Let's start with your project: 

Are you a data scientist? 

I think you are an awesome data scientist.

### **Problem** 
**Our goal is to create a predictive model that can answer the following question:**

**What kind of people had a better chance of surviving?**

**Data about passengers:**
*   Name
*   Age
*   Gender.


## Install and Import Libraries
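Let's import the libraries we need. If PySpark is not installed yet, install it first (for example with `pip install pyspark findspark`):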

In [1]:
# importing libraries
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.sql import SparkSession

## Build Spark Session

In [2]:
spark = SparkSession.builder.getOrCreate()


## Data Loading


You have two datasets to read:
* Train
* Test



In [3]:
train_df = spark.read.csv('train.csv', header=True, inferSchema=True)
test_df = spark.read.csv('test.csv', header=True, inferSchema=True)


Let's work with the train dataset:

**Confirm if this is a dataframe or not:**

In [4]:
print(type(train_df))

<class 'pyspark.sql.dataframe.DataFrame'>


**Show 5 rows.**

In [5]:
train_df.show(5)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+

**Display schema for the dataset:**

In [6]:
train_df.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



**Statistical summary:**

In [7]:
train_df.summary().show()

+-------+-----------------+-------------------+------------------+--------------------+------+------------------+------------------+-------------------+------------------+-----------------+-----+--------+
|summary|      PassengerId|           Survived|            Pclass|                Name|   Sex|               Age|             SibSp|              Parch|            Ticket|             Fare|Cabin|Embarked|
+-------+-----------------+-------------------+------------------+--------------------+------+------------------+------------------+-------------------+------------------+-----------------+-----+--------+
|  count|              891|                891|               891|                 891|   891|               714|               891|                891|               891|              891|  204|     889|
|   mean|            446.0| 0.3838383838383838| 2.308641975308642|                null|  null| 29.69911764705882|0.5230078563411896|0.38159371492704824|260318.54916792738| 32.20420

## EDA - Exploratory Data Analysis

**Display count for the train dataset:**

In [8]:
# DataFrame.count() is a method on the DataFrame, so no extra import is needed
print(train_df.count())

891


**Can you answer this question:** 

**How many people survived, and how many didn't survive?** 

**Please save data in a variable.**

In [9]:
import pyspark.sql.functions as F

cnt_cond = lambda cond: F.sum(F.when(cond, 1).otherwise(0))
query_survive = train_df.groupBy('Survived').agg(
    cnt_cond(F.col('Survived') == 1).alias('Survived_one'), 
    cnt_cond(F.col('Survived') == 0).alias('not_Survived_one'))

**Display your result:**

In [10]:
query_survive.show()

+--------+------------+----------------+
|Survived|Survived_one|not_Survived_one|
+--------+------------+----------------+
|       1|         342|               0|
|       0|           0|             549|
+--------+------------+----------------+



In [11]:
train_df.groupby('Survived').count().show()


+--------+-----+
|Survived|count|
+--------+-----+
|       1|  342|
|       0|  549|
+--------+-----+



**Can you display your answer in ratio form? (Hint: Use a "UDF" function. This is only a hint; you can use any method.)**






In [12]:
def ratioFunction(num1, num2):
    # Cast both counts to float before dividing
    num1 = float(num1)
    num2 = float(num2)
    ratio12 = num1 / num2
    print('The ratio of', num1, 'and', num2,'is', ratio12 , '.')

# 342 survivors and 549 non-survivors, taken from the counts above
ratioFunction(342, 549)

The ratio of 342.0 and 549.0 is 0.6229508196721312 .
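
The ratio above compares survivors with non-survivors (342 : 549). A related quantity is the survival rate, survivors divided by all passengers, which is what the `mean` of `Survived` in the statistical summary reports. A minimal sketch in plain Python, using only the counts shown above (the function names are illustrative and not part of the notebook):

```python
def survival_ratio(survived, not_survived):
    """Survivors per non-survivor (the quantity printed by ratioFunction)."""
    return survived / not_survived


def survival_rate(survived, not_survived):
    """Share of all passengers who survived."""
    return survived / (survived + not_survived)


ratio = survival_ratio(342, 549)
rate = survival_rate(342, 549)
print(f"ratio = {ratio:.4f}, rate = {rate:.4f}")
```

The two numbers answer different questions, so it helps to say which one is meant when reporting a "ratio".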


**Can you get the number of males and females?**


In [13]:
cnt_cond = lambda cond: F.sum(F.when(cond, 1).otherwise(0))
train_df.groupBy('Sex').agg(
    cnt_cond(F.col('Sex') == 'male').alias('male'), 
    cnt_cond(F.col('Sex') == 'female').alias('female')).show()

+------+----+------+
|   Sex|male|female|
+------+----+------+
|female|   0|   314|
|  male| 577|     0|
+------+----+------+



In [14]:
train_df.groupby('Sex').count().show()


+------+-----+
|   Sex|count|
+------+-----+
|female|  314|
|  male|  577|
+------+-----+



**1. What is the average number of survivors of each gender?**

**2. What is the number of survivors of each gender?**

(Hint: Group by the "sex" column. This is a hint you can use any method.)

In [15]:
train_df.groupby('Sex').count().show()


+------+-----+
|   Sex|count|
+------+-----+
|female|  314|
|  male|  577|
+------+-----+
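
The cell above counts passengers per gender, which does not yet answer the two questions. One possible way to get both numbers in a single pass is sketched below, assuming `Survived` is the 0/1 integer column from the schema above (the helper name `survival_by_sex` and the output column names are made up for illustration):

```python
def survival_by_sex(df):
    """Return one row per Sex with the survivor count and the survival rate.

    `df` is assumed to be a PySpark DataFrame with `Sex` and a 0/1 `Survived` column.
    """
    # Imported lazily so the helper can be defined before Spark is set up
    from pyspark.sql import functions as F

    return df.groupBy("Sex").agg(
        F.sum("Survived").alias("survivors"),       # question 2: number of survivors
        F.mean("Survived").alias("survival_rate"),  # question 1: average of the 0/1 flag
    )

# Usage inside the notebook (assumes train_df from the cells above):
# survival_by_sex(train_df).show()
```

Because `Survived` is 0 or 1, its mean per group is the share of that group that survived.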



**Create a temporary view in PySpark:**

In [16]:
train_df.createOrReplaceTempView("practical")

**How many people survived, and how many didn't survive? By SQL:**

In [17]:
spark.sql("""SELECT count(Survived)
  FROM practical
 GROUP BY Survived
        """).show(10)

+---------------+
|count(Survived)|
+---------------+
|            342|
|            549|
+---------------+



**Can you display the number of survivors from each gender as a ratio?**

(Hint: Group by "sex" column. This is a hint you can use any method.)

**Can you do this via SQL?**

In [18]:
train_df.groupby('Sex').count().show()


+------+-----+
|   Sex|count|
+------+-----+
|female|  314|
|  male|  577|
+------+-----+



In [19]:
spark.sql("""SELECT count(Sex)
  FROM practical
 GROUP BY Sex
        """).show()

+----------+
|count(Sex)|
+----------+
|       314|
|       577|
+----------+
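
Both cells above return passenger counts per gender rather than a survival ratio, and the SQL result does not show which row belongs to which gender. A hedged sketch of a query that keeps the `Sex` column and divides survivors by passengers, reusing the `practical` view created earlier (the variable name `SURVIVAL_BY_SEX_SQL` and the column aliases are assumptions, not part of the notebook):

```python
# Query text only; in the notebook it could be run with
# spark.sql(SURVIVAL_BY_SEX_SQL).show()
SURVIVAL_BY_SEX_SQL = """
SELECT Sex,
       SUM(Survived)            AS survivors,
       COUNT(*)                 AS passengers,
       SUM(Survived) / COUNT(*) AS survival_ratio
  FROM practical
 GROUP BY Sex
"""

print(SURVIVAL_BY_SEX_SQL)
```

Selecting the grouping column alongside the aggregates keeps every result row labelled.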



**Display a ratio for "p-class": SUM(Survived)/count for p-class**


In [20]:

spark.sql("""SELECT sum(Survived) , sum(Survived) / count(Pclass) 
  FROM practical
 GROUP BY Pclass
        """).show()

+-------------+-------------------------------------------------------------------------------+
|sum(Survived)|(CAST(sum(CAST(Survived AS BIGINT)) AS DOUBLE) / CAST(count(Pclass) AS DOUBLE))|
+-------------+-------------------------------------------------------------------------------+
|          136|                                                             0.6296296296296297|
|          119|                                                            0.24236252545824846|
|           87|                                                            0.47282608695652173|
+-------------+-------------------------------------------------------------------------------+



**Let's take a break and continue after this.**

## Data Cleaning

**First and foremost, we must merge both the train and test datasets. (Hint: The union function can do this.)**



In [21]:
from functools import reduce  
from pyspark.sql import DataFrame
  
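def unionAll(*dfs):
    # unionAll matches columns by position (not by name), so every DataFrame must share the same column order
    return reduce(DataFrame.unionAll, dfs)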
  
full_df = unionAll(train_df, test_df)


**Display count:**

In [22]:
full_df.count()

1329

**Can you define the number of null values in each column?**


In [23]:
from pyspark.sql.functions import isnan, when, count, col

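# isnan() catches NaN values and isNull() catches missing values; count() only counts the rows that match
null_val = full_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in full_df.columns])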
null_val.show()

+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|          0|       0|     0|   0|  0|265|    0|    0|     0|   0| 1021|       3|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+



**Create Dataframe for null values**

1. Column
2. Number of missing values.

In [24]:
null_val.createOrReplaceTempView("Select_Null")
spark.sql("""SELECT Age , Cabin , Embarked
  FROM Select_Null
        """).show()

+---+-----+--------+
|Age|Cabin|Embarked|
+---+-----+--------+
|265| 1021|       3|
+---+-----+--------+
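
The result above is a single wide row. The task asks for two columns (column name, number of missing values); a minimal sketch of that reshaping in plain Python, using the counts printed by the previous cell (the names `null_counts` and `missing_rows` are illustrative):

```python
# Null counts per column, copied from the output of null_val.show() above
null_counts = {
    "PassengerId": 0, "Survived": 0, "Pclass": 0, "Name": 0, "Sex": 0, "Age": 265,
    "SibSp": 0, "Parch": 0, "Ticket": 0, "Fare": 0, "Cabin": 1021, "Embarked": 3,
}

# Keep only the columns that actually have missing values, largest first
missing_rows = sorted(
    ((name, n) for name, n in null_counts.items() if n > 0),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, n in missing_rows:
    print(f"{name:<10}{n:>6}")
```

In the notebook, `null_val.first().asDict()` should yield the same dictionary, and `spark.createDataFrame(missing_rows, ["Column", "Missing"])` would turn the list back into a DataFrame.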



## Preprocessing 

**Create a temporary view in PySpark:**

In [25]:
full_df.createOrReplaceTempView("new_temp")

**Can you show the "name" column from your temporary table?**

In [26]:
spark.sql("""SELECT name
  FROM new_temp
        """).show(5,truncate= False)

+---------------------------------------------------+
|name                                               |
+---------------------------------------------------+
|Braund, Mr. Owen Harris                            |
|Cumings, Mrs. John Bradley (Florence Briggs Thayer)|
|Heikkinen, Miss. Laina                             |
|Futrelle, Mrs. Jacques Heath (Lily May Peel)       |
|Allen, Mr. William Henry                           |
+---------------------------------------------------+
only showing top 5 rows



**Run this code:**

In [27]:
import pyspark.sql.functions as F
# Raw string so that "\." reaches the regex engine as an escaped dot
combined = full_df.withColumn('Title', F.regexp_extract(F.col("Name"), r"([A-Za-z]+)\.", 1))
combined.createOrReplaceTempView('combined')
combined.show(5)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|Title|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|   Mr|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|  Mrs|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S| Miss|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|  Mrs|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|   Mr|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+
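
To see what the pattern does without Spark, here is a small sketch with Python's `re` module applied to three of the names shown earlier. The pattern captures the first run of letters that is directly followed by a dot. Like `regexp_extract`, the helper returns an empty string when nothing matches (the helper name `extract_title` is illustrative):

```python
import re

TITLE_PATTERN = re.compile(r"([A-Za-z]+)\.")


def extract_title(name):
    """Return the first run of letters that is followed by a dot, or "" if none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else ""


names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]
titles = [extract_title(n) for n in names]
print(titles)
```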

**Display "Title" column and count "Title" column:**

In [28]:
Titles_df = spark.sql("""SELECT Title , COUNT(Title) as Titles_Count
  FROM combined
  GROUP BY Title
  ORDER BY Titles_Count
        """)
Titles_df.show(truncate= False)

+--------+------------+
|Title   |Titles_Count|
+--------+------------+
|Mme     |1           |
|Don     |1           |
|Ms      |1           |
|Countess|2           |
|Jonkheer|2           |
|Sir     |2           |
|Lady    |2           |
|Capt    |2           |
|Major   |3           |
|Mlle    |4           |
|Col     |4           |
|Rev     |9           |
|Dr      |11          |
|Master  |56          |
|Mrs     |186         |
|Miss    |257         |
|Mr      |786         |
+--------+------------+



**We can see that Dr, Rev, Major, Col, Mlle, Capt, Don, Jonkheer, Countess, Ms, Sir, Lady, and Mme are rare titles, so create a dictionary that maps each of them to "RaRe" and keeps the common titles unchanged.**

In [29]:
new_dict = list(map (lambda row : row.asDict(),Titles_df.collect()))
# Any title seen fewer than 56 times (the count for "Master") is mapped to 'RaRe'
dict_map = {row['Title']:'RaRe' if row['Titles_Count'] < 56 else row['Title'] for row in new_dict}
dict_map

{'Don': 'RaRe',
 'Mme': 'RaRe',
 'Ms': 'RaRe',
 'Countess': 'RaRe',
 'Lady': 'RaRe',
 'Capt': 'RaRe',
 'Sir': 'RaRe',
 'Jonkheer': 'RaRe',
 'Major': 'RaRe',
 'Col': 'RaRe',
 'Mlle': 'RaRe',
 'Rev': 'RaRe',
 'Dr': 'RaRe',
 'Master': 'Master',
 'Mrs': 'Mrs',
 'Miss': 'Miss',
 'Mr': 'Mr'}
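
A direct dictionary lookup raises `KeyError` for any title that was not present when the dictionary was built. The sketch below rebuilds the same mapping from the counts shown above and adds a fallback for unseen titles; the fallback is an assumption about the desired behaviour, and the names `title_map` and `map_title` are illustrative:

```python
# Title counts copied from the Titles_df output above
title_counts = {
    "Mme": 1, "Don": 1, "Ms": 1, "Countess": 2, "Jonkheer": 2, "Sir": 2,
    "Lady": 2, "Capt": 2, "Major": 3, "Mlle": 4, "Col": 4, "Rev": 9,
    "Dr": 11, "Master": 56, "Mrs": 186, "Miss": 257, "Mr": 786,
}

RARE_THRESHOLD = 56  # same cut-off as the notebook cell

title_map = {
    title: title if n >= RARE_THRESHOLD else "RaRe"
    for title, n in title_counts.items()
}


def map_title(title):
    """Look up a title; unseen titles fall back to "RaRe" instead of raising KeyError."""
    return title_map.get(title, "RaRe")


print(sorted(set(title_map.values())))
```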

**Run the function:**

In [30]:
def impute_title(title):
    # dict_map is the title dictionary built in the previous cell
    return dict_map[title]

**Apply the function on "Title" column using UDF:**

In [31]:
# import udf and the UDF's return type
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Wrap impute_title as a UDF that returns a string
convertUDF = udf(impute_title, StringType())


**Display the original "Title" column next to the mapped "Title" column:**

In [32]:
from pyspark.sql import functions as F
# using select
combined.select(F.col("Title"), \
    convertUDF(F.col("Title")).alias("Title") ) \
   .show(truncate=False)

+------+------+
|Title |Title |
+------+------+
|Mr    |Mr    |
|Mrs   |Mrs   |
|Miss  |Miss  |
|Mrs   |Mrs   |
|Mr    |Mr    |
|Mr    |Mr    |
|Mr    |Mr    |
|Master|Master|
|Mrs   |Mrs   |
|Mrs   |Mrs   |
|Miss  |Miss  |
|Miss  |Miss  |
|Mr    |Mr    |
|Mr    |Mr    |
|Miss  |Miss  |
|Mrs   |Mrs   |
|Master|Master|
|Mr    |Mr    |
|Mrs   |Mrs   |
|Mrs   |Mrs   |
+------+------+
only showing top 20 rows



## **Preprocessing Age**

**Based on the "age" column mean, you will fill in the missing age values:**

In [33]:
from pyspark.sql.functions import mean , col 

Age_mean = full_df.select(mean(col('Age')))
Age_mean.show()
# Read the mean from the query result instead of hard-coding it
df_age = full_df.na.fill({'Age': Age_mean.first()[0]})


+------------------+
|          avg(Age)|
+------------------+
|30.079501879699244|
+------------------+



**Show the data after filling the missing "Age" values with the mean:**

In [34]:
df_age.show(5)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+

## **Preprocessing Embarked**

**Select the "Embarked" column, count its values, order by count descending, and save the result in the `groupped_Embarked` variable:**




In [35]:
df_age.createOrReplaceTempView("Embarked")
groupped_Embarked = spark.sql("""SELECT Embarked , count(Embarked) as Embark_count
  FROM Embarked
  group by Embarked
  order by Embark_count DESC
        """)

**Show your "groupped_Embarked" variable:**

In [36]:
groupped_Embarked.show(5,truncate= False)

+--------+------------+
|Embarked|Embark_count|
+--------+------------+
|S       |962         |
|C       |253         |
|Q       |111         |
|null    |0           |
+--------+------------+



**Get max of groupped_Embarked:** 

In [37]:
# Alias the import so Spark's max does not shadow Python's built-in max()
from pyspark.sql.functions import max as spark_max

max_empark = groupped_Embarked.select(spark_max(col('Embark_count')))
max_empark.show()

+-----------------+
|max(Embark_count)|
+-----------------+
|              962|
+-----------------+



**Fill the missing "Embarked" values with the most frequent value, 'S':**
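
No cell follows this step in the notebook, so here is one way it might be done. The sketch finds the most frequent non-null `Embarked` value and uses it as the fill value; the helper name `fill_embarked_with_mode` and the variable `df_embarked` are hypothetical, and later cells would need to use the returned DataFrame for the fill to take effect:

```python
def fill_embarked_with_mode(df):
    """Fill null `Embarked` values with the most frequent non-null value.

    `df` is assumed to be a PySpark DataFrame with an `Embarked` string column.
    """
    mode_row = (
        df.dropna(subset=["Embarked"])      # ignore the missing values themselves
          .groupBy("Embarked")
          .count()
          .orderBy("count", ascending=False)
          .first()
    )
    # mode_row is None only when the column has no non-null values at all
    if mode_row is None:
        return df
    return df.na.fill({"Embarked": mode_row["Embarked"]})

# Usage inside the notebook (assumes df_age from the cells above):
# df_embarked = fill_embarked_with_mode(df_age)
```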

## **Preprocessing Cabin**

**Extract the first character of the "Cabin" string into a new column:**



In [38]:
from pyspark.sql.functions import col, substring
#df1=df_age.select('cabin', substring('cabin', 1,1).alias('cabin_First_string'))

# substring positions are 1-based in Spark, so (1, 1) takes the first character
df1 = df_age.withColumn("cabin_First_string", substring('cabin', 1, 1))


**Show the result:**

In [39]:
df1.show()

+-----------+--------+------+--------------------+------+------------------+-----+-----+----------------+-------+-----+--------+------------------+
|PassengerId|Survived|Pclass|                Name|   Sex|               Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|cabin_First_string|
+-----------+--------+------+--------------------+------+------------------+-----+-----+----------------+-------+-----+--------+------------------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|              22.0|    1|    0|       A/5 21171|   7.25| null|       S|              null|
|          2|       1|     1|Cumings, Mrs. Joh...|female|              38.0|    1|    0|        PC 17599|71.2833|  C85|       C|                 C|
|          3|       1|     3|Heikkinen, Miss. ...|female|              26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|              null|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|              35.0|    1|    0|          113803|   53.1

**Create the temporary view:**

In [40]:
df1.createOrReplaceTempView("cabin_First_string")


**Select "Cabin" column, count "Cabin" column, Group by "Cabin" column, Order By count DESC**  

In [41]:
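groupped_cabin = spark.sql("""SELECT Cabin , count(Cabin) as Cabin_count
  FROM cabin_First_string
  group by Cabin
  order by Cabin_count DESC
        """)
# show() returns None, so call it separately to keep the DataFrame in groupped_cabin
groupped_cabin.show(5)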

+-----------+-----------+
|      Cabin|Cabin_count|
+-----------+-----------+
|    B96 B98|          6|
|    C22 C26|          4|
|        D20|          4|
|        B22|          4|
|B51 B53 B55|          4|
+-----------+-----------+
only showing top 5 rows



**Fill missing values with "U":**

In [42]:
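df2 = df1.na.fill({'cabin_First_string': 'U'})
# Caution: na.drop() without a subset removes every row that still has a null in any
# column, including the original "Cabin" column, so only rows with a known cabin remain
df2 = df2.na.drop()
df2.show(5)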

+-----------+--------+------+--------------------+------+----+-----+-----+--------+-------+-----+--------+------------------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|  Ticket|   Fare|Cabin|Embarked|cabin_First_string|
+-----------+--------+------+--------------------+------+----+-----+-----+--------+-------+-----+--------+------------------+
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|PC 17599|71.2833|  C85|       C|                 C|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|  113803|   53.1| C123|       S|                 C|
|          7|       0|     1|McCarthy, Mr. Tim...|  male|54.0|    0|    0|   17463|51.8625|  E46|       S|                 E|
|         11|       1|     3|Sandstrom, Miss. ...|female| 4.0|    1|    1| PP 9549|   16.7|   G6|       S|                 G|
|         12|       1|     1|Bonnell, Miss. El...|female|58.0|    0|    0|  113783|  26.55| C103|       S|            

**StringIndexer: A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0. The ordering behavior is controlled by setting stringOrderType. Its default value is ‘frequencyDesc’.**

**StringIndexer(inputCol=None, outputCol=None)**
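
To make the "most frequent label gets index 0" rule concrete, here is a plain-Python imitation of the default `frequencyDesc` ordering on a toy column. It is a sketch of the idea, not Spark's implementation; the alphabetical tie-break is an assumption that may depend on the Spark version:

```python
from collections import Counter


def frequency_desc_index(values):
    """Map each distinct label to an index, most frequent label first.

    Ties are broken alphabetically here (an assumption, see the text above).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Toy column, not the real data: three "S", two "C", one "Q"
embarked = ["S", "C", "S", "Q", "S", "C"]
mapping = frequency_desc_index(embarked)
print(mapping)
```

The indices only encode frequency rank, so they carry no numeric meaning of their own, which is why a one-hot encoding step usually follows.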

In [43]:
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(
    inputCols=["Sex", "Ticket", "Embarked", "cabin_First_string"],
    outputCols=["Sex_numeric", "Ticket_numeric", "Embarked_numeric", "cabin_First_string_numeric"],
).fit(df2)
indexed_df = indexer.transform(df2)
indexed_df.drop("Sex", "Ticket","Embarked" , "cabin_First_string","Name","Cabin").show()

+-----------+--------+------+------------------+-----+-----+--------+-----------+--------------+----------------+--------------------------+
|PassengerId|Survived|Pclass|               Age|SibSp|Parch|    Fare|Sex_numeric|Ticket_numeric|Embarked_numeric|cabin_First_string_numeric|
+-----------+--------+------+------------------+-----+-----+--------+-----------+--------------+----------------+--------------------------+
|          2|       1|     1|              38.0|    1|    0| 71.2833|        1.0|         134.0|             1.0|                       0.0|
|          4|       1|     1|              35.0|    1|    0|    53.1|        1.0|          48.0|             0.0|                       0.0|
|          7|       0|     1|              54.0|    0|    0| 51.8625|        0.0|         118.0|             0.0|                       3.0|
|         11|       1|     3|               4.0|    1|    1|    16.7|        1.0|          95.0|             0.0|                       6.0|
|         12|

**OneHotEncoder(inputCols=None, outputCols=None)**

A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast), because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
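
The example in the paragraph above can be reproduced with a few lines of plain Python. This is only an imitation of the encoding rule, with dense lists instead of Spark's sparse vectors, and the function name `one_hot` is illustrative:

```python
def one_hot(index, num_categories, drop_last=True):
    """Dense one-hot vector for a category index, imitating the dropLast option."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:  # with drop_last, the last category stays all zeros
        vector[int(index)] = 1.0
    return vector


print(one_hot(2.0, 5))                   # [0.0, 0.0, 1.0, 0.0]
print(one_hot(4.0, 5))                   # [0.0, 0.0, 0.0, 0.0]
print(one_hot(4.0, 5, drop_last=False))  # [0.0, 0.0, 0.0, 0.0, 1.0]
```

The notebook cell that follows calls `setDropLast(False)`, so every category, including the last one, gets its own position.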

In [44]:
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(
    inputCols=["Sex_numeric", "Ticket_numeric", "Embarked_numeric", "cabin_First_string_numeric"],
    outputCols=["encoded_Sex", "encoded_Ticket", "encoded_Embarked", "encoded_cabin_First"],
)
encoder.setDropLast(False)
ohe = encoder.fit(indexed_df)
encoded_df = ohe.transform(indexed_df)
encoded_df.drop("Sex_numeric" , "Ticket_numeric","Embarked_numeric" , "cabin_First_string_numeric").show()

+-----------+--------+------+--------------------+------+------------------+-----+-----+-----------+--------+-----------+--------+------------------+-------------+-----------------+----------------+-------------------+
|PassengerId|Survived|Pclass|                Name|   Sex|               Age|SibSp|Parch|     Ticket|    Fare|      Cabin|Embarked|cabin_First_string|  encoded_Sex|   encoded_Ticket|encoded_Embarked|encoded_cabin_First|
+-----------+--------+------+--------------------+------+------------------+-----+-----+-----------+--------+-----------+--------+------------------+-------------+-----------------+----------------+-------------------+
|          2|       1|     1|Cumings, Mrs. Joh...|female|              38.0|    1|    0|   PC 17599| 71.2833|        C85|       C|                 C|(2,[1],[1.0])|(141,[134],[1.0])|   (3,[1],[1.0])|      (8,[0],[1.0])|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|              35.0|    1|    0|     113803|    53.1|       C123|   

**VectorAssembler: VectorAssembler(*, inputCols=None, outputCol=None). A feature transformer that merges multiple columns into a vector column.**



In [45]:
from pyspark.ml.feature import VectorAssembler


assembler = VectorAssembler(
    inputCols=['Pclass','Age','SibSp','Parch' ,'Fare','encoded_Sex','encoded_Ticket','encoded_Embarked','encoded_cabin_First'
], outputCol="features")


final_df = assembler.transform(encoded_df)
final_df.select(['features','Survived']).show(truncate=False)

+----------------------------------------------------------------------------------+--------+
|features                                                                          |Survived|
+----------------------------------------------------------------------------------+--------+
|(159,[0,1,2,4,6,141,149,151],[1.0,38.0,1.0,71.2833,1.0,1.0,1.0,1.0])              |1       |
|(159,[0,1,2,4,6,55,148,151],[1.0,35.0,1.0,53.1,1.0,1.0,1.0,1.0])                  |1       |
|(159,[0,1,4,5,125,148,154],[1.0,54.0,51.8625,1.0,1.0,1.0,1.0])                    |0       |
|(159,[0,1,2,3,4,6,102,148,157],[3.0,4.0,1.0,1.0,16.7,1.0,1.0,1.0,1.0])            |1       |
|(159,[0,1,4,6,118,148,151],[1.0,58.0,26.55,1.0,1.0,1.0,1.0])                      |1       |
|(159,[0,1,4,5,130,148,153],[2.0,34.0,13.0,1.0,1.0,1.0,1.0])                       |1       |
|(159,[0,1,4,5,121,148,155],[1.0,28.0,35.5,1.0,1.0,1.0,1.0])                       |1       |
|(159,[0,1,2,3,4,5,19,148,151],[1.0,19.0,3.0,2.0,263.0,1.0,1
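
Each entry in the `features` column is a sparse vector written as (size, indices, values). The sketch below works out where each input lands in the 159-wide vector and reproduces the non-zero positions of the first row. The block widths are read off the encoded vectors printed earlier, the index values come from the StringIndexer output, and the variable names are illustrative:

```python
# Width of each block in the assembled vector, in VectorAssembler input order.
# The encoded widths are read off the sparse vectors printed earlier.
blocks = [
    ("Pclass", 1), ("Age", 1), ("SibSp", 1), ("Parch", 1), ("Fare", 1),
    ("encoded_Sex", 2), ("encoded_Ticket", 141),
    ("encoded_Embarked", 3), ("encoded_cabin_First", 8),
]

# Starting offset of every block inside the assembled vector
offsets = {}
position = 0
for name, width in blocks:
    offsets[name] = position
    position += width
total_width = position

# First row: Pclass=1, Age=38, SibSp=1, Parch=0, Fare=71.2833,
# Sex index 1, Ticket index 134, Embarked index 1, cabin index 0
numeric = {"Pclass": 1.0, "Age": 38.0, "SibSp": 1.0, "Parch": 0.0, "Fare": 71.2833}
category_index = {"encoded_Sex": 1, "encoded_Ticket": 134,
                  "encoded_Embarked": 1, "encoded_cabin_First": 0}

# Zero-valued numeric inputs are not stored in a sparse vector
active = [offsets[n] for n, v in numeric.items() if v != 0.0]
active += [offsets[n] + i for n, i in category_index.items()]
print(total_width, sorted(active))
```

The computed width and positions agree with the first row of the output above, `(159,[0,1,2,4,6,141,149,151],...)`.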

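**Use the randomSplit function to split the data into trainDF and testDF, with 80% and 20% of the rows respectively:**

The split is random per row, so the resulting sizes are only approximately 80/20.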

In [46]:
trainDF, testDF = final_df.randomSplit([.8,.2],seed=42)
print(f"There are {trainDF.count()} rows in the training set, and {testDF.count()} in the test set")


There are 256 rows in the training set, and 49 in the test set


**Pipeline: ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines.**

**Build RandomForestClassifier model and use pipeline to fit and transform then display "prediction, Survived, features" columns**

In [50]:
#from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

Rand_classify = RandomForestClassifier(featuresCol='features',labelCol='Survived')

#pipeline = Pipeline(stages=[indexer, encoder, assembler , Rand_classify])
#model = pipeline.fit(train_df)
#transformed = model.transform(test_df)
#transformed.select(['Survived','prediction']).show(truncate=False)


model = Rand_classify.fit(trainDF)
# transform the data
sample_data_test = model.transform(testDF)
sample_data_test.select(['features','Survived' , 'prediction']).show()

+--------------------+--------+----------+
|            features|Survived|prediction|
+--------------------+--------+----------+
|(159,[0,1,4,5,125...|       0|       1.0|
|(159,[0,1,4,5,121...|       1|       1.0|
|(159,[0,1,2,4,6,9...|       1|       1.0|
|(159,[0,1,4,6,138...|       1|       1.0|
|(159,[0,1,3,4,5,8...|       0|       1.0|
|(159,[0,1,3,4,5,8...|       0|       1.0|
|(159,[0,1,2,4,6,5...|       1|       1.0|
|(159,[0,1,4,5,117...|       0|       1.0|
|(159,[0,1,2,4,5,7...|       0|       1.0|
|(159,[0,1,2,3,4,5...|       1|       1.0|
|(159,[0,1,2,3,4,6...|       0|       1.0|
|(159,[0,1,4,6,10,...|       1|       1.0|
|(159,[0,1,5,108,1...|       0|       0.0|
|(159,[0,1,2,4,6,1...|       1|       1.0|
|(159,[0,1,4,6,129...|       1|       1.0|
|(159,[0,1,2,3,4,6...|       1|       1.0|
|(159,[0,1,4,5,119...|       0|       1.0|
|(159,[0,1,4,5,64,...|       0|       1.0|
|(159,[0,1,4,6,10,...|       1|       1.0|
|(159,[0,1,4,6,87,...|       1|       1.0|
+--------------------+--------+----------+

**Use MulticlassClassificationEvaluator and set the "labelCol" to "Survived",  "predictionCol" to "prediction", "metricName" to "accuracy"** 

In [51]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Compare the prediction with the true label and compute test accuracy
evaluator = MulticlassClassificationEvaluator(
    labelCol="Survived", predictionCol="prediction", metricName="accuracy")
Accuracy = evaluator.evaluate(sample_data_test)
print("Accuracy on test data = %g" % Accuracy)

Accuracy on test data = 0.693878
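
Accuracy is the share of rows where the prediction equals the label. As a sanity check, the sketch below recomputes it by hand for the 20 rows shown in the prediction table and compares it with a baseline that always predicts 1 (the function and variable names are illustrative):

```python
def accuracy(labels, predictions):
    """Fraction of rows where the prediction equals the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# The 20 (Survived, prediction) pairs shown in the prediction table above
labels      = [0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1]
predictions = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]

shown_accuracy = accuracy(labels, predictions)
always_one = accuracy(labels, [1] * len(labels))
print(shown_accuracy, always_one)
```

On these rows the model predicts 1.0 for all but one passenger, so its accuracy is close to the constant baseline. The reported 0.693878 on the full test set is consistent with 34 correct predictions out of 49.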


**When you are finished, send the project via Google Classroom.**
**Please let me know if you have any questions.**
* nabieh.mostafa@yahoo.com
* +201015197566 (Whatsapp)

**Don't hate me; I push you to learn.**

**I will help you to become an awesome data engineer.**

**Why did I say "data engineer"?**

**It is a tricky but optional question. If you would like to know the answer, ask me.**
