## Introduction

In this article we will tackle a famous machine learning problem, **Titanic survival prediction, using PySpark's MLlib**. This is one of the best datasets for getting started with new concepts, since most machine learning enthusiasts already know it well. We will do everything from scratch: **data preprocessing, dealing with categorical variables (converting them), and building and evaluating the model using MLlib**.

In [1]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=e46d3ebdc239acc3e532977348abdaced9e8419659979f09dce0e8a2e8854dba
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


## Mandatory process to follow

As mentioned in the introduction, we will predict **which passengers survived the sinking of the Titanic using PySpark's MLlib library**. To do so, we first need to **set up an environment** and start a **Spark session**, which gives us access to all the libraries required for the prediction.

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('Titanic_project').getOrCreate()
spark

**Code breakdown:**

1. The first step is to import the **SparkSession** class from the **pyspark.sql** module.

2. Next we build the session: the **builder** attribute gives us a session builder, **appName()** names the application, and **getOrCreate()** returns the existing session if there is one and creates a new one otherwise.

3. In a notebook, evaluating the session object on its own displays a summary of the session, including the Spark **version**, the **app name** and the **master** location.

## Reading the dataset

So far we have set up our Spark session; now it's time to read the legendary Titanic dataset using PySpark's **read.csv** method. Before heading to the code, let's first look at the **features** this dataset holds.

1. **PassengerId:** The unique ID assigned to each passenger.
2. **Survived:** The target column which our model will predict.
3. **Pclass:** The ticket class of the passenger (1st, 2nd or 3rd).
4. **Name:** Name of the passenger.
5. **Sex:** Gender of the passenger.
6. **Age:** Age of the passenger.
7. **SibSp:** Number of siblings and spouses of the passenger aboard.
8. **Parch:** Number of parents and children of the passenger aboard.
9. **Ticket:** The ticket number.
10. **Fare:** The fare paid for the ticket, which depends on criteria such as the class and the facilities included.
11. **Cabin:** Cabin number assigned to the passenger.
12. **Embarked:** The port at which the passenger embarked.

In [4]:
data = spark.read.csv('titanic.csv',inferSchema=True,header=True)

**Inference:** We already know that read.csv reads the dataset, so here we will discuss the parameters passed to it.

1. **inferSchema:** This parameter is set to **True**, which makes Spark scan the data and **infer the data type of each column**. Without it, every column is read as a string, so enabling it is good practice when you want to see the real types in the dataset.
 
2. **header:** Setting this parameter to **True** makes Spark use the first row of the file as the **header of the DataFrame**; otherwise the column names would be treated as a data record.
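
To make the two parameters concrete, here is a plain-Python sketch of what they change conceptually. It is not Spark's implementation: the sample rows and the `infer_type` helper are made up for illustration, and the sketch only distinguishes integers, floats and strings.

```python
import csv
import io

# Hypothetical three-row sample in the same spirit as the Titanic file.
SAMPLE = "PassengerId,Age,Sex\n1,22,male\n2,38.5,female\n3,,female\n"


def infer_type(values):
    """Return int, float or str: the narrowest type that parses every non-empty value."""
    present = [v for v in values if v != ""]
    for caster in (int, float):
        try:
            for v in present:
                caster(v)
            return caster
        except ValueError:
            continue
    return str


rows = list(csv.reader(io.StringIO(SAMPLE)))
header, records = rows[0], rows[1:]      # header=True: the first row names the columns
columns = list(zip(*records))            # column-wise view of the records
schema = {name: infer_type(col) for name, col in zip(header, columns)}

# Empty fields become None, similar to how missing CSV values show up as nulls.
parsed = [
    {name: (schema[name](v) if v != "" else None) for name, v in zip(header, rec)}
    for rec in records
]

print({name: t.__name__ for name, t in schema.items()})
print(parsed[2])
```

Without the inference step every value would stay a string, and without the header step the row `PassengerId,Age,Sex` would be counted as a passenger.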

In [5]:
data.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



**Inference:** The output above shows the inferred data type of each column, along with whether the column can hold **NULL** values. We should also notice that features like **Sex** and **Embarked** are strings, so we need to convert them into a **numeric encoding of their categories** before modelling.

In [6]:
data.columns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

**Inference:** To find out which columns are present in the dataset, use the **columns** attribute of the DataFrame.

In [7]:
my_cols = data.select(['Survived',
 'Pclass',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Fare',
 'Embarked'])

**Inference:** The previous DataFrame contained every column, including ones that are not useful for modelling (**PassengerId, Name, Ticket and Cabin**). Here we use the **select statement** to keep only the **target column (Survived, the dependent variable)** and the **features (independent variables)** we will train on.

**Dropping NULL values**

There are various methods for dealing with null values: we can either **impute them with a measure of central tendency such as the mean, median or mode**, depending on the nature of the data, or simply **drop the rows that contain them**. Here we drop them all at once, which keeps the example simple at the cost of discarding those rows.

**Note:** The na.drop() method drops every row that contains a null value in any of the selected columns; we assign the result to a new variable that holds the final data.

In [8]:
my_final_data = my_cols.na.drop()
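
The trade-off between the two strategies can be sketched in plain Python. The four rows below are invented for illustration and the code does not use Spark; it only shows that dropping loses whole rows, while mean imputation keeps them and fills a numeric gap.

```python
from statistics import mean

# Hypothetical rows; None stands for a missing (null) value.
rows = [
    {"Age": 22.0, "Fare": 7.25, "Embarked": "S"},
    {"Age": None, "Fare": 8.05, "Embarked": "S"},
    {"Age": 38.0, "Fare": 71.28, "Embarked": None},
    {"Age": 30.0, "Fare": 13.0, "Embarked": "C"},
]

# Strategy 1: drop every row that contains at least one missing value.
dropped = [r for r in rows if all(v is not None for v in r.values())]

# Strategy 2: fill a numeric column with the mean of its observed values.
age_mean = mean(r["Age"] for r in rows if r["Age"] is not None)
imputed = [
    {**r, "Age": r["Age"] if r["Age"] is not None else age_mean}
    for r in rows
]

print(len(rows), len(dropped))   # rows before and after dropping
print(age_mean)                  # value used to fill missing ages
```

Note that imputing Age alone still leaves the missing Embarked value in place, so a categorical column would need its own strategy (for example the mode).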

### Dealing with Categorical Columns

As noted earlier, the categorical columns are currently strings, and MLlib's algorithms only accept numeric input, so we have to convert them through a series of steps in PySpark.

So, let's break down each step and convert the necessary feature columns.

## Vector Assembler and OneHotEncoder

**Vector Assembler**: As the name suggests, it puts columns together into a single vector column, i.e. **all the features get stacked up as a single unit in the form of a vector**. This is also a requirement, since MLlib's algorithms expect all the features as a **single vector column**.


**One Hot Encoder:** There are multiple ways of dealing with categorical variables; this time we go with one-hot encoding, where each **category value is separated into its own column** holding a binary value, i.e. **either 0 or 1**.

In [9]:
from pyspark.ml.feature import (VectorAssembler,VectorIndexer,
                                OneHotEncoder,StringIndexer)

In [10]:
gender_indexer = StringIndexer(inputCol='Sex',outputCol='SexIndex')
gender_encoder = OneHotEncoder(inputCol='SexIndex',outputCol='SexVec')

In [11]:
embark_indexer = StringIndexer(inputCol='Embarked',outputCol='EmbarkIndex')
embark_encoder = OneHotEncoder(inputCol='EmbarkIndex',outputCol='EmbarkVec')

In [12]:
assembler = VectorAssembler(inputCols=['Pclass',
 'SexVec',
 'Age',
 'SibSp',
 'Parch',
 'Fare',
 'EmbarkVec'],outputCol='features')

**Code breakdown:**

1. Since the Vector Assembler and One Hot Encoder are required for the conversion, we import both of them from the **pyspark.ml.feature** module.

2. Among the imports, note **StringIndexer**, which maps each distinct string value to a numeric category index (by default the most frequent value gets index 0).

3. Then the One Hot Encoder converts each **category index into a binary vector of 0s and 1s**. We repeat the same process for the **"Embarked"** column as we did for the **"Sex"** column.

4. Finally, the Vector Assembler puts all the **preprocessed feature columns together** into a single **features** column; the original string columns and the intermediate index columns are simply left out of its input list.
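
The three transformations can be mimicked in a few lines of plain Python. This is a conceptual sketch rather than Spark's code: the helper names `fit_string_indexer`, `one_hot` and `assemble` are hypothetical, and the sample values are made up. It assumes the default behaviours of ordering labels by descending frequency and dropping the last category from the one-hot vector.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each label to an index: most frequent first, ties broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Binary vector for a category index; the last category becomes all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec


def assemble(*parts):
    """Flatten scalars and vectors into one feature vector."""
    out = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            out.extend(float(x) for x in part)
        else:
            out.append(float(part))
    return out


sex_index = fit_string_indexer(["male", "female", "male"])
embark_index = fit_string_indexer(["S", "C", "S", "Q", "S", "C"])

# Hypothetical passenger: Pclass=3, Sex=male, Age=22, Embarked=S
features = assemble(
    3,
    one_hot(sex_index["male"], len(sex_index)),
    22.0,
    one_hot(embark_index["S"], len(embark_index)),
)
print(embark_index)
print(features)
```

Dropping the last category means a two-valued column such as Sex needs only one slot in the vector, because the second value is implied when the slot is 0.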

In [13]:
from pyspark.ml.classification import LogisticRegression

We are now ready for the **model development phase**, and the first thing we need to import is the ML algorithm. For this problem statement we have to **predict a categorical target, so we need a classification** algorithm, here **Logistic Regression**.
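
As a reminder of what logistic regression computes for one row, here is a minimal plain-Python sketch. The weights, intercept and the three-feature layout (Pclass, an is-female flag, Age) are invented for illustration and are not values learned from the Titanic data; the 0.5 threshold is assumed as the cut-off.

```python
import math


def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, intercept, threshold=0.5):
    """Return (probability of class 1, hard 0/1 prediction) for one row."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, 1.0 if probability > threshold else 0.0


# Made-up parameters for features [Pclass, is_female, Age].
weights = [-1.0, 2.5, -0.03]
intercept = 1.0

prob_a, label_a = predict([3.0, 0.0, 22.0], weights, intercept)
prob_b, label_b = predict([1.0, 1.0, 30.0], weights, intercept)
print(round(prob_a, 3), label_a)
print(round(prob_b, 3), label_b)
```

The distinction between the probability and the hard 0/1 label matters later, because evaluation metrics can be computed from either one.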

## Pipelines 

Sometimes the whole process of **model development gets complex** and it becomes hard to keep the steps in the right order. In these situations **Pipelines from PySpark** come to the rescue: they maintain the **proper flow of the execution cycle**, so that each step is performed at its given stage, neither too early nor too late.
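
The idea behind a pipeline can be sketched with two toy classes in plain Python. `LabelIndexer` and `ToyPipeline` are hypothetical stand-ins, not PySpark classes: the toy indexer orders labels alphabetically for simplicity, and the toy `fit` returns the same object rather than a separate fitted model.

```python
class LabelIndexer:
    """Toy stage: learns a value -> index mapping for one column during fit."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = {}

    def fit(self, rows):
        labels = sorted({r[self.input_col] for r in rows})
        self.mapping = {label: float(i) for i, label in enumerate(labels)}
        return self

    def transform(self, rows):
        return [{**r, self.output_col: self.mapping[r[self.input_col]]} for r in rows]


class ToyPipeline:
    """Runs stages in order; each stage is fitted on the previous stage's output."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            stage.fit(rows)
            rows = stage.transform(rows)   # the next stage sees the new columns
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [{"Sex": "male", "Embarked": "S"}, {"Sex": "female", "Embarked": "C"}]
test = [{"Sex": "female", "Embarked": "S"}]

pipeline = ToyPipeline([
    LabelIndexer("Sex", "SexIndex"),
    LabelIndexer("Embarked", "EmbarkIndex"),
]).fit(train)

print(pipeline.transform(test))
```

Everything the stages learn comes from the training rows only, and the test rows are pushed through the same fitted stages in the same order.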

In [14]:
from pyspark.ml import Pipeline

In [15]:
log_reg_titanic = LogisticRegression(featuresCol='features',labelCol='Survived')

In [16]:
pipeline = Pipeline(stages=[gender_indexer,embark_indexer,
                           gender_encoder,embark_encoder,
                           assembler,log_reg_titanic])

In [17]:
train_titanic_data, test_titanic_data = my_final_data.randomSplit([0.7,.3])

In [18]:
fit_model = pipeline.fit(train_titanic_data)

In [19]:
results = fit_model.transform(test_titanic_data)

**Code breakdown:** 

1. First, the **Pipeline** class is imported from the **pyspark.ml** module.

2. Then, to develop the model, we create a **Logistic Regression** estimator, passing in the features column and the label (dependent) column.

3. Next comes the **Pipeline** object, whose **stages** parameter lists all the preprocessing steps one after the other, followed by the model.

4. Then, using the **randomSplit**() method, the final dataset is broken down into a training set of roughly **70**% and a testing set of roughly **30**%. Because no seed is passed, the split differs from run to run.

5. Finally, we fit the pipeline on the **training** data, which returns a **pipeline model**, and use that model to **transform** the **testing** data, which adds the prediction columns.
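
A weighted random split can be sketched in plain Python as follows. The `random_split` helper is hypothetical and only approximates the idea: each row is assigned to a split by an independent random draw, so the resulting sizes are close to, but not exactly, the requested proportions, and a fixed seed makes the result repeatable.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative boundaries in [0, 1], e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding at the edge.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))
train, test = random_split(rows, [0.7, 0.3], seed=42)
train_again, _ = random_split(rows, [0.7, 0.3], seed=42)

print(len(train), len(test))
print(train == train_again)
```

PySpark's randomSplit also accepts a seed argument, which is worth setting whenever results need to be compared across runs.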

## Model Evaluation

Okay! We are now in the model evaluation phase: development is done, and we need to check that the model works as required, i.e. that it **performs well on the testing data**.

In [20]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [21]:
my_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                                       labelCol='Survived')

In [22]:
results.select('Survived','prediction').show()

+--------+----------+
|Survived|prediction|
+--------+----------+
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       1.0|
|       0|       1.0|
|       0|       0.0|
+--------+----------+
only showing top 20 rows



**Code breakdown:**

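1. We imported the **BinaryClassificationEvaluator**, since this problem has a binary target column.

2. Then we create the **Binary Classification evaluator** object, passing in the **raw prediction** column and the **label column**. Its default metric is the area under the ROC curve, and calling **my_eval.evaluate(results)** would return that score. Note that passing the hard 0/1 **prediction** column here means the metric is computed from class labels rather than from the model's scores.

3. Finally, the DataFrame returned by the pipeline model contains both the **Survived** and **prediction** columns, so we display them side by side using the select statement.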
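
To see what such metrics measure, here is a plain-Python sketch of accuracy and area under the ROC curve. The labels and probabilities are invented for illustration, not taken from the model above, and the pairwise AUC formula is a simple equivalent for small samples rather than the way Spark computes it.

```python
def accuracy(labels, predictions):
    """Fraction of rows where the hard prediction matches the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def area_under_roc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up test labels and model probabilities for class 1.
labels = [0, 0, 0, 1, 1, 1]
probabilities = [0.1, 0.4, 0.6, 0.45, 0.8, 0.9]
predictions = [1.0 if p > 0.5 else 0.0 for p in probabilities]

acc = accuracy(labels, predictions)
auc_from_scores = area_under_roc(labels, probabilities)
auc_from_labels = area_under_roc(labels, predictions)

print(round(acc, 3))
print(round(auc_from_scores, 3), round(auc_from_labels, 3))
```

In this sample the AUC computed from hard 0/1 predictions is lower than the one computed from probabilities, because thresholding throws away the ranking information among rows on the same side of the cut-off.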

## Conclusion

We have reached the final section of this article, where we briefly recap everything we did, from starting the Spark session to building and evaluating the model.

1. First, we started the Spark session and read the famous Titanic survival dataset using PySpark's CSV reader.

2. Then, we dealt with NULL values by dropping the rows that contained them, and converted the categorical features to a numeric form using String Indexer, One Hot Encoder and Vector Assembler.

3. In the next phase we came across the concept of Pipelines, which helped us build an end-to-end pipeline of all the stages.

4. Finally, we built the Logistic Regression model using PySpark's MLlib, set up an evaluator, and compared the model's predictions with the actual labels on the testing data.