# **Introduction**
In this notebook we will walk through a detailed statistical analysis of the Titanic data set along with a machine learning model implementation. It is meant as a tutorial for beginners who don't know much about Spark ML, so I have tried to explain every step as simply as possible. I have used several plotting techniques so that you can see how each column relates to survival. I will also go through several machine learning classifiers so that you get to know them and can apply them to similar problems.

## <font color = 'red'> Please upvote if you find the kernel useful. </font>

# **Table of Contents**
* [Setting up the environment](#1)
* [Importing Libraries](#2)
* [Reading the data](#3)
* [Exploratory Data Analysis](#4)
* [Feature Engineering](#5)
* [Spark ML Models](#6)
* [Submitting the predictions](#7)

<a id='1'></a>
# **Setting up the environment**
Before starting, we have to change the Java version of the notebook environment. Java 11 is preinstalled, and using it here leads to errors that prevent PySpark from working properly. So we will remove Java 11 and install Java 8.

In [None]:
! apt remove -y openjdk-11-jre-headless

In [None]:
!apt-get update

In [None]:
!apt install -y openjdk-8-jdk openjdk-8-jre

**If installing Java 8 fails with the error "E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/p/pulseaudio/libpulse0_11.1-1ubuntu7.5_amd64.deb  404  Not Found [IP: 91.189.88.142 80]", run `!apt-get update` first, as done above, and then install JDK 8.**

In [None]:
!java -version
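
As a quick sanity check on the output of the cell above, the sketch below parses text in the style printed by `java -version` and extracts the major version. The helper name `java_major_version` and the sample strings are illustrative assumptions, not part of the original notebook, and the parsing assumes the usual `version "1.8.0_..."` and `version "11.0..."` layouts.

```python
import re

def java_major_version(version_output):
    """Return the Java major version found in `java -version` output, or None.

    Hypothetical helper: assumes the legacy "1.8.0_xxx" scheme for Java 8
    and the "11.0.x" scheme for Java 9 and later.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Legacy scheme: "1.8" means Java 8
    if first == 1 and second is not None:
        return int(second)
    return first

# Sample strings in the style of `java -version` output (illustrative only)
sample_java8 = 'openjdk version "1.8.0_292"'
sample_java11 = 'openjdk version "11.0.11" 2021-04-20'

print(java_major_version(sample_java8))
print(java_major_version(sample_java11))
```

If the first call does not report 8 in your environment, the removal and installation steps above probably did not take effect.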

**After installing jdk8, we will now install pyspark.**

In [None]:
!pip install pyspark

<a id='2'></a>
# **Importing Libraries**

In [None]:
import os
import pandas as pd
import numpy as np
import re
from pylab import *
from pyspark.sql.functions import udf, concat, col, lit
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import *
import pyspark.sql.functions as F

# Create the Spark context and session (local mode, using all available cores)
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
spark = SparkSession \
    .builder \
    .getOrCreate()
# SQLContext is used later to convert pandas DataFrames to Spark DataFrames
sqlContext = SQLContext(sc)

from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import Normalizer
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.classification import LinearSVC
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import MultilayerPerceptronClassifier

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

<a id='3'></a>
# **Reading the data**

In [None]:
df1 = pd.read_csv('../input/titanic/train.csv')
df2 = pd.read_csv('../input/titanic/test.csv')
sub_df = pd.read_csv('../input/titanic/gender_submission.csv')

In [None]:
df1.head()

<a id='4'></a>
# **Exploratory Data Analysis**

**Let us first see how many passengers survived and how many did not.**

In [None]:
survival = df1.groupby('Survived').count()['Name'].reset_index()
sns.countplot(x='Survived', data=df1)
print("Number of passengers who did not survive = {}".format(survival['Name'][0]))
print("Number of passengers survived = {}".format(survival['Name'][1]))

**Now we will look at how each feature relates to whether a passenger survived.**

In [None]:
val = ['Pclass', 'Sex', 'Embarked', 'SibSp', 'Parch']
plt.figure(figsize=(15,15))
plt.subplots_adjust(right=1.5)
for i in range(5):
    plt.subplot(2,3,i+1)
    sns.countplot(x=val[i], hue='Survived', data = df1)
    plt.legend(['Not Survived', 'Survived'], loc='upper center', prop={'size': 10})
    plt.title('Count of Survival in {} Feature'.format(val[i]), size=10, y=1.05)

In [None]:
surv = df1['Survived'] == 1

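# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot
sns.histplot(df1[~surv]['Age'], label='Not Survived', kde=True, stat='density', color='#e74c3c')
sns.histplot(df1[surv]['Age'], label='Survived', kde=True, stat='density', color='#2ecc71')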

plt.legend()
plt.title('Distribution of Survival in Age')

        
plt.show()

In [None]:
plt.figure(figsize=(14,6))
plt.plot(range(0,len(df1[~surv]['Fare'])), df1[~surv]['Fare'], color='blue', linewidth=1)
plt.plot(range(0,len(df1[surv]['Fare'])), df1[surv]['Fare'], color='red', linewidth=1)
# The x-axis is each passenger's position within its group, not the PassengerId column
plt.xlabel('Passenger index within group', fontsize=14)
plt.ylabel('Fare', fontsize=14)
plt.legend(['Not Survived', 'Survived'])
plt.title('Distribution of Fare')
plt.show()

<a id='5'></a>
# **Feature Engineering**
Basically, all machine learning algorithms use some input data to create outputs. This input data comprises features, which are usually in the form of structured columns. Algorithms require features with specific characteristics to work properly, and this is where feature engineering comes in. There may be many redundant features that should be eliminated. We can also derive new features by observing or extracting information from existing ones.

We will apply the feature engineering steps to both the training and the test data. We concatenate the two sets first so that we don't have to apply each step separately, and split them apart again once feature engineering is done.


In [None]:
df = pd.concat([df1,df2],ignore_index=True)

In [None]:
df.dtypes

**Now we will create a new column that stores the number of people in each passenger's family (including the passenger), and another column that indicates whether the passenger is travelling alone. Then we will visualize them to check whether the survival rate has anything to do with family size.**

In [None]:
df['Family'] = df['SibSp'] + df['Parch'] + 1
df['Alone'] = df['Family'].apply(lambda x : 0 if x>1 else 1 )

In [None]:
fig, ax = plt.subplots(2,2,figsize=(14,12))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x='Family',y='Survived',data=df,ax=ax[0][0])
ax[0][0].set_title('Family vs Survived')
sns.pointplot(x='Family',y='Survived',data=df,ax=ax[0][1])
ax[0][1].set_title('Family vs Survived')
sns.countplot(x='Alone',hue='Survived',data=df,ax=ax[1][0])
ax[1][0].set_title('Alone vs Survived')
sns.pointplot(x='Alone',y='Survived',data=df,ax=ax[1][1])
ax[1][1].set_title('Alone vs Survived')
plt.show()
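
The bar and point plots above show the mean of `Survived` per group. As a numeric counterpart, here is a minimal sketch of the same aggregation with `groupby`. The frame `toy` is made-up data for illustration, not the Titanic set.

```python
import pandas as pd

# Toy data (not the Titanic set): family size and survival flag
toy = pd.DataFrame({
    'Family':   [1, 1, 1, 2, 2, 4],
    'Survived': [0, 0, 1, 1, 1, 0],
})

# The mean of a 0/1 column per group is the survival rate of that group
rate_by_family = toy.groupby('Family')['Survived'].mean()
print(rate_by_family)
```

Applied to the notebook's `df`, the same call would give the values behind the 'Family vs Survived' plots, since rows with a missing `Survived` (the test rows) are skipped by `mean()`.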

**After calculating family size, we turn to the Name column, which we haven't looked at yet. The full name is unlikely to affect the survival rate, but the title (Mr., Mrs., etc.) can. So we will create a new column that stores the title extracted from every name.**

In [None]:
# Raw string so that the backslash in the pattern is not treated as a string escape
df['Title'] = df['Name'].apply(lambda x: re.search(r' ([A-Z][a-z]+)\.', x).group(1))

In [None]:
df['Title'].unique()
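
To see what the pattern above captures, here is a small sketch that applies it to a few names written in the format of the Name column. The helper `extract_title` is a hypothetical name, and returning `None` when no title is found is an addition for illustration (the cell above would raise an error in that case).

```python
import re

# Same pattern as above: a space, a capitalized word, then a literal dot
TITLE_PATTERN = re.compile(r' ([A-Z][a-z]+)\.')

def extract_title(name):
    """Return the title found in a passenger name, or None if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

names = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
    'No title here',
]
titles = [extract_title(n) for n in names]
print(titles)
```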

**Now, we will identify the social status of each title.**

In [None]:
Title_Dictionary = {
        "Capt":       "Officer",
        "Col":        "Officer",
        "Major":      "Officer",
        "Dr":         "Officer",
        "Rev":        "Officer",
        "Jonkheer":   "Royalty",
        "Don":        "Royalty",
        "Sir" :       "Royalty",
        "Countess":   "Royalty",
        "Dona":       "Royalty",
        "Lady" :      "Royalty",
        "Mme":        "Mrs",
        "Ms":         "Mrs",
        "Mrs" :       "Mrs",
        "Mlle":       "Miss",
        "Miss" :      "Miss",
        "Mr" :        "Mr",
        "Master" :    "Master"
                   }
    
# we map each title to correct category
df['Title'] = df['Title'].map(Title_Dictionary)
df['Title'].unique()

In [None]:
sns.countplot(x='Title', hue='Survived', data = df)

**Now we will fill the null values in the Age column. As we have observed, Age takes many different values, so we could fill the gaps with the mean age, but here I simply fill them with the placeholder value -0.5.**

In [None]:
df['Age'] = df['Age'].fillna(-0.5)
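
The two options mentioned above can be compared side by side. This is only a sketch on a toy Series (the values are made up); the notebook itself keeps the -0.5 placeholder.

```python
import pandas as pd

ages = pd.Series([22.0, None, 30.0, None, 38.0])  # toy values

# Option used in this notebook: a placeholder outside the valid age range
filled_placeholder = ages.fillna(-0.5)

# Alternative mentioned in the text: the mean of the observed ages
filled_mean = ages.fillna(ages.mean())

print(filled_placeholder.tolist())
print(filled_mean.tolist())
```

The placeholder keeps "age unknown" distinguishable from real ages, while the mean keeps the column on a realistic scale; which one works better depends on the model.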

**The plot of the Fare column shows that, although Fare takes many different values, most of them lie close to the median. So we will fill the null values with the median fare. Since Fare is also a continuous feature, we then convert it into an ordinal category.**

In [None]:
df['Fare'] = df['Fare'].fillna(df['Fare'].median())
# Bin edges used to categorize the fare
quant = (-1, 0, 8, 15, 31, 600)

# Labels for the resulting intervals ('NoInf' covers a fare of 0)
label_quants = ['NoInf','quart_1', 'quart_2', 'quart_3', 'quart_4']

# Cut Fare into the bins and store the result in a new column
df["Fare_cat"] = pd.cut(df['Fare'], quant, labels=label_quants)
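
To make the binning concrete, the sketch below runs `pd.cut` with the same edges and labels on a few toy fares (made-up values chosen to land in different intervals). Note that the intervals are closed on the right, so a fare of exactly 8 falls into `quart_1`.

```python
import pandas as pd

quant = (-1, 0, 8, 15, 31, 600)
label_quants = ['NoInf', 'quart_1', 'quart_2', 'quart_3', 'quart_4']

# Toy fares; intervals are (-1, 0], (0, 8], (8, 15], (15, 31], (31, 600]
fares = pd.Series([0.0, 7.25, 8.0, 8.05, 30.0, 71.28])
fare_cat = pd.cut(fares, quant, labels=label_quants)
print(list(fare_cat))
```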

In [None]:
fig, ax = plt.subplots(1,2,figsize=(14,6))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x='Fare_cat',y='Survived',data=df,ax=ax[0])
ax[0].set_title('Fare_cat vs Survived')
sns.pointplot(x='Fare_cat',y='Survived',data=df,ax=ax[1])
ax[1].set_title('Fare_cat vs Survived')
plt.show()

**Lastly, we will fill the null values in the Embarked column with the value that occurs most often in the data.**

In [None]:
sns.countplot(x='Embarked', data=df)

**Since 'S' occurs most often, we will fill the null values with 'S'.**

In [None]:
df["Embarked"] = df["Embarked"].fillna('S')
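
Instead of reading the most frequent value off the plot, it can also be computed. A minimal sketch on a toy Series (made-up values), using `Series.mode()`:

```python
import pandas as pd

embarked = pd.Series(['S', 'C', 'S', None, 'Q', 'S'])  # toy values

# mode() ignores missing values; [0] picks the first mode if there are ties
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)
print(most_common, filled.tolist())
```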

In [None]:
df.head()

**We will also drop the columns that we don't need.**

In [None]:
df = df.drop(['Name','Ticket', 'Cabin'], axis=1)

**Now that all the feature engineering steps have been applied, it's time to split the data back into training and test sets.**

In [None]:
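# Slice by position: the first len(df1) rows are the training rows, the rest are test rows
# (np.split on a DataFrame triggers deprecation warnings in recent pandas versions)
dfs = [df.iloc[:len(df1)], df.iloc[len(df1):]]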

In [None]:
train = dfs[0]
train.shape

In [None]:
# The test rows have no Survived values, so drop the column (drop returns a new frame)
test = dfs[1].drop(columns=['Survived'])
test.shape
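
The concatenate-then-split round trip used in this section can be sketched on two tiny stand-in frames. The names `toy_train` and `toy_test` and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the training and test frames
toy_train = pd.DataFrame({'PassengerId': [1, 2, 3], 'Survived': [0, 1, 1]})
toy_test = pd.DataFrame({'PassengerId': [4, 5]})

combined = pd.concat([toy_train, toy_test], ignore_index=True)

# Rows keep their order, so the first len(toy_train) rows are the training rows
n_train = len(toy_train)
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:].drop(columns=['Survived'])

print(train_part.shape, test_part.shape)
print(combined['Survived'].tolist())
```

One side effect worth knowing: because the test rows have no label, `Survived` is stored as a float column with NaN for those rows after the concatenation.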

<a id='6'></a>
# **Spark ML Models**
So now it's time to create our models. Spark ML is a package that aims to provide a uniform set of high-level APIs that help users create and tune practical machine learning pipelines. It standardizes the APIs of machine learning algorithms so that multiple algorithms can easily be combined into a single pipeline, or workflow.<br> ***Since we are using PySpark to create our models, we first have to convert our data from a pandas DataFrame to a Spark DataFrame.***

In [None]:
train = sqlContext.createDataFrame(train)

**Some of the columns contain string values, so we first index them using StringIndexer. A StringIndexer assigns a unique numeric index to each distinct string value.**

In [None]:
indexer = StringIndexer(inputCol='Sex',outputCol='label1')
indexer2 = StringIndexer(inputCol='Embarked',outputCol='label2')
indexer3 = StringIndexer(inputCol='Fare_cat',outputCol='label3')
indexer4 = StringIndexer(inputCol='Title',outputCol='label4')
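
By default, StringIndexer orders the labels by descending frequency, so the most frequent string gets index 0. The sketch below mimics that idea in plain Python. It is an illustration of the concept only, not Spark's implementation; the function name `frequency_index` and the alphabetical tie-breaking are assumptions made for this example.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct string to an index, most frequent first.

    Pure-Python illustration only; ties are broken alphabetically here.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

sex = ['male', 'female', 'male', 'male', 'female']  # toy values
mapping = frequency_index(sex)
print(mapping)
print([mapping[v] for v in sex])
```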

**Next, the columns needed to predict survival are combined into a single vector using VectorAssembler, a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into one feature vector for training ML models.<br>Normalizer is a Transformer that transforms a dataset of Vector rows, normalizing each Vector to have unit norm. It takes a parameter p, which specifies the p-norm used for normalization. This normalization can help standardize the input data and improve the behavior of learning algorithms. So we then normalize our data with Normalizer.**

In [None]:
vector = VectorAssembler(inputCols=['label1','Pclass','Age','label2','Family','label3','label4', 'Alone'],outputCol='features')
normalizer = Normalizer(inputCol='features',outputCol='features_norm', p=1.0)
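
With `p=1.0`, each row vector is divided by the sum of the absolute values of its entries. Here is a minimal NumPy sketch of that arithmetic on a toy vector. The function `l1_normalize` is a hypothetical helper written for illustration, not a Spark API.

```python
import numpy as np

def l1_normalize(row):
    """Scale a vector so that the sum of its absolute values equals 1."""
    row = np.asarray(row, dtype=float)
    norm = np.abs(row).sum()
    # Leave an all-zero vector unchanged to avoid dividing by zero
    return row if norm == 0 else row / norm

features = [1.0, 3.0, 4.0]  # toy feature vector
print(l1_normalize(features))
```

Note that this rescales each passenger's row on its own, so a feature with a large raw value (such as Age) dominates the normalized vector of that row.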

**Then we create our model and specify its input and label columns. Here, the input column is the normalized feature vector and the label column is what we want to predict, i.e. Survived.**

In [None]:
lor = LogisticRegression(featuresCol='features_norm', labelCol='Survived', maxIter=100)

**Now it's time to build the pipeline. MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. So we add all the stages we have defined so far to a Pipeline.**

In [None]:
pipeline1 = Pipeline(stages=[indexer,indexer2,indexer3,indexer4,vector,normalizer,lor])

**Next, we fit our pipeline to create a model. The pipeline here acts as an Estimator. An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.**

In [None]:
model1 = pipeline1.fit(train)

**Finally we call transform() on the fitted model, which is a Transformer. A Transformer is an abstraction that includes feature transformers and learned models. Technically, a Transformer implements a method transform(), which converts one DataFrame into another, generally by appending one or more columns.**

In [None]:
predictions1 = model1.transform(train)
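
The Estimator and Transformer roles described above can be sketched with two toy classes in plain Python. The class names `MeanCenterer` and `MeanCenterModel` are hypothetical and exist only for this illustration; they are not Spark classes, and plain lists stand in for DataFrames.

```python
class MeanCenterModel:
    """Toy 'Transformer': subtracts a mean that was learned during fit()."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        # Return new rows with the centered value appended to each input value
        return [(x, x - self.mean) for x in rows]


class MeanCenterer:
    """Toy 'Estimator': fit() learns from the data and returns a model."""
    def fit(self, rows):
        return MeanCenterModel(sum(rows) / len(rows))


data = [1.0, 2.0, 6.0]
model = MeanCenterer().fit(data)   # Estimator -> Model
result = model.transform(data)     # Model (a Transformer) -> new data
print(result)
```

The pattern is the same as `pipeline1.fit(train)` followed by `model1.transform(train)`: fitting learns parameters, and transforming appends new columns without modifying the input.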

**To compare the different models, we initialize a list that will store the accuracy of each model.**

In [None]:
accuracy = []

**Using MulticlassClassificationEvaluator we will get the accuracy of our model. Note that the predictions were made on the training data itself, so this is the training accuracy.**

In [None]:
# Named `evaluator` to avoid shadowing Python's built-in eval()
evaluator = MulticlassClassificationEvaluator().setMetricName('accuracy').setLabelCol('Survived').setPredictionCol('prediction')
acc1 = evaluator.evaluate(predictions1)
print("The accuracy is: " + str(acc1))
accuracy.append(acc1)
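
The accuracy metric is simply the fraction of rows where the prediction equals the label. A minimal plain-Python sketch of that calculation follows; the function name `accuracy_score` and the toy values are assumptions for illustration.

```python
def accuracy_score(labels, predictions):
    """Fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError('labels and predictions must have the same length')
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

labels = [0.0, 1.0, 1.0, 0.0]       # toy values
predictions = [0.0, 1.0, 0.0, 0.0]  # toy values
print(accuracy_score(labels, predictions))
```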

**The second classification method is GBTClassifier. We repeat almost the same steps as before: we only change the names of the model and pipeline, use GBTClassifier as the final stage, and check the accuracy of the model.**

In [None]:
gbt = GBTClassifier(featuresCol='features_norm',labelCol='Survived',maxIter=100)
pipeline2 = Pipeline(stages=[indexer,indexer2,indexer3,indexer4,vector,normalizer,gbt])
model2 = pipeline2.fit(train)
predictions2 = model2.transform(train)

evaluator = MulticlassClassificationEvaluator().setMetricName('accuracy').setLabelCol('Survived').setPredictionCol('prediction')
acc2 = evaluator.evaluate(predictions2)
print("The accuracy is: " + str(acc2))
accuracy.append(acc2)

**The third classification method is Linear Support Vector Classifier.**

In [None]:
svc = LinearSVC(featuresCol='features_norm', labelCol='Survived', maxIter=10)
pipeline3 = Pipeline(stages=[indexer,indexer2,indexer3,indexer4,vector,normalizer,svc])
model3 = pipeline3.fit(train)
predictions3 = model3.transform(train)

evaluator = MulticlassClassificationEvaluator().setMetricName('accuracy').setLabelCol('Survived').setPredictionCol('prediction')
acc3 = evaluator.evaluate(predictions3)
print("The accuracy is: " + str(acc3))
accuracy.append(acc3)

**The fourth classification method is DecisionTreeClassifier.**

In [None]:
dt = DecisionTreeClassifier(featuresCol='features_norm', labelCol='Survived')
pipeline4 = Pipeline(stages=[indexer,indexer2,indexer3,indexer4,vector,normalizer,dt])
model4 = pipeline4.fit(train)
predictions4 = model4.transform(train)

evaluator = MulticlassClassificationEvaluator().setMetricName('accuracy').setLabelCol('Survived').setPredictionCol('prediction')
acc4 = evaluator.evaluate(predictions4)
print("The accuracy is: " + str(acc4))
accuracy.append(acc4)

**The fifth classification method is RandomForestClassifier.**

In [None]:
rfc = RandomForestClassifier(featuresCol='features_norm', labelCol='Survived')
pipeline5 = Pipeline(stages=[indexer,indexer2,indexer3,indexer4,vector,normalizer,rfc])
model5 = pipeline5.fit(train)
predictions5 = model5.transform(train)

evaluator = MulticlassClassificationEvaluator().setMetricName('accuracy').setLabelCol('Survived').setPredictionCol('prediction')
acc5 = evaluator.evaluate(predictions5)
print("The accuracy is: " + str(acc5))
accuracy.append(acc5)

**The sixth classification method is MultilayerPerceptronClassifier. This method is a little different because we also have to specify the layers. For example, here we use layers = [8, 5, 4, 2], where 8 is the number of input features, 5 and 4 are the sizes of the hidden layers, and 2 is the number of output classes. You should define the layers according to your own model.**

In [None]:
layers = [8, 5, 4, 2]
trainer = MultilayerPerceptronClassifier(featuresCol='features_norm', labelCol='Survived', maxIter=100, layers=layers, blockSize=128, seed=1234)
pipeline6 = Pipeline(stages=[indexer,indexer2,indexer3,indexer4,vector,normalizer,trainer])
model6 = pipeline6.fit(train)
predictions6 = model6.transform(train)

evaluator = MulticlassClassificationEvaluator().setMetricName('accuracy').setLabelCol('Survived').setPredictionCol('prediction')
acc6 = evaluator.evaluate(predictions6)
print("The accuracy is: " + str(acc6))
accuracy.append(acc6)
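
To get a feel for what the `layers` list implies, the sketch below counts the trainable parameters of such a network, assuming every layer is fully connected to the next one with one bias per unit. The helper `mlp_parameter_count` is a hypothetical name used only for this illustration.

```python
def mlp_parameter_count(layers):
    """Count weights and biases of a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(layers[:-1], layers[1:]):
        # n_in * n_out weights plus n_out biases between two adjacent layers
        total += n_in * n_out + n_out
    return total

layers = [8, 5, 4, 2]
print(mlp_parameter_count(layers))
```

The first entry has to match the length of the assembled feature vector (the eight columns passed to VectorAssembler here), and the last entry has to match the number of classes.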

**Now let us compare our models through visualization.**

In [None]:
names = ['Logistic Regression', 'GBTClassifier', 'LinearSVC', 'DecisionTreeClassifier', 'RandomForestClassifier', 'MultilayerPerceptronClassifier']

In [None]:
fig, ax = plt.subplots(2,1, figsize=(15,10))
sns.barplot(x=names, y=accuracy, ax=ax[0])
sns.pointplot(x=names, y=accuracy, ax=ax[1])
plt.show()

**We have observed that GBTClassifier has the highest accuracy, so let us look at some of its predictions.**

In [None]:
predictions2.select("Survived", "prediction").show()

**Now it's time to apply our trained model to the test data. We do this by calling the transform method, and we use the GBTClassifier model because it has the highest accuracy.**

In [None]:
test = sqlContext.createDataFrame(test)
# Use the fitted GBTClassifier pipeline (model2), as stated above
predictions = model2.transform(test)
predictions = predictions.toPandas()

<a id='7'></a>
# **Submitting the predictions**

In [None]:
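# Spark returns predictions as floats (0.0 / 1.0); the submission file expects integers
sub_df['Survived'] = predictions['prediction'].astype(int)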
sub_df.head()

In [None]:
sub_df.to_csv('submission.csv', index=False)
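
Before uploading, it can help to sanity-check the submission frame. The sketch below is one way to do that, assuming the expected layout is the two columns `PassengerId` and `Survived` with values 0 or 1. The function `check_submission` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def check_submission(frame, expected_rows):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(frame.columns) != ['PassengerId', 'Survived']:
        problems.append('columns must be PassengerId, Survived')
        return problems
    if len(frame) != expected_rows:
        problems.append('unexpected number of rows')
    if frame['Survived'].isna().any():
        problems.append('Survived contains missing values')
    elif not frame['Survived'].isin([0, 1]).all():
        problems.append('Survived must be 0 or 1')
    return problems

# Toy frame for illustration; in the notebook you would pass sub_df instead
toy_submission = pd.DataFrame({'PassengerId': [892, 893], 'Survived': [0, 1]})
print(check_submission(toy_submission, expected_rows=2))
```

In the notebook this could be called as `check_submission(sub_df, expected_rows=len(df2))`, since the submission should contain one row per test passenger.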

**So we have come to the end. I have tried to explain every step. If you still have questions, leave a comment and I will try to answer them.<br> Also, if you want to know more about Spark ML and some other techniques and methods, you can view my other notebook:** <a href = "https://www.kaggle.com/utcarshagrawal/water-quality-prediction-using-sparkml/notebook" class = "btn btn-info btn-lg active"  role = "button" style = "color: white;" data-toggle = "popover" title = "Click">Click here</a> 
### <font color = 'red'> Thanks a lot for having a look at this notebook. I would like to get an appreciation from you with an upvote. Please upvote if you liked the kernel.</font>