# IST 718: Big Data Analytics

*CLASS EXERCISE 5 (2 PTS)*

## General instructions:

- Class Exercises begin in class and are submitted within a day or two. Submit your .ipynb file on Blackboard.
- You are welcome to discuss the problems with your classmates, but __you are not allowed to copy any part of your answers from your classmates or from the internet__.
- Remove or comment out code that contains `raise NotImplementedError`. This is mainly to make the `assert` statement fail if nothing is submitted.
- The tests shown in some cells (i.e., `assert` and `np.testing.` statements) are used to grade your answers. **However, the professor and FAs will use __additional__ tests on your answers. Think about cases your code should handle even if it passes all the tests you see.**
- Before downloading and submitting your work through Blackboard, remember to save and press `Validate` (or go to `Kernel`$\rightarrow$`Restart and Run All`).

In [2]:
from pyspark.sql import SparkSession
from pyspark.ml import feature
from pyspark.ml import classification
from pyspark.sql import functions as fn
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator, \
    MulticlassClassificationEvaluator, \
    RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd


spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

In [3]:
# Define a function to determine whether we are running on Databricks.
# Return True if running in the Databricks environment, False otherwise.
import os
def is_databricks():
    return os.getenv("DATABRICKS_RUNTIME_VERSION") is not None

# Define a function to read the data. The full path name is constructed by checking
# the runtime environment to determine if it is databricks or a personal computer.
# On the local filesystem the data is assumed to be in the same directory as the code.
# On databricks, the data path is assumed to be at '/FileStore/tables/' location.
# 
# Parameter(s):
#   name: The base name of the file, parquet file or parquet directory
# Return Value:
#   full_path_name: the full path name of the data based on the runtime environment
#
# Correct Usage Example (pass ONLY the full file name):
#   name_to_load = get_datapath("sms_spam.csv") # correct  
#   
# Incorrect Usage Examples:
#   name_to_load = get_datapath("/sms_spam.csv") # incorrect
#   name_to_load = get_datapath("sms_spam.csv/") # incorrect
#   name_to_load = get_datapath("c:/users/will/data/sms_spam.csv") # incorrect
#
def get_datapath(name):    
    if is_databricks():
        full_path_name = "/FileStore/tables/%s" % name
    else:
        full_path_name = name
    return full_path_name

# Supervised Learning for Titanic Survival

https://www.kaggle.com/c/titanic

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In [4]:
titanic_df = spark.read.csv(get_datapath('titanic_original.csv'), header=True, inferSchema=True)

[Column Descriptions](https://data.world/nrippner/titanic-disaster-dataset): <br>
survived - Survival (0 = No; 1 = Yes) <br>
pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) <br>
name - Name <br>
sex - Sex <br>
age - Age <br>
sibsp - Number of Siblings/Spouses Aboard <br>
parch - Number of Parents/Children Aboard <br>
ticket - Ticket Number <br>
fare - Passenger Fare <br>
cabin - Cabin <br>
embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) <br>
boat - Lifeboat (if survived) <br>
body - Body number (if did not survive and body was recovered) <br>

In [5]:
titanic_df_pd = titanic_df.limit(10).toPandas()
display(titanic_df_pd.head())
display("types:", titanic_df_pd.dtypes)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


'types:'

pclass         int32
survived       int32
name          object
sex           object
age          float64
sibsp          int32
parch          int32
ticket        object
fare         float64
cabin         object
embarked      object
boat          object
body         float64
home.dest     object
dtype: object

In [6]:
# some basic cleanup
drop_cols = ['boat', 'body']
new_titanic_df = titanic_df.\
    drop(*drop_cols).\
    withColumnRenamed('home.dest', 'home_dest').\
    dropna(subset=['pclass', 'age', 'sibsp', 'parch', 'fare', 'survived'])

In [7]:
from pyspark.sql.functions import isnan, isnull, when, count, col

new_titanic_df.select([count(when(isnull(c), c)).alias(c) for c in new_titanic_df.columns]).show()
new_titanic_df.select([count(when(isnan(c), c)).alias(c) for c in new_titanic_df.columns]).show()

+------+--------+----+---+---+-----+-----+------+----+-----+--------+---------+
|pclass|survived|name|sex|age|sibsp|parch|ticket|fare|cabin|embarked|home_dest|
+------+--------+----+---+---+-----+-----+------+----+-----+--------+---------+
|     0|       0|   0|  0|  0|    0|    0|     0|   0|  773|       2|      360|
+------+--------+----+---+---+-----+-----+------+----+-----+--------+---------+

+------+--------+----+---+---+-----+-----+------+----+-----+--------+---------+
|pclass|survived|name|sex|age|sibsp|parch|ticket|fare|cabin|embarked|home_dest|
+------+--------+----+---+---+-----+-----+------+----+-----+--------+---------+
|     0|       0|   0|  0|  0|    0|    0|     0|   0|    0|       0|        0|
+------+--------+----+---+---+-----+-----+------+----+-----+--------+---------+



In [8]:
training, test = new_titanic_df.randomSplit([0.8, 0.2], 0)
training.show(5)

+------+--------+--------------------+------+----+-----+-----+--------+------+-------+--------+--------------------+
|pclass|survived|                name|   sex| age|sibsp|parch|  ticket|  fare|  cabin|embarked|           home_dest|
+------+--------+--------------------+------+----+-----+-----+--------+------+-------+--------+--------------------+
|     1|       0|"Lindeberg-Lind, ...|  male|42.0|    0|    0|   17475| 26.55|   null|       S|   Stockholm, Sweden|
|     1|       0|"Rosenshine, Mr. ...|  male|46.0|    0|    0|PC 17585|  79.2|   null|       C|        New York, NY|
|     1|       0|Allison, Miss. He...|female| 2.0|    1|    2|  113781|151.55|C22 C26|       S|Montreal, PQ / Ch...|
|     1|       0|Allison, Mr. Huds...|  male|30.0|    1|    2|  113781|151.55|C22 C26|       S|Montreal, PQ / Ch...|
|     1|       0|Allison, Mrs. Hud...|female|25.0|    1|    2|  113781|151.55|C22 C26|       S|Montreal, PQ / Ch...|
+------+--------+--------------------+------+----+-----+-----+--------+------+-------+--------+--------------------+
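
`randomSplit([0.8, 0.2], 0)` assigns rows to the two sets at random, with `0` as the seed, so the 80/20 proportions hold only approximately. The plain-Python sketch below illustrates that idea; it is not Spark's actual sampling code, and `random_split` and the toy row list are hypothetical names used only here.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a seeded uniform number."""
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    # Dividing by the total normalizes weights that do not sum to 1.
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point round-off

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=0)
print(len(train_rows), len(test_rows))  # roughly 800 and 200, not exactly
```

Reusing the same seed reproduces the same split, which is why the notebook fixes it.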

## classic pipeline

Create a logistic regression pipeline

Fit the logistic regression model on the training data

To add "sex" as a feature, the pipeline needs two extra stages: a `StringIndexer` that converts the string column to a numeric index, and a second `VectorAssembler` that appends that index to the numeric features. By default, `StringIndexer` assigns indices in order of descending frequency: the most frequent label gets index 0, the next most frequent gets index 1, and so on. See the Spark documentation for [StringIndexer](https://spark.apache.org/docs/latest/ml-features.html#stringindexer) for more information.
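
The frequency-based ordering can be sketched in plain Python. This is an illustration of the default behavior rather than Spark's implementation; `frequency_index` and the toy column are hypothetical, and the alphabetical tie-break is an assumption about how equal counts are resolved.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

sex_column = ["male", "female", "male", "male", "female"]
mapping = frequency_index(sex_column)
encoded = [mapping[v] for v in sex_column]
print(mapping)  # {'male': 0.0, 'female': 1.0}
print(encoded)  # [0.0, 1.0, 0.0, 0.0, 1.0]
```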

In [9]:
from pyspark.ml.classification import LogisticRegression
model1 = Pipeline(stages=[feature.VectorAssembler(inputCols=['pclass', 'age', 'sibsp', 'parch', 'fare'], 
                                                  outputCol='features'),
                          feature.StringIndexer(inputCol='sex', outputCol='encoded_sex'),
                          feature.VectorAssembler(inputCols=['features', 'encoded_sex'], outputCol='final_features'),
                          LogisticRegression(labelCol='survived', featuresCol='final_features')])

In [10]:
model1_fitted = model1.fit(training)

In [11]:
evaluator = BinaryClassificationEvaluator(labelCol='survived')
evaluator.evaluate(model1_fitted.transform(test))

0.8341750841750837
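
Unless another `metricName` is set, `BinaryClassificationEvaluator` reports the area under the ROC curve. One way to read that number is as the probability that a randomly chosen survivor receives a higher score than a randomly chosen non-survivor. The sketch below computes it that way in plain Python on made-up labels and scores; Spark builds the curve differently, and `area_under_roc` is a hypothetical helper.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up example: 3 survivors (label 1) and 2 non-survivors (label 0).
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]
auc = area_under_roc(labels, scores)
print(auc)  # 5 of the 6 pairs are ranked correctly
```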

## decision tree pipeline

Research Spark ML for DT classifier

Create a decision tree pipeline

Fit the decision tree model on the training data

In [17]:
# import classifier
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# define model2
model2 = Pipeline(stages=[feature.VectorAssembler(inputCols=['pclass', 'age', 'sibsp', 'parch', 'fare'],
                                                  outputCol='features'),
                          feature.StringIndexer(inputCol='sex', outputCol='encoded_sex'),
                          feature.VectorAssembler(inputCols=['features', 'encoded_sex'], outputCol='final_features'),
                          DecisionTreeClassifier(labelCol='survived', featuresCol='final_features')])
# YOUR CODE HERE


In [18]:
# define model2_fitted

# YOUR CODE HERE
model2_fitted = model2.fit(training)

In [19]:
# evaluate the model

# YOUR CODE HERE
evaluator = MulticlassClassificationEvaluator(labelCol='survived')
evaluator.evaluate(model2_fitted.transform(test))


0.7702231848573313
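
`MulticlassClassificationEvaluator` defaults to the `f1` metric, a weighted average of per-class F1 scores, which is a different quantity from the area under the ROC curve that `BinaryClassificationEvaluator` reports. The plain-Python sketch below computes a weighted F1 on made-up labels and predictions; `weighted_f1` is a hypothetical helper, not Spark code.

```python
def weighted_f1(labels, predictions):
    """Per-class F1, averaged with weights equal to each class's share of labels."""
    pairs = list(zip(labels, predictions))
    total = len(labels)
    score = 0.0
    for c in sorted(set(labels)):
        tp = sum(1 for y, p in pairs if y == c and p == c)
        fp = sum(1 for y, p in pairs if y != c and p == c)
        fn = sum(1 for y, p in pairs if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        support = tp + fn  # number of true examples of class c
        score += f1 * support / total
    return score

# Made-up example: 3 survivors (1) and 5 non-survivors (0).
labels      = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 0, 0, 0, 1, 0]
f1_score = weighted_f1(labels, predictions)
print(f1_score)  # class 0 F1 = 0.8 (weight 5/8), class 1 F1 = 2/3 (weight 3/8)
```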

## random forest pipeline

Research Spark ML for RF classifier

Create a random forest pipeline

Fit the random forest model on the training data

In [46]:
# import classifier
from pyspark.ml.classification import RandomForestClassifier
# define model3
model3 = Pipeline(stages=[feature.VectorAssembler(inputCols=['pclass', 'age', 'sibsp', 'parch', 'fare'],
                                                  outputCol='features'),
                          feature.StringIndexer(inputCol='sex', outputCol='encoded_sex'),
                          feature.VectorAssembler(inputCols=['features', 'encoded_sex'], outputCol='final_features'),
                          RandomForestClassifier(labelCol='survived', featuresCol='final_features')])
# YOUR CODE HERE


In [None]:
# define model3_fitted

# YOUR CODE HERE
model3_fitted = model3.fit(training)

In [None]:
# evaluate the model

# YOUR CODE HERE
evaluator = MulticlassClassificationEvaluator(labelCol='survived')
evaluator.evaluate(model3_fitted.transform(test))

## gradient-boosted tree pipeline

Research Spark ML for GBT classifier

Create a gradient-boosted tree pipeline

Fit the gradient-boosted tree model on the training data

In [25]:
# import classifier
from pyspark.ml.classification import GBTClassifier
# define model4
# YOUR CODE HERE
model4 = Pipeline(stages=[feature.VectorAssembler(inputCols=['pclass', 'age', 'sibsp', 'parch', 'fare'],
                                                  outputCol='features'),
                          feature.StringIndexer(inputCol='sex', outputCol='encoded_sex'),
                          feature.VectorAssembler(inputCols=['features', 'encoded_sex'], outputCol='final_features'),
                          GBTClassifier(labelCol='survived', featuresCol='final_features')])

In [26]:
# define model4_fitted

# YOUR CODE HERE
model4_fitted = model4.fit(training)

In [27]:
# evaluate the model

# YOUR CODE HERE
evaluator = MulticlassClassificationEvaluator(labelCol='survived')
evaluator.evaluate(model4_fitted.transform(test))

0.7696140246130725

## naive bayes pipeline

Research Spark ML for NB classifier

Create a naive bayes pipeline

Fit the naive bayes model on the training data

In [28]:
from pyspark.ml.classification import NaiveBayes
model5 = Pipeline(stages=[feature.VectorAssembler(inputCols=['pclass', 'age', 'sibsp', 'parch', 'fare'], 
                                                  outputCol='features'),
                          feature.StringIndexer(inputCol='sex', outputCol='encoded_sex'),
                          feature.VectorAssembler(inputCols=['features', 'encoded_sex'], outputCol='final_features'),
                          NaiveBayes(labelCol='survived', featuresCol='final_features')])

In [29]:
model5_fitted = model5.fit(training)

In [30]:
evaluator.evaluate(model5_fitted.transform(test))

0.6628363249052904

## support vector machines pipeline

Research Spark ML for SVC classifier

Create a support vector machines pipeline

Fit the support vector machines model on the training data

In [38]:
# import classifier
from pyspark.ml.classification import LinearSVC

# define model6
model6 = Pipeline(stages=[feature.VectorAssembler(inputCols=['pclass', 'age', 'sibsp', 'parch', 'fare'], 
                                                  outputCol='features'),
                          feature.StringIndexer(inputCol='sex', outputCol='encoded_sex'),
                          feature.VectorAssembler(inputCols=['features', 'encoded_sex'], outputCol='final_features'),
                          LinearSVC(labelCol='survived', featuresCol='final_features')])
# YOUR CODE HERE


In [39]:
# define model6_fitted

# YOUR CODE HERE
model6_fitted = model6.fit(training)

In [41]:
# evaluate the model

# YOUR CODE HERE
evaluator.evaluate(model6_fitted.transform(test))

0.7621830324533027

## inspect all the models

Which model seems to be the best for the Titanic survival model? Substantiate your answer.

*YOUR ANSWER HERE*

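Comparing the evaluation results, the logistic regression pipeline has the highest test score (0.834), ahead of the decision tree (0.770), gradient-boosted trees (0.770), support vector machines (0.762), and naive Bayes (0.663); no score was recorded for the random forest. The scores are not strictly comparable, however: logistic regression was scored with `BinaryClassificationEvaluator` (area under the ROC curve by default), the decision tree and gradient-boosted tree cells use `MulticlassClassificationEvaluator` (F1 by default), and the naive Bayes and support vector machines cells reuse whichever `evaluator` was defined last.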
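
One way to put all six pipelines on the same footing is to score each with a single evaluator. The sketch below is a minimal helper under stated assumptions: `score_models` is a hypothetical name, each fitted pipeline is assumed to expose `transform`, and the evaluator is assumed to expose `evaluate`, as the objects in this notebook do.

```python
def score_models(fitted_models, test_df, evaluator):
    """Score every fitted pipeline on the same data with the same evaluator.

    fitted_models: dict mapping a display name to a fitted pipeline
    test_df:       the held-out DataFrame
    evaluator:     any object with an evaluate(predictions) method
    Returns a list of (name, score) pairs, highest score first.
    """
    scores = {name: evaluator.evaluate(model.transform(test_df))
              for name, model in fitted_models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In this notebook it might be called as `score_models({'logistic regression': model1_fitted, 'decision tree': model2_fitted}, test, BinaryClassificationEvaluator(labelCol='survived'))`, with the remaining fitted pipelines added to the dictionary. This assumes every classifier emits the `rawPrediction` column that the evaluator reads by default.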

*END OF EXERCISE*