# Logistic Regression - Titanic Data

Aim - Predict which passengers survived the sinking of the Titanic, based solely on each passenger's features.

Steps to follow: 

1. Create a Spark Session and import LogisticRegression
2. Load the data, check for null values, and check whether it is in (label, features) format; if not, build the features column with VectorAssembler
3. Split data into training and testing set (7:3)
4. Create an instance of Logistic Regression 
5. Create a model by using the instance to train/fit training data 
6. Use trained model to obtain prediction results by evaluating on testing data
7. Select label and predictions from prediction results
8. Create evaluator instance 
9. Score the predictions against the labels with the evaluator instance (the evaluator used here reports area under the ROC curve, not accuracy)

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('myproj').getOrCreate()

In [3]:
data = spark.read.csv('titanic.csv',inferSchema=True,header=True)

In [4]:
data.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



In [5]:
data.columns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [6]:
# Select the columns relevant to predicting passenger survival
# 'Survived' is our label; the remaining columns are candidate features
my_cols = data.select(['Survived',
 'Pclass',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Fare',
 'Embarked'])

In [7]:
# Check null or na values in dataframe
from pyspark.sql.functions import isnan, isnull, when, count, col

my_cols.select([count(when(isnan(c)| isnull(c), c)).alias(c) for c in my_cols.columns]).show()

+--------+------+---+---+-----+-----+----+--------+
|Survived|Pclass|Sex|Age|SibSp|Parch|Fare|Embarked|
+--------+------+---+---+-----+-----+----+--------+
|       0|     0|  0|177|    0|    0|   0|       2|
+--------+------+---+---+-----+-----+----+--------+



In [8]:
# For the time being we will just drop the missing data
data = my_cols.na.drop()
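
The null count and the row drop above can be pictured with a small plain-Python sketch. The rows below are made up for illustration (they are not taken from titanic.csv), and the helper names `is_missing`, `count_missing`, and `drop_missing` are hypothetical. It assumes, as na.drop() does by default, that a row is removed if any of its columns is missing.

```python
import math

# Made-up rows for illustration only; not real Titanic records.
rows = [
    {"Survived": 0, "Age": 22.0, "Embarked": "S"},
    {"Survived": 1, "Age": None, "Embarked": "C"},
    {"Survived": 1, "Age": 26.0, "Embarked": None},
    {"Survived": 0, "Age": float("nan"), "Embarked": "S"},
]

def is_missing(value):
    """Treat None and floating-point NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def count_missing(rows):
    """Count missing values per column, like the isnan/isnull query above."""
    return {c: sum(is_missing(r[c]) for r in rows) for c in rows[0]}

def drop_missing(rows):
    """Keep only rows in which every column has a value."""
    return [r for r in rows if not any(is_missing(v) for v in r.values())]

missing = count_missing(rows)
clean = drop_missing(rows)
print(missing)
print(len(rows), "rows before,", len(clean), "rows after")
```

Dropping rows is simple but discards every passenger with an unknown Age, which is the bulk of the missing data counted above.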

In [9]:
data.printSchema()

root
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Embarked: string (nullable = true)



We have two string columns, Sex and Embarked. To use them as features, we create a StringIndexer for each column to map its categories to numeric indices, and then a OneHotEncoder to turn each index into a one-hot vector.

In [10]:
from pyspark.ml.feature import (VectorAssembler,VectorIndexer,
                                OneHotEncoder,StringIndexer)

In [11]:
# Indexer assigns a number to every category in the column
sex_indexer = StringIndexer(inputCol='Sex',outputCol='SexIndex')
sex_encoder = OneHotEncoder(inputCol='SexIndex',outputCol='SexVec')

In [12]:
embarked_indexer = StringIndexer(inputCol='Embarked',outputCol='EmbarkIndex')
embarked_encoder = OneHotEncoder(inputCol='EmbarkIndex',outputCol='EmbarkVec')
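
To make the two steps concrete, here is a rough plain-Python sketch of what indexing and one-hot encoding produce. It assumes the default settings as I understand them: StringIndexer gives index 0 to the most frequent category, and OneHotEncoder drops the last category so that k categories become a vector of length k - 1. The helper names and the sample values are hypothetical, and Spark itself returns sparse vectors rather than Python lists.

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: float(i) for i, v in enumerate(ordered)}

def one_hot_drop_last(index, num_categories):
    """One-hot encode an index, leaving the last category as all zeros."""
    vec = [0.0] * (num_categories - 1)
    i = int(index)
    if i < num_categories - 1:
        vec[i] = 1.0
    return vec

# Made-up sample values for illustration.
embarked = ["S", "C", "S", "Q", "S", "C"]
embark_index = fit_string_index(embarked)
embark_vec = {v: one_hot_drop_last(i, len(embark_index))
              for v, i in embark_index.items()}
print(embark_index)
print(embark_vec)
```

Dropping the last category avoids a redundant column: if a passenger is neither "S" nor "C", the all-zero vector already identifies "Q".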

In [13]:
# Assemble all our features (independent variables)
assembler = VectorAssembler(inputCols=['Pclass',
 'SexVec',
 'Age',
 'SibSp',
 'Parch',
 'Fare',
 'EmbarkVec'],outputCol='features')

In [14]:
from pyspark.ml.classification import LogisticRegression

In [15]:
from pyspark.ml import Pipeline

In [16]:
log_reg_titanic = LogisticRegression(featuresCol='features',labelCol='Survived')
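
As a reminder of what the fitted model computes, the sketch below applies the logistic (sigmoid) function to a weighted sum of features and thresholds the resulting probability at 0.5, which I believe is the default threshold. The weights, intercept, and feature values are hypothetical numbers chosen for illustration, not coefficients learned from the Titanic data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, intercept, threshold=0.5):
    """Return (probability of class 1, hard 0/1 prediction)."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    label = 1.0 if probability > threshold else 0.0
    return probability, label

# Hypothetical coefficients and features, for illustration only.
weights = [-1.0, 2.0]
intercept = 0.5
prob, label = predict([1.0, 1.0], weights, intercept)
print(round(prob, 4), label)
```

The hard label is what ends up in the prediction column; the underlying score and probability are kept in separate output columns.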

In [17]:
# A Pipeline chains the stages so they run in order:
# index the strings, one-hot encode, assemble the features, then fit the classifier
pipeline = Pipeline(stages=[sex_indexer,embarked_indexer,
                           sex_encoder,embarked_encoder,
                           assembler,log_reg_titanic])

In [18]:
train, test = data.randomSplit([0.7,.3])
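
One thing worth knowing about this kind of split: to my understanding, each row is assigned to a side by an independent random draw, so the 70/30 ratio is approximate rather than exact, and the result changes from run to run unless a seed is supplied (randomSplit accepts an optional seed argument). The plain-Python sketch below illustrates the idea; `bernoulli_split` is a hypothetical helper, not Spark's implementation.

```python
import random

def bernoulli_split(rows, train_weight=0.7, seed=None):
    """Send each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1000))  # stand-in for DataFrame rows
train_rows, test_rows = bernoulli_split(rows, 0.7, seed=42)
print(len(train_rows), len(test_rows))  # close to 700 / 300, not exactly
```

Fixing the seed makes the split, and therefore the reported metric, reproducible across runs.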

In [19]:
model = pipeline.fit(train)

In [20]:
results = model.transform(test)
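
The fit and transform calls above follow a general pattern: fitting learns each stage's parameters from the training data only, and transforming reuses those learned parameters on new data. The toy sketch below mimics that pattern in plain Python. All class names here (`MeanCenterer`, `ToyPipeline`, and so on) are hypothetical stand-ins, not part of pyspark.

```python
class MeanCentererModel:
    """Fitted stage: subtracts a mean that was learned earlier."""
    def __init__(self, column, mean):
        self.column, self.mean = column, mean

    def transform(self, rows):
        out = self.column + "_centered"
        return [{**r, out: r[self.column] - self.mean} for r in rows]

class MeanCenterer:
    """Unfitted stage: learns the column mean when fit is called."""
    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        mean = sum(r[self.column] for r in rows) / len(rows)
        return MeanCentererModel(self.column, mean)

class ToyPipelineModel:
    """Applies the fitted stages in order."""
    def __init__(self, models):
        self.models = models

    def transform(self, rows):
        for model in self.models:
            rows = model.transform(rows)
        return rows

class ToyPipeline:
    """Fits each stage on the output of the previous one."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        models, current = [], rows
        for stage in self.stages:
            model = stage.fit(current)
            current = model.transform(current)
            models.append(model)
        return ToyPipelineModel(models)

train_rows = [{"Age": 20.0}, {"Age": 40.0}]
test_rows = [{"Age": 50.0}]
toy_model = ToyPipeline([MeanCenterer("Age")]).fit(train_rows)
toy_results = toy_model.transform(test_rows)
print(toy_results)  # centered with the training mean (30.0), not the test mean
```

This is why the string indexers in the real pipeline learn their category-to-index mapping from the training split and apply the same mapping to the test split.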

In [21]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [22]:
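# The default metric is areaUnderROC. Passing the hard 0/1 'prediction' column
# (rather than 'rawPrediction') means the ROC curve is built from a single threshold.
evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                                       labelCol='Survived')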

In [23]:
results.select('Survived','prediction').show()

+--------+----------+
|Survived|prediction|
+--------+----------+
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       0.0|
|       0|       1.0|
|       0|       1.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       1.0|
+--------+----------+
only showing top 20 rows



In [24]:
evaluator.evaluate(results)

0.8032967032967033

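This means that the area under the ROC curve (the evaluator's default metric) is about 0.80 on the test set. Note that this is not the same as 80% accuracy. AUC measures how well the model separates survivors from non-survivors, and because the evaluator was given the hard 0/1 prediction column rather than rawPrediction, the value here works out to the average of the true positive rate and the true negative rate. To report accuracy (the fraction of passengers classified correctly), use MulticlassClassificationEvaluator with metricName='accuracy' instead.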
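
Area under the ROC curve and accuracy are different metrics and can give different numbers on the same predictions. The plain-Python sketch below computes both from hard 0/1 predictions. With only hard predictions, the ROC curve has a single interior point, so the trapezoidal area reduces to (TPR + TNR) / 2. The labels and predictions are made up for illustration, and the function names are hypothetical.

```python
def confusion_counts(labels, predictions):
    """Return (tp, fp, tn, fn) for binary 0/1 labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return tp, fp, tn, fn

def accuracy(labels, predictions):
    """Fraction of observations classified correctly."""
    tp, fp, tn, fn = confusion_counts(labels, predictions)
    return (tp + tn) / (tp + fp + tn + fn)

def auc_from_hard_predictions(labels, predictions):
    """Trapezoidal ROC area through (0,0), (FPR,TPR), (1,1)."""
    tp, fp, tn, fn = confusion_counts(labels, predictions)
    tpr = tp / (tp + fn)  # true positive rate
    tnr = tn / (tn + fp)  # true negative rate
    return (tpr + tnr) / 2

# Made-up example: 6 non-survivors, 4 survivors.
labels      = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predictions = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
acc = accuracy(labels, predictions)
auc = auc_from_hard_predictions(labels, predictions)
print(acc, round(auc, 4))
```

In this example 7 of 10 predictions are correct, yet the area under the curve is lower, because the model finds only half of the survivors.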

---------------------------------------------------------------------