# Logistic Regression Code Along

Let's work through a "classic" classification example!

The Titanic dataset is a common exercise for classification in machine learning. There are lots of examples of it online for other machine learning libraries.

We'll use it to attempt to predict which passengers survived the sinking of the Titanic based solely on passenger features such as:
- age
- cabin
- children
- etc...

We'll also explore a few more things!

We'll see a better way to deal with categorical data through a two-step process: index the string categories, then one-hot encode the indices.

We'll also show how to use pipelines to set stages and build models that can be easily used again.

Our data will also have a lot of missing information, so we will need to deal with that as well.

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("titanic").getOrCreate()

In [3]:
data = spark.read.csv("titanic.csv", inferSchema=True, header=True)

In [4]:
data.printSchema()

data.show()

# We are interested in predicting the "Survived" field.

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|   

In [5]:
data.columns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [6]:
my_cols = data.select(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'])

**We need to deal with missing data!**

To keep things simple at this stage, we will just drop the rows that contain null entries.

In [7]:
my_final_data = my_cols.na.drop(how="any")
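To make the `how` argument concrete, here is a minimal plain-Python sketch of the row-dropping rule. It is not how Spark implements `na.drop`; the helper `drop_null_rows` and the sample rows are hypothetical, with `None` standing in for null.

```python
def drop_null_rows(rows, how="any"):
    """Mimic the row-dropping rule on a list of dicts (None stands in for null)."""
    if how == "any":
        # Keep a row only if every value is present.
        return [r for r in rows if all(v is not None for v in r.values())]
    if how == "all":
        # Keep a row unless every value is missing.
        return [r for r in rows if any(v is not None for v in r.values())]
    raise ValueError("how must be 'any' or 'all'")


# Hypothetical sample rows, not taken from titanic.csv.
rows = [
    {"Survived": 0, "Age": 22.0, "Embarked": "S"},
    {"Survived": 1, "Age": None, "Embarked": "C"},
    {"Survived": None, "Age": None, "Embarked": None},
]

kept_any = drop_null_rows(rows, how="any")
kept_all = drop_null_rows(rows, how="all")
print(len(rows), len(kept_any), len(kept_all))
```

With `how="any"` a single missing value is enough to discard a row, so this is the stricter of the two options and can shrink the dataset considerably.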

**We need to process the categorical columns.**

In [8]:
from pyspark.ml.feature import (VectorAssembler, VectorIndexer, OneHotEncoder, StringIndexer)

In [9]:
gender_indexer = StringIndexer(inputCol="Sex", outputCol="SexIndex")
# Output an indexed version of the "Sex" column.  Assign a number for every category of that column.

In [10]:
# One-hot encode the indexed column.
# Transforms each category index into a one-hot encoded vector.
gender_encoder = OneHotEncoder(inputCol="SexIndex", outputCol="SexVec")

In [11]:
# Do the same for the "Embarked" column.
# You do not need to know the number of categories beforehand.
# The Indexer and Encoder combination takes care of that for you.
embark_indexer = StringIndexer(inputCol="Embarked", outputCol="EmbarkedIndex")
embark_encoder = OneHotEncoder(inputCol="EmbarkedIndex", outputCol="EmbarkedVec")
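The two-step idea (index, then one-hot encode) can be sketched in plain Python. This is only an illustration, not Spark's implementation. It assumes the default behaviour as we understand it: `StringIndexer` gives the most frequent label index 0, and `OneHotEncoder` drops the last category so that it is represented by an all-zero vector. The helper names and the sample values are hypothetical.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Ties are broken alphabetically in this sketch.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Turn a category index into a 0/1 vector, optionally dropping the last slot."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec


# Hypothetical port-of-embarkation values.
embarked = ["S", "C", "S", "Q", "S", "C"]

mapping = fit_string_index(embarked)
vectors = [one_hot(mapping[port], len(mapping)) for port in embarked]

print(mapping)
print(vectors[0], vectors[1], vectors[3])
```

The number of categories is discovered from the data during the indexing step, which is why it never has to be specified by hand.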

**Next, we need to create an Assembler.**

In [12]:
assembler = VectorAssembler(inputCols=["Pclass", "SexVec", "EmbarkedVec", "Age", "SibSp", "Parch", "Fare"],
                           outputCol="features")
# Note that we use the encoded "SexVec" and "EmbarkedVec" columns, not the raw "Sex" or "Embarked" string columns.
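Conceptually, the assembler concatenates scalar columns and vector columns into one flat feature vector, in the order given by `inputCols`. A minimal plain-Python sketch follows. The `assemble` helper is hypothetical, and the encoded values for `SexVec` and `EmbarkedVec` are assumed for illustration rather than taken from a real run.

```python
def assemble(row, input_cols):
    """Concatenate scalar and vector columns into one flat list of floats."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # Vector column: append every component.
            features.extend(float(v) for v in value)
        else:
            # Scalar column: append the single value.
            features.append(float(value))
    return features


cols = ["Pclass", "SexVec", "EmbarkedVec", "Age", "SibSp", "Parch", "Fare"]

# Numeric values follow the first passenger shown earlier; the encoded vectors are assumed.
row = {
    "Pclass": 3,
    "SexVec": [1.0],
    "EmbarkedVec": [1.0, 0.0],
    "Age": 22.0,
    "SibSp": 1,
    "Parch": 0,
    "Fare": 7.25,
}

features = assemble(row, cols)
print(features)
```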

**Create a Pipeline.**

In [13]:
from pyspark.ml.classification import LogisticRegression

In [14]:
from pyspark.ml import Pipeline

In [15]:
log_reg_titanic = LogisticRegression(labelCol="Survived")

In [16]:
# A pipeline sets stages for different steps:
pipeline = Pipeline(stages=[gender_indexer, 
                            embark_indexer,
                            gender_encoder, 
                            embark_encoder,
                            assembler, 
                            log_reg_titanic])
# Transformation Stages --> Assembler --> Your Model.
# You can treat this Pipeline like you would a normal model.
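To see why a pipeline behaves like a single model, here is a toy plain-Python sketch of the fit/transform pattern: each stage is fitted on the output of the previous stages, and the fitted stages are then replayed in order on new data. All class names are hypothetical, and the toy indexer orders labels alphabetically for simplicity, which differs from the frequency-based ordering discussed earlier.

```python
class ToyIndexer:
    """Toy estimator: learns a label -> index mapping for one column."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        labels = sorted({row[self.input_col] for row in rows})
        mapping = {label: float(i) for i, label in enumerate(labels)}
        return ToyIndexerModel(self.input_col, self.output_col, mapping)


class ToyIndexerModel:
    """Fitted stage: adds the index column using the learned mapping."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = mapping

    def transform(self, rows):
        return [
            {**row, self.output_col: self.mapping[row[self.input_col]]}
            for row in rows
        ]


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted, current = [], rows
        for stage in self.stages:
            # Fit estimators on the data produced by the earlier stages.
            model = stage.fit(current) if hasattr(stage, "fit") else stage
            current = model.transform(current)
            fitted.append(model)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        # Replay every fitted stage in order on the new data.
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


# Hypothetical rows.
train = [{"Sex": "male", "Embarked": "S"}, {"Sex": "female", "Embarked": "C"}]
test = [{"Sex": "female", "Embarked": "S"}]

toy_pipeline = ToyPipeline([
    ToyIndexer("Sex", "SexIndex"),
    ToyIndexer("Embarked", "EmbarkedIndex"),
])
toy_model = toy_pipeline.fit(train)
toy_results = toy_model.transform(test)
print(toy_results)
```

Because the mappings are learned only from the training rows, the same fitted pipeline can be reused on any later dataset without refitting.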

In [17]:
train_data, test_data = my_final_data.randomSplit([0.7, 0.3], seed=123)

In [18]:
fitted_model = pipeline.fit(train_data)

In [19]:
# transform our test data
results = fitted_model.transform(dataset=test_data)
results

DataFrame[Survived: int, Pclass: int, Sex: string, Age: double, SibSp: int, Parch: int, Fare: double, Embarked: string, SexIndex: double, EmbarkedIndex: double, SexVec: vector, EmbarkedVec: vector, features: vector, rawPrediction: vector, probability: vector, prediction: double]

**It is time to evaluate our results.**

In [20]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [21]:
my_evaluator = BinaryClassificationEvaluator(labelCol="Survived")

In [22]:
results.select(["Survived", "rawPrediction", "prediction"]).show()

+--------+--------------------+----------+
|Survived|       rawPrediction|prediction|
+--------+--------------------+----------+
|       0|[-2.7828312400295...|       1.0|
|       0|[-2.6920522778532...|       1.0|
|       0|[-1.0093444100035...|       1.0|
|       0|[-1.1353427916105...|       1.0|
|       0|[-1.3096816414899...|       1.0|
|       0|[-0.7634442737207...|       1.0|
|       0|[-0.1237448243641...|       1.0|
|       0|[0.30743401607605...|       0.0|
|       0|[-0.1109499526733...|       1.0|
|       0|[-0.3895528708307...|       1.0|
|       0|[1.12640390957957...|       0.0|
|       0|[0.74046053612254...|       0.0|
|       0|[0.84182168577874...|       0.0|
|       0|[0.50876423476703...|       0.0|
|       0|[0.19031777812683...|       0.0|
|       0|[0.19394095505748...|       0.0|
|       0|[0.73209154946681...|       0.0|
|       0|[0.83076294069805...|       0.0|
|       0|[0.87947655211660...|       0.0|
|       0|[0.20000383051040...|       0.0|
+--------+--------------------+----------+
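The relationship between the `rawPrediction` and `prediction` columns above can be sketched in plain Python. This assumes a binary logistic regression in which the first element of `rawPrediction` is the score for class 0 (so the margin for class 1 is its negative), a sigmoid turns the margin into a probability, and the class is chosen with a 0.5 threshold. The helper name is hypothetical, and the input values are rounded from the table.

```python
import math


def probability_and_prediction(raw0, threshold=0.5):
    """Convert the class-0 raw score into class probabilities and a predicted label."""
    # The margin for class 1 is -raw0, so p1 = sigmoid(-raw0).
    p1 = 1.0 / (1.0 + math.exp(raw0))
    p0 = 1.0 - p1
    prediction = 1.0 if p1 > threshold else 0.0
    return p0, p1, prediction


# Rounded first components of rawPrediction from two rows of the table.
for raw0 in (-2.78, 0.31):
    p0, p1, prediction = probability_and_prediction(raw0)
    print(raw0, round(p0, 3), round(p1, 3), prediction)
```

A negative first component therefore corresponds to a prediction of 1.0 and a positive one to 0.0, which matches the rows shown.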

In [23]:
au_roc = my_evaluator.evaluate(dataset=results)
# We are evaluating our "results" dataframe.

In [24]:
au_roc

0.8474666014988604
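The number above is the area under the ROC curve, assuming the evaluator was left on its default metric. One way to read it: the probability that a randomly chosen survivor gets a higher score than a randomly chosen non-survivor. The plain-Python sketch below computes that pairwise definition on hypothetical labels and scores. It illustrates the metric and is not the algorithm Spark uses.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                # Ties count as half a correctly ranked pair.
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted scores for class 1.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = area_under_roc(labels, scores)
print(auc)  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a score near 0.85 indicates the model separates the two classes reasonably well.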