# Logistic Regression in Spark
## 1.) Basic Logistic Regression
* Logistic regression models are classification algorithms
* They use the logistic (sigmoid) function to map a weighted sum of the input features to an output between 0 and 1
* This output is the estimated probability that the given inputs belong to class 1 (the probability of class 0 is one minus this value)
* As such, this is a binary classification method
* Below, we build a simple logistic regression model and fit it to our input data
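
The mapping described above can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation: the weights, bias and 0.5 threshold are made-up values chosen for the example.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features, threshold=0.5):
    """Return (probability of class 1, predicted class) for one input row."""
    # weighted sum of the inputs plus an intercept
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = logistic(z)
    # the probability becomes a 0/1 class by comparing it to a threshold
    return p, (1.0 if p > threshold else 0.0)


# hypothetical coefficients and a single two-feature input
prob, label = predict(weights=[2.0, -1.0], bias=0.5, features=[1.0, 0.5])
print(round(prob, 4), label)
```

Fitting a model amounts to choosing the weights and bias so that these probabilities agree with the observed labels as closely as possible.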

In [14]:
# find spark
import findspark

# point to spark
findspark.init('/home/matt/spark-3.0.2-bin-hadoop3.2')

# load spark lib
import pyspark
from pyspark.sql import SparkSession

# create session
spark = SparkSession.builder.appName('logreg').getOrCreate()

# load logreg libs
from pyspark.ml.classification import LogisticRegression

# read in data
df = spark.read.format('libsvm').load('Data/sample_libsvm_data.txt')

# show schema
df.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)



In [15]:
# peek at data
df.show(3)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|
+-----+--------------------+
only showing top 3 rows



In [16]:
# create logreg model
lr = LogisticRegression()

# fit model to data
lr_model = lr.fit(df)

# get the training summary
lr_summary = lr_model.summary

# inspect summary
# the predictions object is a df
# we can extract label, feature, predictions and probs from this
lr_summary.predictions.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [17]:
# we want to check if the actual 'label' matches the predicted results
# at a glance this looks like a pretty good estimator so far
# (though these are predictions on the same data the model was trained on)
lr_summary.predictions.select('label', 'prediction').show(5)

+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
+-----+----------+
only showing top 5 rows



## 2.) Evaluating the Model
* The above code offers only a simple evaluation of our model
* We can extract basic info such as the probability and predicted binary class of each input
* However, we can use Spark **evaluators** to go into more detail in this area
* Evaluators compute summary metrics from a predictions DataFrame, e.g. area under the ROC curve, area under the precision-recall curve, accuracy and F1
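
Most of these evaluation measures come down to comparing labels with predictions. As a rough sketch of the arithmetic (plain Python on hypothetical pairs, not a Spark API), the counts behind a confusion matrix and a few derived metrics look like this:

```python
from collections import Counter


def confusion_counts(pairs):
    """Count (label, prediction) pairs for a binary problem."""
    counts = Counter(pairs)
    return {
        "tp": counts[(1, 1)],  # predicted 1, actually 1
        "fn": counts[(1, 0)],  # predicted 0, actually 1
        "fp": counts[(0, 1)],  # predicted 1, actually 0
        "tn": counts[(0, 0)],  # predicted 0, actually 0
    }


# hypothetical (label, prediction) pairs, not taken from the dataset above
pairs = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1), (0, 0)]
cm = confusion_counts(pairs)

accuracy = (cm["tp"] + cm["tn"]) / len(pairs)
precision = cm["tp"] / (cm["tp"] + cm["fp"])
recall = cm["tp"] / (cm["tp"] + cm["fn"])
print(cm, accuracy, precision, recall)
```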

In [18]:
# split train/test data
train, test = df.randomSplit([0.7, 0.3])

# create new model object
lr = LogisticRegression()

# train model on train data
lr_model = lr.fit(train)

# evaluate model on test data
# this returns a summary object; its .predictions attribute is a df
test_pred = lr_model.evaluate(test)

# show output df
test_pred.predictions.show(10)

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[122,123,124...|[17.3275120737099...|[0.99999997016286...|       0.0|
|  0.0|(692,[123,124,125...|[30.0713136181052...|[0.99999999999991...|       0.0|
|  0.0|(692,[124,125,126...|[30.8791077666983...|[0.99999999999996...|       0.0|
|  0.0|(692,[124,125,126...|[22.2580906352772...|[0.99999999978450...|       0.0|
|  0.0|(692,[124,125,126...|[22.2560261974221...|[0.99999999978406...|       0.0|
|  0.0|(692,[126,127,128...|[17.2672598669120...|[0.99999996830984...|       0.0|
|  0.0|(692,[126,127,128...|[30.1193826680919...|[0.99999999999991...|       0.0|
|  0.0|(692,[127,128,129...|[23.5627343744599...|[0.99999999994154...|       0.0|
|  0.0|(692,[129,130,131...|[16.1306561349988...|[0.99999990124820...|       0.0|
|  0.0|(692,[151

In [19]:
# load evaluator libs
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)

# create binary evaluator instance
binary_eval = BinaryClassificationEvaluator()

# evaluate predictions 
final_roc = binary_eval.evaluate(test_pred.predictions)

# show results
# by default this is the area under the ROC curve (AUC)
# 1.0 means every positive is ranked above every negative, which is unlikely in reality
# but this is clean sample data, hence the perfect score
final_roc

1.0
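
To make the AUC figure more concrete, here is a small plain-Python sketch that builds ROC points by sweeping a threshold over the scores and integrates them with the trapezoid rule. The labels and scores are made up, and this is an illustration of the definition rather than how Spark computes the metric internally.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) points obtained by sweeping a threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    # visit each distinct score as a threshold, from highest to lowest
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def area_under(points):
    """Trapezoidal area under (fpr, tpr) points ordered by increasing fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


labels = [0, 0, 1, 1]
perfect = [0.1, 0.2, 0.8, 0.9]  # both positives score above both negatives
mixed = [0.1, 0.8, 0.2, 0.9]    # one negative outranks one positive

auc_perfect = area_under(roc_points(labels, perfect))
auc_mixed = area_under(roc_points(labels, mixed))
print(auc_perfect, auc_mixed)
```

An AUC of 1.0 therefore says the scores separate the classes perfectly by rank; it does not by itself say which threshold was used or what the accuracy is.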

# Titanic Classification
## 1.) First Steps
* The titanic data contains information about passengers and whether or not they survived
* We will use this information to build a logistic regression classification model
* This model will predict whether or not a passenger survived based on their information

In [20]:
# read in data
df = spark.read.csv('Data/titanic.csv', inferSchema=True, header=True)

# check schema
df.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



In [21]:
# peek at data
df.show(3)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
only showing top 3 rows



## 2.) Prepare Features
* We need to perform categorical indexing and encoding so that we can handle our categorical variables
* We also need to assemble all of our features into a single vector column, the format Spark ML models expect
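
As a rough picture of what indexing and encoding do to a categorical column, the plain-Python sketch below mimics the two steps on a made-up list of embarkation codes. It assumes the behaviour of Spark's defaults as I understand them (most frequent category gets index 0, and the last category is dropped from the one-hot vector); the helper names are hypothetical and this is not Spark code.

```python
from collections import Counter


def fit_string_index(values):
    """Map each category to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: float(i) for i, v in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 vector; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec


# hypothetical sample of the 'Embarked' column
embarked = ["S", "C", "S", "Q", "S", "C"]
index = fit_string_index(embarked)
encoded = {v: one_hot(i, len(index)) for v, i in index.items()}
print(index)
print(encoded)
```

Dropping one category avoids a redundant column, since the last category is implied when all other positions are zero.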

In [22]:
# extract relevant cols
# drop ID and text cols
df_clean = df.select(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'])

# drop nulls
df_final = df_clean.na.drop()

# load vector and encoder libs
from pyspark.ml.feature import VectorAssembler, OneHotEncoder, StringIndexer

### INDEXERS and ENCODERS ###
# create gender indexer
# this maps each string category to a numeric index, e.g. A, B, C > 0, 1, 2
gender_indexer = StringIndexer(inputCol='Sex', outputCol='SexIndex')

# create gender encoder
gender_encoder = OneHotEncoder(inputCol='SexIndex', outputCol='SexVec')

# create embark indexer
embark_indexer = StringIndexer(inputCol='Embarked', outputCol='EmbarkedIndex')

# create embark encoder
embark_encoder = OneHotEncoder(inputCol='EmbarkedIndex', outputCol='EmbarkedVec')

### ASSEMBLER ###
# assemble all input cols (incl. above encoded categories)
assembler = VectorAssembler(inputCols=['Pclass', 'SexVec', 'EmbarkedVec', 'Age', 'SibSp', 'Parch', 'Fare'],
                            outputCol='features')

## 3.) Build Model Pipeline
* The steps above only define what we plan to do
* Nothing has been executed yet; we have simply laid out the stages to perform
* To run them, we combine the stages into a pipeline, which applies them in order
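
The define-then-run idea can be shown with a toy version of the fit/transform pattern. The classes below are hypothetical stand-ins written in plain Python, not Spark's `Pipeline`; they only show that constructing stages does no work, that `fit` learns parameters from the training rows, and that `transform` reuses them on new rows.

```python
class MeanCenter:
    """Toy stage: learns a column mean in fit(), subtracts it in transform()."""

    def __init__(self, col):
        self.col = col
        self.mean = None  # nothing is computed at construction time

    def fit(self, rows):
        self.mean = sum(r[self.col] for r in rows) / len(rows)
        return self

    def transform(self, rows):
        return [{**r, self.col: r[self.col] - self.mean} for r in rows]


class ToyPipeline:
    """Runs stages in order, feeding each stage's output to the next."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            stage.fit(rows)
            rows = stage.transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


center = MeanCenter("Age")
toy = ToyPipeline(stages=[center])   # defined, but not yet executed
mean_before_fit = center.mean

train_rows = [{"Age": 20.0}, {"Age": 40.0}]
test_rows = [{"Age": 35.0}]
toy.fit(train_rows)                  # parameters are learned from train only
centered_test = toy.transform(test_rows)
print(mean_before_fit, center.mean, centered_test)
```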

In [23]:
# load libs
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

# create logreg instance
titanic_logreg = LogisticRegression(featuresCol='features',
                                    labelCol='Survived')

# create pipeline
pipeline = Pipeline(stages=[gender_indexer, embark_indexer,
                            gender_encoder, embark_encoder,
                            assembler, titanic_logreg])

## 4.) Build and Evaluate Model
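* With the pipeline defined, we can now build and train our model
* Then we can evaluate the results of our model
* We will simply look at the AUC (area under the ROC curve) of our model, which measures how well it separates the two classes; note that this is not the same as % accuracy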

In [24]:
# split train/test data
train, test = df_final.randomSplit([0.7, 0.3])

# fit model to training data
fit_model = pipeline.fit(train)

# make predictions on test data
# transform() appends 'rawPrediction', 'probability' and 'prediction' columns
# we pass the 'prediction' column name as a string in the later eval step
results = fit_model.transform(test)

# load evaluation libs
from pyspark.ml.evaluation import BinaryClassificationEvaluator

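# build evaluator object
# note: 'prediction' holds hard 0/1 labels, so the AUC is computed from those
# the evaluator's default input column, 'rawPrediction', would score the model's ranking instead
my_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='Survived')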

# show results
results.select('Survived', 'prediction').show(5)

+--------+----------+
|Survived|prediction|
+--------+----------+
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       0.0|
|       0|       0.0|
+--------+----------+
only showing top 5 rows



In [25]:
# evaluate model
AUC = my_eval.evaluate(results)

# show result
AUC

0.7472972972972973
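
The evaluator above was given the 'prediction' column, which holds hard 0/1 labels rather than continuous scores. The plain-Python sketch below, on made-up numbers, suggests why that choice matters: thresholding discards the ranking among rows on the same side of the cut, so the AUC from hard labels can differ from the AUC of the underlying scores. The pairwise formula here (fraction of positive/negative pairs ranked correctly, ties counted as half) is a standard way to define AUC, not a claim about Spark's internals.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# hypothetical labels and model scores, not taken from the Titanic results
labels = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.6, 0.45, 0.7, 0.9]

# hard predictions obtained by thresholding the scores at 0.5
hard = [1.0 if s >= 0.5 else 0.0 for s in scores]

auc_scores = pairwise_auc(labels, scores)
auc_hard = pairwise_auc(labels, hard)
print(round(auc_scores, 3), round(auc_hard, 3))
```

On the Titanic pipeline, one way to check this would be to build a second evaluator that keeps the default 'rawPrediction' input column and compare the two results.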