# PySpark Logistic Regression

<a href='https://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression'>Link</a> to documentation

## Content
1. [Example from the documentation](#doc)
2. [Introduction of evaluators](#eval)
3. [Predicting survival of Titanic passengers](#titanic)
4. [Predicting churn: final project](#churn)

<a id='doc'></a>
## 1. Example from the documentation

In [69]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.functions import isnan, when, count, col
from pyspark.ml.feature import StringIndexer, VectorIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [None]:
spark = SparkSession.builder.appName('logregdoc').getOrCreate()

In [5]:
# load data and train model
training = spark.read.format('libsvm').load('data/sample_libsvm_data.txt')
lr = LogisticRegression()
lrModel = lr.fit(training)
summary = lrModel.summary

In [7]:
summary.predictions.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[127,128,129...|[19.8534775947478...|[0.99999999761359...|       0.0|
|  1.0|(692,[158,159,160...|[-20.377398194908...|[1.41321555111056...|       1.0|
|  1.0|(692,[124,125,126...|[-27.401459284891...|[1.25804865126979...|       1.0|
|  1.0|(692,[152,153,154...|[-18.862741612668...|[6.42710509170303...|       1.0|
|  1.0|(692,[151,152,153...|[-20.483011833009...|[1.27157209200604...|       1.0|
|  0.0|(692,[129,130,131...|[19.8506078990277...|[0.99999999760673...|       0.0|
|  1.0|(692,[158,159,160...|[-20.337256674833...|[1.47109814695581...|       1.0|
|  1.0|(692,[99,100,101,...|[-19.595579753418...|[3.08850168102631...|       1.0|
|  0.0|(692,[154,155,156...|[19.2708803215613...|[0.99999999572670...|       0.0|
|  0.0|(692,[127

In [6]:
# evaluated on the training data itself, so the perfect score says nothing about generalization
summary.areaUnderROC

1.0

In [8]:
# evaluated on the training data itself, so the perfect score says nothing about generalization
summary.accuracy

1.0

In [10]:
spark.stop()

<a id='eval'></a>
## 2. Introduction of evaluators
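Evaluators score a model's predictions against the true labels and are an important part of the pipeline. Two of them are relevant for classification: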


- `BinaryClassificationEvaluator` -> (<a href='https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.BinaryClassificationEvaluator'>Link</a>)
- `MulticlassClassificationEvaluator` -> (<a href='https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.MulticlassClassificationEvaluator'>Link</a>)

In [None]:
# imports
# from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
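
To make the `areaUnderROC` metric more concrete, here is a minimal plain-Python sketch (not Spark's implementation) of one common way to read it: the share of (positive, negative) pairs in which the positive example receives the higher score. The function name `auc_pairwise` and the sample values are hypothetical; Spark derives the value from the ROC curve, so its result may differ slightly on large data.

```python
def auc_pairwise(labels, scores):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs in which the positive example gets the higher score.
    Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# hypothetical labels and predicted probabilities of class 1
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_pairwise(labels, scores)  # 3 of the 4 pairs are ranked correctly
```

A value of 1.0 means every positive is ranked above every negative, 0.5 corresponds to random ranking.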

<a id='titanic'></a>
## 3. Predicting survival of Titanic passengers
Goal: build a model that predicts whether a passenger survived or died

In [17]:
# create a new Spark session
spark = SparkSession.builder.appName('titanic').getOrCreate()

In [48]:
# load the data
data = spark.read.csv('data/titanic.csv', inferSchema=True, header=True)

In [4]:
# look at the columns and data types of the variables
data.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



In [19]:
# look at an individual feature
data.select('Cabin').show()

+-----+
|Cabin|
+-----+
| null|
|  C85|
| null|
| C123|
| null|
| null|
|  E46|
| null|
| null|
| null|
|   G6|
| C103|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
+-----+
only showing top 20 rows



#### check for missing values
- In PySpark, NaN and null are not the same: NaN is a floating-point value, null marks a missing entry
- to make the wide one-row result more readable, we can call `toPandas().T`
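
The distinction can be illustrated with a plain-Python analogy, where `float("nan")` stands in for Spark's NaN and `None` for null. The helper names `is_nan` and `is_null` and the sample values are hypothetical; the point is only that one check does not catch the other kind of value, which is why both checks are run.

```python
import math

# hypothetical column values: a real number, a NaN, and a missing entry
ages = [22.0, float("nan"), None]


def is_nan(value):
    """True only for a float NaN; None is not a number at all."""
    return isinstance(value, float) and math.isnan(value)


def is_null(value):
    """True only for a missing value."""
    return value is None


nan_count = sum(is_nan(v) for v in ages)
null_count = sum(is_null(v) for v in ages)
```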

In [39]:
# check NaNs
# code from https://stackoverflow.com/questions/44627386/how-to-find-count-of-null-and-nan-values-for-each-column-in-a-pyspark-dataframe?rq=1
data.select([count(when(isnan(c), c)).alias(c) for c in data.columns]).show()

+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    0|       0|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+



In [40]:
# check for nulls
# code from https://towardsdatascience.com/data-prep-with-spark-dataframes-3629478a1041
data.select([count(when(col(c).isNull(), c)).alias(c) for c in data.columns]).toPandas().T

Unnamed: 0,0
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0


- `PassengerId`, `Name` and `Ticket` are not informative -> can be dropped
- `Sex` needs to be transformed to a numeric value
- `Age`: impute the missing values
- remove `Cabin`, too many missing values
- impute `Embarked` with the mode
- use `OneHotEncoder` on the categorical features

In [49]:
# drop cols
data = data.drop('PassengerId','Name', 'Cabin', 'Ticket')

In [50]:
# impute missing value
# Embarked
data.groupBy('Embarked').count().show()

+--------+-----+
|Embarked|count|
+--------+-----+
|       Q|   77|
|    null|    2|
|       C|  168|
|       S|  644|
+--------+-----+



In [51]:
# impute 'S' for missing values
data = data.fillna({'Embarked': 'S'})
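
The mode imputation above can be sketched in plain Python as follows. The counts are copied from the `groupBy` output; the list-based representation is only an illustration and not how Spark stores the column.

```python
from collections import Counter

# port counts copied from the groupBy output above; None marks the missing rows
embarked = ["S"] * 644 + ["C"] * 168 + ["Q"] * 77 + [None] * 2

# most frequent non-missing value
counts = Counter(v for v in embarked if v is not None)
mode, mode_count = counts.most_common(1)[0]

# replace every missing entry with the mode
imputed = [mode if v is None else v for v in embarked]
```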

In [54]:
# impute missing values
# Age
# find the mean age per Pclass
data.groupBy('Pclass').mean('Age').show()

+------+------------------+
|Pclass|          avg(Age)|
+------+------------------+
|     1|38.233440860215055|
|     3| 25.14061971830986|
|     2| 29.87763005780347|
+------+------------------+



In [55]:
# impute the age for each pclass
# make a function 
# code inspiration: https://www.kaggle.com/roshan77/pyspark-classification-model
def age_imputer(df, pclass, age):
    '''
    finds the null values in the age column with the provided pclass and imputes the provided age, if
    Age is not null it will keep the same age.
    
    INPUTS
    df: PySpark DataFrame
    pclass: int of the Pclass to filter
    age: mean age of the individuals in the associated Pclass
    '''
    return df.withColumn('Age', when((df['Age'].isNull()) & (df['Pclass']==pclass), 
                                     age).otherwise(df['Age']))

# rounded means from the groupBy above: Pclass 1 -> 38, Pclass 2 -> 30, Pclass 3 -> 25
data = age_imputer(data, 1, 38.0)
data = age_imputer(data, 2, 30.0)
data = age_imputer(data, 3, 25.0)
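
As an alternative to typing the fill values by hand, the per-class means can be derived from the data and looked up by class, which rules out pairing a mean with the wrong `Pclass`. The following is a plain-Python sketch of that idea on a hypothetical miniature of the `Pclass`/`Age` columns; it is not Spark code.

```python
from collections import defaultdict

# hypothetical miniature of the Pclass/Age columns
rows = [
    {"Pclass": 1, "Age": 40.0},
    {"Pclass": 1, "Age": 36.0},
    {"Pclass": 1, "Age": None},
    {"Pclass": 3, "Age": 20.0},
    {"Pclass": 3, "Age": 30.0},
    {"Pclass": 3, "Age": None},
]

# mean age per class, computed from the non-missing ages only
sums = defaultdict(float)
counts = defaultdict(int)
for row in rows:
    if row["Age"] is not None:
        sums[row["Pclass"]] += row["Age"]
        counts[row["Pclass"]] += 1
class_means = {pclass: sums[pclass] / counts[pclass] for pclass in sums}

# fill each missing age with the mean of the row's own class
imputed = [
    {**row, "Age": class_means[row["Pclass"]] if row["Age"] is None else row["Age"]}
    for row in rows
]
```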

In [33]:
data.printSchema()

root
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Embarked: string (nullable = true)



In [56]:
# final check for null values
data.select([count(when(col(c).isNull(), c)).alias(c) for c in data.columns]).toPandas().T

Unnamed: 0,0
Survived,0
Pclass,0
Sex,0
Age,0
SibSp,0
Parch,0
Fare,0
Embarked,0


In [57]:
# transform gender into numeric and one hot encode 
gender_indexer = StringIndexer(inputCol='Sex',outputCol='SexIndex')
gender_encoder = OneHotEncoder(inputCol='SexIndex',outputCol='SexVec')

In [58]:
# transform embarked into numeric and one hot encode
embark_indexer = StringIndexer(inputCol='Embarked',outputCol='EmbarkIndex')
embark_encoder = OneHotEncoder(inputCol='EmbarkIndex',outputCol='EmbarkVec')
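
What the indexer/encoder pair does can be sketched in plain Python. The sketch assumes the default settings of both stages: `StringIndexer` gives index 0 to the most frequent label, and `OneHotEncoder` drops the last category, so that category becomes the all-zeros vector. The helper names `string_index` and `one_hot` and the sample data are hypothetical.

```python
from collections import Counter


def string_index(values):
    """Map each label to an index, most frequent label first."""
    ordered = [label for label, _ in Counter(values).most_common()]
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the last category
    becomes the all-zeros vector."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


embarked = ["S", "S", "S", "C", "C", "Q"]  # hypothetical sample
mapping = string_index(embarked)
encoded = {label: one_hot(i, len(mapping)) for label, i in mapping.items()}
```

With three ports the encoded vector has two entries; for `Sex`, with two categories, it has a single entry.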

In [59]:
# assemble all feature columns into a single feature vector
assembler = VectorAssembler(inputCols = ['Pclass', 'SexVec', 'Age', 'SibSp', 'Parch', 'Fare',
                                         'EmbarkVec'],
                            outputCol = 'features')

#### Pipeline
<a href='https://spark.apache.org/docs/latest/ml-pipeline.html'>Link</a> to docs

The data preprocessing and modeling steps have to run in a fixed order. For this, Spark offers a pipeline feature with the following advantages:
- a pipeline is less prone to mistakes, because the same steps are applied automatically to training and test data
- pipelines are commonly used in production environments
- the transformations are evaluated lazily, which matters for big data
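
The idea behind a pipeline can be sketched with a few hypothetical plain-Python classes (`MeanCenter`, `Scale` and `MiniPipeline` are illustrations, not Spark's API): each stage learns its parameters in `fit` and applies them in `transform`, and the pipeline chains the stages in order so that the test data goes through exactly the steps learned on the training data.

```python
class MeanCenter:
    """Stage that learns the mean in fit and subtracts it in transform."""

    def fit(self, values):
        self.mean = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean for v in values]


class Scale:
    """Stage that learns the maximum absolute value and divides by it."""

    def fit(self, values):
        self.scale = max(abs(v) for v in values) or 1.0
        return self

    def transform(self, values):
        return [v / self.scale for v in values]


class MiniPipeline:
    """Fits stages in order, feeding each stage the output of the previous one."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        for stage in self.stages:
            values = stage.fit(values).transform(values)
        return self

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


train = [1.0, 2.0, 3.0]
test = [4.0]
model = MiniPipeline([MeanCenter(), Scale()]).fit(train)
train_out = model.transform(train)
test_out = model.transform(test)  # uses the mean and scale learned on train
```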

In [60]:
log_reg = LogisticRegression(featuresCol='features',labelCol='Survived')

In [61]:
# create pipeline
pipeline = Pipeline(stages=[gender_indexer,
                            embark_indexer,
                            gender_encoder,
                            embark_encoder,
                            assembler,
                            log_reg])

In [62]:
train, test = data.randomSplit([0.7, 0.3])
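
The split above is not seeded, so each run produces different train and test sets; `randomSplit` accepts an optional `seed` argument to make it repeatable. The plain-Python sketch below illustrates the effect of a seed with a hypothetical `random_split` helper. It assumes each row is assigned independently, so the 70/30 proportions are only approximate; it is not Spark's implementation.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to a split with probability
    proportional to its weight; a fixed seed makes the result repeatable."""
    rng = random.Random(seed)
    total = sum(weights)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for i, w in enumerate(weights):
            cumulative += w
            if draw < cumulative:
                splits[i].append(row)
                break
        else:
            splits[-1].append(row)  # guard against floating-point edge cases
    return splits


rows = list(range(1000))
train_a, test_a = random_split(rows, [0.7, 0.3], seed=42)
train_b, test_b = random_split(rows, [0.7, 0.3], seed=42)  # same seed, same split
```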

In [63]:
fit_model = pipeline.fit(train)

In [64]:
results = fit_model.transform(test)

In [70]:
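# note: 'prediction' holds hard 0/1 labels; passing it as rawPredictionCol yields an AUC based on a single threshold
eval_ = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='Survived')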

In [71]:
# compare Survived vs. prediction
results.select('Survived', 'prediction').show()

+--------+----------+
|Survived|prediction|
+--------+----------+
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       1.0|
|       0|       1.0|
|       0|       0.0|
|       0|       1.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
+--------+----------+
only showing top 20 rows



In [72]:
# area under the ROC curve (the evaluator's default metric)
eval_.evaluate(results)

0.766985368680284
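
Because the evaluator was given the hard 0/1 `prediction` column, its ROC curve has only one interior point, and the area under it reduces to the mean of the true-positive rate and the true-negative rate. The plain-Python sketch below shows that calculation on hypothetical labels and predictions; `balanced_accuracy` is an illustrative helper name, not a Spark function. Using the `rawPrediction` column instead would let the evaluator use the full range of scores.

```python
def balanced_accuracy(labels, predictions):
    """Mean of the true-positive rate and the true-negative rate."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return (tp / positives + tn / negatives) / 2


# hypothetical labels and hard 0/1 predictions
labels = [0, 0, 0, 0, 1, 1, 1, 1]
predictions = [0, 0, 0, 1, 1, 1, 0, 0]
score = balanced_accuracy(labels, predictions)  # TPR = 0.5, TNR = 0.75
```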

<a id='churn'></a>
## 4. Predicting churn: final project
Goal: create a machine learning model which can predict customer churn

Data
- Name: name of the last contact person
- Age: age of the customer
- Total_Purchase: count of inserts that the customer bought
- Account_Manager: binary, 0 = no account manager, 1 = account manager
- Years: number of years the customer has been a customer
- Num_sites: number of sites that use the service
- Onboard_date: onboarding date of the last contact
- Location: location of the main office
- Company: name of the company

## Resources
- Udemy course on PySpark
- https://stackoverflow.com/questions/44627386/how-to-find-count-of-null-and-nan-values-for-each-column-in-a-pyspark-dataframe?rq=1
- https://towardsdatascience.com/data-prep-with-spark-dataframes-3629478a1041
- https://sparkbyexamples.com/pyspark/pyspark-groupby-explained-with-example/
- https://www.kaggle.com/roshan77/pyspark-classification-model