### Logistic Regression (Classification method)

Not all labels are continuous. When the target is a category rather than a number, the prediction task is called classification.

Logistic Regression is one of the most basic ways to perform classification.
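To make the idea concrete, here is a minimal pure-Python sketch of how a logistic regression model turns a weighted sum of features into a class probability. The weights, bias, and 0.5 threshold are made-up values for illustration, not anything fitted in this notebook.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1), avoiding overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(weights, bias, features, threshold=0.5):
    """Return (probability of class 1, predicted label) for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, (1 if p >= threshold else 0)

# Hypothetical weights and feature values, chosen only for illustration
p, label = predict(weights=[2.0, -1.0], bias=0.5, features=[1.0, 0.5])
print(round(p, 4), label)
```

The linear score here is 0.5 + 2.0 - 0.5 = 2.0, which the sigmoid maps to a probability of about 0.88, so the example is assigned to class 1.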

Examples of classification problems:
- Spam vs. normal emails
- Loan default (Y/N)
- Disease diagnosis

(Note: the examples above are all binary classification problems.)

We can use a confusion matrix to evaluate classification models.
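As a rough illustration of what a confusion matrix holds, the sketch below counts true/false positives and negatives for a binary problem and derives accuracy, precision, and recall from them. The label and prediction lists are invented for the example.

```python
from collections import Counter

def confusion_matrix(labels, predictions):
    """Count the four outcomes of a binary classifier (1 = positive class)."""
    counts = Counter(zip(labels, predictions))
    return {
        "TP": counts[(1, 1)],  # actual 1, predicted 1
        "FP": counts[(0, 1)],  # actual 0, predicted 1
        "FN": counts[(1, 0)],  # actual 1, predicted 0
        "TN": counts[(0, 0)],  # actual 0, predicted 0
    }

# Made-up labels and predictions, for illustration only
labels      = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(labels, predictions)
accuracy = (cm["TP"] + cm["TN"]) / len(labels)
precision = cm["TP"] / (cm["TP"] + cm["FP"])
recall = cm["TP"] / (cm["TP"] + cm["FN"])
print(cm, accuracy, precision, recall)
```

With these values there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so accuracy, precision, and recall all come out to 0.75.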

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("LogisticRegression").getOrCreate()

In [3]:
# Logistic Regression Documentation Example

In [4]:
from pyspark.ml.classification import LogisticRegression

In [7]:
egdata = spark.read.format('libsvm').load("datasets/sample_libsvm_data.txt")

In [8]:
egdata.show(2)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
+-----+--------------------+
only showing top 2 rows



In [9]:
lor_reg_model = LogisticRegression()

In [10]:
fitted_lr = lor_reg_model.fit(egdata)

In [11]:
log_summary = fitted_lr.summary

In [12]:
log_summary.predictions.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [13]:
train,test = egdata.randomSplit([0.7,0.3])
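The split sizes produced by `randomSplit` are only approximately 70/30, because each row is assigned to a split at random rather than by exact count. The pure-Python sketch below imitates that behaviour under simplifying assumptions; `random_split` is a hypothetical helper, not Spark's implementation.

```python
import random
from itertools import accumulate

def random_split(rows, weights, seed=None):
    """Assign each row to one split by drawing a uniform random number."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split's share, e.g. [0.7, 1.0]
    bounds = list(accumulate(w / total for w in weights))
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, upper in enumerate(bounds):
            if x < upper:
                splits[i].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound
            splits[-1].append(row)
    return splits

train_rows, test_rows = random_split(range(100), [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

Without a fixed seed the split, and therefore every metric computed on it, changes from run to run. In Spark the same effect is obtained by passing `seed`, for example `egdata.randomSplit([0.7, 0.3], seed=42)`.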

In [14]:
lr_estimator = LogisticRegression()  # unfitted estimator
final = lr_estimator.fit(train)      # fitted model, trained on the training split only

In [15]:
predictions_n_labels = final.evaluate(test)

In [16]:
predictions_n_labels.predictions.show(3)

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[121,122,123...|[24.3762709822539...|[0.99999999997408...|       0.0|
|  0.0|(692,[122,123,124...|[19.4421793105573...|[0.99999999639945...|       0.0|
|  0.0|(692,[124,125,126...|[29.9410594829723...|[0.99999999999990...|       0.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 3 rows



In [17]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator,MulticlassClassificationEvaluator

In [19]:
my_eval = BinaryClassificationEvaluator()

In [20]:
roc_results = my_eval.evaluate(predictions_n_labels.predictions)

In [21]:
roc_results

1.0
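`BinaryClassificationEvaluator` reports the area under the ROC curve by default. One way to read that number: it is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The sketch below computes it by brute force in pure Python; the `auc` helper and the scores are illustrative assumptions, not Spark's algorithm.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up scores: every positive outranks every negative
perfect = auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
# Made-up scores: one negative (0.4) outranks one positive (0.35)
imperfect = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(perfect, imperfect)
```

A result of 1.0, as above, means every positive example in the test split was scored above every negative one.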

### Titanic Dataset Exercise

In [24]:
titanic = spark.read.csv("datasets/titanic.csv",inferSchema=True,header=True)  # keep the DataFrame; createOrReplaceTempView returns None
titanic.createOrReplaceTempView("titanic")

In [25]:
df = spark.sql("SELECT * FROM titanic")

In [26]:
df.head()

Row(PassengerId=1, Survived=0, Pclass=3, Name='Braund, Mr. Owen Harris', Sex='male', Age=22.0, SibSp=1, Parch=0, Ticket='A/5 21171', Fare=7.25, Cabin=None, Embarked='S')

In [27]:
df.show(3)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
only showing top 3 rows



In [28]:
df.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



In [29]:
df.columns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [30]:
my_cols = df.select(['Survived',
 'Pclass',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked'])

In [31]:
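# Drop rows with a null in any selected column (this includes Cabin, which is null in two of the three rows shown above)
simple_drop_data = my_cols.na.drop()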

In [32]:
from pyspark.ml.feature import (VectorAssembler,VectorIndexer,OneHotEncoder,StringIndexer)

In [33]:
gender_indexer = StringIndexer(inputCol='Sex',outputCol='SexIndex')
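# StringIndexer maps each category to a numeric index: A, B, C -> 0, 1, 2
# OneHotEncoder turns each index into an indicator vector, e.g. A -> [1, 0, 0]
# (Spark drops the last category by default, so three categories give a length-2 vector)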
gender_encoder = OneHotEncoder(inputCol='SexIndex',outputCol='SexVec')

In [34]:
embark_indexer = StringIndexer(inputCol='Embarked',outputCol='EmbarkIndex')
embark_encoder = OneHotEncoder(inputCol='EmbarkIndex',outputCol='EmbarkVec')
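The indexing and encoding steps can be sketched in pure Python as below. This is a simplified illustration, not Spark's implementation: it assumes the indexer's default ordering (most frequent category first), breaks ties alphabetically as an assumption of this sketch, and uses dense lists where Spark produces sparse vectors. `fit_string_index` and `one_hot` are hypothetical helper names.

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index, most frequent category first."""
    counts = Counter(values)
    # Ties broken alphabetically (an assumption made for this sketch)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: float(i) for i, v in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Encode an index as an indicator vector, optionally dropping the last slot."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec

# Made-up Embarked values, for illustration only
embarked = ["S", "C", "S", "Q", "S", "C"]
mapping = fit_string_index(embarked)
vectors = [one_hot(mapping[v], len(mapping)) for v in embarked]
print(mapping)
print(vectors)
```

Dropping the last category means the least frequent value (`Q` here) is represented by the all-zeros vector, which avoids a redundant column in the feature vector.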

In [35]:
assembler = VectorAssembler(inputCols=['Pclass','SexVec','EmbarkVec','Age','SibSp','Parch','Fare'],
                            outputCol='features')

### Pipeline

In [36]:
from pyspark.ml import Pipeline

In [37]:
log_reg_titanic = LogisticRegression(featuresCol='features',labelCol='Survived')

In [38]:
pipeline = Pipeline(stages=[gender_indexer,embark_indexer,
                            gender_encoder,embark_encoder,
                            assembler,log_reg_titanic])

In [39]:
train, test = simple_drop_data.randomSplit([0.7,0.3])

In [40]:
fit_model = pipeline.fit(train)

In [41]:
results = fit_model.transform(test)

In [42]:
# from pyspark.ml.evaluation import BinaryClassificationEvaluator,MulticlassClassificationEvaluator

In [43]:
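# Note: 'prediction' holds hard 0/1 labels; pass rawPredictionCol='rawPrediction' to score the model's continuous output instead
my_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='Survived')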

In [46]:
results.select('Survived','prediction').show(5)

+--------+----------+
|Survived|prediction|
+--------+----------+
|       0|       0.0|
|       0|       1.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
+--------+----------+
only showing top 5 rows



In [49]:
results.printSchema()

root
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)
 |-- SexIndex: double (nullable = false)
 |-- EmbarkIndex: double (nullable = false)
 |-- SexVec: vector (nullable = true)
 |-- EmbarkVec: vector (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [48]:
AUC = my_eval.evaluate(results)
AUC

0.7426470588235294