<a href="https://colab.research.google.com/github/Ricardo-Jaramillo/PySpark/blob/main/07_Logistic_Regression_Titanic_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logistic Regression example

Now let's work through a simple exercise with the Titanic CSV. We're trying to predict whether a passenger survived, based on the variables we have.

In [1]:
# Install pyspark
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425344 sha256=0edca87616d224ffaed46a33c74638ed2c1815217cc54907c1cd7a06ba244da0
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


In [2]:
# Download the file
!wget https://raw.githubusercontent.com/Ricardo-Jaramillo/PySpark/main/datasets/titanic.csv

--2023-10-03 18:26:37--  https://raw.githubusercontent.com/Ricardo-Jaramillo/PySpark/main/datasets/titanic.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60302 (59K) [text/plain]
Saving to: ‘titanic.csv’


2023-10-03 18:26:37 (28.1 MB/s) - ‘titanic.csv’ saved [60302/60302]



In [3]:
# Import the necessary libraries
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

In [4]:
# Create a session
spark = SparkSession.builder.appName('log_reg_titanic').getOrCreate()

In [6]:
# Read in the file
df = spark.read.csv('titanic.csv', header=True, inferSchema=True)

In [8]:
# Print the schema
df.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



In [9]:
# Show the data
df.show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| NULL|       S|
|          6|       0|     3|    Moran, Mr. James|  male|NULL|    0|    0|      

In [11]:
# Print out the column names
df.columns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [14]:
# Select useful columns
my_cols = df.select(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked'])

In [16]:
# Drop every row with a null in any selected column (this includes rows with a missing Cabin or Age)
my_final_data = my_cols.na.drop()
my_final_data.show()

+--------+------+------+----+-----+-----+--------+-----------+--------+
|Survived|Pclass|   Sex| Age|SibSp|Parch|    Fare|      Cabin|Embarked|
+--------+------+------+----+-----+-----+--------+-----------+--------+
|       1|     1|female|38.0|    1|    0| 71.2833|        C85|       C|
|       1|     1|female|35.0|    1|    0|    53.1|       C123|       S|
|       0|     1|  male|54.0|    0|    0| 51.8625|        E46|       S|
|       1|     3|female| 4.0|    1|    1|    16.7|         G6|       S|
|       1|     1|female|58.0|    0|    0|   26.55|       C103|       S|
|       1|     2|  male|34.0|    0|    0|    13.0|        D56|       S|
|       1|     1|  male|28.0|    0|    0|    35.5|         A6|       S|
|       0|     1|  male|19.0|    3|    2|   263.0|C23 C25 C27|       S|
|       1|     1|female|49.0|    1|    0| 76.7292|        D33|       C|
|       0|     1|  male|65.0|    0|    1| 61.9792|        B30|       C|
|       0|     1|  male|45.0|    1|    0|  83.475|        C83|  

## Encode the categorical features

In [17]:
# Import functions
from pyspark.ml.feature import VectorAssembler, VectorIndexer, OneHotEncoder, StringIndexer

### Create indexed data (convert strings into numbers)

Let's do it for the Sex and Embarked columns.

In [22]:
# Create a SEX string indexer
gender_indexer = StringIndexer(inputCol='Sex', outputCol='SexIndex')

# Create a SEX oneHotEncoder
gender_encoder = OneHotEncoder(inputCol='SexIndex', outputCol='SexVec')

In [42]:
# Create an Embarked string indexer
embark_indexer = StringIndexer(inputCol='Embarked', outputCol='EmbarkedIndex')

# Create an Embarked oneHotEncoder
embark_encoder = OneHotEncoder(inputCol='EmbarkedIndex', outputCol='EmbarkedVec')
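To make the two steps concrete, here is a minimal plain-Python sketch of what the indexer and encoder appear to do with their default settings: the most frequent string gets index 0.0, and the one-hot vector drops the last category so that it is encoded as all zeros. This is consistent with the results table further down, where `S` becomes `0.0` and then `(2,[0],[1.0])`. The helper names and the toy column are hypothetical, and this is not Spark's implementation.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def one_hot_drop_last(index, num_categories):
    """Encode an index as a vector of length num_categories - 1; the last category is all zeros."""
    vec = [0.0] * (num_categories - 1)
    if int(index) < num_categories - 1:
        vec[int(index)] = 1.0
    return vec

# Hypothetical toy column, not the real Titanic counts
embarked = ["S", "S", "S", "C", "C", "Q"]
mapping = fit_string_index(embarked)
encoded = [one_hot_drop_last(mapping[v], len(mapping)) for v in embarked]

print(mapping)
print(encoded[0], encoded[3], encoded[5])
```

Dropping the last category keeps the encoded columns from being perfectly collinear, which matters for a linear model such as logistic regression.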

### Create an assembler
Combine all predictor columns into a single vector column, 'features', which the model reads alongside the 'Survived' label column.

In [43]:
# Create the assembler object
assembler = VectorAssembler(inputCols=['Pclass', 'SexVec', 'EmbarkedVec', 'Age', 'SibSp', 'Parch', 'Fare'],
                            outputCol='features')
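As a rough illustration of what the assembler produces, the sketch below concatenates scalar and vector-valued columns into one flat list, in the order given by `inputCols`. The `assemble` helper is hypothetical plain Python, not the Spark API, and the input row is the first row of the results table further down with its sparse vectors written out densely.

```python
def assemble(row, input_cols):
    """Concatenate scalar and list-valued columns into one flat feature list, in order."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)
        else:
            features.append(float(value))
    return features

# First row of the results table, with SexVec and EmbarkedVec expanded to dense lists
row = {"Pclass": 1, "SexVec": [1.0], "EmbarkedVec": [0.0, 1.0],
       "Age": 18.0, "SibSp": 1, "Parch": 0, "Fare": 108.9}
cols = ["Pclass", "SexVec", "EmbarkedVec", "Age", "SibSp", "Parch", "Fare"]

features = assemble(row, cols)
print(features)
```

The first four values match the `[1.0,1.0,0.0,1.0,...` prefix that the `features` column shows for that row.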

In [44]:
# Import the functions
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

In [45]:
# Create the model
log_reg_titanic = LogisticRegression(featuresCol='features', labelCol='Survived')

### Pipeline

In [46]:
# Create the pipeline containing every stage the data passes through before reaching the logistic regression
pipeline = Pipeline(stages=[gender_indexer, embark_indexer, gender_encoder, embark_encoder, assembler, log_reg_titanic])
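The idea behind a pipeline can be sketched in plain Python: stages run in order, each estimator is fitted on the output of the previous stage, and the fitted stages are later replayed on new data. The classes below (`MeanCenterer`, `Shifter`) and both functions are hypothetical toy stand-ins, not Spark classes.

```python
class MeanCenterer:
    """Toy estimator: learns the mean during fit and returns a transformer."""
    def fit(self, data):
        mean = sum(data) / len(data)
        return Shifter(-mean)

class Shifter:
    """Toy transformer: adds a fixed offset to every value."""
    def __init__(self, offset):
        self.offset = offset

    def transform(self, data):
        return [x + self.offset for x in data]

def fit_pipeline(stages, data):
    """Fit estimators in order, feeding each stage the output of the previous one."""
    fitted = []
    for stage in stages:
        model = stage.fit(data) if hasattr(stage, "fit") else stage
        data = model.transform(data)
        fitted.append(model)
    return fitted

def transform_pipeline(fitted, data):
    """Replay the fitted stages, in order, on new data."""
    for model in fitted:
        data = model.transform(data)
    return data

fitted = fit_pipeline([Shifter(10.0), MeanCenterer()], [1.0, 2.0, 3.0])
print(transform_pipeline(fitted, [4.0]))
```

Because everything learned during fitting comes from the training data only, replaying the fitted stages on held-out rows does not leak information from them.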

### Split into train and test data

In [47]:
# Split into train and test data
train_data, test_data = my_final_data.randomSplit([0.7, 0.3])
train_data.show()

+--------+------+------+----+-----+-----+--------+-----------+--------+
|Survived|Pclass|   Sex| Age|SibSp|Parch|    Fare|      Cabin|Embarked|
+--------+------+------+----+-----+-----+--------+-----------+--------+
|       0|     1|female| 2.0|    1|    2|  151.55|    C22 C26|       S|
|       0|     1|female|25.0|    1|    2|  151.55|    C22 C26|       S|
|       0|     1|female|50.0|    0|    0| 28.7125|        C49|       C|
|       0|     1|  male|19.0|    1|    0|    53.1|        D30|       S|
|       0|     1|  male|19.0|    3|    2|   263.0|C23 C25 C27|       S|
|       0|     1|  male|21.0|    0|    1| 77.2875|        D26|       S|
|       0|     1|  male|24.0|    0|    0|    79.2|        B86|       C|
|       0|     1|  male|24.0|    0|    1|247.5208|    B58 B60|       C|
|       0|     1|  male|27.0|    0|    2|   211.5|        C82|       C|
|       0|     1|  male|29.0|    0|    0|    30.0|         D6|       S|
|       0|     1|  male|29.0|    1|    0|    66.6|         C2|  
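Because the split is random, the rows in each set change from run to run unless a seed is fixed (PySpark's `randomSplit` accepts an optional `seed` argument). The plain-Python sketch below illustrates the general idea of a weighted random split, where each row is assigned independently and the 70/30 proportions are therefore only approximate. The `random_split` helper is hypothetical and is not Spark's implementation.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split by an independent weighted draw, so sizes are approximate."""
    rng = random.Random(seed)
    total = sum(weights)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for i, w in enumerate(weights):
            cumulative += w
            if draw < cumulative:
                splits[i].append(row)
                break
        else:
            splits[-1].append(row)  # guard against floating-point round-off
    return splits

rows = list(range(1000))
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Calling the function again with the same seed returns the same split, which is what makes results comparable across runs.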

In [48]:
# Fit the model with our predefined Pipeline
fit_model = pipeline.fit(train_data)

## Predict and Evaluate

In [49]:
# Generate predictions for the test data
results = fit_model.transform(test_data)

In [50]:
# Import our function to evaluate the model
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [51]:
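# Create our evaluator object
# Note: rawPredictionCol='prediction' scores the hard 0/1 predictions, so the AUC reflects a
# single threshold; pass 'rawPrediction' to score the model's raw outputs instead
my_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='Survived')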

In [53]:
# Print the full results
results.show()

+--------+------+------+----+-----+-----+--------+-----------+--------+--------+-------------+-------------+-------------+--------------------+--------------------+--------------------+----------+
|Survived|Pclass|   Sex| Age|SibSp|Parch|    Fare|      Cabin|Embarked|SexIndex|EmbarkedIndex|       SexVec|  EmbarkedVec|            features|       rawPrediction|         probability|prediction|
+--------+------+------+----+-----+-----+--------+-----------+--------+--------+-------------+-------------+-------------+--------------------+--------------------+--------------------+----------+
|       0|     1|  male|18.0|    1|    0|   108.9|        C65|       C|     0.0|          1.0|(1,[0],[1.0])|(2,[1],[1.0])|[1.0,1.0,0.0,1.0,...|[-1.2570932630573...|[0.22147467660908...|       1.0|
|       0|     1|  male|31.0|    0|    0| 50.4958|        A24|       S|     0.0|          0.0|(1,[0],[1.0])|(2,[0],[1.0])|[1.0,1.0,1.0,0.0,...|[-0.1689484726968...|[0.45786306245347...|       1.0|
|       0|     
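The `rawPrediction` and `probability` columns are linked by the logistic (sigmoid) function. The sketch below applies it to the first entry of `rawPrediction` in the two rows shown above, using only the digits visible before the truncation. It assumes that the first entry of each vector belongs to class 0 and that the decision threshold is 0.5.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# First entries of rawPrediction for the two rows shown above (digits as displayed, truncated)
raw_scores = [-1.2570932630573, -0.1689484726968]

for z in raw_scores:
    p_class0 = sigmoid(z)
    # Assumed threshold of 0.5: predict class 0 only if its probability exceeds one half
    predicted = 0.0 if p_class0 > 0.5 else 1.0
    print(round(p_class0, 6), predicted)
```

Both probabilities fall below 0.5 for class 0, which agrees with the `1.0` in the `prediction` column of those rows.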

In [54]:
# Show only the label and prediction
results.select('Survived', 'Prediction').show()

+--------+----------+
|Survived|Prediction|
+--------+----------+
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       0.0|
|       0|       1.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       1.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       1.0|
|       0|       0.0|
|       1|       1.0|
|       1|       1.0|
+--------+----------+
only showing top 20 rows



In [56]:
# Evaluate the model using the area under the ROC curve
AUC = my_eval.evaluate(results) # AUC stands for Area Under the Curve
AUC

0.75
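One caveat is that the evaluator above was given the hard 0/1 `prediction` column rather than continuous scores, so the ROC curve it measures has a single operating point. The sketch below uses a hypothetical toy data set (not the Titanic run) and a simple pairwise definition of AUC to show how scoring thresholded labels can give a different value than scoring the underlying scores. The `roc_auc` helper is illustrative and is not Spark's implementation.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and scores, not taken from the Titanic run
labels = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.6, 0.55, 0.7, 0.9]
hard = [1.0 if s >= 0.5 else 0.0 for s in scores]

print(roc_auc(labels, scores))  # AUC from continuous scores
print(roc_auc(labels, hard))    # AUC from thresholded 0/1 predictions
```

In this toy case the thresholded version loses the ranking information among rows on the same side of the threshold, so the two values differ.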