### Titanic


In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Titanic").getOrCreate()

In [3]:
df = spark.read.csv('Datasets/titanic.csv', header=True, inferSchema=True)
df.show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| NULL|       S|
|          6|       0|     3|    Moran, Mr. James|  male|NULL|    0|    0|      

In [4]:
df.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



In [5]:
print(df.columns)

['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']


Select the columns that are useful for building a model

In [6]:
my_cols = df.select(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch','Fare','Embarked' ])

In [7]:
# drop rows which contain NULL
my_final_data = my_cols.na.drop()
my_final_data.count() # 712

712

In [8]:
# Convert categorical data into numbers

from pyspark.ml.feature import VectorAssembler, OneHotEncoder, StringIndexer

**StringIndexer**

StringIndexer converts categorical string values into numeric indices, assigning one index to each category. By default the most frequent category gets index 0, the second most frequent gets index 1, and so on.
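The frequency-ordering rule can be illustrated with a plain-Python sketch. This is not Spark's implementation: the helper name `frequency_index` and the sample column are made up for illustration, and the alphabetical tie-break is an assumption about how equal counts are ordered.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct string to an index, most frequent first (hypothetical helper)."""
    counts = Counter(values)
    # Sort by descending count; break ties alphabetically (assumed tie-break rule)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up sample column: 'male' occurs 3 times, 'female' 2 times
sex_column = ["male", "female", "male", "male", "female"]
mapping = frequency_index(sex_column)
indexed = [mapping[v] for v in sex_column]

print(mapping)
print(indexed)
```

Because the indices depend on category counts, the same label can receive a different index on a different dataset or a different random split.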

**OneHotEncoder**

OneHotEncoder converts the indexed categorical values (from StringIndexer) into one-hot encoded binary vectors: the vector has a 1 in the position of the category and 0 elsewhere. By default (`dropLast=True`) the last category is dropped, so it is encoded as an all-zeros vector and a column with n categories produces a vector of length n - 1.
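A minimal plain-Python sketch of the encoding, with and without the last category dropped. The function `one_hot` and its `drop_last` parameter are illustrative names, not part of PySpark, and the indices here simply follow list position, whereas StringIndexer would assign them by frequency.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index (hypothetical helper)."""
    # With drop_last the vector is one element shorter and the
    # last category is represented by all zeros.
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

categories = ["red", "green", "blue"]
for i, name in enumerate(categories):
    full = one_hot(i, len(categories), drop_last=False)
    dropped = one_hot(i, len(categories))
    print(name, full, dropped)
```

Dropping one category avoids a redundant column: when every other position is 0, the row must belong to the dropped category.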

**One-hot encoding**

["red", "green", "blue"] =

"red" → [1, 0, 0]

"green" → [0, 1, 0]

"blue" → [0, 0, 1]

**Sparse Vector Representation**
- Dense representation of one-hot encoded data: [0, 0, 1, 0, 0] (5 elements stored).
- Sparse representation: (5, [2], [1]) (only the size, index of 1, and the value 1 are stored).

(Size, [indices], [values])

Example:

(3, [0], [1.0]) → Set index 0 to 1.0.

(3, [1], [1.0]) → Set index 1 to 1.0.

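(3, [], []) → No index is set, so the vector is all zeros. This is how OneHotEncoder represents the last category when `dropLast=True` (the default).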

(3, [2], [1.0]) → Set index 2 to 1.0.
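To make the (size, [indices], [values]) notation concrete, here is a small plain-Python sketch that expands a sparse triple into its dense form. The helper `to_dense` is a hypothetical name used only for this illustration.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense

# The examples listed above
print(to_dense(5, [2], [1]))
print(to_dense(3, [0], [1.0]))
print(to_dense(3, [], []))
```

In PySpark itself, `pyspark.ml.linalg.Vectors.sparse(size, indices, values)` builds the same kind of vector, and its `toArray()` method returns the dense values.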

One-hot encode ‘Sex’

In [9]:
gender_indexer = StringIndexer(inputCol='Sex', outputCol='SexIndex')
gender_encoder = OneHotEncoder(inputCol='SexIndex', outputCol='SexVec')

One-hot encode ‘Embarked’

In [10]:
embark_indexer = StringIndexer(inputCol='Embarked', outputCol='EmbarkIndex')
embark_encoder = OneHotEncoder(inputCol='EmbarkIndex', outputCol='EmbarkVec')

In [11]:
assembler = VectorAssembler(
   inputCols=['Pclass','SexVec','EmbarkVec','Age','SibSp', 'Parch', 'Fare'],
   outputCol='features'
   )

**Using a Pipeline**

The Pipeline in PySpark MLlib streamlines machine learning workflows by chaining multiple transformers and estimators into a single pipeline object.
- Automates Data Transformation Steps
- Ensures Consistent Processing Across Train & Test Data
- Easier to Manage Complex ML Workflows
- Reduces Manual Repetitions

A Pipeline ensures that all transformations happen in sequence before feeding data to a model.
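The chaining idea can be sketched with a toy pipeline in plain Python. All class names below (`MeanCenter`, `Scale`, `ToyPipeline`) are invented for illustration and are not PySpark classes. Unlike this toy, Spark's `Pipeline.fit` returns a separate `PipelineModel` instead of modifying the pipeline in place.

```python
class MeanCenter:
    """Toy estimator: learns a mean in fit(), subtracts it in transform()."""
    def fit(self, data):
        self.mean = sum(data) / len(data)
        return self

    def transform(self, data):
        return [x - self.mean for x in data]


class Scale:
    """Toy transformer: nothing to learn, just multiplies by a factor."""
    def __init__(self, factor):
        self.factor = factor

    def fit(self, data):
        return self

    def transform(self, data):
        return [x * self.factor for x in data]


class ToyPipeline:
    """Runs stages in order; each stage is fitted on the previous stage's output."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return self

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


# Fit on "training" values, then apply the same learned steps to new data
pipe = ToyPipeline([MeanCenter(), Scale(2.0)]).fit([1.0, 2.0, 3.0])
print(pipe.transform([4.0]))
```

The point of the sketch is that the mean is learned once from the training data and reused unchanged on new data, which is what keeps train and test processing consistent.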


In [12]:
from pyspark.ml import Pipeline

In [13]:
from pyspark.ml.classification import LogisticRegression
log_reg_titanic = LogisticRegression(featuresCol='features', labelCol='Survived')

In [14]:
pipeline = Pipeline(
  stages = [
    gender_indexer,
    embark_indexer,
    gender_encoder,
    embark_encoder,
    assembler,
    log_reg_titanic
  ]
)

In [15]:
# No seed is passed, so the split (and every metric below) changes from run to run
train_data, test_data = my_final_data.randomSplit([0.7, 0.3])

In [16]:
fit_model = pipeline.fit(train_data)

In [17]:
results = fit_model.transform(test_data) # adds rawPrediction, probability and prediction columns
results.select('Survived','prediction').show()

+--------+----------+
|Survived|prediction|
+--------+----------+
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       1.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
+--------+----------+
only showing top 20 rows



In [18]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [19]:
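# NOTE: rawPredictionCol='prediction' scores the hard 0/1 predictions;
# the default, rawPredictionCol='rawPrediction', gives the usual score-based AUC
my_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='Survived')
my_eval.evaluate(results) # area under ROC (the default metric)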

0.8154121863799283
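The evaluator above is given the hard 0/1 `prediction` column as its score, whereas its default score column is `rawPrediction`. The plain-Python sketch below, using made-up labels and scores, shows why the two choices can give different areas under the ROC curve. The `auc` helper is a hypothetical rank-based formulation, not the routine Spark uses internally.

```python
def auc(labels, scores):
    """Area under ROC as the probability that a random positive
    outranks a random negative (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example data, not taken from the Titanic results
labels = [0, 0, 0, 1, 1, 1]
probs = [0.10, 0.40, 0.55, 0.45, 0.80, 0.90]
hard = [1.0 if p >= 0.5 else 0.0 for p in probs]  # thresholded predictions

print("AUC from scores:          ", auc(labels, probs))
print("AUC from hard predictions:", auc(labels, hard))
```

Thresholding throws away the ranking inside each predicted class, so the hard-prediction value reflects a single operating point rather than the whole curve.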

Show the model summary

In [20]:
# Extract trained logistic regression model
log_reg_model = fit_model.stages[-1] #log_reg_titanic

# Print Coefficients & Intercept
print("Coefficients: ", log_reg_model.coefficients)
print("Intercept: ", log_reg_model.intercept)

# Get Model Summary
summary = log_reg_model.summary

# Print Model Performance Metrics (computed on train_data)
print("Accuracy: ", summary.accuracy)
print("AUC: ", summary.areaUnderROC)
print("Precision by Label: ", summary.precisionByLabel)
print("Recall by Label: ", summary.recallByLabel)

Coefficients:  [-1.1239883285904049,-2.44498264874124,0.529193482908619,1.158134802091996,-0.0452707852503392,-0.2530548514302138,-0.17551833814822707,0.0015982976883840217]
Intercept:  4.439466795243761
Accuracy:  0.7913223140495868
AUC:  0.8470055895661432
Precision by Label:  [0.8092105263157895, 0.7611111111111111]
Recall by Label:  [0.8512110726643599, 0.7025641025641025]


**Implication**

![Implication](Img/TitanicImplication.png)

SexVec (gender) has the largest negative coefficient (about -2.44) → Males were less likely to survive.

Pclass negatively affects survival → Lower-class passengers had a lower survival rate.

Fare and EmbarkVec have positive coefficients → Higher fares & embarking from Cherbourg were associated with higher survival, although the Fare coefficient (about 0.0016) is very small.

Age has a small negative coefficient (about -0.045, the fifth value printed above) → Older passengers were slightly less likely to survive in this model.

SibSp and Parch have negative effects → Having more family members on board slightly reduced survival.
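One way to read the printed coefficients is to convert them to odds ratios with `exp(coefficient)` and to push a single passenger through the logistic function. The sketch below copies the numbers printed by cell In [20]. The slot names are an assumption: they follow the assembler's input order and assume that `male`, `S` and `C` occupy the encoder slots, as the most frequent categories would. The passenger is hypothetical.

```python
import math

# Coefficients and intercept copied from the printed output of cell In [20]
coefficients = [-1.1239883285904049, -2.44498264874124, 0.529193482908619,
                1.158134802091996, -0.0452707852503392, -0.2530548514302138,
                -0.17551833814822707, 0.0015982976883840217]
intercept = 4.439466795243761

# ASSUMED slot names (assembler order; male / S / C assumed to fill the encoder slots)
names = ["Pclass", "Sex=male", "Embarked=S", "Embarked=C",
         "Age", "SibSp", "Parch", "Fare"]

# exp(coefficient) = multiplicative change in the odds of survival per unit increase
odds_ratios = {n: math.exp(c) for n, c in zip(names, coefficients)}
for n, r in odds_ratios.items():
    print(f"{n:12s} odds ratio = {r:.3f}")

def survival_probability(x):
    """Logistic function applied to intercept + coefficients . x"""
    z = intercept + sum(c * v for c, v in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical passenger: 3rd class, male, embarked at S, age 22,
# 1 sibling/spouse, 0 parents/children, fare 7.25
p = survival_probability([3, 1, 1, 0, 22, 1, 0, 7.25])
print(f"P(survived) = {p:.3f}")
```

An odds ratio below 1 lowers the odds of survival and one above 1 raises them. Because the features are not standardized, coefficient sizes are per unit of each feature and are not directly comparable across features such as Fare and Pclass.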