# 🧪 PySpark Titanic Practice Levels

## ✅ Level 1: Schema & Types
**Goal:** Understand and clean the schema.
- Identify which columns should be numeric and which should remain strings.
- Convert columns to appropriate data types.
- Count and inspect rows with null or malformed values.




In [43]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [44]:
file = 'titanic.csv'
df = spark.read.csv(file, header=True)

In [45]:
df.printSchema()

root
 |-- Survived: string (nullable = true)
 |-- Pclass: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Siblings/Spouses Aboard: string (nullable = true)
 |-- Parents/Children Aboard: string (nullable = true)
 |-- Fare: string (nullable = true)



In [46]:
# exercise: add a column filled with integer zeros
from pyspark.sql.functions import lit, col
df = df.withColumn('tmr', lit(0))
print(df.columns)

['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare', 'tmr']


In [47]:
# exercise: drop the column you just created
df = df.drop('tmr')
print(df.columns)

['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']


In [48]:
df.show()

+--------+------+--------------------+------+---+-----------------------+-----------------------+-------+
|Survived|Pclass|                Name|   Sex|Age|Siblings/Spouses Aboard|Parents/Children Aboard|   Fare|
+--------+------+--------------------+------+---+-----------------------+-----------------------+-------+
|       0|     3|Mr. Owen Harris B...|  male| 22|                      1|                      0|   7.25|
|       1|     1|Mrs. John Bradley...|female| 38|                      1|                      0|71.2833|
|       1|     3|Miss. Laina Heikk...|female| 26|                      0|                      0|  7.925|
|       1|     1|Mrs. Jacques Heat...|female| 35|                      1|                      0|   53.1|
|       0|     3|Mr. William Henry...|  male| 35|                      0|                      0|   8.05|
|       0|     3|     Mr. James Moran|  male| 27|                      0|                      0| 8.4583|
|       0|     1|Mr. Timothy J McC...|  male| 

In [49]:
# cast each column from string to its proper numeric type
df = (
    df.withColumn('Survived', col('Survived').cast('int'))
      .withColumn('Pclass', col('Pclass').cast('int'))
      .withColumn('Age', col('Age').cast('float'))
      .withColumn('Siblings/Spouses Aboard', col('Siblings/Spouses Aboard').cast('int'))
      .withColumn('Parents/Children Aboard', col('Parents/Children Aboard').cast('int'))
      .withColumn('Fare', col('Fare').cast('float'))
)
df.show()


+--------+------+--------------------+------+----+-----------------------+-----------------------+-------+
|Survived|Pclass|                Name|   Sex| Age|Siblings/Spouses Aboard|Parents/Children Aboard|   Fare|
+--------+------+--------------------+------+----+-----------------------+-----------------------+-------+
|       0|     3|Mr. Owen Harris B...|  male|22.0|                      1|                      0|   7.25|
|       1|     1|Mrs. John Bradley...|female|38.0|                      1|                      0|71.2833|
|       1|     3|Miss. Laina Heikk...|female|26.0|                      0|                      0|  7.925|
|       1|     1|Mrs. Jacques Heat...|female|35.0|                      1|                      0|   53.1|
|       0|     3|Mr. William Henry...|  male|35.0|                      0|                      0|   8.05|
|       0|     3|     Mr. James Moran|  male|27.0|                      0|                      0| 8.4583|
|       0|     1|Mr. Timothy J McC...

In [50]:
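df.printSchema()

root
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: float (nullable = true)
 |-- Siblings/Spouses Aboard: integer (nullable = true)
 |-- Parents/Children Aboard: integer (nullable = true)
 |-- Fare: float (nullable = true)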

In [51]:
# df[df.Sex.isin('male')].show()

---

## ✅ Level 2: Basic Exploration
**Goal:** Describe your dataset.
- Count the number of passengers in each class (`Pclass`).
- Count how many passengers survived vs. how many did not.
- Calculate average and median age and fare.


In [52]:
# count the number of passengers in each class
total = 0
for pclass in range(1, 4):
    n = df.filter(df.Pclass == pclass).count()  # count once, reuse below
    print(f'there are {n} passengers in class {pclass}')
    total += n
print(f'in total there are {total} passengers')
print(df.count())


there are 216 passengers in class 1
there are 184 passengers in class 2
there are 487 passengers in class 3
in total there are 887 passengers
887


In [53]:
# count how many survived vs. how many did not
survived = df.filter(df.Survived == 1).count()
not_survived = df.count() - survived
print('survived', survived)
print('did not survive', not_survived)

survived 342
did not survive 545


In [54]:
df.printSchema()

root
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: float (nullable = true)
 |-- Siblings/Spouses Aboard: integer (nullable = true)
 |-- Parents/Children Aboard: integer (nullable = true)
 |-- Fare: float (nullable = true)



In [55]:
# calculate the average age (median age and the fare statistics are still to do)
from pyspark.sql import functions as f 

df.select(f.mean('Age')).show()

+------------------+
|          avg(Age)|
+------------------+
|29.471443066501564|
+------------------+



---

## ✅ Level 3: Filtering
**Goal:** Extract meaningful subsets.
- Filter all passengers under 10 years old.
- Filter all female passengers who survived.
- Find the 10 passengers with the highest fares.

---

## ✅ Level 4: Feature Engineering
**Goal:** Create new columns.
- Add a `FamilySize` column: siblings/spouses + parents/children + 1.
- Create a binary column for minors (under 18).
- Extract titles (Mr., Miss, etc.) from the `Name` column.

---

## ✅ Level 5: Grouping and Aggregation
**Goal:** Summarize data by groups.
- Compute survival rate by sex.
- Compute average fare per passenger class.
- Find the average age for each title (from Level 4).

---

## ✅ Level 6: Missing Data
**Goal:** Handle nulls effectively.
- Count missing values in each column.
- Replace missing `Age` values with average age per `Pclass`.

---

## ✅ Level 7: Sorting and Ranking
**Goal:** Use ordering to analyze data.
- Rank passengers by fare within each class.
- Identify the top 3 oldest passengers in each survival group.

---

## ✅ Level 8: Joins and Window Functions (Advanced)
**Goal:** Combine and compare rows.
- Create a second DataFrame with only female passengers and join it with the original.
- Use a window function to compute rolling averages or rank by fare.

---

## ⭐ Bonus Level: Build a Mini ML Pipeline
**Goal:** Prepare data for machine learning.
- Select features for predicting survival.
- Encode categorical variables.
- Split into training/testing sets and train a classifier.
