# Lab 2 – Titanic Data Exploration and Splitting
### *Huzaifa Nadeem*
### *03-21-2025*

## Introduction
In this lab, we will explore the Titanic dataset and prepare it for machine learning. We will:
1. Import and inspect the dataset
2. Visualize data patterns
3. Handle missing values
4. Engineer new features
5. Select features and target
6. Split the data using both basic and stratified methods

We will **not** publish or share any Howell dataset solution; it is used only as guidance.
Everything is done locally and then pushed to our GitHub repository.



## Imports
Place all necessary imports here.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

# For cleaner plots (optional), you can increase figure size defaults:
# plt.rcParams['figure.figsize'] = (8, 6)

# Section 1: Import and Inspect the Data

In [None]:
# 1.1 Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Display basic information
titanic.info()

# Display the first 10 rows
display(titanic.head(10))

# Check for missing values
print("\nMissing Values:\n", titanic.isnull().sum())

# Display summary statistics
print("\nSummary Statistics:\n", titanic.describe())

# Check for correlations (numeric only)
print("\nCorrelation Matrix (numeric only):\n", titanic.corr(numeric_only=True))

### Reflection 1
1. How many data instances are there? 891
2. How many features are there? 15
3. What are the names of the features? survived, pclass, sex, age, sibsp, parch, fare, embarked, class, who, adult_male, deck, embark_town, alive, alone
4. Are there any missing values? Yes: age (177), deck (688), embarked (2), and embark_town (2) have missing values.
5. Are there any non-numeric features? sex, embarked, class, who, deck, embark_town, and alive hold text or categories; adult_male and alone are booleans.
6. Are the data instances sorted on any attribute? No, the rows do not appear to be sorted on any attribute.
7. What are the units of `age`? Years
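8. What are the minimum, median, and maximum age values? Min is 0.42, median is 28, and max is 80.
9. Which two numeric features have the highest correlation? Among the non-boolean columns, fare and pclass, with a negative correlation of roughly -0.55 (a higher class number goes with a lower fare).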
10. Are there any categorical features that might be useful for prediction? Yes, sex, pclass and embarked
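
The answers above can be checked programmatically instead of read off the printed output. The following is a minimal sketch on a hypothetical five-row stand-in (`toy` and its values are made up for illustration, not taken from the Titanic data); in the notebook the same calls would be applied to `titanic`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (hypothetical values, not the real data)
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "pclass": [3, 1, 3, 2, 1],
    "sex": ["male", "female", "female", "male", "female"],
    "age": [22.0, 38.0, np.nan, 35.0, 0.42],
    "fare": [7.25, 71.28, 7.92, 13.0, 151.55],
})

n_rows, n_cols = toy.shape                     # number of instances and features
missing = toy.isnull().sum()                   # missing count per column
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
age_stats = toy["age"].agg(["min", "median", "max"])

# Strongest pairwise correlation by absolute value.
# The upper-triangle mask drops the diagonal and duplicate pairs.
corr = toy.corr(numeric_only=True).abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
strongest_pair = corr.where(mask).stack().idxmax()

print("Rows, columns:", n_rows, n_cols)
print("Columns with missing values:\n", missing[missing > 0])
print("Non-numeric columns:", non_numeric)
print("Age stats:\n", age_stats)
print("Most correlated pair:", strongest_pair)
```

Taking the absolute value before searching matters here, because the strongest relationship in a correlation matrix can be negative.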



# Section 2: Data Exploration and Preparation

## 2.1 Explore Data Patterns and Distributions

In [None]:
# Scatter Matrix for numeric attributes
attributes = ['age', 'fare', 'pclass']
scatter_matrix(titanic[attributes], figsize=(10,10))
plt.show()

# Scatter plot of age vs fare, colored by sex (male=0, female=1)
plt.scatter(
    titanic['age'], 
    titanic['fare'], 
    c=titanic['sex'].apply(lambda x: 0 if x == 'male' else 1)
)
plt.xlabel('Age')
plt.ylabel('Fare')
plt.title('Age vs Fare by Gender')
plt.show()

# Histogram of age
sns.histplot(data=titanic, x='age', kde=True)
plt.title('Age Distribution')
plt.show()

# Countplot for class, hue by survived
sns.countplot(x='class', hue='survived', data=titanic)
plt.title('Class Distribution by Survival')
plt.show()

### Reflection 2.1
- What patterns or anomalies do you notice? Fare values have a very large range, with a few fares far above the rest.
- Do any features stand out as potential predictors? sex and pclass seem important, which makes sense since we know first-class passengers and women survived at higher rates.
- Are there any visible class imbalances? Yes. For survived, more people did not survive than survived, which shows a bit of an imbalance.
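
The imbalance and the wide fare range noted above can also be put into numbers. This is a small sketch on a hypothetical eight-row sample (`toy` and its values are assumptions for illustration); in the notebook the same calls would run on `titanic`.

```python
import pandas as pd

# Hypothetical mini-sample; the real notebook would use titanic instead
toy = pd.DataFrame({
    "survived": [0, 0, 0, 1, 1, 0, 1, 0],
    "fare": [7.25, 8.05, 7.90, 71.28, 26.55, 13.00, 512.33, 7.75],
})

# Share of each class in the target
class_share = toy["survived"].value_counts(normalize=True)

# Ratio of the majority class share to the minority class share
imbalance_ratio = class_share.max() / class_share.min()

# A clearly positive skew suggests a long right tail (a few very high fares)
fare_skew = toy["fare"].skew()

print(class_share)
print("Imbalance ratio:", round(imbalance_ratio, 2))
print("Fare skew:", round(fare_skew, 2))
```

A ratio close to 1 would indicate a balanced target; larger values are one signal that a stratified split is worth using.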



## 2.2 Handle Missing Values and Clean Data

In [None]:
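# Impute missing age with the median (assign back instead of chained inplace=True,
# which newer pandas versions warn about and may silently ignore)
titanic['age'] = titanic['age'].fillna(titanic['age'].median())

# Fill missing embark_town with the mode (most frequent value)
titanic['embark_town'] = titanic['embark_town'].fillna(titanic['embark_town'].mode()[0])

# Check what is still missing; deck and embarked are not imputed here
print("Missing Values After Cleaning:\n", titanic.isnull().sum())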

## 2.3 Feature Engineering
Create additional or transformed features, as needed.

In [None]:
# Create new feature: family_size = sibsp + parch + 1
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1

# Convert 'sex' to numeric
titanic['sex'] = titanic['sex'].map({'male':0, 'female':1})

# Convert 'embarked' to numeric
titanic['embarked'] = titanic['embarked'].map({'C':0, 'Q':1, 'S':2})

# Optional: Make 'alone' an integer
titanic['alone'] = titanic['alone'].astype(int)

titanic.head(10)

### Reflection 2.3
- Why might `family_size` be a useful feature for predicting survival? It can show whether traveling with a large family slowed passengers down and lowered their chance of survival.
- Why convert categorical data to numeric? Most machine learning algorithms require numeric inputs.
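
One caveat when converting categories: mapping a port to 0, 1, 2 implies an order that the ports do not really have. A common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on a hypothetical three-row frame (`toy` is made up for illustration); it is shown as an option, not as the method this lab uses.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the Titanic columns
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
})

# Binary category: a single 0/1 column is enough
toy["sex"] = toy["sex"].map({"male": 0, "female": 1})

# Nominal category with no natural order: one indicator column per port,
# so the model is not told that C < Q < S
encoded = pd.get_dummies(toy, columns=["embarked"], prefix="embarked", dtype=int)

print(encoded)
```

For tree-based models the integer mapping is usually harmless, while linear or distance-based models tend to be more sensitive to an artificial ordering.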


# Section 3: Feature Selection and Justification
Pick which features to use as predictors (X) and which to use as target (y).

In [None]:
# For classification, let's pick some columns
X = titanic[['age', 'fare', 'pclass', 'sex', 'family_size']]
y = titanic['survived']  # 0 or 1

X.head()

### Reflection 3
- Why did you pick these features? I picked these because I expected each of them to affect a passenger's chance of survival.
- Which do you suspect will be highly predictive? I think sex and pclass will be. We know that women, children, and first-class passengers survived at the highest rates.
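
A quick way to test these suspicions before modeling is to compare survival rates across groups. The sketch below uses a hypothetical eight-row frame (`toy` and its values are invented for illustration) with `sex` already encoded as in Section 2.3; the same `groupby` calls would work on `titanic`.

```python
import pandas as pd

# Hypothetical sample, not real Titanic rows
toy = pd.DataFrame({
    "survived": [1, 1, 0, 0, 0, 1, 0, 1],
    "sex":      [1, 1, 0, 0, 0, 1, 0, 0],   # 0 = male, 1 = female
    "pclass":   [1, 2, 3, 3, 2, 3, 1, 1],
})

# The mean of a 0/1 target within a group is that group's survival rate
rate_by_sex = toy.groupby("sex")["survived"].mean()
rate_by_class = toy.groupby("pclass")["survived"].mean()

print("Survival rate by sex:\n", rate_by_sex)
print("Survival rate by class:\n", rate_by_class)
```

Large gaps between group rates suggest a feature carries signal, although this looks at one feature at a time and ignores interactions.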



# Section 4: Splitting the Data
We'll do both a basic train/test split and a stratified train/test split.

In [None]:
# 4.1 Basic Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.2, 
    random_state=123
)
print('Basic Split =>')
print('Train size:', len(X_train), 'Test size:', len(X_test))

# Check distribution of the target in the train/test sets
print('\nOriginal Survived Dist:', y.value_counts(normalize=True))
print('Train Survived Dist:', y_train.value_counts(normalize=True))
print('Test Survived Dist:', y_test.value_counts(normalize=True))

In [None]:
# 4.2 Stratified Train/Test Split
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=123)

for train_idx, test_idx in splitter.split(X, y):
    X_train_strat = X.iloc[train_idx]
    y_train_strat = y.iloc[train_idx]
    X_test_strat = X.iloc[test_idx]
    y_test_strat = y.iloc[test_idx]

print('\nStratified Split =>')
print('Train size:', len(X_train_strat), 'Test size:', len(X_test_strat))

# Check distribution of the target in the stratified train/test sets
print('\nOriginal Survived Dist:', y.value_counts(normalize=True))
print('Strat Train Survived Dist:', y_train_strat.value_counts(normalize=True))
print('Strat Test Survived Dist:', y_test_strat.value_counts(normalize=True))

### Reflection 4
- Why might stratification improve model performance? It ensures the train and test sets have the same proportion of survivors and non-survivors as the full dataset, so the model is trained and evaluated on representative data.
- Which split method produced a class distribution closer to the original? The stratified split kept the original ratio of survivors to non-survivors more precisely.
- In what scenarios might stratification be less important? If the dataset is already large and balanced beforehand.
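
For a single split, `train_test_split` also accepts a `stratify` argument, which can stand in for the `StratifiedShuffleSplit` loop. The sketch below uses a hypothetical 100-row target with a 70/30 class ratio (`X`, `y`, and the `feature` column are made up for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 70 zeros and 30 ones
y = pd.Series([0] * 70 + [1] * 30, name="survived")
X = pd.DataFrame({"feature": np.arange(len(y))})

# stratify=y keeps the class proportions of y in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)

print("Overall share of class 1:", y.mean())
print("Train share of class 1:", y_tr.mean())
print("Test share of class 1:", y_te.mean())
```

With these round numbers both subsets reproduce the 30% share exactly; on real data the shares match only as closely as the subset sizes allow.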



# Conclusion
First, I checked out all the info in the dataset and saw there were some missing values, especially in age. I filled those in, and then poked around the data with some basic plots, noticing stuff like how class and gender seemed tied to survival. It was actually kind of interesting to see real historical patterns in the numbers.

I also created a new column, family_size, to see if traveling in a group might matter for survival, and then converted non-numerical columns (like sex and embarked) into numbers so that future machine learning models can use them. After cleaning everything up, I split the data two ways: a regular train/test split and a stratified split. The stratified split makes sure the survival rate stays about the same in both training and testing sets, which should make the evaluation of the final models more reliable.