# Titanic Survival Prediction

This notebook demonstrates an end-to-end baseline for the Kaggle "Titanic: Machine Learning from Disaster" competition.

Learning objectives:
- Understand the typical ML workflow on tabular data: ingest → explore → preprocess → engineer features → train → evaluate → submit.
- See examples of imputation, categorical encoding, feature derivation, and model training with scikit-learn and XGBoost.
- Recognize tradeoffs and caveats (data leakage, reproducibility, choice of metrics, etc.).

Competition reminder (goal): Predict the binary outcome Survived (1/0) for passengers in the test set. Submissions must be a CSV with columns PassengerId and Survived.

High-level pipeline used here:
1) Load Kaggle-provided train/test CSVs.
2) Concatenate them temporarily to apply consistent preprocessing/encoding.
3) Drop low-utility text keys, impute missing values, and extract signal from Cabin.
4) One-hot encode categorical features; derive a couple of simple features.
5) Split the (processed) training data into train/validation to estimate performance.
6) Train a few baseline models (Logistic Regression, XGBoost, Random Forest) and compare accuracy.
7) Fit the chosen model on the train split and generate predictions for the test set.
8) Create the submission file.

In [49]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier


In [50]:
df_train = pd.read_csv('https://raw.githubusercontent.com/Okwybobby/WIGE-Kaggle-Competition-Example/refs/heads/main/train.csv')
df_test = pd.read_csv('https://raw.githubusercontent.com/Okwybobby/WIGE-Kaggle-Competition-Example/refs/heads/main/test.csv')
if 'Survived' not in df_test.columns:
    df_test['Survived'] = 0

In [51]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Quick peek at training data
- Columns: PassengerId, Survived (label), Pclass (1–3), Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked.
- Early intuition:
  - Pclass, Sex, Fare, and Cabin often carry strong signal.
  - Name and Ticket are free text; they may contain useful signal (titles, ticket prefixes), but require extra parsing. In this baseline, they’re dropped for simplicity.
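
As a small illustration of the "extra parsing" mentioned above, the sketch below pulls the honorific out of Name with a regular expression. It is not part of this baseline; the regex and the variable names (`names`, `titles`) are assumptions chosen for illustration, and the sample rows are copied from the `head()` output above.

```python
import pandas as pd

# Sample names copied from the head() output above
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# Assumed pattern: the title is the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles.tolist())  # ['Mr', 'Miss', 'Mrs']
```

A column like this could be one-hot encoded alongside Sex and Embarked, although rare titles would probably need to be grouped first.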

## Preprocessing pipeline design
We’ll now build a function that:
1) Concatenates train and test so the same preprocessing is applied consistently.
2) Drops low-utility text columns (Name, Ticket) in this baseline.
3) Imputes missing values (Age mean; Fare mean; Embarked placeholder; Cabin placeholder).
4) Splits Cabin into letter and number components; one-hot encodes categories.
5) Creates a couple of simple engineered features.
6) Splits the combined frame back into train and test with aligned columns.
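
One caveat on step 3: the function below computes the Age and Fare means over the concatenated train and test rows, so test-set values influence the training features (a mild form of leakage). A minimal sketch of the alternative, assuming scikit-learn's `SimpleImputer` and hypothetical toy frames (`toy_train`, `toy_test`), fits the statistics on training rows only and then applies them to both sets:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for the real train/test data
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 26.0], "Fare": [7.25, 71.28, np.nan]})
toy_test = pd.DataFrame({"Age": [np.nan, 40.0], "Fare": [8.05, np.nan]})

imputer = SimpleImputer(strategy="mean")
imputer.fit(toy_train)  # means are learned from training rows only

# Apply the same training means to both frames
train_filled = pd.DataFrame(imputer.transform(toy_train), columns=toy_train.columns)
test_filled = pd.DataFrame(imputer.transform(toy_test), columns=toy_test.columns)

print(test_filled)
```

On a dataset this small the difference is usually minor, but the habit matters on larger projects.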

Why concatenate?
- Ensures any encoding (e.g., one-hot) yields identical columns across train and test. Otherwise, category mismatches can cause errors at inference.
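
The column mismatch is easy to reproduce on a toy example. The sketch below (with hypothetical frames `toy_train` and `toy_test`) encodes the two sets separately, then together, to show why the concatenated approach keeps the columns aligned:

```python
import pandas as pd

# Hypothetical toy frames: the test set happens to contain only one category
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
toy_test = pd.DataFrame({"Embarked": ["S", "S"]})

# Encoding separately: the test frame is missing two of the three columns
sep_train = pd.get_dummies(toy_train, columns=["Embarked"])
sep_test = pd.get_dummies(toy_test, columns=["Embarked"])
print(list(sep_train.columns))  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
print(list(sep_test.columns))   # ['Embarked_S']

# Encoding the concatenated frame, then splitting back by row position
combined = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)
encoded = pd.get_dummies(combined, columns=["Embarked"])
enc_train = encoded.iloc[:len(toy_train)]
enc_test = encoded.iloc[len(toy_train):]
print(list(enc_test.columns))   # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```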

In [52]:
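def Preprocess(df_train, df_test):
  # Stack train and test so imputation and encoding yield identical columns
  df = pd.concat([df_train, df_test], axis=0)

  # Drop free-text columns that this baseline does not parse
  df = df.drop(['Name', 'Ticket'], axis=1)

  # Impute missing values (means are computed over the combined frame)
  df['Age'] = df['Age'].fillna(df['Age'].mean())
  df['Cabin'] = df['Cabin'].fillna('X000')
  df['Embarked'] = df['Embarked'].fillna('X')
  df['Fare'] = df['Fare'].fillna(df['Fare'].mean())

  # Split Cabin into its first letter group and first number, e.g. 'C85' -> 'C', '85'
  df['cabin_letter'] = df['Cabin'].str.extract(r'([A-Za-z]+)', expand=False)
  df['cabin_number'] = df['Cabin'].str.extract(r'(\d+)', expand=False)
  df = df.drop('Cabin', axis=1)

  # One-hot encode the categorical columns
  df = pd.get_dummies(df, columns=['cabin_letter'], prefix='cabin')
  df = pd.get_dummies(df, columns=['Embarked'], prefix='Embarked')
  df = pd.get_dummies(df, columns=['Sex'], prefix='Sex')

  # Drop the indicator columns created by the 'missing' placeholders
  df = df.drop('cabin_X', axis=1)
  df = df.drop('Embarked_X', axis=1)

  # Cabin values without a number get 0; then convert the strings to numbers
  df['cabin_number'] = df['cabin_number'].fillna(0)
  df['cabin_number'] = pd.to_numeric(df['cabin_number'])

  # Simple engineered features
  df['Pclass_bin_Fare'] = df['Fare'] // df['Pclass']
  df['Pclass_bin_sex'] = df['Pclass'] - df['Sex_female']

  # Split back: the first len(df_train) rows are train, the remaining rows are test
  n_train = len(df_train)
  df_train = df[:n_train]
  df_test = df[n_train:]

  # Remove the placeholder label column from the test features
  df_test = df_test.drop('Survived', axis=1)

  return df_train, df_test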
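
To make the Cabin step concrete, the sketch below applies the same two regular expressions to a few hand-picked strings. The sample values are assumptions chosen to cover the interesting cases (a plain cabin, the `X000` placeholder, a multi-cabin entry, and a letter with no number). It converts to numeric before filling missing values, which differs slightly in order from the function above but gives the same result.

```python
import pandas as pd

# Hypothetical sample values covering the main cases
cabins = pd.Series(["C85", "X000", "C23 C25 C27", "D"])

# First run of letters and first run of digits, as in Preprocess
letters = cabins.str.extract(r"([A-Za-z]+)", expand=False)
numbers = cabins.str.extract(r"(\d+)", expand=False)

# Convert to numeric first, then fill cabins that have no number with 0
numbers = pd.to_numeric(numbers).fillna(0).astype(int)

print(letters.tolist())  # ['C', 'X', 'C', 'D']
print(numbers.tolist())  # [85, 0, 23, 0]
```

Note that a multi-cabin entry keeps only its first cabin, so some information is discarded.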

In [53]:
train_df, test_df = Preprocess(df_train, df_test)

In [54]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,cabin_number,cabin_A,cabin_B,...,cabin_F,cabin_G,cabin_T,Embarked_C,Embarked_Q,Embarked_S,Sex_female,Sex_male,Pclass_bin_Fare,Pclass_bin_sex
0,1,0,3,22.0,1,0,7.25,0,False,False,...,False,False,False,False,False,True,False,True,2.0,3
1,2,1,1,38.0,1,0,71.2833,85,False,False,...,False,False,False,True,False,False,True,False,71.0,0
2,3,1,3,26.0,0,0,7.925,0,False,False,...,False,False,False,False,False,True,True,False,2.0,2
3,4,1,1,35.0,1,0,53.1,123,False,False,...,False,False,False,False,False,True,True,False,53.0,0
4,5,0,3,35.0,0,0,8.05,0,False,False,...,False,False,False,False,False,True,False,True,2.0,3


## Correlation with target (quick intuition)
After preprocessing, `train_df` contains the original training rows with engineered/encoded features, and `test_df` contains the aligned features for the test set (no Survived column). Next, we’ll examine each feature's correlation with Survived to get a first intuition about which features matter. This is no substitute for proper modeling, but it is a useful first look.

A positive correlation means that passengers with higher values of the feature survived more often; a negative correlation means the opposite. The output below shows only the first few rows of the result.
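
The raw output is easier to read when sorted by strength. The sketch below uses a hypothetical six-row frame (`toy`) rather than the real data, and ranks features by the absolute value of their Pearson correlation (the pandas default) with the target:

```python
import pandas as pd

# Hypothetical toy data; values are illustrative, not the real training set
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass":   [3, 1, 3, 1, 3, 2],
    "Fare":     [7.25, 71.28, 7.93, 53.10, 8.05, 13.00],
})

# Correlation of each feature with the target, excluding the target itself
corr = toy.corr()["Survived"].drop("Survived")

# Rank by magnitude while keeping the sign visible
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

The same two lines could be applied to `train_df` to rank all encoded features.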

In [55]:
train_df.corr()['Survived']

Unnamed: 0,Survived
PassengerId,-0.005007
Survived,1.0
Pclass,-0.338481
Age,-0.070323
SibSp,-0.035322
Parch,0.081629
Fare,0.257307
cabin_number,0.229756
cabin_A,0.022287
cabin_B,0.175095


## Train/validation split
- A simple 80/20 split gives a holdout estimate of generalization.
- The cell below calls `train_test_split` without `random_state` or `stratify`, so the split (and every accuracy reported afterwards) changes from run to run. For reproducible class demonstrations, pass a fixed `random_state` and `stratify=y` to preserve class balance across splits.
- The `y_train = np.reshape(y_train, (-1, 1))` line shapes y into a column vector. scikit-learn estimators expect a 1D y; given a column vector, they typically emit a DataConversionWarning and ravel it internally, so the reshape is unnecessary.
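
A minimal sketch of a reproducible, stratified split on synthetic data follows. The arrays `X_demo` and `y_demo` and the seed value 42 are assumptions for illustration; on the real data the call would take `X` and `y` instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 40% positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 40 + [0] * 60)

# Fixed seed for repeatability; stratify keeps the class ratio in both splits
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_val.shape)
print(y_tr.mean(), y_val.mean())  # both splits keep the 40% positive rate
```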

In [56]:
X = train_df.drop('Survived', axis = 1)
y = train_df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
y_train = np.reshape(y_train, (-1, 1))

In [57]:
X_train.shape, y_train.shape

((712, 22), (712, 1))

Shapes check:
- 712 training rows (80% of 891).
- 22 features after preprocessing/encoding/engineering (count may vary by environment if categories differ).
- y is shaped (712, 1) here; it could also be (712,) without issue.


In [58]:
model_1 = LogisticRegression()
model_1.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## Model 1: Logistic Regression (baseline linear classifier)
- Good sanity check model; fast to train and easy to interpret.
- The warning about convergence suggests increasing `max_iter` or scaling features. In a more refined pass, try `LogisticRegression(max_iter=1000, solver='lbfgs')` and consider scaling numeric features (e.g., StandardScaler in a Pipeline).
- To avoid the DataConversionWarning, keep `y_train` 1D (skip the reshape, or call `y_train.ravel()` on the reshaped array). The warning is harmless here.
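
The suggestions above can be sketched as follows. This is an illustration on synthetic data from `make_classification`, not a rerun of the notebook; the names `X_demo`, `y_demo`, and `clf` are assumptions, and one feature is deliberately rescaled to mimic a raw column like Fare sitting next to 0/1 dummies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with one feature on a much larger scale
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_demo[:, 0] *= 1000.0

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# If y was reshaped to a column vector, ravel() returns it to 1D
y_tr = np.reshape(y_tr, (-1, 1)).ravel()

# Scaling happens inside the pipeline, so it is fit on training rows only
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

val_acc = clf.score(X_val, y_val)
print(val_acc)
```

Putting the scaler inside the pipeline also means `clf.predict` applies the same scaling to any new data automatically.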


In [59]:
y_pred = model_1.predict(X_test)

In [60]:
accuracy_score(y_test, y_pred)

0.7988826815642458

Accuracy of about 0.80 is typical for a quick baseline on Titanic with minimal feature work. Feel free to compare with additional features and better preprocessing.

In [61]:
model_2 = XGBClassifier(enable_categorical = True)
model_2.fit(X_train, y_train)

## Model 2: XGBoost (gradient boosted trees)
- Powerful non-linear model that often performs well on tabular datasets.
- Here, features are already numeric/one-hot, so `enable_categorical=True` is not necessary but harmless.
- In practice, set `random_state` (or `seed`) and consider tuning `n_estimators`, `max_depth`, `learning_rate`, `subsample`, `colsample_bytree`, etc. Even light tuning can improve performance.
- Early stopping with a validation set can prevent overfitting and speed up training.

In [62]:
y_pred = model_2.predict(X_test)

accuracy_score(y_test, y_pred)

0.7932960893854749

In [63]:
model_3 = RandomForestClassifier()
model_3.fit(X_train, y_train)

  return fit_method(estimator, *args, **kwargs)


## Model 3: Random Forest (bagged trees)
- Ensemble of decision trees trained on bootstrap samples with feature subsampling, which is robust and easy to use.
- Defaults are reasonable; you can often gain accuracy by tuning `n_estimators`, `max_depth`, `min_samples_split`, `min_samples_leaf`, and `class_weight`.
- Consider enabling `oob_score=True` to get an out-of-bag estimate without a separate validation split (useful in class to discuss bias/variance and validation strategies).
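
The out-of-bag estimate can be sketched as follows, again on synthetic data. The names `X_demo`, `y_demo`, and `forest`, and the choice of 200 trees, are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# Each tree is scored on the rows left out of its bootstrap sample
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_demo, y_demo)

print(forest.oob_score_)  # out-of-bag accuracy estimate
```

Since every row is out-of-bag for some of the trees, this gives a validation-like estimate while still training on all available rows.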


In [64]:
y_pred = model_3.predict(X_test)

accuracy_score(y_test, y_pred)

0.8268156424581006

In [65]:
pred = model_3.predict(test_df)

final = pd.DataFrame()
final['PassengerId'] = test_df['PassengerId']
final['Survived'] = pred
final.to_csv('submission.csv', index=False)
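
Before uploading, it can help to verify that the file matches the required format (columns PassengerId and Survived, binary predictions, one row per test passenger). The helper below, `check_submission`, is a hypothetical name introduced for illustration, and the IDs in the usage example are toy values.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Raise ValueError if the submission frame is malformed (hypothetical helper)."""
    if list(sub.columns) != ["PassengerId", "Survived"]:
        raise ValueError(f"unexpected columns: {list(sub.columns)}")
    if sub["PassengerId"].tolist() != expected_ids.tolist():
        raise ValueError("PassengerId values do not match the test set")
    if not sub["Survived"].isin([0, 1]).all():
        raise ValueError("Survived must contain only 0 or 1")

# Toy usage example with illustrative IDs
toy_sub = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
check_submission(toy_sub, pd.Series([892, 893, 894]))
print("submission format looks valid")
```

In the notebook, the equivalent call would be `check_submission(final, test_df['PassengerId'])`.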