
# Titanic Survival Prediction

## 1. Import Libraries and Load Data

First, let's import the necessary libraries and load our training and testing datasets.

In [34]:
# 1. Import Libraries and Load Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import files

# Upload CSVs (train.csv & test.csv)
uploaded = files.upload()

# Load data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Preview data
train_df.head()


Saving gender_submission.csv to gender_submission (3).csv
Saving test.csv to test (3).csv
Saving train.csv to train (4).csv


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
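
Before any cleaning, it helps to see which columns have gaps and how survival varies across a simple grouping. The sketch below uses a small made-up frame that borrows a few Titanic column names; the values are illustrative and not taken from `train.csv`. On the real data, the same calls would presumably be applied to `train_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df; values are invented for illustration only.
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column, largest first.
missing_counts = toy_df.isna().sum().sort_values(ascending=False)
print(missing_counts)

# Survival rate per group: the mean of a 0/1 column is the share of 1s.
survival_by_sex = toy_df.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
```

Columns with many gaps (such as `Cabin` in this toy frame) are candidates for dropping, while columns with a few gaps are candidates for imputation.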


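## 2. Data Cleaning and Feature Engineering

The `preprocess` function below fills missing values in `Age`, `Fare`, and `Embarked`, encodes `Sex` and `Embarked` as integers, and derives `Title`, `FamilySize`, `IsAlone`, `AgeGroup`, and `FareGroup`. It is applied to the training and test sets separately.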

In [36]:
# 2. Data Cleaning and Advanced Feature Engineering

def preprocess(df):
    df = df.copy()

    # --- Handle missing values ---
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

    # --- Encode categorical variables ---
    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
    df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

    # --- Extract Title from Name ---
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    df['Title'] = df['Title'].replace(['Mlle', 'Ms'], 'Miss')
    df['Title'] = df['Title'].replace(['Mme'], 'Mrs')
    df['Title'] = df['Title'].replace(['Dr', 'Rev', 'Col', 'Major', 'Capt', 'Countess', 'Jonkheer', 'Don', 'Dona', 'Lady', 'Sir'], 'Rare')
    df['Title'] = df['Title'].map({
        'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Rare': 5
    }).fillna(0)

    # --- Family features ---
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

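    # --- Age categories (right-closed bins: (0, 12], (12, 18], ..., (50, 80]) ---
    # Cast the categorical result to plain ints so later numeric steps need no implicit conversion
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 12, 18, 30, 50, 80],
                            labels=[0, 1, 2, 3, 4]).astype(int)

    # --- Fare categories (quartile edges are computed from this DataFrame only) ---
    df['FareGroup'] = pd.qcut(df['Fare'], 4, labels=[0, 1, 2, 3]).astype(int)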

    # --- Drop unnecessary columns ---
    df.drop(columns=['Name', 'Ticket', 'Cabin'], inplace=True, errors='ignore')

    return df


train = preprocess(train_df)
test = preprocess(test_df)
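
`preprocess` computes its medians, mode, and fare quartiles from whichever frame it receives, so `train` and `test` can end up with different fill values and different `FareGroup` boundaries. One common alternative is to learn those statistics from the training frame and reuse them on the test frame. The helpers below, `fit_stats` and `apply_stats`, are hypothetical names introduced for this sketch and are not part of the notebook; the toy data is invented.

```python
import numpy as np
import pandas as pd

def fit_stats(train_frame):
    """Learn fill values and fare bin edges from the training frame only."""
    # retbins=True also returns the quartile edges so they can be reused.
    _, fare_edges = pd.qcut(train_frame["Fare"], 4, retbins=True)
    fare_edges = np.array(fare_edges, dtype=float)
    # Open the outer edges so fares outside the training range still land in a bin.
    fare_edges[0], fare_edges[-1] = -np.inf, np.inf
    return {
        "age_median": train_frame["Age"].median(),
        "fare_median": train_frame["Fare"].median(),
        "fare_edges": fare_edges,
    }

def apply_stats(frame, stats):
    """Fill and bin a frame using statistics learned elsewhere."""
    out = frame.copy()
    out["Age"] = out["Age"].fillna(stats["age_median"])
    out["Fare"] = out["Fare"].fillna(stats["fare_median"])
    out["FareGroup"] = pd.cut(out["Fare"], bins=stats["fare_edges"],
                              labels=[0, 1, 2, 3]).astype(int)
    return out

# Invented toy data standing in for train_df and test_df.
toy_train = pd.DataFrame({
    "Age": [22, np.nan, 26, 35, 54, 2, np.nan, 40],
    "Fare": [7.25, 8.05, 13.0, 26.0, 53.1, 71.28, 15.5, 30.0],
})
toy_test = pd.DataFrame({"Age": [np.nan, 30, 60], "Fare": [np.nan, 600.0, 5.0]})

stats = fit_stats(toy_train)
toy_train_out = apply_stats(toy_train, stats)
toy_test_out = apply_stats(toy_test, stats)
print(toy_test_out)
```

With this split, the missing test values are filled with training medians, and a test fare far above anything seen in training still falls into the top fare group instead of becoming `NaN`.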


## 3. Column Alignment and Feature Scaling

Next, we separate the features from the target, make sure the test set has the same columns in the same order as the training features, and standardize every feature column.

In [37]:
from sklearn.preprocessing import StandardScaler

# Split features and target
X = train.drop('Survived', axis=1)
y = train['Survived']

# Save PassengerId for submission
passenger_id = test['PassengerId']

# Give test the same columns, in the same order, as X;
# any column missing from test is created and filled with 0
test = test.reindex(columns=X.columns, fill_value=0)

# Standardize every feature column (zero mean, unit variance); fit on train, reuse on test
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
test_scaled = scaler.transform(test)
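
Fitting the scaler on all of `X` before cross-validation means each validation fold has already influenced the scaling statistics. For tree-based models the effect is usually negligible, since they are largely insensitive to monotonic rescaling of features, but wrapping the scaler and the model in a pipeline keeps each fold's scaling confined to its own training portion. The sketch below uses synthetic data from `make_classification` as a stand-in for `X` and `y`; the model settings are illustrative, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed Titanic features and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# The scaler is refit on the training portion of every fold.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)

demo_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"{demo_scores.mean():.4f} (+/- {demo_scores.std():.4f})")
```

The same pipeline object can be passed to a grid search, so tuning and scaling stay inside the cross-validation loop together.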


## 4. Model Training and Evaluation

We compare three tree-based classifiers (Random Forest, Gradient Boosting, and XGBoost) using 5-fold cross-validated accuracy on the processed training data.

In [38]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

models = {
    "Random Forest": RandomForestClassifier(n_estimators=300, max_depth=6, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=300, learning_rate=0.05, max_depth=3, random_state=42),
    "XGBoost": XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=4, random_state=42, eval_metric='logloss')
}

for name, model in models.items():
    scores = cross_val_score(model, X_scaled, y, cv=5, scoring='accuracy')
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")


Random Forest: 0.8159 (+/- 0.0108)
Gradient Boosting: 0.7836 (+/- 0.0850)
XGBoost: 0.7858 (+/- 0.0863)
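
With `cv=5` and a classifier, `cross_val_score` uses stratified folds taken in row order, without shuffling. The wider spread reported for the two boosting models (about 0.085 versus 0.011 for Random Forest) may partly reflect sensitivity to that particular split. Passing an explicit `StratifiedKFold` with shuffling is one way to check. This is a minimal sketch on synthetic data, with illustrative model settings; it does not reproduce the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the processed Titanic features and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Shuffle rows before splitting, keeping class proportions equal across folds.
cv_shuffled = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model = GradientBoostingClassifier(n_estimators=50, random_state=42)
shuffled_scores = cross_val_score(model, X_demo, y_demo, cv=cv_shuffled, scoring="accuracy")
print(f"{shuffled_scores.mean():.4f} (+/- {shuffled_scores.std():.4f})")
```

If the standard deviation drops noticeably under shuffled folds, the original spread was likely driven by row order more than by the model itself.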


## 5. Hyperparameter Tuning and Submission File

Finally, we tune XGBoost with a grid search, use the best estimator to make predictions on the test set, and generate the submission file in the required format.
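
The cell in this section runs an exhaustive grid search over five XGBoost hyperparameters before predicting. Counting the combinations ahead of time gives a sense of the cost: 3 × 4 × 3 × 2 × 2 = 144 candidates, and with 5-fold cross-validation that is 720 model fits, which matches the log line in the output. A short check with scikit-learn's `ParameterGrid`, using the same grid values as the cell:

```python
from sklearn.model_selection import ParameterGrid

# Same values as the param_grid used for the grid search in this section.
param_grid = {
    "n_estimators": [200, 400, 600],
    "max_depth": [3, 4, 5, 6],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}

n_candidates = len(ParameterGrid(param_grid))
n_folds = 5
print(n_candidates, n_candidates * n_folds)  # 144 720
```

Adding one more value to any list multiplies the total, so it is worth running this count before widening the grid.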

In [39]:
from sklearn.model_selection import GridSearchCV

xgb = XGBClassifier(eval_metric='logloss', random_state=42)

param_grid = {
    'n_estimators': [200, 400, 600],
    'max_depth': [3, 4, 5, 6],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

grid = GridSearchCV(xgb, param_grid, cv=5, scoring='accuracy', verbose=1, n_jobs=-1)
grid.fit(X_scaled, y)

print("✅ Best parameters:", grid.best_params_)
print("✅ Best score:", grid.best_score_)
best_model = grid.best_estimator_

# Create the final submission file
predictions = best_model.predict(test_scaled)

submission = pd.DataFrame({
    'PassengerId': passenger_id,
    'Survived': predictions
})

submission.to_csv('submission_final.csv', index=False)
print("✅ submission_final.csv created successfully!")
submission.head()


Fitting 5 folds for each of 144 candidates, totalling 720 fits
✅ Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 200, 'subsample': 0.8}
✅ Best score: 0.819327098110602
✅ submission_final.csv created successfully!


Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1
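
Before uploading, a few structural checks can catch common submission mistakes. The helper below, `check_submission`, is a hypothetical name introduced for this sketch; it assumes the two-column format written above (`PassengerId` and `Survived`) with 0/1 predictions. The demo frame reuses the five rows shown in the preview.

```python
import pandas as pd

def check_submission(submission, expected_rows=None):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if list(submission.columns) != ["PassengerId", "Survived"]:
        # Later checks rely on these columns, so stop here.
        problems.append("columns must be exactly PassengerId, Survived")
        return problems
    if submission.isna().any().any():
        problems.append("missing values present")
    if not submission["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    if submission["PassengerId"].duplicated().any():
        problems.append("duplicate PassengerId values")
    if expected_rows is not None and len(submission) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(submission)}")
    return problems

# The five preview rows from the submission above.
demo = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895, 896],
    "Survived": [0, 0, 0, 0, 1],
})
print(check_submission(demo, expected_rows=5))  # []
```

On the real file, `expected_rows` would be the number of rows in `test.csv`.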
