
# Baseline Model Titanic

## Importing the libraries

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import warnings
warnings.filterwarnings('ignore')


## Importing the dataset

In [None]:
# Load Titanic data
df = pd.read_csv('/content/train.csv')

# Drop rows where 'Embarked' is missing (only 2 rows)
df = df.dropna(subset=['Embarked'])


## Model to predict the missing Age values (about 20% of the rows) from the other variables

In [None]:
# Features to help predict Age
features_for_age = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked']

# Create a temporary DataFrame with one-hot encoding
df_temp = pd.get_dummies(df[features_for_age + ['Age']], drop_first=True)

# Split into known and missing Age
df_age_known = df_temp[df_temp['Age'].notnull()]
df_age_missing = df_temp[df_temp['Age'].isnull()]

# Train the model
X_age = df_age_known.drop('Age', axis=1)
y_age = df_age_known['Age']

rfr = RandomForestRegressor(n_estimators=100, random_state=42)
rfr.fit(X_age, y_age)

# Predict missing ages
X_missing_age = df_age_missing.drop('Age', axis=1)
predicted_ages = rfr.predict(X_missing_age)

# Fill back into the original DataFrame
df.loc[df['Age'].isnull(), 'Age'] = predicted_ages
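
One way to sanity-check an imputer like this is to compare its cross-validated error against a trivial median fill. The sketch below is only an illustration: it runs on a synthetic frame (`toy`, with made-up values and a hypothetical link between `Pclass` and `Age`), not on the Titanic data, and the helper `cv_mae` is our own name.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the real Titanic rows)
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    'Pclass': rng.integers(1, 4, n),
    'SibSp': rng.integers(0, 3, n),
    'Parch': rng.integers(0, 3, n),
    'Fare': rng.uniform(5, 100, n),
})
# Made-up relationship: higher classes skew older, plus noise
toy['Age'] = 45 - 7 * toy['Pclass'] + rng.normal(0, 8, n)

X_toy = toy.drop(columns='Age')
y_toy = toy['Age']

def cv_mae(model):
    """Mean absolute error averaged over 5 cross-validation folds."""
    scores = cross_val_score(model, X_toy, y_toy, cv=5,
                             scoring='neg_mean_absolute_error')
    return -scores.mean()

mae_forest = cv_mae(RandomForestRegressor(n_estimators=50, random_state=42))
mae_median = cv_mae(DummyRegressor(strategy='median'))

print(f"Random forest MAE: {mae_forest:.2f} years")
print(f"Median fill MAE:   {mae_median:.2f} years")
```

If the forest does not clearly beat the median fill on the real rows with known `Age`, the simpler imputation may be good enough.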


## Preprocessing pipeline

In [None]:
# Define features and target
X = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
y = df['Survived']

# Column groups
numeric_features = ['Age', 'SibSp', 'Parch', 'Fare']
categorical_features = ['Pclass', 'Sex', 'Embarked']

# Pipelines
numeric_pipeline = Pipeline([
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine pipelines
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_features),
    ('cat', categorical_pipeline, categorical_features)
])

# Apply transformations (for inspection only; the models below refit the preprocessor inside their own pipelines)
X_prepro = preprocessor.fit_transform(X)

# Get column names
num_cols = numeric_features
cat_cols = preprocessor.named_transformers_['cat']['encoder'].get_feature_names_out(categorical_features)
all_columns = list(num_cols) + list(cat_cols)

# Wrap in DataFrame
X_final = pd.DataFrame(X_prepro, columns=all_columns, index=X.index)  # keep the original row index

# Optional: display the result
print(X_final.head())


        Age     SibSp     Parch      Fare  Pclass_1  Pclass_2  Pclass_3  \
0 -0.545560  0.431350 -0.474326 -0.500240       0.0       0.0       1.0   
1  0.619174  0.431350 -0.474326  0.788947       1.0       0.0       0.0   
2 -0.254377 -0.475199 -0.474326 -0.486650       0.0       0.0       1.0   
3  0.400786  0.431350 -0.474326  0.422861       1.0       0.0       0.0   
4  0.400786 -0.475199 -0.474326 -0.484133       0.0       0.0       1.0   

   Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S  
0         0.0       1.0         0.0         0.0         1.0  
1         1.0       0.0         1.0         0.0         0.0  
2         1.0       0.0         0.0         0.0         1.0  
3         1.0       0.0         0.0         0.0         1.0  
4         0.0       1.0         0.0         0.0         1.0  


## Import the models

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

## Define models and their hyperparameters

In [None]:
model_params = {
    'SVC': {
        'model': SVC(probability=True, random_state=42),
        'params': {
            'model__C': [0.1, 1, 10],
            'model__kernel': ['linear', 'rbf']
        }
    },
    'RandomForest': {
        'model': RandomForestClassifier(random_state=42),
        'params': {
            'model__n_estimators': [100, 200],
            'model__max_depth': [None, 5, 10]
        }
    },
    'XGBoost': {
        'model': XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42),
        'params': {
            'model__n_estimators': [100, 200],
            'model__max_depth': [3, 5],
            'model__learning_rate': [0.05, 0.1]
        }
    },
    'MLP': {
        'model': MLPClassifier(max_iter=500, random_state=42),
        'params': {
            'model__hidden_layer_sizes': [(100,), (50, 50)],
            'model__alpha': [0.0001, 0.001]
        }
    },
    'KNN': {
        'model': KNeighborsClassifier(),
        'params': {
            'model__n_neighbors': [3, 5, 7],
            'model__weights': ['uniform', 'distance']
        }
    },
    'DecisionTree': {
        'model': DecisionTreeClassifier(random_state=42),
        'params': {
            'model__max_depth': [None, 5, 10],
            'model__criterion': ['gini', 'entropy']
        }
    },
    'ExtraTrees': {
        'model': ExtraTreesClassifier(random_state=42),
        'params': {
            'model__n_estimators': [100, 200],
            'model__max_depth': [None, 5, 10]
        }
    },
    'GradientBoosting': {
        'model': GradientBoostingClassifier(random_state=42),
        'params': {
            'model__n_estimators': [100, 200],
            'model__learning_rate': [0.05, 0.1],
            'model__max_depth': [3, 5]
        }
    },
    'Voting': {
        'model': VotingClassifier(
            estimators=[
                ('svc', SVC(probability=True, random_state=42)),
                ('rf', RandomForestClassifier(random_state=42)),
                ('xgb', XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)),
                ('mlp', MLPClassifier(max_iter=500, random_state=42))
            ],
            voting='soft'
        ),
        'params': {}  # No hyperparameters to tune here directly
    }
}
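
The `Voting` entry is left untuned here, but its members could be tuned through nested parameter names of the form `<pipeline step>__<estimator name>__<parameter>`. The sketch below is a minimal illustration on synthetic data from `make_classification`, with only two members and a deliberately tiny grid. The names `voting_pipe`, `voting_grid` and `voting_search` are hypothetical and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, for illustration only
X_toy, y_toy = make_classification(n_samples=120, n_features=6, random_state=0)

voting_pipe = Pipeline([
    ('preprocessing', StandardScaler()),
    ('model', VotingClassifier(
        estimators=[
            ('svc', SVC(probability=True, random_state=42)),
            ('rf', RandomForestClassifier(n_estimators=20, random_state=42)),
        ],
        voting='soft',
    )),
])

# 'model' is the pipeline step, 'svc' / 'rf' are the names given to the members
voting_grid = {
    'model__svc__C': [0.1, 1],
    'model__rf__max_depth': [3, None],
}

voting_search = GridSearchCV(voting_pipe, voting_grid, cv=3, scoring='accuracy')
voting_search.fit(X_toy, y_toy)

print(voting_search.best_params_)
```

The grid grows multiplicatively with every member that is tuned, so a joint search over all four members would be noticeably slower than the per-model searches.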

## Split the data

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
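
The split above is purely random. A possible variant is to pass `stratify=y` so that the survival rate is roughly the same in both parts. The sketch below shows the effect on synthetic labels (`y_toy`, 38% positives, chosen to be close to the survival rate in the Titanic training data); it is an illustration, not the split used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels: 38 positives out of 100 (illustrative values)
y_toy = pd.Series([1] * 38 + [0] * 62, name='Survived')
X_toy = pd.DataFrame({'feature': np.arange(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=42,
    stratify=y_toy,  # keep the class proportions similar in train and test
)

print(f"Positive rate overall: {y_toy.mean():.2f}")
print(f"Positive rate train:   {y_tr.mean():.2f}")
print(f"Positive rate test:    {y_te.mean():.2f}")
```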


## Loop over all the models to find the best parameters with cross-validation

In [None]:
from sklearn.model_selection import GridSearchCV

best_models = {}

for name, mp in model_params.items():
    pipe = Pipeline([
        ('preprocessing', preprocessor),
        ('model', mp['model'])
    ])

    clf = GridSearchCV(pipe, mp['params'], cv=5, n_jobs=-1, scoring='accuracy')
    clf.fit(X_train, y_train)

    print(f"Best parameters for {name}: {clf.best_params_}")
    print(f"Validation Accuracy for {name}: {clf.best_score_:.4f}")

    best_models[name] = clf.best_estimator_

Best parameters for SVC: {'model__C': 10, 'model__kernel': 'rbf'}
Validation Accuracy for SVC: 0.8298
Best parameters for RandomForest: {'model__max_depth': 10, 'model__n_estimators': 200}
Validation Accuracy for RandomForest: 0.8368
Best parameters for XGBoost: {'model__learning_rate': 0.05, 'model__max_depth': 5, 'model__n_estimators': 100}
Validation Accuracy for XGBoost: 0.8368
Best parameters for MLP: {'model__alpha': 0.001, 'model__hidden_layer_sizes': (100,)}
Validation Accuracy for MLP: 0.8340
Best parameters for KNN: {'model__n_neighbors': 3, 'model__weights': 'uniform'}
Validation Accuracy for KNN: 0.8298
Best parameters for DecisionTree: {'model__criterion': 'entropy', 'model__max_depth': 5}
Validation Accuracy for DecisionTree: 0.8144
Best parameters for ExtraTrees: {'model__max_depth': 10, 'model__n_estimators': 100}
Validation Accuracy for ExtraTrees: 0.8354
Best parameters for GradientBoosting: {'model__learning_rate': 0.05, 'model__max_depth': 5, 'model__n_estimators': 
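
The printed scores are means over the 5 folds. To see how much they vary between folds, the mean and standard deviation of the best configuration can be read from `cv_results_` and collected in one table. The sketch below is a minimal illustration on synthetic data with two of the simpler models and no preprocessing pipeline; `toy_models`, `rows` and `summary` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
X_toy, y_toy = make_classification(n_samples=150, n_features=6, random_state=0)

toy_models = {
    'KNN': (KNeighborsClassifier(), {'n_neighbors': [3, 5]}),
    'DecisionTree': (DecisionTreeClassifier(random_state=42), {'max_depth': [3, 5]}),
}

rows = []
for name, (model, grid) in toy_models.items():
    search = GridSearchCV(model, grid, cv=5, scoring='accuracy')
    search.fit(X_toy, y_toy)
    best = search.best_index_  # position of the best configuration in cv_results_
    rows.append({
        'model': name,
        'cv_mean': search.cv_results_['mean_test_score'][best],
        'cv_std': search.cv_results_['std_test_score'][best],
    })

# One row per model, best cross-validated accuracy first
summary = pd.DataFrame(rows).sort_values('cv_mean', ascending=False)
print(summary.to_string(index=False))
```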

## Display accuracy with the best parameters for each model

In [None]:
from sklearn.metrics import accuracy_score

# Evaluate each tuned model on the hold-out set
for name, model in best_models.items():
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"Test accuracy for {name}: {acc:.4f}")

Test accuracy for SVC: 0.7921
Test accuracy for RandomForest: 0.7978
Test accuracy for XGBoost: 0.7921
Test accuracy for MLP: 0.8034
Test accuracy for KNN: 0.7640
Test accuracy for DecisionTree: 0.7865
Test accuracy for ExtraTrees: 0.7978
Test accuracy for GradientBoosting: 0.7978
Test accuracy for Voting: 0.8202


## 📊 Baseline Model Review

Before introducing any feature engineering, we tested a variety of models using only the original features (`Pclass`, `Sex`, `Age`, `SibSp`, `Parch`, `Fare`, and `Embarked`). Each model was tuned with 5-fold `GridSearchCV` on the training split and then evaluated on a 20% hold-out test set.

### 🔍 Performance Summary

| Model              | Test Accuracy |
|-------------------|----------------|
| SVC               | 0.7921         |
| RandomForest      | 0.7978         |
| XGBoost           | 0.7921         |
| MLP               | 0.8034         |
| KNN               | 0.7640         |
| DecisionTree      | 0.7865         |
| ExtraTrees        | 0.7978         |
| GradientBoosting  | 0.7978         |
| **VotingClassifier** | **0.8202**     |

- ✅ The **VotingClassifier** with soft voting performed best, reaching **82.02% accuracy**.
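- 🔁 However, the gap to the next-best model (MLP, 80.34%) is small and every model lands between 76% and 82%, which suggests the original features are close to a performance ceiling.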
- 🧠 To improve further, we now explore **feature engineering** to extract more signal from the data.
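
To put the gaps in the table into perspective, the accuracies can be converted back into counts of correctly classified passengers. The sketch below assumes a hold-out set of 178 rows, which is what a 20% split yields if `train.csv` is the standard 891-row file minus the 2 rows dropped for missing `Embarked`. The standard error uses the usual binomial approximation and is only a rough guide.

```python
import math

n_test = 178  # assumption: 20% of the 889 rows left after dropping missing 'Embarked'
acc_voting = 0.8202
acc_mlp = 0.8034

# Convert accuracies back to numbers of correct predictions
correct_voting = round(acc_voting * n_test)
correct_mlp = round(acc_mlp * n_test)

# Binomial standard error of an accuracy estimated on n_test rows
se_voting = math.sqrt(acc_voting * (1 - acc_voting) / n_test)
half_width = 1.96 * se_voting  # approximate 95% interval half-width

print(f"Voting: {correct_voting} correct, MLP: {correct_mlp} correct")
print(f"Difference: {correct_voting - correct_mlp} passengers")
print(f"Approximate 95% interval for Voting: {acc_voting:.4f} +/- {half_width:.4f}")
```

Under these assumptions the top two models differ by only a handful of passengers, so the ranking among the leading models should be read with some caution.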

---

### ⏭️ Next: Feature Engineering

We'll now enrich the dataset with new features like:
- Titles extracted from passenger names
- Family size & solo travelers
- Cabin deck categories
- Ticket groupings or fare bins

These can help our models capture deeper patterns in the data.
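
As a rough preview, the features listed above could be derived along these lines. This is a minimal sketch on a few illustrative rows in the Titanic column layout. The new column names (`Title`, `FamilySize`, `IsAlone`, `Deck`, `FareBin`), the `'U'` placeholder for unknown decks and the two fare bins are our own choices for the example, not a fixed recipe.

```python
import pandas as pd

# A few illustrative rows in the Titanic column layout
toy = pd.DataFrame({
    'Name': [
        'Braund, Mr. Owen Harris',
        'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
        'Heikkinen, Miss. Laina',
        'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
    ],
    'SibSp': [1, 1, 0, 1],
    'Parch': [0, 0, 0, 0],
    'Cabin': [None, 'C85', None, 'C123'],
    'Fare': [7.25, 71.2833, 7.925, 53.1],
})

# Title: the text between the comma and the first period in the name
toy['Title'] = toy['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

# Family size counts the passenger plus siblings/spouses and parents/children
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1
toy['IsAlone'] = (toy['FamilySize'] == 1).astype(int)

# Deck: first letter of the cabin, 'U' (unknown) when the cabin is missing
toy['Deck'] = toy['Cabin'].str[0].fillna('U')

# Fare bins: quantile-based split (two bins here because the sample is tiny)
toy['FareBin'] = pd.qcut(toy['Fare'], q=2, labels=['low', 'high'])

print(toy[['Title', 'FamilySize', 'IsAlone', 'Deck', 'FareBin']])
```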
