
The goal is to predict whether a passenger survived the sinking of the Titanic (the Titanic challenge on Kaggle), based on attributes such as age, gender, passenger class, port of embarkation, and so on.

- First, log in to Kaggle and go to the Titanic challenge to download train.csv and test.csv. The data is already split into a training set and a test set. However, the test data does not contain the labels: your goal is to train the best possible model on the training data, then make predictions on the test data and upload them to Kaggle to see your final score.

- Then, examine the attributes of the training set. Some attributes contain missing data: depending on how much is missing, you can either fill in the gaps (imputation) or leave the attribute out of the model.

- Next, evaluate the performance of some models seen in class.

- To improve this result, you can:
    - Compare more models and tune hyperparameters;
    - Do some preprocessing on the features, for example:
        - replace SibSp and Parch by their sum,
        - try to identify parts of names that correlate well with the Survived attribute (for example, if the name contains "Countess", then survival seems more likely),
        - try to convert numeric attributes into categorical attributes: for example, different age groups had very different survival rates, so it might help to create an age category and use that instead of age. Similarly, it might be useful to have a special category for people traveling alone, since only 30% of them survived.
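
The kind of check that motivates these features can be sketched as follows. This is a minimal illustration on made-up rows (the `toy` DataFrame and its values are assumptions, not the Kaggle data); on the real data the same `groupby` would be applied to the training DataFrame.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "Survived": [0, 1, 0, 0, 1, 0],
    "SibSp":    [1, 1, 0, 0, 0, 2],
    "Parch":    [0, 0, 0, 0, 2, 1],
})

# Combine the two family counters, then flag passengers with no relatives aboard.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]
toy["IsAlone"] = (toy["FamilySize"] == 0).astype(int)

# The mean of a 0/1 target within a group is the survival rate of that group.
survival_by_alone = toy.groupby("IsAlone")["Survived"].mean()
print(survival_by_alone)
```

A large gap between the groups suggests the derived attribute carries signal worth giving to the model.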


In [79]:
# Columns
# PassengerId: Unique identifier for each passenger
# Survived: Survival status (0 = No, 1 = Yes)
# Pclass: Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
# Name: Name of the passenger
# Sex: Gender of the passenger
# Age: Age of the passenger in years
# SibSp: Number of siblings or spouses aboard the Titanic
# Parch: Number of parents or children aboard the Titanic
# Ticket: Ticket number
# Fare: Passenger fare
# Cabin: Cabin number (if available)
# Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

In [90]:
import pandas as pd
import re

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

In [None]:
TRAIN_DATA = "data/titanic/train.csv"
TEST_DATA = "data/titanic/test.csv"

In [None]:
df_train = pd.read_csv(TRAIN_DATA)
df_test = pd.read_csv(TEST_DATA)

In [None]:
df_train.shape, df_train.head()

((891, 12),
    PassengerId  Survived  Pclass  \
 0            1         0       3   
 1            2         1       1   
 2            3         1       3   
 3            4         1       1   
 4            5         0       3   
 
                                                 Name     Sex   Age  SibSp  \
 0                            Braund, Mr. Owen Harris    male  22.0      1   
 1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
 2                             Heikkinen, Miss. Laina  female  26.0      0   
 3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
 4                           Allen, Mr. William Henry    male  35.0      0   
 
    Parch            Ticket     Fare Cabin Embarked  
 0      0         A/5 21171   7.2500   NaN        S  
 1      0          PC 17599  71.2833   C85        C  
 2      0  STON/O2. 3101282   7.9250   NaN        S  
 3      0            113803  53.1000  C123        S  
 4      0            373

In [None]:
# count the number of missing values in each column
missing_values = df_train.isnull().sum()
missing_values = missing_values[missing_values > 0]
print("Missing values in training data:\n", missing_values)

Missing values in training data:
 Age         177
Cabin       687
Embarked      2
dtype: int64


### Cleaning the data

In [None]:
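# keep the test PassengerId values: the Kaggle submission file needs them
test_ids = df_test['PassengerId']
# drop the identifiers (PassengerId, Ticket) and Cabin, which is missing for 687 of 891 training rows
df_train = df_train.drop(columns=['PassengerId', 'Ticket', 'Cabin'])
df_test = df_test.drop(columns=['PassengerId', 'Ticket', 'Cabin'])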

In [None]:
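# Embarked is a categorical feature, so we can fill missing values with the mode (most frequent value).
# The mode is computed on the training set and reused for the test set.
# Assigning the result back avoids the chained `inplace=True` call, which recent pandas versions warn about.
embarked_mode = df_train['Embarked'].mode()[0]
df_train['Embarked'] = df_train['Embarked'].fillna(embarked_mode)
df_test['Embarked'] = df_test['Embarked'].fillna(embarked_mode)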

In [None]:
# Age is a numerical feature, so we can fill missing values with the median age.
# The median is computed on the training set and reused for the test set.
age_median = df_train['Age'].median()
df_train['Age'] = df_train['Age'].fillna(age_median)
df_test['Age'] = df_test['Age'].fillna(age_median)

In [None]:
df_train.isnull().sum()


Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

### Feature Engineering

In [None]:
df_train['FamilySize'] = df_train['SibSp'] + df_train['Parch']
df_test['FamilySize'] = df_test['SibSp'] + df_test['Parch']

df_train['IsAlone'] = (df_train['FamilySize'] == 0).astype(int)
df_test['IsAlone'] = (df_test['FamilySize'] == 0).astype(int)

In [None]:
rare_titles = ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 
               'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']

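# the title is the word that follows a space and ends with a dot (e.g. " Mr.");
# a raw string keeps "\." from being treated as an invalid escape sequence
df_train['Title'] = df_train['Name'].apply(lambda name: re.search(r' ([A-Za-z]+)\.', name).group(1))
df_test['Title'] = df_test['Name'].apply(lambda name: re.search(r' ([A-Za-z]+)\.', name).group(1))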

df_train['Title'] = df_train['Title'].replace(rare_titles, 'Rare')
df_train['Title'] = df_train['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})

df_test['Title'] = df_test['Title'].replace(rare_titles, 'Rare')
df_test['Title'] = df_test['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})
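
To see what the title pattern captures, here is a small standalone sketch. The helper name `extract_title` and its `default` argument are hypothetical additions for illustration; the notebook itself applies the pattern through a lambda and assumes every name contains a title.

```python
import re

# Same pattern as the cell above, as a raw string: a space, letters, then a literal dot.
TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")

def extract_title(name, default="Unknown"):
    """Return the title found in a passenger name, or `default` if none matches."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else default

print(extract_title("Braund, Mr. Owen Harris"))
print(extract_title("Heikkinen, Miss. Laina"))
print(extract_title("No title here"))
```

Returning a default instead of calling `.group(1)` directly avoids an `AttributeError` if a name without a title ever appears.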

In [None]:
# pandas.cut: "Bin values into discrete intervals". (from pandas documentation)
df_train['AgeGroup'] = pd.cut(df_train['Age'], bins=[0, 12, 18, 35, 60, 100],
                              labels=['Child', 'Teen', 'Adult', 'MiddleAge', 'Senior'])
df_test['AgeGroup'] = pd.cut(df_test['Age'], bins=[0, 12, 18, 35, 60, 100],
                             labels=['Child', 'Teen', 'Adult', 'MiddleAge', 'Senior'])
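
One detail of `pd.cut` worth keeping in mind: by default the bins are open on the left and closed on the right, so a value equal to the lowest edge falls outside every bin. The sketch below uses made-up ages (the `ages` Series is an assumption for illustration) with the same edges and labels as above.

```python
import pandas as pd

bins = [0, 12, 18, 35, 60, 100]
labels = ['Child', 'Teen', 'Adult', 'MiddleAge', 'Senior']

# Made-up ages chosen to sit on or near the bin edges.
ages = pd.Series([0.0, 12.0, 12.5, 18.0, 35.0, 60.0, 80.0])

# Default behaviour: intervals are (0, 12], (12, 18], ... so 0.0 gets no group.
groups = pd.cut(ages, bins=bins, labels=labels)
print(groups.tolist())

# include_lowest=True makes the first interval closed on the left as well.
groups_incl = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)
print(groups_incl.tolist())
```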

In [None]:
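# pandas.qcut: "Quantile-based discretization function". (from pandas documentation)
fare_labels = ['LowFare', 'MidFare', 'HighFare', 'VeryHighFare']
# compute the quartile edges on the training set only, then reuse them for the test set,
# so that a given fare falls into the same group in both sets
df_train['FareGroup'], fare_bins = pd.qcut(df_train['Fare'], 4, labels=fare_labels, retbins=True)
fare_bins[0], fare_bins[-1] = -np.inf, np.inf  # cover test fares outside the training range
df_test['FareGroup'] = pd.cut(df_test['Fare'], bins=fare_bins, labels=fare_labels)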

In [None]:
# removing unused features
df_train = df_train.drop(columns=['Name', 'SibSp', 'Parch', 'Age', 'Fare'])
df_test = df_test.drop(columns=['Name', 'SibSp', 'Parch', 'Age', 'Fare'])

In [None]:
df_train.head()

   Survived  Pclass     Sex  Embarked  FamilySize  IsAlone  Title   AgeGroup     FareGroup
0         0       3    male         S           1        0     Mr      Adult       LowFare
1         1       1  female         C           1        0    Mrs  MiddleAge  VeryHighFare
2         1       3  female         S           0        1   Miss      Adult       MidFare
3         1       1  female         S           1        0    Mrs      Adult  VeryHighFare
4         0       3    male         S           0        1     Mr      Adult       MidFare


### Preprocessing Data

- Why use pandas.get_dummies() instead of sklearn.preprocessing.LabelEncoder()?
    - pandas.get_dummies() splits each categorical attribute into binary columns, one per class (with drop_first=True, the first class is dropped and is represented by all zeros). Label encoding would instead assign a single integer to each class, creating an ordinal relationship that does not exist in the data.
    - This matters for models that interpret feature values numerically, such as logistic regression, SVM, or KNN: they would treat the integer codes as magnitudes or distances.
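
The difference between the two encodings can be seen on a tiny example. This is a sketch on a made-up `ports` DataFrame (an assumption for illustration); the integer codes are produced with pandas category codes, which number classes in sorted order much like a label encoder would.

```python
import pandas as pd

# Made-up embarkation ports, for illustration only.
ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Integer codes: C -> 0, Q -> 1, S -> 2 (alphabetical), which implies C < Q < S.
codes = ports["Embarked"].astype("category").cat.codes
print(codes.tolist())

# One-hot encoding: one binary column per port, with no implied order.
one_hot = pd.get_dummies(ports, columns=["Embarked"], dtype=int)
print(one_hot)

# drop_first=True removes the first class (C); a row of all zeros then means C.
one_hot_reduced = pd.get_dummies(ports, columns=["Embarked"], drop_first=True, dtype=int)
print(one_hot_reduced)
```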

In [None]:
categorical = ['Sex', 'Embarked', 'Title', 'AgeGroup', 'FareGroup']

df_train_encoded = pd.get_dummies(df_train, columns=categorical, drop_first=True)
df_test_encoded = pd.get_dummies(df_test, columns=categorical, drop_first=True)

df_train_encoded.head()

   Survived  Pclass  FamilySize  IsAlone  Sex_male  Embarked_Q  Embarked_S  \
0         0       3           1        0         1           0           1
1         1       1           1        0         0           0           0
2         1       3           0        1         0           0           1
3         1       1           1        0         0           0           1
4         0       3           0        1         1           0           1

   Title_Miss  Title_Mr  Title_Mrs  Title_Rare  AgeGroup_Teen  AgeGroup_Adult  \
0           0         1          0           0              0               1
1           0         0          1           0              0               0
2           1         0          0           0              0               1
3           0         0          1           0              0               1
4           0         1          0           0              0               1

   AgeGroup_MiddleAge  AgeGroup_Senior  FareGroup_MidFare  FareGroup_HighFare  FareGroup_VeryHighFare
0                   0                0                  0                   0                       0
1                   1                0                  0                   0                       1
2                   0                0                  1                   0                       0
3                   0                0                  0                   0                       1
4                   0                0                  1                   0                       0


In [None]:
df_train_encoded.isnull().sum().sort_values(ascending=False).head()


Survived              0
Pclass                0
FareGroup_HighFare    0
FareGroup_MidFare     0
AgeGroup_Senior       0
dtype: int64

In [None]:
y_train = df_train_encoded['Survived']
X_train = df_train_encoded.drop(columns=['Survived'])

X_test = df_test_encoded
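
Because `get_dummies` is called separately on the two sets, their columns are only guaranteed to match if every category appears in both. A defensive option, sketched here on made-up frames (`train_toy` and `test_toy` are assumptions for illustration), is to reindex the test columns on the training columns.

```python
import pandas as pd

# Made-up frames: the 'Rare' title only occurs in the training data.
train_toy = pd.DataFrame({"Title": ["Mr", "Mrs", "Rare"], "Pclass": [3, 1, 1]})
test_toy = pd.DataFrame({"Title": ["Mr", "Mrs"], "Pclass": [2, 3]})

X_train_toy = pd.get_dummies(train_toy, columns=["Title"], dtype=int)
X_test_toy = pd.get_dummies(test_toy, columns=["Title"], dtype=int)

# Reorder the test columns to match training; categories missing from the
# test set become all-zero columns, and columns unknown to training are dropped.
X_test_aligned = X_test_toy.reindex(columns=X_train_toy.columns, fill_value=0)
print(X_test_aligned)
```

In the notebook, the equivalent call would presumably be applied to `df_test_encoded` using `X_train.columns`.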


### Model Training and Evaluation

In [None]:
# Logistic regression
log_reg = LogisticRegression(max_iter=1000)
param_grid_lr = {
    'C': [0.01, 0.1, 1, 10],
    'solver': ['liblinear', 'lbfgs']
}

grid_lr = GridSearchCV(log_reg, param_grid_lr, cv=5, scoring='accuracy')
grid_lr.fit(X_train, y_train)

print("Best Parameters Logistic Regression:", grid_lr.best_params_)
print("Best Accuracy:", grid_lr.best_score_)

Best Parameters Logistic Regression: {'C': 1, 'solver': 'lbfgs'}
Best Accuracy: 0.8260435628648548


In [None]:
# Random forest
rf = RandomForestClassifier(random_state=42)
param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

grid_rf = GridSearchCV(rf, param_grid_rf, cv=5, scoring='accuracy', n_jobs=-1)
grid_rf.fit(X_train, y_train)

print("Best Parameters Random Forest:", grid_rf.best_params_)
print("Best Accuracy:", grid_rf.best_score_)


Best Parameters Random Forest: {'max_depth': 5, 'min_samples_split': 10, 'n_estimators': 100}
Best Accuracy: 0.8248885820099178


In [86]:
# Gradient boosting
gb = GradientBoostingClassifier(random_state=42)
param_grid_gb = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 5]
}

grid_gb = GridSearchCV(gb, param_grid_gb, cv=5, scoring='accuracy', n_jobs=-1)
grid_gb.fit(X_train, y_train)

print("Best Parameters Gradient Boosting:", grid_gb.best_params_)
print("Best Accuracy:", grid_gb.best_score_)


Best Parameters Gradient Boosting: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200}
Best Accuracy: 0.8249011361496453


In [92]:
# K-Nearest Neighbors
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)


knn = KNeighborsClassifier()
param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']  # 'minkowski' with its default p=2 is identical to 'euclidean'
}

grid_knn = GridSearchCV(knn, param_grid_knn, cv=5, scoring='accuracy', n_jobs=-1)
grid_knn.fit(X_scaled, y_train)

print("Best Parameters KNN:", grid_knn.best_params_)
print("Best Accuracy:", grid_knn.best_score_)


Best Parameters KNN: {'metric': 'manhattan', 'n_neighbors': 7, 'weights': 'uniform'}
Best Accuracy: 0.8047391877471595
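
One caveat with the KNN cell: the scaler is fitted on the whole training set before cross-validation, so each validation fold influences the scaling statistics, and any data passed to `predict` later has to be scaled by hand. A possible alternative, sketched here on synthetic data (`X_demo` and `y_demo` are stand-ins, not the Titanic features), is to put the scaler and the classifier in a scikit-learn `Pipeline` and search over that.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train, only so the sketch runs on its own.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# The scaler is refitted on the training part of every CV fold.
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# Step parameters are addressed as "<step name>__<parameter>".
param_grid = {"knn__n_neighbors": [3, 5, 7], "knn__weights": ["uniform", "distance"]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)
print(search.best_params_)
print(search.best_score_)

# The fitted pipeline scales new data with the training statistics before predicting.
preds = search.predict(X_demo[:5])
```

With this setup the reported parameter names carry the `knn__` prefix, and the cross-validated accuracy may differ slightly from the value printed above.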


In [93]:
#TODO: send submission to Kaggle
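
A possible shape for this last step, as a hedged sketch: Kaggle expects a CSV with a `PassengerId` column and a `Survived` column for the Titanic challenge. The helper `build_submission` is hypothetical, and the ids and predictions below are made up; in the notebook they would come from the `PassengerId` column of test.csv (which the cleaning step drops from `df_test`, so it has to be kept aside or re-read) and from one of the fitted searches, for example `grid_lr.predict(X_test)`.

```python
import pandas as pd

def build_submission(passenger_ids, predictions):
    """Pair each test PassengerId with its predicted Survived label (0 or 1)."""
    return pd.DataFrame({
        "PassengerId": list(passenger_ids),
        "Survived": [int(p) for p in predictions],
    })

# Made-up ids and predictions, for illustration only.
submission = build_submission([892, 893, 894], [0, 1, 0])

# to_csv() without a path returns the CSV text; pass a file name such as
# "submission.csv" instead to write the file that gets uploaded.
csv_text = submission.to_csv(index=False)
print(csv_text)
```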