# Titanic dataset learning model

|          |               |
|:---------|---------------|
| **Name** | Eddie Aguilar |
| **Date** | 03/27/2025    |
| **ID**   | 739352        |

## Instructions

https://www.kaggle.com/datasets/yasserh/titanic-dataset?resource=download


Use the Titanic dataset to fit a model that predicts whether a passenger survives, given their information.

Find the best possible model based on the AUC metric, using K-fold cross-validation with k = 10.

Points to consider:

- Transformations related to each data type
- Pipeline
- KFolds validation
- AUC
- Comparison of three different models: SVC, MLP and LogisticRegression
- Optimization of hyperparameters:
    - SVC with rbf/sigmoid --> C & gamma
    - SVC with poly --> C, gamma & degree
    - MLPClassifier --> hidden_layer_sizes (max 3 layers, 1-30 neurons in each layer)
    - LogisticRegression --> C
- Report with explanations
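
The AUC metric listed above can be read as the probability that a randomly chosen survivor gets a higher score than a randomly chosen non-survivor. A minimal pure-Python sketch of that definition follows; the helper name `pairwise_auc` and the toy scores are illustrative only, and in practice `roc_auc_score` from scikit-learn would be used instead.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. The double loop is quadratic in the
    number of samples, so this is only meant to illustrate the definition.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy example: 3 of the 4 (positive, negative) pairs are ordered correctly.
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why it is a convenient metric for comparing classifiers regardless of the decision threshold.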

In [118]:
import pandas as pd 
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

data = pd.read_csv(r"C:\Users\AgJo413\Documents\GitHub\Lab_std\labstds\Exams\Exam2\Titanic-Dataset.csv")

data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [119]:
data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


Filling missing `Age` values with the mean age

Filling missing `Embarked` values with the mode

Mapping `Sex` to numerical values (male = 1, female = 0)

In [120]:
# Assign the result back to the column instead of calling
# fillna(..., inplace=True) on a column selection, which raises a pandas
# FutureWarning and will stop modifying the DataFrame in pandas 3.0.
data["Age"] = data["Age"].fillna(data["Age"].mean())
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])
data["Sex"] = data["Sex"].map({"male": 1, "female": 0})
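
The same mean/mode imputation can also be expressed with scikit-learn's `SimpleImputer`, which has the advantage that it can later be placed inside a pipeline and fitted on training folds only. The sketch below is an alternative, not what this notebook does, and the toy values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for two Titanic columns (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 36.0],
    "Embarked": pd.Series(["S", "C", np.nan, "S"], dtype=object),
})

# Mean for the numerical column, most frequent value for the categorical one.
age_imputer = SimpleImputer(strategy="mean")
embarked_imputer = SimpleImputer(strategy="most_frequent")

# fit_transform returns a 2-D array, so flatten it before assigning back.
toy["Age"] = age_imputer.fit_transform(toy[["Age"]]).ravel()
toy["Embarked"] = embarked_imputer.fit_transform(toy[["Embarked"]]).ravel()
print(toy)
```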


In [121]:
data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,0.647587,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,0.47799,13.002015,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,0.0,22.0,0.0,0.0,7.9104
50%,446.0,0.0,3.0,1.0,29.699118,0.0,0.0,14.4542
75%,668.5,1.0,3.0,1.0,35.0,1.0,0.0,31.0
max,891.0,1.0,3.0,1.0,80.0,8.0,6.0,512.3292


Splitting `Ticket` into a new `Area` column (the text prefix of the ticket) and a numeric `Ticket_Number` column

Filling missing `Cabin` and `Area` values with "Unknown"

In [122]:
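def extract_area_and_ticket(ticket):
    """Split a ticket string into its text prefix (area) and its number."""
    parts = ticket.split(" ")
    if len(parts) > 1:
        # Prefixed ticket, e.g. "A/5 21171": the first token is the area and
        # the second token is the number (any further tokens are ignored).
        area = parts[0]
        ticket_number = parts[1]
    else:
        # No prefix, e.g. "113803": the whole string is the number.
        area = "Unknown"
        ticket_number = parts[0]
    return area, ticket_number

# Expand the (area, number) tuples into two new columns.
data[["Area", "Ticket_Number"]] = data["Ticket"].apply(extract_area_and_ticket).apply(pd.Series)

# Ticket numbers that cannot be parsed as numbers become 0.
data["Ticket_Number"] = pd.to_numeric(data["Ticket_Number"], errors="coerce").fillna(0).astype(int)

data["Area"] = data["Area"].fillna("Unknown")
data["Cabin"] = data["Cabin"].fillna("Unknown")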
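
Because the function above splits on every space and keeps only the first two tokens, a ticket whose prefix itself contains a space would lose its real number. A possible variant, sketched here with a hypothetical helper `split_ticket` and illustrative ticket strings, splits at the last space instead:

```python
def split_ticket(ticket):
    """Split a ticket string into (prefix, number) at the LAST space."""
    prefix, sep, number = ticket.rpartition(" ")
    if not sep:
        # No space at all: there is no prefix, only a number candidate.
        prefix = "Unknown"
    # Non-numeric remainders (a ticket that is only text) map to 0, the
    # same convention as the fillna(0) used in the cell above.
    return prefix, int(number) if number.isdigit() else 0


print(split_ticket("A/5 21171"))          # ('A/5', 21171)
print(split_ticket("113803"))             # ('Unknown', 113803)
print(split_ticket("STON/O 2. 3101282"))  # ('STON/O 2.', 3101282)
```

Switching to this variant would change the `Area` and `Ticket_Number` features, so the scores reported later in this notebook would need to be recomputed.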

In [123]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Area,Ticket_Number
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,Unknown,S,A/5,21171
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,C,PC,17599
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,Unknown,S,STON/O2.,3101282
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,C123,S,Unknown,113803
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,Unknown,S,Unknown,373450


One-hot encoding the categorical variables

In [124]:
data = pd.get_dummies(data, columns=["Embarked", "Area", "Cabin"], drop_first=True)
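
One-hot encoding `Cabin` directly creates one column per distinct cabin string (columns such as `Cabin_F E69` or `Cabin_F G63` in this notebook's output), and most of those columns describe a single passenger. One possible way to reduce that cardinality, shown here only as a sketch on made-up values and not used in this notebook, is to keep just the deck letter:

```python
import pandas as pd

# Made-up cabin values, with "Unknown" standing in for missing cabins.
cabins = pd.Series(["C85", "Unknown", "C123", "F G73", "E46"])

# Keep "Unknown" as is; otherwise keep only the first character (the deck).
deck = cabins.where(cabins == "Unknown", cabins.str[0])

deck_dummies = pd.get_dummies(deck, prefix="Deck")
print(deck.tolist())                # ['C', 'Unknown', 'C', 'F', 'E']
print(list(deck_dummies.columns))   # ['Deck_C', 'Deck_E', 'Deck_F', 'Deck_Unknown']
```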

In [125]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Cabin_F E69,Cabin_F G63,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Cabin_Unknown
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,...,False,False,False,False,False,False,False,False,False,True
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,...,False,False,False,False,False,False,False,False,False,False
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,...,False,False,False,False,False,False,False,False,False,True
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,...,False,False,False,False,False,False,False,False,False,False
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,...,False,False,False,False,False,False,False,False,False,True


Scaling numerical variables

In [126]:
numerical_cols = ["Pclass", "Age", "SibSp", "Parch", "Fare", "Ticket_Number"]
scaler = StandardScaler()
data[numerical_cols] = scaler.fit_transform(data[numerical_cols])
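
Scaling the full dataset before cross-validation means the held-out fold contributes to the mean and standard deviation used for scaling. The pipelines defined later in this notebook already contain a `StandardScaler`, so scaling can be left entirely to them. A sketch of doing all preprocessing inside the pipeline with a `ColumnTransformer` follows; the toy data and column choices are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a few Titanic-like columns (values are random).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Age": rng.normal(30, 14, n),
    "Fare": rng.exponential(30, n),
    "Embarked": rng.choice(["S", "C", "Q"], n),
})
y_toy = (toy["Fare"] + rng.normal(0, 20, n) > 30).astype(int)

# Scale numerical columns and one-hot encode categorical ones. Inside a
# pipeline, both transformers are re-fitted on the training part of each fold.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Embarked"]),
])
pipe = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipe, toy, y_toy, scoring="roc_auc", cv=cv)
print(scores.mean())
```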

In [127]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Cabin_F E69,Cabin_F G63,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Cabin_Unknown
0,1,0,0.827377,"Braund, Mr. Owen Harris",1,-0.592481,0.432793,-0.473674,A/5 21171,-0.502445,...,False,False,False,False,False,False,False,False,False,True
1,2,1,-1.566107,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,0.638789,0.432793,-0.473674,PC 17599,0.786845,...,False,False,False,False,False,False,False,False,False,False
2,3,1,0.827377,"Heikkinen, Miss. Laina",0,-0.284663,-0.474545,-0.473674,STON/O2. 3101282,-0.488854,...,False,False,False,False,False,False,False,False,False,True
3,4,1,-1.566107,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,0.407926,0.432793,-0.473674,113803,0.42073,...,False,False,False,False,False,False,False,False,False,False
4,5,0,0.827377,"Allen, Mr. William Henry",1,0.407926,-0.474545,-0.473674,373450,-0.486337,...,False,False,False,False,False,False,False,False,False,True


Building the feature matrix and the target

In [128]:
X = data.drop(columns=["Survived", "PassengerId", "Name", "Ticket"])
y = data['Survived']


Defining the K-fold splitter with k = 10

In [129]:
kf = KFold(n_splits=10, shuffle=True, random_state=42)
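
To make the split concrete, the sketch below applies the same `KFold` settings to 891 row indices (the row count reported by `describe()` above) and lists the size of each validation fold. The names `kf_demo` and `indices` are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

kf_demo = KFold(n_splits=10, shuffle=True, random_state=42)
indices = np.arange(891)  # one index per row of the dataset

# Each split yields (train_indices, validation_indices).
test_sizes = [len(test_idx) for _, test_idx in kf_demo.split(indices)]
print(test_sizes)

# Every row is used for validation exactly once across the 10 folds.
all_test = np.concatenate([test_idx for _, test_idx in kf_demo.split(indices)])
print(len(np.unique(all_test)))
```

Since 891 is not divisible by 10, one fold holds 90 rows and the other nine hold 89. `KFold` does not preserve the class ratio in each fold; `StratifiedKFold` is the scikit-learn splitter that does, if that is ever needed.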

SVC model with RBF kernel: building the pipeline and the hyperparameter grid

In [130]:
svc_rbf = SVC(kernel='rbf', class_weight='balanced', probability=True)
svc_param_grid_rbf = {
    'svc__C': [1, 5, 10, 50],
    'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]
}
model_svc_rbf = make_pipeline(StandardScaler(), svc_rbf)
grid_svc_rbf = GridSearchCV(model_svc_rbf, svc_param_grid_rbf, scoring='roc_auc', cv=kf)
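
The `svc__` prefix in the grid keys comes from how `make_pipeline` names its steps: each step is named after its class in lowercase, and `<step>__<param>` addresses a parameter of that step. A small check, using an illustrative `demo_pipe`:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

demo_pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Step names are generated automatically from the class names.
step_names = [name for name, _ in demo_pipe.steps]
print(step_names)  # ['standardscaler', 'svc']

# Valid grid keys are exactly the keys returned by get_params().
params = demo_pipe.get_params()
print("svc__C" in params)      # True
print("svc__gamma" in params)  # True
```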

SVC model with polynomial kernel: building the pipeline and the hyperparameter grid

In [131]:
svc_poly = SVC(kernel='poly', class_weight='balanced', probability=True)
svc_param_grid_poly = {
    'svc__C': [1, 5, 10, 50],
    'svc__gamma': [0.0001, 0.0005, 0.001, 0.005],
    'svc__degree': [2, 3, 4]
}
model_svc_poly = make_pipeline(StandardScaler(), svc_poly)
grid_svc_poly = GridSearchCV(model_svc_poly, svc_param_grid_poly, scoring='roc_auc', cv=kf)
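
The cost of a grid search grows with the product of the grid sizes. As a rough sketch, `ParameterGrid` can be used to count the candidates in the polynomial grid above; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

poly_grid = {
    "svc__C": [1, 5, 10, 50],
    "svc__gamma": [0.0001, 0.0005, 0.001, 0.005],
    "svc__degree": [2, 3, 4],
}

n_candidates = len(ParameterGrid(poly_grid))  # 4 * 4 * 3 combinations
n_folds = 10
print(n_candidates, n_candidates * n_folds)  # 48 480
```

With 10 folds this means 480 model fits, plus one final refit of the best candidate on the full data.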

MLP model: building the pipeline and the hyperparameter grid

In [None]:
mlp = MLPClassifier(max_iter=10000)
mlp_param_grid = {
    'mlpclassifier__hidden_layer_sizes': [(10,), (20,), (30,), (10, 10), (20, 10), (30, 10)]
}
model_mlp = make_pipeline(StandardScaler(), mlp)
grid_mlp = GridSearchCV(model_mlp, mlp_param_grid, scoring='roc_auc', cv=kf)
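
The grid above covers one-layer and two-layer networks, while the instructions allow up to three layers. One way to enumerate such architectures programmatically is sketched below; the helper `layer_size_grid` and the neuron options `[10, 20, 30]` are assumptions chosen for illustration, and a larger grid makes the search proportionally slower.

```python
from itertools import product


def layer_size_grid(neuron_options, max_layers=3):
    """Return every tuple of layer sizes with 1..max_layers hidden layers."""
    grid = []
    for depth in range(1, max_layers + 1):
        # product(..., repeat=depth) yields all size combinations for `depth` layers.
        grid.extend(product(neuron_options, repeat=depth))
    return grid


hidden_layer_grid = layer_size_grid([10, 20, 30])
print(len(hidden_layer_grid))  # 39 = 3 + 9 + 27
print(hidden_layer_grid[:4])   # [(10,), (20,), (30,), (10, 10)]
```

The resulting list could be passed as the value of `'mlpclassifier__hidden_layer_sizes'` in the parameter grid.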

Logistic regression model: building the pipeline and the hyperparameter grid

In [133]:
logreg = LogisticRegression(class_weight='balanced')
logreg_param_grid = {'logisticregression__C': [0.1, 1, 10, 100]}
model_logreg = make_pipeline(StandardScaler(), logreg)
grid_logreg = GridSearchCV(model_logreg, logreg_param_grid, scoring='roc_auc', cv=kf)

Fitting the grid searches

In [134]:
grid_svc_rbf.fit(X, y)
grid_svc_poly.fit(X, y)
grid_mlp.fit(X, y)
grid_logreg.fit(X, y)



Selecting the best model based on cross-validated AUC

In [135]:

models = {
    "SVC (RBF)": grid_svc_rbf,
    "SVC (Poly)": grid_svc_poly,
    "MLP": grid_mlp,
    "Logistic Regression": grid_logreg
}

best_model_name = max(models, key=lambda key: models[key].best_score_)
best_model = models[best_model_name]
print("Best model based on AUC:", best_model_name)

Best model based on AUC: SVC (RBF)
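
Printing only the winner hides how close the models are. A comparison table with the mean and standard deviation of the cross-validated AUC for each search can be built from `cv_results_`. The sketch below uses synthetic data and two small stand-in searches so that it runs on its own; with the notebook's objects, the `models` dictionary above would take the place of `searches` and the fitting step would be skipped.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data and small grids, used only to keep the example self-contained.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

searches = {
    "SVC (RBF)": GridSearchCV(
        make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        {"svc__C": [1, 10]}, scoring="roc_auc", cv=cv),
    "Logistic Regression": GridSearchCV(
        make_pipeline(StandardScaler(), LogisticRegression()),
        {"logisticregression__C": [0.1, 1]}, scoring="roc_auc", cv=cv),
}

rows = []
for name, search in searches.items():
    search.fit(X_toy, y_toy)
    i = search.best_index_  # row of cv_results_ for the best candidate
    rows.append({
        "model": name,
        "mean_auc": search.cv_results_["mean_test_score"][i],
        "std_auc": search.cv_results_["std_test_score"][i],
        "best_params": search.best_params_,
    })

summary = pd.DataFrame(rows).sort_values("mean_auc", ascending=False)
print(summary)
```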


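Getting the training AUC of the best model. This score is computed on the same data the model was fitted on, so it is optimistic; the cross-validated AUC used for model selection is stored in `best_model.best_score_`.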

In [136]:
y_pred_proba = best_model.predict_proba(X)[:, 1] 
train_auc = roc_auc_score(y, y_pred_proba)
print("Training AUC of the best model:", train_auc)

Training AUC of the best model: 0.9105390982008755
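
For an estimate that does not reuse the training rows, out-of-fold predictions can be scored instead. The sketch below contrasts the two numbers on synthetic data; the dataset and the hyperparameters `C=10`, `gamma=0.1` are arbitrary assumptions, not the values selected in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma=0.1))

# Training AUC: the model scores the same rows it was fitted on.
model.fit(X_toy, y_toy)
train_auc = roc_auc_score(y_toy, model.decision_function(X_toy))

# Out-of-fold AUC: each row is scored by a model that never saw it.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
oof_scores = cross_val_predict(
    model, X_toy, y_toy, cv=cv, method="decision_function")
oof_auc = roc_auc_score(y_toy, oof_scores)

print("Training AUC:", train_auc)
print("Out-of-fold AUC:", oof_auc)
```

The out-of-fold AUC pools all predictions before scoring, so it can differ slightly from `best_score_`, which averages the AUC computed separately on each fold.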
