# Titanic - Machine Learning from Disaster

Download ***only the training set*** from the following link: https://www.kaggle.com/competitions/titanic/data

Divide the training set into train and test subsets later, when needed.
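
The split mentioned above could look roughly like the sketch below. It assumes scikit-learn's `train_test_split`; the toy frame, the 80/20 ratio and the use of `stratify` are illustrative choices, not requirements of the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Kaggle training set (hypothetical values)
toy = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 3, 2, 3, 1, 2, 3],
    "Fare":     [7.25, 71.28, 7.92, 53.1, 8.05, 13.0, 21.07, 51.86, 30.07, 16.7],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1, 1, 0],
})

# Hold out 20% of the rows; stratify keeps the survival ratio similar in both parts
train_part, test_part = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Survived"]
)

print(len(train_part), len(test_part))
```

Stratifying on `Survived` is optional, but it avoids a test subset that happens to contain mostly one class.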


Data Description:

| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |


**Use neural networks (NN) to create three models that predict which passengers survived the Titanic shipwreck**

### Data pre-processing



In [None]:
!gdown 1XmXkKC02f0c3uXVm3aPJRfcfn2MAuoT3

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Load the training set and work on a copy so the raw data stays untouched
df_train = pd.read_csv("/content/train.csv")
print(df_train.head())
df_copy = df_train.copy()

# Percentage of missing values per column
perc_isnull = df_copy.isnull().sum() / len(df_copy) * 100
print(perc_isnull)

# Impute missing values first, so that NaN is not encoded as a class of its own
df_copy['Age'] = df_copy['Age'].fillna(df_copy['Age'].median())
df_copy['Cabin'] = df_copy['Cabin'].fillna('Unknown')
df_copy['Embarked'] = df_copy['Embarked'].fillna(df_copy['Embarked'].mode()[0])

# Encode the categorical columns as integers (a separate encoder per column)
df_copy['Sex'] = LabelEncoder().fit_transform(df_copy['Sex'])
df_copy['Embarked'] = LabelEncoder().fit_transform(df_copy['Embarked'])

df_copy.isnull().sum()


Downloading...
From: https://drive.google.com/uc?id=1XmXkKC02f0c3uXVm3aPJRfcfn2MAuoT3
To: /content/train.csv
  0% 0.00/61.2k [00:00<?, ?B/s]100% 61.2k/61.2k [00:00<00:00, 84.5MB/s]
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
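
The order of imputation and encoding matters for `Embarked`: if the column is label-encoded while it still contains missing values, the gap can end up as a class of its own (or raise an error, depending on the library version), and a later `fillna` has nothing left to fill. A small sketch on a toy column with hypothetical values shows the fill-then-encode order:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with one missing port (hypothetical values)
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q", "S"]})

# 1) Fill the gap with the most frequent port
most_common = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common)

# 2) Encode afterwards, so only real ports become classes
encoder = LabelEncoder()
toy["EmbarkedCode"] = encoder.fit_transform(toy["Embarked"])

print(list(encoder.classes_))
print(toy["EmbarkedCode"].tolist())
```

`LabelEncoder` sorts the classes, so the codes follow alphabetical order of the port letters.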

### Feature engineering

Feature engineering, in data science, refers to manipulation — addition, deletion, combination, mutation — of your data set to improve machine learning model training, leading to better performance and greater accuracy.

From the columns that denote the number of siblings and the number of parents, **define a new column isAlone** which shows whether the passenger has relatives on the boat. The column should contain 0s and 1s.

Additionally, change the **age column** so that the passengers are divided into five age groups: 0 for age<=16, 1 for 16<age<=32, 2 for 32<age<=48, 3 for 48<age<=64 and 4 for age>64.

Hint: Drop the columns for the number of siblings and parents
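
The two transformations can be checked on a handful of rows before applying them to the full data. The sketch below uses a hypothetical toy frame and assumes `pd.cut` with right-inclusive bins, so that an age of exactly 16 falls in group 0 and 17 falls in group 1.

```python
import pandas as pd

# Toy passengers chosen to hit the bin edges (hypothetical values)
toy = pd.DataFrame({
    "SibSp": [1, 0, 0, 2],
    "Parch": [0, 0, 1, 1],
    "Age":   [16.0, 17.0, 48.0, 70.0],
})

# 1 when the passenger travels without any relatives, else 0
toy["isAlone"] = ((toy["SibSp"] + toy["Parch"]) == 0).astype(int)

# Right-inclusive bins: (.., 16], (16, 32], (32, 48], (48, 64], (64, inf)
bins = [0, 16, 32, 48, 64, float("inf")]
toy["AgeGroup"] = pd.cut(
    toy["Age"], bins=bins, labels=[0, 1, 2, 3, 4], include_lowest=True
).astype(int)

toy = toy.drop(columns=["SibSp", "Parch", "Age"])
print(toy)
```

Casting the result of `pd.cut` to an integer type keeps the group usable as a plain numeric feature.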

In [None]:
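# isAlone = 1 when the passenger has no siblings/spouses and no parents/children aboard
df_copy['isAlone'] = ((df_copy['SibSp'] + df_copy['Parch']) == 0).astype('int64')
df_copy = df_copy.drop(['SibSp', 'Parch'], axis=1)

# Bin the ages into five groups; the intervals are right-inclusive, e.g. (16, 32]
bins = [0, 16, 32, 48, 64, float('inf')]
labels = [0, 1, 2, 3, 4]
df_copy['AgeGroup'] = pd.cut(df_copy['Age'], bins=bins, labels=labels, include_lowest=True)
# pd.cut returns a categorical column; cast it to int64 so that the numeric-column
# selection used for the networks below keeps AgeGroup as a feature
df_copy['AgeGroup'] = df_copy['AgeGroup'].astype('int64')
df_copy = df_copy.drop('Age', axis=1)

# Hold out 20% of the rows as the test set
train_set, test_set = train_test_split(df_copy, test_size=0.2, random_state=42)
df_copy.head()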

| | PassengerId | Survived | Pclass | Name | Sex | Ticket | Fare | Cabin | Embarked | isAlone | AgeGroup |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | A/5 21171 | 7.25 | Unknown | 2 | 0 | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | PC 17599 | 71.2833 | C85 | 0 | 0 | 2 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | STON/O2. 3101282 | 7.925 | Unknown | 2 | 1 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 113803 | 53.1 | C123 | 2 | 0 | 2 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 373450 | 8.05 | Unknown | 2 | 1 | 2 |


### Neural Network 1

In [None]:
# Separate the features from the target
X_train = train_set.drop('Survived', axis=1)
Y_train = train_set['Survived']

X_test = test_set.drop('Survived', axis=1)
Y_test = test_set['Survived']

# Standardize the numeric columns; the scaler is fitted on the training split only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.select_dtypes(include=['int64', 'float64']))
X_test_scaled = scaler.transform(X_test.select_dtypes(include=['int64', 'float64']))
#nn1 = MLPClassifier(random_state=42)

# NN1: two ReLU hidden layers (64 and 32 units) and a sigmoid output for the survival probability
model_1 = Sequential()
model_1.add(Dense(64, input_dim=X_train_scaled.shape[1], activation='relu'))
model_1.add(Dense(32, activation='relu'))
model_1.add(Dense(1, activation='sigmoid'))
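
To make the layer stack above more concrete, here is a NumPy sketch of what a forward pass through Dense(64, relu), Dense(32, relu) and Dense(1, sigmoid) computes. The feature count of 6 and the random, untrained weights are assumptions made only for illustration; this is not how Keras initializes or trains the model.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Forward pass: Dense(relu) -> Dense(relu) -> Dense(sigmoid)."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(x @ w1 + b1)
    h2 = relu(h1 @ w2 + b2)
    return sigmoid(h2 @ w3 + b3)

rng = np.random.default_rng(0)
n_features = 6                      # hypothetical number of scaled input columns
sizes = [n_features, 64, 32, 1]     # layer widths matching the stack above

# One (weights, bias) pair per Dense layer, filled with small random values
params = [
    (rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
    for m, n in zip(sizes[:-1], sizes[1:])
]

x = rng.normal(size=(4, n_features))        # four made-up passengers
probs = forward(x, params)                  # shape (4, 1), values in (0, 1)
n_params = sum(w.size + b.size for w, b in params)
print(probs.shape, n_params)
```

Each Dense layer contributes `inputs * units + units` trainable parameters, which is why the count grows quickly with wider layers.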

#### Optimize number of epochs and batch size for NN1

(Try different values for the epochs and batch size parameters and choose the optimal ones)

Hint: You can use exhaustive search over specified parameter values for an estimator.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

You will need a wrapper class for your neural network models

https://adriangb.com/scikeras/stable/generated/scikeras.wrappers.KerasClassifier.html
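
As a rough illustration of the exhaustive search the hint describes, the sketch below runs `GridSearchCV` over training length and batch size. To stay lightweight it swaps in scikit-learn's `MLPClassifier` (where `max_iter` counts epochs for the default `adam` solver) instead of a Keras model wrapped with scikeras, and it uses synthetic data; the grid values are arbitrary assumptions.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the scaled Titanic features (hypothetical data)
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

# Candidate values to try; every combination is fitted and cross-validated
param_grid = {
    "max_iter": [20, 50],      # number of epochs for the 'adam' solver
    "batch_size": [16, 32],
}

search = GridSearchCV(
    MLPClassifier(hidden_layer_sizes=(16,), random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)

with warnings.catch_warnings():
    # Short runs may stop before convergence; silence that warning for the sketch
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With a wrapped Keras model the overall pattern would stay the same: the estimator passed to `GridSearchCV` changes, and the grid keys become whatever parameter names that wrapper exposes.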

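In [None]:
# Disabled draft of a grid search over scikit-learn's MLPClassifier, kept for reference.
# It would not run as-is because nn1 is commented out in the cell above.
"""param_grid = {
    'hidden_layer_sizes': [(50,), (50, 25), (100, 50, 25)],
    'max_iter': [500, 1000, 1500],
    'batch_size': [32, 64, 128],
}
grid_search_nn1 = GridSearchCV(nn1, param_grid, cv=5, scoring='accuracy')
grid_search_nn1.fit(X_train_scaled, Y_train)
"""
# Train NN1 with fixed settings (2 epochs, batch size 64); the held-out split is used for validation
model_1.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history_1 = model_1.fit(X_train_scaled, Y_train, epochs=2, batch_size=64, validation_data=(X_test_scaled, Y_test))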


Epoch 1/2
Epoch 2/2


### Neural Network 2

In [None]:
X_train = train_set.drop('Survived', axis=1)
Y_train = train_set['Survived']

X_test = test_set.drop('Survived', axis=1)
Y_test = test_set['Survived']

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.select_dtypes(include=['int64', 'float64']))
X_test_scaled = scaler.transform(X_test.select_dtypes(include=['int64', 'float64']))
#nn2 = MLPClassifier(random_state=42)

# NN2: three hidden layers (128 and 64 sigmoid units, then 32 ReLU units) and a sigmoid output
model_2 = Sequential()
model_2.add(Dense(128, input_dim=X_train_scaled.shape[1], activation='sigmoid'))
model_2.add(Dense(64, activation='sigmoid'))
model_2.add(Dense(32, activation='relu'))
model_2.add(Dense(1, activation='sigmoid'))

#### Optimize number of epochs and batch size for NN2

(Try different values for the epochs and batch size parameters and choose the optimal ones)

Hint: You can use exhaustive search over specified parameter values for an estimator.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

You will need a wrapper class for your neural network models

https://adriangb.com/scikeras/stable/generated/scikeras.wrappers.KerasClassifier.html

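In [None]:
# Disabled draft of a grid search over scikit-learn's MLPClassifier, kept for reference.
# It would not run as-is: nn2 is commented out above, and 'epochs' is not an MLPClassifier parameter.
"""param_grid_nn2 = {
    'hidden_layer_sizes': [(50,), (50, 25), (100, 50, 25)],
    'max_iter': [500, 1000, 1500],
    'batch_size': [16, 32, 64],
    'epochs': [10, 15, 25],
}
grid_search_nn2 = GridSearchCV(nn2, param_grid_nn2, cv=5, scoring='accuracy')
grid_search_nn2.fit(X_train_scaled, Y_train)"""
# Train NN2 with fixed settings (3 epochs, batch size 128); the held-out split is used for validation
model_2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history_2 = model_2.fit(X_train_scaled, Y_train, epochs=3, batch_size=128, validation_data=(X_test_scaled, Y_test))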


Epoch 1/3
Epoch 2/3
Epoch 3/3


### Neural Network 3

In [None]:
X_train = train_set.drop('Survived', axis=1)
Y_train = train_set['Survived']

X_test = test_set.drop('Survived', axis=1)
Y_test = test_set['Survived']

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.select_dtypes(include=['int64', 'float64']))
X_test_scaled = scaler.transform(X_test.select_dtypes(include=['int64', 'float64']))
#nn3 = MLPClassifier(random_state=42)

# NN3: four hidden layers alternating ReLU and sigmoid (256, 128, 64 and 32 units) and a sigmoid output
model_3 = Sequential()
model_3.add(Dense(256, input_dim=X_train_scaled.shape[1], activation='relu'))
model_3.add(Dense(128, activation='sigmoid'))
model_3.add(Dense(64, activation='relu'))
model_3.add(Dense(32, activation='sigmoid'))
model_3.add(Dense(1, activation='sigmoid'))


#### Optimize number of epochs and batch size for NN3

(Try different values for the epochs and batch size parameters and choose the optimal ones)

Hint: You can use exhaustive search over specified parameter values for an estimator.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

You will need a wrapper class for your neural network models

https://adriangb.com/scikeras/stable/generated/scikeras.wrappers.KerasClassifier.html


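In [None]:
# Disabled draft of a grid search over scikit-learn's MLPClassifier, kept for reference.
# It would not run as-is: nn3 is commented out above, and 'epochs' is not an MLPClassifier parameter.
"""param_grid_nn3 = {
    'hidden_layer_sizes': [(50,), (50, 25), (100, 50, 25)],
    'max_iter': [500, 1000, 1500],
    'batch_size': [64, 128, 256],
    'epochs': [5, 8, 12],
}

grid_search_nn3 = GridSearchCV(nn3, param_grid_nn3, cv=5, scoring='accuracy')
grid_search_nn3.fit(X_train_scaled, Y_train)"""
# Train NN3 with fixed settings (5 epochs, batch size 256); the held-out split is used for validation
model_3.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history_3 = model_3.fit(X_train_scaled, Y_train, epochs=5, batch_size=256, validation_data=(X_test_scaled, Y_test))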


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Evaluate the three NNs

In [None]:
"""
print("Best Hyperparameters for NN1:", grid_search_nn1.best_params_)
print("Best Hyperparameters for NN2:", grid_search_nn2.best_params_)
print("Best Hyperparameters for NN3:", grid_search_nn3.best_params_)

best_nn1 = grid_search_nn1.best_estimator_
best_nn2 = grid_search_nn2.best_estimator_
best_nn3 = grid_search_nn3.best_estimator_

y_pred_nn1 = best_nn1.predict(X_test_scaled)
y_pred_nn2 = best_nn2.predict(X_test_scaled)
y_pred_nn3 = best_nn3.predict(X_test_scaled)

accuracy_nn1 = accuracy_score(Y_test, y_pred_nn1)
accuracy_nn2 = accuracy_score(Y_test, y_pred_nn2)
accuracy_nn3 = accuracy_score(Y_test, y_pred_nn3)

print(f"Accuracy for NN1: {accuracy_nn1}")
print(f"Accuracy for NN2: {accuracy_nn2}")
print(f"Accuracy for NN3: {accuracy_nn3}")"""
# evaluate() returns [loss, accuracy] because each model was compiled with metrics=['accuracy']
eval_1 = model_1.evaluate(X_test_scaled, Y_test)
eval_2 = model_2.evaluate(X_test_scaled, Y_test)
eval_3 = model_3.evaluate(X_test_scaled, Y_test)





## Results analysis

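In [None]:
# Disabled draft, kept for reference. It depends on the grid-search objects from the
# disabled drafts above and uses y_test where the cells above define Y_test.
"""from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc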
import matplotlib.pyplot as plt

cm_nn1 = confusion_matrix(y_test, y_pred_nn1)
cm_nn2 = confusion_matrix(y_test, y_pred_nn2)
cm_nn3 = confusion_matrix(y_test, y_pred_nn3)

report_nn1 = classification_report(y_test, y_pred_nn1)
report_nn2 = classification_report(y_test, y_pred_nn2)
report_nn3 = classification_report(y_test, y_pred_nn3)

fpr_nn1, tpr_nn1, _ = roc_curve(y_test, best_nn1.predict_proba(X_test_scaled)[:, 1])
fpr_nn2, tpr_nn2, _ = roc_curve(y_test, best_nn2.predict_proba(X_test_scaled)[:, 1])
fpr_nn3, tpr_nn3, _ = roc_curve(y_test, best_nn3.predict_proba(X_test_scaled)[:, 1])

roc_auc_nn1 = auc(fpr_nn1, tpr_nn1)
roc_auc_nn2 = auc(fpr_nn2, tpr_nn2)
roc_auc_nn3 = auc(fpr_nn3, tpr_nn3)

print("Accuracy for NN1:", accuracy_nn1)
print("Confusion Matrix for NN1:")
print(cm_nn1)
print("Classification Report for NN1:")
print(report_nn1)

plt.figure(figsize=(10, 6))
plt.plot(fpr_nn1, tpr_nn1, label=f'NN1 (AUC = {roc_auc_nn1:.2f})')
plt.plot(fpr_nn2, tpr_nn2, label=f'NN2 (AUC = {roc_auc_nn2:.2f})')
plt.plot(fpr_nn3, tpr_nn3, label=f'NN3 (AUC = {roc_auc_nn3:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()"""
print("Neural Network 1:")
print(f"Test Loss: {eval_1[0]}, Test Accuracy: {eval_1[1]}")

print("Neural Network 2:")
print(f"Test Loss: {eval_2[0]}, Test Accuracy: {eval_2[1]}")

print("Neural Network 3:")
print(f"Test Loss: {eval_3[0]}, Test Accuracy: {eval_3[1]}")

Neural Network 1:
Test Loss: 0.6054165363311768, Test Accuracy: 0.7094972133636475
Neural Network 2:
Test Loss: 0.6595090627670288, Test Accuracy: 0.5865921974182129
Neural Network 3:
Test Loss: 0.6273787021636963, Test Accuracy: 0.5977653861045837
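
Accuracy alone does not show which kind of mistakes a model makes. A confusion matrix needs hard class labels, while a sigmoid output layer yields probabilities, so the probabilities have to be thresholded first. The sketch below uses hypothetical labels and probabilities and assumes a 0.5 threshold.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and sigmoid outputs, shaped (n, 1) like a Dense(1) layer's output
y_true = np.array([0, 1, 1, 0, 1, 0])
probs = np.array([[0.20], [0.80], [0.40], [0.60], [0.90], [0.10]])

# Threshold at 0.5 and flatten to a 1-D array of 0/1 labels
y_pred = (probs > 0.5).astype(int).ravel()

cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
acc = accuracy_score(y_true, y_pred)

print(cm)
print(round(acc, 3))
```

In this toy case the matrix shows one false positive and one false negative, which a single accuracy figure would not distinguish.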


## **Bonus task** (+2 points)

The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing models in order to balance out their individual weaknesses.

https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier

**Your task is to build a majority (hard) voting ensemble from the three previously created NNs.**

In majority voting, the predicted class label for a particular sample is the class label that represents the majority (mode) of the class labels predicted by each individual classifier.

Is this model better than the models before?

Hint: You will need a wrapper class for your neural network models
 https://adriangb.com/scikeras/stable/generated/scikeras.wrappers.KerasClassifier.html
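
The majority-vote rule described above can be sketched directly in NumPy, independently of any wrapper class. The three prediction arrays below are hypothetical; with the Keras models they would presumably come from thresholding each model's sigmoid output into 0/1 labels. This is a hand-rolled illustration of the rule, not scikit-learn's `VotingClassifier`.

```python
import numpy as np

def majority_vote(*label_arrays):
    """Hard voting for binary labels: predict 1 where more than half of the models predict 1."""
    votes = np.stack(label_arrays, axis=0)              # shape: (n_models, n_samples)
    return (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)

# Hypothetical 0/1 predictions from three models for five passengers
pred_1 = np.array([1, 0, 1, 0, 1])
pred_2 = np.array([1, 1, 0, 0, 1])
pred_3 = np.array([0, 1, 1, 0, 0])

ensemble = majority_vote(pred_1, pred_2, pred_3)
print(ensemble)
```

With an odd number of binary classifiers there are no ties, so the rule always yields a label. Whether the ensemble beats the individual networks can then be checked by comparing its accuracy with theirs on the same test set.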
