# Titanic - Machine Learning from Disaster
## Titanic Survivor Predictor - Machine Learning Model
### Overview
This project explores the infamous Titanic disaster, one of the most studied shipwrecks in history. The passengers aboard came from a wide range of backgrounds—different ages, genders, social classes, and financial statuses. These factors played varying roles in determining who survived the tragedy.

By treating each passenger’s attributes as features in a machine learning model, we can predict survival outcomes and identify the characteristics with the greatest influence. The primary goal is to build an accurate predictive model of survival for the Kaggle competition. The secondary goal is to explore which features had the greatest impact on survival.

## 1. Initiation and viewing the dataset

In [234]:
#import required libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

In [235]:
data = pd.read_csv("/kaggle/input/titanic/train.csv")
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [236]:
data.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

### Observations:
The data contains both numerical and categorical columns. The numerical columns are Pclass, Age, SibSp (siblings/spouses), Parch (parents/children), and Fare. The categorical columns are Name, Sex, Ticket, Cabin, and Embarked. The target column is Survived: 1 means the passenger survived, 0 means the passenger did not.

I will drop two whole columns from the data: Ticket, whose raw values carry little usable information, and Cabin, which is missing in 687 rows, so any attempt to fill it in would be mostly guesswork and would distort the model.

Apart from those, every categorical column has to be encoded numerically first. I also have to deal with the missing values in Age and Embarked, then clean the data. All of that is done in the preprocessing step.
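To put the missing-value counts in proportion, they can be expressed as a share of all rows. The sketch below copies the counts from the `data.isna().sum()` output above and assumes the 891 rows of `train.csv`; `missing_counts`, `n_rows`, and `missing_pct` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Missing-value counts copied from the data.isna().sum() output above.
missing_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
n_rows = 891  # assumed number of rows in train.csv

# Share of missing entries per column, as a percentage.
missing_pct = (missing_counts / n_rows * 100).round(1)
print(missing_pct)
```

On the real DataFrame the same figures could be obtained with `data.isna().mean() * 100`.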

## 2. PreProcessing Data 

### 2.1 Removing unused columns

In [237]:
data = data.drop(['Cabin','Ticket'], axis = 1)

### 2.2 Dealing with missing values

In [238]:
imputer_num = SimpleImputer(strategy = "median")
imputer_cat = SimpleImputer(strategy = 'most_frequent')
data['Age'] = imputer_num.fit_transform(data[['Age']])
data['Embarked'] = imputer_cat.fit_transform(data[['Embarked']]).ravel()
data.isna().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64
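As a small illustration of what the two imputers do, the sketch below applies the same strategies to a toy frame. The values are made up for the example and `toy` is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one missing value per column (made-up values).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0],
    "Embarked": pd.Series(["S", "C", np.nan, "S"], dtype=object),
})

# Median of the observed ages (22, 26, 38) replaces the missing age.
toy["Age"] = SimpleImputer(strategy="median").fit_transform(toy[["Age"]]).ravel()

# The most frequent port ("S") replaces the missing port.
toy["Embarked"] = (
    SimpleImputer(strategy="most_frequent").fit_transform(toy[["Embarked"]]).ravel()
)
print(toy)
```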

### 2.3 Dealing with Categorical data

In [239]:
data = pd.get_dummies(data, columns=['Sex','Embarked'], drop_first=True)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,7.25,True,False,True
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,71.2833,False,False,False
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,7.925,False,False,True
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,53.1,False,False,True
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,8.05,True,False,True
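The effect of `drop_first=True` is easier to see on a toy frame: the first category of each column (in sorted order) is dropped and becomes the baseline, which is why there is no `Sex_female` or `Embarked_C` column above. The sketch uses made-up rows and the hypothetical names `toy` and `encoded`.

```python
import pandas as pd

# Toy frame covering every category that appears in Sex and Embarked.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# drop_first=True removes the first category of each column: female and C.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)
print(encoded)
```

A passenger who embarked at C is therefore represented by both `Embarked_Q` and `Embarked_S` being false.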


### 2.4 Feature Engineering
After my first attempt I realized that feature engineering is key to improving the accuracy of the predictions.

#### First, we can use the Name column by extracting the title from each name and making it a new feature

In [240]:
data['Title'] = data['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
print(data['Title'].value_counts())

Title
Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Col               2
Mlle              2
Major             2
Ms                1
Mme               1
Don               1
Lady              1
Sir               1
Capt              1
the Countess      1
Jonkheer          1
Name: count, dtype: int64


In [241]:
rare_titles = ['Lady', 'the Countess', 'Capt', 'Col', 'Don', 
               'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Mme']

data['Title'] = data['Title'].replace(rare_titles, 'Rare')
data['Title'] = data['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss'})
print(data['Title'].value_counts())

Title
Mr        517
Miss      185
Mrs       125
Master     40
Rare       24
Name: count, dtype: int64


In [242]:
title_map = {'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Rare': 5}
data['Title'] = data['Title'].map(title_map).fillna(0)
print(data['Title'])

0      1
1      3
2      2
3      3
4      1
      ..
886    5
887    2
888    2
889    1
890    1
Name: Title, Length: 891, dtype: int64
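The three title steps above (extract, group, map) can be traced on a few names. This is a sketch under assumptions: it swaps the `split`-based lambda for a vectorized `str.extract` with a regular expression, which should yield the same titles for names of the form `Surname, Title. Given names`, and the last two names are hypothetical examples added to show the rare-title and unknown-title paths.

```python
import pandas as pd

rare_titles = ['Lady', 'the Countess', 'Capt', 'Col', 'Don',
               'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Mme']
title_map = {'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Rare': 5}

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Doe, Dr. John",    # hypothetical name with a rare title
    "Doe, Dona. Jane",  # hypothetical name whose title is in neither list
])

# Capture the text between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Group rare titles, merge spelling variants, then map to integer codes.
titles = titles.replace(rare_titles, 'Rare').replace({'Mlle': 'Miss', 'Ms': 'Miss'})
codes = titles.map(title_map).fillna(0).astype(int)
print(codes.tolist())
```

The `fillna(0)` matters for unseen data: any title that is missing from both `rare_titles` and `title_map` falls back to code 0 instead of leaving a NaN in the feature.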


#### A new feature, Family_Size, is the sum of SibSp (siblings/spouses) and Parch (parents/children), plus one for the passenger. A second feature, Is_Alone, flags passengers travelling alone, which could have a significant effect.

In [243]:
data['Family_Size'] = data['SibSp'] + data['Parch'] + 1
data['Is_Alone'] = (data['Family_Size'] == 1).astype(int)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S,Title,Family_Size,Is_Alone
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,7.25,True,False,True,1,2,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,71.2833,False,False,False,3,2,0
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,7.925,False,False,True,2,1,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,53.1,False,False,True,3,2,0
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,8.05,True,False,True,1,1,1


### 2.5 Data splitting and scaling

In [244]:
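# Data splitting: separate the features from the target
X = data.drop(["Survived","PassengerId", "Name"], axis = 1)
y = data["Survived"]
# Data scaling
# Note: X_scaled is not used below; the model in section 3 is fitted on the unscaled X
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)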
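The scaler above is fitted on all of `X` before any split. One common alternative, sketched here under assumptions, is to put the scaler and the classifier in a `Pipeline` so that scaling is refitted on the training part of every cross-validation fold. The data is synthetic (not the Titanic data), the `C` grid is shortened, the default solver is used instead of `liblinear`, and `X_demo`, `y_demo`, `pipe`, and `search` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y, with features on very different scales.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4)) * np.array([1, 10, 100, 1000])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 10 > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The scaler is a pipeline step, so each CV fold fits it on its own training part.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
search = GridSearchCV(pipe, param_grid={"clf__C": [0.01, 0.1, 1, 10]},
                      cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)
test_accuracy = search.score(X_te, y_te)
print(search.best_params_, test_accuracy)
```

With this layout the same fitted pipeline can be applied to new data through `search.predict`, and the scaling is applied automatically.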

#### - Finalizing PreProcessing

In [245]:
data = data.drop('Name', axis = 1).astype(float)

## 3. Model Creation
### 3.1 Model initiation

In [246]:
# Logistic regression tuned with a 5-fold cross-validated grid search over C and penalty
log_reg = LogisticRegression()
model = GridSearchCV(log_reg, param_grid = {'C': [0.0001,0.001,0.01,0.1,1,10,100,1000,10000], 
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"]}, cv = 5, scoring = "accuracy")
# Hold out 30% of the training data to evaluate the tuned model
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 42)
model.fit(X_train,y_train)

### 3.2 Testing the model

In [247]:
print(model.score(X_test,y_test))

0.7985074626865671
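For the secondary goal of seeing which features matter most, a fitted `GridSearchCV` exposes the selected hyperparameters and the refitted best estimator, whose coefficients can be ranked by absolute size. The sketch below is an assumption-laden illustration on synthetic data: the feature names `strong`, `weak`, and `noise` are invented, the grid is shortened, and the default solver is used. Coefficient sizes are only comparable when the features are on similar scales.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data: one strong predictor, one weak predictor, one pure-noise column.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)),
                      columns=["strong", "weak", "noise"])
signal = 3 * X_demo["strong"] + 0.5 * X_demo["weak"]
y_demo = (signal + rng.normal(scale=0.5, size=300) > 0).astype(int)

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.1, 1, 10]},
                      cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

# best_estimator_ is the model refitted on all the data with the best parameters.
coefs = pd.Series(search.best_estimator_.coef_[0], index=X_demo.columns)

# Rank features by absolute coefficient, largest first.
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(search.best_params_)
print(ranked)
```

In the notebook the same two attributes would be read from `model`, with `X.columns` as the index for the coefficients.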


## 4. Application on testing data

In [248]:
testing = pd.read_csv('/kaggle/input/titanic/test.csv')
testing.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


### The test data needs the same preprocessing as the training data

In [249]:
# Drop the same unused columns and encode the same categorical columns
testing = testing.drop(['Cabin','Ticket'], axis = 1)
testing = pd.get_dummies(testing, columns=['Sex','Embarked'], drop_first=True)
# Fill missing ages with the median Age learned from the training data
testing['Age'] = imputer_num.transform(testing[["Age"]])
# Refit the imputer on the training Fare column, then fill any missing fares
imputer_num.fit(data[["Fare"]])
testing['Fare'] = imputer_num.transform(testing[["Fare"]])
testing.isna().sum()
# Rebuild the engineered features exactly as for the training data
testing['Title'] = testing['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
testing['Title'] = testing['Title'].replace(rare_titles, 'Rare')
testing['Title'] = testing['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss'})
testing['Title'] = testing['Title'].map(title_map).fillna(0)
testing['Family_Size'] = testing['SibSp'] + testing['Parch'] + 1
testing['Is_Alone'] = (testing['Family_Size'] == 1).astype(int)
# Drop Name and cast everything to float, as was done for the training data
testing = testing.drop('Name',axis = 1).astype(float)
testing.head()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S,Title,Family_Size,Is_Alone
0,892.0,3.0,34.5,0.0,0.0,7.8292,1.0,1.0,0.0,1.0,1.0,1.0
1,893.0,3.0,47.0,1.0,0.0,7.0,0.0,0.0,1.0,3.0,2.0,0.0
2,894.0,2.0,62.0,0.0,0.0,9.6875,1.0,1.0,0.0,1.0,1.0,1.0
3,895.0,3.0,27.0,0.0,0.0,8.6625,1.0,0.0,1.0,1.0,1.0,1.0
4,896.0,3.0,22.0,1.0,1.0,12.2875,0.0,0.0,1.0,3.0,3.0,0.0
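Because the model was fitted on the columns of `X`, the test features need the same columns in the same order. A quick check along these lines can catch a mismatch before predicting. It is a sketch: `check_same_columns` is a hypothetical helper, and the two lists are copied from the `head()` outputs above with `PassengerId`, `Survived`, and `Name` removed.

```python
def check_same_columns(train_cols, test_cols):
    """Return (missing, extra, same_order) for the test columns vs. the training columns."""
    missing = [c for c in train_cols if c not in test_cols]
    extra = [c for c in test_cols if c not in train_cols]
    same_order = list(train_cols) == list(test_cols)
    return missing, extra, same_order


# Feature columns of the training data, as listed in the outputs above.
train_cols = ["Pclass", "Age", "SibSp", "Parch", "Fare", "Sex_male",
              "Embarked_Q", "Embarked_S", "Title", "Family_Size", "Is_Alone"]

# Test columns after preprocessing, without PassengerId.
test_cols = ["Pclass", "Age", "SibSp", "Parch", "Fare", "Sex_male",
             "Embarked_Q", "Embarked_S", "Title", "Family_Size", "Is_Alone"]

result = check_same_columns(train_cols, test_cols)
print(result)
```

In the notebook the call would be `check_same_columns(X.columns, testing.drop("PassengerId", axis=1).columns)`.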


## 5. Creating the output - The prediction

In [250]:
X_Final = testing.drop("PassengerId", axis = 1)
Prediction = model.predict(X_Final).astype(int)
submission = pd.DataFrame({
    "PassengerId": testing["PassengerId"].astype(int),
    "Survived":Prediction
})
print(submission)
submission.to_csv("submission.csv", index=False)

     PassengerId  Survived
0            892         0
1            893         1
2            894         0
3            895         0
4            896         1
..           ...       ...
413         1305         0
414         1306         1
415         1307         0
416         1308         0
417         1309         0

[418 rows x 2 columns]
