In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import warnings
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**SCENE SETTING**

One of the most well-known shipwrecks in history is the sinking of the RMS
Titanic. The ship sank on April 15, 1912, after striking an iceberg on her maiden
voyage. Because there were not enough lifeboats, a total of 1502 passengers
and crew members perished.

![titanic](https://robbreport.com/wp-content/uploads/2022/04/7.-titanic-sinking-Screen-Shot-2022-04-14-at-12.10.37-AM.jpg?w=1000)

More than 2200 people were aboard. Even if chance played a role, it appears that
certain individuals had a higher probability of surviving than others.

This project uses two sets of Titanic passenger data, a training set and a test
set, both stored as .csv files. The training dataset covers 891 passengers and
contains the "Survived" response variable together with 11 other descriptive
variables.

The goal of this notebook is to build machine learning models that predict
which passengers survived the shipwreck. Specifically, the response variable
Survived is modeled from 10 predictors. The rest of this notebook goes through
the steps used to build the predictive models and then compares the models to
see which is the most effective.


**PROBLEM ANALYSIS**

An examination of the Titanic's historical report provides useful insight into
the passenger data in terms of survival.

1. Because of a "women and children first" protocol for filling lifeboats, a disproportionate number of men were left on board.

2. Passengers in first and second class were the most likely to make it to the lifeboats. To get to the boat deck, third-class passengers had to navigate a tangle of passageways and staircases.

3. Many lifeboats were only partially loaded when they were launched.


In [2]:
#Performing EDA Analysis

train = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')

print("Checking Test and Train Shapes :")
print("Train DS Shape : ", train.shape)
print("Test DS Shape :  ", test.shape)

I started with a descriptive analysis of the dataset to learn as much as possible about the passengers aboard the Titanic: the gender ratio, the distribution of ticket classes, the spread of ages, and so on.

In [3]:
train.head()

First, I checked what the dataset contains. The columns of the training dataset have the following structure and data types:

In [4]:
train.dtypes

In [5]:
train.info()

The dataset contained 5 categorical columns as shown below :

In [6]:
categorical_cols= train.select_dtypes(include=['object'])
print(f'The dataset contains {len(categorical_cols.columns.tolist())} categorical columns')
for cols in categorical_cols.columns:
    print(cols,':', len(categorical_cols[cols].unique()),'labels')

Of these, the categorical fields that matter most are Sex and Embarked. All categorical data also has to be converted into numerical data, because many machine learning models cannot work with categorical fields directly.

In [7]:
train.describe()

The missing values in the Age and Cabin fields also had to be addressed. The training dataset has 891 rows, but describe() shows that it contains only 714 values for Age.

I know from my historical research that women and children were given
priority while loading lifeboats, so I needed a way to identify children.
The large number of missing age values made this difficult.

Next, I did a gender analysis to see the male-female distribution on the ship.
The following results were observed:


In [8]:
import plotly.graph_objects as go

base_colors = ['#20618E',  '#6880AD',  '#57A7F3']

labels = [x for x in train.Sex.value_counts().index]
values = train.Sex.value_counts()

fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3,pull=[0.03, 0])])

fig.update_layout(
    title_text="Gender ")
fig.update_traces(marker=dict(colors=base_colors))
fig.show()

The number of males was significantly higher than the number of females. With that
in mind, I also compared survival rates by gender to check the assumption I drew
from the historical report of the Titanic.

In [9]:
gender_analysis = sns.catplot(x="Sex",y="Survived",data=train, kind="bar", height = 6, palette = base_colors)
gender_analysis = gender_analysis.set_ylabels("Survival Rate")

The graph above shows that even though there were many more males than females,
the survival rate of men was significantly lower. This supports my historical
assumption about the "women and children first" protocol.
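
The survival rate shown in a bar plot like this is simply the mean of the 0/1 Survived column within each group. A minimal sketch of that calculation, assuming a small made-up table (the `toy` DataFrame is hypothetical and not the Kaggle data):

```python
import pandas as pd

# Toy data (made up for illustration, not the Kaggle file)
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "male"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# Survived is 0/1, so its mean per group is the survival rate of that group
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)
```

The same `groupby(...).mean()` call on the real training data would give the numbers behind the bars.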

Next up is the analysis of the distribution of people in different passenger classes.


C1 : First Class

C2 : Second Class

C3 : Third Class

In [10]:
values = train.Pclass.value_counts()
# Build each label from its own class number so labels always match the counts
labels = [f'C{x}' for x in values.index]

fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3,pull=[0,0,0.04])])

fig.update_layout(
    title_text="Ticket class ")
fig.update_traces(marker=dict(colors=base_colors))
fig.show()

According to this analysis, the bulk of the passengers held a third-class ticket.
From the historical record, we know that third-class passengers had a much longer
way to the boat deck than passengers with first and second-class tickets. With that
in mind, I also compared survival rates by passenger class and gender to check the
assumption I drew from the historical report of the Titanic.

In [11]:
gender_pclass = sns.FacetGrid(train, height=4.5, aspect=1.6)
gender_pclass.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', order=None, hue_order=None )
gender_pclass.add_legend();

According to this analysis, females traveling in third class had a considerably
lower survival rate than females in first and second class. This backs up the
historical assumption that first and second-class passengers were the most likely
to make it to the lifeboats, which were hurriedly launched while only partially loaded.

Next up is the analysis of the distribution of people based on their port of
embarkation.

In [12]:
port_names = {'S': 'Southampton', 'C': 'Cherbourg', 'Q': 'Queenstown'}
values = train.Embarked.value_counts()
# Map each port code to its name so labels stay aligned with the counts
labels = [port_names[x] for x in values.index]

fig=go.Figure(data=[go.Pie(labels=labels,values=values,hole=.3,pull=[0,0,0.06])])

fig.update_layout(
    title_text="Port of embarkation")
fig.update_traces(marker=dict(colors=base_colors))
fig.show()

According to this analysis, an overwhelming number of people embarked at
Southampton, which is encoded as S.

Further analysis showed that, apart from the missing values in the Age
variable, the Age and Fare values were positively skewed.
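
Positive skew can also be checked numerically rather than only by eye. A minimal sketch, assuming a made-up series of fare-like values (the `fares` Series is hypothetical):

```python
import pandas as pd

# Toy values (made up): most are small, a few are very large, like ticket fares
fares = pd.Series([7.0, 8.0, 8.0, 10.0, 13.0, 15.0, 26.0, 80.0, 250.0])

# Series.skew() returns the sample skewness; a value above 0 means a long right tail
print("skewness:", fares.skew())

# For right-skewed data the mean is pulled above the median
print("mean:", fares.mean(), "median:", fares.median())
```

Calling `train['Age'].skew()` and `train['Fare'].skew()` would give the corresponding figures for the real columns.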

In [13]:
import plotly.figure_factory as ff
from plotly.offline import iplot
age=train['Age'].dropna()
fig = ff.create_distplot([age],['Age'],bin_size=1)
fig.update_traces(marker=dict(color='#57A7F3'))
iplot(fig, filename='Basic Distplot')

In [14]:
fig = ff.create_distplot([train['Fare']],['Fare'],bin_size=10)
fig.update_traces(marker=dict(color='#57A7F3'))
iplot(fig, filename='Basic Distplot')

**FEATURE ENGINEERING**

Feature engineering is a critical phase in designing any prediction system,
because the data may have missing fields, incomplete fields, or fields
containing hidden information. Age, Fare, and Embarked, for example, had
missing values in the training or testing data that needed to be filled in.
One-hot encoding is another crucial step in preparing data for machine
learning models: many algorithms cannot work directly on categorical data
such as gender (male/female), so categorical data must be transformed into
numerical data. I also used the passenger's surname to distinguish families
on board the Titanic.

First, I removed the Name, PassengerId, and Ticket variables because their
values are unique or nearly unique per passenger, and creating dummies for
them would only increase the dimensionality. The structure of the training
dataset after dropping these columns was:

In [15]:
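# Pass axis by keyword: the positional form drop(labels, 1) no longer works in pandas 2.x
train=train.drop(['PassengerId','Name','Ticket'],axis=1)
test=test.drop(['PassengerId','Name','Ticket'],axis=1)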

train['Survived']=train['Survived'].astype('int')
train['Pclass']=train['Pclass'].astype('int')
train['SibSp']=train['SibSp'].astype('int')

train.info()

Now I addressed the missing values in the dataset. The following are the
percentages of the missing values.

In [16]:
# Keep every column that has at least one missing value
features=[col for col in train.columns if train[col].isnull().sum()>0]
for feature in features:
    print(feature, np.round(train[feature].isnull().mean()*100, 2),  ' % missing values.\n')

Because the Cabin variable had over 77 percent missing values, I dropped
it. An alternative would be to keep it and impute the missing values as a
separate "other" category. After dropping it, only 2 categorical columns
remained, i.e. Sex and Embarked.

Since there were only two missing values in the Embarked variable, I imputed
them with the mode of the remaining values. To fill in the missing Age
values, I used a random value between the 25th and 75th percentiles. The
test set also had missing values for the Fare variable. Those were imputed
the same way, from the range of values in the training set.
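
The age imputation described above can be sketched on toy data. This is a minimal illustration under assumptions, not the notebook's exact code: the values in `ages` are made up, the names `rng`, `missing`, and `filled` are hypothetical, and this variant draws a separate random value for each missing row.

```python
import numpy as np
import pandas as pd

# Toy ages with gaps (made up for illustration)
ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0, 54.0, 2.0, np.nan, 27.0])

# 25th and 75th percentiles of the observed (non-missing) ages
q25, q75 = ages.quantile([0.25, 0.75])

# Draw one random value per missing row, uniformly between the two quartiles
rng = np.random.default_rng(1)
missing = ages.isna()
filled = ages.copy()
filled[missing] = rng.uniform(q25, q75, size=missing.sum())

print(filled)
```

Drawing per row keeps some spread in the imputed ages, whereas filling every gap with one value creates a single spike in the distribution.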

In [17]:
train=train.drop(['Cabin'],axis=1)
test=test.drop(['Cabin'],axis=1)

categorical_cols_train= train.select_dtypes(include=['object'])

print(f'The dataset contains {len(categorical_cols_train.columns.tolist())} categorical columns')

categorical_cols_train.describe()

In [18]:
categorical_cols_missing = categorical_cols_train.columns[categorical_cols_train.isnull().any()]
from sklearn.impute import SimpleImputer
# Replace missing categorical values with the most frequent value (the mode)
categoricalImputer = SimpleImputer(missing_values = np.nan,strategy = 'most_frequent')
for feature in categorical_cols_missing:
     # fit_transform returns a 2-D array, so flatten it before assigning it to a column
     train[feature] = categoricalImputer.fit_transform(train[[feature]]).ravel()
     categorical_cols_train[feature] = train[feature]

In [19]:
train['Age'].describe()

In [20]:
train['Fare'].describe()

Next, I had to convert the categorical data into numerical data. The male
and female values of the Sex field were encoded as 1 and 0 respectively.

Next, I created dummy variables for the Embarked classes: Embarked_C,
Embarked_Q, and Embarked_S. These columns are one-hot encoded, so in each
row exactly one of them is 1, marking the port where that passenger
embarked. For example, if a passenger embarked at port C, the Embarked_C
column in that passenger's row contains 1, while Embarked_Q and Embarked_S
contain 0.
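
The encoding described above can be sketched with `pd.get_dummies` on a tiny made-up table (the `toy` DataFrame and its values are hypothetical, chosen only to show the resulting columns):

```python
import pandas as pd

# Toy passengers (made up for illustration)
toy = pd.DataFrame({"Fare": [7.25, 71.28, 8.05], "Embarked": ["S", "C", "Q"]})

# One indicator column per port; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(encoded)
```

Each row has exactly one 1 across the three Embarked_* columns, which is what makes the encoding "one-hot".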


In [21]:
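np.random.seed(1)
# Each fillna call uses one random integer for every missing value in that column.
# Assign the result back instead of using inplace=True on a column selection.
train['Age'] = train['Age'].fillna(np.random.randint(20,38))
test['Age'] = test['Age'].fillna(np.random.randint(20,38))
test['Fare'] = test['Fare'].fillna(np.random.randint(0,31))

# Encode Sex as a number: male -> 1, female -> 0
cleanup_nums = {"Sex": {"male": 1, "female": 0}}
train= train.replace(cleanup_nums)
test= test.replace(cleanup_nums)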

train=pd.get_dummies(train, columns=["Embarked"])
test=pd.get_dummies(test, columns=["Embarked"])

train.shape

**CORRELATIONS IN THE DATASET**

Next, we move on to the correlations in the dataset.

Correlations show how strongly the variables in a dataset are associated
with one another. If one attribute depends on another, or tends to change
together with it, the correlation can reveal that relationship.

Here we analyze the association between every pair of variables: numerical
with numerical, numerical with categorical, and categorical with categorical.
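
One common measure of association between two variables that are summarized in a contingency table is Cramér's V. Its standard definition is given below for reference, where $\chi^2$ is the chi-squared statistic of the table, $n$ is the number of observations, and $r$ and $c$ are the numbers of rows and columns (the symbol names are chosen here for illustration):

```latex
V = \sqrt{\frac{\chi^2}{n \cdot \min(r - 1,\; c - 1)}}
```

V ranges from 0 (no association) to 1 (perfect association).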


In [22]:
from scipy.stats import chi2_contingency
import numpy as np

def cramers_V(var1,var2):
  crosstab =np.array(pd.crosstab(var1,var2, rownames=None, colnames=None)) # Build the contingency table
  stat = chi2_contingency(crosstab)[0] # Chi-squared test statistic
  obs = np.sum(crosstab) # Number of observations
  mini = min(crosstab.shape)-1 # Smaller of (rows - 1) and (columns - 1)
  return np.sqrt(stat/(obs*mini)) # Cramér's V is the square root of chi2 / (n * mini)

In [23]:
rows= []

for var1 in train:
  col = []
  for var2 in train:
    cramers =cramers_V(train[var1], train[var2])
    col.append(round(cramers,2))
  rows.append(col)
  
cramers_results = np.array(rows)
df = pd.DataFrame(cramers_results, columns = train.columns, index =train.columns)
df

In [26]:
plt.figure(figsize=(8,8))
plt_data = train[:]
sns.heatmap(df, vmin = -0.5,vmax = 1,annot=True)

To visualize the correlation matrix, I used a seaborn heatmap. The diagonal
of the correlation map is all 1 because each variable is perfectly correlated
with itself.

**BUILDING PREDICTIVE MODELS**



In [27]:
model_results = []

from sklearn.model_selection import GridSearchCV 
def classification_eval(algorithm, grid_params, X_train, X_test, y_train, model_name):
    # Cross-validated grid search (5 folds), then predict on X_test with the best estimator
    model = GridSearchCV(algorithm, grid_params, n_jobs=-1, cv=5, verbose=1)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print("Grid Search Best Score: \t", model.best_score_)
    print("Grid Search Best Params: \t", model.best_params_)
    model_results.append([model_name, model.best_score_])
    return model, y_pred

In [31]:
X_train = train.loc[:,['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked_C', 'Embarked_Q', 'Embarked_S']]
y_train = train.loc[:,'Survived']

X_test = test.loc[:,['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked_C', 'Embarked_Q', 'Embarked_S']]

In [32]:
# Balancing the dataset 
from imblearn.over_sampling import SMOTE 

oversample = SMOTE()
X_train, y_train = oversample.fit_resample(X_train,y_train)

In [38]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(X_train,y_train,test_size=0.2,random_state=69)

In [35]:
# Classification using KNN
from sklearn.neighbors import KNeighborsClassifier 
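grid_params = {"weights": ['uniform', 'distance'],
                          'n_neighbors': range(2, 10, 2),
                          "algorithm": ['auto', 'ball_tree', 'kd_tree', 'brute'],
                          "leaf_size": range(15, 30, 5),
                          "p" : range(1, 5),
                          }
# n_jobs only controls parallelism and does not change predictions, so it is not searched over.
# Use the split x_train/x_test: y_train was overwritten by train_test_split and matches x_train only.
knn_model, y_pred_knn = classification_eval(KNeighborsClassifier(),grid_params,x_train, x_test, y_train,'KNN')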
knn_model

In [36]:
knn_model.best_params_

In [40]:
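from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2; this wrapper keeps the calls below working
def plot_confusion_matrix(estimator, X, y, **kwargs):
    return ConfusionMatrixDisplay.from_estimator(estimator, X, y, **kwargs)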

In [47]:
model1 = KNeighborsClassifier(algorithm = 'ball_tree',
 leaf_size = 15,
 n_jobs = 1,
 n_neighbors = 6,
 p = 1,
 weights = 'distance')
model1.fit(x_train, y_train)

print("train accuracy:",model1.score(x_train, y_train),"\n","test accuracy:",model1.score(x_test,y_test))

knnpred = model1.predict(x_test)
print("\n")
print("classification report for KNN classifier")
print(classification_report(y_test, knnpred))  # argument order is (y_true, y_pred)
print("\n")
print("confusion matrix for KNN classifier")
displr = plot_confusion_matrix(model1, x_test, y_test ,cmap=plt.cm.Blues , values_format='d')

In [49]:
# Classification using SVC
from sklearn.svm import SVC 
grid_params = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}
svc_model, y_pred_svc = classification_eval(SVC(), grid_params, x_train, x_test, y_train,'SVC')
svc_model

In [50]:
model1 = SVC(C = 1000, gamma = 0.001, kernel = 'rbf')
model1.fit(x_train, y_train)

print("train accuracy:",model1.score(x_train, y_train),"\n","test accuracy:",model1.score(x_test,y_test))

pred = model1.predict(x_test)
print("\n")
print("classification report for SVC classifier")
print(classification_report(y_test, pred))  # argument order is (y_true, y_pred)
print("\n")
print("confusion matrix for SVC classifier")
displr = plot_confusion_matrix(model1, x_test, y_test ,cmap=plt.cm.Blues , values_format='d')

In [52]:
# Classification using Decision Tree
from sklearn.tree import DecisionTreeClassifier
grid_params = {'max_depth' : [1,5,10], 'min_samples_leaf' : [1,5,10], 'criterion':['gini','entropy'], } 

dt_model, y_pred_dt = classification_eval(DecisionTreeClassifier(), grid_params, x_train, x_test, y_train,'Decision Tree')
dt_model

In [54]:
model1 = DecisionTreeClassifier(criterion="gini", max_depth=5, min_samples_leaf = 1)
model1.fit(x_train, y_train)

print("train accuracy:",model1.score(x_train, y_train),"\n","test accuracy:",model1.score(x_test,y_test))

pred = model1.predict(x_test)
print("\n")
print("classification report for decision tree classifier")
print(classification_report(y_test, pred))  # argument order is (y_true, y_pred)
print("\n")
print("confusion matrix for decision tree classifier")
displr = plot_confusion_matrix(model1, x_test, y_test ,cmap=plt.cm.Blues , values_format='d')

In [56]:
# Classification using Random Forest
from sklearn.ensemble import RandomForestClassifier
grid_params = {'criterion':['gini'],'n_estimators' : [100, 250, 500], 'max_depth' : [1,5,10,15,20], 'min_samples_leaf' : [1,2,5]} 

rf_model, y_pred_rf = classification_eval(RandomForestClassifier(), grid_params, x_train, x_test, y_train,'Random Forest')
rf_model

In [57]:
model1 = RandomForestClassifier(criterion="gini", max_depth=20, min_samples_leaf = 2, n_estimators = 250)
model1.fit(x_train, y_train)

print("train accuracy:",model1.score(x_train, y_train),"\n","test accuracy:",model1.score(x_test,y_test))

pred = model1.predict(x_test)
print("\n")
print("classification report for random forest classifier")
print(classification_report(y_test, pred))  # argument order is (y_true, y_pred)
print("\n")
print("confusion matrix for random forest classifier")
displr = plot_confusion_matrix(model1, x_test, y_test ,cmap=plt.cm.Blues , values_format='d')

In [58]:
# Classification using XGB Classifier
from xgboost import XGBClassifier 
grid_params = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5]
        }
xg_model, y_pred_rf = classification_eval(XGBClassifier(use_label_encoder=False), grid_params, x_train, x_test, y_train,'XG Boost')
xg_model

In [95]:
xgb_model = XGBClassifier(colsample_bytree = 0.6, gamma = 0.5, max_depth = 5, min_child_weight =  5, subsample = 0.8, use_label_encoder=False)
xgb_model.fit(x_train, y_train)

print("train accuracy:",xgb_model.score(x_train, y_train),"\n","test accuracy:",xgb_model.score(x_test,y_test))

pred = xgb_model.predict(x_test)
print("\n")
print("classification report for XGBoost classifier")
print(classification_report(y_test, pred))  # argument order is (y_true, y_pred)
print("\n")
print("confusion matrix for XGBoost classifier")
# Plot the XGBoost model here, not model1 (the random forest fitted earlier)
displr = plot_confusion_matrix(xgb_model, x_test, y_test ,cmap=plt.cm.Blues , values_format='d')

In [69]:
from sklearn.linear_model import LogisticRegression
grid_params = {"penalty":["l1","l2"], "max_iter" : [10, 100, 1000]}
lr_model, y_pred_rf = classification_eval(LogisticRegression(), grid_params, x_train, x_test, y_train,'Logistic Regression')
lr_model

In [81]:
lr = LogisticRegression(max_iter=2000,penalty='l2')
model1=lr.fit(x_train, y_train)
print("train accuracy:",model1.score(x_train, y_train),"\n","test accuracy:",model1.score(x_test,y_test))
lrpred = lr.predict(x_test)
print("\n")
print("classification report for logistic regression")
print(classification_report(y_test, lrpred))  # argument order is (y_true, y_pred)
print("\n")
print("confusion matrix for logistic regression")
displr = plot_confusion_matrix(lr, x_test, y_test,cmap=plt.cm.Blues , values_format='d')

In [84]:
from sklearn.ensemble import AdaBoostClassifier
grid_params = {
    'n_estimators': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 20],
    'learning_rate': [(0.97 + x / 100) for x in range(0, 8)],
    'algorithm': ['SAMME', 'SAMME.R']
}
ada_model, y_pred_rf = classification_eval(AdaBoostClassifier(), grid_params, x_train, x_test, y_train,'Ada Boost')
ada_model

In [85]:
ada=AdaBoostClassifier(algorithm = 'SAMME.R', learning_rate = 1.03, n_estimators = 12)
model4=ada.fit(x_train, y_train)
print("train accuracy:",model4.score(x_train, y_train),"\n","test accuracy:",model4.score(x_test,y_test))
adapred = ada.predict(x_test)
print("\n")
print("classification report for adaboost classifier")
print(classification_report(y_test, adapred))  # argument order is (y_true, y_pred)
print("\n")
print("confusion matrix for adaboost classifier")
displr = plot_confusion_matrix(ada, x_test, y_test ,cmap=plt.cm.Blues , values_format='d')

In [87]:
from sklearn.linear_model import RidgeClassifier
grid_params = {'alpha':[1, 10]}
rid_model, y_pred_rf = classification_eval(RidgeClassifier(), grid_params, x_train, x_test, y_train,'Ridge Classifier')
rid_model

In [88]:
rc =RidgeClassifier()
model5=rc.fit(x_train, y_train)
print("train accuracy:",model5.score(x_train, y_train),"\n","test accuracy:",model5.score(x_test,y_test))
rcpred = rc.predict(x_test)
print("\n")
print("classification report for Ridge Classification")
print(classification_report(y_test, rcpred))  # argument order is (y_true, y_pred)
print("\n")
print("confusion matrix for Ridge Classification")
displr = plot_confusion_matrix(rc, x_test, y_test,cmap=plt.cm.Blues , values_format='d')

In [92]:
results = pd.DataFrame(model_results,columns = ['Model','Best Score'])

In [94]:
# results = results.drop([0, 6, 7, 8, 9, 10, 11, 13, 15])
results

**GENERATING SUBMISSION VIA XGBOOST**

In [104]:
test2 = pd.read_csv('/kaggle/input/titanic/test.csv')
x = test2["PassengerId"]
pred = xgb_model.predict(test)
submission = pd.DataFrame({"PassengerId" : x, "Survived" : pred})
submission.to_csv('submission.csv', index = False, encoding='utf-8')

**CONCLUSION**

The role of machine learning applications in disaster management and
prediction has been increasing rapidly over the past years. Following this
trend, I was given an assignment to predict the survival of passengers aboard
the infamous Titanic, which sank while crossing the North Atlantic.
Through this work, I gained valuable experience in developing prediction
models and reached a best accuracy of 84 percent in the
"Titanic - Machine Learning from Disaster" competition organized by
Kaggle.

From my work building machine learning models to predict the survival of
passengers aboard the Titanic, I'd rather be a young female with a first-class
ticket.


**If you liked my work, or want to give me any suggestions on how I can improve it, do contact me via the following links**

**Github: https://github.com/AkshatRastogi-1nC0re**
**LinkedIn: https://www.linkedin.com/in/akshat-rastogi-3425aa1b8/**

**SIGNING OUT**
**AKSHAT RASTOGI**