<div>
<img src="https://media1.tenor.com/images/36ee59cad8a7e51c9546613e4521dc17/tenor.gif?itemid=14438682">
</div>

<div class="alert alert-block alert-success">  
<h1><center><strong>🚢 Break the ice</strong></center></h1>
    <p>
    The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the deaths of 1502 of the 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

Considering this, we have been asked to build a predictive model that answers the question “What sorts of people were more likely to survive?” using passenger data (e.g. name, age, gender, socio-economic class).

</p>
</div>

<div class="alert alert-info">  
<h3><strong>Imports</strong></h3>
</div>

In [None]:
!pip install pywaffle

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling
import plotly.express as px
import plotly.graph_objects as go
import sklearn.metrics as metrics
import plotly.offline as py

from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# plot_precision_recall_curve was unused here and is absent from recent scikit-learn releases
from sklearn.metrics import accuracy_score, roc_curve, auc, confusion_matrix, precision_recall_curve
from pywaffle import Waffle
from yellowbrick.classifier import classification_report
from plotly.subplots import make_subplots

In [None]:
custom_colors = ["#c8e7ff", "#deaaff", "#f72585", "#d100d1"]
# set_palette() returns None, so build the palette first and then register it
customPalette = sns.color_palette(custom_colors)
sns.set_palette(customPalette)

In [None]:
sns.palplot(sns.color_palette(custom_colors),size=1)
plt.tick_params(axis='both', labelsize=0, length = 0)

In [None]:
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})

<div class="alert alert-info">  
<h3><strong>Reading the csv files</strong></h3>
</div>

In [None]:
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
train_data.head()

In [None]:
train_data.shape

In [None]:
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()

In [None]:
test_data.shape

<div class="alert alert-info">  
<h3><strong>Generate descriptive statistics</strong></h3>
</div>

`train_data.describe()` summarises each numeric column with statistics such as:

* `DataFrame.count`: number of non-NA/null observations.
* `DataFrame.max`: maximum of the values.
* `DataFrame.min`: minimum of the values.
* `DataFrame.mean`: mean of the values.
* `DataFrame.std`: standard deviation of the observations.
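
To make the link between `describe()` and the individual methods concrete, here is a minimal sketch on a toy DataFrame. The column values are made up for illustration and are not taken from the Titanic files.

```python
import pandas as pd

# Toy data (hypothetical values, not from train.csv).
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

summary = toy.describe()

# Each row of describe() matches the corresponding Series method.
age_count = toy["Age"].count()  # counts non-null observations only
age_mean = toy["Age"].mean()    # missing values are skipped by default
age_std = toy["Age"].std()      # sample standard deviation (ddof=1)

print(summary.loc[["count", "mean", "std", "min", "max"]])
```

Note that `count` differs between the two columns because `Age` contains a missing value, which is a quick first hint about where imputation will be needed.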

In [None]:
train_data.describe()

<div class="alert alert-info">  
<h3><strong>Data types of attributes</strong></h3>
</div>


In [None]:
train_data.dtypes

<div class="alert alert-info">  
<h3><strong>Checking columns for null values</strong></h3>
</div>


In [None]:
train_data.isna().sum()

<div class="alert alert-info">  
<h3><strong>Number of Unique values per column</strong></h3>
</div>


In [None]:
train_data.nunique()

> All passenger IDs are unique and there are no missing values for this column.

<div class="alert alert-info">  
<h3><strong>Pandas profiling</strong></h3>
</div>

pandas_profiling generates a profile report from a pandas DataFrame.

`df.describe()` is useful but fairly basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with `df.profile_report()` for quick data analysis.

In [None]:
profile = pandas_profiling.ProfileReport(train_data)

In [None]:
profile

<div class="alert alert-info">  
<h3><strong>Modifying Cabin column</strong></h3>
</div>

Distribution of class

<div>
<img src="https://i.imgur.com/bvyChJc.jpg">
</div>

Decks

<div>
<img src="https://i.imgur.com/FAMeIC7.png">
</div>

In [None]:
train_data['Cabin'].unique()

In [None]:
train_data['Cabin'] = train_data['Cabin'].apply(lambda i: i[0] if pd.notnull(i) else 'Z')
test_data['Cabin'] = test_data['Cabin'].apply(lambda i: i[0] if pd.notnull(i) else 'Z')

Z indicates those values that are missing.
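
As a rough illustration of the deck extraction, the same result can be obtained with vectorised string indexing instead of `apply`. The cabin strings below are a hypothetical sample in the same format as the `Cabin` column.

```python
import pandas as pd

# Hypothetical sample of Cabin values, including a missing entry.
cabins = pd.Series(["C85", None, "E46", "B57 B59 B63"])

# The first character is the deck letter; missing cabins become 'Z'.
decks = cabins.str[0].fillna("Z")

print(decks.tolist())  # ['C', 'Z', 'E', 'B']
```

Keeping the missing cabins as their own category preserves the information that no cabin was recorded, rather than discarding those rows.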

In [None]:
train_data['Cabin'].unique()

In [None]:
train_data[train_data['Cabin']=='T'].index.values

In [None]:
test_data[test_data['Cabin']=='T'].index.values

In [None]:
train_data.iloc[339]

The passenger at index 339 is the only one recorded in cabin T on the Boat Deck; there is no evidence that anyone else occupied it.
Since he was a class 1 passenger, we group him with the A deck passengers.

In [None]:
index = train_data[train_data['Cabin'] == 'T'].index
train_data.loc[index, 'Cabin'] = 'A'

In [None]:
def plot_bar(df, feat_x, feat_y, s):
    """Bar chart of feat_y counts for each feat_x value; s=True stacks the bars."""
    ct = pd.crosstab(df[feat_x], df[feat_y])
    return ct.plot(kind='bar', stacked=s)

In [None]:
dpi=80
plot_bar(train_data, 'Cabin', 'Pclass',False)
plt.legend(title='Pclass',loc='upper right',bbox_to_anchor=(1.25, 1))
plt.gcf().set_size_inches(10,8)
plt.ylim(0,100)
plt.xticks(rotation=45)
plt.show()

* A, B and C have only class 1 passengers, so they can be grouped together.
* D has both class 1 and 2 passengers. E has class 1, 2 and 3 passengers. So these two can be grouped together.
* F and G both have class 2 and 3 passengers, so they form a third group.
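
The grouping described above can be sketched with a single mapping and checked with a cross-tabulation. The rows below are made up to mirror the observations in the list; `deck_groups` and `CabinGroup` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical deck/class combinations mirroring the observations above.
toy = pd.DataFrame({
    "Cabin":  ["A", "B", "C", "D", "E", "E", "F", "G", "Z"],
    "Pclass": [1, 1, 1, 2, 1, 3, 2, 3, 3],
})

deck_groups = {
    "A": "ABC", "B": "ABC", "C": "ABC",
    "D": "DE", "E": "DE",
    "F": "FG", "G": "FG",
}

# replace() leaves unmapped labels such as 'Z' unchanged.
toy["CabinGroup"] = toy["Cabin"].replace(deck_groups)

# Rows are deck groups, columns are passenger classes.
table = pd.crosstab(toy["CabinGroup"], toy["Pclass"])
print(table)
```

A single dictionary keeps the grouping rule in one place, so the same mapping can be applied to both the training and the test frame.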

In [None]:
train_data['Cabin'] = train_data['Cabin'].replace(['A', 'B', 'C'], 'ABC')
train_data['Cabin'] = train_data['Cabin'].replace(['D', 'E'], 'DE')
train_data['Cabin'] = train_data['Cabin'].replace(['F', 'G'], 'FG')

test_data['Cabin'] = test_data['Cabin'].replace(['A', 'B', 'C'], 'ABC')
test_data['Cabin'] = test_data['Cabin'].replace(['D', 'E'], 'DE')
test_data['Cabin'] = test_data['Cabin'].replace(['F', 'G'], 'FG')

<div class="alert alert-info">  
<h3><strong>Dropping columns and filling NA values</strong></h3>
</div>

In [None]:
train_data.drop(["Ticket", "Name", "PassengerId"], axis=1, inplace=True)
test_data.drop(["Ticket", "Name", "PassengerId"], axis=1, inplace=True)

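# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which may not modify the DataFrame in recent pandas versions.
train_data["Age"] = train_data["Age"].fillna(train_data["Age"].median())
test_data["Age"] = test_data["Age"].fillna(test_data["Age"].median())

test_data["Fare"] = test_data["Fare"].fillna(test_data["Fare"].median())

# 'S' is used as the fill value for missing ports of embarkation
train_data["Embarked"] = train_data["Embarked"].fillna('S')
test_data["Embarked"] = test_data["Embarked"].fillna('S')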

In [None]:
train_data["Cabin"].unique()

<div class="alert alert-info">  
<h3><strong>Label encoding</strong></h3>
</div>
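
One thing to watch with label encoding is that the integer codes depend on the labels seen when the encoder is fitted. The sketch below, on hypothetical cabin groups, contrasts reusing an encoder fitted on the training labels with fitting a second encoder on the test labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical cabin groups; the test sample lacks 'ABC' and 'FG'.
train_cabin = pd.Series(["ABC", "Z", "DE", "FG", "Z"])
test_cabin = pd.Series(["Z", "DE", "Z"])

# Fit once on the training labels; classes_ is sorted alphabetically.
encoder = LabelEncoder().fit(train_cabin)
mapping = dict(zip(encoder.classes_, range(len(encoder.classes_))))

shared = encoder.transform(test_cabin)               # reuses the train mapping
separate = LabelEncoder().fit_transform(test_cabin)  # builds its own mapping

print(mapping)
print(shared.tolist(), separate.tolist())
```

With the shared encoder, 'Z' keeps the same code in both sets; with a separately fitted encoder, the same label can silently receive a different integer.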

In [None]:
gender = {'male': 0, 'female': 1}
train_data.Sex = [gender[item] for item in train_data.Sex] 
test_data.Sex = [gender[item] for item in test_data.Sex] 

embarked = {'S': 0, 'C': 1, 'Q':2}
train_data.Embarked = [embarked[item] for item in train_data.Embarked] 
test_data.Embarked = [embarked[item] for item in test_data.Embarked] 


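# Fit once on the training labels and reuse the mapping for the test set,
# so that each cabin group receives the same integer code in both frames.
cabin_encoder = LabelEncoder().fit(train_data['Cabin'])
train_data['Cabin'] = cabin_encoder.transform(train_data['Cabin'])
test_data['Cabin'] = cabin_encoder.transform(test_data['Cabin'])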

In [None]:
train_data.dtypes

<div class="alert alert-info">  
<h3><strong>EDA</strong></h3>
</div>

<div class="alert alert-info">  
<h3><strong>Distribution of Gender</strong></h3>
</div>

In [None]:
gender = train_data['Sex'].value_counts()

fig = plt.figure(
    FigureClass=Waffle, 
    rows=5,
    columns=10,
    values=gender,
    colors = (custom_colors[0], custom_colors[1]),
    title={'label': 'Gender Distribution', 'loc': 'center'},
    labels=["{}({})".format(a, b) for a, b in zip(gender.index, gender) ],
    legend={'loc': 'upper left', 'bbox_to_anchor': (1,1)},
    font_size=35, 
    icons = ['male','female'],
    icon_legend=True,
    figsize=(10, 8)
)

* Male (encoded as 0): 577
* Female (encoded as 1): 314

<div class="alert alert-info">  
<h3><strong>Distribution of Age</strong></h3>
</div>

In [None]:
def triple_plot(x, title,c):
    fig, ax = plt.subplots(3,1,figsize=(15,8),sharex=True)
    # histplot replaces the deprecated distplot; pass the data as x= explicitly
    sns.histplot(x, kde=True, ax=ax[0], color=c)
    ax[0].set(xlabel=None)
    ax[0].set_title('Histogram + KDE')
    sns.boxplot(x=x, ax=ax[1], color=c)
    ax[1].set(xlabel=None)
    ax[1].set_title('Boxplot')
    sns.violinplot(x=x, ax=ax[2], color=c)
    ax[2].set(xlabel=None)
    ax[2].set_title('Violin plot')
    fig.suptitle(title, fontsize=16)
    plt.tight_layout(pad=3.0)
    plt.show()

In [None]:
def hist(x,title):
    plt.figure(figsize = (10,8))
    # histplot replaces the deprecated distplot(kde=False)
    ax = sns.histplot(x)
    values = np.array([rec.get_height() for rec in ax.patches])
    norm = plt.Normalize(values.min(), values.max())
    colors = plt.cm.jet(norm(values))
    for rec, col in zip(ax.patches, colors):
        rec.set_color(col)
    plt.title(title)

In [None]:
hist(train_data['Age'],'Distribution of Age')

In [None]:
triple_plot(train_data['Age'],'Distribution of Age',custom_colors[2])

<div class="alert alert-info">  
<h3><strong>Distribution of Fare</strong></h3>
</div>

In [None]:
hist(train_data['Fare'],'Distribution of Fare')

In [None]:
triple_plot(train_data['Fare'],'Distribution of Fare',custom_colors[1])

<div class="alert alert-info">  
<h3><strong>Pclass and Age vs Survived</strong></h3>
</div>

In [None]:
sns.violinplot(x="Pclass", y="Age", hue="Survived", split=True, data=train_data)
plt.legend(title='Survived',loc='upper right',bbox_to_anchor=(1.25, 1))
plt.show()

<div class="alert alert-info">  
<h3><strong>Cabin vs Survived</strong></h3>
</div>

In [None]:
td = pd.read_csv("/kaggle/input/titanic/train.csv")
td["Cabin"]=td.Cabin.str[0]

> Before grouping

In [None]:
sns.catplot(x="Survived", col="Cabin", col_wrap=8, data=td[td.Cabin.notnull()], kind="count", height=4, aspect=.6)
plt.show()

> After grouping 

In [None]:
sns.catplot(x="Survived", col="Cabin", col_wrap=4, data=train_data, kind="count", height=4, aspect=.6)
plt.show()

After label encoding, the Cabin codes correspond to the deck groups as follows:

* Cabin 0: ABC
* Cabin 1: DE
* Cabin 2: FG
* Cabin 3: Z (missing values)

<div class="alert alert-info">  
<h3><strong>SibSp vs Survived</strong></h3>
</div>

In [None]:
plot_bar(train_data, 'SibSp', 'Survived',False)
plt.legend(title='Survived',loc='upper right',bbox_to_anchor=(1.25, 1))
plt.gcf().set_size_inches(10,8)
plt.xticks(rotation=45)
plt.show()

<div class="alert alert-info">  
<h3><strong>Parch vs Survived</strong></h3>
</div>


In [None]:
plot_bar(train_data, 'Parch', 'Survived',True)
plt.legend(title='Survived',loc='upper right',bbox_to_anchor=(1.25, 1))
plt.gcf().set_size_inches(10,8)
plt.xticks(rotation=45)
plt.show()

<div class="alert alert-info">  
<h3><strong>Gender vs Survived</strong></h3>
</div>

In [None]:
data = train_data[['Sex','Survived']]
data1 = data.loc[data.Sex==0]
data2 = data.loc[data.Sex!=0]

plt.figure(figsize=(16,8),dpi=60)

ax1 = plt.subplot(121, aspect='equal')
data1['Survived'].value_counts().plot.pie(startangle=90,autopct='%1.1f%%', ax=ax1)
ax1.title.set_text('Male')

ax2 = plt.subplot(122, aspect='equal')
data2['Survived'].value_counts().plot.pie(startangle=90,autopct='%1.1f%%', ax=ax2)
ax2.title.set_text('Female')

plt.show()

<div class="alert alert-info">  
<h3><strong>Embarked and Fare vs Survived</strong></h3>
</div>

In [None]:
sns.barplot(x = "Embarked", y = "Fare", hue = "Survived", data = train_data)
plt.show()

<div class="alert alert-info">  
<h3><strong>Fare vs Survived</strong></h3>
</div>

In [None]:
# fill= replaces the deprecated shade= argument
sns.kdeplot(train_data['Fare'][train_data.Survived == 1], color=custom_colors[2], fill=True)
sns.kdeplot(train_data['Fare'][train_data.Survived == 0], color=custom_colors[1], fill=True)
plt.legend(['Survived', 'Not Survived'])
plt.show()

<div class="alert alert-info">  
<h3><strong>Age vs Survived</strong></h3>
</div>

In [None]:
# fill= replaces the deprecated shade= argument
sns.kdeplot(train_data['Age'][train_data.Survived == 1], color=custom_colors[2], fill=True)
sns.kdeplot(train_data['Age'][train_data.Survived == 0], color=custom_colors[1], fill=True)
plt.legend(['Survived', 'Not Survived'])
plt.show()

<div class="alert alert-info">  
<h3><strong>Correlation</strong></h3>
</div>

In [None]:
mask = np.triu(np.ones_like(train_data.corr(), dtype=bool))
fig, ax = plt.subplots(figsize=(16,10),dpi=80, facecolor='w', edgecolor='k')
sns.heatmap(train_data.corr(), mask=mask, cmap="YlGnBu", vmax=.3, center=0,annot = True,
            square=True)
plt.show()

In [None]:
expected_values = train_data["Survived"]
train_data.drop("Survived", axis=1, inplace=True)

In [None]:
train_data.drop("Cabin", axis=1, inplace=True)
test_data.drop("Cabin", axis=1, inplace=True)

<div class="alert alert-info">  
<h3><strong>Training and testing</strong></h3>
</div>

In [None]:
X = train_data.values
y = expected_values.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

In [None]:
model = RandomForestClassifier(criterion='gini',
                                           n_estimators=1750,
                                           max_depth=7,
                                           min_samples_split=6,
                                           min_samples_leaf=6,
                                           max_features='sqrt',  # same behaviour as the removed 'auto' option for classifiers
                                           oob_score=True,
                                           random_state=42,
                                           n_jobs=-1,
                                           verbose=1) 

In [None]:
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

In [None]:
print("Training accuracy: ", accuracy_score(y_train, y_pred_train))
print("Testing accuracy: ", accuracy_score(y_test, y_pred_test))

In [None]:
column_values = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] 

X_train_df = pd.DataFrame(data = X_train,   
                  columns = column_values) 
X_test_df = pd.DataFrame(data = X_test,   
                  columns = column_values) 

In [None]:
def feature_importance(model):
    importances = model.feature_importances_
    indices = np.argsort(importances)
    features = X_train_df.columns
    plt.title('Feature Importance')
    plt.barh(range(len(indices)), importances[indices], color=custom_colors[2], align='center')
    plt.yticks(range(len(indices)), [features[i] for i in indices])
    plt.xlabel('Relative Importance')
    plt.show()

<div class="alert alert-info">  
<h3><strong>Confusion Matrix</strong></h3>
</div>


![](https://miro.medium.com/max/2800/0*9r99oJ2PTRi4gYF_.jpg)
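
As a small worked example of the layout shown in the figure, the sketch below builds a confusion matrix with scikit-learn from hypothetical labels and predictions. In scikit-learn's convention, rows are actual classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (1 = survived, 0 = did not survive).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Here one non-survivor is predicted as a survivor (a false positive) and one survivor is missed (a false negative).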

<div class="alert alert-info">  
<h3><strong>ROC Curve</strong></h3>
</div>

![](https://glassboxmedicine.files.wordpress.com/2019/02/roc-curve-v2.png?w=576)
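
A ROC curve is traced by sweeping a decision threshold over continuous scores, so it is usually computed from predicted probabilities rather than from hard 0/1 predictions. The following is a minimal sketch with made-up labels and scores; in practice the scores would typically come from something like a classifier's `predict_proba` output.

```python
from sklearn.metrics import roc_curve, auc

# Hypothetical labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each threshold on y_score yields one (FPR, TPR) point on the curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", roc_auc)
```

With hard class predictions there is only one operating point besides the two corners, so the curve degenerates to two straight segments and the AUC becomes less informative.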

<div class="alert alert-info">  
<h3><strong>Precision, Recall, F1 score</strong></h3>
</div>

<div>
<img src="https://i.imgur.com/WEzWTOU.jpg" width="600" height="400">
</div>
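
The three metrics can be computed directly from confusion-matrix counts. The helper below is an illustrative sketch (`precision_recall_f1` is a hypothetical name, and the counts are made up), with guards against division by zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives.
precision, recall, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which makes it more demanding than a simple average of the two.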

In [None]:
def visualize_metrics(model, model_name) :  
    
    cm = confusion_matrix(y_test, y_pred_test)
    x =  ["0 (pred)","1 (pred)"]
    y = ["0 (actual)","1 (actual)"]
    
    trace1 = go.Heatmap(z = cm  ,x = x,
                        y = y,xgap = 1, ygap = 1, 
                        colorscale = 'purpor', showscale  = False)
    
    
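    # Use the predicted probability of class 1 so the ROC curve covers every threshold
    fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:, 1])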
    roc_auc = auc(fpr, tpr)

    trace2 = go.Scatter(x=fpr, y=tpr,
                        name = "ROC : " ,
                        line = dict(color = ('rgb(209,0,209)'),width = 2), fill='tozeroy',fillcolor=('rgba(209,0,209,0.7)'))
    trace3 = go.Scatter(x = [0,1],y = [0,1],
                        line = dict(color = ('black'),width = 1.5,
                        dash = 'dot'))

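    # Use predicted probabilities so the precision-recall curve covers every threshold
    precision, recall, _ = precision_recall_curve(y_test, model.predict_proba(X_test)[:, 1])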
    
    
    tp = cm[1,1]
    fn = cm[1,0]
    fp = cm[0,1]
    tn = cm[0,0]
    Accuracy  =  ((tp+tn)/(tp+tn+fp+fn))
    Precision =  (tp/(tp+fp))
    Recall    =  (tp/(tp+fn))
    F1_score  =  (2*Precision*Recall/(Precision+Recall))

    show_metrics = pd.DataFrame(data=[[F1_score,Recall,Precision,Accuracy]])
    show_metrics = show_metrics.T
    trace4 = go.Bar(x = (show_metrics[0].values), 
                    y = ['F1 score ','Recall ','Precision ','Accuracy '], text = np.round(show_metrics[0].values,4),
                    textposition = 'auto', textfont=dict(color='black'),
                    orientation = 'h', opacity = 1, marker=dict(
            color=custom_colors,
            line=dict(color='#000000',width=1.5)))

    
    trace5 = go.Scatter(x = recall, y = precision,
                        name = "Precision" + str(precision),
                        line = dict(color = ('rgb(222,170,255)'),width = 2), fill='tozeroy',fillcolor=('rgba(222,170,255,0.7)'))
    
    fig = make_subplots(rows=2, cols=2, print_grid=False,
                          specs=[[{}, {}], 
                                 [{}, {}]],
                          subplot_titles=('Confusion Matrix',
                                          'ROC curve'+" "+ '('+ str(round(roc_auc,3))+')',
                                          'Metrics',
                                          'Precision - Recall curve',
                                          ),
                        horizontal_spacing = 0.2
                       )
        
    fig.append_trace(trace1,1,1)
    fig.append_trace(trace2,1,2)
    fig.append_trace(trace3,1,2)
    fig.append_trace(trace4,2,1)
    fig.append_trace(trace5,2,2)
    
    fig['layout'].update(showlegend = False, title = '<b>Visualizing Metrics</b><br>'+model_name, title_x=0.5,
                        autosize = False, height = 800, width = 800,
                        plot_bgcolor = 'white',
                        paper_bgcolor = 'white',
                        margin = dict(b = 195), font=dict(color='black'))
    
    fig["layout"]["xaxis1"].update(showgrid=False, color = 'black',title= "Predicted value")
    fig["layout"]["yaxis1"].update(showgrid=False, color = 'black',title= "Actual value")
    fig["layout"]["xaxis2"].update(dict(title = "False Positive Rate"), color = 'black',showgrid=True, gridwidth=1, gridcolor='black',zeroline=True, zerolinewidth=2, zerolinecolor='black')
    fig["layout"]["yaxis2"].update(dict(title = "True Positive Rate"),color = 'black',showgrid=True, gridwidth=1, gridcolor='black',zeroline=True, zerolinewidth=2, zerolinecolor='black')
    fig["layout"]["xaxis3"].update(dict(range=[0, 1], color = 'black'),showgrid=True, gridwidth=1, gridcolor='black')
    fig["layout"]["yaxis3"].update(color = 'black')
    fig["layout"]["xaxis4"].update(dict(title = "recall"), range = [0,1.05],color = 'black',showgrid=True, gridwidth=1, gridcolor='black')
    fig["layout"]["yaxis4"].update(dict(title = "precision"), range = [0,1.05],color = 'black',showgrid=True, gridwidth=1, gridcolor='black')
 
    for i in fig['layout']['annotations']:
        i['font'] = dict(color='black', size = 14)

    py.iplot(fig)

In [None]:
visualize_metrics(model, 'Random Forest Classifier')

In [None]:
feature_importance(model)

> The women and children were allowed to leave the ship first.

In [None]:
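# Refit on the full training set; use .values so that fitting and the later
# predict(test_data.values) call both receive plain arrays
model.fit(train_data.values, expected_values.values)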
print("%.4f" % model.oob_score_)

<div class="alert alert-info">  
<h3><strong>Creating the submission file</strong></h3>
</div>

In [None]:
passenger_IDs = pd.read_csv("/kaggle/input/titanic/test.csv")[["PassengerId"]].values
preds = model.predict(test_data.values)
preds

In [None]:
df = {'PassengerId': passenger_IDs.ravel(), 'Survived': preds}
df_predictions = pd.DataFrame(df).set_index(['PassengerId'])
df_predictions.head(10)

In [None]:
df_predictions.to_csv('/kaggle/working/Predictions.csv')