# Problem Statement

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing more than 1,500 of the roughly 2,200 passengers and crew. The tragedy shocked the international community and led to better safety regulations for ships.
One reason the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people, such as women, children, and the upper class, were more likely to survive than others.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

<img src = "https://cdn.britannica.com/79/4679-050-BC127236/Titanic.jpg" alt="Drawing" width="400">
<figcaption><h6>Titanic: the incredible ship that collided with destiny.</h6></figcaption>


#### Definition Key

|S.No.|Variable|Definition|Key|
|-----|--------|----------|---|
|1    |Survived|Survival|0 = No, 1 = Yes|
|2    |Pclass  |Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd|
|3    |Sex     |Sex|-|
|4    |Age     |Age in years|-|
|5    |SibSp   |Number of siblings / spouses aboard the Titanic|-|
|6    |Parch   |Number of parents / children aboard the Titanic|-|
|7    |Ticket  |Ticket number|-|
|8    |Fare    |Passenger fare|-|
|9    |Cabin   |Cabin number|-|
|10   |Embarked|Port of embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|


# Importing Libraries and Reading Data

In [None]:
#importing libraries
import pandas as pd
import numpy as np

#Visualisation Libraries
import matplotlib.pyplot as plt
import seaborn as sns

#modelling libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

#evaluation metrics
# jaccard_similarity_score was removed from scikit-learn; jaccard_score is its replacement
from sklearn.metrics import jaccard_score, confusion_matrix, f1_score, classification_report, accuracy_score

import warnings
warnings.simplefilter("ignore")
%matplotlib inline

In [None]:
#reading data
data = pd.read_csv("train_data.csv")
data.head(2)

In [5]:
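# numeric_only=True skips text columns (Name, Sex, ...), which recent pandas versions no longer drop silently
data.corr(numeric_only=True)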

|           |PassengerId|Survived |Pclass   |Age      |SibSp    |Parch    |Fare    |
|-----------|-----------|---------|---------|---------|---------|---------|--------|
|PassengerId|1.0        |-0.005007|-0.035144|0.036847 |-0.057527|-0.001652|0.012658|
|Survived   |-0.005007  |1.0      |-0.338481|-0.077221|-0.035322|0.081629 |0.257307|
|Pclass     |-0.035144  |-0.338481|1.0      |-0.369226|0.083081 |0.018443 |-0.5495 |
|Age        |0.036847   |-0.077221|-0.369226|1.0      |-0.308247|-0.189119|0.096067|
|SibSp      |-0.057527  |-0.035322|0.083081 |-0.308247|1.0      |0.414838 |0.159651|
|Parch      |-0.001652  |0.081629 |0.018443 |-0.189119|0.414838 |1.0      |0.216225|
|Fare       |0.012658   |0.257307 |-0.5495  |0.096067 |0.159651 |0.216225 |1.0     |
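
To read the correlation output more easily, the `Survived` row can be ranked by absolute value. The following is a minimal sketch that copies the figures shown above into a Series by hand; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Correlations with Survived, copied by hand from the output above.
corr_with_survived = pd.Series(
    {
        "PassengerId": -0.005007,
        "Pclass": -0.338481,
        "Age": -0.077221,
        "SibSp": -0.035322,
        "Parch": 0.081629,
        "Fare": 0.257307,
    }
)

# Rank by absolute strength; the sign only gives the direction of the relationship.
order = corr_with_survived.abs().sort_values(ascending=False).index
ranked = corr_with_survived.reindex(order)
print(ranked)
```

In this ranking `Pclass` and `Fare` come first, which is consistent with the problem statement's remark that upper-class passengers were more likely to survive. Correlation only captures linear relationships between numeric columns, so it says nothing about `Sex` or `Embarked`.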


# Data Evaluation, Preprocessing and Data Cleaning

In [None]:
data.describe()

In [None]:
print("Shape of Train Data is:", data.shape)
print("*"*50)
print("Data type in Train Dataset.\n\n", data.dtypes)

In [None]:
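# numeric_only=True skips text columns, which recent pandas versions no longer drop silently
data.corr(numeric_only=True)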

In [None]:
# Missing values
def missing_values_table(df):
    """Summarise the columns of df that contain missing values."""
    mis_val = df.isnull().sum()  # Total missing values per column
    mis_val_percent = 100 * mis_val / len(df)  # Percentage of missing values
    # Combine counts and percentages into one table with readable column names
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table = mis_val_table.rename(columns={0: 'Missing Values', 1: '% of Total Values'})
    # Keep only columns with missing values, sorted by percentage (descending)
    mis_val_table = (mis_val_table[mis_val_table['Missing Values'] != 0]
                     .sort_values('% of Total Values', ascending=False)
                     .round(1))
    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.")
    print("There are " + str(mis_val_table.shape[0]) + " columns that have missing values.")
    return mis_val_table  # Return the dataframe with missing information

missing_values1 = missing_values_table(data)
missing_values1.style.background_gradient(cmap='Reds')

In [None]:
print(data[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False))
sns.countplot(x = "Pclass", hue= "Survived", data = data, dodge = True)

In [None]:
print(data[['Sex','Survived']].groupby(['Sex'], as_index=False).mean())
sns.countplot(x = "Survived", hue= "Sex", data = data, dodge = True)

In [None]:
print(data[['SibSp','Survived']].groupby(['SibSp'], as_index=False).count())
sns.countplot(x = "SibSp", hue = "Survived", data = data, dodge = True)

In [None]:
bins = np.arange(0, 85, 5)
g = sns.FacetGrid(data, col='Survived', row = 'Sex')
g.map(plt.hist, 'Age', bins=bins)

In [None]:
# `size` was renamed to `height` in seaborn 0.9
grid = sns.FacetGrid(data, col='Survived', row='Pclass', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();

In [None]:
sns.boxplot(x = 'Pclass', y = 'Age', data = data)

In [None]:
sns.heatmap(data.isnull(), yticklabels= False)

In [None]:
data.groupby(['Survived','Sex','Pclass'])['Age'].mean()

In [None]:
age_groupby_train = data.groupby(['Survived','Sex','Embarked','Pclass'])['Age'].mean()

In [None]:
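# Mark missing ages with -1 so the loop below can find them; assigning back avoids chained in-place modification
data['Age'] = data['Age'].fillna(-1)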

In [None]:
for row in range(len(age_groupby_train.index)):
    data.loc[(data['Survived'] == age_groupby_train.index[row][0]) &
           (data['Sex']== age_groupby_train.index[row][1]) &
           (data['Embarked']== age_groupby_train.index[row][2])&
           (data['Pclass']== age_groupby_train.index[row][3])&
           (data['Age']==-1),'Age']=age_groupby_train.values[row]
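
The loop above fills each missing age with the mean age of the passenger's group. A more compact way to express the same idea is `groupby(...).transform("mean")` followed by `fillna`. The sketch below uses a small made-up frame rather than the real data, and it groups only by `Sex` and `Pclass`. That grouping is an assumption made for illustration; the notebook also groups by `Survived` and `Embarked`.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the training data (values are illustrative only).
toy = pd.DataFrame(
    {
        "Sex": ["male", "male", "male", "female", "female", "female"],
        "Pclass": [3, 3, 3, 1, 1, 1],
        "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
    }
)

# Mean age of each (Sex, Pclass) group, broadcast back to every row of that group.
group_mean_age = toy.groupby(["Sex", "Pclass"])["Age"].transform("mean")

# Fill only the missing ages; known ages stay untouched.
toy["Age"] = toy["Age"].fillna(group_mean_age)
print(toy)
```

One design point worth weighing: grouping by `Survived` uses the target itself to build a feature, and that value is not available for passengers whose outcome is being predicted. Leaving it out of the grouping, as in this sketch, avoids that dependency.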

In [None]:
Sex = pd.get_dummies(data['Sex'], drop_first=True)
Embark = pd.get_dummies(data['Embarked'], drop_first= True)
Pcls = pd.get_dummies(data['Pclass'], drop_first= True)
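
The `drop_first=True` argument explains why only `male`, `Q` and `S` appear later as feature names: the first category in sorted order is dropped and becomes the implicit baseline. A minimal sketch on made-up values (the `ports` Series is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Made-up embarkation codes, for illustration only.
ports = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# Categories are sorted (C, Q, S) and the first one, C, is dropped.
dummies = pd.get_dummies(ports, drop_first=True)
print(dummies)
```

A passenger who embarked at Cherbourg is then encoded as "neither Q nor S", which avoids a redundant column that the other two already determine.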

In [None]:
data = pd.concat([data, Sex, Embark, Pcls], axis = 1)
data.head(2)

In [None]:
data.drop(['PassengerId', 'Sex', 'Embarked', 'Name', 'Ticket', 'Cabin', 'Fare'], axis = 1, inplace = True)
data.head(2)

In [None]:
print("Columns with missing values in Training Dataset:\n", data.isnull().any())

# Data Split and Model Building

In [None]:
X = data[['Pclass','Age','SibSp', 'Parch', 'male', 'Q','S']]
y = data['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=42)
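
The split above is purely random, so the share of survivors can differ between the training and test parts. If keeping that share equal matters, `train_test_split` accepts a `stratify` argument. The sketch below shows it on made-up data with a 60/40 class balance; the arrays and names are hypothetical and this is an optional variation, not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows with a 60/40 class balance (illustrative only).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print("train size:", len(y_tr), "positive share:", y_tr.mean())
print("test size:", len(y_te), "positive share:", y_te.mean())
```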

In [None]:
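# Note: despite the variable name `lr`, this is a random forest, not a logistic regression
lr = RandomForestClassifier(criterion='gini',
                            max_depth=5,
                            max_leaf_nodes=10,
                            min_samples_leaf=5,
                            min_samples_split=10,
                            n_estimators=100,
                            random_state=42)  # fixed seed so repeated runs give the same forest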

In [None]:
lr.fit(X_train, y_train)

In [None]:
y_pred = lr.predict(X_test)

# Model Evaluation

In [None]:
print("Accuracy Score (%):", accuracy_score(y_test, y_pred) * 100)
print("*" * 50)
print("F1 Score:", f1_score(y_test, y_pred))
print("*" * 50)
print("Confusion Matrix \n", confusion_matrix(y_test, y_pred))
print("*" * 50)
print("Classification Report \n", classification_report(y_test, y_pred))
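
To make the relationship between these numbers concrete, the sketch below derives accuracy and F1 by hand from a confusion matrix and checks them against scikit-learn. The labels are made up for illustration and are not results from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up labels for illustration; these are not predictions from the model above.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary 0/1 labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted survivors who really survived
recall = tp / (tp + fn)     # share of real survivors the model found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy:", accuracy, "sklearn:", accuracy_score(y_true, y_hat))
print("f1:", f1, "sklearn:", f1_score(y_true, y_hat))
```

Accuracy counts every correct prediction equally, while F1 focuses on the positive class (here, survivors), so the two can diverge when the classes are imbalanced.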

## THE END!!!

In [None]:
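# `sns.distributions` is a module, not a plotting function, and 'Fare' was dropped from `data` above,
# so reload the column from the CSV and plot its distribution with histplot
sns.histplot(pd.read_csv("train_data.csv")['Fare'], kde=True)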