<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0; color:tomato' role="tab" aria-controls="home"><center>Titanic EDA + End to End Machine Learning 🛳</center></h2>


<img src="https://faithmag.com/sites/default/files/styles/article_full/public/2018-09/titanic2.jpg?h=6521bd5e&itok=H8td6QVv.jpg"  Width="800">

## *Challenge*
`The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we will build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).`

## <font color=darkgreen>Approach:
- <i><b>Data cleaning and statistical analysis.
- Exploratory Data Analysis and visualisations.
- Machine learning modelling and Prediction using ML model.
- Finding the best machine learning model based on various scores.

### Importing libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid', context='notebook', palette='deep')
import warnings
warnings.filterwarnings('ignore')
import operator
sns.set_context("talk", font_scale = 1, rc={"grid.linewidth": 3})
pd.set_option('display.max_rows', 100, 'display.max_columns', 100)
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler,StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve,precision_score,recall_score,confusion_matrix,classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score,KFold,StratifiedKFold,StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
import tensorflow

### Importing train and test data set.

In [None]:
train= pd.read_csv('../input/titanic/train.csv') #Training data set
test= pd.read_csv('../input/titanic/test.csv') #Testing data set
gender = pd.read_csv('../input/titanic/gender_submission.csv')

In [None]:
train.head()

In [None]:
test.head()

In [None]:
print(len(train))
print(len(test))

### Checking train and test dataset info

In [None]:
print('Train Data Info')
print(train.info())
print('\n')
print('Test Data Info')
print(test.info())

In [None]:
train.describe()

In [None]:
test.describe()

### Exploring category of different feature variables
- <b>PassengerId</b>
 - <i>Index
- <b>Survived: Passenger survived (1) or not (0)</b>
 - <i>Binary categorical variable (the prediction target)
- <b>Pclass: upper class(1), middle class(2), lower class(3)</b>
 - <i>Ordinal variable 
- <b>Name: Passenger Name</b>
 - <i>Text variable or String
- <b>Sex: Male or Female</b>
 - <i>Nominal variable
- <b>Age: Passenger Age</b>
 - <i>Numerical continuous variable
- <b>SibSp: No of Siblings and Spouses travelling with passenger</b>
 - <i>Numerical discrete variable
- <b>Parch: No of Parents and Children travelling with passenger</b>
 - <i>Numerical discrete variable
- <b>Ticket: Ticket Number</b>
 - <i>Text variable or String
- <b>Fare</b>
 - <i>Numerical continuous variable
- <b>Cabin: Cabin number</b>
 - <i>Text/String variable
- <b>Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)</b>
 - <i>Nominal variable (stored as text)


### Missing values percentage for training and testing data set

In [None]:
pd.DataFrame([train.isnull().sum(),train.isnull().sum()/len(train)*100]).T.\
rename(columns={0:'Total',1:'Missing Perc'})

#### _<font color=darkgreen>Inference: Age has ~20% null/NaN values, Cabin has ~77% null values, and Embarked has only 2 null values (~0.2%)._
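The same missing-value table is built once for train and once for test. A small helper keeps the two cells identical; the sketch below runs on a made-up four-row frame, and `missing_report` is a hypothetical name that does not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return the count and percentage of missing values per column."""
    total = df.isnull().sum()
    return pd.DataFrame({'Total': total, 'Missing Perc': total / len(df) * 100})

# Toy frame for illustration only (not the Titanic data)
toy = pd.DataFrame({'Age': [22.0, np.nan, 35.0, np.nan],
                    'Cabin': [None, None, None, 'C85'],
                    'Fare': [7.25, 71.28, 8.05, 53.1]})
report = missing_report(toy)
print(report)
```

On the real data this would be called as `missing_report(train)` and `missing_report(test)`.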

In [None]:
pd.DataFrame([test.isnull().sum(),test.isnull().sum()/len(test)*100]).T.\
rename(columns={0:'Total',1:'Missing Perc'})

#### _<font color=darkgreen>Inference: Age has ~21% null/NaN values, Cabin has ~78% null values, and Fare has only 1 missing value._

### Fixing Embarked missing values

In [None]:
train[train.Embarked.isnull()]

#### For the two rows with Embarked=NaN we know Pclass=1, Fare=80 and Cabin=B28. To visualise fare by port and class more clearly, the plot below first drops fare outliers (fares of three standard deviations or more).

In [None]:
df = train[train['Fare']<train['Fare'].std()*3]
plt.figure(figsize=(10,10))
sns.boxplot(x='Embarked', y='Fare', hue='Pclass', data=df)
#plt.yticks(range(0,550,50))
plt.show()

#### <font color=darkgreen>Inference: Only Embarked C matches Fare=80 with Pclass=1. Hence, the missing Embarked values are most likely C. The Cabin feature could have helped confirm this, but around 78% of Cabin values are missing.
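The visual check can also be expressed numerically: compute the median fare per port and class, then pick the port whose first-class median is closest to the observed fare of 80. The rows below are made up for illustration; on the real data the same `groupby` would be applied to `train`.

```python
import pandas as pd

# Made-up first-class rows for illustration; not the real fares
toy = pd.DataFrame({
    'Embarked': ['C', 'C', 'C', 'S', 'S', 'S', 'Q', 'Q'],
    'Pclass':   [1, 1, 1, 1, 1, 1, 1, 1],
    'Fare':     [76.0, 80.0, 83.0, 50.0, 52.0, 60.0, 90.0, 90.0],
})
medians = toy.groupby(['Embarked', 'Pclass'])['Fare'].median()

# Port whose first-class median fare is closest to the observed fare of 80
closest_port = (medians.xs(1, level='Pclass') - 80).abs().idxmin()
print(medians)
print(closest_port)
```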

### Filling embarked missing values with C.

In [None]:
train['Embarked'] = train['Embarked'].fillna('C')

#### Combining train and test data in a master dataframe to fill in missing values. Let's take the first character of the Cabin column to better understand the distribution of cabin classes.


#### The Titanic had different cabins depending on fare, categorised by an initial letter followed by a number. Let's check whether each cabin category (i.e. C, E, A, D, etc.) has its own fare range.

In [None]:
survivers = train.Survived
train.drop(["Survived"],axis=1, inplace=True)
master=pd.concat([train,test])
master.Cabin.fillna("N", inplace=True)
master['Cabin'] = master['Cabin'].apply(lambda x:list(str(x))[0].upper())
master.head()

In [None]:
master.groupby('Cabin')['Fare'].describe()

#### Analysing mean, median, and max value we can make a fare range under which Cabin category falls.
- val<=14,<b>Cabin:G</b>
- val=35, <b>Cabin:T</b>
- 14 < val <= 26, <b>Cabin:F</b>
- 26 < val <= 39, <b>Cabin:A</b>
- 39<val<=53, <b>Cabin:E</b>
- 53 < val <= 80, <b>Cabin:D</b>
- 80 < val <= 115, <b>Cabin:C</b>
- val > 115, <b>Cabin:B</b>

#### Function to replace NaN or 'N' value with Above mentioned values.

In [None]:
def repl_N(val):
    n = 0
    if val==35:
        n = 'T'
    elif val<=14:
        n = 'G'
    elif 14<val<=26:
        n='F'
    elif 26<val<=39:
        n='A'
    elif 39<val<=53:
        n='E'
    elif 53<val<=80:
        n='D'
    elif 80<val<=115:
        n='C'
    else:
        n='B'
    return n
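The same fare-to-cabin mapping can be written with `pd.cut`, which makes the bin edges explicit. This is a sketch under the ranges listed above; `fare_to_cabin` is a hypothetical name. One difference worth noting: a NaN fare stays NaN here, whereas `repl_N` returns 'B' for NaN because every comparison with NaN is False.

```python
import numpy as np
import pandas as pd

def fare_to_cabin(fares):
    """Map a Series of fares to cabin letters using right-closed bins."""
    bins = [-np.inf, 14, 26, 39, 53, 80, 115, np.inf]
    labels = ['G', 'F', 'A', 'E', 'D', 'C', 'B']
    out = pd.cut(fares, bins=bins, labels=labels).astype(object)
    out[fares == 35] = 'T'   # special case taken from the list above
    return out

cabins = fare_to_cabin(pd.Series([7.25, 14, 14.5, 35, 80, 80.01, 512.33]))
print(list(cabins))
```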


### Retrieving rows having 'N' values for Cabin and replacing them with the values from the function above.

#### Filling Nan/Null value in Fare column which is PassengerId=1044. PassengerId=1044 travelling in Pclass:3 and Embarked:S.

In [None]:

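# Fill the single missing Fare first (PassengerId 1044, Pclass 3, Embarked S),
# so that repl_N never receives NaN and falls through to 'B'.
fare_mean= master[(master['Pclass']==3) & (master['Embarked']=='S')]['Fare'].mean()
master['Fare'] = master['Fare'].fillna(fare_mean)
master_N = master[master['Cabin']=='N'].copy()  # copy avoids SettingWithCopyWarning
master_notN = master[master['Cabin']!='N']
master_N['Cabin']=master_N['Fare'].apply(repl_N)
master = pd.concat([master_N,master_notN])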

In [None]:
missing_value = test[(test.Pclass == 3) & 
                     (test.Embarked == "S") & 
                     (test.Sex == "male")].Fare.mean()
## replace the test.fare null values with test.fare mean
test.Fare.fillna(missing_value, inplace=True)

#### Let's split our master dataframe into train and test sets again. The training set has 891 rows (PassengerId 1-891), the test set has 418 rows (PassengerId 892-1309), and master has 1309 rows.

In [None]:
train = master.sort_values('PassengerId')[:891]
test= master.sort_values('PassengerId')[891:]
train['Survived'] = survivers
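Splitting by row position relies on the sort order and on knowing the row counts. An alternative sketch is to label the rows at concat time with `keys=` and select them back by label. The frames below are toy stand-ins, and the rounding step only represents some shared preprocessing.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames
toy_train = pd.DataFrame({'PassengerId': [1, 2, 3], 'Fare': [7.25, 71.28, 7.92]})
toy_test = pd.DataFrame({'PassengerId': [4, 5], 'Fare': [7.83, 7.0]})

# keys= adds an outer index level that records where each row came from
combined = pd.concat([toy_train, toy_test], keys=['train', 'test'])
combined['Fare'] = combined['Fare'].round(0)  # placeholder for shared preprocessing

train_back = combined.loc['train']
test_back = combined.loc['test']
print(len(train_back), len(test_back))
```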

#### Missing values in master df: only the Age column still has missing values, 263 of them (~20.1%). Survived is not part of master because it was dropped from train before the concat.

In [None]:
pd.DataFrame([master.isnull().sum(),master.isnull().sum()/len(master)*100]).T.\
rename(columns={0:'Total',1:'Missing Perc'})

#### Filling Missing values in Age column for master df

#### Creating a family_size column: SibSp + Parch + 1 (the passenger themselves). As we proceed, we will see that this feature is very important for the model's predictions.

In [None]:
train['family_size'] = train.SibSp + train.Parch+1
test['family_size'] = test.SibSp + test.Parch+1

#### Merging dataframes

#### Drop the PassengerId column; it plays the same role as the index, except that it starts from 1 while the index starts from 0.

In [None]:
passenger_test= test['PassengerId']
train.drop('PassengerId',axis=1,inplace=True)
test.drop('PassengerId',axis=1,inplace=True)

## Function to show values on bar plot

In [None]:
def showvalues(ax,m=None):
    for p in ax.patches:
        ax.annotate("%.1f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()),\
                    ha='center', va='center', fontsize=14, color='k', rotation=0, xytext=(0, 7),\
                    textcoords='offset points',fontweight='light',alpha=0.9) 

## % of Passenger survived or not w.r.t Sex(Male/Female).

In [None]:
plot_df= train.groupby('Sex')['Survived'].value_counts(normalize=True).mul(100).rename('percent').reset_index()
plot_df

In [None]:
plt.figure(figsize=(10,8))
col = {1:'#99ff99', 0:'#ff9999'}
ax= sns.barplot(x='Sex',y='percent',data=plot_df,hue='Survived',palette=col)
showvalues(ax)
plt.title('Percentage of Passenger Survived Sex wise', pad=30)
plt.xlabel('Sex')
plt.ylabel('Percentage of Passenger Survived')
leg = ax.get_legend().texts
leg[0].set_text("No")
leg[1].set_text("Yes")
plt.show()

#### _<font color=darkgreen>~74% of females survived, whereas ~81% of males died. Survival on the Titanic was skewed towards females (female passengers were given priority over male passengers)._
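The percentages plotted above can be cross-checked in two short ways: a row-normalised `pd.crosstab`, or the mean of the 0/1 `Survived` column per group. The frame below is made up, so the numbers are not the Titanic figures.

```python
import pandas as pd

# Made-up labels for illustration (1 = survived)
toy = pd.DataFrame({
    'Sex': ['female'] * 4 + ['male'] * 5,
    'Survived': [1, 1, 1, 0, 0, 0, 0, 0, 1],
})

# Row-normalised crosstab: each row sums to 100
pct = pd.crosstab(toy['Sex'], toy['Survived'], normalize='index') * 100

# The mean of a 0/1 column is the survival rate
rate = toy.groupby('Sex')['Survived'].mean() * 100
print(pct)
print(rate)
```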

## % of Passenger survived  w.r.t Passenger Class(Lower(3),Middle(2),Upper(1)).

In [None]:
plot_df= train.groupby('Pclass')['Survived'].value_counts(normalize=True).mul(100).rename('percent').reset_index()
plot_df

In [None]:
plt.figure(figsize=(10,8))
col = {1:'#99ff99', 0:'#ff9999'}
ax= sns.barplot(x='Pclass',y='percent',data=plot_df,hue='Survived',palette=col)
showvalues(ax)
plt.title("Percentage of Passenger Survived vs PClass", pad=30)
plt.xlabel("Passenger Class");
plt.ylabel("Percentage of Passenger Survived")
leg = ax.get_legend().texts
leg[0].set_text("No")
leg[1].set_text("Yes")
plt.show()

#### _<font color=darkgreen>63% of upper class passengers survived, ~47.3% of middle class passengers survived, whereas only ~24% of lower class passengers survived. (Priority: upper class > middle class > lower class)_

#### Age Distribution of Passengers

In [None]:
col = {0:'#99ff99', 1:'#ff9999'}
plt.figure(figsize=(10,8))
ax=sns.boxplot(x='Sex',data=train,y='Age',hue='Survived',palette=col)
leg = ax.get_legend().texts
leg[0].set_text("No")
leg[1].set_text("Yes")
plt.show()

#### _<font color='darkgreen'>Among males, survivors tend to be younger than non-survivors, while among females the opposite holds._

#### Fare distribution for Pclass and Cabin

In [None]:
plt.figure(figsize=(20,12))
plt.subplot(1,2,1)
ax=sns.distplot(train['Fare'])
plt.subplot(1,2,2)
ax=sns.boxplot(x='Pclass',data=train,y='Fare',hue='Sex',palette='cool')
ax.set_yscale('log')
plt.show()


#### <font color=darkgreen>Upper class (Pclass 1) has the highest fare range. Female passengers in upper class paid more than male passengers, and the same holds for lower class. In middle class, female passengers also paid more, but the difference is smaller than in upper and lower class.

#### Fare and Age distribution vs Survived or not

In [None]:
plt.figure(figsize=(20,20))
plt.subplot(2,1,1)
ax=sns.kdeplot(train.loc[(train['Survived'] == 0),'Fare'] , color='r',shade=True,label='Deceased')
ax=sns.kdeplot(train.loc[(train['Survived'] == 1),'Fare'] , color='g',shade=True, label='Survived')
plt.xlabel('Fare')
plt.ylabel('Frequency of Passenger Survived')
plt.subplot(2,1,2)
ax=sns.kdeplot(train.loc[(train['Survived'] == 0),'Age'] , color='r',shade=True,label='Deceased')
ax=sns.kdeplot(train.loc[(train['Survived'] == 1),'Age'] , color='g',shade=True, label='Survived')
plt.xlabel('Age')
plt.ylabel('Frequency of Passenger Survived')
plt.show()

#### Inference Fare distribution:
- <b><font color=darkgreen>The spike in the plot under 50 dollars shows that many passengers who bought a ticket in that range did not survive. 
- <b><font color=darkgreen>Above roughly 200 dollars there is very little red shade, which means either that almost everyone above that fare survived or that an outlier is clouding our judgment.

#### Inference Age distribution:
- <b><font color=darkgreen>Children and young infants have a higher survival percentage because they were given priority, as were females, which we saw earlier.

In [None]:
# Kernel Density Plot
fig = plt.figure(figsize=(15,8),)
ax=sns.kdeplot(train.Pclass[train.Survived == 0] , 
               color='red',
               shade=True,
               label='not survived')
ax=sns.kdeplot(train.loc[(train['Survived'] == 1),'Pclass'] , 
               color='g',
               shade=True, 
               label='survived', 
              )
plt.title('Passenger Class Distribution - Survived vs Non-Survived', fontsize = 25, pad = 40)
plt.ylabel("Frequency of Passenger Survived", fontsize = 15, labelpad = 20)
plt.xlabel("Passenger Class", fontsize = 15,labelpad =20)
## Converting xticks into words for better understanding
labels = ['Upper', 'Middle', 'Lower']
plt.xticks(sorted(train.Pclass.unique()), labels);

#### Let's visualise Embarked and Sex against Survived

In [None]:

ax = sns.FacetGrid(train,height=5, col="Sex", row="Embarked", margin_titles=True, hue = "Survived",palette = col)  # `size` was renamed to `height` in seaborn 0.9
ax = ax.map(plt.hist, "Age", edgecolor = 'white').add_legend()
ax.fig.suptitle("Survived by Sex and Age", size = 25)
plt.subplots_adjust(top=0.90)



#### _<font color=darkgreen>The majority of passengers boarded at Southampton, followed by Cherbourg. Far fewer passengers boarded at Queenstown than at the other ports. The majority of females survived; most of the surviving females boarded at Southampton, followed by Cherbourg and Queenstown. Note: very few male passengers from Queenstown survived, as you can see from the graph._

#### Factorplot for Parents/Children survived for male and female

In [None]:

sns.catplot(x='Parch',y='Survived',data=train,col='Sex',kind='point',color='g')  # factorplot was renamed to catplot; 95% CI is the default

In [None]:

sns.catplot(x='Parch',y='Survived',data=train,col='Sex',color='g')

#### _<font color=darkgreen>Passengers who travelled in a big group with their parents and children had a lower survival rate than those who travelled alone or with a single parent or child._
####  _<font color=darkgreen>Among females who travelled alone, a high percentage survived._

#### Factorplot for Spouse/Siblings survived for male and female

In [None]:

sns.catplot(x='SibSp',y='Survived',data=train,col='Sex',kind='point',color='g')  # factorplot was renamed to catplot; 95% CI is the default

#### _<font color=darkgreen>Similar Inference as for parent/Children Parch in above plot._

#### Factorplot for family size (spouse + siblings + parents + children + self) survived for male and female

In [None]:

sns.catplot(x='family_size',y='Survived',data=train,col='Sex',kind='point',color='g')  # factorplot was renamed to catplot; 95% CI is the default

#### Survived passenger vs Cabin

In [None]:
plot_df= train.groupby('Cabin')['Survived'].value_counts(normalize=True).mul(100).rename('percent').reset_index()

In [None]:
plt.figure(figsize=(16,8))
col = {1:'#99ff99', 0:'#ff9999'}
ax= sns.barplot(x='Cabin',y='percent',data=plot_df,hue='Survived',palette=col)
showvalues(ax)
plt.title("Percentage of Passenger Survived vs Cabin", pad=30)
plt.xlabel("Cabin");
plt.ylabel("Percentage of Passenger Survived")
leg = ax.get_legend().texts
leg[0].set_text("No")
leg[1].set_text("Yes")
plt.show()

#### <font color=darkgreen>Inference: Cabin B has the highest survival percentage (74.5%), followed by Cabin D. There is only 1 person in cabin T, and he didn't survive.

#### Factor Plot for Embarked vs Survived

In [None]:

sns.catplot(x='Embarked',y='Survived',data=train,col='Sex',kind='point',color='g')  # factorplot was renamed to catplot; 95% CI is the default

#### Passengers who boarded at Cherbourg have the highest survival percentage; for Queenstown, the share of males who survived is very small.

#### Factor Plot for Cabin vs Survived, split by Embarked and Sex

In [None]:

sns.catplot(x='Cabin',y='Survived',data=train,col='Embarked',hue='Sex',kind='point',height=6)  # factorplot/size were renamed to catplot/height

#### _<font color=darkgreen>None of the passengers with Embarked=Q are in cabins D, B, E or T, and D, B and E have high survival percentages. Again, only a very small fraction of the males who boarded at Queenstown survived. Females have a higher survival percentage than males._

In [None]:
plt.figure(figsize=(20,10))
ax=sns.boxplot(x='Cabin',data=train,y='Fare',hue='Sex',palette='cool')
ax.set_yscale('log')
plt.show()


#### _<font color=darkgreen>Cabins C, B and D have high fares, and we saw previously that cabins C, D and B have a greater survival percentage than other cabins. A likely reason is that passengers who paid higher fares were given priority over passengers who paid less._

#### Let's see whether children and young infants were given priority, and if so, what the impact was on their survival

In [None]:
# rename_axis + reset_index(name=...) gives stable column names across pandas versions
plot_df= train[train['Age']<10]['Survived'].value_counts(normalize=True).mul(100).rename_axis('Survived').reset_index(name='percent')

In [None]:
plt.figure(figsize=(10,8))
ax=sns.barplot(x='Survived',y='percent',data=plot_df,palette=col)
locs, labels = plt.xticks()
plt.xticks(ticks=locs,labels=['Not Survived','Survived'])
plt.ylabel('Percentage of Children Survived in Age < 10')
plt.xlabel('Survived or Not Survived')
showvalues(ax)

#### _<font color=darkgreen>For Age<10 (children/infants) the survival percentage is 61%, which shows that a good share of children under 10 survived, as children and females were the first priority._

In [None]:
train['Survived'].value_counts(normalize=True)*100

#### Survival percentage of male passengers who embarked from Queenstown

In [None]:
train[(train['Embarked']=='Q') & (train['Sex']=='male')]['Survived'].value_counts()

In [None]:
train[(train['Embarked']=='Q') & (train['Sex']=='male')]['Survived'].value_counts(normalize=True)*100

## Inferences from above visualisations:
- <b><font color=darkgreen>~38% of passengers survived.
- ~74% of female passengers survived, while only ~19% of male passengers survived.
- ~63% of upper class passengers survived, 47% of middle class passengers survived, while only 24% of lower class passengers survived.
- 75% of the passengers in Cabin B survived; we will look for the reason behind it.
- Most passengers embarked from Cherbourg and Southampton, very few embarked from Queenstown, and more than 90% of the male passengers who boarded at Queenstown died.
- ~61% of children below age 10 survived; they were given priority.
- Small families have a higher survival percentage than medium and big families.


#### Correlation between different features: how they correlate (positive, neutral, or negative)

In [None]:
train['Sex'] = train.Sex.apply(lambda x: 0 if x == "female" else 1)
test['Sex'] = test.Sex.apply(lambda x: 0 if x == "female" else 1)

In [None]:
plot_df

In [None]:
plot_df=train[['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked',
       'family_size']]
plt.figure(figsize=(15,10))
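corr = plot_df.select_dtypes(include='number').corr()  # text columns cannot be correlated
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from NumPy; use the builtin
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr,annot=True,cmap='cividis',mask=mask)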

#### Inference from Correlation:
- <b><font color=darkgreen>Fare and Survived are positively correlated(0.26)
- Fare and Pclass are negatively correlated(-0.55)
- Survived and Passenger Class are negatively correlated(-0.34)
- Survived and Sex are negatively correlated(-0.54)
- Passenger Class and Age are negatively correlated(-0.37)



## Feature Engineering

# Name

In [None]:
train.Name.str.len().describe()  # summary statistics of name length

In [None]:
train['name_len'] = [len(i) for i in train.Name]
test['name_len'] = [len(i) for i in test.Name]
def name_length(size):
    a = ''
    if (size <=20):
        a = 'short'
    elif (size <=35):
        a = 'medium'
    elif (size <=50):
        a = 'long'
    else:
        a = 'very long'
    return a
train['name_len_rnge'] = train['name_len'].map(name_length)
test['name_len_rnge'] = test['name_len'].map(name_length)

In [None]:
plt.figure(figsize=(10,8))
ax=sns.kdeplot(train.loc[(train['Survived'] == 0),'name_len'] , color='r',shade=True,label='Deceased')
ax=sns.kdeplot(train.loc[(train['Survived'] == 1),'name_len'] , color='g',shade=True, label='Survived')
plt.xlabel('Name Length')
plt.ylabel('Frequency of Passenger Survived')
plt.show()

#### _<font color=darkgreen>Inference: for long and very long names, the share of survivors is much larger than the share of deceased, which is a useful insight from the name length feature._

In [None]:
sns.distplot([len(i) for i in train.Name])

#### Fetching title from Name

In [None]:
train['title']=train['Name'].apply(lambda x:x.split('.')[0].split(',')[1].strip())
test['title']=test['Name'].apply(lambda x:x.split('.')[0].split(',')[1].strip())
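The chained `split` calls take the text between the comma and the first period. The same idea can be sketched with a single regular expression through `Series.str.extract`; the names below follow the dataset's "Surname, Title. Given names" format.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Capture everything between the comma and the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.tolist())
```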

#### Let's replace Mlle (Mademoiselle) and Ms with Miss, Mme (Madame) with Mrs, and group Col, Don, Dona, Jonkheer, the Countess, Major, Capt, Lady and Sir under a single 'rare' title.

In [None]:
## we are writing a function that can help us modify title column
def replace_title(df):
    
    result=[]
    for val in df:
        if val in ['the Countess','Capt','Lady','Sir','Jonkheer','Don','Major','Col','Dona']:
            val = 'rare'
            result.append(val)
        elif val in ['Ms', 'Mlle']:
            val = 'Miss'
            result.append(val)
        elif val == 'Mme':
            val = 'Mrs'
            result.append(val)
        else:
            result.append(val)
    return result

train['title']=replace_title(train['title'])
test['title']=replace_title(test['title'])
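The loop above is equivalent to a lookup table. As a sketch, the same grouping can be expressed as a dictionary passed to `Series.replace`; titles not in the dictionary are left unchanged. `title_map` is a hypothetical name.

```python
import pandas as pd

rare = ['the Countess', 'Capt', 'Lady', 'Sir', 'Jonkheer', 'Don', 'Major', 'Col', 'Dona']
title_map = {t: 'rare' for t in rare}
title_map.update({'Ms': 'Miss', 'Mlle': 'Miss', 'Mme': 'Mrs'})

titles = pd.Series(['Mr', 'Mlle', 'Mme', 'Col', 'Dr'])
cleaned = titles.replace(title_map)
print(cleaned.tolist())
```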

In [None]:
train['title'].value_counts()

#### Unique name titles

In [None]:
print(train['title'].unique())
print(test['title'].unique())

#### Family size range

In [None]:
## bin the family size. 
def family_group(size):
    """
    This function groups (loner, small, large) families based on family size
    """
    
    a = ''
    if (size <= 1):
        a = 'loner'
    elif (size <= 4):
        a = 'small'
    else:
        a = 'large'
    return a

train['family_group'] = train['family_size'].map(family_group)
test['family_group'] = test['family_size'].map(family_group)

In [None]:
train['is_alone'] = [1 if i<2 else 0 for i in train.family_size]
test['is_alone'] = [1 if i<2 else 0 for i in test.family_size]

#### actual_fare: passengers travelling with family paid a total fare, not an individual fare, as seen in the Fare column. We derive an individual fare column because fare is a very important feature for predicting whether a passenger survived.

In [None]:
train['actual_fare']=train['Fare']/train.family_size
test['actual_fare'] = test.Fare/test.family_size

#### Fare Range

In [None]:
train['Fare'].describe()

In [None]:
def fare_rnge(fare):
    val= ''
    if fare <= 4:
        val = 'very_low'
    elif fare <= 10:
        val = 'low'
    elif fare <= 20:
        val = 'mid'
    elif fare <= 45:
        val = 'high'
    else:
        val = 'very_high'
    return val

train['fare_rnge'] = train['actual_fare'].map(fare_rnge)
test['fare_rnge'] = test['actual_fare'].map(fare_rnge)

In [None]:
## create bins for age
#def age_group_fun(age):
#    """
#    This function creates a bin for age
#    """
#    a = ''
#    if age <= 1:
#        a = 'infant'
#    elif age <= 4: 
#        a = 'toddler'
#    elif age <= 13:
#        a = 'child'
#    elif age <= 18:
#        a = 'teenager'
#    elif age <= 35:
#        a = 'Young_Adult'
#    elif age <= 45:
#        a = 'adult'
#    elif age <= 55:
#        a = 'middle_aged'
#    elif age <= 65:
#        a = 'senior_citizen'
#    else:
#        a = 'old'
#    return a
        
## Applying "age_group_fun" function to the "Age" column.
#train['age_group'] = train['Age'].map(age_group_fun)
#test['age_group'] = test['Age'].map(age_group_fun)

## Creating dummies for "age_group" feature. 
#train = pd.get_dummies(train,columns=['age_group'], drop_first=True)
#test = pd.get_dummies(test,columns=['age_group'], drop_first=True);

#### Converting categorical columns into dummy variables so that we can use them in the ML model.

In [None]:
train = pd.get_dummies(train, columns=['Pclass', 'Cabin', 'Embarked', 'name_len_rnge', 'title',\
                                       'fare_rnge','family_group'], drop_first=False)
test = pd.get_dummies(test, columns=['Pclass', 'Cabin', 'Embarked', 'name_len_rnge',\
                                     'title','fare_rnge','family_group'], drop_first=False)
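Because `get_dummies` is applied to train and test separately, a category that occurs in only one of them (for example a cabin letter) produces a column in one frame and not the other. A minimal sketch of one way to guard against this is to reindex the test columns to the training feature columns; the frames below are toy stand-ins.

```python
import pandas as pd

# Toy frames: cabin 'T' appears only in the training data
toy_train = pd.DataFrame({'Cabin': ['A', 'T', 'B'], 'Survived': [0, 0, 1]})
toy_test = pd.DataFrame({'Cabin': ['A', 'B', 'B']})

train_d = pd.get_dummies(toy_train, columns=['Cabin'])
test_d = pd.get_dummies(toy_test, columns=['Cabin'])

# Use the training feature columns as the reference layout
feature_cols = [c for c in train_d.columns if c != 'Survived']
test_d = test_d.reindex(columns=feature_cols, fill_value=0)
print(list(test_d.columns))
```

Columns missing from test are added as all zeros, and any column that exists only in test is dropped.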


#### Dropping columns which are not useful after creating dummy variables


In [None]:
train.drop(['family_size','name_len',\
            'Fare','Name','Ticket'], axis=1, inplace=True)
test.drop(['family_size','name_len',\
            'Fare','Name','Ticket'], axis=1, inplace=True)

#### Predicting missing value for Age columns

In [None]:
def predict_age(df):
    df_not_null = df.loc[df['Age'].notnull()]
    df_null= df[df['Age'].isnull()]
    y=df_not_null['Age']
    x=df_not_null.drop('Age',axis=1)
    rf_reg=RandomForestRegressor(n_estimators=1000).fit(x,y)
    pred=rf_reg.predict(df_null.drop('Age',axis=1))
    df.loc[df.Age.isnull(), "Age"] =list(pred)
    return df
predict_age(train)
predict_age(test)
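The imputation step can be tried in isolation on synthetic data. The sketch below follows the same pattern (fit on rows with a known age, predict the rest) with a fixed `random_state` so that repeated runs fill in the same values; the columns and the age formula are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
toy = pd.DataFrame({'Fare': rng.uniform(5, 100, 60),
                    'Parch': rng.integers(0, 3, 60)})
toy['Age'] = 20 + 0.3 * toy['Fare'] + rng.normal(0, 2, 60)  # invented relationship
toy.loc[::10, 'Age'] = np.nan                                # knock out 6 ages

known = toy[toy['Age'].notnull()]
unknown = toy[toy['Age'].isnull()]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(known.drop(columns='Age'), known['Age'])
toy.loc[toy['Age'].isnull(), 'Age'] = model.predict(unknown.drop(columns='Age'))
print(toy['Age'].isnull().sum())
```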
    

## Preprocessing Tasks

#### Splitting data into train and test data 

In [None]:
X = train.drop(['Survived'], axis = 1)
y = train["Survived"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

#### Feature Scaling preprocessing step
<ul>
    <li><b>MinMaxScaler</b>-Scales the data using the max and min values so that it fits between 0 and 1.</li>
    <li><b>StandardScaler</b>-Scales the data so that it has mean 0 and variance of 1.</li>
    <li><b>RobustScaler</b>-Scales the data similarly to StandardScaler, but uses the median and the interquartile range so as to avoid issues with large outliers.</li>
 </ul>
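The three scalers described above reduce to short formulas. The NumPy sketch below applies them to a small array with one outlier; it assumes the default settings of the scikit-learn classes (population standard deviation, 25th to 75th percentile range).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # 100 is an outlier

# MinMaxScaler: map the minimum to 0 and the maximum to 1
minmax = (x - x.min()) / (x.max() - x.min())

# StandardScaler: subtract the mean, divide by the standard deviation
standard = (x - x.mean()) / x.std()

# RobustScaler: subtract the median, divide by the interquartile range
q1, med, q3 = np.percentile(x, [25, 50, 75])
robust = (x - med) / (q3 - q1)

print(minmax, standard, robust, sep='\n')
```

With the outlier present, the min-max values of the first four points are squeezed close to 0, while the robust version keeps them evenly spaced.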

In [None]:
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)

### Data Modeling and Evaluation

In [None]:
from sklearn.linear_model import LogisticRegression
lr= LogisticRegression(solver='liblinear',penalty= 'l1',random_state = 0)
lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)
y_prob = lr.predict_proba(X_test)
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred)) 
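The three printed scores all derive from the four cells of the confusion matrix. A plain Python sketch with made-up labels shows the arithmetic.

```python
# Made-up labels for illustration (1 = survived)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat  = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # predicted survived, did survive
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted survived, did not
fn = sum(t == 1 and p == 0 for t, p in pairs)  # predicted died, did survive
tn = sum(t == 0 and p == 0 for t, p in pairs)  # predicted died, did die

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)
```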

In [None]:
from sklearn.metrics import roc_auc_score,auc,roc_curve
fpr, tpr, _ =roc_curve(y_test,y_prob[:,1])
roc_auc= auc(fpr,tpr)
plt.figure(figsize=(10,8))
plt.plot(fpr,tpr,label='ROC Curve(area = %0.2f)'%roc_auc)
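plt.plot([0,1],[0,1],'r--')  # chance line; 'k--' plus c='r' defined the colour twice
plt.legend(loc='lower right')  # needed for the ROC label to be shown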
plt.xlabel('False Positive Rate', fontsize = 18)
plt.ylabel('True Positive Rate', fontsize = 18)
plt.title('ROC Curve', fontsize= 18)
plt.show()
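The area under the ROC curve has a direct reading: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The NumPy sketch below computes it that way on four made-up scores.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score; ties count as half
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
auc_value = wins / (len(pos) * len(neg))
print(auc_value)
```

Three of the four positive/negative pairs are ordered correctly, which gives 0.75.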

In [None]:
precision,recall,thre=precision_recall_curve(y_test,y_prob[:,1])
prec_recall= auc(recall,precision)
plt.figure(figsize=(10,8))
plt.plot(recall,precision,label='Precision recall Curve(area = %0.2f)'%prec_recall)
plt.legend()  # needed for the curve label to be shown
plt.xlabel('recall', fontsize = 18)
plt.ylabel('precision', fontsize = 18)
plt.title('Precision Recall', fontsize= 18)
plt.show()

In [None]:
cv=StratifiedShuffleSplit(n_splits=20,test_size=0.3,random_state=0)
cross_v_score= cross_val_score(LogisticRegression(),X,y,cv=cv)
print(cross_v_score)
print('mean cross validation score:{0:2.2f}'.format(np.mean(cross_v_score)))
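The cross-validation above runs on the unscaled `X`. One way to scale inside each fold, so that the scaler is fitted only on that fold's training part, is to wrap the scaler and the model in a pipeline. The sketch below uses synthetic data from `make_classification`; the names ending in `_toy` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic feature matrix
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# The scaler is refitted on the training part of every split
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear'))
cv_toy = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
scores = cross_val_score(pipe, X_toy, y_toy, cv=cv_toy)
print(scores.mean())
```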

In [None]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

c = list(np.linspace(0.01,10,19))
penalties = ['l1','l2']
cv = StratifiedShuffleSplit(n_splits = 10, test_size = .3)
param = {'penalty': penalties, 'C': c}
logreg = LogisticRegression(solver='liblinear')
grid = RandomizedSearchCV(estimator=logreg,  # liblinear supports both the l1 and l2 penalties
                           param_distributions = param,
                           scoring = 'accuracy',
                           cv = cv,n_iter=40
                          )
## Fitting the model
grid.fit(X, y)

In [None]:
print(grid.best_estimator_)
print(grid.best_params_)
print(grid.best_score_)
print(grid.best_index_)

In [None]:
lr=grid.best_estimator_
lr.score(X,y)

## KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier()
knn.fit(X_train,y_train)
y_predict=knn.predict(X_test)
accuracy_score(y_test,y_predict)

## Naive Bayes - Baseline Model

In [None]:
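from sklearn.naive_bayes import GaussianNB
# GaussianNB replaces MultinomialNB here: the standardized features contain negative
# values, which MultinomialNB cannot be fitted on. Train on the training split only.
g_nb=GaussianNB()
g_nb.fit(X_train,y_train)
y_pred= g_nb.predict(X_test)
print(round(accuracy_score(y_test,y_pred),3))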


## Support Vector Machine

In [None]:
from sklearn.svm import SVC
svm_n=SVC(C=3,kernel='poly',degree=3)
svm_n.fit(X_train,y_train)
y_pred= svm_n.predict(X_test)
accuracy_score(y_test,y_pred)

## Decision tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier
max_depth_n = range(1,10)
max_feature_n = [20,21,22,23,24,25,26,28,29,30,'sqrt']  # 'auto' meant 'sqrt' for classifiers and was removed in newer scikit-learn
criterion_n = ["gini", "entropy"]
params={'max_depth':max_depth_n,'max_features':max_feature_n,'criterion':criterion_n}
cv_n=StratifiedShuffleSplit(test_size=0.25,random_state=0)
random_cv= RandomizedSearchCV(DecisionTreeClassifier(),param_distributions=params,cv=cv_n)
random_cv.fit(X,y)

In [None]:
print(random_cv.best_estimator_)
print(random_cv.best_index_)
print(random_cv.best_params_)
print(random_cv.best_score_)


In [None]:
dtc=random_cv.best_estimator_
dtc.score(X,y)


## Feature Importance 

In [None]:
columns= X.columns
feature_importances = pd.DataFrame(dtc.feature_importances_,
                                   index = columns,
                                    columns=['Feature Importance'])
feature_importances.sort_values(by='Feature Importance', ascending=False).head(10)

In [None]:
df_temp= feature_importances.sort_values(by='Feature Importance', ascending=False).head(10)
plt.figure(figsize=(8,6))
sns.barplot(data=df_temp,y=df_temp.index, x='Feature Importance',orient='h')
#bar.set_xticklabels(bar.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.show()

## Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
n_estimators_n=[145,150]
max_depth_n=range(1,10)
criterion_n = ["gini", "entropy"]
params={'max_depth':max_depth_n,'criterion':criterion_n,'n_estimators':n_estimators_n}
cv_n=StratifiedShuffleSplit(test_size=0.25,random_state=0)
grid_cv= GridSearchCV(RandomForestClassifier(),param_grid=params,cv=cv_n)  # use the splitter defined above
grid_cv.fit(X,y)

In [None]:
print(grid_cv.best_estimator_)
print(grid_cv.best_params_)
print(grid_cv.best_score_)


In [None]:
rfc=grid_cv.best_estimator_
rfc.score(X,y)

In [None]:
columns= X.columns
feature_importances = pd.DataFrame(rfc.feature_importances_,
                                   index = columns,
                                    columns=['Feature Importance'])
# Keep the ten most important features and plot them
df_temp= feature_importances.sort_values(by='Feature Importance', ascending=False).head(10)
plt.figure(figsize=(8,6))
sns.barplot(data=df_temp,y=df_temp.index, x='Feature Importance',orient='h')
plt.show()

## Bagging Classifier

In [None]:
from sklearn.ensemble import BaggingClassifier
n_estimators_n = [10,20,30,50,70,80,100,120, 140,150]
cv_n=StratifiedShuffleSplit(test_size=0.25,random_state=0)
params={'n_estimators':n_estimators_n}
grid_cv= GridSearchCV(BaggingClassifier(),param_grid=params,cv=cv_n)
grid_cv.fit(X,y)

In [None]:
print(grid_cv.best_estimator_)
print(grid_cv.best_params_)
print(grid_cv.best_score_)

In [None]:
bc_n=grid_cv.best_estimator_
bc_n.score(X,y)

In [None]:
bc=BaggingClassifier(n_estimators=30,max_features=17)
bc.fit(X_train,y_train)
y_pred=bc.predict(X_test)
accuracy_score(y_test,y_pred)

## Gradient Boosting Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
max_depth_n=range(2,10)
params={'max_depth':max_depth_n}
cv_n=StratifiedShuffleSplit(test_size=0.25,random_state=0)
# Pass cv_n explicitly; without it GridSearchCV falls back to its default 5-fold split
grid_cv= GridSearchCV(GradientBoostingClassifier(),param_grid=params,cv=cv_n)
grid_cv.fit(X,y)

In [None]:
print(grid_cv.best_estimator_)
print(grid_cv.best_params_)
print(grid_cv.best_score_)

In [None]:
gbc=grid_cv.best_estimator_
gbc.score(X,y)

## XGB Classifier

In [None]:
from xgboost import XGBClassifier
xgbc=XGBClassifier()
xgbc.fit(X_train,y_train)
y_pred=xgbc.predict(X_test)
print(round(accuracy_score(y_test,y_pred),2))

## AdaBoost Classifier

In [None]:
from sklearn.ensemble import AdaBoostClassifier
n_estimators_n = [50,70,80,100]
cv_n=StratifiedShuffleSplit(test_size=0.25,random_state=0)
params = {'n_estimators':n_estimators_n}
grid_cv= GridSearchCV(AdaBoostClassifier(),param_grid=params,cv=cv_n)
grid_cv.fit(X,y)

In [None]:
print(grid_cv.best_estimator_)
print(grid_cv.best_score_)
print(grid_cv.best_params_)

In [None]:
abc_n=grid_cv.best_estimator_
abc_n.score(X,y)

In [None]:
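# algorithm='SAMME.R' was the default in older scikit-learn and is deprecated in newer releases, so it is left out here
abc=AdaBoostClassifier(learning_rate=1.007)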
abc.fit(X_train,y_train)
y_pred=abc.predict(X_test)
accuracy_score(y_test,y_pred)

## Extra Trees Classifier

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
etc=ExtraTreesClassifier()
etc.fit(X_train,y_train)
y_pred=etc.predict(X_test)
print(round(accuracy_score(y_test,y_pred),2))

## Gaussian Process Classifier

In [None]:
from sklearn.gaussian_process import GaussianProcessClassifier
gpc=GaussianProcessClassifier()
gpc.fit(X_train,y_train)
y_pred=gpc.predict(X_test)
print(round(accuracy_score(y_test,y_pred),2))

## Voting Classifier

In [None]:
from sklearn.ensemble import VotingClassifier
# VotingClassifier expects a list of (name, estimator) tuples

vc = VotingClassifier(estimators=[
    ('lr_grid', lr),
    ('random_forest', rfc),
    ('gradient_boosting', gbc),
    ('decision_tree_grid',dtc),
    ('knn_classifier', knn),
    ('XGB_Classifier', xgbc),
    ('bagging_classifier', bc),
    ('adaBoost_classifier',abc),
    ('ExtraTrees_Classifier', etc),
    ('gaussian_process_classifier', gpc)
],voting='hard')
vc.fit(X_train,y_train)
y_pred=vc.predict(X_test)
print(round(accuracy_score(y_test,y_pred),2))
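
With `voting='hard'`, each fitted member predicts a class label and the ensemble returns the majority label. The sketch below recomputes that vote by hand to show the mechanism. It assumes synthetic data and three small stand-in models (`vc_demo` and the `*_demo` names are illustrative, not the notebook's tuned estimators).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary problem with labels 0/1 (assumption: stand-in for the Titanic data).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

vc_demo = VotingClassifier(estimators=[
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
    ('nb', GaussianNB()),
], voting='hard')
vc_demo.fit(X_tr, y_tr)

# Recompute the vote by hand: with three members and labels 0/1,
# the majority class is 1 whenever at least two members predict 1.
member_preds = np.array([est.predict(X_te) for est in vc_demo.estimators_])
manual_vote = (member_preds.sum(axis=0) >= 2).astype(int)
hard_pred = vc_demo.predict(X_te)
print("manual vote matches VotingClassifier:", np.array_equal(manual_vote, hard_pred))
```

An odd number of members avoids ties in a binary vote. The ensemble above has ten members, so ties are possible; scikit-learn resolves them in favour of the first class in sorted order.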

## Artificial Neural Network

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
#Early Stop
early_stop = EarlyStopping(monitor = 'val_loss', mode = "min", verbose = 1 , patience = 25)

ann= Sequential()

ann.add(Dense(9,activation = 'relu'))
ann.add(Dropout(0.5))
ann.add(Dense(4,activation = 'relu'))
ann.add(Dropout(0.5))

ann.add(Dense(1,activation='sigmoid'))
ann.compile(loss= 'binary_crossentropy', optimizer = 'adam')

ann.fit(x=X_train, y=y_train,epochs=400,validation_data=(X_test,y_test),callbacks=[early_stop])

In [None]:
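# ann.predict returns an (n, 1) array of probabilities; threshold at 0.46 and flatten to 1-D labels
y_pred = (ann.predict(X_test) > 0.46).astype(int).ravel()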

In [None]:
accuracy_score(y_test,y_pred)

In [None]:
models = [lr,knn,svm_n,dtc,rfc,gbc,bc,abc,etc,gpc,xgbc,vc]
c = {}
for model in models:
    pred = model.predict(X_test)
    result = accuracy_score(y_test,pred)
    c[model] = result
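
# The dictionary above is keyed by the model objects themselves, which is hard to read when printed.
# The sketch below shows one way to get a ranked table instead, keyed by class name.

The same comparison can be written so the result is a sorted table. This is a minimal sketch on synthetic data with three stand-in models; `demo_models`, `scores` and `ranking` are illustrative names, and the accuracies it prints say nothing about the Titanic models.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the notebook's features).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

demo_models = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=0),
]

# Key each held-out accuracy by the model's class name.
scores = {}
for m in demo_models:
    m.fit(X_tr, y_tr)
    scores[type(m).__name__] = accuracy_score(y_te, m.predict(X_te))

# Sort so the best-scoring model comes first.
ranking = pd.Series(scores, name='accuracy').sort_values(ascending=False)
best_name = ranking.index[0]
print(ranking)
```

Keying by class name collides if two entries share a class (for example two differently tuned random forests), so in that case an explicit label per model is safer.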
    


In [None]:
test['Cabin_T']=0
test_prediction = (max(c, key=c.get)).predict(test.values)
submission = pd.DataFrame({
        "PassengerId": passenger_test,
        "Survived": test_prediction
    })
submission.PassengerId = submission.PassengerId.astype(int)
submission.Survived = submission.Survived.astype(int)

submission.to_csv("titanic_submission.csv", index=False)

<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0; color:tomato' role="tab" aria-controls="home"><center>If you found this notebook helpful, some upvotes would be very much appreciated. That will keep me motivated :)</center></h2>


<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0; color:tomato' role="tab" aria-controls="home"><center>Thank You :)</center></h2>
