# Titanic Survival Prediction Using Multiple Classifiers

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
titanic_train= pd.read_csv('../input/titanic/train.csv')
titanic_train.head()

## **Exploratory Data Analysis (EDA) of Training set**

In [5]:
titanic_train.shape

In [6]:
titanic_train.info()

In [7]:
titanic_train.describe()

In [8]:
titanic_train.nunique()

## **Missing Data**

In [9]:
titanic_train.isnull().sum()

In [10]:
plt.figure(figsize =(12,6))
sns.heatmap(titanic_train.isnull(), yticklabels = False, cbar = False, cmap ='cividis'); 

The Cabin column has too many missing values to be useful, so it is better to drop it.
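The heatmap gives a visual impression; the share of missing values per column puts a number on it. A minimal sketch on a small made-up frame (`demo` is hypothetical and only stands in for `titanic_train`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_train
demo = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare':  [7.25, 71.28, 7.92, 53.10],
})

# isnull() yields booleans, so mean() is the fraction of missing entries per column
missing_share = demo.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

A column whose share is close to 1 is a candidate for dropping, while a column with only a few gaps can usually be filled.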

## **Data Cleaning**

### **Filling age column:**

It seems better to fill the missing values in the Age column with the mean age of the passenger's class.

In [11]:
plt.figure(figsize=(12, 6))
sns.boxplot(x='Pclass',y='Age',data=titanic_train,palette='winter');

In [12]:
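# numeric_only=True skips text columns such as Name, which recent pandas versions refuse to average
titanic_train.groupby('Pclass').mean(numeric_only=True)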

Define a function that fills a missing Age with the rounded mean age of the passenger's Pclass.

In [13]:
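def fill_age(cols):
    # Access by label; positional access such as cols[0] is deprecated on labelled Series
    Age = cols['Age']
    Pclass = cols['Pclass']
    if pd.isnull(Age):
        # Rounded mean age per passenger class
        if Pclass == 1:
            return 38
        elif Pclass == 2:
            return 30
        else:
            return 25
    else:
        return Age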

Applying fill_age function

In [14]:
titanic_train['Age'] = titanic_train[['Age', 'Pclass']].apply(fill_age, axis=1)
titanic_train.head()
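The same per-class fill can also be done without hard-coded numbers. The sketch below is an alternative to `fill_age`, not the approach used in this notebook: it takes the class means from the data itself with `groupby(...).transform('mean')`. The `demo` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing ages
demo = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3],
    'Age':    [40.0, np.nan, 30.0, np.nan, 20.0, 30.0],
})

# Mean age within each class, broadcast back to every row of that class
class_mean_age = demo.groupby('Pclass')['Age'].transform('mean')

# Only the missing ages are replaced; known ages are kept as they are
demo['Age'] = demo['Age'].fillna(class_mean_age)
print(demo)
```

This version uses unrounded means, so the filled values differ slightly from the rounded 38, 30 and 25 used above.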

Check the heatmap again.

In [15]:
plt.figure(figsize =(12,6))
sns.heatmap(titanic_train.isnull(), yticklabels = False, cbar = False, cmap ='cividis');

In [16]:
titanic_train.isnull().sum()

The Embarked column contains 2 missing values, so it is better to drop those two rows. The Cabin column can be dropped completely.

In [17]:
titanic_train.drop('Cabin', axis =1, inplace = True)

In [18]:
#Dropping missing embarked rows
titanic_train.dropna(inplace = True)

In [19]:
titanic_train.head()

In [20]:
titanic_train.isnull().sum()

No missing values left. 

### **Feature engineering of Name column**

In [21]:
titanic_train['Name'][0].split(', ')[1].split('.')[0]

In [22]:
titanic_train['First_name'] = titanic_train['Name'].apply(lambda x: x.split(', ')[1])
titanic_train['First_name'] = titanic_train['First_name'].apply(lambda x:x.split('.')[0])
titanic_train.head()

In [23]:
titanic_train['First_name'].value_counts()
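The two chained `split` calls pull out the text between the comma and the first period, which is the passenger's title (Mr, Mrs, Miss, ...). A hedged alternative sketch does the same in one step with `Series.str.extract`; the three sample strings follow the dataset's `Surname, Title. Given names` format, and the regular expression is an assumption about that format.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
])

# Capture everything between the comma and the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.tolist())
```

Unlike `split`, `str.extract` returns NaN for a name that does not match the pattern instead of raising an error.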

## **Data Visualization**

In [24]:
# Pass columns by keyword; recent seaborn versions treat a positional argument as `data`
sns.countplot(x='Survived', data=titanic_train, palette='magma');

In [25]:
plt.figure(figsize = (12,5))
sns.countplot(x='Survived', hue='Sex', data=titanic_train, palette='magma');

In [26]:
plt.figure(figsize = (12,5))
sns.countplot(x='Survived', hue='Embarked', data=titanic_train, palette='magma');

In [27]:
plt.figure(figsize = (12,5))
sns.countplot(x='Survived', hue='Pclass', data=titanic_train, palette='magma');

In [28]:
plt.figure(figsize = (12,5))
sns.countplot(x='Survived', hue='SibSp', data=titanic_train, palette='magma_r');

## **Convert Categorical features into dummy variables**

In [29]:
titanic_train.info()

Sex, Embarked and First_name (which holds the passenger's title, such as Mr or Mrs) can be converted to dummy variables. Name and Ticket are to be dropped.

In [30]:
sex = pd.get_dummies(titanic_train['Sex'], drop_first = True)
embark = pd.get_dummies(titanic_train['Embarked'], drop_first = True)
firstname = pd.get_dummies(titanic_train['First_name'], drop_first = True)
titanic_train = pd.concat([titanic_train, sex, embark, firstname], axis =1)
titanic_train.drop(['Sex', 'Embarked', 'Name', 'Ticket', 'First_name'], axis = 1, inplace = True)
titanic_train.head()
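As a small illustration of what `drop_first = True` does, the sketch below encodes a made-up `Sex` column. With two categories, one indicator column is enough, because the dropped category is implied whenever the remaining indicator is 0.

```python
import pandas as pd

# Hypothetical three-passenger example
sex = pd.Series(['male', 'female', 'female'], name='Sex')

# Categories are sorted, so 'female' comes first and is the one dropped
dummies = pd.get_dummies(sex, drop_first=True)
print(dummies)
```

Dropping one level per feature avoids perfectly collinear columns, which matters mostly for linear models such as logistic regression.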

## **Exploratory Data Analysis (EDA) of Testing set**

In [31]:
titanic_test = pd.read_csv('../input/titanic/test.csv')
titanic_test.head()

In [32]:
titanic_test.info()

In [33]:
titanic_test.describe()

In [34]:
titanic_test.nunique()

In [35]:
titanic_test.isnull().sum()

The Fare column contains 1 missing value. It needs to be filled in before dealing with the Age and Cabin columns. Checking the Fare distribution per passenger class gives an idea of a suitable value.

In [36]:
sns.boxplot(x='Pclass',y='Fare',data=titanic_test,palette='winter');

Find which class the missing value lies in.

In [37]:
titanic_test[titanic_test['Fare'].isnull()]

The missing value lies in Pclass 3. Find the mean value of Fares in each class.

In [38]:
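# numeric_only=True skips text columns such as Name, which recent pandas versions refuse to average
titanic_test.groupby('Pclass').mean(numeric_only=True)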

Fill the missing value with the mean Fare of Pclass 3.

In [39]:
titanic_test['Fare'] = titanic_test['Fare'].fillna(value = 12.459678) 

In [40]:
titanic_test[titanic_test['Fare'].isnull()]

The missing Fare value is filled in. 
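The fill value above was copied by hand from the groupby table. A minimal sketch of computing it from the data instead, on a hypothetical `demo` frame (the fares are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing third-class fare
demo = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Fare':   [80.0, 60.0, 8.0, 12.0, np.nan],
})

# Mean fare of third-class passengers; mean() ignores NaN by default
pclass3_mean_fare = demo.loc[demo['Pclass'] == 3, 'Fare'].mean()

# Safe here because the only missing fare belongs to a third-class passenger
demo['Fare'] = demo['Fare'].fillna(pclass3_mean_fare)
print(pclass3_mean_fare)
```

Computing the value keeps the notebook correct if the input file changes, at the cost of one extra line.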

### **Feature engineering of Name column of Test data set.**

In [41]:
titanic_test['First_name'] = titanic_test['Name'].apply(lambda x: x.split(', ')[1])
titanic_test['First_name'] = titanic_test['First_name'].apply(lambda x:x.split('.')[0])
titanic_test.head()

In [42]:
titanic_test['First_name'].value_counts()

In [43]:
titanic_test['First_name'].unique()

Fill the missing Age values and create dummy variables for 'Sex', 'Embarked' and 'First_name'. 'Name', 'Cabin' and 'Ticket' are to be dropped.

In [44]:
titanic_test['Age'] = titanic_test[['Age','Pclass']].apply(fill_age,axis=1)
sex = pd.get_dummies(titanic_test['Sex'], drop_first = True)
embark = pd.get_dummies(titanic_test['Embarked'], drop_first = True)
firstname = pd.get_dummies(titanic_test['First_name'], drop_first = True)
titanic_test = pd.concat([titanic_test, sex, embark, firstname], axis =1)
titanic_test.drop(['Sex', 'Embarked', 'Name', 'Ticket', 'Cabin', 'First_name'], axis = 1, inplace = True)
titanic_test.head()

In [45]:
titanic_test.columns

In [46]:
titanic_train.columns

Compared to the titanic_train set, the titanic_test set does not contain 'Jonkheer', 'Lady', 'Major', 'Mlle', 'Mme', 'Sir', 'the Countess' and 'Col'. It is better to drop those dummy variables from titanic_train. The 'Don' variable can be renamed to 'Dona', which is present in the titanic_test set.

In [47]:
titanic_train.drop(['Jonkheer', 'Lady', 'Major','Mlle', 'Mme', 'Sir', 'the Countess', 'Col'], axis =1, inplace = True)

In [48]:
titanic_train.rename(columns ={'Don': 'Dona'}, inplace = True)
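Dropping and renaming columns by hand works for this dataset, but it depends on reading the two column lists carefully. As a hedged alternative (not what this notebook does), `DataFrame.reindex` can force the test dummies onto the training columns. The title lists below are made up for illustration.

```python
import pandas as pd

# Hypothetical title columns for a training and a test set
train_titles = pd.Series(['Mr', 'Mrs', 'Sir', 'Miss'])
test_titles = pd.Series(['Mr', 'Miss', 'Dona'])

train_dummies = pd.get_dummies(train_titles)
test_dummies = pd.get_dummies(test_titles)

# Reindex the test dummies to the training columns:
# titles unseen in training ('Dona') are dropped,
# titles absent from the test set ('Mrs', 'Sir') become all-zero columns
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```

The design difference is that this keeps every training column and discards unseen test categories, whereas the cells above shrink the training set to match the test set.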

In [49]:
titanic_train.head()

In [50]:
titanic_test.head()

In [51]:
titanic_train.shape

In [52]:
titanic_test.shape

The data is now ready for building classification models. The two sets have matching feature columns; only the 'Survived' column is missing from titanic_test.

## **Building a Logistic Regression Classification model**

### **Train Test Split**

Split the titanic_train data into training and test subsets.

In [53]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(titanic_train.drop('Survived',axis=1), titanic_train['Survived'], test_size=0.30, random_state=101)

### **Training and predicting**

In [54]:
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression(max_iter =10000)
logmodel.fit(X_train,y_train)

In [55]:
predictions = logmodel.predict(X_test)

### **Model Evaluation**

In [56]:
from sklearn.metrics import accuracy_score
print('Accuracy of the Logistic Regression Model is ', accuracy_score(y_test,predictions))

In [57]:
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))
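The numbers in the classification report all derive from four counts: true positives, true negatives, false positives and false negatives. A small worked sketch with made-up labels (not the Titanic predictions) shows the arithmetic for the positive class:

```python
import numpy as np

# Hypothetical true labels and predictions for 8 passengers
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # survivors predicted as survivors
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # non-survivors predicted as non-survivors
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # non-survivors predicted as survivors
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # survivors predicted as non-survivors

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```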

## **Building a Support Vector Classifier (SVC) Model**

In [58]:
from sklearn.svm import SVC
svcmodel = SVC()
svcmodel.fit(X_train,y_train)
predictions = svcmodel.predict(X_test)
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.metrics import accuracy_score
print('Accuracy of the SVC Model is ', accuracy_score(y_test,predictions))
print('\n', '\n','Confusion Matrix of SVC Model:' '\n', confusion_matrix(y_test,predictions))
print('\n', '\n','Classification Report for SVC Model:' '\n',classification_report(y_test,predictions))

## **Building a Decision Tree Classifier Model**

In [59]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
#Predictions and Evaluation of Decision Tree
predictions = dtc.predict(X_test)
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.metrics import accuracy_score
print('Accuracy of the DecisionTreeClassifier Model is ', accuracy_score(y_test,predictions))
print('\n', '\n','Confusion Matrix of DecisionTreeClassifier Model:' '\n', confusion_matrix(y_test,predictions))
print('\n', '\n','Classification Report for DecisionTreeClassifier Model:' '\n',classification_report(y_test,predictions))

## **Building a Random Forest Classifier Model**

In [60]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators = 600)
rfc.fit(X_train, y_train)
predictions = rfc.predict(X_test)
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.metrics import accuracy_score
print('Accuracy of the RandomForestClassifier Model is ', accuracy_score(y_test,predictions))
print('\n', '\n','Confusion Matrix of RandomForestClassifier :' '\n', confusion_matrix(y_test,predictions))
print('\n', '\n','Classification Report for RandomForestClassifier:' '\n',classification_report(y_test,predictions))
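The scores above come from a single 70/30 split, so small differences between models may partly reflect that particular split. A hedged sketch of a less split-dependent comparison uses 5-fold cross-validation; it runs on synthetic data from `make_classification` rather than the Titanic features, and the `models` dictionary and its settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the Titanic features
X, y = make_classification(n_samples=300, n_features=10, random_state=101)

models = {
    'LogisticRegression': LogisticRegression(max_iter=10000),
    'SVC': SVC(),
    'DecisionTree': DecisionTreeClassifier(random_state=101),
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=101),
}

# Mean accuracy over 5 folds for each model
cv_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()
    for name, model in models.items()
}

for name, score in cv_accuracy.items():
    print(f'{name}: {score:.3f}')
```

With the real data, `X` and `y` would be the feature columns and the `Survived` column of `titanic_train`.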

## **Building a Binary Classification Model Using an Artificial Neural Network**

In [61]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
model = Sequential()
# first hidden layer (Keras infers the input size from the data on the first fit)
model.add(Dense(78,  activation='relu'))
model.add(Dropout(0.2))

# hidden layer
model.add(Dense(39, activation='relu'))
model.add(Dropout(0.2))

# hidden layer
model.add(Dense(19, activation='relu'))
model.add(Dropout(0.2))

# output layer
model.add(Dense(units=1,activation='sigmoid'))

# Compile model
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(x=X_train, 
          y=y_train, 
          epochs=25,
          batch_size=256,
          validation_data=(X_test, y_test), 
          )


In [62]:
losses = pd.DataFrame(model.history.history)
losses[['loss','val_loss']].plot()

In [63]:
predictions = (model.predict(X_test) > 0.5).astype("int32")
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.metrics import accuracy_score
print('Accuracy of the ANN Model is ', accuracy_score(y_test,predictions))
print('\n', '\n','Confusion Matrix of ANN Model:' '\n', confusion_matrix(y_test,predictions))
print('\n', '\n','Classification Report for ANN Model:' '\n',classification_report(y_test,predictions))
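The sigmoid output layer returns one probability per passenger in an array of shape `(n, 1)`, and the comparison with 0.5 turns those probabilities into class labels. A minimal sketch with made-up probabilities; the `.ravel()` call is an optional extra, not part of the cell above, that flattens the result to a 1-D label vector:

```python
import numpy as np

# Hypothetical sigmoid outputs for four passengers, shaped (n, 1) like model.predict
probs = np.array([[0.91], [0.12], [0.50], [0.73]])

# Strictly greater than 0.5 counts as "survived"; a probability of exactly 0.5 maps to 0
labels = (probs > 0.5).astype('int32').ravel()
print(labels)
```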

## **Final Output**

The Logistic Regression model gives the best accuracy, precision and f1-score, so it is used to predict survival on the test set, and the predictions are saved as submission.csv.

In [64]:
submission_preds = logmodel.predict(titanic_test)
test_ids = titanic_test['PassengerId']
df = pd.DataFrame({'PassengerId': test_ids.values, 'Survived': submission_preds})
df.to_csv('submission.csv', index = False)
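Before uploading, it can help to sanity-check the submission frame. The helper below is a hypothetical sketch (`check_submission` is not part of pandas or Kaggle); it assumes the expected layout is exactly the two columns written above, with unique ids and 0/1 predictions.

```python
import pandas as pd

def check_submission(df, expected_rows):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(df.columns) != ['PassengerId', 'Survived']:
        return ['unexpected columns']
    problems = []
    if len(df) != expected_rows:
        problems.append('unexpected row count')
    if df['PassengerId'].duplicated().any():
        problems.append('duplicate PassengerId')
    if not df['Survived'].isin([0, 1]).all():
        problems.append('Survived must be 0 or 1')
    return problems

# Hypothetical three-row submission
demo = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})
print(check_submission(demo, expected_rows=3))
```

For the real file, `expected_rows` would be the number of rows in `titanic_test`.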