<h1>Introduction</h1>

This notebook takes a basic, simple approach to a beginner classification problem. The problem is an excellent starting point for aspiring data scientists and lays a good foundation for a newcomer's ML journey. I am a newcomer to Kaggle myself and sincerely hope to convey the concepts in an easy-to-understand manner. Please feel free to leave comments that will help me improve this kernel and supplement my knowledge.

This notebook is divided into six major parts:
<ol>
    <li>Introduction</li>
    <li>Competition Description</li>
    <li>Data Description</li>
    <li>Exploratory Data Analysis (EDA)</li>
    <li>Data Pre-Processing</li>
    <li>Modeling</li>
</ol>
Following the well-known data science rule of thumb, we will spend most of our time on EDA and preprocessing rather than modeling, in roughly an 80:20 ratio.

<h1>Competition Description</h1>

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

<h1>Data Description</h1>

<h3>Overview</h3>

The data has been split into two groups:
<ul>
    <li>training set (train.csv)</li>
    <li>test set (test.csv)</li>
</ul>

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.

# Importing the Libraries

This is where the actual fun begins. We start by importing all the libraries we will need later on. We will use NumPy and pandas for data analysis, and matplotlib (a MATLAB-style plotting library for Python) and seaborn for data visualisation.

In [None]:
import os
print(os.listdir("../input"))

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [None]:
# This is how we assign the datasets to variables in python using pandas.
train=pd.read_csv("../input/train.csv")
test=pd.read_csv("../input/test.csv")

# Exploratory Data Analysis

We will use the .head() method to display the first five rows of the dataset and get a feel for it.

In [None]:
train.head()

From the data displayed above we can see that our target variable, Survived (which we have to predict), takes the values 0 and 1. This indicates a binary classification problem. We will also have to convert the categorical columns Embarked and Sex into numerical values so that our ML algorithm can work with them.

In [None]:
test.head()

The test dataset has the same columns as the train dataset, minus the target variable.

In [None]:
#To get the number of rows and columns of the dataset
train.shape

In [None]:
test.shape

In [None]:
#Gives us statistical information about the dataset
train.describe()


From the above data we can tell that the Age column is missing a lot of values. To better understand the number of missing values we run the following code.

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()
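
Raw counts are easier to compare across columns when expressed as a share of the rows. The following is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not the competition data); the same expression should work on `train` or `test`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull().mean() is the fraction of missing entries in each column.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns that need the most attention at the top.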

To visualise the missing values we can plot it out onto a heatmap.

In [None]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

From the above plot we can tell that Cabin has an enormous number of missing values, so we will simply drop that column. We will impute the Age column with values derived from a concrete hypothesis in the preprocessing section.

A plot to visualise the Target Distribution.

In [None]:
sns.set_style("whitegrid")
sns.countplot(x='Survived',data=train,palette='viridis')

<h3>Passenger Class (Pclass) column</h3>

Let us now look into the Pclass column, which is the passenger class.

In [None]:
train.Pclass.value_counts()


We can find the number of people who survived in each class by grouping them with 'Pclass' column

In [None]:
train.groupby('Pclass').Survived.value_counts()

The code below shows the proportion of people who survived in each class. From this data we can conclude that passengers in a higher class were more likely to survive than those in a lower class.

In [None]:
train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean()


The below is a pictorial representation of the above data.

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass',data=train,palette='rainbow')

<h3>Gender Column</h3>

Now let us look into the Gender Column

In [None]:
train.Sex.value_counts()


We will perform the same analysis that we have done in passenger class.

In [None]:
train.groupby('Survived').Sex.value_counts()

From the above data we can conclude that a higher percentage of female passengers survived.

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')

Now let us see the number of males and females in each passenger class.

In [None]:
tab = pd.crosstab(train['Pclass'], train['Sex'])
print (tab)
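
The class and gender breakdowns can also be combined into a single table of survival rates. This is a minimal sketch on a made-up frame (the `toy` rows are assumptions for illustration); on the real data the same call would use `train` in place of `toy`.

```python
import pandas as pd

# Toy stand-in for the training data; the rows are made up for illustration.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the survival rate within each group.
rate = toy.groupby(["Pclass", "Sex"])["Survived"].mean().unstack()
print(rate)
```

Each cell of `rate` is the fraction of survivors for one class and gender combination, which makes interactions between the two features visible at a glance.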

From the plot below we can tell that the majority of the passengers' ages lie between 20 and 40.

In [None]:
# histplot replaces distplot, which is deprecated in recent seaborn releases.
sns.histplot(train['Age'].dropna(),kde=False,color='darkred',bins=30)

From the plot below we can see that the majority of the passengers didn't have any siblings or spouses aboard.

In [None]:
sns.countplot(x='SibSp',data=train)

In [None]:
train['Fare'].hist(color='green',bins=40,figsize=(8,4))

The plot below gives us the insight that passengers from a higher passenger class tend to be older. We will use this insight later to fill the Age column.

In [None]:
plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=train,palette='winter')


# Data Preprocessing

In [None]:
#We are doing this because the test doesn't have the Target column.
train2=train.drop('Survived',axis=1)

In [None]:
#We are combining train and test dataset as it will be easier for us to process the data together.
#pd.concat is used because DataFrame.append was removed in pandas 2.0.
data = pd.concat([train2, test], sort=False)

In [None]:
data.head()

In [None]:
#We drop the PassengerId column as the values from this column won't contribute to our model.
data.drop(['PassengerId'],axis=1,inplace=True)

It was found that the title in each name, such as Mr, Miss or Mrs, contributes to the prediction. Therefore we create a feature out of it. The length of the name was found to contribute as well.

In [None]:
data['Title'] =data['Name'].apply(lambda x: x.split(',')[1]).apply(lambda x: x.split()[0])
data['Name_Len'] = data['Name'].apply(lambda x: len(x))
data.drop(labels='Name', axis=1, inplace=True)
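
The split-based extraction above keeps the trailing period (for example `Mr.`) and only the first word of a multi-word title. As a possible alternative, here is a minimal sketch using a regular expression; the sample strings only mimic the dataset's "Surname, Title. Given names" layout, and the pattern is an assumption about that layout rather than the method used in this notebook.

```python
import pandas as pd

# Sample strings in the "Surname, Title. Given names" layout.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Capture everything between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

Because the pattern captures up to the first period, a title made of several words would be kept whole.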

In [None]:
data.Name_Len = (data.Name_Len/10).astype(np.int64)+1

In [None]:
training_age_n = data.Age.dropna(axis=0)

In [None]:
fx, axes = plt.subplots(1, 2, figsize=(15,5))
axes[0].set_title("Age vs frequency")
axes[1].set_title("Age-wise survival counts")
# histplot replaces distplot, which is deprecated in recent seaborn releases.
fig1_age = sns.histplot(training_age_n, bins=15, ax=axes[0], kde=True, stat='density', shrink=0.7)

# Creating a new list of survived and dead

pass_survived_age = train[train.Survived == 1].Age
pass_dead_age = train[train.Survived == 0].Age

axes[1].hist([data.Age, pass_survived_age, pass_dead_age], bins=5, range=(0, 100), label=['Total', 'Survived', 'Dead'])
axes[1].legend()
plt.show()

We fill the missing Age values with random integers drawn from within one standard deviation of the mean age, then bin the ages into 15-year groups.

In [None]:
# Train and test are already combined in data, so a single pass fills
# every missing Age (177 nulls from train and 86 from test).
age_mean = data.Age.mean()
age_std = data.Age.std()
age_null = data.Age.isnull()
# Random integer ages within one standard deviation of the mean.
rand_age = np.random.randint(int(age_mean - age_std), int(age_mean + age_std), size=age_null.sum())
# .loc avoids the chained assignment that newer pandas versions reject.
data.loc[age_null, 'Age'] = rand_age
data['Age'] = data['Age'].astype(int) + 1

# Bin Age into 15-year groups.
data.Age = (data.Age/15).astype(np.int64) + 1
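
The boxplot in the EDA section suggested that age varies with passenger class, whereas the cell above draws random ages around the overall mean. One way to act on the boxplot insight is to fill each missing age with the median age of that passenger's class. This is a minimal sketch on a made-up frame (the `toy` values are assumptions), not the imputation used for the scores reported in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the combined data; the values are made up for illustration.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [38.0, 42.0, np.nan, 20.0, 24.0, np.nan],
})

# transform("median") returns the class median aligned to every row.
class_median = toy.groupby("Pclass")["Age"].transform("median")
toy["Age"] = toy["Age"].fillna(class_median)
print(toy)
```

Using `transform` keeps the result the same length as the original column, so it can be passed straight to `fillna`.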

We create a feature called FamilySize by adding the SibSp and Parch columns, plus one for the passenger.

In [None]:
data['FamilySize'] = data['SibSp'] + data['Parch'] + 1

A feature isAlone is created that is 1 when FamilySize is 1 (the passenger travelled alone) and 0 otherwise.

In [None]:
data['isAlone'] =data['FamilySize'].map(lambda x: 1 if x == 1 else 0)

In [None]:
data.drop(labels=['SibSp', 'Parch'], axis=1, inplace=True)
data.head()

In [None]:
# We drop the Cabin column as it has too many null values.
data.drop(['Cabin'],axis=1,inplace=True)

The ticket length also gives us a useful feature that increases our accuracy.

In [None]:
data['Ticket_Len'] = data['Ticket'].apply(lambda x: len(x))

In [None]:
data.drop(labels='Ticket', axis=1, inplace=True)

The empty values in the Fare column are filled with the mean of the Fare column.

In [None]:
# fillna avoids the chained assignment that newer pandas versions reject.
data['Fare'] = data['Fare'].fillna(data.Fare.mean())

In [None]:
data.Fare = (data.Fare /20).astype(np.int64) + 1
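
The cell above groups fares into fixed bands of width 20, which leaves most passengers in the lowest band because fares are heavily skewed. A possible alternative is quantile-based binning with `pd.qcut`, which aims for bands with roughly equal numbers of passengers. This is a minimal sketch on made-up fares (the `fares` values and the choice of three bands are assumptions), not what this notebook uses.

```python
import pandas as pd

# Made-up fares for illustration, skewed like real ticket prices.
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 71.28, 512.33])

# Fixed-width bands of 20, as in the cell above.
fixed = (fares / 20).astype("int64") + 1

# Quantile bands: each band holds about the same number of passengers.
quantile = pd.qcut(fares, q=3, labels=False) + 1

print(fixed.tolist())
print(quantile.tolist())
```

With fixed widths a single very high fare ends up in a band of its own, while quantile bands stay compact.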

In [None]:
data['Embarked'].isnull().sum()

The empty values in the Embarked column are filled with S.

In [None]:
data['Embarked'] = data['Embarked'].fillna('S')

In [None]:
data.head()

We will convert the categorical values into integer codes.

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
lr=LabelEncoder()

In [None]:
data['Sex'] = lr.fit_transform(data['Sex'])
data['Embarked']=lr.fit_transform(data['Embarked'])
data['Title']=lr.fit_transform(data['Title'])
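
Integer codes impose an arbitrary order on categories such as the port of embarkation, which a linear model like logistic regression may read as a magnitude. One common alternative is one-hot encoding. This is a minimal sketch using `pd.get_dummies` on a made-up column (the `toy` frame is an assumption), not the encoding used for the scores reported here.

```python
import pandas as pd

# Toy stand-in for the Embarked column; the values are made up for illustration.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per category, stored as 0/1 integers.
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1, so no ordering between ports is implied.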

In [None]:
train.shape

Now we don't have any null values present in the dataset.

In [None]:
sns.heatmap(data.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
data.head()

In [None]:
#Splitting the data back
train2=data.iloc[0:891,:]
test2=data.iloc[891:1310,:]

In [None]:
train2.shape

In [None]:
#Defining the features (X) and the target (y) used for training.
X=train2
y=train['Survived']

# Modeling 
Modeling is by far the easiest part of the ML workflow. All we have to do is import the library, fit the model and predict our values.
<h3>1. Logistic Regression</h3>

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logmodel = LogisticRegression()
logmodel.fit(X,y)

In [None]:
predictions_log = logmodel.predict(test2)

<h3>2. Random Forest</h3>

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# n_estimators refers to the number of trees.
rfc=RandomForestClassifier(n_estimators=250)
rfc.fit(X,y)

In [None]:
predictions_rfc=rfc.predict(test2)
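
Since the test set has no ground truth, the only local way to compare models before submitting is to hold out part of the training data. Here is a minimal sketch of 5-fold cross-validation with `cross_val_score`. The synthetic `X_toy` and `y_toy` arrays are assumptions standing in for `X` and `y`, and the settings (`cv=5`, `max_iter=1000`, 50 trees) are illustrative choices rather than the ones used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X and y; made up purely for illustration.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 folds gives a rough estimate of unseen-data accuracy.
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```

On the real data, passing `X` and `y` in place of the toy arrays should give an estimate that can be compared with the leaderboard score.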

<h1>Submission</h1>

In [None]:
predictions_log.shape

<h2>Logistic Regression </h2>

In [None]:
#my_submission = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': predictions_log})
#my_submission.to_csv('submission.csv', index=False)

<font size="4">Gives us a score of 0.76555</font>

<h2>Random Forest </h2>

In [None]:
my_submission = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': predictions_rfc})
my_submission.to_csv('submission.csv', index=False)

<font size="4">This also gave us a score of 0.76555</font>