# The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

# Approach

I have divided the analysis into three steps:

1. Exploratory Data Analysis
2. Feature Scaling
3. Applying Classification Model

Importing all the necessary libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Reading the data from the CSV files

In [None]:
test = pd.read_csv("../input/titanic/test.csv")
train = pd.read_csv("../input/titanic/train.csv")

y=train.Survived

In [None]:
train_features = train.drop(['Survived'], axis=1)
test_features = test
features = pd.concat([train_features, test_features]).reset_index(drop=True)

# Exploratory Data Analysis

In [None]:
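# Pass the column by keyword; recent seaborn versions no longer accept it positionally
ax = sns.countplot(x='Survived', data=train)
ax.set_title('Number of passengers by survival outcome')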

Counting survivors and non-survivors by sex

In [None]:
ax=sns.countplot(x='Sex',hue ='Survived',data= train)
ax.set_title('Sex vs Survived')

Counting survivors and non-survivors by passenger class (Pclass)

In [None]:
ax=sns.countplot(x='Pclass',hue ='Survived',data= train)
ax.set_title('Pclass vs Survived')

Counting survivors and non-survivors by port of embarkation (Embarked)

In [None]:
ax=sns.countplot(x='Embarked',hue ='Survived',data= train)
ax.set_title('Embarked vs Survived')
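
The count plots above show raw counts per group. The same comparison can be read as a survival rate by taking the mean of the 0/1 `Survived` column within each group. This is a minimal sketch on a small made-up frame; the rows are illustrative and do not come from `train.csv`.

```python
import pandas as pd

# Tiny made-up sample using the same column names as train.csv (illustrative only)
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of survivors in each group
rate_by_sex = sample.groupby("Sex")["Survived"].mean()
rate_by_class = sample.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

In the notebook, replacing `sample` with `train` would give the rates for the real data.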

Distribution of Age by Sex, split by survival

In [None]:
ax = sns.violinplot(x="Sex", y="Age", hue="Survived", data=train, split=True)
ax.set_title('Sex and Age vs Survived')

Distribution of Pclass by Sex, split by survival

In [None]:
ax = sns.violinplot(x="Pclass", y="Sex", hue="Survived", data=train, split=True)
ax.set_title('Pclass and Sex vs Survived')

Counting survivors and non-survivors by number of siblings/spouses aboard (SibSp)

In [None]:
ax=sns.countplot(x='SibSp',hue ='Survived',data= train)
ax.set_title('SibSp vs Survived')

In [None]:
ax = sns.histplot(train[train['Pclass']==1].Fare, kde=True)
ax.set_title('Fares in Pclass 1')

In [None]:
g = sns.FacetGrid(train, col='Survived')
g = g.map(sns.histplot, "Age", kde=True)

In [None]:
# Sex is a string column at this point, so it is left out of the correlation matrix
g = sns.heatmap(train[["Age","Fare","SibSp","Parch","Pclass","Survived"]].corr(), annot=True)

# Feature Scaling

Dropping the Name, Ticket, and Cabin features

In [None]:
features = features.drop(['Name', 'Ticket', 'Cabin'], axis=1)

Counting the missing (NA) values in each column

In [None]:
features.isna().sum()

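Replacing the missing Age and Fare values with the column mean

In [None]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer()  # default strategy is "mean"
# Select the columns by name rather than by position
features[['Age', 'Fare']] = imp.fit_transform(features[['Age', 'Fare']])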
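
To make the effect of mean imputation concrete, here is a minimal sketch on made-up numbers. The values are illustrative only and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values; NaN marks the missing entries
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Fare": [10.0, 30.0, np.nan],
})

imputer = SimpleImputer(strategy="mean")  # "mean" is also the default strategy
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Each NaN is replaced by the mean of the observed values in its column
print(filled)
print(imputer.statistics_)  # the per-column means that were used
```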

In [None]:
features['Embarked'] = features['Embarked'].fillna('S')

In [None]:
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Sex has two values, so a 0/1 label encoding is sufficient
le = LabelEncoder()
features['Sex'] = le.fit_transform(features['Sex'])

# PassengerId is a row identifier, not a predictor
features = features.drop(['PassengerId'], axis=1)

# One-hot encode only the categorical columns; encoding every column would
# create one indicator per distinct Age, Fare, and PassengerId value
features = pd.get_dummies(features, columns=['Pclass', 'Embarked'], dtype=float)

# Standardize every column to zero mean and unit variance
sd = StandardScaler()
features = pd.DataFrame(sd.fit_transform(features), columns=features.columns)
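
The encoding and scaling steps can also be expressed with scikit-learn's `ColumnTransformer`, which applies each transformer only to the columns named for it. This is a minimal sketch on made-up rows; which columns are treated as categorical or numeric here is an assumption for illustration, not the notebook's exact pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with one categorical and one numeric column (illustrative only)
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q"],
    "Age": [20.0, 30.0, 40.0, 50.0],
})

preprocess = ColumnTransformer(
    [
        # One indicator column per port
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Embarked"]),
        # Zero mean and unit variance for the numeric column
        ("numeric", StandardScaler(), ["Age"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_toy = preprocess.fit_transform(toy)
print(X_toy.shape)
print(X_toy)
```

Keeping the column lists explicit means an identifier or a continuous column cannot be one-hot encoded by accident.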

Dividing the combined features back into the training and test sets

In [None]:
train = features.iloc[:len(y),:].values
test = features.iloc[len(y):,:].values

# Selecting the best classification model:

Importing the classifiers

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

### Applying Logistic Regression

In [None]:
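log_reg = LogisticRegression()
log_reg.fit(train, y)
print(log_reg.score(train, y))  # accuracy on the training data

y_pred = log_reg.predict(train)
test_y = log_reg.predict(test)  # predictions for the unlabeled test set
# test.csv has no Survived column, so these predictions cannot be scored here;
# scoring them against themselves would always return 1.0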

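Using the confusion matrix, precision, recall, and F1 score to check the training predictions

In [None]:
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y, y_pred)
print(cm)

# Computed on the training data, so these are optimistic estimates
print("Precision:", precision_score(y, y_pred))
print("Recall:", recall_score(y, y_pred))
print("F1:", f1_score(y, y_pred))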
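
Precision, recall, and F1 all follow from the four counts in a binary confusion matrix. The sketch below works this out by hand on made-up labels and checks the result against scikit-learn; `y_true` and `y_hat` are illustrative names, not variables from the notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels and predictions for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted survivors, how many survived
recall = tp / (tp + fn)     # of the actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```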

### KNN (K-Nearest Neighbours)

In [None]:
knn = KNeighborsClassifier()
knn.fit(train, y)
y_pred = knn.predict(train)

print(knn.score(train, y))  # accuracy on the training data

y_test = knn.predict(test)  # predictions for the unlabeled test set

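Using the confusion matrix, precision, recall, and F1 score to check the training predictions

In [None]:
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y, y_pred)
print(cm)

# Computed on the training data, so these are optimistic estimates
print("Precision:", precision_score(y, y_pred))
print("Recall:", recall_score(y, y_pred))
print("F1:", f1_score(y, y_pred))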

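### Gaussian Naive Bayes

In [None]:
nb = GaussianNB()
nb.fit(train, y)

y_pred = nb.predict(train)

print(nb.score(train, y))  # accuracy on the training data

y_test = nb.predict(test)  # predictions for the unlabeled test set
# test.csv has no Survived column, so these predictions cannot be scored here;
# scoring them against themselves would always return 1.0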

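Using the confusion matrix, precision, recall, and F1 score to check the training predictions

In [None]:
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y, y_pred)
print(cm)

# Computed on the training data, so these are optimistic estimates
print("Precision:", precision_score(y, y_pred))
print("Recall:", recall_score(y, y_pred))
print("F1:", f1_score(y, y_pred))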

### Support Vector Machine

In [None]:
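svm = SVC()
svm.fit(train, y)
print(svm.score(train, y))  # accuracy on the training data
y_test = svm.predict(test)  # predictions for the unlabeled test set
# test.csv has no Survived column, so these predictions cannot be scored here;
# scoring them against themselves would always return 1.0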

### Decision Tree

In [None]:
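dtf = DecisionTreeClassifier(criterion = 'entropy', max_depth = 11)
dtf.fit(train, y)
print(dtf.score(train, y))  # accuracy on the training data

y_test = dtf.predict(test)  # predictions for the unlabeled test set
# test.csv has no Survived column, so these predictions cannot be scored here;
# scoring them against themselves would always return 1.0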
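
The accuracies in the sections above are measured on the same rows the models were fitted on, so they tend to favour models that memorize the training data. One common way to compare models more fairly is k-fold cross-validation, sketched below. The data here is a synthetic stand-in generated with `make_classification` (an assumption made to keep the example self-contained); in the notebook, `X_demo` and `y_demo` would be replaced by `train` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared training matrix and labels (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", max_depth=11, random_state=0),
}

# 5-fold cross-validation: each model is scored only on folds it was not fitted on
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in models.items()
}

# Print the models from highest to lowest mean validation accuracy
for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The model with the highest mean cross-validated accuracy is a more defensible choice than the one with the highest training accuracy.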