<a href="https://colab.research.google.com/github/4nur4g/Google-Colab/blob/master/Titanic_Exploratory_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## **References**
* [Visualisation and Code](https://github.com/mjamilmoughal/DataSciencePractices/blob/master/Exploratory%20Data%20Analysis%20with%20Titanic%20Dataset.ipynb)
* [Expected Survival From Different Features](https://github.com/raghav96/datascience/blob/master/Titanic%20Dataset%20Kaggle%20Competition.ipynb)
* [Visualisation](https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8)
* [Visualisation](https://zmudzinski.me/posts/2019/02/titanic-eda/)

# Load the Dataset

In [0]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [0]:
# reading data from csv
data = pd.read_csv('/kaggle/input/titanic/train.csv')
# shape of the dataset
print(data.shape)

In [0]:
test = pd.read_csv('/kaggle/input/titanic/test.csv')
print(test.shape)

# Cleaning & Analysing the Dataset

In [0]:
data.head()

In [0]:
data.info()

In [0]:
data.describe()

#### Initial Analysis without Data Cleaning
 - About 38 percent of the passengers survived.
 - The Age column has missing values.
 - Passenger ages range from about 0.4 to 80 years.
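
The three observations above can be read off `data.describe()` and `data.info()`, but they can also be computed directly. The following is a minimal sketch on a small made-up frame (the `toy` rows are hypothetical, not real Titanic records); on the real `data` the same expressions would apply.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training data; these are not real Titanic rows.
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 0],
    'Age': [22.0, 38.0, np.nan, 35.0, 0.42],
})

# The mean of a 0/1 column is the share of ones, i.e. the share of survivors.
survival_share = toy['Survived'].mean() * 100
# Number of missing ages.
missing_age = int(toy['Age'].isnull().sum())
# min() and max() skip NaN by default.
age_min, age_max = toy['Age'].min(), toy['Age'].max()

print(f'{survival_share:.0f}% survived, {missing_age} missing ages, '
      f'age range {age_min} to {age_max}')
```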

In [0]:
# function of finding NaN Values present in dataset
def find_NaN(data):
    total = data.isnull().sum().sort_values(ascending=False)
    percent_1 = data.isnull().sum()/data.isnull().count()*100
    percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
    nan = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
    print('Finding NaN Values present in dataset')
    print(nan.head(4))

In [0]:
find_NaN(data)

In [0]:
# Count values in the column
data['Embarked'].value_counts()

In [0]:
# Fill NaN value with mean in Age
data['Age'] = data['Age'].fillna((data['Age'].mean()))

# Fill NaN value with most common Port
data['Embarked'] = data['Embarked'].fillna('S')
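
Instead of hard-coding `'S'`, the fill values can be derived from the column itself. This is a small sketch under the assumption that the most frequent value is the desired fill for `Embarked`; the `toy` frame and the names `age_fill` and `port_fill` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Age and one missing Embarked value.
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0, 30.0],
    'Embarked': ['S', 'C', None, 'S'],
})

age_fill = toy['Age'].mean()            # mean ignores NaN
port_fill = toy['Embarked'].mode()[0]   # mode() ignores NaN; take the first mode

# assign() returns a new frame, leaving `toy` untouched.
filled = toy.assign(
    Age=toy['Age'].fillna(age_fill),
    Embarked=toy['Embarked'].fillna(port_fill),
)
print(filled)
```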

In [0]:
data.columns

### Finding Survival Rates from Different Features
* Pclass : Ticket class
* Sex : Gender (for example, what is the survival rate of females?)
* Age : Which age group has the highest survival rate?
* SibSp : Number of siblings and spouses aboard the ship
* Parch : Number of parents and children aboard the ship

In [0]:
# Function Showing DataFrame Side by Side
from IPython.display import display, HTML
def display_side_by_side(dfs:list, captions:list):
    output = ""
    combined = dict(zip(captions, dfs))
    for caption, df in combined.items():
        # Use df.head() to show only top 5 rows
        output += df.head().style.set_table_attributes("style='display:inline'").set_caption(caption)._repr_html_()
        output += "\xa0\xa0\xa0"
    display(HTML(output))

In [0]:
# Relationship between Survival of Passenger and Different features
PCLASS = data[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
SEX = data[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
AGE = data[['Age', 'Survived']].groupby(['Age'], as_index=False).mean().sort_values(by='Survived', ascending=False)
PARCH = data[['Parch', 'Survived']].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)
SIBSP = data[['SibSp', 'Survived']].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
display_side_by_side([PCLASS,SEX,AGE,PARCH,SIBSP], ['TICKET CLASS','SEX','AGE','PARCH','SIBSP'])

### Analysis of Survival Rate of a Passenger
* Ticket class 1 (wealthier passengers) : 62%
* Sex (female) : 74%
* Age (infants) : 100%
* Age (3 yrs) : 83%
* Age (15 yrs) : 80%
* Parents or children aboard (highest group) : 60%
* 1 sibling or spouse : 53%
* 2 siblings or spouses : 46%
* 5 or 8 siblings or spouses : 0%
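
Grouping by exact age gives one rate per distinct age value, which is noisy when only a few passengers share an age. One possible alternative is to bin ages first with `pd.cut`. The bin edges and labels below are assumptions chosen for illustration, and the `toy` rows are made up.

```python
import pandas as pd

# Hypothetical ages and outcomes, not real Titanic rows.
toy = pd.DataFrame({
    'Age': [0.5, 3, 15, 22, 35, 47, 63, 80],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 1],
})

# Assumed bin edges; intervals are right-closed, e.g. (0, 12], (12, 18], ...
bins = [0, 12, 18, 40, 60, 80]
labels = ['child', 'teen', 'adult', 'middle-aged', 'senior']
toy['AgeGroup'] = pd.cut(toy['Age'], bins=bins, labels=labels)

# Survival rate per age group instead of per exact age.
rate_by_group = toy.groupby('AgeGroup', observed=True)['Survived'].mean()
print(rate_by_group)
```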

In [0]:
# PCLASS
survived = pd.crosstab(index=data.Survived, columns = data.Pclass, margins=True)
survived.columns = ['Upper Class','Middle Class','Lower Class','ColTotal']
survived.index = ['Not Survived','Survived','RowTotal']
# Normalization of PCLASS
survived_per = pd.crosstab(index=data.Survived, columns = data.Pclass, margins=True,normalize=True)
survived_per.columns = ['Upper Class','Middle Class','Lower Class','ColTotal']
survived_per.index = ['Not Survived','Survived','RowTotal']
display_side_by_side([survived, survived_per], ['Survived','Survived_per'])

# Siblings 
survived_sib = pd.crosstab(index=data.Survived, columns = data.SibSp, margins=True,colnames=['Siblings'])
# Normalisation of Siblings
survived_sib_per = pd.crosstab(index=data.Survived, columns = data.SibSp, margins=True,colnames=['Siblings'],normalize=True)
survived_sib_per.index = ['Not Survived','Survived','RowTotal']
survived_sib.index = ['Not Survived','Survived','RowTotal']
display_side_by_side([survived_sib, survived_sib_per], ['Survived_sib','Survived_sib_per'])


* The lower class has the largest number of passengers who did not survive (372, about 41% of all passengers).
* The upper class has the largest number of passengers who survived (136, about 15% of all passengers).
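
With `normalize=True`, each cell is a share of all passengers, so the percentages above depend on how large each class is. To compare survival within a class, `pd.crosstab` also accepts `normalize='columns'`, which makes every column sum to 1. A short sketch on made-up rows:

```python
import pandas as pd

# Hypothetical passengers: 3 in class 1, 2 in class 2, 5 in class 3.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Share of ALL passengers in each (Survived, Pclass) cell.
overall = pd.crosstab(toy['Survived'], toy['Pclass'], normalize=True)
# Share WITHIN each class: every column sums to 1.
within_class = pd.crosstab(toy['Survived'], toy['Pclass'], normalize='columns')

print(overall)
print(within_class)
```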

# Visualisation of Training Dataset

In [0]:
# importing libraries for data visualisation
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style
sns.set(style='darkgrid')

## Analysis of Survival Rate

### Age and Sex with [Kernel Density Estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation)

In [0]:
survived = 'survived'
not_survived = 'not survived'
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(14, 6))
female = data[data['Sex']=='female']
male = data[data['Sex']=='male']

# Chart of Female: density histogram with a KDE overlay (requires seaborn >= 0.11)
ax = sns.histplot(female[female['Survived']==1].Age.dropna(), bins=18, label=survived, ax=axes[0], kde=True, stat='density')
ax = sns.histplot(female[female['Survived']==0].Age.dropna(), bins=40, label=not_survived, ax=axes[0], kde=True, stat='density')
ax.legend()
ax.set_title('Female')

# Chart of Male: density histogram with a KDE overlay
ax = sns.histplot(male[male['Survived']==1].Age.dropna(), bins=18, label=survived, ax=axes[1], kde=True, stat='density')
ax = sns.histplot(male[male['Survived']==0].Age.dropna(), bins=40, label=not_survived, ax=axes[1], kde=True, stat='density')
ax.legend()
ax.set_title('Male')

* The survival rate of females is higher when they are aged between 15 and 40 years, or between 0.1 and 5 years.
* The survival rate of males is higher when they are aged between 0.1 and 5 years (infants), or between 18 and 40 years.
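
The smooth curves drawn over the histograms come from kernel density estimation: each observation contributes a small Gaussian bump, and the bumps are averaged. The sketch below implements that idea with NumPy only. The function name `gaussian_kde_1d`, the sample ages, and the bandwidth of 5 are all assumptions for illustration; seaborn chooses its bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # Standard normal kernel evaluated at each distance.
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average the kernels and rescale by the bandwidth.
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

# Hypothetical ages, not taken from the dataset.
ages = np.array([2.0, 4.0, 19.0, 22.0, 24.0, 30.0, 35.0, 38.0])
grid = np.linspace(-40, 100, 1401)
density = gaussian_kde_1d(ages, grid, bandwidth=5.0)

# A density should integrate to roughly 1 (simple Riemann sum here).
area = float(np.sum(density) * (grid[1] - grid[0]))
print(f'area under the estimated density: {area:.4f}')
```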

### Survival Rate of PClass

In [0]:
sns.countplot(x='Pclass', hue='Survived', data=data)
plt.title('PClass Survival Rate')
plt.show()

* The survival rate of 1st class is higher than that of 2nd and 3rd class, even though 3rd class has the most passengers.

In [0]:
# Deep Analysis of Pclass
pd.crosstab([data.Sex, data.Survived], data.Pclass, margins=True).style.background_gradient(cmap='PuBu')

In [0]:
sns.catplot(x='Pclass', y='Survived', hue='Sex', kind='point', data=data)
plt.show()

* The survival rate of women in 1st class is above 95%.
* The survival rate of men is low in comparison, even in 1st class.
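
The rates read off the plot can also be tabulated exactly by grouping on both columns. A minimal sketch on made-up rows (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical passengers, two per (Sex, Pclass) combination.
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'male', 'male', 'female', 'male', 'male', 'female'],
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of Survived per (Sex, Pclass), reshaped so classes become columns.
rates = toy.groupby(['Sex', 'Pclass'])['Survived'].mean().unstack('Pclass')
print(rates)
```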

### Survival Rate in Embarked Port

In [0]:
grid = sns.FacetGrid(data, row='Embarked', height=4.0, aspect=1.6)
# Pass explicit order and hue_order so each sex keeps the same colour in every facet
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep', order=[1, 2, 3], hue_order=['male', 'female'])
grid.add_legend()

* Women have a higher survival rate than men at every port (C, Q and S). Men have their best chance of survival if they embarked at port C, and a low chance if they embarked at port Q or S.

In [0]:
data = data.drop(['Name','Cabin','PassengerId','Ticket'],axis=1)

In [0]:
# label encoding the data 
from sklearn.preprocessing import LabelEncoder 
le = LabelEncoder() 

data['Sex']= le.fit_transform(data['Sex']) 
data['Embarked'] =le.fit_transform(data['Embarked'])

In [0]:
data.columns

In [0]:
data.head()

In [0]:
plt.figure(figsize=(12,10))
cor = data.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
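
The heatmap shows every pairwise correlation, but for modelling the `Survived` row is the interesting one. One way to read it is to rank the features by absolute correlation with the target. The frame below is a hypothetical, already-encoded stand-in (0 = female, 1 = male, which is the alphabetical order `LabelEncoder` assigns).

```python
import pandas as pd

# Hypothetical encoded rows; Sex is deliberately the exact inverse of Survived.
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0],
    'Sex':      [1, 0, 0, 1, 0, 1],
    'Pclass':   [3, 1, 2, 3, 1, 2],
    'SibSp':    [1, 0, 1, 0, 0, 1],
})

cor = toy.corr()
# Absolute correlation of each feature with the target, strongest first.
strength = cor['Survived'].drop('Survived').abs().sort_values(ascending=False)
print(strength)
```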

## Models

In [0]:
train_data = pd.DataFrame()

In [0]:
train_data = data

In [0]:
train_data.head()

In [0]:
train_X = train_data[train_data.columns[1:]].values
train_Y = train_data[train_data.columns[0]]

In [0]:
test_data = pd.DataFrame()
test_data = test
test_data.head()

In [0]:
final_output = pd.DataFrame()
final_output = pd.DataFrame({'PassengerId': test_data['PassengerId']})
final_output.head()

In [0]:
test_data = test_data.drop(['Name','Cabin','PassengerId','Ticket'],axis=1)

In [0]:
# Fill NaN value with mean in Age
test_data['Age'] = test_data['Age'].fillna((test_data['Age'].mean()))

# Fill NaN value with mean in Fare
test_data['Fare'] = test_data['Fare'].fillna((test_data['Fare'].mean()))

In [0]:
find_NaN(test_data)

In [0]:
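# Label encode the test data. Refitting the encoder here gives the same codes as for
# the training data only because both files contain the same Sex and Embarked categories.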
from sklearn.preprocessing import LabelEncoder 
le = LabelEncoder() 

test_data['Sex']= le.fit_transform(test_data['Sex']) 
test_data['Embarked'] =le.fit_transform(test_data['Embarked'])
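
A more defensive pattern is to fit one encoder per column on the training data and reuse it on the test data, so the integer codes cannot drift if a category happens to be absent from one file. This is a sketch on hypothetical frames (`train_toy`, `test_toy`); the `encoders` dictionary is an illustrative name, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frames; the test frame only contains port 'S'.
train_toy = pd.DataFrame({'Sex': ['male', 'female', 'male'],
                          'Embarked': ['S', 'C', 'Q']})
test_toy = pd.DataFrame({'Sex': ['female', 'male'],
                         'Embarked': ['S', 'S']})

encoders = {}
for column in ['Sex', 'Embarked']:
    # Fit on the training column only, then apply the same mapping to both frames.
    encoders[column] = LabelEncoder().fit(train_toy[column])
    train_toy[column] = encoders[column].transform(train_toy[column])
    test_toy[column] = encoders[column].transform(test_toy[column])

print(test_toy)
```

Had the encoder been refitted on `test_toy` alone, `'S'` would have been coded as 0 instead of 2, silently disagreeing with the training data.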

In [0]:
test_data.head()

In [0]:
test_X = test_data[test_data.columns].values

### Support Vector Classifier

In [0]:
from sklearn.svm import SVC # "Support Vector Classifier" 
svm_clf = SVC(kernel='linear',random_state=45) 
  
# fitting x samples and y classes 
svm_clf.fit(train_X,train_Y) 

In [0]:
test_Y = svm_clf.predict(test_X)

In [0]:
final_output['Survived'] = test_Y

In [0]:
final_output.head()

In [0]:
final_output.to_csv('SVM_output.csv',sep=',',index=False)

### Random Forest Classifier

In [0]:
# Random forest: an ensemble of decision trees, each trained on a bootstrap sample
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings("ignore")

# Train model
random_clf = RandomForestClassifier(random_state=45,n_estimators=100,min_samples_leaf=50)
random_clf.fit(train_X, train_Y)

In [0]:
random_test_Y = random_clf.predict(test_X)

In [0]:
final_output['Survived'] = random_test_Y

In [0]:
final_output.head()

In [0]:
final_output.to_csv('Random_output.csv',sep=',',index=False)
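
The Kaggle test file has no `Survived` column, so neither model can be scored on it locally. One common way to estimate accuracy before submitting is k-fold cross-validation on the training data. The sketch below uses synthetic features and labels (`X`, `y` are generated, not Titanic data) and a smaller `min_samples_leaf` than above because the synthetic set is small; on the notebook's data, `train_X` and `train_Y` would take their place.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic, linearly separable stand-in for (train_X, train_Y).
rng = np.random.default_rng(45)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

models = {
    'SVC (linear)': SVC(kernel='linear', random_state=45),
    'Random forest': RandomForestClassifier(random_state=45, n_estimators=100,
                                            min_samples_leaf=5),
}

# 5-fold cross-validated accuracy for each model.
cv_scores = {name: cross_val_score(model, X, y, cv=5, scoring='accuracy')
             for name, model in models.items()}

for name, scores in cv_scores.items():
    print(f'{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})')
```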