## About Dataset


### Problem Feature:
The sinking of the Titanic is one of the most infamous shipwrecks in history. **On April 15, 1912**, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing **1502 out of 2224** passengers and crew, hence the nickname DieTanic. It remains one of the most widely remembered disasters in the world.

1.  **Age** ==>> Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

2. **Sibsp** ==>> The dataset defines family relations in this way...

    a. Sibling = brother, sister, stepbrother, stepsister

    b. Spouse = husband, wife (mistresses and fiancés were ignored)

3. **Parch** ==>> The dataset defines family relations in this way...

    a. Parent = mother, father

    b. Child = daughter, son, stepdaughter, stepson

    c. Some children travelled only with a nanny, therefore parch=0 for them.

4. **Pclass** ==>> A proxy for socio-economic status (SES)

    * 1st = Upper
    * 2nd = Middle
    * 3rd = Lower
    
5. **Embarked** ==>> Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
6. **Name** ==>> Nominal datatype. It could be used in feature engineering to derive the gender from the title
7. **Sex** ==>> Nominal datatype
8. **Ticket** ==>> Has no impact on the outcome variable, so it will be excluded from the analysis
9. **Cabin** ==>> Nominal datatype that can be used in feature engineering
10. **Fare** ==>> Passenger fare
11. **PassengerId** ==>> Has no impact on the outcome variable, so it will be excluded from the analysis
12. **Survived** ==>> **Dependent variable**, 0 or 1
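
The idea of deriving gender from the title in **Name** can be sketched as follows. This is a minimal illustration, assuming names follow a "Surname, Title. Given names" layout; the sample names and the `title_to_sex` mapping are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical names written in the "Surname, Title. Given names" layout
names = pd.Series([
    "Doe, Mr. John",
    "Roe, Mrs. Jane (Mary Major)",
    "Poe, Miss. Ann",
    "Loe, Master. Tom",
])

# Capture the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Assumed mapping; ambiguous titles (for example "Dr") are deliberately left out
title_to_sex = {"Mr": "male", "Master": "male", "Mrs": "female", "Miss": "female"}
derived_sex = titles.map(title_to_sex)

print(pd.DataFrame({"Name": names, "Title": titles, "DerivedSex": derived_sex}))
```

Titles missing from the mapping come out as `NaN`, which makes ambiguous cases easy to spot rather than silently guessing.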


## Load the libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

### Import dataset

In [None]:
df=pd.read_csv('../input/titanicdataset-traincsv/train.csv')

# shape
print(df.shape)

In [None]:
# dataset first 5 rows
df.head()

In [None]:
# 5 point summary
# Min, max, and the 1st, 2nd and 3rd quartiles
df.describe().transpose()

### Inferences

1. In the given data, the first quartile of age is around 20 years and the maximum is 80 years.
Half of the passengers were younger than 28 years and half were older. The age data appears to be slightly right skewed.

2. The first quartile of fare is about 7.9 units and the maximum is 512.3 units. Half of the passengers paid more than 14.4 units.
Fare is also right skewed: the maximum lies far above the third quartile, which points to extreme values at the upper end.


In [None]:
# datatype of features
df.info()

### Inferences

1. From the above, we can say that the PassengerId, Name and Ticket columns are not useful for predicting the survival of a person.
We will drop these 3 columns.

2. Some categorical columns are stored with an integer datatype. We will change the datatype of these features in further analysis.

## Remove irrelevant columns

In [None]:
# Name, PassengerId and Ticket are not useful for further analysis
# We will drop those columns
df.drop(columns=['PassengerId','Name','Ticket'] , inplace=True)


## Missing value treatment

In [None]:
#Check NULL values

null_df=pd.DataFrame()
null_df['Features']=df.isnull().sum().index
null_df['Null values']=df.isnull().sum().values
null_df['% Null values']=(df.isnull().sum().values / df.shape[0])*100
null_df.sort_values(by='% Null values',ascending=False)
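
As an aside, the same percentage can be obtained more compactly, because the mean of a boolean mask is the fraction of `True` values. The sketch below uses a made-up frame (`toy` and its columns are illustrative, not part of the dataset).

```python
import numpy as np
import pandas as pd

# Made-up frame with a known number of missing values per column
toy = pd.DataFrame({
    "A": [1, np.nan, 3, 4],            # 1 of 4 missing
    "B": [np.nan, np.nan, np.nan, 1],  # 3 of 4 missing
    "C": [1, 2, 3, 4],                 # none missing
})

# isnull() gives a boolean mask; its column mean is the share of nulls
null_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(null_pct)
```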

In [None]:
# Plot graph to see missing values in each column
import missingno as no
no.bar(df)
plt.show()

In [None]:
# We can see that there are around 77% missing values in the cabin column
# We will drop Cabin column 
df.drop(columns='Cabin' , inplace=True)

In [None]:
# check the size of the dataset
df.shape

In [None]:
# We will drop the records in which Embarked is null
# Only around 0.224% of the values in the Embarked column are null

df.dropna(subset=['Embarked'],inplace=True)

In [None]:
# check the size of the dataset
df.shape

In [None]:
# There are around 19% null values in the Age column
# We cannot drop these records because we would lose too much data
# So instead of dropping the records, we will impute the null values with the median of Age,
# since the distribution is slightly right skewed
df['Age'].describe()

In [None]:
print('Skewness of age :',round(df['Age'].skew(),3))
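# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar plot
sns.histplot(df['Age'],kde=True)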
plt.show()

In [None]:
# Null value imputation in Age column by median value
# Here we have imputed null values with median instead of mean because the Age column is slightly right skewed 
df['Age']=df['Age'].fillna(df['Age'].median())
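
Why the median is preferred for a skewed column can be seen on a small made-up sample (the values below are illustrative only): a single large value pulls the mean upward, while the median stays near the bulk of the data.

```python
import numpy as np
import pandas as pd

# Toy right-skewed sample with two missing entries
ages = pd.Series([22.0, 24.0, 26.0, 28.0, 30.0, 80.0, np.nan, np.nan])

mean_age = ages.mean()      # pulled upward by the single large value
median_age = ages.median()  # depends only on the middle of the sorted values

# Fill the gaps with the median, as done for the Age column
filled = ages.fillna(median_age)
print(f"mean={mean_age:.1f}, median={median_age:.1f}, remaining nulls={filled.isna().sum()}")
```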

In [None]:
# Check null values after null value treatment
df.isnull().sum()

## Unique Value count 

In [None]:
# Check the number of unique values in each column. Columns that hold a single value should be removed,
# because they carry no meaningful information
unique_val=pd.DataFrame()
unique_val['Features']=df.nunique().index
unique_val['Unique_Values']=df.nunique().values
unique_val.sort_values(by='Unique_Values')

In [None]:
# Correlation matrix before changing the datatype of variables
plt.figure(figsize=(10,6))
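# numeric_only=True skips the text columns (Sex, Embarked), which newer pandas versions refuse to correlate
sns.heatmap(df.corr(numeric_only=True),annot=True)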
plt.show()

1. The correlation coefficients between the independent variables are low, so there is no sign of strong multicollinearity in the given data.

### Change the datatype of variables

In [None]:
# Change the datatype of some variables from integer to object
df['Survived']=df['Survived'].astype('object')
df['Pclass']=df['Pclass'].astype('object')
df['SibSp']=df['SibSp'].astype('object')
df['Parch']=df['Parch'].astype('object')

## Outlier treatment

In [None]:
# There are only 2 numerical columns left in the dataset--> Age & Fare
# Individual Boxplot to check outliers in each feature

df_num=df.select_dtypes(include=np.number)

for i in range(len(df_num.columns)):
    sns.boxplot(x=df_num.iloc[:,i])
    plt.show()

### Inferences

1. We can clearly see extreme values on the right side, which we treat as outliers
2. Fare has a more right-skewed distribution than Age
3. We will remove outliers using the interquartile range (IQR)

In [None]:
# We can observe that outliers are present in Fare and Age
# We will remove outliers from those two features only
num_cols=['Fare','Age']
q1=df[num_cols].quantile(0.25)
q3=df[num_cols].quantile(0.75)
iqr=q3-q1

ll=q1-1.5*iqr
ul=q3+1.5*iqr

# Compare only the numeric columns so the object columns are not involved in the comparison
df=df[~((df[num_cols]<ll)|(df[num_cols]>ul)).any(axis=1)]
df.reset_index(drop=True, inplace=True)
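
The IQR rule used here keeps values between Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. A small worked example on made-up fares (not dataset values) shows how the fences are computed and which value is dropped.

```python
import pandas as pd

# Made-up fares with one extreme value
fares = pd.Series([7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 500.0])

q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values inside the fences (between is inclusive on both ends)
kept = fares[fares.between(lower, upper)]
print(f"q1={q1}, q3={q3}, fences=({lower}, {upper}), kept {len(kept)} of {len(fares)}")
```

A negative lower fence, as in this sample, simply means that no small value can be flagged, which is typical for right-skewed data.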

In [None]:
# After removing outliers again check distribution
#Individual Boxplot to check outliers in each feature

df_num=df.select_dtypes(include=np.number)

for i in range(len(df_num.columns)):
    sns.boxplot(x=df_num.iloc[:,i])
    plt.show()

#### Reset index of the dataset

In [None]:
# Reset index of dataframe (reset_index returns a new frame, so the result is assigned back)
df=df.reset_index(drop=True)
# Data size after data cleaning
df.shape

In [None]:
# Final dataset has 8 Features & 721 Records 
# Now we will visualize each feature using different graphs to find some insightful information from the data

## Univariate, Bivariate , Multivariate analysis

In [None]:
# Individual distribution plots to check whether each distribution is skewed
# Skewness of Fare is more than the skewness of Age but that much skewness is acceptable in model building

df_num=df.select_dtypes(include=np.number)

for i in range(len(df_num.columns)):
    print(f'Skewness of {df_num.columns[i]} : {round(df_num.iloc[:,i].skew(),3)}')
    sns.histplot(df_num.iloc[:,i],kde=True)  # replaces the deprecated distplot
    plt.show()

In [None]:
# Target variable - people who died & survived
sns.countplot(x=df['Survived'])
plt.title('Survived Yes/No',fontsize=15)
plt.text(0,df['Survived'].value_counts()[0],df['Survived'].value_counts()[0])
plt.text(1,df['Survived'].value_counts()[1],df['Survived'].value_counts()[1])
plt.show()

d1=df['Survived'].value_counts()
plt.pie(d1.values,labels=d1.index,autopct='%0.2f%%')
plt.title('Percentage of Survived Yes/No',fontsize=15)

plt.tight_layout()
plt.show()

### Inferences:
1. Out of 721 people, 478 died and 243 survived.
That is, 66.3% of the people died and 33.7% survived.
2. More people died than survived.
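
The percentages follow directly from the counts quoted above; a quick arithmetic check:

```python
# Counts quoted in the inferences above
died, survived = 478, 243
total = died + survived

pct_died = round(100 * died / total, 1)
pct_survived = round(100 * survived / total, 1)

print(total, pct_died, pct_survived)
```

On the DataFrame itself, `df['Survived'].value_counts(normalize=True)` returns the same shares as fractions.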

### Countplots

In [None]:
# Subplots: one countplot per categorical column
figure,ax=plt.subplots(3,2,figsize=(9,14))

# np.object was removed from recent NumPy versions; the string 'object' selects the same columns
df_cat=df.select_dtypes(include='object')

# Pair each categorical column with one axis of the 3x2 grid
for axis,column in zip(ax.flatten(),df_cat.columns):
    sns.countplot(x=df_cat[column] , ax=axis)
    axis.set_title(f'Countplot for {column}', fontsize=15)
    axis.set_xlabel(column,fontsize=15)

plt.tight_layout()
plt.show()

### Inferences
Feature --- Observation
1. Pclass -- People are classified based on economic condition.
The countplot clearly shows that more people travelled in the lower class and fewer in the upper class.
2. Sex -- More males than females were travelling on the Titanic.
3. SibSp, Parch -- More than 80% of the people travelled alone. The count is highest for people who had no siblings, spouse, parents or children with them on the Titanic.
4. Embarked -- Most people boarded at Southampton and the fewest boarded at Queenstown.

In [None]:
# Subplots: one countplot per categorical column, split by survival
figure,ax=plt.subplots(3,2,figsize=(10,15))

# np.object was removed from recent NumPy versions; the string 'object' selects the same columns
df_cat=df.select_dtypes(include='object')

# Pair each categorical column with one axis of the 3x2 grid
for axis,column in zip(ax.flatten(),df_cat.columns):
    sns.countplot(x=df_cat[column] ,hue=df_cat['Survived'], ax=axis)
    axis.set_title(f'Survival w.r.t {column}', fontsize=15)
    axis.set_xlabel(column,fontsize=15)

plt.tight_layout()
plt.show()

### Inferences
1. More than 50% of the first-class passengers survived. First-class passengers had a better chance of survival than passengers in the second or third class.
2. Most of the people who died were from the third class.
3. Approximately 65% of the passengers were male and the remaining 35% were female. The survival rate among females was higher than among males.
4. More than 80% of the male passengers died, a much higher share than among females.
5. SibSp, Parch -- More than 80% of the people travelled alone. The chance of survival dropped drastically for people travelling with more than 2 siblings or spouses.
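
Countplots show absolute counts, so groups of different sizes are hard to compare by eye. A survival rate per group makes the comparison explicit. The sketch below uses a made-up frame (`toy` is illustrative); on the notebook's `df`, `Survived` would first need `astype(int)` because it was converted to `object`.

```python
import pandas as pd

# Made-up sample: 6 males (1 survivor) and 4 females (3 survivors)
toy = pd.DataFrame({
    "Sex": ["male"] * 6 + ["female"] * 4,
    "Survived": [0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
})

# Mean of a 0/1 column within each group is that group's survival rate
rate = toy.groupby("Sex")["Survived"].mean()

# Same information as row-wise proportions of died (0) and survived (1)
table = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")

print(rate)
print(table)
```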

In [None]:
# Subplots
figure,ax=plt.subplots(2,2,figsize=(10,8))

# Embarked Vs No. of passengers
sns.countplot(x='Embarked',data=df,ax=ax[0,0])
ax[0,0].set_title('Embarked Vs No. of Passengers')

#Embarked Vs Sex
sns.countplot(x='Embarked',hue='Sex',data=df,ax=ax[0,1])
ax[0,1].set_title('Embarked Vs Sex')

#Embarked Vs Survived
sns.countplot(x='Embarked',hue='Survived',data=df,ax=ax[1,0])
ax[1,0].set_title('Embarked Vs Survived')

#Embarked Vs Pclass
sns.countplot(x='Embarked',hue='Pclass',data=df,ax=ax[1,1])
ax[1,1].set_title('Embarked Vs Pclass')
plt.tight_layout()
plt.show()

### Age Interval

In [None]:
# Since age is the continuous variable , we will make categories by dividing it into interval
# Divide Age into groups based on intervals
df['Age_category']=pd.cut(df['Age'],  [0,10,20,30,40,50,60], labels=['0-10','10-20','20-30','30-40','40-50','50-60'])
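
A note on how `pd.cut` assigns the groups: by default each interval is closed on the right, so an age of exactly 10 lands in `0-10`, and values outside the outer edges become `NaN`. The ages below are made up to show the boundary cases.

```python
import pandas as pd

# Made-up ages chosen to hit bin edges and one value beyond the last edge
ages = pd.Series([0.5, 10.0, 10.5, 30.0, 54.0, 65.0])

bins = [0, 10, 20, 30, 40, 50, 60]
labels = ['0-10', '10-20', '20-30', '30-40', '40-50', '50-60']

# Default right=True: intervals are (0, 10], (10, 20], ...
groups = pd.cut(ages, bins=bins, labels=labels)
print(pd.DataFrame({"Age": ages, "Age_category": groups}))
```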

In [None]:
# Check number of people in each category of age 
df['Age_category'].value_counts()

In [None]:
#Age

fig,ax=plt.subplots(1,2  , figsize=(12,6))
#Countplot for Age categories
sns.countplot(x=df['Age_category'] ,ax=ax[0] )
#Barplot for Survival vs Age categories (Survived is cast back to int so its mean can be plotted)
sns.barplot(x=df['Age_category'] ,y=df['Survived'].astype(int) ,ax=ax[1] )
ax[0].set_title('\nAge Range\n',fontsize=15)
ax[1].set_title('\nSurvived % w.r.t Age\n',fontsize=15)
plt.tight_layout()
plt.show()

plt.figure(figsize=(12,6))
sns.countplot(x=df['Age_category'],hue=df['Survived'])
plt.title('Survived & Died people between Age groups',fontsize=20)
plt.show()

### Inferences
1. Most of the people travelling on the ship were between 20 and 30 years old.
2. A larger fraction of children under 10 survived than died.
3. For the other age groups, the number of casualties was higher than the number of survivors.
4. Around 250 people in the 20-30 age group died, while just over 100 people in the same age range survived.

### Fare Interval

In [None]:
# Fare is a continuous variable
# Divide Fare into groups based on intervals
# include_lowest=True keeps a fare of exactly 0 in the first group instead of turning it into NaN
df['Fare_category']=pd.cut(df['Fare'],  [0,10,20,30,40,50,60,70], labels=['0-10','10-20','20-30','30-40','40-50','50-60','60-70'], include_lowest=True)

In [None]:
# check number of people in each category of Fare
df['Fare_category'].value_counts()

In [None]:
#Fare

fig,ax=plt.subplots(1,2  , figsize=(12,6))
#Countplot for Fare categories
sns.countplot(x=df['Fare_category'] ,ax=ax[0] )
#Barplot for Survival vs Fare categories (Survived is cast back to int so its mean can be plotted)
sns.barplot(x=df['Fare_category'] ,y=df['Survived'].astype(int) ,ax=ax[1] )
ax[0].set_title('\nFare Range\n',fontsize=15)
ax[1].set_title('\nSurvived % w.r.t Fare\n',fontsize=15)
plt.tight_layout()
plt.show()

plt.figure(figsize=(12,6))
sns.countplot(x=df['Fare_category'],hue=df['Survived'])
plt.title('Survived & Died people between Fare categories',fontsize=20)
plt.show()

### Inferences
1. The chances of survival were higher for passengers who paid a higher fare.
The more a passenger paid, the better the chances of survival.
2. Around 250 of the people who died had paid a very low fare.


In [None]:
# age & survival  
plt.figure(figsize=(7,5))
sns.boxplot(x='Survived',y='Age',data=df)
plt.show()

1. The median age of the people who survived and of those who did not is about the same, around 27 years.

In [None]:
# Survived & Fare  
plt.figure(figsize=(7,5))
sns.boxplot(x='Survived',y='Fare',data=df)
plt.show()

1. The median fare paid is higher among the people who survived.
2. The median fare paid is lower among the people who died.
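
A box plot draws the median as its centre line, so the comparison read from the plot is between medians; for skewed fares the mean can be quite different. The made-up frame below (`toy` is illustrative, not dataset values) shows both statistics side by side.

```python
import pandas as pd

# Made-up fares: the survivor group contains one large value
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1],
    "Fare": [7.0, 8.0, 9.0, 10.0, 12.0, 80.0],
})

# Median is what the box plot's centre line shows; mean is sensitive to the large fare
summary = toy.groupby("Survived")["Fare"].agg(["median", "mean"])
print(summary)
```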

In [None]:
# Class & fare
plt.figure(figsize=(7,5))
sns.boxplot(x='Pclass',y='Fare',data=df)
plt.show()

1. The median fare paid by people in class 1 is much higher than that paid by people in class 3.

In [None]:
# class & age 
plt.figure(figsize=(7,5))
sns.boxplot(x='Pclass',y='Age',data=df)
plt.show()

1. People in the first class had a higher median age than people in the third class.

In [None]:
# embarked city and fare
plt.figure(figsize=(7,5))
sns.barplot(x='Embarked',y='Fare',data=df)
plt.show()

1. The average fare is highest for people who boarded at Cherbourg and lowest for those who boarded at Queenstown.

In [None]:
pd.crosstab(df['Sex'],df['Survived']).plot(kind='bar',stacked=True)

In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.boxplot(x=df['Survived'],y=df['Age'])

In [None]:
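# Heatmap for collinearity between numerical variables
# numeric_only=True skips the text and category columns, which newer pandas versions refuse to correlate
sns.heatmap(df.corr(numeric_only=True) , annot=True)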
plt.show()

In [None]:
# Pairplot
sns.pairplot(data=df )
plt.show()