**TITANIC SURVIVOR ANALYSIS**


In [36]:
import pandas as pd
import numpy as np

# Reading Data Using Pandas



In order to read csv file in Python we use pandas read_csv function and to convert file into data frame we use pandas DataFrame method.

In [37]:
df = pd.DataFrame(pd.read_csv('/content/train (1).csv'))
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


**Description of the Attributes of the Dataset**

1.   PassengerId : Identity Number of Passenger
2.   Survived    : Survival (0=No; 1=Yes)
3.   PClass      : Passenger Class (1=1st; 2=2nd; 3=3rd)
4.   Name        : Name
5.   Sex         : Sex
6.   Age         : Age
7.   SibSp       : No. of Siblings/Spouses Abroad
8.   Parch       : No. of Parents/Children Abroad  
9.   Ticket      : Ticket Number
10.  Fare        : Passenger Fare (British Pound) 
11.  Cabin       : Cabin
12.  Embarked    : Port of Embarkation (C=Cherbourg; Q=Queenstown; S=Southampton)

In [38]:
df.shape

(891, 12)

# Handaling Null Values




The dataset may contain rows and columns for which some values are missing. We can not leave those missing values as it is. In such cases we have two options:


1.   Either drop the entire row or column
2.   Or fill the empty cells with some appropriate values (say mean of all the values of that column).

In [39]:
df.isnull().sum()
# To find number of empty cells in each column
# This function returns pandas series with columns name as label index and total null count in that column as it's value.

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Separting out the columns which have more than 30% of the values missing in the dataset.

In [40]:
x=df.isnull().sum()

drop_col = x[x>(30/100 * df.shape[0])]
drop_col
# Here df.shape[0] gives the 1st value of the tuple.

Cabin    687
dtype: int64

In [41]:
drop_col.index
# Since 'Cabin' column had not much useful data we dropped that.

Index(['Cabin'], dtype='object')

In [42]:
df.drop(drop_col.index, axis=1, inplace=True)
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Embarked         2
dtype: int64

In [43]:
df.fillna(df.mean(), inplace=True)
df.isnull().sum()
# df.fillna() will fill the empty cells of 'Age' column with the mean of the Age column data.

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       2
dtype: int64

Since **Embarked** contains string values, we will see the details of that column separately from others because strings does not have mean.

In [44]:
df['Embarked'].describe()

count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

Here count signifies the filled values of Embarked column.
Unique signifies the total numbers of unique data.
Top and Frequency denotes the highest occuring data and its frequency.




For Embarked attribute , we fill the NULL values with the with the most frequent value in the column.

In [45]:
df['Embarked'].fillna('S', inplace=True)

In [46]:
#Now all the NULL values have been filled. Let's check
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

# Detailed Analysis

In [62]:
df.corr()
# It shows the correlation of one column values with all the other column values. 

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,Fare,FamilySize,Alone
PassengerId,1.0,-0.005007,-0.035144,,0.033207,0.012658,-0.040143,0.057462
Survived,-0.005007,1.0,-0.338481,,-0.069809,0.257307,0.016639,-0.203367
Pclass,-0.035144,-0.338481,1.0,,-0.331339,-0.5495,0.065997,0.135207
Sex,,,,,,,,
Age,0.033207,-0.069809,-0.331339,,1.0,0.091566,-0.248512,0.179775
Fare,0.012658,0.257307,-0.5495,,0.091566,1.0,0.217138,-0.271832
FamilySize,-0.040143,0.016639,0.065997,,-0.248512,0.217138,1.0,-0.690922
Alone,0.057462,-0.203367,0.135207,,0.179775,-0.271832,-0.690922,1.0


In [48]:
df['FamilySize'] = df['SibSp']+df['Parch']
df.drop(['SibSp', 'Parch'], axis=1, inplace=True)
df.corr()

Unnamed: 0,PassengerId,Survived,Pclass,Age,Fare,FamilySize
PassengerId,1.0,-0.005007,-0.035144,0.033207,0.012658,-0.040143
Survived,-0.005007,1.0,-0.338481,-0.069809,0.257307,0.016639
Pclass,-0.035144,-0.338481,1.0,-0.331339,-0.5495,0.065997
Age,0.033207,-0.069809,-0.331339,1.0,0.091566,-0.248512
Fare,0.012658,0.257307,-0.5495,0.091566,1.0,0.217138
FamilySize,-0.040143,0.016639,0.065997,-0.248512,0.217138,1.0


**FamilySize in the ship does not have much correlation with the survival rate.** Let's check if whether the person was alone or not can affect the survival rate.

In [50]:
df['Alone'] = [0 if df['FamilySize'][i]>0 else 1 for i in df.index]
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Embarked,FamilySize,Alone
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.25,S,1,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,STON/O2. 3101282,7.925,S,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1,S,1,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.05,S,0,1


In [51]:
df.groupby(['Alone'])['Survived'].mean()

Alone
0    0.505650
1    0.303538
Name: Survived, dtype: float64

If the person is alone, he/she has a less chance of surviving.
> The reason might be the person who is travelling with his family might be belonging to rich class and might be prioritized over others.



In [54]:
df[['Alone','Fare']].corr()

Unnamed: 0,Alone,Fare
Alone,1.0,-0.271832
Fare,-0.271832,1.0


So it is seen that if the person is not alone, the chance that the ticket price is higher is high.

In [57]:
df['Sex'] = [0 if df['Sex'][i]=='male' else 1 for i in df.index]  
df.groupby(['Sex'])['Survived'].mean()
# 0 for male, 1 for female

Sex
1    0.383838
Name: Survived, dtype: float64

It shows that female passengers had more chance of surviving than males.

It means women were prioritized over men.

In [61]:
df.groupby(['Pclass'])['Survived'].mean()

Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

This shows that a person travelling in 1st class (rich people) had more chance of surviving.  

In [56]:
df.groupby(['Embarked'])['Survived'].mean()

Embarked
C    0.553571
Q    0.389610
S    0.339009
Name: Survived, dtype: float64

# Conclusion


 



*   Passengers travelling with their family have higher survival rate.
*   Female passengers were prioritized over man.
*   Rich or high class people have higher survival rate than others. They might have been given priority while saving people.
*   People who had family members also had a higher rate of survival.
*   Passengers who boarded the ship at Cherbourg, survied more in proportion.