#**Titanic Survivor Analysis**
An exploratory data analysis of the passengers of the Titanic.

In [46]:
!pip install numpy
!pip install pandas



In [47]:
import numpy as np
import pandas as pd

In [48]:
#pd.read_csv already returns a DataFrame, so no extra wrapping is needed
df = pd.read_csv("train.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [49]:
df.shape

(891, 12)

##**Dataset Description**

```
Size of Dataset: 891x12

PassengerId: Unique ID of passenger
Pclass: Passenger Class (1: 1st, 2: 2nd, 3: 3rd)
Survived: Survived? (0: No, 1: Yes)
Name: Name of passenger
Sex: Sex of passenger
Age: Age of passenger
SibSp: Number of Siblings/Spouses Aboard
Parch: Number of Parents/Children Aboard
Ticket: Ticket Number
Fare: Passenger Fare (in British pounds)
Cabin: Cabin Number
Embarked: Port of Embarkation (C: Cherbourg, Q: Queenstown, S: Southampton)
```




##**Data Cleaning**
-> Handling NULL values.

In [50]:
#Checking No. of null values for each column in the given dataset
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [51]:
#Dropping columns with more than 35% null values
null_counts = df.isnull().sum()
drop_col = null_counts[null_counts > 0.35 * df.shape[0]]
df.drop(columns=drop_col.index, inplace=True)

#Checking remaining data
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Embarked         2
dtype: int64
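
The same 35% rule can also be expressed through the fraction of missing values per column. The snippet below is only a minimal sketch of that idea, not the code used in this notebook: the helper name `drop_sparse_columns` and the small `toy` frame are made up for illustration.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, max_null_fraction=0.35):
    """Return `frame` without the columns whose share of nulls exceeds the limit."""
    # isnull().mean() gives the fraction of missing values in each column.
    null_fraction = frame.isnull().mean()
    return frame.loc[:, null_fraction <= max_null_fraction]


# Made-up toy data, not taken from train.csv.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],          # 25% missing, kept
    "Cabin": [np.nan, "C85", np.nan, np.nan],   # 75% missing, dropped
    "Fare": [7.25, 71.28, 7.93, 53.10],         # 0% missing, kept
})

cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))  # ['Age', 'Fare']
```

Returning a new frame instead of using `inplace=True` keeps the original data available if we want to try a different threshold later.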

In [52]:
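#Replacing null values in numeric columns (here, Age) with the mean of that column
#numeric_only=True is needed on pandas 2.0+, where df.mean() raises an error on text columns
df.fillna(df.mean(numeric_only=True), inplace=True)

#Finding the most frequent value in 'Embarked', which will replace its null values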
df['Embarked'].describe()

count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

In [53]:
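#Filling missing ports with the most frequent value 'S'
#Assigning the result back avoids chained in-place modification, which newer pandas versions warn about
df['Embarked'] = df['Embarked'].fillna('S')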

#Checking null values in the dataset again
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
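
The two imputation steps above can be sketched without hard-coding `'S'`, by looking up the most frequent value with `Series.mode()`. This is an illustrative variant under stated assumptions: the `toy` frame is made up, and only the column names match the notebook.

```python
import numpy as np
import pandas as pd

# Made-up toy data, not taken from train.csv.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 36.0],
    "Embarked": ["S", "C", None, "S"],
})

# Numeric column: fill with the column mean, here (22 + 26 + 36) / 3 = 28.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Categorical column: look up the most frequent value instead of typing it in.
# mode() ignores missing values and returns a Series, so we take its first entry.
most_frequent_port = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_frequent_port)

print(toy["Age"].tolist())       # [22.0, 28.0, 26.0, 36.0]
print(toy["Embarked"].tolist())  # ['S', 'C', 'S', 'S']
```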

##**Data Re-formatting**

In [54]:
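#numeric_only=True (pandas 1.5+) skips text columns such as Name and Ticket
#On pandas 2.0+ a plain df.corr() raises an error when text columns are present
df.corr(numeric_only=True)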

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,0.033207,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,-0.069809,-0.035322,0.081629,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.331339,0.083081,0.018443,-0.5495
Age,0.033207,-0.069809,-0.331339,1.0,-0.232625,-0.179191,0.091566
SibSp,-0.057527,-0.035322,0.083081,-0.232625,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,-0.179191,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.091566,0.159651,0.216225,1.0


The matrix above shows a moderate positive correlation (about 0.41) between SibSp and Parch. Both columns count family members on board, so we can combine them into a single new column.

In [55]:
df['FamilyOnBoard'] = df['SibSp'] + df['Parch']

#Dropping the old 2 columns to reduce data redundancy
df.drop(['SibSp','Parch'], axis = 1, inplace=True)
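
As a reference for reading these correlation matrices: by default `corr()` reports the Pearson coefficient, which measures only the linear relationship between two columns. The sketch below computes it by hand on a made-up `toy` frame and compares the result with pandas. The values are illustrative and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Made-up toy data, not taken from train.csv.
toy = pd.DataFrame({
    "SibSp": [1, 1, 0, 1, 0, 3],
    "Parch": [0, 0, 0, 2, 0, 1],
})

# Centre both columns on their means.
x = toy["SibSp"] - toy["SibSp"].mean()
y = toy["Parch"] - toy["Parch"].mean()

# Pearson r = sum(x * y) / sqrt(sum(x^2) * sum(y^2))
manual_r = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# Series.corr uses the Pearson method by default.
pandas_r = toy["SibSp"].corr(toy["Parch"])

print(np.isclose(manual_r, pandas_r))  # True
```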

In [56]:
#numeric_only=True (pandas 1.5+) skips the remaining text columns
df.corr(numeric_only=True)

Unnamed: 0,PassengerId,Survived,Pclass,Age,Fare,FamilyOnBoard
PassengerId,1.0,-0.005007,-0.035144,0.033207,0.012658,-0.040143
Survived,-0.005007,1.0,-0.338481,-0.069809,0.257307,0.016639
Pclass,-0.035144,-0.338481,1.0,-0.331339,-0.5495,0.065997
Age,0.033207,-0.069809,-0.331339,1.0,0.091566,-0.248512
Fare,0.012658,0.257307,-0.5495,0.091566,1.0,0.217138
FamilyOnBoard,-0.040143,0.016639,0.065997,-0.248512,0.217138,1.0


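The correlation between FamilyOnBoard and Survived is close to zero (about 0.017), which suggests that survival has no linear relationship with the size of the family a passenger had on board. A near-zero correlation does not by itself rule out a non-linear dependence.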
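
To see why a near-zero correlation is not the same as "no effect", consider the small made-up example below. It is a sketch on toy data and says nothing about the actual Titanic numbers: survival depends strongly on family size, yet the Pearson coefficient is zero because the relationship is not linear.

```python
import pandas as pd

# Made-up toy data: only passengers with exactly one relative survive.
toy = pd.DataFrame({
    "FamilyOnBoard": [0, 0, 1, 1, 2, 2],
    "Survived":      [0, 0, 1, 1, 0, 0],
})

# Linear (Pearson) correlation between family size and survival.
linear_r = toy["FamilyOnBoard"].corr(toy["Survived"])

# Survival rate for each family size.
rate_by_size = toy.groupby("FamilyOnBoard")["Survived"].mean()

print(round(linear_r, 6))
print(rate_by_size.tolist())  # [0.0, 1.0, 0.0]
```

Grouping by the value and comparing rates, as the next cells do with `isAlone`, is one simple way to catch such patterns.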

Now let us analyse whether travelling alone affected the survival rate.

In [57]:
#1 if the passenger had no family on board, else 0
df['isAlone'] = (df['FamilyOnBoard'] == 0).astype(int)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Embarked,FamilyOnBoard,isAlone
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.25,S,1,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,STON/O2. 3101282,7.925,S,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1,S,1,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.05,S,0,1


In [58]:
df.groupby('isAlone')['Survived'].mean()

isAlone
0    0.505650
1    0.303538
Name: Survived, dtype: float64

**Hypothesis:** The data above show that a passenger travelling alone had about a 30.4% chance of survival, while a passenger travelling with family had about a 50.6% chance. One possible reason is that passengers travelling with family tended to belong to a wealthier class and were therefore prioritised over the others.

In [59]:
df[['isAlone','Fare']].corr()

Unnamed: 0,isAlone,Fare
isAlone,1.0,-0.271832
Fare,-0.271832,1.0


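The negative correlation (about -0.27) between isAlone and Fare shows that passengers who were not alone tended to have paid higher fares. The relationship is weak, so it supports the hypothesis only partially.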
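
Fare is an indirect measure of class. A more direct check of the hypothesis, sketched below on made-up toy data, would be to compare the share of each passenger class among passengers travelling alone and those travelling with family. The column names match this notebook, but the values are illustrative only.

```python
import pandas as pd

# Made-up toy data, not taken from train.csv.
toy = pd.DataFrame({
    "isAlone": [1, 1, 1, 1, 0, 0, 0, 0],
    "Pclass":  [3, 3, 3, 1, 1, 1, 2, 3],
})

# normalize="index" turns counts into shares, so each row sums to 1.
class_share = pd.crosstab(toy["isAlone"], toy["Pclass"], normalize="index")

print(class_share.loc[0, 1])  # share of 1st class among passengers with family: 0.5
print(class_share.loc[1, 3])  # share of 3rd class among passengers alone: 0.75
```

On the real data, the same call on `df['isAlone']` and `df['Pclass']` should show whether the class mix really differs between the two groups.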

In [60]:
#Encoding the 'Sex' field as int (0: male, 1: female) so that it is easier to interpret
df['Sex'] = (df['Sex'] != 'male').astype(int)

df.groupby(['Sex'])['Survived'].mean()

Sex
0    0.188908
1    0.742038
Name: Survived, dtype: float64

The data above show that about 74.2% of female passengers survived, compared with about 18.9% of male passengers. This suggests that women were prioritised over men during the rescue.
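
One way to check that such a gap is not simply an effect of passenger class is to group by both columns at once. The snippet below is a minimal sketch on made-up toy data, using the same 0/1 encoding of `Sex` as above. It shows the technique only and makes no claim about the real figures.

```python
import pandas as pd

# Made-up toy data, not taken from train.csv (Sex: 0 = male, 1 = female).
toy = pd.DataFrame({
    "Sex":      [0, 0, 0, 0, 1, 1, 1, 1],
    "Pclass":   [1, 1, 3, 3, 1, 1, 3, 3],
    "Survived": [1, 0, 0, 0, 1, 1, 1, 0],
})

# Survival rate for every (Sex, Pclass) pair; unstack puts classes in columns.
rate = toy.groupby(["Sex", "Pclass"])["Survived"].mean().unstack("Pclass")

print(rate.loc[0].tolist())  # male rates for class 1 and 3:   [0.5, 0.0]
print(rate.loc[1].tolist())  # female rates for class 1 and 3: [1.0, 0.5]
```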

In [61]:
df.groupby(['Embarked'])['Survived'].mean()

Embarked
C    0.553571
Q    0.389610
S    0.339009
Name: Survived, dtype: float64

The data above show that passengers who boarded at Cherbourg had a higher survival rate (about 55.4%) than those who boarded at Queenstown (about 39.0%) or Southampton (about 33.9%).
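
When comparing rates between groups, it helps to look at the group sizes as well, since a rate computed from a few passengers is less reliable. The sketch below uses named aggregation to show both side by side. The `toy` frame and the output column names `survival_rate` and `passengers` are made up for illustration.

```python
import pandas as pd

# Made-up toy data, not taken from train.csv.
toy = pd.DataFrame({
    "Embarked": ["C", "C", "Q", "S", "S", "S", "S", "S"],
    "Survived": [1, 0, 1, 0, 0, 1, 0, 0],
})

# Named aggregation: one column for the mean, one for the group size.
summary = toy.groupby("Embarked")["Survived"].agg(
    survival_rate="mean",
    passengers="count",
)

print(summary)
```

In this toy frame, port Q has a survival rate of 1.0 but only one passenger, which is exactly the kind of case the count column helps to spot.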

##**Conclusion**


1.   Female passengers were prioritised over male passengers.
2.   Wealthier passengers (higher class, higher fare) had a higher chance of survival than others.
3.   Passengers travelling with their family had a higher survival rate than those travelling alone.
4.   Passengers who boarded at Cherbourg had a higher chance of survival than those who boarded at the other two ports.

