# Titanic Data Analysis

Documentation of the process of analyzing Titanic data. By Tan Suyin

"The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class."

Below is a glimpse of the dataset. It is information about Titanic passengers, and whether or not they survived.

In [4]:
%pylab inline
import pandas as pd
passenger_df= pd.read_csv("titanic_data.csv")
passenger_df.head(6)

Populating the interactive namespace from numpy and matplotlib


,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


# Variable Notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

# Questions posed

We know (from the movie, at least) that survivors were mostly people who were given preference on the limited lifeboats. A small number of lucky people may also have survived the ice-cold water until more help arrived (is this true, or only in the movie?).

1. What factors made people more likely to survive (that is, to be given preference on the lifeboats)? Is it age (the very old and the very young), gender (female passengers), family group size (do larger families survive better than single passengers?), socio-economic class (rich vs. poor), or being a member of, or related to, the cabin crew?

2. Why are age values missing for 20% of passengers? Is it true that the passengers without age data, survivors and non-survivors alike, had no family members (or no surviving family members) on board the Titanic?



# How many survived, how many didn't?
While the true total number of passengers and crew was 2224 (survivors: 722 [32%], non-survivors: 1502 [68%]), our dataset contains only a subset of that population. However, the split between survivors (roughly one-third) and non-survivors (roughly two-thirds) is similar in this subset and in the full population, so we will treat this subset as a sample of the overall population.

In this dataset, there are 891 passengers in total: 342 survivors (38%) and 549 non-survivors (62%).

In [72]:
survival_groups=passenger_df.groupby('Survived')
survival_groups['PassengerId'].count()

Survived
0    549
1    342
Name: PassengerId, dtype: int64
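
The percentages quoted above can be reproduced from the counts. The following is a small sketch that enters the two dataset counts by hand (rather than re-reading `titanic_data.csv`) and sets them next to the whole-ship figures quoted in the text; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the groupby output above (0 = did not survive, 1 = survived).
dataset_counts = pd.Series({0: 549, 1: 342}, name="dataset")
# Whole-ship figures quoted in the text: 1502 non-survivors, 722 survivors.
ship_counts = pd.Series({0: 1502, 1: 722}, name="ship")

# Convert each set of counts to a percentage of its own total.
shares = pd.DataFrame({
    "dataset_pct": (dataset_counts / dataset_counts.sum() * 100).round(1),
    "ship_pct": (ship_counts / ship_counts.sum() * 100).round(1),
})
print(shares)
```

With the real DataFrame, `passenger_df['Survived'].value_counts(normalize=True)` should give the dataset shares directly.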

# Explore/Validate data for findings 

In [75]:
#Validate that all passengers have Survived values
print("Total passengers with Survived values above:", survival_groups['PassengerId'].count().sum()
       , "\n Total passengers :", passenger_df['PassengerId'].count())

('Total passengers with Survived values above:', 891, '\n Total passengers :', 891)


Let's check whether the age data has missing values.

In [70]:
passenger_df.loc[pd.isna(passenger_df['Age'])].count()

PassengerId    177
Survived       177
Pclass         177
Name           177
Sex            177
Age              0
SibSp          177
Parch          177
Ticket         177
Fare           177
Cabin           19
Embarked       177
dtype: int64
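
A more compact way to get similar information is to count the missing values in every column with `isna().sum()`. The sketch below runs on a hand-built frame holding a few columns of the six rows printed at the top of this notebook, so its numbers describe only those six rows; `sample_df` and the other names are illustrative.

```python
import numpy as np
import pandas as pd

# A few columns of the six rows printed by passenger_df.head(6) earlier.
sample_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S", "S", "Q"],
})

# isna() marks missing cells as True; sum() counts the True values per column.
missing_per_column = sample_df.isna().sum()
# mean() of the same mask gives the fraction of rows that are missing.
missing_fraction = sample_df.isna().mean()

print(missing_per_column)
print(missing_fraction.round(2))
```

On the full `passenger_df`, the `Age` entry of `isna().sum()` should match the 177 found above.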

There are 177 passengers with missing age data, which is 20% of the 891 passengers in the dataset. Perhaps this is because they did not survive to tell their age, and neither did their family, if they had any. Let's check.

# Why are there missing age values for 20% of the passengers?
- We know that 70% of these passengers did not survive to tell their age. But didn't they have family who survived? The same question applies to the 30% who were survivors.
- Of the 549 non-survivors, 125 have no age data (about 23% of the non-survivors). So why do the other 77% of non-survivors have age values? Did they have surviving family on board, or did families who were not on board come forward to supply the data for the deceased?
- I find this very curious. I'm sure the data collectors did all they could to compile the data.
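
A quick way to check these percentages is to compute them from the counts reported in this notebook (group sizes 549 and 342, missing ages 125 and 52). This is only a worked-arithmetic sketch with the counts entered by hand; the column names are made up for illustration.

```python
import pandas as pd

# Counts reported in this notebook, indexed by Survived (0 = no, 1 = yes).
group_size = pd.Series({0: 549, 1: 342})
missing_age = pd.Series({0: 125, 1: 52})

summary = pd.DataFrame({
    "group_size": group_size,
    "missing_age": missing_age,
    # Share of each survival group that has no age recorded.
    "pct_of_group": (missing_age / group_size * 100).round(1),
    # Share of all missing-age passengers that falls in each group.
    "pct_of_missing": (missing_age / missing_age.sum() * 100).round(1),
})
print(summary)
```

The two percentage columns answer different questions: `pct_of_missing` says how the 177 missing-age passengers split between the groups, while `pct_of_group` says how common a missing age is within each group.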


In [66]:
survival_groups['Age'].apply(lambda x: pd.isna(x).sum()).reset_index(name='Count of people with Age=NaN')
#note: sum() adds up the True values of the boolean mask; count() would count every non-null entry, True or False.

,Survived,Count of people with Age=NaN
0,0,125
1,1,52
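
To illustrate why `sum()` is the right call on a boolean mask, here is a minimal sketch on the `Survived` and `Age` values of the six rows printed at the top of this notebook. It also shows one possible way to get the per-group count without `apply()`; this is an alternative sketch, not the method used in the cell above, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Survived and Age for the six rows printed at the top of this notebook.
sample_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
})

# Boolean mask, True where Age is NaN.
age_missing = sample_df["Age"].isna()

n_true = age_missing.sum()        # adds up the True values
n_entries = age_missing.count()   # counts non-null entries, True and False alike

# Per-group count of missing ages, grouping the mask by the Survived column.
missing_by_group = age_missing.groupby(sample_df["Survived"]).sum()

print(n_true, n_entries)
print(missing_by_group)
```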


As suspected, most of the people with missing age values (125 of 177) did not survive. I wonder how many of them had surviving family members. However, 52 of the survivors also have no recorded age, for some reason.

In [71]:
#create new column in df for Surname
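
The cell above contains only a comment in the source. One possible way to build such a column is sketched below, assuming every `Name` has the form "Surname, Title. Given names" as in the rows printed earlier. It uses the six names from that printout, not the full dataset, and a shared surname is only a rough hint of a family connection, not proof of one.

```python
import pandas as pd

# Names of the six passengers printed at the top of this notebook.
# The second name is truncated with "..." in that printout and is kept as shown.
sample_df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
        "Moran, Mr. James",
    ],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Assumption: the text before the first comma is the surname.
sample_df["Surname"] = sample_df["Name"].str.split(",").str[0].str.strip()

# Passengers ("count") and survivors ("sum") per surname.
by_surname = sample_df.groupby("Surname")["Survived"].agg(["count", "sum"])
print(by_surname)
```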


# Is age a factor in survival?

In [77]:
survival_groups['Age'].mean()

Survived
0    30.626179
1    28.343690
Name: Age, dtype: float64

In [19]:
survival_groups['Age'].std()

Survived
0    14.172110
1    14.950952
Name: Age, dtype: float64

In [20]:
survival_groups['Age'].min()

Survived
0    1.00
1    0.42
Name: Age, dtype: float64

In [21]:
survival_groups['Age'].max()

Survived
0    74.0
1    80.0
Name: Age, dtype: float64
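
The four cells above can also be collapsed into a single `agg` call. The sketch below shows the idea on the `Survived` and `Age` values of the six rows printed at the top of this notebook, so its numbers are not the ones reported above; `sample_df` and `age_stats` are illustrative names.

```python
import numpy as np
import pandas as pd

# Survived and Age for the six rows printed at the top of this notebook.
sample_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
})

# One call instead of four cells; NaN ages are skipped by each statistic.
age_stats = sample_df.groupby("Survived")["Age"].agg(
    ["count", "mean", "std", "min", "max"]
)
print(age_stats)
```

Including `count` makes it visible how many non-missing ages each statistic is based on.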

Finding: The mean, standard deviation, minimum, and maximum age of survivors and non-survivors do not differ much, as seen above.

Conclusion: Judging by these summary statistics alone, age is not a factor.

# Is gender a factor in survival?

In [23]:
survival_groups['Sex'].count()

Survived
0    549
1    342
Name: Sex, dtype: int64
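
Counting `Sex` within each `Survived` group only reproduces the group sizes (549 and 342), so it does not yet say anything about gender. One possible next step, sketched here on the six rows printed at the top of this notebook, is to group by `Sex` instead and compare survival rates. Six rows are far too few to draw a conclusion from, and the all-or-nothing split they happen to show is an artifact of the tiny sample.

```python
import pandas as pd

# Sex and Survived for the six rows printed at the top of this notebook.
sample_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Survived is coded 0/1, so its mean within a group is that group's survival rate.
rate_by_sex = sample_df.groupby("Sex")["Survived"].mean()

# Counts of non-survivors (0) and survivors (1) for each sex.
counts = pd.crosstab(sample_df["Sex"], sample_df["Survived"])

print(rate_by_sex)
print(counts)
```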