# Introduction
Purpose
To perform data analysis on a sample Titanic dataset.

# Questions

In [1]:
# What factors made people more likely to survive?

# Was socio-economic standing a factor in survival rate?
# Did age, regardless of sex, determine your chances of survival?
# Did women and children have priority for lifeboats (survival)?
# How did children with nannies fare in comparison to children with parents?
# Assumption: We are going to assume that everyone who survived made it to a lifeboat and that it wasn't by chance or luck.

# Data Wrangling
Data Description

survival Survival (0 = No; 1 = Yes)
pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)


sibsp Number of Siblings/Spouses Aboard,
parch Number of Parents/Children Aboard,
ticket Ticket Number,
fare Passenger Fare,

embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
SPECIAL NOTES:

Pclass is a proxy for socio-economic status (SES) 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower,

Age is in years; fractional if the age is less than one (1). If the age is estimated, it is in the form xx.5.

With respect to the family relation variables (i.e. sibsp and parch), some relations were ignored. The following are the definitions used for sibsp and parch.

Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic,
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored),
Parent: Mother or Father of Passenger Aboard Titanic,
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic,
Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them. As well, some travelled with very close friends or neighbors in a village, however, the definitions do not support such relations.

In [4]:
import pandas
import matplotlib.pyplot as plt
data = pandas.read_csv('titanic.csv')

data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [5]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [7]:
# Note: Some Age values are NaN. Ticket and Cabin values are alphanumeric, and Cabin also has missing (NaN) values. Not a big deal, but good to know. Based on the current questions, neither Ticket nor Cabin data is required.

# Additional potential questions from reading data and data description

# How did children with nannies fare in comparison to children with parents? Did the nanny "abandon" the child to save his/her own life?

# I would need additional information to determine if a child was indeed only on board with a nanny. For example, a child could be on board with an adult sibling. This would make Parch (parent) = 0 but it would be incorrect to say the child had a nanny.
# Need to review list for children with no siblings. These will be children with nannies; however, a child could have siblings and still have a nanny as well.
# Did cabin location play a part in the survival rate without the consideration of class

# No data on where the cabins are actually located on the Titanic
# External source of this data could probably be found
# Check for inconsistencies in dataset
# In column - 'Survived'

In [18]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [19]:
# Data Cleanup
# From the data description and the questions to answer, I've determined that some dataset columns will not play a part in my analysis, so these columns can be removed. This will declutter the dataset and also help with processing performance.

# * PassengerId
# * Name
# * Ticket
# * Cabin
# I'll take a 2 step approach to data cleanup

# 1. Remove unnecessary columns
# 2. Fix missing and data format issues
# Step 1 - Remove unnecessary columns
# Columns (PassengerId, Name, Ticket, Cabin) removed

In [20]:
titanic_data = data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
titanic_data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [21]:
# Step 2: Fix any missing or data format issues
titanic_data.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

In [22]:
missing_age_bool = pandas.isnull(titanic_data['Age'])
titanic_data[missing_age_bool].head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
5,0,3,male,,0,0,8.4583,Q
17,1,2,male,,0,0,13.0,S
19,1,3,female,,0,0,7.225,C
26,0,3,male,,0,0,7.225,C
28,1,3,female,,0,0,7.8792,Q


In [37]:
missing_age_male = titanic_data[missing_age_bool]['Sex']=='male'
missing_age_female = titanic_data[missing_age_bool]['Sex']=='female'

In [44]:
missing_age_male.sum()

124

In [45]:
missing_age_female.sum()

53

In [46]:
# Should keep note of the proportions across male and female...

# Age missing in male data: 124
# Age missing in female data: 53
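
The raw counts are easier to compare as a share of each group. The following is a minimal sketch rather than part of the original analysis: it reuses the missing-age counts above together with the per-sex passenger totals (577 males, 314 females) obtained by summing the per-class totals printed in the Question 1 output. The variable names are illustrative.

```python
# Counts of missing Age values, taken from the cells above
missing_age = {'male': 124, 'female': 53}
# Passengers per sex, summed over the three classes (122+108+347 and 94+76+144)
total_by_sex = {'male': 577, 'female': 314}

# Share of each sex whose Age is missing, as a percentage
missing_share = {
    sex: round(missing_age[sex] / total_by_sex[sex] * 100, 2)
    for sex in missing_age
}
print(missing_share)  # {'male': 21.49, 'female': 16.88}
```

On the full DataFrame, the same shares could probably be obtained with `titanic_data['Age'].isnull().groupby(titanic_data['Sex']).mean()`.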

In [47]:
# Data Exploration and Visualisation
# Looking at some descriptive statistics

In [48]:
data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [49]:
youngest_to_survive = titanic_data[titanic_data['Survived'] == 1]['Age'].min()
youngest_to_die = titanic_data[titanic_data['Survived'] == 0]['Age'].min()
oldest_to_survive = titanic_data[titanic_data['Survived'] == 1]['Age'].max()
oldest_to_die = titanic_data[titanic_data['Survived'] == 0]['Age'].max()

In [52]:
youngest_to_survive

0.42

In [53]:
youngest_to_die

1.0

In [54]:
oldest_to_survive

80.0

In [55]:
oldest_to_die

74.0

In [56]:
# Note: Interestingly, every infant under one year old with a recorded age survived (the youngest passenger who died was aged 1).

# Question 1 :
Was socio-economic standing a factor in survival rate?

Number of males and females survived in each class

In [57]:
group_by_class_survival = titanic_data.groupby(['Pclass', 'Survived', 'Sex']).size()
print(group_by_class_survival)

Pclass  Survived  Sex   
1       0         female      3
                  male       77
        1         female     91
                  male       45
2       0         female      6
                  male       91
        1         female     70
                  male       17
3       0         female     72
                  male      300
        1         female     72
                  male       47
dtype: int64


In [60]:
def survival(pclass, sex):
    group_by_class = titanic_data.groupby(['Pclass', 'Sex']).size()[pclass, sex].astype('float')
    group_by_class_survived = titanic_data.groupby(['Pclass', 'Survived', 'Sex']).size()[pclass, 1, sex].astype('float')
    
    print ('Total numbers of',sex,'of class',pclass,'-',group_by_class)
    print ('Total numbers of',sex,'of class',pclass,'who survived -',group_by_class_survived)
    survival_rate = ((group_by_class_survived/group_by_class)*100).round(2)
    return survival_rate
    
print ('Effect of social economy in survival rate : \n')
print ('Class 1 - Male survival rate :\n',survival(1, 'male'),'%\n')
print ('Class 1 - Female survival rate \n:',survival(1, 'female'),'%\n')
print ('-------------\n')
print ('Class 2 - Male survival rate :\n',survival(2, 'male'),'%\n')
print ('Class 2 - Female survival rate:\n',survival(2, 'female'),'%\n')
print ('-------------\n')
print ('Class 3 - Male survival rate :\n',survival(3, 'male'),'%\n')
print ('Class 3 - Female survival rate :\n',survival(3, 'female'),'%\n')

Effect of social economy in survival rate : 

Total numbers of male of class 1 - 122.0
Total numbers of male of class 1 who survived - 45.0
Class 1 - Male survival rate :
 36.89 %

Total numbers of female of class 1 - 94.0
Total numbers of female of class 1 who survived - 91.0
Class 1 - Female survival rate 
: 96.81 %

-------------

Total numbers of male of class 2 - 108.0
Total numbers of male of class 2 who survived - 17.0
Class 2 - Male survival rate :
 15.74 %

Total numbers of female of class 2 - 76.0
Total numbers of female of class 2 who survived - 70.0
Class 2 - Female survival rate:
 92.11 %

-------------

Total numbers of male of class 3 - 347.0
Total numbers of male of class 3 who survived - 47.0
Class 3 - Male survival rate :
 13.54 %

Total numbers of female of class 3 - 144.0
Total numbers of female of class 3 who survived - 72.0
Class 3 - Female survival rate :
 50.0 %
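
Because `Survived` is coded as 0/1, its mean within a group is that group's survival fraction, so all six rates can be produced by a single `groupby`. The sketch below illustrates the idea on a small made-up frame (`toy` is a stand-in for `titanic_data`; its rows are invented for illustration and are not real passengers).

```python
import pandas

# Toy frame standing in for titanic_data (values are illustrative, not real rows)
toy = pandas.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Sex':      ['male', 'male', 'female', 'female',
                 'male', 'male', 'female', 'female'],
    'Survived': [1, 0, 1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the survival fraction, so one groupby gives every rate
rates = (toy.groupby(['Pclass', 'Sex'])['Survived'].mean() * 100).round(2)

# One row per class, one column per sex
print(rates.unstack('Sex'))
```

Applied to `titanic_data`, the same expression should reproduce the percentages printed above without calling a helper once per class and sex.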



In [61]:
%matplotlib inline
titanic_data_survived = titanic_data
titanic_data_survived_grouped = titanic_data_survived.groupby(['Sex']).Survived.mean()*100
titanic_data_survived_grouped.plot(kind = 'bar')

<Axes: xlabel='Sex'>

In [62]:
titanic_data_survived = titanic_data
titanic_data_survived_grouped = titanic_data_survived.groupby(['Sex', 'Pclass']).Survived.mean()*100
titanic_data_survived_grouped.plot(kind = 'bar')

<Axes: xlabel='Sex,Pclass'>


# Question 2
Did age, regardless of sex and class, determine your chances of survival?

In [66]:
# Build each mask once and combine masks with &, which avoids chained boolean indexing
# Note: the strict comparisons leave passengers aged exactly 18 or exactly 50 out of every group
survived = titanic_data['Survived']==1
below_18 = titanic_data['Age']<18
between_18_and_50 = (titanic_data['Age']>18) & (titanic_data['Age']<50)
above_50 = titanic_data['Age']>50

age_below_18 = int(below_18.sum())
print ('Total number of passengers below 18 :',age_below_18)
age_below_18_survived = int((below_18 & survived).sum())
print ('Total number of passengers below 18 who survived :',age_below_18_survived)
print ('\n')

age_below_50 = int(between_18_and_50.sum())
print ('Total number of passengers below 50 :',age_below_50)
age_below_50_survived = int((between_18_and_50 & survived).sum())
print ('Total number of passengers below 50 who survived :',age_below_50_survived)
print ('\n')

age_above_50 = int(above_50.sum())
print ('Total number of passengers above 50 :',age_above_50)
age_above_50_survived = int((above_50 & survived).sum())
print ('Total number of passengers above 50 who survived :',age_above_50_survived)
print ('\n')

print ('Below 18 survival rate :',((float(age_below_18_survived)/age_below_18)*100))
print ('Between 18 and 50 survival rate :',((float(age_below_50_survived)/age_below_50)*100))
print ('Above 50 survival rate :',((float(age_above_50_survived)/age_above_50)*100))

Total number of passengers below 18 : 113
Total number of passengers below 18 who survived : 61


Total number of passengers below 50 : 501
Total number of passengers below 50 who survived : 193


Total number of passengers above 50 : 64
Total number of passengers above 50 who survived : 22


Below 18 survival rate : 53.98230088495575
Between 18 and 50 survival rate : 38.522954091816366
Above 50 survival rate : 34.375
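
An alternative way to form the age groups is `pandas.cut`, which assigns every age to exactly one bin and so cannot silently drop boundary ages such as 18 or 50. This is a hedged sketch on made-up values, not the method used above; the bin edges and labels are assumptions chosen to mirror the three groups in this question.

```python
import pandas

# Illustrative ages (including the boundary values 18 and 50) and outcomes
ages = pandas.Series([0.42, 17.0, 18.0, 35.0, 50.0, 80.0, None])
survived = pandas.Series([1, 1, 1, 0, 0, 0, 1])

# right=False makes each bin [lower, upper), so 18 and 50 each land in a bin
groups = pandas.cut(ages, bins=[0, 18, 50, 120], right=False,
                    labels=['Below 18', '18 to 49', '50 and above'])

# Passengers with a missing age get no bin and are left out of the groupby
rate_by_group = survived.groupby(groups, observed=False).mean() * 100
print(rate_by_group)
```

Using bins this way would change the group sizes slightly compared with the counts above, since passengers aged exactly 18 or 50 would then be included.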




In [67]:
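# dropna() removes every row with a missing value (Age or Embarked);
# .copy() gives an independent DataFrame so later column assignments do not raise SettingWithCopyWarning
cleaned_age_data = titanic_data.dropna().copy()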
total_survivors = cleaned_age_data[cleaned_age_data['Survived']==1]['Age'].count()
total_non_survivors = cleaned_age_data[cleaned_age_data['Survived']==0]['Age'].count()
total_survivors_mean = cleaned_age_data[cleaned_age_data['Survived']==1]['Age'].mean()
total_non_survivors_mean = cleaned_age_data[cleaned_age_data['Survived']==0]['Age'].mean()

print ('Total Survivors : ',total_survivors)
print ('Total Non-Survivors :',total_non_survivors)
print ('Total Survivors Mean Age:',total_survivors_mean)
print ('Total Non-Survivors Mean Age:',total_non_survivors_mean)

Total Survivors :  288
Total Non-Survivors : 424
Total Survivors Mean Age: 28.19329861111111
Total Non-Survivors Mean Age: 30.62617924528302


In [70]:
# Use >= on the lower bounds so that ages of exactly 18 and exactly 50 receive a category
cleaned_age_data.loc[(cleaned_age_data['Age']<18),'Age_Category'] = 'Young Aged'
cleaned_age_data.loc[(cleaned_age_data['Age']>=18) & (cleaned_age_data['Age']<50),'Age_Category'] = 'Middle Aged'
cleaned_age_data.loc[(cleaned_age_data['Age']>=50),'Age_Category'] = 'Old Aged'

In [71]:
cleaned_age_data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Age_Category
0,0,3,male,22.0,1,0,7.25,S,Middle Aged
1,1,1,female,38.0,1,0,71.2833,C,Middle Aged
2,1,3,female,26.0,0,0,7.925,S,Middle Aged
3,1,1,female,35.0,1,0,53.1,S,Middle Aged
4,0,3,male,35.0,0,0,8.05,S,Middle Aged


In [72]:
titanic_data_grouped_by_age_category = cleaned_age_data
titanic_data_survival_by_age = (titanic_data_grouped_by_age_category.groupby(['Age_Category']).Survived.mean()*100).sort_values()
titanic_data_survival_by_age.plot(kind = 'bar')

<Axes: xlabel='Age_Category'>

# Question 3
Did women and children have preference to lifeboats and therefore survival (assuming there was no shortage of lifeboats)?

Assumption: With "child" not classified in the data, I'll need to assume a cutoff point. Therefore, I'll be using today's standard of under 18 as those to be considered as a child vs adult.

In [74]:
cleaned_age_data.loc[((cleaned_age_data['Sex']=='female') & (cleaned_age_data['Age']>=18)), 'Category'] = 'Woman'
cleaned_age_data.loc[(cleaned_age_data['Sex']=='male') & (cleaned_age_data['Age']>=18), 'Category'] = 'Man'
cleaned_age_data.loc[(cleaned_age_data['Age'] < 18),'Category'] = 'Child'

In [75]:
cleaned_age_data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Age_Category,Category
0,0,3,male,22.0,1,0,7.25,S,Middle Aged,Man
1,1,1,female,38.0,1,0,71.2833,C,Middle Aged,Woman
2,1,3,female,26.0,0,0,7.925,S,Middle Aged,Woman
3,1,1,female,35.0,1,0,53.1,S,Middle Aged,Woman
4,0,3,male,35.0,0,0,8.05,S,Middle Aged,Man


In [78]:
print (cleaned_age_data.groupby(['Category', 'Survived']).size())

Category  Survived
Child     0            52
          1            61
Man       0           325
          1            70
Woman     0            47
          1           157
dtype: int64


#  Question 4
How did children with nannies fare in comparison to children with parents? Did the nanny "abandon" children to save his/her own life?

Review the list of children with no parents aboard (Parch = 0). Following the data description, these are treated as children travelling with nannies.
Compare the survival rate of children with parents against that of children with nannies.
Assumptions:

If you're classified as a 'Child' (under 18) and have Parch > 0, then the value is taken to refer to your parents, even though it is possible to be under 18 and also have children.

Classifying everyone under 18 years old as a 'Child' applies today's standards to the early 1900s.

In [84]:
# Combine both conditions in a single mask instead of chaining boolean indexers
is_child = cleaned_age_data['Category']=='Child'
children_with_nanny = cleaned_age_data[is_child & (cleaned_age_data['Parch']==0)]
children_with_parents = cleaned_age_data[is_child & (cleaned_age_data['Parch'] > 0)]

print ('Number of children with nanny:',children_with_nanny['Survived'].count())
print ('Number of children with nanny who survived:',children_with_nanny[children_with_nanny['Survived']==1]['Survived'].count())

print ('Number of children with parents:',children_with_parents['Survived'].count())
print ('Number of children with parents who survived:',children_with_parents[children_with_parents['Survived']==1]['Survived'].count())

Number of children with nanny: 32
Number of children with nanny who survived: 16
Number of children with parents: 81
Number of children with parents who survived: 45

In [86]:
# Print both values explicitly; a notebook only displays the last bare expression in a cell
nanny_survivors = children_with_nanny[children_with_nanny['Survived']==1]
print ('Percentage of children who survived with nanny:',(float(len(nanny_survivors))/len(children_with_nanny))*100)
print ('Mean age of children who survived with nanny:',nanny_survivors['Age'].mean())

Percentage of children who survived with nanny: 50.0
Mean age of children who survived with nanny: 14.6875

In [87]:
# Print both values explicitly; a notebook only displays the last bare expression in a cell
parent_survivors = children_with_parents[children_with_parents['Survived']==1]
print ('Percentage of children who survived with parents:',round((float(len(parent_survivors))/len(children_with_parents))*100, 2))
print ('Mean age of children who survived with parents:',parent_survivors['Age'].mean())

Percentage of children who survived with parents: 55.56
Mean age of children who survived with parents: 5.470444444444444

In [88]:
cleaned_age_data.loc[((cleaned_age_data['Parch']==0) & (cleaned_age_data['Category']=='Child')), 'nanny_parents'] = 'With_nanny'
cleaned_age_data.loc[((cleaned_age_data['Parch']>0) & (cleaned_age_data['Category']=='Child')),'nanny_parents']='Without_nanny'

cleaned_age_data.head()



Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Age_Category,Category,nanny_parents
0,0,3,male,22.0,1,0,7.25,S,Middle Aged,Man,
1,1,1,female,38.0,1,0,71.2833,C,Middle Aged,Woman,
2,1,3,female,26.0,0,0,7.925,S,Middle Aged,Woman,
3,1,1,female,35.0,1,0,53.1,S,Middle Aged,Woman,
4,0,3,male,35.0,0,0,8.05,S,Middle Aged,Man,


### Based on the data analysis above, the survival rate for children accompanied by parents was slightly higher than for children accompanied by nannies. The difference could be related to age: the surviving children with parents were much younger on average, less than half the age of the surviving children with nannies.

Percentage of children with nannies who survived: 50.0%, 
Percentage of children with parents who survived: 55.56%, 
Average age of surviving children with nannies: 14.69,  
Average age of surviving children with parents: 5.47
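
Since the gap between the two groups is small and the nanny group has only 32 children, it is worth checking whether the difference could be down to chance. The sketch below applies a two-proportion z-test to the counts reported above. This test is an addition for illustration, not part of the original analysis, and its normal approximation is only rough for groups this small.

```python
import math

# Counts from the cells above: survivors and group sizes
nanny_survived, nanny_total = 16, 32
parents_survived, parents_total = 45, 81

p_nanny = nanny_survived / nanny_total
p_parents = parents_survived / parents_total

# Pooled proportion under the hypothesis that both groups share one survival rate
pooled = (nanny_survived + parents_survived) / (nanny_total + parents_total)
std_err = math.sqrt(pooled * (1 - pooled) * (1 / nanny_total + 1 / parents_total))

z = (p_parents - p_nanny) / std_err
# Two-sided p-value from the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))
print(round(z, 2), round(p_value, 2))
```

The z statistic comes out well below the conventional 1.96 threshold, which is consistent with treating the difference between the two groups as tentative.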

### Conclusion
The results of the analysis, although tentative, appear to indicate that class and sex mattered most: being a female with upper socio-economic standing (first class) gave one the best chance of survival when the tragedy occurred on the Titanic, while being a man in third class gave one the lowest. Age did not seem to be a major factor. Women and children, across all classes, tended to have a higher survival rate than men, but by no means did being a child or woman guarantee survival. Overall, children had a survival rate of at least 50%, whether accompanied by parents or by nannies.