In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### 1. Titanic data

In [6]:
titanic = pd.read_csv("./Downloads/titanic.csv.bz2")
titanic.head(7)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
5,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55,E12,S,3.0,,"New York, NY"
6,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S,10.0,,"Hudson, NY"


#### Some sanity checks

In [7]:
titanic.describe()

Unnamed: 0,pclass,survived,age,sibsp,parch,fare,body
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0,121.0
mean,2.294882,0.381971,29.881135,0.498854,0.385027,33.295479,160.809917
std,0.837836,0.486055,14.4135,1.041658,0.86556,51.758668,97.696922
min,1.0,0.0,0.1667,0.0,0.0,0.0,1.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958,72.0
50%,3.0,0.0,28.0,0.0,0.0,14.4542,155.0
75%,3.0,1.0,39.0,1.0,0.0,31.275,256.0
max,3.0,1.0,80.0,8.0,9.0,512.3292,328.0


In [8]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
pclass       1309 non-null int64
survived     1309 non-null int64
name         1309 non-null object
sex          1309 non-null object
age          1046 non-null float64
sibsp        1309 non-null int64
parch        1309 non-null int64
ticket       1309 non-null object
fare         1308 non-null float64
cabin        295 non-null object
embarked     1307 non-null object
boat         486 non-null object
body         121 non-null float64
home.dest    745 non-null object
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB


In [9]:
titanic.isna().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

The variables with a considerable number of null values are not that relevant to our model, and some of them need care when interpreting, such as the fractional age (0.9167, roughly 11 months) of "Allison, Master. Hudson Trevor" at index 1. Age could support an interesting analysis, for example by dividing people into age groups (children, teenagers, adults, elders), but with 263 of 1309 ages missing (about 20%) the results may not be fully trustworthy. However, because women and children were given priority during the evacuation, this variable can still be useful.

Other categories such as cabin, boat, and body do not seem to be useful as predictors of one's fate in this event that shocked the world: cabin is missing for most passengers, while boat and body were recorded after the fact (the lifeboat a passenger was in, the number given to a recovered body), so they describe the outcome rather than help predict it. Home destination could be useful, as people from English-speaking countries could more easily understand the commands and explanations given by the crew. However, this column is missing for 564 of 1309 passengers (about 43%), so it might not be very useful anyway.
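To make the "how much is missing" argument concrete, the counts printed by `titanic.isna().sum()` above can be turned into shares of the 1309 rows. This is only a small sketch: the dictionary is typed in by hand from that output, and the names `missing_counts` and `missing_share` are made up for the illustration.

```python
# Missing-value counts copied by hand from the titanic.isna().sum() output.
N_ROWS = 1309
missing_counts = {
    "age": 263,
    "fare": 1,
    "cabin": 1014,
    "embarked": 2,
    "boat": 823,
    "body": 1188,
    "home.dest": 564,
}

# Share of rows with a missing value, per column.
missing_share = {col: n / N_ROWS for col, n in missing_counts.items()}

# Print the columns from most to least incomplete.
for col, share in sorted(missing_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<10} {share:6.1%}")
```

On the data frame itself, `titanic.isna().mean()` should give the same shares directly.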

### 2. Logistic Regression

#### 2.1 - Variables to be used

"According to the habits of the time, women and children were first to get to boats (...) meant 1st and 2nd class women and children as for the third class it was much harder to reach up to the boat deck". This suggests that `pclass`, `sex` and, to some extent, `age` are very important variables.

Another quote from the text is "... many of them did not understand explanations given in English anyway". This could indicate that people from the US and other countries where English is a first language, or a required one in school, could better understand the commands given. However, as the text itself points out later, we cannot take this information for granted, especially given the unimaginable nature of this event.

### 2.2 - Models

#### pclass

In [10]:
import statsmodels.formula.api as smf
m = smf.logit(formula='survived ~ pclass', data=titanic).fit()

Optimization terminated successfully.
         Current function value: 0.616220
         Iterations 5


In [12]:
m.summary()

0,1,2,3
Dep. Variable:,survived,No. Observations:,1309.0
Model:,Logit,Df Residuals:,1307.0
Method:,MLE,Df Model:,1.0
Date:,"Mon, 24 Feb 2020",Pseudo R-squ.:,0.07338
Time:,14:13:53,Log-Likelihood:,-806.63
converged:,True,LL-Null:,-870.51
Covariance Type:,nonrobust,LLR p-value:,1.266e-29

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,1.2680,0.168,7.551,0.000,0.939,1.597
pclass,-0.7790,0.071,-10.978,0.000,-0.918,-0.640


In [13]:
m.get_margeff().summary()

0,1
Dep. Variable:,survived
Method:,dydx
At:,overall

Unnamed: 0,dy/dx,std err,z,P>|z|,[0.025,0.975]
pclass,-0.1659,0.012,-13.553,0.0,-0.19,-0.142


In [58]:
round(np.exp(-0.779), 2)

0.46

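The negative values for both the coefficient and the dy/dx value of the `pclass` variable indicate that **`pclass` is negatively associated with `survived`**. That is, a higher `pclass` number goes with a lower chance of survival, so people in class 1 have a higher chance of surviving than those in class 2, who have a higher chance than those in class 3. Note that exp(-0.779) ≈ 0.46 is an odds ratio: moving one class down (for example from 1st to 2nd) multiplies the odds of survival by about 0.46, not the survival rate itself. On the probability scale, the average marginal effect says that each step down in class lowers the survival probability by about 16.6 percentage points. The model treats `pclass` as a number, so it assumes the same step between 1st and 2nd class as between 2nd and 3rd.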
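Because exp(-0.779) is a ratio of odds rather than of probabilities, it can help to convert the fitted coefficients back into predicted survival probabilities per class. The sketch below uses only the two coefficients from the summary table (1.2680 and -0.7790); the helper `logistic` and the variable names are made up for the illustration.

```python
import math

def logistic(x):
    """Inverse of the logit: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Coefficients copied from the 'survived ~ pclass' summary table.
INTERCEPT = 1.2680
B_PCLASS = -0.7790

# Predicted survival probability and odds for each class.
prob = {c: logistic(INTERCEPT + B_PCLASS * c) for c in (1, 2, 3)}
odds = {c: p / (1.0 - p) for c, p in prob.items()}

for c in (1, 2, 3):
    print(f"pclass={c}: probability={prob[c]:.3f}, odds={odds[c]:.3f}")

# The odds ratio between neighbouring classes is exp(coefficient) ...
odds_ratio_2_vs_1 = odds[2] / odds[1]
# ... while the ratio of probabilities is a different, larger number.
prob_ratio_2_vs_1 = prob[2] / prob[1]
print(f"odds ratio: {odds_ratio_2_vs_1:.2f}, probability ratio: {prob_ratio_2_vs_1:.2f}")
```

The odds ratio is the same for every pair of neighbouring classes, while the probability ratio depends on which pair is compared.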

#### sex

In [60]:
m = smf.logit(formula='survived ~ sex', data=titanic).fit()

Optimization terminated successfully.
         Current function value: 0.522576
         Iterations 5


In [61]:
m.summary()

0,1,2,3
Dep. Variable:,survived,No. Observations:,1309.0
Model:,Logit,Df Residuals:,1307.0
Method:,MLE,Df Model:,1.0
Date:,"Mon, 24 Feb 2020",Pseudo R-squ.:,0.2142
Time:,14:39:01,Log-Likelihood:,-684.05
converged:,True,LL-Null:,-870.51
Covariance Type:,nonrobust,LLR p-value:,4.326e-83

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.9818,0.104,9.437,0.000,0.778,1.186
sex[T.male],-2.4254,0.136,-17.832,0.000,-2.692,-2.159


In [65]:
m.get_margeff().summary()

0,1
Dep. Variable:,survived
Method:,dydx
At:,overall

Unnamed: 0,dy/dx,std err,z,P>|z|,[0.025,0.975]
sex[T.male],-0.4125,0.01,-42.239,0.0,-0.432,-0.393


In [66]:
round(np.exp(-2.4254), 2)

0.09

The negative values for both the coefficient and the dy/dx value of the `sex` variable (in this case, `sex[T.male]`) indicate that **being male is negatively associated with `survived`**. That is, being a man on the Titanic goes with a lower chance of survival. Since exp(-2.4254) ≈ 0.09 is an odds ratio, the odds of survival for men were about 0.09 times those for women, i.e. far lower. This is a ratio of odds, not of survival rates.
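The three numbers reported for `sex` (coefficient, dy/dx and exp(coefficient)) live on different scales. As a rough sketch, the two fitted coefficients can be converted into predicted probabilities for women and men, which makes it possible to compare the odds ratio with the ratio and the difference of probabilities. Only the coefficients come from the summary table; the function and variable names are invented for the example.

```python
import math

def logistic(x):
    """Inverse of the logit: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Coefficients copied from the 'survived ~ sex' summary table.
INTERCEPT = 0.9818
B_MALE = -2.4254

p_female = logistic(INTERCEPT)           # reference category
p_male = logistic(INTERCEPT + B_MALE)

odds_ratio = (p_male / (1 - p_male)) / (p_female / (1 - p_female))
prob_ratio = p_male / p_female
prob_difference = p_male - p_female

print(f"P(survived | female) = {p_female:.3f}")
print(f"P(survived | male)   = {p_male:.3f}")
print(f"odds ratio           = {odds_ratio:.3f}")
print(f"probability ratio    = {prob_ratio:.3f}")
print(f"probability gap      = {prob_difference:.3f}")
```

The probability gap computed this way is a discrete change, so it does not have to match the dy/dx value, which is an averaged derivative. `get_margeff` also accepts `dummy=True`, which reports the discrete change for 0/1 variables instead.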

#### age

In [37]:
# Let's create age groups (Source: https://stackoverflow.com/questions/49382207/how-to-map-numeric-data-into-categories-bins-in-pandas-dataframe)
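# Bin edges. pd.cut builds right-closed intervals: (10, 18], (18, 35], (35, 65], (65, inf].
# Ages of 10 or below fall outside these bins and become NaN, so those rows are dropped by the models below.
bins = [10, 18, 35, 65, np.inf]

# Adding a new column to the data frame.
# Text labels would have to be passed by keyword (labels=...) and match the number of intervals (four here);
# the groups are left as plain intervals, which is what the outputs below show.
titanic['ageGroup'] = pd.cut(titanic.age, bins)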

In [39]:
titanic.head(7)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,ageGroup
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO","(18.0, 35.0]"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON",
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON","(18.0, 35.0]"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON","(18.0, 35.0]"
5,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55,E12,S,3.0,,"New York, NY","(35.0, 65.0]"
6,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S,10.0,,"Hudson, NY","(35.0, 65.0]"
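In the table above, the two youngest passengers (ages 0.9167 and 2.0) have an empty `ageGroup`, because the lowest bin edge is 10. A possible variant that keeps young children and uses readable labels is sketched below on a handful of toy ages. The bin edges, the label names and the choice of left-closed intervals (`right=False`) are assumptions made for this illustration, so the groups differ slightly from the right-closed intervals used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy ages for illustration only (not the Titanic column).
ages = pd.Series([0.9167, 2.0, 10.0, 17.0, 18.0, 29.0, 35.0, 48.0, 65.0, 80.0, np.nan])

# Six edges define five intervals, one per label.
bins = [0, 10, 18, 35, 65, np.inf]
labels = ['<10', '10-17', '18-34', '35-64', '65+']

# right=False makes the intervals left-closed: [0, 10), [10, 18), ...
# labels must be passed by keyword; the third positional argument of pd.cut is `right`.
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({'age': ages, 'ageGroup': age_group}))
```

Missing ages stay missing after binning, so rows without an age would still be dropped by the model.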


In [47]:
m = smf.logit(formula='survived ~ ageGroup', data=titanic).fit()

Optimization terminated successfully.
         Current function value: 0.668779
         Iterations 5


In [48]:
m.summary()

0,1,2,3
Dep. Variable:,survived,No. Observations:,960.0
Model:,Logit,Df Residuals:,956.0
Method:,MLE,Df Model:,3.0
Date:,"Mon, 24 Feb 2020",Pseudo R-squ.:,0.001739
Time:,14:35:23,Log-Likelihood:,-642.03
converged:,True,LL-Null:,-643.15
Covariance Type:,nonrobust,LLR p-value:,0.5248

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-0.3205,0.196,-1.636,0.102,-0.704,0.063
"ageGroup[T.Interval(18.0, 35.0, closed='right')]",-0.1434,0.215,-0.667,0.505,-0.565,0.278
"ageGroup[T.Interval(35.0, 65.0, closed='right')]",-0.0823,0.227,-0.362,0.717,-0.528,0.363
"ageGroup[T.Interval(65.0, inf, closed='right')]",-1.0658,0.814,-1.309,0.191,-2.662,0.530


In [49]:
m.get_margeff().summary()

0,1
Dep. Variable:,survived
Method:,dydx
At:,overall

Unnamed: 0,dy/dx,std err,z,P>|z|,[0.025,0.975]
"ageGroup[T.Interval(18.0, 35.0, closed='right')]",-0.0341,0.051,-0.667,0.505,-0.134,0.066
"ageGroup[T.Interval(35.0, 65.0, closed='right')]",-0.0196,0.054,-0.362,0.717,-0.126,0.086
"ageGroup[T.Interval(65.0, inf, closed='right')]",-0.2536,0.193,-1.312,0.189,-0.632,0.125


In [67]:
# 18-35
round(np.exp(-0.1434), 2)

0.87

In [73]:
# 36-65
round(np.exp(-0.0823), 2)

0.92

In [70]:
# 66-inf
round(np.exp(-1.0658), 2)

0.34

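All three coefficients are measured against the reference group, the first interval (ages above 10 and up to 18), and all three are negative: the odds of survival are about 0.87 times those of the reference group for ages 18-35, 0.92 times for ages 35-65, and 0.34 times for passengers above 65. So elderly people appear to have had much lower odds of survival than young adults, who in turn had slightly lower odds than older adults. However, none of these differences is statistically significant (all p-values are above 0.18, and the LLR p-value of the whole model is 0.52), so the age groups alone tell us very little about survival.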

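Passengers aged 10 or younger fall outside the bins defined above, so they are missing from this model: only 960 of the 1046 passengers with a known age were used. It is therefore hard to reach a conclusion about young children from it.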
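The `Pseudo R-squ.` and `LLR p-value` entries of the age-group summary can be reproduced from the two log-likelihoods it reports, which shows where the "whole model" p-value comes from. The sketch below is a hand calculation: the inputs are the rounded values from the table, and `chi2_sf_df3` is a helper written for this example using the closed form of the chi-square upper tail for exactly 3 degrees of freedom (the `Df Model` of this fit).

```python
import math

# Values copied (rounded) from the 'survived ~ ageGroup' summary table.
LL_MODEL = -642.03   # Log-Likelihood
LL_NULL = -643.15    # LL-Null (intercept-only model)

# Likelihood-ratio statistic and McFadden's pseudo R-squared.
lr_stat = 2.0 * (LL_MODEL - LL_NULL)
pseudo_r2 = 1.0 - LL_MODEL / LL_NULL

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

p_value = chi2_sf_df3(lr_stat)

print(f"LR statistic = {lr_stat:.2f}")
print(f"pseudo R^2   = {pseudo_r2:.6f}")
print(f"LLR p-value  = {p_value:.4f}")
```

Small differences from the table in the last digits are expected, since the log-likelihoods used here are rounded to two decimals.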

#### 2.3 - All together

In [71]:
m = smf.logit(formula='survived ~ pclass + sex + ageGroup', data=titanic).fit()

Optimization terminated successfully.
         Current function value: 0.445014
         Iterations 6


In [72]:
m.summary()

0,1,2,3
Dep. Variable:,survived,No. Observations:,960.0
Model:,Logit,Df Residuals:,954.0
Method:,MLE,Df Model:,5.0
Date:,"Mon, 24 Feb 2020",Pseudo R-squ.:,0.3357
Time:,14:45:38,Log-Likelihood:,-427.21
converged:,True,LL-Null:,-643.15
Covariance Type:,nonrobust,LLR p-value:,4.002e-91

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,3.6286,0.408,8.899,0.000,2.829,4.428
sex[T.male],-2.7936,0.183,-15.302,0.000,-3.151,-2.436
"ageGroup[T.Interval(18.0, 35.0, closed='right')]",0.0997,0.284,0.352,0.725,-0.456,0.656
"ageGroup[T.Interval(35.0, 65.0, closed='right')]",-0.5627,0.313,-1.799,0.072,-1.176,0.050
"ageGroup[T.Interval(65.0, inf, closed='right')]",-1.1793,0.978,-1.205,0.228,-3.097,0.738
pclass,-1.0602,0.115,-9.184,0.000,-1.286,-0.834


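Putting all these variables together, the passenger's class and sex remain the important attributes: both coefficients are still strongly negative and highly significant, which agrees with the separate analyses above. The age groups add little. The 18 to 35 group has a slightly positive coefficient (0.0997), but it is far from significant (p = 0.725); only the 35 to 65 group comes close to significance (p = 0.072), pointing to lower survival odds than the reference group (ages above 10 and up to 18).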
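One way to read the combined model is to plug a few passenger profiles into the fitted equation and look at the predicted survival probabilities. The sketch below does this by hand with the coefficients from the summary table. The dictionary keys and the function `survival_probability` are names invented for the example, and `age_group=None` stands for the reference interval (10, 18].

```python
import math

# Coefficients copied from the 'survived ~ pclass + sex + ageGroup' summary table.
COEF = {
    "Intercept": 3.6286,
    "male": -2.7936,
    "age_18_35": 0.0997,
    "age_35_65": -0.5627,
    "age_65_plus": -1.1793,
    "pclass": -1.0602,
}

def survival_probability(pclass, male, age_group=None):
    """Predicted survival probability for one profile.

    age_group is None for the reference interval (10, 18], otherwise one of
    'age_18_35', 'age_35_65', 'age_65_plus'.
    """
    log_odds = COEF["Intercept"] + COEF["pclass"] * pclass
    if male:
        log_odds += COEF["male"]
    if age_group is not None:
        log_odds += COEF[age_group]
    return 1.0 / (1.0 + math.exp(-log_odds))

profiles = [
    ("1st class woman, reference age group", 1, False, None),
    ("3rd class woman, 18-35", 3, False, "age_18_35"),
    ("1st class man, 35-65", 1, True, "age_35_65"),
    ("3rd class man, 18-35", 3, True, "age_18_35"),
]
for name, pclass, male, group in profiles:
    print(f"{name:<40} {survival_probability(pclass, male, group):.3f}")
```

These are point predictions only; the wide confidence intervals on the age coefficients mean the age part of each prediction is quite uncertain.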

### 3 - Historical Conclusions

From what I analyzed and also studied about the topic, I believe that, in the first minutes of evacuation, women and children were indeed given priority. This agrees with our logistic regression as far as women are concerned: they had a much higher survival rate than men. For children the models above cannot confirm it, because passengers aged 10 or younger were left out of the age groups. Besides, the more one had paid for their ticket (a higher class, i.e. a lower `pclass` number), the higher their survival rate was. Maybe access to the boats was easier for them, or they were given preference due to their financial status and importance.

However, I think that as time passed, people started to realize the critical, urgent, desperate situation they found themselves in and abandoned reason and social norms in favor of trying to save their lives. "Chaos was the law of nature", and the worst happened to many people that day.