## In-depth EDA on the Titanic dataset
This notebook describes an exploratory data analysis (EDA) of the Titanic dataset, with appropriate visualisations.
Pclass is the passenger class (1st, 2nd, 3rd) and serves as a proxy for socio-economic class. Age is in years, with fractional values for some infants. Fare is in pre-1970 British pounds.

In [11]:
import pandas as pd

What is the most important factor in determining survival of the Titanic incident?

Sex. Being male has the strongest correlation with survival of any variable examined (-0.543), and the correlation is negative.

In [12]:
# Load the data into a pandas DataFrame
df = pd.read_csv('Titanic.csv')

# Calculate the correlation between each numeric variable and survival
# (numeric_only=True skips text columns such as Name and Ticket)
correlation = df.corr(numeric_only=True)['Survived']

# Sort the correlations in descending order
correlation = correlation.sort_values(ascending=False)

# Print the sorted correlations
print(correlation)

# Encode Sex as a binary indicator: 1 = male, 0 = female
df['Sex'] = (df['Sex'] == 'male').astype(int)

# Calculate the correlation between Sex and survival rate
Sex_correlation = df['Sex'].corr(df['Survived'])

# Print the correlation
print('Correlation between sex (male) and survival rate:', Sex_correlation)


Survived       1.000000
Fare           0.257307
Parch          0.081629
PassengerId   -0.005007
SibSp         -0.035322
Age           -0.077221
Pclass        -0.338481
Name: Survived, dtype: float64
Correlation between sex (male) and survival rate: -0.5433513806577546



In the movie, the upper-class passengers were given preference on lifeboats. Does this show in the data?

Very likely, but the data cannot confirm it directly. Survival was almost impossible without a place in a lifeboat, and after sex, passenger class has the strongest correlation with survival (-0.338; class 1 is the highest class, so the negative sign means higher-class passengers survived more often). However, the dataset does not record who boarded a lifeboat or how they came to be there. It is unlikely but possible that the crew gave no preference to upper-class passengers, and that those passengers reached the lifeboats more often by other means.
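
A correlation coefficient hides how large the gap between classes actually is, so a per-class survival rate can be easier to read. The sketch below is one way to compute it, assuming the Pclass and Survived columns used above; the helper name `survival_rate_by` and the toy rows are made up for illustration and are not taken from Titanic.csv.

```python
import pandas as pd


def survival_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Return the mean of the 0/1 Survived column for each value of `column`."""
    return df.groupby(column)["Survived"].mean()


# Toy rows for illustration only; these are NOT taken from Titanic.csv.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

rates = survival_rate_by(toy, "Pclass")
print(rates)
```

On the real data, the same call would be `survival_rate_by(df, 'Pclass')`.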

“Women and children first”. Was this the case?

Children appear to have been given priority. The correlation between age and survival is negative (-0.077221), so survival rate falls slightly as age rises, which is consistent with younger passengers being offered lifeboat places first. The correlation is weak, though, and it does not separate children from, for example, young adults who survived better because they were healthier. Testing that requires an age cut-off. The Criminal Law Amendment Act 1885 raised the age of consent from 13 to 16, and the Titanic sank in 1912, so I treat anyone under 16 as a child. An under-16 indicator has a positive correlation with survival of 0.136107, which suggests that passengers under 16 survived at a higher rate. Note, however, that of 892 entries, 176 are missing an age, and the indicator as coded counts those passengers as not under 16.

Women also appear to have been given priority, since the correlation between being male and surviving is strongly negative (-0.543).

In [13]:
# Create a binary indicator for passengers under 16
# (a missing Age compares as False, so it is coded 0)
df['age_group'] = (df['Age'] < 16).astype(int)

# Calculate the correlation between age group and survival rate
correlation = df['age_group'].corr(df['Survived'])

# Print the correlation
print('Correlation between age group and survival rate:', correlation)


Correlation between age group and survival rate: 0.13610698067319452
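
Because a missing age compares as False against the cut-off, the indicator above places passengers with no recorded age in the "not under 16" group. A possible refinement, sketched here under the assumption that dropping those rows is acceptable, is to compare children and adults only where the age is known. The helper name `child_survival_summary` and the toy rows are hypothetical and are not taken from Titanic.csv.

```python
import numpy as np
import pandas as pd


def child_survival_summary(df: pd.DataFrame, cutoff: float = 16) -> pd.DataFrame:
    """Survival rate and head count for children and adults with a known Age."""
    known = df.dropna(subset=["Age"]).copy()  # exclude rows with no recorded age
    known["group"] = np.where(known["Age"] < cutoff, "child", "adult")
    return known.groupby("group")["Survived"].agg(["mean", "count"])


# Toy rows for illustration only; these are NOT taken from Titanic.csv.
toy = pd.DataFrame({
    "Age":      [4, 10, 30, 45, np.nan, np.nan, 22, 15],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0],
})

n_missing = int(toy["Age"].isna().sum())  # rows the summary leaves out
summary = child_survival_summary(toy)
print("Rows with missing Age:", n_missing)
print(summary)
```

Reporting the count alongside the rate shows how many passengers each figure rests on, which matters when one group is much smaller than the other.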


Add one other observation that you have noted in the dataset.

People who were in the same passenger class did not pay the same fare.
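
One way to check this observation is to summarise the spread of Fare within each class. The sketch below assumes the Pclass and Fare columns described at the top of the notebook; the helper name `fare_spread_by_class` and the toy fares are invented for illustration and are not values from Titanic.csv.

```python
import pandas as pd


def fare_spread_by_class(df: pd.DataFrame) -> pd.DataFrame:
    """Minimum, median, maximum and number of distinct fares per passenger class."""
    return df.groupby("Pclass")["Fare"].agg(["min", "median", "max", "nunique"])


# Toy rows for illustration only; these are NOT taken from Titanic.csv.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Fare":   [30.0, 80.0, 150.0, 12.0, 20.0, 7.0, 8.0, 7.0],
})

spread = fare_spread_by_class(toy)
print(spread)
```

A class with more than one distinct fare (nunique greater than 1) or a wide gap between min and max would support the observation.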