# Workshop 2025-W49 - Predicting survivors on the Titanic

This workshop focuses on predicting which passengers aboard the Titanic survived. It is based on the classic Titanic dataset, which we have taken from [Kaggle](https://www.kaggle.com/competitions/titanic/). The linked page has more information about the columns of the dataset, as well as examples of how others have analyzed it.

The dataset is presented here in its raw form, which means we will have to do some exploratory data analysis (EDA) and data cleaning before we can use it for prediction and analysis. Additionally, depending on the choice of model, we might have to encode our categorical features (that is, transform them into numerical representations).
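
To make "encoding" concrete, here is a minimal sketch of two common approaches on the `Sex` and `Embarked` columns. It uses only the five rows shown in the `data.head()` output in the Code section, and the 0/1 mapping for `Sex` is an arbitrary choice made for illustration.

```python
import pandas as pd

# The Sex and Embarked values of the five rows shown by data.head().
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# A binary category can be mapped directly to 0/1 (the mapping is our choice).
sex_encoded = sample["Sex"].map({"male": 0, "female": 1})

# A category without a natural order is usually one-hot encoded:
# one indicator column per category that occurs in the data.
embarked_onehot = pd.get_dummies(sample["Embarked"], prefix="Embarked", dtype=int)

print(sex_encoded.tolist())
print(embarked_onehot)
```

Only categories that appear in the data get a column, so on these five rows there is no column for a third port of embarkation.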

**Important**
Try to pick one particular thing to explore in this workshop, at least initially. It is easy to get overwhelmed when there are many things to do, and just as easy to underestimate how much time certain tasks take. This is especially true for the intermediate and advanced level tasks.

**Beginner level:**
A very important and often first step when working with data is to get a good overview of the data and start exploring it. If you do not have much experience with data analysis, this is a great way to begin learning about the [pandas](https://pandas.pydata.org/docs/getting_started/index.html#getting-started) and [matplotlib](https://matplotlib.org/stable/users/index.html) libraries. Below are some ideas if you need inspiration, but feel free to explore whatever question you come up with.
- How do age, sex, and ticket class relate to survival? Showing these relationships is not trivial, so an initial idea could be to plot bar charts of the average survival rate within each group (for age, after binning it into ranges).
- Check for missing values. Do you notice anything strange? Do you think the fact that a value is missing carries important information in itself?
- Come up with a question or hypothesis and try to answer/test it using the data.
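
As a starting point for the first two ideas, the sketch below computes survival rates per group and counts missing values. It runs on the five rows from the `data.head()` output in the Code section so that it works without `titanic.csv`; the helper name `survival_rate_by` and the age bin edges are our own choices, not part of the dataset.

```python
import pandas as pd

# First five rows from the notebook's data.head() output, used only so the
# sketch runs on its own. Replace `sample` with the full `data` frame.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [None, "C85", None, "C123", None],
})

def survival_rate_by(df, key):
    """Mean of the 0/1 Survived column per group, i.e. the survival rate."""
    return df.groupby(key, observed=True)["Survived"].mean()

rate_by_sex = survival_rate_by(sample, "Sex")
rate_by_class = survival_rate_by(sample, "Pclass")

# Age is continuous, so bin it first (the bin edges are an arbitrary choice).
age_group = pd.cut(sample["Age"], bins=[0, 12, 18, 40, 60, 100])
rate_by_age = survival_rate_by(sample, age_group)

# Missing values per column, and whether missingness itself relates to survival.
missing_counts = sample.isna().sum()
cabin_known = sample["Cabin"].notna().rename("CabinKnown")
rate_by_cabin_known = survival_rate_by(sample, cabin_known)

print(rate_by_sex, rate_by_class, rate_by_age, missing_counts, sep="\n\n")
```

Each result is a pandas Series, so something like `survival_rate_by(data, "Pclass").plot.bar()` should turn it into a bar chart. With only five rows the numbers mean nothing; the point is the pattern of grouping and averaging.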

**Intermediate level:**
At the intermediate level, we might want to start predicting whether passengers survived or not.

- For prediction we might need to handle categorical columns and missing data depending on our choice of model. Techniques for handling this are called encoding and imputation methods respectively and there are a wide variety of them. You can read more about basic imputation [here](https://www.kaggle.com/code/alexisbcook/missing-values) and about basic encoding [here](https://www.kaggle.com/code/alexisbcook/categorical-variables).
- Picking/creating a suitable family of models often depends on the problem at hand. For tabular data, a classic first choice is some variant of a gradient boosting machine (GBM). The most common ones are [XGBoost](https://xgboost.readthedocs.io/en/stable/), [CatBoost](https://catboost.ai/docs/en/), and [LightGBM](https://lightgbm.readthedocs.io/en/stable/). Pick one, split the data into train/test, and see if you can get a good classification accuracy on the unseen data.
- One way of improving a model's predictions, or discovering new relationships between our features/explanatory variables, is through feature engineering. Roughly speaking, feature engineering is the art of creating new, potentially strong features based on our given features. See if you can find new features that improve the performance of your chosen model. You can read more [here](https://www.kaggle.com/learn/feature-engineering).

**Advanced level:**
Now we want to dive deeper into a specific case based on the data. Predicting survival is cool, but at the end of the day, that prediction does not mean much by itself: we are not going to receive any new, unseen passengers to predict on. A better question to ask in our case is _why_ certain people survived. This is a fairly deep question that is **very** difficult to answer, so we will tackle it from a slightly different angle. Instead of trying to find causal relationships in the data, we try to figure out why our model makes the predictions it does.
- A more rigorous way of establishing a relationship between our output (survival status) and our explanatory variables is through a simple logistic regression. One would have to handle the categorical variables (probably through one-hot encoding). A problem that commonly arises when doing regression analysis for inference is that of [multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity). Is that a problem in our dataset?
- With more complex data, we may often not be satisfied with the performance of a relatively simple model like logistic regression. A lot of GBM libraries have built-in feature importance modules that let us view which of our features are most important for prediction. This could give us a hint as to what is important for a person to survive the sinking of the Titanic.
- Feature importance is nice, but leaves a lot to be desired. It does not have the statistical properties of the logistic regression, and it does not tell us how certain features impact our predictions. A very popular library for this (that works for a lot of models suffering from this issue, not just GBMs) is [SHAP](https://shap.readthedocs.io/en/latest/). See if you can implement this on a fitted XGBoost model and get some nice plots.
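
For the multicollinearity question in the first bullet, one common diagnostic is the variance inflation factor (VIF). The sketch below is a minimal version under stated assumptions: the data is made up, the helper `variance_inflation_factors` is our own (it is not taken from any library), and it computes each VIF by regressing one column on the others with scikit-learn's `LinearRegression`. It also shows a pitfall that one-hot encoding can introduce on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest."""
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        vifs[col] = 1.0 / max(1.0 - r2, 1e-12)  # cap to avoid dividing by zero
    return pd.Series(vifs)

# Made-up stand-in data (NOT the real dataset), so the block runs on its own.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Sex": rng.choice(["male", "female"], size=200),
    "Age": rng.uniform(1, 70, size=200),
})

# Keeping every dummy column makes Sex_female + Sex_male equal to 1 in each
# row, which is perfectly collinear with the intercept of a regression.
full = pd.get_dummies(toy, columns=["Sex"], dtype=float)
# Dropping one level per categorical variable removes that redundancy.
reduced = pd.get_dummies(toy, columns=["Sex"], drop_first=True, dtype=float)

vif_full = variance_inflation_factors(full)
vif_reduced = variance_inflation_factors(reduced)
print(vif_full, vif_reduced, sep="\n\n")
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a warning sign, but the threshold is a convention rather than a hard rule. On the real data, columns such as `Pclass` and `Fare` would be natural candidates to check.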


## Code

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
PATH = 'titanic.csv'

In [2]:
data = pd.read_csv(PATH)
data.head()

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |
