# Winter School 2024 - EDA Tutorial

This notebook highlights an exploratory data analysis using the example of the Titanic data set
(https://www.kaggle.com/competitions/titanic/data).

Authors: Christopher Katins, Mario Sänger, Christopher Lazik, Thomas Kosch
Credits to Patrick Schäfer (HU Berlin)

------------

Contents of the Notebook:

#### Part 1: Data Description and Visualization:

1. Analyzing features and their distribution.
2. Finding relations or trends between features.

#### Part 2: Feature Engineering and Data Cleaning:

1. Adding features.
2. Removing features.
3. Converting features into suitable form for modeling.

#### Part 3: Predictive Modeling

1. Running a simple classification algorithm.

--------------------


# Part 1: Data Description and Visualization

Set up the environment and install the required packages

In [None]:
!python -m venv env_eda_titanic

In [None]:
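# Note: each `!` command runs in its own subshell, so this activation does not
# carry over to later cells; alternatively, activate the environment before starting Jupyter.
!source env_eda_titanic/bin/activate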

In [None]:
!pip install -r requirements.txt

Import used packages and used classes / functions

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
data = pd.read_csv("data/titanic_train.csv")
data

### Basic Structure

Use .info() to get brief information about the dataframe

In [None]:
data.info()

#### Types of Features

- Nominal Features:
  - Name: Full name
  - Sex: Sex of the passenger
  - Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
  - Ticket: Ticket number
  - Cabin: Cabin number
  - Survived: Survival (0 = No, 1 = Yes)

- Ordinal Features:
  - Pclass: Ticket class as a proxy for socio-economic status (1 = 1st, 2 = 2nd, 3 = 3rd)

- Continuous Features:
  - Age: Passenger age
  - Fare: Passenger fare

- Discrete Features:
  - SibSp: Number of siblings (brother, sister, stepbrother, stepsister) / spouses (husband, wife) aboard the Titanic
  - Parch: Number of parents (mother, father) / children (daughter, son, stepdaughter, stepson) aboard the Titanic*

*Some children travelled only with a nanny, therefore Parch = 0 for them.

In [None]:
# Get number of unique values per column
data.nunique()

### Null Values and Summary Statistics

In [None]:
# Describe basic statistics of the numerical attributes
data.describe()

In [None]:
# Describe basic statistics of the non-numerical attributes
data.describe(include="O")

### Missing Values



In [None]:
# Checking for total number of null values for each column
data.isnull().sum().sort_values(ascending=False)

=> `Age`, `Cabin` and `Embarked` have null values. We will fill `Age` and `Embarked` from the data later on; `Cabin` is dropped before modeling.
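Absolute counts are easier to judge as a share of all rows. The following is a minimal sketch on a made-up toy frame (the values are hypothetical, not taken from the Titanic file) showing how the fraction of missing values per column could be computed:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() yields booleans; their column-wise mean is the fraction of missing values
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to `data`, the same expression would show at a glance which columns are mostly empty and which have only a handful of gaps.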

In [None]:
# Checking for duplicates in the data set
data.duplicated().any()

=> No duplicated rows

### Survival rates

Let's first inspect the survival rates of the passengers

In [None]:
# Create a figure with two sub-plots
f,ax=plt.subplots(1,2,figsize=(12,6))

# Pie plot for relative rates
data["Survived"].value_counts().plot.pie(explode=[0,0.05], autopct='%1.1f%%', ax=ax[0])
ax[0].set_title('Relative Survival Rates')
ax[0].set_ylabel('')

# Bar plot for absolute rates
sns.countplot(data, x="Survived", ax=ax[1])
ax[1].set_title("Absolute Survival Rate")
sns.despine()

**Observation:**
- Out of 891 passengers in the training set, only 342 survived
- That is: 38.4% survived

We will dig deeper to get better insights and see which features of the passengers increased the chance of survival

# Key Question: What influenced the chance of survival?

### Factor Sex

We will first build a crosstab between Sex and Survival Rates.

In [None]:
pd.crosstab(data["Sex"],data["Survived"],margins=True)  \
    .style.background_gradient(cmap='Blues')

Next, we use a point plot to show the relation

In [None]:
f,ax=plt.subplots(1,1,figsize=(6,4))

ax = sns.pointplot(data, x='Sex', y='Survived', ax=ax)
ax.set_title('Factor of Sex - Survived vs Dead')
sns.despine()
plt.show()

**Observations:**

- Concerning female passengers: ~3/4 of the passengers survived
- For male passengers it is the reverse: only ~1/5 of the passengers survived
- We can clearly see that female passengers had a much higher rate of survival.
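To read such rates directly from a table instead of estimating them from counts, the crosstab can be normalized per row. This is a small sketch on hypothetical toy data (not the Titanic file); `normalize="index"` is a standard option of `pd.crosstab`:

```python
import pandas as pd

# Hypothetical toy data: 4 women (3 survived) and 5 men (1 survived)
toy = pd.DataFrame({
    "Sex": ["female"] * 4 + ["male"] * 5,
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0, 0],
})

# normalize="index" divides each row by its row total,
# so every row shows the share of non-survivors (0) and survivors (1)
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(rates)

# The same survival rates via a group-wise mean of the 0/1 target
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

The group-wise mean works because `Survived` is coded as 0/1, so its mean is the share of survivors; this is also what the point plot displays on its y-axis.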

### Factor Pclass

In [None]:
# Build a cross table for pclass and survived
ct = pd.crosstab(data["Pclass"],data["Survived"],margins=True) \
    .style.background_gradient(cmap='Blues')
display(ct)

# Get the table with normalized values
ct = pd.crosstab(data["Pclass"],data["Survived"],margins=True, normalize=True) \
    .style.background_gradient(cmap='Blues')
display(ct)

# Plot the values
_,ax = plt.subplots(1,1,figsize=(6,4))
sns.pointplot(data, x="Pclass", y="Survived", ax=ax)
ax.set_title("Pclass: Survived vs. Dead")
sns.despine()

**Observations:**
=> Passengers of class 1 had by far the highest survival rate, which suggests that they were given priority during the rescue.

### Sex AND Pclass

Now let's look at the joint impact of Sex and Pclass

In [None]:
pd.crosstab([data["Sex"],data["Survived"]],data["Pclass"],margins=True) \
    .style.background_gradient(cmap='Blues')

In [None]:
sns.catplot(data, x="Pclass", y="Survived", hue="Sex", kind="point")
sns.despine()

**Observations:**
=> The chances of survival were highest if you were female and in the first passenger class.
    

### Age

Let's inspect the age of the passengers ...

In [None]:
print('Oldest Passenger was:',data['Age'].max(),'Years')
print('Youngest Passenger was:',data['Age'].min(),'Years')

In [None]:
fg = sns.displot(data, x="Age", hue="Survived", col="Sex", element="poly", height=3)
fg.axes_dict["male"].set_title("Male")
_ = fg.axes_dict["female"].set_title("Female")

**Observations:**

1. Survival rates for passengers below age 10 are increased
2. Survival chances for passengers aged 20-50 from Pclass 1 are highest, and even better for women
3. This is consistent with a "women and children first" policy

### Filling Missing Values

## Filling Age

As we saw earlier, the `Age` feature has 177 null values.

The `Name` feature contains a salutation such as Mr or Mrs, which we can use to estimate the age of the respective groups.

In [None]:
data[["Name"]].head()

We use a regular expression to extract the salutation (initial) of each person. It matches a run of letters (A-Z or a-z) that is immediately followed by a dot (".").

In [None]:
# Raw string avoids an invalid-escape warning; expand=False returns a Series
data["Initial"] = data["Name"].str.extract(r'([A-Za-z]+)\.', expand=False)
data[["Initial"]].head()
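To see what the pattern captures, here is a minimal standalone sketch using Python's `re` module on a few sample names in the dataset's "Surname, Title. Given names" format. The helper list `names` is only for illustration:

```python
import re

# Sample names in the "Surname, Title. Given names" format
names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
]

# One or more letters (captured) immediately followed by a literal dot
pattern = re.compile(r"([A-Za-z]+)\.")

# re.search returns the first match in each string
titles = [pattern.search(name).group(1) for name in names]
print(titles)
```

The surname is skipped because it is followed by a comma rather than a dot, so the first match in each name is the title.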

### Checking Frequency of Initial

In [None]:
pd.crosstab(data["Initial"],data["Sex"]).T \
    .style.background_gradient(cmap='Blues')

In [None]:
data.groupby("Initial")[["Age"]].mean()

**Observations:**
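=> There are some rare titles, including French forms: Mlle (Mademoiselle) corresponds to Miss, and Mme (Madame) is strictly the equivalent of Mrs, although the mapping below groups it with Miss.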

In [None]:
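# Map rare titles onto the common groups; assigning the result back avoids
# calling replace(..., inplace=True) on a column selection
data["Initial"] = data["Initial"].replace(
    ['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],
    ['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'])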

In [None]:
data.groupby('Initial')[['Age']].mean()

**Observations:**
=> Master can be matched to male children. Miss is less distinct.

## Filling Missing Ages by Initial Averages

In [None]:
data[data["Age"].isna()]

In [None]:
# Fill each missing age with the mean age of the passenger's title group
data["Age"] = data["Age"].fillna(data.groupby("Initial")["Age"].transform("mean"))

In [None]:
data.Age.isnull().any()
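The group-wise fill relies on `transform("mean")` returning one value per row, aligned with the original index. A minimal sketch on hypothetical toy data (the ages are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age per title group
toy = pd.DataFrame({
    "Initial": ["Mr", "Mr", "Mr", "Master", "Master"],
    "Age": [30.0, 40.0, np.nan, 4.0, np.nan],
})

# transform("mean") returns one value per row: the mean age of that row's group
group_mean = toy.groupby("Initial")["Age"].transform("mean")

# fillna with a Series aligns on the index, so only the NaN rows are replaced
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy)
```

Here the missing "Mr" age becomes 35.0 and the missing "Master" age becomes 4.0, while the known ages stay untouched.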

In [None]:
# Inspect the survival rates per initial and Pclass
sns.catplot(data, x="Pclass", y="Survived", col="Initial", kind="point", height=2)
sns.despine()

**Observation:**
=> The "women and children first" policy is also visible per title and passenger class

### Embarked

In [None]:
sns.catplot(data, x="Embarked", y="Survived", kind="point", height=3);

**Observations:**
=> The chance of survival is highest for port C (Cherbourg), at around 0.55, and lowest for S (Southampton).

In [None]:
sns.catplot(data, x="Pclass",y="Survived", kind="point", hue="Sex", col="Embarked", height=3);

**Observations:**

- The survival rates are ~1 for women from Pclass 1 and Pclass 2 irrespective of the port
- Port S is worst for Pclass 3
- Port Q is worst for Men

# Filling Embarked NaN

There are two missing values for `Embarked`


In [None]:
data[data["Embarked"].isnull()]

In [None]:
# Let's check which is the port with the highest number of entering passengers
data["Embarked"].mode()

In [None]:
data["Embarked"].value_counts(normalize=True)

As we saw that most passengers boarded from Port S, we replace NaN with S.

In [None]:
data["Embarked"] = data["Embarked"].fillna("S")
data["Embarked"].isnull().any()

### SibSp

This feature counts the siblings and spouses travelling with a passenger, and thus indicates whether a person travelled alone or with family members.

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife

In [None]:
pd.crosstab(data["SibSp"], data["Survived"], normalize=True) \
    .style.background_gradient(cmap='Blues')

In [None]:
pd.crosstab(data["SibSp"],data["Pclass"], normalize=True) \
    .style.background_gradient(cmap='Blues')

In [None]:
f,ax=plt.subplots(figsize=(8,4))
sns.pointplot(data, x="SibSp", y="Survived", hue="Pclass", palette="colorblind", ax=ax)
ax.set_title("SibSp vs Survived")

**Observations:**
- Pclass 1 and 2 had the highest chances of survival
- Families with 3 members had the highest chances of survival
- Families with more than 3 members were only in Pclass 3
- Survival for families with more than 5 members is 0%


### Fare

In [None]:
print('Highest Fare was:',data['Fare'].max())
print('Lowest Fare was:',data['Fare'].min())
print('Average Fare was:',data['Fare'].mean())

In [None]:
f,ax=plt.subplots(1,3,figsize=(12,4))

sns.histplot(x=data[data['Pclass']==1].Fare,kde=True,ax=ax[0])
ax[0].set_title('Fares in Pclass 1')

sns.histplot(x=data[data['Pclass']==2].Fare,kde=True,ax=ax[1])
ax[1].set_title('Fares in Pclass 2')

sns.histplot(x=data[data['Pclass']==3].Fare,kde=True,ax=ax[2])
ax[2].set_title('Fares in Pclass 3')

plt.show()
sns.boxplot(x="Pclass", y="Fare", hue="Sex", data=data);

**Observation:**
=> The fares are widely spread within each class, and there are some outliers.

### Log transformed Data

Let's see how the distributions look if we log transform the data

In [None]:
def get_log_fare(data, pclass):
    fare = data[data['Pclass']==pclass]["Fare"]
    return np.log2(fare[fare>0])

f,ax=plt.subplots(1,3,figsize=(12,4))
sns.histplot(x=get_log_fare(data, 1), kde=True, ax=ax[0])
sns.despine()
ax[0].set_title('Fares in Pclass 1')

sns.histplot(x=get_log_fare(data, 2), kde=True, ax=ax[1])
sns.despine()
ax[1].set_title('Fares in Pclass 2')

sns.histplot(x=get_log_fare(data, 3), kde=True, ax=ax[2])
sns.despine()
_ = ax[2].set_title('Fares in Pclass 3')

**Observation:**
=> Still difficult to interpret the observations due to high variance :-(
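The helper above filters `fare > 0` before taking the logarithm because the logarithm of zero is not finite. A small sketch with hypothetical fare values (chosen as powers of two for readability) illustrates this and shows how the log compresses large values:

```python
import numpy as np

# Hypothetical fares, including a zero fare and one very large value
fares = np.array([0.0, 8.0, 16.0, 32.0, 512.0])

# log2(0) is -inf, so zero fares are dropped first, as in get_log_fare
positive = fares[fares > 0]
log_fares = np.log2(positive)
print(log_fares)

# Ratio of largest to smallest fare before, and range after the transform
print(positive.max() / positive.min(), log_fares.max() - log_fares.min())
```

A factor of 64 between the smallest and largest positive fare shrinks to a difference of 6 on the log2 scale, which is why skewed, outlier-heavy data often looks more compact after a log transform.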

## QQ-Plot - Test for Normality

Quantile-quantile plots (QQ-Plots) are a graphical tool used to assess whether a set of data follows a particular probability distribution (e.g. the normal distribution). If it does, the points lie close to the 45-degree line.

In [None]:
import statsmodels.api as sm

# sm.qqplot returns a matplotlib Figure; fit=True standardizes the data first
for pclass in (1, 2, 3):
    fig = sm.qqplot(get_log_fare(data, pclass), fit=True, line="45")
    fig.suptitle(f"Pclass {pclass}")

**Observation**: No signs of normal distribution, even for log-transformed data
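To make the idea behind a QQ-plot concrete, the following sketch computes the plotted points by hand using only the standard library. It is an illustration under stated assumptions, not how `statsmodels` is implemented: the helper names `qq_points` and `max_deviation` are hypothetical, and the plotting positions `(i + 0.5) / n` are one common convention among several.

```python
import random
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal QQ plot."""
    n = len(sample)
    # Standardize the sample (roughly what fit=True does in sm.qqplot)
    mu, sigma = mean(sample), stdev(sample)
    observed = sorted((x - mu) / sigma for x in sample)
    # Plotting positions (i + 0.5) / n: an assumed convention for this sketch
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, observed

def max_deviation(sample):
    """Largest vertical distance of the QQ points from the 45-degree line."""
    theoretical, observed = qq_points(sample)
    return max(abs(t - o) for t, o in zip(theoretical, observed))

rng = random.Random(0)
normal_sample = [rng.gauss(0, 1) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]

print("normal:", round(max_deviation(normal_sample), 2))
print("skewed:", round(max_deviation(skewed_sample), 2))
```

For normally distributed data the sorted, standardized values track the theoretical quantiles closely, so the deviation stays small; for skewed data the tails bend away from the line, which is the pattern to look for in the plots above.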

# Part 2: Feature Engineering and Data Cleaning

# Family_Size and Alone

In [None]:
# Family size = parents/children + siblings/spouses travelling with the passenger
data['Family_Size'] = data['Parch'] + data['SibSp']
data['Alone'] = data['Family_Size'] == 0

ax = sns.pointplot(x='Alone',y='Survived', hue="Pclass", data=data)
ax.set_title("Traveling Alone");

**Observation:** If you were alone, your chances of survival were very low.


In [None]:
ax = sns.pointplot(x='Alone',y='Survived',data=data, hue='Sex')

**Observation:** Females travelling alone have higher chances of survival than those travelling with family.

# Outlier Handling - Binning Fares

In [None]:
data['Fare_Range'] = pd.qcut(data['Fare'],4,labels=["Small","Medium","Large", "Rich"])
# observed=True makes the handling of the categorical group key explicit
data.groupby('Fare_Range', observed=True)['Survived'].mean().to_frame().style.background_gradient(cmap='Blues')

**Observation:** As the fare increases, the chances of survival increase.
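Quantile binning is robust to outliers because the bin edges follow the data rather than the value range. The sketch below contrasts `pd.qcut` with equal-width `pd.cut` on hypothetical fares containing one extreme value:

```python
import pandas as pd

# Hypothetical fares with a single extreme outlier
fares = pd.Series([1, 2, 3, 4, 5, 6, 7, 500], name="Fare")
labels = ["Small", "Medium", "Large", "Rich"]

# qcut: quartile bins, each bin holds the same number of values
by_quantile = pd.qcut(fares, 4, labels=labels)

# cut: four equal-width bins over the value range, dominated by the outlier
by_width = pd.cut(fares, 4, labels=labels)

print(by_quantile.value_counts().sort_index())
print(by_width.value_counts().sort_index())
```

With `qcut` every bin receives two values, whereas with `cut` seven of the eight values fall into the first bin and the outlier sits alone in the last one.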


# Encoding Categorical Values

In [None]:
data_numerical = pd.get_dummies(data, columns=['Sex'], prefix='is')
data_numerical = pd.get_dummies(data_numerical, columns=['Embarked'], prefix='is')
data_numerical = pd.get_dummies(data_numerical, columns=['Initial'], prefix='is')
data_numerical = pd.get_dummies(data_numerical, columns=['Fare_Range'], prefix='is')
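As a quick illustration of what `pd.get_dummies` produces, here is a minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numerical column
toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "Fare": [7.25, 71.28, 8.05]})

# Each category of "Sex" becomes its own indicator column named "<prefix>_<category>"
encoded = pd.get_dummies(toy, columns=["Sex"], prefix="is")
print(encoded.columns.tolist())

# Exactly one indicator is set per row
row_sums = encoded[["is_female", "is_male"]].sum(axis=1).tolist()
print(row_sums)
```

Because exactly one indicator is set per row, `is_female` and `is_male` carry the same information with opposite sign. `pd.get_dummies` offers `drop_first=True` to drop one of the redundant columns if that matters for the model.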

# Correlation Between The Numerical Features

In [None]:
fig, ax = plt.subplots(figsize=(16,12))
sns.heatmap(data_numerical.corr(numeric_only=True), annot=True, linewidths=0.2, ax=ax)
plt.tight_layout()
plt.show()

There is a noticeable correlation (positive or negative) between `Survived` and `Pclass`, `Fare`, `is_female`, `is_male`, `is_S`, `is_C`, `is_Mr`, `is_Mrs`, `is_Miss`, `Alone`, `is_Rich`.

# Observations in a Nutshell for all features:

- Sex:
    - The chance of survival is much higher for women than for men
- Pclass:
    - Fares varied widely, with a few passengers (<1%) paying as much as 512
    - The more a passenger paid, the better the chances of survival
    - The survival rate for Pclass 3 is very low
- Age:
    - Children between 5 and 10 years have a high chance of survival
- Embarked:
    - Passengers who embarked at C had the highest survival rate (around 0.55); those who embarked at S had the lowest
- Parch+SibSp:
    - Having a small family gives a greater chance of survival than travelling alone (as a male) or with a large family
    - Females travel best alone

# Part 3: Predictive Modeling

- We have gained insights from the EDA
- Using these insights alone, we cannot accurately predict whether a given passenger will survive
- So we use a classification algorithm to predict whether a passenger will survive


In [None]:
from sklearn.linear_model import RidgeClassifierCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# display(data_numerical.head())
train_y = data_numerical['Survived']
train_X = data_numerical.drop(columns=['Survived', 'Name', 'Ticket', 'Cabin'])

# Scaled data has zero mean and unit variance:
# deprecated:
# train_X = StandardScaler().fit_transform(train_X)
# clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10), normalize=True).fit(train_X, train_y)
# score = clf.score(train_X, train_y)
# print ("Accuracy of Model: ", score)

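# Train a ridge classifier on standardized features
clf = make_pipeline(StandardScaler(),
                    RidgeClassifierCV(alphas=np.logspace(-3, 3, 10))).fit(train_X, train_y)

# Note: this is the accuracy on the training data itself, not on unseen passengers
score = clf.score(train_X, train_y)

print("Training accuracy of model:", score)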
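Accuracy measured on the rows used for fitting tends to be optimistic. The following sketch shows one possible way to estimate performance on unseen data with a hold-out split and 5-fold cross-validation. It uses a synthetic dataset from `make_classification` as a stand-in for `train_X` and `train_y`, so the numbers it prints say nothing about the Titanic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifierCV
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (train_X, train_y); not the Titanic data
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

model = make_pipeline(StandardScaler(),
                      RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)))

# Hold out 25% of the rows that the model never sees during fitting
X_fit, X_held, y_fit, y_held = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
model.fit(X_fit, y_fit)
print("Accuracy on fitted rows:  ", model.score(X_fit, y_fit))
print("Accuracy on held-out rows:", model.score(X_held, y_held))

# 5-fold cross-validation averages the held-out accuracy over five splits
cv_scores = cross_val_score(model, X, y, cv=5)
print("5-fold CV accuracy:", cv_scores.mean())
```

Keeping the scaler inside the pipeline matters here: it is re-fitted on the fitting rows of every split, so no information from the held-out rows leaks into the preprocessing.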