# 4t DigiNext Data & ML Bootcamp

## Week 1 - EDA

### Applied Exploratory Data Analysis


This notebook serves as a practical guide to Applied Exploratory Data Analysis.

## Getting Started
- Download the [Titanic Dataset](https://drive.google.com/file/d/1qsPISZNlcAaLGXG9l0WpWSzpxIbj7UYW/view?usp=sharing).
- Upload `titanic.csv` to your runtime (Colab: *Files* → *Upload*).
- Keep your answers brief but clear in the designated *ANSWER* cells.
- Use Plotly for all charts.

In [1]:
# (Optional) Install/upgrade libraries if needed
# If you're in Colab, these are usually available. Uncomment if required.
# !pip install -q --upgrade pandas plotly

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)
SEED = 42
np.random.seed(SEED)


## Intro
The Titanic dataset contains one row per passenger with attributes such as **Survived**, **Pclass**, **Sex**, **Age**, **SibSp**, **Parch**, **Fare**, **Cabin**, **Embarked**, etc.

Read the column descriptions (Kaggle Titanic) and refer back as necessary.

## Data Cleaning – Missing Values

### 1) Load the Titanic dataset

In [None]:
# Load the Titanic dataset
df = pd.read_csv('titanic.csv')
display(df.head())

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### 2) Check for missing values in each column

In [None]:
display(df.isnull().sum())

Unnamed: 0,0
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0


### 3) Are the values missing systematically or at random? *(Brief answer)*

The raw counts above are not enough to tell.
Let's check whether the missing rate varies with other columns.

In [None]:
df.groupby('Survived')["Age"].apply(lambda x: x.isnull().mean())

Unnamed: 0_level_0,Age
Survived,Unnamed: 1_level_1
0,0.227687
1,0.152047


In [None]:
df.groupby('Pclass')["Age"].apply(lambda x: x.isnull().mean())

Unnamed: 0_level_0,Age
Pclass,Unnamed: 1_level_1
1,0.138889
2,0.059783
3,0.276986


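The share of missing `Age` values differs by outcome (22.8% among non-survivors vs. 15.2% among survivors) and by class (13.9%, 6.0%, and 27.7% for classes 1 to 3). `Age` therefore does not look missing completely at random: the missingness is systematic, it depends on `Pclass`, and it carries some information about the label.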

In [None]:
df.groupby('Survived')["Cabin"].apply(lambda x: x.isnull().mean())

Unnamed: 0_level_0,Cabin
Survived,Unnamed: 1_level_1
0,0.876138
1,0.602339


In [None]:
df.groupby('Pclass')["Cabin"].apply(lambda x: x.isnull().mean())

Unnamed: 0_level_0,Cabin
Pclass,Unnamed: 1_level_1
1,0.185185
2,0.913043
3,0.97556


`Cabin` missingness is strongly tied to `Pclass` (18.5% missing in first class vs. 91.3% and 97.6% in second and third), so we can assume it is not missing at random. It is systematically missing.
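The per-group checks above can be folded into one small helper. This is only a sketch on a tiny hand-made frame: the `toy` data and the `missing_rate_by` name are illustrative assumptions, not part of the notebook. With the real data you would pass `df` instead.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic frame (values are made up).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Age": [38.0, np.nan, 30.0, 27.0, np.nan, np.nan, 22.0, np.nan],
})

def missing_rate_by(frame, target, by):
    """Share of missing values in `target` within each level of `by`."""
    return frame[target].isna().groupby(frame[by]).mean()

rates = missing_rate_by(toy, "Age", "Pclass")
print(rates)
```

Calling it once per candidate column (`Survived`, `Pclass`, `Sex`, ...) gives a quick view of which observed variables the missingness depends on.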

### 4) What is the best approach to handle each column? *(Brief answer)*

Age: Fill the missing values with the median or a group-wise median, e.g. the median age for each sex.

Cabin: Drop the column for now. It is mostly missing and has many distinct values, so it is more trouble than it is worth at this stage.

Embarked: Either drop the rows or fill with the mode. We could also look up the passengers' relatives in the dataset and infer the port from them.
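The group-wise median idea for `Age` could look roughly like this. It is a minimal sketch on made-up data: the `toy` frame and the `Age_filled` column name are assumptions for illustration, and grouping by `Sex` alone is just one possible choice.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing age per group.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [20.0, 40.0, np.nan, 10.0, 30.0, np.nan],
})

# Median age per Sex, broadcast back to every row via transform.
group_median = toy.groupby("Sex")["Age"].transform("median")

# Fill only the missing entries; observed ages are left untouched.
toy["Age_filled"] = toy["Age"].fillna(group_median)
print(toy)
```

`transform` returns a Series aligned with the original rows, which is why it can be passed straight to `fillna`. Grouping by several columns (for example `["Sex", "Pclass"]`) works the same way.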

### 5) Handle missing values for `Age`

#### 5-1) Drop rows with missing Age

In [None]:
df_dropped = df.dropna(subset=['Age'])
display(df_dropped.head())
print(df_dropped.shape)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


(714, 12)


#### 5-2) Fill Age with median and mean

In [None]:
df_median = df.copy()
df_mean = df.copy()

df_median['Age'] = df_median['Age'].fillna(df_median['Age'].median())
df_mean['Age'] = df_mean['Age'].fillna(df_mean['Age'].mean())

print("df", df["Age"].mean(), df["Age"].median())
print("df_median", df_median["Age"].mean(), df_median["Age"].median())
print("df_mean", df_mean["Age"].mean(), df_mean["Age"].median())

df 29.69911764705882 28.0
df_median 29.36158249158249 28.0
df_mean 29.69911764705882 29.69911764705882


#### 5-3) Indicator Column

In [None]:
df_indicator = df.copy()
df_indicator['Age_missing'] = df_indicator['Age'].isnull()
df_indicator['Age'] = df_indicator['Age'].fillna(df_indicator['Age'].median())

print("DataFrame with Age indicator column:")
display(df_indicator[df_indicator["Age_missing"]].head())
display(df_indicator.head())


DataFrame with Age indicator column:


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_missing
5,6,0,3,"Moran, Mr. James",male,28.0,0,0,330877,8.4583,,Q,True
17,18,1,2,"Williams, Mr. Charles Eugene",male,28.0,0,0,244373,13.0,,S,True
19,20,1,3,"Masselmani, Mrs. Fatima",female,28.0,0,0,2649,7.225,,C,True
26,27,0,3,"Emir, Mr. Farred Chehab",male,28.0,0,0,2631,7.225,,C,True
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,28.0,0,0,330959,7.8792,,Q,True


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_missing
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,False
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,False
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,False
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,False
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,False


### 6) Look up MAR, MCAR, MNAR. Describe a scenario for each *(Brief answer)*

MCAR (Missing Completely at Random): Missingness is unrelated to any data (observed or unobserved).

A lab misplaces a random subset of blood samples during shipping.


MAR (Missing At Random): Missingness depends on another observed variable, but not the missing value itself.

Driver ratings are missing more for short trips and late-night rides (observed: trip length, time of day).


MNAR (Missing Not At Random): Missingness depends on the unobserved value itself.

People with very high incomes are less likely to report their income on a survey because it is sensitive.
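The three mechanisms can also be simulated. The sketch below uses synthetic data; the column names (`group`, `income`) and all probabilities are arbitrary assumptions chosen only to make the differences visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
sim = pd.DataFrame({
    "group": rng.integers(0, 2, size=n),    # observed covariate
    "income": rng.normal(50, 10, size=n),   # value that may go missing
})

# MCAR: every row has the same 20% chance of being missing.
mcar = rng.random(n) < 0.2
# MAR: the chance depends only on the observed `group` column.
mar = rng.random(n) < np.where(sim["group"] == 1, 0.4, 0.1)
# MNAR: the chance depends on the (would-be missing) income itself.
mnar = rng.random(n) < np.where(sim["income"] > 60, 0.6, 0.05)

for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    # Missing rate per group, and the mean of the values that remain observed.
    by_group = pd.Series(mask).groupby(sim["group"]).mean().round(3).to_dict()
    observed_mean = sim.loc[~mask, "income"].mean()
    print(name, by_group, round(observed_mean, 2))
```

Under MAR the missing rate differs between groups, which is what the `groupby` checks on `Age` and `Cabin` look for. Under MNAR the mean of the observed incomes is biased downward, and nothing in the observed columns reveals why.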

## Data Cleaning – Duplicates

### 0) Download the [corrupt Titanic](https://drive.google.com/file/d/1ThKbMCgw0-UoCk3pJB-mnaw8pt8vIv99/view?usp=sharing) dataset.

### 1) Upload and read `corrupt_titanic.csv`

This dataset is a modified version of the original dataset with some duplicate rows added.

In [None]:
dup_df = pd.read_csv("corrupt_titanic.csv")
print("Original DF:", df.shape)
print("Duplicated DF:", dup_df.shape)

Original DF: (891, 12)
Duplicated DF: (1001, 12)


### 2) Basic EDA

Do a basic EDA on the corrupt dataset and compare the results with the original dataset.

In [None]:
print("Original DF:", df['Survived'].mean())
print("Duplicated DF:", dup_df['Survived'].mean())

Original DF: 0.3838383838383838
Duplicated DF: 0.4115884115884116


In [None]:
print("Original DF:", df.groupby("Sex")['Survived'].mean())
print("Duplicated DF:", dup_df.groupby("Sex")['Survived'].mean())

Original DF: Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64
Duplicated DF: Sex
female    0.754335
male      0.230534
Name: Survived, dtype: float64


### 3) Find and drop **exact** duplicates in corrupt dataset.

In [None]:
print(dup_df.drop_duplicates().shape)

(1001, 12)


No exact duplicates were found. Let's dig deeper.

In [None]:
dup_df.nunique()

Unnamed: 0,0
Survived,2
Pclass,3
Name,921
Sex,2
Age,88
SibSp,7
Parch,7
Ticket,685
Fare,248
Cabin,147


Every row has a unique `PassengerId`, which is why no exact duplicates were found. Let's drop it before comparing rows.

In [None]:
dup_df_2 = dup_df.drop("PassengerId", axis=1).drop_duplicates()
print(dup_df_2.shape)

(940, 11)


### 4) Look for **near-duplicates** (hint: check `Embarked`, `Ticket`, `Name`)

In [None]:
dup_df_2["Embarked"].value_counts()

Unnamed: 0_level_0,count
Embarked,Unnamed: 1_level_1
S,670
C,175
Q,78
Southampton,10
Cherbourg,4
Queenstown,1


`Embarked` uses inconsistent labels (codes and full port names), so let's unify them.

In [None]:
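# Map the spelled-out port names to their one-letter codes.
# Note: fillna("S") also imputes missing ports with the most common value.
dup_df_2["Embarked"] = (
    dup_df_2["Embarked"]
    .fillna("S")
    .replace({"Southampton": "S", "Cherbourg": "C", "Queenstown": "Q"})
)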

dup_df_3 = dup_df_2.drop_duplicates()
print(dup_df_3.shape)

(925, 11)


In [None]:
dup_df_3['Ticket'].sort_values().head(40)

Unnamed: 0,Ticket
947,110152
510,110152
616,110152
515,110152
37,110413
696,110413
530,110413
172,110465
658,110465
246,110564


There are a lot of duplicated tickets!

In [None]:
ticket_counts = dup_df_3['Ticket'].value_counts()
repeated_tickets = ticket_counts[ticket_counts > 1].index
dup_df_with_repeated_tickets = dup_df_3[dup_df_3['Ticket'].isin(repeated_tickets)].sort_values('Ticket')
display(dup_df_with_repeated_tickets.head(30))

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
515,1,1,"Rothes, the Countess. of (Lucy Noel Martha Dye...",female,33.0,0,0,110152,86.5,B77,S
947,1,1,"Maioni, Miss. Roberta",female,16.0,0,0,110152,86.5,B79,S
510,1,1,"Cherry, Miss. Gladys",female,30.0,0,0,110152,86.5,B77,S
616,1,1,Rothes the Countess of Lucy Noel Martha DyerEd...,female,33.0,0,0,110152,86.5,B77,S
696,0,1,"Taussig, Mr. Emil",male,52.0,1,1,110413,79.65,E67,S
37,1,1,"Taussig, Miss. Ruth",female,18.0,0,2,110413,79.65,E68,S
530,1,1,"Taussig, Mrs. Emil (Tillie Mandelbaum)",female,39.0,1,1,110413,79.65,E67,S
172,0,1,"Porter, Mr. Walter Chamberlain",male,47.0,0,0,110465,52.0,C110,S
658,0,1,"Clifford, Mr. George Quincy",male,,0,0,110465,52.0,A14,S
274,1,1,"Hippach, Mrs. Louis Albert (Ida Sophia Fischer)",female,44.0,0,1,111361,57.9792,B18,C


The following rows contain the same name with minor differences in case or punctuation.

258: "Hippach, MISS. Jean Gertrude"

487: "Hippach, Miss. Jean Gertrude"

--

390: "Rood, Mr. Hugh Roscoe"

640: "Rood Mr Hugh Roscoe"

--

So let's convert the names to lowercase and remove all punctuation.

In [None]:
import string

dup_df_4 = dup_df_3.copy()
dup_df_4['Name'] = dup_df_4['Name'].apply(lambda x: x.lower().translate(str.maketrans('', '', string.punctuation)))

dup_df_4 = dup_df_4.drop_duplicates()

print(dup_df_4.shape)

(895, 11)
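The name normalisation can be wrapped in a reusable function. This is a sketch, not the notebook's exact method: `normalize_name` is a hypothetical helper, and collapsing repeated whitespace is an extra step added here on the assumption that spacing may also differ between near-duplicates.

```python
import string
import pandas as pd

# Build the translation table once rather than once per row.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize_name(name):
    """Lower-case, strip punctuation, and collapse repeated whitespace."""
    return " ".join(name.lower().translate(_PUNCT_TABLE).split())

# The two near-duplicate pairs quoted above.
names = pd.Series([
    "Rood, Mr. Hugh Roscoe",
    "Rood Mr Hugh Roscoe",
    "Hippach, MISS. Jean Gertrude",
    "Hippach, Miss. Jean Gertrude",
])
cleaned = names.map(normalize_name)
print(cleaned.drop_duplicates().tolist())
```

After normalisation each pair collapses to a single value, so `drop_duplicates` keeps one row per passenger.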


### **Use the main dataset for the rest of the practice**

## Data Cleaning – Outliers

### 1) Plot `Fare` using box-plot and histogram

In [None]:
fig = px.box(df, y="Fare")
fig.update_layout(title='Box plot of Fare')
fig.show()

fig = px.histogram(df, x="Fare")
fig.update_layout(title='Histogram of Fare')
fig.show()

### 2) Do you see any outliers in `Fare`? *(Brief answer)*

Yes. Most fares are below 100, but a few reach roughly 500.

### 3) Do the same for `Age`

In [None]:
fig = px.box(df, y="Age")
fig.update_layout(title='Box plot of Age')
fig.show()

fig = px.histogram(df, x="Age")
fig.update_layout(title='Histogram of Age')
fig.show()

The box plot flags a few high ages as outliers, but they are plausible values for elderly passengers rather than data errors.

### 4) IQR vs Z-score methods *(Brief answer)*

IQR method (Tukey’s fences)

Compute Q1, Q3, IQR = Q3−Q1.

Flag points < Q1 − 1.5·IQR or > Q3 + 1.5·IQR (often 3·IQR for “extreme” outliers).

Pros: non-parametric, robust to skew and extreme values, works with small samples.

Cons: univariate only; the fixed fences ignore tail shape, so they can be unstable when data are very sparse and can flag many legitimate points in strongly skewed or heavy-tailed data.

----

Z-score method (standard scores)

Compute z = (x − mean)/std and flag |z| > k (common k = 3, sometimes 2.5).

Pros: simple, uses full distribution, comparable across variables; good when data are ~normal.

Cons: not robust, because the mean and std are themselves pulled by outliers, which can hide other outliers; misleading on skewed/non-normal data; less reliable in small samples.

----

Rule of thumb: use IQR for skewed/heavy-tailed data; use Z-scores when the variable is roughly normal.
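Both rules are short to implement. The sketch below uses made-up values rather than the Titanic columns, and `iqr_outliers` and `zscore_outliers` are hypothetical helper names, not pandas functions.

```python
import pandas as pd

def iqr_outliers(s, k=1.5):
    """Boolean mask of points outside Tukey's fences."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

def zscore_outliers(s, k=3.0):
    """Boolean mask of points with |z| > k (sample std, ddof=1)."""
    z = (s - s.mean()) / s.std()
    return z.abs() > k

# Made-up, right-skewed values with one very extreme point.
values = pd.Series([7, 8, 8, 9, 10, 12, 13, 15, 26, 30, 80, 512], dtype=float)

print("IQR flags:    ", values[iqr_outliers(values)].tolist())
print("Z-score flags:", values[zscore_outliers(values)].tolist())
```

On this sample the IQR rule flags both 80 and 512, while the z-score rule flags only 512: the extreme point inflates the standard deviation enough to hide 80. This is the robustness issue listed under the z-score cons.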

## Descriptive Statistics – Survival Rates & Summary

### 1) Survival rate by **gender** and **class**, then both.

In [None]:
print("Survival rate by Sex:")
display(df.groupby('Sex')['Survived'].mean())

print("\nSurvival rate by Pclass:")
display(df.groupby('Pclass')['Survived'].mean())

print("\nSurvival rate by Sex and Pclass:")
display(df.groupby(['Sex', 'Pclass'])['Survived'].mean())

Survival rate by Sex:


Unnamed: 0_level_0,Survived
Sex,Unnamed: 1_level_1
female,0.742038
male,0.188908



Survival rate by Pclass:


Unnamed: 0_level_0,Survived
Pclass,Unnamed: 1_level_1
1,0.62963
2,0.472826
3,0.242363



Survival rate by Sex and Pclass:


Unnamed: 0_level_0,Unnamed: 1_level_0,Survived
Sex,Pclass,Unnamed: 2_level_1
female,1,0.968085
female,2,0.921053
female,3,0.5
male,1,0.368852
male,2,0.157407
male,3,0.135447


### 2) Mean, median, mode for `Fare`

In [None]:
print("Mean of Fare:", df['Fare'].mean())
print("Median of Fare:", df['Fare'].median())
print("Mode of Fare:", df['Fare'].mode()[0])

Mean of Fare: 32.204207968574636
Median of Fare: 14.4542
Mode of Fare: 8.05


### 3) Min, Max, Std for `Age`

In [None]:
print("Minimum Age:", df['Age'].min())
print("Maximum Age:", df['Age'].max())
print("Standard Deviation of Age:", df['Age'].std())

Minimum Age: 0.42
Maximum Age: 80.0
Standard Deviation of Age: 14.526497332334044


### 4) Value counts of `Embarked` with percentages

In [None]:
df['Embarked'].value_counts() / len(df) * 100

Unnamed: 0_level_0,count
Embarked,Unnamed: 1_level_1
S,72.278339
C,18.855219
Q,8.641975
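Dividing by `len(df)` keeps rows with a missing `Embarked` in the denominator, so the percentages above sum to slightly less than 100. As an alternative, `value_counts` can normalise directly. The sketch below uses a small made-up Series rather than the real column.

```python
import numpy as np
import pandas as pd

# Made-up port codes with two missing entries.
embarked = pd.Series(["S", "S", "C", "Q", np.nan, "S", "C", np.nan])

# Percentages over all rows, with missing values shown as their own category.
pct_all = embarked.value_counts(normalize=True, dropna=False) * 100

# Percentages over non-missing rows only (the default, dropna=True).
pct_known = embarked.value_counts(normalize=True) * 100

print(pct_all.round(2))
print(pct_known.round(2))
```

Which denominator is right depends on the question: share of all passengers, or share of passengers with a known port.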


## Univariate Analysis – Numerical

### 1) Histogram + KDE for `Age`

In [62]:
import plotly.figure_factory as ff

fig = ff.create_distplot([df["Age"].dropna()], ["Age"])
fig.show()

### 2) Skewness & Kurtosis for `Age`

In [63]:
print("Skewness of Age:", df['Age'].skew())
print("Kurtosis of Age:", df['Age'].kurtosis())

Skewness of Age: 0.38910778230082704
Kurtosis of Age: 0.17827415364210353


### 3) Distribution types lookup *(Brief answer)*

Symmetric: Left and right sides look the same; mean ≈ median. (e.g., bell curve)

Skewed: long tail to one side, either left or right. (few very large values)

Heavy tails: Extra-wide tails; extreme values happen more often than in a normal curve.

### 4) Indicate Age distribution type *(Brief answer)*

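Skewness is positive (about 0.39), so the distribution is mildly right-skewed with a longer right tail.

Kurtosis is slightly positive (about 0.18; pandas reports excess kurtosis, where a normal distribution scores 0), so the tails are only a little heavier than normal.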

### 5) Repeat for `Fare`

In [64]:
print("Skewness of Fare:", df['Fare'].skew())
print("Kurtosis of Fare:", df['Fare'].kurtosis())

Skewness of Fare: 4.787316519674893
Kurtosis of Fare: 33.39814088089868


Both skewness (about 4.79) and kurtosis (about 33.4) are very high, so `Fare` is strongly right-skewed with very heavy tails.


## Univariate Analysis – Categorical

### 1) Bar charts of `Embarked`, `Pclass`, `Sex`

In [67]:
fig = px.bar(df['Embarked'].value_counts(), title="Distribution of Embarked")
fig.show()

fig = px.bar(df['Pclass'].apply(str).value_counts(), title="Distribution of Pclass")
fig.show()

fig = px.bar(df['Sex'].value_counts(), title="Distribution of Sex")
fig.show()

## Bivariate Analysis – Numerical & Numerical

### 1) Scatter plot of `Age` vs `Fare`

In [69]:
fig = px.scatter(df, x="Age", y="Fare", title="Age vs Fare Scatter Plot")
fig.show()

### 2) Pattern interpretation *(Brief answer)*

No, there is no clear pattern between `Age` and `Fare`. A plot of `Fare` against `Pclass`, or a box plot of `Fare` by `Survived`, would likely be more informative.

## Bivariate Analysis – Categorical & Numerical

### 1) Boxplot of `Age` grouped by `Survived`

In [70]:
fig = px.box(df, y="Age", color="Survived", title="Age Distribution by Survival")
fig.show()

### 2) Violin plot of `Fare` grouped by `Survived`

In [73]:
fig = px.violin(df, y="Fare", color="Survived", title="Fare Distribution by Survival")
fig.show()

### 3) Histogram of `Age` grouped by `Survived`

In [74]:
fig = px.histogram(df, x="Age", color="Survived", title="Age Distribution by Survival", barmode='overlay', histnorm='probability density')
fig.show()

### 4) Histogram of `Fare` grouped by `Survived`

In [76]:
fig = px.histogram(df, x="Fare", color="Survived", title="Fare Distribution by Survival", barmode='overlay', histnorm='probability density')
fig.show()

### 5) Interpretation *(Brief answer)*

Yes. Younger passengers and passengers who paid higher fares were more likely to survive.

## Bivariate Analysis – Categorical & Categorical

### 1) Grouped bar chart of `Sex` and `Pclass`

In [89]:
fig = px.bar(df.groupby(['Sex', 'Pclass']).size().reset_index(name='count'),
             x='Pclass',
             y='count',
             color='Sex',
             barmode='group',
             title='Grouped Bar Chart of Sex and Pclass')
fig.show()

## Bivariate Analysis – Correlation Heatmap

### 1) Correlation matrix of numerical features

In [92]:
corr_matrix = df.select_dtypes(include=np.number).corr()
display(corr_matrix)

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,0.036847,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495
Age,0.036847,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.057527,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0


### 2) Heatmap for the correlation matrix

In [94]:
fig = px.imshow(corr_matrix, text_auto=True, aspect="auto", title="Correlation Heatmap (Numerical Features)")
fig.show()


### 3) Which variables correlate most with `Survived`? *(Brief answer)*

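`Pclass` has the largest correlation with `Survived` in absolute value (about -0.34): the lower the class number (i.e. the higher the class), the better the survival chances.

`Fare` comes next with a positive correlation (about 0.26). Both are moderate rather than strong; the remaining numerical features are all below 0.1 in absolute value.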

## Trends & Patterns – Hypothesis Generation

### 1) Grouped bar plots for survival rate by **Pclass** and **Sex**

In [96]:
pivot = df.pivot_table(values="Survived", index="Pclass", columns="Sex", aggfunc="mean")
pivot = (pivot*100).round(2).reset_index().melt(id_vars="Pclass", var_name="Sex", value_name="Survival_%")
fig = px.bar(pivot, x="Pclass", y="Survival_%", color="Sex", barmode="group",
             title="Survival Rate (%) by Pclass and Sex")
fig.show()


---
## Notes
- Keep your narrative answers concise.
- When you choose a cleaning/imputation method, comment on rationale and