# 4t DigiNext Data & ML Bootcamp

## Week 1 - EDA

### Applied Exploratory Data Analysis


This notebook serves as a practical guide to Applied Exploratory Data Analysis.

## Getting Started
- Download [Titanic Dataset](https://drive.google.com/file/d/1qsPISZNlcAaLGXG9l0WpWSzpxIbj7UYW/view?usp=sharing)
- Upload `titanic.csv` to your runtime (Colab: *Files* → *Upload*).
- Keep your answers brief but clear in the designated *ANSWER* cells.
- Use plotly for plotting charts.

In [None]:
# (Optional) Install/upgrade libraries if needed
# If you're in Colab, these are usually available. Uncomment if required.
# !pip install -q --upgrade pandas plotly

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)
SEED = 42
np.random.seed(SEED)


## Intro
The Titanic dataset contains one row per passenger with attributes such as **Survived**, **Pclass**, **Sex**, **Age**, **SibSp**, **Parch**, **Fare**, **Cabin**, **Embarked**, etc.

Read the column descriptions (Kaggle Titanic) and refer back as necessary.

## Data Cleaning – Missing Values

### 1) Load the Titanic dataset

In [None]:
df = pd.read_csv("titanic.csv")
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


### 2) Check for missing values in each column

In [None]:
missing_counts = df.isna().sum()

print("Shape of df:", df.shape, '\n')
print(missing_counts)

Shape of df: (891, 12) 

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


### 3) Are they systematically missing or randomly? *(Brief answer)*

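**ANSWER:** I think they are missing at random, although this should be checked against observed columns such as `Pclass` and `Sex`.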
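
One way to probe this question is to compare the share of missing values across groups of an observed column. The sketch below uses a small hypothetical frame (`toy`, with made-up values) rather than `titanic.csv`, so the numbers only illustrate the technique.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for titanic.csv (values are made up).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "male", "male", "female", "male"],
    "Age": [38.0, 54.0, 27.0, np.nan, np.nan, 22.0, np.nan, np.nan],
    "Cabin": ["C85", "E46", np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
})

# Share of missing values per column within each passenger class.
missing_by_class = toy[["Age", "Cabin"]].isna().groupby(toy["Pclass"]).mean()
print(missing_by_class)
```

If the missing share differs a lot between groups, the values are probably not missing completely at random.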

### 4) What is the best approach to handle each column? *(Brief answer)*

**ANSWER:**


*   Age --> impute with same-sex mean
*   Cabin --> create an "is_missing" indicator
*   Embarked --> impute with mode
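
The `Cabin` and `Embarked` choices listed above could look roughly like this. It is a minimal sketch on a hypothetical four-row frame; the column names `Cabin_is_missing` and `Embarked_filled` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are made up).
toy = pd.DataFrame({
    "Cabin": ["C85", np.nan, np.nan, "E46"],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Keep the information that Cabin was absent instead of guessing a value.
toy["Cabin_is_missing"] = toy["Cabin"].isna()

# mode() returns a Series (ties are possible), so take the first entry.
embarked_mode = toy["Embarked"].mode().iloc[0]
toy["Embarked_filled"] = toy["Embarked"].fillna(embarked_mode)
print(toy)
```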



### 5) Handle missing values for `Age`

#### 5-1) Drop rows with missing Age

In [None]:
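# Note: this overwrites df, so every later cell works on the 714 rows that have an Age.
# Adding .copy() here would avoid the SettingWithCopyWarning raised by later column assignments.
df = df.dropna(subset=['Age'])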
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


#### 5-2) Fill Age with median and mean

In [None]:
women_age_mean = df.loc[df['Sex'] == 'female', 'Age'].mean()
women_age_median = df.loc[df['Sex'] == 'female', 'Age'].median()

men_age_mean = df.loc[df['Sex'] == 'male', 'Age'].mean()
men_age_median = df.loc[df['Sex'] == 'male', 'Age'].median()

print(f"women's mean: {women_age_mean:.3f}")
print(f"women's median: {women_age_median:.3f}\n")
print(f"men's mean: {men_age_mean:.3f}")
print(f"men's median: {men_age_median:.3f}\n")

# Fill into new columns so the original Age stays available.
# Rows with a missing Age were already dropped in 5-1, so fillna changes nothing here.
df['Age_median'] = df['Age'].fillna(df.groupby('Sex')['Age'].transform('median'))
df['Age_mean'] = df['Age'].fillna(df.groupby('Sex')['Age'].transform('mean'))

df.loc[df['Age'].isna()]

women's mean: 27.916
women's median: 27.000

men's mean: 30.727
men's median: 29.000



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Age_median'] = df['Age'].fillna(df.groupby('Sex')['Age'].transform('median'))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Age_mean'] = df['Age'].fillna(df.groupby('Sex')['Age'].transform('mean'))


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_median,Age_mean
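
Rows with a missing `Age` were dropped in 5-1, so the fills above find nothing to fill and the table is empty. The sketch below shows the same group-wise fill on a hypothetical frame that still contains missing ages, working on an explicit `.copy()` so that no `SettingWithCopyWarning` is raised. The data values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing ages still present.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# An explicit copy makes it unambiguous that new columns go on a new object.
filled = toy.copy()

# Record which rows were missing before filling anything.
filled["Age_indicator"] = filled["Age"].isna()

# transform() returns one value per row, aligned with the original index.
filled["Age_mean"] = filled["Age"].fillna(filled.groupby("Sex")["Age"].transform("mean"))
filled["Age_median"] = filled["Age"].fillna(filled.groupby("Sex")["Age"].transform("median"))
print(filled)
```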




#### 5-3) Indicator Column

In [None]:
# True where Age is missing (all False here, because those rows were dropped in 5-1)
df['Age_indicator'] = df['Age'].isna()
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Age_indicator'] = df['Age'].isna()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_median,Age_mean,Age_indicator
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,22.0,22.0,False
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,38.0,38.0,False
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,26.0,26.0,False
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,35.0,35.0,False
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,35.0,35.0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q,39.0,39.0,False
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,27.0,27.0,False
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,19.0,19.0,False
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,26.0,26.0,False


### 6) Look up MAR, MCAR, MNAR. Describe a scenario for each *(Brief answer)*

**ANSWER:**

1.   MCAR (Missing Completely At Random): The probability of a value being missing has nothing to do with the data itself. Neither observed nor unobserved values influence it.
*   scenario: A lab machine randomly malfunctions and fails to record temperature for some samples, regardless of their actual temperature or any other factor.

2.   MAR (Missing At Random): Missingness depends only on observed data, not on the missing value itself.
*   scenario: In a survey, younger people tend not to report their income. If you know their age, you can model the missing income from the relationship between age and income in the rest of the data.

3.   MNAR (Missing Not At Random): The probability of missingness depends on the unobserved value itself (or on other unmeasured factors).
*   scenario: People with very high incomes avoid reporting them, and you don't have other variables that explain this. The missingness is directly tied to the value itself.
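
The three mechanisms can be made concrete with a small simulation. This is a sketch on synthetic survey data: the `age`/`income` relationship and the missingness probabilities are assumptions chosen for illustration, not estimates from any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Synthetic survey: income rises with age, plus noise (assumed relationship).
age = rng.integers(18, 80, size=n)
income = 20_000 + 500 * age + rng.normal(0, 5_000, size=n)
survey = pd.DataFrame({"age": age, "income": income})

# MCAR: every income has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.20
# MAR: missingness depends only on the observed age (younger -> more missing).
mar_mask = rng.random(n) < np.where(survey["age"] < 30, 0.50, 0.10)
# MNAR: missingness depends on the income value itself (high earners hide it).
high_income = survey["income"] > survey["income"].quantile(0.8)
mnar_mask = rng.random(n) < np.where(high_income, 0.60, 0.05)

true_mean = survey["income"].mean()
observed_means = {
    name: survey.loc[~mask, "income"].mean()
    for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]
}
print(f"true mean income: {true_mean:,.0f}")
for name, value in observed_means.items():
    print(f"{name}: mean of observed incomes = {value:,.0f}")
```

Under MCAR the observed mean stays close to the true mean. Under MAR and MNAR it drifts, but only in the MAR case can the drift be corrected using the observed `age` column.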


## Data Cleaning – Duplicates




### 0) Download the [corrupt Titanic](https://drive.google.com/file/d/1ThKbMCgw0-UoCk3pJB-mnaw8pt8vIv99/view?usp=sharing) dataset.

### 1) Upload and read `corrupt_titanic.csv`

This dataset is a modified version of the original dataset, with some duplicates.

In [None]:
df2 = pd.read_csv("corrupt_titanic.csv")
df2

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,PassengerId
0,0,3,"Vovk, Mr. Janko",male,22.0,0,0,349252,7.8958,,S,1
1,1,3,Lam Mr Ali,male,,0,0,1601,56.4958,,S,2
2,0,1,"Cavendish, Mr. Tyrell William",male,36.0,1,0,19877,78.8500,C46,S,3
3,1,1,"Silvey, Mrs. William Baird (Alice Munger)",female,39.0,1,0,13507,55.9000,E44,Southampton,4
4,0,3,"Hart, Mr. Henry",male,,0,0,394140,6.8583,,Q,5
...,...,...,...,...,...,...,...,...,...,...,...,...
996,1,3,"Salkjelsvik, Miss. Anna Kristine",female,21.0,0,0,343120,7.6500,,S,997
997,0,1,"Cairns, Mr. Alexander",male,,0,0,113798,31.0000,,S,998
998,0,3,"Hansen, Mr. Claus Peter",male,41.0,2,0,350026,14.1083,,S,999
999,1,1,"Carter, Miss. Lucile Polk",female,14.0,1,2,113760,120.0000,B96 B98,S,1000


### 2) Basic EDA

Do a basic EDA on the corrupt dataset and compare the results with the original dataset.

In [None]:
df.info()
print("\n\n")
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 714 entries, 0 to 890
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   PassengerId    714 non-null    int64  
 1   Survived       714 non-null    int64  
 2   Pclass         714 non-null    int64  
 3   Name           714 non-null    object 
 4   Sex            714 non-null    object 
 5   Age            714 non-null    float64
 6   SibSp          714 non-null    int64  
 7   Parch          714 non-null    int64  
 8   Ticket         714 non-null    object 
 9   Fare           714 non-null    float64
 10  Cabin          185 non-null    object 
 11  Embarked       712 non-null    object 
 12  Age_median     714 non-null    float64
 13  Age_mean       714 non-null    float64
 14  Age_indicator  714 non-null    bool   
dtypes: bool(1), float64(4), int64(5), object(5)
memory usage: 84.4+ KB



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Dat

### 3) Find and drop **exact** duplicates in corrupt dataset.

In [None]:
print("before drop duplicates:", df2.shape)
df2.drop_duplicates(inplace=True)
print("after drop duplicates:", df2.shape)

before drop duplicates: (1001, 12)
after drop duplicates: (1001, 12)


### 4) Look for **near-duplicates** (hint: check `Embarked`, `Ticket`, `Name`)

In [None]:
print("before drop duplicates:", df2.shape)
df2.drop_duplicates(subset=['Name', 'Ticket', 'Embarked'], inplace=True)
print("after drop duplicates:", df2.shape)

before drop duplicates: (1001, 12)
after drop duplicates: (940, 12)
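
The cell above matches on the raw strings, so variants such as `Lam Mr Ali` or `Southampton` (both visible in the preview of `df2`) would not match their clean counterparts. A possible refinement is to normalize the three columns first and deduplicate on the normalized keys. The sketch below uses a hypothetical three-row frame. The `port_codes` mapping and the `*_key` column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows imitating the kinds of variants seen in corrupt_titanic.csv.
toy = pd.DataFrame({
    "Name": ["Lam, Mr. Ali", "Lam Mr Ali", "Hart, Mr. Henry"],
    "Ticket": ["1601", "1601", "394140"],
    "Embarked": ["S", "Southampton", "Q"],
})

# Assumed mapping from full port names to the one-letter codes.
port_codes = {"Southampton": "S", "Cherbourg": "C", "Queenstown": "Q"}

keys = pd.DataFrame({
    # Lower-case, drop punctuation, and collapse whitespace in names.
    "name_key": (toy["Name"].str.lower()
                 .str.replace(r"[^a-z0-9 ]", "", regex=True)
                 .str.replace(r"\s+", " ", regex=True)
                 .str.strip()),
    "ticket_key": toy["Ticket"].astype(str).str.strip(),
    "embarked_key": toy["Embarked"].replace(port_codes),
})

# Mark every repeat of a normalized key combination after its first occurrence.
near_dupes = keys.duplicated(keep="first")
deduped = toy.loc[~near_dupes]
print(deduped)
```

Deduplicating on separate key columns keeps the original values untouched, which makes it easier to review what was dropped.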


### **Use the main dataset for the rest of the practice**

## Data Cleaning – Outliers

### 1) Plot `Fare` using box-plot and histogram

In [None]:
#box-plot
boxplot = px.box(df, y='Fare')
boxplot.show()

#histogram
histogram = px.histogram(df, x='Fare')
histogram.show()

### 2) Do you see any outliers in `Fare`? *(Brief answer)*

**ANSWER:** Yes. A fare of about 512 lies far above the rest of the distribution.

### 3) Do the same for `Age`

In [None]:
#box-plot
boxplot = px.box(df, y='Age')
boxplot.show()

#histogram
histogram = px.histogram(df, x='Age')
histogram.show()

### 4) IQR vs Z-score methods *(Brief answer)*

**ANSWER:**
1. *IQR (Interquartile Range) Method:*<br>
Idea: Looks at the middle 50% of your data.<br>
Steps:<br>
a. Compute Q1 (25th percentile) and Q3 (75th percentile).<br>
b. IQR = Q3 − Q1<br>
c. Flag points < Q1 − 1.5×IQR or > Q3 + 1.5×IQR as outliers.<br>
Pros: ✓ Robust to extreme values ✓ Works well for skewed or non‑normal data<br>
Cons: ✗ Quartiles are unstable on very small datasets ✗ The 1.5×IQR cutoff is a rule of thumb, not universal<br>

2. *Z‑Score Method:*<br>
Idea: Measures how many standard deviations a point is from the mean.<br>
Steps:<br>
a. Compute mean (μ) and standard deviation (σ).<br>
b. Z = (x − μ) / σ<br>
c. Flag points where |Z| > 3 (common threshold) as outliers.<br>
Pros: ✓ Simple and mathematically interpretable ✓ Good for symmetric, normal‑like data<br>
Cons: ✗ Sensitive to extreme values (they shift the mean and inflate the standard deviation) ✗ Not reliable for skewed or non‑normal data<br>
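
Both sets of steps can be sketched in a few lines. The fare values below are hypothetical (a handful of ordinary fares plus one extreme value), and the thresholds 1.5 and 3 are the common defaults mentioned above.

```python
import pandas as pd

# Hypothetical fares: mostly small values plus one extreme one.
fares = pd.Series(
    [7.25, 7.9, 8.05, 10.5, 13.0, 13.0, 15.0, 20.0, 26.0, 30.0,
     35.0, 40.0, 50.0, 60.0, 512.0],
    name="Fare",
)

# IQR method: fences at 1.5 * IQR beyond the quartiles.
q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = fares[(fares < lower) | (fares > upper)]

# Z-score method: distance from the mean in units of (sample) standard deviation.
z_scores = (fares - fares.mean()) / fares.std()
z_outliers = fares[z_scores.abs() > 3]

print("IQR fences:", lower, upper)
print("IQR outliers:", iqr_outliers.tolist())
print("Z-score outliers:", z_outliers.tolist())
```

Note that the extreme value itself inflates the mean and standard deviation used by the Z-score, which is why that method can miss outliers in small or heavily skewed samples.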


## Descriptive Statistics – Survival Rates & Summary

### 1) Survival rate by **gender** and **class**, then both.

In [None]:
#by sex
print(df.groupby('Sex')['Survived'].mean(), '\n')

#by pclass
print(df.groupby('Pclass')['Survived'].mean(), '\n')

#by sex & pclass
print(df.groupby(['Sex', 'Pclass'])['Survived'].mean(), '\n')

Sex
female    0.754789
male      0.205298
Name: Survived, dtype: float64 

Pclass
1    0.655914
2    0.479769
3    0.239437
Name: Survived, dtype: float64 

Sex     Pclass
female  1         0.964706
        2         0.918919
        3         0.460784
male    1         0.396040
        2         0.151515
        3         0.150198
Name: Survived, dtype: float64 



### 2) Mean, median, mode for `Fare`

In [None]:
#mean
print("mean fare:", df['Fare'].mean())

#median
print("median fare:", df['Fare'].median())

#mode
print("mode fare:", df['Fare'].mode())

mean fare: 34.694514005602244
median fare: 15.7417
mode fare: 0    13.0
Name: Fare, dtype: float64


### 3) Min, Max, Std for `Age`

In [None]:
#max
print("max age:", df['Age'].max())

#min
print("min age:", df['Age'].min())
print("min age (days):", round(df['Age'].min() * 365))

#std
print("standard deviation age:", df['Age'].std())

max age: 80.0
min age: 0.42
min age (days): 153
standard deviation age: 14.526497332334044


### 4) Value counts of `Embarked` with percentages

In [None]:
df['Embarked'].value_counts(normalize=True)*100

Unnamed: 0_level_0,proportion
Embarked,Unnamed: 1_level_1
S,77.808989
C,18.258427
Q,3.932584


## Univariate Analysis – Numerical


### 1) Histogram + KDE for `Age`

In [84]:
import plotly.figure_factory as ff

fig = ff.create_distplot([df['Age_mean']], ["Age"])
fig.show()

### 2) Skewness & Kurtosis for `Age`

In [86]:
#skewness
print("skewness:", df['Age'].skew())

#kurtosis
print("kurtosis:", df['Age'].kurtosis())

skewness: 0.38910778230082693
kurtosis: 0.1782741536421022


### 3) Distribution types lookup *(Brief answer)*

**ANSWER:**


*   Symmetric: The left and right sides of the distribution look like mirror images. (normal distribution)
*   Skewed: The distribution is stretched more to one side. (chi-square distribution)
*   Heavy-tailed: Tails are “fatter” than a normal distribution — extreme values happen more often. (t-distribution)
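
To see how these shapes show up in numbers, the sketch below draws large random samples from the three example distributions and computes the same pandas statistics used in this notebook. The degrees of freedom (3 and 10) and the sample size are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100_000

samples = pd.DataFrame({
    "normal": rng.normal(size=n),                      # symmetric
    "chi_square_df3": rng.chisquare(df=3, size=n),     # right-skewed
    "student_t_df10": rng.standard_t(df=10, size=n),   # heavier tails than normal
})

# pandas reports excess kurtosis, so a normal sample lands near 0.
summary = pd.DataFrame({
    "skewness": samples.skew(),
    "excess_kurtosis": samples.kurtosis(),
})
print(summary.round(2))
```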



### 4) Indicate Age distribution type *(Brief answer)*

**ANSWER:**

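skewness ≈ 0.39 (mild right skew)<br>
kurtosis ≈ 0.18 (pandas reports excess kurtosis, so this is close to the normal value of 0)<br>
The distribution is roughly symmetric with a slightly longer right tail, so it looks more like a mildly skewed, gamma-like shape than a heavy-tailed one.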

### 5) Repeat for `Fare`

In [91]:
#histogram + KDE
fig = ff.create_distplot([df['Fare'].dropna()], ["Fare"])
fig.show()

#skewness
print("\nskewness:", df['Fare'].skew())

#kurtosis
print("kurtosis:", df['Fare'].kurtosis())


skewness: 4.6536303678277395
kurtosis: 30.9242490147161


## Univariate Analysis – Categorical

### 1) Bar charts of `Embarked`, `Pclass`, `Sex`

In [96]:
#Embarked
fig1 = px.bar(df['Embarked'].value_counts(), title="Embarked")
fig1.show()

#Pclass
fig2 = px.bar(df['Pclass'].apply(str).value_counts(), title="Pclass")
fig2.show()

#Sex
fig3 = px.bar(df['Sex'].value_counts(), title="Sex")
fig3.show()

## Bivariate Analysis – Numerical & Numerical

### 1) Scatter plot of `Age` vs `Fare`

In [97]:
px.scatter(df, x="Age", y="Fare").show()

### 2) Pattern interpretation *(Brief answer)*

**ANSWER:** No clear pattern between `Age` and `Fare`. A pattern may be easier to find when comparing `Fare` with `Survived`.

## Bivariate Analysis – Categorical & Numerical

### 1) Boxplot of `Age` grouped by `Survived`

In [101]:
px.box(df, y="Age", color="Survived").show()

### 2) Violin plot of `Fare` grouped by `Survived`

In [102]:
px.violin(df, y="Fare", color="Survived").show()

### 3) Histogram of `Age` grouped by `Survived`

In [103]:
px.histogram(df, x="Age", color="Survived", barmode='overlay', histnorm='density').show()

### 4) Histogram of `Fare` grouped by `Survived`

In [104]:
px.histogram(df, x="Fare", color="Survived", barmode='overlay', histnorm='density').show()

### 5) Interpretation *(Brief answer)*

**ANSWER:** Younger ages, higher fares --> higher chance of survival

## Bivariate Analysis – Categorical & Categorical

### 1) Grouped bar chart of `Sex` and `Pclass`

In [105]:
px.bar(df.groupby(['Sex', 'Pclass']).size().reset_index(name='count'), x='Pclass', y='count', color='Sex', barmode='group').show()

## Bivariate Analysis – Correlation Heatmap

### 1) Correlation matrix of numerical features

In [110]:
display(df.select_dtypes(include=np.number).corr())

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Age_median,Age_mean
PassengerId,1.0,0.02934,-0.035349,0.036847,-0.082398,-0.011617,0.009592,0.036847,0.036847
Survived,0.02934,1.0,-0.359653,-0.077221,-0.017358,0.093317,0.268189,-0.077221,-0.077221
Pclass,-0.035349,-0.359653,1.0,-0.369226,0.067247,0.025683,-0.554182,-0.369226,-0.369226
Age,0.036847,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067,1.0,1.0
SibSp,-0.082398,-0.017358,0.067247,-0.308247,1.0,0.38382,0.138329,-0.308247,-0.308247
Parch,-0.011617,0.093317,0.025683,-0.189119,0.38382,1.0,0.205119,-0.189119,-0.189119
Fare,0.009592,0.268189,-0.554182,0.096067,0.138329,0.205119,1.0,0.096067,0.096067
Age_median,0.036847,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067,1.0,1.0
Age_mean,0.036847,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067,1.0,1.0


### 2) Heatmap for the correlation matrix

In [112]:
corr = df.select_dtypes(include=np.number).corr()
px.imshow(corr, text_auto=True, aspect="auto").show()

### 3) Which variables correlate most with `Survived`? *(Brief answer)*

**ANSWER:**


*   Positive correlation: fare & survived (about 0.27)
*   Negative correlation: pclass & survived (about −0.36, the strongest in absolute value)
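
A quick way to answer this kind of question is to sort the `Survived` row of the correlation matrix by absolute value, so that strong negative correlations are not overlooked. The sketch below copies the values from the `Survived` row of the matrix above into a standalone Series. The variable names are illustrative.

```python
import pandas as pd

# Correlations with Survived, copied from the correlation matrix output above.
corr_with_survived = pd.Series({
    "PassengerId": 0.02934,
    "Pclass": -0.359653,
    "Age": -0.077221,
    "SibSp": -0.017358,
    "Parch": 0.093317,
    "Fare": 0.268189,
})

# Rank by strength, ignoring the sign.
ranked = corr_with_survived.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

In the notebook itself the same Series could be obtained with `corr["Survived"].drop("Survived")`.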



## Trends & Patterns – Hypothesis Generation

### 1) Grouped bar plots for survival rate by **Pclass** and **Sex**

In [114]:
pivot = df.pivot_table(values="Survived", index="Pclass", columns="Sex", aggfunc="mean")

pivot = (pivot*100).round(2).reset_index().melt(id_vars="Pclass", var_name="Sex", value_name="Survival_%")

px.bar(pivot, x="Pclass", y="Survival_%", color="Sex", barmode="group").show()

---
## Notes
- Keep your narrative answers concise.
- When you choose a cleaning/imputation method, comment on rationale and