
# 🧠 Module 5: Exploratory Data Analysis (EDA)
This notebook covers the key concepts of EDA:
- Data types and variables
- Central tendency and dispersion
- Five-point summary and skewness
- Box plot
- Covariance and correlation
- Encoding
- Scaling and normalization
- Pre-processing: handling missing values
- Working with outliers

We'll use the `pandas`, `numpy`, `matplotlib`, `seaborn`, and `scikit-learn` libraries for this analysis.


In [None]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler

# Load a sample dataset
df = sns.load_dataset('titanic')
df.head()


## 📊 1. Data Types and Variables

In [None]:

# Display the data type of each column
print(df.dtypes)

# Summarize columns, non-null counts, and memory usage
df.info()


## 📈 2. Central Tendency and Dispersion

In [None]:

# Central tendency
print("Mean:\n", df.mean(numeric_only=True))
print("Median:\n", df.median(numeric_only=True))
# mode() can return several rows when values tie; keep the first row
print("Mode:\n", df.mode().iloc[0])

# Dispersion (pandas uses the sample formulas by default, ddof=1)
print("Standard Deviation:\n", df.std(numeric_only=True))
print("Variance:\n", df.var(numeric_only=True))
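
To make the arithmetic behind these summary statistics concrete, here is a minimal sketch using only Python's standard library. The `values` list is made up for illustration and is not taken from the Titanic data.

```python
import statistics

# Made-up sample for illustration (not from the Titanic dataset)
values = [2, 4, 4, 5, 10]

# Central tendency
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Dispersion: variance() and stdev() divide by n - 1,
# the same sample convention pandas uses by default (ddof=1)
variance = statistics.variance(values)
std_dev = statistics.stdev(values)

print("mean:", mean, "median:", median, "mode:", mode)
print("variance:", variance, "std dev:", std_dev)
```

The mean (5) is larger than the median (4) because the single large value 10 pulls it upward. The median is less sensitive to extreme values.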


## 📐 3. Five-Point Summary and Skewness

In [None]:

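# Five-point summary: min, 25%, 50% (median), 75%, max
# describe() also reports count, mean, and std
print(df.describe())

# Skewness (positive = longer right tail, negative = longer left tail)
df.skew(numeric_only=True)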
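
The sketch below works through a five-point summary and a skewness value by hand on two made-up lists. The helper names `five_number_summary` and `sample_skewness` are hypothetical. The skewness formula is the bias-corrected (adjusted Fisher-Pearson) estimator, which is assumed here to be the one pandas reports. Quartiles use linear interpolation (`method="inclusive"`), assumed to match the pandas default.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using linear interpolation."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

def sample_skewness(values):
    """Bias-corrected (adjusted Fisher-Pearson) skewness; needs n >= 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5
    return g1 * (n * (n - 1)) ** 0.5 / (n - 2)

# Made-up samples for illustration
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 3, 4, 100]

print(five_number_summary(right_tailed))
print(sample_skewness(symmetric), sample_skewness(right_tailed))
```

The symmetric list has zero skewness, while the list with one very large value has positive skewness, which corresponds to a long right tail.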


## 📦 4. Box Plot

In [None]:

sns.boxplot(x=df["age"])
plt.title("Box Plot of Age")
plt.show()


## 🔗 5. Covariance and Correlation

In [None]:

# pandas computes sample covariance and Pearson correlation by default
print("Covariance:\n", df.cov(numeric_only=True))
print("Correlation:\n", df.corr(numeric_only=True))

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()
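
Correlation is covariance rescaled by the two standard deviations, so it always falls between -1 and 1. A minimal sketch of that relationship follows. The helper names `sample_covariance` and `pearson_correlation` and the lists are hypothetical, chosen for illustration.

```python
import math

def sample_covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def pearson_correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    std_x = math.sqrt(sample_covariance(x, x))  # cov(x, x) is the variance
    std_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (std_x * std_y)

# Made-up data for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # y = 2x, perfectly increasing with x
z = [10, 8, 6, 4, 2]   # perfectly decreasing with x

print(sample_covariance(x, y))
print(pearson_correlation(x, y), pearson_correlation(x, z))
```

Covariance depends on the units of the variables, so its magnitude is hard to compare across column pairs. Correlation removes that dependence.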


## 🔤 6. Encoding

In [None]:

# Label Encoding 'sex' column
le = LabelEncoder()
df['sex_encoded'] = le.fit_transform(df['sex'])
df[['sex', 'sex_encoded']].head()
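
As a rough illustration of what label encoding does, the sketch below maps each category to an integer in sorted order, which is how `LabelEncoder` orders its classes. It also shows one-hot encoding, a common alternative for categories with no natural order. The helpers `label_encode` and `one_hot_encode` and the `ports` list are hypothetical.

```python
def label_encode(values):
    """Map each category to an integer code, in sorted class order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], classes

def one_hot_encode(values):
    """Return one 0/1 indicator per class for every value."""
    classes = sorted(set(values))
    return [[int(v == c) for c in classes] for v in values], classes

# Made-up category values for illustration
ports = ["S", "C", "Q", "S"]

codes, classes = label_encode(ports)
one_hot, _ = one_hot_encode(ports)

print(classes, codes)
print(one_hot)
```

Integer codes imply an ordering (here C < Q < S) that may not be meaningful. One-hot encoding avoids that at the cost of adding one column per category.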


## ⚖️ 7. Scaling and Normalization

In [None]:

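# Standardization (z-score): mean 0, standard deviation 1.
# 'age' still has missing values here; the scaler leaves them as NaN.
scaler = StandardScaler()
df['age_scaled'] = scaler.fit_transform(df[['age']])

# Min-max normalization: rescales 'fare' to the range [0, 1]
minmax = MinMaxScaler()
df['fare_normalized'] = minmax.fit_transform(df[['fare']])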

df[['age', 'age_scaled', 'fare', 'fare_normalized']].head()
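
The two transformations reduce to simple formulas: z = (x - mean) / std for standardization and (x - min) / (max - min) for min-max normalization. The sketch below applies both by hand. The helper names and the sample list are hypothetical. It uses the population standard deviation (dividing by n), which is what `StandardScaler` uses.

```python
import statistics

def standardize(values):
    """Z-score: subtract the mean, divide by the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # divides by n, not n - 1
    return [(v - mean) / std for v in values]

def min_max_normalize(values):
    """Rescale values linearly so the min maps to 0 and the max to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Made-up sample for illustration
sample = [10, 20, 30, 40, 50]

z_scores = standardize(sample)
normalized = min_max_normalize(sample)

print(z_scores)
print(normalized)
```

Standardization keeps the shape of the distribution but has no fixed range. Min-max normalization guarantees the range [0, 1] but is sensitive to extreme values, since a single outlier sets the maximum.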


## 🔍 8. Pre-processing: Handling Missing Values

In [None]:

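# Check missing values per column
print(df.isnull().sum())

# Fill missing age with the median and missing embarked with the mode.
# Assigning the result back avoids pandas' chained-assignment warning.
df['age'] = df['age'].fillna(df['age'].median())
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])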
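
The idea behind median and mode imputation can be sketched without pandas: compute the fill value from the observed entries only, then substitute it for the missing ones. The helpers `impute_median` and `impute_mode` and the sample lists are hypothetical, with `None` standing in for a missing value.

```python
import statistics

def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Replace None with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]

# Made-up data for illustration; None marks a missing entry
ages = [22, None, 38, 26, None]
ports = ["S", None, "C", "S"]

print(impute_median(ages))
print(impute_mode(ports))
```

The median is a common choice for numeric columns because it is not pulled by extreme values. The mode is the usual choice for categorical columns, where a mean or median is undefined.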


## ⚠️ 9. Working with Outliers

In [None]:

# Detect outliers using the IQR method
Q1 = df['fare'].quantile(0.25)
Q3 = df['fare'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
is_outlier = (df['fare'] < lower) | (df['fare'] > upper)
outliers = df[is_outlier]
print("Outliers in Fare:\n", outliers[['fare']])

# Optional: Remove outliers
df_clean = df[~is_outlier]
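
The IQR rule flags any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. The small worked example below makes the fences explicit. The helper `iqr_fences`, its `k` parameter, and the `fares` list are hypothetical. Quartiles use linear interpolation, assumed to match the pandas default.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences of the IQR outlier rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up fares for illustration
fares = [1, 2, 3, 4, 100]

lower, upper = iqr_fences(fares)
outliers = [v for v in fares if v < lower or v > upper]
kept = [v for v in fares if lower <= v <= upper]

print("fences:", lower, upper)
print("outliers:", outliers, "kept:", kept)
```

Here Q1 = 2 and Q3 = 4, so IQR = 2 and the fences are -1 and 7. Only the value 100 falls outside them.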


## 🏠 10. Homework / Exercise


- **Task 1:** Create a box plot for the 'fare' column and describe the distribution.
- **Task 2:** Encode the 'embarked' column and check its value counts.
- **Task 3:** Normalize the 'age' column using MinMaxScaler and visualize it with a histogram.
- **Task 4:** Drop rows with missing values and compare the dataset shape before and after.
- **Task 5:** Compute and plot a correlation heatmap for the cleaned dataset.
