# Module 2.3 – Exploratory Data Analysis (EDA)

Before we build machine learning models, we must first **understand our data**.

EDA is the process of:
- Looking at the data
- Summarizing it
- Visualizing it
- Asking: *What is normal? What is weird? What relates to what?*

Skipping EDA is like doing **andha ML** (blind machine learning).


## Step 0: Import Required Libraries

We use:
- NumPy for numerical operations
- Pandas for data handling
- Matplotlib & Seaborn for visualization


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# sns.set(style="whitegrid")
# plt.rcParams["figure.figsize"] = (8, 4)


## Step 1: Simple Dataset (for Learning)

We create a **synthetic dataset** so that:
- Patterns are clear
- Visualizations are easy to understand
- You can focus on concepts, not messy data

This dataset mimics a **house price prediction** problem.


In [None]:
np.random.seed(42)
N = 500

age = np.random.normal(35, 8, N).round().astype(int)
income = np.random.exponential(50000, N).round().astype(int)
area = np.random.uniform(500, 2500, N)
bedrooms = np.random.choice([1,2,3,4], N)
distance = np.random.uniform(1, 25, N)

city = np.random.choice(["Delhi", "Mumbai", "Bangalore"], N, p=[0.4,0.35,0.25])
segment = np.random.choice(["Economy", "Mid", "Premium"], N, p=[0.5,0.35,0.15])

price = (
    2000 * area *
    np.where(city=="Mumbai",1.5,np.where(city=="Delhi",1.3,1.1)) *
    np.where(segment=="Premium",1.6,np.where(segment=="Mid",1.25,1.0))
    - 15000*np.log(distance+1)
    + np.random.normal(0,100000,N)
)


df = pd.DataFrame({
    "age": age,
    "income": income,
    "area_sqft": area,
    "bedrooms": bedrooms,
    "distance_km": distance,
    "city": city,
    "segment": segment,
    "house_price": price,
})

df.head(3)


## Step 2: Understand Dataset Structure

Before plotting anything, always check:
- Shape (rows × columns)
- Data types
- Missing values
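
`df.info()` reports non-null counts, but an explicit per-column count of missing values is often easier to read. The sketch below uses a small hypothetical frame (`toy`, not the dataset from Step 1) with two deliberately missing entries to show how that count can be obtained.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({
    "area_sqft": [850.0, np.nan, 1200.0, 1600.0],
    "city": ["Delhi", "Mumbai", None, "Delhi"],
})

# isna() marks missing cells as True; sum() counts them per column
missing_counts = toy.isna().sum()
print(missing_counts)
```

The same call, `df.isna().sum()`, should work on the main dataset; there it is expected to show zeros because the synthetic data has no gaps.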


In [None]:
df.shape

In [None]:
df.info()

## Step 3: Summary Statistics

`.describe()` gives a quick numerical summary:
- Mean, std
- Min, max
- Quartiles

This is the fastest way to understand numeric features.


In [None]:
df.describe()

## Step 4: Central Tendency

We study:
- Mean → average value
- Median → middle value
- Mode → most frequent value

Comparing mean and median gives hints about skewness.
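
To see why the mean/median comparison hints at skewness, here is a minimal sketch with made-up income values (an assumption for illustration only). One very large value pulls the mean upward while the median barely moves.

```python
import pandas as pd

# Hypothetical incomes: most are modest, one is very large (right skew)
incomes = pd.Series([30_000, 35_000, 40_000, 45_000, 500_000])

mean_income = incomes.mean()      # pulled upward by the single large value
median_income = incomes.median()  # middle value, barely affected

print("mean:  ", mean_income)
print("median:", median_income)
```

A mean well above the median suggests right skew; a mean well below it suggests left skew.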


In [None]:
df[["age","income"]].mean()

In [None]:
df[["age","income"]].median()

In [None]:
df["bedrooms"].mode()

## Step 5: Variance and Standard Deviation

These measure **spread**:
- Small std → values are tightly packed
- Large std → values are widely spread


Same mean does NOT mean same behavior.

Imagine you know the average of something (the mean), but you also want to know: "How much do individual values usually deviate from that average?"

Two datasets can have the same mean but different variance. For example:
- Dataset A: {2, 4, 6}
- Dataset B: {1, 4, 7}

Both have a mean of 4, but the values in B lie farther from the mean, so its variance is larger.
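
The two small datasets above can be checked directly. This sketch uses `ddof=1` (sample variance), which is an assumption chosen to match the default used by pandas `.std()` and `.var()`.

```python
import numpy as np

a = np.array([2, 4, 6])  # Dataset A
b = np.array([1, 4, 7])  # Dataset B

# Both means are identical
mean_a, mean_b = a.mean(), b.mean()

# Sample variance (ddof=1): sum of squared deviations divided by n - 1
var_a = a.var(ddof=1)
var_b = b.var(ddof=1)

print("means:    ", mean_a, mean_b)
print("variances:", var_a, var_b)
```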

In [None]:
df[["age","income"]].std()

## Step 6: Skewness

Skewness tells us:
- Is the distribution symmetric?
- Or stretched to one side?

Right skew → few very large values  
Left skew → few very small values
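
The sign of `.skew()` matches the direction of the long tail. The sketch below draws three hypothetical samples (the distributions and the seed are assumptions for illustration) and compares their skewness.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Exponential data has a long right tail
right_skewed = pd.Series(rng.exponential(scale=1.0, size=5000))
# Normal data is roughly symmetric
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=5000))
# Mirroring the right-skewed sample gives a long left tail
left_skewed = -right_skewed

print("right:    ", right_skewed.skew())  # positive
print("symmetric:", symmetric.skew())     # close to zero
print("left:     ", left_skewed.skew())   # negative
```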


In [None]:
df[["age","income"]].skew()

## Step 7: Histogram (Univariate Visualization)

Histograms show:
- Distribution shape
- Skewness
- Multiple peaks

This is the first plot you should draw for numeric data.


In [None]:
df[["age","income"]].hist(bins=30)
plt.tight_layout()

## Step 8: KDE / Density Plot

A KDE plot is a smoothed version of a histogram.
It makes it easier to compare the shapes of distributions.


In [None]:
sns.kdeplot(df["income"], fill=True)
plt.title("Income Distribution")

## Step 9: Box Plot (Outlier Detection)

Box plots show:
- Median
- IQR (middle 50%)
- Outliers

Great for detecting extreme values quickly.
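
The points a box plot draws as outliers usually follow the 1.5 × IQR rule, which is the default whisker setting in seaborn and matplotlib. A minimal sketch of that rule on made-up values (an assumption for illustration) could look like this:

```python
import pandas as pd

# Hypothetical values with one obvious extreme
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1  # spread of the middle 50%

# Fences: anything outside is flagged as a potential outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```

Flagged points are candidates for inspection, not automatically errors; in skewed data such as income, large values can be perfectly legitimate.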


In [None]:
sns.boxplot(x=df["income"])
plt.title("Income Outliers")

## Step 10: Categorical Distribution

Bar / count plots show:
- Category frequencies
- Imbalance
- Rare classes
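
The same information can be read as numbers. The sketch below uses a small hypothetical series of city labels (not the dataset from Step 1) and `value_counts(normalize=True)` to get each category's share.

```python
import pandas as pd

# Hypothetical labels: 6 Delhi, 3 Mumbai, 1 Bangalore
cities = pd.Series(["Delhi"] * 6 + ["Mumbai"] * 3 + ["Bangalore"] * 1)

# normalize=True returns proportions instead of raw counts
shares = cities.value_counts(normalize=True)
print(shares)
```

A category with a very small share is a rare class and may need special handling later.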


In [None]:
sns.countplot(x="city", data=df)

## Step 11: Scatter Plot

Scatter plots show how two numeric variables move together.
They reveal linear or non-linear relationships.


In [None]:
sns.scatterplot(x="area_sqft", y="house_price", data=df)

## Step 12: Scatter with Regression Line

The regression line shows the **average trend**.
It is useful for visually judging whether the relationship is linear.
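
To put a number on the trend, a straight line can be fitted by least squares. The sketch below uses `np.polyfit` on hypothetical data (the coefficient 2000, the noise level, and the seed are assumptions); it illustrates the idea of a fitted line and is not a description of seaborn's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: price grows linearly with area, plus noise
area = rng.uniform(500, 2500, 200)
price = 2000 * area + rng.normal(0, 100_000, 200)

# Degree-1 least-squares fit: price ≈ slope * area + intercept
slope, intercept = np.polyfit(area, price, deg=1)
print("estimated slope:", slope)
```

The estimated slope should land close to the 2000 used to generate the data.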


In [None]:
sns.regplot(x="area_sqft", y="house_price", data=df)

## Step 13: Numeric vs Categorical (Box)

These plots compare the distribution of a numeric variable across categories.


In [None]:
sns.boxplot(x="city", y="house_price", data=df)

## Step 14: Categorical vs Categorical

A crosstab combined with a stacked bar chart shows conditional relationships: how one categorical variable is distributed within each level of another.
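
Raw counts can hide the conditional view when groups differ in size. As a sketch on a small hypothetical frame (`toy` is an assumption, not the dataset from Step 1), `normalize="index"` turns each row into proportions that sum to 1.

```python
import pandas as pd

# Hypothetical listings: 4 in Delhi, 2 in Mumbai
toy = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai"],
    "segment": ["Economy", "Economy", "Mid", "Premium", "Economy", "Premium"],
})

# Share of each segment within each city (rows sum to 1)
row_shares = pd.crosstab(toy["city"], toy["segment"], normalize="index")
print(row_shares)
```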


In [None]:
pd.crosstab(df["city"], df["segment"]).plot(kind="bar", stacked=True)

## Step 15: Correlation Analysis

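Correlation (Pearson, the default in `df.corr()`) measures the strength of **linear** relationships; a value near 0 does not rule out a non-linear one.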
Heatmaps help visualize all correlations at once.
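
Because Pearson correlation captures only linear association, a strong non-linear relationship can score near zero. The sketch below uses constructed values (an assumption for illustration) to compare a perfectly linear and a perfectly quadratic dependence.

```python
import numpy as np
import pandas as pd

# Evenly spaced values, symmetric around zero
x = pd.Series(np.linspace(-3, 3, 61))

linear = 2 * x + 1   # exact linear dependence
quadratic = x ** 2   # exact, but non-linear, dependence

r_linear = x.corr(linear)
r_quadratic = x.corr(quadratic)

print("linear:   ", r_linear)     # essentially 1
print("quadratic:", r_quadratic)  # essentially 0
```

This is why scatter plots remain worth checking even when a heatmap shows weak correlation.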


In [None]:
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")

In [None]:
df.corr(numeric_only=True)["house_price"].sort_values(ascending=False)

## Step 16: Pair Plot (Multivariate EDA)

Pair plots show all pairwise relationships together.
Useful for quick exploration.


In [None]:
sns.pairplot(df[["age","income","area_sqft","house_price"]])

## Step 17: Interaction Effects

Grouped statistics help detect interactions
(e.g., effect of bedrooms depends on city).
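
To make the idea concrete, here is a sketch on a tiny hypothetical table (`toy`, with prices in arbitrary units, all assumed for illustration). If the jump in mean price from 1 to 3 bedrooms differs by city, that difference is the interaction.

```python
import pandas as pd

# Hypothetical prices in arbitrary units
toy = pd.DataFrame({
    "city": ["Delhi"] * 4 + ["Mumbai"] * 4,
    "bedrooms": [1, 1, 3, 3, 1, 1, 3, 3],
    "house_price": [20, 22, 24, 26, 30, 32, 50, 52],
})

# Mean price per (city, bedrooms), with bedrooms spread into columns
means = toy.groupby(["city", "bedrooms"])["house_price"].mean().unstack()

# Effect of going from 1 to 3 bedrooms, computed separately per city
bedroom_effect = means[3] - means[1]
print(bedroom_effect)
```

Here the effect is much larger in one city than the other, which is the pattern an interaction heatmap helps to spot.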


In [None]:
grouped = df.groupby(["city","bedrooms"])["house_price"].mean().unstack()
sns.heatmap(grouped, annot=True, cmap="YlGnBu")

# END!