# Complete EDA: Bivariate & Multivariate Analysis
**Source:** CampusX Day 21 (Video Notes) + Supplemental Data Science Theory

--- 
### Overview
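In bivariate and multivariate EDA, we ask: "How does variable A behave when variable B changes?" The right plot depends on the types of the two variables:
1. **Numerical - Numerical**: Scatter Plots, Line Plots.
2. **Numerical - Categorical**: Bar Plots, Box Plots, KDE Plots (the older `distplot` is deprecated in current seaborn; use `kdeplot` or `histplot`).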
3. **Categorical - Categorical**: Crosstabs, Heatmaps, Clustermaps.
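
The mapping above can be sketched as a small helper that looks at the dtypes of two columns and returns candidate plots. The function name `suggest_plots` and the toy frame are hypothetical and only for illustration. The check is based on dtype alone, so an integer-coded category such as `pclass` would be treated as numerical unless it is converted first.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def suggest_plots(df, col_a, col_b):
    """Return candidate plot types for a pair of columns, based on dtype only."""
    a_num = is_numeric_dtype(df[col_a])
    b_num = is_numeric_dtype(df[col_b])
    if a_num and b_num:
        return ["scatter plot", "line plot"]
    if a_num or b_num:
        return ["bar plot", "box plot", "KDE plot"]
    return ["crosstab", "heatmap", "clustermap"]


# Tiny made-up frame (not the tips dataset), only to exercise the helper.
toy = pd.DataFrame({
    "total_bill": [20.0, 12.5, 31.0],
    "tip": [3.0, 2.0, 5.0],
    "day": ["Sun", "Sun", "Sat"],
    "smoker": ["No", "No", "Yes"],
})

print(suggest_plots(toy, "total_bill", "tip"))
print(suggest_plots(toy, "total_bill", "day"))
print(suggest_plots(toy, "day", "smoker"))
```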

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load video datasets
tips = sns.load_dataset('tips')
titanic = sns.load_dataset('titanic')
flights = sns.load_dataset('flights')
iris = sns.load_dataset('iris')

## 1. Numerical - Numerical Analysis
### Scatter Plot (Bivariate & Multivariate)
- **Video Note:** We plot `total_bill` against `tip`; the points suggest a roughly linear, positive relationship.
- **Multivariate:** Adding `hue='sex'`, `style='smoker'`, and `size='size'` (party size) on top of the x and y axes lets one plot show 5 dimensions at once.

**Pro-Tip (The 'Why'):** Look for **Heteroscedasticity**: the vertical spread of the dots changes as the X-value changes. In the tips data, the variation in `tip` grows as `total_bill` gets larger. This matters because ordinary linear regression assumes the spread stays constant.

In [None]:
# Bivariate to Multivariate transition
plt.figure(figsize=(10,6))
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='sex', style='smoker', size='size')
plt.title("Tips Dataset: Multivariate Scatter Plot")
plt.show()
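
The heteroscedasticity described above can also be checked numerically, not just by eye. A minimal sketch, assuming equal-width bins of the x-variable are a reasonable way to compare spread: the helper `spread_by_bin` is hypothetical, and the data here is synthetic (built so that the noise grows with x), not the tips dataset.

```python
import numpy as np
import pandas as pd


def spread_by_bin(df, x, y, bins=4):
    """Standard deviation of y within equal-width bins of x."""
    x_bins = pd.cut(df[x], bins=bins)
    return df.groupby(x_bins, observed=True)[y].std()


# Synthetic data (not the tips dataset): noise in y grows with x by construction.
rng = np.random.default_rng(0)
bill = rng.uniform(5, 50, size=500)
tip = 0.15 * bill + rng.normal(0, 0.05 * bill)
synthetic = pd.DataFrame({"total_bill": bill, "tip": tip})

spread = spread_by_bin(synthetic, "total_bill", "tip")
print(spread)
```

On the real data, the same call would be `spread_by_bin(tips, 'total_bill', 'tip')`. A standard deviation that rises from the first bin to the last supports what the scatter plot suggests.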

## 2. Numerical - Categorical Analysis
### A. Bar Plot (Aggregated Values)
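- **Video Note:** By default, `sns.barplot` shows the mean of `y` for each category. The black line on top of each bar is the error bar, which by default is a 95% **Confidence Interval** for that mean, estimated by bootstrapping.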
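
To make the Confidence Interval on a bar less of a black box, here is a sketch of a percentile bootstrap for the mean. It illustrates the idea only and is not seaborn's internal code. The function name `bootstrap_mean_ci`, its parameters, and the ages are all made up for the example.

```python
import numpy as np


def bootstrap_mean_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing values
    rng = np.random.default_rng(seed)
    # Resample with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return values.mean(), low, high


# Made-up ages, only to exercise the function.
ages = [22, 35, 41, 19, 54, 28, 33, 47, 25, 38]
mean, low, high = bootstrap_mean_ci(ages)
print(f"mean={mean:.1f}, 95% CI=({low:.1f}, {high:.1f})")
```

A short error bar means the group mean is estimated precisely; a long one usually means few observations or a wide spread in that group.
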
### B. Box Plot (Distribution)
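- **Video Note:** Shows the median, the quartiles (Q1, Q3), and the whiskers, which together approximate the 5-number summary. By default the whiskers stop at the last point within 1.5 × IQR of the box, and anything beyond them is drawn individually as an outlier.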
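
The numbers behind a box plot can be computed directly. This is a minimal sketch assuming the common 1.5 × IQR whisker rule; `box_stats` is a hypothetical helper and the input values are made up.

```python
import numpy as np


def box_stats(values, whis=1.5):
    """Quartiles, whisker ends, and outliers under the whis * IQR rule."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing values
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    # Whiskers end at the most extreme points that are still inside the fences.
    inside = values[(values >= low_fence) & (values <= high_fence)]
    outliers = values[(values < low_fence) | (values > high_fence)]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": inside.min(), "whisker_high": inside.max(),
        "outliers": outliers.tolist(),
    }


# Made-up values with one obvious outlier.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats)
```

Here the value 100 falls outside the upper fence, so the upper whisker stops at 9 and 100 is reported separately, which is what the box plot draws as an isolated point.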

**Deep Dive (Missing in Video): Violin Plots**
Violin plots are a hybrid of a Box Plot and a KDE. They show where the data is most concentrated (the 'fat' part of the violin).

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15, 5))

# Bar Plot from video
sns.barplot(data=titanic, x='pclass', y='age', hue='sex', ax=ax[0])
ax[0].set_title("Average Age per Class & Gender")

# Violin Plot (The Extra Insight)
sns.violinplot(data=titanic, x='pclass', y='fare', hue='survived', split=True, ax=ax[1])
ax[1].set_ylim(0, 300) # Capping fare for clarity
ax[1].set_title("Fare Density vs Survival")

plt.show()

### C. Distplot / KDE Plot
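- **Video Note:** Comparing the age distribution of passengers who survived with that of passengers who did not.
- **Insight:** For children (Age < 10), the orange curve (Survived) sits above the blue curve (Not Survived). Each curve is normalized separately, so this means children make up a larger share of survivors than of non-survivors, which is consistent with "women and children first".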

In [None]:
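plt.figure(figsize=(10,5))
# Each call draws the density of age within one group.
sns.kdeplot(titanic[titanic['survived']==0]['age'], label='Not Survived')
sns.kdeplot(titanic[titanic['survived']==1]['age'], label='Survived')
# The curves are densities of age, not survival probabilities.
plt.title("Age Distribution by Survival")
plt.legend()
plt.show()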
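
KDE curves compare the shape of two age distributions, but they do not give a survival rate directly. To read the rate itself, one option is to cut age into bands and take the mean of `survived` in each band. The helper `rate_by_age_band`, the band edges, and the rows below are made up for illustration; they are not the Titanic data.

```python
import pandas as pd


def rate_by_age_band(df, edges):
    """Share of survivors within each age band (left-closed intervals)."""
    clean = df.dropna(subset=["age"])  # rows without an age cannot be banded
    bands = pd.cut(clean["age"], bins=edges, right=False)
    return clean.groupby(bands, observed=True)["survived"].mean()


# Made-up rows, only to exercise the function.
toy = pd.DataFrame({
    "age": [4, 8, 9, 25, 30, 40, 45, 60, None],
    "survived": [1, 1, 0, 0, 1, 0, 0, 0, 1],
})
rates = rate_by_age_band(toy, edges=[0, 10, 30, 60, 100])
print(rates)
```

On the real data, `rate_by_age_band(titanic, edges=[0, 10, 30, 60, 100])` would give one survival rate per band, which can be compared with the overall rate `titanic['survived'].mean()`.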

## 3. Categorical - Categorical Analysis
### Heatmaps & Clustermaps
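- **Video Note:** Use `pd.crosstab` to build a matrix of counts for each pair of categories, then pass it to `sns.heatmap`.
- **Clustermap:** Reorders rows and columns with hierarchical clustering so that similar ones sit next to each other. The **Dendrograms** (the tree-like structures on the side) show the order in which they were merged.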

**Pro-Tip:** Clustermaps are useful for discovering "Hidden Segments": groups of rows or columns that behave similarly.
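
The row reordering behind a clustermap can be sketched with SciPy's hierarchical clustering. This assumes average linkage on Euclidean distance, which to my understanding matches seaborn's defaults. The matrix is made up, and the snippet only illustrates the idea, not seaborn's own implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

# Made-up matrix: rows 0 and 2 are similar, rows 1 and 3 are similar.
matrix = np.array([
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],
    [1.0, 2.0, 4.0],
    [10.0, 21.0, 30.0],
])

# Build the merge tree (the dendrogram) from distances between rows.
tree = linkage(matrix, method="average", metric="euclidean")

# Leaf order of the tree: similar rows end up next to each other.
row_order = leaves_list(tree).tolist()
print(row_order)
```

Reindexing the matrix with `matrix[row_order]` places the two similar pairs side by side, which is what makes blocks of similar color visible in the plotted clustermap.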

In [None]:
# Heatmap of Survival counts per class
ct = pd.crosstab(titanic['pclass'], titanic['survived'])
sns.heatmap(ct, annot=True, fmt='d', cmap='YlGnBu')
plt.title("Pclass vs Survival Count")
plt.show()

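# Clustermap from Video (Flights dataset)
# Reshape to a month x year matrix; observed=False is stated explicitly because month is categorical
pivot_flights = flights.pivot_table(values='passengers', index='month', columns='year', observed=False)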
sns.clustermap(pivot_flights)
plt.show()
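
Raw counts in a crosstab are hard to compare when the groups differ in size. A small sketch using the `normalize='index'` option of `pd.crosstab`, which turns each row into proportions. The rows below are made up for illustration, not the Titanic data.

```python
import pandas as pd

# Made-up rows, only to show the two kinds of table.
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Counts per (pclass, survived) pair.
counts = pd.crosstab(toy["pclass"], toy["survived"])

# Same table, but each row is divided by its row total.
rates = pd.crosstab(toy["pclass"], toy["survived"], normalize="index")

print(counts)
print(rates.round(2))
```

When plotting the normalized table with `sns.heatmap`, use a float format such as `fmt='.2f'` in place of `fmt='d'`, since the cells are no longer integers.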

## 4. Automation: Pairplot
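- **Video Note:** Generates a scatter plot for every pair of numerical columns. The diagonal shows each column's own distribution: histograms by default, or KDE curves when `hue` is set (as in the cell below).
- **Constraint:** The grid has one panel per pair of columns, so the number of plots grows with the square of the column count. On wide or very large datasets, select a few columns or sample rows first.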

In [None]:
sns.pairplot(iris, hue='species')
plt.show()
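
One way to respect the constraint above is to shrink the frame before plotting. A minimal sketch: `subset_for_pairplot` is a hypothetical helper, the row and column limits are arbitrary, and the wide frame is random data generated only for the demonstration.

```python
import numpy as np
import pandas as pd


def subset_for_pairplot(df, hue=None, max_rows=1000, max_cols=5, seed=0):
    """Keep at most max_cols numeric columns (plus hue) and max_rows sampled rows."""
    numeric_cols = [c for c in df.select_dtypes(include="number").columns if c != hue]
    keep = numeric_cols[:max_cols] + ([hue] if hue is not None else [])
    subset = df[keep]
    if len(subset) > max_rows:
        subset = subset.sample(n=max_rows, random_state=seed)
    return subset


# Random demonstration data: 5000 rows, 8 numeric columns, 1 label column.
rng = np.random.default_rng(0)
wide = pd.DataFrame(rng.normal(size=(5000, 8)), columns=[f"f{i}" for i in range(8)])
wide["label"] = rng.choice(["a", "b"], size=len(wide))

small = subset_for_pairplot(wide, hue="label")
print(small.shape)
```

The reduced frame can then be passed on as `sns.pairplot(small, hue='label')`. Taking the first few numeric columns is a placeholder choice; in practice, pick the columns you actually want to compare.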

## 5. Lineplot (Time Series)
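- **Video Note:** Use for data ordered in time (Year/Month).
- **Insight:** The flights data has a steady upward trend and a seasonal cycle within each year. The yearly totals plotted below show only the trend; the seasonality appears when the data is kept at the monthly level.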

In [None]:
new_df = flights.groupby('year').sum(numeric_only=True).reset_index()
sns.lineplot(data=new_df, x='year', y='passengers')
plt.title("Yearly Growth in Passengers")
plt.show()
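
Trend and seasonality can be separated with two groupbys: totals per year for the trend, and the average per month across years for the seasonal profile. The frame below is made up and only shaped like `flights` (`year`, `month`, `passengers`); the numbers are not real.

```python
import pandas as pd

# Made-up long-format data shaped like flights.
toy = pd.DataFrame({
    "year": [1949] * 4 + [1950] * 4,
    "month": ["Jan", "Apr", "Jul", "Oct"] * 2,
    "passengers": [100, 120, 150, 110, 120, 140, 180, 130],
})

# Trend: one total per year.
yearly_total = toy.groupby("year")["passengers"].sum()

# Seasonality: average per month across years (sort=False keeps calendar order).
monthly_mean = toy.groupby("month", sort=False)["passengers"].mean()

print(yearly_total)
print(monthly_mean)
```

On the real data, `sns.lineplot(data=flights, x='year', y='passengers', hue='month')` draws one line per month, which should show both patterns in a single plot.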