# Day 4: Data Visualization & Exploratory Data Analysis (EDA)

Data visualization is how you spot patterns, trends, and outliers in a dataset before you model it. Today we'll cover:

1. **Matplotlib Fundamentals**: Line plots, bar charts, subplots
2. **Seaborn**: Distribution plots, heatmaps, pair plots
3. **Choosing the Right Visualization**: When to use what
4. **Assignment**: Complete EDA on Titanic with 10+ visualizations

---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')

# Set styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Default figure size
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['figure.dpi'] = 100

print(f"Matplotlib Version: {plt.matplotlib.__version__}")
print(f"Seaborn Version: {sns.__version__}")
print("Ready for Data Visualization!")

In [None]:
# Load datasets
titanic = sns.load_dataset('titanic')
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')

print("Datasets loaded:")
print(f"  - Titanic: {titanic.shape}")
print(f"  - Tips: {tips.shape}")
print(f"  - Iris: {iris.shape}")

---

## 1. Matplotlib Fundamentals

Matplotlib is the base plotting library that Seaborn and pandas plotting build on. Let's explore its core concepts.
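
As a minimal sketch of those core concepts: every plot lives on an `Axes` object inside a `Figure`. The `plt.*` calls used in the cells below act on the "current" Axes implicitly, while `plt.subplots()` hands you both objects so you can call methods on them directly. The data and labels here are illustrative only.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# Explicit interface: create the Figure and Axes, then call methods on the Axes
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_title("Explicit Axes interface")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()

# The implicit pyplot interface would target this same Axes
print(plt.gca() is ax)
```

The explicit style becomes more useful once a figure has several panels, as in the subplots section.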

### 1.1 Line Plots

In [None]:
# Basic line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.figure(figsize=(10, 5))
plt.plot(x, y)
plt.title('Simple Sine Wave', fontsize=14, fontweight='bold')
plt.xlabel('X values', fontsize=12)
plt.ylabel('sin(x)', fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Multiple lines with customization
plt.figure(figsize=(12, 6))

plt.plot(x, np.sin(x), 'b-', linewidth=2, label='sin(x)')
plt.plot(x, np.cos(x), 'r--', linewidth=2, label='cos(x)')
plt.plot(x, np.sin(x) * np.cos(x), 'g-.', linewidth=2, label='sin(x)*cos(x)')

plt.title('Trigonometric Functions', fontsize=14, fontweight='bold')
plt.xlabel('X', fontsize=12)
plt.ylabel('Y', fontsize=12)
plt.legend(loc='upper right', fontsize=11)
plt.xlim(0, 10)
plt.ylim(-1.5, 1.5)
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Line plot with markers
x_points = np.arange(0, 11)
y_points = x_points ** 2

plt.figure(figsize=(10, 6))
plt.plot(x_points, y_points, 'ro-', markersize=10, linewidth=2, 
         markerfacecolor='yellow', markeredgecolor='red', markeredgewidth=2)

# Add annotations
for i, (xi, yi) in enumerate(zip(x_points, y_points)):
    if i % 2 == 0:  # Label every other point
        plt.annotate(f'({xi}, {yi})', (xi, yi), textcoords='offset points', 
                     xytext=(0, 10), ha='center', fontsize=9)

plt.title('X² Function with Annotations', fontsize=14, fontweight='bold')
plt.xlabel('X', fontsize=12)
plt.ylabel('X²', fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

### 1.2 Bar Charts

In [None]:
# Simple bar chart
categories = ['A', 'B', 'C', 'D', 'E']
values = [23, 45, 56, 78, 32]

plt.figure(figsize=(10, 6))
bars = plt.bar(categories, values, color='steelblue', edgecolor='black', alpha=0.7)

# Add value labels on top of bars
for bar, val in zip(bars, values):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, 
             str(val), ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.title('Simple Bar Chart', fontsize=14, fontweight='bold')
plt.xlabel('Category', fontsize=12)
plt.ylabel('Value', fontsize=12)
plt.show()

In [None]:
# Horizontal bar chart
plt.figure(figsize=(10, 6))
colors = plt.cm.viridis(np.linspace(0, 1, len(categories)))
plt.barh(categories, values, color=colors, edgecolor='black')

for i, val in enumerate(values):
    plt.text(val + 1, i, str(val), va='center', fontsize=11)

plt.title('Horizontal Bar Chart', fontsize=14, fontweight='bold')
plt.xlabel('Value', fontsize=12)
plt.ylabel('Category', fontsize=12)
plt.show()

In [None]:
# Grouped bar chart
categories = ['Q1', 'Q2', 'Q3', 'Q4']
product_a = [20, 35, 30, 35]
product_b = [25, 32, 34, 20]
product_c = [15, 25, 28, 30]

x = np.arange(len(categories))
width = 0.25

plt.figure(figsize=(12, 6))
plt.bar(x - width, product_a, width, label='Product A', color='#3498db')
plt.bar(x, product_b, width, label='Product B', color='#2ecc71')
plt.bar(x + width, product_c, width, label='Product C', color='#e74c3c')

plt.title('Quarterly Sales by Product', fontsize=14, fontweight='bold')
plt.xlabel('Quarter', fontsize=12)
plt.ylabel('Sales', fontsize=12)
plt.xticks(x, categories)
plt.legend()
plt.show()

In [None]:
# Stacked bar chart
plt.figure(figsize=(12, 6))

plt.bar(categories, product_a, label='Product A', color='#3498db')
plt.bar(categories, product_b, bottom=product_a, label='Product B', color='#2ecc71')
plt.bar(categories, product_c, bottom=np.array(product_a) + np.array(product_b), 
        label='Product C', color='#e74c3c')

plt.title('Quarterly Sales (Stacked)', fontsize=14, fontweight='bold')
plt.xlabel('Quarter', fontsize=12)
plt.ylabel('Total Sales', fontsize=12)
plt.legend()
plt.show()

### 1.3 Subplots

In [None]:
# Basic subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Plot 1: Line plot
x = np.linspace(0, 10, 100)
axes[0, 0].plot(x, np.sin(x), 'b-', linewidth=2)
axes[0, 0].set_title('Sine Wave', fontsize=12)
axes[0, 0].set_xlabel('X')
axes[0, 0].set_ylabel('sin(x)')

# Plot 2: Bar chart
categories = ['A', 'B', 'C', 'D']
values = [4, 7, 2, 5]
axes[0, 1].bar(categories, values, color='green', alpha=0.7)
axes[0, 1].set_title('Bar Chart', fontsize=12)

# Plot 3: Scatter plot
np.random.seed(42)
x_scatter = np.random.randn(50)
y_scatter = np.random.randn(50)
axes[1, 0].scatter(x_scatter, y_scatter, c='red', alpha=0.6, s=100)
axes[1, 0].set_title('Scatter Plot', fontsize=12)
axes[1, 0].set_xlabel('X')
axes[1, 0].set_ylabel('Y')

# Plot 4: Histogram
data = np.random.normal(0, 1, 1000)
axes[1, 1].hist(data, bins=30, color='purple', alpha=0.7, edgecolor='black')
axes[1, 1].set_title('Histogram', fontsize=12)
axes[1, 1].set_xlabel('Value')
axes[1, 1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
# Subplots with different sizes using GridSpec
from matplotlib.gridspec import GridSpec

fig = plt.figure(figsize=(14, 8))
gs = GridSpec(2, 3, figure=fig)

# Large plot on left
ax1 = fig.add_subplot(gs[:, 0])
ax1.plot(np.random.randn(100).cumsum(), 'b-', linewidth=2)
ax1.set_title('Random Walk', fontsize=12)

# Two plots on top right
ax2 = fig.add_subplot(gs[0, 1])
ax2.bar(['A', 'B', 'C'], [3, 5, 2], color='green')
ax2.set_title('Bar Chart', fontsize=12)

ax3 = fig.add_subplot(gs[0, 2])
ax3.scatter(np.random.randn(30), np.random.randn(30), c='red')
ax3.set_title('Scatter', fontsize=12)

# Wide plot on bottom right
ax4 = fig.add_subplot(gs[1, 1:])
ax4.hist(np.random.randn(500), bins=30, color='purple', alpha=0.7)
ax4.set_title('Histogram', fontsize=12)

plt.tight_layout()
plt.show()

### 1.4 Other Matplotlib Charts

In [None]:
# Pie chart
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Basic pie chart
sizes = [30, 25, 20, 15, 10]
labels = ['Product A', 'Product B', 'Product C', 'Product D', 'Product E']
colors = plt.cm.Set3(np.linspace(0, 1, len(sizes)))
explode = (0.05, 0, 0, 0, 0)

axes[0].pie(sizes, labels=labels, colors=colors, explode=explode,
            autopct='%1.1f%%', shadow=True, startangle=90)
axes[0].set_title('Market Share', fontsize=12, fontweight='bold')

# Donut chart
axes[1].pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%',
            pctdistance=0.85, startangle=90)
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
axes[1].add_artist(centre_circle)
axes[1].set_title('Donut Chart', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Area plot
x = np.arange(1, 11)
y1 = np.random.randint(10, 30, 10)
y2 = np.random.randint(20, 40, 10)
y3 = np.random.randint(10, 25, 10)

plt.figure(figsize=(12, 6))
plt.stackplot(x, y1, y2, y3, labels=['Series 1', 'Series 2', 'Series 3'],
              colors=['#3498db', '#2ecc71', '#e74c3c'], alpha=0.7)
plt.title('Stacked Area Plot', fontsize=14, fontweight='bold')
plt.xlabel('X', fontsize=12)
plt.ylabel('Y', fontsize=12)
plt.legend(loc='upper left')
plt.show()

In [None]:
# Box plot with Matplotlib
np.random.seed(42)
data = [np.random.normal(0, std, 100) for std in range(1, 5)]

plt.figure(figsize=(10, 6))
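bp = plt.boxplot(data, patch_artist=True)
# Set tick labels separately; the `labels` keyword was renamed `tick_labels` in Matplotlib 3.9
plt.xticks(range(1, len(data) + 1), ['Std=1', 'Std=2', 'Std=3', 'Std=4'])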

# Customize box colors
colors = ['lightblue', 'lightgreen', 'lightyellow', 'lightpink']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)

plt.title('Box Plot - Different Standard Deviations', fontsize=14, fontweight='bold')
plt.ylabel('Value', fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()
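
To read a box plot it helps to know what each element encodes. With Matplotlib's defaults, the box spans the first to third quartile, the line inside is the median, and the whiskers reach to the most extreme points within 1.5 × IQR of the box; anything beyond that is drawn as an outlier. A small sketch of that arithmetic on made-up numbers:

```python
import numpy as np

sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])  # made-up data with one extreme value

q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Points outside these fences are plotted individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Outliers: {outliers}")
```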

---

## 2. Seaborn Visualizations

Seaborn builds on Matplotlib and provides a high-level interface for statistical graphics.
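
In practice, "builds on Matplotlib" means two things. Axes-level functions such as `histplot` or `boxplot` draw onto a Matplotlib `Axes` and return it, so you can keep styling the result with ordinary Matplotlib calls. Figure-level functions such as `pairplot`, `jointplot`, `lmplot`, and `clustermap` create their own figure and return a grid object instead. A minimal sketch of the axes-level case, using a small made-up DataFrame rather than one of the loaded datasets:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data for illustration only
demo = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 2.5, 3.0, 4.0],
})

fig, ax = plt.subplots(figsize=(6, 4))
returned = sns.boxplot(data=demo, x="group", y="value", ax=ax)

# The returned object is the same Axes we passed in, so Matplotlib methods apply
ax.set_title("Styled with plain Matplotlib calls")
ax.set_ylabel("Value (arbitrary units)")
print(returned is ax)
```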

### 2.1 Distribution Plots

In [None]:
# Histogram with KDE
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.histplot(titanic['age'].dropna(), kde=True, bins=30, color='steelblue')
plt.title('Age Distribution with KDE', fontsize=12)

plt.subplot(1, 2, 2)
sns.kdeplot(data=titanic, x='age', hue='survived', fill=True, alpha=0.5)
plt.title('Age Distribution by Survival', fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
# Distribution plot with different aesthetics
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Histogram only
sns.histplot(tips['total_bill'], bins=25, ax=axes[0, 0], color='blue')
axes[0, 0].set_title('Histogram', fontsize=12)

# KDE only
sns.kdeplot(tips['total_bill'], ax=axes[0, 1], color='red', fill=True)
axes[0, 1].set_title('KDE Plot', fontsize=12)

# ECDF (Empirical Cumulative Distribution Function)
sns.ecdfplot(tips['total_bill'], ax=axes[1, 0], color='green')
axes[1, 0].set_title('ECDF Plot', fontsize=12)

# Rug plot with KDE
sns.kdeplot(tips['total_bill'], ax=axes[1, 1], color='purple')
sns.rugplot(tips['total_bill'], ax=axes[1, 1], color='purple', alpha=0.5)
axes[1, 1].set_title('KDE with Rug Plot', fontsize=12)

plt.tight_layout()
plt.show()
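
An ECDF answers the question "what fraction of observations are at or below this value?". The idea can be sketched by hand with NumPy; the values below are made up and are not taken from the tips dataset.

```python
import numpy as np

values = np.array([12.0, 7.5, 20.0, 9.0, 15.0])  # made-up bill amounts, all distinct

x_sorted = np.sort(values)
# With distinct values, the i-th sorted point (1-based) has i of n observations at or below it
ecdf = np.arange(1, len(x_sorted) + 1) / len(x_sorted)

for x_val, frac in zip(x_sorted, ecdf):
    print(f"{x_val:>5.1f} -> {frac:.1f}")
```

Unlike a histogram, the ECDF needs no bin count, which makes it a handy cross-check when a histogram's shape changes a lot with `bins`.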

In [None]:
# Box and Violin plots
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Box plot
sns.boxplot(data=titanic, x='pclass', y='age', hue='sex', ax=axes[0])
axes[0].set_title('Age by Class and Sex (Box Plot)', fontsize=12)

# Violin plot
sns.violinplot(data=titanic, x='pclass', y='age', hue='sex', split=True, ax=axes[1])
axes[1].set_title('Age by Class and Sex (Violin Plot)', fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
# Swarm and Strip plots
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Strip plot
sns.stripplot(data=tips, x='day', y='total_bill', hue='sex', 
              dodge=True, alpha=0.6, ax=axes[0])
axes[0].set_title('Total Bill by Day (Strip Plot)', fontsize=12)

# Swarm plot
sns.swarmplot(data=tips, x='day', y='total_bill', hue='sex', 
              dodge=True, ax=axes[1])
axes[1].set_title('Total Bill by Day (Swarm Plot)', fontsize=12)

plt.tight_layout()
plt.show()

### 2.2 Categorical Plots

In [None]:
# Count plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.countplot(data=titanic, x='pclass', hue='survived', ax=axes[0])
axes[0].set_title('Survival Count by Class', fontsize=12)
axes[0].legend(title='Survived', labels=['No', 'Yes'])

sns.countplot(data=titanic, y='embark_town', hue='pclass', ax=axes[1])
axes[1].set_title('Passengers by Embarkation Town', fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
# Bar plot (shows mean and confidence interval)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.barplot(data=titanic, x='pclass', y='survived', hue='sex', ax=axes[0])
axes[0].set_title('Survival Rate by Class and Sex', fontsize=12)
axes[0].set_ylabel('Survival Rate')

sns.barplot(data=tips, x='day', y='total_bill', hue='sex', ax=axes[1])
axes[1].set_title('Average Bill by Day and Sex', fontsize=12)

plt.tight_layout()
plt.show()
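
By default, `sns.barplot` plots the mean of `y` within each category and draws a 95% confidence interval obtained by bootstrapping. The sketch below reproduces both pieces by hand on a tiny made-up table, so the numbers are illustrative and not Titanic results.

```python
import numpy as np
import pandas as pd

# Made-up data: 1 = survived, 0 = did not
df = pd.DataFrame({
    "group": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "survived": [1, 1, 0, 1, 0, 0, 1, 0],
})

# Bar height: the mean of y within each x category
means = df.groupby("group")["survived"].mean()
print(means)

# Error bar: percentile bootstrap interval for the mean of group "a"
rng = np.random.default_rng(0)
values = df.loc[df["group"] == "a", "survived"].to_numpy()
boot_means = [rng.choice(values, size=len(values), replace=True).mean()
              for _ in range(1000)]
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for group a: [{ci_low:.2f}, {ci_high:.2f}]")
```

Because `survived` is coded 0/1, its mean is the survival rate, which is why the bar plot above can be read as a rate chart.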

In [None]:
# Point plot (good for showing trends)
plt.figure(figsize=(10, 6))
sns.pointplot(data=titanic, x='pclass', y='survived', hue='sex', 
              markers=['o', 's'], linestyles=['-', '--'])
plt.title('Survival Rate by Class and Sex', fontsize=14, fontweight='bold')
plt.ylabel('Survival Rate')
plt.show()

### 2.3 Heatmaps

In [None]:
# Correlation heatmap
numeric_cols = titanic.select_dtypes(include=[np.number]).columns
correlation = titanic[numeric_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0,
            fmt='.2f', linewidths=0.5, square=True)
plt.title('Correlation Heatmap - Titanic Dataset', fontsize=14, fontweight='bold')
plt.show()

In [None]:
# Pivot table heatmap
pivot_table = titanic.pivot_table(values='survived', 
                                   index='pclass', 
                                   columns='sex', 
                                   aggfunc='mean')

plt.figure(figsize=(8, 5))
sns.heatmap(pivot_table, annot=True, cmap='RdYlGn', fmt='.2%',
            linewidths=2, cbar_kws={'label': 'Survival Rate'})
plt.title('Survival Rate by Class and Sex', fontsize=14, fontweight='bold')
plt.show()

In [None]:
# Clustered heatmap
iris_numeric = iris.drop('species', axis=1)

# clustermap is figure-level: it creates its own figure, so no plt.figure() call is needed
g = sns.clustermap(iris_numeric.corr(), annot=True, cmap='viridis', 
                   figsize=(8, 8), linewidths=0.5)
g.fig.suptitle('Clustered Correlation Heatmap - Iris', y=1.02)
plt.show()

### 2.4 Pair Plots and Joint Plots

In [None]:
# Pair plot
g = sns.pairplot(iris, hue='species', height=2.5, 
                 plot_kws={'alpha': 0.6, 's': 50})
g.fig.suptitle('Iris Dataset - Pair Plot', y=1.02, fontsize=14, fontweight='bold')
plt.show()

In [None]:
# Pair plot with different diagonal
g = sns.pairplot(iris, hue='species', diag_kind='kde', height=2.5,
                 corner=True)  # Only lower triangle
plt.show()

In [None]:
# Joint plot
g = sns.jointplot(data=iris, x='sepal_length', y='petal_length', 
                  hue='species', height=8)
g.fig.suptitle('Sepal vs Petal Length', y=1.02, fontsize=14)
plt.show()

In [None]:
# Different joint plot kinds
# jointplot is figure-level: each call creates its own figure, so no plt.subplots() grid is needed

# Scatter with marginal histograms
g1 = sns.jointplot(data=tips, x='total_bill', y='tip', kind='scatter', height=5)
g1.fig.suptitle('Scatter', y=1.02)

# Regression
g2 = sns.jointplot(data=tips, x='total_bill', y='tip', kind='reg', height=5)
g2.fig.suptitle('Regression', y=1.02)

# Hexbin (for large datasets)
g3 = sns.jointplot(data=tips, x='total_bill', y='tip', kind='hex', height=5)
g3.fig.suptitle('Hexbin', y=1.02)

plt.show()

### 2.5 Regression Plots

In [None]:
# Linear regression plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.regplot(data=tips, x='total_bill', y='tip', ax=axes[0],
            scatter_kws={'alpha': 0.5})
axes[0].set_title('Tip vs Total Bill (Linear)', fontsize=12)

# With confidence interval
sns.regplot(data=tips, x='total_bill', y='tip', ax=axes[1],
            ci=99, scatter_kws={'alpha': 0.5}, line_kws={'color': 'red'})
axes[1].set_title('Tip vs Total Bill (99% CI)', fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
# lmplot - faceted regression
g = sns.lmplot(data=tips, x='total_bill', y='tip', col='time', row='smoker',
               height=4, aspect=1.2, scatter_kws={'alpha': 0.5})
g.fig.suptitle('Tip vs Total Bill by Time and Smoking', y=1.02, fontsize=14)
plt.show()

---

## 3. Choosing the Right Visualization

| Analysis Goal | Visualization | When to Use |
|-----------|---------------|-------------|
| **Distribution (1 var)** | Histogram, KDE, Box plot | Understanding spread and shape |
| **Relationship (2 vars)** | Scatter, Line, Regression | Finding correlations |
| **Comparison (categories)** | Bar, Count, Box | Comparing groups |
| **Composition** | Pie, Stacked bar | Showing parts of whole |
| **Correlation (matrix)** | Heatmap | Multiple variable relationships |
| **Multiple features** | Pair plot | Exploring all relationships |
| **Time series** | Line plot | Trends over time |
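
One way to read the table is as a lookup keyed on the kinds of variables you have. The helper below is a hypothetical rule-of-thumb sketch, not a library function, and the kind names (`'numeric'`, `'categorical'`, `'datetime'`) are assumptions made for illustration.

```python
def suggest_plots(x_kind, y_kind=None):
    """Rule-of-thumb chart suggestions (hypothetical helper).

    Kinds are 'numeric', 'categorical', or 'datetime'.
    """
    kinds = (x_kind, y_kind)
    if y_kind is None:
        # One variable: look at its distribution
        if x_kind == "numeric":
            return ["histogram", "KDE", "box plot"]
        return ["count plot"]
    if x_kind == "datetime" and y_kind == "numeric":
        return ["line plot"]
    if x_kind == "numeric" and y_kind == "numeric":
        return ["scatter plot", "regression plot"]
    if "categorical" in kinds and "numeric" in kinds:
        return ["bar plot", "box plot"]
    # Fallback, e.g. two categorical variables
    return ["count plot with hue"]


print(suggest_plots("numeric"))
print(suggest_plots("categorical", "numeric"))
print(suggest_plots("datetime", "numeric"))
```

Treat the output as a starting point; the right chart still depends on the question you are asking of the data.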

---

## 4. Assignment: Complete EDA on Titanic Dataset

Create a comprehensive EDA report with 10+ visualizations.

In [None]:
# Reload and prepare data
titanic = sns.load_dataset('titanic')

print("=" * 70)
print("EXPLORATORY DATA ANALYSIS: TITANIC DATASET")
print("=" * 70)
print(f"\nDataset Shape: {titanic.shape}")
print(f"Features: {list(titanic.columns)}")

print("\nFirst 5 rows:")
titanic.head()

In [None]:
# Data summary
print("\nDataset Info:")
titanic.info()  # info() prints its report directly and returns None

print("\nMissing Values:")
print(titanic.isnull().sum()[titanic.isnull().sum() > 0])

print("\nNumerical Summary:")
titanic.describe()

### Visualization 1: Overall Survival Rate

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
survived_counts = titanic['survived'].value_counts().sort_index()  # order 0, 1 matches the labels below
colors = ['#e74c3c', '#2ecc71']
axes[0].bar(['Did not survive', 'Survived'], survived_counts.values, color=colors)
for i, v in enumerate(survived_counts.values):
    axes[0].text(i, v + 10, f'{v}\n({v/len(titanic)*100:.1f}%)', ha='center', fontsize=11)
axes[0].set_title('Survival Count', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Count')

# Pie chart
axes[1].pie(survived_counts.values, labels=['Did not survive', 'Survived'], 
            colors=colors, autopct='%1.1f%%', explode=(0, 0.05),
            shadow=True, startangle=90)
axes[1].set_title('Survival Rate', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print(f"\nOverall Survival Rate: {titanic['survived'].mean()*100:.1f}%")

### Visualization 2: Survival by Gender

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
sns.countplot(data=titanic, x='sex', hue='survived', ax=axes[0], palette=['#e74c3c', '#2ecc71'])
axes[0].set_title('Survival Count by Gender', fontsize=14, fontweight='bold')
axes[0].legend(title='Survived', labels=['No', 'Yes'])

# Survival rate bar plot
survival_by_sex = titanic.groupby('sex')['survived'].mean() * 100
colors = ['#3498db', '#e74c3c']
bars = axes[1].bar(survival_by_sex.index, survival_by_sex.values, color=colors)
for bar, rate in zip(bars, survival_by_sex.values):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, 
                 f'{rate:.1f}%', ha='center', fontsize=12, fontweight='bold')
axes[1].set_title('Survival Rate by Gender', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Survival Rate (%)')
axes[1].set_ylim(0, 100)

plt.tight_layout()
plt.show()

print("\nKey Insight: Women had a significantly higher survival rate (74.2%) than men (18.9%)")

### Visualization 3: Survival by Passenger Class

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
sns.countplot(data=titanic, x='pclass', hue='survived', ax=axes[0], palette=['#e74c3c', '#2ecc71'])
axes[0].set_title('Survival Count by Class', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Passenger Class')
axes[0].legend(title='Survived', labels=['No', 'Yes'])

# Survival rate
survival_by_class = titanic.groupby('pclass')['survived'].mean() * 100
colors = ['#f1c40f', '#bdc3c7', '#cd7f32']
bars = axes[1].bar(['1st Class', '2nd Class', '3rd Class'], survival_by_class.values, color=colors)
for bar, rate in zip(bars, survival_by_class.values):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, 
                 f'{rate:.1f}%', ha='center', fontsize=12, fontweight='bold')
axes[1].set_title('Survival Rate by Class', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Survival Rate (%)')
axes[1].set_ylim(0, 80)

plt.tight_layout()
plt.show()

print("\nKey Insight: First class had the highest survival rate (63%), third class the lowest (24%)")

### Visualization 4: Survival by Class and Gender (Combined)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Grouped bar plot
sns.barplot(data=titanic, x='pclass', y='survived', hue='sex', ax=axes[0])
axes[0].set_title('Survival Rate by Class and Gender', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Survival Rate')
axes[0].set_xlabel('Passenger Class')

# Heatmap
pivot = titanic.pivot_table(values='survived', index='sex', columns='pclass', aggfunc='mean')
sns.heatmap(pivot, annot=True, cmap='RdYlGn', fmt='.1%', ax=axes[1],
            linewidths=2, cbar_kws={'label': 'Survival Rate'})
axes[1].set_title('Survival Rate Heatmap', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("\nKey Insight: Female first-class passengers had 96.8% survival rate!")

### Visualization 5: Age Distribution

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Overall age distribution
sns.histplot(titanic['age'].dropna(), bins=30, kde=True, ax=axes[0, 0], color='steelblue')
axes[0, 0].axvline(titanic['age'].mean(), color='red', linestyle='--', label=f"Mean: {titanic['age'].mean():.1f}")
axes[0, 0].axvline(titanic['age'].median(), color='green', linestyle='--', label=f"Median: {titanic['age'].median():.1f}")
axes[0, 0].set_title('Age Distribution', fontsize=12)
axes[0, 0].legend()

# Age by survival
sns.histplot(data=titanic, x='age', hue='survived', kde=True, ax=axes[0, 1],
             palette=['#e74c3c', '#2ecc71'], alpha=0.6)
axes[0, 1].set_title('Age Distribution by Survival', fontsize=12)

# Box plot by class
sns.boxplot(data=titanic, x='pclass', y='age', ax=axes[1, 0], palette='Set2')
axes[1, 0].set_title('Age by Passenger Class', fontsize=12)

# Violin plot by survival
sns.violinplot(data=titanic, x='survived', y='age', ax=axes[1, 1], palette=['#e74c3c', '#2ecc71'])
axes[1, 1].set_title('Age by Survival (Violin)', fontsize=12)
axes[1, 1].set_xticklabels(['Did not survive', 'Survived'])

plt.tight_layout()
plt.show()

### Visualization 6: Fare Analysis

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Fare distribution
sns.histplot(titanic['fare'], bins=50, kde=True, ax=axes[0, 0], color='purple')
axes[0, 0].set_title('Fare Distribution', fontsize=12)

# Log-transformed fare
sns.histplot(np.log1p(titanic['fare']), bins=30, kde=True, ax=axes[0, 1], color='orange')
axes[0, 1].set_title('Log(Fare) Distribution', fontsize=12)

# Fare by class
sns.boxplot(data=titanic, x='pclass', y='fare', ax=axes[1, 0], palette='coolwarm')
axes[1, 0].set_title('Fare by Passenger Class', fontsize=12)

# Fare vs Survival
sns.violinplot(data=titanic, x='survived', y='fare', ax=axes[1, 1], palette=['#e74c3c', '#2ecc71'])
axes[1, 1].set_title('Fare by Survival', fontsize=12)
axes[1, 1].set_xticklabels(['Did not survive', 'Survived'])

plt.tight_layout()
plt.show()

print(f"\nFare Statistics:")
print(f"  Mean: ${titanic['fare'].mean():.2f}")
print(f"  Median: ${titanic['fare'].median():.2f}")
print(f"  Max: ${titanic['fare'].max():.2f}")
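
The log-transformed panel uses `np.log1p`, which computes log(1 + x). That keeps a fare of 0 at 0 instead of sending it to negative infinity, and it compresses the long right tail. A small sketch of the effect on made-up fares (not the Titanic column):

```python
import numpy as np
import pandas as pd

fares = pd.Series([0.0, 7.25, 8.05, 13.0, 26.0, 31.0, 80.0, 512.33])  # made-up, right-skewed

log_fares = np.log1p(fares)  # log(1 + x), so a fare of 0 maps to 0

print(f"Skewness before: {fares.skew():.2f}")
print(f"Skewness after:  {log_fares.skew():.2f}")
```

A skewness closer to 0 after the transform means the distribution is more symmetric, which is usually easier to read in a histogram.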

### Visualization 7: Embarkation Analysis

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Count by embarkation
sns.countplot(data=titanic, x='embark_town', order=['Southampton', 'Cherbourg', 'Queenstown'],
              ax=axes[0], palette='Set2')
axes[0].set_title('Passengers by Embarkation', fontsize=12)

# Survival by embarkation
sns.barplot(data=titanic, x='embark_town', y='survived', 
            order=['Southampton', 'Cherbourg', 'Queenstown'], ax=axes[1], palette='Set2')
axes[1].set_title('Survival Rate by Embarkation', fontsize=12)
axes[1].set_ylabel('Survival Rate')

# Class distribution by embarkation
sns.countplot(data=titanic, x='embark_town', hue='pclass',
              order=['Southampton', 'Cherbourg', 'Queenstown'], ax=axes[2])
axes[2].set_title('Class by Embarkation', fontsize=12)

plt.tight_layout()
plt.show()

### Visualization 8: Family Size Analysis

In [None]:
# Create family size feature
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Family size distribution
sns.countplot(data=titanic, x='family_size', ax=axes[0], palette='viridis')
axes[0].set_title('Family Size Distribution', fontsize=12)

# Survival by family size
family_survival = titanic.groupby('family_size')['survived'].mean()
axes[1].bar(family_survival.index, family_survival.values, color='teal')
axes[1].set_title('Survival Rate by Family Size', fontsize=12)
axes[1].set_xlabel('Family Size')
axes[1].set_ylabel('Survival Rate')
axes[1].axhline(y=titanic['survived'].mean(), color='red', linestyle='--', label='Overall Rate')
axes[1].legend()

plt.tight_layout()
plt.show()

print("\nKey Insight: Small families (2-4) had better survival rates than solo travelers or large families")

### Visualization 9: Correlation Heatmap

In [None]:
# Select numeric columns for correlation
numeric_cols = ['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare', 'family_size']
correlation = titanic[numeric_cols].corr()

plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(correlation, dtype=bool))  # Upper triangle mask
sns.heatmap(correlation, mask=mask, annot=True, cmap='coolwarm', center=0,
            fmt='.2f', linewidths=0.5, square=True,
            cbar_kws={'label': 'Correlation Coefficient'})
plt.title('Correlation Matrix - Titanic Features', fontsize=14, fontweight='bold')
plt.show()

print("\nCorrelations with survival (strongest first, by absolute value):")
print(correlation['survived'].drop('survived').sort_values(key=abs, ascending=False))
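
Each cell of the heatmap is a Pearson correlation coefficient, the default for `DataFrame.corr()`. As a sketch of what that number means, here it is computed by hand and checked against NumPy on made-up values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up values
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Center both variables, then divide the summed products by the product of their spreads
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

r_numpy = np.corrcoef(x, y)[0, 1]
print(f"manual: {r_manual:.3f}, numpy: {r_numpy:.3f}")
```

Pearson's r only measures linear association, so a value near 0 (as with age and survival) does not rule out a non-linear relationship.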

### Visualization 10: Age vs Fare Scatter (with survival)

In [None]:
plt.figure(figsize=(12, 8))

# Scatter plot colored by survival (marker size is constant)
survived = titanic[titanic['survived'] == 1]
not_survived = titanic[titanic['survived'] == 0]

plt.scatter(not_survived['age'], not_survived['fare'], 
            c='red', alpha=0.5, s=50, label='Did not survive')
plt.scatter(survived['age'], survived['fare'], 
            c='green', alpha=0.5, s=50, label='Survived')

plt.xlabel('Age', fontsize=12)
plt.ylabel('Fare', fontsize=12)
plt.title('Age vs Fare (colored by Survival)', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

### Visualization 11 (Bonus): Pair Plot of Key Features

In [None]:
# Select key features for pair plot
key_features = titanic[['age', 'fare', 'pclass', 'survived']].dropna()

g = sns.pairplot(key_features, hue='survived', 
                 palette=['#e74c3c', '#2ecc71'], 
                 diag_kind='kde', height=2.5,
                 plot_kws={'alpha': 0.5})
g.fig.suptitle('Pair Plot: Age, Fare, Class by Survival', y=1.02, fontsize=14, fontweight='bold')
plt.show()

### Visualization 12 (Bonus): FacetGrid - Comprehensive View

In [None]:
g = sns.FacetGrid(titanic, col='pclass', row='sex', hue='survived',
                  height=3.5, aspect=1.2, palette=['#e74c3c', '#2ecc71'])
g.map(sns.histplot, 'age', bins=20, alpha=0.6)
g.add_legend(title='Survived')
g.fig.suptitle('Age Distribution by Class, Sex, and Survival', y=1.02, fontsize=14, fontweight='bold')
plt.show()

### EDA Summary

In [None]:
print("=" * 70)
print("EDA SUMMARY: KEY FINDINGS")
print("=" * 70)

print("""
1. OVERALL SURVIVAL: 38.4% of passengers survived the disaster.

2. GENDER IMPACT: 
   - Women: 74.2% survival rate
   - Men: 18.9% survival rate
   - Consistent with a "women and children first" evacuation policy

3. CLASS IMPACT:
   - 1st Class: 63.0% survival rate
   - 2nd Class: 47.3% survival rate  
   - 3rd Class: 24.2% survival rate
   - Higher class = higher survival probability

4. BEST SURVIVAL COMBINATION:
   - Female + 1st Class = 96.8% survival
   - Male + 3rd Class = 13.5% survival

5. AGE INSIGHTS:
   - Average age: ~30 years
   - Children had slightly better survival rates
   - No strong linear relationship between age and survival

6. FARE INSIGHTS:
   - Higher fares correlated with higher survival
   - Fare distribution is heavily right-skewed
   
7. FAMILY SIZE:
   - Small families (2-4) had best survival rates
   - Solo travelers and large families had lower rates

8. EMBARKATION:
   - Cherbourg had highest survival rate (55%)
   - Southampton had most passengers but lower rate (34%)
""")

print("=" * 70)
print("VISUALIZATION COUNT: 12 visualizations created")
print("=" * 70)

---

## Summary

Today you learned:

### Matplotlib
- Line plots with customization
- Bar charts (vertical, horizontal, grouped, stacked)
- Subplots and GridSpec for complex layouts
- Pie charts, area plots, and box plots

### Seaborn
- Distribution plots: histplot, kdeplot, boxplot, violinplot
- Categorical plots: countplot, barplot, pointplot
- Heatmaps for correlation matrices
- Pair plots for multivariate analysis
- Joint plots for bivariate relationships
- Regression plots

### EDA Best Practices
1. Start with data overview and missing values
2. Examine distributions of key variables
3. Explore relationships with target variable
4. Look for interactions between features
5. Create visualizations that tell a story
6. Document key findings
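
Step 1 is easy to turn into a reusable helper. The function below is a hypothetical sketch (the name `missing_report` is not from any library) that summarizes missing values per column; the demo table is made up.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Return count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have gaps
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


demo = pd.DataFrame({
    "age": [22, np.nan, 35, np.nan],
    "fare": [7.25, 71.28, np.nan, 8.05],
    "sex": ["m", "f", "f", "m"],
})
print(missing_report(demo))
```

Calling `missing_report(titanic)` at the start of an analysis gives the same information as the "Missing Values" cell above, with percentages added.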

## Next Steps

Tomorrow (Day 5), we'll dive into **Statistics for ML** - the mathematical foundation!

---

**Great job completing Day 4!**