# 📊 Matplotlib & Seaborn Review for ML/DL

**Goal:** Review visualization for EDA and model evaluation

**Contents:**
- Matplotlib basics & customization
- Seaborn statistical plots
- Distribution plots
- Relationship plots
- Model evaluation plots
- Subplots & layouts

**Level:** Intermediate

---

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11

print(f"Matplotlib: {plt.matplotlib.__version__}")
print(f"Seaborn: {sns.__version__}")

# Generate sample data
np.random.seed(42)
n = 500
df = pd.DataFrame({
    'feature1': np.random.randn(n),
    'feature2': np.random.randn(n) * 2 + 1,
    'feature3': np.random.exponential(2, n),
    'target': np.random.binomial(1, 0.3, n),
    'category': np.random.choice(['A', 'B', 'C'], n)
})
df['feature2'] += df['feature1'] * 0.5  # Add correlation

---

## 1. Matplotlib Basics

### Figure & Axes anatomy

In [None]:
# Two ways to create plots

# Method 1: pyplot (simple, quick)
plt.plot([1, 2, 3], [1, 4, 9])
plt.title('Simple Plot')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

# Method 2: Object-oriented (recommended for complex plots)
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot([1, 2, 3], [1, 4, 9], marker='o', linewidth=2, label='Quadratic')
ax.set_title('Better Plot', fontsize=14, fontweight='bold')
ax.set_xlabel('X', fontsize=12)
ax.set_ylabel('Y', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("✅ Prefer the object-oriented API (fig, ax) for better control")
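
One practical reason to prefer the object-oriented API is that plotting code can be wrapped in a function that takes an `Axes` and then reused on any subplot. The sketch below assumes a hypothetical helper named `plot_hist` and freshly generated sample data; neither is part of the notebook above.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_hist(ax, values, title):
    """Draw a labelled histogram on the given Axes (hypothetical helper)."""
    ax.hist(values, bins=30, edgecolor='black', alpha=0.7)
    ax.set_title(title, fontweight='bold')
    ax.set_ylabel('Frequency')
    return ax


# Illustrative data only, not the notebook's df
rng = np.random.default_rng(42)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# The same helper draws on whichever Axes it is handed
plot_hist(axes[0], rng.normal(size=500), 'Normal')
plot_hist(axes[1], rng.exponential(2, size=500), 'Exponential')
fig.tight_layout()
# In a notebook the figure renders at the end of the cell;
# in a script, follow with plt.show() or fig.savefig(...)
```

With the pyplot state machine, the same helper would have to rely on whichever Axes happens to be current, which gets fragile once a figure has several subplots.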

## 2. Distribution Plots

### Essential for EDA

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Histogram
axes[0, 0].hist(df['feature1'], bins=30, edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Histogram', fontweight='bold')
axes[0, 0].set_xlabel('Feature 1')
axes[0, 0].set_ylabel('Frequency')

# KDE (Kernel Density Estimate)
sns.kdeplot(data=df, x='feature1', ax=axes[0, 1], fill=True)
axes[0, 1].set_title('KDE Plot', fontweight='bold')

# Box plot (detect outliers)
sns.boxplot(data=df, x='category', y='feature2', ax=axes[1, 0])
axes[1, 0].set_title('Box Plot by Category', fontweight='bold')

# Violin plot (combines box + KDE)
sns.violinplot(data=df, x='category', y='feature3', ax=axes[1, 1])
axes[1, 1].set_title('Violin Plot', fontweight='bold')

plt.tight_layout()
plt.show()

print("📊 Distribution insights:")
print(f"  Feature1 mean: {df['feature1'].mean():.2f}, std: {df['feature1'].std():.2f}")
print(f"  Feature3 skew: {df['feature3'].skew():.2f} (right-skewed)")

### Multiple Distributions

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Overlapping histograms
for cat in df['category'].unique():
    subset = df[df['category'] == cat]['feature1']
    axes[0].hist(subset, bins=20, alpha=0.5, label=f'Category {cat}')
axes[0].set_title('Overlapping Histograms', fontweight='bold')
axes[0].set_xlabel('Feature 1')
axes[0].legend()

# Multiple KDE
for cat in df['category'].unique():
    subset = df[df['category'] == cat]['feature1']
    sns.kdeplot(subset, ax=axes[1], label=f'Category {cat}', fill=True, alpha=0.3)
axes[1].set_title('Multiple KDE', fontweight='bold')
axes[1].legend()

plt.tight_layout()
plt.show()

## 3. Relationship Plots

### Correlations & Scatter plots

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Scatter plot
axes[0].scatter(df['feature1'], df['feature2'], alpha=0.5, c=df['target'], cmap='coolwarm')
axes[0].set_title('Scatter Plot', fontweight='bold')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')

# Hexbin (for dense data)
axes[1].hexbin(df['feature1'], df['feature2'], gridsize=20, cmap='YlOrRd')
axes[1].set_title('Hexbin Plot', fontweight='bold')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')

# Regression plot
sns.regplot(data=df, x='feature1', y='feature2', ax=axes[2], scatter_kws={'alpha':0.3})
axes[2].set_title('Regression Plot', fontweight='bold')

plt.tight_layout()
plt.show()

# Correlation
corr = df[['feature1', 'feature2', 'feature3']].corr()
print("\nCorrelation matrix:")
print(corr)

### Correlation Heatmap

In [None]:
# Correlation matrix
numeric_cols = ['feature1', 'feature2', 'feature3']
corr_matrix = df[numeric_cols].corr()

# Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1)
plt.title('Correlation Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("💡 Heatmap shows:")
print("  - Strong positive: red")
print("  - Strong negative: blue")
print("  - No correlation: light gray (near white)")

### Pair Plot (All combinations)

In [None]:
# Pairplot (all vs all)
g = sns.pairplot(df[['feature1', 'feature2', 'feature3', 'category']], 
                 hue='category', diag_kind='kde', plot_kws={'alpha': 0.5})
g.figure.suptitle('Pair Plot', y=1.02, fontweight='bold')  # .figure replaces the older .fig (seaborn >= 0.11.2)
plt.show()

print("✅ Pairplot shows all pairwise relationships at once")

## 4. Model Evaluation Plots

### Classification metrics

In [None]:
# Simulate predictions
y_true = df['target'].values
y_pred_proba = np.random.rand(len(y_true))  # Random probabilities
y_pred = (y_pred_proba > 0.5).astype(int)

# Confusion matrix
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Confusion matrix heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0], 
            xticklabels=['Pred 0', 'Pred 1'], yticklabels=['True 0', 'True 1'])
axes[0].set_title('Confusion Matrix', fontweight='bold')
axes[0].set_ylabel('True Label')
axes[0].set_xlabel('Predicted Label')

# ROC curve (simplified)
from sklearn.metrics import roc_curve, auc

fpr, tpr, _ = roc_curve(y_true, y_pred_proba)
roc_auc = auc(fpr, tpr)

axes[1].plot(fpr, tpr, linewidth=2, label=f'ROC (AUC = {roc_auc:.2f})')
axes[1].plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random')
axes[1].set_title('ROC Curve', fontweight='bold')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"✅ AUC: {roc_auc:.3f} (random predictions for demo)")

### Training curves

In [None]:
# Simulate training history
epochs = np.arange(1, 51)
train_loss = 2.0 * np.exp(-epochs / 10) + 0.1 + np.random.randn(50) * 0.05
val_loss = 2.0 * np.exp(-epochs / 10) + 0.2 + np.random.randn(50) * 0.08
train_acc = 1 - np.exp(-epochs / 10) - 0.1 + np.random.randn(50) * 0.02
val_acc = 1 - np.exp(-epochs / 10) - 0.15 + np.random.randn(50) * 0.03

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curves
axes[0].plot(epochs, train_loss, label='Train Loss', linewidth=2)
axes[0].plot(epochs, val_loss, label='Val Loss', linewidth=2)
axes[0].set_title('Training Loss', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy curves
axes[1].plot(epochs, train_acc, label='Train Acc', linewidth=2)
axes[1].plot(epochs, val_acc, label='Val Acc', linewidth=2)
axes[1].set_title('Training Accuracy', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("📈 Training insights:")
print(f"  Final train loss: {train_loss[-1]:.3f}")
print(f"  Final val loss: {val_loss[-1]:.3f}")
print(f"  Val gap: {val_loss[-1] - train_loss[-1]:.3f} (overfitting check)")
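
The single-epoch gap printed above is noisy, so an overfitting check usually reads better on smoothed curves. The following is a minimal sketch, assuming loss histories simulated the same way as in the cell above (with a separate random generator) and a hypothetical `smooth` helper based on a moving average. It is one possible way to locate the best validation epoch, not a prescribed method.

```python
import numpy as np

# Simulated histories, mirroring the formulas used in the cell above
rng = np.random.default_rng(0)
epochs = np.arange(1, 51)
train_loss = 2.0 * np.exp(-epochs / 10) + 0.1 + rng.normal(0, 0.05, size=epochs.size)
val_loss = 2.0 * np.exp(-epochs / 10) + 0.2 + rng.normal(0, 0.08, size=epochs.size)


def smooth(values, window=5):
    """Moving average (hypothetical helper); 'valid' mode shortens the output by window - 1."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode='valid')


window = 5
val_smooth = smooth(val_loss, window)

# Entry i of the smoothed series averages epochs[i : i + window];
# report the last epoch of the best window
best_idx = int(np.argmin(val_smooth))
best_epoch = int(epochs[best_idx + window - 1])

# Averaging the gap over several epochs is less sensitive to noise than a single value
gap = val_loss - train_loss
mean_gap = float(gap[-10:].mean())

print(f"Lowest smoothed val loss ends at epoch {best_epoch}")
print(f"Mean val-train gap over last 10 epochs: {mean_gap:.3f}")
```

A validation loss that bottoms out well before the final epoch while the training loss keeps falling is the usual visual sign of overfitting. In this simulation both curves decay together, so the gap stays roughly constant.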

### Feature importance

In [None]:
# Simulate feature importances
features = ['Feature_A', 'Feature_B', 'Feature_C', 'Feature_D', 'Feature_E']
importances = np.array([0.35, 0.25, 0.20, 0.12, 0.08])

# Sort by importance
indices = np.argsort(importances)

plt.figure(figsize=(10, 6))
plt.barh(range(len(importances)), importances[indices], alpha=0.8)
plt.yticks(range(len(importances)), [features[i] for i in indices])
plt.xlabel('Importance', fontsize=12)
plt.title('Feature Importance', fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Top feature: {features[importances.argmax()]} ({importances.max():.2f})")

## 5. Advanced Layouts

### Complex subplots

In [None]:
# GridSpec for flexible layouts
from matplotlib.gridspec import GridSpec

fig = plt.figure(figsize=(12, 8))
gs = GridSpec(3, 3, figure=fig)

# Large subplot (top-left 2x2)
ax1 = fig.add_subplot(gs[:2, :2])
ax1.scatter(df['feature1'], df['feature2'], c=df['target'], cmap='viridis', alpha=0.5)
ax1.set_title('Main Plot', fontweight='bold')

# Top-right
ax2 = fig.add_subplot(gs[0, 2])
ax2.hist(df['feature1'], bins=20, edgecolor='black')
ax2.set_title('Hist 1')

# Middle-right
ax3 = fig.add_subplot(gs[1, 2])
ax3.hist(df['feature2'], bins=20, edgecolor='black')
ax3.set_title('Hist 2')

# Bottom (full width)
ax4 = fig.add_subplot(gs[2, :])
cats = df['category'].unique()
ax4.boxplot([df.loc[df['category'] == cat, 'feature3'] for cat in cats])
ax4.set_xticklabels(cats)  # avoids boxplot's `labels` argument, renamed `tick_labels` in Matplotlib 3.9
ax4.set_title('Category Comparison', fontweight='bold')

plt.tight_layout()
plt.show()

print("✅ GridSpec allows complex, flexible layouts")

## 6. Customization

### Make plots publication-ready

In [None]:
# Custom style
fig, ax = plt.subplots(figsize=(10, 6))

# Plot with custom styling
for cat, color in zip(df['category'].unique(), ['#FF6B6B', '#4ECDC4', '#45B7D1']):
    subset = df[df['category'] == cat]
    ax.scatter(subset['feature1'], subset['feature2'], 
              c=color, label=f'Category {cat}', 
              s=50, alpha=0.6, edgecolors='black', linewidth=0.5)

# Customization
ax.set_xlabel('Feature 1', fontsize=13, fontweight='bold')
ax.set_ylabel('Feature 2', fontsize=13, fontweight='bold')
ax.set_title('Custom Styled Plot', fontsize=15, fontweight='bold', pad=20)
ax.legend(frameon=True, shadow=True, fontsize=11)
ax.grid(True, alpha=0.3, linestyle='--')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.tight_layout()
plt.show()

# Save figure
# fig.savefig('plot.png', dpi=300, bbox_inches='tight')
# fig.savefig('plot.pdf', bbox_inches='tight')  # Vector format

print("💾 Save tips:")
print("  PNG: dpi=300 for high quality")
print("  PDF/SVG: Vector format (scales without quality loss)")
print("  bbox_inches='tight': Remove extra whitespace")

---

## 🎯 Key Takeaways

### Essential Plots for ML/DL

#### EDA (Exploratory Data Analysis)
1. **Histogram** / **KDE**: Distribution of a single variable
2. **Box plot**: Outliers and quartiles
3. **Scatter plot**: Relationship between two variables
4. **Correlation heatmap**: Feature correlations
5. **Pair plot**: All pairwise relationships

#### Model Evaluation
1. **Confusion matrix**: Classification errors
2. **ROC curve**: Trade-off between TPR and FPR across decision thresholds
3. **Training curves**: Loss & accuracy over time
4. **Feature importance**: Most predictive features
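
The ROC trade-off in item 2 can be made concrete by computing TPR and FPR by hand at a few thresholds. This is a small sketch with made-up labels and scores and a hypothetical `tpr_fpr` helper. In practice `sklearn.metrics.roc_curve` performs this sweep over all relevant thresholds.

```python
import numpy as np


def tpr_fpr(y_true, y_score, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive (hypothetical helper)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_score) >= threshold
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    tn = np.sum(~y_pred & ~y_true)
    return tp / (tp + fn), fp / (fp + tn)


# Made-up example: 3 positives, 5 negatives
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9])

for thr in (0.25, 0.5, 0.75):
    tpr, fpr = tpr_fpr(y_true, y_score, thr)
    print(f"threshold={thr:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
# threshold=0.25  TPR=1.00  FPR=0.60
# threshold=0.50  TPR=0.67  FPR=0.20
# threshold=0.75  TPR=0.33  FPR=0.00
```

Raising the threshold lowers both rates: fewer false alarms, but more missed positives. Each threshold contributes one point to the ROC curve.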

### Best Practices

```python
# ✅ Always use object-oriented API
fig, ax = plt.subplots()
ax.plot(...)

# ✅ Use tight_layout
plt.tight_layout()

# ✅ Label everything
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_title('Title')
ax.legend()

# ✅ Save with high quality
fig.savefig('plot.png', dpi=300, bbox_inches='tight')
```

### Common Gotchas

```python
# ❌ Forgetting plt.show() in scripts
# ✅ Always call plt.show() or save the figure

# ❌ Not closing figures (memory grows as figures accumulate)
# ✅ plt.close(fig) or plt.close('all')

# ❌ Sticking with bare Matplotlib defaults
# ✅ Set a seaborn style: sns.set_style('whitegrid')
```
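
The figure-closing gotcha matters most in scripts that generate many plots in a loop. The sketch below assumes a script run without a display (hence the `Agg` backend) and writes to a temporary directory. The file names and output location are illustrative only.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
out_dir = tempfile.mkdtemp()  # illustrative output location
saved = []

for i in range(3):
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.hist(rng.normal(size=200), bins=20)
    ax.set_title(f"Sample {i}")
    path = os.path.join(out_dir, f"hist_{i}.png")
    fig.savefig(path, dpi=100, bbox_inches="tight")
    plt.close(fig)  # release the figure so pyplot stops tracking it
    saved.append(path)

print(f"Saved {len(saved)} files; open figures: {len(plt.get_fignums())}")
```

Without the `plt.close(fig)` call, pyplot keeps a reference to every figure created in the loop, and Matplotlib emits a warning once more than 20 are open at the same time.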

### Quick Reference

| Plot Type | Use Case | Code |
|-----------|----------|------|
| Histogram | Distribution | `ax.hist(data, bins=30)` |
| Box plot | Outliers | `sns.boxplot(data=df, x='cat', y='val')` |
| Scatter | Relationship | `ax.scatter(x, y)` |
| Heatmap | Correlation | `sns.heatmap(corr, annot=True)` |
| Line | Time series | `ax.plot(x, y)` |
| Bar | Counts | `ax.bar(categories, values)` |

---

**Next:** Scikit-learn for ML algorithms