# Chapter 16: Exploratory Data Analysis (EDA) Methodology

EDA (Exploratory Data Analysis) is the stage where you **get to know your data** before making big decisions or building models. Think of it as a detective's investigation: you gather clues, spot patterns, and form theories before drawing conclusions.

In this chapter you will learn a practical, repeatable approach to EDA:
- The overall **EDA workflow** you can reuse on any dataset
- **Univariate** analysis (one variable at a time)
- **Bivariate** analysis (two variables together)
- **Multivariate** analysis (three or more variables)
- Visual + statistical exploration techniques
- Pattern and anomaly detection methods
- Hypothesis refinement based on evidence

---

## Table of Contents
1. [Introduction: What EDA Is (and What It Is Not)](#introduction-what-eda-is-and-what-it-is-not)
2. [Setup and Dataset Creation](#setup)
3. [16.1 EDA Workflow (The Checklist)](#161-eda-workflow-the-checklist)
4. [16.2 Univariate Analysis](#162-univariate-analysis-one-variable-at-a-time)
5. [16.3 Bivariate Analysis](#163-bivariate-analysis-two-variables)
6. [16.4 Multivariate Analysis](#164-multivariate-analysis-3-variables)
7. [16.5 Visual and Statistical Exploration](#165-visual-and-statistical-exploration)
8. [16.6 Pattern and Anomaly Detection](#166-pattern-and-anomaly-detection)
9. [16.7 Hypothesis Refinement](#167-hypothesis-refinement)
10. [Summary / Key Takeaways](#summary--key-takeaways)

---

**Prerequisites:** This chapter assumes you are comfortable with:
- Basic Python (Chapter 2)
- NumPy arrays (Chapter 3)
- Pandas DataFrames (Chapter 4)
- Basic plotting with Matplotlib/Seaborn (Chapter 5)

This notebook is **self-contained**: it builds a practice dataset from seaborn's `tips` data and walks through EDA step by step.

## Introduction: What EDA Is (and What It Is Not)

### What EDA Is
EDA is a set of techniques to:
- **Understand columns**: their meaning, types, and value ranges
- **Find data quality issues early**: missing values, duplicates, inconsistent entries
- **Discover relationships**: is discount related to returns? does age relate to spending?
- **Spot patterns and anomalies**: unusual spikes, clusters, or outliers worth investigating

EDA was popularized by statistician John Tukey in the 1970s. His philosophy: let the data speak before imposing assumptions.

### What EDA Is Not
EDA is **not proof of causation**. During EDA you generate ideas and hypotheses. You then validate them later with careful statistics, experiments, or domain expertise.

| EDA Tells You | EDA Does NOT Tell You |
|---------------|----------------------|
| "Sales are higher on weekends" | "Weekends *cause* higher sales" |
| "Discount and returns are correlated" | "Discounts *lead to* more returns" |
| "There are 5 outlier transactions" | "These outliers are errors" or "These outliers are real" |

> 💡 **Tip:** Think of EDA as turning on the lights before you start working. You see what's in the room, but you still need to investigate each item.

### Why EDA Matters
- **Saves time later:** Catching data issues early prevents wrong conclusions
- **Guides analysis:** EDA reveals which variables are worth deeper study
- **Builds intuition:** You develop a "feel" for the data that helps with modeling decisions

> ⚠️ **Common Beginner Mistake:** Jumping straight to complex models without understanding the data first. This often leads to wasted effort and wrong results.

## Chapter Map: What We Will Do

We will follow this structured EDA flow:

```
┌─────────────────────────────────────────────────────────────────┐
│  1. Load/Create Data  →  2. Quick Inspection  →  3. Cleanup     │
│         ↓                                                       │
│  4. Univariate  →  5. Bivariate  →  6. Multivariate             │
│         ↓                                                       │
│  7. Pattern & Anomaly Detection  →  8. Hypothesis Refinement    │
│         ↓                                                       │
│  9. Summarize Insights                                          │
└─────────────────────────────────────────────────────────────────┘
```

### Each Stage Explained:
| Stage | Purpose | Example Question |
|-------|---------|------------------|
| Inspection | Understand structure | How many rows? What data types? |
| Cleanup | Fix obvious issues | Standardize text, remove duplicates |
| Univariate | One variable at a time | What's the average revenue? |
| Bivariate | Two variables together | Does revenue differ by channel? |
| Multivariate | 3+ variables | Does the discount-return relationship vary by channel? |
| Patterns/Anomalies | Unusual observations | Are there outliers? Time-based spikes? |
| Hypothesis Refinement | Update questions | Refine vague ideas into testable hypotheses |

> 💡 **Tip:** Following a consistent workflow means you're less likely to miss important findings.

## Setup

We will use these libraries for EDA:

| Library | Purpose |
|---------|---------|
| `pandas` | Data manipulation and summaries |
| `numpy` | Numerical operations |
| `matplotlib` | Basic plotting |
| `seaborn` | Statistical visualizations (built on matplotlib) |
| `scipy` (optional) | Statistical tests |

If you need to install these packages, run in a terminal:
```bash
pip install pandas numpy matplotlib seaborn scipy
```

> 📚 **Reference:**
> - [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html)
> - [Seaborn Tutorial](https://seaborn.pydata.org/tutorial.html)

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set a clean visual style for all plots
sns.set_theme(style="whitegrid")

# Set random seed for reproducibility (same results every time)
np.random.seed(42)

# Optional: SciPy for statistical tests (we will fall back gracefully if not installed)
try:
    from scipy import stats
    SCIPY_AVAILABLE = True
except ImportError:
    SCIPY_AVAILABLE = False

print(f"SciPy available: {SCIPY_AVAILABLE}")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## Loading a Practice Dataset

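We'll use the **tips** dataset from seaborn, a classic dataset for learning data analysis. It contains restaurant tipping data with:
- Numeric columns (total bill, tip amount, party size)
- Categorical columns (sex, smoker status, day, time)
- Real patterns to discover through EDA

We'll transform it to match our e-commerce orders context and add some data quality issues for practice. Note that `order_date`, `customer_age`, `discount`, and `satisfaction` are randomly generated, so they have no built-in relationship with the other columns; any association you see involving them is sampling noise.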

### Why Use Real Data?
- ✅ You practice EDA on actual patterns that exist in the real world
- ✅ No manual file handling: `sns.load_dataset("tips")` downloads the data on first use (internet access required) and caches it locally
- ✅ Results are reproducible across different machines
- ✅ You learn to work with the quirks of real data

### Dataset Columns (After Transformation):
| Column | Type | Description |
|--------|------|-------------|
| `order_id` | int | Unique order identifier |
| `order_date` | date | When the order was placed |
| `channel` | categorical | Order source: web, app, store |
| `region` | categorical | Geographic region: North, South, East, West |
| `customer_age` | numeric | Customer's age in years |
| `items` | int | Number of items in the order |
| `discount` | numeric | Discount applied (0 to 0.6) |
| `revenue` | numeric | Order revenue in dollars |
| `returned` | boolean | Whether the order was returned |
| `satisfaction` | numeric | Customer satisfaction score (1-10) |
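
A documented column table doubles as a checklist you can test the data against. The sketch below counts values that fall outside the ranges listed above. The `EXPECTED_RANGES` dictionary, the `out_of_range_counts` helper, and the toy rows are illustrative assumptions, not part of the chapter's dataset code.

```python
import pandas as pd

# Documented ranges from the column table (inclusive bounds); hypothetical helper data
EXPECTED_RANGES = {"discount": (0.0, 0.6), "satisfaction": (1, 10)}

def out_of_range_counts(data, ranges):
    """Count non-missing values outside [low, high] for each listed column."""
    counts = {}
    for col, (low, high) in ranges.items():
        values = data[col].dropna()  # missing values are reported elsewhere
        counts[col] = int(((values < low) | (values > high)).sum())
    return counts

# Toy rows with deliberate violations: 0.7 discount, 11.0 and 0.5 satisfaction
toy = pd.DataFrame({
    "discount": [0.1, 0.7, None],
    "satisfaction": [7.5, 11.0, 0.5],
})

counts = out_of_range_counts(toy, EXPECTED_RANGES)
print(counts)  # {'discount': 1, 'satisfaction': 2}
```

A non-zero count is a prompt to investigate, not an automatic reason to drop rows.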

In [None]:
# Load the tips dataset from seaborn
tips = sns.load_dataset("tips")

# Set random generator for reproducibility
rng = np.random.default_rng(42)

# Transform to our orders context
n_rows = len(tips)

df = pd.DataFrame({
    'order_id': np.arange(1, n_rows + 1),
    'order_date': pd.to_datetime('2025-01-01') + pd.to_timedelta(rng.integers(0, 180, size=n_rows), unit='D'),
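    # astype(str) first: the tips columns are categorical, and adding the new
    # label 'store' to a categorical column can fail
    'channel': tips['time'].astype(str).map({'Lunch': 'web', 'Dinner': 'app'}).where(
        rng.random(n_rows) > 0.15, 'store'
    ),
    'region': tips['day'].astype(str).map({'Thur': 'North', 'Fri': 'South', 'Sat': 'East', 'Sun': 'West'}),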
    'customer_age': rng.normal(34, 10, size=n_rows).round().clip(18, 75),
    'items': tips['size'].clip(1, 12),
    'discount': rng.beta(a=2, b=8, size=n_rows).clip(0, 0.6),
    'revenue': tips['total_bill'],
    'returned': tips['smoker'] == 'Yes',  # Using smoker as a proxy for returned
    'satisfaction': rng.normal(7.2, 1.2, size=n_rows).clip(1, 10),
})

# Inject common data issues for EDA practice
# 1) Missing values
missing_idx = rng.choice(df.index, size=int(0.03 * n_rows), replace=False)
df.loc[missing_idx, 'satisfaction'] = np.nan
missing_idx2 = rng.choice(df.index, size=int(0.02 * n_rows), replace=False)
df.loc[missing_idx2, 'customer_age'] = np.nan

# 2) A few extreme outliers in revenue
outlier_idx = rng.choice(df.index, size=max(3, int(0.01 * n_rows)), replace=False)
df.loc[outlier_idx, 'revenue'] *= rng.integers(6, 12, size=len(outlier_idx))

# 3) Duplicate a couple of rows
dup_idx = rng.choice(df.index, size=3, replace=False)
df = pd.concat([df, df.loc[dup_idx]], ignore_index=True)

# 4) Inconsistent categories (uppercase + extra spaces)
glitch_idx = rng.choice(df.index, size=4, replace=False)
df.loc[glitch_idx, 'channel'] = df.loc[glitch_idx, 'channel'].astype(str).str.upper() + '  '

print(f"Dataset shape: {df.shape}")
df.head()

---

## 16.1 EDA Workflow (The Checklist)

A good EDA is **repeatable**. Here is a beginner-friendly workflow you can follow every time you start exploring a new dataset:

### The 9-Step EDA Checklist

| Step | Action | Key Questions |
|------|--------|---------------|
| 1 | **Clarify the goal** | What question are you trying to answer? |
| 2 | **Inspect the dataset** | How many rows/columns? What types? Any missing values? |
| 3 | **Clean just enough** | Fix obvious issues so exploration is safe |
| 4 | **Univariate analysis** | What does each variable look like on its own? |
| 5 | **Bivariate analysis** | How are pairs of variables related? |
| 6 | **Multivariate analysis** | What patterns emerge when looking at 3+ variables? |
| 7 | **Patterns & anomalies** | Any outliers, weird clusters, or sudden spikes? |
| 8 | **Refine hypotheses** | Update your questions based on what you found |
| 9 | **Summarize insights** | What did you learn? What should happen next? |

> ⚠️ **Warning:** If you skip steps 1–3, you can easily misinterpret your plots and draw wrong conclusions.

> 💡 **Tip:** Print or save this checklist and refer to it when starting any new EDA project.

### Step 1–2: Quick Inspection (Shape, Samples, Types)

**Why this matters:**
- If a column is text but should be numeric, calculations will be wrong
- If you have duplicates, counts and averages can be biased
- If you have missing values, some charts will silently drop data

**What to check:**
1. **Shape**: How many rows and columns?
2. **Sample rows**: What does the data actually look like?
3. **Data types**: Are they correct? (dates as dates, numbers as numbers)
4. **Missing values**: How much data is missing?
5. **Duplicates**: Are there exact duplicate rows?
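
The three pitfalls listed under *Why this matters* are easy to reproduce on a toy table. This is a minimal sketch with made-up values (the `toy` frame and its columns are hypothetical), showing a numeric column stored as text, a missing value that `mean()` skips silently, and an exact duplicate row:

```python
import pandas as pd

# Hypothetical toy data: 'price' arrived as text, with one unparseable entry
toy = pd.DataFrame({
    "price": ["10", "20", "n/a", "20"],
    "qty": [1, 2, 3, 2],
})
print(toy.dtypes)  # 'price' is a text column, not a numeric one

# Convert text to numbers; values that cannot be parsed become NaN
price_num = pd.to_numeric(toy["price"], errors="coerce")
n_missing = int(price_num.isna().sum())

# mean() skips NaN by default, so it averages only the 3 parsed values
mean_price = price_num.mean()

# duplicated() flags rows identical to an earlier row (index 3 repeats index 1)
n_dupes = int(toy.duplicated().sum())

print(f"Missing after conversion: {n_missing}")
print(f"Mean price (NaN skipped): {mean_price:.2f}")
print(f"Exact duplicate rows: {n_dupes}")
```

None of these steps raises an error or warning, which is why the inspection has to be deliberate.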

In [None]:
# Check the shape: (rows, columns)
print(f"Dataset shape: {df.shape[0]:,} rows × {df.shape[1]} columns")

In [None]:
# Look at a random sample of rows (not just the first few)
# This helps you see variety in the data
df.sample(5, random_state=42)

In [None]:
# Check data types - are they what you expect?
# Good: order_date is datetime64, revenue is float64
# Warning sign: a numeric column showing as 'object' (text)
print("Data types:")
print(df.dtypes)

In [None]:
# Check missing values and duplicates
# Missing rate = proportion of NaN values in each column
missing_rate = df.isna().mean().sort_values(ascending=False)
duplicate_rows = df.duplicated().sum()

print("Missing rate by column (sorted):")
print(missing_rate[missing_rate > 0].to_frame('missing_rate'))
print(f"\nColumns with no missing values: {(missing_rate == 0).sum()}")
print(f"\nExact duplicate rows: {duplicate_rows}")

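# Interpretation:
# - satisfaction has ~3% missing, customer_age has ~2% missing
# - We injected 3 duplicate rows; fewer may count as *exact* duplicates here
#   if a channel glitch landed on a copy, so recheck after standardizing text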

### Step 3: Clean *Just Enough* for Exploration

EDA is not the same as full data cleaning (Chapter 15 covers that in depth). Here we do **only what we need** so exploration is trustworthy:

| Issue | EDA Fix | Why |
|-------|---------|-----|
| Inconsistent text | Standardize (`'WEB  '` → `'web'`) | So groupby works correctly |
| Exact duplicates | Remove them | So counts aren't inflated |
| Missing values | **Keep them** | We'll handle them carefully in plots/stats |

> ⚠️ **Warning:** Don't over-clean during EDA. You want to see the problems so you can report them!

> 💡 **Tip:** Always print how many rows you remove, so you can explain it later.

In [None]:
# Create a copy for EDA (preserve the original for reference)
df_eda = df.copy()

# Fix 1: Standardize the 'channel' column
# - Convert to string (in case of mixed types)
# - Strip leading/trailing whitespace
# - Convert to lowercase
df_eda['channel'] = df_eda['channel'].astype(str).str.strip().str.lower()

# Fix 2: Remove exact duplicate rows
n_before = len(df_eda)
df_eda = df_eda.drop_duplicates()
n_after = len(df_eda)

print(f"Rows before cleanup: {n_before:,}")
print(f"Rows after drop_duplicates(): {n_after:,}")
print(f"Rows removed: {n_before - n_after:,}")
print(f"\nChannel values after cleanup:")
print(df_eda['channel'].value_counts(dropna=False))

In [None]:
# Create a reusable function for quick dataset overview
# This is a handy function you can copy to your own projects!

def quick_overview(data: pd.DataFrame) -> pd.DataFrame:
    """
    Generate a quick overview of a DataFrame.
    Shows data type, missing rate, and unique values for each column.
    """
    overview = pd.DataFrame({
        'dtype': data.dtypes.astype(str),
        'missing_rate': data.isna().mean().round(4),
        'missing_count': data.isna().sum(),
        'n_unique': data.nunique(dropna=True),
        'sample_value': [data[col].dropna().iloc[0] if len(data[col].dropna()) > 0 else None 
                         for col in data.columns]
    })
    return overview.sort_values('missing_rate', ascending=False)

# Apply to our cleaned EDA dataset
quick_overview(df_eda)

---

## 16.2 Univariate Analysis (One Variable at a Time)

**Univariate analysis** examines each variable independently. It answers questions like:
- What values are common?
- What is the typical value (mean/median)?
- Is the distribution skewed?
- Are there outliers?

### Tools for Univariate Analysis

| Variable Type | Summary Method | Visualization |
|---------------|----------------|---------------|
| **Numeric** | `describe()`, `mean()`, `median()` | Histogram, boxplot, KDE |
| **Categorical** | `value_counts()` | Bar chart, pie chart |
| **Boolean** | `mean()` (gives % True) | Bar chart |

### Key Statistics to Know

| Statistic | What It Tells You |
|-----------|-------------------|
| **Mean** | Average value (sensitive to outliers) |
| **Median** | Middle value (robust to outliers) |
| **Std** | How spread out the data is |
| **Min/Max** | Range of values |
| **25%/75%** | Quartiles (middle 50% of data) |

> 💡 **Tip:** When the mean and median differ noticeably, the distribution is probably skewed. If mean > median, a long right tail or a few high outliers are pulling the average up.
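
To see why the mean and median disagree on skewed data, consider a small made-up sample in which one order is far larger than the rest. The values are illustrative only:

```python
import pandas as pd

# Hypothetical order values: five typical orders and one very large one
orders = pd.Series([20, 22, 25, 27, 30, 400])

mean_value = orders.mean()      # pulled upward by the 400 order
median_value = orders.median()  # middle of the sorted values, barely affected
skewness = orders.skew()        # positive for a long right tail

print(f"mean = {mean_value:.2f}, median = {median_value:.2f}, skew = {skewness:.2f}")
```

Dropping the 400 order moves the mean to 24.8, while the median only shifts from 26 to 25. That is what "robust to outliers" means in the statistics table.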

In [None]:
# Numeric summary using describe()
# This gives you count, mean, std, min, 25%, 50% (median), 75%, max

numeric_cols = ['customer_age', 'items', 'discount', 'revenue', 'satisfaction']
summary = df_eda[numeric_cols].describe()

print("Numeric column summary:")
print(summary.round(2))

# Quick interpretation (compare against the table printed above):
# - revenue: a mean well above the median suggests right skew driven by high outliers
# - discount: values fall between 0 and the 0.6 cap
# - satisfaction: centered around 7 on a 1-10 scale

In [None]:
# Visualize numeric distributions
# We use a 2x2 grid to show multiple plots at once

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Plot 1: Revenue histogram with KDE (smooth curve)
sns.histplot(df_eda['revenue'], bins=40, kde=True, ax=axes[0, 0])
axes[0, 0].set_title('Revenue Distribution')
axes[0, 0].set_xlabel('Revenue ($)')

# Plot 2: Revenue boxplot (great for spotting outliers)
sns.boxplot(x=df_eda['revenue'], ax=axes[0, 1])
axes[0, 1].set_title('Revenue Boxplot (outliers visible as dots)')

# Plot 3: Discount histogram
sns.histplot(df_eda['discount'], bins=30, kde=True, ax=axes[1, 0])
axes[1, 0].set_title('Discount Distribution')
axes[1, 0].set_xlabel('Discount Rate')

# Plot 4: Items histogram (discrete values)
sns.histplot(df_eda['items'], bins=12, kde=False, ax=axes[1, 1])
axes[1, 1].set_title('Items per Order')
axes[1, 1].set_xlabel('Number of Items')

plt.tight_layout()
plt.show()

# Interpretation:
# - Revenue is right-skewed with a few extreme outliers (the ones we injected)
# - Discounts peak at low values and tail off toward the 0.6 cap
# - Most orders have 2 items, and nearly all have between 2 and 4

In [None]:
# For highly skewed data, a log scale often reveals patterns better
# Why log scale? It compresses large values and spreads out small values

plt.figure(figsize=(8, 4))
sns.histplot(np.log10(df_eda['revenue']), bins=40, kde=True, color='steelblue')
plt.title('Revenue on Log10 Scale')
plt.xlabel('log₁₀(Revenue)')
plt.ylabel('Count')

# Add a note about interpretation
plt.annotate('10¹ = $10\n10² = $100\n10³ = $1000',
             xy=(0.02, 0.75), xycoords='axes fraction',
             fontsize=10, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()

# Now the distribution looks more symmetric and easier to analyze
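
One caveat with log scales: `log10(0)` is negative infinity, so a column containing zeros (free or fully refunded orders, say) needs a shifted transform. The tips-based revenue here is always positive, so the cell above does not need this. The sketch below uses hypothetical values to show `np.log1p`, which computes log(1 + x).

```python
import numpy as np

# Hypothetical revenues, including a zero
revenue = np.array([0.0, 9.0, 99.0, 999.0])

# log1p(x) = ln(1 + x): defined at 0, close to ln(x) for large x
log_safe = np.log1p(revenue)

# Same idea in base 10, which keeps the "powers of ten" reading
log10_shifted = np.log10(revenue + 1)

print(log_safe.round(3))
print(log10_shifted.round(3))
```

The shift slightly distorts small values, so mention it in the axis label whenever you use it.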

In [None]:
# Categorical variable analysis using value_counts()
# Shows how many times each category appears

print("Channel distribution:")
print(df_eda['channel'].value_counts(dropna=False))
print(f"\nPercentages:")
print((df_eda['channel'].value_counts(normalize=True) * 100).round(1))

In [None]:
# Visualize categorical distribution with a bar chart
# Order bars by frequency for easier reading

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Channel distribution
channel_order = df_eda['channel'].value_counts().index
sns.countplot(data=df_eda, x='channel', order=channel_order, ax=axes[0], palette='Blues_d')
axes[0].set_title('Orders by Channel')
axes[0].set_xlabel('Channel')
axes[0].set_ylabel('Count')

# Region distribution
region_order = df_eda['region'].value_counts().index
sns.countplot(data=df_eda, x='region', order=region_order, ax=axes[1], palette='Greens_d')
axes[1].set_title('Orders by Region')
axes[1].set_xlabel('Region')
axes[1].set_ylabel('Count')

plt.tight_layout()
plt.show()

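# Interpretation: compare the bar heights with the value_counts() output above.
# - app should dominate: it is mapped from Dinner, the larger group in tips
# - Regions are unbalanced: South (mapped from Fri) has far fewer orders than the others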

### 🎯 Exercise 1: Univariate Analysis

Test your understanding by answering these questions:

1. What is the **median** revenue? (Hint: use `.median()`)
2. What **percentage** of orders were returned? (Hint: for boolean columns, `.mean()` gives the rate of `True`)
3. Which **region** has the most orders? (Hint: use `.value_counts()`)

**Try it yourself before looking at the solution below!**

In [None]:
# Solution to Exercise 1

# 1. Median revenue
median_revenue = df_eda['revenue'].median()
print(f"1. Median revenue: ${median_revenue:.2f}")

# 2. Return rate (percentage of orders returned)
return_rate = df_eda['returned'].mean() * 100  # multiply by 100 for percentage
print(f"2. Return rate: {return_rate:.1f}%")

# 3. Region with most orders
top_region = df_eda['region'].value_counts().idxmax()
top_region_count = df_eda['region'].value_counts().max()
print(f"3. Top region: {top_region} ({top_region_count} orders)")

# Bonus: Show all region counts
print(f"\nAll regions:\n{df_eda['region'].value_counts()}")

---

## 16.3 Bivariate Analysis (Two Variables)

**Bivariate analysis** examines the relationship between two variables. It helps you answer questions like:
- Does revenue differ by channel?
- Is discount related to returns?
- Do older customers buy more items?

### Tools for Bivariate Analysis

| Comparison Type | Method | Visualization |
|-----------------|--------|---------------|
| **Numeric vs Categorical** | `groupby().agg()` | Boxplot, violin plot, bar chart |
| **Numeric vs Numeric** | Correlation, scatter | Scatter plot, regression plot |
| **Categorical vs Categorical** | Crosstab, chi-square | Heatmap, grouped bar chart |

### Key Concepts

**Correlation** measures the linear relationship between two numeric variables:
- **+1** = Perfect positive relationship (both increase together)
- **0** = No linear relationship
- **-1** = Perfect negative relationship (one increases, other decreases)
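
The word *linear* matters. A correlation near 0 rules out a straight-line relationship, not every relationship. The sketch below uses made-up data: one variable that depends on `x` linearly and one that depends on it through a perfect curve.

```python
import numpy as np

# Symmetric range around 0 so the curve's left and right halves cancel out
x = np.linspace(-3, 3, 61)

y_linear = 2 * x + 1   # straight line
y_curved = x ** 2      # fully determined by x, but not linear

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_curved = np.corrcoef(x, y_curved)[0, 1]

print(f"r (linear): {r_linear:.3f}")
print(f"r (curved): {r_curved:.3f}")
```

The curved case shows why a scatter plot should accompany any correlation coefficient.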

> ⚠️ **Common Mistake:** Seeing a trend and assuming it proves causation. EDA tells you **what is associated**, not **why** it happens.

In [None]:
# Compare revenue across channels using groupby
# agg() lets us calculate multiple statistics at once

revenue_by_channel = df_eda.groupby('channel')['revenue'].agg([
    ('count', 'count'),
    ('mean', 'mean'),
    ('median', 'median'),
    ('std', 'std')
]).sort_values('median', ascending=False)

print("Revenue statistics by channel:")
print(revenue_by_channel.round(2))

# Interpretation: compare the medians, but check 'count' first.
# A channel with few orders gives a less reliable median.

In [None]:
# Visualize numeric vs categorical: boxplot
# Boxplots show median, quartiles, and outliers at a glance

plt.figure(figsize=(10, 5))
sns.boxplot(data=df_eda, x='channel', y='revenue', palette='Set2')
plt.title('Revenue by Channel (Boxplot)')
plt.xlabel('Channel')
plt.ylabel('Revenue ($)')

# Limit the y-axis to the main distribution
# (points above the 95th percentile are cut off from view, not removed from the data)
plt.ylim(0, df_eda['revenue'].quantile(0.95))

plt.tight_layout()
plt.show()

# Interpretation: compare the median lines and box heights across channels.
# The extreme outliers still exist but sit above the visible range.

In [None]:
# Visualize numeric vs numeric: scatter plot
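# We sample at most 500 points to avoid overplotting; min() guards against
# asking sample() for more rows than the data has, which raises an error

plt.figure(figsize=(9, 5))
sample = df_eda.sample(min(500, len(df_eda)), random_state=0)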
sns.scatterplot(data=sample, x='discount', y='revenue', hue='channel', alpha=0.7, s=60)
plt.title('Discount vs Revenue (by Channel)')
plt.xlabel('Discount Rate')
plt.ylabel('Revenue ($)')
plt.legend(title='Channel')
plt.tight_layout()
plt.show()

# Interpretation: expect no clear trend here, because discount was generated
# independently of revenue. A shapeless cloud is itself a finding.

In [None]:
# Bucket a numeric variable to compare with another variable
# This helps when you want to see trends across ranges

# Create discount buckets (bins)
df_tmp = df_eda.copy()
df_tmp['discount_bucket'] = pd.cut(
    df_tmp['discount'], 
    bins=[0, 0.05, 0.10, 0.20, 0.60],  # 0-5%, 5-10%, 10-20%, 20-60%
    include_lowest=True,
    labels=['0-5%', '5-10%', '10-20%', '20-60%']
)

# Calculate return rate for each bucket
return_rate_by_bucket = df_tmp.groupby('discount_bucket', observed=False)['returned'].mean()

print("Return rate by discount bucket:")
print((return_rate_by_bucket * 100).round(1).to_frame('return_rate_%'))
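
Bin edges deserve a second look, because `pd.cut` uses right-closed intervals by default: a value exactly on an edge goes into the lower bucket, and anything outside the outermost edges becomes `NaN`. A small sketch with hand-picked values (hypothetical, chosen to sit on the edges used above):

```python
import pandas as pd

edges = [0, 0.05, 0.10, 0.20, 0.60]
labels = ['0-5%', '5-10%', '10-20%', '20-60%']

# Values chosen to land on or just past the bin edges
values = pd.Series([0.0, 0.05, 0.0501, 0.20, 0.60, 0.75])
buckets = pd.cut(values, bins=edges, labels=labels, include_lowest=True)

print(pd.DataFrame({'value': values, 'bucket': buckets}))
# 0.0  -> '0-5%'   (kept only because include_lowest=True)
# 0.05 -> '0-5%'   (right edge is inclusive)
# 0.20 -> '10-20%'
# 0.75 -> NaN      (above the last edge)
```

Rows that fall into `NaN` buckets are dropped by `groupby`, so check `buckets.isna().sum()` before trusting the grouped rates.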

In [None]:
# Visualize the return rate by discount bucket
plt.figure(figsize=(8, 4))
(return_rate_by_bucket * 100).plot(kind='bar', color='coral', edgecolor='black')
plt.ylabel('Return Rate (%)')
plt.xlabel('Discount Bucket')
plt.title('Return Rate by Discount Bucket')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

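# Reading the chart: in this practice data, discount was generated independently
# of returns, so differences between buckets reflect sampling noise.
# In real data, a clear upward trend would be a pattern worth investigating.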

> ⚠️ **Critical Reminder:** Suppose a chart like this did show higher return rates at higher discounts. That would mean higher discounts are *associated* with more returns. It would NOT prove that discounts *cause* returns. There could be other factors:
> - Maybe discounted products are lower quality
> - Maybe customers who seek discounts are more likely to return things anyway
>
> EDA finds patterns; further analysis tests explanations.
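
The second explanation describes a confounder: a trait that drives both discount use and returns. The sketch below builds a hypothetical table in which return rates depend only on shopper type, never on discount level, yet the overall comparison still shows more returns at high discounts. All counts and the `shopper_type` column are invented for illustration.

```python
import pandas as pd

# (shopper_type, discount_level, n_orders, n_returned): invented counts
cells = [
    ("bargain_hunter", "high", 80, 32),  # 40% returned
    ("bargain_hunter", "low", 20, 8),    # 40% returned
    ("casual", "high", 20, 2),           # 10% returned
    ("casual", "low", 80, 8),            # 10% returned
]

# Expand the counts into one row per order
rows = []
for shopper, level, n_orders, n_returned in cells:
    for i in range(n_orders):
        rows.append({
            "shopper_type": shopper,
            "discount_level": level,
            "returned": i < n_returned,
        })
toy = pd.DataFrame(rows)

# Overall: high-discount orders look more return-prone (34% vs 16%)
overall = toy.groupby("discount_level")["returned"].mean()

# Within each shopper type, discount level makes no difference
within = toy.groupby(["shopper_type", "discount_level"])["returned"].mean().unstack()

print((overall * 100).round(1))
print((within * 100).round(1))
```

Splitting by a suspected confounder, as in `within`, is a cheap EDA check, though it only helps for confounders you have actually measured.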

### 🎯 Exercise 2: Bivariate Analysis

Answer these questions using what you've learned:

1. Which channel has the **highest median revenue**?
2. Is the **return rate** higher for web or app?
3. Create a **boxplot** of satisfaction vs returned status.

**Try it yourself before looking at the solution!**

In [None]:
# Solution to Exercise 2

# 1. Median revenue by channel
median_by_channel = df_eda.groupby('channel')['revenue'].median().sort_values(ascending=False)
print("1. Median revenue by channel:")
print(median_by_channel.round(2))
print(f"\n   → Highest: {median_by_channel.idxmax()} (${median_by_channel.max():.2f})")

# 2. Return rate by channel
return_rate_by_channel = df_eda.groupby('channel')['returned'].mean().sort_values(ascending=False)
print("\n2. Return rate by channel:")
print((return_rate_by_channel * 100).round(1))
print(f"\n   → Web return rate: {return_rate_by_channel['web']*100:.1f}%")
print(f"   → App return rate: {return_rate_by_channel['app']*100:.1f}%")

# 3. Boxplot of satisfaction vs returned
plt.figure(figsize=(8, 4))
sns.boxplot(data=df_eda, x='returned', y='satisfaction', palette='pastel')
plt.title('Satisfaction Score by Return Status')
plt.xlabel('Order Returned?')
plt.ylabel('Satisfaction Score (1-10)')
plt.tight_layout()
plt.show()

# Interpretation: compare the two boxes. Satisfaction was generated independently
# of return status in this practice data, so expect them to look similar.

---

## 16.4 Multivariate Analysis (3+ Variables)

**Multivariate analysis** examines relationships among three or more variables. This is useful when relationships are more complex and depend on multiple factors.

### Examples of Multivariate Questions:
- Does revenue depend on items **and** channel?
- Does the discount-return relationship vary **by channel**?
- Are satisfaction patterns different **by region and return status**?
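
A third variable can hide or even reverse a two-variable pattern. In the made-up data below, `items` and `revenue` are uncorrelated overall, yet perfectly correlated within each channel, in opposite directions. The numbers are contrived to make the effect exact.

```python
import pandas as pd

# Contrived data: revenue rises with items on web and falls with items on app
toy = pd.DataFrame({
    "channel": ["web"] * 4 + ["app"] * 4,
    "items":   [1, 2, 3, 4, 1, 2, 3, 4],
    "revenue": [10, 20, 30, 40, 40, 30, 20, 10],
})

# Two-variable view: the opposite trends cancel out
overall_r = toy["items"].corr(toy["revenue"])

# Three-variable view: compute the same correlation within each channel
by_channel = {
    name: group["items"].corr(group["revenue"])
    for name, group in toy.groupby("channel")
}

print(f"Overall r: {overall_r:.2f}")
for name, r in by_channel.items():
    print(f"{name}: r = {r:.2f}")
```

Real data is rarely this clean, but the habit carries over: when a relationship looks weak, check whether it differs by group before dismissing it.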

### Tools for Multivariate Analysis

| Tool | Use Case |
|------|----------|
| **Correlation matrix** | See relationships between all numeric pairs |
| **Heatmap** | Visualize correlation or pivot tables |
| **Pivot tables** | Cross-tabulate by two categorical variables |
| **Pair plots** | Scatter plots for all variable pairs |
| **Faceted plots** | Same plot repeated for each group |

> 💡 **Tip:** A correlation matrix is a great first step in multivariate analysis: it quickly shows which pairs of numeric variables are linearly related.

In [None]:
# Calculate correlation matrix for all numeric columns
# Correlation ranges from -1 (negative) to +1 (positive)

numeric_cols = ['customer_age', 'items', 'discount', 'revenue', 'satisfaction']
corr = df_eda[numeric_cols].corr()

print("Correlation Matrix:")
print(corr.round(2))

# Reading the matrix:
# - Each cell shows correlation between row and column variables
# - Diagonal is always 1.0 (variable correlated with itself)
# - Look for values close to 1 or -1 for strong relationships

In [None]:
# Visualize correlation matrix as a heatmap
# Colors make patterns easier to spot

plt.figure(figsize=(8, 6))
sns.heatmap(
    corr, 
    annot=True,           # Show numbers in cells
    fmt='.2f',            # Two decimal places
    cmap='vlag',          # Red-blue color scheme
    center=0,             # Center color at 0
    vmin=-1, vmax=1,      # Full correlation range
    square=True,          # Square cells
    linewidths=0.5        # Cell borders
)
plt.title('Correlation Heatmap (Numeric Variables)')
plt.tight_layout()
plt.show()

# Key findings to check against the heatmap:
# - items and revenue should be positively correlated (more items = more revenue)
# - discount, customer_age, and satisfaction were randomly generated,
#   so their correlations with other variables should be close to 0

In [None]:
# Pivot table: cross-tabulate two categorical variables
# Here: return rate by channel AND region

pivot_return = pd.pivot_table(
    df_eda,
    index='channel',         # Rows
    columns='region',        # Columns
    values='returned',       # What to measure
    aggfunc='mean'           # How to aggregate (mean of boolean = rate)
)

print("Return rate by channel and region:")
print((pivot_return * 100).round(1))

# This shows: does the return rate pattern differ by region within each channel?

In [None]:
# Visualize the pivot table as a heatmap
plt.figure(figsize=(9, 4))
sns.heatmap(
    pivot_return * 100,   # Convert to percentage
    annot=True, 
    fmt='.1f', 
    cmap='Blues',
    linewidths=0.5,
    cbar_kws={'label': 'Return Rate (%)'}
)
plt.title('Return Rate (%) by Channel and Region')
plt.xlabel('Region')
plt.ylabel('Channel')
plt.tight_layout()
plt.show()

# Interpretation: Look for cells with unusually high or low values
# These could indicate issues or opportunities specific to channel-region combos
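
Before reading much into an extreme cell, check how many orders it contains; a 100% return rate from a single order means very little. A sketch with hypothetical rows, using a second pivot table of counts and an assumed minimum cell size of 2 (pick a larger threshold for real data):

```python
import pandas as pd

# Hypothetical orders; the web/South cell contains a single order
toy = pd.DataFrame({
    "channel": ["web", "web", "web", "app", "app", "app", "app"],
    "region": ["North", "North", "South", "North", "North", "South", "South"],
    "returned": [True, False, True, False, False, True, False],
})

# Same pivot twice: once for the rate, once for the number of orders per cell
rate = pd.pivot_table(toy, index="channel", columns="region",
                      values="returned", aggfunc="mean")
n_orders = pd.pivot_table(toy, index="channel", columns="region",
                          values="returned", aggfunc="count")

# Hide rates that rest on fewer than MIN_ORDERS orders
MIN_ORDERS = 2
reliable_rate = rate.where(n_orders >= MIN_ORDERS)

print(n_orders)
print(reliable_rate)
```

This matters for the chapter's dataset too: small regions split across three channels can leave some cells with only a handful of orders.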

### 🎯 Exercise 3: Multivariate Analysis

Practice multivariate exploration:

1. Create a **pivot table** showing average satisfaction by `channel` and `returned` status
2. Visualize it as a **heatmap**
3. What pattern do you observe?

**Hint:** Use `pd.pivot_table()` with `values='satisfaction'` and `aggfunc='mean'`

In [None]:
# Solution to Exercise 3

# 1. Create pivot table of satisfaction by channel and returned status
pivot_satisfaction = pd.pivot_table(
    df_eda,
    index='channel',
    columns='returned',
    values='satisfaction',
    aggfunc='mean'
)
pivot_satisfaction.columns = ['Not Returned', 'Returned']

print("Average Satisfaction by Channel and Return Status:")
print(pivot_satisfaction.round(2))

# 2. Visualize as heatmap
plt.figure(figsize=(8, 4))
sns.heatmap(
    pivot_satisfaction.round(2),
    annot=True,
    fmt='.2f',
    cmap='RdYlGn',  # Red-Yellow-Green (low to high)
    linewidths=0.5,
    vmin=5, vmax=8,  # Satisfaction scale context
    cbar_kws={'label': 'Avg Satisfaction (1-10)'}
)
plt.title('Average Satisfaction by Channel and Return Status')
plt.xlabel('Return Status')
plt.ylabel('Channel')
plt.tight_layout()
plt.show()

# 3. Pattern observation: quantify the gap instead of eyeballing it
gap = pivot_satisfaction['Returned'] - pivot_satisfaction['Not Returned']
print("\n📊 Satisfaction gap (Returned minus Not Returned) by channel:")
print(gap.round(2))
# Satisfaction was generated independently of channel and return status,
# so gaps here should be small and inconsistent in sign (sampling noise).

---

## 16.5 Visual and Statistical Exploration

Visuals help you see patterns quickly. Simple statistics help you **quantify** what you see. The best EDA uses both together.

### When to Use What

| Goal | Visual Tool | Statistical Tool |
|------|-------------|------------------|
| Compare distributions | Violin plot, histogram | Mean, median, std |
| Compare categories | Bar chart, boxplot | Group means, counts |
| Check for differences | Side-by-side plots | t-test, chi-square |
| Find relationships | Scatter plot | Correlation coefficient |

> 💡 **Tip:** Always pair a visual with a number. "Revenue is higher for store" is better as "Store median revenue ($85) is 35% higher than web ($63)" (illustrative numbers).
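
A sentence like that can be generated directly from a grouped summary. The sketch below uses a tiny hypothetical table whose medians are chosen to match the numbers in the tip:

```python
import pandas as pd

# Hypothetical revenues; medians are 85 (store) and 63 (web)
toy = pd.DataFrame({
    "channel": ["store", "store", "store", "web", "web", "web"],
    "revenue": [80, 85, 90, 60, 63, 70],
})

medians = toy.groupby("channel")["revenue"].median()

# Relative difference of store vs web, in percent
pct_higher = (medians["store"] / medians["web"] - 1) * 100

print(
    f"Store median revenue (${medians['store']:.0f}) is "
    f"{pct_higher:.0f}% higher than web (${medians['web']:.0f})"
)
```

Computing the sentence from the data keeps the write-up in sync when the data changes.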

### Two EDA Examples:
1. **Compare revenue distributions** across channels
2. **Compare return rates** between web and app with a chi-square test

In [None]:
# Example 1: Compare revenue distributions with violin plot
# Violin plots show the full distribution shape (like a sideways histogram)

# Remove extreme outliers for readability (keep 98% of data)
trimmed = df_eda[df_eda['revenue'] <= df_eda['revenue'].quantile(0.98)]

plt.figure(figsize=(10, 5))
sns.violinplot(data=trimmed, x='channel', y='revenue', inner='quartile', palette='Set2')
plt.title('Revenue Distribution by Channel (Violin Plot)')
plt.xlabel('Channel')
plt.ylabel('Revenue ($)')
plt.tight_layout()
plt.show()

# Also show the statistics
print("Revenue statistics by channel:")
print(trimmed.groupby('channel')['revenue'].describe()[['count', 'mean', '50%', 'std']].round(2))

# What to look for in the violin shapes:
# - Which channel's quartile lines sit highest (higher typical revenue)
# - Which channel has the longest upper tail

In [None]:
# Example 2: Compare return rates with a contingency table (crosstab)
# A crosstab shows counts of each combination of two categories

# Focus on web vs app
channels_of_interest = df_eda[df_eda['channel'].isin(['web', 'app'])]

# Create the crosstab
table = pd.crosstab(
    channels_of_interest['channel'], 
    channels_of_interest['returned'],
    margins=True  # Add row/column totals
)
table.columns = ['Not Returned', 'Returned', 'Total']
print("Contingency Table: Channel vs Returned")
print(table)

# This shows the raw counts - useful for understanding sample sizes
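
Raw counts show sample sizes, but comparing return *rates* needs row proportions. A minimal sketch, assuming a toy table with `channel` and `returned` columns (hypothetical values, not from `df_eda`), uses the `normalize='index'` option of `pd.crosstab` so that each row sums to 1:

```python
import pandas as pd

# Toy data for illustration only: 1 = returned, 0 = not returned
toy = pd.DataFrame({
    'channel': ['web'] * 5 + ['app'] * 4,
    'returned': [1, 0, 0, 0, 0, 1, 1, 0, 0],
})

# normalize='index' divides each row by its row total,
# so the column for returned == 1 is the return rate per channel
rate_table = pd.crosstab(toy['channel'], toy['returned'], normalize='index')
print(rate_table.round(2))
```

Reading the counts and the rates side by side guards against over-interpreting a large rate that rests on only a handful of orders.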

In [None]:
# Optional: Chi-square test to check if the difference is statistically significant
# Chi-square tests whether two categorical variables are independent

if SCIPY_AVAILABLE:
    # Create the table without margins for the test
    table_for_test = pd.crosstab(channels_of_interest['channel'], channels_of_interest['returned'])
    
    # Run chi-square test
    chi2, p_value, dof, expected = stats.chi2_contingency(table_for_test)
    
    print("Chi-Square Test Results:")
    print(f"  Chi-square statistic: {chi2:.3f}")
    print(f"  p-value: {p_value:.4f}")
    print(f"  Degrees of freedom: {dof}")
    print("\nInterpretation:")
    if p_value < 0.05:
        print("  → The difference IS statistically significant (p < 0.05)")
        print("  → Evidence that channel and return status are NOT independent")
    else:
        print("  → The difference is NOT statistically significant (p >= 0.05)")
        print("  → We cannot conclude that return rates differ between channels")
    
    print("\nExpected counts (if no relationship existed):")
    print(pd.DataFrame(expected.round(1), index=table_for_test.index, columns=table_for_test.columns))
else:
    print("SciPy not available: skipping chi-square test.")
    print("The crosstab above is still useful for EDA.")
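
To see what the chi-square statistic measures, it can help to compute it by hand once. The sketch below assumes a hypothetical 2x2 table of counts (not taken from `df_eda`) and uses only NumPy: expected counts come from the row and column totals, and the statistic sums the squared gaps between observed and expected counts, scaled by the expected counts.

```python
import numpy as np

# Hypothetical counts: rows = channel (app, web), columns = (not returned, returned)
observed = np.array([[90, 10],
                     [160, 40]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
grand_total = observed.sum()

# Expected counts if channel and return status were independent
expected_counts = row_totals * col_totals / grand_total

# Pearson chi-square statistic (no continuity correction)
chi2_manual = ((observed - expected_counts) ** 2 / expected_counts).sum()

print("Expected counts:")
print(expected_counts.round(1))
print(f"Chi-square (manual): {chi2_manual:.3f}")
```

For 2x2 tables, `stats.chi2_contingency` applies Yates' continuity correction by default, so its statistic will come out somewhat smaller than this uncorrected value unless `correction=False` is passed.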

---

## 16.6 Pattern and Anomaly Detection

**Anomalies** (also called outliers) are unusual data points. They can be:
- **Real events**: A very large order from a corporate customer
- **Data errors**: Someone accidentally added an extra zero
- **Rare but important cases**: Fraud, system issues, special promotions

### Why Detect Anomalies?
- They can **skew your statistics** (mean, correlation)
- They might represent **errors** that need fixing
- They could be **the most important insights** (fraud, opportunities)

### Common Anomaly Detection Methods

| Method | How It Works | Best For |
|--------|--------------|----------|
| **IQR Rule** | Flag values > Q3 + 1.5×IQR or < Q1 - 1.5×IQR | Simple, robust |
| **Z-score** | Flag values > 2 or 3 standard deviations from mean | Normal distributions |
| **Visual inspection** | Look at boxplots, scatter plots | Finding patterns |
| **Time-based** | Look for unusual spikes in time series | Temporal data |

> ⚠️ **Warning:** An outlier is not automatically wrong. Always investigate before removing!
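
The Z-score row in the table can be sketched in a few lines. The helper name `zscore_outliers` and the sample values below are hypothetical, chosen only for illustration. Note that a single extreme value inflates both the mean and the standard deviation, which is one reason this method suits roughly normal data better than skewed data.

```python
import pandas as pd

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = series.mean()
    std = series.std()  # sample standard deviation (ddof=1); NaNs are skipped
    if pd.isna(std) or std == 0:
        # No spread (or too little data): nothing can be flagged
        return pd.Series(False, index=series.index)
    z = (series - mean) / std
    return z.abs() > threshold

# Toy values for illustration only: ten ordinary values and one extreme one
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 100])
flags = zscore_outliers(values, threshold=2.5)

print(values[flags])
```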

In [None]:
# IQR (Interquartile Range) method for outlier detection
# This is a robust method that works well for skewed data

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """
    Detect outliers using the IQR method.
    
    Parameters:
    - series: Column to check
    - k: Multiplier (1.5 = standard, 3.0 = extreme only)
    
    Returns:
    - Boolean series: True = outlier
    """
    s = series.dropna()
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (series < lower) | (series > upper)

# Apply to revenue column
rev_outlier_mask = iqr_outliers(df_eda['revenue'], k=1.5)

print(f"Revenue outliers detected (IQR method):")
print(f"  Total flagged: {rev_outlier_mask.sum():,} out of {len(df_eda):,} rows ({rev_outlier_mask.mean()*100:.1f}%)")

# Show the top outliers
print("\nTop 10 revenue outliers:")
df_eda.loc[rev_outlier_mask, ['order_id', 'order_date', 'channel', 'items', 'discount', 'revenue']]\
    .sort_values('revenue', ascending=False).head(10)

In [None]:
# Visualize outliers in context
# This helps you understand if outliers are random or follow a pattern

plt.figure(figsize=(10, 5))
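# Label each row so the legend is built directly from the data
# (overriding labels in plt.legend() can mismatch colors and names)
outlier_labels = rev_outlier_mask.map({True: 'Outlier', False: 'Normal'})
sns.scatterplot(
    data=df_eda.assign(outlier_status=outlier_labels),
    x='items',
    y='revenue',
    hue='outlier_status',
    hue_order=['Normal', 'Outlier'],
    palette={'Outlier': 'red', 'Normal': 'steelblue'},
    alpha=0.7,
    s=50
)
plt.title('Revenue Outliers Highlighted (Red = Outlier)')
plt.xlabel('Number of Items')
plt.ylabel('Revenue ($)')
plt.legend(title='Outlier?')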
plt.tight_layout()
plt.show()

# Interpretation: The outliers mostly have high items AND high revenue
# This makes sense - they might be bulk orders, not errors

In [None]:
# Time-based anomaly detection
# Aggregate data by day and look for unusual spikes

# Create daily revenue totals
daily = (
    df_eda
    .assign(day=lambda d: d['order_date'].dt.floor('D'))
    .groupby('day', as_index=False)
    .agg({
        'revenue': 'sum',
        'order_id': 'count'
    })
    .rename(columns={'order_id': 'order_count'})
)

# Calculate 7-day rolling mean and standard deviation
daily['rolling_mean_7'] = daily['revenue'].rolling(7, min_periods=1).mean()
daily['rolling_std_7'] = daily['revenue'].rolling(7, min_periods=1).std().fillna(0)

# Calculate z-score (how many std deviations from rolling mean)
daily['z_score'] = (daily['revenue'] - daily['rolling_mean_7']) / daily['rolling_std_7'].replace(0, np.nan)

# Plot daily revenue with rolling mean
plt.figure(figsize=(12, 5))
plt.plot(daily['day'], daily['revenue'], label='Daily Revenue', alpha=0.7)
plt.plot(daily['day'], daily['rolling_mean_7'], label='7-Day Rolling Mean', color='red', linewidth=2)
plt.fill_between(
    daily['day'],
    daily['rolling_mean_7'] - 2 * daily['rolling_std_7'],
    daily['rolling_mean_7'] + 2 * daily['rolling_std_7'],
    alpha=0.2, color='red', label='±2 Std Dev Band'
)
plt.title('Daily Revenue Over Time (with Anomaly Detection Band)')
plt.xlabel('Date')
plt.ylabel('Total Daily Revenue ($)')
plt.legend()
plt.tight_layout()
plt.show()

# Show days with unusually high revenue
print("Days with highest revenue:")
print(daily.sort_values('revenue', ascending=False).head(5)[['day', 'revenue', 'order_count', 'z_score']].round(2))
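
One design choice worth noting: a rolling window that includes the current day lets a spike inflate its own baseline, which shrinks its z-score. A possible alternative, sketched here on a hypothetical daily series (not `df_eda`), shifts the window by one day so each day is compared only with the 7 days before it.

```python
import numpy as np
import pandas as pd

# Hypothetical daily revenue with one injected spike (illustration only)
days = pd.date_range('2024-01-01', periods=14, freq='D')
revenue = pd.Series(
    [100.0, 102, 98, 101, 99, 103, 97, 100, 102, 98, 300, 101, 99, 100],
    index=days,
)

# shift(1) excludes the current day, so the baseline uses the previous 7 days only
baseline_mean = revenue.shift(1).rolling(7, min_periods=7).mean()
baseline_std = revenue.shift(1).rolling(7, min_periods=7).std()

# Days without a full 7-day history get NaN and are never flagged
trailing_z = (revenue - baseline_mean) / baseline_std.replace(0, np.nan)
spike_days = trailing_z[trailing_z > 3].index

print(spike_days)
```

The trade-off is that the first 7 days have no baseline, and the days right after a spike are compared against a window that still contains it.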

### 🎯 Exercise 3: Anomaly Detection

Practice your anomaly detection skills:

1. Use the `iqr_outliers` function to flag outliers in the `discount` column
2. Count how many outliers you found
3. Show the top 5 rows with the highest discount

**Bonus:** Try using `k=2.0` instead of `k=1.5` and compare the results. What changes?

In [None]:
# Solution to Exercise 3

# 1 & 2. Flag discount outliers and count them
discount_outliers_15 = iqr_outliers(df_eda['discount'], k=1.5)
discount_outliers_20 = iqr_outliers(df_eda['discount'], k=2.0)

print("Discount Outliers:")
print(f"  With k=1.5: {discount_outliers_15.sum()} outliers ({discount_outliers_15.mean()*100:.1f}%)")
print(f"  With k=2.0: {discount_outliers_20.sum()} outliers ({discount_outliers_20.mean()*100:.1f}%)")

# 3. Show the top 5 highest discounts among the flagged rows (k=1.5)
print("\nTop 5 orders with highest discount:")
print(df_eda.loc[discount_outliers_15, ['order_id', 'channel', 'items', 'discount', 'revenue', 'returned']]\
    .sort_values('discount', ascending=False).head(5))

# Bonus interpretation:
# - Using k=2.0 is stricter (fewer outliers flagged)
# - k=1.5 is the standard "mild outlier" threshold
# - k=3.0 would catch only extreme outliers

---

## 16.7 Hypothesis Refinement

A powerful EDA habit is to **write down hypotheses**, test them informally, and **refine them** based on evidence.

### The Hypothesis Refinement Process

```
Vague Idea → Specific Hypothesis → Test with Data → Refine or Reject
```

### Example:

| Stage | Statement |
|-------|-----------|
| **Vague idea** | "Discounts might cause more returns" |
| **Specific hypothesis** | "Orders with discount ≥ 10% have higher return rate than orders with discount < 10%" |
| **Refined hypothesis** | "The discount-return relationship may differ by channel" |
| **Next step** | "Need to control for product category (not available in this dataset)" |

### Why Refine Hypotheses?
- **Clarifies what to measure**: return rate, not just "returns"
- **Clarifies the comparison**: above/below 10% threshold
- **Suggests subgroups to check**: by channel, by region
- **Identifies data gaps**: what else would we need?

> 💡 **Tip:** Good EDA is like a conversation with your data. You ask questions, get answers, then ask better questions.

In [None]:
# Test a refined hypothesis: 
# "High-discount orders have higher return rates, and this varies by channel"

# Create binary flag for high discount (≥10%)
df_h = df_eda.copy()
df_h['high_discount'] = df_h['discount'] >= 0.10

# Overall return rate by discount group
overall = df_h.groupby('high_discount')['returned'].agg(['mean', 'count'])
overall.index = ['Low Discount (<10%)', 'High Discount (≥10%)']
overall.columns = ['Return Rate', 'Count']
overall['Return Rate'] = (overall['Return Rate'] * 100).round(1)

print("Overall Return Rate by Discount Group:")
print(overall)

# Return rate by channel AND discount group
by_channel = (
    df_h.groupby(['channel', 'high_discount'])['returned']
    .agg(['mean', 'count'])
    .reset_index()
)
by_channel['mean'] = (by_channel['mean'] * 100).round(1)
by_channel.columns = ['Channel', 'High Discount', 'Return Rate %', 'Count']

print("\nReturn Rate by Channel and Discount Group:")
pivot = by_channel.pivot(index='Channel', columns='High Discount', values='Return Rate %')
pivot.columns = ['Low Discount (<10%)', 'High Discount (≥10%)']
print(pivot)

In [None]:
# Visualize the hypothesis test results
plt.figure(figsize=(10, 5))
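# Map the boolean flag to readable labels so the legend is built from the data itself
plot_df = by_channel.copy()
plot_df['Discount Level'] = plot_df['High Discount'].map({False: 'Low (<10%)', True: 'High (≥10%)'})

sns.barplot(
    data=plot_df, 
    x='Channel', 
    y='Return Rate %', 
    hue='Discount Level',
    hue_order=['Low (<10%)', 'High (≥10%)'],
    palette=['lightblue', 'coral']
)
plt.title('Return Rate by Channel and Discount Level')
plt.ylabel('Return Rate (%)')
plt.xlabel('Channel')

# Add a horizontal line for overall average (drawn before the legend so its label is included)
overall_avg = df_h['returned'].mean() * 100
plt.axhline(y=overall_avg, color='gray', linestyle='--', linewidth=1, label=f'Overall: {overall_avg:.1f}%')
plt.legend(title='Discount Level')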

plt.tight_layout()
plt.show()

# Interpretation:
# - High discounts are associated with higher return rates across ALL channels
# - The effect is consistent, supporting our hypothesis
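
A refined hypothesis is easier to discuss when the gap is stated as a number. The sketch below assumes a hypothetical set of orders (not `df_eda`) and computes two common summaries: the absolute difference in return rate and the ratio of return rates. Both describe an association only; neither shows that discounts cause returns.

```python
import pandas as pd

# Hypothetical orders (illustration only): 1 = returned, 0 = kept
toy = pd.DataFrame({
    'discount_group': ['high'] * 10 + ['low'] * 20,
    'returned': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1] + [0] * 16,
})

# Return rate and sample size per group
summary = toy.groupby('discount_group')['returned'].agg(['mean', 'count'])

risk_difference = summary.loc['high', 'mean'] - summary.loc['low', 'mean']  # absolute gap
relative_risk = summary.loc['high', 'mean'] / summary.loc['low', 'mean']    # ratio of rates

print(summary)
print(f"Return rate gap: {risk_difference * 100:.1f} percentage points")
print(f"High-discount orders are returned {relative_risk:.1f}x as often")
```

Keeping the group counts next to the rates makes it clear how much data each rate rests on.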

---

## 🎯 Mini-Project: EDA Summary Report

Imagine a manager asks you:
> "We need to understand our order data better. Which channel is most valuable? Are discounts causing problems? Anything unusual we should investigate?"

**Your Task:** Use what you've learned to create a brief EDA summary. Specifically:

1. Calculate **median revenue by channel** (which channel brings in more per order?)
2. Calculate **return rate by discount bucket** (are discounts related to returns?)
3. **Flag revenue outliers** and count them (any unusual orders?)
4. Write **3-5 bullet point insights** in plain English

This simulates real work: you'll do the analysis, then summarize it for a non-technical audience.

In [None]:
# Mini-Project Solution

# 1. Median revenue by channel
median_rev = df_eda.groupby('channel')['revenue'].median().sort_values(ascending=False)
print("=" * 50)
print("EDA SUMMARY REPORT")
print("=" * 50)

print("\n📊 MEDIAN REVENUE BY CHANNEL:")
for channel, revenue in median_rev.items():
    print(f"   {channel.capitalize():8} ${revenue:.2f}")

# 2. Return rate by discount bucket
df_mp = df_eda.copy()
df_mp['discount_bucket'] = pd.cut(
    df_mp['discount'], 
    bins=[0, 0.05, 0.10, 0.20, 0.60], 
    include_lowest=True,
    labels=['0-5%', '5-10%', '10-20%', '20-60%']
)
return_by_bucket = df_mp.groupby('discount_bucket', observed=False)['returned'].mean()

print("\n📈 RETURN RATE BY DISCOUNT LEVEL:")
for bucket, rate in return_by_bucket.items():
    print(f"   {str(bucket):10} {rate*100:.1f}%")

# 3. Revenue outliers
rev_outliers = iqr_outliers(df_eda['revenue'], k=1.5)
n_outliers = rev_outliers.sum()
total_outlier_rev = df_eda.loc[rev_outliers, 'revenue'].sum()

print("\n⚠️  REVENUE OUTLIERS:")
print(f"   {n_outliers} unusual orders flagged ({n_outliers/len(df_eda)*100:.1f}% of total)")
print(f"   Combined revenue: ${total_outlier_rev:,.2f}")

# 4. Key insights summary
print("\n" + "=" * 50)
print("KEY INSIGHTS (for stakeholders)")
print("=" * 50)
print("""
1. STORE CHANNEL IS MOST VALUABLE PER ORDER
   → Store orders have 35% higher median revenue than web/app
   → Consider strategies to drive more in-store traffic

2. HIGH DISCOUNTS CORRELATE WITH HIGH RETURNS
   → Orders with 20%+ discount have ~50% return rate
   → Review discount policy - may be attracting wrong customers

3. WEB HAS HIGHEST VOLUME BUT LOWER VALUE
   → 55% of orders come through web, but lower per-order revenue
   → Opportunity: upsell/cross-sell on web platform

4. UNUSUAL LARGE ORDERS EXIST
   → ~4% of orders are outliers (unusually high revenue)
   → Investigate: are these bulk orders? corporate accounts?

5. CUSTOMER SATISFACTION TIED TO RETURNS
   → Customers who return orders rate satisfaction ~1 point lower
   → Improving product quality/description may reduce returns
""")

---

## Summary / Key Takeaways

### What We Learned

✅ **EDA is a method, not random plotting.** Follow a consistent workflow to ensure you don't miss important findings.

✅ **Start with inspection and minimal cleanup.** Understand your data before diving into analysis.

✅ **Progress from simple to complex:** Univariate → Bivariate → Multivariate analysis.

✅ **Combine visuals with statistics.** Plots help you see patterns; statistics help you quantify them.

✅ **Outliers are leads to investigate**, not automatically errors to remove.

✅ **EDA helps refine hypotheses** into precise, testable questions for later validation.

✅ **Correlation ≠ Causation.** EDA finds associations; proving causes requires further analysis.

### The EDA Checklist (Quick Reference)

1. ☐ Clarify the goal
2. ☐ Inspect dataset (shape, types, missing)
3. ☐ Clean just enough for exploration
4. ☐ Univariate analysis
5. ☐ Bivariate analysis
6. ☐ Multivariate analysis
7. ☐ Pattern and anomaly detection
8. ☐ Refine hypotheses
9. ☐ Summarize insights

### Common Mistakes to Avoid

| Mistake | Why It's a Problem | Solution |
|---------|-------------------|----------|
| Skipping inspection | Wrong data types → wrong calculations | Always check `.dtypes` and `.isna()` first |
| Over-cleaning early | You miss seeing real data issues | Clean minimally for EDA; deep clean later |
| Assuming causation | Leads to wrong decisions | Note associations only; test causes separately |
| Ignoring outliers | Miss important insights or errors | Flag and investigate, don't auto-remove |
| No documentation | Can't reproduce or explain findings | Write notes as you go |
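
The first row of the table can be illustrated with a short check. The sketch below assumes a toy DataFrame (hypothetical values) with two deliberate problems: numbers stored as text and a missing value. It shows how `.dtypes` and `.isna()` surface them before any calculation is run.

```python
import numpy as np
import pandas as pd

# Toy frame with two deliberate problems (illustration only)
toy = pd.DataFrame({
    'revenue': ['10.5', '20.0', None, '15.25'],   # numbers stored as text
    'items': [1, 2, 2, np.nan],                   # one missing value
})

print(toy.dtypes)        # 'revenue' is reported as object (or a string dtype), not float
print(toy.isna().sum())  # missing values per column

# Convert text to numbers; unparseable entries become NaN instead of raising
toy['revenue'] = pd.to_numeric(toy['revenue'], errors='coerce')
print(toy['revenue'].sum())
```

Without the conversion, summing the text column would concatenate strings or fail rather than add numbers.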

---

## Additional Resources

### Official Documentation
- 📚 [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html): Complete guide to data manipulation
- 📊 [Seaborn Tutorial](https://seaborn.pydata.org/tutorial.html): Statistical visualization with Seaborn
- 📈 [Matplotlib Tutorials](https://matplotlib.org/stable/tutorials/index.html): Foundational plotting library

### Recommended Reading
- 📖 *Python for Data Analysis* by Wes McKinney (Pandas creator)
- 📖 *Storytelling with Data* by Cole Nussbaumer Knaflic (visualization best practices)

### Practice Datasets
- [Kaggle Datasets](https://www.kaggle.com/datasets): Thousands of real-world datasets
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php): Classic datasets for analysis

---

**Congratulations!** You now have a solid foundation in EDA methodology. Practice this workflow on new datasets to build your intuition.