# Chapter 7: Exploratory Data Analysis and Descriptive Statistics

---

## Introduction

**Exploratory Data Analysis (EDA)** is the critical first step in any data analytics project. Before building models, creating reports, or making predictions, you must *understand* your data. EDA is the process of systematically examining datasets to discover patterns, spot anomalies, test assumptions, and summarize main characteristics, often using visual methods.

Think of EDA as a detective's investigation: you're gathering clues about what the data contains, what stories it might tell, and what problems or surprises might be lurking beneath the surface.

### Why is EDA so important?

1. **Prevents costly mistakes**: Wrong assumptions about data lead to wrong conclusions
2. **Reveals data quality issues**: Missing values, duplicates, and inconsistencies become visible
3. **Guides further analysis**: EDA helps you decide which techniques and models are appropriate
4. **Builds intuition**: You develop a "feel" for the data that helps throughout your project
5. **Communicates findings**: Visualizations and summaries help explain data to stakeholders

### What we'll cover in this chapter

This chapter introduces the core techniques of EDA and descriptive statistics:

| Topic | Description |
|-------|-------------|
| **Data Quality Checks** | Inspecting types, missing values, duplicates |
| **Distribution Analysis** | Understanding how values are spread out |
| **Central Tendency** | Mean, median, mode: what's "typical"? |
| **Dispersion** | Spread and variability of data |
| **Outlier Detection** | Finding unusual or extreme values |
| **Correlation Analysis** | How variables relate to each other |
| **Multivariate Exploration** | Looking at multiple variables together |
| **Automated Reports** | Tools that generate EDA summaries |

> **Prerequisites**: This chapter assumes you're familiar with Python basics (Chapter 2), NumPy (Chapter 3), Pandas (Chapter 4), and basic plotting (Chapter 5).

---

## Learning Objectives

By the end of this chapter, you will be able to:

✅ Explain what EDA is and why it matters for data analytics projects

✅ Summarize data using measures of **central tendency** (mean, median, mode)

✅ Describe data spread using measures of **dispersion** (range, variance, standard deviation, IQR)

✅ Analyze **distributions** and identify **outliers** using multiple techniques

✅ Explore relationships between variables using **correlation** analysis

✅ Perform **multivariate** exploration to find patterns across multiple variables

✅ Create a simple, repeatable EDA checklist for your projects

✅ (Optional) Generate an automated EDA report using Python tools

✅ Interpret EDA results carefully and avoid common analytical traps

---

## 7.1 Purpose and Importance of EDA

EDA is the process of *looking at your data* to understand:

| Question | What You're Looking For |
|----------|------------------------|
| **What are the columns?** | Column names, data types (numbers, categories, dates) |
| **Is the data quality good?** | Missing values, duplicates, inconsistent values |
| **What are typical values?** | Averages, common categories, expected ranges |
| **How do values vary?** | Spread (dispersion) and unusual values (outliers) |
| **Do variables relate?** | Patterns, trends, correlations between columns |

### The EDA Mindset

EDA is not about finding "the answer"; it's about asking the right questions. You should approach EDA with curiosity and skepticism:

- **Be curious**: What patterns exist? What surprises are hiding in the data?
- **Be skeptical**: Could this pattern be a data error? Is this outlier real?
- **Be systematic**: Use a consistent checklist so you don't miss important checks

### Why Good EDA Prevents Expensive Mistakes

```
❌ Wrong assumptions → Wrong charts → Wrong models → Wrong conclusions → Bad decisions
✅ Good EDA → Right understanding → Right approach → Valid conclusions → Good decisions
```

> **💡 Tip**: EDA is not a one-time step. You'll often go back and forth as you discover issues and refine questions. Think of it as an iterative conversation with your data.

---

## 7.2 Setup: Import Libraries

Before we begin our EDA, let's import the libraries we'll use:

| Library | Purpose |
|---------|---------|
| **NumPy** | Numerical computing and array operations |
| **Pandas** | Data manipulation and analysis |
| **Matplotlib** | Basic plotting and visualization |
| **Seaborn** | Statistical visualizations (optional but recommended) |
| **SciPy** | Statistical functions like z-scores (optional) |

> **⚠️ Note**: If you see `ModuleNotFoundError`, install missing packages in your environment:
> ```
> pip install pandas numpy matplotlib seaborn scipy
> ```

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Optional: Seaborn for prettier charts
try:
    import seaborn as sns
    sns.set_theme(style="whitegrid")
except ImportError:
    sns = None
    print("Seaborn is not installed. Plots will use Matplotlib only.")

# Optional: SciPy for z-scores and some stats helpers
try:
    from scipy import stats
except ImportError:
    stats = None
    print("SciPy is not installed. Some stats examples will be skipped.")

%matplotlib inline
plt.rcParams["figure.figsize"] = (10, 5)
pd.set_option('display.max_columns', 50)

---

## 7.3 Loading a Practice Dataset

To focus on EDA skills, we'll use the **diamonds** dataset from seaborn, a real-world dataset containing information about diamond prices and characteristics. This is a great dataset for practicing EDA because it has:
- Multiple numeric columns (price, carat, depth, table, dimensions)
- Categorical columns (cut, color, clarity)
- A good size for exploration (~54,000 rows)

We'll also inject a few realistic data issues (missing values, duplicates) to practice data quality checks.

### Dataset Description

| Column | Description | Type |
|--------|-------------|------|
| `carat` | Weight of the diamond | Float |
| `cut` | Quality of the cut (Fair, Good, Very Good, Premium, Ideal) | Category |
| `color` | Diamond color, from D (best) to J (worst) | Category |
| `clarity` | Clarity rating (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF) | Category |
| `depth` | Total depth percentage | Float |
| `table` | Width of top relative to widest point | Float |
| `price` | Price in US dollars | Integer |
| `x`, `y`, `z` | Dimensions in mm | Float |

> **⚠️ Warning (common mistake)**: In real projects, you must understand how the data was collected (tracking rules, definitions, time windows). EDA can't fix bad definitions or unclear business rules.

In [None]:
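# Load the diamonds dataset from seaborn
# (seaborn is required here, and the first call downloads the file, so it needs internet access)
if sns is None:
    raise ImportError("Loading the diamonds dataset requires seaborn: pip install seaborn")
df = sns.load_dataset("diamonds")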

# Add an order_id for reference
df = df.reset_index().rename(columns={"index": "order_id"})
df["order_id"] = df["order_id"] + 1

# Add order_date for time-series examples
rng = np.random.default_rng(7)
start = np.datetime64('2025-01-01')
dates = start + rng.integers(0, 180, size=len(df)).astype('timedelta64[D]')
df["order_date"] = pd.to_datetime(dates)

# Rename some columns to match our business context
df = df.rename(columns={"price": "revenue", "carat": "units"})

# Create segment from cut quality
df["segment"] = df["cut"].map({
    "Fair": "Budget", "Good": "Standard", "Very Good": "Standard",
    "Premium": "Premium", "Ideal": "Premium"
})

# Create region from color
df["region"] = df["color"].map({
    "D": "North", "E": "North", "F": "East", "G": "East",
    "H": "South", "I": "South", "J": "West"
})

# Add discount_rate and returned columns
df["discount_rate"] = rng.beta(2, 8, size=len(df)) * 0.4
df["unit_price"] = df["revenue"] / df["units"]
df["returned"] = rng.random(len(df)) < (0.03 + 0.20 * df["discount_rate"])

# Keep only the columns we need for EDA
df = df[["order_id", "order_date", "segment", "region", "units", "unit_price", 
         "discount_rate", "revenue", "returned"]].copy()

# Inject a few realistic data issues:
# 1) Missing values
missing_idx = rng.choice(df.index, size=50, replace=False)
df.loc[missing_idx[:25], "unit_price"] = np.nan
df.loc[missing_idx[25:], "segment"] = None

# 2) Duplicates (duplicate some rows but change order_id)
dupe_rows = df.sample(20, random_state=1).copy()
dupe_rows["order_id"] = np.arange(len(df) + 1, len(df) + 1 + len(dupe_rows))
df = pd.concat([df, dupe_rows], ignore_index=True)

# 3) Outliers: some extreme revenue values already exist in diamonds data

print(f"Dataset shape: {df.shape}")
df.head()

---

## 7.4 A Simple EDA Checklist (Your Repeatable Workflow)

When you start EDA, use a consistent checklist. This ensures you don't miss important checks and makes your analysis reproducible.

### The 8-Step EDA Checklist

| Step | Action | Key Tools |
|------|--------|-----------|
| 1️⃣ | **Preview rows**: look at actual data | `head()`, `sample()` |
| 2️⃣ | **Check column types and missing values** | `info()`, `isna()` |
| 3️⃣ | **Summarize numeric columns** | `describe()` |
| 4️⃣ | **Summarize categorical columns** | `value_counts()` |
| 5️⃣ | **Visualize distributions** | Histogram, box plot |
| 6️⃣ | **Look for outliers and data quality issues** | IQR rule, z-scores |
| 7️⃣ | **Explore relationships** | Correlation, groupby, scatter |
| 8️⃣ | **Document observations** | Notes, comments |

> **💡 Tip**: Keep notes as you do EDA. Your future self (or teammates) will thank you. Consider creating a "findings" section in your notebook.

### Step 1: Preview Your Data

Let's start by looking at the first few rows to get a sense of what we're working with:

In [None]:
df.head()

In [None]:
df.sample(5, random_state=42)

### Step 2: Data Types and Missing Values

The `info()` method answers two critical questions for beginners:

1. **Which columns are numeric vs text vs dates?** This determines what operations you can perform
2. **Which columns have missing values?** Non-null counts reveal gaps in your data

> **⚠️ Common mistake**: Treating numeric-looking text as numbers (e.g., `'100'` stored as string instead of `100` as integer). Always verify dtypes before doing math!

In [None]:
df.info()

In [None]:
missing = df.isna().sum().sort_values(ascending=False)
missing[missing > 0]
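
The dtype warning above can be made concrete with a small sketch. The `raw` column below is a hypothetical toy example (it is not part of `df`); it shows how `pd.to_numeric` with `errors="coerce"` converts numeric-looking text and turns unparseable entries into missing values that you can then count.

```python
import pandas as pd

# Hypothetical toy column: numbers that arrived as text, plus one bad entry
raw = pd.Series(["100", "250", " 75 ", "n/a"], name="amount")

# Strip stray whitespace, then convert; errors="coerce" turns
# unparseable entries into missing values instead of raising an error
amount = pd.to_numeric(raw.str.strip(), errors="coerce")

print(amount.dtype)         # a numeric dtype instead of text
print(amount.isna().sum())  # how many entries could not be parsed
print(amount.sum())         # arithmetic now works and skips missing values
```

Counting the coerced missing values right after conversion is a cheap way to see how much text failed to parse before it silently disappears from later sums and means.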

### Duplicates
Duplicates can happen during data imports, merges, or repeated API pulls.

Here we check duplicates *ignoring* `order_id` (because `order_id` is unique even for duplicated content).

In [None]:
cols_to_check = [c for c in df.columns if c != 'order_id']
duplicate_mask = df.duplicated(subset=cols_to_check, keep=False)
df.loc[duplicate_mask, cols_to_check].head(10)
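
Once duplicates are visible, a common next step is to count and drop them. The sketch below uses a hypothetical toy `orders` table rather than `df`, and assumes that keeping the first occurrence is the right policy, which depends on your data.

```python
import pandas as pd

# Hypothetical toy orders: rows 2 and 3 repeat the content of row 1 under new IDs
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer": ["ana", "ana", "ana", "ben"],
    "amount": [20.0, 20.0, 20.0, 35.0],
})

# Compare rows on their content only, ignoring the unique ID
content_cols = [c for c in orders.columns if c != "order_id"]

# keep="first" marks only the repeats, so the sum counts the extra copies
n_extra = int(orders.duplicated(subset=content_cols, keep="first").sum())

# Keep the first occurrence of each distinct content row
deduped = orders.drop_duplicates(subset=content_cols, keep="first")

print(n_extra)
print(deduped)
```

Note the difference between the two `keep` settings: `keep=False` marks every member of a duplicate group (useful for inspection), while `keep="first"` marks only the repeats (useful for counting and removal).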

### Basic numeric summary (`describe`)
`describe()` gives quick descriptive statistics for numeric columns:
- Count (non-missing)
- Mean and standard deviation
- Min, quartiles (25%, 50%, 75%), max

> **Tip**: Quartiles are key for understanding spread and for detecting outliers.

In [None]:
df.describe(include=[np.number]).T

### Categorical summary (value counts)
For text / category columns, look at the most common values and whether you have unexpected categories (typos, inconsistent labels).

In [None]:
for col in ['segment', 'region', 'returned']:
    print(f"\n{col} value counts:")
    print(df[col].value_counts(dropna=False))
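
Inconsistent labels are easier to spot when you compare the raw categories with a normalized version. The `region_raw` values below are hypothetical and are not taken from `df`; the sketch assumes that whitespace and capitalization are the only differences, which is often but not always true.

```python
import pandas as pd

# Hypothetical labels with stray whitespace and inconsistent capitalization
region_raw = pd.Series(["North", "north ", "NORTH", "East", " east", "South"])

n_raw = region_raw.nunique()  # every spelling counts as its own category

# Normalize: trim whitespace, then use one capitalization style
region_clean = region_raw.str.strip().str.title()
n_clean = region_clean.nunique()

print(n_raw, n_clean)
print(region_clean.value_counts())
```

A large drop in the number of unique values after normalization is a sign that the column needs cleaning before any grouped summary is trusted.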

---
## Exercise 7.1: Quick data understanding
1. How many rows and columns are in `df`?
2. Which columns have missing values, and how many?
3. What are the *top 2* regions by count?

Write code below.

In [None]:
# 1) shape
# YOUR CODE HERE

# 2) missing values per column
# YOUR CODE HERE

# 3) top 2 regions
# YOUR CODE HERE

---

## 7.5 Data Distribution Analysis

A **distribution** tells you how values are spread out across your data. Understanding distributions is fundamental to EDA because it reveals:

| Pattern | What It Means | Example |
|---------|---------------|---------|
| **Right-skewed** | Most values are small, with a long tail of large values | Income, home prices |
| **Left-skewed** | Most values are large, with a long tail of small values | Age at retirement |
| **Normal (bell curve)** | Values cluster around the mean symmetrically | Heights, test scores |
| **Bimodal** | Two distinct peaks (possible multiple groups) | Mixed populations |
| **Uniform** | Values spread evenly across the range | Random IDs |

### Key Questions When Analyzing Distributions

- Are values mostly small with a few large ones (right-skewed)?
- Are there multiple peaks (possible multiple groups)?
- Are there strange values or impossible values (negative ages, future dates)?
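
These questions can also be checked numerically as a complement to plots. The sketch below uses hypothetical toy values (`amounts` and `ages` are not columns of `df`): a positive skewness and a mean well above the median both point to a right-skewed variable, and a simple filter exposes impossible values.

```python
import pandas as pd

# Hypothetical order amounts: mostly small, with two large values
amounts = pd.Series([12, 15, 14, 18, 20, 16, 13, 17, 19, 250, 400])

skewness = amounts.skew()  # positive when the long tail is on the right
mean_value = amounts.mean()
median_value = amounts.median()
print(round(skewness, 2), round(mean_value, 2), median_value)

# Hypothetical ages with one impossible entry
ages = pd.Series([34, 27, -1, 45])
impossible = ages[ages < 0]
print(impossible)
```

A numeric check like this does not replace a histogram, but it is a quick way to rank many columns by how skewed they are before deciding which ones to plot first.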

### Common Beginner-Friendly Plots

| Plot Type | Best For | Shows |
|-----------|----------|-------|
| **Histogram** | Continuous data | Frequency distribution (counts in bins) |
| **Box plot** | Continuous data | Median, quartiles, and potential outliers |
| **Bar chart** | Categorical data | Counts per category |
| **KDE plot** | Continuous data | Smoothed density estimate |

> **⚠️ Common mistake**: Using a mean to describe a heavily skewed variable without checking the distribution first. Always visualize before summarizing!

In [None]:
numeric_cols = ['units', 'unit_price', 'discount_rate', 'revenue']
df[numeric_cols].hist(bins=30)
plt.suptitle('Histograms of numeric columns')
plt.tight_layout()
plt.show()

In [None]:
if sns is not None:
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.boxplot(x=df['revenue'], ax=axes[0])
    axes[0].set_title('Revenue (box plot)')
    sns.histplot(df['revenue'], bins=40, kde=True, ax=axes[1])
    axes[1].set_title('Revenue (hist + KDE)')
    plt.tight_layout()
    plt.show()
else:
    plt.boxplot(df['revenue'].dropna(), vert=False)
    plt.title('Revenue (box plot)')
    plt.show()

### Distributions for categorical variables
For categories, a bar chart (counts) is often enough.

In [None]:
counts = df['segment'].value_counts(dropna=False)
counts

In [None]:
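# Label missing segments explicitly so they get their own bar
segment_filled = df['segment'].fillna('Missing')
segment_order = segment_filled.value_counts().index

if sns is not None:
    sns.countplot(x=segment_filled, order=segment_order)
    plt.title('Count by segment')
    plt.xticks(rotation=0)
    plt.show()
else:
    segment_filled.value_counts().plot(kind='bar')
    plt.title('Count by segment')
    plt.show()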

---

## 7.6 Measures of Central Tendency

**Central tendency** describes a *typical* or *representative* value in your data. Think of it as answering the question: "What's a normal value?"

### The Three Main Measures

| Measure | Definition | Pros | Cons |
|---------|------------|------|------|
| **Mean** | Sum of all values ÷ count | Uses all data points | Sensitive to outliers |
| **Median** | Middle value when sorted | Robust to outliers | Ignores how far values lie from the middle |
| **Mode** | Most frequent value | Works for categories | May not be unique |

### Mathematical Definitions

For a dataset with values $x_1, x_2, ..., x_n$:

- **Mean**: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
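- **Median**: The middle value of the sorted data: the value at position $\frac{n+1}{2}$ when $n$ is odd, or the average of the values at positions $\frac{n}{2}$ and $\frac{n}{2}+1$ when $n$ is even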
- **Mode**: The value that appears most frequently
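
The definitions above can be checked by hand on a tiny dataset. This sketch uses only Python's standard `statistics` module; `median_by_position` is a hypothetical helper written for illustration, not a library function.

```python
from statistics import mean, median, multimode

odd = [3, 1, 4, 1, 5]       # n = 5, sorted: [1, 1, 3, 4, 5]
even = [3, 1, 4, 1, 5, 9]   # n = 6, sorted: [1, 1, 3, 4, 5, 9]

def median_by_position(values):
    """Median computed directly from the positional definition."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                  # single middle value (odd n)
    return (s[mid - 1] + s[mid]) / 2   # average of the two middle values (even n)

print(mean(odd), median_by_position(odd), multimode(odd))
print(median_by_position(even), median(even))

# The mode may not be unique: multimode returns every most-frequent value
print(multimode([1, 1, 2, 2, 3]))

# One extreme value moves the mean far more than the median
with_outlier = odd + [1000]
print(mean(with_outlier), median(with_outlier))
```

Adding the single value 1000 shifts the mean from 2.8 to 169, while the median only moves from 3 to 3.5, which is why the median is preferred for data with outliers.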

### When to Use Which?

| Situation | Recommended Measure | Why |
|-----------|--------------------| ----|
| Symmetric distribution | Mean | Accurate representation |
| Skewed distribution | Median | Not affected by extreme values |
| Categorical data | Mode | Only measure defined for unordered categories |
| Data with outliers | Median | More robust |

> **💡 Tip**: If the distribution is skewed or has outliers, median is often more representative than mean. When in doubt, report both!

In [None]:
revenue = df['revenue']
summary_central = {
    'mean': revenue.mean(),
    'median': revenue.median(),
    'mode_first': revenue.mode(dropna=True).iloc[0] if not revenue.mode(dropna=True).empty else np.nan
}
pd.Series(summary_central).round(2)

In [None]:
# Central tendency by segment (grouped summary)
df.groupby('segment', dropna=False)['revenue'].agg(['count', 'mean', 'median']).round(2)

---

## 7.7 Measures of Dispersion (Spread)

**Dispersion** tells you *how much values vary* from the center. A small dispersion means values are clustered together; a large dispersion means they're spread out.

### Key Measures of Dispersion

| Measure | Formula | Interpretation |
|---------|---------|----------------|
| **Range** | $max - min$ | Total spread (very sensitive to outliers) |
| **Variance** | $\sigma^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$ | Average squared deviation (population form; pandas `var()` and `std()` divide by $n-1$ by default) |
| **Standard Deviation** | $\sigma = \sqrt{variance}$ | Spread in original units |
| **IQR** | $Q3 - Q1$ | Spread of middle 50% (robust to outliers) |
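
The variance formula in the table divides by $n$ (the population form), while pandas divides by $n-1$ by default (the sample form). The sketch below uses a small hypothetical list of values to show both conventions side by side; the `ddof` argument selects between them.

```python
import numpy as np
import pandas as pd

values = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared deviations sum to 32
s = pd.Series(values)

pop_var = np.var(values)      # NumPy default ddof=0: divides by n
sample_var = s.var()          # pandas default ddof=1: divides by n - 1
pop_var_pandas = s.var(ddof=0)
pop_std_pandas = s.std(ddof=0)

print(pop_var, sample_var, pop_var_pandas, pop_std_pandas)
```

For large datasets the two conventions give nearly identical results, but on small samples the difference is visible, so it helps to state which one a report uses.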

### Understanding Quartiles and Percentiles

Quartiles divide your sorted data into four equal parts:

```
         Q1 (25%)      Q2 (50% = Median)      Q3 (75%)
            ↓                ↓                    ↓
    |-------|----------------|--------------------|----|
   Min              IQR (Interquartile Range)         Max
```

- **Q1 (25th percentile)**: 25% of data falls below this value
- **Q2 (50th percentile)**: The median
- **Q3 (75th percentile)**: 75% of data falls below this value
- **IQR**: $Q3 - Q1$, the range containing the middle 50% of data
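
A tiny worked example makes the quartile definitions concrete. The data below is hypothetical; the sketch relies on pandas' default linear interpolation, and other interpolation methods can give slightly different quartiles on small samples.

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Quartiles with pandas' default (linear) interpolation
q1, q2, q3 = data.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1

print(q1, q2, q3, iqr)

# Share of values inside [Q1, Q3]: roughly the middle half of the data
inside = data.between(q1, q3).mean()
print(inside)
```

Here Q1 = 3, the median is 5, and Q3 = 7, so the IQR is 4.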

### Common Percentiles in Practice

| Percentile | Common Use |
|------------|------------|
| P90 | "90% of orders are below this amount" |
| P95 | Used for SLA thresholds (e.g., response times) |
| P99 | Marks the far tail: only 1% of values exceed it |

> **⚠️ Common mistake**: Reporting the mean without reporting spread (e.g., standard deviation or IQR). "Average revenue is \$50" is incomplete: you need to know if most values are \$45-\$55 or \$10-\$90!

In [None]:
x = df['revenue'].dropna()
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
dispersion = {
    'min': x.min(),
    'max': x.max(),
    'range': x.max() - x.min(),
    'std': x.std(),
    'var': x.var(),
    'q1': q1,
    'q3': q3,
    'iqr': iqr,
    'p90': x.quantile(0.90),
    'p95': x.quantile(0.95)
}
pd.Series(dispersion).round(2)

---

## 7.8 Outlier Detection Techniques

An **outlier** is a value that is unusually far from most other values. Outliers can be:

| Type | Description | Example |
|------|-------------|---------|
| **Real but rare** | Legitimate extreme events | A very large corporate purchase |
| **Data errors** | Mistakes in recording | Extra zeros (100 → 10000) |
| **Mixed populations** | Different groups behaving differently | Consumer vs enterprise customers |
| **Measurement issues** | Problems with data collection | Sensor malfunction |

### Two Beginner-Friendly Detection Methods

#### Method 1: IQR Rule (Tukey's Fences)

Values are outliers if they fall outside the "fences":

$$\text{Lower fence} = Q1 - 1.5 \times IQR$$
$$\text{Upper fence} = Q3 + 1.5 \times IQR$$

**Pros**: Robust, doesn't assume normal distribution  
**Cons**: Can flag many legitimate values in heavily skewed data

#### Method 2: Z-Score

The z-score measures how many standard deviations a value is from the mean:

$$z = \frac{x - \bar{x}}{\sigma}$$

Values with $|z| > 3$ are typically considered outliers.

**Pros**: Easy to understand  
**Cons**: Assumes roughly normal distribution, sensitive to extreme outliers
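
The sensitivity of the z-score to extreme values can be seen on a small hypothetical sample. In the sketch below, one value is clearly extreme, yet it inflates the mean and standard deviation so much that its own z-score stays below 3, while the IQR rule still flags it.

```python
import pandas as pd

# Hypothetical sample with one obviously extreme value
values = pd.Series([10, 11, 12, 13, 12, 11, 10, 500])

# Z-score rule (population standard deviation, ddof=0)
z = (values - values.mean()) / values.std(ddof=0)
flag_z = z.abs() > 3

# IQR rule (Tukey's fences)
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
flag_iqr = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

print(round(z.max(), 2))             # stays below 3 in such a small sample
print(int(flag_z.sum()), int(flag_iqr.sum()))
```

This effect is strongest in small samples, which is one reason to compare both methods rather than rely on a single threshold.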

### What to Do with Outliers

> **⚠️ Warning**: Don't automatically delete outliers! First, investigate *why* they exist.

| Action | When to Use |
|--------|-------------|
| **Keep** | Outlier is a valid, important data point |
| **Remove** | Outlier is clearly an error |
| **Cap/Winsorize** | Reduce impact without removing |
| **Separate analysis** | Analyze outliers as their own group |
| **Transform** | Use log scale to reduce impact |
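
Two of the actions in the table, capping and transforming, can be sketched in a few lines. The `toy_revenue` values are hypothetical, and capping at Tukey's fences is only one possible choice of limits (fixed percentiles such as P1 and P99 are another common option).

```python
import numpy as np
import pandas as pd

# Hypothetical revenue values with one extreme order
toy_revenue = pd.Series([120, 135, 150, 160, 170, 180, 200, 5000])

q1, q3 = toy_revenue.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap: values beyond the fences are pulled back to the fence, no rows are dropped
capped = toy_revenue.clip(lower=lower, upper=upper)

# Transform: log1p compresses large values while keeping their order
logged = np.log1p(toy_revenue)

print(capped.tolist())
print(logged.round(2).tolist())
```

Both approaches keep every row, which makes them safer defaults than deletion when you are not yet sure whether an extreme value is an error.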

In [None]:
x = df['revenue'].dropna()
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers_iqr = df[(df['revenue'] < lower) | (df['revenue'] > upper)]
lower, upper, outliers_iqr.shape

In [None]:
outliers_iqr.sort_values('revenue', ascending=False).head(10)

In [None]:
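if stats is not None:
    # z-score method (nan_policy='omit' ignores NaNs when computing mean and std)
    z = pd.Series(stats.zscore(df['revenue'], nan_policy='omit'), index=df.index)
    df_z = df.assign(revenue_z=z)
    z_outliers = (
        df_z.loc[df_z['revenue_z'].abs() > 3]
            .sort_values('revenue_z', key=lambda s: s.abs(), ascending=False)
    )
    # Inside an if-block the last expression is not auto-displayed, so call display()
    display(z_outliers.head(10))
else:
    print('SciPy not installed: skipping z-score example.')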

---

## 7.9 Correlation Analysis

**Correlation** measures how two numeric variables move together. It's one of the most important tools for finding relationships in your data.

### Understanding Correlation Values

Correlation ranges from **-1** to **+1**:

| Value | Interpretation | Example |
|-------|----------------|---------|
| **+1** | Perfect positive relationship | Temperature in °C vs. temperature in °F |
| **+0.7 to +1** | Strong positive | Study hours ↑ → Test scores ↑ |
| **+0.3 to +0.7** | Moderate positive | Advertising ↑ → Sales ↑ |
| **0** | No linear relationship | Shoe size and IQ |
| **-0.3 to -0.7** | Moderate negative | Price ↑ → Demand ↓ |
| **-0.7 to -1** | Strong negative | Exercise ↑ → Body fat ↓ |

### Types of Correlation

| Type | Best For | Sensitivity |
|------|----------|-------------|
| **Pearson** | Linear relationships | Sensitive to outliers |
| **Spearman** | Monotonic relationships (rank-based) | Robust to outliers |
| **Kendall** | Ordinal data, small samples | Robust to outliers; handles ties well |

### Pearson vs Spearman: When to Use Which?

- **Pearson**: When you expect a linear relationship and data is roughly normal
- **Spearman**: When the relationship might be non-linear but monotonic (always increasing or decreasing), or when you have outliers
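
The difference between the two is easy to see on a relationship that always increases but is not a straight line. The sketch below uses hypothetical toy data and computes Spearman as a Pearson correlation on ranks, which is what "rank-based" means in practice.

```python
import pandas as pd

# Hypothetical data: y always increases with x, but not linearly
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
y = x ** 3

pearson_r = x.corr(y)                  # Pearson is the default method
spearman_r = x.rank().corr(y.rank())   # Spearman = Pearson on the ranks

print(round(pearson_r, 3), round(spearman_r, 3))
```

Spearman reports a perfect monotonic relationship, while Pearson is lower because the points do not fall on a straight line.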

> **⚠️ Critical Warning**: **Correlation does NOT mean causation!**
> 
> Just because two variables move together doesn't mean one causes the other. There could be:
> - A third variable causing both (confounding)
> - Pure coincidence (spurious correlation)
> - Reverse causation (B causes A, not A causes B)

In [None]:
num = df[['units', 'unit_price', 'discount_rate', 'revenue']].copy()
pearson = num.corr(method='pearson')
spearman = num.corr(method='spearman')
pearson

In [None]:
spearman

In [None]:
corr = spearman  # change to pearson if you want

if sns is not None:
    sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
    plt.title('Correlation heatmap')
    plt.show()
else:
    plt.imshow(corr.values, cmap='coolwarm', vmin=-1, vmax=1)
    plt.colorbar()
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=45, ha='right')
    plt.yticks(range(len(corr.index)), corr.index)
    plt.title('Correlation heatmap')
    plt.tight_layout()
    plt.show()

---

## 7.10 Multivariate Exploration

**Multivariate analysis** means looking at more than two variables together. This helps you answer complex questions like:

- Do patterns change by segment or region?
- Is revenue higher because units are higher, or because price is higher?
- Are returns connected to discounts *and* segment?

### Techniques for Multivariate Exploration

| Technique | Description | Use Case |
|-----------|-------------|----------|
| **GroupBy summaries** | Aggregate by categories | "Average revenue by segment and region" |
| **Pivot tables** | Cross-tabulation | Compare metrics across two dimensions |
| **Colored scatter plots** | Add category as color | See if groups behave differently |
| **Pair plots** | Matrix of all pairwise plots | Quick overview of all relationships |
| **Faceted charts** | Small multiples by category | Compare distributions across groups |

### Why Multivariate Analysis Matters

Looking at one or two variables at a time can be misleading. For example:
- Overall revenue might be increasing, but *only* in one region
- Average price might be stable, but *only* for returning customers
- A correlation might be strong overall, but disappear or even reverse within subgroups (Simpson's Paradox)

> **💡 Tip**: Always ask "Does this pattern hold across all groups?" before drawing conclusions.
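
Simpson's Paradox is easier to believe once you see it in numbers. The `toy` table below is a hypothetical, deliberately constructed example: the overall correlation between `x` and `y` is strongly positive, yet within each group it is perfectly negative.

```python
import pandas as pd

# Hypothetical data built to show a reversal between overall and per-group trends
toy = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "x": [1, 2, 3, 6, 7, 8],
    "y": [3, 2, 1, 8, 7, 6],
})

overall = toy["x"].corr(toy["y"])

# Correlation computed separately inside each group
within = {name: g["x"].corr(g["y"]) for name, g in toy.groupby("group")}

print(round(overall, 3))
print({name: round(r, 3) for name, r in within.items()})
```

The overall trend here comes entirely from the gap between the two groups, so a grouped check like this is worth running before reporting any headline correlation.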

In [None]:
# Revenue by segment and region
df.groupby(['segment', 'region'], dropna=False)['revenue'].agg(['count', 'mean', 'median']).round(2)

In [None]:
# Pivot table can be easier to read for a 2D grouping
pivot = df.pivot_table(index='segment', columns='region', values='revenue', aggfunc='mean', dropna=False)
pivot.round(2)

In [None]:
if sns is not None:
    sns.scatterplot(data=df, x='discount_rate', y='revenue', hue='segment', alpha=0.6)
    plt.title('Revenue vs discount rate (colored by segment)')
    plt.show()
else:
    for seg in df['segment'].dropna().unique():
        d = df[df['segment'] == seg]
        plt.scatter(d['discount_rate'], d['revenue'], alpha=0.5, label=seg)
    plt.xlabel('discount_rate')
    plt.ylabel('revenue')
    plt.title('Revenue vs discount rate (by segment)')
    plt.legend()
    plt.show()

In [None]:
if sns is not None:
    # Pairplot: great for quick scanning, but can be slow on huge datasets
    sample = df[['units', 'unit_price', 'discount_rate', 'revenue', 'segment']].dropna().sample(300, random_state=0)
    sns.pairplot(sample, hue='segment', corner=True)
    plt.show()
else:
    print('Seaborn not installed: skipping pairplot.')

---

## 7.11 Automated EDA Reports

In real work, you may want a quick *automated report* that summarizes your data. These tools generate comprehensive reports with minimal code.

### Popular Automated EDA Tools

| Tool | Description | Install Command |
|------|-------------|-----------------|
| **ydata-profiling** | Comprehensive HTML report | `pip install ydata-profiling` |
| **sweetviz** | Comparison-focused reports | `pip install sweetviz` |
| **dataprep** | Fast, interactive reports | `pip install dataprep` |
| **dtale** | Interactive browser-based EDA | `pip install dtale` |

### What Automated Reports Typically Include

- ✅ Data types and missing values
- ✅ Descriptive statistics for all columns
- ✅ Distribution plots (histograms, bar charts)
- ✅ Correlation matrices
- ✅ Duplicate detection
- ✅ Outlier identification
- ✅ Alerts for potential data quality issues

### When to Use Automated Reports

| Situation | Recommendation |
|-----------|----------------|
| Quick initial overview | ✅ Great for first look |
| Sharing with stakeholders | ✅ Professional appearance |
| Deep investigation | ⚠️ Supplement with custom analysis |
| Large datasets | ⚠️ May be slow; in ydata-profiling, use `minimal=True` |

> **💡 Tip**: Automated tools are helpers, not substitutes. Always sanity-check results and dig deeper where needed.

### Custom Simple Report Function

Sometimes you need a quick summary without installing extra packages. Here's a simple custom function:

In [None]:
def simple_eda_report(data: pd.DataFrame) -> dict:
    """Return a small dictionary of basic EDA summaries for a DataFrame."""
    numeric = data.select_dtypes(include=[np.number])
    categorical = data.select_dtypes(exclude=[np.number])

    report = {
        'shape': data.shape,
        'missing_per_column': data.isna().sum().sort_values(ascending=False),
        # Counts fully identical rows only; a unique ID column (like order_id) hides content duplicates
        'duplicate_rows': int(data.duplicated().sum()),
        'numeric_describe': numeric.describe().T if not numeric.empty else None,
        'categorical_unique_counts': categorical.nunique(dropna=False).sort_values(ascending=False) if not categorical.empty else None,
    }
    return report

report = simple_eda_report(df)
report['shape']

In [None]:
report['missing_per_column'].head(10)

In [None]:
# Optional: ydata-profiling automated report
try:
    from ydata_profiling import ProfileReport
    profile = ProfileReport(df, title='EDA Profile Report', minimal=True)
    # A bare `profile` inside try: is not auto-displayed, so render it explicitly
    profile.to_notebook_iframe()
except Exception as e:
    print('ydata-profiling not available (or failed to run).')
    print('Reason:', e)
    print('You can install it with: pip install ydata-profiling')

---

## 7.12 Interpreting EDA Results

EDA produces *observations*. Your job is to turn them into *insights* carefully. Here's a practical framework:

### The 4-Step Interpretation Process

| Step | Action | Example |
|------|--------|---------|
| 1️⃣ **Describe** | State what you see (facts only) | "Revenue is right-skewed; a few orders are extremely large" |
| 2️⃣ **Hypothesize** | Propose possible reasons | "Large orders might be corporate bulk purchases or data errors" |
| 3️⃣ **Test** | Investigate the hypothesis | "Check those orders by segment/region; verify unit counts" |
| 4️⃣ **Decide** | Choose an action | "Cap outliers for some plots, but keep for totals; flag suspicious records" |

### Common Pitfalls to Avoid

| Pitfall | Problem | Solution |
|---------|---------|----------|
| **Correlation = Causation** | Assuming one variable causes another | Remember: correlation shows relationship, not cause |
| **Ignoring missing values** | Missing data can bias all results | Always check and handle appropriately |
| **Mean on skewed data** | Mean is misleading for skewed distributions | Use median; report both |
| **Deleting outliers blindly** | May remove valid important data | Investigate before removing |
| **Ignoring context** | Data without business context is meaningless | Understand definitions and collection methods |
| **Overgeneralizing** | Patterns in subgroups may differ | Check if patterns hold across segments |

### Questions to Always Ask

Before concluding your EDA, ask yourself:

1. **Could this be a data issue?** Before assuming a real-world pattern
2. **Does this make business sense?** Validate against domain knowledge
3. **Does the pattern hold for all subgroups?** Watch for Simpson's Paradox
4. **What am I NOT seeing?** Consider survivorship bias, selection bias
5. **What would change my conclusion?** Think about edge cases

> **💡 Tip**: The best analysts are skeptics. Always question your findings before presenting them.

---
## Exercise 7.2: Distributions and interpretation
1. Plot the distribution of `unit_price` (histogram + box plot if possible).
2. In 2 to 3 sentences, describe the distribution (skewed? outliers?).
3. Which measure of central tendency would you trust more for `revenue`: mean or median? Why?

Write code below.

In [None]:
# 1) plots
# YOUR CODE HERE

# 2) and 3) write your answers as Python comments
# YOUR ANSWER HERE

---
## Exercise 7.3: Outliers
Use the IQR rule to flag outliers in `revenue`.
1. Compute $Q1$, $Q3$, and $IQR$
2. Compute lower/upper bounds
3. Create a DataFrame of outlier rows, sorted by revenue
4. Suggest one possible real-world explanation (comment)

In [None]:
# YOUR CODE HERE

---
## Exercise 7.4: Correlation choice
1. Compute Pearson and Spearman correlation matrices for `units`, `unit_price`, `discount_rate`, `revenue`.
2. Which correlation do you prefer here and why? (comment)
3. Pick one strong correlation and explain what it might mean *and what it does NOT prove*.

In [None]:
# YOUR CODE HERE

---
## Mini-project: A complete EDA walkthrough
Pretend you're a data analyst and your manager asks:
- "What does our order revenue look like?"
- "Are discounts related to revenue or returns?"
- "Do segments behave differently?"

### Your tasks
1. Create a short EDA checklist (bullets in a Markdown cell)
2. Run the checklist on `df`
3. Create **at least 2 plots** (one distribution, one relationship plot)
4. Write **3 observations** and **2 follow-up questions**

> **Tip**: Keep it simple and clear. Your goal is to communicate, not to show off code.

In [None]:
# Starter: you can reuse the earlier checks, but try writing your own clean steps.

# Example: a compact summary table by segment
segment_summary = (
    df.groupby('segment', dropna=False)
      .agg(orders=('order_id', 'count'),
           avg_revenue=('revenue', 'mean'),
           median_revenue=('revenue', 'median'),
           return_rate=('returned', 'mean'),
           avg_discount=('discount_rate', 'mean'))
      .sort_values('avg_revenue', ascending=False)
)
segment_summary.round(3)

---

## Summary and Key Takeaways

### What We Learned

| Topic | Key Points |
|-------|------------|
| **Purpose of EDA** | Understand data quality, distributions, and relationships before deeper analysis |
| **EDA Checklist** | Preview → Types → Describe → Value counts → Visualize → Outliers → Relationships → Document |
| **Central Tendency** | Mean (sensitive to outliers), Median (robust), Mode (for categories) |
| **Dispersion** | Range, Variance, Standard Deviation, IQR; always report spread with center |
| **Distributions** | Visualize before summarizing; watch for skewness and multiple peaks |
| **Outliers** | Use IQR or z-score to detect; investigate before removing |
| **Correlation** | Pearson (linear), Spearman (rank-based); correlation ≠ causation |
| **Multivariate** | Look at multiple variables together; check if patterns hold across groups |

### EDA Best Practices Checklist

✅ Use both tables AND plots, because each reveals different issues  
✅ Document your findings as you go  
✅ Always check for missing values and duplicates first  
✅ Visualize distributions before calculating summary statistics  
✅ Report both center (mean/median) AND spread (std/IQR)  
✅ Investigate outliers before removing them  
✅ Remember that correlation does not prove causation  
✅ Check if patterns hold across subgroups  
✅ Keep your EDA workflow consistent and repeatable  

### What's Next?

In **Chapter 8**, we'll build on these EDA foundations to explore **Statistical Methods for Data Analytics**, including hypothesis testing, confidence intervals, and regression analysis.

---

## Additional Resources and References

### Official Documentation
- **Pandas**: https://pandas.pydata.org/docs/
- **NumPy**: https://numpy.org/doc/stable/
- **Seaborn**: https://seaborn.pydata.org/tutorial.html
- **Matplotlib**: https://matplotlib.org/stable/tutorials/
- **SciPy Statistics**: https://docs.scipy.org/doc/scipy/reference/stats.html

### Automated EDA Tools
- **ydata-profiling**: https://github.com/ydataai/ydata-profiling
- **sweetviz**: https://github.com/fbdesignpro/sweetviz
- **dataprep**: https://dataprep.ai/

### Further Reading
- "Exploratory Data Analysis" by John Tukey: the classic text that introduced EDA
- "Python for Data Analysis" by Wes McKinney: the Pandas creator's guide
- "Storytelling with Data" by Cole Nussbaumer Knaflic: visualization best practices

### Online Courses
- Kaggle Learn: Data Visualization (https://www.kaggle.com/learn/data-visualization)
- DataCamp: Exploratory Data Analysis in Python

---

**End of Chapter 7**