# 🧪 Data Profiling Assignment: Getting Hands-On with Real-World Stats

Welcome! This assignment is designed to help you **play with data like working analysts do**. You’ll explore a synthetic dataset of **5,000 rows across seven columns** and analyze it using:

- **Central tendency**: mean, median, (estimated) mode
- **Dispersion**: variance, standard deviation, coefficient of variation (CV), range, IQR
- **Shape**: skewness, kurtosis
- **Position & extremes**: min, max, percentiles, z-scores
- **Distribution rules**: Empirical Rule (68–95–99.7) & Chebyshev’s inequality

Each question asks you to **use the sample data** to compute something and **explain what it means**. After every question you’ll find **two blank cells**: one for code, one for your interpretation.

**Tip:** In professional work, the numbers are just the start—**the story** they tell is what matters. Be explicit about assumptions, limitations, and what a stakeholder should take away.

## 1) Setup

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

np.set_printoptions(edgeitems=3, linewidth=120)
pd.set_option('display.width', 120)
pd.set_option('display.max_columns', 20)


## 2) Generate a Sample Dataset (thousands of points)

In [2]:
# Reproducible synthetic data that mimics different real-world shapes
rng = np.random.default_rng(42)
n = 5000  # thousands of points

# Distributions
normal = rng.normal(0, 1, n)  # symmetric, light tails
lognormal = rng.lognormal(mean=0.0, sigma=0.9, size=n)  # positive, right-skewed
t_df3 = rng.standard_t(df=3, size=n)  # symmetric, heavy tails
uniform = rng.uniform(-3, 3, size=n)  # flat
exponential = rng.exponential(scale=1.0, size=n)  # positive, right-skewed

# Bimodal mixture
bimodal = np.concatenate([
    rng.normal(-2.0, 0.5, n // 2),
    rng.normal( 2.0, 0.5, n - n // 2)
])

# Normal with 12 values replaced by draws from N(0, 10^2);
# most, though not necessarily all, of these land far out in the tails
with_outliers = rng.normal(0, 1, n)
out_idx = rng.choice(n, size=12, replace=False)
with_outliers[out_idx] = rng.normal(0, 1, size=12) * 10

df = pd.DataFrame({
    'normal': normal,
    'lognormal': lognormal,
    't_df3': t_df3,
    'uniform': uniform,
    'exponential': exponential,
    'bimodal': bimodal,
    'with_outliers': with_outliers,
})

df.head()

|   | normal | lognormal | t_df3 | uniform | exponential | bimodal | with_outliers |
|---|---|---|---|---|---|---|---|
| 0 | 0.304717 | 0.792822 | 0.138433 | -0.045096 | 2.049361 | -2.145103 | -0.651898 |
| 1 | -1.039984 | 0.763733 | -0.229852 | 2.703969 | 1.891653 | -1.722417 | -1.579549 |
| 2 | 0.750451 | 0.534195 | 4.519477 | 1.059963 | 2.327707 | -2.224025 | 0.646698 |
| 3 | 0.940565 | 1.470547 | 0.611817 | 0.513368 | 1.365464 | -2.376502 | -0.319862 |
| 4 | -1.951035 | 1.235137 | 0.332825 | -1.044643 | 0.389808 | -2.328547 | -1.262942 |


### Quick peek

In [3]:
df.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| normal | 5000.0 | -0.019877 | 0.999454 | -3.648413 | -0.691954 | -0.004161 | 0.631247 | 3.454046 |
| lognormal | 5000.0 | 1.515106 | 1.746572 | 0.019251 | 0.551296 | 0.982112 | 1.831378 | 37.458799 |
| t_df3 | 5000.0 | 0.028603 | 1.652612 | -12.577299 | -0.742290 | 0.010486 | 0.765586 | 30.190838 |
| uniform | 5000.0 | 0.038330 | 1.719777 | -2.999717 | -1.448413 | 0.029373 | 1.530037 | 2.999743 |
| exponential | 5000.0 | 0.995986 | 0.993266 | 0.000030 | 0.278624 | 0.696891 | 1.374293 | 7.766503 |
| bimodal | 5000.0 | -0.011055 | 2.053284 | -3.798815 | -2.000615 | 0.003059 | 1.972950 | 4.163460 |
| with_outliers | 5000.0 | 0.013223 | 1.130575 | -5.917001 | -0.667940 | -0.001768 | 0.673869 | 20.732260 |


> Optional: Save a copy as CSV so you (or a teammate) can reuse the same snapshot later.

In [None]:
df.to_csv('sample_data.csv', index=False)
print('Saved sample_data.csv')

## 3) Helper Utilities (optional)

In [None]:
def coefficient_of_variation(x: pd.Series):
    """CV = sample std / mean. Not meaningful when mean≈0 or for variables taking negative values.
    Returns np.nan if mean is too close to zero.
    """
    m = x.mean()
    s = x.std(ddof=1)
    return np.nan if np.isclose(m, 0.0) else s / m

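def empirical_within_k(x: pd.Series, k: float) -> float:
    """Share of observations within k sample standard deviations of the mean.
    Returns np.nan if the column is (numerically) constant.
    """
    m, s = x.mean(), x.std(ddof=1)
    if np.isclose(s, 0.0):
        return np.nan
    z = (x - m) / s
    return (np.abs(z) <= k).mean()

def chebyshev_lower_bound(k: float) -> float:
    """Chebyshev's lower bound, 1 - 1/k^2, on the share within k standard deviations."""
    # Raise instead of assert: assert statements are skipped under `python -O`.
    if k < 1:
        raise ValueError('k must be at least 1; for smaller k the bound is negative and uninformative')
    return 1 - 1 / (k**2)

def z_scores(x: pd.Series) -> pd.Series:
    """Standardize using the sample mean and sample std (ddof=1)."""
    return (x - x.mean()) / x.std(ddof=1)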

def hist_and_qq(x: pd.Series, bins=50, title=''):
    """Simple visuals to reason about shape (run when needed)."""
    plt.figure()
    plt.hist(x.values, bins=bins)
    plt.title(f'Histogram: {title}')
    plt.xlabel('value')
    plt.ylabel('count')
    plt.show()

    # Q-Q plot against normal to assess normality / tails
    plt.figure()
    stats.probplot(x, dist='norm', plot=plt)
    plt.title(f'Normal Q-Q: {title}')
    plt.show()

# Section A — Central Tendency

### A1. Mean vs Median (and an estimate of Mode)

Pick **two columns**—one roughly symmetric (e.g., `normal`) and one skewed (e.g., `lognormal` or `exponential`).

1) Compute the **mean** and **median** for each.
2) Provide a **rough estimate of the mode** (e.g., from the mid-point of the most populated histogram bin).
3) Explain the **ordering** you see (mean vs median vs mode) and what it implies about **symmetry/skewness**.
4) In a business context, when would you prefer **median** over **mean**, and why?
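As a starting point for step 2, here is a minimal sketch of a histogram-based mode estimate on a small made-up array. The helper name `histogram_mode` and the toy values are illustrative assumptions, not part of the assignment data. The estimate depends on the bin count, so it is worth trying a few values before trusting it.

```python
import numpy as np

def histogram_mode(x, bins=50):
    """Estimate the mode as the midpoint of the most populated histogram bin."""
    counts, edges = np.histogram(x, bins=bins)
    i = np.argmax(counts)  # index of the first bin with the highest count
    return 0.5 * (edges[i] + edges[i + 1])

# Toy right-skewed data (hypothetical values, for illustration only)
toy = np.array([1, 1, 1, 2, 3, 4, 10], dtype=float)

toy_mean = toy.mean()
toy_median = np.median(toy)
toy_mode = histogram_mode(toy, bins=9)  # nine unit-width bins from 1 to 10

print(f"mode_est={toy_mode:.2f}, median={toy_median:.2f}, mean={toy_mean:.2f}")
```

On this toy array the estimated mode sits below the median, which sits below the mean: the ordering typically associated with right skew.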

In [None]:
# Your code here

_Use this cell for your interpretation/short write-up._

### A2. Robustness to Outliers

Using the `with_outliers` column:

1) Compute the mean and median **before** and **after** trimming the top and bottom 1% (drop values below the 1st percentile or above the 99th).
2) Which statistic is **more robust** to the injected extremes? Explain.
3) Why might stakeholders be misled by the mean here? Provide a short note.
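One possible way to do the trimming in step 1 is sketched below on a toy array with a single extreme value. The function name `trim_by_percentile` and the data are assumptions made for illustration; the same pattern should carry over to the `with_outliers` column.

```python
import numpy as np

def trim_by_percentile(x, lower=1, upper=99):
    """Keep only values between the lower and upper percentiles (inclusive)."""
    lo, hi = np.percentile(x, [lower, upper])
    return x[(x >= lo) & (x <= hi)]

# Toy data: the integers 1..99 plus one extreme value (hypothetical)
toy = np.append(np.arange(1, 100), 1000).astype(float)
trimmed = trim_by_percentile(toy)

before = (toy.mean(), np.median(toy))
after = (trimmed.mean(), np.median(trimmed))
print("before trimming: mean=%.2f, median=%.2f" % before)
print("after trimming:  mean=%.2f, median=%.2f" % after)
```

In this toy case the mean moves from 59.5 to 50.5 after trimming, while the median stays at 50.5.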

In [None]:
# Your code here

_Use this cell for your interpretation/short write-up._

# Section B — Dispersion (Spread)

### B1. Variance, Standard Deviation, Range, and IQR

For **at least three columns** (suggestion: `normal`, `t_df3`, `with_outliers`):

1) Compute **variance**, **standard deviation**, **range** (max−min), and **IQR** (Q3−Q1).
2) Interpret how **outliers** and **heavy tails** change these measures.
3) Who would care about this in practice (e.g., risk teams, operations, product)?
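The four measures in step 1 can be collected in one small table. The sketch below uses a hypothetical helper, `spread_summary`, and two made-up columns that differ only in one extreme value; it is meant as a pattern to adapt, not a prescribed solution.

```python
import pandas as pd

def spread_summary(x: pd.Series) -> pd.Series:
    """Sample variance, sample std, range, and IQR for one column."""
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    return pd.Series({
        'variance': x.var(ddof=1),
        'std': x.std(ddof=1),
        'range': x.max() - x.min(),
        'iqr': q3 - q1,
    })

# Two toy columns that differ only in their largest value (hypothetical)
toy = pd.DataFrame({
    'tight': [1.0, 2.0, 3.0, 4.0, 5.0],
    'one_extreme': [1.0, 2.0, 3.0, 4.0, 50.0],
})

spread = toy.apply(spread_summary).T  # one row per column
print(spread)
```

Here the single extreme value inflates the variance and range of `one_extreme`, while the IQR is the same for both columns.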

In [None]:
# Your code here

_Use this cell for your interpretation/short write-up._

### B2. Coefficient of Variation (CV)

For **positive-valued** columns (e.g., `lognormal`, `exponential`):

1) Compute **CV = std/mean**.
2) Rank the selected columns by CV.
3) Explain when CV is **not appropriate** (hint: the mean is close to zero, or the variable takes both positive and negative values).
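For steps 1 and 2, a vectorized version of the calculation might look like the sketch below. The toy columns `a` and `b` are made up so that both have the same standard deviation but different means.

```python
import pandas as pd

# Hypothetical positive-valued columns with equal spread but different levels
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],    # mean 2, sample std 1
    'b': [9.0, 10.0, 11.0],  # mean 10, sample std 1
})

cv = toy.std(ddof=1) / toy.mean()            # CV per column
cv_ranked = cv.sort_values(ascending=False)  # most variable (relative to its mean) first
print(cv_ranked)
```

Both columns have a standard deviation of 1, yet `a` has a CV of 0.5 and `b` a CV of 0.1, which is the sense in which CV measures spread relative to the typical level.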

In [None]:
# Your code here

_Use this cell for your interpretation/short write-up._

# Section C — Shape of the Distribution

### C1. Skewness & Kurtosis

Across **all columns**:

1) Compute **skewness** and **(excess) kurtosis** (use `scipy.stats.skew` and `scipy.stats.kurtosis` with `fisher=True`, which is the default and reports 0 for a normal distribution).
2) Identify which distributions are **right/left-skewed**.
3) Which have **heavy tails** (excess kurtosis > 0)? How does that show up in the Q-Q plot?
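A compact way to compute both statistics for every column at once is sketched below. The two toy columns are invented for illustration and are small enough to check by hand; the calls use SciPy's default (biased) estimators.

```python
import pandas as pd
from scipy import stats

# Hypothetical columns: one symmetric, one with a single large value
toy = pd.DataFrame({
    'symmetric': [-2.0, -1.0, 0.0, 1.0, 2.0],
    'right_skewed': [0.0, 0.0, 0.0, 0.0, 10.0],
})

values = toy.to_numpy()
shape = pd.DataFrame({
    'skewness': stats.skew(values, axis=0),
    # fisher=True returns excess kurtosis (normal distribution -> 0)
    'excess_kurtosis': stats.kurtosis(values, axis=0, fisher=True),
}, index=toy.columns)
print(shape)
```

The symmetric column has skewness 0 and the other has skewness 1.5. Note that pandas' own `.skew()` and `.kurt()` methods apply bias corrections, so their results differ slightly from SciPy's defaults.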

In [None]:
# Your code here

_Use this cell for your interpretation/short write-up._

### C2. Visual Intuition Check

Pick **two columns** with different shapes (e.g., `normal` vs `bimodal`).

1) Plot a histogram and a normal Q-Q plot for each (use `hist_and_qq`).
2) Explain **why** percentiles or averages alone can miss **multimodality**.
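To make step 2 concrete, the sketch below builds two tiny made-up arrays that share the same mean and median but have very different shapes. The helper `share_near_mean` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

def share_near_mean(x, width=1.0):
    """Fraction of values lying within `width` of the mean."""
    return np.mean(np.abs(x - x.mean()) <= width)

# Hypothetical data: one cluster around 0 versus two clusters at -2 and +2
unimodal_toy = np.array([-1.0, 0.0, 0.0, 0.0, 0.0, 1.0])
bimodal_toy = np.array([-2.0, -2.0, -2.0, 2.0, 2.0, 2.0])

share_unimodal = share_near_mean(unimodal_toy)
share_bimodal = share_near_mean(bimodal_toy)

for name, x, share in [('unimodal_toy', unimodal_toy, share_unimodal),
                       ('bimodal_toy', bimodal_toy, share_bimodal)]:
    print(f"{name}: mean={x.mean():.1f}, median={np.median(x):.1f}, "
          f"share within 1 of mean={share:.2f}")
```

Both arrays report a mean and median of 0, yet in the two-cluster array no observation is anywhere near that "typical" value. A histogram makes this visible immediately.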

In [None]:
# Your code here

_Use this cell for your interpretation/short write-up._

# Section D — Position & Extremes

### D1. Min/Max & Percentiles

For **each column**:

1) Report **min**, **max**, and key **percentiles** (1st, 5th, 25th, 50th, 75th, 95th, 99th).
2) Which columns show the **widest spread** between the 1st and 99th percentiles?
3) What operational risks might that imply?
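One way to lay out the table for steps 1 and 2 is sketched below, assuming a DataFrame of numeric columns. The toy columns `narrow` and `wide` and the column label `p1_p99_spread` are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: 0..100 and the same values doubled
toy = pd.DataFrame({
    'narrow': np.arange(0, 101, dtype=float),
    'wide': 2.0 * np.arange(0, 101, dtype=float),
})

levels = [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99]
summary = toy.quantile(levels).T  # one row per column, one column per percentile
summary.columns = ['p1', 'p5', 'p25', 'p50', 'p75', 'p95', 'p99']
summary.insert(0, 'min', toy.min())
summary['max'] = toy.max()
summary['p1_p99_spread'] = summary['p99'] - summary['p1']

print(summary.sort_values('p1_p99_spread', ascending=False))
```

Sorting by the 1st-to-99th percentile spread puts the widest column first, which answers step 2 directly.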

In [None]:
# Your code here

_Use this cell for your interpretation/short write-up._

### D2. Z-Scores & Outlier Flagging

Using the `with_outliers` column:

1) Compute **z-scores** and count points with **|z| > 3**.
2) Show the **indices** (or values) of these potential outliers.
3) Explain the difference between **statistical outliers** and **bad data**.
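The flagging logic in steps 1 and 2 can be sketched as follows on a toy series with one obvious extreme. The data are made up; the standardization mirrors the notebook's `z_scores` helper (sample std, `ddof=1`).

```python
import pandas as pd

# Hypothetical series: nineteen identical values and one extreme
toy = pd.Series([10.0] * 19 + [100.0])

z = (toy - toy.mean()) / toy.std(ddof=1)
flagged = toy[z.abs() > 3]  # keeps the original index labels

print(f"{len(flagged)} point(s) with |z| > 3")
print(flagged)
```

Keep in mind that an extreme value inflates the very standard deviation it is measured against, so z-score flagging can understate how unusual a point is.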

In [None]:
# Your code here

_Use this cell for your interpretation/short write-up._

# Section E — Distribution Rules

### E1. Empirical Rule (68–95–99.7)

For `normal` and `t_df3`:

1) Compute the **proportion of points** within **1, 2, and 3** standard deviations of the mean.
2) Compare to **68% / 95% / 99.7%**. Where does it **match or break down**, and why?
3) What decision mistakes might happen if someone assumes normality where it doesn’t hold?
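A minimal sketch of the comparison in steps 1 and 2 follows. It restates the logic of the notebook's `empirical_within_k` helper so the block runs on its own, and it draws fresh demo samples (seed 0, 100,000 points each) rather than using the assignment's `df`; those names and sizes are assumptions for illustration.

```python
import numpy as np

def share_within_k_sd(x, k):
    """Fraction of values within k sample standard deviations of the mean."""
    m, s = x.mean(), x.std(ddof=1)
    return np.mean(np.abs(x - m) <= k * s)

rng_demo = np.random.default_rng(0)
normal_demo = rng_demo.normal(0, 1, 100_000)
heavy_demo = rng_demo.standard_t(df=3, size=100_000)  # heavy-tailed

reference = {1: 0.68, 2: 0.95, 3: 0.997}  # Empirical Rule targets
for k, ref in reference.items():
    print(f"k={k}: normal={share_within_k_sd(normal_demo, k):.3f}, "
          f"t(3)={share_within_k_sd(heavy_demo, k):.3f}, rule={ref}")
```

The normal sample should land close to the rule, while the heavy-tailed sample will generally not, partly because rare extreme draws inflate its standard deviation.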

In [None]:
# Your code here

_Use this cell for your interpretation/short write-up._

### E2. Chebyshev’s Inequality (distribution-free bound)

Pick any **three columns**:

1) For **k = 2 and 3**, compute Chebyshev’s **lower bound** (≥ 1 − 1/k²) for the proportion within k SDs.
2) Compute the **actual proportions** and compare to the bounds.
3) Explain why Chebyshev’s bound can be **loose** but still **useful** in practice.
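The comparison in steps 1 and 2 might be set up as in the sketch below. It uses a freshly drawn, skewed demo sample (seed 0) rather than the assignment's columns, and the function names are local stand-ins for the notebook helpers so the block is self-contained.

```python
import numpy as np

def chebyshev_bound(k):
    """Chebyshev's lower bound on the share within k standard deviations."""
    return 1 - 1 / k**2

def share_within_k_sd(x, k):
    """Observed fraction of values within k sample standard deviations of the mean."""
    m, s = x.mean(), x.std(ddof=1)
    return np.mean(np.abs(x - m) <= k * s)

rng_demo = np.random.default_rng(0)
skewed_demo = rng_demo.exponential(scale=1.0, size=10_000)  # hypothetical sample

results = {}
for k in (2, 3):
    results[k] = (chebyshev_bound(k), share_within_k_sd(skewed_demo, k))
    print(f"k={k}: bound >= {results[k][0]:.4f}, actual = {results[k][1]:.4f}")
```

The bounds are 0.75 for k = 2 and about 0.889 for k = 3. The observed shares can never fall below them, whatever the shape of the data, which is what makes the bound useful when nothing is known about the distribution.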

In [None]:
# Your code here

_Use this cell for your interpretation/short write-up._

# Section F — Synthesis & Communication

### F1. Tell the Data Story

Choose **two columns** with contrasting behavior (e.g., `lognormal` vs `t_df3`).

Write a short **executive-style note** (5–10 sentences) for a non-technical stakeholder that explains:
- Central tendency: which typical value is most representative and why
- Spread: what volatility/risk is present and how you’d summarize it
- Shape: skew/tails and what they mean for planning
- Extremes: what to expect in the worst 1–5% of cases
- Any **actionable recommendations**

In [None]:
# Your code here

_Use this cell for your interpretation/short write-up._

---
### Submission Checklist
- Run all cells in order.
- Ensure each question has code + a short written interpretation.
- If you add extra visualizations or helper functions, briefly justify them.

Good luck & have fun exploring! 🧑‍💻📊