## 1. What is Statistics?

**Statistics** is the science of:
- Collecting data
- Organizing data
- Summarizing data
- Analyzing data
- Drawing conclusions from data

It is used in many fields like marketing, business, healthcare, telecom, and data science.

**Real-life example:**
- You track your monthly expenses and calculate the average to understand your spending.

**Production example:**
- A company collects application logs, response times, and user behavior,
  then uses statistics to understand performance and user patterns.


## 2. Types of Statistics

There are two main branches:

1. **Descriptive Statistics** – describe and summarize data.
2. **Inferential Statistics** – use sample data and probability to draw conclusions about a larger population.


## 3. Basic Terms: Population, Sample, Variable, Parameter, Statistic

- **Population**: The entire group you care about.
- **Sample**: A subset of the population, used for analysis.
- **Variable**: A characteristic that can vary (e.g., height, response time).
- **Parameter**: A numerical summary of the population (e.g., true mean of all users).
- **Statistic**: A numerical summary of the sample (e.g., mean of sampled users).

**Real-life example:**
- Population: All people in a city.
- Sample: 500 people selected for a survey.

**Production example:**
- Population: All API calls in a month.
- Sample: Logs from a single day.
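To make the parameter/statistic distinction concrete, here is a minimal sketch using only the standard library. The population is simulated (10,000 made-up response times drawn from a normal distribution with mean 200 ms and SD 30 ms); those numbers and all variable names are assumptions for illustration only.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 simulated API response times (ms).
population = [rng.gauss(200, 30) for _ in range(10_000)]

# Sample: 500 values drawn at random, without replacement.
sample = rng.sample(population, 500)

parameter = statistics.fmean(population)  # population mean (usually unknown in practice)
statistic = statistics.fmean(sample)      # sample mean (our estimate of the parameter)

print(f"Parameter (population mean): {parameter:.2f} ms")
print(f"Statistic (sample mean):     {statistic:.2f} ms")
```

The two numbers should be close but not identical. Note that this sample is drawn at random; a single day of logs, as in the production example, is only representative if that day is typical.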


## 4. Types of Data

### 4.1 Categorical (Qualitative) Data
Represents **categories or groups**.

Examples:
- Car brands: `"Audi"`, `"BMW"`, `"Mercedes"`
- Yes/No answers
- Browser type: `"Chrome"`, `"Firefox"`, `"Safari"`

### 4.2 Numerical (Quantitative) Data
Represents **numbers**.

It can be:
- **Discrete**: Countable values (e.g., number of children, number of defects).
- **Continuous**: Measured values (e.g., height, time, distance, latency).

**Production example:**
- Discrete: Number of failed test cases in a test run.
- Continuous: API response time in milliseconds.


In [None]:
# Simple example: classifying data as categorical, discrete, or continuous

data_examples = {
    "car_brand": "BMW",          # categorical
    "num_children": 2,           # discrete numerical
    "response_time_ms": 187.5    # continuous numerical
}

for name, value in data_examples.items():
    print(f"{name}: {value}")


## 5. Levels of Measurement

1. **Nominal** – categories without order (e.g., colors, browser type).
2. **Ordinal** – ordered categories (e.g., ratings: bad, ok, good).
3. **Interval** – numeric scale without a true zero (e.g., °C, °F).
4. **Ratio** – numeric scale with a true zero (e.g., height, weight, response time).

These levels determine which statistical methods and operations make sense.


# Part A: Descriptive Statistics

This part covers:

- Measures of central tendency (mean, median, mode)
- Measures of dispersion (range, variance, standard deviation)
- Skewness (shape of the distribution)


## 6. Measures of Central Tendency

### 6.1 Mean
The **mean** is the arithmetic average.
- Mean is heavily affected by **outliers**.

### 6.2 Median
The **median** is the middle value when data is sorted.
- If `n` is odd → exact middle value.
- If `n` is even → average of two middle values.
- Median is more robust when there are extreme values.

### 6.3 Mode
The **mode** is the most frequent value in the dataset.
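The odd/even rule for the median can be written out by hand. The sketch below is illustrative only: `median_by_hand` is a hypothetical helper name and the data is made up. It also uses `statistics.multimode` from the standard library to show what happens to the mode when values tie.

```python
import statistics

def median_by_hand(values):
    """Return the median using the odd/even rule described above."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: exact middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the two middle values

odd_data = [7, 1, 5, 3, 9]       # n = 5
even_data = [7, 1, 5, 3, 9, 11]  # n = 6

print(median_by_hand(odd_data))   # 5
print(median_by_hand(even_data))  # 6.0

# multimode returns every value tied for most frequent.
print(statistics.multimode([3, 3, 3, 5, 6, 1]))  # [3]
print(statistics.multimode([1, 2, 3]))           # [1, 2, 3] (no repeats, so all values tie)
```

When no value repeats, every value ties for the mode, so the mode carries little information for such data.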

In [None]:
import numpy as np
from scipy import stats

# Example dataset: systolic blood pressure of 7 men
bp = np.array([150, 123, 134, 170, 146, 124, 113])

mean_bp = np.mean(bp)
median_bp = np.median(bp)
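# Every value here appears exactly once, so there is no meaningful mode;
# in that case scipy.stats.mode reports the smallest value (113).
mode_bp = stats.mode(bp, keepdims=True)  # keepdims requires SciPy >= 1.9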

print("Data:", bp)
print("Mean:", mean_bp)
print("Median:", median_bp)
print("Mode:", mode_bp.mode[0])


### Effect of Outliers on Mean vs Median

Add a very large outlier and compare mean and median.

- Mean will shift a lot.
- Median will barely move.


In [None]:
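# Simulated monthly expenditure: 10,000 values with mean 25,000 and SD 15,000
np.random.seed(42)  # fixed seed so the output is reproducible
expenditure = np.random.normal(25000, 15000, 10000)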

print("Original mean:", np.mean(expenditure))
print("Original median:", np.median(expenditure))

# Add a huge outlier
expenditure_with_outlier = np.append(expenditure, [10_000_000_000])

print("\nAfter adding outlier:")
print("New mean:", np.mean(expenditure_with_outlier))
print("New median:", np.median(expenditure_with_outlier))


## 7. Measures of Dispersion: Range, Variance, Standard Deviation

Measures of dispersion describe **how spread out** the data is around its center (usually the mean or median).  
They help in understanding whether the data points are **close together** or **widely scattered**.

Dispersion is important because **two datasets can have the same mean but very different spreads**, leading to very different interpretations.

### 7.1 Range

**Definition:**  
The range is the simplest measure of dispersion.  
It is the difference between the **largest** and **smallest** values in the dataset.

**Formula:**  
Range = maximum value − minimum value

**What it tells you:**  
- Gives a quick sense of the total spread  
- Very sensitive to outliers  
- Doesn’t show how values are distributed between the extremes  

**Real-life examples:**  
- Temperature difference in a day (high − low)  
- Height difference in a classroom  
- Monthly stock price high vs low  

**Engineering examples:**  
- Fastest vs slowest API response time in a day  
- Minimum vs maximum CPU usage during a load test  
- Smallest vs largest packet size in network traffic  

### 7.2 Variance

**Definition:**  
Variance measures the **average of the squared differences** between each data point and the mean.

It answers the question:  
**“On average, how far are the data points from the mean — but squared?”**

**Why squared?**  
- Prevents negative values from canceling out  
- Penalizes larger deviations more strongly  
- Keeps the measure smooth (differentiable) and additive for independent variables, which simplifies the math in probability and machine learning  

**Interpretation:**  
- High variance → data points are far from the mean  
- Low variance → data points are close to the mean  

**Real-life examples:**  
- Variance in daily steps: high variance means inconsistent activity  
- Variance in exam scores: high variance means some students did very well and some very poorly  

**Engineering examples:**  
- Variance in API latency: high variance means unpredictable performance  
- Variance in memory usage: helps detect instability or memory leaks  
- Variance in manufacturing measurements: used in quality control (Six Sigma)  

### 7.3 Standard Deviation (SD)

**Definition:**  
Standard deviation is the **square root of variance**.  
It brings the measure back to the **same units** as the original data, making it easier to interpret.

**Why SD is more useful than variance:**  
- Variance is in squared units (e.g., ms², kg²), which is hard to interpret  
- SD is in the original units (ms, kg), so it makes intuitive sense  

**Interpretation:**  
- **Small SD** → data is tightly clustered around the mean  
- **Large SD** → data is widely spread  
- **SD = 0** → all values are identical  

**Real-life examples:**  
- Low SD in monthly expenses → stable spending habits  
- High SD in commute time → unpredictable traffic  

**Engineering examples:**  
- Low SD in API response time → stable system  
- High SD in test execution time → flaky tests  
- Low SD in sensor readings → reliable hardware  
- High SD in error counts → unstable system behavior  
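The definitions of variance and standard deviation can be traced step by step on a tiny dataset. This is a sketch with made-up latency values; it shows both the population form (divide by `n`) and the sample form (divide by `n - 1`), since libraries differ in which one they use by default.

```python
import math

latencies_ms = [120, 130, 125, 135, 140]  # made-up API latencies

n = len(latencies_ms)
mean = sum(latencies_ms) / n                            # 130.0
squared_devs = [(x - mean) ** 2 for x in latencies_ms]  # [100.0, 0.0, 25.0, 25.0, 100.0]

population_variance = sum(squared_devs) / n      # 50.0 ms^2 (divide by n)
sample_variance = sum(squared_devs) / (n - 1)    # 62.5 ms^2 (divide by n - 1)

population_sd = math.sqrt(population_variance)   # back in ms
sample_sd = math.sqrt(sample_variance)           # back in ms

print(f"mean = {mean} ms")
print(f"population variance = {population_variance} ms^2, SD = {population_sd:.2f} ms")
print(f"sample variance     = {sample_variance} ms^2, SD = {sample_sd:.2f} ms")
```

The sample form is the usual choice when the data is a sample from a larger population; in NumPy it corresponds to `ddof=1`.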

## Why Dispersion Matters

Two datasets can have the **same mean** but behave completely differently:

Example:  
Dataset A: 50, 51, 49, 50, 52  
Dataset B: 10, 90, 5, 95, 50  

Both have a mean of about 50 (50.4 for A, 50.0 for B), but Dataset B is far more spread out.  
Dispersion helps in seeing this difference clearly.

- **Range** → Quick snapshot of total spread  
- **Variance** → Mathematical measure of average squared deviation  
- **Standard Deviation** → Practical, easy-to-interpret measure of spread  

Together, these metrics help in understanding **consistency**, **stability**, and **variability** in any dataset — whether it's exam scores, financial data, or system performance metrics.
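As a quick check of the Dataset A / Dataset B comparison, the sketch below computes the mean, range, sample variance, and sample standard deviation for both using the standard library. The `summary` dictionary is just a convenience for this example.

```python
import statistics

dataset_a = [50, 51, 49, 50, 52]
dataset_b = [10, 90, 5, 95, 50]

summary = {}
for name, data in (("A", dataset_a), ("B", dataset_b)):
    summary[name] = {
        "mean": statistics.fmean(data),
        "range": max(data) - min(data),
        "variance": statistics.variance(data),  # sample variance (n - 1)
        "sd": statistics.stdev(data),           # sample standard deviation
    }
    s = summary[name]
    print(f"Dataset {name}: mean={s['mean']:.1f}, range={s['range']}, "
          f"variance={s['variance']:.1f}, sd={s['sd']:.2f}")
```

The means are nearly equal, while the range and standard deviation of Dataset B are many times larger than those of Dataset A.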

In [None]:
results = np.array([3, 3, 3, 5, 6, 1])

sample_mean = np.mean(results)
sample_var = np.var(results, ddof=1)  # ddof=1 for sample variance
sample_std = np.std(results, ddof=1)  # sample standard deviation
data_range = np.max(results) - np.min(results)

print("Data:", results)
print("Mean:", sample_mean)
print("Range:", data_range)
print("Sample variance:", sample_var)
print("Sample standard deviation:", sample_std)


## 8. Visualizing Spread with Standard Deviation

We can visualize how data points cluster around the mean and how standard deviation reflects the spread.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)

# Dataset with small standard deviation
data_small_std = np.random.normal(100, 5, 1000)

# Dataset with large standard deviation
data_large_std = np.random.normal(100, 20, 1000)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.histplot(data_small_std, kde=True, ax=axes[0], color='green')
axes[0].set_title('Small Standard Deviation')
axes[0].axvline(np.mean(data_small_std), color='red', linestyle='--', label='Mean')
axes[0].legend()

sns.histplot(data_large_std, kde=True, ax=axes[1], color='orange')
axes[1].set_title('Large Standard Deviation')
axes[1].axvline(np.mean(data_large_std), color='red', linestyle='--', label='Mean')
axes[1].legend()

plt.tight_layout()
plt.show()


## 9. Skewness

Skewness measures how **asymmetric** a distribution is.

- **Skewness ≈ 0** → roughly symmetric.
- **Positive skew** → right tail is longer (few very large values).
- **Negative skew** → left tail is longer (few very small values).

**Production example:**
- Latency distribution is often positively skewed: most requests are fast, but a few are very slow.
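One common way to compute skewness is the moment-based coefficient: the third central moment divided by the second central moment raised to the power 1.5. The sketch below implements that formula by hand on made-up data; `skewness` is a hypothetical helper, and the formula is the biased (divide-by-`n`) version, which is also the default behaviour of `scipy.stats.skew`.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using divide-by-n central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]    # one very large value
left_tailed = [-x for x in right_tailed]    # mirror image

print("symmetric:   ", skewness(symmetric))
print("right-tailed:", skewness(right_tailed))
print("left-tailed: ", skewness(left_tailed))
```

The sign follows the longer tail: cubing the deviations keeps their sign, so a few large positive deviations dominate and push the result above zero.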


In [None]:
# Example: symmetric vs positively skewed data

from scipy.stats import skew

np.random.seed(0)

symmetric_data = np.random.normal(0, 1, 1000)
positive_skew_data = np.random.exponential(scale=1.0, size=1000)

print("Skewness (symmetric):", skew(symmetric_data))
print("Skewness (positive skew):", skew(positive_skew_data))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.histplot(symmetric_data, kde=True, ax=axes[0], color='blue')
axes[0].set_title('Symmetric Distribution')

sns.histplot(positive_skew_data, kde=True, ax=axes[1], color='purple')
axes[1].set_title('Positively Skewed Distribution')

plt.tight_layout()
plt.show()


# Part B: Random Variables and Distributions

This part covers **probability distributions**, which describe how random variables behave:

- Random variables
- Discrete vs continuous
- Bernoulli distribution
- Binomial distribution
- Normal (Gaussian) distribution
- Poisson distribution (brief)
- PDF, PMF, and CDF


## 10. Random Variables

A **random variable** is a variable whose value depends on the outcome of a random experiment.

- **Discrete random variable**: takes countable values (0, 1, 2, 3, ...).
- **Continuous random variable**: takes any value in an interval.

**Examples:**
- Discrete: number of defective items in a batch.
- Continuous: time taken for a request to complete.


## 11. Bernoulli Distribution

The **Bernoulli distribution** is the simplest possible probability distribution.  
It describes a situation where there are **only two possible outcomes**:

- **1 (success)** with probability `p`
- **0 (failure)** with probability `1 - p`

This makes Bernoulli perfect for modeling **yes/no**, **true/false**, **pass/fail**, or **on/off** events.

In short:

- You perform one trial.
- The trial can only result in success or failure.
- You assign a probability `p` to success.
- Everything else is failure.

**Why Bernoulli Matters**

Bernoulli is the **building block** for many other distributions:

- **Binomial distribution** = repeated Bernoulli trials  
- **Geometric distribution** = number of Bernoulli trials until first success  
- **Logistic regression** models Bernoulli outcomes  
- **Binary classification** in machine learning uses Bernoulli assumptions  

A single trial with a binary outcome is a Bernoulli trial; a sequence of independent trials with the same `p` is a Bernoulli process.

**Coin Toss**  
If you define “success = heads”, then:
- Heads → 1  
- Tails → 0  

**Exam Pass/Fail**  
- Pass → 1  
- Fail → 0  

**Light Bulb Works or Not**  
- Working bulb → 1  
- Defective bulb → 0  

**Customer Buys or Doesn’t Buy**  
- Purchase → 1  
- No purchase → 0  

**Traffic Light Detection**  
- Light is green → 1  
- Light is not green → 0  

These are all **single-event, two-outcome** situations.

**Engineering Examples:**

**API Request Outcome**  
- Success (HTTP 200) → 1  
- Failure (HTTP 4xx/5xx) → 0  

This is one of the most common Bernoulli processes in systems engineering.

**Email Campaign**  
- User opens email → 1  
- User does not open → 0  

The open rate is literally the **mean of these Bernoulli outcomes**; click-through rate (CTR) is the same calculation applied to clicks instead of opens.

**Feature Flag Check**  
- Feature enabled → 1  
- Feature disabled → 0  

**Authentication**  
- Login success → 1  
- Login failure → 0  

**Fraud Detection**  
- Transaction is fraudulent → 1  
- Transaction is legitimate → 0  

**Sensor Trigger**  
- Motion detected → 1  
- No motion → 0  

**Test Case Execution**  
- Test passes → 1  
- Test fails → 0  

**What Bernoulli Really Represents**

A Bernoulli trial is like asking a **single yes/no question**:

- “Did the event happen?”  
- “Did the user click?”  
- “Did the request succeed?”  
- “Did the sensor detect motion?”  

If the answer is yes → 1  
If the answer is no → 0  

The probability of “yes” is `p`.  
The probability of “no” is `1 - p`.

**Mean and Variance**

For a Bernoulli random variable:

- **Mean = p**  
  (the long-term proportion of successes)

- **Variance = p(1 - p)**  
  (highest when p = 0.5, lowest when p = 0 or 1)

This is why:
- A coin toss (p = 0.5) is the most “uncertain” Bernoulli trial  
- A nearly guaranteed event (p ≈ 1) has very low variance  
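The claim that the variance `p(1 - p)` peaks at `p = 0.5` is easy to check numerically. This is a small sketch; the helper names and the grid of probabilities are chosen only for illustration.

```python
def bernoulli_mean(p):
    return p

def bernoulli_variance(p):
    return p * (1 - p)

probabilities = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]

for p in probabilities:
    print(f"p={p:.1f}  mean={bernoulli_mean(p):.2f}  variance={bernoulli_variance(p):.2f}")

# The p on this grid with the largest variance
most_uncertain = max(probabilities, key=bernoulli_variance)
print("Largest variance at p =", most_uncertain)
```

The variance is symmetric around 0.5 and drops to zero at both ends, where the outcome is certain.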

Engineering usage:

- A/B testing  
- Machine learning (binary classification)  
- Reliability engineering  
- QA pass/fail analysis  
- Monitoring and alerting  
- User behavior modeling  
- Risk analysis  

In [None]:
from scipy.stats import bernoulli

p = 0.7  # probability of success
bern_samples = bernoulli.rvs(p, size=20, random_state=42)

print("Bernoulli samples (1 = success, 0 = failure):")
print(bern_samples)

print("Mean (approx p):", np.mean(bern_samples))
print("Variance (approx p(1-p)):", np.var(bern_samples, ddof=0))


## 12. Binomial Distribution

The binomial distribution describes the number of **successes** you get when you repeat the same yes/no experiment a fixed number of times. Each attempt is a Bernoulli trial, meaning it can only result in success (1) or failure (0).

A situation follows a binomial distribution when:
- You repeat the experiment a fixed number of times (`n`).
- Each trial has only two outcomes (success or failure).
- The probability of success (`p`) stays the same for every trial.
- Each trial is independent of the others.

In simple terms, the binomial distribution answers the question:

**“If I repeat this binary event n times, how many successes should I expect?”**

This makes it useful for anything involving repeated yes/no outcomes.

**Real-life examples:**

- **Coin tossing:** Toss a coin 10 times and count how many heads appear.  
- **Exam guessing:** A student guesses on 20 multiple-choice questions; how many will they get right?  
- **Sports:** A basketball player makes 70% of free throws; how many shots will they make out of 10 attempts?  
- **Manufacturing:** A machine has a 2% defect rate; how many defective items will appear in a batch of 50?

These are all repeated Bernoulli trials with a fixed number of attempts.

**Engineering and production examples:**

- **User conversions:** If the signup conversion rate is `p`, how many signups will occur out of 1000 visitors?  
- **Email campaigns:** If 30% of users open an email, how many opens will you get from 500 emails?  
- **API reliability:** If an API succeeds 98% of the time, how many successful responses will occur in 200 calls?  
- **QA testing:** If a flaky test passes 90% of the time, how many passes will you see in 50 runs?  
- **Feature rollouts:** If 5% of users click a new feature, how many clicks will come from 10,000 users?  
- **Security monitoring:** If 1% of login attempts are suspicious, how many suspicious attempts will appear in 5000 logins?

All of these involve repeating the same binary event many times and counting how often success occurs.

**Why the binomial distribution matters:**

It is one of the most widely used distributions in statistics because it models real-world binary outcomes repeated many times. It forms the basis for:

- A/B testing  
- Conversion rate modeling  
- Reliability and failure analysis  
- Quality control  
- Risk estimation  
- Machine learning classification probabilities  

Whenever you count how many times something “works” out of a fixed number of independent attempts with the same success probability, the count follows a binomial distribution.
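The binomial probabilities themselves come from a short formula: P(X = k) = C(n, k) · p^k · (1 − p)^(n − k), with expected value n · p. The sketch below writes that out with `math.comb`; `binomial_pmf` is a hypothetical helper, and the API numbers reuse the 98% / 200-call example from the list above.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Coin tossing: exactly 5 heads in 10 fair tosses (252 / 1024)
prob_5_heads = binomial_pmf(5, 10, 0.5)
print(f"P(5 heads in 10 tosses) = {prob_5_heads:.4f}")

# API reliability: 200 calls, each succeeding with probability 0.98
n_calls, p_success = 200, 0.98
expected_successes = n_calls * p_success
prob_all_succeed = binomial_pmf(n_calls, n_calls, p_success)
print(f"Expected successes out of {n_calls}: {expected_successes:.0f}")
print(f"P(all {n_calls} calls succeed) = {prob_all_succeed:.4f}")
```

Even with a 98% success rate, the chance that all 200 calls succeed is small, because it requires 200 independent successes in a row.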

In [None]:
from scipy.stats import binom

n = 10  # number of trials
p = 0.5 # probability of success

# Probability of exactly 5 successes
prob_5 = binom.pmf(5, n, p)
print("P(X = 5) when n=10, p=0.5:", prob_5)

# Simulate many binomial outcomes
simulated = binom.rvs(n=n, p=p, size=1000, random_state=42)

plt.figure(figsize=(8,5))
sns.histplot(simulated, bins=range(n+2), discrete=True, stat='probability', color='skyblue')
plt.title('Binomial Distribution Simulation (n=10, p=0.5)')
plt.xlabel('Number of successes')
plt.ylabel('Probability')
plt.show()


## 13. Normal Distribution (Gaussian)

The **normal distribution** is a continuous distribution with a bell-shaped curve.

Key properties:
- Symmetric around the mean.
- Mean = Median = Mode.
- Shape controlled by mean `μ` and standard deviation `σ`.

### Empirical Rule (68-95-99.7 rule)
- About **68%** of data lies within 1σ of mean.
- About **95%** within 2σ.
- About **99.7%** within 3σ.
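The three percentages can be recovered from the normal CDF: the share within k standard deviations is CDF(k) − CDF(−k) on the standard normal. A minimal check using `statistics.NormalDist` from the standard library:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

coverage = {}
for k in (1, 2, 3):
    # Probability of landing between -k and +k standard deviations
    coverage[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"Within {k} sigma: {coverage[k]:.4f}")
```

This prints roughly 0.6827, 0.9545, and 0.9973, which round to the 68-95-99.7 figures. The rule applies to data that is actually close to normal; it is not a guarantee for skewed data.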

**Real-life example:**
- Height of people.

**Production example:**
- If response times follow roughly a normal distribution, most requests fall near the mean with predictable spread.


In [None]:
from scipy.stats import norm

mu = 100
sigma = 15

x = np.linspace(mu - 4*sigma, mu + 4*sigma, 500)
y = norm.pdf(x, mu, sigma)

plt.figure(figsize=(8,5))
plt.plot(x, y, label='Normal PDF')
plt.axvline(mu, color='red', linestyle='--', label='Mean')
plt.axvline(mu - sigma, color='gray', linestyle=':', label='μ ± σ')
plt.axvline(mu + sigma, color='gray', linestyle=':')
plt.title('Normal Distribution (μ = 100, σ = 15)')
plt.xlabel('x')
plt.ylabel('Density')
plt.legend()
plt.show()


## 14. Standard Normal Distribution and Z-Score

The **standard normal distribution** is a special normal distribution with:
- Mean = 0
- Standard deviation = 1

We convert any normally distributed variable X into a standard normal variable Z by using the z‑score formula.
The z‑score is calculated as:

**z = (x − mean) / standard deviation**


Interpretation:
- z = 0 → value equals the mean.
- z = 1 → one standard deviation above the mean.
- z = -2 → two standard deviations below the mean.

**Production example:**
- Compute z-score for latency or error counts to detect anomalies.
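A minimal sketch of that anomaly check, assuming a baseline window of "normal" latencies is available and that a cutoff of |z| > 3 is acceptable. The data, the threshold, and the names `baseline_ms`, `z_score`, and `THRESHOLD` are all assumptions for illustration.

```python
import statistics

# Hypothetical baseline: latencies (ms) from a period considered healthy
baseline_ms = [100, 102, 98, 101, 99, 103, 97, 100]

baseline_mean = statistics.fmean(baseline_ms)  # 100.0
baseline_sd = statistics.stdev(baseline_ms)    # 2.0 (sample SD)

def z_score(x):
    """How many baseline standard deviations x lies from the baseline mean."""
    return (x - baseline_mean) / baseline_sd

THRESHOLD = 3.0  # assumed cutoff; tune for your own false-alarm tolerance

new_observations = [101, 150]
flags = {}
for x in new_observations:
    z = z_score(x)
    flags[x] = abs(z) > THRESHOLD
    print(f"latency={x} ms  z={z:.1f}  {'ANOMALY' if flags[x] else 'ok'}")
```

The baseline statistics are computed separately from the values being scored. If an extreme value is included in its own mean and SD, it inflates the SD and can hide itself, especially in small samples.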


In [None]:
# Example: exam scores with mean 72, sd 2.0
mu = 72
sigma = 2.0
stephanie_score = 74

z_stephanie = (stephanie_score - mu) / sigma
print("Stephanie's z-score:", z_stephanie)

# Daniel: mean 76, sd 4.5, score 64
mu2 = 76
sigma2 = 4.5
daniel_score = 64
z_daniel = (daniel_score - mu2) / sigma2
print("Daniel's z-score:", z_daniel)

# Probability of scoring below a certain z (using CDF)
prob_steph_below = norm.cdf(z_stephanie)
print("Probability of scoring <= Stephanie's score:", prob_steph_below)


## 15. Poisson Distribution

The Poisson distribution is used to predict **how many times an event will occur** within a fixed amount of time, space, or volume. It applies when events:

- happen independently of each other, and  
- occur at a steady average rate over time.

In simple terms, Poisson helps estimate **how many events you can expect** when those events happen randomly but follow a consistent long‑term average.

**Everyday and Real‑World Uses**
- Calls arriving at a call center per minute  
- Cars passing through a toll booth in a given time  
- Customers entering a store during an hour  
- Earthquakes above a certain magnitude in a year  
- Accidents occurring at a specific intersection each month  

These are all cases where events are random but follow a predictable long‑term rate.

**Engineering and System-Level Uses**
- Errors appearing in logs within a time window  
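- Failed API calls per minute (a count out of a fixed *batch* of requests is binomial, although Poisson approximates it well when failures are rare)  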
- Jobs arriving in a message queue
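The Poisson probabilities follow from a single formula: P(X = k) = e^(−λ) · λ^k / k!, where λ is the average number of events per interval. The sketch below evaluates it directly; `poisson_pmf` is a hypothetical helper, and λ = 3 is an arbitrary example rate.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(exactly k events in an interval with average rate lam)."""
    return exp(-lam) * lam**k / factorial(k)

lam = 3  # assumed average: 3 events per interval

for k in range(7):
    print(f"P(X = {k}) = {poisson_pmf(k, lam):.4f}")

# Probability of seeing at most 5 events in one interval
prob_at_most_5 = sum(poisson_pmf(k, lam) for k in range(6))
print(f"P(X <= 5) = {prob_at_most_5:.4f}")
```

The most likely counts sit around λ, and the probabilities fall off quickly for larger counts.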

In [None]:
from scipy.stats import poisson

lam = 3  # average number of events per interval
x_values = np.arange(0, 11)
poisson_probs = poisson.pmf(x_values, lam)

plt.figure(figsize=(8,5))
plt.stem(x_values, poisson_probs)  # the old use_line_collection argument no longer exists in recent Matplotlib
plt.title('Poisson Distribution (λ = 3)')
plt.xlabel('Number of events (x)')
plt.ylabel('P(X = x)')
plt.show()


## 16. PDF, PMF, and CDF

When working with probability distributions, we often need to describe how likely different outcomes are. Three core concepts help us do this: **PMF**, **PDF**, and **CDF**. They tell us how probability is distributed across values of a random variable.

**PMF (Probability Mass Function)**  
A PMF is used for **discrete** random variables — variables that take specific, countable values.  
It tells you the probability of the variable being exactly equal to a particular value.

Examples of discrete distributions:  
- Binomial (number of successes in n trials)  
- Poisson (number of events in a time interval)

**IT engineering examples:**  
- Number of failed API calls in a batch of 100 requests  
- Number of errors logged in the last minute  
- Number of retries needed before a request succeeds  
- Number of users who click a button out of 50 shown the feature  
- Number of packets dropped in a network router per second  

In all these cases, the outcome is a **count**, so PMF applies.

**PDF (Probability Density Function)**  
A PDF is used for **continuous** random variables — variables that can take any value within a range.  
A PDF does *not* give the probability of a single exact value (for a continuous variable that probability is exactly zero).  
Instead, it describes how probability is **distributed across intervals**.

Examples of continuous distributions:  
- Normal distribution  
- Exponential distribution  
- Log-normal distribution  

**IT engineering examples:**  
- API response time (e.g., 123.45 ms)  
- CPU utilization percentage (e.g., 67.2%)  
- Memory usage in MB  
- Network latency in milliseconds  
- Time between two incoming requests  

These values can take infinitely many possible numbers, so PDFs describe their behavior.

**CDF (Cumulative Distribution Function)**  
A CDF gives the probability that a random variable is **less than or equal to** a certain value.  
It accumulates probability from the left side of the distribution up to that point.

CDF is extremely useful for understanding **percentiles**.

**IT engineering examples:**  
- Latency percentiles (P50, P90, P95, P99)  
  - CDF tells you what percentage of requests are faster than a given latency.  
- Error rate thresholds  
  - Probability that errors per minute stay below a certain limit.  
- Queue wait times  
  - Probability that a job waits less than X milliseconds.  
- Disk I/O performance  
  - Probability that read/write completes within a certain time.  
- Cloud autoscaling  
  - Probability that CPU stays below 80% for the next 5 minutes.  

CDF is the backbone of **SRE metrics**, **SLAs**, and **performance dashboards**.

**Putting it all together in IT engineering:**

- Use **PMF** when counting events (errors, retries, failures, user clicks).  
- Use **PDF** when measuring continuous metrics (latency, CPU, memory, throughput).  
- Use **CDF** when analyzing percentiles, thresholds, or SLA compliance.

These three concepts help engineers understand system behavior, detect anomalies, and make data-driven decisions about performance, reliability, and scaling.
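The relationship between the three functions can be sketched in a few lines. In the discrete case the CDF is a running sum of the PMF; in the continuous case, interval probabilities and percentiles both come from the CDF. The 2% failure rate and the Normal(100, 15) latency model below are assumptions chosen purely for illustration (real latency is often skewed).

```python
from itertools import accumulate
from math import comb
from statistics import NormalDist

# Discrete: failed calls in a batch of 100, assuming a 2% failure rate
n, p = 100, 0.02
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
cdf = list(accumulate(pmf))  # cdf[k] = P(X <= k), the running sum of the PMF

print(f"P(X = 2)  = {pmf[2]:.4f}")
print(f"P(X <= 2) = {cdf[2]:.4f}")

# Continuous: latency modeled as Normal(mean=100 ms, SD=15 ms)
latency = NormalDist(mu=100, sigma=15)

prob_between = latency.cdf(115) - latency.cdf(85)  # P(85 < X <= 115)
p95 = latency.inv_cdf(0.95)                        # value below which 95% of requests fall

print(f"P(85 < latency <= 115) = {prob_between:.4f}")
print(f"P95 latency = {p95:.2f} ms")
```

A percentile such as P95 is the CDF read in reverse: instead of asking "what fraction is below x?", it asks "which x has 95% below it?".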

In [None]:
# Example: CDF for Normal distribution

mu = 100
sigma = 15

values = [85, 100, 115]
for v in values:
    z = (v - mu) / sigma
    prob = norm.cdf(v, mu, sigma)
    print(f"P(X <= {v}) = {prob:.4f} (z = {z:.2f})")

# Visualizing the CDF
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 500)
cdf_values = norm.cdf(x, mu, sigma)

plt.figure(figsize=(8,5))
plt.plot(x, cdf_values, label='Normal CDF')
plt.title('CDF of Normal Distribution (μ = 100, σ = 15)')
plt.xlabel('x')
plt.ylabel('P(X ≤ x)')
plt.grid(True)
plt.legend()
plt.show()


## 17. Central Limit Theorem (High-Level)

The Central Limit Theorem (CLT) explains why the **Normal distribution appears everywhere**, even when the original data is messy, skewed, or irregular.

- You repeatedly take independent random samples of size `n` from a population (any population with finite variance).  
- For each sample, you calculate the **mean**.  
- If each sample is large enough (a common rule of thumb is `n ≥ 30`, more for heavily skewed data), the distribution of those sample means will look approximately **normal (bell‑shaped)**.  
- This happens **even if the original data is not normal at all**.

**When you average things, the averages tend to become normally distributed.**

**Examples:**

- **Daily average API latency:**  
  Individual request latencies may be highly skewed (some very slow outliers), but the *daily average latency* across thousands of requests tends to follow a normal distribution.

- **Average CPU usage per hour:**  
  CPU usage at any moment jumps around unpredictably, but hourly averages form a bell-shaped pattern.

- **Average memory consumption per container:**  
  Instantaneous memory usage is noisy, but the average over many samples becomes normally distributed.

- **Average number of errors per minute across a full day:**  
  Error bursts cause spikes, but the distribution of *minute-level averages* across many days becomes normal.

- **Average throughput of a microservice:**  
  Per-request throughput varies wildly, but the average throughput over 5-minute windows tends to be normal.

- **Average queue wait time:**  
  Individual wait times may be chaotic, but the average wait time per hour becomes predictable and normal-like.

- **Average disk I/O time:**  
  Raw I/O times are often skewed, but averages over many operations follow a normal distribution.

**Why engineers care:**

Because of the CLT, you can:

- Use normal-based confidence intervals for averages  
- Predict system performance more reliably  
- Detect anomalies using z-scores  
- Build dashboards and alerts around averaged metrics  
- Model aggregated metrics with normal assumptions  
- Simplify complex, messy data into something predictable  

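The CLT is the reason why **averages of many measurements are approximately normal**, with a spread that shrinks predictably as the sample size grows, even when the raw data is skewed. It does not make the raw data itself normal, so tail percentiles such as P99 still need to be measured directly.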
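The CLT also says how wide the bell curve of sample means should be: its standard deviation (the standard error) is approximately σ / √n, where σ is the population standard deviation and n the sample size. The sketch below checks this with a standard-library simulation; the exponential population, the sample size of 50, and the 2,000 repetitions are arbitrary choices for illustration.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

sample_size = 50
num_samples = 2000

# expovariate(1.0) draws from an exponential distribution with mean 1 and SD 1,
# which is strongly right-skewed.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
predicted_se = 1.0 / math.sqrt(sample_size)  # sigma / sqrt(n) with sigma = 1

print(f"Mean of sample means: {mean_of_means:.3f} (population mean is 1.0)")
print(f"SD of sample means:   {sd_of_means:.3f}")
print(f"Predicted sigma/sqrt(n): {predicted_se:.3f}")
```

The observed spread of the sample means should land close to the predicted value, and quadrupling `sample_size` should roughly halve it.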

In [None]:
# Simple CLT simulation

np.random.seed(42)

# Skewed population: exponential distribution
population = np.random.exponential(scale=1.0, size=100000)

sample_means = []
sample_size = 50
num_samples = 1000

for _ in range(num_samples):
    sample = np.random.choice(population, size=sample_size, replace=False)
    sample_means.append(np.mean(sample))

plt.figure(figsize=(8,5))
sns.histplot(sample_means, kde=True, color='teal')
plt.title('Distribution of Sample Means (CLT demo)')
plt.xlabel('Sample mean')
plt.ylabel('Frequency')
plt.show()
