In [28]:
import numpy as np
import plotly.express as px
from scipy import stats

Problem:
An e-commerce company wants to estimate the average delivery time for orders in a region. The raw delivery-time data is highly skewed: most orders arrive in about 2 days, but some take up to 10 days because of logistics issues. The operations team is skeptical: “How can we trust the average if the data is all over the place?”

What is the CLT?
The Central Limit Theorem (CLT) states that if you draw sufficiently large random samples from a population with finite variance (even a skewed one), the distribution of the sample means is approximately normal (a bell curve) and is centered on the population mean.
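
For reference, the statement above can be written compactly. The symbols are introduced here for illustration and do not appear in the notebook: $X_1, \dots, X_n$ are independent draws from a population with mean $\mu$ and finite standard deviation $\sigma$, and $\bar{X}_n$ is their sample mean.

$$
\frac{\sqrt{n}\,\left(\bar{X}_n - \mu\right)}{\sigma} \;\xrightarrow{d}\; \mathcal{N}(0, 1)
\qquad\text{so that, for large } n,\qquad
\bar{X}_n \;\approx\; \mathcal{N}\!\left(\mu,\ \frac{\sigma^2}{n}\right)
$$

The quantity $\sigma / \sqrt{n}$ is the standard error of the mean, which is the scale used for the confidence interval later in this notebook.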

Practical Explanation:

- The team takes 1,000 random samples of 50 orders each and calculates the mean delivery time for each sample.

- Plotting these 1,000 sample means creates a bell curve (normal distribution), even though the original data was skewed.

- Using this normal distribution, they calculate a 95% confidence interval (e.g., “We’re 95% confident the true average delivery time is between 3.5 and 4.2 days”).

### 1. Create Synthetic Delivery Time Data (Skewed Distribution)

In [18]:
np.random.seed(0)
delivery_times = np.clip(
    np.abs(np.append(
        np.random.normal(2, 0.5, 1000),  # Majority of orders (2 days ± 0.5)
        np.random.normal(10, 5, 100)      # Delayed orders (10 days ± 5)
    )), 0, 10
)
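
To put a number on "skewed", one option is to compute the sample skewness and compare the mean with the median. The following is a minimal sketch, not part of the original analysis. It rebuilds the same data so it can run on its own, and the names `skewness`, `mean_time`, and `median_time` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Same construction as the cell above, repeated so this sketch is self-contained
np.random.seed(0)
delivery_times = np.clip(
    np.abs(np.append(
        np.random.normal(2, 0.5, 1000),   # majority of orders
        np.random.normal(10, 5, 100)      # delayed orders
    )), 0, 10
)

skewness = stats.skew(delivery_times)     # positive value: longer right tail
mean_time = np.mean(delivery_times)
median_time = np.median(delivery_times)

print(f"Skewness: {skewness:.2f}")
print(f"Mean: {mean_time:.2f} days, median: {median_time:.2f} days")
```

A positive skewness together with a mean above the median is the usual signature of a right-skewed distribution, which is what the delayed orders produce here.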

### 2. Calculate Population Parameters (Unknown in Real Life)

In [19]:
population_mean = np.mean(delivery_times)
population_std = np.std(delivery_times)

### 3. Central Limit Theorem (CLT) Simulation

In [20]:
sample_size = 50        # Number of orders in each sample
num_samples = 1000      # Number of samples to collect

# Simulate repeated sampling
sample_means = []
for i in range(num_samples):
    sample = np.random.choice(delivery_times, sample_size, replace=True)
    sample_means.append(sample.mean())
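
The simulation above fixes the sample size at 50. To see what "large enough" might mean for this data, the sketch below repeats the sampling for a few sample sizes and compares the observed spread of the sample means with the theoretical standard error. It is an illustration under assumptions: the sample sizes (5, 50, 500), the 2,000 repetitions, and the use of `np.random.default_rng` are choices made here, so the numbers will differ slightly from the notebook's.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch

# Skewed population built the same way as in step 1
population = np.clip(
    np.abs(np.append(
        rng.normal(2, 0.5, 1000),
        rng.normal(10, 5, 100)
    )), 0, 10
)
sigma = population.std()

results = {}
for n in (5, 50, 500):
    # 2000 samples of size n drawn with replacement; one mean per row
    means = rng.choice(population, size=(2000, n), replace=True).mean(axis=1)
    results[n] = {
        "observed_se": means.std(ddof=1),
        "theoretical_se": sigma / np.sqrt(n),
        "skewness": stats.skew(means),
    }
    print(f"n={n:>3}: observed SE={results[n]['observed_se']:.3f}, "
          f"theoretical SE={results[n]['theoretical_se']:.3f}, "
          f"skewness of means={results[n]['skewness']:.2f}")
```

As the sample size grows, the observed spread should track the theoretical standard error and the skewness of the sample means should move toward zero, which is the behavior the CLT describes.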

### 4. Calculate CLT Metrics

In [21]:
# Observed metrics from sampling
sample_mean_of_means = np.mean(sample_means)
sample_std_of_means = np.std(sample_means, ddof=1)

# Theoretical standard error
standard_error = population_std / np.sqrt(sample_size)

# 95% interval from the normal approximation: mean ± z * standard error
confidence_level = 0.95
ci = stats.norm.interval(confidence_level,
                         loc=sample_mean_of_means,
                         scale=standard_error)

Confidence interval
- To get a 95% interval for the sample mean, we take the middle 95% of the sampling distribution of the mean.
- That means excluding 2.5% of the distribution in the left tail and 2.5% in the right tail.
- The lower bound is therefore the 2.5th percentile and the upper bound is the 97.5th percentile of that distribution. Because the CLT says the distribution is approximately normal, these percentiles equal the mean ± z × standard error, with z ≈ 1.96.

In [22]:
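# z-scores that cut off 2.5% in each tail of the standard normal distribution
lower_bound = stats.norm.ppf(0.025)   # about -1.96
upper_bound = stats.norm.ppf(0.975)   # about +1.96
z = upper_bound                       # by symmetry, lower_bound == -z
print("Confidence interval:", round(sample_mean_of_means - z * standard_error, 2), "-", round(sample_mean_of_means + z * standard_error, 2))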

Confidence interval: 1.98 - 3.09
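
The bullets above describe the interval in terms of percentiles of the sample means, while the cell itself uses normal quantiles. As a cross-check, the percentiles can also be read directly from the simulated sample means. This is a minimal sketch that repeats the data and sampling steps so it runs on its own. The names `lower_pct`, `upper_pct`, and `share_inside` are introduced here, and the result is not guaranteed to match the normal-based interval exactly.

```python
import numpy as np

# Rebuild the data and the sample means as in steps 1 and 3
np.random.seed(0)
delivery_times = np.clip(
    np.abs(np.append(
        np.random.normal(2, 0.5, 1000),
        np.random.normal(10, 5, 100)
    )), 0, 10
)
sample_means = np.array([
    np.random.choice(delivery_times, 50, replace=True).mean()
    for _ in range(1000)
])

# Empirical 2.5th and 97.5th percentiles of the simulated sample means
lower_pct, upper_pct = np.percentile(sample_means, [2.5, 97.5])

# Fraction of sample means that fall inside the empirical interval
share_inside = np.mean((sample_means >= lower_pct) & (sample_means <= upper_pct))

print(f"Empirical middle 95% of sample means: {lower_pct:.2f} - {upper_pct:.2f}")
print(f"Share of sample means inside: {share_inside:.1%}")
```

If the normal approximation is reasonable at this sample size, the empirical bounds should land close to the normal-based ones. Any gap between the two reflects the remaining skew in the sampling distribution.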


### 5. Create Interactive Visualizations

In [26]:
# Plot 1: Original Data Distribution
fig1 = px.histogram(
    x=delivery_times,
    nbins=50,
    title="Original Delivery Times Distribution (Skewed)",
    labels={'x': 'Delivery Time (Days)', 'y': 'Number of Orders'}
)
fig1.update_layout(bargap=0.1)
fig1.add_vline(x=population_mean, line_dash="dot", 
              line_color="red", annotation_text="True Average")



In [27]:
# Plot 2: Distribution of Sample Means (CLT)
fig2 = px.histogram(
    x=sample_means,
    nbins=30,
    title="Distribution of Sample Means (CLT in Action!)",
    labels={'x': 'Average Delivery Time (Days)', 'y': 'Frequency'},
    color_discrete_sequence=['#4ECDC4']
)
fig2.add_vline(x=population_mean, line_dash="dot", 
              line_color="red", annotation_text="True Average", annotation_position="top right")
fig2.add_vline(x=sample_mean_of_means, line_dash="dash", 
              line_color="blue", annotation_text="Sample Averages' Mean", annotation_position="bottom right")
fig2.add_vrect(x0=ci[0], x1=ci[1],
              annotation_text="95% Confidence Interval", annotation_position="top left", annotation_y=0.6,
              fillcolor="green", opacity=0.2, line_width=0)

![Original Delivery Times Distribution (Skewed)](original_distribution.png)


![Sample Means Distribution](sample_means_distribution.png)


In [25]:
print(f"True Average Delivery Time: {population_mean:.2f} days")
print(f"Natural Variation in Delivery Times (population std dev): {population_std:.2f} days")
print(f"\nCLT Results (Based on {num_samples} samples of {sample_size} orders):")
print(f"- Average of Sample Averages: {sample_mean_of_means:.2f} days")
print(f"- Observed Variation in Averages: {sample_std_of_means:.2f} days")
print(f"- 95% Confidence Interval: ({ci[0]:.2f} days, {ci[1]:.2f} days)")

True Average Delivery Time: 2.54 days
Natural Variation in Delivery Times (population std dev): 2.00 days

CLT Results (Based on 1000 samples of 50 orders):
- Average of Sample Averages: 2.54 days
- Observed Variation in Averages: 0.29 days
- 95% Confidence Interval: (1.98 days, 3.09 days)


**Key Metrics Explained**

1. True Average Delivery Time (2.54 days):  
The actual average delivery time across all orders (unknown in real scenarios).

2. Natural Variation (2.00 days):  
The population standard deviation: how much delivery times naturally differ from order to order.

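3. Average of Sample Averages (2.54 days):  
The average of all our sample estimates. It matches the true average to two decimals, as expected: the sample mean is an unbiased estimator, so the distribution of sample means is centered on the population mean.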

4. Observed Variation in Averages (0.29 days):  
How much our sample estimates varied from each other. This is close to the theoretical standard error, 2.00 / √50 ≈ 0.28 days.

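5. 95% Confidence Interval (1.98 days, 3.09 days):  
About 95% of sample averages (for samples of 50 orders) fall in this range. In practice, an interval of the same width centered on a single sample's average would capture the true average roughly 95% of the time.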
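
One way to make the "95% confident" statement concrete is to check coverage by simulation: build an interval from each sample's own mean and standard deviation, then count how often it contains the true average. The sketch below is an illustration rather than part of the original analysis. The 2,000 trials, the seed, and the use of `np.random.default_rng` are assumptions made here, and the interval uses the normal quantile `z` as in the cells above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # seed chosen arbitrarily for this sketch

# Skewed population built the same way as in step 1
population = np.clip(
    np.abs(np.append(
        rng.normal(2, 0.5, 1000),
        rng.normal(10, 5, 100)
    )), 0, 10
)
true_mean = population.mean()

n, trials = 50, 2000
z = stats.norm.ppf(0.975)

# Each row is one sample of n orders, drawn with replacement
samples = rng.choice(population, size=(trials, n), replace=True)
means = samples.mean(axis=1)
std_errors = samples.std(axis=1, ddof=1) / np.sqrt(n)  # estimated from each sample

# An interval "covers" when the true mean lies between its bounds
covered = (means - z * std_errors <= true_mean) & (true_mean <= means + z * std_errors)
coverage = covered.mean()

print(f"Share of intervals containing the true average: {coverage:.1%}")
```

With skewed data and a sample size of 50, the observed coverage can fall somewhat below the nominal 95%. It should move closer to 95% as the sample size increases.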