# Statistical Distributions: Synthetic Data Generation

This notebook generates synthetic data from four common statistical distributions, each paired with a real-world example:
- **Normal Distribution**: Heights of adults
- **Log-Normal Distribution**: Income distribution
- **Binomial Distribution**: Product quality control
- **Poisson Distribution**: Customer arrivals per hour

## 1. Import Required Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

# Set style for better visualizations
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")

## 2. Normal Distribution (Gaussian)

**Real-World Example**: Heights of adult humans

The normal distribution is symmetric and bell-shaped. Many natural phenomena are approximately normal, including:
- Human heights and weights
- Blood pressure measurements
- IQ scores
- Measurement errors
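
The bell shape can be made concrete with the 68-95-99.7 rule. The sketch below is illustrative rather than part of the original cells: it assumes NumPy's `default_rng` generator and a hypothetical helper `fraction_within`, and it measures the share of samples that fall within 1, 2, and 3 standard deviations of the mean. The parameter values are assumed to mirror the height example.

```python
import numpy as np

# Assumed illustrative parameters: mean 175 cm, standard deviation 7 cm
rng = np.random.default_rng(42)
mean_cm, std_cm = 175.0, 7.0
samples = rng.normal(mean_cm, std_cm, size=100_000)

def fraction_within(data, center, spread, k):
    """Share of observations within k * spread of center (hypothetical helper)."""
    return float(np.mean(np.abs(data - center) <= k * spread))

# Empirical coverage for 1, 2, and 3 standard deviations
coverage = {k: fraction_within(samples, mean_cm, std_cm, k) for k in (1, 2, 3)}
for k, frac in coverage.items():
    print(f"Within {k} standard deviation(s): {frac:.3f}")
```

With a large sample the three shares should land close to 0.683, 0.954, and 0.997.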

In [None]:
# Generate data: Heights of adult males (in cm)
# Mean height: 175 cm, Standard deviation: 7 cm
mean_height = 175
std_height = 7
sample_size = 1000

heights = np.random.normal(mean_height, std_height, sample_size)

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram with the theoretical PDF overlaid (this is not a KDE)
axes[0].hist(heights, bins=30, density=True, alpha=0.7, color='skyblue', edgecolor='black')
x = np.linspace(heights.min(), heights.max(), 100)
axes[0].plot(x, stats.norm.pdf(x, mean_height, std_height), 'r-', linewidth=2, label='PDF')
axes[0].set_title('Normal Distribution: Adult Male Heights', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Height (cm)')
axes[0].set_ylabel('Density')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Box plot
axes[1].boxplot(heights)  # Vertical orientation is the default
axes[1].set_title('Box Plot of Heights', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Height (cm)')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Statistics
print(f"Mean: {heights.mean():.2f} cm")
print(f"Standard Deviation: {heights.std():.2f} cm")
print(f"Min: {heights.min():.2f} cm")
print(f"Max: {heights.max():.2f} cm")
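within_1sd = np.mean(np.abs(heights - mean_height) <= std_height)
print(f"Theoretical 68% interval (mean ± 1 SD): [{mean_height-std_height:.2f}, {mean_height+std_height:.2f}] cm")
print(f"Share of samples inside that interval: {within_1sd:.1%}")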

## 3. Log-Normal Distribution

**Real-World Example**: Income distribution in a population

The log-normal distribution is right-skewed and appears when the logarithm of the variable is normally distributed. Common in:
- Income and wealth distribution
- Stock prices
- City population sizes
- File sizes on computers
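
A short numeric check can make the definition concrete. The sketch below is an illustration under assumed parameters (`mu_log`, `sigma_log`, and the `default_rng` generator are not from the original cells): it confirms that the log of log-normal samples has the stated mean and standard deviation, and it compares the sample median and mean with the closed forms exp(mu) and exp(mu + sigma^2 / 2).

```python
import numpy as np

rng = np.random.default_rng(0)
mu_log, sigma_log = 10.5, 0.5  # Assumed parameters of log(X)
x = rng.lognormal(mu_log, sigma_log, size=200_000)

# Taking the log should recover a normal sample with the same parameters
log_x = np.log(x)

# Closed-form median and mean of a log-normal variable
theoretical_median = np.exp(mu_log)
theoretical_mean = np.exp(mu_log + sigma_log**2 / 2)

print(f"Mean of log(x): {log_x.mean():.3f} (expected {mu_log})")
print(f"Std of log(x):  {log_x.std():.3f} (expected {sigma_log})")
print(f"Median of x: {np.median(x):,.0f} (theory {theoretical_median:,.0f})")
print(f"Mean of x:   {x.mean():,.0f} (theory {theoretical_mean:,.0f})")
```

The mean exceeds the median because the long right tail pulls the average upward, which is the signature of right skew.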

In [None]:
# Generate data: Annual income (reported in thousands of dollars)
# Log-normal parameters are defined on the dollar scale; exp(mu) ≈ $36,300 is the median
mu = 10.5  # Mean of log(income in dollars)
sigma = 0.5  # Standard deviation of log(income in dollars)
sample_size = 1000

incomes = np.random.lognormal(mu, sigma, sample_size) / 1000  # Convert to thousands

# Create visualization
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Histogram of income
axes[0].hist(incomes, bins=50, density=True, alpha=0.7, color='lightcoral', edgecolor='black')
axes[0].set_title('Log-Normal Distribution: Income Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Annual Income ($1000s)')
axes[0].set_ylabel('Density')
axes[0].grid(alpha=0.3)

# Histogram of log(income) - should be normal
# incomes is in $1000s, so convert back to dollars before taking the log;
# otherwise the data sits log(1000) ≈ 6.9 below the N(mu, sigma) curve
log_incomes = np.log(incomes * 1000)
axes[1].hist(log_incomes, bins=30, density=True, alpha=0.7, color='lightgreen', edgecolor='black')
x_log = np.linspace(log_incomes.min(), log_incomes.max(), 100)
axes[1].plot(x_log, stats.norm.pdf(x_log, mu, sigma), 'r-', linewidth=2, label='Normal PDF')
axes[1].set_title('Log of Income (Normal Distribution)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Log(Income in $)')
axes[1].set_ylabel('Density')
axes[1].legend()
axes[1].grid(alpha=0.3)

# Cumulative distribution
axes[2].hist(incomes, bins=50, cumulative=True, density=True, alpha=0.7, color='plum', edgecolor='black')
axes[2].set_title('Cumulative Distribution', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Annual Income ($1000s)')
axes[2].set_ylabel('Cumulative Probability')
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Statistics
print(f"Mean Income: ${incomes.mean():.2f}k")
print(f"Median Income: ${np.median(incomes):.2f}k")
print(f"Standard Deviation: ${incomes.std():.2f}k")
print(f"Min Income: ${incomes.min():.2f}k")
print(f"Max Income: ${incomes.max():.2f}k")
print(f"\n75th percentile: ${np.percentile(incomes, 75):.2f}k")
print(f"90th percentile: ${np.percentile(incomes, 90):.2f}k")
print(f"95th percentile: ${np.percentile(incomes, 95):.2f}k")

## 4. Binomial Distribution

**Real-World Example**: Product quality control - number of defective items

The binomial distribution models the number of successes in a fixed number of independent trials, each with the same probability of success. Common applications:
- Manufacturing defect rates
- Clinical trial success rates
- A/B testing in marketing
- Election polling
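
One way to see where the binomial distribution comes from is to build it out of individual yes/no trials. The sketch below is an illustration with assumed values (`n`, `p`, `n_batches`, and the `default_rng` generator are not from the original cells): it simulates each item as a Bernoulli trial, sums the defects per batch, and compares the sample mean and variance with the theoretical n*p and n*p*(1-p).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 0.05      # Assumed: 100 items per batch, 5% chance each is defective
n_batches = 20_000

# Each row is one batch; an entry is True when that item is defective
bernoulli_trials = rng.random((n_batches, n)) < p

# Summing the Bernoulli outcomes in a row gives one binomial draw
defects_per_batch = bernoulli_trials.sum(axis=1)

sample_mean = defects_per_batch.mean()
sample_var = defects_per_batch.var()
print(f"Sample mean: {sample_mean:.3f} (theory n*p = {n * p:.3f})")
print(f"Sample variance: {sample_var:.3f} (theory n*p*(1-p) = {n * p * (1 - p):.3f})")
```

Drawing with a binomial sampler directly is equivalent and faster; the explicit trial matrix is used here only to show the construction.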

In [None]:
# Generate data: Quality control - testing 100 products with 5% defect rate
n_trials = 100  # Number of products tested per batch
p_defect = 0.05  # Probability of defect (5%)
n_batches = 1000  # Number of batches

defects = np.random.binomial(n_trials, p_defect, n_batches)

# Compare with different defect rates
p_defect_low = 0.02  # 2% defect rate
p_defect_high = 0.10  # 10% defect rate

defects_low = np.random.binomial(n_trials, p_defect_low, n_batches)
defects_high = np.random.binomial(n_trials, p_defect_high, n_batches)

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram comparing different defect rates
# Shared integer-centered bins keep the three discrete histograms comparable
count_bins = np.arange(0, defects_high.max() + 2) - 0.5
axes[0].hist(defects_low, bins=count_bins, alpha=0.5, label='2% defect rate', color='green', edgecolor='black')
axes[0].hist(defects, bins=count_bins, alpha=0.5, label='5% defect rate', color='orange', edgecolor='black')
axes[0].hist(defects_high, bins=count_bins, alpha=0.5, label='10% defect rate', color='red', edgecolor='black')
axes[0].set_title('Binomial Distribution: Product Defects', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Number of Defects per 100 Products')
axes[0].set_ylabel('Frequency')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Probability mass function for 5% defect rate
x = np.arange(0, 15)
pmf = stats.binom.pmf(x, n_trials, p_defect)
axes[1].bar(x, pmf, alpha=0.7, color='steelblue', edgecolor='black')
axes[1].set_title('Probability Mass Function (5% defect rate)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Number of Defects')
axes[1].set_ylabel('Probability')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Statistics
print(f"Mean defects (5% rate): {defects.mean():.2f}")
print(f"Standard Deviation: {defects.std():.2f}")
print(f"Expected value (n*p): {n_trials * p_defect:.2f}")
print(f"Theoretical std: {np.sqrt(n_trials * p_defect * (1-p_defect)):.2f}")
print("\nProbability of getting:")
print(f"  0 defects: {stats.binom.pmf(0, n_trials, p_defect):.4f}")
print(f"  5 defects: {stats.binom.pmf(5, n_trials, p_defect):.4f}")
print(f"  >10 defects: {stats.binom.sf(10, n_trials, p_defect):.4f}")  # sf = 1 - cdf

## 5. Poisson Distribution

**Real-World Example**: Customer arrivals at a store per hour

The Poisson distribution models the number of events occurring in a fixed interval of time or space, assuming events happen independently at a constant average rate. Common applications:
- Number of customers arriving per hour
- Number of emails received per day
- Number of phone calls to a call center
- Rare disease occurrences
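
The Poisson distribution can also be read as the limit of a binomial: split the interval into n small slots, give each slot an arrival probability of λ/n, and let n grow. The sketch below illustrates this under assumed values (`lam`, the slot counts, and the name `gaps` are illustrative, not from the original cells) by measuring the largest pointwise gap between the two probability mass functions.

```python
import numpy as np
from scipy import stats

lam = 15                 # Assumed average number of arrivals per hour
k = np.arange(0, 41)     # Arrival counts at which the PMFs are compared
poisson_pmf = stats.poisson.pmf(k, lam)

# Largest absolute PMF difference for increasingly fine time slots
gaps = {}
for n in (50, 500, 5000):
    binom_pmf = stats.binom.pmf(k, n, lam / n)
    gaps[n] = float(np.max(np.abs(binom_pmf - poisson_pmf)))
    print(f"n = {n:>4}: max |binomial - Poisson| = {gaps[n]:.6f}")
```

The gap shrinks as n grows, which is why Poisson is a natural model for many rare, independent events.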

In [None]:
# Generate data: Customer arrivals per hour
lambda_low = 5    # Average 5 customers per hour (slow period)
lambda_med = 15   # Average 15 customers per hour (normal period)
lambda_high = 30  # Average 30 customers per hour (busy period)
n_hours = 500

arrivals_low = np.random.poisson(lambda_low, n_hours)
arrivals_med = np.random.poisson(lambda_med, n_hours)
arrivals_high = np.random.poisson(lambda_high, n_hours)

# Create visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Histogram comparing different arrival rates
# Shared integer-centered bins keep the three discrete histograms comparable
arrival_bins = np.arange(0, arrivals_high.max() + 2) - 0.5
axes[0, 0].hist(arrivals_low, bins=arrival_bins, alpha=0.6, label='λ=5 (slow)', color='lightblue', edgecolor='black')
axes[0, 0].hist(arrivals_med, bins=arrival_bins, alpha=0.6, label='λ=15 (normal)', color='yellow', edgecolor='black')
axes[0, 0].hist(arrivals_high, bins=arrival_bins, alpha=0.6, label='λ=30 (busy)', color='salmon', edgecolor='black')
axes[0, 0].set_title('Poisson Distribution: Customer Arrivals', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Number of Customers per Hour')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)

# PMF for λ=15
x = np.arange(0, 35)
pmf = stats.poisson.pmf(x, lambda_med)
axes[0, 1].bar(x, pmf, alpha=0.7, color='mediumseagreen', edgecolor='black')
axes[0, 1].axvline(lambda_med, color='red', linestyle='--', linewidth=2, label=f'Mean = {lambda_med}')
axes[0, 1].set_title('PMF: Normal Period (λ=15)', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Number of Arrivals')
axes[0, 1].set_ylabel('Probability')
axes[0, 1].legend()
axes[0, 1].grid(alpha=0.3)

# Time series simulation
hours = np.arange(24)
# Simulate a day with varying rates
lambda_by_hour = np.array([5, 3, 2, 2, 3, 5, 10, 15, 20, 18, 15, 20,
                            25, 22, 18, 20, 25, 30, 28, 25, 20, 15, 10, 7])
arrivals_by_hour = [np.random.poisson(lam) for lam in lambda_by_hour]

axes[1, 0].plot(hours, arrivals_by_hour, marker='o', linewidth=2, markersize=8, color='purple')
axes[1, 0].fill_between(hours, arrivals_by_hour, alpha=0.3, color='purple')
axes[1, 0].set_title('Customer Arrivals Throughout the Day', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Hour of Day')
axes[1, 0].set_ylabel('Number of Customers')
axes[1, 0].set_xticks(hours[::2])
axes[1, 0].grid(alpha=0.3)

# Mean vs Variance property
lambdas = [5, 10, 15, 20, 25, 30]
means = []
variances = []

for lam in lambdas:
    samples = np.random.poisson(lam, 1000)
    means.append(samples.mean())
    variances.append(samples.var())

axes[1, 1].scatter(means, variances, s=100, alpha=0.7, color='darkred')
axes[1, 1].plot([0, 35], [0, 35], 'k--', linewidth=2, label='Mean = Variance')
axes[1, 1].set_title('Poisson Property: Mean = Variance', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Sample Mean')
axes[1, 1].set_ylabel('Sample Variance')
axes[1, 1].legend()
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Statistics
print(f"Normal Period (λ=15):")
print(f"  Mean arrivals: {arrivals_med.mean():.2f}")
print(f"  Variance: {arrivals_med.var():.2f}")
print(f"  Std deviation: {arrivals_med.std():.2f}")
print(f"\nProbability of exactly 15 customers: {stats.poisson.pmf(15, lambda_med):.4f}")
print(f"Probability of more than 20 customers: {stats.poisson.sf(20, lambda_med):.4f}")  # sf = 1 - cdf
print(f"Probability of fewer than 10 customers: {stats.poisson.cdf(9, lambda_med):.4f}")

## 6. Distribution Comparison Summary

Let's visualize all four distributions side by side to see their key differences and characteristics.
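
A numeric companion to the visual comparison is to tabulate theoretical moments. The sketch below is an illustration, not part of the original cells: it assumes parameter values that mirror the four examples (with the log-normal expressed in dollars rather than thousands) and uses SciPy's frozen distributions to report mean, variance, and skewness.

```python
import numpy as np
from scipy import stats

# Assumed parameters mirroring the four examples in this notebook
distributions = {
    "Normal": stats.norm(loc=175, scale=7),
    "Log-Normal": stats.lognorm(s=0.5, scale=np.exp(10.5)),
    "Binomial": stats.binom(n=100, p=0.05),
    "Poisson": stats.poisson(mu=15),
}

# Collect theoretical mean, variance, and skewness for each distribution
summary = {}
for name, dist in distributions.items():
    mean, var, skew = dist.stats(moments="mvs")
    summary[name] = (float(mean), float(var), float(skew))
    print(f"{name:<12} mean={float(mean):>12.2f}  var={float(var):>16.2f}  skew={float(skew):.3f}")
```

Skewness is zero for the normal case and largest for the log-normal, while the Poisson row shows mean equal to variance.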

In [None]:
# Create comparison plot
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Normal Distribution
axes[0, 0].hist(heights, bins=30, density=True, alpha=0.7, color='skyblue', edgecolor='black')
x_norm = np.linspace(heights.min(), heights.max(), 100)
axes[0, 0].plot(x_norm, stats.norm.pdf(x_norm, mean_height, std_height), 'r-', linewidth=3)
axes[0, 0].set_title('Normal: Adult Heights\nSymmetric, Bell-shaped', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Height (cm)')
axes[0, 0].set_ylabel('Density')
axes[0, 0].grid(alpha=0.3)
axes[0, 0].text(0.05, 0.95, f'Mean: {heights.mean():.1f}\nStd: {heights.std():.1f}',
                transform=axes[0, 0].transAxes, verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# 2. Log-Normal Distribution
axes[0, 1].hist(incomes, bins=50, density=True, alpha=0.7, color='lightcoral', edgecolor='black')
axes[0, 1].set_title('Log-Normal: Income Distribution\nRight-skewed, Long tail', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Annual Income ($1000s)')
axes[0, 1].set_ylabel('Density')
axes[0, 1].grid(alpha=0.3)
axes[0, 1].text(0.65, 0.95, f'Mean: ${incomes.mean():.1f}k\nMedian: ${np.median(incomes):.1f}k',
                transform=axes[0, 1].transAxes, verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# 3. Binomial Distribution
x_binom = np.arange(0, 15)
pmf_binom = stats.binom.pmf(x_binom, n_trials, p_defect)
axes[1, 0].bar(x_binom, pmf_binom, alpha=0.7, color='steelblue', edgecolor='black', width=0.8)
axes[1, 0].set_title('Binomial: Product Defects\nDiscrete, Fixed trials', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Number of Defects (out of 100)')
axes[1, 0].set_ylabel('Probability')
axes[1, 0].grid(alpha=0.3)
axes[1, 0].text(0.65, 0.95, f'n: {n_trials}\np: {p_defect}\nE[X]: {n_trials*p_defect}',
                transform=axes[1, 0].transAxes, verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# 4. Poisson Distribution
x_poisson = np.arange(0, 35)
pmf_poisson = stats.poisson.pmf(x_poisson, lambda_med)
axes[1, 1].bar(x_poisson, pmf_poisson, alpha=0.7, color='mediumseagreen', edgecolor='black', width=0.8)
axes[1, 1].set_title('Poisson: Customer Arrivals\nDiscrete, Rate-based', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Number of Customers per Hour')
axes[1, 1].set_ylabel('Probability')
axes[1, 1].grid(alpha=0.3)
axes[1, 1].text(0.65, 0.95, f'λ: {lambda_med}\nMean: {lambda_med}\nVar: {lambda_med}',
                transform=axes[1, 1].transAxes, verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.suptitle('Comparison of Common Statistical Distributions', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

# Summary table
print("=" * 80)
print("DISTRIBUTION COMPARISON SUMMARY")
print("=" * 80)
print(f"\n{'Distribution':<15} {'Type':<12} {'Real-World Example':<35} {'Key Property'}")
print("-" * 80)
print(f"{'Normal':<15} {'Continuous':<12} {'Human heights, IQ scores':<35} {'Symmetric, bell-shaped'}")
print(f"{'Log-Normal':<15} {'Continuous':<12} {'Income, stock prices':<35} {'Right-skewed, positive'}")
print(f"{'Binomial':<15} {'Discrete':<12} {'Coin flips, defect rates':<35} {'Fixed trials'}")
print(f"{'Poisson':<15} {'Discrete':<12} {'Customer arrivals, rare events':<35} {'Mean = Variance'}")
print("=" * 80)

## Conclusion

This notebook demonstrated how to generate synthetic data from four fundamental statistical distributions:

1. **Normal Distribution** - Suited to quantities that vary symmetrically around a mean
2. **Log-Normal Distribution** - Suited to positive, right-skewed data such as income or prices
3. **Binomial Distribution** - Counts successes in a fixed number of independent trials
4. **Poisson Distribution** - Models event counts in fixed intervals of time or space

Each distribution has unique properties and real-world applications. Understanding these distributions is crucial for:
- Data simulation and testing
- Statistical modeling
- Machine learning feature engineering
- Understanding natural and social phenomena