# Statistical Distributions: Synthetic Data Generation

This notebook generates synthetic data from common statistical distributions, pairing each one with a real-world example:
- **Normal Distribution**: Heights of adults
- **Log-Normal Distribution**: Income distribution
- **Binomial Distribution**: Product quality control
- **Poisson Distribution**: Customer arrivals per hour

## 1. Import Required Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

# Set style for better visualizations
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")

## 2. Normal Distribution (Gaussian)

**Real-World Example**: Heights of adult humans

The normal distribution is symmetric and bell-shaped, and is fully described by its mean and standard deviation. Many natural phenomena are approximately normal, including:
- Human heights and weights
- Blood pressure measurements
- IQ scores
- Measurement errors
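
The bell shape can be made concrete with the empirical (68-95-99.7) rule: roughly 68%, 95%, and 99.7% of draws fall within one, two, and three standard deviations of the mean. The sketch below checks this on simulated data. The parameter values, the seed, and the helper name `fraction_within` are illustrative assumptions, not part of the notebook's own cells.

```python
import numpy as np

# Illustrative parameters (assumed for this sketch)
mean, std, n = 175.0, 7.0, 100_000

rng = np.random.default_rng(0)  # local generator; leaves the global seed untouched
sample = rng.normal(mean, std, n)

def fraction_within(data, center, spread, k):
    """Fraction of values lying within k * spread of center."""
    return np.mean(np.abs(data - center) <= k * spread)

fractions = {k: fraction_within(sample, mean, std, k) for k in (1, 2, 3)}
for k, frac in fractions.items():
    print(f"Within {k} SD: {frac:.3f}")
```

With a sample this large, the printed fractions should land close to the theoretical values; smaller samples will scatter more.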

In [None]:
# Generate data: Heights of adult males (in cm)
# Mean height: 175 cm, Standard deviation: 7 cm
mean_height = 175
std_height = 7
sample_size = 1000

heights = np.random.normal(mean_height, std_height, sample_size)

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram with the theoretical PDF overlaid
axes[0].hist(heights, bins=30, density=True, alpha=0.7, color='skyblue', edgecolor='black')
x = np.linspace(heights.min(), heights.max(), 100)
axes[0].plot(x, stats.norm.pdf(x, mean_height, std_height), 'r-', linewidth=2, label='PDF')
axes[0].set_title('Normal Distribution: Adult Male Heights', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Height (cm)')
axes[0].set_ylabel('Density')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Box plot
axes[1].boxplot(heights, vert=True)
axes[1].set_title('Box Plot of Heights', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Height (cm)')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Statistics
print(f"Mean: {heights.mean():.2f} cm")
print(f"Standard Deviation: {heights.std():.2f} cm")
print(f"Min: {heights.min():.2f} cm")
print(f"Max: {heights.max():.2f} cm")
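# Empirical check of the 68% rule: share of samples within one SD of the mean
within_1sd = np.mean(np.abs(heights - mean_height) <= std_height)
print(f"Within 1 SD [{mean_height - std_height:.2f}, {mean_height + std_height:.2f}] cm: {within_1sd:.1%} of samples (theory: ~68%)")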

## 3. Log-Normal Distribution

**Real-World Example**: Income distribution in a population

The log-normal distribution is right-skewed: a positive variable is log-normal when its logarithm is normally distributed. It is commonly used to model:
- Income and wealth distribution
- Stock prices
- City population sizes
- File sizes on computers
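
If log(X) is normal with mean `mu` and standard deviation `sigma`, then the median of X is exp(mu) and the mean is exp(mu + sigma²/2), so the mean always sits above the median. The sketch below compares these formulas with simulated draws. The values `mu = 10.5` and `sigma = 0.5` mirror the income example; the seed, the sample size, and the variable names are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

mu, sigma = 10.5, 0.5  # parameters of log(X), mirroring the income example
rng = np.random.default_rng(0)
draws = rng.lognormal(mu, sigma, 200_000)

theoretical_median = np.exp(mu)
theoretical_mean = np.exp(mu + sigma**2 / 2)

# SciPy describes the same distribution with s=sigma and scale=exp(mu)
dist = stats.lognorm(s=sigma, scale=np.exp(mu))

print(f"Median: sample {np.median(draws):,.0f} vs theory {theoretical_median:,.0f}")
print(f"Mean:   sample {draws.mean():,.0f} vs theory {theoretical_mean:,.0f}")
print(f"SciPy median {dist.median():,.0f}, SciPy mean {dist.mean():,.0f}")
```

The gap between mean and median grows with `sigma`, which is why the median is often the more representative summary for incomes.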

In [None]:
# Generate data: Annual income (reported in thousands of dollars)
# Log-normal parameters are defined on the dollar scale
mu = 10.5  # Mean of log(income in dollars); median income = exp(mu), about $36.3k
sigma = 0.5  # Standard deviation of log(income)
sample_size = 1000

incomes = np.random.lognormal(mu, sigma, sample_size) / 1000  # Convert to thousands

# Create visualization
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Histogram of income
axes[0].hist(incomes, bins=50, density=True, alpha=0.7, color='lightcoral', edgecolor='black')
axes[0].set_title('Log-Normal Distribution: Income Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Annual Income ($1000s)')
axes[0].set_ylabel('Density')
axes[0].grid(alpha=0.3)

# Histogram of log(income) - should be normal
# incomes is in $1000s, so its log has mean mu - log(1000), not mu
log_incomes = np.log(incomes)
mu_log_k = mu - np.log(1000)
axes[1].hist(log_incomes, bins=30, density=True, alpha=0.7, color='lightgreen', edgecolor='black')
x_log = np.linspace(log_incomes.min(), log_incomes.max(), 100)
axes[1].plot(x_log, stats.norm.pdf(x_log, mu_log_k, sigma), 'r-', linewidth=2, label='Normal PDF')
axes[1].set_title('Log of Income (Normal Distribution)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Log(Income in $1000s)')
axes[1].set_ylabel('Density')
axes[1].legend()
axes[1].grid(alpha=0.3)

# Cumulative distribution
axes[2].hist(incomes, bins=50, cumulative=True, density=True, alpha=0.7, color='plum', edgecolor='black')
axes[2].set_title('Cumulative Distribution', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Annual Income ($1000s)')
axes[2].set_ylabel('Cumulative Probability')
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Statistics
print(f"Mean Income: ${incomes.mean():.2f}k")
print(f"Median Income: ${np.median(incomes):.2f}k")
print(f"Standard Deviation: ${incomes.std():.2f}k")
print(f"Min Income: ${incomes.min():.2f}k")
print(f"Max Income: ${incomes.max():.2f}k")
print(f"\n75th percentile: ${np.percentile(incomes, 75):.2f}k")
print(f"90th percentile: ${np.percentile(incomes, 90):.2f}k")
print(f"95th percentile: ${np.percentile(incomes, 95):.2f}k")

## 4. Binomial Distribution

**Real-World Example**: Product quality control - number of defective items

The binomial distribution models the number of successes in a fixed number of independent trials, each with the same probability of success. Common applications:
- Manufacturing defect rates
- Clinical trial success rates
- A/B testing in marketing
- Election polling
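
A binomial count is the sum of n independent pass/fail trials, and its probability mass function is P(X = k) = C(n, k) p^k (1 - p)^(n - k). The sketch below builds counts from individual trials and compares their frequencies with that formula. The values `n = 100` and `p = 0.05` mirror the quality-control example; `binomial_pmf` is a hypothetical helper written for this illustration, and the seed and batch count are arbitrary.

```python
import math
import numpy as np

n, p = 100, 0.05  # trials per batch and defect probability (illustrative)

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed from the closed form."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

rng = np.random.default_rng(0)
# Each row is one batch of n pass/fail inspections; True marks a defect
trials = rng.random((20_000, n)) < p
counts = trials.sum(axis=1)

for k in (0, 5, 10):
    simulated = np.mean(counts == k)
    print(f"P(X={k}): simulated {simulated:.4f}, formula {binomial_pmf(k, n, p):.4f}")

# The PMF should sum to 1 over all possible counts
pmf_total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
```

Summing Bernoulli trials by hand is slower than calling a binomial sampler directly, but it makes the independence and fixed-probability assumptions explicit.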

In [None]:
# Generate data: Quality control - testing 100 products with 5% defect rate
n_trials = 100  # Number of products tested per batch
p_defect = 0.05  # Probability of defect (5%)
n_batches = 1000  # Number of batches

defects = np.random.binomial(n_trials, p_defect, n_batches)

# Compare with different defect rates
p_defect_low = 0.02  # 2% defect rate
p_defect_high = 0.10  # 10% defect rate

defects_low = np.random.binomial(n_trials, p_defect_low, n_batches)
defects_high = np.random.binomial(n_trials, p_defect_high, n_batches)

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram comparing different defect rates
# Counts are discrete, so use shared bins centered on the integers
bins = np.arange(-0.5, defects_high.max() + 1.5)
axes[0].hist(defects_low, bins=bins, alpha=0.5, label='2% defect rate', color='green', edgecolor='black')
axes[0].hist(defects, bins=bins, alpha=0.5, label='5% defect rate', color='orange', edgecolor='black')
axes[0].hist(defects_high, bins=bins, alpha=0.5, label='10% defect rate', color='red', edgecolor='black')
axes[0].set_title('Binomial Distribution: Product Defects', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Number of Defects per 100 Products')
axes[0].set_ylabel('Frequency')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Probability mass function for 5% defect rate
x = np.arange(0, 15)
pmf = stats.binom.pmf(x, n_trials, p_defect)
axes[1].bar(x, pmf, alpha=0.7, color='steelblue', edgecolor='black')
axes[1].set_title('Probability Mass Function (5% defect rate)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Number of Defects')
axes[1].set_ylabel('Probability')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Statistics
print(f"Mean defects (5% rate): {defects.mean():.2f}")
print(f"Standard Deviation: {defects.std():.2f}")
print(f"Expected value (n*p): {n_trials * p_defect:.2f}")
print(f"Theoretical std: {np.sqrt(n_trials * p_defect * (1-p_defect)):.2f}")
print("\nProbability of getting:")
print(f"  0 defects: {stats.binom.pmf(0, n_trials, p_defect):.4f}")
print(f"  5 defects: {stats.binom.pmf(5, n_trials, p_defect):.4f}")
# sf is the survival function, P(X > 10); more accurate than 1 - cdf in the far tail
print(f"  >10 defects: {stats.binom.sf(10, n_trials, p_defect):.4f}")