# Day 5: Statistics for Machine Learning

Statistics is the backbone of machine learning. Today we'll master:

1. **Descriptive Statistics**: Mean, median, mode, standard deviation
2. **Probability Distributions**: Normal, binomial, uniform
3. **Correlation vs Causation**: Understanding relationships
4. **Hypothesis Testing**: Making data-driven decisions
5. **Assignment**: Correlation analysis and hypothesis testing

---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm, binom, uniform, poisson, expon
from scipy.stats import ttest_ind, ttest_1samp, chi2_contingency, pearsonr, spearmanr
import warnings

warnings.filterwarnings('ignore')

# Set styles
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

import scipy  # `from scipy import stats` does not bind the name `scipy`, so import it for the version check
print(f"SciPy Version: {scipy.__version__}")
print("Ready to learn Statistics for ML!")

In [None]:
# Load datasets
titanic = sns.load_dataset('titanic')
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')

print("Datasets loaded successfully!")

---

## 1. Descriptive Statistics

Descriptive statistics summarize and describe the main features of a dataset.

### 1.1 Measures of Central Tendency

In [None]:
# Create sample data
data = np.array([23, 25, 27, 28, 29, 30, 30, 31, 32, 35, 40, 150])  # Note: 150 is an outlier

print("Sample Data:", data)
print("=" * 50)

# Mean: Average of all values
mean = np.mean(data)
print(f"\nMean: {mean:.2f}")
print(f"  Formula: sum(x) / n = {sum(data)} / {len(data)} = {mean:.2f}")
print(f"  Note: Mean is sensitive to outliers (150 pulls it up)")

# Median: Middle value when sorted
median = np.median(data)
print(f"\nMedian: {median:.2f}")
print(f"  The middle value (or average of two middle values)")
print(f"  Note: Median is robust to outliers")

# Mode: Most frequent value
mode_result = stats.mode(data, keepdims=True)
print(f"\nMode: {mode_result.mode[0]} (appears {mode_result.count[0]} times)")
print(f"  The most frequently occurring value")

In [None]:
# Visualize the difference
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# With outlier
axes[0].hist(data, bins=10, edgecolor='black', alpha=0.7)
axes[0].axvline(mean, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean:.1f}')
axes[0].axvline(median, color='green', linestyle='--', linewidth=2, label=f'Median: {median:.1f}')
axes[0].set_title('With Outlier (150)', fontsize=12)
axes[0].legend()

# Without outlier
data_no_outlier = data[data < 100]
mean_clean = np.mean(data_no_outlier)
median_clean = np.median(data_no_outlier)

axes[1].hist(data_no_outlier, bins=10, edgecolor='black', alpha=0.7, color='green')
axes[1].axvline(mean_clean, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_clean:.1f}')
axes[1].axvline(median_clean, color='blue', linestyle='--', linewidth=2, label=f'Median: {median_clean:.1f}')
axes[1].set_title('Without Outlier', fontsize=12)
axes[1].legend()

plt.tight_layout()
plt.show()

print(f"\nMean dropped from {mean:.1f} to {mean_clean:.1f} after removing outlier")
print(f"Median only changed from {median:.1f} to {median_clean:.1f}")

### 1.2 Measures of Dispersion

In [None]:
# Sample data
np.random.seed(42)
data = np.random.normal(50, 15, 1000)  # mean=50, std=15

print("Measures of Dispersion")
print("=" * 50)

# Range
range_val = np.ptp(data)  # peak-to-peak
print(f"\nRange: {range_val:.2f}")
print(f"  Formula: max - min = {data.max():.2f} - {data.min():.2f}")

# Variance
variance = np.var(data)
print(f"\nVariance: {variance:.2f}")
print(f"  Formula: Σ(x - mean)² / n")
print(f"  Measures how spread out the data is (in squared units)")

# Standard Deviation
std = np.std(data)
print(f"\nStandard Deviation: {std:.2f}")
print(f"  Formula: √variance = √{variance:.2f}")
print(f"  Same unit as original data")

# Interquartile Range (IQR)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
print(f"\nIQR: {iqr:.2f}")
print(f"  Q1 (25th percentile): {q1:.2f}")
print(f"  Q3 (75th percentile): {q3:.2f}")
print(f"  IQR = Q3 - Q1 = {q3:.2f} - {q1:.2f}")
print(f"  Robust to outliers, covers middle 50% of data")
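The variance and standard deviation above use NumPy's default `ddof=0`, which divides by n (the population formula printed in the cell). When the data is a sample from a larger population, dividing by n − 1 (`ddof=1`) is the usual unbiased estimate. A minimal sketch of the difference, using a small hypothetical array named `sample` that is not part of the lesson's datasets:

```python
import numpy as np

# Hypothetical small sample, chosen only for illustration
sample = np.array([4.0, 8.0, 6.0, 5.0, 3.0, 10.0])
n = len(sample)

pop_var = np.var(sample)             # ddof=0: divides by n
sample_var = np.var(sample, ddof=1)  # ddof=1: divides by n - 1 (Bessel's correction)

print(f"Population variance (ddof=0): {pop_var:.3f}")
print(f"Sample variance (ddof=1):     {sample_var:.3f}")

# The two estimates differ by exactly the factor n / (n - 1)
print(f"Ratio: {sample_var / pop_var:.3f}  vs  n/(n-1) = {n / (n - 1):.3f}")
```

Note that pandas defaults to `ddof=1` in `Series.var()` and `Series.std()`, so NumPy and pandas can report slightly different values for the same column unless `ddof` is set explicitly.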

In [None]:
# Coefficient of Variation
cv = (std / np.mean(data)) * 100
print(f"\nCoefficient of Variation (CV): {cv:.2f}%")
print(f"  Formula: (std / mean) × 100")
print(f"  Useful for comparing variability across different scales")

In [None]:
# Visualize standard deviation
plt.figure(figsize=(12, 6))

# Create data with different standard deviations
data1 = np.random.normal(50, 5, 1000)   # Low spread
data2 = np.random.normal(50, 15, 1000)  # Medium spread
data3 = np.random.normal(50, 30, 1000)  # High spread

plt.hist(data1, bins=30, alpha=0.5, label=f'std=5 (tight)', color='blue')
plt.hist(data2, bins=30, alpha=0.5, label=f'std=15 (medium)', color='green')
plt.hist(data3, bins=30, alpha=0.5, label=f'std=30 (spread)', color='red')

plt.axvline(50, color='black', linestyle='--', linewidth=2, label='Mean=50')
plt.title('Effect of Standard Deviation on Data Spread', fontsize=14, fontweight='bold')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.legend()
plt.show()

### 1.3 Percentiles and Quartiles

In [None]:
# Percentiles
data = titanic['age'].dropna()

print("Age Percentiles in Titanic Dataset")
print("=" * 50)

percentiles = [10, 25, 50, 75, 90, 95, 99]
for p in percentiles:
    value = np.percentile(data, p)
    print(f"{p}th percentile: {value:.1f} years")

print(f"\nInterpretation:")
print(f"  - 50% of passengers were younger than {np.percentile(data, 50):.0f} years")
print(f"  - 90% of passengers were younger than {np.percentile(data, 90):.0f} years")

In [None]:
# Box plot showing quartiles
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot
bp = axes[0].boxplot(data, vert=True, patch_artist=True)
bp['boxes'][0].set_facecolor('lightblue')

# Add quartile labels
axes[0].axhline(np.percentile(data, 25), color='green', linestyle='--', alpha=0.7)
axes[0].axhline(np.percentile(data, 50), color='red', linestyle='--', alpha=0.7)
axes[0].axhline(np.percentile(data, 75), color='green', linestyle='--', alpha=0.7)
axes[0].set_title('Age Distribution (Box Plot)', fontsize=12)
axes[0].set_ylabel('Age')

# Annotate
q1, q2, q3 = np.percentile(data, [25, 50, 75])
axes[0].annotate(f'Q1: {q1:.0f}', xy=(1.1, q1), fontsize=10)
axes[0].annotate(f'Q2 (Median): {q2:.0f}', xy=(1.1, q2), fontsize=10)
axes[0].annotate(f'Q3: {q3:.0f}', xy=(1.1, q3), fontsize=10)

# Histogram with quartile lines
axes[1].hist(data, bins=30, edgecolor='black', alpha=0.7)
for p, label, color in [(25, 'Q1', 'green'), (50, 'Median', 'red'), (75, 'Q3', 'green')]:
    val = np.percentile(data, p)
    axes[1].axvline(val, color=color, linestyle='--', linewidth=2, label=f'{label}: {val:.0f}')
axes[1].set_title('Age Distribution with Quartiles', fontsize=12)
axes[1].set_xlabel('Age')
axes[1].set_ylabel('Frequency')
axes[1].legend()

plt.tight_layout()
plt.show()
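Quartiles are also the basis of a common outlier heuristic: flag any point more than 1.5 × IQR beyond Q1 or Q3, which is the same convention Matplotlib's `boxplot` uses by default for its whiskers. The sketch below applies it to the small sample from Section 1.1; the names `values`, `lower_fence`, and `upper_fence` are introduced here for illustration.

```python
import numpy as np

# Same small sample as in Section 1.1, including the outlier 150
values = np.array([23, 25, 27, 28, 29, 30, 30, 31, 32, 35, 40, 150])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Tukey's fences: points beyond 1.5 * IQR from the quartiles are flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print(f"Fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print(f"Flagged as outliers: {outliers}")
```

The 1.5 multiplier is a convention, not a law; a flagged point deserves inspection, not automatic removal.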

### 1.4 Skewness and Kurtosis

In [None]:
# Skewness: Measure of asymmetry
# Positive skew: tail extends to the right (mean > median)
# Negative skew: tail extends to the left (mean < median)

# Create different distributions
np.random.seed(42)
normal_data = np.random.normal(50, 10, 1000)
right_skewed = np.random.exponential(10, 1000)  # Right-skewed
left_skewed = 100 - np.random.exponential(10, 1000)  # Left-skewed

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, data, title in zip(axes, 
                            [left_skewed, normal_data, right_skewed],
                            ['Left Skewed', 'Normal (Symmetric)', 'Right Skewed']):
    ax.hist(data, bins=30, edgecolor='black', alpha=0.7)
    ax.axvline(np.mean(data), color='red', linestyle='--', linewidth=2, label=f'Mean: {np.mean(data):.1f}')
    ax.axvline(np.median(data), color='green', linestyle='--', linewidth=2, label=f'Median: {np.median(data):.1f}')
    skew = stats.skew(data)
    ax.set_title(f'{title}\nSkewness: {skew:.2f}', fontsize=12)
    ax.legend(fontsize=9)

plt.tight_layout()
plt.show()

print("\nSkewness Interpretation:")
print("  Skewness < 0: Left-skewed (tail on left, mean typically < median)")
print("  Skewness ≈ 0: Roughly symmetric (e.g., normal distribution)")
print("  Skewness > 0: Right-skewed (tail on right, mean typically > median)")
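To see what `stats.skew` computes, it can help to write the formula out. With its default settings it returns the third standardized moment, m₃ / m₂^1.5, where m₂ and m₃ are central moments that divide by n. A minimal sketch on freshly generated right-skewed data (the generator `rng` and the array `sample` are assumptions for this example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.exponential(scale=10, size=1000)  # right-skewed, like the example above

# Central moments (dividing by n, matching scipy's default bias=True)
deviations = sample - sample.mean()
m2 = np.mean(deviations**2)
m3 = np.mean(deviations**3)

# Skewness is the third central moment scaled by the spread
manual_skew = m3 / m2**1.5

print(f"Manual skewness:    {manual_skew:.4f}")
print(f"scipy.stats.skew(): {stats.skew(sample):.4f}")
```

Cubing the deviations keeps their sign, so a long right tail produces a positive value and a long left tail a negative one.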

In [None]:
# Kurtosis: Measure of "tailedness"
# High kurtosis: Heavy tails, more outliers
# Low kurtosis: Light tails, fewer outliers

np.random.seed(42)
normal = np.random.normal(0, 1, 1000)
heavy_tails = np.random.standard_t(df=3, size=1000)  # t-distribution with heavy tails
light_tails = np.random.uniform(-2, 2, 1000)  # Uniform has light tails

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, data, title in zip(axes,
                            [light_tails, normal, heavy_tails],
                            ['Light Tails (Uniform)', 'Normal (Mesokurtic)', 'Heavy Tails (t-dist)']):
    ax.hist(data, bins=30, edgecolor='black', alpha=0.7, density=True)
    kurt = stats.kurtosis(data)
    ax.set_title(f'{title}\nExcess Kurtosis: {kurt:.2f}', fontsize=12)
    ax.set_xlim(-6, 6)

plt.tight_layout()
plt.show()

print("\nKurtosis Interpretation (Excess Kurtosis):")
print("  Kurtosis < 0: Platykurtic (lighter tails than normal, fewer extreme values)")
print("  Kurtosis = 0: Mesokurtic (same tail weight as the normal distribution)")
print("  Kurtosis > 0: Leptokurtic (heavier tails than normal, more extreme values)")
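The sample values in the plot titles can be compared against theory. SciPy distributions report their theoretical excess kurtosis through `.stats(moments='k')`. The sketch below checks the normal and uniform cases and compares the uniform value with a large simulated sample; `rng` and `uniform_sample` are names made up for this example.

```python
import numpy as np
from scipy import stats

# Theoretical excess kurtosis reported by SciPy's distribution objects
normal_kurt = float(stats.norm.stats(moments='k'))
uniform_kurt = float(stats.uniform.stats(moments='k'))

print(f"Normal  (theory): {normal_kurt:.2f}")
print(f"Uniform (theory): {uniform_kurt:.2f}")

# A large sample estimate should land close to the theoretical value
rng = np.random.default_rng(42)
uniform_sample = rng.uniform(-2, 2, size=100_000)
sample_kurt = stats.kurtosis(uniform_sample)  # excess kurtosis by default (fisher=True)

print(f"Uniform (sample of 100,000): {sample_kurt:.2f}")
```

For the t-distribution, excess kurtosis equals 6 / (df − 4) and is finite only when df > 4. With df = 3, as in the plot above, the sample kurtosis has no finite value to converge to, so it can change a lot from one random draw to the next.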

---

## 2. Probability Distributions

Understanding probability distributions is essential for ML.

### 2.1 Normal (Gaussian) Distribution

In [None]:
# Normal Distribution: Bell curve; many natural measurements are approximately normal
# Parameters: μ (mean), σ (standard deviation)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Different means
x = np.linspace(-10, 20, 1000)
for mu, color in [(0, 'blue'), (5, 'green'), (10, 'red')]:
    axes[0].plot(x, norm.pdf(x, mu, 2), color=color, linewidth=2, label=f'μ={mu}, σ=2')
axes[0].set_title('Effect of Mean (μ)', fontsize=12)
axes[0].legend()
axes[0].set_xlabel('x')
axes[0].set_ylabel('Probability Density')

# Different standard deviations
x = np.linspace(-15, 15, 1000)
for sigma, color in [(1, 'blue'), (2, 'green'), (4, 'red')]:
    axes[1].plot(x, norm.pdf(x, 0, sigma), color=color, linewidth=2, label=f'μ=0, σ={sigma}')
axes[1].set_title('Effect of Standard Deviation (σ)', fontsize=12)
axes[1].legend()
axes[1].set_xlabel('x')
axes[1].set_ylabel('Probability Density')

plt.tight_layout()
plt.show()

In [None]:
# The 68-95-99.7 Rule (Empirical Rule)
mu, sigma = 0, 1
x = np.linspace(-4, 4, 1000)
y = norm.pdf(x, mu, sigma)

plt.figure(figsize=(12, 6))
plt.plot(x, y, 'b-', linewidth=2)

# Fill areas
# 1 std (68.27%)
x1 = np.linspace(-1, 1, 100)
plt.fill_between(x1, norm.pdf(x1, mu, sigma), alpha=0.3, color='blue', label='68.27% (±1σ)')

# 2 std (95.45%)
x2 = np.linspace(-2, 2, 100)
plt.fill_between(x2, norm.pdf(x2, mu, sigma), alpha=0.2, color='green', label='95.45% (±2σ)')

# 3 std (99.73%)
x3 = np.linspace(-3, 3, 100)
plt.fill_between(x3, norm.pdf(x3, mu, sigma), alpha=0.1, color='red', label='99.73% (±3σ)')

# Add lines
for i in range(-3, 4):
    plt.axvline(i, color='gray', linestyle='--', alpha=0.5)

plt.title('The 68-95-99.7 Rule (Empirical Rule)', fontsize=14, fontweight='bold')
plt.xlabel('Standard Deviations from Mean')
plt.ylabel('Probability Density')
plt.legend()
plt.show()

print("The Empirical Rule:")
print("  68.27% of data falls within 1 standard deviation of the mean")
print("  95.45% of data falls within 2 standard deviations of the mean")
print("  99.73% of data falls within 3 standard deviations of the mean")
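The percentages in the empirical rule are not memorized constants; they come straight from the normal CDF. A short sketch that recomputes them (the dictionary name `coverage` is introduced here):

```python
from scipy.stats import norm

# P(-k < Z < k) for a standard normal Z equals cdf(k) - cdf(-k)
coverage = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, prob in coverage.items():
    print(f"Within ±{k}σ: {prob * 100:.2f}%")
```

Because any normal variable can be standardized, the same three probabilities hold for every choice of μ and σ.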

In [None]:
# Z-scores: How many standard deviations from the mean
# z = (x - μ) / σ

# Example: IQ scores (mean=100, std=15)
mu_iq, sigma_iq = 100, 15

print("IQ Score Analysis (μ=100, σ=15)")
print("=" * 50)

for iq in [70, 85, 100, 115, 130, 145]:
    z = (iq - mu_iq) / sigma_iq
    percentile = norm.cdf(z) * 100
    print(f"IQ {iq}: z-score = {z:+.2f}, percentile = {percentile:.1f}%")

print(f"\nWhat IQ is needed to be in top 1%?")
z_99 = norm.ppf(0.99)  # z-score for 99th percentile
iq_99 = mu_iq + z_99 * sigma_iq
print(f"  z-score for 99th percentile: {z_99:.2f}")
print(f"  IQ needed: {iq_99:.0f}")
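The same z-score formula is what feature standardization does in ML preprocessing, except that μ and σ are estimated from the data instead of being known in advance. A minimal sketch, assuming the six IQ values above are treated as a small feature column (the names `scores`, `z_known`, and `z_estimated` are illustrative):

```python
import numpy as np
from scipy import stats

scores = np.array([70, 85, 100, 115, 130, 145], dtype=float)

# z-scores using the known population parameters (μ=100, σ=15)
z_known = (scores - 100) / 15

# z-scores using the mean and std estimated from the data itself
z_estimated = (scores - scores.mean()) / scores.std()

print("Known parameters:    ", np.round(z_known, 2))
print("Estimated from data: ", np.round(z_estimated, 2))
print("scipy.stats.zscore:  ", np.round(stats.zscore(scores), 2))

# After standardizing with estimated parameters, mean is 0 and std is 1
print(f"Mean: {z_estimated.mean():.2f}, Std: {z_estimated.std():.2f}")
```

The two versions differ because this small set of scores is not centered on 100; which one is appropriate depends on whether the population parameters are actually known.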

### 2.2 Binomial Distribution

In [None]:
# Binomial: Number of successes in n independent trials
# Parameters: n (number of trials), p (probability of success)

# Example: Coin flips
n = 10  # Number of flips
p = 0.5  # Probability of heads

x = np.arange(0, n+1)
probabilities = binom.pmf(x, n, p)

plt.figure(figsize=(10, 5))
plt.bar(x, probabilities, color='steelblue', edgecolor='black')
plt.title(f'Binomial Distribution: {n} Fair Coin Flips', fontsize=14, fontweight='bold')
plt.xlabel('Number of Heads')
plt.ylabel('Probability')
plt.xticks(x)

# Add expected value
expected = n * p
plt.axvline(expected, color='red', linestyle='--', linewidth=2, label=f'Expected Value: {expected}')
plt.legend()
plt.show()

print(f"Expected value (μ) = n × p = {n} × {p} = {expected}")
print(f"Variance = n × p × (1-p) = {n} × {p} × {1-p} = {n*p*(1-p)}")
print(f"Standard deviation = √variance = {np.sqrt(n*p*(1-p)):.2f}")
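The formulas μ = n × p and variance = n × p × (1 − p) can be checked by simulation. The sketch below repeats the 10-flip experiment many times with NumPy's random generator; the number of repetitions (100,000) and the names `rng` and `heads` are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 10, 0.5

# Each entry is the number of heads in one experiment of n flips
heads = rng.binomial(n, p, size=100_000)

print(f"Simulated mean:     {heads.mean():.3f}  (theory: {n * p})")
print(f"Simulated variance: {heads.var():.3f}  (theory: {n * p * (1 - p)})")
```

The simulated values should sit close to the theoretical ones, and they get closer as the number of repetitions grows.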

In [None]:
# Different binomial distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

params = [(10, 0.5), (10, 0.2), (20, 0.5)]
titles = ['n=10, p=0.5\n(Fair coin)', 'n=10, p=0.2\n(Biased)', 'n=20, p=0.5\n(More trials)']

for ax, (n, p), title in zip(axes, params, titles):
    x = np.arange(0, n+1)
    ax.bar(x, binom.pmf(x, n, p), color='steelblue', edgecolor='black')
    ax.axvline(n*p, color='red', linestyle='--', linewidth=2, label=f'μ={n*p}')
    ax.set_title(title, fontsize=11)
    ax.set_xlabel('Number of Successes')
    ax.legend()

plt.tight_layout()
plt.show()

In [None]:
# Practical example: Quality Control
# A factory produces items with 2% defect rate
# What's the probability of finding exactly 3 defects in a batch of 100?

n = 100  # Batch size
p = 0.02  # Defect rate

print("Quality Control Example")
print("=" * 50)
print(f"Batch size: {n}, Defect rate: {p*100}%")
print()

for k in range(0, 8):
    prob = binom.pmf(k, n, p)
    print(f"P(exactly {k} defects) = {prob:.4f} ({prob*100:.2f}%)")

print(f"\nP(5 or more defects) = {1 - binom.cdf(4, n, p):.4f}")
print(f"Expected defects = {n*p}")
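When n is large and p is small, as in this quality-control example, the binomial distribution is well approximated by a Poisson distribution with λ = n × p (the Poisson distribution itself appears in Section 2.4). A hedged sketch comparing the two for the same batch; `lam`, `binom_probs`, `poisson_probs`, and `max_diff` are names introduced here:

```python
import numpy as np
from scipy.stats import binom, poisson

n, p = 100, 0.02
lam = n * p  # Poisson rate matching the binomial mean

k = np.arange(0, 8)
binom_probs = binom.pmf(k, n, p)
poisson_probs = poisson.pmf(k, lam)

for ki, b, q in zip(k, binom_probs, poisson_probs):
    print(f"k={ki}: binomial={b:.4f}, poisson={q:.4f}")

max_diff = np.max(np.abs(binom_probs - poisson_probs))
print(f"\nLargest absolute difference: {max_diff:.4f}")
```

The approximation is convenient because it needs only one parameter, the expected count, and not n and p separately.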

### 2.3 Uniform Distribution

In [None]:
# Uniform: All values equally likely
# Parameters: a (minimum), b (maximum)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Continuous uniform
a, b = 0, 10
x = np.linspace(-2, 12, 1000)
y = uniform.pdf(x, loc=a, scale=b-a)

axes[0].plot(x, y, 'b-', linewidth=2)
axes[0].fill_between(x, y, alpha=0.3)
axes[0].set_title(f'Continuous Uniform [{a}, {b}]', fontsize=12)
axes[0].set_xlabel('x')
axes[0].set_ylabel('Probability Density')
axes[0].axvline((a+b)/2, color='red', linestyle='--', label=f'Mean: {(a+b)/2}')
axes[0].legend()

# Discrete uniform (rolling a die)
x = np.arange(1, 7)
prob = np.ones(6) / 6

axes[1].bar(x, prob, color='steelblue', edgecolor='black')
axes[1].set_title('Discrete Uniform: Rolling a Die', fontsize=12)
axes[1].set_xlabel('Outcome')
axes[1].set_ylabel('Probability')
axes[1].set_ylim(0, 0.25)
axes[1].axhline(1/6, color='red', linestyle='--', label=f'P = 1/6 ≈ {1/6:.3f}')
axes[1].legend()

plt.tight_layout()
plt.show()

print(f"Uniform Distribution Properties:")
print(f"  Mean = (a + b) / 2 = ({a} + {b}) / 2 = {(a+b)/2}")
print(f"  Variance = (b - a)² / 12 = {(b-a)**2/12:.2f}")

### 2.4 Other Important Distributions

In [None]:
# Comparison of distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Poisson distribution: Count of events in fixed time
lambda_param = 5
x_poisson = np.arange(0, 20)
axes[0, 0].bar(x_poisson, poisson.pmf(x_poisson, lambda_param), color='purple', edgecolor='black')
axes[0, 0].set_title(f'Poisson Distribution (λ={lambda_param})\nEvents per time interval', fontsize=11)
axes[0, 0].set_xlabel('Number of Events')
axes[0, 0].set_ylabel('Probability')

# Exponential distribution: Time between events
x_exp = np.linspace(0, 5, 100)
for rate in [0.5, 1, 2]:
    axes[0, 1].plot(x_exp, expon.pdf(x_exp, scale=1/rate), linewidth=2, label=f'λ={rate}')
axes[0, 1].set_title('Exponential Distribution\nTime between events', fontsize=11)
axes[0, 1].set_xlabel('Time')
axes[0, 1].set_ylabel('Probability Density')
axes[0, 1].legend()

# Chi-square distribution: Sum of squared standard normals
x_chi = np.linspace(0, 20, 100)
for df in [1, 2, 5, 10]:
    axes[1, 0].plot(x_chi, stats.chi2.pdf(x_chi, df), linewidth=2, label=f'df={df}')
axes[1, 0].set_title('Chi-Square Distribution\nUsed in hypothesis testing', fontsize=11)
axes[1, 0].set_xlabel('x')
axes[1, 0].set_ylabel('Probability Density')
axes[1, 0].legend()

# t-distribution: Like normal but heavier tails
x_t = np.linspace(-5, 5, 100)
axes[1, 1].plot(x_t, norm.pdf(x_t), 'k--', linewidth=2, label='Normal')
for df in [1, 5, 30]:
    axes[1, 1].plot(x_t, stats.t.pdf(x_t, df), linewidth=2, label=f't (df={df})')
axes[1, 1].set_title('t-Distribution\nUsed with small samples', fontsize=11)
axes[1, 1].set_xlabel('x')
axes[1, 1].set_ylabel('Probability Density')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

---

## 3. Correlation vs Causation

One of the most important concepts in statistics!

### 3.1 Correlation Coefficients

In [None]:
# Pearson correlation: Linear relationship (-1 to 1)
# r = 1: Perfect positive correlation
# r = 0: No linear correlation
# r = -1: Perfect negative correlation

np.random.seed(42)

# Create datasets with different correlations
n = 100
x = np.random.randn(n)

y_strong_pos = x + np.random.randn(n) * 0.3  # r ≈ 0.95
y_weak_pos = x + np.random.randn(n) * 2     # r ≈ 0.45
y_no_corr = np.random.randn(n)               # r ≈ 0
y_strong_neg = -x + np.random.randn(n) * 0.3  # r ≈ -0.95

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

datasets = [
    (x, y_strong_pos, 'Strong Positive'),
    (x, y_weak_pos, 'Weak Positive'),
    (x, y_no_corr, 'No Correlation'),
    (x, y_strong_neg, 'Strong Negative')
]

for ax, (xi, yi, title) in zip(axes.flatten(), datasets):
    ax.scatter(xi, yi, alpha=0.6)
    r, p = pearsonr(xi, yi)
    ax.set_title(f'{title}\nr = {r:.3f}, p-value = {p:.4f}', fontsize=11)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    
    # Add regression line
    m, b = np.polyfit(xi, yi, 1)
    ax.plot(xi, m*xi + b, 'r-', linewidth=2)

plt.tight_layout()
plt.show()

In [None]:
# Pearson vs Spearman correlation
# Pearson: Measures linear relationship
# Spearman: Measures monotonic relationship (based on ranks)

np.random.seed(42)
x = np.linspace(1, 10, 50)
y_linear = 2*x + np.random.randn(50) * 2
y_exponential = np.exp(x/3) + np.random.randn(50) * 5

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Linear relationship
axes[0].scatter(x, y_linear)
r_pearson, _ = pearsonr(x, y_linear)
r_spearman, _ = spearmanr(x, y_linear)
axes[0].set_title(f'Linear Relationship\nPearson: {r_pearson:.3f}, Spearman: {r_spearman:.3f}', fontsize=11)

# Exponential (non-linear) relationship
axes[1].scatter(x, y_exponential)
r_pearson, _ = pearsonr(x, y_exponential)
r_spearman, _ = spearmanr(x, y_exponential)
axes[1].set_title(f'Exponential Relationship\nPearson: {r_pearson:.3f}, Spearman: {r_spearman:.3f}', fontsize=11)

plt.tight_layout()
plt.show()

print("Key Insight:")
print("  Spearman correlation is better for non-linear monotonic relationships")
print("  Both coefficients are similar for linear relationships")
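"Based on ranks" has a concrete meaning: Spearman's coefficient is Pearson's coefficient computed on the ranks of the data. A minimal sketch that confirms this on exponential-shaped data similar to the example above (the noise level and the names `rng`, `rho`, and `r_on_ranks` are assumptions for this example):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, rankdata

rng = np.random.default_rng(42)
x = np.linspace(1, 10, 50)
y = np.exp(x / 3) + rng.normal(0, 1, 50)  # monotonic trend plus noise

# Spearman computed directly
rho, _ = spearmanr(x, y)

# Pearson applied to the ranks of x and y
r_on_ranks, _ = pearsonr(rankdata(x), rankdata(y))

print(f"Spearman rho:     {rho:.4f}")
print(f"Pearson on ranks: {r_on_ranks:.4f}")
```

Because ranks ignore the size of the gaps between values, Spearman's coefficient is unaffected by any monotonic transformation of either variable.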

### 3.2 Correlation Does NOT Imply Causation

In [None]:
# Famous spurious correlations
print("CORRELATION ≠ CAUSATION")
print("=" * 60)
print("""
Examples of Spurious Correlations:

1. Ice cream sales and drowning deaths
   - Both increase in summer (confounding variable: temperature)
   - Ice cream doesn't cause drowning!

2. Number of firefighters and fire damage
   - Bigger fires draw more firefighters and cause more damage (confounding variable: fire size)
   - Firefighters don't cause damage!

3. Shoe size and reading ability in children
   - Both increase with age (confounding variable: age)
   - Bigger feet don't cause better reading!

4. Nicolas Cage movies and pool drownings
   - r = 0.67 (actual statistic!)
   - Completely coincidental
""")

# Visualize confounding variable example
np.random.seed(42)
temperature = np.random.uniform(60, 100, 100)  # Temperature in F
ice_cream = 10 * temperature + np.random.randn(100) * 50  # Ice cream sales
drowning = 0.5 * temperature + np.random.randn(100) * 5  # Drowning incidents

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Ice cream vs Drowning
axes[0].scatter(ice_cream, drowning, alpha=0.6, c='red')
r, _ = pearsonr(ice_cream, drowning)
axes[0].set_title(f'Ice Cream vs Drowning\nr = {r:.3f}', fontsize=11)
axes[0].set_xlabel('Ice Cream Sales')
axes[0].set_ylabel('Drowning Incidents')

# Temperature vs Ice cream
axes[1].scatter(temperature, ice_cream, alpha=0.6, c='blue')
axes[1].set_title('Temperature vs Ice Cream', fontsize=11)
axes[1].set_xlabel('Temperature (F)')
axes[1].set_ylabel('Ice Cream Sales')

# Temperature vs Drowning
axes[2].scatter(temperature, drowning, alpha=0.6, c='green')
axes[2].set_title('Temperature vs Drowning', fontsize=11)
axes[2].set_xlabel('Temperature (F)')
axes[2].set_ylabel('Drowning Incidents')

plt.tight_layout()
plt.show()

print("\nTemperature is the CONFOUNDING VARIABLE!")
print("It causes both ice cream sales AND drowning incidents to increase.")
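One way to check for a confounder is to remove its linear effect from both variables and correlate what is left (a partial correlation). The sketch below regenerates data with the same structure as the simulation above, using a larger sample and a different seed; the helper `residuals` is written for this example and is not a library function.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 2000
temperature = rng.uniform(60, 100, n)
ice_cream = 10 * temperature + rng.normal(0, 50, n)
drowning = 0.5 * temperature + rng.normal(0, 5, n)

def residuals(y, x):
    """Return what is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Raw correlation, with temperature left uncontrolled
raw_r, _ = pearsonr(ice_cream, drowning)

# Correlation after removing the temperature effect from both variables
partial_r, _ = pearsonr(residuals(ice_cream, temperature),
                        residuals(drowning, temperature))

print(f"Raw correlation:                      {raw_r:.3f}")
print(f"Partial correlation (given temp):     {partial_r:.3f}")
```

In this simulation the partial correlation falls to roughly zero, which is what we expect when temperature is the only link between the two series. With real data this check only helps for confounders that were actually measured.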

---

## 4. Hypothesis Testing

Making data-driven decisions with statistical rigor.

### 4.1 Understanding Hypothesis Testing

In [None]:
print("HYPOTHESIS TESTING FRAMEWORK")
print("=" * 60)
print("""
1. FORMULATE HYPOTHESES:
   - H₀ (Null Hypothesis): No effect, no difference (status quo)
   - H₁ (Alternative Hypothesis): There IS an effect or difference

2. CHOOSE SIGNIFICANCE LEVEL (α):
   - Common values: 0.05 (5%), 0.01 (1%)
   - Probability of rejecting H₀ when it's actually true (Type I error)

3. CALCULATE TEST STATISTIC & P-VALUE:
   - P-value: Probability of observing data at least this extreme if H₀ is true

4. MAKE A DECISION:
   - If p-value < α: Reject H₀ (result is statistically significant)
   - If p-value ≥ α: Fail to reject H₀ (not enough evidence)

IMPORTANT NOTES:
   - Statistical significance ≠ Practical significance
   - "Fail to reject" ≠ "Accept" (we can't prove the null)
   - P-value is NOT the probability that H₀ is true
""")

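The meaning of α as a Type I error rate can be checked by simulation: if H₀ really is true and we test at α = 0.05, about 5% of experiments should still reject it. A minimal sketch under that assumption, with an arbitrary number of experiments and sample size (`n_experiments`, `sample_size`, and `false_positive_rate` are names chosen for this example):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(42)
alpha = 0.05
n_experiments, sample_size = 5000, 30

# Every row is one experiment drawn from a population whose true mean is 0,
# so H0: mean = 0 is true by construction
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, sample_size))

# One t-test per row
_, p_values = ttest_1samp(samples, popmean=0.0, axis=1)

# Fraction of experiments that (wrongly) reject H0
false_positive_rate = np.mean(p_values < alpha)
print(f"Rejected H0 in {false_positive_rate * 100:.1f}% of experiments (alpha = {alpha * 100:.0f}%)")
```

This is also why running many tests on the same data and reporting only the significant ones is misleading: some rejections are expected by chance alone.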
### 4.2 One-Sample t-test

In [None]:
# One-sample t-test: Compare sample mean to known value
# Example: Is the average height different from 170 cm?

np.random.seed(42)
heights = np.random.normal(172, 8, 50)  # Sample of 50 heights

hypothesized_mean = 170
t_stat, p_value = ttest_1samp(heights, hypothesized_mean)

print("One-Sample t-Test")
print("=" * 50)
print(f"H₀: μ = {hypothesized_mean} cm")
print(f"H₁: μ ≠ {hypothesized_mean} cm")
print(f"\nSample size: {len(heights)}")
print(f"Sample mean: {np.mean(heights):.2f} cm")
print(f"Sample std: {np.std(heights, ddof=1):.2f} cm")
print(f"\nt-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print(f"\n✓ REJECT H₀ (p < {alpha}): Mean height is significantly different from {hypothesized_mean} cm")
else:
    print(f"\n✗ FAIL TO REJECT H₀ (p ≥ {alpha}): No significant difference from {hypothesized_mean} cm")
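The t-statistic reported by `ttest_1samp` is the distance between the sample mean and the hypothesized mean, measured in standard errors: t = (x̄ − μ₀) / (s / √n). The sketch below computes it by hand on freshly generated heights and compares it with SciPy; the data differs from the cell above because it uses its own generator `rng`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
heights = rng.normal(172, 8, 50)
mu0 = 170  # hypothesized mean

n = len(heights)
sample_mean = heights.mean()
sample_std = heights.std(ddof=1)         # sample standard deviation
standard_error = sample_std / np.sqrt(n)

# Test statistic and two-sided p-value from the t-distribution with n - 1 df
t_manual = (sample_mean - mu0) / standard_error
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(heights, mu0)

print(f"Manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"SciPy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

Writing it out shows why larger samples make small differences detectable: the standard error shrinks with √n.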

### 4.3 Two-Sample t-test

In [None]:
# Two-sample t-test: Compare means of two groups
# Example: Did male and female passengers pay different fares?

male_fares = titanic[titanic['sex'] == 'male']['fare'].dropna()
female_fares = titanic[titanic['sex'] == 'female']['fare'].dropna()

t_stat, p_value = ttest_ind(male_fares, female_fares)

print("Two-Sample t-Test: Fare by Gender")
print("=" * 50)
print(f"H₀: μ_male = μ_female (no difference in fares)")
print(f"H₁: μ_male ≠ μ_female (fares are different)")
print(f"\nMale passengers:")
print(f"  n = {len(male_fares)}, mean = ${np.mean(male_fares):.2f}, std = ${np.std(male_fares, ddof=1):.2f}")
print(f"\nFemale passengers:")
print(f"  n = {len(female_fares)}, mean = ${np.mean(female_fares):.2f}, std = ${np.std(female_fares, ddof=1):.2f}")
print(f"\nt-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.6f}")

alpha = 0.05
if p_value < alpha:
    print(f"\n✓ REJECT H₀: Significant difference in fares between genders")
else:
    print(f"\n✗ FAIL TO REJECT H₀: No significant difference")
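By default `ttest_ind` assumes both groups have the same variance (`equal_var=True`). When the spreads or group sizes differ noticeably, Welch's t-test (`equal_var=False`) is a common, safer alternative. The sketch below uses synthetic groups with unequal variances and sizes; the numbers are made up for illustration and are not the Titanic fares.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
group_a = rng.normal(25, 40, 500)  # synthetic: lower mean, smaller spread, larger group
group_b = rng.normal(45, 60, 300)  # synthetic: higher mean, larger spread, smaller group

# Student's t-test (pooled variance) vs Welch's t-test (separate variances)
t_student, p_student = ttest_ind(group_a, group_b)
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

# Welch's statistic by hand: difference in means over the unpooled standard error
se = np.sqrt(group_a.var(ddof=1) / len(group_a) + group_b.var(ddof=1) / len(group_b))
t_manual = (group_a.mean() - group_b.mean()) / se

print(f"Student: t = {t_student:.3f}, p = {p_student:.2e}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.2e}")
print(f"Manual Welch t = {t_manual:.3f}")
```

When the two groups do have similar variances, both tests give nearly the same answer, so choosing Welch's version costs very little.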

In [None]:
# Visualize the comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot
sns.boxplot(data=titanic, x='sex', y='fare', ax=axes[0])
axes[0].set_title('Fare Distribution by Gender', fontsize=12)

# Histogram
axes[1].hist(male_fares, bins=30, alpha=0.5, label='Male', color='blue')
axes[1].hist(female_fares, bins=30, alpha=0.5, label='Female', color='red')
axes[1].axvline(male_fares.mean(), color='blue', linestyle='--', linewidth=2)
axes[1].axvline(female_fares.mean(), color='red', linestyle='--', linewidth=2)
axes[1].set_title('Fare Distributions Overlaid', fontsize=12)
axes[1].set_xlabel('Fare')
axes[1].legend()

plt.tight_layout()
plt.show()

### 4.4 Chi-Square Test

In [None]:
# Chi-square test: Test independence between categorical variables
# Example: Is survival independent of gender?

contingency_table = pd.crosstab(titanic['sex'], titanic['survived'])
print("Contingency Table: Sex vs Survived")
print(contingency_table)

chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"\nChi-Square Test")
print("=" * 50)
print(f"H₀: Sex and survival are independent")
print(f"H₁: Sex and survival are NOT independent")
print(f"\nChi-square statistic: {chi2:.2f}")
print(f"Degrees of freedom: {dof}")
print(f"p-value: {p_value:.2e}")

print(f"\nExpected frequencies (if independent):")
print(pd.DataFrame(expected, 
                   index=contingency_table.index, 
                   columns=contingency_table.columns).round(1))

alpha = 0.05
if p_value < alpha:
    print(f"\n✓ REJECT H₀: Sex and survival are NOT independent (survival rates differ by sex)")
else:
    print(f"\n✗ FAIL TO REJECT H₀: Cannot conclude they are dependent")
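The "expected frequencies" are what each cell would contain if the two variables were independent: row total × column total / grand total. The sketch below computes them, and the chi-square statistic, by hand for a hypothetical 2×2 table (the counts in `observed` are invented for illustration). It also shows that `chi2_contingency` applies Yates' continuity correction by default for 2×2 tables, which is why a hand calculation may not match its default output.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of counts (rows: groups, columns: outcomes)
observed = np.array([[30, 70],
                     [60, 40]])

# Expected counts under independence
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2_plain, p_plain, dof, expected = chi2_contingency(observed, correction=False)
chi2_yates, p_yates, _, _ = chi2_contingency(observed)  # default: Yates' correction for 2x2

print("Expected counts:\n", expected_manual)
print(f"Manual chi-square:           {chi2_manual:.3f}")
print(f"SciPy (correction=False):    {chi2_plain:.3f}")
print(f"SciPy (default, with Yates): {chi2_yates:.3f}")
```

The correction makes the test slightly more conservative; it matters most when counts are small.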

In [None]:
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Observed counts
contingency_table.plot(kind='bar', ax=axes[0], color=['#e74c3c', '#2ecc71'])
axes[0].set_title('Observed Counts', fontsize=12)
axes[0].set_xlabel('Sex')
axes[0].set_ylabel('Count')
axes[0].legend(['Did not survive', 'Survived'])
axes[0].tick_params(axis='x', rotation=0)

# Survival rate
survival_rate = titanic.groupby('sex')['survived'].mean() * 100
bars = axes[1].bar(survival_rate.index, survival_rate.values, color=['#3498db', '#e74c3c'])
for bar, rate in zip(bars, survival_rate.values):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                 f'{rate:.1f}%', ha='center', fontsize=12, fontweight='bold')
axes[1].set_title('Survival Rate by Gender', fontsize=12)
axes[1].set_ylabel('Survival Rate (%)')
axes[1].set_ylim(0, 100)

plt.tight_layout()
plt.show()

---

## 5. Assignment: Correlation Analysis and Hypothesis Testing

In [None]:
# ASSIGNMENT Part 1: Analyze correlations in the Tips dataset

print("=" * 70)
print("ASSIGNMENT PART 1: CORRELATION ANALYSIS (TIPS DATASET)")
print("=" * 70)

# Calculate correlations
numeric_cols = tips.select_dtypes(include=[np.number]).columns
correlation_matrix = tips[numeric_cols].corr()

print("\nCorrelation Matrix:")
print(correlation_matrix.round(3))

In [None]:
# Visualize correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            fmt='.3f', square=True, linewidths=0.5)
plt.title('Tips Dataset - Correlation Heatmap', fontsize=14, fontweight='bold')
plt.show()

In [None]:
# Analyze tip vs total_bill correlation with significance test
r, p = pearsonr(tips['total_bill'], tips['tip'])

print("\nCorrelation: Tip vs Total Bill")
print("=" * 50)
print(f"Pearson correlation coefficient: {r:.4f}")
print(f"p-value: {p:.2e}")

if p < 0.05:
    print("\n✓ Correlation is statistically significant (p < 0.05)")
    
# Interpret strength
if abs(r) >= 0.7:
    strength = "strong"
elif abs(r) >= 0.4:
    strength = "moderate"
else:
    strength = "weak"
    
direction = "positive" if r > 0 else "negative"
print(f"Interpretation: {strength} {direction} correlation")
print(f"\nAs total bill increases, tip tends to increase as well.")

In [None]:
# Scatter plot with regression
plt.figure(figsize=(10, 6))
sns.regplot(data=tips, x='total_bill', y='tip', 
            scatter_kws={'alpha': 0.5}, line_kws={'color': 'red'})
plt.title(f'Tip vs Total Bill (r = {r:.3f})', fontsize=14, fontweight='bold')
plt.xlabel('Total Bill ($)')
plt.ylabel('Tip ($)')
plt.show()

In [None]:
# ASSIGNMENT Part 2: Hypothesis Testing

print("\n" + "=" * 70)
print("ASSIGNMENT PART 2: HYPOTHESIS TESTING")
print("=" * 70)

# Test 1: Do smokers tip differently than non-smokers?
print("\nTest 1: Do smokers tip differently than non-smokers?")
print("-" * 50)

smoker_tips = tips[tips['smoker'] == 'Yes']['tip']
non_smoker_tips = tips[tips['smoker'] == 'No']['tip']

t_stat, p_value = ttest_ind(smoker_tips, non_smoker_tips)

print(f"H₀: μ_smoker = μ_non_smoker")
print(f"H₁: μ_smoker ≠ μ_non_smoker")
print(f"\nSmokers: n={len(smoker_tips)}, mean=${smoker_tips.mean():.2f}")
print(f"Non-smokers: n={len(non_smoker_tips)}, mean=${non_smoker_tips.mean():.2f}")
print(f"\nt-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("\n✓ REJECT H₀: Smokers tip significantly differently")
else:
    print("\n✗ FAIL TO REJECT H₀: No significant difference in tipping")

In [None]:
# Test 2: Is there a relationship between smoking and time of day?
print("\nTest 2: Is smoking related to time of day (lunch vs dinner)?")
print("-" * 50)

contingency = pd.crosstab(tips['smoker'], tips['time'])
print("Contingency Table:")
print(contingency)

chi2, p_value, dof, expected = chi2_contingency(contingency)

print(f"\nH₀: Smoking and meal time are independent")
print(f"H₁: Smoking and meal time are NOT independent")
print(f"\nChi-square: {chi2:.3f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("\n✓ REJECT H₀: Smoking is related to meal time")
else:
    print("\n✗ FAIL TO REJECT H₀: No significant relationship")

In [None]:
# Test 3: Do tips differ by day of week?
print("\nTest 3: Do tips differ by day of week?")
print("-" * 50)

# One-way ANOVA (F-test)
from scipy.stats import f_oneway

thur_tips = tips[tips['day'] == 'Thur']['tip']
fri_tips = tips[tips['day'] == 'Fri']['tip']
sat_tips = tips[tips['day'] == 'Sat']['tip']
sun_tips = tips[tips['day'] == 'Sun']['tip']

f_stat, p_value = f_oneway(thur_tips, fri_tips, sat_tips, sun_tips)

print(f"H₀: All days have equal mean tips")
print(f"H₁: At least one day has a different mean tip")

print(f"\nMean tips by day:")
for day in ['Thur', 'Fri', 'Sat', 'Sun']:
    mean_tip = tips[tips['day'] == day]['tip'].mean()
    print(f"  {day}: ${mean_tip:.2f}")

print(f"\nF-statistic: {f_stat:.3f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("\n✓ REJECT H₀: Tips differ significantly by day")
else:
    print("\n✗ FAIL TO REJECT H₀: No significant difference by day")
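The F-statistic compares how much group means vary (between-group variance) with how much values vary inside each group (within-group variance). The sketch below computes it by hand for three synthetic groups and checks the result against `f_oneway`; the group parameters are invented for illustration and are not taken from the tips data.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)
# Three synthetic groups with slightly different means
groups = [rng.normal(3.0, 1.2, 60),
          rng.normal(2.8, 1.2, 40),
          rng.normal(3.3, 1.2, 80)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)      # number of groups
N = len(all_values)  # total number of observations

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (N - k))

f_scipy, p_scipy = f_oneway(*groups)

print(f"Manual F: {f_manual:.4f}")
print(f"SciPy  F: {f_scipy:.4f} (p = {p_scipy:.4f})")
```

A significant F only says that at least one mean differs. Finding which groups differ takes a follow-up (post-hoc) comparison.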

In [None]:
# Visualize all tests
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Test 1: Smoker vs Non-smoker tips
sns.boxplot(data=tips, x='smoker', y='tip', ax=axes[0])
axes[0].set_title('Tips by Smoker Status', fontsize=12)

# Test 2: Smoking by time
contingency.plot(kind='bar', ax=axes[1])
axes[1].set_title('Smoking by Meal Time', fontsize=12)
axes[1].tick_params(axis='x', rotation=0)

# Test 3: Tips by day
sns.boxplot(data=tips, x='day', y='tip', order=['Thur', 'Fri', 'Sat', 'Sun'], ax=axes[2])
axes[2].set_title('Tips by Day of Week', fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
# Assignment Summary
print("\n" + "=" * 70)
print("ASSIGNMENT SUMMARY")
print("=" * 70)
print("""
CORRELATION ANALYSIS:
• Total bill and tip have a moderate positive correlation (r ≈ 0.68)
• This is statistically significant (p < 0.001)
• Larger bills tend to come with larger tips

HYPOTHESIS TESTING RESULTS:

1. Smokers vs Non-smokers (t-test):
   • No significant difference in tipping behavior
   • Cannot conclude smoking affects tip amount

2. Smoking and Meal Time (Chi-square):
   • No significant relationship at α = 0.05
   • Cannot conclude that smoking patterns differ between lunch and dinner

3. Tips by Day (ANOVA):
   • No significant difference between days
   • Cannot conclude that mean tips differ by day of week

KEY TAKEAWAYS:
• Always check statistical significance before drawing conclusions
• Correlation doesn't imply causation
• Use appropriate tests for different data types
""")

---

## Summary

Today you learned:

### Descriptive Statistics
- **Central Tendency**: Mean (average), Median (middle), Mode (most frequent)
- **Dispersion**: Range, Variance, Standard Deviation, IQR
- **Shape**: Skewness, Kurtosis

### Probability Distributions
- **Normal**: Bell curve, 68-95-99.7 rule, z-scores
- **Binomial**: Successes in n trials
- **Uniform**: All outcomes equally likely
- **Others**: Poisson, Exponential, Chi-square, t-distribution

### Correlation
- **Pearson**: Linear relationships (-1 to 1)
- **Spearman**: Monotonic relationships (based on ranks)
- **Correlation ≠ Causation**: Confounding variables!

### Hypothesis Testing
- **Framework**: H₀, H₁, α, p-value, decision
- **t-tests**: Compare means (one-sample, two-sample)
- **Chi-square**: Test independence of categorical variables
- **ANOVA**: Compare multiple group means

---

## Week 1 Complete!

Congratulations! You've completed Week 1: Foundations & Data Mastery!

You now have a solid foundation in:
- ML concepts and environment setup
- NumPy for numerical computing
- Pandas for data manipulation
- Visualization with Matplotlib and Seaborn
- Statistics for machine learning

**Next Week**: We'll start building actual ML models!

---

**Great job completing Day 5 and Week 1!**