# Day 3 — Foundations of Statistics & Probability

**Duration:** 2 hours

**Objectives:**
- Refresh descriptive statistics and probability
- Explore key distributions
- Perform simple inferential tests

## 1. Descriptive statistics recap

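Key measures: mean, median, mode, variance, standard deviation and interquartile range (IQR). We'll compute summary statistics on the `tips` dataset from seaborn. Note that `describe()` reports the count, mean, standard deviation, min, quartiles and max, so the median appears as the 50% row, while the mode, variance and IQR are not shown directly.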

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')

print('Columns:', tips.columns.tolist())
print('\nFirst rows:')
print(tips.head())

print('\nSummary stats for total_bill:')
print(tips['total_bill'].describe())
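
`describe()` leaves out the mode, variance and IQR. A minimal sketch of how they could be computed with pandas is shown here. The helper name `spread_summary` and the small `demo_bills` series are illustrative assumptions, not part of the original notebook; the same function could be applied to `tips['total_bill']`.

```python
import pandas as pd

def spread_summary(s: pd.Series) -> dict:
    """Return common descriptive measures for a numeric Series (hypothetical helper)."""
    s = s.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return {
        'mean': s.mean(),
        'median': s.median(),
        'mode': s.mode().iloc[0],   # mode() can return several values; take the first
        'variance': s.var(),        # sample variance (ddof=1 by default in pandas)
        'std': s.std(),             # sample standard deviation (ddof=1)
        'iqr': q3 - q1,             # interquartile range
    }

# Small made-up sample so the numbers are easy to check by hand
demo_bills = pd.Series([10.0, 12.0, 12.0, 15.0, 18.0, 20.0, 25.0])
demo_summary = spread_summary(demo_bills)
for name, value in demo_summary.items():
    print(f'{name}: {value:.3f}')
```

pandas uses the sample (n-1) denominator for `var()` and `std()`, whereas NumPy's `np.var` and `np.std` default to the population (n) denominator, so the two can differ slightly on the same data.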

## 2. Visualizing distributions

Histograms and boxplots help us understand shape, spread and outliers.

In [None]:
# Histogram and boxplot for total_bill
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
plt.hist(tips['total_bill'], bins=20)
plt.title('Histogram of total_bill')
plt.xlabel('total_bill')

plt.subplot(1,2,2)
plt.boxplot(tips['total_bill'].dropna())
plt.title('Boxplot of total_bill')
plt.show()
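
The points a boxplot draws beyond its whiskers are, with matplotlib's default `whis=1.5`, the values lying more than 1.5 × IQR outside the quartiles. The sketch below reproduces that rule numerically. The helper `iqr_outliers` and the `demo_values` list are hypothetical, chosen only for illustration.

```python
import numpy as np

def iqr_outliers(values, whis=1.5):
    """Return the lower fence, upper fence and the values outside them (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    mask = (values < lower) | (values > upper)
    return lower, upper, values[mask]

# Made-up data with one obviously extreme value
demo_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
lower_fence, upper_fence, flagged = iqr_outliers(demo_values)
print('Fences:', lower_fence, upper_fence)
print('Flagged as outliers:', flagged)
```

Passing `tips['total_bill'].dropna()` instead of `demo_values` should list the same points that appear as individual markers in the boxplot above.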

## 3. Probability basics & distributions

We'll demonstrate Normal, Binomial and Poisson distributions using `scipy.stats`.

In [None]:
import numpy as np
from scipy.stats import norm, binom, poisson

# Normal: mean=0, sd=1
x = np.linspace(-4,4,200)
plt.plot(x, norm.pdf(x))
plt.title('Standard Normal Distribution')
plt.show()

# Binomial example: n=10, p=0.5
k = np.arange(0,11)
pmf = binom.pmf(k, 10, 0.5)
plt.bar(k, pmf)
plt.title('Binomial PMF (n=10, p=0.5)')
plt.xlabel('k')
plt.show()

# Poisson example: lambda=3
k = np.arange(0,15)
pmf_p = poisson.pmf(k, 3)
plt.bar(k, pmf_p)
plt.title('Poisson PMF (lambda=3)')
plt.show()
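
Beyond plotting, the same `scipy.stats` objects can answer concrete probability questions through `pmf` and `cdf`. The three questions below are illustrative choices, not taken from the original notebook.

```python
from scipy.stats import norm, binom, poisson

# Normal: probability that a standard normal value falls within 2 sd of the mean
p_within_2sd = norm.cdf(2) - norm.cdf(-2)

# Binomial: probability of exactly 5 successes in 10 trials with p=0.5
p_five_of_ten = binom.pmf(5, 10, 0.5)

# Poisson: probability of at most 2 events when the average rate is 3
p_at_most_two = poisson.cdf(2, 3)

print('P(-2 < Z < 2)          =', round(p_within_2sd, 4))
print('P(X = 5), Binomial     =', round(p_five_of_ten, 4))
print('P(X <= 2), Poisson(3)  =', round(p_at_most_two, 4))
```

`pmf` gives the probability of one exact count for a discrete distribution, while `cdf` accumulates probability up to and including a value.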

## 4. Correlation vs Causation

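Compute the Pearson correlation matrix for the numeric columns and visualize it with a heatmap. A strong correlation shows that two variables move together; it does not by itself show that one causes the other, because a third variable may drive both.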

In [None]:
corr = tips.corr(numeric_only=True)
print(corr)

sns.heatmap(corr, annot=True)
plt.title('Correlation matrix (tips)')
plt.show()
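
To see why correlation alone does not establish causation, here is a small simulation on synthetic data (not the `tips` dataset). A hidden variable `z` drives both `x` and `y`, which never influence each other. The variable names, the seed and the helper `residuals` are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = 5000

z = rng.normal(size=n_obs)        # hidden common cause (confounder)
x = z + rng.normal(size=n_obs)    # x depends on z only
y = z + rng.normal(size=n_obs)    # y depends on z only, never on x

# x and y are clearly correlated even though neither causes the other
r_xy = np.corrcoef(x, y)[0, 1]

def residuals(target, predictor):
    """Remove the linear effect of predictor from target (hypothetical helper)."""
    slope, intercept = np.polyfit(predictor, target, 1)
    return target - (slope * predictor + intercept)

# After removing the part explained by z, the correlation largely disappears
r_given_z = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

print('corr(x, y)             =', round(r_xy, 3))
print('corr(x, y) controlling z =', round(r_given_z, 3))
```

In real data the confounder is usually unobserved, which is why a heatmap can suggest relationships worth investigating but cannot confirm a causal direction.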

## 5. Inferential statistics: Confidence intervals & Hypothesis testing

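We'll compute a 95% confidence interval for the mean tip and run a two-sample t-test comparing male vs female tips. The test uses `equal_var=False`, which is Welch's t-test and does not assume the two groups have equal variances.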

In [None]:
import numpy as np
from scipy import stats

# 95% CI for mean tip
tips_arr = tips['tip'].dropna().values
n = len(tips_arr)
mean = tips_arr.mean()
se = stats.sem(tips_arr)
ci = stats.t.interval(0.95, n-1, loc=mean, scale=se)
print('Mean tip:', mean)
print('95% CI for mean tip:', ci)

# Welch's t-test (equal_var=False): male vs female tips
male = tips[tips['sex']=='Male']['tip']
female = tips[tips['sex']=='Female']['tip']
stat, pvalue = stats.ttest_ind(male, female, equal_var=False)
print('\nT-test male vs female tips: t=', stat, ', p=', pvalue)
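
The confidence interval above is the sample mean plus or minus a critical t value times the standard error. As a sketch, the same interval can be built by hand and compared with `stats.t.interval`. The `demo_tips` array and the `demo_`-prefixed names are made up for illustration so they do not overwrite the notebook's variables.

```python
import numpy as np
from scipy import stats

# Small made-up sample
demo_tips = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0])

demo_n = len(demo_tips)
demo_mean = demo_tips.mean()
demo_se = demo_tips.std(ddof=1) / np.sqrt(demo_n)   # standard error of the mean

# Two-sided 95% interval: 2.5% in each tail, n-1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=demo_n - 1)
ci_manual = (demo_mean - t_crit * demo_se, demo_mean + t_crit * demo_se)

# Same interval from SciPy's helper
ci_scipy = stats.t.interval(0.95, demo_n - 1, loc=demo_mean, scale=demo_se)

print('Manual CI:', ci_manual)
print('SciPy CI: ', ci_scipy)
```

`stats.sem` uses the same `ddof=1` standard deviation divided by the square root of n, so both routes should agree up to floating-point rounding.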

## 6. Interpreting p-values

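The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming the null hypothesis (here, equal mean tips in both groups) is true. If p < 0.05, reject the null hypothesis at alpha = 0.05; otherwise, fail to reject it, which is not the same as proving the null hypothesis. Check the p-value from the t-test above and discuss.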
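
To make the definition concrete, the sketch below computes a Welch t statistic and its two-sided p-value by hand on two small made-up groups, then checks the result against `stats.ttest_ind`. The arrays `group_a` and `group_b` are illustrative assumptions, not data from `tips`.

```python
import numpy as np
from scipy import stats

# Two small made-up samples
group_a = np.array([2.1, 3.4, 1.9, 4.0, 3.3, 2.8, 3.9, 2.5])
group_b = np.array([3.0, 4.2, 3.8, 4.9, 3.6, 4.4, 5.1, 3.9])

# Per-group variance of the mean
va = group_a.var(ddof=1) / len(group_a)
vb = group_b.var(ddof=1) / len(group_b)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(va + vb)
df_welch = (va + vb) ** 2 / (va ** 2 / (len(group_a) - 1) + vb ** 2 / (len(group_b) - 1))

# Two-sided p-value: probability of a |t| at least this large under the null
p_manual = 2 * stats.t.sf(abs(t_manual), df_welch)

# Reference result from SciPy
t_ref, p_ref = stats.ttest_ind(group_a, group_b, equal_var=False)

alpha = 0.05
print('t =', t_manual, ', p =', p_manual)
print('Reject the null at alpha=0.05?', p_manual < alpha)
```

The decision depends on the chosen alpha, and a small p-value says nothing about how large or practically important the difference is.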

## 7. Bias vs Variance (conceptual)

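- Bias: error due to wrong or overly simple model assumptions (underfitting)
- Variance: error due to sensitivity to the particular training data (overfitting)

Balancing the two is central to ML model selection: more flexible models tend to lower bias but raise variance.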
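
One way to see the trade-off is to fit polynomials of increasing degree to noisy data and compare training error with error on held-out data. This is a minimal sketch on synthetic data. The sine target, noise level, seed, sample sizes and degrees are all assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def make_data(n_points):
    """Noisy samples from one period of a sine curve (synthetic data)."""
    x = rng.uniform(0, 1, n_points)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n_points)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

train_mse, test_mse = {}, {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit on the training data only
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse[degree] = np.mean((model(x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((model(x_test) - y_test) ** 2)
    print(f'degree={degree}: train MSE={train_mse[degree]:.3f}, '
          f'test MSE={test_mse[degree]:.3f}')
```

Training error can only go down as the degree grows, since each higher-degree model contains the lower-degree ones. The straight line (degree 1) is too rigid for a sine curve, which is the high-bias case. Whether the degree-9 fit does worse on the test set than degree 3 depends on the particular random sample, which is itself what high variance means.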

## 8. Exercise (in-notebook)

1. Compute a 95% CI for the mean of `total_bill`.
2. Test whether the average `total_bill` differs significantly between smokers and non-smokers (t-test).
3. (Optional) Visualize distributions by `smoker` status.

In [None]:
# Exercise starters
# 1) 95% CI for total_bill
tb = tips['total_bill'].dropna()
ci_tb = stats.t.interval(0.95, len(tb)-1, loc=tb.mean(), scale=stats.sem(tb))
print('Total bill mean:', tb.mean(), '95% CI:', ci_tb)

# 2) t-test smoker vs non-smoker
t_smoker = tips[tips['smoker']=='Yes']['total_bill']
t_non = tips[tips['smoker']=='No']['total_bill']
stat2, p2 = stats.ttest_ind(t_smoker, t_non, equal_var=False)
print('\nT-test smoker vs non-smoker: t=', stat2, ', p=', p2)

# 3) Visualization
sns.boxplot(x='smoker', y='total_bill', data=tips)
plt.title('Total bill by smoker status')
plt.show()

## 9. Wrap-up & Reading

Suggested: *Practical Statistics for Data Scientists* (Bruce & Bruce). Tomorrow: Day 4 — Machine Learning Foundations.