# 📘 02 - Inferential Statistics

🔹 **Objective**: Introduce techniques to infer population-level conclusions based on sample data.

## 🔍 What is Inferential Statistics?
Inferential statistics allows us to make predictions or inferences about a population based on a sample of data.

### Descriptive vs Inferential Statistics
- **Descriptive Statistics** summarizes the features of a dataset.
- **Inferential Statistics** makes generalizations about a population using data from a sample.

### Importance in Data-Driven Decision Making
Companies and researchers rely on inferential methods to:
- Make product decisions based on surveys
- Test medical hypotheses with clinical trials
- Predict elections based on polls

## 🧪 Sampling
### Sampling Methods
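- **Simple Random Sampling**: Every member of the population has an equal chance of being selected
- **Stratified Sampling**: Divide the population into non-overlapping strata (subgroups) and sample from each
- **Systematic Sampling**: Select every nth item from an ordered list, starting from a random position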
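
As a rough illustration of the three methods, the sketch below draws index sets from a small simulated population. The population, the two group labels `A` and `B`, the 10% sampling fraction, and the step `k = 10` are all assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 1,000 values, each with a group label (assumed 80/20 split)
values = rng.normal(loc=50, scale=10, size=1000)
groups = np.repeat(["A", "B"], [800, 200])

# Simple random sampling: every item has the same chance of selection
simple_idx = rng.choice(len(values), size=100, replace=False)

# Stratified sampling: draw 10% from each stratum separately
stratified_idx = np.concatenate([
    rng.choice(np.flatnonzero(groups == g), size=np.sum(groups == g) // 10, replace=False)
    for g in ("A", "B")
])

# Systematic sampling: random start, then every k-th item
k = 10
start = rng.integers(k)
systematic_idx = np.arange(start, len(values), k)

print("Simple random mean:", values[simple_idx].mean())
print("Stratified mean:   ", values[stratified_idx].mean())
print("Systematic mean:   ", values[systematic_idx].mean())
```

Stratified sampling guarantees that each group appears in the sample in proportion to its size, which simple random sampling only achieves on average.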

### Sampling Bias
Sampling bias occurs when some members of the population are systematically less likely to be included in the sample than others, so the sample no longer represents the population.
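
One way to see the effect is to compare a fair sample with a deliberately biased one. In this sketch the selection mechanism (larger values are more likely to be picked) is a hypothetical choice for illustration, not a model of any particular study.

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(loc=50, scale=10, size=10_000)

# Fair sample: every member is equally likely to be selected
fair_sample = rng.choice(population, size=500, replace=False)

# Biased sample (assumed mechanism): selection probability grows with the value
weights = population - population.min() + 1e-9
weights = weights / weights.sum()
biased_sample = rng.choice(population, size=500, replace=False, p=weights)

print("Population mean:   ", population.mean())
print("Fair sample mean:  ", fair_sample.mean())
print("Biased sample mean:", biased_sample.mean())
```

The biased sample tends to overestimate the population mean, and collecting more data with the same mechanism does not remove the error.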

In [None]:
import numpy as np
import pandas as pd
np.random.seed(42)

# Simulate a population
population = np.random.normal(loc=50, scale=10, size=10000)

# Simple random sample
sample = np.random.choice(population, size=100, replace=False)

print("Population Mean:", np.mean(population))
print("Sample Mean:", np.mean(sample))

## 📊 Central Limit Theorem (CLT)
The CLT states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the shape of the population's distribution, provided the observations are independent and the population has finite variance.

In [None]:
import matplotlib.pyplot as plt

sample_means = []
for _ in range(1000):
    # Use a separate name so the earlier `sample` of size 100 is not overwritten
    resample = np.random.choice(population, size=50, replace=True)
    sample_means.append(np.mean(resample))

plt.hist(sample_means, bins=30, edgecolor='black')
plt.title("Sampling Distribution of the Mean")
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")
plt.show()
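
Because the simulated population in this notebook is already normal, the histogram cannot show the CLT at work on a non-normal shape. The sketch below repeats the idea with a skewed population and tracks the skewness of the sample means as the sample size grows. The exponential population, the sample sizes, and the helper `sample_mean_skewness` are assumptions introduced for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical skewed population: exponential with mean 1
skewed_population = rng.exponential(scale=1.0, size=100_000)

def sample_mean_skewness(n, n_repeats=2000):
    """Skewness of the distribution of sample means for samples of size n."""
    draws = rng.choice(skewed_population, size=(n_repeats, n), replace=True)
    return stats.skew(draws.mean(axis=1))

# Skewness near 0 indicates a roughly symmetric, normal-like shape
skew_by_n = {n: sample_mean_skewness(n) for n in (2, 10, 50, 200)}
for n, s in skew_by_n.items():
    print(f"n={n:>3}: skewness of sample means = {s:.3f}")
```

The skewness should shrink toward zero as `n` increases, which is the behavior the CLT predicts.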

## 📐 Estimators
- **Point Estimate**: A single value estimate of a population parameter (e.g., sample mean).
- **Interval Estimate**: A range likely to contain the parameter (e.g., confidence interval).

### Standard Error
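The standard error is the standard deviation of the sampling distribution of the mean. From a single sample it is estimated as s / √n, where s is the sample standard deviation and n is the sample size.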

In [None]:
# ddof=1 gives the sample standard deviation, matching stats.sem
std_error = np.std(sample, ddof=1) / np.sqrt(len(sample))
print("Standard Error:", std_error)
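
As a sanity check on the definition, the formula-based estimate from one sample can be compared with the standard deviation of many simulated sample means. This is a minimal sketch; the population parameters, the sample size of 50, and the 5,000 repetitions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
population = rng.normal(loc=50, scale=10, size=10_000)
n = 50

# Formula-based estimate from a single sample: s / sqrt(n), with ddof=1
one_sample = rng.choice(population, size=n, replace=True)
se_formula = one_sample.std(ddof=1) / np.sqrt(n)

# Empirical value: standard deviation of many sample means
means = rng.choice(population, size=(5000, n), replace=True).mean(axis=1)
se_empirical = means.std(ddof=1)

print("SE from formula (one sample):", se_formula)
print("SE from simulation:          ", se_empirical)
```

The two numbers should be close but not identical, since the formula-based value depends on the particular sample drawn.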

## 📏 Confidence Intervals
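A confidence interval gives a range of plausible values for the population parameter, computed from the sample.

- **95% CI**: If the sampling were repeated many times, about 95% of the intervals built this way would contain the true mean
- **99% CI**: Wider interval; about 99% of such intervals would contain the true mean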

In [None]:
import scipy.stats as stats

mean = np.mean(sample)
se = stats.sem(sample)
ci_95 = stats.t.interval(0.95, df=len(sample)-1, loc=mean, scale=se)
ci_99 = stats.t.interval(0.99, df=len(sample)-1, loc=mean, scale=se)

print("95% Confidence Interval:", ci_95)
print("99% Confidence Interval:", ci_99)
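
To make the calculation less of a black box, the interval can also be built by hand as mean ± t* × SE, where t* is the critical value of the t distribution. The sketch below uses a freshly simulated sample and a hypothetical helper `t_interval`, then checks the result against `stats.t.interval`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
data = rng.normal(loc=50, scale=10, size=100)  # hypothetical sample

n = len(data)
mean = data.mean()
se = data.std(ddof=1) / np.sqrt(n)

def t_interval(confidence):
    """Two-sided t interval: mean +/- t* x SE."""
    # t* leaves (1 - confidence) / 2 probability in each tail
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    return mean - t_crit * se, mean + t_crit * se

manual_95 = t_interval(0.95)
manual_99 = t_interval(0.99)
scipy_95 = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)

print("Manual 95% CI:", manual_95)
print("SciPy 95% CI: ", scipy_95)
print("Manual 99% CI:", manual_99)
```

The 99% interval is wider because its critical value t* is larger for the same standard error.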

In [None]:
plt.figure(figsize=(8, 4))
plt.axvline(ci_95[0], color='red', linestyle='--', label='95% CI lower')
plt.axvline(ci_95[1], color='red', linestyle='--', label='95% CI upper')
plt.axvline(mean, color='green', label='Sample Mean')
plt.title("95% Confidence Interval")
plt.legend()
plt.show()

## 🧪 Basics of Hypothesis Testing (Preview)
- **Null Hypothesis (H0)**: No effect or difference
- **Alternative Hypothesis (H1)**: There is an effect or difference
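- **Type I Error**: Rejecting H0 when it is true; its probability is the significance level α
- **Type II Error**: Failing to reject H0 when it is false; its probability is β, and 1 − β is the power of the test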
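
The two error rates can be estimated by simulation. In this sketch, the first batch of samples is generated with H0 true and the second with H0 false. The sample size of 30, the true mean of 55 in the second batch, and the 2,000 trials are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
alpha = 0.05
n_trials = 2000

# H0 is true: every sample comes from a population with mean 50
samples_null = rng.normal(loc=50, scale=10, size=(n_trials, 30))
p_null = stats.ttest_1samp(samples_null, popmean=50, axis=1).pvalue
type_i_rate = np.mean(p_null < alpha)  # rejected although H0 is true

# H0 is false: the true mean is 55 (hypothetical effect)
samples_alt = rng.normal(loc=55, scale=10, size=(n_trials, 30))
p_alt = stats.ttest_1samp(samples_alt, popmean=50, axis=1).pvalue
type_ii_rate = np.mean(p_alt >= alpha)  # not rejected although H0 is false

print("Estimated Type I error rate: ", type_i_rate)
print("Estimated Type II error rate:", type_ii_rate)
```

The Type I rate should land near α, while the Type II rate depends on the effect size and the sample size.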

### p-value Intuition
The p-value is the probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true.

> Low p-value (below the chosen significance level α, conventionally 0.05) → reject H0
>
> High p-value → fail to reject H0 (this is not evidence that H0 is true)
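
For a concrete case, the sketch below computes a two-sided p-value for a one-sample t-test by hand and compares it with `scipy.stats.ttest_1samp`. The simulated data and the null value `mu_0 = 50` are assumptions for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
data = rng.normal(loc=52, scale=10, size=40)  # hypothetical sample
mu_0 = 50  # population mean assumed under H0

# Test statistic: how many standard errors the sample mean lies from mu_0
n = len(data)
t_stat = (data.mean() - mu_0) / (data.std(ddof=1) / np.sqrt(n))

# Two-sided p-value: probability of a |t| at least this large if H0 is true
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 1)

result = stats.ttest_1samp(data, popmean=mu_0)

print("t statistic:", t_stat)
print("Manual p-value:", p_manual)
print("SciPy p-value: ", result.pvalue)
```

Both p-values should agree, since `ttest_1samp` performs the same two-sided calculation by default.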

## 🧪 Practice Exercises
1. Simulate 100 samples of size 30 from a population and plot the sampling distribution of the sample mean.
2. Compute 95% and 99% confidence intervals for each sample.
3. Estimate the proportion of these confidence intervals that do not contain the population mean, and compare it with the nominal 5% and 1%.