# `Advanced Statistics Notes`

> # `Covariance | Covariance vs Causation`


<details>
<summary>Click to expand</summary>

# **Covariance**

## **1. Definition**

* **Covariance** measures how **two numerical variables change together**.
* If both variables increase/decrease together → **positive covariance**.
* If one increases while the other decreases → **negative covariance**.
* If no consistent relationship → covariance ≈ 0.

**Formula (for two variables X and Y):**

$$
\text{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}
$$
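
As a minimal sketch of the formula above, the function below computes the sample covariance by hand. The data values and the name `sample_covariance` are made up for illustration.

```python
def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 denominator used in the formula above."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from each mean, divided by n - 1.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data: study hours and exam scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 70, 80]

print(sample_covariance(hours, scores))  # 17.0 (positive: scores rise with hours)
```

Reversing one of the lists flips the sign, which matches the "one increases while the other decreases" case.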

---

## **2. Example**

* Suppose study hours (X) and exam scores (Y):

  * If more hours → higher scores → **positive covariance**.
* Suppose exercise (X) and body weight (Y):

  * If more exercise → lower weight → **negative covariance**.

---

## **3. Limitations**

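* The **value of covariance is not standardized**: its magnitude depends on the units of X and Y, so values are hard to compare across variable pairs.
* That’s why we often use **correlation**, which divides covariance by the product of the two standard deviations and always lies between −1 and 1.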

---

# **Covariance vs Causation**

## **Covariance (or Correlation)**

* Shows **association** between variables.
* Example: Ice cream sales ↑ and temperature ↑ → strong positive covariance.

## **Causation**

* Implies that **one variable directly affects the other**.
* Example: Temperature ↑ → people buy more ice cream (temperature causes ice cream sales).

---

## **Key Differences**

| Aspect    | Covariance / Correlation                            | Causation                                         |
| --------- | --------------------------------------------------- | ------------------------------------------------- |
| Meaning   | Shows if variables move together                    | Shows if one variable influences the other        |
| Direction | Positive, Negative, or None                         | Cause → Effect                                    |
| Proof     | From data relationship only                         | Requires controlled experiment or deep reasoning  |
| Example   | Shoe size and reading ability may correlate in kids | Age causes both shoe size ↑ and reading ability ↑ |

---

## **4. In Machine Learning**

* **Covariance/Correlation** → used in feature selection, PCA (Principal Component Analysis), detecting multicollinearity.
* **Causation** → very hard to prove! Needs experiments (A/B testing), causal inference, or domain knowledge.
* ML models often exploit **correlations**, but that doesn’t always mean true causality.
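
The shoe size and reading ability example can be simulated. The sketch below assumes a made-up data-generating process in which age drives both variables and neither affects the other. All coefficients and noise levels are arbitrary choices for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
ages = [rng.randint(5, 12) for _ in range(2000)]          # confounder
shoe = [0.8 * a + rng.gauss(0, 1) for a in ages]          # depends on age only
reading = [10 * a + rng.gauss(0, 10) for a in ages]       # depends on age only

# Across all children the two variables look strongly related.
r_all = pearson_r(shoe, reading)

# Holding age fixed (only 8-year-olds) removes the association.
same_age = [(s, r) for a, s, r in zip(ages, shoe, reading) if a == 8]
r_age8 = pearson_r([s for s, _ in same_age], [r for _, r in same_age])

print(f"all ages: r = {r_all:.2f}")
print(f"age 8 only: r = {r_age8:.2f}")
```

The overall correlation is strongly positive, while the within-age correlation is close to zero. That pattern is what a confounder produces.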

---

**Quick Mnemonic**:

* **Covariance = Moves together?**
* **Causation = One makes the other happen.**

---

</details>

> # `Probability Distribution Functions - PDF, PMF & CDF`

<details>
<summary>Click to expand</summary>

# **Probability Distribution Functions**

Probability distributions describe **how probabilities are assigned to different outcomes** of a random variable.

There are 2 types of random variables:

* **Discrete** → countable outcomes (e.g., dice roll, number of students).
* **Continuous** → infinite possible values within a range (e.g., height, weight).

---

## **1. PMF (Probability Mass Function)**

* Used for **discrete random variables**.
* Assigns a probability to **each possible outcome**.
* Properties:

  * $P(X = x) \geq 0$
  * $\sum P(X = x) = 1$

**Example (Dice roll):**

* PMF:

  $$
  P(X=x) = \frac{1}{6}, \quad x \in \{1,2,3,4,5,6\}
  $$
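
A small sketch of the dice PMF, stored as a dictionary with exact fractions so both properties can be checked without rounding error. The variable names are illustrative.

```python
from fractions import Fraction

# PMF of a fair six-sided die, stored as {outcome: probability}.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# Both PMF properties can be checked directly.
all_non_negative = all(p >= 0 for p in die_pmf.values())
total_probability = sum(die_pmf.values())

print(die_pmf[4])         # 1/6
print(total_probability)  # 1

# Probability of an event = sum of the PMF over the outcomes in it.
p_even = sum(p for face, p in die_pmf.items() if face % 2 == 0)
print(p_even)             # 1/2
```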

**In ML:** PMFs appear in Naive Bayes (discrete features), categorical variables, language models (word counts).

---

## **2. PDF (Probability Density Function)**

* Used for **continuous random variables**.
* Probability is **not assigned to a single point** (any single exact value has probability 0), but to an **interval**.
* Area under the curve = probability.
* Properties:

  * $f(x) \geq 0$
  * $\int_{-\infty}^{\infty} f(x) dx = 1$

**Example (Normal distribution):**

* Bell curve for heights of people.
* Probability someone’s height is between 160–170 cm = area under curve between 160 and 170.
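
To make "area under the curve" concrete, the sketch below uses `statistics.NormalDist` from the Python standard library. The parameters (mean 170 cm, standard deviation 10 cm) are assumed for illustration.

```python
from statistics import NormalDist

# Assumed (hypothetical) parameters for height in cm.
heights = NormalDist(mu=170, sigma=10)

# Density at a single point: a height of the curve, not a probability.
density_at_165 = heights.pdf(165)

# Probability of an interval = area under the PDF = difference of CDF values.
p_160_to_170 = heights.cdf(170) - heights.cdf(160)

print(round(density_at_165, 4))  # 0.0352
print(round(p_160_to_170, 4))    # 0.3413
```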

**In ML:** PDFs appear in density estimation, Gaussian Naive Bayes, anomaly detection, generative models.

---

## **3. CDF (Cumulative Distribution Function)**

* Works for **both discrete and continuous** variables.
* Gives the **probability that random variable ≤ x**.
* Formula:

  $$
  F(x) = P(X \leq x)
  $$

**Example (Dice roll):**

* $F(3) = P(X \leq 3) = \frac{3}{6} = 0.5$

**Example (Height with Normal PDF):**

* If CDF(170) = 0.6 → 60% of people are ≤ 170 cm tall.
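
A short sketch of both cases: the discrete CDF is a running sum of the PMF, and the continuous CDF comes from `statistics.NormalDist`. The normal parameters below are assumed, so the value at 170 differs from the 0.6 used in the example above.

```python
from fractions import Fraction
from itertools import accumulate
from statistics import NormalDist

faces = range(1, 7)
pmf = [Fraction(1, 6)] * 6

# Discrete CDF: running sum of the PMF, F(x) = P(X <= x).
cdf = dict(zip(faces, accumulate(pmf)))

print(cdf[3])  # 1/2
print(cdf[6])  # 1

# Continuous case with assumed parameters: 170 is the mean, so F(170) = 0.5.
heights = NormalDist(mu=170, sigma=10)
print(heights.cdf(170))  # 0.5
```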

**In ML:** CDFs are used in hypothesis testing (p-values), probabilistic models, and survival analysis.

---

## **Quick Comparison**

| Function | Variable Type | What it Tells You                                                        |
| -------- | ------------- | ------------------------------------------------------------------------ |
| **PMF**  | Discrete      | Probability of exact value                                               |
| **PDF**  | Continuous    | Probability density at a value (area over an interval gives probability) |
| **CDF**  | Both          | Probability variable ≤ a value                                           |

---

**Quick Mnemonic**

* **PMF → Point probability** (discrete).
* **PDF → Density probability** (continuous).
* **CDF → Cumulative probability** (up to a point).

---

</details>

> # `Kernel Density Estimation (KDE)`

<details>
<summary>Click to expand</summary>

## **1. Definition**

* **KDE** is a **non-parametric** way to estimate the probability density function (PDF) of a random variable.
* Unlike a histogram (which is blocky), KDE gives a **smooth curve** that represents the underlying distribution of data.

---

## **2. How it Works**

* Imagine placing a **small “bump” (kernel function)** on each data point.
* Add up all these bumps → get a smooth curve (the density estimate).
* The **kernel** is usually Gaussian (bell-shaped), but can be others.
* The **bandwidth** controls smoothness:

  * Small bandwidth → too wiggly (overfits).
  * Large bandwidth → too flat (underfits).

---

## **3. Example**

Suppose we have exam scores: `[50, 55, 60, 70, 80, 85, 90]`.

* Histogram → shows frequencies in bins.
* KDE → smooth curve showing how scores are distributed.
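
The "one bump per data point" idea can be sketched in plain Python. This is a minimal Gaussian KDE, not a library implementation. The bandwidth of 5 is picked by hand as an assumption.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function x -> estimated density, using one Gaussian bump per point."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian bumps centered on each observation.
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density


scores = [50, 55, 60, 70, 80, 85, 90]
kde = gaussian_kde(scores, bandwidth=5.0)  # bandwidth chosen by hand (assumption)

# A density integrates to 1; check with a simple Riemann sum over 0..140.
step = 0.1
area = sum(kde(i * step) * step for i in range(1401))

print(round(area, 3))      # 1.0
print(kde(85) > kde(120))  # True: 120 is far from every observation
```

Rerunning with `bandwidth=1.0` gives a much spikier estimate, with almost no density between the observed scores.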

---

## **4. Applications in Machine Learning**

* **Data Exploration (EDA):** Visualizing feature distributions.
* **Outlier Detection:** Peaks vs rare regions.
* **Generative Modeling:** Estimating density without assuming a specific distribution.
* **Anomaly Detection:** Points with very low KDE density = anomalies.

---

## **5. KDE vs Histogram**

| Aspect     | Histogram                    | KDE                       |
| ---------- | ---------------------------- | ------------------------- |
| Shape      | Blocky (depends on bin size) | Smooth                    |
| Parameters | Bin width                    | Bandwidth                 |
| Use Case   | Quick count visualization    | Estimating underlying PDF |

---

## **6. Quick Analogy**

Think of KDE as pouring a drop of **ink (kernel)** at every data point.
When all ink drops overlap, they form a **smooth shape** (density curve).

---

**Summary**:

* KDE = Smooth estimate of PDF.
* Controlled by **kernel type** and **bandwidth**.
* Widely used in ML for **EDA, density estimation, and anomaly detection**.

---

</details>

> # `Normal Distribution`

<details>
<summary>Click to expand</summary>

## **1. Definition**

* Also called the **Gaussian distribution** (after Carl Friedrich Gauss).
* A **continuous probability distribution** shaped like a symmetric "bell curve."
* Many natural phenomena (heights, IQ scores, measurement errors) follow it approximately.

---

## **2. Properties**

* **Symmetric** around the mean ($\mu$).
* **Mean = Median = Mode**.
* **68-95-99.7 Rule** (Empirical Rule):

  * ~68% of data within 1 standard deviation ($\sigma$).
  * ~95% within 2$\sigma$.
  * ~99.7% within 3$\sigma$.

**Probability Density Function (PDF):**

$$
f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{ -\frac{(x-\mu)^2}{2\sigma^2} }
$$
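
As a quick check, the sketch below transcribes the density formula, compares it with `statistics.NormalDist`, and recovers the 68-95-99.7 rule from the CDF. The helper name `normal_pdf` is illustrative.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Direct transcription of the density formula above."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))


std_normal = NormalDist(0, 1)

# The hand-written formula agrees with the standard library.
print(math.isclose(normal_pdf(0.5, 0, 1), std_normal.pdf(0.5)))  # True

# 68-95-99.7 rule: P(mu - k*sigma <= X <= mu + k*sigma) for k = 1, 2, 3.
for k in (1, 2, 3):
    coverage = std_normal.cdf(k) - std_normal.cdf(-k)
    print(k, round(coverage, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```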

---

## **3. Standard Normal Distribution**

* Special case: $\mu = 0$, $\sigma = 1$.
* Often denoted $Z$.
* Used for **Z-scores**:

  $$
  Z = \frac{X - \mu}{\sigma}
  $$
* Helps compare different datasets on the same scale.
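
A minimal sketch of Z-score standardization, using two hypothetical score lists on different scales. It uses the population standard deviation. Swap in `statistics.stdev` if the sample version is wanted.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD
    return [(v - mu) / sigma for v in values]


# Hypothetical exam scores on two different scales.
math_scores = [60, 70, 80, 90, 100]  # out of 100
essay_scores = [3, 5, 6, 7, 9]       # out of 10

print([round(z, 2) for z in z_scores(math_scores)])
# [-1.41, -0.71, 0.0, 0.71, 1.41]

# The top essay score is further above its mean than the top math score.
print(z_scores(essay_scores)[-1])  # 1.5
```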

---

## **4. Applications in Machine Learning**

* **Assumption in Models:** Linear regression inference assumes normally distributed errors, and Gaussian Naive Bayes assumes normally distributed features within each class; logistic regression does not require normality.
* **Feature Scaling:** Z-score normalization uses mean and standard deviation.
* **Hypothesis Testing:** Many test statistics (t-test, ANOVA, etc.) rely on normality assumptions.
* **Probability & Likelihood:** Gaussian likelihood is used in generative models (e.g., Gaussian Mixture Models).
* **Noise Modeling:** Measurement errors often assumed to be Gaussian.

---

## **5. Example**

* Suppose students’ heights in a class are approximately normal with:

  * $\mu = 170$ cm, $\sigma = 10$ cm.
  * ~68% of students are between **160 cm and 180 cm**.
  * ~95% between **150 cm and 190 cm**.

---

## **6. Visual Intuition**

* Bell-shaped curve.
* Tallest at mean, tails taper off.
* Area under curve = probability.

---

**Quick Mnemonic**:

* **Normal = Natural = Bell curve**.
* Think of it as the "default" distribution in statistics.

---


</details>

> # `Skewness`

<details>
<summary>Click to expand</summary>

## **1. Definition**

* **Skewness** measures the **asymmetry** of a distribution.
* A normal distribution has **skewness = 0** (perfectly symmetric).
* If data leans to one side → it’s skewed.

---

## **2. Types of Skewness**

### **a) Positive Skew (Right-skewed)**

* Tail stretches more to the **right**.
* Typically Mean > Median > Mode (a rule of thumb, not a guarantee).
* Example: Income distribution (a few very high incomes pull the mean to the right).

### **b) Negative Skew (Left-skewed)**

* Tail stretches more to the **left**.
* Typically Mean < Median < Mode (a rule of thumb, not a guarantee).
* Example: Age at retirement (most retire around 60–65, but some retire much earlier).

### **c) Zero Skew (Symmetric)**

* Distribution is symmetric.
* Mean = Median = Mode.
* Example: Heights of adults (approximately).

---

## **3. Formula**

For a dataset $X$:

$$
\text{Skewness} = \frac{\sum (X_i - \bar{X})^3 / n}{\sigma^3}
$$

* Positive value → right skew.
* Negative value → left skew.
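
The formula can be sketched directly, using the population (divide by n) standard deviation in the denominator. The three small datasets are made up to show the sign of the result.

```python
def skewness(values):
    """Moment-based skewness: third central moment divided by sigma cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # population variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]  # one large value stretches the right tail
left_skewed = [-v for v in right_skewed]

print(skewness(symmetric))         # 0.0
print(skewness(right_skewed) > 0)  # True
print(skewness(left_skewed) < 0)   # True
```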

---

## **4. In Machine Learning**

* Skewed features can hurt models that are sensitive to extreme values or that assume roughly normal inputs or errors (e.g., Linear Regression, Gaussian Naive Bayes).
* Often corrected with **log transform, square root, or Box-Cox transform**.
* Important in **outlier detection** (skewness often caused by outliers).

---

## **5. Visual Intuition**

* **Right skew** = tail on the right.
* **Left skew** = tail on the left.
* Think of the "tail" as the direction of skew.

---

**Quick Mnemonic**:

* "Tail tells the tale."
* If the **tail is right → right skew**.
* If the **tail is left → left skew**.

---

</details>

> # `Non-Gaussian Probability Distributions`

<details>
<summary>Click to expand</summary>

###  What are Non-Gaussian Distributions?

* A **Gaussian distribution** = the bell curve (symmetric, unimodal, defined by mean & variance).
* **Non-Gaussian distributions** are all distributions that **don’t follow the bell curve**.

  * They may be **skewed**, **heavier-tailed**, **multi-peaked**, or **discrete**.

---

###  Common Non-Gaussian Distributions

1. **Uniform Distribution**

   * All outcomes equally likely.
   * Example: Rolling a fair die (1–6, each has probability 1/6).

2. **Bernoulli Distribution**

   * Only two outcomes: success (1) or failure (0).
   * Example: Tossing a coin once.

3. **Binomial Distribution**

   * Number of successes in `n` independent Bernoulli trials.
   * Example: Number of heads in 10 coin tosses.

4. **Poisson Distribution**

   * Counts the number of events in a fixed time/space interval.
   * Example: Number of emails received in an hour.

5. **Exponential Distribution**

   * Models **time between events** in a Poisson process.
   * Example: Time between arrivals at a bus stop.

6. **Geometric Distribution**

   * Number of trials until the first success.
   * Example: Tossing a coin until you get heads.

7. **t-Distribution (Student’s t)**

   * Similar to normal but with **heavier tails** (more outliers).
   * Example: Useful when working with **small sample sizes**.

8. **Chi-Square Distribution**

   * Distribution of a sum of squared standard normal variables.
   * Used in **hypothesis testing** (e.g., independence tests).

9. **Exotic Non-Gaussians**

   * **Cauchy Distribution** → very heavy tails, mean & variance undefined.
   * **Beta Distribution** → used for probabilities (values between 0 and 1).
   * **Gamma Distribution** → generalization of exponential.

---

###  Key Idea

 Non-Gaussian distributions show up when:

* Data is **discrete** (like dice rolls).
* Data is **skewed** (like income).
* Data has **heavy tails** (like stock returns).

---

</details>

> # `Kurtosis`

<details>
<summary>Click to expand</summary>

###  What is Kurtosis?

* **Kurtosis** measures the **“tailedness”** of a probability distribution.
* It tells us **how heavy or light the tails** are compared to a normal distribution.
* In other words: Do extreme values (outliers) happen more or less often than in a normal bell curve?

---

###  Types of Kurtosis

1. **Mesokurtic (Kurtosis = 3)**

   * Same kurtosis as the normal distribution (excess kurtosis 0).
   * Balanced tails.
   * Example: Standard normal curve.

2. **Leptokurtic (High Kurtosis > 3)**

   * **Heavy tails**, usually drawn with a **sharp peak**.
   * More extreme values (outliers) likely.
   * Example: Stock market returns.

3. **Platykurtic (Low Kurtosis < 3)**

   * **Light tails**, usually drawn with a **flatter peak**.
   * Fewer extreme values.
   * Example: Uniform distribution.

---

###  Formula for Sample Kurtosis

$$
K = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^4}{\left(\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2\right)^2}
$$

* Numerator = 4th moment (sensitive to extreme values).
* Denominator = squared variance.

Often we use **Excess Kurtosis = K – 3** so that:

* Normal distribution → 0
* Leptokurtic → positive
* Platykurtic → negative
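
A minimal sketch of the sample kurtosis formula and the excess-kurtosis convention. Both datasets are made up: one is evenly spread and the other has extreme values in both tails.

```python
def kurtosis(values):
    """Fourth central moment divided by the squared variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2


def excess_kurtosis(values):
    """Kurtosis relative to the normal distribution (which has K = 3)."""
    return kurtosis(values) - 3.0


evenly_spread = [1, 2, 3, 4, 5]               # flat, uniform-like
with_outliers = [-10, -1, 0, 0, 0, 0, 1, 10]  # extreme values in both tails

print(round(kurtosis(evenly_spread), 2))   # 1.7
print(excess_kurtosis(evenly_spread) < 0)  # True: platykurtic
print(excess_kurtosis(with_outliers) > 0)  # True: leptokurtic
```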

---

###  Intuition

* **Skewness** = “Is the curve tilted left or right?”
* **Kurtosis** = “How fat are the tails?” (the peak shape is a side effect, not what kurtosis measures).

---

</details>

> # `Q–Q plot`

<details>
<summary>Click to expand</summary>

###  What is a Q–Q Plot?

* **Q–Q (Quantile–Quantile) Plot** compares the **quantiles** of your dataset with the quantiles of a theoretical distribution (like normal).
* If your data follows that distribution → the points lie **close to a straight line** (usually 45°).

---

###  How it Works

1. Take your dataset and sort it.
2. Compute its **quantiles**.
3. Compare against the **theoretical quantiles** (e.g., of normal distribution).
4. Plot data quantiles (Y-axis) vs theoretical quantiles (X-axis).
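
The four steps can be sketched without a plotting library by computing the point pairs only. The plotting positions `(i + 0.5) / n` are one common convention, chosen here as an assumption. The reference normal uses the sample's own mean and standard deviation, and the data are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(data):
    """Return (theoretical quantile, observed value) pairs for a normal Q-Q plot."""
    ordered = sorted(data)                                 # step 1: sort
    n = len(ordered)
    probs = [(i + 0.5) / n for i in range(n)]              # step 2: quantile levels
    reference = NormalDist(mean(ordered), stdev(ordered))  # normal fitted to the sample
    theoretical = [reference.inv_cdf(p) for p in probs]    # step 3: theoretical quantiles
    return list(zip(theoretical, ordered))                 # step 4: x = theoretical, y = observed


sample = [50, 55, 60, 70, 80, 85, 90]  # hypothetical scores
for theo, obs in normal_qq_points(sample):
    print(f"{theo:7.2f}  {obs}")
```

Passing the pairs to any scatter-plot function, with the line y = x drawn on top, gives the usual Q-Q picture.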

---

### Interpretation

* **Straight line** → data ≈ theoretical distribution.
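* **S-shape (left end below the line, right end above)** → heavier tails than normal (leptokurtic).
* **Reverse S-shape (left end above the line, right end below)** → lighter tails (platykurtic).
* **Single bow or arc** → skewness: both ends above the line for right skew, both ends below for left skew.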

---

###  Example in Machine Learning

* Inference for **Linear Regression** (confidence intervals and t-tests on coefficients) usually assumes the errors are **normally distributed**.
* A Q–Q plot of residuals can check this assumption quickly.

---

</details>

> # `Uniform Distribution`

<details>
<summary>Click to expand</summary>

###  What is a Uniform Distribution?

* A probability distribution where **all outcomes are equally likely**.
* Every value within a given range has the **same probability** (discrete case) or the **same density** (continuous case).
* Example: Rolling a fair die → each face (1–6) has probability $\tfrac{1}{6}$.

---

###  Types of Uniform Distribution

1. **Discrete Uniform Distribution**

   * Finite number of values, each equally likely.
   * Example: Rolling a die, picking a random card from a deck.

2. **Continuous Uniform Distribution**

   * Infinite values in an interval $[a, b]$, all equally likely.
   * Probability Density Function (PDF):

     $$
     f(x) = \frac{1}{b-a}, \quad a \leq x \leq b
     $$
   * Cumulative Distribution Function (CDF):

     $$
     F(x) = \frac{x-a}{b-a}, \quad a \leq x \leq b
     $$

---

### Properties

* **Mean**:

  $$
  \mu = \frac{a+b}{2}
  $$
* **Variance**:

  $$
  \sigma^2 = \frac{(b-a)^2}{12}
  $$
* Shape: Flat (rectangular).
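
A simulation sketch that compares the mean and variance formulas with random draws from `random.uniform`. The interval `[2, 10]`, the seed, and the sample size are arbitrary choices.

```python
import random

a, b = 2.0, 10.0
rng = random.Random(42)
draws = [rng.uniform(a, b) for _ in range(100_000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / len(draws)

theory_mean = (a + b) / 2        # 6.0
theory_var = (b - a) ** 2 / 12   # about 5.33

print(f"mean: sample {sample_mean:.3f} vs formula {theory_mean:.3f}")
print(f"variance: sample {sample_var:.3f} vs formula {theory_var:.3f}")


def uniform_cdf(x, a, b):
    """F(x) for the continuous uniform distribution on [a, b]."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)


print(uniform_cdf(6.0, a, b))  # 0.5
```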

---

### In Machine Learning

* Used for **random initialization** of weights (before training starts).
* Sampling for simulations (Monte Carlo methods).
* Data augmentation (random crops, rotations with equal probability).

---

</details>

> # `Log-Normal Distribution`

<details>
<summary>Click to expand</summary>

### What is a Log-Normal Distribution?

* A random variable $X$ is **log-normally distributed** if:

  $$
  \ln(X) \sim N(\mu, \sigma^2)
  $$

  That means the **logarithm of $X$** follows a **normal distribution**.
* So while a normal distribution can take **negative values**, a log-normal is **strictly positive**.

---

### Probability Density Function (PDF)

For $x > 0$:

$$
f(x; \mu, \sigma) = \frac{1}{x\sigma \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)
$$

* Skewed to the right (long positive tail).
* Shape depends on $\mu$ and $\sigma$.

---

###  Properties

* **Mean**:

  $$
  E[X] = e^{\mu + \frac{\sigma^2}{2}}
  $$
* **Median**:

  $$
  \text{Median} = e^{\mu}
  $$
* **Variance**:

  $$
  Var[X] = \left(e^{\sigma^2} - 1\right) e^{2\mu + \sigma^2}
  $$
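
The mean and median formulas can be checked against draws from `random.lognormvariate`, which takes the mean and standard deviation of $\ln(X)$. The parameter values and seed below are arbitrary.

```python
import math
import random
from statistics import median

mu, sigma = 0.0, 0.5  # parameters of ln(X), chosen for illustration
rng = random.Random(7)
xs = [rng.lognormvariate(mu, sigma) for _ in range(100_000)]

sample_mean = sum(xs) / len(xs)
sample_median = median(xs)

theory_mean = math.exp(mu + sigma ** 2 / 2)
theory_median = math.exp(mu)
theory_var = (math.exp(sigma ** 2) - 1) * math.exp(2 * mu + sigma ** 2)

print(f"mean: sample {sample_mean:.3f} vs formula {theory_mean:.3f}")
print(f"median: sample {sample_median:.3f} vs formula {theory_median:.3f}")
print(min(xs) > 0)  # True: log-normal values are strictly positive
```

The mean sits above the median, as expected for a right-skewed distribution.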

---

###  Examples in Real Life

* **Stock prices**: Can’t be negative, but can grow exponentially.
* **Income distribution**: Most people earn around the median, but a few earn extremely high amounts.
* **Biological measures**: Reaction times, survival times, etc.

---

### In Machine Learning

* **Modeling skewed data** (e.g., prices, time-to-event data).
* **Feature transformation**: Taking $\ln(x)$ of a skewed feature often makes it more normal-like, which helps models that assume normality (like linear regression).
* **Generative models**: Useful for positive-only variables.

---
</details>

> # `Pareto Distribution`

<details>
<summary>Click to expand</summary>

###  What is the Pareto Distribution?

* A continuous probability distribution often called the **“80/20 rule” distribution**.
* Named after economist **Vilfredo Pareto**, who observed that **20% of people owned 80% of the land in Italy**.
* It describes situations where **a small number of events account for most of the effect**.

---

###  Probability Density Function (PDF)

For $x \geq x_m$:

$$
f(x; x_m, \alpha) = \frac{\alpha x_m^\alpha}{x^{\alpha+1}}
$$

Where:

* $x_m > 0$ = minimum possible value (scale).
* $\alpha > 0$ = shape parameter.

---

###  Properties

* **Mean** (only if $\alpha > 1$):

  $$
  E[X] = \frac{\alpha x_m}{\alpha - 1}
  $$
* **Variance** (only if $\alpha > 2$):

  $$
  Var[X] = \frac{x_m^2 \alpha}{(\alpha - 1)^2 (\alpha - 2)}
  $$
* Heavy-tailed: Extreme large values occur more often than in normal or exponential.
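
A sketch connecting the formulas to the 80/20 idea. The top-share expression `p ** (1 - 1/alpha)` is a standard result for the Pareto distribution that is not derived in these notes. `random.paretovariate` draws with $x_m = 1$, so draws are rescaled by `x_m`. Parameter values are arbitrary.

```python
import math
import random


def pareto_pdf(x, x_m, alpha):
    """Density from the formula above; zero below the minimum x_m."""
    if x < x_m:
        return 0.0
    return alpha * x_m ** alpha / x ** (alpha + 1)


def top_share(p, alpha):
    """Share of the total held by the top fraction p (needs alpha > 1)."""
    return p ** (1 - 1 / alpha)


# The shape value that reproduces an exact 80/20 split is about 1.16.
alpha_80_20 = math.log(5) / math.log(4)
print(round(top_share(0.2, alpha_80_20), 3))  # 0.8

# Simulation check of the mean formula (alpha > 2, so the variance is finite).
x_m, alpha = 2.0, 3.0
rng = random.Random(1)
draws = [x_m * rng.paretovariate(alpha) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
theory_mean = alpha * x_m / (alpha - 1)  # 3.0

print(f"mean: sample {sample_mean:.3f} vs formula {theory_mean:.3f}")
```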

---

###  Applications in Real Life

* **Wealth distribution** (small % of population owns most wealth).
* **Internet traffic** (few sites get most of the clicks).
* **Natural phenomena** (sizes of cities, earthquakes, file sizes).

---

### In Machine Learning

* Understanding **imbalanced data** (e.g., rare events dominate outcomes).
* **Feature engineering**: log-transforming Pareto-like features can stabilize models.
* **Anomaly detection**: large outliers may be natural if the underlying process is Pareto.

---

**Compare with Log-Normal**:

* Both are skewed with heavy tails.
* **Log-normal** → arises when multiplying many random factors.
* **Pareto** → arises when following power-law processes.

---
</details>

> # `Transformations `

<details>
<summary>Click to expand</summary>

# **Data Transformations**

## **1. Definition**

* A **transformation** changes the **scale, shape, or distribution** of data.
* Purpose: Improve model performance, reduce skewness, stabilize variance, or make relationships more linear.

---

## **2. Common Types of Transformations**

### **a) Logarithmic Transformation**

* Formula: $X' = \log(X)$ (usually natural log).
* **Use:** Reduce right skew, compress large values.
* **Example:** Income, population, stock prices.

---

### **b) Square Root Transformation**

* Formula: $X' = \sqrt{X}$
* **Use:** Reduce moderate skewness, stabilize variance.
* **Example:** Count data like number of defects, number of calls.

---

### **c) Cube Root Transformation**

* Formula: $X' = \sqrt[3]{X}$
* **Use:** Can reduce skewness for both positive and negative values.

---

### **d) Reciprocal Transformation**

* Formula: $X' = 1/X$
* **Use:** Invert relationships, reduce influence of large values.

---

### **e) Box-Cox Transformation**

* Formula: $X' = \frac{X^\lambda - 1}{\lambda}$ for $\lambda \neq 0$, and $X' = \ln(X)$ for $\lambda = 0$.
* **Use:** Finds the best power transformation to **normalize** data.
* **Requirement:** All $X > 0$.
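
A minimal sketch of the Box-Cox transform for a single value and a given $\lambda$, with the natural log as the limiting case when $\lambda \to 0$. Choosing the best $\lambda$ (usually by maximum likelihood) is not shown here.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of one positive value x for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return math.log(x)  # limiting case as lambda -> 0
    return (x ** lam - 1) / lam


print(box_cox(4.0, 0.5))  # 2.0 (square-root-like)
print(box_cox(4.0, 1.0))  # 3.0 (just a shift: x - 1)

# A tiny lambda gives nearly the same result as the log case.
print(math.isclose(box_cox(4.0, 1e-8), box_cox(4.0, 0), rel_tol=1e-6))  # True
```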

---

### **f) Yeo-Johnson Transformation**

* Similar to Box-Cox but works for **positive and negative values**.

---

## **3. Why Transformations are Important in ML**

* **Linear models** assume linearity → transform skewed features to improve fit.
* **Stabilize variance** → reduce heteroscedasticity (non-constant variance).
* **Reduce effect of outliers** → compress extreme values.
* **Make distributions more normal** → better for algorithms that assume normality (e.g., Gaussian Naive Bayes, t-tests).

---

## **4. Visual Example**

* Original data: Right-skewed histogram.
* After log transformation: More symmetric, easier for models to learn.
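
The before-and-after effect can be measured with skewness instead of a histogram. The sketch below uses synthetic log-normal "incomes" as an assumed stand-in for right-skewed data.

```python
import math
import random


def skewness(values):
    """Moment-based skewness: third central moment divided by sigma cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(3)
incomes = [rng.lognormvariate(10, 1) for _ in range(5000)]  # synthetic, right-skewed
logged = [math.log(v) for v in incomes]                     # log transformation

skew_before = skewness(incomes)
skew_after = skewness(logged)

print(f"skewness before log: {skew_before:.2f}")
print(f"skewness after log:  {skew_after:.2f}")
```

The raw values have a large positive skewness, and the logged values are close to zero.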

---

**Quick Mnemonic:**

* **Log / Sqrt → compress big values**
* **Reciprocal → invert relationships**
* **Box-Cox → find the best power automatically**

---

</details>

> # `Bernoulli Distribution `

<details>
<summary>Click to expand</summary>

#  **Bernoulli Distribution**

## **1. Definition**

* The **Bernoulli distribution** is the probability distribution of a **single trial** with only **two possible outcomes**:

  * Success ($1$) with probability $p$
  * Failure ($0$) with probability $1-p$

It’s like flipping a biased coin once.

---

## **2. Probability Mass Function (PMF)**

For outcome $x \in \{0, 1\}$:

$$
P(X = x) = p^x (1-p)^{1-x}
$$

* $P(X=1) = p$
* $P(X=0) = 1-p$

---

## **3. Properties**

* **Mean (Expected Value):**

  $$
  E[X] = p
  $$
* **Variance:**

  $$
  Var(X) = p(1-p)
  $$
* **Support:** $X \in \{0,1\}$

---

## **4. Real-Life Examples**

* Tossing a (possibly biased) coin: Heads (1), Tails (0).
* A student passing (1) or failing (0) an exam.
* Email being spam (1) or not spam (0).

---

## **5. In Machine Learning**

* Basis of **binary classification**.
* Logistic regression models $P(Y=1 \mid x)$, treating the label $Y$ as a Bernoulli variable.
* Loss function: **Binary Cross-Entropy** is derived from Bernoulli likelihood.
* Used in **generative models** (e.g., Bernoulli Naive Bayes).
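
To show where Binary Cross-Entropy comes from, the sketch below checks that it equals the negative log of the Bernoulli PMF for a single label. The labels and predicted probabilities are made up.

```python
import math


def bernoulli_pmf(y, p):
    """P(Y = y) for y in {0, 1} with success probability p."""
    return p ** y * (1 - p) ** (1 - y)


def binary_cross_entropy(y, p):
    """Loss for one label y and predicted probability p of class 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


y, p = 1, 0.8
print(math.isclose(-math.log(bernoulli_pmf(y, p)), binary_cross_entropy(y, p)))  # True

# Mean loss over a small hypothetical batch of (label, predicted probability).
batch = [(1, 0.9), (0, 0.2), (1, 0.6), (0, 0.4)]
mean_bce = sum(binary_cross_entropy(yi, pi) for yi, pi in batch) / len(batch)
print(round(mean_bce, 4))
```

Minimizing the mean loss is the same as maximizing the Bernoulli likelihood of the labels.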

---

## **6. Related Distributions**

* **Binomial distribution** = number of successes in $n$ independent Bernoulli trials with the same $p$ (the sum of their 0/1 outcomes).
* **Geometric distribution** = number of Bernoulli trials until the first success.

---

</details>

> # `Binomial Distribution `

<details>
<summary>Click to expand</summary>

#  **Binomial Distribution**

## **1. Definition**

* The **binomial distribution** gives the probability of getting exactly $k$ successes in $n$ independent Bernoulli trials, each with probability of success $p$.
* Example: Flipping a coin $n$ times and counting heads.

---

## **2. Probability Mass Function (PMF)**

For $k = 0,1,2, \dots, n$:

$$
P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}
$$

Where:

* $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ (number of ways to choose $k$ successes)
* $p$ = probability of success
* $n$ = number of trials

---

## **3. Properties**

* **Mean (Expected Value):**

  $$
  E[X] = np
  $$
* **Variance:**

  $$
  Var(X) = np(1-p)
  $$
* **Support:** $X \in \{0,1,2,\dots,n\}$
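
A small sketch of the PMF using `math.comb`, with checks that the probabilities sum to 1 and that the mean and variance match $np$ and $np(1-p)$. It uses the 10 coin flips example.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k): probability of exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(pmf[5])               # 0.24609375 (exactly 5 heads in 10 flips)
print(round(sum(pmf), 10))  # 1.0

mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
print(mean, n * p)                # 5.0 5.0
print(variance, n * p * (1 - p))  # 2.5 2.5
```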

---

## **4. Real-Life Examples**

* Number of heads in 10 coin flips.
* Number of students passing an exam (success/fail).
* Number of defective items in a batch of products.

---

## **5. In Machine Learning**

* **Evaluation metrics**: Many are based on binomial proportions (e.g., accuracy = number of correct predictions out of $n$).
* **Feature modeling**: If you record “number of clicks out of trials,” binomial fits well.
* **Hypothesis testing**: Proportions are often modeled with binomial.

---

## **6. Connection with Other Distributions**

* **Bernoulli**: Special case of binomial with $n=1$.
* **Normal approximation**: For large $n$, binomial ≈ normal distribution (Central Limit Theorem).
* **Poisson approximation**: If $n$ is large and $p$ is small, binomial ≈ Poisson with $\lambda = np$.

---

</details>

> # `Sampling Distribution `

<details>
<summary>Click to expand</summary>

#  **Sampling Distribution**

## **1. Definition**

* A **sampling distribution** is the probability distribution of a statistic (like the mean, variance, or proportion) that is calculated from many random samples of the same population.
* In other words:

  * Take many samples from a population.
  * Compute a statistic (e.g., sample mean $\bar{X}$) for each.
  * Plot the distribution of those statistics → that’s the **sampling distribution**.

---

## **2. Why It’s Important**

* It tells us **how much our sample statistic is likely to vary** from sample to sample.
* Forms the foundation of:

  * **Confidence intervals**
  * **Hypothesis testing**
  * **Standard errors**

---

## **3. Example**

* Population: Heights of students in a college (true mean $\mu$ = 170 cm).
* Take random samples of size $n=30$.
* Compute the mean height for each sample.
* Collect many such sample means → their distribution is the **sampling distribution of the mean**.

---

## **4. Central Limit Theorem (CLT)**

* Key fact: As sample size $n$ increases, the sampling distribution of the mean $\bar{X}$ approaches a **normal distribution**, regardless of the population’s original distribution (as long as its variance is finite).
* Mean of sampling distribution:

  $$
  E[\bar{X}] = \mu
  $$
* Standard error (spread of sample means):

  $$
  SE = \frac{\sigma}{\sqrt{n}}
  $$
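
A simulation sketch of the sampling distribution of the mean. The population is assumed to be exponential with mean 1 and standard deviation 1, a skewed stand-in chosen for illustration. The seed and repetition count are arbitrary.

```python
import math
import random

rng = random.Random(0)
n, reps = 30, 5000

# Draw many samples of size n and keep each sample's mean.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n
    for _ in range(reps)
]

mean_of_means = sum(sample_means) / reps
sd_of_means = math.sqrt(sum((m - mean_of_means) ** 2 for m in sample_means) / reps)
standard_error = 1.0 / math.sqrt(n)  # sigma / sqrt(n) with sigma = 1

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"SD of sample means:   {sd_of_means:.3f} (sigma / sqrt(n) = {standard_error:.3f})")
```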

---

## **5. In Machine Learning**

* **Cross-validation** mimics the idea of sampling distributions (different train/test splits → different performance measures).
* **Bootstrap methods** use repeated resampling to approximate sampling distributions.
* Understanding model stability → how much metrics (like accuracy) vary across samples.

---

**Quick Mnemonic:**

* **Population → one big truth**
* **Sample → one estimate**
* **Sampling distribution → distribution of many estimates**

---
</details>

> # `Central Limit Theorem`

<details>
<summary>Click to expand</summary>

#  **Central Limit Theorem (CLT)**

## **1. Definition**

The **CLT states**:
When we draw a sample of $n$ **independent observations** from a population with mean $\mu$ and finite variance $\sigma^2$, the **sampling distribution of the sample mean** $\bar{X}$ approaches a **normal distribution** as the sample size $n$ becomes large.

Formally:

$$
\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \;\;\xrightarrow{d}\;\; N(0,1) \quad \text{as } n \to \infty
$$

---

## **2. Key Points**

* Works **regardless of the population’s original distribution** (normal, uniform, skewed, etc.).
* Only requirements:

  * Samples must be **independent** and **identically distributed (i.i.d.)**.
  * Finite mean ($\mu$) and variance ($\sigma^2$).
* The approximation improves as $n$ increases ($n \geq 30$ is a common rule of thumb, but heavily skewed or heavy-tailed populations need more).

---

## **3. Implications**

* The **mean of the sampling distribution** = population mean $\mu$.
* The **spread of the sampling distribution** = standard error (SE):

  $$
  SE = \frac{\sigma}{\sqrt{n}}
  $$
* Lets us use **normal probability theory** (z-scores, confidence intervals, hypothesis tests) even when data isn’t normal.

---

## **4. Example**

* Population: Students’ exam scores, skewed right (not normal).
* Take random samples of size $n=40$.
* Compute sample means repeatedly.
* Distribution of those sample means → nearly **bell-shaped (normal)**.
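
The example can be simulated with a right-skewed population. An exponential distribution with mean 1 and standard deviation 1 is used here as an assumed stand-in for the exam scores. If the CLT approximation is good, about 95% of the standardized sample means should fall within ±1.96.

```python
import math
import random

rng = random.Random(123)
mu = sigma = 1.0  # exponential(1) population: right-skewed, mean 1, SD 1
n, reps = 40, 5000


def standardized_mean():
    """Draw one sample of size n and return (xbar - mu) / (sigma / sqrt(n))."""
    xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
    return (xbar - mu) / (sigma / math.sqrt(n))


z_values = [standardized_mean() for _ in range(reps)]
within_1_96 = sum(abs(z) <= 1.96 for z in z_values) / reps

print(f"share of standardized means within ±1.96: {within_1_96:.3f}")
```

The printed share lands close to 0.95 even though individual observations are far from normal.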

---

## **5. In Machine Learning**

* Justifies why many algorithms assume “errors are normally distributed.”
* Supports **bootstrap resampling** and confidence intervals for model metrics.
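* Explains why averaged quantities such as **mean loss** or **accuracy** on a large test set are approximately normal, so **test statistics** built from them (like accuracy differences) behave predictably.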

---

**Quick Mnemonic:**

* **CLT = "Means go Normal"** (even if data isn’t).

---
</details>

> # `Confidence Intervals `

<details>
<summary>Click to expand</summary>

#  **Confidence Intervals (CIs)**

## **1. Idea**

A **confidence interval** gives a **range of plausible values** for a population parameter (like the mean or proportion) based on sample data.

It’s not a guarantee, but rather a **level of confidence** (say 95%) that the true population parameter lies within that interval.

---

## **2. Formula (for Mean, when σ is known)**

From CLT, we know the sample mean $\bar{X}$ is approximately normally distributed for large $n$, which gives:

$$
CI = \bar{X} \;\; \pm \;\; Z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}
$$

* $\bar{X}$ = sample mean
* $\sigma$ = population standard deviation
* $n$ = sample size
* $Z_{\alpha/2}$ = critical value from standard normal distribution

  * 95% CI → $Z = 1.96$
  * 99% CI → $Z = 2.58$

---

## **3. When σ is Unknown**

We use **sample standard deviation (s)** and a **t-distribution**:

$$
CI = \bar{X} \;\; \pm \;\; t_{\alpha/2, \; (n-1)} \cdot \frac{s}{\sqrt{n}}
$$

---

## **4. Example**

* Suppose you measure **average study time** for 100 students: $\bar{X} = 4.5$ hours, $s = 1.2$.
* 95% CI (with $n = 100$ the t critical value is close to 1.96, so the z value is used here):

$$
4.5 \pm 1.96 \times \frac{1.2}{\sqrt{100}}
= 4.5 \pm 0.24
= [4.26, 4.74]
$$

Interpretation: We are 95% confident that the **true mean study time** for all students is between **4.26 and 4.74 hours**.
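
The same interval can be reproduced in a few lines. The critical value comes from `statistics.NormalDist().inv_cdf`, and the helper name `z_confidence_interval` is illustrative.

```python
import math
from statistics import NormalDist


def z_confidence_interval(mean, sd, n, confidence=0.95):
    """Normal-based confidence interval: mean ± z * sd / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin


low, high = z_confidence_interval(mean=4.5, sd=1.2, n=100)
print(round(low, 2), round(high, 2))  # 4.26 4.74
```

Passing `confidence=0.99` widens the interval, because the critical value grows to about 2.58.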

---

## **5. Misconceptions**

* ❌ A 95% CI does **not** mean the probability is 95% that the mean is in the interval.
* ✅ It means: if we repeat the experiment many times, **95% of the intervals built this way will contain the true mean**.
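
The repeated-experiment reading can be checked by simulation. The sketch assumes a normal population with a known σ. The true mean, σ, sample size, and seed are arbitrary choices.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(2024)
true_mean, sigma, n, reps = 50.0, 8.0, 25, 3000

z = NormalDist().inv_cdf(0.975)    # 95% critical value
margin = z * sigma / math.sqrt(n)  # known-sigma margin of error

hits = 0
for _ in range(reps):
    # One "experiment": draw a sample, build its interval, check the true mean.
    xbar = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
    if xbar - margin <= true_mean <= xbar + margin:
        hits += 1

coverage = hits / reps
print(f"intervals containing the true mean: {coverage:.3f}")
```

Each individual interval either contains the true mean or it does not. The 95% describes the long-run share of intervals that do.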

---

## **6. Uses in ML/DS**

* Estimate **model performance** (e.g., 95% CI for accuracy).
* Assess uncertainty in **parameter estimates**.
* Compare two models’ results by the overlap or separation of their intervals (a rough guide; a test on the difference is more reliable).

---

**Quick Mnemonic:**
**CI = Estimate ± Margin of Error**

---


</details>

> # `What is Hypothesis Testing?`

<details>
<summary>Click to expand</summary>

#  What is Hypothesis Testing?

**Definition:**
Hypothesis testing is a **statistical method** used to make decisions or draw conclusions about a **population parameter** based on data from a **sample**.

It answers:
 *“Is this observed effect real, or could it just be due to random chance?”*

---

## **Key Ideas**

* **Hypothesis** = an assumption or claim.
* We test this claim using sample data.
* There are always two competing hypotheses:

  * **Null Hypothesis (H₀):** "No effect" or "status quo" (e.g., mean = 50).
  * **Alternative Hypothesis (H₁ or Hₐ):** What you want to check/prove (e.g., mean > 50).

---

## **Example in Real Life**

* A medicine company claims a new drug lowers blood pressure by 10 points on average.
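* Here H₀: mean reduction = 10, and H₁: mean reduction ≠ 10. You collect a sample of patients and test whether the average reduction is really 10 or not.
* If the data strongly contradict H₀, you reject it in favor of H₁; otherwise you fail to reject H₀ (which is not proof that H₀ is true).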

---

## **Why It Matters in Machine Learning**

* To test whether a **feature is significant**.
* To compare two models’ accuracies (e.g., Model A vs Model B).
* In **A/B testing** for model updates or user interface changes.

---

 **In one line:**
Hypothesis testing is a **decision-making tool** that uses data to check if an assumption about the population is likely true or not.

---

</details>

> # `Null Hypothesis (H₀) and Alternative Hypothesis (H₁ / Hₐ) `

<details>
<summary>Click to expand</summary>


#  Null Hypothesis (H₀)

* The **default assumption** — "nothing new is happening."
* Usually states there is **no effect**, **no difference**, or the value equals a specific number.
* Example:

  * $H₀: \mu = 50$ → The population mean is 50.
  * In machine learning: "Model A and Model B have the same accuracy."

---

# Alternative Hypothesis (H₁ / Hₐ)

* The **opposite claim** to $H₀$.
* Represents what the researcher or analyst is trying to find evidence for.
* Example:

  * $H₁: \mu > 50$ → The population mean is greater than 50.
  * In ML: "Model A has higher accuracy than Model B."

---

# Types of Alternative Hypotheses

1. **Two-tailed (≠)**

   * $H₁: \mu \neq 50$
   * Tests if the mean is *different* (either higher or lower).

2. **Right-tailed (>)**

   * $H₁: \mu > 50$
   * Tests if the mean is *greater*.

3. **Left-tailed (<)**

   * $H₁: \mu < 50$
   * Tests if the mean is *less*.

---

# Analogy (Simple)

* Think of a **courtroom trial**:

  * $H₀$: The person is innocent (default assumption).
  * $H₁$: The person is guilty (claim to be tested).
  * Evidence (data) decides whether we reject $H₀$.

---

**In one line:**

* **H₀:** No effect / status quo.
* **H₁:** There is an effect / change.

---
</details>

> # `Steps Involved in a Hypothesis Test`

<details>
<summary>Click to expand</summary>

#  **Steps Involved in a Hypothesis Test**

### **Step 1: State the Hypotheses**

* Formulate **Null Hypothesis (H₀)** and **Alternative Hypothesis (H₁)**.
* Example:

  * $H₀: \mu = 50$ (mean = 50)
  * $H₁: \mu > 50$ (mean is greater than 50)

---

### **Step 2: Choose the Significance Level (α)**

* Common choices: **0.05 (5%)** or **0.01 (1%)**.
* This is the probability of making a **Type I error** (rejecting H₀ when it’s true).

---

### **Step 3: Select the Test Statistic**

* Depends on data type, distribution, and sample size:

  * **Z-test** → population σ known (with a large sample or normally distributed data).
  * **t-test** → σ unknown and estimated from the sample; matters most for small samples.
  * **Chi-square test** → categorical data.
  * **ANOVA** → comparing more than two group means.

---

### **Step 4: Compute the Test Statistic from Sample Data**

* Formula:

  $$
  Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}
  $$

  (for mean with known σ).

* You’ll calculate either:

  * A **test statistic** (Z, t, χ², F, etc.)
  * Or a **p-value** (probability of observing data this extreme if H₀ is true).

---

### **Step 5: Define the Decision Rule**

* Compare test statistic with **critical value** from the distribution table, OR use the **p-value method**:

  * If **p ≤ α** → Reject H₀.
  * If **p > α** → Fail to reject H₀.

---

### **Step 6: Make a Decision and Interpret**

* Reject H₀ → Evidence supports the alternative hypothesis.
* Fail to reject H₀ → No strong evidence against H₀.

---

#  Quick Example

A sample of 36 phones has mean battery = 9.5 hours, σ = 1.8.
Company claims mean = 10. Test at α = 0.05.

1. $H₀: \mu = 10,\; H₁: \mu \neq 10$.
2. α = 0.05.
3. Z-test.
4. $Z = \frac{9.5 - 10}{1.8/\sqrt{36}} = -1.67$.
5. Critical Z = ±1.96.
6. Since -1.67 lies between -1.96 and 1.96 (outside the rejection region), **fail to reject H₀** → no significant difference.
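
The six steps can be traced in code for the battery example. This is a minimal sketch using only Python's standard library (`statistics.NormalDist`); the function name and return layout are illustrative choices, not part of the notes.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed one-sample Z-test (steps 3 to 6 of the procedure)."""
    std_normal = NormalDist()
    # Step 4: test statistic
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Step 5: two-tailed critical value and p-value
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    p_value = 2 * std_normal.cdf(-abs(z))
    # Step 6: decision
    reject = abs(z) > z_crit
    return z, z_crit, p_value, reject

# Steps 1 and 2: H0: mu = 10, H1: mu != 10, alpha = 0.05
z, z_crit, p_value, reject = one_sample_z_test(9.5, 10, 1.8, 36, alpha=0.05)
print(f"z = {z:.2f}, critical = ±{z_crit:.2f}, p = {p_value:.3f}, reject H0: {reject}")
```

The critical-value comparison and the p-value comparison are both computed so the two decision rules from Step 5 can be checked against each other.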

---

**Mnemonic:**
**“Hypothesis → Alpha → Test → Compute → Compare → Conclude.”**

---

</details>

> # `performing a Z-test`

<details>
<summary>Click to expand</summary>

#  **Z-Test**

## **1. When to Use Z-Test**

* Population standard deviation (σ) is known.
* Either the data are approximately normally distributed, or the sample is large (**n ≥ 30** as a rule of thumb) so that the CLT makes the sample mean approximately normal.

Common uses:

* Test a population **mean**.
* Test a population **proportion**.
* Compare **two population means** (large n).

---

## **2. Formula for Z-Test (One Sample Mean)**

$$
Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}
$$

Where:

* $\bar{X}$ = sample mean
* $\mu_0$ = hypothesized population mean (from H₀)
* $\sigma$ = population standard deviation
* $n$ = sample size

---

## **3. Steps**

1. **State Hypotheses**
   Example: A company claims battery life = 10 hours.

   * $H₀: \mu = 10$
   * $H₁: \mu \neq 10$ (two-tailed)

2. **Set α (significance level)**

   * Example: α = 0.05 → critical Z = ±1.96.

3. **Calculate Test Statistic**
   Suppose:

   * $\bar{X} = 9.5$, σ = 1.8, n = 36.

   $$
   Z = \frac{9.5 - 10}{1.8/\sqrt{36}} 
   = \frac{-0.5}{0.3} 
   = -1.67
   $$

4. **Decision Rule**

   * If $|Z| > 1.96$, reject H₀.
   * Here, $|-1.67| < 1.96$.

5. **Conclusion**

   * Fail to reject H₀ → no significant evidence battery life is different from 10 hours.

---

## **4. Z-Test for Proportion (bonus)**

If testing population proportion $p$:

$$
Z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}
$$
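
A short sketch of the proportion formula, assuming a two-tailed test and hypothetical data (170 passes out of 200 students, with $p_0 = 0.8$). The function name `proportion_z_test` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(successes, n, p0):
    """Two-tailed one-sample Z-test for a population proportion."""
    p_hat = successes / n
    standard_error = sqrt(p0 * (1 - p0) / n)  # uses p0, the value under H0
    z = (p_hat - p0) / standard_error
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value

# Hypothetical data: 170 of 200 students pass; H0 says the pass rate is 0.8.
z, p_value = proportion_z_test(successes=170, n=200, p0=0.8)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

Note that the standard error is built from $p_0$, not $\hat{p}$, because the test statistic is evaluated under the assumption that H₀ is true.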

---

**In short:** Z-test compares your sample mean (or proportion) with the hypothesized population value, using the standard error to scale differences.

---

</details>

> # `Rejection Region Approach`

<details>
<summary>Click to expand</summary>

# Rejection Region Approach

## **1. Idea**

* Instead of calculating a p-value, we decide by comparing the **test statistic** (Z, t, χ², etc.) to a **critical value**.
* The **rejection region** is the set of values for which we reject $H₀$.
* It depends on:

  * The **significance level (α)**.
  * Whether the test is **one-tailed** or **two-tailed**.

---

## **2. Steps**

1. **State Hypotheses**
   Example: $H₀: \mu = 50,\; H₁: \mu > 50$.

2. **Choose α (say 0.05)**.

3. **Determine the Critical Value(s)**

   * Look up the Z or t distribution.
   * For a **one-tailed test** (α = 0.05): critical Z = 1.645.
   * For a **two-tailed test** (α = 0.05): critical Z = ±1.96.

4. **Compute the Test Statistic**

   * Formula:

     $$
     Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}
     $$

5. **Decision Rule**

   * If the test statistic **falls in the rejection region**, reject $H₀$.
   * If not, fail to reject $H₀$.

---

## **3. Example (Two-Tailed Z-Test)**

Claim: Population mean = 100.
Sample: $\bar{X} = 104,\; \sigma = 12,\; n = 36$.
Test at α = 0.05.

1. $H₀: \mu = 100,\; H₁: \mu \neq 100$.
2. α = 0.05 → critical Z = ±1.96.
3. Compute:

   $$
   Z = \frac{104 - 100}{12/\sqrt{36}} 
   = \frac{4}{2} 
   = 2.0
   $$
4. Decision: Since $Z = 2.0 > 1.96$, reject $H₀$.
5. Conclusion: Mean is significantly different from 100.
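
The critical values and the decision for this example can be reproduced as follows. This is a sketch assuming a Z statistic and the standard library's `NormalDist.inv_cdf` in place of a Z table; the helper names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def rejection_region(alpha, tail):
    """Return (lower, upper) critical Z values; None means no cutoff on that side."""
    std_normal = NormalDist()
    if tail == "two":
        crit = std_normal.inv_cdf(1 - alpha / 2)
        return -crit, crit
    if tail == "right":
        return None, std_normal.inv_cdf(1 - alpha)
    if tail == "left":
        return std_normal.inv_cdf(alpha), None
    raise ValueError("tail must be 'two', 'right', or 'left'")

def in_rejection_region(z, alpha, tail):
    """True when the statistic falls beyond a critical value."""
    lower, upper = rejection_region(alpha, tail)
    return (lower is not None and z < lower) or (upper is not None and z > upper)

# Example above: sample mean 104, claimed mean 100, sigma 12, n 36
z = (104 - 100) / (12 / sqrt(36))
print(f"z = {z}, reject H0: {in_rejection_region(z, 0.05, 'two')}")
```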

---

## **4. Visualization (Mental Picture)**

* Imagine the normal bell curve.
* For a two-tailed test at α = 0.05, shade 0.025 in each tail (for a one-tailed test, shade all 0.05 in one tail) → these are the **rejection regions**.
* If your Z lands inside the shaded area, reject $H₀$.

---

**Key Difference from p-value method:**

* **Rejection region:** Compare test statistic to critical value.
* **p-value method:** Compare p-value to α.
* Both always lead to the same decision for a given test and α; they are two views of the same comparison.

---

</details>

> # `Type 1 Vs Type 2 Errors`

<details>
<summary>Click to expand</summary>

# **Type I vs Type II Errors**

## **1. The Setup**

In hypothesis testing, we make a decision about the **null hypothesis (H₀)**:

* **Reject H₀**
* **Fail to reject H₀**

But mistakes can happen depending on the **truth in the population**.

---

## **2. The Two Types of Errors**

###  **Type I Error (α)**

* **Definition:** Rejecting $H₀$ when it is actually true.
* Analogy: Convicting an innocent person.
* Probability of Type I Error = **α** (the significance level).
* Example: Claiming a new drug works when in fact it doesn’t.

---

###  **Type II Error (β)**

* **Definition:** Failing to reject $H₀$ when it is actually false.
* Analogy: Letting a guilty person go free.
* Probability of Type II Error = **β**.
* Example: Missing the fact that a new drug actually works (concluding it doesn’t).

---

## **3. Power of a Test**

* **Power = 1 – β**
* Probability of correctly rejecting a false $H₀$.
* A powerful test = low chance of Type II Error.

---

## **4. Summary Table**

| Reality ↓ / Decision → | Fail to Reject H₀ | Reject H₀      |
| ---------------------- | ----------------- | -------------- |
| H₀ True                | ✅ Correct         | ❌ Type I Error |
| H₀ False               | ❌ Type II Error   | ✅ Correct      |
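
The top-right cell of the table can be checked by simulation: when H₀ is true, a test at α = 0.05 should reject in roughly 5% of repeated samples. The sketch below assumes normally distributed data with known σ and reuses the battery numbers (μ₀ = 10, σ = 1.8, n = 36) purely for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def estimate_type_i_rate(mu0=10.0, sigma=1.8, n=36, alpha=0.05, trials=5000, seed=0):
    """Fraction of simulated samples that reject H0 even though H0 is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # Draw the sample from the distribution claimed by H0
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sigma / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1  # a false alarm: Type I error
    return rejections / trials

rate = estimate_type_i_rate()
print(f"Estimated Type I error rate: {rate:.3f}")
```

The estimate fluctuates around α from run to run; changing `seed` or `trials` shows how close the simulation gets.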

---

## **5. In Machine Learning**

* **Type I Error (False Positive):** Model predicts spam when the email is not spam.
* **Type II Error (False Negative):** Model misses spam (predicts not spam when it is spam).
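* For a fixed sample size, lowering α reduces Type I errors but raises the chance of Type II errors; in a classifier, the decision threshold plays the same balancing role.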

---

**Mnemonic:**

* **Type I = False Alarm** (crying wolf when there is no wolf).
* **Type II = Missed Detection** (not crying wolf when there is one).

---
</details>

> # `One Sided vs 2 sided tests`

<details>
<summary>Click to expand</summary>

#  **One-Sided vs Two-Sided Tests**

## **1. Two-Sided (Two-Tailed) Test**

* **Hypotheses:**

  * $H_0: \mu = \mu_0$
  * $H_1: \mu \neq \mu_0$
* We test for **any difference** (greater or smaller).
* Rejection regions: both **tails** of the distribution.
* Example: A company claims average battery life = 10 hrs. You want to test if it’s **different from 10** (could be less OR more).

Critical Z values (α = 0.05): ±1.96.

---

## **2. One-Sided (One-Tailed) Test**

* **Hypotheses:**

  * Right-tailed:

    * $H_0: \mu \leq \mu_0$
    * $H_1: \mu > \mu_0$
  * Left-tailed:

    * $H_0: \mu \geq \mu_0$
    * $H_1: \mu < \mu_0$
* Tests for a **specific direction** (greater than OR less than, not both).
* Rejection region: only **one tail**.
* Example: Testing if a new medicine **increases survival rate** compared to standard. You only care if it’s better, not worse.

Critical Z value (α = 0.05): 1.645 (right tail) or –1.645 (left tail).

---

## **3. Key Differences**

| Feature                                  | One-Sided Test                          | Two-Sided Test     |
| ---------------------------------------- | --------------------------------------- | ------------------ |
| Direction                                | Specific (>, <)                         | Any difference (≠) |
| Rejection region                         | One tail                                | Both tails         |
| Power (for true effect in one direction) | More powerful                           | Less powerful      |
| Risk                                     | Might miss effect in opposite direction | More conservative  |
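
The "Power" and "Risk" rows can be made concrete for a Z-test. This sketch assumes known σ and a true mean shifted by `effect` from μ₀; the numbers (effect 0.5, σ = 1.8, n = 36) and function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_one_sided(effect, sigma, n, alpha=0.05):
    """Power of a right-tailed Z-test when the true mean is mu0 + effect."""
    shift = effect * sqrt(n) / sigma
    return 1 - STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1 - alpha) - shift)

def power_two_sided(effect, sigma, n, alpha=0.05):
    """Power of a two-tailed Z-test for the same true shift."""
    shift = effect * sqrt(n) / sigma
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)

for effect in (0.5, -0.5):
    one = power_one_sided(effect, sigma=1.8, n=36)
    two = power_two_sided(effect, sigma=1.8, n=36)
    print(f"effect {effect:+.1f}: one-sided power = {one:.3f}, two-sided power = {two:.3f}")
```

For the positive effect the right-tailed test is more powerful; for the negative effect it almost never rejects, while the two-sided test keeps the same power in both directions.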

---

## **4. In Machine Learning**

* One-sided: Testing if **model A’s accuracy > model B’s accuracy**.
* Two-sided: Testing if **model A’s accuracy ≠ model B’s accuracy** (could be better OR worse).

---

**Rule of thumb:**

* Use **two-sided** unless you have a strong prior reason to expect an effect only in one direction.

---
</details>

> # `Statistical Power`

<details>
<summary>Click to expand</summary>

###  What is Statistical Power?

* **Definition:** The probability of correctly rejecting the null hypothesis (**H₀**) when the alternative hypothesis (**H₁**) is true.
* In short: it tells us **how good a test is at detecting a real effect**.

Formula (conceptually):

$$
\text{Power} = 1 - \beta
$$

Where:

* **β (Type II Error):** Probability of failing to reject H₀ when H₁ is actually true.
* So, **higher power = lower risk of missing a real effect**.

---

###  Why is Power Important?

* A test with **low power** might miss real effects → false negatives.
* Researchers usually aim for **80% power** (0.8): if the effect is real, the test detects it about 8 times out of 10.

---

###  Factors Affecting Statistical Power

1. **Sample Size (n):** Larger samples = higher power.
2. **Effect Size:** Bigger effect = easier to detect → higher power.
3. **Significance Level (α):** Larger α (e.g., 0.10 instead of 0.05) increases power but also increases chance of Type I error.
4. **Variability (σ²):** Less noise = higher power.
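
Each of the four factors can be varied in turn to see its effect on power. The sketch assumes a right-tailed one-sample Z-test with known σ; the baseline numbers are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a right-tailed one-sample Z-test for a true shift of `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)
    return 1 - std_normal.cdf(z_crit - effect * sqrt(n) / sigma)

baseline = z_test_power(effect=0.5, sigma=1.8, n=36)
print(f"baseline (effect 0.5, sigma 1.8, n 36): {baseline:.3f}")
print(f"1. larger sample (n = 100):             {z_test_power(0.5, 1.8, 100):.3f}")
print(f"2. bigger effect (1.0):                 {z_test_power(1.0, 1.8, 36):.3f}")
print(f"3. larger alpha (0.10):                 {z_test_power(0.5, 1.8, 36, alpha=0.10):.3f}")
print(f"4. less variability (sigma = 1.0):      {z_test_power(0.5, 1.0, 36):.3f}")
```

With an effect of zero the function returns α itself, which matches the definition of the Type I error rate.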

---

###  Simple Example

Suppose we want to test if a new medicine lowers blood pressure:

* **H₀:** No difference (medicine = placebo).

* **H₁:** Medicine lowers BP.

* If the medicine truly works, a **powerful test** will detect it.

* A **low-power test** might miss it and wrongly conclude the medicine doesn’t work.

---

**Memory Trick:**

* **Type I error (α):** False alarm.
* **Type II error (β):** Missed detection.
* **Power = 1 - β:** Ability to catch the signal when it’s really there.

---

</details>

> # `P-Value`

<details>
<summary>Click to expand</summary>

#  **P-Value**

## **1. Definition**

The **p-value** is the **probability of observing the sample data (or something more extreme) assuming the null hypothesis $H_0$ is true**.

In simple words:

* It measures how **compatible your data** is with the null hypothesis.
* Smaller p-value → less compatible → stronger evidence against $H_0$.

---

## **2. Interpreting the P-Value**

* **p ≤ α:** Reject $H_0$ (data is unlikely under H₀)
* **p > α:** Fail to reject $H_0$ (data is consistent with H₀)

---

## **3. Relationship to Significance Level**

* α = significance level (common: 0.05, 0.01)
* p-value is **compared to α**:

  * If **p ≤ 0.05**, result is statistically significant.
  * If **p > 0.05**, result is not significant.

---

## **4. Example**

A sample of 36 phones has mean battery = 9.5, population mean = 10, σ = 1.8.

* Compute Z = -1.67 (from previous Z-test example).
* **Two-tailed p-value**:

$$
p = 2 \cdot P(Z \le -1.67) \approx 2 \cdot 0.0475 = 0.095
$$

* Compare to α = 0.05 → **p > 0.05**, fail to reject $H_0$.
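
The p-value in this example can be computed without a Z table. A minimal sketch, assuming the standard library's `NormalDist` for the normal CDF; the `tail` argument is an illustrative addition that also covers one-tailed tests.

```python
from statistics import NormalDist

def z_p_value(z, tail="two"):
    """p-value for a Z statistic, using the standard normal as the H0 reference."""
    std_normal = NormalDist()
    if tail == "two":
        return 2 * std_normal.cdf(-abs(z))
    if tail == "left":
        return std_normal.cdf(z)
    if tail == "right":
        return 1 - std_normal.cdf(z)
    raise ValueError("tail must be 'two', 'left', or 'right'")

p = z_p_value(-1.67)           # rounded Z from the example
print(f"two-tailed p = {p:.3f}")
print("reject H0" if p <= 0.05 else "fail to reject H0")
```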

---

## **5. Misconceptions**

* ❌ P-value is **not** the probability that $H_0$ is true.
* ❌ P-value is **not** the probability that results happened by chance.
* ✅ It’s the probability of seeing data as extreme (or more extreme) if H₀ is true.

---

## **6. In Machine Learning**

* Test if a feature **significantly correlates with target**.
* Compare **model performance metrics** between two models.
* Evaluate **A/B testing** results for algorithm changes.

---

**Mnemonic:**
**“p-value = How surprising is my data if H₀ were true?”**

---

</details>

> # `How to interpret P-values`

<details>
<summary>Click to expand</summary>

#  **Interpreting P-Values**

## **1. Compare P-Value to Significance Level (α)**

* **Step 1:** Choose a significance level α (common choices: 0.05 or 0.01).
* **Step 2:** Look at your p-value.

  * **p ≤ α → Reject H₀** (evidence suggests an effect exists)
  * **p > α → Fail to reject H₀** (insufficient evidence to claim an effect)

---

## **2. Understanding the Magnitude**

* **Very small p-value (e.g., < 0.01)** → Strong evidence **against H₀**.
* **Moderate p-value (e.g., 0.01 – 0.05)** → Moderate evidence **against H₀**.
* **Large p-value (e.g., > 0.05)** → Weak or no evidence **against H₀**.

---

## **3. Examples**

| P-value | Interpretation                              |
| ------- | ------------------------------------------- |
| 0.001   | Very strong evidence against H₀ → reject H₀ |
| 0.03    | Moderate evidence against H₀ → reject H₀    |
| 0.07    | Weak evidence → fail to reject H₀           |
| 0.5     | No evidence → fail to reject H₀             |

---

## **4. Common Misinterpretations**

* ❌ A p-value is **not** the probability that H₀ is true.
* ❌ A p-value is **not** the probability that your data happened by chance.
* ✅ Correct view: **P-value = probability of observing your data (or more extreme) if H₀ were true.**

---

## **5. P-Values in Machine Learning**

* Determine whether a **feature is statistically significant** for prediction.
* Test whether **model improvements** are meaningful.
* Evaluate **A/B tests** for algorithm changes in production.

---

**Quick tip:**

* **Small p-value → reject H₀** → effect likely exists.
* **Large p-value → fail to reject H₀** → effect not supported.

---

</details>

> # `Types of Hypothesis Tests`

<details>
<summary>Click to expand</summary>

#  **Types of Hypothesis Tests**

## **1. Z-Test**

* **Purpose:** Test a population mean (or proportion) when **σ is known** or **large sample size (n ≥ 30)**.
* **Types:**

  * **One-sample Z-test** → Compare sample mean to population mean.
  * **Two-sample Z-test** → Compare means of two independent samples.
  * **Proportion Z-test** → Test population proportion.
* **Example:** Is the average battery life of phones = 10 hours?

---

## **2. T-Test**

* **Purpose:** Test means when **population σ unknown** and/or **small sample size (n < 30)**.
* **Types:**

  1. **One-sample t-test** → Compare sample mean to hypothesized mean.
  2. **Independent two-sample t-test** → Compare means of two independent groups.
  3. **Paired t-test** → Compare means of **related samples** (before-after studies).
* **Example:** Does a new teaching method increase student scores compared to old method?

---

## **3. Chi-Square Test (χ²)**

* **Purpose:** Test relationships between **categorical variables**.
* **Types:**

  1. **Chi-square goodness-of-fit** → Does observed data fit a specific distribution?
  2. **Chi-square test of independence** → Are two categorical variables independent?
* **Example:** Is gender independent of preference for a product?

---

## **4. ANOVA (Analysis of Variance)**

* **Purpose:** Compare **means of 3 or more groups**.
* **One-way ANOVA:** One factor, multiple groups.
* **Two-way ANOVA:** Two factors.
* **Example:** Do exam scores differ across three teaching methods?

---

## **5. Non-Parametric Tests**

* Used when **data doesn’t meet normality assumptions**.
* Examples:

  * **Mann-Whitney U Test** → Compare two independent samples.
  * **Wilcoxon Signed-Rank Test** → Paired samples.
  * **Kruskal-Wallis Test** → Compare more than two groups.

---

## **6. Other Specialized Tests**

* **F-Test:** Compare variances of two populations.
* **Proportion Test:** Compare population proportions (one-sample or two-sample).
* **Correlation Tests:** Test if correlation ≠ 0 (e.g., Pearson correlation).

---

**Quick Summary:**

| Test Type      | Data Type             | Purpose                              |
| -------------- | --------------------- | ------------------------------------ |
| Z-test         | Continuous            | Known σ, test mean or proportion     |
| T-test         | Continuous            | Unknown σ, small sample              |
| Chi-square     | Categorical           | Test independence or goodness-of-fit |
| ANOVA          | Continuous            | Compare 3+ group means               |
| Non-parametric | Continuous or Ordinal | Data doesn’t fit normality           |

---

**ML Connection:**

* **T-tests & ANOVA:** Compare model metrics across groups.
* **Chi-square:** Feature selection for categorical variables.
* **Non-parametric tests:** Useful when data is skewed or ordinal.

---

</details>

> # `Z-Test`

<details>
<summary>Click to expand</summary>

#  **Z-Test**

## **1. Definition**

A **Z-test** is a statistical test used to determine if there is a **significant difference** between a sample mean (or proportion) and a population mean (or proportion), **when the population standard deviation (σ) is known** or the sample size is large (n ≥ 30).

---

## **2. Types of Z-Tests**

1. **One-Sample Z-Test**

   * Compare a **sample mean** to a **known population mean**.
   * Example: Is the average battery life of 36 phones different from 10 hours?

2. **Two-Sample Z-Test**

   * Compare the means of **two independent samples**.
   * Example: Compare average exam scores of students in Class A vs Class B.

3. **Z-Test for Proportions**

   * Compare a sample proportion with a population proportion or compare **two proportions**.
   * Example: Is the proportion of students passing a test different from 0.8?

---

## **3. Formula**

### **One-Sample Z-Test (Mean)**

$$
Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}
$$

Where:

* $\bar{X}$ = sample mean
* $\mu_0$ = population mean (H₀)
* σ = population standard deviation
* n = sample size

---

### **Two-Sample Z-Test (Mean)**

$$
Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}
$$
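
A sketch of the two-sample formula, assuming H₀: μ₁ = μ₂, known population standard deviations, and a two-tailed test. The class statistics below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_test(mean1, mean2, sigma1, sigma2, n1, n2):
    """Two-tailed Z-test of H0: mu1 = mu2 with known population sigmas."""
    standard_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    z = (mean1 - mean2) / standard_error
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value

# Hypothetical exam data for Class A and Class B.
z, p_value = two_sample_z_test(mean1=78, mean2=75, sigma1=10, sigma2=12, n1=50, n2=60)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```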

---

### **Z-Test for Proportion**

$$
Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}
$$

* $\hat{p}$ = sample proportion
* $p_0$ = population proportion

---

## **4. Steps to Perform Z-Test**

1. **State Hypotheses**

   * Example: H₀: μ = 10, H₁: μ ≠ 10

2. **Choose α (significance level)**

   * Usually 0.05 → Z critical = ±1.96 for two-tailed test

3. **Compute Z statistic** using formula

4. **Decision Rule**

   * If |Z| > Z critical → reject H₀
   * If |Z| ≤ Z critical → fail to reject H₀

5. **Conclusion**

   * Interpret in context of the problem

---

## **5. Example (One-Sample Z-Test)**

* Sample: n = 36 phones, mean battery life = 9.5 hrs
* Population mean: μ₀ = 10 hrs, σ = 1.8

$$
Z = \frac{9.5 - 10}{1.8 / \sqrt{36}} = \frac{-0.5}{0.3} = -1.67
$$

* Critical Z = ±1.96 (α = 0.05, two-tailed)
* |–1.67| < 1.96 → **Fail to reject H₀**
* Conclusion: No significant difference in battery life.

---

## **6. When to Use Z-Test**

* Known population σ
* Large sample size (n ≥ 30), or data roughly normally distributed

---

## **7. Machine Learning Connection**

* Compare model accuracies to a benchmark.
* Compare conversion rates in A/B tests.
* Feature significance testing for continuous variables.

---

**Quick Memory Trick:**
**Z-Test = “How many standard errors away is my sample mean from the population mean?”**

---

</details>

> # `T-Test`

<details>
<summary>Click to expand</summary>

#  **T-Test**

## **1. Definition**

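A **t-test** is a statistical test used to compare **means** when:

* The **population standard deviation (σ) is unknown** and is estimated by the sample standard deviation $s$
* The **sample size is small (n < 30)**, where this extra uncertainty matters most

It uses the **t-distribution**, which is similar to the normal distribution but has **fatter tails** to account for the extra variability of estimating σ from a small sample; as n grows, it approaches the normal distribution.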

---

## **2. Types of T-Tests**

### **a) One-Sample T-Test**

* Compare a **sample mean** to a **known or hypothesized population mean**.
* Example: Does the average test score of 15 students differ from 75?

**Formula:**

$$
t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}
$$

Where:

* $\bar{X}$ = sample mean
* $s$ = sample standard deviation
* n = sample size
* $\mu_0$ = hypothesized population mean

---

### **b) Independent Two-Sample T-Test**

* Compare **means of two independent groups**.
* Example: Compare exam scores of Class A vs Class B.

**Formula (assuming equal variances):**

$$
t = \frac{\bar{X}_1 - \bar{X}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}
$$

Where $s_p$ = pooled standard deviation:

$$
s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}
$$
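
The pooled formula can be sketched with the standard library. The scores are hypothetical, and because the standard library has no t-distribution, the sketch stops at the statistic and degrees of freedom, which would then be compared with a t-table value.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(group1, group2):
    """t statistic and degrees of freedom for the equal-variance two-sample t-test."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    t = (mean(group1) - mean(group2)) / (sqrt(pooled_var) * sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical exam scores for two classes.
class_a = [82, 85, 88, 90, 79]
class_b = [75, 78, 80, 72, 77]
t, df = pooled_t_statistic(class_a, class_b)
print(f"t = {t:.3f}, df = {df}")
```

For df = 8 a t-table gives a two-tailed critical value of about 2.306 at α = 0.05, so these made-up classes would differ significantly.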

---

### **c) Paired Sample T-Test**

* Compare **means of two related samples** (before-after, matched pairs).
* Example: Student scores **before and after** a new teaching method.

**Formula:**

$$
t = \frac{\bar{d}}{s_d / \sqrt{n}}
$$

Where:

* $\bar{d}$ = mean of differences
* $s_d$ = standard deviation of differences
* n = number of pairs
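
A sketch of the paired formula: the test reduces to a one-sample t-test on the per-pair differences. The before/after scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(before, after):
    """t statistic and degrees of freedom for a paired-sample t-test."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    differences = [a - b for a, b in zip(after, before)]
    n = len(differences)
    t = mean(differences) / (stdev(differences) / sqrt(n))
    return t, n - 1

# Hypothetical scores for the same five students before and after.
before = [70, 68, 75, 80, 72]
after = [74, 70, 78, 85, 73]
t, df = paired_t_statistic(before, after)
print(f"t = {t:.3f}, df = {df}")
```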

---

## **3. Steps to Perform a T-Test**

1. **State Hypotheses**

   * Example: H₀: μ = 75, H₁: μ ≠ 75

2. **Choose α (significance level)**, usually 0.05

3. **Compute t statistic** using the formula

4. **Compare with Critical t value** (from t-table, depends on df = n–1) or use **p-value**

5. **Decision**

   * |t| > t critical → reject H₀
   * |t| ≤ t critical → fail to reject H₀

---

## **4. Example (One-Sample T-Test)**

* Sample: n = 10 students, mean = 78, sample sd = 5
* Population mean: μ₀ = 75

$$
t = \frac{78 - 75}{5 / \sqrt{10}} = \frac{3}{1.58} \approx 1.90
$$

* df = 10–1 = 9
* α = 0.05 → two-tailed t critical ≈ ±2.262
* |1.90| < 2.262 → **Fail to reject H₀**
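
The arithmetic of this example can be checked in a few lines. The critical value 2.262 is taken from the example rather than computed, since the standard library has no t-distribution; the function name is illustrative.

```python
from math import sqrt

def one_sample_t_statistic(sample_mean, mu0, sample_sd, n):
    """t statistic and degrees of freedom for a one-sample t-test."""
    t = (sample_mean - mu0) / (sample_sd / sqrt(n))
    return t, n - 1

T_CRITICAL_DF9 = 2.262  # two-tailed, alpha = 0.05, df = 9 (t-table value)

t, df = one_sample_t_statistic(sample_mean=78, mu0=75, sample_sd=5, n=10)
reject = abs(t) > T_CRITICAL_DF9
print(f"t = {t:.2f}, df = {df}, reject H0: {reject}")
```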

---

## **5. When to Use T-Test**

* **Unknown σ**
* **Small sample (n < 30)**
* Data roughly **normally distributed**

---

## **6. Machine Learning Applications**

* Compare **model performance metrics** for small datasets.
* Test whether a **feature significantly affects target**.
* Compare **before-after effect of algorithm changes**.

---

**Quick Memory Trick:**

* **Z-test:** σ known or large n
* **T-test:** σ unknown, small n

---
</details>

> # `Chi-square Test`

<details>
<summary>Click to expand</summary>

#  **Chi-Square (χ²) Test**

## **1. Definition**

The **Chi-Square test** is a **non-parametric statistical test** used to examine whether there is a significant association between **categorical variables** or whether an observed frequency distribution matches an expected distribution.

---

## **2. Types of Chi-Square Tests**

### **a) Chi-Square Goodness-of-Fit Test**

* **Purpose:** Test if a **single categorical variable** follows a specific distribution.
* **Example:** Are dice rolls uniform across 1–6?

**Formula:**

$$
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
$$

Where:

* $O_i$ = observed frequency
* $E_i$ = expected frequency
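
A sketch of the goodness-of-fit statistic for the dice question, using hypothetical counts from 60 rolls. A fair die expects 10 rolls per face, and df = number of categories – 1.

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic and its degrees of freedom."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1

# Hypothetical counts for faces 1 to 6 from 60 rolls.
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6
statistic, df = chi_square_gof(observed, expected)
print(f"chi2 = {statistic:.2f}, df = {df}")
```

For df = 5 the χ² table gives a critical value of about 11.07 at α = 0.05, so counts this close to uniform would not lead to rejecting H₀.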

---

### **b) Chi-Square Test of Independence**

* **Purpose:** Test if **two categorical variables are independent**.
* **Example:** Is gender independent of choosing a computer course?

**Formula (same as above):**

$$
\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$

Where:

* $O_{ij}$ = observed frequency in cell i,j
* $E_{ij} = \frac{(\text{row total} \cdot \text{column total})}{\text{grand total}}$

---

## **3. Steps to Perform Chi-Square Test**

1. **State Hypotheses**

   * Goodness-of-fit: H₀ = observed frequencies match expected
   * Independence: H₀ = variables are independent

2. **Set significance level α** (common: 0.05)

3. **Calculate χ² statistic** using observed and expected frequencies

4. **Find critical χ² value** from χ² table with appropriate **degrees of freedom (df)**:

   * Goodness-of-fit: df = number of categories – 1
   * Independence: df = (rows–1) × (columns–1)

5. **Decision**

   * χ² > χ² critical → reject H₀
   * χ² ≤ χ² critical → fail to reject H₀

---

## **4. Example (Test of Independence)**

Survey of 100 students:

* **Gender:** Male/Female
* **Course Preference:** Data Science / AI

|        | DS | AI | Total |
| ------ | -- | -- | ----- |
| Male   | 20 | 30 | 50    |
| Female | 25 | 25 | 50    |
| Total  | 45 | 55 | 100   |

**Expected frequencies:**

* Male & DS = (50 × 45)/100 = 22.5
* Male & AI = (50 × 55)/100 = 27.5
* Female & DS = 22.5, Female & AI = 27.5

$$
\chi^2 = \frac{(20-22.5)^2}{22.5} + \frac{(30-27.5)^2}{27.5} + \frac{(25-22.5)^2}{22.5} + \frac{(25-27.5)^2}{27.5} \approx 1.01
$$

* df = (2–1) × (2–1) = 1
* χ² critical at α = 0.05 → 3.841
* 1.01 < 3.841 → **Fail to reject H₀** → no significant evidence that gender and course preference are associated
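
The expected counts and the statistic can be recomputed directly from the survey table. This sketch uses plain Python lists; the function name is illustrative and the critical value 3.841 is the table value quoted in the example.

```python
def chi_square_independence(table):
    """Chi-square statistic, df, and expected counts for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count = (row total * column total) / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df, expected

observed = [[20, 30],   # Male:   DS, AI
            [25, 25]]   # Female: DS, AI
statistic, df, expected = chi_square_independence(observed)
CHI2_CRITICAL_DF1 = 3.841  # alpha = 0.05, df = 1 (table value)
print(f"chi2 = {statistic:.2f}, df = {df}, reject H0: {statistic > CHI2_CRITICAL_DF1}")
```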

---

## **5. When to Use**

* Data is **categorical**
* Observations are **independent**
* Expected frequency in each cell ≥ 5 (rule of thumb)

---

## **6. Machine Learning Applications**

* Feature selection: test if categorical features are **dependent on target**
* Market research: test if user preferences differ by demographics
* A/B testing for categorical outcomes

---

**Mnemonic:**

* **Goodness-of-fit → single variable**
* **Independence → two variables**
* χ² measures **difference between observed & expected frequencies**

---

</details>

> # `ANOVA (Analysis of Variance)`

<details>
<summary>Click to expand</summary>

# **ANOVA (Analysis of Variance)**

## **1. Definition**

ANOVA is a statistical test used to determine if there are **significant differences between the means of 3 or more groups**.

* Instead of doing multiple t-tests (which inflates the overall Type I error rate), ANOVA compares all group means **in one test**.

---

## **2. Idea Behind ANOVA**

It looks at **variability** in data:

* **Between-group variability**: how much the group means differ from each other.
* **Within-group variability**: how much individuals differ within each group.

**If between-group variability is large relative to within-group variability → the group means likely differ.**

---

## **3. Hypotheses**

* **Null hypothesis (H₀):** All group means are equal.
* **Alternative hypothesis (H₁):** At least one mean is different.

---

## **4. Types of ANOVA**

1. **One-way ANOVA**

   * One factor (independent variable), 3+ groups.
   * Example: Do exam scores differ across three teaching methods?

2. **Two-way ANOVA**

   * Two factors (e.g., teaching method **and** gender).
   * Tests main effects + interaction effects.

3. **Repeated Measures ANOVA**

   * Same subjects measured under different conditions.
   * Example: Blood pressure measured before, during, and after treatment.

---

## **5. Test Statistic (F-ratio)**

$$
F = \frac{\text{Between-group variance}}{\text{Within-group variance}}
$$

* If **F is large**, groups differ significantly.
* Compare with **F-critical** value from F-distribution (df between, df within).

---

## **6. Steps in One-Way ANOVA**

1. **State hypotheses** (H₀: μ₁ = μ₂ = μ₃ …).
2. **Set significance level (α = 0.05)**.
3. **Calculate F-statistic** (ANOVA table).
4. **Compare F with critical F** or check **p-value**.

   * If p < α → reject H₀.
5. If significant → do **post-hoc tests** (like Tukey’s HSD) to see **which groups differ**.

---

## **7. Example (One-Way ANOVA)**

Suppose exam scores of students taught by 3 methods:

* Method A: [85, 90, 88]
* Method B: [78, 74, 80]
* Method C: [92, 95, 91]

ANOVA compares:

* Between-group variance (differences in averages).
* Within-group variance (spread inside each method).

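For these scores, F ≈ 27.5 with df = (2, 6), well above the critical value of about 5.14 at α = 0.05 (so p < 0.05) → reject H₀ → at least one teaching method has a different mean score.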
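
The F-ratio for the three score lists can be computed from the between-group and within-group sums of squares. A minimal sketch with the standard library only; the function name is illustrative, and the p-value is left to an F table or a statistics package.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic and (df_between, df_within) for a one-way ANOVA."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # Between-group: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group: spread of observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

methods = {"A": [85, 90, 88], "B": [78, 74, 80], "C": [92, 95, 91]}
f_stat, df_between, df_within = one_way_anova_f(list(methods.values()))
print(f"F = {f_stat:.2f}, df = ({df_between}, {df_within})")
```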

---

## **8. Applications in Machine Learning**

* Model comparison: Compare accuracy across 3+ ML models.
* Feature testing: Compare outcomes across multiple categories of a feature.
* A/B/n testing: Test if multiple experimental groups differ significantly.

---

## **9. Limitations**

* Assumes **normality** and **equal variances** across groups.
* If violated → use **non-parametric alternatives** (Kruskal-Wallis test).

---

**Quick Mnemonic:**

* **T-test** = compare 2 means
* **ANOVA** = compare 3+ means
* **F-test** inside ANOVA tells us if differences are real

---

</details>