> **What is statistics?**

`Statistics supplies principled ways to describe data, quantify uncertainty, make inferences from samples to populations, and construct interpretable models, all of which are foundational for trustworthy machine learning (ML).`

# Main Types of Statistics

<details>
<summary>Click to expand</summary>

### 1. **Descriptive Statistics**

* **Purpose:** To **summarize and describe** data.
* **What it does:** Tells us *what the data shows* (without making predictions or generalizations).
* **Examples:**

  * Mean, median, mode (measures of central tendency).
  * Variance, standard deviation, range, IQR (measures of spread).
  * Graphs & charts: histograms, bar charts, boxplots.
* **ML use:** Understanding dataset distribution, detecting outliers, feature scaling.

---

### 2. **Inferential Statistics**

* **Purpose:** To **make predictions, decisions, or generalizations** about a population based on a sample.
* **What it does:** Uses probability to **infer** characteristics of the larger group.
* **Examples:**

  * Hypothesis testing (t-test, chi-square, ANOVA).
  * Confidence intervals.
  * Regression analysis.
* **ML use:**

  * Comparing two models’ performance (is A better than B?).
  * Determining if a feature has significant predictive power.
  * Generalizing from training data to unseen test data.

---

### Subfields Often Mentioned

* **Predictive Statistics** → focused on making forecasts (closely tied to machine learning).
* **Exploratory Data Analysis (EDA)** → part of descriptive stats but used for **discovering patterns** in data.
* **Bayesian Statistics** → alternative to classical (frequentist) inference; uses prior knowledge + data to update beliefs.

---

**Quick Analogy:**

* *Descriptive statistics* = "This class of 30 students has an average height of 165 cm."
* *Inferential statistics* = "Based on this sample, we estimate that the average height of **all students in the school** is around 165 cm, with a 95% confidence interval."
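
The analogy can be sketched in a few lines of Python. This is a minimal illustration, not a definitive recipe: the `heights` values are made up, and the interval uses a normal approximation (a t-based interval would be slightly wider for a sample this small).

```python
import math
import statistics

# Hypothetical sample of student heights in cm (illustrative values only).
heights = [158, 162, 165, 167, 170, 163, 166, 169, 161, 164]

n = len(heights)
sample_mean = statistics.mean(heights)      # descriptive: summarizes this sample
sample_sd = statistics.stdev(heights)       # sample SD (divides by n - 1)
standard_error = sample_sd / math.sqrt(n)

# Inferential: approximate 95% interval for the population mean.
z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(f"sample mean = {sample_mean:.1f} cm")
print(f"approx. 95% CI = ({ci_low:.1f}, {ci_high:.1f}) cm")
```

The first half only describes the sample; the second half makes a statement about the wider population, with the interval expressing how uncertain that statement is.
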
</details>


# Types of Data in Statistics

<details>
<summary>Click to expand</summary>

Data can be broadly divided into **Qualitative (Categorical)** and **Quantitative (Numerical)**.

---

### 1. **Qualitative (Categorical) Data**

* Represents **qualities, categories, or labels** (not numbers you can calculate with).
* **Subtypes:**

  * **Nominal Data** → categories **without order**.

    * Examples: gender (male/female), colors (red/blue/green), city names.
    * ML use: one-hot encoding before feeding to models.
  * **Ordinal Data** → categories **with a natural order/rank**.

    * Examples: ratings (bad/average/good/excellent), education level (high school, bachelor, master).
    * ML use: label (integer) encoding can work, provided the integer codes follow the natural order of the categories.
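
A minimal sketch of the two encodings in plain Python, assuming small hand-written category lists (`colors`, `ratings`, and `rating_order` are illustrative names, not part of any library):

```python
# Nominal feature: no order, so each category gets its own 0/1 column.
colors = ["red", "blue", "green", "blue"]
color_levels = sorted(set(colors))          # ['blue', 'green', 'red']
one_hot = [[int(c == level) for level in color_levels] for c in colors]

# Ordinal feature: the rank order is stated explicitly, not inferred.
rating_order = ["bad", "average", "good", "excellent"]
rating_rank = {label: rank for rank, label in enumerate(rating_order)}
ratings = ["good", "bad", "excellent", "average"]
ordinal_codes = [rating_rank[r] for r in ratings]

print(one_hot)        # one row per sample, one column per color
print(ordinal_codes)  # integers that respect bad < average < good < excellent
```

Writing the order out by hand matters for ordinal data: an encoder that assigns codes alphabetically would not reproduce the intended ranking.
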

---

### 2. **Quantitative (Numerical) Data**

* Represents **numbers you can measure or count**.
* **Subtypes:**

  * **Discrete Data** → countable values (whole numbers).

    * Examples: number of students in class, number of cars in parking lot.
    * ML use: Poisson models, classification features.
  * **Continuous Data** → measurable values, can take **any value in a range**.

    * Examples: height, weight, temperature, time.
    * ML use: regression models, normalization, scaling.

---

### Summary Table

| **Type**     | **Subtype** | **Examples**                  | **ML Usage**        |
| ------------ | ----------- | ----------------------------- | ------------------- |
| Qualitative  | Nominal     | Gender, color, city name      | One-hot encoding    |
|              | Ordinal     | Rating scale, education level | Label encoding      |
| Quantitative | Discrete    | No. of students, goals scored | Count models        |
|              | Continuous  | Height, weight, temperature   | Regression, scaling |

---

**Quick Analogy:**

* *Nominal* = Names only (no order).
* *Ordinal* = Order matters.
* *Discrete* = Counting numbers.
* *Continuous* = Measuring numbers.

</details>

# Measure of Central Tendency

<details>
<summary>Click to expand</summary>
    
### Definition

* **Central tendency** = a single value that represents the **center** or **typical value** of a dataset.
* It helps us understand **where most data points lie**.

---

### Main Measures

### 1. **Mean (Arithmetic Average)**

* Formula:

  $$
  \bar{x} = \frac{\sum_{i=1}^n x_i}{n}
  $$
* Example: marks = \[10, 20, 30] → mean = (10+20+30)/3 = 20
* **Use in ML:** feature scaling, finding average error (MSE).
* **Limitation:** sensitive to outliers (e.g., a few very high incomes pull the mean upward).

---

### 2. **Median (Middle Value)**

* Arrange data in order, pick the middle.

  * If odd $n$: middle value.
  * If even $n$: average of two middle values.
* Example: \[10, 20, 30] → median = 20
  \[10, 20, 30, 100] → median = (20+30)/2 = 25
* **Use in ML:** robust to outliers, used in imputation (replace missing values with median).

---

### 3. **Mode (Most Frequent Value)**

* Value that occurs **most often**.
* Example: \[2, 2, 3, 4] → mode = 2
* **Use in ML:** categorical feature analysis (e.g., most common category).

---

### Comparison

| Measure | Best When                              | Weakness                           |
| ------- | -------------------------------------- | ---------------------------------- |
| Mean    | Data is symmetric, no extreme outliers | Sensitive to outliers              |
| Median  | Data is skewed or has outliers         | Uses only rank order, not how far values lie from the center |
| Mode    | Categorical or discrete data           | May not exist or may not be unique |
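
The three measures can be checked with Python's standard `statistics` module. The sketch below reuses the small lists from the examples in this section and assumes Python 3.8+ for `multimode`.

```python
import statistics

marks = [10, 20, 30]
marks_with_outlier = [10, 20, 30, 100]

mean_clean = statistics.mean(marks)                      # 20
mean_outlier = statistics.mean(marks_with_outlier)       # 40: pulled up by 100
median_clean = statistics.median(marks)                  # 20
median_outlier = statistics.median(marks_with_outlier)   # 25.0: barely moves

single_mode = statistics.mode([2, 2, 3, 4])              # 2
all_modes = statistics.multimode([1, 1, 2, 2, 3])        # [1, 2]: mode is not unique

print(mean_clean, mean_outlier, median_clean, median_outlier)
print(single_mode, all_modes)
```

One extra value of 100 doubles the mean but shifts the median only from 20 to 25, which is the outlier sensitivity listed in the table.
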

---

**Quick Analogy:**

* **Mean** = “average of everyone’s salary.”
* **Median** = “the middle person’s salary if everyone stands in line.”
* **Mode** = “the most common salary in the group.”
</details>

# Measure of Dispersion

<details>
<summary>Click to expand</summary>

###  Definition

* **Dispersion** = how much the data **varies or spreads out** around the central value (mean/median).
* Helps us know:

  * Is data **tightly clustered** or **widely scattered**?
  * Are averages reliable?

---

### Main Measures of Dispersion

### 1. **Range**

* Formula:

  $$
  \text{Range} = \text{Maximum value} - \text{Minimum value}
  $$
* Example: \[10, 20, 30, 40] → Range = 40 – 10 = 30
* **Simple but weak** (depends only on 2 extreme values).

---

### 2. **Interquartile Range (IQR)**

* Formula:

  $$
  IQR = Q_3 - Q_1
  $$

  where $Q_1$ = 25th percentile, $Q_3$ = 75th percentile
* Example: data \[10, 20, 30, 40, 50]

  * $Q_1 = 20$, $Q_3 = 40$ → IQR = 20
* **Use:** robust to outliers, good for skewed data.

---

### 3. **Variance**

* Formula:

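  $$
  \sigma^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n}
  $$

  This is the **population** variance; when the data is a sample, the usual estimate divides by $n - 1$ instead of $n$.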
* Example: \[2, 4, 6]

  * Mean = 4
  * Variance = ((2-4)² + (4-4)² + (6-4)²)/3 = (4+0+4)/3 = 8/3 ≈ 2.67
* **Interpretation:** larger variance = more spread out.

---

### 4. **Standard Deviation (SD)**

* Formula:

  $$
  \sigma = \sqrt{\sigma^2}
  $$
* Example: same data \[2, 4, 6] → SD = √2.67 ≈ 1.63
* **Most widely used** → because it’s in the same unit as the data.
* **ML use:** feature scaling (z-score normalization uses mean & SD).
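
As a rough sketch of z-score normalization on the same `[2, 4, 6]` data, assuming the population SD (dividing by $n$) as in the variance formula of this section:

```python
import statistics

values = [2, 4, 6]
mean = statistics.fmean(values)
sd = statistics.pstdev(values)   # population SD, about 1.63

# z-score: how many standard deviations each value lies from the mean.
z_scores = [(x - mean) / sd for x in values]

print(z_scores)
```

After scaling, the values have mean 0 and standard deviation 1, so features measured in different units become comparable.
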

---

### 5. **Coefficient of Variation (CV)**

* Formula:

  $$
  CV = \frac{\sigma}{\bar{x}} \times 100\%
  $$
* Shows **relative variation** → how big SD is compared to mean.
* Useful to compare variability between datasets with different units.

---

### Comparison Table

| Measure  | Formula                   | Notes                                |
| -------- | ------------------------- | ------------------------------------ |
| Range    | Max – Min                 | Quick, but depends on extremes       |
| IQR      | Q3 – Q1                   | Robust to outliers                   |
| Variance | Avg of squared deviations | Hard to interpret (squared units)    |
| SD       | √Variance                 | Most common, same unit as data       |
| CV       | SD ÷ Mean × 100           | Compares variability across datasets |
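
All five measures in the table can be computed with the standard library. This sketch assumes Python 3.8+ and picks the `"inclusive"` quartile method because it reproduces the $Q_1 = 20$, $Q_3 = 40$ values of the IQR example; other quartile conventions can give slightly different numbers.

```python
import statistics

data = [10, 20, 30, 40, 50]

value_range = max(data) - min(data)
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

pop_variance = statistics.pvariance(data)     # divides by n
sample_variance = statistics.variance(data)   # divides by n - 1
sd = statistics.pstdev(data)
cv_percent = sd / statistics.fmean(data) * 100

print(value_range, iqr, pop_variance, sample_variance)
print(round(sd, 2), round(cv_percent, 2))
```
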

---

 **Quick Analogy:**

* Imagine students’ heights:

  * If everyone is \~165 cm, dispersion is **small**.
  * If some are 150 cm and some 190 cm, dispersion is **large**.

</details>

# Quantiles and Percentiles

<details>
<summary>Click to expand</summary>

### **1. Quantiles**

* **Definition**: Quantiles are cut points that divide a sorted dataset into parts containing **equal shares of the observations** (the parts need not be equally wide).
* **General form**:
  If data is divided into **q equal parts**, each part represents a **1/q portion** of the data.

### **Common Quantiles**

* **Quartiles** (q = 4) → divide data into **4 equal parts**

  * Q1 = 25% point (1st quartile)
  * Q2 = 50% point (Median)
  * Q3 = 75% point (3rd quartile)
* **Quintiles** (q = 5) → divide into 5 equal parts
* **Deciles** (q = 10) → divide into 10 equal parts
* **Percentiles** (q = 100) → divide into 100 equal parts

**Example**:
If students’ scores are sorted,

* Q1 = 25th percentile (25% of students scored below this)
* Q2 = 50th percentile (median)
* Q3 = 75th percentile
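
A small sketch of quartiles, deciles, and percentiles using `statistics.quantiles` (Python 3.8+). The `scores` list is hypothetical, and the `"inclusive"` method is one of several interpolation conventions, so other tools may report slightly different cut points.

```python
import statistics

scores = [55, 60, 62, 68, 70, 71, 75, 80, 85, 90, 95]  # hypothetical exam scores

# quantiles(..., n=q) returns the q - 1 cut points between the q parts.
quartiles = statistics.quantiles(scores, n=4, method="inclusive")
deciles = statistics.quantiles(scores, n=10, method="inclusive")
percentiles = statistics.quantiles(scores, n=100, method="inclusive")

p25 = percentiles[24]   # index p - 1 holds the p-th percentile
p90 = percentiles[89]

print(quartiles)        # [Q1, Q2, Q3]
print(p25, p90)
```

Here `p25` equals `quartiles[0]` and `quartiles[1]` equals the median, which is the quartile/percentile correspondence described in this section.
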

---

### **2. Percentiles**

* **Definition**: The **p-th percentile** is the value below which **p% of the observations** fall.
* Range: 0th percentile = minimum, 100th percentile = maximum.
* **Common use**:

  * 90th percentile → value below which 90% of data falls.
  * 99th percentile → helps detect extreme outliers.

**Example**:
In an exam, if your score is at the **80th percentile**, it means **you scored better than 80% of students**.

---

### **3. Relation Between Quantiles & Percentiles**

* Quantiles are the general concept; percentiles are the **special case** with 100 parts.
* **Q1 = 25th percentile**, **Q2 = 50th percentile**, **Q3 = 75th percentile**.

---

### **4. Applications in Machine Learning**

* **Outlier detection**: Using IQR (Q3 − Q1).

  * If a data point < Q1 − 1.5×IQR or > Q3 + 1.5×IQR → Outlier.
* **Feature scaling**: `RobustScaler` in scikit-learn uses the **median and IQR** (instead of the mean and standard deviation).
* **Evaluation metrics**: Percentiles help in measuring **model performance distribution** (e.g., latency in production systems).
* **Handling skewed data**: Percentile-based (quantile) binning turns a skewed numerical feature into categories with roughly equal counts.
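
The IQR outlier rule and median/IQR scaling can be sketched without any ML library. The `values` list is hypothetical, and the scaling line only mirrors the median-and-IQR idea; it is not a reimplementation of any particular scaler.

```python
import statistics

# Hypothetical feature values with one extreme point.
values = [12, 14, 15, 15, 16, 18, 19, 21, 95]

q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Outlier rule: anything outside the two fences is flagged.
outliers = [x for x in values if x < lower_fence or x > upper_fence]

# Robust scaling: center on the median, divide by the IQR.
robust_scaled = [(x - median) / iqr for x in values]

print(outliers)
print(robust_scaled)
```

Because the median and IQR ignore the extreme value, the bulk of the scaled data stays in a narrow band around 0 while the outlier stands out clearly.
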

---

**Quick Recap**

* Quantiles = general division of data into equal parts.
* Percentiles = division into 100 parts (more fine-grained).
* Quartiles (Q1, Q2, Q3) are special percentiles (25, 50, 75).
* Very useful in ML for **EDA, scaling, outlier detection, and fairness analysis**.

</details>

# Five-Number Summary and Boxplot

<details>
<summary>Click to expand</summary>

### **Five-Number Summary**

The **5-number summary** describes a dataset using **five key statistics**:

1. **Minimum** → The smallest observation.
2. **Q1 (First Quartile / 25th Percentile)** → 25% of data lies below this point.
3. **Median (Q2 / 50th Percentile)** → Middle value, divides dataset into two halves.
4. **Q3 (Third Quartile / 75th Percentile)** → 75% of data lies below this point.
5. **Maximum** → The largest observation.

---

### **Formula Recap**

* Quartiles (Q1, Q2, Q3) are calculated using ordered data.
* **Interquartile Range (IQR) = Q3 − Q1** (used for spread & outlier detection).

---

### **Example**

Dataset (sorted):
`[2, 4, 7, 10, 12, 15, 18, 21, 25]`

* Minimum = 2
* Q1 = 7 (25th percentile)
* Median (Q2) = 12
* Q3 = 18 (75th percentile)
* Maximum = 25

**Five-number summary = (2, 7, 12, 18, 25)**
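
The example can be reproduced with a short helper. `five_number_summary` is a hypothetical function written for this note; it assumes Python 3.8+ and the `"inclusive"` quartile method, which matches the values listed here.

```python
import statistics

def five_number_summary(data, method="inclusive"):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(data)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method=method)
    return (ordered[0], q1, q2, q3, ordered[-1])

dataset = [2, 4, 7, 10, 12, 15, 18, 21, 25]
summary = five_number_summary(dataset)
summary_exclusive = five_number_summary(dataset, method="exclusive")

print(summary)            # quartiles 7, 12, 18 as in the example
print(summary_exclusive)  # a different convention shifts Q1 and Q3
```

Quartile conventions differ between tools: with the `"exclusive"` method the same data gives Q1 = 5.5 and Q3 = 19.5, so it is worth stating which method was used when reporting a summary.
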

---

### **Applications in Machine Learning**

* **Boxplots**: Visualize the five-number summary.
* **Outlier detection**: Data points outside `[Q1 − 1.5×IQR, Q3 + 1.5×IQR]`.
* **Feature scaling**: Median and quartiles are more **robust to skew/outliers** than mean & SD.
* **EDA**: Quickly summarizes distribution shape (spread, skewness).

---

**Quick Mnemonic** → “**Min–Q1–Median–Q3–Max**”

---

### **Boxplots (Whisker Plots)**

### **1. Definition**

A **boxplot** is a graphical representation of a dataset’s distribution using the **five-number summary**:

* **Minimum**
* **Q1 (25th percentile)**
* **Median (Q2 / 50th percentile)**
* **Q3 (75th percentile)**
* **Maximum**

It also shows **outliers** using separate points beyond whiskers.

---

### **2. Structure of a Boxplot**

* **Box** → From Q1 to Q3 (the Interquartile Range, IQR).
* **Line inside the box** → Median (Q2).
* **Whiskers** → Extend from the box to the smallest and largest data points that lie *within 1.5×IQR* of Q1 and Q3.
* **Outliers** → Points beyond whiskers plotted as dots or stars.

---

### **3. Example**

If dataset = `[2, 4, 7, 10, 12, 15, 18, 21, 25]`

* Min = 2
* Q1 = 7
* Median = 12
* Q3 = 18
* Max = 25

Boxplot will show:

* Box from 7 → 18
* Line at 12 (median)
* Whiskers from 2 → 25
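
To see how the whiskers react to an extreme value, the sketch below computes the pieces of a boxplot by hand. `boxplot_stats` is a hypothetical helper using the common 1.5×IQR convention, and the second dataset (with 25 replaced by 60) is made up for contrast.

```python
import statistics

def boxplot_stats(data):
    """Compute box, median, whisker ends, and outliers (1.5 x IQR rule)."""
    ordered = sorted(data)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    return {
        "box": (q1, q3),
        "median": median,
        "whiskers": (inside[0], inside[-1]),   # most extreme points inside the fences
        "outliers": [x for x in ordered if x < low_fence or x > high_fence],
    }

base = boxplot_stats([2, 4, 7, 10, 12, 15, 18, 21, 25])
with_extreme = boxplot_stats([2, 4, 7, 10, 12, 15, 18, 21, 60])

print(base)
print(with_extreme)
```

In the original data every point lies inside the fences (−9.5 to 34.5), so the whiskers reach the minimum and maximum. With 60 in the data, the upper whisker stops at 21 and 60 is drawn as a separate outlier point.
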

---

### **4. Applications in Machine Learning**

* **Outlier Detection** → Quick way to spot extreme values.
* **Distribution Shape** → Skewness (median closer to one end of the box, or one whisker much longer than the other).
* **Compare Groups** → Side-by-side boxplots help compare feature distributions across categories.

  * Example: Compare **salary distributions by education level**.
* **Feature Engineering** → Decide whether to transform, scale, or remove outliers.

---

### **5. Advantages**

* Summarizes data distribution compactly.
* Good for **skewed data** and **outliers**.
* Easy to compare multiple datasets side by side.

---

**Summary**:
Boxplot = **visual + summary** of spread, median, and outliers.
It answers: *Where is the center? How spread out is the data? Any unusual points?*

</details>

# Graphs for Univariate Analysis

<details>
<summary>Click to expand</summary>

*Univariate analysis* means analyzing a **single variable** at a time. The goal is to understand its distribution, central tendency, and spread. Graphical methods are very useful here.

---

### **1. Histogram**

* **Definition**: A bar-like plot where the data is grouped into intervals (called bins).
* **Purpose**: Shows the **frequency distribution** of continuous data.
* **Example in ML**: Visualizing the distribution of ages in a dataset to check if it’s normal or skewed.
* **Note**: Helps in detecting skewness, outliers, and modality (unimodal, bimodal, etc.).
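
What a histogram does under the hood is just counting values per bin. A minimal text-only sketch, assuming hypothetical `ages` and a hand-picked bin width of 10:

```python
from collections import Counter

ages = [22, 25, 27, 31, 33, 34, 35, 38, 41, 45, 52, 67]  # hypothetical
bin_width = 10

# Each value falls in the bin [start, start + bin_width).
counts = Counter((age // bin_width) * bin_width for age in ages)

# Print one row of '#' marks per bin, including empty bins in between.
for start in range(min(counts), max(counts) + bin_width, bin_width):
    print(f"{start:>2}-{start + bin_width - 1:<2} | {'#' * counts[start]}")
```

The choice of bin width changes the picture: very wide bins hide structure, very narrow ones produce noisy bars.
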

---

### **2. Bar Chart**

* **Definition**: Rectangular bars represent the frequency of each category.
* **Purpose**: Used for **categorical data** (nominal or ordinal).
* **Example in ML**: Showing the count of customers by gender (Male vs Female).

---

### **3. Pie Chart**

* **Definition**: A circular chart divided into slices.
* **Purpose**: Shows **proportion** of categories.
* **Example in ML**: Proportion of spam vs non-spam emails in a dataset.
* **Note**: Not recommended when there are many categories (a bar chart is clearer).

---

### **4. Frequency Polygon**

* **Definition**: A line graph connecting midpoints of histogram bins.
* **Purpose**: Helps in comparing two or more distributions easily.
* **Example in ML**: Comparing frequency of exam scores between two different classes.

---

### **5. Box Plot (Whisker Plot)**

* **Definition**: A plot that shows the **median, quartiles, and outliers**.
* **Purpose**: Identifies **spread and skewness**, highlights outliers.
* **Example in ML**: Detecting salary outliers in employee dataset.

---

### **6. Stem-and-Leaf Plot**

* **Definition**: Splits each number into a “stem” (leading digit or digits) and a “leaf” (last digit).
* **Purpose**: Shows actual values along with distribution.
* **Example in ML**: Exploring small datasets (e.g., exam scores of 30 students).
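
A stem-and-leaf plot is easy to build by hand. The sketch below assumes two-digit hypothetical `scores`, with the tens digit as the stem and the units digit as the leaf:

```python
from collections import defaultdict

scores = [52, 57, 61, 64, 64, 68, 70, 73, 75, 79, 81, 88, 93]  # hypothetical

stems = defaultdict(list)
for score in sorted(scores):
    stem, leaf = divmod(score, 10)   # tens digit is the stem, units digit the leaf
    stems[stem].append(leaf)

for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")
```

Each row reads like a histogram bar, but the individual values can still be recovered (stem 6 with leaves 1 4 4 8 means 61, 64, 64, 68).
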

---

### **7. Density Plot (KDE Plot)**

* **Definition**: A smoothed version of a histogram that estimates the probability density (kernel density estimation, KDE).
* **Purpose**: Useful for understanding **probability distribution shape**.
* **Example in ML**: Checking if dataset feature follows a Gaussian (Normal) distribution.
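
The idea behind a KDE can be sketched as "place a small Gaussian bump on every data point and average them". The helper `gaussian_kde` below is a hypothetical, simplified version with a hand-picked bandwidth; plotting libraries typically choose the bandwidth automatically.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density as an average of Gaussian bumps."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

samples = [4.8, 5.1, 5.3, 5.9, 6.2, 6.4, 7.0]   # hypothetical feature values
density = gaussian_kde(samples, bandwidth=0.5)   # bandwidth chosen by hand

grid = [3 + 0.1 * i for i in range(61)]          # points from 3.0 to 9.0
estimates = [density(x) for x in grid]

print(max(estimates))
```

A larger bandwidth gives a smoother, flatter curve; a smaller one follows individual points more closely. The area under the curve stays close to 1 either way.
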

---

**Summary Table**

| Graph Type         | Data Type   | Purpose                      |
| ------------------ | ----------- | ---------------------------- |
| Histogram          | Continuous  | Frequency distribution       |
| Bar Chart          | Categorical | Category comparison          |
| Pie Chart          | Categorical | Proportions                  |
| Frequency Polygon  | Continuous  | Compare distributions        |
| Box Plot           | Continuous  | Spread & Outliers            |
| Stem-and-Leaf      | Discrete    | Actual values + distribution |
| Density Plot (KDE) | Continuous  | Smooth distribution curve    |

---

In machine learning, these graphs are often the **first step in EDA (Exploratory Data Analysis)** to understand feature distributions before modeling.

</details>

# Graphs for Bivariate Analysis

<details>
<summary>Click to expand</summary>
    
*Bivariate analysis* means analyzing the **relationship between two variables** at the same time. These variables can be **categorical vs categorical**, **categorical vs numerical**, or **numerical vs numerical**.

---

### **1. Scatter Plot**

* **Definition**: Points plotted on a 2D plane with one variable on the x-axis and another on the y-axis.
* **Purpose**: Shows **relationship, correlation, and patterns** between two numerical variables.
* **Example in ML**: Relationship between house size (sqft) and price.
* **Note**: Trend lines can be added to show regression.

---

### **2. Line Chart**

* **Definition**: Plots points connected with a line.
* **Purpose**: Shows the **trend** of one variable across an ordered, continuous variable such as time.
* **Example in ML**: Plotting loss vs epochs in neural network training.

---

### **3. Heatmap (Correlation Matrix)**

* **Definition**: A colored grid showing correlation values between multiple variables.
* **Purpose**: Identifies **strength and direction** of relationships.
* **Example in ML**: Checking correlation of features with target variable.
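* **Note**: How color maps to correlation depends on the chosen colormap; with a diverging colormap, strong positive and strong negative correlations appear at opposite color extremes and values near 0 fall in between.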
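
The numbers behind a correlation heatmap are pairwise Pearson coefficients. A minimal sketch, assuming a tiny made-up housing table (`size_sqft`, `age_years`, `price_k` are illustrative column names) and a hand-written `pearson` helper:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: larger houses are newer and more expensive.
features = {
    "size_sqft": [800, 1000, 1200, 1500, 1800],
    "age_years": [30, 25, 20, 10, 5],
    "price_k": [150, 180, 210, 260, 310],
}

names = list(features)
corr = {a: {b: pearson(features[a], features[b]) for b in names} for a in names}

# Print the matrix; a heatmap would color each cell by its value.
for a in names:
    print(a.ljust(10), "  ".join(f"{corr[a][b]:+.2f}" for b in names))
```

The diagonal is always 1, the matrix is symmetric, and the sign of each entry gives the direction of the relationship (here size and price move together, while age moves against both).
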

---

### **4. Grouped / Clustered Bar Chart**

* **Definition**: Bars grouped by one categorical variable and compared across another.
* **Purpose**: Compare categorical variable counts with subcategories.
* **Example in ML**: Comparing survival rate of Titanic passengers by gender and class.

---

### **5. Stacked Bar Chart**

* **Definition**: Bars stacked on top of each other for subcategories.
* **Purpose**: Shows **composition** within categories.
* **Example in ML**: Proportion of male/female customers across age groups.

---

### **6. Side-by-Side Boxplots**

* **Definition**: Multiple boxplots compared across categories.
* **Purpose**: Shows spread of numerical variable across different categories.
* **Example in ML**: Salary distribution by education level.

---

### **7. Bubble Chart**

* **Definition**: An extension of scatter plot where the **size of bubble** represents a third variable.
* **Purpose**: Adds more information to scatter plots.
* **Example in ML**: Plotting population (bubble size) vs GDP vs literacy rate for countries.

---

### **8. Violin Plot**

* **Definition**: Combination of boxplot and density plot.
* **Purpose**: Shows distribution of numerical data across categories.
* **Example in ML**: Distribution of exam scores across different schools.

---

**Summary Table**

| Graph Type            | Variable Type                 | Purpose                             |
| --------------------- | ----------------------------- | ----------------------------------- |
| Scatter Plot          | Numerical vs Numerical        | Correlation & patterns              |
| Line Chart            | Numerical vs Numerical (time) | Trends over time                    |
| Heatmap               | Numerical vs Numerical        | Correlation matrix                  |
| Grouped/Clustered Bar | Categorical vs Categorical    | Compare groups                      |
| Stacked Bar Chart     | Categorical vs Categorical    | Show composition                    |
| Side-by-Side Boxplots | Categorical vs Numerical      | Compare distributions across groups |
| Bubble Chart          | Numerical + Numerical + Size  | Multivariate visualization          |
| Violin Plot           | Categorical vs Numerical      | Distribution + density              |

---

In **machine learning**, bivariate graphs are crucial in **Exploratory Data Analysis (EDA)** for:

* Detecting correlations
* Understanding feature-target relationships
* Spotting multicollinearity before regression
* Deciding feature transformations

</details>