## Quartiles and Percentiles

### Percentile

A **percentile** is a value below which a certain percentage of observations lie.

**Example:**  
Scoring at the 95th percentile in an exam means the student scored higher than **95% of all students**.

---

### Example Dataset

n = {2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12}


---
### Percentile Rank Formula

$$\text{Percentile Rank of } x = \left( \frac{\text{Number of values below } x}{n} \right) \times 100$$

Here $n$ is the number of values in the dataset. For **$x = 10$**, 16 of the 20 values above lie below 10, so the percentile rank is $\frac{16}{20} \times 100 =$ **80**.
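
A minimal sketch of the percentile rank formula in plain Python. The function name `percentile_rank` is an illustrative choice, not something defined in these notes; it counts values strictly below $x$, as the formula does.

```python
def percentile_rank(values, x):
    """Percentage of observations that lie strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


data = [2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12]
print(percentile_rank(data, 10))  # 80.0
```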

---

### Finding Value at a Given Percentile (The Index)

To find the position of a percentile in a sorted dataset, use the index formula:

$$i = \frac{P}{100} \times (n - 1)$$

**Where:**
* $i$: The resulting index (position) in the sorted data.
* $P$: The desired percentile.
* $n$: The total number of observations.

> **Note:** The index $i$ is 0-based. If $i$ is not an integer, you typically interpolate linearly between the values at indices $\lfloor i \rfloor$ and $\lceil i \rceil$.
---
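
The index formula can be sketched as follows, assuming a 0-based index and linear interpolation between the two neighbouring values. `percentile_value` is a hypothetical helper name, and the data is the example dataset from earlier in this section.

```python
import math


def percentile_value(values, p):
    """Value at percentile p using i = p/100 * (n - 1) with linear interpolation."""
    data = sorted(values)
    i = p / 100 * (len(data) - 1)  # 0-based, possibly fractional index
    lo = math.floor(i)
    hi = math.ceil(i)
    frac = i - lo  # how far i sits between the two neighbouring indices
    return data[lo] + frac * (data[hi] - data[lo])


data = [2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12]
print(percentile_value(data, 50))            # 8.0
print(round(percentile_value(data, 10), 2))  # 2.9
```

Python's `statistics.quantiles(data, n=4, method="inclusive")` uses the same $(n - 1)$ positioning, so it should agree with this sketch for the quartiles.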

## Quartile

### Definition

**Quartiles** are values that divide sorted data into **four equal parts**, with each part containing **25% of the data**.

There are **three quartiles**:
- **Q1** → 25th percentile  
- **Q2** → 50th percentile (Median)  
- **Q3** → 75th percentile  

---

### Interpretation

- **Q1:** 25% of the data lies below this value  
- **Q2 (Median):** 50% of the data lies below this value  
- **Q3:** 75% of the data lies below this value  

The range between **Q1 and Q3** is called the **Interquartile Range (IQR)**.  
It contains the **middle 50% of the data** and shows the **spread of the data**.

---


## Five Number Summary

**Definition:**  
A set of five statistics that summarize the distribution of a dataset, showing its **center, spread, and range**.

The five numbers are:

1. **Minimum** – smallest value in the dataset  
2. **First Quartile (Q1, 25%)** – value below which 25% of data falls  
3. **Median (Q2)** – middle value that divides data into two equal halves (50%)  
4. **Third Quartile (Q3, 75%)** – value below which 75% of data falls  
5. **Maximum** – largest value in the dataset  

**Note:** These five numbers are visually displayed in a **boxplot**.

---

### Example Calculation

**Sample dataset:**  

n = {1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 27}


**Step 1: Find the five number summary**  
- Minimum = 1  
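- Q1 = 3 (0-based index $0.25 \times (19 - 1) = 4.5$, halfway between the 5th and 6th values, both 3)  
- Median (Q2) = 5 (middle value, 10th position out of 19)  
- Q3 = 6.5 (0-based index $0.75 \times (19 - 1) = 13.5$, halfway between the 14th and 15th values, 6 and 7)  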
- Maximum = 27  

**Step 2: Calculate IQR (Interquartile Range)**
  
$$\text{IQR} = Q3 - Q1 = 6.5 - 3 = 3.5$$

**Note:** IQR represents the **spread of the middle 50% of data** and is resistant to outliers.
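
The summary above can be reproduced with the standard library. This sketch assumes `statistics.quantiles` with `method="inclusive"`, which places quartiles using the $(n - 1)$ index formula given earlier; other quartile conventions can give slightly different values for Q1 and Q3.

```python
import statistics

data = [1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 27]

# n=4 asks for the three cut points that split the data into quarters
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

summary = {"min": min(data), "q1": q1, "median": q2, "q3": q3, "max": max(data)}
iqr = q3 - q1

print(summary)  # {'min': 1, 'q1': 3.0, 'median': 5.0, 'q3': 6.5, 'max': 27}
print(iqr)      # 3.5
```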

---

### Removing Outliers from n

**Definition:**  
Outliers are extreme values that lie unusually far from other observations. They can distort summary statistics such as the mean, so they should be identified before analysis.

**Method:** Use the **fence method** to identify outliers.

**Lower and Upper Fence:**

#### Lower Fence
$$\text{Lower fence} = Q1 - 1.5 \times IQR = 3 - 1.5 \times 3.5 = -2.25$$

#### Upper Fence
$$\text{Upper fence} = Q3 + 1.5 \times IQR = 6.5 + 1.5 \times 3.5 = 11.75$$

**Outlier identification:**  
- Any value < -2.25 → lower outlier  
- Any value > 11.75 → upper outlier  

**In our dataset:**  
- 27 > 11.75, so **27 is an outlier**

**Clean dataset (without outliers):**  

{1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9}

**Note:** The 1.5 × IQR rule is a standard convention. Some use 3 × IQR for **extreme outliers**.
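
A short sketch of the fence method on the same dataset. The helper name `iqr_fences` and its parameter `k` are illustrative assumptions; the quartiles are computed with `statistics.quantiles(..., method="inclusive")` so that they match the values used above.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = [1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 27]
lower, upper = iqr_fences(data)

outliers = [v for v in data if v < lower or v > upper]
clean = [v for v in data if lower <= v <= upper]

print(lower, upper)  # -2.25 11.75
print(outliers)      # [27]
```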


## Covariance

**Definition:**  
Covariance is used to quantify the relationship between **X** and **Y**.  
It measures how two variables change together, indicating whether they move in the **same** or **opposite** directions.  

- **Positive value:** Variables move in the same direction  
- **Negative value:** Variables move in opposite directions  

**Key Points:**  
- Can take any value from **-∞ to +∞**  
  - Negative → negative relationship  
  - Positive → positive relationship  
- Measures **linear relationship** between variables  
- Gives the **direction** of the relationship  

---

### Covariance Formulas

**For Population Data ($N$):**
Using population means $\mu_x$ and $\mu_y$:
$$\text{Cov}(X,Y) = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$

**Where**
* **$x_i$**: The $i^{th}$ value of the variable $X$ in the population.
* **$y_i$**: The $i^{th}$ value of the variable $Y$ in the population.
* **$\mu_x$**: The population mean of variable $X$ (the average of all $x_i$ values).
* **$\mu_y$**: The population mean of variable $Y$ (the average of all $y_i$ values).
* **$N$**: The total number of data points in the population.
* **$\sum$**: The summation symbol; it indicates we sum the products of the deviations for all data points from $i=1$ to $N$.
* **$N$ (Denominator)**: In population covariance, we divide by $N$ because we have access to the entire population. **Bessel's correction is not required** because we are calculating the actual parameter rather than estimating it from a sample.

**For Sample Data ($n$):**
Using sample means $\bar{x}$ and $\bar{y}$ with Bessel's correction ($n-1$):
$$\text{Cov}(X,Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

**Where**
* **$x_i$**: The $i^{th}$ value of the variable $X$ in the sample.
* **$y_i$**: The $i^{th}$ value of the variable $Y$ in the sample.
* **$\bar{x}$**: The sample mean of variable $X$ (the average of all $x_i$ values).
* **$\bar{y}$**: The sample mean of variable $Y$ (the average of all $y_i$ values).
* **$n$**: The total number of data points (observations) in the sample.
* **$\sum$**: The summation symbol; it indicates that we sum the products of the deviations for all data points from $i=1$ to $n$.
* **$n-1$**: The **degrees of freedom**. When working with a sample, we divide by $n-1$ to correct for the bias in estimating population covariance. This is known as **Bessel's correction**.
---
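
Both formulas differ only in the denominator, which a small sketch makes visible. The function name `covariance` and its `sample` flag are illustrative choices, and the data is a small made-up example.

```python
def covariance(xs, ys, sample=True):
    """Covariance of two equal-length sequences.

    sample=True divides by n - 1 (Bessel's correction);
    sample=False divides by N (population covariance).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(covariance(x, y, sample=False))  # 4.0
print(covariance(x, y, sample=True))   # 5.0
```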




### Types of Covariance

1. **Positive Covariance:**  
   - One variable increases → the other tends to increase  
   - One variable decreases → the other tends to decrease  

2. **Negative Covariance:**  
   - One variable increases → the other tends to decrease  
   - One variable decreases → the other tends to increase  

3. **Zero Covariance:**  
   - No linear relationship  
   - The variables may still be related in a non-linear way, so zero covariance does not by itself mean they are independent  

---
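
Zero covariance rules out only a linear relationship. As a small illustration with made-up data, $y = x^2$ on symmetric $x$ values is fully determined by $x$ yet has zero covariance. The helper `sample_covariance` is a hypothetical name used for this sketch.

```python
def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / (n - 1)


xs = [-2, -1, 0, 1, 2]
ys = [v * v for v in xs]  # y depends entirely on x, but not linearly

print(sample_covariance(xs, ys))  # 0.0
```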

**Visualization:**

<img src="https://media.geeksforgeeks.org/wp-content/uploads/20251209122450861356/Covariance.webp" width="400px" alt="Covariance Example">


## Correlation

**Definition:**  
Correlation is a statistical measure that quantifies the **strength and direction of the linear relationship** between two numerical variables.  
It is derived from covariance by standardizing it to a scale of **-1 to +1**.  
Unlike covariance, which only indicates the direction of the relationship, correlation provides a **standardized measure of strength**.

### Types of Correlation

- **Positive Correlation (close to +1):**  
  As one variable increases, the other variable also tends to increase.

- **Negative Correlation (close to -1):**  
  As one variable increases, the other variable tends to decrease.

- **Zero Correlation:**  
  There is no linear relationship between the variables.

**Note:**  
Correlation does **not imply causation**. Two variables can be correlated without one causing the other.

---

## Covariance (Prerequisite Concept)

**Definition:**  
Covariance measures how two variables change together.
- Positive covariance → variables increase together
- Negative covariance → one increases while the other decreases

**Formula (Sample Data):**
$$\text{cov}(x,y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

**Limitation:**  
Covariance is **not standardized**, so its magnitude is hard to interpret.  
This is why **correlation**, which standardizes covariance, is more useful.

---

## Pearson Correlation Coefficient

### Formula for Population Dataset (N)

$$\rho_{X,Y} = \frac{\text{cov}(X,Y)}{\sigma_X \sigma_Y}$$

**Where:**
* $\rho_{X,Y}$: The population correlation coefficient between $X$ and $Y$.
* $\text{cov}(X,Y)$: The covariance of $X$ and $Y$.
* $\sigma_X$: The population standard deviation of $X$.
* $\sigma_Y$: The population standard deviation of $Y$.

---

### Formula for Sample Dataset (n)

The correlation coefficient ($r$) measures the strength and direction of the linear relationship between two variables.

$$r = \text{Corr}(x,y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \cdot \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

**Where:**
* $r$ or $\text{Corr}(x,y)$: Sample correlation coefficient
* $\bar{x}, \bar{y}$: Sample means of x and y
* $x_i, y_i$: Individual sample points
* $n$: Number of observations

---
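
The sample formula translates almost term by term into code. This is a minimal sketch; `pearson_r` is an illustrative name and the data is a small made-up example.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation: sum of deviation products over the
    square root of the product of the summed squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4))  # 0.7746
```

The $n - 1$ terms from the covariance and the two standard deviations cancel, which is why they do not appear in the formula or in the code.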

### Interpretation of Pearson Correlation

- $r = +1$ → perfect positive linear relationship
- $r = -1$ → perfect negative linear relationship
- $r = 0$ → no linear relationship
- $|r| \geq 0.7$ → strong correlation
- $0.3 \leq |r| < 0.7$ → moderate correlation
- $|r| < 0.3$ → weak correlation

These cut-offs are common rules of thumb, not fixed standards; what counts as strong varies by field.

---
### Assumptions for Pearson Correlation

1. Both variables are continuous (interval or ratio scale)
2. Linear relationship between variables
3. No significant outliers
4. Approximately normally distributed (for hypothesis testing)

---

### Visualization

<img src="https://www.scribbr.com/wp-content/uploads/2022/07/Perfect-positive-correlation-Perfect-negative-correlation.webp" width="300px" alt="Pearson correlation example">
<img src="https://www.scribbr.com/wp-content/uploads/2022/05/Strong-positive-correlation-and-strong-negative-correlation.webp" width="300px" alt="Pearson correlation example">
<img src="https://www.scribbr.com/wp-content/uploads/2022/05/Low-positive-correlation-and-low-negative-correlation.webp" width="300px" alt="Pearson correlation example">
<img src="https://www.scribbr.com/wp-content/uploads/2022/05/Zero-correlation.webp" width="150px" alt="Pearson correlation example">

---
<p align="center">
  <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4e/Spearman_fig1.svg/500px-Spearman_fig1.svg.png" 
       width="400px" 
       alt="Spearman vs Pearson">
</p>

### Key Difference Between Pearson and Spearman

In the above example, the values on the y-axis increase continuously as the x-axis values increase, yet the **Pearson correlation is not equal to 1** because the relationship is **not linear** (it is exponential/curved).

In such cases, **Spearman's rank correlation** is more appropriate, as it captures **monotonic relationships** (non-linear but consistent direction).  
Spearman would give a correlation of **1** here because the **ranks have a perfect relationship**.
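
The same effect can be reproduced numerically. This sketch assumes an exponential curve $y = e^x$ as a stand-in for the figure and uses illustrative helper names (`pearson_r`, `simple_ranks`); the ranking helper assumes there are no ties.

```python
import math


def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


def simple_ranks(values):
    """1-based ranks, smallest value gets rank 1 (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


x = list(range(1, 11))
y = [math.exp(v) for v in x]  # strictly increasing, but curved

pearson = pearson_r(x, y)
spearman = pearson_r(simple_ranks(x), simple_ranks(y))

print(pearson < 1)         # True: the relationship is not linear
print(round(spearman, 6))  # 1.0: the ranks agree perfectly
```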

---

## Spearman's Rank Correlation

**Definition:**  
A **non-parametric** measure that assesses the strength and direction of association between two variables based on their **ranks** rather than actual values.  
Works with ordinal, interval, or ratio data.

### Interpretation

- Range: -1 to +1
- +1 → perfect positive monotonic relationship
- -1 → perfect negative monotonic relationship
- 0 → no monotonic relationship
- Captures monotonic relationships even if they are not linear

---

### When to Use Spearman Over Pearson

1. Data is ordinal (rankings, ratings)
2. Relationship is monotonic but not linear
3. Presence of outliers
4. Data is not normally distributed
5. Small sample size

---

## Spearman Formula (Without Ties)

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

**Where:**
* $\rho$ (rho): Spearman's rank correlation coefficient.
* $d_i$: The difference between the ranks of corresponding values ($d_i = \text{rank}(x_i) - \text{rank}(y_i)$).
* $\sum d_i^2$: The sum of the squared differences in ranks.
* $n$: The number of observations (data points).

**Steps:**
1. Rank both variables
2. Compute rank differences ($d$)
3. Square differences ($d^2$)
4. Sum all $d^2$
5. Apply formula

**Note:**  
This formula works only when there are **no tied ranks**.
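
The five steps can be sketched directly. The names `ranks_no_ties` and `spearman_no_ties` are illustrative, the data is made up, and the code assumes no tied values, as the formula requires.

```python
def ranks_no_ties(values):
    """1-based ranks, smallest value gets rank 1 (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def spearman_no_ties(xs, ys):
    n = len(xs)
    rx, ry = ranks_no_ties(xs), ranks_no_ties(ys)     # step 1: rank
    d_squared = [(a - b) ** 2 for a, b in zip(rx, ry)]  # steps 2-3: d, d^2
    total = sum(d_squared)                             # step 4: sum
    return 1 - 6 * total / (n * (n ** 2 - 1))          # step 5: formula


x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
# ranks of y are [1, 3, 2, 5, 4], so d = [0, -1, 1, -1, 1] and sum d^2 = 4
print(round(spearman_no_ties(x, y), 4))  # 0.8
```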

---

## General Formula (Spearman as Pearson on Ranks)

This formula shows that Spearman's correlation is simply the Pearson correlation applied to the **ranks** of the data:

$$r_s = \rho(R[X], R[Y]) = \frac{\text{cov}(R[X], R[Y])}{\sigma_{R[X]} \sigma_{R[Y]}}$$

**Where:**
* $r_s$: Spearman's rank correlation coefficient
* $R[X], R[Y]$: The ranks of the observations in variables $X$ and $Y$.
* $\text{cov}(R[X], R[Y])$: Covariance of the ranked variables.
* $\sigma_{R[X]}, \sigma_{R[Y]}$: Standard deviations of the ranked variables.

**Note:** This general formula handles tied ranks automatically and is equivalent to applying Pearson's formula to the ranked data. Most statistical software uses this approach.
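
A sketch of this approach with tied values, assuming the common convention that tied observations share the average of the ranks they would otherwise occupy. The helper names and the data are illustrative.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # extend the run while the next sorted value equals the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman's r_s as Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


x = [1, 2, 2, 3]  # the two 2s tie and share rank 2.5
y = [1, 2, 3, 4]
print(average_ranks(x))         # [1.0, 2.5, 2.5, 4.0]
print(round(spearman(x, y), 4))  # 0.9487
```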

---

## Example Calculation

**Data:**  
X = [1, 2, 3, 4, 5]  
Y = [2, 4, 6, 8, 10]

**Step 1: Rank both variables**
- Ranks of X: [1, 2, 3, 4, 5]
- Ranks of Y: [1, 2, 3, 4, 5]

**Step 2: Calculate rank difference**
- $d = [0, 0, 0, 0, 0]$

**Step 3: Square differences**
- $d^2 = [0, 0, 0, 0, 0]$

**Step 4: Sum**
- $\sum d^2 = 0$

**Step 5: Apply formula**

$$\rho = 1 - \frac{6 \times 0}{5(25-1)} = 1 - 0 = 1$$

**Result:**  
Perfect positive correlation ($\rho = 1$)



## Correlation vs Causation

**Definition:**
- **Correlation**: A statistical relationship where two variables tend to change together, but one does not necessarily cause the other.
- **Causation**: A relationship where one variable directly causes a change in another variable.

**Key principle:** "Correlation does not imply causation" - just because two variables are correlated doesn't mean one causes the other.


### Why Correlation ≠ Causation

Three main reasons why correlated variables may not have a causal relationship:

1. **Coincidence (Spurious Correlation)**
   - Two variables happen to move together by random chance
   - Example: Number of Nicolas Cage movies released in a year and number of swimming pool drownings (no plausible link between them)

2. **Confounding Variable (Third Variable Problem)**
   - A hidden third variable causes both correlated variables
   - Example: Ice cream sales and drowning deaths both increase in summer (warm weather drives both, neither causes the other)
   - Example: Shoe size correlates with reading ability in children (age is the confounding variable causing both)

3. **Reverse Causation**
   - The assumed cause and effect are backwards
   - Example: Does depression cause poor sleep, or does poor sleep cause depression? Could be either direction.


### Examples

**Correlation WITHOUT Causation:**
- Number of firefighters at a fire ↔ Damage caused by fire (both caused by fire size)
- Nicolas Cage movies released ↔ Swimming pool drownings (pure coincidence)
- Coffee consumption ↔ Heart disease (lifestyle factors may be the real cause)

**Correlation WITH Causation:**
- Cigarette smoking → Lung cancer (proven causal relationship)
- Study hours → Exam scores (studying causes better performance)
- Exercise → Fitness level (physical activity causes fitness improvement)


### Establishing Causation

To prove causation, you typically need:

1. **Temporal precedence**: Cause must come before effect
2. **Correlation**: Variables must be related
3. **No alternative explanations**: Rule out confounding variables
4. **Experimental evidence**: Randomized controlled trials showing manipulation of one variable changes the other
5. **Mechanism**: Logical explanation for how one variable causes the other


### Practical Implications

- Always question: "Could there be another explanation for this relationship?"
- Use phrases like "associated with" or "related to" rather than "causes" when only correlation exists
- Be skeptical of claims that jump from correlation to causation without proper evidence
- Design experiments (not just observe) to test causal relationships

**Note:** Most data analysis reveals correlations. Proving causation requires careful experimental design, control of confounding variables, and often longitudinal studies or randomized controlled trials.

## Visualizing Multiple Variables

**Definition:** Techniques to display relationships between three or more variables simultaneously in a single plot, allowing us to understand complex interactions and patterns in multivariate data.


### Why Visualize Multiple Variables?

- Understand how variables interact together (not just pairwise)
- Identify patterns that emerge only when considering multiple dimensions
- Compare groups across multiple metrics simultaneously
- Reduce the need for multiple separate plots


### Common Techniques for Multiple Variables

#### 1. **Adding a Third Variable to 2D Plots**

**a. Color/Hue Encoding**
- Use color to represent a third categorical variable
- Example: Scatterplot of height vs weight, colored by gender
- Works well with: Scatterplots, line plots, bar plots

**b. Size Encoding (Bubble Chart)**
- Use bubble size to represent a third numerical variable
- Example: Scatterplot of GDP vs life expectancy, bubble size = population
- Also called: Bubble plot

**c. Shape/Marker Encoding**
- Use different shapes for a third categorical variable
- Example: Scatterplot with circles for male, triangles for female
- Limit: Best with 3-5 categories maximum

**d. Faceting/Small Multiples**
- Create separate subplots for each category of a third variable
- Also called: FacetGrid, subplot grid
- Example: Separate scatterplots for each country showing same relationship
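
As a rough sketch of combining these encodings, the example below assumes Matplotlib is installed and uses entirely made-up records. Position shows height and weight, marker shape and colour show the group, and marker size shows age.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical records: (height_cm, weight_kg, age_years, group)
records = [
    (160, 55, 25, "A"), (165, 60, 32, "A"), (172, 68, 41, "A"),
    (158, 52, 22, "B"), (175, 74, 36, "B"), (181, 80, 50, "B"),
]

fig, ax = plt.subplots()
for group, marker in {"A": "o", "B": "^"}.items():
    rows = [r for r in records if r[3] == group]
    ax.scatter(
        [r[0] for r in rows],          # x position: height
        [r[1] for r in rows],          # y position: weight
        s=[r[2] * 4 for r in rows],    # size encoding: age (scaled for visibility)
        marker=marker,                 # shape encoding: group
        label=group,                   # each call also gets its own default colour
        alpha=0.7,
    )

ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
ax.legend(title="Group")
```

Calling `fig.savefig(...)` with a file name of your choice would write the figure to disk; that step is left out here so the sketch has no side effects.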


#### 2. **Specialized Multi-Variable Plots**

**a. Pairplot (Scatterplot Matrix)**
- Grid showing all pairwise relationships between multiple numerical variables
- Diagonal shows distributions of individual variables
- Off-diagonal shows scatterplots between pairs
- Can add color for categorical variable

**b. Heatmap with Hierarchical Clustering**
- Shows correlation matrix for multiple variables
- Color intensity represents correlation strength
- Can add dendrograms to group similar variables (ClusterMap)

**c. Parallel Coordinates Plot**
- Each vertical axis represents one variable
- Each line connects values for one observation across all variables
- Good for: Comparing patterns across many variables, identifying clusters

**d. 3D Scatterplot**
- Uses x, y, z axes for three numerical variables
- Can add color/size for fourth and fifth variables
- Limitation: Hard to interpret, rotation needed, avoid if possible

**e. Contour Plot / Heatmap (2D)**
- Shows relationship between three numerical variables
- x and y axes for two variables, color/contour lines for third
- Good for: Visualizing surfaces, density, or model predictions


#### 3. **Advanced Techniques**

**a. Violin Plot (grouped)**
- Shows distribution of numerical variable across multiple categories
- Can be split by another categorical variable
- Combines box plot + KDE for rich information

**b. Stacked/Grouped Bar Charts**
- Compare multiple categories across groups
- Stacked: Shows composition and total
- Grouped: Shows side-by-side comparison

**c. Bubble Chart with Multiple Encodings**
- x-axis: numerical variable 1
- y-axis: numerical variable 2
- Size: numerical variable 3
- Color: categorical or numerical variable 4
- Can represent 4-5 variables in one plot

**d. Animated Plots**
- Use time as an additional dimension
- Show how relationships change over time
- Example: Gapminder plots showing development over decades


### Best Practices

**Dos:**
- Limit to 3-4 variables maximum for clarity
- Use color-blind friendly palettes
- Add legends and clear labels
- Choose encoding that matches variable type (categorical vs numerical)
- Use faceting when patterns differ significantly across groups

**Don'ts:**
- Avoid 3D plots unless absolutely necessary (2D + color is usually better)
- Don't use too many colors (max 7-10 categories)
- Don't combine too many encoding methods (overwhelming)
- Avoid pie charts for multiple variables (use stacked bars instead)


### Choosing the Right Visualization

| Variables | Best Plot | Alternative |
|-----------|-----------|-------------|
| 2 numerical + 1 categorical | Scatterplot with color | Faceted scatterplot |
| 3 numerical | Scatterplot with size/color | 3D scatterplot (avoid) |
| 1 numerical + 2 categorical | Grouped/stacked bar chart | Faceted bar plot |
| Multiple numerical pairs | Pairplot | Correlation heatmap |
| Many numerical variables | Parallel coordinates | Multiple scatterplots |
| Time + 2 numerical | Line plot with multiple lines | Animated scatterplot |


### Example Use Cases

1. **Business**: Sales vs Marketing Spend, colored by Region, sized by Profit
2. **Health**: Weight vs Height, colored by Gender, faceted by Age Group
3. **Finance**: Stock Returns vs Volatility, colored by Sector, sized by Market Cap
4. **Science**: Temperature vs Pressure vs Volume (contour plot)
5. **Social**: Income vs Education, colored by Occupation, faceted by City

**Note:** The goal is insight, not complexity. If a visualization is hard to understand, simplify or split into multiple simpler plots.

---

## Credits

**Prepared by:**  
**Chetan Sharma**  
AIML / Data Science Notes  

🔗 **GitHub:** [github.com/Chetan559](https://github.com/Chetan559)  
🌐 **Portfolio:** [chetan559.github.io](https://chetan559.github.io)  
💼 **LinkedIn:** [linkedin.com/in/sharma-chetan-k](https://www.linkedin.com/in/sharma-chetan-k/)  

These notes were compiled for learning, revision, and academic understanding. 
