### 📊 **Univariate Analysis: A Complete Deep Dive (In Simple Terms)**

Let’s start from the basics and build up step by step.



## 🧐 **What is Univariate Analysis?**

**Univariate Analysis** means analyzing **one variable at a time**.

- **"Uni"** = one  
- **"Variate"** = variable  

It helps us understand the **distribution**, **central tendency**, and **dispersion** of a **single feature** in a dataset.



### 🎯 **Why Do We Do Univariate Analysis?**

Univariate analysis is crucial because:

✅ It helps you **understand your data**.  
✅ It shows **patterns, trends, and outliers**.  
✅ It helps decide **which preprocessing steps** to apply.  
✅ It gives you insights on **how to handle missing values, scaling, or feature engineering**.



## 📌 **Types of Univariate Analysis**

There are **two types** of variables you’ll encounter:

1. **Numerical (Continuous/Discrete)**  
2. **Categorical**

The approach to univariate analysis is different for **numerical** and **categorical variables**. Let’s cover both in detail.



# 🧮 **1. Univariate Analysis for Numerical Variables**

Numerical variables can take **continuous** or **discrete** values.

For example:

- **Continuous**: Height, Weight, Income  
- **Discrete**: Number of children, Number of cars

### 📊 **Key Metrics for Numerical Variables:**

| **Metric**             | **Description**                                               |
|------------------------|---------------------------------------------------------------|
| **Mean**               | Average value                                                 |
| **Median**             | Middle value when sorted                                      |
| **Mode**               | Most frequent value                                           |
| **Range**              | Difference between max and min values                         |
| **Variance**           | Average squared deviation from the mean                       |
| **Standard Deviation** | Square root of the variance; the typical distance of values from the mean |
| **Quartiles**          | Three cut points (Q1, Q2, Q3) that divide sorted data into four equal parts |
| **Skewness**           | Measures the asymmetry of the distribution                    |
| **Kurtosis**           | Measures the "tailedness" of the distribution (how heavy the tails are) |
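
The last two rows of the table can be computed directly from central moments. The sketch below assumes the population (uncorrected) definitions; the helper names `central_moment`, `skewness`, and `excess_kurtosis` are illustrative, and statistics libraries may apply bias corrections that give slightly different numbers.

```python
import statistics

def central_moment(values, k):
    """k-th central moment: the mean of (x - mean) ** k."""
    mu = statistics.fmean(values)
    return sum((x - mu) ** k for x in values) / len(values)

def skewness(values):
    """Population skewness: m3 / m2 ** 1.5 (0 for a symmetric sample)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2 ** 2 - 3 (0 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 50]  # one large value stretches the right tail

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```

A positive skewness points to a longer right tail, a negative one to a longer left tail.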



### 📐 **How to Perform Univariate Analysis for Numerical Variables?**

Here’s how you can do it step by step:

#### ✅ **1. Use Summary Statistics**

You can calculate basic summary statistics like:

- **Mean**  
- **Median**  
- **Mode**  
- **Variance**  
- **Standard Deviation**

#### ✅ **2. Plot Graphs**

Visualize the distribution of the data using:

1. **Histogram** – Shows the frequency distribution.  
2. **Boxplot** – Shows spread, median, and outliers.  
3. **Density Plot** – Smooth curve showing distribution.



### 🔍 **Example: Analyzing 'Age' of Customers**

Suppose you have the following **ages** of customers:

| Age   |
|-------|
| 22    |
| 25    |
| 28    |
| 30    |
| 35    |
| 38    |
| 40    |
| 45    |

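#### 📊 **Summary Statistics:**

- **Mean** = 263 / 8 = 32.875  
- **Median** = (30 + 35) / 2 = 32.5 (with eight values, the median is the average of the two middle values)  
- **Mode** = No mode (every value appears once)  
- **Range** = 45 - 22 = 23  
- **Variance** ≈ 62.98 (sample, dividing by n - 1) or ≈ 55.11 (population, dividing by n)  
- **Standard Deviation** ≈ 7.94 (sample) or ≈ 7.42 (population)  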
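
As a quick check, the same statistics can be computed with Python's standard `statistics` module. This is a minimal sketch for the eight ages in the table; it reports both the sample and the population versions of variance and standard deviation, because the two differ noticeably on a dataset this small.

```python
import statistics

ages = [22, 25, 28, 30, 35, 38, 40, 45]

mean = statistics.mean(ages)            # arithmetic average
median = statistics.median(ages)        # average of the two middle values (n is even)
modes = statistics.multimode(ages)      # every value ties at one occurrence, so no single mode
value_range = max(ages) - min(ages)

# Sample statistics divide by n - 1; population statistics divide by n.
sample_var = statistics.variance(ages)
population_var = statistics.pvariance(ages)
sample_std = statistics.stdev(ages)
population_std = statistics.pstdev(ages)

print(f"mean={mean}, median={median}, range={value_range}")
print(f"sample variance={sample_var:.2f}, sample std={sample_std:.2f}")
print(f"population variance={population_var:.2f}, population std={population_std:.2f}")
```

pandas' `describe()` reports the sample standard deviation, so it matches the `n - 1` figures here.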

#### 🖼️ **Visual Representation:**

- **Histogram** → Shows how many customers fall within each age range.  
- **Boxplot** → Helps identify outliers (if any).  
- **Density Plot** → Shows a smooth curve of the distribution.



# 🗂️ **2. Univariate Analysis for Categorical Variables**

Categorical variables represent **categories** or **groups**.

For example:

- **Gender**: Male, Female  
- **Marital Status**: Single, Married, Divorced  
- **Product Category**: Electronics, Clothing, Grocery  



### 📊 **Key Metrics for Categorical Variables:**

| **Metric**          | **Description**                          |
|---------------------|------------------------------------------|
| **Frequency Count**  | Number of times each category appears    |
| **Mode**            | Most frequently occurring category        |
| **Proportion**       | Percentage of each category              |



### 📐 **How to Perform Univariate Analysis for Categorical Variables?**

#### ✅ **1. Use Frequency Tables**

A **frequency table** shows how many times each category occurs.

#### ✅ **2. Plot Graphs**

Visualize the distribution using:

1. **Bar Chart** – Shows the frequency of each category.  
2. **Pie Chart** – Shows the proportion of each category.  
3. **Countplot** – Seaborn’s bar chart that counts the rows in each category for you.



### 🔍 **Example: Analyzing 'Gender' of Customers**

Suppose you have the following **genders** of customers:

| Gender |
|--------|
| Male   |
| Female |
| Male   |
| Female |
| Male   |
| Female |
| Female |

#### 📊 **Frequency Table:**

| Gender  | Count | Proportion |
|---------|-------|------------|
| Male    | 3     | 42.86%     |
| Female  | 4     | 57.14%     |
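
The frequency table above can be reproduced with the standard library's `Counter`. This is a minimal sketch using the seven values from the example; the variable names are illustrative.

```python
from collections import Counter

genders = ["Male", "Female", "Male", "Female", "Male", "Female", "Female"]

counts = Counter(genders)                      # frequency of each category
total = sum(counts.values())
proportions = {g: 100 * c / total for g, c in counts.items()}
mode_category, mode_count = counts.most_common(1)[0]   # most frequent category

for gender, count in counts.most_common():
    print(f"{gender:<7} {count}  {proportions[gender]:.2f}%")
```

The mode of a categorical variable is simply the category with the highest count, here `Female`.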

#### 🖼️ **Visual Representation:**

- **Bar Chart** → Shows the count of Male vs. Female customers.  
- **Pie Chart** → Shows the percentage split between Male and Female customers.



# 📈 **How to Perform Univariate Analysis in Python?**

Here’s a quick example in Python:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Sample Data (Gender has one more 'Female' than the 7-row table above,
# so that both columns contain 8 values)
data = {'Age': [22, 25, 28, 30, 35, 38, 40, 45], 
        'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Female']}

# Create DataFrame
df = pd.DataFrame(data)

# Numerical Analysis
print(df['Age'].describe())

# Histogram
plt.hist(df['Age'])
plt.title('Age Distribution')
plt.show()

# Categorical Analysis
print(df['Gender'].value_counts())

# Countplot
sns.countplot(x='Gender', data=df)
plt.title('Gender Count')
plt.show()
```



# 💡 **Key Insights from Univariate Analysis:**

1. **For Numerical Variables**:
   - Find **central tendency** (Mean, Median, Mode).  
   - Find **spread** (Range, Variance, Standard Deviation).  
   - Identify **outliers** using boxplots.

2. **For Categorical Variables**:
   - Identify **dominant categories** using frequency tables.  
   - Check **distribution balance** between categories (e.g., Male vs. Female).



# 🔑 **Common Graphs Used in Univariate Analysis**

| **Type of Variable**  | **Graph Type**        | **Use Case**                                  |
|-----------------------|-----------------------|----------------------------------------------|
| Numerical             | Histogram             | To see the distribution of values            |
| Numerical             | Boxplot               | To detect outliers and understand spread     |
| Numerical             | Density Plot          | To see a smooth distribution curve           |
| Categorical           | Bar Chart             | To see the count of each category            |
| Categorical           | Pie Chart             | To see the proportion of each category       |
| Categorical           | Countplot             | To see the frequency of categories in Seaborn|



# 🧩 **Final Takeaways (In Simple Words):**

- **Univariate Analysis** is all about understanding **one feature at a time**.
- **For Numerical Variables** → Look at central tendency, dispersion, and distribution.  
- **For Categorical Variables** → Look at frequency, mode, and proportion.

---

### 📊 **Bivariate Analysis: A Complete Deep Dive (In Simple Terms)**

Now that you’ve mastered **univariate analysis**, let's step up to **bivariate analysis**!



## 🧐 **What is Bivariate Analysis?**

**Bivariate Analysis** is the process of analyzing **two variables at a time** to:

- **Find relationships** between them.
- **Understand correlations** or associations.
- **Identify patterns and trends**.



### 🎯 **Why Do We Do Bivariate Analysis?**

It helps answer questions like:

✅ **Is there a relationship between age and income?**  
✅ **Do higher product prices lead to more sales?**  
✅ **Does one feature affect another?**  
✅ **Are two variables positively or negatively correlated?**  

In simpler terms: **Are two variables connected in any way?**



## 📌 **Types of Bivariate Analysis**

There are **three types** based on the nature of variables:

1. **Numerical vs. Numerical**  
2. **Numerical vs. Categorical**  
3. **Categorical vs. Categorical**

Each type has different methods for analysis. Let’s dive into each.



# 🧮 **1. Numerical vs. Numerical**

In this case, both variables are **continuous or discrete numbers**.

### 🔍 **Example:**

- **Age vs. Income**  
- **Height vs. Weight**  
- **Sales vs. Advertising Budget**



### 📊 **How to Perform Bivariate Analysis for Numerical Variables?**

Here are the common methods:

| **Method**               | **Description**                                   | **Visual Representation** |
|--------------------------|---------------------------------------------------|--------------------------|
| **Scatter Plot**          | Shows the relationship between two variables      | 📈 Scatter Plot           |
| **Correlation Coefficient (r)** | Measures the strength and direction of the linear relationship (range: -1 to +1) | 🔢 Numeric Value          |
| **Regression Analysis**   | Fits a line to understand the nature of the relationship | 📉 Line of Best Fit       |



### 🖼️ **Example: Scatter Plot and Correlation**

Let’s say you have the following data for **Age** and **Income**:

| Age   | Income (in $) |
|-------|---------------|
| 22    | 20000         |
| 25    | 25000         |
| 28    | 28000         |
| 30    | 31000         |
| 35    | 35000         |
| 40    | 40000         |
| 45    | 45000         |

#### ✅ **Steps to Perform Analysis:**

1. **Scatter Plot** → Plot Age on the x-axis and Income on the y-axis.  
2. **Correlation Coefficient** → Calculate the correlation value **(r)**.

#### 🔍 **Insights:**

- **If r = +1** → Perfect **positive correlation** (Both increase together).  
- **If r = -1** → Perfect **negative correlation** (One increases, the other decreases).  
- **If r = 0** → No **linear** correlation (a curved relationship can still exist).
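
To make the three cases concrete, here is a minimal sketch that computes Pearson's r by hand for the Age and Income table above. The function name `pearson_r` is illustrative, and the sketch assumes the standard Pearson definition.

```python
import math

ages = [22, 25, 28, 30, 35, 40, 45]
incomes = [20000, 25000, 28000, 31000, 35000, 40000, 45000]

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y divided by the product of their spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(ages, incomes)
print(f"r = {r:.4f}")   # close to +1: income rises almost linearly with age in this sample

# A perfect curve can still have r = 0, because r only measures linear association
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))
```

The second call uses `y = x²` on symmetric inputs: the relationship is exact, yet r is 0.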



### 📐 **Python Code Example:**

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample Data
data = {'Age': [22, 25, 28, 30, 35, 40, 45], 
        'Income': [20000, 25000, 28000, 31000, 35000, 40000, 45000]}

# Create DataFrame
df = pd.DataFrame(data)

# Scatter Plot
plt.scatter(df['Age'], df['Income'])
plt.title('Age vs Income')
plt.xlabel('Age')
plt.ylabel('Income')
plt.show()

# Correlation
correlation = df.corr()
print(correlation)
```



# 📈 **2. Numerical vs. Categorical**

In this case, one variable is **numerical**, and the other is **categorical**.

### 🔍 **Example:**

- **Product Category vs. Sales**  
- **Gender vs. Height**  
- **Region vs. Income**



### 📊 **How to Perform Bivariate Analysis for Numerical vs. Categorical Variables?**

| **Method**               | **Description**                                           | **Visual Representation** |
|--------------------------|-----------------------------------------------------------|--------------------------|
| **Box Plot**              | Shows the distribution of the numerical variable for each category | 📦 Box Plot              |
| **Violin Plot**           | Similar to a box plot but shows more details about distribution | 🎻 Violin Plot           |
| **Bar Plot**              | Shows the mean or median of the numerical variable for each category | 📊 Bar Chart             |



### 🖼️ **Example: Gender vs. Height**

Suppose you have the following data for **Gender** and **Height**:

| Gender | Height (in cm) |
|--------|----------------|
| Male   | 170            |
| Female | 160            |
| Male   | 175            |
| Female | 165            |
| Male   | 180            |
| Female | 155            |

#### ✅ **Steps to Perform Analysis:**

1. **Box Plot** → Compare the distribution of height for **Male** and **Female**.  
2. **Bar Plot** → Plot the average height for each gender.
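
Before plotting, it can help to look at the per-group numbers that the box plot and bar plot summarize. This is a minimal standard-library sketch for the six rows in the table; the names `heights_by_gender` and `summary` are illustrative.

```python
from collections import defaultdict
from statistics import mean, median

rows = [("Male", 170), ("Female", 160), ("Male", 175),
        ("Female", 165), ("Male", 180), ("Female", 155)]

# Collect the numerical values separately for each category
heights_by_gender = defaultdict(list)
for gender, height in rows:
    heights_by_gender[gender].append(height)

summary = {
    gender: {"mean": mean(h), "median": median(h), "min": min(h), "max": max(h)}
    for gender, h in heights_by_gender.items()
}
for gender, stats in summary.items():
    print(gender, stats)
```

The bar plot would show the `mean` values, while the box plot is built from the median and the spread of each group.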



### 📐 **Python Code Example:**

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Sample Data
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
        'Height': [170, 160, 175, 165, 180, 155]}

# Create DataFrame
df = pd.DataFrame(data)

# Box Plot
sns.boxplot(x='Gender', y='Height', data=df)
plt.title('Gender vs Height')
plt.show()
```



# 🗂️ **3. Categorical vs. Categorical**

In this case, both variables are **categorical**.

### 🔍 **Example:**

- **Gender vs. Purchase Decision**  
- **Region vs. Product Category**  
- **Education Level vs. Employment Status**



### 📊 **How to Perform Bivariate Analysis for Categorical Variables?**

| **Method**                | **Description**                                         | **Visual Representation** |
|---------------------------|---------------------------------------------------------|--------------------------|
| **Contingency Table**      | Shows how often each combination of two categories occurs | 📊 Table                 |
| **Stacked Bar Chart**      | Compares proportions of categories within a group       | 📊 Stacked Bar Chart      |
| **Heatmap**                | Color-codes the counts in a contingency table so large and small cells stand out | 🔥 Heatmap               |



### 🖼️ **Example: Gender vs. Purchase Decision**

Suppose you have the following data:

| Gender | Purchase Decision |
|--------|-------------------|
| Male   | Yes               |
| Female | No                |
| Male   | Yes               |
| Female | Yes               |
| Male   | No                |
| Female | No                |

#### ✅ **Steps to Perform Analysis:**

1. **Contingency Table** → Count the occurrences of each combination (e.g., Male & Yes).  
2. **Heatmap** → Visualize the relationship.



### 📐 **Python Code Example:**

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample Data
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
        'Purchase': ['Yes', 'No', 'Yes', 'Yes', 'No', 'No']}

# Create DataFrame
df = pd.DataFrame(data)

# Contingency Table
contingency_table = pd.crosstab(df['Gender'], df['Purchase'])
print(contingency_table)

# Heatmap
sns.heatmap(contingency_table, annot=True, cmap='YlGnBu')
plt.title('Gender vs Purchase Decision')
plt.show()
```
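
Raw counts can mislead when the groups differ in size, which is why a stacked bar chart compares proportions within each group. The sketch below computes those within-group shares with the standard library, using the same six rows; the names `pair_counts` and `row_shares` are illustrative.

```python
from collections import Counter

rows = [("Male", "Yes"), ("Female", "No"), ("Male", "Yes"),
        ("Female", "Yes"), ("Male", "No"), ("Female", "No")]

pair_counts = Counter(rows)                       # count of each (gender, decision) pair
gender_totals = Counter(g for g, _ in rows)       # row totals

# Share of each decision within a gender; each gender's shares sum to 1.
row_shares = {
    (g, d): n / gender_totals[g] for (g, d), n in pair_counts.items()
}
for (g, d), share in sorted(row_shares.items()):
    print(f"{g:<7} {d:<4} {share:.2f}")
```

With only six rows these shares are illustrative; on real data they show whether one group says "Yes" more often than the other.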



# 🔑 **Key Insights from Bivariate Analysis**

1. **Numerical vs. Numerical** → Use **scatter plots** and **correlation**.  
2. **Numerical vs. Categorical** → Use **box plots** and **bar charts**.  
3. **Categorical vs. Categorical** → Use **contingency tables** and **heatmaps**.



# 🧩 **Final Takeaways (In Simple Words):**

- **Bivariate Analysis** is about **finding relationships between two variables**.  
- Different combinations of variable types require different methods.  
- **Visualizations** like scatter plots, box plots, bar charts, and heatmaps make it easy to understand relationships.

---

### 📊 **Multivariate Analysis: A Complete Deep Dive (In Simple Terms)**

Now that you have a solid understanding of **univariate** and **bivariate analysis**, let's take it one step further to **multivariate analysis**, the most comprehensive type of data analysis!



## 🧐 **What is Multivariate Analysis?**

**Multivariate Analysis** is the process of analyzing **three or more variables** at the same time to:

- **Identify patterns and relationships** across multiple variables.
- **Understand the combined effect** of multiple variables on a target variable.
- **Detect hidden structures or clusters** in the data.

It helps answer questions like:

✅ **What factors impact house prices (location, size, number of rooms, etc.)?**  
✅ **How do age, income, and education level influence spending habits?**  
✅ **Which combination of features best predicts customer churn?**

In simpler terms:  
**Multivariate analysis looks at the big picture by studying multiple variables together.**



## 🔬 **Why is Multivariate Analysis Important?**

In the real world, most problems involve **more than two variables**. For example:

- **Predicting house prices** requires analyzing location, size, age of the house, and more.  
- **Understanding customer behavior** involves looking at age, income, education, and spending patterns.

Without multivariate analysis, you may **miss important interactions** between variables.



## 📌 **Types of Multivariate Analysis**

Multivariate analysis can be divided into two broad categories:

1. **Dependence Methods** (When there is a dependent variable)  
2. **Interdependence Methods** (No dependent variable; exploring relationships)

Let’s explore both in detail!



# 📚 **1. Dependence Methods (When There’s a Dependent Variable)**

These methods are used when you want to **predict or explain a dependent variable** using **multiple independent variables**.



### ✅ **1.1. Multiple Regression**

Used to **predict a numerical dependent variable** based on **multiple independent variables**.

#### 🔍 **Example:**

Predicting **house prices** using:

- **Location**  
- **Number of rooms**  
- **Size (sq. ft.)**  
- **Age of the house**



### 📐 **Equation for Multiple Regression:**

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + ... + \beta_n X_n
$$

Where:

- $ Y $ = Dependent variable (e.g., house price)  
- $ X_1, X_2, X_3, ... $ = Independent variables (e.g., location, size, number of rooms)  
- $ \beta_0 $ = Intercept  
- $ \beta_1, \ldots, \beta_n $ = Coefficients (how much $ Y $ changes for a one-unit increase in each variable, with the other variables held fixed)
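
To see the equation at work, the sketch below evaluates it for one house. The intercept and coefficients are hypothetical numbers chosen only to illustrate the arithmetic, not fitted values, and `predict` is an illustrative helper name.

```python
def predict(intercept, coefficients, features):
    """Evaluate Y = b0 + b1*x1 + ... + bn*xn for one observation."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))

# Hypothetical coefficients, chosen only to illustrate the arithmetic
intercept = 50_000                      # b0
coefficients = [150, 10_000, -1_000]    # price change per sq. ft., per room, per year of age
house = [2_000, 4, 5]                   # size, rooms, age

price = predict(intercept, coefficients, house)
print(price)   # 50000 + 300000 + 40000 - 5000 = 385000
```

Fitting a regression model means finding the intercept and coefficients that make these predictions as close as possible to the observed prices.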



### 📊 **How to Perform Multiple Regression in Python:**

```python
from sklearn.linear_model import LinearRegression
import pandas as pd

# Sample Data
data = {
    'Size': [1500, 2000, 2500, 1800, 2200],
    'Rooms': [3, 4, 4, 3, 5],
    'Age': [10, 5, 15, 8, 12],
    'Price': [300000, 400000, 500000, 350000, 450000]
}

# Create DataFrame
df = pd.DataFrame(data)

# Define independent and dependent variables
X = df[['Size', 'Rooms', 'Age']]
y = df['Price']

# Fit the model
model = LinearRegression()
model.fit(X, y)

# Print coefficients
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
```



### ✅ **1.2. Logistic Regression**

Used when the dependent variable is **categorical** (e.g., **Yes/No**, **Churn/No Churn**).

#### 🔍 **Example:**

Predicting whether a customer will **churn or not** based on:

- **Monthly charges**  
- **Contract type**  
- **Tenure**  
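
Logistic regression passes a weighted sum of the inputs through the sigmoid function to produce a probability between 0 and 1. The sketch below illustrates that step only: the coefficients are hypothetical (a real model learns them from data), contract type is assumed to be encoded as 1 for month-to-month and 0 otherwise, and `churn_probability` is an illustrative name.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

def churn_probability(monthly_charges, is_monthly_contract, tenure_months):
    # Hypothetical coefficients for illustration only; a real model learns them from data.
    z = -1.0 + 0.03 * monthly_charges + 1.2 * is_monthly_contract - 0.05 * tenure_months
    return sigmoid(z)

new_customer = churn_probability(monthly_charges=90, is_monthly_contract=1, tenure_months=2)
loyal_customer = churn_probability(monthly_charges=40, is_monthly_contract=0, tenure_months=60)
print(f"{new_customer:.2f} vs {loyal_customer:.2f}")
```

A common convention is to predict "churn" when the probability exceeds 0.5, although the threshold can be tuned.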



### ✅ **1.3. MANOVA (Multivariate Analysis of Variance)**

Used to test whether **multiple dependent variables** are significantly affected by **independent variables**.

#### 🔍 **Example:**

Studying the effect of **diet and exercise** on:

- **Weight loss**  
- **Cholesterol levels**  
- **Blood pressure**



# 📚 **2. Interdependence Methods (No Dependent Variable)**

These methods are used when you want to **explore relationships** between variables without focusing on a specific dependent variable.



### ✅ **2.1. Principal Component Analysis (PCA)**

Used to **reduce the dimensionality** of large datasets by building a small number of new variables (called **principal components**). Each component is a weighted combination of the original features, and the components are ordered by how much variance they capture.

#### 🔍 **Example:**

Suppose you have a dataset with **100 features**.  
PCA can reduce it to a smaller set of **components** that explain most of the variance in the data.
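
The idea is easiest to see with just two features. The sketch below runs PCA by hand on hypothetical height and weight measurements, using the closed-form eigenvalues of a 2x2 covariance matrix. Real datasets would normally go through a library routine, and features on different scales are usually standardized first.

```python
import math

# Hypothetical, strongly related measurements (e.g. height in cm and weight in kg)
xs = [150, 160, 165, 170, 180, 185]
ys = [50, 58, 63, 66, 77, 80]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Entries of the 2x2 sample covariance matrix [[a, b], [b, c]]
a = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
c = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Eigenvalues of a symmetric 2x2 matrix in closed form
half_trace = (a + c) / 2
spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lambda_1, lambda_2 = half_trace + spread, half_trace - spread

# Direction of the first principal component (valid because b is not zero here)
vx, vy = b, lambda_1 - a
norm = math.hypot(vx, vy)
pc1 = (vx / norm, vy / norm)

explained = lambda_1 / (lambda_1 + lambda_2)
print(f"PC1 direction={pc1}, variance explained={explained:.3f}")
```

Because the two features move together, one component captures almost all of the variance, so the second could be dropped with little loss.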



### ✅ **2.2. Cluster Analysis**

Used to **group similar data points** into clusters.

#### 🔍 **Example:**

Grouping **customers** based on their:

- **Purchase behavior**  
- **Age**  
- **Income**  
- **Preferences**
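
One common clustering technique is k-means. The sketch below is a deliberately tiny version that clusters on a single hypothetical feature (income) with a deterministic starting point; it assumes `k >= 2`, and `kmeans_1d` is an illustrative name rather than a library function. Real customer segmentation would use several scaled features.

```python
def kmeans_1d(values, k=2, iterations=10):
    """Tiny k-means for one numeric feature; returns (centroids, labels). Assumes k >= 2."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted values
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        # Update step: each centroid moves to the mean of its members
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, labels

# Hypothetical annual incomes (in thousands) for eight customers
incomes = [22, 25, 27, 30, 78, 82, 85, 90]
centroids, labels = kmeans_1d(incomes, k=2)
print(centroids, labels)
```

The two steps in the loop (assign, then update) repeat until the groups stop changing; here they settle after the first pass.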



### ✅ **2.3. Factor Analysis**

Used to identify **hidden factors** that explain the relationships between variables.

#### 🔍 **Example:**

In a survey on **job satisfaction**, factor analysis might reveal that:

- **Salary, benefits, and job security** → form a **“Compensation” factor**.  
- **Work-life balance, flexible hours** → form a **“Work Environment” factor**.

# 📊 **Visualizing Multivariate Analysis**

| **Method**               | **Best For**                                | **Visualization**      |
|--------------------------|---------------------------------------------|-----------------------|
| **Pair Plot**             | Numerical vs Numerical relationships        | 📈 Pair Plot           |
| **Heatmap**               | Correlation between multiple variables      | 🔥 Heatmap             |
| **3D Scatter Plot**       | Exploring relationships between 3 variables | 🧊 3D Scatter Plot     |
| **Parallel Coordinates**  | Comparing multiple variables simultaneously | 📉 Parallel Coordinates |



### 🖼️ **Pair Plot Example (Seaborn):**

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Sample Data
data = sns.load_dataset('iris')

# Pair Plot
sns.pairplot(data)
plt.show()
```



# 🧩 **Multivariate Analysis vs. Bivariate Analysis**

| **Aspect**          | **Bivariate Analysis**            | **Multivariate Analysis**           |
|---------------------|-----------------------------------|------------------------------------|
| **Number of Variables** | 2                               | 3 or more                           |
| **Objective**        | Find relationships between 2 variables | Explore relationships between multiple variables |
| **Methods**          | Correlation, Regression           | PCA, Clustering, Factor Analysis    |
| **Visuals**          | Scatter Plot, Box Plot            | Heatmap, Pair Plot, 3D Plots        |





# ✅ **Steps to Perform Multivariate Analysis**

1. **Understand the variables and their types** (Numerical/Categorical).  
2. **Check for missing values and clean the data**.  
3. **Visualize relationships** using pair plots or heatmaps.  
4. **Apply appropriate techniques** (e.g., PCA, regression, clustering).  
5. **Interpret the results and draw insights**.



# 🔑 **Key Takeaways (In Simple Words):**

- **Multivariate analysis** looks at **multiple variables** at once.  
- It helps in **understanding complex relationships** and **reducing dimensionality**.  
- **Dependence methods** are used for prediction, while **interdependence methods** are used for exploration.  
- **Visual tools** like pair plots, heatmaps, and 3D scatter plots are crucial for understanding the data.

---

# 🚩 **1. What are Quantiles?**

Quantiles are cut points that divide a sorted dataset into **equal parts**. Think of quantiles as a way to slice your data into **chunks** of equal size.

### ✅ **Key Concept**  
- If you divide your data into **4 equal parts**, you get **quartiles**.  
- If you divide it into **10 equal parts**, you get **deciles**.  
- If you divide it into **100 equal parts**, you get **percentiles**.

### ✅ **Types of Quantiles:**

| **Quantile Type** | **Divides Data Into** | **Cut Points**               |
|-------------------|-----------------------|------------------------------|
| Quartiles         | 4 equal parts         | Q1 (25%), Q2 (50%), Q3 (75%) |
| Quintiles         | 5 equal parts         | 20%, 40%, 60%, 80%           |
| Deciles           | 10 equal parts        | 10%, 20%, 30%, ..., 90%      |
| Percentiles       | 100 equal parts       | 1%, 2%, 3%, ..., 99%         |



## 🧮 **How to Calculate Quantiles:**

Let’s take an example dataset to understand:

**Dataset:**  
`[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]`

### ✅ **Quartiles (4 equal parts):**

- **Q1 (25%)** = The value at the 25% mark = **25th percentile**  
- **Q2 (50%)** = The value at the 50% mark = **Median**  
- **Q3 (75%)** = The value at the 75% mark = **75th percentile**

For this dataset, taking Q1 and Q3 as the medians of the lower and upper halves (other interpolation conventions give slightly different Q1 and Q3):

| **Quartile** | **Value** |
|--------------|-----------|
| Q1           | 30        |
| Q2 (Median)  | 55        |
| Q3           | 80        |
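
Quartile values depend on the convention used, which is why different tools can disagree on Q1 and Q3 for the same data. The sketch below compares three common conventions on this dataset using Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Convention 1: median of each half (the data splits cleanly into two halves of five)
lower, upper = data[:len(data) // 2], data[len(data) // 2:]
halves = [statistics.median(lower), statistics.median(data), statistics.median(upper)]

# Convention 2: interpolate at position p * (N + 1)
exclusive = statistics.quantiles(data, n=4, method="exclusive")

# Convention 3: interpolate at position p * (N - 1) + 1
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print(halves)      # [30, 55.0, 80]
print(exclusive)   # [27.5, 55.0, 82.5]
print(inclusive)   # [32.5, 55.0, 77.5]
```

All three agree on the median; they differ only in where they place Q1 and Q3, and the gap shrinks as the dataset grows.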



### ✅ **Why Quantiles Matter in Machine Learning:**

Quantiles help us understand the **distribution** of data. They are useful for:

1. **Handling outliers**:  
   - The **1st quartile (Q1)** and **3rd quartile (Q3)** can be used to detect outliers using the **IQR (Interquartile Range)** method.
   
2. **Scaling data**:  
   - In **quantile normalization**, data is scaled based on quantiles to make different distributions comparable.



# 🚩 **2. What are Percentiles?**

Percentiles are a specific type of **quantile** that divide your data into **100 equal parts**.

### ✅ **Definition:**
A **percentile** tells you the value **below which a given percentage of data points fall**.

For example:
- **90th percentile** means **90% of the data points** are **below** this value.



## 🧮 **How to Calculate Percentiles:**

Using the same dataset:

**Dataset:**  
`[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]`

### ✅ **Let’s find the 60th percentile:**

#### **Step 1: Formula to find the position of the percentile:**

$$
P = \left(\frac{n}{100}\right) \times (N + 1)
$$

Where:
- $ P $ = Percentile position  
- $ n $ = Desired percentile (e.g., 60)  
- $ N $ = Total number of data points  

For the **60th percentile**:

$$
P = \left(\frac{60}{100}\right) \times (10 + 1) = 6.6
$$

#### **Step 2: Interpolate between the 6th and 7th values:**

- The 6th value = **60**  
- The 7th value = **70**  

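Since the position is **6.6**, we take **60 + 0.6 × (70 - 60) = 66**.

✅ **The 60th percentile = 66.**

This result follows the $ (N + 1) $ position formula used here; libraries may default to a different interpolation rule and return a slightly different value.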
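
The two steps can be wrapped in a small function. This is a minimal sketch of the $ (N + 1) $ position formula with linear interpolation, assuming the input is already sorted; the clamping of the position at the ends of the data is an added assumption, and `percentile` is an illustrative name.

```python
def percentile(sorted_values, p):
    """Percentile via position = (p / 100) * (N + 1) with linear interpolation."""
    n = len(sorted_values)
    position = (p / 100) * (n + 1)          # 1-based, may be fractional
    position = min(max(position, 1), n)     # clamp so the lookup stays inside the data
    lower_index = int(position)             # whole part: 1-based rank of the lower neighbour
    fraction = position - lower_index
    lower = sorted_values[lower_index - 1]
    upper = sorted_values[min(lower_index, n - 1)]
    return lower + fraction * (upper - lower)

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(round(percentile(data, 60), 2))   # 66.0
```

Rounding is only there to hide floating-point noise in the fractional position.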



# 🚩 **3. Quantiles vs. Percentiles**

| **Aspect**         | **Quantiles**                                   | **Percentiles**                               |
|--------------------|-------------------------------------------------|----------------------------------------------|
| **Definition**     | Divide data into equal intervals                | Divide data into 100 equal intervals         |
| **Range**          | Depends on type (quartiles, deciles, etc.)      | 1st to 99th percentile                       |
| **Use Case**       | General data distribution understanding         | Ranking individuals or values in a dataset   |



# 🚩 **4. Real-World Application of Percentiles:**

### **Example: Student Scores in an Exam**

Let’s say **1000 students** took an exam. The score distribution is as follows:

| **Score** | **Percentile** |
|-----------|----------------|
| 45        | 25th percentile |
| 60        | 50th percentile (Median) |
| 85        | 90th percentile |

- A score of **60** puts the student at the **50th percentile**: **50% of students scored below** this mark.
- A score of **85** puts the student at the **90th percentile**: **90% of students scored below** this mark.



# 🚩 **5. How Quantiles and Percentiles Help in Machine Learning:**

### ✅ **Use Case 1: Outlier Detection**

The **Interquartile Range (IQR)** method uses **quartiles** to detect outliers:

$$
\text{IQR} = Q3 - Q1
$$

- **Lower bound** = $ Q1 - 1.5 \times \text{IQR} $  
- **Upper bound** = $ Q3 + 1.5 \times \text{IQR} $

Values outside this range are considered **outliers**.
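
Here is a minimal sketch of the IQR rule on a small hypothetical dataset. It uses the default quartile convention of Python's `statistics.quantiles`; other conventions shift the bounds slightly but flag the same obvious outlier here.

```python
import statistics

# Hypothetical measurements with one suspicious value
values = [10, 12, 13, 15, 16, 18, 19, 21, 22, 95]

q1, _, q3 = statistics.quantiles(values, n=4)   # default "exclusive" method
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_bound or v > upper_bound]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"bounds=({lower_bound}, {upper_bound}), outliers={outliers}")
```

Whether a flagged value should be removed, capped, or kept depends on whether it is a data error or a genuine extreme.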



### ✅ **Use Case 2: Feature Scaling (Quantile Transformer)**

In machine learning, **scaling data** is important for models like **SVMs** and **neural networks**. The **Quantile Transformer** in **scikit-learn** maps each feature through its estimated quantiles so that the transformed values follow a **uniform** distribution (or a normal one with `output_distribution='normal'`).

```python
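import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Illustrative single-feature data with one extreme value; replace with your own X
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
qt = QuantileTransformer(n_quantiles=5, output_distribution='uniform')  # n_quantiles <= n_samples
X_scaled = qt.fit_transform(X)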
```

---