## How to use PDF in Data Science?


## 2D density plots

## Normal Distribution

The normal distribution is also known as the **‘Gaussian Distribution’** or the **‘Bell Curve’**.

- Fitting a normal distribution to data is a form of **Parametric Density Estimation**.
- It is a **Continuous Probability Distribution**, so it is described by a **Probability Density Function (PDF)**. This PDF is **symmetrical around the mean** and looks like a **bell-shaped curve**.

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*G5J3SMzt2cmYDnBL6qu9MQ.jpeg" width="400">
  <br>
  <em>Normal Distribution</em>
</p>



### In this image
- The **‘0’** at the center marks the **mean value**.
- The **Y-axis** shows the **Probability Density**, and the **X-axis** shows the values of $x$.
- The curve declines towards both ends, which are called the **‘Tails’**. The tails never touch the x-axis: the curve is **Asymptotic**, meaning it gets closer and closer to the x-axis as $x$ goes to **Infinity** without ever reaching it.
- In short, most points of a bell-shaped curve lie near the center (the mean), and fewer points lie in the tails on both sides.
- Where the curve is high, the data is dense; where the curve is low, the data is sparse.
- A normal distribution is characterized by two parameters: **‘$\mu$’ (Mean)** and **‘$\sigma$’ (Standard Deviation)**.
- If data follows a normal distribution and we know its mean and standard deviation, we can draw its curve.
- The mean describes the **Centre** of the data and the standard deviation describes its **Spread**.
- These two parameters fully determine a normal distribution.

---

## Importance of Normal Distribution

- It is widespread: many natural phenomena approximately follow a normal distribution, such as the heights and weights of people and the IQ scores of a population.
- Over many years, researchers across different domains collected and analyzed data. When they plotted the PDFs of their data, they repeatedly observed a bell-shaped curve. Because this curve appears so frequently in nature, it came to be called the **Normal Distribution**.
- Since then, the normal distribution has been studied extensively, and its properties are well known.
- Therefore, if our data follows a normal distribution, it becomes much easier to analyze, because we can rely on these well-understood properties.

---

## PDF Equation of Normal Distribution

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

**Where:**
- $x$: The value of the random variable  
- $\mu$: The population mean (location of the peak)  
- $\sigma$: The population standard deviation (spread of the distribution)  
- $\sigma^2$: The variance  
- $\pi$: Approximately 3.14159  
- $e$: The base of the natural logarithm, approximately 2.71828  
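The equation above can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `normal_pdf` and the default parameter values are illustrative choices, not part of the original text. It also cross-checks the result against `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Evaluate the normal PDF f(x | mu, sigma) term by term."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coefficient * math.exp(exponent)


# Density at the peak of the standard normal curve: 1 / sqrt(2 * pi).
peak_density = normal_pdf(0.0)

# The hand-written formula should agree with the standard library.
library_density = NormalDist(mu=1.0, sigma=2.0).pdf(1.5)
manual_density = normal_pdf(1.5, mu=1.0, sigma=2.0)

print(peak_density, manual_density, library_density)
```

Note that the returned value is a density, not a probability: probabilities come from the area under the curve over an interval.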

---

## Parameters of Normal Distribution

If we change the values of $\mu$ and $\sigma$ in the equation of the normal distribution, it will affect the shape and position of the graph.

### 1. Effect of changing ‘$\mu$’ (mean) — impact on bell curve position

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*f1P_NPS2Gr4Bh9GmHsq8kw.png" width="400">
  <br>
  <em>Change in position of normal distribution on x-axis with change in mean ‘$\mu$’</em>
</p>



- If $\mu$ increases from 0, the whole curve shifts along the x-axis in the positive direction.
- Conversely, if $\mu$ decreases from 0, the whole curve shifts along the x-axis in the negative direction.
- Essentially, changing the mean $\mu$ causes a **horizontal shift** of the graph along the x-axis.

---

### 2. Effect of changing ‘$\sigma$’ (standard deviation) — impact on bell curve shape

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*zlBeMCIRo8iiEq8dP4JZ_A.jpeg" width="400">
  <br>
  <em>Change in ‘$\sigma$’ impact on normal distribution</em>
</p>



- The standard deviation $\sigma$ influences the spread or dispersion of the data.
- If we increase $\sigma$, the peak of the curve gets lower and the curve gets wider. The peak height is $\frac{1}{\sigma\sqrt{2\pi}}$, so it falls in proportion to $\sigma$. A larger spread means the data points are more scattered around the mean and the curve is flatter.
- Conversely, if we decrease $\sigma$, the curve becomes narrower, indicating a smaller spread of data points around the mean. The peak gets higher, giving the curve a sharper shape.
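The effect of both parameters can be checked numerically. This is a small sketch, assuming the standard library's `statistics.NormalDist`; the helper name `peak` and the chosen parameter values are hypothetical. It reports where the curve peaks and how high the peak is.

```python
from statistics import NormalDist


def peak(mu, sigma):
    """Return (location, height) of the peak of a normal curve."""
    # The normal PDF is highest at x = mu.
    return mu, NormalDist(mu, sigma).pdf(mu)


# Changing mu moves the peak left or right but leaves its height unchanged.
for mu in (-2.0, 0.0, 2.0):
    print("mu =", mu, "->", peak(mu, 1.0))

# Changing sigma keeps the peak at the same place but changes its height.
for sigma in (0.5, 1.0, 2.0):
    print("sigma =", sigma, "->", peak(0.0, sigma))
```

Doubling $\sigma$ halves the peak height, because the total area under the curve must stay equal to 1.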

---

## Intuition

The intuition is that $\mu$ sets where the bell curve is centered, while $\sigma$ sets how wide and how tall it is.

## Standard Normal Variate

- This is a special case of the normal distribution.
- It is also known as **‘Z’**.
- If the mean $\mu$ is **0** and the standard deviation $\sigma$ is **1**, the distribution is centered at zero and is called the **‘Standard Normal Distribution’**. A variable that follows it is denoted by **‘Z’**.
- Normal distribution: **X ~ N(μ , σ)**  
- Standard Normal Distribution: **Z ~ N(0, 1)**  

---

### Importance of Standard Normal Distribution

<p align="center">
  <img src="https://www.mathsisfun.com/data/images/normal-distrubution-large.svg" width="300">
  <br>
  <em>Standard Normal Distribution</em>
</p>

This is the graph of the **‘Standard Normal Variate’**, with **Mean = 0** and **Standard Deviation = 1**.

- It allows us to compare different distributions with each other by converting them to the standard normal distribution.
- Using the standard normal distribution, we can calculate a **‘probability’** from a **‘Standard table’** (Z-table).
- We can find any such probability because the table lists the cumulative probabilities of the standard normal variate.
- When we replace the mean $\mu$ with **0** and the standard deviation $\sigma$ with **1** in the normal distribution equation, we get the standard normal distribution formula.
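Written out, that substitution looks as follows. This is a short derivation sketch; the symbols $z$ for the standardized value and $\varphi$ for its density are notational choices made here, not names used earlier in the text.

$$
f(z \mid \mu = 0, \sigma = 1) = \frac{1}{1 \cdot \sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{z - 0}{1}\right)^2} = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{z^2}{2}} = \varphi(z)
$$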


---

### Z-table

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*e_d_6cUOL8vr2YAaZ_GaoQ.png" width="300">
  <br>
  <em>Z-Score Standardization</em>
</p>

#### Importance of Z-scores

1. They can be used to decide whether to reject or fail to reject the Null Hypothesis.
2. They let us compare two scores from different samples that have different means and standard deviations.
3. They help identify outliers.
4. They let us calculate probabilities and percentiles using the standard normal distribution.

---

### Z-score Formula

The formula for calculating a z-score is  

**Z = (x − μ) / σ**  

**Z = (Data point − Mean) / Standard Deviation**

Where:  
- **x** is the data point of interest  
- **μ** is the population mean  
- **σ** is the population standard deviation  

**Note:** If the population mean and standard deviation are not known, we can use the sample mean and sample standard deviation instead.  
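As an illustration of the formula, the sketch below standardizes two scores that come from different distributions so they can be compared, then turns one z-score into a percentile. The exam scores, means, and standard deviations are made-up values, and `z_score` is a hypothetical helper name.

```python
from statistics import NormalDist


def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


# Hypothetical exam results measured on different scales.
math_z = z_score(85, mean=70, std_dev=10)    # (85 - 70) / 10
physics_z = z_score(60, mean=50, std_dev=5)  # (60 - 50) / 5

# Probability of scoring at or below the math result, assuming the math
# scores are normally distributed.
math_percentile = NormalDist().cdf(math_z)

print(math_z, physics_z, round(math_percentile, 4))
```

Although 85 is the larger raw score, the physics score has the higher z-score, so under these assumed parameters it is the more exceptional result relative to its own distribution.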

For more on z-score refer:  
https://medium.com/analytics-vidhya/z-score-in-detail-9dd0f0afa142  

---

### Empirical Rule

This is one of the most important properties of the normal distribution:

- Approximately **68%** of data falls within **1 standard deviation** from the mean.  
- Around **95%** of data falls within **2 standard deviations**.  
- Nearly **99.7%** of data falls within **3 standard deviations**.  

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*BvSWE3LGXBSn7z-IVElI-Q.png" width="300">
  <br>
  <em>Empirical Rule</em>
</p>
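The three percentages of the empirical rule can be reproduced from the standard normal CDF. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the helper name `share_within` is an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

Because any normal distribution can be standardized, these shares are the same for every choice of $\mu$ and $\sigma$.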


## Properties of Normal Distribution

### 1. Symmetricity

The normal distribution is **‘Symmetric around the mean’**: the left half is a mirror image of the right half.  
If we know the probabilities on one side of the mean, we can easily obtain the probabilities on the other side.  

---

### 2. Measures of Central Tendency are Equal

The mean, median, and mode are all equal for a true normal distribution.  

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:566/format:webp/1*RA7XLNHdsOWDx5eGKyKOEA.png" width="300">
  <br>
  <em>Mean = Median = Mode</em>
</p>

---

### 3. The Area under the Curve

The total area under the curve is **‘1’**. This is true for any PDF (Probability Density Function), not only for the normal distribution.


## Skewness

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*Z0wk2ut5quB4bArWWFJ9Sw.png" width="400">
  <br>
  <em>Positive — Zero — Negative Skewness</em>
</p>

- A normal distribution is **‘Symmetric’**; **‘Skewness’** measures how far a distribution departs from that symmetry.
- If there is skewness, the data leans more towards one side.
- The skewness value tells us how strongly the data departs from a symmetric shape.
- In a **‘Symmetrical Distribution’** such as the normal distribution, the Mean, Median, and Mode are all **‘EQUAL’**.
- In a skewed distribution, the Mean, Median, and Mode are not equal, and one tail is longer than the other.

---

### Types of Skewness

There can be two types of skewness:

#### Positive Skewness
When the tail of the distribution is longer on the **‘Right side’**, the distribution is called **‘Positively Skewed’** or **‘Right Skewed’**.  
In this case, the Mean is greater than the Median and Mode:

**Mean > Median > Mode**

---

#### Negative Skewness
When the tail of the distribution is longer on the **‘Left side’**, the distribution is called **‘Negatively Skewed’** or **‘Left Skewed’**.  
In this case, the Mean is less than the Median and Mode:

**Mean < Median < Mode**

> The greater the skew, the greater the distance between mean, median, and mode.

---

### How is Skewness Calculated?

In statistics, a distribution is commonly summarized by its first four **‘Moments’**. Skewness is based on the third one:
1. First moment: **Mean**
2. Second moment: **Variance**
3. Third moment: **Skewness**
4. Fourth moment: **Kurtosis**

---

### Skewness Formula

The most common measure of skewness is the **Pearson’s Moment Coefficient of Skewness**.

#### For Population Data ($\gamma_1$):

$$
\gamma_1 = E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right] = \frac{\sum_{i=1}^{N} (x_i - \mu)^3}{N\sigma^3}
$$

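#### For Sample Data ($G_1$):

When working with a sample, the **Adjusted Fisher-Pearson Standardized Moment Coefficient** is typically used. Here $\bar{x}$ is the sample mean and $s$ is the sample standard deviation (computed with $n-1$ in the denominator):

$$
G_1 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3
$$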
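Both formulas can be sketched in a few lines of standard-library Python. The function names and the two small datasets below are hypothetical, chosen only to show a symmetric case and a right-skewed case.

```python
import math


def population_skewness(data):
    """Moment coefficient of skewness, treating data as the whole population."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return sum((x - mu) ** 3 for x in data) / (n * sigma ** 3)


def sample_skewness(data):
    """Adjusted Fisher-Pearson coefficient; needs at least 3 data points."""
    n = len(data)
    x_bar = sum(data) / n
    # Sample standard deviation, with n - 1 in the denominator.
    s = math.sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in data)


symmetric_data = [1, 2, 3, 4, 5]
right_skewed_data = [1, 1, 2, 2, 3, 10]

print(population_skewness(symmetric_data), sample_skewness(symmetric_data))
print(population_skewness(right_skewed_data), sample_skewness(right_skewed_data))
```

For the same data, the sample value equals the population value multiplied by $\frac{\sqrt{n(n-1)}}{n-2}$, so the two converge as $n$ grows.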

---

### Interpretation of Skewness

- **Skewness = 0**: The distribution is perfectly symmetrical (like a Normal Distribution).
- **Skewness > 0 (Positive Skew)**: The "tail" on the right side is longer or fatter. Most of the data is concentrated on the left.
- **Skewness < 0 (Negative Skew)**: The "tail" on the left side is longer or fatter. Most of the data is concentrated on the right.

---

![Skewness Diagram](Assets/Img/skewness.png)

---

### Note

- The moment coefficient of skewness has no fixed theoretical bounds, but in practice most values fall roughly between **$-3$ and $3$**.
- Skewness values beyond the range of **$-2$ to $2$** are less frequently observed and indicate extreme departures from symmetry.
- A value of **$0$** indicates a perfect symmetrical distribution.
- A value between **$-0.5$ & $0$** or between **$0$ & $0.5$** indicates an **‘Approximately Symmetric Distribution’**.
- A value between **$-1$ & $-0.5$** or between **$0.5$ & $1$** indicates a **‘Moderately Skewed Distribution’**.
- A value between **$-1.5$ & $-1$** or between **$1$ & $1.5$** indicates a **‘Highly Skewed Distribution’**.
- A value less than **$-1.5$** or greater than **$1.5$** indicates an **‘Extremely Skewed Distribution’**.
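- In real data, getting an exact **$0$** is very rare; values between **$-0.5$** and **$0.5$** are often treated as close enough to symmetric to be consistent with a **Normal Distribution**, although low skewness alone does not prove that the data is normal.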
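These rule-of-thumb bands can be expressed as a small lookup function. This is only a sketch: the function name is hypothetical, and because the bands above share their boundary values, the choice to assign each boundary (0.5, 1, 1.5) to the more skewed band is an assumption made here.

```python
def describe_skewness(skewness):
    """Map a skewness value to the rule-of-thumb bands listed above."""
    magnitude = abs(skewness)  # the bands are symmetric around zero
    if magnitude < 0.5:
        return "approximately symmetric"
    if magnitude < 1.0:
        return "moderately skewed"
    if magnitude < 1.5:
        return "highly skewed"
    return "extremely skewed"


for value in (0.1, -0.7, 1.2, -2.4):
    print(value, "->", describe_skewness(value))
```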

## CDF of Normal Distribution

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*qA3ZWwljjwf0U-cQnT2-jw.png" width="300">
  <br>
  <em>CDF of Normal Distribution</em>
</p>

### In this graph

- The blue, red, and yellow lines all have a mean of **0** but differ in standard deviation.
- As the standard deviation increases, the curve becomes flatter: it rises more gradually because the probability is spread further from the mean.
- As the standard deviation decreases, the curve becomes steeper around the mean, because most of the probability lies close to the center.
- The center of the graph is the mean, **0**.
- The **Cumulative Distribution Function (CDF)** gives the probability that the variable takes a value up to a certain point, denoted as **P(X ≤ x)**.
- For a symmetric distribution, such as the normal distribution, this probability is **50%** at the mean.
- To find the CDF at a particular point **‘x’**, we integrate the PDF from **negative infinity** up to that point.

In other words, integrating the PDF (finding the area under the curve) gives the **CDF**, and differentiating the CDF gives back the **PDF**.
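The relationship between the PDF and the CDF can be checked numerically. The sketch below is an approximation under stated assumptions: it uses the trapezoidal rule, replaces negative infinity with a point 10 standard deviations below the mean (the area beyond it is negligible), and uses a central difference for the derivative. The function names, step count, and step size are illustrative.

```python
from statistics import NormalDist


def cdf_by_integration(x, mu=0.0, sigma=1.0, steps=20_000):
    """Approximate P(X <= x) by integrating the PDF with the trapezoidal rule."""
    dist = NormalDist(mu, sigma)
    lower = mu - 10.0 * sigma  # stand-in for negative infinity
    if x <= lower:
        return 0.0
    width = (x - lower) / steps
    total = 0.5 * (dist.pdf(lower) + dist.pdf(x))
    for i in range(1, steps):
        total += dist.pdf(lower + i * width)
    return total * width


def pdf_by_differentiation(x, mu=0.0, sigma=1.0, h=1e-5):
    """Approximate the PDF as the slope of the CDF (central difference)."""
    dist = NormalDist(mu, sigma)
    return (dist.cdf(x + h) - dist.cdf(x - h)) / (2.0 * h)


print(cdf_by_integration(0.0))  # close to 0.5 at the mean
print(cdf_by_integration(1.0), NormalDist().cdf(1.0))
print(pdf_by_differentiation(0.5), NormalDist().pdf(0.5))
```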


## Use of Normal Distribution in Data Science

### 1. Outlier Detection

One property of the normal distribution is the **‘Empirical Rule’**. According to this rule, a data point that lies more than **3 standard deviations** away from the mean (beyond **‘+3, -3’**) is very unlikely, so we treat that point as an **‘Outlier’**.  
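A minimal sketch of this rule is shown below. It assumes the data is roughly normal apart from the outliers, uses the sample mean and sample standard deviation as estimates, and works on made-up readings; `find_outliers` and the default threshold of 3 are illustrative choices.

```python
import statistics


def find_outliers(data, threshold=3.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(data)
    std_dev = statistics.stdev(data)  # sample standard deviation
    return [x for x in data if abs((x - mean) / std_dev) > threshold]


# Hypothetical sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
            12, 11, 10, 13, 12, 11, 12, 10, 11, 13, 95]

print(find_outliers(readings))
```

In very small samples a single extreme value inflates the standard deviation so much that no z-score can exceed 3, so this check is more reliable with larger datasets.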

---

### 2. Assumptions on Data for ML Algorithms

In machine learning, several algorithms, such as **‘Linear Regression’** and **‘Gaussian Mixture Models’**, make a normality assumption about some part of the data.  
In Linear Regression, for example, the normality assumption applies to the **‘Residuals’**.  

If the data strongly violates these assumptions, such algorithms may perform poorly or give unreliable results.  

---

### 3. Hypothesis Testing

When making inferences about a population, many tests operate under the assumption that the data follows a normal distribution.  

---

### 4. Central Limit Theorem

The core principle of the **Central Limit Theorem** is that, regardless of the shape of the original distribution (as long as it has a finite variance), the distribution of the sample means tends towards a normal distribution as the sample size grows.
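A small simulation can illustrate the theorem. The sketch below assumes an exponential population with rate 1, which is strongly right-skewed and has mean 1 and standard deviation 1. The sample size, number of samples, seed, and function name are arbitrary choices. The comparison value $\sigma/\sqrt{n}$ for the spread of the sample means is a standard result that the text above does not state.

```python
import math
import random
import statistics


def simulate_sample_means(sample_size, num_samples, seed=0):
    """Draw repeated samples from an exponential population; return their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(num_samples):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return means


means = simulate_sample_means(sample_size=50, num_samples=2000)

mean_of_means = statistics.fmean(means)   # expected to be near 1.0
spread_of_means = statistics.stdev(means)  # expected to be near 1 / sqrt(50)
expected_spread = 1.0 / math.sqrt(50)

# Share of sample means within one standard deviation of their centre;
# for a normal distribution this would be about 68%.
share_within_one = sum(
    abs(m - mean_of_means) <= spread_of_means for m in means
) / len(means)

print(mean_of_means, spread_of_means, expected_spread, share_within_one)
```

Even though individual exponential values are far from bell-shaped, the sample means cluster symmetrically around the population mean.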
