## Random Variables

definition: A variable whose value is determined by the outcome of a random phenomenon or experiment. It maps outcomes from a sample space to numerical values.

In algebra, a variable is an unknown value and is denoted by a lowercase letter, e.g. x, y, z.

But in stats and probability:

A Random Variable maps each outcome of a random experiment to a number, so it has a set of possible values rather than a single unknown value. Random variables are denoted using capital letters: X, Y, Z

**Sample Space (S)**: The set of all possible outcomes of a random experiment.

**Random Variable**: A function that assigns a numerical value to each outcome in the sample space.

Example: 
- Experiment: Toss a coin twice
- Sample Space: S = {HH, HT, TH, TT}
- Random Variable X = number of heads
- X can take values: {0, 1, 2}
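
The mapping from outcomes to numbers can be sketched in a few lines of Python. This is only an illustration; the names `sample_space`, `X`, and `values` are chosen here and do not come from any library.

```python
from itertools import product

# Sample space for two coin tosses: every ordered pair of H/T
sample_space = ["".join(outcome) for outcome in product("HT", repeat=2)]

# The random variable X is a function: outcome -> number of heads
def X(outcome):
    return outcome.count("H")

# The set of values X can take
values = sorted({X(outcome) for outcome in sample_space})

print(sample_space)  # ['HH', 'HT', 'TH', 'TT']
print(values)        # [0, 1, 2]
```

Note that two different outcomes (HT and TH) map to the same value 1, which is why X is a function on the sample space rather than a relabeling of it.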


### Types of Random Variables

#### 1. Discrete Random Variable

definition: A random variable that can take on a countable number of distinct values (finite or countably infinite).

characteristics:
- Values are discrete/separate (no values in between)
- Can be counted (1, 2, 3, ...)
- Gaps exist between possible values
- Probability Mass Function (PMF) describes probabilities

examples:
- Number of students in a class: {0, 1, 2, 3, ...}
- Number of heads in 10 coin tosses: {0, 1, 2, ..., 10}
- Number of cars sold per day: {0, 1, 2, 3, ...}
- Dice roll outcome: {1, 2, 3, 4, 5, 6}
- Number of defective items in a batch

visualization: Bar charts, probability mass function plots

mathematical notation: P(X = x) for specific values


#### 2. Continuous Random Variable

definition: A random variable that can take any value within a range or interval (infinite number of possible values).

characteristics:
- Values can be any number in a range (uncountable)
- Cannot list all possible values
- No gaps between values
- Probability Density Function (PDF) describes probabilities
- P(X = specific value) = 0 (probability at exact point is zero)
- Instead, we calculate P(a < X < b) for intervals

examples:
- Height of students: any value between 4 ft to 7 ft (e.g., 5.734 ft)
- Time taken to complete a task: 0 to ∞ seconds
- Temperature: -50°C to 50°C (can be 23.456°C)
- Weight of a person: 0 to 200 kg
- Distance traveled: any positive real number

visualization: Histograms, density curves, probability density function plots

mathematical notation: P(a ≤ X ≤ b) for intervals, not P(X = x)


### Key Differences

| Aspect | Discrete | Continuous |
|--------|----------|------------|
| **Values** | Countable, distinct | Uncountable, any value in range |
| **Examples** | 1, 2, 3, ... | 1.5, 2.37, 3.14159... |
| **Probability function** | PMF (Probability Mass Function) | PDF (Probability Density Function) |
| **At specific point** | P(X = x) can be > 0 | P(X = x) = 0 |
| **Visual** | Bar chart | Smooth curve |
| **Sum of probabilities** | ΣP(X = x) = 1 | ∫f(x)dx = 1 |
| **Typical data** | Counts, categories | Measurements |


### Important Notes

**For Discrete Variables:**
- Sum of all probabilities = 1
- Each outcome has a specific probability
- Can calculate exact P(X = value)

**For Continuous Variables:**
- Area under PDF curve = 1
- Probability of exact value = 0
- Calculate probability for intervals only: P(a < X < b)
- Use cumulative distribution function (CDF) to find probabilities

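**Variables That Can Be Treated Either Way:**
Some variables can be modeled as discrete or continuous, depending on how they are recorded:
- Age: Discrete if in whole years (25, 26, 27), Continuous if exact (25.384 years)
- Income: Discrete if rounded to whole dollars ($50,000); usually treated as continuous when recorded finely ($50,234.67)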

note: In practice, measurements are often discrete (limited by instrument precision), but we treat them as continuous if they have many possible values and approximation is reasonable.

## Probability Distribution

A **probability distribution** is a list of all the possible outcomes of a random variable along with their corresponding probability values.

---

### Example: Rolling Two Dice

If we throw two dice together, what are the possible values we get after adding the numbers?

- **Minimum value:**  
  If both dice show 1, the minimum sum is **2**
- **Maximum value:**  
  If both dice show 6, the maximum sum is **12**

So the possible outcomes are:

**2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12**

<img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*oZRq48kREoXJpCetetkYRQ.png" 
     width="300px" 
     alt="Two dice outcomes">

---

### Probability Table

| Sum | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|-----|---|---|---|---|---|---|---|---|----|----|----|
| Probability | 1/36 | 2/36 | 3/36 | 4/36 | 5/36 | 6/36 | 5/36 | 4/36 | 3/36 | 2/36 | 1/36 |

<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*ZsL1jb4AT_3D9_FcquV53Q.png" 
     width="300px" 
     alt="Probability Mass Function for two dice">

**Probability Mass Function for Two Dice Roll Example**

In this example, where two dice are rolled at the same time, the probability of each outcome is **not the same**.

- The **highest probability** is for the sum **7**
- The **lowest probability** is for the sums **2** and **12**
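
The table can be reproduced by enumerating all 36 equally likely pairs. The sketch below assumes two fair six-sided dice; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 (die1, die2) pairs produce each sum
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Divide by 36 to turn counts into probabilities
# (Fraction reduces automatically, so 6/36 is shown as 1/6)
pmf = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

for total, p in pmf.items():
    print(total, p)
```

Using `Fraction` keeps the probabilities exact, so the check that they sum to 1 does not suffer from floating-point rounding.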

---

## Problem with This Type of Distribution

This method of finding probability becomes **tedious** when the number of outcomes is large, because it requires writing extensive tables.

### Example

- Rolling **10 dice together** and creating a probability distribution of their sum  
- If the number of outcomes is **very large** or **infinite**, creating a table becomes impractical
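
To get a feel for the size of the problem, the distribution of the sum of 10 dice can be computed numerically instead of by hand. The sketch below assumes fair, independent dice and uses the fact that the PMF of a sum of independent variables is the convolution of their PMFs.

```python
import numpy as np

die = np.full(6, 1 / 6)          # PMF of one fair die, faces 1..6
pmf = die.copy()
for _ in range(9):               # add nine more dice, one at a time
    pmf = np.convolve(pmf, die)  # PMF of the running sum

sums = np.arange(10, 61)         # possible sums: 10 .. 60
most_likely = sums[np.argmax(pmf)]

print(len(pmf))      # number of distinct sums
print(most_likely)   # the sum with the highest probability
```

Enumerating outcomes directly would mean going through 6^10 (about 60 million) combinations, which is why a table built by hand is impractical here.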

---

## Solution: Using a Function

Instead of using tables, we can use a **mathematical function** to model the relationship between outcomes and probabilities.

$$y = f(x)$$

**Where:**
* $x$ = outcome
* $y$ = probability

This function maps outcomes to probabilities and allows us to plot a graph.

---

## Probability Distribution Function

A **Probability Distribution Function** is a mathematical relationship between all possible outcomes of a random variable and their corresponding probabilities.

Using functions instead of tables makes it easier to work with **large or continuous datasets**.


# Probability Distribution Functions and Their Types

## Problem with Distribution
In many scenarios, the number of outcomes is very large, so a table would be tedious to write down.  
If the number of outcomes is infinite, a table is impossible.

**Example:**  
Rolling 10 dice together and making a probability distribution of their sum.

---

## The Solution to This Problem Is a Function
If we use a mathematical function to model the relationship between outcome and probability, such as

$$y = f(x)$$

we can plot a graph using that function.  
This is called a **Probability Distribution Function**.

---

## Types of Probability Distribution Function
A probability distribution is one of two types:

1. **Discrete**
2. **Continuous**

![Probability Distributions](Assets/Img/discrete-continuos-prob-distribution.png)

- A graph made of **separate bars** (histogram-like) shows a **Discrete Probability Distribution**
  - **Examples:** Binomial, Poisson, Categorical
- A graph drawn as a **continuous line** shows a **Continuous Probability Distribution**
  - **Examples:** Normal Distribution, Uniform, Chi-Square, Exponential, Pareto, F Distribution, Log-Normal

---

## A Few Famous Probability Distributions

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*Zl6tWSiGKRT-k04d2ET2AA.jpeg"
       width="500px"
       alt="Probability Distributions">
</p>

Some probability distributions occur very often in real-world data. When we plot data with the possible outcomes on the X-axis and their probabilities (or probability densities) on the Y-axis, there is a good chance the shape will resemble one of these well-known distributions.

---

## Why Is Probability Distribution Important?

- It gives an idea of the shape of our data, which makes the data easier to read and to draw conclusions from.
- **Example:** The salary distribution of employees in a company can be understood just by looking at the graph.
- If data follows a famous distribution, we automatically know a lot about the data.
  - **Example:** If data is normally distributed, the known properties of the normal distribution apply to our data.

---

## A Note on Parameters

- Parameters in probability distributions are numerical values that determine the **shape**, **location**, and **scale** of the distribution.
- Different probability distributions have different sets of parameters that determine their shape and characteristics.
- Understanding these parameters is essential in statistical analysis and inference.
- Parameters act like **“tuning knobs”**: changing them changes the shape of the distribution's graph.
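
As a small illustration of the tuning-knob idea, the sketch below evaluates the normal PDF at its center for three standard deviations. The values 1, 2, and 5 are arbitrary choices made for this example.

```python
from scipy.stats import norm

# Same mean (0), three different standard deviations
peaks = {sigma: norm.pdf(0, loc=0, scale=sigma) for sigma in (1, 2, 5)}

for sigma, peak in peaks.items():
    print(f"sigma={sigma}: peak density={peak:.4f}")
```

The peak drops as the standard deviation grows: the curve gets wider and flatter while the total area under it stays 1.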

---

## Types of Probability Distribution Functions

A probability distribution function is a mathematical function that describes the probability of obtaining different values of a random variable in a particular probability distribution.

1. **Probability Mass Function (PMF)**: for discrete random variables
2. **Probability Density Function (PDF)**: for continuous random variables
3. **Cumulative Distribution Function (CDF)**: for both; it accumulates the mass (discrete) or the density (continuous)


## Probability Mass Function (PMF)

PMF describes the distribution of a **Discrete Random Variable**.

PMF assigns probability to each possible value of the random variable, and it should follow two conditions:

- The probability assigned to each value cannot be **negative**.  
  It must lie between 0 and 1: $0 \le P(X = x) \le 1$. Values the variable cannot take get probability 0.
- The **sum of probabilities of all possible outcomes must be equal to 1**.

### Example: Dice Roll

For a fair die rolled once,

$$
P(X = x) = 
\begin{cases} 
\frac{1}{6}, & \text{if } x \in \{1, 2, 3, 4, 5, 6\} \\
0, & \text{otherwise}
\end{cases}
$$

**Where:**
* $X$: The random variable representing the outcome of the die roll.
* $x$: A specific possible outcome.
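
The same PMF can be written as a small Python function, which makes the two conditions easy to check. This is a sketch for the fair-die case only; the function name `pmf` is illustrative.

```python
from fractions import Fraction

def pmf(x):
    """P(X = x) for one roll of a fair six-sided die."""
    return Fraction(1, 6) if x in {1, 2, 3, 4, 5, 6} else Fraction(0)

total = sum(pmf(x) for x in range(1, 7))

print(pmf(2))   # 1/6
print(pmf(7))   # 0
print(total)    # 1
```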


## Cumulative Distribution Function (CDF) of PMF

The **Cumulative Distribution Function (CDF)**, denoted as $F(x)$, describes the probability that a random variable $X$ with a given probability distribution will be found at a value less than or equal to $x$.

The CDF is also a function. In the **Probability Mass Function (PMF)**, we find the probability for a particular value, such as $P(X = 2)$. For example, in a dice roll, the probability of rolling a 2 is $1/6$; we find this $f(x)$ for all dice numbers from 1 to 6.

---

### How to Calculate CDF

The CDF gives values like $F(4) = P(X \le 4)$, the probability of getting a 4 or less, or $F(5) = P(X \le 5)$. To find such a value, we sum the probabilities of all outcomes up to that point:

* **To find $F(4) = P(X \le 4)$:**
    $$P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4)$$
    $$\frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{4}{6}$$
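
In code, the CDF of a discrete variable is just a running total of the PMF. A minimal sketch for the fair die, with illustrative variable names:

```python
from fractions import Fraction
from itertools import accumulate

faces = [1, 2, 3, 4, 5, 6]
pmf = [Fraction(1, 6)] * 6            # P(X = x) for each face

# Running sum of the PMF gives P(X <= x) for each face
cdf = dict(zip(faces, accumulate(pmf)))

print(cdf[4])  # 2/3, i.e. 4/6
print(cdf[6])  # 1
```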

---

### Comparison: PMF vs. CDF

* **PMF Graph:** In the dice roll example, all probabilities are equal, so it forms a **Uniform Distribution** graph.
* **CDF Graph:** The CDF shows the accumulation of these probabilities.

<p align="center">
<img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*etYZ9Hw7eWVwt7lLb2wMDw.png" width="300px">
<br>
<em>Cumulative Distribution Function of PMF in two dice roll</em>
</p>

#### Outcome Table for CDF ($F(x)$)
| $x$ | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| $F(x)$ | $1/6$ | $2/6$ | $3/6$ | $4/6$ | $5/6$ | $6/6$ |

#### Outcome Table for PMF ($f(x)$)
| $x$ | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| $f(x)$ | $1/6$ | $1/6$ | $1/6$ | $1/6$ | $1/6$ | $1/6$ |

---

### Key Takeaway
* **PMF** gives us the probability at a **particular point** ($x$).
* **CDF** gives us the probability of **all points up to** $x$.

## Probability Density Function (PDF)

In **PDF**, we create an equation for a **Continuous Variable**, whereas in **PMF**, we create an equation for a **Discrete Variable**. It describes the probability distribution of continuous random variables.

<p align="center">
<img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*Btp16X006rcEMO2fE_JO_g.jpeg" width="300px">
<br>
<em>Normal distribution sample</em>
</p>

* The image above is a sample PDF. The "Continuous Line" is the hallmark of a PDF.
* **Example:** If we plot college student marks on the x-axis, the y-axis no longer represents "Probability" as it does in PMF; instead, it represents **Probability Density**.

> **Key Distinction:** PMF presents "Probability," while PDF presents "Probability Density."

---

#### Why is it Probability Density, not Probability?

In simple terms, **the probability is not shown on the y-axis because, for any specific point on the x-axis, the probability of that exact value is zero.**

* Let’s say we’re looking at student exam scores ranging from 0 to 10. Between these whole numbers, there are countless decimal values.
* The probability of getting exactly one specific score like $7.912$ is zero.
* Since there is an **infinite number of possible values on the x-axis**, no single value can carry a positive share of the probability.
* Therefore, it makes no sense to plot a probability for each value, because it would be zero everywhere. This is why the y-axis represents **Probability Density**.

---

#### What does the area under the curve represent?

In the example of student marks ranging from 0 to 10, the probability of any one specific mark is incredibly low. However, the **area under the curve** allows us to calculate probability for a range.

* **Total Area:** When we add up the entire area under the curve, it sums up to **1**, indicating the complete set of possibilities.
* **Range Probability:** Probability density helps us determine the likelihood of a value occurring between two specific numbers. For instance, to find the probability of marks between 8 and 9, we shade that area on the graph.
* **Calculation:** By calculating this shaded area, we find the probability for that range. 
* **Integration:** For all practical purposes, we calculate this area using **integration**:
  $$P(a \le X \le b) = \int_{a}^{b} f(x) \, dx$$

**Therefore, on this curve, the y-axis does not represent probability directly; rather, the area under the graph represents probability, and we use probability density to calculate this area.**
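
The integral can be evaluated numerically. The sketch below assumes, purely for illustration, that marks follow a normal distribution with mean 6 and standard deviation 1.5; these numbers are not from the notes, and a normal curve also extends slightly beyond the 0 to 10 range.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Hypothetical model for the marks: Normal(mean=6, std=1.5)
marks = norm(loc=6, scale=1.5)

# P(8 <= X <= 9) as the area under the PDF between 8 and 9
p_8_to_9, _ = quad(marks.pdf, 8, 9)

# The same probability from the CDF: F(9) - F(8)
p_via_cdf = marks.cdf(9) - marks.cdf(8)

# The total area under the curve should be 1
total_area, _ = quad(marks.pdf, -np.inf, np.inf)

print(p_8_to_9, p_via_cdf, total_area)
```

Both routes give the same probability, and the density at a single point, for example `marks.pdf(8)`, is not itself a probability.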

## Density Estimation

**How is this graph made?** It is created through a process called **Density Estimation**.

**Density Estimation** is a statistical technique used to estimate the **Probability Density Function (PDF)** of a random variable based on a set of observations or data. It involves estimating the underlying distribution of a set of data points.

* It is used for various purposes, including **hypothesis testing**, **visualizations**, and **data analysis**.
* There are two primary types of methods: **Parametric** and **Non-Parametric** density estimation.

---

### Types of Density Estimation

#### 1. Parametric Density Estimation
In this approach, we assume the data follows a specific, known distribution. Examples include:
* **Normal (Gaussian) Distribution**
* **Log-normal Distribution**
* **Uniform Distribution**

#### 2. Non-Parametric Density Estimation
In this approach, we make **no assumptions** about the underlying data distribution. We calculate the PDF without relying on a base of known distributions.

---

### Commonly Used Techniques

The choice of method depends on the specific characteristics of the data and the intended use of the estimate. Famous techniques include:

* **Kernel Density Estimation (KDE):** A non-parametric way to estimate the PDF of a random variable.
* **Histogram Estimation:** One of the simplest forms of density estimation, where data is binned.
* **Gaussian Mixture Models (GMMs):** A parametric method that assumes data is composed of several Gaussian distributions.

## Parametric and Non-Parametric Density Estimation

### Parametric Density Estimation

Parametric density estimation is a method of estimating the probability density function (PDF) of a **Random variable** by assuming that the underlying distribution belongs to a specific parametric family of probability distributions, such as the normal, exponential, or log-normal distributions.

#### Steps:

* To check whether a dataset follows a famous distribution, we first plot a **Histogram** and look at its shape.
* If it looks like a **Normal Distribution** or another well-known parametric distribution, we work with that distribution's parameters.
* For example, for a **Normal Distribution** we need its parameters:
    * **Mean ($\mu$)**
    * **Standard Deviation ($\sigma$)**
* We calculate the **mean** and **standard deviation** of the sample and use them as estimates of the population mean and standard deviation.
* With both parameter values, we use the **PDF equation** to calculate the density at each value of interest:
  $$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
* This gives the **probability density value** at each point, which we then plot.

---

#### Note:

* The parameters estimated from the sample should be very close to the population parameters; otherwise the fitted PDF will not describe the data well.
* It is called **parametric** because it depends upon parameters ($\mu$, $\sigma$). The more accurate the parameters we select, the better the results.

---

### Steps in code:

```python
import matplotlib.pyplot as plt
import numpy as np
from numpy.random import normal

# Generate Normally distributed data based on the assumption 
# that the data distribution is normal.
# 1000 points generated with: Pop.mean = 50, Pop.std_dev = 5
sample = normal(loc=50, scale=5, size=1000)

# Plot histogram to visualize the distribution of the generated data
plt.hist(sample, bins=10)
plt.title("Histogram of Generated Sample Data")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

```
<p align="center"> <img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*GpJnEgmpmI26dBZvYITcWw.png" width="300px"><br><em>Histogram for a sample size of 1000</em></p>

```python
# calculate sample mean and sample std dev
sample_mean = sample.mean()
sample_std = sample.std()

# Example output values:
# 49.98322995864287 - sample mean
# 4.97301044472603 - sample standard deviation
# fulfilling the condition close to population parameters

# fit the distribution with the above parameters
from scipy.stats import norm
dist = norm(sample_mean, sample_std)

# 100 evenly spaced points between the sample min and sample max
values = np.linspace(sample.min(), sample.max(), 100)

# evaluate the fitted normal PDF at those points (densities, not probabilities)
densities = dist.pdf(values)

# plot the normalized histogram and the fitted PDF together
plt.hist(sample, bins=10, density=True)
plt.plot(values, densities)
plt.show()
```

<p align="center"> <img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*-a-yd4rPlDdi4ySI_Z4M_g.png" width="300px">
<br>
<em>Normal Distribution</em> </p>
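
As an alternative sketch, SciPy can estimate both parameters in one call with `norm.fit`, which returns maximum-likelihood estimates of the mean and standard deviation. The seed and the interval 45 to 55 below are arbitrary choices made for this example.

```python
import numpy as np
from scipy.stats import norm

# Reproducible sample from Normal(mean=50, std=5)
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=1000)

# Maximum-likelihood estimates of the two parameters
mu_hat, sigma_hat = norm.fit(sample)

# Use the fitted distribution to answer a probability question
fitted = norm(mu_hat, sigma_hat)
p_45_to_55 = fitted.cdf(55) - fitted.cdf(45)

print(mu_hat, sigma_hat, p_45_to_55)
```

Once the parameters are fixed, any interval probability follows from the fitted CDF, which is the practical payoff of the parametric approach.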

### Non-Parametric Density Estimation

- When data matches any famous distribution such as **Normal Distribution**, **Uniform**, **Log-Normal**, or **Bernoulli**, we apply **Parametric Density Estimation**.

- When **data does not match any known probability distribution**, we use **Non-Parametric Density Estimation**.

- **Non-Parametric Density Estimation does not make any assumptions about the underlying data distribution**.

- In parametric estimation (e.g., Normal Distribution), we find parameters like **mean** and **standard deviation**.  
  In non-parametric estimation, **every data point is used to understand the data**.

- The **main advantage of Non-Parametric Density Estimation** is that it **does not require assuming a specific distribution**, providing more flexibility and often more accurate estimation when the underlying distribution is unknown.

- **It can be applied to any type of dataset to estimate the Probability Density Function (PDF)**.

- However, it is **computationally expensive** and may not give accurate results if the dataset size is small.

- One of the most famous **Non-Parametric Density Estimation techniques** is **KDE (Kernel Density Estimation)**.

## Kernel Density Estimation

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*alsY1cn8eX4SyoyxgVdgYA.png" width="500px"><br>
  <em>KDE — Nonparametric Density Function</em>
</p>

Kernel Density Estimation (KDE) uses a **‘Kernel Function’**.

- Imagine we have six data points, as shown in the histogram above.
- By observing the histogram, it’s evident that it doesn’t follow any familiar distribution pattern and resembles a **‘Bimodal distribution’**, so we turn to KDE.

**In KDE, we select a kernel, which is essentially a ‘Probability Distribution’ centered on a data point. Any valid kernel can be chosen, but the Gaussian (Normal) kernel is the most widely used.**

---

### KDE Working Principle

Here’s what we do:

- We take each **data point and treat it as the ‘center’**, then place a **normal (Gaussian) curve** on it.
- In other words, each data point becomes the **mean** of its own Gaussian curve.
- In the given example with six data points, we get **six Gaussian curves**.
- At any position on the X-axis, we move **vertically upwards** and read the height (Y-density) of every Gaussian curve at that position.  
  Then, **we sum up these heights**.  
  The sum, divided by the number of data points so that the total area stays 1, is the estimated density at that position.
- Repeating this for every position along the X-axis traces out the full curve.  
  **This is how KDE builds the overall density function from all the data points**.
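
The procedure can be sketched directly with NumPy. The six data points, the bandwidth of 0.5, and the function name `gaussian_kde_manual` are all made up for this illustration; this is not how any particular library implements KDE.

```python
import numpy as np
from scipy.stats import norm

def gaussian_kde_manual(data, grid, bandwidth):
    """Average of one Gaussian curve per data point, evaluated on grid."""
    # Row i holds the Gaussian centered on data[i], evaluated at every grid point
    kernels = norm.pdf(grid[None, :], loc=data[:, None], scale=bandwidth)
    # Sum the heights at each grid point and divide by the number of points
    return kernels.mean(axis=0)

data = np.array([1.0, 1.5, 2.0, 6.0, 6.5, 7.0])   # hypothetical bimodal sample
grid = np.linspace(-4, 12, 1601)
density = gaussian_kde_manual(data, grid, bandwidth=0.5)

# Approximate the area under the estimated curve; it should be close to 1
dx = grid[1] - grid[0]
area = (density * dx).sum()
print(area)
```

Taking the mean rather than the plain sum is what keeps the total area at 1, so the result is a valid density.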

---

### Gaussian / Normal Distribution Parameters

Gaussian/Normal Distribution has two crucial parameters:

- **μ (mean)**
- **σ (standard deviation)**

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*z62USl22kxChT1cenEmm4Q.jpeg" width="300px"><br>
  <em>KDE type — Peak, Smoothen curve</em>
</p>

<i>The red line shows a ‘peaked’ curve.</i>  
<i>The blue line shows a ‘smoothed-out’ curve.</i>

- In this context, **each data point serves as a mean**, and we still need to choose the **standard deviation of the Gaussian placed on each point**.

---

### Bandwidth

In KDE, the **standard deviation is known as the ‘Bandwidth’**.

- Bandwidth acts like a **‘Hyperparameter’** that we can adjust based on our needs.
- If we **decrease the bandwidth**, each Gaussian becomes narrower and taller.  
  Consequently, the **‘peakiness’ increases**, leading to spikes in the overall curve when we sum up the Y-density values.
- Conversely, **increasing the bandwidth spreads each Gaussian out** over a wider range of X-values.
- As a result, the **Y-density of each curve decreases**, reducing peakiness and producing a **smoother overall curve**.
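
The effect of bandwidth can be seen with SciPy's `gaussian_kde`. In this sketch the data points and the two `bw_method` values are arbitrary; note that a scalar `bw_method` is used by SciPy as a factor that scales the sample standard deviation, so it is not the bandwidth itself.

```python
import numpy as np
from scipy.stats import gaussian_kde

data = np.array([1.0, 1.5, 2.0, 6.0, 6.5, 7.0])   # hypothetical sample
grid = np.linspace(-5, 13, 500)

# Small factor: narrow kernels, spiky estimate
narrow = gaussian_kde(data, bw_method=0.05)(grid)

# Large factor: wide kernels, smooth estimate
wide = gaussian_kde(data, bw_method=1.0)(grid)

print(narrow.max(), wide.max())
```

The narrow estimate has much taller peaks than the wide one, which matches the behavior described above.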

---

### Note

In essence, for each data point, we determine a **‘kernel’**, and then we calculate the overall **Y-value** by summing the Y-points of all intersecting Gaussian kernels at a given **X-value** and plot the graph.

However, another crucial factor is the **‘Bandwidth’ or Standard Deviation**:

- Increasing bandwidth → **wider Gaussian kernel → smoother overall curve**
- Decreasing bandwidth → **narrow Gaussian kernel → higher peakedness**

Therefore, **selecting the appropriate bandwidth is vital**.


#### Refer to this article for a detailed example
https://medium.com/ai-mind-labs/parametric-non-parametric-density-estimation-f23faedc06ef

## Cumulative Distribution Function (CDF) of PDF

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*Dkne5qtPs0LphEWBazBrRw.png" width="500px"><br>
  <em>PDF & CDF</em>
</p>

In a **Continuous Random Variable**, we apply the **Probability Density Function (PDF)** instead of the **Probability Mass Function (PMF)**.  
On the Y-axis, we have **‘Probability Density’** instead of **‘Probability’**, and we calculate probability in a continuous random variable by using the **area under the curve** with the help of probability density.

---

### Relationship Between PDF and CDF

- With the probability density curve, we can easily construct the **CDF (Cumulative Distribution Function)**.
- On the PDF, reading the height of the curve at the center point gives a **probability density value**, not a probability.
- On the CDF, reading the height of the curve at the same point gives a **probability**.
- In the image, the CDF reaches **y = 0.5** at x = 165, i.e. **P(X ≤ 165) = 0.5**, which means **50% of people are 165 cm or shorter**.
- In the image, the PDF has height **y = 0.04** at x = 165. This is a density, not **P(X = 165)**, which is 0 for a continuous variable; it does not mean 4% or 40%. Only areas under the curve are probabilities.

---

### Understanding CDF

The **Cumulative Distribution Function (CDF)** offers a complete picture of the probability distribution by illustrating the **accumulated probability up to a particular value**.  
This enables a thorough examination of the dataset’s characteristics and behavior.  
CDF explains the probability **up to a given point for all values**.

---

### Key Observations

- If we calculate the **area under the PDF up to a point $x$**, we get the **CDF** at that point: $F(x) = \int_{-\infty}^{x} f(t) \, dt$.
- If we calculate the **slope of the CDF** at every point, it gives the **PDF**: $f(x) = \frac{d}{dx} F(x)$.
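
Both observations can be checked numerically. The sketch below assumes heights follow a normal distribution with mean 165 cm and standard deviation 10 cm; these parameters are an assumption chosen to roughly match the figure (peak density about 0.04) and are not stated in the notes.

```python
import numpy as np
from scipy.stats import norm

# Assumed model for heights: Normal(mean=165, std=10)
heights = norm(loc=165, scale=10)

x = np.linspace(115, 215, 2001)
pdf = heights.pdf(x)
cdf = heights.cdf(x)

# Slope of the CDF at every point should reproduce the PDF
slope = np.gradient(cdf, x)

# Running area under the PDF should reproduce the CDF
dx = x[1] - x[0]
area = np.cumsum(pdf) * dx

print(heights.pdf(165))   # a density, close to 0.04
print(heights.cdf(165))   # a probability, 0.5
```

The running area is a simple rectangle-rule approximation, so it matches the CDF only up to a small discretization error.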
