<h1 style="font-size: 1.6rem; font-weight: bold">Module 3 - Topic 4: Maximum Likelihood Estimation (MLE) </h1>
<p style="margin-top: 5px; margin-bottom: 5px;">Monash University Australia</p>
<p style="margin-top: 5px; margin-bottom: 5px;">ITO 4001: Foundations of Computing</p>
<p style="margin-top: 5px; margin-bottom: 5px;">Jupyter Notebook by: Tristan Sim Yook Min</p>

(Source Material: Monash University)

---

#### **Core Concept**
Maximum likelihood estimation finds the parameter values that make our observed data most probable. It works by:
1. Finding the parameter $\theta$ that maximizes the likelihood function
2. Producing a point estimate (single best value) for each unknown parameter

#### **How It Works**
- Given: Random sample $X_1, \ldots, X_n$ from distribution $F_\theta$ with unknown parameter(s) $\theta$
- Goal: Find the value of $\theta$ that maximizes the probability of observing our sample
- Result: The MLE is the value of $\theta$ that makes our observed data most likely

#### **Applications**
1. **Statistics & Data Science**
   - Parameter estimation for probabilistic models, fitting regression models and generalized linear models, and estimating distribution parameters (normal, exponential, etc.)

2. **Machine Learning**
   - Training classification algorithms (logistic regression), building generative models, and training neural networks (cross-entropy loss is related to MLE)

<br>

#### **Example: Maximum Likelihood Estimation for Normal Distribution**

For a normal distribution $N(\mu, \sigma^2)$, we can derive the MLE step by step:

1. **Setup**: We have sample data $X_1, X_2, ..., X_n$ from $N(\mu, \sigma^2)$ with unknown $\mu$ (here $\sigma^2$ is treated as known)

2. **Write the likelihood function**:
   $$L(\mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(X_i - \mu)^2}{2\sigma^2}\right)$$

3. **Take the logarithm** (for easier calculation):
   $$\log L(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i - \mu)^2$$

4. **Maximize by setting the derivative to zero**:
   $$\frac{d}{d\mu}\log L(\mu) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(X_i - \mu) = 0$$

5. **Solve for** $\mu$:
   $$\sum_{i=1}^{n}X_i - n\mu = 0 \implies \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}X_i = \bar{X}$$

**Numerical Example**: 
* Given sample: $X_1 = 2$, $X_2 = 3$, $X_3 = 4$
* Calculate MLE of $\mu$: $$\hat{\mu} = \frac{2+3+4}{3} = 3$$
* Conclusion: The value $\mu = 3$ maximizes the likelihood of observing our data
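
The numerical example can be checked by brute force. The sketch below evaluates the log-likelihood from step 3 on a grid of candidate values of $\mu$ and picks the largest one. The grid range, the step size, and the choice $\sigma = 1$ are illustrative assumptions; the location of the maximum in $\mu$ does not depend on $\sigma$.

```python
import math

def normal_log_likelihood(mu, data, sigma=1.0):
    """Log-likelihood of `data` under N(mu, sigma^2), with sigma treated as known."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - squared_error / (2 * sigma ** 2)

data = [2, 3, 4]

# Candidate values of mu from 0.00 to 6.00 in steps of 0.01 (an illustrative grid).
candidates = [i / 100 for i in range(601)]
best_mu = max(candidates, key=lambda mu: normal_log_likelihood(mu, data))

sample_mean = sum(data) / len(data)
print("grid maximizer:", best_mu)
print("sample mean:   ", sample_mean)
```

The grid maximizer and the sample mean agree, which is what the derivation predicts.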

---

### **Maximum Likelihood Estimation (MLE) Example**

#### **Problem Statement**
We have 10 data points sampled from an unknown Gaussian distribution. Our task is to determine which of four candidate Gaussian models most likely generated these points. The image below shows the sample of 10 data points plotted along the x-axis, all with y-values of 0.

![image.png](attachment:image.png)

## Gaussian Distribution
A Gaussian distribution is parameterized by:
- Mean (μ): Determines the center of the distribution
- Standard deviation (σ): Controls the spread

![image-2.png](attachment:image-2.png)

The figure above displays four candidate Gaussian distributions:
  - F1: μ = 0, σ = 1 (green)
  - F2: μ = 2, σ = 1 (blue)
  - F3: μ = 1, σ = 1 (red)
  - F4: μ = 2, σ = 0.5 (purple)

MLE finds the parameter values that maximize the probability of observing the given data. Here the distribution that generated the data is F2: μ = 2, σ = 1 (blue), which is also the candidate curve that fits the data points best.
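
A comparison like this can be sketched in a few lines of Python. The ten values in `data` are hypothetical stand-ins for the plotted points, since the actual values are only shown in the image; the four candidates use the parameters listed above. Each candidate is scored by its log-likelihood, and the one with the highest score is selected.

```python
import math
from statistics import NormalDist

# Hypothetical stand-in for the 10 plotted points (the real values are not listed in the text).
data = [0.5, 1.1, 1.6, 1.8, 2.0, 2.2, 2.5, 2.9, 3.3, 3.6]

# The four candidate models, with the parameters given in the list above.
candidates = {
    "F1": NormalDist(mu=0, sigma=1),
    "F2": NormalDist(mu=2, sigma=1),
    "F3": NormalDist(mu=1, sigma=1),
    "F4": NormalDist(mu=2, sigma=0.5),
}

def log_likelihood(dist, points):
    """Sum of log-densities, assuming the points are independent."""
    return sum(math.log(dist.pdf(x)) for x in points)

scores = {name: log_likelihood(dist, data) for name, dist in candidates.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: log-likelihood = {score:.2f}")
print("Highest likelihood:", best)
```

With these stand-in values F2 scores highest. A different sample could favour a different candidate, so the outcome is a property of the data as much as of the models.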

For Further Reading: Understanding Maximum Likelihood: An interactive visualization (Magnusson, 2021). 
Link: https://rpsychologist.com/likelihood/

---

### **Worksheet Examples: Maximum Likelihood Estimation for Gaussian Distribution**

Let's attempt the MLE for the Gaussian distribution. Suppose 3 data points were generated from a process described by a Gaussian distribution. These points are 3, 4, and 5. How do we calculate the maximum likelihood estimates of the Gaussian parameters μ and σ?

We want to calculate the total probability of observing all of the data, that is, the joint probability distribution of all observed data points. To do this, we would need to calculate some conditional probabilities, which can get very difficult.

**Assumption**: Each data point is generated independently of the others.

Assuming that each data point is generated independently of the others makes the maths easier. Recall that for **independent events**, the total probability of observing all of the data is the product of the probabilities of observing each data point individually, that is, the product of the marginal probabilities. The probability density of observing a single data point x that is generated from a Gaussian distribution is given by:

$$P(x;μ,σ) = \frac{1}{σ\sqrt{2π}} \exp\left(-\frac{(x-μ)^2}{2σ^2}\right)$$
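
As a quick sanity check, the density formula can be written out directly and compared with Python's built-in `statistics.NormalDist`. The function name `gaussian_pdf` and the trial values μ = 4, σ = 1 are illustrative choices, not estimates.

```python
import math
from statistics import NormalDist

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, written out term by term as in the formula."""
    coefficient = 1.0 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Illustrative trial parameters (not estimates): mu = 4, sigma = 1.
density = gaussian_pdf(3, mu=4, sigma=1)
reference = NormalDist(mu=4, sigma=1).pdf(3)

print(density, reference)
```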

<br> 

#### **The MLE Process in Simple Terms**

1. **Start with the PDF**: We begin with the Gaussian probability density function, which tells us how likely it is to observe a specific value given parameters μ and σ.

2. **Build the Likelihood Function**: Since our data points are independent, we multiply the individual probabilities for each point (3, 4, and 5) to get the likelihood function.

3. **Take the Logarithm**: To make calculations easier, we convert to log-likelihood (multiplications become additions).

4. **Find the Maximum**: To find parameter values that maximize the likelihood:
   - Take derivatives with respect to μ and σ
   - Set these derivatives equal to zero
   - Solve the resulting equations

5. **Calculate**:
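   - For μ: The maximum likelihood estimate is simply the sample mean, $\hat{μ} = \frac{1}{n}\sum_{i=1}^{n} x_i$
   - For σ: The maximum likelihood estimate is the square root of the average squared deviation from the sample mean, $\hat{σ} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{μ})^2}$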

<br> 

### **Maximum Likelihood Estimation for the Parameter μ**

#### Given Information:
- Data points: x₁ = 3, x₂ = 4, x₃ = 5
- Distribution: Gaussian (Normal)
- Parameters to estimate: μ (mean) and σ (standard deviation)
- Gaussian PDF: $$P(x;μ,σ) = \frac{1}{σ\sqrt{2π}} \exp\left(-\frac{(x-μ)^2}{2σ^2}\right)$$

#### Step 1: Joint Probability Function
Since the data points are independent, the joint probability is the product of individual probabilities:

$$P(3,4,5;μ,σ) = P(3;μ,σ) × P(4;μ,σ) × P(5;μ,σ)$$

$$= \frac{1}{σ\sqrt{2π}} \exp\left(-\frac{(3-μ)^2}{2σ^2}\right) × \frac{1}{σ\sqrt{2π}} \exp\left(-\frac{(4-μ)^2}{2σ^2}\right) × \frac{1}{σ\sqrt{2π}} \exp\left(-\frac{(5-μ)^2}{2σ^2}\right)$$

$$= \left(\frac{1}{σ\sqrt{2π}}\right)^3 \exp\left(-\frac{1}{2σ^2}[(3-μ)^2 + (4-μ)^2 + (5-μ)^2]\right)$$
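
The factorisation in the last line can be verified numerically. The sketch below computes the joint density once as a product of three individual densities and once with the collapsed closed form. The helper names and the trial values μ = 4, σ = 1 are assumptions made for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at a single point x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def joint_density(data, mu, sigma):
    """Product of the individual densities (valid under the independence assumption)."""
    return math.prod(gaussian_pdf(x, mu, sigma) for x in data)

def joint_density_closed_form(data, mu, sigma):
    """Same quantity with the constants and exponents collected into one term each."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return (1.0 / (sigma * math.sqrt(2 * math.pi))) ** n * math.exp(-squared_error / (2 * sigma ** 2))

data = [3, 4, 5]
mu, sigma = 4.0, 1.0  # trial values, chosen only for illustration

product_form = joint_density(data, mu, sigma)
closed_form = joint_density_closed_form(data, mu, sigma)
print(product_form, closed_form)
```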

#### Step 2: Log-Likelihood Function
Taking the natural logarithm of the joint probability:

$$\ln(P(3,4,5;μ,σ)) = 3\ln\left(\frac{1}{σ\sqrt{2π}}\right) - \frac{1}{2σ^2}[(3-μ)^2 + (4-μ)^2 + (5-μ)^2]$$

$$= -3\ln(σ) - \frac{3}{2}\ln(2π) - \frac{1}{2σ^2}[(3-μ)^2 + (4-μ)^2 + (5-μ)^2]$$
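
To confirm that the expanded log-likelihood matches the logarithm of the joint density, both can be evaluated at the same trial parameters. The values μ = 3.5 and σ = 1.2 below are arbitrary test points, and the function names are illustrative.

```python
import math

def log_likelihood(data, mu, sigma):
    """Expanded form: -n ln(sigma) - (n/2) ln(2 pi) - sum((x - mu)^2) / (2 sigma^2)."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -n * math.log(sigma) - (n / 2) * math.log(2 * math.pi) - squared_error / (2 * sigma ** 2)

def joint_density(data, mu, sigma):
    """Product of Gaussian densities over all data points."""
    return math.prod(
        math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        for x in data
    )

data = [3, 4, 5]
mu, sigma = 3.5, 1.2  # arbitrary trial values

direct = log_likelihood(data, mu, sigma)
via_product = math.log(joint_density(data, mu, sigma))
print(direct, via_product)
```

For larger samples the product of densities can underflow to zero, which is one practical reason to work with the summed form.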

#### Step 3: Differentiate with Respect to μ
To find the value of μ that maximizes the log-likelihood, take the partial derivative with respect to μ and set it equal to zero. The first two terms of the log-likelihood do not depend on μ, so only the last term, with its leading minus sign, contributes:

$$\frac{∂\ln(P(3,4,5;μ,σ))}{∂μ} = -\frac{1}{2σ^2} \frac{∂}{∂μ}[(3-μ)^2 + (4-μ)^2 + (5-μ)^2]$$

$$= -\frac{1}{2σ^2}[2(3-μ)(-1) + 2(4-μ)(-1) + 2(5-μ)(-1)]$$

$$= \frac{1}{σ^2}[(3-μ) + (4-μ) + (5-μ)]$$

$$= \frac{1}{σ^2}[(3+4+5) - 3μ]$$

$$= \frac{1}{σ^2}[12 - 3μ]$$

#### Step 4: Solve for μ
Set the derivative equal to zero and solve:

$$\frac{1}{σ^2}[12 - 3μ] = 0$$

Since $\frac{1}{σ^2} \neq 0$, the bracketed term must be zero:

$$12 - 3μ = 0$$
$$3μ = 12$$
$$μ = 4$$

Therefore, the maximum likelihood estimate for μ is 4, which is exactly the sample mean of the data points: $(3+4+5)/3 = 4$.

#### Interpretation
The value μ = 4 maximizes the likelihood function, meaning that a Gaussian distribution with mean μ = 4 is most likely to have generated our observed data points (3, 4, and 5).
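
This result can be checked numerically. The sketch below evaluates the derivative of the log-likelihood with respect to μ in the general form $\frac{1}{σ^2}\sum_i (x_i - μ)$ and also approximates it with a central finite difference. Fixing σ = 1 and the step size `h` are assumptions made for the check; any positive σ gives the same root.

```python
import math

def log_likelihood(data, mu, sigma):
    """Gaussian log-likelihood of independent data points."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -n * math.log(sigma) - (n / 2) * math.log(2 * math.pi) - squared_error / (2 * sigma ** 2)

def score_mu(data, mu, sigma):
    """Derivative of the log-likelihood with respect to mu: sum(x_i - mu) / sigma^2."""
    return sum(x - mu for x in data) / sigma ** 2

data = [3, 4, 5]
sigma = 1.0  # assumed fixed for this check
mu_hat = sum(data) / len(data)

# Central finite difference approximating the derivative at mu_hat.
h = 1e-6
numeric_slope = (log_likelihood(data, mu_hat + h, sigma) - log_likelihood(data, mu_hat - h, sigma)) / (2 * h)

print("mu_hat:", mu_hat)
print("analytic derivative at mu_hat:", score_mu(data, mu_hat, sigma))
print("numeric derivative at mu_hat: ", numeric_slope)
```

The derivative is positive for μ below 4 and negative above it, so the stationary point is a maximum rather than a minimum.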

<br>

### **Maximum Likelihood Estimation for the Parameter σ**

#### Solving for Parameter σ

The same steps used for finding μ apply when solving for parameter σ, but we differentiate the log-likelihood with respect to σ instead.

#### Step 1: Log-Likelihood Function
Recall our log-likelihood function:

$$\ln(P(3,4,5;μ,σ)) = -3\ln(σ) - \frac{3}{2}\ln(2π) - \frac{1}{2σ^2}[(3-μ)^2 + (4-μ)^2 + (5-μ)^2]$$

Now, using our previously found μ = 4:

$$\ln(P(3,4,5;4,σ)) = -3\ln(σ) - \frac{3}{2}\ln(2π) - \frac{1}{2σ^2}[(3-4)^2 + (4-4)^2 + (5-4)^2]$$

$$= -3\ln(σ) - \frac{3}{2}\ln(2π) - \frac{1}{2σ^2}[1 + 0 + 1]$$

$$= -3\ln(σ) - \frac{3}{2}\ln(2π) - \frac{2}{2σ^2}$$

$$= -3\ln(σ) - \frac{3}{2}\ln(2π) - \frac{1}{σ^2}$$

#### Step 2: Differentiate with Respect to σ
We take the partial derivative with respect to σ and set it equal to zero:

$$\frac{∂\ln(P(3,4,5;4,σ))}{∂σ} = -\frac{3}{σ} + \frac{2}{σ^3} = 0$$

#### Step 3: Solve for σ
Setting the derivative equal to zero:

$$-\frac{3}{σ} + \frac{2}{σ^3} = 0$$

$$\frac{2}{σ^3} = \frac{3}{σ}$$

$$2 = 3σ^2$$

$$σ^2 = \frac{2}{3}$$

$$σ = \sqrt{\frac{2}{3}} \approx 0.82$$

Therefore, the maximum likelihood estimate for σ is approximately 0.82.
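
The estimate can be reproduced in code as the square root of the average squared deviation from the sample mean. The sketch also compares it with `statistics.pstdev`, which divides by n in the same way; `statistics.stdev` divides by n − 1 and therefore returns a different value for the same data. The function name `score_sigma` is an illustrative choice.

```python
import math
from statistics import pstdev

data = [3, 4, 5]
mu_hat = sum(data) / len(data)

# Square root of the mean squared deviation (divides by n, not n - 1).
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / len(data))

def score_sigma(sigma, data, mu):
    """Derivative of the log-likelihood with respect to sigma: -n/sigma + sum((x - mu)^2)/sigma^3."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -n / sigma + squared_error / sigma ** 3

print("sigma_hat:", round(sigma_hat, 4))
print("pstdev:   ", round(pstdev(data), 4))
print("derivative at sigma_hat:", score_sigma(sigma_hat, data, mu_hat))
```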

<br> 

### **General Form of Maximum Likelihood Estimation**

MLE can be applied to any random variable, assuming the observations are independent and identically distributed:

#### Step 1: Define the Parametric Model
Consider a parametric statistical model with parameter(s) θ, where $P(x|θ)$ is the probability density function of a random variable.

#### Step 2: Calculate Joint Probability
For a random variable with observations $x_1,...,x_n$, calculate the joint probability:

$$P(x|θ) = \prod_{i=1}^{n} p(x_i|θ)$$

#### Step 3: Take Log-Likelihood
Convert to log-likelihood for easier computation:

$$\ln(P(x|θ)) = \sum_{i=1}^{n} \ln(p(x_i|θ))$$

#### Step 4: Differentiate with Respect to θ
Take the derivative with respect to θ and set it equal to zero:

$$\frac{∂\ln(P(x|θ))}{∂θ} = 0$$

#### Step 5: Solve for θ
Solve the equation to find the maximum likelihood estimate of θ.

#### Interpretation
The values μ = 4 and σ ≈ 0.82 define the Gaussian distribution that is most likely to have generated our observed data points (3, 4, and 5). This process can be generalized to find the parameters of any parametric distribution, given a set of observations.
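
When the equations in Step 5 cannot be solved by hand, the log-likelihood can be maximized numerically instead. The sketch below is one possible approach, not the method used in the worksheet: it uses a simple golden-section search, which assumes the log-likelihood has a single peak inside the search interval. The helper names and the search intervals are hypothetical choices. Applied to the data points 3, 4, and 5, it estimates μ with σ held at 1, then estimates σ with μ held at that estimate.

```python
import math

def log_likelihood(data, log_pdf):
    """Steps 2 and 3: sum of log-densities over independent observations."""
    return sum(log_pdf(x) for x in data)

def golden_section_maximize(f, lo, hi, tol=1e-8):
    """Maximize a single-peaked function f on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d  # the peak lies in [a, d]
        else:
            a = c  # the peak lies in [c, b]
    return (a + b) / 2

def gaussian_log_pdf(x, mu, sigma):
    """Log of the Gaussian density at x."""
    return -math.log(sigma) - 0.5 * math.log(2 * math.pi) - (x - mu) ** 2 / (2 * sigma ** 2)

data = [3, 4, 5]

# Estimate mu with sigma held at 1 (the maximizer in mu does not depend on sigma).
mu_hat = golden_section_maximize(
    lambda mu: log_likelihood(data, lambda x: gaussian_log_pdf(x, mu, 1.0)), 0.0, 10.0
)

# Estimate sigma with mu held at its estimate.
sigma_hat = golden_section_maximize(
    lambda s: log_likelihood(data, lambda x: gaussian_log_pdf(x, mu_hat, s)), 0.1, 5.0
)

print(round(mu_hat, 4), round(sigma_hat, 4))
```

Swapping `gaussian_log_pdf` for the log-density of another one-parameter model would adapt the same search to that model, provided its log-likelihood is single-peaked on the chosen interval.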