# Maximum Likelihood Estimation

In this notebook, we will see what Maximum Likelihood Estimation is and how it can be used. 

**Contents**:
- **[1. Introduction](#introduction)**
- **[2. Mathematical Formulation](#math-formulation)**
- **[3. Difference between Probability and Likelihood](#probability-likelihood)**
- **[4. Probability Distributions, Estimations and Codes](#distributions)**
    - **[4.1 Bernoulli Distribution](#bernoulli)**
    - **[4.2 Binomial Distribution](#binomial)**
    - **[4.3 Poisson Distribution](#poisson)**
    - **[4.4 Exponential Distribution](#exponential)**
    - **[4.5 Normal Distribution](#normal)**
- **[5. Summary](#summary)**
- **[6. References](#references)**
---

## <a id="introduction">1. Introduction</a>

We often need to study populations for purposes such as understanding their characteristics, finding patterns, and devising better solutions from this information. For large populations, however, it is rarely feasible to examine every individual. Statisticians therefore rely on random sampling: measurements are made on a much smaller sample, which is then analysed to draw conclusions about the population.

Analysing the data becomes much easier if we can identify the distribution of the population from the sample. A distribution from a given parametric family is fully specified by its parameters (for example, a normal distribution by its mean and variance). Therefore, if we can find these parameters, we can study the population easily. This process of inferring the population's parameters by analysing a sample drawn from it is called **estimation**.

Intuitively, if our estimate is good, then sampling randomly again should be **likelier** to produce a sample similar to the original one. In other words, the higher the chance of obtaining a sample like ours under the estimated parameters, the greater the **likelihood** that our estimate is close to correct.

**Maximum likelihood estimation** is the process of choosing these parameters, based on our sample, so that the likelihood of the observed sample under the assumed distribution is maximised.

---

## <a id="math-formulation">2. Mathematical Formulation</a>

Now that we have a gist of the problem we are trying to solve here, let's define it more formally.

Suppose we have obtained a random sample from our population, given by $x_1, \ x_2, \ x_3, \ ..., \ x_n$. <br>
Now, using this sample, we want to assume some distribution for our population. Let that distribution be defined as $D(\theta)$, <br>
where $\theta = [\theta_1 \ \theta_2 \ \theta_3 \ ... \ \theta_k]^T$ is the vector defining the parameters for our assumed distribution $D$. 

Now, using our random sample, we want to determine the values of $\theta$ which make our distribution $D$ most likely to describe our population. To quantify how likely it is that the assumed distribution $D$, with its estimated parameters $\theta$, describes the population, we make use of the **likelihood function**.

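We define the likelihood function $L : \Theta \longrightarrow [0, \infty)$, where $\Theta$ is the set of admissible parameter values and the sample $x_1, \ x_2, \ x_3, \ ..., \ x_n$ is held fixed, as follows: <br>
$$
    \begin{align}
        L(\theta \ | \ x_1, \ x_2, \ x_3, \ ..., \ x_n) & = P(X_1 = x_1,X_2 = x_2,X_3 = x_3, \ ..., \ X_n = x_n \ | \ \theta) \\
        & = f(x_1 | \theta)\ .\ f(x_2 | \theta)\ .\ f(x_3 | \theta)\ .\ ...\ .\ f(x_n | \theta) \\
        & = \prod_{i = 1}^n f(x_i | \theta)
    \end{align}
$$
Here, $f$ is the probability mass/density function for our distribution $D$. For a discrete distribution, the likelihood is the joint probability of the sample and so lies in $[0, 1]$. For a continuous distribution, the first line is read as the joint density evaluated at the sample, which can exceed $1$.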

**Note**: Since the sample is selected randomly from a large population, we can assume that the selection of each datum into the sample is **independent** of the selection of any other datum. This independence assumption allows us to multiply the probability mass/density functions of the individual data points to calculate the overall likelihood.
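
The product form of the likelihood can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is assumed to follow a Bernoulli distribution with success probability `p`, and both the sample and the helper names (`bernoulli_pmf`, `likelihood`) are made up for this example.

```python
import math


def bernoulli_pmf(x, p):
    """Probability mass f(x | p) of a Bernoulli(p) variable, for x in {0, 1}."""
    return p if x == 1 else 1.0 - p


def likelihood(p, sample):
    """L(p | x_1, ..., x_n): product of f(x_i | p) over the whole sample."""
    return math.prod(bernoulli_pmf(x, p) for x in sample)


# Hypothetical sample (assumption): 6 successes and 2 failures.
sample = [1, 0, 1, 1, 0, 1, 1, 1]

# Evaluate the likelihood at a few candidate parameter values.
for p in (0.25, 0.5, 0.75):
    print(f"L({p}) = {likelihood(p, sample):.6f}")
```

Among the three candidates, `p = 0.75` gives the largest likelihood, which matches the intuition that a parameter close to the observed success rate explains the sample best.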

The **goal** of **Maximum Likelihood Estimation** is to find the $\theta$ that maximises the likelihood function. For any differentiable likelihood function which is **concave** in $\theta$, we can find the maximum likelihood estimate by equating the first partial derivatives w.r.t. each parameter $\theta_j, \forall j \in \{1, 2, 3, \ ..., \ k\}$ to $0$, because any stationary point of a concave function is a global maximum. As we will see later, for commonly occurring probability distributions, the stationary point found this way is indeed the maximum.

$$
    \begin{align}
        \therefore \ \theta & = \arg\max_{\theta \ \in \ \Theta} \  L(\theta \ | \ x_1, x_2, x_3, \ ..., \ x_n) \\
        \Longrightarrow \ \theta_j & = \arg \left( \frac{\partial L}{\partial \theta_j} = 0 \right) \ \forall \ j \in \{1, \ 2, \ 3, \ ..., \ k\} \ | \ L \text{ is concave}
    \end{align}
$$
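
As a rough numerical check of the two statements above, the sketch below finds the maximiser by brute force over a grid of candidate values and then verifies that the derivative of $L$ is approximately $0$ there. It assumes a Bernoulli population, a hypothetical sample, and a coarse grid standing in for $\Theta$; none of these come from the derivation itself.

```python
import math


def likelihood(theta, sample):
    """Bernoulli likelihood: theta for each 1, (1 - theta) for each 0."""
    return math.prod(theta if x == 1 else 1.0 - theta for x in sample)


# Hypothetical sample (assumption): 6 successes and 2 failures.
sample = [1, 0, 1, 1, 0, 1, 1, 1]

# Candidate values 0.01, 0.02, ..., 0.99: a coarse stand-in for Theta.
grid = [k / 100 for k in range(1, 100)]

# arg max over the grid.
theta_hat = max(grid, key=lambda t: likelihood(t, sample))

# Central finite difference approximating dL/dtheta at the maximiser.
h = 1e-6
slope = (likelihood(theta_hat + h, sample) - likelihood(theta_hat - h, sample)) / (2 * h)

print("grid arg max:", theta_hat)
print("approximate dL/dtheta at arg max:", slope)
```

A grid search is used here only because it mirrors the definition of $\arg\max$ directly; it scales poorly with the number of parameters, which is one reason the derivative-based condition is preferred.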

Equating the partial derivatives to $0$ would be cumbersome here, because differentiating a product of $n$ probability mass/density functions requires repeated application of the product rule.
Hence, we make use of the **logarithm function**. The logarithm is a strictly increasing function, so $\log(L)$ attains its maximum at the same parameter values as $L$. Therefore, instead of maximising the likelihood function, we maximise the **log-likelihood function**.
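
The claim that $L$ and $\log(L)$ share the same maximiser can be checked numerically. The sketch below assumes a Bernoulli population and hypothetical samples. It also illustrates a practical side benefit that goes beyond the argument above: for large samples, the raw product can underflow to zero in floating-point arithmetic, while the sum of logarithms stays finite.

```python
import math


def likelihood(p, sample):
    """Product of Bernoulli probabilities f(x_i | p)."""
    return math.prod(p if x == 1 else 1.0 - p for x in sample)


def log_likelihood(p, sample):
    """Sum of log f(x_i | p); equals log(likelihood) mathematically."""
    return sum(math.log(p if x == 1 else 1.0 - p) for x in sample)


grid = [k / 100 for k in range(1, 100)]  # candidate values for p

# Small hypothetical sample: both functions peak at the same p.
small = [1, 0, 1, 1, 0, 1, 1, 1]
best_l = max(grid, key=lambda p: likelihood(p, small))
best_ll = max(grid, key=lambda p: log_likelihood(p, small))
print("arg max of L:", best_l, "| arg max of log L:", best_ll)

# Large hypothetical sample: 1500 successes and 500 failures.
large = [1] * 1500 + [0] * 500
print("L(0.75) on large sample:", likelihood(0.75, large))
print("log L(0.75) on large sample:", log_likelihood(0.75, large))
best_ll_large = max(grid, key=lambda p: log_likelihood(p, large))
print("arg max of log L on large sample:", best_ll_large)
```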

$$
\begin{align}
    L(\theta \ | \ x_1, x_2, x_3, \ ..., \ x_n) & = \prod_{i = 1}^n f(x_i \ | \ \theta) \\
    \Longrightarrow \log\left(L(\theta \ | \ x_1, x_2, x_3, \ ..., \ x_n)\right) & = \log\left(\prod_{i = 1}^n f(x_i \ | \ \theta)\right) \\
    & = \sum_{i = 1}^n \log\left(f(x_i \ | \ \theta)\right) \\
    \Longrightarrow \frac{\partial}{\partial \theta_j} \log\left(L(\theta \ | \ x_1, x_2, x_3, \ ..., \ x_n)\right) & = \sum_{i = 1}^n \frac{\partial}{\partial \theta_j} \log\left(f(x_i \ | \ \theta)\right), \ \forall \ j \in \{1, 2, 3, \ ..., \ k\} \\
    & = \sum_{i = 1}^n \frac{1}{f(x_i \ | \ \theta)} \frac{\partial}{\partial \theta_j} f(x_i \ | \ \theta), \ \forall \ j \in \{1, 2, 3, \ ..., \ k\} \\
    \therefore \ \text{for concave likelihood functions,} \ \theta_j & = \arg \left(\sum_{i = 1}^n \frac{1}{f(x_i \ | \ \theta)} \frac{\partial}{\partial \theta_j} f(x_i \ | \ \theta) = 0\right), \ \forall \ j \in \{1, 2, 3, \ ..., \ k\}
\end{align}
$$
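
To make the final condition concrete, here is a minimal sketch that solves $\sum_i \frac{1}{f(x_i | \theta)} \frac{\partial f(x_i | \theta)}{\partial \theta} = 0$ numerically for a single parameter. It assumes an exponential distribution with density $f(x \ | \ \lambda) = \lambda e^{-\lambda x}$, a hypothetical sample, and a simple bisection root finder; the function names and the bracketing interval are choices made for this example only.

```python
import math


def score(lam, sample):
    """Derivative of the log-likelihood w.r.t. lambda: sum of (1/f) * df/dlambda."""
    total = 0.0
    for x in sample:
        f = lam * math.exp(-lam * x)               # f(x | lambda)
        df = (1.0 - lam * x) * math.exp(-lam * x)  # d f(x | lambda) / d lambda
        total += df / f
    return total


def solve_score(sample, lo=1e-6, hi=100.0, tol=1e-10):
    """Bisection for the root of the score, assuming score(lo) > 0 > score(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, sample) > 0:
            lo = mid  # still on the increasing side of log L
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical waiting times (assumption, not real data).
sample = [0.5, 1.2, 0.3, 2.0, 1.0]

lam_hat = solve_score(sample)
closed_form = len(sample) / sum(sample)  # known closed form: n / sum(x_i)

print("numerical root of the score:", lam_hat)
print("closed-form estimate:", closed_form)
```

For the exponential distribution the score simplifies to $n / \lambda - \sum_i x_i$, so the numerical root should agree with the closed-form estimate $n / \sum_i x_i$ up to the tolerance of the bisection.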

Now that we have seen how to maximise the likelihood, let's do this for some probability distributions in the following sections.

---

## <a id="probability-likelihood">3. Difference between Probability and Likelihood</a>

## <a id="distributions">4. Probability Distributions, Estimations and Codes</a>

### <a id="bernoulli">4.1 Bernoulli Distribution</a>

### <a id="binomial">4.2 Binomial Distribution</a>

### <a id="poisson">4.3 Poisson Distribution</a>

### <a id="exponential">4.4 Exponential Distribution</a>

### <a id="normal">4.5 Normal Distribution</a>

## <a id="summary">5. Summary</a>

## <a id="references">6. References</a>

The following sources were referred to while making this notebook:
- [Maximum Likelihood Estimation, STAT 415 - Introduction to Mathematical Statistics, PennState](https://online.stat.psu.edu/stat415/lesson/1/1.2)
- [Maximum Likelihood Estimation, Wikipedia](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation)
- [Probability Concepts Explained: Maximum Likelihood Estimation, towards data science](https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1)
- [StatQuest: Maximum Likelihood, clearly explained!!!, StatQuest with Josh Starmer](https://youtu.be/XepXtl9YKwc)
- [StatQuest: Probability vs Likelihood, StatQuest with Josh Starmer](https://youtu.be/pYxNSUDSFH4)
- [Maximum Likelihood for the Binomial Distribution, Clearly Explained!!!, StatQuest with Josh Starmer](https://youtu.be/4KKV9yZCoM4)
- [Maximum Likelihood for the Exponential Distribution, Clearly Explained! V2.0, StatQuest with Josh Starmer](https://youtu.be/p3T-_LMrvBc)
- [Maximum Likelihood For the Normal Distribution, step-by-step!, StatQuest with Josh Starmer](https://youtu.be/Dn6b9fCIUpM)