In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

## Methods for constructing estimates

### Method of moments

In [28]:
from sympy import symbols, diff, integrate, simplify, pi, exp, sqrt, oo
# sqrt is taken from sympy (not math) so that expressions stay symbolic

The **method of moments** estimates population parameters by equating sample moments (such as the sample mean) to the corresponding population moments, written as functions of the unknown parameters, and solving the resulting equations. 
In simple terms, it uses summary properties of the observed data, such as its average, to estimate the underlying characteristics of the population from which the data was drawn. 
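
For a normal model with mean $\theta$ and standard deviation $\sigma$, the first two population moments are $\theta$ and $\theta^2 + \sigma^2$, so equating them to the sample moments gives closed-form estimates. The following is a minimal sketch of that idea, assuming NumPy; the helper name `normal_moment_estimates` and the toy sample are illustrative and not part of the original notebook.

```python
import numpy as np

def normal_moment_estimates(sample):
    """Method-of-moments estimates (theta_hat, sigma_hat) for a normal model."""
    sample = np.asarray(sample, dtype=float)
    m1 = sample.mean()                  # first sample moment
    m2 = np.mean(sample ** 2)           # second sample moment
    theta_hat = m1                      # solves E[X] = m1
    sigma_hat = np.sqrt(m2 - m1 ** 2)   # solves E[X^2] = theta^2 + sigma^2 = m2
    return theta_hat, sigma_hat

# Toy sample (hypothetical data, chosen only for illustration)
theta_hat, sigma_hat = normal_moment_estimates([1.0, 2.0, 3.0, 4.0, 5.0])
print(theta_hat, sigma_hat)
```

Note that this variance estimate divides by $n$, not $n - 1$, because it comes directly from the raw sample moments.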


In [36]:
x = symbols('x')
theta = symbols('theta')
sigma = symbols('sigma')

In [37]:
expx = x*((1/(sigma*sqrt(2*pi)))*exp(-(1/2)*((x-theta)/sigma)**2))

In [38]:
simplify(expx)

sqrt(2)*x*exp(-0.5*(theta - x)**2/sigma**2)/(2*sqrt(pi)*sigma)

In [40]:
integrate(expx, (x, -oo, oo))

Piecewise((-0.176776695296637*sqrt(2)*theta*(-2.82842712474619*sigma*exp(-0.5*theta**2/sigma**2)/theta - 2*sqrt(pi)*(2 - erfc(0.707106781186548*theta/sigma)))/sqrt(pi) - 0.176776695296637*sqrt(2)*theta*(2.82842712474619*sigma*exp(-0.5*theta**2/sigma**2)/theta - 2*sqrt(pi)*erfc(0.707106781186548*theta/sigma))/sqrt(pi), ((Abs(arg(sigma)) < pi/4) | ((Abs(arg(sigma)) <= pi/4) & (Abs(4*arg(sigma) - 2*arg(theta)) < pi)) | ((Abs(arg(sigma)) < pi/4) & (Abs(4*arg(sigma) - 2*arg(theta)) <= pi)) | ((Abs(arg(sigma)) < pi/4) & (Abs(4*arg(sigma) - 2*arg(theta)) < pi))) & ((Abs(arg(sigma)) < pi/4) | ((Abs(arg(sigma)) <= pi/4) & (Abs(-4*arg(sigma) + 2*arg(theta) + 2*pi) < pi)) | ((Abs(arg(sigma)) < pi/4) & (Abs(-4*arg(sigma) + 2*arg(theta) + 2*pi) <= pi)) | ((Abs(arg(sigma)) < pi/4) & (Abs(-4*arg(sigma) + 2*arg(theta) + 2*pi) < pi)))), (Integral(sqrt(2)*x*exp(-0.5*(-theta + x)**2/sigma**2)/(2*sqrt(pi)*sigma), (x, -oo, oo)), True))
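
The long `Piecewise` result above appears because `sigma` and `theta` were declared without assumptions, so SymPy has to track convergence conditions for complex values, and the float `1/2` introduces decimal coefficients. Below is a sketch of the same integral, assuming `sigma` is positive and `theta` is real and using an exact rational exponent; the names `pdf`, `first_moment`, and `second_moment` are illustrative.

```python
from sympy import symbols, integrate, simplify, exp, sqrt, pi, oo, Rational

# Declare assumptions so SymPy can decide convergence without case splits
x = symbols('x', real=True)
theta = symbols('theta', real=True)
sigma = symbols('sigma', positive=True)

# Normal density with mean theta and standard deviation sigma (exact 1/2)
pdf = exp(-Rational(1, 2) * ((x - theta) / sigma) ** 2) / (sigma * sqrt(2 * pi))

# First and second population moments, E[X] and E[X^2]
first_moment = simplify(integrate(x * pdf, (x, -oo, oo)))
second_moment = simplify(integrate(x ** 2 * pdf, (x, -oo, oo)))
print(first_moment)
print(second_moment)
```

Under these assumptions the first moment should reduce to `theta`, which is the population moment that the method of moments equates to the sample mean.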

### Maximum likelihood method

In [1]:
# Importing libraries 
import numpy as np # used for handling arrays and mathematical operations.
from scipy.optimize import minimize # function that minimizes another function

Estimating parameters is a fundamental step in statistical analysis and machine learning. 
Among the various methods available, **Maximum Likelihood Estimation (MLE)** is one of the most widely used approaches due to its intuitive nature, mathematical rigor, and broad applicability across different types of data and models. 

The **maximum likelihood method** is a statistical technique to estimate the parameters of a probability distribution by finding the parameter values that make the observed data most probable. 
This is done by creating a likelihood function, which represents the probability of the data for different parameter values, and then finding the parameter values that maximize this function. 
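Because the logarithm is strictly increasing, the log-likelihood has the same maximizer as the likelihood and is usually easier to work with; the maximum is typically found by setting its derivative with respect to the parameters to zero. 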
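
When the derivative equations are awkward to solve by hand, the log-likelihood can be maximized numerically instead. The sketch below fits a normal model by minimizing the negative log-likelihood with `scipy.optimize.minimize`. The synthetic data, the seed, the starting point, and the function name `negative_log_likelihood` are assumptions made for illustration; optimizing over `log_sigma` is one possible way to keep `sigma` positive, not the only one.

```python
import numpy as np
from scipy.optimize import minimize

def negative_log_likelihood(params, data):
    """Negative log-likelihood of i.i.d. normal data; params = (theta, log_sigma)."""
    theta, log_sigma = params
    sigma = np.exp(log_sigma)  # sigma > 0 by construction
    n = data.size
    return (n * log_sigma
            + 0.5 * n * np.log(2 * np.pi)
            + 0.5 * np.sum((data - theta) ** 2) / sigma ** 2)

# Synthetic data (hypothetical: true theta = 5, true sigma = 2)
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)

# Minimizing the negative log-likelihood maximizes the likelihood
result = minimize(negative_log_likelihood, x0=np.array([0.0, 0.0]), args=(data,))
theta_hat = result.x[0]
sigma_hat = np.exp(result.x[1])
print(theta_hat, sigma_hat)
```

For the normal model the maximizer is also known in closed form (the sample mean and the standard deviation computed with a divisor of $n$), which gives a convenient check on the numerical result.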

* **Probability** describes how plausible different data are for fixed parameters. It is a function of the data.
* **Likelihood** measures how plausible different parameter values are, given the observed data. It is a function of the parameters for fixed data.
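
The distinction can be made concrete by evaluating one formula in two ways. This sketch assumes a normal density with known `sigma = 1`; the helper `normal_pdf`, the grids, and the observed value `1.0` are illustrative choices.

```python
import numpy as np

def normal_pdf(x, theta, sigma=1.0):
    """Density of a normal distribution with mean theta and std sigma."""
    return np.exp(-0.5 * ((x - theta) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Probability view: the parameter is fixed at theta = 0, the data value x varies
x_grid = np.linspace(-3, 3, 61)
density_over_data = normal_pdf(x_grid, theta=0.0)

# Likelihood view: one observation is fixed at x = 1, the parameter theta varies
theta_grid = np.linspace(-3, 3, 61)
likelihood_over_theta = normal_pdf(1.0, theta=theta_grid)

# The most plausible theta on the grid for the single observation x = 1
best_theta = theta_grid[np.argmax(likelihood_over_theta)]
print(best_theta)
```

The same function is called in both cases; only the argument that is held fixed changes.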

![Probability vs Likelihood](../../00-images/ad_4nxcyfncgqgeyknfqxzodeuudd-wsazuyvqz9djpe96aurzs9pzvdreznj8cooxafdnj4shkc7_kg3nkakmoj4j-legg4lmttsmfvedm3lyodyakjlh-hwwpq82p_3hc2wsftzbeetq.avif)

Let’s suppose we have a dataset $x_1, x_2, ..., x_n$.
We believe these data points are generated from a probability distribution that depends on some unknown parameter $\theta$. 
Our main goal is to estimate $\theta$. 

$$L(\theta) = P(\text{data} | \theta) = P(x_1, x_2, ..., x_n | \theta)$$

The likelihood function measures how likely it is to observe your data for different values of $\theta$. 

Using the chain rule of probability, we can expand the above equation into this:
$$L(\theta) = P(x_1, x_2, ..., x_n | \theta) = P(x_1| \theta) \cdot P(x_2| x_1, \theta) \cdot ... \cdot P(x_n| x_1, x_2, ..., x_{n-1}, \theta)$$

However, this expression is complicated to work with. 
So we assume that the data points are independent, or more precisely, conditionally independent given $\theta$, so that each factor $P(x_k| x_1, ..., x_{k-1}, \theta)$ reduces to $P(x_k| \theta)$: 

$$L(\theta) = P(x_1| \theta) \cdot P(x_2| \theta) \cdot ... \cdot P(x_n| \theta) = \prod_{k=1}^n P(x_k| \theta)$$
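
Under the independence assumption the likelihood is a product of per-point terms, and its logarithm is the corresponding sum. The sketch below checks this on a small sample and shows why the sum is preferred numerically for larger ones. It assumes a normal density with `sigma = 1`; the helper `normal_pdf`, the seed, and the sample sizes are illustrative.

```python
import numpy as np

def normal_pdf(x, theta, sigma=1.0):
    """Density of a normal distribution with mean theta and std sigma."""
    return np.exp(-0.5 * ((x - theta) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)

# Small sample: the product and the exponentiated sum of logs agree
small_sample = rng.normal(loc=0.0, scale=1.0, size=5)
densities = normal_pdf(small_sample, theta=0.0)
likelihood = np.prod(densities)
log_likelihood = np.sum(np.log(densities))

# Large sample: every factor is below 0.4, so the product underflows in float64
large_sample = rng.normal(loc=0.0, scale=1.0, size=2000)
large_densities = normal_pdf(large_sample, theta=0.0)
likelihood_large = np.prod(large_densities)
log_likelihood_large = np.sum(np.log(large_densities))
print(likelihood_large, log_likelihood_large)
```

The product collapses to zero for the large sample while the log-likelihood remains finite, which is one practical reason to work with logarithms.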
