# Why Squared Error Minimization = Maximum Likelihood Estimation
> An investigation into why minimizing squared error is the same as maximum likelihood estimation
- toc: true 
- badges: false
- comments: true
- categories: [deep-learning, mathematics]
- image: images/saddle.JPG 

# Pre-requisites

### Minimizing the squared error loss

Minimizing the squared error is the standard technique machine learning practitioners use when tackling a regression problem, in which the target is a continuous variable. Let's say we have collected all our independent variables in a matrix ${X}$ of shape (m, n), where m is the number of training examples and each training example is represented by an n-dimensional vector. All the dependent variables are represented by a vector ${Y}$ of length m (each training example has a real number as its target). Our task is to find a vector ${\beta}$ of length n such that ${X\beta = Y}$. This task could be solved simply by saying ${\beta = X^{-1}Y}$, but that only works when m = n and ${X}$ is an invertible square matrix, in which case ${Y}$ is guaranteed to lie in the column space of ${X}$, i.e. ${Y}$ is simply a weighted sum of the columns of ${X}$. In the typical case, where we have more training examples than features (m > n), ${Y}$ does not lie in the column space of ${X}$ and ${X\beta}$ can never be equal to ${Y}$. All we can hope to do is minimize the distance between the two vectors ${X\beta}$ and ${Y}$. The distance we choose to minimize is the squared length of their difference, ${\|Y-X\beta\|^{2} = (Y-X\beta)^{T}(Y-X\beta)}$. Let's denote this error by ${\epsilon}$. Now, if you are like me, you may wonder why we don't simply minimize ${Y - X\beta}$ itself, or some other power of ${Y - X\beta}$. This is a perfectly valid question to ask, and it is the one we are going to explore in this blog.
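To make the setup concrete, here is a minimal sketch in Python, assuming NumPy is available. The data is synthetic and entirely made up for illustration: the sizes `m` and `n`, the weights `true_beta`, and the noise level are assumptions, not values from any real dataset. It uses `np.linalg.lstsq` to find the ${\beta}$ with the smallest squared error when ${X}$ is not square.

```python
import numpy as np

# Synthetic data, made up for illustration: m examples, n features.
rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))
true_beta = np.array([2.0, -1.0, 0.5])        # hypothetical "true" weights
Y = X @ true_beta + 0.1 * rng.normal(size=m)  # noise pushes Y out of the column space of X


def squared_error(beta):
    """Return (Y - X beta)^T (Y - X beta) for a candidate beta."""
    residual = Y - X @ beta
    return float(residual @ residual)


# X is 50 x 3, so it has no inverse; lstsq returns the beta minimizing the squared error.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

print("beta_hat:", beta_hat)
print("squared error at beta_hat:", squared_error(beta_hat))
# Any other beta, for example a slightly shifted one, gives a larger error.
print("squared error at shifted beta:", squared_error(beta_hat + 0.05))
```

The error at `beta_hat` is small but not zero, which is exactly the situation described above: ${X\beta}$ cannot equal ${Y}$, so we settle for the closest vector.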

### Maximum Likelihood Estimation

Another way of looking at the regression task is that we have m observations of ${(x,y)}$, where ${x}$ is an n-dimensional vector and ${y}$ is its corresponding target value. These m data points come from a true but unknown data distribution ${P(X,Y\space;\space\theta)}$, where ${\theta}$ is the parameter of the distribution. Let's say we want to predict how many matches each tennis player will win out of his next 10 matches, just by looking at the player's age and ATP ranking. Our independent variable is therefore ${X = [\text{Age}, \text{Ranking}]}$ and our dependent variable is ${Y =}$ number of matches won out of the next 10. For example, Nadal will be represented by the vector ${(35,3)}$ and Federer by ${(39,8)}$. What do you think ${P([18,123], 10)}$ will be? In other words, what is the probability that a player aged 18 and ranked 123 wins all 10 of his next matches? We can safely say that this is a highly unlikely event, since the player is inexperienced. In the same way, we would expect ${P([30,1], 8)}$ to be high, since the player is ranked highest in the world. Now, instead of guessing, we want to construct a probability distribution over the random variables ${X}$ and ${Y}$. Once we have such a distribution, we can simply plug in the age and ranking of any player, calculate ${P([\text{age},\text{ranking}], Y\space ; \space \theta)}$ for each ${Y}$ in ${\{0, 1, \dots, 10\}}$, and report the value of ${Y}$ that gives the highest probability. Unfortunately, we can't know the true distribution, because that would require collecting data from every single active player, which is impractical. But we can estimate the true distribution. A parametric distribution is entirely characterised by its parameters. For example, a Gaussian distribution is characterised by its mean and variance and is denoted as ${N (x; \mu, \sigma^{2})}$. If we can estimate the parameters of a distribution, then we can construct a good estimate of the distribution as a whole. Let's denote the estimate of the parameters ${\theta}$ as ${\hat\theta}$.

To construct such an estimate of a distribution, we collect the age and ranking of m players, observe them for their next 10 matches, and note down how many of those matches they won. Let each of these observations be represented by ${(x^{i}, y^{i})}$, where each ${x^{i}}$ is itself a vector of length two containing the age and ranking of the ${i^{th}}$ player and ${y^{i}}$ is the number of matches won by that player out of 10. Let's assume every observation is independent of every other observation. Given each ${x^{i}}$, our distribution should predict ${y^{i}}$, and this can only happen when ${P(x^{i}, y^{i}\space ;\space \hat\theta)}$ is the largest among all ${P(x^{i}, y\space ; \space \hat\theta)}$ with ${y \neq y^{i}}$. The quantity ${P(x^{i}, y^{i}\space ; \space \theta)}$ is known as the likelihood of the observation ${(x^{i}, y^{i})}$. Since we want the likelihood of all the observations to be high, and since the observations are independent so that their joint probability is the product of the individual probabilities, we want the quantity ${P(x^{1}, y^{1})\cdot P(x^{2}, y^{2})\cdot P(x^{3}, y^{3}) \cdots P(x^{m}, y^{m})}$ to be maximum, which in short can be written as ${\prod_{i=1}^{m}P(x^{i}, y^{i}\space ; \space \theta)}$. Then ${\hat\theta}$ will be ${\arg\max_{\theta} \space\prod_{i=1}^{m} P(x^{i}, y^{i}\space ; \space \theta)}$.
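Here is a minimal sketch of that argmax in Python, under a deliberately simplified assumption: we ignore age and ranking and pretend every player wins each match independently with the same unknown probability `p`, so that ${\theta = p}$ and each ${y^{i}}$ follows a binomial distribution. The `wins` array is made-up data, and the binomial model and the grid search are illustrative choices rather than the model this post builds towards.

```python
from math import comb

import numpy as np

# Made-up observations: matches won out of 10 by m = 8 hypothetical players.
wins = np.array([7, 4, 9, 5, 6, 8, 3, 6])
n_matches = 10

# Binomial coefficients C(10, y_i); they do not depend on the parameter p.
coeffs = np.array([comb(n_matches, int(k)) for k in wins])


def likelihood(p):
    """Product over all observations of the binomial probability P(y_i ; p)."""
    per_observation = coeffs * p**wins * (1 - p) ** (n_matches - wins)
    return float(np.prod(per_observation))


# Try many candidate parameters and keep the one with the largest likelihood.
grid = np.linspace(0.01, 0.99, 99)
scores = [likelihood(p) for p in grid]
p_hat = grid[int(np.argmax(scores))]

# For this simple model the maximizer is also known in closed form:
# total matches won divided by total matches played.
closed_form = wins.sum() / (n_matches * len(wins))

print("grid-search estimate:", p_hat)
print("closed-form estimate:", closed_form)
```

With only 8 observations the raw product is numerically safe. With many more observations it would underflow, which is one practical reason to maximize the sum of log-probabilities instead; the logarithm is increasing, so the maximizer is the same.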