<div style="display: flex; justify-content: space-between;">
<a style="flex: 1; text-align: left;" href="./2_1_Multiple_linear_regresion.ipynb">← Previous: 2.1 Multiple Linear Regression</a>
<a style="flex: 1; text-align: right;" href="./2_3_Error_term.ipynb">Next: 2.3 Estimation of the Error Term →</a>
</div>

### 2.2 Least Squares Estimation of the Parameters
---


<div style="text-align: justify; line-height: 2">

The Least Squares method is used to estimate the parameters $\beta_j$ in the linear regression model $(2.1.1)$ from training data of the form:

$$\{x_{i1}, x_{i2}, ..., x_{ik}, y_i\}, \hspace{.5cm} i = 1, 2, ..., n, \hspace{.5cm}  n > k$$

Where $n$ is the number of observations and $k$ is the number of variables. Thus, $x_{ij}$ is the $j$ th variable for the $i$ th observation and $y_i$ is the response variable for the $i$ th observation. The constraint $n > k$ requires the number of observations to be greater than the number of variables. The model takes this data and fits a line, or a hyperplane in the multiple regression case, of best fit through the data, i.e. the line or hyperplane from which the data points deviate least<sup><a href="#footnote3">3</a></sup>. These deviations are called residuals and are denoted by $\varepsilon_i$<sup><a href="#footnote3">3</a></sup>:

$$\varepsilon_i = y_i - \hat{y}_i$$

Where $\hat{y}_i$ is the predicted value of $y_i$ based on the linear regression model $(2.1.1)$<sup><a href="#footnote2">2</a></sup>:

$$\hat{y}_i = \beta_0 + \beta_1 x_{i1} + ... + \beta_k x_{ik}$$

Then, $$L=\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij})^2$$ is the sum of the squared 'errors' (SSE). The Least Squares method minimizes $L$ with respect to the coefficients $\beta_0, \beta_1, ..., \beta_k$ by taking the first partial derivatives of $L$ with respect to each coefficient and setting them equal to zero<sup><a href="#footnote2">2</a></sup>:

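$$\begin{equation} \tag{2.2.1} 
\frac{\partial L}{\partial \beta_j} = -2 \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{l=1}^{k} \beta_l x_{il}\right)x_{ij} = 0, \hspace{.5cm} j = 0, 1, ..., k
\end{equation}$$

Here $x_{i0} = 1$ for all $i$, so the case $j = 0$ gives the equation for the intercept $\beta_0$.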

To solve for the coefficients, we rewrite the components of equation $(2.2.1)$ as matrices<sup><a href="#footnote4">4</a></sup>:

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \hspace{.5cm} X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}, \hspace{.5cm} \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \hspace{.5cm} \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$
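The matrices above can be assembled in a few lines of NumPy. This is only a minimal sketch: the toy values and the names `x`, `X` and `y` are illustrative assumptions and do not come from the data used in this notebook.

```python
import numpy as np

# Hypothetical toy data (assumption): n = 5 observations, k = 2 variables.
x = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
y = np.array([6.1, 4.9, 12.2, 10.8, 18.0])

n, k = x.shape

# Prepend a column of ones so that beta_0 plays the role of the intercept.
# X then has shape n x (k + 1), matching the matrix X defined above.
X = np.column_stack([np.ones(n), x])

print(X.shape)  # (n, k + 1)
```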

Then, the equation becomes:

$$ -2X^Ty + 2X^TX\beta = 0 $$
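One way to see where this matrix equation comes from, sketched using only the matrices defined above, is to write the sum of squares as $L = (y - X\beta)^T(y - X\beta)$ and differentiate with respect to the vector $\beta$:

$$L = (y - X\beta)^T(y - X\beta) = y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta$$

$$\frac{\partial L}{\partial \beta} = -2X^Ty + 2X^TX\beta$$

Setting this gradient to zero stacks the $k+1$ equations in $(2.2.1)$ into the single system $X^TX\beta = X^Ty$, often called the normal equations.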

Solving for $\beta$ we obtain the solutions for equation $(2.2.1)$:
$$\begin{equation} \tag{2.2.2}
\boxed{\hat{\beta} = (X^TX)^{-1}X^Ty}
\end{equation}
$$
provided that $X^TX$ is invertible. As $X$ is an $n \times (k+1)$ matrix, $X^TX$ is a symmetric $(k+1) \times (k+1)$ matrix, and the constraint $n > k$ mentioned above is necessary (though not sufficient) for it to be invertible: $X$ must have full column rank $k+1$, which requires at least $k+1$ observations<sup><a href="#footnote4">4</a></sup>. $\hat{\beta}$ is therefore the vector of coefficients that estimates $\beta$ from the training data using equation $(2.2.2)$. The predicted value of $y_i$ is then given by the matrix notation of equation $(2.1.1)$:

$$\begin{equation} \tag{2.2.3}
\hat{y} = X\hat{\beta}
\end{equation}
$$

The residuals are $e = y-\hat{y}$, so that $y = X\hat{\beta} + e$.
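The estimate, the predictions and the residuals can be computed together in a short NumPy sketch. The data and the helper name `fit_least_squares` are hypothetical and used only for illustration. The sketch solves the normal equations $X^TX\beta = X^Ty$ with `np.linalg.solve` rather than forming $(X^TX)^{-1}$ explicitly, which is a common numerical choice and not something prescribed by the references.

```python
import numpy as np

def fit_least_squares(X, y):
    """Hypothetical helper: solve the normal equations X^T X beta = X^T y."""
    # Solving the linear system avoids computing the inverse of X^T X explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data (assumption): n = 5 observations, k = 2 variables.
x = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
y = np.array([6.1, 4.9, 12.2, 10.8, 18.0])

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(len(y)), x])

beta_hat = fit_least_squares(X, y)   # estimated coefficients
y_hat = X @ beta_hat                 # predicted values
e = y - y_hat                        # residuals

print("beta_hat:", beta_hat)
print("SSE:", np.sum(e ** 2))
```

As a sanity check, `np.linalg.lstsq(X, y, rcond=None)[0]` should return the same coefficients up to rounding, and `X.T @ e` should be numerically zero, since the normal equations state exactly that the residuals are orthogonal to every column of $X$.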

Using a bivariate example, the Least Squares method can be visualized:

<table align="center" style="border: none;">
<tr>
<td style="text-align: center; border: none;">
  
![Figure 2.2.1](../Images/2_2_1.png)
  
***Figure 2.2.1: Linear regression with two variables.** The line is fitted by minimizing the sum of the squared residuals, which are shown in green as the vertical segments between the data points and the predicted values.*
  
</td>
</tr>
</table>

The estimators obtained from the Least Squares method are unbiased under the following assumptions<sup><a href="#footnote1">1</a></sup>:

1. The model is linear in the parameters. This can be easily checked by plotting the data and seeing if a linear relationship exists. If not, the data can be transformed as discussed in section 2.1. 
2. The error terms $\varepsilon_i$ are independent.
3. The error terms $\varepsilon_i$ are normally distributed with a mean of 0.
4. The error terms $\varepsilon_i$ have equal variance $\sigma^2$. The estimation of $\sigma^2$ is discussed in the next section.
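Unbiasedness can be illustrated with a small simulation. The sketch below is only a hedged illustration: the true coefficients `beta_true`, the noise level `sigma`, the sample size and the number of simulations are all made-up assumptions. It generates many data sets that satisfy the assumptions above, estimates $\hat{\beta}$ for each one, and averages the estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (assumption): 50 observations, 2 variables.
n, k = 50, 2
beta_true = np.array([1.0, 2.0, -0.5])  # made-up "true" coefficients
sigma = 1.0                             # made-up error standard deviation

# Fixed design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])

n_sims = 2000
estimates = np.empty((n_sims, k + 1))
for s in range(n_sims):
    # Independent, normally distributed errors with mean 0 and equal variance.
    eps = rng.normal(0.0, sigma, size=n)
    y = X @ beta_true + eps
    # Least squares estimate from the normal equations.
    estimates[s] = np.linalg.solve(X.T @ X, X.T @ y)

mean_estimate = estimates.mean(axis=0)
print("true beta:       ", beta_true)
print("mean of beta_hat:", mean_estimate)
```

Under these assumptions the average of the estimates should lie close to `beta_true`, although any single estimate can differ from it noticeably.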


</div>
&nbsp;  
&nbsp;  

---

<a name="footnote1"></a>1: [PennState Eberly College of Science](./5_References.ipynb#1) (Note:7.3)  
<a name="footnote2"></a>2: [Ivan T. Ivanov](./5_References.ipynb#2)  
<a name="footnote3"></a>3: [OpenStax](./5_References.ipynb#3)  
<a name="footnote4"></a>4: [Heij, De Boer; Franses, Kloek; and Van Dijk](./5_References.ipynb#4)

<div style="display: flex; justify-content: space-between;">
<a style="flex: 1; text-align: left;" href="./2_1_Multiple_linear_regresion.ipynb">← Previous: 2.1 Multiple Linear Regression</a>
<span style="flex: 1; text-align: center;">2.2 Least Squares Estimation of the Parameters</span>
<a style="flex: 1; text-align: right;" href="./2_3_Error_term.ipynb">Next: 2.3 Estimation of the Error Term →</a>
</div>
