## Some derivations for the coefficient of determination ($R^2$)

### Linear regression

We consider the **Linear Regression Model**, i.e.,

$$Y = \beta_0 + \beta^{T} X + \epsilon,$$

where $X = (X_1, X_2, \ldots, X_r)^T$ is the vector of $r$ independent variables (predictors), $\beta_0$ and $\beta$ are the unknown parameters, with $\beta_0$ a constant and $\beta = (\beta_1, \beta_2, \ldots, \beta_r)^T$ an $r$-dimensional vector, and $\epsilon$ is a random disturbance. It is assumed that $\mathbb{E}(\epsilon) = 0$ and $\mathbb{V}ar(\epsilon) = \sigma^2$.

Given a dataset of $n$ observations, we can solve the Ordinary Least Squares (OLS) problem to obtain the estimates $\hat{\beta}_0$ and $\hat{\beta}$ of the model parameters. With a slight abuse of notation, we assume that the data are given in matrix form as

$$
X = 
\begin{bmatrix}
1 & x_{11} & x_{12} & \ldots & x_{1r}\\
1 & x_{21} & x_{22} & \ldots & x_{2r}\\
\vdots & \vdots & \vdots &\ddots & \vdots\\
1 & x_{n1} & x_{n2} & \ldots & x_{nr}\\ 
\end{bmatrix}
$$

and $Y = [Y_1, Y_2, \ldots, Y_n]^T$. We model the dataset with linear regression as follows:

$$Y = X \beta + \epsilon,$$

where now the constant term is incorporated into the vector of parameters, i.e., $\beta = [\beta_0, \beta_1, \beta_2, \ldots, \beta_r]^T$, and $\epsilon = [\epsilon_1, \epsilon_2, \ldots, \epsilon_n]^T$ is a vector of random errors such that $\mathbb{E}(\epsilon) = \boldsymbol{0}$ and $\mathbb{V}ar(\epsilon) = \sigma^2 \boldsymbol{I}$.
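As an illustration of the matrix form, the following minimal sketch builds a design matrix with a leading column of ones and simulates a response from the model. The sample size, the parameter values, the noise level, and the choice of Gaussian errors are all assumptions made for the example; the model itself only requires zero mean and covariance $\sigma^2 \boldsymbol{I}$.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
n, r = 100, 3
sigma = 0.5                      # assumed noise standard deviation

predictors = rng.normal(size=(n, r))           # hypothetical values x_{ij}
X = np.column_stack([np.ones(n), predictors])  # prepend the column of ones
beta = np.array([1.0, 2.0, -1.0, 0.5])         # made-up [beta_0, ..., beta_r]
epsilon = rng.normal(scale=sigma, size=n)      # E(eps) = 0, Var(eps) = sigma^2 I
Y = X @ beta + epsilon                         # Y = X beta + eps

print(X.shape, Y.shape)  # (100, 4) (100,)
```

The column of ones is what lets $\beta_0$ be absorbed into the parameter vector.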

### Estimation of the parameters

We consider the **Residual Sum of Squares** (RSS) defined as:

$$\textrm{RSS}(\beta) = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2,$$

where $\hat{Y} = [\hat{Y}_1, \hat{Y}_2, \ldots, \hat{Y}_n]^T = X\beta$. The OLS estimate of $\beta$, denoted $\hat{\beta}$, is the minimizer of $\textrm{RSS}(\beta)$. Consequently, the gradient of the RSS, viewed as a multivariate function of the parameters $\beta$, is the zero vector at $\hat{\beta}$. Since

$$\nabla \textrm{RSS}(\beta) = -2 X^{T}(Y - X\beta),$$

we have

$$\nabla \textrm{RSS}(\hat{\beta}) = -2 X^{T}(Y - X\hat{\beta}) = \boldsymbol{0}.$$

Let us denote the $i$-th residual $(Y_i - [1,x_{i1},x_{i2},\ldots,x_{ir}]\hat{\beta})$ by $\hat{\epsilon}_i$. Then from the equation above, it follows that

$$
\begin{bmatrix}
1 & 1 & \ldots & 1\\
x_{11} & x_{21} & \ldots & x_{n1}\\
x_{12} & x_{22} & \ldots & x_{n2}\\
\vdots & \vdots & \ddots & \vdots\\
x_{1r} & x_{2r} & \ldots & x_{nr}\\ 
\end{bmatrix}
\begin{bmatrix}
\hat{\epsilon}_1\\
\hat{\epsilon}_2\\
\vdots\\
\hat{\epsilon}_n
\end{bmatrix} =
\begin{bmatrix}
0\\
0\\
\vdots\\
0
\end{bmatrix}.
$$

We can rewrite this matrix equation in the form of a system of equations:

$$\sum_{i=1}^{n}\hat{\epsilon}_i = 0,$$

i.e., the sum of the residuals is zero if **a model contains the constant term $\beta_0$**, and

$$\sum_{i=1}^{n} x_{ij} \hat{\epsilon}_i = 0 \quad\textrm{for all } j \in \{1,2,\ldots,r\}.$$
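The first-order condition and the two families of equations derived from it can be checked numerically. The sketch below is one way to do so, assuming synthetic data (the seed, sizes, and coefficients are made up) and solving the normal equations $X^T X \beta = X^T Y$ directly, which presumes that $X^T X$ is invertible.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 50, 3
# Hypothetical data: a column of ones followed by r random predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, r))])
Y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

def rss_gradient(beta):
    """Gradient of RSS(beta) = ||Y - X beta||^2 with respect to beta."""
    return -2.0 * X.T @ (Y - X @ beta)

# Solve the normal equations X^T X beta = X^T Y for the OLS estimate.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
residuals = Y - X @ beta_hat

print(np.allclose(rss_gradient(beta_hat), 0.0))  # gradient vanishes at beta_hat
print(np.isclose(residuals.sum(), 0.0))          # residuals sum to zero
print(np.allclose(X[:, 1:].T @ residuals, 0.0))  # orthogonal to each predictor
```

All three checks hold only up to floating-point rounding, which is why tolerant comparisons are used instead of exact equality.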

### Coefficient of Determination

We introduce the following notation:

$$\bar{Y} \triangleq \frac{1}{n}\sum_{i=1}^n Y_i.$$

The **Coefficient of Determination** $R^2$ is defined as

$$R^2=\frac{\textrm{ESS}}{\textrm{TSS}},$$

where TSS is the **total sum of squares** defined as:

$$\textrm{TSS} = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 = (Y - \bar{Y}\boldsymbol{1})^{T}(Y - \bar{Y}\boldsymbol{1}),$$

with $\boldsymbol{1}$ denoting the $n$-dimensional vector of ones, and ESS is the **explained sum of squares** defined as:

$$\textrm{ESS} = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 = (\hat{Y} - \bar{Y}\boldsymbol{1})^{T}(\hat{Y} - \bar{Y}\boldsymbol{1}).$$

**Remark:** $\textrm{ESS}$ and $\textrm{RSS}$ depend on $\beta$ through the fitted values $\hat{Y}_i$, whereas $\textrm{TSS}$ depends only on the observed responses. To be fully precise, one should write $\textrm{ESS}(\beta)$ and $\textrm{RSS}(\beta)$; to simplify the notation, we omit the argument, keeping in mind that from here on both quantities are evaluated at the OLS estimate $\hat{\beta}$.

First, let us consider the sum $\sum_{i=1}^{n} \hat{\epsilon}_i \hat{Y}_i$. We show that it equals zero. This fact is used in the derivations below. Indeed, we have that

$$
\begin{align}
\sum_{i=1}^{n} \hat{\epsilon}_i \hat{Y}_i 
  &= \sum_{i=1}^{n} \hat{\epsilon}_i \left (\hat{\beta}_0 + \sum_{j=1}^r x_{ij}\hat{\beta}_j \right )\\
  &= \sum_{i=1}^{n} \hat{\epsilon}_i\hat{\beta}_0 + \sum_{i=1}^{n} \hat{\epsilon}_i \left (\sum_{j=1}^r x_{ij}\hat{\beta}_j \right)\\
  &= \hat{\beta}_0 \sum_{i=1}^{n} \hat{\epsilon}_i + \sum_{j=1}^r \hat{\beta}_j \left ( \sum_{i=1}^{n} x_{ij}\hat{\epsilon}_i \right )\\
  &= \hat{\beta}_0 \cdot 0 + \sum_{j=1}^r \hat{\beta}_j \cdot 0\\
  &= 0.
\end{align}
$$
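As a cross-check, the same conclusion can be reached in one line in matrix form. Writing $\hat{\epsilon} = Y - X\hat{\beta}$ for the vector of residuals and $\hat{Y} = X\hat{\beta}$ for the vector of fitted values, and using $X^{T}\hat{\epsilon} = \boldsymbol{0}$ from the first-order condition, we get

$$
\sum_{i=1}^{n} \hat{\epsilon}_i \hat{Y}_i = \hat{\epsilon}^{T}\hat{Y} = \hat{\epsilon}^{T} X \hat{\beta} = (X^{T}\hat{\epsilon})^{T}\hat{\beta} = \boldsymbol{0}^{T}\hat{\beta} = 0.
$$

This version relies only on $X^{T}\hat{\epsilon} = \boldsymbol{0}$, so it holds for any OLS fit, whether or not $X$ contains the column of ones.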

Next, TSS can be written as:

$$
\begin{align}
\textrm{TSS} &= \sum_{i=1}^{n}(Y_i - \bar{Y})^2\\
             &= \sum_{i=1}^{n}(Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y})^2\\
             &= \sum_{i=1}^{n}\left [(Y_i - \hat{Y}_i)^2 + 2(Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) + (\hat{Y}_i - \bar{Y})^2\right ]\\
             &= \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + 2\sum_{i=1}^{n}(Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y})\\
             &= \textrm{RSS} + \textrm{ESS} + 2\sum_{i=1}^{n}(Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y})\\
             &= \textrm{RSS} + \textrm{ESS} + 2\left [\sum_{i=1}^{n} \hat{\epsilon}_i (\hat{Y}_i - \bar{Y}) \right ]\\
             &= \textrm{RSS} + \textrm{ESS} + 2\left [\sum_{i=1}^{n} \hat{\epsilon}_i \hat{Y}_i - \sum_{i=1}^{n} \hat{\epsilon}_i \bar{Y} \right ]\\
             &= \textrm{RSS} + \textrm{ESS} + 2\left [\sum_{i=1}^{n} \hat{\epsilon}_i \hat{Y}_i - \bar{Y} \sum_{i=1}^{n} \hat{\epsilon}_i \right ]\\
             &= \textrm{RSS} + \textrm{ESS} - 2\bar{Y} \sum_{i=1}^{n} \hat{\epsilon}_i.
\end{align}
$$
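The last line of this derivation holds for any OLS fit, with or without the constant term. The sketch below checks it numerically on synthetic data for both cases; the helper `sums_of_squares`, the seed, and the coefficients are hypothetical choices made for the illustration.

```python
import numpy as np

def sums_of_squares(X, Y):
    """Fit OLS on design matrix X; return (TSS, RSS, ESS, sum of residuals)."""
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    Y_hat = X @ beta_hat
    Y_bar = Y.mean()
    tss = np.sum((Y - Y_bar) ** 2)
    rss = np.sum((Y - Y_hat) ** 2)
    ess = np.sum((Y_hat - Y_bar) ** 2)
    return tss, rss, ess, np.sum(Y - Y_hat)

rng = np.random.default_rng(2)
n = 30
predictors = rng.normal(size=(n, 2))
Y = 5.0 + predictors @ np.array([1.5, -2.0]) + rng.normal(size=n)

designs = {
    "with intercept": np.column_stack([np.ones(n), predictors]),
    "without intercept": predictors,
}
for name, X in designs.items():
    tss, rss, ess, resid_sum = sums_of_squares(X, Y)
    correction = -2.0 * Y.mean() * resid_sum  # the term -2 * Y_bar * sum(residuals)
    # The identity TSS = RSS + ESS + correction should hold in both cases.
    print(name, np.isclose(tss, rss + ess + correction), "sum of residuals:", resid_sum)
```

With the column of ones, the reported sum of residuals is zero up to rounding, so the correction term disappears; without it, the sum is generally nonzero.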

It follows that if the model **does include** the constant term $\beta_0$, then the last sum is 0 since $\sum_{i=1}^{n} \hat{\epsilon}_i = 0$ as shown above. Hence, if the model **does include** the constant term $\beta_0$, then

$$\textrm{TSS} = \textrm{RSS} + \textrm{ESS}$$

and

$$0 \leq R^2 \leq 1.$$

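Otherwise, the residuals need not sum to zero, the decomposition $\textrm{TSS} = \textrm{RSS} + \textrm{ESS}$ can fail, and $R^2$ can be greater than 1. Therefore, if a model **does not include** the constant $\beta_0$, the coefficient of determination is not well-defined as a proportion of explained variation.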
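A small hand-picked example makes the contrast concrete. The three data points below are hypothetical and were chosen so that the response has a large mean and little variation; the helper `r_squared` simply evaluates $\textrm{ESS}/\textrm{TSS}$ as defined above.

```python
import numpy as np

def r_squared(X, Y):
    """Return ESS / TSS for the OLS fit of Y on the design matrix X."""
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    Y_hat = X @ beta_hat
    Y_bar = Y.mean()
    ess = np.sum((Y_hat - Y_bar) ** 2)
    tss = np.sum((Y - Y_bar) ** 2)
    return ess / tss

x = np.array([1.0, 2.0, 3.0])
Y = np.array([10.0, 11.0, 9.0])

X_with = np.column_stack([np.ones_like(x), x])  # model with the constant term
X_without = x[:, None]                          # regression through the origin

r2_with = r_squared(X_with, Y)
r2_without = r_squared(X_without, Y)
print(r2_with, r2_without)
```

Working the sums by hand gives $R^2 = 0.25$ for the model with the constant term and $R^2 \approx 21.46$ for the regression through the origin, where the fitted line is forced away from $\bar{Y}$.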