<div style="display: flex; justify-content: space-between;">
<a style="flex: 1; text-align: left;" href="./2_3_Error_term.ipynb">← Previous: 2.3 Estimation of the Error Term</a>
<a style="flex: 1; text-align: right;" href="./3_1_MLR_bus_weather.ipynb">Next: 3.1 Model Building with Weather →</a>
</div>

### 2.4 Tests for the Significance of the Regression
---


<div style="text-align: justify; line-height: 2">

There are multiple tests that can be used to determine the significance of the regression. The most common are the t-test and the F-test, described below.

#### 2.4.1 Hypothesis Testing with the t-test

The t-test assesses the significance of an individual regression coefficient $\beta_i$. It determines whether the estimate $\hat{\beta}_i$ is significantly different from zero, i.e. whether the regressor $x_i$ belongs in the model. The hypotheses are<sup><a href="#footnote2">2</a></sup>:

$$H_0: \beta_i = 0$$
$$H_1: \beta_i \neq 0$$

The test statistic is defined as:

$$\begin{equation} \tag{2.4.1}
\boxed{t = \frac{\hat{\beta}_i}{\sqrt{\hat{\sigma}^2 c_{ii}}}}
\end{equation}
$$

where $\hat{\beta}_i$ is the estimated value of $\beta_i$, $\hat{\sigma}^2$ is the estimated variance of the error term (section 2.3), and $c_{ii}$ is the diagonal element of the matrix $(X'X)^{-1}$ corresponding to $\beta_i$. Under $H_0$, the test statistic follows a t-distribution with $n-p$ degrees of freedom, where $n$ is the number of observations and $p = k+1$ is the number of estimated parameters ($k$ regressors plus the intercept). The test is two-tailed, i.e. the null hypothesis is rejected if $|t| > t_{\alpha/2, n-p}$. In that case, the coefficient $\beta_i$ is said to be significantly different from zero and the regressor $x_i$ is retained in the model<sup><a href="#footnote3">3</a><a href="#footnote5">5</a></sup>.
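As a minimal sketch of equation (2.4.1), the snippet below computes the t statistic for every coefficient of an ordinary least squares fit and compares it with the two-tailed critical value. The data are synthetic and purely illustrative (assumption: `x2` has no true effect on `y`), the significance level $\alpha = 0.05$ is likewise an assumed choice, and the variable names are not taken from the rest of this notebook.

```python
import numpy as np
from scipy import stats

# Synthetic data (illustrative assumption): y depends on x1 only, x2 is irrelevant
rng = np.random.default_rng(0)
n, k = 50, 2
x = rng.normal(size=(n, k))
y = 1.0 + 2.0 * x[:, 0] + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column, so p = k + 1 parameters
X = np.column_stack([np.ones(n), x])
p = X.shape[1]

# Least squares estimate and estimated error variance
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p)

# t statistic per coefficient: beta_hat_i / sqrt(sigma2_hat * c_ii)
t_stats = beta_hat / np.sqrt(sigma2_hat * np.diag(XtX_inv))

# Two-tailed test at an assumed significance level alpha = 0.05
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)
reject = np.abs(t_stats) > t_crit

for name, t, r in zip(["intercept", "x1", "x2"], t_stats, reject):
    print(f"{name}: t = {t:.2f}, reject H0: {r}")
```

With these settings the null hypothesis is rejected for the intercept and for `x1`. The outcome for `x2` depends on the random draw, since a true null is still rejected with probability $\alpha$.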

#### 2.4.2 Coefficient of Determination $R^2$

The coefficient of determination $R^2$ is a measure of the goodness of fit of the regression model. It is defined as the fraction of the total variation in the dependent variable $y$ that is explained by the regression model<sup><a href="#footnote4">4</a></sup>. It is built from two quantities: the sum of squared errors ($SS_E$), i.e. the sum of squared residuals, and the total sum of squares ($SS_T$), i.e. the sum of squared deviations of $y$ from its mean. The coefficient of determination is defined as<sup><a href="#footnote2">2</a></sup>:

$$\begin{equation} \tag{2.4.2}
{R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}}
\end{equation}
$$
where
$$SS_E = \sum_{i=1}^{n} \varepsilon_i^2 ; \hspace{.5cm} SS_T = \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2, \quad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i ; \hspace{.5cm} SS_R = SS_T - SS_E$$

The coefficient of determination is a number between 0 and 1 and represents the fraction of the variability in the data that is explained by the model. On its own, however, it is not a reliable measure for comparing models, because it never decreases when a regressor is added, even if that regressor is irrelevant. Therefore, the adjusted coefficient of determination is preferred<sup><a href="#footnote2">2</a></sup>:

$$\begin{equation} \tag{2.4.3}
\boxed{R_{adj}^2 = 1 - \frac{SS_E/(n-p)}{SS_T/(n-1)}}
\end{equation}
$$

This measure penalizes the coefficient of determination for the number of parameters in the model. Adding a regressor increases the adjusted coefficient of determination only if the resulting reduction in $SS_E$ outweighs the loss of one degree of freedom, which discourages overfitting<sup><a href="#footnote2">2</a></sup>.
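The following sketch illustrates equations (2.4.2) and (2.4.3) by fitting two nested models to synthetic data. The helper `r2_scores` and the data are hypothetical and only serve the illustration; the second model adds a regressor that is pure noise by construction.

```python
import numpy as np

def r2_scores(X, y):
    """Return (R^2, adjusted R^2) for a least squares fit of y on X.

    X is assumed to already contain the intercept column,
    so p = X.shape[1] counts all estimated parameters.
    """
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta_hat
    ss_e = residuals @ residuals            # sum of squared residuals
    ss_t = np.sum((y - y.mean()) ** 2)      # total sum of squares
    r2 = 1 - ss_e / ss_t
    r2_adj = 1 - (ss_e / (n - p)) / (ss_t / (n - 1))
    return r2, r2_adj

# Synthetic data (illustrative assumption): y depends on x1 only
rng = np.random.default_rng(1)
n = 40
x1 = rng.normal(size=n)
noise_regressor = rng.normal(size=n)        # unrelated to y by construction
y = 3.0 + 1.5 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, noise_regressor])

r2_small, adj_small = r2_scores(X_small, y)
r2_big, adj_big = r2_scores(X_big, y)

print(f"x1 only:    R2 = {r2_small:.4f}, adjusted R2 = {adj_small:.4f}")
print(f"x1 + noise: R2 = {r2_big:.4f}, adjusted R2 = {adj_big:.4f}")
```

Because the models are nested, $R^2$ of the larger model cannot be lower than that of the smaller one, while the adjusted value may go either way depending on how much the extra regressor reduces $SS_E$.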

#### 2.4.3 Multicollinearity

In many applications, some correlation between the regressors is expected. When this correlation is high, it is called multicollinearity. This poses a problem for model building, because it becomes difficult to determine the individual effect of each regressor on the predicted variable while the other regressors are held constant, which is the main purpose of the regression model<sup><a href="#footnote1">1</a></sup>. A common way to detect multicollinearity is Pearson's correlation coefficient, which is defined as<sup><a href="#footnote6">6</a></sup>:

$$\begin{equation} \tag{2.4.4} 
\boxed{\rho_{ij} = \frac{cov(x_i, x_j)}{\sigma_i \sigma_j}}
\end{equation}
$$

where $\rho_{ij}$ is the correlation coefficient between regressors $x_i$ and $x_j$, $cov(x_i, x_j)$ is the covariance between $x_i$ and $x_j$, and $\sigma_i$ and $\sigma_j$ are the standard deviations of $x_i$ and $x_j$, respectively. This coefficient ranges from -1 to 1, where 1 indicates perfect positive linear correlation, -1 indicates perfect negative linear correlation, and 0 indicates no linear correlation. If two regressors are highly correlated, one of them is usually removed from the model to reduce multicollinearity.

A useful tool to visualize the correlation between the regressors is the correlation matrix, often drawn as a correlation heatmap, where each element of the matrix is the correlation coefficient between the corresponding pair of regressors. An example is shown in Figure 2.4.1.

<table align="center" style="border: none;">
<tr>
<td style="text-align: center; border: none;">
  
<p style="text-align: center;"><img src="../Images/2_4_1.png" alt="Figure 2.4.1" /></p>
  
<p style="text-align: center;"><strong><em>Figure 2.4.1: Correlation heatmap.</em></strong><br><em>The correlation between five different variables is shown in the heatmap. The diagonal elements of the matrix are always 1, as the correlation between a variable and itself is always 1. Note that variables B and E show high positive correlation, while variables A and D show lower but still significant negative correlation.</em></p>

</td>
</tr>
</table>
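A correlation matrix like the one in the figure can be sketched with NumPy as follows. The three regressors are synthetic (assumption: `c` is constructed to be nearly collinear with `a`), and the threshold of 0.8 for flagging a pair is an arbitrary illustrative choice, not a value prescribed by the references.

```python
import numpy as np

# Synthetic regressors (illustrative assumption): c is almost a copy of a
rng = np.random.default_rng(2)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
c = 0.9 * a + 0.1 * rng.normal(size=n)

names = ["a", "b", "c"]
regressors = np.column_stack([a, b, c])

# Pearson correlation matrix; rowvar=False treats each column as a variable
corr = np.corrcoef(regressors, rowvar=False)

# Equation (2.4.4) written out for one pair, using sample estimates
rho_ac = np.cov(a, c)[0, 1] / (a.std(ddof=1) * c.std(ddof=1))

# Flag pairs whose absolute correlation exceeds an assumed threshold
threshold = 0.8
high_pairs = [
    (names[i], names[j], corr[i, j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]

print(np.round(corr, 2))
for first, second, rho in high_pairs:
    print(f"{first} and {second} are highly correlated (rho = {rho:.2f})")
```

Only the pair `a` and `c` is flagged here, so one of the two would be a candidate for removal during variable selection.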

Under the assumptions stated in section 2.2, the t-test and the adjusted coefficient of determination are used to determine the significance of the regressors and the goodness of fit of the model, respectively. Together with the correlation matrix, they are used for variable selection during model building, which is discussed in the next section.

</div>
&nbsp;  
&nbsp;  

---

<a name="footnote1"></a>1: [PennState Eberly College of Science](./5_References.ipynb#1)  Note: 12.3  
<a name="footnote2"></a>2: [Ivan T. Ivanov](./5_References.ipynb#2)  
<a name="footnote3"></a>3: [OpenStax](./5_References.ipynb#3)  
<a name="footnote4"></a>4: [Heij, De Boer, Franses, Kloek, and Van Dijk](./5_References.ipynb#4)  
<a name="footnote5"></a>5: [Mostoufi, Navid; Constantinides, Alkis](./5_References.ipynb#5)  
<a name="footnote6"></a>6: [Rand R. Wilcox](./5_References.ipynb#6) Note: p.45

<div style="display: flex; justify-content: space-between;">
<a style="flex: 1; text-align: left;" href="./2_3_Error_term.ipynb">← Previous: 2.3 Estimation of the Error Term</a>
<span style="flex: 1; text-align: center;">2.4 Tests for the Significance of the Regression</span>
<a style="flex: 1; text-align: right;" href="./3_1_MLR_bus_weather.ipynb">Next: 3.1 Model Building with Weather →</a>
</div>
