# Introduction to Linear Regression

## Table of Contents:

1. [Introduction to Linear Regression](#Introduction-to-Linear-Regression)
2. [The Method of Least Squares](#The-Method-of-Least-Squares)
3. [Inferences on Least Squares Estimated Parameters](#Inferences-on-Least-Squares-Estimated-Parameters)
4. [Linear Regression as Analysis of Variance](#Linear-Regression-as-Analysis-of-Variance)
5. [Linear Regression Example through SAS](#Linear-Regression-Example-through-SAS)
6. [Summary](#Summary)
7. [Citations](#Citations)

### Introduction to Linear Regression

**Linear regression** is a class of techniques that relates one or more variables, known as **predictor/design/feature/regressor variables**, to a **response variable**. The predictors are the independent variables, whereas the response is the dependent variable. Such a relationship is known as a **model**. Given a response variable $Y$ and a single predictor variable $X$, the following constitutes a linear model:

\begin{equation} 
Y = b_{0} + b_{1} X
\end{equation}

This equation is just a straight line, where $b_0$ is the **intercept** and $b_1$ is the **slope**, but it captures the relationship between $X$ and $Y$ in a simple manner: $Y$ changes by $b_{1}$ for every unit change in $X$, and equals $b_{0}$ when $X = 0$. Collectively, the $b_{i}$'s are known as the **parameters/coefficients/weights** of the model. However, real-life data has variability, so observed points rarely fall exactly on a line. To account for this, we add a random error term ($\varepsilon$), assumed to follow a Gaussian (normal) distribution, to the end of the linear model:

\begin{equation}
Y = b_{0} + b_{1} X + \varepsilon
\end{equation}

The error term is assumed to have a mean of 0 and the same variance, $\sigma^2$, at every $X$. The figure below shows how the error term distributes the possible values of $Y$ around the line for a given $X$ value [1]:

<img src="files/images/linear%20regression%20with%20normal%20error%20distribution.PNG">
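To make the role of the error term concrete, here is a minimal Python sketch (Python is used for illustration only and is not part of the SAS workflow in this document) that simulates data from $Y = b_0 + b_1 X + \varepsilon$. The function name `simulate_linear_data` and the parameter values are hypothetical choices made for this example.

```python
import random


def simulate_linear_data(b0, b1, sigma, xs, seed=0):
    """Draw one y per x from Y = b0 + b1*X + eps, with eps ~ Normal(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [b0 + b1 * x + rng.gauss(0.0, sigma) for x in xs]


# Hypothetical "true" parameters, chosen only for illustration.
B0, B1, SIGMA = 2.0, 0.5, 1.0
xs = [i / 10 for i in range(1000)]
ys = simulate_linear_data(B0, B1, SIGMA, xs)

# The errors are the gaps between each simulated y and the true line.
errors = [y - (B0 + B1 * x) for x, y in zip(xs, ys)]
mean_error = sum(errors) / len(errors)
print(f"mean of simulated errors: {mean_error:.3f}")
```

With many simulated points, the average error should land close to 0, matching the assumption that the error term has a mean of 0.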

The goal of linear regression, given a set of ($X$, $Y$) data, is to estimate $b_0$ and $b_1$. We denote the estimates of $b_{0}$ and $b_{1}$ as $\widehat{b}_{0}$ and $\widehat{b}_{1}$. Plugging an observed value $x$ into the line defined by these estimates gives an estimated version of $Y$ as well:

\begin{equation}
\widehat{y} = \widehat{b}_{0} + \widehat{b}_{1} x
\end{equation}

Each value $\widehat{y}$ is the estimated mean of the distribution of the response at the corresponding observed value of $X$.

**Simple linear regression** is a one-to-one relationship where one independent variable is matched to one response variable. **Multiple linear regression** describes a many-to-one relationship where a group of independent variables is matched to one response variable. Multiple linear regression expands $X$ to $X_{1}$, $X_{2}$, ..., $X_{k}$ and $b_{1}$ to $b_{1}$, $b_{2}$, ..., $b_{k}$, where $k$ is the number of predictors.
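Written out, and using $k$ for the number of predictors (a symbol chosen here so that it cannot be confused with a sample size), the multiple linear regression model takes the form:

\begin{equation}
Y = b_{0} + b_{1} X_{1} + b_{2} X_{2} + \cdots + b_{k} X_{k} + \varepsilon
\end{equation}

Setting $k = 1$ recovers the simple linear regression model shown earlier.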

[(back to top)](#Table-of-Contents:)

### The Method of Least Squares

This section shows how to generate estimates for $b_{0}$ and $b_{1}$. The most common technique is known as the **method of least squares**.

A full derivation of the least-squares problem involves linear algebra and maximum likelihood estimation and will not be covered here. Simply put, we want the estimated $\widehat{y}$'s to be as close as possible to the observed $y$'s. Ideally, $\widehat{y} = y$, so that $y - \widehat{y} = 0$. However, due to the error term, this will not be the case. For every observation, there will be some difference $y - \widehat{y}$, known as a **residual**. Sometimes $\widehat{y}$ will be greater than $y$ and other times it will be less than $y$. Simply adding up the residuals would let positive and negative ones cancel each other out, so we square each one before adding them together. The result is known as the **sum of squared errors** (SSE):

\begin{equation}
SSE = \sum_{i = 1}^{n} (y_{i} - \widehat{y}_{i})^2 = \sum_{i = 1}^{n} (y_{i} - (b_{0} + b_{1} x_i))^{2} = \sum_{i = 1}^{n} (y_{i} - b_{0} - b_{1} x_{i})^{2}
\end{equation}

We want to find the estimates for $b_0$ and $b_1$ that minimize the SSE. A small SSE means that the residuals are small overall, although an individual residual can still be large. To figure out which values of $b_0$ and $b_1$ minimize the SSE, we take the partial derivative of the SSE equation with respect to each parameter:

\begin{equation}
\frac{\partial (SSE)}{\partial b_{0}} = \frac{\partial (\sum_{i = 1}^{n} (y_{i} - b_{0} - b_{1} x_{i})^2)}{\partial b_{0}} = -2 \sum_{i = 1}^{n} (y_{i} - b_{0} - b_{1} x_{i})
\end{equation}

\begin{equation}
\frac{\partial (SSE)}{\partial b_1} = \frac{\partial (\sum_{i = 1}^{n} (y_i - b_0 - b_1 x_i)^2)}{\partial b_1} = -2 \sum_{i = 1}^{n} (y_i - b_0 - b_1 x_i)x_i
\end{equation}

In order to find the minimum SSE based on both $b_0$ and $b_1$, we set each partial derivative to 0 and solve for each parameter:

\begin{equation}
\frac{\partial (SSE)}{\partial b_0} = -2 \sum_{i = 1}^{n} (y_i - b_0 - b_1 x_i) = 0
\end{equation}

\begin{equation}
\frac{\partial (SSE)}{\partial b_1} = -2 \sum_{i = 1}^{n} (y_i - b_0 - b_1 x_i)x_i = 0
\end{equation}
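As an intermediate step, dividing each condition by $-2$ and distributing the sums gives a pair of linear equations in $b_0$ and $b_1$, commonly called the normal equations:

\begin{equation}
n b_0 + b_1 \sum_{i = 1}^{n} x_i = \sum_{i = 1}^{n} y_i
\end{equation}

\begin{equation}
b_0 \sum_{i = 1}^{n} x_i + b_1 \sum_{i = 1}^{n} x_i^2 = \sum_{i = 1}^{n} x_i y_i
\end{equation}

Dividing the first equation by $n$ gives $b_0 = \overline{y} - b_1 \overline{x}$. Substituting this into the second equation leaves a single equation in $b_1$.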

Now that we have a system of equations, if we solve this system for $b_0$ and $b_1$ we get their estimated versions $\widehat{b}_0$ and $\widehat{b}_1$:

\begin{equation}
\widehat{b}_1 = \frac{\sum_{i = 1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i = 1}^{n} (x_i - \overline{x})^2}
\end{equation}

\begin{equation}
\widehat{b}_0 = \overline{y} - \widehat{b}_1 \overline{x}
\end{equation}
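The two closed-form estimates can be sketched in a few lines of Python (an illustration only, separate from the SAS workflow used later). The function name `least_squares_fit` is hypothetical. The data are the first eight (Age, Height) rows of SASHELP.CLASS as printed in the SAS example section, so the estimates differ from the fit on all 19 observations reported there.

```python
def least_squares_fit(xs, ys):
    """Return (b0_hat, b1_hat) for a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1_hat = s_xy / s_xx
    b0_hat = y_bar - b1_hat * x_bar
    return b0_hat, b1_hat


# First eight (Age, Height) rows of SASHELP.CLASS, as printed in this document.
ages = [14, 13, 13, 14, 14, 12, 12, 15]
heights = [69.0, 56.5, 65.3, 62.8, 63.5, 57.3, 59.8, 62.5]

b0_hat, b1_hat = least_squares_fit(ages, heights)
residuals = [y - (b0_hat + b1_hat * x) for x, y in zip(ages, heights)]

print(f"b0_hat = {b0_hat:.4f}, b1_hat = {b1_hat:.4f}")
print(f"sum of residuals = {sum(residuals):.2e}")
```

The residuals sum to zero up to floating-point error, which is exactly the first of the two conditions obtained by setting the partial derivatives to 0.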

[(back to top)](#Table-of-Contents:)

### Inferences on Least Squares Estimated Parameters

Each estimator ($\widehat{b}_0$, $\widehat{b}_1$, $\widehat{y}_i$) is itself a random variable with its own distribution, from which we can draw inferences. This section lists common distributional properties for each of the estimators. Their derivations will not be described here, but can be found in [2] for those interested.

All t-statistic values use $n - 2$ degrees of freedom.

The properties of $\widehat{y}_i$ are as follows:

<table>
    <tr>
        <td>**mean**</td>
        <td>$\mu_{\hat{y}_{i}|x_i} = b_0 + b_1 x_i$</td>
    </tr>
    <tr>
        <td>**variance**</td>
        <td>$\sigma^2_{\hat{y}_{i}|x_i} = \sigma^2 \left( \frac{1}{n} + \frac{(x_i - \overline{x})^2}{\sum_{j = 1}^{n} (x_j - \overline{x})^2} \right)$</td>
    </tr>
    <tr>
        <td>**confidence interval bounds**</td>
        <td>$\widehat{y}_i \pm t_{\alpha/2} s \sqrt{\frac{1}{n} + \frac{(x_i - \overline{x})^2}{\sum_{j = 1}^{n} (x_j - \overline{x})^2}}$</td>
    </tr>
    <tr>
        <td>**prediction interval bounds**</td>
        <td>$\widehat{y}_i \pm t_{\alpha/2} s \sqrt{1 + \frac{1}{n} + \frac{(x_i - \overline{x})^2}{\sum_{j = 1}^{n} (x_j - \overline{x})^2}}$</td>
    </tr>
</table>

The unbiased estimate of $\sigma^2$ is

\begin{equation}
    s^2 = \frac{SSE}{n - 2} = \frac{\sum_{i = 1}^{n} (y_i - \widehat{y}_i)^2}{n - 2}
\end{equation}
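The interval formulas for $\widehat{y}_i$ and the estimate $s^2$ can be combined in a short Python sketch (illustrative only, not part of the SAS workflow). The function name `fit_and_intervals` is hypothetical, the data are the first eight (Age, Height) rows of SASHELP.CLASS printed later in this document, and the critical value $t_{0.025} = 2.447$ for $8 - 2 = 6$ degrees of freedom is taken from a standard t table rather than computed.

```python
import math


def fit_and_intervals(xs, ys, x0, t_crit):
    """Return (y_hat, ci_half_width, pi_half_width) at x0.

    t_crit is t_{alpha/2} with n - 2 degrees of freedom, supplied by the caller.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    b0_hat = y_bar - b1_hat * x_bar
    sse = sum((y - (b0_hat + b1_hat * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))  # s^2 = SSE / (n - 2)
    spread = 1 / n + (x0 - x_bar) ** 2 / s_xx
    ci_half = t_crit * s * math.sqrt(spread)      # for the mean response at x0
    pi_half = t_crit * s * math.sqrt(1 + spread)  # for one new observation at x0
    return b0_hat + b1_hat * x0, ci_half, pi_half


# First eight (Age, Height) rows of SASHELP.CLASS, as printed in this document.
ages = [14, 13, 13, 14, 14, 12, 12, 15]
heights = [69.0, 56.5, 65.3, 62.8, 63.5, 57.3, 59.8, 62.5]
T_CRIT = 2.447  # t_0.025 with 6 degrees of freedom, from a t table (assumed lookup)

for age in (13, 15):
    y_hat, ci_half, pi_half = fit_and_intervals(ages, heights, age, T_CRIT)
    print(f"age {age}: fit {y_hat:.2f}, CI +/- {ci_half:.2f}, PI +/- {pi_half:.2f}")
```

The prediction interval is always wider than the confidence interval because of the extra $1$ under the square root, and both widen as $x_i$ moves away from $\overline{x}$.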

The properties of $\widehat{b}_1$ are as follows:

<table>
    <tr>
        <td>**mean**</td>
        <td>$\mu_{\widehat{b}_1} = b_1$</td>
    </tr>
    <tr>
        <td>**variance**</td>
        <td>$\sigma^2_{\widehat{b}_1} = \frac{\sigma^2}{\sum_{i = 1}^{n} (x_i - \overline{x})^2}$</td>
    </tr>
    <tr>
        <td>**confidence interval bounds**</td>
        <td>$\widehat{b}_1 \pm t_{\alpha/2} \frac{s}{\sqrt{\sum_{i = 1}^{n} (x_i - \overline{x})^2}}$</td>
    </tr>
    <tr>
        <td>**hypothesis testing against $H_0: b_1 = b_{10}$**</td>
        <td>t-statistic: $t = \frac{\widehat{b}_1 - b_{10}}{s} \sqrt{\sum_{i = 1}^{n} (x_i - \overline{x})^2}$</td>
    </tr>
</table>

The properties of $\widehat{b}_0$ are as follows:

<table>
    <tr>
        <td>**mean**</td>
        <td>$\mu_{\widehat{b}_0} = b_0$</td>
    </tr>
    <tr>
        <td>**variance**</td>
        <td>$\sigma^2_{\widehat{b}_0} = \frac{\sigma^2 \sum_{i = 1}^{n} x^2_i}{n \sum_{i = 1}^{n} (x_i - \overline{x})^2}$</td>
    </tr>
    <tr>
        <td>**confidence interval bounds**</td>
        <td>$\widehat{b}_0 \pm t_{\alpha/2} \frac{s}{\sqrt{n \sum_{i = 1}^{n} (x_i - \overline{x})^2}} \sqrt{\sum_{i = 1}^{n} x^2_i}$</td>
    </tr>
    <tr>
        <td>**hypothesis testing against $H_0: b_0 = b_{00}$**</td>
        <td>t-statistic: $t = \frac{\widehat{b}_0 - b_{00}}{s \sqrt{\frac{\sum_{i = 1}^{n} x^2_i}{n \sum_{i = 1}^{n} (x_i - \overline{x})^2}}}$</td>
    </tr>
</table>

The hypothesis tests listed for $\widehat{b}_0$ and $\widehat{b}_1$ are a way to evaluate whether the intercept, or the feature variable corresponding to $\widehat{b}_1$, actually contributes to the model. Often, $b_{00} = b_{10} = 0$. In that case, if $H_0$ cannot be rejected for a specific parameter, the data do not provide enough evidence that the parameter differs from 0 (which is not the same as proving that it equals 0). Such a term is usually dropped from the final model. An example of how to interpret these test results is shown in section 5.

[(back to top)](#Table-of-Contents:)

### Linear Regression as Analysis of Variance

With a model specified, it is essential to see whether the generated model actually fits the data. For example, it would be meaningless to find a relationship between students' heights and spelling test grades. An analysis of variance approach to linear regression will indicate that there is little to no relationship between heights and grades.

First, let's define two new terms.

The **regression sum of squares** shows how much variation can be found in the predicted $\widehat{y}$ values compared to the data mean ($\overline{y}$):

\begin{equation}
SSR = \sum_{i = 1}^{n} (\widehat{y}_i - \overline{y})^2
\end{equation}

In many sources, the SSR is said to reflect the amount of variation as explained by the model.

The **total corrected sum of squares** shows how much variation can be found in the original $y$ values compared to the data mean ($\overline{y}$):

\begin{equation}
SST = \sum_{i = 1}^{n} (y_i - \overline{y})^2
\end{equation}

The SST can also be written as the sum of the SSR and the SSE:

\begin{equation}
SST = SSR + SSE
\end{equation}

\begin{equation}
\sum_{i = 1}^{n} (y_i - \overline{y})^2 = \sum_{i = 1}^{n} (\widehat{y}_i - \overline{y})^2 + \sum_{i = 1}^{n} (y_i - \widehat{y}_i)^2
\end{equation}

This equation states that the variation in the original $y$ values with respect to their mean is equal to the variation in the predicted $\widehat{y}$ values with respect to the original data's mean plus the sum of squared residuals. Putting it another way, the variation in the original values (SST) is split between the variation explained by the model (SSR) and the variation not explained by the model (SSE).
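The decomposition can be checked numerically. The following Python sketch (illustrative only, with the hypothetical function name `sums_of_squares`) fits a line to the first eight (Age, Height) rows of SASHELP.CLASS printed later in this document and compares $SST$ with $SSR + SSE$.

```python
def sums_of_squares(xs, ys):
    """Return (ssr, sse, sst) for a least-squares line fitted to xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    b0_hat = y_bar - b1_hat * x_bar
    fitted = [b0_hat + b1_hat * x for x in xs]
    ssr = sum((f - y_bar) ** 2 for f in fitted)          # explained by the model
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # left in the residuals
    sst = sum((y - y_bar) ** 2 for y in ys)              # total variation
    return ssr, sse, sst


# First eight (Age, Height) rows of SASHELP.CLASS, as printed in this document.
ages = [14, 13, 13, 14, 14, 12, 12, 15]
heights = [69.0, 56.5, 65.3, 62.8, 63.5, 57.3, 59.8, 62.5]

ssr, sse, sst = sums_of_squares(ages, heights)
print(f"SSR = {ssr:.4f}, SSE = {sse:.4f}, SSR + SSE = {ssr + sse:.4f}, SST = {sst:.4f}")
```

The identity only holds for the least-squares estimates. For any other line, the cross term between residuals and fitted values does not vanish.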

We want the model to explain as much of the variation in the data as possible. Thus, we want the ratio of SSR to SSE to be large. Take the null hypothesis $H_0: b_1 = 0$. This essentially says that none of the variation in $Y$ is explained by $X$. The alternate hypothesis $H_1: b_1 \neq 0$ says that at least some of it is. To turn the sums of squares into variance estimates, we divide each sum of squares by its degrees of freedom. The result is known as a **mean square**:

\begin{equation}
MS_{SSR} = \frac{SSR}{1}
\end{equation}

\begin{equation}
MS_{SSE} = \frac{SSE}{n - 2}
\end{equation}

The f-statistic, then, is the ratio of the two mean square terms:

\begin{equation}
f = \frac{MS_{SSR}}{MS_{SSE}} = \frac{\frac{SSR}{1}}{\frac{SSE}{n - 2}} = (n - 2) \frac{SSR}{SSE}
\end{equation}

Under $H_0$, the f-statistic follows an $F$-distribution with $1$ and $n - 2$ degrees of freedom. We can display the sum of squares information as the following table:

<table>
    <tr>
        <td>**source of variation**</td>
        <td>**sum of squares**</td>
        <td>**degrees of freedom**</td>
        <td>**mean square**</td>
        <td>**f-statistic**</td>
    </tr>
    <tr>
        <td>regression</td>
        <td>$SSR$</td>
        <td>$1$</td>
        <td>$SSR$</td>
        <td>$\frac{SSR}{s^2}$</td>
    </tr>
    <tr>
        <td>error</td>
        <td>$SSE$</td>
        <td>$n - 2$</td>
        <td>$s^2 = \frac{SSE}{n - 2}$</td>
        <td></td>
    </tr>
    <tr>
        <td>**total**</td>
        <td>$SST$</td>
        <td>$n - 1$</td>
        <td></td>
        <td></td>
    </tr>
</table>

Using the information on this table, we see that the f-statistic is:

\begin{equation}
f = \frac{MS_{SSR}}{MS_{SSE}} = \frac{SSR}{s^2}
\end{equation}

$H_0$ can be rejected at significance level $\alpha$ if $f > f_\alpha(1, n - 2)$. If $H_0$ is ultimately rejected, we can conclude that the proposed model **does** explain a statistically significant amount of the variation found in the original data.
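As a sketch of this decision rule in Python (illustrative only, not part of the SAS workflow), the snippet below recomputes the f-statistic from the sums of squares that appear in the "Analysis of Variance" table of the SAS example section. The function name `f_statistic` is hypothetical, and the critical value $f_{0.05}(1, 17) \approx 4.45$ is an assumed lookup from a standard F table rather than something computed here.

```python
def f_statistic(ssr, sse, n):
    """f = MS_SSR / MS_SSE for simple linear regression (1 and n - 2 df)."""
    ms_ssr = ssr / 1
    ms_sse = sse / (n - 2)
    return ms_ssr / ms_sse


# Sums of squares and sample size from the SAS "Analysis of Variance" table.
SSR, SSE, N = 311.54348, 161.62073, 19
F_CRIT = 4.45  # f_0.05(1, 17), taken from an F table (assumed lookup)

f = f_statistic(SSR, SSE, N)
reject_h0 = f > F_CRIT
print(f"f = {f:.2f}, reject H0: {reject_h0}")
```

The computed value agrees with the "F Value" column reported by SAS and lies well above the critical value, so $H_0$ is rejected at $\alpha = 0.05$.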

[(back to top)](#Table-of-Contents:)

### Linear Regression Example through SAS

As a simple demonstration, let's use SASHELP.CLASS as an example dataset. We will compare age to height values for all the students to determine whether teenagers grow taller as they get older.

The dataset values look like this:

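    title "Age vs. Height for SASHELP.CLASS";
    proc print data = sashelp.class (obs = 8);
        var age height;
    run;

<table>
    <tr><td>**Obs**</td><td>**Age**</td><td>**Height**</td></tr>
    <tr><td>1</td><td>14</td><td>69.0</td></tr>
    <tr><td>2</td><td>13</td><td>56.5</td></tr>
    <tr><td>3</td><td>13</td><td>65.3</td></tr>
    <tr><td>4</td><td>14</td><td>62.8</td></tr>
    <tr><td>5</td><td>14</td><td>63.5</td></tr>
    <tr><td>6</td><td>12</td><td>57.3</td></tr>
    <tr><td>7</td><td>12</td><td>59.8</td></tr>
    <tr><td>8</td><td>15</td><td>62.5</td></tr>
</table>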


PROC REG is the most general-purpose of the regression procedures in SAS, and it is the one we will use for the current example. Its syntax is as follows:

    proc reg data = <dataset>;
        model <response variable> = <feature variable list>;
    run;
    
Name the dataset with the DATA= option of the PROC REG statement. Then, in the MODEL statement, put the response variable to the left of the equals sign and the list of feature variables to the right.

The model that we want to estimate is the following:

\begin{equation}
HEIGHT_i = b_0 + b_1 AGE_i
\end{equation}

PROC REG will give us our parameter estimates $\widehat{b}_0$ and $\widehat{b}_1$.

Let's try this on the SASHELP.CLASS dataset:

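    ods graphics on;
    proc reg data = sashelp.class plots(only) = (fit(stats=none));
        model height = age;
    run;
    ods graphics off;

<table>
    <tr><td>Number of Observations Read</td><td>19</td></tr>
    <tr><td>Number of Observations Used</td><td>19</td></tr>
</table>

<table>
    <tr><td colspan="6">**Analysis of Variance**</td></tr>
    <tr><td>**Source**</td><td>**DF**</td><td>**Sum of Squares**</td><td>**Mean Square**</td><td>**F Value**</td><td>**Pr &gt; F**</td></tr>
    <tr><td>Model</td><td>1</td><td>311.54348</td><td>311.54348</td><td>32.77</td><td>&lt;.0001</td></tr>
    <tr><td>Error</td><td>17</td><td>161.62073</td><td>9.5071</td><td></td><td></td></tr>
    <tr><td>Corrected Total</td><td>18</td><td>473.16421</td><td></td><td></td><td></td></tr>
</table>

<table>
    <tr><td>Root MSE</td><td>3.08336</td><td>R-Square</td><td>0.6584</td></tr>
    <tr><td>Dependent Mean</td><td>62.33684</td><td>Adj R-Sq</td><td>0.6383</td></tr>
    <tr><td>Coeff Var</td><td>4.94629</td><td></td><td></td></tr>
</table>

<table>
    <tr><td colspan="6">**Parameter Estimates**</td></tr>
    <tr><td>**Variable**</td><td>**DF**</td><td>**Parameter Estimate**</td><td>**Standard Error**</td><td>**t Value**</td><td>**Pr &gt; |t|**</td></tr>
    <tr><td>Intercept</td><td>1</td><td>25.22388</td><td>6.52169</td><td>3.87</td><td>0.0012</td></tr>
    <tr><td>Age</td><td>1</td><td>2.78714</td><td>0.48688</td><td>5.72</td><td>&lt;.0001</td></tr>
</table>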


Let's look at the coefficients we found:

    ods graphics on;
    proc reg data = sashelp.class plots = none;
        model height = age;
        ods select ParameterEstimates;
        ods show;
    run;
    ods graphics off;

<table>
    <tr><td colspan="6">**Parameter Estimates**</td></tr>
    <tr><td>**Variable**</td><td>**DF**</td><td>**Parameter Estimate**</td><td>**Standard Error**</td><td>**t Value**</td><td>**Pr &gt; |t|**</td></tr>
    <tr><td>Intercept</td><td>1</td><td>25.22388</td><td>6.52169</td><td>3.87</td><td>0.0012</td></tr>
    <tr><td>Age</td><td>1</td><td>2.78714</td><td>0.48688</td><td>5.72</td><td>&lt;.0001</td></tr>
</table>


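We have found $\widehat{b}_0 = 25.224$ and $\widehat{b}_1 = 2.787$, making the following our fitted model:

\begin{equation}
\widehat{HEIGHT}_i = 25.224 + 2.787 (AGE_i)
\end{equation}

This equation says that each additional year of age is associated with an average increase of about 2.79 (close to 3) in height, starting from an intercept of about 25. The intercept is the extrapolated mean height at age 0, which lies far outside the ages in the data, so it should not be read as a realistic height.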
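As a quick worked example (age 13 is chosen here only for illustration), plugging a value into the fitted line gives the estimated mean height at that age:

\begin{equation}
25.224 + 2.787 (13) = 25.224 + 36.231 = 61.455
\end{equation}

Moving from age 13 to age 14 raises the estimate by the slope, $2.787$, to $64.242$.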

The sum of squared error in our model amounts to $SSE = 161.621$. This represents the total squared deviation of our data from our predicted means. 
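Several of the other summary values in the PROC REG output follow directly from the sums of squares. The Python sketch below (illustrative only) recomputes $s^2$, "Root MSE", and "R-Square" from the values in the "Analysis of Variance" table. Reading "R-Square" as $1 - SSE/SST$ (equivalently $SSR/SST$) is the standard definition and is stated here as background rather than taken from this document.

```python
import math

# Values copied from the "Analysis of Variance" table printed by PROC REG.
SSE = 161.62073
SST = 473.16421
N = 19

s_squared = SSE / (N - 2)        # unbiased estimate of sigma^2 (Mean Square for Error)
root_mse = math.sqrt(s_squared)  # reported by SAS as "Root MSE"
r_square = 1 - SSE / SST         # share of total variation explained ("R-Square")

print(f"s^2 = {s_squared:.4f}, Root MSE = {root_mse:.5f}, R-Square = {r_square:.4f}")
```

These agree with the reported Mean Square for Error, Root MSE, and R-Square to the precision shown in the SAS output.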

Let's look at the "Fit Plot for Height" graph:

    ods graphics on;
    proc reg data = sashelp.class plots(only) = (fit(stats=none));
        model height = age;
        ods select FitPlot;
        ods show;
    run;
    ods graphics off;

The original data is shown as age vs. height scatter points. The "Fit" line passes through the estimated mean of the height distribution at each age. Since we are dealing with probability distributions, the 95% confidence limits show where the true mean height at each age (the mean we would see given an infinite amount of data) could plausibly lie. The 95% prediction limits show where a single new height measurement at a given age could plausibly lie, which is why they are wider. Both sets of limits are only trustworthy within the range of ages in the data: applying them to an unmeasured age outside that range (like age 10) is extrapolation. People also stop growing as they get older, so given more data we would expect the fit line, the CIs, and the PIs to level off after a certain age rather than continue along a straight line.

Alright, so we found a model. But, how do we know that this model is truly accurate? That is to say, how do we know that similar results will be found if we chose a different group of teenagers? Do we really trust our results?

By default, PROC REG tests each parameter against a null hypothesis value of 0. This means that for $b_0$ and $b_1$:

\begin{matrix}
H_{0, b_0}: b_0 = 0 \\
H_{1, b_0}: b_0 \neq 0
\end{matrix}

\begin{matrix}
H_{0, b_1}: b_1 = 0 \\
H_{1, b_1}: b_1 \neq 0
\end{matrix}

If $H_0$ holds for the slope, the feature variable is being multiplied by 0. That feature variable then has no effect on the value of the response variable and is effectively taken out of the model. If $H_0$ holds for the intercept, the fitted line passes through the origin.

To test each null hypothesis, SAS computes the t-statistic and finds its p-value. The t-statistic expressions are listed for each parameter in section 3. Usually, p-values are compared against a significance level of $\alpha = 0.05$. If the p-value is above $0.05$, then the null hypothesis ($H_0$) cannot be rejected. If the p-value is below $0.05$, then the null hypothesis ($H_0$) is rejected in favor of the alternate hypothesis ($H_1$).

The p-values for $\widehat{b}_0$ and $\widehat{b}_1$ can be found under the "Pr > |t|" heading in the "Parameter Estimates" table:

    ods graphics on;
    proc reg data = sashelp.class plots = none;
        model height = age;
        ods select ParameterEstimates;
        ods show;
    run;
    ods graphics off;

<table>
    <tr><td colspan="6">**Parameter Estimates**</td></tr>
    <tr><td>**Variable**</td><td>**DF**</td><td>**Parameter Estimate**</td><td>**Standard Error**</td><td>**t Value**</td><td>**Pr &gt; |t|**</td></tr>
    <tr><td>Intercept</td><td>1</td><td>25.22388</td><td>6.52169</td><td>3.87</td><td>0.0012</td></tr>
    <tr><td>Age</td><td>1</td><td>2.78714</td><td>0.48688</td><td>5.72</td><td>&lt;.0001</td></tr>
</table>


Given the small p-values for both parameters (0.0012 for the intercept and less than 0.0001 for AGE), $H_0$ is rejected in each case, so neither the intercept nor the AGE variable should be taken out of the model.
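The "t Value" column can be reproduced by hand: each t-statistic is the estimate minus its null value (0 here), divided by its standard error. The Python sketch below (illustrative only) does this and also forms 95% confidence intervals. The function name `t_and_ci` is hypothetical, and the critical value $t_{0.025} = 2.110$ for 17 degrees of freedom is an assumed lookup from a standard t table.

```python
def t_and_ci(estimate, std_error, t_crit, null_value=0.0):
    """Return the t-statistic and the (low, high) confidence interval bounds."""
    t_stat = (estimate - null_value) / std_error
    half_width = t_crit * std_error
    return t_stat, (estimate - half_width, estimate + half_width)


T_CRIT = 2.110  # t_0.025 with 19 - 2 = 17 degrees of freedom (assumed table lookup)

# (estimate, standard error) pairs copied from the "Parameter Estimates" table.
rows = {"Intercept": (25.22388, 6.52169), "Age": (2.78714, 0.48688)}
results = {name: t_and_ci(est, se, T_CRIT) for name, (est, se) in rows.items()}

for name, (t_stat, (low, high)) in results.items():
    print(f"{name}: t = {t_stat:.2f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Neither interval contains 0, which is consistent with rejecting $H_0$ for both parameters at $\alpha = 0.05$.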

As a check using the analysis of variance method, we see that the model's f-value is $32.77$ with a small p-value in the "Analysis of Variance" table, indicating that much of the variance found in the original data can be explained by the model:

    ods graphics on;
    proc reg data = sashelp.class plots = none;
        model height = age;
        ods select ANOVA;
        ods show;
    run;
    ods graphics off;

<table>
    <tr><td colspan="6">**Analysis of Variance**</td></tr>
    <tr><td>**Source**</td><td>**DF**</td><td>**Sum of Squares**</td><td>**Mean Square**</td><td>**F Value**</td><td>**Pr &gt; F**</td></tr>
    <tr><td>Model</td><td>1</td><td>311.54348</td><td>311.54348</td><td>32.77</td><td>&lt;.0001</td></tr>
    <tr><td>Error</td><td>17</td><td>161.62073</td><td>9.5071</td><td></td><td></td></tr>
    <tr><td>Corrected Total</td><td>18</td><td>473.16421</td><td></td><td></td><td></td></tr>
</table>


This lends credence to our model being a good fit to the data.
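The t-test on the slope and the ANOVA f-test are in fact linked. In simple linear regression, the square of the t-statistic for the slope equals the f-statistic, which can be verified (up to rounding) from the values SAS reported:

\begin{equation}
t^2 = \left( \frac{2.78714}{0.48688} \right)^2 \approx (5.7245)^2 \approx 32.77 = f
\end{equation}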

This has been a brief overview of parameter estimate significance testing. We will explore model selection further in a separate document.

[(back to top)](#Table-of-Contents:)

### Summary

In this document, we learned:
* what linear regression is
* the difference between simple and multiple linear regression
* how to calculate estimated parameters for simple linear regression
* about the relationship between different sum of squares components
* how ANOVA techniques can determine model adequacy
* how to perform and interpret inference tests for estimated parameters
* how to run a simple linear regression in SAS

Key equations:
* $SSE = \sum_{i = 1}^{n} (y_i - \widehat{y}_i)^2 = \sum_{i = 1}^{n} (y_i - b_0 - b_1 x_i)^2$
* $\widehat{b}_1 = \frac{\sum_{i = 1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i = 1}^{n} (x_i - \overline{x})^2}$
* $\widehat{b}_0 = \overline{y} - \widehat{b}_1 \overline{x}$

New SAS procedures:
* PROC REG

[(back to top)](#Table-of-Contents:)

### Citations

[1] J. L. Devore, “Simple Linear Regression and Correlation,” in *Probability and Statistics for Engineering and the Sciences*, 8th ed. Boston, USA: Brooks/Cole, 2012, ch. 12, sec. 1, p. 473.

[2] R. E. Walpole, R. H. Myers, S. L. Myers, and K. Ye, *Probability & Statistics for Engineers & Scientists*, 9th ed. Boston, USA: Pearson Education, Inc., 2012, ch. 11, sec. 4-6, pp. 400-413.

[(back to top)](#Table-of-Contents:)