# MindTap 14 - Simple Linear Regression

## 14.1 Simple Linear Regression Model

### Regression Model and Regression Equation

Regression analysis requires us to develop a **regression model**, an equation showing how a dependent variable $ y $ is related to the independent variable $ x $.

The simple linear regression model is as follows:

$$
y = \beta _0 + \beta _1 x + \epsilon
$$

Here $ \beta _0 $ and $ \beta _1 $ are referred to as the parameters of the model, and $ \epsilon $ is a random variable called the error term. The error term accounts for the variability in $ y $ that cannot be explained by the linear relationship between $ x $ and $ y $.

The equation that describes how the expected value of $ y $, denoted $ E(y) $, is related to $ x $ is called the **regression equation**:

$$
E(y) = \beta _0 + \beta _1 x
$$

Note that the graph of the regression equation is a straight line with slope $ \beta _1 $ and y-intercept $ \beta _0 $.
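
To make the difference between the model and the regression equation concrete, here is a minimal simulation sketch. The parameter values (`BETA0`, `BETA1`, `SIGMA`) and the function names are hypothetical, chosen only for illustration, and the error term is assumed to be normally distributed with mean 0. Individual draws of $ y $ scatter around the line, while their average at a fixed $ x $ should land close to $ E(y) $.

```python
import random


def expected_y(x, beta0, beta1):
    """Regression equation: E(y) = beta0 + beta1 * x."""
    return beta0 + beta1 * x


def simulate_y(x, beta0, beta1, sigma, rng):
    """One draw from the model y = beta0 + beta1 * x + epsilon.

    Assumption for illustration: epsilon is normal with mean 0 and
    standard deviation sigma.
    """
    epsilon = rng.gauss(0.0, sigma)
    return expected_y(x, beta0, beta1) + epsilon


# Hypothetical parameter values, not taken from any real data set.
BETA0, BETA1, SIGMA = 10.0, 2.0, 3.0

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [simulate_y(5.0, BETA0, BETA1, SIGMA, rng) for _ in range(10_000)]
average_y = sum(draws) / len(draws)

print("E(y) at x = 5:", expected_y(5.0, BETA0, BETA1))
print("average of simulated y at x = 5:", round(average_y, 2))
```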

### Estimated Regression Equation

Since the values of the population parameters $ \beta _0 $ and $ \beta _1 $ are rarely known, they must be estimated. These estimates are represented as the sample statistics $ b_0 $ and $ b_1 $. With this, we can now construct the **estimated regression equation**:

$$
\hat{y} = b_0 + b_1 x
$$

The graph of this equation is known as the *estimated regression line*.

## 14.2 Least Squares Method

The **least squares method** is a procedure for using sample data to find the estimated regression equation. This method uses sample data to provide the values of $b_0$ and $b_1$ that minimize the *sum of the squares of the deviations* between the observed values of the dependent variable $y_i$ and the predicted values of the dependent variable $\hat{y}_i$:

$$
\min \Sigma (y_i - \hat{y}_i)^2
$$

where
- $y_i$ = observed value of the dependent variable for the *i*th observation
- $\hat{y}_i$ = predicted value of the dependent variable for the *i*th observation.

Using calculus, you can derive the following equations for $b_0$ and $b_1$:

$$
b_1 = \frac{\Sigma (x_i - \bar{x})(y_i - \bar{y})}{\Sigma (x_i - \bar{x})^2}
$$

$$
b_0 = \bar{y} - b_1 \bar{x}
$$

where
- $x_i$ = value of the independent variable for the *i*th observation
- $y_i$ = value of the dependent variable for the *i*th observation
- $\bar{x}$ = mean value for the independent variable
- $\bar{y}$ = mean value for the dependent variable
- $n$ = total number of observations
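
The formulas for $b_0$ and $b_1$ translate almost directly into code. The following is a minimal sketch in plain Python; the function name `least_squares` and the five-point sample are made up for illustration.

```python
def least_squares(x, y):
    """Return (b0, b1) for the estimated regression equation y_hat = b0 + b1 * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")

    x_bar = sum(x) / n  # mean of the independent variable
    y_bar = sum(y) / n  # mean of the dependent variable

    # Numerator and denominator of the formula for b1.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Made-up sample, used only to exercise the formulas.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares(x, y)
print(f"y_hat = {b0:.1f} + {b1:.1f} x")  # prints: y_hat = 2.2 + 0.6 x
```

For this sample, $\bar{x} = 3$ and $\bar{y} = 4$, so $b_1 = 6/10 = 0.6$ and $b_0 = 4 - 0.6(3) = 2.2$.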

## 14.3 Coefficient of Determination

The **coefficient of determination** provides a measure of the goodness of the fit for an estimated regression equation.

For the *i*th observation, the difference between the observed value of the dependent variable ($y_i$) and the predicted value of the dependent variable ($\hat{y}_i$) is called the ***i*th residual** ($y_i - \hat{y}_i$). The *i*th residual represents the error in using $\hat{y}_i$ to estimate $y_i$. The sum of squares of these residuals, or errors, is the quantity that is minimized by the least squares method. This quantity is known as the **sum of squares due to error (SSE)**.

$$
SSE = \Sigma (y_i - \hat{y}_i)^2
$$

The value of SSE is a measure of the error in using the estimated regression equation to predict the values of the dependent variable in the sample.

Now suppose we had to predict $y$ without using $x$ at all. The natural estimate for every observation would then be the sample mean $\bar{y}$. The error involved in using $\bar{y}$ to predict each $y_i$ is measured by the **total sum of squares (SST)**.

$$
SST = \Sigma (y_i - \bar{y})^2
$$

Note that we can think of SSE as a measure of how well the observations cluster about the estimated regression line $ y=\hat{y} $ and SST as a measure of how well the observations cluster about the horizontal line $ y=\bar{y} $.

The **sum of squares due to regression (SSR)** is a measure of how much the values of $\hat{y}$ deviate from $\bar{y}$:

$$
SSR = \Sigma (\hat{y}_i - \bar{y})^2
$$

These three values are related as follows:

$$
SST = SSR + SSE
$$

The **coefficient of determination ($r^2$)** measures the goodness of fit for the estimated regression equation. Note that $r^2$ will always have a value between 0 and 1.

$$
r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
$$
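
The three sums of squares and $r^2$ can be checked numerically. The sketch below uses a made-up five-point sample and hypothetical function names. Note that the identity $SST = SSR + SSE$ relies on $b_0$ and $b_1$ being the least squares estimates, so the block computes them first.

```python
def least_squares(x, y):
    """Least squares estimates (b0, b1) for y_hat = b0 + b1 * x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


def sums_of_squares(x, y, b0, b1):
    """Return (SSE, SST, SSR) for the fitted line y_hat = b0 + b1 * x."""
    y_bar = sum(y) / len(y)
    y_hat = [b0 + b1 * xi for xi in x]
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # error
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # regression
    return sse, sst, ssr


# Made-up sample, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares(x, y)
sse, sst, ssr = sums_of_squares(x, y, b0, b1)
r_squared = ssr / sst

print(f"SSE = {sse:.2f}, SST = {sst:.2f}, SSR = {ssr:.2f}")
print(f"r^2 = {r_squared:.2f}")
```

Here $SSE = 2.4$, $SSR = 3.6$, and $SST = 6$, so $r^2 = 3.6/6 = 0.6$: the estimated regression equation accounts for 60% of the total sum of squares in this sample.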

### Correlation Coefficient

The **correlation coefficient ($r_{xy}$)** is a descriptive measure of the strength of linear association between two variables $x$ and $y$. The values of $r_{xy}$ lie between -1 and 1: negative values indicate a negative linear relationship between $x$ and $y$, and positive values indicate a positive linear relationship.

$$
r_{xy} = (\text{sign of } b_1)\sqrt{r^2}
$$
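
As a quick check, the sketch below computes $r_{xy}$ from the sign of $b_1$ and $r^2$, then compares it with the usual sample correlation formula $\Sigma (x_i - \bar{x})(y_i - \bar{y}) / \sqrt{\Sigma (x_i - \bar{x})^2 \, \Sigma (y_i - \bar{y})^2}$. That direct formula is not part of these notes and is included only as a cross-check. The function names, the sample, and the values $b_1 = 0.6$ and $r^2 = 0.6$ (the least squares results for that sample) are illustrative.

```python
import math


def correlation_from_fit(b1, r_squared):
    """r_xy = (sign of b1) * sqrt(r^2)."""
    sign = -1.0 if b1 < 0 else 1.0  # when b1 is 0, r^2 is 0 and the sign is irrelevant
    return sign * math.sqrt(r_squared)


def sample_correlation(x, y):
    """Direct sample correlation, used here only as a cross-check."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Made-up sample; b1 = 0.6 and r^2 = 0.6 are its least squares results.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_from_fit = correlation_from_fit(b1=0.6, r_squared=0.6)
r_direct = sample_correlation(x, y)

print(f"from fit: {r_from_fit:.4f}, direct: {r_direct:.4f}")
```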

## 14.4 Model Assumptions

The tests of significance in regression analysis are based on the following assumptions about the error term $\epsilon$:

1. $E(\epsilon) = 0$. This is necessary so that taking the expected value of the simple linear regression model $y = \beta _0 + \beta _1 x + \epsilon$ gives the regression equation $E(y) = \beta _0 + \beta _1 x$.

2. The variance of $\epsilon$, denoted by $\sigma^2$, is the same for all values of $x$.

3. The values of $\epsilon$ are independent.

4. The error term $\epsilon$ is a normally distributed random variable for all values of $x$.

Note that linear regression also assumes that the relationship between the two variables is in fact linear.
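
Because $\epsilon$ is never observed directly, these assumptions are usually examined through the residuals $y_i - \hat{y}_i$. The sketch below is one rough, informal way to look at assumption 2: it compares the mean squared residual in the lower half of the $x$ values with that in the upper half. The data and function names are hypothetical, and this is an eyeball check under those assumptions, not a formal test.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for y_hat = b0 + b1 * x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


def residuals(x, y):
    """Residuals y_i - y_hat_i from the least squares line."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def spread_by_half(x, res):
    """Mean squared residual for the lower-x half and the upper-x half."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    half = len(order) // 2
    low = [res[i] for i in order[:half]]
    high = [res[i] for i in order[half:]]
    return (sum(r * r for r in low) / len(low),
            sum(r * r for r in high) / len(high))


# Hypothetical data, made up for this example.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8]

res = residuals(x, y)
low_spread, high_spread = spread_by_half(x, res)

# The residuals of a least squares line with an intercept always sum to
# (numerically) zero, so this sum says nothing about assumption 1.
print("sum of residuals:", round(sum(res), 10))
print("mean squared residual, low x vs high x:", low_spread, high_spread)
```

If the two spreads differ greatly, the constant variance assumption deserves a closer look, for example with a plot of the residuals against $x$.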

## 14.5 Testing for Significance