# Regression

Regression analysis is a group of statistical methods that estimate the relationship between a dependent variable and one or more independent variables. The most frequently used form is linear regression, which requires finding a line of “best fit” that most closely traces the data according to a specific mathematical criterion.

We can use regression analysis to answer a wide variety of questions, including the examples below.

* Is the relationship between two variables linear?

* Is there even a dependency between two variables?

* How strong is the relationship?

* Which variable contributes the most to the outcome measurement?

* How accurately can we predict future values?

* Is our outcome variable caused by another variable?


Regression analysis allows us to answer these questions by estimating a regression coefficient for every predictor variable used in a model. Let’s assume that we are analyzing a simple model with just one predictor and one outcome variable. This model is formally represented as follows: $Y = B_0 + B_1 x + \text{error}$, where $B_0$ is the intercept and $B_1$ is the slope.

The best-fit line is the one that minimizes the residual error, where the residual for observation $i$ is $Error_i = y_i - \hat y_i$. Specifically, linear regression minimizes the sum of squared errors, $SSE = \sum_i (y_i - \hat y_i)^2$.
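To make the criterion concrete, here is a minimal sketch that computes the least-squares slope and intercept for one predictor by hand, using the built-in `trees` data (Volume predicted from Girth). The variable names `b0`, `b1`, and `sse_other` are illustrative only; they mirror $B_0$ and $B_1$ from the model above.

```r
# Closed-form least-squares estimates for a single predictor
x <- trees$Girth
y <- trees$Volume

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                      # intercept

y_hat <- b0 + b1 * x          # fitted values on the best-fit line
sse   <- sum((y - y_hat)^2)   # sum of squared errors for that line

# A line with a slightly different slope (an arbitrary alternative)
# gives a larger SSE than the least-squares line
sse_other <- sum((y - (b0 + (b1 + 0.1) * x))^2)

c(intercept = b0, slope = b1, SSE = sse, SSE_other = sse_other)
```

The hand-computed `b0` and `b1` should agree with the coefficients that `lm()` reports for the same model.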

## Fitting the Model

### Example

In [None]:
plot(trees)

In [None]:
# fit a simple linear regression predicting Volume from Girth
m1 <- lm(Volume ~ Girth, data = trees)

In [None]:
coefficients(m1)

In [None]:
fitted(m1)

In [None]:
# pay attention to the Pr(>|t|) column, adjusted R-squared, and the p-value associated with the F-statistic
summary(m1)

In [None]:
# predicted Volume for a tree with Girth = 2.1: slope * 2.1 + intercept
coefficients(m1)[[2]] * 2.1 + coefficients(m1)[[1]]
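The manual calculation above can also be written with `predict()`. The sketch below refits the same model so that it runs on its own, then checks that both approaches give the same number. It also prints the observed range of `Girth`, since a prediction far outside that range is an extrapolation and should be read with caution.

```r
# Refit the model so this cell is self-contained
m1 <- lm(Volume ~ Girth, data = trees)

# Manual prediction: slope * x + intercept
manual <- coefficients(m1)[[2]] * 2.1 + coefficients(m1)[[1]]

# Same prediction with predict(); the column name in newdata
# must match the predictor name used in the formula
via_predict <- predict(m1, newdata = data.frame(Girth = 2.1))

all.equal(manual, unname(via_predict))

# Observed range of the predictor, to judge whether 2.1 is an extrapolation
range(trees$Girth)
```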

### Question 1

Load the dataset `iris`.

A) Perform a linear regression where you aim to predict the petal width from the petal length.

B) Based on your results, what would be the predicted petal width of an iris whose petal length is 2.3?

## Residual Diagnostics

One of the assumptions of a regression analysis is that the distribution of the errors is normal. Let's look at the residuals.

In [None]:
residuals(m1)

In [None]:
plot(m1)

In [None]:
# we're interested in the p-value: a small value (e.g. below 0.05) is evidence against normality
shapiro.test(residuals(m1))
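As a sketch of how the test result might be read programmatically, the cell below stores the Shapiro-Wilk result, compares its p-value to a significance level, and draws a normal Q-Q plot as a visual check. The threshold `alpha = 0.05` is an assumed convention, not something fixed by the test itself.

```r
# Refit the model so this cell is self-contained
m1  <- lm(Volume ~ Girth, data = trees)
res <- residuals(m1)

sw    <- shapiro.test(res)   # null hypothesis: residuals are normally distributed
alpha <- 0.05                # assumed significance level

# TRUE means we fail to reject normality at the chosen level
sw$p.value > alpha

# Visual check: points close to the line suggest approximately normal residuals
qqnorm(res)
qqline(res)
```

Failing to reject the null hypothesis does not prove that the residuals are normal; it only means the test found no strong evidence against it.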

### Question 1 (cont'd)

C) Plot a histogram of the residuals of the analysis from Question 1 and perform a Shapiro-Wilk normality test. Is the assumption of normality met?

## Model Fit

In [None]:
m1h <- lm(Volume ~ Height, data = trees)
m2  <- lm(Volume ~ Girth + Height, data = trees)

**Using adjusted R-squared**

In [None]:
summary(m1)

In [None]:
summary(m1h)

In [None]:
summary(m2)
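Rather than reading adjusted R-squared off each printed summary, the values can be collected side by side. This is a minimal sketch that refits the three models so it runs on its own; the list name `models` and the hand-computed check `manual_m2` are illustrative.

```r
# Refit the three models so this cell is self-contained
m1  <- lm(Volume ~ Girth, data = trees)
m1h <- lm(Volume ~ Height, data = trees)
m2  <- lm(Volume ~ Girth + Height, data = trees)

models <- list(m1 = m1, m1h = m1h, m2 = m2)

# Pull the adjusted R-squared out of each model summary
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
sort(adj_r2, decreasing = TRUE)

# Check for m2 by hand: 1 - (1 - R^2) * (n - 1) / (n - p - 1), with p = 2 predictors
n  <- nrow(trees)
r2 <- summary(m2)$r.squared
manual_m2 <- 1 - (1 - r2) * (n - 1) / (n - 2 - 1)
all.equal(manual_m2, unname(adj_r2["m2"]))
```

Unlike plain R-squared, the adjusted version penalizes extra predictors, so it can be compared across models with different numbers of terms.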

**Using ANOVA**

In [None]:
print(anova(m1))

In [None]:
print(anova(m1h))

In [None]:
print(anova(m2))
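When one model is nested inside another, `anova()` can also compare the two directly with an F-test. The sketch below, which refits the models so it is self-contained, tests whether adding `Height` to the Girth-only model significantly reduces the residual sum of squares. The name `p_height` is illustrative.

```r
# Refit the nested pair of models so this cell is self-contained
m1 <- lm(Volume ~ Girth, data = trees)
m2 <- lm(Volume ~ Girth + Height, data = trees)

# F-test comparing the smaller model (m1) against the larger one (m2)
cmp <- anova(m1, m2)
print(cmp)

# p-value for the added term; a small value favors keeping Height
p_height <- cmp[["Pr(>F)"]][2]
p_height
```

Because only one term is added here, this p-value matches the one reported for `Height` in `summary(m2)`.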

**Using AIC**

In AIC model selection, we compare the AIC value of each model and choose the one with the lowest value. A lower AIC means the model loses less information, balancing goodness of fit against the number of parameters.

In [None]:
AIC(m1)

In [None]:
AIC(m1h)

In [None]:
AIC(m2)
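The three AIC values can also be placed in a single table, and the value itself can be reproduced from the log-likelihood as $AIC = -2 \log L + 2k$, where $k$ is the number of estimated parameters. The sketch below refits the models so it runs on its own; `aic_table`, `manual_aic`, and `best` are illustrative names.

```r
# Refit the three models so this cell is self-contained
m1  <- lm(Volume ~ Girth, data = trees)
m1h <- lm(Volume ~ Height, data = trees)
m2  <- lm(Volume ~ Girth + Height, data = trees)

# One table with the parameter count (df) and AIC of each model
aic_table <- AIC(m1, m1h, m2)
aic_table

# Reproduce the AIC of m2 by hand from its log-likelihood
ll <- logLik(m2)
k  <- attr(ll, "df")   # estimated parameters, including the error variance
manual_aic <- -2 * as.numeric(ll) + 2 * k
all.equal(manual_aic, AIC(m2))

# Name of the model with the lowest AIC
best <- rownames(aic_table)[which.min(aic_table$AIC)]
best
```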

### Question 2

Load the dataset `mtcars`.

A) Perform a linear regression where you aim to predict a car's displacement (`disp`) from both its horsepower (`hp`) and weight (`wt`).

B) Check whether it is better to use only one predictor (single-variable regression) instead of both hp and weight, and support your answer with the appropriate test.

C) Find the 95% bootstrap confidence intervals for the regression coefficients. 

### Question 3

What does the F-statistic tell you?

### Question 4

What does the AIC tell you?


### Question 5

**BONUS:** Why is it commonplace to take the sum of squared residuals instead of just the residual
(not squared)? What advantage(s) does that have?
