# Assessing model fit

## Quantifying model fit

#### Coefficient of determination (R-squared)

The coefficient of determination measures how well a linear regression line fits the observed values. It is defined as the proportion of the variance in the response variable that is predictable from the explanatory variable.

For simple linear regression, the coefficient of determination is simply **the correlation between the explanatory and response variables, squared**.

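It takes values from 0 to 1.
* A coefficient of determination equal to 0 means the model explains none of the variance in the response variable: it does no better than always predicting the mean.
* A coefficient of determination equal to 1 corresponds to a perfect model, where every prediction matches the observed value.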
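The definition can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `r_squared` and the data are made up for illustration. It computes the proportion of explained variance from a least-squares line and compares it with the squared correlation.

```python
import numpy as np

def r_squared(x, y):
    """Proportion of the variance in y explained by a least-squares line on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # coefficients, highest degree first
    fitted = intercept + slope * x
    ss_residual = np.sum((y - fitted) ** 2)     # variation left unexplained
    ss_total = np.sum((y - y.mean()) ** 2)      # total variation around the mean
    return 1.0 - ss_residual / ss_total

# Made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

r2 = r_squared(x, y)
correlation = np.corrcoef(x, y)[0, 1]
print(r2, correlation ** 2)  # the two values agree up to rounding
```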

#### Residual standard error (RSE)

Residual standard error ($\text{RSE}$) is a measure of the typical size of the residuals. Equivalently, it's a measure of how wrong you can expect predictions to be.

$\displaystyle{\text{RSE} = \sqrt{\frac{1}{\eta}\sum_{i=1}^{n}(y_i - \bar{y}_i)^2}}$ , 

where $\bar{y}_i$ and $y_i$ are the predicted and observed values, respectively,

and $\eta$ is the number of degrees of freedom: the number of observations minus the number of model coefficients. 



The $\text{RSE}$ has the same units as the response variable.
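As a small worked example of the formula, here is a sketch assuming NumPy; the function name `residual_standard_error` and the values are hypothetical. With residuals of $1, -1, 1, -1$ and two model coefficients, the sum of squares is $4$ and $\eta = 2$.

```python
import numpy as np

def residual_standard_error(y, fitted, n_coefficients):
    """Square root of the residual sum of squares over the degrees of freedom."""
    y = np.asarray(y, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    degrees_of_freedom = y.size - n_coefficients
    return np.sqrt(np.sum((y - fitted) ** 2) / degrees_of_freedom)

# Hypothetical observed values and predictions, for illustration only
y = np.array([3.0, 5.0, 7.0, 9.0])
fitted = np.array([2.0, 6.0, 6.0, 10.0])

# A simple linear regression has two coefficients: intercept and slope
rse = residual_standard_error(y, fitted, n_coefficients=2)
print(rse)  # sqrt(4 / 2), about 1.414
```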

#### Mean squared error (MSE)

The mean squared error is another measure of the accuracy of a linear regression model. In terms of the $\text{RSE}$, it is defined as

$\displaystyle{\text{MSE} = \text{RSE}^2}$,

$\displaystyle{\text{MSE} = \frac{1}{\eta}\sum_{i=1}^{n}(y_i - \bar{y}_i)^2}$.

#### Root-mean-square error (RMSE)

The root-mean-square error is similar to the $\text{RSE}$, but it divides by the number of observations $n$ instead of the degrees of freedom $\eta$:

$\displaystyle{\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y}_i)^2}}$.

To compare the accuracy of different models, you should generally use the $\text{RSE}$ rather than the $\text{RMSE}$, since the $\text{RSE}$ accounts for the number of coefficients in each model.
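To see how the three metrics relate, here is a minimal sketch assuming NumPy; the helper `error_metrics` and the values are hypothetical. It follows the definitions in these notes, with the $\text{MSE}$ and $\text{RSE}$ dividing by $\eta$ and the $\text{RMSE}$ dividing by $n$.

```python
import numpy as np

def error_metrics(y, fitted, n_coefficients):
    """Return MSE, RSE and RMSE as defined in these notes."""
    residuals = np.asarray(y, dtype=float) - np.asarray(fitted, dtype=float)
    sum_of_squares = np.sum(residuals ** 2)
    n = residuals.size
    degrees_of_freedom = n - n_coefficients
    mse = sum_of_squares / degrees_of_freedom
    return {
        "mse": mse,
        "rse": np.sqrt(mse),
        "rmse": np.sqrt(sum_of_squares / n),
    }

# Hypothetical observed values and predictions, for illustration only
y = np.array([3.0, 5.0, 7.0, 9.0])
fitted = np.array([2.0, 6.0, 6.0, 10.0])

metrics = error_metrics(y, fitted, n_coefficients=2)
print(metrics)  # mse = 2.0, rse about 1.414, rmse = 1.0
```

Because $n > \eta$, the $\text{RMSE}$ is always smaller than the $\text{RSE}$ for the same model, and the gap shrinks as the number of observations grows.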

## Visualizing the model fit

#### Residuals vs. fitted values

If a linear regression model is a good fit, then the residuals are approximately normally distributed, with mean zero.

This plot is useful for identifying whether the residuals tend to grow or shrink as the fitted values increase. If the model is good, the residuals should be scattered around zero and should show no tendency to increase or decrease.
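The same check can be done numerically, without drawing the plot. The sketch below assumes NumPy; the helper `binned_residual_means` and the synthetic data are made up for illustration. It sorts the residuals by fitted value, splits them into a few bins, and averages each bin, which is a rough stand-in for the trend line usually drawn on this plot.

```python
import numpy as np

def binned_residual_means(x, y, n_bins=3):
    """Mean residual in each bin of fitted values, from lowest to highest."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    residuals = y - fitted
    order = np.argsort(fitted)                       # sort by fitted value
    groups = np.array_split(residuals[order], n_bins)
    return np.array([group.mean() for group in groups])

x = np.linspace(0.0, 10.0, 30)
linear_y = 2.0 + 3.0 * x   # a truly linear relationship
curved_y = x ** 2          # a curved relationship forced onto a line

linear_means = binned_residual_means(x, linear_y)
curved_means = binned_residual_means(x, curved_y)
print(linear_means)  # all approximately zero
print(curved_means)  # positive, negative, positive: a U-shaped pattern
```

A pattern like the second one suggests that a straight line is missing some structure in the data.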

#### Q-Q plot

This kind of plot shows how well the residuals fit a normal distribution. If they do, the points lie close to a straight line; for standardized residuals, that line has a slope of $1$ and an intercept of $0$.

On the x-axis, the points correspond to quantiles from the normal distribution. On the y-axis, the points correspond to the quantiles derived from your dataset.
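The coordinates of a Q-Q plot can be computed by hand. This is a minimal sketch, assuming NumPy and the standard-library `statistics.NormalDist`; the helper `qq_points`, the plotting positions $(i - 0.5)/n$, and the simulated residuals are assumptions made for illustration, and plotting libraries may use slightly different conventions.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Theoretical normal quantiles and sorted standardized residuals."""
    residuals = np.asarray(residuals, dtype=float)
    standardized = (residuals - residuals.mean()) / residuals.std(ddof=1)
    n = standardized.size
    probabilities = (np.arange(1, n + 1) - 0.5) / n  # one common choice of plotting positions
    standard_normal = NormalDist()
    theoretical = np.array([standard_normal.inv_cdf(p) for p in probabilities])
    sample = np.sort(standardized)
    return theoretical, sample

# Simulated residuals drawn from a normal distribution, for illustration only
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=2.0, size=500)

theoretical, sample = qq_points(residuals)
slope, intercept = np.polyfit(theoretical, sample, deg=1)
print(slope, intercept)  # close to 1 and 0 when the residuals are normal
```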

#### Scale-location plot

This plot shows the square root of the absolute value of the standardized residuals versus the fitted values. It is useful for identifying whether the size of the residuals stays randomly distributed as the fitted values increase. That is the expected behavior if the data fit a linear regression model.
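The quantities on both axes can be computed directly. The sketch below assumes NumPy and a simple linear regression; the helper `scale_location_points` and the simulated data are hypothetical, and the residuals are standardized by dividing by the residual standard error times $\sqrt{1 - h_i}$, where $h_i$ is the leverage of observation $i$.

```python
import numpy as np

def scale_location_points(x, y):
    """Fitted values and sqrt(|standardized residuals|) for a simple linear fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    residuals = y - fitted
    centered = x - x.mean()
    leverage = 1.0 / n + centered ** 2 / np.sum(centered ** 2)
    rse = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # two coefficients
    standardized = residuals / (rse * np.sqrt(1.0 - leverage))
    return fitted, np.sqrt(np.abs(standardized))

# Simulated data whose noise grows with x, so the residual size is NOT constant
rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3 * x)

fitted, root_abs_standardized = scale_location_points(x, y)
low = root_abs_standardized[:100].mean()    # lower half of the fitted range
high = root_abs_standardized[100:].mean()   # upper half of the fitted range
print(low, high)  # the upper half shows larger values
```

On a scale-location plot, data like these would show an upward trend instead of a flat band.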

## Outliers, leverage and influence

#### Outliers

Once we have created a linear model for our dataset, there are two ways to identify outliers within it:

* Find **extreme values** for the **explanatory variable**.
* Find the **points far away** from the **model line**.    

#### Leverage

Leverage measures **how unusual or extreme the explanatory variables are** for each observation. **High leverage** means that the explanatory variable has **values that are different** from other points in the dataset.

In the case of **simple linear regression**, where there is only one explanatory variable, this typically means observations with a **very high or very low explanatory value**.
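For simple linear regression, the leverage of observation $i$ has a closed form, $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}$, where $\bar{x}$ is the mean of the explanatory variable. A minimal sketch, assuming NumPy and with made-up data (the helper name `leverage_simple` is hypothetical):

```python
import numpy as np

def leverage_simple(x):
    """Leverage of each observation in a simple linear regression on x."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return 1.0 / x.size + centered ** 2 / np.sum(centered ** 2)

# Made-up explanatory values; the last one sits far from the rest
x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])

leverage = leverage_simple(x)
print(leverage)        # the last observation has by far the largest leverage
print(leverage.sum())  # sums to 2, the number of model coefficients
```

Note that leverage depends only on the explanatory variable, not on the response.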

#### Influence

Influence measures **how much a model would change if each observation were left out of the model** calculations, one at a time. That is, it measures how different the prediction line would look if you ran a linear regression on all data points except that point, compared to running a linear regression on the whole dataset.

The **standard metric for influence** is **Cook's distance**, which calculates influence **based on the residual size and the leverage of the point**.
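Both views of influence can be compared in code. The sketch below assumes NumPy and a simple linear regression; the function names and the data are hypothetical. The first function uses the usual closed form of Cook's distance, $D_i = \frac{e_i^2}{p\,s^2}\cdot\frac{h_i}{(1-h_i)^2}$, with residual $e_i$, leverage $h_i$, $p$ model coefficients and $s^2$ the squared residual standard error. The second refits the model with each observation left out and measures how much the fitted values move.

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance from residual size and leverage (simple linear regression)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = x.size, 2                                  # p: intercept and slope
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (intercept + slope * x)
    centered = x - x.mean()
    leverage = 1.0 / n + centered ** 2 / np.sum(centered ** 2)
    s_squared = np.sum(residuals ** 2) / (n - p)
    return residuals ** 2 / (p * s_squared) * leverage / (1.0 - leverage) ** 2

def cooks_distance_leave_one_out(x, y):
    """Same quantity, computed by refitting without each observation in turn."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = x.size, 2
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    s_squared = np.sum((y - fitted) ** 2) / (n - p)
    distances = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                      # drop observation i
        slope_i, intercept_i = np.polyfit(x[keep], y[keep], deg=1)
        fitted_without_i = intercept_i + slope_i * x  # predictions for ALL points
        distances[i] = np.sum((fitted - fitted_without_i) ** 2) / (p * s_squared)
    return distances

# Made-up data: the last point has an extreme x and lies off the trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 12.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 6.0])

from_formula = cooks_distance(x, y)
from_refits = cooks_distance_leave_one_out(x, y)
print(from_formula)
print(from_refits)  # matches the closed form up to rounding
```

The last observation combines high leverage with a sizeable residual, so it has the largest Cook's distance.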
