So far, we've learned how to build linear regression models and interpret estimated coefficients. In this checkpoint, we discuss how we evaluate model performance in the training phase. Recall that there are two contexts where we care about performance: in relation to the training set and in relation to a test set. The former enables us to talk about how well our model explains the information in the target variable, while the latter tells us how well our model will perform when it's given previously unseen observations.

In this checkpoint, we'll go over concepts like **F-tests** and **R-squared**. F-tests allow us to compare our model to a reduced model with no features. R-squared and **adjusted R-squared** (which is a variant of R-squared) values tell us how well the model accounts for variance in the target.

Last, we'll see how we can compare different models in terms of their explanatory power. We show how to read **Akaike** and **Bayesian** information criteria for this purpose.

**Key topics**

* training and test data
* evaluating training performance
* F-tests
* degrees of freedom
* R-squared
* Akaike information criterion
* Bayesian information criterion

At the end of this checkpoint, you'll work through two assignments where you'll evaluate the performance of your weather and house prices models.

## Is our model better than an "empty" model?

When evaluating our model, we first need to ask whether our model contributes anything to the explanation of the outcome variable. In other words, we need to determine whether or not our features explain variance in the outcome. If not, we could drop our features altogether and the resulting "empty" model would perform equally well (which is to say, not very well!).

For this purpose, we use an **F-test**.

####  F-tests

F-tests can be calculated in different ways depending on the situation, but, in general, they compare the unexplained variance of a model with the unexplained variance of a reduced model. Here, the "reduced model" is a model with no features, meaning all variance in the outcome is unexplained. For a linear regression model with two parameters, $y=\alpha+\beta x$, the F-test is built from these pieces:

* unexplained model variance:

$$SSE_F=\sum(y_i-\hat{y}_i)^2$$

* unexplained variance in reduced model:

$$SSE_R=\sum(y_i-\bar{y})^2$$

 * number of parameters in the model:

$$p_F = 2 (\alpha \text{ and } \beta)$$

 * number of parameters in the reduced model:

$$p_R = 1 (\alpha)$$

 * number of observations:

$$n$$

 * degrees of freedom of $SSE_F$:

$$df_F = n - p_F$$

 * degrees of freedom of $SSE_R$:

$$df_R = n - p_R$$

These pieces come together to give us the full equation for the F-test:

$$F=\dfrac{SSE_R-SSE_F}{df_R-df_F}\div\dfrac{SSE_F}{df_F}$$

This introduces some new terminology. **Degrees of freedom** quantifies the amount of information "left over" to estimate variability after all parameters are estimated.

In regression, degrees of freedom for a function works like this:  With two data points, a regression line $y=\alpha + \beta x$ has 0 degrees of freedom (2 minus the number of parameters). Those two parameters encompass all the information in the data. Knowing $\alpha$ and $\beta$ alone, we can perfectly reproduce the original data. No additional information is available from the data itself. If we have 10 data points, then the model's degrees of freedom would be 8 (10 minus the number of parameters).

The F-test null hypothesis states that the model is indistinguishable from the reduced model, which means that the features contribute nothing to the explanation of the target variable. Instead of reading the F statistic, it's easier to read its associated p-value. The lower the p-value, the better for our model. Namely, if the p-value of the F-test for our model is less than or equal to 0.1 (or even less than or equal to 0.05), we say that our model is useful and contributes something that is statistically significant in the explanation of the target.
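
The calculation above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming synthetic data: the data, the random seed, and the variable names are made up for the example and are not part of the medical costs dataset.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y depends linearly on x plus noise.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# Full model y = alpha + beta * x, fitted by least squares.
X_full = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X_full, y, rcond=None)
sse_full = np.sum((y - X_full @ coef) ** 2)

# Reduced model: constant only, so every prediction is the mean of y.
sse_reduced = np.sum((y - y.mean()) ** 2)

# Number of parameters and degrees of freedom of each model.
p_full, p_reduced = 2, 1
df_full, df_reduced = n - p_full, n - p_reduced

# Both differences are written so that they are positive; the ratio is unchanged.
f_stat = ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)
print(f"F = {f_stat:.2f} on {df_reduced - df_full} and {df_full} degrees of freedom")
```

If SciPy is installed, the associated p-value should be available as `scipy.stats.f.sf(f_stat, df_reduced - df_full, df_full)`.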

Let's see the F statistic of our medical costs model:

In [5]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sqlalchemy import create_engine

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings(action="ignore")

In [6]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'medicalcosts'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

insurance_df = pd.read_sql_query('select * from medicalcosts',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()

insurance_df.head(10)

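|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.9 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.55 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.46 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.5 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.86 |
| 5 | 31 | female | 25.74 | 0 | no | southeast | 3756.62 |
| 6 | 46 | female | 33.44 | 1 | no | southeast | 8240.59 |
| 7 | 37 | female | 27.74 | 3 | no | northwest | 7281.51 |
| 8 | 37 | male | 29.83 | 2 | no | northeast | 6406.41 |
| 9 | 60 | female | 25.84 | 0 | no | northwest | 28923.1 |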


In [7]:
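# Encode sex and smoker as 0/1 dummies. dtype=int keeps the columns numeric,
# because recent pandas versions return booleans by default.
insurance_df["is_male"] = pd.get_dummies(insurance_df.sex, drop_first=True, dtype=int)
insurance_df["is_smoker"] = pd.get_dummies(insurance_df.smoker, drop_first=True, dtype=int)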

# Y is the target variable
Y = insurance_df['charges']

# X is the feature set
X = insurance_df[['is_male','is_smoker', 'age', 'bmi']]

# We add constant to the model as it's a best practice
# to do so every time!
X = sm.add_constant(X)

# We fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# We print the summary results.
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.747
Model:                            OLS   Adj. R-squared:                  0.747
Method:                 Least Squares   F-statistic:                     986.5
Date:                Wed, 19 Dec 2018   Prob (F-statistic):               0.00
Time:                        17:18:48   Log-Likelihood:                -13557.
No. Observations:                1338   AIC:                         2.712e+04
Df Residuals:                    1333   BIC:                         2.715e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -1.163e+04    947.267    -12.281      0.0

This model's F statistic is 986.5, and the associated p-value is very close to zero. This means that our features add some information to the reduced model and our model is useful in explaining charges.

However, F-tests don't quantify how much information our model contributes. This requires R-squared, which we discuss next.

## Quantifying the performance of a model on the training set

R-squared is probably the most common measure of goodness of fit in a linear regression model. It is a proportion (typically between 0 and 1) that expresses how much variance in the outcome variable is explained by the explanatory variables in the model. Generally speaking, higher $R^2$ values are better, up to a point. A low $R^2$ indicates that our model isn't explaining much information about the outcome, which means it will not give very good predictions. However, a very high $R^2$ can be a warning sign of overfitting. No dataset is a perfect representation of reality, so a model that perfectly fits our data ($R^2$ of 1 or close to 1) is likely to be biased by quirks in the data and will perform less well on the test set.
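
Using the sums of squares defined for the F-test, R-squared can be written as follows. This is the standard definition for a model that includes a constant, stated here with the symbols from the F-test section:

$$R^2 = 1 - \dfrac{SSE_F}{SSE_R} = 1 - \dfrac{\sum(y_i-\hat{y}_i)^2}{\sum(y_i-\bar{y})^2}$$

When the model's predictions are no better than the mean of the target, $SSE_F$ equals $SSE_R$ and $R^2$ is 0. When the predictions are perfect, $SSE_F$ is 0 and $R^2$ is 1.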

In the regression summary table above, we see that the R-squared value of our medical costs model is 0.747. This means that our model explains 74.7% of the variance in the charges, leaving 25.3% unexplained. We can conclude that there's still room for improvement. Let's refit the model from the previous checkpoint, which included the interaction of body mass index (BMI) and the is_smoker dummy:

In [8]:
# Y is the target variable
Y = insurance_df['charges']

# This is the interaction between bmi and smoking
insurance_df["bmi_is_smoker"] = insurance_df.bmi * insurance_df.is_smoker

# X is the feature set
X = insurance_df[['is_male','is_smoker', 'age', 'bmi', "bmi_is_smoker"]]

# We add constant to the model as it's a best practice
# to do so every time!
X = sm.add_constant(X)

# We fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# We print the summary results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.837
Model:                            OLS   Adj. R-squared:                  0.836
Method:                 Least Squares   F-statistic:                     1365.
Date:                Wed, 19 Dec 2018   Prob (F-statistic):               0.00
Time:                        17:18:50   Log-Likelihood:                -13265.
No. Observations:                1338   AIC:                         2.654e+04
Df Residuals:                    1332   BIC:                         2.657e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const         -2071.0750    840.644     -2.464

The R-squared of this model is 0.837, which is higher than our previous model's. This improvement indicates that the interaction of BMI and is_smoker explains some previously unexplained variance in the charges. 

As we said before, high R-squared values are generally desirable. However, in some cases, very high R-squared values indicate some potential problems with our model. Specifically:

* A very high R-squared value may be a sign of overfitting. If our model is too complex for the data, then it may overfit the training set and do a poor job on the test set. That said, there's no agreed-upon R-squared threshold for detecting overfitting. Instead, detecting it requires a comparison between performance on test and training data. **If our model performs significantly worse on the test set compared to the training set, then we should suspect overfitting**. We'll discuss how to evaluate linear regression models on the test set in the next checkpoint.

* R-squared is an inherently biased estimate of performance in the sense that the more explanatory variables we add to the model, the higher the R-squared value we get. This is so even if we include irrelevant variables such as noise or random data. To mitigate this problem, we usually use a metric called **adjusted R-squared** instead of R-squared. Adjusted R-squared does the same job as R-squared, but it is adjusted according to the number of features included in the model. Hence, **it's always safer to look at the adjusted R-squared value instead of the R-squared value**.
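
The second point can be illustrated with a small experiment. The sketch below assumes synthetic data and uses the common adjustment $1-(1-R^2)\frac{n-1}{n-p}$, where $p$ counts all coefficients including the constant. The helper name `fit_r_squared` and the noise columns are hypothetical and exist only for this example.

```python
import numpy as np

def fit_r_squared(X, y):
    """Return (R-squared, adjusted R-squared) for an OLS fit of y on X.

    X must already contain a column of ones for the constant.
    """
    n, p = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ coef) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)
    return r2, adj_r2

# Synthetic data (an assumption): y depends on x only.
rng = np.random.default_rng(42)
n = 100
x = rng.normal(size=n)
y = 3.0 + 1.5 * x + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x])
# Add ten columns of pure noise that are unrelated to y.
X_big = np.column_stack([X_small, rng.normal(size=(n, 10))])

r2_small, adj_small = fit_r_squared(X_small, y)
r2_big, adj_big = fit_r_squared(X_big, y)
print(f"1 feature:   R2 = {r2_small:.3f}, adjusted R2 = {adj_small:.3f}")
print(f"11 features: R2 = {r2_big:.3f}, adjusted R2 = {adj_big:.3f}")
```

R-squared cannot go down when the noise columns are added, because the larger model contains the smaller one. Adjusted R-squared carries a penalty for each extra coefficient, so it rises only when a new feature reduces the unexplained variance by more than the penalty.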

**A note on negative R-squared values**: It is possible to get negative R-squared values for some models. In general terms, if a model fits the data worse than a straight horizontal line at the mean of the target, then the R-squared value becomes negative. This usually happens when a constant is not included in the model. Getting a negative value for R-squared means that your model does very poorly in explaining the target.
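
A tiny hand-made example shows how this can happen. The five data points below are invented for illustration. R-squared is computed against the mean of the target, that is, as one minus the ratio of the model's squared errors to the squared deviations from the mean.

```python
import numpy as np

# Tiny hand-made dataset: y hovers around 10 and is unrelated to x.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([10.0, 11.0, 9.0, 10.0, 10.0])

# Fit y = beta * x with no constant, so the line is forced through the origin.
beta = np.sum(x * y) / np.sum(x * x)
y_hat = beta * x

sse = np.sum((y - y_hat) ** 2)      # unexplained by the no-constant model
sst = np.sum((y - y.mean()) ** 2)   # unexplained by a horizontal line at mean(y)
r2 = 1 - sse / sst
print(f"beta = {beta:.2f}, R-squared = {r2:.2f}")
```

Because the fitted line must pass through the origin, its predictions stay near 0 while the target stays near 10, so the model does far worse than simply predicting the mean. Note that statsmodels reports an uncentered R-squared for models without a constant, so its summary table would not show a negative value for this fit.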

## Comparing different models

Comparing different models and choosing the best one is one of the essential practices in data science. Often, we try several models and evaluate their performance on a test set in order to determine the top performing one. However, *inference* is also a critical task when it comes to linear regression models. Unlike testing the predictive power, in inference, we care about the explanatory power of our models.

Throughout this checkpoint, we saw that we can measure the performance of our models on the training set using F-test or R-squared. Hence, both F-test and R-squared can be used in the comparison of different models. Unfortunately, the two metrics suffer from some drawbacks that make them inappropriate to use in certain situations.

Here, we briefly outline how we can use F-tests and R-squared in model comparison. Then, we introduce information criteria that we can also use to compare different models.

#### Using F-tests for model comparison

We can use an F-test to compare two models if one of them is nested within the other, that is, if the feature set of one model is a subset of the feature set of the other. In this case, the smaller model plays the role of the reduced model, and the F-test checks whether the additional features in the larger model explain a statistically significant amount of extra variance. A low p-value favors the larger model.

However, if models are not nested, then using an F-test may be misleading. F-tests are quite sensitive to the normality of the error terms. If errors are not normally distributed, we should try other methods.
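
For two nested models, the F statistic can be computed with the same formula as before, using the smaller model in place of the "empty" one. The sketch below assumes synthetic data. The helper `sse_and_df` and the features `x1` and `x2` are hypothetical names used only for this example.

```python
import numpy as np

def sse_and_df(X, y):
    """Return the sum of squared errors and residual degrees of freedom of an OLS fit."""
    n, p = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ coef) ** 2), n - p

# Synthetic data (an assumption): y depends on both x1 and x2.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

# Smaller model: constant and x1. Larger model: constant, x1, and x2.
X_small = np.column_stack([np.ones(n), x1])
X_large = np.column_stack([np.ones(n), x1, x2])

sse_small, df_small = sse_and_df(X_small, y)
sse_large, df_large = sse_and_df(X_large, y)

# Drop in unexplained variance per added parameter, relative to the
# larger model's unexplained variance per degree of freedom.
f_stat = ((sse_small - sse_large) / (df_small - df_large)) / (sse_large / df_large)
print(f"F = {f_stat:.2f} on {df_small - df_large} and {df_large} degrees of freedom")
```

With fitted statsmodels results, the same comparison should be available through `larger_results.compare_f_test(smaller_results)`, which returns the F statistic, its p-value, and the difference in degrees of freedom. The two variable names here are placeholders for your own fitted models.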

#### Using R-squared for model comparison

R-squared can also be used. We already saw that R-squared is biased as it tends to increase with the number of explanatory variables. So, instead of R-squared, we can use adjusted R-squared. The higher the adjusted R-squared, the better the model explains the target variable.

#### Using information criteria

Using information criteria is also a common way of comparing different models and selecting the best one. Here, we talk about two information criteria known as the **Akaike Information Criterion (AIC)** and **Bayesian Information Criterion (BIC)**. Both take into consideration the sum of the squared errors (SSE), the sample size, and the number of parameters.

The formula for AIC is:

$$AIC = n\ln(SSE) - n\ln(n) + 2p$$

The formula for BIC is:

$$BIC = n\ln(SSE) - n\ln(n) + p\ln(n)$$

In both of these formulas, $n$ represents the sample size, and $p$ represents the number of regression coefficients in the model (including the constant). $\ln$ stands for the natural logarithm.

For both AIC and BIC, the lower the value the better. Hence, we choose the model with the lowest AIC or BIC value. Although we can use either of the two criteria, AIC is usually criticized for its tendency to overfit. In contrast, BIC penalizes the number of parameters more severely than AIC and hence favors more parsimonious models (that is, models with fewer parameters).
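
The two formulas can be sketched directly in NumPy. The example below assumes synthetic data in which the second model adds a feature that is pure noise. The helper `information_criteria` and the variable names are hypothetical.

```python
import numpy as np

def information_criteria(X, y):
    """Return (AIC, BIC) using the SSE-based formulas from this checkpoint."""
    n, p = X.shape  # p counts every coefficient, including the constant
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ coef) ** 2)
    base = n * np.log(sse) - n * np.log(n)
    return base + 2 * p, base + p * np.log(n)

# Synthetic data (an assumption): y depends on x only.
rng = np.random.default_rng(7)
n = 150
x = rng.normal(size=n)
noise_feature = rng.normal(size=n)
y = 2.0 + 1.0 * x + rng.normal(size=n)

X_a = np.column_stack([np.ones(n), x])                 # constant + x
X_b = np.column_stack([np.ones(n), x, noise_feature])  # adds an irrelevant feature

aic_a, bic_a = information_criteria(X_a, y)
aic_b, bic_b = information_criteria(X_b, y)
print(f"Model A: AIC = {aic_a:.2f}, BIC = {bic_a:.2f}")
print(f"Model B: AIC = {aic_b:.2f}, BIC = {bic_b:.2f}")
```

The extra parameter costs 2 under AIC and $\ln(n)$ under BIC, so BIC is the stricter criterion whenever $n$ is larger than about 7. The values that statsmodels prints are computed from the log-likelihood and will not match these numbers exactly. For models fitted to the same data, the two versions differ by a constant, so they rank the models in the same order.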

## Which medical costs model is better?

statsmodels' `summary()` function gives us all of the above metrics. In the tables above, we see that for our first model, R-squared is 0.747, adjusted R-squared is 0.747, the F statistic is 986.5, AIC is 27,120, and BIC is 27,150. For our second model, R-squared is 0.837, adjusted R-squared is 0.836, the F statistic is 1365, AIC is 26,540, and BIC is 26,570. According to all of the metrics, our second model seems better than the first one.
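
Instead of reading the numbers off the printed tables, the same metrics can be pulled from the fitted results as attributes. The sketch below uses stand-in objects filled with the rounded values from the two summary tables so that it runs without a database connection. The helper `fit_metrics` is a hypothetical name. In the notebook, you would pass the real fitted results instead, stored under separate variable names.

```python
from types import SimpleNamespace

def fit_metrics(results):
    """Collect goodness-of-fit metrics from a fitted statsmodels OLS results object."""
    return {
        "R-squared": results.rsquared,
        "Adj. R-squared": results.rsquared_adj,
        "F-statistic": results.fvalue,
        "AIC": results.aic,
        "BIC": results.bic,
    }

# Stand-ins for the two fitted models, using the rounded values printed in the
# summary tables above. They only mimic the attributes that fit_metrics reads.
first_model = SimpleNamespace(
    rsquared=0.747, rsquared_adj=0.747, fvalue=986.5, aic=27120.0, bic=27150.0)
second_model = SimpleNamespace(
    rsquared=0.837, rsquared_adj=0.836, fvalue=1365.0, aic=26540.0, bic=26570.0)

metrics_1 = fit_metrics(first_model)
metrics_2 = fit_metrics(second_model)
for name, metrics in [("first", metrics_1), ("second", metrics_2)]:
    print(name, metrics)

# Lower is better for AIC and BIC; higher is better for adjusted R-squared.
second_is_better = (
    metrics_2["AIC"] < metrics_1["AIC"]
    and metrics_2["BIC"] < metrics_1["BIC"]
    and metrics_2["Adj. R-squared"] > metrics_1["Adj. R-squared"]
)
print("Second model preferred:", second_is_better)
```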

## Assignments

As in previous checkpoints, please submit links to two Jupyter notebooks (one for each assignment below).

This is not a graded checkpoint, but you should discuss your solutions with your mentor. Also, when you're done, compare your work to [these example solutions](https://github.com/Thinkful-Ed/machine-learning-regression-problems/blob/master/notebooks/5.solution_evaluating_goodness_of_fit.ipynb).



### 1. Weather model

For this assignment, you'll revisit the historical temperature dataset. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* First, load the dataset from the **weatherinszeged** table from Thinkful's database.
* Like in the previous checkpoint, build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why? 
* Next, add the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Now, what is the R-squared of this model? Does this model improve upon the previous one?
* Add *visibility* as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the improvement in adjusted R-squared that comes from the interaction term with the improvement that comes from *visibility*. Which one is more useful?
* Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.


###  2. House prices model

In this exercise, you'll work on your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Run your house prices model again and assess the goodness of fit of your model using F-test, R-squared, adjusted R-squared, AIC and BIC.
* Do you think your model is satisfactory? If so, why?
* In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables. 
* For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?