# Comparing regression models

We need to be able to compare our regression models in order to test our hypotheses, e.g.:
* Which is a better predictor of app **latency** / **throughput**: the **number of instances** on the machine or the **CPU usage** of the machine?
* Is it worth adding **CPU usage** to the model?
* What are the best features for predicting **latency** / **throughput**?

This notebook covers some measures that help to answer those questions, using data from experiment `linpack_12x20` as an example.

In [1]:
from helpers import (
    add_instances_n,
    draw_regression_graph,
    fit_regression,
    get_experiments_data,
    get_attach_indexes,
    get_data_with_cpu,
)

experiment_name = 'linpack_12x20'

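# get_data_with_cpu is consumed as an iterator of (experiment name, DataFrame) pairs; take the first one
exp_name, df = next(get_data_with_cpu(experiment_name, instances_n=12))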

df.head()

| | cbtool_time | cpu_time | app_latency | app_throughput | cpu | instances_n |
|---|---|---|---|---|---|---|
| 0 | 44.0 | 44.0 | | 32.8908 | 180.0 | 1.0 |
| 1 | 104.0 | 104.0 | | 33.4394 | 427.0 | 1.0 |
| 2 | 164.0 | 165.0 | | 33.2988 | 1244.0 | 1.0 |
| 3 | 224.0 | 225.0 | | 33.4386 | 2035.0 | 1.0 |
| 4 | 284.0 | 285.0 | | 33.4588 | 2034.0 | 1.0 |


## R-Squared

**R-squared** ($R^2$) is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model.

So, if the $R^2$ of a model is $0.50$, then approximately half of the observed variation can be explained by the model's inputs.
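The definition can be made concrete with a small sketch. It assumes synthetic data (not the experiment's data) and computes $R^2 = 1 - SS_{res} / SS_{tot}$ by hand for a one-variable least-squares fit; the variable names are illustrative only.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the experiment's data):
# the response falls roughly linearly with x, plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.repeat(np.arange(1, 13), 5).astype(float)
y = 35.0 - 1.0 * x + rng.normal(scale=1.0, size=x.size)

# Fit y = slope * x + intercept by ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)     # variation left unexplained by the model
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r_squared = 1.0 - ss_res / ss_tot

print(f"R-squared: {r_squared:.3f}")
```

For a simple linear regression with an intercept, this value equals the squared Pearson correlation between `x` and `y`.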

## Adjusted R-Squared

R-squared alone is a fair basis for comparison only between models with the same number of explanatory variables. Adding a predictor to a least-squares model never decreases R-squared, so a model with more terms may appear to fit better simply because it has more terms.

The **adjusted R-squared** compensates for this, which makes it suitable for comparing models with different numbers of predictors. It increases only if the new term improves the model more than would be expected by chance, and it decreases when the predictor improves the model less than that.

In an overfitted model, R-squared can be misleadingly high even though the model's ability to predict has actually decreased. The adjusted R-squared is less prone to this problem.
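As a sketch, the usual formula is $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of predictors (excluding the intercept). The function name and the input values below are illustrative assumptions.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Same R^2 and sample size, different numbers of predictors:
# the more predictors, the larger the downward adjustment.
for p in (1, 5, 20):
    print(p, round(adjusted_r_squared(0.928, n_obs=155, n_predictors=p), 4))
```

With many observations and few predictors the adjustment is small, which is why R-squared and adjusted R-squared can look almost identical in a summary table.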

## Akaike information criterion (AIC)

**Akaike information criterion (AIC)** is an estimator of out-of-sample prediction error and thereby relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.

AIC is founded on information theory. When a statistical model is used to represent the process that generated the data, the representation will almost never be exact, so some information is lost by using the model to represent the process. AIC estimates the relative amount of information lost by a given model: the less information a model loses, the higher the quality of that model. A **lower AIC value therefore indicates a better model**.

In estimating the amount of information lost by a model, AIC deals with the trade-off between the goodness of fit of the model and the simplicity of the model. In other words, AIC deals with both the risk of overfitting and the risk of underfitting.

## Bayesian information criterion (BIC)

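The formula for the **Bayesian information criterion (BIC)** is similar to the formula for AIC, but with a different penalty for the number of parameters: each parameter costs $\ln(n)$, where $n$ is the number of observations, instead of the constant $2$ used by AIC. BIC therefore penalizes additional parameters more heavily than AIC whenever $n \geq 8$. As with AIC, lower values are better.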
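Both criteria can be computed directly from a model's maximized log-likelihood $\ln \hat{L}$: $\mathrm{AIC} = 2k - 2\ln \hat{L}$ and $\mathrm{BIC} = k\ln(n) - 2\ln \hat{L}$. The sketch below assumes that $k$ counts every estimated coefficient including the intercept; the function names are illustrative, and the log-likelihood values are the ones reported in the OLS summaries of this notebook's example.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """AIC = 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """BIC = k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


n_obs = 155
# (log-likelihood, number of parameters including the intercept)
models = {
    "one predictor": (-341.02, 2),
    "two predictors": (-323.24, 3),
}

for name, (llf, k) in models.items():
    print(f"{name}: AIC={aic(llf, k):.1f}, BIC={bic(llf, k, n_obs):.1f}")
```

The extra parameter costs 2 points of AIC and about 5 points of BIC here, but the gain in log-likelihood outweighs both penalties.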

## Example

### Regression `app_throughput` ~ `instances_n`

In [2]:
results = fit_regression(data=df, formula='app_throughput ~ instances_n')
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:         app_throughput   R-squared:                       0.928
Model:                            OLS   Adj. R-squared:                  0.928
Method:                 Least Squares   F-statistic:                     1977.
Date:                Wed, 17 Jun 2020   Prob (F-statistic):           2.14e-89
Time:                        02:55:01   Log-Likelihood:                -341.02
No. Observations:                 155   AIC:                             686.0
Df Residuals:                     153   BIC:                             692.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      34.5753      0.332    104.068      

### Regression `app_throughput` ~ `cpu`

In [3]:
results = fit_regression(data=df, formula='app_throughput ~ cpu')
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:         app_throughput   R-squared:                       0.908
Model:                            OLS   Adj. R-squared:                  0.907
Method:                 Least Squares   F-statistic:                     1511.
Date:                Wed, 17 Jun 2020   Prob (F-statistic):           3.39e-81
Time:                        02:55:01   Log-Likelihood:                -360.14
No. Observations:                 155   AIC:                             724.3
Df Residuals:                     153   BIC:                             730.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     38.2653      0.462     82.792      0.0

### Regression `app_throughput` ~ `instances_n` + `cpu`

In [4]:
results = fit_regression(data=df, formula='app_throughput ~ instances_n + cpu')
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:         app_throughput   R-squared:                       0.943
Model:                            OLS   Adj. R-squared:                  0.942
Method:                 Least Squares   F-statistic:                     1255.
Date:                Wed, 17 Jun 2020   Prob (F-statistic):           3.18e-95
Time:                        02:55:01   Log-Likelihood:                -323.24
No. Observations:                 155   AIC:                             652.5
Df Residuals:                     152   BIC:                             661.6
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      36.3838      0.414     87.793      

### Results

As we can see, the results are:

`app_throughput` ~ `instances_n`:
* Adj. R-squared: `0.928`
* AIC: `686.0`
* BIC: `692.1`

`app_throughput` ~ `cpu`:
* Adj. R-squared: `0.907`
* AIC: `724.3`
* BIC: `730.4`

`app_throughput` ~ `instances_n` + `cpu`:
* Adj. R-squared: `0.942`
* AIC: `652.5`
* BIC: `661.6`

We can conclude that the model with only `instances_n` is **better** than the model with only `cpu`, because its **Adj. R-squared** is **higher** and its **AIC** and **BIC** are **lower**.

In this case, however, the variable `cpu` is still useful, because the model with both `instances_n` and `cpu` is better than either single-variable model on all three measures.

It should be pointed out that these conclusions are **not true for every experiment**.
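The comparison can also be done programmatically. The sketch below copies the metrics from the three summaries into a plain dictionary and picks the best model under each criterion. If `fit_regression` returns a statsmodels results object (an assumption, suggested by the `summary()` call), the same numbers should be available as `results.rsquared_adj`, `results.aic` and `results.bic`.

```python
# Metrics copied by hand from the three OLS summaries in this section.
metrics = {
    "instances_n": {"adj_r2": 0.928, "aic": 686.0, "bic": 692.1},
    "cpu": {"adj_r2": 0.907, "aic": 724.3, "bic": 730.4},
    "instances_n + cpu": {"adj_r2": 0.942, "aic": 652.5, "bic": 661.6},
}

# Higher adjusted R-squared is better; lower AIC and BIC are better.
best_by_adj_r2 = max(metrics, key=lambda name: metrics[name]["adj_r2"])
best_by_aic = min(metrics, key=lambda name: metrics[name]["aic"])
best_by_bic = min(metrics, key=lambda name: metrics[name]["bic"])

print("Best by Adj. R-squared:", best_by_adj_r2)
print("Best by AIC:", best_by_aic)
print("Best by BIC:", best_by_bic)
```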

## Pearson correlation coefficient

The **Pearson correlation coefficient** is a statistic that measures the linear correlation between two variables X and Y. It has a value between +1 and −1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.
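A minimal sketch of the computation, on made-up data: the coefficient is the covariance of the two variables divided by the product of their standard deviations. The function name `pearson` is an illustrative helper, not part of this notebook's `helpers` module.

```python
import numpy as np


def pearson(x, y) -> float:
    """Pearson r: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

print(pearson(x, 2 * x + 1))      # exactly linear, increasing
print(pearson(x, -3 * x + 10))    # exactly linear, decreasing
print(pearson(x, (x - 3) ** 2))   # dependent on x, but not linearly
```

The last case shows the main limitation: the second variable is fully determined by the first, yet the linear correlation is zero.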

## Spearman's rank correlation coefficient

**Spearman's rank correlation coefficient** is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a monotonic function.
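In other words, Spearman's coefficient is the Pearson coefficient computed on ranks instead of raw values. The sketch below assumes made-up data without tied values (ties would need average ranks); `rank` and `spearman` are illustrative helpers.

```python
import numpy as np


def rank(values):
    """Ranks 1..n; assumes no tied values."""
    values = np.asarray(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[np.argsort(values)] = np.arange(1, len(values) + 1)
    return ranks


def spearman(x, y) -> float:
    """Spearman's rho: Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(x), rank(y))[0, 1])


x = np.arange(1.0, 11.0)
y = np.exp(x)  # monotonic, but strongly non-linear

print(f"Pearson:  {np.corrcoef(x, y)[0, 1]:.4f}")
print(f"Spearman: {spearman(x, y):.4f}")
```

Because `y` always increases with `x`, the rank correlation is perfect even though the linear correlation is not.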

## Mutual information

**Mutual information (MI)** of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" (in units such as shannons, commonly called bits) obtained about one random variable through observing the other random variable. The concept of mutual information is intricately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.
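For discrete variables, the definition is $I(X;Y) = \sum_{x,y} p(x,y)\,\ln\frac{p(x,y)}{p(x)\,p(y)}$. The sketch below evaluates it for two toy joint distributions of binary variables; the function name is illustrative, and the result is in nats (divide by $\ln 2$ to get bits).

```python
import numpy as np


def mutual_information(joint) -> float:
    """Mutual information in nats from a joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal p(x), column vector
    py = joint.sum(axis=0, keepdims=True)  # marginal p(y), row vector
    outer = px @ py                        # p(x) * p(y) for every cell
    nonzero = joint > 0                    # 0 * log(0) is taken as 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / outer[nonzero])))


independent = [[0.25, 0.25], [0.25, 0.25]]  # knowing X says nothing about Y
identical = [[0.5, 0.0], [0.0, 0.5]]        # Y is always equal to X

print(mutual_information(independent))
print(mutual_information(identical) / np.log(2), "bits")
```

Independent variables share no information, while two identical fair bits share exactly one bit, which is the entropy of either variable.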

## Example

In [5]:
from itertools import combinations
from scipy.stats import pearsonr

features = ['app_throughput', 'instances_n', 'cpu']

for feat_a, feat_b in combinations(features, 2):
    corr = pearsonr(df[feat_a], df[feat_b])[0]
    print(f'Pearson between {feat_a} and {feat_b}: {corr:.4f}')

Pearson between app_throughput and instances_n: -0.9634
Pearson between app_throughput and cpu: -0.9529
Pearson between instances_n and cpu: 0.9496


In [6]:
from scipy.stats import spearmanr


for feat_a, feat_b in combinations(features, 2):
    corr = spearmanr(df[feat_a], df[feat_b])[0]
    print(f'Spearmanr between {feat_a} and {feat_b}: {corr:.4f}')

Spearmanr between app_throughput and instances_n: -0.9788
Spearmanr between app_throughput and cpu: -0.9589
Spearmanr between instances_n and cpu: 0.9846


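These results are consistent with our hypothesis that `instances_n` is a better explanatory variable than `cpu`: both coefficients are negative, and the correlation between `app_throughput` and `instances_n` is stronger in absolute value than the correlation between `app_throughput` and `cpu`. Note also that `instances_n` and `cpu` are strongly correlated with each other, so they carry largely overlapping information.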

In [7]:
from sklearn.metrics import mutual_info_score


for feat_a, feat_b in combinations(features, 2):
    mi = mutual_info_score(df[feat_a], df[feat_b])
    print(f'Mutual information between {feat_a} and {feat_b}: {mi:.4f}')

Mutual information between app_throughput and instances_n: 2.4209
Mutual information between app_throughput and cpu: 4.9898
Mutual information between instances_n and cpu: 2.4030


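On the other hand, mutual information is higher for `app_throughput` and `cpu`.

This would suggest that the dependence between `app_throughput` and `cpu` is still strong, but more **non-linear** than the dependence between `app_throughput` and `instances_n`. This reading needs caution, though: `mutual_info_score` treats every distinct value as a discrete label, so for features with many distinct values (such as `cpu` and `app_throughput`) the score is inflated and largely reflects the number of distinct values rather than the strength of the dependence.

Correlation coefficients and mutual information are therefore complementary measures, and both should be examined.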
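How strongly a label-based estimate depends on the number of distinct values can be checked with a small sketch. It reimplements the plug-in estimate over a contingency table with NumPy (assumed here to match what `mutual_info_score` computes for label arrays) and applies it to synthetic, deliberately independent data; all names are illustrative.

```python
import numpy as np


def label_mutual_information(a, b) -> float:
    """Plug-in MI (nats), treating every distinct value as a discrete label."""
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    a_idx, b_idx = a_idx.ravel(), b_idx.ravel()
    counts = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(counts, (a_idx, b_idx), 1)  # contingency table of label pairs
    joint = counts / counts.sum()
    outer = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / outer[nonzero])))


rng = np.random.default_rng(0)
n = 155
a = rng.permutation(n).astype(float)       # n distinct values
b = rng.permutation(n).astype(float)       # n distinct values, independent of a
groups = np.repeat(np.arange(12), 13)[:n]  # a variable with only 12 levels

print(f"MI(a, b)      = {label_mutual_information(a, b):.4f}  (ln n = {np.log(n):.4f})")
print(f"MI(a, groups) = {label_mutual_information(a, groups):.4f}  (ln 12 = {np.log(12):.4f})")
```

Two unrelated variables whose values are all distinct reach the maximum possible score, $\ln(n)$, and any pair involving a 12-level variable is capped at $\ln(12)$. For continuous features, binning the values first or using an estimator designed for continuous data, such as `sklearn.feature_selection.mutual_info_regression`, is likely to give a more meaningful comparison.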