# Comparing regression models

We need to be able to compare our regression models in order to test our hypotheses, e.g.:
* Which is a better predictor of app **latency** / **throughput**: the **number of instances** on the machine or the machine's **CPU usage**?
* Is it worth adding **CPU usage** to the model?
* What are the best features for predicting **latency** / **throughput**?

This notebook covers some measures that help to answer those questions, using data from experiment `linpack_12x20` as an example.

## R-Squared

**R-squared** ($R^2$) is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model.

So, if the $R^2$ of a model is $0.50$, then approximately half of the observed variation can be explained by the model's inputs.
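
As a small illustration of this definition, $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the total sum of squares around the mean. The sketch below computes it by hand; the function name `r_squared` and the toy numbers are made up for this example and are not taken from the experiment data.

```python
def r_squared(y_true, y_pred):
    """Proportion of the variance of y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    # Variation left unexplained by the model.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total variation of the observations around their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical observations and model predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(y_true, y_pred)
print(f'R-squared: {r2:.2f}')  # prints "R-squared: 0.98"
```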

## Adjusted R-Squared

Plain R-squared is a fair basis for comparison only between models with the same number of explanatory variables. Adding a predictor to a model either increases R-squared or leaves it unchanged, and never decreases it, so a model with more terms may appear to fit better simply because it has more terms.

The **adjusted R-squared** compensates for this, which makes it suitable for comparing regression models with different numbers of predictors. It increases only if a new term improves the model more than would be expected by chance, and it decreases when a predictor improves the model less than would be expected by chance.

In an overfitted model, R-squared can be misleadingly high even though the model's ability to predict new data has actually decreased. The adjusted R-squared is less prone to this.
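
The usual form of the adjustment is $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of predictors (not counting the intercept). The sketch below uses hypothetical $R^2$ values to show how a model with more predictors can have a higher $R^2$ but a lower adjusted $R^2$; the function name and the numbers are illustrative only.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept and n_predictors predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical fits on the same 50 observations.
adj_one = adjusted_r_squared(r2=0.900, n_obs=50, n_predictors=1)
adj_five = adjusted_r_squared(r2=0.905, n_obs=50, n_predictors=5)

# The five-predictor model has the higher plain R-squared,
# but the penalty for the extra terms outweighs the small gain.
print(f'1 predictor:  {adj_one:.4f}')
print(f'5 predictors: {adj_five:.4f}')
```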

## Akaike information criterion (AIC)

**Akaike information criterion (AIC)** is an estimator of out-of-sample prediction error and thereby relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.

AIC is founded on information theory. When a statistical model is used to represent the process that generated the data, the representation will almost never be exact, so some information will be lost by using the model to represent the process. AIC estimates the relative amount of information lost by a given model: the less information a model loses, the higher its quality. A **lower** AIC value therefore indicates a better model.

In estimating the amount of information lost by a model, AIC deals with the trade-off between the goodness of fit of the model and the simplicity of the model. In other words, AIC deals with both the risk of overfitting and the risk of underfitting.
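
This trade-off is visible in the standard definition $AIC = 2k - 2\ln(\hat{L})$, where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood of the model. A minimal sketch with made-up log-likelihood values (the function name and numbers are assumptions for illustration):

```python
def aic(log_likelihood, n_params):
    """AIC = 2k - 2 ln(L): a complexity penalty minus a goodness-of-fit reward."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical models: the larger one fits slightly better (higher log-likelihood)
# but uses three more parameters.
aic_small = aic(log_likelihood=-120.0, n_params=2)
aic_large = aic(log_likelihood=-119.5, n_params=5)

print(aic_small, aic_large)  # prints "244.0 249.0"
```

Here the small gain in fit does not justify the extra parameters, so the smaller model has the lower (better) AIC.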

## Bayesian information criterion (BIC)

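The formula for the **Bayesian information criterion (BIC)** is similar to the formula for AIC, but with a different penalty for the number of parameters: the penalty per parameter grows with the logarithm of the number of observations instead of being constant. As with AIC, a lower value is better.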
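
Written out, $BIC = k\ln(n) - 2\ln(\hat{L})$ compared with $AIC = 2k - 2\ln(\hat{L})$, where $k$ is the number of estimated parameters, $n$ the number of observations and $\hat{L}$ the maximized likelihood. The sketch below, using hypothetical values, shows that the two criteria differ only by $k(\ln(n) - 2)$, so BIC penalizes extra parameters more strongly than AIC once $n \geq 8$.

```python
import math


def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical log-likelihood and parameter count.
log_lik, k = -300.0, 3

# Gap between BIC and AIC for a few sample sizes; it equals k * (ln(n) - 2).
gaps = {n_obs: bic(log_lik, k, n_obs) - aic(log_lik, k) for n_obs in (5, 8, 150)}
for n_obs, gap in gaps.items():
    print(f'n = {n_obs:3d}: BIC - AIC = {gap:.3f}')
```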

## Example - linpack 12x20min

In [1]:
import warnings
warnings.filterwarnings('ignore')

from helpers.helpers_old import (
    fit_regression,
    get_data_with_metrics,
)

experiments_path = '../../data'
experiment_name = 'linpack_12x20'

exp_name, df = next(get_data_with_metrics(experiment_name, experiments_path, instances_n=12))

df.head()

Unnamed: 0,cbtool_time,cpu_time,app_latency,app_throughput,cpu,memory,instances_n
0,1592311591,1592311594,,33.382,2035.0,4546589000.0,1.0
1,1592311651,1592311654,,33.37,2034.0,4547256000.0,1.0
2,1592311711,1592311715,,33.3523,2034.0,4545925000.0,1.0
3,1592311771,1592311775,,32.272,2057.0,4546154000.0,1.0
4,1592311831,1592311836,,33.1085,2044.0,4545212000.0,1.0


### Regression `app_throughput` ~ `instances_n`

In [2]:
def print_measures(results):
    """Print adjusted R-squared, AIC and BIC of a fitted regression results object."""
    print(f'Adj. R-squared: {results.rsquared_adj:.4f}')
    print(f'AIC: {results.aic:.5f}')
    print(f'BIC: {results.bic:.5f}')
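
The helper `fit_regression` is not shown in this notebook. As a rough illustration of where the three printed measures come from, the sketch below fits a one-predictor least-squares line by hand and derives the same quantities. It assumes ordinary least squares with Gaussian errors and counts the intercept and the slope as the two parameters; the function `simple_ols_measures` and the toy data are hypothetical and are not the helper's actual implementation.

```python
import math


def simple_ols_measures(x, y):
    """Fit y = intercept + slope * x by least squares; return adj. R^2, AIC, BIC."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)

    k = 2  # estimated parameters: intercept and slope
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    # Maximized Gaussian log-likelihood of the fitted line.
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(ss_res / n) + 1)
    return {
        'adj_r2': adj_r2,
        'aic': 2 * k - 2 * log_lik,
        'bic': k * math.log(n) - 2 * log_lik,
    }


# Hypothetical data: throughput dropping as the number of instances grows.
instances = [1, 2, 3, 4, 5, 6]
throughput = [33.4, 30.1, 27.2, 23.8, 21.1, 17.9]

measures = simple_ols_measures(instances, throughput)
print(f"Adj. R-squared: {measures['adj_r2']:.4f}")
print(f"AIC: {measures['aic']:.5f}")
print(f"BIC: {measures['bic']:.5f}")
```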

In [3]:
results = fit_regression(data=df, formula='app_throughput ~ instances_n')
print_measures(results)

NameError: name 'fit_regression' is not defined

### Regression `app_throughput` ~ `cpu`

In [4]:
results = fit_regression(data=df, formula='app_throughput ~ cpu')
print_measures(results)

Adj. R-squared: 0.9239
AIC: 654.52017
BIC: 660.55473


### Regression `app_throughput` ~ `instances_n` + `cpu`

In [5]:
results = fit_regression(data=df, formula='app_throughput ~ instances_n + cpu')
print_measures(results)

Adj. R-squared: 0.9452
AIC: 606.04180
BIC: 615.09364


### Results

We can conclude that the model with only `instances_n` is **slightly worse** than the model with only `cpu`, because **Adj. R-squared** is **lower** and **AIC** and **BIC** are **higher**.

Even so, the variable `instances_n` is still useful in this case, because the model with both `instances_n` and `cpu` is better than either single-variable model.

Note that these conclusions do **not hold for every experiment**.

## Pearson correlation coefficient

The **Pearson correlation coefficient** is a statistic that measures the linear correlation between two variables $X$ and $Y$. Its value lies between −1 and +1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.
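
The coefficient is the covariance of the two variables divided by the product of their standard deviations. The following hand-written sketch (toy data, hypothetical function name `pearson`) shows that it reaches 1 only for an exactly linear relationship and stays below 1 for a monotonic but nonlinear one.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


x = [1, 2, 3, 4, 5]
r_linear = pearson(x, [2 * v for v in x])   # exactly linear relationship
r_cubic = pearson(x, [v ** 3 for v in x])   # monotonic but nonlinear

print(f'linear: {r_linear:.4f}, cubic: {r_cubic:.4f}')
```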

## Spearman's rank correlation coefficient

**Spearman's rank correlation coefficient** is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a monotonic function.
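
One way to see this: Spearman's coefficient is the Pearson correlation computed on the ranks of the values instead of the values themselves. The sketch below implements that idea by hand, assigning average ranks to ties; the helper names and the toy data are made up for illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
cubic = [v ** 3 for v in x]  # monotonic but nonlinear

rho_cubic = spearman(x, cubic)
print(f'Spearman for a cubic relationship: {rho_cubic:.4f}')
```

Because only the ordering matters, the cubic relationship gets a coefficient of 1 (up to floating-point rounding), whereas its Pearson coefficient would be lower.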

## Mutual information

**Mutual information (MI)** of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" (in units such as shannons, commonly called bits) obtained about one random variable through observing the other random variable. The concept of mutual information is intricately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.
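
For discrete variables, MI can be computed directly from the joint and marginal frequencies: $I(X;Y) = \sum_{x,y} p(x,y)\ln\frac{p(x,y)}{p(x)\,p(y)}$. The sketch below implements this by hand using the natural logarithm, so the result is in nats rather than bits. The function name and the toy data are assumptions for illustration.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in nats) between two sequences of discrete labels."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


# Independent labels: every combination occurs equally often, so MI is 0.
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])

# All values distinct (typical for continuous measurements treated as labels):
# MI equals ln(n) regardless of how the two sequences are related.
unique_a = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
unrelated_b = [5.0, 1.0, 7.0, 3.0, 8.0, 2.0, 6.0, 4.0]
mi_unique = mutual_information(unique_a, unrelated_b)

print(f'independent: {mi_independent:.4f}, all unique: {mi_unique:.4f}')
```

The second case matters when a label-based MI estimate is applied to continuous columns: with mostly distinct values the score is driven by the number of distinct values rather than by the strength of the dependency. Binning the values first, or using an estimator designed for continuous data such as `sklearn.feature_selection.mutual_info_regression`, may be more informative in that situation.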

## Example

In [6]:
from itertools import combinations
from scipy.stats import pearsonr

features = ['app_throughput', 'instances_n', 'cpu', 'memory']

for feat_a, feat_b in combinations(features, 2):
    corr = pearsonr(df[feat_a], df[feat_b])[0]
    print(f'Pearson between {feat_a} and {feat_b}: {corr:.4f}')

Pearson between app_throughput and instances_n: -0.9609
Pearson between app_throughput and cpu: -0.9615
Pearson between app_throughput and memory: -0.9197
Pearson between instances_n and cpu: 0.9535
Pearson between instances_n and memory: 0.9648
Pearson between cpu and memory: 0.8998


In [7]:
from scipy.stats import spearmanr


for feat_a, feat_b in combinations(features, 2):
    corr = spearmanr(df[feat_a], df[feat_b])[0]
    print(f'Spearmanr between {feat_a} and {feat_b}: {corr:.4f}')

Spearmanr between app_throughput and instances_n: -0.9753
Spearmanr between app_throughput and cpu: -0.9671
Spearmanr between app_throughput and memory: -0.9537
Spearmanr between instances_n and cpu: 0.9852
Spearmanr between instances_n and memory: 0.9722
Spearmanr between cpu and memory: 0.9589


In [8]:
from sklearn.metrics import mutual_info_score


for feat_a, feat_b in combinations(features, 2):
    mi = mutual_info_score(df[feat_a], df[feat_b])
    print(f'Mutual information between {feat_a} and {feat_b}: {mi:.4f}')

Mutual information between app_throughput and instances_n: 2.3679
Mutual information between app_throughput and cpu: 4.9622
Mutual information between app_throughput and memory: 5.0081
Mutual information between instances_n and cpu: 2.3587
Mutual information between instances_n and memory: 2.3679
Mutual information between cpu and memory: 4.9530


## Another example - wrk 12x20min

In [9]:
experiment_name = 'wrk_12x20'

exp_name, df = next(get_data_with_metrics(experiment_name, experiments_path, instances_n=12))

df.head()

Unnamed: 0,cbtool_time,cpu_time,app_latency,app_throughput,cpu,memory,instances_n
0,1591802843,1591802848,15.24,8880.0,7167.0,4690260000.0,1.0
1,1591802907,1591802908,10.08,20180.0,5906.0,4676293000.0,1.0
2,1591802971,1591802969,20.2,5360.0,7869.0,4687049000.0,1.0
3,1591803449,1591803453,21.11,5340.0,6597.0,4683792000.0,1.0
4,1591803513,1591803514,10.05,20140.0,6163.0,4669866000.0,1.0


In [10]:
results = fit_regression(data=df, formula='app_latency ~ instances_n')
print_measures(results)

Adj. R-squared: 0.9656
AIC: 532.21507
BIC: 537.02851


In [11]:
results = fit_regression(data=df, formula='app_latency ~ cpu')
print_measures(results)

Adj. R-squared: 0.7126
AIC: 706.21839
BIC: 711.03183


In [12]:
results = fit_regression(data=df, formula='app_latency ~ memory')
print_measures(results)

Adj. R-squared: 0.6287
AIC: 727.22240
BIC: 732.03584


In [13]:
results = fit_regression(data=df, formula='app_latency ~ instances_n + cpu')
print_measures(results)

Adj. R-squared: 0.9778
AIC: 497.10948
BIC: 504.32964


In [14]:
results = fit_regression(data=df, formula='app_latency ~ instances_n + cpu + memory')
print_measures(results)

Adj. R-squared: 0.9803
AIC: 488.36683
BIC: 497.99371


In this case, we can see that the model with only `instances_n` is much better than the models with only `cpu` or only `memory`.

At the same time, we see that adding `cpu` and `memory` to the model gives a slight improvement.

Let's check correlations and mutual information.

In [15]:
features = ['app_latency', 'instances_n', 'cpu', 'memory']

for feat_a, feat_b in combinations(features, 2):
    corr = pearsonr(df[feat_a], df[feat_b])[0]
    print(f'Pearson between {feat_a} and {feat_b}: {corr:.4f}')

Pearson between app_latency and instances_n: 0.9829
Pearson between app_latency and cpu: 0.8463
Pearson between app_latency and memory: 0.7958
Pearson between instances_n and cpu: 0.7919
Pearson between instances_n and memory: 0.7312
Pearson between cpu and memory: 0.8598


In [16]:
for feat_a, feat_b in combinations(features, 2):
    corr = spearmanr(df[feat_a], df[feat_b])[0]
    print(f'Spearman between {feat_a} and {feat_b}: {corr:.4f}')

Spearman between app_latency and instances_n: 0.9926
Spearman between app_latency and cpu: 0.9492
Spearman between app_latency and memory: 0.5966
Spearman between instances_n and cpu: 0.9594
Spearman between instances_n and memory: 0.5989
Spearman between cpu and memory: 0.5687


In [17]:
for feat_a, feat_b in combinations(features, 2):
    mi = mutual_info_score(df[feat_a], df[feat_b])
    print(f'Mutual information between {feat_a} and {feat_b}: {mi:.4f}')

Mutual information between app_latency and instances_n: 2.4569
Mutual information between app_latency and cpu: 4.3898
Mutual information between app_latency and memory: 4.3898
Mutual information between instances_n and cpu: 2.4569
Mutual information between instances_n and memory: 2.4569
Mutual information between cpu and memory: 4.4067


**Pearson** correlation shows that there is a strong linear dependency between `app_latency` and `instances_n`. 

**Spearman** correlation shows that there is a significant, monotonic (not necessarily linear) dependency between `app_latency` and `cpu`.

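**Mutual information** suggests that there is a significant (but not necessarily monotonic) dependency between `app_latency` and `memory`. This should be interpreted with care: `mutual_info_score` treats every distinct value as a separate label, so for continuous columns with mostly unique values the score is close to its upper bound regardless of how strongly the variables depend on each other.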