# Evaluation of regression models

In this notebook, we will look at how to evaluate (simple) regression models.

First, we import the standard packages.

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats

import statsmodels.api as sm
from sklearn import linear_model

Then we load some example data: the Google Analytics web data again.

In [2]:
webdata = pd.read_excel("GA users and convertions.xlsx")

In [3]:
webdata

,DayIndex,Users,PurchaseCompleted
0,2016-08-01,560,8
1,2016-08-02,378,10
2,2016-08-03,412,11
3,2016-08-04,499,7
4,2016-08-05,375,11
...,...,...,...
87,2016-10-27,351,23
88,2016-10-28,398,23
89,2016-10-29,209,10
90,2016-10-30,224,16


We will first train simple linear regression models using both *statsmodels* and *scikit-learn*. First, we do it using statsmodels:

In [4]:
X = webdata["Users"]
X = sm.add_constant(X)

y = webdata["PurchaseCompleted"]

linreg_statsm = sm.OLS(y, X).fit()
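`sm.add_constant` adds a column of ones to the design matrix so that OLS estimates an intercept alongside the slope. A minimal NumPy sketch of the same idea follows. The arrays `users` and `purchases` are made-up toy numbers, not the web data, and `np.linalg.lstsq` stands in for the statsmodels fit.

```python
import numpy as np

# Hypothetical toy data, not the Google Analytics data used in this notebook.
users = np.array([100.0, 200.0, 300.0, 400.0, 500.0])
purchases = np.array([1.0, 4.0, 6.0, 10.0, 12.0])

# Design matrix with a leading column of ones (the "constant").
design = np.column_stack([np.ones_like(users), users])

# Ordinary least squares: minimise ||design @ beta - purchases||^2.
beta, *_ = np.linalg.lstsq(design, purchases, rcond=None)
intercept, slope = beta

print(f"intercept={intercept:.3f}, slope={slope:.4f}")
```

Without the column of ones, the fitted line would be forced through the origin.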

From the model summary we already see some evaluation metrics such as R-squared.

In [5]:
linreg_statsm.summary()

Dep. Variable:,PurchaseCompleted,R-squared:,0.378
Model:,OLS,Adj. R-squared:,0.372
Method:,Least Squares,F-statistic:,54.8
Date:,"Thu, 05 Feb 2026",Prob (F-statistic):,6.81e-11
Time:,18:32:40,Log-Likelihood:,-258.35
No. Observations:,92,AIC:,520.7
Df Residuals:,90,BIC:,525.8
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-2.3670,1.309,-1.808,0.074,-4.968,0.234
Users,0.0297,0.004,7.403,0.000,0.022,0.038

Omnibus:,37.103,Durbin-Watson:,0.767
Prob(Omnibus):,0.0,Jarque-Bera (JB):,81.388
Skew:,1.511,Prob(JB):,2.12e-18
Kurtosis:,6.478,Cond. No.,1010.0


We can also get the R-squared and adjusted R-squared directly from the fitted model object.

In [6]:
linreg_statsm.rsquared

np.float64(0.37847262621338407)

In [7]:
linreg_statsm.rsquared_adj

np.float64(0.37156676650464393)
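To make the two numbers less of a black box, here is a small sketch of how R-squared and adjusted R-squared are defined. The helper names `r_squared` and `adjusted_r_squared` are hypothetical and not part of statsmodels. The last lines plug in the R-squared, number of observations and number of predictors reported in the summary above.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_features):
    """Penalise R^2 for the number of predictors (the intercept is not counted)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Values taken from the model summary: R^2, 92 observations, 1 predictor (Users).
adj = adjusted_r_squared(0.37847262621338407, n_obs=92, n_features=1)
print(adj)
```

With a single predictor and 92 observations the penalty is small, which is why the two values are so close.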

We can also get the residuals of the model.

In [8]:
linreg_statsm.resid

0     -6.263597
1      1.141357
2      1.131640
3     -5.452047
4      2.230449
        ...    
87    14.943190
88    13.547406
89     6.160242
90    11.714779
91    13.349787
Length: 92, dtype: float64

Using these, we can calculate the *MAE* (Mean Absolute Error).

In [9]:
np.mean(np.abs(linreg_statsm.resid))

np.float64(2.7583876196345)

We can also calculate the *MSE* (Mean Squared Error).

In [10]:
np.mean(linreg_statsm.resid**2)

np.float64(16.09538539763005)

Taking the square root of the MSE gives the *RMSE* (Root Mean Squared Error).

In [11]:
np.sqrt(np.mean(linreg_statsm.resid**2))

np.float64(4.011905457214819)
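For reference, the three computations above correspond to the following definitions, where $e_i = y_i - \hat{y}_i$ denotes the residual of observation $i$ and $n$ the number of observations (the notation is introduced here for illustration):

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |e_i|, \qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} e_i^{2}, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
$$

The MAE and RMSE are in the same unit as the response (purchases per day), while the MSE is in squared units. Because of the squaring, large residuals weigh more heavily in the MSE and RMSE than in the MAE.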

## Evaluation metrics from scikit-learn

If we train a linear regression model using scikit-learn, we can also use its `metrics` submodule to evaluate the model.

In [12]:
X_no_int = webdata[["Users"]]

linreg_scikit = linear_model.LinearRegression()
linreg_scikit.fit(X_no_int, y)

LinearRegression()


We can also get the R-squared with the `score` method, but here we need to pass the data explicitly.

In [13]:
linreg_scikit.score(X_no_int, y)

0.37847262621338396

Scikit-learn also comes with a `metrics` submodule containing many more evaluation metrics (see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics).

In [14]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, root_mean_squared_error

All of these take the true and the predicted values as arrays. Thus, we first define a variable for the predicted values.

In [15]:
y_pred = linreg_scikit.predict(X_no_int)

In [16]:
r2_score(y, y_pred)

0.37847262621338396

In [17]:
mean_absolute_error(y, y_pred)

2.7583876196345

In [18]:
mean_squared_error(y, y_pred)

16.095385397630054

In [19]:
root_mean_squared_error(y, y_pred)

4.01190545721482

## Train-test splitting

We will now look at splitting the data into a training set and a test set. For this, we first need to import a scikit-learn function:

In [20]:
from sklearn.model_selection import train_test_split

We can now split the data with a call to the function `train_test_split` with the following input and output:

Input:
* X: Our entire X dataset
* y: Our entire y dataset
* test_size: The fraction of the data put in the test dataset. How much we allocate for testing depends on the situation, but 30% is a fairly common choice if there is enough data
* random_state: An integer used to seed the random number generator, so that we can reproduce the split if we want to

Output:
* X_train: Features of the training dataset
* X_test: Features of the test dataset
* y_train: Response variable of the training dataset
* y_test: Response variable (**ground truth**) of the test dataset

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X_no_int, y, test_size=0.3, random_state=123)

In [22]:
X_train.head()

,Users
21,391
59,327
19,174
38,279
90,224


In [23]:
y_train.head()

21    11
59     2
19     2
38    10
90    16
Name: PurchaseCompleted, dtype: int64

In [24]:
X_test.head()

,Users
71,332
62,153
29,373
53,267
88,398


In [25]:
y_test.head()

71    12
62     3
29     6
53     3
88    23
Name: PurchaseCompleted, dtype: int64

Note how this function returns four objects: two DataFrames (X_train and X_test) and two Series (y_train and y_test). Note also how the indexes of X_train and y_train match (this is of course essential for supervised learning!), and the same goes for X_test and y_test.
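These properties of the split can be checked directly. The sketch below uses a randomly generated stand-in DataFrame (`toy` and the names derived from it are hypothetical, with values that have nothing to do with the real web data) and verifies the subset sizes, the index alignment and the reproducibility given a fixed `random_state`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same column names and row count as the notebook's data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Users": rng.integers(150, 600, size=92),
    "PurchaseCompleted": rng.integers(0, 25, size=92),
})
X_toy = toy[["Users"]]            # DataFrame (2-D)
y_toy = toy["PurchaseCompleted"]  # Series (1-D)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=123)

# Features and response keep the same row labels within each subset.
aligned = X_tr.index.equals(y_tr.index) and X_te.index.equals(y_te.index)

# With 92 rows and test_size=0.3, the test set is rounded up to 28 rows.
sizes = (len(X_tr), len(X_te))

# Repeating the call with the same random_state reproduces the split.
X_tr_again, _, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=123)
reproducible = X_tr.index.equals(X_tr_again.index)

print(aligned, sizes, reproducible)
```

Changing `random_state` (or leaving it out) gives a different partition, which is worth keeping in mind when comparing metrics between runs.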

Now let us retrain our simple linear regression.

In [26]:
linreg = linear_model.LinearRegression()
linreg.fit(X_train, y_train)

LinearRegression()


We can now calculate our evaluation metrics on the training set:

In [27]:
y_pred_train = linreg.predict(X_train)

In [28]:
r2_score(y_train, y_pred_train)

0.38379054163559545

In [29]:
mean_absolute_error(y_train, y_pred_train)

2.673073910559279

In [30]:
root_mean_squared_error(y_train, y_pred_train)

4.060519102355277

We can now also calculate our evaluation metrics on the test set to get a better estimate of how well our model generalizes, that is, how well it predicts on new, unseen cases:

In [31]:
y_pred_test = linreg.predict(X_test)

In [32]:
r2_score(y_test, y_pred_test)

0.2885830107992312

In [33]:
mean_absolute_error(y_test, y_pred_test)

2.8898751558519122

In [34]:
root_mean_squared_error(y_test, y_pred_test)

3.986318168999399

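As expected, the model performs worse on the test data in terms of R-squared and MAE. The RMSE, however, is marginally lower on the test set (3.99 versus 4.06 on the training set). A plausible explanation is sampling variation: the test set holds only 28 of the 92 observations, and RMSE is sensitive to a few large residuals, so the metric depends noticeably on which days happen to land in which subset.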
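One way to get a feeling for how much a single train-test split can move the metrics is to repeat the split with many different seeds. The sketch below does this on simulated data: `X_sim` and `y_sim` are hypothetical, generated from a linear trend plus noise, so the numbers it prints say nothing about the real web data. The RMSE is computed as `np.sqrt(mean_squared_error(...))`, which is equivalent to `root_mean_squared_error`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical simulated data: a linear trend plus noise, 92 rows like the web data.
rng = np.random.default_rng(42)
X_sim = rng.uniform(150, 600, size=(92, 1))
y_sim = -2.0 + 0.03 * X_sim[:, 0] + rng.normal(0.0, 4.0, size=92)

train_rmse, test_rmse = [], []
for seed in range(200):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_sim, y_sim, test_size=0.3, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    train_rmse.append(np.sqrt(mean_squared_error(y_tr, model.predict(X_tr))))
    test_rmse.append(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))

train_rmse = np.array(train_rmse)
test_rmse = np.array(test_rmse)

# Share of splits in which the test RMSE comes out lower than the training RMSE.
share_test_better = np.mean(test_rmse < train_rmse)

print(f"test RMSE: min={test_rmse.min():.2f}, max={test_rmse.max():.2f}")
print(f"share of splits with test RMSE < train RMSE: {share_test_better:.2f}")
```

If the spread of the test RMSE across seeds is large compared with the gap between training and test RMSE, a slightly better test score in one particular split should not be over-interpreted.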