# Multiple Regression from Scratch

*Daniel Wiesenfeld*

In this notebook I demostrate how to fit a linear regression and duplicate every metric in `statsmodels`' linear regression summary output from scratch. By that, I mean using only Python with the `numpy` and `scipy` libraries. There is also an accompanying Google sheet and Excel workbook that has shows you how to compute all the metrics using built in formulas only. Though this exercise is mostly for the fun of it (and to learn how all the formulas work), there are actually practical uses. I think you will find that it is far simpler and easier to generate $\beta$'s with just `numpy`, which you will almost always use anyway in a data science context, than it is using `sklearn` or `statsmodels`. Also, many businesses use spreadsheets all the time. I can think of many instances where having that functionality within a spreadsheet, without the need for plugins, add-ons, or custom programming can be very helpful.

Enjoy!

## Imports and Data Generation

Let's import our libraries and get the data in a `pandas` dataframe.

In [75]:
from statsmodels.api import OLS, datasets, add_constant
import numpy as np
import scipy as sp
import pandas as pd

Let's use the macrodata dataset from `statsmodels` as our example. We will build a multiple regression model to predict unemployment with seven features.

In [2]:
df = pd.DataFrame(datasets.macrodata.load().data)
df.head()

Unnamed: 0,year,quarter,realgdp,realcons,realinv,realgovt,realdpi,cpi,m1,tbilrate,unemp,pop,infl,realint
0,1959.0,1.0,2710.349,1707.4,286.898,470.045,1886.9,28.98,139.7,2.82,5.8,177.146,0.0,0.0
1,1959.0,2.0,2778.801,1733.7,310.859,481.301,1919.7,29.15,141.7,3.08,5.1,177.83,2.34,0.74
2,1959.0,3.0,2775.488,1751.8,289.226,491.26,1916.4,29.35,140.5,3.82,5.3,178.657,2.74,1.09
3,1959.0,4.0,2785.204,1753.7,299.356,484.052,1931.3,29.37,140.0,4.33,5.6,179.386,0.27,4.06
4,1960.0,1.0,2847.699,1770.5,331.722,462.199,1955.5,29.54,139.6,3.5,5.2,180.007,2.31,1.19


We'll create our `X` dataframe, by selecting 7 features of interest and adding a constant, and we will select the 'unemp' column as our `y` series.

In [78]:
X = df[['realgdp', 'realcons', 'realgovt', 'realdpi', 'cpi', 'm1', 'pop']].pipe(add_constant)
y = df['unemp']

In [79]:
X.head()

Unnamed: 0,const,realgdp,realcons,realgovt,realdpi,cpi,m1,pop
0,1.0,2710.349,1707.4,470.045,1886.9,28.98,139.7,177.146
1,1.0,2778.801,1733.7,481.301,1919.7,29.15,141.7,177.83
2,1.0,2775.488,1751.8,491.26,1916.4,29.35,140.5,178.657
3,1.0,2785.204,1753.7,484.052,1931.3,29.37,140.0,179.386
4,1.0,2847.699,1770.5,462.199,1955.5,29.54,139.6,180.007


In [80]:
y.head()

0    5.8
1    5.1
2    5.3
3    5.6
4    5.2
Name: unemp, dtype: float64

Run the following code if you want to copy X & y as a single dataframe to the clopboard so you can paste in Sheets or Excel:

In [84]:
X.assign(y = y).to_clipboard()
# now just paste in Sheets or Excel

## Statsmodels Output

OK let's take a look at what `statsmodels` gives us

In [7]:
mod = OLS(y, X)
res = mod.fit()
res.summary()

0,1,2,3
Dep. Variable:,unemp,R-squared:,0.876
Model:,OLS,Adj. R-squared:,0.872
Method:,Least Squares,F-statistic:,197.6
Date:,"Mon, 15 May 2023",Prob (F-statistic):,6.21e-85
Time:,09:42:54,Log-Likelihood:,-151.94
No. Observations:,203,AIC:,319.9
Df Residuals:,195,BIC:,346.4
Df Model:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
constant,-17.5391,2.806,-6.250,0.000,-23.074,-12.004
realgdp,-0.0116,0.000,-27.105,0.000,-0.012,-0.011
realcons,0.0087,0.001,10.055,0.000,0.007,0.010
realgovt,-0.0030,0.001,-4.824,0.000,-0.004,-0.002
realdpi,0.0029,0.001,4.112,0.000,0.002,0.004
cpi,0.0799,0.006,13.065,0.000,0.068,0.092
m1,-0.0030,0.001,-4.945,0.000,-0.004,-0.002
pop,0.1907,0.018,10.646,0.000,0.155,0.226

0,1,2,3
Omnibus:,0.415,Durbin-Watson:,0.779
Prob(Omnibus):,0.813,Jarque-Bera (JB):,0.204
Skew:,0.057,Prob(JB):,0.903
Kurtosis:,3.106,Cond. No.,861000.0


The problem with `statsmodels` output is that the numbers are all rounded to no more than three decimal places. It will be hard to see how accurate our own computations are unless we extract these attributes independently from the `res` object. I show you how to do it below and conveniently save them in variables beginning with the suffix '_sm', so that we can compare them to our own computations.

In [85]:
r2_sm = res.rsquared # R-squared
r2_sm

0.8764203251692085

In [87]:
r2_adj_sm = res.rsquared_adj # Adj. R-squared
r2_adj_sm

0.8719841317137442

In [88]:
f_sm = res.fvalue # F-Statistic
f_sm

197.56134036257416

In [89]:
pf_sm = res.f_pvalue # Prob(F-Statistic)
pf_sm

6.214397666559647e-85

In [90]:
ll_sm = res.llf # Log-Likelihood
ll_sm

-151.94432883403306

In [91]:
aic_sm = res.aic # AIC
aic_sm

319.8886576680661

In [92]:
bic_sm = res.bic # BIC
bic_sm

346.39430550040043

In [95]:
beta_sm = res.params # coefficients
beta_sm

constant   -17.539104
realgdp     -0.011650
realcons     0.008694
realgovt    -0.003007
realdpi      0.002923
cpi          0.079861
m1          -0.003043
pop          0.190674
dtype: float64

In [96]:
se_sm = res.bse # standard erros of coefficients
se_sm

constant    2.806322
realgdp     0.000430
realcons    0.000865
realgovt    0.000623
realdpi     0.000711
cpi         0.006113
m1          0.000615
pop         0.017910
dtype: float64

In [97]:
t_sm = res.tvalues # t values of coefficients
t_sm

constant    -6.249854
realgdp    -27.105072
realcons    10.054683
realgovt    -4.824153
realdpi      4.112236
cpi         13.064576
m1          -4.944581
pop         10.646391
dtype: float64

In [98]:
p_sm = res.pvalues # p values of coefficients
p_sm

constant    2.532927e-09
realgdp     4.704738e-68
realcons    1.991447e-19
realgovt    2.828657e-06
realdpi     5.772642e-05
cpi         1.972339e-28
m1          1.641259e-06
pop         3.691092e-21
dtype: float64

In [104]:
lb_sm = res.conf_int()[0] # lower bound of 95% confidence interval of coefficients
lb_sm

constant   -23.073744
realgdp     -0.012497
realcons     0.006989
realgovt    -0.004236
realdpi      0.001521
cpi          0.067805
m1          -0.004256
pop          0.155352
Name: 0, dtype: float64

In [103]:
ub_sm = res.conf_int()[1] # upper bound of 95% confidence interval of coefficients
ub_sm

constant   -12.004463
realgdp     -0.010802
realcons     0.010399
realgovt    -0.001777
realdpi      0.004325
cpi          0.091916
m1          -0.001829
pop          0.225995
Name: 1, dtype: float64

The bottom-most table of the summary output requires a few more methods from `statsmodels` and access to the model's residuals.

In [106]:
from statsmodels.stats.stattools import omni_normtest, robust_skewness, robust_kurtosis, durbin_watson, jarque_bera
resids_sm = res.resid # residuals (errors)

In [111]:
omni_sm = omni_normtest(resids_sm)[0] # Omnibus statistic of residuals
omni_sm

0.41478658948228114

In [112]:
pomni_sm = omni_normtest(resids_sm)[1] # Prob(Omnibus)
pomni_sm

0.8126999565057904

In [113]:
skew_sm = robust_skewness(resids)[0] # Skewness of residuals
skew_sm

0.056667302727931364

In [115]:
kurt_sm = robust_kurtosis(resids)[0] + 3 # Kurtosis of residuals
kurt_sm

3.1061908834830367

In [116]:
dw_sm = durbin_watson(resids) # Durbin-Watson
dw_sm

0.778964211500831

In [118]:
jb_sm = jarque_bera(resids)[0] # Jarque-Bera (JB)
jb_sm

0.20402545897229157

In [117]:
pjb_sm = jarque_bera(resids)[1] # Prob(JB)
pjb_sm

0.9030180566398727

In [120]:
cn_sm = res.condition_number #Cond. No.
cn_sm

860663.3059185082

Phew!

## Coding the Metrics from Scratch

### Coefficients

Let's start with the most important metrics, the coefficients themselves!
Multiple Linear Regression is actually just a linear algebra problem. Pro tip: the `@` operator is a shortcut form matrix multiplication.

In [130]:
beta = np.linalg.inv(X.T @ X) @ X.T @ y
beta

0   -17.539104
1    -0.011650
2     0.008694
3    -0.003007
4     0.002923
5     0.079861
6    -0.003043
7     0.190674
dtype: float64

That's it! let's compare them to the coefficients we pulled from `statsmodels`.

In [132]:
beta.index = beta_sm.index # reindexing beta with the feature names, allows the comparison to work
beta - beta_sm

constant   -2.014140e-10
realgdp    -1.008395e-14
realcons    9.141299e-14
realgovt    2.271707e-14
realdpi    -9.579967e-14
cpi         1.537936e-13
m1         -1.881134e-14
pop         1.316003e-12
dtype: float64

Ok so we're off by a teeny tiny fraction ... most likely due to rounding differences between `statsmodels` and `numpy`.

The standard errors are a bit more involved, so let's jump up to the top table of the summary and compute some things there first, as some of them will come in handy later.

### The Top Table

In [138]:
# The first sixth row of the first column contains number of observations 
# which is simply the number of rows in X or the length of y, which we'll call n.
n = len(y)
n

203

In [140]:
# The next two rows are the degress of freedom of the residuals which is n less the parameters of the model,
# and the degrees of freedom of the model which is the parameters of the model - 1.
# We'll start by creating p as the parameters of the model, which is just the columns in X, and then compute the two df's.
# These degrees of freedoms are used in various other metrics.

p = X.shape[1]
dfr = n - p
dfm = p- 1
dfr, dfm

(195, 7)

Now let's compute R-squared. R-squared tells us how well $X$ explains $y$. It is computed as: 
$$R^2 = 1 - \frac{SSE}{SST}$$


SSE or sum of squares of errors is the sum of the squares of each residual or error $e_i$ - the difference between each true value $y_i$ and its predicted value $\hat{y_i}$:
$$ SSE = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y_i})^2 $$


SST or the sum of total squares is the sum of the squares of of the differences between each $y_i$ and $\bar{y}$ (the average of all $y_i$'s):

$$ SST = \sum_{i=1}^n (y_i - \bar{y})^2.$$

In [148]:
# let's compute all the things we'll need and the R-squared itself.

yhat = X @ np.array(beta).T # for whatever reason, we need to first recast the beta series as an array before the matrix mult
e = y - yhat
sse = (e**2).sum()
ybar = y.mean()
sst = ((y - ybar)**2).sum()
r2 = 1 - sse/sst
r2

0.8764203251692086

In [150]:
# and let's compare to sm
r2 - r2_sm

1.1102230246251565e-16

Adjusted R-squared is just a version of R squared that is penalized for having additional parameters. As you can see for a two-parameter model (i.e. simple linear regression), it reduces to R-squared:

$$ R^2_{adj.} = 1 - \frac{SSE}{SST}\left(\frac{n - 1}{n-p-1}\right) $$



In [151]:
r2_adj = 1 - (sse/sst) * (n - 1)/(n - p - 1)
r2_adj

0.8713242561040214

The F-statistic is a measure of how well the entire model would generalize to unseen data. The greater the F-statistic, the better we assume the model generalizes. In order to put a probability to that, we take the area under the right tail of the F distribution (i.e. where x >= F-stat) with the degrees of freedom we computed above. That value is the probability that we would see our model results assuming the null hypothesis (that our mode does not generalize at all) were true. Like all p-values, the lower the value, the more confident we can be in our model.

The F-Statistic is computed as:

$$F_{statistic} = \frac{MSM}{MSE}$$

The denominator of that, the MSE, is the mean version of hte SSE we computed above and is computed as: 
$$ MSE = \frac{SSE}{\text{df}_{residual}}$$,
and if you recall $\text{df}_{residual} = n-p$.

The numerator, the MSM, is the mean version of the SSM, sum of squares model, which we have not yet computed. The SSM is the sum of the squares of the differences between each $\hat{y_i}$ and $\bar{y}$. You can think of it as how much we gain from the model vs. our baseline guess. If we didn't have our model, and we just had all the values of $y$, if we had to guess the value of a randomly drawn $y_i$, our best guess would be the mean, $\bar{y}$:
$$SSM =\sum_{i=1}^n (\hat{y_i} - \bar{y})^2 $$

(Note that $SST = SSE + SSM$)

MSM is then computed as:
$$ MSM = \frac{SSM}{\text{df}_{model}} $$,
and if you recall, $\text{df}_{model} = p-1$.

In [153]:
ssm = ((yhat - ybar)**2).sum()
msm = ssm / dfm

mse = sse / dfr
f = msm / mse
f

197.5613403619848

In [136]:
res.summary()

0,1,2,3
Dep. Variable:,unemp,R-squared:,0.876
Model:,OLS,Adj. R-squared:,0.872
Method:,Least Squares,F-statistic:,197.6
Date:,"Mon, 15 May 2023",Prob (F-statistic):,6.21e-85
Time:,20:53:05,Log-Likelihood:,-151.94
No. Observations:,203,AIC:,319.9
Df Residuals:,195,BIC:,346.4
Df Model:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
constant,-17.5391,2.806,-6.250,0.000,-23.074,-12.004
realgdp,-0.0116,0.000,-27.105,0.000,-0.012,-0.011
realcons,0.0087,0.001,10.055,0.000,0.007,0.010
realgovt,-0.0030,0.001,-4.824,0.000,-0.004,-0.002
realdpi,0.0029,0.001,4.112,0.000,0.002,0.004
cpi,0.0799,0.006,13.065,0.000,0.068,0.092
m1,-0.0030,0.001,-4.945,0.000,-0.004,-0.002
pop,0.1907,0.018,10.646,0.000,0.155,0.226

0,1,2,3
Omnibus:,0.415,Durbin-Watson:,0.779
Prob(Omnibus):,0.813,Jarque-Bera (JB):,0.204
Skew:,0.057,Prob(JB):,0.903
Kurtosis:,3.106,Cond. No.,861000.0


In [61]:
resids

0      0.300955
1     -0.030197
2     -0.163874
3      0.025448
4     -0.062204
         ...   
198    1.375159
199    1.139917
200   -0.457351
201    0.091882
202    0.872710
Length: 203, dtype: float64

In [70]:
evs

array([7.69015952e+00, 2.90200619e-01, 1.11691444e-02, 5.44482293e-03,
       2.65061637e-03, 2.20352640e-05, 2.18543567e-04, 1.34697715e-04])

In [72]:
X_Norm = X.iloc[:, 0:]/X.iloc[:, 0:].apply(np.linalg.norm)
evs = np.linalg.eigvals(X_Norm.T@X_Norm)
np.sqrt(max(evs)/min(evs))

590.7565501644921

In [63]:
X.apply(np.linalg.norm)

constant        14.247807
realgdp     112576.023445
realcons     76207.337767
realgovt      9660.703425
realdpi      83134.873746
cpi           1732.003865
m1           11508.543780
pop           3456.637401
dtype: float64

In [127]:
b = IGM @ X.T @ y
b

0    -18.203965
1     -0.011142
2      0.007961
3     -0.000321
4     -0.002995
5      0.003010
6      0.087879
7     -0.004386
8      0.071531
9      0.193646
10    -0.119283
11    -0.131872
dtype: float64

In [239]:
res.params

constant   -18.203965
realgdp     -0.011142
realcons     0.007961
realinv     -0.000321
realgovt    -0.002995
realdpi      0.003010
cpi          0.087879
m1          -0.004386
tbilrate     0.071531
pop          0.193646
infl        -0.119283
realint     -0.131872
dtype: float64

In [128]:
yh = X @ np.array(b)
yh

0      5.809808
1      5.094347
2      5.384577
3      5.439211
4      5.217286
         ...   
198    4.719013
199    5.684469
200    8.556952
201    9.125579
202    8.724699
Length: 203, dtype: float64

In [129]:
e = yh-y
e

0      0.009808
1     -0.005653
2      0.084577
3     -0.160789
4      0.017286
         ...   
198   -1.280987
199   -1.215531
200    0.456952
201   -0.074421
202   -0.875301
Length: 203, dtype: float64

In [205]:
sse = (e**2).sum()
sse

51.53020908619396

In [240]:
res.ssr

51.53020908619406

In [177]:
n = len(y)
n

203

In [178]:
p = X.shape[1]
p

12

In [246]:
mse = sse/dfe
mse

0.26979167060834536

In [243]:
res.mse_resid

0.26979167060834586

In [254]:
se = (np.diag(mse * GM)**(1/2))
se

array([3.23384832e+00, 6.59616837e-04, 9.30922202e-04, 6.35480239e-04,
       7.74340976e-04, 7.30971746e-04, 7.36467163e-03, 8.44282484e-04,
       1.95455456e-01, 2.09826255e-02, 1.94754872e-01, 1.94644087e-01])

In [256]:
res.bse

constant    3.233848
realgdp     0.000660
realcons    0.000931
realinv     0.000635
realgovt    0.000774
realdpi     0.000731
cpi         0.007365
m1          0.000844
tbilrate    0.195455
pop         0.020983
infl        0.194755
realint     0.194644
dtype: float64

In [185]:
t = b/se
t

0     -5.629196
1    -16.891324
2      8.551655
3     -0.504974
4     -3.868399
5      4.118321
6     11.932456
7     -5.195064
8      0.365969
9      9.228855
10    -0.612479
11    -0.677505
dtype: float64

In [255]:
res.tvalues

constant    -5.629196
realgdp    -16.891324
realcons     8.551655
realinv     -0.504974
realgovt    -3.868399
realdpi      4.118321
cpi         11.932456
m1          -5.195064
tbilrate     0.365969
pop          9.228855
infl        -0.612479
realint     -0.677505
dtype: float64

In [192]:
t.apply(lambda x: sp.stats.t.sf(abs(x), dfe)*2)

0     6.386736e-08
1     9.338393e-40
2     3.881400e-15
3     6.141595e-01
4     1.501928e-04
5     5.677926e-05
6     6.920758e-25
7     5.224382e-07
8     7.147933e-01
9     5.217379e-17
10    5.409497e-01
11    4.989052e-01
dtype: float64

In [257]:
res.pvalues

constant    6.386736e-08
realgdp     9.338393e-40
realcons    3.881400e-15
realinv     6.141595e-01
realgovt    1.501928e-04
realdpi     5.677926e-05
cpi         6.920758e-25
m1          5.224382e-07
tbilrate    7.147933e-01
pop         5.217379e-17
infl        5.409497e-01
realint     4.989052e-01
dtype: float64

In [193]:
lb = b - sp.stats.t.isf(0.025, dfe) * se
lb

0    -24.582607
1     -0.012443
2      0.006125
3     -0.001574
4     -0.004523
5      0.001569
6      0.073352
7     -0.006051
8     -0.313998
9      0.152258
10    -0.503430
11    -0.515800
dtype: float64

In [259]:
res.conf_int()

Unnamed: 0,0,1
constant,-24.582607,-11.825322
realgdp,-0.012443,-0.009841
realcons,0.006125,0.009797
realinv,-0.001574,0.000933
realgovt,-0.004523,-0.001468
realdpi,0.001569,0.004452
cpi,0.073352,0.102405
m1,-0.006051,-0.002721
tbilrate,-0.313998,0.457059
pop,0.152258,0.235033


In [194]:
ub = b + sp.stats.t.isf(0.025, dfe) * se
ub

0    -11.825322
1     -0.009841
2      0.009797
3      0.000933
4     -0.001468
5      0.004452
6      0.102405
7     -0.002721
8      0.457059
9      0.235033
10     0.264863
11     0.252056
dtype: float64

In [172]:
sse = ((yh - y)**2).sum()
sse

51.53020908619396

In [196]:
ssm = ((yh - y.mean())**2).sum()
ssm

378.2124510096161

In [197]:
sst = ((y - y.mean())**2).sum()
sst

429.7426600985222

In [207]:
ssm + sse

429.74266009581004

In [208]:
f = msm/mse
f

127.44259307284828

In [237]:
res.fvalue

127.44259307376187

In [212]:
sp.stats.f.sf(f, dfm, dfe)

1.071809644299557e-81

In [234]:
res.f_pvalue

1.071809643657869e-81

In [230]:
ll = (n/2) * np.log(1/(2 * np.pi * (sse/n)))- (1/(2 * (sse/n))) * sse 
ll

-148.88418968708558

In [221]:
r2 = 1 - sse/sst
r2

0.8800905428509699

In [225]:
r2_adj = 1 - (sse/dfe)/(sst/(n-1))
r2_adj

0.8731847625963137

In [231]:
aic = 2*p - 2*ll
aic

321.76837937417116

In [232]:
bic = -2 * ll + np.log(n) * k
bic

361.5268511226726

In [222]:
dfe

191

In [223]:
n-p

191

In [24]:
data.data

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
0,Aruba,ABW,"Fertility rate, total (births per woman)",SP.DYN.TFRT.IN,4.820,4.655,4.471,4.271,4.059,3.842,...,1.786,1.769,1.754,1.739,1.726,1.713,1.701,1.690,,
1,Andorra,AND,"Fertility rate, total (births per woman)",SP.DYN.TFRT.IN,,,,,,,...,,,1.240,1.180,1.250,1.190,1.220,,,
2,Afghanistan,AFG,"Fertility rate, total (births per woman)",SP.DYN.TFRT.IN,7.671,7.671,7.671,7.671,7.671,7.671,...,7.136,6.930,6.702,6.456,6.196,5.928,5.659,5.395,,
3,Angola,AGO,"Fertility rate, total (births per woman)",SP.DYN.TFRT.IN,7.316,7.354,7.385,7.410,7.425,7.430,...,6.704,6.657,6.598,6.523,6.434,6.331,6.218,6.099,,
4,Albania,ALB,"Fertility rate, total (births per woman)",SP.DYN.TFRT.IN,6.186,6.076,5.956,5.833,5.711,5.594,...,2.004,1.919,1.849,1.796,1.761,1.744,1.741,1.748,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214,"Yemen, Rep.",YEM,"Fertility rate, total (births per woman)",SP.DYN.TFRT.IN,7.292,7.308,7.328,7.354,7.385,7.418,...,5.588,5.393,5.199,5.010,4.829,4.658,4.498,4.348,,
215,South Africa,ZAF,"Fertility rate, total (births per woman)",SP.DYN.TFRT.IN,6.173,6.144,6.103,6.049,5.984,5.911,...,2.721,2.675,2.627,2.580,2.538,2.500,2.467,2.438,,
216,"Congo, Dem. Rep.",COD,"Fertility rate, total (births per woman)",SP.DYN.TFRT.IN,6.001,6.015,6.030,6.048,6.067,6.089,...,6.809,6.728,6.642,6.550,6.454,6.354,6.251,6.146,,
217,Zambia,ZMB,"Fertility rate, total (births per woman)",SP.DYN.TFRT.IN,7.018,7.071,7.127,7.184,7.240,7.292,...,5.974,5.954,5.932,5.908,5.881,5.849,5.813,5.773,,
