# Homework 2: Building a GLM from scratch using only numpy (and pandas for data management)

In [1]:
import numpy as np
import pandas as pd

Going forward, note that Python uses the `@` operator for matrix multiplication.
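
As a quick illustration on toy arrays (not the homework data), `@` agrees with `np.dot` and `np.matmul` for the 1-D and 2-D arrays used here, while `*` is elementwise:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([1.0, -1.0])

# Matrix-vector product: each row of A dotted with v
Av = A @ v
print(Av)  # [-1. -1.]

# For these 1-D and 2-D arrays, @ agrees with np.dot and np.matmul
same = np.allclose(Av, np.dot(A, v)) and np.allclose(A @ A, np.matmul(A, A))
print(same)

# By contrast, * multiplies elementwise (v is broadcast across the rows of A)
elementwise = A * v
```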

In [2]:
backpain = pd.read_csv('backpain.csv')

Next, I hand-code a GLM function. The goal is to predict `pain_diff` from five predictors: patient id (`id`), treatment group (`group`), `bpi_intensity`, `gender`, and patient status (`is_patient`).

Include expressions for: 

   * $\hat{\beta}$, which is computed by $\hat{\beta} = (X^TX)^{-1}X^Ty$
   * $var(e)$, which is the error variance, computed by $var(e) = \frac{1}{n-1} \sum_{i=1}^{n} {(e_i)^2} = \frac{S}{n-1}$
   * $R^2$, which is computed by: $R^2 = \frac{SS_{\text{tot}} - SS_{\text{res}}}{SS_{\text{tot}}}$ or $R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$

The following will help me get there:

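* $S = (y - X\beta)^T(y - X\beta)$ is the sum of squared errors (SSE) as a function of $\beta$. This is the quantity the model minimizes.
* $SS_{\text{res}} = e^Te$, with $e = y - X\hat{\beta}$, is the sum of squared residuals, i.e. $S$ evaluated at the fitted $\hat{\beta}$.
* $SS_{\text{tot}} = (y - \bar{y})^T(y - \bar{y})$ is the total sum of squares, the variation of $y$ around its mean.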
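
For reference, the expression for $\hat{\beta}$ follows from setting the gradient of $S$ to zero. A sketch, assuming $X^TX$ is invertible (i.e. $X$ has full column rank):

$$
\begin{aligned}
S(\beta) &= (y - X\beta)^T(y - X\beta) = y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta \\
\frac{\partial S}{\partial \beta} &= -2X^Ty + 2X^TX\beta = 0 \\
X^TX\hat{\beta} &= X^Ty \\
\hat{\beta} &= (X^TX)^{-1}X^Ty
\end{aligned}
$$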


In [3]:
# estimate the beta coefficients via the normal equations:
# beta_hat = (X^T X)^{-1} X^T y
def get_beta_estimates(X, y):
    betas = np.linalg.inv(X.T @ X) @ (X.T @ y)
    return betas
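
Explicitly inverting $X^TX$ works here, but solving the normal equations directly is usually preferred numerically. The sketch below is an alternative, not what the homework function does; `get_beta_estimates_solve` and the toy data are hypothetical names introduced only for illustration:

```python
import numpy as np

def get_beta_estimates_solve(X, y):
    # Solve (X^T X) b = X^T y without forming the inverse explicitly
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data (not the backpain set): y = 2*x1 - 1*x2 exactly, no noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([2.0, -1.0])

b_inv = np.linalg.inv(X_demo.T @ X_demo) @ (X_demo.T @ y_demo)
b_solve = get_beta_estimates_solve(X_demo, y_demo)
# np.linalg.lstsq minimizes ||y - Xb||^2 directly
b_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# All three routes should agree on well-conditioned data
print(np.allclose(b_inv, b_solve), np.allclose(b_solve, b_lstsq))
```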

In [4]:
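# estimate the error variance var(e) = S / (n - 1)
# (despite the function name, this returns a scalar, not the covariance matrix of beta hat)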
def cov_beta_hat(X, y, B):
    S = (y - (X @ B)).T @ (y - (X @ B))
    n = X.shape[0]
    return S/(n-1)
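
The function above divides by $n-1$, as in the formula given earlier. Another common convention divides by $n-p$, the residual degrees of freedom, where $p$ is the number of columns in $X$. A minimal sketch that exposes the divisor as a parameter; `error_variance` and `ddof` are hypothetical names, and `B_toy` is just a fixed vector for illustration, not a fitted estimate:

```python
import numpy as np

def error_variance(X, y, B, ddof=1):
    # ddof=1 reproduces the n-1 divisor used above;
    # ddof=X.shape[1] gives the n-p divisor instead
    e = y - X @ B
    return (e @ e) / (X.shape[0] - ddof)

X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 2.0]])
B_toy = np.array([1.0, 1.0])
# Residuals are +/-0.5 by construction, so e^T e = 1.0
y_toy = X_toy @ B_toy + np.array([0.5, -0.5, 0.5, -0.5])

var_n_minus_1 = error_variance(X_toy, y_toy, B_toy)           # 1.0 / 3
var_n_minus_p = error_variance(X_toy, y_toy, B_toy, ddof=2)   # 1.0 / 2
print(var_n_minus_1, var_n_minus_p)
```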

In [5]:
# compute the total sum of squares (variation of y around its mean)
def SS_tot(y):
    return np.dot((y - np.mean(y)).T, (y - np.mean(y)))

In [6]:
# estimate the sum of squared residuals
def SS_res(X, y, B):
    return np.dot((y - np.dot(X, B)).T, (y - np.dot(X, B)))

In [7]:
# estimate the R squared
def R_squared(X, y, B):
    return (SS_tot(y) - SS_res(X, y, B))/SS_tot(y)

In [8]:
# put it all together
def GLM(X, y):
    B = get_beta_estimates(X, y)       # beta hat
    S = SS_res(X, y, B)                # sum of squared residuals
    cov_beta = cov_beta_hat(X, y, B)   # error variance, S / (n - 1)
    R2 = R_squared(X, y, B)
    return B, S, cov_beta, R2
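
Before running this on real data, one way to sanity-check the pipeline is to simulate data with known coefficients and confirm they are recovered. The sketch below is self-contained, so it restates the steps above in one hypothetical helper, `fit_ols`. The simulated data and `true_B` are made up for illustration:

```python
import numpy as np

def fit_ols(X, y):
    # Condensed restatement of the steps above (hypothetical helper name)
    B = np.linalg.inv(X.T @ X) @ (X.T @ y)
    resid = y - X @ B
    ss_res = resid @ resid
    ss_tot = (y - y.mean()) @ (y - y.mean())
    return B, ss_res, ss_res / (X.shape[0] - 1), 1 - ss_res / ss_tot

rng = np.random.default_rng(42)
n = 200
true_B = np.array([1.5, -0.7, 3.0])
# First column of ones acts as an intercept
X_sim = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
# Small Gaussian noise (sd = 0.1) around the true linear signal
y_sim = X_sim @ true_B + rng.normal(scale=0.1, size=n)

B_hat, S, var_e, R2 = fit_ols(X_sim, y_sim)
print(np.round(B_hat, 2))
print(round(R2, 3))
```

With noise this small, the estimates should land close to `true_B` and $R^2$ should be close to 1.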

In [9]:
# Get my matrix of predictor values (X) and my response vector (y)
# first make a list of the predictor variables I'm interested in
var_labels = ['id', 'group', 'bpi_intensity', 'gender', 'is_patient']
# subset my dataframe down to those predictors
X = backpain[var_labels].to_numpy()
# grab my response variable
y = backpain['pain_diff'].to_numpy()

In [10]:
# get results and print them out
results = GLM(X, y)
print('Beta estimates for each variable:\n')
for i, var in enumerate(var_labels):
    print(f'{var}: {results[0][i]}')
print('\n\nS (sum of squared errors): ', results[1])
print('cov_beta: ', results[2])
print('R2: ', results[3])

Beta estimates for each variable:

id: 0.0003140003130136992
group: 1.0560804398642403
bpi_intensity: 0.009906787275012041
gender: -0.1855060030488982
is_patient: -3.429480968417778


S (sum of squared errors):  215.02230361367157
cov_beta:  1.6046440568184446
R2:  0.33238629060675456


# Comparing my hand-built results with statsmodels results

In [11]:
import statsmodels.api as sm

In [12]:
# add a constant to X (statsmodels does not automatically add a constant)
X = sm.add_constant(X)
# fit the model using statsmodels OLS class and the fit method
model = sm.OLS(y, X).fit()
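
One thing worth checking is whether `X` already contains a constant column, since the hand-built fit above uses `X` without an explicit intercept. As far as I know, `sm.add_constant` by default leaves the array unchanged when it finds an existing constant column, which would be consistent with the summary listing `x1` first and reporting `Df Model: 4`. A numpy-only sketch of that check, on toy data rather than the backpain set:

```python
import numpy as np

X_toy = np.array([[3.0, 1.0], [5.0, 1.0], [7.0, 1.0]])

# Columns whose values never change already act as an intercept
is_constant = np.all(X_toy == X_toy[0], axis=0)
print(is_constant)

# Only prepend a column of ones if no constant column is present
if not is_constant.any():
    X_toy = np.column_stack([np.ones(X_toy.shape[0]), X_toy])
print(X_toy.shape)  # (3, 2)
```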

In [13]:
# print out the results
print(model.summary())
# get the r-squared
print(f"R2: {model.rsquared}")

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.332
Model:                            OLS   Adj. R-squared:                  0.312
Method:                 Least Squares   F-statistic:                     16.18
Date:                Wed, 27 Sep 2023   Prob (F-statistic):           8.88e-11
Time:                        13:50:13   Log-Likelihood:                -222.98
No. Observations:                 135   AIC:                             456.0
Df Residuals:                     130   BIC:                             470.5
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.0003      0.000      1.158      0.2

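Hooray! Looks like my hand-built model is working: the statsmodels R-squared (0.332) matches my hand-computed $R^2$ (0.3324), and the `x1` coefficient (0.0003) matches my estimate for `id`. Also nice to know that statsmodels OLS is really doing the same thing as my hand-built model.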