# Homework 4: Extending the GLM to include the following:
1. Calculate t-tests for each $\beta$
2. Calculate an F-statistic, df, and P-value for the overall model $R^2$, comparing the full model to an intercept-only model. 
3. Add the ability to perform t-tests for a set of contrasts.
4. Make your script into a function that takes in a design matrix $X$, data $y$, and optional contrast matrix $C$. 

Pulling in functions from last assignment

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

In [2]:
# estimate the beta coefficients
def get_beta_estimates(X, y):
    betas = np.linalg.inv(X.T @ X) @ np.dot(X.T, y)
    return betas

In [3]:
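# NOTE: despite the name, this returns a single number, SS_res / (n - 1),
# i.e. an estimate of the residual variance rather than the covariance matrix of beta hat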
def cov_beta_hat(X, y, B):
    S = (y - (X @ B)).T @ (y - (X @ B))
    n = X.shape[0]
    return S/(n-1)

In [4]:
# total sum of squares (squared deviations of y from its mean)
def SS_tot(y):
    return np.dot((y - np.mean(y)).T, (y - np.mean(y)))

In [79]:
# estimate the sum of squared residuals
def SS_res(X, y, B):
    if not isinstance(B, np.ndarray):
        B = np.array([B])
    return np.dot((y - np.dot(X, B)).T, (y - np.dot(X, B)))

In [80]:
# estimate the R squared
def R_squared(X, y, B):
    return (SS_tot(y) - SS_res(X, y, B))/SS_tot(y)

In [81]:
# put it all together
def GLM(X, y):
    B = get_beta_estimates(X, y)
    S = SS_res(X, y, B)
    n = X.shape[0]
    cov_beta = cov_beta_hat(X, y, B)
    R2 = R_squared(X, y, B)
    return B, S, cov_beta, R2

In [110]:
backpain = pd.read_csv('../week2/backpain.csv')
# subset to variables of interest
var_labels = ['group', 'bpi_intensity', 'gender', 'is_patient']
X = backpain[var_labels].to_numpy()
y = backpain['pain_diff'].to_numpy()
backpain.head()

Unnamed: 0.1,Unnamed: 0,id,1,2,group,bpi_intensity,gender,is_patient,pain_diff
0,0,12,2.5,2.75,3.0,2.5,1.0,1,0.25
1,2,14,2.5,2.0,3.0,2.5,2.0,1,-0.5
2,4,15,2.25,2.75,3.0,2.25,1.0,1,0.5
3,6,18,4.5,2.25,2.0,4.5,1.0,1,-2.25
4,8,23,2.5,2.25,2.0,2.5,2.0,1,-0.25


# 1. Calculate t-tests for each $\beta$

The t-statistic for testing $\beta_j = 0$ is given by:
$t_j = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{(X^T X)^{-1}_{jj}}}$

Within that function, I see that I need to calculate the following:
1. $\hat{\beta}_j$ - the estimate of the beta coefficient for the jth predictor
2. $\hat{\sigma}$ - the estimated standard deviation of the residuals
3. $(X^T X)^{-1}_{jj}$ - the jth diagonal element of the inverse of $X^T X$, where j is the index of the predictor in the design matrix. $X^T X$ is the product of the transposed design matrix with the design matrix itself. It holds the raw (uncentered, unscaled) sums of squares and cross-products of the predictors, so it is closely related to their covariance matrix but is not the covariance matrix itself. Scaling its inverse by $\hat{\sigma}^2$ gives the covariance matrix of $\hat{\beta}$, whose diagonal holds the squared standard errors.

My understanding is that the t-statistic is the ratio of the estimated coefficient to its standard error, $\hat{\sigma}\sqrt{(X^T X)^{-1}_{jj}}$. The estimated standard deviation of the residuals, $\hat{\sigma}$, is the square root of the sum of squared residuals divided by the degrees of freedom. The degrees of freedom is the number of observations minus the number of estimated coefficients (the columns of $X$).

So, reducing the standard deviation of the residuals would mean that I would see less spread of the data points around the regression line. Less spread means a smaller standard error, so my beta can be smaller but still be significant. In the same vein, if I have a large spread around that line, then my beta would have to be larger to be significant.

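Also, adding more data points to an otherwise very noisy cloud of data points surrounding the regression line would not, by itself, shrink the standard deviation of the residuals: the sum of squared residuals grows along with the degrees of freedom, so their ratio stays close to the true noise variance. What does shrink is the $\sqrt{(X^T X)^{-1}_{jj}}$ term, because the entries of $X^T X$ grow with every added observation. So more data gives a smaller standard error and a larger t-statistic for the same beta.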
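
To check how sample size affects these quantities, here is a small simulation sketch. The data are synthetic and the helper name `se_and_sigma` is made up for illustration; none of this comes from the backpain data. It fits the same one-predictor model at two sample sizes and reports $\hat{\sigma}$ and the standard error of the slope.

```python
import numpy as np

def se_and_sigma(n, rng, true_sigma=1.0):
    """Simulate y = 1 + 0.5*x + noise and return (sigma_hat, SE of the slope)."""
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])           # intercept + one predictor
    y = 1.0 + 0.5 * x + rng.normal(scale=true_sigma, size=n)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma_hat = np.sqrt(resid @ resid / (n - X.shape[1]))  # residual SD, df = n - p
    se_slope = sigma_hat * np.sqrt(XtX_inv[1, 1])          # standard error of the slope
    return sigma_hat, se_slope

rng = np.random.default_rng(0)
sigma_small, se_small = se_and_sigma(50, rng)
sigma_large, se_large = se_and_sigma(5000, rng)
print(f"n=50:   sigma_hat={sigma_small:.3f}, SE(slope)={se_small:.3f}")
print(f"n=5000: sigma_hat={sigma_large:.3f}, SE(slope)={se_large:.3f}")
```

In a run like this, $\hat{\sigma}$ should land near the true noise level of 1 at both sample sizes, while the standard error of the slope shrinks roughly like $1/\sqrt{n}$.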

In [111]:
# make a function for the estimated variance of the residuals
def sigma_hat(X, y, B):
    n = X.shape[0]
    p = X.shape[1]
    return SS_res(X, y, B)/(n-p)

def t_stat(X, y, B):
    n = X.shape[0] # number of observations
    p = X.shape[1] # number of predictors
    # get sigma_hat (estimated variance of residuals)
    t = {}
    for j, varname in enumerate(var_labels):
        t[varname] = B[j] / (sigma_hat(X,y,B)**2 * np.sqrt(np.linalg.inv(X.T @ X)[j,j]))
        # reminder - try replacing the * with @ to see if that changes anything.
        
    return t
        
    """
    This function is an algorithm. For each beta, it finds the index of the beta in the beta vector. This is provided 
    by the j term of the enumerate function in the loop. The name of the variable, `varname`, is also provided as the second
    value of each iteration of the loop. The variable `t` is a dictionary. The key is the variable name, and the value is the
    t-statistic for that variable (these key-value pairs are populated on each successive iteration of the loop).
    
    On each iteration, the t-statistic is calculated by dividing the beta estimate B[j] by the product of the estimated standard 
    deviation of the residuals and the square root of the jth diagonal element of the inverse of X.T @ X.
    
    The function returns the dictionary of t-statistics.
    """

Below I'm gonna first get the betas and other values from the GLM function I wrote in the last assignment. Then I'm gonna use those values to calculate the t-statistic for each beta.

In [112]:
B, S, cov_beta, R2 = GLM(X, y)
for i, var in enumerate(var_labels):
    print(f'{var} beta = : {B[i]}')
print('\nSum of squared residuals: ', S)
print('Covariance of beta hat: ', cov_beta)
print('R squared: ', R2)

group beta = : 1.0653366477226105
bpi_intensity beta = : -0.005826644237355405
gender beta = : -0.1962808372279472
is_patient beta = : -3.176285499149767

Sum of squared residuals:  217.2416875350979
Covariance of beta hat:  1.621206623396253
R squared:  0.32549541878811156


Now I'm gonna use the t_stat function I wrote above to calculate the t-statistic for each beta.

In [113]:
tvals = t_stat(X, y, B)
print('T-statistics: ', tvals)

T-statistics:  {'group': 3.6435017708995323, 'bpi_intensity': -0.03776585604790268, 'gender': -0.4136203237019821, 'is_patient': -3.0363867561782234}


Yay! Now, for sanity's sake, I'm gonna compare my t-statistics to the ones from the statsmodels package.

In [114]:
X_const = sm.add_constant(X)
model = sm.OLS(y, X_const)
results = model.fit()
print(results.summary()) 
print(f't-values: {results.tvalues}')
# the p-values are in the P>|t| column of the summary table
print(f'p-values for each t-statistic: {results.pvalues}')

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.325
Model:                            OLS   Adj. R-squared:                  0.310
Method:                 Least Squares   F-statistic:                     21.07
Date:                Fri, 13 Oct 2023   Prob (F-statistic):           3.35e-11
Time:                        12:05:22   Log-Likelihood:                -223.67
No. Observations:                 135   AIC:                             455.3
Df Residuals:                     131   BIC:                             467.0
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             1.0653      0.137      7.781      0.0

In [116]:
for i, label in enumerate(var_labels):
    print(f'{label} beta coefficient = {results.params[i]}')

group beta coefficient = 1.0653366477226107
bpi_intensity beta coefficient = -0.00582664423735382
gender beta coefficient = -0.19628083722794493
is_patient beta coefficient = -3.1762854991497687


Whoops, looks like I made a mistake. My hand-calculated t-values are much smaller than the ones from the statsmodels package. However, they are all smaller by the same proportion. So I'm close, but clearly something is wrong with how I calculated them. 

Again, here's the formula for the t-statistic:
$t_j = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{(X^T X)^{-1}_{jj}}}$

And here's my code:
```
def t_stat(X, y, B):
    n = X.shape[0] # number of observations
    p = X.shape[1] # number of predictors
    # get sigma_hat (estimated variance of residuals)
    t = {}
    for j, varname in enumerate(var_labels):
        t[varname] = B[j] / (sigma_hat(X,y,B)**2 * np.sqrt(np.linalg.inv(X.T @ X)[j,j]))
        # reminder - try replacing the * with @ to see if that changes anything.
        
    return t
```

The problem is the `sigma_hat(X,y,B)**2` term: I'm squaring where I should be taking the square root! My `sigma_hat` function actually returns the sum of squared residuals divided by the degrees of freedom, which is the estimated *variance* $\hat{\sigma}^2$, not the standard deviation. So, I need to take the square root of that value to get the estimated standard deviation of the residuals (perhaps I need to rename the function).

In [117]:
def t_stat(X, y, B):
    # sigma_hat() returns the residual variance, so take the square root to get sigma
    sigma = np.sqrt(sigma_hat(X, y, B))
    # (X^T X)^{-1} does not depend on j, so invert once outside the loop
    XtX_inv = np.linalg.inv(X.T @ X)
    t = {}
    for j, varname in enumerate(var_labels):  # var_labels is the global list of column names
        ### NOW TAKE SQRT ###
        t[varname] = B[j] / (sigma * np.sqrt(XtX_inv[j, j]))
        
    return t

In [118]:
tvals = t_stat(X, y, B)
print('T-statistics: ', tvals)

T-statistics:  {'group': 7.780845521775375, 'bpi_intensity': -0.08065051436321638, 'gender': -0.8833029447374987, 'is_patient': -6.4843268316443075}


In [119]:
# saving these to put in the table later
tstats = list(tvals.values())

Great - now I have the right t-values. Next I'm going to calculate the p-values for each t-statistic.

From the book:
The P-value comes from a t-distribution, which is a Normal (Gaussian) distribution adjusted for the fact that there are not really $n$ independent observations, and thus not really $n$ error df, if we have estimated some parameters ($p$ parameters, to be precise) and removed them when estimating the residual error. The t-statistic, $t_j$, can be compared to a t-distribution with $n - p$ degrees of freedom to determine the significance of the $j$-th predictor.

So to get p-values I need to compare each t-statistic against a null *distribution*, rather than a single value. I think the best way to do this in Python is to use the t-distribution that ships with scipy rather than building one myself. I'll use the `stats` module within the scipy package, and the `t.sf` function to calculate the p-value for each t-statistic. I'll also assume we want a two-tailed test for now, so I'll multiply the one-tailed p-value by 2.

The `t.sf` function takes two arguments: the t-statistic and the degrees of freedom. It represents the survival function, which is the complement of the cumulative distribution function (CDF). 
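
As a quick check of that relationship, here is a small sketch using the `gender` t-statistic printed above and the residual degrees of freedom (135 observations minus 4 columns). The variable names are just for illustration.

```python
import numpy as np
from scipy import stats

t_value = -0.8833029447374987   # t-statistic for `gender` printed above
dof = 131                       # n - p = 135 - 4

upper_tail = stats.t.sf(np.abs(t_value), dof)        # P(T > |t|)
via_cdf = 1.0 - stats.t.cdf(np.abs(t_value), dof)    # same quantity via the CDF
p_two_tailed = 2 * upper_tail                        # both tails, by symmetry

print(upper_tail, via_cdf, p_two_tailed)
```

The two one-tailed values should agree to within floating-point error. For very small tail probabilities `sf` is usually the better choice, because `1 - cdf` loses precision once the CDF is close to 1.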

In [120]:
from scipy import stats

In [121]:
# get the degrees of freedom for the full model
n = X.shape[0]
p = X.shape[1]
df = n - p

In [122]:
# get the p-values
pvals = {}
for varname, tstat in tvals.items():
    pvals[varname] = stats.t.sf(np.abs(tstat), df)*2
    
# saving these for later
pvalues = list(pvals.values())
print('P-values: ', pvals)

P-values:  {'group': 1.9002890971436953e-12, 'bpi_intensity': 0.9358429755813209, 'gender': 0.37869073716402013, 'is_patient': 1.6659160795132115e-09}


Now I'll integrate all this into a new and improved GLM function.

In [123]:
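def fancy_GLM(X, y, predictor_names):
    B, S, cov_beta, R2 = GLM(X, y)
    df = X.shape[0] - X.shape[1]  # residual degrees of freedom, n - p
    # t_stat returns a dict keyed by the global var_labels, in column order
    tvals = list(t_stat(X, y, B).values())
    # two-tailed p-values from the t-distribution with n - p degrees of freedom
    pvals = [stats.t.sf(np.abs(tstat), df)*2 for tstat in tvals]
        
    # build the table from the values computed in this call, not from notebook globals
    output = pd.DataFrame({'Predictor': predictor_names, 'Beta': B, 'T-statistic': tvals, 'P-value (two-tailed)': pvals})
    return output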

In [124]:
new_model_results = fancy_GLM(X, y, var_labels)
new_model_results

Unnamed: 0,Predictor,Beta,T-statistic,P-value (two-tailed)
0,group,1.065337,7.780846,1.900289e-12
1,bpi_intensity,-0.005827,-0.080651,0.935843
2,gender,-0.196281,-0.883303,0.3786907
3,is_patient,-3.176285,-6.484327,1.665916e-09


# 2. Calculate an F-statistic, df, and P-value for the overall model $R^2$, comparing the full model to an intercept-only model.

From the book:

The F-test is used to compare the fits of different models. Specifically, it is used to test the hypothesis that a set of predictors has no effect on the response variable. 

Given:

 $SSR_{\text{full}}$ = Sum of Squared Residuals for the full model

 $SSR_{\text{reduced}}$ = Sum of Squared Residuals for a reduced model

 $p$ = Number of predictors in the full model

 $q$ = Number of predictors in the reduced model (with $q < p$)

 $n$ = Total number of observations

The F-statistic is calculated as:


$$F = \frac{(SSR_{\text{reduced}} - SSR_{\text{full}}) / (p - q)}{SSR_{\text{full}} / (n - p)}$$

The difference between $SSR_{\text{reduced}}$ and $SSR_{\text{full}}$ is how much the sum of squared residuals drops when the extra $p - q$ predictors are added to the reduced model. The numerator divides that drop by the number of added predictors, and the denominator is the residual variance of the full model, so a large $F$ means the added predictors explain more than noise alone would.


In [125]:
# calculate the sum of squared residuals for the intercept-only model (i.e., the reduced model)
X_intercept = np.ones((X.shape[0], 1)) # this is a column of ones - statsmodels can do this with sm.add_constant(X)
B_intercept, S_intercept, cov_beta_intercept, R2_intercept = GLM(X_intercept, y) 
print('Sum of squared residuals for intercept-only model: ', S_intercept) # this is SSreduced

Sum of squared residuals for intercept-only model:  322.07592592592596


In [126]:
# calculate the explained variance
explained_variance = S_intercept - S
print('Explained variance: ', explained_variance) # this is SSRreduced - SSRfull

Explained variance:  104.83423839082806


In [127]:
# Again, for my own note, R-squared is the explained sum of squares as a proportion of the total (intercept-only) sum of squares
R2 = explained_variance / S_intercept
print('R-squared: ', R2) # this is R2full

R-squared:  0.32549541878811156


In [128]:
def F_stat(X, y, B):
    n = X.shape[0] # number of observations
    p = X.shape[1] # number of predictors
    # q = number of predictors in the reduced model (not used in this first attempt)
    F = (SS_res(X, y, B)/p) / (SS_res(X, y, B)/(n-p))
    return F


In [129]:
# given a design matrix X, a response vector y, and a list of beta values B, calculate the F-statistic
def F_stat(X, y, B):
    n = X.shape[0] # number of observations
    p = X.shape[1] # number of predictors
    # get sigma_hat (estimated variance of residuals)
    sigma_hat = SS_res(X, y, B)/(n-p)
    # get sigma_hat_reduced (estimated variance of residuals for reduced model)
    sigma_hat_reduced = SS_res(X_intercept, y, B_intercept)/(n-1)
    # calculate F-statistic
    F = ((sigma_hat_reduced**2 - sigma_hat**2)/p)/(sigma_hat**2/(n-p))
    return F

F = F_stat(X, y, B)
print('F-statistic: ', F)

F-statistic:  36.04779061524388
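
This value does not match the F-statistic in the statsmodels summary above (21.07). The gap appears to come from squaring the two variance estimates and dividing by $p$ rather than $p - q$. As a cross-check, here is a minimal sketch that plugs the sums of squares printed earlier straight into the formula. It assumes that `is_patient` is constant in this sample and so acts as the intercept (statsmodels reports Df Model = 3), which makes $q = 1$; the function name `f_from_ssr` is made up for illustration.

```python
def f_from_ssr(ssr_reduced, ssr_full, n, p, q):
    """F-statistic for nested models, computed from sums of squared residuals."""
    numerator = (ssr_reduced - ssr_full) / (p - q)   # drop in SSR per added predictor
    denominator = ssr_full / (n - p)                 # residual variance of the full model
    return numerator / denominator

ssr_reduced = 322.07592592592596   # intercept-only model, printed above
ssr_full = 217.2416875350979       # full model, printed above
n, p, q = 135, 4, 1                # observations, full-model columns, reduced-model columns (assumed)

F_check = f_from_ssr(ssr_reduced, ssr_full, n, p, q)
print(f"F({p - q}, {n - p}) = {F_check:.2f}")
```

Under those assumptions the result should agree with the statsmodels value to two decimals.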


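To get the p-value for the F I'm using the F-distribution in `scipy.stats`, which works like the function I used for the t-statistic. It takes three arguments: the F-statistic, the degrees of freedom for the numerator, and the degrees of freedom for the denominator. In the formula above these are $p - q$ (the number of predictors added to the reduced model) and $n - p$ (the residual df of the full model). `scipy.stats.f.sf` gives the upper-tail probability directly, which is the same quantity as `1 - stats.f.cdf(...)`.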

In [130]:
# get the p-value for the F-statistic
pval = 1 - stats.f.cdf(F, p, n-p)
print('P-value for F-statistic: ', pval)

P-value for F-statistic:  1.1102230246251565e-16
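
The printed value is exactly $2^{-53}$, the size of floating-point rounding error, which is what `1 - cdf` returns once the CDF has rounded to just under 1. Below is a small sketch comparing `f.sf` with `1 - f.cdf`, using the F-statistic and degrees of freedom reported in the statsmodels summary (F = 21.07, Df Model = 3, Df Residuals = 131). These inputs are taken from that summary rather than from my own `F_stat` output.

```python
from scipy import stats

F_value = 21.07          # F-statistic reported in the statsmodels summary
dfn, dfd = 3, 131        # numerator df (p - q) and denominator df (n - p)

p_sf = stats.f.sf(F_value, dfn, dfd)            # upper-tail probability, computed directly
p_cdf = 1.0 - stats.f.cdf(F_value, dfn, dfd)    # same quantity, less precise for tiny tails
print(p_sf, p_cdf)
```

The result should be on the order of the `Prob (F-statistic)` entry in the summary (3.35e-11).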


# 3) Add the ability to perform t-tests for a set of contrasts

In [131]:
backpain

Unnamed: 0.1,Unnamed: 0,id,1,2,group,bpi_intensity,gender,is_patient,pain_diff
0,0,12,2.50,2.75,3.0,2.50,1.0,1,0.25
1,2,14,2.50,2.00,3.0,2.50,2.0,1,-0.50
2,4,15,2.25,2.75,3.0,2.25,1.0,1,0.50
3,6,18,4.50,2.25,2.0,4.50,1.0,1,-2.25
4,8,23,2.50,2.25,2.0,2.50,2.0,1,-0.25
...,...,...,...,...,...,...,...,...,...
130,278,1268,4.50,2.75,2.0,2.75,2.0,1,-1.75
131,280,1277,4.25,5.00,2.0,5.00,2.0,1,0.75
132,282,1294,5.50,5.50,2.0,5.50,2.0,1,0.00
133,284,1302,1.00,1.00,3.0,1.00,2.0,1,0.00


In [133]:
# in this case the contrast matrix is just the identity matrix

contrast_matrix = np.identity(4)
contrast_matrix

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [134]:
# run t-tests for the rows of the contrast matrix
contrast_tvals = {}
for i, row in enumerate(contrast_matrix):
    contrast_tvals[var_labels[i]] = t_stat(X, y, contrast_matrix[i])

# get the p-values
contrast_pvals = {}
for varname, tstat in tvals.items():
    contrast_pvals[varname] = stats.t.sf(np.abs(tstat), df)*2

In [136]:
# print out the results
print('T-statistics: ', contrast_tvals)
print('P-values: ', contrast_pvals)

T-statistics:  {'group': {'group': 2.5764426780723197, 'bpi_intensity': 0.0, 'gender': 0.0, 'is_patient': 0.0}, 'bpi_intensity': {'group': 0.0, 'bpi_intensity': 3.4406127184191035, 'gender': 0.0, 'is_patient': 0.0}, 'gender': {'group': 0.0, 'bpi_intensity': 0.0, 'gender': 1.723121464700032, 'is_patient': 0.0}, 'is_patient': {'group': 0.0, 'bpi_intensity': 0.0, 'gender': 0.0, 'is_patient': 0.9234338998931243}}
P-values:  {'group': 1.9002890971436953e-12, 'bpi_intensity': 0.9358429755813209, 'gender': 0.37869073716402013, 'is_patient': 1.6659160795132115e-09}
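
These t-values do not line up with the p-values printed next to them. The loop passes each contrast row to `t_stat` in place of the beta vector, and the p-values are computed from the earlier `tvals`. For reference, the usual t-statistic for a contrast vector $c$ is $t = \frac{c^T\hat{\beta}}{\hat{\sigma}\sqrt{c^T (X^T X)^{-1} c}}$, compared against a t-distribution with $n - p$ degrees of freedom. Here is a minimal sketch on synthetic data; the function name `contrast_t_tests` and the simulated design are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def contrast_t_tests(X, y, C):
    """t-statistic and two-tailed p-value for each row c of the contrast matrix C."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)                       # residual variance estimate
    C = np.atleast_2d(C)
    se = np.sqrt(sigma2 * np.diag(C @ XtX_inv @ C.T))      # SE of c^T beta for each row
    t = (C @ beta) / se
    pvals = 2 * stats.t.sf(np.abs(t), n - p)
    return t, pvals

# synthetic data: intercept plus two predictors with equal true slopes
rng = np.random.default_rng(1)
n = 200
X_demo = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y_demo = X_demo @ np.array([1.0, 2.0, 2.0]) + rng.normal(size=n)

C_demo = np.array([[0.0, 1.0, 0.0],     # tests beta_1 = 0
                   [0.0, 1.0, -1.0]])   # tests beta_1 = beta_2
t_demo, p_demo = contrast_t_tests(X_demo, y_demo, C_demo)
print(t_demo, p_demo)
```

With an identity contrast matrix this reduces to the per-beta t-tests from part 1, since each row just picks out one coefficient.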


In [140]:
B.shape[0]

4

In [150]:
# now putting that functionality into a super fancy function
def super_fancy_GLM(X, y):
    n, p = X.shape  # observations, columns of the design matrix
    df = n - p      # residual degrees of freedom
    B, S, cov_beta, R2 = GLM(X, y)
    tvals = t_stat(X, y, B)
    pvals = {}
    for varname, tstat in tvals.items():
        pvals[varname] = stats.t.sf(np.abs(tstat), df)*2
        
    # calculate the sum of squared residuals for the intercept-only model (i.e., the reduced model)
    X_intercept = np.ones((X.shape[0], 1)) 
    B_intercept, S_intercept, cov_beta_intercept, R2_intercept = GLM(X_intercept, y) 
    # calculate the F-statistic
    F = F_stat(X, y, B)
    # get the p-value for the F-statistic
    F_pval = 1 - stats.f.cdf(F, p, n-p)
    # build the table from the values computed in this call, not from notebook globals
    output_table = pd.DataFrame({'Predictor': var_labels, 'Beta': B, 'T-statistic': list(tvals.values()), 'P-value (two-tailed)': list(pvals.values())})
    contrast_matrix = np.identity(B.shape[0])
    c_tvals = {}
    for i, row in enumerate(contrast_matrix):
        c_tvals[var_labels[i]] = t_stat(X, y, contrast_matrix[i])[var_labels[i]]
    c_pvals = {}
    for varname, tstat in tvals.items():
        c_pvals[varname] = stats.t.sf(np.abs(tstat), df)*2
    return output_table, c_tvals, c_pvals, F, F_pval

In [151]:
outputs, contrast_tvals, contrast_pvals, F, F_pval = super_fancy_GLM(X, y)
display(outputs)
print('T-statistics for contrasts: ', contrast_tvals)
print('P-values for contrasts: ', contrast_pvals)
print('F-statistic: ', F)
print('P-value for F-statistic: ', F_pval)

Unnamed: 0,Predictor,Beta,T-statistic,P-value (two-tailed)
0,group,1.065337,7.780846,1.900289e-12
1,bpi_intensity,-0.005827,-0.080651,0.935843
2,gender,-0.196281,-0.883303,0.3786907
3,is_patient,-3.176285,-6.484327,1.665916e-09


T-statistics for contrasts:  {'group': 2.5764426780723197, 'bpi_intensity': 3.4406127184191035, 'gender': 1.723121464700032, 'is_patient': 0.9234338998931243}
P-values for contrasts:  {'group': 1.9002890971436953e-12, 'bpi_intensity': 0.9358429755813209, 'gender': 0.37869073716402013, 'is_patient': 1.6659160795132115e-09}
F-statistic:  36.04779061524388
P-value for F-statistic:  1.1102230246251565e-16
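
For part 4, here is one way the pieces could fit into a single function that takes $X$, $y$, and an optional contrast matrix $C$. This is a sketch, not the function that produced the output above: the name `glm_sketch`, the dictionary return format, and the synthetic data are all assumptions, and it assumes $X$ contains a constant column so that the reduced model for the F-test is intercept-only ($q = 1$).

```python
import numpy as np
from scipy import stats

def glm_sketch(X, y, C=None):
    """OLS with per-coefficient t-tests, an overall F-test, and optional contrasts.

    Assumes X contains a constant (intercept) column, so the reduced model
    for the F-test is intercept-only with q = 1.
    """
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    ssr_full = resid @ resid
    sigma2 = ssr_full / (n - p)                       # residual variance, df = n - p

    def t_and_p(rows):
        # t = c^T beta / sqrt(sigma^2 * c^T (X^T X)^{-1} c) for each row c
        se = np.sqrt(sigma2 * np.diag(rows @ XtX_inv @ rows.T))
        t = (rows @ beta) / se
        return t, 2 * stats.t.sf(np.abs(t), n - p)

    t, pvals = t_and_p(np.eye(p))                     # identity rows test each beta_j = 0

    ssr_reduced = np.sum((y - y.mean()) ** 2)         # SSR of the intercept-only model
    F = ((ssr_reduced - ssr_full) / (p - 1)) / sigma2
    result = {"beta": beta, "t": t, "p": pvals,
              "F": F, "df": (p - 1, n - p), "F_pval": stats.f.sf(F, p - 1, n - p),
              "R2": 1 - ssr_full / ssr_reduced}
    if C is not None:
        C = np.atleast_2d(np.asarray(C, dtype=float))
        result["contrast_t"], result["contrast_p"] = t_and_p(C)
    return result

# synthetic example: intercept plus two predictors, identity contrasts
rng = np.random.default_rng(42)
n_demo = 120
X_demo = np.column_stack([np.ones(n_demo), rng.normal(size=n_demo), rng.normal(size=n_demo)])
y_demo = X_demo @ np.array([0.5, 1.5, 0.0]) + rng.normal(size=n_demo)
out = glm_sketch(X_demo, y_demo, C=np.eye(3))
print(out["t"], out["F"], out["df"], out["F_pval"])
```

Returning a dictionary keeps the sketch free of pandas; the `beta`, `t`, and `p` arrays could be wrapped in a `DataFrame` like the table above if that is preferred.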
