# Fitting regression models to data

In this notebook we fit regression models and interpret their parameters.

In [1]:
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd  
import seaborn as sns 
from sklearn.datasets import load_diabetes
import statsmodels.api as sm

dm_dataset = load_diabetes() 
dm = pd.DataFrame(data=dm_dataset.data, columns=dm_dataset.feature_names)

We fit a linear regression model and output the model summary:

In [2]:
m0 = sm.OLS.from_formula("s6 ~ age + sex + bmi", dm)
r0 = m0.fit()
r0.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | s6 | R-squared: | 0.225 |
| Model: | OLS | Adj. R-squared: | 0.220 |
| Method: | Least Squares | F-statistic: | 42.39 |
| Date: | Tue, 25 Feb 2025 | Prob (F-statistic): | 4.57e-24 |
| Time: | 21:45:25 | Log-Likelihood: | 775.34 |
| No. Observations: | 442 | AIC: | -1543.0 |
| Df Residuals: | 438 | BIC: | -1526.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 1.006e-17 | 0.002 | 5.03e-15 | 1.000 | -0.004 | 0.004 |
| age | 0.2149 | 0.043 | 4.956 | 0.000 | 0.130 | 0.300 |
| sex | 0.1411 | 0.043 | 3.298 | 0.001 | 0.057 | 0.225 |
| bmi | 0.3365 | 0.043 | 7.847 | 0.000 | 0.252 | 0.421 |

| | | | |
|---|---|---|---|
| Omnibus: | 4.733 | Durbin-Watson: | 1.746 |
| Prob(Omnibus): | 0.094 | Jarque-Bera (JB): | 4.495 |
| Skew: | 0.229 | Prob(JB): | 0.106 |
| Kurtosis: | 3.187 | Cond. No. | 23.7 |


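From the output above, age, sex, and bmi are statistically significant predictors of s6 because their p-values are below $\alpha = 0.05$. The Intercept, on the other hand, is not statistically significant ($p = 1.000 > \alpha$). Its estimate (1.006e-17) is zero up to floating-point error, which is expected because the variables returned by `load_diabetes` are mean-centered.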
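
To illustrate why an intercept of essentially zero is expected with mean-centered variables, here is a minimal sketch using a one-predictor least-squares fit. The toy data and the helper `simple_ols` are hypothetical and are not part of the notebook; the same reasoning carries over to the multiple-regression case, where the intercept equals the mean of the response minus the coefficient-weighted means of the predictors.

```python
from statistics import fmean


def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    xbar, ybar = fmean(x), fmean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = ybar - slope * xbar  # the intercept absorbs the means
    return intercept, slope


# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.1, 2.9, 5.2, 8.1, 12.2]

# Center both variables by subtracting their means.
x_c = [xi - fmean(x) for xi in x]
y_c = [yi - fmean(y) for yi in y]

raw_intercept, raw_slope = simple_ols(x, y)
centered_intercept, centered_slope = simple_ols(x_c, y_c)

print("raw:     ", raw_intercept, raw_slope)
print("centered:", centered_intercept, centered_slope)
```

Centering leaves the slope unchanged and moves the intercept to zero (up to rounding error), so a non-significant intercept here says nothing about the quality of the model.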

Now we try the following regression model:

In [3]:
model = sm.OLS.from_formula("s6 ~ age + sex + bmi + bp", data=dm)
result = model.fit()
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | s6 | R-squared: | 0.256 |
| Model: | OLS | Adj. R-squared: | 0.249 |
| Method: | Least Squares | F-statistic: | 37.54 |
| Date: | Tue, 25 Feb 2025 | Prob (F-statistic): | 5.35e-27 |
| Time: | 21:45:25 | Log-Likelihood: | 784.28 |
| No. Observations: | 442 | AIC: | -1559.0 |
| Df Residuals: | 437 | BIC: | -1538.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 1.006e-17 | 0.002 | 5.13e-15 | 1.000 | -0.004 | 0.004 |
| age | 0.1654 | 0.044 | 3.749 | 0.000 | 0.079 | 0.252 |
| sex | 0.1068 | 0.043 | 2.498 | 0.013 | 0.023 | 0.191 |
| bmi | 0.2683 | 0.045 | 5.961 | 0.000 | 0.180 | 0.357 |
| bp | 0.2031 | 0.048 | 4.248 | 0.000 | 0.109 | 0.297 |

| | | | |
|---|---|---|---|
| Omnibus: | 5.285 | Durbin-Watson: | 1.766 |
| Prob(Omnibus): | 0.071 | Jarque-Bera (JB): | 5.091 |
| Skew: | 0.255 | Prob(JB): | 0.0784 |
| Kurtosis: | 3.124 | Cond. No. | 28.4 |


As the output shows, adding bp (blood pressure) improved the model's explanatory power: R-squared increased from 0.225 to 0.256 and adjusted R-squared from 0.220 to 0.249. The overall fit also improved, with AIC decreasing from -1543 to -1559 and BIC from -1526 to -1538. In addition, bp is a statistically significant predictor (p < 0.001), so it is associated with s6 even after accounting for age, sex, and bmi. Because adjusted R-squared, AIC, and BIC all penalize extra parameters and all three improved, including bp was the right decision: the gain in fit outweighs the added complexity.
The overall F-statistic does decrease (from 42.39 to 37.54), but it tests all predictors jointly and averages the explained variance over four predictors instead of three, so the drop does not measure bp's individual contribution. Judged by its t-statistic (4.248), bp contributes on a par with age (3.749) and more than sex (2.498) in this model.
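
The comparison statistics can be recomputed by hand from the numbers printed in the two summaries, which is a useful sanity check on the interpretation. The sketch below uses the standard textbook formulas; the helper names are hypothetical, and the inputs are the rounded values from the output, so results agree with the tables only up to rounding. The parameter count for AIC and BIC is assumed to include the intercept.

```python
import math


def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors (plus intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def overall_f(r2, n, k):
    """F-statistic for the joint test that all k slopes are zero."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


def partial_f(r2_reduced, r2_full, n, k_full, q=1):
    """F-statistic for the q predictors added to the reduced model."""
    return ((r2_full - r2_reduced) / q) / ((1 - r2_full) / (n - k_full - 1))


def aic(llf, n_params):
    """Akaike information criterion; n_params counts the intercept."""
    return 2 * n_params - 2 * llf


def bic(llf, n_params, n):
    """Bayesian information criterion; n_params counts the intercept."""
    return n_params * math.log(n) - 2 * llf


n = 442  # No. Observations in both summaries
models = {
    "age + sex + bmi": {"r2": 0.225, "k": 3, "llf": 775.34},
    "age + sex + bmi + bp": {"r2": 0.256, "k": 4, "llf": 784.28},
}

for name, m in models.items():
    print(name)
    print("  adjusted R2:", round(adjusted_r2(m["r2"], n, m["k"]), 3))
    print("  overall F:  ", round(overall_f(m["r2"], n, m["k"]), 2))
    print("  AIC:        ", round(aic(m["llf"], m["k"] + 1), 1))
    print("  BIC:        ", round(bic(m["llf"], m["k"] + 1, n), 1))

# Partial F-test for the single added predictor bp.
f_bp = partial_f(0.225, 0.256, n, k_full=4)
print("partial F for bp:", round(f_bp, 2))
```

The overall F divides R-squared by the number of predictors, which is why it can fall when a useful variable is added. The partial F for bp isolates that variable's contribution; for a single added predictor it should be close to the square of bp's t-statistic, with the small gap coming from the rounded R-squared inputs.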

The following code fits a logistic regression for the binary variable smq (smoking status) and outputs the model summary:

In [4]:
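da = pd.read_csv("nhanes_2015_2016.csv")
# Label gender as a categorical variable
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})
# Recode the response to 0/1 (2 becomes 0); codes 7 and 9 are treated as missing
da["smq"] = da.SMQ020.replace({2: 0, 7: np.nan, 9: np.nan})
# Label education levels; codes 7 and 9 are treated as missing
da["DMDEDUC2x"] = da.DMDEDUC2.replace({1: "lt9", 2: "x9_11", 3: "HS", 4: "SomeCollege",
                                       5: "College", 7: np.nan, 9: np.nan})
# A binomial GLM with the default logit link is a logistic regression
model = sm.GLM.from_formula("smq ~ RIAGENDRx + RIDAGEYR + DMDEDUC2x", family=sm.families.Binomial(), data=da)
result = model.fit()
result.summary()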

| | | | |
|---|---|---|---|
| Dep. Variable: | smq | No. Observations: | 5463 |
| Model: | GLM | Df Residuals: | 5456 |
| Model Family: | Binomial | Df Model: | 6 |
| Link Function: | Logit | Scale: | 1.0 |
| Method: | IRLS | Log-Likelihood: | -3421.3 |
| Date: | Tue, 25 Feb 2025 | Deviance: | 6842.6 |
| Time: | 21:45:25 | Pearson chi2: | 5470.0 |
| No. Iterations: | 4 | Pseudo R-squ. (CS): | 0.1019 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -2.3330 | 0.111 | -21.108 | 0.000 | -2.550 | -2.116 |
| RIAGENDRx[T.Male] | 0.9313 | 0.058 | 15.986 | 0.000 | 0.817 | 1.045 |
| DMDEDUC2x[T.HS] | 0.9345 | 0.087 | 10.761 | 0.000 | 0.764 | 1.105 |
| DMDEDUC2x[T.SomeCollege] | 0.8425 | 0.082 | 10.338 | 0.000 | 0.683 | 1.002 |
| DMDEDUC2x[T.lt9] | 0.2357 | 0.106 | 2.230 | 0.026 | 0.029 | 0.443 |
| DMDEDUC2x[T.x9_11] | 1.0745 | 0.103 | 10.426 | 0.000 | 0.872 | 1.276 |
| RIDAGEYR | 0.0185 | 0.002 | 11.061 | 0.000 | 0.015 | 0.022 |


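All the predictors and the intercept are statistically significant, since every p-value is below 0.05. The coefficients are on the log-odds scale and are measured relative to the reference levels Female and College. RIDAGEYR has the smallest positive coefficient (0.0185), but it is the change in log-odds per single year of age, so it is not directly comparable with the coefficients of the categorical predictors.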
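
Log-odds coefficients are easier to read once exponentiated into odds ratios. The sketch below does this with the coefficients copied from the summary above; the helper `predicted_probability` and the example profile (a 40-year-old in the reference categories) are hypothetical illustrations, not part of the notebook.

```python
import math

# Coefficients copied from the GLM summary (log-odds scale).
coefs = {
    "Intercept": -2.3330,
    "RIAGENDRx[T.Male]": 0.9313,
    "DMDEDUC2x[T.HS]": 0.9345,
    "DMDEDUC2x[T.SomeCollege]": 0.8425,
    "DMDEDUC2x[T.lt9]": 0.2357,
    "DMDEDUC2x[T.x9_11]": 1.0745,
    "RIDAGEYR": 0.0185,
}

# Odds ratio for each predictor: exp(coefficient).
odds_ratios = {
    name: math.exp(b) for name, b in coefs.items() if name != "Intercept"
}

# A per-year age effect compounds, so compare on a 10-year scale as well.
age_or_10 = math.exp(10 * coefs["RIDAGEYR"])


def predicted_probability(age, male=False, education=None):
    """Predicted P(smq = 1); education=None means the reference level."""
    logit = coefs["Intercept"] + coefs["RIDAGEYR"] * age
    if male:
        logit += coefs["RIAGENDRx[T.Male]"]
    if education is not None:
        logit += coefs[f"DMDEDUC2x[T.{education}]"]
    return 1 / (1 + math.exp(-logit))


for name, value in odds_ratios.items():
    print(f"{name}: odds ratio {value:.3f}")
print(f"age, per 10 years: odds ratio {age_or_10:.3f}")

# Hypothetical profile: 40 years old, reference gender and education.
p_ref = predicted_probability(40)
print(f"predicted probability at age 40 (reference levels): {p_ref:.3f}")
```

On this scale the age effect is no longer negligible: a small per-year coefficient accumulates over decades, which is why the raw coefficient should not be ranked against the categorical ones.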

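The pseudo R-squared (Cox-Snell, labelled CS in the output) is 0.1019. Unlike R-squared in linear regression, it is not the proportion of variation in smq explained by the model; it is a likelihood-based measure of improvement over an intercept-only model, and a value of about 0.10 indicates a modest improvement.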
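
For reference, the Cox-Snell measure compares the fitted log-likelihood with that of an intercept-only (null) model. The sketch below implements the usual definition; the function names are hypothetical. The summary does not print the null log-likelihood, so the value used here is back-calculated from the reported 0.1019 and is an assumption, not a figure from the notebook. McFadden's measure is included only to show that different pseudo R-squared definitions give different numbers for the same fit.

```python
import math


def cox_snell_r2(llf, llnull, n):
    """Cox-Snell pseudo R-squared from fitted and null log-likelihoods."""
    return 1 - math.exp(2 * (llnull - llf) / n)


def implied_null_loglik(llf, r2_cs, n):
    """Invert the Cox-Snell formula to recover the null log-likelihood."""
    return llf + (n / 2) * math.log(1 - r2_cs)


def mcfadden_r2(llf, llnull):
    """McFadden pseudo R-squared, an alternative likelihood-based measure."""
    return 1 - llf / llnull


# Values printed in the GLM summary.
llf = -3421.3
n = 5463
r2_cs_reported = 0.1019

# Assumption: back-calculated, because the summary does not show it.
llnull = implied_null_loglik(llf, r2_cs_reported, n)

print("implied null log-likelihood:", round(llnull, 1))
print("Cox-Snell (round trip):     ", round(cox_snell_r2(llf, llnull, n), 4))
print("McFadden:                   ", round(mcfadden_r2(llf, llnull), 4))
```

Because the two measures disagree on the same model, neither should be read as a percentage of variation explained.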
