# OLS Regression Tests

When fitting a linear regression model with ordinary least squares (OLS), we rely on several assumptions. These assumptions are:

1. **Linearity**: The relationship between the independent variables and the dependent variable is linear.
2. **Expectation of Errors**: The errors have a conditional mean of zero: $E(\epsilon|X) = 0$.
3. **Errors and Data Uncorrelated**: The errors are not correlated with the independent variables: $E(\epsilon X) = 0$.
4. **Homoscedasticity**: The variance of the errors is constant across all levels of the independent variables.
5. **No Multicollinearity**: No independent variable is an exact (or nearly exact) linear combination of the others.
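
Some of these assumptions can be probed numerically. The sketch below uses synthetic data (not the diabetes dataset) and plain NumPy; the variable names and the helper functions `ols_fit` and `vif` are our own illustrations. Note that for OLS with an intercept the residuals have zero mean and are orthogonal to the regressors by construction, so those two checks only confirm the fit is correct and say nothing about the true errors. The variance inflation factor (VIF), $1 / (1 - R_j^2)$, is one common way to flag multicollinearity.

```python
import numpy as np

rng = np.random.default_rng(0)  # synthetic data, for illustration only
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)  # deliberately almost collinear with x1
x3 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x3 + rng.normal(size=n)


def ols_fit(X, y):
    """Return least-squares coefficients and residuals for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta


# Design matrix with an explicit intercept column
X = np.column_stack([np.ones(n), x1, x2, x3])
beta, resid = ols_fit(X, y)

# Both quantities are zero (up to rounding) by construction of OLS
resid_mean = resid.mean()
max_abs_xte = np.abs(X.T @ resid).max()


def vif(X, j):
    """VIF of column j: regress it on all other columns (intercept included)."""
    others = np.delete(X, j, axis=1)
    _, r = ols_fit(others, X[:, j])
    tss = np.sum((X[:, j] - X[:, j].mean()) ** 2)
    r2 = 1.0 - (r @ r) / tss
    return 1.0 / (1.0 - r2)


vifs = {name: vif(X, j) for j, name in enumerate(["x1", "x2", "x3"], start=1)}
print(resid_mean, max_abs_xte)
print(vifs)  # x1 and x2 should have large VIFs, x3 a VIF close to 1
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity, although the threshold is a convention rather than a formal test.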



In [10]:

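from sklearn.datasets import load_diabetes
import pandas as pd

# Load the diabetes dataset bundled with scikit-learn
diabetes = load_diabetes()

# Inspect the available fields (data, target, feature_names, ...)
diabetes.keys()

# Build a DataFrame of the features and append the response as 'target'
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df['target'] = diabetes.target

df.head()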





| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |


In [11]:
import statsmodels.formula.api as smf

model = smf.ols('target ~ age + sex + bmi + bp + s1 + s2 + s3 + s4 + s5 + s6', data=df).fit()

model.summary()






| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | target | R-squared: | 0.518 |
| Model: | OLS | Adj. R-squared: | 0.507 |
| Method: | Least Squares | F-statistic: | 46.27 |
| Date: | Tue, 08 Oct 2024 | Prob (F-statistic): | 3.83e-62 |
| Time: | 16:25:07 | Log-Likelihood: | -2386.0 |
| No. Observations: | 442 | AIC: | 4794.0 |
| Df Residuals: | 431 | BIC: | 4839.0 |
| Df Model: | 10 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 152.1335 | 2.576 | 59.061 | 0.000 | 147.071 | 157.196 |
| age | -10.0099 | 59.749 | -0.168 | 0.867 | -127.446 | 107.426 |
| sex | -239.8156 | 61.222 | -3.917 | 0.000 | -360.147 | -119.484 |
| bmi | 519.8459 | 66.533 | 7.813 | 0.000 | 389.076 | 650.616 |
| bp | 324.3846 | 65.422 | 4.958 | 0.000 | 195.799 | 452.970 |
| s1 | -792.1756 | 416.680 | -1.901 | 0.058 | -1611.153 | 26.802 |
| s2 | 476.7390 | 339.030 | 1.406 | 0.160 | -189.620 | 1143.098 |
| s3 | 101.0433 | 212.531 | 0.475 | 0.635 | -316.684 | 518.770 |
| s4 | 177.0632 | 161.476 | 1.097 | 0.273 | -140.315 | 494.441 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 1.506 | Durbin-Watson: | 2.029 |
| Prob(Omnibus): | 0.471 | Jarque-Bera (JB): | 1.404 |
| Skew: | 0.017 | Prob(JB): | 0.496 |
| Kurtosis: | 2.726 | Cond. No. | 227.0 |


## R squared

The R-squared value is a measure of how much of the variance in the dependent variable is explained by the independent variables. It is calculated as the ratio of the explained variance to the total variance.

In [12]:
model.rsquared

np.float64(0.5177484222203499)

In our case, the R-squared value is 0.518, which means that 51.8% of the variance in the dependent variable is explained by the independent variables.

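This is a moderate R-squared value: roughly half of the variance in the target remains unexplained, so the model captures only part of the variation in the data. Whether that counts as a good fit depends on the application.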
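
To make the definition concrete, here is a minimal sketch that computes R-squared two equivalent ways on synthetic data (not the diabetes dataset) with plain NumPy. All variable names are our own. The last lines recompute the adjusted R-squared from the values reported in the summary above, assuming the usual formula $1 - (1 - R^2)\frac{n - 1}{n - k - 1}$ with $k = 10$ regressors.

```python
import numpy as np

rng = np.random.default_rng(42)  # synthetic data, for illustration only
n = 100
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(size=n)

# Fit y = b0 + b1 * x by least squares
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

ss_total = np.sum((y - y.mean()) ** 2)           # total variation
ss_explained = np.sum((fitted - y.mean()) ** 2)  # variation explained by the model
ss_residual = np.sum((y - fitted) ** 2)          # variation left unexplained

r2_ratio = ss_explained / ss_total        # explained / total
r2_resid = 1.0 - ss_residual / ss_total   # 1 - unexplained / total
print(r2_ratio, r2_resid)                 # the two values agree for OLS with an intercept

# Adjusted R-squared from the figures reported in the summary table
reported_r2, n_obs, k_regressors = 0.5177484222203499, 442, 10
adj_r2 = 1.0 - (1.0 - reported_r2) * (n_obs - 1) / (n_obs - k_regressors - 1)
print(round(adj_r2, 3))
```

The adjusted value penalizes additional regressors, which is why it sits slightly below the plain R-squared in the summary.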

## F-statistic

The F-statistic is a measure of the overall significance of the model. It is calculated as the ratio of the explained variance to the unexplained variance, each divided by its degrees of freedom.

In [13]:
model.fvalue

np.float64(46.27243958524321)

In [14]:
model.f_pvalue


np.float64(3.8286490381849547e-62)

In our case, the F-statistic is 46.27.

The null hypothesis is that all the slope coefficients (every coefficient except the intercept) are zero. The p-value is the probability of obtaining an F-statistic at least this large if the null hypothesis were true.

In our case, the p-value is about $3.83 \times 10^{-62}$, which is effectively zero, so we can reject the null hypothesis. That is, at least one of the slope coefficients is not zero.
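
As a sanity check, the F-statistic can be recovered from the R-squared value alone. The sketch below assumes the standard identity for a model with an intercept, $F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)}$, and plugs in the numbers reported above ($n = 442$ observations, $k = 10$ regressors). The variable names are ours.

```python
# Values taken from the regression output above
r2 = 0.5177484222203499   # model.rsquared
n_obs = 442               # No. Observations
k_regressors = 10         # Df Model (slopes only, intercept excluded)

df_model = k_regressors
df_resid = n_obs - k_regressors - 1   # 431, matches "Df Residuals"

# Explained variance per model degree of freedom,
# divided by unexplained variance per residual degree of freedom
f_stat = (r2 / df_model) / ((1.0 - r2) / df_resid)
print(round(f_stat, 2))
```

The result should agree with `model.fvalue` up to rounding.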




## Likelihood function, Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)

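The likelihood function measures how probable the observed data are under the model. For independent observations, it is the product of the probability densities of the observed data points given the model parameters. In practice we work with its logarithm, the log-likelihood, which turns the product into a sum.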
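
The following sketch illustrates this for a regression with Gaussian errors, which is the setting we assume here. The residuals are made-up numbers, not taken from the diabetes model. It evaluates the log-likelihood as a sum of normal log-densities, using the maximum-likelihood variance estimate $\hat\sigma^2 = \mathrm{SSR}/n$, and compares it with the closed form $-\frac{n}{2}\left(\log(2\pi) + \log(\hat\sigma^2) + 1\right)$.

```python
import math

# Hypothetical residuals from some fitted regression (made-up numbers)
residuals = [0.5, -1.2, 0.3, 0.9, -0.5]
n = len(residuals)

ssr = sum(e * e for e in residuals)  # sum of squared residuals
sigma2 = ssr / n                     # maximum-likelihood estimate of the error variance


def normal_logpdf(e, var):
    """Log-density of a zero-mean normal distribution with variance var."""
    return -0.5 * (math.log(2 * math.pi * var) + e * e / var)


# Log of a product of densities = sum of log-densities
loglik_sum = sum(normal_logpdf(e, sigma2) for e in residuals)

# Closed form obtained by substituting sigma2 = SSR / n
loglik_closed = -0.5 * n * (math.log(2 * math.pi) + math.log(sigma2) + 1)

print(loglik_sum, loglik_closed)  # the two values coincide
```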

The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are measures of the goodness of fit of the model. They are calculated as $-2$ times the log-likelihood of the model plus a penalty term that grows with the number of parameters. **A lower AIC and BIC indicate a better fit**.




From an information theory perspective, the AIC and BIC are measures of the trade-off between the goodness of fit of the model and the complexity of the model.

The AIC is calculated as:

$$
AIC = -2L + 2k
$$

The BIC is calculated as:

$$
BIC = -2L + k\log(n)
$$

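Where $L$ is the maximized log-likelihood of the model, $n$ is the number of observations and $k$ is the number of estimated parameters in the model.

For a fixed number of parameters $k$ and observations $n$, maximizing the log-likelihood is equivalent to minimizing both the AIC and the BIC. The two criteria differ only in the penalty: the BIC penalizes each parameter more heavily than the AIC whenever $\log(n) > 2$, that is, for $n \geq 8$.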
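
As a worked example, the two formulas can be checked against the summary table. The sketch below plugs in the reported (rounded) log-likelihood of $-2386.0$ and $n = 442$. It assumes $k = 11$, counting the ten slopes plus the intercept; this choice is our assumption, made because it reproduces the reported values.

```python
import math

# Values taken from the regression summary above
log_likelihood = -2386.0  # "Log-Likelihood" (rounded in the summary)
n_obs = 442               # "No. Observations"
k_params = 11             # assumption: 10 slopes + 1 intercept

aic = -2 * log_likelihood + 2 * k_params
bic = -2 * log_likelihood + k_params * math.log(n_obs)

print(round(aic, 1))  # compare with "AIC" in the summary
print(round(bic, 1))  # compare with "BIC" in the summary
```

Because the log-likelihood used here is already rounded, the results match the summary only up to rounding.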





