# Ordinary Least Squares (OLS) with statsmodels

In this notebook, we use the **statsmodels** library to fit an OLS regression model.

Unlike scikit-learn, statsmodels focuses on:
- Statistical interpretation
- Hypothesis testing
- p-values and confidence intervals


## Goals

- Fit an OLS model using statsmodels
- Interpret regression coefficients statistically
- Understand p-values and R²
- Compare statsmodels with previous approaches


In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

We use the same student performance dataset.

In [2]:
df = pd.read_csv("datasets/students_scores.csv")
df

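|   | study_hours | attendance_rate | previous_gpa | final_score |
|---|---|---|---|---|
| 0 | 5 | 70 | 14.0 | 13.0 |
| 1 | 6 | 75 | 14.5 | 14.0 |
| 2 | 7 | 80 | 15.0 | 15.0 |
| 3 | 8 | 85 | 15.5 | 16.0 |
| 4 | 9 | 90 | 16.0 | 17.0 |
| 5 | 10 | 95 | 16.5 | 18.0 |
| 6 | 11 | 96 | 17.0 | 18.5 |
| 7 | 12 | 98 | 17.5 | 19.0 |
| 8 | 13 | 99 | 18.0 | 19.5 |
| 9 | 14 | 100 | 18.5 | 20.0 |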


We select the input features and the target variable.

In [3]:
X = df[["study_hours", "attendance_rate", "previous_gpa"]]
y = df["final_score"]

`sm.OLS` does not include an intercept by default, so we add
a constant column of ones to `X` with `sm.add_constant`.

In [4]:
X = sm.add_constant(X)
X

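|   | const | study_hours | attendance_rate | previous_gpa |
|---|---|---|---|---|
| 0 | 1.0 | 5 | 70 | 14.0 |
| 1 | 1.0 | 6 | 75 | 14.5 |
| 2 | 1.0 | 7 | 80 | 15.0 |
| 3 | 1.0 | 8 | 85 | 15.5 |
| 4 | 1.0 | 9 | 90 | 16.0 |
| 5 | 1.0 | 10 | 95 | 16.5 |
| 6 | 1.0 | 11 | 96 | 17.0 |
| 7 | 1.0 | 12 | 98 | 17.5 |
| 8 | 1.0 | 13 | 99 | 18.0 |
| 9 | 1.0 | 14 | 100 | 18.5 |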


We now fit the OLS model.

In [5]:
model = sm.OLS(y, X)
results = model.fit()

The summary below contains detailed statistical information
about the regression model.

In [6]:
results.summary()

|   |   |   |   |
|---|---|---|---|
| Dep. Variable: | final_score | R-squared: | 1.0 |
| Model: | OLS | Adj. R-squared: | 1.0 |
| Method: | Least Squares | F-statistic: | 33680.0 |
| Date: | Mon, 15 Dec 2025 | Prob (F-statistic): | 1.14e-14 |
| Time: | 21:23:34 | Log-Likelihood: | 23.379 |
| No. Observations: | 10 | AIC: | -40.76 |
| Df Residuals: | 7 | BIC: | -39.85 |
| Df Model: | 2 |  |  |
| Covariance Type: | nonrobust |  |  |

|   | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.0042 | 0.001 | 3.188 | 0.015 | 0.001 | 0.007 |
| study_hours | 0.2416 | 0.005 | 44.178 | 0.000 | 0.229 | 0.255 |
| attendance_rate | 0.1346 | 0.003 | 40.566 | 0.000 | 0.127 | 0.142 |
| previous_gpa | 0.1693 | 0.016 | 10.376 | 0.000 | 0.131 | 0.208 |

|   |   |   |   |
|---|---|---|---|
| Omnibus: | 3.887 | Durbin-Watson: | 2.512 |
| Prob(Omnibus): | 0.143 | Jarque-Bera (JB): | 0.884 |
| Skew: | -0.553 | Prob(JB): | 0.643 |
| Kurtosis: | 3.949 | Cond. No. | 2.36e+17 |


Important outputs to focus on:

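- **coef**: estimated coefficient for each term
- **std err**: standard error of each estimate
- **P>|t|**: p-value for the test that the coefficient is zero
- **[0.025, 0.975]**: bounds of the 95% confidence interval
- **R-squared**: goodness of fit (share of the variance in `final_score` explained by the model)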

A common rule of thumb, using the conventional 5% significance level:

- p-value < 0.05 → statistically significant
- p-value ≥ 0.05 → not statistically significant

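If features are highly correlated, the individual coefficients and their
p-values can become unreliable, even when the model has a high R-squared value.
This issue is known as multicollinearity.

The summary above shows signs of it: the condition number is extremely large
(2.36e+17), and `Df Model` is 2 although three features were passed in.
In this dataset, `previous_gpa` equals `11.5 + 0.5 * study_hours` in every row,
so the two columns carry the same information and their separate coefficients
should be interpreted with caution.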

We generate predictions for the training data with the fitted model.

In [7]:
y_pred = results.predict(X)

We compute the Mean Squared Error (MSE) between the actual values and these in-sample predictions.

In [8]:
mse = np.mean((y - y_pred) ** 2)
mse

np.float64(0.0005455245593017545)

We compare actual and predicted values.

In [9]:
comparison = pd.DataFrame({
    "Actual": y,
    "Predicted": y_pred
})

comparison

|   | Actual | Predicted |
|---|---|---|
| 0 | 13.0 | 13.001925 |
| 1 | 14.0 | 14.001027 |
| 2 | 15.0 | 15.000128 |
| 3 | 16.0 | 15.99923 |
| 4 | 17.0 | 16.998331 |
| 5 | 18.0 | 17.997433 |
| 6 | 18.5 | 18.458283 |
| 7 | 19.0 | 19.053697 |
| 8 | 19.5 | 19.514547 |
| 9 | 20.0 | 19.975398 |


statsmodels provides rich statistical output,
making it suitable for inference and interpretation.

scikit-learn focuses more on prediction performance.
