<a href="https://colab.research.google.com/github/Tydos/Interpretable-ML-Models/blob/main/Linear%20Regression/Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Linear Regression

This notebook implements linear regression on the diabetes dataset that ships with scikit-learn, a popular benchmark for regression.

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

In [None]:
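# Load the dataset once and reuse the returned object
data = load_diabetes()
X = data.data
y = data.target
cols = data.feature_names

# Inspect one sample: its features, its target, and the feature names
print(X[1],y[1],cols)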

[-0.00188202 -0.04464164 -0.05147406 -0.02632753 -0.00844872 -0.01916334
  0.07441156 -0.03949338 -0.06833155 -0.09220405] 75.0 ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']


In [None]:
X_train, X_test, ytrain, y_test = train_test_split(X,y,test_size=0.2)
print(len(X_train))
print(len(X_test))

353
89


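We want to predict y = w·x + b, where x is the feature vector, w is the weight vector, and b is a scalar bias.

10 features -> 10 weights and 1 bias (11 parameters in total)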

### Closed form solution

Closed-form solutions are best suited to problems with a small number of features. The dominant computational cost comes from inverting the d x d matrix X^T X (the Gram matrix of the features), which takes O(d^3) time in the number of features d and therefore does not scale well to high-dimensional datasets.
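As a rough illustration of the closed-form approach, the sketch below solves the normal equations on synthetic data (an assumption made for illustration; it does not use the diabetes dataset). It uses `np.linalg.solve` rather than an explicit inverse, which is generally the more numerically stable choice, and compares the result with `np.linalg.lstsq`. The names `X_demo`, `y_demo`, `true_w`, and `true_b` are hypothetical.

```python
import numpy as np

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(0)
n_samples, n_features = 200, 3
X_demo = rng.normal(size=(n_samples, n_features))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 4.0
y_demo = X_demo @ true_w + true_b + 0.01 * rng.normal(size=n_samples)

# Prepend a column of ones so the first parameter plays the role of the bias
X_demo_b = np.c_[np.ones((n_samples, 1)), X_demo]

# Solve (X^T X) theta = X^T y directly instead of forming the inverse
theta_solve = np.linalg.solve(X_demo_b.T @ X_demo_b, X_demo_b.T @ y_demo)

# Least-squares solver for comparison
theta_lstsq, *_ = np.linalg.lstsq(X_demo_b, y_demo, rcond=None)

print(theta_solve)                            # [bias, w_1, w_2, w_3]
print(np.allclose(theta_solve, theta_lstsq))  # both routes should agree
```

Both routes should recover parameters close to the ones used to generate the data.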

In [None]:
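# Closed-form solution without a bias term: the fitted hyperplane passes through the origin
def normal_equation(X, y):
    return np.linalg.inv(X.T @ X) @ X.T @ y

def normal_bias_equation(X, y):
    # Prepend a column of ones so that theta[0] acts as the bias
    n_samples = X.shape[0]
    ones = np.ones((n_samples, 1))  # column vector of shape (n_samples, 1)
    X_b = np.c_[ones, X]

    theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
    return theta  # theta[0] is the bias, theta[1:] are the feature weights

def lstsq_normal(X, y):
    # Same augmented design matrix, solved with NumPy's least-squares routine
    n_samples = X.shape[0]
    ones = np.ones((n_samples, 1))  # column vector of shape (n_samples, 1)
    X_b = np.c_[ones, X]

    # Returns a tuple: (solution, residuals, rank, singular values)
    return np.linalg.lstsq(X_b, y, rcond=None)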


In [None]:
w = normal_equation(X_train,ytrain)
for i, col in enumerate(cols):
  print(f"{col}: {w[i]}")

age: 49.29676719759565
sex: -190.70427659307546
bmi: 349.778876623417
bp: 500.3243396054219
s1: -322.1621020437322
s2: 119.1139531044016
s3: -61.93438671836239
s4: 59.33712029028939
s5: 579.1195068364318
s6: -41.09220369266277


In [None]:
w = normal_bias_equation(X_train,ytrain)
# w[0] is the bias; w[1:] holds the ten feature weights
print(f"bias: {w[0]}")
for i, col in enumerate(cols):
  print(f"{col}: {w[i + 1]}")


bias: 154.77536343077819
age: 6.847805466780149
sex: -227.1055395553644
bmi: 501.9976219896486
bp: 339.1559534072792
s1: -733.3300153748287
s2: 396.65417189292674
s3: 77.49394589362713
s4: 197.07479938350198
s5: 737.972840441065
s6: ...


In [None]:
w, residual, rank, s = lstsq_normal(X_train,ytrain)
# w[0] is the bias; w[1:] holds the ten feature weights
print(f"bias: {w[0]}")
for i, col in enumerate(cols):
  print(f"{col}: {w[i + 1]}")

print(residual)
print(rank)
print(s)


bias: 154.77536343077747
age: 6.847805466780528
sex: -227.10553955536543
bmi: 501.9976219896482
bp: 339.1559534072782
s1: -733.3300153748266
s2: 396.6541718929243
s3: 77.49394589363055
s4: 197.07479938350326
s5: 737.9728404410674
s6: ...
[961638.07479431]
11
[18.78846789  1.78902077  1.07394428  0.9725083   0.84022452  0.74022996
  0.68110506  0.6432289   0.59536392  0.23543861  0.07739121]


### Gradient descent solution
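A minimal sketch of batch gradient descent on the mean squared error, shown on synthetic data (an assumption for illustration, not the diabetes dataset). The function name `gradient_descent` and the defaults `lr=0.1` and `n_iters=2000` are hypothetical choices, not values taken from this notebook.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=2000):
    """Fit weights w and bias b by batch gradient descent on the MSE."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        error = X @ w + b - y                      # prediction error per sample
        grad_w = 2.0 / n_samples * (X.T @ error)   # d(MSE)/dw
        grad_b = 2.0 / n_samples * error.sum()     # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic data with known parameters, assumed for illustration only
rng = np.random.default_rng(0)
X_gd = rng.normal(size=(200, 3))
y_gd = X_gd @ np.array([2.0, -1.0, 0.5]) + 4.0 + 0.01 * rng.normal(size=200)

w_gd, b_gd = gradient_descent(X_gd, y_gd)
print(w_gd, b_gd)
```

On the diabetes features, which scikit-learn ships scaled to small magnitudes, a larger learning rate or more iterations would likely be needed for the estimates to approach the closed-form solution.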


### Assumptions of Linear Regression

### 1. Linearity (in parameters)
**Intuition:**  
Each feature contributes additively and proportionally to the prediction.

**Violation symptoms:**  
- Residuals vs. predicted values show curved patterns  
- Systematic under- or over-prediction in certain ranges  

**Fixes:**  
- Add polynomial or interaction terms  
- Transform features (log, square root, etc.)  
- Use a non-linear model if necessary  
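
The effect of adding a polynomial term can be sketched on synthetic data with a known quadratic relationship (assumed here for illustration). The helper `fit_and_mse` is a hypothetical name; it fits by least squares and reports the mean squared residual.

```python
import numpy as np

# Synthetic curved relationship, assumed for illustration only
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y_curved = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.5, size=x.size)

def fit_and_mse(design, target):
    """Least-squares fit; returns the coefficients and the mean squared residual."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return coef, np.mean(residuals**2)

ones = np.ones_like(x)
# Straight-line model: misses the curvature
_, mse_linear = fit_and_mse(np.c_[ones, x], y_curved)
# Same linear-in-parameters model with an added x^2 feature
coef_quad, mse_quad = fit_and_mse(np.c_[ones, x, x**2], y_curved)

print(mse_linear, mse_quad)
```

The model with the squared feature is still linear in its parameters, which is why ordinary least squares can fit it.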

---

### 2. Independence of observations
**Intuition:**  
Each data point should provide new information, not repeat another.

**Violation symptoms:**  
- Time-series or grouped data (e.g., repeated measurements per user)  
- Autocorrelation in residuals  

**Fixes:**  
- Use time-series models  
- Add lag features  
- Use clustered/robust standard errors or mixed models  
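
One common way to check for autocorrelation in residuals is the Durbin-Watson statistic, which is close to 2 for uncorrelated residuals and moves toward 0 under positive autocorrelation. The sketch below computes it with NumPy on simulated residuals (an assumption for illustration); `durbin_watson` is a hypothetical helper written here, not a library call.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    residuals = np.asarray(residuals, dtype=float)
    diff = np.diff(residuals)
    return np.sum(diff**2) / np.sum(residuals**2)

rng = np.random.default_rng(0)

# Independent residuals
independent = rng.normal(size=500)

# Positively autocorrelated residuals from an AR(1) process with coefficient 0.8
noise = rng.normal(size=500)
correlated = np.zeros(500)
for t in range(1, 500):
    correlated[t] = 0.8 * correlated[t - 1] + noise[t]

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)
print(dw_independent, dw_correlated)
```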

---

### 3. Homoscedasticity (constant error variance)
**Intuition:**  
The errors should have the same spread across all prediction ranges, so the model is equally reliable everywhere.

**Violation symptoms:**  
- Residual plot shows a fan or cone shape  
- Errors increase with the magnitude of predictions  

**Fixes:**  
- Transform the target variable (log, Box–Cox)  
- Use weighted least squares  
- Apply heteroscedasticity-robust standard errors  
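
The symptom and the weighted least squares fix can be sketched together on synthetic data whose noise grows with the feature. The sketch assumes the per-sample noise scale `sigma` is known, which is rarely true in practice; all names are hypothetical.

```python
import numpy as np

# Synthetic heteroscedastic data, assumed for illustration only
rng = np.random.default_rng(0)
n = 400
x = rng.uniform(1.0, 10.0, size=n)
sigma = 0.5 * x                               # noise std grows with x
y_het = 3.0 + 2.0 * x + rng.normal(size=n) * sigma
X_het = np.c_[np.ones(n), x]

# Ordinary least squares fit and its residuals
theta_ols, *_ = np.linalg.lstsq(X_het, y_het, rcond=None)
fitted = X_het @ theta_ols
resid = y_het - fitted

# Symptom check: residual spread in the lower vs. upper half of fitted values
order = np.argsort(fitted)
low_std = resid[order[: n // 2]].std()
high_std = resid[order[n // 2 :]].std()

# Weighted least squares: scale each row by 1/sigma_i, then solve as usual
row_scale = 1.0 / sigma
theta_wls, *_ = np.linalg.lstsq(X_het * row_scale[:, None], y_het * row_scale, rcond=None)

print(low_std, high_std)
print(theta_ols, theta_wls)
```

A clearly larger `high_std` than `low_std` corresponds to the fan shape in a residual plot.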

---

### 4. Normality of errors (mainly for inference)
**Intuition:**  
Normally distributed errors allow reliable confidence intervals and hypothesis tests.

**Violation symptoms:**  
- Skewed or heavy-tailed residual distribution  
- Strong deviations from the diagonal in a Q–Q plot  

**Fixes:**  
- Transform the target variable  
- Use bootstrapping methods  
- Often ignorable for prediction with large sample sizes  
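
As a sketch of the bootstrapping option, the code below builds a percentile confidence interval for a slope by resampling rows with replacement, using synthetic data with skewed noise (an assumption for illustration). The helper `ols_slope` and the choice of 1000 resamples are hypothetical.

```python
import numpy as np

# Synthetic data with skewed (non-normal) noise, assumed for illustration only
rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
noise = rng.exponential(scale=1.0, size=n) - 1.0   # centered exponential noise
y_skew = 1.0 + 2.0 * x + noise
X_skew = np.c_[np.ones(n), x]

def ols_slope(design, target):
    """Slope coefficient of a least-squares fit with an intercept column."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[1]

slope_hat = ols_slope(X_skew, y_skew)

# Resample rows with replacement and refit each time
n_boot = 1000
slopes = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, n, size=n)
    slopes[i] = ols_slope(X_skew[idx], y_skew[idx])

# Percentile interval from the bootstrap distribution
ci_low, ci_high = np.percentile(slopes, [2.5, 97.5])
print(slope_hat, ci_low, ci_high)
```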

---

### 5. No (or low) multicollinearity
**Intuition:**  
Each feature should explain unique information about the target.

**Violation symptoms:**  
- Large standard errors for coefficients  
- Unstable coefficients or unexpected sign changes  
- High Variance Inflation Factor (VIF)  

**Fixes:**  
- Remove or combine correlated features  
- Apply dimensionality reduction (e.g., PCA)  
- Use regularization techniques (Ridge, Lasso)  
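
The Variance Inflation Factor for feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. A minimal NumPy sketch follows, using synthetic features where one column is nearly a copy of another (an assumption for illustration); `variance_inflation_factors` is a hypothetical helper, not a library function.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF per column: regress each feature on the others and use 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.c_[np.ones(n_samples), others]   # include an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic features, assumed for illustration only
rng = np.random.default_rng(0)
feat_a = rng.normal(size=500)
feat_b = rng.normal(size=500)
feat_c = feat_a + 0.1 * rng.normal(size=500)   # nearly a copy of feat_a

vifs = variance_inflation_factors(np.c_[feat_a, feat_b, feat_c])
print(vifs)
```

The two nearly duplicated columns should show large VIF values, while the independent column should stay close to 1.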



### Linear Regression model interpretability