<h1 align="center">Linear Regression<br>using Moore-Penrose Pseudoinverse<br>and K-Fold Cross-Validation</h1>

##### Import required modules

In [159]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, KFold

## Step 1: Writing required methods

#### 1. Data Generator Method

In [209]:
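def linear_data_generator(size=(1, 1), noise=0.1, seed=100):
    # size = (number of samples n, number of features p)
    n, p = size
    np.random.seed(seed=seed)
    X = np.random.rand(n, p)             # features, uniform on [0, 1)
    coefficients = np.random.rand(p, 1)  # true coefficients, uniform on [0, 1)
    # Linear response plus Gaussian noise with standard deviation `noise`
    y = X.dot(coefficients) + noise * np.random.randn(n, 1)
    return X, y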

#### 2. Moore-Penrose Pseudoinverse Method

In [220]:
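def moore_penrose_inv(A):
    # Full SVD: A = U @ S @ VT, singular values s sorted in descending order
    U, s, VT = np.linalg.svd(A)
    # Sigma^+ has the shape of A transposed and starts out as all zeros
    S_plus = np.zeros(A.shape).T
    # Invert only the singular values that count toward the numerical rank;
    # the remaining diagonal entries stay exactly zero
    for i in range(np.linalg.matrix_rank(A)):
        S_plus[i, i] = 1 / s[i]
    A_plus = VT.T @ S_plus @ U.T
    return A_plus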
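
As a sanity check on the SVD construction, the sketch below builds a pseudoinverse the same way and tests it on a full-rank and a rank-deficient matrix. The helper name `pinv_via_svd` and the test matrices are illustrative assumptions, not part of the notebook; the cutoff copies the default tolerance documented for `np.linalg.matrix_rank`.

```python
import numpy as np

def pinv_via_svd(A):
    """Hypothetical helper: pseudoinverse from the reduced SVD."""
    U, s, VT = np.linalg.svd(A, full_matrices=False)
    # Default matrix_rank cutoff: largest singular value * max(m, n) * eps
    tol = s.max() * max(A.shape) * np.finfo(s.dtype).eps
    s_inv = np.zeros_like(s)
    s_inv[s > tol] = 1.0 / s[s > tol]   # singular values below tol stay zero
    return VT.T @ (s_inv[:, None] * U.T)

rng = np.random.default_rng(0)
tall = rng.random((6, 3))               # full column rank
deficient = tall @ rng.random((3, 5))   # 6 x 5 matrix of rank 3

# For a well-conditioned matrix the result should agree with NumPy's pinv
print(np.allclose(pinv_via_svd(tall), np.linalg.pinv(tall)))

# First Penrose condition, A @ A+ @ A == A, for both matrices
for M in (tall, deficient):
    print(np.allclose(M @ pinv_via_svd(M) @ M, M))
```

Working from the reduced SVD avoids allocating the full `m x m` matrix `U`, which matters only for tall inputs; the result is the same matrix.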

## Step 3: Data Generation

#### Generating the raw data `X_bar`, `y_bar` and forming `X = X_barᵀ X_bar` (5×5) and `y = X_barᵀ y_bar` (5×1)

In [221]:
X_bar, y_bar = linear_data_generator(size=(100, 5), noise=0.1, seed=77)
X = X_bar.T @ X_bar
y = X_bar.T @ y_bar
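
The products `X_bar.T @ X_bar` and `X_bar.T @ y_bar` are the two sides of the normal equations for regressing `y_bar` on `X_bar`. The sketch below, which uses its own synthetic data and NumPy's `default_rng` rather than the notebook's generator (both are assumptions for illustration), shows that solving this 5×5 system gives the same coefficients as least squares on the original 100×5 data.

```python
import numpy as np

rng = np.random.default_rng(77)
X_bar = rng.random((100, 5))
true_coef = rng.random((5, 1))
y_bar = X_bar @ true_coef + 0.1 * rng.standard_normal((100, 1))

# Normal-equation system: (X_bar^T X_bar) beta = X_bar^T y_bar
G = X_bar.T @ X_bar          # 5 x 5 Gram matrix
b = X_bar.T @ y_bar          # 5 x 1 right-hand side
beta_normal = np.linalg.pinv(G) @ b

# Reference: least squares directly on the 100 x 5 data
beta_lstsq, *_ = np.linalg.lstsq(X_bar, y_bar, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))
```

Note that the cells that follow take a different route: they treat the five rows of `X` as individual samples.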

#### Displaying `X` and `y` as a `DataFrame`

In [222]:
df = pd.DataFrame(
    data=X,
    columns=[f"feature {i+1}" for i in range(X.shape[1])],
    index=[f"sample {i+1}" for i in range(X.shape[0])],
)
df["target"] = y
df.head()

| | feature 1 | feature 2 | feature 3 | feature 4 | feature 5 | target |
|---|---|---|---|---|---|---|
| sample 1 | 31.042075 | 24.299935 | 23.235161 | 20.245869 | 26.004368 | 81.215247 |
| sample 2 | 24.299935 | 33.095042 | 24.171426 | 21.536074 | 27.096102 | 86.012863 |
| sample 3 | 23.235161 | 24.171426 | 30.195341 | 19.753527 | 24.200142 | 79.96898 |
| sample 4 | 20.245869 | 21.536074 | 19.753527 | 27.23034 | 23.301305 | 74.318278 |
| sample 5 | 26.004368 | 27.096102 | 24.200142 | 23.301305 | 35.777421 | 86.539583 |


#### Constructing the design matrix `A` (prepending an intercept column of ones)

In [223]:
A = np.insert(X, 0, np.ones(len(X)), axis=1)

## Step 4: Performing Linear Regression

### Type 1: Linear Regression using the `train_test_split` function

#### Use `train_test_split` to split the data

In [257]:
X_train, X_test, y_train, y_test = train_test_split(
    A, y, test_size=0.2, random_state=77)

#### Calculating the pseudoinverse of the `X_train` matrix

In [258]:
X_train_plus = moore_penrose_inv(X_train)

#### Calculating the Linear Regression coefficients using `train` data

In [259]:
beta_hat = X_train_plus @ y_train
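
In matrix form, the cell above computes the following, where $A_{\text{train}}$ is notation introduced here for `X_train` and $U \Sigma V^{\top}$ is its singular value decomposition:

$$
\hat{\beta} = A_{\text{train}}^{+}\, y_{\text{train}},
\qquad
A_{\text{train}}^{+} = V \Sigma^{+} U^{\top}
$$

Among all vectors $\beta$ that minimize $\lVert A_{\text{train}} \beta - y_{\text{train}} \rVert_2$, this $\hat{\beta}$ is the one with the smallest $\lVert \beta \rVert_2$.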

#### Predicting the `y` vector for the train and test data

In [260]:
y_pred_train = X_train @ beta_hat
y_pred_test = X_test @ beta_hat

#### Calculating the Errors for both train and test data

In [None]:
norm_train = np.linalg.norm(y_train - y_pred_train, ord=2)
norm_test = np.linalg.norm(y_test - y_pred_test, ord=2)

mse_train = mean_squared_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)
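
The two error measures are closely related: the MSE is the squared L2 norm of the residual divided by the number of samples. A minimal sketch with made-up numbers (the arrays below are assumptions, not notebook data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([[1.0], [2.0], [4.0]])
y_hat = np.array([[1.5], [2.0], [3.0]])

# For an (n, 1) array, ord=2 is the spectral norm, which for a single
# column coincides with the Euclidean norm of that column
l2 = np.linalg.norm(y_true - y_hat, ord=2)
mse = mean_squared_error(y_true, y_hat)

print(np.isclose(mse, l2**2 / len(y_true)))
```

With a single test sample the two measures carry the same information, since the MSE is then just the squared norm.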

#### Printing the results

In [262]:
print(f"Norm Error for Train data: {norm_train}")
print(f"Norm Error for Test data: {norm_test}")
print(f"Mean Squared Error for Train data: {mse_train}")
print(f"Mean Squared Error for Test data: {mse_test}")

Norm Error for Train data: 3.759838749409412e-14
Norm Error for Test data: 0.7751584832372913
Mean Squared Error for Train data: 3.534096855390133e-28
Mean Squared Error for Test data: 0.600870674134738
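
A likely reason for the near-zero training error is the shape of the problem: `A` has 5 rows and 6 columns, so an 80/20 split leaves 4 training rows for 6 unknown coefficients. When such a matrix has full row rank, the pseudoinverse solution reproduces the training targets exactly, whatever they are. The sketch below illustrates this with random stand-in data (an assumption, not the notebook's matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
A_train = rng.random((4, 6))   # fewer rows (samples) than columns (parameters)
y_train = rng.random((4, 1))   # arbitrary targets

# Minimum-norm least-squares solution
beta = np.linalg.pinv(A_train) @ y_train

residual = np.linalg.norm(A_train @ beta - y_train)
print(residual)   # expected to be on the order of floating-point round-off
```

A training error at round-off level therefore says little about how well the model generalizes; the test error is the more informative figure here.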


### Type 2: Linear Regression using the `KFold` class

#### Initializing KFold object

In [263]:
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=77)

#### K-Fold Cross-Validation

In [264]:
mse_list = []
norm_list = []

for train_index, test_index in kf.split(A):
    # Rows of A (and matching entries of y) assigned to this fold
    X_train, X_test = A[train_index], A[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit on the training rows via the pseudoinverse
    X_train_plus = moore_penrose_inv(X_train)
    beta_hat = X_train_plus @ y_train

    # Evaluate on the held-out rows
    y_pred = X_test @ beta_hat
    error = np.linalg.norm(y_test - y_pred, ord=2)
    mse = mean_squared_error(y_test, y_pred)

    norm_list.append(error)
    mse_list.append(mse)

#### Printing the results

In [304]:
df = pd.DataFrame(data={"Folds": range(1, k+1),
                        "Norms Error": norm_list,
                        "MSE": mse_list})
print("The norms error and mse table for each fold:\n")
print(df.to_string(index=False), end="\n\n")
print("Average Norms:", np.mean(norm_list))
print("Average Mean Squared Error:", np.mean(mse_list))

The norms error and mse table for each fold:

 Folds  Norms Error       MSE
     1     0.775158  0.600871
     2     3.224640 10.398303
     3     1.755594  3.082112
     4     0.694851  0.482818
     5     3.327010 11.068993

Average Norms: 1.955450684986431
Average Mean Squared Error: 5.126619362242911
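
Because `A` has five rows and `k = 5`, each fold holds out exactly one row, which makes this leave-one-out cross-validation. That is also why every MSE in the table equals the square of the corresponding norm error (for example, 0.775158² ≈ 0.600871). The sketch below prints the index split for a five-row stand-in array (the array `rows` is an assumption used only to show the fold sizes):

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(5).reshape(-1, 1)   # stand-in for the 5 rows of A
kf = KFold(n_splits=5, shuffle=True, random_state=77)

test_sizes = []
held_out = []
for fold, (train_idx, test_idx) in enumerate(kf.split(rows), start=1):
    print(fold, train_idx, test_idx)
    test_sizes.append(len(test_idx))
    held_out.extend(test_idx.tolist())
```

Every row is held out exactly once, so the average over folds is an average of five single-sample errors.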
