Q1: K-Fold Cross Validation for Multiple Linear Regression (Least Square Error Fit)
Download the dataset regarding USA House Price Prediction.

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

(a) Divide the dataset into input features (all columns except price) and output variable (price)

In [None]:
df = pd.read_csv("USA_Housing.csv")
x = df.drop(columns=["Price"])
columns = x.columns
y = df["Price"]
x.head()

,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population
0,79545.45857,5.682861,7.009188,4.09,23086.8005
1,79248.64245,6.0029,6.730821,3.09,40173.07217
2,61287.06718,5.86589,8.512727,5.13,36882.1594
3,63345.24005,7.188236,5.586729,3.26,34310.24283
4,59982.19723,5.040555,7.839388,4.23,26354.10947


(b) Scale the values of input features

In [None]:
sc = StandardScaler()
x_scaled = sc.fit_transform(x)
x = pd.DataFrame(x_scaled, columns=columns)
x.head()

,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population
0,1.02866,-0.296927,0.021274,0.088062,-1.317599
1,1.000808,0.025902,-0.255506,-0.722301,0.403999
2,-0.684629,-0.112303,1.516243,0.93084,0.07241
3,-0.491499,1.221572,-1.393077,-0.58454,-0.186734
4,-0.807073,-0.944834,0.846742,0.201513,-0.988387
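
For reference, the scaling in (b) is a column-wise z-score: subtract each column's mean and divide by its standard deviation. The sketch below reproduces that arithmetic in plain NumPy on a made-up array (`demo` is hypothetical and not taken from the dataset). It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical toy matrix: 4 samples, 2 features (not from USA_Housing.csv).
demo = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0],
                 [4.0, 40.0]])

mu = demo.mean(axis=0)       # per-column mean
sigma = demo.std(axis=0)     # per-column population std (ddof=0)
demo_scaled = (demo - mu) / sigma

# After scaling, every column has mean ~0 and standard deviation ~1.
print(demo_scaled.mean(axis=0))
print(demo_scaled.std(axis=0))
```

Both toy columns end up identical after scaling because they differ only by a constant factor, which the z-score removes.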


(c) Divide input and output features into five folds.

In [6]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)
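
To see what `KFold.split` yields, here is a small sketch on a hypothetical 10-row array (`toy` and `kf_demo` are illustrative names, not part of the assignment). Each of the five folds holds out two rows, and every row is used as test data exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(10).reshape(-1, 1)   # hypothetical 10-row stand-in for x
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

seen = []  # collects every test index across the folds
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(toy), start=1):
    print(f"fold {fold}: {len(train_idx)} training rows, test rows {test_idx}")
    seen.extend(test_idx.tolist())

# Each row index appears in exactly one test fold.
print(sorted(seen))
```

`split` returns positional indices, which is why the notebook selects rows with `.iloc`.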

(d) Run five iterations. In each iteration, use one fold as the test set and the
remaining four folds as the training set. Find the beta (𝛽) matrix, the predicted
values, and the R2_score for each iteration using a least square error fit.

In [None]:
best_r2 = -np.inf
best_beta = None
r2_scores = []

for train_idx, test_idx in kf.split(x):
    x_train, x_test = x.iloc[train_idx], x.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # Prepend a column of ones so that beta[0] acts as the intercept.
    x_train_new = np.hstack([np.ones((x_train.shape[0], 1)), x_train])
    x_test_new = np.hstack([np.ones((x_test.shape[0], 1)), x_test])

    # Least-squares fit via the normal equation: beta = (X^T X)^+ X^T y
    beta = np.linalg.pinv(x_train_new.T @ x_train_new) @ x_train_new.T @ y_train
    y_pred = x_test_new @ beta

    r2 = r2_score(y_test, y_pred)
    r2_scores.append(r2)

    # Keep the coefficients from the fold with the highest test R2.
    if r2 > best_r2:
        best_r2 = r2
        best_beta = beta
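
Written out, the loop above computes the following quantities for each fold. Here $X$ is the training design matrix with a leading column of ones, $(\cdot)^{+}$ is the Moore-Penrose pseudoinverse (what `np.linalg.pinv` returns), and $\bar{y}$ is the mean of the test targets. The symbols are introduced here for illustration only.

```latex
\hat{\beta} = (X^{\top} X)^{+} X^{\top} y_{\text{train}}, \qquad
\hat{y} = X_{\text{test}} \, \hat{\beta}, \qquad
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```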

(e) Use the best value of the (𝛽) matrix (for which R2_score is maximum) to train the
regressor on 70% of the data and test its performance on the remaining 30% of the data.

In [9]:
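# Hold out 30% of the (already scaled) data for testing.
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# Prepend a column of ones for the intercept term.
X_train_final = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
X_test_final = np.hstack([np.ones((X_test.shape[0], 1)), X_test])

# Note: the coefficients are refit on the 70% split with the same normal-equation
# formula; best_beta from (d) is not reused here.
beta_final = np.linalg.pinv(X_train_final.T @ X_train_final) @ X_train_final.T @ Y_train
Y_pred = X_test_final @ beta_final
r2_final = r2_score(Y_test, Y_pred)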

print(r2_final)

0.9146818498916267
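
As a cross-check on the normal-equation formula, the sketch below compares it with NumPy's built-in least-squares solver, `np.linalg.lstsq`, on synthetic data. All names and numbers here (`X_demo`, `true_beta`, the noise level) are made up for illustration and are unrelated to the housing dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem: 200 samples, 3 features, known coefficients.
X_demo = rng.normal(size=(200, 3))
true_beta = np.array([5.0, 2.0, -1.0, 0.5])        # intercept first
A = np.hstack([np.ones((200, 1)), X_demo])          # design matrix with ones column
y_demo = A @ true_beta + rng.normal(scale=0.1, size=200)

# Normal equation with the pseudoinverse, as in the notebook.
beta_pinv = np.linalg.pinv(A.T @ A) @ A.T @ y_demo

# Reference solution from NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print(beta_pinv)
print(beta_lstsq)
```

On a well-conditioned problem like this one the two estimates should agree to within floating-point error. `lstsq` avoids forming `A.T @ A` explicitly, which tends to be more robust when features are nearly collinear.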


Q2: Concept of Validation Set for Multiple Linear Regression (Gradient Descent Optimization)

Consider the same dataset as in Q1. Rather than dividing the dataset into five folds, divide it into a training set (56%), a validation set (14%), and a test set (30%). Consider four different values of the learning rate, i.e. {0.001, 0.01, 0.1, 1}. Compute the regression coefficients for each learning rate after 1000 iterations. For each set of regression coefficients, compute the R2_score on the validation and test sets, and find the best set of regression coefficients.
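
A three-way split like this can be built from two successive two-way splits. The sketch below checks the arithmetic with exact fractions, assuming the first split holds out 44% of the data and the second sends 30/44 of that held-out part to the test set. The variable names are illustrative.

```python
from fractions import Fraction

first_holdout = Fraction(44, 100)   # share held out by the first split
second_test = Fraction(30, 44)      # share of the held-out part sent to test

train = 1 - first_holdout
val = first_holdout * (1 - second_test)
test = first_holdout * second_test

print(float(train), float(val), float(test))   # 0.56 0.14 0.3
```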

In [11]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("USA_Housing.csv")
X = data.drop(columns=["Price"])
y = data["Price"]

scaler = StandardScaler()
X = scaler.fit_transform(X)
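# First hold out 44% (validation + test), then send 30/44 of that part to the
# test set. This gives 56% train, 14% validation, and 30% test.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.44, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=30/44, random_state=42)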

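def gradient_descent(X, y, lr, iterations):
    """Fit linear-regression coefficients by batch gradient descent."""
    m, n = X.shape
    X = np.c_[np.ones(m), X]           # add the intercept column
    y = y.values.reshape(-1, 1)        # pandas Series -> column vector
    beta = np.zeros((n + 1, 1))        # start from all-zero coefficients

    for _ in range(iterations):
        y_pred = X @ beta
        error = y_pred - y
        gradient = (X.T @ error) / m   # gradient of (1/2m) * sum(error^2)
        beta -= lr * gradient
    return beta

def predict(X, beta):
    """Return predictions for X (without intercept column) given beta."""
    m = X.shape[0]
    X = np.c_[np.ones(m), X]
    return X @ beta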

learning_rates = [0.001, 0.01, 0.1, 1]
best_r2 = -np.inf
best_beta = None
best_lr = None

for lr in learning_rates:
    beta = gradient_descent(X_train, y_train, lr, iterations=1000)
    y_val_pred = predict(X_val, beta)
    y_test_pred = predict(X_test, beta)
    r2_val = r2_score(y_val, y_val_pred)
    r2_test = r2_score(y_test, y_test_pred)

    print(f"LR: {lr}, Validation R2: {r2_val:.4f}, Test R2: {r2_test:.4f}")
    if r2_val > best_r2:
        best_r2 = r2_val
        best_beta = beta
        best_lr = lr

print(f"\nBest Learning Rate: {best_lr}")
print(f"Best Beta Coefficients:\n{best_beta.flatten()}")


LR: 0.001, Validation R2: -1.0428, Test R2: -0.9601
LR: 0.01, Validation R2: 0.9199, Test R2: 0.9134
LR: 0.1, Validation R2: 0.9200, Test R2: 0.9134
LR: 1, Validation R2: 0.9200, Test R2: 0.9134

Best Learning Rate: 0.1
Best Beta Coefficients:
[1232180.27200919  230645.88389435  165328.94019375  120045.00851908
    2945.02108903  151375.22971285]
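
One plausible reading of the negative R2 at a learning rate of 0.001 is that 1000 iterations are not enough to converge. With centered features, the intercept update decouples from the others, and after k steps from zero it equals mean(y) * (1 - (1 - lr)^k). That is only about 63% of its final value for lr = 0.001 and k = 1000. The sketch below illustrates this on synthetic data; the data, the helper `gd`, and all names are hypothetical and are not the notebook's dataset or code.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 500

# Synthetic standardized features and a target with a large offset.
X_raw = rng.normal(size=(m, 3))
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
y_demo = 100.0 + X_std @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=m)

A = np.c_[np.ones(m), X_std]                      # design matrix with intercept
beta_exact, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

def gd(A, y, lr, iterations=1000):
    """Batch gradient descent on the mean squared error, starting from zeros."""
    beta = np.zeros(A.shape[1])
    for _ in range(iterations):
        beta -= lr * (A.T @ (A @ beta - y)) / len(y)
    return beta

for lr in [0.001, 0.01, 0.1, 1]:
    beta = gd(A, y_demo, lr)
    print(f"lr={lr}: intercept={beta[0]:.3f} (least squares: {beta_exact[0]:.3f})")
```

In this toy setting the smallest rate leaves the intercept well short of the least-squares value, while the larger rates land close to it. That matches the pattern in the output above, although the sketch does not prove it is the cause there.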
