#Q1: K-Fold Cross Validation for Multiple Linear Regression (Least Square Error Fit)
Download the dataset regarding USA House Price Prediction from the following link:
https://drive.google.com/file/d/1O_NwpJT-8xGfU_-3llUl2sgPu0xllOrX/view?usp=sharing
Load the dataset and implement 5-fold cross validation for multiple linear regression (using least square error fit).
Steps:
a) Divide the dataset into input features (all columns except price) and output variable (price).
b) Scale the values of the input features.
c) Divide the input features and output variable into five folds.
d) Run five iterations; in each iteration, use one fold as the test set and the remaining four folds as the training set. Find the beta (𝛽) matrix, predicted values, and R2_score for each iteration using least square error fit.
e) Use the best value of the (𝛽) matrix (the one for which R2_score is maximum) to train the regressor on 70% of the data and test its performance on the remaining 30%.
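
For reference, the least square error fit in step d has a closed form. The notation below is assumed for illustration: $X$ is the training design matrix with a leading column of ones for the intercept, $y$ is the vector of prices, $\hat{y}$ holds the predictions on the test fold, and $\bar{y}$ is the mean of the test-fold targets.

```latex
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y, \qquad
\hat{y} = X_{\text{test}} \, \hat{\beta}, \qquad
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```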

In [None]:
# Importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



In [None]:
# Loading dataset
from google.colab import files
uploaded = files.upload()



Saving USA_Housing.csv to USA_Housing.csv


In [None]:
df= pd.read_csv('USA_Housing.csv')
df.head()

   Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  Avg. Area Number of Bedrooms  Area Population      Price
0       79545.45857             5.682861                   7.009188                          4.09       23086.8005  1059034.0
1       79248.64245               6.0029                   6.730821                          3.09      40173.07217  1505891.0
2       61287.06718              5.86589                   8.512727                          5.13       36882.1594  1058988.0
3       63345.24005             7.188236                   5.586729                          3.26      34310.24283  1260617.0
4       59982.19723             5.040555                   7.839388                          4.23      26354.10947   630943.5


In [None]:
# a) Divide the dataset into input features (all columns except price) and output variable (price)
X= df.drop('Price', axis=1)
y= df['Price']

print(X.shape)
print(y.shape)

(5000, 5)
(5000,)


In [None]:
# b) Scale the values of input features
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()
X= scaler.fit_transform(X)
print(X[:5])

[[ 1.02865969 -0.29692705  0.02127433  0.08806222 -1.31759867]
 [ 1.00080775  0.02590164 -0.25550611 -0.72230146  0.40399945]
 [-0.68462915 -0.11230283  1.5162435   0.93084045  0.07240989]
 [-0.49149907  1.22157207 -1.39307717 -0.58453963 -0.18673422]
 [-0.80707253 -0.94483368  0.84674187  0.20151314 -0.98838741]]
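
As a rough check of what the scaling step does, the sketch below compares `StandardScaler` with the same standardization written out by hand. The matrix `X_toy` is a hypothetical example, not part of the housing data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: 4 samples, 2 features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0],
                  [4.0, 700.0]])

X_sklearn = StandardScaler().fit_transform(X_toy)

# Same transformation by hand: subtract the column mean,
# then divide by the population standard deviation (ddof=0)
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_sklearn, X_manual))
print(X_manual.mean(axis=0), X_manual.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1, which is what keeps the gradient descent steps in Q2 comparable across features.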


In [None]:
# c) Divide input and output features into five folds.
from sklearn.model_selection import KFold
kf= KFold(n_splits=5, shuffle=True, random_state=42)
for train_index,test_index in kf.split(X):
    print("Train size:", len(train_index), "Test size:", len(test_index))



Train size: 4000 Test size: 1000
Train size: 4000 Test size: 1000
Train size: 4000 Test size: 1000
Train size: 4000 Test size: 1000
Train size: 4000 Test size: 1000
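
To make the fold structure easier to inspect, here is a small sketch on a hypothetical 10-sample dataset (`n_samples` and `kf_toy` are illustrative names). It checks that every sample lands in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 10  # hypothetical small dataset so the folds are easy to read
kf_toy = KFold(n_splits=5, shuffle=True, random_state=42)

# Collect the test indices of each fold
test_folds = [test_idx for _, test_idx in kf_toy.split(np.arange(n_samples))]
for i, test_idx in enumerate(test_folds, start=1):
    print(f"Fold {i} test indices: {sorted(test_idx.tolist())}")

# Every sample is used for testing exactly once across the five folds
all_test = np.sort(np.concatenate(test_folds))
print(np.array_equal(all_test, np.arange(n_samples)))
```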


In [None]:
# d) Run five iterations; in each iteration, use one fold as the test set and the remaining four folds as the training set.
# Find the beta (𝛽) matrix, predicted values, and R2_score for each iteration using least square error fit.

best_r2 = -np.inf
best_beta = None
results = []

# np.asarray makes positional indexing work whether X is a DataFrame or the NumPy array returned by the scaler
X_arr = np.asarray(X)

for fold, (train_index, test_index) in enumerate(kf.split(X_arr)):
    X_train, X_test = X_arr[train_index], X_arr[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Add a column of ones for the intercept term
    X_train_with_intercept = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
    X_test_with_intercept = np.hstack([np.ones((X_test.shape[0], 1)), X_test])

    # Calculate beta using the least squares formula: beta = (X^T X)^-1 X^T y
    beta = np.linalg.inv(X_train_with_intercept.T.dot(X_train_with_intercept)).dot(X_train_with_intercept.T).dot(y_train)

    # Predict values
    y_pred = X_test_with_intercept.dot(beta)

    # Calculate R2 score
    r2_score = 1 - np.sum((y_test - y_pred)**2) / np.sum((y_test - np.mean(y_test))**2)

    results.append({'fold': fold + 1, 'beta': beta, 'r2_score': r2_score})

    print(f'Fold {fold + 1}:')
    print('Beta matrix: ', beta)
    #print('Predicted values:', y_pred)
    print('R2_score:', r2_score)
    print('-' * 30)

    # Check if this fold has the best R2 score
    if r2_score > best_r2:
        best_r2 = r2_score
        best_beta = beta

print(f'\nBest R2_score: {best_r2}')
print(f'Best Beta matrix (from fold with best R2_score): {best_beta}')

Fold 1:
Beta matrix:  [-2.63507290e+06  2.16522058e+01  1.64666481e+05  1.19624012e+05
  2.44037761e+03  1.52703134e+01]
R2_score: 0.9179971706985152
------------------------------
Fold 2:
Beta matrix:  [-2.64332951e+06  2.14960668e+01  1.67328375e+05  1.20843820e+05
  1.69564375e+03  1.52020352e+01]
R2_score: 0.9145677884802818
------------------------------
Fold 3:
Beta matrix:  [-2.62335104e+06  2.16032765e+01  1.64185223e+05  1.20332947e+05
  1.01065526e+03  1.51375276e+01]
R2_score: 0.9116116385364474
------------------------------
Fold 4:
Beta matrix:  [-2.65361218e+06  2.15353015e+01  1.66652443e+05  1.22139754e+05
  2.48272497e+03  1.52063570e+01]
R2_score: 0.9193091764960825
------------------------------
Fold 5:
Beta matrix:  [-2.63157196e+06  2.16033277e+01  1.65386268e+05  1.20424768e+05
  6.34893338e+02  1.51806193e+01]
R2_score: 0.9243869413350321
------------------------------

Best R2_score: 0.9243869413350321
Best Beta matrix (from fold with best R2_score): [-2.6315719
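
The loop above solves the normal equations with an explicit matrix inverse. A possible alternative, sketched here on synthetic data (`X_demo`, `y_demo`, and `true_beta` are made up for illustration), is `np.linalg.lstsq`, which minimizes the same squared error without forming the inverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one training fold: 200 samples, 3 features
X_demo = rng.normal(size=(200, 3))
true_beta = np.array([5.0, 2.0, -1.0, 0.5])      # intercept first
A = np.hstack([np.ones((200, 1)), X_demo])       # design matrix with intercept column
y_demo = A @ true_beta + rng.normal(scale=0.1, size=200)

# Normal equations with an explicit inverse, as in the fold loop
beta_inv = np.linalg.inv(A.T @ A) @ A.T @ y_demo

# Least-squares solver: same minimizer, no explicit inverse
beta_lstsq, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print(np.allclose(beta_inv, beta_lstsq))
```

On well-conditioned data such as standardized features the two should agree closely; the solver is generally the safer choice when columns are nearly collinear.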

In [None]:
# e) Use the best value of the (𝛽) matrix (the one for which R2_score is maximum).
# This cell applies best_beta to the entire dataset; it does not perform the 70%/30% split described in step e.
X_with_intercept = np.hstack([np.ones((X.shape[0], 1)), X])
y_pred_final = X_with_intercept.dot(best_beta)

# Evaluate the performance on the entire dataset
final_r2_score = 1 - np.sum((y - y_pred_final)**2) / np.sum((y - np.mean(y))**2)

print(f'R2_score on the entire dataset using the best beta: {final_r2_score}')

R2_score on the entire dataset using the best beta: 0.9180056538688431
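
Step e asks for a 70%/30% split, whereas the cell above scores `best_beta` on all 5000 rows. One way the split could look is sketched below on synthetic data; `X_demo`, `y_demo`, `add_intercept`, `r2`, and `beta_70` are hypothetical names, and refitting on the 70% portion is an assumption about what "train the regressor" means here. Scoring `best_beta` directly on the 30% portion would be another reading.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-ins for the scaled features and the target
X_demo = rng.normal(size=(500, 5))
coefs = np.array([1.0, 2.0, 0.5, 0.0, -1.0])
y_demo = 3.0 + X_demo @ coefs + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)

def add_intercept(X):
    """Prepend a column of ones for the intercept term."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def r2(y_true, y_hat):
    """Coefficient of determination, same formula as in the fold loop."""
    return 1 - np.sum((y_true - y_hat) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)

# Fit on the 70% portion only, then score on the held-out 30%
beta_70, *_ = np.linalg.lstsq(add_intercept(X_tr), y_tr, rcond=None)
r2_holdout = r2(y_te, add_intercept(X_te) @ beta_70)

print(X_tr.shape, X_te.shape)
print(f"Hold-out R2: {r2_holdout:.4f}")
```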


#Q2: Concept of Validation Set for Multiple Linear Regression (Gradient Descent Optimization)
Consider the same dataset as in Q1. Rather than dividing the dataset into five folds, divide it into a training set (56%), a validation set (14%), and a test set (30%).
Consider four different values of the learning rate, i.e. {0.001, 0.01, 0.1, 1}. Compute the values of the regression coefficients for each learning rate after 1000 iterations.
For each set of regression coefficients, compute R2_score for the validation and test sets and find the best value of the regression coefficients.
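
For reference, a gradient descent update consistent with this task can be written as follows. The notation is assumed for illustration: $m$ is the number of training samples, $\alpha$ is the learning rate, and $X$ includes a leading column of ones; the $\tfrac{1}{2m}$ scaling of the cost is one common convention.

```latex
J(\beta) = \frac{1}{2m} \lVert X\beta - y \rVert^2, \qquad
\nabla J(\beta) = \frac{1}{m} X^{\top} (X\beta - y), \qquad
\beta \leftarrow \beta - \alpha \, \nabla J(\beta)
```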

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the dataset
df = pd.read_csv('USA_Housing.csv')

# Separate input features and output variable
X = df.drop('Price', axis=1)
y = df['Price']

# Scale input features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Shape of X_scaled:", X_scaled.shape)
print("Shape of y:", y.shape)

Shape of X_scaled: (5000, 5)
Shape of y: (5000,)


In [None]:
from sklearn.model_selection import train_test_split

# Split into training (56%) and temporary (44%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X_scaled, y, test_size=0.44, random_state=42)

# Split temporary set into validation (14%) and test (30%) sets
# Calculate the test size for the second split: 30% of original data is 30/(14+30) of the temp data
test_size_second_split = 0.30 / 0.44
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=test_size_second_split, random_state=42)

# Print the shapes to verify the splits
print("Shape of X_train:", X_train.shape)
print("Shape of X_val:", X_val.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_val:", y_val.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (2800, 5)
Shape of X_val: (700, 5)
Shape of X_test: (1500, 5)
Shape of y_train: (2800,)
Shape of y_val: (700,)
Shape of y_test: (1500,)


In [None]:
import numpy as np

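def gradient_descent(X_train, y_train, learning_rate, num_iterations):
    """Fit linear regression coefficients by batch gradient descent.

    Returns beta with the intercept as its first element.
    """
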
    # Add a column of ones for the intercept term
    X_train_with_intercept = np.hstack([np.ones((X_train.shape[0], 1)), X_train])

    # Initialize beta values
    beta = np.zeros(X_train_with_intercept.shape[1])

    m = X_train.shape[0] # Number of training samples

    for i in range(num_iterations):
        # Calculate predicted values
        y_pred = X_train_with_intercept.dot(beta)

        # Calculate the error
        error = y_pred - y_train

        # Calculate the gradient
        gradient = (1/m) * X_train_with_intercept.T.dot(error)

        # Update beta values
        beta = beta - learning_rate * gradient

    return beta
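
A quick sanity check for a routine like this is to compare its result with the closed-form least squares solution. The sketch below repeats the same update rule in compact form so that it runs on its own, and uses synthetic data (`X_demo`, `y_demo` are made up for illustration).

```python
import numpy as np

def gradient_descent(X_train, y_train, learning_rate, num_iterations):
    # Same update rule as the function above, written compactly
    A = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
    beta = np.zeros(A.shape[1])
    m = X_train.shape[0]
    for _ in range(num_iterations):
        gradient = (1 / m) * A.T.dot(A.dot(beta) - y_train)
        beta = beta - learning_rate * gradient
    return beta

rng = np.random.default_rng(2)

# Synthetic, roughly standardized features with a known linear relationship
X_demo = rng.normal(size=(300, 3))
y_demo = 10.0 + X_demo @ np.array([4.0, -2.0, 1.0]) + rng.normal(scale=0.3, size=300)

beta_gd = gradient_descent(X_demo, y_demo, learning_rate=0.1, num_iterations=1000)

# Closed-form least squares solution on the same data
A_demo = np.hstack([np.ones((300, 1)), X_demo])
beta_exact, *_ = np.linalg.lstsq(A_demo, y_demo, rcond=None)

print(np.allclose(beta_gd, beta_exact, atol=1e-6))
```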

In [None]:
learning_rates = [0.001, 0.01, 0.1, 1]
results = []

for learning_rate in learning_rates:
    print(f"Training with learning rate: {learning_rate}")
    # Call the gradient_descent function
    beta_gd = gradient_descent(X_train, y_train, learning_rate, num_iterations=1000)

    # Add intercept term to validation and test sets
    X_val_with_intercept = np.hstack([np.ones((X_val.shape[0], 1)), X_val])
    X_test_with_intercept = np.hstack([np.ones((X_test.shape[0], 1)), X_test])

    # Calculate predicted values for validation and test sets
    y_pred_val = X_val_with_intercept.dot(beta_gd)
    y_pred_test = X_test_with_intercept.dot(beta_gd)

    # Compute R2 score for validation set
    r2_val = 1 - np.sum((y_val - y_pred_val)**2) / np.sum((y_val - np.mean(y_val))**2)

    # Compute R2 score for test set
    r2_test = 1 - np.sum((y_test - y_pred_test)**2) / np.sum((y_test - np.mean(y_test))**2)

    # Store results
    results.append({
        'learning_rate': learning_rate,
        'beta': beta_gd,
        'r2_val': r2_val,
        'r2_test': r2_test
    })

    print(f"Learning Rate: {learning_rate}, R2 Validation: {r2_val:.4f}, R2 Test: {r2_test:.4f}")
    print("-" * 30)

Training with learning rate: 0.001
Learning Rate: 0.001, R2 Validation: -1.0428, R2 Test: -0.9601
------------------------------
Training with learning rate: 0.01
Learning Rate: 0.01, R2 Validation: 0.9199, R2 Test: 0.9134
------------------------------
Training with learning rate: 0.1
Learning Rate: 0.1, R2 Validation: 0.9200, R2 Test: 0.9134
------------------------------
Training with learning rate: 1
Learning Rate: 1, R2 Validation: 0.9200, R2 Test: 0.9134
------------------------------
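
The negative R2 at learning rate 0.001 is consistent with gradient descent not having converged after 1000 iterations. As a back-of-the-envelope sketch, assume the curvature along a coefficient is exactly 1 (an idealization that roughly holds for the intercept with standardized features; correlated features change the exact rates). Each step then shrinks the remaining gap to the optimum by a factor of `1 - lr`. The name `closed` is illustrative.

```python
num_iterations = 1000
closed = {}

for lr in [0.001, 0.01, 0.1, 1]:
    # With unit curvature, each step multiplies the remaining error by (1 - lr)
    remaining = abs(1 - lr) ** num_iterations
    closed[lr] = 1 - remaining
    print(f"lr={lr}: fraction of the gap to the optimum closed = {closed[lr]:.4f}")
```

Under this assumption, the smallest learning rate covers only about 63% of the distance from the zero initialization to the optimum, while the other three rates are effectively converged, which matches their near-identical R2 scores.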


In [None]:
best_r2_val = -np.inf
best_beta_gd = None

for result in results:
    if result['r2_val'] > best_r2_val:
        best_r2_val = result['r2_val']
        best_beta_gd = result['beta']

print(f"The best R2 score on the validation set is: {best_r2_val:.4f}")
print(f"The corresponding best beta coefficients are: {best_beta_gd}")

The best R2 score on the validation set is: 0.9200
The corresponding best beta coefficients are: [1232180.27200919  230645.88389435  165328.94019375  120045.00851908
    2945.02108903  151375.22971285]


In [None]:
# Find the result entry corresponding to the best validation R2 score
best_result = None
for result in results:
    if result['r2_val'] == best_r2_val:
        best_result = result
        break

# Access the R2 score on the test set from the best result
best_r2_test = best_result['r2_test']

# Print the best R2 score on the test set
print(f"The R2 score on the test set with the best beta coefficients (based on validation set performance) is: {best_r2_test:.4f}")

The R2 score on the test set with the best beta coefficients (based on validation set performance) is: 0.9134
