# Q1: K-Fold Cross Validation for Multiple Linear Regression (Least Square Error Fit)

Load the dataset and implement 5-fold cross-validation for multiple linear regression
(using a least squares error fit).
Steps:
* a) Divide the dataset into input features (all columns except price) and the output variable (price).
* b) Scale the values of the input features.
* c) Divide the input features and the output variable into five folds.
* d) Run five iterations; in each iteration, use one fold as the test set and the remaining four folds as the training set. Find the beta (𝛽) matrix, the predicted values, and the R2 score for each iteration using a least squares error fit.
* e) Use the best 𝛽 matrix (the one for which the R2 score is maximum) to train the regressor on 70% of the data and test its performance on the remaining 30%.
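
Step (d) can also be done without a fitted estimator object. The following is a minimal sketch, assuming a closed-form least squares solve via `np.linalg.lstsq` and a hand-rolled fold split; the helper names (`fit_least_squares`, `r2`) and the synthetic data are hypothetical stand-ins, not part of the assignment dataset.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return beta = [intercept, coef_1, ..., coef_p] minimizing ||X_aug @ beta - y||^2."""
    X_aug = np.column_stack([np.ones(len(X)), X])  # ones column FIRST
    beta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return beta

def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical synthetic data standing in for the scaled housing features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
true_beta = np.array([5.0, 2.0, -1.0, 0.5])  # [intercept, three coefficients]
y_demo = true_beta[0] + X_demo @ true_beta[1:] + rng.normal(scale=0.1, size=100)

# Shuffle the row indices once, then cut them into five folds.
folds = np.array_split(rng.permutation(len(X_demo)), 5)

fold_scores, fold_betas = [], []
for k, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    beta_k = fit_least_squares(X_demo[train_idx], y_demo[train_idx])
    X_test_aug = np.column_stack([np.ones(len(test_idx)), X_demo[test_idx]])
    fold_betas.append(beta_k)
    fold_scores.append(r2(y_demo[test_idx], X_test_aug @ beta_k))

best_k = int(np.argmax(fold_scores))
print("R2 per fold:", fold_scores)
print("Best beta:", fold_betas[best_k])
```

Keeping the ones column in a fixed position (first, in this sketch) in both the fit and the prediction is what makes a stored 𝛽 reusable later.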

In [1]:
import pandas as pd

df = pd.read_csv('/content/USA_Housing.csv')
display(df.head())
display(df.info())

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price
0,79545.45857,5.682861,7.009188,4.09,23086.8005,1059034.0
1,79248.64245,6.0029,6.730821,3.09,40173.07217,1505891.0
2,61287.06718,5.86589,8.512727,5.13,36882.1594,1058988.0
3,63345.24005,7.188236,5.586729,3.26,34310.24283,1260617.0
4,59982.19723,5.040555,7.839388,4.23,26354.10947,630943.5


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Avg. Area Income              5000 non-null   float64
 1   Avg. Area House Age           5000 non-null   float64
 2   Avg. Area Number of Rooms     5000 non-null   float64
 3   Avg. Area Number of Bedrooms  5000 non-null   float64
 4   Area Population               5000 non-null   float64
 5   Price                         5000 non-null   float64
dtypes: float64(6)
memory usage: 234.5 KB


None

In [2]:
X = df.drop('Price', axis=1)
y = df['Price']

display(X.head())
display(y.head())

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population
0,79545.45857,5.682861,7.009188,4.09,23086.8005
1,79248.64245,6.0029,6.730821,3.09,40173.07217
2,61287.06718,5.86589,8.512727,5.13,36882.1594
3,63345.24005,7.188236,5.586729,3.26,34310.24283
4,59982.19723,5.040555,7.839388,4.23,26354.10947


Unnamed: 0,Price
0,1059034.0
1,1505891.0
2,1058988.0
3,1260617.0
4,630943.5


In [3]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

display(pd.DataFrame(X_scaled, columns=X.columns).head())

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population
0,1.02866,-0.296927,0.021274,0.088062,-1.317599
1,1.000808,0.025902,-0.255506,-0.722301,0.403999
2,-0.684629,-0.112303,1.516243,0.93084,0.07241
3,-0.491499,1.221572,-1.393077,-0.58454,-0.186734
4,-0.807073,-0.944834,0.846742,0.201513,-0.988387


In [4]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np

kf = KFold(n_splits=5, shuffle=True, random_state=42)

r2_scores = []
beta_matrices = []

for train_index, test_index in kf.split(X_scaled):
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    model = LinearRegression()
    model.fit(X_train, y_train)

    # Layout of beta: [coef_1, ..., coef_p, intercept] (intercept is the LAST entry)
    beta = np.append(model.coef_, model.intercept_)
    beta_matrices.append(beta)

    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    r2_scores.append(r2)

best_r2_index = np.argmax(r2_scores)

best_beta = beta_matrices[best_r2_index]

print("R2 Scores for each fold:", r2_scores)
print("Best R2 Score:", r2_scores[best_r2_index])
print("Best Beta Matrix:", best_beta)

R2 Scores for each fold: [0.9179971706985147, 0.9145677884802819, 0.9116116385364478, 0.9193091764960816, 0.9243869413350316]
Best R2 Score: 0.9243869413350316
Best Beta Matrix: [2.30225051e+05 1.63956839e+05 1.21115120e+05 7.83467170e+02
 1.50662447e+05 1.23161736e+06]


In [5]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

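# best_beta was built with np.append(model.coef_, model.intercept_), so the
# intercept is its LAST entry; the column of ones must therefore go last too.
# Inserting the ones at column 0 pairs every coefficient with the wrong column.
X_train_scaled_intercept = np.append(X_train_scaled, np.ones((X_train_scaled.shape[0], 1)), axis=1)
X_test_scaled_intercept = np.append(X_test_scaled, np.ones((X_test_scaled.shape[0], 1)), axis=1)

y_pred_best_beta = X_test_scaled_intercept @ best_beta

r2_final = r2_score(y_test, y_pred_best_beta)

print("Final R2 score using the best beta matrix on 70/30 split:", r2_final)

Final R2 score using the best beta matrix on 70/30 split: -16.711123036834696  (recorded with the ones column inserted at index 0, i.e. misaligned with best_beta; re-run the cell to refresh)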
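
Evaluating a stored 𝛽 by hand only works when the column of ones sits at the same position as the intercept inside 𝛽. The sketch below, on hypothetical synthetic data (`X_demo`, `y_demo` are not from the housing dataset), compares an aligned and a misaligned layout against `model.predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data, used only to illustrate the column-order issue.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 + X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.05, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# Same layout as np.append(model.coef_, model.intercept_): intercept last.
beta_last = np.append(model.coef_, model.intercept_)

X_ones_last = np.column_stack([X_demo, np.ones(len(X_demo))])   # matches beta_last
X_ones_first = np.column_stack([np.ones(len(X_demo)), X_demo])  # does not match

aligned = X_ones_last @ beta_last
misaligned = X_ones_first @ beta_last

print("aligned matches predict:   ", np.allclose(aligned, model.predict(X_demo)))
print("misaligned matches predict:", np.allclose(misaligned, model.predict(X_demo)))
```

A check like this is cheap to run before reporting a score computed from a hand-built design matrix.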


# Concept of Validation Set for Multiple Linear Regression (Gradient Descent Optimization)

Consider the same dataset as in Q1. Rather than dividing the dataset into five folds, divide it into a training set (56%), a validation set (14%), and a test set (30%).

Consider four different values of the learning rate, {0.001, 0.01, 0.1, 1}. Compute the regression coefficients for each learning rate after 1000 iterations.

For each set of regression coefficients, compute the R2 score on the validation and test sets and find the best regression coefficients.
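
For reference, the update rule for batch gradient descent on the least squares cost can be written as follows. This assumes the common $\frac{1}{2m}$ scaling of the cost, with $m$ the number of training rows, $X$ the design matrix including a column of ones, and $\alpha$ the learning rate; the symbols are chosen here for illustration.

$$
J(\beta) = \frac{1}{2m}\,\lVert X\beta - y \rVert^2, \qquad
\nabla J(\beta) = \frac{1}{m}\, X^\top (X\beta - y), \qquad
\beta \leftarrow \beta - \alpha\, \nabla J(\beta)
$$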

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
import numpy as np

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.2, random_state=42) # 0.2 of 70% is 14%

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

X_train_scaled = np.insert(X_train_scaled, 0, 1, axis=1)
X_val_scaled = np.insert(X_val_scaled, 0, 1, axis=1)
X_test_scaled = np.insert(X_test_scaled, 0, 1, axis=1)


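def gradient_descent(X, y, learning_rate, iterations):
    """Batch gradient descent for least squares; X must already contain the column of ones."""
    y = np.asarray(y, dtype=float)  # work on a plain array instead of a pandas Series
    m = len(y)
    beta = np.zeros(X.shape[1])  # start from all-zero coefficients
    for _ in range(iterations):
        y_pred = X @ beta
        error = y_pred - y
        gradient = (X.T @ error) / m  # gradient of (1 / (2m)) * ||X @ beta - y||^2
        beta -= learning_rate * gradient
    return beta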

learning_rates = [0.001, 0.01, 0.1, 1]
best_r2_val = -np.inf
best_beta_gd = None
results = {}

for lr in learning_rates:
    beta_gd = gradient_descent(X_train_scaled, y_train, lr, 1000)
    y_val_pred = X_val_scaled @ beta_gd
    r2_val = r2_score(y_val, y_val_pred)

    results[lr] = {'beta': beta_gd, 'r2_val': r2_val}

    if r2_val > best_r2_val:
        best_r2_val = r2_val
        best_beta_gd = beta_gd

    print(f"Learning Rate: {lr}, R2 on Validation Set: {r2_val}")

y_test_pred_best_beta_gd = X_test_scaled @ best_beta_gd
r2_test_best_beta_gd = r2_score(y_test, y_test_pred_best_beta_gd)

print("\nBest Beta Matrix (Gradient Descent):", best_beta_gd)
print("R2 on Test Set with Best Beta (Gradient Descent):", r2_test_best_beta_gd)

Learning Rate: 0.001, R2 on Validation Set: -0.7993341493580068
Learning Rate: 0.01, R2 on Validation Set: 0.9098185612062919
Learning Rate: 0.1, R2 on Validation Set: 0.9097995626742029
Learning Rate: 1, R2 on Validation Set: 0.9097995626742028

Best Beta Matrix (Gradient Descent): [1232381.36998704  234542.29331285  162395.51342821  121485.70570465
    3089.76469737  151581.05530854]
R2 on Test Set with Best Beta (Gradient Descent): 0.9147438859798211
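
One way to read the negative validation score at a learning rate of 0.001: because the features are standardized on the training split, their columns have zero mean there, so the intercept update decouples to `b0 <- b0 - lr * (b0 - mean(y_train))`. Starting from zero, the intercept after `t` steps is `mean(y_train) * (1 - (1 - lr)**t)`. The sketch below evaluates that fraction for the four learning rates; it covers only the intercept, under the stated assumption, and the variable names are illustrative.

```python
# Fraction of the converged intercept reached after a fixed number of steps,
# assuming zero-mean feature columns on the training split and beta starting at zero.
learning_rates = [0.001, 0.01, 0.1, 1]
iterations = 1000

fractions = {lr: 1 - (1 - lr) ** iterations for lr in learning_rates}

for lr, frac in fractions.items():
    print(f"lr={lr}: intercept reaches {frac:.4%} of its converged value")
```

At 0.001 the intercept has covered only about 63% of the distance to the mean price after 1000 iterations, so every prediction is biased low, which is consistent with an R2 below zero.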


# Pre-processing and Multiple Linear Regression
1. Load the dataset with following column names ["symboling", "normalized_losses", "make", "fuel_type", "aspiration","num_doors", "body_style", "drive_wheels", "engine_location", "wheel_base", "length", "width", "height", "curb_weight", "engine_type", "num_cylinders", "engine_size", "fuel_system", "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price"] and replace all ? values with NaN
2. Replace all NaN values using central tendency imputation. Drop the rows with NaN values in the price column.
3. There are 10 columns in the dataset with non-numeric values. Convert these values to numeric values using the following scheme:
* (i) For "num_doors" and "num_cylinders": convert words (number names) to figures, e.g., two to 2.
* (ii) For "body_style" and "drive_wheels": use the dummy encoding scheme.
* (iii) For "make", "aspiration", "engine_location", and "fuel_type": use the label encoding scheme.
* (iv) For "fuel_system": replace values containing the string pfi with 1 and all other values with 0.
* (v) For "engine_type": replace values containing the string ohc with 1 and all other values with 0.
4. Divide the dataset into input features (all columns except price) and the output variable (price). Scale all input features.
5. Train a linear regressor on 70% of the data (using Python's built-in linear regression function) and test its performance on the remaining 30%.
6. Reduce the dimensionality of the feature set using the built-in PCA decomposition, then train a linear regressor again on 70% of the reduced data (using Python's built-in linear regression function). Does it lead to any performance improvement on the test set?
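
Step 2 can be read as "drop rows with a missing price, then impute everything else", so that the target is never imputed. A minimal sketch of that ordering on a hypothetical toy frame (the values below are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in a text column, a numeric column, and the target.
toy = pd.DataFrame({
    "num_doors": ["two", "four", np.nan, "four", np.nan],
    "horsepower": [111.0, np.nan, 102.0, 115.0, 90.0],
    "price": [13495.0, 16500.0, np.nan, 17450.0, 9000.0],
})

# 1) Drop rows whose target is missing BEFORE any imputation.
toy = toy.dropna(subset=["price"]).reset_index(drop=True)

# 2) Impute the remaining feature columns: mean for numeric, mode otherwise.
for col in toy.columns.drop("price"):
    if toy[col].isnull().any():
        if pd.api.types.is_numeric_dtype(toy[col]):
            toy[col] = toy[col].fillna(toy[col].mean())
        else:
            toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

Assigning the result of `fillna` back to the column, rather than calling it with `inplace=True` on `df[column]`, avoids relying on chained in-place modification.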

In [16]:
column_names = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration","num_doors", "body_style", "drive_wheels", "engine_location", "wheel_base", "length", "width", "height", "curb_weight", "engine_type", "num_cylinders", "engine_size", "fuel_system", "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price"]

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data', names=column_names, na_values='?')

display(df.head())
display(df.info())

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized_losses  164 non-null    float64
 2   make               205 non-null    object 
 3   fuel_type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num_doors          203 non-null    object 
 6   body_style         205 non-null    object 
 7   drive_wheels       205 non-null    object 
 8   engine_location    205 non-null    object 
 9   wheel_base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb_weight        205 non-null    int64  
 14  engine_type        205 non-null    object 
 15  num_cylinders      205 non-null    object 
 16  engine_size        205 non

None

In [15]:
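for column in df.columns:
    if df[column].isnull().any():
        if df[column].dtype == 'object':
            # Impute with mode for object type
            df[column] = df[column].fillna(df[column].mode()[0])
        else:
            # Impute with mean for numeric type
            df[column] = df[column].fillna(df[column].mean())

# Note: the loop above also mean-imputes 'price', so this dropna removes no rows
# (all 205 remain). Move it above the loop to drop rows with a missing price instead.
df.dropna(subset=['price'], inplace=True)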

display(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 30 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   symboling             205 non-null    int64  
 1   normalized_losses     205 non-null    float64
 2   make                  205 non-null    int64  
 3   fuel_type             205 non-null    int64  
 4   aspiration            205 non-null    int64  
 5   num_doors             205 non-null    int64  
 6   engine_location       205 non-null    int64  
 7   wheel_base            205 non-null    float64
 8   length                205 non-null    float64
 9   width                 205 non-null    float64
 10  height                205 non-null    float64
 11  curb_weight           205 non-null    int64  
 12  engine_type           205 non-null    int64  
 13  num_cylinders         205 non-null    int64  
 14  engine_size           205 non-null    int64  
 15  fuel_system           2

None

In [9]:
from sklearn.preprocessing import LabelEncoder

# (i) For “num_doors” and “num_cylinders”: convert words to figures
num_word_to_fig = {
    'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'eight': 8, 'twelve': 12
}
df['num_doors'] = df['num_doors'].map(num_word_to_fig)
df['num_cylinders'] = df['num_cylinders'].map(num_word_to_fig)

# (ii) For "body_style", "drive_wheels": use dummy encoding scheme
df = pd.get_dummies(df, columns=['body_style', 'drive_wheels'], drop_first=True)

# (iii) For “make”, “aspiration”, “engine_location”,fuel_type: use label encoding scheme
label_encoder = LabelEncoder()
for col in ['make', 'aspiration', 'engine_location', 'fuel_type']:
    df[col] = label_encoder.fit_transform(df[col])

# (iv) For fuel_system: replace values containing string pfi to 1 else all values to 0.
df['fuel_system'] = df['fuel_system'].apply(lambda x: 1 if 'pfi' in x else 0)

# (v) For engine_type: replace values containing string ohc to 1 else all values to 0.
df['engine_type'] = df['engine_type'].apply(lambda x: 1 if 'ohc' in x else 0)

display(df.head())
display(df.info())

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,engine_location,wheel_base,length,width,...,peak_rpm,city_mpg,highway_mpg,price,body_style_hardtop,body_style_hatchback,body_style_sedan,body_style_wagon,drive_wheels_fwd,drive_wheels_rwd
0,3,122.0,0,1,0,2,0,88.6,168.8,64.1,...,5000.0,21,27,13495.0,False,False,False,False,False,True
1,3,122.0,0,1,0,2,0,88.6,168.8,64.1,...,5000.0,21,27,16500.0,False,False,False,False,False,True
2,1,122.0,0,1,0,2,0,94.5,171.2,65.5,...,5000.0,19,26,16500.0,False,True,False,False,False,True
3,2,164.0,1,1,0,4,0,99.8,176.6,66.2,...,5500.0,24,30,13950.0,False,False,True,False,True,False
4,2,164.0,1,1,0,4,0,99.4,176.6,66.4,...,5500.0,18,22,17450.0,False,False,True,False,False,False


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 30 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   symboling             205 non-null    int64  
 1   normalized_losses     205 non-null    float64
 2   make                  205 non-null    int64  
 3   fuel_type             205 non-null    int64  
 4   aspiration            205 non-null    int64  
 5   num_doors             205 non-null    int64  
 6   engine_location       205 non-null    int64  
 7   wheel_base            205 non-null    float64
 8   length                205 non-null    float64
 9   width                 205 non-null    float64
 10  height                205 non-null    float64
 11  curb_weight           205 non-null    int64  
 12  engine_type           205 non-null    int64  
 13  num_cylinders         205 non-null    int64  
 14  engine_size           205 non-null    int64  
 15  fuel_system           2

None

In [10]:
X = df.drop('price', axis=1)
y = df['price']

display(X.head())
display(y.head())

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,engine_location,wheel_base,length,width,...,horsepower,peak_rpm,city_mpg,highway_mpg,body_style_hardtop,body_style_hatchback,body_style_sedan,body_style_wagon,drive_wheels_fwd,drive_wheels_rwd
0,3,122.0,0,1,0,2,0,88.6,168.8,64.1,...,111.0,5000.0,21,27,False,False,False,False,False,True
1,3,122.0,0,1,0,2,0,88.6,168.8,64.1,...,111.0,5000.0,21,27,False,False,False,False,False,True
2,1,122.0,0,1,0,2,0,94.5,171.2,65.5,...,154.0,5000.0,19,26,False,True,False,False,False,True
3,2,164.0,1,1,0,4,0,99.8,176.6,66.2,...,102.0,5500.0,24,30,False,False,True,False,True,False
4,2,164.0,1,1,0,4,0,99.4,176.6,66.4,...,115.0,5500.0,18,22,False,False,True,False,False,False


Unnamed: 0,price
0,13495.0
1,16500.0
2,16500.0
3,13950.0
4,17450.0


In [11]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

display(pd.DataFrame(X_scaled, columns=X.columns).head())

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,engine_location,wheel_base,length,width,...,horsepower,peak_rpm,city_mpg,highway_mpg,body_style_hardtop,body_style_hatchback,body_style_sedan,body_style_wagon,drive_wheels_fwd,drive_wheels_rwd
0,1.74347,0.0,-1.948256,0.328798,-0.469295,-1.141653,-0.121867,-1.690772,-0.426521,-0.844782,...,0.171065,-0.263484,-0.646553,-0.546059,-0.201517,-0.720082,-0.938474,-0.372678,-1.188177,1.302831
1,1.74347,0.0,-1.948256,0.328798,-0.469295,-1.141653,-0.121867,-1.690772,-0.426521,-0.844782,...,0.171065,-0.263484,-0.646553,-0.546059,-0.201517,-0.720082,-0.938474,-0.372678,-1.188177,1.302831
2,0.133509,0.0,-1.948256,0.328798,-0.469295,-1.141653,-0.121867,-0.708596,-0.231513,-0.190566,...,1.261807,-0.263484,-0.953012,-0.691627,-0.201517,1.38873,-0.938474,-0.372678,-1.188177,1.302831
3,0.93849,1.328961,-1.788499,0.328798,-0.469295,0.875923,-0.121867,0.173698,0.207256,0.136542,...,-0.05723,0.787346,-0.186865,-0.109354,-0.201517,-0.720082,1.065559,-0.372678,0.841625,-0.767559
4,0.93849,1.328961,-1.788499,0.328798,-0.469295,0.875923,-0.121867,0.10711,0.207256,0.230001,...,0.272529,0.787346,-1.106241,-1.2739,-0.201517,-0.720082,1.065559,-0.372678,-1.188177,-0.767559


In [12]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X_train_scaled, X_test_scaled, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

model = LinearRegression()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

r2 = r2_score(y_test, y_pred)
print("R2 score on the test set:", r2)

R2 score on the test set: 0.804442243576259


In [13]:
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

pca = PCA(n_components=0.95) # Keep components explaining 95% variance
X_pca = pca.fit_transform(X_scaled)

X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=42)

model_pca = LinearRegression()
model_pca.fit(X_train_pca, y_train)

y_pred_pca = model_pca.predict(X_test_pca)

r2_pca = r2_score(y_test, y_pred_pca)
print("R2 score on the test set after PCA:", r2_pca)

R2 score on the test set after PCA: 0.7500675882701553


In [14]:
print(f"R2 score on the test set with original scaled features: {r2}")
print(f"R2 score on the test set after PCA: {r2_pca}")

if r2_pca > r2:
    print("Applying PCA improved the performance of the linear regression model.")
else:
    print("Applying PCA did not improve the performance of the linear regression model.")

R2 score on the test set with original scaled features: 0.804442243576259
R2 score on the test set after PCA: 0.7500675882701553
Applying PCA did not improve the performance of the linear regression model.
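
In the cells above, the scaler and PCA are fitted on all rows before the 70/30 split. An alternative is to fit them on the training rows only by chaining the steps in a pipeline. The sketch below shows that arrangement on hypothetical synthetic data from `make_regression`; it is not a re-run of the automobile experiment, and its score says nothing about that dataset.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic regression problem standing in for the encoded features.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Scaler and PCA are fitted inside the pipeline, on the training rows only.
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95), LinearRegression())
pipe.fit(X_tr, y_tr)

r2_demo = pipe.score(X_te, y_te)                  # R2 on the held-out 30%
n_kept = pipe.named_steps["pca"].n_components_    # components kept for 95% variance

print("Components kept:", n_kept)
print("R2 on held-out data:", r2_demo)
```

With this layout the test rows influence neither the scaling statistics nor the principal components, so the held-out score is a cleaner estimate.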
