## Assignment 3

Q1:
<br>K-Fold Cross Validation for Multiple Linear Regression (Least Square Error Fit)
<br>Download the dataset regarding USA House Price Prediction from the following link:
https://drive.google.com/file/d/1O_NwpJT-8xGfU_-3llUl2sgPu0xllOrX/view?usp=sharing
<br>Load the dataset and implement 5-fold cross validation for multiple linear regression (using least square error fit).
<br><br>Steps:
<br>a) Divide the dataset into input features (all columns except price) and output variable
(price)
<br>b) Scale the values of input features.
<br>c) Divide input and output features into five folds.
<br>d) Run five iterations. In each iteration, use one fold as the test set and the remaining four folds as the training set. Find the beta (𝛽) matrix, predicted values, and R2_score for each iteration using least square error fit.
<br>e) Use the best value of the (𝛽) matrix (the one for which R2_score is maximum) to train the regressor on 70% of the data and test its performance on the remaining 30%.
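Step c) can also be done by hand, without `KFold`. The following is a minimal sketch, assuming a shuffled index split with NumPy; the helper name `make_folds` and the sample count of 23 are hypothetical and chosen only for illustration.

```python
import numpy as np

def make_folds(n_samples, n_folds=5, seed=42):
    """Return n_folds disjoint index arrays that together cover range(n_samples)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)      # shuffle row indices once
    return np.array_split(shuffled, n_folds)   # fold sizes differ by at most one

folds = make_folds(n_samples=23, n_folds=5)

for k, test_idx in enumerate(folds, start=1):
    # Every fold except the k-th one forms the training set
    train_idx = np.concatenate([f for j, f in enumerate(folds, start=1) if j != k])
    print(f"Fold {k}: {len(train_idx)} train rows, {len(test_idx)} test rows")
```

Each row index appears in exactly one test fold, so every sample is used for testing once and for training four times.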

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, train_test_split

# Load dataset
df = pd.read_csv("USA_Housing.csv")  # rename file if needed

# Step a) Divide into input X and output y
X = df.drop("Price", axis=1).values
y = df["Price"].values

# Step b) Scale input features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Step c) 5-fold cross validation setup
kf = KFold(n_splits=5, shuffle=True, random_state=42)

betas = []
r2_scores = []

# Step d) Run 5 iterations
fold = 1
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Add bias column of 1s
    X_train_b = np.c_[np.ones(X_train.shape[0]), X_train]
    X_test_b = np.c_[np.ones(X_test.shape[0]), X_test]
    
    # Least squares fit: β = (XᵀX)⁻¹ Xᵀy
    beta = np.linalg.inv(X_train_b.T @ X_train_b) @ (X_train_b.T @ y_train)
    
    # Predictions
    y_pred = X_test_b @ beta
    
    # R² score
    r2 = r2_score(y_test, y_pred)
    
    betas.append(beta)
    r2_scores.append(r2)
    
    print(f"Fold {fold}: R2 Score = {r2:.4f}")
    fold += 1

# Step e) Pick the best β (highest R²)
best_idx = np.argmax(r2_scores)
best_beta = betas[best_idx]
print("\nBest fold index:", best_idx+1)
print("Best R2 Score:", r2_scores[best_idx])

# Train-test split 70/30
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)
X_train_b = np.c_[np.ones(X_train.shape[0]), X_train]
X_test_b = np.c_[np.ones(X_test.shape[0]), X_test]

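# Refit on the 70% split with the same least-squares formula.
# Note: best_beta from cross-validation is reported above but is not reused here.
# @ is matrix multiplication (equivalent to np.dot() for 2-D arrays)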
beta_final = np.linalg.inv(X_train_b.T @ X_train_b) @ (X_train_b.T @ y_train)
y_pred_final = X_test_b @ beta_final
final_r2 = r2_score(y_test, y_pred_final)

print("\nFinal Model Performance on 30% test data:")
print("R2 Score =", final_r2)
print("Beta coefficients:\n", beta_final)

Fold 1: R2 Score = 0.9180
Fold 2: R2 Score = 0.9146
Fold 3: R2 Score = 0.9116
Fold 4: R2 Score = 0.9193
Fold 5: R2 Score = 0.9244

Best fold index: 5
Best R2 Score: 0.9243869413350317

Final Model Performance on 30% test data:
R2 Score = 0.9146818498916267
Beta coefficients:
 [1231278.63687691  230464.52520478  164159.19982569  120514.71328324
    2913.62424674  151019.35865134]
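The closed-form fit above inverts XᵀX explicitly. As a hedged aside, the same coefficients can usually be obtained without forming the inverse, which tends to be more robust when features are nearly collinear. The sketch below uses synthetic data (not the housing dataset) and compares the explicit inverse with `np.linalg.solve` and `np.linalg.lstsq`; the variable names are illustrative only.

```python
import numpy as np

# Synthetic data, assumed only for this illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X_b = np.c_[np.ones(X.shape[0]), X]                 # add bias column of 1s
true_beta = np.array([3.0, 1.5, -2.0, 0.0, 0.7])
y = X_b @ true_beta + rng.normal(scale=0.1, size=200)

# 1) Explicit inverse of the normal equations (as in the cell above)
beta_inv = np.linalg.inv(X_b.T @ X_b) @ (X_b.T @ y)

# 2) Solve the normal equations without forming the inverse
beta_solve = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# 3) Least-squares solver working directly on X_b
beta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print("inverse vs lstsq agree:", np.allclose(beta_inv, beta_lstsq))
print("solve vs lstsq agree:  ", np.allclose(beta_solve, beta_lstsq))
```

On well-conditioned data such as standardized features, all three routes should give the same coefficients up to floating-point error.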


Q2:
<br>Concept of Validation set for Multiple Linear Regression (Gradient Descent Optimization)
<br>Using the same dataset as Q1, divide it into a training set (56%), a validation set (14%), and a test set (30%) instead of five folds.
<br>Consider four learning rates, {0.001, 0.01, 0.1, 1}, and compute the regression coefficients for each one after 1000 iterations.
<br>For each set of regression coefficients, compute the R2_score on the validation and test sets, and find the best set of regression coefficients.

In [2]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Load dataset
df = pd.read_csv("USA_Housing.csv")  # change filename if needed

# Input features and output
X = df.drop("Price", axis=1).values
y = df["Price"].values.reshape(-1, 1)  # make y a column vector

# Scale features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split into Train (56%), Validation (14%), Test (30%)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.2, random_state=42)   # 0.2 * 0.7 = 0.14 of the full data

# Add bias term
def add_bias(X):
    return np.c_[np.ones((X.shape[0], 1)), X]

X_train_b = add_bias(X_train)
X_val_b = add_bias(X_val)
X_test_b = add_bias(X_test)

# Gradient Descent function
def gradient_descent(X, y, lr, iterations):
    n_samples, n_features = X.shape
    beta = np.zeros((n_features, 1))  # initialize coefficients
    for i in range(iterations):
        y_pred = X @ beta
        error = y_pred - y
        gradient = (2/n_samples) * (X.T @ error)
        beta -= lr * gradient
    return beta

learning_rates = [0.001, 0.01, 0.1, 1]
results = []

for lr in learning_rates:
    beta = gradient_descent(X_train_b, y_train, lr, 1000)
    
    y_val_pred = X_val_b @ beta
    y_test_pred = X_test_b @ beta
    
    r2_val = r2_score(y_val, y_val_pred)
    r2_test = r2_score(y_test, y_test_pred)
    
    results.append({
        "learning_rate": lr,
        "beta": beta,
        "r2_val": r2_val,
        "r2_test": r2_test
    })
    print(f"Learning Rate: {lr}")
    print(f"Validation R2: {r2_val:.4f}, Test R2: {r2_test:.4f}\n")

# Find best model based on validation R2
best_model = max(results, key=lambda x: x["r2_val"])
print("Best Model based on Validation R2")
print("Learning Rate:", best_model["learning_rate"])
print("Validation R2:", best_model["r2_val"])
print("Test R2:", best_model["r2_test"])
print("Beta Coefficients:\n", best_model["beta"])

Learning Rate: 0.001
Validation R2: 0.6820, Test R2: 0.6490

Learning Rate: 0.01
Validation R2: 0.9098, Test R2: 0.9148

Learning Rate: 0.1
Validation R2: 0.9098, Test R2: 0.9148

Learning Rate: 1
Validation R2: -inf, Test R2: -inf

Best Model based on Validation R2
Learning Rate: 0.01
Validation R2: 0.909799626728122
Test R2: 0.9147569598865972
Beta Coefficients:
 [[1232618.31836202]
 [ 230067.95333238]
 [ 163710.26584918]
 [ 121680.22876975]
 [   2833.37135223]
 [ 150657.57448494]]
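The `-inf` scores for learning rate 1 are consistent with gradient descent diverging. For the gradient (2/n)·Xᵀ(Xβ − y) used above, the update is stable only if the learning rate is below 2/λ_max, where λ_max is the largest eigenvalue of H = (2/n)·XᵀX. With standardized features plus a bias column, (1/n)·XᵀX has ones on its diagonal, so λ_max of H is at least 2 and the limit is at most 1. The sketch below checks this on synthetic data (an assumption, not the housing dataset); `run_gd` is a hypothetical helper that mirrors the update rule of the cell above.

```python
import numpy as np

# Synthetic standardized data, assumed only for this illustration
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)            # same scaling as StandardScaler
Xb = np.c_[np.ones(n), X]                           # add bias column
y = Xb @ np.array([[5.0], [2.0], [-1.0], [0.5]]) + rng.normal(scale=0.1, size=(n, 1))

# Largest stable learning rate for the MSE gradient (2/n) * X^T (X beta - y)
H = (2 / n) * (Xb.T @ Xb)
lam_max = np.linalg.eigvalsh(H).max()
lr_limit = 2 / lam_max
print("largest stable learning rate:", lr_limit)

def run_gd(Xb, y, lr, iterations=200):
    """Plain batch gradient descent on mean squared error."""
    beta = np.zeros((Xb.shape[1], 1))
    for _ in range(iterations):
        beta -= lr * (2 / len(y)) * (Xb.T @ (Xb @ beta - y))
    return beta

beta_ok = run_gd(Xb, y, 0.5 * lr_limit)     # below the limit: converges
beta_bad = run_gd(Xb, y, 1.5 * lr_limit)    # above the limit: blows up
print("norm below limit:", np.linalg.norm(beta_ok))
print("norm above limit:", np.linalg.norm(beta_bad))
```

The same reasoning suggests why 0.01 and 0.1 reach the same scores after 1000 iterations, while 0.001 is stable but has not yet converged.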




Q3:
<br>Pre-processing and Multiple Linear Regression
<br>Download the dataset regarding Car Price Prediction from the following link:
https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
1. Load the dataset with following column names ["symboling", "normalized_losses",
"make", "fuel_type", "aspiration","num_doors", "body_style", "drive_wheels",
"engine_location", "wheel_base", "length", "width", "height", "curb_weight",
"engine_type", "num_cylinders", "engine_size", "fuel_system", "bore", "stroke",
"compression_ratio", "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price"]
and replace all ? values with NaN
2. Replace all NaN values with central tendency imputation. Drop the rows with NaN
values in price column
3. There are 10 columns in the dataset with non-numeric values. Convert these values to
numeric values using following scheme:
<br>(i) For "num_doors" and "num_cylinders": convert words (number names) to figures,
e.g., two to 2
<br>(ii) For "body_style", "drive_wheels": use dummy encoding scheme
<br>(iii) For "make", "aspiration", "engine_location", "fuel_type": use label encoding
scheme
<br>(iv) For "fuel_system": replace values containing the string pfi with 1 and all other values with 0.
<br>(v) For "engine_type": replace values containing the string ohc with 1 and all other values with 0.
4. Divide the dataset into input features (all columns except price) and output variable
(price). Scale all input features.
5. Train a linear regressor on 70% of data (using inbuilt linear regression function of
Python) and test its performance on remaining 30% of data.
6. Reduce the dimensionality of the feature set using inbuilt PCA decomposition and then
again train a linear regressor on 70% of reduced data (using inbuilt linear regression
function of Python). Does it lead to any performance improvement on test set?
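Steps 2, 3(i), 3(iv) and 3(v) can be sketched on a small hand-made frame. This is a minimal illustration under stated assumptions: the five rows in `toy` are invented, mean imputation is assumed for numeric columns and mode imputation for text columns, and rows without a price are dropped before imputing so that they do not influence the fill values.

```python
import numpy as np
import pandas as pd

# Invented rows, for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "num_doors":   ["four", np.nan, "four", "two", "four"],
    "horsepower":  [111.0, np.nan, 154.0, 102.0, 115.0],
    "fuel_system": ["mpfi", "2bbl", "spfi", "idi", "mpfi"],
    "engine_type": ["dohc", "rotor", "ohcv", "l", "ohc"],
    "price":       [13495.0, 16500.0, np.nan, 13950.0, 17450.0],
})

# Step 2: drop rows without a price, then impute the remaining NaNs
toy = toy.dropna(subset=["price"]).copy()
for col in toy.columns.drop("price"):
    if pd.api.types.is_numeric_dtype(toy[col]):
        toy[col] = toy[col].fillna(toy[col].mean())            # mean for numeric columns
    else:
        toy[col] = toy[col].fillna(toy[col].mode().iloc[0])    # mode for text columns

# Step 3(i): number names to figures
word_to_number = {"two": 2, "four": 4}
toy["num_doors"] = toy["num_doors"].map(word_to_number)

# Steps 3(iv) and 3(v): 1 if the value contains the substring, else 0
toy["fuel_system"] = toy["fuel_system"].str.contains("pfi").astype(int)
toy["engine_type"] = toy["engine_type"].str.contains("ohc").astype(int)

print(toy)
```

Imputing text columns with the mode keeps every row that has a price, whereas dropping rows with a missing `num_doors` shrinks the dataset.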

In [4]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.metrics import r2_score

# Step 1: Load dataset
columns = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location", "wheel_base",
           "length", "width", "height", "curb_weight", "engine_type", "num_cylinders",
           "engine_size", "fuel_system", "bore", "stroke", "compression_ratio",
           "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price"]

df = pd.read_csv("Ass3_Q3_Dataset.csv", names=columns, na_values="?")

# Step 2: Impute numeric NaNs (excluding price) and drop rows with NaN price
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
numeric_cols.remove("price")
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].mean())

df = df.dropna(subset=["price"])

# Step 3: Convert non-numeric columns
# (i) num_doors, num_cylinders mapping
num_map = {"two":2, "three":3, "four":4, "five":5, "six":6, "eight":8, "twelve":12}
df["num_doors"] = df["num_doors"].map(num_map)
df["num_cylinders"] = df["num_cylinders"].map(num_map)

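# Drop rows where num_doors or num_cylinders is NaN (missing in the raw data or not
# covered by num_map); note that this drops rows instead of imputing the mode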
df = df.dropna(subset=["num_doors", "num_cylinders"])

# (ii) Dummy encoding: body_style, drive_wheels
df = pd.get_dummies(df, columns=["body_style", "drive_wheels"], drop_first=True)

# (iii) Label encoding: make, aspiration, engine_location, fuel_type
label_cols = ["make", "aspiration", "engine_location", "fuel_type"]
le = LabelEncoder()
for col in label_cols:
    df[col] = le.fit_transform(df[col])

# (iv) Binary encoding: fuel_system (pfi=1, else 0)
df["fuel_system"] = df["fuel_system"].apply(lambda x: 1 if "pfi" in str(x).lower() else 0)

# (v) Binary encoding: engine_type (ohc=1, else 0)
df["engine_type"] = df["engine_type"].apply(lambda x: 1 if "ohc" in str(x).lower() else 0)

# Final check for remaining NaNs
if df.isna().sum().sum() > 0:
    print("Warning: Dataset still contains NaNs!")
else:
    print("No NaNs remain. Ready for modeling.")

# Step 4: Split into input features and output
X = df.drop("price", axis=1)
y = df["price"].values

# Scale input features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 5: Train-test split (70-30)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, train_size=0.7, random_state=42)

# Train Linear Regression on original features
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
r2_original = r2_score(y_test, y_pred)
print("R2 Score (Original Features):", r2_original)

# Step 6: PCA dimensionality reduction (retain 95% variance)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

# Train-test split on reduced data
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reduced, y, train_size=0.7, random_state=42)

# Train Linear Regression on PCA-reduced features
lr_r = LinearRegression()
lr_r.fit(X_train_r, y_train_r)
y_pred_r = lr_r.predict(X_test_r)
r2_reduced = r2_score(y_test_r, y_pred_r)
print("R2 Score (PCA Reduced Features):", r2_reduced)

# Optional: Compare number of features
print("Original number of features:", X_scaled.shape[1])
print("Reduced number of features after PCA:", X_reduced.shape[1])

No NaNs remain. Ready for modeling.
R2 Score (Original Features): 0.8315729391592779
R2 Score (PCA Reduced Features): 0.8627501465144092
Original number of features: 29
Reduced number of features after PCA: 16
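In the recorded run, PCA reduced 29 features to 16 components and the test R2 rose from about 0.832 to about 0.863, so on this particular split the answer to step 6 is yes. A single split is limited evidence, and the cell above fits the scaler and PCA on all rows before splitting. As a hedged alternative, the sketch below fits both inside a scikit-learn pipeline on the training rows only. It uses synthetic data from `make_regression` (an assumption, not the car dataset), so its scores say nothing about the real data.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 29 features, assumed only for this illustration
X, y = make_regression(n_samples=200, n_features=29, n_informative=10,
                       noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

# Scaler (and PCA) statistics are learned from the training rows only
plain = make_pipeline(StandardScaler(), LinearRegression())
reduced = make_pipeline(StandardScaler(), PCA(n_components=0.95), LinearRegression())

plain.fit(X_train, y_train)
reduced.fit(X_train, y_train)

# Pipeline.score returns R2 for a regressor as the final step
print("R2 without PCA:", plain.score(X_test, y_test))
print("R2 with PCA:   ", reduced.score(X_test, y_test))
print("Components kept:", reduced.named_steps["pca"].n_components_)
```

Keeping preprocessing inside the pipeline means the test rows never influence the scaling or the principal components, which makes the comparison between the two models fairer.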
