Part 1 : K-Fold Cross Validation for Multiple Linear Regression (Least Square Error Fit)  

(a) Divide the dataset into input features (all columns except Price) and the output variable (Price).

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

df = pd.read_csv("USA_Housing.csv")

X = df.drop("Price", axis=1).values   # all columns except Price
y = df["Price"].values.reshape(-1, 1) # target variable

print("(a) Input features (X) shape:", X.shape)
print("(a) Output variable (y) shape:", y.shape)
print("First 5 rows of X:\n", X[:5])
print("First 5 values of y:\n", y[:5])


(a) Input features (X) shape: (5000, 5)
(a) Output variable (y) shape: (5000, 1)
First 5 rows of X:
 [[7.95454586e+04 5.68286132e+00 7.00918814e+00 4.09000000e+00
  2.30868005e+04]
 [7.92486424e+04 6.00289981e+00 6.73082102e+00 3.09000000e+00
  4.01730722e+04]
 [6.12870672e+04 5.86588984e+00 8.51272743e+00 5.13000000e+00
  3.68821594e+04]
 [6.33452401e+04 7.18823609e+00 5.58672866e+00 3.26000000e+00
  3.43102428e+04]
 [5.99821972e+04 5.04055452e+00 7.83938779e+00 4.23000000e+00
  2.63541095e+04]]
First 5 values of y:
 [[1059033.558 ]
 [1505890.915 ]
 [1058987.988 ]
 [1260616.807 ]
 [ 630943.4893]]


(b) Scale the values of input features. 

In [4]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("\n(b) Scaled features (first 5 rows):\n", X_scaled[:5])


(b) Scaled features (first 5 rows):
 [[ 1.02865969 -0.29692705  0.02127433  0.08806222 -1.31759867]
 [ 1.00080775  0.02590164 -0.25550611 -0.72230146  0.40399945]
 [-0.68462915 -0.11230283  1.5162435   0.93084045  0.07240989]
 [-0.49149907  1.22157207 -1.39307717 -0.58453963 -0.18673422]
 [-0.80707253 -0.94483368  0.84674187  0.20151314 -0.98838741]]


(c) Divide input and output features into five folds. 

In [5]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)
print("\n(c) Prepared 5-fold cross validation.")


(c) Prepared 5-fold cross validation.
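
The `KFold` object only stores the splitting configuration; the index arrays for each fold are produced when `split` is called. As a rough illustration, the sketch below runs the same configuration on a toy array of 10 samples (`X_toy` is an illustrative stand-in, not the housing data) and records the size of each train/test partition.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples, 2 features (illustrative only, not the housing dataset)
X_toy = np.arange(20).reshape(10, 2)

kf_toy = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []      # (train size, test size) per fold
seen_test_idx = []   # every index that appeared in some test fold
for fold_no, (train_idx, test_idx) in enumerate(kf_toy.split(X_toy), start=1):
    fold_sizes.append((len(train_idx), len(test_idx)))
    seen_test_idx.extend(test_idx.tolist())
    print(f"Fold {fold_no}: train size = {len(train_idx)}, test size = {len(test_idx)}")
```

Each sample lands in exactly one test fold, so across the five iterations every row is used for testing once and for training four times.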


(d) Run five iterations. In each iteration, use one fold as the test set and the remaining four folds as the training set. Find the beta (𝛽) matrix, the predicted values, and the R2_score for each iteration.
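
For reference, the three quantities requested in (d) can be written as follows. The notation is chosen here for illustration: X is the training matrix with a leading column of ones, y the training targets, and the sums in R² run over the test fold.

```latex
\hat{\beta} = (X^\top X)^{-1} X^\top y, \qquad
\hat{y} = X_{\text{test}} \hat{\beta}, \qquad
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```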

In [6]:
r2_scores = []
betas = []
fold = 1

for train_idx, test_idx in kf.split(X_scaled):
    print(f"\n--- Fold {fold} ---")
    
    X_train, X_test = X_scaled[train_idx], X_scaled[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    
    # Add bias column (intercept)
    X_train_b = np.c_[np.ones((X_train.shape[0], 1)), X_train]
    X_test_b = np.c_[np.ones((X_test.shape[0], 1)), X_test]
    
    # Compute Beta with the normal equation: beta = (X^T X)^(-1) X^T y
    beta = np.linalg.inv(X_train_b.T @ X_train_b) @ (X_train_b.T @ y_train)
    print("Beta matrix:\n", beta.flatten())
    
    # Predictions
    y_pred = X_test_b @ beta
    
    # R2 Score
    score = r2_score(y_test, y_pred)
    print("R2 Score:", score)
    
    r2_scores.append(score)
    betas.append(beta)
    fold += 1


--- Fold 1 ---
Beta matrix:
 [1232002.6748241   230745.9407348   163243.27314515  120309.77397759
    3011.45976111  151552.63069359]
R2 Score: 0.9179971706985147

--- Fold 2 ---
Beta matrix:
 [1232037.85755945  229081.97914235  165882.1605634   121536.57475055
    2092.4478622   150874.99274586]
R2 Score: 0.9145677884802819

--- Fold 3 ---
Beta matrix:
 [1231951.92563846  230224.50511001  162766.17455493  121022.77324577
    1247.16258975  150234.77720419]
R2 Score: 0.9116116385364478

--- Fold 4 ---
Beta matrix:
 [1232751.4648651   229500.10043209  165212.07110924  122839.9376815
    3063.71699324  150917.88484984]
R2 Score: 0.9193091764960818

--- Fold 5 ---
Beta matrix:
 [1.23161736e+06 2.30225051e+05 1.63956839e+05 1.21115120e+05
 7.83467170e+02 1.50662447e+05]
R2 Score: 0.9243869413350317


(e) Use the best (𝛽) matrix (the one for which R2_score is maximum) to train the regressor on 70% of the data and test its performance on the remaining 30%.

In [7]:
best_idx = np.argmax(r2_scores)
best_beta = betas[best_idx]

print("\n(e) Best R2 Score:", r2_scores[best_idx])
print("(e) Best Beta Matrix:\n", best_beta.flatten())


(e) Best R2 Score: 0.9243869413350317
(e) Best Beta Matrix:
 [1.23161736e+06 2.30225051e+05 1.63956839e+05 1.21115120e+05
 7.83467170e+02 1.50662447e+05]
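
The cell above selects the best beta but does not show the 70/30 evaluation that (e) asks for. One possible reading of the task is sketched below: split the data 70/30, fit beta on the 70% portion with the normal equation, and score it on the held-out 30%. The data here is synthetic (`X_demo`, `y_demo` are stand-ins for `X_scaled` and `y`), and `add_bias` is a hypothetical helper introduced only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)

# Synthetic stand-in for X_scaled and y (assumption, not the housing data)
X_demo = rng.normal(size=(500, 5))
coefs = np.array([[3.0], [2.0], [1.0], [0.1], [1.5]])
y_demo = 10.0 + X_demo @ coefs + rng.normal(scale=0.5, size=(500, 1))

def add_bias(X):
    """Prepend a column of ones so the first coefficient is the intercept."""
    return np.c_[np.ones((X.shape[0], 1)), X]

# 70% train / 30% test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)

# Fit beta on the 70% training portion with the normal equation
X_tr_b, X_te_b = add_bias(X_tr), add_bias(X_te)
beta_70 = np.linalg.solve(X_tr_b.T @ X_tr_b, X_tr_b.T @ y_tr)

# Score on the 30% hold-out
r2_70_30 = r2_score(y_te, X_te_b @ beta_70)
print("R2 on the 30% test set:", r2_70_30)
```

In the notebook, `X_demo` and `y_demo` would be replaced by `X_scaled` and `y`. To score the fold-selected coefficients without refitting, `beta_70` could be replaced by `best_beta` in the prediction line.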


Part 2 : Concept of Validation Set for Multiple Linear Regression (Gradient Descent Optimization)

Use the same dataset as in Part 1. Rather than dividing it into five folds, divide it into a training set (56%), a validation set (14%), and a test set (30%). Consider four values of the learning rate: {0.001, 0.01, 0.1, 1}. For each learning rate, compute the regression coefficients after 1000 iterations. For each set of regression coefficients, compute the R2_score on the validation and test sets, and find the best set of regression coefficients.
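
For reference, batch gradient descent on the mean-squared-error cost applies the update below at every iteration. The symbols are notation chosen here, not taken from the assignment text: α is the learning rate, m the number of training rows, and X includes a leading column of ones.

```latex
J(\beta) = \frac{1}{m}\lVert X\beta - y\rVert^2, \qquad
\nabla J(\beta) = \frac{2}{m} X^\top (X\beta - y), \qquad
\beta \leftarrow \beta - \alpha \, \nabla J(\beta)
```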

In [9]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score


# Load dataset

df = pd.read_csv("USA_Housing.csv")

X = df.drop("Price", axis=1).values
y = df["Price"].values.reshape(-1, 1)


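# Split: Train (56%), Validation (14%), Test (30%)
# Step 1 holds out 30% for testing. Step 2 takes 20% of the remaining 70%
# (0.20 * 0.70 = 0.14) for validation, which leaves 56% for training.

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.20, random_state=42)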

# Scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

# Bias term
X_train_b = np.c_[np.ones((X_train.shape[0], 1)), X_train]
X_val_b = np.c_[np.ones((X_val.shape[0], 1)), X_val]
X_test_b = np.c_[np.ones((X_test.shape[0], 1)), X_test]


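# Gradient descent with an early stop on divergence

def gradient_descent(X, y, lr, n_iter=1000, tol=1e10):
    """Batch gradient descent on the mean squared error.

    Returns beta after n_iter updates, or None if any coefficient
    exceeds tol in absolute value (treated as divergence).
    """
    m, n = X.shape
    beta = np.zeros((n, 1))
    for i in range(n_iter):
        # Gradient of (1/m) * ||X beta - y||^2
        gradients = (2/m) * X.T @ (X @ beta - y)
        beta -= lr * gradients
        # Stop if the coefficients explode
        if np.any(np.abs(beta) > tol):
            print(f"Stopped early: Divergence at iteration {i}")
            return None
    return beta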


# Try learning rates

learning_rates = [0.001, 0.01, 0.1, 1]
results = []

for lr in learning_rates:
    print(f"\nLearning Rate = {lr}")
    beta = gradient_descent(X_train_b, y_train, lr, n_iter=1000)
    
    if beta is None:
        print("❌ Diverged (too large learning rate)")
        continue
    
    # Predictions
    y_val_pred = X_val_b @ beta
    y_test_pred = X_test_b @ beta
    
    # Scores
    r2_val = r2_score(y_val, y_val_pred)
    r2_test = r2_score(y_test, y_test_pred)
    
    print("Beta coefficients (rounded):", np.round(beta.flatten(), 2))
    print("R2 Score (Validation):", round(r2_val, 4))
    print("R2 Score (Test):", round(r2_test, 4))
    
    results.append((lr, beta, r2_val, r2_test))


# Best Result

if results:
    best_result = max(results, key=lambda x: x[2])
    print("\n✅ Best Learning Rate:", best_result[0])
    print("Best Validation R2:", round(best_result[2], 4))
    print("Best Beta (rounded):", np.round(best_result[1].flatten(), 2))



Learning Rate = 0.001
Beta coefficients (rounded): [1065976.39  201488.21  140561.4    97763.8    20752.23  130367.98]
R2 Score (Validation): 0.6874
R2 Score (Test): 0.6523

Learning Rate = 0.01
Beta coefficients (rounded): [1232434.57  234562.96  162415.89  121759.16    2815.32  151577.64]
R2 Score (Validation): 0.9098
R2 Score (Test): 0.9148

Learning Rate = 0.1
Beta coefficients (rounded): [1232434.58  234562.99  162415.95  121760.31    2814.16  151577.59]
R2 Score (Validation): 0.9098
R2 Score (Test): 0.9148

Learning Rate = 1
Stopped early: Divergence at iteration 18
❌ Diverged (too large learning rate)

✅ Best Learning Rate: 0.01
Best Validation R2: 0.9098
Best Beta (rounded): [1232434.57  234562.96  162415.89  121759.16    2815.32  151577.64]
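
The divergence at a learning rate of 1 can be understood through the stability limit of gradient descent on a quadratic cost: the iteration is stable only if the learning rate is below 2 divided by the largest eigenvalue of the Hessian, which here is (2/m) XᵀX. For standardized training features with a bias column, XᵀX/m has ones on its diagonal, so its largest eigenvalue is at least 1 and the limit is at most 1, which is consistent with the result above. The sketch below checks this on synthetic stand-in data; `run_gd` and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic standardized features with some correlation (stand-in data)
m = 400
Z = rng.normal(size=(m, 3))
Z[:, 1] += 0.5 * Z[:, 0]                    # introduce correlation
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)    # standardize like StandardScaler
X_b = np.c_[np.ones((m, 1)), Z]
y = X_b @ np.array([[4.0], [1.0], [2.0], [-1.0]]) + rng.normal(scale=0.1, size=(m, 1))

# Hessian of the cost (1/m) * ||X beta - y||^2 is (2/m) * X^T X
hessian = (2 / m) * X_b.T @ X_b
lam_max = np.linalg.eigvalsh(hessian).max()
lr_limit = 2 / lam_max                      # stable only for lr below this value

def run_gd(X, y, lr, n_iter=200):
    """Plain batch gradient descent without any divergence check."""
    beta = np.zeros((X.shape[1], 1))
    for _ in range(n_iter):
        beta -= lr * (2 / X.shape[0]) * X.T @ (X @ beta - y)
    return beta

beta_ok = run_gd(X_b, y, lr=0.5 * lr_limit)    # inside the stable range
beta_bad = run_gd(X_b, y, lr=1.5 * lr_limit)   # outside the stable range

print("Stability limit for lr:", lr_limit)
print("Max |beta| inside the limit:", np.abs(beta_ok).max())
print("Max |beta| outside the limit:", np.abs(beta_bad).max())
```

The same bound also explains why 0.001 underfits after 1000 iterations: it is far below the limit, so each step is stable but very small.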


Part 3 : Pre-processing and Multiple Linear Regression

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.decomposition import PCA


# 1. Load dataset with column names & replace '?' with NaN

columns = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

df = pd.read_csv("CarDetail.csv", names=columns, na_values="?")

print("(1) Dataset loaded. Shape:", df.shape)
print("First 5 rows:\n", df.head())


# 2. Handle NaN (central tendency imputation)

# Drop rows with NaN in price (target variable)
# .copy() avoids a possible SettingWithCopyWarning on the assignments below
df = df.dropna(subset=["price"]).copy()

# For numeric columns: fill NaN with mean
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].apply(lambda x: x.fillna(x.mean()))

# For categorical columns: fill NaN with mode
categorical_cols = df.select_dtypes(include=[object]).columns
df[categorical_cols] = df[categorical_cols].apply(lambda x: x.fillna(x.mode()[0]))

print("\n(2) Missing values handled.")
print("Remaining NaN count:\n", df.isna().sum().sum())


# 3. Convert non-numeric columns


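# (i) num_doors & num_cylinders: convert word to number
# Series.map is used instead of replace to avoid the pandas FutureWarning
# about downcasting in replace; a word missing from the dictionary becomes NaN.
word_to_num = {
    "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "eight": 8, "twelve": 12
}
df["num_doors"] = df["num_doors"].map(word_to_num)
df["num_cylinders"] = df["num_cylinders"].map(word_to_num)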

# (ii) body_style, drive_wheels → dummy encoding
df = pd.get_dummies(df, columns=["body_style", "drive_wheels"], drop_first=True)

# (iii) make, aspiration, engine_location, fuel_type → label encoding
label_cols = ["make", "aspiration", "engine_location", "fuel_type"]
for col in label_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])

# (iv) fuel_system: "pfi" → 1, else 0
df["fuel_system"] = df["fuel_system"].apply(lambda x: 1 if "pfi" in x else 0)

# (v) engine_type: "ohc" → 1, else 0
df["engine_type"] = df["engine_type"].apply(lambda x: 1 if "ohc" in x else 0)

print("\n(3) Non-numeric columns converted.")
print("Dataset shape after encoding:", df.shape)


# 4. Divide dataset into input (X) and output (y)

X = df.drop("price", axis=1).values
y = df["price"].values.reshape(-1, 1)

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("\n(4) Features scaled.")
print("X shape:", X_scaled.shape, "| y shape:", y.shape)


# 5. Train/Test Split (70% train, 30% test)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.30, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
r2_original = r2_score(y_test, y_pred)

print("\n(5) Linear Regression without PCA")
print("R2 Score on Test Set:", round(r2_original, 4))


# 6. PCA + Linear Regression

pca = PCA(n_components=0.95)  # keep 95% variance
X_pca = pca.fit_transform(X_scaled)

X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca, y, test_size=0.30, random_state=42)

model_pca = LinearRegression()
model_pca.fit(X_train_pca, y_train_pca)

y_pred_pca = model_pca.predict(X_test_pca)
r2_pca = r2_score(y_test_pca, y_pred_pca)

print("\n(6) Linear Regression with PCA")
print("Number of components after PCA:", X_pca.shape[1])
print("R2 Score on Test Set:", round(r2_pca, 4))

# Compare
if r2_pca > r2_original:
    print("\n✅ PCA improved the performance.")
else:
    print("\n❌ PCA did not improve the performance.")


(1) Dataset loaded. Shape: (205, 26)
First 5 rows:
    symboling  normalized_losses         make fuel_type aspiration num_doors  \
0          3                NaN  alfa-romero       gas        std       two   
1          3                NaN  alfa-romero       gas        std       two   
2          1                NaN  alfa-romero       gas        std       two   
3          2              164.0         audi       gas        std      four   
4          2              164.0         audi       gas        std      four   

    body_style drive_wheels engine_location  wheel_base  ...  engine_size  \
0  convertible          rwd           front        88.6  ...          130   
1  convertible          rwd           front        88.6  ...          130   
2    hatchback          rwd           front        94.5  ...          152   
3        sedan          fwd           front        99.8  ...          109   
4        sedan          4wd           front        99.4  ...          136   

   fuel_sy


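
In the Part 3 code, the scaler and PCA are fitted on the full dataset before the train/test split, so statistics from the test rows influence the transformation. A possible alternative, sketched below under assumptions, is to split first and wrap the steps in a scikit-learn pipeline so that scaling and PCA are fitted on the training rows only. The data is synthetic (`X_demo`, `y_demo`, `latent` and `loadings` are made up for the example and are not the car dataset).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 8 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
loadings = np.zeros((3, 8))
loadings[0, 0:3] = 1.0   # features 0-2 follow factor 0
loadings[1, 3:6] = 1.0   # features 3-5 follow factor 1
loadings[2, 6:8] = 1.0   # features 6-7 follow factor 2
X_demo = latent @ loadings + 0.05 * rng.normal(size=(200, 8))
y_demo = latent @ np.array([3.0, -2.0, 1.0]) + 0.1 * rng.normal(size=200)

# Split first, so the test rows never influence the scaler or PCA
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)

# Scaler and PCA are fitted on the training rows when the pipeline is fitted
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95), LinearRegression())
pipe.fit(X_tr, y_tr)

r2_pipe = r2_score(y_te, pipe.predict(X_te))
n_kept = pipe.named_steps["pca"].n_components_
print("Components kept:", n_kept, "| R2 on test set:", round(r2_pipe, 4))
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by the unscaled feature matrix `X` and the target `y`, and the resulting score could be compared with the non-PCA baseline in the same way as in step 6.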