Question 4: Neighborhood Data


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from IPython.display import display

# Create the dataset
data = {
    "Size": [1500, 2000, 1800, 2200, 1700, 2000],
    "is_urban": [1, 0, 0, 1, 0, 0],
    "is_suburban": [0, 1, 0, 0, 1, 0],
    "is_rural": [0, 0, 1, 0, 0, 1],
    "Price": [350000, 400000, 300000, 450000, 370000, 320000]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Create the test set
test_data = {
    "Size": [1900, 1600, 2100],
    "is_urban": [1, 0, 0],
    "is_suburban": [0, 1, 0],
    "is_rural": [0, 0, 1]
}

test_df = pd.DataFrame(test_data)

def compute_hat_matrix(X):
    """Compute the hat matrix H = X (X'X)^+ X'.

    The pseudo-inverse (^+) replaces the ordinary inverse because X'X may be singular.
    """
    return X @ np.linalg.pinv(X.T @ X) @ X.T


# Prepare X matrix with bias term and target y
X = np.column_stack((np.ones(len(df)), df[["Size", "is_urban", "is_suburban", "is_rural"]].values))
y = np.array(df["Price"]).reshape(-1, 1)

# Compute hat matrix (used only for training predictions)
H = compute_hat_matrix(X)

# Predictions using the hat matrix (FOR TRAINING SET ONLY)
y_hat_train = H @ y

print("Predictions using Hat Matrix (Training Data Only):")
print(y_hat_train)

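# Compute the normal equation solution (θ)
# Caution: X holds an intercept plus all three one-hot columns, so X.T @ X is
# singular and np.linalg.inv returns a numerically unreliable result here.
theta = np.linalg.inv(X.T @ X) @ X.T @ y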

# Prepare test data for prediction (including bias term)
X_test = np.column_stack((np.ones(len(test_df)), test_df[["Size", "is_urban", "is_suburban", "is_rural"]].values))

# Predict using the computed θ
y_test_hat = X_test @ theta

print("\nPredictions using Normal Equation (Manual Regression):")
print(y_test_hat)

# Train scikit-learn regression model
reg = LinearRegression(fit_intercept=False)  # No need for bias since we include it in X
reg.fit(X, y)  # Use full X with bias term

# Predict using Scikit-Learn's model
predictions = reg.predict(X_test)

print("\nPredictions using Scikit-Learn Linear Regression:")
print(predictions)

# Display results
test_df["Predicted Price (Manual)"] = y_test_hat
test_df["Predicted Price (Scikit-Learn)"] = predictions
display(test_df)


Predictions using Hat Matrix (Training Data Only):
[[353145.16129   ]
 [405080.6451614 ]
 [296612.9032257 ]
 [446854.83870996]
 [364919.35483856]
 [323387.09677426]]

Predictions using Normal Equation (Manual Regression):
[[1446547.91523686]
 [ 943671.92862051]
 [ 922631.90631442]]

Predictions using Scikit-Learn Linear Regression:
[[406693.5483871 ]
 [351532.25806452]
 [336774.19354839]]


| | Size | is_urban | is_suburban | is_rural | Predicted Price (Manual) | Predicted Price (Scikit-Learn) |
|---|---|---|---|---|---|---|
| 0 | 1900 | 1 | 0 | 0 | 1446548.0 | 406693.548387 |
| 1 | 1600 | 0 | 1 | 0 | 943671.9 | 351532.258065 |
| 2 | 2100 | 0 | 0 | 1 | 922631.9 | 336774.193548 |


**Build a linear regression model yourself using the matrix operations discussed in class and apply it to make predictions for new homes. What do you observe? If you encounter any issues, how can you address them?**

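The hat-matrix predictions on the training set are close to the actual training prices: for example, 353,145 predicted against 350,000 actual for the first home. The test-set predictions from the normal equation are a different story. They come out at roughly 1.45 million, 0.94 million, and 0.92 million, far above the highest training price of 450,000. The cause is the design matrix: it contains an intercept column plus all three one-hot columns (is_urban, is_suburban, is_rural), and those three columns sum to the intercept column. X'X is therefore singular, and np.linalg.inv returns a numerically unreliable result instead of a valid inverse. I can address this by dropping one of the dummy columns (or the intercept), or by using the pseudo-inverse (np.linalg.pinv), as compute_hat_matrix already does.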
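
The design matrix in the cell above holds an intercept plus all three one-hot columns, which makes it rank-deficient. The sketch below is one possible way to confirm that and to work around it by dropping `is_rural`, so that rural becomes the baseline category. The variable names (`X_full`, `X_reduced`, `theta_reduced`, `pred_reduced`) and the rank tolerance `1e-8` are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np

# Training data copied from the cell above.
size = np.array([1500, 2000, 1800, 2200, 1700, 2000], dtype=float)
is_urban = np.array([1, 0, 0, 1, 0, 0], dtype=float)
is_suburban = np.array([0, 1, 0, 0, 1, 0], dtype=float)
is_rural = np.array([0, 0, 1, 0, 0, 1], dtype=float)
price = np.array([350000, 400000, 300000, 450000, 370000, 320000], dtype=float)

ones = np.ones(len(size))

# Original design: intercept plus all three dummies (5 columns).
X_full = np.column_stack((ones, size, is_urban, is_suburban, is_rural))
rank_full = np.linalg.matrix_rank(X_full, tol=1e-8)  # tol is an assumed cutoff

# Reduced design: drop is_rural so rural becomes the baseline category.
X_reduced = np.column_stack((ones, size, is_urban, is_suburban))
rank_reduced = np.linalg.matrix_rank(X_reduced, tol=1e-8)

# Normal equation on the full-rank design; solve() avoids forming an explicit inverse.
theta_reduced = np.linalg.solve(X_reduced.T @ X_reduced, X_reduced.T @ price)

# Test homes from the cell above, encoded without the is_rural column.
X_test_reduced = np.array([
    [1, 1900, 1, 0],
    [1, 1600, 0, 1],
    [1, 2100, 0, 0],
], dtype=float)
pred_reduced = X_test_reduced @ theta_reduced

print("rank of full design:", rank_full, "of", X_full.shape[1], "columns")
print("rank of reduced design:", rank_reduced, "of", X_reduced.shape[1], "columns")
print("predictions:", pred_reduced)
```

Both designs have rank 4, but only the reduced one has as many columns as its rank, so its normal equation has a unique solution. The resulting predictions should agree with the scikit-learn values printed in the notebook output above.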



**Also build a regressor using scikit-learn’s regression model and make predictions for the same test data. Do your answers match the scikit-learn model’s answers? Do both sets of answers match the ground truth from the test data? Why/why not?**

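No. My normal-equation predictions (about 1,446,548, 943,672, and 922,632) do not match the scikit-learn predictions (about 406,694, 351,532, and 336,774). Both models were given the same X, but X is rank-deficient, so inverting X'X with np.linalg.inv breaks down, while scikit-learn's least-squares fit still returns a usable solution. The test set has no Price column, so neither set of predictions can be checked against ground truth here. What I can say is that the scikit-learn predictions fall within the range of the training prices (300,000 to 450,000), while my manual predictions do not.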
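
To compare a manual solution with the scikit-learn numbers without changing the design matrix, a rank-aware least-squares solver can be used in place of an explicit inverse. The sketch below uses `np.linalg.lstsq` as a stand-in for such a solver and checks its test predictions against the scikit-learn values printed in the notebook output. The names `theta_lstsq`, `pred_lstsq`, and `sklearn_printed`, and the cutoff `rcond=1e-10`, are assumptions chosen for illustration.

```python
import numpy as np

# Full design matrix from the notebook: intercept, Size, and all three dummies.
X = np.array([
    [1, 1500, 1, 0, 0],
    [1, 2000, 0, 1, 0],
    [1, 1800, 0, 0, 1],
    [1, 2200, 1, 0, 0],
    [1, 1700, 0, 1, 0],
    [1, 2000, 0, 0, 1],
], dtype=float)
y = np.array([350000, 400000, 300000, 450000, 370000, 320000], dtype=float)

X_test = np.array([
    [1, 1900, 1, 0, 0],
    [1, 1600, 0, 1, 0],
    [1, 2100, 0, 0, 1],
], dtype=float)

# lstsq is SVD-based: singular values below rcond * largest are treated as zero,
# which yields the minimum-norm least-squares solution for a rank-deficient X.
theta_lstsq, _, rank, _ = np.linalg.lstsq(X, y, rcond=1e-10)
pred_lstsq = X_test @ theta_lstsq

# scikit-learn predictions as printed in the notebook output.
sklearn_printed = np.array([406693.5483871, 351532.25806452, 336774.19354839])

print("effective rank:", rank)
print("lstsq predictions:", pred_lstsq)
print("largest gap to scikit-learn output:", np.max(np.abs(pred_lstsq - sklearn_printed)))
```

The coefficients of a rank-deficient model are not unique, but the predictions for these test homes are, because every test row keeps the same structure as the training rows (exactly one dummy set to 1 alongside the intercept). That is why comparing predictions, not coefficients, is the meaningful check here.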