## **Predicting House Prices using Linear Regression and k-Nearest Neighbors (k-NN)**

### **Task Overview**

The goal of this project was to predict the **median house value** in California districts using the California Housing dataset. We aimed to compare two regression models:  
- **Linear Regression** (parametric)  
- **k-NN Regression** (nonparametric; evaluated for k = 3, 5, 7, 9, and 11)
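
To make the "parametric" label concrete, here is a minimal sketch of ordinary least squares for a single feature. It is not the code used in this notebook (which fits all eight features with scikit-learn's `LinearRegression`); the helper name `fit_simple_linear` and the toy numbers are made up for illustration.

```python
from statistics import mean

def fit_simple_linear(xs, ys):
    """Ordinary least squares with one feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Toy data (hypothetical), generated from y = 0.5 * x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.5, 2.0, 2.5, 3.0]

slope, intercept = fit_simple_linear(xs, ys)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

Once fitted, the whole model is summarized by two numbers, and the training data can be discarded. A k-NN model, by contrast, has to keep every training point around at prediction time.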

We evaluated model performance using standard regression metrics: **R² (coefficient of determination)**, **Mean Absolute Error (MAE)**, **Mean Squared Error (MSE)**, and its square root, **RMSE**.
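
As a rough illustration of what these metrics measure, the sketch below computes them from first principles on a made-up set of four predictions. The function name `regression_metrics` is hypothetical; the notebook itself relies on the equivalent functions in `sklearn.metrics`.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute R², MAE, MSE and RMSE from two equal-length sequences."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    # R² compares the model's squared error with that of always
    # predicting the mean of y_true.
    y_mean = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"R2": r2, "MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}

# Toy values (hypothetical)
y_true = [1.0, 2.0, 3.0, 4.0]
good = regression_metrics(y_true, [1.5, 2.0, 2.5, 4.0])
bad = regression_metrics(y_true, [4.0, 3.0, 2.0, 1.0])

print(good)
print(bad["R2"])
```

The second call shows that R² is not bounded below by zero: predictions that are worse than simply guessing the mean give a negative score.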

### **Dataset Description**

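We used the California Housing dataset from scikit-learn, which contains 20,640 block groups described by eight features:
- `MedInc`: median income
- `HouseAge`: median house age
- `AveRooms`: average number of rooms per household
- `AveBedrms`: average number of bedrooms per household
- `Population`: block group population
- `AveOccup`: average number of household members
- `Latitude`, `Longitude`: block group location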

The **target variable** is:
- `MedHouseVal` → Median house value for each block group (in $100,000s)
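
Because the target is expressed in units of $100,000, the error metrics reported later are in the same units. A small sketch of the conversion, using a hypothetical helper `to_dollars` and made-up values:

```python
def to_dollars(value_in_100k):
    """Convert a value expressed in units of $100,000 to dollars."""
    return value_in_100k * 100_000

target_label = f"${to_dollars(2.5):,.0f}"   # a target of 2.5
error_label = f"${to_dollars(0.5):,.0f}"    # an MAE or RMSE of 0.5

print(target_label)
print(error_label)
```

MAE and RMSE can be read this way; MSE is in squared units, so it does not convert to dollars directly.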

### **Key Takeaways**

- **Linear Regression** is a parametric model that assumes a linear relationship between features and the target.
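- **k-NN Regression** is a nonparametric model that predicts the average target value of the k nearest training points.
- We used multiple metrics because each one weights errors differently: MSE and RMSE penalize large errors more heavily than MAE, while R² reports the share of variance explained.
- Standardization was essential for k-NN: its distances are computed across all features, so an unscaled large-range feature such as `Population` would otherwise dominate them.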
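
The last two points can be sketched together. The example below is a simplified, pure-Python stand-in for `StandardScaler` plus `KNeighborsRegressor`, not the notebook's actual code; the helper names and the four-row toy dataset are assumptions made for illustration. The first column mimics an income-like feature and the second a population-like feature with a much larger range.

```python
import math
from statistics import mean, pstdev

def fit_scaler(rows):
    """Per-column mean and population standard deviation."""
    cols = list(zip(*rows))
    return [mean(c) for c in cols], [pstdev(c) for c in cols]

def scale(row, mus, sigmas):
    """Standardize one row with previously fitted statistics."""
    return [(v - m) / s for v, m, s in zip(row, mus, sigmas)]

def knn_predict(train_X, train_y, query, k):
    """Average the targets of the k training rows closest to `query`."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    return mean(train_y[i] for i in order[:k])

# Toy data (hypothetical): the target follows the first column only.
train_X = [[2.0, 1000], [2.2, 3000], [8.0, 1100], [8.5, 2900]]
train_y = [1.0, 1.2, 4.0, 4.2]
query = [8.2, 1900]

# Without scaling, the second column decides which rows count as "near".
raw_pred = knn_predict(train_X, train_y, query, k=2)

# Fit the statistics on the training rows only, then apply them to the query.
mus, sigmas = fit_scaler(train_X)
scaled_X = [scale(row, mus, sigmas) for row in train_X]
scaled_pred = knn_predict(scaled_X, train_y, scale(query, mus, sigmas), k=2)

print(f"raw: {raw_pred:.2f}, standardized: {scaled_pred:.2f}")
```

On the raw features the two nearest rows are chosen almost entirely by the second column, which mixes a low-target and a high-target neighbor. After standardization both columns contribute on a comparable scale and the query is matched with the two high-income rows.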


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Set random seed for reproducibility
np.random.seed(42)

# 1. Data Loading and Exploration
print("Loading California Housing Dataset...")
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# Display basic information about the dataset
print(f"Dataset shape: {X.shape}")
print("\nFeature names:")
for name in housing.feature_names:
    print(f"- {name}")
print(f"\nTarget variable: {housing.target_names[0]}")

# Display basic statistics
print("\nBasic statistics of features:")
print(X.describe())

# 2. Data Visualization
plt.figure(figsize=(12, 10))

# Correlation matrix
plt.subplot(2, 2, 1)
correlation_matrix = np.corrcoef(X.values, y.reshape(-1, 1), rowvar=False)
feature_names_with_target = housing.feature_names + ['MedHouseVal']
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', 
            xticklabels=feature_names_with_target, 
            yticklabels=feature_names_with_target)
plt.title('Correlation Matrix')

# Scatter plot of MedInc vs. House Value
plt.subplot(2, 2, 2)
plt.scatter(X['MedInc'], y, alpha=0.5)
plt.xlabel('Median Income')
plt.ylabel('Median House Value')
plt.title('Income vs. House Value')

# Distribution of target variable
plt.subplot(2, 2, 3)
plt.hist(y, bins=50)
plt.xlabel('Median House Value')
plt.ylabel('Frequency')
plt.title('Distribution of House Values')

# Geographical distribution (Latitude vs. Longitude colored by price)
plt.subplot(2, 2, 4)
plt.scatter(X['Longitude'], X['Latitude'], c=y, cmap='viridis', 
            alpha=0.5, s=10)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('House Prices across California')
plt.colorbar(label='Median House Value')

plt.tight_layout()
plt.savefig('california_housing_eda.png')
plt.close()

# 3. Data Preparation
# Split into training and testing sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(f"\nTraining set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Model Training and Evaluation
# Function to evaluate model performance
def evaluate_model(model_name, y_true, y_pred):
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    
    print(f"\n{model_name} Performance:")
    print(f"R² Score: {r2:.4f}")
    print(f"Mean Absolute Error: {mae:.4f}")
    print(f"Mean Squared Error: {mse:.4f}")
    print(f"Root Mean Squared Error: {rmse:.4f}")
    
    return {
        'Model': model_name,
        'R²': r2,
        'MAE': mae,
        'MSE': mse,
        'RMSE': rmse
    }

# 4.1 Linear Regression
print("\nTraining Linear Regression model...")
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)
lr_pred = lr_model.predict(X_test_scaled)

# Evaluate Linear Regression
lr_metrics = evaluate_model("Linear Regression", y_test, lr_pred)

# 4.2 k-NN Regression with different k values
k_values = [3, 5, 7, 9, 11]
knn_metrics_list = []

for k in k_values:
    print(f"\nTraining k-NN Regression model with k={k}...")
    knn_model = KNeighborsRegressor(n_neighbors=k)
    knn_model.fit(X_train_scaled, y_train)
    knn_pred = knn_model.predict(X_test_scaled)
    
    # Evaluate k-NN
    knn_metrics = evaluate_model(f"k-NN (k={k})", y_test, knn_pred)
    knn_metrics_list.append(knn_metrics)

# 5. Performance Comparison
# Combine all metrics for comparison
all_metrics = [lr_metrics] + knn_metrics_list
metrics_df = pd.DataFrame(all_metrics)
print("\nModel Performance Comparison:")
print(metrics_df)

# Visualize performance metrics
plt.figure(figsize=(14, 10))

# R² comparison
plt.subplot(2, 2, 1)
plt.bar(metrics_df['Model'], metrics_df['R²'])
plt.title('R² Score Comparison')
plt.xticks(rotation=45)
plt.ylim(0, 1)  # Assumes every R² lies in [0, 1]; R² can be negative for very poor fits

# MAE comparison
plt.subplot(2, 2, 2)
plt.bar(metrics_df['Model'], metrics_df['MAE'])
plt.title('Mean Absolute Error Comparison')
plt.xticks(rotation=45)

# MSE comparison
plt.subplot(2, 2, 3)
plt.bar(metrics_df['Model'], metrics_df['MSE'])
plt.title('Mean Squared Error Comparison')
plt.xticks(rotation=45)

# RMSE comparison
plt.subplot(2, 2, 4)
plt.bar(metrics_df['Model'], metrics_df['RMSE'])
plt.title('Root Mean Squared Error Comparison')
plt.xticks(rotation=45)

plt.tight_layout()
plt.savefig('model_comparison.png')
plt.close()

# Analyze feature importance for Linear Regression
feature_importance = pd.DataFrame({
    'Feature': housing.feature_names,
    'Coefficient': lr_model.coef_
})
feature_importance['Abs_Coefficient'] = abs(feature_importance['Coefficient'])
feature_importance = feature_importance.sort_values('Abs_Coefficient', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Coefficient'])
plt.title('Linear Regression Coefficients')
plt.xlabel('Coefficient Value')
plt.savefig('feature_importance.png')
plt.close()

# 6. Prediction Visualization
plt.figure(figsize=(14, 6))

# Linear Regression predictions vs actual
plt.subplot(1, 2, 1)
plt.scatter(y_test, lr_pred, alpha=0.5)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'r--')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Linear Regression: Predicted vs Actual')

# Best k-NN predictions vs actual
best_knn_idx = metrics_df['R²'][1:].idxmax()
best_knn_model = metrics_df['Model'][best_knn_idx]
best_k = int(best_knn_model.split('=')[1].strip(')'))

knn_model = KNeighborsRegressor(n_neighbors=best_k)
knn_model.fit(X_train_scaled, y_train)
knn_pred = knn_model.predict(X_test_scaled)

plt.subplot(1, 2, 2)
plt.scatter(y_test, knn_pred, alpha=0.5)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'r--')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title(f'k-NN (k={best_k}): Predicted vs Actual')

plt.tight_layout()
plt.savefig('prediction_comparison.png')
plt.close()

print("\nAnalysis complete. All figures saved to current directory.")
print("The best k-NN model used k =", best_k)
print("\nLinear Regression Feature Importance:")
print(feature_importance)

# Summary of findings
print("\n========== SUMMARY ==========")
print("Linear Regression strengths: Interpretable, fast, good for understanding feature relationships")
print("k-NN strengths: Captures non-linear patterns, no assumptions about data distribution")
# idxmax() returns an index label, so look it up with .loc rather than .iloc
print(f"Best performing model: {metrics_df.loc[metrics_df['R²'].idxmax(), 'Model']} with R² = {metrics_df['R²'].max():.4f}")


Loading California Housing Dataset...
Dataset shape: (20640, 8)

Feature names:
- MedInc
- HouseAge
- AveRooms
- AveBedrms
- Population
- AveOccup
- Latitude
- Longitude

Target variable: MedHouseVal

Basic statistics of features:
             MedInc      HouseAge      AveRooms     AveBedrms    Population  \
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000   
mean       3.870671     28.639486      5.429000      1.096675   1425.476744   
std        1.899822     12.585558      2.474173      0.473911   1132.462122   
min        0.499900      1.000000      0.846154      0.333333      3.000000   
25%        2.563400     18.000000      4.440716      1.006079    787.000000   
50%        3.534800     29.000000      5.229129      1.048780   1166.000000   
75%        4.743250     37.000000      6.052381      1.099526   1725.000000   
max       15.000100     52.000000    141.909091     34.066667  35682.000000   

           AveOccup      Latitude     Longitude  
count  