# Assignment: Real Estate Price Prediction

## Instructions

1. **Download the Dataset**  
   - Use the [Real Estate Price Prediction](https://www.kaggle.com/datasets/quantbruce/real-estate-price-prediction) dataset from Kaggle.

2. **Data Preprocessing**  
   - Load the dataset and perform any necessary preprocessing (e.g., handling missing values and scaling features).

3. **Model 1: Polynomial Regression**  
   - Train a Polynomial Regression model on the dataset.
   - Experiment with different polynomial degrees and find the degree that gives the best performance.

4. **Model 2: Support Vector Regression (SVR)**  
   - Train a Support Vector Regression model using the same dataset.
   - Use an appropriate kernel (e.g., RBF or polynomial) and tune the hyperparameters.

5. **Comparison**  
   - Evaluate both models using appropriate metrics (e.g., Mean Squared Error, R²).
   - Compare the performance of Polynomial Regression and SVR.

6. **Report**  
   - Present your results, including:
     - The best SVR hyperparameters and the best polynomial degree.
     - Metrics for both models.
     - A brief discussion of which model performed better and why.

7. **Submission**  
   - Submit your code, the trained models, and a short report summarizing your findings.
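
For reference, the two metrics named in step 5 have the standard definitions below. The symbols are notation chosen here rather than taken from the assignment: $y_i$ is a true price, $\hat{y}_i$ the corresponding prediction, $\bar{y}$ the mean of the true prices, and $n$ the number of test samples.

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

A lower MSE is better. An R² closer to 1 is better, and it can be negative when a model predicts worse than the constant $\bar{y}$.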


In [29]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt


In [30]:
df = pd.read_csv('./Dataset/Real estate.csv')
df.isnull().sum()

No                                        0
X1 transaction date                       0
X2 house age                              0
X3 distance to the nearest MRT station    0
X4 number of convenience stores           0
X5 latitude                               0
X6 longitude                              0
Y house price of unit area                0
dtype: int64

In [31]:
X = df.drop(columns=['Y house price of unit area'])
y = df['Y house price of unit area']
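
The feature matrix above still contains the `No` column. Judging by its name it looks like a running row number, which is an assumption here and worth checking against the data. If so, it carries no information about price and could be dropped together with the target. A minimal sketch on a tiny stand-in frame (the values are made up for illustration):

```python
import pandas as pd

# Stand-in frame that reuses three column names from the output above.
# The numbers are illustrative only, not read from the real CSV.
demo = pd.DataFrame({
    "No": [1, 2, 3],
    "X2 house age": [32.0, 19.5, 13.3],
    "Y house price of unit area": [37.9, 42.2, 47.3],
})

target = "Y house price of unit area"

# Assumption: "No" is only a row identifier, so it is removed with the target.
X_demo = demo.drop(columns=["No", target])
y_demo = demo[target]

print(list(X_demo.columns))
```

Applying the same `drop` to `df` would change the feature matrix, so the scores printed later in this notebook would need to be recomputed.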

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [33]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [34]:
# Try different polynomial degrees
degrees = [2, 3, 4, 5]
best_degree = 0
best_mse = float('inf')

for degree in degrees:
    # Transform features
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train_scaled)
    X_test_poly = poly.transform(X_test_scaled)

    # Train a Linear Regression model
    model = LinearRegression()
    model.fit(X_train_poly, y_train)

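    # Evaluate on the test set. Note: picking the degree from test-set MSE
    # makes the reported test score optimistic; a validation split or CV avoids this.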
    y_pred = model.predict(X_test_poly)
    mse = mean_squared_error(y_test, y_pred)
    print(f"Degree: {degree}, MSE: {mse:.2f}")

    # Keep track of the best degree
    if mse < best_mse:
        best_mse = mse
        best_degree = degree

print(f"\nBest Polynomial Degree: {best_degree}, Best MSE: {best_mse:.2f}")


Degree: 2, MSE: 41.99
Degree: 3, MSE: 1650.71
Degree: 4, MSE: 102926059.67
Degree: 5, MSE: 73749.55

Best Polynomial Degree: 2, Best MSE: 41.99
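
The sharp rise in MSE above degree 2 is consistent with overfitting: the number of polynomial terms grows quickly with the degree, while the training set stays the same size. The sketch below first counts the terms for the 7 input columns used here, then shows one way to choose the degree by cross-validation on the training data instead of on the test set. The data are synthetic and the names `X_demo`, `y_demo`, `cv_mse`, and `best_degree_cv` are hypothetical; this is an illustration under those assumptions, not a rerun of the notebook.

```python
from math import comb

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Terms produced by PolynomialFeatures (bias included) for n input columns:
# comb(n + degree, degree). X above has 7 columns (No, X1..X6).
n_features = 7
for degree in [2, 3, 4, 5]:
    print(f"degree {degree}: {comb(n_features + degree, degree)} terms")

# Synthetic stand-in data with a quadratic relationship plus small noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 2))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=120)

cv_mse = {}
for degree in [1, 2, 3, 4]:
    # Scaling and expansion live inside the pipeline, so each CV fold
    # fits them on its own training part only.
    pipeline = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree),
        LinearRegression(),
    )
    scores = cross_val_score(
        pipeline, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse[degree] = -scores.mean()  # flip the sign back to a plain MSE

best_degree_cv = min(cv_mse, key=cv_mse.get)
print("CV MSE per degree:", cv_mse)
print("Degree chosen by CV:", best_degree_cv)
```

With this approach the test set is touched only once, for the final score of the chosen degree.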


In [35]:
# Define the parameter grid for GridSearchCV
param_grid = {
    'kernel': ['rbf', 'poly'],
    'C': [0.1, 1, 10],
    'epsilon': [0.1, 0.2],
    'gamma': ['scale', 'auto']
}

# Create and train the SVR model with GridSearch
svr = SVR()
grid_search = GridSearchCV(svr, param_grid, cv=3, scoring='neg_mean_squared_error', verbose=2)
grid_search.fit(X_train_scaled, y_train)

# Best SVR model
best_svr = grid_search.best_estimator_
print("Best SVR Parameters:", grid_search.best_params_)

# Evaluate SVR
y_pred_svr = best_svr.predict(X_test_scaled)
mse_svr = mean_squared_error(y_test, y_pred_svr)
r2_svr = r2_score(y_test, y_pred_svr)

print(f"SVR MSE: {mse_svr:.2f}, R²: {r2_svr:.2f}")


Fitting 3 folds for each of 24 candidates, totalling 72 fits
[CV] END ........C=0.1, epsilon=0.1, gamma=scale, kernel=rbf; total time=   0.0s
[CV] END ........C=0.1, epsilon=0.1, gamma=scale, kernel=rbf; total time=   0.0s
[CV] END ........C=0.1, epsilon=0.1, gamma=scale, kernel=rbf; total time=   0.0s
[CV] END .......C=0.1, epsilon=0.1, gamma=scale, kernel=poly; total time=   0.0s
[CV] END .......C=0.1, epsilon=0.1, gamma=scale, kernel=poly; total time=   0.0s
[CV] END .......C=0.1, epsilon=0.1, gamma=scale, kernel=poly; total time=   0.0s
[CV] END .........C=0.1, epsilon=0.1, gamma=auto, kernel=rbf; total time=   0.0s
[CV] END .........C=0.1, epsilon=0.1, gamma=auto, kernel=rbf; total time=   0.0s
[CV] END .........C=0.1, epsilon=0.1, gamma=auto, kernel=rbf; total time=   0.0s
[CV] END ........C=0.1, epsilon=0.1, gamma=auto, kernel=poly; total time=   0.0s
[CV] END ........C=0.1, epsilon=0.1, gamma=auto, kernel=poly; total time=   0.0s
[CV] END ........C=0.1, epsilon=0.1, gamma=auto,
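
The grid search above receives data that were scaled once on the full training set, so each cross-validation fold sees statistics from its own validation part. One possible alternative is to put the scaler and the SVR into a `Pipeline` and search over that. The sketch below uses synthetic data and a reduced grid (no `gamma`) to stay short; `svr_pipeline`, `param_grid_demo`, and `search` are hypothetical names, and the printed values are not results for the real estate data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data: a mildly non-linear target with small noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(90, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=90)

svr_pipeline = Pipeline([
    ("scaler", StandardScaler()),  # refit inside every CV fold
    ("svr", SVR()),
])

# Pipeline parameters are addressed as "<step name>__<parameter>".
param_grid_demo = {
    "svr__kernel": ["rbf", "poly"],
    "svr__C": [0.1, 1, 10],
    "svr__epsilon": [0.1, 0.2],
}

search = GridSearchCV(
    svr_pipeline, param_grid_demo, cv=3, scoring="neg_mean_squared_error"
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV MSE:", -search.best_score_)
```

Leaving `verbose` at its default also avoids the long per-fold log; `search.cv_results_` still holds the score of every candidate.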

In [37]:
# Rebuild the polynomial features with the best degree found above
poly = PolynomialFeatures(degree=best_degree)
X_train_poly = poly.fit_transform(X_train_scaled)  # fit the expansion on the training data
X_test_poly = poly.transform(X_test_scaled)        # apply the same expansion to the test data

# Refit the model using the best degree's transformation
model = LinearRegression()
model.fit(X_train_poly, y_train)

# Predict on the transformed test set
y_pred_poly = model.predict(X_test_poly)

# Evaluate performance
mse_poly = mean_squared_error(y_test, y_pred_poly)
r2_poly = r2_score(y_test, y_pred_poly)

print(f"Polynomial Regression - MSE: {mse_poly:.2f}, R²: {r2_poly:.2f}")


Polynomial Regression - MSE: 41.99, R²: 0.75
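
For the comparison step, it may help to collect the metrics of both models in one place. The helper below is a hypothetical sketch (`summarize` is not part of the notebook or of scikit-learn), demonstrated on made-up numbers rather than on the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def summarize(name, y_true, y_pred):
    """Return MSE, RMSE, and R² for one model as a dictionary."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "model": name,
        "mse": float(mse),
        "rmse": float(np.sqrt(mse)),  # same units as the target
        "r2": float(r2_score(y_true, y_pred)),
    }


# Made-up values purely to exercise the helper; not results from this notebook.
y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 8.0])

row = summarize("demo model", y_true_demo, y_pred_demo)
print(row)
```

In the notebook, calling `summarize("Polynomial", y_test, y_pred_poly)` and `summarize("SVR", y_test, y_pred_svr)` and passing both dictionaries to `pd.DataFrame([...])` would give a two-row comparison table.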
