# ENSF 611 Final Project

Instructor: Dr. L. Dawson

Cory Wu <br>
Rick Zhang <br>

We will use the rentfaster.csv dataset to predict the price of a rental property.

Dataset source: https://www.kaggle.com/datasets/sergiygavrylov/25000-canadian-rental-housing-market-june-2024?resource=download

In [1]:
# Importing numpy and pandas
import numpy as np
import pandas as pd

### Load the data

In [2]:
# Read the dataset into df
df = pd.read_csv('data/rentfaster.csv', header=0)
df.head()


FileNotFoundError: [Errno 2] No such file or directory: 'data/rentfaster.csv'

In [None]:
# show all column names
df.columns


In [None]:
# check data type
df.info()
# check for missing values
df.isnull().sum()

In [None]:
# drop rows with missing values
df = df.dropna()
df.info()

In [None]:

# drop uninformative columns
df = df.drop(columns=['rentfaster_id', 'address', 'sq_feet', 'link', 'availability_date'])
df.head()

In [None]:
# Create a ColumnTransformer that scales the numerical columns and one-hot encodes the categorical ones
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

ct = ColumnTransformer(
    [("scaling", StandardScaler(), ['latitude',
                                     'longitude'
                                     ]),
     ("onehot", OneHotEncoder(sparse_output=False, handle_unknown='ignore'), ['city',
                                                                            'province',
                                                                            'lease_term', 
                                                                            'type',
                                                                            'beds',
                                                                            'baths',
                                                                            'furnishing',
                                                                            'smoking',
                                                                            'cats',
                                                                            'dogs'])])


In [None]:
# split data into features and target
X = df.drop(columns=['price'])
y = df['price']

In [None]:
# split data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.1)



In [None]:
y_test.info()


In [None]:
X_test.info()

### Implement Machine Learning Models

In [None]:
# Create a list to store the results
results_list = []


#### 1. Linear Regression

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor
# import r2_score
from sklearn.metrics import r2_score

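# Option 1 (defined for reference, never fitted): TransformedTargetRegressor applies
# `func` to the training target before fitting, so this would clip y_train rather
# than the predictions. This pipeline is overwritten by Option 2 below.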
def force_positive(y):
    return np.maximum(y, 0)  # Clips negative values to 0

pipe = make_pipeline(
    ct, 
    TransformedTargetRegressor(
        regressor=LinearRegression(),
        func=force_positive,
        check_inverse=False
    )
)

# Option 2 (the one we use): fit a plain linear regression, then clip negative predictions to 0
pipe = make_pipeline(ct, LinearRegression())
pipe.fit(X_train, y_train)
y_pred = np.maximum(pipe.predict(X_test), 0)  # Clips negative values to 0

# calculate R2 score
print(r2_score(y_test, y_pred))

# Add the result to the results list
results_list.append({'Model': 'Linear Regression (Non-negative)', 'R2 Score': r2_score(y_test, y_pred)})
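
Clipping after prediction is a patch on the output rather than a change to the model. A different technique, which we did not use in this project, is to fit the regression on log-prices and exponentiate the predictions, which keeps them strictly positive by construction. The sketch below assumes strictly positive prices and uses synthetic data; `X_toy`, `y_toy`, and `log_model` are illustrative names.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, strictly positive "prices" (invented for illustration)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = 500 * np.exp(0.3 * X_toy[:, 0] + rng.normal(0, 0.1, size=200))

# func is applied to y before fitting; inverse_func is applied to the predictions
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
log_model.fit(X_toy, y_toy)

# Even far outside the training range, exp() keeps predictions above zero
toy_pred = log_model.predict(np.array([[-50.0], [0.0], [5.0]]))
print((toy_pred > 0).all())
```

Whether this would improve the R2 score on the rental data is untested here; it only removes the need for clipping.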


The linear regression model performs very poorly, so there is no point in doing cross-validation and hyperparameter tuning for it.

Create a scatter plot to visualize the relationship between the actual prices and the predicted prices.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create single plot
fig, ax = plt.subplots(figsize=(10, 6))

# Scatter plot with actual prices in blue
scatter_actual = ax.scatter(y_test, y_test, alpha=0.5, color='blue', label='Actual Prices')

# Scatter plot with predicted prices in red
scatter_pred = ax.scatter(y_test, y_pred, alpha=0.5, color='red', label='Predicted Prices')

# Perfect prediction line
ideal_line = ax.plot([y_test.min(), y_test.max()], 
                     [y_test.min(), y_test.max()], 
                     'k--', lw=2,
                     label='Perfect Predictions')

# Set axis limits and labels
ax.set_ylim(0, 10000)
ax.set_xlabel('Actual Price')
ax.set_ylabel('Price')
ax.set_title('Linear Regression: Actual vs Predicted Prices')

# Add legend
ax.legend()

plt.tight_layout()
plt.show()

The plot above shows that the model works reasonably well for mid-range properties but performs poorly for properties with very high or very low prices.

To better understand the relationship between the features and the target, we created scatter plots for numerical features and box plots for one of the categorical features.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))

# 1. Scatter plots for numerical features (latitude and longitude)
numerical_features = ['latitude', 'longitude']
colors = ['blue', 'green']
for feature, color in zip(numerical_features, colors):
    ax1.scatter(X_train[feature], y_train, alpha=0.5, color=color, label=feature)

ax1.set_ylabel('Price')
ax1.set_title('Relationship between Numerical Features and Price')
ax1.legend()

# 2. Box plots for categorical features (using one important categorical feature as example)
sns.boxplot(data=df, x='province', y='price', ax=ax2)
# Rotate the existing tick labels in place (avoids resetting labels without fixed tick positions)
plt.setp(ax2.get_xticklabels(), rotation=45, ha='right')
ax2.set_title('Price Distribution by Province')
ax2.set_ylabel('Price')

plt.tight_layout()
plt.show()

Plot 1 shows no linear pattern between the latitude or longitude of a property and its price.

Plot 2 shows that the price ranges of the different provinces overlap.

Both observations help to explain why the linear regression model performs poorly.
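
The following sketch illustrates the same point on synthetic data, under the assumption that price falls off with distance from a city centre, so it first rises and then falls along one coordinate. The data and the names `lat`, `price`, `linear_r2`, and `forest_r2` are invented for illustration. A straight line cannot capture this shape, while a tree-based model can.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: price peaks at lat = 0 and drops on both sides
rng = np.random.default_rng(42)
lat = rng.uniform(-1, 1, size=1000)
price = 2000 - 1500 * np.abs(lat) + rng.normal(0, 50, size=1000)

X_toy = lat.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, price, random_state=42)

# .score() returns R2 for scikit-learn regressors
linear_r2 = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)
forest_r2 = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr).score(X_te, y_te)

print(f"Linear regression R2: {linear_r2:.3f}")
print(f"Random forest R2:     {forest_r2:.3f}")
```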


#### 2. Random Forest Regression

We tried random forest regression, a non-linear model, to see whether it performs better.

In [None]:
# Create a pipeline with the ColumnTransformer and a random forest regressor
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor

pipe = make_pipeline(ct, RandomForestRegressor())

# Fit the pipeline on the training data
pipe.fit(X_train, y_train)

# Make predictions on the test data
y_pred = pipe.predict(X_test)

# calculate R2 score
from sklearn.metrics import r2_score
print(r2_score(y_test, y_pred))




The R2 score is 0.88, which is much better than the linear regression model.

We made a scatter plot to visualize the relationship between the actual prices and the predicted prices again.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create single plot
fig, ax = plt.subplots(figsize=(10, 6))

# Scatter plot with actual prices in blue
scatter_actual = ax.scatter(y_test, y_test, alpha=0.5, color='blue', label='Actual Prices')

# Scatter plot with predicted prices in red
scatter_pred = ax.scatter(y_test, y_pred, alpha=0.5, color='red', label='Predicted Prices')

# Perfect prediction line
ideal_line = ax.plot([y_test.min(), y_test.max()], 
                     [y_test.min(), y_test.max()], 
                     'k--', lw=2,
                     label='Perfect Predictions')

# Set axis limits and labels
ax.set_ylim(0, 10000)
ax.set_xlabel('Actual Price')
ax.set_ylabel('Price')
ax.set_title('Random Forest Regression: Actual vs Predicted Prices')

# Add legend
ax.legend()

plt.tight_layout()
plt.show()

The result from the random forest regression model is much better than the linear regression model.

We will add cross-validation and hyperparameter tuning to the random forest regression model to see whether the performance can be improved.


In [None]:
# We only search two values of one parameter (max_depth) because the grid search takes a very long time to run.
# Each candidate takes about 25 seconds to run.

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score

# Create pipeline
pipe = make_pipeline(ct, RandomForestRegressor())

# Define parameter grid for RandomForestRegressor
param_grid = {
    'randomforestregressor__max_depth': [100, 1000],
}

# Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='r2',
    n_jobs=-1,  # Use all available cores
    verbose=1
)

# Fit the grid search
grid_search.fit(X_train, y_train)

# Print best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

# Make predictions with best model
y_pred = grid_search.predict(X_test)

# Calculate R2 score on test set
test_score = r2_score(y_test, y_pred)
print("Test set R2 score:", test_score)

# Add the result to the results list
results_list.append({'Model': 'Random Forest Regression', 'R2 Score': test_score})


The R2 score rose slightly to 0.884.

#### 3. K-NN Regression

From now on, we will strictly follow Hyperparameter Tuning -> Implement Model -> Validate Model -> Visualize Model Performance.
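
The tuning and validation steps of this workflow can be wrapped in one function so that each model only needs a pipeline and a parameter grid. This is a minimal sketch, not the code we ran: `tune_and_evaluate` is a hypothetical helper, and the data is synthetic.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def tune_and_evaluate(pipe, param_grid, X_tr, y_tr, X_te, y_te, cv=5):
    """Hypothetical helper: grid-search on the training split, then score once on the test split."""
    search = GridSearchCV(pipe, param_grid, cv=cv, scoring="r2")
    search.fit(X_tr, y_tr)
    test_r2 = r2_score(y_te, search.predict(X_te))
    return search.best_params_, search.best_score_, test_r2


# Synthetic data, invented for illustration
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(300, 2))
y_toy = 1000 + 2000 * X_toy[:, 0] * X_toy[:, 1] + rng.normal(0, 20, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# make_pipeline names each step after its lowercased class name,
# which is where the "kneighborsregressor__" prefix comes from
toy_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
toy_grid = {"kneighborsregressor__n_neighbors": [3, 5, 10]}

best_params, cv_r2, test_r2 = tune_and_evaluate(toy_pipe, toy_grid, X_tr, y_tr, X_te, y_te)
print("Best parameters:", best_params)
print("Cross-validation R2:", round(cv_r2, 3))
print("Test R2:", round(test_r2, 3))
```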


In [None]:
# We only put 2 parameters in the grid search because it takes a very long time to run.
# Each candidate takes about 25 seconds to run.

from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score

# Create pipeline
pipe = make_pipeline(ct, KNeighborsRegressor())

# Define parameter grid for KNeighborsRegressor
param_grid = {
    'kneighborsregressor__n_neighbors': [5, 10, 15],  # number of neighbors
    'kneighborsregressor__weights': ['uniform', 'distance']  # weight function used
}

# Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='r2',
    n_jobs=-1,  # Use all available cores
    verbose=1
)

# Fit the grid search
grid_search.fit(X_train, y_train)

# Print best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

# Make predictions with best model
y_pred = grid_search.predict(X_test)

# Calculate R2 score on test set
test_score = r2_score(y_test, y_pred)
print("Test set R2 score:", test_score)

# Add the result to the results list
results_list.append({'Model': 'K-NN Regression', 'R2 Score': test_score})


Visualize the performance of the K-NN regression model.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create single plot
fig, ax = plt.subplots(figsize=(10, 6))

# Scatter plot with actual prices in blue
scatter_actual = ax.scatter(y_test, y_test, alpha=0.5, color='blue', label='Actual Prices')

# Scatter plot with predicted prices in red
scatter_pred = ax.scatter(y_test, y_pred, alpha=0.5, color='red', label='Predicted Prices')

# Perfect prediction line
ideal_line = ax.plot([y_test.min(), y_test.max()], 
                     [y_test.min(), y_test.max()], 
                     'k--', lw=2,
                     label='Perfect Predictions')

# Set axis limits and labels
ax.set_ylim(0, 10000)
ax.set_xlabel('Actual Price')
ax.set_ylabel('Price')
ax.set_title('K-NN Regression: Actual vs Predicted Prices')

# Add legend
ax.legend()

plt.tight_layout()
plt.show()

#### 4. Gradient Boosted Regression Tree

Because the linear regression model performs very poorly, we do not want to count it as one of the three models, so we added a Gradient Boosted Regression Tree.

The original plan was to use an MLP, but we kept getting `ConvergenceWarning: Stochastic Optimizer: Maximum iterations (1000) reached and the optimization hasn't converged yet`, so we abandoned it.
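
We did not investigate the warning further. One thing that is commonly tried in this situation is to standardize the target as well as the features, since an MLP trained on raw prices in the thousands can need many iterations to reach the right output scale, and to enable early stopping. The sketch below shows that setup on synthetic data; it is an untested assumption that it would resolve the warning on the rental dataset, and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a target on a price-like scale (invented for illustration)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(400, 3))
y_toy = 1500 + 1000 * X_toy[:, 0] - 500 * X_toy[:, 1] + rng.normal(0, 30, size=400)

mlp = TransformedTargetRegressor(
    # Features are scaled inside the pipeline
    regressor=make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), early_stopping=True,
                     max_iter=500, random_state=0),
    ),
    # The target is standardized before fitting and un-scaled after predicting
    transformer=StandardScaler(),
)
mlp.fit(X_toy, y_toy)

toy_pred = mlp.predict(X_toy[:5])
print(toy_pred)
```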

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score

# Create pipeline
pipe = make_pipeline(ct, GradientBoostingRegressor())

# Define parameter grid for GradientBoostingRegressor
param_grid = {
    'gradientboostingregressor__learning_rate': [0.5, 0.8, 1]  # learning rate
}

# Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='r2',
    n_jobs=-1,  # Use all available cores
    verbose=1
)

# Fit the grid search
grid_search.fit(X_train, y_train)

# Print best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

# Make predictions with best model
y_pred = grid_search.predict(X_test)

# Calculate R2 score on test set
test_score = r2_score(y_test, y_pred)
print("Test set R2 score:", test_score)

# Add the result to the results list
results_list.append({'Model': 'Gradient Boosted Regression Tree', 'R2 Score': test_score})


The result is not as good as KNN or Random Forest Regression.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create single plot
fig, ax = plt.subplots(figsize=(10, 6))

# Scatter plot with actual prices in blue
scatter_actual = ax.scatter(y_test, y_test, alpha=0.5, color='blue', label='Actual Prices')

# Scatter plot with predicted prices in red
scatter_pred = ax.scatter(y_test, y_pred, alpha=0.5, color='red', label='Predicted Prices')

# Perfect prediction line
ideal_line = ax.plot([y_test.min(), y_test.max()], 
                     [y_test.min(), y_test.max()], 
                     'k--', lw=2,
                     label='Perfect Predictions')

# Set axis limits and labels
ax.set_ylim(0, 10000)
ax.set_xlabel('Actual Price')
ax.set_ylabel('Price')
ax.set_title('Gradient Boosted Tree Regression: Actual vs Predicted Prices')

# Add legend
ax.legend()

plt.tight_layout()
plt.show()

### Conclusion

In [None]:
# Print out the results
# Round R2 Score to 4 decimal places (kept numeric so the table can still be sorted)
results_df = pd.DataFrame(results_list)
results_df['R2 Score'] = results_df['R2 Score'].round(4)
results_df

Linear Regression is not a good choice for our problem.

Random Forest Regression and K-NN Regression are both good models.

We recommend using K-NN Regression for this problem. It performs better than Random Forest Regression and trains much faster.
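
Training time can be measured directly rather than judged by feel. The sketch below times `fit` for both model types on synthetic data; `time_fit` is a hypothetical helper, and the numbers it prints depend on the machine and say nothing about the rental dataset itself.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, invented for illustration
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(2000, 5))
y_toy = 1000 + 500 * X_toy[:, 0] + rng.normal(0, 20, size=2000)


def time_fit(model, X, y):
    """Hypothetical helper: return the wall-clock seconds spent in model.fit."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


fit_seconds = {
    "K-NN": time_fit(KNeighborsRegressor(), X_toy, y_toy),
    "Random Forest": time_fit(RandomForestRegressor(n_estimators=50, random_state=0), X_toy, y_toy),
}

for name, seconds in fit_seconds.items():
    print(f"{name}: {seconds:.3f} s")
```

Note that K-NN defers most of its work to prediction time, because it searches for neighbours on every call to `predict`. A fair comparison would therefore time prediction as well as fitting.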


The code to create the scatter plots is modified from code generated by ChatGPT with the prompt: "Help me create a scatter plot to compare actual prices and predicted prices."
