# Rental Listing Price Model

The steps below build the regression model that we use to predict effective prices for prospective rental listings.

## Preparing the Data

First, we clean and standardize the data scraped from the rental listing site so that the model can train on it.

In [None]:
import sys
import os

# Make modules in the notebook's directory and in its parent directory importable
current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)

sys.path.append(current_dir)
sys.path.append(parent_dir)

from data.data_cleaner import get_cleaned_data
import pandas as pd
import numpy as np

### Data Cleaning
`get_cleaned_data()` removes invalid and outlier data, including blank entries and single-room listings. It also formats the building and unit amenities: each of these columns becomes a dict whose keys are the relevant amenities, with a value of 1 if the listing has the amenity and 0 otherwise.

`flatten_data()` then flattens the building and unit amenity dicts so that each amenity gets its own column in every row.
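
As a rough illustration of the flattening step, the sketch below expands a dict-valued column into one column per amenity. The column names and sample rows are hypothetical, and this is not the actual `flatten_data()` implementation, only one way such a step could be written with pandas.

```python
import pandas as pd

# Hypothetical rows: the amenities column holds a dict of 0/1 flags per listing.
raw = pd.DataFrame({
    "Bed": [1, 2],
    "Unit Amenities": [
        {"Balcony": 1, "Furnished": 0},
        {"Balcony": 0, "Furnished": 1},
    ],
})

# Expand the dict column into one column per amenity, then drop the original.
amenities = pd.DataFrame(raw["Unit Amenities"].tolist(), index=raw.index)
flat = pd.concat([raw.drop(columns="Unit Amenities"), amenities], axis=1)

print(flat)
```

Each amenity key becomes a numeric column, which is the shape the model needs later when the DataFrame is converted to a tensor.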

In [None]:
from data.data_cleaner import get_cleaned_df

pd.set_option('display.max_columns', None)

df = get_cleaned_df()

print(df.head())

In [None]:
print("Printing columns:")
print(df.columns)

In [None]:
print("Printing first 2 rows:")
print(df.head(2))

### The `Building`, `UnitType`, and `City` classes

We group the data by three major parameters: building, unit type, and city. We created the `Building`, `UnitType`, and `City` classes to group the data cleanly, which becomes useful when dividing it into training and test sets.

The `Building` class encompasses the relationship between a building name and the different types of units in it.

The `UnitType` class represents the different types of units where units are distinguished by number of bedrooms.

The `City` class contains all the `Building` values associated with a specific city.
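
To make the relationship between the three classes concrete, here is a minimal sketch using dataclasses. The field names and the `add_listing` helper are assumptions made for illustration; the real definitions live in the project's `classes` module and may differ.

```python
from dataclasses import dataclass, field


@dataclass
class UnitType:
    # Units are distinguished by their number of bedrooms.
    bedrooms: int
    prices: list[float] = field(default_factory=list)


@dataclass
class Building:
    # Maps a building name to the unit types found in it.
    name: str
    unit_types: dict[int, UnitType] = field(default_factory=dict)

    def add_listing(self, bedrooms: int, price: float) -> None:
        # Hypothetical helper: file a listing under its unit type.
        unit = self.unit_types.setdefault(bedrooms, UnitType(bedrooms))
        unit.prices.append(price)


@dataclass
class City:
    # Holds every building associated with one city.
    name: str
    buildings: list[Building] = field(default_factory=list)


tower = Building("Example Tower")
tower.add_listing(1, 2100.0)
tower.add_listing(1, 2250.0)
tower.add_listing(2, 3000.0)
city = City("toronto", [tower])

print(len(city.buildings), sorted(tower.unit_types))
```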

In [None]:
from constants import TableHeaders

In [None]:
from classes import Building, City, convert_df_to_classes

cities: list[City] = convert_df_to_classes(df)

for city in cities:
    for building in city.buildings[:5]:
        print(building)


We want to partition the data into training and test sets with a 20% test split stratified by unit type, so we first remove every city and unit-type combination that has fewer than 5 listings.

In [None]:
standardized_df = df.copy()
print(df.head())
city_groups = df.groupby(TableHeaders.CITY.value)
for city_name, city_df in city_groups:
    unit_groups = city_df.groupby(TableHeaders.BED.value)
    for unit_type, unit_df in unit_groups:
        # Filter out city/unit-type groups with fewer than 5 listings, since they
        # do not have enough data to split between training and testing
        if len(unit_df) < 5:
            # print(city_name, unit_type, len(unit_df))
            standardized_df = standardized_df.loc[
                ~((standardized_df[TableHeaders.CITY.value] == city_name) &
                (standardized_df[TableHeaders.BED.value] == unit_type))
            ]
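
The same filter can also be expressed without an explicit loop. The sketch below is one possible alternative, assuming hypothetical column names `City` and `Bed` in place of `TableHeaders.CITY.value` and `TableHeaders.BED.value`; it labels each row with the size of its group and keeps only rows from groups that are large enough.

```python
import pandas as pd

# Hypothetical listings: 5 one-bed and 2 two-bed units in one city, 3 one-bed units in another.
listings = pd.DataFrame({
    "City": ["toronto"] * 7 + ["montreal"] * 3,
    "Bed": [1, 1, 1, 1, 1, 2, 2, 1, 1, 1],
})

MIN_LISTINGS = 5

# Label every row with the number of listings in its (city, unit type) group.
group_sizes = listings.groupby(["City", "Bed"])["Bed"].transform("size")

# Keep only rows whose group has enough listings to split.
filtered = listings[group_sizes >= MIN_LISTINGS]

print(filtered.groupby(["City", "Bed"]).size())
```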

## Training the Model

In [None]:
from sklearn.model_selection import train_test_split
city_groups = standardized_df.groupby(TableHeaders.CITY.value)
print(standardized_df.head())
master_train_df = pd.DataFrame()
master_test_df = pd.DataFrame()
for city_name, city_df in city_groups:
    train_df, test_df = train_test_split(city_df, test_size=0.2, random_state=42, stratify=city_df[TableHeaders.BED.value])
    # Concatenate the individual city train and test sets with the master DataFrames
    master_train_df = pd.concat([master_train_df, train_df], ignore_index=True)
    master_test_df = pd.concat([master_test_df, test_df], ignore_index=True)

master_train_1_bed = len(master_train_df.loc[master_train_df[TableHeaders.BED.value] == 1])
master_test_1_bed = len(master_test_df.loc[master_test_df[TableHeaders.BED.value] == 1])
master_train_2_bed = len(master_train_df.loc[master_train_df[TableHeaders.BED.value] == 2])
master_test_2_bed = len(master_test_df.loc[master_test_df[TableHeaders.BED.value] == 2])

print(f"{(master_train_1_bed/master_train_2_bed):.2f}, {(master_test_1_bed/master_test_2_bed):.2f}")
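
To see what `stratify` does on a small scale, the sketch below splits a toy DataFrame (hypothetical `Bed` and `Price` columns) and checks that the test set keeps the same 2:1 ratio of one-bedroom to two-bedroom units as the full data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: ten one-bedroom and five two-bedroom listings.
listings = pd.DataFrame({
    "Bed": [1] * 10 + [2] * 5,
    "Price": range(15),
})

train_part, test_part = train_test_split(
    listings, test_size=0.2, random_state=42, stratify=listings["Bed"]
)

# Count how many listings of each unit type landed in each split.
print(train_part["Bed"].value_counts().to_dict())
print(test_part["Bed"].value_counts().to_dict())
```

With only five listings in the smaller group, a 20% split leaves a single test example for it, which is why groups below that size are removed beforehand.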

In [None]:
import torch
# Columns excluded from the feature matrix: the target (price) and
# identifier-like columns that are not used as features
dropped_columns = [
    TableHeaders.PRICE.value,
    TableHeaders.BUILDING.value,
    TableHeaders.NEIGHBOURHOOD.value,
    TableHeaders.CITY.value,
    TableHeaders.LISTING.value,
    TableHeaders.ADDRESS.value,
    TableHeaders.DATE.value,
    # TableHeaders.LAT.value,
    # TableHeaders.LON.value,
    # TableHeaders.PETS.value,
    # TableHeaders.SQFT.value,
    # TableHeaders.BED.value,
    # TableHeaders.BATH.value,
    # 'Balcony',
    # 'In Unit Laundry',
    # 'Air Conditioning',
    # 'High Ceilings',
    # 'Furnished',
    # 'Hardwood Floor',
    # 'Controlled Access',
    # 'Fitness Center',
    # 'Swimming Pool',
    # 'Roof Deck',
    # 'Storage',
    # 'Residents Lounge',
    # 'Outdoor Space',
]

# updated_df = master_train_df.loc[master_train_df[TableHeaders.CITY.value] == 'toronto']

updated_train_df = master_train_df.drop(dropped_columns, axis=1)
updated_test_df = master_test_df.drop(dropped_columns, axis=1)

X_train = torch.tensor(updated_train_df.values).float()
y_train = torch.tensor(master_train_df[TableHeaders.PRICE.value].values).float()

X_test = torch.tensor(updated_test_df.values).float()
y_test = torch.tensor(master_test_df[TableHeaders.PRICE.value].values).float()

# y_train = y_train.view(y_train.shape[0],1)
# y_test = y_test.view(y_test.shape[0],1)
for index, column in enumerate(updated_train_df.columns):
    print(column, X_train[0][index].item())

print("Unit Price: ", y_train[0].item())

In [None]:
from dataset import RentalDataset
from torch.utils.data import DataLoader

train_dataset = RentalDataset(X_train, y_train)
test_dataset = RentalDataset(X_test, y_test)

batch_size = 32  # Choose a batch size that fits your model and training process

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EPOCHS = 50
LEARNING_RATE = 0.001

class RegressionModelV2(nn.Module):
    def __init__(self, input_size):
        super(RegressionModelV2, self).__init__()
        # Increasing the complexity of the model
        self.fc1 = nn.Linear(input_size, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 16)
        self.fc5 = nn.Linear(16, 1)  # Single output for regression

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = F.relu(self.fc4(x))
        x = self.fc5(x)  # No activation function for the last layer in regression
        return x

input_size = X_train.shape[1]  # Number of feature columns

model = RegressionModelV2(input_size)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)  # Learning rate can be adjusted

model.train()

for epoch in range(NUM_EPOCHS):
  for inputs, targets in train_loader:
    prediction = model(inputs)
    # Reshape targets to (batch, 1) to match the model output, so that
    # MSELoss does not silently broadcast mismatched shapes
    loss = criterion(prediction, targets.view(-1, 1))

    optimizer.zero_grad()  # Clear existing gradients
    loss.backward()  # Backpropagation
    optimizer.step()      # Update model parameters

  # Print the loss of the last batch in each epoch
  print(f'Epoch [{epoch+1}/{NUM_EPOCHS}], Loss: {loss.item():.4f}')
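
One thing worth checking in a loop like this is that predictions and targets have the same shape. A final `nn.Linear(..., 1)` layer returns a tensor of shape `(batch, 1)`, while a 1-D target tensor has shape `(batch,)`; subtracting the two broadcasts to a `(batch, batch)` matrix and changes the loss value. The toy tensors below are made up purely to show the effect.

```python
import torch
import torch.nn as nn

predictions = torch.tensor([[1.0], [2.0], [3.0]])  # shape (3, 1), like a Linear(..., 1) output
targets = torch.tensor([1.0, 2.0, 3.0])            # shape (3,)

# Mismatched shapes broadcast into a 3x3 matrix of pairwise differences.
broadcast_diff = predictions - targets
broadcast_mse = (broadcast_diff ** 2).mean()
print(broadcast_diff.shape, broadcast_mse.item())

# Reshaping the targets to (3, 1) gives the intended element-wise loss.
criterion = nn.MSELoss()
aligned_loss = criterion(predictions, targets.view(-1, 1))
print(aligned_loss.item())
```

Here the predictions are perfect, so the aligned loss is zero, while the broadcast version reports a nonzero error.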

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt

model.eval()  # Set the model to evaluation mode
with torch.no_grad():  # Gradient computation is not needed for evaluation
  # Flatten the (N, 1) model output to (N,) so it lines up with y_test
  predictions = model(X_test).view(-1)
  actuals = y_test.view(-1)

  # Convert to NumPy arrays for the scikit-learn metric functions
  predictions_np = predictions.numpy()
  actuals_np = actuals.numpy()

  # Calculate MSE and RMSE
  mse = mean_squared_error(actuals_np, predictions_np)
  rmse = sqrt(mse)

  # Calculate MAE
  mae = torch.mean(torch.abs(predictions - actuals)).item()

  # Calculate R-squared
  r2 = r2_score(actuals_np, predictions_np)

  print(f'MAE: {mae}')
  print(f'MSE: {mse}')
  print(f'RMSE: {rmse}')
  print(f'R-squared: {r2}')
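
For reference, the four metrics reported above can be computed by hand with NumPy. The prices below are invented for illustration and are not results from the model.

```python
import numpy as np

# Made-up actual and predicted monthly prices.
actual = np.array([2000.0, 2500.0, 3000.0, 3500.0])
predicted = np.array([2100.0, 2400.0, 3100.0, 3300.0])

errors = predicted - actual

mae = np.mean(np.abs(errors))   # Mean absolute error
mse = np.mean(errors ** 2)      # Mean squared error
rmse = np.sqrt(mse)             # Root mean squared error, in price units

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae}, MSE: {mse}, RMSE: {rmse:.2f}, R-squared: {r2:.3f}")
```

MAE and RMSE are in the same units as the price, so they are usually the easiest to interpret; RMSE penalizes large misses more heavily than MAE.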

In [None]:
import matplotlib.pyplot as plt

# Neural network: plot predicted prices against the actual test prices
plt.scatter(y_test, predictions)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Predicted vs. Actual Prices')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=3)  # Diagonal line
plt.show()

In [None]:
from sklearn.linear_model import LinearRegression

regression_model = LinearRegression()

regression_model.fit(X_train, y_train)

predictions = regression_model.predict(X_test)

print(f"{regression_model.score(X_test, y_test):.4f}")

In [None]:
import matplotlib.pyplot as plt

plt.scatter(y_test, predictions)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Linear Regression - Predicted vs. Actual Prices')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=3)  # Diagonal line
plt.show()

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Create the Random Forest regressor
random_forest = RandomForestRegressor(n_estimators=500, random_state=42)

# Train the model
random_forest.fit(X_train, y_train)

# Predict using the test set
predictions = random_forest.predict(X_test)

# Evaluate the R-squared score on the test set
print(f"{random_forest.score(X_test, y_test):.4f}")

In [None]:
import matplotlib.pyplot as plt

# Random forest: plot predicted prices against the actual test prices
plt.scatter(y_test, predictions)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Random Forest - Predicted vs. Actual Prices')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=3)  # Diagonal line
plt.show()

In [None]:
import numpy as np

# Plot the trained random forest's feature importances, labelled with the training column names
importances = random_forest.feature_importances_
indices = np.argsort(importances)

plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], align='center')
plt.yticks(range(len(indices)), [updated_train_df.columns[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

In [None]:
from joblib import dump

# Save random forest model
dump(random_forest, 'random_forest_model.joblib')

In [None]:


# Example input (an assumption for illustration): the first test row as a NumPy array
single_data_point = X_test[0].numpy()

# Get predictions from each tree of the random forest for the single data point
single_point_predictions = np.array([tree.predict(single_data_point.reshape(1, -1)) for tree in random_forest.estimators_])

# Calculate mean and standard deviation for the single prediction
mean_prediction = np.mean(single_point_predictions)
std_deviation = np.std(single_point_predictions)

# Approximate interval: mean plus or minus 2 standard deviations of the per-tree predictions
lower_bound = mean_prediction - 2 * std_deviation
upper_bound = mean_prediction + 2 * std_deviation

# Output the prediction and interval
print(f"Prediction: {mean_prediction}, Interval: [{lower_bound}, {upper_bound}]")
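
An alternative to the mean plus or minus two standard deviations is to read the interval directly from the percentiles of the per-tree predictions, which does not assume the spread is symmetric. The sketch below uses synthetic data from `make_regression` and a small forest purely for illustration. In either form, the interval describes disagreement between trees rather than a calibrated prediction interval.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the rental features and prices.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# One row, kept 2-D so each tree can predict on it.
point = X[:1]

# Collect one prediction per tree for that row.
per_tree = np.array([tree.predict(point)[0] for tree in forest.estimators_])

# Take the central 95% of the per-tree predictions as the interval.
lower, upper = np.percentile(per_tree, [2.5, 97.5])

print(f"Prediction: {per_tree.mean():.2f}, Interval: [{lower:.2f}, {upper:.2f}]")
```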