# Rental Listing Price Model

Below are the steps taken to build our regression model which will be used to predict effective prices for prospective rental listings.

## Preparing the Data

First we need to clean and standardize the data scraped from the rental listing site in order to have the model train on it.

In [3]:
from data_cleaner import get_cleaned_data, flatten_data
import pandas as pd
import numpy as np

### Data Cleaning
`get_cleaned_data()` removes invalid and outlier data including blanks and data for single room listings. It also formats the building and unit amenities by making each column a dict that contains the relevant amenities as keys with a value of 1 if the listing has it, else 0.

`flatten_data()` flattens the building and unit amenities to put individual amenities into their own columns, essentially flattening the building and unit amenities dicts into separate columns in each row.

In [4]:
pd.set_option('display.max_columns', None)

cleaned_data = get_cleaned_data()
flattened_data = flatten_data(cleaned_data)
df = pd.DataFrame(flattened_data)
df.to_excel("cleaned_data.xlsx", index=False)

In [5]:
print("Printing columns:")
print(df.columns)

Printing columns:
Index(['Building', 'Address', 'City', 'Listing', 'Bed', 'Bath', 'SqFt',
       'Price', 'Pets', 'Latitude', 'Longitude', 'Balcony', 'In Unit Laundry',
       'Air Conditioning', 'High Ceilings', 'Furnished', 'Hardwood Floor',
       'Controlled Access', 'Fitness Center', 'Swimming Pool', 'Roof Deck',
       'Storage', 'Residents Lounge', 'Outdoor Space'],
      dtype='object')


In [6]:
print("Printing first 2 rows:")
print(df.head(2))

Printing first 2 rows:
             Building                                  Address     City  \
0  20 Samuel Wood Way  20 Samuel Wood Way, Toronto, ON M9B 0C8  toronto   
1  20 Samuel Wood Way  20 Samuel Wood Way, Toronto, ON M9B 0C8  toronto   

     Listing  Bed  Bath  SqFt  Price  Pets  Latitude  Longitude  Balcony  \
0     Studio    0   1.0   370   2225     0   43.6959    -79.552        0   
1  1 Bedroom    1   1.0   540   2625     0   43.6959    -79.552        0   

   In Unit Laundry  Air Conditioning  High Ceilings  Furnished  \
0                0                 0              0          0   
1                0                 0              0          0   

   Hardwood Floor  Controlled Access  Fitness Center  Swimming Pool  \
0               0                  1               1              0   
1               0                  1               1              0   

   Roof Deck  Storage  Residents Lounge  Outdoor Space  
0          0        1                 1             

### The `Building` and `UnitType` class

For our purposes, we want to group the data by building type, unit type, and city as three major parameters. We created the `Building`, `UnitType`, `City` classes to group data together cleanly. This will become useful when dividing our data into a training and test set.

The `Building` class encompasses the relationship between a building name and the different types of units in it.

The `UnitType` class represents the different types of units where units are distinguished by number of bedrooms.

The `City` class contains all the `Building` values associated with a specific city.

In [7]:
from constants import TableHeaders
from classes import UnitType, Building, City

In [8]:
cities: list[City] = []

# Group data by city to extract city specific insights
city_groups = df.groupby(TableHeaders.CITY.value)

for city_name, city_df in city_groups:
    current_city = City(city_name) 
    # Group city data by building name to extract building specific insights
    building_groups = city_df.groupby(TableHeaders.BUILDING.value)

    # Create an intermediary tuple to record number of available units and sort buildings accordingly
    # When displaying overarching insights for an area, buildings with more units will be more informational
    buildings_tuples = [(building, building_df, len(building_df)) for building, building_df in building_groups]
    buildings_tuples.sort(key = lambda x: x[2], reverse=True)

    for building_name, building_df, num_units in buildings_tuples:

        current_building: Building = Building(building_name, city_name)
        # Group by bed type within this building
        bed_groups = building_df.groupby(TableHeaders.BED.value)
        for bed, unit_df in bed_groups:
            current_building.add_unit_type(bed=bed, unit_df=unit_df)
        
        current_city.add_building(current_building)
    cities.append(current_city)

Since we want to partition the data into a test and train set with an even 20% split based on the unit type, let's remove the entries that have less than 5 listings for that unit type.

In [9]:
standardized_df = df.copy()
for city_name, city_df in city_groups:
    unit_groups = city_df.groupby(TableHeaders.BED.value)
    for unit_type, unit_df in unit_groups:
        # Filter out the unit listings that have less than 5 entries for that unit type
        # since it won't have sufficient data to split between testing and training
        if len(unit_df) < 5:
            # print(city_name, unit_type, len(unit_df))
            standardized_df = standardized_df.loc[
                ~((standardized_df[TableHeaders.CITY.value] == city_name) & 
                (standardized_df[TableHeaders.BED.value] == unit_type)) 
            ]

In [10]:
for city in cities:
    print(f"City: {city.name}")
    for building in city.buildings[:5]:
        print(building)

City: edmonton
Building: Citizen on Jasper
Total Units: 79
Overall Average SqFt: 652.15
Overall Average Price: 2042.63
Overall Price Per SqFt: 3.13
-----------------------------------
Bedroom Type: 1 beds
 - Units: 63
 - Average SqFt: 585.62
 - Average Price: 1894.30
 - Price per SqFt: 3.23
-----------------------------------
Bedroom Type: 2 beds
 - Units: 15
 - Average SqFt: 899.93
 - Average Price: 2571.07
 - Price per SqFt: 2.86
-----------------------------------
Bedroom Type: 3 beds
 - Units: 1
 - Average SqFt: 1127.00
 - Average Price: 3461.00
 - Price per SqFt: 3.07
-----------------------------------

Building: Raymond Block
Total Units: 17
Overall Average SqFt: 824.18
Overall Average Price: 2017.94
Overall Price Per SqFt: 2.45
-----------------------------------
Bedroom Type: 1 beds
 - Units: 11
 - Average SqFt: 736.82
 - Average Price: 1913.64
 - Price per SqFt: 2.60
-----------------------------------
Bedroom Type: 2 beds
 - Units: 6
 - Average SqFt: 984.33
 - Average Price:

### Standardize the Data
We use standard scaling to standardize the values before passing to the model.

In [11]:
print(standardized_df)
sqft_values = np.array(standardized_df[TableHeaders.SQFT.value])
standardized_sqft_values = (sqft_values - np.mean(sqft_values)) / np.std(sqft_values)
standardized_df[TableHeaders.SQFT.value] = standardized_sqft_values
print(standardized_df[TableHeaders.SQFT.value].head())

                Building                                            Address  \
0     20 Samuel Wood Way            20 Samuel Wood Way, Toronto, ON M9B 0C8   
1     20 Samuel Wood Way            20 Samuel Wood Way, Toronto, ON M9B 0C8   
2            The Brixton               410 Dufferin St, Toronto, ON M6K 0H1   
3            The Brixton               410 Dufferin St, Toronto, ON M6K 0H1   
4            The Brixton               410 Dufferin St, Toronto, ON M6K 0H1   
...                  ...                                                ...   
4862     4507 162 Ave Nw              4507 162 Ave Nw, Edmonton, AB T5Y 0H1   
4863    12911 132 Ave Nw             12911 132 Ave Nw, Edmonton, AB T5L 3R2   
4864     4518 118 Ave Nw              4518 118 Ave Nw, Edmonton, AB T5W 1A9   
4865          10388 1506  10388 1506, 10388 105 Street Northwest #1506, ...   
4866     6548 175 Ave Nw              6548 175 Ave Nw, Edmonton, AB T5Y 4B3   

          City               Listing  Bed  Bath  Sq

## Training the Model

In [12]:
from sklearn.model_selection import train_test_split
city_groups = standardized_df.groupby(TableHeaders.CITY.value)
master_train_df = pd.DataFrame()
master_test_df = pd.DataFrame()
for city_name, city_df in city_groups:
    train_df, test_df = train_test_split(city_df, test_size=0.2, random_state=42, stratify=city_df[TableHeaders.BED.value])

    # Concatenate the individual city train and test sets with the master DataFrames
    master_train_df = pd.concat([master_train_df, train_df], ignore_index=True)
    master_test_df = pd.concat([master_test_df, test_df], ignore_index=True)

# print(len(master_train_df.loc[master_train_df[TableHeaders.BED.value] == 1]))
# print(len(master_test_df.loc[master_test_df[TableHeaders.BED.value] == 1]))
# print(len(master_train_df.loc[master_train_df[TableHeaders.BED.value] == 2]))
# print(len(master_test_df.loc[master_test_df[TableHeaders.BED.value] == 2]))

In [13]:
import torch
# Assuming 'target_column' is the name of your target variable
dropped_columns = [
    TableHeaders.PRICE.value, 
    TableHeaders.BUILDING.value,
    TableHeaders.CITY.value,
    TableHeaders.LISTING.value,
    TableHeaders.ADDRESS.value,
    TableHeaders.LAT.value,
    TableHeaders.LON.value,
    TableHeaders.PETS.value,
    # TableHeaders.SQFT.value,
    TableHeaders.BATH.value,
    'Balcony',
    'In Unit Laundry',
    'Air Conditioning',
    'High Ceilings',
    'Furnished',
    'Hardwood Floor',
    'Controlled Access',
    'Fitness Center',
    'Swimming Pool',
    'Roof Deck',
    'Storage',
    'Residents Lounge',
    'Outdoor Space',
]

updated_df = master_train_df.loc[master_train_df[TableHeaders.CITY.value] == 'toronto']

X_train = torch.tensor(updated_df.drop(dropped_columns, axis=1).values).float()
y_train = torch.tensor(master_train_df[TableHeaders.PRICE.value].values).float()

X_test = torch.tensor(updated_df.drop(dropped_columns, axis=1).values).float()
y_test = torch.tensor(master_test_df[TableHeaders.PRICE.value].values).float()

print(X_train[0], y_train[0])

tensor([ 1.0000, -0.4425]) tensor(2162.)


In [14]:
from dataset import RentalDataset
from torch.utils.data import DataLoader

train_dataset = RentalDataset(X_train, y_train)
test_dataset = RentalDataset(X_test, y_test)

batch_size = 64  # Choose a batch size that fits your model and training process

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


In [15]:
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EPOCHS = 100
LEARNING_RATE = 0.005

class RegressionModel(nn.Module):
    def __init__(self, input_size):
        super(RegressionModel, self).__init__()
        # Increasing the complexity of the model
        self.fc1 = nn.Linear(input_size, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)  # Single output for regression

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)  # No activation function for the last layer in regression
        return x

input_size = X_train.shape[1]

# Replace input_size with the actual size of your input features
model = RegressionModel(input_size)  
criterion = nn.MSELoss()  # Mean Squared Error Loss for regression tasks
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)  # Learning rate can be adjusted

for epoch in range(NUM_EPOCHS):
    model.train()  # Set the model to training mode
    for index, (inputs, targets) in enumerate(train_loader):
        # print(f"Processing batch {index} / {len(train_loader)}")
        # Forward pass
        outputs = model(inputs)

        l1_lambda = 0.001  # Regularization strength
        l1_norm = sum(p.abs().sum() for p in model.parameters())
        loss = criterion(outputs, targets) + l1_lambda * l1_norm
        # print(next(model.parameters()))
        # loss = criterion(outputs, targets)

        # Backward pass and optimization
        optimizer.zero_grad()  # Clear existing gradients
        loss.backward()       # Backpropagation
        optimizer.step()      # Update model parameters

    # Optional: Print the loss every epoch
    print(f'Epoch [{epoch+1}/{NUM_EPOCHS}], Loss: {loss.item():.4f}')

    # Optional: Evaluate the model every epoch
    # model.eval()  # Set the model to evaluation mode
    # with torch.no_grad():  # Gradient computation is not needed for evaluation
    #     # Code to evaluate the model on the test set


  return F.mse_loss(input, target, reduction=self.reduction)
  return F.mse_loss(input, target, reduction=self.reduction)


Epoch [1/100], Loss: 9678651.0000
Epoch [2/100], Loss: 4739660.0000
Epoch [3/100], Loss: 1493835.8750
Epoch [4/100], Loss: 1200698.6250
Epoch [5/100], Loss: 1095932.6250
Epoch [6/100], Loss: 1451447.5000
Epoch [7/100], Loss: 16217389.0000
Epoch [8/100], Loss: 1634444.5000
Epoch [9/100], Loss: 6427657.5000
Epoch [10/100], Loss: 1017921.4375
Epoch [11/100], Loss: 2147139.0000
Epoch [12/100], Loss: 1907690.0000
Epoch [13/100], Loss: 1484333.1250
Epoch [14/100], Loss: 948898.2500
Epoch [15/100], Loss: 4086450.5000
Epoch [16/100], Loss: 700297.0625
Epoch [17/100], Loss: 1032093.6875
Epoch [18/100], Loss: 1617412.7500
Epoch [19/100], Loss: 56902936.0000
Epoch [20/100], Loss: 54351452.0000
Epoch [21/100], Loss: 4089810.7500
Epoch [22/100], Loss: 1309831.3750
Epoch [23/100], Loss: 3971045.2500
Epoch [24/100], Loss: 820714.8750
Epoch [25/100], Loss: 1073614.6250
Epoch [26/100], Loss: 1417959.6250
Epoch [27/100], Loss: 3750353.0000
Epoch [28/100], Loss: 3217237.5000
Epoch [29/100], Loss: 1555739