# Predict Rent Prices

The dataset we have obtained is from StreetEasy and contains the median rent prices over the last 14 years. We will work with the dataset to predict rent prices over set time periods. 

In [1]:
# Import packages 
import os 
import pandas as pd

In [None]:
# Load data 
data = pd.read_csv(r"medianAskingRent_All.csv")

In [12]:
# Inspect data 
data.head()

# View number of rows and columns
data.shape

(198, 180)

Taking a look at the first few rows of data, we need to perform a few transformation to feed the data into a model. We will need to reshape the data into a time series format by making the date the index and the rent values the target variable.

In [13]:
# Data transformation 
# Calculate average rent data across all areas in NYC for each month 

# We will exclude the first three columns because they are categorical 
average_rent_data = data.iloc[:, 3:].mean(axis = 0).reset_index()

# Rename columns 
average_rent_data.columns = ["Date", "Average Rent"]
average_rent_data["Date"] = pd.to_datetime(average_rent_data["Date"], format = "%Y-%m")

# Sort by date 
average_rent_data = average_rent_data.sort_values(by = "Date").set_index("Date")

# Display newly transformed data
average_rent_data.head()

Unnamed: 0_level_0,Average Rent
Date,Unnamed: 1_level_1
2010-01-01,2306.102273
2010-02-01,2257.144444
2010-03-01,2248.873563
2010-04-01,2275.090909
2010-05-01,2328.843373


A summary of the steps performed: 

- Calculate monthly average rent: exclude categorical columns (areaName, Borough, areaType) and calculate mean across all rows for each month. This gives the average rent for each month
- Rename columns for clarity 
- Convert date column to a datetime format to set up the data from time series analysis
- Sort and set index: sort the data by date and set the date column as the index to obtain a time series structure

### Next step

The next step is to begin creating **lagged features** which allows for the model to use past rent values as a predictor for future rent values. Once the the features are created, we can train a model to predict future prices. 

##### What is a lagged feature? 

They are variables created from past values in a time series dataset to help predict future values. In time series forecasting, they are useful because they allow a model to consider previous data points as predictors, helping it understand patterns or dependencies over time.

 - *Capture Temporal Patterns*: They help capture the time-based relationships in the data, like trends and seasonality
- *Enable Supervised Learning*: By transforming a time series into a supervised learning format, you can apply machine learning algorithms that aren’t inherently designed for sequential data

In [14]:
# We want 12 months of lagged data 
lag_number = 12 

# Create the lag features using a for loop
for lag in range(1, lag_number + 1):
    average_rent_data[f"rent_t-{lag}"] = average_rent_data["Average Rent"].shift(lag)
    
# Drop rows with NaN values 
average_rent_data = average_rent_data.dropna() 

# # Display data 
average_rent_data.head()

Unnamed: 0_level_0,Average Rent,rent_t-1,rent_t-2,rent_t-3,rent_t-4,rent_t-5,rent_t-6,rent_t-7,rent_t-8,rent_t-9,rent_t-10,rent_t-11,rent_t-12
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2011-01-01,2403.897727,2386.755556,2440.666667,2508.258824,2562.56962,2419.146341,2391.876543,2329.47619,2328.843373,2275.090909,2248.873563,2257.144444,2306.102273
2011-02-01,2372.119565,2403.897727,2386.755556,2440.666667,2508.258824,2562.56962,2419.146341,2391.876543,2329.47619,2328.843373,2275.090909,2248.873563,2257.144444
2011-03-01,2339.765957,2372.119565,2403.897727,2386.755556,2440.666667,2508.258824,2562.56962,2419.146341,2391.876543,2329.47619,2328.843373,2275.090909,2248.873563
2011-04-01,2366.691489,2339.765957,2372.119565,2403.897727,2386.755556,2440.666667,2508.258824,2562.56962,2419.146341,2391.876543,2329.47619,2328.843373,2275.090909
2011-05-01,2382.305263,2366.691489,2339.765957,2372.119565,2403.897727,2386.755556,2440.666667,2508.258824,2562.56962,2419.146341,2391.876543,2329.47619,2328.843373


Before we perform any predictions, let's perform some EDA to get a better understanding of our data and trends in rent prices. 

In [15]:
# Import visualization library 
import plotly.graph_objects as go 

# Create our canvas 
fig = go.Figure()

# Create the plot itself on the canvas 
fig.add_trace(go.Scatter(x = average_rent_data.index, 
                         y = average_rent_data["Average Rent"],
                         mode = "lines", 
                         name = "Average Rent"))

fig.update_layout(title = "Average Rent Over Time (2010 - 2024)",
                  xaxis_title = "Date", 
                  yaxis_title = "Average Rent",
                  template = "plotly_white")

fig.show()

What are some insights we can get obtain from the plot? 

Next, we will also look at the summary statistics for the data. 

In [16]:
summary = average_rent_data["Average Rent"].describe()
summary

count     165.000000
mean     2643.530593
std       300.674673
min      2336.429577
25%      2468.500000
50%      2524.228261
75%      2677.935714
max      3453.338129
Name: Average Rent, dtype: float64

- Mean rent: $2,643
- Median rent: $2,524
- Standard deviation: $300 (some variation around the mean)
- Range: minimum rent is $2,336 and maximum is $3,453 with a 75% percentile value of $2,677. This suggests that rent has increased over the years, but hasn't fluctuated too much. 

Let's continue with the predictions using the Random Forest Regressor model

##### Random Forest Regressor

- Supervised learning model 
- Good for our predictions because:
    - it handles nonlinear relationships well
    - it works well with structured data
    - it is robust to overfitting with enough data
    - it can give good performance with minimal hyperparameter tuning
    - it can give information on which features are heavily weighted when predicting prices

Some considerations: 
- Lack of Time Awareness: Random forests don’t inherently understand time dependencies; that’s why we manually create lagged features. However, if trends are captured well through these features, a random forest can model the data effectively
- Computationally Intensive: Random forests require more computational power than simpler models like linear regression, but for a moderate-sized dataset, this is manageable

In [18]:
# Import packages for model predictions 
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

In [20]:
# Input features and target variable definition 
X = average_rent_data.drop(columns = ["Average Rent"])
y = average_rent_data["Average Rent"]

The typical train_test_split method to split the data into training and testing variables may not necessarily work for the approach we are trying to go for (using preivous rent prices to predict new prices). 

Since train_test_split can disrupt the temporal order of our data, we will use time-based splitting to create the train and test variables. We will train on the **earliest data** and test on the **most recent data**. 

In [25]:
# We will use the last 24 months as the test set (last 1-2 years)
split_index = -24
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]

# Initialize and train the model
model = RandomForestRegressor(random_state = 42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate model performance
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Mean absolute error: ${mae:.2f}")
print(f"Root mean squared error: ${rmse:.2f}")

Mean absolute error: $141.71
Root mean squared error: $171.38


- MAE: tells us the average dollar amount the predictions deviate from actual rent prices
- RMSE: measures the average magnitude of error between the predicted values and the actual values, with a stronger emphasis on larger errors due to squaring. RMSE tells us how far, on average, our predictions are from the actual rent prices in dollar terms
    - since the average rent is around $2,643, an rmse of $171, is about 6.5% of the average rent
    - having an RMSE under 10% of the target value is generally good

Next, we will use the model to predict rent prices for the next 2-3 years using an iterative forecasting approach. 

- set up conditions: start with the last available data point and use the model to predict the next month's rent 
- iterate for each month: use each new prediction as input for the following month 
- store predictions: store each prediction to track the forecasted rent prices over time 

In [26]:
# Import packages 
import datetime 

# Get the last available data point to generate a prediction for the next 3 years (36 months)
forecast_horizon = 36 
last_date = average_rent_data.index[-1] # Last date in the dataframe 
future_predictions = []
# Select the last row of data from X (the final month in historical data) 
# This is the starting point for forecasting the next month 
# Extract the row as a NumPy array to make it compatible with the model's input requirements
# Reshape the array into a 2D array with 1 row and multiple columns 
# -1 automatically determines the appropriate number of columns based on the number of elements
current_features = X.iloc[-1].values.reshape(1, -1)

In [27]:
# 1. Using the previous month’s input features to make a new prediction
# 2. Updating the features by adding the new prediction and removing the oldest lagged value

# Iterate and predict one month at a time
for i in range(forecast_horizon):
    # Predict rent price 
    next_rent = model.predict(current_features)[0]
    future_predictions.append(next_rent)
    
    # Update features by shifting and adding the new prediction as the most recent lag
    # Take current features, drop the oldest lagged value
    # Convert remaining values to a list 
    # Add the new prediction to the end of the list to ensure that the most recent month's prediction becomes the newest input for the model
    current_features = current_features[0][1:].tolist() + [next_rent] 
    # Convert updated list back to a 2D numpy array to keep it compatible 
    current_features = np.array(current_features).reshape(1, -1)


X does not have valid feature names, but RandomForestRegressor was fitted with feature names


X does not have valid feature names, but RandomForestRegressor was fitted with feature names


X does not have valid feature names, but RandomForestRegressor was fitted with feature names


X does not have valid feature names, but RandomForestRegressor was fitted with feature names


X does not have valid feature names, but RandomForestRegressor was fitted with feature names


X does not have valid feature names, but RandomForestRegressor was fitted with feature names


X does not have valid feature names, but RandomForestRegressor was fitted with feature names


X does not have valid feature names, but RandomForestRegressor was fitted with feature names


X does not have valid feature names, but RandomForestRegressor was fitted with feature names


X does not have valid feature names, but RandomForestRegressor was fitted with feature names


X does not have valid feature names, but RandomFo

In [32]:
# Create dates for the forecasted values 
future_dates = [last_date + datetime.timedelta(days = 30 * (i + 1)) for i in range(forecast_horizon)]

# Convert to a dataframe 
future_price_df = pd.DataFrame({"Date": future_dates, "Predicted Rent": future_predictions})
future_price_df.set_index("Date", inplace = True)

# Display predicted rent for the next 2-3 years 
future_price_df.head()

Unnamed: 0_level_0,Predicted Rent
Date,Unnamed: 1_level_1
2024-10-01,3141.161973
2024-10-31,3141.161973
2024-11-30,3141.161973
2024-12-30,3141.161973
2025-01-29,3141.161973


- **Starting rent (Oct. 2024)**: $3,141.16
- **Ending rent (Aug. 2027)**: $3,141.33

Our model predicted relatively stable values over time. This may indicate that the model is capturing steady trends and not large seasonal variations. Let's take a look at how the predictions stack up against the historical data we have. 

In [38]:
# Create our canvas 
fig = go.Figure()

# Plot the historical data 
fig.add_trace(go.Scatter(x = average_rent_data.index, 
                         y = average_rent_data["Average Rent"],
                         mode = "lines", 
                         name = "Historical Data", 
                         line = dict(color = "blue")))

# Add the forecasted data
fig.add_trace(go.Scatter(x = future_price_df.index,
                         y = future_price_df["Predicted Rent"],
                         mode = "lines",
                         name = "Forecasted Rent Price",
                         line = dict(color = "red", dash = "dash")))
# Update titles 
fig.update_layout(title = "Historical and Forecasted Avg. Rent Prices in NYC",
                  xaxis_title = "Date", 
                  yaxis_title = "Average Rent", 
                  template = "plotly_white",
                  width = 1200,
                  height = 600)
# Display plot 
fig.show()

What are some things that you notice? How could we improve our model? 