# 📊 Canada Rent Regression Analysis using AutoGluon

# 🏘️ Canada Rent Dataset (1987–2024)

**This dataset provides historical rental data for various cities and provinces across Canada from 1987 to 2024.** It includes details about rent prices, unit types, and geographic locations, which can be valuable for housing market analysis, rental trend forecasting, and urban planning studies.

## 🧾 Dataset Schema Overview

| Feature            | Data Type | Description                                                 |
|--------------------|-----------|-------------------------------------------------------------|
| **Province**        | object    | Canadian province where the rental unit is located          |
| **City**            | object    | City within the province                                    |
| **Year**            | int64     | Year of the rent data (from 1987 to 2024)                   |
| **AverageRent**     | int64     | Average monthly rent price in CAD                           |
| **UnitType**        | object    | Size category of the rental unit (e.g., Two bedroom units)  |
| **UnitDescription** | object    | Structure type of the unit (e.g., Apartment structures of six units and over) |


In [1]:
import pandas as pd
from autogluon.tabular import TabularPredictor

In [2]:
#Loading the dataset
df = pd.read_csv('Canada_Rent_1987-2024_NO ZEROS.csv', encoding = 'latin1')
df.head()

Unnamed: 0,Province,City,Year,AverageRent,UnitType,UnitDescription
0,Newfoundland and Labrador,Corner Brook,1987,480,Two bedroom units,Apartment structures of six units and over
1,Newfoundland and Labrador,Gander,1987,370,One bedroom units,Apartment structures of six units and over
2,Newfoundland and Labrador,Gander,1987,414,Two bedroom units,Apartment structures of six units and over
3,Newfoundland and Labrador,Gander,1987,414,Three bedroom units,Apartment structures of six units and over
4,Newfoundland and Labrador,Labrador City,1987,254,One bedroom units,Apartment structures of six units and over


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69532 entries, 0 to 69531
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Province         69532 non-null  object
 1   City             69532 non-null  object
 2   Year             69532 non-null  int64 
 3   AverageRent      69532 non-null  int64 
 4   UnitType         69532 non-null  object
 5   UnitDescription  69532 non-null  object
dtypes: int64(2), object(4)
memory usage: 3.2+ MB


In [4]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Year,69532.0,2007.715412,9.921874,1987.0,1999.0,2008.0,2016.0,2024.0
AverageRent,69532.0,680.042757,302.502535,169.0,464.0,610.0,820.0,2961.0


## 📊 Exploratory Data Analysis (EDA) Report

For a detailed overview of the dataset, distributions, correlations, and other insights, check out the full EDA report here:

🔗 [Canada Rent EDA Report](https://kcracks.github.io/EDA_Reports/ydata/Canada_Report.html)


In [5]:
#Defining the target variable
target = 'AverageRent'
#Dropping rows where target is missing (if any)
df = df.dropna(subset=[target])

In [6]:
#Splitting the dataset into training (80%) and test (20%) sets
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)
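
As a quick sanity check on the 80/20 split, the expected partition sizes can be worked out from the 69,532 rows reported by `df.info()`. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows, which is my understanding of scikit-learn's behaviour; the variable names are illustrative only.

```python
import math

n_rows = 69532      # row count reported by df.info()
test_size = 0.2     # fraction passed to train_test_split

# Assumption: a fractional test_size is rounded up to a whole number of rows,
# and the training set receives whatever remains.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(f"train rows: {n_train}, test rows: {n_test}")
```

Under that assumption the test set has 13,907 rows, which agrees with the length of the prediction Series printed later in the notebook.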

In [7]:
#Training the AutoGluon regressor (no preset given, so it defaults to 'medium')
predictor = TabularPredictor(label=target).fit(train_data)

No path specified. Models will be saved in: "AutogluonModels/ag-20250412_032320"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.2
Python Version:     3.10.16
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 24.3.0: Thu Jan  2 20:23:36 PST 2025; root:xnu-11215.81.4~3/RELEASE_ARM64_T8112
CPU Count:          8
Memory Avail:       1.33 GB / 8.00 GB (16.6%)
Disk Space Avail:   13.88 GB / 228.27 GB (6.1%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and b

[1000]	valid_set's rmse: 46.1162
[2000]	valid_set's rmse: 43.4778
[3000]	valid_set's rmse: 42.3414
[4000]	valid_set's rmse: 41.5979
[5000]	valid_set's rmse: 40.967
[6000]	valid_set's rmse: 40.3216
[7000]	valid_set's rmse: 40.0321
[8000]	valid_set's rmse: 39.6645
[9000]	valid_set's rmse: 39.3521
[10000]	valid_set's rmse: 39.0907


	-39.0822	 = Validation score   (-root_mean_squared_error)
	19.98s	 = Training   runtime
	0.42s	 = Validation runtime
Fitting model: LightGBM ...


[1000]	valid_set's rmse: 39.8841
[2000]	valid_set's rmse: 37.2184
[3000]	valid_set's rmse: 35.4499
[4000]	valid_set's rmse: 34.6497
[5000]	valid_set's rmse: 34.1509
[6000]	valid_set's rmse: 33.7314
[7000]	valid_set's rmse: 33.39
[8000]	valid_set's rmse: 33.0661
[9000]	valid_set's rmse: 32.8765
[10000]	valid_set's rmse: 32.6826


	-32.6604	 = Validation score   (-root_mean_squared_error)
	9.39s	 = Training   runtime
	0.24s	 = Validation runtime
Fitting model: RandomForestMSE ...
	-49.0845	 = Validation score   (-root_mean_squared_error)
	1.1s	 = Training   runtime
	0.03s	 = Validation runtime
Fitting model: CatBoost ...
	-48.6969	 = Validation score   (-root_mean_squared_error)
	76.48s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: ExtraTreesMSE ...
	-94.553	 = Validation score   (-root_mean_squared_error)
	0.6s	 = Training   runtime
	0.03s	 = Validation runtime
Fitting model: NeuralNetFastAI ...
	-45.852	 = Validation score   (-root_mean_squared_error)
	20.4s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: XGBoost ...
	-65.343	 = Validation score   (-root_mean_squared_error)
	7.82s	 = Training   runtime
	0.04s	 = Validation runtime
Fitting model: NeuralNetTorch ...
	-79.0302	 = Validation score   (-root_mean_squared_error)
	74.17s	 = Training   runtime
	0.03s	 = Validation 

[1000]	valid_set's rmse: 36.9196
[2000]	valid_set's rmse: 34.7041
[3000]	valid_set's rmse: 34.0429
[4000]	valid_set's rmse: 33.4126
[5000]	valid_set's rmse: 33.1703
[6000]	valid_set's rmse: 33.0204
[7000]	valid_set's rmse: 32.921
[8000]	valid_set's rmse: 32.8258
[9000]	valid_set's rmse: 32.7069
[10000]	valid_set's rmse: 32.6742


	-32.6742	 = Validation score   (-root_mean_squared_error)
	35.86s	 = Training   runtime
	0.78s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
	Ensemble Weights: {'LightGBM': 0.474, 'LightGBMLarge': 0.474, 'RandomForestMSE': 0.053}
	-32.1619	 = Validation score   (-root_mean_squared_error)
	0.02s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 250.99s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 2374.4 rows/s (2500 batch size)
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/Users/kamani/GitHub/KBIA_Portfolio/Canada Rent Project/AutoGluon Analysis - Canada Rent/AutogluonModels/ag-20250412_032320")
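
The training log reports the weights of `WeightedEnsemble_L2`. A minimal sketch of how such weights could combine base-model outputs is shown below, assuming the ensemble prediction is a weighted average of the base models' predictions. The base predictions are made-up numbers, and `blend` is a hypothetical helper, not part of AutoGluon.

```python
# Weights copied from the training log (rounded there to three decimals).
ensemble_weights = {
    "LightGBM": 0.474,
    "LightGBMLarge": 0.474,
    "RandomForestMSE": 0.053,
}

# Hypothetical base-model predictions (CAD per month) for a single row;
# these numbers are invented for illustration.
base_predictions = {
    "LightGBM": 1000.0,
    "LightGBMLarge": 1010.0,
    "RandomForestMSE": 990.0,
}


def blend(weights, predictions):
    """Return the weighted average of the base-model predictions."""
    total = sum(weights.values())  # 1.001 here because the log rounds weights
    return sum(weights[name] * predictions[name] for name in weights) / total


blended = blend(ensemble_weights, base_predictions)
print(f"blended prediction: {blended:.2f} CAD")
```

Dividing by the weight total keeps the result a true average even though the logged weights sum to slightly more than 1 after rounding.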


In [10]:
#Leaderboard to evaluate the performance of the models
predictor.leaderboard(test_data, silent=True)

  warn("load_learner` uses Python's insecure pickle module, which can execute malicious arbitrary code when loading. Only load files you trust.\nIf you only need to load model weights and optimizer state, use the safe `Learner.load` instead.")


Unnamed: 0,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,-29.307713,-32.161863,root_mean_squared_error,6.105472,1.052907,46.358309,0.014262,0.001044,0.015482,2,True,12
1,LightGBMLarge,-29.601958,-32.674194,root_mean_squared_error,4.484803,0.780507,35.856439,4.484803,0.780507,35.856439,1,True,11
2,LightGBM,-30.211416,-32.660355,root_mean_squared_error,1.474782,0.244432,9.390929,1.474782,0.244432,9.390929,1,True,4
3,LightGBMXT,-37.239541,-39.082189,root_mean_squared_error,2.090789,0.419215,19.97643,2.090789,0.419215,19.97643,1,True,3
4,NeuralNetFastAI,-44.438516,-45.852027,root_mean_squared_error,0.103298,0.020143,20.402468,0.103298,0.020143,20.402468,1,True,8
5,CatBoost,-44.903626,-48.696869,root_mean_squared_error,0.094519,0.015704,76.480107,0.094519,0.015704,76.480107,1,True,6
6,RandomForestMSE,-47.339443,-49.084521,root_mean_squared_error,0.131624,0.026924,1.095459,0.131624,0.026924,1.095459,1,True,5
7,XGBoost,-67.36086,-65.342968,root_mean_squared_error,0.21422,0.041725,7.817703,0.21422,0.041725,7.817703,1,True,9
8,NeuralNetTorch,-81.172827,-79.030206,root_mean_squared_error,0.042305,0.031235,74.173968,0.042305,0.031235,74.173968,1,True,10
9,ExtraTreesMSE,-93.876769,-94.553022,root_mean_squared_error,0.198881,0.027275,0.601269,0.198881,0.027275,0.601269,1,True,7


In [11]:
#Model performance evaluation
performance = predictor.evaluate(test_data)

In [12]:
#Displaying the performance metrics
performance_df = pd.DataFrame(list(performance.items()), columns=['Metric', 'Value'])
performance_df

Unnamed: 0,Metric,Value
0,root_mean_squared_error,-29.307713
1,mean_squared_error,-858.942014
2,mean_absolute_error,-14.565557
3,r2,0.990771
4,pearsonr,0.995375
5,median_absolute_error,-7.071899
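
The error metrics above are negative because AutoGluon reports every metric in a higher-is-better form, so error-type metrics are negated. The sketch below flips them back to conventional positive values and checks that RMSE is the square root of MSE. The dictionary is copied by hand from the output above; the `_error` suffix rule is a convenience for this example, not an AutoGluon feature.

```python
import math

# Values copied from the evaluation output above.
performance = {
    "root_mean_squared_error": -29.307713,
    "mean_squared_error": -858.942014,
    "mean_absolute_error": -14.565557,
    "r2": 0.990771,
    "pearsonr": 0.995375,
    "median_absolute_error": -7.071899,
}

# Metrics that are errors (lower is better) and therefore appear negated.
error_metrics = {name for name in performance if name.endswith("_error")}

# Flip the sign of error metrics only; leave r2 and pearsonr untouched.
positive = {
    name: (-value if name in error_metrics else value)
    for name, value in performance.items()
}

# Consistency check: RMSE should be the square root of MSE.
rmse_from_mse = math.sqrt(positive["mean_squared_error"])
print(positive)
print(f"sqrt(MSE) = {rmse_from_mse:.4f}")
```

The positive values are the ones used in the summary table in the conclusion.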


In [13]:
#Getting Feature Importance
feature_importance = predictor.feature_importance(test_data)
print(feature_importance)

Computing feature importance via permutation shuffling for 5 features using 5000 rows with 5 shuffle sets...
	75.13s	= Expected runtime (15.03s per shuffle set)
	64.84s	= Actual runtime (Completed 5 of 5 shuffle sets)


                 importance    stddev       p_value  n    p99_high     p99_low
Year             281.301333  5.341944  1.559849e-08  5  292.300469  270.302197
City             214.391148  2.675186  2.908569e-09  5  219.899392  208.882904
UnitType         166.271759  0.827469  7.360357e-11  5  167.975528  164.567990
Province          38.963494  0.989955  4.996183e-08  5   41.001826   36.925163
UnitDescription   20.867028  1.348818  2.083227e-06  5   23.644263   18.089793
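
The log above says the importances come from permutation shuffling. A minimal, standard-library sketch of that idea follows: shuffle one feature column, re-score the model, and record how much the RMSE rises. It uses synthetic data and a stand-in `model` function (both hypothetical), so it illustrates the technique rather than reproducing AutoGluon's implementation.

```python
import math
import random

random.seed(0)

# Synthetic data (not the rent dataset): y depends on x1 only; x2 is noise.
n = 500
x1 = [random.uniform(0, 10) for _ in range(n)]
x2 = [random.uniform(0, 10) for _ in range(n)]
y = [3.0 * a + random.gauss(0, 0.5) for a in x1]


def model(a, b):
    """Stand-in for an already-fitted model; it ignores the second feature."""
    return 3.0 * a


def rmse(col1, col2, target):
    """Root mean squared error of the stand-in model on the given columns."""
    sq = [(model(a, b) - t) ** 2 for a, b, t in zip(col1, col2, target)]
    return math.sqrt(sum(sq) / len(sq))


def permutation_importance(which, n_shuffles=5):
    """Average RMSE increase after shuffling one feature column."""
    baseline = rmse(x1, x2, y)
    increases = []
    for _ in range(n_shuffles):
        shuffled = x1[:] if which == "x1" else x2[:]
        random.shuffle(shuffled)
        if which == "x1":
            score = rmse(shuffled, x2, y)
        else:
            score = rmse(x1, shuffled, y)
        increases.append(score - baseline)
    return sum(increases) / n_shuffles


importance_x1 = permutation_importance("x1")
importance_x2 = permutation_importance("x2")
print(f"x1: {importance_x1:.3f}, x2: {importance_x2:.3f}")
```

Read the same way, and assuming the importance column is expressed in units of the evaluation metric, the value of about 281 for **Year** would mean that shuffling it raised the RMSE by roughly 281 CAD on the sampled rows.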


In [15]:
#Predicting on the held-out test set (target column dropped)
y_pred = predictor.predict(test_data.drop(columns=[target]))
print("\nPredictions:")
print(y_pred)


Predictions:
65499    1151.758789
51253     555.073181
26072     535.379333
56611     721.856689
53923    1301.127441
            ...     
19193     803.646667
17539     328.081848
34271     623.530518
64481    1211.135742
16778     374.108521
Name: AverageRent, Length: 13907, dtype: float32


# 🧠 Conclusion and Analysis of Results

## 🔍 Analysis of Model Performance

| Metric | Value |
|:------|------:|
| Root Mean Squared Error (RMSE), CAD | 29.31 |
| Mean Squared Error (MSE), CAD² | 858.94 |
| Mean Absolute Error (MAE), CAD | 14.57 |
| R² Score | 0.9908 |
| Pearson Correlation (r) | 0.9954 |
| Median Absolute Error (MedAE), CAD | 7.07 |

## 📝 Interpretation

### Scores
- **High R² Score (0.9908)** indicates that the model explains approximately 99% of the variance in average rent prices on the held-out test set.
- **Low RMSE (29.31 CAD)** and **low MAE (14.57 CAD)** are small relative to the dataset's mean rent of about 680 CAD (roughly 4% and 2%, respectively), indicating high accuracy on the test set.
- **Pearson Correlation (0.9954)** shows a very strong positive linear relationship between actual and predicted rents.
- **Median Absolute Error (7.07 CAD)** is about half the MAE: half of the test predictions fall within roughly 7 CAD of the actual rent, while a smaller number of larger misses pull the mean error up.
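
To put the error metrics in context, they can be expressed as a share of the mean rent reported by `df.describe()`. The sketch below does that with figures copied by hand from the notebook output. It also includes a rough cross-check of R² that assumes the full-dataset variance is a reasonable stand-in for the test-set variance, so that result is only approximate.

```python
# Figures copied from the notebook output.
mean_rent = 680.042757      # mean of AverageRent from df.describe()
std_rent = 302.502535       # standard deviation of AverageRent (full dataset)
rmse = 29.307713
mae = 14.565557
medae = 7.071899
mse = 858.942014

# Errors expressed as a share of the mean monthly rent.
relative = {
    "RMSE": rmse / mean_rent,
    "MAE": mae / mean_rent,
    "MedAE": medae / mean_rent,
}
for name, share in relative.items():
    print(f"{name}: {share:.1%} of mean rent")

# Rough cross-check of R^2 = 1 - MSE / Var(y). Assumption: the full-dataset
# variance stands in for the test-set variance, so the match is approximate.
r2_approx = 1 - mse / std_rent ** 2
print(f"approximate R^2: {r2_approx:.4f}")
```

The approximate R² lands close to the reported 0.9908, which suggests the error metrics and the variance of the target are mutually consistent.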

### Features
- Based on permutation importance, **Year** is the most important feature (importance ≈ 281): shuffling it degrades test performance the most, which confirms that rent prices are highly time-sensitive.
- **City** (≈ 214) and **UnitType** (≈ 166) also significantly influence rent, highlighting the importance of location and unit classification.
- **Province** (≈ 39) and **UnitDescription** (≈ 21) contribute to a lesser extent but remain statistically significant (all p-values are well below 0.01).

## ✅ Final Thoughts

🍁 **This model achieves outstanding predictive performance on the held-out test set and can support data-driven pricing strategies across Canadian cities**, such as rental price forecasting and market analysis, with a low risk of large errors. 🍁