# Phase 4: Model Training & Forecasting
**Project:** National Rent Intelligence Engine
**Goal:** Train an XGBoost model to predict 2026 Rents based on 2025 market pressure signals.

**The Workflow:**
1.  **Train:** Fit the model on historical patterns (2021–2024).
2.  **Test:** Verify accuracy on unseen data.
3.  **Forecast:** Predict the "Unknown" (2026) using the latest 2025 data.

In [6]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
import matplotlib.pyplot as plt
import os

# Paths
PROCESSED_PATH = "../data/processed"
OUTPUT_PATH = "../output"
os.makedirs(OUTPUT_PATH, exist_ok=True)

# Load Training Data
df_train = pd.read_csv(f"{PROCESSED_PATH}/national_training_data.csv")
print(f"✅ Training Data Loaded: {len(df_train)} rows")

✅ Training Data Loaded: 153 rows


## Prepare the Training Matrix
We separate the **Features (X)** from the **Target (y)**.
* **X (Inputs):** `Average rent ($)_Lag1`, `Turnover_Rate`, `Cap_Rate`, `Student_Supply_Mismatch`, etc.
* **y (Output):** `Target_Next_Year_Growth` (The % change we want to predict).
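For readers who have not seen the earlier phases, here is a minimal sketch of how a lag feature and a next-year growth target can be derived from a city/year panel. The toy data and the exact construction are assumptions for illustration; the actual `national_training_data.csv` may have been built differently.

```python
import pandas as pd

# Hypothetical toy panel: two cities, four years of average rent.
toy = pd.DataFrame({
    "City": ["A"] * 4 + ["B"] * 4,
    "Year": [2021, 2022, 2023, 2024] * 2,
    "Average rent ($)": [1000.0, 1050.0, 1100.0, 1210.0,
                         2000.0, 2000.0, 2100.0, 2205.0],
})

# Sort so that shifting within a city moves along the time axis.
toy = toy.sort_values(["City", "Year"]).reset_index(drop=True)
rent = toy.groupby("City")["Average rent ($)"]

# Lag feature: last year's rent within the same city.
toy["Average rent ($)_Lag1"] = rent.shift(1)

# Target: percent change from this year's rent to next year's rent.
next_rent = rent.shift(-1)
toy["Target_Next_Year_Growth"] = (next_rent / toy["Average rent ($)"] - 1) * 100

# The latest year has no known "next year", so it cannot be used for training.
train_rows = toy.dropna(subset=["Target_Next_Year_Growth"])
print(train_rows)
```

Under this construction, the latest year of each city has no target, which is why the 2025 rows are kept in a separate forecast input file rather than in the training set.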

In [7]:
# Define the Features the model is allowed to see
# We EXCLUDE columns that are "future knowledge" or text (like City Name)
features = [
    'Turnover_Rate', 'Cap_Rate', 'Price_to_Rent_Ratio',
    'Average rent ($)_Lag1', 'Average rent ($)_Lag2', 'Average rent ($)_Growth',
    'Turnover_Rate_Lag1', 'Turnover_Rate_Growth',
    'Total_Units_Growth', 'Student_Supply_Mismatch'
]
target = 'Target_Next_Year_Growth'

X = df_train[features]
y = df_train[target]

# Split: 80% for Training, 20% for Testing (to check accuracy)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training Features: {X_train.shape}")
print(f"Testing Features: {X_test.shape}")

Training Features: (122, 10)
Testing Features: (31, 10)
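A random 80/20 split mixes years, so the test set can contain rows from the same period the model trained on. As an alternative check, the sketch below holds out the most recent year instead. It assumes the training data carries a `Year` column, which the feature list above does not confirm; `time_based_split` is a hypothetical helper, not part of scikit-learn.

```python
import pandas as pd

def time_based_split(df, feature_cols, target_col, year_col="Year", test_years=1):
    """Hold out the most recent `test_years` years as the test set."""
    years = sorted(df[year_col].unique())
    cutoff = years[-test_years]  # first year that belongs to the test set
    train = df[df[year_col] < cutoff]
    test = df[df[year_col] >= cutoff]
    return train[feature_cols], test[feature_cols], train[target_col], test[target_col]

# Hypothetical toy data: two markets observed over four years.
toy = pd.DataFrame({
    "Year": [2021, 2021, 2022, 2022, 2023, 2023, 2024, 2024],
    "Cap_Rate": [4.1, 5.0, 4.2, 5.1, 4.0, 5.2, 3.9, 5.3],
    "Target_Next_Year_Growth": [3.0, 5.5, 4.0, 6.1, 2.5, 7.0, 3.2, 6.4],
})

X_tr, X_te, y_tr, y_te = time_based_split(toy, ["Cap_Rate"], "Target_Next_Year_Growth")
print(X_tr.shape, X_te.shape)
```

A score measured this way is closer to the real task, which is predicting a year the model has never seen.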


## Train the XGBoost Model
We use a **Regressor** model because we are predicting a continuous number (percentage growth).

In [8]:
# ---------------------------------------------------------
# STRATEGY A: ROBUST MODEL (Prevents Overfitting)
# ---------------------------------------------------------
# We reduce the model's complexity so it focuses on Big Trends, not noise.

model = xgb.XGBRegressor(
    n_estimators=50,     # REDUCED: 1000 -> 50 trees (Stop it from memorizing)
    max_depth=3,         # REDUCED: 5 -> 3 (Keep decision logic simple)
    learning_rate=0.1,   # INCREASED: Learn faster, stop sooner
    subsample=0.8,       # Randomness helps it generalize
    colsample_bytree=0.8,# Only look at 80% of features per tree
    random_state=42
)

# Train
print("🤖 Training Robust Model...")
model.fit(X_train, y_train)

# Evaluate
predictions = model.predict(X_test)
train_preds = model.predict(X_train)

mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
train_r2 = r2_score(y_train, train_preds)

print(f"\n🎯 Model Performance:")
print(f"   -> Training Score (R2): {train_r2:.3f} (A large gap above the test score signals overfitting)")
print(f"   -> Test Score (R2): {r2:.3f} (Must be > 0.0 to beat a predict-the-mean baseline)")
print(f"   -> Average Error: +/- {mae:.2f}%")

# Sanity Check
comparison = pd.DataFrame({'Actual_Growth_%': y_test, 'Predicted_Growth_%': predictions})
print("\n--- Reality Check (Test Set Sample) ---")
print(comparison.head(5))

🤖 Training Robust Model...

🎯 Model Performance:
   -> Training Score (R2): 0.924 (A large gap above the test score signals overfitting)
   -> Test Score (R2): 0.662 (Must be > 0.0 to beat a predict-the-mean baseline)
   -> Average Error: +/- 2.37%

--- Reality Check (Test Set Sample) ---
     Actual_Growth_%  Predicted_Growth_%
84          0.000000            3.915250
86          0.000000           -0.558284
97          2.583026            2.648124
115         4.673496            5.248118
29         12.447844            7.301955
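The 0.0 threshold on the test score follows from the definition of R2: it compares the model's squared error with that of a baseline that always predicts the mean of the actual values. The sketch below computes R2 and MAE by hand with NumPy, using the five sample rows printed above. The helper names are illustrative, and five rows are far too few to reproduce the full test-set scores.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, where SS_tot is the error of predicting the mean."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mae_manual(y_true, y_pred):
    """Mean absolute error, in the same units as the target (percentage points)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# The five rows shown in the "Reality Check" output above.
actual = np.array([0.0, 0.0, 2.583026, 4.673496, 12.447844])
predicted = np.array([3.915250, -0.558284, 2.648124, 5.248118, 7.301955])

# Baseline that ignores all features and always predicts the mean.
baseline = np.full_like(actual, actual.mean())

print("R2 of mean baseline:", r2_manual(actual, baseline))
print("R2 of model (5-row sample):", r2_manual(actual, predicted))
print("MAE of model (5-row sample):", mae_manual(actual, predicted))
```

A negative R2 would mean the model does worse than ignoring every feature, which is why any score at or below zero makes the forecast unusable.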


## The 2026 Forecast & Export

In [None]:
# ---------------------------------------------------------
# THE 2026 FORECAST & EXPORT 
# ---------------------------------------------------------
print("🚀 Starting 2026 Forecast Generation...")

# 1. Load the "Exam Paper" (The 2025 Data)
# We need to predict next-year (2026) rent growth for these rows
df_forecast_input = pd.read_csv(f"{PROCESSED_PATH}/national_forecast_input.csv")

# 2. Select the Exact Same Features used in Training
# (Indexing with the training feature list fails fast with a KeyError if a column is missing)
X_future = df_forecast_input[features]

# 3. Ask the AI to Predict
# It returns the "Projected Growth %"
df_forecast_input['Predicted_Growth_2026_Pct'] = model.predict(X_future)

# 4. Calculate the Final Dollar Amount
# Formula: Rent_2026 = Rent_2025 * (1 + Growth%)
df_forecast_input['Forecast_Rent_2026'] = df_forecast_input['Average rent ($)'] * (1 + (df_forecast_input['Predicted_Growth_2026_Pct'] / 100))

# 5. Create the Final Report Table
final_report = df_forecast_input[[
    'City', 'Year', 'Average rent ($)', 
    'Predicted_Growth_2026_Pct', 'Forecast_Rent_2026',
    'Cap_Rate', 'Student_Supply_Mismatch'
]].copy()

# Rename for clarity
final_report.rename(columns={'Average rent ($)': 'Current_Rent_2025'}, inplace=True)

# Sort: Winners at the top (Highest Growth)
final_report = final_report.sort_values(by='Predicted_Growth_2026_Pct', ascending=False)

# 6. Save to Output
output_file = f"{OUTPUT_PATH}/Final_2026_Rent_Forecast.csv"
final_report.to_csv(output_file, index=False)

print(f"✅ DONE! Forecast saved to: {output_file}")
print("\n--- 🏆 TOP 5 CITIES FOR 2026 GROWTH ---")
display(final_report.head(5))

🚀 Starting 2026 Forecast Generation...
✅ DONE! Forecast saved to: ../output/Final_2026_Rent_Forecast.csv

--- 🏆 TOP 5 CITIES FOR 2026 GROWTH ---


|    | City | Year | Current_Rent_2025 | Predicted_Growth_2026_Pct | Forecast_Rent_2026 | Cap_Rate | Student_Supply_Mismatch |
|---:|:---|---:|---:|---:|---:|---:|---:|
| 6 | Ottawa | 2025 | 1757.0 | 15.205829 | 2024.166334 | 5.240974 | 0.0 |
| 14 | Windsor | 2025 | 1451.0 | 8.931981 | 1580.603066 | 4.445239 | NaN |
| 5 | Montreal | 2025 | 1346.0 | 8.208922 | 1456.492044 | 3.797195 | 0.0 |
| 13 | Victoria | 2025 | 2120.0 | 7.588059 | 2280.866718 | 4.614129 | -3.936073 |
| 0 | Calgary | 2025 | 1914.0 | 7.379888 | 2055.251085 | 6.976207 | -11.01327 |
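As a worked check of the dollar conversion in step 4, the sketch below applies `Rent_2026 = Rent_2025 * (1 + Growth% / 100)` to the Ottawa row from the table. `forecast_rent` is a hypothetical helper written for this illustration, not a function from the notebook.

```python
def forecast_rent(current_rent, growth_pct):
    """Convert a predicted percentage growth into a dollar forecast."""
    return current_rent * (1 + growth_pct / 100)

# Ottawa: 2025 rent of $1757.0 and predicted growth of 15.205829%.
ottawa_2026 = forecast_rent(1757.0, 15.205829)
print(f"Ottawa 2026 forecast: ${ottawa_2026:.2f}")
```

The result agrees with the `Forecast_Rent_2026` column to the cent. Differences further down the decimals are expected, because the table shows the predicted growth rounded to six places.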


In [None]:
# ---------------------------------------------------------
# MARKET INTELLIGENCE REPORT GENERATOR 
# ---------------------------------------------------------
# This script reads your forecast and writes a text summary for stakeholders.

top_3 = final_report.head(3)
bottom_1 = final_report.tail(1)

print("="*60)
print("       🇨🇦 NATIONAL RENT INTELLIGENCE REPORT (2026)       ")
print("="*60)
print(f"Generated by: XGBoost Predictive Engine | Date: {pd.Timestamp.now().date()}\n")

print("EXECUTIVE SUMMARY")
print("-" * 20)
print(f"The model has analyzed {len(final_report)} major rental markets.")
print(f"The highest projected growth is in {top_3.iloc[0]['City']} ({top_3.iloc[0]['Predicted_Growth_2026_Pct']:.1f}%).")
print(f"The national average rent growth forecast is {final_report['Predicted_Growth_2026_Pct'].mean():.1f}%.\n")

print("🏆 TOP INVESTMENT OPPORTUNITIES")
print("-" * 30)
# Rank by position in the sorted report, not by the original DataFrame index
for rank, (_, row) in enumerate(top_3.iterrows(), start=1):
    print(f"#{rank}: {row['City'].upper()}")
    print(f"   • 2025 Rent: ${row['Current_Rent_2025']:.0f}")
    print(f"   • 2026 Forecast: ${row['Forecast_Rent_2026']:.0f} (+{row['Predicted_Growth_2026_Pct']:.1f}%)")
    print(f"   • Driver: Cap Rate is {row['Cap_Rate']:.1f}% (Yield)")
    print(f"   • Pressure: Student/Supply Mismatch Score: {row['Student_Supply_Mismatch']:.1f}")
    print("")

print("⚠️ MARKET WATCH (LOWEST GROWTH)")
print("-" * 30)
row = bottom_1.iloc[0]
print(f"The slowest market is projected to be {row['City']} ")
print(f"with growth of just {row['Predicted_Growth_2026_Pct']:.1f}%.\n")

print("="*60)
print("END OF REPORT")

============================================================
       🇨🇦 NATIONAL RENT INTELLIGENCE REPORT (2026)       
============================================================
Generated by: XGBoost Predictive Engine | Date: 2026-02-04

EXECUTIVE SUMMARY
--------------------
The model has analyzed 16 major rental markets.
The highest projected growth is in Ottawa (15.2%).
The national average rent growth forecast is 6.2%.

🏆 TOP INVESTMENT OPPORTUNITIES
------------------------------
#1: OTTAWA
   • 2025 Rent: $1757
   • 2026 Forecast: $2024 (+15.2%)
   • Driver: Cap Rate is 5.2% (Yield)
   • Pressure: Student/Supply Mismatch Score: 0.0

#2: WINDSOR
   • 2025 Rent: $1451
   • 2026 Forecast: $1581 (+8.9%)
   • Driver: Cap Rate is 4.4% (Yield)
   • Pressure: Student/Supply Mismatch Score: nan

#3: MONTREAL
   • 2025 Rent: $1346
   • 2026 Forecast: $1456 (+8.2%)
   • Driver: Cap Rate is 3.8% (Yield)
   • Pressure: Student/Supply Mismatch Score: 0.0

⚠️ MARKET WATCH (LOWEST GROWTH)
------------------------------
The slowest market is projected to be London 
with growth