# Automated Valuation Model

## Overview
The Government of Manitoba, through the City of Winnipeg, relies on property assessments to determine fair taxation for homeowners and businesses. Property taxes are one of the city’s largest revenue sources, funding essential public services like roads, transit, emergency response, and community programs.

Traditional property assessments are performed by human assessors, a process that is time-consuming, costly, and prone to inconsistency. With hundreds of thousands of parcels across Winnipeg, manually reassessing every property on a frequent basis is impractical, which makes it hard to keep valuations accurate and in line with real estate market trends.

## Why this project?


The main objective of this project is to build and deploy a machine learning model to predict property assessment values in Winnipeg using publicly available data from the City of Winnipeg's Open Data Portal.

An Automated Valuation Model (AVM) addresses these challenges by using statistical and machine learning methods to estimate property values in a consistent, transparent, and scalable way. An AVM enables:

- Fairness & Equity: similar properties are assessed consistently, ensuring fair tax distribution.

- Efficiency: the entire city’s properties can be revalued quickly, reducing manual workload.

- Transparency: model outputs can be explained using features such as lot size, living area, and distance to downtown.

- Adaptability: models can incorporate new data (e.g., neighbourhood trends, spatial features) to reflect market shifts.

By adopting an AVM, Winnipeg can strengthen public trust in its assessment system, improve operational efficiency, and ensure that property taxation remains fair, consistent, and aligned with current market conditions.

## Project Execution


## 1.0 Model Scaffolding

### 1.1 Cloning Repository

In [21]:
 !git pull origin main

remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 8 (delta 5), reused 8 (delta 5), pack-reused 0 (from 0)
Unpacking objects:  87% 

In [22]:
# Start the project by cloning my GitHub repository into this Colab notebook
import os

if not os.path.exists('property-valuation'):
    !git clone https://github.com/BoseOmoniyi/property-valuation.git
    print("✓ Repository cloned")
else:
    print("✓ Repository already exists")

%cd property-valuation
print(f"Current directory: {os.getcwd()}")

Cloning into 'property-valuation'...
remote: Enumerating objects: 28, done.
remote: Counting objects:  78% (22/28)

- Observation:
My GitHub repository has been cloned into the Google Colab session. This gives me direct access to the repository from the notebook and lets me save my work straight back to GitHub.

### 1.2 Install Dependencies

In [23]:
# Install packages
print("Installing dependencies...")
!pip install -q pandas numpy scikit-learn matplotlib seaborn requests xgboost
print("✓ All dependencies installed")

Installing dependencies...
✓ All dependencies installed


### 1.3 Set Random Seeds

In [24]:
# Reproducibility
import random
import numpy as np
import os

RANDOM_STATE = 42
random.seed(RANDOM_STATE)
np.random.seed(RANDOM_STATE)
os.environ['PYTHONHASHSEED'] = str(RANDOM_STATE)

print("=" * 60)
print("REPRODUCIBILITY")
print("=" * 60)
print(f"Random State: {RANDOM_STATE}")
print("All random operations will use this seed.")
print("=" * 60)

REPRODUCIBILITY
Random State: 42
All random operations will use this seed.
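
The effect of the seeding cell can be checked directly. The sketch below is a minimal illustration, not the repository's own helper: the name `seed_everything` is hypothetical (the project's `set_random_seeds` is not shown in this notebook). It reseeds both generators and confirms that the same draws come back. Note that setting `PYTHONHASHSEED` inside a running interpreter only affects child processes started afterwards, because hash randomization for the current process is fixed at startup.

```python
import os
import random

import numpy as np


def seed_everything(seed: int = 42) -> None:
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    # Inherited by subprocesses only; the current interpreter's hash seed
    # was already fixed when it started.
    os.environ["PYTHONHASHSEED"] = str(seed)


def draw_sample() -> tuple:
    """Draw one value from each generator so two runs can be compared."""
    return random.random(), float(np.random.rand())


seed_everything(42)
first = draw_sample()

seed_everything(42)
second = draw_sample()

print(first == second)  # True: reseeding reproduces the same draws
```

Libraries such as scikit-learn and XGBoost additionally accept their own `random_state` argument, so passing `RANDOM_STATE` explicitly to each estimator and split remains necessary.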


### 1.4 Document Environment Information

In [25]:
# Document environment
import sys
import platform
from datetime import datetime

print("=" * 60)
print("ENVIRONMENT")
print("=" * 60)
print(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Python: {sys.version.split()[0]}")
print(f"Platform: {platform.platform()}")
print("=" * 60)

ENVIRONMENT
Date: 2025-10-04 23:03:54
Python: 3.12.11
Platform: Linux-6.6.97+-x86_64-with-glibc2.35
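
Python and platform versions alone may not be enough to reproduce results, since library versions also matter. As a possible extension, the sketch below records the installed version of each package from the install cell using the standard library's `importlib.metadata`. The helper name `collect_versions` is an assumption made for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names taken from the pip install cell in section 1.2
PACKAGES = [
    "pandas", "numpy", "scikit-learn", "matplotlib",
    "seaborn", "requests", "xgboost",
]


def collect_versions(packages):
    """Return a {package: version} mapping, tolerating missing packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = "not installed"
    return versions


package_versions = collect_versions(PACKAGES)
for name, ver in package_versions.items():
    print(f"{name}: {ver}")
```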


### 1.5 Show Configuration Information

In [26]:
# Import config
import sys
sys.path.insert(0, '/content/property-valuation')

from config import *

print("=" * 60)
print("CONFIGURATION")
print("=" * 60)
print(f"\n Reproducibility:")
print(f"  Random State: {RANDOM_STATE}")
print(f"\n API:")
print(f"  URL: {WINNIPEG_API_URL}")
print(f"  Dataset ID: {WINNIPEG_DATASET_ID}")
print(f"\n Data Split:")
print(f"  Test Size: {TEST_SIZE} ({int(TEST_SIZE*100)}%)")
print(f"\n Features:")
print(f"  Target: {TARGET_VARIABLE}")
print(f"  Features: {BASELINE_FEATURES}")
print("=" * 60)

CONFIGURATION

 Reproducibility:
  Random State: 42

 API:
  URL: https://data.winnipeg.ca/resource/d4mq-wa44.json
  Dataset ID: d4mq-wa44

 Data Split:
  Test Size: 0.2 (20%)

 Features:
  Target: Total.Assessed.Value.Num
  Features: ['Total.Living.Area.Num', 'Assessed.Land.Area.Num']


### 1.6 Import Libraries

In [28]:
# Force reimport after overwriting file
import importlib
import sys

# Remove old module from cache
if 'src.data_fetching' in sys.modules:
    del sys.modules['src.data_fetching']

# Now import fresh
from src.data_fetching import fetch_winnipeg_assessment_data, get_dataset_info

print("✓ Functions imported successfully!")

✓ Functions imported successfully!
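
Deleting the entry from `sys.modules` works, and `importlib.reload` is an alternative way to pick up an edited module. The sketch below demonstrates it on a throwaway module written to a temporary directory; `demo_module` and its `GREETING` attribute are stand-ins invented for this example, not part of the project.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a throwaway module that stands in for an edited project file
tmp_dir = Path(tempfile.mkdtemp())
module_path = tmp_dir / "demo_module.py"
module_path.write_text('GREETING = "old"\n')

sys.path.insert(0, str(tmp_dir))
importlib.invalidate_caches()

demo_module = importlib.import_module("demo_module")
print(demo_module.GREETING)  # old

# Simulate overwriting the file (as %%writefile does), then reload it
module_path.write_text('GREETING = "new version"\n')
importlib.invalidate_caches()
demo_module = importlib.reload(demo_module)
print(demo_module.GREETING)  # new version
```

Names imported earlier with `from module import name` still point at the old objects after a reload, so they need to be imported again, which is what the cell above does.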


In [29]:
# Import all libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Import project modules
from src.data_fetching import fetch_winnipeg_assessment_data, get_dataset_info
from src.data_loading import load_data
from src.reproducibility import set_random_seeds

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
%matplotlib inline

print("✓ All libraries imported successfully")

ImportError: cannot import name 'load_data' from 'src.data_loading' (/content/property-valuation/src/data_loading.py)

In [30]:
%%writefile src/data_fetching.py
"""
Fetch data from Winnipeg Open Data Portal API
"""
import pandas as pd
import requests
import json
from pathlib import Path
from datetime import datetime

# Import from config
import sys
sys.path.insert(0, '/content/property-valuation')
from config import WINNIPEG_API_URL

def get_dataset_info(api_url=None):
    """
    Get information about the dataset (row count, columns, etc.)
    """
    if api_url is None:
        api_url = WINNIPEG_API_URL

    print("Fetching dataset information...")

    # Get sample to see columns
    response = requests.get(api_url, params={'$limit': 1})
    response.raise_for_status()
    sample = response.json()

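    # Try to get count
    try:
        count_params = {'$select': 'COUNT(*)'}
        count_response = requests.get(api_url, params=count_params)
        count_response.raise_for_status()
        count_data = count_response.json()
        total_count = int(count_data[0]['COUNT']) if count_data else 'Unknown'
    except (requests.exceptions.RequestException, KeyError, IndexError, TypeError, ValueError):
        # Fall back gracefully if the count query fails or has an unexpected shape
        total_count = 'Unknown'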

    info = {
        'total_records': total_count,
        'columns': list(sample[0].keys()) if sample else [],
        'sample_record': sample[0] if sample else None
    }

    return info


def fetch_all_records(api_url, batch_size=50000):
    """
    Fetch all records from Socrata API with pagination.
    """
    all_data = []
    offset = 0

    print("Fetching all records with pagination...")

    while True:
        params = {
            '$limit': batch_size,
            '$offset': offset,
            # Stable ordering so consecutive pages do not overlap or skip rows
            '$order': ':id'
        }

        print(f"  Fetching batch at offset {offset}...")
        response = requests.get(api_url, params=params)
        response.raise_for_status()

        data = response.json()

        if not data:
            break

        all_data.extend(data)
        offset += len(data)

        print(f"  Total records fetched: {len(all_data):,}")

        if len(data) < batch_size:
            break

    print(f"✓ Completed! Total records: {len(all_data):,}")

    df = pd.DataFrame(all_data)
    return df


def fetch_winnipeg_assessment_data(api_url=None, save_local=True, limit=None):
    """
    Fetch Winnipeg property assessment data from Open Data Portal API.
    """
    if api_url is None:
        api_url = WINNIPEG_API_URL

    print("=" * 60)
    print("FETCHING WINNIPEG ASSESSMENT DATA")
    print("=" * 60)
    print(f"API URL: {api_url}")
    print()

    try:
        if limit is None:
            # Fetch all data with pagination
            df = fetch_all_records(api_url)
        else:
            # Fetch limited records
            print(f"Fetching sample of {limit:,} records...")
            response = requests.get(api_url, params={'$limit': limit})
            response.raise_for_status()
            data = response.json()
            df = pd.DataFrame(data)
            print(f"✓ Successfully fetched {len(df):,} records")

        # Save local copy if requested
        if save_local:
            output_dir = Path('data/raw')
            output_dir.mkdir(parents=True, exist_ok=True)

            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            output_file = output_dir / f'Assessment_Parcels_{timestamp}.csv'

            df.to_csv(output_file, index=False)
            print(f"✓ Saved local copy to: {output_file}")

        print("=" * 60)
        return df

    except requests.exceptions.RequestException as e:
        print(f" Error fetching data: {e}")
        raise
    except json.JSONDecodeError as e:
        print(f" Error parsing JSON response: {e}")
        raise

Overwriting src/data_fetching.py
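
The pagination loop in `fetch_all_records` can be exercised without network access by separating the loop from the HTTP call. The sketch below is a simplified illustration under that assumption: `paginate` and `fake_fetch` are hypothetical names, and the fake rows are made up purely to drive the loop.

```python
from typing import Callable, Dict, List


def paginate(fetch_page: Callable[[int, int], List[Dict]], batch_size: int) -> List[Dict]:
    """Collect pages until one comes back shorter than batch_size."""
    records: List[Dict] = []
    offset = 0
    while True:
        page = fetch_page(offset, batch_size)
        records.extend(page)
        offset += len(page)
        # A short (or empty) page means the source is exhausted
        if len(page) < batch_size:
            break
    return records


# Made-up rows standing in for API records
FAKE_ROWS = [{"roll_number": str(i)} for i in range(7)]


def fake_fetch(offset: int, limit: int) -> List[Dict]:
    """Mimic $offset/$limit by slicing the in-memory list."""
    return FAKE_ROWS[offset:offset + limit]


result = paginate(fake_fetch, batch_size=3)
print(len(result))  # 7
```

With seven rows and a batch size of three, the loop requests pages of 3, 3, and 1 rows and then stops, which mirrors the `len(data) < batch_size` exit condition in the module.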


## 2.0 Dataset - City of Winnipeg Open Data

### 2.1 Retrieving Dataset

In [31]:
# Get dataset information
print("Checking dataset structure...\n")

info = get_dataset_info()

print(f"Dataset Information:")
print(f"  Total Records: {info['total_records']}")
print(f"  Number of Columns: {len(info['columns'])}")

print(f"\nAvailable Columns:")
for i, col in enumerate(info['columns'], 1):
    print(f"  {i:2d}. {col}")

Checking dataset structure...

Fetching dataset information...
Dataset Information:
  Total Records: 244596
  Number of Columns: 35

Available Columns:
   1. roll_number
   2. street_number
   3. street_name
   4. street_type
   5. full_address
   6. neighbourhood_area
   7. market_region
   8. total_living_area
   9. building_type
  10. basement
  11. basement_finish
  12. year_built
  13. rooms
  14. air_conditioning
  15. fire_place
  16. attached_garage
  17. detached_garage
  18. pool
  19. property_use_code
  20. assessed_land_area
  21. property_influences
  22. zoning
  23. total_assessed_value
  24. assessment_date
  25. detail_url
  26. current_assessment_year
  27. property_class_1
  28. status_1
  29. assessed_value_1
  30. multiple_residences
  31. geometry
  32. dwelling_units
  33. centroid_lat
  34. centroid_lon
  35. gisid
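
The configuration in section 1.5 names the target and features in a dotted style (`Total.Assessed.Value.Num`), while the API returns snake_case fields such as `total_assessed_value`. Before modelling, it may help to check that every configured name is actually present. The sketch below is one way to do that; the `API_TO_CONFIG` mapping and the `missing_columns` helper are assumptions for illustration, not something the project defines.

```python
# Values as printed by the configuration cell
TARGET_VARIABLE = "Total.Assessed.Value.Num"
BASELINE_FEATURES = ["Total.Living.Area.Num", "Assessed.Land.Area.Num"]

# Hypothetical mapping from API field names to the configured names
API_TO_CONFIG = {
    "total_living_area": "Total.Living.Area.Num",
    "assessed_land_area": "Assessed.Land.Area.Num",
    "total_assessed_value": "Total.Assessed.Value.Num",
}


def missing_columns(required, available):
    """Return the required names that are absent from the available ones."""
    available = set(available)
    return [name for name in required if name not in available]


# A few of the field names reported by the API
api_columns = ["roll_number", "total_living_area", "assessed_land_area", "total_assessed_value"]
required = BASELINE_FEATURES + [TARGET_VARIABLE]

# Against the raw API names, all configured names are reported missing
print(missing_columns(required, api_columns))

# After renaming with the mapping, nothing is missing
renamed = [API_TO_CONFIG.get(col, col) for col in api_columns]
print(missing_columns(required, renamed))  # []
```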


## 3.0 Exploratory Data Analysis

### 3.1 Fetch Sample Data For Exploration

In [33]:
# Fetch sample data for exploration
print("Fetching sample of 1000 records...")
print("This helps us explore the data structure quickly.\n")

df_sample = fetch_winnipeg_assessment_data(limit=1000, save_local=False)

print(f"\nSuccess: Sample loaded!")
print(f"  Shape: {df_sample.shape[0]:,} rows × {df_sample.shape[1]} columns")
print(f"  Memory: {df_sample.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

Fetching sample of 1000 records...
This helps us explore the data structure quickly.

FETCHING WINNIPEG ASSESSMENT DATA
API URL: https://data.winnipeg.ca/resource/d4mq-wa44.json

Fetching sample of 1,000 records...
✓ Successfully fetched 1,000 records

Success: Sample loaded!
  Shape: 1,000 rows × 45 columns
  Memory: 2.47 MB
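
Assuming the JSON endpoint delivers values as strings (the leading zeros in `roll_number` suggest this), numeric fields need converting before any statistics or modelling. The sketch below shows one possible approach with `pd.to_numeric`; the third `total_living_area` value is an invented bad entry added only to show how `errors="coerce"` behaves.

```python
import pandas as pd

# First two rows use values from the sample; the third living-area value is made up
raw = pd.DataFrame({
    "roll_number": ["01000001000", "01000005500", "01000008000"],
    "total_living_area": ["1313", "4007", "not available"],
})

numeric_columns = ["total_living_area"]

clean = raw.copy()
# Unparseable entries become NaN instead of raising an error
clean[numeric_columns] = clean[numeric_columns].apply(pd.to_numeric, errors="coerce")

print(clean.dtypes)
print(clean["total_living_area"].isna().sum())  # 1
```

Identifier columns such as `roll_number` are deliberately left as strings so that leading zeros are preserved.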


In [36]:
# Display all column names
print("=" * 60)
print("Dataset Columns")
print("=" * 60)
print(f"\nTotal columns: {len(df_sample.columns)}\n")

for i, col in enumerate(df_sample.columns, 1):
    print(f"{i:2d}. {col}")

Dataset Columns

Total columns: 45

 1. roll_number
 2. street_number
 3. street_name
 4. street_type
 5. full_address
 6. neighbourhood_area
 7. market_region
 8. total_living_area
 9. building_type
10. basement
11. basement_finish
12. year_built
13. rooms
14. air_conditioning
15. fire_place
16. attached_garage
17. detached_garage
18. pool
19. property_use_code
20. assessed_land_area
21. property_influences
22. zoning
23. total_assessed_value
24. assessment_date
25. detail_url
26. current_assessment_year
27. property_class_1
28. status_1
29. assessed_value_1
30. multiple_residences
31. geometry
32. dwelling_units
33. centroid_lat
34. centroid_lon
35. gisid
36. property_class_2
37. status_2
38. assessed_value_2
39. property_class_3
40. status_3
41. assessed_value_3
42. water_frontage_measurement
43. sewer_frontage_measurement
44. unit_number
45. number_floors_condo
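
The one-record probe in section 2.1 reported 35 columns, while the 1,000-record sample has 45. One plausible explanation, which the source does not confirm, is that the JSON endpoint omits fields that are empty for a given record, so a single row only reveals the fields it happens to populate. `pd.DataFrame` builds its columns from the union of keys across all records, which the pure-Python sketch below imitates. The records and the `column_union` helper are invented for illustration.

```python
# Made-up records: each one carries only the fields it has values for
records = [
    {"roll_number": "A", "total_living_area": "1313"},
    {"roll_number": "B", "unit_number": "12"},
    {"roll_number": "C", "total_living_area": "900", "water_frontage_measurement": "50"},
]


def column_union(rows):
    """Return every key seen across the rows, in first-seen order."""
    columns = []
    seen = set()
    for row in rows:
        for key in row:
            if key not in seen:
                seen.add(key)
                columns.append(key)
    return columns


print(len(records[0]))             # 2 fields visible from the first record alone
print(len(column_union(records)))  # 4 fields once all records are considered
```

If this explanation holds, column discovery should be based on a reasonably large sample (or the dataset's metadata) rather than on the first record.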


In [37]:
# Show columns with sample values in a nice table
print("=" * 80)
print("COLUMNS AND SAMPLE VALUES")
print("=" * 80)

# Create a transposed view
sample_display = df_sample.head(5).T  # T = transpose

display(sample_display)

COLUMNS AND SAMPLE VALUES


Unnamed: 0,0,1,2,3,4
roll_number,01000001000,01000005500,01000008000,01000008200,01000008400
street_number,1636,1584,1574,1550,1538
street_name,MCCREARY,MCCREARY,MCCREARY,MCCREARY,MCCREARY
street_type,ROAD,ROAD,ROAD,ROAD,ROAD
full_address,1636 MCCREARY ROAD,1584 MCCREARY ROAD,1574 MCCREARY ROAD,1550 MCCREARY ROAD,1538 MCCREARY ROAD
neighbourhood_area,WILKES SOUTH,WILKES SOUTH,WILKES SOUTH,WILKES SOUTH,WILKES SOUTH
market_region,"6, CHARLESWOOD","6, CHARLESWOOD","6, CHARLESWOOD","6, CHARLESWOOD","6, CHARLESWOOD"
total_living_area,1313,4007,1052,3120,1510
building_type,ONE STOREY,TWO STOREY,ONE STOREY,ONE STOREY,ONE STOREY
basement,Yes,Yes,No,Yes,Yes
