# Predicting Housing Prices with Multiple Linear Regression

In this Jupyter notebook, I explore predicting housing prices using Multiple Linear Regression (MLR).

This technique, a cornerstone of machine learning, allows us to forecast a target variable by analyzing multiple input features. For this project, I'll be predicting the average absorbed unit cost of residential properties. The term 'absorbed' refers to units that have been sold after the completion of construction.
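As a rough sketch, a multiple linear regression with three inputs has the form below. The symbols are illustrative labels chosen here, not notation from the data source: $x_1$ stands for units completed, $x_2$ for units absorbed, $x_3$ for the composite HPI, and $\hat{y}$ for the predicted average absorbed price.

$$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3
$$

Ordinary least squares, which is what scikit-learn's `LinearRegression` uses, picks the intercept $\beta_0$ and the coefficients $\beta_1, \beta_2, \beta_3$ that minimize the sum of squared differences between $\hat{y}$ and the observed prices.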

---
## Data and Objective
My approach will involve the following steps:

   - Examine the number of residential units completed each month. Clean the relevant data and prepare it for the regression model.

   - Next, analyze the number of absorbed residential units each month. Similar to the first step, this data will be prepared and cleaned.

   - Subsequently, I'll prepare the average absorption price of single-family/attached homes. This is the target variable that the model is fitted against.

   - Lastly, I'll bring in the Housing Price Index (HPI), which tracks relative changes in housing prices over time.
   
All data used here is aggregated, specific to Calgary, AB, and sourced from Statistics Canada. Because the primary goal of this project is hands-on practice in building and applying Multiple Linear Regression models, the focus is on the process rather than the absolute accuracy of the model.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Feature 1: Residential Units Completed Each Month

Feature 1 represents the number of residential units completed each month. This data is sourced from the 'res_construction.csv' file. After cleaning and filtering to only include 'Total residential' units, we are left with an array of 16 monthly counts.

In [2]:
res_con = pd.read_csv('res_construction.csv')

In [3]:
res_con = res_con[['REF_DATE', 'Type of structure', 'VALUE']]

In [4]:
res_con = res_con.rename(columns={'REF_DATE': 'Period','Type of structure': 'Type', 'VALUE':'Count'})

In [5]:
res_con = res_con[res_con['Type'] == 'Total residential']

In [6]:
res_con = res_con.drop('Type', axis=1).set_index('Period', drop=True)

In [7]:
res_con.shape, res_con.dtypes 

((16, 1),
 Count    int64
 dtype: object)

In [8]:
res_con = np.array(res_con)

In [9]:
res_con

array([[1251],
       [ 805],
       [ 970],
       [1500],
       [1683],
       [2045],
       [1562],
       [1439],
       [1373],
       [1366],
       [1835],
       [1429],
       [1536],
       [1262],
       [1265],
       [1416]])

## Feature 1 Complete: Residential Units Completed Each Month

# Feature 2: Completed Units Absorption Each Month

Feature 2 pertains to the absorption of completed residential units each month. The data is sourced from the 'absorb_units.csv' file. 

After initial cleaning, the data is filtered to 2022 and 2023, converted to integers, and reshaped into an array with 17 rows and 1 column. To align with the other datasets, the last row is dropped, leaving 16 rows that hold the count of absorbed units for each corresponding month.

In [10]:
absorb = pd.read_csv('absorb_units.csv')
absorb = absorb[['Time','All']].set_index('Time', drop=True)
absorb = absorb.rename(columns={'All':'Count'})

In [11]:
absorb = absorb.loc[absorb.index.str.contains('2023|2022')]

In [12]:
absorb = absorb['Count'].astype('int64')

In [13]:
absorb = absorb.values.reshape(-1,1)

In [14]:
absorb.shape, absorb.dtype

((17, 1), dtype('int64'))

In [15]:
absorb = absorb[:-1]
absorb.shape

(16, 1)

In [16]:
absorb

array([[577],
       [471],
       [548],
       [645],
       [605],
       [656],
       [563],
       [620],
       [588],
       [548],
       [630],
       [705],
       [481],
       [615],
       [657],
       [695]])
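The positional trim `absorb[:-1]` works as long as the extra row really is the most recent month. A label-based alternative is sketched below. It is not what this notebook does, and it assumes both series share the same period labels, which would first require normalizing the `Period` and `Time` columns. All values are toy placeholders.

```python
import pandas as pd

# Toy values only: labels and counts are placeholders, not the Calgary data.
completed = pd.Series([10, 20], index=["2023-03", "2023-04"], name="completed")
absorbed = pd.Series([5, 6, 7], index=["2023-03", "2023-04", "2023-05"], name="absorbed")

# An inner join keeps only the periods present in both series,
# so the unmatched "2023-05" row is dropped by label rather than by position.
aligned = pd.concat([completed, absorbed], axis=1, join="inner")

print(aligned)
```

Aligning by label makes a mismatch visible if a month is missing in the middle of one series, which a positional slice would silently misalign.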

## Feature 2 Complete: Absorbed Residential Units Each Month

# Feature 3: Absorbed Unit Costs (Target Variable)

Feature 3, which serves as our target variable, represents the average absorption cost of single-family/attached homes each month. The corresponding data is collected from the 'absorb_price.csv' file. After cleaning and filtering, the data for the years 2022 and 2023 is transformed into an integer array with 17 rows and 1 column. To ensure consistency with our other datasets, this array is then trimmed to 16 entries.

Unlike the other features, this data is not an input to the model. It is the value the model is fitted to predict, and the same 16 rows are later used to score the fit.

In [17]:
df = pd.read_csv('absorb_price.csv')
df = df[['Time','Average']].set_index('Time', drop=True)
df

Time,Average
1990 January,157461
1990 February,145924
1990 March,153323
1990 April,158684
1990 May,169382
...,...
2023 February,719495
2023 March,721938
2023 April,743584
2023 May,766959


In [18]:
df = df.loc[df.index.str.contains('2022|2023')]

In [19]:
df = df.Average.str.replace(',','').astype('int64')
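The comma stripping in the cell above suggests the raw `Average` column is read as strings such as "157,461". As an alternative sketch, `pd.read_csv` can parse such numbers directly through its `thousands` parameter. The in-memory CSV below is a stand-in for 'absorb_price.csv': the two rows reuse values from the output shown earlier, but the quoted, comma-separated formatting is an assumption about the file.

```python
import io
import pandas as pd

# Stand-in for absorb_price.csv. The quoted "157,461" style is assumed,
# not confirmed from the real file.
raw_csv = io.StringIO(
    'Time,Average\n'
    '1990 January,"157,461"\n'
    '1990 February,"145,924"\n'
)

# thousands="," tells the parser to treat commas inside numbers as separators,
# so the column arrives as integers instead of strings.
prices = pd.read_csv(raw_csv, index_col="Time", thousands=",")

print(prices["Average"].dtype)
print(prices["Average"].tolist())
```

With this option the later `.str.replace(',', '').astype('int64')` step would no longer be needed.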

In [20]:
df = df.values.reshape(-1,1)
df.shape

(17, 1)

In [21]:
df = df[:-1]

In [22]:
df.shape

(16, 1)

In [23]:
df

array([[620090],
       [630697],
       [652992],
       [651990],
       [689070],
       [683979],
       [657916],
       [720625],
       [707905],
       [690682],
       [700049],
       [737179],
       [699978],
       [719495],
       [721938],
       [743584]])

## Feature 3 Complete: Avg. Absorption Price of Single-Family/Attached Homes

# Feature 4: Housing Price Index - Composite

Feature 4, the Housing Price Index - Composite (HPI), is an important indicator of market trends. The relevant data is extracted from the 'comp_hpi.csv' file. 

After setting 'Period' as the index, the data is reshaped into an array with 17 rows and 1 column. For consistency with the other feature datasets, the last entry is then removed, leaving 16 entries. These represent the composite HPI for specific months within the years 2022 and 2023.

In [24]:
df_hpi = pd.read_csv('comp_hpi.csv')
df_hpi = df_hpi[['Period','HPI']].set_index('Period', drop=True)
df_hpi

Period,HPI
Jan 2022,227.1
Feb 2022,240.4
Mar 2022,247.2
Apr 2022,250.9
May 2022,251.4
Jun 2022,250.6
Jul 2022,249.0
Aug 2022,246.2
Sep 2022,244.0
Oct 2022,242.3


In [25]:
df_hpi = df_hpi.values.reshape(-1,1)
df_hpi.shape, df_hpi.dtype

((17, 1), dtype('float64'))

In [26]:
df_hpi = df_hpi[:-1]

In [37]:
df_hpi

array([[227.1],
       [240.4],
       [247.2],
       [250.9],
       [251.4],
       [250.6],
       [249. ],
       [246.2],
       [244. ],
       [242.3],
       [240.6],
       [239.3],
       [241.3],
       [245.9],
       [249.6],
       [254.7]])
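Because the HPI is an index, individual readings matter less than the ratio between them. The short sketch below works out the relative change between the first and last values of the array above. It assumes the 16 rows are consecutive months, which would make the last row April 2023.

```python
# First and last composite HPI values from the array above.
hpi_jan_2022 = 227.1
hpi_apr_2023 = 254.7  # assumed to be April 2023 (16th consecutive month)

# Relative change between two index readings, expressed in percent.
pct_change = (hpi_apr_2023 / hpi_jan_2022 - 1) * 100

print(f"HPI change: {pct_change:.2f}%")
```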

## Feature 4 Complete: Housing Price Index - Composite

# Training The Model

After completion of data extraction and cleaning, Feature 1 (Residential Units Completed Each Month), Feature 2 (Absorbed Residential Units Each Month), and Feature 4 (Housing Price Index - Composite) are concatenated into a single dataset, which will be used as the feature set for the machine learning model.
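Concatenating along `axis=1` only works when every array has the same number of rows, and the column order has to match the order used later when building prediction inputs. A minimal sketch of a guard for this is shown below. `stack_features` is a hypothetical helper, not part of this notebook or of NumPy, and the demo uses only the first three rows of each array shown earlier.

```python
import numpy as np

def stack_features(*columns):
    """Concatenate (n, 1) column arrays side by side after checking shapes.

    Hypothetical helper for illustration only.
    """
    n_rows = columns[0].shape[0]
    for i, col in enumerate(columns):
        if col.ndim != 2 or col.shape[1] != 1:
            raise ValueError(f"column {i} has shape {col.shape}, expected (n, 1)")
        if col.shape[0] != n_rows:
            raise ValueError(f"column {i} has {col.shape[0]} rows, expected {n_rows}")
    return np.concatenate(columns, axis=1)

# First three rows of each feature array shown earlier in the notebook.
completed = np.array([[1251], [805], [970]])
absorbed = np.array([[577], [471], [548]])
hpi = np.array([[227.1], [240.4], [247.2]])

# Column order: units completed, units absorbed, composite HPI.
features_demo = stack_features(completed, absorbed, hpi)
print(features_demo.shape)
```

Failing early with a clear message is easier to debug than a shape error raised from inside `np.concatenate` or `fit`.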

The target dataset is created using Feature 3 (Absorbed Unit Costs). It is the variable the model is fitted to predict, not one of its inputs.

The model's performance is gauged using the R-squared score. It is computed here on the same 16 rows the model was fitted on, so it measures in-sample fit rather than performance on unseen data.
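For reference, R-squared (the coefficient of determination that `LinearRegression.score` returns) compares the model's squared errors with those of a baseline that always predicts the mean. The sketch below computes it by hand. `r_squared` is a helper written for this illustration, and the numbers are toy values.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Toy numbers for illustration only.
y = [1.0, 2.0, 3.0]
print(r_squared(y, y))                 # perfect predictions -> 1.0
print(r_squared(y, [2.0, 2.0, 2.0]))   # always predicting the mean -> 0.0
```

A score of 1.0 means the predictions match exactly, while 0.0 means the model does no better than predicting the mean of the target.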

In [27]:
features = np.concatenate((res_con,absorb,df_hpi), axis=1)

In [28]:
target = df

In [29]:
model = LinearRegression()

In [30]:
model.fit(features,target)

In [31]:
# Actual average absorbed price for the month being predicted (May 2023): 766,959

In [32]:
next_month_absorb = 948
next_month_res_con = 1400
next_month_df_hpi = 257.8

next_month_features = np.array([[next_month_res_con, next_month_absorb, next_month_df_hpi]])

In [33]:
prediction = model.predict(next_month_features)
prediction

array([[800745.29661285]])
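To put the prediction in context, it can be compared with the actual figure. The sketch below assumes the "next month" is May 2023, whose average of 766,959 appears both in the 'absorb_price.csv' output and in the `# target` comment above.

```python
predicted = 800745.29661285   # model output shown above
actual = 766959               # May 2023 average from absorb_price.csv

# Signed error: positive means the model overestimated the price.
abs_error = predicted - actual
pct_error = abs_error / actual * 100

print(f"Absolute error: {abs_error:,.0f}")
print(f"Percent error: {pct_error:.1f}%")
```

Under that assumption the model overshoots by roughly 4.4%. This is a single data point, so it says little about how the model would behave in other months.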

In [34]:
# R-squared on the same rows used for fitting (target is the same array as df)
score = model.score(features, target)

print(f"R-Squared Score is {score}")

R-Squared Score is 0.37814278255725453


## Summary

This project allowed me to dive deep into the core concepts of MLR and its real-world applications. Even though the model's fit was modest, with an R-squared of about 0.38 on the training data, I recognize the importance of hands-on experience and the lessons gained throughout the process.

I am excited to explore more datasets, try different types of regression techniques, and experiment with more advanced machine learning models to boost the predictive power of my future projects. 