## Group proposal: The Sophisticated Housing Price in Vancouver

**(1) Introduction:**

**Background information:**
The housing market in the Lower Mainland region of BC has been described as a “housing bubble” for the past decade, with prices rising sharply. However, even within this bubble, there are trends and patterns that can help us predict how the market will change in the coming years.

**Research question:** We will predict future house prices based on the price history from 2005 to 2024.

**Dataset we used:** The dataset we are using was found in the UBC resource library and originates from the CREA (Canadian Real Estate Association). It includes data from major cities and regions across Canada. For our analysis we will only look at data collected from the Lower Mainland region, and within this region we will only use the single-family HPI (home price index) as the basis for our predictions.

**(2) Preliminary exploratory data analysis:**

First, we read the dataset from the Excel file into Python.

In [13]:
import pandas as pd

In [14]:
data = pd.read_excel("Seasonally Adjusted.xlsx", sheet_name="LOWER_MAINLAND")
data

Unnamed: 0,Date,Composite_HPI_SA,Single_Family_HPI_SA,One_Storey_HPI_SA,Two_Storey_HPI_SA,Townhouse_HPI_SA,Apartment_HPI_SA,Composite_Benchmark_SA,Single_Family_Benchmark_SA,One_Storey_Benchmark_SA,Two_Storey_Benchmark_SA,Townhouse_Benchmark_SA,Apartment_Benchmark_SA
0,2005-01-01,100.0,100.0,100.0,100.0,100.0,100.0,333400,463100,362900,536800,264900,202700
1,2005-02-01,100.4,100.2,100.6,100.1,100.4,100.7,334600,464200,364900,537200,265900,204200
2,2005-03-01,100.8,100.4,101.0,100.1,100.7,102.0,336100,464800,366500,537400,266700,206800
3,2005-04-01,101.1,100.6,101.3,100.3,101.2,102.7,337000,465800,367500,538300,268200,208100
4,2005-05-01,101.5,100.8,101.7,100.5,101.7,103.5,338400,466900,368900,539300,269300,209700
...,...,...,...,...,...,...,...,...,...,...,...,...,...
224,2023-09-01,345.7,393.3,390.3,374.0,371.3,355.1,1152600,1821500,1416400,2007600,983500,719700
225,2023-10-01,344.1,391.0,388.2,371.6,372.2,356.8,1147300,1810600,1408600,1994900,985900,723200
226,2023-11-01,342.5,390.0,384.8,371.9,371.3,356.1,1142000,1806000,1396400,1996400,983500,721900
227,2023-12-01,339.6,387.8,383.9,369.4,368.8,353.2,1132300,1795900,1393300,1983200,976900,716000


Then, we clean and wrangle our data into a tidy format, with one row per date and house type.

In [15]:
data = data.melt(id_vars=["Date"],
                 var_name="house_type",
                 value_name="house_price_index")
data

Unnamed: 0,Date,house_type,house_price_index
0,2005-01-01,Composite_HPI_SA,100.0
1,2005-02-01,Composite_HPI_SA,100.4
2,2005-03-01,Composite_HPI_SA,100.8
3,2005-04-01,Composite_HPI_SA,101.1
4,2005-05-01,Composite_HPI_SA,101.5
...,...,...,...
2743,2023-09-01,Apartment_Benchmark_SA,719700.0
2744,2023-10-01,Apartment_Benchmark_SA,723200.0
2745,2023-11-01,Apartment_Benchmark_SA,721900.0
2746,2023-12-01,Apartment_Benchmark_SA,716000.0


We filter the dataframe to keep only one house type, "Single_Family_HPI_SA".

In [17]:
data = data[data["house_type"] == "Single_Family_HPI_SA"]
data

Unnamed: 0,Date,house_type,house_price_index
229,2005-01-01,Single_Family_HPI_SA,100.0
230,2005-02-01,Single_Family_HPI_SA,100.2
231,2005-03-01,Single_Family_HPI_SA,100.4
232,2005-04-01,Single_Family_HPI_SA,100.6
233,2005-05-01,Single_Family_HPI_SA,100.8
...,...,...,...
453,2023-09-01,Single_Family_HPI_SA,393.3
454,2023-10-01,Single_Family_HPI_SA,391.0
455,2023-11-01,Single_Family_HPI_SA,390.0
456,2023-12-01,Single_Family_HPI_SA,387.8


We split the dataframe into training (80%) and test (20%) sets and use only the training data for the exploratory analysis.

In [19]:
from sklearn.model_selection import train_test_split

# Only one house type remains after filtering, so stratifying on it has no effect.
price_train, price_test = train_test_split(data, train_size=0.80)
price_train

Unnamed: 0,Date,house_type,house_price_index
418,2020-10-01,Single_Family_HPI_SA,294.6
416,2020-08-01,Single_Family_HPI_SA,285.8
387,2018-03-01,Single_Family_HPI_SA,295.8
433,2022-01-01,Single_Family_HPI_SA,401.7
440,2022-08-01,Single_Family_HPI_SA,381.0
...,...,...,...
247,2006-07-01,Single_Family_HPI_SA,122.1
434,2022-02-01,Single_Family_HPI_SA,413.2
407,2019-11-01,Single_Family_HPI_SA,270.5
296,2010-08-01,Single_Family_HPI_SA,155.6


The training set shown above contains 183 observations. The variable we plan to use is the house price index.
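
A count and mean for the training set could be computed along the following lines. This is only a sketch: `sample` is a hypothetical five-row stand-in built from the first single-family rows shown earlier, not the full `price_train`, and the column names `n_obs` and `mean_hpi` are our own choices.

```python
import pandas as pd

# Hypothetical stand-in for price_train: the first five single-family rows
# from the filtered table shown earlier, not the full training set.
sample = pd.DataFrame({
    "Date": pd.to_datetime([
        "2005-01-01", "2005-02-01", "2005-03-01", "2005-04-01", "2005-05-01",
    ]),
    "house_type": "Single_Family_HPI_SA",
    "house_price_index": [100.0, 100.2, 100.4, 100.6, 100.8],
})

# One row per house type with the number of observations and the mean HPI.
summary = (
    sample.groupby("house_type")["house_price_index"]
    .agg(n_obs="count", mean_hpi="mean")
    .reset_index()
)
print(summary)
```

Replacing `sample` with `price_train` should give the same kind of table for the real training data.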

Then we visualize the dataframe.

In [21]:
import altair as alt

plot = alt.Chart(data, title="Housing price in the past 20 years").mark_line().encode(
    x=alt.X('Date:T').title('time'),
    y=alt.Y('house_price_index').title('HPI')
).configure_axis(titleFontSize=15)
plot

**(3) Methods:**

**How we will conduct our data analysis and which variables/columns we will use**: By plotting HPI against time, we can examine the trend and fit a linear regression. We will then use the fitted regression between year and housing price to predict future prices. The variables we will use are the year (from the Date column) and the HPI.
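
As a rough illustration of this approach, the sketch below fits a linear regression of HPI on time and predicts two later dates. It is not our final analysis: `toy` holds made-up values rather than CREA data, and the helper `years_since_start` and the column names `years` and `predicted_hpi` are hypothetical names introduced only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the filtered single-family rows; the HPI values
# are made up for illustration and are not CREA data.
toy = pd.DataFrame({
    "Date": pd.date_range("2005-01-01", periods=24, freq="MS"),
})
toy["house_price_index"] = 100.0 + 0.5 * np.arange(len(toy))


def years_since_start(dates, start):
    """Convert a Series of timestamps to fractional years elapsed since `start`."""
    return (dates - start).dt.days / 365.25


# Regression needs a numeric predictor, so dates become years since the first month.
start = toy["Date"].min()
X = years_since_start(toy["Date"], start).to_frame(name="years")
y = toy["house_price_index"]

model = LinearRegression().fit(X, y)

# Predict the HPI at two dates after the end of the toy series.
future = pd.DataFrame({"Date": pd.to_datetime(["2007-06-01", "2008-01-01"])})
future_X = years_since_start(future["Date"], start).to_frame(name="years")
future["predicted_hpi"] = model.predict(future_X)
print(future)
```

With the real data, `toy` would be replaced by the training set, and the held-out test set could be used to check how far the predictions are from the observed HPI.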

**Visualization**: As mentioned earlier, we will visualize the results with a line plot, as we intend to show the relationship between the independent variable (time) and the dependent variable (HPI).

**(4) Expected outcomes and significance:**

**Expected outcomes:** We expect to identify the trend in housing market prices over time and to predict whether prices will go up or down in the future based on the variables in our data.

**Impact of the project:** 

(1) Better inform people about how the housing market may change in the coming years, allowing them to make better-informed decisions and prepare for the future.

(2) Tell readers about the factors and characteristics that are related to housing market prices.

(3) Help readers identify which housing areas are affordable.


**Future questions:**

(1) What aspects should customers look at when looking for affordable homes?

(2) What is the predicted price of a house given the aspects a customer cares about?
