# Metro Vancouver Housing Price Analysis (2004–2023)

## Objective
This project explores how neighborhood, property type, structural features, and time relate to housing prices in Metro Vancouver.

## Key Questions
1. Do housing prices differ across neighborhoods?
2. How does property type relate to market price?
3. Which factors are most associated with market price?

## Tools
- R (tidyverse, ggplot2, naniar, corrplot)
- Exploratory Data Analysis (EDA)
- Correlation analysis
- Visualization

## Key Insight
This project demonstrates my ability to distinguish correlation from causation, identify dominant signals in data, and connect statistical findings with real-world economic context.

- Author: Jake Zhan

## Part 1: Data Summary and Cleaning

In [None]:
library(tidyverse)  # attaches dplyr, tidyr, ggplot2, and readr
library(naniar)     # miss_var_summary()

In [None]:
library(corrplot)

In [None]:
df <- read_csv("synthetic_house_prices_20_years.csv")

In [None]:
# 1. Data Overview
glimpse(df)

This dataset contains 3,520 observations and 14 variables. The variables include both categorical and quantitative features that describe housing characteristics and market prices in Metro Vancouver.

In this project, `Market Price` is the main outcome variable, while the other variables are used to explore factors associated with housing prices.

In [None]:
# 2. Summary statistics
summary(df)

# 3. Missing values
colSums(is.na(df))
miss_var_summary(df)

The summary statistics show that the dataset is generally clean and well-structured, with reasonable ranges across the numeric variables.

Only `Renovation Year` has missing data: 2,017 values (57.3%). All other variables are complete. This missingness is probably structural rather than a data-quality problem, because a missing value most likely means the property has no renovation on record. I therefore keep these rows and replace the column with a categorical indicator, `Renovated` (Yes/No), that records whether a renovation year is present.

In [None]:
df_1 <- df |>
  mutate(Renovated = if_else(!is.na(`Renovation Year`), "Yes", "No")) |>
  select(-`Renovation Year`)
head(df_1)

## Part 2: Research Questions Analysis

### RQ1 Are there significant price differences across neighborhoods?

In [None]:
neigh_summary <- df |>
  group_by(Neighborhood) |>
  summarise(
    mean_price = mean(`Market Price`),
    median_price = median(`Market Price`),
    sd_price = sd(`Market Price`))
ggplot(neigh_summary, aes(x = reorder(Neighborhood, mean_price),
                          y = mean_price)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Average Market Price by Neighborhood",
    x = "Neighborhood",
    y = "Average Market Price")
head(neigh_summary)

The summary statistics and bar chart show clear differences in average market prices across neighborhoods. Some areas, such as Lynn Valley and Edmonds, have higher average prices (around 2.4–2.5 million), while others, such as Coquitlam Centre and Lower Lonsdale, have lower average prices (around 1.9–2.0 million). In addition, the standard deviations are relatively large, indicating price variation within neighborhoods. Overall, the results suggest that market price differs meaningfully across neighborhoods.


In [None]:
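set.seed(123) # make the random draw reproducible

# Randomly select up to 10 distinct neighborhoods for comparison.
# Sampling from the unique names avoids drawing the same neighborhood twice.
neighborhoods <- unique(df$Neighborhood)
random_neigh <- sample(neighborhoods, min(10, length(neighborhoods)))

# Keep only properties located in the selected neighborhoods
df_selected <- df |>
  filter(Neighborhood %in% random_neigh)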

# Calculate average market price for each neighborhood and property type
selected_summary <- df_selected |>
  filter(`Property Type` %in% c("House", "Condo")) |>
  group_by(Neighborhood, `Property Type`) |>
  summarise(
    mean_price = mean(`Market Price`),
    .groups = "drop")


# Visualize average prices for House vs Condo across neighborhoods
ggplot(selected_summary,
       aes(x = Neighborhood,
           y = mean_price,
           fill = `Property Type`)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(title = "House vs Condo Prices Across Selected Neighborhoods",
       x = "Neighborhood",
       y = "Average Market Price")

To assess whether neighborhood differences persist after accounting for property type, I randomly selected 10 neighborhoods (simple random sampling). This keeps the comparison manageable and avoids hand-picking areas.
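One detail worth checking when sampling neighborhoods is that the draw is made from the distinct neighborhood names rather than from the rows. The sketch below uses a made-up `neighborhood` vector (not the project data) to show the difference; the names `row_draw` and `name_draw` are purely illustrative.

```r
# Hypothetical toy column: 3 neighborhoods with very unequal listing counts
neighborhood <- c(rep("A", 50), rep("B", 5), rep("C", 1))

set.seed(123)

# Drawing from the column samples listings, so a large neighborhood can
# appear more than once and small ones may be missed entirely.
row_draw <- sample(neighborhood, 3)

# Drawing from the unique names samples neighborhoods themselves.
name_draw <- sample(unique(neighborhood), 3)

length(unique(name_draw))  # always 3: each neighborhood appears exactly once
```

Sampling rows weights each neighborhood by how many listings it has, which may or may not be what a "random selection of neighborhoods" is meant to do.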

I then restricted the data to Houses and Condos, so that neighborhoods are compared on the same property types, calculated the average price for each property type within each neighborhood, and plotted the results as a grouped bar chart. Comparing bars of the same property type across neighborhoods shows that price gaps remain, which suggests that neighborhood differences are not driven solely by differences in housing composition.

#### Answer to RQ1

Even after controlling for property type, noticeable price differences remain across neighborhoods. This suggests that location itself is associated with market price.
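The grouped bar chart is a visual check. One possible way to formalize it, sketched here on simulated data rather than the project dataset, is a two-factor ANOVA in which property type enters before neighborhood. The data frame `toy`, the effect sizes, and the noise level are all assumptions made up for illustration.

```r
# Hypothetical toy data (not the project dataset): 3 neighborhoods x 2 property
# types, 20 simulated listings per combination.
set.seed(1)
toy <- expand.grid(
  neighborhood = c("A", "B", "C"),
  property_type = c("Condo", "House"),
  listing = 1:20
)

# Assumed effect sizes, chosen only for illustration
neigh_effect <- c(A = 0, B = 200000, C = 400000)
type_effect <- c(Condo = 0, House = 500000)

toy$price <- 1500000 +
  unname(neigh_effect[as.character(toy$neighborhood)]) +
  unname(type_effect[as.character(toy$property_type)]) +
  rnorm(nrow(toy), sd = 100000)

# Sequential ANOVA: property_type enters first, so the neighborhood row tests
# whether neighborhoods still differ once property type is accounted for.
anova_table <- anova(lm(price ~ property_type + neighborhood, data = toy))
anova_table["neighborhood", "Pr(>F)"]
```

Applied to the real data, the same formula would use `Market Price`, `Property Type`, and `Neighborhood`. A small p-value in the neighborhood row would support the visual impression, though ANOVA assumptions such as roughly equal variances would need checking first.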

### RQ2 How does property type relate to market price?

In [None]:
df |>
  group_by(`Property Type`) |>
  summarise(
    mean_price = mean(`Market Price`, na.rm = TRUE),
    median_price = median(`Market Price`, na.rm = TRUE),
    sd_price = sd(`Market Price`, na.rm = TRUE),
    .groups = "drop")

To examine how property type relates to market price, I grouped the dataset by property type and calculated the mean, median, and standard deviation of market price for each category. This allows a direct comparison of price levels across different housing types. The results show that Houses have the highest average and median prices, followed by Triplex and Duplex properties, while Condos have the lowest prices.

In [None]:
ggplot(df, aes(x = `Property Type`,
           y = `Market Price`,
           fill = `Property Type`)) +
geom_boxplot() +
labs(title = "Market Price Distribution by Property Type",
         x = "Property Type",
         y = "Market Price")

To further examine the relationship between property type and market price, I used boxplots to visualize the distribution of prices across different housing types. The boxplots allow comparison of median values, variability, and the presence of outliers within each category. The figure shows that Houses have the highest median prices and the widest price range, indicating greater variability. In contrast, Condos have lower central prices and a more compact distribution. Duplex, Triplex, and Townhouse properties fall between these two extremes.

#### Answer to RQ2
Overall, property type is strongly associated with market price, with detached Houses commanding the highest values and Condos the lowest.

### RQ3 Which factors are most associated with market price?

In [None]:
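# `cor_matrix` is not defined in an earlier cell; rebuilt here assuming it held
# correlations among the numeric columns of df_1 (`Renovation Year` dropped).
cor_matrix <- df_1 |>
  select(where(is.numeric)) |>
  cor()
cor_with_price <- cor_matrix[, "Market Price"]
cor_with_price <- cor_with_price[names(cor_with_price) != "Market Price"]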
barplot(sort(cor_with_price, decreasing = TRUE),
        las = 2,
        col = "steelblue",
        ylab = "Correlation",
        main = "Correlation with Market Price")

In this step, I computed the correlation between Market Price and every other numeric variable to identify which are most associated with price. The bar chart shows that Year has the strongest positive correlation with Market Price (≈ 0.43), while the remaining variables (Year Built, Bedrooms, Bathrooms, Square Footage, etc.) have correlations close to zero. Among the numeric variables, time (Year) is therefore the factor most associated with market price in this dataset. Categorical variables such as Neighborhood and Property Type are not part of this correlation comparison.
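To put the ≈ 0.43 correlation in perspective, squaring it gives the share of price variance that a straight-line relationship with Year would account for. The short sketch below does that arithmetic and then checks the identity on a few made-up numbers; `x` and `y` are hypothetical values, not project data.

```r
r_year <- 0.43            # approximate correlation reported above
r_squared <- r_year^2     # share of variance captured by a linear fit
round(r_squared, 3)       # 0.185

# Check on hypothetical toy numbers: squared correlation equals the R-squared
# of a one-predictor linear model.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 2.9, 4.2, 4.8, 6.1, 6.9)
all.equal(cor(x, y)^2, summary(lm(y ~ x))$r.squared)  # TRUE
```

So Year alone would account for roughly 18% of the variation in Market Price under a linear fit. That makes it the leading numeric variable while leaving most of the variation unexplained.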

In [None]:
ggplot(df, aes(x = Year, y = `Market Price`)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "red")

#### Answer to RQ3
To further examine this relationship, I plotted Market Price against Year and added a linear regression line. The upward-sloping red line shows that Market Price tends to increase as Year increases. This is consistent with the earlier correlation result and indicates a moderate positive association (r ≈ 0.43) between Year and price.

However, this does not mean that the passage of time itself causes Market Price to rise. Correlation does not equal causation. Year more plausibly acts as a proxy for broader market dynamics that change over time, such as inflation, economic growth, housing demand, and policy changes.
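The idea that Year stands in for underlying market forces can be illustrated with a small simulation. In the sketch below, price is built to depend only on a hypothetical `market_index` that happens to rise over time. The variable names, coefficients, and noise levels are all assumptions, and nothing here comes from the project dataset.

```r
set.seed(42)
n <- 2000
year <- sample(2004:2023, n, replace = TRUE)

# Hypothetical market index that rises over time (a stand-in for inflation,
# demand, and similar forces)
market_index <- 100 + 5 * (year - 2004) + rnorm(n, sd = 10)

# Price depends on the index only; Year has no direct effect by construction
price <- 10000 * market_index + rnorm(n, sd = 50000)

# Slope on Year when it is the only predictor
slope_year_only <- coef(lm(price ~ year))[["year"]]

# Slope on Year once the index is included in the model
slope_year_adjusted <- coef(lm(price ~ year + market_index))[["year"]]

c(year_only = slope_year_only, adjusted = slope_year_adjusted)
```

In this setup the Year slope is large when Year is the only predictor and shrinks toward zero once the index is included. That is the pattern expected when a time variable is a proxy rather than a cause.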

## Part 3: Conclusion / Summary

The strongest association with housing prices in this dataset comes from time, not from structural features. Neighborhood (RQ1) and property type (RQ2) clearly differentiate price levels, with Houses priced above Condos and certain neighborhoods consistently more expensive. Among the numeric variables, however, Year shows the strongest correlation with Market Price (≈ 0.43), while structural characteristics such as bedrooms or square footage show correlations near zero. In this dataset, the time trend therefore accounts for more price variation than differences in these structural features do.

In the context of the Canadian housing market, this finding is economically intuitive. Over the past two decades, sustained price appreciation—driven by low interest rates, population growth, and strong urban demand—has created a powerful upward trend. In this dataset, Year likely acts as a proxy for these macroeconomic forces rather than being a causal determinant itself. This highlights an important analytical insight: trend-driven correlation can dominate structural effects in time-evolving markets.

However, correlation does not imply causation. The association between Year and Market Price does not mean time causes price increases; rather, it reflects underlying economic dynamics and potential confounding variables. A next step would be to control for temporal effects within a multiple regression framework to better isolate structural price determinants.
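As a rough sketch of what that next step might look like, the example below fits a linear model with year fixed effects on simulated data. The data frame `toy`, the column `sqft`, and the assumed coefficients are hypothetical and chosen only to show the mechanics. A real analysis would use the dataset's own columns and would probably also include property type and neighborhood.

```r
set.seed(7)
n <- 1000
toy <- data.frame(
  year = sample(2004:2023, n, replace = TRUE),
  sqft = runif(n, 500, 3000)
)

# Assumed data-generating process: a time trend plus 300 per square foot
toy$price <- 500000 + 60000 * (toy$year - 2004) + 300 * toy$sqft +
  rnorm(n, sd = 100000)

# factor(year) gives each year its own intercept, absorbing the time trend,
# so the sqft coefficient reflects within-year differences in size.
fit <- lm(price ~ factor(year) + sqft, data = toy)
sqft_effect <- coef(fit)[["sqft"]]
sqft_effect
```

Treating Year as a factor avoids assuming the trend is linear. The trade-off is one extra coefficient per year, which is unlikely to matter with a few thousand observations.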

### Data Source

The dataset used in this project was obtained from Kaggle:

Jenny Zhu. *Vancouver House Prices for Past 20 Years*.  
Available at: https://www.kaggle.com/datasets/jennyzzhu/vancouver-house-prices-for-past-20-years

The dataset is labeled as synthetic and is publicly available for educational and analytical purposes.

### License & Attribution

All credit for the dataset belongs to the original author on Kaggle.  
This project does not claim ownership of the dataset.  
The analysis, data cleaning, visualizations, and interpretations are independently conducted by the author of this repository.

### Disclaimer

This project is intended for educational and portfolio purposes only.  
The dataset is synthetic and does not represent actual housing transactions.  
All insights are exploratory and should not be interpreted as financial or investment advice.