# DSCI 100 Group 10 Project Proposal: Austin, TX Housing Prices
#### Bhagat Basra, Oscar Cheng, Lucas Wei, Wanning Zhang
### Introduction
Housing prices depend on many variables. We investigate the market for single-family homes in Austin, Texas, the tenth-most populous city in the United States (Moskowitz, 2024). Home prices are influenced by a variety of factors, including the size of the home, its number of bedrooms, the quality of nearby schools, and its age (Zietz et al., 2008). We chose to investigate these variables because our exploratory plots showed a relationship between each of them and the house price.

The question we will try to answer with our project is, **"Can we predict the housing prices of single-family homes in Austin, Texas based on the living area square footage, number of bedrooms, average school rating, and year it was built?"**
The dataset we will be using is called “Austin, TX House Listings”. This dataset contains house listing data from 2021. 

### Methods & Results

First, we load our libraries and read the data into a data frame.

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(RCurl)
library(cowplot)
options(repr.matrix.max.rows = 6)

In [None]:
# read_csv() can read directly from a URL, so no separate download step is needed
data <- read_csv("https://raw.githubusercontent.com/Victorier0101/dsci_project/main/austinHousingData.csv")
head(data, n = 3)

Below, we select our chosen variables and filter for single-family homes. We also filter specifically for the city of Austin, since about 1% of the homes in the dataset are in the Greater Austin area rather than in Austin itself.

In [None]:
data_selected <- data |> 
    filter(homeType == "Single Family", city == "austin") |> 
    select(latestPrice, livingAreaSqFt, numOfBedrooms, avgSchoolRating, yearBuilt)

Now we will set the seed and split the data into 75% training and 25% testing. 

#####  Table 1: Training data for the model

In [None]:
set.seed(1)
data_split <- initial_split(data_selected, prop = 0.75, strata = latestPrice) 
data_training <- training(data_split)
data_testing <- testing(data_split)
head(data_training)
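
To illustrate what `strata = latestPrice` is for, here is a minimal base R sketch of a stratified 75/25 split. It assumes the numeric outcome is binned into quartiles before sampling, which is our understanding of the default behaviour of `initial_split()`; the `price` vector is invented for the example and is not the Austin data.

```r
set.seed(1)

# Invented, right-skewed prices standing in for latestPrice (hypothetical data).
price <- round(exp(rnorm(200, mean = 13, sd = 0.5)))

# Bin prices into quartiles, roughly what stratifying on a numeric column does.
bins <- cut(price, breaks = quantile(price, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE, labels = FALSE)

# Draw about 75% of the rows from each bin for training.
train_idx <- unlist(lapply(split(seq_along(price), bins), function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}))
test_idx <- setdiff(seq_along(price), train_idx)

# Share of training rows that comes from each price quartile.
table(bins[train_idx]) / length(train_idx)
```

Sampling within each bin keeps cheap and expensive homes represented in similar proportions in both sets, which matters when prices are skewed.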

#### Exploratory analysis

Below, we perform exploratory data analysis on the training data and summarize it in a table. The table shows the total number of homes in our training data, as well as the mean price and the mean of each selected predictor.

##### Table 2: Average value of each predictor

In [None]:
data_mean <- summarize(data_training,
                    total_homes = n(),
                    mean_price = mean(latestPrice),
                    mean_sqft = mean(livingAreaSqFt),
                    mean_bedrooms = mean(numOfBedrooms),
                    mean_school_rating = mean(avgSchoolRating),
                    mean_year_built = mean(yearBuilt))
data_mean 

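Here are plots comparing each selected predictor with housing price. Outliers in the training data (homes priced at $4,000,000 or more and, for the living area plot, homes of 10,000 square feet or more) were omitted from the visualizations to show the relationships more clearly.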

In [None]:
options(repr.plot.width = 12, repr.plot.height = 12)

# Figure 1: price vs. living area (extreme sizes and prices removed)
data_sqft_plot <- data_training |>
    filter(livingAreaSqFt < 10000, latestPrice < 4000000) |>
    ggplot(aes(x = livingAreaSqFt, y = latestPrice)) +
    geom_point(alpha = 0.3) +
    labs(x = "Living Area (SqFt)", y = "Price ($USD)") +
    ggtitle("Price vs. Living Area (SqFt)", subtitle = "Figure 1") +
    theme(text = element_text(size = 15))

# Figure 2: price vs. number of bedrooms
data_bedrooms_plot <- data_training |>
    filter(latestPrice < 4000000) |>
    ggplot(aes(x = numOfBedrooms, y = latestPrice)) +
    geom_point(alpha = 0.3) +
    labs(x = "Number of Bedrooms", y = "Price ($USD)") +
    ggtitle("Price vs. Number of Bedrooms", subtitle = "Figure 2") +
    theme(text = element_text(size = 15))

# Figure 3: price vs. average nearby school rating
data_school_plot <- data_training |>
    filter(latestPrice < 4000000) |>
    ggplot(aes(x = avgSchoolRating, y = latestPrice)) +
    geom_point(alpha = 0.3) +
    labs(x = "Average Nearby School Rating (1-10)", y = "Price ($USD)") +
    ggtitle("Price vs. Average Nearby School Rating", subtitle = "Figure 3") +
    theme(text = element_text(size = 15))

# Figure 4: price vs. year built
data_yearbuilt_plot <- data_training |>
    filter(latestPrice < 4000000) |>
    ggplot(aes(x = yearBuilt, y = latestPrice)) +
    geom_point(alpha = 0.3) +
    labs(x = "Year Built", y = "Price ($USD)") +
    ggtitle("Price vs. Year Built", subtitle = "Figure 4") +
    theme(text = element_text(size = 15))

# Arrange the four plots in a 2 x 2 grid
plot_grid(data_sqft_plot, data_bedrooms_plot, data_school_plot, data_yearbuilt_plot)

These graphs show why we selected these predictors. In each visualization, the predictor has a roughly linear but relatively weak relationship with price. Because each predictor shows some correlation with price, we expect each to contribute useful information to our model. The points in the number-of-bedrooms and average-school-rating plots fall in vertical bands because these variables are not continuous and take only a limited set of values.
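
One way to put a number on how strong each linear relationship is would be the Pearson correlation between each predictor and price. The sketch below is only an illustration on a small made-up data frame: `toy_homes` and its values are hypothetical, not taken from the Austin data, though the same `cor()` call could be applied to the columns of `data_training`.

```r
# Hypothetical toy data: values are invented for illustration only.
toy_homes <- data.frame(
  latestPrice     = c(300000, 350000, 420000, 510000, 640000, 800000),
  livingAreaSqFt  = c(1100, 1500, 1400, 2100, 2600, 3200),
  numOfBedrooms   = c(2, 3, 3, 4, 4, 5),
  avgSchoolRating = c(4, 6, 5, 7, 6, 9),
  yearBuilt       = c(1975, 1988, 2001, 1995, 2010, 2015)
)

predictors <- setdiff(names(toy_homes), "latestPrice")

# Pearson correlation of each predictor with price (values lie between -1 and 1).
price_cor <- sapply(toy_homes[predictors], cor, y = toy_homes$latestPrice)
print(round(price_cor, 2))
```

Values near 0 indicate a weak linear relationship, and values near 1 or -1 a strong one.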

#### Analysis
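Here we create a model specification and a recipe so that we can find the optimal number of neighbours, k. Setting `neighbors = tune()` marks k as a parameter to be tuned rather than fixed. The recipe scales and centers all predictors so that each one contributes on the same scale to the distance calculation.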

In [None]:
set.seed(1)
data_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> 
       set_engine("kknn") |>
       set_mode("regression") 

data_recipe <- recipe(latestPrice ~., data = data_training) |>
       step_scale(all_predictors()) |>
       step_center(all_predictors())
data_recipe
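
To make the two ingredients above concrete, here is a minimal base R sketch of unweighted ("rectangular") k-nearest-neighbours regression on standardized predictors. The helper `knn_predict()` and the toy data are hypothetical and written only for this illustration; this is not how the `kknn` engine is implemented internally.

```r
# Hypothetical helper: unweighted ("rectangular") k-NN regression in base R.
# It mirrors the idea of the recipe plus model spec, not the kknn internals.
knn_predict <- function(train_x, train_y, new_x, k) {
  # Standardize using the TRAINING means and standard deviations only.
  mu <- colMeans(train_x)
  sigma <- apply(train_x, 2, sd)
  train_z <- scale(train_x, center = mu, scale = sigma)
  new_z <- scale(new_x, center = mu, scale = sigma)

  apply(new_z, 1, function(row) {
    # Euclidean distance from this home to every training home.
    dists <- sqrt(rowSums(sweep(train_z, 2, row)^2))
    nearest <- order(dists)[seq_len(k)]
    mean(train_y[nearest])  # every neighbour gets equal weight
  })
}

# Invented toy data: living area (sq ft) and bedrooms, with prices.
train_x <- cbind(sqft = c(1000, 1200, 2000, 2200, 3000),
                 beds = c(2, 2, 3, 4, 5))
train_y <- c(250000, 270000, 400000, 450000, 700000)
new_x <- cbind(sqft = 2100, beds = 3)

toy_prediction <- knn_predict(train_x, train_y, new_x, k = 2)
toy_prediction  # 425000, the mean of the two closest homes' prices
```

Without standardization, square footage (in the thousands) would dominate the distance and the number of bedrooms would have almost no influence on which homes count as neighbours.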

Here we are creating splits for cross validation with 5 folds and creating a workflow. 

In [None]:
set.seed(1)
data_vfold <- vfold_cv(data_training, v = 5, strata = latestPrice)

data_workflow <- workflow() |>
  add_recipe(data_recipe) |>
  add_model(data_spec)
data_workflow

Next, we determine which value of k gives the lowest mean RMSE across the folds. We first searched a much wider range of k values with large increments, then narrowed the range and reduced the increment to 1 to pinpoint a specific value.

##### Table 3: Cross-validation results for each value of k

In [None]:
set.seed(1)
gridvals <- tibble(neighbors = seq(from = 5, to = 20, by = 1))
data_results <- data_workflow |>
  tune_grid(resamples = data_vfold, grid = gridvals) |>
  collect_metrics()
data_results
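
The coarse-to-fine search described above can be sketched as follows. Note that `toy_cv_rmse()` is a made-up stand-in for the mean cross-validated RMSE at a given k; in this notebook that number would come from `tune_grid()` and `collect_metrics()`, and the grid limits used here are assumptions chosen for illustration.

```r
# Hypothetical stand-in for "mean cross-validated RMSE at k"; the shape and
# numbers are invented so that the example runs on its own.
toy_cv_rmse <- function(k) 30000 + 50 * (k - 11)^2

# Pass 1: wide range, large step.
coarse_k <- seq(from = 5, to = 105, by = 10)
coarse_best <- coarse_k[which.min(sapply(coarse_k, toy_cv_rmse))]

# Pass 2: every integer within one coarse step of the coarse winner.
fine_k <- seq(from = max(1, coarse_best - 10), to = coarse_best + 10, by = 1)
fine_best <- fine_k[which.min(sapply(fine_k, toy_cv_rmse))]

c(coarse = coarse_best, fine = fine_best)
```

The first pass only locates the right neighbourhood (k = 15 in this toy curve); the second pass then finds the exact minimum (k = 11) at a fraction of the cost of testing every k from 5 to 105.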

##### Table 4: Result using k = 11

In [None]:
data_min <- data_results |>
    filter(.metric == "rmse") |>
    slice_min(mean, n=1)
data_min

From this we can see that the optimal k value to select is k=11. Below is a graph further demonstrating that k=11 is the optimal choice. 

In [None]:
data_kval <- data_results |>
    filter(.metric == "rmse")
options(repr.plot.width = 8, repr.plot.height = 6)
data_k_results <- data_kval |>
    ggplot(aes(x = neighbors, y = mean))+
        geom_line()+
        labs(x="Neighbours", y = "Mean RMSE")+
        ggtitle("Mean RMSE vs. k neighbors", subtitle = "Figure 5")
data_k_results

Now that we have the optimal k value of 11, we create a new model specification with this value and reuse the same recipe to fit the model to the training data. The summary of its predictions on the testing data is shown below.

##### Table 5: Summary of results using k = 11, evaluated on testing data

In [None]:
data_kspec <- nearest_neighbor(weight_func = "rectangular", neighbors = 11) |> 
       set_engine("kknn") |>
       set_mode("regression") 

data_fit <- workflow() |>
          add_recipe(data_recipe) |>
          add_model(data_kspec) |>
          fit(data = data_training)

data_summary <- data_fit |>
           predict(data_testing) |>
           bind_cols(data_testing) |>
           metrics(truth = latestPrice, estimate = .pred)
data_summary

This shows that the RMSE (root mean squared error), a measure of the typical difference between our predicted and actual prices, is roughly $30,922.9 on the testing data. Since the mean home price is $520,634.5, the RMSE is roughly 6% of the mean home price (calculated by dividing the RMSE by the mean home price).
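
As a quick check of the arithmetic, the sketch below first computes an RMSE from scratch on three invented prices (not taken from the Austin data), and then reproduces the percentage using only the two figures quoted above. The variable names are our own.

```r
# RMSE from scratch on invented prices (hypothetical values).
actual    <- c(400000, 520000, 610000)
predicted <- c(430000, 500000, 650000)
toy_rmse <- sqrt(mean((predicted - actual)^2))
toy_rmse

# RMSE relative to the mean price, using the two figures quoted in the text.
reported_rmse <- 30922.9
mean_price    <- 520634.5
pct_of_mean <- round(100 * reported_rmse / mean_price, 1)
pct_of_mean  # 5.9
```

Because the errors are squared before averaging, RMSE penalizes large misses more heavily than a plain average of absolute errors would.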

In [None]:
# Predict on the testing data and drop extreme homes for plotting
data_preds <- data_fit |>
    predict(data_testing) |>
    bind_cols(data_testing) |>
    filter(livingAreaSqFt < 7500, latestPrice < 4000000)

options(repr.plot.width = 12, repr.plot.height = 12)

# Each plot shows the testing homes (points) and their predicted prices (blue line).
# The points come from the testing data, not the raw `data`, so that they match
# the homes the predictions were made for.
data_livingAreaSqFt_plot <- data_preds |>
    ggplot(aes(x = livingAreaSqFt, y = latestPrice)) +
    geom_point(alpha = 0.4) +
    geom_line(aes(y = .pred), color = "blue", linewidth = 1) +
    xlab("House size (square feet)") +
    ylab("Price (USD)") +
    ggtitle("Price vs. House Size", subtitle = "Figure 6")

data_numOfBedrooms_plot <- data_preds |>
    filter(numOfBedrooms < 10) |>
    ggplot(aes(x = numOfBedrooms, y = latestPrice)) +
    geom_point(alpha = 0.4) +
    geom_line(aes(y = .pred), color = "blue", linewidth = 1) +
    xlab("Number of Bedrooms") +
    ylab("Price (USD)") +
    ggtitle("Price vs. Number of Bedrooms", subtitle = "Figure 7")

data_avgSchoolRating_plot <- data_preds |>
    ggplot(aes(x = avgSchoolRating, y = latestPrice)) +
    geom_point(alpha = 0.4) +
    geom_line(aes(y = .pred), color = "blue", linewidth = 1) +
    xlab("Average School Rating (1-10)") +
    ylab("Price (USD)") +
    ggtitle("Price vs. Average School Rating", subtitle = "Figure 8")

data_yearBuilt_plot <- data_preds |>
    ggplot(aes(x = yearBuilt, y = latestPrice)) +
    geom_point(alpha = 0.4) +
    geom_line(aes(y = .pred), color = "blue", linewidth = 1) +
    xlab("Year Built") +
    ylab("Price (USD)") +
    ggtitle("Price vs. Year Built", subtitle = "Figure 9")

# Arrange the four plots in a 2 x 2 grid
plot_grid(
    data_livingAreaSqFt_plot, data_numOfBedrooms_plot,
    data_avgSchoolRating_plot, data_yearBuilt_plot
)

The graphs above visualize our results. Each of the four graphs plots one predictor against price, with the predicted house prices overlaid as a line. We show each predictor separately because displaying all four predictors and the price in a single figure would require a five-dimensional plot, which is not practical. The predicted line may look jagged or only loosely related to any single predictor, but this is because each prediction depends on all four predictors at once, so a plot against one predictor cannot show their combined effect. It is therefore more informative to judge the model by its RMSE, which is low relative to the mean home price. Among these plots, house size appears to be the strongest predictor, as its plot aligns most closely with the predicted values line.
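
Since all four predictors cannot be drawn against price at once, one common alternative is a single predicted-versus-actual plot, which collapses every predictor into the model's output. The sketch below uses invented numbers in a hypothetical `toy_preds` data frame; in this notebook the same plot could be drawn from the `latestPrice` and `.pred` columns of `data_preds`.

```r
library(ggplot2)

# Invented predictions standing in for the notebook's `data_preds`.
toy_preds <- data.frame(
  latestPrice = c(300000, 420000, 510000, 640000, 800000),
  .pred       = c(330000, 400000, 540000, 600000, 760000)
)

pred_vs_actual <- ggplot(toy_preds, aes(x = latestPrice, y = .pred)) +
  geom_point(alpha = 0.4) +
  # Points on the dashed line are predicted perfectly.
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Actual price (USD)", y = "Predicted price (USD)")
pred_vs_actual
```

Points above the dashed line are homes the model overprices, and points below it are homes it underprices.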

### Discussion

We have built a model that predicts the price of single-family houses in Austin, TX. It has an RMSE of around 31,000 US dollars, which suggests that the model is adequate for predicting housing prices: the average single-family house price is around 520,000 US dollars, so the RMSE is small relative to the average price.

This result is what we expected from the model because of our process for selecting the k value. We first tried a wide range of k values with large increments. After examining the graph of those results, we narrowed the search to a smaller range with smaller increments. This helped us find the k value that works best for this project. We also expected the model to work because of the care we took in choosing our predictors: we selected predictors whose plots against house price showed a general trend.

Creating an accurate model of housing prices given certain input variables can have great impact. It would allow homeowners to predict the price of their house to ensure that they are asking for a fair value. It would also allow real estate investors to identify undervalued properties, increasing potential returns. Banks and other mortgage lenders could use such a model to appropriately assess the risk of a loan.

Future investigations could explore whether using more variables would create a more accurate model, and which variables affect home price the most. There are also factors that we were not able to incorporate into the model because they were not in our data set, such as transit proximity, which can affect housing prices (Yu et al., 2017). Furthermore, other home types, such as apartments, townhomes, or condos, could be studied, as could cities other than Austin, Texas.

### References

_____

Moskowitz, D. (2024, January 23). 10 largest cities in the U.S. Investopedia. https://www.investopedia.com/articles/personal-finance/050815/top-10-most-developed-cities-us.asp

Yu, H., Zhang, M., & Pang, H. (2017). Evaluation of transit proximity effects on residential land prices: an empirical study in Austin, Texas. Transportation Planning and Technology, 40(8), 841–854. https://doi.org/10.1080/03081060.2017.1355880

Zietz, J., Zietz, E. N., & Sirmans, G. S. (2008). Determinants of house prices: A quantile regression approach. The Journal of Real Estate Finance and Economics, 37, 317–333. https://doi.org/10.1007/s11146-007-9053-7

Data set obtained from: Pierce, E. (2021, April 12). Austin, TX House listings. Kaggle. https://www.kaggle.com/datasets/ericpierce/austinhousingprices/data