# COGS 118A - Final Project

# Moscow Real Estate Model

## Group members
- Keri Chen
- Arth Shukla
- WonJae Lee
- Ashley Chu
- Cynthia Delira

# Abstract 
Real estate is the foundation for many life milestones, such as owning a home or starting a family or a business. However, it can be hard to break into the real estate world without first doing a lot of research and planning because real estate is, after all, an investment. By building an ML model that predicts the price of a home, we hope to make this process easier for prospective homeowners and sellers. The data we will be using encompasses 5+ million Russian real estate sales from 2018 to 2021 and has multiple variables. Although our dataset covers the Russian market, it provides a large number of data points that allow us to learn more about the different models and how they might generalize to other markets.

We will perform EDA to view the correlations between the different factors, and then build regression models using linear regression, CART regression, and random forests. We will then evaluate the performance of our models using mean absolute error (MAE).

# Background

The real estate market is a pivotal contributor to the economy: according to the National Association of Home Builders, housing’s combined contribution to gross domestic product (GDP) generally averages 15-18%<a name="nahb"></a>[<sup>[1]</sup>](#nahb). This percentage is calculated from both residential investment and consumption spending on housing. Housing is also an important asset in people’s lives: it signifies more than a place to sleep and is often perceived as a marker of social status and a valuable asset in which money can be invested.


Despite its importance and large contribution to the economy, the market is sensitive to many factors that can change it quickly. Among these, one of the most important is demographics<a name="keyfactors"></a>[<sup>[2]</sup>](#keyfactors).

Demographics consists of data about the population near a location, which affects both the price of a house and the demand for the property. Places in and near a major city can be more expensive per square foot<a name="demographics"></a>[<sup>[3]</sup>](#demographics), since major cities usually have limited land left to develop or no space at all for new developments. Although real estate predictions across different areas (urban, rural, suburban) can vary due to differences in land use, when considering a single location (e.g. one city) where land use, housing supply, etc. are more similar, demographics prevail as the key difference when comparing real estate.

Our exploration will, therefore, focus on Moscow, the capital of Russia which has a quickly-growing real estate market. Because the market is developing, it can be difficult for the average person to determine which variables contribute most to real estate pricing. By building a model to predict real estate pricing (in rubles), we aim to make distilling this demographic information easier on a larger scale.

There has been a lot of prior (and ongoing) research within the real estate industry, especially at real estate companies: Zillow has its “Neural Zestimate”<a name="Zestimate"></a>[<sup>[4]</sup>](#Zestimate), Redfin has its “Redfin Estimate”<a name="Redfin"></a>[<sup>[5]</sup>](#Redfin), and many other real estate companies have their own models for estimating home prices. Since each model is built differently, price estimates vary between them. However, the basis of the models is similar: each takes in large amounts of previous transactions and/or MLS data, derives features from them, and is retrained over time to improve its results.

# Problem Statement

The real estate market can be a turbulent and rapidly changing environment in which it is often hard to predict a property's actual value because so many factors are involved. Given the multitude of variables, we will focus our model on the general description of the property. We aim to make this type of information easier to obtain by training an ML model on a large dataset of previous home purchases to predict the price point of a home.

# Data

Our current best candidate is the following dataset of <a href="https://www.kaggle.com/datasets/mrdaniilak/russia-real-estate-20182021">Russian Real Estate pricing from 2018-2021.</a> The dataset contains 5 million+ data points, with no null values and only a few thousand duplicate rows. Therefore, our data is well poised to support models that generalize, even without techniques like cross-validation.

This massive dataset means training could take many days or even weeks given our computational resources, which is not feasible. However, since demographic data can vary between cities and counties, our exploration will primarily focus on Moscow. Thus, we are able to limit the size of our data to about 1/10th of the original dataset. Furthermore, if computational cost continues to be an issue, we may randomly sample a subset of our data to train on (uniform random sampling does not violate any of the assumptions about the data that our regression models require).
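As a rough illustration of the filtering and subsampling described above, the following sketch uses pandas on a toy table. The column names (`price`, `region`, `rooms`) and the value of `MOSCOW_REGION` are assumptions made for the example, not values confirmed from the dataset.

```python
import pandas as pd

# Hypothetical region code for Moscow; check the dataset documentation for the real one.
MOSCOW_REGION = 3


def moscow_subset(df, region_code, n_samples=None, seed=0):
    """Keep rows for one region, drop duplicates, and optionally subsample uniformly."""
    subset = df[df["region"] == region_code].drop_duplicates()
    if n_samples is not None and n_samples < len(subset):
        # Uniform random sampling without replacement, reproducible via the seed.
        subset = subset.sample(n=n_samples, random_state=seed)
    return subset.reset_index(drop=True)


# Toy stand-in for the full listings table (all values are made up).
listings = pd.DataFrame({
    "price": [6_000_000, 8_500_000, 3_200_000, 12_000_000, 8_500_000, 4_100_000],
    "region": [3, 3, 81, 3, 3, 2661],
    "rooms": [1, 2, 1, 3, 2, 2],
})

moscow = moscow_subset(listings, MOSCOW_REGION, n_samples=2)
print(moscow)
```

Fixing the seed keeps the subsample reproducible across runs, which makes later model comparisons easier to interpret.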

There are 13 variables: 2 categorical, 2 location-based, and the rest ordinal. We will remove the latitude and longitude columns, since they raise ethical concerns about revealing the location of homeowners and could lead to serious violations of privacy.

Each observation contains the price of a house, listing date and time, geographic location, region, and information about the building (type, storeys, floor, living rooms, rooms). Notably, it does not contain square footage, which is a key feature in much of the American real estate market.

Critical variables mostly encompass the house descriptions and the time of publishing. We will need to one-hot encode building type; since it takes only a limited number of values, this will not greatly increase the width of the design matrix.

We will also need to reduce the date and time of publication to only the year, and potentially also the month, in case we’d like to do time series analysis. As mentioned earlier, we’ll also remove the latitude and longitude due to privacy concerns. Finally, for our non-tree models, we will normalize our data by z-score, since values like price in rubles will be orders of magnitude larger than the number of rooms.
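The preprocessing steps described in this section could be sketched roughly as follows. The column names (`date`, `geo_lat`, `geo_lon`, `building_type`, `rooms`, `level`, `levels`) are assumptions for illustration, and the toy values are made up. In practice the z-score statistics would be computed on the training split only.

```python
import pandas as pd


def preprocess(df):
    """Drop coordinates, extract year/month, one-hot encode, and z-score numeric columns."""
    # Remove location columns for privacy reasons.
    out = df.drop(columns=["geo_lat", "geo_lon"])

    # Reduce the publication date to year and month.
    stamp = pd.to_datetime(out["date"])
    out["year"] = stamp.dt.year
    out["month"] = stamp.dt.month
    out = out.drop(columns=["date"])

    # One-hot encode the categorical building type.
    out = pd.get_dummies(out, columns=["building_type"], dtype=float)

    # Z-score the numeric columns so that rubles and room counts share a scale.
    for col in ["price", "rooms", "level", "levels"]:
        out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    return out


# Toy rows standing in for real listings.
raw = pd.DataFrame({
    "price": [6_000_000, 8_500_000, 12_000_000, 4_100_000],
    "date": ["2018-09-08", "2019-03-15", "2020-11-02", "2021-01-20"],
    "geo_lat": [55.75, 55.76, 55.70, 55.80],
    "geo_lon": [37.61, 37.62, 37.50, 37.70],
    "building_type": [1, 3, 1, 2],
    "rooms": [1, 2, 3, 1],
    "level": [4, 9, 2, 12],
    "levels": [9, 17, 5, 25],
})

clean = preprocess(raw)
print(clean.columns.tolist())
```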

# Proposed Solution

Note that we discuss error metrics, including justifications for L1 loss (MAE), in the Evaluation Metrics section.

Before discussing our implementation, a note on benchmark models: there are some models available on Kaggle using time series analysis, which might produce good results. However, there are no significant authorities on Russian real estate pricing in machine learning, especially since this is an emerging market. Furthermore, American authorities on real estate prediction often keep their models internal as part of their business model, so it is difficult to use existing robust benchmark models without access to APIs and the like.

First, it is important to note that our dataset is massive. With over 5 million samples, our model should generalize well, but this also means we may have too many confounding variables and our model may not reach a low enough MAE. During EDA, we will determine which cities contain interesting data, and we can split our data by city. Depending on computational resources and time constraints, we may choose multiple cities or only one.

Second, regardless of which or how many cities we use, this data is simply far too massive for any form of CV. Additionally, CV is not necessary here, since performance on a validation set of this size is likely to be a reliable estimate of how well the model generalizes.

Finally, much of the data is fortunately ordinal, with few categorical variables that have limited possible values. For our regression models, we may leave out the categorical variables so that we don’t have too many features in our design matrix. However, to include these features in at least one model, we will also try random forests. The models we plan to try are:

- CART Regression
- Linear regression using L1 loss
- After performing EDA, if certain metrics seem like they could use polynomial features, we can also try polynomial regression using L1 loss.
- Random Forests to include categorical variables.
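
The candidate models listed above could be set up in sklearn roughly as in this sketch. The data here is synthetic, and the hyperparameters (`max_depth`, `n_estimators`, and so on) are placeholder assumptions. For the L1-loss linear models, one option is `SGDRegressor` with the epsilon-insensitive loss and `epsilon=0.0`, which reduces to absolute error; note that its default settings also add a small L2 penalty.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the design matrix and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.normal(scale=0.1, size=500)

# Single held-out validation split instead of cross-validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)


def l1_linear():
    # epsilon=0.0 turns the epsilon-insensitive loss into plain absolute error (L1 loss).
    return SGDRegressor(loss="epsilon_insensitive", epsilon=0.0,
                        max_iter=2000, tol=1e-4, random_state=0)


candidates = {
    "cart": DecisionTreeRegressor(max_depth=6, random_state=0),
    "linear_l1": make_pipeline(StandardScaler(), l1_linear()),
    "poly2_l1": make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                              StandardScaler(), l1_linear()),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each candidate and record its validation MAE.
val_mae = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_mae[name] = mean_absolute_error(y_val, model.predict(X_val))

for name, score in sorted(val_mae.items(), key=lambda item: item[1]):
    print(f"{name}: validation MAE = {score:.3f}")
```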


We can also try variants of linear and polynomial regression using L2 regularization. It is unlikely that many of these features will be confounding (though we can confirm with EDA), so L2 regularization is likely more reasonable. We can also try mixed regularization in case some features are, indeed, confounding.

Then, if we have enough computational resources, we can perform grid search on different hyperparameters for model selection. However, if this is not feasible, we can empirically justify pruning techniques, regularization mix, etc.
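
Since cross-validation is ruled out above, one way to run the grid search is to loop over sklearn's `ParameterGrid` and score each setting on a single held-out validation split. The sketch below does this for the regularization strength and the L1/L2 mix of an L1-loss linear model. The grid values and synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic data in which two of the five features are irrelevant.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=400)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

# l1_ratio=0.0 is pure L2 regularization, 1.0 is pure L1, values in between are mixed.
grid = ParameterGrid({
    "alpha": [1e-4, 1e-2, 1.0],
    "l1_ratio": [0.0, 0.5, 1.0],
})

results = []
for params in grid:
    model = SGDRegressor(loss="epsilon_insensitive", epsilon=0.0,
                         penalty="elasticnet", max_iter=2000,
                         random_state=1, **params)
    model.fit(X_train, y_train)
    results.append((mean_absolute_error(y_val, model.predict(X_val)), params))

# Pick the setting with the lowest validation MAE.
best_mae, best_params = min(results, key=lambda item: item[0])
print(f"best validation MAE = {best_mae:.3f} with {best_params}")
```

The cost of this loop grows with the product of the grid sizes, so on the full Moscow subset a coarse grid is likely the practical starting point.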

Finally, we will use sklearn for all implementations for 1) readable code, and 2) efficient, thoroughly tested implementations of the algorithms discussed above. While tools like Keras offer GPU acceleration, this benefits neural network models far more than the models we plan to use.


# Evaluation Metrics

The three most common metrics for regression are mean squared error (MSE), mean absolute error (MAE) and root mean squared error (RMSE). MSE and RMSE heavily penalize outliers, while MAE proportionately penalizes all errors. Our data includes some more extreme outliers (10 living rooms, 39th floor, etc). For these ‘extreme’ sorts of houses, there are also many extra possible factors beyond measurable features like number of rooms; for example, the ‘art’ of designing expensive homes with luxury features. So, using MSE or RMSE would likely bias our model to these extreme outliers while lowering our model’s success in gauging prices for a majority of houses on the market. Conversely, MAE would result in a better representation of the data for a majority of ‘normal’ cases. Therefore, we will stick to MAE.
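
A small numeric illustration of this argument, using made-up prices: for a constant prediction, the mean minimizes MSE and the median minimizes MAE, so a single luxury outlier pulls the MSE-optimal prediction far away from typical homes while the MAE-optimal prediction barely moves.

```python
from statistics import mean, median

# Hypothetical listing prices in millions of rubles; the last one is a luxury outlier.
prices = [5.0, 6.0, 6.5, 7.0, 8.0, 60.0]
typical = prices[:-1]


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Best constant prediction under each loss.
mse_optimal = mean(prices)    # minimizes squared error
mae_optimal = median(prices)  # minimizes absolute error

# How far off each constant is on the typical (non-outlier) homes.
typical_mae_mean = mae(typical, [mse_optimal] * len(typical))
typical_mae_median = mae(typical, [mae_optimal] * len(typical))

print(f"MSE-optimal constant: {mse_optimal:.2f}, error on typical homes: {typical_mae_mean:.2f}")
print(f"MAE-optimal constant: {mae_optimal:.2f}, error on typical homes: {typical_mae_median:.2f}")
```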

# Results

First, after cleaning and normalizing our data, we trained a linear regression model, which took a long time to train. Next, we tried CART regression, followed by a random forest and a deep neural network.

### Subsection 1 (Analysis)


### Subsection 2 (Linear Regression)


### Subsection 3 (CART Regression)


### Subsection 4 (Random Forest)


### Subsection 5 (DNN)




# Discussion

### Interpreting the result


### Limitations

While some of the models are able to give reasonable predictions on average, the MAE is still high, which means the model is not as accurate as it could be. Given additional time, it could be worthwhile to compile a more thorough dataset of the Russian market, perhaps with square footage and distance to shopping, malls, tourist attractions, etc., to get a better understanding of the different factors that could play a role in determining the price of a home. And even though our dataset had a lot of data points, due to lack of computational power and time, we focused on a particular city instead. Training on the whole dataset may give a more accurate model due to the increased number of data points.

### Ethics & Privacy

- The Russian economy is currently in a volatile position due to the war in Ukraine. If our model were to be used as a source of truth, and if it were too optimistic or pessimistic, we could wrongfully inflate the market or cause people to sell their homes for less than they are truly worth. Real estate investments can make or break one’s livelihood, especially in a turbulent and growing market like Russia, so making sure our model is functional and usable is important.
- The dataset doesn’t contain explicit personal information, but it contains information like date and time of listing publication and longitude/latitude location, which could potentially be used to identify individuals.
- The data is collected under specific legal provisions, which means it is collected lawfully, but it should be ensured that the use of this data for a machine learning project aligns with the original purpose of data collection.
- Any dataset has a potential for systematic biases, which could result in biased outcomes in a machine learning project. It is important to be aware of this and to either adjust the dataset to more fairly represent different groups or adjust the machine learning model to reduce bias in its prediction.

### Conclusion


# Footnotes
<a name = "nahb"></a>1.[^](#nahb): Housing’s Contribution to Gross Domestic Product. https://www.nahb.org/news-and-economics/housing-economics/housings-economic-impact/housings-contribution-to-gross-domestic-product#:~:text=Share%3A,homes%2C%20and%20brokers'%20fees.<br> 
<a name="keyfactors"></a>2.[^](#keyfactors): Key Factors That Drive the Real Estate Market. https://www.investopedia.com/articles/mortages-real-estate/11/factors-affecting-real-estate-market.asp <br>
<a name="demographics"></a>3.[^](#demographics): Is It Cheaper to Live in the City or the Suburbs? https://www.apartmenttherapy.com/suburbs-vs-city-cost-of-living-265646 <br>
<a name="Zestimate"></a>4.[^](#Zestimate): Building the Neural Zestimate. https://www.zillow.com/tech/building-the-neural-zestimate/ <br>
<a name="Redfin"></a>5.[^](#Redfin): Redfin Estimate. https://www.redfin.com/redfin-estimate <br>

