# COGS 118A - Project Proposal

# Project Description

You will design and execute a machine learning project. There are a few constraints on the nature of the allowed project. 
- The problem addressed will not be a "toy problem" or "common training students problem" like mtcars, iris, palmer penguins etc.
- The dataset will have >1k observations and >5 variables. I'd prefer more like >10k observations and >10 variables. A general rule is that if you have >100x more observations than variables, your solution will likely generalize a lot better. The goal of training a supervised machine learning model is to learn the underlying pattern in a dataset in order to generalize well to unseen data, so choosing a large dataset is very important.

- The project will include a model selection and/or feature selection component where you will be looking for the best setup to maximize the performance of your ML system.
- You will evaluate the performance of your ML system using more than one appropriate metric
- You will be writing a report describing and discussing these accomplishments


Feel free to delete this description section when you hand in your proposal.

### Peer Review

You will all have an opportunity to look at the Project Proposals of other groups to fuel your creativity and get more ideas for how you can improve your own projects. 

Both the project proposal and project checkpoint will have peer review.

# Names


- Lingxi Li
- Jianghua Lu
- Yvonne Luo
- Man Kui Sit
- Robert Zhang

# Abstract 
The goal of our project is to estimate the sale price of a house from its features and condition. The dataset we chose contains houses described by these attributes together with their actual final sale price. The attributes include location, area, amenities (AC, pool, utilities), and time measurements (year built, year sold), among others. We will likely analyze only a subset of the features, since many of them are related and can be combined. We will likely use strategies such as one-hot encoding to make categorical data suitable for statistical analysis. We will develop several regression models to predict sale prices and evaluate them by comparing the predicted sale price with the actual sale price. Success will be measured by how close the predicted price is to the actual price.


# Background

The COVID-19 pandemic has changed many aspects of people’s lives. From global chip shortages to scarcity of everyday products, its effects have made it harder for people to return to their pre-pandemic lives. Beyond the obvious threat of the virus itself, one impact of the pandemic is that living costs keep rising. House prices are a particular example. A combination of stimulus, low mortgage rates, and high demand for housing driven by work-from-home arrangements has pushed people to buy new houses<a name="house price reason"></a>[<sup>[4]</sup>](#reason). As house prices keep climbing and driving up living expenses, predicting them is becoming a very relevant problem. Which factors contribute to a high sale price, how are those factors weighted, and, given a set of feature variables, is there a consistent and accurate method for predicting house prices? These questions have become more pressing in the current period, and we look to address them with a machine learning approach. 
The idea for this project stems from an ongoing Kaggle challenge<a name="house price pred"></a>[<sup>[5]</sup>](#pred) with over 4000 participants. The competition dates back to 2016, but the data is still valuable because the feature set is extensive and describes many aspects of a house. Related work by Varma et al. also addresses the problem of predicting house prices using regression techniques<a name="other paper"></a>[<sup>[6]</sup>](#paper). Our work aims to develop an effective regression model by comparing several popular ones and selecting the best performer.

# Problem Statement

The problem we are trying to solve is predicting the price of a house based on its condition and features such as location, area, home appliances, and year built (specific feature selection to be determined). This problem is quantifiable because all measurements are either originally numerical or can be encoded numerically using strategies such as one-hot encoding. It is measurable because the predicted price can be directly compared to the actual price, using the difference between the two, R^2, and Root Mean Square Error (RMSE) as metrics. It is replicable because the machine learning models can be repeatedly selected, trained, and tested on the same data in order to verify the result.
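As a minimal sketch of the one-hot encoding mentioned above, the snippet below replaces one categorical column with a 0/1 indicator column per category. The helper `one_hot_encode` is hypothetical and written only for illustration, the rows are made up, and the column names (`BldgType`, `GrLivArea`, `SalePrice`) are assumed to match the Kaggle data dictionary.

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    # Sort the categories so the output columns come out in a deterministic order.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other field unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Made-up rows for illustration only; these are not taken from the dataset.
houses = [
    {"BldgType": "1Fam", "GrLivArea": 1500, "SalePrice": 200_000},
    {"BldgType": "Duplex", "GrLivArea": 1200, "SalePrice": 150_000},
    {"BldgType": "1Fam", "GrLivArea": 1800, "SalePrice": 230_000},
]

encoded_houses = one_hot_encode(houses, "BldgType")
print(encoded_houses[1])
```

In the actual project, a library routine such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would likely do this job; the hand-written version only shows what the transformation produces.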


# Data

The dataset comes directly from Kaggle’s competition page. It is already split into a training set and a testing set of roughly equal size, each containing about 1.5k observations. The dataset has 80 variables per observation, one of which is the final sale price. The features cover a wide range of house attributes, from the year of sale to the garage type. Some key features include overall house condition, house type, building type, area, and year built. House condition, house type, and building type are categorical; area and year built are numerical. Because the dataset contains a large amount of categorical data and many overlapping features, a fair amount of preprocessing and feature selection will be needed. 
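One possible way to handle overlapping numerical features is a simple correlation filter, sketched below. This is an assumption about how the selection could be done, not a committed design: the helpers `pearson` and `drop_redundant`, the 0.95 threshold, the `TotalAreaFt` column, and all values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length, non-constant numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drop_redundant(features, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up columns; "TotalAreaFt" is a hypothetical near-duplicate of "GrLivArea".
features = {
    "GrLivArea": [1500, 1200, 1800, 2100, 950],
    "TotalAreaFt": [1510, 1190, 1820, 2080, 960],
    "YearBuilt": [2003, 1976, 1950, 2000, 1988],
}

selected = drop_redundant(features)
print(selected)
```

The filter is greedy and order-dependent: the first feature of a correlated pair is kept and the later one is dropped. Categorical overlaps would need a different association measure or manual inspection.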


# Proposed Solution

Our overall task is to predict future house prices based on current data. This is a traditional regression task. Our solution consists of two major parts: a feature selection step and a model selection step.

We consider feature selection necessary because the dataset contains a large number of variables per observation. Some features overlap heavily, such as dwelling style and dwelling type. We expect that removing redundant features will save computational cost, keep the model simple, and improve how well it generalizes. Not every house will have every feature recorded, which would make a model that depends on all of them hard to deploy in practice.

The second component of our solution is to develop a machine learning model that predicts house prices. A variety of regression models suit this task. From the basic least squares solution to deep neural networks, each popular model has its own advantages. In our work, we wish to explore the differences between some of the most popular models and seek to combine their strengths for the best result. The models we are considering are linear regression, random forest, boosting, and a multi-layer perceptron. Linear regression can serve as our benchmark because of its low complexity (logistic regression, by contrast, is a classification method and does not apply to a continuous target such as price). Random forest has been shown to perform well on tabular data. Boosting represents the power of combining multiple models for the best performance. For neural networks, we choose the multi-layer perceptron because our tabular data does not possess the feature locality property needed by models such as convolutional neural networks. We will evaluate all these models using both R^2 and Root Mean Square Error (RMSE), and select the best-performing model as our final solution. 
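The model selection step could follow a k-fold cross-validation loop like the sketch below. Everything here is illustrative: the data is synthetic, and the two candidates (a mean-only baseline and a one-feature least squares line) are simple stand-ins for the models named above. The helper names `fit_mean`, `fit_line`, and `cross_val_rmse` are hypothetical.

```python
import math
import random


def fit_mean(xs, ys):
    """Baseline: always predict the mean training price."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Least squares line y = intercept + slope * x for a single numeric feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def cross_val_rmse(fit, xs, ys, k=5):
    """Average validation RMSE over k folds."""
    indices = list(range(len(xs)))
    scores = []
    for fold in range(k):
        val_idx = set(indices[fold::k])  # every k-th sample forms the validation fold
        train_x = [xs[i] for i in indices if i not in val_idx]
        train_y = [ys[i] for i in indices if i not in val_idx]
        predict = fit(train_x, train_y)
        squared_errors = [(predict(xs[i]) - ys[i]) ** 2 for i in val_idx]
        scores.append(math.sqrt(sum(squared_errors) / len(squared_errors)))
    return sum(scores) / k


# Synthetic data: price grows roughly linearly with living area, plus noise.
rng = random.Random(0)
areas = [rng.uniform(800, 3000) for _ in range(200)]
prices = [50_000 + 100 * a + rng.gauss(0, 20_000) for a in areas]

candidates = {"mean baseline": fit_mean, "least squares": fit_line}
results = {name: cross_val_rmse(fit, areas, prices) for name, fit in candidates.items()}
best_model = min(results, key=results.get)
print(results, best_model)
```

The same loop extends to any model exposed as a `fit` function that returns a predictor. In practice, scikit-learn's `cross_val_score` or `GridSearchCV` would likely replace the hand-written version.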

# Evaluation Metrics

Since we are predicting house prices, which are a continuous target, we will use regression models. R^2, the coefficient of determination, is a popular goodness-of-fit metric for regression tasks. It is “the proportion of the variance in the dependent variable that is predictable from the independent variable”[1]. R^2 is at most 1, a value that corresponds to perfect predictions. A model that always predicts the mean value of y has R^2 = 0, and a model that does worse than this baseline has a negative R^2. In the context of our project, we interpret R^2 as the proportion of the variance in the target variable (selling prices) that can be explained by the features (utilities, house style, year built, etc.). Alongside R^2, we will report Root Mean Square Error (RMSE), which measures the typical size of the prediction error in the same units as the sale price.
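To make the metric concrete, here is a small sketch that computes R^2 as 1 - SS_res / SS_tot, together with RMSE, the second metric named in this proposal. The prices are made up, and the helper names `r_squared` and `rmse` are hypothetical.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


actual = [200_000, 150_000, 230_000, 180_000]  # made-up sale prices
good = [195_000, 155_000, 225_000, 185_000]    # each prediction is off by 5,000
mean_only = [190_000] * 4                      # always predicts the mean of `actual`
bad = [150_000, 230_000, 180_000, 200_000]     # worse than predicting the mean

print(r_squared(actual, good), rmse(actual, good))
print(r_squared(actual, mean_only))
print(r_squared(actual, bad))
```

The mean-only predictor scores exactly 0 and the `bad` predictor scores below 0, which shows that R^2 is bounded above by 1 but not below by 0. scikit-learn provides the same computation as `sklearn.metrics.r2_score`.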


# Ethics & Privacy

Our research question and hypothesis are designed to explore the house pricing dataset and train a model that makes good predictions from a subset of its features. The dataset we will be using is a public, open-source dataset on Kaggle, which gives us the right to perform data analysis and modeling on it. The dataset is anonymized, so we will not violate any personal rights regarding the properties. Our project repository is open to the COGS 118A staff members for any investigation and will undergo further peer review later in the project. Still, we will remain cautious and address any ethics- or privacy-related concerns that arise during our research. 


# Team Expectations 

- Team Expectation 1: We expect to communicate effectively (Discord, email) and help each other out if anyone is experiencing difficulties during the project.
- Team Expectation 2: We expect to complete scheduled tasks (individual or group) on time (before meetings) and have meaningful in-person or online discussions every time we meet.
- Team Expectation 3: We expect to make every decision as a group, and to consult the professor/TAs for clarification when in doubt.

# Project Timeline Proposal


| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 4/20 | 7 PM | Brainstorm topics and do background research on topics of interest. Read COGS 118A expectations. | Finalize the topic. Assign each person a part of the proposal. |
| 4/22 | 7 PM | Finish the assigned parts of the proposal. | Discuss and resolve difficulties and uncertainties about the proposal. Finalize the proposal together. |
| 4/26 | 7 PM | Review the Kaggle dataset and become familiar with its structure and content. | Review and edit the data. |
| 5/3 | 7 PM | Complete the dataset and data cleaning part. | Discuss data cleaning. |
| 5/10 | 7 PM | Consider how to analyze our data and come up with appropriate models. | Data analysis & results. |
| 5/13 | 7 PM | Complete analysis & results. | Complete the Checkpoint. |
| 6/1 | 8 PM | Complete the entire analysis. | Discuss/edit the full project. |
| 6/8 | Before 11:59 PM | Finish any leftover parts and proofread the entire project. | Discuss the whole project. Turn in Final Project & Group survey. |

# Footnotes
<a name="lorenznote"></a>1.[^](#lorenz): Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html<br> 
<a name="admonishnote"></a>2.[^](#admonish): Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.<br>
<a name="sotanote"></a>3.[^](#sota): Perhaps the current state of the art solution such as you see on [Papers with code](https://paperswithcode.com/sota). Or maybe not SOTA, but rather a standard textbook/Kaggle solution to this kind of problem

<a name="house price reason"></a>4.[^](#reason): Why House prices surged as the COVID-19 pandemic took hold. Dallasfed.org. (n.d.). Retrieved April 24, 2022, from https://www.dallasfed.org/research/economics/2021/1228.aspx  <br> 

<a name="house pred"></a>5.[^](#pred): House prices - advanced regression techniques. Kaggle. (n.d.). Retrieved April 24, 2022, from https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview/description  <br> 

<a name="other paper"></a>6.[^](#paper): A. Varma, A. Sarma, S. Doshi and R. Nair, "House Price Prediction Using Machine Learning and Neural Networks," 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), 2018, pp. 1936-1939, doi: 10.1109/ICICCT.2018.8473231.  <br> 

