Our client is a large Real Estate Investment Trust (REIT).
- They invest in houses, apartments, and condos (complexes of buildings) within a small county in New York state.
- As part of their business, they try to predict the fair transaction price of a property before it's sold.
- They do so to calibrate their internal pricing models and keep a pulse on the market.
The REIT has hired us to find a data-driven approach to valuing properties.
- They currently have an untapped dataset of transaction prices for previous properties on the market.
- The data was collected in 2016.
- Our task is to build a real-estate pricing model using that dataset.
- If we can build a model that predicts transaction prices with an average error under $70,000, our client will be very satisfied with the resulting model.
- Deliverable: Trained model file
- Win condition: Avg. prediction error < $70,000
- Model interpretability will be useful
- No latency requirement
For this project:
- The dataset has 1883 observations in the county where the REIT operates.
- Each observation is for the transaction of one property only.
- Each transaction was between $200,000 and $800,000.
- 'tx_price' - Transaction price in USD
Public records:
- 'tx_year' - Year the transaction took place
- 'property_tax' - Monthly property tax
- 'insurance' - Cost of monthly homeowner's insurance
Property characteristics:
- 'beds' - Number of bedrooms
- 'baths' - Number of bathrooms
- 'sqft' - Total floor area in square feet
- 'lot_size' - Total outside area in square feet
- 'year_built' - Year property was built
- 'basement' - Does the property have a basement?
- 'exterior_walls' - The material used for constructing walls of the house
- 'roof' - The material used for constructing the roof
Location convenience scores:
- 'restaurants' - Number of restaurants within 1 mile
- 'groceries' - Number of grocery stores within 1 mile
- 'nightlife' - Number of nightlife venues within 1 mile
- 'cafes' - Number of cafes within 1 mile
- 'shopping' - Number of stores within 1 mile
- 'arts_entertainment' - Number of arts and entertainment venues within 1 mile
- 'beauty_spas' - Number of beauty and spa locations within 1 mile
- 'active_life' - Number of gyms, yoga studios, and sports venues within 1 mile
Neighborhood demographics:
- 'median_age' - Median age of the neighborhood
- 'married' - Percent of neighborhood who are married
- 'college_grad' - Percent of neighborhood who graduated college
Schools:
- 'num_schools' - Number of public schools within district
- 'median_school' - Median score of the public schools within district, on a scale of 1 to 10
This is a regression problem: given the set of features above, we need to predict the transaction price of the property.
Since it is a regression problem, we will use the following regression metrics:
- Root Mean Squared Error (RMSE)
- R-squared
- Mean Absolute Error
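As a rough illustration of the three metrics above, the sketch below computes them with NumPy. The function name `regression_metrics` and the sample prices are made up for illustration, and it assumes the client's "average prediction error" is read as the Mean Absolute Error.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R-squared for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))          # average absolute miss, in USD
    rmse = np.sqrt(np.mean(errors ** 2))   # penalises large misses more heavily
    ss_res = np.sum(errors ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot             # share of variance explained

    return {"mae": mae, "rmse": rmse, "r2": r2}


# Illustrative (made-up) transaction prices and predictions, in USD.
y_true = [300_000, 450_000, 600_000]
y_pred = [320_000, 430_000, 650_000]

metrics = regression_metrics(y_true, y_pred)

# Assumption: the win condition is checked against MAE.
meets_win_condition = metrics["mae"] < 70_000
```

MAE is expressed in dollars, which makes it the most direct match for the win condition, while RMSE and R-squared give complementary views of large errors and overall fit.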
Engineered features:
- 'two_and_two' - Indicator variable that flags properties with 2 beds and 2 baths.
- Buyers may also take less interest in old properties, so another indicator variable flags properties built before 1980.
- Both property tax and insurance are amounts the property holder pays monthly.
- A new feature called 'during_recession' is created to indicate whether a transaction falls between 2010 and 2013.
- 'property_age' denotes the age of the property when it was sold, not how old it is today, since we want to predict the price at the time of sale.
- A school score feature is created as num_schools * median_school.
- Most machine learning algorithms cannot directly handle categorical features. Specifically, they cannot handle text values.
- Therefore, we need to create dummy variables for our categorical features.
- Dummy variables are a set of binary (0 or 1) features that each represent a single class from a categorical feature.
- Features such as 'property_tax', 'year_built', 'tx_year', and 'insurance' are removed.
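The feature engineering steps above could be sketched in pandas roughly as follows. This is a minimal sketch under several assumptions: the column names `built_before_1980` and `school_score` are hypothetical (the text does not name those features), the 2010 to 2013 range is treated as inclusive, only 'exterior_walls' and 'roof' are treated as text categories, and the sample rows are made up.

```python
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the engineered features, dummy encoding and column removal."""
    df = df.copy()

    # Indicator for properties with exactly 2 beds and 2 baths.
    df["two_and_two"] = ((df["beds"] == 2) & (df["baths"] == 2)).astype(int)

    # Indicator for older properties (hypothetical column name).
    df["built_before_1980"] = (df["year_built"] < 1980).astype(int)

    # Indicator for transactions in 2010-2013 (bounds assumed inclusive).
    df["during_recession"] = df["tx_year"].between(2010, 2013).astype(int)

    # Age of the property at the time of the transaction.
    df["property_age"] = df["tx_year"] - df["year_built"]

    # School score (hypothetical column name).
    df["school_score"] = df["num_schools"] * df["median_school"]

    # One binary (0/1) column per class of each categorical feature.
    df = pd.get_dummies(df, columns=["exterior_walls", "roof"], dtype=int)

    # Remove the raw columns listed above.
    return df.drop(columns=["property_tax", "year_built", "tx_year", "insurance"])


# Made-up rows for illustration only; category values are not from the dataset.
sample = pd.DataFrame({
    "beds": [2, 3],
    "baths": [2, 1],
    "tx_year": [2011, 2015],
    "year_built": [1975, 2005],
    "num_schools": [3, 2],
    "median_school": [8.0, 6.0],
    "exterior_walls": ["Brick", "Wood Siding"],
    "roof": ["Asphalt", "Composition Shingle"],
    "property_tax": [250, 300],
    "insurance": [80, 90],
})

features = engineer_features(sample)
```

The derived columns are computed before the raw columns are dropped, because 'property_age' and the recession flag both depend on 'tx_year' and 'year_built'.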