# Automated Valuation Model

An Automated Valuation Model (AVM) is a service that combines mathematical modeling with databases of existing properties and transactions to calculate real estate values. Most AVMs compare the values of similar properties at the same point in time. Many appraisers, and even Wall Street institutions, use this type of model to value residential properties (see [What is an AVM](https://www.investopedia.com/terms/a/automated-valuation-model.asp) on Investopedia).
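
The comparison of similar properties can be sketched in a few lines. This is only a minimal illustration, not how any commercial AVM works: the comparable sales, the `estimate_value` helper and the price-per-square-metre rule are all assumptions made for the example.

```python
from statistics import median

# Hypothetical recent sales near the subject property: (sale price, floor area in m²).
comparables = [
    (140_000, 84.0),
    (141_000, 77.0),
    (162_000, 102.0),
]


def estimate_value(subject_area_m2, comps):
    """Estimate a value as the median comparable price per m² times the subject's area."""
    price_per_m2 = [price / area for price, area in comps]
    return median(price_per_m2) * subject_area_m2


# Value a hypothetical 90 m² property against the three comparables.
estimate = estimate_value(90.0, comparables)
print(round(estimate))
```

Using the median rather than the mean keeps a single unusual sale from dominating the estimate; real AVMs typically also adjust for location, condition and sale date.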


#### Why do real estate businesses use AVMs?
- Real estate companies often use AVMs to capture the contact information of potential home sellers. Home sellers are valuable leads in the real estate industry, and the assumption is that people who look up the value of a home may want to sell it. Many solution providers sell versions of an AVM (the best-known example is the Zillow Zestimate). Companies embed those AVMs on their websites or pages to identify people in their area who are likely to enter the real estate market.


#### Popular Commercial AVMs
* [Zestimate](https://www.zillow.com/zestimate/) - The Zestimate® home valuation model is Zillow’s estimate of a home's market value. The Zestimate incorporates public and user-submitted data, taking into account home facts, location and market conditions.

* [CoreLogic](https://www.corelogic.com/landing-pages/automated-valuation-models.aspx) - CoreLogic® is the chosen AVM provider for 8 of the top 10 U.S. mortgage lenders.

* [House Canary](https://www.housecanary.com/products/data-points/) - We compute instant valuations spanning property and land values, home equity, and more, and report on the data density behind our conclusions. Gain speed and reduce errors with values and context exactly when and where you need them most.

* [Attom Data](https://www.attomdata.com/data/analytics-derived-data/avm-property-valuations/) - Draws on a property database of more than 80 million homes across all 50 states, representing 99% of the US population, and on valuation software developed by Automated Valuation Model Analytics.

### Valuation Process
<img src="https://github.com/BlockchainClimateInstitute/microservice_price/raw/develop/notebooks/AVM/valuation_process.png" height="120" >

### Interesting GitHub Repositories related to AVMs
* [Zillow-Kaggle](https://github.com/junjiedong/Zillow-Kaggle/blob/master/README.md) - This repo tackles the first round of Zillow’s Home Value Prediction Competition, which challenges competitors to predict the log error between the Zestimate and the actual sale price of houses. Submissions are evaluated on the Mean Absolute Error between the predicted log error and the actual log error. The competition was hosted on Kaggle from May 2017 to October 2017, and the final private leaderboard was revealed after the evaluation period ended in January 2018.

* [AutomatedValuationModel](https://github.com/jayshah5696/AutomaticValuationModel/blob/master/notebooks/Final_notebook.ipynb) -  Automated valuation model (AVM) is the name given to a service that can provide real estate property valuations using mathematical modelling combined with a database. Most AVMs calculate a property’s value at a specific point in time by analyzing values of comparable properties. Some also take into account previous surveyor valuations, historica…

* [Lots more on Kaggle](https://www.kaggle.com/c/zillow-prize-1/notebooks) - The Zillow Prize competition, sponsored by Zillow, Inc. (“Sponsor”), is open to all individuals over the age of 18 at the time of entry. The competition contains two rounds, one public and one private. Each round has separate datasets, submission deadlines and instructions on how to participate. Capitalized terms used but not defined herein have the meanings assigned to them in the Zillow Prize competition Official Rules.
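
The evaluation metric described for the Zillow Prize entries above can be sketched as follows. The price pairs and the predicted log errors are invented for illustration, and the natural logarithm is assumed; only the definition (log of the Zestimate minus log of the sale price, scored by Mean Absolute Error) comes from the competition description.

```python
import math


def log_error(zestimate, sale_price):
    """Log error: log(Zestimate) - log(sale price), natural log assumed."""
    return math.log(zestimate) - math.log(sale_price)


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical (Zestimate, sale price) pairs.
pairs = [(210_000, 200_000), (95_000, 100_000), (300_000, 300_000)]
actual_log_errors = [log_error(z, s) for z, s in pairs]

# Hypothetical competitor predictions of those log errors.
predicted_log_errors = [0.03, -0.04, 0.01]

mae = mean_absolute_error(actual_log_errors, predicted_log_errors)
print(f"MAE of predicted log error: {mae:.4f}")
```

Note that competitors predict the error of the Zestimate rather than the price itself, so a positive log error means the Zestimate was above the sale price.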

### Other interesting articles
* [towardsdatascience.com](https://towardsdatascience.com/automated-valuation-model-how-it-works-in-real-estate-industry-8d082757e1ed) - Automated Valuation Model — How It Works in Real Estate Industry?

### How does it relate to BCI Risk Modeling?
<img src="https://github.com/BlockchainClimateInstitute/microservice_price/raw/develop/notebooks/AVM/bci_flowchart_2.png" height="120" >

### Interesting cross-over companies using AVM technology in the context of climate risk modeling
* [Jupiter Intelligence](https://jupiterintel.com) - Predicting Risk in a Changing Climate: Jupiter’s AI and Scientific Models Deliver Unrivaled Power

### Development Plan
- EDA on golden dataset (due by July 1st) - volunteers? email mike.casale@blockchainclimate.org
- Basic machine learning studies of different models (due by July 8th) - volunteers? email mike.casale@blockchainclimate.org
- Hyperparameter tuning & final analysis of machine learning studies (due by July 8th) - volunteers? email mike.casale@blockchainclimate.org
- Completed AVM modeling and pipeline + integrate with AWS microservice (due by Aug 1st) - volunteers? email mike.casale@blockchainclimate.org

# AutoML in EvalML

In [60]:
import glob
import json
import os
import urllib

import evalml
import numpy as np
import pandas as pd
import requests
from evalml.preprocessing import load_data

# Load the EPC_Price Sample

In [61]:
datapath = '../../data/processed/sample_EPC_Price_merged.csv'
data = pd.read_csv(datapath)
data = data.reset_index()
data

| | index | Postcode | PriceAddress | EpcAddress | JaroDistance | Price | PurchaseDate | PropertyType | New | Duration | ... | PotentialEnergyRating | CurrentEnergyEfficiency | PotentialEnergyEfficiency | EpcInspectionDate | GlazedArea | HabitableRooms | HeatedRooms | FlatStoreyCount | TotalFloorArea | FloorLevel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | WV69QH | 142, CODSALL ROAD, | 142, Codsall Road, | 1.00 | 437500 | 5/20/20 0:00 | D | N | F | ... | C | 55 | 75 | 6/24/13 | Normal | 8.0 | 8.0 | | 348.0 | NODATA! |
| 1 | 1 | WV108AB | 49, CARISBROOKE ROAD, | 49, Carisbrooke Road, Bushbury, | 0.94 | 140000 | 2/7/20 0:00 | S | N | F | ... | B | 81 | 85 | 6/17/16 | Normal | 5.0 | 4.0 | | 84.0 | NODATA! |
| 2 | 2 | WV46BJ | 56, GREENOCK CRESCENT, | 56, Greenock Crescent, | 1.00 | 102000 | 3/11/20 0:00 | F | N | L | ... | C | 78 | 79 | 5/2/12 | Normal | 3.0 | 3.0 | | 77.0 | 2nd |
| 3 | 3 | WV112QQ | 21, RYAN AVENUE, | 21, Ryan Avenue, | 1.00 | 141000 | 5/18/20 0:00 | S | N | F | ... | C | 62 | 79 | 10/30/19 | Normal | 5.0 | 5.0 | | 77.0 | NODATA! |
| 4 | 4 | WV147AP | 34, MARBURY DRIVE, | 34, Marbury Drive, | 1.00 | 122000 | 3/13/20 0:00 | S | N | L | ... | C | 75 | 75 | 9/29/11 | Normal | 3.0 | 3.0 | | 56.5 | NODATA! |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1116 | 1116 | WV22AW | 116 - 126, STEELHOUSE LANE, | 100, Steelhouse Lane, | 0.75 | 280000 | 5/26/20 0:00 | O | N | F | ... | A | 96 | 97 | 1/31/20 | NO DATA! | | | | 107.0 | NO DATA! |
| 1117 | 1117 | WV108RP | 55, PRIMROSE LANE, | 55, Primrose Lane, | 1.00 | 162000 | 2/7/20 0:00 | S | N | F | ... | B | 72 | 85 | 9/16/19 | Normal | 5.0 | 5.0 | | 102.0 | NODATA! |
| 1118 | 1118 | WV106BA | 12, THREE TUNS PARADE, | Flat, 12a, Three Tuns Parade, | 0.76 | 150000 | 6/24/20 0:00 | O | N | F | ... | C | 57 | 72 | 4/11/19 | Normal | 3.0 | 3.0 | | 67.0 | Ground |
| 1119 | 1119 | WV38NA | 201, CASTLECROFT ROAD, | 201, Castlecroft Road, | 1.00 | 470000 | 4/6/20 0:00 | D | N | F | ... | C | 56 | 75 | 5/22/14 | Normal | 7.0 | 7.0 | | 229.0 | NODATA! |


# Handle Date Features 

In [62]:
#handle datefields
from evalml.pipelines.components.transformers import DateTimeFeaturizer

dtf = DateTimeFeaturizer(features_to_extract = ["year", "month", "day_of_week", "hour"])
datefields = ['PurchaseDate','EpcInspectionDate']
data[datefields[0]] = pd.to_datetime(data[datefields[0]])
data[datefields[1]] = pd.to_datetime(data[datefields[1]])
Xdates = data[datefields]
dtf.fit(Xdates)
Xdates = dtf.transform(Xdates)
Xdates = Xdates.reset_index()
Xdates


| | index | PurchaseDate_year | PurchaseDate_month | PurchaseDate_day_of_week | PurchaseDate_hour | EpcInspectionDate_year | EpcInspectionDate_month | EpcInspectionDate_day_of_week | EpcInspectionDate_hour |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2020 | May | Wednesday | 0 | 2013 | June | Monday | 0 |
| 1 | 1 | 2020 | February | Friday | 0 | 2016 | June | Friday | 0 |
| 2 | 2 | 2020 | March | Wednesday | 0 | 2012 | May | Wednesday | 0 |
| 3 | 3 | 2020 | May | Monday | 0 | 2019 | October | Wednesday | 0 |
| 4 | 4 | 2020 | March | Friday | 0 | 2011 | September | Thursday | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1116 | 1116 | 2020 | May | Tuesday | 0 | 2020 | January | Friday | 0 |
| 1117 | 1117 | 2020 | February | Friday | 0 | 2019 | September | Monday | 0 |
| 1118 | 1118 | 2020 | June | Wednesday | 0 | 2019 | April | Thursday | 0 |
| 1119 | 1119 | 2020 | April | Monday | 0 | 2014 | May | Thursday | 0 |
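
To make the date handling concrete, here is a rough standard-library sketch of the same kind of extraction (year, month name, weekday name, hour) for a single timestamp. It is not the `DateTimeFeaturizer` implementation; the `date_features` helper and the month-first format string are assumptions chosen to match the `5/20/20 0:00` style seen in the sample.

```python
from datetime import datetime


def date_features(raw, fmt="%m/%d/%y %H:%M"):
    """Split a timestamp string into year, month name, weekday name and hour."""
    ts = datetime.strptime(raw, fmt)
    return {
        "year": ts.year,
        "month": ts.strftime("%B"),        # full month name, e.g. "May"
        "day_of_week": ts.strftime("%A"),  # full weekday name, e.g. "Wednesday"
        "hour": ts.hour,
    }


# First PurchaseDate value from the sample.
features = date_features("5/20/20 0:00")
print(features)
```

Because the purchase dates carry no real time of day, the hour feature is always 0 in this sample and is unlikely to help the model.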


In [63]:
data_t = pd.merge(data, Xdates, on='index')
data_t = data_t.drop(datefields,axis=1)
data_t = data_t.drop('index',axis=1)
t_datapath = datapath.replace('.csv','_t.csv')
data_t.to_csv(t_datapath,index=False)


In [64]:
from evalml.preprocessing import load_data
index_fields = ['Postcode','PriceAddress','EpcAddress','JaroDistance']
X, y = load_data(t_datapath,index=index_fields,target='Price')
X, y

             Number of Features
Categorical                  11
Numeric                      10

Number of training examples: 1121
Targets
130000    2.14%
140000    1.96%
125000    1.69%
90000     1.61%
155000    1.61%
          ...  
234500    0.09%
277500    0.09%
257000    0.09%
236500    0.09%
256000    0.09%
Name: Price, Length: 380, dtype: object


(                                                                                         PropertyType  \
 Postcode PriceAddress                    EpcAddress                         JaroDistance                
 WV69QH   142, CODSALL ROAD,              142, Codsall Road,                 1.00                    D   
 WV108AB  49, CARISBROOKE ROAD,           49, Carisbrooke Road, Bushbury,    0.94                    S   
 WV46BJ   56, GREENOCK CRESCENT,          56, Greenock Crescent,             1.00                    F   
 WV112QQ  21, RYAN AVENUE,                21, Ryan Avenue,                   1.00                    S   
 WV147AP  34, MARBURY DRIVE,              34, Marbury Drive,                 1.00                    S   
 ...                                                                                               ...   
 WV22AW   116 - 126, STEELHOUSE LANE,     100, Steelhouse Lane,              0.75                    O   
 WV108RP  55, PRIMROSE LANE,              55, 

# Find the relevant "real estate" related features

In [65]:

X


| Postcode | PriceAddress | EpcAddress | JaroDistance | PropertyType | New | Duration | CurrentEnergyRating | PotentialEnergyRating | CurrentEnergyEfficiency | PotentialEnergyEfficiency | GlazedArea | HabitableRooms | HeatedRooms | ... | TotalFloorArea | FloorLevel | PurchaseDate_year | PurchaseDate_month | PurchaseDate_day_of_week | PurchaseDate_hour | EpcInspectionDate_year | EpcInspectionDate_month | EpcInspectionDate_day_of_week | EpcInspectionDate_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WV69QH | 142, CODSALL ROAD, | 142, Codsall Road, | 1.00 | D | N | F | D | C | 55 | 75 | Normal | 8.0 | 8.0 | ... | 348.0 | NODATA! | 2020 | May | Wednesday | 0 | 2013 | June | Monday | 0 |
| WV108AB | 49, CARISBROOKE ROAD, | 49, Carisbrooke Road, Bushbury, | 0.94 | S | N | F | B | B | 81 | 85 | Normal | 5.0 | 4.0 | ... | 84.0 | NODATA! | 2020 | February | Friday | 0 | 2016 | June | Friday | 0 |
| WV46BJ | 56, GREENOCK CRESCENT, | 56, Greenock Crescent, | 1.00 | F | N | L | C | C | 78 | 79 | Normal | 3.0 | 3.0 | ... | 77.0 | 2nd | 2020 | March | Wednesday | 0 | 2012 | May | Wednesday | 0 |
| WV112QQ | 21, RYAN AVENUE, | 21, Ryan Avenue, | 1.00 | S | N | F | D | C | 62 | 79 | Normal | 5.0 | 5.0 | ... | 77.0 | NODATA! | 2020 | May | Monday | 0 | 2019 | October | Wednesday | 0 |
| WV147AP | 34, MARBURY DRIVE, | 34, Marbury Drive, | 1.00 | S | N | L | C | C | 75 | 75 | Normal | 3.0 | 3.0 | ... | 56.5 | NODATA! | 2020 | March | Friday | 0 | 2011 | September | Thursday | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| WV22AW | 116 - 126, STEELHOUSE LANE, | 100, Steelhouse Lane, | 0.75 | O | N | F | A | A | 96 | 97 | NO DATA! | | | ... | 107.0 | NO DATA! | 2020 | May | Tuesday | 0 | 2020 | January | Friday | 0 |
| WV108RP | 55, PRIMROSE LANE, | 55, Primrose Lane, | 1.00 | S | N | F | C | B | 72 | 85 | Normal | 5.0 | 5.0 | ... | 102.0 | NODATA! | 2020 | February | Friday | 0 | 2019 | September | Monday | 0 |
| WV106BA | 12, THREE TUNS PARADE, | Flat, 12a, Three Tuns Parade, | 0.76 | O | N | F | D | C | 57 | 72 | Normal | 3.0 | 3.0 | ... | 67.0 | Ground | 2020 | June | Wednesday | 0 | 2019 | April | Thursday | 0 |
| WV38NA | 201, CASTLECROFT ROAD, | 201, Castlecroft Road, | 1.00 | D | N | F | D | C | 56 | 75 | Normal | 7.0 | 7.0 | ... | 229.0 | NODATA! | 2020 | April | Monday | 0 | 2014 | May | Thursday | 0 |


# Construct the Price MicroService Pipeline

In [66]:
from evalml.pipelines import RegressionPipeline

class PriceMicroservicePipeline(RegressionPipeline):
    # Components run in this order: impute, featurize dates, one-hot encode, regress.
    component_graph = ['Imputer', 'DateTime Featurization Component', 'One Hot Encoder', 'XGBoost Regressor']
    custom_name = 'Price Microservice Pipeline'
    parameters = {
        'Imputer': {
            'categorical_impute_strategy': 'most_frequent',
            'numeric_impute_strategy': 'mean',
            'categorical_fill_value': None,
            'numeric_fill_value': None,
        },
        'DateTime Featurization Component': {},
        'One Hot Encoder': {
            'top_n': 10,
            'categories': None,
            'drop': None,
            'handle_unknown': 'ignore',
            'handle_missing': 'error',
        },
        'XGBoost Regressor': {
            'eta': 0.1,
            'max_depth': 6,
            'min_child_weight': 1,
            'n_estimators': 100,
        },
    }

# An empty dict leaves every component on its default parameters.
price_microservice_pipeline = PriceMicroservicePipeline({})
price_microservice_pipeline

PriceMicroservicePipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'DateTime Featurization Component':{}, 'One Hot Encoder':{'top_n': 10, 'categories': None, 'drop': None, 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'XGBoost Regressor':{'eta': 0.1, 'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 100},})
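
The `top_n` and `handle_unknown` settings of the One Hot Encoder step can be illustrated with a small plain-Python sketch. This is not EvalML's implementation: it assumes `top_n` keeps the most frequent categories of a column and that `'ignore'` maps unseen categories to an all-zero row; the helper names and the example column are hypothetical.

```python
from collections import Counter


def fit_top_n(values, top_n):
    """Return the top_n most frequent categories, most common first."""
    return [category for category, _ in Counter(values).most_common(top_n)]


def one_hot(value, categories):
    """Encode one value; a category not in the list yields an all-zero row."""
    return [1 if value == category else 0 for category in categories]


# Hypothetical PropertyType column.
property_types = ["S", "D", "S", "F", "S", "D", "O"]

# Keep only the two most frequent categories.
categories = fit_top_n(property_types, top_n=2)

# "F" was not kept, so it encodes to zeros instead of raising an error.
encoded = [one_hot(v, categories) for v in ["S", "D", "F"]]
print(categories, encoded)
```

Capping the number of encoded categories keeps rare values from adding many sparse columns, at the cost of lumping them together as "none of the above".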

In [67]:
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, test_size=0.2, random_state=0, regression=True)
X_train.shape, X_holdout.shape


((896, 21), (225, 21))
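
The 896/225 row counts follow from the 20% holdout share applied to 1121 rows. The sketch below reproduces that arithmetic with a shuffled index split; it is a stand-in, not the `split_data` implementation, and it assumes the holdout size is rounded up. The `shuffle_split` helper is hypothetical.

```python
import math
import random


def shuffle_split(n_samples, test_size, seed=0):
    """Shuffle row positions and split them into train and holdout index lists."""
    n_test = math.ceil(test_size * n_samples)  # assumption: holdout share is rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # seeded so the split is repeatable
    return indices[n_test:], indices[:n_test]


train_idx, holdout_idx = shuffle_split(1121, test_size=0.2)
print(len(train_idx), len(holdout_idx))
```

Fixing the seed matters here because the reported holdout score is only comparable between runs if the same rows are held out each time.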

In [68]:
price_microservice_pipeline.fit(X_train, y_train)

PriceMicroservicePipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'DateTime Featurization Component':{}, 'One Hot Encoder':{'top_n': 10, 'categories': None, 'drop': None, 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'XGBoost Regressor':{'eta': 0.1, 'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 100},})

In [89]:
feature_importance = price_microservice_pipeline.feature_importance
feature_importance.head(50)

| | feature | importance |
|---|---|---|
| 0 | CurrentEnergyRating_C | 0.252885 |
| 1 | PropertyType_D | 0.250919 |
| 2 | PropertyType_O | 0.149751 |
| 3 | FlatStoreyCount | 0.030948 |
| 4 | FloorLevel_Ground | 0.030853 |
| 5 | PurchaseDate_month_March | 0.023917 |
| 6 | TotalFloorArea | 0.022414 |
| 7 | PropertyType_S | 0.021111 |
| 8 | FloorLevel_1st | 0.009803 |
| 9 | EpcInspectionDate_month_March | 0.009327 |


In [69]:
price_microservice_pipeline.graph_feature_importance()

<img src="https://github.com/BlockchainClimateInstitute/microservice_price/raw/develop/reports/figures/fi.png" height="120" >

In [70]:
from evalml.model_understanding.graphs import graph_prediction_vs_actual

y_pred = price_microservice_pipeline.predict(X_holdout)
graph_prediction_vs_actual(y_holdout, y_pred, outlier_threshold=50)

In [71]:
price_microservice_pipeline.describe()

*******************************
* Price Microservice Pipeline *
*******************************

Problem Type: regression
Model Family: XGBoost
Number of features: 78

Pipeline Steps
1. Imputer
	 * categorical_impute_strategy : most_frequent
	 * numeric_impute_strategy : mean
	 * categorical_fill_value : None
	 * numeric_fill_value : None
2. DateTime Featurization Component
	 * features_to_extract : ['year', 'month', 'day_of_week', 'hour']
3. One Hot Encoder
	 * top_n : 10
	 * features_to_encode : None
	 * categories : None
	 * drop : None
	 * handle_unknown : ignore
	 * handle_missing : error
4. XGBoost Regressor
	 * eta : 0.1
	 * max_depth : 6
	 * min_child_weight : 1
	 * n_estimators : 100


In [72]:
print(price_microservice_pipeline.score(X_holdout, y_holdout, objectives=['MAE']))

OrderedDict([('MAE', 41639.26171875)])


In [73]:
save_path = '../../models/price_microservice_pipeline.pkl'
price_microservice_pipeline.save(save_path)

In [74]:
!evalml info

numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
EvalML version: 0.15.0
EvalML installation directory: /usr/local/anaconda3/envs/microservice_price_env/lib/python3.7/site-packages/evalml

SYSTEM INFO
-----------
python: 3.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 19.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
# of CPUS: 8
Available memory: 5.3G
unclosed file <_io.TextIOWrapper name='/usr/local/anaconda3/envs/mi

In [75]:
from evalml.pipelines import RegressionPipeline

price_microservice_pipeline = RegressionPipeline.load(save_path)
price_microservice_pipeline

PriceMicroservicePipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'DateTime Featurization Component':{}, 'One Hot Encoder':{'top_n': 10, 'categories': None, 'drop': None, 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'XGBoost Regressor':{'eta': 0.1, 'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 100},})

In [76]:
price_microservice_pipeline.predict(X)

0       426248.250000
1       151059.453125
2        82150.953125
3       142689.812500
4       127317.421875
            ...      
1116    279921.843750
1117    169645.578125
1118    153830.046875
1119    496080.750000
1120    214928.062500
Length: 1121, dtype: float32
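
As a worked check, the first five predictions above can be compared with the first five sale prices from the loaded sample. This sketch assumes the prediction order matches the row order of the sample; the percentage error is an extra illustration, not a metric reported by the pipeline.

```python
# First five sale prices and predictions as displayed in this README.
actual = [437_500, 140_000, 102_000, 141_000, 122_000]
predicted = [426_248.25, 151_059.453125, 82_150.953125, 142_689.8125, 127_317.421875]

# Absolute error per property, and the same error as a share of the sale price.
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
pct_errors = [err / a for err, a in zip(abs_errors, actual)]

mae = sum(abs_errors) / len(abs_errors)
print(f"MAE over five rows: {mae:,.2f}")
print(f"Largest percentage error: {max(pct_errors):.1%}")
```

These rows come from the full dataset, most of which was used for training, so their error is expected to look better than the holdout MAE of about 41,639 reported earlier.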

<img src="https://github.com/BlockchainClimateInstitute/microservice_price/raw/develop/reports/figures/pd_area.png" height="120" >

In [84]:
from evalml.model_understanding.graphs import graph_partial_dependence
graph_partial_dependence(price_microservice_pipeline, X_holdout, feature='TotalFloorArea')

<img src="https://github.com/BlockchainClimateInstitute/microservice_price/raw/develop/reports/figures/pd_hr.png" height="120" >

In [85]:
graph_partial_dependence(price_microservice_pipeline, X_holdout, feature='HabitableRooms')


There are null values in the features, which will cause NaN values in the partial dependence output. Fill in these values to remove the NaN values.
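
One way to act on this warning is to fill the gaps before plotting. The sketch below mirrors the pipeline's Imputer settings (mean for numeric columns, most frequent value for categorical ones) in plain Python; it is an illustration rather than the EvalML Imputer itself, and the helper names and example columns are hypothetical.

```python
from collections import Counter
from statistics import mean


def impute_mean(values):
    """Replace None in a numeric column with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_most_frequent(values):
    """Replace None in a categorical column with the most common observed value."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Hypothetical columns with one missing entry each.
habitable_rooms = [8.0, 5.0, 3.0, None, 5.0]
glazed_area = ["Normal", None, "Normal", "More Than Typical", "Normal"]

filled_rooms = impute_mean(habitable_rooms)
filled_glazing = impute_most_frequent(glazed_area)
print(filled_rooms)
print(filled_glazing)
```

In practice the fill values should be computed on the training split only and then reused on the holdout rows, so that no holdout information leaks into the features.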



<img src="reports/figures/pd_ee.png" height="120" >

In [87]:
graph_partial_dependence(price_microservice_pipeline, X_holdout, feature='PotentialEnergyEfficiency')

In [81]:
price_microservice_pipeline.input_feature_names

{'Imputer': ['PropertyType',
  'New',
  'Duration',
  'CurrentEnergyRating',
  'PotentialEnergyRating',
  'CurrentEnergyEfficiency',
  'PotentialEnergyEfficiency',
  'GlazedArea',
  'HabitableRooms',
  'HeatedRooms',
  'FlatStoreyCount',
  'TotalFloorArea',
  'FloorLevel',
  'PurchaseDate_year',
  'PurchaseDate_month',
  'PurchaseDate_day_of_week',
  'PurchaseDate_hour',
  'EpcInspectionDate_year',
  'EpcInspectionDate_month',
  'EpcInspectionDate_day_of_week',
  'EpcInspectionDate_hour'],
 'DateTime Featurization Component': ['PropertyType',
  'New',
  'Duration',
  'CurrentEnergyRating',
  'PotentialEnergyRating',
  'CurrentEnergyEfficiency',
  'PotentialEnergyEfficiency',
  'GlazedArea',
  'HabitableRooms',
  'HeatedRooms',
  'FlatStoreyCount',
  'TotalFloorArea',
  'FloorLevel',
  'PurchaseDate_year',
  'PurchaseDate_month',
  'PurchaseDate_day_of_week',
  'PurchaseDate_hour',
  'EpcInspectionDate_year',
  'EpcInspectionDate_month',
  'EpcInspectionDate_day_of_week',
  'EpcInspecti