# Predicting the performance of the Zestimate model

As most of you know, Zillow has become a major force in real estate. Their model "Zestimate" aims to predict housing prices based on a variety of features. In this dataset of many properties in southern California, they provide how well the model predicted house prices. Specifically, they calculated the (log) error between their model estimate and the actual price given by:

$logerror = log(Zestimate) - log(actual)$

This means that if the logerror was greater than 0, the Zestimate overestimated the actual prices, and if it was less than 0 the Zestimate underestimated the price. The goal of this project is to write a neural network that will predict whether the Zestimate will over or underpredict the price given features of other houses. 

In [5]:
%matplotlib inline

In [6]:
# add more imports as you need
import os
import sys
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sklearn.model_selection as ms
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD  
from keras.utils import to_categorical
from sklearn.metrics import accuracy_score

In [7]:
# set the paths for your system

data_dir ='/Users/xli/'

## Pandas data I/O

Unfortunately, in this session we don't have the time to go over the pandas module, but suffice it to say that it is a very popular package for managing data. We will use pandas.read_csv() to bring in the data. For those of you familiar with R, pandas uses DataFrames as well. 

Pandas usually comes default with Anaconda. If you don't have it, you can install it using the following:

`conda install pandas`

or

`pip install pandas`

We provide the function call for loading in the data below. It is the bare-bones function call because I have already cleaned the data for you.


In [8]:
train_df = pd.read_csv(os.path.join(data_dir, "final_project_data.csv"))

In [9]:
# We can very quickly look at the data using the following:
train_df.head()

Unnamed: 0,parcelid,bathroomcnt,bedroomcnt,calculatedbathnbr,calculatedfinishedsquarefeet,finishedsquarefeet12,fips,fullbathcnt,latitude,longitude,...,regionidcounty,regionidzip,roomcnt,yearbuilt,structuretaxvaluedollarcnt,taxvaluedollarcnt,landtaxvaluedollarcnt,taxamount,censustractandblock,logerror
0,10711738,3.0,4.0,3.0,2538.0,2538.0,6037.0,3.0,34220381.0,-118620802.0,...,3101.0,96339.0,0.0,1978.0,245180.0,567112.0,321932.0,7219.18,60371130000000.0,0.0276
1,10711755,3.0,3.0,3.0,1589.0,1589.0,6037.0,3.0,34222040.0,-118622240.0,...,3101.0,96339.0,0.0,1959.0,254691.0,459844.0,205153.0,6901.09,60371130000000.0,-0.0182
2,10711805,2.0,3.0,2.0,2411.0,2411.0,6037.0,2.0,34220427.0,-118618549.0,...,3101.0,96339.0,0.0,1973.0,235114.0,384787.0,149673.0,4876.61,60371130000000.0,-0.1009
3,10711816,2.0,4.0,2.0,2232.0,2232.0,6037.0,2.0,34222390.0,-118618631.0,...,3101.0,96339.0,0.0,1973.0,262309.0,437176.0,174867.0,5560.07,60371130000000.0,-0.0121
4,10711858,2.0,4.0,2.0,1882.0,1882.0,6037.0,2.0,34222544.0,-118617961.0,...,3101.0,96339.0,0.0,1973.0,232037.0,382055.0,150018.0,4878.25,60371130000000.0,-0.0481


As mentioned before, we are simply interested in whether the Zestimate over or under-estimated the actual price. So we "binarize" our labels below by converting positive numbers to 1 and negative numbers to 0.

In [10]:
train_df["labels"]= train_df["logerror"].apply(lambda x: ( np.sign(x) +1) / 2 ) 
train_df = train_df.drop("logerror", axis=1)  # VERY important-- you'll probably get perfect classification if you forget to remove this

In [11]:
train_df_1=train_df.drop("parcelid",axis=1)
X = np.array(train_df_1.drop("labels",axis=1))  # our data is everything except the labels
y = train_df["labels"]  # our labels

In [12]:
y_vectorized = to_categorical(y)

You now have a dataset with which you can train your model. Specifically, you are trying to predict y based on X using a neural network with one or more hidden layers with an architecture of your choosing. Please use the tools from the previous exercises to export a Keras model "ZillowPredictor.h5" that predicts whether the Zestimate over or underestimates actual housing prices.

## Model

In [13]:
# Projecting the data to a lower dimension.
model1_layer_sizes = [X.shape[1],10,y_vectorized.shape[1]]

In [14]:
def build_model1(): 
    model = Sequential()

    model.add(Dense(input_dim=model1_layer_sizes[0],
                    units=model1_layer_sizes[1],
                    kernel_initializer="uniform",
                    activation="relu"))
    
  
    model.add(Dense(units=model1_layer_sizes[-1], # last layer
                    kernel_initializer='uniform',
                    activation="softmax"))
    
    sgd = SGD(lr=0.001, decay=1e-7, momentum=.9)  
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=["accuracy"])  
   
    return model

In [15]:
# Projecting the data to a higher dimension.
model2_layer_sizes = [X.shape[1],300,y_vectorized.shape[1]]

In [16]:
def build_model2(): 
    model = Sequential()

    model.add(Dense(input_dim=model2_layer_sizes[0],
                    units=model1_layer_sizes[1],
                    kernel_initializer="uniform",
                    activation="relu"))
    
  
    model.add(Dense(units=model2_layer_sizes[-1], # last layer
                    kernel_initializer='uniform',
                    activation="softmax"))
    
    sgd = SGD(lr=0.001, decay=1e-7, momentum=.9)  
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=["accuracy"])  
   
    return model

## Cross Validation for above two models

In [17]:
# define kfold here:
k=2
kf = ms.KFold(k, shuffle=True)

### Model 1

In [18]:
accuracies = []

for  train_idx, val_idx in kf.split(X):
    
    model1 = build_model1()
    
    model1.fit(X[train_idx],y_vectorized[train_idx],epochs=500,batch_size=10,verbose=0) 
    
    proba = model1.predict_proba(X[val_idx], batch_size=32)
    classes = np.argmax(proba, axis=1)
    
    accuracies.append(accuracy_score(y[val_idx], classes))


model1_accuracy = np.array(accuracies).mean()
print(model1_accuracy)

0.5018074550805204


### Model 2

In [19]:
accuracies = []

for  train_idx, val_idx in kf.split(X):
    
    model2 = build_model2()
    
    model2.fit(X[train_idx],y_vectorized[train_idx],epochs=500,batch_size=10,verbose=0) 
    
    proba = model2.predict_proba(X[val_idx], batch_size=32)
    classes = np.argmax(proba, axis=1)
    
    accuracies.append(accuracy_score(y[val_idx], classes))


model2_accuracy = np.array(accuracies).mean()
print(model2_accuracy)

0.5000793145524403


## A Model with Regularization

In [26]:
#Adding alpha=1e-4
from sklearn.neural_network import MLPClassifier
model3=MLPClassifier(activation='relu',solver='lbfgs', alpha=1e-4, hidden_layer_sizes=(300,), random_state=1,max_iter=500,batch_size=10,verbose=0)

## Cross Validation for Regularized models

In [27]:

accuracies = []

for  train_idx, val_idx in kf.split(X):
    
    model3.fit(X[train_idx],y_vectorized[train_idx]) 
    
    proba = model3.predict_proba(X[val_idx])
    classes = np.argmax(proba, axis=1)
    
    accuracies.append(accuracy_score(y[val_idx], classes))


model3_accuracy = np.array(accuracies).mean()
print(model3_accuracy)

0.4421934952441695


# Summary
The first model is a one hidden layer model with 10 neurons which will project the data to a lower dimension to find out a classifier. The second model is a one hidden layer model with 300 neurons which will project the data to a higher dimension and might lead to overfitting problem and bad result on the testing set. The third model is the second model adding regularization which can penalize the overfitting problem.