# Final Model Analysis

Finally, it is time to analyze our final model. We will create a prediction function, examine our coefficients, and draw conclusions.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('../data/final_model.csv', header = None, names = ['names', 'coefficients']).set_index('names')
coef = df['coefficients'].to_dict()

## Prediction Function

To create our prediction function, we start with data in the same format as the dataset we were originally given, kc_house_data.csv. Because the prediction only looks up keys that appear in our coefficients dictionary, we don't need to remove the extra columns: any column without a coefficient is simply ignored. We do, however, need to convert all columns to numeric types, log transform the continuous features, and inverse log transform (np.exp) the predicted price at the end. The categorical columns are converted back to strings so that we can create dummy variables from them. Finally, we replace all NaN values with 0 and turn the row into a dictionary. We first check the function on a single row, then read the first 5 rows from kc_house_data.csv and try those.

In [3]:
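def predict(data):
    """Predict the sale price for one comma-separated row of kc_house_data.csv."""
    data = data.strip().split(',')
    columns = 'id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15'.split(',')
    df = pd.DataFrame(dict(zip(columns, data)), index=[0])
    # keep only the sale year from the month/day/year date
    df['yr_sold'] = df.date.map(lambda x: int(x.split('/')[-1]))
    df.drop('date', axis=1, inplace=True)
    # every remaining column is numeric; empty fields become NaN
    for col in df.columns:
        df[col] = pd.to_numeric(df[col])
    # a house that was never renovated counts from the year it was built
    df['yr_since_renovation'] = np.where(df['yr_renovated']==0.0, df['yr_sold']-df['yr_built'], df['yr_sold']-df['yr_renovated'])
    df['yr_since_built'] = df['yr_sold'] - df['yr_built']
    # categorical columns go back to strings so get_dummies encodes them
    categoricals = ['floors', 'condition', 'grade', 'zipcode']
    df = df.astype({col: 'str' for col in categoricals})
    df = pd.get_dummies(df)
    # coefficient names have no decimal point (floors_15 means 1.5 floors)
    df.columns = df.columns.str.replace('.', '', regex=False)
    df['renovated'] = df.yr_renovated.map(lambda x: 1 if x>0 else 0)
    # log transform the price and the continuous features
    continuous = ['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot']
    for col in continuous:
        df[col] = df[col].map(np.log)
    df.replace(np.nan, 0, inplace=True)
    data_dict = df.iloc[0].to_dict()

    # linear predictor on the log scale; features without a coefficient are
    # ignored, and 'Intercept' is not a feature, so the loop adds 0 for it
    prediction = coef['Intercept']
    for key, value in coef.items():
        prediction += value * data_dict.get(key, 0)
    # undo the log transform of the price
    prediction = np.exp(prediction)
    return round(prediction, 2)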

In [4]:
data = '7129300520,10/13/2014,221900.0,3,1.0,1180,5650,1.0,,0.0,3,7,1180,0.0,1955,0.0,98178,47.5112,-122.257,1340,5650'
predict(data)
# real price value: 221900.0

239005.82

In [5]:
# read the first five data rows from kc_house_data.csv and run the prediction function
with open('data/kc_house_data.csv') as f:
    f.readline()
    for i in range(5):
        print(predict(f.readline()))
# the real price values are: 221900.0, 538000.0, 180000.0, 604000.0, and 510000.0

239005.82
576006.35
239994.68
556154.46
471701.59
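
To put the five predictions next to the real prices, a quick sketch like the following computes the percentage error for each row. The `predicted` and `actual` lists are copied from the output and the comment above; the helper name `percent_error` is our own and is not part of the model code.

```python
# Predictions printed above and the real prices from the comment in the cell.
predicted = [239005.82, 576006.35, 239994.68, 556154.46, 471701.59]
actual = [221900.0, 538000.0, 180000.0, 604000.0, 510000.0]

def percent_error(pred, real):
    """Signed error of a prediction as a percentage of the real price."""
    return 100 * (pred - real) / real

errors = [percent_error(p, a) for p, a in zip(predicted, actual)]
for p, a, e in zip(predicted, actual, errors):
    print(f'predicted {p:>10.2f}  actual {a:>10.2f}  error {e:+.1f}%')

# Mean absolute percentage error over these five rows only.
mape = sum(abs(e) for e in errors) / len(errors)
print(f'mean absolute percentage error: {mape:.1f}%')
```

On these rows, four of the five predictions fall within about 8% of the real price, while the third overshoots by about a third. Five rows are too few to say anything about the model's overall accuracy.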


# Coefficient Analysis

In [6]:
coef

{'Intercept': -86.96327587506237,
 'bedrooms': -0.05790908383811544,
 'bathrooms': 0.062164516758676736,
 'sqft_living': 0.4837753401719039,
 'sqft_lot': 0.06663803910671842,
 'waterfront': 0.6027962254517414,
 'sqft_basement': -5.32076652954763e-05,
 'lat': 0.6472578402367359,
 'long': -0.5282742969088292,
 'yr_since_built': 0.00040867761814662815,
 'renovated': 0.06219396721553915,
 'floors_15': 0.015254015266802584,
 'floors_30': -0.0635517536782565,
 'condition_2': 0.18927766124862488,
 'condition_3': 0.3134553949417471,
 'condition_4': 0.3483349587318392,
 'condition_5': 0.3986555695222103,
 'grade_11': 0.13444823043057247,
 'grade_12': 0.17757036224366418,
 'grade_13': -1.1304037945325856e-14,
 'grade_4': -0.4158651581648968,
 'grade_5': -0.4607047689919184,
 'grade_6': -0.4109358560514862,
 'grade_7': -0.3326081907930973,
 'grade_8': -0.22083610555204264,
 'grade_9': -0.08208493742248679,
 'zipcode_98004': 0.9511542887595263,
 'zipcode_98005': 0.5854414838201202,
 'zipcode_98006

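bedrooms: -0.05790908383811544
- bedrooms is log transformed, so a 1% increase in the number of bedrooms decreases the sale price of a house by about .06%, holding the other features fixed

bathrooms: 0.062164516758676736
- bathrooms is log transformed, so a 1% increase in the number of bathrooms increases the sale price of a house by about .06%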

sqft_living: 0.4837753401719039
- a 1% increase in square footage living area increases the sale price of a house by .48%

sqft_lot: 0.06663803910671842
- a 1% increase in square footage lot area increases the sale price of a house by .07%

waterfront: 0.6027962254517414
- if the house is on the waterfront, its log price increases by 0.60, so its sale price increases by about 83% (np.exp(0.6028) - 1 ≈ 0.83)

sqft_basement: -5.32076652954763e-05
- sqft_basement is not log transformed, so each additional square foot of basement area decreases the sale price of a house by about .005%

lat: 0.6472578402367359
- if you move north, a 1 degree increase in latitude increases the sale price of a house by about 91% (np.exp(0.6473) - 1 ≈ 0.91)

long: -0.5282742969088292
- if you move east, a 1 degree increase in longitude decreases the sale price of a house by about 41% (np.exp(-0.5283) - 1 ≈ -0.41)

yr_since_built: 0.00040867761814662815
- a 1-year increase in the age of a house increases its sale price by .04%

renovated: 0.06219396721553915
- a house that has been renovated has its sale price increased by 6%

floors_15: 0.015254015266802584 and floors_30: -0.0635517536782565
- using a one-floor house as a baseline, a 1.5-floor house has its price increased by 1.5%, while a 3-floor house has its price decreased by about 6.2% (np.exp(-0.0636) - 1). Other numbers of floors are approximately equal in price to a 1-floor house.

condition_2: 0.18927766124862488, condition_3: 0.3134553949417471, condition_4: 0.3483349587318392, and condition_5: 0.3986555695222103
- using a condition of 1 as a baseline and converting each coefficient with np.exp(coefficient) - 1, a condition of 2 increases the price of a house by about 21%, a condition of 3 by about 37%, a condition of 4 by about 42%, and a condition of 5 by about 49%

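Using zipcode 98001 as a baseline, each listed zipcode multiplies the price of a house by np.exp of its coefficient. For small coefficients this is approximately 100 times the coefficient as a percentage, but for large ones the difference matters: zipcode_98004 (0.9512) corresponds to a price about 2.6 times that of a comparable house in 98001. The unlisted zipcodes are all approximately the same in price as the baseline. We do not interpret each zipcode individually here.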
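
Because the model predicts log price, a coefficient only approximates a percentage change when it is small. The sketch below shows the conversions we would use to check the readings above. The helper names `level_effect_pct` and `elasticity_effect_pct` are hypothetical, and the coefficient values are copied from the dictionary printed earlier.

```python
import math

def level_effect_pct(coefficient, change=1.0):
    """Percent change in price when an untransformed feature rises by `change`.

    In a log-price model the price ratio is exp(coefficient * change).
    """
    return 100 * (math.exp(coefficient * change) - 1)

def elasticity_effect_pct(coefficient, pct_increase=1.0):
    """Percent change in price when a log-transformed feature rises by `pct_increase` percent."""
    return 100 * ((1 + pct_increase / 100) ** coefficient - 1)

print(round(level_effect_pct(0.6027962254517414), 1))           # waterfront dummy
print(round(level_effect_pct(0.06219396721553915), 1))          # renovated dummy
print(round(elasticity_effect_pct(0.4837753401719039), 2))      # sqft_living, +1%
print(round(elasticity_effect_pct(0.4837753401719039, 10), 1))  # sqft_living, +10%
```

For a small coefficient such as renovated, the converted value stays close to the naive reading of about 6%, while for waterfront the gap between 60% and roughly 83% is large enough to change the conclusion.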

# Recommendations and Future Work

We attempted to add interaction features to our model, but our results indicated that they only decreased its accuracy. With more time, we could take a deeper look at these to find out why that is the case, and see whether other interactions could help our model.

Similarly, adding polynomial features could make our predictions more accurate. Trial-and-error would be needed to determine which features could be changed in this way to improve our model.
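
As a starting point for that trial and error, the sketch below adds squared (or higher) terms to a feature dictionary like the one built inside predict. The function name `add_polynomial_terms` and the `_pow2` naming scheme are hypothetical, and the model would have to be refit with matching column names before any of these terms had coefficients.

```python
def add_polynomial_terms(features, names, degree=2):
    """Return a copy of `features` with powers 2..degree added for each name in `names`.

    New keys look like 'sqft_living_pow2' (a hypothetical naming scheme).
    """
    expanded = dict(features)
    for name in names:
        for power in range(2, degree + 1):
            expanded[f'{name}_pow{power}'] = features[name] ** power
    return expanded

# Illustrative row with made-up values, not taken from the dataset.
row = {'sqft_living': 7.0, 'lat': 47.5}
print(add_polynomial_terms(row, ['sqft_living']))
```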

Using a mapping library could turn the longitude and latitude into more directly beneficial information, like distance to a school or grocery store. With more time, we could create new features using this information to add to our model.
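
One way to turn latitude and longitude into a distance feature is the haversine formula, sketched below without any mapping library. The point-of-interest coordinates are placeholders rather than a real school or store, and the function name `haversine_km` is our own.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, long) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# House coordinates from the first row above; the point of interest is a placeholder.
house = (47.5112, -122.257)
point_of_interest = (47.6, -122.3)
print(round(haversine_km(*house, *point_of_interest), 2))
```

A feature such as the distance to the nearest school would then be the minimum of this value over a list of school coordinates, which a mapping library or public dataset would have to supply.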

# Conclusions

Our final model will be useful in predicting sale prices of houses in King County. We can use these predictions to help our clients set the prices for their houses, and to find houses that are currently underpriced.