# Hedonic Pricing

We often try to predict the price of an asset from its observable characteristics. This is generally called **hedonic pricing**: How do the unit's characteristics determine its market price?

In the lab folder, there are three options: housing prices in `pierce_county_house_sales.csv`, car prices in `cars_hw.csv`, and Airbnb rental prices in `airbnb_hw.csv`. If you know of another suitable dataset, feel free to use it instead.

1. Clean the data and perform some EDA and visualization to get to know the dataset.
2. Transform your variables --- particularly categorical ones --- for use in your regression analysis.
3. Implement an ~80/~20 train-test split. Put the test data aside.
4. Build some simple linear models that include no transformations or interactions. Fit them, and determine their RMSE and $R^2$ on both the training and test sets. Which of your models performs best?
5. Include transformations and interactions, and build a more complex model that reflects your ideas about how the features of the asset determine its value. Determine its RMSE and $R^2$ on the training and test sets. How does the more complex model you build compare to the simpler ones?
6. Summarize your results from 1 to 5. Have you learned anything about overfitting and underfitting, or model selection?
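
As a rough illustration of steps 2 to 4, the sketch below one-hot encodes a categorical column, makes an approximately 80/20 split, fits ordinary least squares, and reports RMSE and $R^2$ on both sets. Everything in it is an assumption made for illustration: the data are synthetic, and the names `toy`, `square_feet`, `price`, `rmse`, and `r_squared` do not come from the lab files. It uses plain NumPy for the fit; a library such as scikit-learn would work as well.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; not drawn from any of the lab datasets.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "square_feet": rng.uniform(500, 4000, size=n),
    "exterior": rng.choice(["Frame Siding", "Frame Vinyl", "Masonry"], size=n),
})
# Hypothetical price process: linear in size plus a premium for one exterior type.
toy["price"] = (
    50_000
    + 150 * toy["square_feet"]
    + 40_000 * (toy["exterior"] == "Masonry")
    + rng.normal(0, 30_000, size=n)
)

# Step 2: one-hot encode the categorical column, dropping one level as the baseline.
X = pd.get_dummies(toy[["square_feet", "exterior"]], columns=["exterior"], drop_first=True)
X.insert(0, "intercept", 1.0)
X = X.to_numpy(dtype=float)
y = toy["price"].to_numpy()

# Step 3: shuffle the row indices and hold out roughly 20% as a test set.
idx = rng.permutation(n)
n_train = int(0.8 * n)
train_idx, test_idx = idx[:n_train], idx[n_train:]

# Step 4: fit ordinary least squares on the training rows only.
beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Report both metrics on the training rows and on the held-out rows.
for name, rows in [("train", train_idx), ("test", test_idx)]:
    pred = X[rows] @ beta
    print(name, "RMSE:", round(rmse(y[rows], pred)), "R^2:", round(r_squared(y[rows], pred), 3))
```

Comparing the train and test rows of the output is the basic check for overfitting: a model whose test RMSE is much worse than its training RMSE is fitting noise in the training data.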


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv('./pierce_county_house_sales.csv')
df.head()

In [None]:
df_sorted = df.sort_values('house_square_feet', ascending=True)

df_sorted

In [None]:
df = df.drop(['attic_finished_square_feet', 'basement_square_feet', 'attached_garage_square_feet', 'detached_garage_square_feet', 'view_quality', 'waterfront_type', 'utility_sewer', 'roof_cover'], axis=1)

In [None]:
# Drop rows whose house_square_feet is recorded as exactly 1
df = df[df['house_square_feet'] != 1]

In [None]:
df.loc[:,['sale_price','exterior'] ].groupby('exterior').describe()

In [None]:
df.loc[:,['sale_price','interior'] ].groupby('interior').describe()

Based on this analysis, I can drop the houses that do not have a typical exterior. Any exterior type with fewer than 40 cases is unlikely to be useful for analysis.

In [None]:
exterior_to_drop = ["Cedar A-Frame", "Cedar Finished Cabin", "Cedar Unfinished Cabin", "Frame Hardboard", "Frame Rustic Log", "Log", "Pine A-Frame", "Pine Finished Cabin", "Pine Unfinished Cabin", "Unfinished Cottage"]
df = df[~df.exterior.isin(exterior_to_drop)]
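
The hard-coded list above could also be derived from the category counts, which avoids retyping names if the threshold changes. The sketch below is a minimal illustration on made-up data: the frame `toy`, its counts, and the threshold `min_cases = 3` are hypothetical stand-ins (the cutoff used above is 40).

```python
import pandas as pd

# Toy data for illustration only; the counts and prices are made up.
toy = pd.DataFrame({
    "exterior": ["Frame Siding"] * 5 + ["Frame Vinyl"] * 4 + ["Log"] * 1,
    "sale_price": range(100_000, 200_000, 10_000),
})

min_cases = 3  # hypothetical threshold for the toy data; the notebook uses 40

# Count rows per category, then collect the categories below the threshold.
counts = toy["exterior"].value_counts()
rare_levels = counts[counts < min_cases].index.tolist()

# Keep only rows whose category is not in the rare list.
filtered = toy[~toy["exterior"].isin(rare_levels)]

print(rare_levels)
print(filtered["exterior"].value_counts())
```

On the real data, the same pattern would be applied to `df['exterior']` with the threshold set to 40.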

In [None]:
df.loc[:,['sale_price','exterior'] ].groupby('exterior').describe()

In [None]:
sns.kdeplot(x=df['sale_price'], hue=df['exterior'])
plt.show()

sns.kdeplot(x=np.log(df['sale_price']), hue=df['exterior'])
plt.show()

In [None]:
# Keep only rows whose exterior is neither Frame Siding nor Frame Vinyl
dropframe = df[~df.exterior.isin(["Frame Siding", "Frame Vinyl"])]

In [None]:
sns.kdeplot(x=dropframe['sale_price'], hue=dropframe['exterior'])
plt.show()

sns.kdeplot(x=np.log(dropframe['sale_price']), hue=dropframe['exterior'])
plt.show()

In [None]:
df.plot.scatter(x='house_square_feet', y='sale_price')