# King County Housing Regression Project

## Table Of Contents
<font size=3rem>
    
0 -**[ INTRO](#INTRODUCTION)<br>**
1 -**[ OBTAIN](#OBTAIN)**<br>
2 -**[ SCRUB](#SCRUB)**<br>
3 -**[ EXPLORE](#EXPLORE)**<br>
4 -**[ MODEL](#MODEL)**<br>
5 -**[ INTERPRET](#INTERPRET)**<br>
6 -**[ CONCLUSIONS & RECCOMENDATIONS](#Conclusions-&-Recommendations)<br>**
</font>
___

# INTRODUCTION

* Students: Cody Freese/Fennec Nightingale/Thomas Cornett
* Pace: Part time
* Instructor: Amber Yandow
* Blog post URL:

<p> In this notebook we're going to be using the OSEMN model to do OLS regression on housing data from King County in 2015. Here we'll be looking to answer questions like:</p><p> What factors impact the price of a home?</p>
<p> What factors impact the price of a home for different income levels?</p>
<p> If you're looking to move to king county, where is the best bang for your buck?</p>

## Import Tools

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import mlxtend

In [3]:
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.stats.api as sms
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib as mpl

In [4]:
from math import sin, cos, sqrt, atan2, radians
from sklearn import svm
from scipy.stats import zscore
from sklearn import linear_model
from statsmodels.formula.api import ols
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier, KNeighborsRegressor
from sklearn.model_selection import cross_val_predict, KFold, train_test_split
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from statsmodels.stats.outliers_influence import variance_inflation_factor

### Editing Our Settings 
We have too many columns to view normally, and it's difficult to get a good grasp of our data with how much is normally cut off. 


In [5]:
%matplotlib inline
plt.style.use('dark_background')
mpl.rcParams['lines.linewidth'] = 2
mpl.rcParams['lines.color'] = '#FBE122'
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

### Define functions 

In [6]:
def getClosest(home_lat: float, home_lon: float, dest_lat_series: 'series', dest_lon_series: 'series'):
    """Pass 1 set of coordinates and one latitude or longitude column you would like to compare it's distance to"""
    #radius of the earth in miles 
    r = 3963
    #setting variables to use to iterate through  
    closest = 100
    within_mile = 0
    i = 0
    #using a while loop to iterate over our data and calculate the distance between each datapoint and our homes 
    while i < dest_lat_series.size:
        lat_dist = radians(home_lat) - (dest_lat := radians(dest_lat_series.iloc[i]))
        lon_dist = radians(home_lon) - (radians(dest_lon_series.iloc[i]))
        a = sin(lat_dist / 2)**2 + cos(radians(home_lat)) * cos(radians(dest_lat)) * sin(lon_dist / 2)**2
        c = 2 * atan2(sqrt(a), sqrt(1 - a))
        c = r * c 
        #find the closest data to our homes by keeping our smallest (closest) value
        if (c < closest):
            closest = c
        #find all of the points that fall within one mile and count them 
        if (c <= 1.0):
            within_mile += 1
        i += 1
    return [closest, within_mile]

In [7]:
def plotcoef(model):
    """Takes in OLS results and returns a plot of the coefficients"""
    #make dataframe from summary of results 
    coef_df = pd.DataFrame(model.summary().tables[1].data)
    #rename your columns
    coef_df.columns = coef_df.iloc[0]
    #drop header row 
    coef_df = coef_df.drop(0)
    #set index to variables
    coef_df = coef_df.set_index(coef_df.columns[0])
    #change dtype from obj to float
    coef_df = coef_df.astype(float)
    #get errors
    err = coef_df['coef'] - coef_df['[0.025']
    #append err to end of dataframe 
    coef_df['errors'] = err
    #sort values for plotting 
    coef_df = coef_df.sort_values(by=['coef'])
    ## plotting time ##
    var = list(coef_df.index.values)
    #add variables column to dataframe 
    coef_df['var'] = var
    # define fig 
    fig, ax = plt.subplots(figsize=(8,5))
    #error bars for 95% confidence interval
    coef_df.plot(x='var', y='coef', kind='bar',
                ax=ax, fontsize=15, yerr='errors', color='#FBE122', ecolor = '#FBE122')
    #set title and label 
    plt.title('Coefficients of Features With 95% Confidence Interval', fontsize=20)
    ax.set_ylabel('Coefficients', fontsize=15)
    ax.set_xlabel(' ')
    #coefficients 
    ax.scatter(x= np.arange(coef_df.shape[0]),
              marker='+', s=50, 
              y=coef_df['coef'], color='#FBE122')
    plt.legend(fontsize= 15,frameon=True, fancybox=True, facecolor='black')
    return plt.show()

In [8]:
def make_ols(df, x_columns, target='price'):
    """Pass in a DataFrame & your predictive columns to return an OLS regression model """
    #set your x and y variables
    X = df[x_columns]
    y = df[target]
    # pass them into stats models OLS package
    ols = sm.OLS(y, X)
    #fit your model
    model = ols.fit()
    #display the model summarry
    display(model.summary())
    #plot the residuals 
    fig = sm.graphics.qqplot(model.resid, dist=stats.norm, line='r', alpha=.65, fit=True, markerfacecolor="y")
    plt.xlim(-2.5, 2.5)
    plt.ylim(-2.5, 2.5)
    #return model for later use 
    return 

# OBTAIN DATA
Here we'll be working with the King County housing data provided to us by FlatIron and data about schools in King County gathered by ArcGis. We'll be importing them via the Pandas library. 

In [9]:
#wrote up our data types to save on computer space and stop some of them from being inccorectly read as objs
kc_dtypes = {'id': int, 'date' : str,  'price': float, 'bedrooms' : int, 'bathrooms' : float, 'sqft_living': int, 'sqft_lot': int, 
             'floors': float, 'waterfront': float, 'view' : float, 'condition': float, 'grade': int, 'sqft_above': int, 
             'yr_built': int, 'yr_renovated': float, 'zipcode': float, 'lat': float, 'long': float}

In [10]:
kc_data = pd.read_csv(r'~\Documents\Flatiron\data\data\kc_house_data.csv', parse_dates = ['date'], dtype=kc_dtypes)
schools = pd.read_csv(r'~\Documents\Flatiron\data\data\Schools.csv')
foods = pd.read_csv(r'~\Documents\Flatiron\foods.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\fenne\\Documents\\Flatiron\\foods.csv'

In [11]:
kc_data = pd.read_csv(r'~\Documents\Flatiron\kc_data.csv')

In [None]:
foods = foods.loc[foods['lat'] != '[0.0]'].copy()
foods = foods.loc[foods['long'] != '[0.0]'].copy()
foods['lat'] = foods['lat'].astype(dtype=float)
foods['long'] = foods['long'].astype(dtype=float)

In [None]:
rest = foods.loc[foods['SEAT_CAP'] != 'Grocery']
groc = foods.loc[foods['SEAT_CAP'] == 'Grocery']

In [None]:
kc_dict = {}

In [None]:
i = 0
while i < kc_data['lat'].size:
    school = getClosest(kc_data['lat'].iloc[i], kc_data['long'].iloc[i], schools['LAT_CEN'], schools['LONG_CEN'])
    restaurant = getClosest(kc_data['lat'].iloc[i], kc_data['long'].iloc[i], rest['lat'], rest['long'])
    grocery = getClosest(kc_data['lat'].iloc[i], kc_data['long'].iloc[i], groc['lat'], groc['long'])
    kc_dict[i] = {
        "closest school": school[0],
        "schools within mile": school[1],
        "closest restaurant": restaurant[0],
        "restaurants within mile": restaurant[1],
        "closest grocery": grocery[0],
        "groceries within mile": grocery[1]}
    i += 1 

That's all of the data we need to start. Now we'll be adding the last of our data, merging in the distance between the schools and our homes. 

In [None]:
kc = pd.DataFrame.from_dict(kc_dict, orient='index')
kc_data = kc_data.merge(kc, left_index=True, right_index=True)

In [None]:
kc_data = kc_data.rename(columns ={'closest school': 'mi_2_scl', 'schools within mile': 'scls_in_mi', 'closest restaurant':'mi_2_rest', 
                          'restaurants within mile':'rest_in_mi','closest grocery': 'mi_2_groc', 'groceries within mile': 'groc_in_mi'})

Now let's take a look at our data to see what we are working with and what we might need to fix 

In [None]:
kc_data.isnull().sum()

# SCRUB
Cleaning up our data, filling NaN values, dropping unnecessary columns 

In [12]:
kc_data = kc_data.drop(['id', 'date'], 1)

In [13]:
#to use sqft basment later on we need to convert it to a float 
kc_data['sqft_basement'] = kc_data['sqft_basement'].replace({'?': 0})
kc_data['sqft_basement'] = kc_data['sqft_basement'].astype(dtype=float)

As we can see we have 3 columns with null values, after exploring them, it makes the most sense to fill the null values with zeros, which is what they had been using to indicate a column without anything anyways. 

In [14]:
kc_data = kc_data.fillna(0)

In [15]:
#Convert to integer for whole number year, not sure why it'll let us reassign it here but raise errors in dtypes
kc_data['yr_renovated'] = kc_data['yr_renovated'].astype('int')

### Add Dummy Variables<p>
Catagorical columns needs to be transformed so we can use them in our model.</p>
<p>
thankfully, the Pandas library has got us covered with pd.get_dummies()</p>

In [16]:
# fixing condition to be a good or bad, hoping that'll help get rid of the multicolinearity 
kc_data['condition'] = kc_data.condition.replace(to_replace = [1.0, 2.0, 3.0, 4.0, 5.0],  value= ['bad', 'bad', 'good', 'good', 'good'])

In [17]:
#we have 70 zipcodes and 120 years, it would add too much complexity to our data to increase it by 190 columns
# so instead, we're going to go through and bin them! 
zips = []
years = []


for zipcode in kc_data.zipcode:
    zips.append(zipcode)
for year in kc_data.yr_built:
    years.append(year)
    
zips = list(set(zips))
years = list(set(years))

zips.sort()
years.sort()

In [18]:
#will have to find a way to write this into a loop at some point, but, I can't figure out how to get .replace()
#to adequatley read lists of lists while also giving them unique names, so for now this works 
kc_data['zipcode'] = kc_data.zipcode.replace(to_replace = zips[0:5],  value= 'zip001t005')
kc_data['zipcode'] = kc_data.zipcode.replace(to_replace = zips[5:10], value= 'zip006t011')
kc_data['zipcode'] = kc_data.zipcode.replace(to_replace = zips[10:15], value= 'zip014t024')
kc_data['zipcode'] = kc_data.zipcode.replace(to_replace = zips[15:20], value= 'zip027t031')
kc_data['zipcode'] = kc_data.zipcode.replace(to_replace = zips[20:25], value= 'zip032t039')
kc_data['zipcode'] = kc_data.zipcode.replace(to_replace = zips[25:30], value= 'zip040t053')
kc_data['zipcode'] = kc_data.zipcode.replace(to_replace = zips[30:35], value= 'zip055t065')
kc_data['zipcode'] = kc_data.zipcode.replace(to_replace = zips[35:40], value= 'zip070t077')
kc_data['zipcode'] = kc_data.zipcode.replace(to_replace = zips[40:45], value= 'zip092t106')
kc_data['zipcode'] = kc_data.zipcode.replace(to_replace = zips[45:50], value= 'zip107t115')
kc_data['zipcode'] = kc_data.zipcode.replace(to_replace = zips[50:55], value= 'zip116t122')
kc_data['zipcode'] = kc_data.zipcode.replace(to_replace = zips[55:60], value= 'zip125t144')
kc_data['zipcode'] = kc_data.zipcode.replace(to_replace = zips[60:65], value= 'zip146t168')
kc_data['zipcode'] = kc_data.zipcode.replace(to_replace = zips[65:70], value= 'zip177t199')

In [19]:
#gonna do the same for year built by 20 years, will give us 6 new columns, may be illuminating 
kc_data['yr_built'] = kc_data.yr_built.replace(to_replace = years[0:20], value= 'thru20')
kc_data['yr_built'] = kc_data.yr_built.replace(to_replace = years[20:40], value= 'thru40')
kc_data['yr_built'] = kc_data.yr_built.replace(to_replace = years[40:60], value= 'thru60')
kc_data['yr_built'] = kc_data.yr_built.replace(to_replace = years[60:80], value= 'thru80')
kc_data['yr_built'] = kc_data.yr_built.replace(to_replace = years[80:100], value= 'thru2000')
kc_data['yr_built'] = kc_data.yr_built.replace(to_replace = years[100:120], value= 'thru2020')

In [20]:
# get dummies of our new variables 
dummys = ['zipcode', 'yr_built', 'condition', ]

for dummy in dummys:
    dumm = pd.get_dummies(kc_data[dummy], drop_first=True)
    kc_data = kc_data.merge(dumm, left_index=True, right_index=True)

#we're doing something unique to these variables so it wouldn't save us any time to put them into a loop
dumm = pd.get_dummies(kc_data['view'], prefix='view', drop_first=True, dtype=int)
kc_data = kc_data.merge(dumm, left_index=True, right_index=True)
dumm = pd.get_dummies(kc_data['grade'], prefix='gra', drop_first=True, dtype=int)
kc_data = kc_data.merge(dumm, left_index=True, right_index=True)

In [21]:
#break up variables into diverse ranges & renaming our dummies so that they'r easier to interpret 
kc_data = kc_data.rename({'view_1.0': 'view1', 'view_2.0': 'view2', 'view_3.0': 'view3', 'view_4.0':'view4'},axis=1)
kc_data = kc_data.rename({'gra_4': 'D', 'gra_5':'Cmin', 'gra_6':'C','gra_7':'Cpl', 'gra_8':'Bmin', 'gra_9':'B',
                          'gra_10':'Bpl', 'gra_11':'Amin', 'gra_12':'A', 'gra_13':'Apl'},axis=1)

# EXPLORE

Now that we have all of the data we'll need ready to go we can really start digging in and checking it out! 

## Histogram

In [22]:
hist = kc_data[['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view',
                'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 
                'lat', 'long', 'sqft_living15', 'sqft_lot15', 'mi_2_scl', 'scls_in_mi', 'mi_2_rest',
                'rest_in_mi', 'mi_2_groc', 'groc_in_mi']]
hist.hist(figsize=(15,15))
plt.tight_layout()

KeyError: "['scls_in_mi', 'mi_2_scl', 'groc_in_mi', 'mi_2_rest', 'mi_2_groc', 'rest_in_mi'] not in index"

## Scatter Matrix

In [None]:
fig = pd.plotting.scatter_matrix(kc_data,figsize=(16,16));

## Heatmap

In [None]:
fig, ax = plt.subplots(figsize=(25,20))
corr = kc_data.corr().abs().round(3)
mask = np.triu(np.ones_like(corr, dtype=np.bool))
sns.heatmap(corr, annot=True, mask=mask, cmap='Oranges', ax=ax)
plt.setp(ax.get_xticklabels(), 
         rotation=45, 
         ha="right",
         rotation_mode="anchor")
ax.set_title('Correlations')
fig.tight_layout()

We can see some colinearity between our features, it's best to either remove or transform them if we want to use them in our model

If we were to multiply by basement to try and get rid of the correlation, we'd be multiplying by a bunch of zeros and it wouldn't adequetly represent our data. By adding one to every 'sqft_basement' that is equal to zero, when we multiply if there are no basement values we still keep our 'sqft_above' values. 

In [None]:
kc_data['sqft_basement'] = kc_data['sqft_basement'].map(lambda x :  1 if x == 0 else x )

In [None]:
#getting rid of multicolinearity in sqftage 
kc_data['sqft_total'] = kc_data['sqft_living']*kc_data['sqft_lot']
kc_data['sqft_neighb'] = kc_data['sqft_living15']*kc_data['sqft_lot15']
kc_data['sqft_habitable'] = kc_data['sqft_above']*kc_data['sqft_basement']

In [None]:
#print columns we will be using going forward 
#make a copy of the dataframe holding only columns we'll be including
kc_data.columns
all_data = kc_data.copy()
kc_data = kc_data[['price', 'bedrooms', 'bathrooms', 'floors','waterfront', 
                   'yr_renovated', 'lat', 'long', 
                   'sqft_total', 'sqft_neighb', 'sqft_habitable', 
                   'good', 'view1', 'view2', 'view3', 'view4', 
                   'D', 'Cmin', 'C', 'Cpl', 'Bmin', 'B', 'Bpl', 'Amin', 
                   'zip006t011', 'zip014t024', 'zip027t031', 'zip032t039', 
                   'zip040t053', 'zip055t065', 'zip070t077', 'zip092t106', 
                   'zip107t115', 'zip116t122', 'zip125t144', 'zip146t168', 
                   'zip177t199', 
                   'thru2000', 'thru2020', 'thru40', 'thru60', 'thru80',
                   'mi_2_scl', 'scls_in_mi', 'mi_2_rest', 'rest_in_mi', 'mi_2_groc', 'groc_in_mi']].copy()

# MODEL

In [None]:
plt.style.use('seaborn')

## Initial Model on Price

In [None]:
lowtier = kc_data[kc_data.price <=300000]
midtier = kc_data[(kc_data.price > 300001) & (kc_data.price<=800000) ]
hightier = kc_data[kc_data.price >800000]

lowincome = ['bedrooms', 'bathrooms', 'floors', 'waterfront', 
          'yr_renovated', 'lat', 'long', 
          'sqft_total', 'sqft_neighb', 'sqft_habitable', 
          'good', 'view1', 'view2', 'view3', 'view4', 
          'D', 'Cmin', 'C', 'Cpl', 'Bmin', 'B', 'Bpl', 'Amin',  
          'zip006t011', 'zip014t024', 'zip027t031', 'zip032t039', 
          'zip040t053', 'zip055t065', 'zip070t077', 'zip092t106', 
          'zip107t115', 'zip116t122', 'zip125t144', 'zip146t168', 
          'zip177t199', 
          'thru2000', 'thru2020', 'thru40', 'thru60', 'thru80',
          'mi_2_scl', 'scls_in_mi', 'mi_2_rest', 'rest_in_mi', 'mi_2_groc', 'groc_in_mi']

mediumincome = ['bedrooms', 'bathrooms', 'floors', 'waterfront', 
          'yr_renovated', 'lat', 'long', 
          'sqft_total', 'sqft_neighb', 'sqft_habitable', 
          'good', 'view1', 'view2', 'view3', 'view4', 
          'D', 'Cmin', 'C', 'Cpl', 'Bmin', 'B', 'Bpl', 'Amin',  
          'zip006t011', 'zip014t024', 'zip027t031', 'zip032t039', 
          'zip040t053', 'zip055t065', 'zip070t077', 'zip092t106', 
          'zip107t115', 'zip116t122', 'zip125t144', 'zip146t168', 
          'zip177t199', 
          'thru2000', 'thru2020', 'thru40', 'thru60', 'thru80',
          'mi_2_scl', 'scls_in_mi', 'mi_2_rest', 'rest_in_mi', 'mi_2_groc', 'groc_in_mi']

highincome = ['bedrooms', 'bathrooms', 'floors', 'waterfront', 
          'yr_renovated', 'lat', 'long', 
          'sqft_total', 'sqft_neighb', 'sqft_habitable', 
          'good', 'view1', 'view2', 'view3', 'view4', 
          'D', 'Cmin', 'C', 'Cpl', 'Bmin', 'B', 'Bpl', 'Amin',  
          'zip006t011', 'zip014t024', 'zip027t031', 'zip032t039', 
          'zip040t053', 'zip055t065', 'zip070t077', 'zip092t106', 
          'zip107t115', 'zip116t122', 'zip125t144', 'zip146t168', 
          'zip177t199', 
          'thru2000', 'thru2020', 'thru40', 'thru60', 'thru80',
          'mi_2_scl', 'scls_in_mi', 'mi_2_rest', 'rest_in_mi', 'mi_2_groc', 'groc_in_mi']

In [None]:
price_tiers = [('low', lowtier, lowincome), 
               ('mid', midtier, mediumincome), 
               ('high', hightier, highincome)]

In [None]:
for name, tier, income in price_tiers:
    print(name.upper())
    make_ols(tier, income)

## Refinement
First we're going to start filtering out outliers, helping normalize our data should improve our model 

In [None]:
for col in ['price']:
    col_zscore = str(col + '_zscore')
    kc_data[col_zscore] = (kc_data[col] - kc_data[col].mean())/kc_data[col].std()
    kc_data = kc_data.loc[kc_data[col_zscore] < 2]
    kc_data = kc_data.loc[kc_data[col_zscore] > (-2)]
    kc_data = kc_data.drop(col_zscore, axis = 1)

In [None]:
plt.figure(figsize=(15,4))
plt.plot(kc_data['price'].value_counts().sort_index(), color='#FBE122')

In [None]:
for i in range(1,100):
    q = i / 100
    print('{} percentile: {}'.format(q, kc_data['price'].quantile(q=q)))

In [None]:
#in bedrooms, we can clearly see a single outlier that is likely just a typo 
kc_data[kc_data['bedrooms'] == 33]
# wouldn't be realistic for a house with 33 bedrooms to only have a sqft_living of 1620 and only 1 3/4 bathrooms so we will adjust to 3 
kc_data[kc_data['bedrooms'] == 33] = kc_data[kc_data['bedrooms'] == 33].replace(33,3)

In [None]:
# to fix other outliers we will explore our data and find cutoffs that seem reasonable 
kc_data = kc_data.loc[kc_data['sqft_total'] <= 1.000000e+09] 
kc_data = kc_data.loc[kc_data['sqft_total'] >= 400000]
kc_data = kc_data.loc[kc_data['sqft_neighb'] <= 1.000000e+09]
kc_data = kc_data.loc[kc_data['sqft_habitable'] >= 400000]
kc_data = kc_data.loc[kc_data['sqft_habitable'] <= 1.000000e+07]
kc_data =  kc_data.loc[kc_data['bathrooms'] >= 1]
kc_data =  kc_data.loc[kc_data['bathrooms'] <= 5]
kc_data =  kc_data.loc[kc_data['bedrooms'] <= 7]

# The Final Model

In [None]:
lowtier = kc_data[(kc_data.price >= 210000) & (kc_data.price <= 348000) ]
midtier = kc_data[(kc_data.price >= 348000) & (kc_data.price <= 480000) ]
uppermidtier = kc_data[(kc_data.price >= 480000) & (kc_data.price <= 640000) ]
hightier = kc_data[(kc_data.price >= 640000) & (kc_data.price <= 900000)]

In [None]:
lowincome = ['bathrooms', 'waterfront', 'lat', 'long',
             'sqft_total', 'sqft_habitable', 
             'view1', 'view2', 'view3', 
             'C', 'Cpl', 'Bmin', 'B',
             'zip040t053', 'zip055t065', 'zip092t106', 
             'zip107t115', 'zip146t168', 
             'groc_in_mi']

mediumincome = ['bathrooms',  'lat', 'long', 
                'sqft_habitable', 'view2',   
                'Cpl', 'Bmin', 'B', 'Bpl',   
                'zip006t011', 'zip014t024', 'zip032t039', 
                'zip055t065', 'zip070t077', 'zip092t106', 
                'zip177t199', 'rest_in_mi', 'groc_in_mi',
                'thru2000', 'thru2020', 'thru60', 'thru80']

uppermedincome = ['bathrooms',  'lat', 'sqft_habitable',   
                  'C', 'Bmin', 'B', 
                  'zip014t024', 'zip027t031', 'zip032t039', 
                  'zip070t077', 'zip125t144', 'zip146t168', 
                  'thru2000', 'thru2020', 'thru60', 'thru80']


highincome = ['bathrooms', 'floors', 'sqft_neighb', 
              'sqft_habitable', 'thru2020',
              'zip006t011', 'zip107t115',
              'zip116t122', 'zip177t199', 
              'mi_2_scl', 'scls_in_mi', 'mi_2_rest',
              'mi_2_groc', 'groc_in_mi']

In [None]:
price_tiers = [('low', lowtier, lowincome), 
               ('mid', midtier, mediumincome), 
               ('upmid', uppermidtier, uppermedincome),
               ('high', hightier, highincome)]

In [None]:
for name, tier, income in price_tiers:
    print(name.upper())
    make_ols(tier, income)

## Train Split Test - High Tier

In [None]:
#first step
high_data = hightier[['price', 'bathrooms', 'floors', 'sqft_neighb', 
                      'sqft_habitable', 'thru2020',
                      'zip006t011', 'zip107t115',
                      'zip116t122', 'zip177t199', 
                      'mi_2_scl', 'scls_in_mi', 'mi_2_rest',
                      'mi_2_groc', 'groc_in_mi']].copy()

training_data, testing_data = train_test_split(high_data, test_size=0.25, random_state=44)

In [None]:
#split columns
target = 'price'
predictive_cols = training_data.drop('price', 1).columns

In [None]:
high_model = make_ols(hightier, predictive_cols)
high_model.params.sort_values()

In [None]:
# predictions
y_pred_train = high_model.predict(training_data[predictive_cols])
y_pred_test = high_model.predict(testing_data[predictive_cols])
# then get the scores:
train_mse = mean_squared_error(training_data[target], y_pred_train)
test_mse = mean_squared_error(testing_data[target], y_pred_test)
print('Training MSE:', train_mse, '\nTesting MSE:', test_mse)
print('Training Error: $', sqrt(train_mse), '\nTesting Error:', sqrt(test_mse))
plotcoef(high_model)

In [None]:
fig = plt.figure(figsize=(15,8))
fig = sm.graphics.plot_regress_exog(high_model, "groc_in_mi", fig=fig)
plt.show()

In [None]:
fig = plt.figure(figsize=(15,8))
fig = sm.graphics.plot_regress_exog(high_model, "sqft_habitable", fig=fig)
plt.show()

## Train Split Test - Upper Medium Tier

In [None]:
#first step
upper_med_data = uppermidtier[['bathrooms',  'lat', 'sqft_habitable',   
                               'C', 'Bmin', 'B', 'price',
                               'zip014t024', 'zip027t031', 'zip032t039', 
                               'zip070t077', 'zip125t144', 'zip146t168', 
                               'thru2000', 'thru2020', 'thru60', 'thru80']].copy()

training_data, testing_data = train_test_split(upper_med_data,test_size=0.30, random_state=55)

In [None]:
#split columns
target = 'price'
predictive_cols = training_data.drop('price', 1).columns

In [None]:
uppmid_model = make_ols(training_data, predictive_cols)
uppmid_model.params.sort_values()

In [None]:
# predictions
y_pred_train = uppmid_model.predict(training_data[predictive_cols])
y_pred_test = uppmid_model.predict(testing_data[predictive_cols])
# then get the scores:
train_mse = mean_squared_error(training_data[target], y_pred_train)
test_mse = mean_squared_error(testing_data[target], y_pred_test)
print('Training MSE:', train_mse, '\nTesting MSE:', test_mse)
print('Training Error: $', sqrt(train_mse), '\nTesting Error:', sqrt(test_mse))
plotcoef(uppmid_model)

In [None]:
fig = plt.figure(figsize=(15,8))
fig = sm.graphics.plot_regress_exog(uppmid_model, "sqft_habitable", fig=fig)
plt.show()

## Train Split Test - Medium Tier

In [None]:
#first step
mid_data = midtier[['bathrooms',  'lat', 'long', 
                    'sqft_habitable', 'view2', 'price', 
                    'Cpl', 'Bmin', 'B', 'Bpl',   
                    'zip006t011', 'zip014t024', 'zip032t039', 
                    'zip055t065', 'zip070t077', 'zip092t106', 
                    'zip177t199', 'rest_in_mi', 'groc_in_mi',
                    'thru2000', 'thru2020', 'thru60', 'thru80']].copy()
training_data, testing_data = train_test_split(mid_data, test_size=0.30, random_state=70)

In [None]:
#split columns
target = 'price'
predictive_cols = training_data.drop('price', 1).columns

In [None]:
mid_model = make_ols(mid_data, predictive_cols)
mid_model.params.sort_values()

In [None]:
# predictions
y_pred_train = mid_model.predict(training_data[predictive_cols])
y_pred_test = mid_model.predict(testing_data[predictive_cols])
# then get the scores:
train_mse = mean_squared_error(training_data[target], y_pred_train)
test_mse = mean_squared_error(testing_data[target], y_pred_test)
print('Training MSE:', train_mse, '\nTesting MSE:', test_mse)
print('Training Error: $', sqrt(train_mse), '\nTesting Error:', sqrt(test_mse))
plotcoef(mid_model)

In [None]:
fig = plt.figure(figsize=(15,8))
fig = sm.graphics.plot_regress_exog(high_model, "sqft_habitable", fig=fig)
plt.show()

## Train Split Test - Low Income 

In [None]:
#first step
low_data = lowtier[['bathrooms', 'waterfront', 'lat', 'long',
                    'sqft_total', 'sqft_habitable', 
                    'view1', 'view2', 'view3', 
                    'C', 'Cpl', 'Bmin', 'B', 'price',
                    'zip040t053', 'zip055t065', 'zip092t106', 
                    'zip107t115', 'zip146t168', 
                    'groc_in_mi']].copy()

training_data, testing_data = train_test_split(low_data, test_size=0.25, random_state=66)

In [None]:
#split columns
target = 'price'
predictive_cols = training_data.drop('price', 1).columns

In [None]:
low_model = make_ols(low_data, predictive_cols)

## Interpret Low Income Model  

In [None]:
low_model.params.sort_values()

In [None]:
# predictions
y_pred_train = low_model.predict(training_data[predictive_cols])
y_pred_test = low_model.predict(testing_data[predictive_cols])
# then get the scores:
train_mse = mean_squared_error(training_data[target], y_pred_train)
test_mse = mean_squared_error(testing_data[target], y_pred_test)
print('Training MSE:', train_mse, '\nTesting MSE:', test_mse)
print('Training Error: $', round(sqrt(train_mse), 2), '\nTesting Error: $', round(sqrt(test_mse), 2))

In [None]:
plotcoef(low_model)

In [None]:
fig = plt.figure(figsize=(15,8), color = 'solarized_light')
fig = sm.graphics.plot_regress_exog(low_model, "lat", fig=fig)
plt.show()
#as you can see here, the farther west, towards the cities, you go, the more expensive homes become. 
#please note we are not missing data in those gaps, those represent bodies of water where few people live on small islands or in house boats 

In [None]:
fig = plt.figure(figsize=(15,8))
fig = sm.graphics.plot_regress_exog(low_model, "bathrooms", fig=fig)
plt.show()
# 6,000$ doesn't look like much compared to prices in the 450,000, but, an increase from 1 to 4 could add over 20k and possibly bump you up to a higher 
# grade, making your home worth even more. 

In [None]:
fig, ax = plt.subplots(figsize=(12,8))

sns.regplot(x='bathrooms', y='price', data=low_data, ci=95, marker='o', units=.25, color='green', scatter_kws={'s': 95})
ax.set(title='Bathooms & Price', 
       xlabel='Bathrooms', ylabel='Price', alpha =.75)

fig.tight_layout()

# Conclusions & Recommendations

## Observations
- Different incomes have different priorities when it comes to buying or selling a home. In short, Low income tends to put more priority on pragmatic space, While Middle Income homes tend to put more on location and grade, a trend that will increase with importance as you go up income brackets. Here are our best reccomendations. 

## Reccomendations 

### *Upper Class*
<p>
Buyer - If you want to live in Bill Gates neighborhood, near the waterfront, Downtown, or in one of Seattles Art Districts, you'll want to look for home built recently, as homes built after 2000 cost an average of 152,439.24 less. 
    
Seller - Add a loft, it's the most affordable and hip way to increase the number of stories you have and add an average of 115,554.79, while you're at it, we'd reccomend adding another bathroom as well, so long as you don't pass a 1:1 ratio with bedrooms. 
</p>

### *Upper Middle Class* 
<p>
Buyer - Move into a newer, more bland home. Homes built after 1960 cost around 25,000 less than more antique homes, but if they have a slightly above average design, they'll still cost a bit more. However, if you aim for a completely average home without any frills, you're likely to save around an additional 11,274.38, letting you buy an Upper Middle Class home for ~35k less overall. 
    
Seller - Make your homes look nicer by doing things like: adding a nice walkway, garden, some trim or fix up your roof; bringing your grade up to at least a B will increase your house worth by 30,798.13.
</p>

### *Middle Class* 
<p>

*Buyer* - Tech & buisness heavy neighborhoods like Downtown Seattle/Bellevue, and some of the surrounding areas, dont have very many grocery stores despite being very expensive, living just outside of this range will be cheaper by around a minimum of 11,437.45 and give you more access to grocery stores, while keeping you close enough for a short commute and generally being able to walk everywhere.
    
*Seller* - Add a balcony, bathroom, and other finishing touches to your home. Giving your home a view of something, and some extra features to increase your grade to a B+ will increase the value by over 56,745.55
</p>

### *Lower Class*  
<p>

*Buyer* - To save the most money & have easy access to grocery stores, it's best to live in more rural areas. Moving a few miles farther North or South of downtown Seattle/Bellevue, will save you 10-23k on average, and you can stay away from the city noise/dust while still being in a good distance for adequate public transit or to driving. 
    
*Seller* - In real estate many experts believe the best ratio is a minimum of 2 bathrooms for every 3 bedrooms, and that fitting your property to that ratio will likely increase your value significantly. Our numbers agree with this conclusion, so we reccomend you add a bathroom or two, so long as you don't exceed a 1:1 ratio for bedrooms and bathrooms, each bathroom will add around 6,685.28. 
</p>