![for sale image, from https://time.com/5835778/selling-home-coronavirus/](https://api.time.com/wp-content/uploads/2020/05/selling-home-coronavirus.jpg?w=800&quality=85)

# Project Title

## Overview

We use housing data from King County, Washington. The dataset contains 21,597 entries, each representing a home sale. Each entry includes the sale price of the home along with other features, such as square footage of living area, overall condition, and zipcode.

## Business Problem

We have been hired by a developer to find out which factors indicate that a home will sell for a high price. They want to build single-family homes like the ones in our dataset and need to know which attributes to focus on. We treat the variables most strongly associated with sale price as the most promising ones for the developer to invest in.


## Data Understanding

Describe the data being used for this project.

Questions to consider:

- Where did the data come from, and how do they relate to the data analysis questions?
- What do the data represent? Who is in the sample and what variables are included?
- What is the target variable?
- What are the properties of the variables you intend to use?

### Importing relevant libraries, as well as our data 

In [1]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

from statsmodels.formula.api import ols

pd.set_option('display.max_columns',None)
data = pd.read_csv('kc_house_data.csv')
#data['condition'].head(25)
#data.head(20)

### Taking an initial look at our dataframe

In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     19221 non-null  float64
 9   view           21534 non-null  float64
 10  condition      21597 non-null  int64  
 11  grade          21597 non-null  int64  
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  object 
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   17755 non-null  float64
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long  

### Separating our data into train and test groups

In [3]:
y = data['price']
X = data.drop('price', axis=1)

# Split the data out, specifying size of the split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    test_size=0.20,
                                                    random_state=5
)
data_train = pd.concat([X_train, y_train], axis=1)
data_test = pd.concat([X_test, y_test], axis=1)
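
As a quick check on the split sizes, the arithmetic below assumes that scikit-learn converts a fractional `test_size` to a row count by rounding up. Under that assumption it reproduces the 17,277 training rows that appear as `No. Observations` in the model summaries in this notebook.

```python
import math

n_rows = 21597          # RangeIndex length reported by data.info()
test_size = 0.20        # same fraction passed to train_test_split

# Assumption: a float test_size is turned into a row count by rounding up
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

print(n_train, n_test)
```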

### Looking at the correlations of price vs each column 

In [4]:
#sns.pairplot(data_train)
#plt.show()
#fig, ax = plt.subplots(figsize=(18,10))
#sns.heatmap(data_train.corr(), annot=True, ax=ax)
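# Restrict to numeric columns so text columns such as date do not raise an error in newer pandas
data_train.select_dtypes('number').corr()['price'].sort_values(ascending=False)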


price            1.000000
sqft_living      0.700923
grade            0.664625
sqft_above       0.604636
sqft_living15    0.585330
bathrooms        0.525356
view             0.392116
lat              0.307032
bedrooms         0.305913
waterfront       0.258042
floors           0.253709
yr_renovated     0.134954
sqft_lot         0.082496
sqft_lot15       0.079215
yr_built         0.050130
condition        0.037065
long             0.021214
id              -0.018081
zipcode         -0.057878
Name: price, dtype: float64

### Encoding the waterfront column as a binary flag

In [5]:
def waterfront_cleanup(df, column_name):
    # Flag homes recorded as waterfront (1); 0 and missing values both become 0
    df['on_the_water'] = (df[column_name] == 1).astype(float)

waterfront_cleanup(data_train, 'waterfront')
data_train.on_the_water.value_counts()


0.0    17153
1.0      124
Name: on_the_water, dtype: int64
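
The `waterfront` column has 19,221 non-null values out of 21,597 rows in the full dataset, so 2,376 entries are missing, and the cleanup treats those missing entries as not on the water. A small sketch of why, using toy values rather than the real data: a comparison such as `!= 1` is `True` for `NaN`, so missing rows land in the 0 group.

```python
import numpy as np
import pandas as pd

# Toy values only: one waterfront home, one non-waterfront home, one missing entry
toy = pd.DataFrame({'waterfront': [1.0, 0.0, np.nan]})

# NaN != 1 evaluates to True, so the missing entry is grouped with the zeros
not_one = toy['waterfront'] != 1
toy['on_the_water'] = np.where(not_one, 0.0, 1.0)

n_missing = int(toy['waterfront'].isna().sum())
print(toy)
print('missing waterfront values:', n_missing)
```

Whether "missing" should mean "not waterfront" is a modeling choice; it seems reasonable here because waterfront homes are rare in the training data (124 of 17,277).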

### Grouping our data by zipcode, and sorting by average house price

In [6]:
def zipcode_sorter(df):
    return df.groupby('zipcode')['price'].mean().round().sort_values(ascending=False)

zipcode_sorter(data_train)

zipcode
98039    2272014.0
98004    1362173.0
98040    1199883.0
98112    1091131.0
98102     912214.0
           ...    
98188     280392.0
98148     275446.0
98032     249788.0
98168     241826.0
98002     236006.0
Name: price, Length: 70, dtype: float64

### One-Hot encoding zipcodes

In [7]:
pd.set_option('display.max_columns', None)
from sklearn.preprocessing import OneHotEncoder

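def zipcode_encoder(df):
    # Fit a one-hot encoder on the zipcode column and return the dummies as a DataFrame
    zipcodes = df[['zipcode']]
    ohe = OneHotEncoder(categories='auto', handle_unknown='ignore')
    # The default output is a sparse matrix, so convert it to a dense array
    encoded_zipcodes = ohe.fit_transform(zipcodes).toarray()
    # get_feature_names_out replaces the deprecated get_feature_names (scikit-learn >= 1.0)
    encoded_zipcodes_df = pd.DataFrame(encoded_zipcodes,
                                       columns=ohe.get_feature_names_out(['zipcode']),
                                       index=df.index)
    return encoded_zipcodes_df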

    
data_train =  pd.concat([data_train, zipcode_encoder(data_train)], axis=1)
data_train.head()

Unnamed: 0,id,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,price,on_the_water,zipcode_98001,zipcode_98002,zipcode_98003,zipcode_98004,zipcode_98005,zipcode_98006,zipcode_98007,zipcode_98008,zipcode_98010,zipcode_98011,zipcode_98014,zipcode_98019,zipcode_98022,zipcode_98023,zipcode_98024,zipcode_98027,zipcode_98028,zipcode_98029,zipcode_98030,zipcode_98031,zipcode_98032,zipcode_98033,zipcode_98034,zipcode_98038,zipcode_98039,zipcode_98040,zipcode_98042,zipcode_98045,zipcode_98052,zipcode_98053,zipcode_98055,zipcode_98056,zipcode_98058,zipcode_98059,zipcode_98065,zipcode_98070,zipcode_98072,zipcode_98074,zipcode_98075,zipcode_98077,zipcode_98092,zipcode_98102,zipcode_98103,zipcode_98105,zipcode_98106,zipcode_98107,zipcode_98108,zipcode_98109,zipcode_98112,zipcode_98115,zipcode_98116,zipcode_98117,zipcode_98118,zipcode_98119,zipcode_98122,zipcode_98125,zipcode_98126,zipcode_98133,zipcode_98136,zipcode_98144,zipcode_98146,zipcode_98148,zipcode_98155,zipcode_98166,zipcode_98168,zipcode_98177,zipcode_98178,zipcode_98188,zipcode_98198,zipcode_98199
2744,2472920140,4/3/2015,4,2.5,2620,9359,2.0,0.0,0.0,3,9,2620,0.0,1987,0.0,98058,47.438,-122.152,2580,7433,405000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8025,6021500025,8/18/2014,3,1.75,2360,4063,1.0,0.0,0.0,5,7,1180,1180.0,1940,0.0,98117,47.6902,-122.382,1660,4063,631750.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13314,7852130720,10/9/2014,3,2.5,2240,7791,2.0,0.0,0.0,3,7,2240,0.0,2002,0.0,98065,47.5361,-121.88,2480,5018,452500.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8085,1924059029,6/17/2014,5,6.75,9640,13068,1.0,1.0,4.0,3,12,4820,4820.0,1983,2009.0,98040,47.557,-122.21,3270,10454,4670000.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10303,4154304740,2/24/2015,3,2.75,2780,7200,1.5,0.0,0.0,4,8,1870,910.0,1913,0.0,98118,47.5632,-122.27,1700,7200,709000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
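
Because `zipcode_encoder` fits a new encoder on whatever frame it receives, encoding `data_test` separately could produce a different set of columns if a zipcode is absent from one of the splits. One way to keep the columns aligned is sketched below with `pd.get_dummies` and `reindex`. This is a swapped-in pandas technique, not the `OneHotEncoder` used in this notebook, and the toy frames and zipcodes are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the train and test splits (illustrative values only)
toy_train = pd.DataFrame({'zipcode': [98001, 98002, 98039]})
toy_test = pd.DataFrame({'zipcode': [98002, 98199]})   # 98199 never seen in training

# Learn the dummy columns from the training split only
train_dummies = pd.get_dummies(toy_train['zipcode'], prefix='zipcode', dtype=float)

# Reuse the training columns for the test split: unseen zipcodes become all-zero rows,
# and zipcodes missing from the test split become all-zero columns
test_dummies = (pd.get_dummies(toy_test['zipcode'], prefix='zipcode', dtype=float)
                  .reindex(columns=train_dummies.columns, fill_value=0.0))

print(test_dummies)
```

The all-zero row for the unseen zipcode mirrors what `handle_unknown='ignore'` does when a fitted encoder transforms a category it has not seen.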


### Sanity Check!

In [8]:
zipcode_df = zipcode_encoder(data_train)
zipcode_df.head()

Unnamed: 0,zipcode_98001,zipcode_98002,zipcode_98003,zipcode_98004,zipcode_98005,zipcode_98006,zipcode_98007,zipcode_98008,zipcode_98010,zipcode_98011,zipcode_98014,zipcode_98019,zipcode_98022,zipcode_98023,zipcode_98024,zipcode_98027,zipcode_98028,zipcode_98029,zipcode_98030,zipcode_98031,zipcode_98032,zipcode_98033,zipcode_98034,zipcode_98038,zipcode_98039,zipcode_98040,zipcode_98042,zipcode_98045,zipcode_98052,zipcode_98053,zipcode_98055,zipcode_98056,zipcode_98058,zipcode_98059,zipcode_98065,zipcode_98070,zipcode_98072,zipcode_98074,zipcode_98075,zipcode_98077,zipcode_98092,zipcode_98102,zipcode_98103,zipcode_98105,zipcode_98106,zipcode_98107,zipcode_98108,zipcode_98109,zipcode_98112,zipcode_98115,zipcode_98116,zipcode_98117,zipcode_98118,zipcode_98119,zipcode_98122,zipcode_98125,zipcode_98126,zipcode_98133,zipcode_98136,zipcode_98144,zipcode_98146,zipcode_98148,zipcode_98155,zipcode_98166,zipcode_98168,zipcode_98177,zipcode_98178,zipcode_98188,zipcode_98198,zipcode_98199
2744,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13314,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8085,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Generating a model of zipcodes vs price 

In [10]:
#zipcodes = sorted(list(set(data_train['zipcode'])))
zipcodes = zipcode_df.columns
# Join the dummy column names into a single formula string
zipcode_formula = 'price~' + '+'.join(zipcodes)
print(zipcode_formula)

all_zipcode_model = ols(formula=zipcode_formula, data=data_train).fit()
results = all_zipcode_model.summary()
results

price~zipcode_98001+zipcode_98002+zipcode_98003+zipcode_98004+zipcode_98005+zipcode_98006+zipcode_98007+zipcode_98008+zipcode_98010+zipcode_98011+zipcode_98014+zipcode_98019+zipcode_98022+zipcode_98023+zipcode_98024+zipcode_98027+zipcode_98028+zipcode_98029+zipcode_98030+zipcode_98031+zipcode_98032+zipcode_98033+zipcode_98034+zipcode_98038+zipcode_98039+zipcode_98040+zipcode_98042+zipcode_98045+zipcode_98052+zipcode_98053+zipcode_98055+zipcode_98056+zipcode_98058+zipcode_98059+zipcode_98065+zipcode_98070+zipcode_98072+zipcode_98074+zipcode_98075+zipcode_98077+zipcode_98092+zipcode_98102+zipcode_98103+zipcode_98105+zipcode_98106+zipcode_98107+zipcode_98108+zipcode_98109+zipcode_98112+zipcode_98115+zipcode_98116+zipcode_98117+zipcode_98118+zipcode_98119+zipcode_98122+zipcode_98125+zipcode_98126+zipcode_98133+zipcode_98136+zipcode_98144+zipcode_98146+zipcode_98148+zipcode_98155+zipcode_98166+zipcode_98168+zipcode_98177+zipcode_98178+zipcode_98188+zipcode_98198+zipcode_98199


0,1,2,3
Dep. Variable:,price,R-squared:,0.409
Model:,OLS,Adj. R-squared:,0.406
Method:,Least Squares,F-statistic:,172.3
Date:,"Tue, 05 Oct 2021",Prob (F-statistic):,0.0
Time:,15:30:45,Log-Likelihood:,-241430.0
No. Observations:,17277,AIC:,483000.0
Df Residuals:,17207,BIC:,483500.0
Df Model:,69,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-1.559e+17,1.44e+17,-1.086,0.277,-4.37e+17,1.25e+17
zipcode_98001,1.559e+17,1.44e+17,1.086,0.277,-1.25e+17,4.37e+17
zipcode_98002,1.559e+17,1.44e+17,1.086,0.277,-1.25e+17,4.37e+17
zipcode_98003,1.559e+17,1.44e+17,1.086,0.277,-1.25e+17,4.37e+17
zipcode_98004,1.559e+17,1.44e+17,1.086,0.277,-1.25e+17,4.37e+17
zipcode_98005,1.559e+17,1.44e+17,1.086,0.277,-1.25e+17,4.37e+17
zipcode_98006,1.559e+17,1.44e+17,1.086,0.277,-1.25e+17,4.37e+17
zipcode_98007,1.559e+17,1.44e+17,1.086,0.277,-1.25e+17,4.37e+17
zipcode_98008,1.559e+17,1.44e+17,1.086,0.277,-1.25e+17,4.37e+17

0,1,2,3
Omnibus:,17486.832,Durbin-Watson:,1.989
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2356947.428
Skew:,4.709,Prob(JB):,0.0
Kurtosis:,59.44,Cond. No.,565000000000000.0
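
In this summary every zipcode coefficient is about 1.559e+17, the intercept is about -1.559e+17, and the condition number is roughly 5.65e+14. That pattern is what one would expect when a full set of dummy columns is combined with an intercept: the dummies in each row sum to 1, which duplicates the intercept column, so the individual coefficients cannot be identified. The sketch below illustrates this on a toy design matrix (three made-up groups, not the real data) and shows that dropping one dummy restores full column rank.

```python
import numpy as np

# Toy data: six rows spread over three groups (stand-ins for zipcodes)
groups = np.array([0, 0, 1, 1, 2, 2])
dummies = np.eye(3)[groups]                  # one indicator column per group
intercept = np.ones((len(groups), 1))

# Intercept plus ALL dummies: the dummy columns sum to the intercept column
X_full = np.hstack([intercept, dummies])
# Intercept plus all but the first dummy: the first group becomes the reference
X_dropped = np.hstack([intercept, dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)
rank_dropped = np.linalg.matrix_rank(X_dropped)

print(X_full.shape[1], rank_full)        # more columns than rank: collinear
print(X_dropped.shape[1], rank_dropped)  # columns equal rank: identifiable
```

Both matrices span the same column space, so the fitted values and R-squared should not depend on which dummy is dropped; only the coefficients become interpretable.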


### Generating a model of neighborhood vs price

In [23]:
zipcode_dict = {98002: 'Auburn',98092: 'Auburn',98224: 'Baring',98004: 'Bellevue',98005: 'Bellevue',
98006: 'Bellevue',98007: 'Bellevue',98008: 'Bellevue',98010: 'Black_Diamond',98011: 'Bothell',
98178: 'Bryn_Mawr_Skyway',98148: 'Burien',98166: 'Burien',98014: 'Carnation',98077: 'Cottage_Lake',
98042: 'Covington',98198: 'Des_Moines',98019: 'Duvall',98031: 'East_Hill_Meridian',98022: 'Enumclaw',
98058: 'Fairwood',98024: 'Fall_City',98003: 'Federal_Way',98023: 'Federal_Way',98027: 'Issaquah',
98029: 'Issaquah',98028: 'Kenmore',98032: 'Kent',98030: 'Kent',98033: 'Kirkland',98034: 'Kirkland',
98001: 'Lakeland_North',98038: 'Maple_Valley',98039: 'Medina',98040: 'Mercer_Island',98045: 'North_Bend',
98047: 'Pacific',98050: 'Preston',98051: 'Ravensdale',98052: 'Redmond',98055: 'Renton',98056: 'Renton',
98057: 'Renton',98059: 'Renton',98074: 'Sammamish',98075: 'Sammamish',98188: 'SeaTac',98199: 'Seattle',
98174: 'Seattle',98154: 'Seattle',98158: 'Seattle',98164: 'Seattle',98101: 'Seattle',98102: 'Seattle',
98103: 'Seattle',98104: 'Seattle',98105: 'Seattle',98106: 'Seattle',98107: 'Seattle',98108: 'Seattle',
98109: 'Seattle',98112: 'Seattle',98115: 'Seattle',98116: 'Seattle',98117: 'Seattle',98118: 'Seattle',
98119: 'Seattle',98121: 'Seattle',98122: 'Seattle',98125: 'Seattle',98126: 'Seattle',98133: 'Seattle',
98134: 'Seattle',98136: 'Seattle',98144: 'Seattle',98155: 'Shoreline',98177: 'Shoreline',98288: 'Skykomish',
98065: 'Snoqualmie',98168: 'Tukwila',98053: 'Union_Hill_Novelty_Hill',98195: 'Univ_Of_Washington',
98070: 'Vashon',98146: 'White_Center',98072: 'Woodinville'}


# Map each zipcode to its neighborhood name
data_train['neighborhood'] = data_train['zipcode'].map(zipcode_dict)
# Select the price column before averaging so non-numeric columns are ignored
neighborhoods = data_train.groupby('neighborhood')['price'].mean().round().sort_values(ascending=False)

neighborhood_list = list(set(data_train['neighborhood']))

# Add one 0/1 indicator column per neighborhood
for i in neighborhood_list:
    data_train[i] = (data_train['neighborhood'] == i).astype(float)

# Join the indicator column names into a single formula string
neighborhood_formula = ' price ~ ' + ' + '.join(neighborhood_list) + ' '
print(neighborhood_formula)

neighborhood_model = ols(formula=neighborhood_formula, data=data_train).fit()
results = neighborhood_model.summary()
results

 price ~ White_Center + Kirkland + Woodinville + Fall_City + Des_Moines + Shoreline + Maple_Valley + Seattle + Cottage_Lake + Auburn + Tukwila + Issaquah + Lakeland_North + Mercer_Island + Bothell + Enumclaw + Renton + Fairwood + Bellevue + North_Bend + Duvall + Vashon + Bryn_Mawr_Skyway + Burien + Federal_Way + Carnation + Kent + Union_Hill_Novelty_Hill + Black_Diamond + Medina + Covington + Redmond + SeaTac + Snoqualmie + Kenmore + Sammamish + East_Hill_Meridian 


0,1,2,3
Dep. Variable:,price,R-squared:,0.285
Model:,OLS,Adj. R-squared:,0.283
Method:,Least Squares,F-statistic:,190.8
Date:,"Tue, 05 Oct 2021",Prob (F-statistic):,0.0
Time:,15:33:16,Log-Likelihood:,-243070.0
No. Observations:,17277,AIC:,486200.0
Df Residuals:,17240,BIC:,486500.0
Df Model:,36,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2.734e+17,9.44e+16,2.896,0.004,8.83e+16,4.58e+17
White_Center,-2.734e+17,9.44e+16,-2.896,0.004,-4.58e+17,-8.83e+16
Kirkland,-2.734e+17,9.44e+16,-2.896,0.004,-4.58e+17,-8.83e+16
Woodinville,-2.734e+17,9.44e+16,-2.896,0.004,-4.58e+17,-8.83e+16
Fall_City,-2.734e+17,9.44e+16,-2.896,0.004,-4.58e+17,-8.83e+16
Des_Moines,-2.734e+17,9.44e+16,-2.896,0.004,-4.58e+17,-8.83e+16
Shoreline,-2.734e+17,9.44e+16,-2.896,0.004,-4.58e+17,-8.83e+16
Maple_Valley,-2.734e+17,9.44e+16,-2.896,0.004,-4.58e+17,-8.83e+16
Seattle,-2.734e+17,9.44e+16,-2.896,0.004,-4.58e+17,-8.83e+16

0,1,2,3
Omnibus:,16905.746,Durbin-Watson:,1.991
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1767596.661
Skew:,4.532,Prob(JB):,0.0
Kurtosis:,51.716,Cond. No.,261000000000000.0
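
The neighborhood coefficients (all near -2.734e+17, offset by an intercept near 2.734e+17) are hard to read as price effects. As a hedged sketch of what the coefficients mean once a reference level is dropped, the toy example below uses made-up prices for three of the neighborhood names and fits by least squares with NumPy rather than `ols`. The intercept comes out as the mean price of the reference neighborhood, and each coefficient as the difference from that mean.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration only
toy = pd.DataFrame({
    'neighborhood': ['Auburn', 'Auburn', 'Medina', 'Medina', 'Seattle', 'Seattle'],
    'price': [250000.0, 270000.0, 2000000.0, 2200000.0, 500000.0, 560000.0],
})

# drop_first=True removes the first level (Auburn), which becomes the reference
dummies = pd.get_dummies(toy['neighborhood'], drop_first=True, dtype=float)
X = np.column_stack([np.ones(len(toy)), dummies.to_numpy()])

# Ordinary least squares on the intercept plus the remaining dummies
coefs, *_ = np.linalg.lstsq(X, toy['price'].to_numpy(), rcond=None)
fit = pd.Series(coefs, index=['Intercept'] + list(dummies.columns))
print(fit.round(2))
```

In a statsmodels formula, writing the predictor as `C(neighborhood)` applies treatment coding, which drops a reference level automatically and avoids building the indicator columns by hand.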


## Data Preparation

Describe and justify the process for preparing the data for analysis.

Questions to consider:

- Were there variables you dropped or created?
- How did you address missing values or outliers?
- Why are these choices appropriate given the data and the business problem?

In [12]:
# code here to prepare your data

## Modeling

Describe and justify the process for analyzing or modeling the data.

Questions to consider:

- How did you analyze the data to arrive at an initial approach?
- How did you iterate on your initial approach to make it better?
- Why are these choices appropriate given the data and the business problem?

In [13]:
data_train.head()

Unnamed: 0,id,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,price,on_the_water,zipcode_98001,zipcode_98002,zipcode_98003,zipcode_98004,zipcode_98005,zipcode_98006,zipcode_98007,zipcode_98008,zipcode_98010,zipcode_98011,zipcode_98014,zipcode_98019,zipcode_98022,zipcode_98023,zipcode_98024,zipcode_98027,zipcode_98028,zipcode_98029,zipcode_98030,zipcode_98031,zipcode_98032,zipcode_98033,zipcode_98034,zipcode_98038,zipcode_98039,zipcode_98040,zipcode_98042,zipcode_98045,zipcode_98052,zipcode_98053,zipcode_98055,zipcode_98056,zipcode_98058,zipcode_98059,zipcode_98065,zipcode_98070,zipcode_98072,zipcode_98074,zipcode_98075,zipcode_98077,zipcode_98092,zipcode_98102,zipcode_98103,zipcode_98105,zipcode_98106,zipcode_98107,zipcode_98108,zipcode_98109,zipcode_98112,zipcode_98115,zipcode_98116,zipcode_98117,zipcode_98118,zipcode_98119,zipcode_98122,zipcode_98125,zipcode_98126,zipcode_98133,zipcode_98136,zipcode_98144,zipcode_98146,zipcode_98148,zipcode_98155,zipcode_98166,zipcode_98168,zipcode_98177,zipcode_98178,zipcode_98188,zipcode_98198,zipcode_98199,neighborhood,White_Center,Kirkland,Woodinville,Fall_City,Des_Moines,Shoreline,Maple_Valley,Seattle,Cottage_Lake,Auburn,Tukwila,Issaquah,Lakeland_North,Mercer_Island,Bothell,Enumclaw,Renton,Fairwood,Bellevue,North_Bend,Duvall,Vashon,Bryn_Mawr_Skyway,Burien,Federal_Way,Carnation,Kent,Union_Hill_Novelty_Hill,Black_Diamond,Medina,Covington,Redmond,SeaTac,Snoqualmie,Kenmore,Sammamish,East_Hill_Meridian
2744,2472920140,4/3/2015,4,2.5,2620,9359,2.0,0.0,0.0,3,9,2620,0.0,1987,0.0,98058,47.438,-122.152,2580,7433,405000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Fairwood,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8025,6021500025,8/18/2014,3,1.75,2360,4063,1.0,0.0,0.0,5,7,1180,1180.0,1940,0.0,98117,47.6902,-122.382,1660,4063,631750.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Seattle,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13314,7852130720,10/9/2014,3,2.5,2240,7791,2.0,0.0,0.0,3,7,2240,0.0,2002,0.0,98065,47.5361,-121.88,2480,5018,452500.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Snoqualmie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
8085,1924059029,6/17/2014,5,6.75,9640,13068,1.0,1.0,4.0,3,12,4820,4820.0,1983,2009.0,98040,47.557,-122.21,3270,10454,4670000.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Mercer_Island,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10303,4154304740,2/24/2015,3,2.75,2780,7200,1.5,0.0,0.0,4,8,1870,910.0,1913,0.0,98118,47.5632,-122.27,1700,7200,709000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Seattle,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Simple Model 

Our simple model regresses price on square footage of living space (`sqft_living`), the variable with the strongest correlation to price (about 0.70 in the training data).

In [14]:
simple_formula = 'price ~ sqft_living'
simple_model = ols(formula=simple_formula, data=data_train).fit()
results= simple_model.summary()
results

0,1,2,3
Dep. Variable:,price,R-squared:,0.491
Model:,OLS,Adj. R-squared:,0.491
Method:,Least Squares,F-statistic:,16680.0
Date:,"Tue, 05 Oct 2021",Prob (F-statistic):,0.0
Time:,15:30:46,Log-Likelihood:,-240120.0
No. Observations:,17277,AIC:,480300.0
Df Residuals:,17275,BIC:,480300.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-4.574e+04,4958.793,-9.225,0.000,-5.55e+04,-3.6e+04
sqft_living,282.1408,2.184,129.165,0.000,277.859,286.422

0,1,2,3
Omnibus:,12090.176,Durbin-Watson:,1.996
Prob(Omnibus):,0.0,Jarque-Bera (JB):,474100.489
Skew:,2.885,Prob(JB):,0.0
Kurtosis:,28.006,Cond. No.,5630.0
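
For a regression with a single predictor, R-squared equals the squared Pearson correlation between the predictor and the target. The arithmetic below checks this against the numbers reported in this notebook: 0.700923 is the `sqft_living` correlation with price from the correlation table, and 0.491 is the R-squared from this summary. The second half repeats the check on synthetic data (randomly generated, not the housing data) using only NumPy.

```python
import numpy as np

r_sqft_living = 0.700923             # correlation with price from the correlation table
print(round(r_sqft_living ** 2, 3))  # compare with R-squared in the summary

# Synthetic check: fit y = a + b*x by least squares and compare R^2 with r^2
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(size=200)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)
r_squared = 1 - residuals.var() / y.var()
pearson_r = np.corrcoef(x, y)[0, 1]

print(np.isclose(r_squared, pearson_r ** 2))
```

The `sqft_living` coefficient of 282.1408 means that, in this model, each additional square foot of living space is associated with a sale price about $282 higher.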


## Evaluation

The evaluation of each model should accompany the creation of each model, and you should be sure to evaluate your models consistently.

Evaluate how well your work solves the stated business problem. 

Questions to consider:

- How do you interpret the results?
- How well does your model fit your data? How much better is this than your baseline model? Is it over or under fit?
- How well does your model/data fit any modeling assumptions?

For the final model, you might also consider:

- How confident are you that your results would generalize beyond the data you have?
- How confident are you that this model would benefit the business if put into use?

### Baseline Understanding

- What does a baseline, model-less prediction look like?

In [15]:
# code here to arrive at a baseline prediction
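
One common model-less baseline is to predict the mean training price for every home. The sketch below shows the idea on synthetic prices (randomly generated, not the King County data); the variable names `y_train_toy` and `y_test_toy` and the helper `r_squared` are hypothetical. By construction this baseline scores an R-squared of 0 on the data it was computed from, which gives later models a reference point.

```python
import numpy as np

# Synthetic prices for illustration only (not the King County data)
rng = np.random.default_rng(42)
y_train_toy = rng.lognormal(mean=13.0, sigma=0.5, size=1000)
y_test_toy = rng.lognormal(mean=13.0, sigma=0.5, size=250)

# Baseline: always predict the mean of the training target
baseline_prediction = y_train_toy.mean()

def r_squared(y_true, y_pred):
    # 1 minus the ratio of residual to total sum of squares
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

train_r2 = r_squared(y_train_toy, baseline_prediction)
test_rmse = np.sqrt(np.mean((y_test_toy - baseline_prediction) ** 2))
print(train_r2, test_rmse)
```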

### First $&(@# Model

Before going too far down the data preparation rabbit hole, be sure to check your work against a first 'substandard' model! What is the easiest way for you to find out how hard your problem is?

In [16]:
# code here for your first 'substandard' model

In [17]:
# code here to evaluate your first 'substandard' model

### Modeling Iterations

Now you can start to use the results of your first model to iterate - there are many options!

In [18]:
# code here to iteratively improve your models

In [19]:
# code here to evaluate your iterations

### 'Final' Model

In the end, you'll arrive at a 'final' model - aka the one you'll use to make your recommendations/conclusions. This likely blends any group work. It might not be the one with the highest scores, but instead might be considered 'final' or 'best' for other reasons.

In [20]:
# code here to show your final model

In [21]:
# code here to evaluate your final model

## Conclusions

Provide your conclusions about the work you've done, including any limitations or next steps.

Questions to consider:

- What would you recommend the business do as a result of this work?
- What are some reasons why your analysis might not fully solve the business problem?
- What else could you do in the future to improve this project (future work)?
