<div style="text-align:center"><img src="zillowprojlogo.png"/></div>

<a id='navigation'></a>

<button class="button-save large">[Scenario](#scenario)</button>
<button class="button-save large">[Project Planning](#project-planning)</button>
<button class="button-save large">[Key Findings](#key-findings)</button>
<button class="button-save large">[Tested Hypotheses](#tested-hypotheses)</button>
<button class="button-save large">[Take Aways](#take-aways)</button>
<button class="button-save large">[Data Dictionary](#data-dictionary)</button>
<button class="button-save large">[Workflow](#workflow)</button>

<div class="alert alert-block alert-info"><a name="scenario"></a><h1><i class="fas fa-home"></i> Scenario</h1></div>
Selling a home in our new normal has just gotten easier with Zillow Offers®. Homeowners can now hand over the burden of selling their property by selling directly to us, based on our state-of-the-art Zestimate score.

The accuracy and integrity of our Zestimate score are of high importance. As junior data scientists on the Zillow data science team, we are tasked with uncovering which drivers most affect the accuracy of the Zestimate score. Accuracy is measured by our target variable, `logerror`: the difference between the log of the Zestimate and the log of the actual sale price.
>`logerror` = log (Zestimate) − log (ActualSalePrice)
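
To make the target concrete, the definition above can be sketched in a few lines. This is an illustration only: the helper name `log_error` and the prices are hypothetical, and the natural log is an assumption, since the formula does not state the base.

```python
import numpy as np


def log_error(zestimate, sale_price):
    """Return log(Zestimate) - log(actual sale price).

    Assumption: natural log. The base is not stated in the project text.
    """
    zestimate = np.asarray(zestimate, dtype=float)
    sale_price = np.asarray(sale_price, dtype=float)
    return np.log(zestimate) - np.log(sale_price)


# Hypothetical prices, for illustration only (not taken from the dataset)
example = log_error([510_000, 300_000], [500_000, 330_000])

# A positive value means the Zestimate was above the sale price;
# a negative value means it was below.
print(example)
```

Because the error is a difference of logs, it reflects the ratio between the Zestimate and the sale price rather than the dollar gap, so a given percentage miss yields the same `logerror` regardless of the home's price.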

### Project Goal: 
The goal of this project is to build a model that accurately predicts the Zestimate's `logerror`. In doing so, we will uncover which features in the Zillow dataset drive the error.


In [1]:
# Imports needed

# Standard library
import os

# Data handling and statistics
import numpy as np
import pandas as pd
from scipy import stats

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing, splitting, and clustering
import sklearn.preprocessing
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures, StandardScaler

# Modeling methods
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoLars, LinearRegression
from sklearn.metrics import mean_squared_error

# Local modules: database credentials and data wrangling helpers
from env import host, user, password
import wrangle

## Acquire the Data

In [2]:
# Acquire the data from SQL

df = wrangle.zillow17()
df

| | id | parcelid | airconditioningtypeid | architecturalstyletypeid | basementsqft | bathroomcnt | bedroomcnt | buildingclasstypeid | buildingqualitytypeid | calculatedbathnbr | ... | censustractandblock | logerror | transactiondate | airconditioningdesc | architecturalstyledesc | buildingclassdesc | heatingorsystemdesc | propertylandusedesc | storydesc | typeconstructiondesc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1087254 | 10711855 | | | | 2.0 | 3.0 | | 8.0 | 2.0 | ... | 6.037113e+13 | -0.007357 | 2017-07-07 | | | | Central | Single Family Residential | | |
| 1 | 1072280 | 10711877 | 1.0 | | | 2.0 | 4.0 | | 8.0 | 2.0 | ... | 6.037113e+13 | 0.021066 | 2017-08-29 | Central | | | Central | Single Family Residential | | |
| 2 | 1340933 | 10711888 | 1.0 | | | 2.0 | 4.0 | | 8.0 | 2.0 | ... | 6.037113e+13 | 0.077174 | 2017-04-04 | Central | | | Central | Single Family Residential | | |
| 3 | 1878109 | 10711910 | | | | 2.0 | 3.0 | | 8.0 | 2.0 | ... | 6.037113e+13 | -0.041238 | 2017-03-17 | | | | Central | Single Family Residential | | |
| 4 | 2190858 | 10711923 | | | | 2.0 | 4.0 | | 8.0 | 2.0 | ... | 6.037113e+13 | -0.009496 | 2017-03-24 | | | | Central | Single Family Residential | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 77569 | 775695 | 167686999 | | | | 0.0 | 0.0 | | | | ... | | -0.068632 | 2017-02-28 | | | | | Single Family Residential | | |
| 77570 | 2863262 | 167687739 | | | | 0.0 | 0.0 | | | | ... | | 0.360020 | 2017-03-03 | | | | | Condominium | | |
| 77571 | 1372384 | 167687839 | | | | 0.0 | 0.0 | | | | ... | | 0.038797 | 2017-05-31 | | | | | Single Family Residential | | |
| 77572 | 2758757 | 167688532 | 1.0 | | | 3.0 | 3.0 | | 4.0 | 3.0 | ... | | 0.006706 | 2017-02-03 | Central | | | Central | Condominium | | |


### Acquire Takeaways:
- Large dataset (more than 77,000 rows)
- Many columns are mostly nulls/NaNs
- Many columns are redundant
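
One way to quantify how sparse each column is would be a small summary of missing values per column. The sketch below is not the code used in `wrangle`; the helper name `null_summary` and the toy frame are hypothetical, while the column names are borrowed from the preview above.

```python
import numpy as np
import pandas as pd


def null_summary(df):
    """Return the count and percentage of missing values per column, most-missing first."""
    summary = pd.DataFrame({
        "num_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })
    return summary.sort_values("pct_missing", ascending=False)


# Tiny made-up frame, for illustration only (not real Zillow rows)
toy = pd.DataFrame({
    "basementsqft": [np.nan, np.nan, np.nan, 700.0],
    "bathroomcnt": [2.0, 2.0, np.nan, 3.0],
    "bedroomcnt": [3.0, 4.0, 4.0, 3.0],
})

summary = null_summary(toy)
print(summary)
```

Run against the full frame (`null_summary(df)`), a table like this makes it straightforward to pick a null-percentage cutoff for dropping columns.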

## Prep the Data

In [15]:
# Drop unnecessary/redundant columns, drop nulls/NaNs, and limit outliers

df, train, validate, test = wrangle.wrangle_zillow()

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71818 entries, 10711855 to 162960814
Data columns (total 23 columns):
bathrooms                 71818 non-null float64
bedrooms                  71818 non-null int64
property_quality          71818 non-null int64
sqft                      71818 non-null float64
fips                      71818 non-null int64
latitude                  71818 non-null float64
longitude                 71818 non-null float64
lot_sqft                  71818 non-null float64
rawcensustractandblock    71818 non-null float64
regionidcity              71818 non-null float64
zip_code                  71818 non-null int64
roomcnt                   71818 non-null int64
unitcnt                   71818 non-null int64
yearbuilt                 71818 non-null int64
structure_value           71818 non-null float64
home_value                71818 non-null float64
land_value                71818 non-null float64
taxamount                 71818 non-null float64
logerror    

In [14]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| bathrooms | 71818.0 | 2.272822 | 0.916359 | 1.0 | 2.0 | 2.0 | 3.0 | 10.0 |
| bedrooms | 71818.0 | 3.016765 | 0.990155 | 1.0 | 2.0 | 3.0 | 4.0 | 11.0 |
| property_quality | 71818.0 | 6.758431 | 1.350214 | 1.0 | 6.0 | 7.0 | 7.0 | 12.0 |
| sqft | 71818.0 | 1741.974 | 873.320788 | 501.0 | 1172.0 | 1517.0 | 2056.0 | 12039.0 |
| fips | 71818.0 | 6049.319 | 21.083111 | 6037.0 | 6037.0 | 6037.0 | 6059.0 | 6111.0 |
| latitude | 71818.0 | 34007130.0 | 267350.75188 | 33339530.0 | 33810530.0 | 34021240.0 | 34176710.0 | 34818770.0 |
| longitude | 71818.0 | -118200400.0 | 364289.055844 | -119475300.0 | -118418200.0 | -118168500.0 | -117917900.0 | -117572300.0 |
| lot_sqft | 71818.0 | 27756.19 | 118484.322103 | 236.0 | 5985.0 | 7265.0 | 10619.25 | 6971010.0 |
| rawcensustractandblock | 71818.0 | 60496220.0 | 209253.579521 | 60371010.0 | 60374000.0 | 60376210.0 | 60590520.0 | 61110090.0 |
| regionidcity | 71818.0 | 33507.59 | 45986.851085 | 3491.0 | 12447.0 | 25218.0 | 45457.0 | 396556.0 |


### Data Prep Takeaways:
- Dropped unnecessary/redundant columns: `id`, `calculatedbathnbr`, `finishedsquarefeet12`, `fullbathcnt`, `heatingorsystemtypeid`, `propertycountylandusecode`, `propertylandusetypeid`, `propertyzoningdesc`, `censustractandblock`, `propertylandusedesc`, `heatingorsystemdesc`, `assessmentyear`, `regionidcounty`
- Limited outliers by keeping only properties with `taxvaluedollarcnt` < 5,000,000 and `calculatedfinishedsquarefeet` < 12,500
- Dropped columns with more than 60% nulls and rows with more than 70% nulls
- Dropped a total of 45 columns and 5,756 rows
- Still a large dataset: 71,818 rows and 23 columns
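
The null-handling and outlier rules listed above could be sketched roughly as follows. The actual logic lives in `wrangle.wrangle_zillow()` and is not shown in this notebook, so the function names (`drop_sparse`, `limit_outliers`), the order of the steps, and the toy data are all assumptions; only the thresholds and the two column names come from the takeaways.

```python
import numpy as np
import pandas as pd


def drop_sparse(df, max_col_null=0.60, max_row_null=0.70):
    """Drop columns, then rows, whose share of nulls exceeds the given limits."""
    # Keep columns with at most max_col_null missing values
    keep_cols = df.columns[df.isna().mean() <= max_col_null]
    df = df[keep_cols]
    # Keep rows with at most max_row_null missing values (among remaining columns)
    keep_rows = df.isna().mean(axis=1) <= max_row_null
    return df[keep_rows]


def limit_outliers(df, max_value=5_000_000, max_sqft=12_500):
    """Keep only rows below the home-value and square-footage cutoffs."""
    mask = (
        (df["taxvaluedollarcnt"] < max_value)
        & (df["calculatedfinishedsquarefeet"] < max_sqft)
    )
    return df[mask]


# Made-up rows, for illustration only
toy = pd.DataFrame({
    "taxvaluedollarcnt": [400_000.0, 6_000_000.0, 350_000.0, np.nan],
    "calculatedfinishedsquarefeet": [1_500.0, 9_000.0, 1_200.0, np.nan],
    "basementsqft": [np.nan, np.nan, np.nan, np.nan],
    "bedroomcnt": [3.0, 5.0, 2.0, np.nan],
})

cleaned = limit_outliers(drop_sparse(toy))
print(cleaned)
```

In this toy example the all-null `basementsqft` column and the all-null last row are removed by the null rules, and the 6,000,000 home is removed by the value cutoff.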

## Explore the Data

In [19]:
# Check the shapes of our train, validate, and test splits

train.shape, validate.shape, test.shape

((40217, 23), (17237, 23), (14364, 23))
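
These shapes are consistent with holding out 20% of the rows for test and then 30% of the remainder for validate, which gives roughly a 56/24/20 split. The sketch below shows how such a split could be produced with `train_test_split`; the proportions are inferred from the shapes rather than read from `wrangle`, and the function name `split_data`, the seed, and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(df, seed=123):
    """Split a frame into train, validate, and test (about 56% / 24% / 20%)."""
    # First hold out 20% of all rows as the test set
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    # Then hold out 30% of the remaining rows as the validate set
    train, validate = train_test_split(
        train_validate, test_size=0.3, random_state=seed
    )
    return train, validate, test


# Made-up frame with 1,000 rows, for illustration only
toy = pd.DataFrame({"sqft": np.arange(1000), "logerror": np.zeros(1000)})

train_toy, validate_toy, test_toy = split_data(toy)
print(train_toy.shape, validate_toy.shape, test_toy.shape)
```

Fixing `random_state` keeps the split reproducible, so exploration and modeling always see the same train rows.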