# Predicting Zillow Log Error
### Author: 
Bethany Thompson, Jr. Data Scientist  
thompson.bethany.01@gmail.com
### Date:
October 21st, 2020
### Project Goals:
Gain insight into the current Zillow Zestimate model by predicting the model's log error.
### Conclusions:

## Table of Contents:
 [Acquire](#first-bullet)  
> Use acquire module to assign the data to a variable  

 [Prepare](#second-bullet)  
> Use prepare module to clean the data  

 [Explore](#third-bullet)  
> Highlight key takeaways from explore  

 [Modeling](#fourth-bullet)  
> Create four models to predict log error and determine the best-performing one  

 [Conclusions](#fifth-bullet)  
> Reasons for log error, best model to predict log error

# Acquire <a class="anchor" id="first-bullet"></a>  
- To see the data before preparation, run the code in this section
- This step is optional, because the acquire function is also called from within the Prepare.py module
    - calling Prepare.prepare_zillow() uses Acquire to obtain the data and then prepares it, all in one function
- The query in Acquire joins the predictions_2017 and properties_2017 tables and selects only single-unit properties
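
To make the join concrete, here is a minimal pandas sketch of the same idea on toy data. It is not the query in Acquire.py, which runs in SQL and is not shown here. The toy values, the join key `parcelid`, and the use of `unitcnt == 1` as the single-unit filter are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the two tables; all values are invented for illustration.
properties = pd.DataFrame({
    "parcelid": [1, 2, 3],
    "unitcnt": [1, 2, 1],      # assumed single-unit indicator
    "bedroomcnt": [3, 4, 2],
})
predictions = pd.DataFrame({
    "parcelid": [1, 2, 3],
    "logerror": [0.02, -0.01, 0.05],
    "transactiondate": ["2017-01-03", "2017-02-10", "2017-03-15"],
})

# Inner join on the parcel identifier, so only parcels in both tables remain
joined = predictions.merge(properties, on="parcelid", how="inner")

# Keep only rows flagged as single-unit under the assumed filter
single_unit = joined[joined["unitcnt"] == 1].reset_index(drop=True)
print(single_unit)
```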

In [1]:
# Acquire.py holds functions that connect to the SQL database and return a DataFrame with the data selected by the query
import Acquire

In [8]:
raw_df = Acquire.get_home_data()

### The original data has 61 columns and 70,364 rows. Many columns are more than 50% null.

In [9]:
raw_df.shape

(70364, 61)

In [10]:
raw_df.isnull().sum()

id                              0
parcelid                        0
airconditioningtypeid       50658
architecturalstyletypeid    70213
basementsqft                70320
                            ...  
taxdelinquencyflag          69043
taxdelinquencyyear          69043
censustractandblock          1590
logerror                        0
transactiondate                 0
Length: 61, dtype: int64
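
Raw null counts are easier to judge as a share of rows. The sketch below shows one way to compute that share per column and list the columns that are more than half empty. It runs on a small invented DataFrame, not on `raw_df`; the 0.5 cutoff mirrors the 50% figure mentioned above.

```python
import numpy as np
import pandas as pd

# Toy frame with one mostly-empty column; values are invented for illustration.
toy = pd.DataFrame({
    "parcelid": [1, 2, 3, 4],
    "basementsqft": [np.nan, np.nan, np.nan, 500.0],
    "logerror": [0.01, 0.02, -0.03, 0.00],
})

# The mean of a boolean mask is the fraction of nulls in each column
null_share = toy.isnull().mean().sort_values(ascending=False)

# Columns where more than half of the values are missing
mostly_null = null_share[null_share > 0.5].index.tolist()
print(null_share)
print(mostly_null)
```

The same two lines should work on `raw_df` in place of `toy`.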

# Prepare <a class="anchor" id="second-bullet"></a>  

Within the prepare function:  
- columns and rows with more than 25% nulls are removed
- unnecessary columns are removed (such as multiple sqft columns)
- outliers in continuous variables are removed
- leftover nulls are filled with mean or mode
- features for property age and transaction month are created
- columns are renamed and rearranged
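
The null-handling steps above can be sketched roughly as follows. This is not the code in Prepare.py: the helper names `drop_sparse` and `fill_remaining` are hypothetical, the toy data is invented, and the choice of mean for numeric columns and mode for everything else is an assumption about how "mean or mode" is applied. Outlier removal, feature creation, and renaming are left out.

```python
import numpy as np
import pandas as pd

def drop_sparse(df, max_null=0.25):
    """Drop columns, then rows, whose share of nulls exceeds max_null."""
    df = df.loc[:, df.isnull().mean() <= max_null]
    df = df.loc[df.isnull().mean(axis=1) <= max_null]
    return df

def fill_remaining(df):
    """Fill leftover nulls: mean for numeric columns, mode for the rest."""
    df = df.copy()
    for col in df.columns:
        if df[col].isnull().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].mean())
            else:
                df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df

# Toy data, invented for illustration: "pool" is 80% null and gets dropped.
toy = pd.DataFrame({
    "sqft": [1000.0, 1500.0, np.nan, 2000.0, 1200.0],
    "county": ["LA", "LA", "LA", None, "Orange"],
    "pool": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "beds": [3, 4, 2, 3, 3],
    "baths": [2, 2, 1, 2, 2],
})

clean = fill_remaining(drop_sparse(toy))
print(clean)
```

Dropping columns before rows matters here: once the sparse `pool` column is gone, the rows with a single remaining null fall within the 25% limit and are kept for filling.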

In [6]:
# Prepare.py holds the function to obtain the data, prep the data, and split into train, validate, test
import Prepare

### Prepare.py also splits the data into train, validate, test
70% Train, 20% Validate, 10% Test

In [12]:
train, validate, test = Prepare.prepare_zillow()

train shape:  (44284, 25) , validate shape:  (12491, 25) , test shape:  (6309, 25)

train percent:  70.0 , validate percent:  20.0 , test percent:  10.0
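
For readers who want to see how a 70/20/10 split can be produced, here is a minimal sketch using a plain pandas shuffle. How Prepare.py actually performs the split is not shown in this notebook, so the function name `three_way_split`, the seed, and the shuffle-then-slice approach are all assumptions.

```python
import numpy as np
import pandas as pd

def three_way_split(df, train_frac=0.7, validate_frac=0.2, seed=123):
    """Shuffle the rows, then slice into train, validate, and test."""
    shuffled = df.sample(frac=1, random_state=seed)
    n = len(shuffled)
    train_end = int(round(n * train_frac))
    validate_end = train_end + int(round(n * validate_frac))
    return (
        shuffled.iloc[:train_end],
        shuffled.iloc[train_end:validate_end],
        shuffled.iloc[validate_end:],  # whatever remains becomes test
    )

# Toy frame with 100 rows, invented for illustration
toy = pd.DataFrame({"log_error": np.linspace(-0.1, 0.1, 100)})
train, validate, test = three_way_split(toy)
print(train.shape, validate.shape, test.shape)
```

Fixing the seed keeps the three subsets identical between runs, which makes model comparisons on validate repeatable.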


### No nulls are left after using the prepare function

In [14]:
train.isnull().sum()

index_id               0
parcel_id              0
log_error              0
tax_value              0
structure_tax_value    0
land_tax_value         0
tax_amount             0
county_id              0
zip_code               0
latitude               0
longitude              0
census_id              0
bathrooms              0
bedrooms               0
full_bathrooms         0
bed_plus_bath          0
room_count             0
property_sqft          0
lot_sqft               0
land_use_code          0
land_use_type          0
year_built             0
property_age           0
transaction_date       0
transaction_month      0
dtype: int64

# Explore <a class="anchor" id="third-bullet"></a>

# Modeling <a class="anchor" id="fourth-bullet"></a>

# Conclusions <a class="anchor" id="fifth-bullet"></a>