# Zillow Regression Project

## Executive Summary



The model predicted home values with an error (RMSE) of $484,301.60, a 26% improvement over the baseline.

*What are you going to do and how are you going to do it? Write out your thoughts here in a way that you can easily explain your work to members on your team.*

*This is your space to put together the elevator pitch for your project, and then follow it up with a plan of attack. It's going to be short, but that doesn't mean you won't need to spend much time putting it together -- this is a process of dropping the bad ideas until you're left with something that you are confident you can work with.*

*Don't take shortcuts here; coming up with a useful question and a straightforward work plan is going to make the rest of your project flow much more smoothly from start to finish.*

*One final, and very important, point: the main reason for thinking about this stage as how you're going to describe your work to others is that you should be talking about work with others!!! The more you work on your project in isolation, the better your ideas will sound to you -- even the bad ones. Listening to your ideas spoken out loud in your own voice is an excellent sanity check, and feedback from your peers is your most valuable resource.*

**You don't have to answer every question, but answer each that you can now and then come back later. Also, if the question doesn't apply, remove it.**

### Problem Statement

We want to predict the tax-assessed values of single unit properties, using data for properties whose last transaction occurred during the "hot months" (in terms of real estate demand) of May and June 2017.

#### Additional Reporting Requested

1. What state and county are these properties in?
2. Tax Rate for each property
3. Distribution of tax rates for each county
4. How much does the tax rate vary by county?
5. What tax rate do the bulk of the properties sit around?

This model will allow us to predict tax assessment values, which is helpful when considering a property purchase.
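The tax-rate questions above can be explored with a short pandas sketch. It assumes the tax rate is `taxamount / taxvaluedollarcnt` (a definition this document does not state) and that `fips` identifies the county. The rows are made up for illustration. The three codes used are the standard FIPS codes for Los Angeles, Orange, and Ventura counties in California; whether these are the counties present in this dataset is an assumption.

```python
import pandas as pd

# Made-up sample rows; column names follow the Zillow columns used in this project.
sample = pd.DataFrame({
    "fips": [6037, 6037, 6059, 6059, 6111],
    "taxamount": [6000.0, 2500.0, 5500.0, 3300.0, 4400.0],
    "taxvaluedollarcnt": [500000.0, 200000.0, 500000.0, 300000.0, 400000.0],
})

# Standard FIPS codes; their presence in the real data is an assumption.
county_names = {6037: "Los Angeles, CA", 6059: "Orange, CA", 6111: "Ventura, CA"}
sample["county"] = sample["fips"].map(county_names)

# Assumption: tax rate = annual tax bill / assessed value.
sample["tax_rate"] = sample["taxamount"] / sample["taxvaluedollarcnt"]

# Summarize the distribution of tax rates within each county.
county_summary = sample.groupby("county")["tax_rate"].agg(
    ["count", "mean", "median", "min", "max"]
)
print(county_summary)
```

On the real data, a histogram of `tax_rate` per county would show where the bulk of the properties sit.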


### Work Plan
**Work Flow**   
Acquire data from the Codeup Zillow database, then clean and prepare it   
Explore 3 variables for modeling (bedroomcnt, bathroomcnt, and calculatedfinishedsquarefeet)  
Use a chi-squared test of independence on bedroomcnt and bathroomcnt to determine whether the features are dependent  
Construct a model to predict single unit property tax assessment value   
Evaluate model effectiveness  
Summarize conclusions and next steps  
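The chi-squared step in the work flow could look roughly like the sketch below. The counts are made up, and the sample is far too small for a trustworthy test; it only illustrates the mechanics. The null hypothesis is that `bedroomcnt` and `bathroomcnt` are independent, and the significance level of 0.05 is an assumption.

```python
import pandas as pd
from scipy import stats

# Made-up observations for illustration only.
sample = pd.DataFrame({
    "bedroomcnt":  [2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 2, 3],
    "bathroomcnt": [1, 1, 2, 2, 2, 2, 3, 3, 2, 3, 1, 2],
})

# Contingency table of observed counts (rows: bedrooms, columns: bathrooms).
observed = pd.crosstab(sample["bedroomcnt"], sample["bathroomcnt"])

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the expected counts under independence.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

alpha = 0.05  # assumed significance level
reject_independence = p_value < alpha
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}, reject H0: {reject_independence}")
```

If the null hypothesis is rejected, the two features carry overlapping information, which is one motivation for combining them into a single engineered feature.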

**Machine Learning (ML) Considerations**   
The model type is a regression ML model  
Use four regression approaches: Linear Regression, LassoLars, polynomial regression (PolynomialFeatures with Linear Regression), and TweedieRegressor  
Models require numeric data, which must be scaled   
Establish a baseline from the target mean; if a model's Root Mean Squared Error (RMSE) is lower than the baseline's, the model performs better
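The baseline comparison can be sketched as follows, using synthetic data rather than the Zillow data. For brevity the sketch scores both the baseline and the model on the same data they were fit on; in the project the comparison would be made on the validate split. The improvement formula `(baseline_rmse - model_rmse) / baseline_rmse` is an assumption about how the percentage in the summary was computed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: one feature (square footage) and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(200, 1))
y = 150 * X[:, 0] + rng.normal(0, 50_000, size=200)

# Baseline: predict the mean of the target for every observation.
baseline_pred = np.full_like(y, y.mean())
baseline_rmse = np.sqrt(mean_squared_error(y, baseline_pred))

# Candidate model: ordinary least squares.
model = LinearRegression().fit(X, y)
model_rmse = np.sqrt(mean_squared_error(y, model.predict(X)))

# Fractional reduction in RMSE relative to the baseline.
improvement = (baseline_rmse - model_rmse) / baseline_rmse
print(f"baseline RMSE={baseline_rmse:,.0f}, model RMSE={model_rmse:,.0f}, "
      f"improvement={improvement:.1%}")
```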


## Imports

```python
# Project modules
import src.acquire
import src.prepare
import src.explore
import src.model

# Data handling, plotting, and statistics
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from math import sqrt
from scipy import stats

# Silence library warnings in the notebook output
import warnings
warnings.filterwarnings("ignore")

# Modeling, feature selection, and evaluation
from statsmodels.formula.api import ols
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from sklearn.feature_selection import f_regression, SelectKBest, RFE
from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor
from sklearn.preprocessing import PolynomialFeatures
```



## Acquire the Data

### SQL Query synopsis

Select all columns from the 2017 property records and join the 2017 predictions to get the transaction date, keeping only transactions from May and June 2017. Then filter the results to the property use types that meet the definition of a single unit property.

**Determine definition of single unit property**

Source: James Chen, "What Is a Housing Unit?", Investopedia, updated Sep 11, 2020.

"The term housing unit refers to a single unit within a larger structure that can be used by an individual or household to eat, sleep, and live. The unit can be in any type of residence such as a house, apartment, mobile home, or may also be a single unit in a group of rooms. Essentially, a housing unit is deemed to be a separate living quarter where the occupants live and eat separately from other residents of the structure or building. They also have direct access from the building's exterior or through a common hallway."

    https://www.investopedia.com/terms/h/housingunits.asp

**Identify properties in the database: based on the above definition, some categories do not fit the brief**

| Included | propertylandusetypeid | propertylandusedesc | Reason excluded |
|---|---|---|---|
| **No** | 31 | Commercial/Office/Residential Mixed Used | not a residence |
| **No** | 46 | Multi-Story Store | not a residence |
| **No** | 47 | Store/Office (Mixed Use) | not a residence |
| Yes | 246 | Duplex (2 Units, Any Combination) | |
| Yes | 247 | Triplex (3 Units, Any Combination) | |
| Yes | 248 | Quadruplex (4 Units, Any Combination) | |
| Yes | 260 | Residential General | |
| Yes | 261 | Single Family Residential | |
| Yes | 262 | Rural Residence | |
| Yes | 263 | Mobile Home | |
| Yes | 264 | Townhouse | |
| Yes | 265 | Cluster Home | |
| Yes | 266 | Condominium | |
| **No** | 267 | Cooperative | become shareholder not owner |
| Yes | 268 | Row House | |
| Yes | 269 | Planned Unit Development | |
| **No** | 270 | Residential Common Area | property feature |
| **No** | 271 | Timeshare | become shareholder not owner |
| Yes | 273 | Bungalow | |
| Yes | 274 | Zero Lot Line | |
| Yes | 275 | Manufactured, Modular, Prefabricated Homes | |
| Yes | 276 | Patio Home | |
| Yes | 279 | Inferred Single Family Residential | |
| **No** | 290 | Vacant Land - General | not a residence |
| **No** | 291 | Residential Vacant Land | not a residence |
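The query described in the synopsis, restricted to the property types kept above, might be sketched like this. The table names `properties_2017` and `predictions_2017`, the join on `parcelid`, and ISO-formatted `transactiondate` strings are assumptions, not taken from the project's actual query. An in-memory SQLite database with made-up rows stands in for the Codeup database.

```python
import sqlite3
import pandas as pd

# propertylandusetypeid values not marked "No" in the list above.
SINGLE_UNIT_IDS = [246, 247, 248, 260, 261, 262, 263, 264, 265,
                   266, 268, 269, 273, 274, 275, 276, 279]

def build_query(ids=SINGLE_UNIT_IDS):
    """Build the filter query; table and key names are hypothetical."""
    id_list = ", ".join(str(i) for i in ids)
    return f"""
        SELECT prop.*, pred.logerror, pred.transactiondate
        FROM properties_2017 AS prop
        JOIN predictions_2017 AS pred ON prop.parcelid = pred.parcelid
        WHERE pred.transactiondate BETWEEN '2017-05-01' AND '2017-06-30'
          AND prop.propertylandusetypeid IN ({id_list})
    """

# Tiny in-memory stand-in for the real database (made-up rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE properties_2017 (parcelid INTEGER, propertylandusetypeid INTEGER);
    CREATE TABLE predictions_2017 (parcelid INTEGER, logerror REAL, transactiondate TEXT);
    INSERT INTO properties_2017 VALUES (1, 261), (2, 31), (3, 266), (4, 261);
    INSERT INTO predictions_2017 VALUES
        (1, 0.01, '2017-05-15'), (2, 0.02, '2017-06-01'),
        (3, 0.03, '2017-07-04'), (4, 0.00, '2017-06-30');
""")

sample = pd.read_sql_query(build_query(), conn)
conn.close()
print(sample)
```

In the sample, parcel 2 is dropped because of its use type and parcel 3 because its transaction falls outside May and June.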

## Prepare the Data

**Feature Selection**   
What data do we need in our initial df?

| Column | Non-null count | Dtype | Decision |
|---|---|---|---|
| parcelid | 20931 | int64 | Listing Number - Drop for Explore |
| id | 20931 | int64 | Listing ID - Drop for Explore |
| airconditioningtypeid | 6779 | float64 | Too many null values - Drop for Explore |
| architecturalstyletypeid | 52 | float64 | Too many null values - Drop for Explore |
| basementsqft | 16 | float64 | Only 16 non-null values - Drop for Explore |
| bathroomcnt | 20931 | float64 | Use |
| bedroomcnt | 20931 | float64 | Use - Combine bath/bed (feature engineering) |
| buildingclasstypeid | 0 | object | All values null - Drop for Explore |
| buildingqualitytypeid | 13257 | float64 | Too many null values - Drop for Explore |
| calculatedbathnbr | 20771 | float64 | Repeat of bathroomcnt - Drop for Explore |
| decktypeid | 174 | float64 | Too many null values - Drop for Explore |
| finishedfloor1squarefeet | 1738 | float64 | Repeat column - Drop for Explore |
| calculatedfinishedsquarefeet | 20868 | float64 | Use - Drop null |
| finishedsquarefeet12 | 20024 | float64 | Repeat column - Drop for Explore |
| finishedsquarefeet13 | 17 | float64 | Repeat column - Drop for Explore |
| finishedsquarefeet15 | 736 | float64 | Repeat column - Drop for Explore |
| finishedsquarefeet50 | 1738 | float64 | Repeat column - Drop for Explore |
| finishedsquarefeet6 | 91 | float64 | Repeat column - Drop for Explore |
| fips | 20931 | float64 | Repeat column - Drop for Explore |
| fireplacecnt | 2422 | float64 | Use - change null to 0 |
| fullbathcnt | 20771 | float64 | Repeat of bathroom - Drop for Explore |
| garagecarcnt | 7075 | float64 | Use - Rename as garage, change null to 0 |
| garagetotalsqft | 7075 | float64 | garagesqft verifies that the garages exist |
| hashottuborspa | 461 | float64 | Use - change null to 0, for no hot tub or spa |
| heatingorsystemtypeid | 13285 | float64 | Too many null values - Drop for Explore |
| latitude | 20931 | float64 | Repeat column - Drop for Explore |
| longitude | 20931 | float64 | Repeat column - Drop for Explore |
| lotsizesquarefeet | 18742 | float64 | Too large for modeling, scaling? |
| poolcnt | 4496 | float64 | Use - change null to 0, for no pool |
| poolsizesum | 251 | float64 | Repeat column - Drop for Explore |
| pooltypeid10 | 121 | float64 | Repeat column - Drop for Explore |
| pooltypeid2 | 340 | float64 | Repeat column - Drop for Explore |
| pooltypeid7 | 4154 | float64 | Repeat column - Drop for Explore |
| propertycountylandusecode | 20931 | object | Repeat column - Drop for Explore |
| propertylandusetypeid | 20931 | float64 | Use - Categories |
| propertyzoningdesc | 13437 | object | Too many null values - Drop for Explore |
| rawcensustractandblock | 20931 | float64 | Repeat info (zip) |
| regionidcity | 20503 | float64 | Repeat info (zip) - Drop for Explore |
| regionidcounty | 20931 | float64 | Repeat info (zip) - Drop for Explore |
| regionidneighborhood | 8443 | float64 | Too many null values - Drop for Explore |
| regionidzip | 20916 | float64 | Use - altered to categorical |
| roomcnt | 20931 | float64 | Use |
| storytypeid | 16 | float64 | Too many null values - Drop for Explore |
| threequarterbathnbr | 2800 | float64 | Repeat info (bathroom) - Drop for Explore |
| typeconstructiontypeid | 56 | float64 | Too many null values - Drop for Explore |
| unitcnt | 13476 | float64 | Repeat info() - Drop for Explore |
| yardbuildingsqft17 | 701 | float64 | Too many null values - Drop for Explore |
| yardbuildingsqft26 | 25 | float64 | Too many null values - Drop for Explore |
| yearbuilt | 20850 | float64 | Use - Drop null values |
| numberofstories | 4917 | float64 | Too many null values - Drop for Explore |
| fireplaceflag | 51 | float64 | Repeat info (fireplace) - Drop for Explore |
| structuretaxvaluedollarcnt | 20897 | float64 | Correlates w/Target - Drop for Explore |
| taxvaluedollarcnt | 20930 | float64 | Target Variable |
| assessmentyear | 20931 | float64 | Filtered in SQL - Drop for Explore |
| landtaxvaluedollarcnt | 20930 | float64 | Correlates w/Target - Drop for Explore |
| taxamount | 20931 | float64 | Correlates w/Target - Drop for Explore |
| taxdelinquencyflag | 703 | object | Correlates w/Target - Drop for Explore |
| taxdelinquencyyear | 703 | float64 | Correlates w/Target - Drop for Explore |
| censustractandblock | 20852 | float64 | Repeat column - Drop for Explore |
| id | 20931 | int64 | Repeat column - Drop for Explore |
| logerror | 20931 | float64 | Calculation - Drop for Explore |
| transactiondate | 20931 | object | Filtered in SQL - Drop for Explore |

### Clean the Data

Rename columns for clarity: fireplace, hottub, garage    
Replace NaN values with 0, because NaN indicates that the property does not have the feature     
Convert zip, parcelid, and year to categorical values and assign codes using .astype and .cat     
Drop all columns except those needed for the explore and modeling stages   
Drop outliers identified during the tax rate calculation (14 observations in total); see the Additional Information Notebook for a synopsis of the strategy    
Drop any remaining null values   
Split data into train, validate, test   
Define dataframe for exploration (X_train_explore) based on train   
Define dataframes for scaling (X_train, X_validate, X_test), dropping target values and non-feature columns   
Define target dataframes for modeling (y_train, y_validate, y_test)   
Fit the scaler (MinMaxScaler used) to the X_train dataframe   
Transform the dataframes to return X_train_scaled, X_validate_scaled, X_test_scaled   
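The cleaning steps above can be condensed into a sketch like the one below. It is not the project's `wrangle_zillow` implementation: the data is synthetic, the 60/20/20 split proportions and random seeds are assumptions, and the `scale` helper is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the prepared Zillow columns.
rng = np.random.default_rng(42)
n = 100
df = pd.DataFrame({
    "bedroomcnt": rng.integers(1, 6, n).astype(float),
    "bathroomcnt": rng.integers(1, 4, n).astype(float),
    "calculatedfinishedsquarefeet": rng.uniform(600, 4000, n),
    "poolcnt": rng.choice([1.0, np.nan], n),
    "regionidzip": rng.choice([96001.0, 96002.0, 96003.0], n),
    "taxvaluedollarcnt": rng.uniform(1e5, 1e6, n),
})

# NaN means the property lacks the feature, so fill with 0.
df["poolcnt"] = df["poolcnt"].fillna(0)
# Convert zip to a categorical and keep its integer codes.
df["regionidzip"] = df["regionidzip"].astype("category").cat.codes
df = df.dropna()

# Assumed 60/20/20 train/validate/test split.
train_validate, test = train_test_split(df, test_size=0.2, random_state=123)
train, validate = train_test_split(train_validate, test_size=0.25, random_state=123)

target = "taxvaluedollarcnt"
X_train, y_train = train.drop(columns=target), train[target]
X_validate, y_validate = validate.drop(columns=target), validate[target]
X_test, y_test = test.drop(columns=target), test[target]

# Fit the scaler on train only, then apply it to every split.
scaler = MinMaxScaler().fit(X_train)

def scale(X):
    """Hypothetical helper: scaled copy that keeps column names and index."""
    return pd.DataFrame(scaler.transform(X), columns=X.columns, index=X.index)

X_train_scaled, X_validate_scaled, X_test_scaled = map(
    scale, (X_train, X_validate, X_test)
)
```

Fitting the scaler on `X_train` alone keeps information from the validate and test splits out of the preprocessing step.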


### Load created dataframes

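```python
# Path to the Zillow data CSV
path = 'zillow_df.csv'

# Load the full dataframe, the exploration dataframe, and the scaled splits with their targets
df, X_train_explore, \
    X_train_scaled, y_train, \
    X_validate_scaled, y_validate, \
    X_test_scaled, y_test = src.prepare.wrangle_zillow(path)

# Confirm the shapes of the scaled splits
X_train_scaled.shape, X_validate_scaled.shape, X_test_scaled.shape
```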

## Explore the Data

**If you haven't split the data yet, do it before you explore**

*Before you start exploring the data, write out your thought process about what you're looking for and what you expect to find. Take a minute to confirm that your plan actually makes sense.*

*Calculate summary statistics and plot some charts to give you an idea what types of useful relationships might be in your dataset. Use these insights to go back and download additional data or engineer new features if necessary. Not now though... remember we're still just trying to finish the MVP!*

### Explore Distributions

*It's good to have a sense of things like the scale and the spread of the data you're working with. But looking at the distribution of each feature can be a helpful way to determine if more features are needed.*

*Look for outliers. Take a closer look at the observations that stand out from the rest. Is there anything about these data points that might explain why they aren't behaving like the others? If so, that's a feature you would want to add to your model, to control for such a big difference.*

*Look for clusters in your histograms and scatterplots of various features. If the variance is inconsistent, with dense clusters of observations within the feature space, try to figure out what makes those observations so much more similar than the others. This could point to another feature to add to your data. Investigate any unusual patterns that look like a feature is not being drawn from a random distribution.*

*Look for multiple modes within the distributions of your features. This is a sign that there are categorical variables within your data set. Try to find out what these classes are, if you don't already know, and label your observations if possible.*

### Explore Relationships

*The "eyeball test" is the quickest way to check that your hypothesis is on the right track. Plot your inputs against your outputs. If there is real predictive power for the model you are trying to build, you should be able to make reasonable guesses (just by pointing at the chart, no need for precise values yet) of what the output should be for any given input value.*

*In a regression model, we are looking for clear relationships between our features and targets. Generate pairplots and look for features that have high correlations with the target variable.*
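As a numeric companion to the pairplots, the correlation of each feature with the target can be ranked on the training split. The sketch below uses synthetic data; `noise_feature` is a hypothetical column added only to show what an uninformative feature looks like.

```python
import numpy as np
import pandas as pd

# Synthetic training data for illustration only.
rng = np.random.default_rng(1)
n = 300
sqft = rng.uniform(600, 4000, n)
train = pd.DataFrame({
    "calculatedfinishedsquarefeet": sqft,
    "bedroomcnt": np.round(sqft / 900 + rng.normal(0, 0.5, n)).clip(1, 6),
    "noise_feature": rng.normal(0, 1, n),
    "taxvaluedollarcnt": 200 * sqft + rng.normal(0, 80_000, n),
})

# Pearson correlation of every feature with the target, strongest first.
target = "taxvaluedollarcnt"
correlations = train.corr()[target].drop(target).sort_values(ascending=False)
print(correlations)
```

`sns.pairplot(train)` gives the visual counterpart of the same comparison.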

*In a classification model, we are looking for features that separate the population into distinct distributions. Generate pairplots that are color-coded by your categories and look for features where the categories have distinct distributions.*

## Model the Data

*Describe the algorithms that you are considering. How do they work? Why are they good choices for this data and problem space?*

*What nuances in the data will you have to be aware of in order to avoid introducing bias to your model? What steps will you need to take to prevent overfitting? What risks are there for data leakage?*

### Train Validation Test Split

*Pay special attention here to what data is going into your training and test sets. Is there any data leakage? Make sure that you are testing your model on information it has not seen during the learning process. If this is a classification problem, make sure that your classes are reasonably balanced.*

*Also, keep in mind that after you've trained and evaluated your model, you may want to look at the results of specific missed predictions. Set up your training and test data sets so that it will be easy to identify the record id when you are looking at a given prediction result.*

### Preprocessing

*These are standard processing steps to prepare your data for modeling. But remember that any code you use to scale or encode your data has to come from the observations in your training set.*

#### Feature Scaling

#### Label Encoding

### Build and Train Model

*Write down any thoughts you may have about working with these algorithms on this data. What looks to have been the most successful design choices? What pain points are you running into? What other ideas do you want to try out as you iterate on this pipeline?*

### Predict and Score

Important: Stick to the metrics you chose to test before you trained and evaluated your model! Remember that this is the standard you selected before your analysis as the one that you would find most convincing that the model works.

## Interpret the Model

Write up the things you learned, and how well your model performed. Be sure to address the model's strengths and weaknesses. What types of data does it handle well? What types of observations tend to give it a hard time? What future work might you or someone reading this want to do, building on the lessons learned and tools developed in this project?

#### Report Metrics in Context
How did the model perform on the key metrics you chose to demonstrate its usefulness? What counts as good or bad performance? How well do humans perform on this task? How well would you expect to do with random dice rolls? What are the costs associated with missed predictions?

#### Inspect Errors

Look at some of the observations with missed predictions. (Do this in your validation set, never look at individual records in your test set!) Are there common patterns among the observations with bad results? Do you have data that you can include in your model that will capture these patterns or is this a task that will need another research project to solve?
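One way to surface the worst misses on the validation set is to rank observations by absolute error. The values below are made up, and indexing by `parcelid` is an assumption that makes each miss traceable to its record.

```python
import pandas as pd

# Made-up validation results, indexed by parcelid.
results = pd.DataFrame(
    {
        "actual": [300000.0, 450000.0, 1200000.0, 250000.0],
        "predicted": [320000.0, 400000.0, 700000.0, 260000.0],
    },
    index=pd.Index([101, 102, 103, 104], name="parcelid"),
)

# Positive residual: the model under-predicted; negative: it over-predicted.
results["residual"] = results["actual"] - results["predicted"]
results["abs_error"] = results["residual"].abs()

# The two observations with the largest misses.
worst = results.sort_values("abs_error", ascending=False).head(2)
print(worst)
```

Joining `worst` back to the feature columns helps reveal whether the large misses share a pattern, such as unusually high-value homes.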

This is your last step in the iteration cycle; if you can't find anything else you can work on here with your present data, project scope, and deadlines, then it's time to wrap things up.

#### Strengths and Weaknesses

Once you've gone through your iteration cycles and are finished with this version of the model, or this particular project, provide an assessment of what types of observations are handled well by your model, and what circumstances seem to give it trouble. This will point you and others towards more questions for future projects.

## Next Steps: What Can We Do Now?

Reporting a model's results is good, and is the main objective of any data science project. But a project is one thing, a career is another. A question is one thing, but science is another. If you've carried out your research with a mindset of curiosity and creativity, then by now you should have plenty more, and much better informed, questions about this topic than what you started with.

So in addition to reporting on the question you investigated and the answers you found, think of the needs of your team, your users, and your peers in the industry, and make some recommendations that answer these two questions:

1. What are some unanswered questions in my project where more information (additional data sources, deeper understanding, other models or tools) might help improve these results?
2. What are other needs or problems where my model or my approach may be useful?