# <center>Wrangling Zillow - Intro</center>

### Background
Let's set up an example scenario as perspective for our regression exercises using the Zillow dataset.

As a Codeup data science graduate, you want to show off your skills to the Zillow data science team in hopes of getting an interview for a position you saw pop up on LinkedIn. You thought it might look impressive to build an end-to-end project in which you use some of their Kaggle data to predict property values using some of their available features; who knows, you might even do some feature engineering to blow them away. Your goal is to predict the values of single-unit properties using the observations from 2017.

In these exercises, you will complete the first step toward the above goal: acquire and prepare the necessary Zillow data from the zillow database in the Codeup database server.

### Acquire
From the zillow database, acquire the following columns for all 'Single Family Residential' properties:
- propertylandusetype table: 
    * propertylandusetypeid
    * propertylandusedesc
- properties_2017 table: 
    * propertylandusetypeid
    * bedroomcnt
    * bathroomcnt
    * calculatedfinishedsquarefeet
    * taxvaluedollarcnt
    * yearbuilt
    * taxamount
    * fips
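
As a rough sketch of the acquisition step, the query below joins the two tables listed above and filters on the land use description. The table and column names come from the list; the join key, the `acquire_zillow` helper name, and the in-memory SQLite stand-in with its made-up rows are assumptions for illustration only, since the real data lives on the Codeup database server.

```python
import sqlite3

import pandas as pd

# Assumption: the two tables are related through propertylandusetypeid.
ZILLOW_QUERY = """
SELECT p.bedroomcnt,
       p.bathroomcnt,
       p.calculatedfinishedsquarefeet,
       p.taxvaluedollarcnt,
       p.yearbuilt,
       p.taxamount,
       p.fips,
       p.propertylandusetypeid,
       t.propertylandusedesc
FROM properties_2017 AS p
JOIN propertylandusetype AS t
  ON p.propertylandusetypeid = t.propertylandusetypeid
WHERE t.propertylandusedesc = 'Single Family Residential'
"""


def acquire_zillow(connection):
    """Hypothetical helper: run the query on a connection pandas can read from."""
    return pd.read_sql(ZILLOW_QUERY, connection)


# In-memory stand-in for the real server; every row here is invented.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE propertylandusetype (
    propertylandusetypeid REAL, propertylandusedesc TEXT);
CREATE TABLE properties_2017 (
    propertylandusetypeid REAL, bedroomcnt REAL, bathroomcnt REAL,
    calculatedfinishedsquarefeet REAL, taxvaluedollarcnt REAL,
    yearbuilt REAL, taxamount REAL, fips REAL);
INSERT INTO propertylandusetype VALUES (261, 'Single Family Residential');
INSERT INTO propertylandusetype VALUES (246, 'Duplex');
INSERT INTO properties_2017 VALUES (261, 3, 2, 1500, 300000, 1975, 3600, 6037);
INSERT INTO properties_2017 VALUES (246, 4, 2, 2000, 400000, 1980, 4800, 6037);
""")

df = acquire_zillow(demo)
print(df.shape)
```

Filtering on the description rather than on a hard-coded ID keeps the query readable, at the cost of depending on the exact spelling of the description text.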
    
### Prepare
Using your acquired Zillow data, walk through the summarization and cleaning steps in your wrangle.ipynb file as we did above.

You may handle the missing values however you feel is appropriate and meaningful; remember to document your process and decisions using markdown and code commenting where helpful.
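
One way to approach the missing values, sketched on a tiny made-up frame: count the nulls per column first, then decide what to do with them. Dropping incomplete rows and exact duplicates is only one possible choice, shown here because it matches the no-null, no-duplicate result used later in this notebook; the sample values are invented.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the acquired Zillow data.
raw = pd.DataFrame({
    "bedroomcnt": [3.0, 4.0, np.nan, 3.0],
    "bathroomcnt": [2.0, 2.5, 1.0, 2.0],
    "taxvaluedollarcnt": [250000.0, np.nan, 120000.0, 250000.0],
})

# Summarize: missing values per column, as a count and as a share of rows.
null_counts = raw.isna().sum()
null_share = raw.isna().mean()
print(pd.DataFrame({"nulls": null_counts, "share": null_share}))

# One possible decision: drop incomplete rows, then drop exact duplicates.
clean = raw.dropna().drop_duplicates()
print(clean.shape)
```

If the share of missing values in a column were large, imputing or dropping the column might be a better fit than dropping rows; whichever you choose, note the reason in a markdown cell.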

### Compartmentalize Functions
Store all of the functions needed to automate your process, from acquiring the data to returning a cleaned dataframe with no missing values, in your wrangle.py file. Name your final function wrangle_zillow.
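
A possible layout for wrangle.py is sketched below. Only the names get_zillow and wrangle_zillow come from this notebook; the `fetch` callable, the `cache_path` parameter, the `prep_zillow` helper, and the CSV caching are assumptions made so the sketch runs without database access. A real version would likely take no arguments and query the Codeup server inside get_zillow.

```python
import os
import tempfile

import pandas as pd


def get_zillow(fetch, cache_path="zillow.csv"):
    """Return the raw data, reading a local CSV cache when one exists.

    `fetch` is a hypothetical zero-argument callable that queries the database.
    """
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    df = fetch()
    df.to_csv(cache_path, index=False)
    return df


def prep_zillow(df):
    """Hypothetical prepare step: drop incomplete rows and exact duplicates."""
    return df.dropna().drop_duplicates()


def wrangle_zillow(fetch, cache_path="zillow.csv"):
    """Acquire, then prepare: the single entry point a notebook imports."""
    return prep_zillow(get_zillow(fetch, cache_path))


# Demo with a stand-in fetch function and a temporary cache file.
calls = []


def fake_fetch():
    calls.append("db hit")
    return pd.DataFrame({"bedroomcnt": [3.0, None, 3.0],
                         "bathroomcnt": [2.0, 1.0, 2.0]})


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "zillow.csv")
    first = wrangle_zillow(fake_fetch, cache)   # queries, then writes the cache
    second = wrangle_zillow(fake_fetch, cache)  # reads the cache instead

print(len(calls), first.shape, second.shape)
```

Keeping acquisition and preparation in separate functions lets you re-run the cleaning logic against the cached file without touching the database again.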

# <center>Wrangling Zillow - wrangle.py</center>

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd

from wrangle import get_zillow      # Get a fresh copy from Codeup DB or local drive
from wrangle import wrangle_zillow  # Get a no-null no-duplicate version from fresh

In [2]:
zillow = wrangle_zillow()

# <center>Wrangling Zillow - Investigating Further</center>

### Investigate columns for discrepancies

In [3]:
# zillow.describe().T

In [4]:
# zillow.taxvaluedollarcnt.sort_values()

In [5]:
# zillow[zillow.taxvaluedollarcnt == 22]

**Some property appraisals are very low**: Some properties of 30 to 150 sqft are appraised at 22 dollars and have widely varying tax amounts (up to 6,000 dollars). I'm not sure what causes this, so for now I will leave these rows in and treat them as potential outliers.

In [6]:
# zillow[zillow.calculatedfinishedsquarefeet < 30]

**Some property sizes are very low**: Some properties are listed at 1 square foot with an appraised value in the 7-figure range, while others listed at less than 10 square feet have an appraised value in the 5-figure range. I'm not sure what causes this, either.

In [7]:
# zillow[zillow.bedroomcnt > 10]

**Few houses have more than 10 bedrooms**: 79 observations (of over two million total observations) have more than 10 bedrooms. The 75th percentile is 4 bedrooms. The high-bedroom-count outliers should be removed before any statistics or modeling are done.

In [8]:
# zillow[zillow.bathroomcnt > 10]

**Few houses have more than 10 bathrooms**: Same story as the bedroom count, with 353 observations of homes having more than 10 bathrooms. The bathroom count's 75th percentile is 3 bathrooms.
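
The counts and percentiles quoted for bedrooms and bathrooms can be checked with a boolean mask and `quantile`. The snippet below is a small illustration on invented bedroom counts, not the real column, so its numbers will not match the 79 and 353 reported above.

```python
import pandas as pd

# Made-up bedroom counts; the real column has far more rows.
homes = pd.DataFrame({"bedroomcnt": [2.0, 3.0, 3.0, 4.0, 4.0, 3.0, 2.0, 12.0]})

above_cutoff = homes.bedroomcnt > 10           # boolean mask, one value per row
over_ten = above_cutoff.sum()                  # number of rows above the cutoff
share = above_cutoff.mean()                    # the same rows as a fraction
q75 = homes.bedroomcnt.quantile(0.75)          # 75th percentile of the column

print(over_ten, share, q75)
```

The same three lines work for bathroomcnt by swapping the column name.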

# <center>Wrangling Zillow - Takeaways</center>

**Removing outliers will be valuable**: This is the regression module, and outliers are the enemy of a good regression model. The zillow dataset contains outliers that should be removed before creating a model. So far we've removed rows with nulls and duplicate rows, but we must also remove outliers. I've added a function called remove_outliers to my wrangle.py that uses the IQR to remove outliers, because I lack the domain knowledge to remove them at my own discretion.
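
The body of remove_outliers is not shown in this notebook, so the following is only a minimal sketch of what an IQR-based filter could look like. The signature, the `k=1.5` multiplier (the common Tukey fence), and the choice to compute all fences on the full input before filtering are assumptions, and the sample data is made up.

```python
import pandas as pd


def remove_outliers(df, columns, k=1.5):
    """Keep rows whose value in every listed column lies within
    [Q1 - k*IQR, Q3 + k*IQR]; fences are computed on the full input."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


# Made-up bedroom counts with one extreme value.
homes = pd.DataFrame({"bedroomcnt": [2.0, 3.0, 3.0, 4.0, 4.0, 3.0, 2.0, 12.0]})
trimmed = remove_outliers(homes, ["bedroomcnt"])
print(len(homes), len(trimmed), trimmed.bedroomcnt.max())
```

A larger `k` keeps more of the tails. Because each column's fence is applied to the same original frame, the order of the columns does not change which rows survive.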

In [9]:
zillow.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1790213 entries, 7 to 2152853
Data columns (total 9 columns):
 #   Column                        Dtype  
---  ------                        -----  
 0   bedroomcnt                    float64
 1   bathroomcnt                   float64
 2   calculatedfinishedsquarefeet  float64
 3   taxvaluedollarcnt             float64
 4   yearbuilt                     object 
 5   taxamount                     float64
 6   fips                          object 
 7   propertylandusetypeid         float64
 8   propertylandusedesc           object 
dtypes: float64(6), object(3)
memory usage: 136.6+ MB
