<a id="TableOfContents"></a>
# TABLE OF CONTENTS:
<li><a href='#imports'>Imports</a></li>
<li><a href="#acquiremvp">Acquire-MVP</a></li>
<li><a href='#preparemvp'>Prepare-MVP</a></li>
<li><a href="#acquire1">Acquire-V1</a></li>
<li><a href='#prepare1'>Prepare-V1</a></li>
<li><a href='#extra'>Extra</a></li>

<a id="imports"></a>
# Imports:
<li><a href='#TableOfContents'>Table of Contents</a></li>

In [1]:
# Vectorization and tables
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Stats
from scipy import stats

# .py files
import wrangle

<a id="acquiremvp"></a>
# Acquire-MVP:
<li><a href='#TableOfContents'>Table of Contents</a></li>

Acquire every column from the vanilla Zillow database through a SQL query over a database connection.

- Vanilla shape:
    - Rows: 77,613
    - Columns: 68

In [2]:
zillow = wrangle.acquire()
zillow.shape

(77613, 68)
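
The body of `wrangle.acquire()` is not shown in this notebook. As a rough sketch of the query-over-a-connection step it describes, the block below reads a table into a DataFrame with `pd.read_sql`. The `acquire_sketch` helper, the in-memory SQLite database standing in for the real Zillow server, and the two-row `properties` table are all assumptions made for illustration.

```python
import sqlite3

import pandas as pd


def acquire_sketch(conn, query='SELECT * FROM properties'):
    """Run a SQL query over an open connection and return a DataFrame.

    Hypothetical helper: the real wrangle.acquire() may work differently.
    """
    return pd.read_sql(query, conn)


# Stand-in database so the sketch runs without credentials (made-up rows)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE properties (parcelid INTEGER, bedroomcnt REAL)')
conn.executemany('INSERT INTO properties VALUES (?, ?)', [(1, 3.0), (2, None)])
conn.commit()

demo_acquired = acquire_sketch(conn)
print(demo_acquired.shape)
```

SQL `NULL` values arrive as `NaN`, which is what the null percentages in the next section count.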

<a id="preparemvp"></a>
# Prepare-MVP:
<li><a href='#TableOfContents'>Table of Contents</a></li>

### List o' column determinations:
- Drop Columns:
    - 'typeconstructiontypeid'
        - 99.9% Nulls
        - ID unnecessary
    - 'storytypeid'
        - 99.9% Nulls
        - ID unnecessary
    - 'propertylandusetypeid'
        - 32.4% Nulls
        - ID unnecessary
    - 'heatingorsystemtypeid'
        - 56.3% Nulls
        - ID unnecessary
    - 'buildingclasstypeid'
        - 100% Nulls
        - ID unnecessary
    - 'architecturalstyletypeid'
        - 99.9% Nulls
        - ID unnecessary
    - 'airconditioningtypeid'
        - 82.4% Nulls
        - ID unnecessary
    - 'parcelid'
        - 0.0% Nulls
        - ID unnecessary
    - 'logerror'
        - 0.0% Nulls
        - Appears unnecessary
    - 'transactiondate'
        - 0.0% Nulls
        - Appears unnecessary
    - 'id'
        - 32.4% Nulls
        - ID unnecessary
    - 'basementsqft'
        - 99.9% Nulls
        - Too many nulls
    - 'buildingqualitytypeid'
        - 56.5% Nulls
        - ID unnecessary
    - 'calculatedbathnbr'
        - 32.6% Nulls
        - Repeats 'bathroomcnt'
    - 'decktypeid'
        - 99.5% Nulls
        - ID unnecessary
    - 'finishedfloor1squarefeet'
        - 94.4% Nulls
        - Too many nulls
    - 'finishedsquarefeet12'
        - 32.8% Nulls
        - Repeats 'calculatedfinishedsquarefeet'
    - 'finishedsquarefeet13'
        - 100% Nulls
        - Too many nulls
    - 'finishedsquarefeet15'
        - 100% Nulls
        - Too many nulls
    - 'finishedsquarefeet50'
        - 94.4% Nulls
        - Too many nulls
    - 'finishedsquarefeet6'
        - 99.8% Nulls
        - Too many nulls
    - 'fireplacecnt'
        - 90.7% Nulls
        - Too many nulls
    - 'bathroomcnt'
        - 32.4% Nulls
        - Using 'fullbathcnt'
    - 'garagecarcnt'
        - 76.8% Nulls
        - Too many nulls
    - 'garagetotalsqft'
        - 76.8% Nulls
        - Too many nulls
    - 'hashottuborspa'
        - 98.0% Nulls
        - Too many nulls
    - 'latitude'
        - 32.4% Nulls
        - Unnecessary for now
    - 'longitude'
        - 32.4% Nulls
        - Unnecessary for now
    - 'poolcnt'
        - 85.7% Nulls
        - Too many nulls
    - 'poolsizesum'
        - 98.9% Nulls
        - Too many nulls
    - 'pooltypeid10'
        - 99.4% Nulls
        - Too many nulls
    - 'pooltypeid2'
        - 98.6% Nulls
        - Too many nulls
    - 'pooltypeid7'
        - 87.1% Nulls
        - Too many nulls
    - 'propertycountylandusecode'
        - 32.4% Nulls
        - ID unnecessary
    - 'propertyzoningdesc'
        - 56.4% Nulls
        - Too many nulls
    - 'rawcensustractandblock'
        - 32.4% Nulls
        - Unnecessary for now
    - 'regionidcity'
        - 33.8% Nulls
        - Unnecessary for now
    - 'regionidcounty'
        - 32.4% Nulls
        - Unnecessary for now
    - 'regionidneighborhood'
        - 75.5% Nulls
        - Too many nulls
    - 'regionidzip'
        - 32.5% Nulls
        - Unnecessary for now
    - 'roomcnt'
        - 32.4% Nulls
        - Numbers don't seem to relate to rest of data
    - 'threequarterbathnbr'
        - 91.3% Nulls
        - Too many nulls
    - 'unitcnt'
        - 56.4% Nulls
        - Too many nulls
    - 'yardbuildingsqft17'
        - 97.5% Nulls
        - Too many nulls
    - 'yardbuildingsqft26'
        - 99.9% Nulls
        - Too many nulls
    - 'numberofstories'
        - 81.2% Nulls
        - Too many nulls
    - 'fireplaceflag'
        - 99.9% Nulls
        - Too many nulls
    - 'structuretaxvaluedollarcnt'
        - 32.5% Nulls
        - Per instructions
    - 'assessmentyear'
        - 32.4% Nulls
        - Unnecessary for now
    - 'landtaxvaluedollarcnt'
        - 32.4% Nulls
        - Per instructions
    - 'taxamount'
        - 32.4% Nulls
        - Per instructions
    - 'taxdelinquencyflag'
        - 97.3% Nulls
        - Too many nulls
    - 'taxdelinquencyyear'
        - 97.3% Nulls
        - Too many nulls
    - 'censustractandblock'
        - 32.6% Nulls
        - Unnecessary for now
    - 'airconditioningdesc'
        - 82.4% Nulls
        - Too many nulls
    - 'architecturalstyledesc'
        - 99.9% Nulls
        - Too many nulls
    - 'buildingclassdesc'
        - 100% Nulls
        - Too many nulls
    - 'heatingorsystemdesc'
        - 56.3% Nulls
        - Too many nulls
    - 'storydesc'
        - 99.9% Nulls
        - Too many nulls
    - 'typeconstructiondesc'
        - 99.9% Nulls
        - Too many nulls
    - 'propertylandusedesc'
        - 32.4% Nulls
        - Unnecessary due to SQL query
- Fix columns:
    - 'bedroomcnt'
        - Fill nulls with mode
        - 'bedroomcnt' ==> 'bedrooms'
        - dtype ==> int
    - 'calculatedfinishedsquarefeet'
        - Fill nulls with mean
        - 'calculatedfinishedsquarefeet' ==> 'home_sqft'
    - 'fips'
        - Fill nulls with mode
        - 'fips' ==> 'county'
        - '6037' ==> 'Los Angeles'
        - '6059' ==> 'Orange'
        - '6111' ==> 'Ventura'
    - 'fullbathcnt'
        - Fill nulls with mode
        - 'fullbathcnt' ==> 'full_bathrooms'
        - dtype ==> int
    - 'lotsizesquarefeet'
        - Fill nulls with mean
        - 'lotsizesquarefeet' ==> 'lotsize_sqft'
    - 'yearbuilt'
        - Fill nulls with mode
        - Find diff to 2017
        - Change values to diff
        - 'yearbuilt' ==> 'home_age'
        - dtype ==> int
    - 'taxvaluedollarcnt'
        - Fill nulls with mean
        - 'taxvaluedollarcnt' ==> 'value'
- Create columns:
    - 'home_lot_ratio'
        - 'home_sqft' / 'lotsize_sqft', rounded to 2 decimal places
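
The null percentages in the list above can be reproduced with a short report like the one sketched here. The `null_report` helper and the three-column demo frame are hypothetical; only the `isnull().mean()` idiom is standard pandas.

```python
import numpy as np
import pandas as pd


def null_report(df):
    """Percentage of nulls per column, highest first, rounded to one decimal."""
    return (df.isnull().mean() * 100).round(1).sort_values(ascending=False)


# Made-up frame that borrows three of the real column names
demo_nulls = pd.DataFrame({
    'basementsqft': [np.nan, np.nan, np.nan, 600.0],
    'bedroomcnt': [3.0, 2.0, np.nan, 4.0],
    'parcelid': [1, 2, 3, 4],
})

report = null_report(demo_nulls)
print(report)
```

Run against the acquired frame, the same call would be `null_report(zillow)`.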

In [3]:
# Drop columns
zillow = zillow.drop(columns=[ 
     'typeconstructiontypeid',
     'storytypeid',
     'propertylandusetypeid',
     'heatingorsystemtypeid',
     'buildingclasstypeid',
     'architecturalstyletypeid',
     'airconditioningtypeid',
     'parcelid',
     'logerror',
     'transactiondate',
     'id',
     'basementsqft',
     'bathroomcnt',
     'buildingqualitytypeid',
     'calculatedbathnbr',
     'decktypeid',
     'finishedfloor1squarefeet',
     'finishedsquarefeet12',
     'finishedsquarefeet13',
     'finishedsquarefeet15',
     'finishedsquarefeet50',
     'finishedsquarefeet6',
     'fireplacecnt',
     'garagecarcnt',
     'garagetotalsqft',
     'hashottuborspa',
     'latitude',
     'longitude',
     'poolcnt',
     'poolsizesum',
     'pooltypeid10',
     'pooltypeid2',
     'pooltypeid7',
     'propertycountylandusecode',
     'propertyzoningdesc',
     'rawcensustractandblock',
     'regionidcity',
     'regionidcounty',
     'regionidneighborhood',
     'regionidzip',
     'roomcnt',
     'threequarterbathnbr',
     'unitcnt',
     'yardbuildingsqft17',
     'yardbuildingsqft26',
     'numberofstories',
     'fireplaceflag',
     'structuretaxvaluedollarcnt',
     'assessmentyear',
     'landtaxvaluedollarcnt',
     'taxamount',
     'taxdelinquencyflag',
     'taxdelinquencyyear',
     'censustractandblock',
     'airconditioningdesc',
     'architecturalstyledesc',
     'buildingclassdesc',
     'heatingorsystemdesc',
     'propertylandusedesc',
     'storydesc',
     'typeconstructiondesc'])
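
Many of the columns in the hand-typed list above are dropped only because they have too many nulls. One possible way to handle that subset without listing names is a threshold filter, sketched below. The `drop_sparse_columns` helper and the 50% cutoff are assumptions; columns dropped for other reasons (IDs, duplicates, instructions) would still need an explicit list.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_null_pct=50.0):
    """Keep only columns whose null percentage is at or below max_null_pct."""
    null_pct = df.isnull().mean() * 100
    keep = null_pct[null_pct <= max_null_pct].index
    return df[keep]


# Made-up frame: 'basementsqft' is 75% null, the other columns are not
demo_sparse = pd.DataFrame({
    'basementsqft': [np.nan, np.nan, np.nan, 600.0],
    'bedroomcnt': [3.0, 2.0, np.nan, 4.0],
    'parcelid': [1, 2, 3, 4],
})

trimmed = drop_sparse_columns(demo_sparse)
print(list(trimmed.columns))
```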

In [4]:
# Fill nulls in remaining columns

# bedroomcnt
# Fill na with mode = 3.0 
# Change to int type (NO FLOATS)
zillow.bedroomcnt = zillow.bedroomcnt.fillna(3.0).astype(int)

# calculatedfinishedsquarefeet
# Fill na with mean = 1922.89
# Leave as float type: the filled mean is the only non-whole value (ONLY FLOAT)
zillow.calculatedfinishedsquarefeet = zillow.calculatedfinishedsquarefeet.fillna(1922.89)

# fips
# Fill na with mode = 6037 
zillow.fips = zillow.fips.fillna(6037).astype(int)

# fullbathcnt
# Fill na with mode = 2.0 
# Change to int type (NO FLOATS)
zillow.fullbathcnt = zillow.fullbathcnt.fillna(2.0).astype(int)

# lotsizesquarefeet
# Fill na with mean = 11339.62
# Leave as float type (HAS FLOATS)
zillow.lotsizesquarefeet = zillow.lotsizesquarefeet.fillna(11339.62)

# yearbuilt
# Fill na with mode = 1955 
# Change to int type (NO FLOATS)
zillow.yearbuilt = zillow.yearbuilt.fillna(1955).astype(int)

# taxvaluedollarcnt
# Fill na with mean = 529688.16
# Leave as float type: the filled mean is the only non-whole value (ONLY FLOAT)
zillow.taxvaluedollarcnt = zillow.taxvaluedollarcnt.fillna(529688.16)
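
The fill values in the cell above are typed in by hand, so they would go stale if the query or the data changed. A minimal sketch of computing them from the frame instead is shown below, assuming a hypothetical `fill_nulls` helper and a small made-up frame.

```python
import numpy as np
import pandas as pd


def fill_nulls(df, mode_cols, mean_cols):
    """Return a copy with nulls filled by each column's own mode or mean."""
    out = df.copy()
    for col in mode_cols:
        # mode() can return several values; take the first (smallest)
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    for col in mean_cols:
        out[col] = out[col].fillna(round(out[col].mean(), 2))
    return out


# Made-up rows using two of the real column names
demo_fill = pd.DataFrame({
    'bedroomcnt': [3.0, 3.0, 2.0, np.nan],
    'lotsizesquarefeet': [5000.0, 7000.0, np.nan, 6000.0],
})

filled = fill_nulls(demo_fill, mode_cols=['bedroomcnt'], mean_cols=['lotsizesquarefeet'])
filled['bedroomcnt'] = filled['bedroomcnt'].astype(int)  # safe once nulls are gone
print(filled)
```

The cast to `int` has to come after the fill, because a float column that still contains `NaN` cannot be converted with `astype(int)`.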

In [5]:
# Fix remaining columns

# fips 
# 6037 ==> Los Angeles
# 6059 ==> Orange
# 6111 ==> Ventura
# Ensure object type
conditions = [
    zillow.fips == 6037,
    zillow.fips == 6059,
    zillow.fips == 6111
]

choices = [
    'Los Angeles',
    'Orange',
    'Ventura'
]

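# Explicit string default: np.select's built-in default of 0 is an int,
# which does not share a dtype with the string choices
zillow.fips = np.select(conditions, choices, default='Unknown')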

# yearbuilt
# Get difference of year to 2017
zillow.yearbuilt = 2017 - zillow.yearbuilt
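
As an alternative to the condition and choice lists above, the same code-to-name translation can be written as a dictionary lookup with `Series.map`. This is only a sketch on a made-up three-row frame; `FIPS_TO_COUNTY` is a hypothetical name, and the FIPS codes are the ones listed in this notebook.

```python
import pandas as pd

# Codes and names taken from the column determinations above
FIPS_TO_COUNTY = {6037: 'Los Angeles', 6059: 'Orange', 6111: 'Ventura'}

demo_county = pd.DataFrame({
    'fips': [6037, 6111, 6059],
    'yearbuilt': [1955, 2000, 2017],
})

# Any code missing from the dictionary would become NaN rather than a name
demo_county['county'] = demo_county['fips'].map(FIPS_TO_COUNTY)

# Age relative to 2017, as in the cell above
demo_county['home_age'] = 2017 - demo_county['yearbuilt']
print(demo_county)
```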

In [6]:
# Rename remaining columns

zillow = zillow.rename(columns={
    'bedroomcnt' : 'bedrooms',
    'calculatedfinishedsquarefeet' : 'home_sqft',
    'fips' : 'county',
    'fullbathcnt' : 'full_bathrooms',
    'lotsizesquarefeet' : 'lotsize_sqft',
    'yearbuilt' : 'home_age',
    'taxvaluedollarcnt' : 'value'
})

In [7]:
# Create columns

# Ratio of home size to lot size
zillow['home_lot_ratio'] = round(zillow.home_sqft / zillow.lotsize_sqft, 2)
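
As a quick check of the ratio, the sketch below uses the home and lot sizes from the sampled row shown later in this notebook (1006.0 and 5700.0). The second row and the zero-lot guard are assumptions added for illustration; the notebook itself does not say whether zero lot sizes occur.

```python
import numpy as np
import pandas as pd

demo_ratio = pd.DataFrame({
    'home_sqft': [1006.0, 1500.0],
    'lotsize_sqft': [5700.0, 0.0],  # second row is a made-up zero-size lot
})

# Assumption: treat a zero lot size as missing so the ratio is NaN, not inf
safe_lot = demo_ratio['lotsize_sqft'].replace(0, np.nan)
demo_ratio['home_lot_ratio'] = (demo_ratio['home_sqft'] / safe_lot).round(2)
print(demo_ratio)
```

The first row gives 1006 / 5700 ≈ 0.176, which rounds to the 0.18 seen in the sample output.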

In [8]:
# Test .py file functionality
train, validate, test = wrangle.wrangle_zillow_mvp()
train.sample()

train.shape:(43463, 10)
validate.shape:(18627, 10)
test.shape:(15523, 10)


| | bedrooms | home_sqft | full_bathrooms | lotsize_sqft | home_age | value | home_lot_ratio | county_Los Angeles | county_Orange | county_Ventura |
|---|---|---|---|---|---|---|---|---|---|---|
| 47960 | 2 | 1006.0 | 1 | 5700.0 | 75 | 492396.0 | 0.18 | 1 | 0 | 0 |


- Prepped shape:
    - Rows: 77,613 (43,463 train + 18,627 validate + 15,523 test)
    - Columns: 10
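
The body of `wrangle.wrangle_zillow_mvp()` is not shown here. The reported shapes are consistent with holding out 20% of the rows for test and then 30% of the remainder for validate, with `county` one-hot encoded into three columns. A minimal sketch under those assumptions follows; the `encode_and_split` helper, the fractions, the seed, and the demo frame are all hypothetical.

```python
import math

import pandas as pd


def encode_and_split(df, test_frac=0.2, validate_frac=0.3, seed=123):
    """One-hot encode 'county', shuffle, and cut into train/validate/test."""
    encoded = pd.get_dummies(df, columns=['county'], dtype=int)
    shuffled = encoded.sample(frac=1, random_state=seed)

    n_test = math.ceil(len(shuffled) * test_frac)
    test_part, rest = shuffled.iloc[:n_test], shuffled.iloc[n_test:]

    n_validate = math.ceil(len(rest) * validate_frac)
    validate_part, train_part = rest.iloc[:n_validate], rest.iloc[n_validate:]
    return train_part, validate_part, test_part


# Made-up frame: 100 rows cycling through the three counties
counties = ['Los Angeles', 'Orange', 'Ventura']
demo_split = pd.DataFrame({
    'value': [float(i) for i in range(100)],
    'county': [counties[i % 3] for i in range(100)],
})

train_df, validate_df, test_df = encode_and_split(demo_split)
print(train_df.shape, validate_df.shape, test_df.shape)
```

Every row lands in exactly one of the three frames, which is why the three row counts above add back up to 77,613.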

<a id="extra"></a>
# Extra:
<li><a href='#TableOfContents'>Table of Contents</a></li>