<img src="https://prnewswire2-a.akamaihd.net/p/1893751/sp/189375100/thumbnail/entry_id/1_su9da4fu/def_height/1001/def_width/1911/version/100011/type/2/q/100"  width="300" height="200">

# Data Preparation

Let's create a road map to guide us through preparation.
1. Handle missing values
1. Cast data types
1. Rename columns
1. Visualize distributions of features

In [1]:
# Import libraries for cleaning data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# import functions to acquire data and visualize missing values.
import sys
sys.path.insert(1, '../src/')

from acquire import get_zillow_data
from prepare import handle_missing_values, missing_values_summary, prepare_zillow

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_rows', None)

In [2]:
# Let's use the function we created in acquisition to acquire the data.
# assign the data to a dataframe
df = get_zillow_data()

What's the shape of our data?

In [None]:
print(f"There are {df.shape[0]} properties and {df.shape[1]} features.")

In [19]:
print(df.parcelid.duplicated().sum())

119


In [None]:
# drop_duplicates returns a new DataFrame, so assign the result back.
df = df.drop_duplicates()
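
Note that the duplicate check above counts repeated `parcelid` values, while `drop_duplicates()` with no arguments only removes rows that are identical in every column. A minimal sketch on a toy frame (the values and the `value` column are made up for illustration) shows the difference and how `subset` targets the key column:

```python
import pandas as pd

# Toy data (hypothetical): parcel 2 appears twice with different values.
toy = pd.DataFrame({
    'parcelid': [1, 2, 2, 3],
    'value': [10.0, 20.0, 25.0, 30.0],
})

# No rows are identical across all columns, so nothing is removed.
full_row_dedup = toy.drop_duplicates()

# Deduplicate on the key only, keeping the first occurrence of each parcel.
by_parcel = toy.drop_duplicates(subset='parcelid', keep='first')

print(len(full_row_dedup), len(by_parcel))
```

Which occurrence to keep (`first`, `last`, or the most recent transaction) is a modeling decision this notebook does not spell out.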

| Feature | Description |
| :------ | :---------- |
| `airconditioningtypeid` | Type of cooling system present in the home (if any) |
| `architecturalstyletypeid` | Architectural style of the home (e.g. ranch, colonial, split-level) |
| `basementsqft` | Finished living area below or partially below ground level |
| `bathroomcnt` | Number of bathrooms in home including fractional bathrooms |
| `bedroomcnt` | Number of bedrooms in home |
| `buildingqualitytypeid` | Overall assessment of condition of the building from best (lowest) to worst (highest) |
| `buildingclasstypeid` | The building framing type (steel frame, wood frame, concrete/brick) |
| `calculatedbathnbr` | Number of bathrooms in home including fractional bathroom |
| `decktypeid` | Type of deck (if any) present on parcel |
| `threequarterbathnbr` | Number of 3/4 bathrooms in house (shower + sink + toilet) |
| `finishedfloor1squarefeet` | Size of the finished living area on the first (entry) floor of the home |
| `calculatedfinishedsquarefeet` | Calculated total finished living area of the home |
| `finishedsquarefeet6` | Base unfinished and finished area |
| `finishedsquarefeet12` | Finished living area |
| `finishedsquarefeet13` | Perimeter living area |
| `finishedsquarefeet15` | Total area |
| `finishedsquarefeet50` | Size of the finished living area on the first (entry) floor of the home |
| `fips` | Federal Information Processing Standard code - see https://en.wikipedia.org/wiki/FIPS_county_code for more details |
| `fireplacecnt` | Number of fireplaces in a home (if any) |
| `fireplaceflag` | Is a fireplace present in this home |
| `fullbathcnt` | Number of full bathrooms (sink, shower + bathtub, and toilet) present in home |
| `garagecarcnt` | Total number of garages on the lot including an attached garage |
| `garagetotalsqft` | Total number of square feet of all garages on lot including an attached garage |
| `hashottuborspa` | Does the home have a hot tub or spa |
| `heatingorsystemtypeid` | Type of home heating system |
| `latitude` | Latitude of the middle of the parcel multiplied by 10e6 |
| `longitude` | Longitude of the middle of the parcel multiplied by 10e6 |
| `lotsizesquarefeet` | Area of the lot in square feet |
| `numberofstories` | Number of stories or levels the home has |
| `parcelid` | Unique identifier for parcels (lots) |
| `poolcnt` | Number of pools on the lot (if any) |
| `poolsizesum` | Total square footage of all pools on property |
| `pooltypeid10` | Spa or Hot Tub |
| `pooltypeid2` | Pool with Spa/Hot Tub |
| `pooltypeid7` | Pool without hot tub |
| `propertycountylandusecode` | County land use code, i.e. its zoning at the county level |
| `propertylandusetypeid` | Type of land use the property is zoned for |
| `propertyzoningdesc` | Description of the allowed land uses (zoning) for that property |
| `rawcensustractandblock` | Census tract and block ID combined - also contains blockgroup assignment by extension |
| `censustractandblock` | Census tract and block ID combined - also contains blockgroup assignment by extension |
| `regionidcounty` | County in which the property is located |
| `regionidcity` | City in which the property is located (if any) |
| `regionidzip` | Zip code in which the property is located |
| `regionidneighborhood` | Neighborhood in which the property is located |
| `roomcnt` | Total number of rooms in the principal residence |
| `storytypeid` | Type of floors in a multi-story house (e.g. basement and main level, split-level, attic). See tab for details. |
| `typeconstructiontypeid` | What type of construction material was used to construct the home |
| `unitcnt` | Number of units the structure is built into (e.g. 2 = duplex, 3 = triplex) |
| `yardbuildingsqft17` | Patio in yard |
| `yardbuildingsqft26` | Storage shed/building in yard |
| `yearbuilt` | The year the principal residence was built |
| `taxvaluedollarcnt` | The total tax assessed value of the parcel |
| `structuretaxvaluedollarcnt` | The assessed value of the built structure on the parcel |
| `landtaxvaluedollarcnt` | The assessed value of the land area of the parcel |
| `taxamount` | The total property tax assessed for that assessment year |
| `assessmentyear` | The year of the property tax assessment |
| `taxdelinquencyflag` | Property taxes for this parcel are past due as of 2015 |
| `taxdelinquencyyear` | Year for which the unpaid property taxes were due |


# Handle Missing Values
1. Missing Values Summary
2. Drop columns with 100% of values missing
3. Salvage columns with 80-99.99% of values missing
> Repeat until all columns that have a missing value have been dropped or salvaged.

Let's look at the features of our dataset.

In [None]:
df.head()

## Missing Values Summary

In [None]:
# Use a function to calculate missing value stats.
df_missing = missing_values_summary(df)
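
`missing_values_summary` lives in `../src/prepare.py` and is not shown here. As a rough sketch of what such a helper could look like, assuming it returns one row per column with the `attribute`, `num_rows_missing`, and `pct_rows_missing` fields used in this notebook (the function name `sketch_missing_values_summary` and the toy data are hypothetical):

```python
import numpy as np
import pandas as pd

def sketch_missing_values_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with the count and fraction of missing values."""
    num_missing = frame.isna().sum()
    return pd.DataFrame({
        'attribute': num_missing.index,
        'num_rows_missing': num_missing.to_numpy(),
        # Stored as a fraction between 0 and 1, not as a percentage.
        'pct_rows_missing': num_missing.to_numpy() / len(frame),
    })

# Toy data: 'a' is half missing, 'b' is fully missing, 'c' is complete.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [np.nan] * 4,
    'c': [1, 2, 3, 4],
})
summary = sketch_missing_values_summary(toy)
print(summary)
```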

In [None]:
missing_summary = df_missing.sort_values(by='num_rows_missing', ascending=False)

In [None]:
# Visualize the missing data.
sns.set_theme(style="whitegrid")
plt.figure(figsize=(7, 20))

sns.barplot(
    
    y='attribute',
    x='pct_rows_missing',
    data=missing_summary,
    orient='h',
    color='Royalblue',
)

plt.title("What's the percentage of missing values in each column?")
plt.xlabel('% of column missing')
plt.ylabel('column name')

plt.xticks(
    
    ticks=np.linspace(0, 1, 6),
    labels=['0%', '20%', '40%', '60%', '80%', '100%']
)

plt.xlim(0, 1)
plt.show()

## Columns with 100% of Values Missing
More than half of our columns are missing at least one value. Let's start by dropping the columns that are missing 100% of their values.

<img src="imputing_nans.jpg" width="300" height="300">

In [None]:
# Filter out the summary rows for columns with 100% of values missing
missing_summary = missing_summary[missing_summary.pct_rows_missing != 1]
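
The cell above filters the summary table rather than the property data itself. A minimal sketch of removing fully empty columns from a DataFrame directly, using toy data with a hypothetical `all_missing` column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'bedroomcnt': [3, 2, 4],
    'basementsqft': [np.nan, 500.0, np.nan],   # partially missing: kept
    'all_missing': [np.nan, np.nan, np.nan],   # hypothetical empty column: dropped
})

# how='all' drops a column only when every value in it is missing.
trimmed = toy.dropna(axis='columns', how='all')
print(trimmed.columns.tolist())
```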

## Columns with 80-99.99% of Values Missing

Before we drop the columns that are missing more than 80% of their values, let's look at their values to see which ones are salvageable.

In [None]:
# start with > 90% missing values first
missing_summary[missing_summary.pct_rows_missing > .9]

## Salvage features

- Matching {typeid: typedesc} columns
- Pool columns (including `hashottuborspa`)
- Features measured in square feet
- Fireplace
- Tax delinquency columns


### Matching {typeid: typedesc} columns
1. `typeconstructiontypeid` : `typeconstructiondesc`
2. `storytypeid` : `storydesc`
3. `architecturalstyletypeid` : `architecturalstyledesc`

In [None]:
# drop this column
pd.crosstab(df.typeconstructiontypeid,
            df.typeconstructiondesc)

The story type represents how many levels/floors a property has, whether it has an attic, a basement, etc. 'Basement' is the only value in `storydesc`, so let's rename it `has_basement` with values 0/1.<br>Where 0 == no basement, 1 == basement

In [None]:
# keep these columns
pd.crosstab(df.storytypeid,
            df.storydesc)

Note: If we had an ML algorithm that could use Google Images, we could infer the architectural style of every property. :O Or we could infer the architectural style using clustering! (If we had more values...)

In [None]:
# drop this column
# Not enough values to infer the architectural style of a property.
pd.crosstab(df.architecturalstyletypeid,
            df.architecturalstyledesc)

##### Matching {typeid: typedesc} columns Summary
Columns to keep:
- `storytypeid` and `storydesc`
    - Create a new column called `has_basement` where: 0 == no basement, 1 == basement
        - Drop both columns after creating the new column.

Columns to drop:
- `typeconstructiontypeid`
- `typeconstructiondesc`
- `architecturalstyletypeid`
- `architecturalstyledesc`

### Pool columns (including hot tub/spa)

The only value in `hashottuborspa` is 1. Let's rename it `has_hottub_or_spa` with values 0/1.<br>
Where 0 == no hottub or spa, 1 == hot tub or spa

In [None]:
# keep this column
df.hashottuborspa.value_counts()
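
Because a missing value here is read as "no hot tub or spa", the 0/1 flag can be derived from whether a value is present at all. A minimal sketch on toy data, assuming missing really does mean absent (an assumption this notebook makes but the data dictionary does not confirm):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'hashottuborspa': [1.0, np.nan, np.nan, 1.0]})

# notna() is True where a value was recorded; casting to int yields 1/0.
toy['has_hottub_or_spa'] = toy['hashottuborspa'].notna().astype(int)
print(toy['has_hottub_or_spa'].tolist())
```

The same pattern applies to any column whose only recorded value marks presence, such as `poolcnt` or `storydesc`.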

Each pool column has a single unique value, 1 (one swimming pool). `poolcnt` has the fewest missing values, so let's salvage it and rename `poolcnt` as `has_pool`.
<br>Where 0 == no pool, 1 == pool

In [None]:
# keep this column
# The number of pools on a property
print(df.poolcnt.value_counts(), end='\n\n')

# drop these columns
# Pool with Spa/Hot Tub
print(df.pooltypeid2.value_counts(), end='\n\n')

# Pool without hot tub
print(df.pooltypeid7.value_counts(), end='\n\n')

# Spa or Hot Tub
print(df.pooltypeid10.value_counts(), end='\n')

Of the properties that have pools, there are 262 unique pool sizes. Let's rename `poolsizesum` as `pool_area_sqft`.

In [None]:
# Area of all pools on a property in square feet.
print(df.poolsizesum.nunique())

Let's take a look at the distribution of pool area in square feet.

In [None]:
df[['poolsizesum']].describe().T

In [None]:
plt.figure(figsize=(13, 7))
df.poolsizesum.hist()

# The only properties with a pool are located in Ventura county
# See the Appendix
plt.title('Distribution of Pool Sizes in Ventura County')
plt.xlabel('pool size (area in sqft)')
plt.ylabel('# of occurrences')
plt.show()

In [None]:
pd.cut(df.poolsizesum, 4).value_counts(dropna=False)

##### Pool columns (including hot tub/spa) Summary
Columns to keep:
> Drop these columns once the new ones are created.
- `hashottuborspa`
    - Create a new column called `has_hottub_or_spa` with the values 0/1.<br>Where: 0 == no hot tub or spa, 1 == hot tub or spa
- `poolcnt`
    - Create a new column called `has_pool`.<br>
    Where: 0 == no pool, 1 == pool
- `poolsizesum`
    - Create a new column called `pool_area_sqft`. Fill missing values with 0.
    - For those with a pool but no reported area, impute the median.

Columns to drop:
- `pooltypeid2`
- `pooltypeid7`
- `pooltypeid10`
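
The `poolsizesum` rule above has two cases: properties without a pool get an area of 0, and properties with a pool but no reported size get the median of the reported sizes. A minimal sketch on toy data, assuming a missing `poolcnt` means no pool:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'poolcnt':     [1.0, 1.0, np.nan, 1.0, np.nan],
    'poolsizesum': [400.0, np.nan, np.nan, 600.0, np.nan],
})

has_pool = toy['poolcnt'].notna()
# Median of the reported pool sizes only (NaN values are skipped by default).
median_pool_area = toy.loc[has_pool, 'poolsizesum'].median()

pool_area = toy['poolsizesum'].copy()
# Pool present but size missing: impute the median.
pool_area[has_pool & pool_area.isna()] = median_pool_area
# No pool: area is 0.
pool_area[~has_pool] = 0.0

toy['has_pool'] = has_pool.astype(int)
toy['pool_area_sqft'] = pool_area
print(toy)
```

Filling every missing `poolsizesum` with 0 in one step would skip the median case, so the order of the two assignments matters only in that both must be applied.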

### Features measured in square feet

In [None]:
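# Duplicate column check. Series.equals treats NaNs in the same positions as equal,
# whereas an element-wise == comparison returns False wherever both values are NaN.
df.finishedfloor1squarefeet.equals(df.finishedsquarefeet50)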
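
Element-wise `==` treats NaN as unequal to NaN, so comparing two columns that share missing values needs a NaN-aware check. A small sketch with toy data:

```python
import numpy as np
import pandas as pd

a = pd.Series([100.0, np.nan, 250.0])
b = pd.Series([100.0, np.nan, 250.0])

# Element-wise comparison: NaN == NaN is False, so the check fails.
elementwise_equal = bool(np.all(a == b))

# Series.equals treats NaNs in the same positions as equal.
nan_aware_equal = a.equals(b)

print(elementwise_equal, nan_aware_equal)
```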

In [None]:
# Drop column
print(df.finishedsquarefeet6.count())

In [None]:
# Patio in yard
plt.figure(figsize=(13, 7))
df.yardbuildingsqft17.hist()
plt.show()

In [None]:
# Storage shed/building in yard
plt.figure(figsize=(13, 7))
df.yardbuildingsqft26.hist()
plt.show()

In [None]:
# Basement size in square feet
plt.figure(figsize=(13, 7))
df.basementsqft.hist()
plt.show()

In [None]:
df.storydesc.count() == df.basementsqft.count()

# Prepare the data
<div class='alert alert-block alert-info'>Will return to salvage more features if time permits.</div>

1. storytypeid and storydesc
    - Create a new column called `has_basement` where: 0 == no basement, 1 == basement
        - Drop both columns after creating the new column.
1. hashottuborspa
    - Create a new column called `has_hottub_or_spa` with the values 0/1.
    Where: 0 == no hot tub or spa, 1 == hot tub or spa
1. poolcnt
    - Create a new column called `has_pool`.
    Where: 0 == no pool, 1 == pool
1. poolsizesum
    - Create a new column called `pool_area_sqft`. Fill missing values with 0.
    For those with a pool but no reported area, impute the median.

## `has_basement`

In [None]:
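# Encode missing as 0 and 'Basement' as 1. Assign the result back: in-place edits
# on a column accessed as an attribute are not guaranteed to update df.
df['storydesc'] = df.storydesc.fillna(0).replace('Basement', 1)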

In [None]:
df.storydesc.value_counts()

In [None]:
df['has_basement'] = df.storydesc

In [None]:
df.has_basement.value_counts()

## `has_hottub_or_spa`

In [None]:
# Assign back rather than relying on an in-place edit of the column.
df['hashottuborspa'] = df.hashottuborspa.fillna(0)

In [None]:
df['has_hottub_or_spa'] = df.hashottuborspa

## `has_pool` and `pool_area`

In [None]:
# A missing pool count is treated as no pool.
df['poolcnt'] = df.poolcnt.fillna(0)
df['has_pool'] = df.poolcnt

In [None]:
df.has_pool.value_counts()

In [None]:
# A missing pool size is treated as 0 square feet.
df['poolsizesum'] = df.poolsizesum.fillna(0)
df['pool_area_sqft'] = df.poolsizesum

In [None]:
df.pool_area_sqft.value_counts().head()

## `has_patio` and `patio_area_sqft`

In [None]:
# np.int was removed in NumPy 1.24; the built-in int is the replacement.
df['has_patio'] = df.yardbuildingsqft17.notnull().astype(int)
df['patio_area_sqft'] = df.yardbuildingsqft17.fillna(0)

In [None]:
df.has_patio.value_counts().head()

## `has_shed` and `shed_area_sqft`

In [None]:
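# np.int was removed in NumPy 1.24; the built-in int is the replacement.
df['has_shed'] = df.yardbuildingsqft26.notnull().astype(int)
# yardbuildingsqft26 is the storage shed/building, so name the area column accordingly.
df['shed_area_sqft'] = df.yardbuildingsqft26.fillna(0)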

## `num_of_bathrooms` and `num_of_bedrooms`

In [None]:
bathroom_median = df.calculatedbathnbr.median()

In [None]:
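# Fill missing bathroom counts with the median, then rename to match the section heading.
df['calculatedbathnbr'] = df.calculatedbathnbr.fillna(bathroom_median)
df.rename(columns={'calculatedbathnbr': 'num_of_bathrooms'}, inplace=True)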

In [None]:
df.head()

In [None]:
df.bedroomcnt.value_counts(dropna=False)

## `age_of_property`

In [None]:
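median_year_built = df.yearbuilt.median()
current_year = 2017

# Fill missing build years with the median, then cast to integer
# (np.int was removed in NumPy 1.24; use the built-in int).
df['yearbuilt'] = df.yearbuilt.fillna(median_year_built).astype(int)

df['age_of_property'] = current_year - df.yearbuilt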

## `lot_size_sqft`

In [None]:
median_lot_in_sqft = df.lotsizesquarefeet.median()
df['lot_size_sqft'] = df.lotsizesquarefeet.fillna(median_lot_in_sqft)

In [None]:
# Let's drop the columns
features_to_drop = [
    'decktypeid',
    'buildingclasstypeid',
    'buildingqualitytypeid',
    'finishedsquarefeet6',
    'finishedsquarefeet13',
    'finishedsquarefeet15',
    'buildingclassdesc',
    'pooltypeid2',
    'pooltypeid7',
    'pooltypeid10',
    'regionidcity',
    'regionidcounty',
    'regionidzip',    
    'typeconstructiontypeid',
    'typeconstructiondesc',
    'architecturalstyletypeid',
    'architecturalstyledesc',
    'storytypeid',
    'storydesc',
    'hashottuborspa',
    'poolcnt',
    'poolsizesum',
    'yardbuildingsqft17',
    'yardbuildingsqft26',
    'taxdelinquencyyear',
    'taxdelinquencyflag',
    'finishedsquarefeet50',
    'finishedfloor1squarefeet',
    'censustractandblock',
    'rawcensustractandblock',
    'propertylandusetypeid',
    'id',
    'assessmentyear',
    'finishedsquarefeet12',
    'bathroomcnt',
    'fullbathcnt',
    'basementsqft',
    'threequarterbathnbr',
    'lotsizesquarefeet',
    'propertylandusedesc',
    'propertycountylandusecode',
]
   

In [None]:
df.drop(columns=features_to_drop, inplace=True)

In [None]:
df.shape

In [None]:
df = handle_missing_values(df)
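
`handle_missing_values` is imported from `../src/prepare.py` and its implementation is not shown. A common pattern for such a helper is to drop columns, then rows, that fall below a required share of non-missing values. The sketch below is one hypothetical version; the function name, parameters, and thresholds are assumptions, not the project's actual code:

```python
import numpy as np
import pandas as pd

def sketch_handle_missing_values(frame, prop_required_column=0.5, prop_required_row=0.75):
    """Drop columns, then rows, with too few non-missing values (hypothetical helper)."""
    # Minimum number of non-missing values a column must have to be kept.
    col_thresh = int(round(prop_required_column * len(frame)))
    frame = frame.dropna(axis='columns', thresh=col_thresh)
    # Minimum number of non-missing values a row must have to be kept.
    row_thresh = int(round(prop_required_row * frame.shape[1]))
    return frame.dropna(axis='index', thresh=row_thresh)

toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],   # only 25% present: column dropped
    'c': [1.0, np.nan, 3.0, 4.0],
    'd': [1.0, np.nan, 3.0, 4.0],
})
cleaned = sketch_handle_missing_values(toy)
print(cleaned)
```

Dropping columns before rows means the row threshold is evaluated only against the columns that survive.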

In [None]:
columns_to_impute = df.columns[df.isna().any()].tolist()

In [None]:
columns_to_impute

In [None]:
for column_name in columns_to_impute:
    median = df[column_name].median()
    df[column_name] = df[column_name].fillna(median)
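
The loop above calls `.median()` on every column that still has missing values, which only works for numeric columns. A minimal sketch, on toy data that includes a text column, of restricting the imputation to numeric columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'taxamount': [1000.0, np.nan, 3000.0],
    'propertyzoningdesc': ['R1', None, 'R2'],   # text column: left untouched
})

# Numeric columns that still contain at least one missing value.
numeric_cols = toy.select_dtypes(include='number').columns
cols_to_impute = [c for c in numeric_cols if toy[c].isna().any()]

# fillna with a Series of medians fills each column with its own median.
toy[cols_to_impute] = toy[cols_to_impute].fillna(toy[cols_to_impute].median())
print(toy)
```

Text columns would need a different strategy, such as the mode or an explicit "unknown" category.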

In [None]:
df.isna().sum()

In [None]:
df = prepare_zillow()
df.info()

In [None]:
df.rename(columns={'calculatedfinishedsquarefeet': 'living_room_area_sqft',
                   'structuretaxvaluedollarcnt': 'structure_tax',
                   'taxvaluedollarcnt': 'taxable_value',
                   'landtaxvaluedollarcnt': 'land_tax',
                   'taxamount': 'property_tax',
                   'lasttransactiondate': 'date_sold',
                   'roomcnt': 'num_of_rooms',
                   'yearbuilt': 'year_built'},
          inplace=True)

In [None]:
df.info()

# Appendix

Use this snippet to load the data dictionary in your local environment once you've cloned the repository:
```python
pd.read_csv('data_dictionary.csv')
```

In [None]:
df_appendix = get_zillow_data()

In [None]:
# Iteration #2 Figure out what the codes mean, Mason.
df_appendix.groupby(by='fips').propertycountylandusecode.value_counts(dropna=False)

In [None]:
# Properties in Ventura County are the only properties that have pool area reported!
df_appendix.groupby(by='fips').poolsizesum.value_counts(dropna=False).head()