### Imports

In [15]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

from acquire import acquire_data
from prepare import the_master_imputer, data_prep, percent_of_values_missing, change_data_to_int, change_data_to_object
from set_counties import create_county_cols
from summerize import summarize_data, show_distribution

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Acquire

- using the function from our acquire.py file, we pull our data from SQL, save it to a csv, drop the Unnamed column, and assign the result to a data frame
- we also cut outliers by removing any home whose structuretaxvaluedollarcnt was > $1,000,000

In [2]:
df = acquire_data()

Acquiring data ...

- csv already exist

Data has been acquired
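
The body of `acquire_data` is not shown in this notebook, so the following is only a rough sketch of the caching and outlier steps described above. The function name `acquire_data_sketch`, the `csv_path` and `query_fn` parameters, and the exact handling of the Unnamed column are assumptions made for illustration.

```python
import os
import pandas as pd

def acquire_data_sketch(csv_path, query_fn, max_structure_value=1_000_000):
    """Hypothetical stand-in for acquire_data: cache to CSV, then trim outliers."""
    if os.path.exists(csv_path):
        # Reuse the cached copy instead of querying the database again.
        df = pd.read_csv(csv_path)
    else:
        # query_fn is assumed to run the SQL query and return a DataFrame.
        df = query_fn()
        df.to_csv(csv_path, index=False)
    # Drop any stray index column that a CSV round trip can leave behind.
    unnamed = [c for c in df.columns if str(c).startswith("Unnamed")]
    df = df.drop(columns=unnamed)
    # Remove homes whose structure value exceeds the cutoff (missing values are kept).
    too_expensive = df["structuretaxvaluedollarcnt"] > max_structure_value
    return df[~too_expensive].reset_index(drop=True)
```

Because the CSV is checked first, a second call never touches the database, which matches the "csv already exist" message printed above.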


## Prepare

- our data_prep function removes any column missing 50% or more of its data as well as any row missing up to 75% of its data (*We didn't want to waste time strategizing ways to accurately estimate data for columns missing over 50% of their values*)

In [3]:
df = data_prep(df)
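
As a rough illustration of threshold-based dropping, the sketch below removes sparse columns first and then sparse rows. It is not the actual `data_prep` code: the name `drop_sparse`, both parameters, and in particular the reading of the row rule as "drop rows missing 75% or more" are assumptions.

```python
import pandas as pd

def drop_sparse(df, max_col_missing=0.5, max_row_missing=0.75):
    """Hypothetical sketch: drop columns, then rows, that are mostly empty."""
    # Fraction of missing values in each column; keep columns under the threshold.
    col_missing = df.isna().mean(axis=0)
    df = df.loc[:, col_missing < max_col_missing]
    # Fraction of missing values in each remaining row; keep rows under the threshold.
    row_missing = df.isna().mean(axis=1)
    return df.loc[row_missing < max_row_missing]
```

Dropping columns before rows matters here: a row's missing fraction is measured only over the columns that survive the first step.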

- the function below tells us what percent of each column is missing

In [4]:
percent_of_values_missing(df)

parcelid                        0.00
logerror                        0.00
transactiondate                 0.00
bathroomcnt                     0.00
bedroomcnt                      0.00
calculatedfinishedsquarefeet    0.10
fips                            0.00
latitude                        0.00
longitude                       0.00
lotsizesquarefeet               0.65
regionidcity                    1.96
regionidcounty                  0.00
regionidzip                     0.04
yearbuilt                       0.16
structuretaxvaluedollarcnt      0.00
taxvaluedollarcnt               0.00
landtaxvaluedollarcnt           0.00
taxamount                       0.01
dtype: float64
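
A report like the one above can be produced in a single pandas expression. This is a minimal sketch, assuming the values are percentages rounded to two decimals; `percent_missing` is a hypothetical name, not the project's own function.

```python
import pandas as pd

def percent_missing(df):
    """Hypothetical sketch: percent of missing values per column."""
    # isna() gives a boolean frame; its column-wise mean is the missing fraction.
    return (df.isna().mean() * 100).round(2)
```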

- using the_master_imputer function, we fill all of our missing values with the median value of their column (*Since we weren't missing more than 2% of any column's data, we felt that the median would suffice for the missing values in our dataset*)

In [5]:
df = the_master_imputer(df)
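
Median imputation can be sketched as below. This is an assumption about how `the_master_imputer` might behave, not its actual code; `impute_median` is a hypothetical name, and non-numeric columns are deliberately left untouched.

```python
import pandas as pd

def impute_median(df):
    """Hypothetical sketch: fill numeric NaNs with each column's median."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    # median() returns one value per column; fillna aligns it by column label.
    medians = out[numeric_cols].median()
    out[numeric_cols] = out[numeric_cols].fillna(medians)
    return out
```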

We see that all of our missing values are taken care of

In [6]:
df.isna().sum()

parcelid                        0
logerror                        0
transactiondate                 0
bathroomcnt                     0
bedroomcnt                      0
calculatedfinishedsquarefeet    0
fips                            0
latitude                        0
longitude                       0
lotsizesquarefeet               0
regionidcity                    0
regionidcounty                  0
regionidzip                     0
yearbuilt                       0
structuretaxvaluedollarcnt      0
taxvaluedollarcnt               0
landtaxvaluedollarcnt           0
taxamount                       0
dtype: int64

>now we're going to use our fips codes to determine which county each observation was made in, using our create_county_cols function. In the process we're going to drop our fips & regionidcounty columns.
fips source: [THIS LINK](https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013697)

In [7]:
df = create_county_cols(df)
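
One way to turn fips codes into county indicator columns is sketched below. The mapping uses the standard FIPS codes for Los Angeles (6037), Orange (6059), and Ventura (6111) counties; the function name `add_county_columns` and the use of `pd.get_dummies` are assumptions, since the body of `create_county_cols` is not shown.

```python
import pandas as pd

# Standard California county FIPS codes, mapped to the column names used in this notebook.
FIPS_TO_COUNTY = {6037: "LA", 6059: "Orange", 6111: "Ventura"}

def add_county_columns(df):
    """Hypothetical sketch: one indicator column per county, then drop the raw codes."""
    county = df["fips"].astype(int).map(FIPS_TO_COUNTY)
    # One column per county name, marking which county each row belongs to.
    dummies = pd.get_dummies(county)
    out = pd.concat([df, dummies], axis=1)
    return out.drop(columns=["fips", "regionidcounty"])
```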

### Let's change our values so we can have less noise in our exploration / modeling phase

- using our change_data_to_object function to change the datatypes of *parcelid*, *regionidcity*, & *regionidzip* to object (we don't want to do any arithmetic with our unit IDs)

In [8]:
cols = ['parcelid', 'regionidzip', 'regionidcity']
df = change_data_to_object(df, cols)

- using our change_data_to_int function to change the datatypes of *yearbuilt*, *latitude*, *longitude*, *lotsizesquarefeet*, *calculatedfinishedsquarefeet*, & *bedroomcnt* from floats to integers, so these columns hold whole numbers only

In [11]:
cols = ['yearbuilt', 'latitude', 'longitude', 'lotsizesquarefeet', 'calculatedfinishedsquarefeet', 'bedroomcnt' ]
df = change_data_to_int(df, cols)
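
Both dtype helpers used above could plausibly be thin wrappers around `DataFrame.astype`. The sketch below shows that idea with a single hypothetical function, `cast_columns`; it is an assumption, not the project's implementation.

```python
import pandas as pd

def cast_columns(df, cols, dtype):
    """Hypothetical sketch: cast the listed columns to one dtype, leave the rest alone."""
    # astype accepts a {column: dtype} mapping and returns a new DataFrame.
    return df.astype({col: dtype for col in cols})
```

Note that casting floats to `int` truncates any fractional part and raises an error if a column still contains NaN, which is one reason the imputation step has to come first.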

### We're going to summarize what our data looks like
- this function runs .info, .describe, and .shape on our data and returns the results in a pandas series

In [14]:
summarize_data(df)

******** Info
<class 'pandas.core.frame.DataFrame'>
Int64Index: 51586 entries, 0 to 51585
Data columns (total 19 columns):
logerror                        51586 non-null float64
transactiondate                 51586 non-null object
bathroomcnt                     51586 non-null float64
structuretaxvaluedollarcnt      51586 non-null float64
taxvaluedollarcnt               51586 non-null float64
landtaxvaluedollarcnt           51586 non-null float64
taxamount                       51586 non-null float64
LA                              51586 non-null uint8
Orange                          51586 non-null uint8
Ventura                         51586 non-null uint8
parcelid                        51586 non-null object
regionidzip                     51586 non-null object
regionidcity                    51586 non-null object
yearbuilt                       51586 non-null int64
latitude                        51586 non-null int64
longitude                       51586 non-null int64
lotsizesquare

## Preprocessing