# Acquire (acquire.py)

Zillow

For each of the following tasks, work through the steps you would take to create functions: write the code in a Jupyter notebook, test it, convert it to functions, then create the file to house those functions.

You will have a zillow.ipynb file and a helper file for each section of the pipeline.

***acquire & summarize***

1. Acquire data from MySQL using the Python module to connect and query. You will want to end with a single dataframe. Make sure to include the logerror and all available fields related to the properties. You will end up using all the tables in the database.

    - ***Be sure to do the correct join. We do not want to eliminate properties purely because they may have a null value for `airconditioningtypeid`***
    
    - only include properties with a transaction in 2017, and include only the last transaction for each property (so no duplicate property IDs), along with zestimate error and date of transaction.
    
    - only include properties that include a latitude and longitude value.
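The three requirements above can be illustrated on toy data without a database connection. This is a minimal sketch, not the contents of acquire.py: the frames `properties`, `predictions`, and `ac_types` and the column `airconditioningdesc` are hypothetical stand-ins, and every value is made up. It shows a left join that keeps properties with a null `airconditioningtypeid`, keeps only the last 2017 transaction per `parcelid`, and drops rows without latitude or longitude.

```python
import pandas as pd

# Hypothetical stand-ins for database tables; all values are made up.
properties = pd.DataFrame({
    "parcelid": [1, 2, 3, 4],
    "airconditioningtypeid": [1.0, None, 1.0, None],
    "latitude": [34.1, 34.2, None, 34.4],
    "longitude": [-118.1, -118.2, -118.3, -118.4],
})
predictions = pd.DataFrame({
    "parcelid": [1, 1, 2, 3, 4],
    "logerror": [0.03, 0.05, -0.10, 0.01, 0.02],
    "transactiondate": ["2017-01-05", "2017-06-01", "2017-03-15",
                        "2017-02-01", "2016-12-30"],
})
ac_types = pd.DataFrame({
    "airconditioningtypeid": [1.0],
    "airconditioningdesc": ["Central"],  # hypothetical description column
})

# Keep only 2017 transactions, then the latest one per property.
predictions["transactiondate"] = pd.to_datetime(predictions["transactiondate"])
in_2017 = predictions[predictions["transactiondate"].dt.year == 2017]
last_sale = in_2017.sort_values("transactiondate").drop_duplicates("parcelid", keep="last")

df = (
    properties
    .merge(last_sale, on="parcelid", how="inner")             # must have a 2017 sale
    .merge(ac_types, on="airconditioningtypeid", how="left")  # null AC type is kept
    .dropna(subset=["latitude", "longitude"])                 # must have a location
    .reset_index(drop=True)
)
print(df)
```

Parcel 2 survives with a null air-conditioning description because of the left join; an inner join on the lookup table would have removed it.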

In [1]:
# Set up

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

# Wrangling
import pandas as pd 
import numpy as np 

# Exploring
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns 

# default pandas decimal number display format.
pd.options.display.float_format = '{:20,.2f}'.format
pd.set_option("display.max_rows", 80)

import acquire
import prepare


In [2]:
df = acquire.get_zillow_data()

In [3]:
df.head()

Unnamed: 0,logerror,transactiondate,id,parcelid,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,...,numberofstories,fireplaceflag,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock
0,0.03,2017-01-01,1727539,14297519,,,,3.5,4.0,,...,,,485713.0,1023282.0,2016.0,537569.0,11013.72,,,60590630072012.0
1,0.06,2017-01-01,1387261,17052889,,,,1.0,2.0,,...,1.0,,88000.0,464000.0,2016.0,376000.0,5672.48,,,61110010023006.0
2,0.01,2017-01-01,11677,14186244,,,,2.0,3.0,,...,1.0,,85289.0,564778.0,2016.0,479489.0,6488.3,,,60590218022012.0
3,-0.1,2017-01-01,2288172,12177905,,,,3.0,4.0,,...,,,108918.0,145143.0,2016.0,36225.0,1777.51,,,60373001001006.0
4,0.01,2017-01-01,1970746,10887214,1.0,,,3.0,3.0,,...,,,73681.0,119407.0,2016.0,45726.0,1533.89,,,60371236012000.0


2. Summarize the data you have just read into a dataframe in the ways we have discussed in previous models (sample view, datatypes, value counts, summary stats, ...)


In [None]:
df.shape

In [None]:
print(df.info())

In [None]:
pd.get_option("display.max_rows")

In [None]:
pd.set_option("display.max_rows", 100)
pd.set_option('display.max_columns', None)


In [None]:
pd.get_option("display.max_rows")

### nulls_by_col

3. Write a function that takes in a dataframe of observations and attributes and returns a dataframe where each row is an attribute name, the first column is the number of rows with missing values for that attribute, and the second column is the percent of total rows that have missing values for that attribute. Run the function and document takeaways from this on how you want to handle missing values.

In [None]:
def nulls_by_col(df):
    # Number of rows with a missing value, per column
    num_missing = df.isnull().sum()

    # Total number of rows
    rows = df.shape[0]

    # Percent of rows missing, per column (0-100, matching nulls_by_row)
    pct_missing = num_missing / rows * 100

    cols_missing = pd.DataFrame({'num_rows_missing': num_missing, 'pct_rows_missing': pct_missing})

    return cols_missing

In [None]:
nulls_by_col(df)
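One way to turn the per-column null summary into a takeaway is to list the columns that exceed a chosen missing-value cutoff. The sketch below is an illustration on made-up data; the helper name `columns_over_missing_threshold` and the 50% cutoff are assumptions, not part of the exercise.

```python
import pandas as pd

def columns_over_missing_threshold(df, max_pct_missing=50):
    """Return names of columns whose percent of missing rows exceeds the cutoff."""
    # isnull().mean() is the fraction of missing rows per column
    pct_missing = df.isnull().mean() * 100
    return pct_missing[pct_missing > max_pct_missing].index.tolist()

# Toy frame with made-up values
toy = pd.DataFrame({
    "bedroomcnt": [3, 4, 2, 3],
    "basementsqft": [None, None, None, 500.0],  # 75% missing
    "garagecarcnt": [2, None, None, 1],         # 50% missing
})

sparse_cols = columns_over_missing_threshold(toy, max_pct_missing=50)
print(sparse_cols)
```

Columns returned here are candidates for dropping; columns just under the cutoff may be better handled by imputation.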

### nulls_by_row

4. Write a function that takes in a dataframe and returns a dataframe with 3 columns: the number of columns missing, the percent of columns missing, and the number of rows with n columns missing. Run the function and document takeaways from this on how you want to handle the missing values.

In [None]:
def nulls_by_row(df):
    # Count nulls in each row (axis=1)
    num_cols_missing = df.isnull().sum(axis=1)

    # Number of columns
    columns = df.shape[1]

    # Percent of columns missing in each row
    pct_cols_missing = num_cols_missing / columns * 100

    # Number of rows sharing each (count, percent) of missing columns
    rows_missing = (
        pd.DataFrame({'num_cols_missing': num_cols_missing,
                      'pct_cols_missing': pct_cols_missing})
        .reset_index()
        .groupby(['num_cols_missing', 'pct_cols_missing'])
        .count()
        .rename(index=str, columns={'index': 'num_rows'})
        .reset_index()
    )

    return rows_missing

In [None]:
nulls_by_row(df)
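The same three-column summary can also be built from `value_counts`, which may be easier to read than the groupby chain. This is an alternative sketch on made-up data, with the hypothetical name `nulls_by_row_compact`; unlike the version above, it keeps `num_cols_missing` numeric rather than converting it to strings.

```python
import pandas as pd

def nulls_by_row_compact(df):
    """Summarize how many rows are missing n columns, for each observed n."""
    num_cols_missing = df.isnull().sum(axis=1)
    # How many rows share each count of missing columns
    counts = num_cols_missing.value_counts().sort_index()
    n_missing = counts.index.to_numpy()
    return pd.DataFrame({
        "num_cols_missing": n_missing,
        "pct_cols_missing": n_missing / df.shape[1] * 100,
        "num_rows": counts.to_numpy(),
    })

# Toy frame with made-up values: rows are missing 0, 1, 1, and 3 columns
toy = pd.DataFrame({
    "a": [1, None, 1, None],
    "b": [1, 1, None, None],
    "c": [1, 1, 1, None],
})
summary = nulls_by_row_compact(toy)
print(summary)
```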

In [None]:
def df_value_counts(df):
    for col in df.columns:
        print(f'{col}:')
        if df[col].dtype == 'object':
            col_count = df[col].value_counts()
        else:
            if df[col].nunique() >= 35:
                col_count = df[col].value_counts(bins=10, sort=False)
            else:
                col_count = df[col].value_counts()
        print(col_count)
        print('\n')


In [None]:
df_value_counts(df)

In [None]:
def df_summary(df):
    print(f'--- Shape:{df.shape}')
    print('\n--- Info:')
    df.info()
    print('\n--- Descriptions:')
    print(df.describe(include='all'))
    print(f'\n--- Nulls by Column:\n {nulls_by_col(df)}')
    print(f'\n--- Nulls by Row:\n {nulls_by_row(df)}')
    print('\n--- Value Counts:\n')
    # df_value_counts prints its own output and returns None, so call it directly
    df_value_counts(df)

In [None]:
df_summary(df)

In [None]:
df.hist(figsize=(36, 80), bins=20)
plt.show()





In [None]:
def get_outliers(s, k):
    """
    Given a series and a cutoff value, k, returns the upper and lower outliers for the series as a tuple (upper, lower).

    Each value is 0 if the point is not an outlier; otherwise it is the signed distance of the observation from the upper or lower bound (positive above the upper bound, negative below the lower bound).
    """
    
    q1, q3 = s.quantile([.25,.75])
    iqr = q3 - q1
    upper_bound = q3 + k * iqr
    lower_bound = q1 - k * iqr
    return s.apply(lambda x: max([x - upper_bound,0])), s.apply(lambda x: min([x - lower_bound,0]))

In [None]:
def add_outlier_columns(df,k):
    """
    Add _lower_outliers and _upper_outliers columns for every numeric column.
    """
    for col in df.select_dtypes('number'):
        # compute both bounds once per column
        upper, lower = get_outliers(df[col], k)
        df[col + '_lower_outliers'] = lower
        df[col + '_upper_outliers'] = upper
    return df

In [None]:
# df_single_unit is created in the Prepare section below; run that cell first
add_outlier_columns(df_single_unit, k = 1.5)
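A small worked example makes the IQR fences concrete. The sketch below uses a hypothetical helper, `outlier_distances`, that applies the same bounds as `get_outliers` but with vectorized `clip` calls instead of `apply`; the series values are made up.

```python
import pandas as pd

def outlier_distances(s, k=1.5):
    """Return (upper, lower) distances from the IQR fences; 0 means not an outlier."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    upper_bound = q3 + k * iqr
    lower_bound = q1 - k * iqr
    upper = (s - upper_bound).clip(lower=0)  # positive only above the upper fence
    lower = (s - lower_bound).clip(upper=0)  # negative only below the lower fence
    return upper, lower

# q1 = 2, q3 = 4, iqr = 2, so the fences are -1 and 7 for k = 1.5
s = pd.Series([1, 2, 3, 4, 100])
upper, lower = outlier_distances(s)
print(upper.tolist())  # [0.0, 0.0, 0.0, 0.0, 93.0]
print(lower.tolist())  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

Only the value 100 lies beyond a fence, at a distance of 93 above the upper bound of 7.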

# Prepare

1. Remove any properties that are likely to be something other than single unit properties (e.g. no duplexes, no land/lots, ...). There are multiple ways to estimate that a property is a single unit, and there is no single "right" answer. But for this exercise, do not purely filter by unitcnt as we did previously. Add some new logic that will reduce the number of properties that are falsely removed. You might want to use the number of bedrooms, square feet, unit type, or the like to identify single-unit properties among those with unitcnt not defined.

In [None]:
df.propertylandusetypeid.value_counts()

In [None]:
df.bathroomcnt.value_counts().sort_index()

In [None]:
df.bedroomcnt.value_counts().sort_index()

In [None]:
df.calculatedfinishedsquarefeet.value_counts().sort_index()

In [3]:
df_single_unit = df[(df.propertylandusetypeid == 261) &
                    (df.bedroomcnt > 0) &
                    (df.bathroomcnt > 0)]
df_single_unit.shape

(52168, 61)
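The filter above does not look at unitcnt at all. One possible refinement, sketched here on made-up rows, keeps properties whose unitcnt is either 1 or undefined, so that a missing unitcnt alone never removes a property. The land-use code 246 is a made-up stand-in for "some other property type", and treating a null unitcnt as acceptable is an assumption, not a rule from the exercise.

```python
import pandas as pd

# Toy rows with made-up values
toy = pd.DataFrame({
    "propertylandusetypeid": [261, 261, 261, 246, 261],
    "bedroomcnt": [3, 0, 4, 2, 3],
    "bathroomcnt": [2, 0, 3, 1, 2],
    "unitcnt": [1, None, None, 2, 2],
})

is_single_unit = (
    (toy.propertylandusetypeid == 261)
    & (toy.bedroomcnt > 0)
    & (toy.bathroomcnt > 0)
    # keep a property when unitcnt is 1 or simply not recorded
    & (toy.unitcnt.isna() | (toy.unitcnt == 1))
)
single = toy[is_single_unit]
print(single)
```

Row 2 is kept despite its missing unitcnt because its bedroom and bathroom counts are consistent with a home; row 4 is dropped because unitcnt explicitly says 2.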

In [4]:
df_single_unit.isnull().sum()

logerror                            0
transactiondate                     0
id                                  0
parcelid                            0
airconditioningtypeid           38563
architecturalstyletypeid        52098
basementsqft                    52121
bathroomcnt                         0
bedroomcnt                          0
buildingclasstypeid             52168
buildingqualitytypeid           18541
calculatedbathnbr                  16
decktypeid                      51781
finishedfloor1squarefeet        47814
calculatedfinishedsquarefeet        8
finishedsquarefeet12              166
finishedsquarefeet13            52168
finishedsquarefeet15            52168
finishedsquarefeet50            47814
finishedsquarefeet6             52010
fips                                0
fireplacecnt                    44947
fullbathcnt                        16
garagecarcnt                    34202
garagetotalsqft                 34202
hashottuborspa                  50654
heatingorsys

2. Create a function that will drop rows or columns based on the percent of values that are missing: handle_missing_values(df, prop_required_column, prop_required_row)

    - The input:
        - A dataframe
        - A number between 0 and 1 that represents the proportion, for each column, of rows with non-missing values required to keep the column. For example, if prop_required_column = .6, then you are requiring a column to have at least 60% of values not-NA (no more than 40% missing).
        - A number between 0 and 1 that represents the proportion, for each row, of columns/variables with non-missing values required to keep the row. For example, if prop_required_row = .75, then you are requiring a row to have at least 75% of variables with a non-missing value (no more than 25% missing).
    - The output:
        - The dataframe with the columns and rows dropped as indicated. ***Be sure to drop the columns prior to the rows in your function***
    - hint:
        - Look up the dropna documentation
        - You will want to compute a threshold from your input values (prop_required) and the total number of rows or columns.
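The threshold arithmetic can be checked by hand on a tiny frame. The sketch below is one possible reading of the specification, under the hypothetical name `handle_missing_values_sketch`; it relies on the `thresh` parameter of `dropna`, which is the minimum number of non-null values required to keep a row or column. All values are made up.

```python
import pandas as pd

def handle_missing_values_sketch(df, prop_required_column=.5, prop_required_row=.75):
    """Drop sparse columns first, then sparse rows, using dropna(thresh=...)."""
    # Columns: required non-null count = proportion * number of rows
    col_thresh = int(round(prop_required_column * len(df.index)))
    df = df.dropna(axis=1, thresh=col_thresh)
    # Rows: required non-null count = proportion * number of remaining columns
    row_thresh = int(round(prop_required_row * len(df.columns)))
    df = df.dropna(axis=0, thresh=row_thresh)
    return df

toy = pd.DataFrame({
    "a": [1, 2, 3, None],
    "b": [1, None, 3, None],
    "c": [1, 2, 3, 4],
    "d": [None, None, None, 1],  # only 1 of 4 values present
})

# Column threshold: round(.5 * 4) = 2, so "d" (1 non-null) is dropped.
# Row threshold: round(.75 * 3) = 2, so the last row (1 non-null) is dropped.
result = handle_missing_values_sketch(toy)
print(result.shape)  # (3, 3)
```

Dropping columns first matters: the row threshold is computed from the number of columns that remain, so sparse columns do not penalize otherwise complete rows.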
     

In [6]:
# remove unwanted columns:
def remove_columns(df, cols_to_remove):
    df = df.drop(columns = cols_to_remove)
    return df

In [7]:
# handle missing values according to a set of required thresholds.
def handle_missing_values(df, prop_required_column = .5, prop_required_row = .75):
    # drop columns first: keep a column only if enough of its rows are non-null
    threshold = int(round(prop_required_column*len(df.index)))
    df = df.dropna(axis=1, thresh=threshold)
    # then drop rows: keep a row only if enough of the remaining columns are non-null
    threshold = int(round(prop_required_row*len(df.columns)))
    df = df.dropna(axis=0, thresh=threshold)
    return df

In [8]:
#handle_missing_values(df_single_unit)

In [9]:
# Prep the dataframe: remove unwanted columns, then drop the columns and rows that fall below the required proportion of non-missing values.
def data_prep(df, cols_to_remove=None, prop_required_column = .5, prop_required_row = .75):
    # avoid a mutable default argument; None means "remove nothing"
    df = remove_columns(df, cols_to_remove or [])
    df = handle_missing_values(df, prop_required_column, prop_required_row)
    return df

In [10]:
df = data_prep(df_single_unit)
df.shape

(52167, 32)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52167 entries, 0 to 77379
Data columns (total 32 columns):
logerror                        52167 non-null float64
transactiondate                 52167 non-null object
id                              52167 non-null int64
parcelid                        52167 non-null int64
bathroomcnt                     52167 non-null float64
bedroomcnt                      52167 non-null float64
buildingqualitytypeid           33627 non-null float64
calculatedbathnbr               52152 non-null float64
calculatedfinishedsquarefeet    52160 non-null float64
finishedsquarefeet12            52002 non-null float64
fips                            52167 non-null float64
fullbathcnt                     52152 non-null float64
heatingorsystemtypeid           33823 non-null float64
latitude                        52167 non-null float64
longitude                       52167 non-null float64
lotsizesquarefeet               51813 non-null float64
propertycountyla

In [13]:
df.fips.value_counts()

6,037.00    33751
6,059.00    14060
6,111.00     4356
Name: fips, dtype: int64

In [None]:
df.buildingqualitytypeid.value_counts(dropna=False)

In [None]:
df.heatingorsystemtypeid.value_counts(dropna=False)

In [None]:
df.isnull().sum()

In [None]:
df.taxvaluedollarcnt.value_counts(dropna=False).sort_index().tail()

In [None]:
df.dropna(subset=['taxvaluedollarcnt']).shape

3. Decide how to handle the remaining missing values:
    - Fill with a constant value.
    - Impute with mean, median, or mode.
    - Drop the row/column.

In [None]:
def fill_missing_values(df, fill_value):
    # fillna returns a new dataframe, so return its result
    return df.fillna(fill_value)
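For the imputation option, a minimal sketch on made-up values follows. It fills one column with its median and another with its mode; choosing the median for `buildingqualitytypeid` and the mode for `heatingorsystemtypeid` is an assumption made for illustration, not a recommendation from the exercise.

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "buildingqualitytypeid": [4.0, 6.0, None, 8.0],
    "heatingorsystemtypeid": [2.0, 2.0, None, 7.0],
})

filled = toy.copy()

# Median imputation for an ordered, numeric-like column
quality_median = filled["buildingqualitytypeid"].median()
filled["buildingqualitytypeid"] = filled["buildingqualitytypeid"].fillna(quality_median)

# Mode imputation for a categorical id column (mode() returns a Series)
heating_mode = filled["heatingorsystemtypeid"].mode()[0]
filled["heatingorsystemtypeid"] = filled["heatingorsystemtypeid"].fillna(heating_mode)

print(filled)
```

In a modeling workflow, the median and mode would normally be computed on the training split only and then applied to the other splits.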