# Acquire & Summarize

In [34]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from acquire import get_zillow_data

import warnings

pd.set_option('display.max_rows', None)
warnings.filterwarnings('ignore')

Acquire data from MySQL using the Python module to connect and query. You will want to end with a single dataframe. Make sure to include `logerror` and all available fields related to the properties. You will end up using all the tables in the database.

1. Be sure to do the correct join (inner, outer, etc.). We do not want to eliminate properties purely because they may have a null value for `airconditioningtypeid`.
2. Only include properties with a transaction in 2017, and include only the last transaction for each property (so no duplicate property IDs), along with `zestimate error` and `date of transaction`.
3. Only include properties that include a `latitude` and `longitude` value.


This is the SQL query I used to pull the requested data from the Zillow database.

```python
sql_query = '''
select *
from properties_2017
join(select parcelid,
    logerror,
    max(transactiondate) as lasttransactiondate
    from predictions_2017
    group by parcelid, logerror
    ) as predictions using(parcelid)
left join `airconditioningtype` using(`airconditioningtypeid`)
left join `architecturalstyletype` using(`architecturalstyletypeid`)
left join `buildingclasstype` using(`buildingclasstypeid`)
left join `heatingorsystemtype` using(`heatingorsystemtypeid`)
left join `propertylandusetype` using(`propertylandusetypeid`)
left join `storytype` using(`storytypeid`)
left join `typeconstructiontype` using(`typeconstructiontypeid`)
where (latitude is not null
and longitude is not null);'''

df = get_zillow_data(sql_query)
```
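One thing to watch: because the subquery groups by both `parcelid` and `logerror`, a parcel that sold more than once with different log errors can still come back as more than one row. Below is a minimal sketch of checking for and dropping such repeats in pandas. It assumes the column names from the query above; the toy frame and the `keep_last_transaction` helper are made up for illustration and are not part of the `acquire` module.

```python
import pandas as pd

# Toy stand-in for the query result (hypothetical values).
toy = pd.DataFrame({
    "parcelid": [1, 1, 2, 3],
    "logerror": [0.01, -0.02, 0.05, 0.00],
    "lasttransactiondate": ["2017-01-05", "2017-06-30", "2017-03-12", "2017-09-01"],
})


def keep_last_transaction(df, id_col="parcelid", date_col="lasttransactiondate"):
    """Keep only the most recent row for each property ID."""
    # ISO-formatted date strings sort chronologically, so a plain sort works here.
    ordered = df.sort_values([id_col, date_col])
    return ordered.drop_duplicates(subset=id_col, keep="last").reset_index(drop=True)


# Count rows whose parcelid has already appeared earlier in the frame.
n_dupes = int(toy["parcelid"].duplicated().sum())

deduped = keep_last_transaction(toy)
print(f"{n_dupes} duplicate row(s) found; {len(deduped)} rows after deduplication")
```

Running the same `duplicated()` check on the real dataframe shows whether any cleanup is needed before moving on.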

In [17]:
# Load the zillow dataset from a cached csv file
df = pd.read_csv('zillow.csv')
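
Reading from a cached CSV avoids re-running the query on every kernel restart. One way to combine the two steps is sketched below; `load_cached` is a hypothetical helper, and the `fetch` argument stands in for whatever actually runs the query.

```python
import os

import pandas as pd


def load_cached(csv_path, fetch):
    """Return the dataframe stored at csv_path, calling fetch() only on a cache miss.

    fetch is any zero-argument callable that returns a DataFrame.
    """
    if os.path.isfile(csv_path):
        return pd.read_csv(csv_path)
    df = fetch()
    # index=False keeps the RangeIndex from being written as an extra column.
    df.to_csv(csv_path, index=False)
    return df
```

In this notebook it could be called as `load_cached('zillow.csv', lambda: get_zillow_data(sql_query))`.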

Summarize your data (summary stats, info, dtypes, shape, distributions, value_counts, etc.)

In [18]:
df.shape

(77575, 68)

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77575 entries, 0 to 77574
Data columns (total 68 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   typeconstructiontypeid        222 non-null    float64
 1   storytypeid                   50 non-null     float64
 2   propertylandusetypeid         77575 non-null  float64
 3   heatingorsystemtypeid         49570 non-null  float64
 4   buildingclasstypeid           15 non-null     float64
 5   architecturalstyletypeid      206 non-null    float64
 6   airconditioningtypeid         25006 non-null  float64
 7   parcelid                      77575 non-null  int64  
 8   id                            77575 non-null  int64  
 9   basementsqft                  50 non-null     float64
 10  bathroomcnt                   77575 non-null  float64
 11  bedroomcnt                    77575 non-null  float64
 12  buildingqualitytypeid         49809 non-null  float64
 13  c

In [None]:
# sns.pairplot(df) Nope

In [25]:
# Using `isnull()` and `notnull()` we can calculate the number of missing values and non-null values.
nulls = df.isnull().sum()
non_nulls = df.notnull().sum()

# Add missing values and non-null values together to get the total number of values in each column.
total_values = nulls + non_nulls

# Create a variable to store the percentage of missing values in each column.
pct_missing = (nulls/total_values).sort_values(ascending=False)

# Perform formatting to clearly see the percentage of missing values in each column.
pct_missing_chart = pct_missing.apply("{0:.2%}".format)

# Display table to the user showing the percentage of missing values in each column.
print('Percentage of values missing per column')
print('-' * 39)
print(f"{pct_missing_chart}")

Percentage of values missing per column
---------------------------------------
buildingclasstypeid             99.98%
buildingclassdesc               99.98%
finishedsquarefeet13            99.95%
storytypeid                     99.94%
basementsqft                    99.94%
storydesc                       99.94%
yardbuildingsqft26              99.91%
fireplaceflag                   99.78%
architecturalstyledesc          99.73%
architecturalstyletypeid        99.73%
typeconstructiondesc            99.71%
typeconstructiontypeid          99.71%
finishedsquarefeet6             99.50%
pooltypeid10                    99.40%
decktypeid                      99.21%
poolsizesum                     98.88%
pooltypeid2                     98.62%
hashottuborspa                  98.02%
yardbuildingsqft17              96.92%
taxdelinquencyyear              96.26%
taxdelinquencyflag              96.26%
finishedsquarefeet15            96.10%
finishedsquarefeet50            92.22%
finishedfloor1squarefee

Write a function that takes in a dataframe of observations and attributes and returns a dataframe where each row is an attribute name, the first column is the number of rows with missing values for that attribute, and the second column is the percent of total rows that have missing values for that attribute. Run the function and document takeaways on how you want to handle missing values.
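
A minimal sketch of one way to write such a function. The name `nulls_by_col` and the output column names are arbitrary choices, and the small `example` frame is made up to show the shape of the result.

```python
import pandas as pd


def nulls_by_col(df):
    """Summarize missing values per column.

    Returns a dataframe indexed by attribute name with the number of rows
    missing that attribute and the percent of all rows that are missing it.
    """
    num_missing = df.isnull().sum()
    pct_missing = num_missing / len(df) * 100
    summary = pd.DataFrame({
        "num_rows_missing": num_missing,
        "pct_rows_missing": pct_missing,
    })
    # Put the sparsest columns first so they are easy to spot.
    return summary.sort_values("num_rows_missing", ascending=False)


# Made-up example frame: 'a' is half missing, 'b' is complete, 'c' is mostly missing.
example = pd.DataFrame({
    "a": [1, None, 3, None],
    "b": [1, 2, 3, 4],
    "c": [None, None, None, 1],
})
summary = nulls_by_col(example)
print(summary)
```

Calling `nulls_by_col(df)` on the Zillow dataframe should give the same percentages as the table printed earlier, with the raw counts alongside.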