# Acquire Zillow

## For each step below, work iteratively: write the code in a Jupyter notebook, test it, convert it to functions, then create the file that houses those functions.

## You will have a zillow.ipynb file and a helper file for each section in the pipeline.

* Acquire data from MySQL using the Python module to connect and query. You will want to end with a single dataframe.

* Make sure to include the logerror and all available fields related to the properties. You will end up using all the tables in the database.

* Be sure to do the correct join (inner, outer, etc.). We do not want to eliminate properties purely because they may have a null value for airconditioningtypeid.

* Only include properties with a transaction in 2017, and include only the last transaction for each property (so no duplicate property IDs), along with zestimate error and date of transaction.
* Only include properties that include a latitude and longitude value.
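
The join and de-duplication requirements above can be illustrated on toy data before touching the database. This is a minimal sketch, not the notebook's method: the three small frames are hypothetical stand-ins for `predictions_2017`, `properties_2017`, and `airconditioningtype`, and the column `airconditioningdesc` is an assumed name for the lookup description.

```python
import pandas as pd

# Hypothetical toy stand-ins for predictions_2017, properties_2017 and airconditioningtype.
predictions = pd.DataFrame({
    "parcelid": [1, 1, 2, 3],
    "logerror": [0.01, 0.02, -0.03, 0.04],
    "transactiondate": ["2017-01-05", "2017-06-10", "2017-03-01", "2017-02-14"],
})
properties = pd.DataFrame({
    "parcelid": [1, 2, 3],
    "airconditioningtypeid": [1.0, None, 1.0],
    "latitude": [34.1, 34.2, None],
    "longitude": [-118.1, -118.2, -118.3],
})
ac_types = pd.DataFrame({
    "airconditioningtypeid": [1.0],
    "airconditioningdesc": ["Central"],  # assumed column name
})

# Keep only the latest transaction per parcel (ISO date strings sort chronologically).
latest = (
    predictions.sort_values("transactiondate")
    .drop_duplicates("parcelid", keep="last")
)

# Left joins keep parcel 2 even though its airconditioningtypeid is null;
# an inner join on the lookup table would drop it.
toy = (
    latest.merge(properties, on="parcelid", how="left")
    .merge(ac_types, on="airconditioningtypeid", how="left")
    .dropna(subset=["latitude", "longitude"])  # parcel 3 has no latitude
    .reset_index(drop=True)
)
print(toy)
```

Parcel 1 appears once with its June transaction, parcel 2 survives with a missing air-conditioning description, and parcel 3 is removed only because it lacks a latitude.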

In [1]:
from env import user, host, password
import pandas as pd
import matplotlib.pyplot as plt
import os
import warnings
warnings.filterwarnings("ignore")

In [2]:
def get_connection(database, user=user, host=host, password=password):
    """Build the SQLAlchemy connection URL for the given database."""
    return f"mysql+pymysql://{user}:{password}@{host}/{database}"

In [3]:
def cache_sql_data(df, database):
    """Write the query result to a local CSV named after the database."""
    df.to_csv(f'{database}_query.csv', index=False)

In [4]:
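def get_sql_data(database, query):
    """Return the query result, reading from the local CSV cache when one exists."""
    # The cache is keyed by database name only, so delete the CSV to re-run a changed query.
    if not os.path.isfile(f'{database}_query.csv'):
        df = pd.read_sql(query, get_connection(database))
        cache_sql_data(df, database)

    return pd.read_csv(f'{database}_query.csv')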

In [5]:
query = '''

select * 
from predictions_2017

left join properties_2017 using(parcelid)
left join airconditioningtype using(airconditioningtypeid)
left join architecturalstyletype using(architecturalstyletypeid)
left join buildingclasstype using(buildingclasstypeid)
left join heatingorsystemtype using(heatingorsystemtypeid)
left join propertylandusetype using(propertylandusetypeid)
left join storytype using(storytypeid)
left join typeconstructiontype using(typeconstructiontypeid)

where latitude is not null

and longitude is not null

and (parcelid, transactiondate) in (select parcelid, max(transactiondate)
                                      from predictions_2017
                                      where transactiondate between '2017-01-01' and '2017-12-31'
                                      group by parcelid)
'''

database = "zillow"

In [6]:
df = get_sql_data(database,query)
df.head()

OperationalError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query ([Errno 54] Connection reset by peer)')
(Background on this error at: http://sqlalche.me/e/13/e3q8)
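
The error above does not say why the connection was reset. If the cause is the size of the result set (an assumption, not something the traceback confirms), one thing to try is reading the result in chunks with the `chunksize` parameter of `pd.read_sql`. The sketch below uses an in-memory SQLite database as a stand-in for the MySQL server so that it runs anywhere; the table contents are made up.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the MySQL server (assumption for illustration only).
conn = sqlite3.connect(":memory:")
pd.DataFrame({"parcelid": range(10), "logerror": [0.1] * 10}).to_sql(
    "predictions_2017", conn, index=False
)

# With chunksize set, read_sql returns an iterator of DataFrames instead of one frame.
chunks = pd.read_sql("select * from predictions_2017", conn, chunksize=4)

# Stitch the chunks back into a single dataframe.
df_chunked = pd.concat(chunks, ignore_index=True)
conn.close()
print(df_chunked.shape)
```

Whether this helps against a real MySQL server depends on the driver and the server's timeout settings, so treat it as an experiment rather than a fix.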

# Mall Customers

## notebook

* Acquire data from mall_customers.customers in the MySQL database.
* Summarize data (include distributions and descriptive statistics).
* Detect outliers using IQR.
* Split data (train, validate, and test split).
* Encode categorical columns using a one hot encoder (pd.get_dummies).
* Handle missing values.
* Scale the data.
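
The IQR step in the list above can be sketched as follows. This is an illustration on made-up incomes, not the notebook's own implementation; the helper name `iqr_bounds` and the multiplier `k=1.5` (the conventional default) are assumptions.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences for a numeric series (hypothetical helper)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up incomes; the last value is deliberately extreme.
income = pd.Series([30, 40, 55, 60, 62, 70, 78, 300], name="annual_income")

lower, upper = iqr_bounds(income)

# Anything outside the fences is flagged as an outlier.
outliers = income[(income < lower) | (income > upper)]
print(lower, upper)
print(outliers)
```

A larger `k` (for example 3) flags only more extreme values; which threshold is appropriate depends on the data.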

In [None]:
database = "mall_customers"

query = "select * from customers"

In [None]:
df = get_sql_data(database,query)

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
# drop customer ID
df = df[['gender', 'age', 'annual_income', 'spending_score']]
df.head()

In [None]:
# distribution of the data
num_cols = df.select_dtypes(include='int64').columns

# one histogram per numeric column
for col in num_cols:
    plt.hist(df[col])
    plt.title(col)
    plt.show()

In [None]:
# bar chart of value counts for each categorical column
cat_cols = df.select_dtypes(include='object').columns

for col in cat_cols:
    df[col].value_counts().plot(kind='bar', title=f"{col.capitalize()} Distribution")
    plt.show()
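
The encoding, splitting, and scaling steps from the task list can be sketched with pandas alone. This is a rough illustration under stated assumptions: the frame below is made-up data with the same column names as the customers table, the 60/20/20 proportions and `random_state=123` are arbitrary choices, and min-max scaling is only one of several possible scalers.

```python
import pandas as pd

# Made-up data with the same column names as the customers table.
toy = pd.DataFrame({
    "gender": ["Male", "Female"] * 10,
    "age": range(20, 40),
    "annual_income": range(15, 115, 5),
    "spending_score": range(1, 100, 5),
})

# One-hot encode gender; drop_first avoids a redundant column.
encoded = pd.get_dummies(toy, columns=["gender"], drop_first=True)

# Shuffle once, then slice roughly 60/20/20 into train/validate/test.
shuffled = encoded.sample(frac=1, random_state=123)
n = len(shuffled)
train = shuffled.iloc[: n * 6 // 10]
validate = shuffled.iloc[n * 6 // 10 : n * 8 // 10]
test = shuffled.iloc[n * 8 // 10 :]

# Min-max scale using statistics from train only, so nothing leaks from validate or test.
num_cols = ["age", "annual_income", "spending_score"]
col_min, col_max = train[num_cols].min(), train[num_cols].max()
train_scaled = (train[num_cols] - col_min) / (col_max - col_min)
validate_scaled = (validate[num_cols] - col_min) / (col_max - col_min)
test_scaled = (test[num_cols] - col_min) / (col_max - col_min)

print(train.shape, validate.shape, test.shape)
```

Because the scaler is fit on train only, values in validate and test can fall slightly outside the 0 to 1 range.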