#### Zillow Dataset

In [1]:
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np
import os
from env import host, user, password
import wrangle_zillow
from sklearn.model_selection import train_test_split

Create a Python script or Jupyter notebook named `explore_zillow` and do the following:

In [3]:
def get_connection(db, user=user, host=host, password=password):
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'
    
def get_zillow_data():
    query = '''
    SELECT p1.*
            , p2.transactiondate
            , p2.logerror 
            , ac.airconditioningdesc
            , arch.architecturalstyledesc
            , bldg.buildingclassdesc
            , heat.heatingorsystemdesc
            , land.propertylandusedesc
            , stories.storydesc
            , const.typeconstructiondesc
    FROM zillow.properties_2017 p1
    LEFT JOIN zillow.airconditioningtype ac USING(airconditioningtypeid)
    LEFT JOIN zillow.architecturalstyletype arch USING(architecturalstyletypeid)
    LEFT JOIN zillow.buildingclasstype bldg USING(buildingclasstypeid)
    LEFT JOIN zillow.heatingorsystemtype heat USING(heatingorsystemtypeid)
    LEFT JOIN zillow.propertylandusetype land USING(propertylandusetypeid)
    LEFT JOIN zillow.storytype stories USING(storytypeid)
    LEFT JOIN zillow.typeconstructiontype const USING(typeconstructiontypeid)
    INNER JOIN (
        -- Most recent transaction per parcel, with its logerror.
        SELECT pred.parcelid, pred.logerror, pred.transactiondate
            FROM zillow.predictions_2017 pred
            INNER JOIN (SELECT parcelid, MAX(transactiondate) AS max_transactiondate
                    FROM zillow.predictions_2017
                    GROUP BY parcelid) latest
            ON pred.parcelid = latest.parcelid AND pred.transactiondate = latest.max_transactiondate
        ) p2 USING(parcelid)
    WHERE (p1.bedroomcnt > 0 AND p1.bathroomcnt > 0 
            AND calculatedfinishedsquarefeet > 500
            AND latitude IS NOT NULL AND longitude IS NOT NULL)
            AND (unitcnt = 1 OR unitcnt IS NULL)
    ;
    '''
    return pd.read_sql(query, get_connection('zillow'))

In [4]:
df = get_zillow_data()

In [5]:
df.head()

| | id | parcelid | airconditioningtypeid | architecturalstyletypeid | basementsqft | bathroomcnt | bedroomcnt | buildingclasstypeid | buildingqualitytypeid | calculatedbathnbr | ... | censustractandblock | transactiondate | logerror | airconditioningdesc | architecturalstyledesc | buildingclassdesc | heatingorsystemdesc | propertylandusedesc | storydesc | typeconstructiondesc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1727539 | 14297519 | | | | 3.5 | 4.0 | | | 3.5 | ... | 60590630000000.0 | 2017-01-01 | 0.025595 | | | | | Single Family Residential | | |
| 1 | 1387261 | 17052889 | | | | 1.0 | 2.0 | | | 1.0 | ... | 61110010000000.0 | 2017-01-01 | 0.055619 | | | | | Single Family Residential | | |
| 2 | 11677 | 14186244 | | | | 2.0 | 3.0 | | | 2.0 | ... | 60590220000000.0 | 2017-01-01 | 0.005383 | | | | | Single Family Residential | | |
| 3 | 2288172 | 12177905 | | | | 3.0 | 4.0 | | 8.0 | 3.0 | ... | 60373000000000.0 | 2017-01-01 | -0.10341 | | | | Central | Single Family Residential | | |
| 4 | 1970746 | 10887214 | 1.0 | | | 3.0 | 3.0 | | 8.0 | 3.0 | ... | 60371240000000.0 | 2017-01-01 | 0.00694 | Central | | | Central | Condominium | | |


1. Ask at least 5 questions about the data, keeping in mind that your target variable is `logerror`. e.g. Is `logerror` significantly different for properties in LA County vs Orange County vs Ventura County?

1.
2.
3.
4.
5.

2. Answer those questions through a mix of statistical tests and visualizations.
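
As one way to approach the example question about counties, the sketch below runs a Kruskal-Wallis H-test on `logerror` grouped by county. It assumes the DataFrame has a `fips` column holding county codes (6037 for Los Angeles, 6059 for Orange, 6111 for Ventura). The function name `compare_logerror_by_county` and the synthetic demo data are illustrative stand-ins, not part of the exercise. A rank-based test is used here because `logerror` may not be normally distributed; `scipy.stats.f_oneway` would be the ANOVA alternative.

```python
import numpy as np
import pandas as pd
from scipy import stats

# County FIPS codes; assumes the Zillow DataFrame exposes them in a `fips` column.
COUNTY_NAMES = {6037: "Los Angeles", 6059: "Orange", 6111: "Ventura"}


def compare_logerror_by_county(df, alpha=0.05):
    """Kruskal-Wallis H-test of logerror across the counties in df['fips'].

    Returns (statistic, p_value, reject_null).
    """
    groups = [
        grp["logerror"].dropna().to_numpy()
        for _, grp in df.groupby("fips")
    ]
    stat, p = stats.kruskal(*groups)
    return stat, p, p < alpha


# Synthetic stand-in data (not real Zillow values): one county is shifted.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "fips": np.repeat([6037, 6059, 6111], 200),
    "logerror": np.concatenate([
        rng.normal(0.00, 0.1, 200),
        rng.normal(0.10, 0.1, 200),
        rng.normal(0.00, 0.1, 200),
    ]),
})

stat, p, reject = compare_logerror_by_county(demo)
print(f"H = {stat:.2f}, p = {p:.4g}, reject H0: {reject}")
```

With the real data, the call would be `compare_logerror_by_county(df)`; a box plot of `logerror` by `fips` is a natural companion visualization.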

In your exploration, be sure to include the following:
1. A plot with at least 3 dimensions, such as x, y, and color.

2. At least 3 different **types** of plots (like box, scatter, bar, ...).

3. At least 2 statistical tests.

4. Documented takeaways/conclusions after each question is addressed.
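
A plot with three dimensions can be as simple as a scatter plot whose point color encodes a third column. The sketch below is a minimal illustration using matplotlib; the helper name `scatter_with_color`, its default column choices, and the synthetic demo data are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def scatter_with_color(df, x="calculatedfinishedsquarefeet", y="logerror",
                       color="bathroomcnt"):
    """Scatter plot of y against x, with a third column mapped to color."""
    fig, ax = plt.subplots(figsize=(8, 5))
    points = ax.scatter(df[x], df[y], c=df[color], cmap="viridis",
                        alpha=0.5, s=10)
    fig.colorbar(points, ax=ax, label=color)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.set_title(f"{y} vs {x}, colored by {color}")
    return ax


# Synthetic stand-in data (not real Zillow values).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "calculatedfinishedsquarefeet": rng.uniform(500, 5000, 300),
    "bathroomcnt": rng.integers(1, 6, 300).astype(float),
    "logerror": rng.normal(0.0, 0.1, 300),
})

ax = scatter_with_color(demo)
```

On the real data, sampling a few thousand rows before plotting (for example with `df.sample(5000)`) may keep the scatter readable.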

**Bonus**:
Compute the mean(logerror) by zip code and the overall mean(logerror). Write a loop that will run a t-test between the overall mean and the mean for each zip code. We want to identify the zip codes where the error is significantly higher or lower than the expected error.
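
One possible shape for the bonus loop is sketched below: a one-sample t-test of each zip code's `logerror` values against the overall mean. It assumes the zip code lives in a column named `regionidzip`. The function name `ttest_by_zip`, the `min_n` cutoff for skipping very small groups, and the synthetic demo data are illustrative choices, not requirements of the exercise.

```python
import numpy as np
import pandas as pd
from scipy import stats


def ttest_by_zip(df, zip_col="regionidzip", target="logerror",
                 alpha=0.05, min_n=30):
    """One-sample t-test of each zip code's target values vs. the overall mean."""
    overall_mean = df[target].mean()
    rows = []
    for zip_code, values in df.groupby(zip_col)[target]:
        values = values.dropna()
        if len(values) < min_n:
            continue  # too few observations for a meaningful test
        t, p = stats.ttest_1samp(values, overall_mean)
        rows.append({
            "zip": zip_code,
            "n": len(values),
            "mean": values.mean(),
            "t": t,
            "p": p,
            "significant": p < alpha,
        })
    return pd.DataFrame(rows)


# Synthetic stand-in data (not real Zillow values): 20 zip codes, one shifted.
rng = np.random.default_rng(42)
zips = np.repeat(np.arange(90001, 90021), 50)
logerror = rng.normal(0.0, 0.1, zips.size)
logerror[zips == 90020] += 0.2
demo = pd.DataFrame({"regionidzip": zips, "logerror": logerror})

result = ttest_by_zip(demo)
print(result.sort_values("p").head())
```

The sign of `t` indicates whether a zip code's error sits above or below the overall mean. Because one test is run per zip code, some results will fall below `alpha` by chance alone, so a multiple-comparison correction may be worth considering.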