# Exploratory Data Analysis - Zillow Dataset

This is an exploratory data analysis of the Zillow dataset, taking inspiration from:

- [https://www.kaggle.com/philippsp/exploratory-analysis-zillow](https://www.kaggle.com/philippsp/exploratory-analysis-zillow)
- [https://www.kaggle.com/headsortails/pytanic](https://www.kaggle.com/headsortails/pytanic)

In [None]:
#%matplotlib inline

import pandas as pd
import numpy as np
from scipy import stats
from io import StringIO
import sklearn as sk
import itertools
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from statsmodels.graphics.mosaicplot import mosaic

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn import svm
import xgboost as xgb
from mlxtend.classifier import StackingClassifier

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

sns.set(style='white', context='notebook', palette='deep')
pd.options.display.max_columns = 999

# Load Input Data

In [None]:
properties        = pd.read_csv('../input/properties_2017.csv')
transactions      = pd.read_csv('../input/train_2017.csv')
sample_submission = pd.read_csv('../input/sample_submission.csv')

Remap the column names to be more human readable

In [None]:
data_dictionary_string = """key|old_key|description
aircon|airconditioningtypeid|Type of cooling system present in the home (if any)
architecturalstyletypeid|architecturalstyletypeid|Architectural style of the home (i.e. ranch, colonial, split-level, etc…)
area_base|finishedsquarefeet6|Base unfinished and finished area
area_firstfloor_finished|finishedfloor1squarefeet|Size of the finished living area on the first (entry) floor of the home
area_garage|garagetotalsqft|Total number of square feet of all garages on lot including an attached garage
area_live_finished|finishedsquarefeet12|Finished living area
area_liveperi_finished|finishedsquarefeet13|Perimeter living area
area_lot|lotsizesquarefeet|Area of the lot in square feet
area_patio|yardbuildingsqft17|Patio in yard
area_pool|poolsizesum|Total square footage of all pools on property
area_shed|yardbuildingsqft26|Storage shed/building in yard
area_total_calc|calculatedfinishedsquarefeet|Calculated total finished living area of the home
area_total_finished|finishedsquarefeet15|Total area
area_unknown|finishedsquarefeet50|Size of the finished living area on the first (entry) floor of the home
basementsqft|basementsqft|Finished living area below or partially below ground level
build_year|yearbuilt|The Year the principal residence was built
deck|decktypeid|Type of deck (if any) present on parcel
flag_fireplace|fireplaceflag|Is a fireplace present in this home
flag_tub|hashottuborspa|Does the home have a hot tub or spa
framing|buildingclasstypeid|The building framing type (steel frame, wood frame, concrete/brick)
heating|heatingorsystemtypeid|Type of home heating system
id_fips|fips|Federal Information Processing Standard code - see https://en.wikipedia.org/wiki/FIPS_county_code for more details
id_parcel|parcelid|Unique identifier for parcels (lots)
id_zoning_raw|rawcensustractandblock|Census tract and block ID combined - also contains blockgroup assignment by extension
id_zoning|censustractandblock|Census tract and block ID combined - also contains blockgroup assignment by extension
latitude|latitude|Latitude of the middle of the parcel multiplied by 10e6
longitude|longitude|Longitude of the middle of the parcel multiplied by 10e6
material|typeconstructiontypeid|What type of construction material was used to construct the home
num_75_bath|threequarterbathnbr|Number of 3/4 bathrooms in house (shower + sink + toilet)
num_bathroom_calc|calculatedbathnbr|Number of bathrooms in home including fractional bathroom
num_bathroom|bathroomcnt|Number of bathrooms in home including fractional bathrooms
num_bath|fullbathcnt|Number of full bathrooms (sink, shower + bathtub, and toilet) present in home
num_bedroom|bedroomcnt|Number of bedrooms in home
num_fireplace|fireplacecnt|Number of fireplaces in a home (if any)
num_garage|garagecarcnt|Total number of garages on the lot including an attached garage
num_pool|poolcnt|Number of pools on the lot (if any)
num_room|roomcnt|Total number of rooms in the principal residence
num_story|numberofstories|Number of stories or levels the home has
num_unit|unitcnt|Number of units the structure is built into (i.e. 2 = duplex, 3 = triplex, etc...)
pooltypeid10|pooltypeid10|Spa or Hot Tub
pooltypeid2|pooltypeid2|Pool with Spa/Hot Tub
pooltypeid7|pooltypeid7|Pool without hot tub
quality|buildingqualitytypeid|Overall assessment of condition of the building from best (lowest) to worst (highest)
region_city|regionidcity|City in which the property is located (if any)
region_county|regionidcounty|County in which the property is located
region_neighbor|regionidneighborhood|Neighborhood in which the property is located
region_zip|regionidzip|Zip code in which the property is located
story|storytypeid|Type of floors in a multi-story house (i.e. basement and main level, split-level, attic, etc.). See tab for details.
tax_building|structuretaxvaluedollarcnt|The assessed value of the built structure on the parcel
tax_delinquency_year|taxdelinquencyyear|Year for which the unpaid property taxes were due
tax_delinquency|taxdelinquencyflag|Property taxes for this parcel are past due as of 2015
tax_land|landtaxvaluedollarcnt|The assessed value of the land area of the parcel
tax_property|taxamount|The total property tax assessed for that assessment year
tax_total|taxvaluedollarcnt|The total tax assessed value of the parcel
tax_year|assessmentyear|The year of the property tax assessment
zoning_landuse_county|propertycountylandusecode|County land use code i.e. its zoning at the county level
zoning_landuse|propertylandusetypeid|Type of land use the property is zoned for
zoning_property|propertyzoningdesc|Description of the allowed land uses (zoning) for that property
"""

data_dictionary_df = pd.read_csv(StringIO(data_dictionary_string), sep="|")
data_dictionary_df.sort_values(by="key", inplace=True)
#data_dictionary_df.index = data_dictionary_df["key"]
data_dictionary_df

In [None]:
# Create quick lookup 
data_dictionary = data_dictionary_df.set_index("key")["description"]
data_dictionary["id_parcel"]

In [None]:
# Remap properties with new keys from data_dictionary
# Build an {old_key: key} mapping (the original discarded the to_dict() result)
data_dictionary_rename = data_dictionary_df.set_index("old_key")["key"].to_dict()

# Apply rename to properties
properties.rename(columns=data_dictionary_rename, inplace=True)
properties.set_index('id_parcel', drop=False, inplace=True)
properties.index.name = 'id'
properties.head()

In [None]:
transactions = transactions.rename(columns={
    "parcelid": "id_parcel",  
    "transactiondate": "date" 
})
transactions.sort_values(by="id_parcel", inplace=True)
transactions.set_index('id_parcel', drop=False, inplace=True)
transactions.index.name = 'id'
transactions.head()

We can now combine the two datasets to add transaction data (logerror + date) to properties.

In [None]:
properties = properties.join(transactions, on="id_parcel", rsuffix="_transaction", how="inner", sort=True)
properties.head()

In [None]:
transactions = transactions.join(properties, on="id_parcel", rsuffix="_property", how="outer", sort=True)
transactions.head()

# Duplicate Properties

Properties that have been sold more than once will have multiple transaction entries.

In [None]:
transaction_counts = pd.DataFrame({
    'count' : transactions.groupby("id_parcel").size()
}).reset_index()

transaction_duplicate_counts = {
    1: { 
        "total":   transaction_counts[transaction_counts['count'].eq(1)].size, 
        "percent": transaction_counts[transaction_counts['count'].eq(1)].size / transaction_counts.size * 100
    },    
    2: { 
        "total":   transaction_counts[transaction_counts['count'].eq(2)].size, 
        "percent": transaction_counts[transaction_counts['count'].eq(2)].size / transaction_counts.size * 100
    }, 
    3: { 
        "total":   transaction_counts[transaction_counts['count'].eq(3)].size, 
        "percent": transaction_counts[transaction_counts['count'].eq(3)].size / transaction_counts.size * 100
    }, 
    4: { 
        "total":   transaction_counts[transaction_counts['count'].gt(3)].size, 
        "percent": transaction_counts[transaction_counts['count'].gt(3)].size / transaction_counts.size * 100
    }    
}
transaction_duplicate_counts

- 181108 (99.74%) were sold only once, making up the vast majority of the data
- 254 (0.25%) were sold twice
- 2 (0.003%) were sold 3 times
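
One way to cross-check counts like these is to tally rows rather than cells. `DataFrame.size` returns rows times columns, so on a two-column frame it reports twice the number of parcels, although the percentages are unaffected. The sketch below uses a small made-up `toy_transactions` frame (an assumption for illustration, not the Zillow data) and `value_counts` to count parcels directly.

```python
import pandas as pd

# Hypothetical toy data: parcels 1, 4 and 5 sold once, parcel 2 twice, parcel 3 three times
toy_transactions = pd.DataFrame({
    "id_parcel": [1, 2, 2, 3, 3, 3, 4, 5],
    "logerror":  [0.01, -0.02, 0.03, 0.00, 0.05, -0.01, 0.02, 0.04],
})

# Number of transactions per parcel (one entry per parcel)
sales_per_parcel = toy_transactions.groupby("id_parcel").size()

# As a two-column frame, .size counts cells while len() counts rows
toy_counts = sales_per_parcel.reset_index(name="count")
cells, rows = toy_counts.size, len(toy_counts)

# How many parcels were sold once, twice, three times, ...
parcels_by_sale_count = sales_per_parcel.value_counts().sort_index()

# The same breakdown as a percentage of all parcels
percent_by_sale_count = parcels_by_sale_count / len(sales_per_parcel) * 100

print(cells, rows)
print(parcels_by_sale_count.to_dict())
```

On real data, the same three lines (`groupby`, `size`, `value_counts`) would apply to the `transactions` frame unchanged.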

The next question is whether any properties were never sold, i.e. have no entry in the transactions dataset.
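
A hedged sketch of one way to answer this, assuming a left merge with pandas' `indicator=True` option so that properties without a matching transaction are kept and labelled. An inner join would discard those rows before they could be counted. The `toy_properties` and `toy_transactions` frames are invented for illustration.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real properties and transactions
toy_properties = pd.DataFrame({
    "id_parcel": [10, 11, 12, 13],
    "area_total_calc": [1200.0, 850.0, 2300.0, 1600.0],
})
toy_transactions = pd.DataFrame({
    "id_parcel": [11, 13, 13],          # parcel 13 was sold twice
    "logerror": [0.02, -0.01, 0.04],
})

# Left merge keeps every property; the _merge column records where each row matched
merged = toy_properties.merge(
    toy_transactions, on="id_parcel", how="left", indicator=True
)

# Count distinct parcels so that repeat sales are not double counted
with_tx = merged.loc[merged["_merge"] == "both", "id_parcel"].nunique()
without_tx = merged.loc[merged["_merge"] == "left_only", "id_parcel"].nunique()
percent_with = with_tx / toy_properties["id_parcel"].nunique() * 100

print(with_tx, without_tx, percent_with)
```

Counting distinct `id_parcel` values rather than rows keeps the percentages comparable even when a parcel appears in several transactions.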

In [None]:
properties_with_transactions    = properties[properties['id_parcel_transaction'].notnull()]
properties_without_transactions = properties[properties['id_parcel_transaction'].isnull()]
{
   "transactions": { "total": transactions.size, "percent": transactions.size / properties.size * 100 },
   "properties":   { "total": properties.size, "percent":   properties.size / properties.size * 100 },
   "properties_with_transactions":    { "total": properties_with_transactions.size,    "percent": properties_with_transactions.size / properties.size * 100 },
   "properties_without_transactions": { "total": properties_without_transactions.size, "percent": properties_without_transactions.size / properties.size * 100 },
}

As we can see, properties_with_transactions makes up only 2.7% of the original data, and this is the only data useful to us, since only these rows carry a logerror.

In [None]:
properties.head()

# Correlation Matrix

Let's do an initial top-level correlation matrix analysis.

By calculating the mean of each column of the correlation matrix, we find the attributes with the most cross-correlation.
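
As a rough illustration of this column-mean approach, the sketch below builds a small synthetic frame (the column names and data are assumptions, not Zillow values) and averages each column of its correlation matrix. Because every column correlates perfectly with itself, the diagonal contributes a constant 1/n to each mean. Masking the diagonal is shown as an optional variation, not as something this analysis does.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 500

# Hypothetical data: two columns driven by a shared "size" factor, one pure noise
size_factor = rng.normal(size=n_rows)
toy = pd.DataFrame({
    "area":  size_factor + rng.normal(scale=0.3, size=n_rows),
    "baths": size_factor + rng.normal(scale=0.5, size=n_rows),
    "noise": rng.normal(size=n_rows),
})

corr = toy.corr()

# Mean of each column of the correlation matrix, highest first
mean_corr = corr.mean(axis=0).sort_values(ascending=False)

# Variation: hide the diagonal (self-correlation of 1.0) before averaging
off_diag = corr.mask(np.eye(len(corr), dtype=bool))
mean_corr_off_diag = off_diag.mean(axis=0).sort_values(ascending=False)

print(mean_corr)
print(mean_corr_off_diag)
```

The ranking is the same either way; only the scale of the means changes, which matters when comparing them against a fixed threshold.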


In [None]:
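# numeric_only=True skips text columns, which recent pandas versions no longer drop silently
properties.drop(columns=['id_parcel']).corr(numeric_only=True).mean(axis=0).sort_values(ascending=False)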

The top cross-correlated entries are: 

- 0.338765 - area_total_calc 
- 0.270347 - tax_building
- 0.266564 - num_bath
- 0.245959 - num_fireplace
- 0.220768 - num_garage
- 0.204202 - num_bedroom                 

The larger and higher quality the house, the more of everything else it tends to have, including tax.

Baths, fireplaces and even garages are better proxies for size and quality than bedrooms. This is the metric versus measurement effect: the number of bedrooms is usually the most visible statistic in an estate agent's listing, and so the one with the greatest psychological effect on market price. House builders, and even owners, therefore have an incentive to create several smaller bedrooms to make a small house look bigger. People rarely go to the same effort with baths or fireplaces, which makes those attributes a better proxy for quality.

The most anti-correlated attribute is region_county.

In [None]:
plt.figure(figsize=(14,12))
sns.heatmap(properties.corr(numeric_only=True), vmax=0.6, square=True, annot=False)

Several of these columns have 

In [None]:
cross_corellation_matrix = properties.corr(numeric_only=True).mean().sort_values(ascending=False)
cross_corellation_matrix

The top cross-correlated attributes could be considered a proxy for underlying utility value, as opposed to market price.

- **area_total_calc** is the most cross-correlated item. The larger the overall property, the more room there is for everything else, including tax.

- **tax_building** is far more cross-correlated with utility value than **tax_land**, which may be more correlated with market price.

- **id_parcel** is effectively a random field, which suggests that a correlation strength below **0.13** can be assumed to be indistinguishable from noise. This is matched by the strongest anti-correlated field, **region_county**, with a correlation strength of -0.12.

- **num_bath (0.259448)** vs **num_bathroom (0.250364)** suggests a difference between metrics and measurements. A bathroom by definition contains a bath (representing utility value), whereas an estate agent trying to optimise for market price would quote the **num_bathroom** statistic, even though it is slightly less cross-correlated with utility value. Properties with more bathrooms than baths may be prone to an error in valuation.

- **num_room (0.177573)** vs **num_bathroom (0.250364)** may be another indicator of sales marketing. We suspect that in a house optimised for utility value, bathrooms would scale proportionally with total rooms and with the rest of the cross-correlation matrix. A higher ratio of rooms to bathrooms would suggest the house is optimised for market price.

- A possible avenue to explore is whether the error between estimate and sale price is correlated with the difference in correlation between the most cross-correlated attributes (> +0.24) and the lesser cross-correlated attributes (0.15-0.20).

- Several attributes produced a NaN result for the mean correlation matrix, suggesting they lack sufficient non-null data to be used for measuring correlations, thus could be safely removed from the properties dataset: **framing, deck, num_pool, pooltypeid10, pooltypeid2, pooltypeid7, story, tax_year**     

- **logerror (0.037710)** is almost perfectly uncorrelated with all the other attributes provided, which may explain why Zillow have attached a million dollar prize to being able to predict it correctly.
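
The noise-floor idea from the **id_parcel** bullet can be sketched by adding a deliberately random column and checking how large its correlations become by chance alone. Everything below is synthetic: `feature_a`, `feature_b` and `random_id` are hypothetical names, and the resulting floor applies only to this toy sample, not to the Zillow data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows = 2000

# Two hypothetical features sharing a common signal, plus an arbitrary identifier
signal = rng.normal(size=n_rows)
toy = pd.DataFrame({
    "feature_a": signal + rng.normal(scale=0.5, size=n_rows),
    "feature_b": signal + rng.normal(scale=1.0, size=n_rows),
    "random_id": rng.permutation(n_rows),  # shuffled ids carry no information
})

corr = toy.corr()

# Largest absolute correlation the random column reaches with any real feature
noise_floor = corr["random_id"].drop("random_id").abs().max()

# A correlation is only interesting if it clears that floor
above_noise = corr.loc["feature_a", "feature_b"] > noise_floor

print(round(noise_floor, 4), above_noise)
```

Chance correlations shrink roughly with the square root of the number of rows, so a floor estimated this way depends on how many non-null rows each attribute has.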

In [None]:
attributes_id      = cross_corellation_matrix[['id_parcel','id_zoning','id_zoning_raw', 'id_parcel_transaction']].keys()
attributes_utility = cross_corellation_matrix[cross_corellation_matrix.gt(0.22)].keys()
attributes_price   = cross_corellation_matrix[cross_corellation_matrix.lt(0.22) & cross_corellation_matrix.gt(0.15)].keys()
attributes_random  = cross_corellation_matrix[cross_corellation_matrix.lt(0.15)].drop(attributes_id, errors='ignore').keys()
attributes_null    = cross_corellation_matrix[cross_corellation_matrix.isnull()].keys()

In [None]:
attributes_id

In [None]:
attributes_utility

In [None]:
attributes_price

In [None]:
attributes_random

In [None]:
attributes_null

In [None]:
# Drop by column label; without columns= pandas looks for these names in the row index
properties = properties.drop(columns=attributes_null, errors='ignore')

Let's explore the correlation between the seemingly random attributes.

In [None]:
logerror_corellation = properties.corr(numeric_only=True)["logerror"].sort_values(ascending=False)
logerror_corellation

In [None]:
attributes_logerror = logerror_corellation[logerror_corellation.abs() > 0.01].index
attributes_logerror

In [None]:
{
    "attributes_id":       logerror_corellation[attributes_id].mean(),
    "attributes_utility":  logerror_corellation[attributes_utility].mean(),
    "attributes_price":    logerror_corellation[attributes_price].mean(),
    "attributes_random":   logerror_corellation[attributes_random].mean(),
    "attributes_logerror": logerror_corellation[attributes_logerror].mean(),
}

The attributes correlated with high logerror seem to be the unique attributes that don't cross-correlate with the rest of the attributes.

In [None]:
sns.heatmap(properties[attributes_logerror].corr(), vmax=0.6, square=True, annot=False)

This may be the time to try training a neural network on the top 25 fields most correlated with logerror.
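
Selecting the fields most correlated with the target could look roughly like the sketch below. It is a minimal example on synthetic data: the `feature_*` columns, the coefficients and `top_k = 2` are assumptions chosen so the result is easy to check, whereas the text above suggests 25 fields from the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_rows = 1000

# Hypothetical features; only feature_0 and feature_3 influence the target
toy = pd.DataFrame(
    rng.normal(size=(n_rows, 6)),
    columns=[f"feature_{i}" for i in range(6)],
)
toy["logerror"] = (
    0.5 * toy["feature_0"] - 0.3 * toy["feature_3"] + rng.normal(size=n_rows)
)

# Correlation of every feature with the target, excluding the target itself
target_corr = toy.corr()["logerror"].drop("logerror")

# Rank by absolute value so strong negative correlations are kept too
top_k = 2
top_features = target_corr.abs().nlargest(top_k).index.tolist()

# Input matrix that a downstream model would be trained on
model_inputs = toy[top_features]

print(top_features)
```

Ranking on the absolute value matters here because a strongly negative correlation is as informative to a model as a strongly positive one.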

In [None]:
logerror_corellation