# Exploratory Data Analysis (EDA) of Zillow Data
In this notebook, initial EDA is conducted on the Zillow data set.

## Import required packages

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Import processed data
- Columns are in lower case
- Zip code column renamed to zip
- Index set to date column in datetime format
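
The processing steps listed above could look roughly like the sketch below. It runs on a toy wide-format frame rather than the real file, and the raw column names (`RegionID`, `RegionName`, `City`, and month columns such as `1996-04`) and all values are assumptions made for illustration.

```python
import pandas as pd

# Toy wide-format frame; column names and values are assumed, not taken from the raw file
raw = pd.DataFrame({
    'RegionID': [1, 2],
    'RegionName': [60657, 60614],   # assumed to hold the zip code
    'City': ['Chicago', 'Chicago'],
    '1996-04': [334200.0, 410000.0],
    '1996-05': [335400.0, 412500.0],
})

id_cols = ['RegionID', 'RegionName', 'City']

# Reshape: one row per (zip, month) instead of one column per month
long = raw.melt(id_vars=id_cols, var_name='date', value_name='value')

# Lower-case the column names and rename the zip code column
long.columns = long.columns.str.lower()
long = long.rename(columns={'regionname': 'zip'})

# Use the month as a datetime index
long['date'] = pd.to_datetime(long['date'], format='%Y-%m')
long = long.set_index('date').sort_index()
```

After this reshaping, each zip code occupies one row per month, which is why counts later in the notebook are scaled by the number of months.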

In [None]:
zill = pd.read_csv('../data/processed/zillow_time_index.csv', index_col=0)

In [None]:
zill.index = pd.to_datetime(zill.index)

In [None]:
zill['zip'] = zill['zip'].astype(str)
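
One thing worth checking when zip codes are converted to strings: if the CSV parser has already read them as integers, leading zeros are lost before the conversion happens. The sketch below uses a small in-memory CSV (invented for illustration) to show the effect and two possible remedies. Whether the processed file is affected depends on how it was written, so treat this as a check rather than a required fix.

```python
import io
import pandas as pd

# Tiny stand-in for the processed CSV; the rows are made up
csv_text = "date,zip,value\n2018-01-01,02116,500000\n2018-01-01,60657,400000\n"

# Default parsing infers an integer column, so '02116' becomes 2116
as_int = pd.read_csv(io.StringIO(csv_text), index_col=0)
lost = as_int['zip'].astype(str).tolist()

# Option 1: read the column as text from the start
as_str = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype={'zip': str})
kept = as_str['zip'].tolist()

# Option 2: pad after the fact, assuming 5-digit US zip codes
padded = as_int['zip'].astype(str).str.zfill(5).tolist()
```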

In [None]:
zill.columns

In [None]:
zill.head()

## Feature-by-feature analysis
Before any models are created, each feature is assessed for underlying issues that could affect feature selection in this data set.

In [None]:
zill.info()

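Based on initial research, Zillow appears to define 'sizerank' as the average house price per state divided by the population of that state. This definition should be confirmed against Zillow's documentation before the feature is used.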

## Datetime Index

In [None]:
zill.index.nunique()

__Key Takeaway__ The original data set stored prices in 265 monthly columns, one per month and year. In the processed long format, each zip code should therefore appear on 265 rows, so the value counts for features in this EDA must be divided by 265 to give the actual number of zip codes.
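
To make the scaling concrete, here is a small sketch with three months standing in for the 265. The zip codes and states are toy values. It also shows `nunique` as an alternative that counts zip codes directly and does not rely on every zip code having a row for every month.

```python
import pandas as pd

n_months = 3  # stands in for the 265 monthly columns
dates = pd.date_range('2018-01-01', periods=n_months, freq='MS')

# Long format: every zip code appears once per month (toy values)
zips = [('60657', 'IL'), ('60614', 'IL'), ('10011', 'NY')]
toy = pd.DataFrame(
    [(d, z, s) for z, s in zips for d in dates],
    columns=['date', 'zip', 'state'],
).set_index('date')

raw_counts = toy['state'].value_counts()                 # rows per state
zips_by_division = raw_counts / n_months                 # approach used in this notebook
zips_by_nunique = toy.groupby('state')['zip'].nunique()  # counts zip codes directly
```

Both approaches agree here; they would differ only if some zip codes were missing months.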

## RegionID

In [None]:
zill.regionid.value_counts()/265

In [None]:
(zill.regionid.value_counts().min()/265), (zill.regionid.value_counts().max()/265)

__Key Takeaway__ This value is unique to each region and therefore adds no information. It will be removed after the regions have been compared against one another, so the column is added to the 'kill_cols' list for eventual deletion.
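
One way to confirm that an identifier column is redundant is to check that it maps one-to-one onto the zip code. The sketch below does this on a toy frame; the ids and zip codes are invented, and the same two `groupby` calls could be pointed at `zill` instead.

```python
import pandas as pd

# Toy long-format rows: two months for each of two zip codes (values invented)
toy = pd.DataFrame({
    'regionid': [101, 101, 202, 202],
    'zip': ['60657', '60657', '75070', '75070'],
})

# Number of distinct region ids seen for each zip code, and vice versa
ids_per_zip = toy.groupby('zip')['regionid'].nunique()
zips_per_id = toy.groupby('regionid')['zip'].nunique()

# True when the two columns map one-to-one
one_to_one = bool((ids_per_zip == 1).all() and (zips_per_id == 1).all())
```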

In [None]:
zill.metro.value_counts()

In [None]:
kill_cols = ['regionid']

## Zip

In [None]:
(zill.zip.value_counts().min()/265), (zill.zip.value_counts().max()/265)

In [None]:
zill.zip.value_counts()

__Key Takeaway__ Zip code is the unit for which the "best" performers are being picked, so it is kept for EDA purposes.

## City

In [None]:
(zill.city.value_counts().min()/265), (zill.city.value_counts().max()/265)

In [None]:
zill.city.value_counts()

## State

In [None]:
(zill.state.value_counts().min()/265), (zill.state.value_counts().max()/265)

In [None]:
zill.state.value_counts()/265

## Metro

In [None]:
(zill.metro.value_counts().min()/265), (zill.metro.value_counts().max()/265)

In [None]:
zill.metro.value_counts()/265

## CountyName

In [None]:
(zill.countyname.value_counts().min()/265), (zill.countyname.value_counts().max()/265)

In [None]:
zill.countyname.value_counts()/265

## SizeRank

In [None]:
zill.sizerank.min(), zill.sizerank.max()

In [None]:
zill.sizerank

## Value

In [None]:
zill.value.min(), zill.value.max()

In [None]:
type(zill.index)

### Let's first explore all house values by year in the data set

In [None]:
zill.columns

In [None]:
yearly = zill.groupby([zill.index.year, zill.zip]).agg({'regionid': 'min', 'sizerank': 'min', 'value': 'mean'})

In [None]:
yearly.index.get_level_values(0)

In [None]:
sns.set()

In [None]:
yearly_lineplot = sns.lineplot(x = yearly.index.get_level_values(0), 
                               y = 'value', 
                               data = yearly);

In [None]:
fig = yearly_lineplot.get_figure()    
fig.savefig('../viz/all_values_annual.png')

__Key takeaway__: The mean of all housing values appears to dip starting in 2006, bottom out in 2011, and rebound through 2018. It may therefore be best to investigate housing values from 2011 through 2018. That said, it would be interesting to see which zip codes were resilient during the housing crisis from 2006 to 2011, as a potential indicator of retained value through a future national crisis.
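
One possible way to quantify resilience is the percent change in mean value between 2006 and 2011 for each zip code. The sketch below applies that idea to a toy frame shaped like `yearly`, indexed by year and zip code. The values are invented, and percent change is an assumed metric, not one the notebook has settled on.

```python
import pandas as pd

# Toy yearly means indexed by (year, zip), mirroring the shape of `yearly`
idx = pd.MultiIndex.from_product([[2006, 2011], ['10011', '60657', '89108']])
toy_yearly = pd.DataFrame(
    {'value': [800000.0, 400000.0, 250000.0,    # 2006
               760000.0, 340000.0, 100000.0]},  # 2011
    index=idx,
)

# One column per year so both years line up on zip code
wide = toy_yearly['value'].unstack(level=0)

# Percent change from 2006 to 2011
pct_change = (wide[2011] - wide[2006]) / wide[2006] * 100

# Least negative change first, i.e. most resilient first
most_resilient = pct_change.sort_values(ascending=False)
```

A relative measure like this avoids favoring expensive zip codes, which an absolute dollar difference tends to do.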

In [None]:
y2011_on = yearly.loc[(yearly.index.get_level_values(0) >= 2011)]

In [None]:
y2011_on.index.get_level_values(0)

In [None]:
# Yearly mean values per zip code for the last and first year of the window
vals_2018 = y2011_on.loc[(y2011_on.index.get_level_values(0) == 2018)].value
vals_2011 = y2011_on.loc[(y2011_on.index.get_level_values(0) == 2011)].value

In [None]:
vals_2011

In [None]:
z = y2011_on.loc[(y2011_on.index.get_level_values(0) == 2018)].index.get_level_values(1).to_list()
v_2011 = vals_2011.to_list()
v_2018 = vals_2018.to_list()

In [None]:
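# Columns are filled by position, which assumes 2011 and 2018 contain
# the same zip codes in the same order
y2011_to_2018 = pd.DataFrame()
y2011_to_2018['zips'] = z
y2011_to_2018['v_2011'] = v_2011
y2011_to_2018['v_2018'] = v_2018
# Absolute change in mean value from 2011 to 2018
y2011_to_2018['18_less_11'] = y2011_to_2018.v_2018 - y2011_to_2018.v_2011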
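
Building the comparison frame from plain lists pairs the 2011 and 2018 values by position. If a zip code were present in only one of the two years, the lists would differ in length or pair up the wrong rows. The sketch below shows an alternative that aligns on the zip code index, using a toy frame with invented values in which one zip code has no 2011 row.

```python
import pandas as pd

# Toy (year, zip) frame where zip '94110' has no 2011 row
idx = pd.MultiIndex.from_tuples(
    [(2011, '10011'), (2011, '60657'),
     (2018, '10011'), (2018, '60657'), (2018, '94110')],
    names=['year', 'zip'],
)
toy = pd.DataFrame({'value': [700.0, 300.0, 900.0, 450.0, 1200.0]}, index=idx)

# xs selects one year and drops the year level, leaving zip as the index
v_2011 = toy['value'].xs(2011, level=0)
v_2018 = toy['value'].xs(2018, level=0)

# concat aligns on zip; a zip missing in one year gets NaN instead of shifting rows
change = pd.concat({'v_2011': v_2011, 'v_2018': v_2018}, axis=1)
change['18_less_11'] = change['v_2018'] - change['v_2011']

# Keep only zip codes observed in both years
change = change.dropna().rename_axis('zips').reset_index()
```

The resulting frame has the same columns as `y2011_to_2018`, so it could be sorted and plotted in the same way.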

In [None]:
top10 = y2011_to_2018.sort_values(by = '18_less_11', ascending = False).head(10)

In [None]:
top10

In [None]:
sns.catplot(x = 'zips', 
            y = '18_less_11', 
            data = top10, 
            hue = 'zips', 
            kind = 'bar', 
            height=8.27, 
            aspect=11.7/8.27);

In [None]:
y2011_on_lineplot = sns.lineplot(x = y2011_on.index.get_level_values(0),
                                 y = 'value',
                                 data = y2011_on, 
                                 hue = y2011_on.index.get_level_values(1));

In [None]:
fig2011 = y2011_on_lineplot.get_figure()
# Saved under its own filename so the all-years figure saved earlier is not overwritten
fig2011.savefig('../viz/values_annual_2011_on.png')