# Exploratory Data Analysis (EDA) of Zillow Data
In this notebook, initial EDA is conducted on the Zillow data set.

## Import required packages

In [13]:
import pandas as pd
import numpy as np
import seaborn as sns

## Import processed data
- Columns are in lower case
- Zip code column renamed to zip
- Index set to date column in datetime format

In [14]:
zill = pd.read_csv('../data/processed/zillow_time_index.csv', index_col=0)

In [15]:
zill.columns

Index(['regionid', 'zip', 'city', 'state', 'metro', 'countyname', 'sizerank',
       'value'],
      dtype='object')

In [16]:
zill.head()

date,regionid,zip,city,state,metro,countyname,sizerank,value
1996-04-01,84654,60657,Chicago,IL,Chicago,Cook,1,334200.0
1996-04-01,90668,75070,McKinney,TX,Dallas-Fort Worth,Collin,2,235700.0
1996-04-01,91982,77494,Katy,TX,Houston,Harris,3,210400.0
1996-04-01,84616,60614,Chicago,IL,Chicago,Cook,4,498100.0
1996-04-01,93144,79936,El Paso,TX,El Paso,El Paso,5,77300.0


## Feature-by-feature analysis
Before any models are built, each feature is assessed in turn for underlying issues that could affect feature selection in this data set.

In [17]:
zill.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3901595 entries, 1996-04-01 to 2018-04-01
Data columns (total 8 columns):
 #   Column      Dtype  
---  ------      -----  
 0   regionid    int64  
 1   zip         int64  
 2   city        object 
 3   state       object 
 4   metro       object 
 5   countyname  object 
 6   sizerank    int64  
 7   value       float64
dtypes: float64(1), int64(3), object(4)
memory usage: 267.9+ MB
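
The `Index:` line in the `info()` output (rather than `DatetimeIndex:`) suggests the dates were read as plain strings. Below is a minimal sketch of how `parse_dates=True` could be passed to `pd.read_csv` to obtain a datetime index, assuming the first CSV column holds ISO-formatted dates. The toy CSV text is illustrative only and is not taken from the Zillow file.

```python
import io
import pandas as pd

# Toy CSV standing in for the processed file (values are made up).
csv_text = """date,zip,value
1996-04-01,11111,100.0
1996-05-01,11111,101.0
"""

# Without parse_dates the index holds strings.
as_strings = pd.read_csv(io.StringIO(csv_text), index_col=0)

# With parse_dates=True the index column is converted to datetimes.
as_dates = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(as_strings.index).__name__)
print(type(as_dates.index).__name__)
```

A datetime index also lets the data be sliced by date, for example `as_dates.loc["1996-05"]`.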


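Initial research suggested that Zillow defines 'sizerank' as the average house price per state divided by the population of that state. The column is stored as an integer, which is more consistent with an ordinal ranking of regions, so this definition should be verified against Zillow's documentation before it is relied upon.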

## Datetime Index

In [20]:
len(zill.index.value_counts())

265

__Key Takeaway__ The original data set had 265 monthly columns, each holding the house price for that month. In the processed (long) format every zip code therefore appears in 265 rows, one per month, so the value counts for the features in this EDA must be divided by 265 to obtain the actual number of zip codes.
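
The division by 265 can be illustrated on a toy example. This is a sketch under the assumption that every zip code has a row for every month; the frame, its values and the names `wide` and `long_df` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical wide frame: one row per zip, one column per month (toy values).
wide = pd.DataFrame({
    "zip": [10001, 10002, 60601],
    "city": ["New York", "New York", "Chicago"],
    "1996-04": [100.0, 110.0, 90.0],
    "1996-05": [101.0, 111.0, 91.0],
})

# Melting to long format repeats each zip once per month column.
long_df = wide.melt(id_vars=["zip", "city"], var_name="date", value_name="value")
n_months = long_df["date"].nunique()  # 2 in this toy frame; 265 in the notebook

# Rows per city divided by the number of months gives zip codes per city.
zips_per_city = long_df["city"].value_counts() / n_months

# Equivalent count that does not depend on knowing the number of months.
zips_per_city_direct = long_df.groupby("city")["zip"].nunique()

print(zips_per_city)
print(zips_per_city_direct)
```

The `nunique` variant may be the safer choice if some zip codes turn out to have missing months.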

## RegionID

In [34]:
zill.regionid.value_counts()/265

63457    1.0
92897    1.0
60113    1.0
99067    1.0
74479    1.0
        ... 
89457    1.0
99690    1.0
77180    1.0
60740    1.0
69666    1.0
Name: regionid, Length: 14723, dtype: float64

In [44]:
(zill.regionid.value_counts().min()/265), (zill.regionid.value_counts().max()/265)

(1.0, 1.0)

__Key Takeaway__ Each `regionid` appears exactly once per month, so it is a unique identifier and adds no information for modeling. It will be removed after regions have been compared against one another. As such, this column is added to the 'kill_cols' list for ultimate deletion.

In [23]:
zill.metro.value_counts()

New York                          206435
Los Angeles-Long Beach-Anaheim     91955
Chicago                            86125
Philadelphia                       74465
Washington                         65985
                                   ...  
Clarksdale                           265
Alamogordo                           265
Beeville                             265
Alice                                265
Sweetwater                           265
Name: metro, Length: 701, dtype: int64

In [45]:
kill_cols = ['regionid']
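
One way the `kill_cols` list might be applied later is sketched below. The toy frame and the names `one_to_one` and `trimmed` are assumptions for illustration; the notebook itself does not show the deletion step.

```python
import pandas as pd

# Toy frame with the notebook's column names (values are made up).
toy = pd.DataFrame({
    "regionid": [1, 2, 1, 2],
    "zip": [10001, 60601, 10001, 60601],
    "value": [100.0, 90.0, 101.0, 91.0],
})

kill_cols = ["regionid"]

# Check that each regionid maps to exactly one zip before discarding it.
one_to_one = toy.groupby("regionid")["zip"].nunique().eq(1).all()

# drop() returns a new frame; the original is left unchanged.
trimmed = toy.drop(columns=kill_cols)

print(one_to_one)
print(list(trimmed.columns))
```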

## Zip

In [46]:
(zill.zip.value_counts().min()/265), (zill.zip.value_counts().max()/265)

(1.0, 1.0)

__Key Takeaway__ Zip code is the unit for which the "best" performers are being picked. As such, it will be kept for EDA purposes.

## City

In [47]:
(zill.city.value_counts().min()/265), (zill.city.value_counts().max()/265)

(1.0, 114.0)

In [48]:
zill.city.value_counts()

New York            30210
Los Angeles         25175
Houston             23320
San Antonio         12720
Washington          11925
                    ...  
Minneota              265
Southern Shores       265
Camp Hill             265
Oronoko               265
Highland Springs      265
Name: city, Length: 7554, dtype: int64

## State

In [49]:
(zill.state.value_counts().min()/265), (zill.state.value_counts().max()/265)

(16.0, 1224.0)

In [58]:
zill.state.value_counts()/265

CA    1224.0
NY    1015.0
TX     989.0
PA     831.0
FL     785.0
OH     588.0
IL     547.0
NJ     502.0
MI     499.0
NC     428.0
IN     428.0
MA     417.0
TN     404.0
VA     401.0
MN     375.0
GA     345.0
WA     341.0
WI     332.0
MO     319.0
MD     317.0
CO     249.0
KS     241.0
AZ     230.0
OR     224.0
OK     221.0
SC     206.0
NH     199.0
LA     193.0
AL     183.0
IA     158.0
MS     153.0
KY     139.0
CT     124.0
UT     121.0
ID     110.0
AR     105.0
NV     103.0
ME      86.0
NE      83.0
WV      72.0
MT      71.0
HI      62.0
NM      60.0
RI      59.0
DE      41.0
WY      31.0
ND      31.0
AK      28.0
SD      19.0
DC      18.0
VT      16.0
Name: state, dtype: float64

## Metro

In [50]:
(zill.metro.value_counts().min()/265), (zill.metro.value_counts().max()/265)

(1.0, 779.0)

In [59]:
zill.metro.value_counts()/265

New York                          779.0
Los Angeles-Long Beach-Anaheim    347.0
Chicago                           325.0
Philadelphia                      281.0
Washington                        249.0
                                  ...  
Clarksdale                          1.0
Alamogordo                          1.0
Beeville                            1.0
Alice                               1.0
Sweetwater                          1.0
Name: metro, Length: 701, dtype: float64

## CountyName

In [51]:
(zill.countyname.value_counts().min()/265), (zill.countyname.value_counts().max()/265)

(1.0, 264.0)

In [60]:
zill.countyname.value_counts()/265

Los Angeles    264.0
Jefferson      175.0
Orange         166.0
Washington     164.0
Montgomery     159.0
               ...  
Bullitt          1.0
Erath            1.0
Bailey           1.0
Yankton          1.0
Lemhi            1.0
Name: countyname, Length: 1212, dtype: float64
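
County names such as Jefferson, Washington and Montgomery are used in more than one state, so counts by `countyname` alone may merge distinct counties. A minimal sketch of counting by the (`state`, `countyname`) pair instead; the toy frame below is hypothetical and only borrows the notebook's column names.

```python
import pandas as pd

# Toy frame: one row per zip (values are made up).
toy = pd.DataFrame({
    "zip": [1, 2, 3, 4],
    "state": ["AL", "AL", "CO", "CO"],
    "countyname": ["Jefferson", "Jefferson", "Jefferson", "Denver"],
})

# Counting by name alone merges the Alabama and Colorado counties.
by_name = toy.groupby("countyname")["zip"].nunique()

# Counting by (state, countyname) keeps them separate.
by_state_and_name = toy.groupby(["state", "countyname"])["zip"].nunique()

print(by_name)
print(by_state_and_name)
```

The same caveat probably applies to `city`, where a name like Washington can refer to places in several states.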

## SizeRank

In [65]:
zill.sizerank.min(), zill.sizerank.max()

(1, 14723)

In [64]:
zill.sizerank

date
1996-04-01        1
1996-04-01        2
1996-04-01        3
1996-04-01        4
1996-04-01        5
              ...  
2018-04-01    14719
2018-04-01    14720
2018-04-01    14721
2018-04-01    14722
2018-04-01    14723
Name: sizerank, Length: 3901595, dtype: int64
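
The minimum of 1 and maximum of 14723 match the number of distinct `regionid` values, which would be consistent with `sizerank` being an ordinal rank of regions. A sketch of how that could be checked per month follows; the toy frame and the helper `is_complete_rank` are hypothetical.

```python
import pandas as pd

# Toy long-format frame: three zips observed in two months (values are made up).
toy = pd.DataFrame({
    "date": ["1996-04-01"] * 3 + ["1996-05-01"] * 3,
    "zip": [10, 20, 30, 10, 20, 30],
    "sizerank": [1, 2, 3, 1, 2, 3],
})


def is_complete_rank(ranks):
    """Return True if ranks are exactly 1..n with no gaps or repeats."""
    return sorted(ranks) == list(range(1, len(ranks) + 1))


# Check every month separately; the result has one boolean per date.
rank_ok = toy.groupby("date")["sizerank"].apply(is_complete_rank)

print(rank_ok)
```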

## Value

In [56]:
zill.value.min(), zill.value.max()

(11300.0, 19314900.0)

In [None]:
sns.relplot(x=zill.index, y=zill.value, kind='line', data=zill)
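
With roughly 3.9 million rows, a line plot of the raw data is likely to be slow, because seaborn aggregates repeated x values and by default bootstraps a confidence interval for each one. A possible alternative, sketched here on a made-up frame, is to reduce the data to one summary row per month with pandas first and plot the result. The names `toy` and `monthly` are assumptions and not part of the notebook.

```python
import pandas as pd

# Toy long-format frame indexed by date (values are made up).
toy = pd.DataFrame(
    {"zip": [1, 2, 1, 2], "value": [100.0, 300.0, 110.0, 330.0]},
    index=pd.to_datetime(
        ["1996-04-01", "1996-04-01", "1996-05-01", "1996-05-01"]
    ),
)
toy.index.name = "date"

# One summary row per month instead of one row per zip per month.
monthly = toy.groupby(level="date")["value"].agg(["mean", "median"])

print(monthly)
```

The resulting `monthly` frame has one row per date and could then be passed to a plotting call in place of the full data set.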