# Capstone: Zillow Economics Data

### Task: 



### Context

Zillow's Economic Research Team collects, cleans and publishes housing and economic data from a variety of public and proprietary sources. Public property record data filed with local municipalities -- including deeds, property facts, parcel information and transactional histories -- forms the backbone of our data products, and is fleshed out with proprietary data derived from property listings and user behavior on Zillow.

The large majority of Zillow's aggregated housing market and economic data is made available for free download at zillow.com/data.

### Content
#### Variable Availability:

Zillow Home Value Index (ZHVI): A smoothed, seasonally adjusted measure of the median estimated home value across a given region and housing type. A dollar-denominated alternative to repeat-sales indices. Find a more detailed methodology here: http://www.zillow.com/research/zhvi-methodology-6032/

Zillow Rent Index (ZRI): A smoothed, seasonally adjusted measure of the median estimated market rate rent across a given region and housing type. A dollar-denominated alternative to repeat-rent indices. Find a more detailed methodology here: http://www.zillow.com/research/zillow-rent-index-methodology-2393/

For-Sale Listing/Inventory Metrics: Zillow provides many variables capturing current and historical for-sale listings availability, generally from 2012 to current. These variables include median list prices and inventory counts, both by various property types. Other variables capture for-sale market competitiveness, including the share of listings with a price cut, median price cut size, age of inventory, and the days a listing spends on Zillow before the sale is final.

Home Sales Metrics: Zillow provides data on sold homes including median sale price by various housing types, sale counts (methodology here: http://www.zillow.com/research/home-sales-methodology-7733/), and a normalized view of sale volume referred to as turnover. The prevalence of foreclosures is also provided as a ratio of the housing stock and as the share of all sales in which the home was previously foreclosed upon.

For-Rent Listing Metrics: Zillow provides median rent prices and median rent price per square foot by property type and bedroom count.

#### Housing type definitions:

All Homes: Zillow defines all homes as single-family, condominium and co-operative homes with a county record. Unless specified, all series cover this segment of the housing stock.

Condo/Co-op: Condominium and co-operative homes.

Multifamily 5+ units: Units in buildings with 5 or more housing units that are not condominiums or co-ops.

Duplex/Triplex: Housing units in buildings with 2 or 3 housing units.

Tiers: By metro, we determine price tier cutoffs that divide the all homes housing stock into thirds using the full distribution of estimated home values. We then estimate real estate metrics within the property sets, Bottom, Middle, and Top, defined by these cutoffs. When reported at the national level, all Bottom Tier homes defined at the metro level are pooled together to form the national bottom tier. The same holds for Middle and Top Tier homes.
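The tier construction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the home-level table, its column names (`metro`, `estimated_value`), and the values are invented, and Zillow's actual procedure may differ in detail.

```python
import numpy as np
import pandas as pd

# Hypothetical home-level data; column names and values are invented.
homes = pd.DataFrame({
    "metro": ["A"] * 6 + ["B"] * 6,
    "estimated_value": [100, 150, 200, 250, 300, 350,
                        500, 600, 700, 800, 900, 1000],
})

# Per-metro cutoffs at the 1/3 and 2/3 quantiles of estimated value.
cutoffs = homes.groupby("metro")["estimated_value"].quantile([1 / 3, 2 / 3]).unstack()
cutoffs.columns = ["low_cut", "high_cut"]
homes = homes.join(cutoffs, on="metro")

# Assign each home to a tier using its own metro's cutoffs.
homes["tier"] = np.select(
    [homes["estimated_value"] <= homes["low_cut"],
     homes["estimated_value"] <= homes["high_cut"]],
    ["Bottom", "Middle"],
    default="Top",
)

# Pool the metro-level tiers to form national tiers, then summarize.
national_median_by_tier = homes.groupby("tier")["estimated_value"].median()
print(national_median_by_tier)
```

Because the cutoffs are computed per metro, a home that is "Top" in an inexpensive metro can be worth less than a "Bottom" home in an expensive one, which is why the pooled national tiers overlap in value.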

#### Regional Availability:

Zillow metrics are reported for common US geographies including Nation, State, Metro (2013 Census Defined CBSAs), County, City, ZIP code, and Neighborhood.

We provide a crosswalk between colloquial Zillow region names and federally defined region names, along with linking variables such as County FIPS codes and CBSA codes. Cities and Neighborhoods do not match standard jurisdictional boundaries. Zillow city boundaries reflect mailing address conventions and so are often visually similar to collections of ZIP codes. Zillow neighborhood boundaries can be found here.
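A crosswalk like this is typically used as a lookup table joined onto the time series. The sketch below assumes a hypothetical crosswalk with columns `RegionName`, `CountyFIPS`, and `CBSACode`; those column names and all codes are placeholders rather than values from Zillow's files. A left merge with `indicator=True` keeps unmatched regions visible instead of silently dropping them.

```python
import pandas as pd

# Toy time-series rows keyed by the colloquial region name.
series = pd.DataFrame({
    "RegionName": ["aberdeenharfordmd", "aberdeenbinghamid", "unknownregion"],
    "ZHVI_TopTier": [147900.0, 168400.0, 100000.0],
})

# Hypothetical crosswalk; the codes are placeholders, stored as strings
# so that leading zeros are preserved.
crosswalk = pd.DataFrame({
    "RegionName": ["aberdeenharfordmd", "aberdeenbinghamid"],
    "CountyFIPS": ["00001", "00002"],
    "CBSACode": ["10001", "10002"],
})

merged = series.merge(
    crosswalk,
    on="RegionName",
    how="left",
    validate="many_to_one",  # each region should appear once in the crosswalk
    indicator=True,          # adds a _merge column describing match status
)

unmatched = merged.loc[merged["_merge"] == "left_only", "RegionName"].tolist()
print(unmatched)
```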

Suppression Rules: To ensure the reliability of reported values, the Zillow Economic Research team applies suppression rules triggered by low sample sizes and excessive volatility. These rules are customized to the metric and region type and explain most missingness found in the provided datasets.
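The source does not state the actual thresholds, so the following is only a sketch of what a suppression rule of this kind might look like. The column names, the minimum sample size, and the volatility cap are all invented for illustration.

```python
import pandas as pd

# Invented monthly observations for one region; none of this is Zillow data.
obs = pd.DataFrame({
    "median_price": [200.0, 205.0, 400.0, 210.0],
    "sample_size": [50, 3, 40, 45],
})

MIN_SAMPLE = 10           # hypothetical minimum number of observations
MAX_MONTHLY_CHANGE = 0.5  # hypothetical cap on month-over-month change

# Flag months with too few observations.
too_small = obs["sample_size"] < MIN_SAMPLE

# Flag months whose relative change from the prior month is too large.
monthly_change = obs["median_price"].diff() / obs["median_price"].shift()
too_volatile = monthly_change.abs() > MAX_MONTHLY_CHANGE

# Suppressed values become NaN, which is how they surface as missing data.
obs["reported"] = obs["median_price"].mask(too_small | too_volatile)
print(obs)
```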

#### Additional Data Products

The following data products and more are available for free download exclusively at Zillow.com/Data:

- Zillow Home Value Forecast
- Zillow Rent Forecast
- Negative Equity (the share of mortgaged properties worth less than mortgage balance)
- Zillow Home Price Expectations Survey
- Zillow Housing Aspirations Report
- Zillow Rising Sea Levels Research
- Cash Buyers Time Series
- Buy vs. Rent Breakeven Horizon
- Mortgage Affordability, Rental Affordability, Price-to-Income Ratio
- Conventional 30-year Fixed Mortgage Rate, Weekly Time Series
- Jumbo 30-year Fixed Mortgage Rates, Weekly Time Series

### Acknowledgements
The mission of the Zillow Economic Research Team is to be the most open, authoritative source for timely and accurate housing data and unbiased insight. We aim to empower consumers, industry professionals, policy makers and researchers looking to better understand the housing market.

To see more of our mission in action, we invite you to learn more about us and to check out our collection of research briefs, stories, data tools and past presentations at https://www.zillow.com/research/

### Inspiration

Zillow, and the Zillow Economic Research Team, firmly believe that not only do data want to be free, data are going to be free. Instead of simply publishing raw data, we believe in the power of pushing data up the ladder from raw data bits, to actionable information and finally to unique insight. We aim to answer questions of all kinds, even questions our users may not have known they had before coming to us. When done right, we firmly believe this process of turning data into insight can be transformational in people's lives.

Please join us on this journey, and we're excited to see what insights you can discover hidden amongst our data!

## 2. Dataset exploration

### 2.1. Loading

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
# from sklearn.model_selection import cross_val_score
# from sklearn.model_selection import train_test_split
import seaborn as sns
# import statsmodels.api as sm
# from sklearn.linear_model import LogisticRegression
# from sklearn.linear_model import Lasso

import statsmodels.formula.api as smf

%matplotlib inline



### 2.2. Data cleaning and feature selection (human approach)

In [7]:
df = pd.read_csv("../datasets/zecon/City_time_series.csv")
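Each row carries a `Date` string such as `1996-04-30`, so it may help to parse that column at load time. The sketch below uses a small in-memory CSV, with only three of the file's columns, so that it runs without the data file. For the real file, the same `parse_dates` argument would be passed alongside the path.

```python
import io
import pandas as pd

# Tiny stand-in for City_time_series.csv so the example is self-contained.
csv_text = """Date,RegionName,ZHVI_TopTier
1996-04-30,abbottstownadamspa,108700.0
1996-04-30,aberdeenbinghamid,168400.0
1996-04-30,aberdeenharfordmd,147900.0
"""

# parse_dates converts the Date column to datetime64 while reading.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(sample.dtypes)
print(sample["Date"].dt.year.unique())
```

With a datetime column in place, filtering by period (for example, keeping only rows from 2012 onward) becomes a simple comparison on `Date`.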

In [8]:
df.head(n=3)

Unnamed: 0,Date,RegionName,InventorySeasonallyAdjusted_AllHomes,InventoryRaw_AllHomes,MedianListingPricePerSqft_1Bedroom,MedianListingPricePerSqft_2Bedroom,MedianListingPricePerSqft_3Bedroom,MedianListingPricePerSqft_4Bedroom,MedianListingPricePerSqft_5BedroomOrMore,MedianListingPricePerSqft_AllHomes,...,ZHVI_BottomTier,ZHVI_CondoCoop,ZHVI_MiddleTier,ZHVI_SingleFamilyResidence,ZHVI_TopTier,ZRI_AllHomes,ZRI_AllHomesPlusMultifamily,ZriPerSqft_AllHomes,Zri_MultiFamilyResidenceRental,Zri_SingleFamilyResidenceRental
0,1996-04-30,abbottstownadamspa,,,,,,,,,...,,,,,108700.0,,,,,
1,1996-04-30,aberdeenbinghamid,,,,,,,,,...,,,,,168400.0,,,,,
2,1996-04-30,aberdeenharfordmd,,,,,,,,,...,81300.0,137900.0,109600.0,108600.0,147900.0,,,,,
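The preview shows that most metrics are empty for the earliest dates. One way to quantify this is the share of missing values per column, sketched here on a toy frame that mirrors a few columns of the three preview rows.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring part of the preview; NaN marks a missing value.
preview = pd.DataFrame({
    "RegionName": ["abbottstownadamspa", "aberdeenbinghamid", "aberdeenharfordmd"],
    "ZHVI_BottomTier": [np.nan, np.nan, 81300.0],
    "ZHVI_TopTier": [108700.0, 168400.0, 147900.0],
    "ZRI_AllHomes": [np.nan, np.nan, np.nan],
})

# isna() gives a boolean frame; its column means are the missing shares.
missing_share = preview.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the full frame, the same expression with `df` in place of `preview` would rank every column by how sparse it is.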


In [4]:
df.columns

Index(['Date', 'RegionName', 'InventorySeasonallyAdjusted_AllHomes',
       'InventoryRaw_AllHomes', 'MedianListingPricePerSqft_1Bedroom',
       'MedianListingPricePerSqft_2Bedroom',
       'MedianListingPricePerSqft_3Bedroom',
       'MedianListingPricePerSqft_4Bedroom',
       'MedianListingPricePerSqft_5BedroomOrMore',
       'MedianListingPricePerSqft_AllHomes',
       'MedianListingPricePerSqft_CondoCoop',
       'MedianListingPricePerSqft_DuplexTriplex',
       'MedianListingPricePerSqft_SingleFamilyResidence',
       'MedianListingPrice_1Bedroom', 'MedianListingPrice_2Bedroom',
       'MedianListingPrice_3Bedroom', 'MedianListingPrice_4Bedroom',
       'MedianListingPrice_5BedroomOrMore', 'MedianListingPrice_AllHomes',
       'MedianListingPrice_CondoCoop', 'MedianListingPrice_DuplexTriplex',
       'MedianListingPrice_SingleFamilyResidence',
       'MedianPctOfPriceReduction_AllHomes',
       'MedianPctOfPriceReduction_CondoCoop',
       'MedianPctOfPriceReduction_SingleFamily

In [5]:
df_features = df[[
    'Date', 'RegionName',
    'InventoryRaw_AllHomes',
    'MedianListingPricePerSqft_1Bedroom',
    'MedianListingPricePerSqft_2Bedroom',
    'MedianListingPricePerSqft_3Bedroom',
    'MedianListingPricePerSqft_4Bedroom',
    'MedianListingPricePerSqft_5BedroomOrMore',
    'MedianListingPricePerSqft_AllHomes',
    'MedianListingPricePerSqft_CondoCoop',
    'MedianListingPricePerSqft_DuplexTriplex',
    'MedianListingPricePerSqft_SingleFamilyResidence',
    'MedianRentalPricePerSqft_1Bedroom',
    'MedianRentalPricePerSqft_2Bedroom',
    'MedianRentalPricePerSqft_3Bedroom',
    'MedianRentalPricePerSqft_4Bedroom',
    'MedianRentalPricePerSqft_5BedroomOrMore',
    'MedianRentalPricePerSqft_AllHomes',
    'MedianRentalPricePerSqft_CondoCoop',
    'MedianRentalPricePerSqft_DuplexTriplex',
    'MedianRentalPricePerSqft_MultiFamilyResidence5PlusUnits',
    'MedianRentalPricePerSqft_SingleFamilyResidence',
    'MedianRentalPricePerSqft_Studio',
]]
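A selection like this can also be written by matching column-name prefixes. The following is a sketch on a toy frame, not the notebook's original code. On the full dataset a prefix match picks up every column sharing the prefix, so the result should be checked against the explicit list.

```python
import pandas as pd

# Empty toy frame carrying a handful of column names from the dataset.
toy = pd.DataFrame(columns=[
    "Date", "RegionName", "InventoryRaw_AllHomes",
    "MedianListingPricePerSqft_AllHomes", "MedianListingPrice_AllHomes",
    "MedianRentalPricePerSqft_Studio", "ZHVI_TopTier",
])

# Identifier columns are listed explicitly; metrics are matched by prefix.
prefixes = ("MedianListingPricePerSqft_", "MedianRentalPricePerSqft_")
keep = ["Date", "RegionName", "InventoryRaw_AllHomes"] + [
    col for col in toy.columns if col.startswith(prefixes)
]

toy_features = toy[keep]
print(list(toy_features.columns))
```

The trailing underscore in each prefix matters: without it, `MedianListingPrice` would also match the per-square-foot columns' shorter siblings.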

    
    

In [6]:
df_features.describe()

Unnamed: 0,InventoryRaw_AllHomes,MedianListingPricePerSqft_1Bedroom,MedianListingPricePerSqft_2Bedroom,MedianListingPricePerSqft_3Bedroom,MedianListingPricePerSqft_4Bedroom,MedianListingPricePerSqft_5BedroomOrMore,MedianListingPricePerSqft_AllHomes,MedianListingPricePerSqft_CondoCoop,MedianListingPricePerSqft_DuplexTriplex,MedianListingPricePerSqft_SingleFamilyResidence,...,MedianRentalPricePerSqft_2Bedroom,MedianRentalPricePerSqft_3Bedroom,MedianRentalPricePerSqft_4Bedroom,MedianRentalPricePerSqft_5BedroomOrMore,MedianRentalPricePerSqft_AllHomes,MedianRentalPricePerSqft_CondoCoop,MedianRentalPricePerSqft_DuplexTriplex,MedianRentalPricePerSqft_MultiFamilyResidence5PlusUnits,MedianRentalPricePerSqft_SingleFamilyResidence,MedianRentalPricePerSqft_Studio
count,40940.0,430.0,3881.0,13309.0,7612.0,1963.0,29337.0,2945.0,409.0,28256.0,...,1699.0,1832.0,526.0,58.0,3978.0,880.0,304.0,2284.0,3288.0,935.0
mean,165.13615,262.501879,172.717092,145.208204,154.774094,195.538625,151.305542,213.595639,108.472119,149.500054,...,1.3764,1.059724,0.944972,0.825126,1.115962,1.583022,1.395043,1.44364,1.007844,1.109478
std,490.145031,166.056221,123.945307,93.114383,99.649057,162.262648,101.152842,144.276312,107.807656,101.046032,...,0.60402,0.463674,0.391578,0.34398,0.643622,0.66128,0.700599,0.596103,0.593224,0.634485
min,4.0,29.471545,22.4846,24.211165,21.827555,18.331015,22.727273,36.585366,16.927083,22.921607,...,0.557292,0.427003,0.413532,0.346246,0.447967,0.585938,0.331255,0.486667,0.445765,0.446519
25%,37.0,154.092329,96.839546,93.233683,100.061117,104.058934,93.560815,120.904836,43.853659,92.944755,...,0.942262,0.766458,0.718746,0.646423,0.774515,1.13071,0.885193,1.010305,0.745386,0.738022
50%,72.0,217.322725,135.183737,120.192308,127.022032,144.11315,123.915145,169.357819,72.392574,122.627353,...,1.229206,0.904412,0.81684,0.747363,0.935999,1.422801,1.210227,1.292227,0.871258,0.907716
75%,153.0,323.50597,201.893939,164.078283,171.905536,224.450557,171.671722,262.5,132.654605,170.39673,...,1.64222,1.198344,1.071916,0.96121,1.26751,1.884919,1.699926,1.727779,1.13435,1.243386
max,29265.0,1097.262273,1228.127025,1744.693225,1279.967493,1284.443417,1558.66266,1423.141399,710.461285,1558.66266,...,4.347826,3.820086,3.549254,1.814512,17.0,4.774898,4.5,4.732408,17.0,5.263336
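In the summary, the maximum of 17.0 for `MedianRentalPricePerSqft_AllHomes` sits far above its 75th percentile (about 1.27), which may point to outliers or data errors. A common rule of thumb, which the source does not prescribe, is to flag values beyond 1.5 times the interquartile range. The sketch below applies it to invented values.

```python
import pandas as pd

# Invented rental price-per-sqft values with one extreme entry.
rent_per_sqft = pd.Series([0.77, 0.85, 0.94, 1.10, 1.27, 17.0])

# Quartiles and the upper fence of the 1.5 * IQR rule.
q1, q3 = rent_per_sqft.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Values above the fence are candidates for closer inspection, not deletion.
outliers = rent_per_sqft[rent_per_sqft > upper_fence]
print(outliers.tolist())
```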
