# Data Cleaning - Zillow Data

## Introduction

In this notebook, I clean rental price data from [Zillow](https://www.zillow.com/research/data/). The cleaned data will supplement my analysis of Airbnb listings in the San Francisco area.

I intend to use it to compare the cost of rent in SF with the cost of renting an average Airbnb for a month.

**Read in necessary libraries**

In [5]:
#Read in libraries
import dask.dataframe as dd
import swifter

import pandas as pd

import re

import numpy as np
from scipy import stats

**Settings for Notebook**

In [6]:
#Increase number of columns and rows displayed by Pandas
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows',100)

#Suppress future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

**Read in Data**

In [7]:
#Set path to Zillow Data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\01_Raw\Zillow Raw Data\Zip_MedianRentalPricePerSqft_AllHomes.csv'

#Read in Zillow data
zillow = pd.read_csv(path, header=1)

## Data Preview

In [8]:
#Print shape of zillow data
print('Original zillow data shape:', zillow.shape)

#Preview Zillow data
zillow.head()

Original zillow data shape: (6647, 123)


Unnamed: 0,RegionName,City,State,Metro,CountyName,SizeRank,2010-02,2010-03,2010-04,2010-05,2010-06,2010-07,2010-08,2010-09,2010-10,2010-11,2010-12,2011-01,2011-02,2011-03,2011-04,2011-05,2011-06,2011-07,2011-08,2011-09,2011-10,2011-11,2011-12,2012-01,2012-02,2012-03,2012-04,2012-05,2012-06,2012-07,2012-08,2012-09,2012-10,2012-11,2012-12,2013-01,2013-02,2013-03,2013-04,2013-05,2013-06,2013-07,2013-08,2013-09,2013-10,2013-11,2013-12,2014-01,2014-02,2014-03,2014-04,2014-05,2014-06,2014-07,2014-08,2014-09,2014-10,2014-11,2014-12,2015-01,2015-02,2015-03,2015-04,2015-05,2015-06,2015-07,2015-08,2015-09,2015-10,2015-11,2015-12,2016-01,2016-02,2016-03,2016-04,2016-05,2016-06,2016-07,2016-08,2016-09,2016-10,2016-11,2016-12,2017-01,2017-02,2017-03,2017-04,2017-05,2017-06,2017-07,2017-08,2017-09,2017-10,2017-11,2017-12,2018-01,2018-02,2018-03,2018-04,2018-05,2018-06,2018-07,2018-08,2018-09,2018-10,2018-11,2018-12,2019-01,2019-02,2019-03,2019-04,2019-05,2019-06,2019-07,2019-08,2019-09,2019-10
0,10025,New York,NY,New York-Newark-Jersey City,New York County,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4.76058583563554,4.75926225216817,4.72836850649351,4.64949307541222,4.59623430962343,4.50281425891182,4.50281425891182,4.56317204301075,4.52049184484927,4.60216891910046,4.57918682840643,4.60268948655257,4.75888324873096,4.74417376722262,4.74500577007049,4.64285714285714,4.72461414875685,4.77064220183486,4.78945035460993,4.71783655135425,4.75638994265917,4.79683972911964,4.87218045112782,4.81597038398057,4.81729197601206,4.84934472934473,4.85578965282505,4.88888888888889,4.88431876606684,4.75974025974026,4.84488975992161,4.85626373626374,4.85976408912189,4.82315112540193,4.78151921802518,4.86759581881533,4.75926225216817,4.85626373626374,4.91803278688525,4.81831395348837,4.85538461538462,4.82815057283142,4.83857949959645,4.82274902131361,4.85538461538462,4.83058813526497,4.84988452655889,4.85714285714286,4.85714285714286,4.88251470430332,4.85714285714286,4.85443223443223,4.62277331470485,4.58981229127481,4.63517409625194,4.69043151969981,4.68439704639961,4.71236247828605,4.60216891910046,4.70994790270705,4.72727272727273,4.74137931034483,4.84530301829993,4.74285342704264,4.75964125560538,4.79166666666667,4.79583333333333,4.75888324873096,4.77821282251545,4.73678137340376,4.71544715447155,4.75926225216817,4.79166666666667
1,10023,New York,NY,New York-Newark-Jersey City,New York County,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.00853816012112,5.1244167962675,5.196,5.22210357210357,5.28953229398664,5.21428571428571,5.2871826926416,5.35114465476605,5.5,5.45755237045204,5.46099290780142,5.48961424332344,5.49480712166172,5.45255949114999,5.46099290780142,5.43564572982298,5.39325842696629,5.39325842696629,5.44722417062843,5.50358324145535,5.5,5.57963163596966,5.52411657831492,5.47795165718916,5.47645125958379,5.56328233657858,5.56328233657858,5.5,5.49450549450549,5.5,5.46099290780142,5.41125541125541,5.41272570937231,5.41666666666667,5.37636761487965,5.35294117647059,5.33768656716418,5.26853718637993,5.06445672191529,5.41907514450867,5.41760299625468,5.37636761487965,5.50280898876404,5.52238805970149,5.45901639344262,5.45901639344262,5.43929795202805,5.37117903930131,5.41041440135315,5.34303416074707,5.34303416074707,5.37636761487965,5.13106433033714,5.07368421052632,5.23076923076923,5.31111441995897,5.36965305736835,5.33333333333333,5.18933784222572,5.20385050962627,5.34402852049911,5.46099290780142,5.35294117647059,5.37636761487965,5.24433849821216,5.23936525506212,5.35489285296069,5.37636761487965,5.34164304589811,5.34124629080119,5.39351851851852,5.33333333333333,5.32819932914406
2,77494,Katy,TX,Houston-The Woodlands-Sugar Land,Harris County,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.935449360073317,0.949538795442214,0.914634146341463,0.903023164507263,0.935510041411213,1.00414903973043,0.982042648709315,0.915948275862069,0.925001457480993,0.928812741312741,0.91628940913354,0.920132499079867,0.919618528610354,0.922916814914965,0.924884432216527,0.940017905102954,0.927235306810071,0.914494741655235,0.914703912456163,0.892230397036487,0.882240122746452,0.907391046354033,0.911686306467552,0.906612327181635,0.910931174089069,0.921464243020099,0.90844399900453,0.923443550789395,0.928411995925114,0.910273081924577,0.902668759811617,0.895346825903771,0.900558071141,0.90293453724605,0.889328063241107,0.886766712141883,0.874514158800666,0.877413920378329,0.882296531961461,0.889248181083266,0.873743993010048,0.860832137733142,0.832551594746717,0.814219788210597,0.840034194427377,0.829737848318178,0.83011909092934,0.842234779438571,0.845590030544164,0.844414944364788,0.860165759912808,0.874172185430464,0.859758820902188,0.850277264325323,0.857796416317194,0.862349562827356,0.887784090909091,0.894854710533936,0.883838383838384,0.879404617253949,0.878774222622803,0.888324873096447,0.893186610437533,0.87373020721072,0.85978835978836,0.850462370273133,0.838217697820064,0.843280247977669,0.847439083381078,0.87110826593386,0.859598853868195,0.878734622144112,0.876372711436468,0.875818228355453,0.869565217391304,0.878677431475571,0.869985500241663,0.865800910746434,0.859745382633485,0.865133917990045
3,77449,Katy,TX,Houston-The Woodlands-Sugar Land,Harris County,4,,,,,,,,,,,,0.643086816720257,0.682687870780422,0.704924026777126,0.662983425414365,0.678969688593396,0.708128078817734,0.714285714285714,0.708128078817734,0.700218839092689,0.68358724875176,0.688706540814875,0.701772479292492,0.695127561142578,0.702634192535107,0.730034689149299,0.698563037984365,0.688073394495413,0.719345894867877,0.717828159216798,0.698312371133711,0.685684147076552,0.727784728161808,0.702009895000029,0.71720620777584,0.721784776902887,0.776892430278884,0.739096573208723,0.740740740740741,0.733805668016194,0.740740740740741,0.766189879467293,0.754884547069272,0.732535986934065,0.712927756653992,0.714540448232051,0.722492290560219,0.722492290560219,0.733794604798441,0.734338412241271,0.743541294982617,0.766650694777192,0.789108166796505,0.784698381559588,0.767690253671562,0.767953507679535,0.792368908643575,0.781889519293195,0.795513823682838,0.795513823682838,0.798611111111111,0.795396727131311,0.795513823682838,0.80222739146312,0.809802933686258,0.817941952506596,0.8169828456895,0.798371947401378,0.784982935153584,0.792811839323467,0.807626636311895,0.805871779508688,0.814026299311209,0.822755054066761,0.806577341497801,0.826617099761266,0.80546265328874,0.784963259868489,0.805035128805621,0.795909662821283,0.81735609041823,0.803673938002296,0.793864370290635,0.798001256947305,0.796703296703297,0.801282051282051,0.827991452991453,0.83955223880597,0.829875518672199,0.829875518672199,0.81731065223548,0.819664853268317,0.815981992121553,0.834812955983634,0.820870535714286,0.814332247557003,0.834648558134144,0.841701122268163,0.833129185040013,0.829142712840913,0.837182448036952,0.838579672405331,0.833754634310752,0.826832151300236,0.827205882352941,0.831096880958377,0.843644544431946,0.845955363443896,0.843644544431946,0.848219032429559,0.858369098712446,0.874191373958845,0.876655052264808,0.868952559887271,0.863874345549738,0.855809128630705,0.870863078527312
4,77084,Houston,TX,Houston-The Woodlands-Sugar Land,Harris County,5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.918635170603675,0.801060718328926,0.722687969036529,0.727617842454919,0.737573505072337,0.736648250460405,0.739476678043231,0.729483282674772,0.736648250460405,0.777281712264353,0.788247939806521,0.783072129845835,0.771461716937355,0.764934106294109,0.763550056505283,0.769563550040559,0.76904296875,0.774082568807339,0.772781097396592,0.773718924403856,0.780074105685274,0.790572792362768,0.787124153593369,0.790273556231003,0.807840646776111,0.791602439514778,0.786278081360049,0.782254607801114,0.79764306676449,0.795334040296925,0.780380011135857,0.772545094780529,0.778036169297264,0.772849462365591,0.771963824289406,0.788378945429911,0.802615933412604,0.792063079629037,0.817720171854265,0.802752293577982,0.783169533169533,0.785705529424258,0.784879148240032,0.799142249051115,0.796930342384888,0.795454545454545,0.797837164974521,0.806858105323403,0.808314087759815,0.815850815850816,0.802888700084962,0.809299587992937,0.827070159132761,0.834617317448971,0.823691460055096,0.828803143557775,0.833972765326236,0.833743236596163,0.826808152961992,0.824145534729879,0.826411218606194,0.823629717102886,0.810564342796146,0.815789473684211,0.818661971830986,0.81842598098043,0.834748568666065,0.839749264788188,0.845860880272254,0.863625952729232,0.843222985633979,0.846242111302352,0.842245505331108,0.848329048843188,0.840685583105551


**Convert zillow data into a tidy dataset**

In [9]:
#Set columns for melt
id_vars = list(zillow.loc[:,:'SizeRank'].columns.values)
value_vars = list(zillow.iloc[:,6:].columns.values)

#Melt zillow. Create a Date and Price_SqrFt column
zillow = zillow.melt(id_vars= id_vars,value_vars= value_vars, var_name='Date', value_name= 'Price_SqrFt' )

#Print updated shape and dtypes of zillow data
print('Updated zillow data shape:',zillow.shape)
print('Zillow data types: \n', zillow.dtypes)

Updated zillow data shape: (777699, 8)
Zillow data types: 
 RegionName     object
City           object
State          object
Metro          object
CountyName     object
SizeRank       object
Date           object
Price_SqrFt    object
dtype: object
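
The wide-to-long reshape above can be illustrated on a toy frame. This is a minimal sketch, not the notebook's actual data: the two rows and two month columns are made up for illustration (values rounded from the preview), and only `RegionName` and `City` are kept as identifier columns.

```python
import pandas as pd

# Toy wide table: one row per ZIP code, one column per month (illustrative values)
wide = pd.DataFrame({
    "RegionName": [10025, 77494],
    "City": ["New York", "Katy"],
    "2019-09": [4.76, 0.86],
    "2019-10": [4.79, 0.87],
})

# Identifier columns stay as they are; every other column becomes a (Date, Price_SqrFt) pair
id_vars = ["RegionName", "City"]
value_vars = [c for c in wide.columns if c not in id_vars]

tidy = wide.melt(id_vars=id_vars, value_vars=value_vars,
                 var_name="Date", value_name="Price_SqrFt")

# Rows after melting = original rows x number of melted columns
expected_rows = len(wide) * len(value_vars)
print(tidy.shape, expected_rows)
```

The same arithmetic applies to the full dataset: 6647 rows times 117 month columns gives the 777699 rows reported above.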


There appear to be some string values in RegionName, SizeRank, Date, and Price_SqrFt that we will need to isolate and remove.
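
Before dropping anything, it can help to look at which values fail to parse as numbers. The sketch below uses a small made-up frame; the assumption that the stray strings are repeated header rows is hypothetical and should be checked against the real file.

```python
import pandas as pd

# Hypothetical sample: a header row repeated inside the data (assumption for illustration)
sample = pd.DataFrame({
    "RegionName": ["10025", "RegionName", "77494"],
    "SizeRank": ["1", "SizeRank", "3"],
})

# errors='coerce' turns anything that is not a number into NaN
as_number = pd.to_numeric(sample["RegionName"], errors="coerce")

# Rows that were not missing originally but could not be parsed are the offenders
bad_rows = sample[as_number.isna() & sample["RegionName"].notna()]
print(bad_rows)
```

Inspecting `bad_rows` first makes it easier to confirm that only junk rows are removed in the next step.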

In [10]:
#Drop rows whose RegionName contains letters (na=False treats missing values as non-matches)
zillow.drop(zillow[zillow['RegionName'].str.contains('[A-Za-z]', na=False)].index,inplace=True)

#Create list of cols to convert into numeric
cols = ['RegionName', 'SizeRank', 'Price_SqrFt']

#Set data types
zillow.Date= zillow.Date.astype('datetime64[ns]')
zillow[cols] = zillow[cols].apply(pd.to_numeric, errors='coerce')

In [11]:
#Print updated shape and dtypes of zillow data
print('Updated zillow data shape:',zillow.shape)
print('Zillow data types: \n', zillow.dtypes)

Updated zillow data shape: (777465, 8)
Zillow data types: 
 RegionName            float64
City                   object
State                  object
Metro                  object
CountyName             object
SizeRank              float64
Date           datetime64[ns]
Price_SqrFt           float64
dtype: object


**Zillow Metrics**

In [12]:
#Describe zillow
display(zillow.describe())

| | RegionName | SizeRank | Price_SqrFt |
|---|---|---|---|
| count | 777348.0 | 777348.0 | 385366.0 |
| mean | 55429.895846 | 1661.414509 | 1.299695 |
| std | 29615.932146 | 958.969772 | 1.037336 |
| min | 1545.0 | 1.0 | 0.111453 |
| 25% | 30329.0 | 831.0 | 0.808539 |
| 50% | 52245.5 | 1661.5 | 1.007005 |
| 75% | 85013.0 | 2492.0 | 1.430175 |
| max | 99705.0 | 3322.0 | 23.969319 |
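
The minimum RegionName of 1545.0 in the summary above suggests that ZIP codes starting with a zero lose that digit once the column is cast to float. If the ZIP code is needed later as a join key, one option is to rebuild a five-character string. This is a sketch under that assumption; the `zip_str` name is hypothetical and not part of the notebook.

```python
import pandas as pd

# Illustrative values: a ZIP that originally had a leading zero, a regular ZIP, and a missing one
region = pd.Series([1545.0, 10025.0, None], name="RegionName")

# Nullable Int64 keeps missing values, then pad the digits back to five characters
zip_str = region.astype("Int64").astype("string").str.zfill(5)
print(zip_str.tolist())
```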


In [13]:
#Zillow variance (numeric columns only; text and date columns have no variance)
print('Variance:\n', zillow.var(axis=0, numeric_only=True))

Variance:
 RegionName     8.771034e+08
SizeRank       9.196230e+05
Price_SqrFt    1.076066e+00
dtype: float64
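
As a quick consistency check, the variance should equal the square of the standard deviation reported by `describe()`. The sketch below shows this on a made-up frame and then on the Price_SqrFt figures printed above. Passing `numeric_only=True` is an assumption about the installed pandas version: recent releases need it when text columns are present.

```python
import math
import pandas as pd

# Toy frame with one text column and one numeric column (illustrative values)
toy = pd.DataFrame({
    "City": ["A", "B", "C", "D"],
    "Price_SqrFt": [0.8, 1.0, 1.4, 2.0],
})

# numeric_only=True skips the text column instead of raising on it
variance = toy.var(axis=0, numeric_only=True)
std = toy.std(axis=0, numeric_only=True)
toy_consistent = math.isclose(variance["Price_SqrFt"], std["Price_SqrFt"] ** 2)

# Same check against the Price_SqrFt values printed in the notebook output
reported_std = 1.037336
reported_var = 1.076066
reported_consistent = math.isclose(reported_std ** 2, reported_var, rel_tol=1e-5)
print(toy_consistent, reported_consistent)
```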


## Export Cleaned Data

In [14]:
#Set path to export cleaned zillow data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\02_Intermediate\12_23_2019_Zillow_Cleaned.csv'

#Write file
zillow.to_csv(path, sep=',')
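
Writing with the default settings also saves the DataFrame index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below compares both options on a made-up frame, using an in-memory buffer instead of a real path. Whether to drop the index is a choice that depends on how later notebooks read the file.

```python
import io
import pandas as pd

# Small frame with a non-contiguous index, as left behind after dropping rows (illustrative)
frame = pd.DataFrame(
    {"RegionName": [10025, 77494], "Price_SqrFt": [4.79, 0.87]},
    index=[0, 5],
)

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reread_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reread_without_index = pd.read_csv(without_index)

print(list(reread_with_index.columns))
print(list(reread_without_index.columns))
```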