# Data Cleaning - Zillow Data

## Introduction

In this notebook, I clean an aggregated dataset of Zillow median rents across the United States. The aggregation covers 03/2010 through 11/2019.

The aggregation source code can be found [here](https://github.com/KishenSharma6/Airbnb-SF_ML_-_Text_Analysis/blob/master/Project%20Codes/01.%20Raw%20Data%20Aggregation%20Scripts/12_28_2019_Zillow_Raw_Data_Aggregation.ipynb).

The raw data can be found [here](https://github.com/KishenSharma6/Airbnb-SF_ML_-_Text_Analysis/tree/master/Data/01_Raw/Zillow%20Raw%20Data).

**Read in necessary libraries**

In [89]:
#Read in libraries
import pandas as pd

import re

import numpy as np

**Settings for Notebook**

In [90]:
#Increase number of columns and rows displayed by Pandas
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows',100)

#Suppress future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

**Read in Data**

In [91]:
#Set path to Zillow Data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\SF Airbnb Raw Data - Aggregated\12_29_2019_Zillow_Raw_Aggregated.csv'

#Read in Zillow data
zillow = pd.read_csv(path,index_col=0 )

#List the identifier columns that should come first in zillow
cols_to_order=['City', 'CountyName', 'Metro', 'RegionName', 'State', 'SizeRank', 'Beds']

#Set new column order
new_columns = cols_to_order + (zillow.columns.drop(cols_to_order).tolist())

#Update zillow
zillow = zillow[new_columns]

#Rename some columns for clarity
zillow = zillow.rename(columns= {'Beds':'Bedrooms',
                                 'CountyName':'County',
                                'RegionName':'Zip'})

#Remove duplicates
zillow.drop_duplicates(inplace = True)
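
The reorder, rename, and de-duplicate steps above can be sketched on a toy frame. The values below are made up and only stand in for the raw Zillow export; the column names mirror a subset of the real ones.

```python
import pandas as pd

# Toy frame standing in for the raw export (made-up values, two identical rows)
toy = pd.DataFrame({
    "2010-03": [1000.0, 1000.0],
    "RegionName": [10025, 10025],
    "City": ["New York", "New York"],
    "Beds": [1, 1],
})

# Identifier columns go first; the remaining (date) columns keep their order
lead_cols = ["City", "RegionName", "Beds"]
ordered = lead_cols + toy.columns.drop(lead_cols).tolist()

# Reorder, then rename for clarity
toy = toy[ordered].rename(columns={"Beds": "Bedrooms", "RegionName": "Zip"})

# Exact duplicate rows collapse to a single row
toy = toy.drop_duplicates()

print(toy.columns.tolist())
print(len(toy))
```

Note that `drop_duplicates` only removes rows that match on every column, so two rows for the same zip code with different rents would both be kept.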

## Data Preview

In [92]:
#Print shape of zillow data
print('Original zillow data shape:', zillow.shape)

#Preview zillow data
zillow.head()

Original zillow data shape: (6041, 124)


Unnamed: 0,City,County,Metro,Zip,State,SizeRank,Bedrooms,2010-03,2010-04,2010-05,2010-06,2010-07,2010-08,2010-09,2010-10,2010-11,2010-12,2011-01,2011-02,2011-03,2011-04,2011-05,2011-06,2011-07,2011-08,2011-09,2011-10,2011-11,2011-12,2012-01,2012-02,2012-03,2012-04,2012-05,2012-06,2012-07,2012-08,2012-09,2012-10,2012-11,2012-12,2013-01,2013-02,2013-03,2013-04,2013-05,2013-06,2013-07,2013-08,2013-09,2013-10,2013-11,2013-12,2014-01,2014-02,2014-03,2014-04,2014-05,2014-06,2014-07,2014-08,2014-09,2014-10,2014-11,2014-12,2015-01,2015-02,2015-03,2015-04,2015-05,2015-06,2015-07,2015-08,2015-09,2015-10,2015-11,2015-12,2016-01,2016-02,2016-03,2016-04,2016-05,2016-06,2016-07,2016-08,2016-09,2016-10,2016-11,2016-12,2017-01,2017-02,2017-03,2017-04,2017-05,2017-06,2017-07,2017-08,2017-09,2017-10,2017-11,2017-12,2018-01,2018-02,2018-03,2018-04,2018-05,2018-06,2018-07,2018-08,2018-09,2018-10,2018-11,2018-12,2019-01,2019-02,2019-03,2019-04,2019-05,2019-06,2019-07,2019-08,2019-09,2019-10,2019-11
0,New York,New York County,New York-Newark-Jersey City,10025,NY,1,1,,,,,,,,2600.0,2689.0,2678.0,2400.0,2392.5,2450.0,2495.0,2550.0,2550.0,2500.0,2495.0,2395.0,2425.0,2595.0,2640.0,2500.0,2500.0,2500.0,2550.0,2590.0,2595.0,2625.0,2697.0,2700.0,2800.0,2850.0,2750.0,2732.5,2695.0,2695.0,2730.0,2800.0,2809.0,2700.0,2700.0,2878.5,2800.0,2800.0,2850.0,2900.0,2900.0,2900.0,2850.0,2800.0,2750.0,3200.0,3050.0,3100.0,3075.0,3100.0,3100.0,3100.0,3088.0,3100.0,3100.0,3000.0,2962.5,2997.5,3050.0,3097.5,3050.0,3050.0,3000.0,3050.0,3050.0,3100.0,3000.0,3100.0,3050.0,3000.0,3000.0,3100.0,3050.0,3012.5,2950.0,3032.5,3000.0,3000.0,2995.0,2975.0,2900.0,2900.0,3025.0,3025.0,3000.0,2935.0,3025.0,3032.5,3040.5,2995.0,2975.0,2995.0,3000.0,2950.0,2900.0,2900.0,2950.0,3000.0,3100.0,3150.0,3195.0,3200.0,3200.0,3100.0,3150.0,3100.0,3100.0,3050.0,3100.0,3156.0
1,Chicago,Cook County,Chicago-Naperville-Elgin,60657,IL,2,1,,,,,,,,,,,1075.0,1095.0,1095.0,1080.0,1110.0,1125.0,1125.0,1095.0,1095.0,1100.0,1105.0,1125.0,1125.0,1100.0,1100.0,1125.0,1125.0,1145.0,1125.0,1095.0,1105.0,1147.5,1150.0,1175.0,1195.0,1205.0,1199.0,1200.0,1220.0,1230.0,1250.0,1270.0,1250.0,1245.0,1245.0,1245.0,1245.0,1250.0,1250.0,1250.0,1270.0,1265.0,1275.0,1275.0,1250.0,1250.0,1252.5,1255.0,1265.0,1295.0,1295.0,1300.0,1305.0,1325.0,1315.0,1295.0,1295.0,1295.0,1300.0,1312.5,1325.0,1345.0,1395.0,1395.0,1395.0,1400.0,1395.0,1375.0,1340.0,1350.0,1295.0,1295.0,1350.0,1450.0,1450.0,1450.0,1450.0,1465.0,1450.0,1420.0,1400.0,1375.0,1370.0,1386.0,1400.0,1440.0,1425.0,1445.0,1450.0,1450.0,1450.0,1400.0,1370.0,1350.0,1350.0,1395.0,1450.0,1425.0,1475.0,1490.0,1475.0,1495.0,1452.5,1425.0,1425.0,1410.0,1400.0
2,New York,New York County,New York-Newark-Jersey City,10023,NY,3,1,,,,,,,,2995.0,3025.0,3000.0,2770.0,2835.0,2900.0,3000.0,3000.0,2850.0,2900.0,2900.0,2950.0,2999.5,3100.0,3150.0,3100.0,3070.0,3000.0,3000.0,3100.0,3000.0,3000.0,3000.0,3100.0,3117.5,3100.0,3037.5,3000.0,3000.0,3150.0,3047.0,3042.0,3100.0,3109.0,3150.0,3200.0,3200.0,3100.0,3195.0,3200.0,3200.0,3200.0,3199.0,3200.0,3175.0,3295.0,3337.5,3300.0,3300.0,3350.0,3300.0,3300.0,3300.0,3350.0,3369.0,3300.0,3300.0,3337.5,3350.0,3350.0,3300.0,3300.0,3300.0,3300.0,3300.0,3300.0,3300.0,3300.0,3300.0,3100.0,3300.0,3300.0,3300.0,3300.0,3300.0,3226.0,3295.0,3250.0,3300.0,3299.5,3300.0,3300.0,3295.0,3290.5,3295.0,3300.0,3300.0,3259.0,3250.0,3225.0,3285.0,3300.0,3300.0,3300.0,3300.0,3300.0,3300.0,3295.0,3300.0,3300.0,3300.0,3300.0,3350.0,3350.0,3325.0,3297.5,3350.0,3350.0,3400.0,3391.5
3,Katy,Harris County,Houston-The Woodlands-Sugar Land,77494,TX,4,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1042.0,1058.0,1010.0,1010.0,1010.0,1055.0,1078.0,1022.0,1078.0,1088.0,1080.0,1055.0,1052.0,1050.0,1034.0,1000.0,1000.0,980.0,1032.5,1004.5,1019.5,1005.0,985.0,1021.0,1020.0,1020.0,1020.0,1063.5,1000.0,1000.0,985.0,1029.0,1039.0,1205.0,1057.0,1095.0,1104.0,1127.0,1225.0,1175.0,1082.0,1045.0,1081.0,1058.0,1119.0,1134.5,1097.0,1119.0,1175.0,1165.0,1124.0,1190.0,1202.5,1179.0,1175.0,1179.0,1175.0,1190.0,1205.0,1200.0,1178.0,1209.0,1209.0,1177.0,1245.0,1213.0,1144.0,1159.0,1159.0,1155.0,1152.5,1194.0,1236.5,1201.0,1206.0,1169.0,1157.0,1127.5
4,Chicago,Cook County,Chicago-Naperville-Elgin,60614,IL,5,1,,,,,,,,,,,1250.0,1200.0,1199.0,1175.0,1195.0,1245.0,1265.0,1260.0,1295.0,1300.0,1300.0,1375.0,1375.0,1365.0,1365.0,1350.0,1315.0,1400.0,1435.0,1485.0,1495.0,1495.0,1515.0,1545.0,1500.0,1492.5,1450.0,1450.0,1500.0,1497.5,1450.0,1445.0,1467.5,1450.0,1450.0,1550.0,1525.0,1495.0,1450.0,1435.0,1445.0,1425.0,1395.0,1395.0,1465.0,1500.0,1644.0,1500.0,1525.0,1485.0,1395.0,1475.0,1475.0,1450.0,1425.0,1435.0,1445.0,1425.0,1565.0,1575.0,1550.0,1666.0,1550.0,1525.0,1460.0,1460.0,1455.0,1525.0,1500.0,1450.0,1500.0,1500.0,1575.0,1550.0,1545.0,1495.0,1500.0,1500.0,1535.0,1551.0,1595.0,1595.0,1500.0,1545.0,1575.0,1610.0,1595.0,1600.0,1600.0,1595.0,1577.0,1595.0,1550.0,1550.0,1520.0,1600.0,1650.0,1650.0,1695.0,1675.0,1700.0,1675.0,1720.0,1700.0,1670.0,1625.0,1600.0


**Convert zillow data into a tidy dataset**

In [93]:
#Set columns for melt
id_vars = list(zillow.loc[:,:'Bedrooms'].columns.values)
value_vars = list(zillow.iloc[:,7:].columns.values)

#Melt zillow. Create a Date and a Median_Rent column
zillow = zillow.melt(id_vars= id_vars,value_vars= value_vars, var_name='Date', value_name= 'Median_Rent' )

In [94]:
#Print updated shape and data types
print('Updated Zillow data shape:',zillow.shape)
print('Original Zillow data types: \n', zillow.dtypes)

#Preview updated data
display(zillow.head())

Updated Zillow data shape: (706797, 9)
Original Zillow data types: 
 City            object
County          object
Metro           object
Zip              int64
State           object
SizeRank         int64
Bedrooms         int64
Date            object
Median_Rent    float64
dtype: object


Unnamed: 0,City,County,Metro,Zip,State,SizeRank,Bedrooms,Date,Median_Rent
0,New York,New York County,New York-Newark-Jersey City,10025,NY,1,1,2010-03,
1,Chicago,Cook County,Chicago-Naperville-Elgin,60657,IL,2,1,2010-03,
2,New York,New York County,New York-Newark-Jersey City,10023,NY,3,1,2010-03,
3,Katy,Harris County,Houston-The Woodlands-Sugar Land,77494,TX,4,1,2010-03,
4,Chicago,Cook County,Chicago-Naperville-Elgin,60614,IL,5,1,2010-03,
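
To make the wide-to-long reshape concrete, here is a minimal sketch of `melt` on a toy frame (made-up layout with two zip codes and two month columns). It also checks the row count reported above: 6,041 rows times 117 month columns (124 columns minus 7 identifier columns) gives 706,797 rows.

```python
import pandas as pd

# Toy wide frame: one row per zip code, one column per month
wide = pd.DataFrame({
    "Zip": [10025, 60657],
    "Bedrooms": [1, 1],
    "2010-03": [None, 1075.0],
    "2010-04": [2600.0, 1095.0],
})

# Every non-identifier column becomes a (Date, Median_Rent) pair
long = wide.melt(id_vars=["Zip", "Bedrooms"],
                 var_name="Date", value_name="Median_Rent")

print(long)

# Rows after melting = original rows * number of melted columns
expected_rows = 6041 * (124 - 7)
print(expected_rows)
```

Missing months are carried over as `NaN` rows rather than dropped, which is why the melted data still needs a missing-value pass.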


### Data Cleaning

**Data Type Conversion**

In [95]:
#Set date data type
zillow.Date= zillow.Date.astype('datetime64[ns]')

#Convert Date to month_year
zillow['Date'] = zillow['Date'].dt.to_period('M')

#Convert Bedrooms, SizeRank, and Zip to objects
cols = ['Bedrooms', 'SizeRank','Zip']
zillow[cols] = zillow[cols].astype(object)

#Print updated data types
print('Updated Zillow data types: \n', zillow.dtypes)

Updated Zillow data types: 
 City              object
County            object
Metro             object
Zip               object
State             object
SizeRank          object
Bedrooms          object
Date           period[M]
Median_Rent      float64
dtype: object
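
As a small illustration of the date conversion, the sketch below turns `YYYY-MM` strings into monthly periods. It uses `pd.to_datetime` with an explicit format, which is an alternative spelling of the `astype` call in the cell above, not the notebook's exact code.

```python
import pandas as pd

# First and last months covered by the aggregation
dates = pd.Series(["2010-03", "2019-11"])

# Parse to timestamps, then keep only the year-month information
periods = pd.to_datetime(dates, format="%Y-%m").dt.to_period("M")
print(periods.dtype)

# Count the months in the covered range, inclusive of both ends
n_months = len(pd.period_range(periods.iloc[0], periods.iloc[1], freq="M"))
print(n_months)
```

The inclusive month count agrees with the 117 month columns in the original wide table.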


**Missing Values**

In [96]:
#Remove rows where Median_Rent is null (copy so later column assignments do not warn)
zillow = zillow[zillow.Median_Rent.notnull()].copy()

#Print updated shape
print('Updated zillow data shape:',zillow.shape)

#Preview updated data
display(zillow.head())

Updated zillow data shape: (257971, 9)


Unnamed: 0,City,County,Metro,Zip,State,SizeRank,Bedrooms,Date,Median_Rent
4298,Virginia Beach,Virginia Beach City,Virginia Beach-Norfolk-Newport News,23462,VA,83,3,2010-03,1200.0
10339,Virginia Beach,Virginia Beach City,Virginia Beach-Norfolk-Newport News,23462,VA,83,3,2010-04,1250.0
16380,Virginia Beach,Virginia Beach City,Virginia Beach-Norfolk-Newport News,23462,VA,83,3,2010-05,1200.0
22421,Virginia Beach,Virginia Beach City,Virginia Beach-Norfolk-Newport News,23462,VA,83,3,2010-06,1250.0
28462,Virginia Beach,Virginia Beach City,Virginia Beach-Norfolk-Newport News,23462,VA,83,3,2010-07,1225.0
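
The filter above can be written in a couple of equivalent ways. The sketch below uses a toy frame (made-up values) to compare a boolean mask with `dropna`, and then computes the share of melted rows that were dropped, using the two shapes printed in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Zip": [10025, 60657, 77494],
    "Median_Rent": [np.nan, 1075.0, np.nan],
})

# Two equivalent ways to keep only rows with a rent value
by_mask = toy[toy["Median_Rent"].notnull()]
by_dropna = toy.dropna(subset=["Median_Rent"])
same_result = by_mask.equals(by_dropna)
print(same_result)

# Share of rows removed, from the shapes reported before and after filtering
rows_before, rows_after = 706797, 257971
dropped_share = 1 - rows_after / rows_before
print(round(dropped_share, 3))
```

Roughly 63.5% of the melted rows had no rent value, which is consistent with many zip codes only reporting data for later years.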


**Clean County Names**

In [97]:
#Remove the literal word 'County' from County and trim the leftover whitespace
zillow['County'] = zillow['County'].str.replace('County', '', regex=False).str.strip()
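
Whitespace handling matters when removing the suffix. The sketch below, on a few county names taken from the previews above, compares plain substring removal with an anchored regular expression that also eats the preceding space. The regex variant is an illustrative alternative, not the notebook's original code.

```python
import pandas as pd

county = pd.Series(["Cook County", "Virginia Beach City", "New York County"])

# Plain substring removal keeps the space that preceded 'County'
naive = county.str.replace("County", "", regex=False)
print(naive.tolist())

# Anchored pattern: optional whitespace plus 'County' at the end of the string
trimmed = county.str.replace(r"\s*County$", "", regex=True)
print(trimmed.tolist())
```

Names without the suffix, such as "Virginia Beach City", pass through unchanged in both versions.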

**Zillow Metrics**

In [98]:
#Describe zillow
display(zillow.describe())

Unnamed: 0,Median_Rent
count,257971.0
mean,1827.066172
std,1489.317983
min,350.0
25%,1200.0
50%,1570.0
75%,2124.0
max,50000.0


In [99]:
#Zillow variance (numeric columns only; object and period columns have no variance)
print('Variance:\n', zillow.var(axis=0, numeric_only=True))

Variance:
 Median_Rent    2.218068e+06
dtype: float64
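
As a quick consistency check, the variance should equal the square of the standard deviation reported by `describe()`, since pandas uses the sample estimator (`ddof=1`) for both by default. The sketch below verifies this on a toy series (made-up values) and on the two figures printed above.

```python
import numpy as np
import pandas as pd

# Toy series: variance and squared standard deviation agree
rent = pd.Series([1200.0, 1570.0, 2124.0, 50000.0])
toy_matches = np.isclose(rent.var(), rent.std() ** 2)
print(toy_matches)

# Figures reported in this notebook
reported_std = 1489.317983
reported_var = 2.218068e6
reported_matches = np.isclose(reported_std ** 2, reported_var, rtol=1e-6)
print(reported_matches)
```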


## Export Cleaned Data

In [100]:
#Set path to export cleaned zillow data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\02_Intermediate\12_29_2019_Zillow_Cleaned.csv'

#Write file
zillow.to_csv(path, sep=',')
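
One thing to keep in mind when reloading the exported file: CSV does not store dtypes, so `Date` comes back as plain text and `Zip` as an integer unless told otherwise. The sketch below shows a round trip through an in-memory buffer (standing in for the file on disk) with a toy frame of made-up rows. Reading `Zip` as a string is my assumption about what a downstream step would want.

```python
import io
import pandas as pd

# Toy cleaned frame with the same dtypes as the notebook's result
cleaned = pd.DataFrame({
    "Zip": pd.Series([10025, 60657], dtype=object),
    "Date": pd.to_datetime(pd.Series(["2010-10", "2011-01"]),
                           format="%Y-%m").dt.to_period("M"),
    "Median_Rent": [2600.0, 1075.0],
})

# Write to an in-memory buffer instead of a path
buffer = io.StringIO()
cleaned.to_csv(buffer, sep=",")
buffer.seek(0)

# Read back, keeping Zip as text and restoring the monthly period dtype
restored = pd.read_csv(buffer, index_col=0, dtype={"Zip": str})
restored["Date"] = pd.to_datetime(restored["Date"],
                                  format="%Y-%m").dt.to_period("M")

print(restored.dtypes)
```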