# Data Immersion Task 6.1 - Sourcing Open Data

### Table of Contents

##### 1. Historical Monthly Inventory Zip data from Realtor.com
        A. Consistency Checks
        B. Data Cleaning
        C. Basic descriptive statistical analysis
##### 2. Zillow Observed Rent Index by ZIP code
        A. Consistency Checks
        B. Data Cleaning
        C. Basic descriptive statistical analysis
##### 3. Apartment List Vacancy Index
        A. Consistency Checks
        B. Data Cleaning
        C. Basic descriptive statistical analysis
##### 4. Apartment List Rent Estimates
        A. Consistency Checks
        B. Data Cleaning
        C. Basic descriptive statistical analysis
##### 5. Zip code list acquired from major protein company facility locations
        A. Consistency Checks
        B. Data Cleaning
        C. Basic descriptive statistical analysis
        D. Creating zip code list to be used for web scraper
##### 6. Realtor.com scraped homes for sale listings for zip codes in list
        A. Web scraping script
        B. Consistency Checks
        C. Data Cleaning
        D. Basic descriptive statistical analysis
##### 7. Realtor.com scraped rental listings for zip codes in list
        A. Web scraping script
        B. Consistency Checks
        C. Data Cleaning
        D. Basic descriptive statistical analysis

In [2]:
# Importing libraries

import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
import scipy

In [3]:
# Defining path
path=r'D:\Adam\Employment\Data Analysis Course\Final Data Project'

## 1. Historical Monthly Inventory Zip data from Realtor.com

In [8]:
# Importing dataframe
df_rh = pd.read_csv(os.path.join(path, 'Data', 'Original', 'RDC_Inventory_Core_Metrics_Zip_History.csv'), index_col = False)


### 1A. Consistency Checks

In [6]:
# checking dataset info
df_rh.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2298453 entries, 0 to 2298452
Data columns (total 40 columns):
 #   Column                                   Dtype  
---  ------                                   -----  
 0   month_date_yyyymm                        object 
 1   postal_code                              object 
 2   zip_name                                 object 
 3   median_listing_price                     float64
 4   median_listing_price_mm                  float64
 5   median_listing_price_yy                  float64
 6   active_listing_count                     float64
 7   active_listing_count_mm                  float64
 8   active_listing_count_yy                  float64
 9   median_days_on_market                    float64
 10  median_days_on_market_mm                 float64
 11  median_days_on_market_yy                 float64
 12  new_listing_count                        float64
 13  new_listing_count_mm                     float64
 14  new_listing_count_

### 1B. Data Cleaning

In [9]:
# Checking for duplicates (and removing them if there are any)

df_rh.drop_duplicates(inplace=True)

In [10]:
df_rh.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2298453 entries, 0 to 2298452
Data columns (total 40 columns):
 #   Column                                   Dtype  
---  ------                                   -----  
 0   month_date_yyyymm                        object 
 1   postal_code                              object 
 2   zip_name                                 object 
 3   median_listing_price                     float64
 4   median_listing_price_mm                  float64
 5   median_listing_price_yy                  float64
 6   active_listing_count                     float64
 7   active_listing_count_mm                  float64
 8   active_listing_count_yy                  float64
 9   median_days_on_market                    float64
 10  median_days_on_market_mm                 float64
 11  median_days_on_market_yy                 float64
 12  new_listing_count                        float64
 13  new_listing_count_mm                     float64
 14  new_listing_count_

#### No duplicates were found
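Comparing the row counts from `info()` before and after works, but the duplicates can also be counted directly before anything is dropped. A minimal sketch, assuming a small toy frame (the values below are made up and only stand in for `df_rh`):

```python
import pandas as pd

# Toy frame standing in for df_rh (hypothetical values)
toy = pd.DataFrame({
    "postal_code": ["77494", "77494", "11385"],
    "median_listing_price": [350000.0, 350000.0, 799000.0],
})

# Number of rows that exactly repeat an earlier row
n_duplicates = int(toy.duplicated().sum())
print(f"{n_duplicates} duplicate row(s) found")

# Drop them; the row count should fall by the same amount
deduped = toy.drop_duplicates()
print(len(toy) - len(deduped))
```

Run against the real dataframe, a count of 0 would confirm the conclusion above without having to read the two `info()` outputs side by side.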

In [12]:
# Removing unneeded columns
# I'm removing all month-over-month (_mm) values because housing is very seasonal, so monthly changes
# aren't informative here. I'm keeping most year-over-year (_yy) values because they can help predict
# larger trends of availability in the housing market. I'm also dropping the days-on-market, price-increase,
# price-reduction, and pending-listing columns, and keeping the variables that track the number of homes,
# the cost of the homes (average & median), the square feet, and the price per square foot

df_rh_new = df_rh.drop(['active_listing_count_mm', 'median_listing_price_mm', 'median_days_on_market', 'median_days_on_market_mm', 'median_days_on_market_yy', 'new_listing_count_mm', 'price_increased_count', 'price_increased_count_mm', 'price_increased_count_yy', 'price_reduced_count', 'price_reduced_count_mm', 'price_reduced_count_yy', 'pending_listing_count', 'pending_listing_count_mm', 'pending_listing_count_yy', 'median_listing_price_per_square_foot_mm', 'median_square_feet_mm', 'average_listing_price_mm', 'total_listing_count_mm', 'pending_ratio', 'pending_ratio_mm', 'pending_ratio_yy'], axis=1)
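Since every month-over-month column ends in `_mm`, that part of the list could also be built from the column names instead of typed out. A minimal sketch on an empty toy frame; the column subset is only illustrative, and the other dropped columns (days on market, price changes, pending listings) would still need to be listed by hand:

```python
import pandas as pd

# Empty toy frame with a few of the Realtor.com column names
toy = pd.DataFrame(columns=[
    "postal_code",
    "median_listing_price",
    "median_listing_price_mm",
    "median_listing_price_yy",
    "active_listing_count",
    "active_listing_count_mm",
])

# Every column whose name ends in "_mm" (month-over-month)
mm_cols = [col for col in toy.columns if col.endswith("_mm")]

# Drop them in one call and show what is left
trimmed = toy.drop(columns=mm_cols)
print(list(trimmed.columns))
```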

In [13]:
df_rh_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2298453 entries, 0 to 2298452
Data columns (total 18 columns):
 #   Column                                   Dtype  
---  ------                                   -----  
 0   month_date_yyyymm                        object 
 1   postal_code                              object 
 2   zip_name                                 object 
 3   median_listing_price                     float64
 4   median_listing_price_yy                  float64
 5   active_listing_count                     float64
 6   active_listing_count_yy                  float64
 7   new_listing_count                        float64
 8   new_listing_count_yy                     float64
 9   median_listing_price_per_square_foot     float64
 10  median_listing_price_per_square_foot_yy  float64
 11  median_square_feet                       float64
 12  median_square_feet_yy                    float64
 13  average_listing_price                    float64
 14  average_listing_pr

In [18]:
# Checking for missing values
missing_values = df_rh_new.isnull().sum()

# total rows in dataframe
total_rows = df_rh_new.shape[0]

# calculating percentage of missing values
percent_missing = ((missing_values / total_rows) * 100).round(2)

# Making a dataframe to show missing values
missing_value_table = pd.concat([missing_values, percent_missing], axis=1)

# Naming columns for sorting
missing_value_table.columns = ['Missing Values', 'Percent Missing']

# sorting ascending by percent missing, so the columns with the most missing values come last
missing_value_table.sort_values('Percent Missing', ascending=True, inplace=True)

print(missing_value_table)

                                         Missing Values  Percent Missing
month_date_yyyymm                                     0             0.00
postal_code                                           0             0.00
total_listing_count                                3277             0.14
new_listing_count                                  3277             0.14
active_listing_count                               4728             0.21
median_listing_price                               6752             0.29
average_listing_price                              6752             0.29
median_square_feet                                26112             1.14
median_listing_price_per_square_foot              26357             1.15
zip_name                                          65838             2.86
quality_flag                                     348411            15.16
total_listing_count_yy                           448028            19.49
average_listing_price_yy                         46

A large share of the year-over-year data is missing (total_listing_count_yy alone is missing in 19.49% of rows), while none of the postal codes are missing, which is helpful.
I can't be sure why so much year-over-year data is missing. Presumably some zip codes have less historical data: a year-over-year value needs a figure from 12 months earlier, so it can't be calculated for a zip code's first year in the dataset.
Because I'm not sure I'll use the year-over-year data, I'll leave the missing values alone for now.
If I find a good use case for that data, I may still be able to use it even if some zip codes don't have long-reaching historical data.
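To illustrate why short histories would produce missing year-over-year values, here is a minimal sketch. It assumes the `_yy` columns are a fractional change versus the same month one year earlier; that is my reading of the column names and value ranges, not something confirmed by Realtor.com. The series is made up.

```python
import pandas as pd

# 18 months of made-up listing counts for one hypothetical zip code
months = pd.period_range("2016-07", periods=18, freq="M")
counts = pd.Series(range(100, 118), index=months, dtype="float64")

# Fractional change versus the same month one year earlier
yoy = counts.pct_change(periods=12)

# The first 12 months have nothing to compare against
print(int(yoy.isna().sum()))  # → 12
```

Under that assumption, any zip code with only a year or so of data contributes mostly missing `_yy` values, which would account for at least part of the gap.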

In [19]:
# Renaming one column

df_rh_new = df_rh_new.rename(columns={'zip_name': 'city state'})

### 1C. Basic descriptive statistical analysis

In [21]:
# counting number of unique values in each column

print(df_rh_new.nunique())

month_date_yyyymm                              83
postal_code                                 38373
city state                                  25728
median_listing_price                       204371
median_listing_price_yy                     55684
active_listing_count                         1232
active_listing_count_yy                     23369
new_listing_count                             191
new_listing_count_yy                         4329
median_listing_price_per_square_foot         2854
median_listing_price_per_square_foot_yy     46435
median_square_feet                           7856
median_square_feet_yy                       33256
average_listing_price                      739779
average_listing_price_yy                    56802
total_listing_count                          1414
total_listing_count_yy                      21477
quality_flag                                    2
dtype: int64


In [24]:
df_rh_new.describe().applymap(lambda x: f"{x:0.2f}")

Unnamed: 0,median_listing_price,median_listing_price_yy,active_listing_count,active_listing_count_yy,new_listing_count,new_listing_count_yy,median_listing_price_per_square_foot,median_listing_price_per_square_foot_yy,median_square_feet,median_square_feet_yy,average_listing_price,average_listing_price_yy,total_listing_count,total_listing_count_yy,quality_flag
count,2291701.0,1834276.0,2293725.0,1823346.0,2295176.0,1213656.0,2272096.0,1815793.0,2272341.0,1816039.0,2291701.0,1834276.0,2295176.0,1850425.0,1950042.0
mean,348570.43,13.92,33.12,0.05,14.6,-0.02,177.05,10.25,2006.73,0.09,424998.59,13.57,48.49,0.07,0.55
std,844241.55,2444.92,59.84,0.95,24.41,0.75,1271.31,1723.15,17252.76,13.77,984581.78,2436.91,81.8,1.06,0.5
min,1.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,1.0,-1.0,1.0,-1.0,0.0,-1.0,0.0
25%,142975.0,-0.08,3.0,-0.39,0.0,-0.44,86.0,-0.03,1498.0,-0.11,165062.0,-0.09,4.0,-0.3,0.0
50%,242000.0,0.07,11.0,-0.11,4.0,-0.06,129.0,0.07,1815.0,0.0,278533.0,0.06,16.0,-0.07,1.0
75%,395000.0,0.27,39.0,0.2,20.0,0.22,193.0,0.21,2234.0,0.12,463792.0,0.26,60.0,0.18,1.0
max,279000000.0,1789999.0,2646.0,233.0,802.0,39.0,795000.0,836107.24,12503400.0,8681.66,279000000.0,1789999.0,2860.0,274.0,1.0


#### The quartiles fall within expected ranges for median listing price, square feet, average listing price, etc. The minimums and maximums (for example, a median listing price of 1 or 279,000,000) point to outliers that may need attention later
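The extreme values in the `min` and `max` rows can be pulled out for inspection with a simple range filter. A minimal sketch on toy data; the bounds `LOW` and `HIGH` are hypothetical and chosen only for illustration, not a recommendation for this dataset:

```python
import pandas as pd

# Toy frame with placeholder postal codes and prices echoing the table above
toy = pd.DataFrame({
    "postal_code": ["00001", "00002", "00003", "00004"],
    "median_listing_price": [1.0, 242000.0, 395000.0, 279000000.0],
})

# Hypothetical plausibility bounds, for illustration only
LOW, HIGH = 10_000, 50_000_000

# Rows whose price falls outside the bounds
price = toy["median_listing_price"]
suspect = toy[~price.between(LOW, HIGH)]
print(suspect)
```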

In [26]:
df_rh_new['month_date_yyyymm'].value_counts()

201709                                                       29313
201707                                                       29293
201809                                                       29291
201610                                                       29264
201608                                                       29261
                                                             ...  
202203                                                       26734
202202                                                       26513
201607                                                       24683
201607                                                        4692
quality_flag = 1:  year-over-year figures may be impacted        1
Name: month_date_yyyymm, Length: 83, dtype: int64

In [27]:
# Removing the one unnecessary row (a footnote about quality_flag, not a month)

df_rh_new = df_rh_new[df_rh_new['month_date_yyyymm'] != 'quality_flag = 1:  year-over-year figures may be impacted']

In [28]:
df_rh_new['month_date_yyyymm'].value_counts()

201709    29313
201707    29293
201809    29291
201610    29264
201608    29261
          ...  
202102    26845
202203    26734
202202    26513
201607    24683
201607     4692
Name: month_date_yyyymm, Length: 82, dtype: int64
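The value counts above still list 201607 twice (24683 and 4692 rows). One possible explanation, which I have not verified against the source file, is that the column has dtype `object` and holds the same month both as an integer and as a string. A minimal sketch of how that would look and how casting to string merges the two keys (toy values):

```python
import pandas as pd

# Hypothetical column mixing integers and strings for the same months
months = pd.Series([201607, "201607", 201608, "201608", "201608"], dtype="object")

# The int 201607 and the str "201607" are counted as different values
print(months.nunique())  # 4 distinct values before normalizing

# Cast everything to string so equal months collapse into one key
normalized = months.astype(str)
print(normalized.value_counts())
```

If that is the cause, passing `dtype={'month_date_yyyymm': str}` to `pd.read_csv` at import time would avoid the mixed types in the first place.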

In [31]:
# Exporting cleaned data

df_rh_new.to_csv(os.path.join(path, 'Data', 'Cleaned', 'Realtor_Historical_Zip_Inventory_ForSale.csv'), index=False)

## 2. Zillow Observed Rent Index by ZIP code

In [32]:
# Importing dataframe
df_zr = pd.read_csv(os.path.join(path, 'Data', 'Original', 'Zip_zori_sm_month.csv'), index_col = False)

### 2A. Consistency Checks

In [40]:
# checking dataset info

df_zr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6376 entries, 0 to 6375
Columns: 106 entries, RegionID to 2023-03-31
dtypes: float64(97), int64(3), object(6)
memory usage: 5.2+ MB


In [48]:
df_zr.head(10)

Unnamed: 0,RegionID,SizeRank,RegionName,RegionType,StateName,State,City,Metro,CountyName,2015-03-31,2015-04-30,2015-05-31,2015-06-30,2015-07-31,2015-08-31,2015-09-30,2015-10-31,2015-11-30,2015-12-31,2016-01-31,2016-02-29,2016-03-31,2016-04-30,2016-05-31,2016-06-30,2016-07-31,2016-08-31,2016-09-30,2016-10-31,2016-11-30,2016-12-31,2017-01-31,2017-02-28,2017-03-31,2017-04-30,2017-05-31,2017-06-30,2017-07-31,2017-08-31,2017-09-30,2017-10-31,2017-11-30,2017-12-31,2018-01-31,2018-02-28,2018-03-31,2018-04-30,2018-05-31,2018-06-30,2018-07-31,2018-08-31,2018-09-30,2018-10-31,2018-11-30,2018-12-31,2019-01-31,2019-02-28,2019-03-31,2019-04-30,2019-05-31,2019-06-30,2019-07-31,2019-08-31,2019-09-30,2019-10-31,2019-11-30,2019-12-31,2020-01-31,2020-02-29,2020-03-31,2020-04-30,2020-05-31,2020-06-30,2020-07-31,2020-08-31,2020-09-30,2020-10-31,2020-11-30,2020-12-31,2021-01-31,2021-02-28,2021-03-31,2021-04-30,2021-05-31,2021-06-30,2021-07-31,2021-08-31,2021-09-30,2021-10-31,2021-11-30,2021-12-31,2022-01-31,2022-02-28,2022-03-31,2022-04-30,2022-05-31,2022-06-30,2022-07-31,2022-08-31,2022-09-30,2022-10-31,2022-11-30,2022-12-31,2023-01-31,2023-02-28,2023-03-31
0,91982,1,77494,zip,TX,TX,Katy,"Houston-The Woodlands-Sugar Land, TX",Fort Bend County,1497.424718,1499.24853,1511.190697,1518.373209,1524.778886,1534.467413,1525.972679,1513.825732,1496.067628,1484.176959,1477.602784,1469.89333,1469.266453,1465.506633,1460.69749,1464.839719,1463.348701,1463.308774,1451.153055,1440.484725,1414.325031,1403.325081,1401.745874,1422.660319,1436.056953,1452.028342,1457.738077,1465.455128,1456.703293,1446.772335,1444.602727,1455.723864,1471.962899,1478.867489,1485.714413,1490.908561,1490.231947,1486.949121,1484.584374,1491.146678,1492.230793,1488.545118,1470.569482,1467.520836,1468.130101,1479.471622,1480.83772,1484.407342,1489.211373,1492.102559,1496.527475,1501.177687,1505.78083,1515.566709,1506.300283,1510.691959,1499.980684,1505.795181,1495.972559,1498.902275,1502.508405,1514.348774,1507.873714,1506.337641,1504.307536,1513.752688,1527.794248,1524.872925,1537.864851,1534.983419,1542.192281,1533.228151,1535.895213,1558.783273,1594.669434,1648.675204,1694.636887,1731.006308,1744.970826,1758.066255,1775.214637,1781.542878,1776.812683,1779.581445,1784.77224,1805.412721,1820.739019,1842.220063,1868.047131,1873.817718,1878.928788,1858.462009,1840.580666,1837.859685,1866.420457,1878.317284,1879.541555
1,91940,2,77449,zip,TX,TX,Katy,"Houston-The Woodlands-Sugar Land, TX",Harris County,1277.132974,1290.552348,1297.324146,1310.568023,1319.342211,1328.941113,1325.851755,1324.083586,1323.559282,1325.39794,1320.378924,1316.058328,1319.733662,1327.915762,1333.973771,1327.52958,1321.263721,1321.523778,1321.007518,1325.192442,1316.672574,1309.092448,1299.634126,1303.0657,1313.825541,1322.124167,1325.332426,1324.231674,1331.204495,1332.155926,1344.550593,1342.235872,1349.215973,1345.98962,1356.093238,1361.766509,1367.501081,1371.292204,1375.634822,1384.091429,1389.244946,1393.950416,1388.586372,1384.673133,1385.688017,1383.849626,1376.778312,1372.473906,1377.052443,1383.838332,1390.497816,1396.109101,1406.476355,1405.774508,1408.497078,1403.872928,1402.392732,1397.623038,1402.252585,1407.792889,1417.250549,1419.855427,1406.522741,1407.013628,1410.859196,1435.006098,1439.897578,1444.169843,1454.399468,1476.26225,1487.010318,1491.236476,1501.682413,1519.444443,1568.593329,1608.079967,1663.924436,1675.820344,1686.698612,1682.038321,1687.457652,1695.160415,1698.073415,1720.708202,1738.002891,1750.625491,1744.870716,1751.848602,1802.897432,1844.663002,1839.843822,1817.949304,1784.282291,1780.857238,1772.767233,1782.596162,1792.935784
2,92593,4,78660,zip,TX,TX,Pflugerville,"Austin-Round Rock-Georgetown, TX",Travis County,1213.896723,1202.061542,1210.64875,1235.145033,1257.007757,1257.881811,1257.506468,1255.223144,1256.37794,1259.291195,1266.857438,1262.929548,,1276.916177,1291.094497,1300.589664,1309.258325,1308.464719,1299.734453,1306.002029,1310.97273,1300.607528,1284.189301,1280.334085,1295.872435,1309.923928,1310.853327,1318.664933,1314.511684,1324.520001,1322.754487,1328.07691,1320.924809,1313.570585,1303.632042,1317.669723,1327.345552,1343.32599,1347.656499,1358.513485,1357.594438,1356.668627,1355.37854,1358.251351,1362.794213,1361.303731,1362.842623,1357.634069,1359.213324,1367.584804,1378.62248,1388.271416,1394.832913,1399.472838,1402.246878,1398.979755,1392.002211,1398.6203,1399.746265,1410.945804,1416.001456,1419.887578,1419.578231,1407.293023,1410.266391,1430.904864,1441.446165,1444.441106,1429.829386,1433.071054,1440.360359,1437.513873,1450.528819,1478.00105,1534.609844,1585.627305,1631.632943,,1716.093367,1745.067127,1740.623827,1753.971036,1757.736583,1762.095083,1774.815096,1776.177778,1805.584129,1829.508574,1855.75404,1844.563025,1836.608744,1829.236068,1832.788915,1818.796956,1825.326246,1824.967456,1828.388042
3,62093,5,11385,zip,NY,NY,New York,"New York-Newark-Jersey City, NY-NJ-PA",Queens County,2002.268262,2037.526224,2061.543907,2082.22329,2090.946531,2101.119163,2127.159394,2124.550233,2120.522267,2119.040449,2119.092187,2156.149319,2180.213141,2212.880008,2195.06884,2185.687996,2189.945202,2207.49578,2217.205063,2229.215506,2196.001381,2188.261579,2164.387386,2177.838793,2191.762129,2194.234774,2197.057057,2195.851515,2225.32326,2256.535411,2265.851461,2251.343189,2232.20705,2227.860084,2211.190543,2212.089833,2199.359855,2219.327416,2223.110451,2236.012802,2245.854843,2262.433113,2282.367337,2301.812405,2285.690402,2274.918462,2273.711925,2297.896204,2303.359826,2298.496611,2292.571896,2309.365078,2329.386519,2349.984191,2354.155218,2350.999494,2361.658657,2362.27045,2350.602492,2346.078132,2349.50972,2352.87888,2375.710817,2359.50756,2356.634053,2316.063959,2299.685891,2271.613823,2227.638376,2219.586462,2210.811419,2201.843389,2181.017894,2173.213028,2200.773561,2238.683011,2278.74516,2316.200824,2358.112678,2410.64992,2426.294188,2451.322199,2445.761932,2479.873594,2489.440934,2544.843223,2589.899639,2690.483038,2767.820061,2844.072074,2850.187383,2849.302473,2818.58222,2771.866885,2723.280995,2722.922736,2746.429739
4,93144,6,79936,zip,TX,TX,El Paso,"El Paso, TX",El Paso County,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1132.474356,1135.498773,1140.499609,1145.384713,1179.258952,1216.892264,1215.499492,1233.988449,1230.355814,1252.408703,1265.835938,1254.378579,1262.539224,1273.463572,1319.020253,1352.624832,1364.329298,1341.859387,1352.863897,1360.447722,1397.850669,1414.703422,1412.378205
5,62019,7,11208,zip,NY,NY,New York,"New York-Newark-Jersey City, NY-NJ-PA",Kings County,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2030.940908,2025.097436,1949.032972,1999.685305,2007.965498,2110.120608,2112.371382,2164.6242,2167.730603,2144.357726,2102.545288,2081.750948,2145.478402,2111.107917,2146.378433,2176.348498,2349.645808,2430.384194,2396.61187,2424.384161,2470.706539,2516.26695,2487.859244,2456.280551,2489.307247,2563.947896,2610.525206,2676.616409,2620.478358,2563.108333
6,95992,8,90011,zip,CA,CA,Los Angeles,"Los Angeles-Long Beach-Anaheim, CA",Los Angeles County,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2405.565205,2355.519949,2428.117584,2451.3297,2519.001799,2539.021831,2501.664937,2513.53332,2463.372364,2515.0
7,91733,9,77084,zip,TX,TX,Houston,"Houston-The Woodlands-Sugar Land, TX",Harris County,1233.641967,1244.075336,1247.713054,1254.876364,1259.601486,1270.391466,1276.234223,1271.331775,1265.734134,1261.182921,1262.46933,1265.067362,1264.97308,1267.502484,1266.446462,1257.095838,1261.365986,1267.67707,1273.270092,1268.341789,1261.121591,1262.989288,1262.565952,1256.035087,1257.852953,1264.966178,1276.34552,1281.56189,1285.075282,1287.668497,1307.381558,1323.800426,1338.370093,1323.963358,1314.530976,1307.837366,1305.776412,1315.449421,1320.412394,1332.744922,1324.810525,1332.027619,1350.403947,1352.831016,1360.323345,1341.403137,1345.967308,1328.68846,1340.097493,1348.519395,1374.659935,1374.870573,1379.988053,1372.430132,1370.333551,1364.991152,1364.554445,1367.731956,1365.790042,1374.852702,1382.17891,1394.477302,1393.408276,1396.06915,1395.07505,1398.666761,1405.83041,1404.734647,1407.847035,1409.717848,1420.190451,1428.472731,1435.355058,1450.284327,1470.039106,1503.41192,1549.86702,1600.29715,1650.68321,1647.014189,1645.690382,1646.895238,1666.982436,1675.483034,1661.685858,1668.53619,1684.548159,1719.482869,1742.866425,1727.62983,1710.984213,1695.807557,1709.414886,1701.107999,1700.951012,1694.641319,1712.270396
8,91926,10,77433,zip,TX,TX,Cypress,"Houston-The Woodlands-Sugar Land, TX",Harris County,1271.117945,1275.211973,1261.860826,1262.016176,1277.909648,1298.31711,1307.127937,1310.138559,1294.814174,1290.963166,1283.63803,1297.653204,1309.647767,1308.717502,1312.976293,1306.662112,1320.529641,1322.345581,1327.979332,1311.284587,1292.597219,1278.892625,1281.110115,1301.119041,1304.663159,1308.460275,1305.620214,1310.262601,1317.587642,1324.30614,1332.365748,1344.160547,1356.577275,1364.518387,1359.989701,1374.236887,1379.954356,1391.194625,1374.672072,1372.730616,1372.66137,1372.318824,1379.835119,1376.25408,1375.858065,1376.246811,1380.438751,1385.217597,1375.113366,1378.293585,1390.830751,1405.288417,1411.988846,1413.133151,1410.933394,1406.120132,1398.630794,1400.89448,1417.781026,1422.959677,1440.987031,1435.822648,1446.099193,1437.744322,1447.171931,1451.813395,1455.613663,1460.769998,1479.559194,1494.942613,1505.26153,1492.467617,1488.27712,1497.332756,1535.902039,1589.605508,1640.338707,1680.083689,1683.627711,1687.008185,1704.62001,1743.561824,1766.490613,1761.327204,1760.505734,1761.984858,1763.88068,1781.881801,1804.866395,1827.713655,1798.208562,1786.982792,1784.521263,1791.03631,1802.4172,1803.497801,1804.064127
9,84630,11,60629,zip,IL,IL,Chicago,"Chicago-Naperville-Elgin, IL-IN-WI",Cook County,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,959.539515,954.803226,951.800956,956.730579,959.406923,968.090732,975.39179,997.122444,1005.856883,1026.783942,1008.533068,1011.548906,1015.097193,1052.609519,1073.280237,1091.769474,1060.861816,1064.324606,1053.516263,1075.694733,1093.260254,1104.233046,1109.350385,1104.877778,1111.897029,1140.708333


In [44]:
# printing the dtype of every column, because info() only gives a summary when there are this many columns

# raising the display limit so every row of the dtypes output is shown
pd.options.display.max_rows = len(df_zr.dtypes)

print(df_zr.dtypes)

RegionID        int64
SizeRank        int64
RegionName      int64
RegionType     object
StateName      object
State          object
City           object
Metro          object
CountyName     object
2015-03-31    float64
2015-04-30    float64
2015-05-31    float64
2015-06-30    float64
2015-07-31    float64
2015-08-31    float64
2015-09-30    float64
2015-10-31    float64
2015-11-30    float64
2015-12-31    float64
2016-01-31    float64
2016-02-29    float64
2016-03-31    float64
2016-04-30    float64
2016-05-31    float64
2016-06-30    float64
2016-07-31    float64
2016-08-31    float64
2016-09-30    float64
2016-10-31    float64
2016-11-30    float64
2016-12-31    float64
2017-01-31    float64
2017-02-28    float64
2017-03-31    float64
2017-04-30    float64
2017-05-31    float64
2017-06-30    float64
2017-07-31    float64
2017-08-31    float64
2017-09-30    float64
2017-10-31    float64
2017-11-30    float64
2017-12-31    float64
2018-01-31    float64
2018-02-28    float64
2018-03-31

All the datatypes look appropriate
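One caveat: `RegionName` is `int64`, and an integer column cannot keep the leading zeros of zip codes that start with 0 (common in the northeastern US). If the zip codes need to match a text column in another dataset, they can be converted to zero-padded strings. A minimal sketch on toy values; the column name `zip_code` is my own choice:

```python
import pandas as pd

# Toy RegionName values; 901 stands for the zip code 00901
toy = pd.DataFrame({"RegionName": [901, 11385, 77494]})

# Convert to text and left-pad with zeros to five characters
toy["zip_code"] = toy["RegionName"].astype(str).str.zfill(5)
print(toy["zip_code"].tolist())  # → ['00901', '11385', '77494']
```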

In [49]:
# Checking for duplicates (and removing them if there are any)

df_zr.drop_duplicates(inplace=True)

In [50]:
df_zr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6376 entries, 0 to 6375
Columns: 106 entries, RegionID to 2023-03-31
dtypes: float64(97), int64(3), object(6)
memory usage: 5.2+ MB


The row count is unchanged at 6376, so there weren't any duplicates

In [51]:
# Checking for missing values
missing_values = df_zr.isnull().sum()

# total rows in dataframe
total_rows = df_zr.shape[0]

# calculating percentage of missing values
percent_missing = ((missing_values / total_rows) * 100).round(2)

# Making a dataframe to show missing values
missing_value_table = pd.concat([missing_values, percent_missing], axis=1)

# Naming columns for sorting
missing_value_table.columns = ['Missing Values', 'Percent Missing']

# sorting ascending by percent missing, so the columns with the most missing values come last
missing_value_table.sort_values('Percent Missing', ascending=True, inplace=True)

print(missing_value_table)

            Missing Values  Percent Missing
RegionID                 0             0.00
SizeRank                 0             0.00
RegionName               0             0.00
RegionType               0             0.00
StateName                0             0.00
State                    0             0.00
CountyName               0             0.00
2023-03-31               7             0.11
Metro                    8             0.13
City                    53             0.83
2023-02-28             903            14.16
2023-01-31            1198            18.79
2022-12-31            1557            24.42
2022-11-30            1697            26.62
2022-10-31            1789            28.06
2022-09-30            1903            29.85
2022-08-31            1950            30.58
2022-07-31            2012            31.56
2022-06-30            2076            32.56
2022-05-31            2161            33.89
2022-04-30            2278            35.73
2022-03-31            2363      

So the older the data, the more zip codes are missing, which makes sense. It is strange, though, that the month just before the latest one (2023-02-28) is already missing for about 14% of zip codes, compared with 0.11% for 2023-03-31. And 7 zip codes have no value even for the latest month. I wonder why.
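The missing-value table is built the same way for each dataset in this notebook, so it could be wrapped in a small function. A minimal sketch; the function name `missing_value_table` is my own, and the toy frame uses made-up values:

```python
import pandas as pd


def missing_value_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, sorted ascending."""
    missing_values = df.isnull().sum()
    percent_missing = (missing_values / len(df) * 100).round(2)
    table = pd.concat([missing_values, percent_missing], axis=1)
    table.columns = ["Missing Values", "Percent Missing"]
    return table.sort_values("Percent Missing")


# Toy frame with made-up rent values
toy = pd.DataFrame({
    "RegionName": [77494, 77449, 78660, 11385],
    "2023-02-28": [1878.3, None, None, 2722.9],
    "2023-03-31": [1879.5, 1792.9, 1828.4, None],
})
result = missing_value_table(toy)
print(result)
```

With this in place, each dataset's check reduces to `print(missing_value_table(df_zr))` or similar.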

In [53]:
# counting number of unique values in each column

print(df_zr.nunique())

RegionID      6376
SizeRank      6012
RegionName    6376
RegionType       1
StateName       52
State           52
City          2565
Metro          480
CountyName     671
2015-03-31    1096
2015-04-30    1117
2015-05-31    1134
2015-06-30    1142
2015-07-31    1153
2015-08-31    1157
2015-09-30    1170
2015-10-31    1173
2015-11-30    1193
2015-12-31    1223
2016-01-31    1279
2016-02-29    1328
2016-03-31    1356
2016-04-30    1422
2016-05-31    1628
2016-06-30    1682
2016-07-31    1695
2016-08-31    1711
2016-09-30    1717
2016-10-31    1749
2016-11-30    1777
2016-12-31    1821
2017-01-31    1920
2017-02-28    1925
2017-03-31    1948
2017-04-30    1967
2017-05-31    1991
2017-06-30    2002
2017-07-31    2019
2017-08-31    2027
2017-09-30    2030
2017-10-31    2050
2017-11-30    2070
2017-12-31    2089
2018-01-31    2173
2018-02-28    2191
2018-03-31    2220
2018-04-30    2235
2018-05-31    2250
2018-06-30    2251
2018-07-31    2259
2018-08-31    2264
2018-09-30    2269
2018-10-31  

### 2B. Data Cleaning

In [54]:
# Removing unneeded columns. State and StateName appear to be identical (both have 52 unique values),
# so I'm dropping StateName. I don't need RegionType because I already know they're all zip codes.
# I manually checked online: RegionName is the real zip code, so RegionID must be an internal
# Zillow identifier, which I don't need. I'm also not interested in SizeRank, though City, Metro,
# and CountyName might all come in handy.

df_zr_new = df_zr.drop(['RegionID', 'SizeRank', 'RegionType', 'StateName'], axis=1)
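Having the same number of unique values suggests that `State` and `StateName` are identical, but it does not prove it row by row. A minimal sketch of a direct comparison on toy data (this assumes neither column has missing values, which the missing-value table above reports for the real data):

```python
import pandas as pd

# Toy frame standing in for df_zr
toy = pd.DataFrame({
    "StateName": ["TX", "TX", "NY"],
    "State": ["TX", "TX", "NY"],
    "RegionName": [77494, 77449, 11385],
})

# True only if the two columns agree on every row
columns_identical = bool((toy["State"] == toy["StateName"]).all())
print(columns_identical)

# Drop the redundant column only once the check passes
if columns_identical:
    toy = toy.drop(columns=["StateName"])
```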

### 2C. Basic descriptive statistical analysis

In [55]:
df_zr_new.describe().applymap(lambda x: f"{x:0.2f}")

Unnamed: 0,RegionName,2015-03-31,2015-04-30,2015-05-31,2015-06-30,2015-07-31,2015-08-31,2015-09-30,2015-10-31,2015-11-30,2015-12-31,2016-01-31,2016-02-29,2016-03-31,2016-04-30,2016-05-31,2016-06-30,2016-07-31,2016-08-31,2016-09-30,2016-10-31,2016-11-30,2016-12-31,2017-01-31,2017-02-28,2017-03-31,2017-04-30,2017-05-31,2017-06-30,2017-07-31,2017-08-31,2017-09-30,2017-10-31,2017-11-30,2017-12-31,2018-01-31,2018-02-28,2018-03-31,2018-04-30,2018-05-31,2018-06-30,2018-07-31,2018-08-31,2018-09-30,2018-10-31,2018-11-30,2018-12-31,2019-01-31,2019-02-28,2019-03-31,2019-04-30,2019-05-31,2019-06-30,2019-07-31,2019-08-31,2019-09-30,2019-10-31,2019-11-30,2019-12-31,2020-01-31,2020-02-29,2020-03-31,2020-04-30,2020-05-31,2020-06-30,2020-07-31,2020-08-31,2020-09-30,2020-10-31,2020-11-30,2020-12-31,2021-01-31,2021-02-28,2021-03-31,2021-04-30,2021-05-31,2021-06-30,2021-07-31,2021-08-31,2021-09-30,2021-10-31,2021-11-30,2021-12-31,2022-01-31,2022-02-28,2022-03-31,2022-04-30,2022-05-31,2022-06-30,2022-07-31,2022-08-31,2022-09-30,2022-10-31,2022-11-30,2022-12-31,2023-01-31,2023-02-28,2023-03-31
count,6376.0,1096.0,1117.0,1134.0,1142.0,1153.0,1157.0,1170.0,1173.0,1193.0,1223.0,1279.0,1328.0,1356.0,1422.0,1628.0,1682.0,1695.0,1711.0,1717.0,1749.0,1777.0,1821.0,1920.0,1925.0,1948.0,1967.0,1991.0,2002.0,2019.0,2027.0,2030.0,2050.0,2070.0,2089.0,2173.0,2191.0,2220.0,2235.0,2250.0,2251.0,2259.0,2264.0,2269.0,2278.0,2282.0,2314.0,2368.0,2380.0,2407.0,2412.0,2416.0,2419.0,2417.0,2424.0,2420.0,2425.0,2430.0,2445.0,2457.0,2466.0,2471.0,2479.0,2497.0,2498.0,2504.0,2509.0,2515.0,2544.0,2575.0,2628.0,2742.0,2816.0,2977.0,3016.0,3051.0,3086.0,3093.0,3127.0,3177.0,3271.0,3357.0,3499.0,3703.0,3838.0,4013.0,4098.0,4215.0,4300.0,4364.0,4426.0,4473.0,4587.0,4679.0,4819.0,5178.0,5473.0,6369.0
mean,53564.17,1406.11,1421.92,1433.23,1443.5,1487.17,1494.81,1497.54,1498.99,1499.07,1497.39,1503.34,1504.85,1514.4,1516.03,1512.22,1514.56,1524.21,1524.12,1525.78,1521.25,1517.65,1513.37,1521.05,1531.12,1541.24,1557.0,1568.74,1577.86,1583.52,1578.85,1598.89,1598.21,1593.67,1592.45,1594.12,1600.86,1609.04,1620.57,1632.56,1644.67,1652.56,1657.9,1657.85,1656.1,1656.83,1655.51,1655.45,1661.15,1670.74,1682.16,1693.81,1704.9,1714.12,1715.4,1716.18,1713.88,1712.21,1710.85,1713.95,1721.47,1731.3,1731.96,1727.85,1724.79,1729.38,1730.53,1728.44,1725.34,1723.49,1718.14,1715.26,1716.08,1717.69,1741.27,1770.1,1806.26,1839.18,1874.02,1896.99,1904.61,1914.06,1913.91,1919.48,1931.05,1943.0,1964.35,1987.75,2007.49,2025.84,2042.79,2038.08,2025.93,2016.88,2015.69,2018.28,2020.93,2060.61
std,30000.75,621.26,626.24,629.06,633.83,1099.52,1113.07,1105.14,1112.1,1117.5,1137.5,1146.12,1155.77,1159.49,1131.85,1050.02,1016.06,1006.66,992.7,997.48,981.02,994.02,1000.87,1023.71,1031.39,1021.12,1004.23,993.15,1005.4,1007.55,892.75,1093.47,1098.57,1103.49,1108.42,1076.37,1064.16,1011.1,1009.36,1017.71,1039.57,1064.48,1081.52,1084.59,1060.19,1052.92,1037.64,1018.9,1010.74,1006.6,1008.24,1010.24,1026.86,1014.37,985.31,963.95,963.53,973.86,978.02,975.95,966.75,987.63,1003.44,1008.56,996.54,1000.52,969.04,939.51,926.69,964.11,987.07,985.08,990.78,1021.36,1061.68,1103.91,1131.6,1128.94,1082.11,1019.52,998.21,1011.9,1008.54,1042.37,1078.87,1144.7,1201.83,1211.4,1212.74,1171.62,1280.24,1205.56,1183.49,1203.08,1247.51,1332.26,1395.12,2026.07
min,901.0,501.49,505.98,494.02,502.58,518.89,529.47,535.31,529.66,530.77,492.83,522.68,529.15,526.14,531.1,536.46,539.6,544.15,463.29,469.31,480.57,504.69,495.9,490.65,488.76,570.27,570.27,571.91,563.91,564.49,557.88,48.92,51.9,50.93,52.07,53.72,54.59,54.33,52.6,51.78,52.01,53.93,53.64,52.66,51.78,52.18,51.78,50.96,50.87,51.99,53.56,52.74,52.1,52.3,55.52,56.87,57.56,54.26,54.29,55.21,56.72,56.6,52.91,56.52,58.7,64.02,63.6,66.3,64.4,64.74,55.6,60.8,59.7,513.45,80.75,93.12,94.47,96.52,102.85,105.57,104.82,99.72,103.15,105.67,109.31,117.95,118.71,128.19,89.71,68.42,191.24,151.01,554.18,548.57,548.95,542.81,560.9,575.56
25%,29831.5,1011.69,1024.27,1032.42,1038.43,1048.16,1048.72,1052.93,1054.0,1052.79,1053.21,1061.69,1064.46,1070.52,1072.85,1075.37,1080.34,1087.92,1089.55,1095.32,1091.18,1091.15,1090.44,1098.13,1104.2,1109.23,1120.04,1128.35,1134.23,1139.19,1145.22,1144.53,1142.53,1140.31,1142.27,1147.25,1149.69,1160.47,1170.61,1180.05,1189.18,1194.45,1200.96,1201.25,1202.72,1203.89,1206.88,1205.09,1210.64,1221.89,1231.97,1240.29,1251.24,1261.0,1266.14,1265.85,1259.85,1260.83,1260.54,1263.58,1270.35,1278.97,1275.95,1280.12,1283.86,1290.56,1299.13,1300.25,1305.52,1302.24,1299.04,1299.15,1301.47,1300.74,1318.35,1339.63,1367.28,1390.68,1421.42,1437.95,1443.36,1450.9,1453.19,1451.88,1452.64,1454.02,1467.37,1483.14,1496.14,1513.76,1521.16,1518.48,1506.3,1494.62,1491.87,1479.73,1479.3,1457.94
50%,52243.0,1237.94,1254.38,1265.36,1270.22,1278.57,1281.37,1282.06,1287.23,1291.99,1292.42,1299.54,1303.35,1310.1,1317.96,1321.22,1324.1,1333.91,1333.88,1332.3,1332.95,1329.97,1328.58,1331.34,1340.82,1351.0,1365.37,1375.96,1386.45,1393.99,1393.85,1394.9,1397.25,1385.48,1388.91,1391.77,1404.98,1412.04,1427.07,1433.2,1445.48,1451.74,1457.82,1457.73,1455.41,1455.63,1454.12,1453.51,1462.15,1471.7,1482.88,1495.35,1501.86,1510.36,1512.58,1513.3,1510.48,1505.31,1505.8,1506.76,1515.63,1530.15,1525.56,1523.09,1521.03,1533.19,1544.85,1554.15,1558.93,1557.96,1556.93,1552.77,1558.5,1567.58,1586.69,1611.38,1645.94,1683.36,1721.1,1747.11,1759.17,1766.77,1768.0,1768.77,1779.78,1786.73,1805.27,1828.59,1849.99,1871.77,1874.93,1876.35,1867.1,1851.69,1844.18,1834.88,1837.59,1832.0
75%,80408.25,1618.88,1640.48,1644.39,1664.6,1673.64,1681.9,1689.47,1699.41,1695.97,1699.8,1708.17,1706.69,1720.11,1716.48,1716.26,1729.14,1749.24,1748.61,1744.31,1744.2,1735.44,1723.47,1726.27,1740.16,1748.59,1772.78,1785.47,1801.98,1809.33,1814.11,1824.76,1817.68,1820.86,1819.35,1826.24,1839.75,1845.78,1858.15,1866.67,1880.25,1890.46,1893.85,1889.37,1889.56,1897.57,1905.32,1903.07,1905.53,1915.95,1929.09,1943.8,1950.91,1961.2,1965.5,1963.43,1965.75,1958.37,1958.83,1965.55,1968.91,1980.93,1987.54,1978.23,1978.82,1981.34,1988.07,1983.92,1982.71,1973.04,1968.17,1961.55,1960.7,1963.82,1988.02,2018.54,2062.22,2105.42,2150.84,2180.47,2178.6,2188.97,2185.89,2195.82,2211.18,2219.37,2244.76,2280.46,2309.4,2325.56,2337.2,2340.09,2331.77,2319.54,2323.54,2325.24,2326.97,2330.35
max,99705.0,7744.95,7997.14,7865.72,7776.15,26328.55,26927.08,27099.91,27190.27,27439.2,28197.74,28935.92,29867.54,30286.53,29905.67,28706.11,27081.51,27175.4,27017.13,27909.09,27011.84,27653.35,28045.2,29507.09,29753.62,29574.25,29051.19,28578.78,29685.46,28968.65,28844.86,27323.89,27912.22,27709.96,26844.35,25804.61,25693.05,23599.48,23823.28,24842.35,25818.49,26734.04,27489.88,27901.34,26908.53,26739.87,26248.02,25274.47,24972.28,25239.66,25694.41,25791.83,26045.16,25678.64,24321.4,23796.18,23815.16,23611.85,23188.28,23127.8,22470.94,24030.76,24405.69,25637.12,24958.47,25649.22,24957.91,25431.83,25215.08,27510.07,28261.31,28871.46,29044.89,30894.55,32307.99,33403.67,33655.55,33143.63,32615.16,29813.37,30484.96,30949.32,31220.7,32140.38,32309.65,36556.25,40047.53,41363.23,39379.65,38425.4,38528.87,36731.99,34685.66,36121.0,39250.2,40354.97,45075.03,68750.0


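The quartiles all look plausible, but the extremes need a closer look: the maximum rent reaches 68,750 in the most recent month, and the minimum sits around 50 for a long stretch of months. I will have to see whether the zip codes behind those values are within the area I'm intending to analyze.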
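To follow up on an extreme value like that, one option is to pull the zip code behind it. This is only a sketch on a toy frame: the zip codes, rents, and the `2023-04` column name are made up, since the real month column names aren't shown in this section.

```python
import pandas as pd

# Toy stand-in for df_zr_new; every value here is hypothetical
toy = pd.DataFrame({
    'RegionName': [10001, 20002, 30003],
    '2023-04': [1500.0, 9000.0, 1200.0],
})

latest = '2023-04'  # assumed name of the most recent month column

# Row holding the maximum rent for that month
top_row = toy.loc[toy[latest].idxmax()]
print(top_row['RegionName'], top_row[latest])

# Listing the few highest zip codes shows whether the max is a lone outlier
top_two = toy.nlargest(2, latest)[['RegionName', latest]]
print(top_two)
```

The same pattern with `idxmin` and `nsmallest` would cover the very low minimums.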

In [56]:
# Counting value counts in some of the "object" columns

df_zr_new['RegionName'].value_counts()

77494    1
31558    1
97333    1
34747    1
71446    1
        ..
64804    1
38118    1
32714    1
45373    1
12085    1
Name: RegionName, Length: 6376, dtype: int64

In [57]:
df_zr_new['State'].value_counts()

CA    834
FL    611
TX    589
NY    293
PA    225
IL    211
GA    207
WA    194
OH    190
NC    183
VA    178
AZ    173
MA    162
CO    156
NJ    155
MI    147
MD    132
MN    122
MO    121
TN    116
IN    114
WI     95
SC     89
OK     85
OR     82
AL     82
NV     73
LA     71
CT     69
KS     65
UT     65
KY     64
NE     46
AR     45
IA     45
ID     35
RI     29
NM     28
MS     25
HI     25
DC     21
NH     20
DE     16
SD     14
ND     13
MT     13
WV     12
ME     11
AK     10
WY     10
VT      3
PR      2
Name: State, dtype: int64

This seems to check out

In [58]:
df_zr_new['Metro'].value_counts()

New York-Newark-Jersey City, NY-NJ-PA    317
Los Angeles-Long Beach-Anaheim, CA       299
Chicago-Naperville-Elgin, IL-IN-WI       189
Dallas-Fort Worth-Arlington, TX          186
Houston-The Woodlands-Sugar Land, TX     164
                                        ... 
Red Bluff, CA                              1
Midland, MI                                1
Fremont, NE                                1
Rolla, MO                                  1
Sioux City, IA-NE-SD                       1
Name: Metro, Length: 480, dtype: int64

This also seems to check out

In [59]:
# Exporting cleaned data

df_zr_new.to_csv(os.path.join(path, 'Data', 'Cleaned', 'Zillow_Historical_Zip_Rental_Rates.csv'), index=False)

## 3. Apartment List Vacancy Index

In [4]:
# Importing dataframe
df_av = pd.read_csv(os.path.join(path, 'Data', 'Original', 'Apartment_List_Vacancy_Index_2023_04.csv'), index_col = False)

### 3A. Consistency Checks

In [61]:
# checking dataset info

df_av.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 455 entries, 0 to 454
Data columns (total 83 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   location_name       455 non-null    object 
 1   location_type       455 non-null    object 
 2   location_fips_code  455 non-null    int64  
 3   population          455 non-null    int64  
 4   state               353 non-null    object 
 5   county              311 non-null    object 
 6   metro               412 non-null    object 
 7   2017_01             439 non-null    float64
 8   2017_02             443 non-null    float64
 9   2017_03             444 non-null    float64
 10  2017_04             445 non-null    float64
 11  2017_05             447 non-null    float64
 12  2017_06             447 non-null    float64
 13  2017_07             447 non-null    float64
 14  2017_08             447 non-null    float64
 15  2017_09             447 non-null    float64
 16  2017_10 

all the dtypes look appropriate

In [5]:
# Checking for duplicates (and removing them if there are any) - then checking if the numbers change

df_av.drop_duplicates(inplace=True)

df_av.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 455 entries, 0 to 454
Data columns (total 83 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   location_name       455 non-null    object 
 1   location_type       455 non-null    object 
 2   location_fips_code  455 non-null    int64  
 3   population          455 non-null    int64  
 4   state               353 non-null    object 
 5   county              311 non-null    object 
 6   metro               412 non-null    object 
 7   2017_01             439 non-null    float64
 8   2017_02             443 non-null    float64
 9   2017_03             444 non-null    float64
 10  2017_04             445 non-null    float64
 11  2017_05             447 non-null    float64
 12  2017_06             447 non-null    float64
 13  2017_07             447 non-null    float64
 14  2017_08             447 non-null    float64
 15  2017_09             447 non-null    float64
 16  2017_10 

Nothing changed, there were no duplicates
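Comparing two `info()` printouts works, but the duplicate count can also be read directly before anything is dropped. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy frame with one exact duplicate row (hypothetical data)
toy = pd.DataFrame({
    'location_name': ['Austin, TX', 'Katy, TX', 'Austin, TX'],
    '2023_04': [0.07, 0.05, 0.07],
})

# Number of rows that exactly repeat an earlier row
n_dupes = int(toy.duplicated().sum())
print(n_dupes)

# Drop them and rebuild a clean 0..n-1 index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), '->', len(deduped))
```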

In [7]:
# Checking for missing values
missing_values = df_av.isnull().sum()

# total rows in dataframe
total_rows = df_av.shape[0]

# making it show me all the rows
pd.options.display.max_rows = len(df_av.dtypes)

# calculating percentage of missing values
percent_missing = ((missing_values / total_rows) * 100).round(2)

# Making a dataframe to show missing values
missing_value_table = pd.concat([missing_values, percent_missing], axis=1)

# Naming columns for sorting
missing_value_table.columns = ['Missing Values', 'Percent Missing']

# sorting the view based on most missing values %
missing_value_table.sort_values('Percent Missing', ascending=True, inplace=True)

print(missing_value_table)

                    Missing Values  Percent Missing
location_name                    0             0.00
2021_05                          0             0.00
2021_04                          0             0.00
2021_03                          0             0.00
2021_02                          0             0.00
2021_01                          0             0.00
2020_12                          0             0.00
2020_11                          0             0.00
2020_10                          0             0.00
2020_09                          0             0.00
2020_08                          0             0.00
2020_07                          0             0.00
2020_06                          0             0.00
2020_05                          0             0.00
2020_04                          0             0.00
2020_03                          0             0.00
2020_02                          0             0.00
2020_01                          0             0.00
2021_06     

So the historical coverage of the data is good. Many rows have no county and quite a few have no state, but most have a metro. I'll just have to see how many of my food production facility zip codes this matches up with to decide whether it's worth using. My guess is that it'll add useful data for some facilities but not for most.
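The same missing-value steps get repeated for each dataset in this notebook, so they could be wrapped in a small helper. This is a sketch only: the function name `missing_value_report` and the toy data are hypothetical, and the steps mirror the cell above.

```python
import numpy as np
import pandas as pd

def missing_value_report(df):
    """Return missing-value counts and percentages per column, sorted by percent missing."""
    missing = df.isnull().sum()
    percent = (missing / len(df) * 100).round(2)
    report = pd.concat([missing, percent], axis=1)
    report.columns = ['Missing Values', 'Percent Missing']
    return report.sort_values('Percent Missing', ascending=True)

# Toy frame with some gaps (made-up values)
toy = pd.DataFrame({
    'location_name': ['A', 'B', 'C', 'D'],
    'county': ['X County', None, None, 'Y County'],
    '2017_01': [0.07, np.nan, 0.05, 0.06],
})

report = missing_value_report(toy)
print(report)
```

Returning a new frame instead of sorting in place means the same call can be reused for each dataset without overwriting a shared `missing_value_table` variable.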

In [8]:
# counting number of unique values in each column

print(df_av.nunique())

location_name         455
location_type           5
location_fips_code    455
population            429
state                  42
county                157
metro                 193
2017_01               386
2017_02               389
2017_03               390
2017_04               392
2017_05               395
2017_06               397
2017_07               398
2017_08               399
2017_09               398
2017_10               399
2017_11               399
2017_12               405
2018_01               405
2018_02               403
2018_03               408
2018_04               407
2018_05               408
2018_06               408
2018_07               413
2018_08               411
2018_09               410
2018_10               416
2018_11               416
2018_12               417
2019_01               421
2019_02               418
2019_03               417
2019_04               418
2019_05               419
2019_06               421
2019_07               420
2019_08     

that all seems about right

### 3B. Data Cleaning

In [10]:
# Removing unneeded columns. "location_fips_code" is a FIPS geographic identifier, but I won't be linking on it,
# so I don't need it. Similarly I don't need "population", but it's possible I'll find use for state, county, and metro
# to use for linking with my zip codes (since there are no zip codes here)

df_av_new = df_av.drop(['location_fips_code', 'population'], axis=1)

### 3C. Basic Descriptive Analysis

In [11]:
# Checking basic stats on each column

df_av_new.describe().applymap(lambda x: f"{x:0.2f}")

Unnamed: 0,2017_01,2017_02,2017_03,2017_04,2017_05,2017_06,2017_07,2017_08,2017_09,2017_10,...,2022_07,2022_08,2022_09,2022_10,2022_11,2022_12,2023_01,2023_02,2023_03,2023_04
count,439.0,443.0,444.0,445.0,447.0,447.0,447.0,447.0,447.0,447.0,...,455.0,455.0,455.0,455.0,455.0,455.0,455.0,455.0,455.0,455.0
mean,0.07,0.07,0.07,0.07,0.07,0.07,0.07,0.07,0.07,0.07,...,0.05,0.05,0.05,0.05,0.06,0.06,0.06,0.06,0.07,0.07
std,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.02,0.02,...,0.02,0.01,0.01,0.01,0.01,0.01,0.01,0.02,0.02,0.02
min,0.0,0.0,0.01,0.01,0.01,0.01,0.02,0.02,0.02,0.02,...,0.01,0.01,0.01,0.02,0.02,0.02,0.02,0.02,0.03,0.03
25%,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.06,0.06,0.06,...,0.04,0.04,0.04,0.05,0.05,0.05,0.05,0.05,0.05,0.06
50%,0.06,0.07,0.07,0.07,0.07,0.07,0.07,0.07,0.07,0.07,...,0.05,0.05,0.05,0.05,0.06,0.06,0.06,0.06,0.06,0.07
75%,0.08,0.08,0.08,0.08,0.08,0.08,0.08,0.08,0.08,0.08,...,0.06,0.06,0.06,0.06,0.06,0.07,0.07,0.07,0.07,0.08
max,0.2,0.2,0.21,0.23,0.26,0.28,0.3,0.29,0.27,0.26,...,0.13,0.13,0.14,0.14,0.13,0.13,0.13,0.13,0.13,0.13


That's about what I'd expect. Since the index is a share of inventory, the values are fractions between 0 and 1, and in practice vacancy sits far below 1: the mean is about 0.07 and the maximum across the months shown is 0.30.

In [12]:
# Counting value counts in some of the "object" columns

df_av_new['location_name'].value_counts()

United States                1
Huntsville, AL Metro Area    1
Cleveland, OH                1
Savannah, GA Metro Area      1
Placer County, CA            1
                            ..
Milwaukee County, WI         1
Marion County, IN            1
Austin, TX                   1
Westchester County, NY       1
Katy, TX                     1
Name: location_name, Length: 455, dtype: int64

In [13]:
df_av_new['location_type'].value_counts()

County      166
City        145
Metro       101
State        42
National      1
Name: location_type, dtype: int64

In [14]:
df_av_new['state'].value_counts()

Texas                   38
California              32
Florida                 20
Virginia                16
North Carolina          15
Colorado                15
Washington              15
Georgia                 13
Tennessee               11
New Jersey              11
Arizona                 11
Maryland                11
North Dakota             9
Pennsylvania             9
South Carolina           9
Minnesota                8
Massachusetts            8
Oregon                   7
Michigan                 7
Ohio                     7
Nebraska                 6
Nevada                   6
Missouri                 6
New York                 5
Oklahoma                 5
Kentucky                 5
Wisconsin                5
Louisiana                4
Connecticut              4
Alabama                  4
Illinois                 4
New Mexico               3
South Dakota             3
Indiana                  3
Utah                     3
Kansas                   3
Arkansas                 3
I

In [15]:
df_av_new['county'].value_counts()

Maricopa County       8
King County           7
Los Angeles County    6
Dallas County         6
Montgomery County     6
                     ..
Chester County        1
Fairfield County      1
Volusia County        1
Bergen County         1
Burlington County     1
Name: county, Length: 157, dtype: int64

In [16]:
df_av_new['metro'].value_counts()

Dallas-Fort Worth-Arlington, TX                 17
Washington-Arlington-Alexandria, DC-VA-MD-WV    13
Seattle-Tacoma-Bellevue, WA                     11
New York-Newark-Jersey City, NY-NJ-PA           11
Houston-The Woodlands-Sugar Land, TX             9
                                                ..
Tucson, AZ Metro Area                            1
Tulsa, OK Metro Area                             1
Bridgeport-Stamford-Norwalk, CT Metro Area       1
Bridgeport-Stamford-Norwalk, CT                  1
Knoxville, TN Metro Area                         1
Name: metro, Length: 193, dtype: int64

I'd say that all of this is about what I'd expect.  So overall very clean, but possibly limited data. Hopefully I'll find enough of it that I can use.
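One thing to watch in the `metro` output above: Bridgeport-Stamford-Norwalk, CT is listed both with and without a "Metro Area" suffix, so the same metro may end up counted under two labels. A possible cleanup before linking, sketched on a few of the names from that output:

```python
import pandas as pd

metros = pd.Series([
    'Tucson, AZ Metro Area',
    'Bridgeport-Stamford-Norwalk, CT Metro Area',
    'Bridgeport-Stamford-Norwalk, CT',
])

# Strip a trailing " Metro Area" so both spellings collapse to one label
normalized = metros.str.replace(r'\s+Metro Area$', '', regex=True)
print(normalized.tolist())
print(normalized.nunique())
```

Whether the suffix should be removed or added depends on how the other datasets spell their metro names, which would need checking first.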

In [17]:
# Exporting cleaned data

df_av_new.to_csv(os.path.join(path, 'Data', 'Cleaned', 'Apartment_List_Vacancy_Index.csv'), index=False)

## 4. Apartment List Rent Estimates

In [18]:
# Importing dataframe
df_ar = pd.read_csv(os.path.join(path, 'Data', 'Original', 'Apartment_List_Rent_Estimates_2023_04.csv'), index_col = False)

### 4A. Consistency Checks

In [19]:
# checking dataset info

df_ar.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3378 entries, 0 to 3377
Data columns (total 84 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   location_name       3378 non-null   object 
 1   location_type       3378 non-null   object 
 2   location_fips_code  3378 non-null   int64  
 3   population          3378 non-null   int64  
 4   state               2835 non-null   object 
 5   county              2694 non-null   object 
 6   metro               3234 non-null   object 
 7   bed_size            3378 non-null   object 
 8   2017_01             3108 non-null   float64
 9   2017_02             3126 non-null   float64
 10  2017_03             3165 non-null   float64
 11  2017_04             3183 non-null   float64
 12  2017_05             3207 non-null   float64
 13  2017_06             3237 non-null   float64
 14  2017_07             3234 non-null   float64
 15  2017_08             3231 non-null   float64
 16  2017_0

It's interesting that the monthly rent estimate columns are a mix of int64 and float64. I'll convert them all to float64 so they're consistent with each other and with the same columns in the apartment vacancies dataset

In [20]:
# changing the date columns that are int64 to float64 so that they can be consistent

# Selecting the monthly columns by their YYYY_MM name pattern instead of listing each one
# (columns that are already float64 are left unchanged by the conversion)
date_cols = df_ar.columns[df_ar.columns.str.match(r'\d{4}_\d{2}$')]

df_ar = df_ar.astype({col: 'float64' for col in date_cols})

In [23]:
# Checking for duplicates (and removing them if there are any) - then checking if the numbers change

df_ar.drop_duplicates(inplace=True)

df_ar.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3378 entries, 0 to 3377
Data columns (total 84 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   location_name       3378 non-null   object 
 1   location_type       3378 non-null   object 
 2   location_fips_code  3378 non-null   int64  
 3   population          3378 non-null   int64  
 4   state               2835 non-null   object 
 5   county              2694 non-null   object 
 6   metro               3234 non-null   object 
 7   bed_size            3378 non-null   object 
 8   2017_01             3108 non-null   float64
 9   2017_02             3126 non-null   float64
 10  2017_03             3165 non-null   float64
 11  2017_04             3183 non-null   float64
 12  2017_05             3207 non-null   float64
 13  2017_06             3237 non-null   float64
 14  2017_07             3234 non-null   float64
 15  2017_08             3231 non-null   float64
 16  2017_0

no duplicates

In [25]:
# Checking for missing values
missing_values = df_ar.isnull().sum()

# total rows in dataframe
total_rows = df_ar.shape[0]

# making it show me all the rows
pd.options.display.max_rows = len(df_ar.dtypes)

# calculating percentage of missing values
percent_missing = ((missing_values / total_rows) * 100).round(2)

# Making a dataframe to show missing values
missing_value_table = pd.concat([missing_values, percent_missing], axis=1)

# Naming columns for sorting
missing_value_table.columns = ['Missing Values', 'Percent Missing']

# sorting the view based on most missing values %
missing_value_table.sort_values('Percent Missing', ascending=True, inplace=True)

print(missing_value_table)

                    Missing Values  Percent Missing
location_name                    0             0.00
2021_05                          0             0.00
2021_04                          0             0.00
2021_03                          0             0.00
2021_02                          0             0.00
2021_01                          0             0.00
2020_12                          0             0.00
2020_11                          0             0.00
2020_10                          0             0.00
2020_09                          0             0.00
2020_08                          0             0.00
2020_07                          0             0.00
2020_06                          0             0.00
2020_05                          0             0.00
2020_04                          0             0.00
2020_03                          0             0.00
2020_02                          0             0.00
2020_01                          0             0.00
2021_06     

I'd say this was expected. Older dates have less info. Metro, state, and county are missing proportionally less often than in the vacancy dataset, but otherwise the trends are similar.

In [26]:
# counting number of unique values in each column

print(df_ar.nunique())

location_name         1067
location_type            5
location_fips_code    1126
population            1066
state                   47
county                 299
metro                  181
bed_size                 3
2017_01               1178
2017_02               1213
2017_03               1215
2017_04               1207
2017_05               1219
2017_06               1244
2017_07               1220
2017_08               1216
2017_09               1223
2017_10               1206
2017_11               1220
2017_12               1203
2018_01               1204
2018_02               1226
2018_03               1214
2018_04               1242
2018_05               1235
2018_06               1235
2018_07               1252
2018_08               1252
2018_09               1251
2018_10               1248
2018_11               1254
2018_12               1235
2019_01               1241
2019_02               1264
2019_03               1262
2019_04               1236
2019_05               1255
2

this all seems expected

### 4B. Data Cleaning

In [27]:
# Removing unneeded columns. Similar to rental vacancies, I'm dropping "location_fips_code" and "population"
# since I won't be linking on either of them. But I'm keeping the rest to possibly help
# identify the regions I want to link to my food production facility address list

df_ar_new = df_ar.drop(['location_fips_code', 'population'], axis=1)

### 4C. Basic descriptive analysis

In [28]:
# Checking basic stats on each column

df_ar_new.describe().applymap(lambda x: f"{x:0.2f}")

Unnamed: 0,2017_01,2017_02,2017_03,2017_04,2017_05,2017_06,2017_07,2017_08,2017_09,2017_10,...,2022_07,2022_08,2022_09,2022_10,2022_11,2022_12,2023_01,2023_02,2023_03,2023_04
count,3108.0,3126.0,3165.0,3183.0,3207.0,3237.0,3234.0,3231.0,3243.0,3243.0,...,3378.0,3378.0,3378.0,3378.0,3378.0,3378.0,3378.0,3378.0,3378.0,3378.0
mean,1131.48,1136.63,1141.31,1151.32,1164.75,1172.95,1178.15,1176.43,1172.98,1168.73,...,1541.56,1543.45,1533.49,1517.17,1499.01,1485.71,1481.55,1485.85,1495.12,1504.0
std,404.01,409.87,409.06,411.13,415.71,418.07,418.75,416.0,413.09,408.27,...,489.6,488.71,482.81,473.77,465.18,459.56,457.7,460.14,464.11,467.72
min,439.0,439.0,440.0,456.0,460.0,480.0,499.0,514.0,552.0,549.0,...,586.0,602.0,615.0,623.0,618.0,611.0,612.0,604.0,605.0,605.0
25%,849.0,851.0,854.0,863.0,870.0,875.0,882.0,881.0,876.5,876.0,...,1180.0,1183.0,1178.0,1165.0,1153.0,1143.0,1142.0,1145.0,1149.25,1156.0
50%,1037.0,1038.5,1043.0,1053.0,1063.0,1071.0,1077.0,1075.0,1074.0,1074.0,...,1464.0,1468.0,1457.5,1442.5,1427.0,1415.0,1407.0,1409.0,1417.0,1423.5
75%,1308.25,1316.75,1320.0,1330.0,1347.0,1360.0,1363.75,1362.0,1362.5,1353.5,...,1807.0,1809.0,1795.75,1772.0,1750.75,1735.0,1733.75,1738.75,1746.0,1756.75
max,4248.0,4278.0,4311.0,4321.0,4344.0,4342.0,4354.0,4369.0,4420.0,4435.0,...,5409.0,5417.0,5359.0,5315.0,5235.0,5212.0,5204.0,5254.0,5217.0,5216.0


This all looks plausible

In [30]:
# Counting value counts in some of the "object" columns

df_ar_new['location_name'].value_counts()

Santa Fe, NM              6
Rochester, MN             6
Wilmington, NC            6
Knoxville, TN             6
Baton Rouge, LA           6
                         ..
Butler County, OH         3
Forsyth County, NC        3
Eugene-Springfield, OR    3
Lane County, OR           3
Glendale, CO              3
Name: location_name, Length: 1067, dtype: int64

It's interesting that some have 6. Each location gets 3 rows (one per bed_size), and I checked it out: the names with 6 rows appear twice, once as the city and once as the metro.
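That kind of check can be done with a groupby rather than by eye. A sketch on made-up rows shaped like `df_ar_new` (the specific rows are hypothetical):

```python
import pandas as pd

# Toy stand-in for df_ar_new: one name present as City and Metro, one as City only
toy = pd.DataFrame({
    'location_name': ['Santa Fe, NM'] * 6 + ['Glendale, CO'] * 3,
    'location_type': ['City'] * 3 + ['Metro'] * 3 + ['City'] * 3,
    'bed_size': ['overall', '1br', '2br'] * 3,
})

# Rows per name, split out by location type
breakdown = toy.groupby(['location_name', 'location_type']).size().unstack(fill_value=0)
print(breakdown)

# Names that appear under more than one location type
types_per_name = toy.groupby('location_name')['location_type'].nunique()
multi_type_names = types_per_name[types_per_name > 1].index.tolist()
print(multi_type_names)
```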

In [31]:
# Counting value counts in some of the "object" columns

df_ar_new['location_type'].value_counts()

City        1707
County       987
Metro        540
State        141
National       3
Name: location_type, dtype: int64

In [32]:
# Counting value counts in some of the "object" columns

df_ar_new['state'].value_counts()

Texas                   339
California              315
Florida                 252
Georgia                 150
Virginia                138
Washington              105
Colorado                105
North Carolina          102
Maryland                102
Minnesota                81
Massachusetts            75
Illinois                 66
South Carolina           60
Tennessee                57
New Jersey               57
Arizona                  54
Indiana                  54
Ohio                     51
Louisiana                51
Michigan                 51
New York                 48
Pennsylvania             48
Oregon                   45
Alabama                  42
Missouri                 42
Utah                     33
Nevada                   33
Kansas                   27
Connecticut              27
Oklahoma                 24
Nebraska                 24
Wisconsin                18
Arkansas                 18
Kentucky                 18
New Mexico               15
Mississippi         

In [33]:
# Counting value counts in some of the "object" columns

df_ar_new['county'].value_counts()

Orange County           60
Montgomery County       57
Los Angeles County      42
Broward County          42
King County             39
                        ..
Champaign County         3
Santa Barbara County     3
Berkeley County          3
Yolo County              3
Atlantic County          3
Name: county, Length: 299, dtype: int64

In [34]:
# Counting value counts in some of the "object" columns

df_ar_new['metro'].value_counts()

Washington-Arlington-Alexandria, DC-VA-MD-WV    138
Dallas-Fort Worth-Arlington, TX                 114
Atlanta-Sandy Springs-Alpharetta, GA            108
Miami-Fort Lauderdale-Pompano Beach, FL          90
Los Angeles-Long Beach-Anaheim, CA               90
                                               ... 
Bismarck, ND                                      3
Harrisburg-Carlisle, PA                           3
Houma-Thibodaux, LA                               3
Toledo, OH                                        3
Moses Lake, WA                                    3
Name: metro, Length: 181, dtype: int64

In [35]:
# Counting value counts in some of the "object" columns

df_ar_new['bed_size'].value_counts()

overall    1126
1br        1126
2br        1126
Name: bed_size, dtype: int64

all of the above looks as expected

In [36]:
# Exporting cleaned data

df_ar_new.to_csv(os.path.join(path, 'Data', 'Cleaned', 'Apartment_List_Rent_Estimates.csv'), index=False)

## 5. Zip code list acquired from major protein company facility locations

In [37]:
# Importing dataframe
df_tz = pd.read_csv(os.path.join(path, 'Data', 'Original', '3.21.23 plant addresses.csv'), index_col = False)

### 5A. Consistency Checks

In [38]:
# checking dataset info

df_tz.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113 entries, 0 to 112
Data columns (total 5 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   Location                           113 non-null    object
 1   Location Address - Line 1          113 non-null    object
 2   Location Address - City            113 non-null    object
 3   Location Address - State/Province  113 non-null    object
 4   Location Address - Postal Code     113 non-null    object
dtypes: object(5)
memory usage: 4.5+ KB


I thought about changing the "postal code" column to "int64", but I'm leaving it as a string so that the leading zeroes on some zip codes are preserved. An integer column can't store a leading zero, so a zip code like 04102 would become 4102
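To make sure the leading zeroes survive no matter what pandas infers, the column type can be fixed at import time. A small sketch using an in-memory CSV with made-up rows and a shortened, hypothetical column name (`Postal Code`):

```python
import io
import pandas as pd

# In-memory CSV standing in for the plant address file (made-up rows)
csv_text = "Location,Postal Code\nPlant A,04102\nPlant B,73701\n"

# Default parsing infers integers here and loses the leading zero
as_int = pd.read_csv(io.StringIO(csv_text))
print(as_int['Postal Code'].tolist())

# Forcing string dtype keeps the zip codes exactly as written
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'Postal Code': str})
print(as_str['Postal Code'].tolist())

# If leading zeroes were already lost, zfill can restore 5-digit zip codes
restored = as_int['Postal Code'].astype(str).str.zfill(5)
print(restored.tolist())
```

The `zfill(5)` repair assumes every value is a plain 5-digit US zip code; ZIP+4 or non-US postal codes would need different handling.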

In [39]:
# Checking for duplicates (and removing them if there are any) - then checking if the numbers change

df_tz.drop_duplicates(inplace=True)

df_tz.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 113 entries, 0 to 112
Data columns (total 5 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   Location                           113 non-null    object
 1   Location Address - Line 1          113 non-null    object
 2   Location Address - City            113 non-null    object
 3   Location Address - State/Province  113 non-null    object
 4   Location Address - Postal Code     113 non-null    object
dtypes: object(5)
memory usage: 5.3+ KB


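No duplicates. The slightly higher memory usage comes from the index: after drop_duplicates the RangeIndex became an Int64Index, which stores a label for every row instead of just a start, stop, and step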
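The two `info()` printouts show the index type changing from `RangeIndex` to `Int64Index`, and the size gap between those two kinds of index can be measured directly. A rough sketch, building both index types by hand rather than relying on what `drop_duplicates` returns in a given pandas version:

```python
import numpy as np
import pandas as pd

n = 113  # same row count as the plant list

# A RangeIndex only needs to remember start, stop, and step
range_bytes = pd.RangeIndex(n).memory_usage()

# An explicit integer index stores one 8-byte label per row
explicit_bytes = pd.Index(np.arange(n, dtype='int64')).memory_usage()

print(range_bytes, explicit_bytes)
```

If the extra memory ever mattered, calling `reset_index(drop=True)` after dropping rows would give the frame a fresh `RangeIndex`.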

In [40]:
# Checking for missing values
missing_values = df_tz.isnull().sum()

# total rows in dataframe
total_rows = df_tz.shape[0]

# making it show me all the rows
pd.options.display.max_rows = len(df_tz.dtypes)

# calculating percentage of missing values
percent_missing = ((missing_values / total_rows) * 100).round(2)

# Making a dataframe to show missing values
missing_value_table = pd.concat([missing_values, percent_missing], axis=1)

# Naming columns for sorting
missing_value_table.columns = ['Missing Values', 'Percent Missing']

# sorting the view based on most missing values %
missing_value_table.sort_values('Percent Missing', ascending=True, inplace=True)

print(missing_value_table)

                                   Missing Values  Percent Missing
Location                                        0              0.0
Location Address - Line 1                       0              0.0
Location Address - City                         0              0.0
Location Address - State/Province               0              0.0
Location Address - Postal Code                  0              0.0


No missing values, just a clean straight list

In [41]:
# counting number of unique values in each column

print(df_tz.nunique())

Location                             113
Location Address - Line 1            105
Location Address - City               96
Location Address - State/Province     26
Location Address - Postal Code        97
dtype: int64


The interesting thing is that there are only 105 unique address lines for 113 locations, so some facilities share an address even though they're technically considered different facilities. I'll look into that later. This dataset (i.e. the goals of the project) targets 97 specific zip codes across 96 different cities. So some cities/zip codes obviously have multiple facilities
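To see which facilities share an address line, `duplicated` with `keep=False` flags every row in a shared group rather than only the repeats. A sketch on three rows in the shape of the plant list (treat them as illustrative):

```python
import pandas as pd

addr_col = 'Location Address - Line 1'

# Small stand-in for df_tz before the columns are renamed
toy = pd.DataFrame({
    'Location': ['Albany Plant', 'Albany Waste Water Plant', 'Amarillo Plant'],
    addr_col: ['2294 KY 90 W', '2294 KY 90 W', '5000 FM1912'],
})

# keep=False marks all rows whose address appears more than once
shared = toy[toy.duplicated(addr_col, keep=False)].sort_values(addr_col)
print(shared)

# Gap between number of locations and number of distinct addresses
gap = toy['Location'].nunique() - toy[addr_col].nunique()
print(gap)
```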

### 5B. Data Cleaning

In [42]:
# I don't have any unwanted columns here, 
# but I want to change the column names to make them fit better with pandas df naming conventions

In [43]:
# Renaming columns

df_tz = df_tz.rename(columns={'Location Address - Line 1': 'address', 'Location Address - City': 'city', 'Location Address - State/Province': 'state', 'Location Address - Postal Code': 'zip'})

In [44]:
df_tz.head()

Unnamed: 0,Location,address,city,state,zip
0,54th St Enid Plant,201 S Raleigh Rd,Enid,Oklahoma,73701
1,Albany Plant,2294 KY 90 W,Albany,Kentucky,42602
2,Albany Waste Water Plant,2294 KY 90 W,Albany,Kentucky,42602
3,Albertville Plant,6600 Hwy 431 S,Albertville,Alabama,35950
4,Amarillo Plant,5000 FM1912,Amarillo,Texas,79108


In [45]:
# Dropping rows that have facilities that were recently closed

df_tz.drop(df_tz[df_tz['city'] == 'Van Buren'].index, inplace=True)
df_tz.drop(df_tz[df_tz['city'] == 'Glen Allen'].index, inplace=True)
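
The two `drop` calls could also be written as a single boolean filter, which may be easier to extend if more closed facilities turn up. This is only a sketch on a made-up frame; the location names and the `closed_cities` list are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows; only the 'city' values matter for the filter
sample = pd.DataFrame({
    'Location': ['Plant A', 'Plant B', 'Plant C'],
    'city': ['Van Buren', 'Enid', 'Glen Allen'],
})

closed_cities = ['Van Buren', 'Glen Allen']

# ~ negates the mask, so only rows whose city is NOT in the closed list remain
sample_open = sample[~sample['city'].isin(closed_cities)]
print(sample_open)
```

Like the `drop` version, this keeps the original index labels of the surviving rows.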


### 5C. Basic Descriptive Analysis

In [48]:
# There aren't any numeric columns to run stats on. So, counting value counts in the "object" columns

df_tz['Location'].value_counts()

54th St Enid Plant    1
New London Plant      1
                     ..
Dakota City Plant     1
Zeeland Plant         1
Name: Location, Length: 111, dtype: int64

In [49]:
df_tz['address'].value_counts()

57 Melvin Clark Road    3
2294 KY 90 W            2
                       ..
1131 Dakota Ave         1
8300 96th Avenue        1
Name: address, Length: 103, dtype: int64

In [50]:
df_tz['city'].value_counts()

Enid          4
Wilkesboro    3
             ..
Dallas        1
Zeeland       1
Name: city, Length: 94, dtype: int64

In [54]:
# making it show me all the rows
pd.set_option('display.max_rows', None)
df_tz['state'].value_counts().to_frame()

Unnamed: 0,state
Arkansas,18
Texas,10
Iowa,8
Alabama,7
Nebraska,7
Missouri,6
North Carolina,6
Tennessee,5
Kansas,5
Kentucky,5


In [55]:
df_tz['zip'].value_counts()

73701         4
28697         3
36027         3
68748         2
72756         2
50588         2
72638         2
31730         2
17557         2
42602         2
72764         2
50703         2
54961         1
68137         1
76180         1
64854         1
38059         1
99363         1
38261         1
72114         1
71852         1
71602         1
28112         1
65708         1
27332         1
52738         1
46947         1
50220         1
30161         1
04102         1
23442         1
68462         1
48091         1
72958         1
08360         1
31092         1
76384         1
72704         1
72802         1
68450         1
42452         1
64507         1
75090         1
37160         1
78155         1
65301         1
94580         1
35077         1
66106         1
68850         1
67505-1111    1
61257         1
32254         1
49684         1
39204         1
41001-9001    1
72830         1
28610         1
45246         1
60641         1
39183         1
75935         1
62232   

Looks like I need to update a couple of those zip codes to strip the ZIP+4 suffix (the digits after the hyphen).

In [56]:
# removing the latter parts of any zip codes that have the -

df_tz['zip'] = df_tz['zip'].str.split('-').str[0]
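
As a quick check of that split, here is a minimal sketch on a few values taken from the counts above, followed by a sanity check that every result is exactly five digits.

```python
import pandas as pd

zips = pd.Series(['67505-1111', '41001-9001', '04102', '73701'])

# Keep only the part before the hyphen (the 5-digit ZIP)
zip5 = zips.str.split('-').str[0]

# Sanity check: every value should now be exactly five digits
all_five_digits = zip5.str.fullmatch(r'\d{5}').all()
print(zip5.tolist(), all_five_digits)
```

Values without a hyphen pass through unchanged, because splitting them yields a single-element list.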

In [57]:
df_tz['zip'].value_counts()

73701    4
28697    3
36027    3
68748    2
72756    2
50588    2
72638    2
31730    2
17557    2
42602    2
72764    2
50703    2
54961    1
68137    1
76180    1
64854    1
38059    1
99363    1
38261    1
72114    1
71852    1
71602    1
28112    1
65708    1
27332    1
52738    1
46947    1
50220    1
30161    1
04102    1
23442    1
68462    1
48091    1
72958    1
08360    1
31092    1
76384    1
72704    1
72802    1
68450    1
42452    1
64507    1
75090    1
37160    1
78155    1
65301    1
94580    1
35077    1
66106    1
68850    1
67505    1
61257    1
32254    1
49684    1
39204    1
41001    1
72830    1
28610    1
45246    1
60641    1
39183    1
75935    1
62232    1
39051    1
75633    1
74728    1
42101    1
35031    1
72616    1
44001    1
79108    1
35950    1
64020    1
47112    1
51501    1
72901    1
67501    1
38343    1
77029    1
71801    1
76117    1
71944    1
37072    1
35906    1
67851    1
30040    1
66801    1
29640    1
63841    1
51442    1
72834    1

It worked. Looks like I've done all the checks on it I can do now, and it's a nice clean list.

In [58]:
# Exporting cleaned data

df_tz.to_csv(os.path.join(path, 'Data', 'Cleaned', 'All_Tyson_Addresses.csv'), index=False)

### 5D. Creating zip code list to be used for web scraper

In [120]:
# Creating a dataframe with just food processing facility zips but with no duplicates - to be used in the real estate scraper

# I came back in and modified this to do astype(str) so that I can preserve the leading zeroes
df_tzips = df_tz[['zip']].astype(str)

In [121]:
df_tzips.drop_duplicates(inplace=True)

df_tzips.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 0 to 112
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   zip     95 non-null     object
dtypes: object(1)
memory usage: 1.5+ KB


In [122]:
print(df_tzips)

       zip
0    73701
1    42602
3    35950
4    79108
5    44001
6    72764
7    72616
8    35031
9    42101
10   74728
11   31730
13   75633
14   39051
15   62232
16   75935
17   39183
18   60641
19   72756
20   45246
21   28610
22   72830
23   41001
24   39204
25   49684
26   64020
27   47112
28   51501
29   30040
30   68731
31   75237
32   24586
33   72834
34   51442
35   63841
36   29640
37   66801
39   36027
42   67851
43   72901
44   35906
46   37072
47   71944
48   72638
50   76117
51   71801
52   77029
53   38343
54   67505
55   67501
56   32254
57   61257
58   66106
59   68850
60   46947
61   52738
62   68748
64   27332
65   65708
66   28112
67   72114
68   71852
69   17557
71   54961
72   38059
73   64854
74   76180
75   38261
76   68137
77   99363
78   50220
79   71602
81   04102
83   42452
84   30161
85   35077
86   94580
87   65301
88   78155
89   37160
90   75090
91   64507
92   50588
94   68450
95   23442
96   72802
98   72704
100  76384
101  31092
102  08360
103  72958

In [123]:
# Exporting zips only list

df_tzips.to_csv(os.path.join(path, 'Data', 'Cleaned', 'Tyson_zips_only.csv'), index=False, header=False)

It seems like it worked (the leading zeroes are preserved) if I open the file in Notepad or something similar. Not sure why they don't show up when I bring it back in here. I guess pandas automatically imports the column as a number unless told otherwise?

In [131]:
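# Checking to make sure that zip codes with leading zeroes remain intact

# Importing dataframe
# Note: the file was exported with header=False, so without header=None here
# pandas treats the first zip code (73701) as the column name
df_tzips_import = pd.read_csv(os.path.join(path, 'Data', 'Cleaned', 'Tyson_zips_only.csv'), dtype ='str', index_col = False)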

In [132]:
# Set display options to print all rows
pd.set_option('display.max_rows', None)

df_tzips_import.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   73701   94 non-null     object
dtypes: object(1)
memory usage: 880.0+ bytes


In [133]:
print(df_tzips_import)

    73701
0   42602
1   35950
2   79108
3   44001
4   72764
5   72616
6   35031
7   42101
8   74728
9   31730
10  75633
11  39051
12  62232
13  75935
14  39183
15  60641
16  72756
17  45246
18  28610
19  72830
20  41001
21  39204
22  49684
23  64020
24  47112
25  51501
26  30040
27  68731
28  75237
29  24586
30  72834
31  51442
32  63841
33  29640
34  66801
35  36027
36  67851
37  72901
38  35906
39  37072
40  71944
41  72638
42  76117
43  71801
44  77029
45  38343
46  67505
47  67501
48  32254
49  61257
50  66106
51  68850
52  46947
53  52738
54  68748
55  27332
56  65708
57  28112
58  72114
59  71852
60  17557
61  54961
62  38059
63  64854
64  76180
65  38261
66  68137
67  99363
68  50220
69  71602
70  04102
71  42452
72  30161
73  35077
74  94580
75  65301
76  78155
77  37160
78  75090
79  64507
80  50588
81  68450
82  23442
83  72802
84  72704
85  76384
86  31092
87  08360
88  72958
89  48091
90  50703
91  68462
92  28697
93  49464


It didn't work the first time, so I'm going to convert the zip code column to "string" before exporting it, then try again and check it again.

OK, after playing with it a bit, I finally figured it out. Importing CSVs like that in pandas automatically infers a numeric dtype (which drops the leading zeroes) unless I specify that the dtype is "str". Then it'll show correctly.
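
The round trip can be checked in isolation. This sketch assumes an in-memory buffer in place of the real file and a made-up three-row list. It passes `header=None` so the first zip is not consumed as a column name, and `dtype=str` so the leading zeroes survive.

```python
import io
import pandas as pd

zips = pd.DataFrame({'zip': ['73701', '04102', '08360']})

# Write without header or index, mirroring the export above (buffer instead of a file on disk)
buffer = io.StringIO()
zips.to_csv(buffer, index=False, header=False)
buffer.seek(0)

# header=None: the file has no header row, so the first zip stays in the data
# dtype=str: stops pandas from parsing the zips as integers and dropping leading zeroes
restored = pd.read_csv(buffer, header=None, names=['zip'], dtype=str)
print(restored)
```

With these two arguments the row count after import should match the row count before export.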

## 6. Realtor.com scraped homes for sale listings for zip codes in list

### 6A. Web Scraping Script

In [224]:
# Below is the code I wrote to scrape realtor.com for-sale listings for each zip code in the list.
# I developed it in another notebook (which is a bit messy from all the trial and error).
# I wasn't going to run it again in this notebook, because it takes a few hours to run,
# but I had to tweak it so that the zip code being searched is included in the dataframe.
# So I ran it again (more than once, because it got caught once).
# I wrote it with help from a number of tutorials, ChatGPT and Google Bard, and other people,
# improving it incrementally over a long process until it worked.

import os
import requests
from bs4 import BeautifulSoup
import pandas as pd
import datetime
import time
import random
import csv
import re

start_time = time.time()

# Defining path
path=r'D:\Adam\Employment\Data Analysis Course\Final Data Project'

# Defining location of zipfile (raw string so the backslashes aren't treated as escape sequences)
zipfile = r'D:\Adam\Employment\Data Analysis Course\Final Data Project\Data\Cleaned\Tyson_zips_only.csv'

# Open the CSV file that contains the list of zip codes.
with open(zipfile, 'r') as f:
    reader = csv.reader(f)
    zip_codes = list(reader)

data = []    
    
# Loop through the list of zip codes.
for zip_code in zip_codes:
    print(f"Processing zip code: {zip_code[0]}")
    
    # Add a random delay before going to next zip code
    time.sleep(random.randint(30, 90))
    
    # Loop through the page numbers (1-19; the upper bound of range is exclusive).
    for page_num in range(1, 20):
        print(f"Processing page {page_num}")
    
        # Update the URL in the script.
        url_template = "https://www.realtor.com/realestateandhomes-search/{}/pg-{}"
        url = url_template.format(zip_code[0], page_num)

        # Set user-agent header to avoid bot detection
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

        # Add a random delay before making the request
        time.sleep(random.randint(8, 16))

        # Send HTTP GET request to the URL and get the HTML response
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            print("Request successful")
        else:
            print(f"Request failed with status code: {response.status_code}. No more listings pages found.")
            break

        # Parse the HTML response using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the container for all the properties
        container = soup.find('section', class_='srp-content')

        # Check if the container is empty
        if container is None:
            print("No listings found")
            break

        # Check if the page contains the text "We're Sorry"
        if "We're Sorry" in soup.text:
            print("No more pages")
            break

        # Find all the listings in the container
        listings = container.find_all('div', attrs={'data-label': 'property-card'})

        # Loop through each listing and extract the details
        for listing in listings:
            if listing is not None:
                try:
                    address1 = listing.find('div', attrs={'data-label': 'pc-address'}).text.strip()
                except AttributeError:
                    address1 = ''

                try:
                    address2 = listing.find('div', attrs={'data-label': 'pc-address-second'}).text.strip()
                except AttributeError:
                    address2 = ''

                try:
                    availability = listing.find('span', class_='statusText').text.strip()
                except AttributeError:
                    availability = ''

                try:
                    price = listing.find('span', attrs={'data-label': 'pc-price'}).text.strip().replace('From', '').replace(',', '').replace('$', '')
                except AttributeError:
                    price = ''

                try:
                    beds = listing.find('li', attrs={'data-label': 'pc-meta-beds'}).text.strip().replace('bed', '')
                except AttributeError:
                    beds = ''

                try:
                    baths = listing.find('li', attrs={'data-label': 'pc-meta-baths'}).text.strip().replace('bath', '')
                except AttributeError:
                    baths = ''

                try:
                    sqft = listing.find('li', attrs={'data-label': 'pc-meta-sqft'}).text.strip().replace(',', '').split('sqft')[0]
                except AttributeError:
                    sqft = ''

            else:
                print('No listing found.')

            # Appending the data
            data.append([zip_code[0], address1, address2, availability, price, beds, baths, sqft])
       
        # stopping pagination if fewer than 40 listings found
        if soup.find(id="srp-footer-found-listing") is None:
            print("Not enough listings found for this zip code")
            break
        
        if int(soup.find(id="srp-footer-found-listing").text.strip().split()[1]) < 40:
            print("No more listings found")
            break
        
        else:
            continue
        
    print("Finished processing zip code")

# Create a Pandas dataframe from the extracted data
df = pd.DataFrame(data, columns=['TysonZip', 'Address1', 'Address2', 'Availability', 'Price', 'Beds', 'Baths', 'Sqft'])

# get today's date in the format YYYY-MM-DD
today = datetime.datetime.today().strftime('%Y-%m-%d')

# Export the dataframe to a CSV file
df.to_csv(os.path.join(path, 'Data', 'Scraped', f'TysonZips_forsale_{today}.csv'), index=False)

# Print the dataframe
print(df)

end_time = time.time()

execution_time = end_time - start_time

print("Execution time:", execution_time)

Processing zip code: 73701
Processing page 1
Request successful
Processing page 2
Request successful
Processing page 3
Request failed with status code: 404. No more listings pages found.
Finished processing zip code
Processing zip code: 42602
Processing page 1
Request successful
Processing page 2
Request successful
Processing page 3
Request failed with status code: 404. No more listings pages found.
Finished processing zip code
Processing zip code: 35950
Processing page 1
Request successful
Processing page 2
Request successful
Processing page 3
Request failed with status code: 404. No more listings pages found.
Finished processing zip code
Processing zip code: 79108
Processing page 1
Request successful
Processing page 2
Request successful
Processing page 3
Request failed with status code: 404. No more listings pages found.
Finished processing zip code
Processing zip code: 44001
Processing page 1
Request successful
Processing page 2
Request successful
Processing page 3
Request failed wi

Request successful
Processing page 4
Request successful
Processing page 5
Request successful
Processing page 6
Request successful
Processing page 7
Request failed with status code: 404. No more listings pages found.
Finished processing zip code
Processing zip code: 66801
Processing page 1
Request successful
Processing page 2
Request successful
Processing page 3
Request failed with status code: 404. No more listings pages found.
Finished processing zip code
Processing zip code: 36027
Processing page 1
Request successful
Processing page 2
Request successful
Processing page 3
Request successful
Processing page 4
Request successful
Processing page 5
Request failed with status code: 404. No more listings pages found.
Finished processing zip code
Processing zip code: 67851
Processing page 1
Request successful
Not enough listings found for this zip code
Finished processing zip code
Processing zip code: 72901
Processing page 1
Request successful
Processing page 2
Request successful
Processing 

Request successful
Processing page 12
Request successful
Processing page 13
Request successful
Processing page 14
Request successful
Processing page 15
Request successful
Processing page 16
Request successful
Processing page 17
Request successful
Processing page 18
Request successful
Processing page 19
Request successful
Finished processing zip code
Processing zip code: 37160
Processing page 1
Request successful
Processing page 2
Request successful
Processing page 3
Request successful
Processing page 4
Request successful
No listings found
Finished processing zip code
Processing zip code: 75090
Processing page 1
Request successful
Processing page 2
Request successful
Processing page 3
Request successful
Processing page 4
Request successful
Processing page 5
Request successful
Processing page 6
Request successful
Processing page 7
Request failed with status code: 404. No more listings pages found.
Finished processing zip code
Processing zip code: 64507
Processing page 1
Request successfu

The scraper got caught partway through and I had to turn on a VPN, so I'll have to run the remaining zip codes separately and combine the lists.
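
One way to work out which zip codes still need scraping is to compare the full list against the zips already present in the partial results. The sketch below uses made-up stand-ins for both; `all_zips` and `scraped` are hypothetical names, not objects from this notebook.

```python
import pandas as pd

# Hypothetical stand-ins: the full zip list and the rows collected before the scraper was blocked
all_zips = pd.Series(['73701', '42602', '71801', '77029'])
scraped = pd.DataFrame({'TysonZip': ['73701', '73701', '42602']})

# Zips that already have at least one scraped row
done = set(scraped['TysonZip'])

# Everything in the full list that is not in the scraped results
skipped = all_zips[~all_zips.isin(done)]
print(skipped.tolist())
```

A zip code that genuinely has no listings would also show up as skipped, so the resulting list is an upper bound on what needs re-running.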

In [225]:
# Running it again with the skipped zips

import os
import requests
from bs4 import BeautifulSoup
import pandas as pd
import datetime
import time
import random
import csv
import re

start_time = time.time()

# Defining path
path=r'D:\Adam\Employment\Data Analysis Course\Final Data Project'

# Defining location of zipfile (raw string so the backslashes aren't treated as escape sequences)
zipfile = r'D:\Adam\Employment\Data Analysis Course\Final Data Project\Data\Cleaned\Tyson_skippedZips.csv'

# Open the CSV file that contains the list of zip codes.
with open(zipfile, 'r') as f:
    reader = csv.reader(f)
    zip_codes = list(reader)

data = []    
    
# Loop through the list of zip codes.
for zip_code in zip_codes:
    print(f"Processing zip code: {zip_code[0]}")
    
    # Add a random delay before going to next zip code
    time.sleep(random.randint(30, 90))
    
    # Loop through the page numbers (1-19; the upper bound of range is exclusive).
    for page_num in range(1, 20):
        print(f"Processing page {page_num}")
    
        # Update the URL in the script.
        url_template = "https://www.realtor.com/realestateandhomes-search/{}/pg-{}"
        url = url_template.format(zip_code[0], page_num)

        # Set user-agent header to avoid bot detection
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

        # Add a random delay before making the request
        time.sleep(random.randint(8, 16))

        # Send HTTP GET request to the URL and get the HTML response
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            print("Request successful")
        else:
            print(f"Request failed with status code: {response.status_code}. No more listings pages found.")
            break

        # Parse the HTML response using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the container for all the properties
        container = soup.find('section', class_='srp-content')

        # Check if the container is empty
        if container is None:
            print("No listings found")
            break

        # Check if the page contains the text "We're Sorry"
        if "We're Sorry" in soup.text:
            print("No more pages")
            break

        # Find all the listings in the container
        listings = container.find_all('div', attrs={'data-label': 'property-card'})

        # Loop through each listing and extract the details
        for listing in listings:
            if listing is not None:
                try:
                    address1 = listing.find('div', attrs={'data-label': 'pc-address'}).text.strip()
                except AttributeError:
                    address1 = ''

                try:
                    address2 = listing.find('div', attrs={'data-label': 'pc-address-second'}).text.strip()
                except AttributeError:
                    address2 = ''

                try:
                    availability = listing.find('span', class_='statusText').text.strip()
                except AttributeError:
                    availability = ''

                try:
                    price = listing.find('span', attrs={'data-label': 'pc-price'}).text.strip().replace('From', '').replace(',', '').replace('$', '')
                except AttributeError:
                    price = ''

                try:
                    beds = listing.find('li', attrs={'data-label': 'pc-meta-beds'}).text.strip().replace('bed', '')
                except AttributeError:
                    beds = ''

                try:
                    baths = listing.find('li', attrs={'data-label': 'pc-meta-baths'}).text.strip().replace('bath', '')
                except AttributeError:
                    baths = ''

                try:
                    sqft = listing.find('li', attrs={'data-label': 'pc-meta-sqft'}).text.strip().replace(',', '').split('sqft')[0]
                except AttributeError:
                    sqft = ''

            else:
                print('No listing found.')

            # Appending the data
            data.append([zip_code[0], address1, address2, availability, price, beds, baths, sqft])
       
        # stopping pagination if fewer than 40 listings found
        if soup.find(id="srp-footer-found-listing") is None:
            print("Not enough listings found for this zip code")
            break
        
        if int(soup.find(id="srp-footer-found-listing").text.strip().split()[1]) < 40:
            print("No more listings found")
            break
        
        else:
            continue
        
    print("Finished processing zip code")

# Create a Pandas dataframe from the extracted data
df = pd.DataFrame(data, columns=['TysonZip', 'Address1', 'Address2', 'Availability', 'Price', 'Beds', 'Baths', 'Sqft'])

# get today's date in the format YYYY-MM-DD
today = datetime.datetime.today().strftime('%Y-%m-%d')

# Export the dataframe to a CSV file
df.to_csv(os.path.join(path, 'Data', 'Scraped', f'Tyson_skippedZips_forsale_{today}.csv'), index=False)

# Print the dataframe
print(df)

end_time = time.time()

execution_time = end_time - start_time

print("Execution time:", execution_time)

Processing zip code: 71801
Processing page 1
Request successful
Processing page 2
Request successful
Processing page 3
Request failed with status code: 404. No more listings pages found.
Finished processing zip code
Processing zip code: 77029
Processing page 1
Request successful
Processing page 2
Request successful
Processing page 3
Request failed with status code: 404. No more listings pages found.
Finished processing zip code
Processing zip code: 38343
Processing page 1
Request successful
Processing page 2
Request successful
Processing page 3
Request failed with status code: 404. No more listings pages found.
Finished processing zip code
Processing zip code: 67505
Processing page 1
Request successful
No more listings found
Finished processing zip code
Processing zip code: 67501
Processing page 1
Request successful
Processing page 2
Request successful
Processing page 3
Request failed with status code: 404. No more listings pages found.
Finished processing zip code
Processing zip code:

In [248]:
# I saw that two Tyson zips weren't represented in the results. After finding out which ones and checking
# realtor.com myself, I saw that they do have listings. So I'm going to run my script again on just those two.

import os
import requests
from bs4 import BeautifulSoup
import pandas as pd
import datetime
import time
import random
import csv
import re

start_time = time.time()

# Defining path
path=r'D:\Adam\Employment\Data Analysis Course\Final Data Project'

# Defining location of zipfile (raw string so the backslashes aren't treated as escape sequences)
zipfile = r'D:\Adam\Employment\Data Analysis Course\Final Data Project\Data\Cleaned\TwoTysonZips.csv'

# Open the CSV file that contains the list of zip codes.
with open(zipfile, 'r') as f:
    reader = csv.reader(f)
    zip_codes = list(reader)

data = []    
    
# Loop through the list of zip codes.
for zip_code in zip_codes:
    print(f"Processing zip code: {zip_code[0]}")
    
    # Add a random delay before going to next zip code
    time.sleep(random.randint(30, 90))
    
    # Loop through the page numbers (1-19; the upper bound of range is exclusive).
    for page_num in range(1, 20):
        print(f"Processing page {page_num}")
    
        # Update the URL in the script.
        url_template = "https://www.realtor.com/realestateandhomes-search/{}/pg-{}"
        url = url_template.format(zip_code[0], page_num)

        # Set user-agent header to avoid bot detection
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

        # Add a random delay before making the request
        time.sleep(random.randint(8, 16))

        # Send HTTP GET request to the URL and get the HTML response
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            print("Request successful")
        else:
            print(f"Request failed with status code: {response.status_code}. No more listings pages found.")
            break

        # Parse the HTML response using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the container for all the properties
        container = soup.find('section', class_='srp-content')

        # Check if the container is empty
        if container is None:
            print("No listings found")
            break

        # Check if the page contains the text "We're Sorry"
        if "We're Sorry" in soup.text:
            print("No more pages")
            break

        # Find all the listings in the container
        listings = container.find_all('div', attrs={'data-label': 'property-card'})

        # Loop through each listing and extract the details
        for listing in listings:
            if listing is not None:
                try:
                    address1 = listing.find('div', attrs={'data-label': 'pc-address'}).text.strip()
                except AttributeError:
                    address1 = ''

                try:
                    address2 = listing.find('div', attrs={'data-label': 'pc-address-second'}).text.strip()
                except AttributeError:
                    address2 = ''

                try:
                    availability = listing.find('span', class_='statusText').text.strip()
                except AttributeError:
                    availability = ''

                try:
                    price = listing.find('span', attrs={'data-label': 'pc-price'}).text.strip().replace('From', '').replace(',', '').replace('$', '')
                except AttributeError:
                    price = ''

                try:
                    beds = listing.find('li', attrs={'data-label': 'pc-meta-beds'}).text.strip().replace('bed', '')
                except AttributeError:
                    beds = ''

                try:
                    baths = listing.find('li', attrs={'data-label': 'pc-meta-baths'}).text.strip().replace('bath', '')
                except AttributeError:
                    baths = ''

                try:
                    sqft = listing.find('li', attrs={'data-label': 'pc-meta-sqft'}).text.strip().replace(',', '').split('sqft')[0]
                except AttributeError:
                    sqft = ''

            else:
                print('No listing found.')

            # Appending the data
            data.append([zip_code[0], address1, address2, availability, price, beds, baths, sqft])
       
        # stopping pagination if fewer than 40 listings found
        if soup.find(id="srp-footer-found-listing") is None:
            print("Not enough listings found for this zip code")
            break
        
        if int(soup.find(id="srp-footer-found-listing").text.strip().split()[1]) < 40:
            print("No more listings found")
            break
        
        else:
            continue
        
    print("Finished processing zip code")

# Create a Pandas dataframe from the extracted data
df = pd.DataFrame(data, columns=['TysonZip', 'Address1', 'Address2', 'Availability', 'Price', 'Beds', 'Baths', 'Sqft'])

# get today's date in the format YYYY-MM-DD
today = datetime.datetime.today().strftime('%Y-%m-%d')

# Export the dataframe to a CSV file
df.to_csv(os.path.join(path, 'Data', 'Scraped', f'2TysonZips_forsale_{today}.csv'), index=False)

# Print the dataframe
print(df)

end_time = time.time()

execution_time = end_time - start_time

print("Execution time:", execution_time)

Processing zip code: 31730
Processing page 1
Request successful
No more listings found
Finished processing zip code
Processing zip code: 76117
Processing page 1
Request successful
Processing page 2
Request successful
Processing page 3
Request failed with status code: 404. No more listings pages found.
Finished processing zip code
   TysonZip                                         Address1  \
0     31730           1432 County Line Rd, Camilla, GA 31730   
1     31730                53 E Morgan St, Camilla, GA 31730   
2     31730               555 Moultrie Rd, Camilla, GA 31730   
3     31730                211 S Scott St, Camilla, GA 31730   
..      ...                                              ...   
89    76117  3516 Tourist Dr, North Richland Hills, TX 76117   
90    76117              513 Fieldstone Ln, Haslet, TX 76117   
91    76117            4609 Nadine Dr, Haltom City, TX 76117   
92    76117     6208 Little Fossil Rd, Haltom City, TX 76117   

                          A
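
Each field in the scraper cells above is wrapped in its own try/except, which makes it easy for one fallback to drift out of step with the variable it belongs to. A small helper that returns a default when a tag is missing keeps that logic in one place. This is a sketch only: `text_or_default` is a hypothetical helper, and the HTML is a hand-written snippet, not real realtor.com markup.

```python
from bs4 import BeautifulSoup

def text_or_default(parent, name, default='', **find_kwargs):
    """Return the stripped text of the first matching tag, or `default` if none is found."""
    tag = parent.find(name, **find_kwargs)
    return tag.text.strip() if tag is not None else default

# Tiny hand-written snippet standing in for one property card (an assumption for illustration)
html = '''
<div data-label="property-card">
  <div data-label="pc-address"> 1 Main St </div>
  <span data-label="pc-price">$100,000</span>
</div>
'''
card = BeautifulSoup(html, 'html.parser').find('div', attrs={'data-label': 'property-card'})

address1 = text_or_default(card, 'div', attrs={'data-label': 'pc-address'})
price = text_or_default(card, 'span', attrs={'data-label': 'pc-price'}).replace(',', '').replace('$', '')
# No statusText span in the snippet, so this falls back to the default ''
availability = text_or_default(card, 'span', class_='statusText')

print([address1, price, availability])
```

Because every field goes through the same function, a missing tag always yields that field's own default rather than a value left over from the previous listing.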

In [249]:
# Importing the three scraped lists so they can be combined into a single csv
# dtype keeps TysonZip as a string so leading zeroes survive the import

df_rs1 = pd.read_csv(os.path.join(path, 'Data', 'Scraped', 'TysonZips_forsale_2023-05-05.csv'), dtype={'TysonZip': str}, index_col = False)
df_rs2 = pd.read_csv(os.path.join(path, 'Data', 'Scraped', 'Tyson_skippedZips_forsale_2023-05-05.csv'), dtype={'TysonZip': str}, index_col = False)
df_rs3 = pd.read_csv(os.path.join(path, 'Data', 'Scraped', '2TysonZips_forsale_2023-05-05.csv'), dtype={'TysonZip': str}, index_col = False)

In [253]:
# combining the three dataframes

df_rs = pd.concat([df_rs1, df_rs2, df_rs3])

# resetting the index of the combined dataset
df_rs = df_rs.reset_index(drop=True)

# Making sure TysonZip stays as string (I hope)
df_rs['TysonZip'] = df_rs['TysonZip'].astype(str)

# exporting to a combined csv
df_rs.to_csv(os.path.join(path, 'Data', 'Scraped', f'TysonZips_forsale_combined_{today}.csv'), index=False)

In [254]:
# Importing scraped data
df_rs = pd.read_csv(os.path.join(path, 'Data', 'Scraped', f'TysonZips_forsale_combined_{today}.csv'), dtype={'TysonZip': str}, index_col = False)

### 6B. Consistency Checks

In [255]:
# checking dataset info

df_rs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9248 entries, 0 to 9247
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   TysonZip      9248 non-null   object 
 1   Address1      9227 non-null   object 
 2   Address2      9227 non-null   object 
 3   Availability  9248 non-null   object 
 4   Price         9248 non-null   object 
 5   Beds          6578 non-null   object 
 6   Baths         6538 non-null   object 
 7   Sqft          6331 non-null   float64
dtypes: float64(1), object(7)
memory usage: 578.1+ KB


These dtypes need to be cleaned up. Before I run my script again, I think I'll code that cleanup into the script.

In [256]:
print(df_rs['Price'])

0       315000
1        69900
2        84900
3        35000
         ...  
9244    114000
9245    589000
9246    235000
9247    165000
Name: Price, Length: 9248, dtype: object


I'm going back and re-using some of this code to make it easier on myself.  The output above no longer has the "From" on some of it because I changed my scraper script, so the section below is from the former implementation of it.

In [145]:
# Need to clean up the way some of those values were created (and later I'll fix that in the scraper script)

df_rs['Price'] = df_rs['Price'].str.replace('From', '')

In [257]:
# Looking for any other strange values in the Price column:

df_rs['Price'].value_counts()

150000     76
275000     71
250000     71
45000      66
           ..
1199995     1
687000      1
829900      1
245500      1
Name: Price, Length: 2645, dtype: int64

In [258]:
# Found a "Contact for Price" row and need to drop the whole row

df_rs.drop(df_rs[df_rs['Price'] == 'Contact For Price'].index, inplace=True)

In [259]:
df_rs['Price'] = df_rs['Price'].astype('float64')

In [260]:
df_rs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9240 entries, 0 to 9247
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   TysonZip      9240 non-null   object 
 1   Address1      9219 non-null   object 
 2   Address2      9219 non-null   object 
 3   Availability  9240 non-null   object 
 4   Price         9240 non-null   float64
 5   Beds          6573 non-null   object 
 6   Baths         6533 non-null   object 
 7   Sqft          6327 non-null   float64
dtypes: float64(2), object(6)
memory usage: 649.7+ KB


This looks pretty good now
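A more forgiving way to do the same Price conversion is to coerce anything non-numeric to NaN and then look at what failed, instead of dropping one known bad string by hand. This is only a sketch on a made-up `sample` frame; the values are hypothetical stand-ins for what the scraper returned.

```python
import pandas as pd

# Hypothetical sample mimicking the scraped Price column (strings, some non-numeric)
sample = pd.DataFrame({'Price': ['315000', 'From 69900', 'Contact For Price', '84900']})

raw = sample['Price']

# Strip the "From" prefix and whitespace, then coerce: unparseable text becomes NaN
numeric = pd.to_numeric(raw.str.replace('From', '', regex=False).str.strip(), errors='coerce')

# Original strings that could not be converted, kept for inspection
unparsed = raw[numeric.isna()]

# Store the numeric prices and drop the rows that had no usable price
sample['Price'] = numeric
sample = sample.dropna(subset=['Price'])
```

Printing `unparsed` before the drop shows exactly which strings were thrown away, which is handy if the site introduces a new placeholder later.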

In [261]:
# Checking for duplicates (and removing them if there are any) - then checking if the numbers change

df_rs.drop_duplicates(inplace=True)

df_rs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8979 entries, 0 to 9244
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   TysonZip      8979 non-null   object 
 1   Address1      8959 non-null   object 
 2   Address2      8959 non-null   object 
 3   Availability  8979 non-null   object 
 4   Price         8979 non-null   float64
 5   Beds          6431 non-null   object 
 6   Baths         6391 non-null   object 
 7   Sqft          6188 non-null   float64
dtypes: float64(2), object(6)
memory usage: 631.3+ KB


Looks like there were 261 duplicates (9,240 rows down to 8,979), probably because of the way I re-ran some zip codes when the scraper got caught and then combined the two lists.
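The duplicate count can also be read directly rather than by comparing two `info()` outputs. A minimal sketch, using a small hypothetical frame in place of `df_rs`:

```python
import pandas as pd

# Hypothetical sample: the second and third rows are exact copies of each other
sample = pd.DataFrame({
    'TysonZip': ['72756', '72756', '72756', '30040'],
    'Address1': ['1 Main St', '2 Oak Ave', '2 Oak Ave', '3 Elm Rd'],
    'Price': [100000.0, 250000.0, 250000.0, 400000.0],
})

# duplicated() flags every repeat of an earlier row, so the sum is the
# number of rows that drop_duplicates() will remove
n_dupes = int(sample.duplicated().sum())

deduped = sample.drop_duplicates()
```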

In [262]:
# Checking for missing values
missing_values = df_rs.isnull().sum()

# total rows in dataframe
total_rows = df_rs.shape[0]

# making it show me all the rows
pd.options.display.max_rows = len(df_rs.dtypes)

# calculating percentage of missing values
percent_missing = ((missing_values / total_rows) * 100).round(2)

# Making a dataframe to show missing values
missing_value_table = pd.concat([missing_values, percent_missing], axis=1)

# Naming columns for sorting
missing_value_table.columns = ['Missing Values', 'Percent Missing']

# sorting the view by percent missing (lowest first)
missing_value_table.sort_values('Percent Missing', ascending=True, inplace=True)

print(missing_value_table)

              Missing Values  Percent Missing
TysonZip                   0             0.00
Availability               0             0.00
Price                      0             0.00
Address1                  20             0.22
Address2                  20             0.22
Beds                    2548            28.38
Baths                   2588            28.82
Sqft                    2791            31.08


In [263]:
# Sqft has a lot of missing values, and from looking at the data, the addresses with missing sqft values appear to be
# lots for sale, which I don't believe helps the purpose of this analysis. So I'll remove all rows with missing
# sqft values. I'm not worried much about missing beds and baths info

df_rs.dropna(subset=['Sqft'], inplace=True)
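One quick way to sanity-check the "missing Sqft means it's a lot" idea before dropping is to see how many of those rows also have no beds and no baths. This is only a sketch on hypothetical rows, and a high share is suggestive rather than proof.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two land-like rows, one house that simply lacks Sqft
sample = pd.DataFrame({
    'Address1': ['1 Main St', 'Lot 7 Oak Ave', '3 Elm Rd', 'Tract B'],
    'Beds': ['3', None, '4', None],
    'Baths': ['2', None, '2.5', None],
    'Sqft': [1200.0, np.nan, np.nan, np.nan],
})

# Rows with no square footage
no_sqft = sample[sample['Sqft'].isna()]

# Share of those rows that also have neither beds nor baths
share_land_like = (no_sqft['Beds'].isna() & no_sqft['Baths'].isna()).mean()
```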

In [264]:
df_rs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6188 entries, 0 to 9242
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   TysonZip      6188 non-null   object 
 1   Address1      6181 non-null   object 
 2   Address2      6181 non-null   object 
 3   Availability  6188 non-null   object 
 4   Price         6188 non-null   float64
 5   Beds          6132 non-null   object 
 6   Baths         6106 non-null   object 
 7   Sqft          6188 non-null   float64
dtypes: float64(2), object(6)
memory usage: 435.1+ KB


In [265]:
# Figuring out what the 20 rows with missing addresses are (if they weren't already dropped)

# Select rows with missing values in the "Address1" column
missing_address1 = df_rs[df_rs['Address1'].isna()]

# Print the selected rows
print(missing_address1)

     TysonZip Address1 Address2 Availability     Price Beds Baths    Sqft
1892    39183      NaN      NaN     For Sale   67500.0    2     1   920.0
1924    39183      NaN      NaN      Pending  169900.0    3     2  1500.0
2132    72756      NaN      NaN     For Sale  399950.0    2     1  1104.0
3374    30040      NaN      NaN     For Sale  645000.0    5     4  3156.0
6452    78155      NaN      NaN     For Sale  154999.0    4     2  1176.0
7047    78155      NaN      NaN     For Sale  295000.0    3     2  1188.0
7725    72802      NaN      NaN      Pending  160000.0    3     1  1120.0


In [266]:
# Interestingly they have info without addresses.  I'm just going to drop them

df_rs.dropna(subset=['Address1'], inplace=True)

In [267]:
df_rs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6181 entries, 0 to 9242
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   TysonZip      6181 non-null   object 
 1   Address1      6181 non-null   object 
 2   Address2      6181 non-null   object 
 3   Availability  6181 non-null   object 
 4   Price         6181 non-null   float64
 5   Beds          6125 non-null   object 
 6   Baths         6099 non-null   object 
 7   Sqft          6181 non-null   float64
dtypes: float64(2), object(6)
memory usage: 434.6+ KB


In [268]:
# counting number of unique values in each column

print(df_rs.nunique())

TysonZip          95
Address1        6161
Address2         274
Availability       5
Price           2070
Beds              20
Baths             33
Sqft            2324
dtype: int64


The number of different unique values in Beds and in Baths is surprising. I'll have to look into that later during the basic statistical analysis.

### 6C. Data Cleaning

In [269]:
# I need to split up the Address2 data.  I'm not sure that the Address1 data is important for my goals so I'll only
# wrangle that later if I need to

df_rs['City'] = df_rs['Address2'].str.split(',', expand=True)[0]

In [270]:
# Now I'm making a new column for the state abbreviation

df_rs['State'] = df_rs['Address2'].str.split(', ', expand=True)[1]
df_rs['State'] = df_rs['State'].str.split(' ', expand=True)[0]

In [271]:
# Now I'm going to try the same thing with zip code. I know some listings didn't capture a zip code, but hopefully that won't cause an error

df_rs['Zip'] = df_rs['Address2'].str.split(', ', expand=True)[1]
df_rs['Zip'] = df_rs['Zip'].str.split(' ', expand=True)[1]

In [272]:
df_rs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6181 entries, 0 to 9242
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   TysonZip      6181 non-null   object 
 1   Address1      6181 non-null   object 
 2   Address2      6181 non-null   object 
 3   Availability  6181 non-null   object 
 4   Price         6181 non-null   float64
 5   Beds          6125 non-null   object 
 6   Baths         6099 non-null   object 
 7   Sqft          6181 non-null   float64
 8   City          6181 non-null   object 
 9   State         6181 non-null   object 
 10  Zip           5606 non-null   object 
dtypes: float64(2), object(9)
memory usage: 579.5+ KB


Looks like it worked great! Zip is missing for 575 rows, which is expected since some listings didn't capture a zip code.
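The three splits can also be done in one pass with `str.extract` and named groups. This is a sketch that assumes every Address2 value looks like "City, ST" or "City, ST 12345", as in the value counts seen here; the regex itself is my assumption, not something taken from the scraper.

```python
import pandas as pd

# Hypothetical sample covering both Address2 shapes (with and without a zip)
sample = pd.DataFrame({'Address2': ['Seguin, TX 78155', 'Seguin, TX', 'Broken Bow, OK 74728']})

# City = everything before the comma, State = two capitals, Zip = optional five digits
pattern = r'^(?P<City>[^,]+),\s*(?P<State>[A-Z]{2})(?:\s+(?P<Zip>\d{5}))?$'

# One column per named group; rows without a zip get NaN in Zip
parts = sample['Address2'].str.extract(pattern)
sample = sample.join(parts)
```

Because Zip stays a string here, leading zeros such as the one in 08360 are kept.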

### 6D. Basic Descriptive Analysis

In [273]:
# Checking basic stats on each column

# making it show me all the rows
pd.options.display.max_rows = len(df_rs.dtypes)

df_rs.describe().applymap(lambda x: f"{x:0.2f}")

Unnamed: 0,Price,Sqft
count,6181.0,6181.0
mean,399690.81,2123.99
std,644334.08,3757.5
min,1.0,288.0
25%,209900.0,1390.0
50%,305550.0,1808.0
75%,450000.0,2438.0
max,38500000.0,256672.0


I'd say that these results look about right, but the max values are pretty crazy!
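To see which listings sit behind the extreme values, `nlargest` and `nsmallest` pull out the rows directly. A small sketch on hypothetical rows (the addresses are made up):

```python
import pandas as pd

# Hypothetical sample with one implausibly low and one very high price
sample = pd.DataFrame({
    'Address1': ['A St', 'B St', 'C St', 'D St'],
    'Price': [1.0, 209900.0, 305550.0, 38500000.0],
    'Sqft': [900.0, 1390.0, 1808.0, 256672.0],
})

# The two most expensive listings and the single cheapest one
top_price = sample.nlargest(2, 'Price')
low_price = sample.nsmallest(1, 'Price')
```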

In [274]:
# Now I'm going to count unique values in each object column

df_rs['Address1'].value_counts()

68 Ogden Rd, Broken Bow, OK 74728            2
116 S Union St, Traverse City, MI 49684      2
1205 Secrest Commons Dr, Monroe, NC 28112    2
23 Double Oak Dr, Broken Bow, OK 74728       2
1208 Secrest Commons Dr, Monroe, NC 28112    2
                                            ..
618 Prairie Blvd, Dakota Dunes, SD 57049     1
Elm, Dakota City, NE 68731                   1
1307 Mulberry St, Dakota City, NE 68731      1
603 S 20th St, Dakota City, NE 68731         1
4701 Kindred St, Haltom City, TX 76117       1
Name: Address1, Length: 6161, dtype: int64

In [275]:
# dropping houses with duplicate addresses and sqft listed

df_rs.drop_duplicates(subset=['Address1', 'Sqft'], inplace=True)

In [276]:
df_rs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6173 entries, 0 to 9242
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   TysonZip      6173 non-null   object 
 1   Address1      6173 non-null   object 
 2   Address2      6173 non-null   object 
 3   Availability  6173 non-null   object 
 4   Price         6173 non-null   float64
 5   Beds          6117 non-null   object 
 6   Baths         6091 non-null   object 
 7   Sqft          6173 non-null   float64
 8   City          6173 non-null   object 
 9   State         6173 non-null   object 
 10  Zip           5598 non-null   object 
dtypes: float64(2), object(9)
memory usage: 578.7+ KB


In [277]:
df_rs['Address1'].value_counts()

162 Pr # 3358, Clarksville, AR 72830         2
170 Pr # 3397, Clarksville, AR 72830         2
1205 Secrest Commons Dr, Monroe, NC 28112    2
3143 AR Highway 174, Hope, AR 71801          2
1208 Secrest Commons Dr, Monroe, NC 28112    2
                                            ..
4955 Stallions Gait Rd, Cumming, GA 30040    1
4315 Alister Park Dr, Cumming, GA 30040      1
3160 Amble Valley Ave, Cumming, GA 30040     1
4915 Stallions Gait Rd, Cumming, GA 30040    1
4701 Kindred St, Haltom City, TX 76117       1
Name: Address1, Length: 6161, dtype: int64

In [278]:
# Making list of duplicates

duplicates_rs = df_rs[df_rs.duplicated(['Address1'], keep=False)]

# making it show me all the rows
pd.options.display.max_rows = len(duplicates_rs)

print(duplicates_rs)

     TysonZip                                     Address1  \
880     74728     17 Happy Place Cir, Broken Bow, OK 74728   
888     74728       23 Double Oak Dr, Broken Bow, OK 74728   
1052    74728     17 Happy Place Cir, Broken Bow, OK 74728   
1088    74728       16 Double Oak Dr, Broken Bow, OK 74728   
1193    74728       23 Double Oak Dr, Broken Bow, OK 74728   
1205    74728       16 Double Oak Dr, Broken Bow, OK 74728   
2477    72830         162 Pr # 3358, Clarksville, AR 72830   
2487    72830         162 Pr # 3358, Clarksville, AR 72830   
2495    72830         170 Pr # 3397, Clarksville, AR 72830   
2518    72830         170 Pr # 3397, Clarksville, AR 72830   
2891    49684  810 Cottageview Dr, Traverse City, MI 49684   
2922    49684  810 Cottageview Dr, Traverse City, MI 49684   
4702    71801          3143 AR Highway 174, Hope, AR 71801   
4710    71801          3143 AR Highway 174, Hope, AR 71801   
5243    28112    1303 Secrest Commons Dr, Monroe, NC 28112   
5265    

That doesn't seem to have gotten rid of all the duplicates, but I'm going to keep it as-is for now, because I'm not sure what's causing them.
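Since the earlier drop only removed rows where both Address1 and Sqft matched, any address that still repeats must have different Sqft values on its rows. To see what else differs, one option is to count distinct values per column within each repeated address. A sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical sample: two repeated addresses and one unique address
sample = pd.DataFrame({
    'Address1': ['17 Happy Place Cir', '17 Happy Place Cir',
                 '162 Pr # 3358', '162 Pr # 3358', '5 Elm Rd'],
    'Price': [450000.0, 450000.0, 120000.0, 125000.0, 90000.0],
    'Sqft': [1500.0, 1650.0, 1100.0, 1200.0, 900.0],
    'Availability': ['For Sale', 'For Sale', 'For Sale', 'Pending', 'For Sale'],
})

# Every row whose address appears more than once
dupes = sample[sample.duplicated('Address1', keep=False)]

# Distinct values per column for each repeated address;
# a count above 1 marks a column where the repeated rows disagree
differences = dupes.groupby('Address1').nunique()
```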

In [279]:
# Now I'm going to count unique values in each object column

df_rs['Address2'].value_counts()

Seguin, TX 78155           450
Broken Bow, OK 74728       311
Fayetteville, AR 72704     293
Seguin, TX                 263
Bowling Green, KY 42101    221
                          ... 
Eltopia, WA 99301            1
Allen, TX 75013              1
Arlington, TX 76018          1
Dallas, TX 75223             1
Dallas, TX 75204             1
Name: Address2, Length: 274, dtype: int64

In [280]:
# Now I'm going to count unique values in each object column

df_rs['Availability'].value_counts()

For Sale       4000
Pending        1780
Contingent      376
Foreclosure      12
Coming Soon       5
Name: Availability, dtype: int64

In [281]:
# Now I'm going to count unique values in each object column

# making it show me all the rows
pd.options.display.max_rows = len(df_rs['Beds'])

df_rs['Beds'].value_counts()

3         2871
4         1715
2          772
5          436
1          111
6           86
3.0         44
7           21
8           12
Studio      12
2.0         12
4.0         11
9            4
5.0          2
12           2
14           2
13           1
41           1
155          1
6.0          1
Name: Beds, dtype: int64

In [282]:
print(df_rs[df_rs['Beds'] == '155'])

     TysonZip                                           Address1  \
2427    28610  2675 Bethlehem Dr Unit Portfolio, Claremont, N...   

                 Address2 Availability      Price Beds  Baths     Sqft  \
2427  Claremont, NC 28610      Pending  8567000.0  155  86.5+  74945.0   

           City State    Zip  
2427  Claremont    NC  28610  


74,000 sqft - a whole apartment complex I guess.  I'll take it off the list, as it's not a normal house for someone to buy.

In [283]:
print(df_rs[df_rs['Beds'] == '41'])

     TysonZip                         Address1          Address2 Availability  \
5360    28112  708 Cotton St, Monroe, NC 28112  Monroe, NC 28112      Pending   

          Price Beds Baths     Sqft    City State    Zip  
5360  3000000.0   41    41  43800.0  Monroe    NC  28112  


In [288]:
print(df_rs[df_rs['Beds'] == '3.0'])

     TysonZip                                           Address1  \
9155    31730             1432 County Line Rd, Camilla, GA 31730   
9156    31730                  53 E Morgan St, Camilla, GA 31730   
9157    31730                 555 Moultrie Rd, Camilla, GA 31730   
9158    31730                  211 S Scott St, Camilla, GA 31730   
9159    31730              2989 Old Pelham Rd, Camilla, GA 31730   
9165    31730                49 E Thompson St, Camilla, GA 31730   
9167    31730                196 Macdonald St, Camilla, GA 31730   
9172    31730               375 E Camellia Dr, Camilla, GA 31730   
9174    31730                   64 Scott St S, Camilla, GA 31730   
9175    31730              4558 Friendship Rd, Camilla, GA 31730   
9176    31730                  110 Dogwood St, Camilla, GA 31730   
9177    31730                199 Macdonald St, Camilla, GA 31730   
9178    31730                     64 Scott St, Camilla, GA 31730   
9179    31730              367 County Line Rd, C

In [286]:
# Dropping rows with too many beds (i.e. apartment complexes or the like)

df_rs.drop(df_rs[df_rs['Beds'] == '155'].index, inplace=True)
df_rs.drop(df_rs[df_rs['Beds'] == '41'].index, inplace=True)

In [287]:
df_rs['Beds'].value_counts()

3         2871
4         1715
2          772
5          436
1          111
6           86
3.0         44
7           21
8           12
Studio      12
2.0         12
4.0         11
9            4
5.0          2
14           2
12           2
13           1
6.0          1
Name: Beds, dtype: int64

In [289]:
# Replacing the values that have a trailing .0 with the normal values

df_rs['Beds'] = df_rs['Beds'].replace({'3.0': '3', '2.0': '2', '1.0': '1', '4.0': '4', '5.0': '5', '6.0': '6'})

In [290]:
df_rs['Beds'].value_counts()

3         2915
4         1726
2          784
5          438
1          111
6           87
7           21
8           12
Studio      12
9            4
12           2
14           2
13           1
Name: Beds, dtype: int64

In [291]:
df_rs['Baths'].value_counts()

2       2371
2.5      997
1        846
3        781
3.5      379
1.5      254
4        156
4.5       92
2.0       33
5.5       30
5         26
3.5+      23
4.5+      18
1.0       18
2.5+      16
1.5+       9
6.5        8
5.5+       6
6          5
6.5+       3
3.0        3
9          2
9.5        2
8          2
0.5+       2
7          2
7.5        1
15         1
7.5+       1
0.5        1
4.0        1
Name: Baths, dtype: int64

In [294]:
# Replacing the values that have a trailing .0 with the normal values

df_rs['Baths'] = df_rs['Baths'].replace({'3.0': '3', '2.0': '2', '1.0': '1', '4.0': '4'})

# I'm going to do that with the + values too

df_rs['Baths'] = df_rs['Baths'].replace({'3.5+': '3.5', '4.5+': '4.5', '2.5+': '2.5', '1.5+': '1.5', '5.5+': '5.5', '6.5+': '6.5', '0.5+': '0.5', '7.5+': '7.5'})

In [295]:
df_rs['Baths'].value_counts()

2      2404
2.5    1013
1       864
3       784
3.5     402
1.5     263
4       157
4.5     110
5.5      36
5        26
6.5      11
6         5
0.5       3
9.5       2
8         2
7.5       2
9         2
7         2
15        1
Name: Baths, dtype: int64

This seems likely to be accurate enough
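The replacement dictionaries work, but they have to list every variant by hand. A shorter sketch, assuming the only oddities are a trailing "+" and a trailing ".0" (the sample values below are hypothetical):

```python
import pandas as pd

# Hypothetical sample of scraped bath counts, including a missing value
baths = pd.Series(['2', '2.0', '3.5+', '1.5', None, '0.5+'])

# Remove a trailing "+" and parse to numbers, so "2" and "2.0" become the same value
baths_num = pd.to_numeric(baths.str.rstrip('+'), errors='coerce')
```

The same call on Beds would turn "Studio" into NaN, so that value would need its own rule (for example mapping it to 0) before converting.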

In [296]:
print(df_rs[df_rs['Baths'] == '15'])

     TysonZip                                     Address1  \
2534    72830  342 County Road 3446, Clarksville, AR 72830   

                   Address2 Availability    Price Beds Baths    Sqft  \
2534  Clarksville, AR 72830      Pending  27000.0    3    15  1216.0   

             City State    Zip  
2534  Clarksville    AR  72830  


In [297]:
# I checked the address for the one with 15 baths and it was clearly a mistake. So changing it here.

df_rs.loc[df_rs['Baths'] == '15', 'Baths'] = '1.5'

In [298]:
df_rs['Sqft'].value_counts()

1800.0      31
1200.0      31
1500.0      28
1400.0      24
1344.0      23
2000.0      21
960.0       20
2400.0      18
1688.0      18
1300.0      18
1440.0      18
1600.0      18
1680.0      18
1983.0      17
1248.0      17
1000.0      17
1487.0      17
1350.0      16
1456.0      16
2500.0      15
1347.0      15
1636.0      15
1862.0      15
1792.0      15
1152.0      15
1120.0      14
2100.0      14
1567.0      14
1512.0      14
2200.0      13
1056.0      13
2174.0      13
1568.0      13
1910.0      13
1796.0      13
2300.0      13
1700.0      13
1560.0      13
1790.0      13
2304.0      13
1626.0      12
1080.0      12
1280.0      12
900.0       12
1750.0      12
1217.0      12
2511.0      12
1250.0      12
1618.0      12
1266.0      12
1958.0      11
2368.0      11
720.0       11
1830.0      11
924.0       11
2601.0      11
1474.0      11
1296.0      11
1740.0      11
1092.0      11
1699.0      11
1008.0      11
1292.0      11
1050.0      10
1850.0      10
1667.0      10
1595.0    

In [304]:
print(df_rs[df_rs['Sqft'] > 15000])

     TysonZip                                    Address1  \
5030    27332         106 Grinnel Loop, Sanford, NC 27332   
5032    27332         103 Grinnel Loop, Sanford, NC 27332   
5674    38261         1732 Stone St, Union City, TN 38261   
7422    75090    820 N Sam Rayburn Fwy, Sherman, TX 75090   
7888    72704  4199 W Sante Fe Sq, Fayetteville, AR 72704   
8161    31092         202 St Charles Pl, Veinna, GA 31092   

                    Address2 Availability       Price Beds Baths      Sqft  \
5030       Sanford, NC 27332     For Sale    390000.0    4   2.5   22137.0   
5032       Sanford, NC 27332     For Sale    390000.0    4   2.5   22137.0   
5674    Union City, TN 38261     For Sale   4500000.0  NaN   NaN   62844.0   
7422       Sherman, TX 75090      Pending   8000000.0  NaN   NaN   65986.0   
7888  Fayetteville, AR 72704      Pending  38500000.0  NaN   NaN  256672.0   
8161        Veinna, GA 31092     For Sale   1850000.0  NaN   NaN   30055.0   

              City State 

I checked those above individually.  The first two are errors, but the others are all multi-family units, so I'm removing them

In [305]:
df_rs.drop(df_rs[df_rs['Sqft'] > 30000].index, inplace=True)

In [310]:
df_rs.loc[df_rs['Sqft'] == 22137, 'Sqft'] = 2213

In [311]:
print(df_rs[df_rs['Sqft'] > 15000])

Empty DataFrame
Columns: [TysonZip, Address1, Address2, Availability, Price, Beds, Baths, Sqft, City, State, Zip]
Index: []


In [312]:
df_rs['City'].value_counts()

Seguin                  713
Broken Bow              311
Fayetteville            295
Cumming                 230
Bowling Green           226
Sherman                 192
Monroe                  169
Sanford                 158
Alexandria              142
Rogers                  137
Springdale              125
Traverse City           119
Shelbyville             106
Sedalia                 103
Goodlettsville           97
Vineland                 87
Jacksonville             82
Russellville             73
Easley                   72
Chicago                  71
Rome                     68
Zeeland                  65
Jackson                  65
Caseyville               63
Carthage                 63
Emporia                  59
Eufaula                  59
Vicksburg                58
Council Bluffs           56
Enid                     56
Warren                   55
Monett                   54
Hutchinson               53
Amherst                  52
Clarksville              52
Waterloo            

In [313]:
df_rs['State'].value_counts()

TX    1245
AR    1042
KY     418
OK     374
NC     349
GA     333
TN     294
IA     262
MO     245
MI     239
KS     203
AL     199
NE     174
IL     171
MS     136
NJ      87
FL      82
SC      72
OH      65
IN      59
WA      32
CA      27
WI      14
ME      14
PA       9
VA       8
OR       8
SD       6
Name: State, dtype: int64

That worked this time

In [314]:
df_rs['Zip'].value_counts()

78155    451
74728    317
72704    292
42101    222
30040    201
27332    152
72756    140
75090    140
72764    124
28112    122
49684    119
65301    103
37160     99
37072     97
08360     87
32254     82
72802     73
60641     71
39204     65
62232     64
30161     60
36027     60
29640     60
39183     59
66801     59
73701     56
51501     56
77029     56
48091     55
65708     54
41001     54
72830     53
67501     53
50703     52
75633     51
49464     50
35077     49
38343     46
63841     45
72114     45
76117     45
44001     45
71801     44
68137     43
71602     43
72834     41
35906     39
66106     38
72901     35
38261     34
72638     33
35950     32
76384     32
72958     32
64507     31
79108     31
72616     31
46947     29
50588     29
47112     29
94580     27
76180     27
75935     27
71852     25
68850     24
51442     24
68701     23
31730     21
50220     21
28610     20
68462     20
67846     20
35031     17
42420     17
99337     14
68305     14
31092     14

In [315]:
print(df_rs['Zip'].nunique())

219


That's significantly more zip codes than we have food production facility zip codes, probably because when no homes are found in a zip code, the site shows homes within a 3-mile radius instead. So I need to set up my scraper so that each listing it saves also records the Tyson zip it was searching for.

The notes above describe how things turned out the first time. I then fixed my scraper and ran through it all again with a new TysonZip column, so that each listing is easy to link to its food production facility. Then I cleaned and checked it all again too.
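With both columns in place, the listings that came from a neighboring zip can be flagged by comparing Zip to TysonZip. A minimal sketch on hypothetical rows; it assumes both columns are stored as strings, which is not how Zip comes back from the cleaned CSV unless a dtype is given.

```python
import pandas as pd

# Hypothetical sample: one listing from a nearby zip, one with no zip captured
sample = pd.DataFrame({
    'TysonZip': ['78155', '78155', '72704', '08360'],
    'Zip': ['78155', '78123', '72704', None],
})

# True where the listing's own zip is known and differs from the searched zip
outside = sample['Zip'].notna() & (sample['Zip'] != sample['TysonZip'])

# Number of nearby-zip listings pulled in by each searched zip
outside_counts = sample[outside].groupby('TysonZip').size()
```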

In [419]:
# Importing cleaned data again
df_rs = pd.read_csv(os.path.join(path, 'Data', 'Cleaned', 'Realtor_Scraped_ForSale.csv'), dtype={'TysonZip': str}, index_col = False)

df_rs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6167 entries, 0 to 6166
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   TysonZip      6167 non-null   object 
 1   Address1      6167 non-null   object 
 2   Address2      6167 non-null   object 
 3   Availability  6167 non-null   object 
 4   Price         6167 non-null   float64
 5   Beds          6115 non-null   object 
 6   Baths         6089 non-null   float64
 7   Sqft          6167 non-null   float64
 8   City          6166 non-null   object 
 9   State         6167 non-null   object 
 10  Zip           5592 non-null   float64
dtypes: float64(4), object(7)
memory usage: 530.1+ KB


In [316]:
# I'm going to go ahead and export what I have and then update my scraper and run it again

df_rs.to_csv(os.path.join(path, 'Data', 'Cleaned', 'Realtor_Scraped_ForSale.csv'), index=False)

## 7. Realtor.com scraped rental listings for zip codes in list

### 7A. Web Scraping Script

In [229]:
# I'm going to go ahead and run this rental script again since I updated the other list and I've updated the script
# a little to make it easier to link to the zip code list

import os
import requests
from bs4 import BeautifulSoup
import pandas as pd
import datetime
import time
import random
import csv
import re

start_time = time.time()

# Defining path
path=r'D:\Adam\Employment\Data Analysis Course\Final Data Project'

# Defining location of the zip code file (built from path so the backslashes don't need escaping)
zipfile = os.path.join(path, 'Data', 'Cleaned', 'Tyson_zips_only.csv')

# Open the CSV file that contains the list of zip codes.
with open(zipfile, 'r') as f:
    reader = csv.reader(f)
    zip_codes = list(reader)

data = []    
    
# Loop through the list of zip codes.
for zip_code in zip_codes:
    print(f"Processing zip code: {zip_code[0]}")
    
    # Add a random delay before going to next zip code
    time.sleep(random.randint(30, 90))
    
    # Loop through the page numbers (1-20).
    for page_num in range(1, 21):
        print(f"Processing page {page_num}")
    
        # Update the URL in the script.
        url_template = "https://www.realtor.com/apartments/{}/pg-{}"
        url = url_template.format(zip_code[0], page_num)

        # Set user-agent header to avoid bot detection
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

        # Add a random delay before making the request
        time.sleep(random.randint(8, 16))

        # Send HTTP GET request to the URL and get the HTML response
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            print("Request successful")
        else:
            print(f"Request failed with status code: {response.status_code}. No more listings pages found.")
            break

        # Parse the HTML response using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the container for all the properties
        container = soup.find('section', class_=re.compile(r'PropertiesList_propertiesContainer'))

        # Check if the container is empty
        if container is None:
            print("No listings found")
            break

        # Check if the page contains the text "We're Sorry"
        if "We're Sorry" in soup.text:
            print("No more pages")
            break

        # Find all the listings in the container
        listings = container.find_all('div', class_=re.compile(r'BasePropertyCard_propertyCardWrap'))

        # Loop through each listing and extract the details
        for listing in listings:
            if listing is not None:
                try:
                    address1 = listing.find('div', {'class': 'truncate-line', 'data-testid': 'card-address-1'}).text.strip()
                except AttributeError:
                    address1 = ''

                try:
                    address2 = listing.find('div', {'class': 'truncate-line', 'data-testid': 'card-address-2'}).text.strip()
                except AttributeError:
                    address2 = ''

                try:
                    style = listing.find('div', {'class': 'message'}).text.split(' - ')[1]
                except (AttributeError, IndexError):
                    style = ''

                try:
                    price = listing.find('div', class_=re.compile(r'Price__Component')).text.strip().replace(',', '').replace('$', '')
                except AttributeError:
                    price = ''

                try:
                    beds = listing.find('li', {'data-testid': 'property-meta-beds'}).text.strip().replace('bed', '')
                except AttributeError:
                    beds = ''

                try:
                    baths = listing.find('li', {'data-testid': 'property-meta-baths'}).text.strip().replace('bath', '')
                except AttributeError:
                    baths = ''

                try:
                    sqft = listing.find('li', {'data-testid': 'property-meta-sqft'}).text.strip().replace(',', '').split('sqft')[0]
                except AttributeError:
                    sqft = ''

            else:
                print('No listing found.')

            # Split the beds, baths, and sqft values at the " - " and create two new rows with the values before and after the " - "

            if "-" in price:
                price_split = price.split(" - ")
                if "-" in beds:
                    beds_split = beds.split(" - ")
                    if "-" in sqft:
                        sqft_split = sqft.split(" - ")
                        if "-" in baths:
                            baths_split = baths.split(" - ")
                            data.append([zip_code[0], address1, address2, style, price_split[0], beds_split[0], baths_split[0], sqft_split[0]])
                            data.append([zip_code[0], address1, address2, style, price_split[1], beds_split[1], baths_split[1], sqft_split[1]])

                        else:
                            data.append([zip_code[0], address1, address2, style, price_split[0], beds_split[0], baths, sqft_split[0]])
                            data.append([zip_code[0], address1, address2, style, price_split[1], beds_split[1], baths, sqft_split[1]])
                    else:
                        data.append([zip_code[0], address1, address2, style, price_split[0], beds_split[0], baths, sqft])
                        data.append([zip_code[0], address1, address2, style, price_split[1], beds_split[1], baths, sqft])    
                else:
                    data.append([zip_code[0], address1, address2, style, price_split[0], beds, baths, sqft])
                    data.append([zip_code[0], address1, address2, style, price_split[1], beds, baths, sqft])
            else:
                data.append([zip_code[0], address1, address2, style, price, beds, baths, sqft])
       
        # stopping pagination if fewer than 40 matching listings are reported
        if int(soup.find('div', class_=re.compile(r'MatchProperties')).text.strip().split(' matching')[0]) < 40:
            print("No more listings")
            break
        else:
            continue
        
    print("Finished processing zip code")

# Create a Pandas dataframe from the extracted data
df = pd.DataFrame(data, columns=['TysonZip', 'Address1', 'Address2', 'Style', 'Price', 'Beds', 'Baths', 'Sqft'])

# get today's date in the format YYYY-MM-DD
today = datetime.datetime.today().strftime('%Y-%m-%d')

# Export the dataframe to a CSV file
df.to_csv(os.path.join(path, 'Data', 'Scraped', f'TysonZips_rentals_{today}.csv'), index=False)

# Print the dataframe
print(df)

end_time = time.time()

execution_time = end_time - start_time

print("Execution time:", execution_time)

Processing zip code: 73701
Processing page 1
Request successful
No more listings
Finished processing zip code
Processing zip code: 42602
Processing page 1
Request successful
No listings found
Finished processing zip code
Processing zip code: 35950
Processing page 1
Request successful
No more listings
Finished processing zip code
Processing zip code: 79108
Processing page 1
Request successful
No more listings
Finished processing zip code
Processing zip code: 44001
Processing page 1
Request successful
No more listings
Finished processing zip code
Processing zip code: 72764
Processing page 1
Request successful
Processing page 2
Request successful
Processing page 3
Request failed with status code: 404. No more listings pages found.
Finished processing zip code
Processing zip code: 72616
Processing page 1
Request successful
No more listings
Finished processing zip code
Processing zip code: 35031
Processing page 1
Request successful
No more listings
Finished processing zip code
Processing zi

Processing page 1
Request successful
No more listings
Finished processing zip code
Processing zip code: 38059
Processing page 1
Request successful
No more listings
Finished processing zip code
Processing zip code: 64854
Processing page 1
Request successful
No more listings
Finished processing zip code
Processing zip code: 76180
Processing page 1
Request successful
Processing page 2
Request successful
Processing page 3
Request successful
Processing page 4
Request successful
Processing page 5
Request failed with status code: 404. No more listings pages found.
Finished processing zip code
Processing zip code: 38261
Processing page 1
Request successful
No more listings
Finished processing zip code
Processing zip code: 68137
Processing page 1
Request successful
No more listings
Finished processing zip code
Processing zip code: 99363
Processing page 1
Request successful
No listings found
Finished processing zip code
Processing zip code: 50220
Processing page 1
Request successful
No more list
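
The pagination check in the script above calls `int(...)` directly on the scraped text, which would raise an error if the text is not a plain number followed by " matching". A minimal sketch of a more defensive parser, assuming the text looks like `'12 matching ...'` (the helper name `parse_match_count` and the sample strings are hypothetical, not part of the original script):

```python
import re

def parse_match_count(text):
    """Return the leading integer from text like '12 matching ...', or None if absent."""
    match = re.match(r'\s*([\d,]+)\s+matching', text)
    if match is None:
        return None
    # Remove thousands separators before converting
    return int(match.group(1).replace(',', ''))

print(parse_match_count('12 matching rentals'))
print(parse_match_count('No results'))
```

Returning `None` instead of raising lets the loop decide whether to stop paginating or skip the zip code.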

In [374]:
# Importing scraped rental data
df_rr = pd.read_csv(os.path.join(path, 'Data', 'Scraped', 'TysonZips_rentals_2023-05-05.csv'), dtype={'TysonZip': str}, index_col = False)

### 7B. Consistency Checks

In [375]:
# checking dataset info

df_rr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1610 entries, 0 to 1609
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   TysonZip  1610 non-null   object
 1   Address1  1606 non-null   object
 2   Address2  1606 non-null   object
 3   Style     1610 non-null   object
 4   Price     1610 non-null   object
 5   Beds      1605 non-null   object
 6   Baths     1602 non-null   object
 7   Sqft      1415 non-null   object
dtypes: object(8)
memory usage: 100.8+ KB


I need to convert Price and Sqft to numeric types, but both columns may contain non-numeric values, so I'll check for those first.
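
One way to surface non-numeric values without scanning the full `value_counts()` output is to coerce the column to numbers and look at what fails. This is only a sketch on a small made-up Series standing in for `df_rr['Price']`; the names `prices`, `as_numbers` and `non_numeric` are hypothetical.

```python
import pandas as pd

# Hypothetical sample standing in for df_rr['Price']
prices = pd.Series(['1395', '1200', 'Contact For Price', '650'])

# Values that cannot be parsed become NaN
as_numbers = pd.to_numeric(prices, errors='coerce')

# Keep only entries that were present originally but failed to parse
non_numeric = prices[as_numbers.isna() & prices.notna()]

print(non_numeric.value_counts())
```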

In [377]:
# making it show me all the rows
pd.options.display.max_rows = len(df_rr['Price'].value_counts())

df_rr['Price'].value_counts()

1395                 28
1200                 27
650                  24
1500                 23
2400                 22
900                  22
1649                 21
1100                 21
850                  20
750                  20
1450                 20
1300                 19
1050                 19
1850                 19
1650                 18
895                  18
1600                 18
1250                 18
2000                 17
1950                 17
1800                 17
1000                 17
950                  17
2200                 16
1550                 16
1900                 14
795                  14
1350                 14
1495                 14
2500                 13
1195                 13
1695                 13
695                  13
1750                 12
625                  12
1995                 12
2100                 12
1275                 12
1599                 12
Contact For Price    11
1595                 11
2600            

I found only one non-numeric value, "Contact For Price", so I'm going to drop those rows now.

In [378]:
# Dropping rows with "Contact For Price" value

df_rr.drop(df_rr[df_rr['Price'] == 'Contact For Price'].index, inplace=True)

In [379]:
df_rr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1599 entries, 0 to 1609
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   TysonZip  1599 non-null   object
 1   Address1  1595 non-null   object
 2   Address2  1595 non-null   object
 3   Style     1599 non-null   object
 4   Price     1599 non-null   object
 5   Beds      1597 non-null   object
 6   Baths     1594 non-null   object
 7   Sqft      1409 non-null   object
dtypes: object(8)
memory usage: 112.4+ KB


In [380]:
# making it an int64 dtype

df_rr = df_rr.astype({'Price': 'int64'})

In [381]:
df_rr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1599 entries, 0 to 1609
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   TysonZip  1599 non-null   object
 1   Address1  1595 non-null   object
 2   Address2  1595 non-null   object
 3   Style     1599 non-null   object
 4   Price     1599 non-null   int64 
 5   Beds      1597 non-null   object
 6   Baths     1594 non-null   object
 7   Sqft      1409 non-null   object
dtypes: int64(1), object(7)
memory usage: 112.4+ KB


In [383]:
# making it show me all the rows
pd.options.display.max_rows = len(df_rr['Sqft'].value_counts())

df_rr['Sqft'].value_counts()

700            29
1000           21
1100           21
900            20
1050           18
400            16
1200           15
694            14
1729           14
800            14
950            14
1300           13
550            12
1146           12
750            11
850            11
1067           10
726            10
600            10
1261           10
1159            9
650             9
1634            8
816             8
1372            7
1350            7
660             7
884             7
680             7
784             7
500             6
1680            6
350             6
1064            6
1500            6
1212            6
1440            6
1080            6
1082            6
2002            6
956             6
1152            6
960             6
1796            5
952             5
855             5
1400            5
663             5
2032            5
720             5
2000            5
727             5
1344            5
1056            5
1738            5
564       

In [384]:
print(df_rr[df_rr['Sqft'] == '986 - 1150'])

    TysonZip       Address1                Address2      Style  Price   Beds  \
605    35906  98 Sutton Cir  Rainbow City, AL 35906  Apartment    940  2 - 3   

    Baths        Sqft  
605   1.5  986 - 1150  


Many Sqft values are still ranges with two numbers separated by a dash, so I'm keeping only the lower number. Given the way I wrote my script, a range here means the listing showed only one price, and that price is more likely the "starting from" price than the highest possible price.

In [385]:
df_rr['Sqft'] = df_rr['Sqft'].str.split(' - ').str[0]

In [386]:
df_rr['Sqft'].value_counts()

700      30
1000     21
1100     21
900      20
1050     18
400      16
1200     15
694      14
950      14
1729     14
800      14
1300     13
600      12
1146     12
550      12
726      11
750      11
850      11
1261     10
1067     10
650       9
1159      9
816       8
1634      8
1372      7
680       7
784       7
1056      7
884       7
660       7
1350      7
1440      6
1082      6
1212      6
1500      6
350       6
1080      6
956       6
500       6
960       6
2002      6
1064      6
1152      6
1680      6
727       5
300       5
952       5
855       5
720       5
1400      5
1796      5
663       5
1738      5
2000      5
2032      5
327       5
1344      5
1380      4
1627      4
1150      4
2054      4
1131      4
450       4
1298      4
1450      4
675       4
1600      4
1216      4
1340      4
640       4
924       4
525       4
1234      4
1800      4
564       4
418       4
1319      4
1280      4
1395      3
974       3
1508      3
1470      3
780       3
1147

In [387]:
print(df_rr[df_rr['Sqft'].isnull()])

     TysonZip                            Address1  \
1       73701           1201 E Broadway Ave Apt J   
3       73701                       624 N 10th St   
4       73701                       812 W Elm Ave   
5       73701                     1910 N Grand St   
14      79108                   10505 Broadway Dr   
15      79108                                 NaN   
16      79108                  8403 N Highway 287   
20      79108                       4611 Dumas Dr   
21      44001                        36 Westwoods   
22      44001                        36 Westwoods   
31      72764          3810 S Thompson Ave Unit 7   
32      72764         3810 S Thompson Ave Unit 25   
42      72764          3810 S Thompson Ave Unit 6   
43      72764         3810 S Thompson Ave Unit 32   
44      72764          3810 S Thompson Ave Unit 5   
45      72764         3810 S Thompson Ave Unit 35   
46      72764         3810 S Thompson Ave Unit 39   
47      72764         3810 S Thompson Ave Unit

In [388]:
# Convert sqft to numeric while keeping null values as null

df_rr['Sqft'] = pd.to_numeric(df_rr['Sqft'], errors='coerce')

In [389]:
df_rr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1599 entries, 0 to 1609
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   TysonZip  1599 non-null   object 
 1   Address1  1595 non-null   object 
 2   Address2  1595 non-null   object 
 3   Style     1599 non-null   object 
 4   Price     1599 non-null   int64  
 5   Beds      1597 non-null   object 
 6   Baths     1594 non-null   object 
 7   Sqft      1409 non-null   float64
dtypes: float64(1), int64(1), object(6)
memory usage: 112.4+ KB


In [390]:
# Checking for duplicates (and reviewing them first - because sometimes apartments have the same address)

df_rr_dups = df_rr[df_rr.duplicated()]

print(df_rr_dups)

    TysonZip                         Address1                     Address2  \
74     72764             315 Texas Ln Apt 206         Springdale, AR 72764   
75     72764         1414 S Powell St Apt 202         Springdale, AR 72764   
76     72764         1414 S Powell St Apt 102         Springdale, AR 72764   
77     72764         1706 S Powell St Apt 203         Springdale, AR 72764   
78     72764            1160 Dove Loop Unit B         Springdale, AR 72764   
79     72764                884 Palisades Ave         Springdale, AR 72764   
80     72764           1600 Juniper Cir Apt B         Springdale, AR 72764   
81     72764              3444 Acorn Falls Ln         Springdale, AR 72764   
82     72764         1006 Walnut Ave Unit 101         Springdale, AR 72764   
83     72764                 3568 Bastille St         Springdale, AR 72764   
84     72764              420 Park St Apt 201         Springdale, AR 72764   
85     72764                  703 W Grove Ave         Springdale

I manually checked some of these addresses and most of them look like multi-unit properties, so I'm going to leave them in.

In [391]:
# Checking for missing values

missing_values = df_rr.isnull().sum()

# total rows in dataframe
total_rows = df_rr.shape[0]

# making it show me all the rows
pd.options.display.max_rows = len(df_rr.dtypes)

# calculating percentage of missing values
percent_missing = ((missing_values / total_rows) * 100).round(2)

# Making a dataframe to show missing values
missing_value_table = pd.concat([missing_values, percent_missing], axis=1)

# Naming columns for sorting
missing_value_table.columns = ['Missing Values', 'Percent Missing']

# sorting the view by percent missing, lowest first
missing_value_table.sort_values('Percent Missing', ascending=True, inplace=True)

print(missing_value_table)

          Missing Values  Percent Missing
TysonZip               0             0.00
Style                  0             0.00
Price                  0             0.00
Beds                   2             0.13
Address1               4             0.25
Address2               4             0.25
Baths                  5             0.31
Sqft                 190            11.88


Sqft has a lot of missing values (about 12%), but I think that's OK. At this point, the main interest is whether units for rent are available within a certain price range at the food facility locations.

In [392]:
# counting number of unique values in each column

print(df_rr.nunique())

TysonZip      75
Address1    1357
Address2      88
Style          7
Price        472
Beds          13
Baths         14
Sqft         653
dtype: int64


So only 75 of the 95 TysonZips had rental units available, which is fewer than I anticipated. I'm going to check the missing zip codes manually to see whether that holds or whether there was some error with the scraper. The other counts look like I'd expect *maybe*, though the large number of distinct Beds and Baths values might indicate those columns still need work in the cleaning phase.

I checked all of those zip codes manually, and there really is nothing there. So the scraper worked, but those places are apparently small enough that they can't meet rental housing demand and/or don't list on major sites like Realtor.com (I also checked a couple of them on Trulia and found nothing there either).
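
To see exactly which zip codes came back empty, the searched list can be compared against the scraped results with a set difference. A minimal sketch, assuming the zip list is available as a Python list; `zip_list`, `scraped` and the sample values are hypothetical stand-ins for the real zip code list and `df_rr`.

```python
import pandas as pd

# Hypothetical stand-ins: the zip list fed to the scraper and the scraped results
zip_list = ['73701', '42602', '35950', '99363']
scraped = pd.DataFrame({'TysonZip': ['73701', '73701', '35950']})

# Zip codes that were searched but returned no rows
missing_zips = sorted(set(zip_list) - set(scraped['TysonZip']))

print(missing_zips)
```

Keeping both sides as strings matters here, since the zip codes were read with `dtype={'TysonZip': str}` to preserve leading zeros.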

### 7C. Data Cleaning

In [393]:
# I'm going to check the value counts for each column because there's probably still some cleaning to do on Beds and Baths

df_rr['Style'].value_counts()

Apartment         694
House             664
Townhome           76
Condo              66
Other              62
Condo/Townhome     22
Duplex/Triplex     15
Name: Style, dtype: int64

In [394]:
# making it show me all the rows
pd.options.display.max_rows = len(df_rr['Beds'].value_counts())

df_rr['Beds'].value_counts()

3             564
2             462
1             317
4             162
Studio         47
5              19
Studio+        15
2 - 3           3
6               2
Studio - 2      2
1 - 2           2
1 - 3           1
1 - 4           1
Name: Beds, dtype: int64

In [395]:
# I'm going to keep the first value in entries like "2 - 3" (for the same reason as above: if they only listed
# one price, they probably listed the starting price, and thus what would apply to the smallest unit)

df_rr['Beds'] = df_rr['Beds'].str.split(' - ').str[0]

In [396]:
# Changing Studio+ to Studio
df_rr['Beds'] = df_rr['Beds'].replace({'Studio+': 'Studio'})

In [397]:
df_rr['Beds'].value_counts()

3         564
2         465
1         321
4         162
Studio     64
5          19
6           2
Name: Beds, dtype: int64

In [398]:
# making it show me all the rows
pd.options.display.max_rows = len(df_rr['Baths'].value_counts())

df_rr['Baths'].value_counts()

1          745
2          575
2.5        130
3.5         42
1.5         41
3           33
1 - 2        9
4            8
4.5          5
0.5+         2
5.5+         1
3.5+         1
2.5+         1
1 - 2.5      1
Name: Baths, dtype: int64

In [399]:
# Keeping the first number when there's a -, for the same reason as above

df_rr['Baths'] = df_rr['Baths'].str.split(' - ').str[0]

In [403]:
# Changing Baths+ to value without +

df_rr['Baths'] = df_rr['Baths'].replace({'5.5+': '5.5', '3.5+': '3.5', '2.5+': '2.5'})

In [404]:
# making it show me all the rows
pd.options.display.max_rows = len(df_rr['Baths'].value_counts())

df_rr['Baths'].value_counts()

1       755
2       575
2.5     131
3.5      43
1.5      41
3        33
4         8
4.5       5
0.5+      2
5.5       1
Name: Baths, dtype: int64
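
The output above still shows `0.5+`, because the replacement dictionary only lists three specific values. A hedged alternative is to strip any trailing `+` regardless of the number in front of it. This is a sketch on a made-up Series standing in for `df_rr['Baths']`; the names `baths` and `baths_clean` are hypothetical.

```python
import pandas as pd

# Hypothetical sample standing in for df_rr['Baths']
baths = pd.Series(['1', '0.5+', '2.5+', None, '3'])

# Remove any trailing '+' characters; missing values stay missing
baths_clean = baths.str.rstrip('+')

print(baths_clean.tolist())
```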

In [409]:
# Going to split Address 2 into city, state, and zip

df_rr['City'] = df_rr['Address2'].str.split(',', expand=True)[0]

In [410]:
# Now I'm making a new column for the state abbreviation

df_rr['State'] = df_rr['Address2'].str.split(', ', expand=True)[1]
df_rr['State'] = df_rr['State'].str.split(' ', expand=True)[0]

In [411]:
# Now I'm going to try to do that with zip code.

df_rr['Zip'] = df_rr['Address2'].str.split(', ', expand=True)[1]
df_rr['Zip'] = df_rr['Zip'].str.split(' ', expand=True)[1]
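
The three splits above could also be done in one pass with a regular expression. A minimal sketch, assuming every `Address2` value follows the `City, ST 12345` pattern seen in this dataset; the Series `address2` and the frame `parts` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for df_rr['Address2']
address2 = pd.Series(['Springdale, AR 72764', 'North Richland Hills, TX 76180', None])

# Named groups become column names; rows that don't match the pattern yield NaN
parts = address2.str.extract(r'^(?P<City>[^,]+),\s*(?P<State>[A-Z]{2})\s+(?P<Zip>\d{5})')

print(parts)
```

A side benefit is that values which do not fit the pattern end up as NaN in all three columns rather than as partial strings.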

In [416]:
# counting number of unique values in each column

print(df_rr.nunique())

TysonZip      75
Address1    1357
Address2      88
Style          7
Price        472
Beds           7
Baths         10
Sqft         653
City          81
State         25
Zip           77
dtype: int64


So Zip has only two more unique values than TysonZip (77 vs. 75). Since the search automatically expands by 3 miles when nothing is found in the original zip code, this suggests that in only about two cases did it find rental property outside the searched zip code but still within the local radius. That doesn't bode well for the food processing facilities that have no rentals within their zip code: it seems to indicate there are no rentals available in the local vicinity either.
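
Comparing unique counts is an indirect check. A more direct one is to pull the rows whose parsed zip differs from the searched zip. This is a sketch on a made-up frame with the same column names as `df_rr`; the sample rows and the name `outside` are hypothetical.

```python
import pandas as pd

# Hypothetical sample with the same column names as df_rr
sample = pd.DataFrame({
    'TysonZip': ['76180', '76180', '72764', '79108'],
    'Zip':      ['76180', '76117', '72764', None],
})

# Listings whose parsed zip differs from the searched zip
# (rows without a parsed zip are skipped rather than counted as mismatches)
outside = sample[sample['Zip'].notna() & (sample['Zip'] != sample['TysonZip'])]

print(outside)
```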

### 7D. Basic Descriptive Analysis

In [405]:
# Checking basic stats on each column

# making it show me all the rows
pd.options.display.max_rows = len(df_rr.dtypes)

df_rr.describe().applymap(lambda x: f"{x:0.2f}")

         Price      Sqft
count   1599.0    1409.0
mean   1531.75   1247.92
std     717.73    964.62
min        1.0     100.0
25%     1017.5     784.0
50%     1434.0    1100.0
75%     1870.0    1470.0
max     6300.0   21439.0


That seems about like I'd expect, though it's terrible how high rent prices are for relatively small homes
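
The summary above does include a minimum Price of 1 and a maximum Sqft of 21439, which may be worth a second look. One common way to flag such values is the 1.5 × IQR rule. This is a sketch of that rule on made-up numbers, not a step the original analysis performed; `prices` and `outliers` are hypothetical names.

```python
import pandas as pd

# Hypothetical sample standing in for df_rr['Price']
prices = pd.Series([1, 1200, 1300, 1400, 1500, 1600, 6300])

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are candidates for manual review, not automatic removal
outliers = prices[(prices < lower) | (prices > upper)]

print(outliers.tolist())
```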

In [406]:
# Now I'm going to count unique values in each object column

df_rr['Address1'].value_counts()

401 E Ave              4
6021 Parker Blvd       4
5401 Jenny Lind Rd     4
1507 Nona St           4
                      ..
1410 C St S            1
2510 Ionia St          1
912 S 19th St Apt A    1
566 Alice St           1
Name: Address1, Length: 1357, dtype: int64

In [407]:
# Making list of duplicates

duplicates_rr = df_rr[df_rr.duplicated(['Address1'], keep=False)]

# making it show me all the rows
pd.options.display.max_rows = len(duplicates_rr)

print(duplicates_rr)

     TysonZip                         Address1  \
15      79108                              NaN   
17      79108                              NaN   
21      44001                     36 Westwoods   
22      44001                     36 Westwoods   
23      44001                   7055 Quarry Rd   
24      44001                   7055 Quarry Rd   
25      72764               1701 S West End St   
26      72764               1701 S West End St   
27      72764                1517 Electric Ave   
28      72764                1517 Electric Ave   
29      72764                    5325 N Oak St   
30      72764                    5325 N Oak St   
34      72764         514 Butterfield Coach Rd   
35      72764         514 Butterfield Coach Rd   
36      72764               3702-B Locksley St   
37      72764               3702-B Locksley St   
38      72764                   1406 Oriole St   
39      72764                   1406 Oriole St   
40      72764                   726 E Emma Ave   


So there are a lot of duplicate addresses because of the way I wrote the code: it captures both the high and low values for each rental unit that listed a range. So I won't remove them.

In [408]:
# Now I'm going to count unique values in each object column

df_rr['Address2'].value_counts()

North Richland Hills, TX 76180    181
Chicago, IL 60641                 133
Fayetteville, AR 72704            105
Seguin, TX 78155                   98
North Little Rock, AR 72114        81
Springdale, AR 72764               70
Cumming, GA 30040                  67
Sherman, TX 75090                  66
Fort Smith, AR 72901               65
Jacksonville, FL 32254             61
Easley, SC 29640                   50
Haltom City, TX 76117              41
Waterloo, IA 50703                 33
Dallas, TX 75237                   31
Goodlettsville, TN 37072           31
Bowling Green, KY 42101            29
Hutchinson, KS 67501               26
Omaha, NE 68137                    22
Rogers, AR 72756                   22
Warren, MI 48091                   22
Jackson, MS 39204                  21
Monroe, NC 28112                   21
Sanford, NC 27332                  20
Rome, GA 30161                     20
Emporia, KS 66801                  20
Houston, TX 77029                  17
Vineland, NJ

In [412]:
df_rr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1599 entries, 0 to 1609
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   TysonZip  1599 non-null   object 
 1   Address1  1595 non-null   object 
 2   Address2  1595 non-null   object 
 3   Style     1599 non-null   object 
 4   Price     1599 non-null   int64  
 5   Beds      1597 non-null   object 
 6   Baths     1594 non-null   object 
 7   Sqft      1409 non-null   float64
 8   City      1595 non-null   object 
 9   State     1595 non-null   object 
 10  Zip       1595 non-null   object 
dtypes: float64(1), int64(1), object(9)
memory usage: 149.9+ KB


In [413]:
# Now I'm going to count unique values in each object column

df_rr['City'].value_counts()

North Richland Hills    183
Chicago                 133
Fayetteville            106
Seguin                   98
North Little Rock        81
Springdale               71
Cumming                  67
Sherman                  66
Fort Smith               65
Jacksonville             61
Easley                   50
Haltom City              41
Waterloo                 33
Jackson                  32
Goodlettsville           31
Dallas                   31
Bowling Green            29
Hutchinson               28
Rogers                   22
Omaha                    22
Warren                   22
Monroe                   21
Sanford                  20
Rome                     20
Emporia                  20
Houston                  17
Vineland                 15
Shelbyville              14
Kansas City              12
Enid                     12
Council Bluffs           12
Cincinnati               11
Rainbow City              9
Portland                  8
Lincoln                   8
San Lorenzo         

In [414]:
# Now I'm going to count unique values in each object column

df_rr['State'].value_counts()

TX    448
AR    360
IL    136
GA     88
TN     64
FL     61
KS     60
SC     50
IA     50
NC     43
KY     34
MI     31
NE     31
MS     28
OH     18
OK     15
NJ     15
AL     14
MO     11
CA     11
ME      8
IN      7
PA      6
WI      4
VA      2
Name: State, dtype: int64

In [415]:
# Now I'm going to count unique values in each object column

df_rr['Zip'].value_counts()

76180    182
60641    133
72704    105
78155     98
72114     83
72764     71
75090     67
30040     67
72901     65
32254     61
29640     50
76117     44
50703     33
37072     31
75237     31
42101     30
67501     26
68137     22
72756     22
48091     22
28112     21
39204     21
27332     20
30161     20
66801     20
77029     17
08360     15
45246     14
37160     14
73701     12
51501     12
66106     12
94580     11
38305     11
35906      9
04102      8
39183      7
49684      7
46947      6
17557      6
79108      5
68507      5
64507      5
44001      4
54961      4
72638      4
38261      4
65301      4
72715      3
71602      3
68505      3
72802      3
74728      3
50220      3
35077      2
50588      2
38355      2
38059      2
24586      2
75633      2
62232      2
28610      2
41071      2
41076      2
49464      2
35950      2
67502      2
72834      1
68776      1
65708      1
63841      1
39854      1
76384      1
75935      1
61275      1
35031      1
47112      1

In [420]:
df_rr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1599 entries, 0 to 1609
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   TysonZip  1599 non-null   object 
 1   Address1  1595 non-null   object 
 2   Address2  1595 non-null   object 
 3   Style     1599 non-null   object 
 4   Price     1599 non-null   int64  
 5   Beds      1597 non-null   object 
 6   Baths     1594 non-null   object 
 7   Sqft      1409 non-null   float64
 8   City      1595 non-null   object 
 9   State     1595 non-null   object 
 10  Zip       1595 non-null   object 
dtypes: float64(1), int64(1), object(9)
memory usage: 149.9+ KB


In [417]:
# Time to export the data and wrap up this lesson

df_rr.to_csv(os.path.join(path, 'Data', 'Cleaned', 'Realtor_Scraped_Rental.csv'), index=False)