# Foot traffic data cleaning process

In [1]:
import pandas as pd
import numpy as np
import os
import data_cleaning_methods

## IV. Google Community Mobility Reports

Data Summary

These datasets show how visits and length of stay at different places change compared to a baseline. These changes are calculated using the same kind of aggregated and anonymized data used to show popular times for places in Google Maps. 

The data shows how visitors to (or time spent in) categorized places change compared to baseline days. A baseline day represents a normal value for that day of the week. **The baseline day is the median value from the 5‑week period Jan 3 – Feb 6, 2020.**

For each region-category, the baseline isn’t a single value; it’s 7 individual values, one per day of the week. The same number of visitors on 2 different days of the week can therefore result in different percentage changes. So, we recommend the following:

- Don’t infer that larger changes mean more visitors or smaller changes mean fewer visitors.
- Avoid comparing day-to-day changes, especially weekends with weekdays.
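
To make the weekday-specific baseline concrete, here is a minimal sketch on made-up numbers. The visit counts are hypothetical (Google publishes only the percentage changes, not raw counts), and the computation is an illustration of the description above rather than Google's actual pipeline.

```python
import pandas as pd

# Hypothetical daily visit counts for one region-category (not real data).
dates = pd.date_range("2020-01-03", "2020-02-16", freq="D")
visits = pd.Series(100.0, index=dates)
visits[dates.dayofweek >= 5] = 200.0             # weekends are busier in this toy series
visits.loc[pd.Timestamp("2020-02-10")] = 120.0   # a Monday after the baseline window
visits.loc[pd.Timestamp("2020-02-15")] = 220.0   # a Saturday after the baseline window

# Baseline: the median for each day of the week over Jan 3 - Feb 6, 2020.
window = visits.loc["2020-01-03":"2020-02-06"]
baseline = window.groupby(window.index.dayofweek).median()  # 7 values, 0 = Monday

# Percentage change of each day relative to the baseline for its weekday.
base_for_day = baseline.loc[visits.index.dayofweek].to_numpy()
pct_change = (visits / base_for_day - 1) * 100

print(pct_change.loc[["2020-02-10", "2020-02-15"]])
```

Both days gained 20 visitors over their baseline, yet the Monday shows a larger percentage change than the Saturday because its baseline is smaller. This is why the size of a change should not be read as a visitor count.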

The records run from February 15, 2020 to October 27, 2020 and cover the following categories:

- Grocery & pharmacy. Mobility trends for places like grocery markets, food warehouses, farmers markets, specialty food shops, drug stores, and pharmacies.
- Parks. Mobility trends for places like local parks, national parks, public beaches, marinas, dog parks, plazas, and public gardens.
- Transit stations. Mobility trends for places like public transport hubs such as subway, bus, and train stations.
- Retail & recreation. Mobility trends for places like restaurants, cafes, shopping centers, theme parks, museums, libraries, and movie theaters.
- Residential. Mobility trends for places of residence.
- Workplaces. Mobility trends for places of work.

In [2]:
# State-level files are picked out by 'ST' in the file name
pattern = 'ST'
df_states = []

for f in os.listdir('../../data/foot_traffic/source/Google_Community_Mobility_Reports/'):
    if pattern in f:
        data = pd.read_csv(os.path.join('../../data/foot_traffic/source/Google_Community_Mobility_Reports/', f), index_col=0)
        # The part of the file name before '_percent' is the place category
        data['category'] = f.split('_percent')[0]
        df_states.append(data)

In [3]:
# County-level files are picked out by 'CO' in the file name
pattern = 'CO'
df_counties = []

for f in os.listdir('../../data/foot_traffic/source/Google_Community_Mobility_Reports/'):
    if pattern in f:
        data = pd.read_csv(os.path.join('../../data/foot_traffic/source/Google_Community_Mobility_Reports/', f), index_col=0)
        # The part of the file name before '_percent' is the place category
        data['category'] = f.split('_percent')[0]
        df_counties.append(data)
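
The two loading cells differ only in the pattern they match, so they could be folded into one helper. The sketch below is one possible way to do that. `load_reports` is a hypothetical name, not part of this notebook or of `data_cleaning_methods`, and sorting the file names is an addition that makes the order of the frames reproducible.

```python
import os
import pandas as pd

def load_reports(folder, pattern):
    """Read every CSV in `folder` whose name contains `pattern`, tagged with its category."""
    frames = []
    for fname in sorted(os.listdir(folder)):  # sorted: os.listdir order is arbitrary
        if pattern in fname:
            frame = pd.read_csv(os.path.join(folder, fname), index_col=0)
            # The part of the file name before '_percent' is the place category.
            frame["category"] = fname.split("_percent")[0]
            frames.append(frame)
    return frames
```

With it, the two cells above would reduce to `df_states = load_reports(folder, 'ST')` and `df_counties = load_reports(folder, 'CO')`, where `folder` holds the source directory path.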

In [4]:
df_states[0].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 50
Columns: 259 entries, STATE to category
dtypes: float64(256), int64(1), object(2)
memory usage: 103.6+ KB


In [5]:
df_counties[0].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3143 entries, 0 to 3142
Columns: 260 entries, COUNTY to category
dtypes: float64(256), int64(1), object(3)
memory usage: 6.3+ MB


In [6]:
df_google_states = pd.concat(df_states, ignore_index=True)
df_google_counties = pd.concat(df_counties, ignore_index=True)

In [7]:
df_google_states.head()

Unnamed: 0,STATE,NAME,2020-02-15,2020-02-16,2020-02-17,2020-02-18,2020-02-19,2020-02-20,2020-02-21,2020-02-22,...,2020-10-19,2020-10-20,2020-10-21,2020-10-22,2020-10-23,2020-10-24,2020-10-25,2020-10-26,2020-10-27,category
0,1,Alabama,7.0,3.0,7.0,-1.0,4.0,-1.0,9.0,13.0,...,-8.0,-9.0,-6.0,-6.0,-8.0,-10.0,-2.0,-8.0,-8.0,transit_stations
1,2,Alaska,2.0,6.0,-6.0,4.0,4.0,1.0,5.0,3.0,...,-24.0,-23.0,-24.0,-22.0,-26.0,-17.0,-30.0,-25.0,-25.0,transit_stations
2,4,Arizona,3.0,3.0,1.0,6.0,7.0,5.0,4.0,-6.0,...,-20.0,-22.0,-22.0,-22.0,-21.0,-17.0,-18.0,-24.0,-22.0,transit_stations
3,5,Arkansas,-3.0,0.0,-1.0,-2.0,0.0,1.0,2.0,2.0,...,-5.0,-3.0,-2.0,-1.0,-3.0,2.0,3.0,-8.0,-6.0,transit_stations
4,6,California,1.0,1.0,-12.0,3.0,1.0,1.0,0.0,-4.0,...,-41.0,-41.0,-41.0,-41.0,-39.0,-32.0,-35.0,-42.0,-41.0,transit_stations


In [8]:
df_google_counties.head()

Unnamed: 0,COUNTY,NAME,addr,2020-02-15,2020-02-16,2020-02-17,2020-02-18,2020-02-19,2020-02-20,2020-02-21,...,2020-10-25,2020-10-26,2020-10-27,2020-08-22,2020-08-23,2020-08-29,2020-08-30,2020-09-05,2020-09-06,category
0,1001,Autauga County,"Alabama Autauga County, United States",,,5.0,0.0,0.0,2.0,0.0,...,2.0,3.0,5.0,,,,,,,residential
1,1003,Baldwin County,"Alabama Baldwin County, United States",-2.0,2.0,1.0,0.0,-1.0,3.0,-1.0,...,0.0,1.0,2.0,1.0,4.0,3.0,3.0,0.0,0.0,residential
2,1005,Barbour County,"Alabama Barbour County, United States",,,,,,,,...,,,,,,,,,,residential
3,1007,Bibb County,"Alabama Bibb County, United States",,,,,,,,...,,,,,,,,,,residential
4,1009,Blount County,"Alabama Blount County, United States",,,4.0,2.0,0.0,3.0,1.0,...,2.0,5.0,4.0,,,,,,,residential


In [9]:
print("Missing data? {}".format(data_cleaning_methods.missing_bool(df_google_states)))
abs_missing = data_cleaning_methods.frequency_missing(df_google_states, 'relative')
print("Absolute number of missing values: {}".format(abs_missing))
col_min, col_max = data_cleaning_methods.missing(df_google_states)
print("Column with lowest amount of missings contains {} % missings.".format(col_min))
print("Column with highest amount of missings contains {} % missings.\n\n".format(col_max))

Missing data? True
Absolute number of missing values: STATE         0
NAME          0
2020-02-15    0
2020-02-16    0
2020-02-17    0
2020-02-18    0
2020-02-19    0
2020-02-20    0
2020-02-21    0
2020-02-22    0
2020-02-23    0
2020-02-24    0
2020-02-25    0
2020-02-26    0
2020-02-27    0
2020-02-28    0
2020-02-29    0
2020-03-01    0
2020-03-02    0
2020-03-03    0
2020-03-04    0
2020-03-05    0
2020-03-06    0
2020-03-07    0
2020-03-08    0
2020-03-09    0
2020-03-10    0
2020-03-11    0
2020-03-12    0
2020-03-13    0
             ..
2020-09-29    0
2020-09-30    0
2020-10-01    0
2020-10-02    0
2020-10-03    1
2020-10-04    0
2020-10-05    0
2020-10-06    0
2020-10-07    0
2020-10-08    0
2020-10-09    0
2020-10-10    1
2020-10-11    0
2020-10-12    0
2020-10-13    0
2020-10-14    0
2020-10-15    0
2020-10-16    0
2020-10-17    1
2020-10-18    0
2020-10-19    0
2020-10-20    0
2020-10-21    0
2020-10-22    0
2020-10-23    0
2020-10-24    1
2020-10-25    0
2020-10-26    0
20

Some days have missing values; we are going to fill them with the median for the same day of the week.
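
The fill itself is not shown in this notebook. A minimal sketch of one way to do it follows, assuming "median depending on the day of the week" means the median of the same row (one region and category) over all columns that fall on the same weekday. `fill_with_weekday_median` and the toy frame are hypothetical, and the function assumes a unique row index, as produced by `ignore_index=True` above.

```python
import numpy as np
import pandas as pd

def fill_with_weekday_median(df, date_cols):
    """Fill gaps in each row with that row's median over the same day of the week."""
    filled = df.copy()
    weekdays = pd.to_datetime(pd.Index(date_cols)).dayofweek
    for day in range(7):
        cols = [c for c, wd in zip(date_cols, weekdays) if wd == day]
        if not cols:
            continue
        row_median = filled[cols].median(axis=1)  # one value per row, NaNs skipped
        # fillna aligns on the row index, so each row gets its own median.
        filled[cols] = filled[cols].apply(lambda col: col.fillna(row_median))
    return filled

# Toy wide table in the same layout as the state data (values are made up).
toy = pd.DataFrame({
    "NAME": ["A", "B"],
    "2020-10-03": [np.nan, 4.0],   # Saturday
    "2020-10-04": [1.0, 2.0],      # Sunday
    "2020-10-10": [6.0, np.nan],   # Saturday
    "2020-10-17": [10.0, 8.0],     # Saturday
})
date_cols = [c for c in toy.columns if c != "NAME"]
filled = fill_with_weekday_median(toy, date_cols)
print(filled)
```

A row with no observations at all for a given weekday has no median to draw on, so it stays missing. This matters for the county data, where some rows are entirely empty.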

Gaps are intentional and happen because the data doesn’t meet the quality and privacy threshold—when there isn’t enough data to ensure anonymity. When one (or more) of the categories has gaps, a report shows the following:

- An asterisk (*) next to a category name means the percentage change is for the last-recorded date and not the report date.
- A "Not enough data for this date" message.
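
The missing-value checks above rely on the `data_cleaning_methods` module, whose code is not shown here. For readers without that module, a rough pandas-only equivalent is sketched below. `missing_summary` is a hypothetical helper and is not claimed to match the module's implementation.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return (any_missing, per-column counts, min %, max %) for a DataFrame."""
    counts = df.isna().sum()          # absolute number of NaNs per column
    shares = df.isna().mean() * 100   # percentage of NaNs per column
    return bool(counts.any()), counts, shares.min(), shares.max()

# Toy frame: one of four values is missing in column 'a', none in 'b'.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": [1.0, 2.0, 3.0, 4.0]})
any_missing, counts, col_min, col_max = missing_summary(toy)
print(any_missing, counts.to_dict(), col_min, col_max)
```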

The same, by counties:

In [10]:
print("Missing data? {}".format(data_cleaning_methods.missing_bool(df_google_counties)))
abs_missing = data_cleaning_methods.frequency_missing(df_google_counties, 'relative')
print("Absolute number of missing values: {}".format(abs_missing))
col_min, col_max = data_cleaning_methods.missing(df_google_counties)
print("Column with lowest amount of missings contains {} % missings.".format(col_min))
print("Column with highest amount of missings contains {} % missings.\n\n".format(col_max))

Missing data? True
Absolute number of missing values: COUNTY            0
NAME              0
addr           1872
2020-02-15     9423
2020-02-16     9992
2020-02-17     8429
2020-02-18     8275
2020-02-19     8239
2020-02-20     8162
2020-02-21     8109
2020-02-22     9481
2020-02-23    10036
2020-02-24     8330
2020-02-25     8299
2020-02-26     8282
2020-02-27     8238
2020-02-28     8137
2020-02-29     9480
2020-03-01    10137
2020-03-02     8393
2020-03-03     8401
2020-03-04     8320
2020-03-05     8262
2020-03-06     8151
2020-03-07     9557
2020-03-08    10183
2020-03-09     8460
2020-03-10     8410
2020-03-11     8389
2020-03-12     8364
              ...  
2020-10-05     9832
2020-10-06     9805
2020-10-07     9755
2020-10-08     9700
2020-10-09     9571
2020-10-10    10955
2020-10-11    11553
2020-10-12     9824
2020-10-13     9776
2020-10-14     9754
2020-10-15     9692
2020-10-16     9524
2020-10-17    10965
2020-10-18    11546
2020-10-19     9804
2020-10-20     9755
2020-1

In this case, we can drop the `addr` column, which has 1872 missing values.

In [11]:
df_google_counties.drop(columns='addr', inplace=True)

#### Finally, we save two files, one for states and one for counties, each combining all the categories of places.

In [12]:
df_google_states.to_csv('../../data/foot_traffic/interim/google_states.csv', index = False)
df_google_counties.to_csv('../../data/foot_traffic/interim/google_counties.csv', index = False)

In [13]:
df_google_counties.head()

Unnamed: 0,COUNTY,NAME,2020-02-15,2020-02-16,2020-02-17,2020-02-18,2020-02-19,2020-02-20,2020-02-21,2020-02-22,...,2020-10-25,2020-10-26,2020-10-27,2020-08-22,2020-08-23,2020-08-29,2020-08-30,2020-09-05,2020-09-06,category
0,1001,Autauga County,,,5.0,0.0,0.0,2.0,0.0,,...,2.0,3.0,5.0,,,,,,,residential
1,1003,Baldwin County,-2.0,2.0,1.0,0.0,-1.0,3.0,-1.0,-2.0,...,0.0,1.0,2.0,1.0,4.0,3.0,3.0,0.0,0.0,residential
2,1005,Barbour County,,,,,,,,,...,,,,,,,,,,residential
3,1007,Bibb County,,,,,,,,,...,,,,,,,,,,residential
4,1009,Blount County,,,4.0,2.0,0.0,3.0,1.0,,...,2.0,5.0,4.0,,,,,,,residential
