# Foot traffic data cleaning process

In [1]:
import pandas as pd
import numpy as np
import os
import data_cleaning_methods

## I. Descartes Labs Mobility Change

These datasets show how mobility, measured as the distance a typical member of a population moves in a day, changes compared to a baseline period.

m50: The median of the max-distance mobility for all samples in the specified region.
m50_index: The percent of normal m50 in the region, with normal m50 defined during 2020-02-17 to 2020-03-07.
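
The relationship between the two metrics can be sketched on toy numbers. This is only an illustration: the values are made up, and taking the median over the baseline window as "normal" m50 is an assumption, since the text above only states the window dates.

```python
import pandas as pd

# Toy daily m50 values for one region; the numbers are made up for illustration.
m50 = pd.Series(
    [10.0, 12.0, 11.0, 5.5, 6.0],
    index=pd.to_datetime(
        ["2020-02-17", "2020-02-18", "2020-03-06", "2020-03-20", "2020-03-21"]
    ),
)

# Assumption: "normal" m50 is the median over the baseline window.
baseline = m50.loc["2020-02-17":"2020-03-07"].median()

# m50_index: each day's m50 expressed as a percent of the baseline value.
m50_index = (100 * m50 / baseline).round()
print(m50_index)
```

An index of 100 therefore means "as mobile as during the baseline window", and 50 means half the baseline distance.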

The file named YY_US_DL-us-m50.csv contains the median of the max-distance mobility for all samples in the specified region at the YY level, where YY is State or County.

The file named YY_US_DL-us-m50_index.csv contains the percent of normal m50 in the region, with normal m50 defined during 2020-02-17 to 2020-03-07. YY is again State or County.
 - Data Sources: Descartes Labs, Data for Mobility Changes in Response to COVID-19: https://github.com/descarteslabs/DL-COVID-19
 - People Contribution & Credit: Xiaokang Fu, Tao Hu (data crawler, data quality control and data validation)
 
### Mobility Changes in Response to COVID-19
#### Abstract
In response to the COVID-19 pandemic, both voluntary changes in behavior and administrative restrictions on human interactions have occurred. These actions are intended to reduce the transmission rate of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). We use anonymized and/or de-identified mobile device locations to measure mobility, a statistic representing the distance a typical member of a given population moves in a day. Results indicate that a large reduction in mobility has taken place, both in the United States and globally. In the US, large mobility reductions have been detected associated with the onset of the COVID-19 threat and specific government directives. Mobility data at the US admin1 (state) and admin2 (county) level have been made freely available under a Creative Commons Attribution (CC BY 4.0) license via the GitHub repository.

In [2]:
# Folder holding the Descartes Labs source files
source_dir = '../../data/foot_traffic/source/Descartes_Lab_Mobility_Change/'
prefix = 'County_US_DL'
df_counties_lists = []

# Load every county-level file; os.listdir returns names in arbitrary order
for f in os.listdir(source_dir):
    if prefix in f:
        data = pd.read_csv(os.path.join(source_dir, f), index_col=0)
        df_counties_lists.append(data)

In [3]:
# Folder holding the Descartes Labs source files
source_dir = '../../data/foot_traffic/source/Descartes_Lab_Mobility_Change/'
prefix = 'State_US_DL'
df_states_lists = []

# Load every state-level file; os.listdir returns names in arbitrary order
for f in os.listdir(source_dir):
    if prefix in f:
        data = pd.read_csv(os.path.join(source_dir, f), index_col=0)
        df_states_lists.append(data)
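
Because `os.listdir` does not guarantee any order, the position of a file in the lists above can differ between machines. A minimal sketch of an order-independent alternative is to key each DataFrame by its metric. The helper `load_by_metric` is hypothetical, and it assumes the files are named exactly as in the pattern described earlier (`YY_US_DL-us-m50.csv` and `YY_US_DL-us-m50_index.csv`).

```python
import os

import pandas as pd


def load_by_metric(source_dir, level):
    """Return {'m50': df, 'm50_index': df} for level 'County' or 'State'.

    Hypothetical helper; assumes the file name pattern YY_US_DL-us-<metric>.csv.
    """
    frames = {}
    for metric in ("m50", "m50_index"):
        file_name = "{}_US_DL-us-{}.csv".format(level, metric)
        frames[metric] = pd.read_csv(
            os.path.join(source_dir, file_name), index_col=0
        )
    return frames
```

With this layout, `frames["m50_index"]` always refers to the percent-of-normal file, whatever order the directory listing returns.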

Number of datasets by type (county divisions and state divisions):

In [4]:
print('Number of datasets with county divisions: {}'.format(len(df_counties_lists)))
print('Number of datasets with state divisions: {}'.format(len(df_states_lists)))

Number of datasets with county divisions: 2
Number of datasets with state divisions: 2


In [5]:
df_counties_lists[0].head(2)

Unnamed: 0,COUNTY,NAME,country_code,admin_level,admin1,admin2,2020-03-01,2020-03-02,2020-03-03,2020-03-04,...,2020-10-21,2020-10-22,2020-10-23,2020-10-24,2020-10-25,2020-10-26,2020-10-27,2020-10-28,2020-10-29,2020-10-30
0,1001,Autauga County,US,2.0,Alabama,Autauga County,49.0,100.0,95.0,95.0,...,80.0,79.0,91.0,47.0,39.0,66.0,76.0,76.0,65.0,91.0
1,1003,Baldwin County,US,2.0,Alabama,Baldwin County,81.0,100.0,95.0,90.0,...,86.0,86.0,99.0,75.0,69.0,80.0,82.0,64.0,67.0,93.0


In [6]:
df_counties_lists[0].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3143 entries, 0 to 3142
Columns: 247 entries, COUNTY to 2020-10-30
dtypes: float64(242), int64(1), object(4)
memory usage: 5.9+ MB


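- 3143 rows and 247 variables
- 4 categorical variables (NAME, country_code, admin1, admin2), the integer COUNTY code, and 242 float64 variables: admin_level plus 241 daily columns. The daily values shown above (49.0, 100.0, 95.0, ...) are percentages, so this file holds m50_index: the percent of normal m50 in the region (in this case, county), with normal m50 defined during 2020-02-17 to 2020-03-07.
- Dates from 2020-03-01 to 2020-10-30. Only 241 of the 244 calendar days in that range have a column.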
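
The column counts in the bullets above can be checked programmatically. The sketch below separates metadata columns from daily columns and lists calendar days that have no column. The helpers `split_columns` and `missing_dates` are hypothetical names, shown here on a toy frame rather than on the real data.

```python
import pandas as pd


def split_columns(df):
    """Split column names into (metadata columns, date columns)."""
    parsed = pd.to_datetime(
        pd.Series(df.columns, index=df.columns),
        format="%Y-%m-%d",
        errors="coerce",  # non-date names such as 'COUNTY' become NaT
    )
    date_cols = parsed.dropna().index.tolist()
    meta_cols = parsed[parsed.isna()].index.tolist()
    return meta_cols, date_cols


def missing_dates(date_cols):
    """List calendar days absent between the first and last date column."""
    dates = pd.to_datetime(date_cols)
    full_range = pd.date_range(dates.min(), dates.max(), freq="D")
    return full_range.difference(dates).strftime("%Y-%m-%d").tolist()


# Toy frame with one gap in the daily columns.
toy = pd.DataFrame(columns=["COUNTY", "NAME", "2020-04-19", "2020-04-21"])
meta_cols, date_cols = split_columns(toy)
gaps = missing_dates(date_cols)
print(meta_cols, date_cols, gaps)
```

On the real data one would call `split_columns(df_counties_lists[0])` and compare `len(date_cols)` with the length of the full calendar range.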

In [7]:
df_counties_lists[1].head(2)

Unnamed: 0,COUNTY,NAME,country_code,admin_level,admin1,admin2,2020-03-01,2020-03-02,2020-03-03,2020-03-04,...,2020-10-21,2020-10-22,2020-10-23,2020-10-24,2020-10-25,2020-10-26,2020-10-27,2020-10-28,2020-10-29,2020-10-30
0,1001,Autauga County,US,2.0,Alabama,Autauga County,7.194,14.587,13.865,13.88,...,11.757,11.57,13.286,6.866,5.692,9.67,11.121,11.131,9.503,13.419
1,1003,Baldwin County,US,2.0,Alabama,Baldwin County,9.78,12.042,11.481,10.879,...,10.37,10.37,11.964,9.052,8.372,9.642,9.926,7.776,8.138,11.214


In [8]:
df_counties_lists[1].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3143 entries, 0 to 3142
Columns: 247 entries, COUNTY to 2020-10-30
dtypes: float64(242), int64(1), object(4)
memory usage: 5.9+ MB


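- 3143 rows and 247 variables
- 4 categorical variables (NAME, country_code, admin1, admin2), the integer COUNTY code, and 242 float64 variables: admin_level plus 241 daily columns. The daily values shown above (7.194, 14.587, 13.865, ...) are distances rather than percentages, so this file holds m50: the median of the max-distance mobility for all samples in the specified region (in this case, county) by date.
- Dates from 2020-03-01 to 2020-10-30. Only 241 of the 244 calendar days in that range have a column.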

In [9]:
df_states_lists[0].head(2)

Unnamed: 0,STATE,NAME,country_code,admin_level,admin1,admin2,2020-03-01,2020-03-02,2020-03-03,2020-03-04,...,2020-10-21,2020-10-22,2020-10-23,2020-10-24,2020-10-25,2020-10-26,2020-10-27,2020-10-28,2020-10-29,2020-10-30
0,1,Alabama,US,1,Alabama,,79.0,98.0,100.0,96.0,...,94.0,94.0,110.0,76.0,63.0,84.0,89.0,83.0,87.0,108.0
1,2,Alaska,US,1,Alaska,,20.0,86.0,94.0,100.0,...,90.0,96.0,100.0,72.0,30.0,75.0,85.0,82.0,77.0,81.0


In [10]:
df_states_lists[0].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 50
Columns: 247 entries, STATE to 2020-10-30
dtypes: float64(208), int64(36), object(3)
memory usage: 98.8+ KB


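- 51 rows and 247 variables
- 3 categorical variables (NAME, country_code, admin1) and 244 numerical variables: the integer STATE and admin_level columns, the empty admin2 column, and 241 daily columns. Some daily columns are stored as int64, which is why `info()` reports 36 int64 columns. The daily values shown above (79.0, 98.0, 100.0, ...) are percentages, so this file holds m50_index: the percent of normal m50 in the region (in this case, state), with normal m50 defined during 2020-02-17 to 2020-03-07.
- Dates from 2020-03-01 to 2020-10-30. Only 241 of the 244 calendar days in that range have a column.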

In [11]:
df_states_lists[1].head(2)

Unnamed: 0,STATE,NAME,country_code,admin_level,admin1,admin2,2020-03-01,2020-03-02,2020-03-03,2020-03-04,...,2020-10-21,2020-10-22,2020-10-23,2020-10-24,2020-10-25,2020-10-26,2020-10-27,2020-10-28,2020-10-29,2020-10-30
0,1,Alabama,US,1,Alabama,,8.331,10.398,10.538,10.144,...,9.98,9.908,11.684,8.053,6.684,8.868,9.398,8.766,9.253,11.41
1,2,Alaska,US,1,Alaska,,0.808,3.463,3.791,4.003,...,3.638,3.848,4.03,2.906,1.208,3.029,3.409,3.312,3.106,3.274


In [12]:
df_states_lists[1].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 50
Columns: 247 entries, STATE to 2020-10-30
dtypes: float64(242), int64(2), object(3)
memory usage: 98.8+ KB


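- 51 rows and 247 variables
- 3 categorical variables (NAME, country_code, admin1), 2 int64 variables (STATE and admin_level), and 242 float64 variables: the empty admin2 column plus 241 daily columns. The daily values shown above (8.331, 10.398, 10.538, ...) are distances rather than percentages, so this file holds m50: the median of the max-distance mobility for all samples in the specified region (in this case, state) by date.
- Dates from 2020-03-01 to 2020-10-30. Only 241 of the 244 calendar days in that range have a column.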

## 1. Missing data:

In [13]:
for df in df_states_lists:
    print("\n\nMissing data? {}".format(data_cleaning_methods.missing_bool(df)))
    abs_missing = data_cleaning_methods.frequency_missing(df, 'absolute')
    print("Absolute number of missing values: {}".format(abs_missing))
    col_min, col_max = data_cleaning_methods.missing(df)
    print("Column with lowest amount of missings contains {} % missings.".format(col_min))
    print("Column with highest amount of missings contains {} % missings.".format(col_max))
    data_cleaning_methods.columns_rows_missing(df)



Missing data? True
Absolute number of missing values: 51
Column with lowest amount of missings contains 0.0 % missings.
Column with highest amount of missings contains 100.0 % missings.
Empty DataFrame
Columns: [STATE, NAME, country_code, admin_level, admin1, admin2, 2020-03-01, 2020-03-02, 2020-03-03, 2020-03-04, 2020-03-05, 2020-03-06, 2020-03-07, 2020-03-08, 2020-03-09, 2020-03-10, 2020-03-11, 2020-03-12, 2020-03-13, 2020-03-14, 2020-03-15, 2020-03-16, 2020-03-17, 2020-03-18, 2020-03-19, 2020-03-20, 2020-03-21, 2020-03-22, 2020-03-23, 2020-03-24, 2020-03-25, 2020-03-26, 2020-03-27, 2020-03-28, 2020-03-29, 2020-03-30, 2020-03-31, 2020-04-01, 2020-04-02, 2020-04-03, 2020-04-04, 2020-04-05, 2020-04-06, 2020-04-07, 2020-04-08, 2020-04-09, 2020-04-10, 2020-04-11, 2020-04-12, 2020-04-13, 2020-04-14, 2020-04-15, 2020-04-16, 2020-04-17, 2020-04-18, 2020-04-19, 2020-04-21, 2020-04-22, 2020-04-23, 2020-04-24, 2020-04-25, 2020-04-26, 2020-04-27, 2020-04-28, 2020-04-29, 2020-04-30, 2020-05-01

In [14]:
for df in df_counties_lists:
    print("\n\nMissing data? {}".format(data_cleaning_methods.missing_bool(df)))
    abs_missing = data_cleaning_methods.frequency_missing(df, 'relative')
    print("Absolute number of missing values: {}".format(abs_missing))
    col_min, col_max = data_cleaning_methods.missing(df)
    print("Column with lowest amount of missings contains {} % missings.".format(col_min))
    print("Column with highest amount of missings contains {} % missings.".format(col_max))



Missing data? True
Absolute number of missing values: COUNTY            0
NAME              0
country_code    473
admin_level     473
admin1          473
admin2          473
2020-03-01      534
2020-03-02      527
2020-03-03      519
2020-03-04      498
2020-03-05      518
2020-03-06      531
2020-03-07      521
2020-03-08      532
2020-03-09      513
2020-03-10      519
2020-03-11      512
2020-03-12      514
2020-03-13      528
2020-03-14      540
2020-03-15      546
2020-03-16      553
2020-03-17      525
2020-03-18      537
2020-03-19      534
2020-03-20      562
2020-03-21      616
2020-03-22      561
2020-03-23      533
2020-03-24      515
               ... 
2020-09-30      475
2020-10-01      476
2020-10-02      475
2020-10-03      474
2020-10-04      474
2020-10-05      474
2020-10-06      474
2020-10-07      474
2020-10-09      475
2020-10-10      475
2020-10-11      475
2020-10-12      474
2020-10-13      474
2020-10-14      474
2020-10-15      474
2020-10-16      474
2020
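
The module `data_cleaning_methods` is not shown in this notebook, so its exact behaviour is unknown here. As a rough equivalent, the same kind of summary can be sketched with plain pandas calls. `missing_summary` is a hypothetical helper and is not claimed to match the module's implementation.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Summarise missing values with plain pandas calls."""
    per_column = df.isna().sum()      # absolute count of NaN per column
    percent = df.isna().mean() * 100  # share of rows missing per column
    return {
        "any_missing": bool(per_column.any()),
        "total_missing": int(per_column.sum()),
        "min_percent": float(percent.min()),
        "max_percent": float(percent.max()),
    }


# Toy frame: admin2 is entirely empty, as in the state-level files.
toy = pd.DataFrame(
    {"STATE": [1, 2], "admin2": [np.nan, np.nan], "2020-03-01": [79.0, 20.0]}
)
summary = missing_summary(toy)
print(summary)
```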

Checking country_codes:

In [15]:
df_counties_lists[1].groupby('country_code').count()

country_code,COUNTY,NAME,admin_level,admin1,admin2,2020-03-01,2020-03-02,2020-03-03,2020-03-04,2020-03-05,...,2020-10-21,2020-10-22,2020-10-23,2020-10-24,2020-10-25,2020-10-26,2020-10-27,2020-10-28,2020-10-29,2020-10-30
US,2670,2670,2670,2670,2670,2609,2616,2624,2645,2625,...,2669,2669,2669,2669,2669,2669,2669,2669,2669,2669


In [16]:
df_states_lists[1].groupby('country_code').count()

country_code,STATE,NAME,admin_level,admin1,admin2,2020-03-01,2020-03-02,2020-03-03,2020-03-04,2020-03-05,...,2020-10-21,2020-10-22,2020-10-23,2020-10-24,2020-10-25,2020-10-26,2020-10-27,2020-10-28,2020-10-29,2020-10-30
US,51,51,51,51,0,51,51,51,51,51,...,51,51,51,51,51,51,51,51,51,51


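Every row that has a country code belongs to the US. Note that `groupby` omits rows whose country_code is NaN, so the 473 county rows without a country code (3143 rows in total, 2670 counted above) do not appear in these tables.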
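
One caveat with this check is that `groupby` leaves out rows whose key is NaN by default, so rows with a missing country code are invisible in the result. A small sketch on toy data (the rows are made up) shows how `value_counts(dropna=False)` makes them visible:

```python
import numpy as np
import pandas as pd

# Toy frame: two US rows and one row without a country code.
toy = pd.DataFrame(
    {"COUNTY": [1001, 1003, 1005], "country_code": ["US", "US", np.nan]}
)

# groupby skips the NaN key, so only the US group is reported.
grouped = toy.groupby("country_code")["COUNTY"].count()

# value_counts with dropna=False also counts the rows without a code.
counts = toy["country_code"].value_counts(dropna=False)
n_missing = int(toy["country_code"].isna().sum())

print(grouped)
print(counts)
```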

#### For the county datasets, we drop every row that contains a NaN value and remove the admin columns. The clean datasets are then saved in the interim folder.

In [17]:
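# Names follow the order seen in the head() outputs above: index 0 holds the
# percent values (m50_index) and index 1 the distance values (m50).
# os.listdir order is not guaranteed, so re-check this mapping on a new run.
csv_names = ['m50_percent_counties', 'm50_max_counties']

for i, df in enumerate(df_counties_lists):
    df.dropna(inplace=True)  # drop every row with at least one NaN
    df.drop(columns=['admin_level', 'admin1', 'admin2'], inplace=True)
    df.to_csv('../../data/foot_traffic/interim/'+csv_names[i]+'.csv', index = False)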
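
Calling `dropna()` without arguments removes a row as soon as a single daily value is missing, which can discard counties that only lack a few days. The toy comparison below (made-up rows) contrasts this with keeping any row whose metadata is present. It is only a sketch of the trade-off, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "COUNTY": [1001, 1003, 1005],
        "country_code": ["US", "US", np.nan],
        "2020-03-01": [49.0, np.nan, np.nan],
        "2020-03-02": [100.0, 100.0, np.nan],
    }
)

strict = toy.dropna()                          # drops rows with any NaN
lenient = toy.dropna(subset=["country_code"])  # only requires the metadata

print(len(toy), len(strict), len(lenient))
```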

#### For the state datasets, we drop the columns that contain NaN values (only admin2, which is entirely empty) and remove the remaining admin columns.

In [18]:
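# Names follow the order seen in the head() outputs above: index 0 holds the
# percent values (m50_index) and index 1 the distance values (m50).
# os.listdir order is not guaranteed, so re-check this mapping on a new run.
csv_names = ['m50_percent_states', 'm50_max_states']

for i, df in enumerate(df_states_lists):
    df.dropna(axis='columns', inplace=True)  # drop every column with at least one NaN
    df.drop(columns=['admin_level', 'admin1'], inplace=True)
    df.to_csv('../../data/foot_traffic/interim/'+csv_names[i]+'.csv', index = False)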
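
With `axis='columns'` and the default `how='any'`, a single missing value is enough to drop a whole daily column. That is harmless here because admin2 is the only column with missing values in the state files, but on other data `how='all'` may be the safer choice. A toy illustration with made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "STATE": [1, 2],
        "admin2": [np.nan, np.nan],    # entirely empty column
        "2020-03-01": [79.0, np.nan],  # one missing value
    }
)

any_nan = toy.dropna(axis="columns")             # default how="any"
all_nan = toy.dropna(axis="columns", how="all")  # only drops empty columns

print(list(any_nan.columns), list(all_nan.columns))
```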