# Recovery Data Partnership - SafeGraph user guide

### 1. Install Dependencies
Uncomment the line below and run it.

In [7]:
# ! pip3 install s3fs pandas

In [4]:
import pandas as pd
import s3fs

We will use `s3fs` to access our files in AWS S3. Authentication is handled as shown below:

In [5]:
s3 = s3fs.S3FileSystem(
      key='<YOUR KEY HERE>',
      secret='<YOUR SECRET HERE>', 
      client_kwargs={
          'endpoint_url': 'https://s3.amazonaws.com', 
          'region_name':'us-east-1'
      }
    )
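To keep credentials out of the notebook, one option is to read them from environment variables. The sketch below is only an illustration: the helper `s3_kwargs_from_env` and the variable names `RDP_S3_KEY` and `RDP_S3_SECRET` are hypothetical, and the endpoint and region are copied from the cell above.

```python
import os


def s3_kwargs_from_env(env=os.environ):
    """Build keyword arguments for s3fs.S3FileSystem from environment variables.

    RDP_S3_KEY and RDP_S3_SECRET are hypothetical names; use whatever fits your setup.
    """
    return {
        "key": env["RDP_S3_KEY"],
        "secret": env["RDP_S3_SECRET"],
        "client_kwargs": {
            "endpoint_url": "https://s3.amazonaws.com",
            "region_name": "us-east-1",
        },
    }


# Usage (requires s3fs and both variables to be set):
# import s3fs
# s3 = s3fs.S3FileSystem(**s3_kwargs_from_env())
```

A missing variable raises `KeyError`, which surfaces a configuration problem before any request is sent.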

### 2. Listing available files

`s3.ls` lists the contents of a directory given an S3 path. For this project, the `recovery-data-partnership/output` folder is the root folder for all delivered output files.


In [7]:
s3.ls('recovery-data-partnership/output/')

['recovery-data-partnership/output/example',
 'recovery-data-partnership/output/lookups',
 'recovery-data-partnership/output/poi',
 'recovery-data-partnership/output/social_distancing']

Under the `output` folder, data products are organized by category. All social-distancing data products are stored under `social_distancing`, and all point-of-interest and patterns datasets are stored under `poi`.

In [8]:
s3.ls('recovery-data-partnership/output/social_distancing/')

['recovery-data-partnership/output/social_distancing/weekly_county_trips',
 'recovery-data-partnership/output/social_distancing/weekly_state_trips']

Because the datasets are large, we partitioned each output table by year and quarter.

In [10]:
s3.ls('recovery-data-partnership/output/social_distancing/weekly_state_trips/')

['recovery-data-partnership/output/social_distancing/weekly_state_trips/weekly_state_trips_2019Q1.csv.zip',
 'recovery-data-partnership/output/social_distancing/weekly_state_trips/weekly_state_trips_2019Q2.csv.zip',
 'recovery-data-partnership/output/social_distancing/weekly_state_trips/weekly_state_trips_2019Q3.csv.zip',
 'recovery-data-partnership/output/social_distancing/weekly_state_trips/weekly_state_trips_2019Q4.csv.zip',
 'recovery-data-partnership/output/social_distancing/weekly_state_trips/weekly_state_trips_2020Q1.csv.zip',
 'recovery-data-partnership/output/social_distancing/weekly_state_trips/weekly_state_trips_2020Q2.csv.zip',
 'recovery-data-partnership/output/social_distancing/weekly_state_trips/weekly_state_trips_2020Q3.csv.zip']
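If only some partitions are needed, the year and quarter can be read from the file names. This is a minimal sketch that assumes every partition follows the `_<year>Q<quarter>.csv.zip` suffix seen in the listing above; `parse_partition` and `select_partitions` are hypothetical helpers, not part of `s3fs`.

```python
import re

# Matches the suffix seen in the listing above, e.g. "..._2020Q3.csv.zip"
PARTITION_RE = re.compile(r"_(\d{4})Q([1-4])\.csv\.zip$")


def parse_partition(path):
    """Return (year, quarter) for a partition path, or None if it does not match."""
    match = PARTITION_RE.search(path)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def select_partitions(paths, year):
    """Keep only the partition paths that belong to the given year."""
    selected = []
    for path in paths:
        partition = parse_partition(path)
        if partition is not None and partition[0] == year:
            selected.append(path)
    return selected
```

For example, `select_partitions(s3.ls('recovery-data-partnership/output/social_distancing/weekly_state_trips/'), 2020)` would keep only the 2020 files.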

### 3. Reading files into a dataframe and combining them

You can loop through all the year-quarter partitions of a dataset and use `pd.concat` to concatenate the partitions into one big table.

In [18]:
dfs = []
for dataset in s3.ls('recovery-data-partnership/output/social_distancing/weekly_state_trips/'):
  # Each partition is a zipped CSV: open it in binary mode and let pandas unzip it
  with s3.open(dataset, mode='rb') as f:
    dfs.append(pd.read_csv(f, compression='zip'))

In [19]:
weekly_state_trips = pd.concat(dfs)
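The same read-and-concatenate pattern can be tried without S3 access. The sketch below builds two small zipped CSVs in memory (the rows are made up for illustration and `make_zipped_csv` is a hypothetical helper) and combines them the same way.

```python
import io
import zipfile

import pandas as pd


def make_zipped_csv(name, text):
    """Return an in-memory zip archive containing a single CSV file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w") as archive:
        archive.writestr(name, text)
    buffer.seek(0)
    return buffer


# Two made-up partitions with identical columns (values are illustrative only)
partitions = [
    make_zipped_csv("q1.csv", "year_week,state,to_nyc\n2019-01,1,10\n"),
    make_zipped_csv("q2.csv", "year_week,state,to_nyc\n2019-14,1,20\n"),
]

# Read each zipped CSV, then stack the frames into one table
frames = [pd.read_csv(p, compression="zip") for p in partitions]
combined = pd.concat(frames, ignore_index=True)
```

Passing `ignore_index=True` gives the combined table a fresh row index instead of repeating each partition's own row numbers.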

> Note that because we partition by quarter, the last week of a quarter can cross into the next quarter and show up in two partitions. To avoid confusion, we recommend grouping by week and geographic boundary.

In [22]:
weekly_state_trips = weekly_state_trips.groupby(['year_week', 'state']).sum()

In [26]:
weekly_state_trips

year_week,state,to_nyc,from_nyc,net_nyc
2019-01,1,48691,21817,26874
2019-01,2,2288,2044,244
2019-01,4,42800,56635,-13835
2019-01,5,20910,14140,6770
2019-01,6,364721,507026,-142305
...,...,...,...,...
2020-40,54,3038,8467,-5429
2020-40,55,2086,7714,-5628
2020-40,56,434,3882,-3448
2020-40,72,2828,15001,-12173
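To see why the group-by matters, here is a small sketch with made-up numbers. It assumes week `2019-13` straddles two quarters and that each partition holds only the trips falling in its own quarter, so summing restores the weekly total.

```python
import pandas as pd

# Made-up rows: assume week "2019-13" straddles Q1 and Q2 and appears in both partitions
q1 = pd.DataFrame({"year_week": ["2019-12", "2019-13"], "state": [36, 36], "to_nyc": [100, 40]})
q2 = pd.DataFrame({"year_week": ["2019-13", "2019-14"], "state": [36, 36], "to_nyc": [60, 90]})

combined = pd.concat([q1, q2], ignore_index=True)

# Before grouping, the (2019-13, 36) key is present on two rows
duplicated = int(combined.duplicated(subset=["year_week", "state"], keep=False).sum())

# Grouping by week and state collapses the two partial rows into one total
weekly = combined.groupby(["year_week", "state"]).sum()
```

After grouping, each week and state pair appears exactly once, indexed by `(year_week, state)`.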


### 4. Lookup tables

We also prepared lookup tables, stored under `lookups`:

In [24]:
s3.ls('recovery-data-partnership/output/lookups/')

['recovery-data-partnership/output/lookups/',
 'recovery-data-partnership/output/lookups/fips_to_county.csv',
 'recovery-data-partnership/output/lookups/fips_to_state.csv',
 'recovery-data-partnership/output/lookups/naics_sector.csv',
 'recovery-data-partnership/output/lookups/naics_subsector.csv',
 'recovery-data-partnership/output/lookups/nta_to_boro_county.csv']
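A lookup table can be joined onto a data product to replace codes with readable names. The sketch below uses small stand-in frames. The lookup column names `fips` and `state_name` are assumptions, since the actual columns of `fips_to_state.csv` are not shown in this guide, so inspect the file before adapting this.

```python
import pandas as pd

# Stand-in for the grouped trips table, with the index turned back into columns
trips = pd.DataFrame(
    {"year_week": ["2019-01", "2019-01"], "state": [1, 6], "to_nyc": [48691, 364721]}
)

# Stand-in for fips_to_state.csv; these column names are hypothetical
fips_to_state = pd.DataFrame({"fips": [1, 6], "state_name": ["Alabama", "California"]})

# A left join keeps every trips row, even if a code is missing from the lookup
labeled = trips.merge(
    fips_to_state, left_on="state", right_on="fips", how="left"
).drop(columns="fips")
```

With the real data, `trips` would be `weekly_state_trips.reset_index()` and the lookup would be read with `pd.read_csv(s3.open('recovery-data-partnership/output/lookups/fips_to_state.csv'))`.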