# Accessing data from Google Cloud Storage

### Purpose
The purpose of this notebook is to explore using Google Cloud Storage to cache COVID data, so that the data does not have to be requested and processed each time the Heroku server is accessed.

- First, connect and read from cloud storage, from a notebook.
- Second, write to cloud storage, from a notebook.
- Third, update the Heroku server and test reading from cloud storage.
- Fourth, write a script that will request and process data from the source and write to cloud storage.

## Set Up
First we need to move this notebook's working directory up one level, to the root directory of the app.

This is only needed because all notebooks are kept in a `notebooks` directory for cleanliness.

In [1]:
pwd

'/Users/DanOvadia/Projects/covid-hotspots/notebooks'

In [2]:
cd ..

/Users/DanOvadia/Projects/covid-hotspots


In [3]:
import pandas as pd
from os import path
from google.cloud import storage

from config import config

%load_ext autoreload
%autoreload 1
%aimport config.config

To access Google Cloud Storage, we need to set the environment variable `GOOGLE_APPLICATION_CREDENTIALS` for this session. This variable points to the service account credentials JSON file.

In the actual app, we will handle this separately.

In [5]:
# App Engine default service account credentials 
config_file = config.config.service_account_credentials_file

In [7]:
# Assign environment variable
%env GOOGLE_APPLICATION_CREDENTIALS=/Users/DanOvadia/Projects/covid-hotspots/config/$config_file

env: GOOGLE_APPLICATION_CREDENTIALS=/Users/DanOvadia/Projects/covid-hotspots/config/ytd-shared-project-eb630837f7b3.json
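The `%env` magic only works inside IPython, so a plain Python script would need another way to set the variable. Below is a minimal sketch of one option, using `os.environ`. The helper name `set_gcs_credentials` and its parameters are hypothetical and not part of this app; the only assumption taken from above is that the credentials file lives in a config directory.

```python
import os


def set_gcs_credentials(config_dir, credentials_file):
    """Point GOOGLE_APPLICATION_CREDENTIALS at a service account key file.

    Hypothetical helper: `config_dir` and `credentials_file` are supplied
    by the caller (for example, from the app's own config module).
    """
    # Build an absolute path so the setting survives later directory changes.
    credentials_path = os.path.abspath(os.path.join(config_dir, credentials_file))

    # The Google client libraries read this variable when a client is created
    # without explicit credentials.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_path
    return credentials_path
```

In this notebook, the call would look something like `set_gcs_credentials("config", config_file)`, run from the app's root directory before creating a `storage.Client()`.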


Let's test whether we've established authentication for Google Cloud Storage.

In [13]:
def implicit():
    from google.cloud import storage

    # If you don't specify credentials when constructing the client, the
    # client library will look for credentials in the environment.
    storage_client = storage.Client()

    # Make an authenticated API request
    buckets = list(storage_client.list_buckets())
    print(buckets)


implicit()

[<Bucket: staging.ytd-shared-project.appspot.com>, <Bucket: us.artifacts.ytd-shared-project.appspot.com>, <Bucket: us_covid_hotspot-bucket>, <Bucket: ytd-shared-project.appspot.com>]


Yup, we look good. Now, let's import the data.

## Read from Cloud Storage

In [10]:
df = pd.read_csv('gs://us_covid_hotspot-bucket/covid_counties_20200831.csv')

In [11]:
df.head()

Unnamed: 0,date,county,state,fips,cases,deaths,SUMLEV,REGION,DIVISION,STATE,...,CTYNAME,POPESTIMATE2019,CENSUS2010POP,FIPS,casesPerMillion,deathsPerMillion,case_diff,death_diff,cases_14MA,deaths_14MA
0,2020-01-21,Snohomish,Washington,53061.0,1,0,50.0,4.0,9.0,53.0,...,Snohomish County,822083.0,713335.0,53061.0,1.216422,0.0,,,,
1,2020-01-22,Snohomish,Washington,53061.0,1,0,50.0,4.0,9.0,53.0,...,Snohomish County,822083.0,713335.0,53061.0,1.216422,0.0,0.0,0.0,,
2,2020-01-23,Snohomish,Washington,53061.0,1,0,50.0,4.0,9.0,53.0,...,Snohomish County,822083.0,713335.0,53061.0,1.216422,0.0,0.0,0.0,,
3,2020-01-24,Cook,Illinois,17031.0,1,0,50.0,2.0,3.0,17.0,...,Cook County,5150233.0,5194675.0,17031.0,0.194166,0.0,,,,
4,2020-01-24,Snohomish,Washington,53061.0,1,0,50.0,4.0,9.0,53.0,...,Snohomish County,822083.0,713335.0,53061.0,1.216422,0.0,0.0,0.0,,
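Reading a `gs://` path directly with pandas relies on an extra filesystem package (`gcsfs`) being installed. As an alternative, here is a rough sketch that downloads the object through a storage client and parses it in memory. The function name `read_csv_from_gcs` is hypothetical, and it assumes a `google-cloud-storage` version recent enough to provide `Blob.download_as_bytes()`.

```python
import io

import pandas as pd


def read_csv_from_gcs(client, bucket_name, blob_name, **read_csv_kwargs):
    """Download a CSV object from a bucket and parse it into a DataFrame.

    Hypothetical helper: `client` is expected to be a
    google.cloud.storage.Client (or anything exposing the same
    bucket().blob().download_as_bytes() chain).
    """
    blob = client.bucket(bucket_name).blob(blob_name)

    # Fetch the whole object into memory; fine for modest file sizes.
    data = blob.download_as_bytes()

    # Extra keyword arguments are passed straight through to pandas.
    return pd.read_csv(io.BytesIO(data), **read_csv_kwargs)
```

With the client from the authentication test, usage might look like `read_csv_from_gcs(storage.Client(), 'us_covid_hotspot-bucket', 'covid_counties_20200831.csv', index_col=0)`. Passing `index_col=0` is a guess that the `Unnamed: 0` column in the output above is a saved index; if so, this keeps it from appearing as a data column.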
