# Accessing data from Google Cloud Storage

### Purpose
The purpose of this notebook is to explore using Google Cloud Storage to cache COVID data, so that the data does not have to be requested and processed each time the Heroku server is accessed.

- First, connect and read from cloud storage, from a notebook.
- Second, write to cloud storage, from a notebook.
- Third, update the Heroku server and test reading from cloud storage.
- Fourth, write a script that will request and process data from the source and write to cloud storage.

## Set Up
First we need to move this notebook's working directory up one level, to the root directory of the app.

This is purely for tidiness: it lets us keep all notebooks in a `notebooks` directory.

In [1]:
pwd

'/Users/DanOvadia/Projects/covid-hotspots/notebooks'

In [2]:
cd ..

/Users/DanOvadia/Projects/covid-hotspots
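
Re-running a bare `cd ..` cell climbs one more level each time. A minimal sketch of an idempotent alternative follows; the helper name `move_to_project_root` and the assumption that notebooks live in a directory called `notebooks` are illustrative, taken from the path printed by `pwd` rather than from the app's code.

```python
import os


def move_to_project_root(notebooks_dirname="notebooks"):
    """Change to the parent directory only if we are inside the notebooks directory.

    `notebooks_dirname` is an assumption based on the path shown by `pwd`.
    Returns the working directory after the (possible) change.
    """
    cwd = os.getcwd()
    if os.path.basename(cwd) == notebooks_dirname:
        os.chdir(os.path.dirname(cwd))
    return os.getcwd()
```

Calling it a second time leaves the working directory unchanged, so the cell can be re-run safely.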


## Import Libraries

In [76]:
import pandas as pd
import os
from google.cloud import storage

from config import config
from modules import data_processing

%load_ext autoreload
%autoreload 1
%aimport config.config
%aimport modules.data_processing

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In order to access Google Cloud Storage, we need to set the environment variable `GOOGLE_APPLICATION_CREDENTIALS` for this session. It points to the service account credentials JSON file.

In the actual app, we will handle this separately.

In [4]:
# App Engine default service account credentials 
config_file = config.config.service_account_credentials_file

In [10]:
# Assign environment variable if it hasn't been assigned yet
if 'GOOGLE_APPLICATION_CREDENTIALS' not in os.environ:
    %env GOOGLE_APPLICATION_CREDENTIALS=/Users/DanOvadia/Projects/covid-hotspots/config/$config_file
    
print(f"Credentials already in environment: {'GOOGLE_APPLICATION_CREDENTIALS' in os.environ}")


Credentials already in environment: True
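
The `%env` magic only exists inside IPython. Outside a notebook, the same effect could be had with `os.environ`; the sketch below is one way to do it. The helper name `ensure_credentials_env` and its arguments are hypothetical and not part of the app.

```python
import os

ENV_VAR = "GOOGLE_APPLICATION_CREDENTIALS"


def ensure_credentials_env(config_dir, credentials_file, environ=os.environ):
    """Point ENV_VAR at the credentials JSON unless it is already set.

    Returns the value that ends up in the environment mapping.
    """
    # setdefault leaves an existing value untouched, mirroring the
    # "assign only if it hasn't been assigned yet" check above.
    return environ.setdefault(ENV_VAR, os.path.join(config_dir, credentials_file))
```

Passing a plain `dict` as `environ` makes the helper easy to try out without touching the real process environment.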


Let's test whether we've established authentication for Google Cloud Storage.

### List available buckets for our service account

In [46]:
def implicit():
    from google.cloud import storage

    # If you don't specify credentials when constructing the client, the
    # client library will look for credentials in the environment.
    storage_client = storage.Client()

    # Make an authenticated API request
    buckets = [bucket.name for bucket in storage_client.list_buckets()]
    print(buckets)
    return buckets
a = implicit()

['staging.ytd-shared-project.appspot.com', 'us.artifacts.ytd-shared-project.appspot.com', 'us_covid_hotspot-bucket', 'ytd-shared-project.appspot.com']


Yup, we look good. Now, let's import the data.

## Connect to cloud storage

First we create a `google.cloud.storage.client.Client` object, which bundles the configuration needed for API requests.

In [None]:
CLIENT = storage.Client()

Using the name of a bucket, we grab a bucket object from our client bundle.

In [49]:
BUCKET_NAME = 'us_covid_hotspot-bucket'
BUCKET = CLIENT.get_bucket(bucket_or_name = BUCKET_NAME)

In [50]:
[blob.name for blob in BUCKET.list_blobs()]

['covid_counties_20200831.csv',
 'covid_states.csv',
 'test-covid_counties_20200831.csv.gz']

In [20]:
FILENAME = 'covid_states.csv'

In [None]:
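# Blob expects the blob's name and a Bucket object (not the bucket's name)
blob = storage.blob.Blob(name = FILENAME, bucket = BUCKET)

In [None]:
# get_blob needs the name of the blob to fetch
BUCKET.get_blob(FILENAME)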

In [57]:
BUCKET_NAME = 'us_covid_hotspot-bucket'
BUCKET = CLIENT.get_bucket(bucket_or_name=BUCKET_NAME)

In [58]:
BUCKET_NAME = 'us_covid_hotspot-bucket'
BLOB_NAME = "covid_states.csv"

In [60]:
df = get_df_from_blob(BUCKET_NAME, BLOB_NAME)

In [70]:
def get_df_from_blob(bucket_name, blob_name):
    # Build the URI from the arguments, not the notebook-level globals
    blob_uri = f"gs://{bucket_name}/{blob_name}"

    # 'infer' picks gzip for names ending in .gz and no compression otherwise,
    # so both covid_states.csv and the .csv.gz blobs can be read
    return pd.read_csv(blob_uri, compression = 'infer')

## Reading from a blob

I tried two ways of reading from the blob. The first downloads the blob's contents as a byte string, wraps it in `BytesIO`, and passes that to `pd.read_csv`.

The second uses pandas' built-in support for reading directly from a blob via its `gs://[BUCKET_NAME]/[BLOB_NAME]` URI.

In [53]:
BLOB_URI = f"gs://{BUCKET_NAME}/{BLOB_NAME}"

In [54]:
BLOB_URI

'gs://us_covid_hotspot-bucket/covid_states.csv'
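
Reading a `gs://` URI with pandas relies on the optional `gcsfs` package being installed. The URI itself is plain string handling; as a small illustration, the hypothetical helpers below (not part of the notebook or of any library) build such a URI and split it back into bucket and blob names.

```python
def make_gs_uri(bucket_name, blob_name):
    """Build a gs://bucket/blob URI from its two parts."""
    return f"gs://{bucket_name}/{blob_name}"


def split_gs_uri(uri):
    """Split a gs://bucket/blob URI into (bucket_name, blob_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    # Blob names may contain "/", so split only on the first one.
    bucket_name, _, blob_name = uri[len(prefix):].partition("/")
    if not bucket_name or not blob_name:
        raise ValueError(f"URI must name both a bucket and a blob: {uri!r}")
    return bucket_name, blob_name
```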

In [56]:
%%time
df = pd.read_csv(BLOB_URI)

CPU times: user 421 ms, sys: 93.1 ms, total: 514 ms
Wall time: 2.6 s


In [69]:
df.head()

Unnamed: 0,date,county,state,fips,cases,deaths,FIPS,POPESTIMATE2019,CENSUS2010POP,casesPerMillion,deathsPerMillion,case_diff,death_diff,cases_14MA,deaths_14MA
0,2020-01-21,Snohomish,Washington,53061.0,1,0,53061.0,822083.0,713335.0,1.216422,0.0,,,,
1,2020-01-22,Snohomish,Washington,53061.0,1,0,53061.0,822083.0,713335.0,1.216422,0.0,0.0,0.0,,
2,2020-01-23,Snohomish,Washington,53061.0,1,0,53061.0,822083.0,713335.0,1.216422,0.0,0.0,0.0,,
3,2020-01-24,Cook,Illinois,17031.0,1,0,17031.0,5150233.0,5194675.0,0.194166,0.0,,,,
4,2020-01-24,Snohomish,Washington,53061.0,1,0,53061.0,822083.0,713335.0,1.216422,0.0,0.0,0.0,,


In [None]:
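%%time
from io import BytesIO

# Download the blob's contents as bytes, then read them through an in-memory buffer.
# Newer google-cloud-storage releases deprecate download_as_string in favor of download_as_bytes.
content = blob.download_as_string()
df = pd.read_csv(BytesIO(content))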
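
One caveat with the in-memory route: the bytes downloaded from a `.csv.gz` blob are still compressed, and pandas cannot infer compression from a buffer because there is no file name to inspect. Below is a sketch of one workaround, under the assumption that gzip is the only compression in use; `open_blob_bytes` is a hypothetical helper.

```python
import gzip
import io


def open_blob_bytes(content):
    """Return a readable buffer of CSV bytes, decompressing gzip content if needed."""
    # Gzip streams start with the two magic bytes 0x1f 0x8b.
    if content[:2] == b"\x1f\x8b":
        content = gzip.decompress(content)
    return io.BytesIO(content)
```

The returned buffer can then be handed to pandas, for example `pd.read_csv(open_blob_bytes(content))`.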

## Writing to Blob

In [99]:
BUCKET_NAME = 'us_covid_hotspot-bucket'
#BLOB_NAME = "covid_states.csv"
BLOB_NAME = "covid_states_20200901.csv.gz"
FILE_PATH = f"data/{BLOB_NAME}"

In [100]:
%%time
write_blob_to_gcs(BUCKET_NAME, 
                  BLOB_NAME, 
                  FILE_PATH)

CPU times: user 87.6 ms, sys: 21.3 ms, total: 109 ms
Wall time: 9.28 s


['covid_counties_20200831.csv',
 'covid_counties_20200901.csv.gz',
 'covid_states.csv',
 'covid_states_20200901.csv.gz',
 'test-covid_counties_20200831.csv.gz']

Now let's test to see if we got the result we wanted.

In [72]:
%%time 
df = get_df_from_blob(BUCKET_NAME, BLOB_NAME)

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 6.91 µs


In [73]:
df.head()

Unnamed: 0,date,county,state,fips,cases,deaths,FIPS,POPESTIMATE2019,CENSUS2010POP,casesPerMillion,deathsPerMillion,case_diff,death_diff,cases_14MA,deaths_14MA
0,2020-01-21,Snohomish,Washington,53061.0,1,0,53061.0,822083.0,713335.0,1.216422,0.0,,,,
1,2020-01-22,Snohomish,Washington,53061.0,1,0,53061.0,822083.0,713335.0,1.216422,0.0,0.0,0.0,,
2,2020-01-23,Snohomish,Washington,53061.0,1,0,53061.0,822083.0,713335.0,1.216422,0.0,0.0,0.0,,
3,2020-01-24,Cook,Illinois,17031.0,1,0,17031.0,5150233.0,5194675.0,0.194166,0.0,,,,
4,2020-01-24,Snohomish,Washington,53061.0,1,0,53061.0,822083.0,713335.0,1.216422,0.0,0.0,0.0,,


In [65]:
def write_blob_to_gcs(bucket_name, blob_name, filepath):
    # Client to bundle configuration needed for API requests.
    client = storage.Client()

    # Extract the bucket object from the client bundle
    bucket = client.get_bucket(bucket_name)

    # Instantiate or extract the blob object from the bucket
    blob = storage.blob.Blob(blob_name,bucket)

    # Upload the file to the specific blob
    blob.upload_from_filename(filepath)
    
    return [blob_item.name for blob_item in bucket.list_blobs()]

## Reading from Blob

In [98]:
BUCKET_NAME = 'us_covid_hotspot-bucket'
BLOB_NAME = 

'us_covid_hotspot-bucket'

In [None]:
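# The function needs the bucket name, blob name, and local file path
write_blob_to_gcs(BUCKET_NAME, BLOB_NAME, FILE_PATH)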

In [102]:
[blob_item.name for blob_item in BUCKET.list_blobs()]

['covid_counties_20200831.csv',
 'covid_counties_20200901.csv.gz',
 'covid_states.csv',
 'covid_states_20200901.csv.gz',
 'test-covid_counties_20200831.csv.gz']

### Trying to zip and save

In [None]:
covid_counties.head()

In [None]:
# write a pandas dataframe to gzipped CSV file
covid_counties.to_csv("data/education_salary.csv.gz", 
           index=False, 
           compression="gzip")

In [None]:
# Read the gzipped CSV written above back in
pd.read_csv("data/education_salary.csv.gz", compression="gzip")

In [104]:
a = pd.read_csv("data/covid_states_20200901.csv.gz", compression="gzip")

In [None]:
df = pd.read_csv('gs://us_covid_hotspot-bucket/covid_states_20200901.csv.gz')

In [91]:
import time

today = time.strftime('%Y%m%d')
filepath = f'data/covid_states_{today}.csv.gz'
cache_mode = 2

(os.path.exists(filepath) and cache_mode in (1,2))

True
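
The check above could be wrapped in small helpers so the same logic is shared between state and county data. This is only a sketch: the notebook does not say what each `cache_mode` value means, so the assumption that modes 1 and 2 permit a local file is taken solely from the tuple in the cell above, and the helper names are hypothetical.

```python
import os
import time


def dated_cache_path(prefix, directory="data", day=None):
    """Return e.g. data/covid_states_20200901.csv.gz for the given YYYYMMDD day."""
    day = day or time.strftime("%Y%m%d")
    return os.path.join(directory, f"{prefix}_{day}.csv.gz")


def can_use_local_cache(path, cache_mode, local_modes=(1, 2)):
    """True when a cached file exists and the mode allows using it (assumed semantics)."""
    return os.path.exists(path) and cache_mode in local_modes
```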

In [107]:
#a = data_processing.get_covid_county_data()
b = data_processing.get_covid_state_data(cache_mode = 3)


Pulling state data from Cloud Storage
gs://us_covid_hotspot-bucket/covid_states_20200901.csv.gz


In [108]:
b.head()

Unnamed: 0,date,state,positive,negative,pending,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,onVentilatorCurrently,...,density,lat,long,Lived,Standard,fips_y,case_pm,death_pm,deaths_14MA,cases_14MA
0,2020-09-01,AK,6160.0,368095.0,,41.0,,,,9.0,...,1.2863,63.59,-154.49,27.0,1.0,2.0,8392.347705,53.13337,,
1,2020-09-01,AL,127616.0,831304.0,,990.0,14538.0,,1487.0,,...,96.9221,32.32,-86.9,93.0,37.0,1.0,25998.340471,448.19105,,
2,2020-09-01,AR,61497.0,669528.0,,423.0,4306.0,,,85.0,...,58.403,35.2,-91.83,68.0,22.0,5.0,20235.939531,267.851355,,
3,2020-09-01,AS,0.0,1514.0,,,,,,,...,,,,,,,,,,
4,2020-09-01,AZ,202342.0,1006648.0,,729.0,21405.0,253.0,,150.0,...,64.955,34.05,-111.09,125.0,23.0,4.0,27423.211295,683.608335,,


In [109]:
%%time
bucket_name = 'us_covid_hotspot-bucket'
blob_name = "covid_counties_20200901.csv.gz"
blob_uri = f"gs://{bucket_name}/{blob_name}"
print(f"Pulling county data from GCS [{blob_uri}]")

# Get the client object to make the request
client = storage.Client()

df = pd.read_csv(blob_uri, compression = 'gzip')

df['date'] = pd.to_datetime(df['date'], format = '%Y-%m-%d')

Pulling county data from GCS [gs://us_covid_hotspot-bucket/covid_counties_20200901.csv.gz]

