# Accessing data from Google Cloud Storage

### Purpose
The purpose of this notebook is to explore using Google Cloud Storage to cache COVID data, so that the Heroku server does not have to request and process the data each time it is accessed.

- First, connect to and read from Cloud Storage from a notebook.
- Second, write to Cloud Storage from a notebook.
- Third, update the Heroku server and test reading from Cloud Storage.
- Fourth, write a script that requests and processes data from the source and writes it to Cloud Storage.

## Set Up
First we need to move this notebook's working directory one level up, to the root directory of the app.

This is purely for tidiness: it lets us keep all notebooks in a notebooks directory.

In [1]:
pwd

'/Users/DanOvadia/Projects/covid-hotspots/notebooks'

In [2]:
cd ..

/Users/DanOvadia/Projects/covid-hotspots


In [3]:
import pandas as pd
from os import path
from google.cloud import storage

from config import config

%load_ext autoreload
%autoreload 1
%aimport config.config

To access Google Cloud Storage we need to set the environment variable `GOOGLE_APPLICATION_CREDENTIALS` for this session. It points to the service account credentials JSON file.

In the actual app, we will handle this separately.
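
Outside a notebook the `%env` magic is not available, so a script would have to set the variable itself. The following is a minimal sketch of one way to do that with `os.environ`; the helper name `set_gcs_credentials` is hypothetical and not part of this app or of the Google client library.

```python
import os
from pathlib import Path


def set_gcs_credentials(credentials_path):
    """Point GOOGLE_APPLICATION_CREDENTIALS at a service account JSON file.

    Hypothetical helper: it only sets the environment variable for the
    current process, after checking that the file actually exists.
    """
    resolved = Path(credentials_path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"No credentials file at {resolved}")
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(resolved)
    return resolved
```

The variable has to be set before `storage.Client()` is constructed, since the client looks up credentials in the environment at that point.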

In [5]:
# App Engine default service account credentials 
config_file = config.config.service_account_credentials_file

In [7]:
# Assign environment variable
%env GOOGLE_APPLICATION_CREDENTIALS=/Users/DanOvadia/Projects/covid-hotspots/config/$config_file

env: GOOGLE_APPLICATION_CREDENTIALS=/Users/DanOvadia/Projects/covid-hotspots/config/ytd-shared-project-eb630837f7b3.json


Let's test whether authentication for Google Cloud Storage is working.

In [13]:
def implicit():
    from google.cloud import storage

    # If you don't specify credentials when constructing the client, the
    # client library will look for credentials in the environment.
    storage_client = storage.Client()

    # Make an authenticated API request
    buckets = list(storage_client.list_buckets())
    print(buckets)
implicit()

[<Bucket: staging.ytd-shared-project.appspot.com>, <Bucket: us.artifacts.ytd-shared-project.appspot.com>, <Bucket: us_covid_hotspot-bucket>, <Bucket: ytd-shared-project.appspot.com>]


Yup, we look good. Now, let's import the data.

## Read from Cloud Storage

In [27]:
covid_counties = pd.read_csv('data/covid_counties_20200901.csv')

In [28]:
covid_counties.to_json('data/covid_json.json')

In [29]:
# to_hdf() requires a `key` naming the dataset inside the HDF5 file
covid_counties.to_hdf('data/covid_hdf.hdf', key='covid_counties')

In [21]:
file_name = "covid_counties_20200831.csv"
bucket_name = "us_covid_hotspot-bucket"

In [18]:
client = storage.Client()

In [53]:
bucket = client.get_bucket('us_covid_hotspot-bucket')

In [58]:
for blob in bucket.list_blobs():
    print(blob)

<Blob: us_covid_hotspot-bucket, covid_counties_20200831.csv, 1598919467735996>
<Blob: us_covid_hotspot-bucket, covid_states.csv, 1598994661100202>
<Blob: us_covid_hotspot-bucket, test-covid_counties_20200831.csv.gz, 1598993956887189>
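
The county file in the bucket carries a date stamp in its name (`covid_counties_20200831.csv`). If the cache keeps that convention, the app needs a way to build a stamped name and to pick the newest one from a listing. The sketch below assumes the `<prefix>_<YYYYMMDD>.csv` pattern seen above; both function names are hypothetical.

```python
import re
from datetime import date, datetime


def dated_blob_name(prefix, day):
    """Build a name such as 'covid_counties_20200831.csv'."""
    return f"{prefix}_{day:%Y%m%d}.csv"


def latest_dated_blob(names, prefix):
    """Return the newest '<prefix>_<YYYYMMDD>.csv' name, or None if none match."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d{{8}})\.csv$")
    dated = []
    for name in names:
        match = pattern.match(name)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y%m%d").date()
            dated.append((stamp, name))
    # Tuples sort by date first, so max() picks the most recent stamp
    return max(dated)[1] if dated else None


print(dated_blob_name("covid_counties", date(2020, 8, 31)))
```

With a real bucket, the names could come from `[b.name for b in bucket.list_blobs()]`.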

## Writing to Blob

In [56]:
def write_blob_to_gcs(bucket_name, blob_name, filepath):
    """Upload the local file at `filepath` to `blob_name` in `bucket_name`."""
    # Client to bundle configuration needed for API requests.
    client = storage.Client()

    # Look up the bucket through the client
    bucket = client.get_bucket(bucket_name)

    # Create a handle to the blob; it need not exist in the bucket yet
    blob = bucket.blob(blob_name)

    # Upload the local file's contents to that blob
    blob.upload_from_filename(filepath)

In [57]:
%%time
write_blob_to_gcs("us_covid_hotspot-bucket",
                  "covid_states.csv",
                  "data/covid_states_20200901.csv")

In [50]:
#blob = bucket.get_blob(new_file)
blob.upload_from_filename("data/education_salary.csv.gz")
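
Uploading from a filename means the data has to be written to disk first. As a possible alternative, the sketch below gzips text in memory and uploads the bytes with `upload_from_string`. The function name `upload_gzipped_text` is hypothetical, and `bucket` is assumed to be a `google.cloud.storage` bucket object such as the one returned by `client.get_bucket(...)`.

```python
import gzip


def upload_gzipped_text(bucket, blob_name, text):
    """Gzip `text` in memory and upload it to `blob_name` in `bucket`.

    `bucket` is assumed to expose `.blob(name)`, as a
    google.cloud.storage bucket does. Returns the compressed size in bytes.
    """
    payload = gzip.compress(text.encode("utf-8"))
    blob = bucket.blob(blob_name)
    blob.upload_from_string(payload, content_type="application/gzip")
    return len(payload)
```

A call might look like `upload_gzipped_text(bucket, "covid_states.csv.gz", df.to_csv(index=False))`, where the blob name is only an example.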

## Reading from Blob

In [61]:
bucket.id

'us_covid_hotspot-bucket'

In [24]:
%%time
from io import BytesIO

# Download the blob's raw bytes, then parse them as CSV
content = blob.download_as_string()
df = pd.read_csv(BytesIO(content))

CPU times: user 5.14 s, sys: 1.53 s, total: 6.68 s
Wall time: 51.7 s


In [59]:
%%time
df = pd.read_csv('gs://us_covid_hotspot-bucket/covid_states.csv')

CPU times: user 336 ms, sys: 99.3 ms, total: 436 ms
Wall time: 3.1 s
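
The `%%time` magic only works inside a notebook. To repeat this kind of comparison in a script or on the server, wall-clock time can be measured with `time.perf_counter`. The sketch below is one way to do it; the `wall_timer` name is hypothetical, and the `time.sleep` call stands in for a real read.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_timer(label, results):
    """Record the wall-clock seconds spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Store the elapsed time even if the timed block raised
        results[label] = time.perf_counter() - start


timings = {}
with wall_timer("placeholder_read", timings):
    time.sleep(0.01)  # stand-in for a download or pd.read_csv call
```

Each read path can be wrapped in its own `with wall_timer(...)` block and the entries of `timings` compared afterwards.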


In [60]:
df.head()

Unnamed: 0,date,state,positive,negative,pending,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,onVentilatorCurrently,...,density,lat,long,Lived,Standard,fips_y,case_pm,death_pm,deaths_14MA,cases_14MA
0,2020-08-31,AK,6125.0,342505.0,,39.0,,,,8.0,...,1.2863,63.59,-154.49,27.0,1.0,2.0,8344.663911,50.408582,,
1,2020-08-31,AL,126058.0,859229.0,,1004.0,14267.0,,1474.0,,...,96.9221,32.32,-86.9,93.0,37.0,1.0,25680.939718,444.524032,,
2,2020-08-31,AR,61224.0,665811.0,,420.0,4213.0,,,87.0,...,58.403,35.2,-91.83,68.0,22.0,5.0,20146.10732,262.257408,,
3,2020-08-31,AS,0.0,1514.0,,,,,,,...,,,,,,,,,,
4,2020-08-31,AZ,201835.0,1002594.0,,768.0,21405.0,256.0,,152.0,...,64.955,34.05,-111.09,125.0,23.0,4.0,27354.498086,681.5754,,


### Trying to zip and save

In [30]:
covid_counties.head()

Unnamed: 0,date,county,state,fips,cases,deaths,SUMLEV,REGION,DIVISION,STATE,...,CTYNAME,POPESTIMATE2019,CENSUS2010POP,FIPS,casesPerMillion,deathsPerMillion,case_diff,death_diff,cases_14MA,deaths_14MA
0,2020-01-21,Snohomish,Washington,53061.0,1,0,50.0,4.0,9.0,53.0,...,Snohomish County,822083.0,713335.0,53061.0,1.216422,0.0,,,,
1,2020-01-22,Snohomish,Washington,53061.0,1,0,50.0,4.0,9.0,53.0,...,Snohomish County,822083.0,713335.0,53061.0,1.216422,0.0,0.0,0.0,,
2,2020-01-23,Snohomish,Washington,53061.0,1,0,50.0,4.0,9.0,53.0,...,Snohomish County,822083.0,713335.0,53061.0,1.216422,0.0,0.0,0.0,,
3,2020-01-24,Cook,Illinois,17031.0,1,0,50.0,2.0,3.0,17.0,...,Cook County,5150233.0,5194675.0,17031.0,0.194166,0.0,,,,
4,2020-01-24,Snohomish,Washington,53061.0,1,0,50.0,4.0,9.0,53.0,...,Snohomish County,822083.0,713335.0,53061.0,1.216422,0.0,0.0,0.0,,


In [31]:
# Write a pandas DataFrame to a gzipped CSV file
covid_counties.to_csv("data/education_salary.csv.gz",
                      index=False,
                      compression="gzip")

In [34]:
a = pd.read_csv("data/education_salary.csv.gz", compression="gzip")

In [35]:
a.head()

Unnamed: 0,date,county,state,fips,cases,deaths,SUMLEV,REGION,DIVISION,STATE,...,CTYNAME,POPESTIMATE2019,CENSUS2010POP,FIPS,casesPerMillion,deathsPerMillion,case_diff,death_diff,cases_14MA,deaths_14MA
0,2020-01-21,Snohomish,Washington,53061.0,1,0,50.0,4.0,9.0,53.0,...,Snohomish County,822083.0,713335.0,53061.0,1.216422,0.0,,,,
1,2020-01-22,Snohomish,Washington,53061.0,1,0,50.0,4.0,9.0,53.0,...,Snohomish County,822083.0,713335.0,53061.0,1.216422,0.0,0.0,0.0,,
2,2020-01-23,Snohomish,Washington,53061.0,1,0,50.0,4.0,9.0,53.0,...,Snohomish County,822083.0,713335.0,53061.0,1.216422,0.0,0.0,0.0,,
3,2020-01-24,Cook,Illinois,17031.0,1,0,50.0,2.0,3.0,17.0,...,Cook County,5150233.0,5194675.0,17031.0,0.194166,0.0,,,,
4,2020-01-24,Snohomish,Washington,53061.0,1,0,50.0,4.0,9.0,53.0,...,Snohomish County,822083.0,713335.0,53061.0,1.216422,0.0,0.0,0.0,,
