# Accessing data from Google Cloud Storage

### Purpose
The purpose of this notebook is to explore using Google Cloud Storage (GCS) to cache COVID data, so that the data does not have to be requested and processed each time the Heroku server is accessed.

- **First**, connect and read from cloud storage, from a notebook.
- **Second**, write to cloud storage, from a notebook.
- **Third**, run the server locally and test reading from cloud storage.
- **Fourth**, update the Heroku server and test reading from cloud storage.
- **Fifth**, write a script that will request and process data from the source and write to cloud storage.
- **Sixth**, deploy an App Engine instance that can update the bucket as needed.
- **Seventh**, create a cron job on GCP to trigger the App Engine instance every morning to retrieve, process, and update the data.

### Sections
1. [Environment Setup](#1---Environment-Setup)
2. [Connecting to Cloud Storage Client](#2---Connecting-to-Cloud-Storage-Client)
3. [Reading from Cloud Storage](#3---Reading-from-Cloud-Storage)
4. [Writing to Cloud Storage](#4---Writing-to-Cloud-Storage)
5. [Writing to Cloud Storage using gzip and BytesIO](#5---Writing-to-Cloud-Storage-using-gzip-and-BytesIO)
6. [Conclusion](#6---Conclusion)

---

## 1 - Environment Setup

First, we move the notebook's working directory up one level, to the root directory of the app.

This is purely for tidiness: it lets all notebooks live in a `notebooks` directory while still importing from the app root.

In [1]:
pwd

'/Users/DanOvadia/Projects/covid-hotspots/notebooks'

In [2]:
cd ..

/Users/DanOvadia/Projects/covid-hotspots


### Import Libraries

In [3]:
import os
import sys
import time
import requests

import pandas as pd
import gzip
from io import BytesIO, TextIOWrapper
from google.cloud import storage


from config import config
from modules import data_processing

%load_ext autoreload
%autoreload 1
%aimport config.config
%aimport modules.data_processing

To access Google Cloud Storage, we need to set the environment variable `GOOGLE_APPLICATION_CREDENTIALS` for this session. This variable points to the service account credentials JSON file.

In the actual app, we will handle this separately.

### Setting Credentials

To establish a connection with the GCS client, we need a credentials JSON file from a service account associated with the Google Cloud project in which the bucket was created.

In [4]:
# App Engine default service account credentials
if 'GOOGLE_APPLICATION_CREDENTIALS' not in os.environ:
    print(f"Credentials: {'GOOGLE_APPLICATION_CREDENTIALS' in os.environ}. Setting environment variable.")
    # Retrieve the name of the file from config.py
    CONFIG_FILENAME = config.config.service_account_credentials_file

    # Generate path for my personal computer
    CONFIG_PATH = f"/Users/DanOvadia/Projects/covid-hotspots/config/{CONFIG_FILENAME}"

    # Assign the environment variable for this session of python
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = CONFIG_PATH
else:
    print(f"Credentials: {'GOOGLE_APPLICATION_CREDENTIALS' in os.environ}. Proceeding.")

Credentials: False. Setting environment variable.


In [None]:
# Check to verify
os.environ['GOOGLE_APPLICATION_CREDENTIALS']
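
The absolute path above only resolves on one machine. As a minimal sketch, assuming the credentials JSON sits in a `config/` directory under the app root, the path could be derived relative to the working directory instead. The helper name `ensure_credentials_env` and its parameters are hypothetical and not part of the app.

```python
import os
from pathlib import Path


def ensure_credentials_env(config_dir, filename, environ=os.environ):
    """Point GOOGLE_APPLICATION_CREDENTIALS at config_dir/filename if unset.

    Returns the path that ends up in the environment mapping.
    """
    key = "GOOGLE_APPLICATION_CREDENTIALS"
    if key in environ:
        # Respect a value that was already exported in the shell.
        return environ[key]

    path = (Path(config_dir) / filename).resolve()
    if not path.is_file():
        # Fail early rather than letting the storage client raise later.
        raise FileNotFoundError(f"Credentials file not found: {path}")

    environ[key] = str(path)
    return environ[key]
```

After the `cd ..` cell, a call such as `ensure_credentials_env("config", CONFIG_FILENAME)` would set the variable only when it is missing, and the `environ` argument makes the helper easy to exercise against a plain dictionary.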

The following code generates a string that can be pasted into a terminal to set `GOOGLE_APPLICATION_CREDENTIALS` for that shell session.

In [7]:
# Produce string to use in terminal if running elsewhere.
print(f"export GOOGLE_APPLICATION_CREDENTIALS=/Users/DanOvadia/Projects/covid-hotspots/config/{CONFIG_FILENAME}")

export GOOGLE_APPLICATION_CREDENTIALS=/Users/DanOvadia/Projects/covid-hotspots/config/ytd-shared-project-eb630837f7b3.json
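
If the credentials path ever contains spaces or other shell-sensitive characters, the pasted line would break. A small sketch of one way to guard against that uses the standard library's `shlex.quote`; the function name `export_line` and the sample path are illustrative only.

```python
import shlex


def export_line(credentials_path):
    """Build a shell command that exports the credentials path safely."""
    # shlex.quote leaves simple paths untouched and single-quotes any
    # value containing characters the shell would otherwise interpret.
    return f"export GOOGLE_APPLICATION_CREDENTIALS={shlex.quote(credentials_path)}"


print(export_line("/path/with space/key.json"))
```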


---
## 2 - Connecting to Cloud Storage Client

First we create a `google.cloud.storage.client.Client` object. The Client is used to bundle configuration needed for API requests.

In [9]:
CLIENT = storage.Client()

Retrieve the bucket object by passing in the name of the bucket, in this case: `bucket_name="us_covid_hotspot-bucket"`.

In [31]:
BUCKET_NAME = 'us_covid_hotspot-bucket'
BUCKET = CLIENT.get_bucket(BUCKET_NAME)
type(BUCKET)

google.cloud.storage.bucket.Bucket

In [17]:
# List all the names of our blobs in the bucket
[blob.name for blob in BUCKET.list_blobs()]

['covid_counties.csv.gz',
 'covid_counties_20200901.csv.gz',
 'covid_states.csv.gz',
 'covid_states_20200901.csv.gz',
 'states_populuations.csv']
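
The listing mixes undated files with date-stamped snapshots such as `covid_states_20200901.csv.gz`. As a hedged sketch, assuming snapshots always follow the `<prefix>_<YYYYMMDD>.csv.gz` pattern seen above, the newest snapshot for a prefix could be picked from the list of names without any further API calls. The helper `latest_dated_blob` is hypothetical.

```python
import re
from datetime import datetime

# Matches names such as "covid_states_20200901.csv.gz".
DATED_BLOB = re.compile(r"^(?P<prefix>.+)_(?P<date>\d{8})\.csv\.gz$")


def latest_dated_blob(blob_names, prefix):
    """Return the newest date-stamped blob name for a prefix, or None."""
    candidates = []
    for name in blob_names:
        match = DATED_BLOB.match(name)
        if not match or match.group("prefix") != prefix:
            continue
        try:
            stamp = datetime.strptime(match.group("date"), "%Y%m%d")
        except ValueError:
            # Eight digits that do not form a real date are skipped.
            continue
        candidates.append((stamp, name))
    return max(candidates)[1] if candidates else None


names = [
    "covid_counties.csv.gz",
    "covid_counties_20200901.csv.gz",
    "covid_states.csv.gz",
    "covid_states_20200901.csv.gz",
    "states_populuations.csv",
]
print(latest_dated_blob(names, "covid_states"))
```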

In [29]:
# Assign a filename
BLOB_NAME = 'covid_states.csv'

# Retrieve the blob object (get_blob returns None if the blob does not exist)
BLOB = BUCKET.get_blob(BLOB_NAME)
print(f"blob is named {BLOB.name} and is {type(BLOB)} object")

blob is named covid_states.csv and is <class 'google.cloud.storage.blob.Blob'> object


In [None]:
# Generate a BLOB_URI to retrieve data
BLOB_URI = f"gs://{BUCKET_NAME}/{BLOB_NAME}"

In [None]:
def implicit():
    # Establish a connection
    client = storage.Client()
    # Make an authenticated API request to list buckets
    buckets = [bucket.name for bucket in client.list_buckets()]
    print(buckets)
    return buckets
a = implicit()

---
## 3 - Reading from Cloud Storage

I tried two ways to read from the blob. The first downloads the blob's contents, wraps them in a `BytesIO` object, and passes that to `pd.read_csv`.

The second uses pandas' built-in support for reading directly from a blob through its `gs://[BUCKET_NAME]/[BLOB_NAME]` URI.
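
Reading a `gs://` URI with pandas relies on the optional `gcsfs` package being installed. The URI itself is just the bucket name and blob name joined together; the sketch below shows building one and splitting it back apart. Both helper names are hypothetical and exist only for illustration.

```python
def make_gs_uri(bucket_name, blob_name):
    """Join a bucket name and blob name into a gs:// URI."""
    return f"gs://{bucket_name}/{blob_name}"


def split_gs_uri(uri):
    """Split a gs:// URI into (bucket_name, blob_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a gs:// URI: {uri!r}")
    # Only the first "/" separates the bucket; blob names may contain more.
    bucket_name, _, blob_name = uri[len(prefix):].partition("/")
    if not bucket_name or not blob_name:
        raise ValueError(f"URI must include a bucket and a blob: {uri!r}")
    return bucket_name, blob_name


uri = make_gs_uri("us_covid_hotspot-bucket", "covid_states.csv.gz")
print(uri)
print(split_gs_uri(uri))
```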

In [None]:
BLOB_URI = f"gs://{BUCKET_NAME}/{BLOB_NAME}"

In [None]:
BLOB_URI

In [None]:
%%time
df = pd.read_csv(BLOB_URI)

In [None]:
df.head()

In [None]:
%%time
# download_as_bytes() replaces the deprecated download_as_string()
content = BLOB.download_as_bytes()
df = pd.read_csv(BytesIO(content))
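
The `BytesIO` route hands pandas the raw bytes, so it matters whether the blob is plain CSV or gzipped. One possible way to handle both, sketched here with the standard library only, is to check for the two-byte gzip magic number before wrapping the bytes in a text stream. The helper `open_text` is hypothetical.

```python
import gzip
from io import BytesIO, TextIOWrapper

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def open_text(content, encoding="utf-8"):
    """Return a text stream over downloaded bytes, gunzipping if needed."""
    raw = BytesIO(content)
    if content[:2] == GZIP_MAGIC:
        # Decompress lazily as the stream is read.
        raw = gzip.GzipFile(fileobj=raw, mode="rb")
    return TextIOWrapper(raw, encoding=encoding)


plain = b"state,cases\nNY,10\n"
print(open_text(plain).read() == open_text(gzip.compress(plain)).read())
```

Since `pd.read_csv` accepts file-like objects, the returned stream could presumably be passed to it directly in place of `BytesIO(content)`.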

In [None]:
def get_df_from_blob(bucket_name, blob_name):
    """Read a CSV blob into a DataFrame through its gs:// URI."""
    # Build the URI from the arguments rather than the notebook globals
    blob_uri = f"gs://{bucket_name}/{blob_name}"

    # pandas infers gzip compression from a .gz extension by default
    return pd.read_csv(blob_uri)

In [None]:
df = get_df_from_blob(BUCKET_NAME, BLOB_NAME)

### Reading from Blob

In [None]:
BUCKET_NAME = 'us_covid_hotspot-bucket'
BLOB_NAME = 

In [None]:
write_blob_to_gcs()

In [None]:
[blob_item.name for blob_item in BUCKET.list_blobs()]

---
## 4 - Writing to Cloud Storage

In [None]:
def write_blob_to_gcs(bucket_name, blob_name, filepath):
    # Client to bundle configuration needed for API requests.
    client = storage.Client()

    # Retrieve the bucket object through the client
    bucket = client.get_bucket(bucket_name)

    # Create a local handle for the blob; nothing is uploaded yet
    blob = storage.blob.Blob(blob_name, bucket)

    # Upload the local file to the blob
    blob.upload_from_filename(filepath)

    # Return the names of all blobs now in the bucket
    return [blob_item.name for blob_item in bucket.list_blobs()]

In [None]:
BUCKET_NAME = 'us_covid_hotspot-bucket'
#BLOB_NAME = "covid_states.csv"
BLOB_NAME = "covid_states_20200901.csv.gz"
FILE_PATH = f"data/{BLOB_NAME}"

In [None]:
%%time
write_blob_to_gcs(BUCKET_NAME,
                  BLOB_NAME,
                  FILE_PATH)

Now let's test to see if we got the result we wanted.

In [None]:
%%time
df = get_df_from_blob(BUCKET_NAME, BLOB_NAME)

In [None]:
df.head()

### Trying to zip and save

In [None]:
covid_counties.head()

In [None]:
# write a pandas dataframe to gzipped CSV file
covid_counties.to_csv("data/education_salary.csv.gz", 
           index=False, 
           compression="gzip")

In [None]:
pd.read_csv()

In [None]:
a = pd.read_csv("data/covid_states_20200901.csv.gz", compression="gzip")

In [None]:
df = pd.read_csv('gs://us_covid_hotspot-bucket/covid_states_20200901.csv.gz')

In [None]:
import time

today = time.strftime('%Y%m%d')
filepath = f'data/covid_states_{today}.csv.gz'
cache_mode = 2

(os.path.exists(filepath) and cache_mode in (1,2))

In [None]:
#a = data_processing.get_covid_county_data()
b = data_processing.get_covid_state_data(cache_mode = 3)


---
## 5 - Writing to Cloud Storage using gzip and BytesIO

In [None]:
%%time
bucket_name = 'us_covid_hotspot-bucket'
blob_name = "covid_counties_20200901.csv.gz"
blob_uri = f"gs://{bucket_name}/{blob_name}"
print(f"Pulling county data from GCS [{blob_uri}]")

# Get the client object to make the request
client = storage.Client()

df = pd.read_csv(blob_uri, compression = 'gzip')

df['date'] = pd.to_datetime(df['date'], format = '%Y-%m-%d')

### Using gzip and IO

Since a Cloud Function is serverless, we cannot rely on writing `covid_states.csv.gz` to local disk first, so we hold it in memory as a `BytesIO` buffer instead.
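
To make the in-memory idea concrete without needing pandas or GCS, here is a minimal sketch that uses the standard library's `csv` module as a stand-in for `DataFrame.to_csv`. It writes a few made-up rows into a gzip stream backed by a `BytesIO` buffer, then rewinds the buffer so it is ready to be read or uploaded. The function name `rows_to_gzip_buffer` and the sample rows are illustrative assumptions.

```python
import csv
import gzip
from io import BytesIO, TextIOWrapper


def rows_to_gzip_buffer(header, rows):
    """Write CSV rows into an in-memory gzip buffer and rewind it."""
    gz_buffer = BytesIO()
    with gzip.GzipFile(mode="wb", fileobj=gz_buffer) as gz_file:
        # The text wrapper encodes str to bytes; closing it flushes the
        # remaining text into the gzip stream before the stream is finalized.
        with TextIOWrapper(gz_file, encoding="utf-8", newline="") as text_file:
            writer = csv.writer(text_file)
            writer.writerow(header)
            writer.writerows(rows)
    # GzipFile does not close a fileobj it was handed, so the buffer is
    # still open here; move back to the start for the next reader.
    gz_buffer.seek(0)
    return gz_buffer


buffer = rows_to_gzip_buffer(["state", "cases"], [["NY", 10], ["CA", 20]])
print(gzip.decompress(buffer.getvalue()).decode("utf-8"))
```

Closing the text wrapper explicitly, rather than relying on garbage collection, ensures every row reaches the gzip stream before the buffer is used.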

In [None]:
import io

In [None]:
io.TextIOWrapper()

In [None]:
import gzip
import pandas as pd
from io import BytesIO, TextIOWrapper

In [None]:
# Read in new state data from Covid Tracking Project
COVID_STATES_DF = data_processing.get_covid_state_data(cache_mode = 0)

In [None]:
BUCKET_NAME = 'us_covid_hotspot-bucket'
BLOB_NAME = 'covid_states2.csv.gz'
FILEPATH = f"data/{BLOB_NAME}"
BLOB_URI = f"gs://{BUCKET_NAME}/{BLOB_NAME}"

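In [None]:
gz_buffer = BytesIO()

with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
    # Close the text wrapper explicitly so all rows are flushed into the gzip stream
    with TextIOWrapper(gz_file, encoding='utf8') as text_file:
        COVID_STATES_DF.to_csv(text_file, index=False)
# Rewind the buffer so the upload reads from the start
gz_buffer.seek(0);

In [None]:
type(gz_buffer)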

In [None]:
client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)

blob = storage.blob.Blob(BLOB_NAME, bucket)

In [None]:
%%time
blob.upload_from_file(file_obj=gz_buffer, content_type = 'text/csv')

In [None]:
df = pd.read_csv(BLOB_URI)

In [None]:
BLOB_NAME = 'covid_counties.csv.gz'

In [None]:
COVID_COUNTIES_DF = data_processing.get_covid_county_data(cache_mode = 0)

In [None]:
COVID_COUNTIES_DF.head()

In [None]:
a = data_processing.get_census_county_data()

---
## 6 - Conclusion

In conclusion, the function I will use to write to GCS from the Cloud Function is the following:

In [None]:
def write_df_to_GCS(df, blob_name, bucket_name):
    # Instantiate the in-memory buffer that will hold the gzipped CSV
    gz_buffer = BytesIO()

    # Write the DataFrame through a text wrapper into a GzipFile backed
    # by the buffer; closing the wrapper flushes all rows into the stream.
    with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
        with TextIOWrapper(gz_file, encoding='utf8') as text_file:
            df.to_csv(text_file, index=False)
    # Rewind the buffer so the upload reads from the start
    gz_buffer.seek(0)

    # Instantiate Client to Storage
    client = storage.Client()
    # Retrieve the bucket
    bucket = client.get_bucket(bucket_name)
    # Create a local handle for the blob; the upload creates or overwrites it
    blob = storage.blob.Blob(blob_name, bucket)

    # Upload buffer to blob
    blob.upload_from_file(file_obj=gz_buffer, content_type='text/csv')

    # Return list of current contents in bucket.
    return [blob_item.name for blob_item in bucket.list_blobs()]