# 1. Google Cloud Storage API connection

**Note**: This notebook connects to files in Google Cloud Storage, loads them, and saves them locally in compressed pickle format, which takes up less space than the raw CSVs.

<div class="alert alert-block alert-warning">
  
<b>Notebook objectives:</b>
    
* Connect to data source via Google Cloud Storage API
    
* Export daily data in pickle format and store in specified directory

</div>
    

## 1.1 Packages Installs

Uncomment and run every time a new environment is created.

In [2]:
### Google Cloud Storage API packages
!pip install cloudstorage
!pip install --upgrade google-cloud-storage # might only need this update if there are inheritance problems.
!pip install gcsfs

## 1.2 Importing packages

- Google BigQuery

- JSON (JavaScript Object Notation)

- Google Cloud Storage

- Google Authentication

In [1]:
##### Importing packages

# Google API handling
import json
from google.cloud import bigquery
from google.cloud import storage
from google.oauth2 import service_account

# Data handling
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# File format handling
import pickle
import bz2
import _pickle as cPickle


# Path set up (If a new directory is used all of the paths need to be updated)
path = "/project/data/" 
path_w1 = "/project/data/w1/"
path_w2 = "/project/data/w2/"
path_w3 = "/project/data/w3/"
path_w4 = "/project/data/w4/"
path_w5 = "/project/data/w5/"
json_path = "/project/KEYFILE.json"
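
Because every week directory sits under the same base directory, the paths could also be derived from a single value, so that only one line changes when the data moves. The following is a minimal sketch, not part of the original notebook: `build_week_paths` and its `n_weeks` parameter are hypothetical names, and the demo uses a temporary directory instead of `/project/data/`.

```python
import tempfile
from pathlib import Path


def build_week_paths(base_dir, n_weeks=5):
    """Return {week_number: Path} for base_dir/w1 ... base_dir/wN.

    Hypothetical helper: it also creates each directory if it is missing.
    """
    base = Path(base_dir)
    week_dirs = {}
    for week in range(1, n_weeks + 1):
        week_dir = base / f"w{week}"
        week_dir.mkdir(parents=True, exist_ok=True)
        week_dirs[week] = week_dir
    return week_dirs


# Demo in a throwaway directory (assumption: the real base would be /project/data/)
with tempfile.TemporaryDirectory() as tmp:
    demo_dirs = build_week_paths(tmp, n_weeks=2)
    demo_names = sorted(p.name for p in demo_dirs.values())
```

Since `transform_2_pickle` joins the directory and the file name by string concatenation, a directory built this way would need to be passed as `str(demo_dirs[1]) + "/"`.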

## 1.3 Pre-defined functions

These functions handle the data from Google Cloud Storage and apply transformations that reduce the size of the raw CSVs, as an alternative to random sampling.

In [2]:
###### Pre-defined functions

### Function to transform files in google storage from csv format to pickle compressed form

def transform_2_pickle(csv_list, path_pickle, google_JSON_path):
    
    '''
    This function receives:
    - csv_list: list of Google Cloud Storage gsutil URIs (gs://...) where the data is stored (https://cloud.google.com/storage/docs/gsutil)
    - path_pickle: directory in which to save the generated pickle files (must end with "/")
    - google_JSON_path: path to the JSON key file that provides the credentials to access Google Cloud Storage
    
    The function outputs:
    - One bz2-compressed pickle file per CSV (client_day_1, client_day_2, ...) stored in the provided directory
    
    '''
    # Number the files from 1 so that the file name matches the position in the list
    for day, csv_path in enumerate(csv_list, start=1):
        temp_df = pd.read_csv(csv_path, storage_options={"token": google_JSON_path})

        # file conversion to pickle and compression
        filename = 'client_day_' + str(day)
        # The with-block closes the file even if pickling fails
        with bz2.BZ2File(path_pickle + filename, 'w') as sfile:
            pickle.dump(temp_df, sfile)
        
    print(f'Successfully stored {len(csv_list)} new data files in pkl format under path: {path_pickle}')
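
Files written by `transform_2_pickle` have to be opened through `bz2` before they can be unpickled. The following is a minimal sketch of the read side; `load_compressed_pickle` is a hypothetical helper name, and the demo pickles a small dictionary as a stand-in for a DataFrame so that it runs without any Google Cloud access.

```python
import bz2
import pickle
import tempfile
from pathlib import Path


def load_compressed_pickle(file_path):
    """Read back an object that was written with bz2.BZ2File + pickle.dump."""
    with bz2.BZ2File(file_path, "rb") as compressed_file:
        return pickle.load(compressed_file)


# Round-trip demo with a stand-in object (assumption: in the notebook this is a DataFrame)
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "client_day_1"
    with bz2.BZ2File(demo_file, "wb") as compressed_file:
        pickle.dump({"rows": [1, 2, 3]}, compressed_file)
    restored = load_compressed_pickle(demo_file)
```

Unpickling executes code embedded in the file, so this should only be used on pickle files that were generated by a trusted source.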

# 2. Handling data from Google Cloud Storage

To avoid crashing the kernel, generate the pickle files for one week of data at a time.

Additionally, after the pickle files are generated, it is recommended to restart the kernel before attempting to build the new data frame from the samples.

## 2.1 Connection to Google Cloud API

Load a CSV from Google Cloud Storage by providing its file path.

**Note:** Only run this cell if a complete CSV from GCS needs to be loaded into memory.


In [None]:
# get data from google cloud path
# raw_data = pd.read_csv('gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_144038_943556045_3669978067.csv',
#                  storage_options={"token": "/project/notebooks/sandbox/KEYFILE.json"})

# data = raw_data.copy() #creating a copy to avoid altering original data

## 2.2 Google cloud file paths

Lists of Google Cloud Storage paths that point to the daily data in CSV format.

In [4]:
# Defining a complete list of google cloud storage paths to connect via API
client_data = ['gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_144038_943556045_3669978067.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_144253_943556045_3669977120.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_144637_943556045_3669977266.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_144806_943556045_3669975884.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_145007_943556045_3669976899.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_145011_943556045_3669976131.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_145409_943556045_3669977107.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_150254_943556045_3669976775.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_150354_943556045_3669977404.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_151355_943556045_3669976894.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_152655_943556045_3669976586.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_153710_943556045_3669977873.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_154954_943556045_3669976870.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_162459_943556045_3669977382.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_162643_943556045_3669977882.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_163945_943556045_3669978002.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220422_013031_943556045_3670354162.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220423_015510_943556045_3671466489.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220424_013336_943556045_3672541034.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220425_010447_943556045_3673605064.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220426_020558_943556045_3674759660.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220427_005711_943556045_3675887689.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220428_012607_943556045_3677006841.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220429_020119_943556045_3678135669.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220430_015833_943556045_3679248545.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220501_014108_943556045_3680324184.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220502_012949_943556045_3681411070.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220503_023223_943556045_3682562803.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220504_014637_943556045_3683713181.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220505_005531_943556045_3684873343.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220506_005709_943556045_3686025686.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220507_011303_943556045_3687156915.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220508_010559_943556045_3688230263.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220509_004721_943556045_3689305375.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220510_014103_943556045_3690473264.csv']

In [5]:
# 1st week data list of google cloud map paths
first_week_data = ['gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_144038_943556045_3669978067.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_144253_943556045_3669977120.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_144637_943556045_3669977266.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_144806_943556045_3669975884.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_145007_943556045_3669976899.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_145011_943556045_3669976131.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_145409_943556045_3669977107.csv']

In [6]:
# 2nd week data list of google cloud map paths
second_week_data = ['gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_150254_943556045_3669976775.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_150354_943556045_3669977404.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_151355_943556045_3669976894.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_152655_943556045_3669976586.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_153710_943556045_3669977873.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_154954_943556045_3669976870.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_162459_943556045_3669977382.csv']

In [7]:
# 3rd week data list of google cloud map paths
third_week_data = ['gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_162643_943556045_3669977882.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220421_163945_943556045_3669978002.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220422_013031_943556045_3670354162.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220423_015510_943556045_3671466489.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220424_013336_943556045_3672541034.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220425_010447_943556045_3673605064.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220426_020558_943556045_3674759660.csv']

In [8]:
# 4th week data list of google cloud map paths
fourth_week_data = ['gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220427_005711_943556045_3675887689.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220428_012607_943556045_3677006841.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220429_020119_943556045_3678135669.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220430_015833_943556045_3679248545.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220501_014108_943556045_3680324184.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220502_012949_943556045_3681411070.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220503_023223_943556045_3682562803.csv']

In [9]:
# 5th week data list of google cloud map paths
fifth_week_data = ['gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220504_014637_943556045_3683713181.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220505_005531_943556045_3684873343.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220506_005709_943556045_3686025686.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220507_011303_943556045_3687156915.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220508_010559_943556045_3688230263.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220509_004721_943556045_3689305375.csv',
              'gs://map_testingdata/ADD_TestingFolder/OceanSaver/OceanSaver_ADD_20220510_014103_943556045_3690473264.csv']
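
The five weekly lists are consecutive seven-item slices of `client_data`, so they could also be derived rather than typed out by hand. The following is a minimal sketch, assuming a hypothetical helper `split_into_weeks`; the `example-bucket` paths are placeholders, not real files.

```python
def split_into_weeks(paths, days_per_week=7):
    """Split a list of paths into consecutive chunks of days_per_week items."""
    return [
        paths[start:start + days_per_week]
        for start in range(0, len(paths), days_per_week)
    ]


# Placeholder paths (assumption: the notebook would pass client_data instead)
demo_paths = [f"gs://example-bucket/day_{n}.csv" for n in range(1, 11)]
demo_weeks = split_into_weeks(demo_paths)
```

Applied to `client_data`, `split_into_weeks(client_data)[0]` would correspond to `first_week_data`, and so on. The last chunk is shorter whenever the number of paths is not a multiple of seven.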

## 2.3 Generate pickle files

This step stores a compressed form of the data (i.e., pickle files) in the local directory to reduce the space the data takes up.

In [36]:
# Run pre-defined function to transform csv files to pickles

transform_2_pickle(first_week_data, path_w1, json_path)

Successfully stored 7 new data files in pkl format under path: /project/data/w1/
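
The success message only reports how many paths were passed in, so it can be worth checking what actually landed in the directory. The following is a minimal sketch, assuming a hypothetical helper `summarize_directory`; the demo writes two small dummy files to a temporary directory instead of reading `/project/data/w1/`.

```python
import tempfile
from pathlib import Path


def summarize_directory(directory):
    """Return (file name, size in bytes) for every file in directory, sorted by name."""
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    return [(p.name, p.stat().st_size) for p in files]


# Demo with dummy files (assumption: the notebook would pass path_w1 instead)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "client_day_1").write_bytes(b"abc")
    (Path(tmp) / "client_day_2").write_bytes(b"abcdef")
    demo_summary = summarize_directory(tmp)
```

A file with a size of zero bytes, or a missing `client_day_N` entry, would suggest that the conversion stopped partway through.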
