# Python Client for GCS

Google Cloud Storage (GCS) uses buckets to hold information, data, and files.

This notebook illustrates some common interactions with GCS and provides tips on using the Python Client to:
- List buckets
- List files
- Create buckets
- Delete buckets
- Download files
- Upload files
- And more

Resources:
- [Product](https://cloud.google.com/storage)
- [Client API](https://github.com/googleapis/python-storage)
- [Client API Documentation](https://cloud.google.com/python/docs/reference/storage/latest)


---
## Setup

inputs:

In [1]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [2]:
REGION = 'us-central1'
EXPERIMENT = 'gcs'
SERIES = 'tips'

packages:

In [3]:
from google.cloud import storage
import os, shutil
import glob
from datetime import datetime

clients:

In [4]:
gcs = storage.Client()

parameters:

In [5]:
DIR = f'temp/{EXPERIMENT}'
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

environment:

In [6]:
# remove directory named DIR if exists
shutil.rmtree(DIR, ignore_errors = True)

# create directory DIR
os.makedirs(DIR)

# check for existence of DIR
print('DIR exists? ', os.path.exists(DIR))

# list contents of directory one level higher than DIR
os.listdir(DIR + '/../')

DIR exists?  True


['job-parms', 'gcs', 'multiprocess']

---
## Buckets

Buckets are the actual storage locations of files. They are resources within a project, have properties and permissions, and are located in a region.

In [44]:
print(f"View the project's buckets directly here:\nhttps://console.cloud.google.com/storage/browser?forceOnBucketsSortingFiltering=false&project={PROJECT_ID}")

View the project's buckets directly here:
https://console.cloud.google.com/storage/browser?forceOnBucketsSortingFiltering=false&project=statmike-mlops-349915


### Is there a bucket in this project with the same name as the project?

In [7]:
PROJECT_ID

'statmike-mlops-349915'

In [8]:
lookup = gcs.lookup_bucket(PROJECT_ID)
type(lookup)

google.cloud.storage.bucket.Bucket

In [9]:
lookup = gcs.lookup_bucket(PROJECT_ID+'not_real_name')
type(lookup)

NoneType
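
The two lookups above suggest a simple existence check: `lookup_bucket` returned a `Bucket` for a real name and `None` for a made-up one. A minimal sketch of that idea follows. The helper name `bucket_exists` is hypothetical, and it only assumes that `client` behaves like the `gcs` client used in this notebook. Other failures, such as permission errors, are not handled here.

```python
def bucket_exists(client, name):
    """Return True if `client.lookup_bucket(name)` finds a bucket.

    `client` is assumed to behave like google.cloud.storage.Client, whose
    lookup_bucket() returns a Bucket object, or None when nothing is found.
    """
    return client.lookup_bucket(name) is not None
```

In this notebook it could be called as `bucket_exists(gcs, PROJECT_ID)`.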

### List the buckets in the current project:

In [10]:
list(gcs.list_buckets())

[<Bucket: cloud-ai-platform-a68e7f3a-fac8-47f6-9f92-fff95c09cdb8>,
 <Bucket: statmike-mlops-349915>,
 <Bucket: statmike-mlops-349915-vertex-pipelines-us-central1>]

In [11]:
list(gcs.list_buckets(prefix = 'statmike'))

[<Bucket: statmike-mlops-349915>,
 <Bucket: statmike-mlops-349915-vertex-pipelines-us-central1>]

### Create a new bucket

In [12]:
bucket = gcs.bucket(PROJECT_ID + TIMESTAMP)

In [13]:
bucket = gcs.create_bucket(bucket, project = PROJECT_ID, location = REGION)
bucket

<Bucket: statmike-mlops-34991520220920104702>

In [14]:
list(gcs.list_buckets(prefix = 'statmike'))

[<Bucket: statmike-mlops-349915>,
 <Bucket: statmike-mlops-349915-vertex-pipelines-us-central1>,
 <Bucket: statmike-mlops-34991520220920104702>]

In [37]:
print(f"View the bucket directly here:\nhttps://console.cloud.google.com/storage/browser/{PROJECT_ID + TIMESTAMP};tab=objects&project={PROJECT_ID}")

View the bucket directly here:
https://console.cloud.google.com/storage/browser/statmike-mlops-34991520220920104702;tab=objects&project=statmike-mlops-349915
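
Because the bucket name above is built by concatenating `PROJECT_ID` and `TIMESTAMP`, it can help to sanity-check a candidate name before asking GCS to create it. The sketch below checks only a simplified subset of the bucket naming rules (length, allowed characters, first and last character, and the reserved `goog` prefix). The function name is hypothetical, and passing this check does not guarantee that GCS will accept the name or that it is still available.

```python
import re

# Simplified subset of the naming rules (an assumption for this sketch):
# 3-63 characters; lowercase letters, digits, dashes, underscores, dots;
# first and last character must be a letter or digit.
_BUCKET_NAME = re.compile(r'[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]')


def looks_like_valid_bucket_name(name):
    """Rough local check; the service remains the final authority."""
    if name.startswith('goog'):
        return False  # names may not begin with the "goog" prefix
    return _BUCKET_NAME.fullmatch(name) is not None


# Example values copied from the outputs shown in this notebook.
candidate = 'statmike-mlops-349915' + '20220920104702'
print(candidate, looks_like_valid_bucket_name(candidate))
```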


### Retrieve a specific bucket:

In [15]:
bucket = gcs.bucket(PROJECT_ID + TIMESTAMP)

In [16]:
bucket

<Bucket: statmike-mlops-34991520220920104702>

In [17]:
bucket.name

'statmike-mlops-34991520220920104702'

In [18]:
bucket.path

'/b/statmike-mlops-34991520220920104702'

## Files (blobs)

Files are stored as objects, called blobs. A blob's name includes any prefix you want to use to represent a folder structure. That's right, there are no actual folders in object storage, just file names prefixed with folder-like names to help organize and find information.
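
The idea that folders are only a naming convention can be illustrated without touching GCS at all. The sketch below works on a plain list of object names and mimics, in pure Python, how a listing with a prefix and a delimiter separates direct "files" from rolled-up "folders". The function `list_level` and the sample names are illustrative assumptions, not part of the client library.

```python
def list_level(names, prefix='', delimiter='/'):
    """Split flat object names into direct files and folder-like prefixes."""
    files, folders = [], set()
    for name in names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # A delimiter remains after the prefix: roll the name up
            # into a single folder-like prefix.
            folders.add(prefix + head + sep)
        else:
            files.append(name)
    return files, sorted(folders)


names = [
    'my_folder/my_subfolder/folder_0/myfile_0.txt',
    'my_folder/my_subfolder/folder_0/myfile_1.txt',
    'my_folder/my_subfolder/folder_1/myfile_0.txt',
    'readme.txt',
]
print(list_level(names))
print(list_level(names, prefix='my_folder/my_subfolder/'))
```

Nothing in `names` is a folder. The "folders" are computed from the names each time they are listed.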

### Make some local files

In [19]:
n_folders = 3
n_files = 100

for folder in range(n_folders):
    os.makedirs(f'./{DIR}/folder_{folder}', exist_ok = True)
    for f in range(n_files):
        with open(f'./{DIR}/folder_{folder}/myfile_{f}.txt', 'w') as file:
            file.write(f'Creating the example file named: myfile_{f}.txt')

In [20]:
os.listdir(f'./{DIR}')[0:10]

['folder_1', 'folder_2', 'folder_0']

In [21]:
os.listdir(f'./{DIR}/folder_0')[0:10]

['myfile_95.txt',
 'myfile_93.txt',
 'myfile_65.txt',
 'myfile_98.txt',
 'myfile_14.txt',
 'myfile_10.txt',
 'myfile_49.txt',
 'myfile_38.txt',
 'myfile_90.txt',
 'myfile_25.txt']

### Uploading Files to Bucket

Get a list of files in the local folder:

In [22]:
glob.glob(f'./{DIR}/**/**')[0:5]

['./temp/gcs/folder_1/myfile_95.txt',
 './temp/gcs/folder_1/myfile_93.txt',
 './temp/gcs/folder_1/myfile_65.txt',
 './temp/gcs/folder_1/myfile_98.txt',
 './temp/gcs/folder_1/myfile_14.txt']

In [23]:
gcs_path_prefix = 'my_folder/my_subfolder/'

In [25]:
for file in glob.glob(f'./{DIR}/**/**'):
    file_path = ('/').join(file.split('/')[-2:]) # just the subfolder and filename
    blob = bucket.blob(gcs_path_prefix + file_path)
    blob.upload_from_filename(file)

In [38]:
print(f"View the bucket directly here:\nhttps://console.cloud.google.com/storage/browser/{PROJECT_ID + TIMESTAMP};tab=objects&project={PROJECT_ID}")

View the bucket directly here:
https://console.cloud.google.com/storage/browser/statmike-mlops-34991520220920104702;tab=objects&project=statmike-mlops-349915
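
The upload loop above relies on `glob` and on splitting paths at `/`. Without `recursive=True`, the `**` pattern behaves like `*`, so that loop only picks up entries exactly one folder deep. A minimal alternative sketch, assuming `bucket` behaves like the `Bucket` object used above (`blob()` and `upload_from_filename()` both appear in this notebook), walks the whole directory tree and builds blob names with forward slashes on any operating system. The helper name `upload_tree` is hypothetical.

```python
import os


def upload_tree(bucket, local_root, gcs_prefix=''):
    """Upload every file under local_root, keeping relative paths as names."""
    uploaded = []
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for filename in sorted(filenames):
            local_path = os.path.join(dirpath, filename)
            rel = os.path.relpath(local_path, local_root)
            # Blob names always use '/', whatever the local separator is.
            blob_name = gcs_prefix + rel.replace(os.sep, '/')
            bucket.blob(blob_name).upload_from_filename(local_path)
            uploaded.append(blob_name)
    return uploaded
```

In this notebook the call would look like `upload_tree(bucket, DIR, gcs_path_prefix)`.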


### List Files in Bucket

In [26]:
list(bucket.list_blobs(max_results = 5))

[<Blob: statmike-mlops-34991520220920104702, my_folder/my_subfolder/folder_0/myfile_0.txt, 1663670925276172>,
 <Blob: statmike-mlops-34991520220920104702, my_folder/my_subfolder/folder_0/myfile_1.txt, 1663670923166027>,
 <Blob: statmike-mlops-34991520220920104702, my_folder/my_subfolder/folder_0/myfile_10.txt, 1663670922513872>,
 <Blob: statmike-mlops-34991520220920104702, my_folder/my_subfolder/folder_0/myfile_11.txt, 1663670923347570>,
 <Blob: statmike-mlops-34991520220920104702, my_folder/my_subfolder/folder_0/myfile_12.txt, 1663670923106588>]

In [27]:
list(bucket.list_blobs(max_results = 5, prefix = gcs_path_prefix + 'folder_1/'))

[<Blob: statmike-mlops-34991520220920104702, my_folder/my_subfolder/folder_1/myfile_0.txt, 1663670906045754>,
 <Blob: statmike-mlops-34991520220920104702, my_folder/my_subfolder/folder_1/myfile_1.txt, 1663670902233487>,
 <Blob: statmike-mlops-34991520220920104702, my_folder/my_subfolder/folder_1/myfile_10.txt, 1663670901064714>,
 <Blob: statmike-mlops-34991520220920104702, my_folder/my_subfolder/folder_1/myfile_11.txt, 1663670902565371>,
 <Blob: statmike-mlops-34991520220920104702, my_folder/my_subfolder/folder_1/myfile_12.txt, 1663670902144210>]

### Downloading

In [28]:
os.makedirs(f'./{DIR}/downloaded')

In [29]:
os.listdir(f'./{DIR}')

['downloaded', 'folder_1', 'folder_2', 'folder_0']

Download files from one subfolder on GCS to a local folder:

In [30]:
for blob in bucket.list_blobs(prefix = gcs_path_prefix + 'folder_0/', delimiter = '/'):
    blob.download_to_filename(f"./{DIR}/downloaded/{blob.name.split('/')[-1]}")

In [31]:
os.listdir(f'./{DIR}/downloaded')[0:10]

['myfile_95.txt',
 'myfile_93.txt',
 'myfile_65.txt',
 'myfile_98.txt',
 'myfile_14.txt',
 'myfile_10.txt',
 'myfile_49.txt',
 'myfile_38.txt',
 'myfile_90.txt',
 'myfile_25.txt']
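
The download loop above keeps only the last part of each blob name, so every file lands directly in one local folder. When a prefix covers several folder-like levels, it may be preferable to recreate those levels locally. The following is a minimal sketch of that variation, assuming `bucket` behaves like the `Bucket` used above (`list_blobs(prefix=...)` and `download_to_filename()` both appear in this notebook). The helper name `download_prefix` is hypothetical, and `local_root` is assumed to be a non-empty path.

```python
import os


def download_prefix(bucket, gcs_prefix, local_root):
    """Download blobs under gcs_prefix into local_root, keeping sub-paths."""
    saved = []
    for blob in bucket.list_blobs(prefix=gcs_prefix):
        rel = blob.name[len(gcs_prefix):]
        if not rel or rel.endswith('/'):
            continue  # skip placeholder objects that only mark a "folder"
        local_path = os.path.join(local_root, *rel.split('/'))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        blob.download_to_filename(local_path)
        saved.append(local_path)
    return saved
```

For example, `download_prefix(bucket, gcs_path_prefix, f'./{DIR}/downloaded')` would recreate `folder_0`, `folder_1`, and `folder_2` under the local `downloaded` folder.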

## Buckets, Again

### Delete The Bucket

In [32]:
bucket

<Bucket: statmike-mlops-34991520220920104702>

In [33]:
bucket.delete()
# results in error: Conflict: 409 DELETE https://path: The bucket you tried to delete is not empty.

Conflict: 409 DELETE https://storage.googleapis.com/storage/v1/b/statmike-mlops-34991520220920104702?prettyPrint=false: The bucket you tried to delete is not empty.

In [34]:
bucket.delete(force = True)
# results in error: ValueError: Refusing to delete bucket with more than 256 objects. If you actually want to delete this bucket, please delete the objects yourself before calling Bucket.delete().

ValueError: Refusing to delete bucket with more than 256 objects. If you actually want to delete this bucket, please delete the objects yourself before calling Bucket.delete().

In [35]:
bucket.delete_blobs(blobs = list(bucket.list_blobs()))

In [40]:
list(gcs.list_buckets())

[<Bucket: cloud-ai-platform-a68e7f3a-fac8-47f6-9f92-fff95c09cdb8>,
 <Bucket: statmike-mlops-349915>,
 <Bucket: statmike-mlops-349915-vertex-pipelines-us-central1>,
 <Bucket: statmike-mlops-34991520220920104702>]

In [41]:
bucket.delete()
# works, because the bucket is empty

In [42]:
list(gcs.list_buckets())

[<Bucket: cloud-ai-platform-a68e7f3a-fac8-47f6-9f92-fff95c09cdb8>,
 <Bucket: statmike-mlops-349915>,
 <Bucket: statmike-mlops-349915-vertex-pipelines-us-central1>]

In [43]:
print(f"View the project's buckets directly here:\nhttps://console.cloud.google.com/storage/browser?forceOnBucketsSortingFiltering=false&project={PROJECT_ID}")

View the project's buckets directly here:
https://console.cloud.google.com/storage/browser?forceOnBucketsSortingFiltering=false&project=statmike-mlops-349915
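
The cleanup sequence in this section (empty the bucket, then delete it) can be wrapped in a small helper. The sketch below is one possible arrangement, assuming `bucket` behaves like the `Bucket` used above: `list_blobs()`, `delete_blobs()`, and `delete()` all appear in this notebook. The helper name `empty_and_delete` and the `batch_size` parameter are hypothetical. Deleting in batches avoids building one list of every object in a very large bucket.

```python
def empty_and_delete(bucket, batch_size=100):
    """Delete all objects in `bucket` in batches, then delete the bucket.

    Returns the number of objects that were deleted.
    """
    batch, deleted = [], 0
    for blob in bucket.list_blobs():
        batch.append(blob)
        if len(batch) == batch_size:
            bucket.delete_blobs(batch)
            deleted += len(batch)
            batch = []
    if batch:
        # Delete whatever is left over after the last full batch.
        bucket.delete_blobs(batch)
        deleted += len(batch)
    bucket.delete()  # only succeeds once the bucket is empty
    return deleted
```

This is irreversible, so it is worth double-checking `bucket.name` before calling it.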
