<a href="https://colab.research.google.com/github/KarlRadtke/ELOC_database/blob/main/notebooks/Backup_in_google_drive.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Copying S3 Buckets to Google Drive in Colab

This notebook copies the contents of AWS S3 buckets to Google Drive,
preserving each bucket's directory structure. Before running it, prepare
a YAML file named `connection_config.yaml` containing your AWS
credentials under the keys `access_key` and `secret_access_key`.


### Import Necessary Libraries

In [1]:
%%capture
!pip install boto3 tqdm

In [2]:
from google.colab import drive
import boto3
import os
from tqdm.notebook import tqdm
import yaml

### Mount Google Drive

In [3]:
drive.mount('/content/drive')

Mounted at /content/drive


### Load AWS Credentials

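Upload `connection_config.yaml` when prompted. The file must define the
keys `access_key` and `secret_access_key`, which are used to create the
boto3 S3 resource.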

In [4]:
from google.colab import files
uploaded = files.upload()

# Load AWS credentials from YAML
with open("connection_config.yaml", 'r') as file:
    creds = yaml.safe_load(file)

AWS_ACCESS_KEY_ID = creds['access_key']
AWS_SECRET_ACCESS_KEY = creds['secret_access_key']

s3 = boto3.resource(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY
)


Saving connection_config.yaml to connection_config.yaml
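
A missing or empty key in the YAML file only shows up later as a `KeyError` or as an authentication error from AWS. The following is a minimal sketch of an early check on the loaded dictionary, assuming the two key names used in the cell above. The helper `missing_credential_keys` and the constant `REQUIRED_KEYS` are hypothetical names introduced for illustration; they are not part of the notebook.

```python
# Hypothetical helper: names below are illustrative, not part of the notebook.
REQUIRED_KEYS = ("access_key", "secret_access_key")


def missing_credential_keys(creds):
    """Return the required keys that are absent or empty in the loaded mapping."""
    if not isinstance(creds, dict):
        # yaml.safe_load returns None for an empty file, so treat that as "all missing"
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if not creds.get(key)]


# Placeholder values only; real credentials come from connection_config.yaml
example_creds = {"access_key": "EXAMPLE_KEY_ID", "secret_access_key": ""}
missing = missing_credential_keys(example_creds)
if missing:
    print(f"connection_config.yaml is missing values for: {missing}")
```

Running this check right after `yaml.safe_load` would let the notebook stop before any S3 call is attempted.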


### Define the Copy Function

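This function copies every object in a given S3 bucket into a folder
named after the bucket under the chosen Google Drive directory,
recreating the key hierarchy as subfolders. Files that already exist on
Drive are skipped rather than overwritten, so the copy can be re-run to
resume an interrupted transfer.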

In [5]:
def copy_bucket_to_drive(bucket, destination_folder):
    """
    Copies the contents of an S3 bucket to Google Drive based on filename,
    without overwriting existing files. Includes detailed status in tqdm progress bar.

    Parameters:
    - bucket: boto3 Bucket instance.
    - destination_folder: str. The root directory on Google Drive where the
      bucket contents will be saved.
    """

    # Define the base directory in Google Drive
    base_dir = os.path.join(destination_folder, bucket.name)

    # Initialize counters
    total_copied = 0
    total_skipped = 0
    total_failed = 0

    # Get total number of objects for progress tracking
    # (this lists the whole bucket once before the copy loop lists it again)
    total_objects = sum(1 for _ in bucket.objects.all())

    with tqdm(total=total_objects, desc="Initializing...") as pbar:
        for obj in bucket.objects.all():
            # Define where to save the object on Google Drive
            path_on_drive = os.path.join(base_dir, obj.key)

            # Skip directory placeholders (keys ending in '/')
            if obj.key.endswith('/'):
                pbar.update(1)  # Update progress bar for skipped directories
                continue

            # Check if the file exists on Google Drive
            if not os.path.exists(path_on_drive):
                # Ensure the directory structure exists
                os.makedirs(os.path.dirname(path_on_drive), exist_ok=True)

                # Download the file
                try:
                    bucket.download_file(obj.key, path_on_drive)
                    total_copied += 1
                except Exception as e:
                    total_failed += 1
                    print(f"Error downloading {obj.key}: {e}")
            else:
                total_skipped += 1

            # Update progress bar description with detailed status
            # (tqdm already displays the n/total count itself)
            pbar.set_description(f"Files copied: {total_copied}, Files skipped: {total_skipped}")

            pbar.update(1)  # Update progress bar for each file processed

    # Final summary
    print(f"Completed. Total files copied: {total_copied}. Total files skipped: {total_skipped}.")
    if total_failed:
        print(f"Total files failed: {total_failed}.")
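
The decision made for each object in the loop above (directory placeholder, already present, or to be downloaded) can be previewed without touching S3. The sketch below mirrors that logic on a plain list of keys. `plan_copy`, its `exists` parameter, the example keys, and the example base directory are all hypothetical and only serve to illustrate how keys map to paths.

```python
import os


def plan_copy(keys, base_dir, exists=os.path.exists):
    """Classify S3 keys the way the copy loop does, without downloading anything.

    `exists` is injectable so the sketch can be tried without a mounted Drive.
    """
    plan = {"copy": [], "skip": [], "directory": []}
    for key in keys:
        if key.endswith("/"):
            # Keys ending in '/' are folder placeholders, not files
            plan["directory"].append(key)
            continue
        target = os.path.join(base_dir, key)
        if exists(target):
            plan["skip"].append(target)   # already on Drive, left untouched
        else:
            plan["copy"].append(target)   # would be downloaded
    return plan


# Made-up keys and base directory, for illustration only
example_base = os.path.join("/content/drive/MyDrive/backup", "example-bucket")
example_keys = ["site_a/", "site_a/rec_001.wav", "notes.txt"]

# Pretend that only notes.txt is already present on Drive
plan = plan_copy(example_keys, example_base,
                 exists=lambda path: path.endswith("notes.txt"))
for action, items in plan.items():
    print(action, items)
```

With a real bucket, the key list could be built from `obj.key for obj in bucket.objects.all()` and `exists` left at its default.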


### Execute the Copy

Specify the S3 buckets to copy and start the copy process. Adjust the
`BUCKETS_TO_COPY` list to contain the names of the buckets you want to
transfer, and set `DESTINATION_FOLDER` to the target directory on
Google Drive.


In [6]:
#BUCKETS_TO_COPY = ['tangkahan', 'bukit-tiga-puluh', 'sabah', 'way-kambas']
BUCKETS_TO_COPY = ['btp-abt-202307']
DESTINATION_FOLDER = "/content/drive/Shareddrives/ELOC_database"

# Iterate over each bucket and copy to Google Drive
for bucket_name in BUCKETS_TO_COPY:
    try:
        bucket = s3.Bucket(bucket_name)
        copy_bucket_to_drive(bucket, DESTINATION_FOLDER)
    except Exception as e:
        print(f"Error copying bucket {bucket_name}: {e}")


Initializing...:   0%|          | 0/15087 [00:00<?, ?it/s]

Completed. Total files copied: 3112. Total files skipped: 11926.
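
One way to cross-check the summary line is to count the files that now exist under the bucket's folder on Drive. If the folder holds nothing but this backup, that count would be expected to equal copied plus skipped. This is a rough sketch; `count_files` is a hypothetical helper, and the self-check runs on a temporary directory rather than on Drive.

```python
import os
import tempfile


def count_files(root):
    """Count regular files under `root`, including all subfolders."""
    return sum(len(filenames) for _dirpath, _dirnames, filenames in os.walk(root))


# Self-check on a throwaway directory tree with three files
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "a", "b"))
    for rel in ("x.txt", os.path.join("a", "y.txt"), os.path.join("a", "b", "z.txt")):
        open(os.path.join(tmp, rel), "w").close()
    demo_count = count_files(tmp)

print(demo_count)

# In the notebook, the same helper could be pointed at the copied bucket:
# count_files(os.path.join(DESTINATION_FOLDER, "btp-abt-202307"))
```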
