# S3 Soundfile Storage Class Management Based on Label Association

This notebook assigns AWS S3 storage classes to soundfiles based on whether they have labels.

Soundfiles that are associated with labels keep the S3 Standard storage class, while those without labels are moved to S3 Intelligent-Tiering.

Whether a soundfile has labels is determined from a CSV file that lists all labeled soundfiles. The goal is to reduce storage cost by matching each soundfile's storage class to its labeling status.
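
The rule described above can be written as a small pure function. This is a minimal sketch rather than code from the notebook: the name `target_storage_class` and the example keys are hypothetical, while `STANDARD` and `INTELLIGENT_TIERING` are the S3 storage class identifiers.

```python
def target_storage_class(key, labeled_keys):
    """Return the storage class a soundfile should have under the labeling rule."""
    # Labeled soundfiles stay in Standard; everything else goes to Intelligent-Tiering
    return 'STANDARD' if key in labeled_keys else 'INTELLIGENT_TIERING'


# Hypothetical keys, for illustration only
labeled_keys = {'soundfiles/rec_a.wav'}
labeled_class = target_storage_class('soundfiles/rec_a.wav', labeled_keys)
unlabeled_class = target_storage_class('soundfiles/rec_b.wav', labeled_keys)
```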

## Import Necessary Libraries
This cell imports `boto3` for AWS operations, `pandas` for reading the CSV file, `yaml` for reading the configuration, `tqdm` for progress bars, and `ThreadPoolExecutor` for running requests in parallel.


In [2]:
# Import necessary libraries
import boto3
import pandas as pd
import yaml
from tqdm.notebook import tqdm
from concurrent.futures import ThreadPoolExecutor

## Load AWS Credentials
AWS credentials are loaded from a YAML file. These credentials are used to authenticate and perform operations on S3 buckets.


In [3]:
# Load the S3 credentials from a YAML file
with open('../config/connection_config.yaml', 'r') as f:
    credentials = yaml.safe_load(f)
    
# Extract the access key and secret access key
access_key = credentials['access_key']
secret_access_key = credentials['secret_access_key']

# Initialize S3 client
s3_client = boto3.client('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_access_key)
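
If a key is missing from the YAML file, the cell above fails with a bare `KeyError`. The sketch below checks the parsed mapping before the client is created. It is an assumption and not part of the original notebook: the helper name `extract_credentials` and the placeholder values are hypothetical, while the two key names are the ones used above.

```python
REQUIRED_KEYS = ('access_key', 'secret_access_key')


def extract_credentials(config):
    """Return (access_key, secret_access_key) from a parsed config mapping."""
    # yaml.safe_load returns None for an empty file, so check the type first
    if not isinstance(config, dict):
        raise ValueError("connection config must be a mapping")
    # Treat absent and empty values alike as missing
    missing = [k for k in REQUIRED_KEYS if not config.get(k)]
    if missing:
        raise ValueError(f"connection config is missing: {', '.join(missing)}")
    return config['access_key'], config['secret_access_key']


# Placeholder values for illustration; real values would come from yaml.safe_load
example_config = {'access_key': 'EXAMPLE_ID', 'secret_access_key': 'EXAMPLE_SECRET'}
example_access_key, example_secret_key = extract_credentials(example_config)
```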

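## Define Bucket Names
Specify the names of the buckets to be processed. The CSV file of labeled soundfiles is not read here; it is read per bucket inside the bucket processing function.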


In [4]:

# List of bucket names
bucket_names = ['sabah', 'way-kambas', 'bukit-tiga-puluh']

## Function to Change Storage Class
This function changes the storage class of a single file in an S3 bucket by copying the object onto itself with the `INTELLIGENT_TIERING` storage class. It does not check the exclusion list itself; the caller decides which files to pass in.


In [5]:
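def change_storage_class(bucket, key):
    """Move one object to S3 Intelligent-Tiering by copying it onto itself."""
    # An in-place copy is allowed because the storage class changes.
    # A single copy_object call supports objects up to 5 GB.
    s3_client.copy_object(
        Bucket=bucket,
        CopySource={'Bucket': bucket, 'Key': key},
        Key=key,
        StorageClass='INTELLIGENT_TIERING'
    )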
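
`copy_object` raises on failure, and an exception raised for one file propagates out of a loop over `executor.map`. A hedged sketch of a wrapper that records failures instead of raising is shown here. The name `try_change_storage_class`, the `client` parameter, and the `FakeClient` stand-in are hypothetical; in the notebook, the `s3_client` created earlier would be passed in.

```python
def try_change_storage_class(client, bucket, key, storage_class='INTELLIGENT_TIERING'):
    """Copy an object onto itself with a new storage class; return (key, error)."""
    try:
        client.copy_object(
            Bucket=bucket,
            CopySource={'Bucket': bucket, 'Key': key},
            Key=key,
            StorageClass=storage_class,
        )
        return key, None
    except Exception as exc:  # broad on purpose: this sketch only records the failure
        return key, str(exc)


class FakeClient:
    """Stand-in for a boto3 S3 client so the sketch runs without AWS access."""

    def copy_object(self, **kwargs):
        if kwargs['Key'].endswith('missing.wav'):
            raise RuntimeError('NoSuchKey')
        return {}


results = [try_change_storage_class(FakeClient(), 'example-bucket', k)
           for k in ('soundfiles/ok.wav', 'soundfiles/missing.wav')]
failed = [key for key, error in results if error is not None]
```

The list of failed keys can then be retried or logged after the run, rather than losing track of which files were not moved.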

## Define Bucket Processing Function
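This function processes a single bucket. It lists every file under the `soundfiles/` prefix, skips the files that appear in the exclusion list (the labeled soundfiles), and changes the storage class of the remaining files to Intelligent-Tiering. The exclusion list is read with `pd.read_csv` from an `s3://` path, which requires the `s3fs` package; since no `storage_options` are passed, the credentials loaded above are not used for this read.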


In [6]:
def update_storage_class_for_bucket(bucket_name, test_limit=None, base_update_interval=100):
    """
    Processes sound files in a specified S3 bucket in parallel. Reads a CSV file 
    from the same bucket to get the exclusion list. Only files within the 
    'soundfiles' directory are processed. Files not in the excluded list are 
    moved to Intelligent Tiering. If a test_limit is provided, only that many 
    files are processed (useful for testing).

    Args:
        bucket_name (str): Name of the S3 bucket to be processed.
        test_limit (int, optional): Number of files to process for testing. 
                                    If None, processes all files.
        base_update_interval (int): Base interval for progress bar updates.

    Returns:
        None: The function does not return anything but prints out the status.
    """
    
    # Construct the CSV file path and read the exclusion list
    csv_file_path = f"s3://{bucket_name}/labels/selection_tables_to_soundfiles.csv"
    df = pd.read_csv(csv_file_path)
    excluded_files = set(df['soundfile_directory'].tolist())

    # Gather list of files to process based on exclusion list
    paginator = s3_client.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=bucket_name, Prefix='soundfiles/')

    files_to_process = [obj['Key'] for page in page_iterator 
                        for obj in page.get('Contents', []) 
                        if obj['Key'] not in excluded_files]
    
    # Adjust the list and update interval if test_limit is set
    if test_limit:
        files_to_process = files_to_process[:test_limit]
        # Adjust update interval for smaller batches
        update_interval = max(1, min(base_update_interval, test_limit // 10))
    else:
        update_interval = base_update_interval
        
    # Initialize the tqdm progress bar
    pbar = tqdm(total=len(files_to_process), desc=f"Processing Bucket: {bucket_name}")

    # Define the file processing function
    def process_file(key):
        change_storage_class(bucket_name, key)
        return key

    # Process the files in parallel
    total = len(files_to_process)
    width = 0  # widest file name printed so far
    with ThreadPoolExecutor() as executor:
        for i, filename in enumerate(executor.map(process_file, files_to_process), 1):
            # Refresh the progress bar every update_interval files and at the end
            if i % update_interval == 0 or i == total:
                # Advance by the files completed since the last refresh,
                # so the final update does not overshoot the total
                pbar.update(i - pbar.n)
                # Pad to the widest name seen so a shorter name overwrites a longer one
                width = max(width, len(filename))
                print(f"\rLast Processed: {filename:<{width}}", end='')

    # Close the progress bar and print the final status
    pbar.close()
    print(f"\nStorage class updated for soundfiles in {bucket_name}. "
          f"Processed {len(files_to_process)} files.")
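
Each entry returned by `list_objects_v2` includes a `StorageClass` field, so files that are already in the target class could be skipped on a rerun instead of being copied again. The following is a sketch under that assumption and not the notebook's implementation: `select_keys_to_move` and the sample pages are hypothetical, and in practice `pages` would be the `page_iterator` built in the function above.

```python
def select_keys_to_move(pages, excluded_files, target_class='INTELLIGENT_TIERING'):
    """Return keys that are neither excluded nor already in target_class."""
    keys = []
    for page in pages:
        for obj in page.get('Contents', []):
            if obj['Key'] in excluded_files:
                continue  # labeled soundfile: leave it in its current class
            if obj.get('StorageClass', 'STANDARD') == target_class:
                continue  # already moved by an earlier run
            keys.append(obj['Key'])
    return keys


# Hand-written pages shaped like list_objects_v2 responses, for illustration only
sample_pages = [
    {'Contents': [
        {'Key': 'soundfiles/a.wav', 'StorageClass': 'STANDARD'},
        {'Key': 'soundfiles/b.wav', 'StorageClass': 'INTELLIGENT_TIERING'},
    ]},
    {'Contents': [{'Key': 'soundfiles/c.wav', 'StorageClass': 'STANDARD'}]},
    {},  # a page without 'Contents'
]
to_move = select_keys_to_move(sample_pages, excluded_files={'soundfiles/c.wav'})
```

Skipping files this way avoids repeat copy requests when the notebook is run again after new recordings are uploaded.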

## Execute Bucket Processing
Loop through each bucket and apply the processing function. Files that are not in the exclusion list are moved to Intelligent-Tiering; files in the exclusion list are left untouched in S3 Standard.


### Test the Function with a Limited Number of Files
Before processing all buckets, let's test the function on a single bucket with a limited number of files. This ensures the functionality works as expected without impacting a large number of files.


In [7]:
# Test with a limited number of files
test_bucket_name = 'tangkahan' 
update_storage_class_for_bucket(test_bucket_name, test_limit=20)

Processing Bucket: tangkahan:   0%|          | 0/20 [00:00<?, ?it/s]

Last Processed: soundfiles/1631374980_Swift1/Swift1_2021-09-13/Swift1_20210913_110004.wav                           
Storage class updated for soundfiles in tangkahan. Processed 20 files.
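
After a test run, the result can be checked by counting storage classes in a fresh listing of the bucket. A minimal sketch, assuming pages shaped like `list_objects_v2` responses; `count_storage_classes` and the sample page are hypothetical and do not reflect the actual contents of any bucket.

```python
from collections import Counter


def count_storage_classes(pages):
    """Count objects per storage class across list_objects_v2-style pages."""
    counts = Counter()
    for page in pages:
        for obj in page.get('Contents', []):
            counts[obj.get('StorageClass', 'STANDARD')] += 1
    return counts


# Hand-written page for illustration only
verification_pages = [
    {'Contents': [
        {'Key': 'soundfiles/a.wav', 'StorageClass': 'INTELLIGENT_TIERING'},
        {'Key': 'soundfiles/b.wav', 'StorageClass': 'INTELLIGENT_TIERING'},
        {'Key': 'soundfiles/c.wav', 'StorageClass': 'STANDARD'},
    ]},
]
class_counts = count_storage_classes(verification_pages)
```

With the real client, `pages` would come from `s3_client.get_paginator('list_objects_v2').paginate(Bucket=test_bucket_name, Prefix='soundfiles/')`.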


In [8]:
# Process each bucket
for bucket in bucket_names:
    update_storage_class_for_bucket(bucket)

print("Storage class updated for all buckets.")

Processing Bucket: sabah:   0%|          | 0/15561 [00:00<?, ?it/s]

Last Processed: soundfiles/eloc22_1677318417833/eloc22_1677318417833_2023-02-28_23-46-58.wav                        
Storage class updated for soundfiles in sabah. Processed 15561 files.


Processing Bucket: way-kambas:   0%|          | 0/1728 [00:00<?, ?it/s]

Last Processed: soundfiles/Swift4/SWIFT4_20211213_084027/SWIFT4_2021-12-14/SWIFT4_20211214_180006.wav               
Storage class updated for soundfiles in way-kambas. Processed 1728 files.


Processing Bucket: bukit-tiga-puluh:   0%|          | 0/27741 [00:00<?, ?it/s]

Last Processed: soundfiles/jambi_data/ELOC19_1660118917118/config_ELOC19_1660118917118.txt
Storage class updated for soundfiles in bukit-tiga-puluh. Processed 27741 files.
Storage class updated for all buckets.
