# S3 Data Download into Lepton Local Storage

This notebook shows how data can be downloaded from an external S3 bucket to mounted storage on a DGX Cloud Lepton Dev Pod. It makes use of a publicly available dataset for global fishery statistics purely as an example of loading a CSV dataset into a pandas DataFrame. The intent is for developers to make use of their own S3 buckets for transferring data to and from the Dev Pod.

<div class="alert alert-block alert-info">
<b>Note:</b> The license and terms of use for this sample dataset can be found <a href="https://registry.opendata.aws/sau-global-fisheries-catch-data/">here</a>.
</div>

### Requirements

-  Image: A RAPIDS Notebooks container image such as `nvcr.io/nvidia/rapidsai/notebooks:25.08-cuda12.9-py3.13` or later
-  Packages: The `s3fs` Python package
-  GPU: An NVIDIA Ampere-class or newer GPU for cuDF acceleration (for example, A100)

### Storage Mount Setup

Follow the instructions __[here](https://docs.nvidia.com/dgx-cloud/lepton/features/storage/)__ to set up either Node Local or Static NFS volumes in DGX Cloud Lepton. The prescribed RAPIDS container image runs as the non-root user `rapids` (UID/GID 1001), so there are two options in Lepton for ensuring the storage mount is writable:
1. Launch the Dev Pod as `root` instead of the default `rapids` user.
2. Create a Node Local storage mount using a world-writable location on the node such as `/tmp`. Files in `/tmp` may be deleted after a reboot, so it is important to also copy your data to S3 periodically.
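
Whichever option is chosen, it can help to confirm from inside the Dev Pod that the mount is actually writable before downloading anything. The following is a minimal sketch using only the Python standard library; the helper name `is_writable` and the `mount_path` value are illustrative assumptions, so substitute the path of your own storage mount.

```python
import os
import tempfile


def is_writable(directory):
    """Return True if the current user can create a file in `directory`."""
    try:
        # The temporary file is removed automatically when the block exits.
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers a missing directory as well as a permission error.
        return False


mount_path = '/tmp'  # assumption: replace with your own mount path
print(f'uid={os.getuid()} gid={os.getgid()} writable={is_writable(mount_path)}')
```

If this reports `writable=False`, revisit the two options above before continuing.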

### Install s3fs

Make sure that we have the __[s3fs](https://s3fs.readthedocs.io/en/latest/)__ pip package installed.

In [None]:
!pip install s3fs

### Import pandas (Optionally with cuDF GPU Acceleration)

If we want to run the GPU accelerated version of __[pandas](https://pandas.pydata.org/docs/index.html)__, we can load the __[cuDF](https://docs.rapids.ai/api/cudf/stable/)__ extension.

In [None]:
%load_ext cudf.pandas

Import pandas in its own cell to make sure the previous load extension step was completed by the kernel.

In [None]:
import pandas as pd
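
Outside a notebook, the `%load_ext` magic is not available. A possible alternative, sketched here, is to call `cudf.pandas.install()` before pandas is imported and to fall back to CPU pandas when cuDF cannot be loaded. The helper name `enable_cudf_pandas` is an illustrative assumption, not part of cuDF.

```python
def enable_cudf_pandas():
    """Try to enable cudf.pandas; return True on success, False otherwise."""
    try:
        import cudf.pandas
        cudf.pandas.install()
        return True
    except Exception:
        # cuDF is not installed, or it could not be initialized on this machine.
        return False


# Call this before the first `import pandas`.
gpu_accelerated = enable_cudf_pandas()
print('cudf.pandas enabled:', gpu_accelerated)
```

When the function returns `False`, a later `import pandas as pd` simply uses standard pandas on the CPU.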

### Anonymous S3 Access

Import the s3fs package, then set the local destination path and the location of a publicly available S3 dataset of yearly fishery statistics from around the globe. The bucket allows anonymous access, so no credentials, access keys, secrets, or account are required in this case.

In [None]:
import s3fs

In [None]:
local_path = '/tmp/rfmo_12.csv'
bucket_path = 's3://fisheries-catch-data/global-catch-data/csv/rfmo_12.csv'
s3 = s3fs.S3FileSystem(anon=True)

### Options for Connecting to S3

To access a private S3 bucket, pass the access key and secret with the `key` and `secret` keyword arguments.

In [None]:
# s3 = s3fs.S3FileSystem(
#      key='YOUR_ACCESS_KEY...',
#      secret='YOUR_ACCESS_SECRET...'
#    )

s3fs can also detect and use appropriate environment variables if they have been set for the key and secret.

In [None]:
# Set these in the shell environment that launches the notebook kernel:
# export FSSPEC_S3_KEY='YOUR_ACCESS_KEY...'
# export FSSPEC_S3_SECRET='YOUR_ACCESS_SECRET...'
# Then, in Python:
# s3 = s3fs.S3FileSystem()

When no key or secret is passed explicitly, credentials can also be found by the underlying boto credential lookup, which checks client_kwargs, environment variables, config files, and an EC2 IAM server.

In [None]:
# s3 = s3fs.S3FileSystem(anon=False)

Finally, s3fs is compatible with non-AWS object storage such as MinIO. In this case, we would specify the URL of the MinIO endpoint.

In [None]:
# s3 = s3fs.S3FileSystem(
#      endpoint_url='https://non.aws.such.as.minio...'
#   )
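
The options above can be collected into one small helper that picks the connection arguments from the environment. This is only a sketch: the function name `s3_kwargs_from_env` and the `S3_ENDPOINT_URL` variable are hypothetical, while `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard AWS variable names. It falls back to anonymous access when no key pair is set.

```python
import os


def s3_kwargs_from_env(env=None):
    """Build keyword arguments for s3fs.S3FileSystem from environment variables."""
    env = os.environ if env is None else env
    kwargs = {}

    key = env.get('AWS_ACCESS_KEY_ID')
    secret = env.get('AWS_SECRET_ACCESS_KEY')
    if key and secret:
        kwargs['key'] = key
        kwargs['secret'] = secret
    else:
        # No complete key pair found: use anonymous access.
        kwargs['anon'] = True

    # Hypothetical variable name for a non-AWS endpoint such as MinIO.
    endpoint = env.get('S3_ENDPOINT_URL')
    if endpoint:
        kwargs['endpoint_url'] = endpoint
    return kwargs


print(sorted(s3_kwargs_from_env()))
```

The result would then be passed on as `s3 = s3fs.S3FileSystem(**s3_kwargs_from_env())`, which keeps secrets out of the notebook itself.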

### Download a Dataset from S3

In [None]:
s3.download(bucket_path, local_path)

Perform a cursory check that we have downloaded the dataset.

In [None]:
!ls -lh /tmp/*.csv
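
For a slightly stronger check than listing the file, the local file size can be compared with the size that the filesystem reports for the remote object. The sketch below assumes a filesystem object with an fsspec-style `info()` method that returns a dictionary with a `size` entry, as s3fs provides; the helper name `download_is_complete` is illustrative.

```python
import os


def download_is_complete(fs, remote_path, local_path):
    """Return True if the local file exists and matches the remote size in bytes."""
    if not os.path.isfile(local_path):
        return False
    remote_size = fs.info(remote_path)['size']
    return os.path.getsize(local_path) == remote_size
```

In this notebook it would be called as `download_is_complete(s3, bucket_path, local_path)`. A size match does not prove the contents are identical, but it does catch truncated downloads.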

### Load the Dataset into a pandas DataFrame

Read the downloaded CSV file into a pandas DataFrame.

In [None]:
df = pd.read_csv(local_path)

### Work with the DataFrame

Check the number of rows in the DataFrame.

In [None]:
len(df)

In [None]:
df.dtypes

Summarize the object (string) columns of the original dataset: count, number of unique values, most frequent value, and its frequency.

In [None]:
df.describe(include='object')

Look at the first 5 rows of the DataFrame.

In [None]:
df.head()

Look at the last 5 rows of the DataFrame.

In [None]:
df.tail()

Count the 15 most common fishing gear types in the dataset. Profile the GPU usage (which may be minimal).

In [None]:
%%cudf.pandas.profile
df["gear_name"].value_counts().head(15)

Convert some of the columns from object to numeric or date types.

In [None]:
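# Parse years through strings with an explicit format so that integer years
# are not interpreted as nanosecond offsets from 1970-01-01.
df['year'] = pd.to_datetime(df['year'].astype(str), format='%Y')
df['catch_sum'] = pd.to_numeric(df['catch_sum'])
df['real_value'] = pd.to_numeric(df['real_value'])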
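
How these conversions behave depends on what the columns contain. The sketch below uses a made-up two-row DataFrame (the values are not taken from the dataset) to show two points worth knowing: `pd.to_datetime` reads bare integers as nanoseconds since 1970-01-01 unless they are parsed with a format, and `pd.to_numeric` can turn unparseable entries into NaN with `errors='coerce'` instead of raising.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({'year': [2014, 2018], 'catch_sum': ['1.5', 'n/a']})

# Integers passed directly are read as nanoseconds since 1970-01-01.
naive = pd.to_datetime(toy['year'])

# Casting to string and giving an explicit format parses them as calendar years.
toy['year'] = pd.to_datetime(toy['year'].astype(str), format='%Y')

# errors='coerce' turns unparseable entries into NaN instead of raising.
toy['catch_sum'] = pd.to_numeric(toy['catch_sum'], errors='coerce')

print(naive.dt.year.tolist())
print(toy)
```

Whether coercion is appropriate depends on the data; raising on bad values, which is the default, is often the safer choice for a first pass.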

Check the datatypes again.

In [None]:
df.dtypes

Collect the last 5 years of the dataset.

In [None]:
# Sort first: unique() returns values in order of appearance, not in order of value.
years = df['year'].sort_values().unique()[-5:]

Create a new DataFrame from the top 15 countries in the dataset with the largest total catch in metric tonnes between 2014 and 2018.

In [None]:
in_range = (df['year'] >= years[0]) & (df['year'] <= years[4])
top_fisheries_by_catch = (
    df[in_range]
    .groupby(['fishing_entity'], as_index=False)[['catch_sum']]
    .sum()
    .sort_values(by='catch_sum', ascending=False)
    .head(15)
)

Plot the result as a horizontal bar chart.

In [None]:
import matplotlib.pyplot as plt

In [None]:
colors = ['red', 'green', 'blue', 'purple', 'orange','yellow']
top_fisheries_by_catch.plot(x='fishing_entity',y='catch_sum',kind='barh',color=colors,legend=False)
plt.title('Top 15 Fishing Catch by Country/Region 2014-2018')
plt.xlabel('Metric Tonnes')
plt.ylabel('Countries/Regions')
plt.show()
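
Transfers in the other direction work similarly. As a sketch, a result file written to the Dev Pod can be copied back to a bucket with the fsspec-style `put()` method that s3fs filesystems provide. The helper name `upload_file`, the bucket name, and the file paths below are hypothetical, and the target bucket must be one you have write access to.

```python
import os


def upload_file(fs, local_path, remote_path):
    """Copy a local file to remote storage and report whether it now exists there."""
    if not os.path.isfile(local_path):
        raise FileNotFoundError(local_path)
    fs.put(local_path, remote_path)
    return fs.exists(remote_path)


# Example usage (hypothetical bucket and paths, requires write credentials):
# top_fisheries_by_catch.to_csv('/tmp/top_fisheries.csv', index=False)
# private_s3 = s3fs.S3FileSystem(anon=False)
# upload_file(private_s3, '/tmp/top_fisheries.csv',
#             's3://your-bucket/results/top_fisheries.csv')
```

Running an upload like this periodically is one way to protect work stored under `/tmp` from being lost after a reboot.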

This completes the notebook example.