# Copy TSV Data To S3

<img src="img/write_tsv_to_s3.png" width="45%" align="left">

#### We have chosen the [Amazon Customer Reviews Dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html) as our main dataset.

The dataset is shared in a public Amazon S3 bucket, and is available in two file formats: 

* Tab-separated values (TSV), a text format - `s3://usd-mads-508/amazon-reviews-pds/tsv/`
* Parquet, an optimized columnar binary format - `s3://usd-mads-508/amazon-reviews-pds/parquet/`

The Parquet dataset is partitioned (divided into subfolders) by the `product_category` column to further improve query performance. A SQL query with a `WHERE` clause on `product_category` then reads only the data for that category.
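
To illustrate why partitioning helps, here is a minimal sketch of how a reader can skip whole subfolders when filtering on the partition column. It assumes the common `product_category=<value>/` folder naming convention; the object keys and the helper `keys_for_category` are hypothetical and were made up for this example, not taken from the bucket.

```python
from typing import Iterable, List


def keys_for_category(keys: Iterable[str], category: str) -> List[str]:
    """Return only the object keys stored under the given partition value."""
    # Assumption: partitions are laid out as .../product_category=<value>/...
    marker = f"/product_category={category}/"
    return [key for key in keys if marker in key]


# Hypothetical object keys, for illustration only.
example_keys = [
    "amazon-reviews-pds/parquet/product_category=Books/part-0000.snappy.parquet",
    "amazon-reviews-pds/parquet/product_category=Gift_Card/part-0000.snappy.parquet",
    "amazon-reviews-pds/parquet/product_category=Gift_Card/part-0001.snappy.parquet",
]

# Only the files in the matching subfolder need to be read.
selected = keys_for_category(example_keys, "Gift_Card")
print(len(selected))  # 2
```

A query engine applies the same idea at scale: files in non-matching subfolders are never opened.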

We can list the contents of the S3 bucket with the AWS Command Line Interface (CLI):


In [1]:
!aws s3 ls s3://usd-mads-508/amazon-reviews-pds/tsv/

2024-11-02 16:07:40          0 
2024-11-02 16:07:40 3629753164 amazon_reviews_multilingual_US_v1_00.tsv
2024-11-02 16:07:40 1971061630 amazon_reviews_us_Apparel_v1_00.tsv
2024-11-02 16:07:40 1350294084 amazon_reviews_us_Automotive_v1_00.tsv
2024-11-02 16:07:40  872274720 amazon_reviews_us_Baby_v1_00.tsv
2024-11-02 16:07:40 2152186111 amazon_reviews_us_Beauty_v1_00.tsv
2024-11-02 16:08:05 3238702530 amazon_reviews_us_Books_v1_02.tsv
2024-11-02 16:08:15 1100169988 amazon_reviews_us_Camera_v1_00.tsv
2024-11-02 16:08:33 3224038446 amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv
2024-11-02 16:08:37  628880453 amazon_reviews_us_Digital_Music_Purchase_v1_00.tsv
2024-11-02 16:08:44   53855391 amazon_reviews_us_Digital_Software_v1_00.tsv
2024-11-02 16:08:46 1288048833 amazon_reviews_us_Digital_Video_Download_v1_00.tsv
2024-11-02 16:08:54   73154460 amazon_reviews_us_Digital_Video_Games_v1_00.tsv
2024-11-02 16:08:57 1725988504 amazon_reviews_us_Electronics_v1_00.tsv
2024-11-02 16:09:18  36697

In [2]:
!aws s3 ls s3://usd-mads-508/amazon-reviews-pds/parquet/
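
If you want to work with a listing programmatically, the text printed by `aws s3 ls` can be parsed into sizes and names. The following is a rough sketch that assumes each object line has the form `date time size name`; the `S3Entry` type and the `parse_s3_ls` helper are names introduced here for illustration.

```python
from typing import List, NamedTuple


class S3Entry(NamedTuple):
    size_bytes: int
    name: str


def parse_s3_ls(output: str) -> List[S3Entry]:
    """Parse object lines of `aws s3 ls` output (date, time, size, name)."""
    entries = []
    for line in output.splitlines():
        parts = line.split(None, 3)
        # Skip blank lines, "PRE <prefix>/" lines, and rows without a name.
        if len(parts) != 4 or not parts[2].isdigit():
            continue
        entries.append(S3Entry(int(parts[2]), parts[3]))
    return entries


# Sample lines copied from the TSV listing above.
sample = """\
2024-11-02 16:07:40          0
2024-11-02 16:08:44   53855391 amazon_reviews_us_Digital_Software_v1_00.tsv
2024-11-02 16:08:54   73154460 amazon_reviews_us_Digital_Video_Games_v1_00.tsv
"""

entries = parse_s3_ls(sample)
total_bytes = sum(entry.size_bytes for entry in entries)
print(total_bytes)  # 127009851
```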

# To Simulate an Application Writing Into Our Data Lake, We Copy the Public TSV Dataset to a Private S3 Bucket in our Account

<img src="img/copy_data_to_s3.png" width="60%" align="left">

# Check Pre-Requisites from the `01_setup/` Folder

In [3]:
%store -r setup_instance_check_passed

In [4]:
try:
    setup_instance_check_passed
except NameError:
    print("+++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Instance Check.")
    print("+++++++++++++++++++++++++++++++")

In [5]:
print(setup_instance_check_passed)

True


In [6]:
%store -r setup_dependencies_passed

In [7]:
try:
    setup_dependencies_passed
except NameError:
    print("+++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Setup Dependencies.")
    print("+++++++++++++++++++++++++++++++")

In [8]:
print(setup_dependencies_passed)

True


In [9]:
%store -r setup_s3_bucket_passed

In [10]:
try:
    setup_s3_bucket_passed
except NameError:
    print("+++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Setup S3 Bucket.")
    print("+++++++++++++++++++++++++++++++")

In [11]:
print(setup_s3_bucket_passed)

True


In [12]:
%store -r setup_iam_roles_passed

In [13]:
try:
    setup_iam_roles_passed
except NameError:
    print("+++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Setup IAM Roles.")
    print("+++++++++++++++++++++++++++++++")

In [14]:
print(setup_iam_roles_passed)

True


In [15]:
if not setup_instance_check_passed:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Instance Check.")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
if not setup_dependencies_passed:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Setup Dependencies.")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
if not setup_s3_bucket_passed:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Setup S3 Bucket.")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
if not setup_iam_roles_passed:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Setup IAM Roles.")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
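
The four checks above follow the same pattern, so they could also be written as a loop over a mapping from step name to flag. This is only a sketch: `missing_setup_steps` is a hypothetical helper, and the flag values are hard-coded here so the example runs on its own. In the notebook they would come from the variables restored with `%store -r`.

```python
from typing import Dict, List


def missing_setup_steps(flags: Dict[str, bool]) -> List[str]:
    """Return the names of all setup steps whose flag is not truthy."""
    return [label for label, passed in flags.items() if not passed]


# Hard-coded example values; one step is deliberately marked as failed.
example_flags = {
    "Instance Check": True,
    "Setup Dependencies": True,
    "Setup S3 Bucket": False,
    "Setup IAM Roles": True,
}

for label in missing_setup_steps(example_flags):
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. "
          f"You are missing {label}.")
```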

In [16]:
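import boto3
import sagemaker
import pandas as pd

# SageMaker session and its default bucket, which serves as our private S3 bucket
sess = sagemaker.Session()
bucket = sess.default_bucket()
# IAM execution role attached to this notebook
role = sagemaker.get_execution_role()
# Region and account ID of the current AWS session
region = boto3.Session().region_name
account_id = boto3.client("sts").get_caller_identity().get("Account")

# Low-level SageMaker client
sm = boto3.Session().client(service_name="sagemaker", region_name=region)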

# Set S3 Source Location (Public S3 Bucket)

In [17]:
s3_public_path_tsv = "s3://usd-mads-508/amazon-reviews-pds/tsv"

In [18]:
%store s3_public_path_tsv

Stored 's3_public_path_tsv' (str)


# Set S3 Destination Location (Our Private S3 Bucket)

In [19]:
s3_private_path_tsv = "s3://{}/amazon-reviews-pds/tsv".format(bucket)
print(s3_private_path_tsv)

s3://sagemaker-us-east-1-908587188823/amazon-reviews-pds/tsv


In [20]:
%store s3_private_path_tsv

Stored 's3_private_path_tsv' (str)
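
An S3 path of this form consists of a bucket name followed by a key prefix. As a small sketch of that structure, the code below builds the destination path from a bucket name and splits it back into its parts with the standard library. The helper names and the bucket `my-example-bucket` are placeholders chosen for this example.

```python
from typing import Tuple
from urllib.parse import urlparse


def private_tsv_path(bucket_name: str) -> str:
    """Build the destination path the same way the cell above does."""
    return f"s3://{bucket_name}/amazon-reviews-pds/tsv"


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, key prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder bucket name, for illustration only.
path = private_tsv_path("my-example-bucket")
print(split_s3_uri(path))  # ('my-example-bucket', 'amazon-reviews-pds/tsv')
```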


# Copy Data From the Public S3 Bucket to our Private S3 Bucket in this Account
Because the full dataset is large, we copy only three files into our bucket to speed up the later steps.

In [21]:
!aws s3 cp --recursive $s3_public_path_tsv/ $s3_private_path_tsv/ --exclude "*" --include "amazon_reviews_us_Digital_Software_v1_00.tsv"
!aws s3 cp --recursive $s3_public_path_tsv/ $s3_private_path_tsv/ --exclude "*" --include "amazon_reviews_us_Digital_Video_Games_v1_00.tsv"
!aws s3 cp --recursive $s3_public_path_tsv/ $s3_private_path_tsv/ --exclude "*" --include "amazon_reviews_us_Gift_Card_v1_00.tsv"

copy: s3://usd-mads-508/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv to s3://sagemaker-us-east-1-908587188823/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv
copy: s3://usd-mads-508/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Games_v1_00.tsv to s3://sagemaker-us-east-1-908587188823/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Games_v1_00.tsv
copy: s3://usd-mads-508/amazon-reviews-pds/tsv/amazon_reviews_us_Gift_Card_v1_00.tsv to s3://sagemaker-us-east-1-908587188823/amazon-reviews-pds/tsv/amazon_reviews_us_Gift_Card_v1_00.tsv
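
The three copy commands differ only in the file name, so they can also be generated from a list. The sketch below builds one single-file `aws s3 cp <source> <destination>` command per file instead of using `--recursive` with `--exclude`/`--include` filters. It only constructs the command strings and does not run them; `build_copy_commands` and the destination bucket `my-example-bucket` are assumptions made for this example.

```python
from typing import List

FILES_TO_COPY = [
    "amazon_reviews_us_Digital_Software_v1_00.tsv",
    "amazon_reviews_us_Digital_Video_Games_v1_00.tsv",
    "amazon_reviews_us_Gift_Card_v1_00.tsv",
]


def build_copy_commands(source_prefix: str, dest_prefix: str, filenames: List[str]) -> List[str]:
    """Return one `aws s3 cp` command per file, copying source to destination."""
    source = source_prefix.rstrip("/")
    dest = dest_prefix.rstrip("/")
    return [f"aws s3 cp {source}/{name} {dest}/{name}" for name in filenames]


commands = build_copy_commands(
    "s3://usd-mads-508/amazon-reviews-pds/tsv/",
    "s3://my-example-bucket/amazon-reviews-pds/tsv",  # placeholder destination
    FILES_TO_COPY,
)
for command in commands:
    print(command)
```

Keeping the file names in one list also makes it easy to check afterwards that every expected file arrived.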


# _Make sure ^^^^ this ^^^^ S3 COPY command above runs successfully. We will need those data files for the rest of this workshop._

# List Files in our Private S3 Bucket in this Account

In [22]:
print(s3_private_path_tsv)

s3://sagemaker-us-east-1-908587188823/amazon-reviews-pds/tsv


In [23]:
!aws s3 ls $s3_private_path_tsv/

2025-03-15 19:27:41   53855391 amazon_reviews_us_Digital_Software_v1_00.tsv
2025-03-15 19:27:45   73154460 amazon_reviews_us_Digital_Video_Games_v1_00.tsv
2025-03-15 19:27:48   39977611 amazon_reviews_us_Gift_Card_v1_00.tsv
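
Rather than checking the listing by eye, you could compare it against the expected file names. The following sketch works on the listing text shown above; `missing_files` is a hypothetical helper, and it assumes each line has the form `date time size name`.

```python
from typing import Iterable, List

EXPECTED_FILES = [
    "amazon_reviews_us_Digital_Software_v1_00.tsv",
    "amazon_reviews_us_Digital_Video_Games_v1_00.tsv",
    "amazon_reviews_us_Gift_Card_v1_00.tsv",
]


def missing_files(listing: str, expected: Iterable[str]) -> List[str]:
    """Return the expected file names that do not appear in the listing."""
    present = set()
    for line in listing.splitlines():
        parts = line.split(None, 3)
        if len(parts) == 4:  # date, time, size, name
            present.add(parts[3])
    return sorted(set(expected) - present)


# Listing text copied from the output above.
listing = """\
2025-03-15 19:27:41   53855391 amazon_reviews_us_Digital_Software_v1_00.tsv
2025-03-15 19:27:45   73154460 amazon_reviews_us_Digital_Video_Games_v1_00.tsv
2025-03-15 19:27:48   39977611 amazon_reviews_us_Gift_Card_v1_00.tsv
"""

print(missing_files(listing, EXPECTED_FILES))  # []
```

An empty list means all three files are present in the private bucket.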


In [24]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/sagemaker-{}-{}/amazon-reviews-pds/?region={}&tab=overview">S3 Bucket</a></b>'.format(
            region, account_id, region
        )
    )
)


# Store Variables for the Next Notebooks

In [25]:
%store

Stored variables and their in-db values:
auto_ml_job_name                        -> 'automl-dm-15-18-42-31'
autopilot_endpoint_arn                  -> 'arn:aws:sagemaker:us-east-1:908587188823:endpoint
autopilot_endpoint_name                 -> 'automl-dm-ep-15-19-18-17'
autopilot_model_arn                     -> 'arn:aws:sagemaker:us-east-1:908587188823:model/au
autopilot_model_name                    -> 'automl-dm-15-18-42-31-dpp0-model-8d2fb09cd16b4efa
autopilot_train_s3_uri                  -> 's3://sagemaker-us-east-1-908587188823/data/amazon
s3_private_path_tsv                     -> 's3://sagemaker-us-east-1-908587188823/amazon-revi
s3_public_path_tsv                      -> 's3://usd-mads-508/amazon-reviews-pds/tsv'
setup_dependencies_passed               -> True
setup_iam_roles_passed                  -> True
setup_instance_check_passed             -> True
setup_s3_bucket_passed                  -> True


# Release Resources

In [26]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    const els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

In [27]:
%%javascript

try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}

<IPython.core.display.Javascript object>

In [28]:
# Internal - DO NOT RUN

# step_prefix = '04_prepare'
# !aws s3 cp --recursive $s3_public_path_tsv/ s3://usd-mads-508/$step_prefix/ --exclude "*" --include "amazon_reviews_us_Digital_Software_v1_00.tsv.gz"
# !aws s3 cp --recursive $s3_public_path_tsv/ s3://usd-mads-508/$step_prefix/ --exclude "*" --include "amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz"
# !aws s3 cp --recursive $s3_public_path_tsv/ s3://usd-mads-508/$step_prefix/ --exclude "*" --include "amazon_reviews_us_Gift_Card_v1_00.tsv.gz"
# !aws s3 ls --recursive s3://usd-mads-508/$step_prefix/