# Part 0 - Data preparation

In this notebook we will download the Amazon Review dataset and save it to S3. We will also do some light data preprocessing by only keeping the columns we need, filtering out reviews that are too short, and limiting the size of the datasets.

To read more, please check out https://towardsdatascience.com/setting-up-a-text-summarisation-project-introduction-526622eea4a8.

## Data download

We download the dataset from https://huggingface.co/datasets/amazon_reviews_multi and load each split into a pandas DataFrame.

In [3]:
#!pip install datasets

In [2]:
#from datasets import load_dataset
#train_ds = load_dataset("amazon_reviews_multi", "en", split='train')
#val_ds = load_dataset("amazon_reviews_multi", "en", split='validation')
#test_ds = load_dataset("amazon_reviews_multi", "en", split='test')

DefunctDatasetError: Dataset 'amazon_reviews_multi' is defunct and no longer accessible due to the decision of data providers

In [None]:
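import pandas as pd

# The Hugging Face download above fails, so read local CSV copies of each split instead.
# The paths are placeholders: replace them with the location of your own files.
df_train = pd.read_csv('path/to/train.csv')
df_val = pd.read_csv('path/to/val.csv')
df_test = pd.read_csv('path/to/test.csv')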

In [None]:
df_train.head()

## Filtering the dataset

We want to discard reviews and titles that are too short, so that our model can produce more interesting summaries.

In [None]:
cutoff_title = 5
cutoff_body = 20

In [None]:
df_train = df_train[(df_train['review_title'].apply(lambda x: len(x.split()) >= cutoff_title)) & (df_train['review_body'].apply(lambda x: len(x.split()) >= cutoff_body))]
df_val = df_val[(df_val['review_title'].apply(lambda x: len(x.split()) >= cutoff_title)) & (df_val['review_body'].apply(lambda x: len(x.split()) >= cutoff_body))]
df_test = df_test[(df_test['review_title'].apply(lambda x: len(x.split()) >= cutoff_title)) & (df_test['review_body'].apply(lambda x: len(x.split()) >= cutoff_body))]
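
The word-count rule used in the cell above can be sketched as a small standalone function. The helper names `has_min_words` and `keep_review` are hypothetical and not part of the original notebook, and the sample reviews are made up. Unlike the lambdas above, this version also rejects missing values, which pandas loads as float NaN and which would otherwise fail on `.split()`.

```python
def has_min_words(text, cutoff):
    """Return True if `text` is a string with at least `cutoff` whitespace-separated words."""
    # Missing values show up as float NaN (or None), not str, so reject non-strings.
    if not isinstance(text, str):
        return False
    return len(text.split()) >= cutoff


def keep_review(title, body, cutoff_title=5, cutoff_body=20):
    """Apply both cutoffs, mirroring the combined mask used in the filtering cell."""
    return has_min_words(title, cutoff_title) and has_min_words(body, cutoff_body)


# Toy check with made-up reviews (not taken from the dataset).
short_title = "Great"
long_title = "Works well but arrived late"
long_body = " ".join(["word"] * 25)

print(keep_review(short_title, long_body))   # False
print(keep_review(long_title, long_body))    # True
print(keep_review(float("nan"), long_body))  # False
```

In the notebook, the same predicate could be plugged into the mask, for example with `df_train['review_title'].apply(has_min_words, cutoff=cutoff_title)`.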

## Limiting the size of the datasets

We want to limit the size of the datasets so that training of the model can finish in a reasonable amount of time. This is a decision that we might want to revisit in the experimentation phase if we want to increase the performance of the model.

In [None]:
print(len(df_train), len(df_val), len(df_test))

In [None]:
df_train = df_train.sample(20000, random_state=42)
df_train = df_train.rename(columns={"review_body": "text", "review_title": "summary"})

df_val = df_val.sample(1000, random_state=42)
df_val = df_val.rename(columns={"review_body": "text", "review_title": "summary"})

df_test = df_test.sample(1000, random_state=42)
df_test = df_test.rename(columns={"review_body": "text", "review_title": "summary"})
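
One caveat with the sampling cell: `DataFrame.sample` raises a `ValueError` when asked for more rows than the frame holds (with the default `replace=False`), which can happen if the cutoffs leave fewer than 1,000 validation or test reviews. A minimal sketch of a guard, assuming a hypothetical helper named `sample_capped` that is not part of the original notebook:

```python
def sample_capped(df, n, seed=42):
    """Sample up to `n` rows from a pandas DataFrame, reproducibly.

    The request is capped at len(df) so that a split smaller than `n`
    is returned whole (in shuffled order) instead of raising an error.
    """
    return df.sample(n=min(n, len(df)), random_state=seed)
```

It would be used as `df_val = sample_capped(df_val, 1000)`. The fixed seed keeps the subset identical across runs, which matters when comparing models later.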

## Save the data as CSV files and upload them to S3

We need to upload the data to S3 in order to train the model at a later point.

In [None]:
import os

os.makedirs('data', exist_ok=True)  # to_csv does not create missing directories
df_train.to_csv('data/train.csv', index=False, columns=['text', 'summary'])
df_val.to_csv('data/val.csv', index=False, columns=['text', 'summary'])
df_test.to_csv('data/test.csv', index=False, columns=['text', 'summary'])
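
Before uploading, it can be worth confirming that each file contains only the two expected columns. The sketch below uses the standard library `csv` module, which copes with review text that spans several lines; `read_header` is a hypothetical helper name, not part of the original notebook.

```python
import csv


def read_header(path):
    """Return the column names from the first row of a CSV file."""
    # newline="" lets the csv module handle quoted line breaks inside fields.
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f))
```

For the files written above, `read_header('data/train.csv')` should return `['text', 'summary']`.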

In [7]:
#!pip install sagemaker

In [2]:
#!pip install naas

In [1]:
import boto3

Before creating a boto3 client, you need to configure your AWS credentials with the AWS CLI. To do this:
- Download and install the AWS CLI, then check the installation by running "aws --version" from the command prompt.
- Sign in to the AWS console and open IAM. Create a new user, choose "Attach policies directly", and attach the "AdministratorAccess" policy.
- Once the user is created, go to "Security credentials" and generate an access key with the use case "CLI". Download the key CSV file.
- Back in the command prompt, run "aws configure" and enter the access key ID, the secret access key, the region name, and "json" as the output format.
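
To check that the last step worked, you can inspect the credentials file that "aws configure" writes, which by default lives at `.aws/credentials` in your home directory. The following is a minimal sketch under that assumption; `configured_profiles` is a hypothetical helper name, and environment-variable overrides of the file location are ignored.

```python
import configparser
from pathlib import Path


def configured_profiles(credentials_path=None):
    """Return the profile names that have both key fields set in an AWS credentials file."""
    # Default location used by the AWS CLI when no override is set.
    path = Path(credentials_path) if credentials_path else Path.home() / ".aws" / "credentials"
    # interpolation=None so that special characters in keys are never interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # yields no sections if the file does not exist
    required = ("aws_access_key_id", "aws_secret_access_key")
    return [
        name for name in parser.sections()
        if all(parser.has_option(name, key) for key in required)
    ]
```

After a successful "aws configure", calling `configured_profiles()` should return a list that includes `'default'`. An empty list suggests the credentials were not saved.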

In [2]:
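# Replace the placeholders with your own values. If you have already run "aws configure",
# boto3 picks up the stored credentials and boto3.client("s3") is enough.
s3 = boto3.client(
    "s3",
    aws_access_key_id='YOUR_ID',
    aws_secret_access_key='YOUR_KEY',
    region_name='YOUR_REGION'
)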

In [3]:
my_region = boto3.session.Session().region_name  # read the region from the default AWS configuration
print(my_region)

ap-south-1


In [4]:
import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()


In [None]:
# The next step is to upload the datasets (train, val, test) to a storage bucket in S3.
# You can do this from the console by creating a new bucket and uploading the relevant files,
# or you can use the AWS CLI to upload the files.
# The CLI commands are shown below.

In [11]:
!aws s3 cp data/train.csv s3://{bucket}/summarization/data/train.csv


In [12]:
# Uploading a file to S3, that is, copying it from your local file system to a bucket, is done with the "aws s3 cp" command.
# For example, to upload a file named file.txt:
# aws s3 cp file.txt s3://bucket-name

In [None]:
!aws s3 cp data/val.csv s3://{bucket}/summarization/data/val.csv
!aws s3 cp data/test.csv s3://{bucket}/summarization/data/test.csv
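
If the AWS CLI is not available on your PATH, the same upload can be done from Python with the boto3 S3 client's `upload_file` method. The sketch below is one possible way to do it, not the notebook's original approach: `upload_splits` is a hypothetical helper, and the `data` directory and `summarization/data` prefix are assumed from the earlier cells.

```python
import os


def upload_splits(s3_client, bucket, local_dir="data", prefix="summarization/data",
                  filenames=("train.csv", "val.csv", "test.csv")):
    """Upload each split with the client's upload_file method and return the S3 URIs written."""
    uris = []
    for name in filenames:
        local_path = os.path.join(local_dir, name)
        key = f"{prefix}/{name}"  # S3 keys use forward slashes on every platform
        s3_client.upload_file(local_path, bucket, key)
        uris.append(f"s3://{bucket}/{key}")
    return uris
```

With the `s3` client and the `bucket` name created in the earlier cells, the call would be `upload_splits(s3, bucket)`.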