# Develop training and inference scripts for Script Mode

## Overview
In this notebook, we will learn how to develop training and inference scripts using the HuggingFace framework. We will use SageMaker pre-built containers for HuggingFace (with a PyTorch backend).

We chose to solve a typical NLP task: text classification. We will use the `20 Newsgroups` dataset, which contains approximately 20,000 newsgroup documents across 20 different newsgroups.

By the end of this notebook you will know how to:
- prepare a text corpus for training and inference using Amazon SageMaker;
- develop a training script to run in a pre-built HuggingFace container;
- configure and schedule a training job;
- develop inference code;
- configure and deploy a real-time inference endpoint;
- test the SageMaker endpoint.

Note that this notebook was tested on a SageMaker Notebook instance with the latest PyTorch dependencies installed (conda environment `conda_pytorch_latest_p36`). Make sure that your runtime has PyTorch and its dependencies installed.

## Preparing Dataset
First, we download the dataset using the `sklearn.datasets` module.

In [14]:
from sklearn.datasets import fetch_20newsgroups

# We select 6 out of 20 diverse newsgroups
categories = [
    "comp.windows.x",
    "rec.autos",
    "sci.electronics",
    "misc.forsale",
    "talk.politics.misc",
    "alt.atheism"
]

train_dataset = fetch_20newsgroups(subset='train',
                                  categories=categories,
                                  shuffle=True,
                                  random_state=42
                                 )
test_dataset = fetch_20newsgroups(subset='test',
                                  categories=categories,
                                  shuffle=True,
                                  random_state=42
                                 )

print(f"Number of training samples: {len(train_dataset['data'])}")
print(f"Number of test samples: {len(test_dataset['data'])}")

print("=========================")
n=6
print(f"Sample news article: {train_dataset['data'][n]} \n category:{train_dataset['target'][n]}")


Number of training samples: 3308
Number of test samples: 2203
Sample news article: From: whaley@sigma.kpc.com (Ken Whaley)
Subject: Re: XCopyPlane Question
In-Reply-To: buzz@bear.com's message of 19 Apr 93 14:15:38 GMT
Organization: Kubota Pacific Computer Inc.
	<BUZZ.93Apr19101538@lion.bear.com>
Lines: 33

> 
> In article <WHALEY.93Apr15103931@sigma.kpc.com> whaley@sigma.kpc.com (Ken Whaley) writes:
> >   Actually, I must also ask the FAQ's #1 most popular reason why graphics
> >   don't show up: do you wait for an expose event before drawing your
> >   rectangle?
> 
> Suppose you have an idle app with a realized and mapped Window that contains
> Xlib graphics.  A button widget, when pressed, will cause a new item
> to be drawn in the Window.  This action clearly should not call XCopyArea() 
> (or equiv) directly; instead, it should register the existence of the new
> item in a memory structure and let the expose event handler take care
> of rendering the image because at that time it

In [18]:
import sklearn
import transformers  # imported here so the version check runs on its own

print(sklearn.__version__)
print(transformers.__version__)

0.24.1
4.10.3


Now we need to save the selected datasets to files and upload the resulting files to Amazon S3. SageMaker will download them to the training container at training time.

In [2]:
import csv

# newline='' lets the csv module manage line endings, which matters for multi-line posts
with open('train_dataset.csv', 'w', newline='') as f:
    w = csv.DictWriter(f, ['data', 'category_id'])
    w.writeheader()
    for i in range(len(train_dataset["data"])):
        w.writerow({"data":train_dataset["data"][i], "category_id":train_dataset["target"][i]})

with open('test_dataset.csv', 'w', newline='') as f:
    w = csv.DictWriter(f, ['data', 'category_id'])
    w.writeheader()
    for i in range(len(test_dataset["data"])):
        w.writerow({"data":test_dataset["data"][i], "category_id":test_dataset["target"][i]})
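
Newsgroup posts span many lines and contain commas and quotes, so it can be worth confirming that such text round-trips cleanly through CSV before uploading the files. The following is a minimal sketch on a toy record, not part of the original notebook; the file name `roundtrip_check.csv` and the sample text are made up for illustration.

```python
import csv
import os
import tempfile

# Toy record that mimics a newsgroup post: multiple lines, commas, and quotes.
sample_rows = [
    {
        "data": 'From: someone@example.com\nSubject: Re: "test", again\n\nBody text.',
        "category_id": 1,
    },
]

# Hypothetical scratch file; the notebook itself writes train_dataset.csv and test_dataset.csv.
path = os.path.join(tempfile.mkdtemp(), "roundtrip_check.csv")

# newline="" lets the csv module control line endings, as the csv documentation recommends.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, ["data", "category_id"])
    writer.writeheader()
    writer.writerows(sample_rows)

with open(path, newline="", encoding="utf-8") as f:
    restored = list(csv.DictReader(f))

# The text should survive intact; category_id comes back as a string.
print(restored[0]["data"] == sample_rows[0]["data"])
print(type(restored[0]["category_id"]).__name__)
```

Because CSV stores everything as text, a training script that reads these files back would need to convert `category_id` to an integer before using it as a label.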

In [3]:
import sagemaker 

session = sagemaker.Session()
session.upload_data("train_dataset.csv", key_prefix="newsgroups")
session.upload_data("test_dataset.csv", key_prefix="newsgroups")

's3://sagemaker-us-east-2-941656036254/newsgroups/test_dataset.csv'

## Developing training script

In [None]:
# TODO: add reading from dataset

In [16]:
import datasets, transformers, torch

In [29]:
#dataset = load_dataset('csv', data_files={"train":"train_dataset.csv", "test":"test_dataset.csv"}, column_names=["data", "target"])

Using custom data configuration default-e7f2998f7619f51e


Downloading and preparing dataset csv/default to /home/ec2-user/.cache/huggingface/datasets/csv/default-e7f2998f7619f51e/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff...


Dataset csv downloaded and prepared to /home/ec2-user/.cache/huggingface/datasets/csv/default-e7f2998f7619f51e/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff. Subsequent calls will reuse this data.



In [5]:
from transformers import DistilBertTokenizerFast

# Name of the pretrained checkpoint, reused for the tokenizer
MODEL = "distilbert-base-uncased"
tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL)

In [6]:
train_encodings = tokenizer(train_dataset["data"], truncation=True, padding=True)
test_encodings = tokenizer(test_dataset["data"], truncation=True, padding=True)
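
In the call above, `truncation=True` cuts each sequence at the model's maximum length (512 tokens for `distilbert-base-uncased`), and `padding=True` pads shorter sequences up to the longest one in the batch. The sketch below imitates that behaviour on made-up token ID lists in plain Python; `pad_and_truncate` is a hypothetical helper for illustration, not the tokenizer's implementation.

```python
from typing import Dict, List


def pad_and_truncate(
    batch: List[List[int]], max_length: int, pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Toy version of truncation=True, padding=True on lists of token IDs."""
    # Truncate: keep at most max_length tokens per sequence.
    # (The real tokenizer also preserves special tokens such as [SEP]; this toy does not.)
    truncated = [ids[:max_length] for ids in batch]
    # Pad: extend every sequence to the longest one in the batch.
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs; max_length=4 stands in for the model's 512-token limit.
toy_batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 2146, 102]]
encoded = pad_and_truncate(toy_batch, max_length=4)
print(encoded["input_ids"])       # [[101, 7592, 102, 0], [101, 2023, 2003, 1037]]
print(encoded["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

Because the notebook tokenizes each split in a single call, every sample in that split is padded to the length of its longest sequence, capped at 512 tokens.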

In [7]:
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['label'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [8]:
train_enc_dataset = CustomDataset(train_encodings, train_dataset["target"])
test_enc_dataset = CustomDataset(test_encodings, test_dataset["target"])

In [9]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments, AutoConfig, DistilBertConfig

In [10]:
config = DistilBertConfig()
config.num_labels = 6
print(config)

DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "transformers_version": "4.10.3",
  "vocab_size": 30522
}



In [None]:


training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=1,  # batch size per device during training
    per_device_eval_batch_size=1,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=100,
)




model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", config=config)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_enc_dataset,         # training dataset
    eval_dataset=test_enc_dataset             # evaluation dataset
)

trainer.train()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_clas

| Step | Training Loss |
|------|---------------|
| 100  | 1.7696 |
| 200  | 1.5795 |
| 300  | 1.0601 |
| 400  | 0.7723 |
| 500  | 0.6073 |
| 600  | 0.5429 |
| 700  | 0.1404 |
| 800  | 0.6882 |
| 900  | 0.7728 |
| 1000 | 0.404  |


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json
Model weights saved in ./results/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-2000
Configuration saved in ./results/checkpoint-2000/config.json
Model weights saved in ./results/checkpoint-2000/pytorch_model.bin
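
The `warmup_steps=500` setting in the training arguments controls the learning-rate schedule. Assuming the `Trainer` defaults (a linear schedule and a peak learning rate of `5e-5`, neither of which the notebook overrides), the rate ramps up linearly for 500 steps and then decays linearly to zero. The sketch below reproduces that shape in plain Python; `linear_warmup_lr` is a hypothetical helper, and the `total_steps` value of 3308 is an assumption derived from one epoch over 3308 samples with a batch size of 1 on a single device.

```python
def linear_warmup_lr(
    step: int,
    peak_lr: float = 5e-5,
    warmup_steps: int = 500,
    total_steps: int = 3308,
) -> float:
    """Learning rate at a given optimizer step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr over the warmup period.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Inspect the schedule at the start, mid-warmup, the peak, and the final step.
for step in (0, 250, 500, 3308):
    print(step, linear_warmup_lr(step))
```

Under these assumptions, the first 500 logged steps in the loss table fall inside the warmup phase, where the learning rate is still below its peak.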


In [None]:
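# Tokenize one training sample; truncate so long posts fit the 512-token limit
encoded_input = tokenizer(train_dataset['data'][0], truncation=True, return_tensors="pt")
# Move the inputs to whichever device the model is on (GPU or CPU)
encoded_input = encoded_input.to(model.device)

model.eval()
with torch.no_grad():
    classification_logits = model(**encoded_input)[0]
# Convert logits to class probabilities
probabilities = torch.softmax(classification_logits, dim=1).tolist()[0]

print(f"Predicted category: {probabilities.index(max(probabilities))}")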
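
The cell above prints only a numeric class index. A minimal sketch of turning raw logits into a readable label follows; the logits are made up, `predict_label` is a hypothetical helper, and the label order assumes that `fetch_20newsgroups` numbers the selected categories alphabetically, which can be checked against `train_dataset.target_names`.

```python
import math
from typing import List, Tuple

# Assumed label order: the six selected newsgroups sorted alphabetically.
# Verify against train_dataset.target_names before relying on it.
LABEL_NAMES = sorted([
    "comp.windows.x", "rec.autos", "sci.electronics",
    "misc.forsale", "talk.politics.misc", "alt.atheism",
])


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits: List[float]) -> Tuple[str, float]:
    """Return the most likely label name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_NAMES[best], probs[best]


# Made-up logits for one document, for illustration only.
name, prob = predict_label([-1.2, 3.4, 0.1, -0.5, 0.8, -2.0])
print(f"Predicted category: {name} (p={prob:.2f})")
```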

## Training Code



In [27]:
from sagemaker.huggingface.estimator import HuggingFace
from sagemaker import get_execution_role

role=get_execution_role()

In [28]:
hyperparameters = {}
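
In Script Mode, SageMaker passes each entry in `hyperparameters` to the entry point as a `--name value` command-line argument and exposes locations such as the model output directory and the input channels through `SM_*` environment variables. The notebook does not show `train.py`, so the following is only a sketch of how such a script might read its inputs; the hyperparameter names (`epochs`, `per-device-train-batch-size`, `model-name`) and the `train`/`test` channel names are assumptions, not taken from the original sources.

```python
import argparse
import os
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # Hypothetical hyperparameters; SageMaker would pass them as --name value.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--per-device-train-batch-size", type=int, default=1)
    parser.add_argument("--model-name", type=str, default="distilbert-base-uncased")
    # Locations provided by the training container via environment variables.
    # The fallbacks are local paths for running the script outside SageMaker.
    parser.add_argument("--model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--train-dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "."))
    parser.add_argument("--test-dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TEST", "."))
    # parse_known_args tolerates extra arguments the script does not declare.
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the command line SageMaker would build from {"epochs": 2}.
args = parse_args(["--epochs", "2"])
print(args.epochs, args.model_name, args.train_dir)
```

The `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` variables exist only when `fit` is called with matching input channels, for example a dictionary with `train` and `test` keys that point at the uploaded S3 files.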

In [32]:
estimator = HuggingFace(
    py_version="py36",
    entry_point="train.py",
    source_dir="1_sources",
    pytorch_version="1.7.1",
    transformers_version="4.6.1",
    hyperparameters=hyperparameters,
    instance_type="ml.p2.xlarge",
    instance_count=1,
    role=role
)


estimator.fit()

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p2.xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.