# Huggingface Sagemaker-sdk - Getting Started Demo
## Please select PyTorch 1.13 Python 3.9 as kernel 
### Named Entity Recognition with `wnut_17` dataset


1. [Introduction](#Introduction)  
2. [Development Environment and Permissions](#Development-Environment-and-Permissions)
    1. [Installation](#Installation)  
    2. [Development environment](#Development-environment)  
    3. [Permissions](#Permissions)
3. [Preprocessing](#Preprocessing)   
    1. [Tokenization](#Tokenization)  
    2. [Uploading data to sagemaker_session_bucket](#Uploading-data-to-sagemaker_session_bucket)  
4. [Fine-tuning & starting Sagemaker Training Job](#Fine-tuning-\&-starting-Sagemaker-Training-Job)  
    1. [Creating an Estimator and start a training job](#Creating-an-Estimator-and-start-a-training-job)  
    2. [Estimator Parameters](#Estimator-Parameters)   
    3. [Download fine-tuned model from s3](#Download-fine-tuned-model-from-s3)
    4. [Attach to old training job to an estimator](#Attach-to-old-training-job-to-an-estimator)  
5. [_Coming soon_: Push model to the Hugging Face hub](#Push-model-to-the-Hugging-Face-hub)

# Introduction

Welcome to our NER example. In this demo, we will use Hugging Face's `transformers` and `datasets` libraries together with a custom Amazon sagemaker-sdk extension to fine-tune a pre-trained transformer for a Named Entity Recognition (NER) use case. In particular, the pre-trained model will be fine-tuned on the `wnut_17` dataset. To get started, we need to set up the environment with a few prerequisite steps for permissions, configuration, and so on.

[LINK TO WNUT_17 Dataset](https://huggingface.co/datasets/wnut_17)

_**NOTE: You can run this demo in Sagemaker Studio, your local machine or Sagemaker Notebook Instances**_

# Development Environment and Permissions 

## Installation

_*Note:* we only install the required libraries from Hugging Face and AWS. You also need PyTorch or TensorFlow if you don't have one of them installed already._

In [None]:
!pip install "sagemaker>=2.140.0" "transformers==4.26.1" "datasets[s3]==2.10.1" "evaluate" --upgrade

In [None]:
!pip install jupyter_contrib_nbextensions IProgress

## Development environment 

In [None]:
import sagemaker.huggingface

## Permissions

_If you are going to use SageMaker in a local environment, you need access to an IAM role with the required SageMaker permissions. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)._

In [None]:
import sagemaker
import boto3


sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    # outside SageMaker (e.g. on a local machine), look the role up by name
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

# Preprocessing

All the preprocessing will be handled by the scripts from Huggingface. You can either:
- use a dataset from the dataset hub and / or create and upload your own dataset to the hub (https://huggingface.co/docs/datasets/about_dataset_load)
- keep a dataset on disk or S3 for the script to pull in (it must be in .csv or .json format)
- adapt the script from Huggingface to deal with your dataset

## !! The preprocessing below is just to demonstrate how it would be done, without pulling in the dataset from the hub !!

We are using the `datasets` library to download and preprocess the `wnut_17` dataset. After preprocessing, the dataset will be uploaded to our `sagemaker_session_bucket` to be used within our training job. The dataset consists of sentences, where every token should be classified as either no entity or an entity. 


## Tokenization 

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

# tokenizer used in preprocessing
tokenizer_name = 'distilbert-base-uncased'

# dataset used
dataset_name = 'wnut_17'

# s3 key prefix for the data
s3_prefix = 'samples/datasets/wnut_17'

In [None]:
dataset = load_dataset(dataset_name)

# download tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
tokenizer

Datasets from Huggingface are nice in several ways:
- they let you try a model on a task very easily
- they give you a good intuition for how a dataset should be structured before going too deep
- they let you quickly judge how hard a certain task could be for a model
- you can explore them very quickly

In [None]:
print(dataset["train"][0])

In [None]:
label_list = dataset["train"].features["ner_tags"].feature.names
print(label_list)

#### Quick explanation of the tags:
Here are explanations for each of the Named Entity Recognition (NER) categories in the WNUT-17 dataset:
- O: This tag represents tokens that are not part of any named entity. In other words, it’s used for words or phrases in the text that do not fall under any of the predefined entity categories.
- B-corporation: This tag is used to indicate the beginning of a named entity that refers to a corporation. Named entities of this type include the names of business organizations or companies.
- I-corporation: This tag is used for tokens within a named entity of the “corporation” type. It indicates that the token is part of the ongoing named entity started by a “B-corporation” tag.
- B-creative-work: This tag marks the start of a named entity representing a creative work. Creative works can include titles of books, movies, songs, paintings, and other artistic or creative content.
- I-creative-work: Similar to the “I-corporation” tag, this tag is used for tokens within a named entity of the “creative work” type. It denotes that the token belongs to an ongoing creative work entity.
- B-group: This tag signals the beginning of a named entity that represents a group. This category could include names of clubs, teams, organizations, and other collections of people or entities.
- I-group: Like the previous “I” tags, this tag is used for tokens within a named entity of the “group” type. It indicates that the token is part of a group entity.
- B-location: This tag is used to mark the start of a named entity referring to a location. Locations could include names of cities, countries, landmarks, and other geographical places.
- I-location: Tokens within a named entity of the “location” type are tagged with this “I-location” tag. It signifies that the token is a part of a location entity that started with a “B-location” tag.
- B-person: This tag marks the beginning of a named entity that is a person’s name. It’s applied to individual names of people, whether they are public figures, celebrities, or regular individuals.
- I-person: Like other “I” tags, this tag is used for tokens within a named entity of the “person” type. It denotes that the token belongs to an ongoing person entity.
- B-product: This tag indicates the beginning of a named entity representing a product. Products could include names of items, gadgets, commodities, and other things that are manufactured or sold.
- I-product: Tokens within a named entity of the “product” type are labeled with this “I-product” tag. It signifies that the token is part of an ongoing product entity.


These tags collectively allow for the annotation and classification of various types of named entities in text, which is useful for tasks like information extraction, sentiment analysis, and more.
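
To make the B-/I-/O scheme concrete, here is a minimal sketch that groups tagged tokens back into entity spans. The helper name `bio_to_spans` and the example sentence are assumptions for illustration; they are not part of the dataset or of any library.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the entity opened by the matching B- tag.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical sentence, not taken from wnut_17.
demo_tokens = ["Empire", "State", "Building", "is", "in", "New", "York"]
demo_tags = ["B-location", "I-location", "I-location", "O", "O", "B-location", "I-location"]
print(bio_to_spans(demo_tokens, demo_tags))
# [('location', 'Empire State Building'), ('location', 'New York')]
```

The `B-` prefix is what lets two adjacent entities of the same type stay separate, which a plain per-token type label could not express.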

Let's look at what a tokenized example looks like:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenizer

In [None]:
example = dataset["train"][0]
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(tokens)

Apply the tokenization to the whole dataset:

In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets `map` function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once:

In [None]:
tokenized_wnut = dataset.map(tokenize_and_align_labels, batched=True)
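
To see what the label alignment does for a single example, the sketch below replays the same rule on hand-written word ids, without needing a tokenizer. The `word_ids` and `word_labels` values are hypothetical; the label ids assume the tag order listed earlier (0 is `O`, 11 is `B-product`).

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first sub-token and mask everything else."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # special tokens such as [CLS] and [SEP] belong to no word
            aligned.append(ignore_index)
        elif word_idx != previous_word_idx:
            # first sub-token of a word carries the word's label
            aligned.append(word_labels[word_idx])
        else:
            # remaining sub-tokens of the same word are masked
            aligned.append(ignore_index)
        previous_word_idx = word_idx
    return aligned


# Hypothetical: [CLS], one token for word 0, three pieces for word 1, [SEP]
demo_word_ids = [None, 0, 1, 1, 1, None]
demo_word_labels = [0, 11]
print(align_labels(demo_word_ids, demo_word_labels))
# [-100, 0, 11, -100, -100, -100]
```

Positions labelled `-100` are skipped by the PyTorch cross-entropy loss by default, so only one prediction per original word contributes to training.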

In [None]:
!pip install evaluate seqeval

In [None]:
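import evaluate
import numpy as np

# seqeval computes entity-level precision/recall/F1 (requires `pip install seqeval`)
seqeval = evaluate.load("seqeval")

# human-readable tags of the first training example
labels = [label_list[i] for i in example["ner_tags"]]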

## You can adapt this to run your own very custom evaluation loop
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
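
As a rough illustration of the filtering step inside `compute_metrics`, the following sketch runs the same argmax-and-mask logic on toy values in plain Python. The scores, the shortened label list, and all `toy_` names are made up for this example.

```python
toy_label_list = ["O", "B-person", "I-person"]  # hypothetical, shortened label list

# One sequence with four token positions; each row holds one score per label.
toy_logits = [[[2.0, 0.1, 0.3],
               [0.2, 3.1, 0.4],
               [0.1, 0.2, 1.5],
               [1.0, 0.9, 0.8]]]
# -100 marks special tokens and non-first sub-tokens, as in the alignment step.
toy_labels = [[-100, 1, 2, -100]]


def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


toy_predictions = [[argmax(row) for row in seq] for seq in toy_logits]

# Keep only positions whose gold label is not -100, then map ids to tag names.
toy_true_predictions = [
    [toy_label_list[p] for p, l in zip(pred, lab) if l != -100]
    for pred, lab in zip(toy_predictions, toy_labels)
]
toy_true_labels = [
    [toy_label_list[l] for p, l in zip(pred, lab) if l != -100]
    for pred, lab in zip(toy_predictions, toy_labels)
]
print(toy_true_predictions)  # [['B-person', 'I-person']]
print(toy_true_labels)       # [['B-person', 'I-person']]
```

The two resulting lists of tag strings are the shape of input that the `seqeval` metric expects for `predictions` and `references`.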

In [None]:
train_dataset = dataset["train"]
# train_dataset.set_format("torch")
validation_dataset = dataset["validation"]
# validation_dataset.set_format("torch")
test_dataset = dataset["test"]
# test_dataset.set_format("torch")

In [None]:
train_dataset

## Uploading data to `sagemaker_session_bucket`

After processing the dataset, we use the `datasets` filesystem [integration](https://huggingface.co/docs/datasets/filesystems.html) to upload it to S3.

In [None]:
# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
train_dataset.save_to_disk(training_input_path)

# save validation_dataset to s3
validation_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/validation'
validation_dataset.save_to_disk(validation_input_path)


# save test_dataset to s3
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
test_dataset.save_to_disk(test_input_path)


# Fine-tuning & starting Sagemaker Training Job

In order to create a SageMaker training job we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In an Estimator we define which fine-tuning script should be used as `entry_point`, which `instance_type` should be used, which `hyperparameters` are passed in, and so on.



```python
huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./scripts',
                            base_job_name='huggingface-sdk-extension',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            transformers_version='4.4',
                            pytorch_version='1.6',
                            py_version='py36',
                            role=role,
                            hyperparameters = {'epochs': 1,
                                               'train_batch_size': 32,
                                               'model_name':'distilbert-base-uncased'
                                                })
```

When we create a SageMaker training job, SageMaker takes care of starting and managing all the required EC2 instances for us with the `huggingface` container, uploads the provided fine-tuning script `train.py` and downloads the data from our `sagemaker_session_bucket` into the container at `/opt/ml/input/data`. Then, it starts the training job by running:

```bash
/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32
```

The `hyperparameters` you define in the `HuggingFace` estimator are passed in as named arguments. 

SageMaker provides useful properties about the training environment through various environment variables, including the following:

* `SM_MODEL_DIR`: A string that represents the path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting.

* `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.

* `SM_CHANNEL_XXXX:` A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the HuggingFace estimator’s fit call, named `train` and `test`, the environment variables `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` are set.
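
A training script could pick up both the hyperparameters and these environment variables roughly as sketched below. This is an assumed layout for a `train.py`, not the script used later in this notebook; the argument names mirror the example hyperparameters above, and the local fallback paths are hypothetical.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters from the estimator arrive as command-line arguments.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")
    # SageMaker-provided values, with hypothetical fallbacks for runs outside a job.
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--n_gpus", type=int,
                        default=int(os.environ.get("SM_NUM_GPUS", "0")))
    parser.add_argument("--training_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "./data/train"))
    # parse_known_args tolerates extra arguments the script does not declare.
    args, _ = parser.parse_known_args(argv)
    return args


demo_args = parse_args(
    ["--epochs", "1", "--model_name", "distilbert-base-uncased", "--train_batch_size", "32"]
)
print(demo_args.epochs, demo_args.train_batch_size, demo_args.model_name)
# 1 32 distilbert-base-uncased
```

Reading the environment variables as argument defaults keeps the script runnable on a laptop, where none of the `SM_*` variables are set.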


To run your training job locally you can define `instance_type='local'` or `instance_type='local_gpu'` for GPU usage. _Note: this does not work within SageMaker Studio._


In [None]:
from transformers import TrainingArguments
from sagemaker.huggingface import HuggingFace

In [None]:
training_args = TrainingArguments("test-trainer",
                                  per_device_train_batch_size=16,
                                  per_device_eval_batch_size=32,
                                  learning_rate=2e-5,
                                  weight_decay=0.01
                                 )
print(type(training_args.to_dict()))

## Creating an Estimator and start a training job

In [None]:
from sagemaker.huggingface import HuggingFace

hyperparameters = {
    'model_name_or_path': 'distilbert-base-uncased',
    'output_dir': '/opt/ml/model',
    'dataset_name': 'wnut_17',
    'do_train': 'true',
    'do_eval': 'true',
    'num_train_epochs': 1
}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.26.0'}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_ner.py',
    source_dir='./examples/pytorch/token-classification',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    # NOTE: `training_arguments` does not appear among the documented HuggingFace
    # estimator parameters; settings for the script are normally passed via `hyperparameters`
    training_arguments=training_args,
    git_config=git_config,
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    hyperparameters=hyperparameters,
)

huggingface_estimator.fit()

In [None]:

# deploy model to SageMaker Inference
predictor = huggingface_estimator.deploy(
	initial_instance_count=1, # number of instances
	instance_type='ml.m5.xlarge' # ec2 instance type
)


In [None]:
predictor.predict({
	"inputs": "My name is Sarah Jessica Parker but you can call me Jessica and my salary is 2000k",
})

In [None]:
predictor.delete_model()
predictor.delete_endpoint()

# Extras

### Estimator Parameters

In [None]:
# container image used for training job
print(f"container image used for training job: \n{huggingface_estimator.image_uri}\n")

# s3 uri where the trained model is located
print(f"s3 uri where the trained model is located: \n{huggingface_estimator.model_data}\n")

# latest training job name for this estimator
print(f"latest training job name for this estimator: \n{huggingface_estimator.latest_training_job.name}\n")



In [None]:
# access the logs of the training job
huggingface_estimator.sagemaker_session.logs_for_job(huggingface_estimator.latest_training_job.name)
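
To download the fine-tuned model from S3, one possible approach is to split the `model_data` URI into bucket and key and hand those to `boto3`. The sketch below uses a hypothetical URI and a hypothetical helper name `split_s3_uri`; the download call is left commented out because it needs AWS credentials and a real object.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI in the shape of huggingface_estimator.model_data
demo_bucket, demo_key = split_s3_uri("s3://my-bucket/my-training-job/output/model.tar.gz")
print(demo_bucket, demo_key)
# my-bucket my-training-job/output/model.tar.gz

# With credentials configured, the archive could then be fetched with boto3:
# import boto3
# boto3.client("s3").download_file(demo_bucket, demo_key, "model.tar.gz")
```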

### Attach to old training job to an estimator 

In SageMaker you can attach an old training job to an estimator to continue training, retrieve results, and so on.

In [None]:
from sagemaker.estimator import Estimator

# job which is going to be attached to the estimator
old_training_job_name=''

In [None]:
# attach old training job
huggingface_estimator_loaded = Estimator.attach(old_training_job_name)


# get model output s3 from training job
huggingface_estimator_loaded.model_data