# Part 3 - Training (aka *fine-tuning*) a Transformer model

In this part we will finally train our very own Transformer model. We saw that the zero-shot model didn't produce great results, probably because it was trained to summarise news articles, not academic papers.

These lines of code are the typical setup for SageMaker and are required for training jobs: https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html

In [2]:
import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sess.default_bucket()

print(f"IAM role arn used for running training: {role}")
print(f"S3 bucket used for storing artifacts: {sess.default_bucket()}")

IAM role arn used for running training: arn:aws:iam::905847418383:role/service-role/AmazonSageMaker-ExecutionRole-20211005T160629
S3 bucket used for storing artifacts: sagemaker-us-east-1-905847418383


We are in the fortunate position of not having to write our own training script. Instead we will use a script from the transformers library on GitHub: https://github.com/huggingface/transformers/blob/v4.6.1/examples/pytorch/summarization/run_summarization.py

In [36]:
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.6.1'}

These are the hyperparameters for training, and they are among the most important levers we have once we are in the experimentation phase. Changing them can influence model performance, and finding the best model will involve some trial and error. Also see https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html for automated hyperparameter tuning.

In [38]:
# hyperparameters, which are passed into the training job
hyperparameters={'per_device_train_batch_size': 4,
                 'per_device_eval_batch_size': 4,
                 'model_name_or_path': 'sshleifer/distilbart-cnn-12-6',
                 'train_file': '/opt/ml/input/data/datasets/train.csv',
                 'validation_file': '/opt/ml/input/data/datasets/val.csv',
                 'do_train': True,
                 'do_eval': True,
                 'do_predict': False,
                 'predict_with_generate': True,
                 'output_dir': '/opt/ml/model',
                 'num_train_epochs': 3,
                 'learning_rate': 5e-5,
                 'seed': 7,
                 'fp16': True,
                 'val_max_target_length': 20,
                 'text_column': 'text',
                 'summary_column': 'summary',
                 }

# configuration for running training on smdistributed Data Parallel
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}
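
The trial and error mentioned above can be organised as a small grid of hyperparameter variants. The following is a minimal sketch only: the helper `expand_grid` and the values in `search_space` are made up for illustration and are not recommendations, and `base_hyperparameters` repeats just a few entries from the dictionary above to keep the example self-contained.

```python
from itertools import product

# A few entries copied from the hyperparameters above (subset, for illustration)
base_hyperparameters = {
    "per_device_train_batch_size": 4,
    "num_train_epochs": 3,
    "learning_rate": 5e-5,
}

# Hypothetical values to try; these are assumptions, not tuned recommendations
search_space = {
    "learning_rate": [3e-5, 5e-5],
    "num_train_epochs": [3, 5],
}


def expand_grid(base, space):
    """Return one hyperparameter dict per combination of values in `space`."""
    keys = list(space)
    variants = []
    for values in product(*(space[key] for key in keys)):
        variant = dict(base)               # copy so the base dict stays untouched
        variant.update(zip(keys, values))  # override the searched keys
        variants.append(variant)
    return variants


for variant in expand_grid(base_hyperparameters, search_space):
    print(variant["learning_rate"], variant["num_train_epochs"])
```

Each resulting dictionary could then be passed as `hyperparameters` to its own training job and the runs compared on the validation metrics.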

In [39]:
from sagemaker.huggingface import HuggingFace

# create the Estimator
huggingface_estimator = HuggingFace(
      entry_point='run_summarization.py',
      source_dir='./examples/pytorch/summarization',
      git_config=git_config,
      instance_type='ml.p3.16xlarge',
      instance_count=2,
      transformers_version='4.6',
      pytorch_version='1.7',
      py_version='py36',
      role=role,
      hyperparameters = hyperparameters,
      distribution = distribution
)
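
With data parallelism across two instances, each optimisation step processes more examples than `per_device_train_batch_size` alone suggests. A rough calculation, assuming 8 GPUs per ml.p3.16xlarge instance and no gradient accumulation (the helper name is made up for illustration):

```python
def effective_batch_size(per_device_batch_size, gpus_per_instance,
                         instance_count, grad_accum_steps=1):
    """Number of examples contributing to one optimizer step under data parallelism."""
    return per_device_batch_size * gpus_per_instance * instance_count * grad_accum_steps


# 4 examples per GPU x 8 GPUs per instance x 2 instances
print(effective_batch_size(4, gpus_per_instance=8, instance_count=2))  # 64
```

This matters when changing the instance type or count: the global batch size changes with it, and the learning rate may need to be revisited.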

This will kick off the training job, which should take ~45 minutes.
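
It may help to see how the job's inputs reach the training script. SageMaker makes each input channel passed to `fit` available under `/opt/ml/input/data/<channel>` inside the container, which is why `train_file` points to `/opt/ml/input/data/datasets/train.csv`, and it hands the hyperparameters to the entry point as command-line arguments. The two helpers below are illustrative stand-ins, not SageMaker APIs, and the exact argument formatting used by the training toolkit may differ in detail.

```python
def channel_path(channel_name, filename=""):
    """Local path inside the training container for an input channel (or a file in it)."""
    base = f"/opt/ml/input/data/{channel_name}"
    return f"{base}/{filename}" if filename else base


def to_cli_args(hyperparameters):
    """Simplified view of hyperparameters turned into `--key value` arguments."""
    args = []
    for key, value in hyperparameters.items():
        args += [f"--{key}", str(value)]
    return args


print(channel_path("datasets", "train.csv"))
print(to_cli_args({"num_train_epochs": 3, "fp16": True}))
```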

In [40]:
huggingface_estimator.fit({'datasets':f's3://{bucket}/summarization/data/'})

Cloning into '/tmp/tmpx7p5b1co'...
Note: switching to 'v4.6.1'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at fb27b276e Release: v4.6.1


2021-12-01 09:15:54 Starting - Starting the training job...
2021-12-01 09:16:04 Starting - Launching requested ML instancesProfilerReport-1638350145: InProgress
.........
2021-12-01 09:17:49 Starting - Preparing the instances for training.........
2021-12-01 09:19:12 Downloading - Downloading input data...
2021-12-01 09:19:51 Training - Downloading the training image............
2021-12-01 09:21:56 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-12-01 09:21:56,897 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-12-01 09:21:56,975 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-12-01 09:21:57

In [7]:
import pandas as pd
df_test = pd.read_csv('data/test.csv')
ref_summaries = list(df_test['summary'])

In [8]:
texts = list(df_test['text'])

In [9]:
texts[0]

"  Consider the blow-up $X$ of $\\mathbb{P}^3$ at 6 points in very general position and the 15 lines through the 6 points. We construct an infinite-order pseudo-automorphism $\\phi_X$ on $X$, induced by the complete linear system of a divisor of degree 13. The effective cone of $X$ has infinitely many extremal rays and hence, $X$ is not a Mori Dream Space. The threefold $X$ has a unique anticanonical section which is a Jacobian K3 Kummer surface $S$ of Picard number 17. The restriction of $\\phi_X$ on $S$ realizes one of Keum's 192 infinite-order automorphisms of Jacobian K3 Kummer surfaces. In general, we show the blow-up of $\\mathbb{P}^n$ ($n\\geq 3$) at $(n+3)$ very general points and certain 9 lines through them is not Mori Dream, with infinitely many extremal effective divisors. As an application, for $n\\geq 7$, the blow-up of $\\overline{M}_{0,n}$ at a very general point has infinitely many extremal effective divisors. "

In [15]:
data = {"inputs": texts[0], "max_length":10}
predictor.predict(data)

[{'generated_text': 'Works great but not easy to install. You have to adjust from the top and with the small space I had to stand on the toilet to get it to the right height then get my head between the wall/counter and the toilet .... Like I said it works great'}]

In [None]:
candidate_summaries = []

for i, text in enumerate(texts):
    if i % 50 == 0:
        print(i)
    data = {"inputs": text}  # use the current text, not always the first one
    candidate = predictor.predict(data)
    candidate_summaries.append(candidate[0]['generated_text'])

In [47]:
def calc_rouge_scores(candidates, references):
    result = metric.compute(predictions=candidates, references=references, use_stemmer=True)
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    return result

In [48]:
calc_rouge_scores(candidate_summaries, ref_summaries)

{'rouge1': 18.962463067641274,
 'rouge2': 10.93389662527964,
 'rougeL': 17.206193741062908,
 'rougeLsum': 17.241895755727892}

In [None]:
# Write one summary per line; the context manager closes the file automatically
with open("model-summaries.txt", "w") as file:
    for s in candidate_summaries:
        file.write(s + "\n")

In [49]:
predictor.delete_endpoint()

In [50]:
! mkdir inference_code

In [51]:
%%writefile inference_code/inference.py

# This is the script that will be used in the inference container
import json 
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def model_fn(model_dir):
    """
    Load the model and tokenizer for inference 
    """
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir).to(device).eval()
    
    model_dict = {'model':model, 'tokenizer':tokenizer}
    
    return model_dict 


def predict_fn(input_data, model_dict):
    """
    Make a prediction with the model
    """
    text = input_data.pop('inputs')
    parameters_list = input_data.pop('parameters_list', None)
    
    tokenizer = model_dict['tokenizer']
    model = model_dict['model']

    # Tokenize the input, truncating it to the model's maximum input length
    input_ids = tokenizer(text, truncation=True, padding='longest', return_tensors="pt").input_ids.to(device)
    
    # Generation parameters may or may not be passed
    if parameters_list:
        predictions = []
        for parameters in parameters_list:
            output = model.generate(input_ids, **parameters)
            predictions.append(tokenizer.batch_decode(output, skip_special_tokens=True))
    else:
        output = model.generate(input_ids)
        predictions = tokenizer.batch_decode(output, skip_special_tokens=True)
    
    return predictions


def input_fn(request_body, request_content_type):
    """
    Transform the input request to a dictionary
    """
    request = json.loads(request_body)

    return request

Writing inference_code/inference.py
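
The branching in `predict_fn` determines the shape of the response: with a `parameters_list` the endpoint returns one list of strings per parameter set (a nested list), and without it a flat list. The sketch below mirrors that control flow with a stand-in `fake_generate` function instead of a real model; both function names are made up for illustration.

```python
def run_generation(generate, text, parameters_list=None):
    """Mirror the branching of predict_fn using a stand-in `generate` callable."""
    if parameters_list:
        # One list of decoded strings per parameter set -> nested list
        return [generate(text, **parameters) for parameters in parameters_list]
    # No parameters -> flat list of decoded strings
    return generate(text)


def fake_generate(text, max_length=None, **_ignored):
    """Stand-in for tokenize + model.generate + decode: keep the first words."""
    words = text.split()
    return [" ".join(words[:max_length])]


print(run_generation(fake_generate, "the blow-up of projective space"))
print(run_generation(fake_generate, "the blow-up of projective space",
                     [{"min_length": 1, "max_length": 2}]))
```

This is why responses to requests that include `parameters_list` are indexed with `[0][0]`: the first index selects the parameter set, the second the generated text.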


In [1]:
from sagemaker.huggingface import HuggingFaceModel

In [47]:
huggingface_estimator.model_data

's3://sagemaker-us-east-1-905847418383/huggingface-pytorch-training-2021-12-01-09-15-45-087/output/model.tar.gz'

In [5]:
model_for_deployment = HuggingFaceModel(entry_point='inference.py',
                                        source_dir='inference_code',
                                        model_data='s3://sagemaker-us-east-1-905847418383/huggingface-pytorch-training-2021-12-01-09-15-45-087/output/model.tar.gz',
                                        role=role,
                                        pytorch_version='1.7.1',
                                        py_version='py36',
                                        transformers_version='4.6.1',
                                        )

In [6]:
predictor = model_for_deployment.deploy(initial_instance_count=1,
                                        instance_type='ml.g4dn.xlarge',
                                        serializer=sagemaker.serializers.JSONSerializer(),
                                        deserializer=sagemaker.deserializers.JSONDeserializer()
                                        )

-------!

In [10]:
data = {"inputs":texts[0], "parameters_list":[{"min_length": 5, "max_length": 20}]}
predictor.predict(data)

[['The blow-up of $\\mathbb{P}^3$ at 6 points']]

In [11]:
ref_summaries[0]

'Birational geometry of blow-ups of projective spaces along points and   lines'

In [12]:
candidate_summaries = []

for i, text in enumerate(texts):
    if i % 50 == 0:
        print(i)
    data = {"inputs":text, "parameters_list":[{"min_length": 5, "max_length": 20}]}
    candidate = predictor.predict(data)
    candidate_summaries.append(candidate[0][0])

0
50
100
150
200
250
300
350
400
450
500
550
600
650
700
750
800
850
900
950
1000
1050
1100
1150
1200
1250
1300
1350
1400
1450
1500
1550
1600
1650
1700
1750
1800
1850
1900
1950


In [13]:
# Write one summary per line; the context manager closes the file automatically
with open("model-summaries.txt", "w") as file:
    for s in candidate_summaries:
        file.write(s + "\n")

In [19]:
!pip install rouge_score

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting rouge_score
  Downloading rouge_score-0.0.4-py2.py3-none-any.whl (22 kB)
Collecting absl-py
  Downloading absl_py-1.0.0-py3-none-any.whl (126 kB)
     |████████████████████████████████| 126 kB 16.9 MB/s            
Installing collected packages: absl-py, rouge-score
Successfully installed absl-py-1.0.0 rouge-score-0.0.4


In [20]:
from datasets import load_metric
metric = load_metric("rouge")

In [21]:
def calc_rouge_scores(candidates, references):
    result = metric.compute(predictions=candidates, references=references, use_stemmer=True)
    result = {key: round(value.mid.fmeasure * 100, 1) for key, value in result.items()}
    return result

In [22]:
calc_rouge_scores(candidate_summaries, ref_summaries)

{'rouge1': 44.5, 'rouge2': 24.5, 'rougeL': 39.4, 'rougeLsum': 39.4}
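
To make the metric more concrete: ROUGE-1 measures unigram overlap between a candidate and a reference summary, and the F-measure combines the resulting precision and recall. The following is a simplified sketch assuming whitespace tokenization and no stemming. It is not the `rouge_score` implementation, so its numbers will not match the scores above exactly.

```python
from collections import Counter


def rouge1_f(candidate, reference):
    """Simplified ROUGE-1 F-measure based on lowercase whitespace tokens."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f("blow-ups of projective spaces",
               "birational geometry of blow-ups of projective spaces"))
```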

In [31]:
candidate_summaries_topk = []

for i, text in enumerate(texts):
    if i % 50 == 0:
        print(i)
    data = {"inputs":text, "parameters_list":[{"min_length": 5, "max_length": 20, "num_beams": 50, "top_p": 0.9, "do_sample": True}]}
    candidate = predictor.predict(data)
    candidate_summaries_topk.append(candidate[0][0])

0
50
100
150
200
250
300
350
400
450
500
550
600
650
700
750
800
850
900
950


In [32]:
calc_rouge_scores(candidate_summaries_topk, ref_summaries)

{'rouge1': 31.0, 'rouge2': 20.0, 'rougeL': 29.2, 'rougeLsum': 29.1}
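
The `top_p` parameter used in this run enables nucleus sampling: at each step, sampling is restricted to the smallest set of most probable tokens whose cumulative probability reaches `top_p`. The sketch below shows that filtering step on a toy distribution. It is a simplified illustration, not the transformers implementation, and the function name is made up.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable indices whose mass reaches top_p,
    then renormalise the kept probabilities so they sum to 1."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


# Toy next-token distribution over four tokens
filtered = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)
print(sorted(filtered))  # [0, 1, 2]
```

Because tokens are then sampled rather than chosen greedily, outputs vary between runs, which is one possible reason the ROUGE scores differ from the previous run.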

In [33]:
# Write one summary per line; the context manager closes the file automatically
with open("model-summaries-top_p.txt", "w") as file:
    for s in candidate_summaries_topk:
        file.write(s + "\n")

In [12]:
candidate_summaries[:5]

['Didn’t work for me. Second brand I’ve tried',
 'The graphics were not centered and placed more towards the handle than what the Amazon image',
 'Just right for us, towels, first aid kit, tools, personal emergency toiletries',
 '3.5 STARS,... When you purchase instead of rent... you really want',
 'Only 6 of the 12 lights were included. Only 6 of these lights are']

In [14]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting datasets
  Downloading datasets-1.15.1-py3-none-any.whl (290 kB)
     |████████████████████████████████| 290 kB 21.6 MB/s            
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.1.2-py3-none-any.whl (59 kB)
     |████████████████████████████████| 59 kB 10.3 MB/s            
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2021.11.0-py3-none-any.whl (132 kB)
     |████████████████████████████████| 132 kB 30.5 MB/s            
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
     |████████████████████████████████| 243 kB 47.6 MB/s            
Collecting pyparsing<3,>=2.0.2
  Using cached pyparsing-2.4.7-py2.py3-none-any.whl (67 kB)
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/PyYAML-6.0.dist-i

In [30]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [31]:
tokenizer = AutoTokenizer.from_pretrained('./model/')

In [33]:
model = AutoModelForSeq2SeqLM.from_pretrained('./model/').to('cpu').eval()

In [34]:
texts[0]

"  Consider the blow-up $X$ of $\\mathbb{P}^3$ at 6 points in very general position and the 15 lines through the 6 points. We construct an infinite-order pseudo-automorphism $\\phi_X$ on $X$, induced by the complete linear system of a divisor of degree 13. The effective cone of $X$ has infinitely many extremal rays and hence, $X$ is not a Mori Dream Space. The threefold $X$ has a unique anticanonical section which is a Jacobian K3 Kummer surface $S$ of Picard number 17. The restriction of $\\phi_X$ on $S$ realizes one of Keum's 192 infinite-order automorphisms of Jacobian K3 Kummer surfaces. In general, we show the blow-up of $\\mathbb{P}^n$ ($n\\geq 3$) at $(n+3)$ very general points and certain 9 lines through them is not Mori Dream, with infinitely many extremal effective divisors. As an application, for $n\\geq 7$, the blow-up of $\\overline{M}_{0,n}$ at a very general point has infinitely many extremal effective divisors. "

In [35]:
input_ids = tokenizer(texts[0], truncation=True, padding='longest', return_tensors="pt").input_ids.to('cpu')

In [36]:
output = model.generate(input_ids)
predictions = tokenizer.batch_decode(output, skip_special_tokens=True)

In [37]:
predictions

['The blow-up of $\\mathbb{P}^3$ at 6 points and certain 9 lines through   them is not Mori Dream, with infinitely many extremal effective divisors and applications to the blow-ups of $\\overline{M}_{0,n}$ at a very general point']

In [26]:
!mkdir model

In [27]:
!aws s3 cp s3://sagemaker-us-east-1-905847418383/huggingface-pytorch-training-2021-12-01-09-15-45-087/output/model.tar.gz model/

download: s3://sagemaker-us-east-1-905847418383/huggingface-pytorch-training-2021-12-01-09-15-45-087/output/model.tar.gz to model/model.tar.gz


In [28]:
!tar -xf model/model.tar.gz

^C


In [None]:
!tar -xvf model/model.tar.gz -C model/
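
The archive can also be unpacked from Python, which makes the target directory explicit so that `from_pretrained('./model/')` finds the files. This is a minimal sketch, assuming the model files sit at the top level of `model.tar.gz`; the helper name is made up for illustration.

```python
import tarfile
from pathlib import Path


def extract_model_archive(archive_path, target_dir):
    """Extract a .tar.gz archive into target_dir and return the extracted top-level names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(path=target)
    return sorted(p.name for p in target.iterdir())
```

For the files downloaded above this would be called as `extract_model_archive("model/model.tar.gz", "model")`.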