# Hugging Face Language Models with ART

In this notebook we go over how to use Hugging Face language models with ART. This feature is still under development, so not all ART tools are supported yet; further tools and development are planned. As of ART 1.17 we support:
* Tokenization
* Inference
* Text Generation

If you have a use case that is not supported, or you find a bug in this new feature, please raise an issue on the ART repository.

Let's look at how we can use ART to run Hugging Face language models!

In [1]:
import numpy as np
import torch

from art.estimators.language_modeling import HuggingFaceLanguageModel

## Tokenization

Using the ART wrapper for the Hugging Face language model, we can easily tokenize text. The wrapper accepts a string (or a list of strings) and outputs the tokens in the same way as a Hugging Face tokenizer.

In [2]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

language_model = HuggingFaceLanguageModel(
    model=model,
    tokenizer=tokenizer,
)

In [3]:
# We can tokenize a string like a normal Hugging Face tokenizer

output = language_model.tokenize('this is a sample sentence')
output

{'input_ids': [101, 2023, 2003, 1037, 7099, 6251, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [4]:
# We can also tokenize multiple strings and pass any additional keyword arguments

output = language_model.tokenize(['this is a sample sentence', 'another string'], padding=True, truncation=True)
output

{'input_ids': [[101, 2023, 2003, 1037, 7099, 6251, 102],
  [101, 2178, 5164, 102, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0]]}
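
The padded batch above can be reproduced by hand to see what `padding=True` does. The following is a minimal sketch in plain Python, assuming right-side padding with pad ID 0 (as seen in the output above); `pad_batch` is a hypothetical helper for illustration and is not part of ART or Hugging Face.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad token ID sequences to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep their IDs; padding positions get pad_id
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the tokenizer output above
batch = pad_batch([
    [101, 2023, 2003, 1037, 7099, 6251, 102],
    [101, 2178, 5164, 102],
])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask is what lets the model tell the three trailing zeros of the second sentence apart from real tokens.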

In [5]:
# We can encode a string into token IDs and decode them back into a string

token_ids = language_model.encode('this is a sentence')
token_ids

[101, 2023, 2003, 1037, 6251, 102]

In [6]:
strings = language_model.decode(token_ids, skip_special_tokens=True)
strings

'this is a sentence'

## Model Inference

We can use the ART wrapper to perform inference with the Hugging Face language model. Input strings are tokenized automatically. Additional keyword arguments can be provided, and tokenized inputs can also be passed directly. The output is a dictionary that contains the same fields as the output of the underlying language model.

In [7]:
# Automatic tokenization

output = language_model.predict('this is a sentence')
output.keys()

dict_keys(['last_hidden_state', 'pooler_output'])

In [8]:
# Manual tokenization

tokens = language_model.tokenize('this is a sentence')
output = language_model.predict(**tokens)
output.keys()

dict_keys(['last_hidden_state', 'pooler_output'])

In [13]:
# Additional keyword arguments

output = language_model.predict('this is a sentence', attention_mask=[1, 1, 1, 1, 0, 0])
output.keys()

dict_keys(['last_hidden_state', 'pooler_output'])

## Text Generation

We can use the ART wrapper to generate text with generative models through the Hugging Face API. Here we use T5, an encoder-decoder model.

In [26]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

language_model = HuggingFaceLanguageModel(
    model=model,
    tokenizer=tokenizer,
)

In [24]:
# We can generate for a single sentence

output = language_model.generate('translate English to French: This is a nice house.')
output

"C'est une belle maison."

In [25]:
# We can generate for multiple sentences

output = language_model.generate(['translate English to French: This is a nice house.', 'translate English to German: This is a nice house.'])
output

["C'est une belle maison.", 'Das ist ein schönes Haus.']
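
To give some intuition for what happens during generation, here is a toy sketch of greedy decoding, one simple strategy in which the highest-scoring next token is picked at every step. Everything here is hypothetical: `TOY_SCORES` stands in for a real model, and `greedy_decode` is not an ART or Hugging Face function. Whether a given model actually decodes greedily depends on its generation configuration.

```python
from typing import Callable, Dict, List

# Hypothetical toy "model": maps the last token to scores for the next token
TOY_SCORES: Dict[str, Dict[str, float]] = {
    "<s>": {"c'est": 0.7, "une": 0.3},
    "c'est": {"une": 0.8, "maison": 0.2},
    "une": {"belle": 0.6, "maison": 0.4},
    "belle": {"maison": 0.9, "</s>": 0.1},
    "maison": {"</s>": 1.0},
}


def greedy_decode(
    next_scores: Callable[[str], Dict[str, float]],
    start: str = "<s>",
    end: str = "</s>",
    max_new_tokens: int = 10,
) -> List[str]:
    """Repeatedly append the highest-scoring next token until `end` or the length limit."""
    tokens = [start]
    for _ in range(max_new_tokens):
        scores = next_scores(tokens[-1])
        best = max(scores, key=scores.get)
        if best == end:
            break
        tokens.append(best)
    return tokens[1:]  # drop the start token


generated = greedy_decode(lambda tok: TOY_SCORES[tok])
print(" ".join(generated))
```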

## Downstream Tasks

We can use the ART wrapper to perform downstream tasks using various language models.

### BERT Embeddings

In [63]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

language_model = HuggingFaceLanguageModel(
    model=model,
    tokenizer=tokenizer,
)

sentences = [
    'this is a sample sentence',
    'here is another sentence',
    'and yet another sentence',
]

output = language_model.predict(sentences)
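# Use the hidden state of the [CLS] token (index 0) as the sentence embedding;
# index -1 may select a padding token for the shorter sentences in the batch
sentence_embeddings = output['last_hidden_state'][:, 0]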
sentence_embeddings.shape

(3, 768)
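
The cell above takes a single token's hidden state as the sentence embedding. A common alternative is to average the hidden states of all real (non-padding) tokens. The sketch below shows the idea in plain Python on made-up toy values; `masked_mean_pool` is a hypothetical helper, and in the notebook the same idea would be applied to `output['last_hidden_state']` together with the attention mask returned by `tokenize`.

```python
from typing import List


def masked_mean_pool(hidden: List[List[List[float]]], mask: List[List[int]]) -> List[List[float]]:
    """Average token vectors per sentence, ignoring positions where the mask is 0."""
    pooled = []
    for token_vectors, token_mask in zip(hidden, mask):
        # Keep only the vectors of real tokens
        kept = [vec for vec, m in zip(token_vectors, token_mask) if m == 1]
        dim = len(token_vectors[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(dim)])
    return pooled


# Toy data: 2 sentences, 3 token positions, hidden size 2 (made-up values)
toy_hidden = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],  # last position is padding
]
toy_mask = [[1, 1, 1], [1, 1, 0]]

embeddings = masked_mean_pool(toy_hidden, toy_mask)
print(embeddings)
```

Because the padding vector of the second sentence is masked out, it does not distort that sentence's average.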

### Sentiment Analysis

In [62]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')

language_model = HuggingFaceLanguageModel(
    model=model,
    tokenizer=tokenizer,
)

sentences = [
    'I like apples',
    'I like oranges',
]

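# Note: the classification head of 'bert-base-uncased' is randomly initialized
# (see the warning below), so these predictions only demonstrate the API
output = language_model.predict(sentences)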
np.argmax(output['logits'], axis=-1)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


array([1, 1])
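
The `argmax` above returns only class indices. To also see how strongly one class is preferred over another, the logits can be converted into probabilities with a softmax. This is a minimal sketch using made-up logits, not real model output; `softmax` here is a small helper written for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for two sentences and two classes (not real model output)
toy_logits = [[0.2, 0.9], [-0.1, 0.4]]

labels = []
for row in toy_logits:
    probs = softmax(row)
    # The predicted label is the index of the largest probability
    labels.append(max(range(len(probs)), key=probs.__getitem__))
    print([round(p, 3) for p in probs])

print(labels)
```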

### Masked Language Modeling

In [61]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

language_model = HuggingFaceLanguageModel(
    model=model,
    tokenizer=tokenizer,
)

output = language_model.predict('The capital of France is [MASK].')
# The tokens are [CLS] the capital of france is [MASK] . [SEP], so [MASK] sits at index -3
predicted_token_id = output['logits'][0, -3].argmax(axis=-1)
language_model.decode(predicted_token_id)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


'paris'
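
The hard-coded index `-3` only works for this exact sentence. A more general approach is to look up where the mask token occurs in the input IDs. The sketch below assumes that `[MASK]` has ID 103, as in the `bert-base-uncased` vocabulary; in practice the value can be read from the Hugging Face tokenizer's `mask_token_id` attribute. `find_mask_positions` is a hypothetical helper, and the non-special token IDs are illustrative.

```python
from typing import List

MASK_TOKEN_ID = 103  # assumed [MASK] ID for bert-base-uncased


def find_mask_positions(input_ids: List[int], mask_id: int = MASK_TOKEN_ID) -> List[int]:
    """Return every index at which the mask token occurs."""
    return [i for i, tok in enumerate(input_ids) if tok == mask_id]


# Illustrative IDs for '[CLS] the capital of france is [MASK] . [SEP]'
input_ids = [101, 1996, 3007, 1997, 2605, 2003, 103, 1012, 102]
positions = find_mask_positions(input_ids)
print(positions)
```

The returned position can then be used in place of `-3` when indexing into `output['logits']`, and the same lookup handles inputs with several masks.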