In [None]:
pip install transformers

The pipeline function wraps the whole inference process: it tokenizes the input text, runs the model, and post-processes the outputs for the user.

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("Learning about AI is very fun!")
print(result)

[{'label': 'POSITIVE', 'score': 0.999871015548706}]


In [None]:
generator = pipeline("text-generation", model="distilgpt2")

result = generator(
    "In this course, we will teach you how to",
    truncation=True,
    max_length=30,
    num_return_sequences=2,
)
print(result)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to develop the world of computer and communications in high school, if necessary. We will demonstrate the technologies used'}, {'generated_text': 'In this course, we will teach you how to use "Koboda\u200f," our hand-wringing magic wand (if you know'}]


In [None]:
classifier = pipeline("zero-shot-classification")

result = classifier(
    "This is a course about Python list comprehension",
    candidate_labels=["education", "politics", "buisness"]
)
print(result)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about Python list comprehension', 'labels': ['education', 'buisness', 'politics'], 'scores': [0.7770820260047913, 0.21406987309455872, 0.008848120458424091]}


The full list of tasks available to the pipeline is documented at https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.pipeline.task

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
result = classifier("I've been waiting for a HuggingFace course my whole life.")
print(result)


[{'label': 'POSITIVE', 'score': 0.9598048329353333}]


This is the tokenization that happens under the hood of the pipeline function.

Notice that it outputs an attention mask as well. The mask holds one binary value for each entry in input_ids: 1 means the model should attend to that token, and 0 means the token should be ignored. Its main use is to hide padding tokens when sequences of different lengths are batched together.

In [None]:
sequence = "Using a Transformer network is simple"
result = tokenizer(sequence)
print(result)

{'input_ids': [101, 2478, 1037, 10938, 2121, 2897, 2003, 3722, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


The tokenizer has several methods for tokenizing a sequence. With tokenize, we can get the tokens in text form.

In [None]:
tokens = tokenizer.tokenize(sequence)
print(tokens)

['using', 'a', 'transform', '##er', 'network', 'is', 'simple']


Here we get only the token ids.

Notice any differences between the input_ids above and these ids? The only difference is that the input_ids above are bookended by two special tokens that the tokenizer adds automatically: [CLS] (101) at the start and [SEP] (102) at the end.

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[2478, 1037, 10938, 2121, 2897, 2003, 3722]
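
To make the link between tokens and ids concrete, here is a minimal sketch of greedy longest-match splitting in the WordPiece style, written in plain Python. The tiny TOY_VOCAB and the helper toy_wordpiece are hypothetical and exist only for illustration. The ids are copied from the outputs above; the real tokenizer does much more (text normalization, punctuation splitting, a vocabulary with tens of thousands of entries).

```python
# Toy vocabulary: ids copied from the notebook outputs above (illustration only).
TOY_VOCAB = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "using": 2478, "a": 1037, "transform": 10938, "##er": 2121,
    "network": 2897, "is": 2003, "simple": 3722,
}

def toy_wordpiece(text, vocab):
    """Lowercase, split on whitespace, then greedily match the longest known piece."""
    tokens = []
    for word in text.lower().split():
        start, pieces = 0, []
        while start < len(word):
            end, piece = len(word), None
            while end > start:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate  # continuation pieces carry a ## prefix
                if candidate in vocab:
                    piece = candidate
                    break
                end -= 1
            if piece is None:  # nothing fits: the whole word becomes unknown
                pieces = ["[UNK]"]
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    return tokens

toy_tokens = toy_wordpiece("Using a Transformer network is simple", TOY_VOCAB)
toy_ids = [TOY_VOCAB[t] for t in toy_tokens]
# Add the special tokens that calling the tokenizer directly inserts for us.
toy_input_ids = [TOY_VOCAB["[CLS]"]] + toy_ids + [TOY_VOCAB["[SEP]"]]
print(toy_tokens)
print(toy_input_ids)
```

The word "transformer" is not in the toy vocabulary, so it falls back to the longest known prefix ("transform") followed by the continuation piece "##er", which mirrors the split seen in the tokenizer output.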


We can also decode the sequence back out to text from the token ids.

Notice that the decoded string is in lowercase. This is because the model is uncased: its tokenizer lowercases the text before splitting it into tokens, so the original casing cannot be recovered from the ids.

In [None]:
decoded_string = tokenizer.decode(ids)
print(decoded_string)

using a transformer network is simple
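
As a rough illustration of what decoding involves once the ids have been mapped back to tokens, the sketch below joins WordPiece tokens into a string. The helper toy_decode is hypothetical and not part of the library; it handles only the "##" continuation rule and ignores special tokens and punctuation cleanup.

```python
def toy_decode(tokens):
    """Join WordPiece tokens: pieces starting with ## attach to the previous word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the ## marker and glue onto the last word
        else:
            words.append(token)
    return " ".join(words)

toy_tokens = ["using", "a", "transform", "##er", "network", "is", "simple"]
toy_decoded = toy_decode(toy_tokens)
print(toy_decoded)
```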


We can input multiple sequences through the pipeline and get results for each sequence.

In [None]:
X_train = ["I've been waiting for a HuggingFace course my whole life.", "Python is awful!"]

result = classifier(X_train)
print(result)

[{'label': 'POSITIVE', 'score': 0.9598048329353333}, {'label': 'NEGATIVE', 'score': 0.9996261596679688}]


We can also group the sequences into a batch so that they are processed together in one pass.

One of our sequences is noticeably longer than the other. This is a problem when building a tensor of token ids, because a tensor must be rectangular (every row the same length). That is why padding is set to True. As the output shows, zeros (the pad token id) are appended to the second sequence so that both lengths match, and those padded positions are set to 0 in the attention mask so that the model ignores them.

In [None]:
batch = tokenizer(X_train, padding=True, truncation=True, max_length=512, return_tensors="pt")
print(batch)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101, 18750,  2003,  9643,   999,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
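
The padding and masking seen in this output can be reproduced by hand. The following is a minimal sketch in plain Python lists rather than tensors; the helper pad_batch is hypothetical, and its truncation is simplified (it just slices, whereas the real tokenizer keeps the closing special token when it truncates).

```python
PAD_ID = 0  # pad token id, as seen in the batch output above

def pad_batch(sequences, pad_id=PAD_ID, max_length=512):
    """Truncate to max_length, then right-pad every sequence to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks a padding position.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids copied from the batch output above, with the padding removed.
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
         2607, 2026, 2878, 2166, 1012, 102]
second = [101, 18750, 2003, 9643, 999, 102]

toy_batch = pad_batch([first, second])
print(toy_batch["input_ids"][1])
print(toy_batch["attention_mask"][1])
```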


Here is a look at model inference and some of the post-processing that needs to be done.

We unpack the batch and pass it directly to the model (not the pipeline). Logits are the unnormalized predictions, so we apply a softmax function to turn them into probabilities.

The probabilities match the pipeline scores above (the tensor just prints them in scientific notation). The labels then indicate whether each sequence was positive (1) or negative (0).

In [None]:
import torch
import torch.nn.functional as F

# no_grad disables gradient calculations because we don't plan on adjusting model parameters
with torch.no_grad():
    outputs = model(**batch)
    print(outputs)
    predictions = F.softmax(outputs.logits, dim=1)
    print(predictions)
    labels = torch.argmax(predictions, dim=1)
    print(labels)

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [ 4.3348, -3.5565]]), hidden_states=None, attentions=None)
tensor([[4.0195e-02, 9.5980e-01],
        [9.9963e-01, 3.7386e-04]])
tensor([1, 0])
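
To see what the softmax and argmax steps compute, the sketch below redoes them in plain Python on the logits printed above. The softmax helper is written out for illustration, and the ID2LABEL mapping is an assumption based on the description above (0 is negative, 1 is positive) rather than something read from the model config.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the model output above.
toy_logits = [[-1.5607, 1.6123], [4.3348, -3.5565]]
# Assumed label mapping, following the text: 0 is negative, 1 is positive.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

toy_probs = [softmax(row) for row in toy_logits]
toy_labels = [row.index(max(row)) for row in toy_probs]  # argmax per row
for probs, label in zip(toy_probs, toy_labels):
    print(ID2LABEL[label], round(probs[label], 4))
```

Each row of probabilities sums to 1, and the largest entry in each row is the score that the pipeline reports for that sequence.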


You can save a pretrained model and tokenizer to a local directory for later use, which is especially handy offline (check the files tab on the left to see the saved files). This is also very helpful for storing a fine-tuned model.

In [None]:
save_directory = "saved"
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

tok = AutoTokenizer.from_pretrained(save_directory)
mod = AutoModelForSequenceClassification.from_pretrained(save_directory)

On Hugging Face, there are thousands of community models that we can try and experiment with.

All we need to do is find the model we want on the Hugging Face Model Hub and copy its name from the top of the model page into our code. Some models also provide code examples below their model description that we can copy.

https://huggingface.co/facebook/bart-large-cnn

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

[{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]
