## Introduction to the Transformers Library (Colab Recommended) 🤗


To recap, HuggingFace is an AI company that has blown up in the last few years, especially in the realm of Natural Language Processing (NLP).

In particular, the Transformers library has revolutionized the way people work with large-scale transformer models. The goal of this challenge is to introduce you to these models for the first time and show how easy they can be to work with.

### Why you should love HuggingFace:

#### Pre-trained Models 📚:

One of the best features of the Transformers library is its huge repo of pre-trained models. Whether you're looking to employ BERT, GPT-2, T5, RoBERTa, or any of the other transformer architectures, chances are you'll find a version that suits your needs in their model hub.

#### It's super easy 👍:

The library is designed to be user-friendly. Loading a model and its corresponding tokenizer can be done in just a couple of lines of code. This simplicity extends to fine-tuning as well, allowing you to adapt these powerful models to a wide range of tasks. The `pipeline` API we'll be using lets you go from model selection to getting results in just a few lines.
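As a rough illustration of the "couple of lines" claim, here is a minimal sketch that loads a tokenizer and its matching model with the `Auto*` classes. The checkpoint name is the default sentiment model that the pipeline reports later in this notebook; the helper name `load_model_and_tokenizer` is hypothetical and only there to keep the download from happening until you call it.

```python
# Default sentiment checkpoint reported by the pipeline later in this notebook.
CHECKPOINT = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"


def load_model_and_tokenizer(checkpoint: str = CHECKPOINT):
    """Download (or load from the local cache) a tokenizer and its classification model."""
    # Imported inside the function so that defining it does not load transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    return tokenizer, model


# Usage (downloads the weights on the first call):
# tokenizer, model = load_model_and_tokenizer()
# inputs = tokenizer("Transformers are awesome!", return_tensors="pt")
# logits = model(**inputs).logits
```

The `pipeline` function wraps these same steps (tokenize, run the model, post-process) into a single call.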

#### Tokenizers 🔄 and Datasets 📊 Libraries:

Alongside the Transformers library, HuggingFace also offers the Tokenizers and Datasets libraries. While the first provides efficient and easy-to-use tokenization methods, the second offers a whole bunch of datasets, meaning you have all the tools and data you need in one ecosystem.

#### Community-Driven 🌐:
The HuggingFace community is very active and any community member (you included) can upload their own models and datasets.

__If you are working in Colab__, you'll need to install the appropriate libraries in your Colab environment (you will already have them locally if you followed the setup instructions).

In [1]:
# Install the transformers library from HuggingFace
!pip install transformers torch pytesseract

Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Coll

In [2]:
# You'll also need some extra tools that some of these models use under the hood
! pip install sentencepiece sacremoses

Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m897.5/897.5 kB[0m [31m35.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: sacremoses
Successfully installed sacremoses-0.1.1


Over the course of this notebook, you'll be using Pipelines to download and easily use some very powerful models. Bear in mind that some of these models are quite large (up to 500 MB), so make sure you have some free disk space on your machine, or run this notebook in Colab to benefit from faster download speeds!

We are going to be using pre-built models, and the best resource for implementing them is the [Pipelines documentation](https://huggingface.co/docs/transformers/main_classes/pipelines). If you ever want to delete the downloaded models after use, you can find them in your home directory at:

`~/.cache/huggingface/hub`
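To see how much space the downloaded models take up, a small sketch like the one below can help. It assumes the default cache location under your home directory (`~/.cache/huggingface/hub`); the location can differ if cache-related environment variables are set. The helper name `dir_size_mb` is hypothetical.

```python
from pathlib import Path


def dir_size_mb(root: Path) -> float:
    """Return the total size of regular files under `root` in megabytes (0.0 if missing)."""
    if not root.exists():
        return 0.0
    # Skip symlinks: the hub cache links snapshot files to shared blobs,
    # so following them would count the same bytes twice.
    total_bytes = sum(
        p.stat().st_size
        for p in root.rglob("*")
        if p.is_file() and not p.is_symlink()
    )
    return total_bytes / 1_000_000


# Assumed default cache location (not guaranteed on every setup).
hub_cache = Path.home() / ".cache" / "huggingface" / "hub"
print(f"{hub_cache}: {dir_size_mb(hub_cache):.1f} MB")
```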

### Basic Sentiment : 😀 /  😕 / 😠 / 😟

With that in mind, instantiate a pipeline for sentiment analysis __without__ specifying a model and try testing out that model with the sentence "Transformers are awesome!" Feel free to try some other sentences, too.

In [3]:
from transformers import pipeline

# No model specified: the pipeline falls back to its default sentiment checkpoint
model = pipeline("sentiment-analysis")

sentences = [
    "Transformers are awesome!",
    "This code keeps giving me errors"
]

results = model(sentences)

state = {
    "POSITIVE": "😀",
    "NEGATIVE": "😟",
    "NEUTRAL": "😐"
}

for text, result in zip(sentences, results):
    print(f"{text} {result['label']} {state.get(result['label'], '')}")
    print(f" {result['score']:.2%}\n")


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Device set to use cuda:0


Transformers are awesome! POSITIVE 😀
 99.99%

This code keeps giving me errors NEGATIVE 😟
 99.97%



### Nuanced Sentiment 🤔

HuggingFace will default to using `distilbert-base-uncased-finetuned-sst-2-english` if we don't specify a model. This model works fine for a lot of basic use cases, but it has been fine-tuned on a fairly limited corpus of text, the Stanford Sentiment Treebank:

`The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges.`

A model trained on this corpus will likely perform poorly on sentences that include modern slang, e.g. "These beats are sick!". Try running sentences like this through your pipeline now; you should get negative labels even for the ones expressing quite positive sentiment.

In [4]:
from transformers import pipeline

# Same default model as before (no `model=` argument)
way = pipeline("sentiment-analysis")

texts = [
    "These beats are sick!",
    "That's whack",
    "This pizza slaps"
]
for text in texts:
    result = way(text)[0]
    print(text)
    print(f" {result['label']} ({result['score']:.2%})")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


These beats are sick!
 NEGATIVE (99.97%)
That's whack
 NEGATIVE (95.94%)
This pizza slaps
 NEGATIVE (99.30%)


Go to the list of HuggingFace models to see if you can find a model that specializes in Twitter sentiment (looking for `"twitter-roberta-base-sentiment-latest"` might be a good place to start) - hopefully that should be a bit more up to date with all this new lingo! Now create a second pipeline, this time __specifying__ the model that we want to use (use `model=`), and see how the results improve now that we're using a model fine-tuned on more relevant data.


In [5]:
# Default SST-2 pipeline, kept for comparison
model = pipeline("sentiment-analysis")
# Second pipeline with an explicitly chosen Twitter-tuned model
twitter = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0
Device set to use cuda:0


In [6]:
test_model = [
    "I love data seince",
    "What a wondeful naight",
    "I hate late nights"
]

In [7]:
for text in test_model:
    print(f"{text}")
    print(f"{model(text)[0]}")
    print(f"{twitter(text)[0]}")

I love data seince
{'label': 'POSITIVE', 'score': 0.9996283054351807}
{'label': 'LABEL_2', 'score': 0.9697285890579224}
What a wondeful naight
{'label': 'POSITIVE', 'score': 0.9899623394012451}
{'label': 'LABEL_2', 'score': 0.6254957318305969}
I hate late nights
{'label': 'NEGATIVE', 'score': 0.9945341348648071}
{'label': 'LABEL_0', 'score': 0.9483242630958557}


You should see a much more accurate interpretation of the sentiment we're trying to express.
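The Twitter model in the output above returns generic labels (`LABEL_0`, `LABEL_2`) rather than words. The sketch below translates them into readable names; the mapping (0 = negative, 1 = neutral, 2 = positive) is taken from the `cardiffnlp/twitter-roberta-base-sentiment` model card and should be treated as an assumption for any other checkpoint. The helper name `readable` is hypothetical.

```python
# Label mapping as described on the model card (assumed; check it for other models).
TWITTER_LABELS = {
    "LABEL_0": "negative",
    "LABEL_1": "neutral",
    "LABEL_2": "positive",
}


def readable(result: dict) -> str:
    """Format one pipeline result dict as 'label (score%)'."""
    # Fall back to the raw label if it is not in the mapping.
    label = TWITTER_LABELS.get(result["label"], result["label"])
    return f"{label} ({result['score']:.2%})"


# Example using the first Twitter result printed above.
print(readable({"label": "LABEL_2", "score": 0.9697285890579224}))
```

With the `twitter` pipeline from the earlier cell, this could be used as `readable(twitter(text)[0])`.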

### Sentiment in other languages

While even our first pipeline will actually perform surprisingly well on simple sentences in other languages (e.g. "C'est bon" or "Está bueno"), it breaks down when handling more sophisticated ideas in those languages.

Here is an example review for the Jurassic World Dominion movie 😬:

"This was frankly a spectacular failure from start to finish, with remarkably uninspired performances from some very well-paid actors who acted with all the passion of a wet biscuit"

Translated into Korean, it reads as follows: "이것은 솔직히 처음부터 끝까지 엄청난 실패였으며 젖은 비스킷의 모든 열정으로 연기한 일부 매우 보수가 좋은 배우들의 현저하게 영감을 받지 못한 연기로 끝났습니다."

Try running the Korean text through either of your sentiment pipelines (the default model or your Twitter model); you should see that they don't pick up on how bad the review is.

In [8]:
# Translate the Korean review back into English to inspect what a model "sees"
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-ko-en")
result = pipe("이것은 솔직히 처음부터 끝까지 엄청난 실패였으며 젖은 비스킷의 모든 열정으로 연기한 일부 매우 보수가 좋은 배우들의 현저하게 영감을 받지 못한 연기로 끝났습니다")
result[0]['translation_text']

Device set to use cuda:0


'It was a huge failure, frankly, from beginning to end, and it ended up with some of the most conservative actors who played it with all the passion of wet biscuits, obviously uninspired.'

Now see if you can find a model on the HuggingFace Hub that might perform better and use it. Try using `"matthewburke/korean_sentiment"` in a `text-classification` pipeline and see if your results change.

In [9]:
# Korean-specific sentiment classifier
pipe = pipeline("text-classification",
                model="matthewburke/korean_sentiment")

result = pipe("이것은 솔직히 처음부터 끝까지 엄청난 실패였으며 젖은 비스킷의 모든 열정으로 연기한 일부 매우 보수가 좋은 배우들의 현저하게 영감을 받지 못한 연기로 끝났습니다")
print("Sentiment:", result[0])

Device set to use cuda:0


Sentiment: {'label': 'LABEL_0', 'score': 0.9602949023246765}


In [10]:
# Second Korean-to-English translation model, for comparison with the first
pipe = pipeline("translation",
                model="DunnBC22/opus-mt-ko-en-Korean_Parallel_Corpora")

result = pipe("이것은 솔직히 처음부터 끝까지 엄청난 실패였으며 젖은 비스킷의 모든 열정으로 연기한 일부 매우 보수가 좋은 배우들의 현저하게 영감을 받지 못한 연기로 끝났습니다")
print("Translated text:", result[0])

Device set to use cuda:0


Translated text: {'translation_text': 'It was a huge failure from beginning to end, frankly, and ended with some of the very conservative actors who played with all the passion of the wet Biscuit, which was largely uninspired.'}


### Translation ✍️

Let's stick with our language theme and see if we can find a model that can handle the task of translating some sentences for us. The `opus-mt` project from the University of Helsinki is incredibly active on HuggingFace, creating and maintaining models designed to democratize the translation process for many different global languages. Try implementing the `"Helsinki-NLP/opus-mt-<source-language>-<destination-language>"` model to see if you can translate between two languages (e.g. English to Spanish).
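The naming pattern above can be captured in a tiny helper. This is only a sketch: the function name `opus_mt_checkpoint` is hypothetical, and not every language pair is guaranteed to exist on the Hub, so check the model list before relying on a given name.

```python
def opus_mt_checkpoint(source: str, target: str) -> str:
    """Build an opus-mt checkpoint name from two language codes, e.g. 'en' and 'es'."""
    return f"Helsinki-NLP/opus-mt-{source}-{target}"


print(opus_mt_checkpoint("en", "es"))

# Usage with a pipeline (downloads the model on first use):
# from transformers import pipeline
# translator = pipeline("translation", model=opus_mt_checkpoint("en", "es"))
```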

In [14]:
from transformers import pipeline

# English-to-Spanish translation model from the opus-mt project
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
print(translator("Good morining")[0]['translation_text'])

Device set to use cuda:0


Buenos días.


### Summarization

Another really useful NLP task is summarizing a large amount of information into a small number of words. BART is a model that performs well on tasks like summarization; it combines ideas from two models you've already seen briefly in the lecture, the BERT model and the autoregressive GPT-style model. Check out this [link](https://www.projectpro.io/article/transformers-bart-model-explained/553) for some more information on it.

Since BART models can be quite large, try to find the `distilbart-xsum-12-6` model on HuggingFace which is one of the smallest distillations available (we'll talk more about distillations later!). Integrate that model into a `"summarization"` pipeline, then take some text (e.g. perhaps by copy-pasting [a BBC article](https://www.bbc.com/news/topics/cx2pk70323et)) and summarize it with your pipeline!

N.B. You need to be careful about context windows - here, you may run into an issue with your input being too long for the model!
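One simple way to stay inside the context window is to split a long article into smaller pieces and summarize each piece separately. The sketch below splits on whitespace, which is only a rough proxy for the model's token count; the helper name `chunk_words` and the 400-word budget are assumptions, not values from the model's documentation.

```python
def chunk_words(text: str, max_words: int = 400) -> list[str]:
    """Split text into consecutive chunks of at most `max_words` whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Small demonstration with a tiny budget so the split is visible.
print(chunk_words("one two three four five", max_words=2))

# Usage with a summarization pipeline (assumes `summarizer` and `article` exist):
# summaries = [summarizer(chunk)[0]["summary_text"] for chunk in chunk_words(article)]
```

Splitting on words can cut sentences in half, so for real use you may prefer to split on sentence or paragraph boundaries instead.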

In [15]:
text = """
If you can dream—and not make dreams your master;
If you can think—and not make thoughts your aim;
If you can meet with Triumph and Disaster
And treat those two impostors just the same;
If you can bear to hear the truth you’ve spoken
Twisted by knaves to make a trap for fools,
Or watch the things you gave your life to, broken,
And stoop and build ’em up with worn-out tools
"""

In [19]:
from transformers import pipeline

# Note: this loads the CNN/DailyMail distillation rather than the xsum one mentioned above
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
summary = summarizer(text, max_length=100, min_length=30, do_sample=False)
print(summary[0])

Device set to use cuda:0


{'summary_text': ' If you can bear to hear the truth you’ve spoken, or watch the things you gave your life to, broken, or stoop and build ‘em up with worn-out tools’'}


### Going further: Question Answering 🔍

What if we wanted to go further than just a summary? Perhaps asking questions about a specific dataset in an intuitive way? There's a model for that, too! Enter the (reasonably small) `roberta-base-squad2` - a model trained on question-answer pairs that can answer a `question` about a provided `context` (a body of text you will provide). Check the docs [here](https://huggingface.co/deepset/roberta-base-squad2?context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species.&question=How+many+species+are+in+the+Amazon%3F).

You know the drill: Create a `"question-answering"` pipeline with the `roberta-base-squad2` model, then try putting the `article` you picked before as your context and try asking a `question` about it.

In [20]:
# Extractive QA: the answer is a span copied out of the provided context
Qanswering = pipeline("question-answering", model="deepset/roberta-base-squad2")

Device set to use cuda:0


In [21]:
context = """
All things in life pass away, except what we create from our soul.
People chase after money and status,
but the artist seeks only truth.
A person’s finest moments are those spent listening to their inner voice,
when what they paint or write becomes their true self, not a mirror of others’ desires
"""

In [22]:
questions = [
    "What is the only thing that doesn't pass away?",
    "What do ordinary people chase?",
    "What does the artist seek?",
    "When are a person's finest moments?",
    "What becomes an artist's true self?"
]

In [23]:
print("=== PHILOSOPHICAL QA ===")
for question in questions:
    answer = Qanswering(question=question, context=context)
    print(f"Q: {question}")
    print(f"A: {answer['answer']} (confidence: {answer['score']:.0%})\n")

=== PHILOSOPHICAL QA ===
Q: What is the only thing that doesn't pass away?
A: what we create from our soul (confidence: 41%)

Q: What do ordinary people chase?
A: money and status (confidence: 68%)

Q: What does the artist seek?
A: truth (confidence: 79%)

Q: When are a person's finest moments?
A: listening to their inner voice (confidence: 16%)

Q: What becomes an artist's true self?
A: what they paint or write (confidence: 42%)
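Several of the answers above come with low confidence scores. A minimal sketch of one way to handle that is to only accept answers above a chosen threshold. The helper name `confident_answer` and the 0.5 cut-off are assumptions; the result dictionaries mimic the `answer` and `score` keys used in the loop above, with scores taken from the rounded percentages in the output.

```python
def confident_answer(result: dict, threshold: float = 0.5) -> str:
    """Return the answer text if its score reaches the threshold, else a placeholder."""
    if result["score"] >= threshold:
        return result["answer"]
    return "(no confident answer)"


# Illustrative results based on the rounded scores printed above.
examples = [
    {"answer": "truth", "score": 0.79},
    {"answer": "listening to their inner voice", "score": 0.16},
]
for example in examples:
    print(confident_answer(example))
```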



### Speech to text 🎤

One of the best models for converting speech to text is the open-source Whisper model made by OpenAI (creator of ChatGPT etc.). Take a look at the diagram of the model architecture - it should now look quite similar to those you've already seen today:


<img src = https://wagon-public-datasets.s3.amazonaws.com/data-science-images/lectures/Transformers/whipser.png width = 450px>

Run the following cells to install `ffmpeg` (an additional required package) and download this audio sample:

In [24]:
# Line below is for Linux / Colab (apt-based systems)
!sudo apt install ffmpeg

# Uncomment line below for Mac users
# !HOMEBREW_NO_AUTO_UPDATE=1 brew install ffmpeg

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 30 not upgraded.


In [25]:
!mkdir data
!curl https://wagon-public-datasets.s3.amazonaws.com/deep_learning_datasets/harvard.wav > data/harvard.wav

mkdir: cannot create directory ‘data’: File exists
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3173k  100 3173k    0     0  1688k      0  0:00:01  0:00:01 --:--:-- 1688k


You can listen to the clip by importing `IPython` and loading the audio file (see the Algebra day recap for an example of how this is done!)

In [26]:
from IPython.display import Audio

# Render an inline audio player for the downloaded clip
Audio("data/harvard.wav")

Output hidden; open in https://colab.research.google.com to view.

Find the smallest Whisper model version on HuggingFace (`whisper-tiny`) and use it to transcribe the audio. Try it on some other `.wav` files if you'd like!

In [27]:
from transformers import pipeline

# Smallest Whisper checkpoint
transcribe = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

result = transcribe("data/harvard.wav")
print(result["text"])

Device set to use cuda:0
Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.


 The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health in zest. A salt pickle tastes fine with ham. Tacos all pastora are my favorite. A zestful food is the hot cross bun.


### Bonus: Let's get multimodal 😎: Visual Question Answering

We can even use question-answering style models on images if we'd like. Many of these models will use chains under the hood that extract text from an image and then pass it through to a language model. In order to use the following model, make sure you `pip install Pillow pytesseract`, two libraries that will help us extract text from our images.

Once that's done, we're going to create a `"document-question-answering"` pipeline - we'll need a model for it, so search for the `layoutlm-invoices` model on HuggingFace. Then try to ask questions about this [`receipt.webp`](https://wagon-public-datasets.s3.amazonaws.com/data-science-images/lectures/Transformers/receipt.webp) (you can download the image to your data folder or pass the URL directly into your model when you call it). Try asking how much the eggs cost, what the sales tax was and what the total was. Feel free to try it on some of your own images!

For this to run, you'll need some dependencies:

In [28]:
# For Mac, uncomment:
# !brew install tesseract

# For Linux or Colab etc., run these:
!sudo apt install tesseract-ocr
!sudo apt install libtesseract-dev

# Then restart your kernel and give it a try!

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 30 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libtesseract-dev is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 30 not upgraded.


In [29]:
from transformers import pipeline
from PIL import Image

# No model specified, so the pipeline falls back to its default document-QA checkpoint
doc_qa = pipeline("document-question-answering")


No model was supplied, defaulted to impira/layoutlm-document-qa and revision beed3c4 (https://huggingface.co/impira/layoutlm-document-qa).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


In [30]:
result = doc_qa(
    image=Image.open("/content/data/receipt.webp"),
    question="What is the total amount?"
)

print(f"Total: {result[0]['answer']}")

Total: $55.59
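To ask several questions about the same receipt (eggs, sales tax, total), a small loop is enough. This is a sketch under the assumption that the pipeline returns a list of dictionaries with an `answer` key, as in the cell above; the helper name `ask_all` is hypothetical.

```python
def ask_all(doc_qa, image, questions):
    """Ask each question about the same image and return {question: top answer}."""
    answers = {}
    for question in questions:
        result = doc_qa(image=image, question=question)
        # Keep only the highest-ranked answer; None if the pipeline returned nothing.
        answers[question] = result[0]["answer"] if result else None
    return answers


# Usage (assumes `doc_qa` and an opened PIL image named `receipt` already exist):
# questions = [
#     "How much do the eggs cost?",
#     "What is the sales tax?",
#     "What is the total amount?",
# ]
# for question, answer in ask_all(doc_qa, receipt, questions).items():
#     print(f"{question} -> {answer}")
```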


Congrats 🎉 You've just seen how simple it can be to start working with some advanced Transformer-based models and we've only just scratched the surface.

There are so many models you can explore in the HuggingFace library for all kinds of different tasks. Your imagination is literally the limit (well - your compute power can also be a limit sometimes 😅). To take these models even further for custom usage, we're going to tackle fine-tuning next.

⚠️⚠️⚠️ If you have been running these models locally, don't forget to clean up your `~/.cache/huggingface/hub` if you're limited on space or you'll have a lot of unwanted models hanging around in your cache 🧹 ⚠️⚠️⚠️