<a href="https://colab.research.google.com/github/Saminho44/dotfiles/blob/master/Your_first_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction to HuggingFace's Transformers Library 🤗


To recap, HuggingFace is an AI company that has blown up in the last few years, especially in the realm of Natural Language Processing (NLP).

In particular, the Transformers library has revolutionized the way people work with large-scale transformer models. The goal of this challenge is to introduce you to these models for the first time and show how easy they can be to work with.

### Why you should love HuggingFace 🌟:

#### Pre-trained Models 📚:

One of the best features of the Transformers library is its huge repo of pre-trained models. Whether you're looking to employ BERT, GPT-2, T5, RoBERTa, or any of the other transformer architectures, chances are you'll find a version that suits your needs in their model hub.

#### It's super easy 👍:

The library is designed to be user-friendly. Loading a model and its corresponding tokenizer can be done in just a couple of lines of code. This simplicity extends to fine-tuning as well, allowing you to adapt these powerful models to a wide range of tasks. The `pipeline` API we'll be using lets you go from model selection to results in just a few lines.

#### Tokenizers 🔄 and Datasets 📊 Libraries:

Alongside the Transformers library, HuggingFace also offers the Tokenizers and Datasets libraries. While the first provides efficient and easy-to-use tokenization methods, the second offers a whole bunch of datasets, meaning you have all the tools and data you need in one ecosystem.

#### Community-Driven 🌐:
The HuggingFace community is very active and any community member (you included) can upload their own models and datasets.

Enough preamble, let's get started by installing the HuggingFace library!

In [None]:
# Install the transformers library from HuggingFace
!pip install transformers torch pytesseract



In [None]:
# You'll also need some extra tools that some of these models use under the hood
! pip install sentencepiece sacremoses



Over the course of this notebook, you'll be using Pipelines to download and easily use some very powerful models. Bear in mind that some of these models are quite large (up to 500 MB), so make sure you have some disk space free on your machine! We are going to be using pre-built models, and the best resource for implementing them is the [Pipelines documentation](https://huggingface.co/docs/transformers/main_classes/pipelines). If you ever want to delete the models after use, you can find them in your home directory at:

`~/.cache/huggingface/hub`
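To see what is taking up space before deleting anything, a small sketch like the one below can total the size of each cached model folder. It assumes the default cache location under your home directory (this can be changed through environment variables such as `HF_HOME`) and that each model lives in its own top-level folder. `cache_sizes` is an illustrative helper, not part of the HuggingFace libraries.

```python
import os
from pathlib import Path

# ASSUMPTION: the default cache location; adjust if you have moved your cache
DEFAULT_CACHE = os.path.expanduser("~/.cache/huggingface/hub")


def cache_sizes(cache_dir=DEFAULT_CACHE):
    """Return {folder_name: size_in_MB} for each top-level folder in cache_dir."""
    cache_dir = Path(cache_dir)
    sizes = {}
    if not cache_dir.is_dir():
        return sizes  # nothing has been downloaded yet
    for folder in sorted(cache_dir.iterdir()):
        if not folder.is_dir():
            continue
        # Skip symlinks so a file linked from several places is only counted once
        total_bytes = sum(
            f.stat().st_size
            for f in folder.rglob("*")
            if f.is_file() and not f.is_symlink()
        )
        sizes[folder.name] = round(total_bytes / 1e6, 2)
    return sizes


for name, megabytes in cache_sizes().items():
    print(f"{name}: {megabytes} MB")
```

Deleting a folder listed here simply means the model is downloaded again the next time a pipeline asks for it.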

### Basic Sentiment : 😀 /  😕 / 😠 / 😟

With that in mind, instantiate a pipeline for sentiment analysis __without__ specifying a model and try testing out that model with the sentence "Transformers are awesome!" Feel free to try some other sentences, too.

In [None]:
pass  # YOUR CODE HERE
from transformers import pipeline

sentiment_analysis = pipeline("sentiment-analysis")

result = sentiment_analysis(["Transformers are awesome!"])

result

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


[{'label': 'POSITIVE', 'score': 0.9998667240142822}]

### Nuanced Sentiment 🤔

HuggingFace defaults to `distilbert-base-uncased-finetuned-sst-2-english` if we don't specify a model. This model works fine for a lot of basic use cases, but it was fine-tuned on a fairly limited corpus of text, the Stanford Sentiment Treebank:

`The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges.`

A model trained on this corpus will likely perform poorly on sentences that include modern slang, e.g. "These jokes were absolutely killer!" or "These beats are sick!". Try running these sentences through your pipeline now: you may well get negative labels even though they express quite positive sentiment.

In [None]:
pass  # YOUR CODE HERE
result2 = sentiment_analysis(["These beats are sick!"])
result2

[{'label': 'NEGATIVE', 'score': 0.9997040629386902}]

In [None]:
result3 = sentiment_analysis(["These jokes were absolutely killer!"])
result3

[{'label': 'POSITIVE', 'score': 0.9855958223342896}]

Go to the list of HuggingFace models and see if you can find one that specializes in Twitter sentiment; hopefully it will be a bit more up to date with all this new lingo! Now create a second pipeline, this time __specifying__ the model you want to use (with `model=`), and see how performance improves now that we're using a model fine-tuned on more relevant text.

In [None]:
pass  # YOUR CODE HERE
model2 = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest")


result4 = model2(["These beats are sick!"])

result4

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'label': 'positive', 'score': 0.7884202003479004}]

You should see a much more accurate interpretation of the sentiment we're trying to express.
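To compare models side by side rather than eyeballing separate cells, a helper along these lines may be handy. It is only a sketch: `compare_models` is a hypothetical name, and it assumes each pipeline returns a list of `{'label', 'score'}` dictionaries as in the outputs above. The stand-in functions mimic that output so the snippet runs without downloading anything; in the notebook you would pass `sentiment_analysis` and `model2` instead.

```python
def compare_models(models, sentences):
    """Run every sentence through every model and collect (label, score) pairs.

    `models` maps a display name to a callable that behaves like a
    sentiment pipeline: list of strings in, list of dicts out.
    """
    table = {}
    for sentence in sentences:
        table[sentence] = {}
        for name, model in models.items():
            prediction = model([sentence])[0]
            table[sentence][name] = (prediction["label"], round(prediction["score"], 3))
    return table


# Stand-ins that return the labels and scores recorded earlier in this notebook
def fake_default(texts):
    return [{"label": "NEGATIVE", "score": 0.9997040629386902} for _ in texts]


def fake_twitter(texts):
    return [{"label": "positive", "score": 0.7884202003479004} for _ in texts]


comparison = compare_models(
    {"default": fake_default, "twitter": fake_twitter},
    ["These beats are sick!"],
)
for sentence, predictions in comparison.items():
    print(sentence, predictions)
```

Note that the two models use different label sets (`POSITIVE`/`NEGATIVE` versus `positive`/`neutral`/`negative`), so their scores are not directly comparable.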

### Sentiment in other languages

While even our first pipeline will actually perform surprisingly well on simple sentences in other languages (e.g. "C'est bon" or "Está bueno"), it breaks down when handling more sophisticated ideas in those languages.

Here is an example review for the Jurassic World Dominion movie 😬:

"This was frankly a spectacular failure from start to finish, with  remarkably uninspired performances from some very well-paid actors who acted with all the passion of a wet biscuit"

Translated into Korean, it reads: "이것은 솔직히 처음부터 끝까지 엄청난 실패였으며 젖은 비스킷의 모든 열정으로 연기한 일부 매우 보수가 좋은 배우들의 현저하게 영감을 받지 못한 연기로 끝났습니다."

Try running the Korean text through either your Twitter model or the original sentiment model; you should see that they don't pick up on how bad the review is.

In [None]:
pass  # YOUR CODE HERE
result5 = model2(["이것은 솔직히 처음부터 끝까지 엄청난 실패였으며 젖은 비스킷의 모든 열정으로 연기한 일부 매우 보수가 좋은 배우들의 현저하게 영감을 받지 못한 연기로 끝났습니다."])
result5

[{'label': 'neutral', 'score': 0.7584187388420105}]

Now see if you can find a model that might perform better in the HuggingFace library and use it.

In [None]:
pass  # YOUR CODE HERE
model3 = pipeline("sentiment-analysis", model="matthewburke/korean_sentiment")

In [None]:
result6 = model3(["이것은 솔직히 처음부터 끝까지 엄청난 실패였으며 젖은 비스킷의 모든 열정으로 연기한 일부 매우 보수가 좋은 배우들의 현저하게 영감을 받지 못한 연기로 끝났습니다."])
result6

[{'label': 'LABEL_0', 'score': 0.9615505337715149}]
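Some models on the hub return generic labels such as `LABEL_0` instead of readable names. One way to make results easier to read is to rename them, as sketched below. The mapping used here (`LABEL_0` as negative, `LABEL_1` as positive) is an assumption for illustration only; check the model card or `model3.model.config.id2label` before relying on it. `rename_labels` is a hypothetical helper.

```python
def rename_labels(results, mapping):
    """Return a copy of pipeline results with labels replaced via `mapping`.

    Labels missing from `mapping` are left unchanged.
    """
    return [
        {**result, "label": mapping.get(result["label"], result["label"])}
        for result in results
    ]


# ASSUMPTION: this mapping is illustrative; verify it against the model card
assumed_mapping = {"LABEL_0": "negative", "LABEL_1": "positive"}

result6 = [{"label": "LABEL_0", "score": 0.9615505337715149}]  # output recorded above
readable = rename_labels(result6, assumed_mapping)
print(readable)
```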

### Translation ✍️

Let's stick with our language theme and see if we can find a model that can handle the task of translating some sentences for us. The `opus-mt` project from the University of Helsinki is incredibly active on HuggingFace, creating and maintaining models designed to democratize the translation process for many different global languages. Try to find one that will allow you to translate from English to Spanish.

In [None]:
pass  # YOUR CODE HERE

model5 = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
result7 = model5('I love you')
result7

[{'translation_text': 'Te quiero.'}]

### Summarization

Another really useful NLP task is summarizing a large amount of information into a very small number of words. BART is a model that performs well on tasks like summarization; it combines two kinds of model you've already seen briefly in the lecture: a BERT-style encoder and an autoregressive, GPT-style decoder. Check out this [link](https://www.projectpro.io/article/transformers-bart-model-explained/553) for some more information on it.

Since BART models can be quite large, try to find the `distilbart-xsum-12-6` model on HuggingFace which is one of the smallest distillations available (we'll talk more about distillations later!). Integrate that model into a `"summarization"` pipeline, then try to scrape your own article from [the BBC](https://www.bbc.com/news/topics/cx2pk70323et) and summarize it with your pipeline!

In [None]:
pass  # YOUR CODE HERE
import requests
from bs4 import BeautifulSoup
from transformers import pipeline

# Load the distilbart-xsum-12-6 model into a summarization pipeline
summarizer = pipeline("summarization", model="sshleifer/distilbart-xsum-12-6")

def summarize_bbc_article(url):
    # Scrape the article text
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    # Assumption: the article body lives in <p> tags; BBC markup changes
    # over time, so adjust this selector if nothing is found
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    article_text = " ".join(paragraphs)

    # Check that some article text was found
    if not article_text:
        return None

    # Summarize, truncating input that exceeds the model's maximum length
    summary = summarizer(article_text, truncation=True)
    return summary[0]["summary_text"]


In [None]:
print(summarize_bbc_article("https://www.bbc.com/news/uk-england-hereford-worcester-66279315"))


Once you've done it for one article, try building a function that scrapes BBC URLs and summarizes them.
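One possible shape for such a function is sketched below; treat it as a starting point rather than the expected answer. To keep the sketch free of extra dependencies it uses the standard library's `HTMLParser` in place of BeautifulSoup, and it takes the download function and the summarizer as arguments so each piece can be swapped or tested separately. `ParagraphExtractor`, `extract_text`, and `summarize_urls` are illustrative names, and the assumption that article text lives in `<p>` tags may not hold for every page.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> tags."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self.paragraphs[-1] += data


def extract_text(html):
    """Join the non-empty paragraphs of an HTML page into one string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    return " ".join(p.strip() for p in parser.paragraphs if p.strip())


def summarize_urls(urls, summarizer, fetch_html):
    """Return {url: summary}, or None for pages where no text was found."""
    summaries = {}
    for url in urls:
        text = extract_text(fetch_html(url))
        if not text:
            summaries[url] = None  # nothing to summarize
            continue
        summaries[url] = summarizer(text, truncation=True)[0]["summary_text"]
    return summaries
```

In the notebook, `summarizer` would be a pipeline such as `pipeline("summarization", model="sshleifer/distilbart-xsum-12-6")` and `fetch_html` could be `lambda url: requests.get(url).text`.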

In [None]:
pass  # YOUR CODE HERE


In [None]:
summarize_url("https://www.bbc.com/news/uk-england-hereford-worcester-66279315")

NameError: ignored

### Going further: Question Answering 🔍

What if we wanted to go further than just a summary? Perhaps asking questions about a specific dataset in an intuitive way? There's a model for that, too! Enter the (reasonably small) `roberta-base-squad2` - a model trained on question-answer pairs that can answer a `question` about a provided `context` (a body of text you will provide). Check the docs [here](https://huggingface.co/deepset/roberta-base-squad2?context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species.&question=How+many+species+are+in+the+Amazon%3F).

You know the drill: Create a `"question-answering"` pipeline with the `roberta-base-squad2` model, then try putting the `article` you picked before as your context and try asking a `question` about it.
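As a rough sketch of the shape this can take (not the only way to do it): the hub id `deepset/roberta-base-squad2` comes from the link above, while `build_qa`, `ask`, and the sample context are illustrative. A question-answering pipeline is called with `question=` and `context=` and returns a dictionary containing `answer` and `score`.

```python
def build_qa(model_name="deepset/roberta-base-squad2"):
    """Create a question-answering pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily so `ask` works without it

    return pipeline("question-answering", model=model_name)


def ask(qa, question, context, min_score=0.0):
    """Return the answer text, or None when the model's score is below min_score."""
    result = qa(question=question, context=context)
    if result["score"] < min_score:
        return None
    return result["answer"]


# In the notebook:
# qa = build_qa()
# ask(qa, "What does the pipeline need?", "The pipeline needs a question and a context.")
```

The `min_score` threshold is a design choice for this sketch: extractive models always return some span of the context, so filtering on the score is one simple way to avoid reporting very uncertain answers.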

In [None]:
pass  # YOUR CODE HERE

### Speech to text 🎤

One of the best models for converting speech to text is the open-source Whisper model made by OpenAI (creator of ChatGPT etc.). Take a look at the diagram of the model architecture; it should now look quite similar to those you've already seen today:


<img src = https://wagon-public-datasets.s3.amazonaws.com/data-science-images/lectures/Transformers/whipser.png width = 450px>

Run the following command to download this audio sample and install some additional required packages:

In [None]:
# Uncomment the line below for Linux (or Windows via WSL)
#!sudo apt install ffmpeg

# Uncomment the line below for Mac users
#!HOMEBREW_NO_AUTO_UPDATE=1 brew install ffmpeg

!mkdir data
!curl https://wagon-public-datasets.s3.amazonaws.com/deep_learning_datasets/harvard.wav > data/harvard.wav

You can listen to the clip by importing `IPython` and loading the audio file (see the Algebra day recap for an example of how this is done!).
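A minimal sketch of one way to do this, assuming the clip was saved to `data/harvard.wav` by the cell above. `describe_wav` and `play` are illustrative helpers: the first uses the standard library's `wave` module to check the file, and the second wraps `IPython.display.Audio`, which is what renders the player in a notebook.

```python
import wave


def describe_wav(path):
    """Return basic facts about an uncompressed .wav file."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate": rate,
            "duration_seconds": frames / rate,
        }


def play(path):
    """Return an audio player; it is displayed when it is the last line of a cell."""
    from IPython.display import Audio  # lazy import: only needed inside a notebook

    return Audio(path)


# In the notebook:
# print(describe_wav("data/harvard.wav"))
# play("data/harvard.wav")
```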

In [None]:
pass  # YOUR CODE HERE

Find the smallest Whisper model version on HuggingFace and use it to transcribe the audio. Try it on some other `.wav` files if you'd like!
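For reference, a hedged sketch of what the transcription step can look like. `openai/whisper-tiny` is one of the smallest Whisper checkpoints on the hub; `build_transcriber` and `transcribe` are illustrative names. The speech recognition pipeline accepts a path to an audio file (decoded with ffmpeg) and returns a dictionary with a `text` key.

```python
def build_transcriber(model_name="openai/whisper-tiny"):
    """Create a speech-to-text pipeline (downloads the model on first use)."""
    from transformers import pipeline  # lazy import keeps `transcribe` usable on its own

    return pipeline("automatic-speech-recognition", model=model_name)


def transcribe(transcriber, paths):
    """Return {path: transcribed text} for each audio file path."""
    return {path: transcriber(path)["text"].strip() for path in paths}


# In the notebook:
# transcriber = build_transcriber()
# transcribe(transcriber, ["data/harvard.wav"])
```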

In [None]:
pass  # YOUR CODE HERE

### Let's get multimodal 😎: Visual Question Answering

We can even use question-answering style models on images if we'd like. Many of these models will use chains under the hood that will extract text from an image then pass it through to a language model. In order to use the following model you will need to make sure you `pip install Pillow pytesseract` which are two libraries that will help us to extract text from our images.

Once that's done, we're going to create a `"document-question-answering"` pipeline - we'll need a model for it, so search for the `layoutlm-invoices` model on HuggingFace. Then try to ask questions about this [`receipt.webp`](https://wagon-public-datasets.s3.amazonaws.com/data-science-images/lectures/Transformers/receipt.webp) (you'll need to download it first). Try asking how much the eggs cost, what sales tax was and what the total was. Feel free to try it on some of your own images!
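A sketch of how the pieces might fit together, under a few assumptions: the hub id `impira/layoutlm-invoices` is what a search for `layoutlm-invoices` should surface (confirm it on the hub), the receipt has been downloaded to `receipt.webp`, and `build_doc_qa` and `ask_document` are illustrative names. The pipeline is called with `image=` and `question=`; the helper accepts either a single result dictionary or a list of them, since the return shape can vary.

```python
def build_doc_qa(model_name="impira/layoutlm-invoices"):
    """Create a document question-answering pipeline (needs Pillow and pytesseract)."""
    from transformers import pipeline  # lazy import keeps `ask_document` usable on its own

    return pipeline("document-question-answering", model=model_name)


def ask_document(doc_qa, image_path, questions):
    """Return {question: best answer} for one image, or None when nothing is found."""
    answers = {}
    for question in questions:
        result = doc_qa(image=image_path, question=question)
        if isinstance(result, list):
            # Keep the highest-scoring candidate when several are returned
            result = max(result, key=lambda r: r["score"]) if result else None
        answers[question] = result["answer"] if result else None
    return answers


# In the notebook:
# doc_qa = build_doc_qa()
# ask_document(doc_qa, "receipt.webp", ["How much did the eggs cost?", "What was the total?"])
```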

For this to run, you'll need some dependencies:

In [None]:
# For Mac, uncomment:
#!brew install tesseract

# For Linux etc., uncomment these:
#!sudo apt install tesseract-ocr
#!sudo apt install libtesseract-dev

# Then restart your kernel and give it a try!

In [None]:
pass  # YOUR CODE HERE

Congrats 🎉 You've just seen how simple it can be to start working with some advanced Transformer-based models and we've only just scratched the surface.

There are so many models you can explore in the HuggingFace library for all kinds of different tasks. Your imagination is the limit (well, your compute power can be a limit sometimes 😅). To take these models even further for custom usage, we're going to tackle fine-tuning next.

⚠️⚠️⚠️ Don't forget to clean up your `~/.cache/huggingface/hub` if you're limited on space or you'll have a lot of unwanted models hanging around in your cache 🧹 ⚠️⚠️⚠️