# Session 10 - Using BERT-style models via ```HuggingFace```

In the lecture today, we saw how exploring the different layers and self-attention heads in BERT-style models can give us a more nuanced breakdown of how the model has performed and what it has learned.

There are three main tools which can be used for this task:

- BERTviz
    - https://github.com/jessevig/bertviz
- Ecco
    - https://github.com/jalammar/ecco
- Language Interpretability Toolkit (LIT)
    - https://github.com/PAIR-code/lit

Each of these is backed by empirical results in peer-reviewed publications, but each does something a little different. Feel free to explore them in this class, or in your own time.

A second thing we saw was that BERT (and BERT-style) models can be *finetuned* in order to perform specific tasks. In this class, we're going to see how this can be used for the purposes of cultural data science. To do this, we're going to be using the ```transformers``` library from ```HuggingFace``` (sometimes written as just ```🤗```).

__My (Emma) note__:

The requirements.txt should include: 

- transformers
- tensorflow-cpu

## Creating ```HuggingFace``` pipelines

We're specifically going to use the ```pipeline()``` abstraction in HuggingFace. This allows us to load a finetuned model, initialize it with the necessary requirements, and use it for the specific task for which it was finetuned. You can read more [here](https://huggingface.co/docs/transformers/v4.27.2/en/task_summary#natural-language-processing).

We're going to use the ```text-classification``` pipeline in this class (and [Assignment 4](https://classroom.github.com/a/BhnScEmU)).

In [1]:
# import the pipeline function
from transformers import pipeline



### Text classification

To begin with, let's use the default sentiment classification model to see how we can return a binary sentiment classification for a document.

In [2]:
# With no model specified, the pipeline downloads the default model,
# distilbert-base-uncased-finetuned-sst-2-english: a DistilBERT base model that is
# uncased (all text is lowercased), finetuned on the SST-2 dataset, for English.
classifier = pipeline(task="sentiment-analysis")
# We are not training anything; we are just using a pre-trained model from HuggingFace.
# Sentiment analysis means predicting whether the text is positive or negative.

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [3]:
# use the classifier on a string to get predictions
preds = classifier("Hugging Face is the best thing since sliced bread!")

In [4]:
print(preds)
# the score is the probability the model assigns to the predicted label (its confidence), not an accuracy

[{'label': 'POSITIVE', 'score': 0.9990912675857544}]
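
The ```score``` here is a probability rather than an accuracy. The classifier produces one raw value (a logit) per label and, for a two-label model like this one, a softmax turns those values into probabilities that sum to 1. Here is a minimal sketch of that step, using made-up logits (the numbers below are illustrative assumptions, not values taken from the model):

```python
import math

def softmax(logits):
    """Convert raw model outputs (logits) into probabilities that sum to 1."""
    # subtract the maximum before exponentiating, for numerical stability
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical logits for the two labels (illustrative values only)
labels = ["NEGATIVE", "POSITIVE"]
logits = [-3.4, 3.6]

probs = softmax(logits)
# pick the label with the highest probability, as the pipeline does
best = max(range(len(labels)), key=lambda i: probs[i])
print({"label": labels[best], "score": probs[best]})
```

A score close to 1 therefore means the model is very confident in that label for this one input. It says nothing about how often the model is right across a dataset.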


### Question answering

We can also use BERT-style models for much more complex tasks, such as *question answering*. Again, there's a ```HuggingFace``` pipeline for this!

Let's start by defining a text we want to use as our *context*:

In [5]:
text = "In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles."

We then initialize our question-answering pipeline.

In [6]:
# loading a question answering pipeline
question_answerer = pipeline(task="question-answering")
# the default model here is distilbert-base-cased-distilled-squad: a cased DistilBERT model (it distinguishes upper and lower case) finetuned on the SQuAD dataset

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


And then we define the question we want to ask of our text:

In [7]:
answer = question_answerer(
    context = text,
    question="What are the main results of this paper?",
)

In [8]:
print(answer)
# "start" and "end" are character positions in the context string (not token numbers), so text[505:570] is the answer text

{'score': 0.0676712617278099, 'start': 505, 'end': 570, 'answer': 'our best model outperforms even all previously reported ensembles'}
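
The ```start``` and ```end``` values are character positions in the context string, so slicing the context with them gives back the answer. A small check, with the context and the output copied from the cells above rather than produced by a live model call:

```python
# context and pipeline output copied from the cells above
text = "In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles."
answer = {
    "score": 0.0676712617278099,
    "start": 505,
    "end": 570,
    "answer": "our best model outperforms even all previously reported ensembles",
}

# slice the context by character position: start is inclusive, end is exclusive
recovered = text[answer["start"]:answer["end"]]
print(recovered)
print(recovered == answer["answer"])
```

This is handy when you want to highlight the answer in the original text or pull out the sentence around it.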


### Text summarization

HuggingFace also allows us to use other styles of transformer models, such as T5 and GPT, which we'll be looking at in coming weeks. These allow us to do interesting things like *text summarization* and *text generation*.

In [9]:
summarizer = pipeline(task="summarization")

# summarizes the main points of the text
summary = summarizer(text)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Your max_length is set to 142, but you input_length is only 117. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=58)


In [10]:
print(summary)
# note that these sentences don't have to occur in the original text; the model generates the summary itself

[{'summary_text': ' The Transformer is the first sequence transduction model based entirely on attention . It replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention . For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers .'}]


### Text generation 

Compare how this performs relative to your trained RNN and consider that we're only using the default parameters here:

In [11]:
prompt = "Hugging Face is a community-based open-source platform for machine learning."

In [12]:
# load the text-generation pipeline, which defaults to gpt2
# gpt2 was trained on next-word prediction, which is what lets it generate new text
generator = pipeline(task="text-generation")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [25]:
generated = generator(prompt)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [26]:
print(generated)

[{'generated_text': "Hugging Face is a community-based open-source platform for machine learning. We're excited about creating a community that is able to harness both the power of AI in machine learning and the power of machine learning in machine learning through the use of the"}]
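
Note that ```generated_text``` contains the prompt followed by the model's continuation. If you only want the newly generated text, one option is to slice the prompt off. This is sketched here with the output copied from the cell above, not from a live model call:

```python
# prompt and pipeline output copied from the cells above
prompt = "Hugging Face is a community-based open-source platform for machine learning."
generated = [{"generated_text": "Hugging Face is a community-based open-source platform for machine learning. We're excited about creating a community that is able to harness both the power of AI in machine learning and the power of machine learning in machine learning through the use of the"}]

full_text = generated[0]["generated_text"]

# only strip the prompt if the output really starts with it
if full_text.startswith(prompt):
    continuation = full_text[len(prompt):].strip()
else:
    continuation = full_text

print(continuation)
```

Keeping the continuation separate makes it easier to compare outputs for different prompts, or against text generated by another model.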


In [23]:
prompt2 = "Hello my name is Emma and I study cultural data science."

In [27]:
generated2 = generator(prompt2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [28]:
print(generated2)

[{'generated_text': "Hello my name is Emma and I study cultural data science. I'm interested in how people think and understand the world around them rather than just thinking. Maybe I can help you understand why people aren't talking about the world around people instead of just having"}]


### Using a different model

So far, we've only been using the default models and parameters for these tasks. But if you check out the ```HuggingFace``` model universe, you'll see that there are many (in some cases hundreds of) finetuned models which can be slotted into these pipelines.

Check out the options [here](https://huggingface.co/models).

In [15]:
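# load the model j-hartmann/emotion-english-distilroberta-base
# this model does emotion classification with 7 emotion classes
# return_all_scores=True returns a score for every label (newer transformers versions deprecate it in favour of top_k=None)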
classifier = pipeline("text-classification", 
                      model="j-hartmann/emotion-english-distilroberta-base", 
                      return_all_scores=True)



In [16]:
classifier("I love this!")
# gives a score for every label in the model (anger, disgust, fear, joy etc.)
# the sentence most likely reflects joy

[[{'label': 'anger', 'score': 0.004419781267642975},
  {'label': 'disgust', 'score': 0.0016119900392368436},
  {'label': 'fear', 'score': 0.0004138521908316761},
  {'label': 'joy', 'score': 0.9771687984466553},
  {'label': 'neutral', 'score': 0.005764583125710487},
  {'label': 'sadness', 'score': 0.002092392183840275},
  {'label': 'surprise', 'score': 0.008528688922524452}]]
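
Because ```return_all_scores=True``` returns one dictionary per label, you usually need a small extra step to find the most likely emotion. A minimal sketch, using the output above in place of a live model call:

```python
# output of classifier("I love this!") copied from the cell above
preds = [[{'label': 'anger', 'score': 0.004419781267642975},
          {'label': 'disgust', 'score': 0.0016119900392368436},
          {'label': 'fear', 'score': 0.0004138521908316761},
          {'label': 'joy', 'score': 0.9771687984466553},
          {'label': 'neutral', 'score': 0.005764583125710487},
          {'label': 'sadness', 'score': 0.002092392183840275},
          {'label': 'surprise', 'score': 0.008528688922524452}]]

# the outer list has one entry per input text; we passed a single string
scores = preds[0]

# pick the label with the highest score
top = max(scores, key=lambda item: item["score"])
print(top["label"], round(top["score"], 3))

# the scores are probabilities over the 7 labels, so they add up to roughly 1
total = sum(item["score"] for item in scores)
print(round(total, 3))
```

The same pattern works in a loop over many texts, for example to count how often each emotion comes out on top.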

This final pipeline forms the basis of [Assignment 4](https://classroom.github.com/a/BhnScEmU), which you should start working on now!