
# Transformers: Applications in Language and Communication (INFOMTALC)

## Midterm Take-Home Exam

### Deadline: Sunday March 10, 2024, 23:59 CET

The midterm take-home exam for the INFOMTALC Transformers course consists of five assignments. Each assignment requires you to write or edit working code blocks that produce results, and to write a discussion of those results. Each answer is evaluated on the quality of the code or code edits (where you were asked to write or edit code), on whether the required results are produced, and on the quality of the discussion. Each question is worth at most 2 points.

All questions assume the use of Google Colab. We begin by installing some requirements.

In [None]:
!git clone https://github.com/nlp-with-transformers/notebooks.git
%cd notebooks
from install import *
install_requirements()

fatal: destination path 'notebooks' already exists and is not an empty directory.
/content/notebooks
⏳ Installing base requirements ...
✅ Base requirements installed!
⏳ Installing Git LFS ...
✅ Git LFS installed!


In [None]:
%%javascript
require.config({
  paths: {
      d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min',
      jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
  }
});


In [None]:
from utils import *
setup_chapter()

Using transformers v4.16.2
Using datasets v1.16.1


# Assignment 1

This question deals with using a pre-trained BERT encoder for classification. Given a labeled dataset provided below, code two ways of using DistilBERT (or any other BERT-based Transformer you prefer):

1.   Using the Transformer as a feature extractor: Compare the performance of **at least two** machine learning classifiers using the output of the Transformer as input, with a baseline classifier;
2.   Fine-tuning the Transformer: train a classifier on top of the pre-trained Transformer on the same task.

Report the performance in terms of overall accuracy on the test data and the precision, recall, and F1-score for both classes, and visualize the errors on the test data with confusion matrices. Discuss the differences in performance between the baseline classifier, the machine learning classifiers, and the fine-tuned Transformer.
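
To make the requested metrics concrete, here is a minimal pure-Python sketch of how accuracy, per-class precision, recall, F1, and the confusion-matrix counts relate to each other. The labels and predictions are hypothetical toy values, not SST-2 results, and the helper names are invented for illustration. In the notebook itself you would more likely rely on scikit-learn, for example `classification_report` and `ConfusionMatrixDisplay`.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary task with labels 0 and 1."""
    pairs = Counter(zip(y_true, y_pred))
    negative = 1 - positive
    tp = pairs[(positive, positive)]
    fp = pairs[(negative, positive)]
    fn = pairs[(positive, negative)]
    tn = pairs[(negative, negative)]
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class given by `positive`."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data (NOT real SST-2 labels or model predictions).
y_true = [1, 0, 1, 1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

# A simple baseline: always predict the most frequent gold label.
majority_label = Counter(y_true).most_common(1)[0][0]
y_base = [majority_label] * len(y_true)

for name, preds in [("classifier", y_pred), ("baseline", y_base)]:
    print(name, "accuracy:", accuracy(y_true, preds))
    for cls in (0, 1):
        p, r, f = precision_recall_f1(y_true, preds, positive=cls)
        print(f"  class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The majority baseline illustrates why accuracy alone is not enough: it can reach a reasonable accuracy while scoring zero recall on the minority class.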

The dataset used is `SetFit/sst2`, the Stanford Sentiment Treebank version 2, with a binary sentiment score (0 for negative, 1 for positive). Each instance is a sentence expressing an opinion on a movie.

If lost, you may refresh your memory and get inspiration from the [Seminar 1 Notebook](https://colab.research.google.com/drive/1Wdpuppyk9oyWS9LiGqYI9J6I8IKqg8C-?usp=sharing). Here is a first block of code to get you started:


In [None]:
from datasets import load_dataset

sst2 = load_dataset("SetFit/sst2")


In [None]:
### add your code blocks here.

# Assignment 2

This assignment is about finding meaningful attention weights in a pre-trained Transformer. We are going to use the `bertviz` library to visualize internals of [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased), a pretrained BERT model. We are going to input a short two-sentence text into the model, activating all 12 attention layers:

> *Jack put a banana on the table. It was yellow, with brown spots.*

In this text we find a co-referential relation between 'a banana' and 'It', although 'the table' could potentially also be 'yellow, with brown spots', and so theoretically, 'It' could also refer to 'the table'.


In [None]:
# Load model and retrieve attention weights

from bertviz import head_view, model_view
from transformers import BertTokenizer, BertModel

model_version = 'bert-base-uncased'
model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version)
sentence_a = "Jack put a banana on the table."
sentence_b = "It was yellow, with brown spots."
inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt')
input_ids = inputs['input_ids']
token_type_ids = inputs['token_type_ids']
attention = model(input_ids, token_type_ids=token_type_ids)[-1]
sentence_b_start = token_type_ids[0].tolist().index(1)
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)

In Seminar 2, we found out that among other attention heads, the sixth (brown-colored) attention head of layer 5 is sensitive to co-reference. By selecting this layer and head, and moving your pointer to 'It', please verify which are the input words with the strongest attention weights to 'It', and **discuss whether this co-referential attention makes sense**.

Then, replace 'a banana' with 'two bananas'. Do the values of the attention weights from input tokens to 'It' change? **Discuss**.
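
Besides hovering in the visualization, the weights can also be ranked numerically. The sketch below is a hedged illustration on a hypothetical toy matrix, not real BERT output; `strongest_attention` is a helper name invented here. In the notebook, a matrix for one layer and head could be obtained with something like `attention[layer][0, head].tolist()`, where the zero-based `layer` and `head` indices are an assumption you should check against how the visualization counts them.

```python
def strongest_attention(tokens, weights, query_token, k=3):
    """Return the k (token, weight) pairs that `query_token` attends to most.

    `weights[i][j]` is the attention from token i (the attending token)
    to token j, for a single layer and a single head.
    """
    query_index = tokens.index(query_token)
    ranked = sorted(
        zip(tokens, weights[query_index]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy example (values invented for illustration only).
toy_tokens = ["[CLS]", "a", "banana", "[SEP]", "it", "[SEP]"]
uniform_row = [1 / 6] * 6
toy_weights = [
    uniform_row,
    uniform_row,
    uniform_row,
    uniform_row,
    [0.05, 0.12, 0.60, 0.10, 0.08, 0.05],  # row for the query token "it"
    uniform_row,
]

print(strongest_attention(toy_tokens, toy_weights, "it", k=2))
# -> [('banana', 0.6), ('a', 0.12)]
```

Running the same ranking before and after replacing 'a banana' with 'two bananas' gives you numbers to compare in your discussion.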


In [None]:
head_view(attention, tokens, sentence_b_start)


# Assignment 3

Fine-tune an LLM (not restricted to DistilBERT!) on one or more of the [GLUE Benchmark tasks](https://gluebenchmark.com/), aiming to achieve a score better than the best reported baseline score on the [GLUE Benchmark Leaderboard](https://gluebenchmark.com/leaderboard). You may choose to use hyperparameter tuning. **Discuss the results you obtained.**

Get inspiration from the [Seminar 3 Notebook](https://colab.research.google.com/drive/13-H6-Zl1y3DFvLjNIswnyrZtni--Ixle?usp=sharing).

Note that you can [run the GLUE Benchmark baselines yourself](https://github.com/nyu-mll/GLUE-baselines).

In [None]:
## add your code blocks here

# Assignment 4

In this assignment you will compare three sub-word tokenizers by applying them to the following two sentences:

> *1) Investor excitement reached a new peak this week.*

> *2) Idk send a dm to my fam #lol*

Load in the following three tokenizers, and use them to tokenize the two given sentences.

- distilbert-base-uncased
- jhu-clsp/bernice
- xlnet-base-cased


In [None]:
from transformers import AutoTokenizer

distilbert_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
bernice_tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/bernice")
xlnet_tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

sentence1 = "Investor excitement reached a new peak this week."
sentence2 = "Idk send a dm to my fam #lol"

db1 = distilbert_tokenizer.tokenize(sentence1)
db2 = distilbert_tokenizer.tokenize(sentence2)
bern1 = bernice_tokenizer.tokenize(sentence1)
bern2 = bernice_tokenizer.tokenize(sentence2)
xl1 = xlnet_tokenizer.tokenize(sentence1)
xl2 = xlnet_tokenizer.tokenize(sentence2)

print('Sentence1:\nDistilbert:\t{}'.format(db1))
print('\nBernice:\t{}'.format(bern1))
print('\nXLNet:\t{}'.format(xl1))

print('\nSentence2:\nDistilbert:\t{}'.format(db2))
print('\nBernice:\t{}'.format(bern2))
print('\nXLNet:\t{}'.format(xl2))

Sentence1:
Distilbert:     ['investor', 'excitement', 'reached', 'a', 'new', 'peak',
'this', 'week', '.']

Bernice:        ['▁Investor', '▁excitement', '▁reached', '▁a', '▁new', '▁peak',
'▁this', '▁week', '.']

XLNet:  ['▁Investor', '▁excitement', '▁reached', '▁a', '▁new', '▁peak', '▁this',
'▁week', '.']

Sentence2:
Distilbert:     ['id', '##k', 'send', 'a', 'd', '##m', 'to', 'my', 'fa', '##m',
'#', 'lo', '##l']

Bernice:        ['▁Idk', '▁send', '▁a', '▁dm', '▁to', '▁my', '▁fam', '▁#',
'lol']

XLNet:  ['▁I', 'd', 'k', '▁send', '▁a', '▁', 'd', 'm', '▁to', '▁my', '▁', 'fa',
'm', '▁', '#', 'lol']


**What differences do you observe between the three tokenizers when tokenizing sentence 1? And sentence 2? How do you explain these differences?**
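
As background for this question, the `##` pieces in the DistilBERT output come from WordPiece, which splits a word by greedily matching the longest vocabulary entry from left to right. The sketch below is a simplified illustration of that matching step with a tiny hypothetical vocabulary; it is not the actual `distilbert-base-uncased` vocabulary or the library's implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary without social-media abbreviations.
toy_vocab = {"send", "id", "##k", "d", "##m", "fa", "lo", "##l"}

print(wordpiece("idk", toy_vocab))   # -> ['id', '##k']
print(wordpiece("fam", toy_vocab))   # -> ['fa', '##m']
print(wordpiece("send", toy_vocab))  # -> ['send']

# If the abbreviation itself is in the vocabulary, it stays whole.
print(wordpiece("idk", toy_vocab | {"idk"}))  # -> ['idk']
```

The last call hints at why a tokenizer whose vocabulary was built from social-media text can keep abbreviations intact, while one built from edited text splits them into pieces.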

Now check the vocabulary size of the different tokenizers. **How do they differ, and what are the pros and cons of smaller or larger vocabulary sizes?**

In [None]:
print('Vocab size distilbert:\t{}'.format(distilbert_tokenizer.vocab_size))
print('Vocab size xlnet:\t{}'.format(xlnet_tokenizer.vocab_size))
print('Vocab size bernice:\t{}'.format(bernice_tokenizer.vocab_size))

Vocab size distilbert:  30522
Vocab size xlnet:       32000
Vocab size bernice:     250000
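
One concrete cost of a larger vocabulary is the size of the input embedding matrix, which has one row per vocabulary entry. The arithmetic below is a rough sketch using the vocabulary sizes printed above and an assumed hidden size of 768 for all three models; check each model's config before relying on that number.

```python
# Vocabulary sizes as printed by the cell above.
vocab_sizes = {
    "distilbert-base-uncased": 30522,
    "xlnet-base-cased": 32000,
    "jhu-clsp/bernice": 250000,
}

# Assumption for illustration: every model uses a hidden size of 768.
HIDDEN_SIZE = 768


def embedding_parameters(vocab_size, hidden_size):
    """Number of weights in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


for name, size in vocab_sizes.items():
    params = embedding_parameters(size, HIDDEN_SIZE)
    print(f"{name}: {params:,} embedding parameters")
# distilbert-base-uncased: 23,440,896 embedding parameters
# xlnet-base-cased: 24,576,000 embedding parameters
# jhu-clsp/bernice: 192,000,000 embedding parameters
```

Under this assumption, the largest vocabulary needs roughly eight times as many embedding weights as the smallest one, which is one side of the trade-off to weigh against shorter token sequences.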


Most tokenizers that are used for transformer architectures include characters, words and sub-words in their vocabulary, but not super-words. Would you expect that super-words can be beneficial to model performance? **Motivate why (not).**

# Assignment 5

In this assignment you are going to use Transformer models to generate short stories.

You will need to install the `bitsandbytes`, `accelerate` and `transformers` packages. You may also need to restart the Google Colab session.

In [None]:
## !pip uninstall -y bitsandbytes accelerate transformers
!pip install bitsandbytes accelerate
!pip install --upgrade transformers

Found existing installation: bitsandbytes 0.42.0
Uninstalling bitsandbytes-0.42.0:
  Successfully uninstalled bitsandbytes-0.42.0
Found existing installation: accelerate 0.5.1
Uninstalling accelerate-0.5.1:
  Successfully uninstalled accelerate-0.5.1
Found existing installation: transformers 4.38.1
Uninstalling transformers-4.38.1:
  Successfully uninstalled transformers-4.38.1
Collecting bitsandbytes
  Using cached bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Collecting accelerate
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
Installing collected packages: bitsandbytes, accelerate
Successfully installed accelerate-0.27.2 bitsandbytes-0.42.0
Collecting transformers
  Using cached transformers-4.38.1-py3-none-any.whl (8.5 MB)
Installing collected packages: transformers
Successfully installed transformers-4.38.1


Now load a model. As an example, we use GPT-2 (1.5B) with int-8 quantization to save compute and memory (we will learn more about this approach in week 6).

You can also explore other models, including more recent ones such as Google [Gemma](https://huggingface.co/blog/gemma), which requires a HuggingFace login token. It is possible to run Gemma-7B with int-8 quantization on a T4 GPU in Google Colab.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, GenerationConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_name = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantization_config)


`low_cpu_mem_usage` was None, now set to True since model is quantized.



Specify the configurations for generation. We offer an example but feel free to explore other options (see links below for explanations for each parameter):
- https://huggingface.co/docs/transformers/generation_strategies
- https://huggingface.co/docs/transformers/v4.38.1/en/main_classes/text_generation#transformers.GenerationConfig

In [None]:
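generation_config = GenerationConfig(
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.eos_token_id,  # GPT-2 has no pad token, so reuse EOS
    max_new_tokens=50,   # upper bound on the number of generated tokens
    penalty_alpha=0.5,   # contrastive-search penalty (used when do_sample=False)
    top_k=4,             # restrict each step to the 4 most likely tokens
    temperature=0.7,     # values below 1 sharpen the next-token distribution
    do_sample=True,      # sample from the distribution instead of decoding greedily
)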
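
To build intuition for `top_k` and `temperature`, the following pure-Python sketch shows one sampling step on hypothetical toy logits. It is a simplified illustration of the idea, not the `transformers` implementation, and the function names and logit values are invented for this example.

```python
import math
import random


def top_k_temperature_probs(logits, top_k, temperature):
    """Keep the top_k highest logits, scale by temperature, apply softmax."""
    kept = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    scaled = [(token, logit / temperature) for token, logit in kept]
    max_scaled = max(value for _, value in scaled)  # for numerical stability
    exps = [(token, math.exp(value - max_scaled)) for token, value in scaled]
    total = sum(value for _, value in exps)
    return {token: value / total for token, value in exps}


def sample_token(probs, rng):
    """Draw one token according to its probability."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


# Hypothetical next-token logits (invented for illustration).
toy_logits = {"the": 2.0, "a": 1.5, "exam": 1.0, "banana": 0.2, "zebra": -1.0}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
for temperature in (1.0, 0.7, 0.3):
    probs = top_k_temperature_probs(toy_logits, top_k=3, temperature=temperature)
    rounded = {token: round(p, 3) for token, p in probs.items()}
    print(temperature, rounded, "sampled:", sample_token(probs, rng))
```

Lower temperatures concentrate probability on the most likely token, and a small `top_k` removes unlikely tokens from consideration entirely, so both settings trade diversity for predictability.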

Here is an example of prompting the model:

In [None]:
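def prompting(input_text, model, tokenizer, generation_config):
    """Generate a continuation of `input_text` and return the decoded text."""
    # Tokenize the prompt and move the tensors to the device the model is on
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, generation_config=generation_config)
    # outputs[0] holds the prompt tokens followed by the generated tokens
    return tokenizer.decode(outputs[0])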

example_input = "I am an applied data science student taking the midterm exam of Transformers course. I"
example_generation = prompting(example_input, model, tokenizer, generation_config)
print(example_generation)


I am an applied data science student taking the midterm exam of Transformers course. I have been looking forward to the exam for quite a while now. I am currently working on a project which will be the final exam. I have been preparing my questions for the exam for the last couple of weeks and I am now ready for the test


Please answer the following questions:

1) Use five different prompts to generate stories on different topics. Discuss the generation quality.

2) Select one prompt, and use this prompt to generate five different stories. State your observations.

3) Can BERT also be used for generating stories? Explain your reasoning.