# Project 4: Generating and Finetuning Transformer Language Models With Huggingface 

In this project, you will first learn how to use Huggingface's Transformers library to load large language models. Next, we will generate text from these models. Finally, we will finetune models on two tasks (sentiment analysis and machine translation).

This project is more open-ended than the previous projects. We expect you to learn how to use the Huggingface and PyTorch documentation.

## Setup

First we install and import the required dependencies. These include:
* `torch` for modeling and training
* `transformers` for pre-trained models
* `datasets` from huggingface to load existing datasets.

In [1]:
%%capture
!pip install transformers
!pip install datasets
!pip install --upgrade sacrebleu sentencepiece

# Core imports used throughout the notebook
import torch
from torch.utils.data import Dataset, random_split
from transformers import AutoTokenizer, TrainingArguments, Trainer, AutoModelWithLMHead

Before proceeding, let's verify that we're connected to a GPU runtime and that `torch` can detect the GPU.
We'll define a variable `device` here to use throughout the code so that we can easily change to run on CPU for debugging.

In [2]:
assert torch.cuda.is_available()
device = torch.device("cuda")
print("Using device:", device)

Using device: cuda


### Loading Model

We will use GPT-2 medium for this project. This includes both the GPT-2 tokenizer and the GPT-2 model weights themselves. If you want to learn more about this model, you can read the GPT-2 paper: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

Let's first load the tokenizer for the GPT-2 medium model. You can find out how to do this by reading the documentation for `AutoTokenizer` in transformers and finding the GPT-2 model with ~345 million parameters.

In [3]:
from transformers import AutoTokenizer
# Your code here
tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")


Let's tokenize and detokenize some text from this model.

In [4]:
print(tokenizer.encode('Hello world'))
print(tokenizer.decode(tokenizer.encode('Hello world')))
print(tokenizer.encode("Hola, cómo estás😍"))

[15496, 995]
Hello world
[39, 5708, 11, 269, 10205, 5908, 1556, 40138, 47249, 235]


Now let's load the GPT-2 medium model. Make sure you also put the model onto the GPU.

In [5]:
from transformers import AutoModelWithLMHead
# Your code here
gpt2_model = AutoModelWithLMHead.from_pretrained('gpt2-medium')
gpt2_model = gpt2_model.to(device)




## Generate From the Model

Now let's generate some text from the model to test its LM capabilities. Generate 10 pieces of text, each 50 tokens long, using random sampling with the temperature set to 0.7. Random sampling keeps the text reasonably diverse, while a temperature below 1 sharpens the distribution toward likely tokens and so maintains reasonable quality. When generating text, you can condition on a phrase such as "The coolest thing right now in NLP is". Use the Huggingface documentation to find the relevant function and arguments for generating text.

Hint: you may find https://huggingface.co/docs/transformers/main_classes/text_generation to be useful for learning about generating from LMs.
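To build intuition for the temperature setting described above, here is a minimal pure-Python sketch. It is not the Huggingface implementation: the logits are made-up values for a hypothetical four-token vocabulary, and `softmax_with_temperature` is an illustrative helper name.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a toy 4-token vocabulary (illustration only).
toy_logits = [2.0, 1.0, 0.5, -1.0]

probs_t1 = softmax_with_temperature(toy_logits, 1.0)
probs_t07 = softmax_with_temperature(toy_logits, 0.7)

# A lower temperature concentrates probability mass on the most likely token.
print([round(p, 3) for p in probs_t1])
print([round(p, 3) for p in probs_t07])

# Random sampling draws token indices according to these probabilities.
rng = random.Random(0)
sampled = rng.choices(range(len(toy_logits)), weights=probs_t07, k=10)
print(sampled)
```

With a temperature of 0.7 the top token receives a larger share of the probability than at temperature 1, which is why sampled text tends to be less erratic.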

In [6]:
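# Note: GPT-2 has no "<|startoftext|>" special token, so this marker is tokenized as ordinary text.
inputs = tokenizer("<|startoftext|>The coolest thing right now in NLP is", return_tensors="pt").input_ids.to(device)
# Your code here
# max_length counts the prompt tokens as well as the newly generated tokens.
sample_outputs = gpt2_model.generate(inputs, max_length=50, min_length=50, temperature=0.7, num_return_sequences=10, do_sample=True)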

Now let's print the generated text.

In [7]:
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

0: <|startoftext|>The coolest thing right now in NLP is that you can get the start of text in your questions in a real-time. The trick is to use two different algorithms: one that uses NLP's built-
1: <|startoftext|>The coolest thing right now in NLP is that the data is stored in a structured text format. This means that, once you have a word in your memory, it's impossible to replace it with anything else until
2: <|startoftext|>The coolest thing right now in NLP is the fact that, for the first time in decades, we're able to use deep learning to understand the meaning of words," says Andrew McAfee, Ph.D.,
3: <|startoftext|>The coolest thing right now in NLP is the ability to combine multiple features of multiple languages with a single language tag. That means you can combine NLP features into a single tag. One thing I have discovered recently
4: <|startoftext|>The coolest thing right now in NLP is talking to people about whatever you want. You could be talking about how to make this 

Now generate one piece of text of length 50 with the same prompt ("The coolest thing right now in NLP is"), but use greedy decoding, which always picks the most likely next token (the limit of sampling as the temperature approaches 0). This roughly corresponds to generating text that has high likelihood under the model.
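As a toy illustration of greedy decoding, the sketch below repeatedly picks the most probable next token from a hand-written table. The table, the token names, and `greedy_decode` are all hypothetical, and unlike a real LM the toy conditions only on the previous token rather than on the full prefix.

```python
# Hypothetical next-token probabilities for a toy vocabulary (illustration only).
toy_next_token_probs = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"model": 0.7, "data": 0.3},
    "a": {"model": 0.5, "token": 0.5},
    "model": {"<end>": 0.9, "the": 0.1},
    "data": {"<end>": 1.0},
    "token": {"<end>": 1.0},
}

def greedy_decode(table, start="<start>", end="<end>", max_steps=10):
    """At every step, append the single most probable next token (no sampling)."""
    sequence = [start]
    for _ in range(max_steps):
        candidates = table[sequence[-1]]
        next_token = max(candidates, key=candidates.get)
        sequence.append(next_token)
        if next_token == end:
            break
    return sequence

greedy_sequence = greedy_decode(toy_next_token_probs)
print(greedy_sequence)
```

Because no randomness is involved, running the function again always returns the same sequence, which is the defining property of greedy decoding.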

In [8]:
inputs = tokenizer("<|startoftext|>The coolest thing right now in NLP is", return_tensors="pt").input_ids.to(device)
# Your code here
# Greedy decoding is selected with do_sample=False; temperature only applies when sampling.
output = gpt2_model.generate(inputs, max_length=50, min_length=50, do_sample=False)
for i, sample_output in enumerate(output):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

0: <|startoftext|>The coolest thing right now in NLP is the ability to use the word "startoftext" to refer to a word that is not part of the text. This is useful for things like "the word"


Now let's see how good a translation system GPT-2 medium is when used "out of the box". To do this, we can condition on a prompt like the one below and generate from the model with greedy decoding. This will attempt to translate the sentence "UC Berkeley ist eine Schule in Kalifornien", which means "UC Berkeley is a school in California". Make sure to set the max length high enough that the model generates sufficient text.

In [9]:
prompt = """Translate the following texts into English.

German: UC Berkeley ist eine Schule in Kalifornien
English:"""

In [10]:
# Your code here. Generate from the model using greedy decoding with the above prompt
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
# Greedy decoding is selected with do_sample=False; temperature only applies when sampling.
sample_outputs = gpt2_model.generate(inputs, do_sample=False, max_length=100)
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

0: Translate the following texts into English.

German: UC Berkeley ist eine Schule in Kalifornien
English: UC Berkeley ist eine Schule in Kalifornien

English: UC Berkeley ist eine Schule in Kalifornien

English: UC Berkeley ist eine Schule in Kalifornien

English: UC Berkeley ist eine Schule in Kalifornien

English: UC Berkeley ist


As we can see, the translation quality is terrible: the model just repeats the German sentence instead of translating it.

Now, let's finetune GPT-2 on the translation task to improve the results. We will use a translation dataset from the Huggingface dataset repository, which has thousands of other datasets available. This dataset consists of TED talks translated between German and English.

In [11]:
import datasets
dataset = datasets.load_dataset("ted_talks_iwslt", language_pair=("de", "en"), year="2014")


In [12]:
tokenizer.pad_token = tokenizer.eos_token

In [13]:
print(dataset['train'][0]['translation'])

{'de': '"Ich habe Zerebralparese. Ich zappele die ganze Zeit", kündigt Maysoon Zayid zu Anfang dieses ungeheuer witzigen, erheiternden an. (Er ist wirklich ungeheur witzig.) "Als würde Shakira auf Muhammad Ali treffen." Elegant und scharfsinnig nimmt uns die arabisch-amerikanische Komikerin auf eine Reise durch ihre Abenteuer als Schauspielerin, Komikerin, Philanthropin und Fürsprecherin für Menschen mit Behinderungen mit.', 'en': '"I have cerebral palsy. I shake all the time," Maysoon Zayid announces at the beginning of this exhilarating, hilarious talk. (Really, it\'s hilarious.) "I\'m like Shakira meets Muhammad Ali." With grace and wit, the Arab-American comedian takes us on a whistle-stop tour of her adventures as an actress, stand-up comic, philanthropist and advocate for the disabled.'}


Now we can create a dataset. Each element in the dataset should consist of a text prompt followed by the translation, similar to the prompt above. Your job is to fill in the labels field below. This field sets the labels to use for training during the language modeling task.

For the labels, we only want to train the model to output the text after the word "English:". This is because everything in the prompt up to and including "English:" will also be provided to the model as input. Hint: use -100 as the label for tokens you do not want to train on.
Hint 2: When doing LM training, the labels are the same as the input tokens, except shifted to the left by one. You should check whether Huggingface is already doing the shifting, or whether you need to do the shifting yourself.

One thing to be careful of with all LMs is extra spaces. The text should be formatted like "English: Hello...", not "English:  Hello...". This is a common problem people face when using APIs like GPT-3, which we will cover next time.
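The effect of the -100 labels and of the shift can be checked on a toy example. The sketch below is pure Python, not the Huggingface loss: the token ids and predicted probabilities are made up, `causal_lm_loss` is a hypothetical helper, and it assumes the shift happens inside the loss so that `labels` has the same length as `input_ids`.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips

def causal_lm_loss(probs_per_position, labels):
    """Average negative log-likelihood, with the shift done inside the loss.

    probs_per_position[t] is the (toy) predicted distribution over the next
    token after reading input position t, so it is scored against labels[t + 1].
    """
    losses = []
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE_INDEX:
            continue  # masked positions contribute nothing to the loss
        losses.append(-math.log(probs_per_position[t][target]))
    return sum(losses) / len(losses)

# Toy example: 3 prompt tokens followed by 2 target tokens (ids are made up).
input_ids = [5, 6, 7, 1, 2]
prompt_length = 3
labels = [IGNORE_INDEX] * prompt_length + input_ids[prompt_length:]

# Made-up predictions over a tiny target vocabulary {0, 1, 2}.
uniform = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
toy_probs = [uniform, uniform, {0: 0.1, 1: 0.8, 2: 0.1}, {0: 0.1, 1: 0.1, 2: 0.8}, uniform]

toy_loss = causal_lm_loss(toy_probs, labels)
print(labels)
print(toy_loss)
```

Only the two predictions that target the unmasked tokens enter the average; the predictions made while reading the prompt are ignored.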

In [14]:
prompt = """Translate the following texts into English.
German: """

class TranslationDataset(Dataset):
    def __init__(self, examples, tokenizer):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        for example in examples:
            training_text = prompt + example['translation']['de'] + '\nEnglish: ' + example['translation']['en'] + "<|endoftext|>"
            encodings_dict = tokenizer(training_text, max_length=275, padding="max_length", truncation=True)
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))
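            prompt_and_input_length = len(tokenizer.encode(prompt + example['translation']['de'] + '\nEnglish:'))
            # your code below
            # The labels line up with input_ids (no manual shift). Mask the prompt with -100,
            # and also mask padding positions (attention_mask == 0), so that only the English
            # translation and its end-of-text token are trained on.
            input_ids = encodings_dict['input_ids']
            attention_mask = encodings_dict['attention_mask']
            prompt_len = min(prompt_and_input_length, len(input_ids))  # guard against truncated examples
            labels = [-100] * prompt_len + [
                token if mask == 1 else -100
                for token, mask in zip(input_ids[prompt_len:], attention_mask[prompt_len:])
            ]
            self.labels.append(torch.tensor(labels))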


    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {'input_ids':self.input_ids[idx], 'attention_mask':self.attn_masks[idx], 'labels':self.labels[idx]}

In [15]:
translation_dataset = TranslationDataset(dataset['train'], tokenizer)

Now let's split the dataset into training and validation sets.

In [16]:
train_size = int(0.9 * len(translation_dataset))
train_dataset, val_dataset = random_split(translation_dataset, [train_size, len(translation_dataset) - train_size])
print(len(train_dataset))
print(len(val_dataset))

2674
298


In [17]:
print(train_dataset[0])

{'input_ids': tensor([ 8291, 17660,   262,  1708, 13399,   656,  3594,    13,   198, 16010,
           25, 18623, 38294,    25,  1810,   388,   266,   343,   266,   798,
          263,  1976,   388,  8706,   285,  9116,   824,   268,   198, 15823,
           25, 18623, 38294,    25,  4162,   356,   761,   284,   467,   736,
          284,  8706, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 

Now we can use the Huggingface Trainer to finetune GPT-2 on this dataset. This abstracts away all of the details of training. Set up the training arguments to perform 3 epochs of training on this dataset, with a per-device batch size of 2, gradient accumulation set to 8, 100 warmup steps, and a weight decay of 0.05. Set the eval batch size to 2. Save a checkpoint every 250 steps. Set fp16 to True. Save the checkpoints in a specific output_dir so you can load them later. Hint: if it tries to launch Wandb, you may add the argument report_to="none".
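It can help to work out how these settings combine before training. The arithmetic below is a rough sketch using the numbers from this notebook (2674 training examples); it assumes a single GPU and one optimizer step per full group of accumulated batches, and the variable names are only illustrative.

```python
import math

# Numbers taken from this notebook: 2674 training examples, batch size 2,
# gradient accumulation 8, 3 epochs. A single GPU is assumed.
num_train_examples = 2674
per_device_batch_size = 2
gradient_accumulation_steps = 8
num_epochs = 3

# Each optimizer step sees batch_size * accumulation_steps examples.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
batches_per_epoch = math.ceil(num_train_examples / per_device_batch_size)
# Assumption: one optimizer step per full group of accumulated batches.
optimizer_steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_optimizer_steps = optimizer_steps_per_epoch * num_epochs

print(effective_batch_size)
print(optimizer_steps_per_epoch)
print(total_optimizer_steps)
```

Under these assumptions the total comes to 501 optimizer steps, which agrees with the `global_step=501` reported by the training run in this notebook.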

In [18]:
# Your code here
# Note: this configuration evaluates and saves once per epoch (rather than every 250 steps),
# so the final checkpoint is written at the last training step.
training_args = TrainingArguments(output_dir="try1",
                                  warmup_steps=100,
                                  num_train_epochs=3,
                                  weight_decay=0.05,
                                  per_device_train_batch_size=2,
                                  per_device_eval_batch_size=2,
                                  gradient_accumulation_steps=8,
                                  evaluation_strategy="epoch",
                                  save_strategy="epoch",
                                  fp16=True,
                                  report_to="none",
                                 )

Next create a Huggingface Trainer object and call train() on it.

In [19]:
# Your code here
trainer = Trainer(
    model=gpt2_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset = val_dataset,
    tokenizer=tokenizer,
)
trainer.train()



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 0     | No log        | 0.560738        |
| 1     | No log        | 0.524275        |
| 2     | 0.860700      | 0.522176        |


TrainOutput(global_step=501, training_loss=0.8597437055287009, metrics={'train_runtime': 1211.8767, 'train_samples_per_second': 6.619, 'train_steps_per_second': 0.413, 'total_flos': 4000487073792000.0, 'train_loss': 0.8597437055287009, 'epoch': 3.0})

Now load your saved checkpoint and see how well the finetuned GPT-2 model does on translating the sentence from before.

In [37]:
prompt = """Translate the following texts into English.

German: UC Berkeley ist eine Schule in Kalifornien
English:"""

In [38]:
tokenizer = AutoTokenizer.from_pretrained("try1/checkpoint-501")
model = AutoModelWithLMHead.from_pretrained("try1/checkpoint-501")
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1024)
    (wpe): Embedding(1024, 1024)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout)

In [39]:
# your code here
inputs = tokenizer.encode(prompt, return_tensors="pt", padding=True, truncation=True)
output = model.generate(input_ids=inputs, max_length=100, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)

In [40]:
# print(output.shape)
# print(output[0])
translation = tokenizer.decode(output[0], skip_special_tokens=True)
print(translation)


Translate the following texts into English.

German: UC Berkeley ist eine Schule in Kalifornien
English: University of California, Berkeley: A school in Cambodia


If training went correctly, you should see a reasonable translation of the sentence, with some errors.

For the project report, find two sentences where the model succeeds and two sentences where the model fails. Describe what might be causing these types of failures.

In [41]:
prompt2 = """Translate the following texts into English.

German: Das Schlossgespenst schwatzte geschwind und geschwätzig in der Schlosskapelle.
English:"""
inputs2 = tokenizer.encode(prompt2, return_tensors="pt", padding=True, truncation=True)
output2 = model.generate(input_ids=inputs2, max_length=100, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
print(output2.shape)
print(output2[0])
translation2 = tokenizer.decode(output2[0], skip_special_tokens=True)
print(translation2)
# actual translation : "The castle ghost chattered swiftly and garrulously in the castle chapel."

torch.Size([1, 61])
tensor([ 8291, 17660,   262,  1708, 13399,   656,  3594,    13,   198,   198,
        16010,    25, 29533,  3059, 22462,  3212,  3617,   301,  5513,    86,
        27906,   660,   308,   274,   354,  7972,  3318,   308,   274,   354,
           86, 11033, 22877,   328,   287,  4587,  3059, 22462,    74,   499,
        13485,    13,   198, 15823,    25,   383, 20073,   286,   257, 24790,
          268,  1748,  6486,   739,   262,  4417,   286,   262,  9151,    13,
        50256])
Translate the following texts into English.

German: Das Schlossgespenst schwatzte geschwind und geschwätzig in der Schlosskapelle.
English: The ruins of a sunken city lie under the surface of the ocean.


In [42]:
print(dataset['train'][0]['translation'])

{'de': '"Ich habe Zerebralparese. Ich zappele die ganze Zeit", kündigt Maysoon Zayid zu Anfang dieses ungeheuer witzigen, erheiternden an. (Er ist wirklich ungeheur witzig.) "Als würde Shakira auf Muhammad Ali treffen." Elegant und scharfsinnig nimmt uns die arabisch-amerikanische Komikerin auf eine Reise durch ihre Abenteuer als Schauspielerin, Komikerin, Philanthropin und Fürsprecherin für Menschen mit Behinderungen mit.', 'en': '"I have cerebral palsy. I shake all the time," Maysoon Zayid announces at the beginning of this exhilarating, hilarious talk. (Really, it\'s hilarious.) "I\'m like Shakira meets Muhammad Ali." With grace and wit, the Arab-American comedian takes us on a whistle-stop tour of her adventures as an actress, stand-up comic, philanthropist and advocate for the disabled.'}


In [44]:
# Success case, likely because this sentence is in the training set.
# Hint: One failure mode for GPT-2 is that it may generate fluent sentences that are actually unrelated to the input

prompt3 = """Translate the following texts into English.

German: Ich habe Zerebralparese.
English:"""
inputs3 = tokenizer.encode(prompt3, return_tensors="pt", padding=True, truncation=True)
output3 = model.generate(input_ids=inputs3, max_length=100, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
print(output3.shape)
print(output3[0])
translation3 = tokenizer.decode(output3[0], skip_special_tokens=True)
print(translation3)
#Actual translation : "I have cerebral palsy."

torch.Size([1, 32])
tensor([ 8291, 17660,   262,  1708, 13399,   656,  3594,    13,   198,   198,
        16010,    25, 26364,   387,  1350,  1168,   567, 24427,    79,   533,
          325,    13,   198, 15823,    25,   314,   423, 31169, 39898,    88,
           13, 50256])
Translate the following texts into English.

German: Ich habe Zerebralparese.
English: I have cerebral palsy.


In [48]:
# Failure case: this sentence is taken from a training example, yet the output is fluent but unrelated to the input.
# Hint: One failure mode for GPT-2 is that it may generate fluent sentences that are actually unrelated to the input

prompt3 = """Translate the following texts into English.

German: Als würde Shakira auf Muhammad Ali treffen.
English:"""
inputs3 = tokenizer.encode(prompt3, return_tensors="pt", padding=True, truncation=True)
output3 = model.generate(input_ids=inputs3, max_length=100, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
print(output3.shape)
print(output3[0])
translation3 = tokenizer.decode(output3[0], skip_special_tokens=True)
print(translation3)
# Actual translation: "I'm like Shakira meets Muhammad Ali."

torch.Size([1, 40])
tensor([ 8291, 17660,   262,  1708, 13399,   656,  3594,    13,   198,   198,
        16010,    25,   978,    82,   266, 25151,  2934, 35274,  8704,   257,
         3046, 15870, 12104,  2054, 46985,    13,   198, 15823,    25,  1081,
           64, 18971,  3245,    25, 19098,   329,  1466,   338,  2489, 50256])
Translate the following texts into English.

German: Als würde Shakira auf Muhammad Ali treffen.
English: Asa Butterfield: Fighting for women's rights


Finally, revisit the code from Project 2 for loading and running the Multi30k dataset. Your goal is to translate the test set using the GPT-2 model you just finetuned. You will then submit your test predictions as a txt file, with your model's prediction for each test example on a separate line. Feel free to copy and paste any code from Project 2 that may be useful. Submit the file, named mt_predictions.txt, to Gradescope.
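One way to guarantee the one-prediction-per-line format is sketched here. The helper name `write_predictions` and the example predictions are hypothetical, and whether the autograder expects a trailing newline is an assumption you should check against its output.

```python
import os
import tempfile

def write_predictions(predictions, path):
    """Write one prediction per line, collapsing internal newlines and extra spaces."""
    cleaned = [" ".join(p.split()) for p in predictions]
    with open(path, "w", encoding="utf-8") as f:
        # Assumption: a trailing newline is acceptable to the grader.
        f.write("\n".join(cleaned) + "\n")
    return cleaned

# Made-up predictions; the second contains a stray newline, the third is empty.
toy_predictions = ["A man rides a bike.", "Two dogs\nrun in a field.", ""]
out_path = os.path.join(tempfile.mkdtemp(), "mt_predictions.txt")
write_predictions(toy_predictions, out_path)

# Read the file back and confirm there is exactly one line per example.
with open(out_path, encoding="utf-8") as f:
    lines = f.read().split("\n")[:-1]
print(len(lines) == len(toy_predictions))
```

A generated translation that contains a newline would otherwise shift every later prediction onto the wrong line, so it is worth checking the line count before submitting.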

The GPT-2 model may not work that well on the Multi30k dataset because of distribution shift: the Multi30k data looks different from the TED talks data that you finetuned the model on. The takeaway I want people to have is that a general-purpose LM system can be decent at a task like translation; however, a domain-specific model, such as an LSTM trained specifically on Multi30k, can outperform the general-purpose model.

For the project report, compare two translations from the GPT-2 model and the LSTM model. Which one works better?

Hint: One failure mode for GPT-2 is that it may generate fluent sentences that are actually unrelated to the input.

In [27]:
%%capture
!pip install torchtext==0.6

import torchtext




In [28]:
# Your code for generating mt_predictions.txt below

extensions = [".de", ".en"]
source_field = torchtext.data.Field(tokenize=lambda x: x)
target_field = torchtext.data.Field(tokenize=lambda x: x)
training_data, validation_data, test_data = torchtext.datasets.Multi30k.splits(
    extensions, [source_field, target_field], root=".")
print("Number of training examples:", len(training_data))
print("Number of validation examples:", len(validation_data))
print("Number of test examples:", len(test_data))
print()
  


Number of training examples: 29000
Number of validation examples: 1014
Number of test examples: 1000



In [35]:
translations = []
for i, example in enumerate(test_data):
    print(i)  # progress indicator
    german_sentence = example.src
    # Build the prompt without indentation so it matches the prompt format used earlier.
    prompt = ("Translate the following texts into English.\n\n"
              f"German: {german_sentence}\n"
              "English:")
    input_sentence = tokenizer.encode(prompt, return_tensors="pt", padding=True, truncation=True)
    output_sentence = model.generate(input_ids=input_sentence, max_length=100, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
    # Decode only the newly generated tokens, i.e. everything after the prompt.
    generated_tokens = output_sentence[0][input_sentence.shape[1]:]
    generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    # Keep only the first line so the file has exactly one prediction per example.
    generated_lines = generated_text.strip().splitlines()
    english = generated_lines[0].strip() if generated_lines else ""
    translations.append(english)

with open("mt_predictions.txt", "w") as f:
    f.write('\n'.join(translations))


## Sentiment Analysis

The beauty of language models is that we can apply this exact same machinery to solve a completely different task of sentiment analysis. Here, we will be given a movie review and the goal is to have the model predict whether the review is positive or negative.

First, we will load some sentiment analysis data. Your job is to copy what we did above for machine translation: load the dataset, build a class to create the dataset, and so on.

When doing so, use the prompt below. Put the text of the input in the first [], and in the second [] put the word Positive if the label is 1 and the word Negative if the label is 0. Make sure to also set the self.labels field correctly: we only want to compute a loss on the words Positive/Negative, and on no other tokens in the model's input.

The following is a movie review. [Movie Review Text Here]. The sentiment of the review is [Positive/Negative].
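A small helper makes it easier to keep the training text and the inference prompt consistent. This is only a sketch: `build_sentiment_text` is a hypothetical name, the review is made up, and it assumes the review carries its own final punctuation, so no extra period is added.

```python
def build_sentiment_text(review, label=None):
    """Format a review with the template above.

    With label=None, return only the prompt (for inference); otherwise append
    ' Positive' or ' Negative' as the training target.
    """
    prompt = f"The following is a movie review. {review.strip()} The sentiment of the review is"
    if label is None:
        return prompt
    return prompt + (" Positive" if label == 1 else " Negative")

# Made-up review text with a trailing space, to show that it gets stripped.
toy_review = "The plot was thin but the cast was wonderful. "
example_prompt = build_sentiment_text(toy_review)
example_training_text = build_sentiment_text(toy_review, label=1)
print(example_prompt)
print(example_training_text)
```

The prompt ends without a trailing space, and the space before Positive or Negative belongs to the label, which follows the spacing advice given earlier for "English:".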

In [None]:
import datasets
# If datasets.load_dataset("sst2") does not work for you, load SST-2 through GLUE instead:
dataset = datasets.load_dataset('glue', 'sst2')

Note: Some people reported that `datasets.load_dataset("sst2")` wasn't working and that they needed to use `dataset = datasets.load_dataset('glue', 'sst2')` instead, as in the cell above.

In [None]:
# Prefix and suffix follow the prompt template above and match the prompt used at inference time.
sentiment_prefix = "The following is a movie review. "
sentiment_suffix = " The sentiment of the review is"

class SentimentDataset(Dataset):
    # Your code below
    def __init__(self, examples, tokenizer):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        for example in examples:
            sentiment = 'Positive' if example['label'] == 1 else 'Negative'
            prompt_text = sentiment_prefix + example['sentence'].strip() + sentiment_suffix
            # The space before the label belongs to the label, not to the prompt.
            training_text = prompt_text + ' ' + sentiment + "<|endoftext|>"
            encodings_dict = tokenizer(training_text, max_length=275, padding="max_length", truncation=True)
            input_ids = encodings_dict['input_ids']
            attention_mask = encodings_dict['attention_mask']
            self.input_ids.append(torch.tensor(input_ids))
            self.attn_masks.append(torch.tensor(attention_mask))
            # Length of the prompt without the trailing space
            prompt_length = min(len(tokenizer.encode(prompt_text)), len(input_ids))
            # your code below
            # Mask the prompt and the padding with -100 so the loss covers only the label tokens.
            labels = [-100] * prompt_length + [
                token if mask == 1 else -100
                for token, mask in zip(input_ids[prompt_length:], attention_mask[prompt_length:])
            ]
            self.labels.append(torch.tensor(labels))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {'input_ids':self.input_ids[idx], 'attention_mask':self.attn_masks[idx], 'labels':self.labels[idx]}

In [None]:
sentiment_train_dataset = SentimentDataset(dataset['train'], tokenizer)
sentiment_val_dataset = SentimentDataset(dataset['validation'], tokenizer)

In [None]:
print(sentiment_train_dataset[0])

The data already comes with train and validation splits.

In [None]:
print(len(sentiment_train_dataset))
print(len(sentiment_val_dataset))

Now let's train the model using the same trainer arguments as before, except train for less than 1 epoch, because this dataset is quite large and training on the entire thing will take some time. Make sure you also use a different output_dir so it doesn't overwrite your old results.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")

In [None]:
gpt2_model = AutoModelWithLMHead.from_pretrained('gpt2-medium')
gpt2_model.to(device)

In [None]:
# Your code here
training_args = TrainingArguments(output_dir="sentiment",
                                  warmup_steps=100,
                                  num_train_epochs=0.2,  # a fraction of an epoch, since the dataset is large
                                  weight_decay=0.05,
                                  per_device_train_batch_size=2,
                                  per_device_eval_batch_size=2,
                                  gradient_accumulation_steps=8,
                                  evaluation_strategy="epoch",
                                  save_strategy="epoch",
                                  fp16=True,
                                  report_to="none",
                                 )




# Create the Trainer object
trainer = Trainer(
    model=gpt2_model,
    args=training_args,
    train_dataset=sentiment_train_dataset,
    eval_dataset=sentiment_val_dataset,
)

In [None]:
# Train the model
trainer.train()

At test-time, when you want to classify an incoming movie review, you can just check whether the model generates the words Positive or Negative as the final word.
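The check described above can be written as a small parsing function. This is a hedged sketch: `parse_sentiment` is a hypothetical helper that inspects the first word generated after the prompt, and the example strings are made up rather than real model output.

```python
def parse_sentiment(generated_text, prompt):
    """Map generated text to 1 (Positive), 0 (Negative), or None if unclear."""
    # Drop the prompt if the decoded text still begins with it.
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    words = continuation.strip().split()
    first_word = words[0].strip(".,!?").lower() if words else ""
    if first_word == "positive":
        return 1
    if first_word == "negative":
        return 0
    return None

# Made-up prompt and generations (not real model output).
toy_prompt = "The following is a movie review. Great fun. The sentiment of the review is"
result_positive = parse_sentiment(toy_prompt + " Positive", toy_prompt)
result_negative = parse_sentiment(toy_prompt + " Negative.", toy_prompt)
result_unclear = parse_sentiment(toy_prompt + " mixed", toy_prompt)
print(result_positive, result_negative, result_unclear)
```

Returning None for unclear generations lets you decide separately how to handle them, for example by falling back to one of the two classes.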

In [None]:
prompt = """The following is a movie review. The acting was great but overall I was left disappointed by the film. The sentiment of the review is"""

In [None]:
# Your code here
# tokenizer = AutoTokenizer.from_pretrained("sentiment/checkpoint-842")
model = AutoModelWithLMHead.from_pretrained("sentiment/checkpoint-842")
model.eval()

In [None]:
# your code here
inputs = tokenizer.encode(prompt, return_tensors="pt", truncation=True)
output = model.generate(input_ids=inputs, max_length=100, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)

# print(output.shape)
# print(output[0])
sentiment = tokenizer.decode(output[0], skip_special_tokens=True)
print(sentiment)


Finally, run the entire validation set through the model and get your model predictions. Save the results as a txt file, where each line contains just "1" if your model predicted Positive or "0" if your model predicted Negative. You will get full credit if your model's accuracy is greater than 80%. Save the file as sst_predictions.txt and submit it to Gradescope.
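Before submitting, you can estimate your accuracy locally, since the validation labels are available. The sketch below uses made-up predictions and gold labels; `accuracy` is an illustrative helper and not part of any library.

```python
def accuracy(predictions, gold_labels):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    correct = sum(int(p == g) for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

# Made-up file lines and gold labels (illustration only).
toy_prediction_lines = ["1", "0", "1", "1", "0"]
toy_gold = [1, 0, 0, 1, 0]

toy_accuracy = accuracy([int(line) for line in toy_prediction_lines], toy_gold)
print(toy_accuracy)
```

In practice the prediction lines would be read from sst_predictions.txt and the gold labels taken from the `label` field of the validation split.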

For the report, describe two possible improvements to your sentiment classifier.

In [None]:
print(dataset["validation"][0])

In [None]:
# Your code here for generating sst_predictions
sentiments = []
for review in dataset["validation"]:
    prompt = f"The following is a movie review. {review['sentence'].strip()} The sentiment of the review is"
    input_sentence = tokenizer.encode(prompt, return_tensors="pt", truncation=True)
    output_sentence = model.generate(input_ids=input_sentence, max_length=100, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
    # Decode only the newly generated tokens so that text inside the review cannot be mistaken for the label.
    generated_tokens = output_sentence[0][input_sentence.shape[1]:]
    generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
    # Write "1" for Positive and "0" for anything else.
    sentiments.append("1\n" if generated_text.startswith("Positive") else "0\n")

with open("sst_predictions.txt", "w") as f:
    f.write(''.join(sentiments))

## Submission

Turn in the following files on Gradescope:
* hw4.ipynb (this file; please rename to match)
* mt_predictions.txt (the predictions for the Multi30k test set)
* sst_predictions.txt (the predictions for the SST-2 validation set)
* report.pdf

Be sure to check the output of the autograder after it runs. It should confirm that no files are missing and that the output files have the correct format.