# Week 3: NLP Transformer Architecture


Applied Learning Assignment 1:

Apply transformers to a real-world text classification task

1. Fine-tune a pre-trained transformer from the Hugging Face library
(e.g., BERT or GPT).

2. Train the fine-tuned model on a text classification dataset (use any
dataset of your choice).

In [21]:

import torch

torch.cuda.is_available()


True

In [22]:

torch.cuda.device_count()


1

In [23]:

torch.cuda.current_device()


0

In [24]:

torch.cuda.device(0)


<torch.cuda.device at 0x786a69420b10>

In [25]:

torch.cuda.get_device_name(0)


'Tesla T4'

In [26]:
#!pip install -U datasets huggingface_hub fsspec
#  ↳ Runtime ▸ Restart runtime

In [27]:

from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')


In [28]:

# To load your own plain-text files instead (the file names are placeholders):
# datasets = load_dataset("text", data_files={"train": "path_to_train.txt", "validation": "path_to_validation.txt"})


In [29]:

datasets["train"][10]


{'text': ' The game \'s battle system , the BliTZ system , is carried over directly from Valkyira Chronicles . During missions , players select each unit using a top @-@ down perspective of the battlefield map : once a character is selected , the player moves the character around the battlefield in third @-@ person . A character can only act once per @-@ turn , but characters can be granted multiple turns at the expense of other characters \' turns . Each character has a field and distance of movement limited by their Action Gauge . Up to nine characters can be assigned to a single mission . During gameplay , characters will call out if something happens to them , such as their health points ( HP ) getting low or being knocked out by enemy attacks . Each character has specific " Potentials " , skills unique to each character . They are divided into " Personal Potential " , which are innate skills that remain unaltered unless otherwise dictated by the story and can either help or impede

In [30]:

from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
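
The retry loop in `show_random_elements` keeps drawing until it has enough distinct indices. A minimal sketch of the same idea using `random.sample`, which draws without replacement in one call, is shown here. The helper name `pick_distinct_indices` and the `seed` parameter are illustrative and not part of the notebook; 36718 is the size of the train split reported by the `Map` progress output.

```python
import random

def pick_distinct_indices(dataset_len, num_examples=10, seed=None):
    """Return num_examples distinct indices in [0, dataset_len)."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    # sample() draws without replacement, so no duplicate check is needed
    return rng.sample(range(dataset_len), num_examples)

picks = pick_distinct_indices(36718, num_examples=10, seed=0)
print(picks)
```

The resulting list can be passed to `dataset[picks]` exactly as in the function above.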


In [31]:

show_random_elements(datasets["train"])


Unnamed: 0,text
0,= = Production = = \n
1,= = Route description = = \n
2,"Once Dylan was well enough to resume creative work , he began to edit D. A. Pennebaker 's film of his 1966 tour . A rough cut was shown to ABC Television and rejected as incomprehensible to a mainstream audience . The film was subsequently titled Eat the Document on bootleg copies , and it has been screened at a handful of film festivals . In 1967 he began recording with the Hawks at his home and in the basement of the Hawks ' nearby house , "" Big Pink "" . These songs , initially demos for other artists to record , provided hits for Julie Driscoll and the Brian Auger Trinity ( "" This Wheel 's on Fire "" ) , The Byrds ( "" You Ain 't Goin ' Nowhere "" , "" Nothing Was Delivered "" ) , and Manfred Mann ( "" Mighty Quinn "" ) . Columbia released selections in 1975 as The Basement Tapes . Over the years , more songs recorded by Dylan and his band in 1967 appeared on bootleg recordings , culminating in a five @-@ CD set titled The Genuine Basement Tapes , containing 107 songs and alternative takes . In the coming months , the Hawks recorded the album Music from Big Pink using songs they worked on in their basement in Woodstock , and renamed themselves the Band , beginning a long recording and performing career of their own . \n"
3,
4,"A hemmema ( from Finnish "" Hämeenmaa "" , Tavastia ) was a type of warship built for the Swedish archipelago fleet and the Russian Baltic navy in the late 18th and early 19th centuries . The hemmema was initially developed for use against the Russian Navy in the Archipelago Sea and along the coasts of Svealand and Finland . It was designed by the prolific and innovative Swedish naval architect Fredrik Henrik af Chapman ( 1721 – 1808 ) in collaboration with Augustin Ehrensvärd ( 1710 – 1772 ) , an artillery officer and later commander of the Swedish archipelago fleet . The hemmema was a specialized vessel for use in the shallow waters and narrow passages that surround the thousands of islands and islets extending from the Swedish capital of Stockholm into the Gulf of Finland . \n"
5,
6,= = Production = = \n
7,""" Moment of Surrender "" is played in common time at a tempo of 87 beats per minute in a key of A minor . The song makes use of the conventional verse @-@ chorus form . The song begins with an uneven percussion loop , before an ambient synthesiser fades in and the drums enter at 0 : 08 . A cello part joins and the synthesiser plays the chord progression C – Am – F – C – G – E – D7 . At the end of the progression , 47 seconds into the song , the intensity of the synthesier rises before an organ , bass guitar , and piano subsequently enter . At 1 : 16 , Bono 's vocals enter and the first verse begins , lasting three stanzas . After the first chorus concludes and the second verse begins at 2 : 59 , the Edge begins playing a guitar riff . The second verse lasts two stanzas . After the second chorus , a piano interlude begins , with Lanois contributing pedal steel . The Edge begins a slide guitar solo at 4 : 59 that many critics compared to the playing style of Pink Floyd 's David Gilmour . After the third chorus ends at 6 : 11 , "" Oh @-@ oh @-@ ohhh "" vocals and a guitar figure bring the song to its conclusion . \n"
8,
9,


In [32]:

model_checkpoint = "gpt2"
tokenizer_checkpoint = "sgugger/gpt2-like-tokenizer"


In [33]:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)


In [34]:

def tokenize_function(examples):
    return tokenizer(examples["text"])


In [35]:

tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])


Map (num_proc=4):   0%|          | 0/4358 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/36718 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/3760 [00:00<?, ? examples/s]

In [36]:

tokenized_datasets["train"][1]



{'input_ids': [238, 8576, 9441, 2987, 238, 252],
 'attention_mask': [1, 1, 1, 1, 1, 1]}

In [37]:

# block_size = tokenizer.model_max_length
block_size = 128


In [17]:

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder so every chunk has exactly block_size tokens.
    # Padding would be an alternative if the model supported it; customize as needed.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
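
To see what `group_texts` does to a batch, here is a toy run with a block size of 4 instead of 128. The function body mirrors the cell above, but `block_size` is passed as a parameter and the token ids are made up for illustration.

```python
def group_texts_toy(examples, block_size):
    # Concatenate every list in the batch, column by column.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Round down to a multiple of block_size; the remainder is dropped.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Four "documents" of different lengths, one of them empty (made-up ids).
batch = {
    "input_ids": [[1, 2, 3], [], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [], [1, 1, 1, 1, 1], [1, 1]],
}
grouped = group_texts_toy(batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(grouped["labels"])     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The ten tokens become two full blocks and the last two tokens (9 and 10) are discarded. Document boundaries are not preserved, so a block can span several source texts.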


In [38]:

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)


Map (num_proc=4):   0%|          | 0/4358 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/36718 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/3760 [00:00<?, ? examples/s]

In [39]:

tokenizer.decode(lm_datasets["train"][1]["input_ids"])


' the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . \n The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for series newcomers . Character designer Raita Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large'
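
Each block serves as both input and target, because `labels` is a copy of `input_ids`. Hugging Face causal language models shift the labels by one position inside the model, so the token at position i is trained to predict the token at position i + 1. The sketch below only illustrates that pairing in plain Python; the helper name `next_token_pairs` is hypothetical, and the ids are the six tokens printed earlier for `tokenized_datasets["train"][1]`.

```python
def next_token_pairs(input_ids):
    """Pair each token with the token that follows it (the prediction target)."""
    labels = list(input_ids)  # labels start out as a copy of the inputs
    # Drop the last input and the first label to align position i with i + 1.
    return list(zip(input_ids[:-1], labels[1:]))

block = [238, 8576, 9441, 2987, 238, 252]
pairs = next_token_pairs(block)
print(pairs)
# [(238, 8576), (8576, 9441), (9441, 2987), (2987, 238), (238, 252)]
```

A block of 128 tokens therefore contributes 127 prediction targets to the loss.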

In [40]:

from transformers import AutoConfig, AutoModelForCausalLM

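# from_config builds the GPT-2 architecture with randomly initialised weights;
# no pre-trained weights are loaded, so the training below starts from scratch.
config = AutoConfig.from_pretrained(model_checkpoint)
model = AutoModelForCausalLM.from_config(config)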


In [41]:

from transformers import Trainer, TrainingArguments


In [44]:
import inspect
import transformers
print(transformers.__version__)
print(inspect.signature(transformers.TrainingArguments.__init__))

4.53.1


In [45]:
# Optional downgrade, for code that still uses the older `evaluation_strategy` argument name:
# !pip install "transformers<4.46"


In [46]:
training_args = TrainingArguments(
    output_dir=f"/content/{model_checkpoint}-wikitext2",
    eval_strategy="epoch",          # called `evaluation_strategy` in older transformers releases
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False
)
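
With the transformers release used here, `Trainer` reports to any logging integration installed in the environment, which on Colab can mean a Weights & Biases login prompt when training starts. A hedged sketch of the same arguments with logging integrations switched off through the `report_to` parameter of `TrainingArguments` follows. The arguments are collected in a plain dict so the snippet runs without transformers installed; `training_kwargs` is an illustrative name.

```python
model_checkpoint = "gpt2"

# Same settings as the cell above, plus report_to="none" to skip
# external loggers such as Weights & Biases.
training_kwargs = dict(
    output_dir=f"/content/{model_checkpoint}-wikitext2",
    eval_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
    report_to="none",
)

# In the notebook this would be passed on as:
# training_args = TrainingArguments(**training_kwargs)
print(training_kwargs["report_to"])
```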

In [47]:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)


In [48]:

trainer.train()






`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Epoch,Training Loss,Validation Loss
1,6.5062,6.475512
2,6.1401,6.195918
3,5.9622,6.110425


TrainOutput(global_step=6747, training_loss=6.339066404368377, metrics={'train_runtime': 3581.566, 'train_samples_per_second': 15.07, 'train_steps_per_second': 1.884, 'total_flos': 3525678710784000.0, 'train_loss': 6.339066404368377, 'epoch': 3.0})
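
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes the `TrainingArguments` defaults that the notebook does not override: 3 epochs, a per-device batch size of 8, no gradient accumulation, and a single GPU. Under those assumptions the step count implies roughly 18,000 training blocks of 128 tokens each.

```python
global_step = 6747
train_runtime = 3581.566        # seconds, from TrainOutput
num_epochs = 3                  # assumption: TrainingArguments default
per_device_batch_size = 8       # assumption: TrainingArguments default, one GPU
block_size = 128

steps_per_epoch = global_step // num_epochs                       # 2249
# The last batch of an epoch may be partial, so the block count is a range.
max_blocks = steps_per_epoch * per_device_batch_size              # 17992
min_blocks = (steps_per_epoch - 1) * per_device_batch_size + 1    # 17985
steps_per_second = round(global_step / train_runtime, 3)          # 1.884

print(steps_per_epoch, min_blocks, max_blocks, steps_per_second)
print(max_blocks * block_size)  # upper bound on training tokens per epoch
```

The computed 1.884 steps per second matches `train_steps_per_second`, and multiplying it by the batch size of 8 gives about 15.07, in line with the reported `train_samples_per_second`.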

In [49]:

import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")


Perplexity: 450.53
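
Perplexity is the exponential of the mean cross-entropy loss (in nats), so it can be recomputed from the validation losses in the training table without rerunning the evaluation. A minimal sketch follows; the dict `validation_losses` simply copies the three values reported above.

```python
import math

# Validation loss per epoch, copied from the training table.
validation_losses = {1: 6.475512, 2: 6.195918, 3: 6.110425}

perplexities = {epoch: math.exp(loss) for epoch, loss in validation_losses.items()}
for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: perplexity {ppl:.2f}")
```

The epoch 3 value reproduces the 450.53 printed by the evaluation cell. Because of the exponential, even the modest drop in loss between epochs translates into a large drop in perplexity.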
