In [3]:
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

## Load Model

You can use a different model by changing the value of pretrained_weights. If it is not a GPT2 model, you will also need to change the classes used for the model and the tokenizer.
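One way to avoid swapping classes by hand is to use the generic Auto classes from transformers, which pick the concrete class from the checkpoint name. The helper below is a sketch, not part of the original notebook: `load_causal_lm` is a hypothetical name, and it assumes the checkpoint you pass is a causal language model.

```python
def load_causal_lm(pretrained_weights):
    """Load a (tokenizer, model) pair for a causal LM checkpoint name.

    Hypothetical helper, shown only to illustrate the Auto classes.
    """
    # Imported inside the function so that defining the helper does not
    # require transformers to be installed yet.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)
    model = AutoModelForCausalLM.from_pretrained(pretrained_weights)
    return tokenizer, model
```

Calling `load_causal_lm('gpt2')` should give a tokenizer and model comparable to the ones loaded in the next cell, at the cost of being less explicit about which classes are in use.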

In [5]:
pretrained_weights = 'gpt2'
tokenizer = GPT2TokenizerFast.from_pretrained(pretrained_weights)
model = GPT2LMHeadModel.from_pretrained(pretrained_weights)


## Tokenize

The tokenizers in HuggingFace usually do the tokenization and the numericalization in one step (we ignore the padding warning for now):

In [6]:
ids = tokenizer.encode('This is an example of text, and')
ids

[1212, 318, 281, 1672, 286, 2420, 11, 290]

Like fastai Transforms, the tokenizer has a decode method that turns ids back into text:

In [7]:
tokenizer.decode(ids)

'This is an example of text, and'

## Inference

In [8]:
import torch

The model has a generate method that expects a batch of prompts, so we feed it our ids with a batch dimension added (there is a padding warning that we can ignore here as well):

In [9]:
t = torch.LongTensor(ids)[None]
preds = model.generate(t)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
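The warning is about the attention mask, which tells the model which positions are real tokens and which are padding. With a single prompt there is no padding, so the mask would be all ones and ignoring the warning is harmless. The plain-Python sketch below shows what the mask looks like once prompts of different lengths are batched. `left_pad_batch` is a hypothetical helper, and padding on the left is an assumption that is commonly recommended for generation with decoder-only models.

```python
GPT2_EOS_ID = 50256  # the id the warning above falls back to as pad_token_id


def left_pad_batch(prompts, pad_id=GPT2_EOS_ID):
    """Pad lists of token ids to equal length and build the matching mask."""
    max_len = max(len(p) for p in prompts)
    input_ids, attention_mask = [], []
    for p in prompts:
        n_pad = max_len - len(p)
        input_ids.append([pad_id] * n_pad + list(p))      # padding first
        attention_mask.append([0] * n_pad + [1] * len(p))  # 0 = ignore, 1 = attend
    return input_ids, attention_mask


# A single prompt needs no padding, so its mask is all ones.
single_ids, single_mask = left_pad_batch([[1212, 318, 281, 1672, 286, 2420, 11, 290]])

# Two prompts of different lengths: the shorter one is padded on the left.
batch_ids, batch_mask = left_pad_batch([[1212, 318, 281], [1212, 318]])
```

Converted to tensors, lists like these could be passed to generate as the input ids and `attention_mask`, together with `pad_token_id`, which should silence the warning.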


In [10]:
preds.shape,preds[0]

(torch.Size([1, 20]),
 tensor([1212,  318,  281, 1672,  286, 2420,   11,  290,  340,  338,  407,  257,
          922,  530,   13,  198,  198,  464,  717, 1517]))

In [11]:
tokenizer.decode(preds[0].numpy())

"This is an example of text, and it's not a good one.\n\nThe first thing"
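The prediction above has exactly 20 tokens, prompt included, which is consistent with a default length cap of 20. With no sampling options set, generate appears to pick the highest-scoring token at each step (greedy decoding). The sketch below illustrates that loop in plain Python. `toy_scores` is a made-up stand-in for the model, and the real method has many more options, so check the generation config of your transformers version.

```python
def greedy_generate(next_token_scores, prompt_ids, max_length=20):
    """Append the highest-scoring token until max_length tokens are reached."""
    ids = list(prompt_ids)
    while len(ids) < max_length:
        scores = next_token_scores(ids)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
    return ids


def toy_scores(ids, vocab_size=5):
    """Hypothetical stand-in for a model: favours (last id + 1) % vocab_size."""
    favourite = (ids[-1] + 1) % vocab_size
    return [1.0 if i == favourite else 0.0 for i in range(vocab_size)]


generated = greedy_generate(toy_scores, [0, 1], max_length=6)
```

Note that `max_length` counts the prompt tokens too, which is why an 8-token prompt here yields only 12 new tokens.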

# Bridging the gap with fastai

In [14]:
from fastai.text.all import *

In [18]:
path = untar_data(URLs.WIKITEXT_TINY)
path.ls()

(#2) [Path('/Users/ridealist/.fastai/data/wikitext-2/test.csv'),Path('/Users/ridealist/.fastai/data/wikitext-2/train.csv')]

In [19]:
df_train = pd.read_csv(path/'train.csv', header=None)
df_valid = pd.read_csv(path/'test.csv', header=None)
df_train.head()

Unnamed: 0,0
0,"\n = 2013 – 14 York City F.C. season = \n \n The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club , a professional football club based in York , North Yorkshire , England . Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two . The season ran from 1 July 2013 to 30 June 2014 . \n Nigel Worthington , starting his first full season as York manager , made eight permanent summer signings . By the turn of the year York were only above the relegation z..."
1,"\n = Big Boy ( song ) = \n \n "" Big Boy "" <unk> "" I 'm A Big Boy Now "" was the first single ever recorded by the Jackson 5 , which was released by Steeltown Records in January 1968 . The group played instruments on many of their Steeltown compositions , including "" Big Boy "" . The song was neither a critical nor commercial success , but the Jackson family were delighted with the outcome nonetheless . \n The Jackson 5 would release a second single with Steeltown Records before moving to Motown Records . The group 's recordings at Steeltown Records were thought to be lost , but they were re..."
2,"\n = The Remix ( Lady Gaga album ) = \n \n The Remix is a remix album by American recording artist Lady Gaga . Released in Japan on March 3 , 2010 , it contains remixes of the songs from her first studio album , The Fame ( 2008 ) , and her third extended play , The Fame Monster ( 2009 ) . A revised version of the track list was prepared for release in additional markets , beginning with Mexico on May 3 , 2010 . A number of recording artists have produced the songs , including Pet Shop Boys , Passion Pit and The Sound of Arrows . The remixed versions feature both uptempo and <unk> composit..."
3,"\n = New Year 's Eve ( Up All Night ) = \n \n "" New Year 's Eve "" is the twelfth episode of the first season of the American comedy television series Up All Night . The episode originally aired on NBC in the United States on January 12 , 2012 . It was written by Erica <unk> and was directed by Beth McCarthy @-@ Miller . The episode also featured a guest appearance from Jason Lee as Chris and Reagan 's neighbor and Ava 's boyfriend , Kevin . \n During Reagan ( Christina Applegate ) and Chris 's ( Will <unk> ) first New Year 's Eve game night , Reagan 's competitiveness comes out causing Ch..."
4,"\n = Geopyxis carbonaria = \n \n Geopyxis carbonaria is a species of fungus in the genus Geopyxis , family <unk> . First described to science in 1805 , and given its current name in 1889 , the species is commonly known as the charcoal loving elf @-@ cup , dwarf <unk> cup , <unk> <unk> cup , or pixie cup . The small , <unk> @-@ shaped fruitbodies of the fungus are reddish @-@ brown with a whitish fringe and measure up to 2 cm ( 0 @.@ 8 in ) across . They have a short , tapered stalk . Fruitbodies are commonly found on soil where brush has recently been burned , sometimes in great numbers ...."


In [23]:
all_texts = np.concatenate([df_train[0].values, df_valid[0].values])

## Building a fastai Transform

In a fastai Transform you can define:

- an encodes method that is applied when you call the transform (a bit like the forward method in a nn.Module)
- a decodes method that is applied when you call the decode method of the transform, if you need to decode anything for showing purposes (like converting ids to a text here)
- a setups method that sets some inner state of the Transform (not needed here so we skip it)
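To make the calling convention in the list above concrete, here is a plain-Python stand-in. It is not fastai's implementation: `MiniTransform` and `WordIds` are hypothetical names, and the real Transform class also handles type dispatch and other features that this sketch leaves out.

```python
class MiniTransform:
    """Plain-Python stand-in for the calling convention (not fastai code)."""
    def __call__(self, x): return self.encodes(x)   # calling applies encodes
    def decode(self, x): return self.decodes(x)     # decode applies decodes
    def setup(self, items): self.setups(items)      # setup applies setups
    # Defaults: do nothing.
    def encodes(self, x): return x
    def decodes(self, x): return x
    def setups(self, items): pass


class WordIds(MiniTransform):
    """Toy whitespace tokenizer that needs inner state (a vocabulary)."""
    def setups(self, items):
        self.vocab = sorted({w for text in items for w in text.split()})
        self.word2id = {w: i for i, w in enumerate(self.vocab)}
    def encodes(self, x): return [self.word2id[w] for w in x.split()]
    def decodes(self, x): return " ".join(self.vocab[i] for i in x)


tfm = WordIds()
tfm.setup(["this is an example", "this is text"])
ids = tfm("this is text")   # goes through encodes
text = tfm.decode(ids)      # goes through decodes
```

The GPT2 tokenizer already ships with its vocabulary, which is why the transform in the next cell does not need a setups method.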

In [24]:
class TransformersTokenizer(Transform):
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x): 
        toks = self.tokenizer.tokenize(x)
        return tensor(self.tokenizer.convert_tokens_to_ids(toks))
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))

In [25]:
splits = [range_of(df_train), list(range(len(df_train), len(all_texts)))]
tls = TfmdLists(all_texts, TransformersTokenizer(tokenizer), splits=splits, dl_type=LMDataLoader)

Token indices sequence length is longer than the specified maximum sequence length for this model (4576 > 1024). Running this sequence through the model will result in indexing errors


In [27]:
tls.train[0], tls.valid[0]

(tensor([220, 198, 796,  ..., 198, 220, 198]),
 tensor([220, 198, 796,  ..., 198, 220, 198]))

In [28]:
show_at(tls.train, 0)

 
 = 2013 – 14 York City F.C. season = 
 
 The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club, a professional football club based in York, North Yorkshire, England. Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two. The season ran from 1 July 2013 to 30 June 2014. 
 Nigel Worthington, starting his first full season as York manager, made eight permanent summer signings. By the turn of the year York were only above the relegation zone on goal difference, before a 17 @-@ match unbeaten run saw the team finish in seventh @-@ place in the 24 @-@ team 2013 – 14 Football League Two. This meant York qualified for the play @-@ offs, and they were eliminated in the semi @-@ final by Fleetwood Town. York were knocked out of the 2013 – 14 FA Cup, Football League Cup and Football League Trophy in their opening round matches. 
 35 players made at least

In [29]:
show_at(tls.valid, 0)

 
 = Tropical Storm <unk> ( 2008 ) = 
 
 Tropical Storm <unk> was the tenth tropical storm of the 2008 Atlantic hurricane season. <unk> developed out of a strong tropical wave which moved off the African coast on August 31. The wave quickly became organized and was declared Tropical Depression Ten while located 170 mi ( 270 km ) to the south @-@ southeast of the Cape Verde Islands on September 2. The depression was quickly upgraded to Tropical Storm <unk> around noon the same day. Over the next several days, <unk> moved in a general west @-@ northwest direction and reached its peak intensity early on September 3. Strong wind shear, some due to the outflow of Hurricane Ike, and dry air caused the storm to weaken. On September 6, the combination of wind shear, dry air, and cooling waters caused <unk> to weaken into a tropical depression. <unk> deteriorated into a remnant low shortly after as convection continued to dissipate around the storm. The low ultimately dissipated while located 5

In [30]:
bs,sl = 4,256
dls = tls.dataloaders(bs=bs, seq_len=sl)

In [31]:
dls.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"\n = Martin Keamy = \n \n First Sergeant Martin Christopher Keamy is a fictional character played by Kevin Durand in the fourth season and sixth season of the American ABC television series Lost. Keamy is introduced in the fifth episode of the fourth season as a crew member aboard the freighter called the <unk> that is offshore the island where most of Lost takes place. In the second half of the season, Keamy served as the primary antagonist. He is the leader of a mercenary team hired by billionaire Charles Widmore ( played by Alan Dale ) that is sent to the island on a mission to capture Widmore's enemy Ben Linus ( Michael Emerson ) from his home, then torch the island. \n Unlike Lost's ensemble of characters who, according to the writers, each have good and bad intentions, the writers have said that Keamy is evil","\n = Martin Keamy = \n \n First Sergeant Martin Christopher Keamy is a fictional character played by Kevin Durand in the fourth season and sixth season of the American ABC television series Lost. Keamy is introduced in the fifth episode of the fourth season as a crew member aboard the freighter called the <unk> that is offshore the island where most of Lost takes place. In the second half of the season, Keamy served as the primary antagonist. He is the leader of a mercenary team hired by billionaire Charles Widmore ( played by Alan Dale ) that is sent to the island on a mission to capture Widmore's enemy Ben Linus ( Michael Emerson ) from his home, then torch the island. \n Unlike Lost's ensemble of characters who, according to the writers, each have good and bad intentions, the writers have said that Keamy is evil and"
1,"s "" official film history "". The film addresses three main categories of "" Ozploitation "" films : sex, horror and action. \n \n = = <unk> = = \n \n The actors, directors, screenwriters and producers interviewed for the film were : \n \n = = Production = = \n \n As a child, Mark Hartley discovered many of the "" Ozploitation "" B @-@ movies from the 1970s and'80s while watching late @-@ night television, but was disappointed when they were completely overlooked in books he read detailing Australian cinema. After becoming an accomplished music video director, his interest in this era of Australian filmmaking grew and he spent years researching a potential documentary film. He was close to giving up on the project when he sent a 100 @-@ page draft of the script to American film director Quentin Tarantino, not expecting to receive a reply. Tarantino",""" official film history "". The film addresses three main categories of "" Ozploitation "" films : sex, horror and action. \n \n = = <unk> = = \n \n The actors, directors, screenwriters and producers interviewed for the film were : \n \n = = Production = = \n \n As a child, Mark Hartley discovered many of the "" Ozploitation "" B @-@ movies from the 1970s and'80s while watching late @-@ night television, but was disappointed when they were completely overlooked in books he read detailing Australian cinema. After becoming an accomplished music video director, his interest in this era of Australian filmmaking grew and he spent years researching a potential documentary film. He was close to giving up on the project when he sent a 100 @-@ page draft of the script to American film director Quentin Tarantino, not expecting to receive a reply. Tarantino"
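The two columns of the batch above hold the same text shifted by one token: the target (text_) is the input (text) moved one position ahead, so that at each position the model is asked to predict the next token. The sketch below shows that idea on a plain list of ids. `lm_pairs` is a hypothetical helper; the real LMDataLoader also concatenates and shuffles documents and arranges them into batches.

```python
def lm_pairs(stream, seq_len):
    """Cut a token stream into (input, target) pairs shifted by one token."""
    pairs = []
    for start in range(0, len(stream) - 1, seq_len):
        x = stream[start:start + seq_len]
        y = stream[start + 1:start + seq_len + 1]
        if len(x) == len(y):  # drop the last chunk if its target is too short
            pairs.append((x, y))
    return pairs


stream = list(range(10))             # stand-in for a stream of token ids
pairs = lm_pairs(stream, seq_len=4)
```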


## Fine-tuning the model

In [32]:
class DropOutput(Callback):
    def after_pred(self): self.learn.pred = self.pred[0]
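The callback above is needed because a HuggingFace model call returns more than the predictions: the logits come first, followed by additional outputs, while the fastai loss function expects the predictions alone. The sketch below mimics that with plain Python objects. `fake_hf_forward` and its return values are invented for illustration and do not reproduce the library's output type.

```python
def fake_hf_forward(x):
    """Hypothetical model call returning (logits, extra outputs)."""
    logits = [[0.1, 0.9], [0.8, 0.2]]
    extras = {"past_key_values": None}
    return logits, extras


def drop_output(pred):
    """Keep only the first element, as DropOutput does after each prediction."""
    return pred[0]


pred = drop_output(fake_hf_forward(None))
```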

In [33]:
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), cbs=[DropOutput], metrics=Perplexity()).to_fp16()

In [34]:
learn.validate()



(#2) [3.695716381072998,40.2744140625]
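The two numbers returned by validate are the validation loss and the metric. Perplexity is the exponential of the cross-entropy loss, which can be checked against the values above (a quick sanity check, assuming the loss is an average in nats):

```python
import math

valid_loss = 3.695716381072998      # first value returned by learn.validate() above
perplexity = math.exp(valid_loss)   # should be close to the second value, 40.27
```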

In [None]:
learn.lr_find()

In [None]:
learn.fit_one_cycle(1, 1e-4)