# Finetuning GPT2 on ELI5-Data

Downloading dependencies

In [1]:
! pip install transformers
! pip install yake
! pip install datasets

Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  A

Importing dependencies

In [2]:
from google.colab import drive

import os
import numpy as np
import pandas as pd

import re
import random

from datasets import list_datasets, load_dataset
from transformers import AutoTokenizer, AutoConfig, AutoModelForPreTraining, \
                         TrainingArguments, Trainer, pipeline, AutoModelForTokenClassification

import yake

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

import torch
from torch.utils.data import Dataset
print(f"PyTorch version: {torch.__version__}")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
PyTorch version: 1.10.0+cu111


Verifying GPU

In [3]:
!nvidia-smi

Sat Jan 22 19:00:04 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Configurations

In [4]:
MODEL           = 'gpt2' # Base model to finetune (huggingface.co/gpt2)

UNFREEZE_LAST_N = 6 # The last N layers to unfreeze for training

SPECIAL_TOKENS  = { "bos_token": "<|BOS|>",
                    "eos_token": "<|EOS|>",
                    "unk_token": "<|UNK|>",                    
                    "pad_token": "<|PAD|>",
                    "sep_token": "<|SEP|>"}
                    
MAXLEN          = 768 # Max token length of training samples and of generated text

TRAIN_SIZE      = 0.8 # Train split

USE_APEX        = True

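# Use a smaller per-device batch with more accumulation steps when memory is limited;
# the effective batch size (TRAIN_BATCHSIZE * BATCH_UPDATE) is 64 in both cases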
if USE_APEX:
    TRAIN_BATCHSIZE = 4
    BATCH_UPDATE    = 16
else:
    TRAIN_BATCHSIZE = 2
    BATCH_UPDATE    = 32

EPOCHS          = 4
LR              = 5e-4
EPS             = 1e-8
WARMUP_STEPS    = 100 # number of warmup steps, kept as an int

SEED            = 2022
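
The batch settings above combine with the dataset size to determine how many optimizer updates a run performs. The arithmetic below is a small sketch, assuming 8,000 training examples (80% of the 10,000 posts kept later in this notebook) and a single GPU; `n_train_examples` is an illustrative name, not a variable used elsewhere in the notebook.

```python
import math

TRAIN_BATCHSIZE = 4      # per-device batch size (USE_APEX branch)
BATCH_UPDATE = 16        # gradient accumulation steps
EPOCHS = 4
n_train_examples = 8000  # assumption: 0.8 * 10,000 posts

# Samples that contribute to a single optimizer update
effective_batch = TRAIN_BATCHSIZE * BATCH_UPDATE
# Optimizer updates per epoch and over the whole run
steps_per_epoch = math.ceil(n_train_examples / effective_batch)
total_steps = steps_per_epoch * EPOCHS

print(effective_batch, steps_per_epoch, total_steps)
```

Under these assumptions the effective batch size is 64 and the run takes 500 optimizer steps, which agrees with the figures reported in the training log further down.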

Initialize seeds

In [5]:
random.seed(SEED)
os.environ['PYTHONHASHSEED'] = str(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False # benchmark mode auto-tunes kernels and can break reproducibility
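
Seeding matters here because both the train/validation split and the keyword shuffling rely on Python's `random` module. The sketch below only illustrates the idea with the standard library; `sample_after_seed` is a hypothetical helper, not part of the notebook.

```python
import random

def sample_after_seed(seed, n=5):
    """Seed the global RNG, then draw n integers from it."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]

first = sample_after_seed(2022)
second = sample_after_seed(2022)

# Re-seeding with the same value replays the same sequence
print(first == second)
```

The same principle applies to NumPy and PyTorch, each of which keeps its own generator state and therefore needs its own seed call.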

# Load ELI5 Dataset

Available on Hugging Face: https://huggingface.co/datasets/eli5

In [6]:
eli5_dataset = load_dataset('eli5')

Downloading and preparing dataset eli5/LFQA_reddit (download: 6.03 MiB, generated: 1.26 GiB, post-processed: Unknown size, total: 1.26 GiB) to /root/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa...


Dataset eli5 downloaded and prepared to /root/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa. Subsequent calls will reuse this data.


Print data sample

In [7]:
# Each question comes with multiple answers and their scores (upvotes)
# Answers are stored in descending score order inside 'text'; only the first (best) answer is used
eli5_dataset["test_eli5"][1]

{'answers': {'a_id': ['cyi9d1m',
   'cyi990h',
   'cyi9do0',
   'cyiger5',
   'cyilcl7',
   'cyiidqq',
   'cyi9i76',
   'cyiic6t',
   'cyibpiq',
   'cyikhsv'],
  'score': [148, 57, 45, 19, 19, 6, 6, 4, 2, 2],
  'text': ['I like to think that leather clothing is rather more durable and easy to fix than an Italian tailored suit.\n\nSo in lack of industrial infrastructure (lack of shops, sewing machines, fabric production etc.) Simple and durable clothing would become common.\n\nAs for makeup, i guess black makeup (or motor oil, dirt, grit) is easy to pull off than perfectly clean face.',
   'To a lot of of observers, the nearest present-day mirror that we have to a dystopian and post-apocalyptic society is the part of our own current society that prefers to dress and appear that way. Biker ~~games~~ **gangs** (edit: used an incorrect worm) are a good example.\n\nPeople who dress and wear a whole lot of dark make-up or tattoos are generally seen as tougher and grittier than Joe Average, m

Dropping unused columns and keeping only the top answer:

In [8]:
df = pd.DataFrame(data=eli5_dataset["test_eli5"])

In [9]:
df = df.drop(columns=["q_id", "selftext", "document", "subreddit", "title_urls", "selftext_urls", "answers_urls"])

In [10]:
# keep only the text of the first (top-scored) answer for each question
df["answers"] = df["answers"].apply(lambda a: a["text"][0])
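
The cell above relies on the answers already being sorted by score. A more defensive variant could select the highest-scored answer explicitly. The sketch below assumes a record shaped like the sample printed earlier (parallel `score` and `text` lists); `best_answer` is a hypothetical helper.

```python
def best_answer(answers):
    """Return the text of the highest-scored answer in an ELI5-style record."""
    scores, texts = answers["score"], answers["text"]
    # Index of the maximum score; ties resolve to the earliest answer
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return texts[best_index]

# Toy record with the answers deliberately out of order
record = {"score": [6, 148, 57], "text": ["third", "top answer", "second"]}
print(best_answer(record))
```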

In [11]:
# merging question and answers to one
df["full_text"] = df["title"] + " " + df["answers"]
df["keyword"] = ""

In [12]:
# For development, only the first 10k posts are used
df = df[:10000].copy() # copy so that later column assignments do not act on a slice of the original frame

# Extracting Keywords

Reference for yake from SeemsPhishy code:

```
! pip install yake
import yake
import sys
def keywords_yake(text, language = "en", max_ngram_size = 2, numOfKeywords = 1):

    custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, top=numOfKeywords)
    keywords = custom_kw_extractor.extract_keywords(text)
    return keywords

for i in df.index:
    df["keyword"][i] = keywords_yake(df.full_text[i])[0][0]
    sys.stdout.write("\rExtracting keyword: %i" % i)
    sys.stdout.flush()
df

data = dict()
for id in df.index:
    data[id] = [df["title"][id], df["answers"][id], [df["keyword"][id]]]
```


## Keyword extraction approach A: Yake

In [13]:
def keywords_yake(text, language = "en", max_ngram_size = 1, numOfKeywords = 1):

    custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, top=numOfKeywords)
    keywords = custom_kw_extractor.extract_keywords(text)
    return keywords

## Keyword extraction approach B: Tf-idf

In [14]:
def sort_coo(coo_matrix):
    """Return (column index, score) pairs of a COO matrix, sorted by descending score"""
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]

    score_vals = []
    feature_vals = []
    
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        
        #keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])

    #build a dict that maps each feature name to its score
    results = {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]] = score_vals[idx]
    
    return results

def get_keywords(vectorizer, feature_names, doc):
    """Return top k keywords from a doc using TF-IDF method"""

    #generate tf-idf for the given document
    tf_idf_vector = vectorizer.transform([doc])
    
    #sort the tf-idf scores in descending order
    sorted_items = sort_coo(tf_idf_vector.tocoo())

    #keep only the top 3 keywords
    keywords = extract_topn_from_vector(feature_names, sorted_items, 3)
    
    return list(keywords.keys())
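
To make the scoring behind approach B more concrete, here is a minimal pure-Python sketch of TF-IDF keyword ranking. It assumes the smoothed idf variant, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which corresponds to the `TfidfVectorizer` settings used later in this notebook. Whitespace tokenization and the absence of stop-word removal are simplifications, and `tfidf_top_keywords` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_top_keywords(docs, doc_index, topn=3):
    """Rank the terms of one document by smoothed, L2-normalized TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    term_counts = Counter(tokenized[doc_index])
    # Raw term count times smoothed inverse document frequency
    scores = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(score * score for score in scores.values()))
    scores = {term: score / norm for term, score in scores.items()}
    # Highest score first; ties fall back to reverse alphabetical order
    ranked = sorted(scores.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
    return [term for term, _ in ranked[:topn]]

docs = ["why is the sky blue",
        "why is grass green",
        "why is the ocean blue and salty"]
top_terms = tfidf_top_keywords(docs, doc_index=0)
print(top_terms)
```

In this toy corpus "sky" ranks first for the first document because it occurs in no other document, while "why" and "is" occur everywhere and score lowest.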

# Preprocessing Data

In [15]:
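#PUNCTUATION = """!"#$%&'()*+,-.:;<=>?@[]^_`{|}~"""

def clean_text(text, training = False):
  text = re.sub(r"\(_URL_[0-9]+_\)", " ", text) #replace URL placeholders such as (_URL_0_)
  text = "".join([c for c in text if c not in "[]<>*_"]) #strip markdown and bracket characters
  text = text.replace("\n", " ") #remove newlines

  #text used for keyword extraction is additionally lowercased and whitespace-normalized
  if not training:
    text = text.lower()
    text = re.sub(r"\s+", " ", text) #collapse runs of whitespace into a single space

  return text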
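
The two modes of `clean_text` are easiest to see on a small input. The block below repeats the same cleaning steps so that it runs on its own; the input string is made up for illustration.

```python
import re

def clean_text(text, training=False):
    text = re.sub(r"\(_URL_[0-9]+_\)", " ", text)         # drop URL placeholders
    text = "".join(c for c in text if c not in "[]<>*_")  # drop markup characters
    text = text.replace("\n", " ")                        # newlines become spaces
    if not training:
        text = text.lower()
        text = re.sub(r"\s+", " ", text)                  # collapse whitespace
    return text

raw = "See [this](_URL_0_) **now**\nplease"
for_keywords = clean_text(raw)                 # lowercased, whitespace collapsed
for_training = clean_text(raw, training=True)  # case and spacing preserved

print(repr(for_keywords))
print(repr(for_training))
```

With `training=True` the original casing is kept, so the finetuned model sees naturally capitalized text, while keyword extraction works on the normalized variant.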

Clean the answers for model finetuning

In [16]:
df["answers"] = df["answers"].apply(lambda text: clean_text(text, training = True))

In [17]:
df["answers"]

0       I think it's because, at that moment, it's bas...
1       I like to think that leather clothing is rathe...
2       Shrubs and trees are both specifically woody p...
3       Moving air = lower pressure. The greater the d...
4       It's kind of like a "3 strikes and you're out"...
                              ...                        
9995    The charts aren't based on YouTube views - the...
9996    I perceive certain animals as food, chickens, ...
9997    Relevant xkcd   DVDs and Blu Rays come in Keep...
9998    In an alphabetic script one symbol represents ...
9999    The new exoskeleton stays soft as it is formed...
Name: answers, Length: 10000, dtype: object

Clean the full text for keyword extraction

In [18]:
df["full_text"] = df["full_text"].apply(clean_text)

In [19]:
vectorizer = TfidfVectorizer(stop_words=set(stopwords.words("english")), smooth_idf=True, use_idf=True)

In [20]:
corpora = df["full_text"].to_list()

In [21]:
vectorizer.fit_transform(corpora[:])

<10000x36210 sparse matrix of type '<class 'numpy.float64'>'
	with 506220 stored elements in Compressed Sparse Row format>

In [22]:
feature_names = vectorizer.get_feature_names_out() # scikit-learn >= 1.0; older versions use get_feature_names()



Tf-idf keyword extraction

In [23]:
# top 3 tf-idf keywords per document
df["keyword"] = [get_keywords(vectorizer, feature_names, doc) for doc in corpora]

Yake keyword extraction

In [24]:
# keep only the keyword strings and drop the yake scores
df["yake_keyword"] = [[keyword[0] for keyword in keywords_yake(text, numOfKeywords=3)] for text in df["full_text"]]

# Named entity recognition

The code below was an experiment with named entity recognition (NER). The initial idea was to extract entities that could be used for finetuning, so that the model would eventually learn to incorporate new, unseen entities. As it turned out, the dataset doesn't provide enough entities for this.

```
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)
df["ner"] = ""
for i in df.index:
  ner_results = nlp(df["full_text"][i])
  df["ner"][i] = [entity["word"] for entity in ner_results]
#ner doesn't work too well...
nlp("why can a big music name (e.g. keith urban) get into the billboard top 100 easily with 1,000,000 song views, but a new/independent artist needs 50,000,000 or even 100,000,000 song views to make it into the billboard hot 100? the charts aren't based on youtube views - they're still mostly based on sales & radio play. most unsigned artists on youtube aren't even getting counted at all because the charts *are run by the record industry*.")
```

# Datasets processing for finetuning

In [25]:
class myDataset(Dataset):

    def __init__(self, data, tokenizer, randomize=True):

        title, text, keywords = [], [], [] # each value in data is [title, answer, keywords]
        for k, v in data.items():
            title.append(v[0])
            text.append(v[1])
            keywords.append(v[2])

        self.randomize = randomize
        self.tokenizer = tokenizer 
        self.title     = title
        self.text      = text
        self.keywords  = keywords  


    @staticmethod
    def join_keywords(keywords, randomize=True):
        N = len(keywords)

        #random sampling and shuffle
        if randomize: 
            M = random.choice(range(N+1))
            keywords = keywords[:M]
            random.shuffle(keywords)

        return ','.join(keywords)


    def __len__(self):
        return len(self.text)

    
    def __getitem__(self, i):
        keywords = self.keywords[i].copy() #list of keywords [k1, k2, k3]
        kw = self.join_keywords(keywords, self.randomize)
        
        # training data consists of keywords, questions and the corresponding answers
        sample_text = SPECIAL_TOKENS['bos_token'] + \
                      SPECIAL_TOKENS['sep_token'] + kw + SPECIAL_TOKENS['sep_token'] + \
                      self.title[i] + ": " + self.text[i] + SPECIAL_TOKENS['eos_token']

        # use the tokenizer stored on the dataset instead of the notebook-level global
        encodings_dict = self.tokenizer(sample_text,
                                        truncation=True,
                                        max_length=MAXLEN,
                                        padding="max_length")
        
        input_ids = encodings_dict['input_ids']
        attention_mask = encodings_dict['attention_mask']
        
        return {'label': torch.tensor(input_ids),
                'input_ids': torch.tensor(input_ids), 
                'attention_mask': torch.tensor(attention_mask)}
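
To show what a single training sample looks like before tokenization, the sketch below rebuilds the string layout used in `__getitem__` without the tokenizer or random keyword sampling. `build_training_text` and the example question are illustrative only.

```python
SPECIAL_TOKENS = {"bos_token": "<|BOS|>",
                  "eos_token": "<|EOS|>",
                  "sep_token": "<|SEP|>"}

def build_training_text(title, answer, keywords):
    """Concatenate keywords, question and answer in the training layout."""
    kw = ",".join(keywords)
    return (SPECIAL_TOKENS["bos_token"]
            + SPECIAL_TOKENS["sep_token"] + kw + SPECIAL_TOKENS["sep_token"]
            + title + ": " + answer + SPECIAL_TOKENS["eos_token"])

example = build_training_text("Why is the sky blue?",
                              "Air scatters blue light the most.",
                              ["sky", "blue"])
print(example)
# <|BOS|><|SEP|>sky,blue<|SEP|>Why is the sky blue?: Air scatters blue light the most.<|EOS|>
```

The keywords come first, so at generation time the model can be conditioned on keywords alone and asked to produce both a question and its answer.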

In [26]:
def split_data(data, S=TRAIN_SIZE):
    # Shuffle ids
    ids = list(data.keys())
    random.shuffle(ids)

    # Split into training and validation sets    
    train_size = int(S * len(data))

    train_ids = ids[:train_size]
    val_ids = ids[train_size:]

    train_data = dict()
    for id in train_ids:
        train_data[id] = data[id]

    val_data = dict()
    for id in val_ids:
        val_data[id] = data[id]

    return train_data, val_data
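
`split_data` shuffles with the global `random` state, so its result depends on every random call made before it. One possible alternative, sketched below with a hypothetical `split_ids` helper, uses a local `random.Random` instance so the split depends only on the seed passed in.

```python
import random

def split_ids(ids, train_fraction=0.8, seed=2022):
    """Shuffle ids with a private RNG and cut them into train/validation lists."""
    rng = random.Random(seed)  # independent of the global random state
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

train_ids, val_ids = split_ids(range(10))
print(len(train_ids), len(val_ids))
```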

# Loading Tokenizer

In [27]:
def get_tokenizer(special_tokens=None):
    tokenizer = AutoTokenizer.from_pretrained(MODEL) # uses the gpt2 tokenizer

    if special_tokens:
        tokenizer.add_special_tokens(special_tokens)
        print("Special tokens added")
    return tokenizer

In [28]:
tokenizer = get_tokenizer(special_tokens=SPECIAL_TOKENS)

Special tokens added


# Loading Config and Model

In [29]:
def get_model(tokenizer, special_tokens=None, load_model_path=None):

    # AutoModelForPreTraining resolves to GPT2LMHeadModel for the gpt2 checkpoint
    if special_tokens:
        config = AutoConfig.from_pretrained(MODEL, 
                                            bos_token_id=tokenizer.bos_token_id,
                                            eos_token_id=tokenizer.eos_token_id,
                                            sep_token_id=tokenizer.sep_token_id,
                                            pad_token_id=tokenizer.pad_token_id,
                                            output_hidden_states=False)
    else: 
        config = AutoConfig.from_pretrained(MODEL,                                     
                                            pad_token_id=tokenizer.eos_token_id,
                                            output_hidden_states=False)    


    model = AutoModelForPreTraining.from_pretrained(MODEL, config=config)

    if special_tokens:
        #Special tokens added, model needs to be resized accordingly
        model.resize_token_embeddings(len(tokenizer))

    if load_model_path:
        model.load_state_dict(torch.load(load_model_path))

    model.cuda()
    return model

In [30]:
model = get_model(tokenizer, 
                  special_tokens=SPECIAL_TOKENS,
                #   load_model_path='pytorch_model.bin'
                 )

In [None]:
# Freeze all layers except the last UNFREEZE_LAST_N transformer blocks,
# the final layer norm and the LM head
for parameter in model.parameters():
    parameter.requires_grad = False

n_blocks = len(model.transformer.h) # 12 for the base gpt2 model
for i, m in enumerate(model.transformer.h):
    #Only un-freeze the last n transformer blocks
    if i >= n_blocks - UNFREEZE_LAST_N:
        for parameter in m.parameters():
            parameter.requires_grad = True

for parameter in model.transformer.ln_f.parameters():        
    parameter.requires_grad = True

for parameter in model.lm_head.parameters():        
    parameter.requires_grad = True
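
As a quick check of the freezing condition in the loop above, the sketch below lists which zero-based block indices end up trainable for a given depth. `unfrozen_block_indices` is a hypothetical helper; the values 12 and 6 correspond to base GPT-2 and `UNFREEZE_LAST_N` in this notebook.

```python
def unfrozen_block_indices(n_blocks, unfreeze_last_n):
    """Zero-based indices of the transformer blocks that remain trainable."""
    return [i for i in range(n_blocks) if i >= n_blocks - unfreeze_last_n]

print(unfrozen_block_indices(12, 6))
# [6, 7, 8, 9, 10, 11]
```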

## Split dataset

In [None]:
data = dict()
for id in df.index:
    data[id] = [df["title"][id], df["answers"][id], df["yake_keyword"][id]] # yake keywords are used because they appear to be more relevant

In [None]:
data[0][2]

In [None]:
train_data, val_data = split_data(data)

train_dataset = myDataset(train_data, tokenizer)
val_dataset = myDataset(val_data, tokenizer, randomize=False)

f'Training sample: {len(train_dataset) :,}\nValidation sample: {len(val_dataset) :,}'

# Finetuning GPT2 

In [None]:
%%time

training_args = TrainingArguments(
    output_dir="/content/",
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=TRAIN_BATCHSIZE,
    per_device_eval_batch_size=TRAIN_BATCHSIZE,
    gradient_accumulation_steps=BATCH_UPDATE,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    fp16=True,
    fp16_opt_level='O1',
    warmup_steps=WARMUP_STEPS,    
    learning_rate=LR,
    adam_epsilon=EPS,
    weight_decay=0.01,        
    save_total_limit=1,
    load_best_model_at_end=True,     
)


trainer = Trainer(
    model=model,
    args=training_args,    
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer
)
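
The arguments above set `warmup_steps` and `learning_rate` but no scheduler type, so the sketch below assumes the Trainer's default linear schedule with warmup. It is a standalone illustration of the shape of that schedule, not the library's code; `linear_warmup_lr` is a hypothetical name and the total of 500 steps is taken from the training log.

```python
def linear_warmup_lr(step, base_lr=5e-4, warmup_steps=100, total_steps=500):
    """Learning rate at a given optimizer step: linear ramp up, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 50, 100, 300, 500):
    print(step, linear_warmup_lr(step))
```

Under this assumption the learning rate peaks at `LR` after 100 steps, which is most of the first epoch (125 steps), and then falls to zero by the end of training.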

In [36]:
trainer.train()
trainer.save_model()    

***** Running training *****
  Num examples = 8000
  Num Epochs = 4
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 16
  Total optimization steps = 500



***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /content/checkpoint-125
Configuration saved in /content/checkpoint-125/config.json
Model weights saved in /content/checkpoint-125/pytorch_model.bin
tokenizer config file saved in /content/checkpoint-125/tokenizer_config.json
Special tokens file saved in /content/checkpoint-125/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /content/checkpoint-250
Configuration saved in /content/checkpoint-250/config.json
Model weights saved in /content/checkpoint-250/pytorch_model.bin
tokenizer config file saved in /content/checkpoint-250/tokenizer_config.json
Special tokens file saved in /content/checkpoint-250/special_tokens_map.json
Deleting older checkpoint [/content/checkpoint-125] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /content/checkpoint-375


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 0.755551 |
| 2 | No log | 0.741121 |
| 3 | No log | 0.744387 |
| 4 | 1.782400 | 0.750322 |


***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /content/checkpoint-500
Configuration saved in /content/checkpoint-500/config.json
Model weights saved in /content/checkpoint-500/pytorch_model.bin
tokenizer config file saved in /content/checkpoint-500/tokenizer_config.json
Special tokens file saved in /content/checkpoint-500/special_tokens_map.json
Deleting older checkpoint [/content/checkpoint-375] due to args.save_total_limit


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from /content/checkpoint-250 (score: 0.7411211133003235).
Saving model checkpoint to /content/
Configuration saved in /content/config.json
Model weights saved in /content/pytorch_model.bin
tokenizer config file saved in /content/tokenizer_config.json
Special tokens file saved in /content/special_tokens_map.json
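
The validation loss is a mean cross-entropy per token, so it can be turned into a perplexity by exponentiating. The snippet below does this for the best checkpoint's score from the log above. One caveat, stated as an assumption: since every sample is padded to `MAXLEN` and the labels are a copy of `input_ids`, padded positions likely contribute to this loss, so the figure is probably not comparable to perplexities measured on unpadded text.

```python
import math

best_validation_loss = 0.7411211133003235  # score of checkpoint-250 in the log
perplexity = math.exp(best_validation_loss)

print(round(perplexity, 2))
```

The result is roughly 2.1.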


In [None]:
#storing the files
drive.mount('/content/gdrive',force_remount=True)

In [None]:
!cp ./pytorch_model.bin '/content/gdrive/MyDrive/model_run_2'
!cp ./config.json '/content/gdrive/MyDrive/model_run_2'
!cp ./tokenizer_config.json '/content/gdrive/MyDrive/model_run_2'
!cp ./special_tokens_map.json '/content/gdrive/MyDrive/model_run_2'
!ls -lt '/content/gdrive/MyDrive/model_run_2'

# Generating Text

In [40]:
tokenizer = get_tokenizer(special_tokens=SPECIAL_TOKENS)
model = get_model(tokenizer, 
                  special_tokens=SPECIAL_TOKENS,
                  load_model_path='pytorch_model.bin')

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": 

Special tokens added


loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50257,
  "embd_pdrop": 0.1,
  "eos_token_id": 50258,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "pad_token_id": 50260,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "sep_token_id": 50261,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "sum

In [41]:
title = "I have a Question."
keywords = ["business","consulting","fraud"]
kw = myDataset.join_keywords(keywords, randomize=False)

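# Note: this prompt places the title between BOS and the first SEP, whereas the training
# samples built in myDataset use BOS + SEP + keywords + SEP + title + ": " + answer
prompt = SPECIAL_TOKENS['bos_token'] + title + SPECIAL_TOKENS['sep_token'] + kw + SPECIAL_TOKENS['sep_token']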
         
generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)
device = torch.device("cuda")
generated = generated.to(device)

model.eval();

In [42]:
sample_outputs = model.generate(generated, 
                                do_sample=True,   
                                min_length=50, 
                                max_length=MAXLEN,
                                top_k=30,                                 
                                top_p=0.7,        
                                temperature=0.9,
                                repetition_penalty=2.0,
                                num_return_sequences=10
                                )

for i, sample_output in enumerate(sample_outputs):
    text = tokenizer.decode(sample_output, skip_special_tokens=True)
    a = len(title) + len(','.join(keywords))    
    print("{}: {}\n\n".format(i+1,  text[a:]))

1: What is the difference between "concessions" and real business?: The latter refers to any transaction you've made with someone else that involves money or property in exchange for your services (such as agreeing on how much cash/gold it will cost).  In most cases this means purchasing something at auction - which usually requires lots of other people's permission first thing before being able sell them off...but sometimes there are agreements where they're allowed some sort trade-in fee if their goods aren't sold outright so long like an agreement regarding what kind might be sent back once somebody has purchased theirs from another source....or when buying products directly without actually doing anything themselves by paying anyone up front; either way these deals tend not always take place within legitimate transactions but instead involve very complex legal arrangements involving many different kinds.: Concession comes into play here because banks typically don’t want customers 
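
The `temperature`, `top_k` and `top_p` arguments passed to `generate` narrow down which tokens can be sampled at each step. The sketch below is a simplified, pure-Python illustration of that idea on a toy distribution, not the library's implementation; `filter_distribution` and the toy logits are made up.

```python
import math

def filter_distribution(logits, temperature=0.9, top_k=30, top_p=0.7):
    """Apply temperature, top-k and nucleus (top-p) filtering to token logits."""
    # 1. Temperature: values below 1 sharpen the distribution
    scaled = {tok: value / temperature for tok, value in logits.items()}
    # 2. Top-k: keep only the k highest-scoring tokens
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # 3. Softmax over the surviving tokens (shifted for numerical stability)
    max_logit = ranked[0][1]
    weights = [(tok, math.exp(value - max_logit)) for tok, value in ranked]
    total = sum(weight for _, weight in weights)
    probs = [(tok, weight / total) for tok, weight in weights]
    # 4. Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, prob in probs:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # 5. Renormalize so the kept probabilities sum to 1
    kept_total = sum(prob for _, prob in kept)
    return {tok: prob / kept_total for tok, prob in kept}

toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
result = filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.7)
print(result)
```

In the toy case, top-k drops "d", and the nucleus step then keeps only "a" and "b" because together they already cover more than 70% of the probability mass.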

# How to use

The model is uploaded to https://huggingface.co/Madhour/gpt2-eli5. Follow the instructions there to use this pre-trained model.

# References
Parts of the code were adapted from or inspired by Ivan Lai's article on [conditional text generation](https://towardsdatascience.com/conditional-text-generation-by-fine-tuning-gpt-2-11c1a9fc639d). His [colab notebook](https://colab.research.google.com/drive/1vnpMoZoenRrWeaxMyfYK4DDbtlBu-M8V?usp=sharing#scrollTo=ZwRhz144Fknp) explains in full detail how to finetune a pre-trained Hugging Face transformer model.

The ELI5 dataset comes from [Fan et al.](https://doi.org/10.18653/v1/p19-1346) and was downloaded using the Hugging Face `datasets` library.