<h1><a href="https://huggingface.co/transformers/model_doc/gpt2.html">HuggingFace OpenAI GPT2</a> <a href="https://huggingface.co/exbert/?model=gpt2&modelKind=bidirectional&sentence=The%20girl%20ran%20to%20a%20local%20pub%20to%20escape%20the%20din%20of%20her%20city.&layer=11&heads=..&threshold=0.7&tokenInd=null&tokenSide=null&maskInds=..&hideClsSep=false">Transformer Visualizer</a></h1>
<h4><a href="https://huggingface.co/transformers/main_classes/processors.html">List of Data Processors</a></h4>
<h4><a href="https://huggingface.co/transformers/pretrained_models.html">List of Pretrained Models</a></h4>
<h4><a href="https://huggingface.co/transformers/main_classes/tokenizer.html">List of Tokenizers</a></h4>
<h4><a href="https://huggingface.co/transformers/main_classes/pipelines.html">List of Pipelines</a></h4>
<h4><a href="https://huggingface.co/transformers/main_classes/optimizer_schedules.html">List of Optimizers</a></h4>

<h3>Installation</h3>

>pip install transformers\[torch]

>pip install transformers\[tf-cpu]

If you don’t have any specific environment variable set, the cache directory will be at ~/.cache/torch/transformers/.
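As a rough sketch of that lookup, the snippet below resolves the cache location the way the sentence above describes. It assumes `TRANSFORMERS_CACHE` is the environment variable being consulted (the text does not name it), and `resolve_cache_dir` is a hypothetical helper, not a library function.

```python
import os

# Default location quoted in the text above.
DEFAULT_CACHE = os.path.join("~", ".cache", "torch", "transformers")

def resolve_cache_dir(env=None):
    """Return the override directory if one is set, else the default."""
    env = os.environ if env is None else env
    # Assumption: TRANSFORMERS_CACHE is the variable that overrides the default.
    override = env.get("TRANSFORMERS_CACHE")
    return os.path.expanduser(override or DEFAULT_CACHE)

print(resolve_cache_dir({}))
print(resolve_cache_dir({"TRANSFORMERS_CACHE": "/tmp/hf-cache"}))
```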


<h2>Text Generation</h2>
In text generation (a.k.a. open-ended text generation), the goal is to create a coherent piece of text that continues the given context. The following example shows how GPT-2 can be used in pipelines to generate text. By default, all models apply Top-K sampling when used in pipelines, as set in their respective configurations (see the gpt-2 config for an example).

<pre><code>
from transformers import pipeline

text_generator = pipeline("text-generation")

print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))



from transformers import GPT2Tokenizer, <em>TFGPT2Model</em>

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = <em>TFGPT2Model.from_pretrained('gpt2')</em>

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

</code></pre>

>>[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]


Here, the model generates text with a maximum total length of 50 tokens from the context “As far as I am concerned, I will”. Because do_sample=False is passed, the continuation is decoded greedily rather than sampled at random. The default arguments of PreTrainedModel.generate() can be overridden directly in the pipeline, as shown above for the argument max_length.
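To make the difference between greedy decoding and Top-K sampling concrete, here is a minimal pure-Python sketch of one decoding step. The probabilities are made-up numbers and the helper names are hypothetical; this is not how the library implements generation internally.

```python
import random

def greedy_pick(probs):
    """Return the single most probable token (what do_sample=False amounts to)."""
    return max(probs, key=probs.get)

def top_k_sample(probs, k, rng):
    """Sample one token from the k most probable candidates."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [p for _, p in top]
    # random.choices normalises the weights, so they need not sum to 1.
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy next-token distribution (illustrative values only).
next_token_probs = {"be": 0.40, "not": 0.25, "go": 0.20, "never": 0.10, "zebra": 0.05}

rng = random.Random(0)
print(greedy_pick(next_token_probs))                   # always "be"
print(top_k_sample(next_token_probs, k=3, rng=rng))    # one of "be", "not", "go"
```

With k=3 the two least likely tokens can never be chosen, which is the point of Top-K: it keeps some randomness while cutting off the unlikely tail.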

<h2>Text Summarization</h2>
Summarization is the task of summarizing a document or an article into a shorter text.

An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was created for the task of summarization. If you would like to fine-tune a model on a summarization task, various approaches are described in this document.

Here is an example of using the pipelines to do summarization. It leverages a Bart model that was fine-tuned on the CNN / Daily Mail data set.
<pre><code>
from transformers import pipeline

summarizer = pipeline("summarization")
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
<em>...</em> 
<em>...</em> 
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18."""
</code></pre>

Because the summarization pipeline depends on the PreTrainedModel.generate() method, we can override its default arguments, such as max_length and min_length, directly in the pipeline, as shown below. This outputs the following summary:

<pre><code>
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))
</code></pre>
>>[{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]

<h4> Using Model and Tokenizer Example</h4>
Here is an example of doing summarization using a model and a tokenizer. The process is the following:

1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder model, such as Bart or T5.

2. Define the article that should be summarized.

3. Add the T5-specific prefix “summarize: ”.

4. Use the PreTrainedModel.generate() method to generate the summary.

<pre><code>
from transformers import AutoModelWithLMHead, AutoTokenizer

model = AutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# T5 uses a max_length of 512, so we truncate the article to 512 tokens.
inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

# Decode the generated token ids back into text.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
</code></pre>

<h2>Fine Tuning</h2>
<h3> Fine-tuning with <a href="https://huggingface.co/transformers/main_classes/trainer.html">Trainer</a></h3>
<pre>
<code>
from transformers import TFDistilBertForSequenceClassification, TFTrainer, TFTrainingArguments

training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

with training_args.strategy.scope():
    model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = TFTrainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()
</code>
</pre>

<h3>Fine-tuning with native PyTorch</h3>
<pre>
<code>
import torch
from torch.utils.data import DataLoader
from transformers import DistilBertForSequenceClassification, AdamW

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(device)
model.train()

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

optim = AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):
    for batch in train_loader:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        loss.backward()
        optim.step()

model.eval()
</code>
</pre>

<pre><code>
import tensorflow as tf
from transformers import GPT2Model, GPT2Config
import ipywidgets
from IPython import display

# Initializing a GPT2 configuration
configuration = GPT2Config()
# Initializing a model from the configuration
model = GPT2Model(configuration)
# Accessing the model configuration
configuration = model.config
</code></pre>

<h1>Tokenization</h1>

→Every transformer model has a similar tokenizer API.

→Here I am using a tokenizer from a pretrained model.

→The main arguments are:

* add_special_tokens: Adds the special tokens that the pretrained model expects, such as [CLS], [SEP], and [UNK]. It should always be kept True.
* max_length: The maximum length of any tokenized sentence. It is a hyperparameter (the original BERT has a maximum length of 512).
* pad_to_max_length: Pads shorter sentences up to max_length.
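The effect of these three arguments can be sketched with a toy encoder. Everything here is hypothetical (the `encode` helper, the six-entry vocabulary, the BERT-style token names); a real tokenizer also splits words into subwords, which this sketch skips.

```python
def encode(tokens, vocab, add_special_tokens=True, max_length=8, pad_to_max_length=True):
    """Toy encoder illustrating special tokens, truncation and padding."""
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    if add_special_tokens:
        # Reserve two slots for [CLS] and [SEP] before truncating.
        ids = [vocab["[CLS]"]] + ids[: max_length - 2] + [vocab["[SEP]"]]
    else:
        ids = ids[:max_length]
    if pad_to_max_length:
        ids = ids + [vocab["[PAD]"]] * (max_length - len(ids))
    return ids

vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4, "world": 5}

# "again" is not in the vocabulary, so it maps to [UNK].
print(encode(["hello", "world", "again"], vocab))  # [2, 4, 5, 1, 3, 0, 0, 0]
```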


<pre><code>
from transformers import DistilBertTokenizer, RobertaTokenizer
distil_bert = 'distilbert-base-uncased' # Pick any desired pre-trained model
roberta = 'roberta-base'

# Defining the DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained(distil_bert, do_lower_case=True, add_special_tokens=True,
                                                max_length=128, pad_to_max_length=True)
# Defining the RoBERTa tokenizer (this reassigns `tokenizer`; keep only the one you need)
tokenizer = RobertaTokenizer.from_pretrained(roberta, do_lower_case=True, add_special_tokens=True,
                                                max_length=128, pad_to_max_length=True)
</code></pre>

<h2>Tokenize Document</h2>

→Any transformer model generally needs three inputs:

* input ids: the vocabulary id of each token

* attention mask: which ids must be attended to; 1 = pay attention. In simple terms, it tells the model which positions hold real tokens and which are padding

* token type ids: used by models that consume multiple sentences, such as question-answering models. They tell the model which sentence each token belongs to.

→It is not compulsory to provide all three; input ids alone will do. However, the attention mask helps the model focus only on valid tokens, so for classification tasks at least these two should be provided.
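To see how the three inputs line up for a sentence pair, here is a small sketch that builds them by hand. The helper `build_inputs` and the ids 101, 102 and 0 for [CLS], [SEP] and padding are placeholders chosen for illustration; a real tokenizer produces these fields for you.

```python
def build_inputs(ids_a, ids_b, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Assemble BERT-style inputs for a sentence pair (toy version)."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS] + sentence A + [SEP]; segment 1 covers sentence B + [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    pad = max_length - len(input_ids)
    if pad < 0:
        raise ValueError("pair is longer than max_length; truncate first")
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "attention_mask": attention_mask + [0] * pad,   # 0 marks padding
        "token_type_ids": token_type_ids + [0] * pad,
    }

inputs = build_inputs([7, 8, 9], [10, 11], max_length=10)
for name, values in inputs.items():
    print(name, values)
```

Only the two trailing padding positions get a 0 in the attention mask, while the token type ids switch from 0 to 1 where the second sentence starts.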


<pre><code>
import numpy as np
from tqdm import tqdm

def tokenize(sentences, tokenizer):
    # Encode each sentence and collect its ids, attention mask and segment ids.
    input_ids, input_masks, input_segments = [],[],[]
    for sentence in tqdm(sentences):
        inputs = tokenizer.encode_plus(sentence, add_special_tokens=True, max_length=128, pad_to_max_length=True,
                                             truncation=True, return_attention_mask=True, return_token_type_ids=True)
        input_ids.append(inputs['input_ids'])
        input_masks.append(inputs['attention_mask'])
        input_segments.append(inputs['token_type_ids'])

    return np.asarray(input_ids, dtype='int32'), np.asarray(input_masks, dtype='int32'), np.asarray(input_segments, dtype='int32')
</code></pre>


<h1>Training and Fine-Tuning</h1>

<h2>Use Pretrained Model directly as Classifier</h2>

Hugging Face’s transformers library provides some models with sequence classification ability. These models have two parts: a pre-trained model architecture as the base and a classifier as the top head.

Tokenizer definition →Tokenization of Documents →Model Definition

<pre><code>
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
import tensorflow as tf
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
inputs["labels"] = tf.reshape(tf.constant(1), (-1, 1)) # Batch size 1
outputs = model(inputs)
loss, logits = outputs[:2]
</code></pre>


<pre><code>
from transformers import TFDistilBertForSequenceClassification, DistilBertConfig
import tensorflow as tf


distil_bert = 'distilbert-base-uncased'

config = DistilBertConfig(num_labels=6)
config.output_hidden_states = False
transformer_model = TFDistilBertForSequenceClassification.from_pretrained(distil_bert, config = config)

input_ids = tf.keras.layers.Input(shape=(128,), name='input_token', dtype='int32')
input_masks_ids = tf.keras.layers.Input(shape=(128,), name='masked_token', dtype='int32')
X = transformer_model(input_ids, input_masks_ids)
model = tf.keras.Model(inputs=[input_ids, input_masks_ids], outputs = X)
</code></pre>

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_projector', 'vocab_transform', 'activation_13', 'vocab_layer_norm']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'classifier', 'dropout_99']
You should probably TRAIN this model on a down-stream task to be able to use i

<pre><code>
model.summary()
</code></pre>

<pre>
Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
input_token (InputLayer)        [(None, 128)]        0
__________________________________________________________________________________________________
masked_token (InputLayer)       [(None, 128)]        0
__________________________________________________________________________________________________
tf_distil_bert_for_sequence_cla ((None, 6),)         66958086    input_token[0][0]
                                                                 masked_token[0][0]
Total params: 66,958,086
Trainable params: 66,958,086
Non-trainable params: 0
__________________________________________________________________________________________________
</pre>


→Note: Only SequenceClassification models are applicable here.

→Defining the proper config is crucial here. In the code above, the config is created with DistilBertConfig(num_labels=6), where num_labels is the number of classes to use when the model is a classification model. The config supports many other options, so go ahead and see the docs.

→Some key things to note here are:

* Here, only the weights of the pre-trained model can be updated, but updating them is not a good idea as it defeats the purpose of transfer learning. So there is effectively nothing here to update, which is why I prefer this approach the least.
* It is also the least customizable.
* A hack you can try is to set num_labels to a much higher number and add a trainable dense layer at the end.

<pre><code>
# # Hack
# config = DistilBertConfig(num_labels=64)
# config.output_hidden_states = False
# transformer_model=TFDistilBertForSequenceClassification.from_pretrained(distil_bert, config = config)
# input_ids = tf.keras.layers.Input(shape=(128,), name='input_token', dtype='int32')
# input_masks_ids = tf.keras.layers.Input(shape=(128,), name='masked_token', dtype='int32')
# X = transformer_model(input_ids, input_masks_ids)[0]
# X = tf.keras.layers.Dropout(0.2)(X)
# X = tf.keras.layers.Dense(6, activation='softmax')(X)
# model = tf.keras.Model(inputs=[input_ids, input_masks_ids], outputs = X)
# # Freeze the two input layers and the transformer; only the dense layer trains
# for layer in model.layers[:3]:
#     layer.trainable = False
</code></pre>


<h2>Transformer model to extract embedding and use it as input to another classifier</h2>

This approach needs two levels, or two separate models. We use any transformer model to extract word embeddings and then use these embeddings as input to any classifier (e.g. logistic regression, random forest, neural nets, etc.).
I would suggest you read this <a href="http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/">article</a> by Jay Alammar, which discusses this approach with great detail and clarity.
As this blog is all about neural nets, let me give you an example of this approach with a NN.

→The line cls_token = embedding_layer[:,0,:] is key here. We are only interested in the [CLS] (classification) token of the model, which can be extracted using the slice operation. Now we have 2D data and can build the network as desired.
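The slice can be illustrated without TensorFlow. The sketch below uses nested lists as a stand-in for a tensor of shape (batch, sequence length, hidden size); `take_first_token` is a hypothetical helper and the numbers are arbitrary.

```python
# Toy "embedding_layer" output with shape (batch=2, seq_len=3, hidden=4).
embedding_layer = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.1, 3.2, 3.3], [4.0, 4.1, 4.2, 4.3]],
]

def take_first_token(batch):
    """Pure-Python equivalent of batch[:, 0, :] on a 3D tensor."""
    return [sequence[0] for sequence in batch]

cls_token = take_first_token(embedding_layer)
print(len(cls_token), len(cls_token[0]))  # 2 4
print(cls_token)
```

The result keeps one vector per example, so the data goes from 3D (batch, tokens, hidden) to 2D (batch, hidden), which is what a dense layer expects.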


<pre><code>
import tensorflow as tf
from transformers import TFDistilBertModel, DistilBertConfig
distil_bert = 'distilbert-base-uncased'

config = DistilBertConfig(dropout=0.2, attention_dropout=0.2)
config.output_hidden_states = False
transformer_model = TFDistilBertModel.from_pretrained(distil_bert, config = config)

input_ids_in = tf.keras.layers.Input(shape=(128,), name='input_token', dtype='int32')
input_masks_in = tf.keras.layers.Input(shape=(128,), name='masked_token', dtype='int32')

embedding_layer = transformer_model(input_ids_in, attention_mask=input_masks_in)[0]
cls_token = embedding_layer[:,0,:]
X = tf.keras.layers.BatchNormalization()(cls_token)
X = tf.keras.layers.Dense(192, activation='relu')(X)
X = tf.keras.layers.Dropout(0.2)(X)
X = tf.keras.layers.Dense(6, activation='softmax')(X)
model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs = X)

# Freeze the two input layers and the transformer
for layer in model.layers[:3]:
  layer.trainable = False
</code></pre>

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['vocab_projector', 'vocab_transform', 'activation_13', 'vocab_layer_norm']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


<pre><code>
model.summary()
</code></pre>

<pre>
Model: "functional_6"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
input_token (InputLayer)        [(None, 128)]        0
__________________________________________________________________________________________________
masked_token (InputLayer)       [(None, 128)]        0
__________________________________________________________________________________________________
tf_distil_bert_model_1 (TFDisti ((None, 128, 768),)  66362880    input_token[0][0]
                                                                 masked_token[0][0]
__________________________________________________________________________________________________
tf_op_layer_strided_slice_1 (Te [(None, 768)]        0           tf_distil_bert_model_1
</pre>

<h3>Just extract word embedding</h3>

<pre><code>
# import numpy as np
# from transformers import AutoTokenizer, pipeline, TFDistilBertModel
# model = TFDistilBertModel.from_pretrained('distilbert-base-uncased')
# tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
# pipe = pipeline('feature-extraction', model=model,
#                 tokenizer=tokenizer)
# features = pipe('any text data or list of text data',
#                 pad_to_max_length=True)
# features = np.squeeze(features)
# features = features[:,0,:]
</code></pre>


<h2>Fine-Tuning a Pretrained transformer model</h2>

→Look at the Bidirectional LSTM line in the code below: because the embedding layer produces 3D data, we can use an LSTM to extract finer details.

→The next step is to transform the 3D data into 2D so that we can use a FC layer. You can use any pooling layer to do this.
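As an illustration of that 3D-to-2D step, here is a pure-Python sketch of global max pooling over the time axis. The helper name and the toy values are made up; in Keras the same job is done by a pooling layer.

```python
def global_max_pool_1d(batch):
    """Max over the time axis: (batch, steps, features) -> (batch, features)."""
    return [[max(column) for column in zip(*sequence)] for sequence in batch]

# Toy LSTM output with shape (batch=1, steps=3, features=2).
lstm_out = [[[0.1, 0.9], [0.7, 0.2], [0.4, 0.5]]]

pooled = global_max_pool_1d(lstm_out)
print(pooled)  # [[0.7, 0.9]]
```

Each feature keeps only its largest value across the sequence, so the sequence dimension disappears and a fully connected layer can take the result directly.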

→Also, note the final loop that sets layer.trainable = False. We should always freeze the pre-trained weights of the transformer model and update only the remaining weights.

Some extras

→Every approach has two things in common:

* config.output_hidden_states = False, as we are training and are not interested in the hidden states.
* X = transformer_model(…)\[0], which is in line with config.output_hidden_states = False, as we want only the output of the top head.

→config behaves like a dictionary, so to see all the available settings, simply print it.

→Choose the base model carefully: TF 2.0 support is new, so there might be bugs.

<pre><code>
distil_bert = 'distilbert-base-uncased'

config = DistilBertConfig(dropout=0.2, attention_dropout=0.2)
config.output_hidden_states = False
transformer_model = TFDistilBertModel.from_pretrained(distil_bert, config = config)

input_ids_in = tf.keras.layers.Input(shape=(128,), name='input_token', dtype='int32')
input_masks_in = tf.keras.layers.Input(shape=(128,), name='masked_token', dtype='int32')

embedding_layer = transformer_model(input_ids_in, attention_mask=input_masks_in)[0]
X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(embedding_layer)
X = tf.keras.layers.GlobalMaxPool1D()(X)
X = tf.keras.layers.Dense(50, activation='relu')(X)
X = tf.keras.layers.Dropout(0.2)(X)
X = tf.keras.layers.Dense(6, activation='sigmoid')(X)
model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs = X)

for layer in model.layers[:3]:
  layer.trainable = False
</code></pre>



<pre><code>
model.summary()
</code></pre>

<pre>
Model: "functional_8"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
input_token (InputLayer)        [(None, 128)]        0
__________________________________________________________________________________________________
masked_token (InputLayer)       [(None, 128)]        0
__________________________________________________________________________________________________
tf_distil_bert_model_3 (TFDisti ((None, 128, 768),)  66362880    input_token[0][0]
                                                                 masked_token[0][0]
__________________________________________________________________________________________________
bidirectional (Bidirectional)   (None, 128, 100)     327600      tf_distil_bert_model_3
</pre>

<h3>For Data Output Cleaning</h3>
You can use the bad_words_ids parameter with the token id of a line break, which prevents the generate function from returning results that contain "\n". (You would probably have to add every id in your vocab whose token contains a line break, since there tends to be more than one "breaking" sequence.)
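A rough sketch of collecting those ids is shown below. It runs on a toy vocabulary so it needs no model download; the helper name is hypothetical, and it assumes GPT-2's byte-level BPE, where a line break appears as "Ċ" inside token strings. With a real tokenizer the vocabulary would come from tokenizer.get_vocab().

```python
def newline_bad_words_ids(vocab):
    """Collect the id of every token containing a line break, one list per id."""
    # Assumption: GPT-2's byte-level BPE writes "\n" as "Ċ" inside token strings.
    markers = ("\n", "Ċ")
    ids = sorted(i for token, i in vocab.items() if any(m in token for m in markers))
    # bad_words_ids expects a list of token-id sequences, hence the inner lists.
    return [[i] for i in ids]

# Toy vocabulary standing in for tokenizer.get_vocab().
toy_vocab = {"Hello": 0, "Ġworld": 1, "Ċ": 2, "ĊĊ": 3, ".Ċ": 4, "!": 5}

bad_words_ids = newline_bad_words_ids(toy_vocab)
print(bad_words_ids)  # [[2], [3], [4]]
# The result would then be passed as model.generate(..., bad_words_ids=bad_words_ids).
```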