## Introduction to the Hugging Face __transformers__ library
* The transformers library is an open-source, community-based repository for training, using, and sharing models based on the Transformer architecture, such as
  * BERT
  * GPT-2
  * XLNet

* Along with the models, the library contains multiple variations of each of them for a large variety of downstream tasks, such as

  * Named Entity Recognition (NER)

  * Classification tasks (such as text classification and sentiment analysis)

  * Language Modeling

  * Question Answering

* The transformers library allows you to benefit from large, pretrained language models without requiring a huge and costly computational infrastructure.


## Getting started with transformers

Install the Hugging Face transformers library.

In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/ed/d5/f4157a376b8a79489a76ce6cfe147f4f3be1e029b7144fa7b8432e8acb26/transformers-4.4.2-py3-none-any.whl (2.0MB)

Load TensorFlow and the required transformers modules.

In [2]:
import tensorflow 
from transformers import AutoTokenizer, TFAutoModel

## Three types of classes for each model:
*   **Model** classes such as BertModel, which are 30+ PyTorch models (torch.nn.Module) or Keras models (tf.keras.Model) that work with the pretrained weights provided in the library.
*   **Configuration** classes such as BertConfig, which store all the parameters required to build a model. You don’t always need to instantiate these yourself. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model).
*   **Tokenizer** classes such as BertTokenizer, which store the vocabulary for each model and provide methods for encoding strings into, and decoding them from, the list of token embedding indices that is fed to a model.

## All these classes can be instantiated from pretrained instances and saved locally using two methods:

*   **from_pretrained()** lets you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (see the library documentation for the list of supported models) or stored locally (or on a server) by the user.
*   **save_pretrained()** lets you save a model/configuration/tokenizer locally so that it can be reloaded using from_pretrained().
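
The round trip between the two methods can be sketched without downloading anything by using a configuration object. This is a minimal sketch: the tiny sizes passed to `BertConfig` are arbitrary values chosen for illustration, not settings of any published checkpoint.

```python
import tempfile

from transformers import BertConfig

# Hypothetical, deliberately tiny settings chosen only for illustration.
config = BertConfig(
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=256,
)

with tempfile.TemporaryDirectory() as save_dir:
    # save_pretrained() writes the configuration as config.json in the directory.
    config.save_pretrained(save_dir)
    # from_pretrained() accepts that local directory in place of a model name.
    reloaded = BertConfig.from_pretrained(save_dir)

print(reloaded.hidden_size, reloaded.num_hidden_layers)
```

Models and tokenizers follow the same pattern: pass a directory to `save_pretrained()`, then pass the same directory to `from_pretrained()`.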

In [3]:
# Store the name of the model we want to use
MODEL_NAME = "bert-base-cased"

# Create the pretrained bert tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Create the pretrained bert Tensorflow models
model = TFAutoModel.from_pretrained(MODEL_NAME)


Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [5]:
# Tokens come from a process that splits the input into sub-entities with interesting linguistic properties.
tokens = tokenizer.tokenize("This is an input example")
print("Tokens                       : {}".format(tokens))

# This is not sufficient for the model, which requires integers as input,
# so let's convert the tokens to ids.
tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Tokens id                    : {}".format(tokens_ids))

# Add the required special tokens
tokens_ids = tokenizer.build_inputs_with_special_tokens(tokens_ids)
print("Tokens id with special tokens: {}".format(tokens_ids))

# We need to convert to a framework-specific tensor format; let's use TensorFlow
tokens_tf = tensorflow.convert_to_tensor([tokens_ids])
print("Tokens TensorFlow            : {}".format(tokens_tf))

Tokens                       : ['This', 'is', 'an', 'input', 'example']
Tokens id                    : [1188, 1110, 1126, 7758, 1859]
Tokens id with special tokens: [101, 1188, 1110, 1126, 7758, 1859, 102]
Tokens TensorFlow            : [[ 101 1188 1110 1126 7758 1859  102]]
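
To make the three steps concrete, here is a minimal pure-Python sketch of greedy longest-match-first WordPiece splitting, id lookup, and special-token insertion. It is an illustration under stated assumptions, not the library's implementation: `TOY_VOCAB`, `wordpiece`, and `encode` are hypothetical names, the vocabulary holds only a handful of entries whose ids are copied from this notebook's outputs (the `[UNK]` id is assumed), and the text is split on whitespace only.

```python
# Toy vocabulary: ids copied from the outputs printed in this notebook.
# The real bert-base-cased vocabulary is far larger; the [UNK] id is assumed.
TOY_VOCAB = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "This": 1188, "is": 1110, "an": 1126, "input": 7758, "example": 1859,
    "a": 170, "A": 138, "se": 14516, "##q": 4426, "##ment": 1880,
}


def wordpiece(word, vocab):
    """Split one word into sub-word pieces, always taking the longest match."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    """Return (tokens, ids), where ids are wrapped in [CLS] ... [SEP]."""
    tokens = [piece for word in text.split() for piece in wordpiece(word, vocab)]
    ids = [vocab[token] for token in tokens]
    return tokens, [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]


tokens, ids = encode("This is an input example", TOY_VOCAB)
print(tokens)
print(ids)
print(wordpiece("seqment", TOY_VOCAB))  # a misspelled word falls back to pieces
```

With this toy vocabulary the ids reproduce the `Tokens id with special tokens` line printed by the cell, and a word missing from the vocabulary is broken into smaller known pieces instead of being discarded.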



```python
tokens = tokenizer.tokenize("This is an input example")
tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
tokens_ids = tokenizer.build_inputs_with_special_tokens(tokens_ids)
tokens_tf = tensorflow.convert_to_tensor([tokens_ids])
```
__TensorFlow__: This code can be factored into a single line as follows:

```python
tokens_tf = tokenizer.encode_plus("This is an input example", return_tensors="tf")
```

In [6]:
tokens_tf2 = tokenizer.encode_plus("This is an input example", return_tensors="tf")

for key, value in tokens_tf2.items():
    print("{}:\n\t{}".format(key, value))

input_ids:
	[[ 101 1188 1110 1126 7758 1859  102]]
token_type_ids:
	[[0 0 0 0 0 0 0]]
attention_mask:
	[[1 1 1 1 1 1 1]]


As you can see above, the `encode_plus` method provides a convenient way to generate all the required parameters
that will go through the model.

In addition to input_ids, it also generates the token_type_ids and attention_mask tensors.

#### __token_type_ids__: 

This tensor maps every token to its corresponding segment.

In [7]:
# Single segment input
single_seg_input = tokenizer.encode_plus("This is a seqment A")

print("Single segment token (str): {}".format(tokenizer.convert_ids_to_tokens(single_seg_input['input_ids'])))
print("Single segment token (int): {}".format(single_seg_input['input_ids']))
print("Single segment type       : {}".format(single_seg_input['token_type_ids']))
print()

# Multiple segment input
multi_seg_input = tokenizer.encode_plus("This is segment A", "This is segment B")

# Segments are concatenated in the input to the model, with a [SEP] token after each segment
print("Multi segment token (str): {}".format(tokenizer.convert_ids_to_tokens(multi_seg_input['input_ids'])))
print("Multi segment token (int): {}".format(multi_seg_input['input_ids']))
print("Multi segment type       : {}".format(multi_seg_input['token_type_ids']))

Single segment token (str): ['[CLS]', 'This', 'is', 'a', 'se', '##q', '##ment', 'A', '[SEP]']
Single segment token (int): [101, 1188, 1110, 170, 14516, 4426, 1880, 138, 102]
Single segment type       : [0, 0, 0, 0, 0, 0, 0, 0, 0]

Multi segment token (str): ['[CLS]', 'This', 'is', 'segment', 'A', '[SEP]', 'This', 'is', 'segment', 'B', '[SEP]']
Multi segment token (int): [101, 1188, 1110, 6441, 138, 102, 1188, 1110, 6441, 139, 102]
Multi segment type       : [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
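
The layout of the two lists can be reproduced by hand. The following is a minimal sketch, assuming the special-token ids seen in the output above (101 for `[CLS]`, 102 for `[SEP]`); `build_pair` is a hypothetical helper written for illustration, not a library function.

```python
CLS_ID, SEP_ID = 101, 102  # special-token ids taken from the printed output


def build_pair(ids_a, ids_b=None):
    """Return (input_ids, token_type_ids) for one or two segments of token ids."""
    # Segment A, its [CLS] prefix and its closing [SEP] all get type 0.
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)
    if ids_b is not None:
        # Segment B and its closing [SEP] get type 1.
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids


segment_a = [1188, 1110, 6441, 138]  # "This is segment A"
segment_b = [1188, 1110, 6441, 139]  # "This is segment B"

input_ids, token_type_ids = build_pair(segment_a, segment_b)
print(input_ids)
print(token_type_ids)
```

The printed lists match the `Multi segment` lines above: six zeros covering `[CLS]`, segment A and the first `[SEP]`, followed by five ones covering segment B and the final `[SEP]`.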


#### __attention_mask__

This tensor is used to "mask" padded values in a batch of sequences with different lengths.

In [9]:
# Padding highlight
tokens = tokenizer.batch_encode_plus(["This is a sample", 
                                      "This is another longer sample text"], 
                                     padding=True)  # First sentence will have some PADDED tokens to match second sequence length

for i in range(2):
    print("Tokens (int)      : {}".format(tokens['input_ids'][i]))
    print("Tokens (str)      : {}".format([tokenizer.convert_ids_to_tokens(s) for s in tokens['input_ids'][i]]))
    print("Tokens (attn_mask): {}".format(tokens['attention_mask'][i]))
    print()

Tokens (int)      : [101, 1188, 1110, 170, 6876, 102, 0, 0]
Tokens (str)      : ['[CLS]', 'This', 'is', 'a', 'sample', '[SEP]', '[PAD]', '[PAD]']
Tokens (attn_mask): [1, 1, 1, 1, 1, 1, 0, 0]

Tokens (int)      : [101, 1188, 1110, 1330, 2039, 6876, 3087, 102]
Tokens (str)      : ['[CLS]', 'This', 'is', 'another', 'longer', 'sample', 'text', '[SEP]']
Tokens (attn_mask): [1, 1, 1, 1, 1, 1, 1, 1]
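
What `padding=True` produces can be sketched in a few lines of plain Python. This is an illustrative sketch only: `pad_batch` is a hypothetical helper, and the pad id of 0 is taken from the `[PAD]` entries in the output above.

```python
PAD_ID = 0  # id of [PAD], as seen in the printed output


def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should not attend to.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask


batch = [
    [101, 1188, 1110, 170, 6876, 102],               # "This is a sample"
    [101, 1188, 1110, 1330, 2039, 6876, 3087, 102],  # "This is another longer sample text"
]

input_ids, attention_mask = pad_batch(batch)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

The shorter sequence gains two pad ids and two zeros in its mask, while the longer one is left unchanged, which is the same pattern shown in the cell output.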



In [10]:
outputs = model(tokens_tf2)

print("last_hidden_state: \n\t {}".format(outputs.last_hidden_state.shape))
print("pooler_output    : \n\t {}".format(outputs.pooler_output.shape))

last_hidden_state: 
	 (1, 7, 768)
pooler_output    : 
	 (1, 768)


As you can see, BERT outputs two tensors:
 - One with the generated representation for every token in the input `(1, NB_TOKENS, REPRESENTATION_SIZE)`
 - One with an aggregated representation for the whole input `(1, REPRESENTATION_SIZE)`
 
The first, token-based, representation can be leveraged if your task requires keeping the sequence structure and you
want to operate at the token level. This is particularly useful for Named Entity Recognition and Question Answering.

The second, aggregated, representation is especially useful if you need to extract the overall context of the sequence and don't
require fine-grained, token-level detail. This is the case for Sentiment Analysis of the sequence or Information Retrieval.
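
As a rough illustration of how the two outputs relate, the sketch below assumes the standard BERT pooler, which passes the hidden state of the first token (`[CLS]`) through a dense layer followed by `tanh`. Everything here is a toy: `pool_first_token` is a hypothetical helper, the hidden size is 2 instead of 768, and the weights are made up rather than taken from the checkpoint.

```python
import math


def pool_first_token(last_hidden_state, weight, bias):
    """Compute tanh(W @ h_first + b) for each sequence in the batch."""
    pooled = []
    for sequence in last_hidden_state:
        h_first = sequence[0]  # hidden state of the first token ([CLS])
        pooled.append([
            math.tanh(sum(w * h for w, h in zip(row, h_first)) + b)
            for row, b in zip(weight, bias)
        ])
    return pooled


# Toy tensor of shape (1, 3, 2): one sequence, three tokens, hidden size 2.
last_hidden_state = [[[0.5, -1.0], [0.1, 0.2], [0.3, 0.4]]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # made-up dense weights (identity matrix)
bias = [0.0, 0.0]

pooled_output = pool_first_token(last_hidden_state, weight, bias)
print(len(pooled_output), len(pooled_output[0]))  # batch size, hidden size
```

The token axis disappears in the pooled result, mirroring the change from `(1, 7, 768)` to `(1, 768)` in the shapes printed above.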