In [1]:
import pandas as pd

## Pipeline

In [4]:
from transformers import pipeline

### Create default pipeline
When no model is given, the pipeline is initialized with a default DistilBERT model fine-tuned on SST-2 and its matching tokenizer.

In [43]:
classifier = pipeline('sentiment-analysis')

reviews = [
    "The movie was great!",
    "The movie was okay.",
    "The movie was terrible...",
    "The movie was so bad that I had to leave the theater. I will never watch it again.",
    "Netflix's The Witcher is a dark fantasy epic that will make you want to toss a coin of your own to the show's creators.",
]

classifier(reviews)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998706579208374},
 {'label': 'POSITIVE', 'score': 0.9998013377189636},
 {'label': 'NEGATIVE', 'score': 0.9997760653495789},
 {'label': 'NEGATIVE', 'score': 0.999782145023346},
 {'label': 'POSITIVE', 'score': 0.9509405493736267}]

### Specify the model
The tokenizer must be the same one that was used to train the model. When only a model name is passed, `pipeline` loads the matching tokenizer from the same checkpoint.

In [48]:
model_name = 'nlptown/bert-base-multilingual-uncased-sentiment'

classifier = pipeline('sentiment-analysis', model=model_name)

In [49]:
classifier(reviews)

[{'label': '5 stars', 'score': 0.6459116339683533},
 {'label': '3 stars', 'score': 0.7961868047714233},
 {'label': '1 star', 'score': 0.7816266417503357},
 {'label': '1 star', 'score': 0.7096598744392395},
 {'label': '4 stars', 'score': 0.44803386926651}]

### Text generation

In [53]:
generator = pipeline('text-generation', model='gpt2')

generated_text = generator(
    "In this course, we will teach you how to",
    max_length=20,
    num_return_sequences=2,
)

for text in generated_text:
    print(text['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In this course, we will teach you how to use a variety of programming languages to create and manipulate
In this course, we will teach you how to write functional HTML5 code using JavaScript with HTML5


## Auto classes

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [55]:
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer) # if tokenizer is not specified, it will use the default model tokenizer

classifier(reviews) # the results are the same as before because "distilbert-base-uncased-finetuned-sst-2-english" is a default model for sentiment analysis

[{'label': 'POSITIVE', 'score': 0.9998706579208374},
 {'label': 'POSITIVE', 'score': 0.9998013377189636},
 {'label': 'NEGATIVE', 'score': 0.9997760653495789},
 {'label': 'NEGATIVE', 'score': 0.999782145023346},
 {'label': 'POSITIVE', 'score': 0.9509405493736267}]

### Tokenizer

In [66]:
sequence = "Using a transformer network is simple"

res = tokenizer(sequence)

tokens = tokenizer.tokenize(sequence)

ids = tokenizer.convert_tokens_to_ids(tokens)
decoded_tokens = tokenizer.convert_ids_to_tokens(ids)

original_text = tokenizer.decode(ids)

print(res) # dictionary with input_ids (including the [CLS] and [SEP] special tokens) and attention_mask
print(tokens)
print(ids) # token ids without the [CLS] and [SEP] special tokens
print(decoded_tokens)
print(original_text)

{'input_ids': [101, 2478, 1037, 10938, 2121, 2897, 2003, 3722, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
['using', 'a', 'transform', '##er', 'network', 'is', 'simple']
[2478, 1037, 10938, 2121, 2897, 2003, 3722]
['using', 'a', 'transform', '##er', 'network', 'is', 'simple']
using a transformer network is simple
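
The `'transform'`, `'##er'` split in the output comes from WordPiece-style greedy longest-match splitting. Below is a minimal sketch of that idea, not the library's implementation. It assumes a tiny hypothetical vocabulary (`TOY_VOCAB`) and a hypothetical helper (`wordpiece_split`); the real vocabulary has about 30k entries.

```python
# Toy vocabulary (an assumption for illustration, not the real DistilBERT vocabulary).
TOY_VOCAB = {"using", "a", "transform", "##er", "network", "is", "simple", "[UNK]"}


def wordpiece_split(word, vocab):
    """Split one lowercase word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


toy_sentence = "Using a transformer network is simple"
toy_tokens = [
    piece
    for word in toy_sentence.lower().split()
    for piece in wordpiece_split(word, TOY_VOCAB)
]
print(toy_tokens)
```

With this toy vocabulary, `transformer` has no entry of its own, so the longest matching prefix `transform` is taken first and the remainder becomes the continuation piece `##er`.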


## Hugging Face with TensorFlow
`TFAutoModel` is a generic class that loads the base model of any supported pre-trained checkpoint, without a task-specific head. `TFAutoModelForSequenceClassification` loads the same base model with a classification head on top, for sequence classification tasks. Both classes can load the same checkpoint, but the resulting models differ in architecture and output: the first returns hidden states, the second returns one logit per label.
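
To make the difference between a base model and a model with a classification head concrete, here is a library-free toy, not the real DistilBERT code. The names `base_model` and `classification_head`, the hidden size of 3, and all weights are made up for illustration. The head reads the first token's vector, as BERT-style classifiers commonly do.

```python
def base_model(input_ids):
    """Stand-in for a base model: returns one hidden vector (size 3) per token id."""
    return [[(tid % 7) / 7.0, (tid % 5) / 5.0, (tid % 3) / 3.0] for tid in input_ids]


def classification_head(hidden_states, weights, bias):
    """Linear layer applied to the first token's hidden vector: one logit per label."""
    first = hidden_states[0]
    return [
        sum(w * h for w, h in zip(row, first)) + b
        for row, b in zip(weights, bias)
    ]


toy_input_ids = [101, 2478, 102]
toy_hidden = base_model(toy_input_ids)  # what a base-only model would give you

toy_weights = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.2]]  # made-up: 2 labels x 3 features
toy_bias = [0.0, 0.1]
toy_logits = classification_head(toy_hidden, toy_weights, toy_bias)  # base + head

print(len(toy_hidden), len(toy_hidden[0]))  # one vector per token
print(len(toy_logits))                      # one logit per label
```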

In [40]:
import tensorflow as tf

from transformers import TFAutoModelForSequenceClassification, AutoTokenizer
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizer

Both classes load the same pre-trained model here. `TFAutoModelForSequenceClassification` provides a higher level of abstraction: it can load any pre-trained sequence classification model from the transformers library and selects the matching architecture itself. `TFDistilBertForSequenceClassification` is specific to the DistilBERT architecture, so it can only load pre-trained DistilBERT models for sequence classification.
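
The idea behind an auto class can be sketched as a lookup from the checkpoint's model type to a concrete class. The toy below is only an illustration of that dispatch pattern under assumed names (`ToyAutoClassifier`, `TOY_REGISTRY`, and the two toy model classes are all hypothetical); it does not reproduce the library's internals.

```python
class ToyDistilBertClassifier:
    """Stand-in for an architecture-specific class."""

    def __init__(self, config):
        self.config = config


class ToyBertClassifier:
    """Stand-in for a second architecture-specific class."""

    def __init__(self, config):
        self.config = config


# Maps a model type string to the class that implements it.
TOY_REGISTRY = {
    "distilbert": ToyDistilBertClassifier,
    "bert": ToyBertClassifier,
}


class ToyAutoClassifier:
    """Picks the concrete class from the config instead of hard-coding it."""

    @classmethod
    def from_config(cls, config):
        model_type = config["model_type"]
        if model_type not in TOY_REGISTRY:
            raise ValueError(f"Unsupported model type: {model_type}")
        return TOY_REGISTRY[model_type](config)


toy_model = ToyAutoClassifier.from_config({"model_type": "distilbert", "num_labels": 2})
print(type(toy_model).__name__)
```

The architecture-specific class is still what gets instantiated; the auto class only saves you from naming it yourself.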

In [67]:
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

X_train = [
    "I've been waiting for this LEGO Witcher series for a long time! I'm so excited!",
    "The Witch series on Netflix is so inacurate. I can't believe they made it."
]

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_419']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [68]:
# padding=True pads every sequence to the longest one in the batch;
# truncation=True with max_length=512 cuts off anything longer than 512 tokens
batch = tokenizer(X_train, padding=True, truncation=True, max_length=512, return_tensors="tf")
batch

{'input_ids': <tf.Tensor: shape=(2, 24), dtype=int32, numpy=
array([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  2023, 23853,
         6965,  2121,  2186,  2005,  1037,  2146,  2051,   999,  1045,
         1005,  1049,  2061,  7568,   999,   102],
       [  101,  1996,  6965,  2186,  2006, 20907,  2003,  2061, 27118,
        10841, 11657,  1012,  1045,  2064,  1005,  1056,  2903,  2027,
         2081,  2009,  1012,   102,     0,     0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 24), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        0, 0]], dtype=int32)>}
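
The trailing zeros in the second row of `input_ids` and `attention_mask` are padding. Here is a minimal library-free sketch of padding to the longest sequence, using a hypothetical helper `pad_batch` and short id lists borrowed from the tokenizer example. Real tokenizers truncate more carefully (they keep the closing special token), so this only illustrates the shapes.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Naively truncate to max_length (if given), then pad to the longest sequence."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = pad_batch([[101, 2478, 1037, 102], [101, 3722, 102]])
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```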

In [69]:
# Inference using the model (no GradientTape is needed, since no gradients are computed)
logits = model(batch)[0]
probs = tf.nn.softmax(logits, axis=1)
preds = tf.argmax(probs, axis=1)

In [70]:
print(logits.numpy())
print(probs.numpy())
print(preds.numpy()) # 0 is negative and 1 is positive

[[-3.7832916  4.0378428]
 [ 3.5807006 -2.9227448]]
[[4.0100541e-04 9.9959904e-01]
 [9.9850404e-01 1.4960265e-03]]
[1 0]
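
As a sanity check, the probabilities and predictions can be recomputed from the printed logits without TensorFlow. This is a small sketch with a hypothetical `softmax` helper; the logit values are copied from the output above.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the printed output above.
logits_from_output = [[-3.7832916, 4.0378428], [3.5807006, -2.9227448]]

toy_probs = [softmax(row) for row in logits_from_output]
# argmax: index of the largest probability in each row (0 = negative, 1 = positive)
toy_preds = [max(range(len(row)), key=row.__getitem__) for row in toy_probs]

print(toy_probs)
print(toy_preds)
```

Since softmax preserves the ordering of the logits, taking the argmax of the logits directly would give the same predictions.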


In [71]:
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

classifier(X_train)

[{'label': 'POSITIVE', 'score': 0.9995989203453064},
 {'label': 'NEGATIVE', 'score': 0.9985040426254272}]

We achieved the same results using `pipeline` and `TFAutoModelForSequenceClassification`!

## Save/Load model

In [72]:
save_dir = 'models'

tokenizer.save_pretrained(save_dir)
model.save_pretrained(save_dir)

our_tokenizer = AutoTokenizer.from_pretrained(save_dir)
our_model = TFAutoModelForSequenceClassification.from_pretrained(save_dir)

Some layers from the model checkpoint at models were not used when initializing TFDistilBertForSequenceClassification: ['dropout_419']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at models and are newly initialized: ['dropout_439']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [73]:
our_classifier = pipeline('sentiment-analysis', model=our_model, tokenizer=our_tokenizer)

our_classifier(X_train)

[{'label': 'POSITIVE', 'score': 0.9995989203453064},
 {'label': 'NEGATIVE', 'score': 0.9985040426254272}]

## Fine-tuning an existing model

**Usual workflow:**
1. Prepare the dataset
2. Load the pre-trained tokenizer and call it on the dataset to get the encodings
3. Build a TensorFlow dataset object from the encodings
4. Load the pre-trained model
5. Load `Trainer` and train the model
6. Alternatively, use a native TensorFlow training loop
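
Step 3 of this workflow pairs each encoding with its label and groups the examples into batches. Below is a library-free stand-in for that step, roughly what `tf.data.Dataset.from_tensor_slices((dict(encodings), labels)).batch(n)` would produce. The helper `make_batches`, the token ids, and the labels are all hypothetical.

```python
def make_batches(encodings, labels, batch_size):
    """Yield (features, labels) pairs, batch_size examples at a time."""
    for start in range(0, len(labels), batch_size):
        stop = start + batch_size
        features = {name: values[start:stop] for name, values in encodings.items()}
        yield features, labels[start:stop]


# Made-up encodings for three short examples, already padded to equal length.
toy_encodings = {
    "input_ids": [[101, 2307, 102], [101, 6659, 102], [101, 3100, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
}
toy_labels = [1, 0, 1]  # made-up labels: 1 = positive, 0 = negative

toy_batches = list(make_batches(toy_encodings, toy_labels, batch_size=2))
for features, batch_labels in toy_batches:
    print(features["input_ids"], batch_labels)
```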

In [74]:
from transformers import Trainer, TrainingArguments

In [None]:
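# Trainer works with PyTorch models, so load the PyTorch version of the checkpoint
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)

training_args = TrainingArguments('test-trainer')

# tokenized_dataset is not defined in this notebook; it is assumed to be a
# tokenized dataset with 'train' and 'test' splits
trainer = Trainer(
    model=pt_model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
)

trainer.train()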