# Using Transformers

## Summary of the tasks

[https://huggingface.co/transformers/v3.0.2/task_summary.html]

Transformers can be used for a wide rage of tasks including question answering, entity recognition, sentiment analysis, text summerization, etc.

The examples below use 'AutoModel' meaning that a model is chosen based on the best model architecture

*Fined-tuned* means to retrain a pre-trained model towards a specific set of data related towards the developer's target task (Language Modeling, Translation, etc.). This means the learning rate is small enough so the loss reaches the global mininum for that downstream task (image a parabola on a 2D chart with the loss as the y-axis and the learning rate weight as the x-axis).   

For a model to perform well, it needs to be loaded from a checkpoint (based on pre-training on a large corpus then fine-tuned towards a specific task)
to the specific task it was pre-trained on.

Not all models were trained to apply to all tasks; like mentioned before they are fine-tuned towards a specific task. Therefore if a developer wants to train a model, they will need to create there own dataset first along with their own training script.

To apply inference on a task, the model and tokenizer must first be saved to a checkpoint (it will be on HF's Model Hub if they are saved there). Inference can be down in 1 of 2 ways:
    - Using 'pipeline' can help with inference with only 2 lines of code.
    - Using an AutoTokenizer and AutoModel

available tasks are ['audio-classification', 'automatic-speech-recognition', 'conversational', 'depth-estimation', 'document-question-answering', 'feature-extraction', 'fill-mask', 'image-classification', 'image-feature-extraction', 'image-segmentation', 'image-to-image', 'image-to-text', 'mask-generation', 'ner', 'object-detection', 'question-answering', 'sentiment-analysis', 'summarization', 'table-question-answering', 'text-classification', 'text-generation', 'text-to-audio', 'text-to-speech', 'text2text-generation', 'token-classification', 'translation', 'video-classification', 'visual-question-answering', 'vqa', 'zero-shot-audio-classification', 'zero-shot-classification', 'zero-shot-image-classification', 'zero-shot-object-detection', 'translation_XX_to_YY']"

## Goal: Make predictions from established fine-tuned models for various Generative LLM tasks.

### Sequence Classification

Sequence Classfication is a model that predicts the label of an input based on a set of classes. 

Hugging Face offers a dataset called GLUE for training and fine-tuning a model for a sequence classsification task. (run_glue.py and runt_tf_glue.py)

### Sentiment Analysis

Determines whether an input text has a posititve or negative sentiment. 

### Paraphrase Classification

Predicts whether two sentences are consecutive or not 

In [818]:
from transformers import pipeline
import torch

In [819]:
# If no model is included, the default model (distilbert/distilbert-base-uncased-finetuned-sst-2-english) is then used
x = pipeline(task='sentiment-analysis')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [820]:
x('I love you')

[{'label': 'POSITIVE', 'score': 0.9998656511306763}]

In [5]:
x('I love you')[0]

{'label': 'POSITIVE', 'score': 0.9998656511306763}

In [6]:
x('I love you')[0]['label'], x('I love you')[0]['score']

('POSITIVE', 0.9998656511306763)

Can the model find the sentiments of characters or numbers?

In [822]:
x(".")

[{'label': 'POSITIVE', 'score': 0.9668781757354736}]

In [823]:
x("4")

[{'label': 'POSITIVE', 'score': 0.9861552119255066}]

Now determine if sentences are paraphrases of each other

1. Instantiate tokenizer AND model from the same checkpoint name that has been fine-tuned towards the specific task 
2. Build a sequence of texts, and a list of classes that are recognized by the mdoel
3. Feed inputs into model to be classified as 0 (not a paraphrase) or 1 (is a paraphrase)
4. Extract the softmax outputs to receive probabilites for each class



In [7]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [13]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased-finetuned-mrpc')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased-finetuned-mrpc')

In [14]:
tokenizer

BertTokenizerFast(name_or_path='bert-base-cased-finetuned-mrpc', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [15]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [25]:
model.num_labels, model.model_tags, model.config.id2label, model.config.label2id

(2, None, {0: 'LABEL_0', 1: 'LABEL_1'}, {'LABEL_0': 0, 'LABEL_1': 1})

------------------

Notice the label ids above go in order from 0 to 1. When creating the classes below they should be in the same order i.e 0 means 'not paraphrase' and 1 means 'paraphrase'

In [94]:
classes = ["not paraphrase", "paraphrase"]

# sequence_a = "The goalkeeper made an amazing save."
# sequence_b = "A dog walked through the park."
# sequence_c = "The referee has issued a red card after the foul in the penalty box."

sequence_a = "The company HuggingFace is based in New York City"
sequence_b = "Apples are especially bad for your health"
sequence_c = "HuggingFace's headquarters are situated in Manhattan"

In [95]:
tokenizer(sequence_a)

{'input_ids': [101, 1109, 1419, 20164, 10932, 2271, 7954, 1110, 1359, 1107, 1203, 1365, 1392, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [96]:
tokenizer(sequence_b)

{'input_ids': [101, 7302, 1116, 1132, 2108, 2213, 1111, 1240, 2332, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [97]:
tokenizer(sequence_c)

{'input_ids': [101, 20164, 10932, 2271, 7954, 112, 188, 3834, 1132, 3629, 1107, 6545, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [98]:
paraphrase = tokenizer(sequence_a, sequence_c, return_tensors="pt")
paraphrase

{'input_ids': tensor([[  101,  1109,  1419, 20164, 10932,  2271,  7954,  1110,  1359,  1107,
          1203,  1365,  1392,   102, 20164, 10932,  2271,  7954,   112,   188,
          3834,  1132,  3629,  1107,  6545,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1]])}

In [99]:
non_paraphrase = tokenizer(sequence_a, sequence_b, return_tensors='pt')
non_paraphrase

{'input_ids': tensor([[  101,  1109,  1419, 20164, 10932,  2271,  7954,  1110,  1359,  1107,
          1203,  1365,  1392,   102,  7302,  1116,  1132,  2108,  2213,  1111,
          1240,  2332,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

------

In [100]:
model_paraphrase_logits = model(**paraphrase)

In [101]:
model_paraphrase_logits

SequenceClassifierOutput(loss=None, logits=tensor([[-0.3495,  1.9004]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [102]:
model_paraphrase_logits[0]

tensor([[-0.3495,  1.9004]], grad_fn=<AddmmBackward0>)

In [103]:
model_non_paraphrase_logits = model(**non_paraphrase)

In [104]:
model_non_paraphrase_logits

SequenceClassifierOutput(loss=None, logits=tensor([[ 0.5386, -2.2197]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [105]:
torch.softmax(model_paraphrase_logits[0], dim=1)

tensor([[0.0954, 0.9046]], grad_fn=<SoftmaxBackward0>)

In [108]:
paraphrase_predictions = torch.softmax(model_paraphrase_logits[0], dim=1).tolist()[0]
paraphrase_predictions

[0.09536290913820267, 0.9046370387077332]

In [109]:
non_paraphrase_predictions = torch.softmax(model_non_paraphrase_logits[0], dim=1).tolist()[0]
non_paraphrase_predictions

[0.94038325548172, 0.059616751968860626]

In [117]:
for idx, pclass in enumerate(classes):

    print(f'{classes[idx]} A to C: {paraphrase_predictions[idx] * 100}')
    print(f'{classes[idx]} A to B: {non_paraphrase_predictions[idx] * 100}')


not paraphrase A to C: 9.536290913820267
not paraphrase A to B: 94.038325548172
paraphrase A to C: 90.46370387077332
paraphrase A to B: 5.961675196886063


### Extractive Question Answering

Given a question, a transformer model can predict an answer. I.e. it can extract answers from a large corpus of text that it was pre-trained on

Dataset used for fine-tuning for this specific task is called SQuAD dataset

EQA with AutoTokenizer and AutoModelForQuestionAnswering
1. Instatiate tokenizer
2. Instatiate model
3. Create context for model
4. Instatiate questions
5. Feed questions and context into tokenizer
6. Convert ids to tokens
7. Use the model to unpack tokenizer's input id's 
8. Use argmax to find the most likely beginning and ending

In [118]:
from transformers import pipeline

In [120]:
pipeline

<function transformers.pipelines.pipeline(task: str = None, model: Union[str, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel'), NoneType] = None, config: Union[str, transformers.configuration_utils.PretrainedConfig, NoneType] = None, tokenizer: Union[str, transformers.tokenization_utils.PreTrainedTokenizer, ForwardRef('PreTrainedTokenizerFast'), NoneType] = None, feature_extractor: Union[str, ForwardRef('SequenceFeatureExtractor'), NoneType] = None, image_processor: Union[str, transformers.image_processing_utils.BaseImageProcessor, NoneType] = None, framework: Union[str, NoneType] = None, revision: Union[str, NoneType] = None, use_fast: bool = True, token: Union[str, bool, NoneType] = None, device: Union[int, str, ForwardRef('torch.device'), NoneType] = None, device_map=None, torch_dtype=None, trust_remote_code: Union[bool, NoneType] = None, model_kwargs: Dict[str, Any] = None, pipeline_class: Union[Any, NoneType] = None, **kwargs) -> transformers.pipelines.base.Pipeline>

In [119]:
eqa = pipeline('question-answering')

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [124]:
context = ''' 
K Nearest Neighbors (KNN) is a supervised machine learning model. 
It takes the K closest samples and uses the labels to determine 
the class of the inference point based on a popularity vote. The 
K is determined by the user, but should be an odd number so the 
voting between classes isn't tied. The best way to choose a K is 
visualize an elbow plot, where the elbow of the graph should be 
used.
'''

In [125]:
question = "Explain to me how I should choose a K for a KNN model?"

In [127]:
result = eqa(question=question, context=context)
result

{'score': 0.23419320583343506,
 'start': 331,
 'end': 354,
 'answer': 'visualize an elbow plot'}

In [129]:
question_2 = "How does KNN work?"
result_2 = eqa(question=question_2, context=context)
result_2

{'score': 0.0891399011015892,
 'start': 69,
 'end': 99,
 'answer': 'It takes the K closest samples'}

---

Extractive Question Answering WITH Tokenizer and Model

In [135]:
# Import packages
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

In [174]:
# Create a list of questions with respective context where the answers will be extracted from

text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""
question = "How many pretrained models are available in 🤗 Transformers?"


# questions = [
# "How many pretrained models are available in 🤗 Transformers?"
# "What does 🤗 Transformers provide?",
# "🤗 Transformers provides interoperability between which frameworks?",
# ]

In [136]:
# Instatiate tokenizer and model from a specific checkpoint via the HuggingFace Model Hub 
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [175]:
# Apply the tokenizer to the question inputs outputting the ids, token_type_ids, and the attention mask
inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
inputs

{'input_ids': tensor([[  101,  2129,  2116,  3653, 23654,  2098,  4275,  2024,  2800,  1999,
           100, 19081,  1029,   102,   100, 19081,  1006,  3839,  2124,  2004,
          1052, 22123,  2953,  2818,  1011, 19081,  1998,  1052, 22123,  2953,
          2818,  1011,  3653, 23654,  2098,  1011, 14324,  1007,  3640,  2236,
          1011,  3800,  4294,  2015,  1006, 14324,  1010, 14246,  2102,  1011,
          1016,  1010, 23455,  1010, 28712,  2213,  1010,  4487, 16643, 23373,
          1010, 28712,  7159,  1529,  1007,  2005,  3019,  2653,  4824,  1006,
         17953,  2226,  1007,  1998,  3019,  2653,  4245,  1006, 17953,  2290,
          1007,  2007,  2058,  3590,  1009,  3653, 23654,  2098,  4275,  1999,
          2531,  1009,  4155,  1998,  2784,  6970, 25918,  8010,  2090, 23435,
         12314,  1016,  1012,  1014,  1998,  1052, 22123,  2953,  2818,  1012,
           102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [176]:
# Extract and convert the the input ids to a list
input_ids = inputs['input_ids'].tolist()[0]
input_ids

[101,
 2129,
 2116,
 3653,
 23654,
 2098,
 4275,
 2024,
 2800,
 1999,
 100,
 19081,
 1029,
 102,
 100,
 19081,
 1006,
 3839,
 2124,
 2004,
 1052,
 22123,
 2953,
 2818,
 1011,
 19081,
 1998,
 1052,
 22123,
 2953,
 2818,
 1011,
 3653,
 23654,
 2098,
 1011,
 14324,
 1007,
 3640,
 2236,
 1011,
 3800,
 4294,
 2015,
 1006,
 14324,
 1010,
 14246,
 2102,
 1011,
 1016,
 1010,
 23455,
 1010,
 28712,
 2213,
 1010,
 4487,
 16643,
 23373,
 1010,
 28712,
 7159,
 1529,
 1007,
 2005,
 3019,
 2653,
 4824,
 1006,
 17953,
 2226,
 1007,
 1998,
 3019,
 2653,
 4245,
 1006,
 17953,
 2290,
 1007,
 2007,
 2058,
 3590,
 1009,
 3653,
 23654,
 2098,
 4275,
 1999,
 2531,
 1009,
 4155,
 1998,
 2784,
 6970,
 25918,
 8010,
 2090,
 23435,
 12314,
 1016,
 1012,
 1014,
 1998,
 1052,
 22123,
 2953,
 2818,
 1012,
 102]

In [177]:
# Convert input ids (above) to tokens based on the pre-trained tokenizer
text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
text_tokens

['[CLS]',
 'how',
 'many',
 'pre',
 '##train',
 '##ed',
 'models',
 'are',
 'available',
 'in',
 '[UNK]',
 'transformers',
 '?',
 '[SEP]',
 '[UNK]',
 'transformers',
 '(',
 'formerly',
 'known',
 'as',
 'p',
 '##yt',
 '##or',
 '##ch',
 '-',
 'transformers',
 'and',
 'p',
 '##yt',
 '##or',
 '##ch',
 '-',
 'pre',
 '##train',
 '##ed',
 '-',
 'bert',
 ')',
 'provides',
 'general',
 '-',
 'purpose',
 'architecture',
 '##s',
 '(',
 'bert',
 ',',
 'gp',
 '##t',
 '-',
 '2',
 ',',
 'roberta',
 ',',
 'xl',
 '##m',
 ',',
 'di',
 '##sti',
 '##lbert',
 ',',
 'xl',
 '##net',
 '…',
 ')',
 'for',
 'natural',
 'language',
 'understanding',
 '(',
 'nl',
 '##u',
 ')',
 'and',
 'natural',
 'language',
 'generation',
 '(',
 'nl',
 '##g',
 ')',
 'with',
 'over',
 '32',
 '+',
 'pre',
 '##train',
 '##ed',
 'models',
 'in',
 '100',
 '+',
 'languages',
 'and',
 'deep',
 'inter',
 '##oper',
 '##ability',
 'between',
 'tensor',
 '##flow',
 '2',
 '.',
 '0',
 'and',
 'p',
 '##yt',
 '##or',
 '##ch',
 '.',
 '[SEP]']

In [178]:
model(**inputs)

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[-6.5990, -6.1387, -7.8761, -7.8292, -8.7167, -8.9216, -8.6420, -8.2637,
         -8.1355, -7.8485, -7.9702, -8.3272, -9.3328, -6.5990, -3.5472, -3.5201,
         -7.6366, -6.9593, -8.0762, -8.4253, -6.1195, -7.9624, -8.3828, -8.0024,
         -8.3195, -5.4944, -8.2405, -6.5540, -8.1549, -8.4309, -8.2269, -8.3963,
         -6.7172, -8.1222, -8.3750, -8.2482, -5.6149, -6.8511, -6.3145, -6.1016,
         -7.6912, -7.6393, -5.7812, -7.5353, -7.2943, -5.4125, -8.2601, -6.2903,
         -8.0463, -8.2036, -7.1410, -8.3341, -6.6303, -8.4391, -6.5964, -7.9467,
         -8.5371, -7.0665, -8.0596, -7.9908, -8.5031, -7.0008, -7.7140, -6.2754,
         -6.6136, -7.0534, -5.5944, -6.5917, -6.9854, -8.0204, -6.4408, -7.9721,
         -7.5704, -8.2496, -5.2606, -6.5215, -6.1437, -7.8759, -5.9586, -7.1834,
         -5.0404, -2.2078,  5.1718,  4.9945, -3.1125, -4.7365, -6.2348, -6.1286,
         -3.5514, -5.0355, -1.5301, -6.8052, -5.3237, -7

In [190]:
### This does't work as outputs are strings -> ('start_logits', 'end_logits')
# Pass input tokens to the model. This outputs a range of scores across the start and end positions for both questions and the content (text variable above)
# answer_start_score, answer_end_score = model(**inputs)
# answer_start_score, answer_end_score

('start_logits', 'end_logits')

In [192]:
# This does work unlike code cell above
# Pass input tokens to the model. This outputs a range of scores across the start and end positions for both questions and the content (text variable above)
results = model(**inputs)
answer_start_score = results[0]
answer_end_score = results[1]

tensor([[-6.1479, -8.1504, -8.0547, -8.2172, -8.2597, -9.0340, -9.1834, -8.5189,
         -7.0241, -8.2608, -8.8719, -9.7697, -6.1478, -5.6884, -3.5799, -7.2198,
         -5.8089, -7.5308, -7.7310, -0.2308, -6.2658, -6.5939, -5.9501, -7.4640,
         -4.6919, -6.0423, -0.7622, -6.1282, -6.7993, -6.0876, -6.4823, -3.8885,
         -7.2929, -7.8001, -6.8442, -2.1224, -6.9784, -3.6053, -0.7250, -7.1122,
         -4.8897, -3.4082, -5.9891, -1.8025,  2.1470, -5.6282, -1.4007, -6.5709,
         -7.2383, -5.8166, -7.8012, -4.1402, -7.5582, -4.1243, -6.8240, -8.0695,
         -4.9950, -7.3042, -6.9205, -8.0321, -5.2739, -6.0296, -6.3443, -6.4113,
         -3.2012,  1.8201, -4.3073, -5.1026, -2.3548, -0.6008, -6.1212, -6.4907,
         -6.0279, -2.2168, -6.1134, -5.6783, -5.4636, -5.1583, -6.1090, -5.3320,
         -6.8291, -6.3435, -7.2655, -8.5996, -6.2427, -8.2107, -8.5213, -5.5052,
         -7.9098, -6.9708, -8.8957, -5.3919, -6.2110, -1.8061, -3.2943, -5.5694,
         -5.5466, -0.2703,  

In [204]:
# Compute the softmax to get the probabilities of:
# the most likely beginning of an answer
# the most likely ending of an answer

answer_start = torch.argmax(answer_start_score)
answer_end = torch.argmax(answer_end_score) + 1
answer_start, answer_end

(tensor(98), tensor(108))

In [219]:
# Now print the complete answer out

answer_start_id = answer_start.tolist()


answer_end_id = answer_end.tolist()
answer_start_id, answer_end_id

(98, 108)

In [210]:
tokenizer.convert_ids_to_tokens(98)

'[unused97]'

In [220]:
input_ids[answer_start_id:answer_end_id]

[23435, 12314, 1016, 1012, 1014, 1998, 1052, 22123, 2953, 2818]

In [223]:
tokenizer.convert_ids_to_tokens(input_ids[answer_start_id:answer_end_id])

['tensor', '##flow', '2', '.', '0', 'and', 'p', '##yt', '##or', '##ch']

In [222]:
tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start_id:answer_end_id]))

'tensorflow 2. 0 and pytorch'

In [224]:
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start_id:answer_end_id]))
answer

'tensorflow 2. 0 and pytorch'

---

Now put it all together 

In [186]:
text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""
# question = "How many pretrained models are available in 🤗 Transformers?"


questions = [
"How many pretrained models are available in 🤗 Transformers?",
"What does 🤗 Transformers provide?",
"🤗 Transformers provides interoperability between which frameworks?",
]

In [187]:
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [244]:
for question in questions:
    # tokenizer
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs['input_ids'].tolist()[0]

    #convert tokens to ids
    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Unravel the results using the model
    result_tensors = model(**inputs)

    # Extract the start and end of the answer
    answer_start_tensor = result_tensors[0]
    answer_end_tensor = result_tensors[1]

    # Determine the embeddings for the start and end of answer
    answer_start_id = torch.argmax(answer_start_tensor)
    answer_end_id = torch.argmax(answer_end_tensor)

    # Convert the list of ids of the answer to tokens then to a string
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start_id:answer_end_id]))

    print(f"Question: {question}")
    print(f"Answer: {answer}", "\n")


    

Question: How many pretrained models are available in 🤗 Transformers?
Answer: over 32 

Question: What does 🤗 Transformers provide?
Answer: general - purpose architecture 

Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytor 



### Language Modeling

Language modeling means to predict a word given both a sentence and the location of the masked token. Furthermore, a model is trained on a large corpus, but can then be fine-tuned towards specific content like movie or sports articles. e.g. "The dog {mask} up the hill."

#### Masked Language Modeling

MLM is is a technique to mask tokens of text to where the model can predict the missing word using the context of the right and left side of that missing word. 

In [255]:
from transformers import pipeline
from pprint import pprint


In [246]:
mlm = pipeline('fill-mask')
mlm

No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<transformers.pipelines.fill_mask.FillMaskPipeline at 0x2ffc74eb0>

In [261]:
print(mlm(f"I will reach the {mlm.tokenizer.mask_token} of the moutain no matter what it takes."))

[{'score': 0.06975307315587997, 'token': 253, 'token_str': ' end', 'sequence': 'I will reach the end of the moutain no matter what it takes.'}, {'score': 0.05570685490965843, 'token': 2576, 'token_str': ' bottom', 'sequence': 'I will reach the bottom of the moutain no matter what it takes.'}, {'score': 0.05507586523890495, 'token': 3564, 'token_str': ' summit', 'sequence': 'I will reach the summit of the moutain no matter what it takes.'}, {'score': 0.04980460926890373, 'token': 34150, 'token_str': ' pinnacle', 'sequence': 'I will reach the pinnacle of the moutain no matter what it takes.'}, {'score': 0.03204379975795746, 'token': 4996, 'token_str': ' peak', 'sequence': 'I will reach the peak of the moutain no matter what it takes.'}]


In [262]:
pprint(mlm(f"I will reach the {mlm.tokenizer.mask_token} of the moutain no matter what it takes."))

[{'score': 0.06975307315587997,
  'sequence': 'I will reach the end of the moutain no matter what it takes.',
  'token': 253,
  'token_str': ' end'},
 {'score': 0.05570685490965843,
  'sequence': 'I will reach the bottom of the moutain no matter what it takes.',
  'token': 2576,
  'token_str': ' bottom'},
 {'score': 0.05507586523890495,
  'sequence': 'I will reach the summit of the moutain no matter what it takes.',
  'token': 3564,
  'token_str': ' summit'},
 {'score': 0.04980460926890373,
  'sequence': 'I will reach the pinnacle of the moutain no matter what it '
              'takes.',
  'token': 34150,
  'token_str': ' pinnacle'},
 {'score': 0.03204379975795746,
  'sequence': 'I will reach the peak of the moutain no matter what it takes.',
  'token': 4996,
  'token_str': ' peak'}]


In [260]:
print(f"I will reach the {mlm.tokenizer.mask_token_id} of the moutain no matter what it takes.")

I will reach the 50264 of the moutain no matter what it takes.


In [264]:
pprint(f"I will reach the {mlm.tokenizer.mask_token_id} of the moutain no matter what it takes.")

'I will reach the 50264 of the moutain no matter what it takes.'


In [267]:
# Another example 

from transformers import pipeline
from pprint import pprint
mlm = pipeline('fill-mask')

pprint(mlm(f"Who put the {mlm.tokenizer.mask_token} in the cookie jar?"))

No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.19594798982143402,
  'sequence': 'Who put the cookies in the cookie jar?',
  'token': 15269,
  'token_str': ' cookies'},
 {'score': 0.04086028039455414,
  'sequence': 'Who put the cookie in the cookie jar?',
  'token': 20931,
  'token_str': ' cookie'},
 {'score': 0.02517770417034626,
  'sequence': 'Who put the chips in the cookie jar?',
  'token': 8053,
  'token_str': ' chips'},
 {'score': 0.02281174622476101,
  'sequence': 'Who put the dough in the cookie jar?',
  'token': 14397,
  'token_str': ' dough'},
 {'score': 0.02184339612722397,
  'sequence': 'Who put the candy in the cookie jar?',
  'token': 12644,
  'token_str': ' candy'}]


Language Modeling with AutoTokenizer and AutoModelWithLMHead

1. Load Tokenizer and load model (DISTIL-BERT)
2. Establish a sequence (sentence) with a masked token ('[MASK]')
3. Encode the sequence into token ids using the pre-trained tokenizer
4. Determine the index of the masked token e.g. "I ['MASK'] you." <- masked token id is 1
4. Feed Encoded inputs into model to gather the logits (probabilities) for each result
5. Extract the logits from the step above based on the masked token index 
6. Print top K results


In [138]:
# import packages
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

In [144]:
# Initiate tokenizer and model
mlm_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
mlm_model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased")



In [185]:
# Create a sequence
mlm_sequence = f"Distil models are {tokenizer.mask_token} than the models they mimic."
mlm_sequence

'Distil models are [MASK] than the models they mimic.'

In [186]:
# encode sequence into id's 

mlm_token_inputs = tokenizer.encode(mlm_sequence, return_tensors="pt")
mlm_token_inputs

tensor([[  101, 12120,  2050,  2723,  3584,  1132,   103,  1190,  1103,  3584,
          1152, 27180,   119,   102]])

In [187]:
# Find position of masked token

mlm_mask_token_idx = torch.where(mlm_token_inputs == tokenizer.mask_token_id)[1]
mlm_mask_token_idx

tensor([6])

In [188]:
# Retrieve the predictions from the model 
mlm_model_logits = model(mlm_token_inputs)[0]
mlm_model_logits

tensor([[[ -6.5477,  -6.5194,  -6.6604,  ...,  -5.4948,  -5.1901,  -5.5867],
         [ -5.9723,  -5.4067,  -5.6112,  ...,  -5.0678,  -4.4226,  -5.4090],
         [ -4.2103,  -4.3035,  -3.7812,  ...,  -3.7468,  -3.5109,  -4.7747],
         ...,
         [ -6.2919,  -5.9060,  -5.9605,  ...,  -5.7470,  -5.5142,  -3.4783],
         [-12.4542, -12.2136, -12.3548,  ..., -10.0168, -10.9499, -11.1881],
         [-12.0805, -11.9212, -12.0093,  ...,  -9.7150, -10.5636, -10.9012]]],
       grad_fn=<ViewBackward0>)

In [189]:
# Number of columns is 14 which is the same size as the input sequence. 
mlm_model_logits.shape

torch.Size([1, 14, 28996])

In [190]:
# Retrieve the logits based on the mask token index
# mlm_model_logits[:,6] <- 6 is the masked token index that needs model prediction
mlm_mask_token_logits = mlm_model_logits[0, mlm_mask_token_idx, :]
mlm_mask_token_logits

tensor([[-4.5315, -4.1294, -4.6032,  ..., -4.7260, -3.2562, -4.5568]],
       grad_fn=<IndexBackward0>)

In [199]:
# Extact top K results

# torch.topk(mlm_mask_token_logits, k=3)
torch.topk(mlm_mask_token_logits, k=3).indices

tensor([[2610, 4946, 2964]])

Now put the code for the Masked Language task all together

In [201]:
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch


In [202]:
# Instatiate pre-trained tokenizer and model

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased")



In [204]:
# Create a sequence 

sequence = f"The way to metricize a distilled model is to measure its {tokenizer.mask_token} for performance."
sequence

'The way to metricize a distilled model is to measure its [MASK] for performance.'

In [206]:
# Encode the sequence into input ids based on the pre-trained tokenizer

input_ids = tokenizer.encode(sequence, return_tensors="pt")
input_ids

tensor([[  101,  1109,  1236,  1106, 12676,  3708,   170,  4267,  2050,  8683,
          1181,  2235,  1110,  1106,  4929,  1157,   103,  1111,  2099,   119,
           102]])

In [218]:
# Determine the mask token id

# tokenizer.mask_token_id, tokenizer.mask_token
mask_token_idx = torch.where(input_ids == tokenizer.mask_token_id)[1]
mask_token_idx

tensor([16])

In [219]:
# Feed input_ids into model to get the logits (probalities for each result)

token_logits = model(input_ids)[0]
token_logits

tensor([[[ -6.6562,  -6.5983,  -6.7531,  ...,  -5.5380,  -5.2008,  -5.6139],
         [ -7.7014,  -7.3568,  -7.3947,  ...,  -5.5471,  -6.1795,  -5.7535],
         [-12.0731, -11.4573, -11.3715,  ...,  -8.6063,  -9.3767,  -9.2412],
         ...,
         [ -7.8804,  -7.9259,  -8.0780,  ...,  -7.2703,  -6.6576,  -7.3891],
         [-12.5762, -12.8128, -12.4420,  ..., -10.4367, -10.5835, -11.2269],
         [-12.2661, -12.5492, -12.0930,  ..., -10.1015, -10.2164, -10.8908]]],
       grad_fn=<ViewBackward0>)

In [220]:
token_logits.shape

torch.Size([1, 21, 28996])

In [221]:
# Extract the masked token id column from the token_logits

token_logits[0, mask_token_idx, :].shape

torch.Size([1, 28996])

In [223]:
mask_token_logits = token_logits[0, mask_token_idx, :]
mask_token_logits

tensor([[-4.5954, -4.4410, -4.5555,  ..., -4.7940, -4.6856, -3.7627]],
       grad_fn=<IndexBackward0>)

In [233]:
# Find the top K results

top_k_token_ids = torch.topk(input=mask_token_logits, k=5).indices[0].tolist()
top_k_token_ids

[2860, 3209, 12949, 8096, 3211]

In [236]:
for token in top_k_token_ids:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

The way to metricize a distilled model is to measure its value for performance.
The way to metricize a distilled model is to measure its potential for performance.
The way to metricize a distilled model is to measure its effectiveness for performance.
The way to metricize a distilled model is to measure its efficiency for performance.
The way to metricize a distilled model is to measure its capacity for performance.


#### Casual Language Modeling

Casual Language allows a model to predict the end of a sequence i.e. it repeatly predicts a single token after a given set of input token till the top k and top p are met or the constraints of other hyperparameters like max_length or min_length. 

e.g. 

"I would like to " -> "I would like to know " -> "I would like to know if"

This means that the model is focused only on the tokens left of the mask token

The next token samples from the last hidden state from the model's token logits (based on input sequence)

top k -> number of tokens with the highest probabilities
top p -> reached sum of probabiliteis from k to k-(k-1)

e.g. 

k = 3
p = .75

"I went to the store to "

top k:
buy .45
purchase .3
shop .25

top p: 
buy .45
purchase .3
since (.45 + .3 = .75) which meets or exceeds the p hyperparameter


temperature: softmax hyperparameter that controls the escalation of values. 

Lower tempeature: next input token logits are close together.

Higher temperature: next input token logits increasing exceed with higher values

e.g.

"The bird flew above the "

lower temperature:
tree .3
clouds .25
bridge .2
building .15
nest .1

higher temperature
tree .6
clouds .2
bridge .15
building .04
nest .01



In [269]:
from transformers import AutoTokenizer, AutoModelWithLMHead, top_k_top_p_filtering
import torch
from torch.nn import functional as F

In [364]:
# Esablish pre-trained tokenizer and model

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2")


In [365]:
# Create Sequence 

sequence = "Hugging Face is based in DUMBO, New York City, and "
sequence

'Hugging Face is based in DUMBO, New York City, and '

In [366]:
# Encode sequence into input ids using pretrained tokenizer

input_ids = tokenizer.encode(sequence, return_tensors="pt")
input_ids

tensor([[48098,  2667, 15399,   318,  1912,   287,   360,  5883,  8202,    11,
           968,  1971,  2254,    11,   290,   220]])

In [367]:
# Compute the model logits from the last hidden state

next_token_logits = model(input_ids)[0][:, -1, :]
next_token_logits

tensor([[-65.0642, -65.9022, -66.1343,  ..., -73.3640, -70.0751, -66.3969]],
       grad_fn=<SliceBackward0>)

In [368]:
# Filter the top K results

filtered_next_token_logits= top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
filtered_next_token_logits

tensor([[-inf, -inf, -inf,  ..., -inf, -inf, -inf]],
       grad_fn=<MaskedFillBackward0>)

In [369]:
# Sample 
probs = F.softmax(filtered_next_token_logits, dim=-1)
probs

tensor([[0., 0., 0.,  ..., 0., 0., 0.]], grad_fn=<SoftmaxBackward0>)

In [370]:
torch.where(probs[0] != 0)

(tensor([  130,   134,   135,   166,   169,   170,   364,   425,   488,   522,
           528,   544,   666,   700,   742,   844,   933,  1134,  1427,  1834,
          1849,  1980,  2419,  2575,  2602,  3711,  4340,  4557,  4841,  4907,
          5099,  5523,  5746,  6353,  9805, 10185, 10221, 13323, 15116, 17479,
         21727, 27193, 29343, 29773, 31854, 32941, 37405, 40493, 48585, 48869]),)

In [371]:
# Samples probabilities for next token.
# num_samples allows the number of tokens to include after the input sequence
next_token = torch.multinomial(probs, num_samples=1)
next_token

tensor([[666]])

In [372]:
input_ids.shape, next_token.shape

(torch.Size([1, 16]), torch.Size([1, 1]))

In [373]:
# Concatenate the input ids and the next token(s) together 
# Notice the sizes of each are different (as seen above), so 
# ...be sure to join them together in 1 dimension
generated = torch.concatenate([input_ids, next_token], dim=-1)
generated

tensor([[48098,  2667, 15399,   318,  1912,   287,   360,  5883,  8202,    11,
           968,  1971,  2254,    11,   290,   220,   666]])

In [382]:
generated[0]

tensor([48098,  2667, 15399,   318,  1912,   287,   360,  5883,  8202,    11,
          968,  1971,  2254,    11,   290,   220,   666])

In [374]:
# Decode the generated token ids into a string
result_string = tokenizer.decode(generated[0])
result_string

'Hugging Face is based in DUMBO, New York City, and ian'

Now put it all together

In [377]:
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch
from torch.nn import functional as F

In [379]:
# Instatiate pre-trained model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2")



In [466]:
# Create Sequence

sequence = "The dog walked down the  "

In [467]:
# Encode the sequence using the pre-trained tokenizer

input_token_ids = tokenizer.encode(sequence, return_tensors="pt")
input_token_ids

tensor([[ 464, 3290, 6807,  866,  262,  220,  220]])

In [468]:
# Compute logits for the input token ids
# and extract the probabilities from the last hidden state
last_hidden_state_token_logits = model(input_token_ids)[0][:,-1, :]
last_hidden_state_token_logits

tensor([[-63.7481, -64.7459, -66.4113,  ..., -75.2423, -68.6873, -63.8376]],
       grad_fn=<SliceBackward0>)

In [469]:
last_hidden_state_token_logits.shape

torch.Size([1, 50257])

In [478]:
# Filter the top k and p next token results

filtered_next_token_logits = top_k_top_p_filtering(last_hidden_state_token_logits, top_k=150, top_p=0.8)
filtered_next_token_logits

tensor([[-inf, -inf, -inf,  ..., -inf, -inf, -inf]],
       grad_fn=<MaskedFillBackward0>)

In [479]:
# Determine probabilities and sample using Softmax and Multinomial

softmax_logits = F.softmax(filtered_next_token_logits)
softmax_logits

  softmax_logits = F.softmax(filtered_next_token_logits)


tensor([[0., 0., 0.,  ..., 0., 0., 0.]], grad_fn=<SoftmaxBackward0>)

In [480]:
softmax_logits.shape

torch.Size([1, 50257])

In [481]:
sampled_next_token_id = torch.multinomial(softmax_logits, num_samples=1)
sampled_next_token_id

tensor([[1133]])

In [482]:
# Combine input ids and sampled next token id(s) together

input_output_ids = torch.concatenate([input_token_ids, sampled_next_token_id], dim=1)
input_output_ids

tensor([[ 464, 3290, 6807,  866,  262,  220,  220, 1133]])

In [483]:
input_output_ids.shape

torch.Size([1, 8])

In [484]:
# Decode result
result = tokenizer.decode(input_output_ids[0])
result

'The dog walked down the  ute'

### Text Generation

Whereas Casual Language Modeling predicts a single token, Text Generation allows the model to output a series of tokens to adhere to a prompt.

Additionally, a padding text is including to help the model with short answers.

Instead of using top_k_top_p_filtering and multinomial, the model itself has a method that can control k, p, max length, min length, and sampling using model.generate()

max_length: uses the length of the original input string and the start and end tokens, then determines the next n tokens.
e.g "If I want to create my own Large language model, I need to " is 14 tokens.
With the start and end tokens, the length is now 16. This means the output will only generate 4 more tokens i.e. "use the following."
for a max total of 20 tokens. 

do_sample: Sample from other probabilites or chose the highest probability when do_sample=False


In [485]:
from transformers import pipeline

In [486]:
sequence = "If I want to create my own Large language model, I need to "

In [487]:
text_gen_model = pipeline("text-generation")
text_gen_model

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


<transformers.pipelines.text_generation.TextGenerationPipeline at 0x2efe68a60>

In [488]:
text_gen_model.predict(sequence)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'If I want to create my own Large language model, I need to \xa0know the \xa0language model parameters\xa0to provide the data structure of a model.\n1) The initial parameter is a model name, which I use to store the'}]

In [489]:
text_gen_model(sequence)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "If I want to create my own Large language model, I need to \xa0know how to set values with the language I want to create.\nIn Scala's examples, there may be three possible values for the language, each with their own syntax"}]

In [497]:
# text_gen_model(sequence, max_length=10)

# ValueError: Input length of input_ids is 10, but `max_length` is set to 10. This can lead to unexpected behavior. You should consider increasing `max_length` or, better yet, setting `max_new_tokens`.


In [498]:
'''
max_length uses the length of the original input string and the start and end tokens, then determines the next n tokens.
e.g "If I want to create my own Large language model, I need to " is 14 tokens.
With the start and end tokens, the length is now 16. This means the output will only generate 4 more tokens i.e. "use the following."
for a max total of 20 tokens. 
'''


text_gen_model(sequence, max_length=20)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'If I want to create my own Large language model, I need to \xa0use the following.'}]

In [502]:
text_gen_model(sequence, max_length=20, do_sample=False)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'If I want to create my own Large language model, I need to \xa0have a language model'}]

In [503]:
text_gen_model(sequence, max_length=20, do_sample=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'If I want to create my own Large language model, I need to \xa0have several different language'}]

In [504]:
from transformers import pipeline
import torch

In [509]:
sequence = "Key, Query, and Value vectors in the Multi Head Attention layers are defined as "

# gpt2 is Default
text_generator_model = pipeline("text-generation")

text_generator_model(sequence, max_length=50, do_sample=True)

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Key, Query, and Value vectors in the Multi Head Attention layers are defined as 〈O(H,A)〈 where H is a nonnegative number between 1 and 2.\n\nFor each set 〈A〈'}]

Text Generation with AutoTokenizer and AutoModelWithLMHead

In [511]:
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

In [572]:
# Establish input sequence and padding text
prompt = "Key, Query, and Value vectors in the Multi Head Attention layers are defined as "

PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""

# prompt = "Today the weather is really nice and I am planning on "


In [573]:
# Inistiate pre-trained tokenizer and model

tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
model = AutoModelWithLMHead.from_pretrained("xlnet-base-cased")

In [561]:
tokenizer

XLNetTokenizerFast(name_or_path='xlnet-base-cased', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '<sep>', 'pad_token': '<pad>', 'cls_token': '<cls>', 'mask_token': '<mask>', 'additional_special_tokens': ['<eop>', '<eod>']}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<cls>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("<sep>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	5: AddedToken("<pad>", 

In [562]:
model

XLNetLMHeadModel(
  (transformer): XLNetModel(
    (word_embedding): Embedding(32000, 768)
    (layer): ModuleList(
      (0-11): 12 x XLNetLayer(
        (rel_attn): XLNetRelativeAttention(
          (layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (ff): XLNetFeedForward(
          (layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (layer_1): Linear(in_features=768, out_features=3072, bias=True)
          (layer_2): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (activation_function): GELUActivation()
        )
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (lm_loss): Linear(in_features=768, out_features=32000, bias=True)
)

In [574]:
# Tokenize the prompt into ids
# Notice this cell and the one below result in the same tensor
# Therefore, either use tokenizer() or tokenizer.encode()

input_token_ids = tokenizer(PADDING_TEXT + prompt, return_tensors="pt", add_special_tokens=False)['input_ids']
input_token_ids

tensor([[   67,  2840,    19,    18,  1484,    20,   965, 29077,  8719,  1273,
            21,    45,   273,    17,    10, 15048,    28, 27511,    21,  4185,
            11,    41,  2444,     9,    32,  1025,    20,  8719,    26,    23,
           673,   966,    19, 29077, 20643, 27511, 20822, 20643,    19,    17,
          6616, 17511,    18,  8978,    20,    18,   777,     9, 19233,  1527,
         17669,    19,    24,   673,    17, 28756,   150, 12943,  4354,   153,
            27,   442,    37,    45,   668,    21,    24,   256,    20,   416,
            22,  2771,  4901,     9, 12943,  4354,   153,    51,    24,  3004,
            21, 28142,    23,    65,    20,    18,   416,    34,    24,  2958,
         22947,     9,  1177,    45,   668,  3097, 13768,    23,   103,    28,
           441,   148,    48, 20522,    19, 12943,  4354,   153, 12860,    34,
            18,   326,    27, 17492,   684,    21,  6709,     9,  8585,   123,
           266,    19, 12943,  4354,   153,  6872,  

In [582]:

input_token_ids = tokenizer.encode(PADDING_TEXT + prompt, return_tensors="pt", add_special_tokens=False)
input_token_ids

tensor([[   67,  2840,    19,    18,  1484,    20,   965, 29077,  8719,  1273,
            21,    45,   273,    17,    10, 15048,    28, 27511,    21,  4185,
            11,    41,  2444,     9,    32,  1025,    20,  8719,    26,    23,
           673,   966,    19, 29077, 20643, 27511, 20822, 20643,    19,    17,
          6616, 17511,    18,  8978,    20,    18,   777,     9, 19233,  1527,
         17669,    19,    24,   673,    17, 28756,   150, 12943,  4354,   153,
            27,   442,    37,    45,   668,    21,    24,   256,    20,   416,
            22,  2771,  4901,     9, 12943,  4354,   153,    51,    24,  3004,
            21, 28142,    23,    65,    20,    18,   416,    34,    24,  2958,
         22947,     9,  1177,    45,   668,  3097, 13768,    23,   103,    28,
           441,   148,    48, 20522,    19, 12943,  4354,   153, 12860,    34,
            18,   326,    27, 17492,   684,    21,  6709,     9,  8585,   123,
           266,    19, 12943,  4354,   153,  6872,  

In [598]:
output_token_ids = model.generate(input_token_ids, max_length=250, do_sample=True, top_k=60, top_p=.95)
output_token_ids

tensor([[ 1980,  1219,   291,   132,  4278,  1184,   332,  5360,    49, 16261,
            20, 17366,    23,    22,  2565,    22,    19,    21,    18, 14497,
         10714, 15807,    23,  3911,   166,  2350,  5360,     9,   595,   491,
          2067,  1219,  3303,    19,    18,  1342,    64,  1184,    22,  2565,
            22,   332,  3878,    20,    18,  5272,  6173,  8691,    19,  5375,
         15356,    24,  6768,   944,    20,   229,    21,  4710,    18,  1342,
            26,    23,   922,    31,   807,  6243,     9,   228,    24,  1259,
            30,  2999,   151,    24,  1579, 14277,    19,    63,  1133,    24,
          5361,   834,    40,   583,     9,   228,    63,   711,   199,    63,
           685,   106,  4007,   834,    76,    24,   274,  2094,     9,    79,
           887,    29,   917,    76,    22,    24,   263,    33,    24,  1236,
            29,    42,    17,    12, 14825,    18,  5378,    12,     9,    84,
          6577,   115,   886,    30, 20090,    38,  

In [599]:
prompt_length = len(tokenizer.decode(input_token_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
prompt_length

495

In [600]:
# generated_prompt = tokenizer.convert_tokens_to_string(output_token_logits[0])
generated_prompt = tokenizer.decode(output_token_ids[0])[len(PADDING_TEXT)-2:]
generated_prompt

'couple was walking down a dark alley, they heard a laugh coming from behind. As they turned around they saw some smoke coming up a few steps. A door that opened up to a left with a sign that said "Take the Right". It sounded like someone was yelling at the stairs. When the lights went out they heard a laugh coming from behind them. The fire from the smoke started to appear.<eop> The two men walked to the left and went upstairs. The door opened up to a third of a way up the stairs. The smoke of the smoke started to "flash" around the other side of the stairs. The smoke of the smoke looked like fire in the mirror. It was a big black, burning mess of fire. The smoke in the mirror looked like fire in the mirror.<eop> The group headed to a little back alley on the right and took a walk in the'

Notice the Padding Text did not affect the results the ouput such that the padding text contained info about Rasputin but the prompt was about Multi Head Attention Vectors

Now put it all together

In [601]:
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

In [629]:
# Establish prompt and padding text 

PADDING_TEXT = '''
Each attention head may indeed learn different 
patterns or combinations of tokens to attend to, 
and the Value matrices capture these learned patterns. 
By having multiple attention heads, the model can learn 
to attend to different aspects of the input sequence 
simultaneously, potentially capturing a wider range of 
information and improving the model's performance on various tasks.
'''

prompt = "As a couple was walking down a dark alley, they heard a "

In [630]:
# Determine the padding length so that you can use that value to index the output.
# This is due to the inputs (via tokenizer.encode()) for the padding text and prompt being concatenated together (i.e. PADDING_TEXT + prompt)

padding_text_length = len(PADDING_TEXT)
padding_text_length

390

In [631]:
# Compute the input tokens ids of PADDING_TEXT + prompt
'''
input_token_ids.shape -> torch.Size([1, 91])
'''
input_token_ids = tokenizer.encode(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")
input_token_ids

tensor([[ 1980,  1219,   291,   132,  4278,  1184,   332,  5360,    49, 16261,
            20, 17366,    23,    22,  2565,    22,    19,    21,    18, 14497,
         10714, 15807,    23,  3911,   166,  2350,  5360,     9,   595,   491,
          2067,  1219,  3303,    19,    18,  1342,    64,  1184,    22,  2565,
            22,   332,  3878,    20,    18,  5272,  6173,  8691,    19,  5375,
         15356,    24,  6768,   944,    20,   229,    21,  4710,    18,  1342,
            26,    23,   922,    31,   807,  6243,     9,   228,    24,  1259,
            30,  2999,   151,    24,  1579, 14277,    19,    63,  1133,    24]])

In [637]:
# Compute the output which includes all the input token ids and the output token ids 

# do_sample = False means to choose text with the highest probabilities
# do_sample = True means to randomly sample based on k and p

output_token_ids = model.generate(input_token_ids, top_k=60, top_p=0.9, max_length=120, do_sample=True)
output_token_ids

tensor([[ 1980,  1219,   291,   132,  4278,  1184,   332,  5360,    49, 16261,
            20, 17366,    23,    22,  2565,    22,    19,    21,    18, 14497,
         10714, 15807,    23,  3911,   166,  2350,  5360,     9,   595,   491,
          2067,  1219,  3303,    19,    18,  1342,    64,  1184,    22,  2565,
            22,   332,  3878,    20,    18,  5272,  6173,  8691,    19,  5375,
         15356,    24,  6768,   944,    20,   229,    21,  4710,    18,  1342,
            26,    23,   922,    31,   807,  6243,     9,   228,    24,  1259,
            30,  2999,   151,    24,  1579, 14277,    19,    63,  1133,    24,
         31980,    20,    18,    17,    12,    12,    17,    10,    12,    12,
            11,    25,    18,  1538,    20,    24,  1515,     9,   200,   505,
          2999,   151,    18, 14277,    25,  2720,    22,   278,   106,    70,
          4688,     9,     8,   200,  1068,   685,    24,  2729,    20,  4481]])

In [639]:
tokenizer.decode(output_token_ids[0])[padding_text_length-7:]

'As a couple was walking down a dark alley, they heard a shuffling of the "" ("") in the middle of a store. They started walking down the alley in hopes to find some more clothing.<eop> They quickly saw a pair of shoes'

### Named Entity Recognition (NER)

NER predicts classes for tokens given a sequence. Some of the tokens in a sentence are distinctive towards a certain label like a persons name, or location.

Popular dataset for NER tasks include the CoNLL-2003 dataset, which solely pertains to the specific task that is NER. 

A basic example of applying pipelines for Named Entity Recognition can identify tokens for 9 classes:

- O: Outside of a named entity
- B-MISC: Beginning of a miscalaneus item entity
- I-MISC: Miscalaneus entity
- B-PER: Beginning of a person's name entity
- I-PER: Person Entity
- B-LOC: Beginning of a location entity
- I-LOC: Location entity
- B-ORG: Beginning of an organization entity
- I-ORG: Organization entity

Resource Link: `https://ubiai.tools/mastering-named-entity-recognition-with-bert/`

Named Entity Recognition with pipelines

Default model is dbmdz/bert-large-cased-finetuned-conll03-english

In [648]:
from transformers import pipeline
# import pprint

In [645]:
ner = pipeline("ner")
ner

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<transformers.pipelines.token_classification.TokenClassificationPipeline at 0x3b7695b80>

In [657]:
entities = ner("Sally needs to deliver 30 computers to Adam from RoadCross LLC in Lowell, Indiana.")

In [658]:
# pprint.pprint(entities)

In [659]:
for entity in entities:
    print(f"Word: {entity['word']}, Entity: {entity['entity']}, Score: {entity['score']}")

Word: Sally, Entity: I-PER, Score: 0.9633005261421204
Word: Adam, Entity: I-ORG, Score: 0.7452928423881531
Word: Road, Entity: I-ORG, Score: 0.9995040893554688
Word: ##C, Entity: I-ORG, Score: 0.9991481304168701
Word: ##ross, Entity: I-ORG, Score: 0.9993059635162354
Word: LLC, Entity: I-ORG, Score: 0.9989431500434875
Word: Lowell, Entity: I-LOC, Score: 0.9779480695724487
Word: Indiana, Entity: I-LOC, Score: 0.9881199598312378


Notice the company 'RoadCross LLC' was classified into three 'I-ORG' locaton entities.

- ##C
- ##ross
- LLC

Named Entity Recognition with AutoTokenizer and AutoModel

In [699]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

In [807]:
# Load pretrained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
# model = AutoModelForTokenClassification.from_pretrained("bert-base-cased")


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [808]:
tokenizer

BertTokenizerFast(name_or_path='dbmdz/bert-large-cased-finetuned-conll03-english', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [809]:
model

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-23): 24 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), 

In [810]:
# Create a sequence 

# sequence = "I need ten six-foot two-by-fours delivered to Charles from TankStein Inc. This needs to be dropped off by the side of their warehouse not the entrance."

'''
The sequence commented out below shows shows a different 
result as its using a different pre-trained tokenizer 
(looking below gives a good result compared to 
the tokenizer that uses the same bert model as the 
pre-trained model itself.)
'''

# sequence = "I need ten six-foot two-by-fours delivered to Charles from TankStein Inc which needs to be dropped off by the side of their warehouse not the entrance."
# sequence = "I need ten six computers delivered to Charles from TankStein Inc which needs to be dropped off in their Cedar Lake, Indiana warehouse."
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge."

In [811]:
# Encode the sequence into input ids

ner_input_ids = tokenizer.encode(sequence, return_tensors="pt")
ner_input_ids

tensor([[  101, 20164, 10932, 10289,  3561,   119,  1110,   170,  1419,  1359,
          1107,  1203,  1365,  1392,   119,  2098,  3834,  1132,  1107,   141,
         25810, 23904,   117,  3335,  1304,  1601,  1106,  1103,  6545,  3640,
           119,   102]])

In [812]:
# Compute the model logits - The way to extract the output logits is a bit hacky

# Note this specific model does not include a .generate() method

# Make sure to understand the task of the model to better understand 
# ... its hyperperameters

# For example, top_k, top_p, and do_sample are NOT necessary here because
# ...the model is NOT generating a token(s), but instead an entity label like person or place

# Utilize the correct hyperparameters via the documentation


ner_input_tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
ner_input_tokens

['[CLS]',
 'Hu',
 '##gging',
 'Face',
 'Inc',
 '.',
 'is',
 'a',
 'company',
 'based',
 'in',
 'New',
 'York',
 'City',
 '.',
 'Its',
 'headquarters',
 'are',
 'in',
 'D',
 '##UM',
 '##BO',
 ',',
 'therefore',
 'very',
 'close',
 'to',
 'the',
 'Manhattan',
 'Bridge',
 '.',
 '[SEP]']

In [813]:
ner_token_str = tokenizer.decode(tokenizer.encode(sequence))
ner_token_str

'[CLS] Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge. [SEP]'

In [814]:
# Fine-tune the BERT model

ner_output_class_logits = model(ner_input_ids)[0]
ner_output_class_logits

tensor([[[ 9.4474e+00, -2.5001e+00, -1.6684e+00, -2.0315e+00, -2.1446e+00,
          -1.7645e+00, -4.6828e-01, -1.9855e+00,  1.2138e+00],
         [ 6.9599e-01, -2.7866e+00, -6.6957e-01, -3.2325e+00, -8.4399e-01,
          -1.7734e+00,  9.0137e+00, -2.4886e+00, -5.2432e-01],
         [ 1.8199e+00, -2.2245e+00,  1.5032e-01, -3.1896e+00, -5.8592e-02,
          -1.2954e+00,  6.7632e+00, -2.2815e+00, -8.2044e-01],
         [ 6.5241e-01, -2.8254e+00, -1.6821e-01, -3.4216e+00,  2.2805e-01,
          -1.3876e+00,  7.8084e+00, -2.8727e+00, -3.5963e-01],
         [ 1.0578e+00, -2.9133e+00, -8.2334e-01, -3.4733e+00, -1.6451e+00,
          -1.9206e+00,  9.0729e+00, -2.0552e+00, -1.3504e-01],
         [ 6.9276e+00, -2.6744e+00, -1.4723e+00, -3.7311e+00, -5.5609e-01,
          -1.8288e+00,  4.7135e+00, -2.5164e+00, -2.0687e-01],
         [ 1.0976e+01, -2.1511e+00, -8.8926e-01, -2.6146e+00, -1.3121e+00,
          -1.6596e+00,  7.6239e-01, -2.1253e+00, -7.0168e-01],
         [ 1.1182e+01, -2.2161e+00

In [815]:
# Extract the classes for each entity with the highest probablity 
predictions = torch.argmax(ner_output_class_logits, dim=2)

In [816]:
predictions[0]

tensor([0, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 8, 8, 8, 0, 0, 0, 0, 0, 8, 8, 8, 0, 0,
        0, 0, 0, 0, 8, 8, 0, 0])

In [817]:
# Print results 

# The label list helps with with the print statement becuase the index of
# ...each prediction (above) represents the index of the label list (below).

label_list = [
     "O",       # Outside of a named entity i.e. NO CLASS WAS IDENTIFIED
     "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
     "I-MISC",  # Miscellaneous entity
     "B-PER",   # Beginning of a person's name right after another person's name
     "I-PER",   # Person's name
     "B-ORG",   # Beginning of an organisation right after another organisation
     "I-ORG",   # Organisation
     "B-LOC",   # Beginning of a location right after another location
     "I-LOC"    # Location
 ]

# dbmdz/bert-large-cased-finetuned-conll03-english
# for token, prediction in zip(ner_input_tokens, predictions[0].tolist()):
#     if prediction != 0:
#         print(token, label_list[prediction])

# bert-base-cased
for token, prediction in zip(ner_input_tokens, predictions[0]):
    if prediction:
        print(token, label_list[int(prediction)])

Hu I-ORG
##gging I-ORG
Face I-ORG
Inc I-ORG
New I-LOC
York I-LOC
City I-LOC
D I-LOC
##UM I-LOC
##BO I-LOC
Manhattan I-LOC
Bridge I-LOC


Tokenizer Impact 

'BERT' ('bert-based-cased') tokenizer vs 'dbmdz/bert-large-cased-finetuned-conll03-english' tokenizer (which is the pre-trained model and tokenizer name)

output results: 

- bert-base-cased:
    - Hu I-ORG
    - ##gging I-ORG
    - Face I-ORG
    - Inc I-ORG
    - New I-LOC
    - York I-LOC
    - City I-LOC
    - D I-LOC
    - ##UM I-LOC
    - ##BO I-LOC
    - Manhattan I-LOC
    - Bridge I-LOC

---

- dbmdz/bert-large-cased-finetuned-conll03-english:
    - Hu I-ORG
    - ##gging I-ORG
    - Face I-ORG
    - Inc I-ORG
    - New I-LOC
    - York I-LOC
    - City I-LOC
    - D I-LOC
    - ##UM I-LOC
    - ##BO I-LOC
    - Manhattan I-LOC
    - Bridge I-LOC

    

    Conclusion: This shows bert-base-cased and the pre-trained tokenizer result in the same thing. This is due to the fact that the tokeinzer for the pre-trained model is the exact same, meaning no other clases were added, removed, or edited when training the LLM (dbmdz/bert-large-cased-finetuned-conll03-english")

Now put it all together 

If you plan on using a pretrained model, it’s important to use the associated pretrained tokenizer: it will split the text you give it in tokens the same way for the pretraining corpus, and it will use the same correspondence token to index (that we usually call a vocab) as during pretraining. ('https://huggingface.co/transformers/v3.0.2/preprocessing.html')

In [824]:
# Import tokenizer and model packages from transformers
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

In [825]:
# Initiate pre-trained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [826]:
# Create sequence 
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge."

In [835]:
# Encode the input into ids
input_ids = tokenizer.encode(sequence, return_tensors="pt")
input_ids

tensor([[  101, 20164, 10932, 10289,  3561,   119,  1110,   170,  1419,  1359,
          1107,  1203,  1365,  1392,   119,  2098,  3834,  1132,  1107,   141,
         25810, 23904,   117,  3335,  1304,  1601,  1106,  1103,  6545,  3640,
           119,   102]])

In [839]:
# Decode the model ids into logits for each class

output_logits = model(input_ids)[0]
output_logits

tensor([[[ 9.4474e+00, -2.5001e+00, -1.6684e+00, -2.0315e+00, -2.1446e+00,
          -1.7645e+00, -4.6828e-01, -1.9855e+00,  1.2138e+00],
         [ 6.9599e-01, -2.7866e+00, -6.6957e-01, -3.2325e+00, -8.4399e-01,
          -1.7734e+00,  9.0137e+00, -2.4886e+00, -5.2432e-01],
         [ 1.8199e+00, -2.2245e+00,  1.5032e-01, -3.1896e+00, -5.8592e-02,
          -1.2954e+00,  6.7632e+00, -2.2815e+00, -8.2044e-01],
         [ 6.5241e-01, -2.8254e+00, -1.6821e-01, -3.4216e+00,  2.2805e-01,
          -1.3876e+00,  7.8084e+00, -2.8727e+00, -3.5963e-01],
         [ 1.0578e+00, -2.9133e+00, -8.2334e-01, -3.4733e+00, -1.6451e+00,
          -1.9206e+00,  9.0729e+00, -2.0552e+00, -1.3504e-01],
         [ 6.9276e+00, -2.6744e+00, -1.4723e+00, -3.7311e+00, -5.5609e-01,
          -1.8288e+00,  4.7135e+00, -2.5164e+00, -2.0687e-01],
         [ 1.0976e+01, -2.1511e+00, -8.8926e-01, -2.6146e+00, -1.3121e+00,
          -1.6596e+00,  7.6239e-01, -2.1253e+00, -7.0168e-01],
         [ 1.1182e+01, -2.2161e+00

In [844]:
output_logits[0]

tensor([[ 9.4474e+00, -2.5001e+00, -1.6684e+00, -2.0315e+00, -2.1446e+00,
         -1.7645e+00, -4.6828e-01, -1.9855e+00,  1.2138e+00],
        [ 6.9599e-01, -2.7866e+00, -6.6957e-01, -3.2325e+00, -8.4399e-01,
         -1.7734e+00,  9.0137e+00, -2.4886e+00, -5.2432e-01],
        [ 1.8199e+00, -2.2245e+00,  1.5032e-01, -3.1896e+00, -5.8592e-02,
         -1.2954e+00,  6.7632e+00, -2.2815e+00, -8.2044e-01],
        [ 6.5241e-01, -2.8254e+00, -1.6821e-01, -3.4216e+00,  2.2805e-01,
         -1.3876e+00,  7.8084e+00, -2.8727e+00, -3.5963e-01],
        [ 1.0578e+00, -2.9133e+00, -8.2334e-01, -3.4733e+00, -1.6451e+00,
         -1.9206e+00,  9.0729e+00, -2.0552e+00, -1.3504e-01],
        [ 6.9276e+00, -2.6744e+00, -1.4723e+00, -3.7311e+00, -5.5609e-01,
         -1.8288e+00,  4.7135e+00, -2.5164e+00, -2.0687e-01],
        [ 1.0976e+01, -2.1511e+00, -8.8926e-01, -2.6146e+00, -1.3121e+00,
         -1.6596e+00,  7.6239e-01, -2.1253e+00, -7.0168e-01],
        [ 1.1182e+01, -2.2161e+00, -8.9680e-01, 

In [848]:
output_ids = torch.argmax(output_logits, dim=2)
output_ids

tensor([[0, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 8, 8, 8, 0, 0, 0, 0, 0, 8, 8, 8, 0, 0,
         0, 0, 0, 0, 8, 8, 0, 0]])

In [851]:
# tokenizer.decode(output_ids[0])

'[PAD] [unused6] [unused6] [unused6] [unused6] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [unused8] [unused8] [unused8] [PAD] [PAD] [PAD] [PAD] [PAD] [unused8] [unused8] [unused8] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [unused8] [unused8] [PAD] [PAD]'

In [860]:
label_list = [
     "O",       # Outside of a named entity
     "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
     "I-MISC",  # Miscellaneous entity
     "B-PER",   # Beginning of a person's name right after another person's name
     "I-PER",   # Person's name
     "B-ORG",   # Beginning of an organisation right after another organisation
     "I-ORG",   # Organisation
     "B-LOC",   # Beginning of a location right after another location
     "I-LOC"    # Location
 ]

sequence_token_str = tokenizer.decode(input_ids[0]).split(' ')
[(i,label_list[o]) for i,o in zip(sequence_token_str, output_ids[0])]


[('[CLS]', 'O'),
 ('Hugging', 'I-ORG'),
 ('Face', 'I-ORG'),
 ('Inc.', 'I-ORG'),
 ('is', 'I-ORG'),
 ('a', 'O'),
 ('company', 'O'),
 ('based', 'O'),
 ('in', 'O'),
 ('New', 'O'),
 ('York', 'O'),
 ('City.', 'I-LOC'),
 ('Its', 'I-LOC'),
 ('headquarters', 'I-LOC'),
 ('are', 'O'),
 ('in', 'O'),
 ('DUMBO,', 'O'),
 ('therefore', 'O'),
 ('very', 'O'),
 ('close', 'I-LOC'),
 ('to', 'I-LOC'),
 ('the', 'I-LOC'),
 ('Manhattan', 'O'),
 ('Bridge.', 'O'),
 ('[SEP]', 'O')]

### Summarization

Task: summarize a piece of text(s) into a shorter text.

In [861]:
from transformers import pipeline

In [864]:
# A real use case includes summarizing news articles
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

In [865]:
summarizer = pipeline("summarization")
summarizer

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


<transformers.pipelines.text2text_generation.SummarizationPipeline at 0x352475d30>

In [866]:
summarizer(ARTICLE)

[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, nine of them between 1999 and 2002 . She is believed to still be married to four men, and at one time, she was married to eight men at once .'}]

In [867]:
summarizer(ARTICLE, max_length=50, min_length = 20, do_sample=False)

[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002'}]

In [868]:
summarizer(ARTICLE, max_length=50, min_length=20, do_sample=True)

[{'summary_text': ' Liana Barrientos has been married 10 times, nine of them between 1999 and 2002 . She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say .'}]

The usage of different hyperparameters are shown above, resulting in similar results for this specific article and hyperparameter tuning. 

#### Summarization with AutoTokenizer and AutoModelWithLMHead

The input sequence consists of the Abstract, Introduction, and Background from the famouse research paper "Attention is All you need".

`https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf`


1. Instatiate pre-trained tokenizer and model from a checkpoint
2. Establish an input using an article, paper, etc.
3. Encode the paper into token ids (where ids are the keys in a dict, and the embeddings are values). BE SURE TO USE THE KEY WORD "summarize:" KEYWORD AS THIS IS T5 SPECIFIC e.g. tokenizer.encode("summarization:" + ARTICLE)
4. Feed input ids into model (.generate()) to determine the output ids
5. Decode output ids with tokenizer.decode()

Summarization is usually done with BART or T5

Googles T5 model was trained on a multi-task-mixed dataset like CNN, and Daily Mail. Since the dataset is more general, it's logical to think that is too many loaded tasks for a model, but nonetheless it returns good results.

model depends on model.generate() therefore hyperameters like max/min length can be overwritten. 


In [869]:
from transformers import AutoTokenizer, AutoModelWithLMHead

In [918]:
# Establish the sequence, article, script, review, etc. to be summarized.

research_paper = '''
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including
ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,
our model establishes a new single-model state-of-the-art BLEU score of 41.0 after
training for 3.5 days on eight GPUs, a small fraction of the training costs of the
best models from the literature.
1 Introduction
Recurrent neural networks, long short-term memory [12] and gated recurrent [7] neural networks
in particular, have been firmly established as state of the art approaches in sequence modeling and
transduction problems such as language modeling and machine translation [29, 2, 5]. Numerous
efforts have since continued to push the boundaries of recurrent language models and encoder-decoder
architectures [31, 21, 13].
∗Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started
the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and
has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head
attention and the parameter-free position representation and became the other person involved in nearly every
detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and
tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and
efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and
implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating
our research.
†Work performed while at Google Brain.
‡Work performed while at Google Research.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Recurrent models typically factor computation along the symbol positions of the input and output
sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden
states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently
sequential nature precludes parallelization within training examples, which becomes critical at longer
sequence lengths, as memory constraints limit batching across examples. Recent work has achieved
significant improvements in computational efficiency through factorization tricks [18] and conditional
computation [26], while also improving model performance in case of the latter. The fundamental
constraint of sequential computation, however, remains.
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in
the input or output sequences [2, 16]. In all but a few cases [22], however, such attention mechanisms
are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead
relying entirely on an attention mechanism to draw global dependencies between input and output.
The Transformer allows for significantly more parallelization and can reach a new state of the art in
translation quality after being trained for as little as twelve hours on eight P100 GPUs.
2 Background
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU
[20], ByteNet [15] and ConvS2S [8], all of which use convolutional neural networks as basic building
block, computing hidden representations in parallel for all input and output positions. In these models,
the number of operations required to relate signals from two arbitrary input or output positions grows
in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes
it more difficult to learn dependencies between distant positions [11]. In the Transformer this is
reduced to a constant number of operations, albeit at the cost of reduced effective resolution due
to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as
described in section 3.2.
Self-attention, sometimes called intra-attention is an attention mechanism relating different positions
of a single sequence in order to compute a representation of the sequence. Self-attention has been
used successfully in a variety of tasks including reading comprehension, abstractive summarization,
textual entailment and learning task-independent sentence representations [4, 22, 23, 19].
End-to-end memory networks are based on a recurrent attention mechanism instead of sequencealigned recurrence and have been shown to perform well on simple-language question answering and
language modeling tasks [28].
To the best of our knowledge, however, the Transformer is the first transduction model relying
entirely on self-attention to compute representations of its input and output without using sequencealigned RNNs or convolution. In the following sections, we will describe the Transformer, motivate
self-attention and discuss its advantages over models such as [14, 15] and [8].
'''

In [871]:
# Instatiate pre-trained tokenizer and model

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelWithLMHead.from_pretrained("t5-base")



In [872]:
model

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dro

In [892]:
# Encode the sequence into token indices
# T5 uses a max_length of 512 so we cut the article NO MORE than 512 tokens.

input_ids = tokenizer.encode("summarize:" + research_paper, return_tensors="pt", max_length=200)
input_ids

tensor([[21603,    10, 20114,    37, 12613,  5932,  3017,  8291,  2250,    33,
             3,   390,    30,  1561,     3,    60, 14907,    42,   975, 24817,
           138, 24228,  5275,    24,   560,    46, 23734,    52,    11,     3,
             9,    20,  4978,    52,     5,    37,   200,  5505,  2250,    92,
          1979,     8, 23734,    52,    11,    20,  4978,    52,   190,    46,
          1388,  8557,     5,   101,  4230,     3,     9,   126,   650,  1229,
          4648,     6,     8, 31220,     6,     3,   390,  4199,   120,    30,
          1388, 12009,     6,  1028,  3801,    53,    28,     3,    60,  3663,
            52,  1433,    11,   975, 24817,     7,  4585,     5,  1881,  4267,
          4128,    30,   192,  1437,  7314,  4145,   504,   175,  2250,    12,
            36,  4784,    16,   463,   298,   271,    72,  8449,    23,   172,
           179,    11,     3, 10695,  4019,   705,    97,    12,  2412,     5,
           421,   825,  1984,     7,  2059,     5,  

In [897]:
# Feed input ids into model to return logit probabilities for each word

# output_ids = model.generate(input_ids, min_length=30, max_length=200, do_sample=False, length_penalty=2.0, num_beams=4, early_stopping=True)
output_ids = model.generate(input_ids, min_length=30, max_length=200, do_sample=False)

output_ids

tensor([[    0,     3,     9,   126,   650,  1229,  4648,    19,  4382,     6,
             8, 31220,     3,     5,    34,  1028,  3801,    15,     7,    28,
             3,    60,  3663,    52,  1433,    11,   975, 24817,     7,  4585,
             3,     5,    34,  1984,     7,  2059,     5,   591,     3,  8775,
         12062,    30,     8,  1412, 22269,    18,   235,    18, 24518,  7314,
          2491,     3,     5,     1]])

In [898]:
output_tokens_str = tokenizer.decode(output_ids[0])
output_tokens_str

'<pad> a new simple network architecture is proposed, the Transformer. it dispenses with recurrence and convolutions entirely. it achieves 28.4 BLEU on the 2014 english-to-German translation task.</s>'

Now put it all together. 

In [899]:
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

In [919]:
# Establish input article, paper, etc. 
TRANSCRIPT = '''

Judge: Order in the court! This mock trial is in session. Today we're examining the case of "The People vs. Mr. Smith" for alleged theft.

Prosecutor: Your Honor, the evidence clearly shows Mr. Smith was caught on camera stealing from the store.

Defense Attorney: Objection, Your Honor! The footage is inconclusive and lacks context.

Judge: Sustained. Defense, your opening statement?

Defense Attorney: Ladies and gentlemen, my client is innocent until proven guilty. There's reasonable doubt here.

Judge: Let's proceed with witnesses.

[The trial continues with arguments and testimony.]

Judge: Court adjourned. We'll reconvene for closing statements tomorrow.
'''

In [920]:
# Instatiate pre-trained tokenizer and automodel
tokenizer = AutoTokenizer.from_pretrained("T5-base")
model = AutoModelWithLMHead.from_pretrained("T5-base")




In [921]:
tokenizer

T5TokenizerFast(name_or_path='T5-base', vocab_size=32100, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': ['<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>', '<extra_id_4>', '<extra_id_5>', '<extra_id_6>', '<extra_id_7>', '<extra_id_8>', '<extra_id_9>', '<extra_id_10>', '<extra_id_11>', '<extra_id_12>', '<extra_id_13>', '<extra_id_14>', '<extra_id_15>', '<extra_id_16>', '<extra_id_17>', '<extra_id_18>', '<extra_id_19>', '<extra_id_20>', '<extra_id_21>', '<extra_id_22>', '<extra_id_23>', '<extra_id_24>', '<extra_id_25>', '<extra_id_26>', '<extra_id_27>', '<extra_id_28>', '<extra_id_29>', '<extra_id_30>', '<extra_id_31>', '<extra_id_32>', '<extra_id_33>', '<extra_id_34>', '<extra_id_35>', '<extra_id_36>', '<extra_id_37>', '<extra_id_38>', '<extra_id_39>', '<extra_id_40>', '<extra_id_41>', '<extra_id_42>',

In [922]:
model

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dro

In [931]:
# Encode the transript into input indices

input_ids = tokenizer.encode("summarization:" + research_paper, return_tensors="pt")
input_ids

tensor([[4505, 1635, 1707,  ...,  927, 4275,    1]])

In [932]:
# Feed token ids into model

output_ids = model.generate(input_ids, min_length=10, max_length=100, do_sample=True)
output_ids

tensor([[    0,     3,   157,   547,     9,     3,    10,     8, 19903,    19,
           166,   825,    24,    65,   150,  1388,  8557,     6,     3,  3565,
          5932,   138,  1707,     3,     5,     3,    88,   845,    34,  1250,
            21,   231,    72,  8449,  1707,    11,     3,  7161,  3627,   761,
             3,     5,     1]])

In [933]:
# Decode the output ids into a string of the tokens

output_transcript_summarization = tokenizer.decode(output_ids[0])
output_transcript_summarization

'<pad> khata : the transformer is first model that has no attention mechanism, despite sequencealization. he says it allows for much more parallelization and enables faster training.</s>'

### Translation

Translating text in one language to another

An example of a dataset used for training is the WMT Dataset which includes English-to-German translations (English sentences as input and German translative sentence in output)

In [934]:
# Using Pipelines
from transformers import pipelines

In [943]:
# model defaulted to google-t5/t5-base 
translator = pipeline("translation_en_to_de")
translator

No model was supplied, defaulted to google-t5/t5-base and revision 686f1db (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on google-t5/t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


<transformers.pipelines.text2text_generation.TranslationPipeline at 0x3c1a04c70>

In [945]:
sentence = "I would like to get a 1 bed room in the non-smoking area for the next two nights."

In [946]:
translator(sentence)

[{'translation_text': 'Das Hotel ist sehr zentral gelegen und hat eine gute Verkehrsanbindung.'}]

#### Pre-trained translation tokenizer and model

The steps are similar to summarization

1. Instatiate pre-trained tokenizer and model (use BART OR T5 since they are encoder-decoder models unlike BERT or XLNET that are strictly encoder models)
2. Create an input sequence to be translated
3. Encode the inputs to become to token indices.
    - Be sure to include the key specific word for T5 "translate English to German:" concatenated with the input string e.g. tokenizer.encode("translate English to German:" + input_english_sentence)
4. Feed token ids into the model to extract the output token ids. 
    - Use model.generate() to override hyperparameters like max/min length
5. Decode the output ids into words of the predicted translation

In [948]:
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

In [1001]:
# Instantiate pre-trained tokenizer and model

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelWithLMHead.from_pretrained("t5-base")



In [1020]:
# Sequence to be translated is from "Attention is All You Need"

# sequence = ''' 
# Most competitive neural sequence transduction models have an
# encoder-decoder structure.
# '''

sequence = "Most competitive neural sequence transduction models have an encoder-decoder structure."

In [1021]:
# Encode sequence into pre-trained token ids

input_ids = tokenizer.encode("translate English to German: " + sequence, return_tensors="pt")
input_ids

tensor([[13959,  1566,    12,  2968,    10,  1377,  3265, 24228,  5932,  3017,
          8291,  2250,    43,    46, 23734,    52,    18,   221,  4978,    52,
          1809,     5,     1]])

In [1022]:
# Feed input ids into model to output of the summarized token idst

# output_ids = model.generate(input_ids, top_k=100, top_p=0.95, do_sample=False, min_length=100, max_length=170)
# output_ids

# output_ids = model.generate(input_ids, do_sample=False, min_length=100, max_length=170)
# output_ids



# Compare the outputs below 
# output_ids = model.generate(input_ids, do_sample=True)
# output_ids
# <pad> Die meisten wettbewerbsfähigen neuralen Sequenz-Transduction-Modelle besitzen'

# output_ids = model.generate(input_ids, num_beams=4, early_stopping=True)
# output_ids
# '<pad> Die meisten wettbewerbsfähigen neuralen Sequenztransduction-Modelle haben eine'

output_ids = model.generate(input_ids, do_sample=True, max_length=len(sequence))
output_ids
# '<pad> Die meisten wettbewerbsfähigen neuralen Sequenztransduction-Modelle haben eine Coder-Decoder-Struktur.</s>'



tensor([[    0,  1212,  7176,     7,     3, 31467,     7, 13718,    15, 24228,
          5932,  3017,  8291,  2250,    43,     3,     9, 23734,    52,    18,
           221,  4978,    52,  1809,     5,     1]])

In [1023]:
# Decode the output ids (which is the token ids for the german translation) into a string 

translation_output = tokenizer.decode(output_ids[0])
translation_output

'<pad> Meistens wettbewerbsfähige neural sequence transduction models have a encoder-decoder structure.</s>'

Compare the translations from "model.generate(input_ids, do_sample=True)" and model.generate(input_ids, num_beams=4, early_stopping=True)
to see which is closer to the English translation.

Source link for Pre-trained German to English LLM
`https://huggingface.co/amey1803/german_to_eng_lang_translation`

In [1024]:
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

In [1025]:
tokenizer = AutoTokenizer.from_pretrained("amey1803/german_to_eng_lang_translation")
model = AutoModelWithLMHead.from_pretrained("amey1803/german_to_eng_lang_translation")



In [1026]:
model

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Drop

In [1027]:
input_sequence_1 = "Die meisten wettbewerbsfähigen neuralen Sequenz-Transduction-Modelle besitzen"

input_sequence_2 = "Die meisten wettbewerbsfähigen neuralen Sequenztransduction-Modelle haben eine"

input_sequence_3 = "Die meisten wettbewerbsfähigen neuralen Sequenztransduction-Modelle haben eine Coder-Decoder-Struktur."


In [1028]:
input_ids_1 = tokenizer.encode("translate German to English: " + input_sequence_1, return_tensors="pt")
input_ids_2 = tokenizer.encode("translate German to English: " + input_sequence_2, return_tensors="pt")
input_ids_3 = tokenizer.encode("translate German to English: " + input_sequence_3, return_tensors="pt")

In [1029]:
output_ids_1 = model.generate(input_ids_1, do_sample=True)
output_ids_2 = model.generate(input_ids_2, num_beams=4, early_stopping=True)
output_ids_3 = model.generate(input_ids_3, max_length=50, num_beams=4, early_stopping=True)



In [1030]:
translation_output_1 = tokenizer.decode(output_ids_1[0])
translation_output_2 = tokenizer.decode(output_ids_2[0])
translation_output_3 = tokenizer.decode(output_ids_3[0])

In [1031]:
# Original Sequence 
sequence

'Most competitive neural sequence transduction models have an encoder-decoder structure.'

In [1032]:
# Output from "model.generate(input_ids_1, do_sample=True)" for both English to German and German to English Translations.
translation_output_1

'<pad> The majority of competitor-fähig neural-Select TransductionModels are registered</s>'

In [1033]:
# Output from "model.generate(input_ids_2, num_beams=4, early_stopping=True)" for both English to German and German to English Translations.
translation_output_2

'<pad> The most competitive neural-transduction models have a</s>'

In [1034]:
# Output from "Die meisten wettbewerbsfähigen neuralen Sequenztransduction-Modelle haben eine Coder-Decoder-Struktur." for both English to German and German to English Translations.
translation_output_3

'<pad> The most competitive neural networks have a coder-decoder-Struktur.</s>'

## Summary of the models

Community-uploaded Pretrained Models availble to use are located at (`https://huggingface.co/models?sort=trending`)

HuggingFace Pretrained Models availble to use are located at (`https://huggingface.co/transformers/v3.0.2/pretrained_models.html`)

Link for "Attention is all you Need" `https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf`

For an in-depth explaination of the code ("Attention is all you Need"), click the link (`http://nlp.seas.harvard.edu/annotated-transformer/`)

Each transformer model falls into one of the categories:
- AutoRegressive Models: Pre-trained casual language models (e.g GPT2, XLNET) that utilizes only the decoder (meaning there is a separate component for the encoder) for conditioning a model to sequentially predict a token given a sequence of previous tokens e.g. mask-language-modeling and text generation. 

    This does not mean there isn't a separate model for the encoding process. Typically during the preprocessing step, the encoder includes converting the tokens (words) to its respective token id. Then a lookup table is used to extract the embedding per token id. The input then becomes a high dimensional tensor filled with token embeddings for each word in the input sequence. 
    
    For text generation, the model corresponds to the DECODER part of the original transformer to predict the masked token at the end of a sequence ONE AT A TIME, so all the attention is put towards the words PRIOR to the masked token. Therefore, the last hidden state (which is the attention score computed by the attention mechanism for all prior tokens) and the current hidden state are concatenated at each step. 

    - Examples of AutoRegressive Models and their specific pretrained task:
        - Original GPT: Language Modeling (MLM), multi-task language modeling (learns multiple classification tasks jointly), multiple choice classification.
        - GPT2: Question Answering, Language Modeling (MLM), multi-task language modeling(learns multiple classifcation tasks jointly), multiple choice classification.
        - CRTL: Similar to GPT model, but uses control codes to learns the style of writing so tasks like MLM or Text Generation can result in text that matches the style of the input sequence (e.g. a blog or speech).  
        - Transformer-XL: Similar to GPT Modlel, but introduces a reoccurance mechanism for two consecutive segments i.e tokens (e.g. a RNN model with two consecutive inputs). A segment is a reoccuring token that span across multiple documents, which are then fed in order to the model. This helps put the focus on the current hidden state and the last state by concatentating them together to determine the attention scores. For example, the word "artificial" and "intelligence" might occur frequently across latest series of political documents, so those words can recieve a higher attention score. There are multiple attention layers stacked on top of each other. Therefore, the inputs from one attention layer feed into the next one creating multiple previous segments (i.e. tokens). This changes the positional embeddings to positional relatvie embeddings. Postitional embeddings only uses the current input which is the same as the current hidden state. Transformer-XL is used for language modeling. 
        - Reformer: Primary benefit is that is holds a set of tricks to lower computation time and memory footprint by using smaller batch matrices instead of a full attention matrix (NxN matrix includes all the word tokens in the matrix).
            - Axial Position Encoding: Splitting up a large positional encodeing matrix into a smaller matrices. 
            - Replace attention for computing the full results of Key-Query vectors in the attention layers with LSH (local-sensitive-hashing) attention.
            - To obtain the intermediate results for each layer, replace the reverse transformers method (obtained by a backwards pass) and just subtract the residual (i.e. cost or loss) from the input of the next layer.
            - Modify the Feed-Forward NN to input batch chunks of data as opposed to the whole batch.
        - XL-NET: Uses permutations of a sequence where mask tokens hide 1 to n number of words to predict the n+1 token. Examples include sentence classification (sentiment analysis) multiple choice classification, token classification (NER), language modeling (MLM) and Casual Language Modeling (text generation). 

- AutoEncoding Models: Models that traditionally correspond to the encoder-decoder architecture. However, for models like BERT, it is considered an autoencoder because it aligns with the AutoEncoding principles (even though it only includes and encoder, but not a decoder). Unlike before with AutoRegressive models, autoencoding models traditionally have no masked token included in the input, so it gets a full view of the input from a bidirectional standpoint (gather context from both right-to-left and left-to-right of the sequence). Furthermore, it uses a metric called "reconstruction loss" instead of "masked tokens" to help measure accuracy from the original sequence to the encoded representation via a cost function like CategoricalCrossEntropy or Mean Squared Error. Example tasks include text generation, token classification (NER), sentence classification (Sentiment Analysis).

    - Examples of AutoEncoding Models:
        - BERT: corrupts the original sequence with a percentage of masked tokens:
            - 80% probability for a special mask token.
            - 10% probability for a random token to be masked that are different than other than masked tokens.
            - 10% probability same token will be masked again.
            Goal of BERT is not only predict the masked tokens, but also if the words to the left and right of the masked token are consecutive or not (paraphrases). BERT can be used for Language Modeling (Masked Language Modeling), Casual Modeling (Next Sentence Prediction and Text Generation), Token Classification (NER), Question Answering, Multiple Choice classification, Sentence Classification.
        - ALBERT: Similar to BERT except for some adjustments:
            - one embedding represents one token (context dependent), whereas one hidden state represents a series of token embeddings.
            - Layers are split into groups that have same parameters
            - Next Sentence Prediction is swapped with Sentence Ordering Prediction. For example, let's say there are 2 sentences A and B. We know B comes after A but its up to the model to predict whether they have been switched or not. Therefore the inputs can either be A and B or B and A.  
        - RoBERTa: Similar to BERT except for some pretaining adjustments:
            - Includes dynamic masking to masked tokens at each epoch rather than the whole batch like BERT. 
            - No Next Sentence Prediction Loss. Therefore, instead of a putting two sentences together, include a contiguous group of text between each sentence, so the total token length is 512. This will allow other words to span across multiple documents. RoBERTa can be used for Masked Language Modeling, Sentence Classifcation, Token Classification, Question Answering, and multiple choice classification.
        - DistilBERT: Same as BERT, but its 40% smaller which makes it faster. Another interesting point to make out, is that DistilBERT can retain most (~97%) of the functionality from BERT making it useful all around. DistilBERT can be used for masked language modeling, sentence classification, token classification, question answering.
        - XLM: trained on several languages, so there are three separate checkpoints available that pertain to the different types of training. 
            - Casual Language Modeling (CLM) (predicting next word in a sequence (NTP? NWP?) or Text Generation),which uses the traditional autoregressive training, thefore it can technically be in the category above. It trains on a selected language for a given input of 256 tokens (words) than span across several documents for that paricular language.
            - Masked Language Modeling (MLM) similar to RoBERTa, it takes an input of 256 tokens that span across several documnets, but with a dynamic masking of tokens. 
            - Combination of MLM and transition language modeling (TLM) that consists of concatenating sentence in two different languages with random masking, so the model can better learn the context of surrounding words. 
            
            Checkpoints include CLM/MLM for indicating the language used based on the whole sequence, and MLM-TLM for indicating the language used based on specific tokens. 

            XLM can be used for Language Modeling like MLM and its subgroup of Casual Language Modeling like Next-Word-Prediction and Text Generation, Token Classification, Sentence Classification, multiple-choice classification, question answering
        - XLM-RoBERTa: Uses a RoBERTa tricks on the XLM approach, but only uses one language (unlike XLM). However XLM-RoBERTa is TRAINED on 100 languages. Also it doesn't utilize language embeddings so the model detects the language by itself.
        - FlauBERT: Like RoBERTa, it doesn't utilize sentence ordering prediction (predicting whether a sentence A comes after sentence B or not.). Therefore its just trained for MLM objective, but can also be used for sentence classification.
        - ELECTRA contains a two model process (though ELECTRA is a single model): 
            - One language model that takes text as input and then applies an n number of random masked tokens to each sequence. 
            - ELECTRA then tries to predict which tokens are part of the original sequence and which ones were replaced with tokens from the random language model. It only trains for a few steps (like a Generative Adverserial Network).However, it still includes the original sequences for its target labels unlike a GAN model. ELECTRA can be applied to sentence classification, token classification, masked language modeling.
        - Longformer  was originally pretrained the same way as ROBERTa, but takes advantage of speed and memory. Essentially, Longformer uses sparse matrices and focuses on local attention with some pre-selected input tokens as the focus for global attention. Longformer can be used for masked language modeling, sentence classification, token classification, multiple choice classification, and question answering. 

NOTE: The only difference between AutoRegressive and AutoEncodings is how they pretrain, meaning they have the same architecture. If a single architecture is pretrained on both model types (AutoRegressive and AutoEncodings), it will be classified in the group of what is was first pretrained on. 

- Sequence-to-sequence Models: Based off the original trasformer (Attention is All You Nedd), it includes a encoder-decoder architecture where the encoder converts the sequence to embeddings, then outputs a new sequence of a different size. Examples include summariztion, question answering, and translation (Fun Fact: the original transformer was pretrained for translating Engish to German). Models for this task include T5 or BART. 

    - Example of Sequence-to-Sequence Models:
        - BART: Encoder inputs a text sequence that outputs the tokenized encodings which was manipulated, whereas the decoder tries to learn the original tokens from the tokens that replaced the original ones. Some transformative methods for encoding include the following:
            - Masking random tokens (like BERT)
            - Deleting random tokens 
            - Masking a set of tokens but with the same masked token
            - Permutations of sequences
            - Set the document to start at a specific token instead of the beginning of first sequence. 
            BART can be applied to conditional generation (like translation) and sequence classification
        - MarianMT: framework for translation (conditional generation) models using BART architecture
        - T5: Learns postional encodings at each layers. It can be used for many text-to-text tasks like translation (conditional generation), summarization, question answering, etc. as long as the sequence to be encoded inludes the prefix:
            - input_ids = tokenizer.encode("summarization: " + SEQUENCE, return_tensors="pt")
            - input_ids = tokenizer.encode("question: " + SEQUENCE, return_tensors="pt")
            - input_ids = tokenizer.encode("translate English to German : " + SEQUENCE, return_tensors="pt")
            T5 uses both supervised and self-supervised learning: 
                - For supervised learning, the model learns a set of downstream tasks (fined-tuned model after its been pretrained) to be converted into text-to-text problem statements like mentioned above. 
                - Self-supervised learning consists of replacing 15% of tokens with masked tokens. Additionally, only a single token represents the n number of consecutive masked tokens. Input to the encoder is the sequence with the masked tokens. Input to the decoder is the original sequence and the target output displays only the masked tokens. I.e. the original masked words vs the unmasked words are switched around.


                e.g. "The bird ate a worm."

                Encoder input: "The <x> a <y>." 
                <x> = "bird ate"
                <y> = "worm"

                Decoder input: "The bird ate a worm."
                Decoder target output "<x> bird ate <y> worm <z>"  
                <x> = "The" 
                <y> = "a"
                <z> = "."


- Multimodel Models: Mixture of various types of data (e.g. text, audio, image, etc.) pretrained towards a specific task.

    - Example of MultiModal Models:
        - MMBT: A classifcation model that takes a text and image combination to generate predictions. The transformer takes the tokenized encodings of the text sequence while the image model takes in the final activations (after the pooling layer) of a resnet image model and feeds them into a linear layer to compute the dimension for the transformer's hidden state. 



### More Technical Aspects

#### Full vs Sparse Attention

Most transformers use full attention matrices i.e an NxN matrix that includes all words in the sequence(s). However, this can cause bottlenecking, increasing time and space complexity. Models like Reformer and Longformer overcome this by utilizing sparse local attention (NxM) matrices.  

##### LSH (local sensitve hashing) attention:

- Reformer: For each Query-Key vector, Reformer uses an average score of various hash functions to determine what keys (k vectors) are related to what query (q vectors) so only the closest ones are choosen. To add, the current token is masked over before hashing.

##### Local Attention:

- Longformer: splits a full attention NxN matrix into stackable attention layers with smaller windows. Thefore, only the token before and after are taken into account for each token's local attention score. This means that a bunch of smaller windowed layers on top of each other can be understood for global attention. The model also has a set of pre-determined word tokens that are be used for global attention so each word has access to the pre-selected words along with the word tokens in their respected window.

Another benefit of sparse attention is that the model can input longer sequences without bottlenecking. 

#### Other Tricks

##### Axial positional encodings:

- Reformer: Traditionally, transformers use full attention with a postional encoding matrix called E. Axial positional encodings split E into two sub matrices then doing a positional-wise multiplication beteen the each embedding from both the smaller matrices. The product for each position in the matrics is then used to represent the attention (which lowers time complexity).

## Preprocessing data

Tokenizers convert words into token ids (represent words and subwords) then are allowed to be a.) encoded into embeddings and b.) to be decoded from the embeddings into token ids. This can be accomplished using AutoTokenizer or PreTrainedTokenizer class.

It's important to remember to use the respective tokenizer that was included when pre-training the model. This is due to the tokenizer splitting words into tokens and sub-tokens for a dev's input the same way text was split in the pre-training corpus (i.e. the corresponding word to token ids (known as vocab) are the same for both pre-training and fine-tuning). This includes a list of special tokens and other preprocessing steps like removing stopwords, lemmatizing, etc. 

    (e.g AutoTokenizer.from_pretrained("bert-base-cased") and AutoModelWithLMHead.from_pretrained("bert-base-cased"))


    
Transformers includes different classes for each data type making preprocessing easier since it is called under one function e.g. *tokenizer(data, return_tensors="pt")* or *tokenizer.encode(data, return_tensors="pt")*

- Tokenizer: inputs text data, returns tensors or list of token ids
- Feature extractor: inputs speech and audio data, returns tensors or list of token ids
- ImageProcessor: inputs images, returns tensors or list of token ids
- Processor: various types of inputs (multimodal) like text, audio, images, returns tensors or list of token ids
 


### Base Use

In [1045]:
from transformers import AutoTokenizer

In [1047]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

For preprocessing, the tokenizer has many methods but the *__call__* function is the most important as it will output a sequences input_ids, token_type_ids, and attention_mask (see below). 

- The *input ids* represent the token ids for each word in a sequence. 
- The *attention_mask* represent a list, or matrix (depending if return_tensors="pt" and return_tensors="tf" or not) the same size as the input_ids matrix where a 1 is a token and 0 is not a token. This occurs when additional hyperparameters are set to add padding and truncation to the tokenizer. 
- The *token_type_ids* 


In [1056]:
sequence = "The brown dog ran up the hill"

input_info = tokenizer(sequence, return_tensors="pt")
input_info

{'input_ids': tensor([[ 101, 1109, 3058, 3676, 1868, 1146, 1103, 4665,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Have a looks at the outputs above and below. The input_info includes the input_ids, token_type_ids, and the attention_mask whearas the .encode() method used below extracts the input_ids as its own output. 

In [1055]:
input_ids = tokenizer.encode(sequence)
input_ids

[101, 1109, 3058, 3676, 1868, 1146, 1103, 4665, 102]

In [1058]:
decode_sequence = tokenizer.decode(input_ids)
decode_sequence

'[CLS] The brown dog ran up the hill [SEP]'

Notice the decoded string includes special characters from the pretrained tokenizer. This can be removed by setting the hyperparameter as *verbose=False*. It's only advised to remove special tokens when they have been added by yourself. 

Not all models include special character tokens (e.g. gpt2-medium).

Take a look below at all the hyperparameters that can be instantiated to various settings.

In [1076]:
sequence = "GPT is a large language model."

In [1077]:
tokenizer_gpt2 = AutoTokenizer.from_pretrained("gpt2-medium")
tokenizer_gpt2

GPT2TokenizerFast(name_or_path='gpt2-medium', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}

In [1078]:
input_ids_gpt2 = tokenizer_gpt2(sequence, return_tensors="pt")
input_ids_gpt2

{'input_ids': tensor([[   38, 11571,   318,   257,  1588,  3303,  2746,    13]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

Compare the original sentence to the outputs of the decoded sequences via input_ids_gpt2 and input_ids_gpt2_b above to the one below that includes additional set hyperparameter *add_special_tokens* to False. 

In [1088]:
input_ids_gpt2_b = tokenizer_gpt2(sequence, return_tensors="pt", add_special_tokens=False)
input_ids_gpt2_b

{'input_ids': tensor([[   38, 11571,   318,   257,  1588,  3303,  2746,    13]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [1089]:
tokenizer.decode(input_ids_gpt2["input_ids"][0])

'[unused38] hatred Ś ë south conducted Earth [unused13]'

In [1090]:
tokenizer.decode(input_ids_gpt2_b["input_ids"][0])

'[unused38] hatred Ś ë south conducted Earth [unused13]'

For batch pre-processing, use a list of sequences

In [1091]:
from transformers import AutoTokenizer

In [1095]:
batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
tokenizer_bert_base = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer_bert_base

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

If return_tensors hyperparameter is the only additional hyperparameter set, then this error will occur -> 

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).


In [1099]:
input_ids_bert_base = tokenizer(batch_sentences)
input_ids_bert_base

{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102], [101, 1262, 1330, 5650, 102], [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}

In [1104]:
tokenizer_bert_base.decode(input_ids_bert_base['input_ids'][0])

"[CLS] Hello I'm a single sentence [SEP]"

If the point is to feed a pairs of sentences as input, then use truncation and padding.

Let's looks at *bert-bsase-cased* hyperparameters via input_ids_bert_base_b

In [1103]:
tokenizer_bert_base

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [1112]:
# input_ids_bert_base_b = tokenizer_bert_base(batch_sentences, padding_side="right", truncation_side="right", return_tesors="pt")
# input_ids_bert_base_b

input_ids_bert_base_b = tokenizer_bert_base(batch_sentences, padding=True, truncation=True, return_tensors="pt")
input_ids_bert_base_b

{'input_ids': tensor([[ 101, 8667,  146,  112,  182,  170, 1423, 5650,  102],
        [ 101, 1262, 1330, 5650,  102,    0,    0,    0,    0],
        [ 101, 1262, 1103, 1304, 1304, 1314, 1141,  102,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 0]])}

Notice above the attention mask. It contains a matrix the same size as the input token ids where a 1 represents a given token and 0 represents a non-token (which comes up during padding).

- padding: adding zeros to a sequence when its size is less than the max_length.
- truncation: removing characters when a sequence size is greater than the max_length.

### Processing PAIRS of sequences NOT lists

Depending on the tokenizer and model, feeding an input list of paired sentences needs to be set according to the specific task.

E.g. BERT can do paraphrase classification and question answering. For these tasks the input appears as such *[CLS] Sequence A [SEP] Sequence B [SEP]* where sentence A and B are a combined string separted by a token. To be processed by the tokenizer, each string can be two set as two arguements i.e. *tokenizer(Sequence A, Sequence B)*. This means the input is processed in the same batch as oppose to separate batches for each sentence (e.g. list of sequences - *tokenizer([Sequence A, Sequence B])* ).

Now let's look at what *token_type_ids* represent.

In [1113]:
from transformers import AutoTokenizer

In [1114]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [1115]:
input_info = tokenizer("Why do you feel that way?", "I feel that way because I am sad.")
input_info

{'input_ids': [101, 2009, 1202, 1128, 1631, 1115, 1236, 136, 102, 146, 1631, 1115, 1236, 1272, 146, 1821, 6782, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

*token_type_ids* consists of a list of ids, where each part indicates the sentence they belong to. I.e. "0, 0, 0, 0, 0, 0, 0, 0, 0" are assigned to *Sentence A* and "1, 1, 1, 1, 1, 1, 1, 1, 1, 1" are assigned to *Sentence B* 

Not all the input info is included for all models (which depends on the setup of the model's inputs). This can be overwritten with the additional hyperparameter *setup return_input_ids*, *return_token_type_ids*. 

When there are multiple pairs of inputs, it helps to put the first sentences in one list and second sentences in another with the same respective order.

In [1123]:
from transformers import AutoTokenizer

In [1126]:
batch_input_sentence_pairs_A = [
    "The lion ran as fast as it could.",
    "Fine-Tuning a model means to train a pretrained model.",
    "You\'re a wizard, Harry."
]

batch_input_sentence_pairs_B = [
    "Soon thereafter, it got too tired to stand.",
    "It includes a learning rate that should be small enough to learn a specific downstream task.",
    "I\'m a what?"
]

In [1127]:
batch_input_sentence_pairs_A

['The lion ran as fast as it could.',
 'Fine-Tuning a model means to train a pretrained model.',
 "You're a wizard, Harry."]

In [1130]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

Note: When *return_tensors="pt"* or *"tf"*, the tokenizer will return a array of integers, OTHERWISE it will return a list of list of integers. Each integer indicates the sentence id for every token. Special characters are also included and associated with their resprective sentence id UNLESS they are set to NOT be inlcuded i.e. *add_special_tokens=False*

In [1132]:
input_info = tokenizer(batch_input_sentence_pairs_A, batch_input_sentence_pairs_B)
input_info

{'input_ids': [[101, 1109, 11160, 1868, 1112, 2698, 1112, 1122, 1180, 119, 102, 5398, 7321, 117, 1122, 1400, 1315, 4871, 1106, 2484, 119, 102], [101, 4730, 118, 17037, 3381, 170, 2235, 2086, 1106, 2669, 170, 3073, 4487, 9044, 2235, 119, 102, 1135, 2075, 170, 3776, 2603, 1115, 1431, 1129, 1353, 1536, 1106, 3858, 170, 2747, 14102, 4579, 119, 102], [101, 1192, 112, 1231, 170, 16678, 117, 3466, 119, 102, 146, 112, 182, 170, 1184, 136, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

In [1134]:
input_ids = input_info['input_ids']
input_ids

[[101,
  1109,
  11160,
  1868,
  1112,
  2698,
  1112,
  1122,
  1180,
  119,
  102,
  5398,
  7321,
  117,
  1122,
  1400,
  1315,
  4871,
  1106,
  2484,
  119,
  102],
 [101,
  4730,
  118,
  17037,
  3381,
  170,
  2235,
  2086,
  1106,
  2669,
  170,
  3073,
  4487,
  9044,
  2235,
  119,
  102,
  1135,
  2075,
  170,
  3776,
  2603,
  1115,
  1431,
  1129,
  1353,
  1536,
  1106,
  3858,
  170,
  2747,
  14102,
  4579,
  119,
  102],
 [101,
  1192,
  112,
  1231,
  170,
  16678,
  117,
  3466,
  119,
  102,
  146,
  112,
  182,
  170,
  1184,
  136,
  102]]

In [1139]:
for a_b_ids in input_ids:
    print(tokenizer.decode(a_b_ids))

[CLS] The lion ran as fast as it could. [SEP] Soon thereafter, it got too tired to stand. [SEP]
[CLS] Fine - Tuning a model means to train a pretrained model. [SEP] It includes a learning rate that should be small enough to learn a specific downstream task. [SEP]
[CLS] You're a wizard, Harry. [SEP] I'm a what? [SEP]


Rember to include truncation, padding, and a Pytorch or Tensorflow array

In [1140]:
input_info = tokenizer(batch_input_sentence_pairs_A, batch_input_sentence_pairs_B, padding=True, truncation=True, return_tensors="pt")
input_info

{'input_ids': tensor([[  101,  1109, 11160,  1868,  1112,  2698,  1112,  1122,  1180,   119,
           102,  5398,  7321,   117,  1122,  1400,  1315,  4871,  1106,  2484,
           119,   102,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0],
        [  101,  4730,   118, 17037,  3381,   170,  2235,  2086,  1106,  2669,
           170,  3073,  4487,  9044,  2235,   119,   102,  1135,  2075,   170,
          3776,  2603,  1115,  1431,  1129,  1353,  1536,  1106,  3858,   170,
          2747, 14102,  4579,   119,   102],
        [  101,  1192,   112,  1231,   170, 16678,   117,  3466,   119,   102,
           146,   112,   182,   170,  1184,   136,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [1141]:
input_ids = input_info['input_ids']
input_ids

tensor([[  101,  1109, 11160,  1868,  1112,  2698,  1112,  1122,  1180,   119,
           102,  5398,  7321,   117,  1122,  1400,  1315,  4871,  1106,  2484,
           119,   102,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0],
        [  101,  4730,   118, 17037,  3381,   170,  2235,  2086,  1106,  2669,
           170,  3073,  4487,  9044,  2235,   119,   102,  1135,  2075,   170,
          3776,  2603,  1115,  1431,  1129,  1353,  1536,  1106,  3858,   170,
          2747, 14102,  4579,   119,   102],
        [  101,  1192,   112,  1231,   170, 16678,   117,  3466,   119,   102,
           146,   112,   182,   170,  1184,   136,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0]])

In [1142]:
for embedding in input_ids:
    print(tokenizer.decode(embedding))

[CLS] The lion ran as fast as it could. [SEP] Soon thereafter, it got too tired to stand. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[CLS] Fine - Tuning a model means to train a pretrained model. [SEP] It includes a learning rate that should be small enough to learn a specific downstream task. [SEP]
[CLS] You're a wizard, Harry. [SEP] I'm a what? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]


Hyperparameter settings for tokenizer's padding and truncation 

    Reference Link (https://huggingface.co/transformers/v3.0.2/preprocessing.html)

    max_length:
    - max_length: used to set the size of batch paired sequences. It can either be set with a specific max_length arguement or if not specified (i.e. the default is max_length=None), the the length is deterimined by the maximum length that the model can accept. If the model does not have a maximum input length it can accept, then max_length is deactivated. 

        e.g.

            tokenizer(batch_sentences, max_length=50)

    
    - Padding: add zeros to and beginning/end of a sequence
        - 'max_length': Uses the maximum length accepted by the model. 
        - True or 'longest': Using the longest sequence pair as reference, pad zeros to the rest of the sequences making them the same size as the longest sequence pair. 
        - False or 'do_not_pad': No padding at all, meaning the sequences could be different length sizes. 


        e.g. 
            tokenizer(batch_sentences)
            
            tokenizer(batch_sentences, padding="do_not_pad") # same as above

            
            
            tokenizer(batch_sentences, padding="max_length")

            
            
            tokenizer(batch_sentences, padding="longest")

            tokenizer(batch_sentences, padding=True) # same as line above


            


    - Truncation: remove tokens at the beginning/end of sequence from either the first sentence, second sentence, or both sequences within the batch pair. 

        DISCLAIMER: a chosen STRATEGY arguement for truncation such as *only_first*, *only_second*, and *longest_first* are specific towards BATCH PAIRS OF SENTENCES ONLY (i.e. *[CLS] Sequence A [SEP] Sequence B [SEP]*).

        e.g. 
            batch_input_sentence_pairs_A = [
                "The lion ran as fast as it could.",
                Fine-Tuning a model means to train a pretrained model.",
                "You\'re a wizard, Harry."
                ]

            batch_input_sentence_pairs_B = [
                "Soon thereafter, it got too tired to stand.",
                "It includes a learning rate that should be small enough to learn a specific downstream task.",
                "I\'m a what?"
                ]


        - 'longest_first': Truncate tokens from each longer sentence within the batch pairs of sequences via...
            - a specific length using the max_length arguement
            - If no max_length arguement is given, then the default is max_length=None. This means the maximum sequence length accepted by the model will be used. E.g. a model might only accept a sequence length size of 10, so any sequences with a larger size will be truncated. 
        - True or 'only_first': Truncates the *FIRST* sentence in a batch pair to sequences to... 
            - a specific length using the max_length arguement
            - If no max_length arguement is given, then the default is max_length=None. This means the maximum sequence length accepted by the model will be used. E.g. a model might only accept a sequence length size of 10, so any sequences with a larger size will be truncated. 
        - False or 'do_not_truncate': No truncating at all, meaning sequences could be different sizes. 
        - 'only_second': Truncates the *SECOND* sentence in a batch pair of sequences to...
            - a specific length using the max_length arguement
            - If no max_length arguement is given, then the default is max_length=None. This means the maximum sequence length accepted by the model will be used. E.g. a model might only accept a sequence length size of 10, so any sequences with a larger size will be truncated. 

            e.g. 
                tokenizer(batch_sentences, truncation=True)

                tokenizer(batch_sentences, truncation=STRATEGY) # same as line above

                

                tokenizer(batch_sentences, truncation="only_first")

                tokenizer(batch_sentences, truncation="longest_first")

                tokenizer(batch_sentences, truncation="only_second")


                
                tokenizer(batch_sentences, truncation=True, max_length=50)
                
                tokenizer(batch_sentences, truncation=STRATEGY, max_length=50) # same as line above

                

                tokenizer(batch_sentences, padding=True, truncation=True)

                tokenizer(batch_sentences, padding=True, truncation=STRATEGY) # same as line above

                

                tokenizer(batch_sentences, padding=True, truncation=True, max_length=50)

                tokenizer(batch_sentences, padding=STRATEGY, truncation=True, max_length=50)  # same as line above


                
                tokenizer(batch_sentences, padding="max_length", truncation=True, max_length=50)

                tokenizer(batch_sentences, padding="max_length", truncation=STRATEGY, max_length=50) # same as line above
                


### Pre-tokenized Inputs

The model can accept tokenized inputs along with a sequence represented as a single string


Single list or a list of lists containing a single string for each sequence:

batch_input_sentence_pairs_A = [
    "The lion ran as fast as it could.",
    "Fine-Tuning a model means to train a pretrained model.",
    "You\'re a wizard, Harry."
]

batch_input_sentence_pairs_B = [
    "Soon thereafter, it got too tired to stand.",
    "It includes a learning rate that should be small enough to learn a specific downstream task.",
    "I\'m a what?"
]


Involves a list of lists, where each list contains the tokens (comma separated words usually presented in a list) for each sequence. 

When each sequence is a list of tokens (like seen below), then set the following arguement to be True for the tokenizer. e.g. *"is_pretokenized=True"*

batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
                   ["And", "another", "sentence"],
                   ["And", "the", "very", "very", "last", "one"]]


batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
                             ["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
                             ["And", "I", "go", "with", "the", "very", "last", "one"]]


input_info = tokenizer(batch_sentences_A,  is_pretokenized=True)

OR

batch_input_info = tokenizer(batch_sentences_A, 
                            batch_sentences_B, 
                            padding=True,
                            truncation=True,
                            return_tensors="pt"
                            is_pretokenized=True)





## Training and Fine-Tuning

### Fine-tuning in Pytorch

Hugging Face allows both TensorFlow and Pytorch models

Use the *.pre_trained()* method to extract any pre-trained model in the HuggingFace Model Hub. This will allow all the pretrained weights to be used for further training/inference. The library also includes attentional heads that can attached to the already pre-trained model for specific tasks. 

E.g for sentence classification, the arguement *num_labels* can be added to the *from_pretrained()* function to specify the number of outputs. The additional layer (or head) is placed on top of the encoder and can be used for further training. Since their weights are instantiated randomly, the new layer can update to learn a particular task.  

*BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)*

Above shows the pre-trained model with its respective weights with an added layer (head) on top of the encoder with an output size of 2.

In [1159]:
from transformers import BertForSequenceClassification

In [1248]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
model

loading configuration file config.json from cache at /Users/druestaples/.cache/huggingface/hub/models--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/config.json
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.38.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file model.safetensors from cache at /Users/druestaples/.cache/huggingface/hub/models--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/model.safetensors
Some 

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

model.train() easily helps train a model to learn another dataset. Whatever dataset is used, it needs to pertain to the specific task the model was already pre-trained on. E.g. for a pre-trained sentence classifier, the new dataset for it to be trained on also needs to be for sentence classification.

In [1161]:
model.train()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,


Pytorch includes the *AdamW* optimizer which involves gradient bias correction and weight decay.

In [1162]:
from transformers import AdamW

In [1163]:
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.2, correct_bias=False)
optimizer

AdamW (
Parameter Group 0
    betas: (0.9, 0.999)
    correct_bias: False
    eps: 1e-06
    lr: 1e-05
    weight_decay: 0.2
)

The optimizer allows for different hyperparamertes to be utilized for each specific group. The following shows weight decay being applied without bias and layer normalization and vice versa. 

In [1174]:
[m for m in model.parameters()]

[Parameter containing:
 tensor([[-0.0102, -0.0615, -0.0265,  ..., -0.0199, -0.0372, -0.0098],
         [-0.0117, -0.0600, -0.0323,  ..., -0.0168, -0.0401, -0.0107],
         [-0.0198, -0.0627, -0.0326,  ..., -0.0165, -0.0420, -0.0032],
         ...,
         [-0.0218, -0.0556, -0.0135,  ..., -0.0043, -0.0151, -0.0249],
         [-0.0462, -0.0565, -0.0019,  ...,  0.0157, -0.0139, -0.0095],
         [ 0.0015, -0.0821, -0.0160,  ..., -0.0081, -0.0475,  0.0753]],
        requires_grad=True),
 Parameter containing:
 tensor([[ 1.7505e-02, -2.5631e-02, -3.6642e-02,  ...,  3.3437e-05,
           6.8312e-04,  1.5441e-02],
         [ 7.7580e-03,  2.2613e-03, -1.9444e-02,  ...,  2.8910e-02,
           2.9753e-02, -5.3247e-03],
         [-1.1287e-02, -1.9644e-03, -1.1573e-02,  ...,  1.4908e-02,
           1.8741e-02, -7.3140e-03],
         ...,
         [ 1.7418e-02,  3.4903e-03, -9.5621e-03,  ...,  2.9599e-03,
           4.3435e-04, -2.6949e-02],
         [ 2.1687e-02, -6.0216e-03,  1.4736e-02,  

In [1182]:
[m for m in model.named_parameters()]

[('bert.embeddings.word_embeddings.weight',
  Parameter containing:
  tensor([[-0.0102, -0.0615, -0.0265,  ..., -0.0199, -0.0372, -0.0098],
          [-0.0117, -0.0600, -0.0323,  ..., -0.0168, -0.0401, -0.0107],
          [-0.0198, -0.0627, -0.0326,  ..., -0.0165, -0.0420, -0.0032],
          ...,
          [-0.0218, -0.0556, -0.0135,  ..., -0.0043, -0.0151, -0.0249],
          [-0.0462, -0.0565, -0.0019,  ...,  0.0157, -0.0139, -0.0095],
          [ 0.0015, -0.0821, -0.0160,  ..., -0.0081, -0.0475,  0.0753]],
         requires_grad=True)),
 ('bert.embeddings.position_embeddings.weight',
  Parameter containing:
  tensor([[ 1.7505e-02, -2.5631e-02, -3.6642e-02,  ...,  3.3437e-05,
            6.8312e-04,  1.5441e-02],
          [ 7.7580e-03,  2.2613e-03, -1.9444e-02,  ...,  2.8910e-02,
            2.9753e-02, -5.3247e-03],
          [-1.1287e-02, -1.9644e-03, -1.1573e-02,  ...,  1.4908e-02,
            1.8741e-02, -7.3140e-03],
          ...,
          [ 1.7418e-02,  3.4903e-03, -9.5621e

In [1192]:
no_decay = ['bias', 'LayerNorm.weight']

optimizer_group_parameters = [
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0, 'correct_bias': False},
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01}
]
optimizer_group_parameters

[{'params': [Parameter containing:
   tensor([0.9261, 0.8851, 0.8581, 0.8617, 0.8937, 0.8969, 0.9297, 0.9137, 0.9371,
           0.8084, 0.7992, 0.8071, 0.9031, 0.8198, 0.9100, 0.8493, 0.8152, 0.8613,
           0.9142, 0.8652, 0.9234, 0.8672, 0.9008, 0.8684, 0.8440, 0.8990, 0.7891,
           0.9275, 0.8501, 0.8413, 0.9179, 0.8641, 0.9185, 0.9657, 0.8861, 0.8710,
           0.9103, 0.8739, 0.9133, 0.8880, 0.9130, 0.9374, 0.8823, 0.8622, 0.8812,
           0.8708, 0.8570, 0.9445, 0.9163, 0.9356, 0.9265, 0.8504, 0.9300, 0.3447,
           0.8650, 0.8197, 0.8722, 0.8566, 0.8939, 0.8051, 0.9007, 0.8483, 0.3870,
           0.8889, 0.8923, 0.8772, 0.8963, 0.9548, 0.8944, 0.8946, 0.9471, 0.9489,
           0.9349, 0.7814, 0.9255, 0.7943, 0.8806, 0.3857, 0.7900, 0.8478, 0.8886,
           0.9215, 0.9292, 0.8990, 0.7790, 0.8255, 0.8717, 0.8778, 0.9021, 0.9190,
           0.8605, 0.8762, 0.7084, 0.8599, 0.8981, 0.8092, 0.4021, 0.7917, 0.8923,
           0.9118, 0.9459, 0.9489, 0.8744, 0.8402, 0

In [1193]:
optimizer = AdamW(optimizer_group_parameters, lr=1e-5)
optimizer

AdamW (
Parameter Group 0
    betas: (0.9, 0.999)
    correct_bias: False
    eps: 1e-06
    lr: 1e-05
    weight_decay: 0.0

Parameter Group 1
    betas: (0.9, 0.999)
    correct_bias: True
    eps: 1e-06
    lr: 1e-05
    weight_decay: 0.01
)

For batch encoding, the Tokenizer uses the __call__() method to return a BatchEncoding() instance of the class

In [1199]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BertTokenizer

In [1210]:
batch_sequences = ["Encoders convert words to embeddings.", "Decoders convert embeddings back into text."]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
input_info = tokenizer(batch_sequences, return_tensors="pt", padding=True, truncation=True)
input_info

loading file vocab.txt from cache at /Users/druestaples/.cache/huggingface/hub/models--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /Users/druestaples/.cache/huggingface/hub/models--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/tokenizer_config.json
loading file tokenizer.json from cache at /Users/druestaples/.cache/huggingface/hub/models--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/tokenizer.json
loading configuration file config.json from cache at /Users/druestaples/.cache/huggingface/hub/models--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": n

{'input_ids': tensor([[  101,  4372, 16044,  2869, 10463,  2616,  2000,  7861,  8270,  4667,
          2015,  1012,   102],
        [  101, 21933, 13375, 10463,  7861,  8270,  4667,  2015,  2067,  2046,
          3793,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [1240]:
input_info['input_ids']

tensor([[  101,  4372, 16044,  2869, 10463,  2616,  2000,  7861,  8270,  4667,
          2015,  1012,   102],
        [  101, 21933, 13375, 10463,  7861,  8270,  4667,  2015,  2067,  2046,
          3793,  1012,   102]])

In [1241]:
input_info['attention_mask']

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

In [1242]:
tokenizer.decode(input_info['input_ids'][0]), tokenizer.decode(input_info['input_ids'][1])

('[CLS] encoders convert words to embeddings. [SEP]',
 '[CLS] decoders convert embeddings back into text. [SEP]')

When a model is called with the *labels* arguement, the first element returned is the Cross Entropy loss between the predictions and passed labels. From there, a backwards pass can be done and then update the weights 

In [1243]:
labels = torch.tensor([[0,1]])
labels

tensor([[0, 1]])

In [1250]:
outputs = model(input_ids=input_info['input_ids'], attention_mask=input_info['attention_mask'], labels=labels)
outputs

SequenceClassifierOutput(loss=tensor(0.9562, grad_fn=<NllLossBackward0>), logits=tensor([[-0.9187,  0.5627],
        [-0.9136,  0.4580]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [1251]:
loss = outputs[0]
loss

tensor(0.9562, grad_fn=<NllLossBackward0>)

In [1252]:
loss.backward()

In [1253]:
optimizer.step()

If the *labels* hyperparameter is not used as an arguement for the model() method,
then the logits are outputed. The loss can then be computed from utilizing the *torch.nn.functional.cross_entropy()* method. 

Notice the key *loss* in the output is instatiated to None

In [1256]:
outputs = model(input_ids=input_info['input_ids'], attention_mask=input_info['attention_mask'])
outputs 

SequenceClassifierOutput(loss=None, logits=tensor([[-0.9187,  0.5627],
        [-0.9136,  0.4580]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [1257]:
from torch.nn import functional as F

In [1262]:
loss = F.cross_entropy(outputs[0], target=labels[0])
loss

tensor(0.9562, grad_fn=<NllLossBackward0>)

In [1263]:
loss.backward()

In [1264]:
optimizer.step()

In [1265]:
from transformers import get_linear_schedule_with_warmup
num_warmup_steps, num_train_steps = 3, 5
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_train_steps)

In [1267]:
optimizer.step()

In [1268]:
scheduler.step()

Freezing the encoder 

A dev might want to only tune the newly addition head layers instead of all of the layers including the pre-trained ones. 

To do this, simply set requires_grad to False for the encoder parameters. 

In [1273]:
for i in model.parameters():
    print(i.requires_grad)
    i.requires_grad = False
    print(i.requires_grad)

True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False
True
False

### Fine-tuning in Tensorflow 2

from transformers import TFBertForSequenceClassification
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

from transformers import BertTokenizer, glue_convert_examples_to_features
import tensorflow_datasets as tfds
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
data = tfds.load('glue/mrpc')
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss)
model.fit(train_dataset, epochs=2, steps_per_epoch=115)

TF/Pytorch models can be saved and reloaded as either or. 

from transformers import BertForSequenceClassification
model.save_pretrained('./my_mrpc_model/')
pytorch_model = BertForSequenceClassification.from_pretrained('./my_mrpc_model/', from_tf=True)

### Fine-Tuning with Pytorch using Tranformers Trainer Class

Training in Pytorch includes a training loop where the loss is continually computed, then the weighted parameters are optimized with the goal of reducing loss as much as possible. Transformers offer a an alternative way to train through a Trainer Class to avoid the loop. It includes a range of training options and features e.g. logging, gradient accumulation, mixed precision.

In [32]:
# Load dataset
# For this example, it is a Yelp Review Dataset

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer
import numpy as np 
import evaluate

In [3]:
data = load_dataset("yelp_review_full")
data

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [34]:
# Preprocess data via tokenizing, padding, and truncating

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
tokenizer

BertTokenizerFast(name_or_path='google-bert/bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [35]:
def tokenize_func(data_text: str):
    return tokenizer(data_text['text'], padding='max_length', truncation=True, return_tensors="pt")


In [36]:
# data.map() references the keys inside data which is "train" and "test"
# Both train and test have a "text" key inside both dictionaries. 
# Therefore the "tokenize_func()" function works. 
tokenized_data = data.map(tokenize_func, batched=True)

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

Map:  64%|██████▍   | 32000/50000 [00:07<00:04, 4347.66 examples/s]


KeyboardInterrupt: 

In [37]:
training_data = tokenized_data['train'].shuffle(seed=99)
testing_data = tokenized_data['test'].shuffle(seed=99)

In [38]:
training_data, testing_data

(Dataset({
     features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
     num_rows: 650000
 }),
 Dataset({
     features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
     num_rows: 50000
 }))

In [39]:
# Load model with an additional layer on top of the pretrained model 
# ...above the encoder.
# Also, instatitate the num_labels arguement, which is the output 
# ...size for the encoder.
# In this example (Yelp Review Dataset), there are five label classes. 
# Therefore, the input for the first layer (which is the one that was 
# ...just added) will have an input of five.

In [40]:
model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [41]:
# Option A: Training Hyperparameters WITHOUT an evaluation_metric parameter

# Transformers offer a class called "TrainingArguements" to set a 
# ...wide amout of hyperparameters for model tuning (which is just 
# ...additional training for a downstream task for an already pre-trained model)
training_args = TrainingArguments(output_dir="test_trainer" )

In [42]:
training_args

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_la

In [43]:
# Option B: Training Hyperparameters WITh an *evaluation strategy* to monitor the evaluation metrics 

# Trainer does not offer an automatic way to compute metrics, 
# ...so include a custom function and a library that can 
# ...evaluate on its own.

accuracy_metric = evaluate.load("accuracy")

def compute_metrics(data):
    logits, labels = data
    predictions = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions, references=labels)

In [44]:
training_args_2 = TrainingArguments(output_dir="test_trainer_2", evaluation_strategy='epoch')

In [45]:
training_args_2

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=epoch,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_

In [49]:
'''
Create a Trainer instance with the...
- model
- training arguments
- train dataset
- test dataset
- custom function for computing metrics
'''


trainer = Trainer(
    model=model,
    args=training_args_2,
    train_dataset=training_data,
    eval_dataset=testing_data,
    compute_metrics=compute_metrics
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [50]:
# Train model

trainer.train()





KeyboardInterrupt: 

### Llama 3 with 120 billion parameters

In [None]:
# Install transformers
# !pip install -qU transformers accelerate

In [51]:

from transformers import AutoTokenizer
import transformers
import torch



In [54]:
model = "mlabonne/Llama-3-120B"


In [55]:
messages = [{"role": "user", "content": "What is a large language model?"}]


In [56]:

tokenizer = AutoTokenizer.from_pretrained(model)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [57]:
tokenizer

PreTrainedTokenizerFast(name_or_path='mlabonne/Llama-3-120B', vocab_size=128000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|begin_of_text|>', 'eos_token': '<|end_of_text|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	128000: AddedToken("<|begin_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128001: AddedToken("<|end_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128002: AddedToken("<|reserved_special_token_0|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128003: AddedToken("<|reserved_special_token_1|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128004: AddedToken("<|reserved_special_token_2|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128005: AddedToken("<|reserv

In [None]:

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)



In [None]:
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)



In [None]:
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, 
          temperature=0.7, top_k=50, top_p=0.95)


In [None]:
print(outputs[0]["generated_text"])