
# Transformers

In [1]:
from transformers import pipeline

The `pipeline` function is the highest-level API of the Transformers library. <br/>
- It is an end-to-end object that goes from raw text to usable predictions.
- It includes all pre-processing (text to numbers)
- and post-processing (numbers to text).

In [2]:
classifier1 = pipeline("sentiment-analysis")
classifier1(["I've been waiting my whole life to be with you!",
           "The OS is the manager of the computer system.",
           "What is wrong with you?",
           "Are you okay?"])

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9984979629516602},
 {'label': 'NEGATIVE', 'score': 0.9084308743476868},
 {'label': 'NEGATIVE', 'score': 0.9990400671958923},
 {'label': 'POSITIVE', 'score': 0.998790442943573}]

In [4]:
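# Caution: this checkpoint is a sentiment classifier (SST-2), not a natural language inference (NLI) model.
# The zero-shot pipeline relies on an 'entailment' label, so the scores below are not meaningful
# (see the warning in the output). An NLI checkpoint such as facebook/bart-large-mnli is the usual choice.
classifier2 = pipeline("zero-shot-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", trust_remote_code=True)
classifier2("Shehbaz Sharif is the PM of Pakitan.",
           candidate_labels=["education", "politics", "business"])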

Device set to use cpu
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.


{'sequence': 'Shehbaz Sharif is the PM of Pakitan.',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.47259363532066345, 0.294681578874588, 0.23272478580474854]}

In [5]:
generator = pipeline("text-generation")  # causal language modeling, the pretraining objective of GPT
generator("In this notebook, we will", max_length=20, num_return_sequences=2)

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this notebook, we will be using the RCS standard library and all of its variants.\n'},
 {'generated_text': 'In this notebook, we will find a simple guide how to setup your device using the default firmware as'}]

In [6]:
unmasker = pipeline("fill-mask")  # masked language modeling, the pretraining objective of BERT
unmasker("In this notebook, we will learn <mask> computation", top_k=3)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cpu


[{'score': 0.22510744631290436,
  'token': 17997,
  'token_str': ' quantum',
  'sequence': 'In this notebook, we will learn quantum computation'},
 {'score': 0.03392070159316063,
  'token': 47713,
  'token_str': ' asynchronous',
  'sequence': 'In this notebook, we will learn asynchronous computation'},
 {'score': 0.020953776314854622,
  'token': 37920,
  'token_str': ' numerical',
  'sequence': 'In this notebook, we will learn numerical computation'}]

In [7]:
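# aggregation_strategy="simple" groups sub-word tokens into whole entities (it replaces the deprecated grouped_entities=True)
ner = pipeline("ner", aggregation_strategy="simple")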
ner("My name is Harris. I am a software engineer at Apple Inc.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cpu


[{'entity_group': 'PER',
  'score': 0.9988734,
  'word': 'Harris',
  'start': 11,
  'end': 17},
 {'entity_group': 'ORG',
  'score': 0.99913627,
  'word': 'Apple Inc',
  'start': 47,
  'end': 56}]

In [8]:
answerer = pipeline("question-answering")
answerer(question="Who am I?",
         context="The sky is blue and I am Aqiba.")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cpu


{'score': 0.9811325669288635, 'start': 25, 'end': 30, 'answer': 'Aqiba'}

In [9]:
summarizer = pipeline("summarization")
summarizer("""The Trojan War was a legendary conflict in Greek mythology
           that took place around the 12th or 13th century BC.
           The war was waged by the Achaeans (Greeks) against the city of
           Troy after Paris of Troy took Helen from her husband Menelaus,
           king of Sparta. The war is one of the most important events in Greek
           mythology, and it has been narrated through many works of Greek
           literature, most notably Homer's Iliad. The core of the Iliad
           (Books II – XXIII) describes a period of four days and two nights
           in the tenth year of the decade-long siege of Troy; the Odyssey
           describes the journey home of Odysseus, one of the war's heroes.
           Other parts of the war are described in a cycle of epic poems,
           which have survived through fragments. Episodes from the war provided
           material for Greek tragedy and other works of Greek literature,
           and for Roman poets including Virgil and Ovid. The ancient Greeks
           believed that Troy was located near the Dardanelles and that
           the Trojan War was a historical event of the 13th or 12th century
           BC. By the mid-19th century AD, both the war and the city were
           widely seen as non-historical, but in 1868, the German archaeologist
           Heinrich Schliemann met Frank Calvert, who convinced Schliemann
           that Troy was at what is now Hisarlık in modern-day Turkey. On
           the basis of excavations conducted by Schliemann and others,
           this claim is now accepted by most scholars.""")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cpu


[{'summary_text': ' The Trojan War was a legendary conflict in Greek mythology that took place around the 12th or 13th century BC . The war was waged by the Achaeans (Greeks) against the city of Troy after Paris of Troy took Helen from her husband Menelaus, the king of Sparta . The core of the Iliad describes a period of four days and two nights in the tenth year of the decade-long siege of Troy; the Odyssey describes the journey home of Odysseus .'}]

In [None]:
translator = pipeline("translation", model="abdulwaheed1/urdu_to_english_translation_mbart", src_lang="ur_PK", tgt_lang="en_XX")
translator("کیا نام ہے آپ کا؟")


Transformers are language models: they are trained on large amounts of raw text in a self-supervised fashion, without human-annotated labels.<br/>
This gives the model a statistical understanding of the language it was trained on, but that alone is not very useful for specific practical tasks. The pretrained model therefore goes through transfer learning, in which it is fine-tuned in a supervised way using human-annotated labels for a given task.<br/><br/>
One example of a pretraining task is predicting the next word in a sentence after reading the n previous words. This is called causal language modeling because the output depends on the past and present inputs, but not on future ones.<br/>
Another example is masked language modeling, in which the model predicts a masked word in the sentence.<br/>
Transformer models fall roughly into three types:
- GPT-like (also called auto-regressive Transformer models)
- BERT-like (also called auto-encoding Transformer models)
- BART/T5-like (also called sequence-to-sequence Transformer models)
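
To make the two pretraining objectives concrete, here is a minimal sketch in plain Python. The helper names `causal_lm_examples` and `masked_lm_example` are hypothetical and only illustrate how training targets could be derived from a token list; real models work on sub-word token IDs and mask tokens at random.

```python
def causal_lm_examples(tokens):
    """Return (context, next_token) pairs: each target depends only on earlier tokens."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_lm_example(tokens, mask_index, mask_token="[MASK]"):
    """Hide one token; the model sees context on both sides and must recover it."""
    masked = list(tokens)
    target = masked[mask_index]
    masked[mask_index] = mask_token
    return masked, target


tokens = ["my", "name", "is", "sylvain"]

# Causal language modeling: predict each token from the tokens before it.
for context, target in causal_lm_examples(tokens):
    print(context, "->", target)

# Masked language modeling: predict the hidden token from both sides.
print(masked_lm_example(tokens, mask_index=2))
```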

Pretraining is done on a large amount of data, starting from randomly initialised weights. It requires a lot of computing power, energy and money, and produces significant carbon emissions.

After pretraining, the model is fine-tuned on data specific to your task. The knowledge acquired during pretraining is 'transferred' during fine-tuning, so there is no need to train the model from scratch, which greatly reduces computation and cost.

# The Transformer Architecture

## 1. The Encoder
The encoder receives an input and builds a representation of its features. This means the model is optimised to acquire an understanding of the input.
Encoder-only models suit tasks such as sentence classification and named entity recognition.

## 2. The Decoder
The decoder uses the encoder's representation along with other inputs to generate a target sequence. This means the model is optimised for generating outputs.
Decoder-only models suit generative tasks such as text generation.

Encoder-decoder models, or sequence-to-sequence models, are good for generative tasks that require an input, such as translation or summarisation.

## Attention Layers

This layer tells the model to pay specific attention to certain words in the sentence when building the representation of each word. Attention was initially developed for translation: in some languages, translating one word requires paying attention to other words in the sentence to fully understand the context.

The Transformer architecture was originally designed for translation. During training, the encoder receives inputs (sentences) in a certain language, while the decoder receives the same sentences in the desired target language.

In the encoder, the attention layers can use all the words in a sentence (since, as we just saw, the translation of a given word can be dependent on what is after as well as before it in the sentence). The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated (so, only the words before the word currently being generated). For example, when we have predicted the first three words of the translated target, we give them to the decoder which then uses all the inputs of the encoder to try to predict the fourth word.
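
The difference between encoder-style and decoder-style attention can be sketched with scaled dot-product attention weights in plain Python. This is a simplified illustration, not the library's implementation: the vectors are made up, and the same vectors are used as queries and keys without any learned projections.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys, causal=False):
    """Row i holds how much token i attends to each token j."""
    d = len(keys[0])
    weights = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        if causal:
            # Decoder-style mask: position i must not see positions j > i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights.append(softmax(scores))
    return weights


# Hypothetical 2-dimensional vectors for a 3-token sentence.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

encoder_style = attention_weights(x, x)               # every token sees every token
decoder_style = attention_weights(x, x, causal=True)  # tokens see only themselves and the past

for row in decoder_style:
    print([round(w, 3) for w in row])
```

In the causal case, the weights above the diagonal are exactly zero, which is what prevents the decoder from looking at words it has not generated yet.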

- Architecture: This is the skeleton of the model — the definition of each layer and each operation that happens within the model.
- Checkpoints: These are the weights that will be loaded in a given architecture.<br/><br/>
For example, BERT is an architecture while bert-base-cased, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say “the BERT model” and “the bert-base-cased model.”

### Encoder Only Models
At each stage of the encoder, the attention layers can access all the words in the given sentence. These models have bidirectional attention and are often called auto-encoding models.
Encoder models are best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition (and more generally word classification), and extractive question answering.

Representatives of this family of models include:

- ALBERT
- BERT
- DistilBERT
- ELECTRA
- RoBERTa

The numerical representation of each token in the sentence is also called the feature vector (or feature tensor).

![feature vector](images/feature_vectors.png "Feature Vector")

The dimension of the vector is defined by the architecture of the model. For the base BERT model, it is 768. Through the self-attention mechanism, the vector captures the meaning of the word within the text, including the context to its left and right.

The two main characteristics are:
- Self-attention
- Bidirectional context

### Decoder-Only Models

They can be used for most of the same tasks as encoders, generally with a small loss of performance. Here again the words are converted into feature vectors. The distinction from the encoder is that the decoder uses masked self-attention: the words on one side of the current word (typically those to its right) are not included in its context, so each position has access to only one direction of the context. This is why decoders are good at causal language modelling.

In causal language modelling, the model starts from an initial word, for instance 'My', and computes its feature vector. A small transformation, called the language modelling head, maps this vector to a score for every word known by the model, and the word with the highest probability is selected. The new word, for instance "name", is then appended to the initial sequence and the process repeats. This is the autoregressive concept: past outputs are reused as inputs in the following steps. Generation can continue up to the model's context size. GPT-2, for example, has a context size of 1024 tokens, so it can only take the last 1024 tokens into account.
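
The autoregressive loop can be sketched as follows. This is a toy illustration under stated assumptions: `NEXT_TOKEN_SCORES` is a hypothetical lookup table standing in for a real model and its language modelling head, and the selection is greedy (always the highest score).

```python
# Hypothetical next-token scores standing in for a real language modelling head.
NEXT_TOKEN_SCORES = {
    "My": {"name": 0.9, "dog": 0.1},
    "name": {"is": 0.8, "was": 0.2},
    "is": {"Sylvain": 0.6, "unknown": 0.4},
}


def toy_lm_head(context):
    """Return scores for candidate next tokens, based on the last token of the context."""
    return NEXT_TOKEN_SCORES.get(context[-1], {})


def generate(prompt, max_new_tokens, context_size=4):
    """Greedy autoregressive generation: each output is fed back as input."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        context = tokens[-context_size:]  # the model only sees the most recent tokens
        scores = toy_lm_head(context)
        if not scores:  # the toy table has no continuation for this token
            break
        next_token = max(scores, key=scores.get)  # pick the highest-scoring token
        tokens.append(next_token)
    return tokens


print(generate(["My"], max_new_tokens=5))
```

The `context_size` slice mirrors the idea of a limited context window: tokens older than the window are simply not visible to the model.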

Some features of decoder-only models are:
- Unidirectional context
- Autoregressive
- Masked self-attention (cross-attention is only present when the decoder is paired with an encoder)

### Sequence-to-Sequence Models

T5 is an example. The outputs of the encoder (which hold the meaning of the input sequence) are passed to the decoder along with the decoder's usual inputs. The decoder uses the encoder's output to generate the first word. After that, the encoder does not need to be run again: its output is reused, and each generated word is appended to the decoder's inputs to produce the next one.
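
The control flow of an encoder-decoder model can be sketched with a toy example. Everything here is hypothetical (`TOY_DICTIONARY`, `encode`, `decode_step`); there is no learned model, only the pattern of encoding once and then decoding step by step.

```python
# Hypothetical word-level dictionary standing in for learned translation knowledge.
TOY_DICTIONARY = {"bonjour": "hello", "le": "the", "monde": "world"}


def encode(source_tokens):
    """Stand-in encoder: runs once and returns one 'representation' per source token."""
    return [TOY_DICTIONARY.get(tok, tok) for tok in source_tokens]


def decode_step(encoder_output, generated):
    """Stand-in decoder step: uses the encoder output and the tokens generated so far."""
    position = len(generated)
    if position >= len(encoder_output):
        return "</s>"  # end-of-sequence marker
    return encoder_output[position]


def translate(source_tokens, max_steps=10):
    encoder_output = encode(source_tokens)  # computed once, reused at every step
    generated = []
    for _ in range(max_steps):
        token = decode_step(encoder_output, generated)
        if token == "</s>":
            break
        generated.append(token)  # the output becomes part of the decoder's next input
    return generated


print(translate(["bonjour", "le", "monde"]))
```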

Sequence-to-sequence models are best suited for tasks revolving around generating new sentences depending on a given input, such as summarization, translation, or generative question answering.

# Model Bias

In [None]:
unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("He works as a [MASK].")
print([r['token_str'] for r in result])
result = unmasker("She works as a [MASK].")
print([r['token_str'] for r in result])

# The Transformers Library

- To make it easier to use the new models that are released every day, the Transformers library was created to provide a single API through which models can be loaded, trained and saved.
- It offers ease of use, simplicity (single-file definitions for a model, with no complicated abstraction layers) and flexibility (all models are simple PyTorch nn.Module or TensorFlow tf.keras.Model classes).

## What happens inside the pipeline function?

Let's focus on sentiment analysis.

The flow is as follows: <br/><br/>
Tokenizer -> Model -> Postprocessing <br/><br/>
The raw text is converted to numbers by the tokenizer. The numbers are sent to a model, which outputs logits. The post-processing steps convert these logits to predictions (labels and scores).

### Tokenization
1. The text is divided into tokens.
2. The tokenizer adds some special tokens if the model needs them, such as a [CLS] token at the beginning and a [SEP] token at the end of the sentence to classify.
3. The tokens are then mapped to their unique IDs by the tokenizer. We can do the same using the AutoTokenizer API provided by the Transformers library.
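
The three steps above can be sketched with a toy whitespace tokenizer. This is an illustration only: `TOY_VOCAB` and `toy_tokenize` are hypothetical, and real tokenizers split text into sub-word units using a vocabulary learned during pretraining.

```python
# Hypothetical vocabulary; real vocabularies contain tens of thousands of sub-word tokens.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "love": 5, "transformers": 6}


def toy_tokenize(text):
    # Step 1: split the text into tokens (here, lowercased whitespace-separated words).
    tokens = text.lower().split()
    # Step 2: add the special tokens the model expects.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 3: map each token to its ID, falling back to [UNK] for unknown words.
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    return tokens, ids


print(toy_tokenize("I love transformers"))
print(toy_tokenize("I love pizza"))  # "pizza" is not in the toy vocabulary
```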

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint) # it downloads and caches the vocabulary associated with the given checkpoint

raw_inputs = ["I've been waiting for someone like you my whole life.",
              "I don't believe any of you!"]


In [None]:
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
# padding=True pads the shorter sentence, since the two sentences do not have the same length.
# truncation=True truncates any sentence longer than the maximum length the model can handle.
# return_tensors="pt" tells the tokenizer to return PyTorch tensors.
inputs

Positions where `attention_mask` is 0 are padding: the shorter sentence is padded to the length of the longer one.
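
As a rough sketch of what padding does, the hypothetical helper `pad_batch` below pads lists of token IDs to a common length and builds the matching attention mask. The IDs are made up, and the pad ID of 0 is an assumption for this example.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and mark real tokens with 1, padding with 0."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two sequences of different lengths (hypothetical token IDs).
batch = pad_batch([[2, 4, 5, 3], [2, 4, 3]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask lets the model ignore the padded positions so that they do not influence the representation of the real tokens.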

### The Model

In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained(checkpoint)  # downloads and caches the model configuration and the pretrained weights
# However, the AutoModel API only instantiates the body of the model, which is the part
# that is left once the pretraining head is removed.
# It outputs a high-dimensional tensor that represents the sentences passed in, but is not directly
# useful for our classification problem.
outputs = model(**inputs)


In [None]:
outputs.last_hidden_state.shape

The output has shape (batch size, sequence length, hidden size). Here the last dimension, 768, is the hidden size of the model.

To get an output linked to our classification problem, we do the following.

In [None]:
from transformers import AutoModelForSequenceClassification
# It is like AutoModel class except that it will build a model with a classification head.
# There is one Auto class for each common NLP task in the transformers library.

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits)


Here, we get one value for each sentence and each possible label.
These values are logits (raw, unnormalised scores), not probabilities yet.

### Postprocessing

To convert logits to probabilities, we apply softmax to them. This transforms them to positive numbers that sum up to one.

In [None]:
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predictions

In [None]:
model.config.id2label

 ![Pipeline](images/pipeline.png)

In [None]:
predictions * 100
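
The whole post-processing step can be sketched in plain Python: apply softmax to each row of logits, pick the most likely class, and look up its label. The logits below are hypothetical values, not real model outputs, and the `id2label` mapping is assumed to follow the style of `model.config.id2label`.

```python
import math

# Hypothetical logits for two sentences and two labels (not real model outputs).
logits = [[-1.5, 1.6], [4.2, -3.4]]
# Assumed label mapping, in the style of model.config.id2label.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Turn raw scores into positive numbers that sum to one."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Convert each row of logits into a label and a score."""
    results = []
    for row in logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)  # index of the highest probability
        results.append({"label": id2label[best], "score": probs[best]})
    return results


for result in postprocess(logits, id2label):
    print(result)
```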

## Models

The `AutoModel` class allows you to instantiate a pretrained model from any checkpoint on the Hugging Face Hub. It picks the right model class from the library, instantiates the proper architecture, and loads the pretrained weights into it.

 ![Pipeline](images/model_files.png)