## Behind the pipeline

There are 3 stages in the pipeline:

* Tokenizer
* Model
* PostProcessing


*Process*

**Raw Text** ====> **Numbers [Input IDs]** ====> **Outputs [Logits]** ====> **Predictions**


### Tokenization

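Tokenization has several steps:

1. The text is split into small chunks called tokens.
2. Special tokens are added (for example, markers for the start and end of the input).
3. Each token is mapped to an integer ID from the model's vocabulary.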
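
The three steps can be sketched in plain Python. This is only a toy: `toy_tokenize` and `TOY_VOCAB` are made-up names, the splitting rule is a simple regular expression rather than the subword algorithm a real tokenizer uses, and the vocabulary holds only the few IDs that appear in the tokenizer output further down in these notes.

```python
import re

# Toy vocabulary (assumption: only the entries needed for one sentence).
# The IDs are copied from the real tokenizer output shown later in these notes.
TOY_VOCAB = {
    "[CLS]": 101,
    "[SEP]": 102,
    "i": 1045,
    "hate": 5223,
    "this": 2023,
    "so": 2061,
    "much": 2172,
    "!": 999,
}


def toy_tokenize(text):
    # Step 1: lowercase and split into words and punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Step 2: add the special tokens that mark the start and end of the input.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 3: map each token to its integer ID in the vocabulary.
    input_ids = [TOY_VOCAB[token] for token in tokens]
    return tokens, input_ids


tokens, input_ids = toy_tokenize("I hate this so much!")
print(tokens)
print(input_ids)
```

The printed IDs can be compared with the second row of `input_ids` in the real tokenizer output later on (ignoring the trailing padding zeros).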



In [2]:
# install the required dependencies in the virtual environment venv

# install tensorflow

!pip install tensorflow
!pip freeze > requirements.txt
!cat requirements.txt


Collecting tensorflow
  Using cached tensorflow-2.17.0-cp312-cp312-macosx_12_0_arm64.whl.metadata (4.1 kB)
Using cached tensorflow-2.17.0-cp312-cp312-macosx_12_0_arm64.whl (236.3 MB)
Installing collected packages: tensorflow
Successfully installed tensorflow-2.17.0

[notice] A new release of pip is available: 24.0 -> 24.1.2
[notice] To update, run: pip install --upgrade pip
absl-py==2.1.0
anyio==4.4.0
appnope==0.1.4
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
astunparse==1.6.3
async-lru==2.0.4
attrs==23.2.0
Babel==2.15.0
beautifulsoup4==4.12.3
bleach==6.1.0
certifi==2024.7.4
cffi==1.16.0
charset-normalizer==3.3.2
comm==0.2.2
debugpy==1.8.2
decorator==5.1.1
defusedxml==0.7.1
executing==2.0.1
fastjsonschema==2.20.0
filelock==3.15.4
flatbuffers==24.3.25
fqdn==1.5.1
fsspec==2024.6.1
gast==0.6.0
google-pasta==0.2.0
grp

In [7]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print("hello")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


hello


[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

## Pipeline process explained

The pipeline groups together three steps:
* Tokenizer
* Model
* Post-processing

### Preprocessing with a tokenizer

1. The text input is converted into numbers that the model can make sense of. This is done by a tokenizer, which is responsible for splitting the input into tokens, mapping each token to an integer, and adding any additional inputs that may be useful to the model.

2. Preprocessing needs to be done in exactly the same way as when the model was pretrained. To achieve this, the `AutoTokenizer` class and its `from_pretrained()` method are used. Given the checkpoint name of the model, it automatically fetches the data associated with the model's tokenizer and caches it, so that it is only downloaded once.

In [8]:
# using the autotokenizer class

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


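3. Once the tokenizer is created as above, sentences can be passed to it directly. The output is a dictionary that is ready to feed to the model.
4. The lists of input IDs are converted to tensors. The `return_tensors` argument selects the tensor type (`"tf"` for TensorFlow here).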

In [9]:
# example

raw_inputs = [
    "i've been waiting for this course my whole life",
    "I hate this so much!",
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="tf")
print(inputs)

{'input_ids': <tf.Tensor: shape=(2, 13), dtype=int32, numpy=
array([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 2023, 2607, 2026, 2878,
        2166,  102],
       [ 101, 1045, 5223, 2023, 2061, 2172,  999,  102,    0,    0,    0,
           0,    0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 13), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]], dtype=int32)>}
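
The zeros at the end of the second row and the matching zeros in `attention_mask` come from padding: both sentences must have the same length to fit in one tensor. A rough sketch of that idea in plain Python follows. `pad_batch` is a made-up helper, not part of the library, and the padding ID of 0 is read off the output above.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every list of IDs to the length of the longest one (toy helper)."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        # Real tokens are kept, padding IDs are appended on the right.
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# The two unpadded ID lists, copied from the tokenizer output above.
batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 2023, 2607, 2026, 2878, 2166, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]

padded_ids, mask = pad_batch(batch)
for ids_row, mask_row in zip(padded_ids, mask):
    print(ids_row)
    print(mask_row)
```

Both rows end up with 13 entries, which is where the `(2, 13)` shape in the output comes from.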


**Going through the model**

We can download the pretrained model the same way we did the tokenizer: Transformers provides a `TFAutoModel` class, which also has a `from_pretrained()` method.


In [10]:
from transformers import TFAutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModel.from_pretrained(checkpoint)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


A few things to note about the code above:

* We downloaded the same checkpoint used in the pipeline before and instantiated a model with it.
* This architecture contains only the base transformer module: given some inputs, it outputs what are called `hidden states`, also known as `features`.
* Hidden states (features) are usually the inputs to another part of the model, known as the head.


**A high-dimensional vector?**
***context: beginner***
*meaning of terms*:

* Vector: think of a list in Python or an array in Java, i.e. a list of numbers.
* High-dimensional vector: a long list of numbers.
* Transformer module: a type of machine learning model used for processing sequential data, such as sentences in a language. It takes the input and transforms it into a numerical representation the machine can work with.

When we say the vector output by the transformer module is "high-dimensional", we are talking about the size of the data. The output usually has 3 dimensions:

* **Batch size**: the number of sequences processed at a time
* **Sequence length**: the length of the numerical representation of each sequence
* **Hidden size**: the dimension of the vector (refer to the list analogy) produced for each model input


In [11]:
# We can see this by feeding the inputs we preprocessed to the model
outputs = model(inputs)
print(outputs.last_hidden_state.shape)

(2, 13, 768)
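
To connect the printed shape with the list analogy, here is a small sketch that builds a nested Python list with the same three dimensions. The values are placeholder zeros, not real hidden states.

```python
batch_size, sequence_length, hidden_size = 2, 13, 768

# One list of `hidden_size` numbers per token, one list of tokens per sentence.
hidden_states = [
    [[0.0] * hidden_size for _ in range(sequence_length)]
    for _ in range(batch_size)
]

shape = (len(hidden_states), len(hidden_states[0]), len(hidden_states[0][0]))
print(shape)  # (2, 13, 768)

# Total count of numbers the model produced for this small batch.
total = batch_size * sequence_length * hidden_size
print(total)  # 19968
```

So even for two short sentences the model outputs almost twenty thousand numbers, which is why a head is needed to reduce them to something readable.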


Note that the outputs of Hugging Face Transformers models behave like named tuples or dictionaries: an element can be accessed by attribute (as done above), by key, or by index.
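
A small sketch of what these access styles look like. `ToyModelOutput` is a stand-in class written just for these notes, not the class the library actually returns, and the stored value is a placeholder string instead of a tensor.

```python
class ToyModelOutput:
    """Toy container that can be read by attribute, by key, or by index."""

    def __init__(self, **fields):
        # Plain dicts keep insertion order, which gives us index access.
        self._fields = dict(fields)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        fields = self.__dict__.get("_fields", {})
        if name in fields:
            return fields[name]
        raise AttributeError(name)

    def __getitem__(self, key):
        if isinstance(key, int):
            return list(self._fields.values())[key]
        return self._fields[key]


toy_outputs = ToyModelOutput(last_hidden_state="placeholder for a (2, 13, 768) tensor")

by_attribute = toy_outputs.last_hidden_state
by_key = toy_outputs["last_hidden_state"]
by_index = toy_outputs[0]

# All three styles return the same object.
print(by_attribute is by_key is by_index)  # True
```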

### Model Heads: Making sense out of numbers

***What is a model head?***

The head is the part of the model that takes the final hidden states produced by the transformer and turns them into an output for a specific task. Loosely speaking, it is where the numbers get interpreted.

***Functions of model heads, i.e. what do they do?***

1. They take high-dimensional vectors (think of long Python lists) and project them onto a different dimension, for example down to one number per label. They do so using simple mathematical layers called linear layers.
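
A linear layer is just a weighted sum plus a bias for each output. The sketch below projects a 4-number vector down to 2 scores. The weights, bias and input are invented for illustration (a real hidden state here has 768 numbers and the weights are learned), and a real head may stack more than one such layer.

```python
def linear(vector, weights, bias):
    # One output per row of `weights`: dot(row, vector) + bias.
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


# Made-up hidden state with 4 numbers, standing in for 768.
hidden_state = [0.5, -1.0, 2.0, 0.0]

# Made-up weights: one row per label, so 2 rows of 4 numbers each.
weights = [
    [1.0, 0.0, -1.0, 0.5],   # row producing the score for label 0
    [-1.0, 0.5, 1.0, 0.0],   # row producing the score for label 1
]
bias = [0.1, -0.1]

logits = linear(hidden_state, weights, bias)
print([round(x, 2) for x in logits])  # [-1.4, 0.9]
```

The input had 4 numbers and the output has 2, one per label. That change of size is what "projecting onto a different dimension" means here.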

***How does this work within the transformer model?***

1. **Embedding layer**: the first part of the transformer model. It converts each input token into a vector, a list of numbers that represents that token.
2. **Subsequent layers**: they use the attention mechanism to manipulate those vectors and capture the relationships between the tokens. This produces the final representation of the sentence.
3. **Model head**: after the transformer has processed the data, its output is sent to the model head, which processes it to perform a specific task.

***Different Tasks for Different Heads***

There are many types of model heads, each designed for a specific task. Here are some examples:

* **Model**: just retrieves the hidden states (vectors) without any specific task
* **ForCausalLM**: used for generating text
* **ForMaskedLM**: used for filling in missing words in a sentence
* **ForMultipleChoice**: used for answering multiple-choice questions
* **ForQuestionAnswering**: used for answering questions based on a text
* **ForSequenceClassification**: used for classifying entire sentences (like determining if a sentence is positive or negative)
* **ForTokenClassification**: used for classifying individual words in a sentence


For our example, we need a model with a sequence classification head (to be able to classify each sentence as positive or negative). So we won't use the `TFAutoModel` class, but `TFAutoModelForSequenceClassification`, as shown below:


In [12]:
from transformers import TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(inputs)

# check the shape of the data
print(outputs.logits.shape)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


(2, 2)


Since we have just two sentences and two labels, the result we get from the model is of shape 2 x 2.



### Postprocessing the Output

The values we get as output from the model don't necessarily make sense by themselves.

**Context**
***Raw model outputs (logits)***

The raw outputs of a model are called `logits`. They are unprocessed scores, for example:

In [13]:
print(outputs.logits)

tf.Tensor(
[[-2.6304553  2.6135018]
 [ 4.1692314 -3.3464472]], shape=(2, 2), dtype=float32)


The above outputs mean:

* for the first sentence, the model gave scores of [-2.6305, 2.6135]
* for the second sentence, the model gave scores of [4.1692, -3.3464]

***What are logits?***
Logits are the raw, unnormalized scores output by the last layer of the model. They are not probabilities yet. To convert them into probabilities, or something we can interpret, we need to process them further.

To convert logits to probabilities, we use a function called `softmax`. This function turns the raw scores into values between 0 and 1 that add up to 1, like percentages that add up to 100%.
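
To see what softmax does to the numbers, here is a plain-Python sketch applied to the logits printed above. The `softmax` function below is written by hand for illustration. It is not the TensorFlow implementation, so the last decimals may differ slightly from what the library returns.

```python
import math


def softmax(row):
    # Subtracting the largest value avoids overflow and does not change the result.
    largest = max(row)
    exps = [math.exp(x - largest) for x in row]
    total = sum(exps)
    # Each score becomes its share of the total, so the shares add up to 1.
    return [e / total for e in exps]


# Logits copied from the model output above.
logits = [
    [-2.6304553, 2.6135018],
    [4.1692314, -3.3464472],
]

probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([f"{p:.4f}" for p in row], "sum =", round(sum(row), 6))
```

The larger logit in each row ends up with almost all of the probability, because softmax exponentiates the scores before normalizing them.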

Below is the example:

In [15]:
import tensorflow as tf

predictions = tf.math.softmax(outputs.logits, axis=-1)
print(predictions)

tf.Tensor(
[[5.2515999e-03 9.9474841e-01]
 [9.9945587e-01 5.4418424e-04]], shape=(2, 2), dtype=float32)


The above output means:

* the model predicted [0.00525 (0.525%), 0.995 (99.5%)] for the first sentence and [0.9995 (99.95%), 0.0005 (0.05%)] for the second one.

  *The values are printed in scientific notation: the exponent at the end tells you how far to shift the decimal point, e.g. `5.2515999e-03` means 0.0052515999, or about 0.00525.*

To get the labels corresponding to each position, we can inspect the `id2label` attribute of the model config:

In [16]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
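
Putting the last two outputs together gives the final prediction: for each sentence, pick the position with the highest probability and look up its label. This sketch uses plain Python lists with the values copied from the outputs above. The `results` format mimics what the pipeline printed at the start, but it is assembled by hand here.

```python
# Mapping copied from model.config.id2label above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities copied from the softmax output above.
predictions = [
    [5.2515999e-03, 9.9474841e-01],
    [9.9945587e-01, 5.4418424e-04],
]

results = []
for probs in predictions:
    # Index of the largest probability in this row (a hand-written argmax).
    best = max(range(len(probs)), key=lambda i: probs[i])
    results.append({"label": id2label[best], "score": probs[best]})

for result in results:
    print(result)
```

The first sentence comes out as POSITIVE and the second as NEGATIVE, which is the same kind of answer the `pipeline` call produced in one step.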