## Behind the pipeline

There are 3 stages in the pipeline:

* Tokenizer
* Model
* Post-processing


*Process*

**Raw text** ====> **Numbers [input IDs]** ====> **Outputs [logits]** ====> **Predictions**


### Tokenization

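Tokenization has several steps:

1. The text is split into small chunks called tokens.
2. Special tokens are added (for example, markers for the start and end of a sequence).
3. Each token is mapped to its integer ID in the model's vocabulary.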
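The three tokenization steps can be sketched with a toy tokenizer. This is only an illustration: the vocabulary, its IDs, and the `toy_tokenize` helper are made up for this example, and real tokenizers use subword algorithms rather than a plain word split.

```python
import re

# Hypothetical toy vocabulary; a real model ships its own vocabulary file.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9,
}


def toy_tokenize(text):
    """Return (tokens, ids) for a sentence using the toy vocabulary."""
    # Step 1: lowercase, then split into word and punctuation tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Step 2: add special tokens marking the start and end of the sequence.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 3: map each token to its ID, falling back to [UNK] if unknown.
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    return tokens, ids


tokens, ids = toy_tokenize("I hate this so much!")
print(tokens)
print(ids)
```

Words missing from the toy vocabulary fall back to `[UNK]`; a subword tokenizer would instead break them into smaller known pieces.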



In [2]:
# install the required dependencies in the virtual environment venv

# install tensorflow

!pip install tensorflow
!pip freeze > requirements.txt
!cat requirements.txt


Collecting tensorflow
  Using cached tensorflow-2.17.0-cp312-cp312-macosx_12_0_arm64.whl.metadata (4.1 kB)
Using cached tensorflow-2.17.0-cp312-cp312-macosx_12_0_arm64.whl (236.3 MB)
Installing collected packages: tensorflow
Successfully installed tensorflow-2.17.0

absl-py==2.1.0
anyio==4.4.0
appnope==0.1.4
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
astunparse==1.6.3
async-lru==2.0.4
attrs==23.2.0
Babel==2.15.0
beautifulsoup4==4.12.3
bleach==6.1.0
certifi==2024.7.4
cffi==1.16.0
charset-normalizer==3.3.2
comm==0.2.2
debugpy==1.8.2
decorator==5.1.1
defusedxml==0.7.1
executing==2.0.1
fastjsonschema==2.20.0
filelock==3.15.4
flatbuffers==24.3.25
fqdn==1.5.1
fsspec==2024.6.1
gast==0.6.0
google-pasta==0.2.0
grp

In [6]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print("hello")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.



hello


[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

## Pipeline process explained

The pipeline groups together three steps:
* Tokenizer
* Model
* Post-processing
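Post-processing is the least visible of the three steps, so here is a minimal sketch of what it could look like for a two-class sentiment model: the raw logits are turned into probabilities with a softmax, and the highest one is reported with its label. The logits and the `ID2LABEL` mapping are assumptions made for illustration; on a real model the mapping is available as `model.config.id2label`.

```python
import math

# Assumed label mapping for illustration; check model.config.id2label
# on the actual checkpoint.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(batch_logits):
    """Return the most likely label and its probability for each row."""
    results = []
    for logits in batch_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append({"label": ID2LABEL[best], "score": probs[best]})
    return results


# Hypothetical logits, one row per input sentence.
example_logits = [[-1.56, 1.61], [4.17, -3.35]]
predictions = postprocess(example_logits)
for prediction in predictions:
    print(prediction)
```

The output has the same shape as the pipeline result: one dictionary per sentence, with a `label` and a `score` between 0 and 1.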

### Preprocessing with a tokenizer

1. Text input is converted into numbers that the model can make sense of. This is the job of the tokenizer, which is responsible for splitting the input into tokens, mapping each token to an integer, and adding any additional inputs that may be useful to the model.

2. Preprocessing must be done in exactly the same way as when the model was pretrained. To achieve this, use the `AutoTokenizer` class and its `from_pretrained()` method. Given the checkpoint name of the model, it automatically fetches the data associated with the model's tokenizer and caches it, so that it is downloaded only once.

In [9]:
# using the autotokenizer class

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


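3. Once the tokenizer is created as above, sentences can be passed to it directly. The output is a dictionary that is ready to feed to the model.
4. The lists of input IDs are converted to tensors. The `return_tensors` argument selects the tensor type (`"tf"` for TensorFlow in the example that follows).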

In [12]:
# example

raw_inputs = [
    "i've been waiting for this course my whole life",
    "I hate this so much!",
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="tf")
print(inputs)

{'input_ids': <tf.Tensor: shape=(2, 13), dtype=int32, numpy=
array([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 2023, 2607, 2026, 2878,
        2166,  102],
       [ 101, 1045, 5223, 2023, 2061, 2172,  999,  102,    0,    0,    0,
           0,    0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 13), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]], dtype=int32)>}
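The zeros at the end of the second row come from padding: both sentences must have the same length to fit into one rectangular tensor, and the attention mask tells the model which positions are real tokens. A minimal pure-Python sketch of that step, assuming a padding ID of 0 (as the output above suggests) and using a hypothetical `pad_batch` helper, could look like this:

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every ID list to the longest one and build the attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        # Real tokens keep their IDs; padding positions get pad_id.
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# The unpadded ID lists taken from the tokenizer output above.
batch = pad_batch([
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 2023, 2607, 2026, 2878, 2166, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The real tokenizer does this internally when `padding=True` is passed; the sketch only mirrors the visible result.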


### Going through the model

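We can download the pretrained model in the same way: Transformers provides a `TFAutoModel` class, which also has a `from_pretrained()` method. `TFAutoModel` loads only the base Transformer without the classification head, which is why the loading messages report the `classifier` and `pre_classifier` weights as unused.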


In [14]:
from transformers import TFAutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModel.from_pretrained(checkpoint)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.
