### 🧩 Components of a Hugging Face Pipeline
A pipeline in Hugging Face is essentially made up of three core parts:

**Tokenizer** – Prepares input text for the model.

**Model** – Performs the task (e.g., classification, translation).

**Pipeline Logic** – Ties tokenizer + model together and applies task-specific behavior.

Let’s explore each one with a real example (`sentiment analysis`):

#### Step-by-step Breakdown
1. Load Tokenizer: Tokenizers break text into tokens and convert them to IDs the model understands.

AutoTokenizer automatically loads the correct tokenizer based on the model name.

In our case the model is `distilbert-base-uncased-finetuned-sst-2-english`.

`from_pretrained(...)` downloads the tokenizer configuration and vocabulary for this specific model.


In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

2. Load Model: This is the trained neural network that produces the prediction. Here it is a DistilBERT checkpoint fine-tuned on SST-2 for sentiment classification.

In [6]:
from transformers import AutoModelForSequenceClassification

# AutoModelForSequenceClassification loads a pre-trained model with a
# sequence-classification head, i.e. one designed for classifying sequences of text.
# In our case the checkpoint is "distilbert-base-uncased-finetuned-sst-2-english".
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

3. Build Pipeline Manually: You can combine tokenizer and model using the `pipeline` function:

In [9]:
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

Device set to use cpu


In [10]:
result = sentiment_pipeline("I love hugging face")
print(result)

[{'label': 'POSITIVE', 'score': 0.9998455047607422}]
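
To show what the "pipeline logic" adds on top of the tokenizer and model, here is a minimal pure-Python sketch of the post-processing step, assuming the model returns one raw score (logit) per class. The logits are made up for illustration, and `ID2LABEL` and `postprocess` are hypothetical names; in the real library the label map comes from the model's configuration.

```python
import math

# Assumed label map for illustration; the real one lives in the model's config.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits):
    """Pick the most probable class and report it in pipeline-style format."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


# Made-up logits standing in for the model's output on one sentence.
result = postprocess([-4.2, 4.6])
print(result)
```

The result has the same shape as the pipeline output above: a label plus a score between 0 and 1.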


## Tokenizer internals (how tokenization actually works)

#### What is Tokenization?
Tokenization is the process of converting text into pieces (called tokens) that can be converted into numbers and fed to a model.

Input: "Hugging Face is awesome!"

Tokens: ["hugging", "face", "is", "awesome", "!"]


### Types of Tokenizers
There are 4 main types of tokenizers you’ll encounter in Hugging Face:

| Type                         | Example         | Example Split (illustrative)              |
| ---------------------------- | --------------- | ----------------------------------------- |
| **Whitespace**               | basic tokenizer | \["Hello", "world"]                       |
| **WordPiece**                | BERT            | \["play", "##ing"]                        |
| **Byte Pair Encoding (BPE)** | RoBERTa, GPT    | \["hug", "ging", "face"]                  |
| **Unigram**                  | XLNet           | \["hugging", "face", "is", "aw", "esome"] |
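
To make the BPE row concrete, here is a minimal sketch of how a list of learned merge rules could be applied to a single word. The merge list is hypothetical and hand-picked so that "hugging" splits as in the table; real tokenizers learn thousands of merges from a corpus, and `bpe_encode` is an illustrative helper, not a library function.

```python
def bpe_encode(word, merges):
    """Apply ordered BPE merge rules to one word, starting from characters."""
    ranks = {pair: i for i, pair in enumerate(merges)}  # lower rank = learned earlier
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules, listed in the order they would have been learned.
TOY_MERGES = [("h", "u"), ("hu", "g"), ("i", "n"), ("in", "g"), ("g", "ing")]

print(bpe_encode("hugging", TOY_MERGES))  # ['hug', 'ging']
```

A word with no applicable merges simply stays split into characters, which is why BPE never needs an "unknown word" fallback at the character level.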


These tokenizers help in:

* Handling unknown words by splitting them into known pieces (for example, unseenword could become un, ##seen, ##word, depending on the vocabulary)
* Reducing vocabulary size
* Improving model efficiency
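
The unknown-word handling above can be sketched with a greedy longest-match-first split, which is the idea behind WordPiece. This is a simplified illustration: `TOY_VOCAB` is a hypothetical six-entry vocabulary and `wordpiece` is not the library's implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # try a shorter piece
        if match is None:
            return [unk]  # nothing fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for the examples.
TOY_VOCAB = {"play", "##ing", "un", "##seen", "##word", "hugging"}

print(wordpiece("playing", TOY_VOCAB))     # ['play', '##ing']
print(wordpiece("unseenword", TOY_VOCAB))  # ['un', '##seen', '##word']
print(wordpiece("xyz", TOY_VOCAB))         # ['[UNK]']
```

Because rare words are assembled from shared pieces, the vocabulary can stay small while still covering most inputs.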




In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hugging Face is awesome!"
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)


Tokens: ['hugging', 'face', 'is', 'awesome', '!']
Token IDs: [17662, 2227, 2003, 12476, 999]


Behind the scenes:

* Text is lowercased, because `bert-base-uncased` is an uncased model
* Punctuation is split off into its own token
* Words like hugging are already in the vocabulary, so they are not broken into subwords



#### What Happens Internally?
Here’s the pipeline:

1. Normalization and pre-tokenization: Clean and lowercase the text, then split it on whitespace and punctuation.
2. Tokenization: Break each word into subwords based on the model’s vocabulary.
3. Mapping to IDs: Use the vocabulary dictionary to convert tokens to integers.
4. Truncation and special tokens: Trim sequences that exceed the maximum length, then add [CLS] at the start and [SEP] at the end (used by BERT, etc.).
5. Padding: Add [PAD] to shorter sequences so that every sequence in a batch has the same length.
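
The stages above can be sketched end to end in plain Python for a single sentence. This is a toy under stated assumptions: `TOY_VOCAB` holds only the IDs that appear in this section's outputs (the `[UNK]` ID of 100 is an assumption), `encode` is a hypothetical helper, and subword splitting is skipped because every word here is already in the vocabulary. Special tokens are added after truncation and before any padding, which matches the `[CLS] ... [SEP]` followed by zeros seen in padded output.

```python
import re

# Toy vocabulary; IDs are copied from the bert-base-uncased outputs in this section.
# The [UNK] ID is an assumption added for illustration.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "hugging": 17662, "face": 2227, "is": 2003, "awesome": 12476, "!": 999,
}


def encode(text, vocab, max_length=None):
    """Normalize, split, map to IDs, truncate, then add special tokens."""
    normalized = text.lower()                             # normalization
    words = re.findall(r"\w+|[^\w\s]", normalized)        # split on spaces/punctuation
    ids = [vocab.get(w, vocab["[UNK]"]) for w in words]   # map tokens to IDs
    if max_length is not None:
        ids = ids[: max(max_length - 2, 0)]               # leave room for [CLS] and [SEP]
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]      # add special tokens


print(encode("Hugging Face is awesome!", TOY_VOCAB))
# [101, 17662, 2227, 2003, 12476, 999, 102]
```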

In [13]:
# Example: Full Tokenizer Output

output = tokenizer("Hugging Face is awesome!", return_tensors="pt")
print(output)

{'input_ids': tensor([[  101, 17662,  2227,  2003, 12476,   999,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}


Where:

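* input_ids: The token IDs, including 101 for [CLS] at the start and 102 for [SEP] at the end
* attention_mask: Which tokens are real (1) vs. padding (0)
* (optionally) token_type_ids: Which segment each token belongs to; used in sentence-pair tasks like QA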
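
To see why `token_type_ids` matters for tasks like QA, here is a minimal sketch of how a BERT-style sentence pair could be laid out: the first segment (with [CLS] and the first [SEP]) gets 0s and the second segment (with the final [SEP]) gets 1s. The helper `encode_pair` is hypothetical, and the word IDs are taken from the outputs in this section.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs seen in the tokenizer output


def encode_pair(ids_a, ids_b):
    """Lay out two already-tokenized segments as [CLS] A [SEP] B [SEP]."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], A and the first [SEP]; segment 1 covers B and the last [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)  # no padding here, so every token is real
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# "hello world" as segment A, "how are you today?" as segment B.
pair = encode_pair([7592, 2088], [2129, 2024, 2017, 2651, 1029])
print(pair["token_type_ids"])  # [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```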

#### Summary

* Hugging Face tokenizers use subword tokenization for flexibility and compact vocabularies.
* They convert raw text → tokens → IDs → tensors.
* You can inspect tokenization using .tokenize() and .convert_tokens_to_ids().

#### Example: Attention Mask and Padding

In [15]:
inputs = tokenizer(["Hello world", "How are you today?"], padding=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[ 101, 7592, 2088,  102,    0,    0,    0],
        [ 101, 2129, 2024, 2017, 2651, 1029,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1]])}


The output contains:

* input_ids: tokenized + padded sequences
* attention_mask: 1s for real tokens, 0s for padding
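
The relationship between padding and the attention mask can be reproduced in a few lines of plain Python. This is a sketch, assuming right-side padding with a pad ID of 0 as in the output above; `pad_batch` is a hypothetical helper, not part of the library.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build matching masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs for "Hello world" and "How are you today?" from the output above.
batch = pad_batch([
    [101, 7592, 2088, 102],
    [101, 2129, 2024, 2017, 2651, 1029, 102],
])
print(batch["input_ids"][0])       # [101, 7592, 2088, 102, 0, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 1, 0, 0, 0]
```

The mask lets the model ignore the padded positions, so the shorter sentence is not affected by the extra zeros.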