<div align="center">
<img src="https://poorit.in/image.png" alt="Poorit" width="40" style="vertical-align: middle;"> <b>AI SYSTEMS ENGINEERING 1</b>

## Unit 3: HuggingFace Pipelines & Tokenization

**CV Raman Global University, Bhubaneswar**  
*AI Center of Excellence*

---

</div>


### What You'll Learn

In this notebook, you will:

1. **Understand the HuggingFace ecosystem** — Hub, Transformers, Datasets, Spaces
2. **Use the Pipelines API** for text generation and sentiment analysis
3. **Understand why tokenization matters** — cost, context length, and performance
4. **Use HuggingFace tokenizers** to encode and decode text
5. **Compare tokenizers** across different models and languages

---

## 1. Environment Setup

In [None]:
!pip install -q transformers torch tokenizers

In [None]:
import torch
from transformers import pipeline, AutoTokenizer

# Pipelines take a device index: 0 is the first GPU, -1 means CPU.
device = 0 if torch.cuda.is_available() else -1

---

## 2. The HuggingFace Ecosystem

HuggingFace is one of the largest platforms for sharing and using open AI models.

| Component | Description |
|-----------|-------------|
| **Hub** | Repository of 500k+ models and datasets |
| **Transformers** | Library for working with transformer models |
| **Datasets** | Library for loading and processing datasets |
| **Spaces** | Platform for hosting ML demos |

The **Pipelines API** is the simplest way to use models — it handles tokenization, inference, and post-processing automatically.

---

## 3. Text Generation Pipeline

Let's start with text generation using GPT-2. The `gpt2` checkpoint is the smallest variant (about 124M parameters), so it downloads quickly and runs on a CPU.

In [None]:
generator = pipeline("text-generation", model="gpt2", device=device)

In [None]:
prompt = "Artificial intelligence is transforming"

result = generator(prompt, max_new_tokens=50, do_sample=True, temperature=0.7)

print(result[0]["generated_text"])

That's it: one line to load the model, one call to generate text. The pipeline handles tokenization, model inference, and decoding automatically. Because `do_sample=True`, the generated text will differ from run to run.
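The `temperature=0.7` argument in the call above controls how sharply peaked the next-token distribution is. The following is a minimal sketch of that idea in plain Python, assuming a made-up four-token vocabulary with invented scores (`toy_logits` is hypothetical, not GPT-2 output). Logits are divided by the temperature before the softmax, so values below 1 concentrate probability on the top token and values above 1 flatten the distribution.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize with a softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate next tokens (not real GPT-2 values).
toy_logits = [2.0, 1.0, 0.5, 0.1]

for t in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, temperature=t)
    print(f"temperature={t}: " + ", ".join(f"{p:.2f}" for p in probs))
```

Comparing the three printed rows shows the probability of the first token shrinking as the temperature rises.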

---

## 4. Sentiment Analysis Pipeline

Sentiment analysis classifies text as positive or negative. The pipeline uses a pretrained model (DistilBERT fine-tuned on SST-2) by default.

In [None]:
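# Naming the default checkpoint explicitly avoids the "no model was supplied" warning.
sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=device,
)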

In [None]:
texts = [
    "I love this course! It's really helping me understand AI.",
    "The weather today is terrible and I'm stuck indoors.",
    "The food was okay, nothing special."
]

results = sentiment(texts)

for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']} (confidence: {result['score']:.2f})\n")
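The `label` and `score` fields come from the pipeline's post-processing step. As a rough sketch of what that step does for a two-class model, the snippet below applies a softmax to a pair of raw class scores and reports the larger one. The logit values and the `postprocess` helper are made up for illustration and are not taken from the Transformers source. It also suggests why a lukewarm sentence still receives a POSITIVE or NEGATIVE label: the model has no neutral class to choose.

```python
import math

LABELS = ["NEGATIVE", "POSITIVE"]  # the two SST-2 classes

def postprocess(logits):
    """Turn raw class scores into a {'label', 'score'} dict via softmax."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}

# Hypothetical logits: a clearly positive input and a near-neutral one.
print(postprocess([-3.2, 3.5]))
print(postprocess([0.3, 0.1]))
```

A score close to 0.5 on a two-class model is a hint that the text is ambiguous, even though a label is always returned.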

---

## 5. Why Tokenization Matters

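LLMs don't process raw text. A tokenizer first splits the text into **tokens** (words or sub-word pieces), and each token is mapped to an integer ID that the model consumes.

Tokenization affects:
- **Cost**: hosted LLM APIs typically bill per token
- **Context length**: models have token limits (e.g., GPT-4 Turbo accepts up to 128k tokens)
- **Performance**: tokenizers handle some languages less efficiently than others, which uses up the context window faster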
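The first two points reduce to simple arithmetic on token counts. Here is a small sketch, assuming a hypothetical price of $0.01 per 1,000 tokens (not any real provider's rate) and using the 128k limit above as the example context window. The helper names `estimate_cost` and `fits_in_context` are introduced here for illustration.

```python
def estimate_cost(num_tokens, price_per_1k_tokens):
    """Cost of processing num_tokens at a given per-1,000-token price."""
    return num_tokens / 1000 * price_per_1k_tokens

def fits_in_context(prompt_tokens, max_new_tokens, context_limit):
    """A request fits only if prompt plus generated tokens stay within the limit."""
    return prompt_tokens + max_new_tokens <= context_limit

PRICE_PER_1K = 0.01      # hypothetical price in USD, not a real provider's rate
CONTEXT_LIMIT = 128_000  # example limit, matching the 128k figure above

prompt_tokens = 1_500
max_new_tokens = 500

cost = estimate_cost(prompt_tokens + max_new_tokens, PRICE_PER_1K)
print(f"Estimated cost: ${cost:.4f}")
print("Fits in context:", fits_in_context(prompt_tokens, max_new_tokens, CONTEXT_LIMIT))
```

Because both quantities scale with the token count, a text that tokenizes into twice as many tokens costs twice as much and fills the context window twice as fast.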

---

## 6. Basic Encoding & Decoding

**Encoding** converts text to token IDs. **Decoding** converts token IDs back to text.

In [None]:
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [None]:
text = "Hello, I am studying at CV Raman University!"

tokens = gpt2_tokenizer.encode(text)
print(f"Text: {text}")
print(f"Token IDs: {tokens}")
print(f"Number of tokens: {len(tokens)}")

In [None]:
decoded = gpt2_tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

In [None]:
token_strings = gpt2_tokenizer.convert_ids_to_tokens(tokens)

print("Token breakdown:")
for token_id, token_str in zip(tokens, token_strings):
    print(f"  {token_id:6d} → '{token_str}'")

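In the breakdown, `Ġ` marks a leading space, so a token such as `Ġstudying` is the word "studying" preceded by a space. Words that are rare in GPT-2's training data (names, for example) may instead be split into several sub-word pieces. This is how the tokenizer represents words that are not in its vocabulary as a whole.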
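The `Ġ` symbol comes from GPT-2's byte-level BPE, which remaps every byte to a printable character before any merges are applied. The sketch below rebuilds that byte-to-character table in plain Python. It is modeled on the `bytes_to_unicode` helper from the original GPT-2 code but rewritten here for illustration, so treat it as an approximation rather than the library's implementation. `to_byte_level` is a hypothetical helper name.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in keep}
    # Remaining bytes (space, control characters, ...) are shifted past U+0100.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping

BYTE_TO_CHAR = bytes_to_unicode()

def to_byte_level(text):
    """Spell text the way a byte-level BPE vocabulary stores it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

print(to_byte_level(" studying"))    # the leading space becomes Ġ
print(to_byte_level("Hello, I am"))  # every space is shown as Ġ
```

The space character is byte 32, which is not in the printable ranges, so it is shifted to U+0120, the character `Ġ`.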

---

## 7. Comparing Tokenizers

Different models tokenize the same text differently. Let's compare GPT-2 and BERT.

In [None]:
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(f"GPT-2 vocabulary size: {len(gpt2_tokenizer):,}")
print(f"BERT vocabulary size:  {len(bert_tokenizer):,}")

In [None]:
def compare_tokenizers(text, tokenizers):
    """Compare how different tokenizers handle the same text."""
    print(f"Text: {text}\n")

    for name, tokenizer in tokenizers.items():
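        # encode() adds special tokens when the tokenizer defines them,
        # so BERT's count includes [CLS] and [SEP]; GPT-2 adds none.
        tokens = tokenizer.encode(text)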
        token_strings = tokenizer.convert_ids_to_tokens(tokens)

        print(f"{name}:")
        print(f"  Tokens: {len(tokens)}")
        print(f"  Breakdown: {token_strings}")
        print()

In [None]:
tokenizers = {
    "GPT-2": gpt2_tokenizer,
    "BERT": bert_tokenizer
}

compare_tokenizers("Artificial Intelligence is transforming industries.", tokenizers)

In [None]:
compare_tokenizers("भारत एक महान देश है।", tokenizers)

**Key observations:**
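- English text is tokenized efficiently by both models
- Hindi text is handled far less efficiently, since both vocabularies were learned mostly from English text; GPT-2, for example, falls back to byte-level pieces and needs more than one token per character
- BERT's breakdown includes `[CLS]` and `[SEP]`, which `encode` adds automatically, so its count is 2 higher than the text alone requires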
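One reason for the gap, at least for GPT-2's byte-level vocabulary, is that Devanagari characters take three bytes each in UTF-8, and a vocabulary learned mostly from English text contains few merges that combine those bytes. A quick check using only the standard library (the `bytes_per_char` helper is introduced here for illustration):

```python
def bytes_per_char(text):
    """Average number of UTF-8 bytes per Unicode character."""
    return len(text.encode("utf-8")) / len(text)

english = "Artificial Intelligence is transforming industries."
hindi = "भारत एक महान देश है।"

for label, text in [("English", english), ("Hindi", hindi)]:
    n_bytes = len(text.encode("utf-8"))
    print(f"{label}: {len(text)} characters, {n_bytes} bytes, "
          f"{bytes_per_char(text):.1f} bytes/char")
```

In the worst case a byte-level tokenizer emits one token per byte, so the byte count is an upper bound on the number of tokens for that text.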

---

## 8. Tokenization Strategies

Different models use different algorithms to build their vocabulary:

| Strategy | Used By | Key Idea |
|----------|---------|----------|
| **BPE** (Byte Pair Encoding) | GPT-2, GPT-4 | Starts with characters (bytes, in GPT-2's case) and repeatedly merges the most frequent adjacent pair |
| **WordPiece** | BERT | Similar to BPE, but picks the merge that most increases the likelihood of the training data |
| **SentencePiece** | Llama 2, T5 | Language-agnostic toolkit that works directly on raw text (no pre-tokenization) and treats spaces as ordinary symbols |

All three approaches learn sub-word vocabularies — they split rare words into pieces while keeping common words whole. You don't need to choose a strategy yourself; it's baked into each model's tokenizer.
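To make the BPE row concrete, here is a toy sketch of the merge-learning loop in plain Python. It assumes a tiny hand-made word-frequency table (`toy_word_freqs` is invented for illustration) and works on characters rather than bytes, so it is a simplified teaching version and not GPT-2's actual tokenizer. Ties between equally frequent pairs are broken by whichever pair was counted first.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(corpus)
        if not counts:
            break  # every word is already a single symbol
        best = max(counts, key=counts.get)  # most frequent pair
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges, corpus

# Invented word frequencies for illustration only.
toy_word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

merges, corpus = learn_bpe(toy_word_freqs, num_merges=3)
print("Merges:", merges)
for symbols, freq in corpus.items():
    print(f"{freq} x {list(symbols)}")
```

After three merges the frequent ending `est` has become a single symbol, while the rarer `lower` is still spelled from smaller pieces, which is the behavior described above: common sequences stay whole and rare words are split.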

---

## 9. Exercise

**Step 1:** Run sentiment analysis on 3 sentences of your choice.  
**Step 2:** Tokenize each sentence with GPT-2 and count the tokens. Which is the most token-efficient?

In [None]:
# Step 1: Sentiment analysis on your own sentences

my_texts = [
    # "your first sentence here",
    # "your second sentence here",
    # "your third sentence here",
]

results = sentiment(my_texts)

for text, result in zip(my_texts, results):
    print(f"{text}")
    print(f"  → {result['label']} ({result['score']:.2f})\n")

In [None]:
# Step 2: Count tokens for each sentence

for text in my_texts:
    tokens = gpt2_tokenizer.encode(text)
    print(f"{text}")
    print(f"  → {len(tokens)} tokens ({len(text)/len(tokens):.1f} chars/token)\n")

---

## Key Takeaways

1. **Pipelines abstract complexity** — tokenization, inference, and post-processing are handled automatically

2. **Many tasks supported** — text generation, classification, NER, zero-shot classification, and more

3. **Tokenizers are model-specific** — always use the matching tokenizer for a model

4. **Token counts vary by content** — non-English text often uses more tokens

5. **Sub-word tokenization** — rare words are split into pieces, common words stay whole

### Common Pipeline Tasks

| Task | Pipeline Name | Example Model |
|------|--------------|---------------|
| Text Generation | `text-generation` | gpt2, llama |
| Classification | `sentiment-analysis` (alias of `text-classification`) | distilbert |
| Named Entities | `ner` (alias of `token-classification`) | bert |
| Zero-Shot Classification | `zero-shot-classification` | bart-large-mnli |

### What's Next?

In the next notebook, we'll **build a multi-tool text analyzer app** using Gradio and the HuggingFace pipelines you learned here — sentiment analysis, named entity recognition, zero-shot classification, and token counting, all in one interface.

---

## Additional Resources

- [HuggingFace Hub](https://huggingface.co/models)
- [Transformers Documentation](https://huggingface.co/docs/transformers)
- [Pipeline Tutorial](https://huggingface.co/docs/transformers/pipeline_tutorial)
- [Tokenizers Documentation](https://huggingface.co/docs/tokenizers)
- [Understanding Tokenization](https://huggingface.co/docs/transformers/tokenizer_summary)

---

**Course Information:**
- **Institution:** CV Raman Global University, Bhubaneswar
- **Program:** AI Center of Excellence
- **Course:** AI Systems Engineering 1
- **Developed by:** [Poorit Technologies](https://poorit.in) — *Transform Graduates into Industry-Ready Professionals*

---