🔥 First Concept: NLP - Tokenization

💡 What is it?
Tokenization = Breaking down text into smaller pieces (words, subwords, or characters) so that models can understand and process them.

🧠 Why it matters?
Models can't read raw text. The text is split into tokens (like "Hello world" → ['Hello', 'world']), each token is mapped to an integer ID, and each ID is looked up as a vector of numbers (an embedding).
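Here is a minimal sketch of that text → tokens → IDs → vectors chain. The vocabulary, the `[UNK]` fallback entry, and the embedding values are all made up for illustration; a real model learns a vocabulary of tens of thousands of tokens and much longer vectors.

```python
# Toy vocabulary and embedding table; every value here is made up for illustration.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
embeddings = [
    [0.0, 0.0, 0.0],   # vector for [UNK] (unknown token)
    [0.1, 0.3, -0.2],  # vector for "hello"
    [0.7, -0.5, 0.4],  # vector for "world"
]

def encode(text):
    """Lowercase, split on whitespace, and map each token to its ID."""
    tokens = text.lower().split()
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

def embed(token_ids):
    """Look up one embedding vector per token ID."""
    return [embeddings[i] for i in token_ids]

ids = encode("Hello world")
print(ids)         # [1, 2]
print(embed(ids))  # [[0.1, 0.3, -0.2], [0.7, -0.5, 0.4]]
```

Any word missing from the toy vocabulary falls back to ID 0, which is one reason real tokenizers prefer subwords over whole words.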

🔧 Types of Tokenizers:

- Word-level → “Hello world” → ['Hello', 'world']
- Subword-level (e.g. BPE) → “unhappiness” → ['un', 'happi', 'ness'] (illustrative split; the real pieces depend on the learned vocabulary)
- Character-level → “Hi” → ['H', 'i']
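The three levels can be sketched in plain Python. Note the assumptions: the subword function below is a simple greedy longest-match over a hypothetical three-piece vocabulary, not real BPE (which learns its merge rules from data), and the function names are just for this example.

```python
def word_tokenize(text):
    """Word-level: split on whitespace."""
    return text.split()

def char_tokenize(text):
    """Character-level: one token per character."""
    return list(text)

def greedy_subword_tokenize(word, vocab):
    """Subword-level: repeatedly take the longest vocabulary piece that matches."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no piece matches: fall back to a single character
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

toy_vocab = {"un", "happi", "ness"}  # hypothetical vocabulary
print(word_tokenize("Hello world"))                       # ['Hello', 'world']
print(greedy_subword_tokenize("unhappiness", toy_vocab))  # ['un', 'happi', 'ness']
print(char_tokenize("Hi"))                                # ['H', 'i']
```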

✅ Real Use:
Hugging Face models like BERT use subword tokenizers (BERT uses WordPiece; GPT-2 uses byte-level BPE).

🤗 Transformers – Core Concept

💡 What is a Transformer?
A Transformer is a neural network architecture for sequences (like sentences). It uses self-attention: every token looks at all the other tokens at once and learns which ones matter most for its meaning.

🧠 Why It Matters?
This architecture underlies BERT, GPT, Claude, and Gemini, along with most other modern LLMs.

Key Ideas:

- No recurrence (no RNN-style loop over tokens), just attention
- Tokens are processed in parallel = Fast
- Can relate distant words, so context can disambiguate a word (e.g., “bank” = riverbank or money)
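To make "just attention" concrete, here is a small sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: a real Transformer first applies learned projection matrices to get separate queries, keys, and values, while this sketch uses the token vectors directly for all three. The 2-dimensional vectors are made up.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this token against every token in the sequence, including itself.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # The output is a weighted average of all token vectors.
        mixed = [sum(w * vec[d] for w, vec in zip(weights, vectors))
                 for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up 2-dimensional vectors standing in for three token embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])  # one row of attention weights per token
```

Each token's row is computed independently of the others, which is why the whole sequence can be processed in parallel.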

🧱 Transformer Parts (simple view):
- Input Embeddings: Text → Vectors
- Positional Encoding: Adds word order info
- Self-Attention: Learns context
- Feed Forward Layers: Processes info
- Output: Classifies, generates, etc.
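The "Positional Encoding" step can be sketched as follows. This uses the fixed sinusoidal scheme from the original Transformer design as an assumption; BERT-style models learn their position vectors instead. The embedding values and function names are made up for the example.

```python
import math

def positional_encoding(num_positions, dim):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((i - i % 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the position vector to each token embedding, element by element."""
    encoding = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)]
            for emb, pos in zip(embeddings, encoding)]

# The same made-up embedding at positions 0 and 1 ends up with different vectors.
same_word_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
for row in add_positions(same_word_twice):
    print([round(x, 3) for x in row])
```

Without this step, attention would treat "dog bites man" and "man bites dog" as the same bag of tokens.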

In [None]:
# !pip install transformers

# If the wrong Python is picked up: change the path (e.g. in Notepad),
# or move Python 3.13 to the top of the PATH environment variable in the system settings



In [None]:
import sys
print(sys.executable)


import transformers
print(transformers.__version__)

/Library/Developer/CommandLineTools/usr/bin/python3




4.52.4


In [None]:
# 🤗 Using Hugging Face Transformers (Hands-On)
# Let’s load a real model and run it on your own text 👇

from transformers import pipeline

classifier = pipeline("sentiment-analysis")

# result = classifier("I love learning Hugging Face with ChatGPT")
result = classifier("happy")
print(result)
# e.g. [{'label': 'POSITIVE', 'score': 0.9998}]
# 'label': predicted class
# 'score': confidence (close to 1.0 = very confident)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.9998753070831299}]


🧠 What’s happening here:
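pipeline("sentiment-analysis"): Loads a pretrained model that's fine-tuned for sentiment. By default this is DistilBERT (a smaller BERT variant) fine-tuned on SST-2, as the log message above shows.

You pass in raw text → it gets tokenized, embedded, and processed by the Transformer → you get a label (POSITIVE/NEGATIVE) + a confidence score.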

🧠 Token IDs & Attention Mask (Mini Concept)

💡 Token IDs:
- Text is turned into numbers. Example:
- "AI is great" → [101, 9932, 2003, 2307, 102] (Each word/subword gets a unique ID from the model's vocab; 101 and 102 are BERT's special [CLS] and [SEP] tokens that mark the start and end)

💡 Attention Mask:
- Tells the model which tokens to attend to (1 = real token, 0 = padding).
- Needed when inputs of different lengths are padded to the same length and sent as a batch.
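A rough sketch of what padding a batch looks like, using the token IDs shown in this notebook. The helper name `pad_batch` is made up for this example; in practice the tokenizer does this for you when you pass `padding=True`.

```python
PAD_ID = 0  # BERT's vocabulary uses ID 0 for [PAD]

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask

batch = [
    [101, 9932, 2003, 2307, 102],        # "AI is great"
    [101, 9932, 2003, 12476, 999, 102],  # "AI is awesome!"
]
ids, mask = pad_batch(batch)
print(ids)   # [[101, 9932, 2003, 2307, 102, 0], [101, 9932, 2003, 12476, 999, 102]]
print(mask)  # [[1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 1]]
```

The shorter sentence gets one pad token, and its mask ends in 0 so the model ignores that position.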

In [None]:
from transformers import AutoTokenizer  # Class that loads the right tokenizer for a Hugging Face checkpoint (e.g. BERT, GPT)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # Downloads the BERT tokenizer (lowercase version) that knows how to:
# split words into subwords
# convert to token IDs
inputs = tokenizer("AI is awesome!", padding=True, truncation=True, return_tensors="pt")
# Tokenizes the input:
# padding=True → Pads to the longest sequence in the batch (no effect for a single input)
# truncation=True → Cuts the input if it exceeds the model's maximum length
# return_tensors="pt" → Returns PyTorch tensors (pt = PyTorch)

print(inputs['input_ids'])       # Token IDs
print(inputs['attention_mask'])  # 1s = real tokens, 0s = ignore (pad)

tensor([[  101,  9932,  2003, 12476,   999,   102]])
tensor([[1, 1, 1, 1, 1, 1]])


Next, we'll manually run text through a Transformer model for classification, to see how the pieces (tokenizer + model) work together.

In [None]:
import torch  # We'll use PyTorch tensors for input/output
from transformers import AutoTokenizer, AutoModelForSequenceClassification  # Loads tokenizer + classification model

# Load tokenizer and model from the same checkpoint so the vocabulary always matches
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # Tokenizer: Breaks input text into token IDs
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)  # Model: A DistilBERT already fine-tuned for sentiment analysis (SST-2 dataset)

inputs = tokenizer("I love this movie!", return_tensors="pt")  # Tokenizes text → returns token IDs + attention mask (as PyTorch tensors)

with torch.no_grad():             # Disables gradient tracking (we're just predicting)
    outputs = model(**inputs)     # Passes the input into the model to get output logits
print(outputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)  # Converts raw output (logits) into probabilities
print(predictions)  # For this checkpoint: [negative_prob, positive_prob]

SequenceClassifierOutput(loss=None, logits=tensor([[-4.3246,  4.6837]]), hidden_states=None, attentions=None)
tensor([[1.2238e-04, 9.9988e-01]])


### 🔹 What is `logits`?

* `logits` are the **raw, unnormalized outputs** from the final layer of a neural network.
* They can be **positive or negative**, and **don’t sum to 1**.
* You convert `logits` → probabilities using `softmax`.

**Example:**

```python
import torch
logits = torch.tensor([[2.0, 0.5]])
probs = torch.softmax(logits, dim=-1)
# probs ≈ [0.82, 0.18] → means class 0 is 82% likely
```

---

### 🔹 What is `**inputs` inside the model?

When you do:

```python
outputs = model(**inputs)
```

It’s the same as writing:

```python
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
```

✔️ `tokenizer(...)` returns a dictionary like:

```python
{
  'input_ids': tensor([[101, 1045, 2293, 2023, 3185, 999, 102]]),
  'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])
}
```

The `**inputs` syntax **unpacks** that dictionary directly into keyword arguments for the model.
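You can try the unpacking behavior without loading a model. In this sketch `fake_model` is a hypothetical stand-in that only echoes facts about its arguments, and plain lists replace tensors.

```python
def fake_model(input_ids, attention_mask):
    """Hypothetical stand-in for a model's forward call; it just inspects its arguments."""
    return {"num_tokens": len(input_ids[0]), "mask_sum": sum(attention_mask[0])}

inputs = {
    "input_ids": [[101, 1045, 2293, 2023, 3185, 999, 102]],
    "attention_mask": [[1, 1, 1, 1, 1, 1, 1]],
}

unpacked = fake_model(**inputs)
explicit = fake_model(input_ids=inputs["input_ids"],
                      attention_mask=inputs["attention_mask"])
print(unpacked == explicit)  # True
print(unpacked)              # {'num_tokens': 7, 'mask_sum': 7}
```

The dictionary keys must match the function's parameter names, which is why the tokenizer output plugs straight into the model.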


### 1️⃣ **Why `torch.no_grad()`?**

When you're **only predicting (inference)** and not training, you don’t need to calculate gradients.

✅ **Benefits:**

* Saves memory (no computation graph is stored for backpropagation)
* Speeds up execution
* Avoids accidentally tracking gradients during inference

---

### 2️⃣ **What is Softmax?**

🧠 **Softmax** turns raw scores (logits) into probabilities that add up to 1.

**Formula:**

$$
\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
$$

It gives:

* The highest probability to the class with the largest logit
* Low values for others
* Output like: `[0.02, 0.98]` → 98% confidence for class 1
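The formula is short enough to write by hand. This is a plain-Python sketch, not how PyTorch implements it; subtracting the maximum logit first is a common trick to avoid overflow and does not change the result.

```python
import math

def softmax(logits):
    """Exponentiate each logit and divide by the sum of all exponentials."""
    peak = max(logits)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 2) for p in softmax([2.0, 0.5])])  # [0.82, 0.18]

probs = softmax([-4.3246, 4.6837])  # the logits printed for "I love this movie!"
print(probs)                        # close to the tensor([[1.2238e-04, 9.9988e-01]]) output
```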

---

### 3️⃣ **Whole Purpose Recap**

We’re doing this:

**Raw Text** → `Tokenizer` → `Model` → `Logits` → `Softmax` → `Probabilities` → `Prediction`

📌 This is what a Hugging Face classification pipeline does internally:

* Tokenization = preprocess
* Model = neural network
* Logits = raw model output
* Softmax = turn logits into probabilities; the highest one becomes the prediction
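The last arrow in the chain (probabilities → prediction) is just "pick the biggest one and name it". A minimal sketch, assuming the label order 0 = NEGATIVE, 1 = POSITIVE used by the SST-2 checkpoint in this notebook; `to_prediction` is a made-up helper name.

```python
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed label order for the SST-2 checkpoint

def to_prediction(probabilities, id2label=ID2LABEL):
    """Pick the most probable class and report it with its score."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return {"label": id2label[best], "score": probabilities[best]}

# Probabilities printed by the softmax step for "I love this movie!"
print(to_prediction([1.2238e-04, 9.9988e-01]))
# {'label': 'POSITIVE', 'score': 0.99988}
```

This is the same `label` + `score` shape that `pipeline("sentiment-analysis")` returns.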


🛠️ Mini Project: Sentiment Classifier for Multiple Texts
Analyze multiple reviews at once by passing a list of texts to the pipeline.

In [None]:
from transformers import pipeline

# Load sentiment analysis model
sentiment_model = pipeline("sentiment-analysis")

# Sample reviews
reviews = [
    "This movie was fantastic!",
    "Worst experience ever.",
    "I loved the visuals but hated the story.",
    "Just average, nothing special.",
    "Absolutely brilliant!"
]

# Analyze all
results = sentiment_model(reviews)
print(results, end="\n\n")

for review, result in zip(reviews, results):
    print(f"{review} -> {result['label']} ({round(result['score']*100, 2)}%)")


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.9998781681060791}, {'label': 'NEGATIVE', 'score': 0.9997876286506653}, {'label': 'NEGATIVE', 'score': 0.9886063933372498}, {'label': 'NEGATIVE', 'score': 0.998314619064331}, {'label': 'POSITIVE', 'score': 0.999871015548706}]

This movie was fantastic! -> POSITIVE (99.99%)
Worst experience ever. -> NEGATIVE (99.98%)
I loved the visuals but hated the story. -> NEGATIVE (98.86%)
Just average, nothing special. -> NEGATIVE (99.83%)
Absolutely brilliant! -> POSITIVE (99.99%)


🔍 What You Practiced:

- Multi-input processing
- Model confidence score
- Basic NLP automation
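If you want to reuse this, the loop can be wrapped in a small function. This is one possible sketch: `summarize_reviews` and `stub_classifier` are made-up names, and the stub only exists so the example runs without downloading a model. With the real pipeline you would pass `sentiment_model` in place of the stub.

```python
def summarize_reviews(reviews, classifier):
    """Run the classifier on all reviews and return one formatted line per review."""
    results = classifier(reviews)
    return [
        f"{review} -> {result['label']} ({result['score'] * 100:.2f}%)"
        for review, result in zip(reviews, results)
    ]

def stub_classifier(texts):
    """Hypothetical stand-in so the sketch runs without downloading a model."""
    return [
        {"label": "POSITIVE" if "great" in text.lower() else "NEGATIVE", "score": 0.9}
        for text in texts
    ]

for line in summarize_reviews(["Great film!", "Too long."], stub_classifier):
    print(line)
# Great film! -> POSITIVE (90.00%)
# Too long. -> NEGATIVE (90.00%)
```

Taking the classifier as a parameter means the same function works with any pipeline that returns `label` and `score` for each text.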

🧠 Use Custom Models from Hugging Face Hub
Explore other models on the Hub (e.g. emotion, topic, toxicity detection) using a few lines of code.

In [None]:
# ✅ Example 1: Emotion Detection
from transformers import pipeline

emotion = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base", top_k=1)

print(emotion("I am so proud of myself today!"))

Device set to use mps:0


[[{'label': 'joy', 'score': 0.7351986765861511}]]


In [None]:
# ✅ Example 2: Topic Classification
topic = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = "Apple is releasing a new iPhone this year"
# labels = ["sports", "politics", "technology", "food"]
labels = ["technology", "food"]

print(topic(text, candidate_labels=labels))



In [None]:
# ✅ Example 3: Toxicity Detection
toxic = pipeline("text-classification", model="unitary/toxic-bert")

print(toxic("I hate you!"))

🔍 Summary:
- pipeline() wraps tokenization, the model, and post-processing in a single call
- You can swap models by passing a Hugging Face model name via model=
- Explore huggingface.co/models for more tasks