🔥 First Concept: NLP - Tokenization

💡 What is it?
Tokenization = Breaking down text into smaller pieces (words, subwords, or characters) so that models can understand and process them.

🧠 Why it matters?
Models can’t read raw text. Text is first split into tokens (e.g. “Hello” → ['Hello']), each token is mapped to an integer ID, and the IDs are then looked up as vectors (embeddings).

🔧 Types of Tokenizers:

- Word-level → “Hello world” → ['Hello', 'world']
- Subword-level (e.g. BPE, WordPiece) → “unhappiness” → ['un', 'happi', 'ness'] (illustrative; the real split depends on the learned vocabulary)
- Character-level → “Hi” → ['H', 'i']

✅ Real Use:
Hugging Face models like BERT use subword tokenizers (BERT uses WordPiece; GPT-2 uses byte-level BPE).
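
To make the three tokenizer types concrete, here is a minimal plain-Python sketch. `TOY_VOCAB` is a made-up vocabulary (an assumption for illustration, not BERT's real vocab), and the subword function uses a simple greedy longest-match rule in the spirit of WordPiece rather than the exact algorithm of any library.

```python
# Toy tokenizers in plain Python. TOY_VOCAB is a made-up vocabulary,
# not the vocabulary of BERT or any real model.
TOY_VOCAB = {"un", "happi", "ness", "hello", "world"}

def word_tokenize(text):
    """Word-level: split on whitespace."""
    return text.split()

def char_tokenize(text):
    """Character-level: one token per character."""
    return list(text)

def subword_tokenize(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first subword split (WordPiece-style idea)."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no piece matched: give up with an unknown token
            return ["[UNK]"]
        tokens.append(word[start:end])
        start = end
    return tokens

print(word_tokenize("Hello world"))     # ['Hello', 'world']
print(char_tokenize("Hi"))              # ['H', 'i']
print(subword_tokenize("unhappiness"))  # ['un', 'happi', 'ness']
```

Real tokenizers learn their vocabulary from data, so the pieces you get from an actual model will usually differ from this toy split.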

🤗 Transformers – Core Concept

💡 What is a Transformer?
A Transformer is an architecture that understands sequences (like sentences) using self-attention – it looks at all words at once and learns which ones matter most.

🧠 Why It Matters?
This powers BERT, GPT, Claude, Gemini – all modern LLMs.

Key Ideas:

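- No recurrence: attention replaces the step-by-step loops of RNNs
- Parallel processing = Fast (all positions are processed at once)
- Uses surrounding context to resolve meaning, even across long distances (e.g., “bank” = riverbank or money)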

🧱 Transformer Parts (simple view):
- Input Embeddings: Text → Vectors
- Positional Encoding: Adds word order info
- Self-Attention: Learns context
- Feed Forward Layers: Processes info
- Output: Classifies, generates, etc.
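
The self-attention step can be sketched in a few lines of plain Python. This is a simplified illustration, assuming queries, keys, and values are all just the input vectors; real Transformers apply learned projection matrices and several attention heads, which are left out here. `toy_embeddings` is made-up data.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = the inputs."""
    d = len(vectors[0])
    weights, outputs = [], []
    for q in vectors:
        # Similarity of this token to every token, scaled by sqrt(d)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        w = softmax(scores)
        weights.append(w)
        # New representation = weighted average of all token vectors
        outputs.append([sum(w_j * v[i] for w_j, v in zip(w, vectors)) for i in range(d)])
    return weights, outputs

# Three made-up 2-dimensional token embeddings; the first two are similar
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
attn_weights, contextual = self_attention(toy_embeddings)
for row in attn_weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence; every row sums to 1, and similar tokens get larger weights.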

In [None]:
# !pip install transformers

# If the wrong Python interpreter is picked up, fix the PATH entry:
# move Python 3.13 to the top of PATH in the system environment variables



In [None]:
import sys
print(sys.executable)


import transformers
print(transformers.__version__)

/Library/Developer/CommandLineTools/usr/bin/python3


  from .autonotebook import tqdm as notebook_tqdm


4.52.4


In [None]:
# 🤗 Using Hugging Face Transformers (Hands-On)
# Let’s load a real model and run it on your own text 👇

from transformers import pipeline    # Correct

classifier = pipeline("sentiment-analysis")

# result = classifier("I love learning Hugging Face with ChatGPT")
result = classifier("happy")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.9998}]
# 'label': predicted class
# 'score': confidence (close to 1.0 = very confident)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.9998753070831299}]


🧠 What’s happening here:
pipeline("sentiment-analysis"): Loads a pretrained model fine-tuned for sentiment. With no model specified, it defaults to DistilBERT fine-tuned on SST-2 (see the log above).

You pass in raw text → it gets tokenized, embedded, and processed by the transformer → you get a label (POSITIVE/NEGATIVE) + a confidence score.

🧠 Token IDs & Attention Mask (Mini Concept)

💡 Token IDs:
- Text is turned into numbers. Example:
- "AI is great" → [101, 9932, 2003, 2307, 102] (each word/subword gets a unique ID from the model's vocab; in BERT, 101 = [CLS] and 102 = [SEP] are special tokens added at the start and end)

💡 Attention Mask:
- Tells the model which tokens to attend to (1 = real token, 0 = padding).
- Useful when inputs are of different lengths but sent as batches.
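
Here is a small sketch of how padding and the attention mask fit together when two inputs of different lengths go into one batch. `pad_batch` is a hypothetical helper written for this note (the tokenizer does this for you), and `PAD_ID = 0` is assumed to match BERT's `[PAD]` token.

```python
PAD_ID = 0  # assumed padding ID (BERT's [PAD] token); other models may differ

def pad_batch(batch_ids, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_ids]
    return input_ids, attention_mask

batch = [
    [101, 9932, 2003, 2307, 102],  # "AI is great" (IDs from the example above)
    [101, 9932, 102],              # a shorter sequence: just "AI"
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)       # [[101, 9932, 2003, 2307, 102], [101, 9932, 102, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The zeros in the mask line up exactly with the padded positions, so the model can ignore them.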

In [None]:
from transformers import AutoTokenizer  # Loads the right tokenizer class for any Hugging Face checkpoint (BERT, GPT, etc.)

# ✅ Downloads the BERT tokenizer (lowercase version), which knows how to:
#   - split words into subwords
#   - convert them to token IDs
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

inputs = tokenizer("AI is awesome!", padding=True, truncation=True, return_tensors="pt")
# ✅ Tokenizes the input:
# padding=True → Pads to the longest sequence in the batch (no effect on a single input)
# truncation=True → Cuts the input if it exceeds the model's max length
# return_tensors="pt" → Returns PyTorch tensors (pt = PyTorch)

print(inputs['input_ids'])       # Token IDs
print(inputs['attention_mask'])  # 1s = real tokens, 0s = ignore (pad)

tensor([[  101,  9932,  2003, 12476,   999,   102]])
tensor([[1, 1, 1, 1, 1, 1]])


Next, we’ll manually run text through a Transformer model for classification, to see how the pieces (tokenizer + model) work together.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification  # Tokenizer + classification model loaders
import torch  # We'll use PyTorch tensors for input/output

# Load tokenizer and model from the same checkpoint so vocabulary and weights are guaranteed to match
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # Tokenizer: breaks input text into token IDs
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)  # Model: DistilBERT fine-tuned for sentiment analysis (SST-2 dataset)

inputs = tokenizer("I love this movie!", return_tensors="pt")  # Returns token IDs + attention mask as PyTorch tensors

with torch.no_grad():             # Disables gradient tracking (we're just predicting)
    outputs = model(**inputs)     # Passes the inputs into the model to get output logits
print(outputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)  # Converts raw outputs (logits) into probabilities
print(predictions)  # For this model: [negative_prob, positive_prob]

SequenceClassifierOutput(loss=None, logits=tensor([[-4.3246,  4.6837]]), hidden_states=None, attentions=None)
tensor([[1.2238e-04, 9.9988e-01]])


### 🔹 What is `logits`?

* `logits` are the **raw, unnormalized outputs** from the final layer of a neural network.
* They can be **positive or negative**, and **don’t sum to 1**.
* You convert `logits` → probabilities using `softmax`.

**Example:**

```python
logits = torch.tensor([[2.0, 0.5]])
# After softmax → [0.82, 0.18] → means class 0 is 82% likely
```

---

### 🔹 What is `**inputs` inside the model?

When you do:

```python
outputs = model(**inputs)
```

It’s the same as writing:

```python
outputs = model(input_ids=..., attention_mask=...)
```

✔️ `tokenizer(...)` returns a dictionary like:

```python
{
  'input_ids': tensor([[101, 1045, 2293, 2023, 3185, 999, 102]]),
  'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])
}
```

The `**inputs` syntax **unpacks** that dictionary directly into keyword arguments for the model.
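
The unpacking behaviour is plain Python and can be checked without any model. In this sketch, `toy_model` is a hypothetical stand-in for the real model call; it only reports what it received.

```python
def toy_model(input_ids, attention_mask=None):
    """Hypothetical stand-in for a model call: reports what it received."""
    return {"n_tokens": len(input_ids[0]), "has_mask": attention_mask is not None}

inputs = {
    "input_ids": [[101, 1045, 2293, 2023, 3185, 999, 102]],
    "attention_mask": [[1, 1, 1, 1, 1, 1, 1]],
}

unpacked = toy_model(**inputs)  # keys become keyword arguments
explicit = toy_model(input_ids=inputs["input_ids"],
                     attention_mask=inputs["attention_mask"])

print(unpacked == explicit)  # True
print(unpacked)              # {'n_tokens': 7, 'has_mask': True}
```

The dictionary keys must match the parameter names of the function being called, which is why the tokenizer names its outputs `input_ids` and `attention_mask`.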


### 1️⃣ **Why `torch.no_grad()`?**

When you're **only predicting (inference)** and not training, you don’t need to calculate gradients.

✅ **Benefits:**

* Saves memory
* Speeds up execution
* Avoids accidentally building a computation graph during inference

---

### 2️⃣ **What is Softmax?**

🧠 **Softmax** turns raw scores (logits) into probabilities that add up to 1.

**Formula:**

$$
\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
$$

It gives:

* High confidence for the most likely class
* Low values for others
* Output like: `[0.02, 0.98]` → 98% confidence for class 1
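
The formula is short enough to write by hand. Below is a minimal plain-Python sketch (subtracting the max first is a common numerical-stability trick and does not change the result). The second call reuses the logits printed by the model earlier, rounded to four decimals.

```python
import math

def softmax(logits):
    """softmax(x_i) = exp(x_i) / sum_j exp(x_j)"""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 2) for p in softmax([2.0, 0.5])])  # [0.82, 0.18]

# Logits from the "I love this movie!" run above
probs = softmax([-4.3246, 4.6837])
print(probs)
```

The second result should be very close to the `tensor([[1.2238e-04, 9.9988e-01]])` that PyTorch printed, with small differences because the logits were rounded.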

---

### 3️⃣ **Whole Purpose Recap**

We’re doing this:

**Raw Text** → `Tokenizer` → `Model` → `Logits` → `Softmax` → `Probabilities` → `Prediction`

📌 This is how Hugging Face models work internally:

* Tokenization = preprocess
* Model = neural network
* Logits = raw model output
* Softmax = turns logits into human-readable probabilities
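
The last arrow in the flow (`Probabilities` → `Prediction`) is just "pick the most likely class". A minimal sketch, assuming the label order 0 = NEGATIVE, 1 = POSITIVE that the SST-2 model above uses; `predict_label` is a hypothetical helper, not a Transformers function.

```python
# Assumed label mapping for the SST-2 sentiment model used above
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def predict_label(probabilities, id2label=ID2LABEL):
    """Pick the class with the highest probability (argmax)."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return {"label": id2label[best], "score": probabilities[best]}

# Probabilities from the "I love this movie!" run above
print(predict_label([1.2238e-04, 9.9988e-01]))
# {'label': 'POSITIVE', 'score': 0.99988}
```

This has the same shape as what `pipeline("sentiment-analysis")` returns: a label plus its score.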


🛠️ Mini Project: Sentiment Classifier for Multiple Texts
Pass a list of reviews to the pipeline to analyze them all in one call.

In [None]:
from transformers import pipeline

# Load sentiment analysis model
sentiment_model = pipeline("sentiment-analysis")

# Sample reviews
reviews = [
    "This movie was fantastic!",
    "Worst experience ever.",
    "I loved the visuals but hated the story.",
    "Just average, nothing special.",
    "Absolutely brilliant!"
]

# Analyze all
results = sentiment_model(reviews)
print(results, end="\n\n")

for review, result in zip(reviews, results):
    print(f"{review} -> {result['label']} ({round(result['score']*100, 2)}%)")




[{'label': 'POSITIVE', 'score': 0.9998781681060791}, {'label': 'NEGATIVE', 'score': 0.9997876286506653}, {'label': 'NEGATIVE', 'score': 0.9886063933372498}, {'label': 'NEGATIVE', 'score': 0.998314619064331}, {'label': 'POSITIVE', 'score': 0.999871015548706}]

This movie was fantastic! -> POSITIVE (99.99%)
Worst experience ever. -> NEGATIVE (99.98%)
I loved the visuals but hated the story. -> NEGATIVE (98.86%)
Just average, nothing special. -> NEGATIVE (99.83%)
Absolutely brilliant! -> POSITIVE (99.99%)


🔍 What You Practiced:

- Multi-input processing
- Model confidence score
- Basic NLP automation
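
The loop above can also be wrapped in a reusable function. This is one possible sketch: `analyze_reviews` is a hypothetical helper that accepts any classifier callable, so you could pass in `sentiment_model` from the cell above. To keep the example runnable without downloading a model, it is demonstrated with `fake_classifier`, a keyword rule that is only a stand-in.

```python
def analyze_reviews(classifier, reviews):
    """Classify a list of texts and return one formatted line per text.

    `classifier` is any callable that takes a list of strings and returns a
    list of {'label': ..., 'score': ...} dicts, e.g. a sentiment pipeline.
    """
    results = classifier(reviews)
    return [
        f"{review} -> {result['label']} ({result['score'] * 100:.2f}%)"
        for review, result in zip(reviews, results)
    ]

def fake_classifier(texts):
    """Stand-in for demonstration only: a keyword rule, not a real model."""
    return [
        {"label": "NEGATIVE" if "worst" in t.lower() else "POSITIVE", "score": 0.5}
        for t in texts
    ]

sample = ["This movie was fantastic!", "Worst experience ever."]
for line in analyze_reviews(fake_classifier, sample):
    print(line)
```

With the real pipeline the call would be `analyze_reviews(sentiment_model, reviews)`.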

🧠 Use Custom Models from Hugging Face Hub
Explore other powerful models (e.g. emotion, topic, toxicity detection) with a few lines of code.

In [2]:
# ✅ Example 1: Emotion Detection
from transformers import pipeline

emotion = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base", top_k=1)

print(emotion("I am so proud of myself today!"))

  from .autonotebook import tqdm as notebook_tqdm
Device set to use mps:0


[[{'label': 'joy', 'score': 0.7351986765861511}]]


In [7]:
# ✅ Example 2: Topic Classification
# Earlier attempt with the full-size model (slow to download and run):
# topic = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # Silences the tokenizers parallelism warning

# Lightweight model; an even smaller alternative is "typeform/distilbert-base-uncased-mnli"
classifier = pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-1")

text = "Apple is releasing a new iPhone this year"
# labels = ["sports", "politics", "technology", "food"]
labels = ["technology", "food"]

result = classifier(text, candidate_labels=labels)
print(result)




Device set to use mps:0


{'sequence': 'Apple is releasing a new iPhone this year', 'labels': ['technology', 'food'], 'scores': [0.9958313703536987, 0.004168595653027296]}


In [8]:
# ✅ Example 3: Toxicity Detection
toxic = pipeline("text-classification", model="unitary/toxic-bert")

print(toxic("I hate you!"))

Device set to use mps:0


[{'label': 'toxic', 'score': 0.95553058385849}]


🔍 Summary:
- pipeline() makes any model easy to use
- You can swap models using Hugging Face model names
- Explore huggingface.co/models for more tasks

⏱️ Why Zero-Shot Topic Classification Is Slow:
Zero-shot classification takes noticeably longer than the other examples. A call like this:

# pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
does the following:
- Loads a large transformer (BART-large has 400M+ parameters)
- Internally runs the input once per label (e.g., 4 labels = 4 forward passes)
- Works without any prior training on your custom labels = “zero-shot”
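
Why one pass per label? Zero-shot classification with an NLI model turns each candidate label into a hypothesis sentence and asks the model whether the text entails it. The sketch below only builds those text pairs; `build_nli_pairs` is a hypothetical helper, and the template `"This example is {}."` is an assumption about the wording used.

```python
def build_nli_pairs(text, candidate_labels, hypothesis_template="This example is {}."):
    """One (premise, hypothesis) pair per label; each pair needs its own forward pass."""
    return [(text, hypothesis_template.format(label)) for label in candidate_labels]

pairs = build_nli_pairs(
    "Apple is releasing a new iPhone this year",
    ["technology", "food"],
)
for premise, hypothesis in pairs:
    print(premise, "|", hypothesis)
```

The number of pairs grows with the number of labels, which is why trimming the label list speeds things up.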

✅ Tips to Speed It Up:

- Use fewer candidate labels
- Use DistilBART (if available) instead of full BART
- Run it only once and cache results locally
- Use GPU if you have one


If you see warnings and repeated retry logs while loading the zero-shot model, the large model files are failing to download from Hugging Face.

❗Issue:
DNS lookup is failing due to network problems, causing:

- ⚠️ Repeated retries
- ⚠️ Delayed or stuck execution
- 📦 Large files not loading (models are 1GB+)

🔧 How to Fix It:
✅ Option 1: Silence tokenizer warnings
This only removes unrelated noise from the log; it does not fix the download. Add this before importing pipelines:
# import os
# os.environ["TOKENIZERS_PARALLELISM"] = "false"

✅ Option 2: Try lightweight model
Use "valhalla/distilbart-mnli-12-1" instead of "facebook/bart-large-mnli":
# classifier = pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-1")

✅ Option 3: Restart kernel + run with stable internet
Sometimes Colab/VS Code gets stuck due to a corrupted download. Restarting + running again helps.

✅ Option 4: Check for network/DNS issues
Ensure your environment can connect to Hugging Face:
# ping huggingface.co

✅ Option 5: Clear the cache and retry
Clear the cache folder (~/.cache/huggingface), then run the cell again.

**Embeddings + Vector Search** (FAISS, Pinecone), the backbone of:

* **Semantic Search**
* **Chatbot memory**
* **RAG (Retrieval-Augmented Generation)**

Let’s start with the core idea 👇

---

### 🧠 What Are Embeddings?

Embeddings are **vector representations of text** (or images, code, etc.) in a high-dimensional space.
Words/paragraphs with **similar meaning → closer vectors**.

📌 Example:

* "king" and "queen" will have similar vectors.
* "dog" and "bark" will be closer than "dog" and "car".
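
"Closer vectors" is usually measured with cosine similarity. A minimal sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions and come from a trained model; the numbers in `toy` are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors, chosen so that "dog" and "bark" point in a similar direction
toy = {
    "dog":  [0.9, 0.8, 0.1],
    "bark": [0.8, 0.9, 0.2],
    "car":  [0.1, 0.2, 0.9],
}

dog_bark = cosine_similarity(toy["dog"], toy["bark"])
dog_car = cosine_similarity(toy["dog"], toy["car"])
print(dog_bark > dog_car)  # True
```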

---

### 🔍 Why Use Embeddings?

* To **compare semantic meaning** (not exact words)
* To build **semantic search engines**, chatbot memory, document similarity tools, etc.

---

### ⚙️ What is FAISS?

FAISS (by Facebook AI) lets you:

* Store millions of embeddings
* Search fast: “Which vector is most similar to this one?”
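
To see what "most similar" means, here is a brute-force sketch of what a flat L2 index does conceptually: compare the query with every stored vector and keep the k smallest squared distances. `l2_search` is a hypothetical helper in plain Python; FAISS does the same job with heavily optimized code, and its other index types use approximate search for large collections.

```python
def l2_search(index_vectors, query, k):
    """Return (distances, indices) of the k nearest vectors by squared L2 distance."""
    scored = sorted(
        (sum((a - b) ** 2 for a, b in zip(vec, query)), i)
        for i, vec in enumerate(index_vectors)
    )
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]  # made-up 2-d "embeddings"
distances, indices = l2_search(vectors, [0.9, 1.2], k=2)
print(indices)    # [1, 0]
print(distances)
```

The return shape mirrors the `D, I` pair you get back from a FAISS search: distances first, then the positions of the matching vectors.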

---

### 📦 Flow (RAG / Semantic Search):

1. Convert text → embeddings (using a model from the `sentence-transformers` library)
2. Store them in FAISS or Pinecone
3. When user queries:
   → Convert query to embedding
   → Search nearest documents
   → Show or feed them to LLM
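
The three steps above can be sketched end to end in plain Python. Big caveat: `toy_embed` uses simple word counts as a stand-in for a real embedding model, so it matches keywords rather than meaning, and a Python list stands in for FAISS or Pinecone. The documents and the prompt format are made up for illustration.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in 'embedding': lowercase word counts (NOT a semantic embedding)."""
    return Counter(text.lower().replace(".", "").replace("?", "").split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = [
    "FAISS stores embeddings for fast search.",
    "Softmax turns logits into probabilities.",
    "Tokenization splits text into tokens.",
]

# Steps 1-2: embed every document and store the vectors
store = [(doc, toy_embed(doc)) for doc in documents]

def retrieve(query, top_k=1):
    """Step 3: embed the query and return the nearest documents."""
    q = toy_embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

query = "How does FAISS search embeddings?"
context = retrieve(query)
prompt = f"Context: {context[0]}\nQuestion: {query}"  # what you would feed to an LLM
print(prompt)
```

Swapping `toy_embed` for a real sentence-embedding model and the list for a vector index gives the actual RAG retrieval step.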

---
Step 1: pip install sentence-transformers

In [None]:
# ✅ Step 2: Get sentence embeddings
# Here’s how to convert text into a vector using a pretrained model: text->vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # Light & fast, 384-dim embeddings

sentences = [
    "I love playing football.",
    "Soccer is my favorite sport.",
    "Apples are red and sweet."
]

embeddings = model.encode(sentences)

print(embeddings.shape)  # (3, 384)
# Each sentence is now a 384-dimension vector that captures meaning.

(3, 384)


pip install faiss-cpu

In [None]:
# Build a vector search system using FAISS. This is used in chatbots, search engines, RAG, memory, etc.
# ✅ Step 3: Use FAISS to search similar sentences
import faiss          # Facebook AI Similarity Search: indexes and searches high-dimensional vectors fast
import numpy as np    # Stores and manipulates the embeddings (vectors)

# FAISS only works with float32, so convert the sentence embeddings
embeddings = np.array(embeddings).astype("float32")

# Create a FAISS index that uses L2 (Euclidean) distance to measure similarity.
# embeddings.shape[1] = number of dimensions per embedding (e.g., 384 or 768).
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)  # Add all sentence vectors to the index so they can be searched

# Query FAISS: find the sentences most similar to a new query.
# Encode the query with the same sentence-transformers model, as float32.
query = model.encode(["I enjoy watching football"]).astype("float32")

D, I = index.search(query, k=2)
# D: squared L2 distances to the top-k closest vectors
# I: indices of the most similar sentences (from the original list)

# Show the results: print the sentences behind the top-k indices
print("Most similar sentences:")
for idx in I[0]:
    print(sentences[idx])


Most similar sentences:
I love playing football.
Soccer is my favorite sport.


📌 Summary
FAISS creates a fast search engine using vector similarity.
We use sentence embeddings → build index → query → get closest matches.

This finds the sentences closest in meaning to the query using vector similarity: when a user queries, FAISS returns the top 2 most similar sentences.

🧠 GOAL:
We want to find similar sentences by semantic meaning (not keywords). This is useful in:

- AI search engines
- Chatbot memory
- Document Q&A
- Retrieval-Augmented Generation (RAG)

In [11]:
# 🔧 Now let's turn this into a reusable search function:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# 1. Load model & encode your sentences
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "I love playing cricket",
    "Artificial Intelligence is the future",
    "Messi is the best football player",
    "AI will change the world",
    "Football is a great sport"
]
embeddings = model.encode(sentences).astype("float32")

# 2. Build the FAISS index
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
print(embeddings)

# 3. 🔍 Define the search function
def semantic_search(query_text, top_k=3):
    query_vec = model.encode([query_text]).astype("float32")
    distances, indices = index.search(query_vec, top_k)
    print("dist", distances, indices)
    return [sentences[i] for i in indices[0]]

# 4. ✅ Try it
print(semantic_search("Tell me about football"))


[[ 4.9958229e-02  3.3321250e-02 -2.3285332e-03 ...  1.1092138e-02
   8.0104306e-05 -6.7105673e-02]
 [-3.2004926e-02 -1.6430480e-03  2.6563013e-02 ...  3.8728736e-02
   6.6925928e-02 -4.1541487e-02]
 [ 1.8890753e-02  3.5387490e-02 -3.0344751e-02 ...  6.5683750e-03
   1.1034363e-01 -2.4785191e-02]
 [-2.1528115e-02  4.1045551e-03  4.4449449e-02 ... -4.9252391e-02
   1.4991087e-03 -7.1676977e-02]
 [ 2.4371710e-02  3.2307386e-02  1.8054860e-02 ...  3.5636850e-02
   6.0518291e-02  1.3624058e-02]]
dist [[0.56211036 1.2136043  1.2308424 ]] [[4 0 2]]
['Football is a great sport', 'I love playing cricket', 'Messi is the best football player']
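
The `dist` values above are easier to read as cosine similarities. A small sketch, under two assumptions: the index returns squared L2 distances, and the embeddings are unit-length (this model is expected to normalize its outputs, but check your own model). For unit vectors, squared distance = 2 − 2 × cosine, so the conversion is a one-liner.

```python
def cosine_from_squared_l2(squared_distance):
    """For unit-length vectors: ||a - b||^2 = 2 - 2*cos(a, b)."""
    return 1.0 - squared_distance / 2.0

distances = [0.56211036, 1.2136043, 1.2308424]  # copied from the output above
for d in distances:
    print(round(cosine_from_squared_l2(d), 3))
# 0.719
# 0.393
# 0.385
```

Read this way, only the first result is a strong match; the second and third are almost tied, which is why the cricket sentence lands just ahead of the Messi one.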
