🔥 First Concept: NLP - Tokenization

💡 What is it?
Tokenization = Breaking down text into smaller pieces (words, subwords, or characters) so that models can understand and process them.

🧠 Why it matters?
Models can’t read raw text directly. Text is first split into tokens (like “Hello” → ['Hello']), each token is mapped to an integer ID, and the IDs are then looked up as vectors (embeddings).

🔧 Types of Tokenizers:

- Word-level → “Hello world” → ['Hello', 'world']
- Subword-level (e.g., BPE or WordPiece) → “unhappiness” → ['un', 'happi', 'ness'] (illustrative; the exact split depends on the learned vocabulary)
- Character-level → “Hi” → ['H', 'i']
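
The three granularities above can be sketched in plain Python. This is a minimal illustration, not how Hugging Face tokenizers are implemented: `TOY_VOCAB` and the greedy longest-match rule are assumptions chosen so the split matches the example above, whereas real BPE or WordPiece vocabularies are learned from data.

```python
# Toy tokenizers for illustration only (hypothetical helpers, not a library API).

def word_tokenize(text):
    """Word-level: split on whitespace."""
    return text.split()

def char_tokenize(text):
    """Character-level: one token per character."""
    return list(text)

# Hypothetical subword vocabulary; real vocabularies are learned from a corpus.
TOY_VOCAB = {"un", "happi", "ness"}

def subword_tokenize(word, vocab=TOY_VOCAB):
    """Subword-level: greedy longest-match against a fixed vocabulary."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to a single character.
            end = start + 1
        tokens.append(word[start:end])
        start = end
    return tokens

print(word_tokenize("Hello world"))     # ['Hello', 'world']
print(subword_tokenize("unhappiness"))  # ['un', 'happi', 'ness']
print(char_tokenize("Hi"))              # ['H', 'i']
```

The single-character fallback is what lets a subword tokenizer handle words it has never seen, at the cost of longer token sequences.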

✅ Real Use:
Hugging Face models like BERT use subword tokenizers (BERT uses WordPiece; GPT-style models use byte-level BPE).

🤗 Transformers – Core Concept

💡 What is a Transformer?
A Transformer is an architecture that understands sequences (like sentences) using self-attention – it looks at all words at once and learns which ones matter most.

🧠 Why It Matters?
This powers BERT, GPT, Claude, Gemini – most modern LLMs.

Key Ideas:

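- No recurrence (no RNN-style loops over the sequence), just attention
- All tokens are processed in parallel = Fast to train
- Captures long-range relationships between words and uses context to resolve ambiguity (e.g., “bank” = riverbank or money)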

🧱 Transformer Parts (simple view):
- Input Embeddings: Text → Vectors
- Positional Encoding: Adds word order info
- Self-Attention: Learns context
- Feed Forward Layers: Processes info
- Output: Classifies, generates, etc.
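
To make the Self-Attention part concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It assumes the queries, keys, and values are the token vectors themselves; a real Transformer first applies learned projection matrices and uses several attention heads, which are left out here. The three 2-dimensional `tokens` are made-up values.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a short sequence of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token's query with every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical token vectors (a real model would use learned embeddings).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how much one token attends to every token in the sequence, which is the "looks at all words at once" behavior described above.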

In [None]:
# !pip install transformers

# If the notebook picks up the wrong Python interpreter, fix the path:
# edit it in Notepad, or move Python 3.13 to the top of PATH in the system environment variables



In [None]:
import sys
print(sys.executable)


import transformers
print(transformers.__version__)

/Library/Developer/CommandLineTools/usr/bin/python3


  from .autonotebook import tqdm as notebook_tqdm


4.52.4


In [None]:
# 🤗 Using Hugging Face Transformers (Hands-On)
# Let’s load a real model and run it on your own text 👇

from transformers import pipeline    # Correct

classifier = pipeline("sentiment-analysis")

# result = classifier("I love learning Hugging Face with ChatGPT")
result = classifier("happy")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.9998}]
# 'label': predicted class; 'score': confidence (close to 1.0 = very confident)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.9998753070831299}]


🧠 What’s happening here:
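pipeline("sentiment-analysis"): Loads a pretrained model that’s fine-tuned for sentiment. With no model specified, it falls back to a default checkpoint (here DistilBERT fine-tuned on SST-2, as the log above shows).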

You pass in raw text → it gets tokenized, embedded, processed by transformer → gives a label (POSITIVE/NEGATIVE) + confidence.

🧠 Token IDs & Attention Mask (Mini Concept)

💡 Token IDs:
- Text is turned into numbers. Example:
- "AI is great" → [101, 9932, 2003, 2307, 102] (Each word/subword gets a unique ID from the model's vocab; 101 and 102 are the special [CLS] and [SEP] tokens that BERT adds at the start and end)

💡 Attention Mask:
- Tells the model which tokens to attend to (1 = real token, 0 = padding).
- Useful when inputs are of different lengths but sent as batches.
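
A minimal sketch of what padding and the attention mask look like for a batch of two inputs with different lengths. `pad_batch` is a hypothetical helper written for illustration (the tokenizer does this for you), the second input is made up, and the pad ID of 0 is an assumption that matches BERT's [PAD] token.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real, 0 = pad
    return input_ids, attention_mask

batch = [
    [101, 9932, 2003, 2307, 102],  # "AI is great" (IDs from the example above)
    [101, 9932, 102],              # a shorter, hypothetical input
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)       # [[101, 9932, 2003, 2307, 102], [101, 9932, 102, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```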

In [None]:
from transformers import AutoTokenizer #loads class to fetch tokenizer from Hugging Face (e.g. bert, gpt, etc.)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") #✅ Downloads the BERT tokenizer (lowercase version) that knows how to:
# split words into subwords
# convert to token IDs
inputs = tokenizer("AI is awesome!", padding=True, truncation=True, return_tensors="pt")
# ✅ Tokenizes the input:
# padding=True → Pads to the longest sequence in the batch (a single input gets no padding, as the output shows)
# truncation=True → Cuts the input if it’s longer than the model's maximum length
# return_tensors="pt" → Returns PyTorch tensor format (pt = PyTorch)

print(inputs['input_ids'])       # Token IDs
print(inputs['attention_mask'])  # 1s = real tokens, 0s = ignore (pad)

tensor([[  101,  9932,  2003, 12476,   999,   102]])
tensor([[1, 1, 1, 1, 1, 1]])


Next, we’ll manually run text through a Transformer model for classification, to see how all the pieces (tokenizer + model) work together.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification # Loads tokenizer + classification model
import torch # We'll use PyTorch tensors for input/output

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # A DistilBERT already trained for sentiment analysis (SST-2 dataset)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # Tokenizer: loaded from the same checkpoint as the model so the vocab is guaranteed to match
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)  # Model: DistilBERT with a classification head

inputs = tokenizer("I love this movie!", return_tensors="pt") #Tokenizes text → returns token IDs + attention mask (in PyTorch tensor format)

with torch.no_grad():             # Disables gradient tracking (we're just predicting)
    outputs = model(**inputs)     # Passes the input into the model to get output logits
print(outputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) #Converts raw output (logits) into probabilities
print(predictions)  # For this model: [negative_prob, positive_prob] (the order is given by model.config.id2label)

SequenceClassifierOutput(loss=None, logits=tensor([[-4.3246,  4.6837]]), hidden_states=None, attentions=None)
tensor([[1.2238e-04, 9.9988e-01]])


### 🔹 What is `logits`?

* `logits` are the **raw, unnormalized outputs** from the final layer of a neural network.
* They can be **positive or negative**, and **don’t sum to 1**.
* You convert `logits` → probabilities using `softmax`.

**Example:**

```python
logits = torch.tensor([[2.0, 0.5]])
# After softmax → [0.82, 0.18] → the model assigns class 0 a probability of 82%
```

---

### 🔹 What is `**inputs` inside the model?

When you do:

```python
outputs = model(**inputs)
```

It’s the same as writing:

```python
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
```

✔️ `tokenizer(...)` returns a dictionary like:

```python
{
  'input_ids': tensor([[101, 1045, 2293, 2023, 3185, 999, 102]]),
  'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])
}
```

The `**inputs` syntax **unpacks** that dictionary directly into keyword arguments for the model.
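
A small, self-contained sketch of the same unpacking, using a hypothetical `fake_model` function in place of a real model so it runs without any downloads:

```python
def fake_model(input_ids, attention_mask):
    """Hypothetical stand-in for a model's forward call."""
    return {"n_tokens": len(input_ids[0]), "n_real": sum(attention_mask[0])}

inputs = {
    "input_ids": [[101, 1045, 2293, 2023, 3185, 999, 102]],
    "attention_mask": [[1, 1, 1, 1, 1, 1, 1]],
}

unpacked = fake_model(**inputs)  # keys become keyword argument names
explicit = fake_model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
)
print(unpacked == explicit)  # True
```

The dictionary keys must match the parameter names the function accepts; an unexpected key raises a `TypeError`.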


### 1️⃣ **Why `torch.no_grad()`?**

When you're **only predicting (inference)** and not training, you don’t need to calculate gradients.

✅ **Benefits:**

* Saves memory
* Speeds up execution
* Cleaner and safer for inference (no computation graph is kept around by accident)

---

### 2️⃣ **What is Softmax?**

🧠 **Softmax** turns raw scores (logits) into probabilities that add up to 1.

**Formula:**

$$
\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
$$

It gives:

* The highest probability for the class with the largest logit
* Smaller probabilities for the others
* Output like: `[0.02, 0.98]` → 98% confidence for class 1
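
The formula can be checked by hand with a few lines of plain Python. This is a sketch of the math only (PyTorch's `softmax` does the same thing on tensors); subtracting the maximum first is a standard trick to avoid overflow and does not change the result.

```python
import math

def softmax(logits):
    """softmax(x_i) = exp(x_i) / sum_j exp(x_j)"""
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# The small example from the logits section.
print([round(p, 2) for p in softmax([2.0, 0.5])])  # [0.82, 0.18]

# The logits printed by the sentiment model earlier in this notebook.
probs = softmax([-4.3246, 4.6837])
print(probs)  # compare with the tensor printed by the model cell
```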

---

### 3️⃣ **Whole Purpose Recap**

We’re doing this:

**Raw Text** → `Tokenizer` → `Model` → `Logits` → `Softmax` → `Probabilities` → `Prediction`

📌 This is how Hugging Face models work internally:

* Tokenization = preprocess
* Model = neural network
* Logits = raw model output
* Softmax = make predictions human-readable


🛠️ Mini Project: Sentiment Classifier for Multiple Texts
Run the sentiment pipeline on a list of reviews to analyze them all at once.

In [None]:
from transformers import pipeline

# Load sentiment analysis model
sentiment_model = pipeline("sentiment-analysis")

# Sample reviews
reviews = [
    "This movie was fantastic!",
    "Worst experience ever.",
    "I loved the visuals but hated the story.",
    "Just average, nothing special.",
    "Absolutely brilliant!"
]

# Analyze all
results = sentiment_model(reviews)
print(results, end="\n\n")

for review, result in zip(reviews, results):
    print(f"{review} -> {result['label']} ({round(result['score']*100, 2)}%)")


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.9998781681060791}, {'label': 'NEGATIVE', 'score': 0.9997876286506653}, {'label': 'NEGATIVE', 'score': 0.9886063933372498}, {'label': 'NEGATIVE', 'score': 0.998314619064331}, {'label': 'POSITIVE', 'score': 0.999871015548706}]

This movie was fantastic! -> POSITIVE (99.99%)
Worst experience ever. -> NEGATIVE (99.98%)
I loved the visuals but hated the story. -> NEGATIVE (98.86%)
Just average, nothing special. -> NEGATIVE (99.83%)
Absolutely brilliant! -> POSITIVE (99.99%)


🔍 What You Practiced:

- Multi-input processing
- Model confidence score
- Basic NLP automation
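
The loop above can be wrapped in a reusable function. This is a sketch under assumptions: `analyze_reviews` and `stub_classifier` are hypothetical names, and the stub only mimics the output format of the pipeline (a list of `label`/`score` dicts) so the example runs without downloading a model. In the notebook you would pass `sentiment_model` instead of the stub.

```python
def analyze_reviews(reviews, classifier):
    """Classify a list of reviews and return one summary row per review."""
    results = classifier(reviews)
    return [
        {
            "review": review,
            "label": r["label"],
            "confidence_pct": round(r["score"] * 100, 2),
        }
        for review, r in zip(reviews, results)
    ]

def stub_classifier(texts):
    """Hypothetical stand-in that returns pipeline-shaped results."""
    return [
        {"label": "NEGATIVE" if "worst" in t.lower() else "POSITIVE", "score": 0.9}
        for t in texts
    ]

report = analyze_reviews(
    ["This movie was fantastic!", "Worst experience ever."],
    stub_classifier,  # swap in sentiment_model for real predictions
)
for row in report:
    print(f"{row['review']} -> {row['label']} ({row['confidence_pct']}%)")
```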

🧠 Use Custom Models from Hugging Face Hub
Explore other models on the Hub (e.g. emotion, topic, toxicity detection) using a few lines of code.

In [2]:
# ✅ Example 1: Emotion Detection
from transformers import pipeline

emotion = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base", top_k=1)

print(emotion("I am so proud of myself today!"))

  from .autonotebook import tqdm as notebook_tqdm
Device set to use mps:0


[[{'label': 'joy', 'score': 0.7351986765861511}]]


In [7]:
# # ✅ Example 2: Topic Classification
# topic = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# text = "Apple is releasing a new iPhone this year"
# # labels = ["sports", "politics", "technology", "food"]
# labels = ["technology", "food"]

# print(topic(text, candidate_labels=labels))

# ✅ Example 2: Topic Classification
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

classifier = pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-1")  # lightweight model; alternative: model="typeform/distilbert-base-uncased-mnli"

text = "Apple is releasing a new iPhone this year"
# labels = ["sports", "politics", "technology", "food"]
labels = ["technology", "food"]

result = classifier(text, candidate_labels=labels)
print(result)




Device set to use mps:0


{'sequence': 'Apple is releasing a new iPhone this year', 'labels': ['technology', 'food'], 'scores': [0.9958313703536987, 0.004168595653027296]}


In [8]:
# ✅ Example 3: Toxicity Detection
toxic = pipeline("text-classification", model="unitary/toxic-bert")

print(toxic("I hate you!"))

Device set to use mps:0


[{'label': 'toxic', 'score': 0.95553058385849}]


🔍 Summary:
- pipeline() makes most Hub models easy to use
- You can swap models using Hugging Face model names
- Explore huggingface.co/models for more tasks

⏱️ Why Zero-Shot Topic Classification Is Slow:
Suppose you're using this:

# pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
This:
- Loads a large transformer (BART-large = 400M+ parameters)
- Internally scores the input against each label as a separate NLI (premise, hypothesis) pair, so cost grows with the number of labels (e.g., 4 labels = 4 forward passes)
- Works without any prior training on your custom labels = “zero-shot”

✅ Tips to Speed It Up:

- Use fewer candidate labels
- Use a distilled model such as "valhalla/distilbart-mnli-12-1" instead of full BART
- Run it only once and cache results locally
- Use GPU if you have one


If loading the zero-shot model prints warnings and retry logs, it is usually because the model is trying to download large files from Hugging Face and the download is failing.

❗Issue:
DNS lookup is failing due to network problems, causing:

⚠️ Retrying multiple times
⚠️ Delayed or stuck execution
📦 Big files not loading (models are 1GB+)

🔧 How to Fix It:
✅ Option 1: Silence tokenizer warnings
This only removes warning noise; it does not fix the download. Add this before importing pipelines:
# import os
# os.environ["TOKENIZERS_PARALLELISM"] = "false"

✅ Option 2: Try a lightweight model
Use "valhalla/distilbart-mnli-12-1" instead of "facebook/bart-large-mnli":
# classifier = pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-1")

✅ Option 3: Restart kernel + run with stable internet
Sometimes Colab/VS Code gets stuck due to a corrupted download. Restarting + running again helps.

✅ Option 4: Check for network/DNS issues
Ensure your environment can connect to Hugging Face: ping huggingface.co

✅ Option 5: Clear the cache & retry
Clear the cache folder: ~/.cache/huggingface

**Embeddings + Vector Search** (FAISS, Pinecone), the backbone of:

* **Semantic Search**
* **Chatbot memory**
* **RAG (Retrieval-Augmented Generation)**

Let’s start with the core idea 👇

---

### 🧠 What Are Embeddings?

Embeddings are **vector representations of text** (or images, code, etc.) in a high-dimensional space.
Words/paragraphs with **similar meaning → closer vectors**.

📌 Example:

* "king" and "queen" will have similar vectors.
* "dog" and "bark" will be closer than "dog" and "car".

---

### 🔍 Why Use Embeddings?

* To **compare semantic meaning** (not exact words)
* To build **semantic search engines**, chatbot memory, document similarity tools, etc.
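
"Comparing semantic meaning" usually comes down to a similarity score between two vectors, most often cosine similarity. Below is a minimal sketch in plain Python; the 3-dimensional vectors are made-up values chosen to mirror the dog/bark/car example, not real model embeddings (which have hundreds of dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings for illustration only.
toy = {
    "dog":  [0.9, 0.8, 0.1],
    "bark": [0.8, 0.9, 0.2],
    "car":  [0.1, 0.2, 0.9],
}

dog_bark = cosine_similarity(toy["dog"], toy["bark"])
dog_car = cosine_similarity(toy["dog"], toy["car"])
print(dog_bark > dog_car)  # True
```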

---

### ⚙️ What is FAISS?

FAISS (by Facebook AI) lets you:

* Store millions of embeddings
* Search fast: “Which vector is most similar to this one?”

---

### 📦 Flow (RAG / Semantic Search):

1. Convert text → embeddings (using models like `sentence-transformers`)
2. Store them in FAISS or Pinecone
3. When user queries:
   → Convert query to embedding
   → Search nearest documents
   → Show or feed them to LLM

---
Step 1: pip install sentence-transformers

In [None]:
# ✅ Step 2: Get sentence embeddings
# Here’s how to convert text into a vector using a pretrained model: text->vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # Light & fast, 384-dim embeddings

sentences = [
    "I love playing football.",
    "Soccer is my favorite sport.",
    "Apples are red and sweet."
]

embeddings = model.encode(sentences)

print(embeddings.shape)  # (3, 384)
# Each sentence is now a 384-dimension vector that captures meaning.

(3, 384)


pip install faiss-cpu

In [None]:
# Build a vector search system using FAISS. This is used in chatbots, search engines, RAG, memory, etc.
# ✅ Step 3: Use FAISS to search similar sentences
import faiss          # ✅ faiss: Facebook AI Similarity Search – used to index and search high-dimensional vectors fast.
import numpy as np    # ✅ numpy: Needed to store and manipulate embeddings (vectors).

# Convert embeddings to float32 (required by FAISS)
embeddings = np.array(embeddings).astype("float32") #🔍 FAISS only works with float32 format. We convert sentence embeddings (from sentence-transformers) to float32.

# Create FAISS index
index = faiss.IndexFlatL2(embeddings.shape[1]) #🧠 Create a FAISS index that uses L2 (Euclidean) distance to measure similarity. embeddings.shape[1] = the number of dimensions per embedding (e.g., 384 or 768).
index.add(embeddings)  # Add your sentence vectors to index. 📚 We add all our sentence vectors to the index — like storing them in memory for search.


#querying faiss
# Search for most similar sentence to a new query
query = model.encode(["I enjoy watching football"]).astype("float32") #🎯 We encode a new query sentence into a vector using the same sentence-transformers model. Convert to float32 for FAISS.

D, I = index.search(query, k=2) #🔍 FAISS searches and gives:
# D: squared L2 distances to the top-k closest vectors (smaller = more similar)
# I: indices of the most similar sentences (positions in the original list)

# Show the results - 🗣️ We print the actual sentences corresponding to the top-k similar results.
print("Most similar sentences:")
for idx in I[0]:
    print(sentences[idx])


Most similar sentences:
I love playing football.
Soccer is my favorite sport.


📌 Summary
FAISS creates a fast search engine using vector similarity.
We use sentence embeddings → build index → query → get closest matches.
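
Conceptually, a flat L2 index does an exhaustive comparison of the query against every stored vector. The sketch below reproduces that idea in plain Python with made-up 2-dimensional vectors; `flat_l2_search` is a hypothetical helper, and FAISS itself is far more optimized than this.

```python
def flat_l2_search(vectors, query, k):
    """Brute-force search: squared L2 distance to every vector, keep the k smallest."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((v - q) ** 2 for v, q in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Hypothetical toy vectors standing in for sentence embeddings.
vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
distances, indices = flat_l2_search(vectors, [0.9, 0.1], k=2)
print(indices)  # [1, 0]
```

The two returned lists play the same roles as `D` and `I` in the FAISS call: distances first, then positions in the original list.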

This finds the sentences closest in meaning using vector similarity:
when a user sends a query, FAISS returns the top 2 sentences most similar to it.
🧠 GOAL:
We want to search similar sentences using semantic meaning (not keyword). This is useful in:

- AI search engines
- Chatbot memory
- Document Q&A
- Retrieval-Augmented Generation (RAG)

In [None]:
# 🔧 Now let's turn this into a reusable search function:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# 1. Load model & encode your sentences
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "I love playing cricket",
    "Artificial Intelligence is the future",
    "Messi is the best football player",
    "AI will change the world",
    "Football is a great sport"
]
embeddings = model.encode(sentences).astype("float32")
emb = model.encode("example sentence")
print(len(emb))         # ➝ 384 (dimensions per embedding)
print(len(embeddings))  # ➝ 5 (number of sentences)
print(embeddings)

# 2. Build the FAISS index
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# 3. 🔍 Define the search function
def semantic_search(query_text, top_k=3):
    query_vec = model.encode([query_text]).astype("float32")
    distances, indices = index.search(query_vec, top_k)
    print("dist", distances, indices)
    return [sentences[i] for i in indices[0]]

# 4. ✅ Try it
print(semantic_search("Tell me about football"))


384
5
[[ 4.9958229e-02  3.3321250e-02 -2.3285332e-03 ...  1.1092138e-02
   8.0104306e-05 -6.7105673e-02]
 [-3.2004926e-02 -1.6430480e-03  2.6563013e-02 ...  3.8728736e-02
   6.6925928e-02 -4.1541487e-02]
 [ 1.8890753e-02  3.5387490e-02 -3.0344751e-02 ...  6.5683750e-03
   1.1034363e-01 -2.4785191e-02]
 [-2.1528115e-02  4.1045551e-03  4.4449449e-02 ... -4.9252391e-02
   1.4991087e-03 -7.1676977e-02]
 [ 2.4371710e-02  3.2307386e-02  1.8054860e-02 ...  3.5636850e-02
   6.0518291e-02  1.3624058e-02]]
dist [[0.56211036 1.2136043  1.2308424 ]] [[4 0 2]]
['Football is a great sport', 'I love playing cricket', 'Messi is the best football player']


Why 384?

- The model turns each sentence into a 384-length vector of numbers
- A vector index (FAISS here, or Pinecone when you supply your own vectors) must know this dimension to store/search properly
- If you used a different model (e.g., one with 768 or 1024 dims), you’d change this

🔍 Output:
This will return 3 sentences most semantically similar to "Tell me about football".

In [8]:
import os
from dotenv import load_dotenv
from pinecone import Pinecone, ServerlessSpec


# Step 1: Load .env file
load_dotenv(dotenv_path=".env")  # Ensure .env is in the same folder

# Step 2: Get API Key
api_key = os.getenv("PINECONE_API_KEY")

# Optional check
print("Loaded:", bool(api_key))  # Should print True


# ✅ Set up the Pinecone client
pc = Pinecone(api_key=api_key)  # or use os.environ.get("PINECONE_API_KEY")

# Note: older pod-based Pinecone projects also needed an environment (a region such as "gcp-starter" or "us-west1-gcp") that told Pinecone where to store and access your vector data. Here the cloud and region are chosen when the index is created.

Loaded: True


In [9]:
# ✅ STEP 3: Create Index (1-time setup)
index_name = "semantic-search-demo"

# ✅ Create index if it doesn't exist
if index_name not in pc.list_indexes().names():
    pc.create_index_for_model(  # Creates a new index with integrated embedding, using a model hosted by Pinecone (llama-text-embed-v2).
        name=index_name,    
        cloud="aws",
        region="us-east-1",
        embed={
            "model": "llama-text-embed-v2",     # Embedding model Pinecone will use (so you don’t need to generate embeddings manually).
            "field_map": {"text": "chunk_text"} # 	Tells Pinecone: “In my documents, take the chunk_text field as the input text.”
        }
    )

💡 Why it's powerful:
You don’t need to create embeddings manually anymore — just send chunks with a chunk_text field, and Pinecone will embed it using llama-text-embed-v2 automatically.

In [10]:
# pushing data to pinecone
index = pc.Index(index_name)  # Connect to the index

# Example chunks - Use upsert_records for text-based inserts:
chunks = [ # chunk_text must match your field_map={"text": "chunk_text"} from when you created the index.
    {
        "_id": "doc1#chunk1",
        "chunk_text": "Pinecone is a vector database for semantic search.",
    },
    {
        "_id": "doc1#chunk2",
        "chunk_text": "It enables Retrieval-Augmented Generation (RAG) applications.",
    }
]

# ✅ Upsert text data — Pinecone handles embeddings
index.upsert_records(namespace="__default__", records=chunks)

# 🔹 What is a Serverless Index in Pinecone?
A serverless index is a type of Pinecone index that:

- Auto-scales: you don’t need to manage or provision resources (no pods, no replicas).
- Can embed text automatically: this is a separate feature (integrated embedding) that you get when you select a hosted embedding model (like llama-text-embed-v2) while creating the index.
- Is generally cheaper and easier to get started with.


# 🧠 Why it matters for us:
You're using:

# pc.create_index_for_model(...)
That automatically creates a serverless index with built-in embedding (integrated embedding), so you don’t need to manually embed text yourself.

And:

# index.search(namespace="...", query={"inputs": {"text": "..."}, "top_k": 3})
✅ No need for sentence-transformers or any separate model.

In [8]:
# let’s run semantic search on the data you just upserted.

# 🔍 Search with a query string — Pinecone auto-embeds it (because your index has embedding built-in)
query_text = "What is Pinecone used for?"

results = index.search(
    namespace="__default__",
    query={
        "inputs": {"text": query_text},  # this uses the integrated embedding -> This will be converted to an embedding using the model attached to the index.
        "top_k": 3                       #  It fetches the top 3 most relevant vector matches.
    },
    fields=["chunk_text", "category"]  # fields to return; our records have no "category", so only chunk_text comes back
)

print(results)


# 🖨️ Print matched results
for match in results["result"]["hits"]:
    print(f"\nScore: {match['_score']}")
    print(f"Text: {match['fields']['chunk_text']}")


{'result': {'hits': [{'_id': 'doc1#chunk1',
                      '_score': 0.41501930356025696,
                      'fields': {'chunk_text': 'Pinecone is a vector database '
                                               'for semantic search.'}},
                     {'_id': 'doc1#chunk2',
                      '_score': 0.06871819496154785,
                      'fields': {'chunk_text': 'It enables Retrieval-Augmented '
                                               'Generation (RAG) '
                                               'applications.'}}]},
 'usage': {'embed_total_tokens': 10, 'read_units': 6}}

Score: 0.41501930356025696
Text: Pinecone is a vector database for semantic search.

Score: 0.06871819496154785
Text: It enables Retrieval-Augmented Generation (RAG) applications.


Now let’s move to chatbot memory using a RAG-style flow.

We already have:
- 🔹 Document chunks stored in Pinecone.
- 🔹 Text queries giving back semantically similar results.

🧠 Now we’ll simulate a chatbot that:
- Takes a user question.
- Searches Pinecone for relevant chunks (memory).
- Uses those chunks as context to answer.

In [16]:
# ✅ Step 1: Function to Search Pinecone
def search_memory(query_text, index, namespace="__default__", top_k=3):
    response = index.search(
        namespace=namespace,
        query={"inputs": {"text": query_text}, "top_k": top_k},
        fields=["chunk_text"]
    )
    return [hit["fields"]["chunk_text"] for hit in response["result"]["hits"]]

In [17]:
# ✅ Step 2: Combine Results into Context
def build_context(chunks):
    return "\n".join(chunks)

In [11]:
# ✅ Step 3: Simulate Chatbot Answering with Context
# We'll use a chat LLM via an OpenAI-compatible API (text-davinci-003 has since been retired); for local testing, just print() the prompt.
user_question = "How is Pinecone used in AI?"

relevant_chunks = search_memory(user_question, index)
context = build_context(relevant_chunks)

final_prompt = f"""Use the context below to answer the question:

Context:
{context}

Question: {user_question}
Answer:"""

print(final_prompt)  # replace this with OpenAI call later


Use the context below to answer the question:

Context:
Pinecone is a vector database for semantic search.
It enables Retrieval-Augmented Generation (RAG) applications.

Question: How is Pinecone used in AI?
Answer:


In [19]:
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv(dotenv_path=".env")  

# Fetch the API key from the environment (loaded from .env)
api_key = os.getenv("OPENROUTER_API_KEY")

# Create an OpenAI-compatible client that points at OpenRouter
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=api_key  # keep the key in .env rather than pasting it into the notebook
)

print("API Key Loaded:", bool(api_key))  # Optional: confirms it's loaded without printing the key

# 🔮 RAG-style Answer Generator
def generate_answer(prompt):
    response = client.chat.completions.create(
        model="mistralai/mistral-7b-instruct",  # Try Claude or GPT if needed
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3
    )
    return response.choices[0].message.content

# 🧠 RAG Flow
user_question = "How does Pinecone help with semantic search?"
relevant_chunks = search_memory(user_question, index)  # <-- defined in an earlier cell
context = build_context(relevant_chunks)               # <-- defined in an earlier cell

# 📝 Build prompt with context
prompt = f"""Use the context below to answer the question:

Context:
{context}

Question: {user_question}
Answer:"""

# 🤖 Get and print answer
response = generate_answer(prompt)
print(response)

API Key Loaded: True
 Pinecone helps with semantic search by providing a vector database specifically designed for it. Semantic search is a type of search that aims to understand the intent and context behind the query, rather than just matching keywords. Pinecone enables this by using Retrieval-Augmented Generation (RAG) applications, which combine the power of machine learning models for understanding the context with the efficiency of vector databases for quick retrieval of relevant information. This allows Pinecone to deliver more accurate and meaningful search results.


Here is a summary of what we built, the step-by-step flow, the purpose, and how to turn it into a real chatbot with chat history. Let’s go!

# ✅ WHAT WE DID
We built a RAG-style (Retrieval-Augmented Generation) chatbot using:

- 🧠 OpenRouter LLMs (free/alternative to OpenAI)
- 🧾 Pinecone vector DB for semantic search
- 🧑‍💻 Your own context data (example chunks)
- 📥 A prompt builder to generate answers from relevant data

# 🎯 PURPOSE
To answer user questions using your own data, not just what the LLM was trained on.

Instead of guessing, the LLM:
- Searches your custom docs via Pinecone (semantic search)
- Gets relevant context
- Answers only based on that

This is broadly how tools like ChatPDF, Notion AI, and ChatGPT-based RAG bots work.

# 🔁 FLOW WE BUILT
# 1. Store Chunks in Pinecone
You:
- Break text into small chunks
- Embed them (using an embedding model)
- Store in Pinecone with IDs

2. Search Pinecone by Semantic Similarity
You:
- Take user's question
- Embed it
- Search in Pinecone for similar chunks

3. Build Context from Top Chunks
You:
- Extract the matched chunks' text
- Combine into a context string

4. Send Prompt to LLM
You:
- Create a prompt that combines the context and the question
- Send it to the LLM (like mistral-7b-instruct) via OpenRouter

5. LLM Answers Based on Context
You:
Print or return the response to the user

In [18]:
# 💬 ADDING CHAT HISTORY (to make it a real chatbot)
# Right now, each prompt is single-turn (no memory).
# To add chat memory, do this:

# ✅ Modified generate_answer function:
chat_history = [
    {"role": "system", "content": "You are a helpful assistant."}
]

def generate_answer(user_input, context):
    chat_history.append({"role": "user", "content": f"""Use the context below to answer the question:

Context:
{context}

Question: {user_input}
Answer:"""})

    response = client.chat.completions.create(
        model="mistralai/mistral-7b-instruct",  # or any model from OpenRouter
        messages=chat_history,
        temperature=0.3
    )

    assistant_reply = response.choices[0].message.content
    chat_history.append({"role": "assistant", "content": assistant_reply})
    return assistant_reply
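
Because every turn appends both the retrieved context and the reply, `chat_history` grows without bound and will eventually exceed the model's context window. One simple mitigation is to keep the system message plus only the most recent messages. `trim_history` below is a hypothetical helper and the limit of 6 messages is an arbitrary assumption; a real app might count tokens instead.

```python
def trim_history(history, max_messages=6):
    """Keep the system message plus the most recent `max_messages` messages."""
    system = [m for m in history if m["role"] == "system"][:1]
    rest = [m for m in history if m["role"] != "system"]
    recent = rest[-max_messages:] if max_messages > 0 else []
    return system + recent

# Build a hypothetical history: 1 system message + 10 alternating turns.
demo_history = [{"role": "system", "content": "You are a helpful assistant."}]
for turn in range(5):
    demo_history.append({"role": "user", "content": f"question {turn}"})
    demo_history.append({"role": "assistant", "content": f"answer {turn}"})

trimmed = trim_history(demo_history)
print(len(trimmed))            # 7
print(trimmed[0]["role"])      # system
print(trimmed[-1]["content"])  # answer 4
```

In the chat loop, the trimmed list would be passed as `messages` in place of the full history.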

In [19]:
# 🔁 How to use it in a loop:
while True:
    user_question = input("You: ")
    if user_question.lower() in ["exit", "quit"]:
        print("Chat ended.")
        break

    relevant_chunks = search_memory(user_question, index)
    context = build_context(relevant_chunks)

    reply = generate_answer(user_question, context)
    print("Bot:", reply)

Bot:  Hello! I'm here to help. Pinecone is a vector database designed for semantic search. It supports Retrieval-Augmented Generation (RAG) applications. How can I assist you with Pinecone or RAG today?
Bot:  Certainly! Pinecone is a vector database designed for semantic search. It enables Retrieval-Augmented Generation (RAG) applications, which are used to generate responses to user queries by combining retrieved data with pre-defined templates. This makes it a powerful tool for a variety of applications such as chatbots, content generation, and recommendation systems.
Chat ended.





In [13]:
# Build a simple chat UI using ipywidgets
import ipywidgets as widgets
from IPython.display import display, clear_output

In [14]:
# ✅ Step 2: Setup Chat History + Functions
chat_history = [
    {"role": "system", "content": "You are a helpful assistant."}
]

# Chat handler
def generate_answer(user_input, context):
    chat_history.append({"role": "user", "content": f"""Use the context below to answer the question:

Context:
{context}

Question: {user_input}
Answer:"""})

    response = client.chat.completions.create(
        model="mistralai/mistral-7b-instruct",
        messages=chat_history,
        temperature=0.3
    )

    reply = response.choices[0].message.content
    chat_history.append({"role": "assistant", "content": reply})
    return reply


In [15]:
# ✅ Step 3: Create Input/Output Widgets
input_box = widgets.Text(
    placeholder='Type your question here...',
    description='You:',
    layout=widgets.Layout(width='90%')
)

output_area = widgets.Output()

In [16]:
# ✅ Step 4: Create Callback + Display UI
def on_submit(change):
    user_q = input_box.value
    input_box.value = ''  # Clear input

    with output_area:
        # 🔍 RAG: Search + Context
        relevant_chunks = search_memory(user_q, index)
        context = build_context(relevant_chunks)

        # 🤖 Get Answer
        reply = generate_answer(user_q, context)
        print(f"\nYou: {user_q}")
        print(f"Bot: {reply}")

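# ✅ Attach Submit Event (fires when Enter is pressed)
# Note: on_submit is deprecated in ipywidgets 8 and emits a DeprecationWarning; the suggested replacement is to set continuous_update=False and observe the 'value' trait
input_box.on_submit(on_submit)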

# ✅ Show UI
display(input_box, output_area)



  input_box.on_submit(on_submit)


Text(value='', description='You:', layout=Layout(width='90%'), placeholder='Type your question here...')

Output()

Let’s add PDF support into your RAG chatbot before turning it into a web app. Here's the plan:

✅ PDF Ingestion Plan (Phase 4: RAG with Real Docs)
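🔹 Step 1: Install Dependencies
Install pypdf and ipywidgets (for example, !pip install pypdf ipywidgets in a notebook cell).

🔹 Step 2: Extract Text from PDF
We'll use pypdf (the maintained successor to PyPDF2); unstructured is an alternative for more accurate parsing.

🔹 Step 3: Chunk the Text
Split the large PDF text into smaller overlapping chunks. The notebook code below uses 300 characters with a 50-character overlap; chunks of 200–300 words are another option.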

🔹 Step 4: Push to Pinecone
Each chunk will be upserted with a unique _id and chunk_text.

In [22]:
# ✅ Step 2: Upload + Extract Text from PDF
from pypdf import PdfReader
from io import BytesIO

# 🔼 Upload PDF using file upload widget
from IPython.display import display
import ipywidgets as widgets

upload_btn = widgets.FileUpload(accept='.pdf', multiple=False)
display(upload_btn)


FileUpload(value=(), accept='.pdf', description='Upload')

In [11]:
from pypdf import PdfReader
from io import BytesIO
from IPython.display import display
import ipywidgets as widgets

# 📥 Step 1: Upload widget
upload_btn = widgets.FileUpload(accept='.pdf', multiple=False)
display(upload_btn)

# 📤 Step 2: Extract after file is uploaded
def extract_pdf_text(file_upload_widget):
    if file_upload_widget.value:
        # ✅ Extract the first uploaded file from the tuple
        uploaded_file = file_upload_widget.value[0]
        content = uploaded_file['content']
        reader = PdfReader(BytesIO(content))

        # ✅ Extract text from all pages (pages with no extractable text count as empty)
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        return text
    return None

FileUpload(value=(), accept='.pdf', description='Upload')

In [12]:
pdf_text = extract_pdf_text(upload_btn)
print(pdf_text[:1000])  # preview the first 1000 characters

Rakshath U Shetty
Bengaluru, India | +91-8951085835 | rakshathushettyu6@gmail.com | LinkedIn | Github | Portfolio
EDUCATION
Dayananda Sagar University Bengaluru, India
Bachelor of Technology in Computer Science. (SGPA:10 CGPA: 9.31) Graduation Date: May 2024
EXPERIENCE
Startup Role: Software Engineer Intern Jul 2024 - Dec 2024
Freo - MoneyTap| Backend Java Developer Bengaluru, India
• Developed and debugged code, leveraging log analysis to resolve issues and maintain system stability; written SDK test scripts
in REST to automate FinFlux testing, reducing manual testing time by 40% and improving deployment reliability by 25%.
• Directed the repayments service, owning the process in a 6-member backend team, automating updates for customer
collections from lending partners, leading to a 30% efficiency gain and 20% improvement in payment accuracy.
Startup Role: Full Stack Developer, Junior Data Engineer Sept 2023 - Feb 2024
AnTech Crew - Freelancing Startup Bengaluru, India
• Pioneered the

In [13]:
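# Let's chunk the PDF text and push it to Pinecone using the same logic we used earlier.
# ✅ Step 1: Split PDF Text into Chunks

def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into chunks of chunk_size characters; neighbors share `overlap` characters."""
    if overlap >= chunk_size:
        # Otherwise `start` would never advance and the loop would not end
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start += chunk_size - overlap
    return chunks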

pdf_chunks = chunk_text(pdf_text)
print(f"Total chunks: {len(pdf_chunks)}")
print(pdf_chunks[:2])  # Preview first 2

Total chunks: 21
['Rakshath U Shetty\nBengaluru, India | +91-8951085835 | rakshathushettyu6@gmail.com | LinkedIn | Github | Portfolio\nEDUCATION\nDayananda Sagar University Bengaluru, India\nBachelor of Technology in Computer Science. (SGPA:10 CGPA: 9.31) Graduation Date: May 2024\nEXPERIENCE\nStartup Role: Software Enginee', 'May 2024\nEXPERIENCE\nStartup Role: Software Engineer Intern Jul 2024 - Dec 2024\nFreo - MoneyTap| Backend Java Developer Bengaluru, India\n• Developed and debugged code, leveraging log analysis to resolve issues and maintain system stability; written SDK test scripts\nin REST to automate FinFlux testing']
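
The character-based chunker above can cut a word in half (note 'Software Enginee' at the end of the first chunk). A minimal sketch of a word-based alternative with overlap follows; the name `chunk_words` and its default sizes are assumptions for illustration, not part of the notebook.

```python
def chunk_words(text, chunk_size=200, overlap=30):
    """Split text into chunks of at most `chunk_size` words; neighbors share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = "one two three four five six seven eight nine ten"
print(chunk_words(sample, chunk_size=4, overlap=1))
# ['one two three four', 'four five six seven', 'seven eight nine ten']
```

Splitting on whitespace discards the original line breaks, so this trades layout fidelity for chunks that never break inside a word.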


In [14]:
# ✅ Step 2: Format and Push to Pinecone
# We assume your Pinecone index was already created with this setup:
# index_name = "semantic-search-demo"
# index = pc.Index(index_name)

records = []
for i, chunk in enumerate(pdf_chunks):
    records.append({
        "_id": f"pdf#chunk{i}",
        "chunk_text": chunk
    })

index.upsert_records(namespace="pdf-doc", records=records)
print("✅ Uploaded to Pinecone!")



✅ Uploaded to Pinecone!
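
A resume produces only a couple of dozen records, but a longer PDF can produce hundreds, and upsert requests are usually subject to a size limit. The sketch below splits a record list into batches; the helper name `batched` and the batch size of 50 are assumptions, so check Pinecone's current limits before choosing a value.

```python
def batched(records, batch_size=50):
    """Yield consecutive slices of `records` with at most `batch_size` items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# Hypothetical records in the same shape as the cell above
demo_records = [{"_id": f"pdf#chunk{i}", "chunk_text": f"text {i}"} for i in range(120)]
batch_sizes = [len(batch) for batch in batched(demo_records)]
print(batch_sizes)  # [50, 50, 20]

# With a real index, each batch would be sent separately:
# for batch in batched(records):
#     index.upsert_records(namespace="pdf-doc", records=batch)
```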


In [24]:
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv(dotenv_path=".env")

# Read the API key from the environment (avoid hard-coding it in the notebook)
api_key = os.getenv("OPENROUTER_API_KEY")

# Create an OpenAI-compatible client that points at OpenRouter
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=api_key
)

print("API Key Loaded:", bool(api_key))  # Confirms the key is set without printing it

# 🔮 RAG-style Answer Generator
def generate_answer(prompt):
    response = client.chat.completions.create(
        model="mistralai/mistral-7b-instruct",  # Try Claude or GPT if needed
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3
    )
    return response.choices[0].message.content

# 🧠 RAG Flow
user_question = "what projects rakshath worked on?"
relevant_chunks = search_memory(user_question, index, namespace="pdf-doc")  # <-- function defined earlier
context = build_context(relevant_chunks)               # <-- function defined earlier

# 📝 Build prompt with context
prompt = f"""Use the context below to answer the question:

Context:
{context}

Question: {user_question}
Answer:"""

# 🤖 Get and print answer
response = generate_answer(prompt)
print(response)

API Key Loaded: True
 Rakshath U Shetty has worked on two notable projects as per the provided context:

1. UR-Connect: This was a project that Rakshath led his team to participate in and place in the top 2 at a highly competitive State Level Project Competition held at BIT - Bengaluru Institute of Technology. The competition had over 300+ participants.

2. Code 360: While working as a Software Engineer at a startup, Rakshath participated in Code 360, where he demonstrated exceptional skills and resolved over 1000 complex problems. He also achieved the Grand Master, Dominator (highest rank among 7 leagues), and MASTER titles in this competition.


In [22]:
# ✅ PDF Q&A Inside while Loop
chat_history = [
    {"role": "system", "content": "You are a helpful assistant."}
]

while True:
    user_question = input("You: ")
    if user_question.strip().lower() in ["exit", "quit"]:
        print("Chat ended.")
        break

    # 1. Search Pinecone for relevant PDF chunks (namespace="pdf-doc")
    relevant_chunks = search_memory(user_question, index, namespace="pdf-doc")

    # 2. Build the context from those chunks
    context = build_context(relevant_chunks)

    # 3. Append question to chat history
    chat_history.append({
        "role": "user",
        "content": f"""Use the context below to answer the question:

Context:
{context}

Question: {user_question}
Answer:"""
    })

    # 4. Get the LLM-generated response
    response = client.chat.completions.create(
        model="mistralai/mistral-7b-instruct",
        messages=chat_history,
        temperature=0.3
    )

    # 5. Extract and display reply
    assistant_reply = response.choices[0].message.content
    chat_history.append({"role": "assistant", "content": assistant_reply})
    print("Bot:", assistant_reply)


Chat ended.
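
Because every user turn in the loop above embeds the retrieved context, chat_history grows quickly and can eventually exceed the model's context window. One possible mitigation, sketched here under assumptions (the helper `trim_history` and the `max_turns` value are hypothetical, not part of the notebook), is to keep the system message plus only the most recent question/answer pairs.

```python
def trim_history(history, max_turns=4):
    """Keep all system messages plus the last `max_turns` user/assistant pairs."""
    system_msgs = [m for m in history if m["role"] == "system"]
    dialogue = [m for m in history if m["role"] != "system"]
    keep = 2 * max_turns  # one user message + one assistant message per turn
    recent = dialogue[-keep:] if keep > 0 else []
    return system_msgs + recent

# Hypothetical five-turn conversation
demo = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(5):
    demo.append({"role": "user", "content": f"question {i}"})
    demo.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(demo, max_turns=2)
print([m["content"] for m in trimmed])
# ['You are a helpful assistant.', 'question 3', 'answer 3', 'question 4', 'answer 4']
```

Calling it right after the assistant reply is appended keeps the user/assistant pairs aligned.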


✅ Step 1: Ask a Question
Just type a question like:

- "What projects has Rakshath worked on?"
- "What skills does this resume highlight?"
- "Tell me about Rakshath’s work experience?"

🧠 How It Works (Quick Recap):

- Search your uploaded resume chunks in the pdf-doc namespace in Pinecone.
- Grab the top matches based on semantic similarity.
- Pass them to the LLM via OpenRouter.
- The LLM is prompted to answer from your resume (RAG-style).


The questioning logic of our RAG-style PDF chatbot (semantic search + OpenRouter LLM) is the loop shown above: search, build context, prompt, answer.

🧠 Why This Works
- We use semantic search to find only relevant info from your PDF.
- Then we let the LLM generate an answer based on that context (RAG).
- This steers the bot toward answering from your data rather than from its training data; the prompt as written does not strictly forbid outside knowledge.
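
To push the model harder toward context-only answers, the prompt can say so explicitly. This is a minimal sketch: the helper `build_rag_prompt`, the refusal wording, and the sample context are all assumptions, and a model may still ignore the instruction.

```python
def build_rag_prompt(context, question):
    """Return a prompt that asks the model to answer strictly from `context`."""
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, reply \"I don't know.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical context and question, for illustration only
prompt = build_rag_prompt("The sky is green on planet Zog.", "What color is the sky on Zog?")
print(prompt)
```

The returned string can be used as the user message content in place of the inline f-string in the loop.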

✅ You can now reuse this for PDFs, Docs, YouTube transcripts, etc.

✅ What You’ve Completed:
You now have a full pipeline for:
- 📄 Uploading a PDF
- 🧠 Extracting and chunking the text
- 🔍 Indexing into Pinecone
- 🤖 Asking questions using semantic search + LLM (RAG)

In [None]:
# ✅ Streamlit Web App: PDF RAG Chatbot

import streamlit as st
from pypdf import PdfReader
from io import BytesIO
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from openai import OpenAI
import os
from dotenv import load_dotenv

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # avoids the tokenizers fork/parallelism warning


# Load .env for OpenRouter API
load_dotenv(dotenv_path=".env")
api_key = os.getenv("OPENROUTER_API_KEY")
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=api_key)

# Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Streamlit UI
st.title("📄 Chat with Your PDF (RAG + FAISS style)")

uploaded_file = st.file_uploader("Upload a PDF", type="pdf")

if uploaded_file:
    reader = PdfReader(uploaded_file)
    raw_text = "\n".join([page.extract_text() for page in reader.pages if page.extract_text()])

    # Split into chunks
    def chunk_text(text, chunk_size=300):
        words = text.split()
        return [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

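    chunks = chunk_text(raw_text)
    if not chunks:
        # Scanned or image-only PDFs may yield no text; stop before building the index
        st.error("No extractable text found in this PDF.")
        st.stop()
    embeddings = model.encode(chunks).astype("float32")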

    # Create FAISS index
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)

    # Chat history
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = [
            {"role": "system", "content": "You are a helpful assistant."}
        ]

    # Search top-k relevant chunks
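    def search_context(query, top_k=3):
        query_vec = model.encode([query]).astype("float32")
        # Never ask for more neighbors than there are chunks; FAISS pads missing results with -1
        _, I = index.search(query_vec, min(top_k, len(chunks)))
        return [chunks[i] for i in I[0] if i != -1]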

    # Generate answer
    def generate_answer(user_q):
        context = "\n".join(search_context(user_q))
        prompt = f"""Use the context below to answer the question:

Context:
{context}

Question: {user_q}
Answer:"""
        st.session_state.chat_history.append({"role": "user", "content": prompt})
        response = client.chat.completions.create(
            model="mistralai/mistral-7b-instruct",
            messages=st.session_state.chat_history,
            temperature=0.3
        )
        reply = response.choices[0].message.content
        st.session_state.chat_history.append({"role": "assistant", "content": reply})
        return reply

    # Input box
    user_input = st.text_input("Ask a question about the PDF")
    if user_input:
        with st.spinner("Thinking..."):
            answer = generate_answer(user_input)
            st.markdown(f"**You:** {user_input}")
            st.markdown(f"**Bot:** {answer}")


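# If the `streamlit` command is not found after installing, add the user-level Python bin directory to PATH (macOS, zsh):
# echo 'export PATH=$PATH:~/Library/Python/3.9/bin' >> ~/.zshrc
# source ~/.zshrc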
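
The search_context function in the app above relies on faiss.IndexFlatL2, which performs an exact, brute-force nearest-neighbor search by L2 distance. The dependency-free sketch below shows the same idea on toy 2-D vectors; the function name `l2_search` and the data are assumptions for illustration, and a real app would keep FAISS for speed.

```python
def l2_search(vectors, query, top_k=3):
    """Return indices of the `top_k` vectors closest to `query` by squared L2 distance."""
    distances = [
        (sum((a - b) ** 2 for a, b in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    ]
    distances.sort()  # smallest distance first; ties are broken by index
    return [idx for _, idx in distances[:top_k]]

# Toy "embeddings": vector 1 equals the query, vector 3 is close to it
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
print(l2_search(toy_vectors, [1.0, 1.0], top_k=2))  # [1, 3]
```

Slicing with `[:top_k]` simply returns fewer results when there are fewer vectors than requested, so no padding values need to be filtered out.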




In [None]:
# ✅ Streamlit Web App: PDF RAG Chatbot (Fixed Version)
code = """
import streamlit as st
from pypdf import PdfReader
from io import BytesIO
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from openai import OpenAI
import os
from dotenv import load_dotenv

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # avoids the tokenizers fork/parallelism warning

# Load .env for OpenRouter API
load_dotenv(dotenv_path=".env")
api_key = os.getenv("OPENROUTER_API_KEY")
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=api_key)

# Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Streamlit UI
st.title("📄 Chat with Your PDF (RAG + FAISS style)")

uploaded_file = st.file_uploader("Upload a PDF", type="pdf")

if uploaded_file:
    reader = PdfReader(uploaded_file)
    raw_text = "\\n".join([page.extract_text() for page in reader.pages if page.extract_text()])

    # Split into chunks
    def chunk_text(text, chunk_size=300):
        words = text.split()
        return [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

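    chunks = chunk_text(raw_text)
    if not chunks:
        # Scanned or image-only PDFs may yield no text; stop before building the index
        st.error("No extractable text found in this PDF.")
        st.stop()
    embeddings = model.encode(chunks).astype("float32")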

    # Create FAISS index
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)

    # Chat history
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = [
            {"role": "system", "content": "You are a helpful assistant."}
        ]

    # Search top-k relevant chunks
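    def search_context(query, top_k=3):
        query_vec = model.encode([query]).astype("float32")
        # Never ask for more neighbors than there are chunks; FAISS pads missing results with -1
        _, I = index.search(query_vec, min(top_k, len(chunks)))
        return [chunks[i] for i in I[0] if i != -1]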

    # Generate answer
    def generate_answer(user_q):
        context = "\\n".join(search_context(user_q))
        prompt = (
            "Use the context below to answer the question:\\n\\n"
            f"Context:\\n{context}\\n\\n"
            f"Question: {user_q}\\n"
            "Answer:"
        )
        st.session_state.chat_history.append({"role": "user", "content": prompt})
        response = client.chat.completions.create(
            model="mistralai/mistral-7b-instruct",
            messages=st.session_state.chat_history,
            temperature=0.3
        )
        reply = response.choices[0].message.content
        st.session_state.chat_history.append({"role": "assistant", "content": reply})
        return reply

    # Input box
    user_input = st.text_input("Ask a question about the PDF")
    if user_input:
        with st.spinner("Thinking..."):
            answer = generate_answer(user_input)
            st.markdown(f"**You:** {user_input}")
            st.markdown(f"**Bot:** {answer}")
"""

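# Save to file (UTF-8 so the emoji in the title is written correctly on every platform)
with open("chat_pdf_app.py", "w", encoding="utf-8") as f:
    f.write(code)

# Then, from a terminal: streamlit run chat_pdf_app.py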

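🧠 Why You See the TOKENIZERS_PARALLELISM Warning
Hugging Face's tokenizers library supports parallel processing (multi-threading).

But when a tokenizer has already run in parallel and the process is then forked (as can happen inside a web app like Streamlit, or in any code that uses multiprocessing), the tokenizers library disables its own parallelism to avoid deadlocks.

It shows this warning to inform you. Setting the TOKENIZERS_PARALLELISM environment variable explicitly (to "false" or "true") before the tokenizer is used silences it.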
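
A minimal sketch of the workaround, assuming the variable is set at the very top of the script, before any library that loads a tokenizer is imported:

```python
import os

# Set this early, before transformers / sentence_transformers load a tokenizer
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])  # false
```

The value "false" trades some tokenization speed for a quiet, fork-safe start-up, which is usually acceptable for a small Streamlit app.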