## **Build Your Own AI-Powered Article Search with Pinecone + Gemini**

This notebook guides you through building a smart Q&A system that searches a large collection of technical articles and answers natural language questions using Google’s Gemini model.

It combines:
- **Semantic search** using vector embeddings + Pinecone  
- **LLM-generated answers** based on retrieved context using Gemini

---

#### What You’ll Learn

1. **Setup**
   - Clone dataset and install required libraries  
2. **Data Preparation**
   - Merge article CSVs and convert to JSON  
3. **Embedding Generation**
   - Generate embeddings using Sentence Transformers or Qwen  
4. **Pinecone Integration**
   - Create an index and upload embeddings  
5. **Question Answering**
   - Query with a user question and get a Gemini-generated answer  

---

### Dataset Overview

A curated set of technical articles from GeeksforGeeks, merged from multiple CSV files.

- **Rows**: 49,328  
- **Columns**:
  - `title`: Topic/title of the article  
  - `article`: Full content of the article  

### Sample

| Title                             | Article (snippet)                            |
|----------------------------------|-----------------------------------------------|
| What’s New in PHP 7 ?           | Prerequisite PHP 7 Features Set 1...          |
| Kotlin Inheritance              | Kotlin supports inheritance which allows...   |
| Merge two sorted linked list... | Merge two sorted linked list of size n1...    |

> Articles cover topics like programming, web dev, data structures, and more—perfect for powering GenAI search tools.

---

By the end, you'll have a fully functional RAG system ready for real-world applications.


### Download Preprocessed Data & Pretrained Embeddings

To save time, you can directly download the **cleaned dataset** and **pretrained embeddings** from:

- Hugging Face [(articles_dataset_for_rag)](https://huggingface.co/datasets/ashish-jangra/articles_dataset_for_RAG)  
- Kaggle [(articles_dataset_for_rag)](https://www.kaggle.com/datasets/ashishjangra27/articles-dataset-for-rag)


> These files can be loaded directly into the notebook to skip data processing and embedding generation steps.



### 1. Data Preparation

#### 1.1) Get the data & install dependencies

In [None]:
!git clone https://github.com/AshishJangra27/datasets/
!pip install pinecone

#### 1.2) Import Required Libraries

In [1]:
import os
import json
import time
import unicodedata

import pandas as pd
import torch
import torch.nn.functional as F

from google import genai
from google.genai import types
from pinecone import Pinecone, ServerlessSpec
from tqdm.auto import tqdm  # picks the notebook progress bar automatically when available
from transformers import AutoModel, AutoTokenizer

#### 1.3) Set Up API Keys for Pinecone and Gemini

In [2]:
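# Read keys from the environment instead of hardcoding them in the notebook.
# The variable names PINECONE_API_KEY and GEMINI_API_KEY are this notebook's choice.
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY", "")
YOUR_GEMINI_API_KEY = os.getenv("GEMINI_API_KEY", "")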
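
If a key is missing, the later Pinecone and Gemini calls fail with errors that are harder to trace back to configuration. The following is a minimal sketch of a fail-fast check. The helper name `require_key` and the environment variable names are assumptions made for illustration, not part of either SDK.

```python
import os


def require_key(name, env=None):
    """Return the value stored under `name`, or raise if it is unset or blank."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook.")
    return value


# An explicit mapping stands in for os.environ so the sketch runs without real credentials.
demo_env = {"PINECONE_API_KEY": "demo-key", "GEMINI_API_KEY": "  "}
print(require_key("PINECONE_API_KEY", demo_env))
```

In the notebook itself, `require_key("PINECONE_API_KEY")` would read from the real environment.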

#### 1.4) Load and Combine the Datasets

- List all CSV files from dataset folder  
- Read and combine them into a single DataFrame  
- Save the merged data as `data.csv`

In [None]:
data_dir = '/content/datasets/GFG Articles/data'
csvs = sorted(name for name in os.listdir(data_dir) if name.endswith('.csv'))

# Read every CSV, then concatenate once; ignore_index avoids duplicate row labels.
frames = [pd.read_csv(os.path.join(data_dir, name)) for name in tqdm(csvs)]
df = pd.concat(frames, ignore_index=True)

df.to_csv('data.csv', index = False)

#### 1.5) Data Formatting

- Extract `title` and `article` columns  
- Store as a list of dictionaries in `articles.json`  
- Delete the original DataFrame to free memory  


```json
[
  {
    "id": "Understanding Linear Regression",
    "text": "Linear regression is a fundamental algorithm in supervised learning..."
  },
  {
    "id": "Introduction to Neural Networks",
    "text": "Neural networks are inspired by the structure of the human brain..."
  }
]
```


In [None]:
articles = []
for _, row in tqdm(df.iterrows(), total=len(df)):  # total lets tqdm show real progress
    articles.append({"id": row['title'], "text": row['article']})

with open('articles.json', 'w') as f:
    json.dump(articles, f)

del df
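
pandas reads empty CSV cells as `NaN` floats, and a `NaN` title or article would break the string handling in the embedding step. The following is a minimal sketch of filtering such rows before writing `articles.json`. The helper names `is_usable` and `clean_records` are hypothetical, and the sample rows are made up for illustration.

```python
import math


def is_usable(value):
    """True for non-empty strings; False for NaN, None, or blank values."""
    if isinstance(value, float) and math.isnan(value):
        return False
    return isinstance(value, str) and bool(value.strip())


def clean_records(rows):
    """Keep rows with a usable title and article, in the {"id", "text"} shape."""
    return [
        {"id": row["title"].strip(), "text": row["article"].strip()}
        for row in rows
        if is_usable(row.get("title")) and is_usable(row.get("article"))
    ]


rows = [
    {"title": "Kotlin Inheritance", "article": "Kotlin supports inheritance..."},
    {"title": float("nan"), "article": "Orphaned text"},
    {"title": "Empty article", "article": "   "},
]
cleaned = clean_records(rows)
print(len(cleaned), "of", len(rows), "rows kept")
```

With a DataFrame, the same rows could be passed in as `df.to_dict("records")`.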

### 2. Generate Embeddings

#### 2.1) Generate Embeddings with all-MiniLM-L6-v2

- Set up device for GPU/CPU  
- Load tokenizer and model from `sentence-transformers/all-MiniLM-L6-v2`  
- Define helper functions:
  - `to_ascii_id`: Converts a title to an ASCII-only ID by dropping characters that have no ASCII form
  - `get_embeddings`: Generates sentence embeddings by mean-pooling the model's token vectors over the attention mask  
- Load `articles.json`  
- Generate embeddings for each article  
- Save results to `article_embeddings.json`  


```json
[
  {
    "id": "Understanding_Linear_Regression",
    "embedding": [0.0123, -0.0541, ..., 0.0998],
    "text": "Linear regression is a fundamental algorithm in supervised learning..."
  },
  ...
]
```

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2").to(device)

def to_ascii_id(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

def get_embeddings(texts):
    """Return one mean-pooled embedding (a list of floats) per input text."""
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt').to(device)
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Mean pooling: average the token vectors, counting only non-padding positions.
    mask = encoded_input['attention_mask'].unsqueeze(-1)
    sentence_embeddings = (model_output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    return sentence_embeddings.tolist()


with open('articles.json', 'r') as file:
    articles = json.load(file)

embeddings_list = []

for article in tqdm(articles, desc="Generating embeddings"):
    text = article["text"]
    # Build the ID outside the try block so a bad title cannot raise inside the except branch.
    article_id = to_ascii_id(str(article["id"]))
    try:
        embedding = get_embeddings([text])[0]
    except Exception as e:
        # Keep the record with a null embedding so failures can be filtered out before upload.
        print(f"Error generating embedding for article {article['id']}: {e}")
        embedding = None
    embeddings_list.append({"id": article_id, "embedding": embedding, "text": text})


print(f"Generated embeddings for {len(embeddings_list)} articles.")

with open('article_embeddings.json', 'w') as f:
    json.dump(embeddings_list, f)
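
To make the pooling step in `get_embeddings` concrete, here is a small worked example of masked mean pooling in plain Python. It is an illustrative sketch on hand-made numbers, not the model's real output: the function name `masked_mean_pool` and the toy vectors are assumptions.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if count == 0:
        raise ValueError("attention mask has no real tokens")
    return [total / count for total in totals]


# Two real tokens followed by one padding token with large values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)
print(pooled)
```

The padding vector does not affect the result, which is why the notebook multiplies by the attention mask before summing.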

#### 2.2) Generate Embeddings with Qwen3 (**Optional**)

Qwen3-Embedding-0.6B is a much heavier model than all-MiniLM-L6-v2; expect embedding generation to take around **15 hours** on a GPU.

- Define `last_token_pool` to take the hidden state of each sequence's last non-padding token as its embedding
- Define `to_ascii_id` for ID normalization
- Load tokenizer and model from `Qwen/Qwen3-Embedding-0.6B`
- Set device (GPU/CPU)
- Define `get_qwen_embeddings` to encode and normalize text embeddings
- Load articles from `articles.json`
- Generate embeddings and save as `article_embeddings_qwen.json`

```json
[
  {
    "id": "Understanding_Linear_Regression",
    "embedding": [0.0142, -0.0357, ..., 0.0871],
    "text": "Linear regression is a fundamental algorithm in supervised learning..."
  },
  ...
]
```

In [None]:
def last_token_pool(last_hidden_states, attention_mask):
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        seq_lens = attention_mask.sum(dim=1) - 1
        return last_hidden_states[torch.arange(last_hidden_states.size(0)), seq_lens]


def to_ascii_id(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", padding_side='left')
model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B").to(device)


def get_qwen_embeddings(texts):
    encoded = tokenizer(texts, padding=True, truncation=True, max_length=8192, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model(**encoded)
    pooled = last_token_pool(output.last_hidden_state, encoded['attention_mask'])
    return F.normalize(pooled, p=2, dim=1).cpu().tolist()


with open('articles.json', 'r') as f:
    articles = json.load(f)

results = []
for article in tqdm(articles, desc="Generating Qwen embeddings"):
    text = article["text"]
    # Build the ID outside the try block so a bad title cannot raise inside the except branch.
    article_id = to_ascii_id(str(article["id"]))
    try:
        emb = get_qwen_embeddings([text])[0]
    except Exception as e:
        print(f"Error with {article['id']}: {e}")
        emb = None
    results.append({"id": article_id, "embedding": emb, "text": text})

with open('article_embeddings_qwen.json', 'w') as f:
    json.dump(results, f)
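
The idea behind `last_token_pool` followed by `F.normalize` can be shown on a single sequence in plain Python. This is a simplified sketch under stated assumptions: it handles one sequence at a time, and the names `last_token_pool_1d` and `l2_normalize` and the toy vectors are illustrative only.

```python
import math


def last_token_pool_1d(token_vectors, attention_mask):
    """Return the vector of the last real (mask == 1) token in one sequence."""
    last_real = max(i for i, keep in enumerate(attention_mask) if keep)
    return token_vectors[last_real]


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)


# The same two real tokens, once with right padding and once with left padding.
right_padded = last_token_pool_1d([[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
left_padded = last_token_pool_1d([[9.0, 9.0], [1.0, 0.0], [3.0, 4.0]], [0, 1, 1])
unit = l2_normalize(right_padded)
print(right_padded, left_padded, unit)
```

Both padding layouts select the same token, which is what the two branches of `last_token_pool` achieve on batched tensors.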

### 3. Push Embeddings to Pinecone

#### 3.1) Initialize Index on Pinecone

- Initialize Pinecone client using API key  
- Set index name as `vector-db`  
- Create index if it doesn't already exist  
- Index configuration:
  - **Dimension**: 384 (the output size of all-MiniLM-L6-v2)  
  - **Metric**: Cosine similarity  
  - **Cloud**: AWS  
  - **Region**: us-east-1  
- Print list of available indexes  

In [None]:
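# Pass the key itself; os.getenv(PINECONE_API_KEY) would look up the key's value as a variable name and return None.
pc = Pinecone(api_key=PINECONE_API_KEY)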

index_name = "vector-db"

# if pc.has_index(index_name):
#     pc.delete_index(name=index_name)

if index_name not in [info["name"] for info in pc.list_indexes()]:
    pc.create_index(
        name=index_name,
        dimension=384,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

print(pc.list_indexes())
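
An index only accepts vectors whose length matches its configured dimension, so it can help to check the saved embeddings before uploading. The following is a minimal sketch of such a check. The helper name `split_by_dimension` is hypothetical, and the example uses 3-dimensional toy vectors in place of the 384-dimensional ones.

```python
def split_by_dimension(records, expected_dim=384):
    """Separate records whose embedding length matches the index dimension from the rest."""
    ok, bad = [], []
    for record in records:
        embedding = record.get("embedding")
        if isinstance(embedding, list) and len(embedding) == expected_dim:
            ok.append(record)
        else:
            bad.append(record)
    return ok, bad


records = [
    {"id": "a", "embedding": [0.1, 0.2, 0.3]},
    {"id": "b", "embedding": [0.1, 0.2]},   # wrong length
    {"id": "c", "embedding": None},          # embedding failed earlier
]
ok, bad = split_by_dimension(records, expected_dim=3)
print([r["id"] for r in ok], [r["id"] for r in bad])
```

The same check would flag embeddings produced by a different model than the one the index was sized for.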

#### 3.2) Push Vectors to Pinecone

- Initialize Pinecone client and connect to `vector-db` index  
- Define utility functions:
  - `to_ascii_id`: Ensures valid ASCII IDs  
  - `truncate_text`: Limits metadata to max byte size  
- Load `article_embeddings.json` and filter out entries with missing embeddings  
- Batch upload to Pinecone with:
  - **Batch size**: 50  
  - **Max metadata**: 40,000 bytes  
  - **Delay**: 1 second between batches  
- Upload as: `(id, embedding, metadata)`  
- Print total uploaded vector count  


In [None]:
pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index("vector-db")

def to_ascii_id(text):
    """Convert text to ASCII-only string for Pinecone vector IDs."""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

def truncate_text(text, max_bytes):
    """Truncate text to a maximum number of bytes, ensuring valid UTF-8."""
    truncated_text = text.encode('utf-8')[:max_bytes].decode('utf-8', 'ignore')
    return truncated_text


with open('article_embeddings.json', 'r') as file:
    articles_with_embeddings = json.load(file)

articles_with_embeddings = [article for article in articles_with_embeddings if article["embedding"] is not None]

BATCH_SIZE = 50
DELAY = 1
MAX_METADATA_BYTES = 40000

total_batches = (len(articles_with_embeddings) + BATCH_SIZE - 1) // BATCH_SIZE

for batch_start in tqdm(
    range(0, len(articles_with_embeddings), BATCH_SIZE),
    total=total_batches,
    desc="Upserting batches",
    unit="batch"
):
    batch = articles_with_embeddings[batch_start:batch_start + BATCH_SIZE]

    vectors = []
    for article in batch:
        truncated_text = truncate_text(article["text"], MAX_METADATA_BYTES)
        vectors.append((article["id"], article["embedding"], {"text": truncated_text}))

    index.upsert(vectors=vectors)

    if batch_start + BATCH_SIZE < len(articles_with_embeddings):
        time.sleep(DELAY)

print(f"Upserted {len(articles_with_embeddings)} article embeddings to vector-db")
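
Upserting a vector whose ID already exists overwrites the earlier vector, and `to_ascii_id` can map different titles to the same ID or to an empty string. The following is a sketch of one way to keep IDs unique and non-empty. The helper name `unique_ids`, the `article-N` fallback, and the 512-character cap are assumptions chosen for illustration; check the current Pinecone limits before relying on the cap.

```python
import unicodedata


def to_ascii_id(text):
    """Same normalization as the notebook helper: drop characters with no ASCII form."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def unique_ids(titles, max_len=512):
    """Build non-empty, unique ASCII IDs; max_len is an assumed ID length limit."""
    used = set()
    ids = []
    for position, title in enumerate(titles):
        base = to_ascii_id(title).strip()[:max_len] or f"article-{position}"
        candidate, suffix = base, 1
        while candidate in used:
            tail = f"-{suffix}"
            candidate = base[:max_len - len(tail)] + tail
            suffix += 1
        used.add(candidate)
        ids.append(candidate)
    return ids


# "Caf\u00e9" normalizes to "Cafe"; the third title has no ASCII characters at all.
titles = ["Caf\u00e9", "Cafe", "\u65e5\u672c\u8a9e", "Cafe"]
ids = unique_ids(titles)
print(ids)
```

The resulting list can replace the `id` field of each record before the vectors are built.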

### 4. Search for Similar Articles and Generate Answers

#### 4.1) Search for Articles with Similar Embeddings

- Initialize Pinecone client and access `vector-db`  
- Embed the query using `get_embeddings`  
- Search top 10 similar vectors using cosine similarity  
- Print:
  - Matched vector ID  
  - Similarity score  
  - First 200 characters of the matched article text  


In [None]:
pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index("vector-db")

query_text = "what is gen-ai?"
query_embedding = get_embeddings([query_text])[0]

response = index.query(
    vector=query_embedding,
    top_k=10,
    include_metadata=True,
    include_values=False)

for match in response["matches"]:
    print(f"ID: {match['id']}, score: {match['score']:.4f}")
    snippet = match["metadata"].get("text", "")
    print("Snippet:", snippet[:200].replace("\n", " "), "...\n")
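
Conceptually, the query ranks stored vectors by cosine similarity to the query vector and returns the best `top_k`. The brute-force sketch below illustrates that ranking on a toy in-memory dictionary. It is not how Pinecone works internally, and the IDs and 3-dimensional vectors are invented for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, vectors, k=2):
    """Score every stored vector against the query and return the k best (id, score) pairs."""
    scored = [(vec_id, cosine(query, vec)) for vec_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


toy_index = {
    "gans": [0.9, 0.1, 0.0],
    "sorting": [0.0, 0.2, 0.9],
    "gen-ai": [0.8, 0.3, 0.1],
}
matches = top_k([1.0, 0.2, 0.0], toy_index, k=2)
for vec_id, score in matches:
    print(f"ID: {vec_id}, score: {score:.4f}")
```

Scoring every vector is only practical for small collections; a vector database exists to avoid this full scan at scale.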

#### 4.2) Get RAG-Based Answers with Gemini

- Initialize Gemini client with API key  
- Define `answer_with_gemini(query, top_k)` to:
  - Embed the query  
  - Retrieve top-k similar articles from Pinecone  
  - Build a context from their metadata  
  - Prompt Gemini model with context and question  
- Print the model-generated answer  

```python
q = "What is GAN?"
response = answer_with_gemini(q)
print("Answer:", response.text)
```

In [None]:
client = genai.Client(api_key=YOUR_GEMINI_API_KEY)

def answer_with_gemini(query, top_k=5):

    q_emb = get_embeddings([query])[0]
    resp = index.query(vector=q_emb, top_k=top_k, include_metadata=True)

    context = "\n\n".join(
        m["metadata"].get("text", "").strip()
        for m in resp["matches"]
    )

    response = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=[
            types.Content(role="user", parts=[types.Part(text=(
                f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
            ))])
        ],
        config=types.GenerateContentConfig(
            temperature=0.7,
            max_output_tokens=500
        )
    )
    return response

if __name__ == "__main__":
    q = "What is GAN?"
    response = answer_with_gemini(q)
    print("Answer:", response.text)
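
Each retrieved article can carry up to 40,000 bytes of text, so joining five of them may produce a very long prompt. The following is a sketch of building the same `Context / Question / Answer` prompt under a character budget. The helper name `build_prompt` and the default budget of 6,000 characters are assumptions for illustration, not tuned values.

```python
def build_prompt(query, passages, max_context_chars=6000):
    """Join passages until the character budget is spent, then append the question."""
    kept, used = [], 0
    for passage in passages:
        passage = passage.strip()
        if not passage:
            continue
        remaining = max_context_chars - used
        if remaining <= 0:
            break
        piece = passage[:remaining]  # the last passage may be cut to fit the budget
        kept.append(piece)
        used += len(piece)
    context = "\n\n".join(kept)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


prompt = build_prompt("What is GAN?", ["a" * 10, "  ", "b" * 10], max_context_chars=15)
print(prompt)
```

Inside `answer_with_gemini`, the passages would be the `text` metadata values of the matches, in ranked order so the most relevant text survives truncation.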

#### 4.3) Check Token Consumption

- Print the Gemini model version used  
- Extract and display the generated answer text  
- Print token usage statistics:
  - Prompt tokens  
  - Completion tokens  
  - Total tokens  

In [None]:
print("Model version:", response.model_version)

if response.candidates:
    candidate = response.candidates[0]
    # Skip parts without text (part.text can be None for non-text parts).
    text = "".join(part.text for part in candidate.content.parts if part.text)
    print("\nGenerated Text:\n", text.strip())


usage = response.usage_metadata
print("\nTokens used in prompt:", usage.prompt_token_count)
print("Tokens in generated content:", usage.candidates_token_count)
print("Total tokens:", usage.total_token_count)

### 5. Conclusion

In this notebook, we built a complete RAG-based system that combines semantic search using Pinecone with contextual answering using Gemini.

From loading and embedding thousands of articles to querying with real questions and getting LLM-powered answers, you've seen how modern GenAI tools can work together to create powerful search experiences.

#### What You Now Have:
- A scalable, vector-based search backend using Pinecone  
- Cleanly embedded article data using Transformer models  
- A question-answering layer powered by Google Gemini  
- A flexible pipeline you can extend with your own data or use cases

> This setup can be the foundation for internal knowledge assistants, educational bots, or smart documentation search tools.

Feel free to tweak, scale, and deploy it further — the core logic is now in your hands.