# **Storing and Querying PDF Data in a Vector Database**

This notebook demonstrates how to:
1. Extract text data from a PDF document.
2. Convert the text into embeddings using the Gemini embedding model (`models/embedding-001`).
3. Store these embeddings in a Pinecone vector database for efficient retrieval.
4. Query the vector database for the most relevant chunks and pass them to a Gemini LLM to answer a question.

---

## **Prerequisites**

Ensure you have:
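- A Python environment with the following libraries installed:
  - `pinecone-client`
  - `PyPDF2`
  - `llama-index`
  - `llama-index-llms-gemini`
  - `llama-index-embeddings-gemini`
  - `llama-index-vector-stores-pinecone`
- A Pinecone account, an API key (`PINECONE_API_KEY`), and an existing index (this notebook uses one named `pdfq` with dimension 768).
- A Google API key (`GOOGLE_API_KEY`) for the Gemini models.
- A PDF document to process.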

---

## **Steps**

### Step 1: Extract Text from the PDF
We use `PyPDF2` to extract the text from the PDF document. The text is then split on newlines, so each non-empty line becomes one chunk to embed.
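
Splitting on newlines ties chunk boundaries to how the PDF happens to wrap its text. As a hedged alternative, the sketch below groups words into fixed-size windows that overlap slightly. The function name `chunk_words` and the parameters `chunk_size` and `overlap` are hypothetical and are not used elsewhere in this notebook.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


# Toy example with a tiny window so the overlap is easy to see
example_chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
print(example_chunks)
```

Larger windows give each embedding more context, while smaller ones make retrieval more precise; the right values depend on the document and are left as an assumption here.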

### Step 2: Generate Embeddings
The Gemini embedding model is used to create vector representations of the PDF text chunks.
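
Extracted PDF text may contain blank or repeated lines, and each embedding call is a network request. The sketch below skips empty chunks and reuses the embedding of any chunk it has already seen. It takes any callable as `embed_fn`; in this notebook that would be `embed_model.get_text_embedding`. The helper name `embed_chunks` is hypothetical.

```python
from typing import Callable


def embed_chunks(
    chunks: list[str],
    embed_fn: Callable[[str], list[float]],
) -> list[tuple[str, list[float]]]:
    """Return (clean_chunk, embedding) pairs, calling `embed_fn` once per distinct chunk."""
    cache: dict[str, list[float]] = {}
    pairs = []
    for chunk in chunks:
        clean = chunk.strip()
        if not clean:
            continue  # nothing to embed
        if clean not in cache:
            cache[clean] = embed_fn(clean)  # only the first occurrence hits the model
        pairs.append((clean, cache[clean]))
    return pairs
```

A possible call, assuming the `embed_model` and `pdf_text` objects created later in this notebook, is `embed_chunks(pdf_text, embed_model.get_text_embedding)`.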

### Step 3: Store Embeddings in Pinecone
The embeddings are stored in a Pinecone vector database. Each record keeps the original chunk text as metadata, so query results can be turned back into readable context.
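
Each stored record is a dictionary with an `id`, the embedding `values`, and a `metadata` dictionary. The sketch below builds such records from parallel lists of chunks and embeddings. The helper name `make_records` and the extra `source` metadata field are assumptions added for illustration; the notebook itself stores only `text`.

```python
def make_records(chunks: list[str], embeddings: list[list[float]], source: str) -> list[dict]:
    """Pair each chunk with its embedding in the id/values/metadata record shape."""
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")
    return [
        {
            "id": f"{source}-chunk-{i}",  # prefixing with the source keeps ids unique across PDFs
            "values": list(vector),
            "metadata": {"text": chunk, "source": source},
        }
        for i, (chunk, vector) in enumerate(zip(chunks, embeddings))
    ]


# Toy 2-dimensional vectors; real embeddings in this notebook have 768 dimensions
records = make_records(["first line", "second line"], [[0.1, 0.2], [0.3, 0.4]], source="sample")
print(records[0]["id"], records[0]["metadata"])
```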

### Step 4: Query the Vector Database
The user's question is embedded with the same model, the closest chunks are fetched from Pinecone, and a Gemini LLM answers the question using those chunks as context.
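
Conceptually, a vector query scores the stored vectors against the query vector and returns the best few. The sketch below does this by brute force in memory with cosine similarity. It is an illustration only: Pinecone uses its own indexing, and whether the index in this notebook uses the cosine metric is an assumption. The names `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], records: list[dict], k: int = 5) -> list[dict]:
    """Score every record against the query and return the k best, highest score first."""
    matches = [
        {
            "id": record["id"],
            "score": cosine_similarity(query_vector, record["values"]),
            "metadata": record.get("metadata", {}),
        }
        for record in records
    ]
    return heapq.nlargest(k, matches, key=lambda match: match["score"])


# Toy 2-dimensional records standing in for stored chunks
toy_records = [
    {"id": "chunk-0", "values": [1.0, 0.0], "metadata": {"text": "first"}},
    {"id": "chunk-1", "values": [0.0, 1.0], "metadata": {"text": "second"}},
    {"id": "chunk-2", "values": [1.0, 1.0], "metadata": {"text": "third"}},
]
for match in top_k([1.0, 0.1], toy_records, k=2):
    print(match["id"], match["metadata"]["text"])
```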

---

## **Expected Output**

- The embeddings from the PDF data are successfully stored in Pinecone.
- The query fetches relevant information based on the user's input.

Let's proceed!


In [None]:
# Dependencies
!pip install llama-index llama-index-llms-gemini llama-index-embeddings-gemini llama-index-vector-stores-pinecone PyPDF2 pinecone-client



In [3]:
# Imports
import os
import PyPDF2
from llama_index.embeddings.gemini import GeminiEmbedding
from pinecone import Pinecone

In [20]:
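# Env vars: both keys must already be set in the environment.
# (Assigning os.getenv(...) back into os.environ raises TypeError when the variable is unset.)
for key in ("PINECONE_API_KEY", "GOOGLE_API_KEY"):
    if not os.getenv(key):
        raise RuntimeError(f"Set the {key} environment variable before running this notebook.")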


In [None]:
# Init Pinecone client
pinecone_client = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
pinecone_index = pinecone_client.Index("pdfq")  # replace "pdfq" with your Pinecone index name

In [6]:
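def extract_text_from_pdf(pdf_path):
    """Return the text of the PDF as a list of lines."""
    reader = PyPDF2.PdfReader(pdf_path)
    # `or ""` guards against pages with no extractable text; joining pages with "\n"
    # keeps the last line of one page from fusing with the first line of the next.
    pages = [page.extract_text() or "" for page in reader.pages]
    return "\n".join(pages).split("\n")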

In [21]:
# Extract text from pdf here
pdf_text = extract_text_from_pdf("./data/somatosensory.pdf")
pdf_text[:5]

['This is a sample document to',
 'showcase page-based formatting. It',
 'contains a chapter from a Wikibook',
 'called Sensory Systems . None of the',
 'content has been changed in this']

In [9]:
embed_model = GeminiEmbedding(
    model_name="models/embedding-001",
    api_key=os.getenv("GOOGLE_API_KEY"),
    title="this is a document",
)
vectors = []
for i, chunk in enumerate(pdf_text):
    clean_chunk = chunk.strip()
    if clean_chunk:
        embedding = embed_model.get_text_embedding(clean_chunk)
        # Keep only well-formed embeddings (a list of floats)
        if isinstance(embedding, list) and all(isinstance(x, float) for x in embedding):
            vectors.append({
                "id": f"chunk-{i}",
                "values": embedding,
                "metadata": {"text": clean_chunk},
            })

In [22]:

vectors[:1]

[{'id': 'chunk-0',
  'values': [0.038513806,
   -0.034792732,
   -0.037798207,
   -0.020050896,
   0.041156333,
   ...


In [11]:
pinecone_index.upsert(vectors)

# Verify the insertion
print(pinecone_index.describe_index_stats())

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 177}},
 'total_vector_count': 177}
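
The single `upsert` call above works for the 177 vectors here. For larger documents, sending the records in smaller batches keeps each request small. The sketch below is one way to do that; the helper names `batched` and `upsert_in_batches` are hypothetical, and the batch size of 100 is an assumption rather than a documented requirement. It only relies on the index object having an `upsert(vectors=...)` method.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def upsert_in_batches(index, vectors, batch_size=100):
    """Upsert `vectors` batch by batch and return how many records were sent."""
    sent = 0
    for batch in batched(vectors, batch_size):
        index.upsert(vectors=batch)
        sent += len(batch)
    return sent
```

With the objects from this notebook, a call might look like `upsert_in_batches(pinecone_index, vectors)`.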


### Search From Vector Database

In [1]:
# Imports
from llama_index.llms.gemini import Gemini



In [14]:
llm = Gemini()

In [12]:
query = "What are Joint receptors"
q_embedding = embed_model.get_query_embedding(query)

In [15]:
# Search in Pinecone
results = pinecone_index.query(
    vector=q_embedding,  # Specify the query vector
    top_k=5,                 # Number of results to return
    include_metadata=True    # Include metadata in the results
)
# Combine results into context
context = " ".join([match["metadata"]["text"] for match in results["matches"]])

# Generate an answer
response = llm.complete(prompt=f"Context: {context}\n\nQuestion: {query}\nAnswer:")
print(response)


Joint receptors are low-threshold mechanoreceptors located in muscles and joints that provide information about the position and movement of the body.
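
The query cell joins all five matches into the context regardless of how relevant each one is. A hedged refinement is to drop low-scoring matches and cap the context length before building the prompt. The sketch below assumes each match exposes `score` and `metadata["text"]` through dictionary-style access, as in the cell above, and that a higher score means a closer match (true for cosine and dot-product metrics). The helper name `build_prompt` and the parameters `min_score` and `max_chars` are hypothetical.

```python
def build_prompt(query: str, matches: list[dict], min_score: float = 0.0, max_chars: int = 2000) -> str:
    """Build the LLM prompt from the best matches, skipping weak ones and capping context size."""
    parts = []
    used = 0
    for match in sorted(matches, key=lambda m: m["score"], reverse=True):
        if match["score"] < min_score:
            continue  # too dissimilar to be useful context
        text = match["metadata"]["text"]
        if used + len(text) > max_chars:
            break  # context budget exhausted
        parts.append(text)
        used += len(text) + 1  # +1 for the joining space
    context = " ".join(parts)
    return f"Context: {context}\n\nQuestion: {query}\nAnswer:"
```

Assuming the `results` and `llm` objects from the cells above, the prompt could be passed on with `llm.complete(build_prompt(query, results["matches"], min_score=0.5))`, where the threshold `0.5` is an arbitrary illustrative value.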

