## Daily Challenge: W6_D4

### Pinecone Serverless Reranking in Action

**Why we are doing this**

Reranking models improve search relevance by assigning similarity scores between a query and documents, then reordering results so that the most relevant information appears first. In fields like healthcare, this helps clinicians quickly access critical patient notes.

---

**Task Overview**

You are provided with a step-by-step pipeline. Each numbered step is an action to perform. After each instruction, you’ll find an explanation of what to do and why it matters. Replace each placeholder (like ...) with the correct code or value.

---

**Part 1 – Load Documents and Execute Reranking Model**

1. **Install Pinecone libraries**

Install the Pinecone client and notebook helper package to interact with the API and simplify authentication.

2. **Authenticate with Pinecone**

Check if your environment contains the API key. If not, authenticate securely to avoid hard-coding credentials.

3. **Instantiate the Pinecone client**

Use your API key and cloud environment (e.g., us-west1-gcp) to initialize a Pinecone client instance.

4. **Define your query and example documents**

Write a query (e.g., "Tell me about Apple’s products") and create five sample documents mixing different meanings of "apple" (fruit and company).

5. **Call the reranker**

Run the bge-reranker-v2-m3 model with your query and documents. Set how many top-ranked results to retrieve.

6. **Inspect the reranked results**

Print each result’s rank, score, and content. This shows how the reranker prioritizes relevant documents.

---

**Part 2 – Set Up a Serverless Index for Medical Notes**

1. **Install libraries**

Install pandas, torch, and transformers to handle data and load the embedding model.

2. **Import modules and define environment**

Set cloud provider (e.g., AWS) and region. Configure the serverless index with appropriate resource settings.

3. **Create or recreate the index**

Delete and recreate the index with a dimension matching your embedding model output (e.g., 384).

---

**Part 3 – Load the Sample Data**

1. **Download and read JSONL**

Download the sample notes file from GitHub and load it into a DataFrame.

2. **Preview the data**

Display the first few rows to verify it includes the expected columns: id, embedding, metadata.

---

**Part 4 – Upsert Data into the Index**

1. **Instantiate index client and upsert**

Connect to the index and upload the DataFrame to Pinecone using upsert_from_dataframe.

2. **Wait for index availability**

Check that the data is fully indexed before running queries.

---

**Part 5 – Run a Semantic Search**

1. **Define the embedding function**

Use sentence-transformers to create a function that converts text into a vector using a model like all-MiniLM-L6-v2.

2. **Run the search query**

Write a clinical question, convert it to a vector, and search the index to retrieve the most similar notes.

---

**Part 6 – Display and Rerank Clinical Notes**

1. **Show initial search results**

Print each result’s ID, similarity score, and metadata for review.

2. **Prepare documents for reranking**

Format each result’s metadata into a string and define a refined version of the query.

3. **Execute reranking**

Call the reranker with the refined query and metadata strings. Choose how many top reranked results to display.

4. **Show reranked results**

Print each reranked note’s ID, new score, and metadata to see how the ordering changed and improved.

### Part 1: Load Documents & Execute Reranking Model

### Step 1 – Install Pinecone Libraries

In [None]:
# Install Pinecone core client and notebook helper
!pip install pinecone==6.0.1 pinecone-notebooks

### Step 2 – Authenticate with Pinecone

In [47]:
# Standard library
import os
import tempfile
import time

# Third-party
import pandas as pd
import requests
import torch
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer

In [None]:
# Set your API key manually for local development (don't share this!)
os.environ["PINECONE_API_KEY"] = "xxx"  # replace with your real key

In [26]:
# Replace with your Pinecone environment
os.environ["PINECONE_ENVIRONMENT"] = "us-west1-gcp"

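Explanation: These cells store the Pinecone API key and environment name in environment variables so that later cells can read them with os.environ. The values are set by hand here, with no interactive prompt, so replace the "xxx" placeholder with your own key and keep it out of shared or version-controlled notebooks.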
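The task overview asks for a check-then-prompt pattern rather than a hand-written key. A minimal sketch of that pattern follows, assuming the standard-library getpass module for hidden input; the helper name ensure_api_key is hypothetical and not part of the Pinecone client.

```python
import getpass
import os


def ensure_api_key(env_var="PINECONE_API_KEY", prompt=getpass.getpass):
    """Return the key from the environment, prompting only if it is missing.

    ensure_api_key is a hypothetical helper, not a Pinecone function.
    """
    key = os.environ.get(env_var)
    if not key:
        # Hidden input keeps the key out of the visible notebook output.
        key = prompt(f"Enter {env_var}: ")
        os.environ[env_var] = key
    return key
```

Calling ensure_api_key() once before creating the client would leave the key in os.environ for the later cells.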

### Step 3 – Instantiate the Pinecone Client

In [27]:
api_key = os.environ["PINECONE_API_KEY"]
environment = os.environ.get("PINECONE_ENVIRONMENT")  # or set "us-west1-gcp" directly
pc = Pinecone(api_key=api_key, environment=environment)

Explanation: The Pinecone object is initialized using your API key and the correct cloud region. This sets up the connection for all future operations.

### Step 4 – Define Query and Sample Documents

In [28]:
# Define a query and a list of mixed-context documents (fruit and company)
query = "Tell me about Apple's products"

documents = [
    "Apple has recently released the new iPhone with advanced camera features.",
    "I like eating green apples in the summer.",
    "The Apple Watch can monitor your heart rate and sleep patterns.",
    "An apple a day keeps the doctor away.",
    "Apple's M-series chips are revolutionizing personal computing."
]

Explanation: This query is ambiguous (“Apple”) and the documents include both meanings: the company and the fruit. This allows us to test the reranker's ability to distinguish context.

### Step 5 – Call the Reranker

In [29]:
# Apply the reranker to score and reorder documents by relevance to the query
from pinecone import RerankModel

reranked = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query=query,
    documents=[{"id": str(i), "text": doc} for i, doc in enumerate(documents)],
    top_n=3  # Only keep the top 3 most relevant documents
)

Explanation: This reranking model scores the similarity between the query and each document, and returns the top 3 by relevance.
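To make the score, sort, and truncate contract concrete, here is a toy sketch that runs without any API call. The overlap_score function is a deliberately crude stand-in (shared-word counting), not the bge-reranker-v2-m3 model, and toy_rerank is a hypothetical name; only the shape of the computation carries over.

```python
import re


def _tokens(text):
    """Lowercase word tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, text):
    """Toy relevance score: fraction of query tokens that appear in the text."""
    q_tokens = _tokens(query)
    if not q_tokens:
        return 0.0
    return len(q_tokens & _tokens(text)) / len(q_tokens)


def toy_rerank(query, docs, top_n=3, score_fn=overlap_score):
    """Score every document, sort by descending score, keep the top_n."""
    scored = [
        {"index": i, "score": score_fn(query, doc), "text": doc}
        for i, doc in enumerate(docs)
    ]
    scored.sort(key=lambda row: row["score"], reverse=True)
    return scored[:top_n]
```

A word-overlap scorer like this cannot tell the fruit from the company, since both kinds of sentence contain "apple"; separating the two meanings is what the learned model contributes.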

### Step 6 – Display Reranked Results

In [30]:
# Function to safely print reranked results using correct .data field
def show_reranked(query, matches):
    print(f"Query: {query}")
    if not matches:
        print("No results returned from reranking.")
        return
    for i, m in enumerate(matches):
        try:
            print(f"{i+1}. [Score: {m.score:.4f}] {m.document['text']}")
        except Exception as e:
            print(f"{i+1}. Error accessing match data: {e}")

# Call with correct field
show_reranked(query, reranked.data)

Query: Tell me about Apple's products
1. [Score: 0.0771] Apple's M-series chips are revolutionizing personal computing.
2. [Score: 0.0356] Apple has recently released the new iPhone with advanced camera features.
3. [Score: 0.0217] The Apple Watch can monitor your heart rate and sleep patterns.


**Interpretation of the Reranked Results**

**Query:** Tell me about Apple's products

The reranking model evaluated the semantic similarity between the query and the candidate documents. Here’s the interpretation of the top results:

1. **"Apple's M-series chips are revolutionizing personal computing."**  
   → Highest score and most relevant. This sentence directly relates to a key hardware product from Apple Inc., aligning strongly with the query.

2. **"Apple has recently released the new iPhone with advanced camera features."**  
   → Very relevant. It mentions a major Apple product (iPhone), though slightly less technical or core than the first result.

3. **"The Apple Watch can monitor your heart rate and sleep patterns."**  
   → Relevant but less directly aligned. The Apple Watch is a product, but the focus on health monitoring makes it a weaker match.

**Not returned:**  
Sentences about the fruit (e.g., "An apple a day keeps the doctor away") were excluded, showing that the reranker correctly distinguished between the fruit and the company.

**Conclusion:** The model demonstrates contextual understanding, going beyond keywords to capture intent and meaning.

### Part 2: Set Up a Serverless Index for Medical Notes

#### Step 1 – Import Modules and Define Environment Settings

In [34]:
# Define your deployment environment
cloud = "aws"
region = "us-east-1"

# Create the serverless configuration with cloud and region
spec = ServerlessSpec(cloud=cloud, region=region)

# Initialize Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"], environment=f"{cloud}-{region}")

Explanation:
You choose the cloud provider and region based on your Pinecone project configuration. ServerlessSpec records which cloud and region will host the serverless index; it is passed to create_index in the next step. Finally, we instantiate the Pinecone client.

#### Step 2 – Create or Recreate the Serverless Index

In [36]:
# Define index name
index_name = "pinecone-reranker"

# Remove existing index if present
if pc.has_index(index_name):
    pc.delete_index(index_name)

# Create a new serverless index with the proper spec argument
pc.create_index(
    name=index_name,
    dimension=384,  # must match your embedding model output
    spec=spec       # pass the serverless spec object
)

{
    "name": "pinecone-reranker",
    "metric": "cosine",
    "host": "pinecone-reranker-duhgekd.svc.aped-4627-b74a.pinecone.io",
    "spec": {
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    },
    "status": {
        "ready": true,
        "state": "Ready"
    },
    "vector_type": "dense",
    "dimension": 384,
    "deletion_protection": "disabled",
    "tags": null
}

### Part 3: Load the Sample Data

#### Step 1 – Download & Load JSONL File

In [41]:
with tempfile.TemporaryDirectory() as tmpdir:
    file_path = os.path.join(tmpdir, "sample_notes_data.jsonl")
    url = "https://raw.githubusercontent.com/pinecone-io/examples/refs/heads/master/docs/data/sample_notes_data.jsonl"
    
    resp = requests.get(url)
    resp.raise_for_status()
    with open(file_path, "wb") as f:
        f.write(resp.content)

    df = pd.read_json(file_path, orient='records', lines=True)

# Rename 'values' column to 'embedding' if needed for compatibility
if 'values' in df.columns and 'embedding' not in df.columns:
    df.rename(columns={"values": "embedding"}, inplace=True)

# Preview the loaded DataFrame
print("Loaded DataFrame:")
print(df.head())

Loaded DataFrame:
     id                                          embedding  \
0  P011  [-0.2027486265, 0.2769146562, -0.1509393603, 0...   
1  P001  [0.1842793673, 0.4459365904, -0.0770567134, 0....   
2  P002  [-0.2040648609, -0.1739618927, -0.2897160649, ...   
3  P003  [0.1889383644, 0.2924542725, -0.2335938066, -0...   
4  P004  [-0.12171068040000001, 0.1674752235, -0.231888...   

                                            metadata  
0  {'advice': 'rest, hydrate', 'symptoms': 'heada...  
1  {'tests': 'EKG, stress test', 'symptoms': 'che...  
2  {'HbA1c': '7.2', 'condition': 'diabetes', 'med...  
3  {'symptoms': 'cough, wheezing', 'diagnosis': '...  
4  {'referral': 'dermatology', 'condition': 'susp...  


**Interpretation of the Medical Notes DataFrame**

The dataset simulates clinical notes, where each row represents a patient case. Here's how to interpret the columns:

- **id**:  
  A unique identifier for each medical note (e.g., "P001").  
  → This will be used as the document ID in the Pinecone index.

- **embedding** (originally named values):  
  A list of 384 float values representing the semantic embedding of the note.  
  → These embeddings capture the meaning of the note and enable semantic search.

- **metadata**:  
  A dictionary containing relevant clinical context. It can include:
  - Symptoms (e.g., "headache", "chest pain")
  - Diagnoses (e.g., "asthma", "diabetes")
  - Tests, advice, prescriptions, or referrals

This structure is well-suited for testing reranking models because:
- Embeddings support initial semantic search.
- Metadata provides rich textual context to refine search results through reranking.
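Because the index dimension is fixed at 384, it can be worth checking the loaded records before uploading them. The following is a sketch of such a check under the column layout described above (id, embedding, metadata); validate_records is a hypothetical helper, not part of pandas or Pinecone.

```python
def validate_records(records, dim=384):
    """Return (id, problem) pairs; an empty list means every record looks valid.

    records: list of dicts with 'id', 'embedding' and 'metadata' keys
    (assumed layout, matching the DataFrame columns described above).
    """
    problems = []
    seen = set()
    for rec in records:
        rid = rec.get("id")
        if rid in seen:
            problems.append((rid, "duplicate id"))
        seen.add(rid)
        vec = rec.get("embedding")
        if not isinstance(vec, list) or len(vec) != dim:
            problems.append((rid, f"expected {dim} values"))
        if not isinstance(rec.get("metadata"), dict):
            problems.append((rid, "metadata is not a dict"))
    return problems
```

With the DataFrame from this part, the call would look like validate_records(df.to_dict("records")).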

### Part 4: Upsert Data into the Index

#### Step 1 – Instantiate Index Client and Upsert Data

In [45]:
# Rename 'embedding' to 'values' for Pinecone compatibility
df.rename(columns={"embedding": "values"}, inplace=True)

# Preview structure after renaming
print(df.columns)

# Connect to the index
index = pc.Index(index_name)

# Upsert into Pinecone
index.upsert_from_dataframe(df)

Index(['id', 'values', 'metadata'], dtype='object')


sending upsert requests:   0%|          | 0/100 [00:00<?, ?it/s]

{'upserted_count': 100}
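If you ever upsert records yourself instead of going through upsert_from_dataframe, a common approach is to split them into fixed-size batches first. The generator below is a generic sketch of that idea; chunked is a hypothetical helper and the batch size is an arbitrary choice.

```python
from itertools import islice


def chunked(items, size):
    """Yield consecutive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each batch could then be sent with a separate index.upsert(vectors=batch) call, assuming every item is a dict with id, values, and metadata keys.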

In [46]:
print(index.describe_index_stats())

{'dimension': 384,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 100}},
 'total_vector_count': 100,
 'vector_type': 'dense'}
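On larger uploads the reported vector count may lag behind the upsert, so a short polling loop is a reasonable safeguard. This is a sketch only: wait_for_count is a hypothetical helper, and get_count stands in for whatever callable returns the current count.

```python
import time


def wait_for_count(get_count, expected, timeout=60.0, interval=1.0, sleep=time.sleep):
    """Poll get_count() until it reaches `expected` or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        count = get_count()
        if count >= expected:
            return count
        if time.monotonic() >= deadline:
            raise TimeoutError(f"only {count} of {expected} vectors visible")
        sleep(interval)
```

In this notebook, get_count could be lambda: index.describe_index_stats()["total_vector_count"], assuming the stats object supports key access as its printed form suggests.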


**Index Availability Check**

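Although the tutorial includes a polling loop to wait for indexing, it is not needed here. The upsert response

```python
{'upserted_count': 100}
```

together with the describe_index_stats() output above (total_vector_count: 100) indicates that all 100 vectors are already visible in the index.
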
In [44]:
print(df.columns)
print(df.dtypes)
print(df.iloc[0])

Index(['id', 'embedding', 'metadata'], dtype='object')
id           object
embedding    object
metadata     object
dtype: object
id                                                        P011
embedding    [-0.2027486265, 0.2769146562, -0.1509393603, 0...
metadata     {'advice': 'rest, hydrate', 'symptoms': 'heada...
Name: 0, dtype: object


**Upsert Successful: What It Means**

You have successfully inserted 100 clinical notes into your Pinecone index. Here's what that confirms:

- **Each record includes:**
  - An id (e.g., "P001"): a unique identifier for the note.
  - A values field: the 384-dimensional embedding vector representing the semantic content of the note.
  - A metadata field: structured information (e.g., symptoms, diagnoses, prescriptions).

- **Result:**
  - Pinecone now stores and indexes these 100 vectors.
  - You are ready to perform **semantic search** on this dataset using embedding queries.
  - You can then apply **reranking** to reorder the most relevant notes based on metadata and context.

Next, you'll define a query, embed it using a sentence-transformer model, and perform your first semantic search.

### Part 5: Query & Embedding Function

#### Step 1 – Define Your Embedding Function

In [48]:
# Load the pre-trained model (384-dim compatible with your index)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Function to convert a text query into a dense embedding vector
def get_embedding(text):
    return model.encode(text).tolist()

Explanation:
The model "all-MiniLM-L6-v2" outputs 384-dimensional vectors, which matches your Pinecone index's dimension.

#### Step 2 – Run a Semantic Search Query

In [59]:
# Define a clinical question
question = "what if my patient has chest pain?"

# Convert question to embedding
emb = get_embedding(question)

# Query Pinecone index using this embedding
results = index.query(vector=emb, top_k=5, include_metadata=True)

# Sort results by similarity score
matches = sorted(results.matches, key=lambda m: m.score, reverse=True)

In [60]:
# Use clinical search results (from 'question' with 'chest pain')
rerank_docs = [
    {
        "id": m.id,
        "reranking_field": "; ".join([f"{k}: {v}" for k, v in m.metadata.items()])
    }
    for m in matches
]

rerank_query = "Evaluate chest-related symptoms and diagnostic tests"

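Explanation:
The first cell retrieves the top 5 most semantically similar medical notes from the index using your query. The second cell flattens each match's metadata into a single reranking_field string and defines a refined query (rerank_query); both are used for reranking in Part 6.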
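The metadata flattening can be tried offline with plain dictionaries. The sketch below mirrors the list comprehension in the cell above but takes (id, metadata) pairs instead of Pinecone match objects; flatten_metadata and build_rerank_docs are hypothetical names.

```python
def flatten_metadata(metadata, sep="; "):
    """Join key/value pairs into one 'key: value; key: value' string."""
    return sep.join(f"{key}: {value}" for key, value in metadata.items())


def build_rerank_docs(matches, field="reranking_field"):
    """Turn (id, metadata) pairs into the document dicts passed to the reranker."""
    return [{"id": match_id, field: flatten_metadata(meta)} for match_id, meta in matches]


# Example using the metadata shown for note P001 in this notebook
example = build_rerank_docs(
    [("P001", {"symptoms": "chest pain", "tests": "EKG, stress test"})]
)
print(example)
```

Keeping the field name in one parameter makes it easier to keep it in sync with the rank_fields argument of the rerank call.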

**Semantic Search Completed (Part 5)**

We successfully ran a semantic query against our indexed medical notes:

- Used the all-MiniLM-L6-v2 model to encode a clinical question into a vector.
- Queried the Pinecone index using index.query(...).
- Retrieved the top 5 most relevant notes based on vector similarity.

The results (matches) are now ready to be visualized and reranked using metadata and a refined query.
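The index in this notebook uses the cosine metric, so the scores returned by the query are cosine similarities between the question vector and each stored vector. A small pure-Python sketch of that scoring follows; the brute-force top_k scan is only illustrative of the ranking, not of how the service searches, and both function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query_vec, vectors, k=5):
    """vectors: dict of id -> vector. Returns the k best (id, score) pairs."""
    scored = [(vid, cosine_similarity(query_vec, vec)) for vid, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Because cosine similarity ignores vector length, two notes with embeddings pointing in the same direction score the same regardless of magnitude.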

### Part 6: Display & Rerank Clinical Notes

#### Step 1 – Display Initial Search Results

In [61]:
# Show the initial top-k search results with their metadata
def show_results(q, matches):
    print(f"Question: {q}")
    for i, m in enumerate(matches):
        print(f"{i+1}. ID: {m.id} | Score: {m.score:.4f} | Metadata: {m.metadata}")

# Call display function
show_results(question, matches)

Question: what if my patient has chest pain?
1. ID: P001 | Score: 0.7128 | Metadata: {'symptoms': 'chest pain', 'tests': 'EKG, stress test'}
2. ID: P016 | Score: 0.4676 | Metadata: {'condition': 'heart murmur', 'referral': 'cardiology'}
3. ID: P0100 | Score: 0.4450 | Metadata: {'advice': 'over-the-counter pain relief, stretching', 'symptoms': 'muscle pain'}
4. ID: P047 | Score: 0.4145 | Metadata: {'symptoms': 'back pain', 'treatment': 'physical therapy'}
5. ID: P095 | Score: 0.4145 | Metadata: {'symptoms': 'back pain', 'treatment': 'physical therapy'}


Explanation:
This prints each note’s ID, similarity score, and metadata for review before reranking.

#### Step 2 – Execute Serverless Reranking

In [62]:
# Rerank the five search results against the refined query, keeping the top 3
reranked = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query=rerank_query,
    documents=rerank_docs,
    rank_fields=["reranking_field"],
    top_n=3
)

Explanation:
The reranker scores the refined query (rerank_query) against the reranking_field string of each document. Both were prepared at the end of Part 5, where each match's metadata was compressed into a single string. Only the top 3 results are kept.

#### Step 3 – Display Reranked Results

In [63]:
def show_reranked(q, matches):
    print(f"Refined Query: {q}")
    for i, m in enumerate(matches):
        print(f"{i+1}. ID: {m.document.id} | Score: {m.score:.4f} | Metadata: {m.document.reranking_field}")

show_reranked(rerank_query, reranked.data)

Refined Query: Evaluate chest-related symptoms and diagnostic tests
1. ID: P001 | Score: 0.7921 | Metadata: symptoms: chest pain; tests: EKG, stress test
2. ID: P016 | Score: 0.0019 | Metadata: condition: heart murmur; referral: cardiology
3. ID: P0100 | Score: 0.0015 | Metadata: advice: over-the-counter pain relief, stretching; symptoms: muscle pain


**Final Interpretation of Reranked Clinical Notes**

**Refined Query:**  
Evaluate chest-related symptoms and diagnostic tests

The reranker re-scored the top 5 semantic matches using only their metadata fields. Here’s what the results indicate:

1. **P001 | Score: 0.7921**  
   - symptoms: chest pain; tests: EKG, stress test  
   → Highly relevant. The patient presents with chest pain and has diagnostic tests ordered, matching the refined query closely.

2. **P016 | Score: 0.0019**  
   - condition: heart murmur; referral: cardiology  
   → Weak match. Although heart-related, it lacks direct mention of chest pain or diagnostic procedures.

3. **P0100 | Score: 0.0015**  
   - symptoms: muscle pain; advice: over-the-counter pain relief  
   → Very weak match. Unrelated to chest symptoms or diagnostic tests.

**Conclusion:**  
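The reranker kept the most relevant note (P001) at the top and sharply separated it from the other candidates: its score is 0.7921, while the next two fall below 0.002. In the initial semantic search the gap was much smaller (0.7128 versus 0.4676), so the reranked scores give a clearer signal of which notes actually match the clinical question. This is crucial in medical scenarios where nuance matters.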
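One way to check how much the reranker changed the ordering is to compare each note's rank before and after. The sketch below does this for the ID lists printed in this notebook; rank_shifts is a hypothetical helper, and a positive shift means the note moved up.

```python
def rank_shifts(initial_ids, reranked_ids):
    """Map each reranked id to (old_rank, new_rank, shift), with 1-based ranks.

    shift is old_rank - new_rank, or None if the id was not in the initial list.
    """
    old_rank = {doc_id: i for i, doc_id in enumerate(initial_ids, start=1)}
    shifts = {}
    for new, doc_id in enumerate(reranked_ids, start=1):
        old = old_rank.get(doc_id)
        shifts[doc_id] = (old, new, None if old is None else old - new)
    return shifts


# IDs copied from the search and rerank outputs shown above
initial = ["P001", "P016", "P0100", "P047", "P095"]
reranked_top3 = ["P001", "P016", "P0100"]

for doc_id, (old, new, shift) in rank_shifts(initial, reranked_top3).items():
    print(doc_id, old, new, shift)
```

Pairing this with the score values gives a fuller picture, since a reranker can leave the order unchanged while still widening the gap between strong and weak matches.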