# Semantic Search Demo

This notebook demonstrates how to build a **Semantic Search Engine** using **Pinecone** (Vector Database) and **Sentence-Transformers** (Embedding Model).

### What is Semantic Search?
Instead of matching exact keywords (for example, "Apple" matching "Apple"), semantic search matches **concepts**. The embedding model places "wild cats" close to "Lions" in vector space even though the two phrases share no words.
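
To make the contrast concrete, here is a minimal sketch of plain keyword matching, using only the standard library. The `tokenize` and `keyword_match` helpers are hypothetical names made up for this illustration; they are not part of Pinecone or Sentence-Transformers.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def keyword_match(query: str, document: str) -> bool:
    """Return True if the query and the document share at least one word."""
    return bool(tokenize(query) & tokenize(document))


document = "Lions are social animals that live in prides."

# Shares the words "lions" and "in", so keyword search finds it.
print(keyword_match("lions in africa", document))

# Shares no words at all, so keyword search misses it,
# even though the query is clearly about the same topic.
print(keyword_match("tell me about wild cats", document))
```

The second query is the kind of case the rest of this notebook handles with embeddings instead of word overlap.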

### Prerequisites
Make sure you have installed the required libraries (`python-dotenv` is needed for the `load_dotenv` import in the first cell):
`pip install pinecone-client sentence-transformers python-dotenv`

Also make sure your `.env` file contains your `PINECONE_API_KEY`.

In [1]:
# 1. Setup & Imports
import os
import time
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer
from dotenv import load_dotenv

# Load API keys from .env file
load_dotenv()
api_key = os.environ.get("PINECONE_API_KEY")

if not api_key:
    print("❌ Error: PINECONE_API_KEY not found in environment variables.")
else:
    print("✅ API Key loaded successfully.")

INDEX_NAME = "semantic-search-demo"

✅ API Key loaded successfully.


## 2. Load the "Brain" (The Embedding Model)

We use `all-MiniLM-L6-v2`, a small, fast, and free model from the Hugging Face `sentence-transformers` library.

It converts any text into a list of **384 numbers** (a vector).

In [2]:
print("üß† Loading model (this might take a few seconds)...")
model = SentenceTransformer('all-MiniLM-L6-v2') 
DIMENSION = 384 
print("‚úÖ Model loaded.")

🧠 Loading model (this might take a few seconds)...


✅ Model loaded.


## 3. Initialize Pinecone (The Memory)

We connect to Pinecone and create the index if it doesn't already exist.
The `dimension` must match the model's output size (384).
The `metric` is "cosine", a common choice for text embeddings because it compares the direction of two vectors rather than their length.
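
As a rough illustration of what the cosine metric measures, here is a pure-Python sketch. The `cosine_similarity` function and the three-number vectors are made up for this example; Pinecone computes the score on its side, over the real 384-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]  # v scaled by 2: longer, but same direction
orthogonal = [3.0, 0.0, -1.0]     # dot product with v is zero

# Scaling a vector does not change the score: the result is (approximately) 1.
print(cosine_similarity(v, same_direction))

# Vectors at a right angle score 0.
print(cosine_similarity(v, orthogonal))
```

Scores close to 1 mean "pointing the same way", which for embeddings is read as "similar meaning".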

In [None]:
pc = Pinecone(api_key=api_key)

# Check if index exists, if not create it
existing_indexes = [index.name for index in pc.list_indexes()]

if INDEX_NAME not in existing_indexes:
    print(f"üìÇ Creating index: {INDEX_NAME}...")
    pc.create_index(
        name=INDEX_NAME,
        dimension=DIMENSION, # Must match the model output!
        metric="cosine",     # Cosine Similarity is best for text
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )
    # Wait for index to be ready
    while not pc.describe_index(INDEX_NAME).status['ready']:
        time.sleep(1)
    print("✅ Index created.")
else:
    print(f"ℹ️ Index '{INDEX_NAME}' already exists.")

# Connect to the Index
index = pc.Index(INDEX_NAME)
print(f"✅ Connected to Pinecone Index: {INDEX_NAME}")

## 4. Define Data (The Corpus)

We create a small dataset with sentences about different topics (Tech, Finance, Animals). 
This data will be our "Knowledge Base".

In [None]:
data = [
    {"id": "vec1", "text": "Apple released a new iPhone with a better camera."}, # Tech
    {"id": "vec2", "text": "The stock market crashed today due to inflation."}, # Finance
    {"id": "vec3", "text": "Lions are social animals that live in prides."},    # Animals
    {"id": "vec4", "text": "My laptop keyboard is broken and needs repair."},   # Tech
    {"id": "vec5", "text": "Dogs are known as man's best friend."},            # Animals
    {"id": "vec6", "text": "Interest rates were raised by the central bank."},  # Finance
]

## 5. Embed & Upsert (Teaching)

We loop through our data, convert each sentence into a vector using the model, and upload it to Pinecone.
We also store the original `text` as metadata so we can read it back later.

In [None]:
print("\nüì§ Upserting data...")
vectors_to_upsert = []

for item in data:
    # Convert Text -> Numbers (Vector)
    vector_values = model.encode(item["text"]).tolist()
    
    # Pack it for Pinecone: (ID, Vector, Metadata)
    vectors_to_upsert.append((
        item["id"], 
        vector_values, 
        {"text": item["text"]} # Store original text as metadata so we can read it later!
    ))

index.upsert(vectors=vectors_to_upsert)
print(f"✨ Uploaded {len(data)} vectors.")
time.sleep(2) # Give Pinecone a moment to index
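
With six sentences, a single upsert call is fine. For a larger corpus, one possible approach is to build the `(ID, Vector, Metadata)` tuples first and send them in batches. The sketch below assumes that approach; `build_records`, `chunked`, and `fake_embed` are hypothetical helpers, and `fake_embed` only stands in for `model.encode(text).tolist()` so the example runs without the model. The batch size is also an arbitrary choice for illustration.

```python
def build_records(items, embed):
    """Pack each item as an (id, vector, metadata) tuple, as in the cell above."""
    return [
        (item["id"], embed(item["text"]), {"text": item["text"]})
        for item in items
    ]


def chunked(records, batch_size):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def fake_embed(text):
    """Hypothetical stand-in for model.encode(text).tolist(); NOT a real embedding."""
    return [float(len(text)), float(text.count(" "))]


sample = [
    {"id": "vec1", "text": "Apple released a new iPhone with a better camera."},
    {"id": "vec2", "text": "The stock market crashed today due to inflation."},
    {"id": "vec3", "text": "Lions are social animals that live in prides."},
]

records = build_records(sample, fake_embed)
batches = list(chunked(records, batch_size=2))
print([len(batch) for batch in batches])
```

In the notebook itself, each batch would then be passed to `index.upsert(vectors=batch)`, with `fake_embed` swapped for the real model.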

## 6. The "Search" (Query)

Now for the magic. We ask questions that don't match the exact words in our database.
1.  **Query**: "tell me about wild cats"
2.  **Model**: Converts to Vector.
3.  **Pinecone**: Finds closest vectors.
4.  **Result**: Should match the sentence about "Lions".
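
Step 3 can be pictured as a brute-force nearest-neighbour search. The sketch below is only a local illustration of the idea, not how Pinecone is implemented; the `top_k` helper and the three-number vectors are invented for this example.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Score every stored vector against the query and keep the k best."""
    scored = (
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in stored.items()
    )
    return heapq.nlargest(k, scored)


# Toy 3-dimensional "embeddings" (hypothetical values, chosen by hand).
stored = {
    "lions": [0.9, 0.1, 0.0],
    "iphone": [0.0, 0.9, 0.2],
    "stocks": [0.1, 0.0, 0.9],
}
query = [0.8, 0.2, 0.1]  # points mostly in the same direction as "lions"

for score, doc_id in top_k(query, stored, k=2):
    print(f"{doc_id}: {score:.2f}")
```

The real query cell does the same thing conceptually, except that the vectors come from the embedding model and the search runs inside Pinecone.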

In [None]:
queries = [
    "tell me about wild cats",       # Should match "Lions"
    "tech news",                     # Should match "iPhone" or "Laptop"
    "money and economy"              # Should match "Stock market" or "Interest rates"
]

print("\nüîç Starting Semantic Search...\n")

for query_text in queries:
    print(f"❓ Question: '{query_text}'")
    
    # 1. Convert Question -> Vector
    query_vector = model.encode(query_text).tolist()
    
    # 2. Ask Pinecone for the 2 closest matches
    results = index.query(
        vector=query_vector, 
        top_k=2, 
        include_metadata=True
    )
    
    # 3. Print Results
    for match in results.matches:
        print(f"   üëâ Match ({match.score:.2f}): {match.metadata['text']}")
    print("-" * 40)