# 🌲 Pinecone Vector Database Tutorial

A comprehensive guide to building a semantic search system using the **Pinecone** vector database and **SentenceTransformers**.  
This tutorial demonstrates how to store document embeddings and perform intelligent similarity searches.

## 📋 Prerequisites

Before starting, ensure you have:

- A **Pinecone account** and **API key**
- **Python 3.7+** installed
- Required packages:
  - `pinecone-client` (published as `pinecone` in newer SDK releases)
  - `sentence-transformers`
  - `python-dotenv`


In [None]:
! pip install pinecone-client sentence-transformers python-dotenv

## 🚀 Implementation

### 1. Import Required Libraries

Start by importing all necessary libraries.  
We'll use:

- **Pinecone** for vector storage  
- **SentenceTransformers** for creating embeddings  
- **dotenv** for managing environment variables securely


In [30]:
from pinecone import Pinecone, ServerlessSpec
from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer
import os

> 📝 **Note:** Create a `.env` file in your project directory containing the line  
> `PINECONE_API_KEY=your_api_key_here`

### 2. Environment Setup and Pinecone Initialization

Load your environment variables and establish a connection to Pinecone.  
This keeps your API key secure and separate from your code.


In [31]:
# Load environment variables from .env file
load_dotenv()

# Get the API key from environment variables
api_key = os.getenv('PINECONE_API_KEY')

# Initialize Pinecone with the API key
pc = Pinecone(api_key=api_key)

> 🔐 **Security Tip:** Never hardcode API keys in your source code.  
> Always use environment variables.
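
One way to fail early when the key is missing is a small helper like the sketch below. `require_env` is a hypothetical name introduced here for illustration; it is not part of `python-dotenv` or the Pinecone SDK and only wraps `os.getenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in your shell."
        )
    return value
```

After `load_dotenv()` has run, `api_key = require_env("PINECONE_API_KEY")` could stand in for the plain `os.getenv` call, so a missing key surfaces before any request is made.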


### 3. Index Configuration and Creation

Configure your vector index with appropriate dimensions and create it if it doesn't exist.  
We're using a **serverless setup on AWS** for cost-effectiveness and scalability.


In [32]:
# Index name and config
index_name = "pinecone-tutorial"
dimension = 384  # Dimension for all-MiniLM-L6-v2 model

# Create the index only if it does not already exist
# (loop variable named idx so it does not shadow the `index` handle created below)
if index_name not in [idx['name'] for idx in pc.list_indexes()]:
    pc.create_index(
        name=index_name,
        dimension=dimension,
        metric="cosine",  # Cosine similarity for text embeddings
        spec=ServerlessSpec(cloud='aws', region='us-east-1')
    )
    print(f"✅ Created index: {index_name}")
else:
    print(f"ℹ️ Index {index_name} already exists")

# Connect to the index
index = pc.Index(index_name)
print(f"📊 Index stats: {index.describe_index_stats()}")

ℹ️ Index pinecone-tutorial already exists
📊 Index stats: {'dimension': 384,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 10}},
 'total_vector_count': 10,
 'vector_type': 'dense'}
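
A newly created index may need a moment before it accepts requests. The sketch below polls `pc.describe_index(...)` until its `status['ready']` flag is true. `wait_until_ready` and its timeout parameters are hypothetical names chosen for this example, and the default timings are assumptions rather than recommended values.

```python
import time


def wait_until_ready(pc, index_name, timeout_s=60.0, poll_s=1.0):
    """Block until the index reports ready, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        # describe_index returns a description whose status includes a 'ready' flag
        status = pc.describe_index(index_name).status
        if status["ready"]:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Index {index_name!r} not ready after {timeout_s}s")
        time.sleep(poll_s)
```

Calling `wait_until_ready(pc, index_name)` right after `pc.create_index(...)` would keep the first upsert from racing the index creation.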


#### 🔑 Key Points:

- **Dimension 384**: Matches the output size of the `all-MiniLM-L6-v2` model  
- **Cosine metric**: A common choice for text embeddings; it compares the direction of two vectors and ignores their magnitude  
- **Serverless**: Pay-per-use pricing model
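
To make the cosine metric concrete, here is a minimal pure-Python sketch of the formula (dot product divided by the product of the vector norms). The function name and the toy 2-D vectors are illustrative assumptions; real embeddings from this tutorial have 384 values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both non-zero, same length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different length: similarity is (approximately) 1
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))

# Perpendicular vectors: similarity is 0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because only direction matters, scaling a vector does not change its cosine similarity to any other vector.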


### 4. Sample Documents and Model Initialization

Define a diverse set of sample documents covering various tech topics  
and initialize the SentenceTransformer model that will convert text to vector embeddings.

In [33]:
documents = [
    "The quick brown fox jumps over the lazy dog",
    "Artificial intelligence is transforming technology",
    "Python is a popular programming language",
    "Machine learning models require large datasets",
    "Vector databases enable fast similarity search",
    "Natural language processing analyzes text data",
    "Deep learning uses neural networks",
    "Data science combines statistics and programming",
    "Cloud computing provides scalable infrastructure",
    "Software development involves writing code"
]

# Initialize the sentence transformer model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

> 🤖 **Model Choice:** `all-MiniLM-L6-v2` is lightweight, fast, and produces high-quality embeddings  
> for general text similarity tasks.


### 5. Generate Embeddings and Prepare Data

Transform your text documents into numerical vectors that capture their semantic meaning,  
then format them for Pinecone storage with metadata.


In [34]:
# Generate embeddings for all documents
embeddings = model.encode(documents).tolist()

# Prepare data structure for Pinecone upsert
to_upsert = [
    {
        "id": f"doc{i}",                    # Unique identifier
        "values": embeddings[i],            # Vector embedding
        "metadata": {"text": documents[i]}  # Original text for retrieval
    }
    for i in range(len(documents))
]

> **📊 Data Structure:**
> - **ID**: Unique identifier for each document
> - **Values**: 384-dimensional vector embedding
> - **Metadata**: Store original text for easy retrieval
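
The same record structure can be built by a small helper that also checks each vector's length before anything is sent to the index. This is a sketch under assumptions: `build_records`, `id_prefix`, and the default `dimension=384` are names and values chosen for this example.

```python
def build_records(documents, embeddings, dimension=384, id_prefix="doc"):
    """Pair each document with its embedding in the id/values/metadata layout."""
    if len(documents) != len(embeddings):
        raise ValueError("documents and embeddings must have the same length")

    records = []
    for i, (text, vector) in enumerate(zip(documents, embeddings)):
        if len(vector) != dimension:
            raise ValueError(
                f"embedding {i} has {len(vector)} values, expected {dimension}"
            )
        records.append(
            {
                "id": f"{id_prefix}{i}",       # unique identifier
                "values": list(vector),        # vector embedding as a plain list
                "metadata": {"text": text},    # original text for retrieval
            }
        )
    return records
```

With the tutorial's variables, `to_upsert = build_records(documents, embeddings)` would produce the same list as the comprehension above, with a length check added.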

### 6. Insert Documents into Pinecone

Upload your prepared vectors to the Pinecone index. This operation is called "upsert" because it inserts new vectors or updates existing ones.

In [35]:
# Upsert vectors to Pinecone index
index.upsert(vectors=to_upsert)

print("✅ Documents inserted successfully.")
print(f"📈 Index stats after upload: {index.describe_index_stats()}")

✅ Documents inserted successfully.
📈 Index stats after upload: {'dimension': 384,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 10}},
 'total_vector_count': 10,
 'vector_type': 'dense'}


> **⚡ Performance:** Pinecone handles indexing automatically, optimizing for fast similarity searches.
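
The recorded run shows 10 vectors both before and after the upload, which is consistent with upsert semantics: records whose IDs already exist are overwritten rather than duplicated. The toy below mimics that behavior with a plain dictionary. It is an illustration only (`toy_upsert` is a hypothetical function) and says nothing about how Pinecone stores vectors internally.

```python
def toy_upsert(store, records):
    """Insert each record by id, replacing any record that already has that id."""
    for record in records:
        store[record["id"]] = record
    return len(store)


store = {}  # stand-in for an index: maps id -> record

first = toy_upsert(store, [
    {"id": "doc0", "values": [0.1, 0.2]},
    {"id": "doc1", "values": [0.3, 0.4]},
])
second = toy_upsert(store, [{"id": "doc1", "values": [0.9, 0.9]}])

print(first, second)            # 2 2
print(store["doc1"]["values"])  # [0.9, 0.9]
```

The count stays at 2 after the second call because `doc1` was updated in place.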

### 7. Perform Similarity Search

Query your vector database with natural language and retrieve the most semantically similar documents based on cosine similarity.

In [36]:
# Define search query
query = "What is AI and machine learning?"

# Convert query to embedding using the same model
# (encoding a single string returns a flat 1-D vector, so .tolist() yields a list of floats)
query_embedding = model.encode(query)

# Search for similar documents
search_results = index.query(
    vector=query_embedding.tolist(),
    top_k=3,                    # Return top 3 matches
    include_metadata=True       # Include original text
)

# Display results
print(f"\n🔍 Pinecone Query: '{query}'")
print("\n🏆 Top 3 most similar documents:")
print("-" * 50)

for i, match in enumerate(search_results['matches']):
    score = match['score']
    text = match['metadata']['text']
    
    print(f"{i+1}. 📊 Similarity Score: {score:.3f}")
    print(f"   📄 Text: {text}")
    print()


🔍 Pinecone Query: 'What is AI and machine learning?'

🏆 Top 3 most similar documents:
--------------------------------------------------
1. 📊 Similarity Score: 0.532
   📄 Text: Artificial intelligence is transforming technology

2. 📊 Similarity Score: 0.438
   📄 Text: Deep learning uses neural networks

3. 📊 Similarity Score: 0.366
   📄 Text: Data science combines statistics and programming
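
Conceptually, a `top_k` query ranks stored vectors by their similarity to the query vector and keeps the best few. The brute-force sketch below shows that ranking on toy 2-D vectors. It is an assumption-laden illustration: `top_k_matches` is a hypothetical function, and a vector database is not expected to scan every vector this way.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k_matches(query_vector, records, k=3):
    """Score every record against the query and return the k highest-scoring ones."""
    scored = [
        {
            "id": r["id"],
            "score": cosine_similarity(query_vector, r["values"]),
            "metadata": r.get("metadata", {}),
        }
        for r in records
    ]
    return heapq.nlargest(k, scored, key=lambda match: match["score"])


toy_records = [
    {"id": "a", "values": [1.0, 0.0], "metadata": {"text": "points right"}},
    {"id": "b", "values": [0.0, 1.0], "metadata": {"text": "points up"}},
    {"id": "c", "values": [1.0, 1.0], "metadata": {"text": "points diagonally"}},
]

for match in top_k_matches([1.0, 0.1], toy_records, k=2):
    print(match["id"], round(match["score"], 3), match["metadata"]["text"])
```

The returned dictionaries mirror the `score` and `metadata` fields read in the loop above, so the printing code would work unchanged on them.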



## 🎯 Expected Output

On a first run against a fresh index, you should see output similar to the following. Exact similarity scores can vary with the model and library versions; the matches shown here are the ones from the recorded run above.

```
✅ Created index: pinecone-tutorial
📊 Index stats: {'dimension': 384, 'index_fullness': 0.0, 'namespaces': {}}
✅ Documents inserted successfully.
📈 Index stats after upload: {'dimension': 384, 'index_fullness': 0.0, 'namespaces': {'': {'vector_count': 10}}}

🔍 Pinecone Query: 'What is AI and machine learning?'

🏆 Top 3 most similar documents:
--------------------------------------------------
1. 📊 Similarity Score: 0.532
   📄 Text: Artificial intelligence is transforming technology

2. 📊 Similarity Score: 0.438
   📄 Text: Deep learning uses neural networks

3. 📊 Similarity Score: 0.366
   📄 Text: Data science combines statistics and programming
```

## 🔧 Advanced Usage

### Batch Processing
For larger datasets, process documents in batches:

In [37]:
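batch_size = 100  # number of vectors sent per upsert request

# Iterate over the prepared records (not the raw documents) in fixed-size slices
for i in range(0, len(to_upsert), batch_size):
    batch = to_upsert[i:i+batch_size]
    index.upsert(vectors=batch)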
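
The batching loop can also be wrapped in a reusable function. In this sketch, `chunked` and `upsert_in_batches` are hypothetical helpers; the only SDK call assumed is `index.upsert(vectors=...)`, as used throughout this tutorial.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upsert_in_batches(index, vectors, batch_size=100):
    """Upsert `vectors` in batches and return how many records were sent."""
    sent = 0
    for batch in chunked(vectors, batch_size):
        index.upsert(vectors=batch)
        sent += len(batch)
    return sent
```

With the tutorial's variables, `upsert_in_batches(index, to_upsert)` would replace the explicit loop.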

### Filtering with Metadata
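Attach extra metadata fields to each vector, upsert the updated records, and then restrict a query with `filter`:

In [38]:
# Enhanced metadata
to_upsert = [
    {
        "id": f"doc{i}",
        "values": embeddings[i],
        "metadata": {
            "text": documents[i],
            "category": "technology",
            "source": "tutorial"
        }
    }
    for i in range(len(documents))
]

# Re-upsert so the stored vectors carry the new metadata fields;
# without this step the filter below would match nothing
index.upsert(vectors=to_upsert)

# Query with filters (only vectors whose category is "technology" are considered)
search_results = index.query(
    vector=query_embedding.tolist(),
    top_k=3,
    include_metadata=True,
    filter={"category": "technology"}
)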
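
Filters can also be written with explicit operators such as `$eq`, `$in`, and `$and` from Pinecone's metadata filter syntax. The sketch below builds such a filter as a plain dictionary; `build_filter` is a hypothetical helper, and the `category` and `source` fields are the ones introduced in this tutorial's metadata.

```python
def build_filter(categories, source=None):
    """Build a metadata filter matching any of `categories`, optionally one `source`."""
    clauses = [{"category": {"$in": list(categories)}}]
    if source is not None:
        clauses.append({"source": {"$eq": source}})
    # A single clause can be used directly; several are combined with $and
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


print(build_filter(["technology"]))
print(build_filter(["technology", "science"], source="tutorial"))
```

The resulting dictionary would be passed as the `filter` argument of `index.query(...)`.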

## 🛠️ Troubleshooting

| Issue | Solution |
|-------|----------|
| API Key Error | Ensure your `.env` file contains `PINECONE_API_KEY=your_key` |
| Dimension Mismatch | Verify model output dimensions match index configuration |
| Index Not Found | Check index name spelling and creation status |
| Slow Queries | Partition data into namespaces so each query searches only the relevant subset of vectors |
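
For the dimension mismatch case, a quick check before creating or writing to the index can catch the problem early. This sketch relies on the `get_sentence_embedding_dimension()` method of `SentenceTransformer`; the function name `assert_model_matches_index` is a hypothetical choice for this example.

```python
def assert_model_matches_index(model, index_dimension):
    """Raise ValueError if the model's embedding size differs from the index dimension."""
    model_dimension = model.get_sentence_embedding_dimension()
    if model_dimension != index_dimension:
        raise ValueError(
            f"Model produces {model_dimension}-dimensional embeddings, "
            f"but the index expects {index_dimension}."
        )
    return model_dimension
```

With the tutorial's variables, the call would be `assert_model_matches_index(model, dimension)`.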

---

## 📚 Additional Resources

- [Pinecone Documentation](https://docs.pinecone.io/)
- [Sentence Transformers Models](https://huggingface.co/sentence-transformers)
- [Vector Database Concepts](https://www.pinecone.io/learn/vector-database/)

---

## 🎉 Next Steps

1. **Scale Up**: Try with larger document collections
2. **Fine-tune**: Experiment with different embedding models
3. **Production**: Implement error handling and logging
4. **Optimize**: Use namespaces and metadata filtering for complex queries
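
As a starting point for the error handling and logging step, the sketch below retries a failing operation with exponential backoff and logs each attempt. Everything here is an assumption made for illustration: `with_retries`, the logger name, and the default delays are hypothetical, and catching a broad `Exception` should be narrowed to the errors you actually expect.

```python
import logging
import time

logger = logging.getLogger("pinecone_tutorial")


def with_retries(operation, attempts=3, base_delay_s=0.5):
    """Call `operation()` and retry on failure, doubling the delay after each attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:  # narrow this in real code
            if attempt == attempts:
                logger.error("Giving up after %d attempts: %s", attempts, exc)
                raise
            delay = base_delay_s * 2 ** (attempt - 1)
            logger.warning(
                "Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay
            )
            time.sleep(delay)
```

An upsert could then be wrapped as `with_retries(lambda: index.upsert(vectors=to_upsert))`.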

Happy vector searching! 🚀