# Notebook 01 - Pinecone Vector Store Integration with LangChain

In this notebook I follow the official [Pinecone integration tutorial](https://python.langchain.com/docs/integrations/vectorstores/pinecone) from the LangChain documentation.

**What I will cover:**
1. **Setup** - I configure API credentials (Google Gemini + Pinecone).
2. **Initialization** - I create (or connect to) a Pinecone index and initialize the vector store.
3. **Manage the vector store** - I add and delete documents.
4. **Query the vector store** - I perform similarity search, similarity search with score, and use a retriever.

> **Note:** I use **Google Gemini** (free tier) instead of OpenAI for embeddings. API keys are loaded from environment variables or prompted via `getpass`. I never hard-code secrets.

## 1 - Setup

I load environment variables from a `.env` file (if present) and make sure both `GOOGLE_API_KEY` and `PINECONE_API_KEY` are available.

- **Google Gemini** provides free-tier access to chat models and to embedding models. I use `gemini-embedding-001`, reduced from its default 3072 dimensions to 768 via Matryoshka Representation Learning (MRL).
- **Pinecone** offers a free starter tier for vector storage.

In [1]:
import os
import getpass
from dotenv import load_dotenv

load_dotenv()

if not os.getenv("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter your Google API key: ")

if not os.getenv("PINECONE_API_KEY"):
    os.environ["PINECONE_API_KEY"] = getpass.getpass("Enter your Pinecone API key: ")

### Connect to Pinecone

I create a `Pinecone` client using the API key. This client lets me manage indexes (create, list, delete) and obtain `Index` handles for reading/writing vectors.

In [2]:
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

## 2 - Initialization

Before initializing the LangChain vector store I need a Pinecone **index**. If one with the chosen name does not exist yet, I create it as a *serverless* index with:
- **dimension = 768** - I use `gemini-embedding-001` with `output_dimensionality=768` (Matryoshka Representation Learning allows flexible sizing).
- **metric = cosine** (standard for semantic similarity).
- Hosted on **AWS us-east-1**.

If the index already exists with a different dimension, I delete and recreate it automatically. Deleting an index discards every vector stored in it, so I only do this on a throwaway test index.

In [3]:
from pinecone import ServerlessSpec

index_name = "langchain-test-index"

if pc.has_index(index_name):
    desc = pc.describe_index(index_name)
    if desc.dimension != 768:
        print(f"Index '{index_name}' has dimension {desc.dimension}, deleting to recreate with 768...")
        pc.delete_index(index_name)

if not pc.has_index(index_name):
    pc.create_index(
        name=index_name,
        dimension=768,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index(index_name)
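
To make the `metric="cosine"` choice above concrete, here is a minimal pure-Python sketch of cosine similarity. Pinecone computes this server-side; the function name `cosine_similarity` and the toy 2-D vectors are my own illustration, not part of the Pinecone API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score about 1.0, regardless of length.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors score 0.0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because the score depends only on direction, two texts with similar meaning can match even when their embedding vectors differ in magnitude.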

### Create the embedding model and the vector store

I use Google's `gemini-embedding-001` to convert text into 768-dimensional vectors (reduced from the default 3072 via the `output_dimensionality` parameter). Then I wrap the Pinecone index with `PineconeVectorStore`, which provides a high-level LangChain interface for adding, deleting, and searching documents.

In [4]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_pinecone import PineconeVectorStore

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001",
    task_type="SEMANTIC_SIMILARITY",
    output_dimensionality=768,
)

vector_store = PineconeVectorStore(index=index, embedding=embeddings)
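
The idea behind the reduced output size can be sketched in a few lines. This is only an illustration of MRL-style truncation (keep a prefix of the vector, then rescale it to unit length); the Gemini API performs the reduction itself when `output_dimensionality` is set, and the helper `truncate_and_normalize` and the 4-element toy vector are hypothetical stand-ins.

```python
import math


def truncate_and_normalize(vector, dims):
    """Keep the first `dims` components and rescale them to unit L2 length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Toy stand-in for a full-size embedding (a real one has 3072 components).
full = [3.0, 4.0, 12.0, 1.0]
# Toy stand-in for the reduced embedding (768 components in this notebook).
short = truncate_and_normalize(full, 2)
```

Cosine similarity ignores vector length, so the rescaling step is optional for a cosine index; it matters if the same vectors are later compared with a dot-product or Euclidean metric.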

## 3 - Manage the Vector Store

### Add items

I create 10 sample `Document` objects (each with `page_content` and `metadata`) and upsert them into Pinecone using `add_documents`. Each document receives a unique UUID so I can reference or delete it later.

In [5]:
from uuid import uuid4
from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1, document_2, document_3, document_4, document_5,
    document_6, document_7, document_8, document_9, document_10,
]
uuids = [str(uuid4()) for _ in documents]

vector_store.add_documents(documents=documents, ids=uuids)

['1ba648b6-4642-456a-87e3-40629100f5fe',
 '3e294496-d0da-4764-aa43-6a82e870e224',
 '0fb150b4-5225-475c-a138-7d1e6fd25b91',
 'fc386538-1a7c-41f7-8e6a-3fb5b747f659',
 '85fb5cc0-9237-4c9b-a2d2-ed2195ba519b',
 '1e4f120a-3496-4bb8-97d6-3497c736b89e',
 '5d4a145a-3e21-43e8-bd77-6ba0616d8158',
 'd3805ffc-da90-411f-8029-b4697d18a91f',
 '72e537d9-636f-41bc-ad01-0317ed64be4d',
 'dabddc3b-c3bd-448e-b221-3395eca0be61']
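
Because `uuid4()` generates fresh random IDs on every run, re-running the cell above stores new copies of the same documents instead of overwriting the old ones (a Pinecone upsert only overwrites when the ID already exists). One possible alternative, sketched here under the assumption that a document is identified by its text and source, is to derive the ID from the content with `uuid5`. The helper `content_id` is hypothetical and not part of LangChain or Pinecone.

```python
from uuid import NAMESPACE_URL, uuid5


def content_id(page_content, source):
    """Derive a stable UUID from a document's text and source (hypothetical helper)."""
    return str(uuid5(NAMESPACE_URL, f"{source}:{page_content}"))


# The same content always maps to the same ID, so a second run would overwrite.
first_run = content_id("The top 10 soccer players in the world right now.", "website")
second_run = content_id("The top 10 soccer players in the world right now.", "website")
# Different content maps to a different ID.
other_doc = content_id("Is the new iPhone worth the price? Read this review to find out.", "website")
```

Such IDs could be passed to `add_documents(documents=..., ids=...)` in place of the random ones. The trade-off is that two documents with identical text and source collapse into a single record.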

### Delete items

I can remove a document by its UUID. Here I delete the last one (`document_10`), the one that had a bad feeling about being deleted.

In [6]:
vector_store.delete(ids=[uuids[-1]])

## 4 - Query the Vector Store

### 4.1 Similarity search

The simplest query: I pass a natural-language string and get back the **k** most similar documents. I also apply a metadata **filter** so only tweets are returned.

In [7]:
results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=2,
    filter={"source": "tweet"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
* Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
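
Conceptually, a filtered similarity search keeps only the records whose metadata matches the filter, ranks them by similarity to the query vector, and returns the top **k**. The sketch below models that in memory with toy 2-D vectors. Pinecone does this server-side and supports richer filter operators, so `filtered_top_k` is a simplified illustration that assumes plain equality filters only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def filtered_top_k(query_vec, records, k, metadata_filter):
    """Return the k best (text, score) pairs among records matching the filter.

    `records` is a list of (text, vector, metadata) tuples.
    """
    candidates = [
        (text, cosine(query_vec, vec))
        for text, vec, meta in records
        # Equality-only filter: every key in the filter must match exactly.
        if all(meta.get(key) == value for key, value in metadata_filter.items())
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]


# Toy records with made-up 2-D "embeddings".
records = [
    ("tweet about LangChain", [0.9, 0.1], {"source": "tweet"}),
    ("news about banks", [0.8, 0.2], {"source": "news"}),
    ("tweet about breakfast", [0.1, 0.9], {"source": "tweet"}),
]
hits = filtered_top_k([1.0, 0.0], records, k=2, metadata_filter={"source": "tweet"})
```

The news record is excluded even though its vector is close to the query, because the filter removes it before ranking.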


### 4.2 Similarity search with score

Same as above, but each result also comes with a **similarity score** (cosine similarity, since the index uses the cosine metric). This is useful when I want to set a threshold and discard low-confidence matches.

In [8]:
results = vector_store.similarity_search_with_score(
    "Will it be hot tomorrow?", k=1, filter={"source": "news"}
)
for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content} [{res.metadata}]")

* [SIM=0.851] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}]
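
Discarding low-confidence matches can be as simple as filtering the `(document, score)` pairs. This is a minimal sketch with made-up texts and scores. The helper `keep_confident` and the cutoff of 0.5 are assumptions for illustration, and a sensible cutoff depends on the embedding model and the data.

```python
def keep_confident(scored_results, min_score):
    """Keep only (item, score) pairs whose score is at least `min_score`."""
    return [(item, score) for item, score in scored_results if score >= min_score]


# Hypothetical results in the same (item, score) shape that
# similarity_search_with_score returns; plain strings stand in for Documents.
scored = [
    ("weather forecast", 0.85),
    ("stock market", 0.41),
    ("bank robbery", 0.30),
]
confident = keep_confident(scored, min_score=0.5)
```

With real results, I would pass the list returned by `similarity_search_with_score` directly, since it already has the `(Document, score)` shape.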


### 4.3 Using the vector store as a Retriever

LangChain's `as_retriever()` converts the vector store into a **Retriever** object that I can plug directly into chains and agents. Here I use the `similarity_score_threshold` search type, so the retriever returns at most `k` results and only those whose relevance score is at least 0.4.

In [9]:
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 1, "score_threshold": 0.4},
)
retriever.invoke("Stealing from the bank is a crime", filter={"source": "news"})

[Document(id='569d21a6-df70-4601-8f95-d0816622ea30', metadata={'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]

## Summary

In this notebook I:
1. Connected to Pinecone and created a serverless index (768 dims for Gemini embeddings).
2. Initialized `GoogleGenerativeAIEmbeddings` (`gemini-embedding-001`) and `PineconeVectorStore`.
3. Added and deleted documents.
4. Queried the store via similarity search (with and without scores) and via a LangChain Retriever.

