# LangChain + Pinecone Starter Pipeline

This notebook builds a handful of in-memory sample documents, indexes them in Pinecone, and runs a retrieval-augmented generation (RAG) query using OpenAI models.

**Prerequisites**

* Python 3.10+ environment with Jupyter.
* API keys exported as environment variables:
  * `OPENAI_API_KEY` (for both text embeddings and chat completions).
  * `PINECONE_API_KEY` and `PINECONE_ENVIRONMENT` (used below as the serverless region when the index is created, e.g. `us-east-1`).
* A Pinecone index with cosine similarity and an embedding dimension that matches the embedding model (e.g. `text-embedding-3-large` → 3072).
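
The dimension requirement above can be checked before creating or connecting to an index. The following is a minimal sketch, assuming a hand-maintained lookup table. `KNOWN_DIMENSIONS` and `check_index_dimension` are hypothetical names, not part of any library; the values listed are the default output sizes of these OpenAI embedding models.

```python
# Hypothetical lookup of default embedding sizes; extend it for other models.
KNOWN_DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
}


def check_index_dimension(model: str, index_dimension: int) -> None:
    """Raise ValueError if the index dimension does not match the model."""
    expected = KNOWN_DIMENSIONS.get(model)
    if expected is None:
        raise ValueError(f"Unknown embedding model: {model!r}")
    if expected != index_dimension:
        raise ValueError(
            f"{model} produces {expected}-dimensional vectors, "
            f"but the index expects {index_dimension}."
        )


check_index_dimension("text-embedding-3-large", 3072)  # passes silently
```

A mismatch typically surfaces only at upsert time, so a check like this is cheap insurance when the index was created by someone else.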

If you already have an OpenAI-hosted vector store (for example: [`vs_6859e43920848191a894dd36ecf0595a`](https://platform.openai.com/storage/vector_stores/vs_6859e43920848191a894dd36ecf0595a)), you can skip the Pinecone indexing cells and adapt the retrieval logic accordingly.


In [None]:
%%capture
!pip install --upgrade langchain langchain-openai langchain-pinecone pinecone tiktoken unstructured


## 1. Configuration
Set the runtime configuration and sanity-check the required environment variables.

In [None]:
import os
from datetime import datetime

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Fail fast if critical environment variables are missing.
required_env = ["OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_ENVIRONMENT"]
missing = [var for var in required_env if not os.environ.get(var)]
if missing:
    raise EnvironmentError(f"Missing environment variables: {missing}")

EMBED_MODEL = "text-embedding-3-large"
CHAT_MODEL = "gpt-4o-mini"
INDEX_NAME = "roque-ingest"  # change to your Pinecone index
NAMESPACE = "demo"
METADATA_SOURCE = "notebook-demo"


## 2. Create sample documents
Replace these with your own IP ingestion flow. In production you would pull from cloud storage, OCR PDFs, transcribe audio, etc.

In [None]:
seed_texts = [
    ("balanchine-cipher.txt", "Balanchine Cipher", "A short excerpt describing the Balanchine ritual blueprint and its choreography-driven mnemonic devices."),
    ("rosetta-console.txt", "Rosetta Console", "Notes on console interactions, fail-safes, and curator touchpoints for the ROQUE activation sequence."),
    ("market-scan.txt", "Market Scan", "A synthetic daily scan capturing volatility, hype spikes, and the Newton gas fee index across social networks."),
]

documents = []
for filename, title, summary in seed_texts:
    text = f"# {title}\n\n{summary}\n\nTimestamp: {datetime.utcnow().isoformat()}Z\n"
    documents.append(Document(page_content=text, metadata={"source": METADATA_SOURCE, "title": title, "filename": filename}))

print(f"Created {len(documents)} in-memory documents.")
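
As a first step away from hard-coded seed texts, the same records could be read from a local folder. This is a rough sketch, assuming a flat directory of UTF-8 `.txt` files; `load_text_files` and the title-from-filename rule are hypothetical choices, not part of the notebook.

```python
from pathlib import Path


def load_text_files(directory: str) -> list[dict]:
    """Read every .txt file in `directory` into a plain dict record."""
    records = []
    for path in sorted(Path(directory).glob("*.txt")):
        records.append(
            {
                "filename": path.name,
                # Derive a display title from the file name,
                # e.g. "market-scan" becomes "Market Scan".
                "title": path.stem.replace("-", " ").title(),
                "text": path.read_text(encoding="utf-8"),
            }
        )
    return records
```

Each record maps directly onto the `page_content` and metadata fields used when building `Document` objects.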


## 3. Chunk and embed
Chunk documents with provenance metadata and generate embeddings.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120)
chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks.")

embeddings = OpenAIEmbeddings(model=EMBED_MODEL)
embedding_vectors = embeddings.embed_documents([chunk.page_content for chunk in chunks])
print(f"Embedded {len(embedding_vectors)} chunks.")
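
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter in plain Python. It is only an illustration of the overlap idea: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this sketch does not do. `sliding_window_chunks` is a hypothetical helper name.

```python
def sliding_window_chunks(
    text: str, chunk_size: int = 800, chunk_overlap: int = 120
) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

The seed documents in this notebook are far shorter than 800 characters, so each one should end up as a single chunk; the overlap only matters once real documents are ingested.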


## 4. Upsert into Pinecone
Create (or connect to) an index and upsert the chunk embeddings with metadata.

In [None]:
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
if INDEX_NAME not in [index_info.name for index_info in pc.list_indexes()]:
    pc.create_index(
        name=INDEX_NAME,
        dimension=len(embedding_vectors[0]),
        metric="cosine",
        spec={"serverless": {"cloud": "aws", "region": os.environ["PINECONE_ENVIRONMENT"]}}
    )
index = pc.Index(INDEX_NAME)

vectors = []
for i, (chunk, vector) in enumerate(zip(chunks, embedding_vectors)):
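    # Store the chunk text under "page_content": the vector store in section 5 reads it back via text_key="page_content".
    metadata = chunk.metadata | {"page_content": chunk.page_content, "chunk_id": i, "ingested_at": datetime.utcnow().isoformat() + "Z"}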
    vectors.append({"id": f"{chunk.metadata.get('filename', 'doc')}-{i}", "values": vector, "metadata": metadata})

index.upsert(vectors=vectors, namespace=NAMESPACE)
print(f"Upserted {len(vectors)} vectors into namespace '{NAMESPACE}'.")
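
A single `upsert` call is fine for a few vectors, but larger ingests are usually sent in batches to stay under request-size limits. Below is a minimal sketch of a batching helper; `batched` is a hypothetical name and the default batch size of 100 is an assumption, so check Pinecone's documented limits for your plan and vector size.

```python
from typing import Iterator, Sequence


def batched(items: Sequence[dict], batch_size: int = 100) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Intended use with the objects defined in this notebook (not executed here):
# for batch in batched(vectors):
#     index.upsert(vectors=batch, namespace=NAMESPACE)
```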


## 5. Build a LangChain retriever and QA chain
Use the vector store to retrieve context and pass it to a ChatOpenAI model.

In [None]:
# PineconeVectorStore lives in the separate langchain-pinecone package (pip install langchain-pinecone).
from langchain_pinecone import PineconeVectorStore
from langchain.chains import RetrievalQA

vectorstore = PineconeVectorStore(
    index=index,
    embedding=embeddings,
    text_key="page_content",
    namespace=NAMESPACE
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
llm = ChatOpenAI(model=CHAT_MODEL, temperature=0.2)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)

query = "How do we validate a ROQUE artifact before anchoring it?"
result = qa_chain.invoke({"query": query})
print(result["result"])
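
Conceptually, the retriever embeds the query, scores stored chunks by cosine similarity (the metric configured on the index), and returns the top `k`. The toy sketch below shows that ranking in plain Python over two-dimensional vectors. It is an illustration only: Pinecone performs this search server-side at scale, and `cosine_similarity`, `top_k`, and `toy_index` are hypothetical names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 4) -> list[tuple[str, float]]:
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranked = top_k([1.0, 0.1], toy_index, k=2)
print([doc_id for doc_id, _ in ranked])  # ['a', 'c']
```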


## 6. Inspect provenance
Each returned document carries the metadata attached at upsert time (`source`, `title`, `filename`, `chunk_id`, `ingested_at`), so every answer can be traced back to the chunks it was built from.


In [None]:
for doc in result['source_documents']:
    print(f"--- {doc.metadata['title']} ({doc.metadata['filename']}) ---")
    print(doc.page_content[:300] + '...')
    print(doc.metadata)
    print()


## Next steps
* Swap the in-memory `seed_texts` with your actual ingestion pipeline (Google Drive, Dropbox, Git repos, audio transcripts, etc.).
* Store the raw artifacts and receipts in GCS/S3 and anchor hashes to Arweave or your preferred provenance layer.
* Move the chunking + upsert logic into a scheduled Prefect/Dagster flow.
* Replace the example query with your production prompts and add evaluation harnesses (hallucination rate, curator stamp agreement).
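
For the hash-anchoring step listed above, the digest itself can be computed with the standard library. This is a sketch under stated assumptions: the receipt layout and the names `sha256_hex` and `build_receipt` are hypothetical, and publishing the digest to Arweave or another provenance layer is out of scope here.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


def build_receipt(filename: str, data: bytes) -> dict:
    """Bundle the fields one might store alongside a raw artifact (hypothetical layout)."""
    return {
        "filename": filename,
        "sha256": sha256_hex(data),
        "size_bytes": len(data),
    }


receipt = build_receipt("balanchine-cipher.txt", b"example artifact bytes")
print(receipt["sha256"])
```

Hashing the raw bytes rather than the chunked text keeps the receipt stable even if the chunking parameters change later.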
