# Pinecone

Workflow:

Embedding: You store a large corpus of documents in a vector database, where each document is represented as a vector embedding.

Query: When a user inputs a query, it is also converted into a vector embedding.

Search: The vector database performs a similarity search to find the most relevant document vectors based on the query vector.

Generation: The retrieved documents are passed to the generative model as context, and the model produces an answer grounded in that retrieved information.
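The first three steps above can be sketched without any external service. This is a toy illustration only: the bag-of-words `embed` function, the three-sentence `corpus`, and the `retrieve` helper are all made-up stand-ins for a real embedding model and a real vector database, and the generation step is left out.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical corpus used only for this sketch.
corpus = {
    "doc1": "Pinecone is a managed vector database.",
    "doc2": "Apples are a healthy snack.",
    "doc3": "A vector database stores embeddings for similarity search.",
}

# 1. Embedding: embed every document once and keep the vectors.
store = {doc_id: embed(text) for doc_id, text in corpus.items()}

def retrieve(query, top_k=2):
    # 2. Query: embed the query the same way as the documents.
    q = embed(query)
    # 3. Search: rank documents by similarity to the query vector.
    ranked = sorted(store, key=lambda doc_id: cosine(q, store[doc_id]), reverse=True)
    return ranked[:top_k]

top_ids = retrieve("What does a vector database store?")
print(top_ids)
```

In a real pipeline the text of the returned documents would then be placed in the prompt sent to the generative model.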

In [6]:
from pinecone import Pinecone, ServerlessSpec
import os

# Read the API key from the environment; never hard-code secrets in a notebook.
pinecone_api_key = os.getenv("PINECONE_API_KEY")

pc = Pinecone(api_key=pinecone_api_key)



In [None]:
index_name = "quickstart"

pc.create_index(
    name=index_name,
    dimension=1024, # Replace with your model dimensions
    metric="cosine", # Replace with your model metric
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    ) 
)

# pinecone.init(api_key=pinecone_api_key, environment="us-west1-gcp")
# pinecone.create_index(name="example-index", dimension=1536, metric="cosine")

In [47]:
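from langchain.embeddings.openai import OpenAIEmbeddings

# OpenAIEmbeddings needs an OpenAI key: set OPENAI_API_KEY in the environment
# or pass openai_api_key=...; without it the constructor raises the
# ValidationError shown below.
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Upserts go through an Index object (index.upsert), not the Pinecone client itself.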


ValidationError: 1 validation error for OpenAIEmbeddings
  Value error, Did not find openai_api_key, please add an environment variable `OPENAI_API_KEY` which contains it, or pass `openai_api_key` as a named parameter. [type=value_error, input_value={'model': 'text-embedding...20, 'http_client': None}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/value_error

In [8]:
data = [
    {"id": "vec1", "text": "Apple is a popular fruit known for its sweetness and crisp texture."},
    {"id": "vec2", "text": "The tech company Apple is known for its innovative products like the iPhone."},
    {"id": "vec3", "text": "Many people enjoy eating apples as a healthy snack."},
    {"id": "vec4", "text": "Apple Inc. has revolutionized the tech industry with its sleek designs and user-friendly interfaces."},
    {"id": "vec5", "text": "An apple a day keeps the doctor away, as the saying goes."},
    {"id": "vec6", "text": "Apple Computer Company was founded on April 1, 1976, by Steve Jobs, Steve Wozniak, and Ronald Wayne as a partnership."}
]

embeddings = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=[d['text'] for d in data],
    parameters={"input_type": "passage", "truncate": "END"}
)
print(embeddings[0])

{'vector_type': dense, 'values': [0.04913330078125, -0.01306915283203125, ..., -0.0196990966796875, -0.0110321044921875]}


In [9]:
import time

# Wait for the index to be ready
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)

index = pc.Index(index_name)

vectors = []
for d, e in zip(data, embeddings):
    vectors.append({
        "id": d['id'],
        "values": e['values'],
        "metadata": {'text': d['text']}
    })

index.upsert(
    vectors=vectors,
    namespace="ns1"
)

{'upserted_count': 6}

In [10]:
print(index.describe_index_stats())

{'dimension': 1024,
 'index_fullness': 0.0,
 'namespaces': {'ns1': {'vector_count': 6}},
 'total_vector_count': 6}


In [None]:
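# Embed the query with the same model as the passages, then search with `vector=`
query_vector = pc.inference.embed(
    model="multilingual-e5-large", inputs=["Explain AI technologies"], parameters={"input_type": "query"}
)[0].values
result = index.query(namespace="ns1", vector=query_vector, top_k=5, include_metadata=True)
print(result)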


In [11]:
query = "Tell me about the tech company known as Apple."

embedding = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=[query],
    parameters={
        "input_type": "query"
    }
)

Indexes

An index stores vector embeddings for similarity search.
Each index is created with one similarity metric: cosine (`cosine`), dot product (`dotproduct`), or Euclidean distance (`euclidean`).
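To make the difference between the three metrics concrete, here is a small pure-Python sketch. The two-dimensional vectors are invented for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product of the vectors after normalizing each to unit length.
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 0.0]
same_direction = [3.0, 0.0]   # same direction as the query, larger magnitude
orthogonal = [0.0, 1.0]       # unrelated direction

for name, vec in [("same_direction", same_direction), ("orthogonal", orthogonal)]:
    print(
        name,
        "cosine:", cosine_similarity(query, vec),
        "dot:", dot_product(query, vec),
        "euclidean:", euclidean_distance(query, vec),
    )
```

Cosine similarity ignores vector length and only compares direction, the dot product also rewards larger magnitudes, and Euclidean distance is a distance, so smaller values mean closer vectors.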

Namespaces

Namespaces partition the vectors in an index into logical groups, which is useful for isolation or multi-tenancy. Each upsert or query targets a single namespace.

Metadata Filtering

Attach metadata (e.g., category, tags) to vectors and filter results during queries.
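The idea behind metadata filtering can be sketched in plain Python. This is not how Pinecone evaluates filters internally; it is a toy matcher that borrows the `$eq`, `$ne`, `$in`, and `$gte` operator names, and it assumes a bare value such as `{"category": "science"}` behaves like `$eq`. The `records` list and field names are hypothetical.

```python
# Toy implementations of a few comparison operators.
OPS = {
    "$eq": lambda value, target: value == target,
    "$ne": lambda value, target: value != target,
    "$in": lambda value, target: value in target,
    "$gte": lambda value, target: value is not None and value >= target,
}

def matches(metadata, flt):
    """Return True if `metadata` satisfies every condition in `flt`."""
    for field, cond in flt.items():
        if not isinstance(cond, dict):
            cond = {"$eq": cond}  # assumption: a bare value means equality
        value = metadata.get(field)
        for op, target in cond.items():
            if not OPS[op](value, target):
                return False
    return True

# Hypothetical records with metadata attached.
records = [
    {"id": "a", "metadata": {"category": "science", "year": 2021}},
    {"id": "b", "metadata": {"category": "sports", "year": 2023}},
    {"id": "c", "metadata": {"category": "science", "year": 2024}},
]

def filter_ids(flt):
    return [r["id"] for r in records if matches(r["metadata"], flt)]

print(filter_ids({"category": "science"}))
print(filter_ids({"category": "science", "year": {"$gte": 2022}}))
```

In a vector database the filter narrows the candidate set, and similarity ranking is applied only to the records that pass it.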

In [None]:
result = index.query(
    namespace="ns1",
    vector=query_vector,  # a single query vector, not a `queries` list
    top_k=5,
    filter={"category": "science"}
)


In [None]:
results = index.query(
    namespace="ns1",
    vector=embedding[0].values,
    top_k=3,
    include_values=False,
    include_metadata=True
)

print(results)

Pinecone Use Cases with LLMs

RAG Systems: Retrieve knowledge stored in vectors to generate more accurate answers.

Semantic Search: Search through unstructured data (e.g., documents) based on meaning.

Chatbots & Q&A Systems: Store embeddings of documents for context-based chat interactions.

Recommendation Engines: Suggest similar content based on embeddings.

Anomaly Detection: Flag items whose embeddings lie far from those of their nearest neighbors.

Advanced Features

Hybrid Search: Combines keyword and vector search for better accuracy.

Batch Processing: Supports batch insertion and queries for scalability.

Data Persistence: Ensures data durability and backups.

APIs and SDKs: Easy-to-use REST APIs and SDKs for Python and Node.js.

Integration with ML Frameworks: Works with HuggingFace, OpenAI, and LangChain.
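As an illustration of the hybrid search idea listed above, one common approach is to blend a semantic (dense) score with a keyword score using a weight `alpha`. The sketch below works on already-computed scores; the score values, document IDs, and the `hybrid_rank` helper are invented for this example and do not describe Pinecone's internal scoring.

```python
def hybrid_rank(dense_scores, keyword_scores, alpha):
    """Rank IDs by alpha * dense + (1 - alpha) * keyword.

    alpha=1.0 is pure vector search, alpha=0.0 is pure keyword search.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    ids = set(dense_scores) | set(keyword_scores)
    combined = {
        i: alpha * dense_scores.get(i, 0.0) + (1 - alpha) * keyword_scores.get(i, 0.0)
        for i in ids
    }
    return sorted(combined, key=combined.get, reverse=True)

# Hypothetical per-document scores, both already scaled to [0, 1].
dense_scores = {"doc1": 0.82, "doc2": 0.79, "doc3": 0.40}
keyword_scores = {"doc1": 0.10, "doc2": 0.90, "doc3": 0.95}

for alpha in (1.0, 0.5, 0.0):
    print(alpha, hybrid_rank(dense_scores, keyword_scores, alpha))
```

The two score sets need to be on comparable scales before blending, otherwise one side dominates regardless of `alpha`.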

In [None]:
%pip install pinecone

In [70]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load PDF
loader = PyPDFLoader("docs/django-api-tutorial-latest.pdf")  # Replace with your PDF file
pages = loader.load_and_split()

# Split text into smaller chunks for embedding
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
chunks = splitter.split_documents(pages)

print(f"Number of chunks: {len(chunks)}")
print(chunks)



Number of chunks: 82
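To show what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter tries to break on separators such as paragraphs, lines, and spaces first); `split_fixed` is a hypothetical helper that only illustrates the window and overlap arithmetic.

```python
def split_fixed(text, chunk_size, chunk_overlap):
    """Split `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks

demo_chunks = split_fixed("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence cut at a boundary still appears intact in at least part of a neighboring chunk.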


In [91]:
index_name = "pdfchucks"

pc.create_index(
    name=index_name,
    dimension=1024, # Replace with your model dimensions
    metric="cosine", # Replace with your model metric
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    ) 
)

# pinecone.init(api_key=pinecone_api_key, environment="us-west1-gcp")
# pinecone.create_index(name="example-index", dimension=1536, metric="cosine")

In [92]:
# Extract text data for embeddings
inputs = [doc.page_content for doc in chunks]
print(inputs)

embeddings = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=inputs,
    parameters={"input_type": "passage", "truncate": "END"}
)
print(embeddings[0])

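# Note: `index` still refers to the index opened in an earlier cell, not "pdfchucks",
# so the stats printed below describe that earlier index.
print(index.describe_index_stats())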

{'vector_type': dense, 'values': [0.033599853515625, 0.006103515625, ..., -0.02166748046875, -0.01187896728515625]}
{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'ns1': {'vector_count': 82}},
 'total_vector_count': 82}


In [93]:
import time
# Wait for the index to be ready
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)

index = pc.Index(index_name)

vectors = []
for idx, (doc, embedding) in enumerate(zip(chunks, embeddings)):
    vectors.append({
        "id": f"doc-{idx}",  # Unique ID for each document
        "values": embedding['values'],  # Extract only the values
        "metadata": {
            "source": doc.metadata['source'],  # Store source metadata
            "page": doc.metadata['page'],     # Store page metadata
            "text": doc.page_content          # Store original text
        }
    })

index.upsert(
    vectors=vectors,
    namespace="ns1"
)

{'upserted_count': 82}
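All 82 vectors fit in a single upsert here. For larger corpora it is usually safer to send vectors in batches. The sketch below shows only the batching logic: `batched` is a hypothetical helper, the batch size of 25 is an arbitrary choice for the demo, and the `index.upsert` call is left as a comment because it needs a live index.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch

# Stand-in vectors with the same shape of record used in the upsert above.
demo_vectors = [{"id": f"doc-{i}", "values": [0.0, 0.0]} for i in range(82)]

batch_sizes = []
for batch in batched(demo_vectors, 25):
    # With a real index this would be:
    # index.upsert(vectors=batch, namespace="ns1")
    batch_sizes.append(len(batch))

print(batch_sizes)
```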

In [96]:
query = "Tell me about the django-rest settings."

embedding = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=[query],
    parameters={
        "input_type": "query"
    }
)

In [97]:
results = index.query(
    namespace="ns1",
    vector=embedding[0].values,
    top_k=4,
    include_values=False,
    include_metadata=True
)

print(results)

{'matches': [{'id': 'doc-7',
              'metadata': {'page': 5.0,
                           'source': 'docs/django-api-tutorial-latest.pdf',
                           'text': 'Building APIs with Django and Django Rest '
                                   'Framework, Release 2.0\n'
                                   '2 Contents'},
              'score': 0.834403694,
              'values': []},
             {'id': 'doc-9',
              'metadata': {'page': 7.0,
                           'source': 'docs/django-api-tutorial-latest.pdf',
                           'text': 'Building APIs with Django and Django Rest '
                                   'Framework, Release 2.0\n'
                                   '4 Chapter 1. Introductions'},
              'score': 0.833245814,
              'values': []},
             {'id': 'doc-46',
              'metadata': {'page': 33.0,
                           'source': 'docs/django-api-tutorial-latest.pdf',
                           'text'

In [71]:
openai_api_key = os.getenv("OPENAI_API_KEY")

In [72]:

from langchain.embeddings.openai import OpenAIEmbeddings


embeddings = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=openai_api_key)


In [77]:
index_name = "textada-002"

pc.create_index(
    name=index_name,
    dimension=1536, # Replace with your model dimensions
    metric="cosine", # Replace with your model metric
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    ) 
)

In [78]:
# Wait for the index to be ready
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)

index = pc.Index(index_name)

vectors = []
for idx, doc in enumerate(chunks):
    # Get embedding for the current chunk of text
    embedding = embeddings.embed_query(doc.page_content)
    
    # Prepare vector for upsert
    vectors.append({
        "id": f"doc-{idx}",  # Unique ID for each document
        "values": embedding,  # Embed the content
        "metadata": {
            "source": doc.metadata['source'],  # Store source metadata
            "page": doc.metadata['page'],     # Store page metadata
            "text": doc.page_content          # Store original text
        }
    })

# Upsert the vectors into Pinecone (correct way)
try:
    index.upsert(vectors=vectors, namespace="ns1")
    print(f"Upserted {len(vectors)} vectors to index '{index_name}'")
except Exception as e:
    print(f"Error during upsert: {e}")

Upserted 82 vectors to index 'textada-002'


In [79]:
# Output the number of chunks and details
print(f"Number of chunks: {len(chunks)}")
for chunk in chunks:
    print(chunk)

Number of chunks: 82
page_content='Building APIs with Django and Django
Rest Framework
Release 2.0
Agiliq
Aug 07, 2021' metadata={'source': 'docs/django-api-tutorial-latest.pdf', 'page': 0}
page_content='Contents
1 Introductions 3
1.1 Who is this book for? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 How to read this book? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Setup, Models and Admin 5
2.1 Creating a project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Database setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Creating models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Activating models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 A simple API with pure Django 9
3.1 The endpoints and the URLS .

In [None]:
# Create OpenAIEmbeddings instance
# embeddings = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=openai_api_key)

# Define the query
query = "Tell me about the django-rest settings."

# Embed the query using OpenAI's API (text-embedding-ada-002 model)
embedding_Q = embeddings.embed_query(query) 

# Connect to the Pinecone index
index = pc.Index(index_name)


In [87]:

# Query the Pinecone index
results = index.query(
    namespace="ns1",
    vector=embedding_Q,  # Use the OpenAI-generated embedding vector
    top_k=2,  # Number of top results to return
    include_values=False,  # Whether to include the values of the vectors
    include_metadata=True  # Include metadata with the results
)

# Print the results
print(results)

{'matches': [{'id': 'doc-46',
              'metadata': {'page': 33.0,
                           'source': 'docs/django-api-tutorial-latest.pdf',
                           'text': 'Building APIs with Django and Django Rest '
                                   'Framework, Release 2.0\n'
                                   '30 Chapter 6. More views and viewsets'},
              'score': 0.850220501,
              'values': []},
             {'id': 'doc-9',
              'metadata': {'page': 7.0,
                           'source': 'docs/django-api-tutorial-latest.pdf',
                           'text': 'Building APIs with Django and Django Rest '
                                   'Framework, Release 2.0\n'
                                   '4 Chapter 1. Introductions'},
              'score': 0.844251812,
              'values': []}],
 'namespace': 'ns1',
 'usage': {'read_units': 6}}
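The matches above can be turned into context for the generation step. The sketch below assumes the response has been converted to, or behaves like, the plain dictionary printed above; `build_context`, `build_prompt`, the `min_score` threshold, and the prompt wording are all hypothetical choices, and no language model is called.

```python
# Trimmed copy of the query output shown above.
sample_results = {
    "matches": [
        {"id": "doc-46", "score": 0.850220501,
         "metadata": {"page": 33.0, "source": "docs/django-api-tutorial-latest.pdf",
                      "text": "Building APIs with Django and Django Rest Framework, Release 2.0\n"
                              "30 Chapter 6. More views and viewsets"}},
        {"id": "doc-9", "score": 0.844251812,
         "metadata": {"page": 7.0, "source": "docs/django-api-tutorial-latest.pdf",
                      "text": "Building APIs with Django and Django Rest Framework, Release 2.0\n"
                              "4 Chapter 1. Introductions"}},
    ],
    "namespace": "ns1",
}

def build_context(results, min_score=0.0):
    """Join the text of all matches scoring at least `min_score`, with a source label."""
    parts = []
    for match in results["matches"]:
        if match["score"] < min_score:
            continue
        meta = match["metadata"]
        parts.append(f"[{meta['source']}, page {int(meta['page'])}]\n{meta['text']}")
    return "\n\n".join(parts)

def build_prompt(question, context):
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

context = build_context(sample_results, min_score=0.85)
prompt = build_prompt("Tell me about the django-rest settings.", context)
print(prompt)
```

Note that both top matches here are short page-header chunks, so in practice it may be worth filtering out very short chunks before indexing.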
