# Vector Store Creation for Document Search

This notebook builds a searchable vector database from markdown documents using Pinecone and OpenAI embeddings. It performs the following steps:

1. Chunks markdown documents into manageable segments
2. Creates embeddings for each chunk using OpenAI's text-embedding-3-small model
3. Stores these embeddings in a Pinecone vector database for efficient similarity search

This setup enables semantic search across documents: a natural language query returns the content closest to it in meaning rather than the content that happens to share its keywords. That makes it a practical foundation for AI-powered document search and retrieval systems.
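
As a rough illustration of what "closest in meaning" reduces to, here is a minimal sketch of cosine similarity, the metric the Pinecone index in this notebook is configured with. The three-dimensional vectors are made up for illustration; real embeddings from text-embedding-3-small are much longer.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, not real embeddings).
query = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]   # points in the same direction as the query
doc_far = [0.0, 1.0, 0.0]     # orthogonal to the query

scores = {
    "doc_close": cosine_similarity(query, doc_close),
    "doc_far": cosine_similarity(query, doc_far),
}
best = max(scores, key=scores.get)  # name of the highest-scoring document
```

Note that `doc_close` scores highest even though its magnitude differs from the query's: cosine similarity compares direction only.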

## Requirements:
- OpenAI API key
- Pinecone API key
- Markdown (`.md`) documents in the current directory or any of its subdirectories

In [40]:
from pinecone import Pinecone, ServerlessSpec
import os
import dotenv
import json
from tqdm import tqdm
from openai import OpenAI
import tiktoken

# Text Chunking Function
Define a function that splits text into overlapping chunks based on token count. Tokens are counted with the tiktoken encoding for the given model name.


In [41]:
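def chunk_text_by_tokens(text: str, chunk_size: int = 500, overlap: int = 100, model_name: str = "gpt-3.5-turbo") -> list[str]:
    """Split text into chunks of at most chunk_size tokens; consecutive chunks share overlap tokens."""
    if not 0 <= overlap < chunk_size:
        # Otherwise the window would never advance and the loop below would not terminate.
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    encoding = tiktoken.encoding_for_model(model_name)
    token_ids = encoding.encode(text)

    chunks = []
    start = 0
    while start < len(token_ids):
        end = start + chunk_size
        chunk_ids = token_ids[start:end]
        chunk_text = encoding.decode(chunk_ids)
        chunks.append(chunk_text)
        start += chunk_size - overlap  # advance by the stride so neighbouring chunks overlap
    return chunks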
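
To make the overlap arithmetic concrete, the sketch below computes only the token offsets of each chunk, using the same loop and the same defaults (500-token chunks, 100-token overlap). `chunk_spans` is a hypothetical helper for illustration; it does not tokenize anything.

```python
def chunk_spans(n_tokens: int, chunk_size: int = 500, overlap: int = 100) -> list[tuple[int, int]]:
    """Return the (start, end) token offsets of each chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap
    spans = []
    start = 0
    while start < n_tokens:
        # The final chunk is clipped to the end of the document.
        spans.append((start, min(start + chunk_size, n_tokens)))
        start += stride
    return spans

spans = chunk_spans(1200)
# spans == [(0, 500), (400, 900), (800, 1200)]
```

Each chunk starts 400 tokens after the previous one, so neighbouring chunks share 100 tokens of context.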

# Setup Pinecone and OpenAI Clients
Initialize the Pinecone and OpenAI clients from API keys stored in environment variables (loaded from a `.env` file if one is present).
Then define the passage-loading function, which reads each markdown file and splits it into chunks.


In [42]:

dotenv.load_dotenv()

pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def load_passages(data_dir="."):
    """Walk data_dir and return {relative_path: [chunk, ...]} for every .md file."""
    passages = {}

    for root, _dirs, files in os.walk(data_dir):
        for file in files:
            if file.endswith('.md'):
                file_path = os.path.join(root, file)
                relative_path = os.path.relpath(file_path, data_dir)

                try:
                    with open(file_path, 'r', encoding='utf-8') as f:
                        full_text = f.read()

                    full_text = full_text.strip()

                    chunked_passages = chunk_text_by_tokens(
                        full_text,
                        chunk_size=500,
                        overlap=100,
                        model_name="gpt-3.5-turbo"
                    )

                    # Skip empty files: they yield no chunks, so there is nothing to embed.
                    if chunked_passages:
                        passages[relative_path] = chunked_passages

                except Exception as e:
                    print(f"Error processing file {file_path}: {e}")

    # Each entry is one file; its value holds that file's chunks.
    print(f"Loaded {len(passages)} passages from {data_dir}")
    return passages
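
The file-discovery part of this function can be tried in isolation. The sketch below builds a throwaway directory tree and lists the files the same `os.walk` filter would pick up; `find_markdown_files` and the file names are hypothetical and used only for illustration.

```python
import os
import tempfile

def find_markdown_files(data_dir: str) -> list[str]:
    """Return sorted paths of .md files under data_dir, relative to data_dir."""
    found = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.endswith(".md"):
                full_path = os.path.join(root, name)
                found.append(os.path.relpath(full_path, data_dir))
    return sorted(found)

# Build a temporary tree with two markdown files and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "guides"))
    for rel in ("intro.md", "notes.txt", os.path.join("guides", "setup.md")):
        with open(os.path.join(tmp, rel), "w", encoding="utf-8") as f:
            f.write("# placeholder\n")
    md_files = find_markdown_files(tmp)
```

Here `md_files` contains `intro.md` and the nested `guides/setup.md`, but not `notes.txt`. The relative paths are what later become the dictionary keys and the `source_file` metadata.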

# Create Embeddings
Load the markdown files and create embeddings for each chunk using OpenAI's embedding model.


In [43]:
data = load_passages()

embedding_dict = {}

# One request per file: the response's .data list holds one embedding per chunk.
for key, value in data.items():
    embedding_dict[key] = client.embeddings.create(
        model="text-embedding-3-small",
        input=value
    )

Loaded 12 passages from data
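
The shape of the response matters for the later cells, which read `.data` and `.embedding`. The sketch below shows that access pattern with an offline stand-in for the client, so it runs without an API key. `embed_chunks` and `FakeEmbeddingsClient` are hypothetical names, and the two-dimensional vectors are placeholders rather than real embeddings.

```python
from types import SimpleNamespace

def embed_chunks(client, chunks: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Embed all chunks in one request and return the vectors in input order."""
    response = client.embeddings.create(model=model, input=chunks)
    # Each item carries the position of its input text in `index`.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


class FakeEmbeddingsClient:
    """Offline stand-in for the OpenAI client (hypothetical, for demonstration only)."""

    def __init__(self):
        self.embeddings = self  # so that client.embeddings.create(...) resolves

    def create(self, model, input):
        # One fake 2-dimensional vector per input string.
        data = [
            SimpleNamespace(index=i, embedding=[float(len(text)), float(i)])
            for i, text in enumerate(input)
        ]
        return SimpleNamespace(data=data)


fake_vectors = embed_chunks(FakeEmbeddingsClient(), ["alpha", "be"])
```

With the real client, the same function would return one embedding per chunk, aligned with the chunk list by position.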


# Create Pinecone Index
Set up a new serverless Pinecone index with 1536 dimensions, the default output size of text-embedding-3-small, and cosine similarity as the distance metric.


In [44]:
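index_name = "showcase-index"

# Create the index only if it does not exist yet, so the cell can be re-run safely.
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # output size of text-embedding-3-small
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1",
        )
    )

index = pc.Index(index_name)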
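
Because the index dimension is fixed at creation time, every vector must have exactly that length. A minimal sketch of a pre-upload sanity check follows; `check_dimensions` is a hypothetical helper, and the vectors are placeholders.

```python
def check_dimensions(vectors: list[list[float]], expected: int = 1536) -> None:
    """Raise ValueError if any vector's length differs from the index dimension."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected:
            raise ValueError(
                f"vector {position} has {len(vector)} dimensions, expected {expected}"
            )

# Two correctly sized placeholder vectors pass silently.
check_dimensions([[0.0] * 1536, [0.1] * 1536])

# A 3072-dimensional vector (the size text-embedding-3-large produces) is rejected.
try:
    check_dimensions([[0.0] * 1536, [0.0] * 3072])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Running a check like this before the upload catches an embedding-model mismatch early.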


# Prepare Vector Data
Format the embeddings and metadata into vectors suitable for Pinecone storage.


In [45]:
vectors = []

for file_path, file_embeddings in embedding_dict.items():
    file_vectors = [
        {
            "id": f"{file_path}_{i}",  # Create unique ID combining file path and index
            "values": embedding.embedding,
            "metadata": {
                "source_file": file_path,
                "text": data[file_path][i]
            }
        }
        for i, embedding in enumerate(file_embeddings.data)
    ]
    
    # Extend our main vectors list with the vectors from this file
    vectors.extend(file_vectors)

print(f"Created {len(vectors)} total vectors")

Created 157 total vectors


# Upload Vectors to Pinecone
Upload all prepared vectors to the Pinecone index.

In [77]:
index.upsert(vectors=vectors)
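
The upload cell sends every vector in a single request. Pinecone limits the size of one upsert request, so larger collections are usually uploaded in batches. A minimal sketch, assuming a batch size of 100 (an illustrative choice, not a value from this notebook): `upsert_in_batches` and `RecordingIndex` are hypothetical helpers, and the latter only stands in for the real index so the example runs offline.

```python
def upsert_in_batches(index, vectors: list[dict], batch_size: int = 100) -> int:
    """Upsert vectors in slices of batch_size; return how many were sent."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    sent = 0
    for start in range(0, len(vectors), batch_size):
        batch = vectors[start:start + batch_size]
        index.upsert(vectors=batch)
        sent += len(batch)
    return sent


class RecordingIndex:
    """Offline stand-in for a Pinecone index (hypothetical, for demonstration only)."""

    def __init__(self):
        self.batch_sizes = []

    def upsert(self, vectors):
        # Record the size of each request instead of contacting Pinecone.
        self.batch_sizes.append(len(vectors))


# 157 placeholder records, matching the vector count this notebook reports.
demo_vectors = [{"id": f"doc.md_{i}", "values": [0.0, 0.0]} for i in range(157)]
demo_index = RecordingIndex()
sent = upsert_in_batches(demo_index, demo_vectors)
# demo_index.batch_sizes == [100, 57]
```

With the real `index` object from this notebook, the call would be `upsert_in_batches(index, vectors)`.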
