# Twilight Imperium Embedding Generator
## Step 3: Generate Embeddings and Create Vector Database

This notebook creates vector embeddings for all text chunks using OpenAI's text-embedding-3-small model and stores them in a FAISS vector database for fast similarity search.


## 🔑 OpenAI API Setup Instructions

**Before running this notebook, you need to set up your OpenAI API key:**

### Step 1: Get OpenAI API Key
1. Go to [OpenAI Platform](https://platform.openai.com/)
2. Sign up or log in to your account
3. Navigate to the **API Keys** section
4. Click **"Create new secret key"**
5. Copy the key (starts with `sk-...`)

### Step 2: Set Environment Variable
**Windows (Anaconda Prompt; `setx` only affects new sessions, so reopen the prompt and restart Jupyter afterwards):**
```bash
setx OPENAI_API_KEY "your-api-key-here"
```

**Alternative: Create a `.env` file**
1. Create a file named `.env` in your project root directory
2. Add this line: `OPENAI_API_KEY=your-api-key-here`
3. Save the file

### Step 3: Verify Setup
Run the cells below to verify your API key is working.

**💰 Cost Estimate:** well under $0.01 for 286 chunks with text-embedding-3-small (roughly 40k tokens at about $0.02 per million tokens; check current OpenAI pricing)
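
As a rough sanity check on the cost figure, the arithmetic can be sketched as follows. The chunk count (286) and average chunk size (583 characters) come from the Step 2 statistics loaded later in this notebook. The ratio of 4 characters per token and the price of $0.02 per million tokens are assumptions and should be checked against the current OpenAI pricing page. `estimate_embedding_cost` is a hypothetical helper, not part of the pipeline.

```python
def estimate_embedding_cost(num_chunks, avg_chars_per_chunk,
                            chars_per_token=4.0,          # assumption: rough heuristic for English text
                            usd_per_million_tokens=0.02):  # assumption: verify against current pricing
    """Return (estimated_tokens, estimated_cost_usd) for embedding the corpus once."""
    total_chars = num_chunks * avg_chars_per_chunk
    estimated_tokens = total_chars / chars_per_token
    estimated_cost = estimated_tokens / 1_000_000 * usd_per_million_tokens
    return estimated_tokens, estimated_cost


tokens, cost = estimate_embedding_cost(num_chunks=286, avg_chars_per_chunk=583)
print(f"Estimated tokens: {tokens:,.0f}")
print(f"Estimated cost:   ${cost:.4f}")
```

Under these assumptions the result lands well below one cent, so the estimate above should be read as an upper bound.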


In [1]:
# Import necessary libraries
import json
import os
from pathlib import Path
from typing import List, Dict, Any
import numpy as np
from tqdm import tqdm

# LangChain imports
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document  # current import location for Document

# Load environment variables (for .env file)
from dotenv import load_dotenv
load_dotenv()

print("✅ All libraries imported successfully")


✅ All libraries imported successfully


In [2]:
# Force reload environment variables and check
from dotenv import load_dotenv
load_dotenv(override=True)  # Values in .env override variables already set in the environment

import os
api_key = os.getenv("OPENAI_API_KEY")
print(f"API key found: {api_key is not None}")
if api_key:
    # Show only the generic prefix so no key material ends up in saved notebook output
    print(f"Key starts with: {api_key[:7]}...")

API key found: True
Key starts with: sk-proj...


In [3]:
# Verify OpenAI API key is set
api_key = os.getenv("OPENAI_API_KEY")

if not api_key:
    print("❌ OpenAI API key not found!")
    print("Please set your OPENAI_API_KEY environment variable or create a .env file")
    print("See the instructions above for how to do this.")
elif api_key.startswith("sk-"):
    print("✅ OpenAI API key found and appears valid")
    print(f"Key starts with: {api_key[:7]}...")  # Show only the generic prefix, never key material
else:
    print("⚠️  API key found but format looks incorrect")
    print("OpenAI API keys should start with 'sk-'")

# Initialize the embedding model
try:
    embeddings_model = OpenAIEmbeddings(
        model="text-embedding-3-small",
        openai_api_key=api_key
    )
    print("✅ OpenAI Embeddings model initialized successfully")
    print(f"Model: {embeddings_model.model}")
except Exception as e:
    print(f"❌ Error initializing embeddings model: {e}")
    print("Please check your API key and internet connection")

✅ OpenAI API key found and appears valid
Key starts with: sk-proj...
✅ OpenAI Embeddings model initialized successfully
Model: text-embedding-3-small


In [4]:
# Load the chunked data from Step 2
processed_rules_dir = Path("processed_rules")
chunked_data_path = processed_rules_dir / "chunked_data.json"

# Check if the chunked data exists
if not chunked_data_path.exists():
    print("❌ Error: Could not find chunked data from Step 2.")
    print("Please run the text_chunker.ipynb notebook first!")
else:
    print("✅ Found chunked data from Step 2")
    
    # Load the data
    with open(chunked_data_path, 'r', encoding='utf-8') as f:
        chunked_data = json.load(f)
    
    # Extract the chunks
    all_chunks = chunked_data['all_chunks']
    
    print(f"📊 Loaded {len(all_chunks)} chunks:")
    print(f"  - Learn to Play: {chunked_data['statistics']['learn_to_play_chunks']} chunks")
    print(f"  - Rulebook: {chunked_data['statistics']['rulebook_chunks']} chunks")
    print(f"  - Average chunk size: {chunked_data['statistics']['avg_chunk_size']} characters")


✅ Found chunked data from Step 2
📊 Loaded 286 chunks:
  - Learn to Play: 182 chunks
  - Rulebook: 104 chunks
  - Average chunk size: 583 characters


In [5]:
# Convert chunks to LangChain Documents
def create_langchain_documents(chunks: List[Dict[str, Any]]) -> List[Document]:
    """
    Convert our chunks with metadata to LangChain Document objects
    """
    documents = []
    
    for chunk in chunks:
        # Create a Document with the content and metadata
        doc = Document(
            page_content=chunk['content'],
            metadata=chunk['metadata']
        )
        documents.append(doc)
    
    return documents

# Create Document objects from our chunks
print("📄 Converting chunks to LangChain Documents...")
documents = create_langchain_documents(all_chunks)

print(f"✅ Created {len(documents)} Document objects")
print(f"📋 Sample metadata keys: {list(documents[0].metadata.keys())}")

# Preview a sample document
sample_doc = documents[0]
print(f"\n🔍 Sample Document:")
print(f"  Content length: {len(sample_doc.page_content)} characters")
print(f"  Source: {sample_doc.metadata['source']}")
print(f"  Chunk ID: {sample_doc.metadata['chunk_id']}")
if 'section' in sample_doc.metadata:
    print(f"  Section: {sample_doc.metadata['section']}")
print(f"  Content preview: {sample_doc.page_content[:200]}...")


📄 Converting chunks to LangChain Documents...
✅ Created 286 Document objects
📋 Sample metadata keys: ['source', 'doc_type', 'chunk_id', 'chunk_index', 'total_chunks', 'char_count', 'word_count', 'section']

🔍 Sample Document:
  Content length: 33 characters
  Source: learn_to_play
  Chunk ID: learn_to_play_chunk_000
  Section: ®
  Content preview: --- Page 1 ---

®

--- Page 2 ---...
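
The sample document above consists only of page markers and a trademark symbol, so its embedding carries no rules text. One possible way to screen out such chunks before embedding is sketched below. The `--- Page N ---` marker pattern is inferred from the preview, and the helper name `has_rule_text` and the 20-character threshold are assumptions rather than part of the original pipeline.

```python
import re

# Assumption: page markers look like "--- Page 12 ---", as in the preview above
PAGE_MARKER = re.compile(r"---\s*Page\s+\d+\s*---")


def has_rule_text(text, min_alnum_chars=20):
    """Return True if the chunk still has meaningful text once page markers are removed."""
    without_markers = PAGE_MARKER.sub("", text)
    # Count letters and digits only, so symbols such as "®" and whitespace are ignored
    alnum_count = sum(ch.isalnum() for ch in without_markers)
    return alnum_count >= min_alnum_chars


marker_only = "--- Page 1 ---\n\n®\n\n--- Page 2 ---"
real_text = "During the movement step of a tactical action, the active player may move units."

print(has_rule_text(marker_only))
print(has_rule_text(real_text))
```

A filter like this could be applied to `all_chunks` before calling `create_langchain_documents`, for example with `[c for c in all_chunks if has_rule_text(c['content'])]`.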


In [6]:
# Generate embeddings and create FAISS vector store
print("🔥 Starting embedding generation...")
print(f"📊 Processing {len(documents)} documents")
print("⏳ This may take a few minutes depending on your internet connection...")

try:
    # Create FAISS vector store from documents
    # This will automatically generate embeddings for each document
    vector_store = FAISS.from_documents(
        documents=documents,
        embedding=embeddings_model
    )
    
    print("✅ Embeddings generated successfully!")
    print(f"📈 Vector store created with {vector_store.index.ntotal} vectors")
    
    # Read the embedding dimension from the FAISS index itself (avoids an extra API call)
    embedding_dimension = vector_store.index.d
    print(f"🔢 Embedding dimension: {embedding_dimension}")
    
except Exception as e:
    print(f"❌ Error generating embeddings: {e}")
    print("Please check your API key, internet connection, and API usage limits")


🔥 Starting embedding generation...
📊 Processing 286 documents
⏳ This may take a few minutes depending on your internet connection...
✅ Embeddings generated successfully!
📈 Vector store created with 286 vectors
🔢 Embedding dimension: 1536
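
To illustrate what a similarity search over these vectors amounts to, here is a minimal NumPy sketch of brute-force nearest-neighbour lookup by squared Euclidean (L2) distance. It is an illustration only, not the FAISS or LangChain implementation. It assumes the vector store uses an exact L2 index, which is the usual default for `FAISS.from_documents`. The function `top_k_l2` and the toy 2-dimensional vectors are hypothetical stand-ins for the real 1536-dimensional embeddings.

```python
import numpy as np


def top_k_l2(query_vec, doc_vecs, k=3):
    """Return (indices, squared_distances) of the k rows in doc_vecs closest to query_vec."""
    diffs = doc_vecs - query_vec              # broadcast the query against every document vector
    sq_dists = np.sum(diffs ** 2, axis=1)     # squared L2 distance per document
    order = np.argsort(sq_dists)[:k]          # smallest distance first
    return order, sq_dists[order]


# Toy "embeddings": four documents in 2 dimensions
doc_vecs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.7, 0.7],
    [-1.0, 0.0],
])
query_vec = np.array([0.9, 0.1])

indices, distances = top_k_l2(query_vec, doc_vecs, k=2)
print("Closest documents:", indices.tolist())
print("Squared distances:", distances.tolist())
```

Lower distance means a closer match. An exact index does essentially this comparison against every stored vector, which is perfectly adequate for a few hundred chunks.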


In [7]:
# Test the vector store with a sample query
print("🧪 Testing vector store with sample queries...")

test_queries = [
    "How do I move ships in combat?",
    "What happens when I activate a system?",
    "How do strategy cards work?",
    "What are the victory conditions?",
    "How does ground combat work?"
]

print("\n🔍 Sample Query Results:")
print("=" * 50)

for i, query in enumerate(test_queries[:2]):  # Test first 2 queries
    print(f"\n🔸 Query {i+1}: '{query}'")
    
    try:
        # Perform similarity search
        similar_docs = vector_store.similarity_search(
            query=query,
            k=3  # Get top 3 most similar chunks
        )
        
        print(f"   Found {len(similar_docs)} similar documents:")
        
        for j, doc in enumerate(similar_docs):
            print(f"   📄 Result {j+1}:")
            print(f"      Source: {doc.metadata['source']}")
            print(f"      Chunk ID: {doc.metadata['chunk_id']}")
            if 'section' in doc.metadata:
                print(f"      Section: {doc.metadata['section']}")
            print(f"      Preview: {doc.page_content[:150]}...")
            print()
            
    except Exception as e:
        print(f"   ❌ Error testing query: {e}")

print("✅ Vector store testing complete!")


🧪 Testing vector store with sample queries...

🔍 Sample Query Results:
==================================================

🔸 Query 1: 'How do I move ships in combat?'
   Found 3 similar documents:
   📄 Result 1:
      Source: learn_to_play
      Chunk ID: learn_to_play_chunk_110
      Preview: 13 2. MOVEMENT During the movement step of a tactical action, the active player may choose to move some of his units into the active system. Each ship...

   📄 Result 2:
      Source: learn_to_play
      Chunk ID: learn_to_play_chunk_117
      Preview: . 3. SPACE COMBAT If multiple players have ships in the active system, they must resolve a space combat in that system. During combat, the active play...

   📄 Result 3:
      Source: learn_to_play
      Chunk ID: learn_to_play_chunk_118
      Preview: . iii. MAKE COMBAT ROLLS: Each player rolls one die for each ship he has in the active system. If the result of a unit’s die roll is equal to or great...


🔸 Query 2: 'What happens when I activate a system?'
   Found 3 similar documents:
   📄 Result 

In [8]:
# Save vector store to disk for later use
vector_store_dir = processed_rules_dir / "vector_store"

print(f"💾 Saving vector store to: {vector_store_dir}")

try:
    # Save the FAISS vector store
    vector_store.save_local(str(vector_store_dir))
    
    print("✅ Vector store saved successfully!")
    print(f"📁 Saved files in: {vector_store_dir}")
    
    # List the created files
    if vector_store_dir.exists():
        files = list(vector_store_dir.glob("*"))
        print(f"📋 Created files:")
        for file in files:
            print(f"   - {file.name}")
    
    # Save embedding configuration for easy reloading
    embedding_config = {
        'model_name': 'text-embedding-3-small',
        'embedding_dimension': embedding_dimension,
        'total_vectors': vector_store.index.ntotal,
        'total_documents': len(documents),
        'vector_store_path': vector_store_dir.as_posix(),  # forward slashes keep the path portable
        'created_from_chunks': len(all_chunks),
        'sources': {
            'learn_to_play_chunks': chunked_data['statistics']['learn_to_play_chunks'],
            'rulebook_chunks': chunked_data['statistics']['rulebook_chunks']
        }
    }
    
    # Save config file
    config_path = processed_rules_dir / "embedding_config.json"
    with open(config_path, 'w', encoding='utf-8') as f:
        json.dump(embedding_config, f, indent=2, ensure_ascii=False)
    
    print(f"⚙️  Configuration saved to: {config_path}")
    
except Exception as e:
    print(f"❌ Error saving vector store: {e}")

print(f"\n🎉 Step 3 Complete!")
print(f"📊 Summary:")
print(f"  - Generated embeddings for {len(documents)} text chunks")
print(f"  - Created FAISS vector store with {embedding_dimension}-dimensional vectors")
print(f"  - Saved vector store locally for fast loading")
print(f"  - Ready for Step 4: Create LangChain search tool")
print(f"\n🚀 Next step: Build the LangChain agent and chatbot interface!")


💾 Saving vector store to: processed_rules\vector_store
✅ Vector store saved successfully!
📁 Saved files in: processed_rules\vector_store
📋 Created files:
   - index.faiss
   - index.pkl
⚙️  Configuration saved to: processed_rules\embedding_config.json

🎉 Step 3 Complete!
📊 Summary:
  - Generated embeddings for 286 text chunks
  - Created FAISS vector store with 1536-dimensional vectors
  - Saved vector store locally for fast loading
  - Ready for Step 4: Create LangChain search tool

🚀 Next step: Build the LangChain agent and chatbot interface!
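
For the next step, the saved index has to be reloaded with the same embedding model that produced it. The following is a minimal sketch, assuming the `embedding_config.json` and `vector_store` files written above and an `OPENAI_API_KEY` available in the environment. The helper names `load_embedding_config`, `resolve_store_dir`, and `load_vector_store` are hypothetical. Recent `langchain_community` releases require `allow_dangerous_deserialization=True` in `FAISS.load_local` because `index.pkl` is a pickle file, so only enable it for files you created yourself.

```python
import json
from pathlib import Path


def load_embedding_config(config_path):
    """Read the JSON config saved in Step 3 and check the keys needed for reloading."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    missing = {"model_name", "vector_store_path"} - config.keys()
    if missing:
        raise KeyError(f"Config is missing keys: {sorted(missing)}")
    return config


def resolve_store_dir(config):
    """Normalize Windows-style separators so the saved path also resolves on other platforms."""
    return Path(config["vector_store_path"].replace("\\", "/"))


def load_vector_store(config_path=Path("processed_rules") / "embedding_config.json"):
    """Reload the FAISS store with the embedding model named in the config."""
    # Imported lazily so the config helpers above work without LangChain installed
    from langchain_openai import OpenAIEmbeddings
    from langchain_community.vectorstores import FAISS

    config = load_embedding_config(config_path)
    embeddings = OpenAIEmbeddings(model=config["model_name"])
    return FAISS.load_local(
        str(resolve_store_dir(config)),
        embeddings,
        allow_dangerous_deserialization=True,  # index.pkl is a pickle: trusted files only
    )
```

The stored path is relative, so `load_vector_store()` should be called from the same working directory as this notebook.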
