# 🗄️ Vector Database Module - Google Colab

This notebook runs **ONLY the Vector Database Module** to process and store legal documents.

## 📋 What this does:
- Processes Sri Lankan legal documents (Acts and Cases)
- Splits each document into chunks and numbers the chunks in sequence
- Creates embeddings using Legal-BERT/Sentence-BERT
- Stores the embeddings in a Pinecone vector database
- Supports multilingual content (Sinhala, Tamil, English)

## 🔑 Requirements:
- A Pinecone API key and environment name
- Legal documents in the project's `legal_documents` folder
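
The chunking step listed above can be pictured with a minimal sketch. The function `chunk_with_sequence`, its parameters, and the dictionary keys are hypothetical names chosen for illustration; the module's own chunker is not shown in this notebook and may split on sentences or legal sections rather than fixed word windows.

```python
def chunk_with_sequence(text, doc_id, chunk_size=200, overlap=20):
    """Split text into overlapping word windows and number each chunk.

    Hypothetical helper: names and defaults are illustrative only.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for seq, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_size]
        chunks.append({
            "id": f"{doc_id}-{seq:04d}",   # stable ID built from the sequence number
            "sequence": seq,
            "text": " ".join(window),
        })
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_with_sequence(sample, "act-001", chunk_size=4, overlap=1)
for chunk in demo_chunks:
    print(chunk["id"], "->", chunk["text"])
```

Keeping the sequence number in both the ID and the metadata lets a later retrieval step put neighbouring chunks back in their original order.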


In [None]:
# 🔗 Step 1: Mount Google Drive and Setup
from google.colab import drive
import os
import sys
import zipfile

print("📁 Mounting Google Drive...")
drive.mount('/content/drive')
print("✅ Google Drive mounted successfully!")

In [None]:
# 📦 Step 2: Install Vector DB Specific Packages
print("📦 Installing Vector Database packages...")

# Pinned below 3.0 because Step 8 calls pinecone.init(), which pinecone-client 3.x removed
!pip install -q "pinecone-client<3"
!pip install -q sentence-transformers
!pip install -q transformers
!pip install -q torch
!pip install -q numpy pandas
!pip install -q scikit-learn
!pip install -q nltk spacy
!pip install -q python-dotenv
!pip install -q tqdm
!pip install -q langdetect

print("✅ Vector Database packages installed!")
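
Because `pip install -q` hides most output, it can help to confirm what was actually installed. The sketch below uses the standard-library `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the list simply repeats the distribution names from the install cell.

```python
from importlib import metadata

# Distribution names exactly as they were passed to pip in the install cell.
REQUIRED = [
    "pinecone-client", "sentence-transformers", "transformers", "torch",
    "numpy", "pandas", "scikit-learn", "nltk", "spacy",
    "python-dotenv", "tqdm", "langdetect",
]


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name:<24}{version or 'MISSING'}")
```

Any line that shows `MISSING` points at an install that failed silently and should be rerun without `-q`.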

In [None]:
# 📂 Step 3: Extract Vector DB Module
# 🔧 CHANGE THIS PATH to your uploaded zip file location
ZIP_PATH = '/content/drive/MyDrive/Vector_DB_module and RGA_Module.zip'

print(f"📂 Extracting Vector DB module from: {ZIP_PATH}")

try:
    # Extract the project
    with zipfile.ZipFile(ZIP_PATH, 'r') as zip_ref:
        zip_ref.extractall('/content/')
    
    # Change to Vector DB directory
    vector_db_dir = '/content/Vector_DB_module and RGA_Module/Vector_DB_module'
    os.chdir(vector_db_dir)
    
    # Add to Python path
    sys.path.append('/content/Vector_DB_module and RGA_Module')
    sys.path.append(vector_db_dir)
    
    print("✅ Vector DB module extracted successfully!")
    print("📁 Vector DB module contents:")
    !ls -la
    
except FileNotFoundError:
    print("❌ Zip file not found! Please check the ZIP_PATH variable.")
except Exception as e:
    print(f"❌ Error extracting project: {e}")
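
The extraction cell assumes the archive unpacks into a folder named after the zip file. If `os.chdir` fails, listing the archive's top-level entries shows which folder name to use. This is a minimal sketch; `top_level_entries` is a hypothetical helper, and the demo archive is built in a temporary directory only so the example runs on its own.

```python
import os
import tempfile
import zipfile


def top_level_entries(zip_path):
    """Return the sorted top-level names inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Names inside a zip always use forward slashes.
        return sorted({name.split("/", 1)[0] for name in archive.namelist() if name})


# Self-contained demo with a throwaway archive.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("project/Vector_DB_module/main_fixed.py", "print('hi')\n")
        archive.writestr("project/README.md", "demo\n")
    demo_entries = top_level_entries(demo_zip)

print(demo_entries)
```

In the notebook you would call `top_level_entries(ZIP_PATH)` and adjust `vector_db_dir` to match what it reports.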

In [None]:
# 🔑 Step 4: Setup Vector DB Environment
print("🔑 Setting up Vector Database environment...")

# 🔧 CHANGE THESE VALUES to your actual Pinecone credentials
PINECONE_API_KEY = 'your-pinecone-api-key-here'
PINECONE_ENVIRONMENT = 'your-pinecone-environment-here'
INDEX_NAME = 'sri-lankan-legal-docs'  # Your Pinecone index name

# Set environment variables
os.environ['PINECONE_API_KEY'] = PINECONE_API_KEY
os.environ['PINECONE_ENVIRONMENT'] = PINECONE_ENVIRONMENT
os.environ['INDEX_NAME'] = INDEX_NAME

# Create .env file
env_content = f"""PINECONE_API_KEY={PINECONE_API_KEY}
PINECONE_ENVIRONMENT={PINECONE_ENVIRONMENT}
INDEX_NAME={INDEX_NAME}
"""

with open('.env', 'w') as f:
    f.write(env_content)

print("✅ Vector Database environment configured!")
print(f"📊 Index Name: {INDEX_NAME}")
print("⚠️  Make sure to replace the placeholder API keys with your actual keys!")
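
Running later steps with the template values still in place only fails once Pinecone rejects the request. A small check can catch that earlier. This is a sketch under the assumption that placeholders follow the `your-...-here` pattern used above; `unset_credentials` is a hypothetical helper, and `us-east-1-aws` is just an example value.

```python
def unset_credentials(env, keys=("PINECONE_API_KEY", "PINECONE_ENVIRONMENT")):
    """Return the keys whose values are missing, empty, or still placeholders."""
    problems = []
    for key in keys:
        value = env.get(key, "").strip()
        is_placeholder = value.startswith("your-") and value.endswith("-here")
        if not value or is_placeholder:
            problems.append(key)
    return problems


demo_env = {
    "PINECONE_API_KEY": "your-pinecone-api-key-here",   # untouched template value
    "PINECONE_ENVIRONMENT": "us-east-1-aws",            # example value only
}
print(unset_credentials(demo_env))
```

In the notebook, pass `os.environ` instead of `demo_env`. To keep the key out of the saved notebook altogether, one option is to read it with the standard-library `getpass.getpass()` rather than typing it into the cell.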

In [None]:
# 📚 Step 5: Check Legal Documents
print("📚 Checking for legal documents...")

# Check for legal documents directory
legal_docs_dir = 'legal_documents'
if os.path.exists(legal_docs_dir):
    print(f"✅ Found legal documents directory: {legal_docs_dir}")
    
    # List document types
    for root, dirs, files in os.walk(legal_docs_dir):
        if files:
            rel_path = os.path.relpath(root, legal_docs_dir)
            print(f"📁 {rel_path}: {len(files)} files")
            # Show first few files as examples
            for file in files[:3]:
                print(f"   📄 {file}")
            if len(files) > 3:
                print(f"   ... and {len(files) - 3} more files")
else:
    print("❌ No legal_documents directory found!")
    print("💡 Please add your legal documents to the legal_documents folder")
    print("📁 Expected structure:")
    print("   legal_documents/")
    print("   ├── acts/")
    print("   ├── cases/")
    print("   └── regulations/")

print("\n📊 System Information:")
!nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits
print("💾 Available RAM:")
!grep MemAvailable /proc/meminfo
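
Beyond counting files per folder, a breakdown by file type shows quickly whether the folder holds formats the processor may not expect. A minimal sketch, assuming nothing about which formats the module supports; `count_by_extension` is a hypothetical helper and the demo tree is created in a temporary directory.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_by_extension(root):
    """Count files under root by lowercase extension ('' when there is none)."""
    return Counter(p.suffix.lower() for p in Path(root).rglob("*") if p.is_file())


# Self-contained demo mirroring the expected acts/ and cases/ layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "acts").mkdir()
    (Path(tmp) / "cases").mkdir()
    (Path(tmp) / "acts" / "act_1.pdf").write_text("demo")
    (Path(tmp) / "acts" / "act_2.PDF").write_text("demo")
    (Path(tmp) / "cases" / "case_1.txt").write_text("demo")
    demo_counts = count_by_extension(tmp)

print(dict(demo_counts))
```

In the notebook, `count_by_extension('legal_documents')` gives the same summary for the real folder.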

In [None]:
# 🚀 Step 6: Initialize and Run Vector Database Processing
print("🚀 Starting Vector Database processing...")
print("⏳ This may take several minutes depending on document size...")

import subprocess

try:
    # Run the main Vector DB script as a child process (sys was imported in Step 1)
    result = subprocess.run(
        [sys.executable, 'main_fixed.py'],
        capture_output=True, text=True, timeout=3600,  # 1-hour timeout
    )
    
    print("📤 Vector DB Processing Output:")
    print("=" * 50)
    print(result.stdout)
    
    if result.stderr:
        print("⚠️  Warnings/Errors:")
        print(result.stderr)
    
    if result.returncode == 0:
        print("\n✅ Vector Database processing completed successfully!")
        print("🎉 Legal documents have been processed and stored in Pinecone!")
    else:
        print(f"\n❌ Processing failed with return code: {result.returncode}")
        
except subprocess.TimeoutExpired:
    print("⏰ Processing timed out after 1 hour")
    print("💡 Try processing smaller batches of documents")
except Exception as e:
    print(f"❌ Error during processing: {e}")
    print("💡 Try running the processing manually in the next cell")
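
With `capture_output=True`, nothing is shown until the script exits, which can look like a hang on a long run. One alternative is to stream the output line by line. The sketch below uses `subprocess.Popen` from the standard library; `run_streaming` is a hypothetical helper, and the demo command stands in for `main_fixed.py`.

```python
import subprocess
import sys


def run_streaming(command):
    """Run a command, echoing each output line as it arrives.

    Returns (return_code, lines). stderr is merged into stdout.
    """
    lines = []
    with subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1,  # line-buffered on our side of the pipe
    ) as process:
        for line in process.stdout:
            print(line, end="")
            lines.append(line.rstrip("\n"))
    # Leaving the with-block waits for the process, so returncode is set here.
    return process.returncode, lines


demo_code, demo_lines = run_streaming(
    [sys.executable, "-u", "-c", "print('step 1'); print('step 2')"]
)
```

For the real run the command would be `[sys.executable, '-u', 'main_fixed.py']`, where `-u` turns off the child's output buffering. Unlike the cell above, this sketch has no timeout, so you would interrupt the cell by hand if processing stalls.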

In [None]:
# 🔧 Step 7: Manual Processing (Alternative)
print("🔧 Manual Vector DB Processing (if automatic failed)")
print("Uncomment and run the code below for manual processing:")

# Uncomment the following lines for manual processing:

# try:
#     # Import the main components
#     from vector_db_connector import VectorDBConnector
#     from document_processor import DocumentProcessor
#     from embedding_generator import EmbeddingGenerator
#     
#     print("📊 Initializing Vector Database components...")
#     
#     # Initialize components
#     embedding_gen = EmbeddingGenerator()
#     doc_processor = DocumentProcessor()
#     vector_db = VectorDBConnector()
#     
#     print("📚 Processing legal documents...")
#     
#     # Process documents
#     processed_docs = doc_processor.process_directory('legal_documents')
#     print(f"✅ Processed {len(processed_docs)} documents")
#     
#     # Generate embeddings and store
#     for doc in processed_docs:
#         embeddings = embedding_gen.generate_embeddings(doc['chunks'])
#         vector_db.store_embeddings(doc['id'], embeddings, doc['metadata'])
#     
#     print("🎉 Manual processing completed successfully!")
#     
# except Exception as e:
#     print(f"❌ Manual processing error: {e}")
#     import traceback
#     traceback.print_exc()

print("💡 To use manual processing:")
print("   1. Uncomment the code above")
print("   2. Run this cell")
print("   3. Monitor the output for any errors")
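
The manual path stores one document's embeddings per call. How `store_embeddings` sends them is not shown here; if it sends everything in a single request, a long Act with many chunks could exceed request-size limits. Splitting the vectors into batches is a common workaround. This is a minimal sketch with a hypothetical `batched` helper and made-up demo vectors; a batch size around 100 is a common starting point, not a value taken from this module.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Made-up (id, values, metadata) tuples standing in for real embeddings.
demo_vectors = [(f"doc-{i}", [0.0, 0.1, 0.2], {"sequence": i}) for i in range(7)]
demo_batches = list(batched(demo_vectors, 3))
print([len(batch) for batch in demo_batches])
```

Each batch would then be passed to whatever upload call the connector exposes, one request at a time.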

In [None]:
# 📊 Step 8: Verify Vector Database Status
print("📊 Checking Vector Database status...")

try:
    import pinecone
    
    # Initialize Pinecone (pinecone-client 2.x API; pinecone.init() was removed in 3.0)
    pinecone.init(
        api_key=os.environ['PINECONE_API_KEY'],
        environment=os.environ['PINECONE_ENVIRONMENT']
    )
    
    # Check index status
    index_name = os.environ['INDEX_NAME']
    
    if index_name in pinecone.list_indexes():
        index = pinecone.Index(index_name)
        stats = index.describe_index_stats()
        
        print("✅ Vector Database Status:")
        print(f"📊 Index Name: {index_name}")
        print(f"📈 Total Vectors: {stats['total_vector_count']}")
        print(f"📏 Dimension: {stats['dimension']}")
        
        if 'namespaces' in stats:
            print("📁 Namespaces:")
            for namespace, info in stats['namespaces'].items():
                print(f"   {namespace}: {info['vector_count']} vectors")
        
        print("\n🎉 Vector Database is ready for RAG Module!")
        
    else:
        print(f"❌ Index '{index_name}' not found!")
        print("Available indexes:", pinecone.list_indexes())
        
except Exception as e:
    print(f"❌ Error checking database status: {e}")
    print("💡 Make sure your Pinecone credentials are correct")
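
Printing the statistics does not say whether they are healthy. The sketch below turns the same fields the cell prints (`total_vector_count`, `dimension`, `namespaces`, `vector_count`) into a list of problems. `check_index` is a hypothetical helper that works on a plain dict, so the client's response would need converting first (assumption: through its `to_dict()` method). The value 768 is only an example; use your embedding model's output size.

```python
def check_index(stats, expected_dimension):
    """Return a list of human-readable problems found in index statistics."""
    problems = []
    total = stats.get("total_vector_count", 0)
    if total == 0:
        problems.append("index is empty")
    if stats.get("dimension") != expected_dimension:
        problems.append(
            f"dimension {stats.get('dimension')} != expected {expected_dimension}"
        )
    namespace_total = sum(
        info.get("vector_count", 0) for info in stats.get("namespaces", {}).values()
    )
    if namespace_total != total:
        problems.append("namespace counts do not add up to the total")
    return problems


# Made-up statistics shaped like the fields printed in the cell above.
demo_stats = {
    "dimension": 768,
    "total_vector_count": 1200,
    "namespaces": {"acts": {"vector_count": 800}, "cases": {"vector_count": 400}},
}
print(check_index(demo_stats, expected_dimension=768))
```

A dimension mismatch is worth catching here, since queries whose embedding size differs from the index dimension cannot be matched against the stored vectors.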

In [None]:
# 💾 Step 9: Save Processing Results
from datetime import datetime
import shutil

print("💾 Saving Vector DB processing results...")

# Create results directory in Google Drive
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
results_dir = f'/content/drive/MyDrive/Vector_DB_Results_{timestamp}'
os.makedirs(results_dir, exist_ok=True)

# Save configuration and logs
files_to_save = [
    '.env',  # ⚠️ contains the Pinecone API key in plain text; remove this entry if your Drive is shared
    'config.py',
    'vector_db.log',
    'processing.log',
    'embeddings_stats.json'
]

for file_name in files_to_save:
    if os.path.exists(file_name):
        shutil.copy2(file_name, results_dir)
        print(f"✅ Saved {file_name}")

# Save processing summary
summary = f"""Vector Database Processing Summary
==========================================
Timestamp: {datetime.now()}
Index Name: {os.environ.get('INDEX_NAME', 'N/A')}
Pinecone Environment: {os.environ.get('PINECONE_ENVIRONMENT', 'N/A')}

Processing completed in Google Colab
Ready for RAG Module integration
"""

with open(f'{results_dir}/processing_summary.txt', 'w') as f:
    f.write(summary)

print(f"\n💾 Results saved to: {results_dir}")
print("✅ Vector Database processing backup completed!")
print("\n🔗 Next Steps:")
print("   1. Use the RAG Module Colab notebook")
print("   2. Configure it with the same Pinecone credentials")
print("   3. Start asking legal questions!")
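
The backup copies `.env` to Google Drive, and that file holds the Pinecone API key in plain text. If the results folder may be shared, one option is to save a redacted copy instead. This is a minimal sketch; `redact_env` and `SECRET_KEYS` are hypothetical names, and the demo text uses a made-up key.

```python
SECRET_KEYS = ("PINECONE_API_KEY",)


def redact_env(text, secret_keys=SECRET_KEYS):
    """Replace the values of secret keys in .env-style text with a marker."""
    redacted = []
    for line in text.splitlines():
        key, separator, _value = line.partition("=")
        if separator and key.strip() in secret_keys:
            redacted.append(f"{key}=<redacted>")
        else:
            redacted.append(line)
    return "".join(f"{line}\n" for line in redacted)


demo_env_text = (
    "PINECONE_API_KEY=abc123\n"          # made-up key for the demo
    "PINECONE_ENVIRONMENT=demo-env\n"
    "INDEX_NAME=sri-lankan-legal-docs\n"
)
demo_redacted = redact_env(demo_env_text)
print(demo_redacted, end="")
```

In the notebook, the redacted text could be written to the results folder in place of copying `.env` directly.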

## 🎯 Vector Database Module Complete!

### ✅ What was accomplished:
- Legal documents processed and chunked
- Embeddings generated using Legal-BERT/Sentence-BERT
- Documents stored in Pinecone vector database
- Multilingual support enabled
- Proper sequence numbers and metadata added

### 🔗 Next Steps:
1. **Use the RAG Module Colab notebook** to create the question-answering interface
2. **Configure the same Pinecone credentials** in the RAG module
3. **Start asking legal questions** and get AI-powered responses!

### 📊 Database Ready For:
- ✅ Legal question answering
- ✅ Semantic search
- ✅ Document retrieval
- ✅ Multi-language queries
- ✅ Context-aware responses

Your Vector Database is now ready to power the RAG Module! 🚀
