# LOCALTRIAGE - Data Ingestion Notebook

This notebook demonstrates the data ingestion pipeline for the LOCALTRIAGE platform.
It covers:
1. Database setup and connection
2. Ticket data ingestion (CSV/JSON)
3. Knowledge base article ingestion
4. Embedding generation and vector store population
5. Summary of the ingested data

## Prerequisites
- PostgreSQL database running
- Vector store (FAISS or Qdrant) available
- Python environment with required packages
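
Before running the cells below, it can help to confirm that the Python packages are importable. The following is a minimal sketch using only the standard library. The helper name `missing_modules` is hypothetical, and the module list assumes the packages named in the install cell (`psycopg2-binary` imports as `psycopg2`, `sentence-transformers` as `sentence_transformers`, `faiss-cpu` as `faiss`).

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in module_names if find_spec(name) is None]


# Import names for the packages listed in the install cell (assumed set)
REQUIRED = ["pandas", "psycopg2", "sentence_transformers", "faiss", "tqdm"]

missing = missing_modules(REQUIRED)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

This only checks that the modules can be located. It does not verify versions or that PostgreSQL itself is reachable.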

In [None]:
# Install required packages (if needed)
# !pip install pandas psycopg2-binary sentence-transformers faiss-cpu tqdm

In [None]:
import os
import sys
from pathlib import Path

# Add src to path for imports
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / 'src'))

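# Configuration: local development defaults.
# setdefault() leaves any value already set in the environment untouched.
os.environ.setdefault('DB_HOST', 'localhost')
os.environ.setdefault('DB_PORT', '5432')
os.environ.setdefault('DB_NAME', 'localtriage')
os.environ.setdefault('DB_USER', 'postgres')
os.environ.setdefault('DB_PASSWORD', 'postgres')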

print(f"Project root: {project_root}")

## 1. Database Setup

First, we'll verify connectivity, then initialize the database schema and confirm that the tables exist.
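
The internals of `DatabaseConnection` are not shown in this notebook. As a rough sketch of how the `DB_*` environment variables from the configuration cell might be turned into connection settings, the hypothetical helpers below (`db_settings_from_env`, `redacted`) collect them into a dict. The defaults and the key names are assumptions, chosen to match the keyword arguments that `psycopg2.connect` accepts.

```python
import os


def db_settings_from_env(env=None):
    """Collect connection settings from DB_* variables, with assumed local defaults."""
    env = os.environ if env is None else env
    return {
        "host": env.get("DB_HOST", "localhost"),
        "port": int(env.get("DB_PORT", "5432")),
        "dbname": env.get("DB_NAME", "localtriage"),
        "user": env.get("DB_USER", "postgres"),
        "password": env.get("DB_PASSWORD", ""),
    }


def redacted(settings):
    """Return a copy of the settings that is safe to print or log."""
    return {**settings, "password": "***" if settings["password"] else ""}


# Example with an explicit mapping instead of the real environment
settings = db_settings_from_env({"DB_HOST": "db.internal", "DB_PORT": "6543"})
print(redacted(settings))
```

A dict in this shape could be passed on as `psycopg2.connect(**settings)`. Printing only the redacted copy keeps the password out of notebook output.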

In [None]:
from ingestion.ingest import DatabaseConnection

# Test database connection
with DatabaseConnection() as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT version();")
    version = cursor.fetchone()[0]
    print(f"Connected to PostgreSQL: {version[:50]}...")

In [None]:
# Initialize database schema
schema_path = project_root / 'src' / 'ingestion' / 'schema.sql'

with open(schema_path, 'r', encoding='utf-8') as f:
    schema_sql = f.read()

# Execute the schema. Caution: this creates tables in the configured database.
with DatabaseConnection() as conn:
    cursor = conn.cursor()
    cursor.execute(schema_sql)
    conn.commit()
    print("Schema initialized successfully!")

In [None]:
# Verify tables exist
with DatabaseConnection() as conn:
    cursor = conn.cursor()
    cursor.execute("""
        SELECT table_name 
        FROM information_schema.tables 
        WHERE table_schema = 'public'
        ORDER BY table_name;
    """)
    tables = cursor.fetchall()
    print("Tables in database:")
    for table in tables:
        print(f"  - {table[0]}")

## 2. Ticket Data Ingestion

Ingest support tickets from CSV or JSON files. The cells below demonstrate the CSV path.
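
How `TicketIngester` reads files and applies `column_mapping` is not shown in this notebook. The sketch below illustrates one way the idea could work, using only the standard library rather than pandas. It assumes the mapping goes from source column name to ticket field and that `subject` and `body` are required. Both are assumptions, and the helper names (`read_ticket_rows`, `map_columns`, `ingest_rows`) are hypothetical.

```python
import csv
import json
from pathlib import Path

REQUIRED_FIELDS = ("subject", "body")  # assumption: minimum fields for a ticket


def read_ticket_rows(path):
    """Return a list of dicts from a .csv or .json file, chosen by extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        if not isinstance(data, list):
            raise ValueError("expected a JSON array of ticket objects")
        return data
    raise ValueError(f"unsupported file type: {suffix}")


def map_columns(row, column_mapping):
    """Rename source columns to ticket fields and check the required ones."""
    ticket = {}
    for source, field in column_mapping.items():
        value = row.get(source)
        ticket[field] = "" if value is None else str(value).strip()
    missing = [f for f in REQUIRED_FIELDS if not ticket.get(f)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return ticket


def ingest_rows(rows, column_mapping):
    """Split rows into mapped tickets and per-row error messages."""
    tickets, errors = [], []
    for index, row in enumerate(rows):
        try:
            tickets.append(map_columns(row, column_mapping))
        except ValueError as exc:
            errors.append(f"row {index}: {exc}")
    return tickets, errors


# Example: a source file whose column names differ from the ticket fields
rows = [
    {"Title": "Cannot log in", "Text": "Password rejected"},
    {"Title": "", "Text": "Row with an empty subject"},
]
tickets, errors = ingest_rows(rows, {"Title": "subject", "Text": "body"})
print(f"ok={len(tickets)} failed={len(errors)}")
```

Collecting errors per row, instead of stopping at the first bad row, mirrors the `successful` / `failed` / `errors` summary that the ingestion cell prints.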

In [None]:
import pandas as pd

# Create sample ticket data for demonstration
sample_tickets = pd.DataFrame([
    {
        'subject': 'Cannot reset password',
        'body': 'I tried to reset my password using the forgot password link but the email never arrived. I checked my spam folder too.',
        'customer_email': 'user1@example.com',
        'category': 'Account',
        'priority': 'P2'
    },
    {
        'subject': 'Charged twice for subscription',
        'body': 'I noticed two charges on my credit card for my monthly subscription. The first charge was on the 1st and another on the 3rd. Please refund the duplicate.',
        'customer_email': 'user2@example.com',
        'category': 'Billing',
        'priority': 'P1'
    },
    {
        'subject': 'App crashes on startup',
        'body': 'After the latest update, the mobile app crashes immediately when I open it. I am using iPhone 14 with iOS 17. Reinstalling did not help.',
        'customer_email': 'user3@example.com',
        'category': 'Technical',
        'priority': 'P2'
    },
    {
        'subject': 'Order not delivered',
        'body': 'My order #12345 was supposed to arrive last week but the tracking shows it is still in transit. Can you help locate my package?',
        'customer_email': 'user4@example.com',
        'category': 'Shipping',
        'priority': 'P2'
    },
    {
        'subject': 'Feature request: Dark mode',
        'body': 'Would love to see a dark mode option in the application. It would be easier on the eyes especially when working late.',
        'customer_email': 'user5@example.com',
        'category': 'Product',
        'priority': 'P4'
    }
])

print(f"Sample data created: {len(sample_tickets)} tickets")
sample_tickets.head()

In [None]:
# Save to CSV for demonstration
data_dir = project_root / 'data' / 'raw'
data_dir.mkdir(parents=True, exist_ok=True)

csv_path = data_dir / 'sample_tickets.csv'
sample_tickets.to_csv(csv_path, index=False)
print(f"Saved to: {csv_path}")

In [None]:
from ingestion.ingest import TicketIngester

# Initialize ingester with column mapping
column_mapping = {
    'subject': 'subject',
    'body': 'body',
    'customer_email': 'customer_email',
    'category': 'category',
    'priority': 'priority'
}

ingester = TicketIngester(column_mapping=column_mapping)

# Ingest from CSV
result = ingester.ingest_from_csv(str(csv_path))

print(f"Ingested {result['successful']} tickets successfully")
print(f"Failed: {result['failed']}")
if result['errors']:
    print(f"Errors: {result['errors'][:3]}")

In [None]:
# Verify tickets in database
with DatabaseConnection() as conn:
    cursor = conn.cursor()
    cursor.execute("""
        SELECT id, subject, category, priority, created_at
        FROM tickets
        ORDER BY created_at DESC
        LIMIT 10;
    """)
    tickets = cursor.fetchall()
    
print("Recent tickets in database:")
for t in tickets:
    print(f"  [{t[3]}] {t[1][:50]}... ({t[2]})")

## 3. Knowledge Base Ingestion

Ingest knowledge base articles from Markdown files.
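
The `KBIngester` used below takes `chunk_size` and `chunk_overlap`, but its chunking logic is not shown here. As an illustration only, the sketch below splits text into overlapping character windows and pulls a title from the first top-level Markdown heading. Treating the sizes as character counts is an assumption (the real ingester may split on tokens or sentences), and `chunk_text` and `extract_title` are hypothetical names.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


def extract_title(markdown, fallback="Untitled"):
    """Return the text of the first '# ' heading, or the fallback if none exists."""
    for line in markdown.splitlines():
        if line.startswith("# "):
            return line[2:].strip()
    return fallback


sample = "x" * 1000
print([len(c) for c in chunk_text(sample)])  # [500, 500, 100]
print(extract_title("# Billing FAQ\n\n## Subscription Plans\n"))  # Billing FAQ
```

The overlap means that a sentence falling on a window boundary still appears whole in at least one chunk, at the cost of storing some text twice.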

In [None]:
# Create sample KB articles
kb_dir = data_dir / 'kb'
kb_dir.mkdir(parents=True, exist_ok=True)

kb_articles = {
    'password-reset.md': '''# How to Reset Your Password

If you've forgotten your password, follow these steps to reset it:

## Via Email

1. Go to the login page and click "Forgot Password"
2. Enter your registered email address
3. Check your inbox (and spam folder) for the reset link
4. Click the link and create a new password

**Note:** Reset links expire after 24 hours.

## Common Issues

- **Email not received:** Wait 5-10 minutes and check spam. If still missing, contact support.
- **Link expired:** Request a new reset link.
- **Account locked:** After 5 failed attempts, accounts are locked for 30 minutes.

## Security Tips

- Use a strong password with at least 12 characters
- Include uppercase, lowercase, numbers, and symbols
- Don't reuse passwords across sites
''',
    
    'billing-faq.md': '''# Billing FAQ

## Subscription Plans

We offer three subscription tiers:
- **Basic:** $9.99/month
- **Pro:** $19.99/month  
- **Enterprise:** Custom pricing

## Billing Cycle

Subscriptions are billed on the same day each month. If you signed up on the 15th, you'll be charged on the 15th each month.

## Refunds

We offer full refunds within 14 days of purchase. Contact support with your order number to request a refund.

## Duplicate Charges

If you see duplicate charges:
1. Check if both charges have completed (not pending)
2. Verify you don't have multiple subscriptions
3. Contact support with both charge dates and amounts

Duplicate charges are typically resolved within 3-5 business days.
''',
    
    'app-troubleshooting.md': '''# App Troubleshooting Guide

## App Crashes on Startup

If the app crashes immediately:

### iOS
1. Force close the app (swipe up from app switcher)
2. Restart your device
3. Check for app updates in App Store
4. Delete and reinstall the app

### Android
1. Force stop the app (Settings > Apps > [App Name] > Force Stop)
2. Clear app cache (Settings > Apps > [App Name] > Storage > Clear Cache)
3. Restart device
4. Reinstall if issues persist

## Known Issues

- **iOS 17 compatibility:** Version 3.2.1 has known issues with iOS 17. Update to version 3.2.2+
- **Android 14:** Some devices may experience slow performance. Optimization patch coming soon.

## Reporting Bugs

When reporting issues, please include:
- Device model and OS version
- App version number
- Steps to reproduce
- Screenshots if possible
'''
}

for filename, content in kb_articles.items():
    filepath = kb_dir / filename
    filepath.write_text(content, encoding='utf-8')
    print(f"Created: {filepath}")

In [None]:
from ingestion.ingest import KBIngester

# Initialize KB ingester
kb_ingester = KBIngester(
    chunk_size=500,
    chunk_overlap=50
)

# Ingest all KB articles
result = kb_ingester.ingest_directory(str(kb_dir))

print(f"Articles processed: {result['articles_processed']}")
print(f"Chunks created: {result['chunks_created']}")
if result['errors']:
    print(f"Errors: {result['errors']}")

In [None]:
# Verify KB data in database
with DatabaseConnection() as conn:
    cursor = conn.cursor()
    
    # Count articles
    cursor.execute("SELECT COUNT(*) FROM kb_articles;")
    article_count = cursor.fetchone()[0]
    
    # Count chunks
    cursor.execute("SELECT COUNT(*) FROM kb_chunks;")
    chunk_count = cursor.fetchone()[0]
    
    print(f"Total KB articles: {article_count}")
    print(f"Total KB chunks: {chunk_count}")
    
    # Sample chunks
    cursor.execute("""
        SELECT c.id, a.title, LEFT(c.content, 100) as preview
        FROM kb_chunks c
        JOIN kb_articles a ON c.article_id = a.id
        LIMIT 5;
    """)
    print("\nSample chunks:")
    for chunk in cursor.fetchall():
        print(f"  [{chunk[1]}] {chunk[2]}...")

## 4. Embedding Generation & Vector Store

Generate embeddings for KB chunks and populate the vector store.
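
To make the retrieval step concrete, here is a toy stand-in for a vector store that does exact cosine-similarity search in pure Python. It is not how `FAISSVectorStore` is implemented (those internals are not shown in this notebook), and the class name `BruteForceStore` is hypothetical. It only mirrors the `add` / `search` call shape and the `score` / `metadata` result keys used in the cells below.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


class BruteForceStore:
    """Toy in-memory store: exact cosine search over unit-length vectors."""

    def __init__(self, dimension):
        self.dimension = dimension
        self.ids, self.vectors, self.metadata = [], [], []

    def add(self, ids, vectors, metadata):
        if not (len(ids) == len(vectors) == len(metadata)):
            raise ValueError("ids, vectors and metadata must have equal length")
        for id_, vector, meta in zip(ids, vectors, metadata):
            if len(vector) != self.dimension:
                raise ValueError(f"expected dimension {self.dimension}")
            self.ids.append(id_)
            self.vectors.append(normalize(vector))
            self.metadata.append(meta)

    def search(self, query, top_k=3):
        """Return the top_k entries with the highest cosine similarity."""
        q = normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), i)
            for i, v in enumerate(self.vectors)
        ]
        scored.sort(reverse=True)
        return [
            {"id": self.ids[i], "score": score, "metadata": self.metadata[i]}
            for score, i in scored[:top_k]
        ]


store = BruteForceStore(dimension=2)
store.add(
    ["a", "b", "c"],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [{"title": "A"}, {"title": "B"}, {"title": "C"}],
)
for hit in store.search([1.0, 0.1], top_k=2):
    print(hit["id"], round(hit["score"], 3), hit["metadata"]["title"])
```

A brute-force scan like this costs time linear in the number of stored vectors per query, which is fine for a handful of KB chunks. Libraries such as FAISS exist to make the same kind of lookup fast at larger scale.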

In [None]:
from retrieval.vector_search import EmbeddingModel, FAISSVectorStore

# Initialize embedding model
print("Loading embedding model...")
embedder = EmbeddingModel()
print(f"Model loaded: {embedder.model_name}")
print(f"Embedding dimension: {embedder.dimension}")

In [None]:
# Fetch all KB chunks
with DatabaseConnection() as conn:
    cursor = conn.cursor()
    cursor.execute("""
        SELECT c.id, c.content, a.title
        FROM kb_chunks c
        JOIN kb_articles a ON c.article_id = a.id;
    """)
    chunks = cursor.fetchall()

print(f"Fetched {len(chunks)} chunks for embedding")

In [None]:
from tqdm import tqdm

# Generate embeddings
chunk_ids = [str(c[0]) for c in chunks]
chunk_texts = [c[1] for c in chunks]

print("Generating embeddings...")
embeddings = embedder.encode(chunk_texts)
print(f"Generated {len(embeddings)} embeddings of shape {embeddings[0].shape}")

In [None]:
# Initialize FAISS vector store
vector_store = FAISSVectorStore(dimension=embedder.dimension)

# Add embeddings to store
metadata = [
    {'chunk_id': chunk_id, 'title': title, 'content_preview': content[:200]}
    for chunk_id, (_, content, title) in zip(chunk_ids, chunks)
]

vector_store.add(chunk_ids, embeddings, metadata)
print(f"Added {len(chunk_ids)} vectors to FAISS index")

In [None]:
# Save the index for later use
index_dir = project_root / 'data' / 'indices'
index_dir.mkdir(parents=True, exist_ok=True)

index_path = str(index_dir / 'kb_faiss.index')
vector_store.save(index_path)
print(f"Index saved to: {index_path}")

In [None]:
# Test retrieval
test_query = "How do I reset my password if the email didn't arrive?"
query_embedding = embedder.encode([test_query])[0]

results = vector_store.search(query_embedding, top_k=3)

print(f"Query: {test_query}")
print("\nTop results:")
for result in results:
    print(f"  Score: {result['score']:.4f}")
    print(f"  Title: {result['metadata'].get('title', 'N/A')}")
    print(f"  Preview: {result['metadata'].get('content_preview', '')[:100]}...")
    print()

## 5. Data Summary

Summarize what was ingested: ticket counts, category distribution, knowledge base size, and vector index size.

In [None]:
# Generate data summary
with DatabaseConnection() as conn:
    cursor = conn.cursor()
    
    # Ticket stats
    cursor.execute("""
        SELECT 
            COUNT(*) as total,
            COUNT(DISTINCT category) as categories,
            COUNT(DISTINCT customer_email) as customers
        FROM tickets;
    """)
    ticket_stats = cursor.fetchone()
    
    # Category distribution
    cursor.execute("""
        SELECT category, COUNT(*) 
        FROM tickets 
        GROUP BY category 
        ORDER BY COUNT(*) DESC;
    """)
    cat_dist = cursor.fetchall()
    
    # KB stats
    cursor.execute("SELECT COUNT(*) FROM kb_articles;")
    kb_count = cursor.fetchone()[0]
    
    cursor.execute("SELECT COUNT(*), AVG(LENGTH(content)) FROM kb_chunks;")
    chunk_stats = cursor.fetchone()

print("=" * 50)
print("DATA INGESTION SUMMARY")
print("=" * 50)
print("\nTickets:")
print(f"  Total: {ticket_stats[0]}")
print(f"  Categories: {ticket_stats[1]}")
print(f"  Unique customers: {ticket_stats[2]}")
print("\nCategory distribution:")
for cat, count in cat_dist:
    print(f"  {cat}: {count}")
print("\nKnowledge Base:")
print(f"  Articles: {kb_count}")
print(f"  Chunks: {chunk_stats[0]}")
# AVG() returns NULL (None) when kb_chunks is empty, so guard the format call
avg_chunk_size = f"{chunk_stats[1]:.0f} chars" if chunk_stats[1] is not None else "n/a"
print(f"  Avg chunk size: {avg_chunk_size}")
print("\nVector Store:")
print(f"  Index size: {vector_store.index.ntotal}")
print("=" * 50)

## Next Steps

Now that the data is ingested, you can:

1. Run the **evaluation harness** to establish baseline metrics
2. Start the **API server** to enable ticket triage
3. Open the **Streamlit dashboard** for interactive use
4. Generate **weekly insights** using the insights notebook