<a href="https://colab.research.google.com/github/Thiwanka-Sandakalum/ETL-pipline/blob/main/Educational_Data_ETL_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🎓 Educational Data ETL Pipeline with LLM Extraction

## System Overview

```
Raw Text Files (per institution)
        ↓
Chunking + Cleaning
        ↓
LLM Extraction (Ollama / LLaMA)
        ↓
Strict JSON Validation (Pydantic)
        ↓
Confidence Scoring & Normalization
        ↓
MongoDB Insert (Institutions / Programs / Raw)
        ↓
Audit + Retry Queue
```

## Technology Stack

| Layer | Technology |
|-------|-----------|
| LLM Runtime | Ollama (LLaMA 3.x / Mistral) |
| Runtime Env | Google Colab / Local |
| Language | Python 3.10+ |
| Validation | Pydantic |
| Database | MongoDB (Atlas or Local) |
| Embeddings | Ollama embeddings |
| Logging | JSON logs |

---

## 📦 Step 1: Install Dependencies and Set Up Environment

In [1]:
# Install required packages
import sys
import subprocess

def install_packages():
    """Install all required dependencies"""
    packages = [
        'pymongo[srv]==3.12',  # MongoDB driver with srv support for Atlas
        'pydantic',
        'requests',
        'python-dotenv'
    ]

    for package in packages:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])

    print("✅ All packages installed successfully!")

install_packages()

Installing pymongo[srv]==3.12...
Installing pydantic...
Installing requests...
Installing python-dotenv...
✅ All packages installed successfully!


In [3]:
# Create necessary directories
import os
from pathlib import Path

# Auto-detect base directory for Google Colab or local environment
try:
    import google.colab
    # Running in Google Colab
    BASE_DIR = Path("/content")
    print("🌐 Running in Google Colab")
except ImportError:
    # Running locally - auto-detect from current working directory
    BASE_DIR = Path.cwd()
    print("💻 Running locally")

INPUT_DIR = BASE_DIR / "bci_lk"
LOGS_DIR = BASE_DIR / "logs"
OUTPUT_DIR = BASE_DIR / "output"

# Create directories if they don't exist
LOGS_DIR.mkdir(exist_ok=True)
OUTPUT_DIR.mkdir(exist_ok=True)
INPUT_DIR.mkdir(exist_ok=True)  # Create input dir if needed

print("✅ Directory structure created:")
print(f"   📂 Base: {BASE_DIR}")
print(f"   📂 Input: {INPUT_DIR}")
print(f"   📂 Logs: {LOGS_DIR}")
print(f"   📂 Output: {OUTPUT_DIR}")


🌐 Running in Google Colab
   📂 Output: /content/output
   📂 Input: /content/bci_lk
   📂 Base: /content


In [4]:
# Import core libraries
import json
import logging
from datetime import datetime
from typing import Optional, List, Dict, Any
import traceback

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(LOGS_DIR / f'pipeline_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)
logger.info("🚀 ETL Pipeline Initialized")
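
The technology stack lists JSON logs, while the `basicConfig` call above writes plain-text lines. One way to close that gap is a custom formatter. The following is a minimal sketch, not part of the original notebook: the `JsonFormatter` class, the field names (`ts`, `level`, `logger`, `message`) and the `etl_json_demo` logger name are all assumptions made for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: renders each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload, ensure_ascii=False)


# Demo: log into an in-memory stream so the result can be inspected
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

demo_logger = logging.getLogger("etl_json_demo")  # assumed name, separate from the pipeline logger
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo output out of the root handlers
demo_logger.addHandler(handler)
demo_logger.info("ETL Pipeline Initialized")

parsed = json.loads(stream.getvalue().strip())
print(parsed["level"], parsed["message"])
```

Attaching such a formatter to the `FileHandler` would keep the console output human-readable while making the log file machine-parseable.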

## 🤖 Step 2: Install and Start Ollama Server

**Note:** This section is for Google Colab. For local environments, install Ollama separately.

In [5]:
# Check if running in Google Colab
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    print("📍 Running in Google Colab - Installing Ollama...")
    # Install Ollama in Colab
    !curl -fsSL https://ollama.com/install.sh | sh

    # Start Ollama server in background
    import subprocess
    import time

    # Start server; discard its output so a full pipe buffer cannot block it
    ollama_process = subprocess.Popen(['ollama', 'serve'],
                                      stdout=subprocess.DEVNULL,
                                      stderr=subprocess.DEVNULL)
    time.sleep(5)  # Wait for server to start

    # Pull LLaMA 3 model
    !ollama pull llama3

    print("✅ Ollama installed and started successfully!")
else:
    print("📍 Running locally - Please ensure Ollama is installed and running")
    print("   Install: curl -fsSL https://ollama.com/install.sh | sh")
    print("   Start server: ollama serve")
    print("   Pull model: ollama pull llama3")

📍 Running in Google Colab - Installing Ollama...
>>> Installing ollama to /usr/local
>>> Downloading ollama-linux-amd64.tgz
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.

In [None]:
# Test Ollama connection
import requests
import time

def test_ollama_connection(max_retries=5):
    """Test if Ollama server is accessible"""
    for i in range(max_retries):
        try:
            response = requests.get("http://localhost:11434/api/tags", timeout=5)
            if response.status_code == 200:
                models = response.json().get('models', [])
                print("✅ Ollama server is running!")
                print(f"   Available models: {[m['name'] for m in models]}")
                return True
        except requests.RequestException:
            pass  # Server not reachable yet; retry below

        # Wait before the next attempt, whether the request failed or returned a non-200 status
        print(f"⏳ Waiting for Ollama server... (attempt {i+1}/{max_retries})")
        time.sleep(2)

    print("❌ Could not connect to Ollama server")
    return False

test_ollama_connection()

## 📋 Step 3: Define Pydantic Data Models (JSON Contract)

In [39]:
from pydantic import BaseModel, Field, ConfigDict
from typing import Optional, List, Dict, Any
from datetime import datetime

class Institution(BaseModel):
    """Strict schema for educational institution data"""
    institution_code: Optional[str] = Field(None, description="Unique institution identifier")
    name: str = Field(..., description="Official institution name")
    description: Optional[str] = Field(None, description="Institution description and overview")
    type: List[str] = Field(default_factory=list, description="Institution types (e.g., university, college)")
    country: Optional[str] = Field(default="Sri Lanka", description="Country where institution is located")
    website: Optional[str] = Field(None, description="Official website URL")
    recognition: Optional[Dict[str, Optional[bool]]] = Field(None, description="Recognition/accreditation status")
    contact_info: Optional[Dict[str, Any]] = Field(None, description="Contact details")
    confidence_score: float = Field(..., ge=0.0, le=1.0, description="Extraction confidence (0.0-1.0)")

    model_config = ConfigDict(extra="forbid")  # Use ConfigDict instead of class Config


class Program(BaseModel):
    """Strict schema for academic program data"""
    name: str = Field(..., description="Official program name")
    program_code: Optional[str] = Field(None, description="Unique program identifier")
    description: Optional[str] = Field(None, description="Program description and overview")
    level: Optional[str] = Field(None, description="Academic level (e.g., Bachelor, Diploma)")
    duration: Optional[Dict[str, Any]] = Field(None, description="Program duration details")
    delivery_mode: Optional[List[str]] = Field(None, description="Delivery modes (online, on-campus, hybrid)")
    fees: Optional[Dict[str, Any]] = Field(None, description="Fee structure")
    eligibility: Optional[Dict[str, Any]] = Field(None, description="Admission requirements")
    curriculum_summary: Optional[str] = Field(None, description="Brief curriculum overview")
    specializations: Optional[List[str]] = Field(None, description="Available specializations")
    extensions: Optional[Dict[str, Any]] = Field(None, description="Additional unstructured data")
    confidence_score: float = Field(..., ge=0.0, le=1.0, description="Extraction confidence (0.0-1.0)")

    model_config = ConfigDict(extra="forbid") # Use ConfigDict instead of class Config


class ExtractionResult(BaseModel):
    """Complete extraction result container"""
    institution: Institution
    programs: List[Program] = Field(default_factory=list)
    raw_text: str = Field(..., description="Original source text")
    extraction_timestamp: datetime = Field(default_factory=datetime.now)
    source_file: Optional[str] = Field(None, description="Source filename")

    model_config = ConfigDict(extra="allow")  # Use ConfigDict instead of class Config

print("✅ Pydantic schemas defined successfully!")
print("   - Institution")
print("   - Program")
print("   - ExtractionResult")

✅ Pydantic schemas defined successfully!
   - Institution
   - Program
   - ExtractionResult


## 💬 Step 4: Create Prompt Engineering Module

In [8]:
SYSTEM_PROMPT = """You are a structured data extraction engine for educational institutions.

CRITICAL RULES:
1. Output ONLY valid JSON - no markdown, no code blocks, no explanations
2. Follow the JSON schema EXACTLY as defined
3. Do NOT infer, guess, or hallucinate information
4. If data is missing or unclear, OMIT the field entirely
5. Unknown or unstructured information → place in "extensions" field
6. Preserve original wording and terminology from source
7. Assign confidence_score honestly based on data clarity (0.0–1.0)
8. Be conservative with confidence scores - unclear data = lower score

CONFIDENCE SCORING GUIDELINES:
- 1.0: Explicitly stated, clear, unambiguous
- 0.8-0.9: Clearly stated but minor ambiguity
- 0.6-0.7: Implied or partially stated
- 0.4-0.5: Inferred from context
- 0.0-0.3: Highly uncertain or speculative

OUTPUT FORMAT:
Return a single JSON object with this structure:
{
  "institution": {
    "name": "...",
    "description": "Institution overview and mission...",
    "type": ["..."],
    "country": "...",
    "website": "...",
    "confidence_score": 0.0-1.0
  },
  "programs": [
    {
      "name": "...",
      "description": "Program overview and objectives...",
      "level": "...",
      "duration": {...},
      "curriculum_summary": "...",
      "confidence_score": 0.0-1.0
    }
  ]
}
"""

USER_PROMPT_TEMPLATE = """Extract structured educational data from the following content.

Source content:
<<<
{content}
>>>

Return ONLY the JSON object. No explanations. No markdown.
"""

print("✅ Extraction prompts configured")
print(f"   System prompt length: {len(SYSTEM_PROMPT)} chars")
print("   User template ready")

✅ Extraction prompts configured
   System prompt length: 1316 chars
   User template ready
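
The confidence scoring guidelines in the system prompt can also be checked on the Python side, for example when reviewing extracted records. The sketch below is illustrative only: `confidence_band` is a hypothetical helper, and because the prompt leaves some values unassigned (such as 0.95 or 0.75), it assumes each band extends up to the lower bound of the next one.

```python
def confidence_band(score: float) -> str:
    """Map a confidence score to a guideline label (hypothetical helper)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence_score out of range: {score}")
    # Assumption: each band reaches up to the next band's lower bound,
    # since the prompt does not assign values such as 0.95 or 0.75.
    if score >= 1.0:
        return "explicitly stated"
    if score >= 0.8:
        return "clearly stated, minor ambiguity"
    if score >= 0.6:
        return "implied or partially stated"
    if score >= 0.4:
        return "inferred from context"
    return "highly uncertain or speculative"


for value in (1.0, 0.85, 0.65, 0.45, 0.2):
    print(f"{value:.2f} -> {confidence_band(value)}")
```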


## 🔌 Step 5: Implement Ollama LLM Client

In [9]:
import requests
import json
import re

class OllamaClient:
    """Robust Ollama LLM client with error handling"""

    def __init__(self, base_url="http://localhost:11434", model="llama3"):
        self.base_url = base_url
        self.model = model
        self.api_url = f"{base_url}/api/generate"

    def call(self, prompt: str, temperature: float = 0.1, timeout: int = 300) -> dict:
        """
        Call Ollama API with structured prompt

        Args:
            prompt: Complete prompt including system and user messages
            temperature: Sampling temperature (lower = more deterministic)
            timeout: Request timeout in seconds

        Returns:
            Parsed JSON response from LLM

        Raises:
            ValueError: If JSON parsing fails
            requests.RequestException: If API call fails
        """
        try:
            logger.info(f"Calling Ollama with model: {self.model}")

            response = requests.post(
                self.api_url,
                json={
                    "model": self.model,
                    "prompt": prompt,
                    "stream": False,
                    "options": {
                        # Ollama reads sampling parameters from "options";
                        # a top-level "temperature" key is not applied
                        "temperature": temperature,
                        "num_predict": 4096,  # Max tokens
                        "top_k": 40,
                        "top_p": 0.9
                    }
                },
                timeout=timeout
            )

            response.raise_for_status()

            # Extract response text
            raw_response = response.json().get("response", "")
            logger.debug(f"Raw LLM response: {raw_response[:200]}...")

            # Clean response (remove markdown code blocks if present)
            cleaned = self._clean_json_response(raw_response)

            # Parse JSON
            try:
                return json.loads(cleaned)
            except json.JSONDecodeError as e:
                logger.error(f"JSON decode error: {e}")
                logger.error(f"Cleaned response: {cleaned[:500]}")
                raise ValueError(f"Invalid JSON returned by LLM: {e}")

        except requests.RequestException as e:
            logger.error(f"Ollama API request failed: {e}")
            raise

    def _clean_json_response(self, text: str) -> str:
        """Remove markdown code blocks and extract JSON"""
        # Remove markdown code blocks
        text = re.sub(r'```json\s*', '', text)
        text = re.sub(r'```\s*', '', text)

        # Find JSON object (starting with { and ending with })
        match = re.search(r'\{.*\}', text, re.DOTALL)
        if match:
            return match.group(0)

        return text.strip()

    def test_connection(self) -> bool:
        """Test if Ollama server is accessible"""
        try:
            response = requests.get(f"{self.base_url}/api/tags", timeout=5)
            return response.status_code == 200
        except requests.RequestException:
            return False

# Initialize client
ollama_client = OllamaClient(model="llama3")

# Test connection
if ollama_client.test_connection():
    print("✅ Ollama client initialized and connected")
else:
    print("⚠️  Warning: Could not connect to Ollama server")
    print("   Make sure Ollama is running: ollama serve")

✅ Ollama client initialized and connected
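
To see what the response cleaning step does in practice, the following sketch copies the logic of `_clean_json_response` into a standalone function and runs it on a made-up model reply. The reply text and the institution name "Example College" are invented for illustration; they are not output from a real model run.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built programmatically


def clean_json_response(text: str) -> str:
    """Standalone copy of the cleaning logic, for illustration."""
    text = re.sub(FENCE + r"json\s*", "", text)
    text = re.sub(FENCE + r"\s*", "", text)
    # Greedy match from the first "{" to the last "}"
    match = re.search(r"\{.*\}", text, re.DOTALL)
    return match.group(0) if match else text.strip()


# Hypothetical reply that ignores the "no markdown" instruction
raw = (
    "Here is the result:\n"
    + FENCE + "json\n"
    + '{"institution": {"name": "Example College", "confidence_score": 0.9}, "programs": []}\n'
    + FENCE + "\nLet me know if you need more."
)

data = json.loads(clean_json_response(raw))
print(data["institution"]["name"])
```

Because the pattern is greedy, trailing commentary that itself contains a `}` would be captured as well and make `json.loads` fail, which is one reason the caller wraps parsing in a `try` block.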


## ✂️ Step 6: Build Text Chunking Function

In [10]:
class TextChunker:
    """Intelligent text chunking for large documents"""

    def __init__(self, max_chars=3000, overlap=200):
        """
        Initialize chunker

        Args:
            max_chars: Maximum characters per chunk
            overlap: Character overlap between chunks for context preservation
        """
        self.max_chars = max_chars
        self.overlap = overlap

    def chunk_text(self, text: str) -> List[str]:
        """
        Split text into manageable chunks while preserving context

        Args:
            text: Input text to chunk

        Returns:
            List of text chunks
        """
        if len(text) <= self.max_chars:
            return [text]

        chunks = []
        current_chunk = ""

        # Split by paragraphs (double newline) for better context
        paragraphs = text.split('\n\n')

        for paragraph in paragraphs:
            # If single paragraph exceeds max, split by sentences
            if len(paragraph) > self.max_chars:
                sentences = paragraph.split('. ')
                for sentence in sentences:
                    if len(current_chunk) + len(sentence) + 2 <= self.max_chars:
                        current_chunk += sentence + '. '
                    else:
                        if current_chunk:
                            chunks.append(current_chunk.strip())
                            # Add overlap from previous chunk
                            current_chunk = current_chunk[-self.overlap:] + sentence + '. '
                        else:
                            current_chunk = sentence + '. '
            else:
                # Add full paragraph if it fits
                if len(current_chunk) + len(paragraph) + 2 <= self.max_chars:
                    current_chunk += paragraph + '\n\n'
                else:
                    if current_chunk:
                        chunks.append(current_chunk.strip())
                        # Add overlap
                        current_chunk = current_chunk[-self.overlap:] + paragraph + '\n\n'
                    else:
                        current_chunk = paragraph + '\n\n'

        # Add final chunk
        if current_chunk:
            chunks.append(current_chunk.strip())

        logger.info(f"Split text into {len(chunks)} chunks")
        return chunks

    def clean_text(self, text: str) -> str:
        """Clean and normalize text"""
        # Remove excessive whitespace
        text = re.sub(r'\s+', ' ', text)
        # Remove special characters but keep punctuation
        text = re.sub(r'[^\w\s\.\,\:\;\-\(\)\[\]\/\&]', '', text)
        return text.strip()

# Initialize chunker
chunker = TextChunker(max_chars=3000, overlap=200)

# Test chunking
test_text = "This is a test. " * 500
test_chunks = chunker.chunk_text(test_text)
print("✅ Text chunker initialized")
print(f"   Test: {len(test_text)} chars → {len(test_chunks)} chunks")
print(f"   Max chunk size: {max(len(c) for c in test_chunks)} chars")

✅ Text chunker initialized
   Test: 8000 chars → 3 chunks
   Max chunk size: 2999 chars
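
The role of the `overlap` parameter is easier to see in a stripped-down variant. The sketch below is not the `TextChunker` above: it is a hypothetical fixed-window chunker (`sliding_chunks`) that ignores paragraph and sentence boundaries, shown only to make the overlap behaviour explicit.

```python
from typing import List


def sliding_chunks(text: str, max_chars: int = 3000, overlap: int = 200) -> List[str]:
    """Hypothetical simplified chunker: fixed-size windows sharing `overlap` characters."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    step = max_chars - overlap  # how far each window advances
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "This is a test. " * 500  # 8000 characters, as in the test above
chunks = sliding_chunks(sample)
print(len(chunks), [len(c) for c in chunks])
print(chunks[0][-200:] == chunks[1][:200])  # the windows share 200 characters
```

Unlike this sketch, `TextChunker` prepends the overlap to the next paragraph or sentence, so a chunk can cut across the overlap text but never splits a sentence that fits within `max_chars`.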


## ✅ Step 7: Implement JSON Validation Layer

In [11]:
from pydantic import ValidationError

class DataValidator:
    """Strict Pydantic-based validation"""

    @staticmethod
    def validate_extraction(data: dict, source_file: Optional[str] = None) -> ExtractionResult:
        """
        Validate extracted data against Pydantic schema

        Args:
            data: Raw dictionary from LLM
            source_file: Source filename for metadata

        Returns:
            Validated ExtractionResult object

        Raises:
            ValueError: If validation fails
        """
        try:
            # Add metadata
            if source_file:
                data['source_file'] = source_file

            # Validate against schema
            result = ExtractionResult(**data)

            logger.info("✅ Validation passed")
            logger.info(f"   Institution: {result.institution.name}")
            logger.info(f"   Programs: {len(result.programs)}")

            return result

        except ValidationError as e:
            error_details = []
            for error in e.errors():
                field = '.'.join(str(x) for x in error['loc'])
                message = error['msg']
                error_details.append(f"  - {field}: {message}")

            error_msg = "Schema validation failed:\n" + "\n".join(error_details)
            logger.error(error_msg)
            raise ValueError(error_msg) from e

    @staticmethod
    def validate_confidence_scores(result: ExtractionResult) -> bool:
        """Check if confidence scores are within acceptable range"""
        min_acceptable = 0.5

        # Check institution confidence
        if result.institution.confidence_score < min_acceptable:
            logger.warning(f"Low institution confidence: {result.institution.confidence_score}")
            return False

        # Check program confidences
        low_confidence_programs = [
            p.name for p in result.programs
            if p.confidence_score < min_acceptable
        ]

        if low_confidence_programs:
            logger.warning(f"Low confidence programs: {low_confidence_programs}")

        return True

validator = DataValidator()
print("✅ Validation layer ready")

✅ Validation layer ready
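
The error message built in `validate_extraction` flattens each error location into a dotted path. The sketch below reproduces only that formatting step on a hand-written list shaped like the result of `ValidationError.errors()` (dictionaries with `loc`, `msg` and `type` keys). The sample entries are illustrative, not captured Pydantic output, and `format_validation_errors` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def format_validation_errors(errors: List[Dict[str, Any]]) -> str:
    """Flatten error locations into dotted paths, one line per error."""
    lines = []
    for error in errors:
        field = ".".join(str(part) for part in error["loc"])
        lines.append(f"  - {field}: {error['msg']}")
    return "Schema validation failed:\n" + "\n".join(lines)


# Hand-written sample in the shape of ValidationError.errors()
sample_errors = [
    {"loc": ("institution", "name"), "msg": "Field required", "type": "missing"},
    {
        "loc": ("programs", 0, "confidence_score"),
        "msg": "Input should be less than or equal to 1",
        "type": "less_than_equal",
    },
]

message = format_validation_errors(sample_errors)
print(message)
```

List indices appear in the path as plain numbers, so `programs.0.confidence_score` points at the first program in the merged result.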


## 🎯 Step 8: Create Confidence Scoring System

In [13]:
class ConfidenceScorer:
    """Normalize and adjust confidence scores based on data quality"""

    def __init__(self, low_confidence_threshold=0.6, penalty_factor=0.9):
        self.low_confidence_threshold = low_confidence_threshold
        self.penalty_factor = penalty_factor

    def normalize_confidence(self, result: ExtractionResult) -> ExtractionResult:
        """
        Apply confidence normalization rules

        Rules:
        1. Penalize scores below threshold
        2. Adjust based on data completeness
        3. Apply penalties for missing critical fields

        Args:
            result: ExtractionResult to normalize

        Returns:
            Normalized ExtractionResult
        """
        # Normalize institution confidence
        result.institution.confidence_score = self._normalize_institution_score(
            result.institution
        )

        # Normalize program confidences
        for program in result.programs:
            program.confidence_score = self._normalize_program_score(program)

        logger.info("Confidence scores normalized")
        logger.info(f"   Institution: {result.institution.confidence_score:.2f}")
        if result.programs:
            avg_program_score = sum(p.confidence_score for p in result.programs) / len(result.programs)
            logger.info(f"   Programs avg: {avg_program_score:.2f}")
        else:
            logger.info("   Programs: N/A")

        return result

    def _normalize_institution_score(self, institution: Institution) -> float:
        """Normalize institution confidence score"""
        score = institution.confidence_score

        # Apply low confidence penalty
        if score < self.low_confidence_threshold:
            score *= self.penalty_factor

        # Penalize if critical fields are missing
        if not institution.website:
            score *= 0.95

        if not institution.institution_code:
            score *= 0.98

        return round(min(score, 1.0), 3)

    def _normalize_program_score(self, program: Program) -> float:
        """Normalize program confidence score"""
        score = program.confidence_score

        # Apply low confidence penalty
        if score < self.low_confidence_threshold:
            score *= 0.85  # Stricter for programs

        # Penalize missing important fields
        missing_fields = 0
        if not program.level:
            missing_fields += 1
        if not program.duration:
            missing_fields += 1
        if not program.curriculum_summary:
            missing_fields += 1

        # Reduce score by 3% per missing important field
        score *= (1.0 - 0.03 * missing_fields)

        return round(min(score, 1.0), 3)

    def get_overall_quality_score(self, result: ExtractionResult) -> float:
        """Calculate overall data quality score"""
        scores = [result.institution.confidence_score]
        scores.extend([p.confidence_score for p in result.programs])

        return sum(scores) / len(scores) if scores else 0.0

scorer = ConfidenceScorer()
print("✅ Confidence scoring system ready")

✅ Confidence scoring system ready
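
A worked example makes the penalty arithmetic concrete. The two functions below restate the rules of `_normalize_institution_score` and `_normalize_program_score` on plain numbers. The function names and boolean parameters are introduced here for illustration only, while the factors (0.9, 0.95, 0.98, 0.85 and 3% per missing field) are the ones used in the class above.

```python
def institution_score(score: float, has_website: bool, has_code: bool,
                      threshold: float = 0.6, penalty: float = 0.9) -> float:
    """Plain-number restatement of the institution rules (illustrative)."""
    if score < threshold:
        score *= penalty      # low-confidence penalty
    if not has_website:
        score *= 0.95         # missing website
    if not has_code:
        score *= 0.98         # missing institution_code
    return round(min(score, 1.0), 3)


def program_score(score: float, missing_fields: int, threshold: float = 0.6) -> float:
    """Plain-number restatement of the program rules (illustrative)."""
    if score < threshold:
        score *= 0.85         # stricter low-confidence penalty for programs
    score *= 1.0 - 0.03 * missing_fields  # 3% per missing level/duration/curriculum_summary
    return round(min(score, 1.0), 3)


# 0.5 * 0.9 * 0.95 * 0.98 = 0.41895
print(institution_score(0.5, has_website=False, has_code=False))
# 0.8 * (1 - 0.06) = 0.752
print(program_score(0.8, missing_fields=2))
```

Since every factor is at most 1, normalization can only lower a score, and the `min(score, 1.0)` clamp never takes effect for valid inputs.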


## 💾 Step 9: Set Up MongoDB Connection and Writer

In [14]:
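# MongoDB Atlas Configuration
# Read the connection string from the environment (or a local .env file)
# so that credentials are not stored in the notebook.
# The variable name MONGODB_URI is an assumed convention.
from dotenv import load_dotenv

load_dotenv()
MONGODB_URI = os.environ["MONGODB_URI"]
DATABASE_NAME = "edu_platform"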

print("✅ MongoDB Atlas configured")
print(f"   Database: {DATABASE_NAME}")
print("   Cluster: development.ps1jayw.mongodb.net")

✅ MongoDB Atlas configured
   Database: edu_platform
   Cluster: development.ps1jayw.mongodb.net


In [33]:
from pymongo import MongoClient, errors
from pymongo.server_api import ServerApi
from pymongo.collection import Collection

class MongoDBWriter:
    """MongoDB writer with error handling and relationship management"""

    def __init__(self, connection_string=MONGODB_URI, database_name=DATABASE_NAME):
        """
        Initialize MongoDB connection

        Args:
            connection_string: MongoDB connection URI (Atlas or local)
            database_name: Database name to use
        """
        try:
            # Create client with server API version
            self.client = MongoClient(
                connection_string,
                server_api=ServerApi('1'),
                serverSelectionTimeoutMS=5000
            )

            # Test connection with ping
            self.client.admin.command('ping')
            logger.info("✅ Pinged deployment - successfully connected to MongoDB!")

            self.db = self.client[database_name]

            # Collections
            self.institutions = self.db.institutions
            self.programs = self.db.programs
            self.raw_documents = self.db.raw_documents
            self.extraction_logs = self.db.extraction_logs

            # Create indexes
            self._create_indexes()

            logger.info(f"✅ Connected to MongoDB: {database_name}")

        except errors.ServerSelectionTimeoutError:
            logger.error("❌ Could not connect to MongoDB")
            logger.error("   Check your connection string and network access")
            raise
        except Exception as e:
            logger.error(f"❌ MongoDB connection error: {e}")
            raise

    def _create_indexes(self):
        """Create database indexes for performance"""
        # Institution indexes
        self.institutions.create_index("name")
        self.institutions.create_index("institution_code", unique=True, sparse=True)

        # Program indexes
        self.programs.create_index("institution_id")
        self.programs.create_index("name")
        self.programs.create_index("level")

        logger.info("Database indexes created")

    def write_extraction(self, result: ExtractionResult) -> dict:
        """
        Write complete extraction result to MongoDB

        Args:
            result: Validated ExtractionResult

        Returns:
            Dictionary with inserted IDs
        """
        try:
            # Insert institution
            institution_data = result.institution.model_dump()
            institution_data['inserted_at'] = datetime.now()

            inst_result = self.institutions.insert_one(institution_data)
            institution_id = inst_result.inserted_id

            logger.info(f"✅ Inserted institution: {institution_id}")

            # Insert programs
            program_ids = []
            for program in result.programs:
                program_data = program.model_dump()
                program_data['institution_id'] = institution_id
                program_data['inserted_at'] = datetime.now()

                prog_result = self.programs.insert_one(program_data)
                program_ids.append(prog_result.inserted_id)

            logger.info(f"✅ Inserted {len(program_ids)} programs")

            raw_doc = {
                'institution_id': institution_id,
                'raw_text': result.raw_text,
                'source_file': result.source_file,
                'extraction_timestamp': result.extraction_timestamp,
                'inserted_at': datetime.now()
            }
            raw_result = self.raw_documents.insert_one(raw_doc)

            # Log extraction
            log_entry = {
                'institution_id': institution_id,
                'source_file': result.source_file,
                'programs_count': len(result.programs),
                'institution_confidence': result.institution.confidence_score,
                'avg_program_confidence': sum(p.confidence_score for p in result.programs) / len(result.programs) if result.programs else 0,
                'timestamp': datetime.now(),
                'status': 'success'
                }
            self.extraction_logs.insert_one(log_entry)

            return {
                'institution_id': institution_id,
                'program_ids': program_ids,
                'raw_document_id': raw_result.inserted_id
            }

        except Exception as e:
            logger.error(f"MongoDB write failed: {e}")

            # Log failure
            log_entry = {
                'source_file': result.source_file if hasattr(result, 'source_file') else None,
                'error': str(e),
                'timestamp': datetime.now(),
                'status': 'failed'
            }
            self.extraction_logs.insert_one(log_entry)

            raise

    def get_statistics(self) -> dict:
        """Get database statistics"""
        return {
            'institutions': self.institutions.count_documents({}),
            'programs': self.programs.count_documents({}),
            'raw_documents': self.raw_documents.count_documents({}),
            'extraction_logs': self.extraction_logs.count_documents({})
        }

    def close(self):
        """Close MongoDB connection"""
        self.client.close()
        logger.info("MongoDB connection closed")

# Initialize MongoDB writer with Atlas connection
mongo_writer = MongoDBWriter()

## 📁 Step 10: Load and Process Text Files

In [27]:
from pathlib import Path

class FileLoader:
    """Load and preprocess text files"""

    def __init__(self, input_dir: Path):
        self.input_dir = Path(input_dir)

    def get_all_files(self, pattern="*.txt") -> List[Path]:
        """Get all text files from input directory"""
        files = list(self.input_dir.glob(pattern))
        logger.info(f"Found {len(files)} files in {self.input_dir}")
        return files

    def load_file(self, file_path: Path) -> str:
        """Load and preprocess a single file"""
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()

            # Basic preprocessing
            content = self._preprocess_text(content)

            logger.info(f"Loaded {file_path.name}: {len(content)} chars")
            return content

        except Exception as e:
            logger.error(f"Failed to load {file_path}: {e}")
            raise

    def _preprocess_text(self, text: str) -> str:
        """Clean and preprocess text"""
        # Remove excessive whitespace
        text = re.sub(r'\n{3,}', '\n\n', text)
        text = re.sub(r' {2,}', ' ', text)

        # Remove non-printable characters
        text = ''.join(char for char in text if char.isprintable() or char in '\n\t')

        return text.strip()

# Initialize file loader
file_loader = FileLoader(INPUT_DIR)

# Get all files
all_files = file_loader.get_all_files()
print("✅ File loader initialized")
print(f"   Found {len(all_files)} files to process")
print("\n   Sample files:")
for file in all_files[:5]:
    print(f"   - {file.name}")

✅ File loader initialized
   Found 20 files to process

   Sample files:
   - www_bci_lk_programmes_undergraduate-programmes.txt
   - www_bci_lk_academic-senate.txt
   - www_bci_lk_course_bachelor-of-science-honours-in-computer-science.txt
   - www_bci_lk_course_advanced-certificate-in-ict.txt
   - www_bci_lk_course_certificate-in-counselling-programme.txt
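
The effect of `_preprocess_text` can be checked on a small string. The sketch below copies the three cleaning steps into a standalone function and applies them to an invented snippet. The sample text is made up for illustration and does not come from the `bci_lk` files.

```python
import re


def preprocess_text(text: str) -> str:
    """Standalone copy of the preprocessing steps, for illustration."""
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
    text = re.sub(r" {2,}", " ", text)      # collapse repeated spaces
    # Drop control characters but keep newlines and tabs
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    return text.strip()


# Invented sample with extra spaces, a NUL byte and four newlines
raw = "  Bachelor   of Science\x00\n\n\n\nDuration:  4 years  "
cleaned = preprocess_text(raw)
print(repr(cleaned))
```

Paragraph breaks survive as double newlines, which matters later because `TextChunker.chunk_text` splits on `'\n\n'`.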


## 🚀 Step 11: Run Extraction Pipeline on Single File

In [36]:
class ExtractionPipeline:
    """Master pipeline orchestrator"""

    def __init__(self, ollama_client, chunker, validator, scorer, mongo_writer=None):
        self.ollama_client = ollama_client
        self.chunker = chunker
        self.validator = validator
        self.scorer = scorer
        self.mongo_writer = mongo_writer

    def process_file(self, file_path: Path, save_to_db=True) -> Optional[ExtractionResult]:
        """
        Process a single file through the complete pipeline

        Args:
            file_path: Path to input file
            save_to_db: Whether to save results to MongoDB

        Returns:
            ExtractionResult or None if processing fails
        """
        logger.info(f"\n{'='*60}")
        logger.info(f"Processing: {file_path.name}")
        logger.info(f"{'='*60}")

        try:
            # Step 1: Load file
            text = file_loader.load_file(file_path)

            # Step 2: Chunk text
            chunks = self.chunker.chunk_text(text)
            logger.info(f"Split into {len(chunks)} chunks")

            # Step 3: Process each chunk and merge results
            all_programs = []
            institution_data = None

            for i, chunk in enumerate(chunks, 1):
                logger.info(f"Processing chunk {i}/{len(chunks)}")

                # Build prompt
                prompt = SYSTEM_PROMPT + USER_PROMPT_TEMPLATE.format(content=chunk)

                # Call LLM
                try:
                    llm_response = self.ollama_client.call(prompt)
                except Exception as e:
                    logger.error(f"LLM call failed for chunk {i}: {e}")
                    continue

                # Extract data from chunk
                if 'institution' in llm_response and institution_data is None:
                    institution_data = llm_response['institution']

                if 'programs' in llm_response:
                    all_programs.extend(llm_response['programs'])

            # Step 4: Merge all data
            if not institution_data:
                logger.error("No institution data extracted")
                return None

            merged_data = {
                'institution': institution_data,
                'programs': all_programs,
                'raw_text': text[:5000]  # Store first 5000 chars for audit
            }

            # Step 5: Validate (validate_extraction is a static method, so it is called on the class)
            result = DataValidator.validate_extraction(merged_data, file_path.name)

            # Step 6: Normalize confidence scores
            result = self.scorer.normalize_confidence(result)

            # Step 7: Calculate overall quality
            quality_score = self.scorer.get_overall_quality_score(result)
            logger.info(f"Overall quality score: {quality_score:.2f}")

            # Step 8: Save to MongoDB
            if save_to_db and self.mongo_writer:
                inserted_ids = self.mongo_writer.write_extraction(result)
                logger.info(f"Saved to MongoDB: {inserted_ids['institution_id']}")

            # Step 9: Save JSON output
            self._save_json_output(result, file_path)

            logger.info(f"✅ Successfully processed {file_path.name}")
            return result

        except Exception as e:
            logger.error(f"❌ Pipeline failed for {file_path.name}: {e}")
            logger.error(traceback.format_exc())
            return None

    def _save_json_output(self, result: ExtractionResult, file_path: Path):
        """Save extraction result as JSON file"""
        output_file = OUTPUT_DIR / f"{file_path.stem}_extracted.json"

        # Convert to JSON-compatible types (Pydantic v2); every datetime,
        # including extraction_timestamp, is serialized as an ISO 8601 string
        result_dict = result.model_dump(mode="json")

        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(result_dict, f, indent=2, ensure_ascii=False)

        logger.info(f"JSON output saved: {output_file.name}")

# Initialize pipeline
pipeline = ExtractionPipeline(
    ollama_client=ollama_client,
    chunker=chunker,
    validator=validator,
    scorer=scorer,
    mongo_writer=mongo_writer
)

print("✅ Extraction pipeline ready")

✅ Extraction pipeline ready


In [35]:
# Test pipeline on a single file
if len(all_files) > 0:
    test_file = all_files[0]  # Process first file
    print(f"\n🧪 Testing pipeline on: {test_file.name}\n")

    result = pipeline.process_file(test_file, save_to_db=True)

    if result:
        print(f"\n✅ Test completed successfully!")
        print(f"\n📊 Extraction Summary:")
        print(f"   Institution: {result.institution.name}")
        print(f"   Confidence: {result.institution.confidence_score}")
        print(f"   Programs extracted: {len(result.programs)}")
        if result.programs:
            print(f"\n   Sample programs:")
            for prog in result.programs[:3]:
                print(f"   - {prog.name} (confidence: {prog.confidence_score})")
else:
    print("❌ No files found to process")


🧪 Testing pipeline on: www_bci_lk_programmes_undergraduate-programmes.txt



ERROR:__main__:❌ Pipeline failed for www_bci_lk_programmes_undergraduate-programmes.txt: 'function' object has no attribute 'validate_extraction'
ERROR:__main__:Traceback (most recent call last):
  File "/tmp/ipython-input-2783511999.py", line 70, in process_file
    result = self.validator.validate_extraction(merged_data, str(file_path.name))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'function' object has no attribute 'validate_extraction'
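
The `AttributeError` above means the object stored as `self.validator` was a plain function, not something exposing `validate_extraction`. One plausible cause (an assumption, since the earlier cells are not shown here) is that the name `validator` was rebound, for example by importing a decorator with the same name. The sketch below shows a fail-fast check that could run in the pipeline constructor; `require_method` and `StubValidator` are hypothetical names used only for illustration.

```python
def require_method(obj, method_name: str, role: str):
    """Return obj if it exposes a callable `method_name`; otherwise raise TypeError."""
    method = getattr(obj, method_name, None)
    if not callable(method):
        raise TypeError(
            f"{role} must provide a callable '{method_name}', got {type(obj).__name__}"
        )
    return obj


class StubValidator:
    """Stand-in for the notebook's DataValidator (illustration only)."""

    @staticmethod
    def validate_extraction(data: dict, source_file: str) -> dict:
        return {"source_file": source_file, **data}


def validator(func):
    """Mimics a decorator that accidentally shadows a variable named `validator`."""
    return func


# A class with the expected static method passes the check...
checked = require_method(StubValidator, "validate_extraction", "validator")

# ...while a bare function is rejected before any file is processed.
try:
    require_method(validator, "validate_extraction", "validator")
    shadowing_detected = False
except TypeError as exc:
    shadowing_detected = True
    message = str(exc)
```

Calling such a check in `ExtractionPipeline.__init__` would surface the problem when the pipeline is built rather than midway through a file.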



In [40]:
# Test pipeline on a single file
if len(all_files) > 0:
    test_file = all_files[0]  # Process first file
    print(f"\nℹ️ Testing pipeline on: {test_file.name}\n")

    result = pipeline.process_file(test_file, save_to_db=True)

    if result:
        print(f"\n✅ Test completed successfully!")
        print(f"\n📊 Extraction Summary:")
        print(f"   Institution: {result.institution.name}")
        print(f"   Confidence: {result.institution.confidence_score}")
        print(f"   Programs extracted: {len(result.programs)}")
        if result.programs:
            print(f"\n   Sample programs:")
            for prog in result.programs[:3]:
                print(f"   - {prog.name} (confidence: {prog.confidence_score})")
else:
    print("❌ No files found to process")


ℹ️ Testing pipeline on: www_bci_lk_programmes_undergraduate-programmes.txt



ERROR:__main__:MongoDB write failed: E11000 duplicate key error collection: edu_platform.institutions index: institution_code_1 dup key: { institution_code: null }, full error: {'index': 0, 'code': 11000, 'errmsg': 'E11000 duplicate key error collection: edu_platform.institutions index: institution_code_1 dup key: { institution_code: null }', 'keyPattern': {'institution_code': 1}, 'keyValue': {'institution_code': None}}
ERROR:__main__:❌ Pipeline failed for www_bci_lk_programmes_undergraduate-programmes.txt: E11000 duplicate key error collection: edu_platform.institutions index: institution_code_1 dup key: { institution_code: null }, full error: {'index': 0, 'code': 11000, 'errmsg': 'E11000 duplicate key error collection: edu_platform.institutions index: institution_code_1 dup key: { institution_code: null }', 'keyPattern': {'institution_code': 1}, 'keyValue': {'institution_code': None}}
ERROR:__main__:Traceback (most recent call last):
  File "/tmp/ipython-input-2626884717.py", line 
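
The `E11000` error above occurs because a plain unique index treats a missing or `null` `institution_code` as a value, so the second institution without a code collides with the first. One possible remedy is a partial unique index that only covers documents whose code is a string. The sketch below builds such an index specification; the index name and the helper functions are hypothetical, and `RecordingCollection` is a test double so the example runs without a database.

```python
from typing import Any, Dict, List, Tuple


def code_index_spec() -> Tuple[List[Tuple[str, int]], Dict[str, Any]]:
    """Keys and options for a unique index that skips documents without a string code."""
    keys = [("institution_code", 1)]
    options = {
        "unique": True,
        "name": "institution_code_unique_when_present",  # hypothetical index name
        "partialFilterExpression": {"institution_code": {"$type": "string"}},
    }
    return keys, options


def ensure_code_index(collection) -> str:
    """Create the partial unique index on any object exposing pymongo's create_index."""
    keys, options = code_index_spec()
    return collection.create_index(keys, **options)


class RecordingCollection:
    """Test double that records the call instead of talking to MongoDB."""

    def __init__(self):
        self.calls = []

    def create_index(self, keys, **kwargs):
        self.calls.append((keys, kwargs))
        return kwargs["name"]


fake = RecordingCollection()
index_name = ensure_code_index(fake)
```

With a live connection this would be called as `ensure_code_index(mongo_writer.institutions)`. The existing `institution_code_1` index named in the error would still reject a second `null`, so it would have to be dropped first.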

In [37]:
# Test pipeline on a single file
if len(all_files) > 0:
    test_file = all_files[0]  # Process first file
    print(f"\nℹ️ Testing pipeline on: {test_file.name}\n")

    result = pipeline.process_file(test_file, save_to_db=True)

    if result:
        print(f"\n✅ Test completed successfully!")
        print(f"\n📊 Extraction Summary:")
        print(f"   Institution: {result.institution.name}")
        print(f"   Confidence: {result.institution.confidence_score}")
        print(f"   Programs extracted: {len(result.programs)}")
        if result.programs:
            print(f"\n   Sample programs:")
            for prog in result.programs[:3]:
                print(f"   - {prog.name} (confidence: {prog.confidence_score})")
else:
    print("❌ No files found to process")


ℹ️ Testing pipeline on: www_bci_lk_programmes_undergraduate-programmes.txt



ERROR:__main__:Schema validation failed:
  - institution.country: Input should be a valid string
ERROR:__main__:❌ Pipeline failed for www_bci_lk_programmes_undergraduate-programmes.txt: Schema validation failed:
  - institution.country: Input should be a valid string
ERROR:__main__:Traceback (most recent call last):
  File "/tmp/ipython-input-2074822042.py", line 27, in validate_extraction
    result = ExtractionResult(**data)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pydantic/main.py", line 250, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for ExtractionResult
institution.country
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev
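
The `ValidationError` above shows that the LLM returned `null` for `institution.country`, which the schema declares as a required string. Two common options are declaring the field as optional in the model, or normalizing the payload before validation. The sketch below illustrates the second option under stated assumptions: the helper name `fill_null_strings`, the field list, the `"Unknown"` placeholder, and the sample institution are all hypothetical and not part of the notebook.

```python
from copy import deepcopy
from typing import Any, Dict, Iterable


def fill_null_strings(
    data: Dict[str, Any],
    required_fields: Iterable[str] = ("country",),
    placeholder: str = "Unknown",
) -> Dict[str, Any]:
    """Return a copy of `data` whose institution has no None in the required string fields."""
    cleaned = deepcopy(data)  # leave the caller's dictionary untouched
    institution = cleaned.get("institution") or {}
    for field in required_fields:
        if institution.get(field) is None:
            institution[field] = placeholder
    cleaned["institution"] = institution
    return cleaned


# Hypothetical LLM payload with a null country
raw = {"institution": {"name": "Example College", "country": None}, "programs": []}
normalized = fill_null_strings(raw)
```

A placeholder keeps validation from failing but hides the gap, so it may be worth lowering the confidence score for records that needed it.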

## 🔄 Step 12: Batch Process All Institution Files

In [None]:
import time

class BatchProcessor:
    """Batch processing with retry logic and error tracking"""

    def __init__(self, pipeline, max_retries=2, retry_delay=5):
        self.pipeline = pipeline
        self.max_retries = max_retries
        self.retry_delay = retry_delay

        self.results = {
            'success': [],
            'failed': [],
            'retry_queue': []
        }

    def process_all_files(self, files: List[Path], save_to_db=True):
        """
        Process all files with error handling and retry logic

        Args:
            files: List of file paths to process
            save_to_db: Whether to save to MongoDB
        """
        total_files = len(files)
        print(f"\n{'='*60}")
        print(f"BATCH PROCESSING: {total_files} files")
        print(f"{'='*60}\n")

        start_time = time.time()

        for idx, file_path in enumerate(files, 1):
            print(f"\n[{idx}/{total_files}] Processing: {file_path.name}")

            # Attempt processing with retries
            success = self._process_with_retry(file_path, save_to_db)

            if success:
                self.results['success'].append(file_path.name)
                print(f"✅ Success")
            else:
                self.results['failed'].append(file_path.name)
                print(f"❌ Failed after {self.max_retries} retries")

            # Progress update
            success_rate = len(self.results['success']) / idx * 100
            print(f"Progress: {idx}/{total_files} ({success_rate:.1f}% success rate)")

        # Final summary
        elapsed_time = time.time() - start_time
        self._print_summary(elapsed_time)

    def _process_with_retry(self, file_path: Path, save_to_db: bool) -> bool:
        """Process file with retry logic"""
        for attempt in range(self.max_retries + 1):
            try:
                result = self.pipeline.process_file(file_path, save_to_db)

                if result:
                    return True

                if attempt < self.max_retries:
                    logger.warning(f"Retry {attempt + 1}/{self.max_retries} for {file_path.name}")
                    time.sleep(self.retry_delay)

            except Exception as e:
                logger.error(f"Attempt {attempt + 1} failed: {e}")
                if attempt < self.max_retries:
                    time.sleep(self.retry_delay)

        return False

    def _print_summary(self, elapsed_time: float):
        """Print processing summary"""
        total = len(self.results['success']) + len(self.results['failed'])
        success_count = len(self.results['success'])
        failed_count = len(self.results['failed'])

        print(f"\n{'='*60}")
        print(f"BATCH PROCESSING COMPLETE")
        print(f"{'='*60}")

        # Guard against division by zero when the file list was empty
        if total == 0:
            print("\nNo files were processed")
            return

        print(f"\n📊 Summary:")
        print(f"   Total files: {total}")
        print(f"   ✅ Successful: {success_count} ({success_count/total*100:.1f}%)")
        print(f"   ❌ Failed: {failed_count} ({failed_count/total*100:.1f}%)")
        print(f"   ⏱️  Time elapsed: {elapsed_time:.1f}s")
        print(f"   ⚡ Avg time/file: {elapsed_time/total:.1f}s")

        if self.results['failed']:
            print(f"\n❌ Failed files:")
            for filename in self.results['failed']:
                print(f"   - {filename}")

        # Save summary to file
        summary_file = LOGS_DIR / f"batch_summary_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
        with open(summary_file, 'w') as f:
            json.dump({
                'total': total,
                'success': success_count,
                'failed': failed_count,
                'elapsed_time': elapsed_time,
                'success_files': self.results['success'],
                'failed_files': self.results['failed']
            }, f, indent=2)

        print(f"\n📄 Summary saved to: {summary_file.name}")

# Initialize batch processor
batch_processor = BatchProcessor(pipeline, max_retries=2, retry_delay=3)

print("✅ Batch processor ready")

In [None]:
# Run batch processing on all files
# WARNING: This will process ALL files - may take a long time!
# Uncomment the line below to process all files

# batch_processor.process_all_files(all_files, save_to_db=True)

print("⚠️  To process all files, uncomment the line above")
print(f"   {len(all_files)} files ready to process")
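
`BatchProcessor` keeps a `retry_queue` list in its results but never fills it, so failed files are lost once the session ends. A minimal sketch of a persistent retry queue follows. It assumes failed files are tracked by name, as `self.results['failed']` does; the helper names, the `retry_queue.json` file name, and the `/data/input` directory are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def save_retry_queue(failed_files: List[str], queue_path: Path) -> None:
    """Write the de-duplicated, sorted list of failed file names to a JSON file."""
    payload = {"pending": sorted(set(failed_files))}
    queue_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_retry_queue(queue_path: Path, input_dir: Path) -> List[Path]:
    """Read pending file names and resolve them against the input directory."""
    if not queue_path.exists():
        return []
    pending = json.loads(queue_path.read_text(encoding="utf-8")).get("pending", [])
    return [input_dir / name for name in pending]


# Round trip in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    queue_file = Path(tmp) / "retry_queue.json"
    save_retry_queue(["b.txt", "a.txt", "b.txt"], queue_file)
    reloaded = load_retry_queue(queue_file, Path("/data/input"))
    missing = load_retry_queue(Path(tmp) / "absent.json", Path("/data/input"))
```

In the notebook this could be wired up as `save_retry_queue(batch_processor.results['failed'], LOGS_DIR / 'retry_queue.json')` after a run, with a later call to `batch_processor.process_all_files(load_retry_queue(LOGS_DIR / 'retry_queue.json', INPUT_DIR))`.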

## ✅ Step 13: Validate and Display Results

In [None]:
# Check generated JSON outputs
output_files = list(OUTPUT_DIR.glob("*_extracted.json"))

print(f"📊 Generated Outputs: {len(output_files)} files\n")

if output_files:
    # Display first extraction result
    sample_file = output_files[0]
    print(f"Sample extraction: {sample_file.name}\n")

    # Read with the same encoding used when the file was written
    with open(sample_file, 'r', encoding='utf-8') as f:
        sample_data = json.load(f)

    # Display institution info
    print("🏛️  Institution:")
    inst = sample_data['institution']
    print(f"   Name: {inst['name']}")
    print(f"   Type: {', '.join(inst['type'])}")
    print(f"   Country: {inst['country']}")
    print(f"   Confidence: {inst['confidence_score']}")

    # Display programs
    if sample_data['programs']:
        print(f"\n📚 Programs ({len(sample_data['programs'])}):")
        for i, prog in enumerate(sample_data['programs'][:5], 1):
            print(f"   {i}. {prog['name']}")
            print(f"      Level: {prog.get('level', 'N/A')}")
            print(f"      Confidence: {prog['confidence_score']}")

    print(f"\n📅 Extraction timestamp: {sample_data['extraction_timestamp']}")
else:
    print("❌ No extraction results found. Run the pipeline first.")

## 🔍 Step 14: Query MongoDB Collections

In [None]:
if mongo_writer:
    print("📊 MongoDB Database Statistics\n")

    # Get statistics
    stats = mongo_writer.get_statistics()
    print(f"Collections:")
    for collection, count in stats.items():
        print(f"   {collection}: {count} documents")

    print("\n" + "="*60)

    # Query institutions
    print("\n🏛️  Institutions:")
    institutions = list(mongo_writer.institutions.find().limit(5))
    for inst in institutions:
        print(f"\n   Name: {inst['name']}")
        print(f"   Confidence: {inst['confidence_score']}")
        if 'website' in inst and inst['website']:
            print(f"   Website: {inst['website']}")

    print("\n" + "="*60)

    # Query programs with aggregation
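    print("\n📚 Programs by Confidence Level:")
    # $bucket upper bounds are exclusive, so the last boundary sits just above 1.0
    # to keep scores of exactly 1.0 out of the 'Other' bucket
    boundaries = [0.0, 0.5, 0.7, 0.9, 1.01]
    # Named bucket_pipeline so it does not overwrite the ExtractionPipeline held in `pipeline`
    bucket_pipeline = [
        {
            '$bucket': {
                'groupBy': '$confidence_score',
                'boundaries': boundaries,
                'default': 'Other',
                'output': {
                    'count': {'$sum': 1},
                    'programs': {'$push': '$name'}
                }
            }
        }
    ]

    confidence_buckets = list(mongo_writer.programs.aggregate(bucket_pipeline))
    for bucket in confidence_buckets:
        lower = bucket['_id']
        if lower in boundaries[:-1]:
            # Label each bucket with its actual upper bound (capped at 1.0 for display)
            upper = min(boundaries[boundaries.index(lower) + 1], 1.0)
            range_str = f"{lower:.1f}-{upper:.1f}"
        else:
            range_str = str(lower)
        print(f"   {range_str}: {bucket['count']} programs")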

    print("\n" + "="*60)

    # Query high-confidence programs
    print("\n⭐ High Confidence Programs (>0.8):")
    high_conf_programs = list(mongo_writer.programs.find(
        {'confidence_score': {'$gt': 0.8}}
    ).limit(10))

    for prog in high_conf_programs:
        print(f"   - {prog['name']} ({prog['confidence_score']})")

    print("\n" + "="*60)

    # Query extraction logs
    print("\n📝 Recent Extraction Logs:")
    logs = list(mongo_writer.extraction_logs.find().sort('timestamp', -1).limit(5))

    for log in logs:
        status_icon = "✅" if log['status'] == 'success' else "❌"
        print(f"\n   {status_icon} {log.get('source_file', 'Unknown')}")
        print(f"      Status: {log['status']}")
        print(f"      Timestamp: {log['timestamp']}")
        if 'programs_count' in log:
            print(f"      Programs: {log['programs_count']}")
        if 'error' in log:
            print(f"      Error: {log['error'][:100]}...")

    print("\n" + "="*60)

    # Advanced query: Programs with missing fields
    print("\n⚠️  Programs with Missing Duration Info:")
    missing_duration = mongo_writer.programs.count_documents({'duration': None})
    print(f"   Count: {missing_duration}")

    print("\n⚠️  Programs with Missing Level Info:")
    missing_level = mongo_writer.programs.count_documents({'level': None})
    print(f"   Count: {missing_level}")

else:
    print("❌ MongoDB not connected. Cannot query database.")
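
The `$bucket` stage in the query above assigns each document to the bucket whose lower boundary is inclusive and whose upper boundary is exclusive; anything outside the boundaries goes to the `default` bucket. The sketch below emulates that rule in plain Python to show one consequence: with a top boundary of exactly `1.0`, a score of `1.0` is counted under `'Other'`. The function name `bucket_id` and the sample scores are illustrative only.

```python
from bisect import bisect_right
from collections import Counter
from typing import List, Union


def bucket_id(value: float, boundaries: List[float], default: str = "Other") -> Union[float, str]:
    """Return the lower bound of the bucket holding `value`, or `default` if out of range."""
    if value < boundaries[0] or value >= boundaries[-1]:
        return default
    # bisect_right finds the first boundary greater than value; the bucket starts one before it
    return boundaries[bisect_right(boundaries, value) - 1]


boundaries = [0.0, 0.5, 0.7, 0.9, 1.0]
scores = [0.45, 0.5, 0.85, 0.9, 1.0]  # illustrative confidence scores
counts = Counter(bucket_id(score, boundaries) for score in scores)

# Raising the top boundary slightly keeps a perfect score in the last numeric bucket
perfect_with_wider_top = bucket_id(1.0, [0.0, 0.5, 0.7, 0.9, 1.01])
```

Note also that the buckets have different widths (0.5, 0.2, 0.2, 0.1), so any display label has to be derived from the boundary list rather than from a fixed step.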

## 🎯 Additional Features: Retry Queue & Data Quality Metrics

In [None]:
class DataQualityAnalyzer:
    """Analyze data quality and generate metrics"""

    def __init__(self, mongo_writer):
        self.mongo_writer = mongo_writer

    def generate_quality_report(self) -> dict:
        """Generate comprehensive data quality report"""
        if not self.mongo_writer:
            return {"error": "MongoDB not connected"}

        report = {
            'timestamp': datetime.now().isoformat(),
            'institutions': {},
            'programs': {},
            'overall': {}
        }

        # Institution metrics
        institutions = list(self.mongo_writer.institutions.find())
        if institutions:
            inst_scores = [i['confidence_score'] for i in institutions]
            report['institutions'] = {
                'total': len(institutions),
                'avg_confidence': sum(inst_scores) / len(inst_scores),
                'min_confidence': min(inst_scores),
                'max_confidence': max(inst_scores),
                'missing_website': sum(1 for i in institutions if not i.get('website')),
                'missing_code': sum(1 for i in institutions if not i.get('institution_code'))
            }

        # Program metrics
        programs = list(self.mongo_writer.programs.find())
        if programs:
            prog_scores = [p['confidence_score'] for p in programs]
            report['programs'] = {
                'total': len(programs),
                'avg_confidence': sum(prog_scores) / len(prog_scores),
                'min_confidence': min(prog_scores),
                'max_confidence': max(prog_scores),
                'missing_level': sum(1 for p in programs if not p.get('level')),
                'missing_duration': sum(1 for p in programs if not p.get('duration')),
                'missing_fees': sum(1 for p in programs if not p.get('fees')),
                'missing_curriculum': sum(1 for p in programs if not p.get('curriculum_summary'))
            }

        # Overall quality score across whichever collections returned documents
        all_scores = [doc['confidence_score'] for doc in institutions + programs]
        if all_scores:
            report['overall'] = {
                'total_records': len(all_scores),
                'avg_confidence': sum(all_scores) / len(all_scores),
                'quality_grade': self._calculate_grade(sum(all_scores) / len(all_scores))
            }

        return report

    def _calculate_grade(self, score: float) -> str:
        """Calculate quality grade from score"""
        if score >= 0.9:
            return 'A - Excellent'
        elif score >= 0.8:
            return 'B - Good'
        elif score >= 0.7:
            return 'C - Acceptable'
        elif score >= 0.6:
            return 'D - Poor'
        else:
            return 'F - Very Poor'

    def print_report(self):
        """Print formatted quality report"""
        report = self.generate_quality_report()

        if 'error' in report:
            print(f"❌ {report['error']}")
            return

        print("\n" + "="*60)
        print("📊 DATA QUALITY REPORT")
        print("="*60)

        if 'institutions' in report and report['institutions']:
            print("\n🏛️  Institutions:")
            inst = report['institutions']
            print(f"   Total: {inst['total']}")
            print(f"   Avg Confidence: {inst['avg_confidence']:.3f}")
            print(f"   Range: {inst['min_confidence']:.3f} - {inst['max_confidence']:.3f}")
            print(f"   Missing Website: {inst['missing_website']} ({inst['missing_website']/inst['total']*100:.1f}%)")
            print(f"   Missing Code: {inst['missing_code']} ({inst['missing_code']/inst['total']*100:.1f}%)")

        if 'programs' in report and report['programs']:
            print("\n📚 Programs:")
            prog = report['programs']
            print(f"   Total: {prog['total']}")
            print(f"   Avg Confidence: {prog['avg_confidence']:.3f}")
            print(f"   Range: {prog['min_confidence']:.3f} - {prog['max_confidence']:.3f}")
            print(f"   Missing Level: {prog['missing_level']} ({prog['missing_level']/prog['total']*100:.1f}%)")
            print(f"   Missing Duration: {prog['missing_duration']} ({prog['missing_duration']/prog['total']*100:.1f}%)")
            print(f"   Missing Fees: {prog['missing_fees']} ({prog['missing_fees']/prog['total']*100:.1f}%)")
            print(f"   Missing Curriculum: {prog['missing_curriculum']} ({prog['missing_curriculum']/prog['total']*100:.1f}%)")

        if 'overall' in report and report['overall']:
            print("\n⭐ Overall Quality:")
            overall = report['overall']
            print(f"   Total Records: {overall['total_records']}")
            print(f"   Avg Confidence: {overall['avg_confidence']:.3f}")
            print(f"   Quality Grade: {overall['quality_grade']}")

        print("\n" + "="*60)

        # Save report
        report_file = LOGS_DIR / f"quality_report_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
        with open(report_file, 'w') as f:
            json.dump(report, f, indent=2)
        print(f"\n📄 Report saved to: {report_file.name}\n")

# Initialize quality analyzer
if mongo_writer:
    quality_analyzer = DataQualityAnalyzer(mongo_writer)
    print("✅ Data quality analyzer ready")
else:
    quality_analyzer = None
    print("⚠️  Quality analyzer unavailable (MongoDB not connected)")

In [None]:
# Generate and display quality report
if quality_analyzer:
    quality_analyzer.print_report()
else:
    print("❌ Cannot generate report - MongoDB not available")

## 📝 Usage Examples & Quick Reference

### Quick Reference Guide

#### 1️⃣ Process a Single File
```python
file_path = Path("/path/to/file.txt")
result = pipeline.process_file(file_path, save_to_db=True)
```

#### 2️⃣ Process All Files in Batch
```python
batch_processor.process_all_files(all_files, save_to_db=True)
```

#### 3️⃣ Query MongoDB
```python
# Get all institutions
institutions = list(mongo_writer.institutions.find())

# Get high-confidence programs
high_conf = list(mongo_writer.programs.find({'confidence_score': {'$gt': 0.8}}))

# Get programs for specific institution
programs = list(mongo_writer.programs.find({'institution_id': institution_id}))
```

#### 4️⃣ Generate Quality Report
```python
quality_analyzer.print_report()
```

#### 5️⃣ Export Data
```python
# Export to JSON (requires pandas; _id is excluded because ObjectId is not JSON-serializable)
import pandas as pd

institutions_df = pd.DataFrame(list(mongo_writer.institutions.find({}, {'_id': 0})))
institutions_df.to_json('institutions_export.json', orient='records', indent=2)
```

---

### ⚙️ Configuration Options

**Ollama Model:**
- Default: `llama3`
- Change: `ollama_client = OllamaClient(model="mistral")`

**Chunking:**
- Max chars: 3000 (default)
- Overlap: 200 (default)
- Adjust: `chunker = TextChunker(max_chars=5000, overlap=300)`
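
To make the `max_chars` and `overlap` settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not the notebook's `TextChunker` (which may split on sentence or paragraph boundaries); `chunk_with_overlap` is a hypothetical stand-in that only shows how the two parameters interact.

```python
from typing import List


def chunk_with_overlap(text: str, max_chars: int = 3000, overlap: int = 200) -> List[str]:
    """Split text into chunks of at most max_chars, each sharing `overlap` chars with the previous one."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    step = max_chars - overlap  # how far the window advances each time
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks


# Small illustration: 25 characters, windows of 10 with 3 characters of overlap
sample = "".join(str(i % 10) for i in range(25))
pieces = chunk_with_overlap(sample, max_chars=10, overlap=3)
```

A larger overlap gives the LLM more shared context between neighbouring chunks at the cost of more chunks, and therefore more LLM calls, per file.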

**Confidence Thresholds:**
- Low confidence: 0.6 (default)
- Penalty factor: 0.9 (default)
- Adjust: `scorer = ConfidenceScorer(low_confidence_threshold=0.7, penalty_factor=0.85)`

**MongoDB:**
- Default: `mongodb://localhost:27017`
- Database: `edu_platform`
- Change: `mongo_writer = MongoDBWriter(connection_string="mongodb://...")`

---

### 🔧 Troubleshooting

**Issue: Ollama not connecting**
- Check if Ollama is running: `ps aux | grep ollama`
- Start server: `ollama serve`
- Pull model: `ollama pull llama3`
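
The same check can be done from Python inside the notebook. The sketch below assumes Ollama listens on its default address, `http://localhost:11434`, and probes the `/api/tags` endpoint that lists locally available models; `ollama_is_up` is a hypothetical helper, not part of the notebook's `OllamaClient`.

```python
import urllib.error
import urllib.request


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 3.0) -> bool:
    """Return True if GET <base_url>/api/tags answers with HTTP 200."""
    url = base_url.rstrip("/") + "/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error status, or malformed URL
        return False
```

Calling `ollama_is_up()` before a batch run gives an early, explicit failure instead of one logged LLM error per chunk.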

**Issue: MongoDB connection failed**
- Check if MongoDB is running: `ps aux | grep mongod`
- Start MongoDB: `mongod` or `sudo systemctl start mongod`

**Issue: Low extraction quality**
- Increase chunk overlap for better context
- Adjust temperature (lower = more deterministic)
- Try different LLM model (mistral, llama3.1, etc.)

**Issue: JSON parsing errors**
- The LLM may return Markdown around the JSON; the pipeline cleans this up automatically
- Check prompt clarity
- Reduce chunk size for better focus
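
As an illustration of the Markdown cleaning mentioned above, the sketch below extracts JSON from a reply that wraps it in a code fence or surrounds it with chatty text. It is an assumption about how such cleaning can work, not the notebook's actual implementation; `parse_llm_json` is a hypothetical name.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable
_FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE),
    re.DOTALL | re.IGNORECASE,
)


def parse_llm_json(raw: str) -> Any:
    """Parse JSON from an LLM reply, tolerating code fences and surrounding prose."""
    text = raw.strip()
    fenced = _FENCED_BLOCK.search(text)
    if fenced:
        text = fenced.group(1)
    else:
        # Fall back to the span between the first '{' and the last '}'
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            text = text[start:end + 1]
    return json.loads(text)  # raises json.JSONDecodeError if no valid JSON remains


fenced_reply = "Here is the data:\n" + FENCE + 'json\n{"programs": []}\n' + FENCE
parsed = parse_llm_json(fenced_reply)
```

Letting `json.JSONDecodeError` propagate keeps genuinely malformed replies visible to the retry logic instead of silently producing empty results.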

---

## 🎉 Summary & Next Steps

### ✅ What This Pipeline Does

1. **Loads** raw text files from educational institutions
2. **Chunks** large documents into manageable pieces
3. **Extracts** structured data using Ollama LLM (LLaMA 3)
4. **Validates** output against strict Pydantic schemas
5. **Scores** confidence levels for data quality
6. **Normalizes** scores based on completeness
7. **Stores** in MongoDB with proper relationships
8. **Audits** with logs and raw document storage
9. **Analyzes** data quality with comprehensive metrics
10. **Exports** results as JSON for downstream use

### 🚀 Next Steps

**Immediate:**
- [ ] Ensure Ollama is installed and running with LLaMA 3 model
- [ ] Start MongoDB server
- [ ] Run test on single file to verify setup
- [ ] Review extraction quality and adjust prompts if needed

**Short-term:**
- [ ] Process all files in batch mode
- [ ] Review quality report and identify gaps
- [ ] Fine-tune confidence thresholds
- [ ] Create retry queue for failed extractions

**Long-term:**
- [ ] Implement vector embeddings for semantic search
- [ ] Add schema evolution tracking
- [ ] Build API layer for data access
- [ ] Create data visualization dashboard
- [ ] Implement incremental updates (delta processing)
- [ ] Add multilingual support

### 📚 Resources

- **Ollama Documentation:** https://ollama.ai/
- **Pydantic Docs:** https://docs.pydantic.dev/
- **MongoDB Manual:** https://docs.mongodb.com/
- **LLaMA 3:** https://ai.meta.com/llama/

---

**Pipeline Status:** ✅ Ready to run
**Version:** 1.0.0
**Last Updated:** January 2026