# 📚 PDF-to-Embedding Pipeline
## Processing 15 GB of School Textbooks → Supabase pgvector

**⚠️  IMPORTANT: Run the cells IN ORDER (1 → 2 → 3 → 4 → 5 → 6)**

**Required Kaggle configuration:**
1. ✅ **Accelerator**: GPU P100 (Settings → Accelerator → GPU)
2. ✅ **Internet**: ON (Settings → Internet → ON)
3. ✅ **Secrets**: Add in Settings → Secrets:
   - `SUPABASE_URL` = your-project-url.supabase.co
   - `SUPABASE_ANON_KEY` = eyJhbGc... (anon public key)
4. ✅ **Dataset**: Upload `materiale_didactice.zip` as a Kaggle Dataset

**Estimated time:** 18-25 hours for 15 GB of PDFs

## Cell 1: GPU Check + Dependency Installation

In [None]:
# Check that a GPU is available
print("=" * 70)
print("GPU CHECK")
print("=" * 70)
!nvidia-smi

print("\n" + "=" * 70)
print("INSTALLING DEPENDENCIES")
print("=" * 70)

# Install all required dependencies.
# The specifiers are quoted so the shell does not treat ">=" as an output redirect.
!pip install -q "PyMuPDF>=1.23.0" \
    "sentence-transformers>=2.2.2" \
    "supabase>=2.0.0" \
    "numpy>=1.24.0" \
    "tqdm>=4.65.0" \
    "pyyaml>=6.0"

# PaddleOCR - OPTIONAL (leave disabled if you do not need OCR)
# OCR adds processing time but extracts text from images
ENABLE_OCR = False  # Set to True to enable OCR

if ENABLE_OCR:
    print("\nInstalling PaddleOCR...")
    !pip install -q "paddleocr>=2.7.0"
    print("✅ PaddleOCR installed")
else:
    print("\n⚠️  OCR DISABLED - plain text extraction only")
    print("    (Set ENABLE_OCR=True to enable OCR)")

print("\n" + "=" * 70)
print("✅ ALL DEPENDENCIES INSTALLED")
print("=" * 70)
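
To confirm that the pinned packages were actually installed, their versions can be read back from Python. This is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the notebook.

```python
from importlib import metadata
from typing import Dict, List, Optional


def installed_versions(packages: List[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as passed to pip in Cell 1
report = installed_versions(
    ["PyMuPDF", "sentence-transformers", "supabase", "numpy", "tqdm", "pyyaml"]
)
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

A `None` entry suggests the install step failed silently, which is worth checking before starting a multi-hour run.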

## Cell 2: Module Imports + Configuration

In [None]:
# Import standard libraries
import os
import sys
import logging
import time
import hashlib
import re
import json
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Tuple

# Import third-party libraries
import numpy as np
from tqdm import tqdm
import fitz  # PyMuPDF
from sentence_transformers import SentenceTransformer
from supabase import create_client
from kaggle_secrets import UserSecretsClient

# Import PaddleOCR if enabled
if ENABLE_OCR:
    from paddleocr import PaddleOCR
    print("✅ PaddleOCR imported")

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

print("=" * 70)
print("LOADING SUPABASE CREDENTIALS")
print("=" * 70)

# Read the Supabase credentials from Kaggle Secrets
secret = UserSecretsClient()

try:
    SUPABASE_URL = secret.get_secret('SUPABASE_URL')
    SUPABASE_KEY = secret.get_secret('SUPABASE_ANON_KEY')
    logger.info(f"✅ Credentials loaded: {SUPABASE_URL[:40]}...")
except Exception as e:
    logger.error("❌ ERROR: Could not load secrets!")
    logger.error(f"   {e}")
    logger.error("\nMake sure you added the following in Settings → Secrets:")
    logger.error("   - SUPABASE_URL")
    logger.error("   - SUPABASE_ANON_KEY")
    raise

# Processing configuration
CONFIG = {
    'chunk_size': 500,
    'chunk_overlap': 50,
    'min_chunk_length': 50,
    'embedding_model': 'paraphrase-multilingual-mpnet-base-v2',
    'embedding_batch_size': 128,
    'embedding_device': 'cuda',  # GPU
    'supabase_batch_size': 5000,  # 5k vectors per batch
    'max_retries': 3,
    'ocr_enabled': ENABLE_OCR,
    'ocr_languages': ['ro', 'en'],
    'ocr_min_confidence': 0.6,
    'test_mode': False  # ⚠️ SET TO True for a quick test run (10 PDFs)
}

logger.info("\n" + "=" * 70)
logger.info("CONFIGURATION")
logger.info("=" * 70)
logger.info(f"Chunk size: {CONFIG['chunk_size']} characters")
logger.info(f"Overlap: {CONFIG['chunk_overlap']} characters")
logger.info(f"Embedding model: {CONFIG['embedding_model']}")
logger.info(f"Device: {CONFIG['embedding_device']}")
logger.info(f"Batch upload: {CONFIG['supabase_batch_size']} vectors")
logger.info(f"OCR enabled: {CONFIG['ocr_enabled']}")
logger.info(f"Test mode: {CONFIG['test_mode']} {'(only 10 PDFs!)' if CONFIG['test_mode'] else '(all PDFs)'}")
logger.info("=" * 70)

print("\n✅ Imports and configuration complete!")
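
Because a full run takes many hours, it may be worth sanity-checking the configuration values before processing starts. The sketch below is one possible check; `validate_config` is a hypothetical helper, and the rules it enforces are assumptions about what a sensible configuration looks like, not requirements stated by the notebook.

```python
from typing import Any, Dict, List


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    if config["chunk_size"] <= 0:
        problems.append("chunk_size must be positive")
    if not 0 <= config["chunk_overlap"] < config["chunk_size"]:
        problems.append("chunk_overlap must be >= 0 and smaller than chunk_size")
    if config["min_chunk_length"] > config["chunk_size"]:
        problems.append("min_chunk_length must not exceed chunk_size")
    if config["embedding_batch_size"] <= 0 or config["supabase_batch_size"] <= 0:
        problems.append("batch sizes must be positive")
    if config["max_retries"] < 1:
        problems.append("max_retries must be at least 1")
    return problems


# Values copied from the CONFIG dict in this cell
example = {
    "chunk_size": 500, "chunk_overlap": 50, "min_chunk_length": 50,
    "embedding_batch_size": 128, "supabase_batch_size": 5000, "max_retries": 3,
}
print(validate_config(example) or "config OK")
```

In the notebook this could be called as `validate_config(CONFIG)` right after the dict is defined, raising an error if the returned list is not empty.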

## Cell 3: Finding PDFs in the Dataset

In [None]:
print("=" * 70)
print("SEARCHING FOR PDFs IN THE DATASET")
print("=" * 70)

# Locate the dataset folder in Kaggle
kaggle_input = Path('/kaggle/input')

# List all available datasets
dataset_folders = [d for d in kaggle_input.iterdir() if d.is_dir()]
logger.info("\nDatasets found in /kaggle/input/:")
for folder in dataset_folders:
    logger.info(f"  - {folder.name}")

if not dataset_folders:
    logger.error("\n❌ ERROR: No datasets in /kaggle/input/")
    logger.error("\nYou need to:")
    logger.error("  1. Upload materiale_didactice.zip as a Kaggle Dataset")
    logger.error("  2. Attach the dataset to this notebook (Add Data → Your Datasets)")
    logger.error("  3. Restart the notebook after attaching it")
    raise FileNotFoundError("Dataset is missing!")

# Use the first dataset (or set the folder manually)
pdf_folder = dataset_folders[0]
logger.info(f"\n✅ Using dataset: {pdf_folder.name}")

# Find all PDFs recursively
logger.info("\nSearching for PDFs recursively...")
all_pdfs = list(pdf_folder.glob('**/*.pdf'))

if not all_pdfs:
    logger.error("\n❌ ERROR: No PDFs found in the dataset!")
    logger.error("   Check that the dataset contains folders with PDFs")
    raise FileNotFoundError("PDFs are missing from the dataset!")

logger.info("\n" + "=" * 70)
logger.info(f"✅ FOUND {len(all_pdfs)} PDFs")
logger.info("=" * 70)

# Show a sample of the PDFs
logger.info("\nFirst 10 PDFs:")
for i, pdf in enumerate(all_pdfs[:10]):
    size_mb = pdf.stat().st_size / 1024 / 1024
    relative_path = pdf.relative_to(pdf_folder)
    logger.info(f"  {i+1}. {relative_path} ({size_mb:.1f} MB)")

if len(all_pdfs) > 10:
    logger.info(f"  ... and {len(all_pdfs) - 10} more PDFs")

# Compute the total size
total_size_bytes = sum(p.stat().st_size for p in all_pdfs)
total_size_gb = total_size_bytes / 1024 / 1024 / 1024

logger.info(f"\nTotal dataset size: {total_size_gb:.2f} GB")
# Rough estimate that assumes 3-5 minutes per PDF
logger.info(f"Estimated processing time: {len(all_pdfs) * 3 / 60:.0f}-{len(all_pdfs) * 5 / 60:.0f} hours")

print("\n✅ PDFs found and verified!")
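
The cell takes the first folder under `/kaggle/input`, which may not be the intended one when several datasets are attached. As an alternative, the folder with the most PDFs could be selected. The sketch below assumes that heuristic is acceptable; `pick_pdf_folder` is a hypothetical helper, and the demo runs on a temporary directory tree rather than on Kaggle.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def pick_pdf_folder(folders: List[Path]) -> Optional[Path]:
    """Return the folder containing the most PDFs, or None if no folder has any."""
    counts = {folder: sum(1 for _ in folder.glob("**/*.pdf")) for folder in folders}
    best = max(counts, key=counts.get, default=None)
    return best if best is not None and counts[best] > 0 else None


# Demo on a throwaway directory tree instead of /kaggle/input
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes").mkdir()
    (root / "manuale" / "clasa_2").mkdir(parents=True)
    (root / "manuale" / "clasa_2" / "a.pdf").touch()
    (root / "manuale" / "b.pdf").touch()
    chosen = pick_pdf_folder([root / "notes", root / "manuale"])
    chosen_name = chosen.name if chosen else None
    print(chosen_name)
```

In the notebook, `pdf_folder = pick_pdf_folder(dataset_folders)` would replace the `dataset_folders[0]` line, with a `None` result handled like the missing-PDF error.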

## Cell 4: Processing Function Definitions

In [None]:
print("=" * 70)
print("DEFINING PROCESSING FUNCTIONS")
print("=" * 70)

# ===================================================================
# 1. PDF TEXT EXTRACTION
# ===================================================================

def extract_text_from_pdf(pdf_path: Path) -> Dict:
    """
    Extract text from a PDF using PyMuPDF.
    
    Returns:
        dict with 'text', 'pages', 'images', 'status'
    """
    try:
        doc = fitz.open(pdf_path)
        full_text = ""
        images_found = []
        
        for page_num, page in enumerate(doc):
            # Extract text
            page_text = page.get_text()
            full_text += page_text + "\n"
            
            # Detect images (optional - used for OCR)
            if CONFIG['ocr_enabled']:
                image_list = page.get_images(full=True)
                for img in image_list:
                    images_found.append({
                        'page': page_num + 1,
                        'xref': img[0]
                    })
        
        # Read the page count before closing; len(doc) raises on a closed document
        page_count = len(doc)
        doc.close()
        
        return {
            'text': full_text,
            'pages': page_count,
            'images': images_found,
            'status': 'success',
            'char_count': len(full_text)
        }
        
    except Exception as e:
        return {
            'text': '',
            'pages': 0,
            'images': [],
            'status': 'error',
            'error': str(e)
        }

# ===================================================================
# 2. OCR PROCESSING (OPTIONAL)
# ===================================================================

def extract_text_from_images_ocr(pdf_path: Path, images: List[Dict]) -> str:
    """
    Extract text from images using PaddleOCR.
    Runs only when OCR is enabled.
    
    Returns:
        str - text extracted from images
    """
    if not CONFIG['ocr_enabled']:
        return ""
    
    try:
        # Initialize OCR once and cache it on the function (lazy loading)
        if not hasattr(extract_text_from_images_ocr, 'ocr'):
            logger.info("Initializing PaddleOCR...")
            extract_text_from_images_ocr.ocr = PaddleOCR(
                use_angle_cls=True,
                lang='en',  # English model only; CONFIG['ocr_languages'] is not used here
                use_gpu=True,
                show_log=False
            )
        
        ocr = extract_text_from_images_ocr.ocr
        ocr_text = ""
        
        # Process only the first 5 images (for speed)
        for img_info in images[:5]:
            try:
                # Render the page that contains the image
                doc = fitz.open(pdf_path)
                page = doc[img_info['page'] - 1]
                pix = page.get_pixmap()
                img_array = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
                    pix.height, pix.width, pix.n
                )
                doc.close()
                
                # Run OCR
                result = ocr.ocr(img_array, cls=True)
                
                if result and result[0]:
                    for line in result[0]:
                        text = line[1][0]
                        confidence = line[1][1]
                        if confidence >= CONFIG['ocr_min_confidence']:
                            ocr_text += text + " "
            except Exception:
                # Skip pages that fail to render or OCR
                continue
        
        return ocr_text
        
    except Exception as e:
        logger.warning(f"OCR failed: {e}")
        return ""

# ===================================================================
# 3. TEXT CHUNKING
# ===================================================================

def chunk_text(text: str) -> List[str]:
    """
    Split text into chunks with overlap.
    
    Returns:
        List[str] - list of chunks
    """
    chunk_size = CONFIG['chunk_size']
    overlap = CONFIG['chunk_overlap']
    min_length = CONFIG['min_chunk_length']
    
    # Split into sentences (approximately)
    sentences = re.split(r'(?<=[.!?])\s+', text)
    
    chunks = []
    current_chunk = ""
    
    for sentence in sentences:
        if len(current_chunk) + len(sentence) <= chunk_size:
            current_chunk += " " + sentence if current_chunk else sentence
        else:
            if len(current_chunk) >= min_length:
                chunks.append(current_chunk.strip())
            # Start the next chunk with the tail of the previous one, so up to
            # CONFIG['chunk_overlap'] characters are shared between neighbours
            tail = current_chunk[-overlap:].strip() if overlap > 0 else ""
            current_chunk = f"{tail} {sentence}" if tail else sentence
    
    # Add the last chunk
    if len(current_chunk) >= min_length:
        chunks.append(current_chunk.strip())
    
    return chunks

# ===================================================================
# 4. DEDUPLICATION
# ===================================================================

def deduplicate_chunks(chunks: List[str]) -> List[str]:
    """
    Remove duplicate chunks (repeated headers/footers).
    
    Returns:
        List[str] - unique chunks
    """
    seen_hashes = set()
    unique_chunks = []
    
    for chunk in chunks:
        chunk_hash = hashlib.md5(chunk.encode('utf-8')).hexdigest()
        
        if chunk_hash not in seen_hashes:
            seen_hashes.add(chunk_hash)
            unique_chunks.append(chunk)
    
    return unique_chunks

# ===================================================================
# 5. EMBEDDING GENERATION
# ===================================================================

def init_embedding_model():
    """
    Initialize the sentence-transformers model.
    """
    logger.info(f"Loading model: {CONFIG['embedding_model']}")
    logger.info(f"Device: {CONFIG['embedding_device']}")
    
    model = SentenceTransformer(
        CONFIG['embedding_model'],
        device=CONFIG['embedding_device']
    )
    
    # Report the dimension from the model itself rather than hard-coding it
    logger.info(f"✅ Model loaded: {model.get_sentence_embedding_dimension()} dimensions")
    return model

def generate_embeddings(texts: List[str], model) -> np.ndarray:
    """
    Generate embeddings for a list of texts.
    
    Returns:
        np.ndarray of shape (N, 768)
    """
    embeddings = model.encode(
        texts,
        batch_size=CONFIG['embedding_batch_size'],
        convert_to_numpy=True,
        show_progress_bar=False
    )
    return embeddings

# ===================================================================
# 6. SUPABASE UPLOAD
# ===================================================================

def init_supabase():
    """
    Initialize the Supabase client.
    """
    client = create_client(SUPABASE_URL, SUPABASE_KEY)
    logger.info("✅ Supabase client initialized")
    return client

def upload_vectors_to_supabase(supabase, vectors: List[Dict]) -> Dict:
    """
    Upload vectors to Supabase in batches.
    
    Returns:
        dict with 'success' and 'failed' counts
    """
    batch_size = CONFIG['supabase_batch_size']
    max_retries = CONFIG['max_retries']
    
    total_success = 0
    total_failed = 0
    
    for i in range(0, len(vectors), batch_size):
        batch = vectors[i:i + batch_size]
        
        # Retry logic
        for attempt in range(max_retries):
            try:
                response = supabase.table('document_embeddings').insert(batch).execute()
                total_success += len(batch)
                break  # Success, exit retry loop
            except Exception as e:
                if attempt == max_retries - 1:
                    logger.warning(f"Batch {i//batch_size} failed after {max_retries} retries: {e}")
                    total_failed += len(batch)
                else:
                    time.sleep(2)  # Wait before retry
    
    return {'success': total_success, 'failed': total_failed}

logger.info("\n✅ All functions defined!")
logger.info("   - extract_text_from_pdf()")
logger.info("   - extract_text_from_images_ocr() [optional]")
logger.info("   - chunk_text()")
logger.info("   - deduplicate_chunks()")
logger.info("   - init_embedding_model()")
logger.info("   - generate_embeddings()")
logger.info("   - init_supabase()")
logger.info("   - upload_vectors_to_supabase()")
print("=" * 70)
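
The upload function waits a fixed 2 seconds between attempts. For rate limits or short outages, exponential backoff is a common alternative. The sketch below shows the idea on a generic callable; `retry_with_backoff` and the fake `flaky_insert` are hypothetical names for illustration, and no real Supabase call is made.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action(); on failure wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(max_retries):
        try:
            return action()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_retries must be at least 1")


# Demo: a fake action that fails twice, and a fake sleep that records the delays
delays = []
calls = {"n": 0}


def flaky_insert() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "inserted"


outcome = retry_with_backoff(flaky_insert, max_retries=3, base_delay=2.0, sleep=delays.append)
print(outcome, delays)
```

Inside `upload_vectors_to_supabase()`, the action could be a lambda wrapping the existing `insert(batch).execute()` call, with a final `except` that counts the batch as failed.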

## Cell 5: Component Initialization (Model + Supabase)

In [None]:
print("=" * 70)
print("INITIALIZING COMPONENTS")
print("=" * 70)

# Initialize the embedding model
logger.info("\nLoading the embedding model on the GPU...")
embedding_model = init_embedding_model()

# Initialize Supabase
logger.info("\nConnecting to Supabase...")
supabase = init_supabase()

# Test the Supabase connection
try:
    test_response = supabase.table('document_embeddings').select('count', count='exact').limit(1).execute()
    current_count = test_response.count if test_response.count else 0
    logger.info(f"✅ Supabase connected - {current_count} existing vectors in the DB")
except Exception as e:
    logger.error(f"❌ Error testing the Supabase connection: {e}")
    logger.error("Check that you ran supabase_setup.sql in the SQL Editor!")
    raise

print("\n" + "=" * 70)
print("✅ COMPONENTS INITIALIZED - READY FOR PROCESSING")
print("=" * 70)

## Cell 6: MAIN PROCESSING PIPELINE

**⚠️  IMPORTANT**: This cell depends on the previous cells!
- Cell 2: `CONFIG`
- Cell 3: `all_pdfs`, `pdf_folder`
- Cell 5: `embedding_model`, `supabase`

If you get a `NameError`, run cells 1-5 first!

In [None]:
print("\n" + "=" * 70)
print("STARTING MAIN PROCESSING")
print("=" * 70)

# ===================================================================
# CHECK DEPENDENCIES FROM PREVIOUS CELLS
# ===================================================================
required_vars = {
    'all_pdfs': 'Cell 3 (Finding PDFs)',
    'pdf_folder': 'Cell 3 (Finding PDFs)',
    'embedding_model': 'Cell 5 (Component Initialization)',
    'supabase': 'Cell 5 (Component Initialization)',
    'CONFIG': 'Cell 2 (Configuration)'
}

# Collect the names of required variables that are not defined yet
missing_vars = [name for name in required_vars if name not in globals()]

if missing_vars:
    print("\n" + "=" * 70)
    print("❌ ERROR: Required variables are missing!")
    print("=" * 70)
    print("\nRun the cells in order BEFORE this cell:\n")
    for name in missing_vars:
        print(f"  ❌ {name} (from {required_vars[name]})")
    print("\nSteps to fix:")
    print("  1. Run Cell 1 (Dependency Installation)")
    print("  2. Run Cell 2 (Configuration)")
    print("  3. Run Cell 3 (Finding PDFs)")
    print("  4. Run Cell 4 (Function Definitions)")
    print("  5. Run Cell 5 (Component Initialization)")
    print("  6. Then run this cell\n")
    raise NameError(f"Missing variables: {', '.join(missing_vars)}")

print("✅ All dependencies verified - continuing with processing...\n")

# ===================================================================
# TEST vs FULL MODE (from CONFIG)
# ===================================================================
if CONFIG['test_mode']:
    pdfs_to_process = all_pdfs[:10]
    logger.warning("\n⚠️  TEST MODE ENABLED - only 10 PDFs will be processed!")
    logger.warning("   Set CONFIG['test_mode']=False in Cell 2 for full processing\n")
else:
    pdfs_to_process = all_pdfs
    logger.info(f"\nFULL mode - processing all {len(all_pdfs)} PDFs\n")

# Initialize statistics
stats = {
    'start_time': time.time(),
    'processed_pdfs': 0,
    'failed_pdfs': 0,
    'total_pages': 0,
    'total_chars': 0,
    'total_chunks': 0,
    'total_unique_chunks': 0,
    'total_embeddings': 0,
    'uploaded_vectors': 0,
    'failed_uploads': 0,
    'ocr_images_processed': 0
}

# Buffer for batch upload
upload_buffer = []
buffer_max_size = CONFIG['supabase_batch_size']

# ===================================================================
# MAIN PROCESSING LOOP
# ===================================================================

logger.info("Starting processing...\n")

with tqdm(total=len(pdfs_to_process), desc="📄 Processing PDFs") as pbar:
    
    for pdf_path in pdfs_to_process:
        try:
            # ============================================================
            # STEP 1: EXTRACT TEXT FROM THE PDF
            # ============================================================
            result = extract_text_from_pdf(pdf_path)
            
            if result['status'] != 'success':
                logger.warning(f"Failed: {pdf_path.name} - {result.get('error', 'Unknown error')}")
                stats['failed_pdfs'] += 1
                pbar.update(1)
                continue
            
            text = result['text']
            pages = result['pages']
            images = result['images']
            
            stats['processed_pdfs'] += 1
            stats['total_pages'] += pages
            stats['total_chars'] += len(text)
            
            # ============================================================
            # STEP 2: OCR ON IMAGES (if enabled)
            # ============================================================
            ocr_text = ""
            if CONFIG['ocr_enabled'] and images:
                ocr_text = extract_text_from_images_ocr(pdf_path, images)
                if ocr_text:
                    text = text + "\n" + ocr_text
                    # The OCR helper processes at most the first 5 images
                    stats['ocr_images_processed'] += min(len(images), 5)
            
            # Skip PDFs without text
            if len(text.strip()) < 100:
                logger.warning(f"Skipping {pdf_path.name} - too little text ({len(text)} chars)")
                pbar.update(1)
                continue
            
            # ============================================================
            # STEP 3: CHUNKING TEXT
            # ============================================================
            chunks = chunk_text(text)
            stats['total_chunks'] += len(chunks)
            
            # ============================================================
            # STEP 4: CHUNK DEDUPLICATION
            # ============================================================
            unique_chunks = deduplicate_chunks(chunks)
            stats['total_unique_chunks'] += len(unique_chunks)
            
            # ============================================================
            # STEP 5: EMBEDDING GENERATION
            # ============================================================
            embeddings = generate_embeddings(unique_chunks, embedding_model)
            stats['total_embeddings'] += len(embeddings)
            
            # ============================================================
            # STEP 6: PREPARE FOR UPLOAD
            # ============================================================
            relative_path = pdf_path.relative_to(pdf_folder)
            pdf_relative_path = str(relative_path)
            
            # Extract metadata from the path (e.g. clasa_2/matematica/...)
            path_parts = relative_path.parts
            clasa = 0
            
            # Try to extract the grade from the path (e.g. clasa_2)
            for part in path_parts:
                if 'clasa' in part.lower():
                    match = re.search(r'\d+', part)
                    if match:
                        clasa = int(match.group())
            
            # The subject is taken from a folder name; "General" if the PDF sits at the top level
            if len(path_parts) > 2:
                materie = path_parts[1]
            elif len(path_parts) == 2:
                materie = path_parts[0]
            else:
                materie = "General"
            
            # Create the vector items for Supabase
            for idx, (chunk, embedding) in enumerate(zip(unique_chunks, embeddings)):
                chunk_hash = hashlib.md5(chunk.encode('utf-8')).hexdigest()
                
                vector_item = {
                    'chunk_id': f"{pdf_path.stem}_{idx}_{chunk_hash[:8]}",
                    'text': chunk[:10000],  # Limit to 10k chars for the DB
                    'embedding': embedding.tolist(),
                    'source_pdf': pdf_relative_path,
                    'page_num': 1,  # Simplified - the exact page is not tracked
                    'clasa': clasa,
                    'materie': materie,
                    'capitol': 'General',
                    'chunk_hash': chunk_hash,
                    'has_images': len(images) > 0
                }
                
                upload_buffer.append(vector_item)
            
            # ============================================================
            # STEP 7: UPLOAD THE BATCH WHEN THE BUFFER IS FULL
            # ============================================================
            if len(upload_buffer) >= buffer_max_size:
                upload_result = upload_vectors_to_supabase(supabase, upload_buffer)
                stats['uploaded_vectors'] += upload_result['success']
                stats['failed_uploads'] += upload_result['failed']
                upload_buffer = []  # Clear buffer
            
            pbar.update(1)
            
        except Exception as e:
            logger.error(f"Critical error processing {pdf_path.name}: {e}")
            stats['failed_pdfs'] += 1
            pbar.update(1)
            continue

# ===================================================================
# UPLOAD THE LAST BATCH LEFT IN THE BUFFER
# ===================================================================
if upload_buffer:
    logger.info(f"\nUploading the last batch: {len(upload_buffer)} vectors...")
    upload_result = upload_vectors_to_supabase(supabase, upload_buffer)
    stats['uploaded_vectors'] += upload_result['success']
    stats['failed_uploads'] += upload_result['failed']

stats['end_time'] = time.time()
stats['elapsed_seconds'] = stats['end_time'] - stats['start_time']

print("\n" + "=" * 70)
print("✅ PROCESSING COMPLETE")
print("=" * 70)
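
Step 6 derives the grade (`clasa`) and subject (`materie`) from the folder layout. The same rule can be isolated in a small pure function, which makes it easy to test against sample paths before a long run. This is a sketch: `metadata_from_path` is a hypothetical helper and the example file names are invented for illustration.

```python
import re
from pathlib import PurePosixPath
from typing import Tuple


def metadata_from_path(relative_path: str) -> Tuple[int, str]:
    """Return (clasa, materie) parsed from a path like clasa_2/matematica/file.pdf."""
    parts = PurePosixPath(relative_path).parts
    clasa = 0
    for part in parts:
        if "clasa" in part.lower():
            match = re.search(r"\d+", part)
            if match:
                clasa = int(match.group())
    # Subject: second folder when nested, first folder otherwise, else "General"
    if len(parts) > 2:
        materie = parts[1]
    elif len(parts) == 2:
        materie = parts[0]
    else:
        materie = "General"
    return clasa, materie


for example in ["clasa_2/matematica/fractii.pdf", "biologie/celula.pdf", "ghid.pdf"]:
    print(example, "->", metadata_from_path(example))
```

Paths without a `clasa_N` folder keep the default grade `0`, matching the loop's fallback.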

## Cell 7: Final Statistics

In [None]:
print("\n" + "=" * 70)
print("📊 FINAL STATISTICS")
print("=" * 70)

# Compute elapsed time (the same duration expressed in hours and in minutes)
elapsed_hours = stats['elapsed_seconds'] / 3600
elapsed_minutes = stats['elapsed_seconds'] / 60

print(f"\n⏱️  PROCESSING TIME")
print(f"   Total: {elapsed_hours:.1f} hours ({elapsed_minutes:.0f} minutes)")
print(f"   Average time per PDF: {stats['elapsed_seconds'] / max(stats['processed_pdfs'], 1):.1f} seconds")

print(f"\n📄 PDFs")
print(f"   ✅ Processed successfully: {stats['processed_pdfs']:,}")
print(f"   ❌ Failed: {stats['failed_pdfs']:,}")
print(f"   📑 Total pages: {stats['total_pages']:,}")

print(f"\n📝 TEXT")
print(f"   Total characters: {stats['total_chars']:,}")
print(f"   Average per PDF: {stats['total_chars'] // max(stats['processed_pdfs'], 1):,} characters")

print(f"\n✂️  CHUNKING")
print(f"   Total chunks created: {stats['total_chunks']:,}")
print(f"   After deduplication: {stats['total_unique_chunks']:,}")
print(f"   Duplicates removed: {stats['total_chunks'] - stats['total_unique_chunks']:,}")
print(f"   Average chunks per PDF: {stats['total_unique_chunks'] // max(stats['processed_pdfs'], 1)}")

print(f"\n🧠 EMBEDDINGS")
print(f"   Total vectors generated: {stats['total_embeddings']:,}")
print(f"   Dimension: 768 (paraphrase-multilingual-mpnet-base-v2)")

if CONFIG['ocr_enabled']:
    print(f"\n🔍 OCR")
    print(f"   Images processed: {stats['ocr_images_processed']:,}")

print(f"\n☁️  SUPABASE UPLOAD")
print(f"   ✅ Uploaded successfully: {stats['uploaded_vectors']:,} vectors")
print(f"   ❌ Failed uploads: {stats['failed_uploads']:,}")

# Estimate the database size
estimated_db_mb = (stats['uploaded_vectors'] * 3.5) / 1024  # ~3.5KB per vector
print(f"\n💾 DATABASE SIZE")
print(f"   Estimated: ~{estimated_db_mb:.0f} MB")
print(f"   Supabase free tier: 500 MB")
if estimated_db_mb < 500:
    print(f"   ✅ Within the limit ({500 - estimated_db_mb:.0f} MB remaining)")
else:
    print(f"   ⚠️  EXCEEDS the limit by {estimated_db_mb - 500:.0f} MB")

print("\n" + "=" * 70)
print("🎉 PROCESSING FINISHED SUCCESSFULLY!")
print("=" * 70)
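
The size estimate above can be turned around to ask how many vectors fit in the 500 MB free tier. The sketch below reuses the notebook's ~3.5 KB-per-vector figure, which is itself an approximation; the function names are hypothetical, and the 600,000-vector example is the count mentioned in the Cell 10 index note.

```python
def estimated_db_mb(vector_count: int, kb_per_vector: float = 3.5) -> float:
    """Estimated table size in MB, using the notebook's ~3.5 KB-per-vector figure."""
    return vector_count * kb_per_vector / 1024


def max_vectors(limit_mb: float = 500, kb_per_vector: float = 3.5) -> int:
    """Largest vector count that stays within limit_mb under the same assumption."""
    return int(limit_mb * 1024 / kb_per_vector)


# A 768-dimensional embedding stored as 4-byte floats occupies 768 * 4 bytes = 3 KB
raw_embedding_kb = 768 * 4 / 1024

print(f"raw embedding: {raw_embedding_kb:.1f} KB")
print(f"600,000 vectors: ~{estimated_db_mb(600_000):.0f} MB")
print(f"vectors that fit in 500 MB: ~{max_vectors():,}")
```

Under these assumptions a 600,000-vector corpus would be roughly four times the free-tier limit, so running the estimate on a test-mode sample first may save a failed full run.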

## Cell 8: Supabase Database Verification

In [None]:
print("\n" + "=" * 70)
print("🔍 SUPABASE DATABASE VERIFICATION")
print("=" * 70)

try:
    # Count total vectors
    response = supabase.table('document_embeddings').select('count', count='exact').limit(1).execute()
    db_count = response.count if response.count else 0
    
    print(f"\n📊 Vectors in the database: {db_count:,}")
    print(f"   Uploaded in this run: {stats['uploaded_vectors']:,}")
    
    if db_count == stats['uploaded_vectors']:
        print("   ✅ VERIFICATION PASSED - all vectors uploaded")
    else:
        print(f"   ⚠️  Difference: {abs(db_count - stats['uploaded_vectors']):,} vectors")
        print("      (May include vectors from previous runs)")
    
    # Sample query - show the first 3 records
    print("\n📄 Sample records from the database:")
    sample = supabase.table('document_embeddings').select('chunk_id, source_pdf, clasa, materie').limit(3).execute()
    
    if sample.data:
        for i, record in enumerate(sample.data, 1):
            print(f"   {i}. {record['chunk_id']}")
            print(f"      PDF: {record['source_pdf']}")
            print(f"      Clasa: {record['clasa']}, Materie: {record['materie']}")
    else:
        print("   ⚠️  No records found")
    
    print("\n✅ Verification complete - database OK!")
    
except Exception as e:
    print(f"\n❌ Database verification error: {e}")
    print("   Check the Supabase connection and the 'document_embeddings' table")

print("=" * 70)

## Cell 9: Similarity Search Test (Optional)

In [None]:
print("\n" + "=" * 70)
print("🔍 SIMILARITY SEARCH TEST")
print("=" * 70)

if stats['uploaded_vectors'] == 0:
    print("\n⚠️  No vectors in the database - skipping the search test")
else:
    try:
        # Test queries in Romanian
        test_queries = [
            "Cum se calculează aria unui pătrat?",
            "Ce este fotosinteza?",
            "Capitala României"
        ]
        search_ok = True
        
        for query_text in test_queries:
            print(f"\n📝 Query: '{query_text}'")
            
            # Generate query embedding
            query_embedding = embedding_model.encode([query_text])[0].tolist()
            
            # Call Supabase RPC function
            try:
                results = supabase.rpc('match_documents', {
                    'query_embedding': query_embedding,
                    'match_count': 3
                }).execute()
                
                if results.data:
                    print("   Results:")
                    for i, result in enumerate(results.data, 1):
                        text_preview = result.get('text', '')[:80]
                        similarity = result.get('similarity', 0)
                        source = result.get('source_pdf', 'unknown')
                        print(f"      {i}. [{similarity:.1%}] {text_preview}...")
                        print(f"         Source: {source}")
                else:
                    print("   ⚠️  No results found")
                    
            except Exception as e:
                print(f"   ❌ Query error: {e}")
                print("      (Check that the match_documents() RPC function exists in Supabase)")
                print("      (Run supabase_setup.sql to create the function)")
                search_ok = False
                break
        
        # Report success only if no query failed
        if search_ok:
            print("\n✅ Similarity search works!")
            print("   You can use the match_documents() RPC in your application")
        
    except Exception as e:
        print(f"\n❌ Search test error: {e}")

print("\n" + "=" * 70)
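
The `similarity` value printed above is presumably a cosine similarity, since the index suggested in Cell 10 uses `vector_cosine_ops`; the actual definition lives in `supabase_setup.sql`, which is not shown here. As a sketch of what such a ranking computes, here is cosine similarity in plain Python on toy 3-dimensional vectors (the function names are hypothetical):

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(
    query: Sequence[float], rows: List[Tuple[str, Sequence[float]]], k: int = 3
) -> List[Tuple[str, float]]:
    """Rank (label, vector) rows by similarity to the query, best first."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in rows]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy vectors standing in for 768-dimensional embeddings
rows = [
    ("same direction", [2.0, 0.0, 0.0]),
    ("orthogonal", [0.0, 1.0, 0.0]),
    ("opposite", [-1.0, 0.0, 0.0]),
]
ranking = top_matches([1.0, 0.0, 0.0], rows)
print(ranking)
```

Vector length does not affect the score, only direction, which is why the first row ranks highest despite having twice the magnitude of the query.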

## Cell 10: NEXT STEPS & CLEANUP

In [None]:
print("\n" + "=" * 70)
print("🎯 NEXT STEPS")
print("=" * 70)

# Raw string so the "\n" sequences in the sample code print literally
print(r"""
✅ DONE:
   1. PDFs processed successfully
   2. Embeddings generated (768 dimensions)
   3. Vectors uploaded to Supabase

🔧 TODO (if you have not done so already):
   
   1. CREATE THE HNSW INDEX in the Supabase SQL Editor:
      
      CREATE INDEX IF NOT EXISTS idx_embedding_hnsw
      ON document_embeddings
      USING hnsw (embedding vector_cosine_ops)
      WITH (m = 16, ef_construction = 64);
      
      ⏱️  Time: 30-60 minutes for 600k vectors (one-time)
   
   2. CHECK THAT THE RPC FUNCTION match_documents() exists
      (It should have been created by supabase_setup.sql)

🚀 USAGE IN THE AI TUTORING APP:
   
   # In your Python/Node.js application:
   
   1. The user asks a question: "Cum calculez aria unui pătrat?"
   
   2. Generate the query embedding:
      model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
      query_emb = model.encode([question])[0].tolist()
   
   3. Search in Supabase:
      results = supabase.rpc('match_documents', {
          'query_embedding': query_emb,
          'match_count': 10,
          'filter_clasa': 2  # Optional
      }).execute()
   
   4. Feed the context to the LLM:
      context = "\n".join([r['text'] for r in results.data])
      prompt = f"Context: {context}\n\nÎntrebare: {question}"
      response = client.chat.completions.create(...)  # openai>=1.0, client = OpenAI()

💾 STORAGE:
   - Vectors in Supabase: PERMANENT (up to the 500 MB limit)
   - You can now delete the local PDFs to free 15 GB
   - Processing is done ONLY ONCE

💰 COST:
   - Total: $0 (Kaggle free + Supabase free tier)
   - Maintenance: $0/month

""")

print("=" * 70)
print("🎉 SUCCESS! Embeddings ready to use in AI tutoring!")
print("=" * 70)
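
Step 4 of the usage outline joins every retrieved chunk into the prompt, which can grow large with `match_count: 10`. A character budget is one simple way to bound it. The sketch below assumes result rows carry a `text` field, as in Cell 9; `build_prompt`, the budget value and the sample rows are hypothetical.

```python
from typing import Dict, List


def build_prompt(question: str, results: List[Dict], max_context_chars: int = 4000) -> str:
    """Join retrieved chunk texts (best first) until the character budget is used up."""
    selected = []
    used = 0
    for row in results:
        text = row.get("text", "").strip()
        if not text:
            continue  # ignore empty rows
        if used + len(text) > max_context_chars:
            break  # the next chunk would exceed the budget
        selected.append(text)
        used += len(text)
    context = "\n".join(selected)
    return f"Context: {context}\n\nÎntrebare: {question}"


# Hypothetical rows shaped like the match_documents() results used in Cell 9
rows = [
    {"text": "Aria pătratului este latura înmulțită cu latura.", "similarity": 0.82},
    {"text": "", "similarity": 0.40},
    {"text": "x" * 5000, "similarity": 0.35},
]
prompt = build_prompt("Cum calculez aria unui pătrat?", rows, max_context_chars=200)
print(prompt)
```

Stopping at the first chunk that does not fit keeps the best-ranked results and avoids truncating a chunk mid-sentence.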