# Memoirr: Preprocessor + Chunker Pipeline Smoke Test

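This notebook smoke-tests a minimal Haystack pipeline built around the SRT preprocessor and the semantic chunker. The final cell runs the full chain end to end: preprocess, chunk, embed, and write to Qdrant.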

Requirements:
- A local sentence-transformers model must be available under `models/<EMBEDDING_MODEL_NAME>/`, including `model.safetensors` and the tokenizer/config files.
- `.env` must set `EMBEDDING_MODEL_NAME` (the repo ships a default). Optionally set `EMBEDDING_DEVICE` (e.g., `cuda:0`).
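
The model-folder requirement can be checked before running anything heavy. The sketch below is only an illustration: the helper name `missing_model_files` is hypothetical, and `config.json` is an assumed file name (common for Hugging Face style model folders), not something confirmed by the Memoirr repo. It builds a throwaway folder so it runs anywhere.

```python
import pathlib
import tempfile


def missing_model_files(model_dir, required=("model.safetensors", "config.json")):
    """Return the names from `required` that are not regular files in model_dir.

    The default `required` tuple is an assumption for illustration only.
    """
    model_dir = pathlib.Path(model_dir)
    return [name for name in required if not (model_dir / name).is_file()]


# Demo on a temporary folder that contains only the weights file.
with tempfile.TemporaryDirectory() as tmp:
    demo = pathlib.Path(tmp) / "models" / "demo-model"
    demo.mkdir(parents=True)
    (demo / "model.safetensors").touch()
    missing = missing_model_files(demo)
    print("Missing files:", missing)
```

In the notebook itself, the same helper could be pointed at `pathlib.Path('models') / settings.embedding_model_name`, or at the candidate folder found by the fallback search.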

Notes:
- The preprocessor emits cleaned JSONL, one line per caption.
- The chunker uses Chonkie's `SemanticChunker` with the self-hosted embedding model to group captions into time-aware chunks.
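
To make the two notes above concrete, here is a rough sketch of what "one JSONL line per caption" and a "time-aware chunk" could look like. It is not the Memoirr implementation: the field names `start_ms`, `end_ms`, and `text` are assumptions, the helpers `srt_to_jsonl` and `merge_captions` are hypothetical, and the merge step simply joins all captions instead of using semantic similarity as the real chunker does.

```python
import json
import re

# Matches an SRT timing row such as "00:00:01,000 --> 00:00:02,000".
_TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def _to_ms(hours, minutes, seconds, millis):
    """Convert the four captured timestamp fields to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def srt_to_jsonl(srt_text):
    """Return one JSON string per non-empty caption (assumed schema)."""
    lines = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        rows = [row.strip() for row in block.splitlines() if row.strip()]
        timing_idx = next((i for i, row in enumerate(rows) if _TIMING.search(row)), None)
        if timing_idx is None:
            continue  # not a caption block
        fields = _TIMING.search(rows[timing_idx]).groups()
        text = " ".join(rows[timing_idx + 1:])
        text = re.sub(r"^-\s*", "", text)  # drop a leading dialogue dash
        if not text:
            continue  # skip empty captions
        record = {
            "start_ms": _to_ms(*fields[:4]),
            "end_ms": _to_ms(*fields[4:]),
            "text": text,
        }
        lines.append(json.dumps(record))
    return lines


def merge_captions(jsonl_lines):
    """Join captions into a single chunk that keeps the overall time span."""
    captions = [json.loads(line) for line in jsonl_lines]
    return {
        "text": " ".join(c["text"] for c in captions),
        "start_ms": captions[0]["start_ms"],
        "end_ms": captions[-1]["end_ms"],
    }


sample = """1
00:00:01,000 --> 00:00:02,000
- Hello there!

2
00:00:02,100 --> 00:00:03,000
How are you doing?
"""

records = srt_to_jsonl(sample)
for line in records:
    print(line)
chunk = merge_captions(records)
print(chunk)
```

The point of carrying `start_ms` and `end_ms` through both steps is that a chunk can later be mapped back to a position in the video, whatever grouping strategy produced it.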


In [None]:

import pathlib
import textwrap
from src.core.config import get_settings

settings = get_settings()
print('EMBEDDING_MODEL_NAME =', settings.embedding_model_name)
print('EMBEDDING_DEVICE     =', settings.device)

# Quick existence check to help the user
model_path = pathlib.Path('models') / settings.embedding_model_name
if not model_path.exists():
    # Fallback: search by terminal folder name (case-insensitive), similar to runtime resolver
    target = settings.embedding_model_name.split('/')[-1].lower()
    candidates = [p for p in pathlib.Path('models').rglob('*') if p.is_dir() and p.name.lower() == target]
    if candidates:
        print('Found candidate model dir at:', candidates[0])
    else:
        print('WARNING: Expected model folder not found under models/. The chunker cell may fail.')


EMBEDDING_MODEL_NAME = qwen3-embedding-0.6B
EMBEDDING_DEVICE     = 
Found candidate model dir at: models/chunker/qwen3-embedding-0.6B


In [2]:
# Sample SRT content (very small)
sample_srt = textwrap.dedent('''
1
00:00:01,000 --> 00:00:02,000
- Hello there!

2
00:00:02,100 --> 00:00:03,000
How are you doing?

3
00:00:03,100 --> 00:00:04,000
I'm fine. Thanks!
''')
print(sample_srt)



1
00:00:01,000 --> 00:00:02,000
- Hello there!

2
00:00:02,100 --> 00:00:03,000
How are you doing?

3
00:00:03,100 --> 00:00:04,000
I'm fine. Thanks!



In [3]:
# Run the complete end-to-end pipeline: SRT → Preprocess → Chunk → Embed → Qdrant
from src.pipelines.srt_to_qdrant import build_srt_to_qdrant_pipeline

print('Building the complete SRT-to-Qdrant pipeline...')
pipe = build_srt_to_qdrant_pipeline()

print('Pipeline components:')
for component_name in pipe.graph.nodes:
    print(f'  - {component_name}')

print('Pipeline connections:')
for edge in pipe.graph.edges:
    print(f'  - {edge[0]} → {edge[1]}')

print('Running pipeline on sample SRT...')
result = pipe.run({'pre': {'srt_text': sample_srt}})

print('Pipeline Results:')
print('=================')

# Show preprocessing stats
pre_stats = result['pre']['stats']
print(f'Preprocessor: {pre_stats}')

# Show chunking stats
chunk_stats = result['chunk']['stats']
print(f'Chunker: {chunk_stats}')

# Show write stats
write_stats = result['write']['stats']
print(f'Writer: {write_stats}')

print('✅ SUCCESS! Check Qdrant UI at http://localhost:6300/dashboard to see the embedded chunks!')
print('Collection name: memoirr')

Building the complete SRT-to-Qdrant pipeline...




Pipeline components:
  - pre
  - chunk
  - explode
  - embed
  - docs
  - write
Pipeline connections:
  - pre → chunk
  - chunk → explode
  - explode → embed
  - explode → docs
  - explode → docs
  - embed → docs
  - docs → write
Running pipeline on sample SRT...



Pipeline Results:
Preprocessor: {'total_captions': 3, 'kept': 3, 'dropped_empty': 0, 'dropped_non_english': 0, 'deduped': 0}
Chunker: {'input_captions': 3, 'output_chunks': 1, 'avg_tokens_per_chunk': 15.0}
Writer: {'written': 1, 'skipped': 0, 'total': 1}
✅ SUCCESS! Check Qdrant UI at http://localhost:6300/dashboard to see the embedded chunks!
Collection name: memoirr



