# CPATF: Code-switched Parliament-Aware Token Filtering

## Definition
**CPATF** (Code-switched Parliament-Aware Token Filtering) is a domain-specific preprocessing module designed for bilingual (Malay-English) Malaysian parliamentary Hansard texts. It filters out noise (e.g., attendance lists, OCR artifacts, repeated honorifics) while selectively retaining content-rich tokens essential for downstream topic modeling, entity extraction, and speaker identification.

## Core Concept
Hansard proceedings are highly structured yet noisy due to:
- Frequent code-switching between Malay and English
- Repetitive honorific titles (YB, Dato', Datuk Seri, etc.)
- Historical OCR errors and attendance lists in older documents (1970s–1990s)

CPATF addresses these challenges by assigning each token a **retention score** based on linguistic, syntactic, and domain-specific signals, preserving only high-information tokens.

## Processing Flow
1. **Attendance List Detection**  
   Skip entire segments identified as attendance lists using regex patterns (e.g., long runs of names separated by ". ", or many lines that start with an item number).

2. **Tokenization & Linguistic Analysis**  
   - spaCy pipeline with custom Malay POS fallback  
   - FastText language identification per token (ms/en confidence > 0.7)

3. **Retention Score Calculation**  
   Compute the retention score for each token and retain only tokens whose score meets or exceeds the threshold.

4. **Lemmatization & Output**  
   Lemmatize retained tokens and concatenate into cleaned text.
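
Step 1 can be illustrated with a small self-contained sketch. The two patterns follow the description above (runs of ". Name" fragments, or many numbered lines). The helper name `looks_like_attendance_list` and the `min_numbered` cut-off of six lines are assumptions for illustration, not tuned values.

```python
import re

# Five or more consecutive ". Name" fragments, e.g. ". Ali. Abu. Chong. Devi. Lim"
NAME_RUN = re.compile(r"(\.\s+[A-Z][a-z]+){5,}")
# Lines that start with an item number such as "12."
NUMBERED_LINE = re.compile(r"^\d+\.", re.MULTILINE)


def looks_like_attendance_list(text: str, min_numbered: int = 6) -> bool:
    """Heuristic: True if the text resembles a roll call rather than debate."""
    if NAME_RUN.search(text):
        return True
    return len(NUMBERED_LINE.findall(text)) >= min_numbered


# Hypothetical inputs for illustration only
roll_call = "\n".join(f"{i}. Yang Berhormat Member {i}" for i in range(1, 8))
speech = "Terima kasih, Tuan Yang di-Pertua. Saya ingin menyentuh soal pendidikan."

print(looks_like_attendance_list(roll_call))  # True: seven numbered lines
print(looks_like_attendance_list(speech))     # False: ordinary debate text
```

A heuristic like this trades recall for simplicity: a speech that quotes a long numbered list would also be skipped, so the cut-off is worth checking against real segments.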

## Mathematical Formulation
For each token \( t \):

Retention score:
$$
r(t) = w_1 \cdot \mathbb{I}_{\text{lang}}(t) + w_2 \cdot \mathbb{I}_{\text{POS}}(t) + w_3 \cdot \mathbb{I}_{\text{NER}}(t) - w_4 \cdot \text{red}(t)
$$

Token is retained if \( r(t) \geq \theta \).

### Indicators:
- **I_lang(t)**: 1 if FastText predicts "ms" or "en" with confidence > 0.7, else 0
- **I_POS(t)**: 1 if POS ∈ {NOUN, PROPN, VERB, ADJ, ADV}, else 0
- **I_NER(t)**: 1 if NER type ∈ {PERSON, ORG, LOC, BILL/LAW}, else 0
- **red(t)**: Redundancy penalty = (number of honorific occurrences beyond the first occurrence of each title in the local window) / (window length)
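
As a worked illustration of red(t) as defined above, the sketch below counts honorific occurrences beyond the first of each title inside a window centred on the token and divides by the number of tokens in that window. The honorific set, the window size of 5, and the function name are assumptions chosen for the example.

```python
from collections import Counter
from typing import List, Set


def redundancy_penalty(tokens: List[str], idx: int,
                       honorifics: Set[str], window: int = 5) -> float:
    """Repeated honorifics in the local window, normalised by window length."""
    half = window // 2
    start = max(0, idx - half)
    end = min(len(tokens), idx + half + 1)
    local = [t.lower() for t in tokens[start:end]]
    # Count each honorific, then keep only the occurrences beyond the first
    counts = Counter(t for t in local if t in honorifics)
    repeats = sum(c - 1 for c in counts.values())
    # Near the edges the window is shorter, so divide by its actual length
    return repeats / len(local)


tokens = ["YB", "YB", "Datuk", "YB", "bangun"]
# "yb" appears three times (two repeats), "datuk" once: 2 / 5 = 0.4
print(redundancy_penalty(tokens, 2, {"yb", "datuk"}))
```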

### Optimized Parameters (via ablation study on core500 dataset):
- w₁ = 0.25, w₂ = 0.25, w₃ = 0.40, w₄ = 0.10
- Threshold θ = 0.6
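
To make the decision rule concrete, here is a minimal sketch that plugs the weights and threshold listed above into r(t). The indicator values are supplied by hand; in the pipeline they would come from language identification, POS tagging, and NER. The function names are illustrative.

```python
# Weights and threshold as listed in the section above
W_LANG, W_POS, W_NER, W_RED = 0.25, 0.25, 0.40, 0.10
THETA = 0.6


def retention_score(i_lang: int, i_pos: int, i_ner: int, red: float) -> float:
    """r(t) = w1*I_lang + w2*I_POS + w3*I_NER - w4*red."""
    return W_LANG * i_lang + W_POS * i_pos + W_NER * i_ner - W_RED * red


def is_retained(i_lang: int, i_pos: int, i_ner: int, red: float) -> bool:
    """A token is kept when its score reaches the threshold."""
    return retention_score(i_lang, i_pos, i_ner, red) >= THETA


# Content word in a recognised language, not an entity: 0.25 + 0.25 = 0.50
print(is_retained(1, 1, 0, 0.0))  # False
# Same token recognised as a named entity: 0.25 + 0.25 + 0.40 = 0.90
print(is_retained(1, 1, 1, 0.0))  # True
```

Taken literally, these values mean a token passes only when the NER indicator fires together with at least one other indicator, since 0.25 + 0.25 = 0.50 falls below θ = 0.6 and 0.40 alone does as well.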

## Role in Overall Pipeline
CPATF produces clean, content-focused segments that serve as standardized input for all topic modeling pipelines, ensuring fair evaluation and high-quality input for MEHTC clustering and XLM-RoBERTa fine-tuning.

### Imports and Environment Setup

In [2]:
import os
import re
import gc
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Dict, Optional
from datetime import datetime
from pathlib import Path
from functools import lru_cache

import pymongo
import spacy
import fasttext
import random
from tqdm import tqdm
from dotenv import load_dotenv

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

# Load environment variables
_cwd = Path.cwd()
# Use the directory two levels up as the project root when it exists
project_root = _cwd.parents[1] if len(_cwd.parents) > 1 else _cwd
backend_env_path = project_root / "3_app_system" / "backend" / ".env"
load_dotenv(backend_env_path)

# FastText language model
FASTTEXT_MODEL_PATH = 'lid.176.bin'
if not os.path.exists(FASTTEXT_MODEL_PATH):
    print("Downloading FastText language model...")
    import urllib.request
    urllib.request.urlretrieve("https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin", FASTTEXT_MODEL_PATH)

# Load FastText model
ft_model = fasttext.load_model(FASTTEXT_MODEL_PATH)

# Load spaCy models
nlp_ner = spacy.load("xx_ent_wiki_sm")      # Multilingual NER
nlp_en = spacy.load("en_core_web_sm")       # English lemmatizer

# Common Malay function words (for rule-based POS filtering)
MALAY_FUNCTION_WORDS = {
    'yang', 'di', 'dan', 'untuk', 'dari', 'pada', 'dalam', 'adalah', 'ialah', 'bahawa',
    'sebagai', 'oleh', 'kepada', 'dengan', 'atau', 'jika', 'kerana', 'supaya', 'walaupun',
    'serta', 'tetapi', 'manakala', 'itu', 'ini', 'tersebut', 'akan', 'telah', 'sudah'
}

print("Environment ready.")
print(f"CPU Cores: {os.cpu_count()} ")

Environment ready.
CPU Cores: 32 


### Database Connection and Data Loading

In [3]:
client = pymongo.MongoClient(os.getenv("MONGO_URI"))
db = client["MyParliament"]

# Collections
segmented_col = db["hansard_segmented500"]
honorific_col = db["honorific_dictionary"]
cpatf_col = db["hansard_cpatf500"]

# Load honorifics
honorific_dict = honorific_col.find_one({}, {"categories": 1})
if not honorific_dict:
    raise ValueError("Honorific dictionary not found")
all_honorifics = set()
for titles in honorific_dict.get("categories", {}).values():
    all_honorifics.update([t.lower() for t in titles])

print(f"Loaded {len(all_honorifics)} unique honorifics.")

# Load documents with correct date field
all_docs = list(tqdm(
    segmented_col.find(
        {}, 
        {
            "_id": 1, 
            "segmentation_output": 1,   
            "hansardDate": 1,          
            "mesyuarat": 1,
            "parlimen": 1,
            "penggal": 1,
            "decade": 1
        }
    ),
    desc="Loading segmented documents"
))

print(f"Successfully loaded {len(all_docs)} documents.")

# Check samples with correct date field
print("\nSample documents check:")
for doc in all_docs[:5]:
    print(f"Doc ID: {doc['_id']}")
    print(f"  Date: {doc.get('hansardDate', 'Missing')}")
    print(f"  Segments: {len(doc.get('segmentation_output', []))}")
    print(f"  Parlimen: {doc.get('parlimen')}, Penggal: {doc.get('penggal')}, Mesyuarat: {doc.get('mesyuarat')}")
    print(f"  Decade: {doc.get('decade')}")
    print("---")

Loaded 20 unique honorifics.


Loading segmented documents: 500it [00:14, 35.11it/s]

Successfully loaded 500 documents.

Sample documents check:
Doc ID: 6947d3ebdaaf821ec476383b
  Date: 1959-09-11 00:00:00
  Segments: 16
  Parlimen: 1, Penggal: 1, Mesyuarat: 1
  Decade: pre1970
---
Doc ID: 6947d3ebdaaf821ec476383c
  Date: 1961-10-19 00:00:00
  Segments: 67
  Parlimen: 1, Penggal: 3, Mesyuarat: 1
  Decade: pre1970
---
Doc ID: 6947d3ebdaaf821ec476383d
  Date: 1961-04-24 00:00:00
  Segments: 56
  Parlimen: 1, Penggal: 3, Mesyuarat: 1
  Decade: pre1970
---
Doc ID: 6947d3ebdaaf821ec476383e
  Date: 1961-04-28 00:00:00
  Segments: 61
  Parlimen: 1, Penggal: 3, Mesyuarat: 1
  Decade: pre1970
---
Doc ID: 6947d3ebdaaf821ec476383f
  Date: 1963-08-15 00:00:00
  Segments: 42
  Parlimen: 1, Penggal: 5, Mesyuarat: 1
  Decade: pre1970
---





### CPATF Core Functions (Bilingual Processing)

In [8]:
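# Weights and thresholds for this rule-based variant. The NER term (w3) is not
# computed in this cell, so these values differ from the documented
# w1 = 0.25, w2 = 0.25, w4 = 0.10, θ = 0.6 and the 0.7 language-confidence cut-off.
W_LANG = 0.30              # weight of the language indicator
W_POS  = 0.30              # weight of the content-POS indicator
W_RED  = 0.15              # weight of the redundancy penalty
THRESHOLD = 0.40           # minimum score for a token to be retained
LANG_CONF_THRESHOLD = 0.6  # minimum FastText confidence
REDUNDANCY_WINDOW = 15     # number of tokens considered around each position

CONTENT_POS_TAGS = {'NOUN', 'PROPN', 'VERB', 'ADJ', 'ADV', 'NUM'}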

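def is_attendance_list(text: str) -> bool:
    """Return True if the segment looks like an attendance list.

    Matches five or more consecutive ". Name" fragments, or more than five
    lines that start with an item number.
    """
    dot_pattern = re.compile(r'(\.\s+[A-Z][a-z]+){5,}')
    num_pattern = re.compile(r'^\d+\.', re.MULTILINE)
    return bool(dot_pattern.search(text) or len(num_pattern.findall(text)) > 5)

@lru_cache(maxsize=30000)
def get_lang_indicator(token: str) -> int:
    """Return 1 if FastText labels the token ms/en/id above the confidence cut-off."""
    # FastText does not accept newlines in its input, so replace them first
    pred = ft_model.predict(token.replace('\n', ' '), k=1)
    lang, conf = pred[0][0].replace('__label__', ''), pred[1][0]
    # 'id' is accepted as well: Malay and Indonesian are closely related and
    # easily confused on single tokens
    return 1 if lang in ['ms', 'en', 'id'] and conf > LANG_CONF_THRESHOLD else 0

def get_redundancy_penalty(tokens: List[str], idx: int) -> float:
    """Penalty for repeated honorifics in a window centred on tokens[idx], capped at 0.4."""
    start = max(0, idx - REDUNDANCY_WINDOW // 2)
    end = min(len(tokens), idx + REDUNDANCY_WINDOW // 2 + 1)
    window = [t.lower() for t in tokens[start:end]]
    # Count occurrences beyond the first for every honorific present in the window
    repeated = sum(max(0, window.count(h) - 1) for h in all_honorifics if h in window)
    return min(repeated * 0.15, 0.4)  # Stronger penalty for honorifics

def simple_malay_stem(word: str) -> str:
    """Strip at most one common Malay suffix; words of four letters or fewer are kept whole."""
    word_lower = word.lower()
    if len(word_lower) <= 4:
        return word_lower
    suffixes = ['kan', 'an', 'i', 'lah', 'kah', 'nya', 'tah', 'pun', 'mu', 'ku']
    for suffix in suffixes:
        if word_lower.endswith(suffix):
            return word_lower[:-len(suffix)]
    return word_lower

def rule_based_pos(word: str) -> str:
    """Assign a coarse POS tag from surface cues (capitalisation, digits, suffixes).

    Every branch returns a tag in CONTENT_POS_TAGS, so the POS indicator is
    always 1 and MALAY_FUNCTION_WORDS is not consulted here.
    """
    if word and word[0].isupper():
        return 'PROPN'
    if word.isdigit() or re.match(r'^\d', word):
        return 'NUM'
    if word.lower().endswith(('kan', 'i', 'lah', 'nya', 'tah')):
        return 'VERB'
    return 'NOUN'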

def process_segment(segment: str) -> str:
    if isinstance(segment, list):
        segment = " ".join([s.strip() for s in segment if s.strip()])

    if not segment or not segment.strip():
        return ""

    if is_attendance_list(segment):
        return ""

    # Hard cap of 6,000 characters; longer input (including long chunks) is truncated here
    segment = segment[:6000]
    words = segment.split()
    if not words:
        return ""

    pos_tags = [rule_based_pos(word) for word in words]

    retained = []
    for idx, word in enumerate(words):
        lang_ind = get_lang_indicator(word)
        pos_ind = 1 if pos_tags[idx] in CONTENT_POS_TAGS else 0
        red_pen = get_redundancy_penalty(words, idx)

        score = W_LANG * lang_ind + W_POS * pos_ind - W_RED * red_pen

        # Force retain only critical entities
        force_retain = (
            pos_tags[idx] == 'PROPN' or 
            pos_tags[idx] == 'NUM' or 
            len(word) > 8  # Longer words likely entities
        )

        if force_retain or score >= THRESHOLD:
            word_lower = word.lower()
            if lang_ind == 1 and not word[0].isupper():
                normalized = simple_malay_stem(word_lower)
            else:
                normalized = word_lower
            retained.append(normalized)

    return " ".join(retained)

def process_long_segment(segment: str, max_chunk_tokens: int = 2000) -> str:
    words = segment.split()
    total_tokens = len(words)
    
    if total_tokens <= max_chunk_tokens:
        return process_segment(segment)
    
    print(f"Long segment ({total_tokens} tokens) → chunking (max {max_chunk_tokens})...")
    
    retained_words = []
    overlap = 200
    start = 0
    while start < total_tokens:
        end = min(start + max_chunk_tokens, total_tokens)
        chunk_text = " ".join(words[start:end])
        cleaned_chunk = process_segment(chunk_text)
        chunk_words = cleaned_chunk.split()
        if chunk_words:
            if retained_words and chunk_words[:50] == retained_words[-50:]:
                retained_words.extend(chunk_words[50:])
            else:
                retained_words.extend(chunk_words)
        start = end - overlap if end < total_tokens else end
    
    return " ".join(retained_words)
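
Because each token is scored from a small local window only, one alternative to overlapping chunks is to split on non-overlapping boundaries, so that no token is cleaned twice and no alignment of duplicated output is needed. The sketch below is a variant under stated assumptions, not the notebook's method: `clean_fn` stands in for `process_segment`, and `chunk_clean` and `demo_clean` are hypothetical names.

```python
from typing import Callable, List


def chunk_clean(text: str, clean_fn: Callable[[str], str],
                max_chunk_tokens: int = 2000) -> str:
    """Clean a long text in consecutive, non-overlapping token chunks."""
    words = text.split()
    cleaned: List[str] = []
    for start in range(0, len(words), max_chunk_tokens):
        chunk = " ".join(words[start:start + max_chunk_tokens])
        cleaned.extend(clean_fn(chunk).split())
    return " ".join(cleaned)


def demo_clean(chunk: str) -> str:
    """Stand-in cleaner (assumption): lowercase and drop one-letter tokens."""
    return " ".join(w.lower() for w in chunk.split() if len(w) > 1)


text = " ".join(f"Kata{i}" for i in range(10))
print(chunk_clean(text, demo_clean, max_chunk_tokens=4))
```

The trade-off is that tokens at a chunk boundary see a truncated redundancy window, which matters only for honorifics repeated across the split.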

### Test on 10 Random Real Segments

In [9]:
# Randomly sample 10 documents for testing
test_docs = random.sample(all_docs, min(10, len(all_docs)))

print("=== CPATF Test: Before & After on 10 Random Real Segments ===\n")

total_before = 0
total_after = 0
valid_count = 0
total_entities_before = 0
total_entities_after = 0

for i, doc in enumerate(test_docs, 1):
    doc_id = doc["_id"]
    date_str = doc.get("hansardDate")
    if isinstance(date_str, datetime):
        date_str = date_str.strftime("%Y-%m-%d")
    else:
        date_str = str(date_str) if date_str else "Unknown Date"
    
    raw_segments = doc.get("segmentation_output", [])

    print(f"[{i}/10] Doc ID: {doc_id} | Date: {date_str}")
    print(f"Raw segmentation_output items: {len(raw_segments)}\n")

    # Extract clean text strings
    segments = []
    for item in raw_segments:
        if isinstance(item, str):
            text = item.strip()
        elif isinstance(item, dict):
            text = item.get("text", "") or item.get("content", "")
            text = text.strip()
        else:
            continue
        if text:
            segments.append(text)

    if not segments:
        print("No valid text segments found in this document.\n")
        print("-" * 90 + "\n")
        continue

    valid_count += 1

    # Pick the longest segment for demonstration
    sample_seg = max(segments, key=len)

    print("【Original Segment】")
    print(sample_seg[:600] + ("..." if len(sample_seg) > 600 else "") + "\n")

    # Chunk long segments; 1,000-token chunks are used for this test
    cleaned = process_long_segment(sample_seg, max_chunk_tokens=1000)

    print("【CPATF Cleaned (Rule-based)】")
    print(cleaned[:600] + ("..." if len(cleaned) > 600 else "") + "\n")

    # Token count
    before_tokens = len(sample_seg.split())
    after_tokens = len(cleaned.split())
    total_before += before_tokens
    total_after += after_tokens

    # Simple entity preservation check (using spaCy NER on original and cleaned)
    try:
        orig_doc = nlp_ner(sample_seg)
        clean_doc = nlp_ner(cleaned)
        
        orig_entities = {ent.text.lower() for ent in orig_doc.ents 
                        if ent.label_ in ['PERSON', 'ORG', 'LOC']}
        clean_entities = {ent.text.lower() for ent in clean_doc.ents 
                         if ent.label_ in ['PERSON', 'ORG', 'LOC']}
        
        preserved = len(orig_entities & clean_entities)
        total_entities_before += len(orig_entities)
        total_entities_after += preserved
        
        # Compute the percentage outside the f-string and guard against division by zero
        preserved_pct = 100 * preserved / len(orig_entities) if orig_entities else 0.0
        print(f"Entities preserved: {preserved}/{len(orig_entities)} ({preserved_pct:.1f}%)")
        if orig_entities:
            print(f"Sample entities kept: {list(orig_entities & clean_entities)[:5]}")
    except Exception:
        print("Entity check skipped (NER error)")

    reduction = before_tokens - after_tokens
    reduction_pct = (reduction / before_tokens * 100) if before_tokens > 0 else 0
    print(f"Tokens: {before_tokens} → {after_tokens} "
          f"(reduced by {reduction}, {reduction_pct:.1f}%)")
    print("-" * 90 + "\n")

# Final Summary
print("=== CPATF Test Summary ===")
print(f"Documents with valid text: {valid_count}/10")

if total_before > 0:
    overall_reduction = total_before - total_after
    overall_pct = 100 * overall_reduction / total_before
    entity_preserve_rate = (100 * total_entities_after / total_entities_before 
                           if total_entities_before > 0 else 0)
    
    print(f"Total tokens before CPATF : {total_before}")
    print(f"Total tokens after CPATF  : {total_after}")
    print(f"Overall token reduction   : {overall_reduction} tokens ({overall_pct:.1f}%)")
    print(f"Entity preservation rate  : {entity_preserve_rate:.1f}% "
          f"({total_entities_after}/{total_entities_before} entities kept)")
    print(f"\n→ Reduction ~{overall_pct:.1f}% with ~{entity_preserve_rate:.1f}% entity retention")
else:
    print("No valid tokens found in sampled documents.")

print("\n=== CPATF Test Completed Successfully ===")

=== CPATF Test: Before & After on 10 Random Real Segments ===

[1/10] Doc ID: 6947dae7daaf821ec47638bb | Date: 1981-10-27
Raw segmentation_output items: 93

【Original Segment】
Terima kasih, hend Tuan Yang di-Pertua. Oleh yang demikian iala bagi menyelesaikan Rancangan Makanan seko Tambahan yang sangat berfaedah ini adalah muri dicadangkan mana-mana sekolah yang Wala terpilih hendaklah diberi kepada semua murid dala sekolah tersebut dan yang kedua, bantuan 25 ingi sen itu hendaklah ditimbangkan semula oleh S kerana memandangkan harga bahan-bahan kera makanan yang melambung hari ini adalah seba tidak sesuai. mera Tuan Yang di-Pertua, berhubung dengan pent Skim Pinjaman Buku Teks pula, saya hendak seba menyentuh dan memberi sedikit pandangan akan iaitu mengikut pengala...

Long segment (5716 tokens) → chunking (max 1000)...
【CPATF Cleaned (Rule-based)】
terima kasih, tuan yang di-pertua. oleh yang iala bagi menyelesaikan rancangan makanan tambahan yang sangat berfaedah ini ada dicadangkan 

Entities preserved: 27/120 (22.5% if orig_entities else 0) 
Sample entities kept: ['mekah', 'libya', 'keselamatan', 'umpamanya', 'akan']
Tokens: 5716 → 3935 (reduced by 1781, 31.2%)
------------------------------------------------------------------------------------------

[2/10] Doc ID: 6947dae7daaf821ec47638cb | Date: 1982-03-19
Raw segmentation_output items: 174

【Original Segment】
apa guna undi kepada Kerajaan, apa dia boleh bagi? Kalau undi kami pun sama juga. Dalam keadaan semacam itu, dia seolah-olah menggalak pengundi-pengundi bertanya 1339 19 MAC kepada calon-calon Kerajaan, apa Kerajaan m hendak bagi. Bila pengundi-pengundi o bertanya maka terpaksa Kerajaan membuat s penerangan. Calon Kerajaan ini bertanya. r kalau kami jadi Kerajaan maka kami dapat t mengadakan ini, ini dan ini, dan oleh kerana s pengundi mendapat ini dianya berpendapat m bahawa satu cara untuk mengugut calon- p calon Kerajaan ialah dengan berkata, “Kalau tidak bagi ini, kamu undi parti Pembangka...

Long se

### Full Preprocessing on 500 Documents using Best CPATF Parameters (Multi-threaded)
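
The driver cell in this section calls `process_full_document`, whose definition is not shown here. The sketch below is only a guess at its shape, inferred from the fields the driver reads (`parent_doc_id`, `original_text`, `original_token_count`, `cleaned_token_count`) and from the `segmentation_output` layout handled in the test cell. The name `build_segment_records`, the `clean_fn` parameter, and the `segment_index` and `cleaned_text` fields are assumptions.

```python
from typing import Any, Callable, Dict, List


def build_segment_records(doc: Dict[str, Any],
                          clean_fn: Callable[[str], str]) -> List[Dict[str, Any]]:
    """Turn one document into a list of per-segment records (hypothetical schema)."""
    records: List[Dict[str, Any]] = []
    for index, item in enumerate(doc.get("segmentation_output", [])):
        # Segments may be plain strings or dicts with "text" / "content"
        if isinstance(item, str):
            text = item.strip()
        elif isinstance(item, dict):
            text = (item.get("text") or item.get("content") or "").strip()
        else:
            continue
        if not text:
            continue
        cleaned = clean_fn(text)
        if not cleaned:
            continue  # assumption: segments cleaned to nothing are not stored
        records.append({
            "parent_doc_id": doc["_id"],
            "segment_index": index,      # assumed field
            "original_text": text,
            "cleaned_text": cleaned,     # assumed field
            "original_token_count": len(text.split()),
            "cleaned_token_count": len(cleaned.split()),
        })
    return records


# Hypothetical document; str.lower stands in for the real cleaner
doc = {"_id": "doc-1",
       "segmentation_output": ["Tuan Yang di-Pertua", {"text": "  "}, 42,
                               {"content": "Saya setuju"}]}
records = build_segment_records(doc, str.lower)
print(len(records), records[0]["original_token_count"])
```

Passing the cleaner in as an argument keeps the record-building logic testable without loading FastText or connecting to MongoDB.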

In [11]:
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
import gc
from pymongo.errors import ConnectionFailure, BulkWriteError
from pymongo import UpdateOne

# Full preprocessing - resume-safe (avoid duplicates if interrupted)

all_docs = list(segmented_col.find({}))
total_docs = len(all_docs)

if total_docs == 0:
    print("Error: No documents found in segmented_col.")
else:
    print(f"Found {total_docs} documents in segmented_col.")
    print("Resume-safe preprocessing: skipping already processed parent_doc_id")

start_time = time.time()

# Get already processed parent_doc_ids to skip
existing_parent_ids = set(
    seg["parent_doc_id"] for seg in cpatf_col.find({}, {"parent_doc_id": 1})
)
print(f"Found {len(existing_parent_ids)} already processed parent_doc_ids - skipping them")

# Filter only unprocessed docs
unprocessed_docs = [doc for doc in all_docs if doc["_id"] not in existing_parent_ids]
print(f"Need to process {len(unprocessed_docs)} new documents")

all_cleaned_segments = []
total_original_tokens = 0
total_cleaned_tokens = 0

BATCH_SIZE = 500  # Safe batch

def process_full_document_wrapper(doc):
    segments = process_full_document(doc)
    return segments

# Multi-thread processing only on unprocessed docs
if unprocessed_docs:
    with ThreadPoolExecutor(max_workers=24) as executor:
        futures = [executor.submit(process_full_document_wrapper, doc) for doc in unprocessed_docs]
        
        for future in tqdm(as_completed(futures), total=len(unprocessed_docs), desc="Processing new documents", unit="doc"):
            segments = future.result()
            all_cleaned_segments.extend(segments)
            
            for seg in segments:
                total_original_tokens += seg["original_token_count"]
                total_cleaned_tokens += seg["cleaned_token_count"]
            
            if len(all_cleaned_segments) % 2000 == 0:
                gc.collect()
else:
    print("All documents already processed - nothing to do!")

# Batch insert with upsert (avoid duplicates even if interrupted)
elapsed_time = time.time() - start_time

if all_cleaned_segments:
    print(f"\nStarting safe batch insert of {len(all_cleaned_segments)} new segments (batch size: {BATCH_SIZE})...")
    
    # Use upsert with unique key (parent_doc_id + original_text hash)
    # Create index first 
    cpatf_col.create_index([("parent_doc_id", 1), ("original_text_hash", 1)], unique=True)
    
    import hashlib  # imported once here rather than inside the per-segment loop

    for i in range(0, len(all_cleaned_segments), BATCH_SIZE):
        batch = all_cleaned_segments[i:i + BATCH_SIZE]
        operations = []
        for seg in batch:
            # Hash the original text so (parent_doc_id, hash) identifies a segment
            text_hash = hashlib.md5(seg["original_text"].encode('utf-8')).hexdigest()
            seg["original_text_hash"] = text_hash
            
            operations.append(
                UpdateOne(
                    {"parent_doc_id": seg["parent_doc_id"], "original_text_hash": text_hash},
                    {"$setOnInsert": seg},
                    upsert=True
                )
            )
        
        retry_count = 0
        max_retries = 5
        while retry_count < max_retries:
            try:
                result = cpatf_col.bulk_write(operations, ordered=False)
                print(f"Batch {i//BATCH_SIZE + 1}: Inserted {result.upserted_count}, Modified {result.modified_count}")
                break
            except (ConnectionFailure, BulkWriteError) as e:
                retry_count += 1
                print(f"Batch failed (attempt {retry_count}/{max_retries}): {e}")
                time.sleep(5 * retry_count)
        else:
            print("Batch failed after max retries")
    
    overall_reduction = 100 * (1 - total_cleaned_tokens / total_original_tokens) if total_original_tokens > 0 else 0
    
    print(f"\n=== Full Preprocessing Complete ===")
    print(f"Total time: {elapsed_time/60:.1f} minutes")
    print(f"New segments inserted: {len(all_cleaned_segments)}")
    print(f"Overall reduction: {overall_reduction:.1f}%")
    print(f"Collection: hansard_cpatf500")
else:
    print("No new segments to process")

print("\nCPATF preprocessing done")

Found 500 documents in segmented_col.
Resume-safe preprocessing: skipping already processed parent_doc_id
Found 0 already processed parent_doc_ids - skipping them
Need to process 500 new documents
Long segment (1354 tokens) → chunking (max 1200)...
Long segment (3773 tokens) → chunking (max 1200)...
Long segment (1262 tokens) → chunking (max 1200)...
Long segment (2426 tokens) → chunking (max 1200)...
Long segment (6704 tokens) → chunking (max 1200)...
Long segment (1323 tokens) → chunking (max 1200)...
Long segment (1207 tokens) → chunking (max 1200)...
Long segment (1276 tokens) → chunking (max 1200)...
Long segment (3021 tokens) → chunking (max 1200)...
Long segment (1785 tokens) → chunking (max 1200)...
Long segment (1504 tokens) → chunking (max 1200)...
Long segment (1560 tokens) → chunking (max 1200)...
Long segment (1772 tokens) → chunking (max 1200)...
Long segment (1218 tokens) → chunking (max 1200)...
Long segment (1404 tokens) → chunking (max 1200)...

Processing new documents:   4%|▎         | 18/500 [00:00<00:14, 32.83doc/s]

Long segment (1309 tokens) → chunking (max 1200)...
Long segment (1397 tokens) → chunking (max 1200)...
Long segment (1670 tokens) → chunking (max 1200)...
Long segment (1905 tokens) → chunking (max 1200)...
Long segment (2399 tokens) → chunking (max 1200)...
Long segment (1423 tokens) → chunking (max 1200)...
Long segment (1348 tokens) → chunking (max 1200)...
Long segment (2080 tokens) → chunking (max 1200)...
Long segment (1759 tokens) → chunking (max 1200)...
Long segment (1918 tokens) → chunking (max 1200)...
Long segment (1653 tokens) → chunking (max 1200)...
Long segment (1414 tokens) → chunking (max 1200)...
Long segment (1263 tokens) → chunking (max 1200)...
Long segment (8538 tokens) → chunking (max 1200)...
Long segment (1438 tokens) → chunking (max 1200)...
Long segment (6201 tokens) → chunking (max 1200)...
Long segment (1719 tokens) → chunking (max 1200)...
Long segment (2434 tokens) → chunking (max 1200)...
Long segment (2356 tokens) → chunking (max 1200)...
Long segment

Processing new documents:   5%|▍         | 24/500 [00:02<00:56,  8.47doc/s]

Long segment (1499 tokens) → chunking (max 1200)...Long segment (1962 tokens) → chunking (max 1200)...
Long segment (1246 tokens) → chunking (max 1200)...
Long segment (1452 tokens) → chunking (max 1200)...
Long segment (1355 tokens) → chunking (max 1200)...
Long segment (3638 tokens) → chunking (max 1200)...
Long segment (1281 tokens) → chunking (max 1200)...
Long segment (1545 tokens) → chunking (max 1200)...
Long segment (2828 tokens) → chunking (max 1200)...
Long segment (14768 tokens) → chunking (max 1200)...
Long segment (1661 tokens) → chunking (max 1200)...
Long segment (1254 tokens) → chunking (max 1200)...

Long segment (1500 tokens) → chunking (max 1200)...
Long segment (3448 tokens) → chunking (max 1200)...
Long segment (2098 tokens) → chunking (max 1200)...
Long segment (2819 tokens) → chunking (max 1200)...
Long segment (2264 tokens) → chunking (max 1200)...
Long segment (1253 tokens) → chunking (max 1200)...
Long segment (2395 tokens) → chunking (max 1200)...
Long segmen

Processing new documents:   5%|▌         | 26/500 [00:02<01:12,  6.56doc/s]

Long segment (2675 tokens) → chunking (max 1200)...
Long segment (1810 tokens) → chunking (max 1200)...
Long segment (1788 tokens) → chunking (max 1200)...
Long segment (2510 tokens) → chunking (max 1200)...
Long segment (1263 tokens) → chunking (max 1200)...
Long segment (3012 tokens) → chunking (max 1200)...
Long segment (5046 tokens) → chunking (max 1200)...
Long segment (1356 tokens) → chunking (max 1200)...
Long segment (1690 tokens) → chunking (max 1200)...
Long segment (1208 tokens) → chunking (max 1200)...
Long segment (1476 tokens) → chunking (max 1200)...
Long segment (1376 tokens) → chunking (max 1200)...
Long segment (1330 tokens) → chunking (max 1200)...
Long segment (1778 tokens) → chunking (max 1200)...
Long segment (2654 tokens) → chunking (max 1200)...
Long segment (1271 tokens) → chunking (max 1200)...
Long segment (2929 tokens) → chunking (max 1200)...
Long segment (1800 tokens) → chunking (max 1200)...
Long segment (1494 tokens) → chunking (max 1200)...
Long segment

Processing new documents:   6%|▌         | 28/500 [00:07<04:56,  1.59doc/s]

Long segment (1439 tokens) → chunking (max 1200)...
Long segment (1286 tokens) → chunking (max 1200)...
Long segment (1992 tokens) → chunking (max 1200)...
Long segment (1682 tokens) → chunking (max 1200)...
Long segment (1604 tokens) → chunking (max 1200)...
Long segment (2651 tokens) → chunking (max 1200)...
Long segment (2618 tokens) → chunking (max 1200)...
Long segment (3669 tokens) → chunking (max 1200)...
Long segment (2082 tokens) → chunking (max 1200)...
Long segment (2472 tokens) → chunking (max 1200)...
Long segment (2202 tokens) → chunking (max 1200)...
Long segment (3887 tokens) → chunking (max 1200)...
Long segment (3423 tokens) → chunking (max 1200)...
Long segment (1362 tokens) → chunking (max 1200)...
Long segment (2786 tokens) → chunking (max 1200)...
Long segment (1402 tokens) → chunking (max 1200)...
Long segment (1319 tokens) → chunking (max 1200)...
Long segment (1609 tokens) → chunking (max 1200)...
Long segment (2454 tokens) → chunking (max 1200)...
Long segment

Processing new documents:   6%|▌         | 30/500 [00:10<07:30,  1.04doc/s]

Long segment (1714 tokens) → chunking (max 1200)...
Long segment (1934 tokens) → chunking (max 1200)...
Long segment (2567 tokens) → chunking (max 1200)...
Long segment (1285 tokens) → chunking (max 1200)...
Long segment (1217 tokens) → chunking (max 1200)...
Long segment (3764 tokens) → chunking (max 1200)...
Long segment (1781 tokens) → chunking (max 1200)...
Long segment (2620 tokens) → chunking (max 1200)...
Long segment (1286 tokens) → chunking (max 1200)...
Long segment (5179 tokens) → chunking (max 1200)...
Long segment (1210 tokens) → chunking (max 1200)...
Long segment (1341 tokens) → chunking (max 1200)...
Long segment (2004 tokens) → chunking (max 1200)...
Long segment (1567 tokens) → chunking (max 1200)...
Long segment (2480 tokens) → chunking (max 1200)...
Long segment (1466 tokens) → chunking (max 1200)...
Long segment (2309 tokens) → chunking (max 1200)...
Long segment (1496 tokens) → chunking (max 1200)...
Long segment (1746 tokens) → chunking (max 1200)...
Long segment

Processing new documents:   6%|▌         | 31/500 [00:11<07:34,  1.03doc/s]

Long segment (2364 tokens) → chunking (max 1200)...
Long segment (3758 tokens) → chunking (max 1200)...
Long segment (3532 tokens) → chunking (max 1200)...
Long segment (1318 tokens) → chunking (max 1200)...
Long segment (4993 tokens) → chunking (max 1200)...
Long segment (1324 tokens) → chunking (max 1200)...
Long segment (3227 tokens) → chunking (max 1200)...
Long segment (1819 tokens) → chunking (max 1200)...
Long segment (1375 tokens) → chunking (max 1200)...
Long segment (5101 tokens) → chunking (max 1200)...
Long segment (1696 tokens) → chunking (max 1200)...
Long segment (3056 tokens) → chunking (max 1200)...
Long segment (2265 tokens) → chunking (max 1200)...
Long segment (1305 tokens) → chunking (max 1200)...
Long segment (3645 tokens) → chunking (max 1200)...
Long segment (1622 tokens) → chunking (max 1200)...
Long segment (1571 tokens) → chunking (max 1200)...
Long segment (3579 tokens) → chunking (max 1200)...
Long segment (3080 tokens) → chunking (max 1200)...
Long segment

Processing new documents:   7%|▋         | 33/500 [00:12<06:50,  1.14doc/s]

Long segment (1607 tokens) → chunking (max 1200)...
Long segment (1728 tokens) → chunking (max 1200)...
Long segment (1855 tokens) → chunking (max 1200)...
Long segment (2100 tokens) → chunking (max 1200)...
Long segment (1633 tokens) → chunking (max 1200)...
Long segment (1619 tokens) → chunking (max 1200)...
Long segment (1378 tokens) → chunking (max 1200)...
Long segment (2002 tokens) → chunking (max 1200)...
Long segment (1203 tokens) → chunking (max 1200)...
Long segment (7898 tokens) → chunking (max 1200)...


Processing new documents:   7%|▋         | 34/500 [00:13<06:31,  1.19doc/s]

Long segment (1318 tokens) → chunking (max 1200)...
Long segment (3815 tokens) → chunking (max 1200)...
Long segment (2008 tokens) → chunking (max 1200)...
Long segment (2004 tokens) → chunking (max 1200)...
Long segment (1883 tokens) → chunking (max 1200)...
Long segment (3825 tokens) → chunking (max 1200)...
Long segment (2004 tokens) → chunking (max 1200)...
Long segment (1444 tokens) → chunking (max 1200)...
Long segment (2733 tokens) → chunking (max 1200)...


Processing new documents:   7%|▋         | 36/500 [00:14<06:10,  1.25doc/s]

Long segment (1456 tokens) → chunking (max 1200)...
Long segment (1553 tokens) → chunking (max 1200)...
Long segment (1788 tokens) → chunking (max 1200)...
Long segment (1508 tokens) → chunking (max 1200)...
Long segment (1462 tokens) → chunking (max 1200)...
Long segment (1824 tokens) → chunking (max 1200)...
Long segment (2325 tokens) → chunking (max 1200)...
Long segment (2495 tokens) → chunking (max 1200)...
Long segment (1235 tokens) → chunking (max 1200)...
Long segment (1643 tokens) → chunking (max 1200)...
Long segment (1397 tokens) → chunking (max 1200)...
Long segment (2206 tokens) → chunking (max 1200)...
Long segment (1721 tokens) → chunking (max 1200)...
Long segment (2015 tokens) → chunking (max 1200)...
Long segment (1768 tokens) → chunking (max 1200)...
Long segment (1322 tokens) → chunking (max 1200)...
Long segment (1210 tokens) → chunking (max 1200)...
Long segment (1219 tokens) → chunking (max 1200)...
Long segment (1780 tokens) → chunking (max 1200)...
Long segment

Processing new documents:   8%|▊         | 38/500 [00:17<07:58,  1.03s/doc]

Long segment (1424 tokens) → chunking (max 1200)...
Long segment (3901 tokens) → chunking (max 1200)...
Long segment (5203 tokens) → chunking (max 1200)...
Long segment (1421 tokens) → chunking (max 1200)...
Long segment (1840 tokens) → chunking (max 1200)...
Long segment (4120 tokens) → chunking (max 1200)...
Long segment (1345 tokens) → chunking (max 1200)...
Long segment (1838 tokens) → chunking (max 1200)...
Long segment (1687 tokens) → chunking (max 1200)...
Long segment (1944 tokens) → chunking (max 1200)...
Long segment (2849 tokens) → chunking (max 1200)...
Long segment (1428 tokens) → chunking (max 1200)...
Long segment (1337 tokens) → chunking (max 1200)...
Long segment (3049 tokens) → chunking (max 1200)...
Long segment (2294 tokens) → chunking (max 1200)...
Long segment (3021 tokens) → chunking (max 1200)...
Long segment (1943 tokens) → chunking (max 1200)...
Long segment (1577 tokens) → chunking (max 1200)...
Long segment (1211 tokens) → chunking (max 1200)...
Long segment

Processing new documents:   8%|▊         | 40/500 [00:20<08:35,  1.12s/doc]

Long segment (3989 tokens) → chunking (max 1200)...Long segment (2124 tokens) → chunking (max 1200)...
Long segment (1476 tokens) → chunking (max 1200)...
Long segment (1227 tokens) → chunking (max 1200)...
Long segment (1228 tokens) → chunking (max 1200)...
Long segment (2484 tokens) → chunking (max 1200)...
Long segment (1426 tokens) → chunking (max 1200)...
Long segment (1745 tokens) → chunking (max 1200)...
Long segment (1414 tokens) → chunking (max 1200)...
Long segment (1898 tokens) → chunking (max 1200)...
Long segment (1322 tokens) → chunking (max 1200)...
Long segment (3229 tokens) → chunking (max 1200)...
Long segment (1599 tokens) → chunking (max 1200)...
Long segment (1399 tokens) → chunking (max 1200)...
Long segment (1305 tokens) → chunking (max 1200)...
Long segment (1205 tokens) → chunking (max 1200)...
Long segment (1666 tokens) → chunking (max 1200)...
Long segment (1327 tokens) → chunking (max 1200)...
Long segment (1276 tokens) → chunking (max 1200)...
Long segment

Processing new documents:   8%|▊         | 42/500 [00:23<10:26,  1.37s/doc]

Long segment (1395 tokens) → chunking (max 1200)...
Long segment (1234 tokens) → chunking (max 1200)...
Long segment (1380 tokens) → chunking (max 1200)...
Long segment (1258 tokens) → chunking (max 1200)...
Long segment (2027 tokens) → chunking (max 1200)...
Long segment (2040 tokens) → chunking (max 1200)...
Long segment (4461 tokens) → chunking (max 1200)...
Long segment (2202 tokens) → chunking (max 1200)...
Long segment (1291 tokens) → chunking (max 1200)...
Long segment (3247 tokens) → chunking (max 1200)...
Long segment (1868 tokens) → chunking (max 1200)...
Long segment (1453 tokens) → chunking (max 1200)...
Long segment (2628 tokens) → chunking (max 1200)...
Long segment (1258 tokens) → chunking (max 1200)...
Long segment (1715 tokens) → chunking (max 1200)...
Long segment (1529 tokens) → chunking (max 1200)...
Long segment (1465 tokens) → chunking (max 1200)...
Long segment (2025 tokens) → chunking (max 1200)...
Long segment (2380 tokens) → chunking (max 1200)...
Long segment

Processing new documents:   9%|▊         | 43/500 [00:24<09:29,  1.25s/doc]

Long segment (1288 tokens) → chunking (max 1200)...
Long segment (2140 tokens) → chunking (max 1200)...
Long segment (1696 tokens) → chunking (max 1200)...
Long segment (1438 tokens) → chunking (max 1200)...
Long segment (1211 tokens) → chunking (max 1200)...
Long segment (1321 tokens) → chunking (max 1200)...
Long segment (1551 tokens) → chunking (max 1200)...
Long segment (1990 tokens) → chunking (max 1200)...
Long segment (1366 tokens) → chunking (max 1200)...
Long segment (1277 tokens) → chunking (max 1200)...
Long segment (1717 tokens) → chunking (max 1200)...
Long segment (1735 tokens) → chunking (max 1200)...
Long segment (1326 tokens) → chunking (max 1200)...
Long segment (1796 tokens) → chunking (max 1200)...
Long segment (1961 tokens) → chunking (max 1200)...


Processing new documents:   9%|▉         | 45/500 [00:26<09:45,  1.29s/doc]

Long segment (3477 tokens) → chunking (max 1200)...
Long segment (1256 tokens) → chunking (max 1200)...
Long segment (1282 tokens) → chunking (max 1200)...
Long segment (1267 tokens) → chunking (max 1200)...
Long segment (2086 tokens) → chunking (max 1200)...
Long segment (1628 tokens) → chunking (max 1200)...
Long segment (1335 tokens) → chunking (max 1200)...
Long segment (1438 tokens) → chunking (max 1200)...
Long segment (2392 tokens) → chunking (max 1200)...
Long segment (1218 tokens) → chunking (max 1200)...
Long segment (1430 tokens) → chunking (max 1200)...
Long segment (1251 tokens) → chunking (max 1200)...
Long segment (1367 tokens) → chunking (max 1200)...
Long segment (2043 tokens) → chunking (max 1200)...
Long segment (1299 tokens) → chunking (max 1200)...
Long segment (1316 tokens) → chunking (max 1200)...
Long segment (1714 tokens) → chunking (max 1200)...
Long segment (2426 tokens) → chunking (max 1200)...
Long segment (1289 tokens) → chunking (max 1200)...
Long segment

Processing new documents:   9%|▉         | 46/500 [00:28<09:46,  1.29s/doc]

Long segment (1361 tokens) → chunking (max 1200)...
Long segment (1231 tokens) → chunking (max 1200)...
Long segment (1559 tokens) → chunking (max 1200)...
Long segment (1675 tokens) → chunking (max 1200)...
Long segment (2135 tokens) → chunking (max 1200)...
Long segment (1311 tokens) → chunking (max 1200)...
Long segment (1652 tokens) → chunking (max 1200)...
Long segment (2109 tokens) → chunking (max 1200)...
Long segment (2044 tokens) → chunking (max 1200)...
Long segment (1525 tokens) → chunking (max 1200)...
Long segment (1512 tokens) → chunking (max 1200)...
Long segment (1508 tokens) → chunking (max 1200)...
Long segment (1470 tokens) → chunking (max 1200)...
Long segment (2100 tokens) → chunking (max 1200)...
Long segment (1821 tokens) → chunking (max 1200)...
Long segment (2182 tokens) → chunking (max 1200)...
Long segment (3058 tokens) → chunking (max 1200)...
Long segment (1250 tokens) → chunking (max 1200)...
Long segment (1793 tokens) → chunking (max 1200)...
Long segment

Processing new documents:   9%|▉         | 47/500 [00:30<11:41,  1.55s/doc]

Long segment (2088 tokens) → chunking (max 1200)...
Long segment (1396 tokens) → chunking (max 1200)...
Long segment (2044 tokens) → chunking (max 1200)...
Long segment (2101 tokens) → chunking (max 1200)...
Long segment (1320 tokens) → chunking (max 1200)...
Long segment (1444 tokens) → chunking (max 1200)...
Long segment (1297 tokens) → chunking (max 1200)...
Long segment (1433 tokens) → chunking (max 1200)...
Long segment (1453 tokens) → chunking (max 1200)...
Long segment (2418 tokens) → chunking (max 1200)...
Long segment (1351 tokens) → chunking (max 1200)...
Long segment (1721 tokens) → chunking (max 1200)...
Long segment (1216 tokens) → chunking (max 1200)...
Long segment (1593 tokens) → chunking (max 1200)...
Long segment (8104 tokens) → chunking (max 1200)...
Long segment (3274 tokens) → chunking (max 1200)...
Long segment (1687 tokens) → chunking (max 1200)...
Long segment (1502 tokens) → chunking (max 1200)...
Long segment (1480 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  10%|▉         | 49/500 [00:32<08:47,  1.17s/doc]

Long segment (1427 tokens) → chunking (max 1200)...
Long segment (1291 tokens) → chunking (max 1200)...
Long segment (2881 tokens) → chunking (max 1200)...
Long segment (1261 tokens) → chunking (max 1200)...
Long segment (2171 tokens) → chunking (max 1200)...
Long segment (1552 tokens) → chunking (max 1200)...
Long segment (1623 tokens) → chunking (max 1200)...
Long segment (2114 tokens) → chunking (max 1200)...
Long segment (2630 tokens) → chunking (max 1200)...
Long segment (1227 tokens) → chunking (max 1200)...
Long segment (1734 tokens) → chunking (max 1200)...


Processing new documents:  10%|█         | 51/500 [00:35<10:07,  1.35s/doc]

Long segment (1373 tokens) → chunking (max 1200)...
Long segment (1983 tokens) → chunking (max 1200)...
Long segment (1920 tokens) → chunking (max 1200)...
Long segment (1592 tokens) → chunking (max 1200)...
Long segment (1586 tokens) → chunking (max 1200)...
Long segment (1373 tokens) → chunking (max 1200)...
Long segment (1265 tokens) → chunking (max 1200)...
Long segment (1845 tokens) → chunking (max 1200)...
Long segment (4198 tokens) → chunking (max 1200)...
Long segment (1352 tokens) → chunking (max 1200)...
Long segment (1611 tokens) → chunking (max 1200)...
Long segment (1500 tokens) → chunking (max 1200)...
Long segment (1264 tokens) → chunking (max 1200)...
Long segment (1759 tokens) → chunking (max 1200)...
Long segment (1289 tokens) → chunking (max 1200)...
Long segment (1236 tokens) → chunking (max 1200)...
Long segment (1567 tokens) → chunking (max 1200)...
Long segment (1646 tokens) → chunking (max 1200)...
Long segment (1333 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  11%|█         | 53/500 [00:36<08:10,  1.10s/doc]

Long segment (1712 tokens) → chunking (max 1200)...
Long segment (1679 tokens) → chunking (max 1200)...
Long segment (1440 tokens) → chunking (max 1200)...
Long segment (1729 tokens) → chunking (max 1200)...
Long segment (1203 tokens) → chunking (max 1200)...
Long segment (2887 tokens) → chunking (max 1200)...
Long segment (1223 tokens) → chunking (max 1200)...
Long segment (1334 tokens) → chunking (max 1200)...
Long segment (2225 tokens) → chunking (max 1200)...
Long segment (1272 tokens) → chunking (max 1200)...
Long segment (2018 tokens) → chunking (max 1200)...
Long segment (1895 tokens) → chunking (max 1200)...
Long segment (1635 tokens) → chunking (max 1200)...
Long segment (1483 tokens) → chunking (max 1200)...
Long segment (1649 tokens) → chunking (max 1200)...
Long segment (1439 tokens) → chunking (max 1200)...
Long segment (1315 tokens) → chunking (max 1200)...
Long segment (1744 tokens) → chunking (max 1200)...
Long segment (1588 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  11%|█         | 55/500 [00:39<09:51,  1.33s/doc]

Long segment (2039 tokens) → chunking (max 1200)...
Long segment (2195 tokens) → chunking (max 1200)...
Long segment (1644 tokens) → chunking (max 1200)...
Long segment (1356 tokens) → chunking (max 1200)...
Long segment (1321 tokens) → chunking (max 1200)...
Long segment (2598 tokens) → chunking (max 1200)...
Long segment (1276 tokens) → chunking (max 1200)...
Long segment (1267 tokens) → chunking (max 1200)...
Long segment (1372 tokens) → chunking (max 1200)...
Long segment (1387 tokens) → chunking (max 1200)...
Long segment (1717 tokens) → chunking (max 1200)...
Long segment (1267 tokens) → chunking (max 1200)...
Long segment (2770 tokens) → chunking (max 1200)...
Long segment (2516 tokens) → chunking (max 1200)...
Long segment (1317 tokens) → chunking (max 1200)...
Long segment (1549 tokens) → chunking (max 1200)...
Long segment (1201 tokens) → chunking (max 1200)...
Long segment (1371 tokens) → chunking (max 1200)...
Long segment (2199 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  11%|█▏        | 57/500 [00:42<09:29,  1.29s/doc]

Long segment (1330 tokens) → chunking (max 1200)...
Long segment (1862 tokens) → chunking (max 1200)...
Long segment (1859 tokens) → chunking (max 1200)...
Long segment (1703 tokens) → chunking (max 1200)...
Long segment (2899 tokens) → chunking (max 1200)...
Long segment (1537 tokens) → chunking (max 1200)...
Long segment (1860 tokens) → chunking (max 1200)...
Long segment (2844 tokens) → chunking (max 1200)...
Long segment (1872 tokens) → chunking (max 1200)...
Long segment (1309 tokens) → chunking (max 1200)...


Processing new documents:  40%|████      | 202/500 [00:45<00:12, 23.63doc/s]

Long segment (1418 tokens) → chunking (max 1200)...
Long segment (1690 tokens) → chunking (max 1200)...
Long segment (1322 tokens) → chunking (max 1200)...
Long segment (1381 tokens) → chunking (max 1200)...
Long segment (1715 tokens) → chunking (max 1200)...
Long segment (1622 tokens) → chunking (max 1200)...
Long segment (1824 tokens) → chunking (max 1200)...
Long segment (1741 tokens) → chunking (max 1200)...
Long segment (1395 tokens) → chunking (max 1200)...
Long segment (1660 tokens) → chunking (max 1200)...
Long segment (2765 tokens) → chunking (max 1200)...
Long segment (1347 tokens) → chunking (max 1200)...
Long segment (1909 tokens) → chunking (max 1200)...
Long segment (2391 tokens) → chunking (max 1200)...
Long segment (1700 tokens) → chunking (max 1200)...
Long segment (3451 tokens) → chunking (max 1200)...
Long segment (1840 tokens) → chunking (max 1200)...
Long segment (1621 tokens) → chunking (max 1200)...
Long segment (1515 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  45%|████▌     | 226/500 [00:50<00:22, 12.30doc/s]

Long segment (1407 tokens) → chunking (max 1200)...
Long segment (1430 tokens) → chunking (max 1200)...
Long segment (1203 tokens) → chunking (max 1200)...
Long segment (1415 tokens) → chunking (max 1200)...
Long segment (1430 tokens) → chunking (max 1200)...
Long segment (5303 tokens) → chunking (max 1200)...
Long segment (6521 tokens) → chunking (max 1200)...
Long segment (2968 tokens) → chunking (max 1200)...
Long segment (1554 tokens) → chunking (max 1200)...
Long segment (1201 tokens) → chunking (max 1200)...
Long segment (1392 tokens) → chunking (max 1200)...
Long segment (1401 tokens) → chunking (max 1200)...
Long segment (1377 tokens) → chunking (max 1200)...
Long segment (5254 tokens) → chunking (max 1200)...
Long segment (1259 tokens) → chunking (max 1200)...
Long segment (1290 tokens) → chunking (max 1200)...
Long segment (1833 tokens) → chunking (max 1200)...
Long segment (1569 tokens) → chunking (max 1200)...
Long segment (1637 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  49%|████▊     | 243/500 [00:53<00:26,  9.53doc/s]

Long segment (1751 tokens) → chunking (max 1200)...
Long segment (1537 tokens) → chunking (max 1200)...
Long segment (1553 tokens) → chunking (max 1200)...
Long segment (1229 tokens) → chunking (max 1200)...
Long segment (1467 tokens) → chunking (max 1200)...
Long segment (1561 tokens) → chunking (max 1200)...
Long segment (1224 tokens) → chunking (max 1200)...
Long segment (1503 tokens) → chunking (max 1200)...
Long segment (1437 tokens) → chunking (max 1200)...
Long segment (1536 tokens) → chunking (max 1200)...
Long segment (1415 tokens) → chunking (max 1200)...
Long segment (1733 tokens) → chunking (max 1200)...
Long segment (1970 tokens) → chunking (max 1200)...
Long segment (2045 tokens) → chunking (max 1200)...
Long segment (1308 tokens) → chunking (max 1200)...
Long segment (1312 tokens) → chunking (max 1200)...
Long segment (1450 tokens) → chunking (max 1200)...
Long segment (1717 tokens) → chunking (max 1200)...
Long segment (1279 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  51%|█████     | 255/500 [00:57<00:32,  7.58doc/s]

Long segment (1930 tokens) → chunking (max 1200)...
Long segment (1519 tokens) → chunking (max 1200)...
Long segment (1420 tokens) → chunking (max 1200)...
Long segment (2375 tokens) → chunking (max 1200)...
Long segment (2531 tokens) → chunking (max 1200)...
Long segment (1429 tokens) → chunking (max 1200)...
Long segment (1772 tokens) → chunking (max 1200)...
Long segment (1691 tokens) → chunking (max 1200)...
Long segment (1764 tokens) → chunking (max 1200)...
Long segment (2386 tokens) → chunking (max 1200)...
Long segment (2264 tokens) → chunking (max 1200)...
Long segment (1280 tokens) → chunking (max 1200)...
Long segment (3245 tokens) → chunking (max 1200)...
Long segment (1841 tokens) → chunking (max 1200)...


Processing new documents:  53%|█████▎    | 264/500 [00:58<00:31,  7.47doc/s]

Long segment (1270 tokens) → chunking (max 1200)...
Long segment (1768 tokens) → chunking (max 1200)...
Long segment (1624 tokens) → chunking (max 1200)...
Long segment (1326 tokens) → chunking (max 1200)...
Long segment (1671 tokens) → chunking (max 1200)...
Long segment (2477 tokens) → chunking (max 1200)...
Long segment (1365 tokens) → chunking (max 1200)...
Long segment (1306 tokens) → chunking (max 1200)...
Long segment (2273 tokens) → chunking (max 1200)...
Long segment (1709 tokens) → chunking (max 1200)...
Long segment (1262 tokens) → chunking (max 1200)...
Long segment (1478 tokens) → chunking (max 1200)...
Long segment (2479 tokens) → chunking (max 1200)...
Long segment (1987 tokens) → chunking (max 1200)...
Long segment (2263 tokens) → chunking (max 1200)...
Long segment (1983 tokens) → chunking (max 1200)...
Long segment (1631 tokens) → chunking (max 1200)...
Long segment (2456 tokens) → chunking (max 1200)...
Long segment (1304 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  54%|█████▍    | 271/500 [01:01<00:40,  5.68doc/s]

Long segment (1800 tokens) → chunking (max 1200)...
Long segment (2458 tokens) → chunking (max 1200)...
Long segment (2032 tokens) → chunking (max 1200)...
Long segment (4295 tokens) → chunking (max 1200)...
Long segment (1389 tokens) → chunking (max 1200)...
Long segment (2109 tokens) → chunking (max 1200)...
Long segment (2222 tokens) → chunking (max 1200)...
Long segment (1519 tokens) → chunking (max 1200)...
Long segment (2802 tokens) → chunking (max 1200)...
Long segment (1864 tokens) → chunking (max 1200)...
Long segment (1455 tokens) → chunking (max 1200)...
Long segment (1786 tokens) → chunking (max 1200)...
Long segment (1235 tokens) → chunking (max 1200)...
Long segment (2181 tokens) → chunking (max 1200)...
Long segment (1677 tokens) → chunking (max 1200)...
Long segment (2605 tokens) → chunking (max 1200)...
Long segment (2058 tokens) → chunking (max 1200)...
Long segment (2008 tokens) → chunking (max 1200)...
Long segment (1988 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  56%|█████▌    | 280/500 [01:03<00:42,  5.22doc/s]

Long segment (2370 tokens) → chunking (max 1200)...
Long segment (1886 tokens) → chunking (max 1200)...
Long segment (2189 tokens) → chunking (max 1200)...
Long segment (3128 tokens) → chunking (max 1200)...
Long segment (2083 tokens) → chunking (max 1200)...
Long segment (1280 tokens) → chunking (max 1200)...
Long segment (2263 tokens) → chunking (max 1200)...
Long segment (1420 tokens) → chunking (max 1200)...
Long segment (1698 tokens) → chunking (max 1200)...
Long segment (2431 tokens) → chunking (max 1200)...
Long segment (1210 tokens) → chunking (max 1200)...
Long segment (1356 tokens) → chunking (max 1200)...
Long segment (1891 tokens) → chunking (max 1200)...


Processing new documents:  57%|█████▋    | 283/500 [01:04<00:43,  4.98doc/s]

Long segment (2400 tokens) → chunking (max 1200)...
Long segment (2297 tokens) → chunking (max 1200)...
Long segment (1318 tokens) → chunking (max 1200)...
Long segment (3669 tokens) → chunking (max 1200)...
Long segment (1771 tokens) → chunking (max 1200)...
Long segment (1941 tokens) → chunking (max 1200)...
Long segment (1250 tokens) → chunking (max 1200)...
Long segment (1690 tokens) → chunking (max 1200)...
Long segment (1877 tokens) → chunking (max 1200)...
Long segment (1207 tokens) → chunking (max 1200)...
Long segment (1310 tokens) → chunking (max 1200)...
Long segment (1430 tokens) → chunking (max 1200)...
Long segment (1878 tokens) → chunking (max 1200)...
Long segment (1753 tokens) → chunking (max 1200)...
Long segment (1666 tokens) → chunking (max 1200)...
Long segment (2184 tokens) → chunking (max 1200)...
Long segment (2745 tokens) → chunking (max 1200)...
Long segment (2712 tokens) → chunking (max 1200)...
Long segment (1438 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  57%|█████▋    | 287/500 [01:08<01:13,  2.88doc/s]

Long segment (1268 tokens) → chunking (max 1200)...
Long segment (1263 tokens) → chunking (max 1200)...
Long segment (1611 tokens) → chunking (max 1200)...
Long segment (1387 tokens) → chunking (max 1200)...
Long segment (1494 tokens) → chunking (max 1200)...
Long segment (1311 tokens) → chunking (max 1200)...


Processing new documents:  58%|█████▊    | 289/500 [01:09<01:17,  2.73doc/s]

Long segment (1391 tokens) → chunking (max 1200)...
Long segment (1539 tokens) → chunking (max 1200)...
Long segment (1205 tokens) → chunking (max 1200)...
Long segment (1307 tokens) → chunking (max 1200)...
Long segment (1232 tokens) → chunking (max 1200)...
Long segment (1243 tokens) → chunking (max 1200)...
Long segment (2854 tokens) → chunking (max 1200)...
Long segment (1479 tokens) → chunking (max 1200)...
Long segment (1902 tokens) → chunking (max 1200)...
Long segment (1667 tokens) → chunking (max 1200)...
Long segment (2585 tokens) → chunking (max 1200)...
Long segment (1657 tokens) → chunking (max 1200)...
Long segment (2999 tokens) → chunking (max 1200)...
Long segment (1360 tokens) → chunking (max 1200)...
Long segment (2138 tokens) → chunking (max 1200)...
Long segment (1964 tokens) → chunking (max 1200)...
Long segment (1227 tokens) → chunking (max 1200)...
Long segment (1733 tokens) → chunking (max 1200)...
Long segment (1322 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  58%|█████▊    | 292/500 [01:14<02:24,  1.44doc/s]

Long segment (1936 tokens) → chunking (max 1200)...
Long segment (1600 tokens) → chunking (max 1200)...
Long segment (1521 tokens) → chunking (max 1200)...
Long segment (1563 tokens) → chunking (max 1200)...
Long segment (1236 tokens) → chunking (max 1200)...
Long segment (1381 tokens) → chunking (max 1200)...
Long segment (1531 tokens) → chunking (max 1200)...
Long segment (2518 tokens) → chunking (max 1200)...
Long segment (1761 tokens) → chunking (max 1200)...
Long segment (1727 tokens) → chunking (max 1200)...
Long segment (2107 tokens) → chunking (max 1200)...
Long segment (1789 tokens) → chunking (max 1200)...


Processing new documents:  59%|█████▊    | 293/500 [01:15<02:29,  1.39doc/s]

Long segment (1659 tokens) → chunking (max 1200)...
Long segment (1251 tokens) → chunking (max 1200)...
Long segment (1213 tokens) → chunking (max 1200)...
Long segment (1801 tokens) → chunking (max 1200)...
Long segment (1333 tokens) → chunking (max 1200)...


Processing new documents:  59%|█████▉    | 295/500 [01:16<02:23,  1.43doc/s]

Long segment (2237 tokens) → chunking (max 1200)...
Long segment (1492 tokens) → chunking (max 1200)...
Long segment (2385 tokens) → chunking (max 1200)...
Long segment (1618 tokens) → chunking (max 1200)...
Long segment (1259 tokens) → chunking (max 1200)...
Long segment (1613 tokens) → chunking (max 1200)...
Long segment (2525 tokens) → chunking (max 1200)...
Long segment (1347 tokens) → chunking (max 1200)...
Long segment (1593 tokens) → chunking (max 1200)...


Processing new documents:  59%|█████▉    | 296/500 [01:18<02:39,  1.28doc/s]

Long segment (1511 tokens) → chunking (max 1200)...
Long segment (1564 tokens) → chunking (max 1200)...
Long segment (1206 tokens) → chunking (max 1200)...
Long segment (1311 tokens) → chunking (max 1200)...
Long segment (1395 tokens) → chunking (max 1200)...
Long segment (1219 tokens) → chunking (max 1200)...
Long segment (2242 tokens) → chunking (max 1200)...
Long segment (1334 tokens) → chunking (max 1200)...
Long segment (1392 tokens) → chunking (max 1200)...
Long segment (1282 tokens) → chunking (max 1200)...
Long segment (2125 tokens) → chunking (max 1200)...
Long segment (1282 tokens) → chunking (max 1200)...
Long segment (1589 tokens) → chunking (max 1200)...
Long segment (2198 tokens) → chunking (max 1200)...
Long segment (1204 tokens) → chunking (max 1200)...


Processing new documents:  60%|█████▉    | 299/500 [01:21<03:41,  1.10s/doc]

Long segment (1357 tokens) → chunking (max 1200)...
Long segment (1301 tokens) → chunking (max 1200)...
Long segment (1575 tokens) → chunking (max 1200)...
Long segment (1573 tokens) → chunking (max 1200)...
Long segment (2022 tokens) → chunking (max 1200)...
Long segment (1243 tokens) → chunking (max 1200)...


Processing new documents:  60%|██████    | 300/500 [01:23<03:45,  1.13s/doc]

Long segment (2270 tokens) → chunking (max 1200)...
Long segment (1213 tokens) → chunking (max 1200)...
Long segment (2248 tokens) → chunking (max 1200)...
Long segment (1870 tokens) → chunking (max 1200)...
Long segment (1940 tokens) → chunking (max 1200)...
Long segment (2471 tokens) → chunking (max 1200)...
Long segment (1757 tokens) → chunking (max 1200)...
Long segment (1226 tokens) → chunking (max 1200)...
Long segment (1452 tokens) → chunking (max 1200)...
Long segment (1340 tokens) → chunking (max 1200)...
Long segment (1675 tokens) → chunking (max 1200)...
Long segment (1243 tokens) → chunking (max 1200)...
Long segment (1375 tokens) → chunking (max 1200)...
Long segment (1694 tokens) → chunking (max 1200)...


Processing new documents:  60%|██████    | 302/500 [01:26<04:44,  1.44s/doc]

Long segment (1206 tokens) → chunking (max 1200)...
Long segment (1282 tokens) → chunking (max 1200)...
Long segment (1696 tokens) → chunking (max 1200)...
Long segment (1389 tokens) → chunking (max 1200)...
Long segment (1240 tokens) → chunking (max 1200)...
Long segment (1318 tokens) → chunking (max 1200)...
Long segment (1661 tokens) → chunking (max 1200)...
Long segment (1302 tokens) → chunking (max 1200)...
Long segment (1972 tokens) → chunking (max 1200)...
Long segment (1405 tokens) → chunking (max 1200)...
Long segment (2110 tokens) → chunking (max 1200)...
Long segment (1668 tokens) → chunking (max 1200)...
Long segment (1263 tokens) → chunking (max 1200)...
Long segment (1467 tokens) → chunking (max 1200)...


Processing new documents:  61%|██████    | 303/500 [01:27<04:40,  1.43s/doc]

Long segment (1224 tokens) → chunking (max 1200)...
Long segment (1249 tokens) → chunking (max 1200)...
Long segment (1601 tokens) → chunking (max 1200)...
Long segment (1389 tokens) → chunking (max 1200)...
Long segment (1232 tokens) → chunking (max 1200)...
Long segment (1245 tokens) → chunking (max 1200)...
Long segment (3202 tokens) → chunking (max 1200)...
Long segment (1511 tokens) → chunking (max 1200)...
Long segment (1524 tokens) → chunking (max 1200)...


Processing new documents:  61%|██████    | 305/500 [01:31<05:47,  1.78s/doc]

Long segment (1431 tokens) → chunking (max 1200)...
Long segment (1201 tokens) → chunking (max 1200)...
Long segment (1403 tokens) → chunking (max 1200)...
Long segment (1227 tokens) → chunking (max 1200)...
Long segment (1531 tokens) → chunking (max 1200)...
Long segment (1378 tokens) → chunking (max 1200)...
Long segment (2155 tokens) → chunking (max 1200)...
Long segment (2818 tokens) → chunking (max 1200)...
Long segment (1400 tokens) → chunking (max 1200)...
Long segment (1345 tokens) → chunking (max 1200)...
Long segment (1301 tokens) → chunking (max 1200)...
Long segment (1800 tokens) → chunking (max 1200)...
Long segment (1545 tokens) → chunking (max 1200)...
Long segment (1562 tokens) → chunking (max 1200)...
Long segment (1288 tokens) → chunking (max 1200)...
Long segment (1348 tokens) → chunking (max 1200)...


Processing new documents:  61%|██████▏   | 307/500 [01:35<06:37,  2.06s/doc]

Long segment (1808 tokens) → chunking (max 1200)...
Long segment (1217 tokens) → chunking (max 1200)...
Long segment (2051 tokens) → chunking (max 1200)...
Long segment (2252 tokens) → chunking (max 1200)...
Long segment (2135 tokens) → chunking (max 1200)...
Long segment (1472 tokens) → chunking (max 1200)...
Long segment (1625 tokens) → chunking (max 1200)...
Long segment (1832 tokens) → chunking (max 1200)...
Long segment (1205 tokens) → chunking (max 1200)...
Long segment (1384 tokens) → chunking (max 1200)...
Long segment (1824 tokens) → chunking (max 1200)...
Long segment (1579 tokens) → chunking (max 1200)...
Long segment (1223 tokens) → chunking (max 1200)...
Long segment (1299 tokens) → chunking (max 1200)...
Long segment (1517 tokens) → chunking (max 1200)...
Long segment (1352 tokens) → chunking (max 1200)...
Long segment (1951 tokens) → chunking (max 1200)...
Long segment (1626 tokens) → chunking (max 1200)...
Long segment (1248 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  62%|██████▏   | 308/500 [01:36<05:31,  1.72s/doc]

Long segment (1242 tokens) → chunking (max 1200)...
Long segment (1612 tokens) → chunking (max 1200)...
Long segment (1269 tokens) → chunking (max 1200)...
Long segment (1262 tokens) → chunking (max 1200)...
Long segment (1840 tokens) → chunking (max 1200)...
Long segment (2753 tokens) → chunking (max 1200)...
Long segment (1340 tokens) → chunking (max 1200)...
Long segment (1408 tokens) → chunking (max 1200)...
Long segment (1243 tokens) → chunking (max 1200)...
Long segment (1265 tokens) → chunking (max 1200)...
Long segment (1315 tokens) → chunking (max 1200)...
Long segment (1786 tokens) → chunking (max 1200)...
Long segment (1545 tokens) → chunking (max 1200)...
Long segment (1675 tokens) → chunking (max 1200)...
Long segment (1314 tokens) → chunking (max 1200)...
Long segment (1244 tokens) → chunking (max 1200)...
Long segment (1313 tokens) → chunking (max 1200)...
Long segment (1383 tokens) → chunking (max 1200)...
Long segment (1777 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  62%|██████▏   | 310/500 [01:40<05:36,  1.77s/doc]

Long segment (1224 tokens) → chunking (max 1200)...
Long segment (1767 tokens) → chunking (max 1200)...
Long segment (1240 tokens) → chunking (max 1200)...
Long segment (2010 tokens) → chunking (max 1200)...
Long segment (1867 tokens) → chunking (max 1200)...
Long segment (1659 tokens) → chunking (max 1200)...
Long segment (1417 tokens) → chunking (max 1200)...
Long segment (1618 tokens) → chunking (max 1200)...
Long segment (1332 tokens) → chunking (max 1200)...


Processing new documents:  62%|██████▏   | 312/500 [01:42<04:45,  1.52s/doc]

Long segment (1704 tokens) → chunking (max 1200)...
Long segment (1388 tokens) → chunking (max 1200)...
Long segment (3562 tokens) → chunking (max 1200)...
Long segment (1741 tokens) → chunking (max 1200)...
Long segment (1646 tokens) → chunking (max 1200)...
Long segment (1585 tokens) → chunking (max 1200)...
Long segment (1268 tokens) → chunking (max 1200)...
Long segment (1521 tokens) → chunking (max 1200)...
Long segment (2082 tokens) → chunking (max 1200)...
Long segment (2333 tokens) → chunking (max 1200)...
Long segment (1538 tokens) → chunking (max 1200)...
Long segment (1239 tokens) → chunking (max 1200)...
Long segment (1768 tokens) → chunking (max 1200)...
Long segment (1548 tokens) → chunking (max 1200)...
Long segment (1427 tokens) → chunking (max 1200)...
Long segment (1742 tokens) → chunking (max 1200)...
Long segment (1420 tokens) → chunking (max 1200)...
Long segment (1375 tokens) → chunking (max 1200)...
Long segment (1455 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  63%|██████▎   | 314/500 [01:46<05:06,  1.65s/doc]

Long segment (1582 tokens) → chunking (max 1200)...
Long segment (1932 tokens) → chunking (max 1200)...
Long segment (1619 tokens) → chunking (max 1200)...
Long segment (2245 tokens) → chunking (max 1200)...
Long segment (2080 tokens) → chunking (max 1200)...
Long segment (1263 tokens) → chunking (max 1200)...
Long segment (2112 tokens) → chunking (max 1200)...
Long segment (2048 tokens) → chunking (max 1200)...
Long segment (1261 tokens) → chunking (max 1200)...
Long segment (2260 tokens) → chunking (max 1200)...
Long segment (1203 tokens) → chunking (max 1200)...
Long segment (1571 tokens) → chunking (max 1200)...
Long segment (2231 tokens) → chunking (max 1200)...
Long segment (1352 tokens) → chunking (max 1200)...
Long segment (1642 tokens) → chunking (max 1200)...
Long segment (2329 tokens) → chunking (max 1200)...
Long segment (1249 tokens) → chunking (max 1200)...
Long segment (1683 tokens) → chunking (max 1200)...
Long segment (1336 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  63%|██████▎   | 315/500 [01:48<05:06,  1.66s/doc]

Long segment (1652 tokens) → chunking (max 1200)...
Long segment (1713 tokens) → chunking (max 1200)...
Long segment (2660 tokens) → chunking (max 1200)...
Long segment (2696 tokens) → chunking (max 1200)...
Long segment (1394 tokens) → chunking (max 1200)...
Long segment (1544 tokens) → chunking (max 1200)...
Long segment (1757 tokens) → chunking (max 1200)...


Processing new documents:  63%|██████▎   | 316/500 [01:49<04:29,  1.47s/doc]

Long segment (1202 tokens) → chunking (max 1200)...
Long segment (1884 tokens) → chunking (max 1200)...
Long segment (2092 tokens) → chunking (max 1200)...
Long segment (1595 tokens) → chunking (max 1200)...
Long segment (1944 tokens) → chunking (max 1200)...
Long segment (1207 tokens) → chunking (max 1200)...
Long segment (1476 tokens) → chunking (max 1200)...
Long segment (1691 tokens) → chunking (max 1200)...
Long segment (1931 tokens) → chunking (max 1200)...
Long segment (1348 tokens) → chunking (max 1200)...
Long segment (1653 tokens) → chunking (max 1200)...
Long segment (1591 tokens) → chunking (max 1200)...
Long segment (1338 tokens) → chunking (max 1200)...
Long segment (1349 tokens) → chunking (max 1200)...
Long segment (1622 tokens) → chunking (max 1200)...
Long segment (1383 tokens) → chunking (max 1200)...
Long segment (2487 tokens) → chunking (max 1200)...
Long segment (1396 tokens) → chunking (max 1200)...
Long segment (1831 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  64%|██████▎   | 318/500 [01:53<05:32,  1.82s/doc]

Long segment (1244 tokens) → chunking (max 1200)...
Long segment (1505 tokens) → chunking (max 1200)...
Long segment (1509 tokens) → chunking (max 1200)...
Long segment (1436 tokens) → chunking (max 1200)...
Long segment (2718 tokens) → chunking (max 1200)...
Long segment (1254 tokens) → chunking (max 1200)...
Long segment (1259 tokens) → chunking (max 1200)...
Long segment (1435 tokens) → chunking (max 1200)...
Long segment (1466 tokens) → chunking (max 1200)...
Long segment (2081 tokens) → chunking (max 1200)...
Long segment (1686 tokens) → chunking (max 1200)...
Long segment (1882 tokens) → chunking (max 1200)...
Long segment (1463 tokens) → chunking (max 1200)...
Long segment (2000 tokens) → chunking (max 1200)...
Long segment (2318 tokens) → chunking (max 1200)...
Long segment (1597 tokens) → chunking (max 1200)...
Long segment (1207 tokens) → chunking (max 1200)...
Long segment (1574 tokens) → chunking (max 1200)...
Long segment (1222 tokens) → chunking (max 1200)...


Processing new documents:  64%|██████▍   | 320/500 [01:55<03:59,  1.33s/doc]

Long segment (1231 tokens) → chunking (max 1200)...
Long segment (1237 tokens) → chunking (max 1200)...
Long segment (1315 tokens) → chunking (max 1200)...
Long segment (1519 tokens) → chunking (max 1200)...
Long segment (1208 tokens) → chunking (max 1200)...
Long segment (1335 tokens) → chunking (max 1200)...
Long segment (1318 tokens) → chunking (max 1200)...
Long segment (2537 tokens) → chunking (max 1200)...
Long segment (1254 tokens) → chunking (max 1200)...


Processing new documents:  64%|██████▍   | 322/500 [01:56<02:45,  1.07doc/s]

Long segment (1301 tokens) → chunking (max 1200)...
Long segment (1610 tokens) → chunking (max 1200)...
Long segment (1509 tokens) → chunking (max 1200)...
Long segment (1917 tokens) → chunking (max 1200)...
Long segment (1487 tokens) → chunking (max 1200)...
Long segment (9127 tokens) → chunking (max 1200)...
Long segment (1561 tokens) → chunking (max 1200)...
Long segment (1927 tokens) → chunking (max 1200)...


Processing new documents:  65%|██████▍   | 324/500 [01:59<03:11,  1.09s/doc]

Long segment (1323 tokens) → chunking (max 1200)...
Long segment (1236 tokens) → chunking (max 1200)...
Long segment (1378 tokens) → chunking (max 1200)...
Long segment (1617 tokens) → chunking (max 1200)...


Processing new documents:  65%|██████▌   | 326/500 [02:01<02:52,  1.01doc/s]

Long segment (1879 tokens) → chunking (max 1200)...
Long segment (1374 tokens) → chunking (max 1200)...
Long segment (3048 tokens) → chunking (max 1200)...
Long segment (2972 tokens) → chunking (max 1200)...
Long segment (1236 tokens) → chunking (max 1200)...
Long segment (2327 tokens) → chunking (max 1200)...
Long segment (1313 tokens) → chunking (max 1200)...
Long segment (2161 tokens) → chunking (max 1200)...
Long segment (1589 tokens) → chunking (max 1200)...
Long segment (1321 tokens) → chunking (max 1200)...
Long segment (1458 tokens) → chunking (max 1200)...
Long segment (1880 tokens) → chunking (max 1200)...
Long segment (1801 tokens) → chunking (max 1200)...
Long segment (1738 tokens) → chunking (max 1200)...
Long segment (1486 tokens) → chunking (max 1200)...
Long segment (3683 tokens) → chunking (max 1200)...
Long segment (2107 tokens) → chunking (max 1200)...
Long segment (1758 tokens) → chunking (max 1200)...
Long segment (1267 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  65%|██████▌   | 327/500 [02:02<02:36,  1.11doc/s]

Long segment (1574 tokens) → chunking (max 1200)...

Long segment (1930 tokens) → chunking (max 1200)...
Long segment (1245 tokens) → chunking (max 1200)...
Long segment (1434 tokens) → chunking (max 1200)...
Long segment (1691 tokens) → chunking (max 1200)...
Long segment (4116 tokens) → chunking (max 1200)...
Long segment (1233 tokens) → chunking (max 1200)...
Long segment (1783 tokens) → chunking (max 1200)...
Long segment (1436 tokens) → chunking (max 1200)...
Long segment (1594 tokens) → chunking (max 1200)...
Long segment (1392 tokens) → chunking (max 1200)...
Long segment (1828 tokens) → chunking (max 1200)...
Long segment (4269 tokens) → chunking (max 1200)...
Long segment (1205 tokens) → chunking (max 1200)...
Long segment (1211 tokens) → chunking (max 1200)...
Long segment (1248 tokens) → chunking (max 1200)...
Long segment (1442 tokens) → chunking (max 1200)...


Processing new documents:  66%|██████▌   | 328/500 [02:04<04:32,  1.58s/doc]

Long segment (1811 tokens) → chunking (max 1200)...
Long segment (1574 tokens) → chunking (max 1200)...
Long segment (1987 tokens) → chunking (max 1200)...

Long segment (1299 tokens) → chunking (max 1200)...
Long segment (2106 tokens) → chunking (max 1200)...
Long segment (1374 tokens) → chunking (max 1200)...
Long segment (1248 tokens) → chunking (max 1200)...
Long segment (1381 tokens) → chunking (max 1200)...
Long segment (1451 tokens) → chunking (max 1200)...
Long segment (2869 tokens) → chunking (max 1200)...
Long segment (1209 tokens) → chunking (max 1200)...


Processing new documents:  66%|██████▌   | 330/500 [02:08<04:44,  1.67s/doc]

Long segment (1378 tokens) → chunking (max 1200)...
Long segment (2117 tokens) → chunking (max 1200)...
Long segment (1387 tokens) → chunking (max 1200)...
Long segment (1498 tokens) → chunking (max 1200)...
Long segment (1839 tokens) → chunking (max 1200)...
Long segment (1780 tokens) → chunking (max 1200)...
Long segment (1607 tokens) → chunking (max 1200)...
Long segment (1900 tokens) → chunking (max 1200)...


Processing new documents:  66%|██████▌   | 331/500 [02:10<04:49,  1.71s/doc]

Long segment (3125 tokens) → chunking (max 1200)...
Long segment (1410 tokens) → chunking (max 1200)...
Long segment (1498 tokens) → chunking (max 1200)...
Long segment (1942 tokens) → chunking (max 1200)...
Long segment (1718 tokens) → chunking (max 1200)...
Long segment (1233 tokens) → chunking (max 1200)...
Long segment (1418 tokens) → chunking (max 1200)...
Long segment (1282 tokens) → chunking (max 1200)...
Long segment (1325 tokens) → chunking (max 1200)...
Long segment (1827 tokens) → chunking (max 1200)...
Long segment (2121 tokens) → chunking (max 1200)...
Long segment (1494 tokens) → chunking (max 1200)...
Long segment (1450 tokens) → chunking (max 1200)...
Long segment (1463 tokens) → chunking (max 1200)...
Long segment (2681 tokens) → chunking (max 1200)...
Long segment (1299 tokens) → chunking (max 1200)...
Long segment (1236 tokens) → chunking (max 1200)...
Long segment (1255 tokens) → chunking (max 1200)...


Processing new documents:  66%|██████▋   | 332/500 [02:14<04:50,  1.73s/doc]

Long segment (1231 tokens) → chunking (max 1200)...
Long segment (1258 tokens) → chunking (max 1200)...
Long segment (1398 tokens) → chunking (max 1200)...
Long segment (1754 tokens) → chunking (max 1200)...
Long segment (1473 tokens) → chunking (max 1200)...
Long segment (1886 tokens) → chunking (max 1200)...
Long segment (1539 tokens) → chunking (max 1200)...
Long segment (1260 tokens) → chunking (max 1200)...
Long segment (1206 tokens) → chunking (max 1200)...
Long segment (1401 tokens) → chunking (max 1200)...
Long segment (1532 tokens) → chunking (max 1200)...
Long segment (1393 tokens) → chunking (max 1200)...
Long segment (1276 tokens) → chunking (max 1200)...
Long segment (2350 tokens) → chunking (max 1200)...
Long segment (1469 tokens) → chunking (max 1200)...
Long segment (1280 tokens) → chunking (max 1200)...
Long segment (3182 tokens) → chunking (max 1200)...
Long segment (1530 tokens) → chunking (max 1200)...
Long segment (1537 tokens) → chunking (max 1200)...


Processing new documents:  67%|██████▋   | 334/500 [02:16<04:44,  1.71s/doc]

Long segment (1664 tokens) → chunking (max 1200)...
Long segment (1586 tokens) → chunking (max 1200)...
Long segment (1487 tokens) → chunking (max 1200)...
Long segment (1698 tokens) → chunking (max 1200)...
Long segment (1683 tokens) → chunking (max 1200)...
Long segment (2055 tokens) → chunking (max 1200)...
Long segment (2048 tokens) → chunking (max 1200)...
Long segment (1899 tokens) → chunking (max 1200)...
Long segment (1558 tokens) → chunking (max 1200)...
Long segment (1776 tokens) → chunking (max 1200)...
Long segment (1278 tokens) → chunking (max 1200)...
Long segment (2203 tokens) → chunking (max 1200)...
Long segment (1454 tokens) → chunking (max 1200)...
Long segment (1459 tokens) → chunking (max 1200)...
Long segment (1333 tokens) → chunking (max 1200)...
Long segment (1224 tokens) → chunking (max 1200)...
Long segment (1365 tokens) → chunking (max 1200)...
Long segment (1438 tokens) → chunking (max 1200)...
Long segment (1649 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  67%|██████▋   | 335/500 [02:16<03:31,  1.28s/doc]

Long segment (1371 tokens) → chunking (max 1200)...
Long segment (1379 tokens) → chunking (max 1200)...
Long segment (1257 tokens) → chunking (max 1200)...
Long segment (1622 tokens) → chunking (max 1200)...
Long segment (1784 tokens) → chunking (max 1200)...
Long segment (1384 tokens) → chunking (max 1200)...
Long segment (1334 tokens) → chunking (max 1200)...
Long segment (1287 tokens) → chunking (max 1200)...
Long segment (1249 tokens) → chunking (max 1200)...
Long segment (1484 tokens) → chunking (max 1200)...
Long segment (1645 tokens) → chunking (max 1200)...
Long segment (1244 tokens) → chunking (max 1200)...
Long segment (1602 tokens) → chunking (max 1200)...
Long segment (1243 tokens) → chunking (max 1200)...
Long segment (1476 tokens) → chunking (max 1200)...
Long segment (1270 tokens) → chunking (max 1200)...
Long segment (1615 tokens) → chunking (max 1200)...
Long segment (1299 tokens) → chunking (max 1200)...
Long segment (1332 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  68%|██████▊   | 338/500 [02:18<02:13,  1.21doc/s]

Long segment (1235 tokens) → chunking (max 1200)...
Long segment (1391 tokens) → chunking (max 1200)...
Long segment (1405 tokens) → chunking (max 1200)...
Long segment (1737 tokens) → chunking (max 1200)...
Long segment (1632 tokens) → chunking (max 1200)...
Long segment (1412 tokens) → chunking (max 1200)...
Long segment (1720 tokens) → chunking (max 1200)...
Long segment (1405 tokens) → chunking (max 1200)...
Long segment (1320 tokens) → chunking (max 1200)...
Long segment (2694 tokens) → chunking (max 1200)...
Long segment (1355 tokens) → chunking (max 1200)...
Long segment (1853 tokens) → chunking (max 1200)...
Long segment (1712 tokens) → chunking (max 1200)...
Long segment (1615 tokens) → chunking (max 1200)...
Long segment (1439 tokens) → chunking (max 1200)...
Long segment (1947 tokens) → chunking (max 1200)...
Long segment (1532 tokens) → chunking (max 1200)...
Long segment (1222 tokens) → chunking (max 1200)...
Long segment (1635 tokens) → chunking (max 1200)...
Long segment

Processing new documents:  91%|█████████ | 455/500 [02:19<00:00, 81.78doc/s]

Long segment (1642 tokens) → chunking (max 1200)...
Long segment (2071 tokens) → chunking (max 1200)...

Long segment (1296 tokens) → chunking (max 1200)...
Long segment (1551 tokens) → chunking (max 1200)...
Long segment (1533 tokens) → chunking (max 1200)...
Long segment (1241 tokens) → chunking (max 1200)...


Processing new documents:  99%|█████████▉| 496/500 [02:19<00:00, 96.42doc/s]

Long segment (1470 tokens) → chunking (max 1200)...
Long segment (1398 tokens) → chunking (max 1200)...
Long segment (1441 tokens) → chunking (max 1200)...
Long segment (1534 tokens) → chunking (max 1200)...
Long segment (1396 tokens) → chunking (max 1200)...


Processing new documents: 100%|██████████| 500/500 [02:20<00:00,  3.57doc/s]

Long segment (1298 tokens) → chunking (max 1200)...

Starting safe batch insert of 195055 new segments (batch size: 500)...

Batch 1: Inserted 500, Modified 0
Batch 2: Inserted 498, Modified 0
Batch 3: Inserted 500, Modified 0
Batch 4: Inserted 500, Modified 0
Batch 5: Inserted 497, Modified 0
Batch 6: Inserted 499, Modified 0
Batch 7: Inserted 499, Modified 0
Batch 8: Inserted 499, Modified 0
Batch 9: Inserted 500, Modified 0
Batch 10: Inserted 498, Modified 0
Batch 11: Inserted 500, Modified 0
Batch 12: Inserted 498, Modified 0
Batch 13: Inserted 500, Modified 0
Batch 14: Inserted 499, Modified 0
Batch 15: Inserted 499, Modified 0
Batch 16: Inserted 497, Modified 0
Batch 17: Inserted 499, Modified 0
Batch 18: Inserted 500, Modified 0
Batch 19: Inserted 485, Modified 0
Batch 20: Inserted 481, Modified 0
Batch 21: Inserted 499, Modified 0
Batch 22: Inserted 498, Modified 0
Batch 23: Inserted 500, Modified 0
Batch 24: Inserted 500, Modified 0
Batch 25: Inserted 497, Modified 0
Batch 26: Inserted 499, Modified 0
Batch 27: Inserted 500, Modified 0
Batch 28: Inserted 498, Modified 0
Batch 29: Inserted 498, Modif