# 01: Data Processing Pipeline

**Purpose:** Process raw JSONL data into clean DataFrames with embeddings.

**Run this notebook ONCE** to create processed data files. After that, use `02_EDA.ipynb` for analysis and experiment notebooks for modeling.

---

## What This Notebook Does

1. **Loads raw JSONL** files (GOLD and SILVER)
2. **Adds document IDs** automatically (one per line)
3. **Processes with NLP pipeline:**
   - spaCy-UDPipe for POS tagging
   - RobeCzech (`ufal/robeczech-base`, a RoBERTa model) for embeddings
4. **Computes embeddings INDEPENDENTLY** per sentence (prevents data leakage)
5. **Saves to DataFrame pickles** with full metadata
6. **Creates integrity checksums** (SHA256)

## Output Files

After running this notebook, you'll have:
```
data/processed/
├── gold_tokens.pkl          # Token-level data
├── gold_tokens.pkl.sha256   # Integrity check
├── gold_sentences.pkl       # Sentence-level data
├── gold_sentences.pkl.sha256
├── silver_tokens.pkl
├── silver_tokens.pkl.sha256
├── silver_sentences.pkl
└── silver_sentences.pkl.sha256
```
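
Each `.sha256` file is an integrity sidecar for the pickle next to it. The pipeline's own checksum code lives in `load_preprocess_data` and is not shown in this notebook, so the following is only a minimal sketch of how such a sidecar could be written and checked with the standard library. The helper names `sha256_of`, `write_checksum`, and `verify_checksum` are hypothetical, and the sidecar layout (a bare hex digest in `<file>.sha256`) is an assumption.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large pickle is never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(path: Path) -> Path:
    """Write '<file>.sha256' next to the file (assumed sidecar layout)."""
    sidecar = path.with_name(path.name + ".sha256")
    sidecar.write_text(sha256_of(path), encoding="utf-8")
    return sidecar


def verify_checksum(path: Path) -> bool:
    """Return True if the file still matches its recorded checksum."""
    sidecar = path.with_name(path.name + ".sha256")
    return sidecar.read_text(encoding="utf-8").strip() == sha256_of(path)


# Demo on a throwaway file, not on the real processed data.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_tokens.pkl"
    demo.write_bytes(b"example payload")
    write_checksum(demo)
    ok_before = verify_checksum(demo)   # file unchanged since the checksum was written
    demo.write_bytes(b"tampered payload")
    ok_after = verify_checksum(demo)    # file content no longer matches the sidecar

print(ok_before, ok_after)
```

A check like this catches truncated or silently modified files before they are unpickled.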

---

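⚠️ **WARNING:** Processing takes time (estimated ~5-10 minutes for GOLD and ~30-60 minutes for SILVER, depending on hardware). In the CPU run recorded in this notebook, GOLD finished in about 1 minute and SILVER in about 5 minutes.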

## 1. Setup & Imports

In [1]:
import sys
import os
from pathlib import Path

# Jupyter magic
%load_ext autoreload
%autoreload 2

# Add src to path
current_dir = os.getcwd()
src_dir = os.path.abspath(os.path.join(current_dir, '..', 'src'))
if src_dir not in sys.path:
    sys.path.append(src_dir)

# Import modules
import config
from load_preprocess_data import run_full_pipeline

print("✅ Setup complete")
print(f"   Device: {config.DEVICE}")
print(f"   Model: {config.MODEL_NAME}")
print(f"   Output: {config.PROCESSED_DIR}")

⚙️ Configuration loaded. Device: cpu
✅ Setup complete
   Device: cpu
   Model: ufal/robeczech-base
   Output: C:\Users\dobes\Documents\UniversityCodingProject_10-02-26\ThesisCoding\data\processed


## 2. Validate Configuration

In [2]:
# Check if raw data files exist
print("📋 Checking raw data files...\n")

if config.PATH_GOLD_RAW.exists():
    print(f"✅ Gold raw data found: {config.PATH_GOLD_RAW}")

    # Count lines
    with open(config.PATH_GOLD_RAW, 'r', encoding='utf-8') as f:
        gold_lines = sum(1 for _ in f)
    print(f"   → {gold_lines} entries detected")
else:
    print(f"❌ Gold raw data NOT found: {config.PATH_GOLD_RAW}")
    print("   Please add GOLD_data_raw.jsonl to data/raw/")

if config.PATH_SILVER_RAW.exists():
    print(f"\n✅ Silver raw data found: {config.PATH_SILVER_RAW}")

    # Count lines
    with open(config.PATH_SILVER_RAW, 'r', encoding='utf-8') as f:
        silver_lines = sum(1 for _ in f)
    print(f"   → {silver_lines} entries detected")
else:
    print(f"\n⚠️  Silver raw data NOT found: {config.PATH_SILVER_RAW}")
    print("   (Silver is optional - you can skip it)")

# Validate configuration
print("\n🔍 Validating configuration...")
try:
    config.validate_config()
except Exception as e:
    print(f"⚠️  Validation warning: {e}")
    print("   Continuing anyway...")

📋 Checking raw data files...

✅ Gold raw data found: C:\Users\dobes\Documents\UniversityCodingProject_10-02-26\ThesisCoding\data\raw\GOLD_data_raw.jsonl
   → 521 entries detected

✅ Silver raw data found: C:\Users\dobes\Documents\UniversityCodingProject_10-02-26\ThesisCoding\data\raw\SILVER_data_raw.jsonl
   → 1903 entries detected

🔍 Validating configuration...
✅ Configuration validated successfully


## 3. Preview Raw Data

Let's check what the raw data looks like.

In [3]:
import json

print("📄 GOLD Raw Data Preview (first 3 entries):\n")

if config.PATH_GOLD_RAW.exists():
    with open(config.PATH_GOLD_RAW, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i >= 3:
                break
            entry = json.loads(line)
            print(f"Entry {i+1}:")
            print(f"  context_prev: {entry.get('context_prev', 'N/A')[:60]}...")
            print(f"  target_sentence: {entry.get('target_sentence', 'N/A')[:60]}...")
            print(f"  label: {entry.get('label')}")
            print(f"  source: {entry.get('source')}")
            print(f"  target_token: {entry.get('target_token')}")
            
            # Check if document_id exists
            if 'document_id' in entry:
                print(f"  ✅ Has document_id: {entry['document_id']}")
            else:
                print(f"  ℹ️  No document_id (will be auto-generated)")
            print()
else:
    print("⚠️  GOLD file not found")

📄 GOLD Raw Data Preview (first 3 entries):

Entry 1:
  context_prev: Zásoby zemního plynu v evropských skladech dosáhly rekordní ...
  target_sentence: Současná úroveň spotřeby energie je alarmující....
  label: 1
  source: LLM
  target_token: alarmující
  ℹ️  No document_id (will be auto-generated)

Entry 2:
  context_prev: Průměrná mzda v České republice ve třetím čtvrtletí meziročn...
  target_sentence: Rozdíl mezi platy mužů a žen zůstává i na dále neuvěřitelný....
  label: 1
  source: Author
  target_token: neuvěřitelný
  ℹ️  No document_id (will be auto-generated)

Entry 3:
  context_prev: V posledních měsících došlo k výraznému poklesu stavební pro...
  target_sentence: Tempo výstavby nových bytů je žalostné....
  label: 1
  source: LLM
  target_token: žalostné
  ℹ️  No document_id (will be auto-generated)
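
The preview only prints a few fields for the first three entries. A more systematic structural check could look like the sketch below. It assumes that every entry should carry the five fields printed in the preview and that a non-empty `target_token` should occur verbatim in `target_sentence`; neither rule is confirmed by the pipeline code, and `check_entry` and `check_jsonl` are hypothetical helpers. Blank lines are skipped explicitly, since a plain line count also counts them.

```python
import json

# Assumed schema: the five fields printed in the preview cell.
REQUIRED_FIELDS = ("context_prev", "target_sentence", "label", "source", "target_token")


def check_entry(entry: dict) -> list[str]:
    """Return a list of human-readable problems for one raw entry."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    token = entry.get("target_token")
    sentence = entry.get("target_sentence") or ""
    if token and token not in sentence:
        problems.append(f"target_token {token!r} not found in target_sentence")
    return problems


def check_jsonl(lines) -> dict[int, list[str]]:
    """Map 1-based line numbers to problems; blank lines are ignored."""
    report = {}
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            report[lineno] = [f"invalid JSON: {exc.msg}"]
            continue
        problems = check_entry(entry)
        if problems:
            report[lineno] = problems
    return report


# Made-up sample lines: one valid entry, one blank line, one incomplete entry, one broken line.
sample = [
    '{"context_prev": "a", "target_sentence": "Tempo je žalostné.", "label": 1, "source": "LLM", "target_token": "žalostné"}',
    "",
    '{"target_sentence": "Tempo je pomalé.", "label": 1, "target_token": "žalostné"}',
    "{not json",
]
report = check_jsonl(sample)
for lineno, problems in report.items():
    print(lineno, problems)
```

On the real files, the same function could be fed an open file handle, for example `check_jsonl(open(config.PATH_GOLD_RAW, encoding='utf-8'))`.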



## 4. Process GOLD Dataset

**GOLD Dataset:** High-quality, manually annotated data (~520 documents).

This will:
1. Load GOLD_data_raw.jsonl
2. **Auto-generate document IDs** (gold_doc_0001, gold_doc_0002, ...)
3. Process each sentence INDEPENDENTLY (no cross-sentence context)
4. Extract token-level and sentence-level embeddings
5. Save to pickles with metadata
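
The ID format that appears in the processed output (`gold_doc_0001`, `gold_doc_0001_target`, `gold_doc_0001_ctx_prev_tok_2`) can be reproduced with plain string formatting. The sketch below is inferred from those printed IDs and is not the pipeline's actual code; the three function names are hypothetical.

```python
def make_document_id(dataset: str, line_number: int) -> str:
    """1-based line number -> 'gold_doc_0001' style ID, zero-padded to 4 digits."""
    return f"{dataset}_doc_{line_number:04d}"


def make_sentence_id(document_id: str, role: str) -> str:
    """role is 'target', 'ctx_prev' or 'ctx_next' (suffixes seen in the notebook output)."""
    return f"{document_id}_{role}"


def make_token_id(sentence_id: str, position: int) -> str:
    """position is the 0-based token index within the sentence."""
    return f"{sentence_id}_tok_{position}"


doc_id = make_document_id("gold", 1)
sent_id = make_sentence_id(doc_id, "ctx_prev")
tok_id = make_token_id(sent_id, 2)
print(doc_id, sent_id, tok_id)
# gold_doc_0001 gold_doc_0001_ctx_prev gold_doc_0001_ctx_prev_tok_2
```

Because every sentence and token ID embeds its document ID, rows can be traced back to their source line without a separate lookup table.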

### Expected Progress:
- Loading models: ~30 seconds
- Processing sentences: ~5-10 minutes (with progress bar)
- Saving files: ~10 seconds

In [4]:
print("🚀 Processing GOLD dataset...\n")
print("This may take 5-10 minutes depending on your hardware.")
print("Progress bars will show detailed status.\n")
print("="*60)

token_df_gold, sentence_df_gold = run_full_pipeline('gold')

print("="*60)
print("\n✅ GOLD dataset processing complete!")
print(f"   Tokens: {len(token_df_gold):,} rows")
print(f"   Sentences: {len(sentence_df_gold):,} rows")
print(f"   Documents: {token_df_gold['document_id'].nunique():,} unique")
print(f"   LJMPNIK words: {(token_df_gold['is_target'] == True).sum():,}")

2026-02-22 19:27:40,060 - INFO - 🚀 Starting full pipeline for GOLD dataset
2026-02-22 19:27:40,061 - INFO - Loading spaCy-UDPipe model ('cs-pdt')...
2026-02-22 19:27:40,062 - INFO - Downloading spaCy-UDPipe model...


🚀 Processing GOLD dataset...

This may take 5-10 minutes depending on your hardware.
Progress bars will show detailed status.

Downloaded pre-trained UDPipe model for 'cs-pdt' language


2026-02-22 19:27:53,315 - INFO - Loading RobeCzech model ('ufal/robeczech-base')...
Some weights of RobertaModel were not initialized from the model checkpoint at ufal/robeczech-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2026-02-22 19:27:54,673 - INFO - ✅ Models loaded successfully
2026-02-22 19:27:54,678 - INFO - Loaded 520 entries from C:\Users\dobes\Documents\UniversityCodingProject_10-02-26\ThesisCoding\data\raw\GOLD_data_raw.jsonl
Processing gold data: 100%|██████████| 520/520 [00:49<00:00, 10.51it/s]
2026-02-22 19:28:44,194 - INFO - ✅ Processed gold:
2026-02-22 19:28:44,195 - INFO -    - Token-level: 17557 rows
2026-02-22 19:28:44,196 - INFO -    - Sentence-level: 1560 rows
2026-02-22 19:28:46,975 - INFO - 💾 Saved processed data:



✅ GOLD dataset processing complete!
   Tokens: 17,557 rows
   Sentences: 1,560 rows
   Documents: 520 unique
   LJMPNIK words: 305


### 4.1 Preview Processed GOLD Data

In [5]:
print("📊 GOLD Token-Level Data Preview:\n")
display(token_df_gold.head(10))

print("\n📋 Column Information:")
for col in token_df_gold.columns:
    dtype = token_df_gold[col].dtype
    if col == 'embedding':
        shape = token_df_gold[col].iloc[0].shape
        print(f"  - {col}: array{shape}")
    else:
        print(f"  - {col}: {dtype}")

📊 GOLD Token-Level Data Preview:

| | document_id | sentence_id | token_id | position | form | lemma | pos | embedding | is_target | label | token_label | is_context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gold_doc_0001 | gold_doc_0001_target | gold_doc_0001_target_tok_0 | 0 | Současná | současný | ADJ | [0.05658283, -0.16128066, -0.083715945, 0.0036... | False | 1 | 0 | False |
| 1 | gold_doc_0001 | gold_doc_0001_target | gold_doc_0001_target_tok_1 | 1 | úroveň | úroveň | NOUN | [0.32223943, -0.046616577, 0.17479904, -0.0836... | False | 1 | 0 | False |
| 2 | gold_doc_0001 | gold_doc_0001_target | gold_doc_0001_target_tok_2 | 2 | spotřeby | spotřeba | NOUN | [0.2664389, -0.040305506, 0.043529328, -0.0057... | False | 1 | 0 | False |
| 3 | gold_doc_0001 | gold_doc_0001_target | gold_doc_0001_target_tok_3 | 3 | energie | energie | NOUN | [-0.14703438, 0.017672857, 0.04595686, 0.16327... | False | 1 | 0 | False |
| 4 | gold_doc_0001 | gold_doc_0001_target | gold_doc_0001_target_tok_4 | 4 | je | být | AUX | [-0.49623346, -0.17490284, -0.014973694, 0.339... | False | 1 | 0 | False |
| 5 | gold_doc_0001 | gold_doc_0001_target | gold_doc_0001_target_tok_5 | 5 | alarmující | alarmující | ADJ | [-0.085729636, -0.032963824, -0.12887268, 0.10... | True | 1 | 1 | False |
| 6 | gold_doc_0001 | gold_doc_0001_target | gold_doc_0001_target_tok_6 | 6 | . | . | PUNCT | [0.25777772, -0.08328905, 0.115178466, 0.11127... | False | 1 | 0 | False |
| 7 | gold_doc_0001 | gold_doc_0001_ctx_prev | gold_doc_0001_ctx_prev_tok_0 | 0 | Zásoby | zásoba | NOUN | [0.19291194, 0.13453041, 0.05749755, -0.134289... | False | 0 | 0 | True |
| 8 | gold_doc_0001 | gold_doc_0001_ctx_prev | gold_doc_0001_ctx_prev_tok_1 | 1 | zemního | zemní | ADJ | [0.09201947, 0.11344469, -0.0370981, 0.0397366... | False | 0 | 0 | True |
| 9 | gold_doc_0001 | gold_doc_0001_ctx_prev | gold_doc_0001_ctx_prev_tok_2 | 2 | plynu | plyn | NOUN | [-0.017090743, -0.04328588, 0.08480315, 0.1402... | False | 0 | 0 | True |

📋 Column Information:
  - document_id: object
  - sentence_id: object
  - token_id: object
  - position: int64
  - form: object
  - lemma: object
  - pos: object
  - embedding: array(768,)
  - is_target: object
  - label: int64
  - token_label: int64
  - is_context: bool


In [6]:
print("📊 GOLD Sentence-Level Data Preview:\n")

# Display without embedding columns (too long)
display_cols = [col for col in sentence_df_gold.columns if 'embedding' not in col]
display(sentence_df_gold[display_cols].head(10))

print("\n📋 Embedding Information:")
print(f"  - CLS embedding shape: {sentence_df_gold['cls_embedding'].iloc[0].shape}")
print(f"  - Mean embedding shape: {sentence_df_gold['mean_embedding'].iloc[0].shape}")

📊 GOLD Sentence-Level Data Preview:

| | document_id | sentence_id | text | num_tokens | label | is_context | context_type |
|---|---|---|---|---|---|---|---|
| 0 | gold_doc_0001 | gold_doc_0001_target | Současná úroveň spotřeby energie je alarmující . | 7 | 1 | False | |
| 1 | gold_doc_0001 | gold_doc_0001_ctx_prev | Zásoby zemního plynu v evropských skladech dos... | 10 | 0 | True | prev |
| 2 | gold_doc_0001 | gold_doc_0001_ctx_next | Energetické společnosti oznámily stabilní dodá... | 10 | 0 | True | next |
| 3 | gold_doc_0002 | gold_doc_0002_target | Rozdíl mezi platy mužů a žen zůstává i na dále... | 12 | 1 | False | |
| 4 | gold_doc_0002 | gold_doc_0002_ctx_prev | Průměrná mzda v České republice ve třetím čtvr... | 16 | 0 | True | prev |
| 5 | gold_doc_0002 | gold_doc_0002_ctx_next | Statistici očekávají , že růst mezd zpomalí v ... | 11 | 0 | True | next |
| 6 | gold_doc_0003 | gold_doc_0003_target | Tempo výstavby nových bytů je žalostné . | 7 | 1 | False | |
| 7 | gold_doc_0003 | gold_doc_0003_ctx_prev | V posledních měsících došlo k výraznému pokles... | 10 | 0 | True | prev |
| 8 | gold_doc_0003 | gold_doc_0003_ctx_next | Developerské společnosti plánují navýšit inves... | 9 | 0 | True | next |
| 9 | gold_doc_0004 | gold_doc_0004_target | Kvalita poskytovaných služeb je však tristní . | 7 | 1 | False | |

📋 Embedding Information:
  - CLS embedding shape: (768,)
  - Mean embedding shape: (768,)


### 4.2 Quick Statistics Check

In [7]:
import pandas as pd

print("📈 GOLD Dataset Quick Stats:\n")

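# Filter to target sentences only. If 'is_context' is missing, treat every row as a target
# (a scalar default would make `~False` evaluate to -1 and break the boolean indexing).
is_context = sentence_df_gold.get('is_context', pd.Series(False, index=sentence_df_gold.index))
target_sentences = sentence_df_gold[~is_context.astype(bool)]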

stats = pd.DataFrame({
    'Metric': [
        'Total Documents',
        'Total Sentences',
        '  - Target Sentences',
        '  - Context Sentences',
        'Total Tokens',
        'LJMPNIK Tokens',
        '',
        'Target Sentences:',
        '  - Neutral (L0)',
        '  - LJMPNIK (L1)',
        '  - L0:L1 Ratio'
    ],
    'Count': [
        token_df_gold['document_id'].nunique(),
        len(sentence_df_gold),
        len(target_sentences),
        len(sentence_df_gold) - len(target_sentences),
        len(token_df_gold),
        (token_df_gold['is_target'] == True).sum(),
        '',
        '',
        (target_sentences['label'] == 0).sum(),
        (target_sentences['label'] == 1).sum(),
        f"{(target_sentences['label'] == 0).sum() / (target_sentences['label'] == 1).sum():.2f}:1"
    ]
})

display(stats)

# Check if everything looks reasonable
n_docs = token_df_gold['document_id'].nunique()
n_ljmpnik = (token_df_gold['is_target'] == True).sum()

if n_docs > 0 and n_ljmpnik > 0:
    print("\n✅ Data looks good!")
    print(f"   Average {n_ljmpnik / n_docs:.1f} LJMPNIK words per document")
else:
    print("\n⚠️  Something might be wrong - check the data")

📈 GOLD Dataset Quick Stats:

| | Metric | Count |
|---|---|---|
| 0 | Total Documents | 520 |
| 1 | Total Sentences | 1560 |
| 2 | - Target Sentences | 520 |
| 3 | - Context Sentences | 1040 |
| 4 | Total Tokens | 17557 |
| 5 | LJMPNIK Tokens | 305 |
| 6 | | |
| 7 | Target Sentences: | |
| 8 | - Neutral (L0) | 188 |
| 9 | - LJMPNIK (L1) | 332 |

✅ Data looks good!
   Average 0.6 LJMPNIK words per document


## 5. Process SILVER Dataset (Optional)

**SILVER Dataset:** Larger, automatically generated data (~1900 documents).

⚠️ **This takes MUCH longer** (~30-60 minutes).

You can skip this if:
- You only want to experiment with GOLD data
- You don't have SILVER data yet
- You want to test the pipeline first

In [8]:
# Set this to True to process SILVER data
PROCESS_SILVER = True  # ⚠️ Set to False to skip SILVER processing

if PROCESS_SILVER:
    if config.PATH_SILVER_RAW.exists():
        print("🚀 Processing SILVER dataset...\n")
        print("⏰ This will take 30-60 minutes!\n")
        print("="*60)
        
        token_df_silver, sentence_df_silver = run_full_pipeline('silver')
        
        print("="*60)
        print("\n✅ SILVER dataset processing complete!")
        print(f"   Tokens: {len(token_df_silver):,} rows")
        print(f"   Sentences: {len(sentence_df_silver):,} rows")
        print(f"   Documents: {token_df_silver['document_id'].nunique():,} unique")
        print(f"   LJMPNIK words: {(token_df_silver['is_target'] == True).sum():,}")
    else:
        print("⚠️  SILVER raw data not found. Skipping.")
else:
    print("⏭️  Skipping SILVER processing")
    print("   (Set PROCESS_SILVER=True above to enable)")

2026-02-22 19:28:47,320 - INFO - 🚀 Starting full pipeline for SILVER dataset
2026-02-22 19:28:47,322 - INFO - ✅ Models loaded successfully


🚀 Processing SILVER dataset...

⏰ This will take 30-60 minutes!



2026-02-22 19:28:47,338 - INFO - Loaded 1903 entries from C:\Users\dobes\Documents\UniversityCodingProject_10-02-26\ThesisCoding\data\raw\SILVER_data_raw.jsonl
Processing silver data: 100%|██████████| 1903/1903 [04:36<00:00,  6.88it/s]
2026-02-22 19:33:23,977 - INFO - ✅ Processed silver:
2026-02-22 19:33:23,978 - INFO -    - Token-level: 78991 rows
2026-02-22 19:33:23,979 - INFO -    - Sentence-level: 5709 rows
2026-02-22 19:33:43,231 - INFO - 💾 Saved processed data:



✅ SILVER dataset processing complete!
   Tokens: 78,991 rows
   Sentences: 5,709 rows
   Documents: 1,903 unique
   LJMPNIK words: 918


## 6. Verify Processed Data

Load the saved files to verify integrity.

In [9]:
from load_preprocess_data import load_processed_data

print("🔍 Verifying processed files...\n")

# Verify GOLD
try:
    gold_tokens_verify = load_processed_data('gold', level='token', verify_integrity=True)
    gold_sentences_verify = load_processed_data('gold', level='sentence', verify_integrity=True)
    print(f"✅ GOLD data verified (SHA256 checksums match):")
    print(f"   Tokens: {len(gold_tokens_verify):,} rows")
    print(f"   Sentences: {len(gold_sentences_verify):,} rows")
except Exception as e:
    print(f"❌ GOLD verification failed: {e}")

# Verify SILVER (if processed)
if PROCESS_SILVER:
    try:
        silver_tokens_verify = load_processed_data('silver', level='token', verify_integrity=True)
        silver_sentences_verify = load_processed_data('silver', level='sentence', verify_integrity=True)
        print(f"\n✅ SILVER data verified (SHA256 checksums match):")
        print(f"   Tokens: {len(silver_tokens_verify):,} rows")
        print(f"   Sentences: {len(silver_sentences_verify):,} rows")
    except Exception as e:
        print(f"❌ SILVER verification failed: {e}")

🔍 Verifying processed files...



2026-02-22 19:33:44,463 - INFO - ✅ Loaded 17557 rows from C:\Users\dobes\Documents\UniversityCodingProject_10-02-26\ThesisCoding\data\processed\gold_tokens.pkl
2026-02-22 19:33:44,538 - INFO - ✅ Loaded 1560 rows from C:\Users\dobes\Documents\UniversityCodingProject_10-02-26\ThesisCoding\data\processed\gold_sentences.pkl


✅ GOLD data verified (SHA256 checksums match):
   Tokens: 17,557 rows
   Sentences: 1,560 rows


2026-02-22 19:33:47,429 - INFO - ✅ Loaded 78991 rows from C:\Users\dobes\Documents\UniversityCodingProject_10-02-26\ThesisCoding\data\processed\silver_tokens.pkl
2026-02-22 19:33:47,683 - INFO - ✅ Loaded 5709 rows from C:\Users\dobes\Documents\UniversityCodingProject_10-02-26\ThesisCoding\data\processed\silver_sentences.pkl



✅ SILVER data verified (SHA256 checksums match):
   Tokens: 78,991 rows
   Sentences: 5,709 rows


## 7. Summary

### What You've Created

✅ **Processed data files** in `data/processed/` directory  
✅ **Integrity checksums** for all files (SHA256)  
✅ **Document IDs** auto-generated and tracked  
✅ **Independent embeddings** (no data leakage!)  
✅ **Full metadata** for qualitative analysis  

### File Structure

**Token-level DataFrame columns:**
- `document_id` - Groups sentences from same document
- `sentence_id` - Unique sentence identifier
- `token_id` - Unique token identifier
- `position` - Position in sentence
- `form` - The actual word
- `lemma` - Base form
- `pos` - Part of speech tag
- `embedding` - 768-dim RobeCzech embedding vector
- `is_target` - True if this is the LJMPNIK word
- `label` - Sentence-level label (0=neutral, 1=contains LJMPNIK)
- `token_label` - Token-level label (0=neutral, 1=LJMPNIK)
- `is_context` - True for tokens that belong to a context sentence

**Sentence-level DataFrame columns:**
- `document_id` - Document identifier
- `sentence_id` - Unique sentence identifier
- `text` - Original sentence text
- `num_tokens` - Token count
- `cls_embedding` - Embedding of the sentence-start token, the RoBERTa counterpart of BERT's [CLS] (768-dim)
- `mean_embedding` - Mean of token embeddings (768-dim)
- `label` - 0 (neutral) or 1 (contains LJMPNIK)
- `is_context` - True for context sentences
- `context_type` - 'prev', 'next', or None
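
The two sentence-level vectors differ only in how the per-token vectors of one sentence are pooled. The sketch below illustrates that difference on random stand-in data with NumPy; it does not call the RobeCzech model. It assumes the encoder output for a single sentence is a `(num_tokens, 768)` matrix whose first row belongs to the sentence-start token, and it averages over all rows, although this notebook does not state whether the pipeline excludes special tokens from the mean. `pool_sentence` is a hypothetical helper.

```python
import numpy as np


def pool_sentence(hidden_states: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Pool one sentence's (num_tokens, dim) matrix into two (dim,) vectors."""
    cls_vector = hidden_states[0]              # vector at the first position
    mean_vector = hidden_states.mean(axis=0)   # average over all token positions
    return cls_vector, mean_vector


rng = np.random.default_rng(seed=0)
# Two sentences of different length, each pooled on its own (random stand-in data).
sentences = [rng.normal(size=(7, 768)), rng.normal(size=(10, 768))]

pooled = [pool_sentence(h) for h in sentences]
for cls_vector, mean_vector in pooled:
    print(cls_vector.shape, mean_vector.shape)
```

Since each sentence is pooled from its own matrix only, neither vector can carry information from neighbouring sentences.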

### Next Steps

1. **Run `02_EDA.ipynb`** to explore the data
2. **Run experiment notebooks** (M1, M2) for modeling
3. **Don't re-run this notebook** unless:
   - Raw data changes
   - You need to regenerate embeddings
   - You suspect data corruption
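
Because `document_id` ties each target sentence to its context sentences, train/test splits in later notebooks are safest when made per document rather than per row. The sketch below shows one way to do that with the standard library only. `split_by_document` is a hypothetical helper, and the experiment notebooks may well use a different strategy.

```python
import random


def split_by_document(document_ids, test_fraction=0.2, seed=42):
    """Assign whole documents to train or test so no document spans both."""
    unique_ids = sorted(set(document_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_test = max(1, round(len(unique_ids) * test_fraction))
    test_ids = set(unique_ids[:n_test])
    train_ids = set(unique_ids[n_test:])
    return train_ids, test_ids


# One document_id per sentence row: each toy document has a target and two context sentences.
row_document_ids = [f"gold_doc_{i:04d}" for i in range(1, 11) for _ in range(3)]

train_ids, test_ids = split_by_document(row_document_ids)
train_rows = [i for i, d in enumerate(row_document_ids) if d in train_ids]
test_rows = [i for i, d in enumerate(row_document_ids) if d in test_ids]
print(len(train_ids), len(test_ids), len(train_rows), len(test_rows))
# 8 2 24 6
```

With the processed DataFrames, the same ID sets can be applied through `df[df['document_id'].isin(train_ids)]`.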

### File Locations

```
data/
├── raw/
│   ├── GOLD_data_raw.jsonl           # Your input
│   └── SILVER_data_raw.jsonl
└── processed/
    ├── gold_tokens.pkl               # Generated by this notebook ✅
    ├── gold_tokens.pkl.sha256        # Integrity check ✅
    ├── gold_sentences.pkl            # Generated by this notebook ✅
    ├── gold_sentences.pkl.sha256     # Integrity check ✅
    ├── silver_tokens.pkl             # (if PROCESS_SILVER=True)
    ├── silver_tokens.pkl.sha256
    ├── silver_sentences.pkl
    └── silver_sentences.pkl.sha256
```

---

**Processing complete! 🎉**

**Time to run EDA:** Open `02_EDA.ipynb` next!