# Test Unsupervised Data Pipeline with Real Data

This notebook tests the complete pipeline using **real data downloads**:
- **Wikipedia**: article summaries fetched from the Wikipedia REST API
- **CC-News**: streamed from the `cc_news` dataset on Hugging Face
- **BookCorpus**: streamed from the `lucadiliello/bookcorpusopen` dataset on Hugging Face

**Requirements**: `pip install datasets`


## Setup and Imports


In [None]:
import sys
from pathlib import Path
import json
import tempfile
import shutil
import urllib.request

# Add src to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / "src"))

# Import our parsers
from spellchecker.data.parsers.unsupervised_parser import (
    WikipediaParser,
    CCNewsParser,
    BookCorpusParser,
    UniversalTextCleaner,
)

# Create cleaner
cleaner = UniversalTextCleaner(min_length=10, max_length=500)

# Create temp directory
temp_dir = Path(tempfile.mkdtemp())

print("Setup complete!")
print(f"   Project root: {project_root}")
print(f"   Temp directory: {temp_dir}")


✅ Setup complete!
   Project root: /Users/stepan/Documents/RLML/SpellChecker
   Temp directory: /var/folders/fx/th2v9glj5tz0jj1y5s8_5fjh0000gn/T/tmpabhghrte


## 1. Download Real Wikipedia Data

Download real article summaries through the Wikipedia REST API.


In [None]:
print("Downloading real Wikipedia articles...\n")

# Create Wikipedia directory structure
wiki_dir = temp_dir / "wikipedia" / "extracted" / "AA"
wiki_dir.mkdir(parents=True, exist_ok=True)
wiki_file = wiki_dir / "wiki_00"

# Articles to download
articles = [
    "Python_(programming_language)",
    "Machine_learning",
    "Natural_language_processing",
    "Artificial_intelligence",
    "Data_science"
]

wiki_texts = []

for article_title in articles:
    try:
        url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{article_title}"
        
        # Create request with proper headers (Wikipedia requires User-Agent)
        req = urllib.request.Request(
            url,
            headers={
                'User-Agent': 'SpellCheckerBot/1.0 (Educational Project)',
                'Accept': 'application/json'
            }
        )
        
        with urllib.request.urlopen(req, timeout=10) as response:
            data = json.loads(response.read())
            
            title = data.get("title", article_title.replace("_", " "))
            extract = data.get("extract", "")
            
            # Format as WikiExtractor-like output
            wiki_text = f'<doc id="{len(wiki_texts)+1}" title="{title}">\n{extract}\n</doc>'
            wiki_texts.append(wiki_text)
            
            print(f"  ✓ Downloaded: {title} ({len(extract)} chars)")
    
    except Exception as e:
        print(f"  ✗ Failed to download {article_title}: {e}")

# If no articles downloaded, use sample data
if len(wiki_texts) == 0:
    print("\n⚠️  No articles downloaded from Wikipedia API")
    print("   Using sample data instead...\n")
    
    wiki_texts = [
        """<doc id="1" title="Python">
Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured, object-oriented and functional programming.
</doc>""",
        """<doc id="2" title="Machine Learning">
Machine learning is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Recently, artificial neural networks have been able to surpass many previous approaches in performance.
</doc>""",
        """<doc id="3" title="Natural Language Processing">
Natural language processing is an interdisciplinary subfield of computer science and artificial intelligence. It is primarily concerned with providing computers the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and computational linguistics.
</doc>""",
        """<doc id="4" title="Artificial Intelligence">
Artificial intelligence is the intelligence of machines or software, as opposed to the intelligence of humans or animals. It is a field of study in computer science that develops and studies intelligent machines. Such machines may be called AIs.
</doc>""",
        """<doc id="5" title="Data Science">
Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured, and unstructured data.
</doc>"""
    ]
    
    for i, text in enumerate(wiki_texts, 1):
        # Extract title for display
        title_start = text.find('title="') + 7
        title_end = text.find('"', title_start)
        title = text[title_start:title_end]
        print(f"  ✓ Using sample: {title}")

# Write to file
with open(wiki_file, "w", encoding="utf-8") as f:
    f.write("\n".join(wiki_texts))

print(f"\nSaved {len(wiki_texts)} Wikipedia articles")
print(f"   Total size: {wiki_file.stat().st_size} bytes")


📥 Downloading real Wikipedia articles...

  ✓ Downloaded: Python (programming language) (152 chars)
  ✓ Downloaded: Machine learning (462 chars)
  ✓ Downloaded: Natural language processing (331 chars)
  ✓ Downloaded: Artificial intelligence (463 chars)
  ✓ Downloaded: Data science (270 chars)

✅ Saved 5 Wikipedia articles
   Total size: 1934 bytes
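
To sanity-check the file before handing it to `WikipediaParser`, the WikiExtractor-style `<doc>` blocks can be read back with a small regular expression. This is a minimal sketch and not the parser's actual logic: `iter_docs` is a hypothetical helper, and the sample string stands in for the contents of `wiki_file`. It assumes titles contain no double quotes.

```python
import re

# Matches one <doc id="..." title="...">...</doc> block in the layout written above.
DOC_RE = re.compile(
    r'<doc id="(?P<id>\d+)" title="(?P<title>[^"]*)">\n(?P<body>.*?)\n</doc>',
    re.DOTALL,
)

def iter_docs(raw):
    """Yield (id, title, body) for each <doc> block found in raw."""
    for m in DOC_RE.finditer(raw):
        yield int(m.group("id")), m.group("title"), m.group("body").strip()

# Made-up sample in the same layout as the file written by the cell above.
sample = (
    '<doc id="1" title="Python">\nPython is a programming language.\n</doc>\n'
    '<doc id="2" title="Data science">\nData science uses statistics.\n</doc>'
)
docs = list(iter_docs(sample))
for doc_id, title, body in docs:
    print(doc_id, title, len(body))
```

In the notebook, the same helper could be applied to `wiki_file.read_text(encoding="utf-8")` to confirm that the number of blocks matches `len(wiki_texts)`.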


In [None]:
# Test WikipediaParser on real data
print("\nProcessing Wikipedia articles with WikipediaParser...\n")

wiki_parser = WikipediaParser(cleaner)
output_file = temp_dir / "wikipedia_processed.txt"

count = wiki_parser.save_to_file(
    input_path=temp_dir / "wikipedia" / "extracted",
    output_file=output_file
)

print(f"\nProcessed {count} passages from Wikipedia")
print("\nFirst 3 processed passages:")
with open(output_file, "r", encoding="utf-8") as f:
    for i, line in enumerate(f, 1):
        if i > 3:
            break
        preview = line.strip()[:100] + "..." if len(line.strip()) > 100 else line.strip()
        print(f"  {i}. {preview}")



🔧 Processing Wikipedia articles with WikipediaParser...

Processing /var/folders/fx/th2v9glj5tz0jj1y5s8_5fjh0000gn/T/tmpu4z853lu/wikipedia/extracted/AA/wiki_00...
Saved 5 passages to /var/folders/fx/th2v9glj5tz0jj1y5s8_5fjh0000gn/T/tmpu4z853lu/wikipedia_processed.txt

✅ Processed 5 passages from Wikipedia

📄 First 3 processed passages:
  1. Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code ...
  2. Machine learning (ML) is a field of study in artificial intelligence concerned with the development ...
  3. Natural language processing (NLP) is the processing of natural language information by a computer. T...
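
The preview loop in the cell above is repeated for each corpus in this notebook. A minimal sketch of how it could be factored into one function; `head_previews` is a hypothetical name and not part of the `spellchecker` package, and the demo file is made up so the block runs on its own.

```python
from itertools import islice
from pathlib import Path
import tempfile

def head_previews(path, n=3, width=100):
    """Return up to n previews, one per line, each cut to `width` characters."""
    previews = []
    with open(path, "r", encoding="utf-8") as f:
        for line in islice(f, n):
            text = line.strip()
            previews.append(text[:width] + "..." if len(text) > width else text)
    return previews

# Demo on a throwaway file so the sketch does not depend on the notebook state.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "processed.txt"
    demo.write_text("short line\n" + "x" * 150 + "\nthird\nfourth\n", encoding="utf-8")
    previews = head_previews(demo)

for i, p in enumerate(previews, 1):
    print(f"  {i}. {p}")
```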


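## 2. Download Real CC-News Data

Stream real news articles from the CC-News dataset via Hugging Face.


In [None]: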
print("📥 Downloading real CC-News articles...\n")

try:
    from datasets import load_dataset
    
    # Download 20 articles from CC-News
    dataset = load_dataset("cc_news", split="train", streaming=True)
    
    ccnews_file = temp_dir / "ccnews.jsonl"
    count = 0
    
    with open(ccnews_file, "w", encoding="utf-8") as f:
        for i, example in enumerate(dataset):
            if i >= 20:  # Download 20 articles
                break
            
            title = example.get("title", "")
            text = example.get("text", "")
            
            json.dump({"title": title, "text": text}, f)
            f.write("\n")
            count += 1
            
            title_preview = title[:60] + "..." if len(title) > 60 else title
            print(f"  ✓ Downloaded article {count}: {title_preview}")
    
    print(f"\nDownloaded {count} real CC-News articles")
    print(f"   File size: {ccnews_file.stat().st_size} bytes")
    
except ImportError:
    print("'datasets' library not available")
    print("   Install with: pip install datasets")
    ccnews_file = None
except Exception as e:
    print(f"Error downloading CC-News: {e}")
    ccnews_file = None


📥 Downloading real CC-News articles...





  ✓ Downloaded article 1: Daughter Duo is Dancing in The Same Company
  ✓ Downloaded article 2: New York City Ballet Announces Interim Leadership Team
  ✓ Downloaded article 3: Watch Pennsylvania Ballet & Boston Ballet Face Off for the S...
  ✓ Downloaded article 4: dance shoes
  ✓ Downloaded article 5: Rebecca Krohn on Her Retirement from New York City Ballet
  ✓ Downloaded article 6: Roy Kaiser to Become Nevada Ballet Theatre's New Artistic Di...
  ✓ Downloaded article 7: What It's Like Inside NYCB After Peter Martins
  ✓ Downloaded article 8: Nutcracker Secrets and Surprises
  ✓ Downloaded article 9: Inside the Beijing Dance Academy
  ✓ Downloaded article 10: dance shoes
  ✓ Downloaded article 11: Isabella Boylston and James Whiteside Get Hilariously Candid
  ✓ Downloaded article 12: Ballet Performances This Week
  ✓ Downloaded article 13: Guillaume Côté on NBoC's "Frame by Frame"
  ✓ Downloaded article 14: Broadway's "Carousel" Stars Some Familiar Ballet Faces
  ✓ Downloaded articl
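
The download loop above caps the stream at 20 records by checking the index by hand. For any lazy iterable, `itertools.islice` expresses the same cap more directly. A minimal sketch, using a stand-in generator (`fake_stream`, hypothetical) in place of the streaming dataset so that it runs offline; `write_jsonl` is likewise an illustrative name.

```python
import io
import json
from itertools import islice

def fake_stream():
    """Stand-in for a streaming dataset: yields dict records indefinitely."""
    n = 0
    while True:
        n += 1
        yield {"title": f"Article {n}", "text": f"Body of article {n}."}

def write_jsonl(records, fh, limit):
    """Write at most `limit` records as JSON Lines; return the number written."""
    written = 0
    for record in islice(records, limit):
        row = {"title": record.get("title", ""), "text": record.get("text", "")}
        fh.write(json.dumps(row, ensure_ascii=False) + "\n")
        written += 1
    return written

buffer = io.StringIO()
written = write_jsonl(fake_stream(), buffer, limit=20)
lines = buffer.getvalue().splitlines()
print(written, len(lines))
```

Because `islice` stops pulling from the iterator once the limit is reached, no more than `limit` records are requested from the source.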

In [None]:
# Test CCNewsParser on real data
if ccnews_file:
    print("\nProcessing CC-News articles with CCNewsParser...\n")
    
    ccnews_parser = CCNewsParser(cleaner)
    output_file = temp_dir / "ccnews_processed.txt"
    
    count = ccnews_parser.save_to_file(input_file=ccnews_file, output_file=output_file)
    
    print(f"\n✅ Processed {count} articles from CC-News")
    print("\n📄 First 3 processed articles:")
    with open(output_file, "r", encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            if i > 3:
                break
            preview = line.strip()[:100] + "..." if len(line.strip()) > 100 else line.strip()
            print(f"  {i}. {preview}")
else:
    print("Skipping CC-News processing (download failed)")



🔧 Processing CC-News articles with CCNewsParser...

Saved 3 articles to /var/folders/fx/th2v9glj5tz0jj1y5s8_5fjh0000gn/T/tmpu4z853lu/ccnews_processed.txt

✅ Processed 3 articles from CC-News

📄 First 3 processed articles:
  1. dance shoes. Looking for your next audition shoe? Shot at and in collaboration with Broadway Dance C...
  2. dance shoes. Looking for your next audition shoe? Shot at and in collaboration with Broadway Dance C...
  3. Wonderfully Simple Graphic Design Software. DesignWizard Sponsors rebelCon 2017 DesignWizard has joi...
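
Only 3 of the 20 downloaded articles appear in the output. One plausible explanation, which is an assumption and not confirmed by the parser source, is that `UniversalTextCleaner(min_length=10, max_length=500)` discards texts outside those bounds. A minimal sketch for checking how many records in a JSONL file would fall inside such bounds; `count_within_bounds` is a hypothetical helper and the sample records are made up.

```python
import io
import json

def count_within_bounds(fh, min_length=10, max_length=500):
    """Count JSONL records whose 'text' length lies in [min_length, max_length].

    Assumption: the bounds are measured in characters on the raw text.
    Returns a (kept, total) tuple.
    """
    kept = total = 0
    for line in fh:
        line = line.strip()
        if not line:
            continue
        total += 1
        text = json.loads(line).get("text", "")
        if min_length <= len(text) <= max_length:
            kept += 1
    return kept, total

# Made-up records: one too short, one in range, one too long.
sample = io.StringIO(
    "\n".join(
        json.dumps({"title": t, "text": x})
        for t, x in [("short", "hi"), ("ok", "a" * 120), ("long", "b" * 900)]
    )
)
kept, total = count_within_bounds(sample)
print(f"{kept} of {total} records within bounds")
```

Running it on `ccnews_file` (opened with `encoding="utf-8"`) would show whether the length bounds alone account for the drop from 20 to 3.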


## 3. Download Real BookCorpus Data

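Download real book text from the `lucadiliello/bookcorpusopen` dataset on Hugging Face. Each streamed record holds the full text of one book, not a single sentence, so 50 records amount to roughly 18 MB.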


In [None]:
print("Downloading real BookCorpus data...\n")

try:
    from datasets import load_dataset

    # Load the BookCorpus dataset using streaming mode
    print("  Loading BookCorpus dataset (streaming mode)...")
    dataset = load_dataset("lucadiliello/bookcorpusopen", split="train", streaming=True)

    book_file = temp_dir / "bookcorpus.txt"

    # Take the first 50 records. Each record in this dataset is the full text
    # of one book rather than a single sentence, so the file gets large.
    count = 0
    print("  Downloading sentences...\n")

    with open(book_file, "w", encoding="utf-8") as f:
        for i, example in enumerate(dataset):
            if i >= 50:  # Stop after 50 records
                break

            text = example["text"].strip()
            if text:
                f.write(text + "\n")
                count += 1

                text_preview = text[:60] + "..." if len(text) > 60 else text
                if count <= 5 or count % 10 == 0:  # Show first 5 and every 10th
                    print(f"  ✓ Sentence {count}: {text_preview}")

    print(f"\nDownloaded {count} real BookCorpus sentences")
    print(f"   File size: {book_file.stat().st_size} bytes")

except ImportError:
    print("'datasets' library not available")
    print("   Install with: pip install datasets")
    book_file = None  # Lets the processing cell skip cleanly
except Exception as e:
    print(f"Error downloading BookCorpus: {e}")
    book_file = None


📥 Downloading real BookCorpus data...

  Loading BookCorpus dataset (streaming mode)...
  Downloading sentences...

  ✓ Sentence 1: 1 + 2

This Is Only The Beginning

Kristie Lynn Higgins

Tex...
  ✓ Sentence 2: ## 1 God – Poems on God , Creator – volume 1

## By

Nikhil ...
  ✓ Sentence 3: ## 1 God – Poems on God , Creator – volume 2

## By

Nikhil ...
  ✓ Sentence 4: ## 1 God – Poems on God , Creator – volume 3

## By

Nikhil ...
  ✓ Sentence 5: ## 1 God – Poems on God , Creator – volume 4

## By

Nikhil ...
  ✓ Sentence 10: ### 10 of the Best Stories from Kenji Miyazawa & Nankichi Ni...
  ✓ Sentence 20: * * *

## One Thousand Yards

A John Milton Novel

Mark Daws...
  ✓ Sentence 30: # 10X Culture

### The 4-hour meeting week and 25 other secr...
  ✓ Sentence 40: 13 TALES TO GIVE YOU NIGHT TERRORS

A Night Terrors Novel

E...
  ✓ Sentence 50: # 185 TIPS ON WORLD BUILDING

# by Randy Ellefson

Copyright...

✅ Downloaded 50 real BookCorpus sentences
   File size: 17924923 bytes
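
The previews above span several lines because the first 60 characters of each record include newlines; the file size (about 17.9 MB for 50 records) also suggests that each record is a whole book. A minimal sketch of a preview helper that collapses whitespace before truncating; `one_line_preview` is a hypothetical name, not part of the project.

```python
def one_line_preview(text, width=60):
    """Collapse all whitespace runs to single spaces, then truncate to `width`."""
    flat = " ".join(text.split())
    return flat[:width] + "..." if len(flat) > width else flat

# Sample modeled on the opening lines of the first record shown above.
sample = "1 + 2\n\nThis Is Only The Beginning\n\nKristie Lynn Higgins\n\nText Copyright"
preview = one_line_preview(sample, width=40)
print(preview)
```

Swapping this in for the `text[:60]` expression would keep each progress message on a single line without changing what is written to `book_file`.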


In [None]:
# Test BookCorpusParser on real data
if book_file:
    print("\nProcessing BookCorpus sentences with BookCorpusParser...\n")
    
    book_parser = BookCorpusParser(cleaner)
    output_file = temp_dir / "bookcorpus_processed.txt"
    
    count = book_parser.save_to_file(input_path=book_file, output_file=output_file)
    
    print(f"\nProcessed {count} passages from BookCorpus")
    print("\nFirst 3 processed passages:")
    with open(output_file, "r", encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            if i > 3:
                break
            preview = line.strip()[:100] + "..." if len(line.strip()) > 100 else line.strip()
            print(f"  {i}. {preview}")
else:
    print("Skipping BookCorpus processing (download failed)")



🔧 Processing BookCorpus sentences with BookCorpusParser...

Saved 106703 passages to /var/folders/fx/th2v9glj5tz0jj1y5s8_5fjh0000gn/T/tmpabhghrte/bookcorpus_processed.txt

✅ Processed 106703 passages from BookCorpus

📄 First 3 processed passages:
  1. This Is Only The Beginning
  2. Kristie Lynn Higgins
  3. Text Copyright © 2018
