# 📚 Week 2 – Data Collection & Extraction

## 📝 Tesseract OCR Guide

Tesseract is an open-source Optical Character Recognition (OCR) engine, originally developed at Hewlett-Packard and later sponsored by Google. It converts scanned pages and other images containing text into machine-readable text; PDFs must first be rendered to images (for example with `pdf2image`). Tesseract supports more than 100 languages, including complex scripts such as Arabic and Chinese.

---

## 🔧 Installation

### **Installing Tesseract on Different Platforms**

1. **On macOS** (using Homebrew):
    ```bash
    brew install tesseract
    ```

2. **On Ubuntu/Debian**:
    ```bash
    sudo apt update
    sudo apt install tesseract-ocr
    ```

3. **On Windows**:
    - Download the Tesseract installer from the [Tesseract GitHub releases](https://github.com/tesseract-ocr/tesseract/releases).
    - Follow the installation instructions, and make sure to add Tesseract to your system’s `PATH`.

---

## 🛠️ Basic Usage

Once Tesseract is installed, you can use it either from the command line or by using the Python wrapper `pytesseract`.

### **Command Line Usage**

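To perform OCR on an image:
```bash
tesseract image.png output
```

Tesseract appends `.txt` to the output base name, so this command writes the recognized text to `output.txt`.

### **Use Tesseract with Python**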

In [1]:
from PIL import Image
import pytesseract #pip install pytesseract first

# Specify the path to the tesseract.exe executable
# Make sure this path is correct for your installation
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Load an image using Pillow (PIL)
image = Image.open('myGPU.png')

# Perform OCR on the image
text = pytesseract.image_to_string(image)

print(text)

System >» Display » Advanced display

Select a display to view or change its settings Display 1: MAG 274QRFW

Display information

MAG 274QRFW
Display 1: Connected to NVIDIA GeForce RTX 4070 SUPER

Desktop mode 2560 x 1440, 180 Hz
Active signal mode 2560 x 1440, 180 Hz
Variable refresh rate Not Supported

Bit depth 10-bit

Color format RGB

Color space High dynamic range (HDR)

HDR certification Not found More about HDR certification
Peak brightness 418 nits

Display adapter properties for Display 1

Choose a refresh rate

180 Hz v
A higher rate gives smoother motion, but also uses more power More about refresh rate
Dynamic refresh rate
To help save power, Windows adjusts the refresh rate up to the selected rate above or @

Dynamic refresh rate isn't supported
More about dynamic refresh rate



Tesseract supports over 100 languages, and you can even train it for custom languages or fonts. To use a different language, install the corresponding trained data file (for example `spa.traineddata`) and pass its language code with the `-l` flag.

For example, to use Spanish (spa):

In [None]:
tesseract image.png output -l spa
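
The same language selection is available from Python through the `lang` argument of `pytesseract.image_to_string`. The sketch below is illustrative only: the helper names `build_lang_arg` and `ocr_with_languages` are invented for this example, and it assumes the matching trained data files are already installed. Tesseract accepts several languages joined with `+` (for example `eng+spa`).

```python
def build_lang_arg(codes):
    """Join Tesseract language codes into the form expected by -l / lang=."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_with_languages(image_path, codes=("spa",)):
    """Run OCR on image_path using the given Tesseract language codes."""
    # Imported lazily so build_lang_arg can be used without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=build_lang_arg(codes))


print(build_lang_arg(["eng", "spa"]))  # eng+spa
```

Calling `ocr_with_languages("image.png", ["spa"])` would then mirror the command-line example above.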

## 🧑‍💻 Tesseract Best Practices

- **Preprocess images:** Always preprocess images by converting to grayscale, adjusting brightness/contrast, and removing noise to improve Tesseract’s accuracy.
- **Use the correct `--psm`:** The Page Segmentation Mode (`--psm`) plays a crucial role in how Tesseract segments the image and interprets the text. Experiment with different modes for complex documents.
- **Choose the right language:** Always specify the correct language (`-l lang_code`) for better accuracy. Tesseract performs poorly when the language is incorrect.
- **Custom training:** For specialized fonts or languages, you can train Tesseract to recognize custom fonts or languages. This is especially useful for handwriting or unusual fonts.
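
The preprocessing and `--psm` advice can be sketched with Pillow and pytesseract. This is a minimal example under stated assumptions, not a tuned pipeline: the function names are hypothetical, the median filter size of 3 is an arbitrary starting point, and PSM 6 (a single uniform block of text) is just one mode worth trying.

```python
def tesseract_config(psm=3, oem=None):
    """Build a config string such as '--psm 6' for pytesseract."""
    if not 0 <= psm <= 13:
        raise ValueError("Tesseract page segmentation modes range from 0 to 13")
    parts = [f"--psm {psm}"]
    if oem is not None:
        parts.append(f"--oem {oem}")
    return " ".join(parts)


def preprocess_for_ocr(image, median_size=3):
    """Grayscale, stretch contrast, and lightly denoise a PIL image."""
    from PIL import ImageFilter, ImageOps  # lazy import: requires Pillow

    gray = ImageOps.grayscale(image)
    contrasted = ImageOps.autocontrast(gray)
    return contrasted.filter(ImageFilter.MedianFilter(size=median_size))


def ocr_preprocessed(image_path, lang="eng", psm=6):
    """Preprocess an image file, then OCR it with the chosen language and PSM."""
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        cleaned = preprocess_for_ocr(image)
    return pytesseract.image_to_string(cleaned, lang=lang, config=tesseract_config(psm))
```

Comparing the output of `ocr_preprocessed("myGPU.png", psm=6)` against a few other PSM values is a quick way to see which mode suits a given layout.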

## Bonus Hands‑on Assignment Sheet
This is the bonus part of our homework. It covers data collection, extraction, and cleaning using OCR (Optical Character Recognition) with Tesseract, along with other techniques such as Web Scraping and Automatic Speech Recognition (ASR). The goal is to apply different tools and methods to extract useful information from web pages, PDFs, and audio files, and to clean the data for further analysis.

---
### Task Overview

| # | 💡 Module / Skill | 🎯 Task Goal | 🛠️ Core Tools | 📌 Deliverables |
|---|-------------------|--------------|----------------|-----------------|
| **1** | 🌐 *Web Scraping & HTML Cleaning* | **arXiv Paper Abstract Scraper**<br>• Query any subcategory (e.g., *cs.CL*) to fetch the latest 200 papers.<br>• Scrape the `/abs/` page and use **Trafilatura** to clean the content.<br>• Use **Tesseract OCR** to extract abstract text from screenshots of the downloaded pages.<br>• Save the data as JSON: `{url, title, abstract, authors, date}` | `trafilatura`, `pytesseract` | `arxiv_clean.json` (≤1MB) + scraper script |
| **2** | 🖼️ *PDF to Text OCR* | **Batch OCR for arXiv PDFs** (same paper set as Task 1).<br>• Use **Tesseract** to convert PDFs to text.<br>• Retain OCR layout (e.g., titles, sections) if needed. | `pytesseract`, `pdf2image` | `pdf_ocr/` folder with TXT files + code notebook |
| **3** | 🔊 *Automatic Speech Recognition (ASR)* | **Whisper Transcription Bot** for 10 short NLP conference talks (~3 minutes each).<br>• Use **yt-dlp** to fetch YouTube audio.<br>• Transcribe with **Tesseract** for any OCR-based text in the transcript images.<br>• Save `.jsonl` with timestamps. | `yt-dlp`, `pytesseract` | `talks_transcripts.jsonl` + transcription script |
| **4** | 🧹 *Data Cleaning & Deduplication* | **End‑to‑End Cleaner**:<br>• Merge the outputs from Tasks 1‑3 into one dataset.<br>• Steps: language detection → strip HTML noise → use MinHash for deduplication (similarity ≥ 0.7) → remove PII (emails, credit card numbers, phone numbers) → remove repetitive n‑grams. | `langdetect`, `datasketch` | `clean_corpus.txt` + `stats.md` (token count, removal percentage) |
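
As a rough illustration of the PII-removal step in Task 4, the sketch below redacts emails, phone numbers, and card-like digit runs with regular expressions. The patterns and the name `redact_pii` are assumptions made for this example; they are deliberately simple (the phone pattern targets North American style numbers) and will miss or over-match some real-world formats.

```python
import re

# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")
# Optional country code, then a 3-3-4 digit grouping.
PHONE_RE = re.compile(
    r"(?<!\w)(?:\+?\d{1,3}[ .-]?)?(?:\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}(?!\w)"
)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def redact_pii(text):
    """Replace card numbers, phone numbers, and emails with placeholders."""
    # Cards go first so the phone pattern cannot match part of a card number.
    text = CARD_RE.sub("[CARD]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return EMAIL_RE.sub("[EMAIL]", text)


sample = "Contact jane.doe@example.org or 555-123-4567; card 4111 1111 1111 1111."
print(redact_pii(sample))
# Contact [EMAIL] or [PHONE]; card [CARD].
```

Counting how many placeholders were inserted is one simple way to produce the removal percentage requested for `stats.md`.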



### 💬 Resources

1. **Trafilatura Quick Start:**  
   - [Trafilatura Documentation](https://github.com/adbar/trafilatura)  
   - Usage: `trafilatura.extract(html, include_comments=False, include_tables=False)`

2. **Tesseract OCR:**  
   - [Tesseract OCR GitHub Repository](https://github.com/tesseract-ocr/tesseract)  
   - [Tesseract OCR Documentation](https://tesseract-ocr.github.io/)  
   - [pytesseract Python Wrapper Documentation](https://github.com/madmaze/pytesseract)  
   - Use Tesseract for OCR conversion of PDFs or images. For complex layouts, use **Tesseract’s layout analysis** feature.  
     - Example: `text = pytesseract.image_to_string(image, config='--psm 6')`

3. **Whisper Automatic Speech Recognition (ASR):**  
   - [Whisper GitHub Repository](https://github.com/openai/whisper)  
   - [Whisper Documentation](https://github.com/openai/whisper#usage)  
   - To use Whisper with Python, follow the setup instructions provided on the official repository.

4. **yt-dlp for Downloading YouTube Audio:**  
   - [yt-dlp GitHub Repository](https://github.com/yt-dlp/yt-dlp)  
   - [yt-dlp Installation and Usage](https://github.com/yt-dlp/yt-dlp#installation)

5. **PDF to Image Conversion (Using `pdf2image`):**  
   - [pdf2image Documentation](https://pdf2image.readthedocs.io/en/latest/)  
   - This library converts PDF pages to images, which you can then process with Tesseract.

6. **MinHash for Deduplication:**  
   - [Datasketch Documentation](https://datasketch.readthedocs.io/en/latest/)  
   - **MinHashLSH** is useful for deduplicating large text corpora by finding similar documents.

7. **Cleaning HTML and Removing PII (Personally Identifiable Information):**  
   - [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)  
   - **langdetect** Python library for language detection:  
     - [langdetect GitHub Repository](https://github.com/Mimino666/langdetect)
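
To make the MinHash idea concrete, here is a dependency-free sketch of near-duplicate removal at the 0.7 similarity threshold named in Task 4. It is not the `datasketch` API: all function names are hypothetical, the 3-word shingles and 64 hash functions are arbitrary choices, and documents are compared pairwise, which only scales to small corpora. For larger corpora, `MinHashLSH` from `datasketch` avoids the pairwise comparison.

```python
import hashlib
import re


def word_shingles(text, k=3):
    """Return the set of lowercase k-word shingles in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingles, num_perm=64):
    """For each salted hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for s in shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.7):
    """Keep each document only if it is not too similar to one already kept."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(word_shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

The first occurrence of each near-duplicate group is the one that survives, so the order of the input list matters.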


---




## Task 1: Web Scraping & HTML Cleaning

__Task Goal:__ arXiv Paper Abstract Scraper
* Query any subcategory (e.g., cs.CL) to fetch the latest 200 papers.
* Scrape the /abs/ page and use Trafilatura to clean the content.
* Use Tesseract OCR to extract abstract text from screenshots of the downloaded pages.
* Save the data as JSON: {url, title, abstract, authors, date}

__Core Tools:__ trafilatura, pytesseract

__Deliverables:__ arxiv_clean.json (≤1MB) + scraper script
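
Before the full scraper, it may help to pin down the output contract. The sketch below checks that each record carries the five required fields and that the written file stays within the size limit. The name `save_records` is hypothetical, and reading "≤1MB" as 1024 × 1024 bytes is an assumption.

```python
import json
import os

REQUIRED_FIELDS = ("url", "title", "abstract", "authors", "date")
MAX_BYTES = 1024 * 1024  # assumed interpretation of the 1 MB limit


def save_records(records, path="arxiv_clean.json"):
    """Validate records, write them as JSON, and return the file size in bytes."""
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"record {index} is missing fields: {missing}")
    with open(path, "w", encoding="utf-8") as f:
        # Compact separators keep the file small; ensure_ascii=False keeps
        # non-ASCII author names readable.
        json.dump(records, f, ensure_ascii=False, separators=(",", ":"))
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        raise ValueError(f"{path} is {size} bytes, above the {MAX_BYTES}-byte limit")
    return size
```

If the file exceeds the limit, dropping the `indent` argument (as done here) or truncating very long abstracts are two ways to shrink it.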

To implement the arXiv Paper Abstract Scraper in a Python Jupyter Notebook, we'll break it down into steps, utilizing the specified tools: requests for web scraping, BeautifulSoup for initial HTML parsing (though Trafilatura will do the heavy lifting for cleaning), trafilatura for content extraction and cleaning, and pytesseract for OCR.

First, let's make sure you have all the necessary libraries installed. You can run the following in your Jupyter Notebook:

In [1]:
!pip show requests

Name: requests
Version: 2.32.2
Summary: Python HTTP for Humans.
Home-page: https://requests.readthedocs.io
Author: Kenneth Reitz
Author-email: me@kennethreitz.org
License: Apache-2.0
Location: C:\Users\ch939\anaconda3\Lib\site-packages
Requires: certifi, charset-normalizer, idna, urllib3
Required-by: aext-assistant-server, anaconda-catalogs, anaconda-client, anaconda-cloud-auth, anaconda-project, conda, conda-build, conda-repo-cli, conda_package_streaming, cookiecutter, datasets, datashader, huggingface-hub, intake, jupyterlab_server, langchain, langchain-community, langsmith, panel, requests-file, requests-toolbelt, Sphinx, streamlit, tensorflow, tensorflow_intel, tldextract, transformers, webdriver-manager


In [3]:
!pip show beautifulsoup4

Name: beautifulsoup4
Version: 4.12.3
Summary: Screen-scraping library
Home-page: https://www.crummy.com/software/BeautifulSoup/bs4/
Author: 
Author-email: Leonard Richardson <leonardr@segfault.org>
License: MIT License
Location: C:\Users\ch939\anaconda3\Lib\site-packages
Requires: soupsieve
Required-by: conda-build, nbconvert


In [3]:
!pip show trafilatura

Name: trafilatura
Version: 2.0.0
Summary: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML.
Home-page: https://trafilatura.readthedocs.io
Author: 
Author-email: Adrien Barbaresi <barbaresi@bbaw.de>
License: Apache 2.0
Location: C:\Users\ch939\anaconda3\Lib\site-packages
Requires: certifi, charset_normalizer, courlan, htmldate, justext, lxml, urllib3
Required-by: 


In [5]:
!pip show pytesseract

Name: pytesseract
Version: 0.3.13
Summary: Python-tesseract is a python wrapper for Google's Tesseract-OCR
Home-page: https://github.com/madmaze/pytesseract
Author: Samuel Hoffstaetter
Author-email: samuel@hoffstaetter.com
License: Apache License 2.0
Location: C:\Users\ch939\anaconda3\Lib\site-packages
Requires: packaging, Pillow
Required-by: 


In [7]:
!pip show pillow

Name: pillow
Version: 10.3.0
Summary: Python Imaging Library (Fork)
Home-page: https://python-pillow.org
Author: 
Author-email: "Jeffrey A. Clark" <aclark@aclark.net>
License: HPND
Location: C:\Users\ch939\anaconda3\Lib\site-packages
Requires: 
Required-by: bokeh, datashader, gradio, imageio, matplotlib, pytesseract, scikit-image, streamlit, torchvision


__The Most Recommended Solution: Create and Use a Virtual Environment__
This is the best practice for Python development. Virtual environments isolate your project's dependencies from your system-wide Python installation and from other projects, which prevents version conflicts between packages.

On Windows (for example, from an Anaconda Prompt):
1. Create a virtual environment:
    python -m venv venv_arxiv_scraper
2. Activate the virtual environment:
    venv_arxiv_scraper\Scripts\activate.bat
3. Install the required packages within the virtual environment:
    pip install requests beautifulsoup4 trafilatura pytesseract pillow tensorflow-intel==2.18.0
4. Register the environment as a Jupyter kernel, then select that kernel when you open the notebook:
    pip install ipykernel
    python -m ipykernel install --user --name=venv_arxiv_scraper
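
Once the kernel is selected, a quick check from inside the notebook can confirm which interpreter is running. This is a small sketch using only the standard library; the function name is made up for illustration. Note that the check detects `venv`-style environments only, since a conda environment reports the same value for both prefixes.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment while sys.base_prefix
    # still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_environment())
```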

In [21]:
import requests
from bs4 import BeautifulSoup
from trafilatura import extract
import pytesseract
from PIL import Image
import io
import json
import re
import time
import os

# --- Configuration ---
# Set the path to the Tesseract executable if it's not in your PATH
# pytesseract.pytesseract.tesseract_cmd = r'/usr/local/bin/tesseract' # Example for macOS
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe' # Example for Windows

# --- Constants ---
ARXIV_BASE_URL = "https://arxiv.org"
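# {subcategory} is filled in by scrape_arxiv_subcategory(); show=250 returns one page large enough for 200 papers.
ARXIV_SEARCH_URL = "https://arxiv.org/list/{subcategory}/pastweek?skip=0&show=250"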
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# --- Helper Functions ---

def fetch_page(url):
    """Fetches the content of a given URL."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

def extract_abstract_from_html(html_content):
    """
    Extracts the abstract text from HTML content using Trafilatura.
    If Trafilatura doesn't find a good abstract, we'll try a more specific BeautifulSoup parse.
    """
    extracted_data = extract(html_content, include_comments=False, no_fallback=False)
    if extracted_data:
        # Trafilatura extracts the full content, we need to find the abstract part.
        # This is a heuristic: looking for "Abstract:" or similar markers.
        match = re.search(r'(?i)abstract:\s*(.*)', extracted_data, re.DOTALL)
        if match:
            return match.group(1).strip()
        else:
            # If no clear abstract marker, try to get a significant block of text
            # which often is the abstract in such pages.
            return extracted_data.strip()
    return None

def extract_info_from_abs_page(abs_page_html):
    """
    Extracts title, authors, and date from the /abs/ page.
    """
    soup = BeautifulSoup(abs_page_html, 'html.parser')

    title = soup.find('h1', class_='title')
    title = title.text.replace('Title:', '').strip() if title else 'N/A'

    authors_div = soup.find('div', class_='authors')
    authors = []
    if authors_div:
        for a_tag in authors_div.find_all('a'):
            authors.append(a_tag.text.strip())
    authors_str = ', '.join(authors) if authors else 'N/A'

    date_line = soup.find('div', class_='submission-history')
    date = 'N/A'
    if date_line:
        # Example: [v1] Thu, 1 Jul 2024 13:00:00 UTC (1,234KB)
        match = re.search(r'\[v\d+\]\s*([^,]+,\s*\d+\s+\w+\s+\d{4})', date_line.text)
        if match:
            date = match.group(1).strip()
    return title, authors_str, date

def get_screenshot_and_ocr(url):
    """
    (Conceptual) This function would typically require a headless browser (e.g., Selenium)
    to render the page and take a screenshot. For this exercise, we will simulate
    it by attempting to use Trafilatura's text and, if it fails to find an abstract,
    we would acknowledge the need for a visual capture and OCR.

    As direct "screenshot taking" from a URL without a browser is not possible
    with `requests` or `BeautifulSoup`, this part of the task is more conceptual
    for a pure `requests`/`bs4` setup. If OCR is *strictly* required for downloaded pages,
    a headless browser (like Selenium with ChromeDriver) would be necessary to
    capture an image of the abstract section.

    For the sake of this implementation, we'll focus on text extraction first.
    If a clear abstract is not found via HTML parsing, we'll flag it.
    """
    print(f"  Attempting OCR fallback for {url} (Requires screenshot tool, not implemented here directly).")
    # In a real scenario, you would use Selenium to get a screenshot
    # Example (requires selenium and a webdriver):
    # from selenium import webdriver
    # from selenium.webdriver.chrome.service import Service as ChromeService
    # from webdriver_manager.chrome import ChromeDriverManager
    #
    # options = webdriver.ChromeOptions()
    # options.add_argument('--headless')
    # options.add_argument('--disable-gpu')
    # driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)
    # driver.get(url)
    # driver.save_screenshot('temp_screenshot.png')
    # driver.quit()
    #
    # text = pytesseract.image_to_string(Image.open('temp_screenshot.png'))
    # return text
    return "OCR fallback simulated: Could not extract abstract directly."


# --- Main Scraper Function ---

def scrape_arxiv_subcategory(subcategory, num_papers=200):
    """
    Scrapes the latest papers from a given arXiv subcategory.
    """
    print(f"Scraping latest {num_papers} papers from arXiv subcategory: {subcategory}")
    initial_url = ARXIV_SEARCH_URL.format(subcategory=subcategory, num_papers=num_papers)
    print(f"Fetching initial list from: {initial_url}")

    html_list_page = fetch_page(initial_url)
    if not html_list_page:
        print("Failed to fetch the initial list page.")
        return []

    soup = BeautifulSoup(html_list_page, 'html.parser')
    paper_entries = soup.find_all('dt') # <dt> holds the paper id
    paper_metadata = soup.find_all('dd') # <dd> holds the title, authors, abstract link

    papers_data = []

    for i in range(min(len(paper_entries), len(paper_metadata), num_papers)):
        entry_dt = paper_entries[i]
        entry_dd = paper_metadata[i]

        paper_id_link = entry_dt.find('a', title="Abstract")
        if not paper_id_link:
            continue

        paper_url = ARXIV_BASE_URL + paper_id_link['href']

        print(f"\nProcessing paper: {paper_url}")

        # Fetch the /abs/ page
        abs_page_html = fetch_page(paper_url)
        if not abs_page_html:
            print(f"Skipping {paper_url} due to fetch error.")
            continue

        title, authors, date = extract_info_from_abs_page(abs_page_html)

        # Try to extract abstract using Trafilatura
        abstract = extract_abstract_from_html(abs_page_html)

        if not abstract or "OCR fallback simulated" in abstract: # Check if Trafilatura was insufficient
            print(f"  Abstract not clearly found via Trafilatura for {paper_url}. Attempting OCR fallback (conceptual).")
            # In a real scenario, you would trigger the Selenium-based screenshot and OCR here.
            # For this exercise, we'll acknowledge the limitation and use a placeholder or
            # indicate that a direct HTML parse might be sufficient for many cases.
            abstract_via_ocr = get_screenshot_and_ocr(paper_url) # This is conceptual
            if "OCR fallback simulated" in abstract_via_ocr and not abstract:
                abstract = "Abstract could not be extracted via direct parsing or simulated OCR fallback."
            elif "OCR fallback simulated" not in abstract_via_ocr:
                abstract = abstract_via_ocr # If OCR worked conceptually

        paper_info = {
            "url": paper_url,
            "title": title,
            "abstract": abstract,
            "authors": authors,
            "date": date
        }
        papers_data.append(paper_info)
        time.sleep(0.5) # Be polite to the server

        if len(papers_data) >= num_papers:
            break

    return papers_data

# --- Execution ---

if __name__ == "__main__":
    subcategory_to_query = "cs.CL" # Example: Computational Linguistics
    num_papers_to_fetch = 200

    scraped_papers = scrape_arxiv_subcategory(subcategory_to_query, num_papers_to_fetch)

    # Save to JSON
    output_filename = "arxiv_clean.json"
    try:
        with open(output_filename, 'w', encoding='utf-8') as f:
            json.dump(scraped_papers, f, ensure_ascii=False, indent=4)
        print(f"\nSuccessfully scraped {len(scraped_papers)} papers and saved to {output_filename}")
        print(f"File size: {round(os.path.getsize(output_filename) / (1024 * 1024), 2)} MB")
    except Exception as e:
        print(f"Error saving JSON file: {e}")

    # Optional: Display a sample of the scraped data
    if scraped_papers:
        print("\n--- Sample Scraped Paper ---")
        import pprint
        pprint.pprint(scraped_papers[0])

Scraping latest 200 papers from arXiv subcategory: cs.CL
Fetching initial list from: https://arxiv.org/list/cs.CL/pastweek?skip=0&show=250

Processing paper: https://arxiv.org/abs/2507.22887

Processing paper: https://arxiv.org/abs/2507.22829

Processing paper: https://arxiv.org/abs/2507.22811

Processing paper: https://arxiv.org/abs/2507.22758

Processing paper: https://arxiv.org/abs/2507.22753

Processing paper: https://arxiv.org/abs/2507.22752

Processing paper: https://arxiv.org/abs/2507.22744

Processing paper: https://arxiv.org/abs/2507.22729

Processing paper: https://arxiv.org/abs/2507.22720

Processing paper: https://arxiv.org/abs/2507.22716

Processing paper: https://arxiv.org/abs/2507.22676

Processing paper: https://arxiv.org/abs/2507.22623

Processing paper: https://arxiv.org/abs/2507.22608

Processing paper: https://arxiv.org/abs/2507.22603

Processing paper: https://arxiv.org/abs/2507.22581

Processing paper: https://arxiv.org/abs/2507.22564

Processing paper: https://ar

## Task 2: PDF to Text OCR

__Task Goal:__ Batch OCR for arXiv PDFs (same paper set as Task 1).

__Deliverables:__ pdf_ocr/ folder with TXT files + code notebook

__Core Tools:__ pytesseract, pdf2image
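
Because Task 2 reuses the paper set from Task 1, each `/abs/` URL has to be mapped to a PDF URL and to local file names. The helper below is a sketch of that mapping; `pdf_targets` and its default directory names are assumptions for illustration. It also makes old-style arXiv IDs, which contain a slash, safe to use as file names.

```python
import os
from urllib.parse import urlparse


def pdf_targets(abs_url, pdf_dir="arxiv_pdfs", txt_dir="pdf_ocr"):
    """Map an arXiv /abs/ URL to its PDF URL and local output paths."""
    path = urlparse(abs_url).path
    if not path.startswith("/abs/"):
        raise ValueError(f"not an arXiv abstract URL: {abs_url}")
    paper_id = path[len("/abs/"):].strip("/")
    # Old-style IDs have the form archive/number, so replace the slash.
    safe_id = paper_id.replace("/", "_")
    return {
        "pdf_url": abs_url.replace("/abs/", "/pdf/", 1),
        "pdf_path": os.path.join(pdf_dir, f"{safe_id}.pdf"),
        "txt_path": os.path.join(txt_dir, f"{safe_id}.txt"),
    }


targets = pdf_targets("https://arxiv.org/abs/2507.22887")
print(targets["pdf_url"])  # https://arxiv.org/pdf/2507.22887
```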

In [9]:
!pip install pdf2image

Collecting pdf2image
  Downloading pdf2image-1.17.0-py3-none-any.whl.metadata (6.2 kB)
Downloading pdf2image-1.17.0-py3-none-any.whl (11 kB)
Installing collected packages: pdf2image
Successfully installed pdf2image-1.17.0


In [11]:
import os
import json
import requests
from pdf2image import convert_from_path
import pytesseract
from PIL import Image
import io
import time

# --- Configuration ---
# Set the path to the Tesseract executable if it's not in your PATH
# pytesseract.pytesseract.tesseract_cmd = r'/usr/local/bin/tesseract' # Example for macOS
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe' # Example for Windows
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# --- Constants ---
# Directory to save downloaded PDFs
PDF_DOWNLOAD_DIR = "arxiv_pdfs"
# Directory to save OCR'd text files
OCR_OUTPUT_DIR = "pdf_ocr"
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Ensure output directories exist
os.makedirs(PDF_DOWNLOAD_DIR, exist_ok=True)
os.makedirs(OCR_OUTPUT_DIR, exist_ok=True)

# --- Helper Functions ---

def download_pdf(pdf_url, filename):
    """Downloads a PDF from a given URL."""
    filepath = os.path.join(PDF_DOWNLOAD_DIR, filename)
    if os.path.exists(filepath):
        print(f"  PDF already exists: {filename}")
        return filepath

    print(f"  Downloading PDF: {pdf_url}")
    try:
        response = requests.get(pdf_url, headers=HEADERS, stream=True, timeout=30)
        response.raise_for_status()
        with open(filepath, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        return filepath
    except requests.exceptions.RequestException as e:
        print(f"  Error downloading {pdf_url}: {e}")
        return None

def ocr_pdf_to_text(pdf_path, output_filepath):
    """
    Converts a PDF to text using Tesseract OCR.
    Retains layout by performing OCR page by page.
    """
    if not os.path.exists(pdf_path):
        print(f"  PDF file not found for OCR: {pdf_path}")
        return False

    print(f"  Performing OCR on: {os.path.basename(pdf_path)}")
    try:
        # Convert PDF pages to list of PIL images
        # poppler_path might be needed on Windows.
        # Example: poppler_path=r'C:\Program Files\poppler-23.08.0\Library\bin'
        pages = convert_from_path(pdf_path, 300) # 300 DPI for better OCR accuracy

        full_text = []
        for i, page_image in enumerate(pages):
            print(f"    Processing page {i+1}/{len(pages)}")
            # Use image_to_string with output_type='text' for plain text
            # and lang='eng' for English.
            # config='--psm 1' might help with layout retention, 1 for OSD, 3 for default.
            # For general document layout, PSM 6 or 3 are usually good.
            # If layout retention is crucial, you might need to try different PSMs.
            # Here, we'll try a common one for general documents.
            text = pytesseract.image_to_string(page_image, lang='eng', config='--psm 3')
            full_text.append(text)
            time.sleep(0.1) # Small delay between pages

        with open(output_filepath, 'w', encoding='utf-8') as f:
            f.write("\n".join(full_text))

        print(f"  OCR complete. Text saved to: {os.path.basename(output_filepath)}")
        return True
    except Exception as e:
        print(f"  Error during OCR for {pdf_path}: {e}")
        return False

# --- Main OCR Process ---

def batch_ocr_arxiv_pdfs(json_filepath="arxiv_clean.json"):
    """
    Reads the arxiv_clean.json, downloads associated PDFs, and performs OCR.
    """
    if not os.path.exists(json_filepath):
        print(f"Error: {json_filepath} not found. Please run Task 1 first.")
        return

    print(f"Loading paper data from: {json_filepath}")
    try:
        with open(json_filepath, 'r', encoding='utf-8') as f:
            papers = json.load(f)
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON from {json_filepath}: {e}")
        return

    if not papers:
        print("No papers found in the JSON file to process.")
        return

    print(f"Found {len(papers)} papers to process for PDF OCR.")

    processed_count = 0
    for i, paper in enumerate(papers):
        paper_url = paper.get('url')
        if not paper_url:
            print(f"Skipping paper {i+1}: No URL found.")
            continue

        # Construct PDF URL: replace '/abs/' with '/pdf/'
        pdf_url = paper_url.replace('/abs/', '/pdf/')
        
        # Extract paper ID to use as filename
        paper_id = paper_url.split('/')[-1]
        pdf_filename = f"{paper_id}.pdf"
        txt_filename = f"{paper_id}.txt"
        
        pdf_filepath = os.path.join(PDF_DOWNLOAD_DIR, pdf_filename)
        txt_filepath = os.path.join(OCR_OUTPUT_DIR, txt_filename)

        print(f"\nProcessing paper {i+1}/{len(papers)}: {paper['title']}")

        # Step 1: Download PDF
        downloaded_path = download_pdf(pdf_url, pdf_filename)
        if not downloaded_path:
            print(f"  Failed to download PDF for {paper['title']}. Skipping OCR.")
            continue

        # Step 2: Perform OCR if the text file doesn't already exist
        if os.path.exists(txt_filepath) and os.path.getsize(txt_filepath) > 0:
            print(f"  OCR text file already exists and is not empty: {txt_filename}. Skipping OCR.")
        else:
            ocr_success = ocr_pdf_to_text(downloaded_path, txt_filepath)
            if ocr_success:
                processed_count += 1
            else:
                print(f"  OCR failed for {paper['title']}.")
        
        time.sleep(0.5) # Be polite to the servers

    print(f"\nBatch OCR process finished. Successfully OCR'd/processed {processed_count} new PDFs.")
    print(f"PDFs are in: {os.path.abspath(PDF_DOWNLOAD_DIR)}")
    print(f"OCR'd text files are in: {os.path.abspath(OCR_OUTPUT_DIR)}")

# --- Execution ---

if __name__ == "__main__":
    # Ensure you have run Task 1 and 'arxiv_clean.json' exists in the same directory
    # or specify its path if it's elsewhere.
    batch_ocr_arxiv_pdfs("arxiv_clean.json")

    print("\nRemember to install Poppler for pdf2image if on Windows/Linux and Tesseract OCR for pytesseract.")
    print("Example Poppler installation (Windows): Download from https://github.com/oschwartz10612/poppler-windows/releases")
    print("  Then add the 'bin' folder to your system PATH or specify in convert_from_path(pdf_path, poppler_path='path_to_bin')")

Loading paper data from: arxiv_clean.json
Found 200 papers to process for PDF OCR.

Processing paper 1/200: Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning
  Downloading PDF: https://arxiv.org/pdf/2507.22887
  Performing OCR on: 2507.22887.pdf
    Processing page 1/34
    Processing page 2/34
    Processing page 3/34
    Processing page 4/34
    Processing page 5/34
    Processing page 6/34
    Processing page 7/34
    Processing page 8/34
    Processing page 9/34
    Processing page 10/34
    Processing page 11/34
    Processing page 12/34
    Processing page 13/34
    Processing page 14/34
    Processing page 15/34
    Processing page 16/34
    Processing page 17/34
    Processing page 18/34
    Processing page 19/34
    Processing page 20/34
    Processing page 21/34
    Processing page 22/34
    Processing page 23/34
    Processing page 24/34
    Processing page 25/34
    Processing page 26/34
    Processing page 27/34
    Processing page 28/34
    Proc


## Task 3: Automatic Speech Recognition (ASR) - Conceptual Implementation

__Task Goal:__ Whisper Transcription Bot for 10 short NLP conference talks (~3 minutes each).

* Use yt-dlp to fetch YouTube audio.
* Transcribe with Tesseract for any OCR-based text in the transcript images.
* Save .jsonl with timestamps.

__Core Tools:__ yt-dlp, pytesseract


__Deliverables:__ talks_transcripts.jsonl + transcription script
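
The `.jsonl` deliverable stores one JSON object per line. The sketch below shows one possible layout with a line per transcript segment. It assumes segments shaped like the `start`/`end`/`text` dictionaries that Whisper returns under `result["segments"]`; the field names in each row and both function names are choices made for this example.

```python
import json


def format_timestamp(seconds):
    """Render a time offset in seconds as HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def write_transcript_jsonl(talks, path="talks_transcripts.jsonl"):
    """Write one line per segment and return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for talk in talks:
            for segment in talk["segments"]:
                row = {
                    "title": talk["title"],
                    "url": talk["url"],
                    "start": format_timestamp(segment["start"]),
                    "end": format_timestamp(segment["end"]),
                    "text": segment["text"].strip(),
                }
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
                count += 1
    return count


print(format_timestamp(3.5))  # 00:00:03.500
```

Keeping one segment per line means the file can be streamed and filtered line by line without loading every transcript into memory.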

In [18]:
!pip install jsonlines

Collecting jsonlines
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 kB)
Installing collected packages: jsonlines
Successfully installed jsonlines-4.0.0


In [1]:
import os
import jsonlines
import time
# For actual ASR, you would typically use a library like 'openai-whisper'
# import whisper
# For OCR on video frames or slide images (not needed for the ASR step itself)
# from PIL import Image
# import pytesseract

# --- Configuration ---
# Set the path to the Tesseract executable if needed for OCR on images *within the video*
# pytesseract.pytesseract.tesseract_cmd = r'/usr/local/bin/tesseract'

AUDIO_DOWNLOAD_DIR = "conference_talk_audio"
TRANSCRIPTS_OUTPUT_FILE = "talks_transcripts.jsonl"

# Create directory for audio downloads
os.makedirs(AUDIO_DOWNLOAD_DIR, exist_ok=True)

# List of YouTube video IDs or URLs for NLP conference talks
# Replace these with actual short NLP conference talk URLs (around 3 minutes each)
# NOTE: Ensure these are short for practical demonstration and download limits.
# You would fetch these URLs manually or from a curated list.
YOUTUBE_TALKS = [
    {"title": "NLP Talk 01", "url": "https://www.youtube.com/watch?v=iXMh5_3IXww"},
    {"title": "NLP Talk 02", "url": "https://www.youtube.com/watch?v=HZyej-kFueA"},
    {"title": "NLP Talk 03", "url": "https://www.youtube.com/watch?v=4BdcAGS1XRU"},
    {"title": "NLP Talk 04", "url": "https://www.youtube.com/watch?v=Cr7NmDdB4_0"},
    {"title": "NLP Talk 05", "url": "https://www.youtube.com/watch?v=6aFyaUZWDfs"},
    {"title": "NLP Talk 06", "url": "https://www.youtube.com/watch?v=8rXEIQBhnwM"},    
    {"title": "NLP Talk 07", "url": "https://www.youtube.com/watch?v=5pBmF8-qK0Y"},
    {"title": "NLP Talk 08", "url": "https://www.youtube.com/watch?v=QEaYrGvGtFw"},
    {"title": "NLP Talk 09", "url": "https://www.youtube.com/watch?v=IiYvOQZPxqg"},
    {"title": "NLP Talk 10", "url": "https://www.youtube.com/watch?v=6Z3qGfZL1Y4"},  
    # Replace the entries above with your own short NLP conference talk URLs
    # Example for a real (but longer) video if you want to test the full process locally:
    # {"title": "Attention is All You Need", "url": "https://www.youtube.com/watch?v=rBCJzF0Sxdg"}
]

# --- Helper Functions (Conceptual/External) ---

def download_audio_with_yt_dlp(youtube_url, output_path):
    """
    CONCEPTUAL: This function represents an external call to yt-dlp.
    It cannot be directly executed in this environment.
    You would run this command in your terminal or via os.system()/subprocess.run()
    if you have yt-dlp installed on your system.
    """
    print(f"  [EXTERNAL STEP] Downloading audio for {youtube_url} using yt-dlp...")
    print(f"  Command example: yt-dlp -x --audio-format mp3 -o '{output_path}/%(title)s.%(ext)s' {youtube_url}")
    # Simulate success for demonstration
    time.sleep(1)
    # In a real scenario, check the return code of the subprocess call
    # For this demo, let's assume it saves as 'video_id.mp3'
    video_id = youtube_url.split('v=')[-1]
    simulated_audio_file = os.path.join(output_path, f"{video_id}.mp3")
    # Create a dummy file for demonstration purposes
    with open(simulated_audio_file, 'w') as f:
        f.write("dummy audio content")
    return simulated_audio_file # Return the path to the downloaded audio file

def transcribe_audio_with_whisper(audio_file_path):
    """
    CONCEPTUAL: This function represents using the OpenAI Whisper model for ASR.
    Requires the 'openai-whisper' library and potentially downloading a model.
    """
    print(f"  [ASR STEP] Transcribing audio file: {audio_file_path} with Whisper...")
    # Load the Whisper model (e.g., 'base', 'small', 'medium', 'large')
    # model = whisper.load_model("base")
    
    # Transcribe the audio
    # result = model.transcribe(audio_file_path, word_timestamps=True)
    
    # Simulate transcription result for demonstration
    simulated_segments = [
        {"start": 0.0, "end": 3.5, "text": "Hello, and welcome to this short talk on Natural Language Processing."},
        {"start": 4.0, "end": 8.2, "text": "Today, we will discuss the advancements in large language models."},
        {"start": 9.0, "end": 12.8, "text": "Specifically, we'll look at their impact on text generation."}
    ]
    # In a real scenario, 'result' would contain the segments and their timestamps
    return {"segments": simulated_segments, "text": " ".join([s['text'] for s in simulated_segments])}

# --- Main Transcription Bot Function ---

def run_whisper_transcription_bot(youtube_talks_list):
    """
    Orchestrates downloading audio and transcribing with Whisper.
    """
    print("Starting Whisper Transcription Bot...")
    
    processed_transcripts = []

    with jsonlines.open(TRANSCRIPTS_OUTPUT_FILE, mode='w') as writer:
        for i, talk_info in enumerate(youtube_talks_list):
            title = talk_info['title']
            url = talk_info['url']
            
            print(f"\nProcessing talk {i+1}/{len(youtube_talks_list)}: {title} ({url})")
            
            # Use video ID as part of the filename for uniqueness
            video_id = url.split('v=')[-1]
            audio_output_file_base = os.path.join(AUDIO_DOWNLOAD_DIR, video_id)
            
            # Step 1: Download Audio
            # In a real scenario, this would involve calling yt-dlp via subprocess
            # Example: subprocess.run(['yt-dlp', '-x', '--audio-format', 'mp3', '-o', audio_output_file_base + '.%(ext)s', url], check=True)
            # For this conceptual demo, we call the placeholder function
            audio_file_path = download_audio_with_yt_dlp(url, AUDIO_DOWNLOAD_DIR)
            
            if not audio_file_path or not os.path.exists(audio_file_path):
                print(f"  Skipping {title}: Audio download failed or file not created.")
                continue

            # Step 2: Transcribe Audio using Whisper
            # This is where the actual Whisper model would be used
            transcription_result = transcribe_audio_with_whisper(audio_file_path)
            
            # Step 3 (Optional, if needed): OCR on video frames (Not directly implemented here)
            # This would be a separate, more complex step involving video processing
            # to extract frames and then apply pytesseract.
            # E.g., using OpenCV to read video frames.
            # print("  [Optional] OCR on video frames not implemented in this demo.")

            # Prepare data for .jsonl
            transcript_entry = {
                "title": title,
                "url": url,
                "audio_file": os.path.basename(audio_file_path),
                "full_transcript": transcription_result.get("text", ""),
                "segments": transcription_result.get("segments", []) # List of {"start", "end", "text"}
            }
            processed_transcripts.append(transcript_entry)
            writer.write(transcript_entry) # Write immediately to .jsonl

            # Clean up audio file after processing to save space if desired
            # os.remove(audio_file_path)
            # print(f"  Removed audio file: {os.path.basename(audio_file_path)}")
            
            time.sleep(1) # Be polite / simulate processing time

    print(f"\nTranscription process complete. Transcripts saved to: {TRANSCRIPTS_OUTPUT_FILE}")
    print(f"Total talks processed: {len(processed_transcripts)}")
    
    return processed_transcripts

# --- Execution ---

if __name__ == "__main__":
    # --- IMPORTANT LOCAL SETUP STEPS ---
    # 1. Install yt-dlp: pip install yt-dlp (or download executable)
    # 2. Install Whisper: pip install openai-whisper
    # 3. Ensure ffmpeg is installed and in your system PATH (required by Whisper)
    #    (Windows: download from ffmpeg.org, add bin to PATH; macOS: brew install ffmpeg)
    # 4. Replace example YOUTUBE_TALKS URLs with actual short NLP conference talks.
    #    It's crucial that these are short (around 3 minutes) for practical reasons.
    # ---

    # If you run this on your local machine, you'd replace the conceptual calls
    # with actual subprocess calls for yt-dlp and direct Whisper model usage.

    # Example of how to use this in a real environment:
    # 1. Manually download audio using yt-dlp in your terminal:
    #    `yt-dlp -x --audio-format mp3 -o 'conference_talk_audio/%(id)s.%(ext)s' https://www.youtube.com/watch?v=VIDEO_ID_HERE`
    # 2. Then, modify this script to iterate through the downloaded audio files
    #    and call `whisper.load_model("base").transcribe(...)` on each.
    
    # For this Jupyter environment, we'll run the conceptual flow.
    transcribed_data = run_whisper_transcription_bot(YOUTUBE_TALKS)

    # Optional: Display a sample of the transcribed data
    if transcribed_data:
        print("\n--- Sample Transcribed Talk ---")
        import pprint
        pprint.pprint(transcribed_data[0])

Starting Whisper Transcription Bot...

Processing talk 1/10: NLP Talk 01 (https://www.youtube.com/watch?v=iXMh5_3IXww)
  [EXTERNAL STEP] Downloading audio for https://www.youtube.com/watch?v=iXMh5_3IXww using yt-dlp...
  Command example: yt-dlp -x --audio-format mp3 -o 'conference_talk_audio/%(title)s.%(ext)s' https://www.youtube.com/watch?v=iXMh5_3IXww
  [ASR STEP] Transcribing audio file: conference_talk_audio\iXMh5_3IXww.mp3 with Whisper...

Processing talk 2/10: NLP Talk 02 (https://www.youtube.com/watch?v=HZyej-kFueA)
  [EXTERNAL STEP] Downloading audio for https://www.youtube.com/watch?v=HZyej-kFueA using yt-dlp...
  Command example: yt-dlp -x --audio-format mp3 -o 'conference_talk_audio/%(title)s.%(ext)s' https://www.youtube.com/watch?v=HZyej-kFueA
  [ASR STEP] Transcribing audio file: conference_talk_audio\HZyej-kFueA.mp3 with Whisper...

Processing talk 3/10: NLP Talk 03 (https://www.youtube.com/watch?v=4BdcAGS1XRU)
  [EXTERNAL STEP] Downloading audio for https://www.youtube.c
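
The run above only simulates the download step. As a rough sketch of what a real invocation could look like, the snippet below builds the yt-dlp argument list described in the script comments (`-x --audio-format mp3 -o ...`) and extracts the video ID with `urllib.parse` rather than `split('v=')`, which would also keep trailing parameters such as `&t=30s`. The helper names `extract_video_id` and `build_yt_dlp_command` are hypothetical, and the command is only constructed here, not executed.

```python
import os
from urllib.parse import urlparse, parse_qs


def extract_video_id(youtube_url):
    """Return the value of the 'v' query parameter, or None if it is absent."""
    query = parse_qs(urlparse(youtube_url).query)
    values = query.get("v")
    return values[0] if values else None


def build_yt_dlp_command(youtube_url, output_dir):
    """Build (but do not run) a yt-dlp command that extracts MP3 audio."""
    # %(id)s and %(ext)s are yt-dlp output-template fields
    output_template = os.path.join(output_dir, "%(id)s.%(ext)s")
    return ["yt-dlp", "-x", "--audio-format", "mp3", "-o", output_template, youtube_url]


url = "https://www.youtube.com/watch?v=iXMh5_3IXww&t=30s"
print(extract_video_id(url))
print(build_yt_dlp_command(url, "conference_talk_audio"))
```

On a machine with yt-dlp and ffmpeg installed, the returned list could be passed to `subprocess.run(cmd, check=True)` so that a failed download raises an error instead of passing silently.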

## Module 4: Data Cleaning & Deduplication

__Task Goal:__ 
End to End Cleaner:
* Merge the outputs from Tasks 1-3 into one dataset.
* Steps: language detection → strip HTML noise → use MinHash for deduplication (similarity ≥ 0.7) → remove PII (emails, credit card numbers, phone numbers) → remove repetitive n-grams.
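
To make the `similarity ≥ 0.7` threshold concrete, here is a minimal sketch that computes the exact Jaccard similarity between word shingles of two short strings. MinHash, as provided by datasketch, estimates this same quantity from compact signatures, so this stdlib-only version is meant for intuition and is not the deduplication method itself. The function names and the shingle size `k=3` are illustrative assumptions.

```python
def word_shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


doc_a = "large language models have changed text generation in recent years"
doc_b = "large language models have changed text generation in the last years"

sim = jaccard(word_shingles(doc_a), word_shingles(doc_b))
print(f"Jaccard similarity: {sim:.2f}")
print("near-duplicate" if sim >= 0.7 else "distinct")
```

The two sentences share 6 of 11 distinct shingles, so they fall below the threshold even though they differ by only a couple of words. Short texts are sensitive to small edits; longer documents with the same edit would score much higher.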

__Core Tools:__
langdetect, datasketch


__Deliverables:__ 
clean_corpus.txt + stats.md (token count, removal percentage)
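
As a small illustration of the PII removal step, the sketch below redacts email addresses and North American style phone numbers with regular expressions. The patterns are deliberately simplified assumptions, the names `EMAIL_RE`, `PHONE_RE`, and `redact` are hypothetical, and the sample text is made up; real PII detection usually needs broader coverage.

```python
import re

# Simplified patterns for illustration only
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
# Numbers such as 555-123-4567 or (555) 123-4567; the lookarounds stop the
# pattern from matching a slice of a longer digit run (e.g. a card number)
PHONE_RE = re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s]?)\d{3}[-.\s]?\d{4}(?!\d)")


def redact(text):
    """Replace emails and phone numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


sample = "Contact jane.doe@example.org or call (555) 123-4567 for details."
print(redact(sample))
```

Replacing matches with tags such as `[EMAIL]` rather than deleting them keeps sentence structure intact, which can matter if the corpus is later used for language modeling.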


In [3]:
!pip install langdetect datasketch requests beautifulsoup4 trafilatura pytesseract pillow pdf2image openai-whisper jsonlines

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting datasketch
  Downloading datasketch-1.6.5-py3-none-any.whl.metadata (5.8 kB)
Collecting openai-whisper
  Downloading openai_whisper-20250625.tar.gz (803 kB)
  Installing build dependencies: started
  In

In [1]:
import os
import json
import re
from langdetect import detect, DetectorFactory
from datasketch import MinHash, MinHashLSH
import jsonlines
from collections import Counter

# Ensure consistent language detection results
DetectorFactory.seed = 0

# --- Configuration ---
# Input file paths from previous modules
ARXIV_CLEAN_JSON = "arxiv_clean.json"
PDF_OCR_DIR = "pdf_ocr"
TALKS_TRANSCRIPTS_JSONL = "talks_transcripts.jsonl" # From Module 3

# Output file paths for this module
CLEAN_CORPUS_FILE = "clean_corpus.txt"
STATS_FILE = "stats.md"

# Deduplication similarity threshold
MINHASH_SIMILARITY_THRESHOLD = 0.7

# --- Helper Functions ---

def load_arxiv_data(filepath):
    """Loads data from arxiv_clean.json."""
    if not os.path.exists(filepath):
        print(f"Warning: {filepath} not found. Skipping arXiv data.")
        return []
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            return json.load(f)
    except json.JSONDecodeError as e:
        print(f"Error reading {filepath}: {e}")
        return []

def load_pdf_ocr_data(directory):
    """Loads OCR'd text from the pdf_ocr directory."""
    if not os.path.exists(directory):
        print(f"Warning: {directory} not found. Skipping PDF OCR data.")
        return []
    
    ocr_texts = []
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            filepath = os.path.join(directory, filename)
            try:
                with open(filepath, 'r', encoding='utf-8') as f:
                    ocr_texts.append({"source": f"pdf_ocr/{filename}", "text": f.read()})
            except Exception as e:
                print(f"Error reading {filepath}: {e}")
    return ocr_texts

def load_talk_transcripts(filepath):
    """Loads talk transcripts from talks_transcripts.jsonl."""
    if not os.path.exists(filepath):
        print(f"Warning: {filepath} not found. Skipping talk transcripts.")
        return []
    
    transcripts = []
    try:
        with jsonlines.open(filepath, mode='r') as reader:
            for obj in reader:
                transcripts.append({"source": obj.get("url", "unknown_url"), "text": obj.get("full_transcript", "")})
    except FileNotFoundError: # jsonlines can also raise this
        print(f"Error reading {filepath}: File not found.")
    except Exception as e: # Catch other potential errors
        print(f"Error reading {filepath}: {e}")
    return transcripts

def strip_html_noise(text):
    """Removes basic HTML tags and entities."""
    if not isinstance(text, str):
        return ""
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Decode a few common HTML entities (a basic approach).
    # '&amp;' is decoded last so that text like '&amp;lt;' is not decoded twice.
    text = re.sub(r'&lt;', '<', text)
    text = re.sub(r'&gt;', '>', text)
    text = re.sub(r'&quot;', '"', text)
    text = re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), text)
    text = re.sub(r'&#x([0-9a-fA-F]+);', lambda m: chr(int(m.group(1), 16)), text)
    text = re.sub(r'&amp;', '&', text)
    
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def detect_language(text):
    """Detects language of a text."""
    if not isinstance(text, str) or len(text.strip()) < 20: # Needs sufficient text for accurate detection
        return None
    try:
        return detect(text)
    except Exception: # Handle cases where language cannot be reliably detected
        return None

def remove_pii(text):
    """Removes common PII (emails, credit card numbers, phone numbers)."""
    if not isinstance(text, str):
        return ""
    # Email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', text)
    # Credit card numbers (simplified, common patterns)
    # Visa: 4[0-9]{12}(?:[0-9]{3})?
    # MasterCard: 5[1-5][0-9]{14}
    # Amex: 3[47][0-9]{13}
    # Discover: 6(?:011|5[0-9]{2})[0-9]{12}
    text = re.sub(r'\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13}|6(?:011|5[0-9]{2})[0-9]{12})\b', '[CREDIT_CARD]', text)
    # Phone numbers (common North American patterns for simplicity)
    # The digit lookarounds prevent matching a slice of a longer digit run
    text = re.sub(r'(?<!\d)\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}(?!\d)', '[PHONE]', text)
    return text

def create_shingles(text, k=9):
    """Generates k-shingles from a text."""
    text = text.lower() # Case-insensitive shingles
    text = re.sub(r'\s+', ' ', text) # Normalize whitespace
    words = text.split()
    if not words:
        return set() # No words, so no shingles can be formed
    if len(words) < k:
        return {" ".join(words)} # If text is shorter than k words, treat the whole text as one shingle
    
    shingles = set()
    for i in range(len(words) - k + 1):
        shingles.add(" ".join(words[i:i+k]))
    return shingles

def remove_repetitive_ngrams(text, n_min=3, n_max=5, threshold=3):
    """
    Removes n-grams that repeat too frequently (e.g., more than 'threshold' times).
    This is a heuristic for boilerplate/noisy text that repeats.
    """
    if not isinstance(text, str):
        return ""
    
    words = text.split()
    if len(words) < n_min:
        return text

    cleaned_text = text

    # Check larger n-grams first so longer repeated phrases are removed before their sub-phrases
    for n in range(n_max, n_min - 1, -1):
        if len(words) < n:
            continue

        # Count n-grams over the original token sequence
        ngram_counts = Counter(tuple(words[i : i + n]) for i in range(len(words) - n + 1))

        for ngram, count in ngram_counts.items():
            if count > threshold:
                # print(f"  Removing repetitive {n}-gram: '{' '.join(ngram)}' (count: {count})")
                # Remove all occurrences of this repetitive n-gram.
                # Tokens come from whitespace splitting, so match on whitespace
                # boundaries; \b fails for tokens that start or end with punctuation.
                pattern = r'(?<!\S)' + re.escape(' '.join(ngram)) + r'(?!\S)'
                cleaned_text = re.sub(pattern, ' ', cleaned_text, flags=re.IGNORECASE)
                cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip() # Normalize whitespace after replacement
    return cleaned_text


# --- Main Cleaning and Deduplication Function ---

def run_end_to_end_cleaner():
    """
    Merges data, cleans, deduplicates, removes PII, and handles repetitive n-grams.
    """
    print("Starting End-to-End Data Cleaner...")

    # 1. Load Data from all Modules
    print("\n--- Loading Data ---")
    arxiv_papers = load_arxiv_data(ARXIV_CLEAN_JSON)
    pdf_ocr_texts = load_pdf_ocr_data(PDF_OCR_DIR)
    talk_transcripts = load_talk_transcripts(TALKS_TRANSCRIPTS_JSONL)

    all_raw_documents = []

    # Add arXiv abstracts
    for paper in arxiv_papers:
        if paper.get('abstract'):
            all_raw_documents.append({
                "source": paper.get('url', 'arxiv_abstract'),
                "text": paper['abstract'],
                "type": "arxiv_abstract"
            })
    
    # Add PDF OCR texts
    for ocr_data in pdf_ocr_texts:
        all_raw_documents.append({
            "source": ocr_data['source'],
            "text": ocr_data['text'],
            "type": "pdf_ocr"
        })

    # Add talk transcripts
    for transcript_data in talk_transcripts:
        all_raw_documents.append({
            "source": transcript_data['source'],
            "text": transcript_data['text'],
            "type": "talk_transcript"
        })

    total_initial_docs = len(all_raw_documents)
    total_initial_tokens = sum(len(doc["text"].split()) for doc in all_raw_documents if doc["text"])
    print(f"Loaded {total_initial_docs} documents with {total_initial_tokens} initial tokens.")

    # 2. Initial Cleaning and Language Detection
    print("\n--- Initial Cleaning & Language Detection ---")
    cleaned_docs = []
    docs_removed_lang = 0
    docs_empty_after_clean = 0

    for i, doc in enumerate(all_raw_documents):
        cleaned_text = strip_html_noise(doc["text"])
        
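        # Skip documents that became empty after cleaning.
        # This check runs first; otherwise empty texts would be counted as language removals.
        if not cleaned_text:
            docs_empty_after_clean += 1
            continue

        # Language detection (only keep English)
        lang = detect_language(cleaned_text)
        if lang != 'en':
            docs_removed_lang += 1
            continue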

        cleaned_docs.append({
            "id": i, # Assign a unique ID for LSH
            "source": doc["source"],
            "type": doc["type"],
            "text": cleaned_text
        })
    
    print(f"Removed {docs_removed_lang} documents due to non-English language or unreliable detection.")
    print(f"Removed {docs_empty_after_clean} documents that became empty after initial cleaning.")
    print(f"Remaining documents after initial cleaning & language detection: {len(cleaned_docs)}")
    
    # 3. Deduplication using MinHash LSH
    print(f"\n--- Deduplication using MinHash LSH (Similarity >= {MINHASH_SIMILARITY_THRESHOLD}) ---")
    num_perm = 128 # Number of permutations for MinHash
    lsh = MinHashLSH(threshold=MINHASH_SIMILARITY_THRESHOLD, num_perm=num_perm)
    
    unique_documents = []
    processed_doc_ids = set() # To track which documents have already been processed/added
    
    document_minhashes = {} # Store MinHash objects by ID
    
    # First pass: Create MinHashes and add to LSH
    print("Building MinHashes and LSH index...")
    for doc in cleaned_docs:
        shingles = create_shingles(doc["text"])
        if not shingles: # Skip if no shingles can be formed
            continue
        m = MinHash(num_perm=num_perm)
        for s in shingles:
            m.update(s.encode('utf8'))
        document_minhashes[doc["id"]] = m
        lsh.insert(doc["id"], m)

    # Second pass: Query for duplicates
    print("Querying for duplicates...")
    docs_removed_dedup = 0
    for doc in cleaned_docs:
        if doc["id"] in processed_doc_ids:
            continue # Already handled as a duplicate or original
        
        current_minhash = document_minhashes.get(doc["id"])
        if not current_minhash:
            continue
            
        # Query LSH for similar items
        # Returns a list of candidate IDs that might be duplicates
        candidates = lsh.query(current_minhash)
        
        # Re-check each candidate against the threshold using the MinHash Jaccard estimate
        # LSH only returns approximate candidates; note that this is still an estimate, not the exact Jaccard similarity
        is_duplicate = False
        for candidate_id in candidates:
            if candidate_id == doc["id"]:
                continue # Don't compare with self
            
            candidate_minhash = document_minhashes.get(candidate_id)
            if candidate_minhash and current_minhash.jaccard(candidate_minhash) >= MINHASH_SIMILARITY_THRESHOLD:
                # This current 'doc' is a duplicate of an already processed 'candidate_id'
                if candidate_id in processed_doc_ids: # Ensure candidate is an original we already kept
                    is_duplicate = True
                    break
        
        if not is_duplicate:
            unique_documents.append(doc)
            processed_doc_ids.add(doc["id"]) # Mark this as an original we're keeping
        else:
            docs_removed_dedup += 1

    print(f"Removed {docs_removed_dedup} documents due to deduplication.")
    print(f"Remaining documents after deduplication: {len(unique_documents)}")

    # 4. PII Removal and Repetitive N-gram Removal
    print("\n--- PII Removal & Repetitive N-gram Removal ---")
    final_corpus_documents = []
    tokens_after_dedup = sum(len(doc["text"].split()) for doc in unique_documents)
    tokens_after_pii_ngrams = 0

    for doc in unique_documents:
        text_after_pii = remove_pii(doc["text"])
        text_after_ngrams = remove_repetitive_ngrams(text_after_pii)
        
        final_corpus_documents.append(text_after_ngrams)
        tokens_after_pii_ngrams += len(text_after_ngrams.split())
    
    # 5. Save Clean Corpus
    print(f"\n--- Saving Clean Corpus to {CLEAN_CORPUS_FILE} ---")
    with open(CLEAN_CORPUS_FILE, 'w', encoding='utf-8') as f:
        for text in final_corpus_documents:
            f.write(text + "\n\n") # Add double newline for separation

    print("Clean corpus saved successfully.")

    # 6. Generate Statistics
    print("\n--- Generating Statistics ---")
    stats_content = []
    stats_content.append("# Corpus Cleaning Statistics\n")
    stats_content.append(f"**Total initial documents:** {total_initial_docs}\n")
    stats_content.append(f"**Total initial tokens:** {total_initial_tokens:,}\n")
    
    stats_content.append("\n## Removal Breakdown:\n")
    stats_content.append(f"- Documents removed (non-English/empty after basic clean): {docs_removed_lang + docs_empty_after_clean}\n")
    stats_content.append(f"- Documents removed (deduplication): {docs_removed_dedup}\n")
    stats_content.append(f"- Final unique documents in corpus: {len(final_corpus_documents)}\n")
    
    final_tokens = sum(len(text.split()) for text in final_corpus_documents)
    stats_content.append(f"- Tokens after deduplication: {tokens_after_dedup:,}\n")
    stats_content.append(f"- Tokens after PII/Repetitive n-gram removal: {final_tokens:,}\n")

    token_removal_percentage = 0
    if total_initial_tokens > 0:
        token_removal_percentage = ((total_initial_tokens - final_tokens) / total_initial_tokens) * 100
    
    stats_content.append(f"\n**Overall Token Removal Percentage:** {token_removal_percentage:.2f}%\n")

    with open(STATS_FILE, 'w', encoding='utf-8') as f:
        f.write("".join(stats_content))

    print(f"Statistics saved to {STATS_FILE}")
    print("\nEnd-to-End Cleaner process complete.")

# --- Execution ---
if __name__ == "__main__":
    # Ensure you have run Modules 1, 2, and conceptually Module 3,
    # and that their output files/directories exist in the expected paths.
    
    # IMPORTANT: The script will only process data that is *actually present*.
    # If arxiv_clean.json, pdf_ocr/, or talks_transcripts.jsonl are missing,
    # it will warn and skip those data sources.
    
    run_end_to_end_cleaner()

Starting End-to-End Data Cleaner...

--- Loading Data ---
Loaded 410 documents with 2099409 initial tokens.

--- Initial Cleaning & Language Detection ---
Removed 2 documents due to non-English language or unreliable detection.
Removed 0 documents that became empty after initial cleaning.
Remaining documents after initial cleaning & language detection: 408

--- Deduplication using MinHash LSH (Similarity >= 0.7) ---
Building MinHashes and LSH index...
Querying for duplicates...
Removed 9 documents due to deduplication.
Remaining documents after deduplication: 399

--- PII Removal & Repetitive N-gram Removal ---

--- Saving Clean Corpus to clean_corpus.txt ---
Clean corpus saved successfully.

--- Generating Statistics ---
Statistics saved to stats.md

End-to-End Cleaner process complete.
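
The log reports document counts at each stage, so the document-level removal rate can be checked by hand. The sketch below uses only the numbers printed above (410 loaded, 2 removed by language filtering, 9 removed by deduplication). The token-level percentage is written to `stats.md` and does not appear in the log, so it is not reproduced here. The helper name `removal_percentage` is hypothetical.

```python
def removal_percentage(initial, final):
    """Percentage of items removed between two stages (0.0 if initial is 0)."""
    if initial <= 0:
        return 0.0
    return (initial - final) / initial * 100


initial_docs = 410                 # "Loaded 410 documents"
after_language = initial_docs - 2  # 2 removed by language filtering
after_dedup = after_language - 9   # 9 removed by deduplication

print(after_dedup)
print(f"{removal_percentage(initial_docs, after_dedup):.2f}%")
```

This gives 399 remaining documents, matching the log, and a document-level removal rate of about 2.68%.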
