# Notebook 2: PDF Downloading and Text Extraction




# 1 . **Introduction and Objectives**
In the previous notebook, we successfully searched the arXiv API to generate a corpus of metadata related to **BiS2-based layered superconductors**.

The objective of this notebook is to transition from metadata to raw data. We will:
1.  Iterate through the corpus generated in step 1.
2.  Download the full-text PDF for each entry.
3.  Extract plain text from these PDFs to prepare for Natural Language Processing (NLP) tasks.


# 2 . Environment Setup


## 2.1 Installation of dependencies
We will use **PyMuPDF** (imported as `fitz`), a high-performance library for data extraction from PDF files. It allows us to access the document structure and extract text with high fidelity.

In [1]:
# Install dependencies
# Using -q to minimize log output for a cleaner notebook presentation
!pip install PyMuPDF -q

[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m24.1/24.1 MB[0m [31m45.3 MB/s[0m eta [36m0:00:00[0m
[?25h

## 2.2 Import Libraries
We utilize a combination of standard Python libraries for file and time management, alongside third-party tools for web requests and data processing.

* **`requests`**: Used to fetch PDF binary data from arXiv URLs.
* **`fitz` (PyMuPDF)**: The core engine for parsing PDF documents and extracting text.
* **`pandas`**: Used to load and manipulate the corpus dataframe generated in the previous notebook.

In [2]:
# IMPORT LIBRARIES


# --- Standard Library ---
import os
import re
import json
import time
import hashlib
import logging
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Optional

# --- Third-Party Data Science & Utilities ---
import requests
import fitz  # PyMuPDF
import pandas as pd

# --- Google Colab Specifics ---
from google.colab import drive
from google.colab import userdata


#  SETUP & INITIALIZATION

# Configure Logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Mount Google Drive
MOUNT_PATH = '/content/drive'

if not os.path.exists(MOUNT_PATH):
    print("üîå Mounting Google Drive...")
    drive.mount(MOUNT_PATH)
else:
    print(f"‚úÖ Drive already mounted at {MOUNT_PATH}")

# Optional: Define Base Directory for the project immediately
# BASE_DIR = Path(MOUNT_PATH) / "MyDrive/Research/PDFs"
# BASE_DIR.mkdir(parents=True, exist_ok=True)

üîå Mounting Google Drive...
Mounted at /content/drive


## 2.3 Global Configuration and Path Management
To maintain a structured and version-controlled workflow, we define a central `CorpusConfig` class. This configuration handles:

1.  **Directory Management:** Automatically creating distinct folders for raw PDFs (`data/raw`) and processed text (`data/processed`).
2.  **Versioning:** Using a `CORPUS_VERSION` flag to separate different experimental runs without overwriting previous data.
3.  **Network Settings:** Defining timeouts and retry logic to handle potential instability when scraping the arXiv server.

In [3]:
class CorpusConfig:
    """
    Central configuration for local corpus management.
    Handles directory structures, versioning, and download settings.
    """

    # --- Metadata ---
    CORPUS_VERSION: str = "v1.0"
    CREATION_DATE: str = datetime.now().strftime("%Y-%m-%d")

    # --- Base Paths ---
    # TFM is the Root Project Folder
    BASE_DIR: Path = Path("/content/drive/My Drive/TFM")

    # --- Main Corpus Directories (Versioned) ---
    # These paths dynamically update based on the VERSION attribute above
    PDF_DIR: Path = BASE_DIR / "data/raw/pdfs" / f"corpus_{CORPUS_VERSION}_pdfs"
    TEXT_DIR: Path = BASE_DIR / "data/processed" / f"corpus_{CORPUS_VERSION}_text_dumps"
    EXTRACTION_DIR: Path = BASE_DIR / "data/processed/extractions" / f"corpus_{CORPUS_VERSION}_txt_extractions"
    LOG_DIR: Path = BASE_DIR / "logs"

    # --- Sample & Testing Directories ---
    # Static paths for testing pipelines without affecting the main corpus
    SAMPLE_PDF_DIR: Path = BASE_DIR / "data/raw/pdfs/sample_pdfs"
    SAMPLE_TEXT_DIR: Path = BASE_DIR / "data/processed/sample_text_dumps"
    SAMPLE_EXTRACTION_DIR: Path = BASE_DIR / "data/processed/extractions/sample_txt_extractions"

    # --- Validation ---
    GOLD_STD_DIR: Path = BASE_DIR / "data/processed/extractions/gold_standard_extractions"

    # --- Network / Scraper Settings ---
    TIMEOUT: int = 15       # Seconds to wait for a request
    MAX_RETRIES: int = 3    # Number of retries for failed downloads
    RETRY_DELAY: int = 2    # Seconds to wait between retries

    @classmethod
    def create_directories(cls):
        """
        Utilities to auto-create the folder structure defined above.
        Uses exist_ok=True to prevent errors if folders already exist.
        """
        directories = [
            cls.PDF_DIR, cls.TEXT_DIR, cls.EXTRACTION_DIR, cls.LOG_DIR,
            cls.SAMPLE_PDF_DIR, cls.SAMPLE_TEXT_DIR, cls.SAMPLE_EXTRACTION_DIR,
            cls.GOLD_STD_DIR
        ]

        print(f"üìÇ Initializing Directory Structure for {cls.CORPUS_VERSION}...")
        for folder in directories:
            folder.mkdir(parents=True, exist_ok=True)
            # print(f"   Checked: {folder}") # Uncomment for verbose output
        print("‚úÖ Directory structure ready.")

# --- Execute Setup ---

# Initialize the folder structure immediately
CorpusConfig.create_directories()

# Quick verify of the active version
print(f"üîß Configuration loaded for Corpus Version: {CorpusConfig.CORPUS_VERSION}")

üìÇ Initializing Directory Structure for v1.0...
‚úÖ Directory structure ready.
üîß Configuration loaded for Corpus Version: v1.0


## 2.4 Loading the Corpus Metadata
We begin by loading the JSON corpus generated in **Notebook 1**. This file contains the metadata (titles, authors, arXiv IDs, links) for the papers we intend to download.

We will:
1.  Load the raw JSON file.
2.  Inspect the JSON structure to confirm data integrity.
3.  Convert the list of papers into a **Pandas DataFrame** for efficient iteration and filtering.

In [5]:
# --- Configuration ---
# Update the filename below to match the specific output from Notebook 1
INPUT_FILENAME = "bis2_corpus_v1_20260114_102041.json"
INPUT_PATH = CorpusConfig.BASE_DIR / "data/corpora/01_raw/v1" / INPUT_FILENAME

# --- Load JSON ---
if INPUT_PATH.exists():
    with open(INPUT_PATH, "r", encoding="utf-8") as f:
        corpus_json = json.load(f)
    print(f"‚úÖ Corpus loaded successfully: {INPUT_FILENAME}")
else:
    raise FileNotFoundError(f"‚ùå Corpus file not found at: {INPUT_PATH}")

# --- JSON Structure Exploration ---
print("\n" + "="*40)
print("üîç Top-level JSON keys:")
print("="*40)
print(list(corpus_json.keys()))

print("\n" + "="*40)
print("üìÑ Inspecting 'papers' field (first entry):")
print("="*40)
# Pretty print the first paper to verify structure
print(json.dumps(corpus_json["papers"][0], indent=2))

print("\n" + "-"*40)
print(f"üìö Total papers in corpus: {len(corpus_json['papers'])}")
print("-"*40)

# --- Convert to DataFrame ---
# We focus on the 'papers' list; metadata about the search itself is ignored here
df_corpus = pd.DataFrame(corpus_json["papers"])

print("\n" + "="*40)
print("üìä DataFrame Summary:")
print("="*40)
print(f"Shape: {df_corpus.shape}")
print(f"Columns: {df_corpus.columns.tolist()}")

# Display the first few rows
df_corpus.head(3)

‚úÖ Corpus loaded successfully: bis2_corpus_v1_20260114_102041.json

üîç Top-level JSON keys:
['metadata', 'papers']

üìÑ Inspecting 'papers' field (first entry):
{
  "arxiv_id": "2406.01263v2",
  "entry_id": "http://arxiv.org/abs/2406.01263v2",
  "doi": "10.7566/JPSJ.93.074707",
  "title": "Pb Substitution Effects on Lattice and Electronic System of the BiS2-based Superconductors La(O F)BiS2",
  "abstract": "We examined the effect of Pb substitution in the layered superconductor LaO0.5F0.5Bi1-xPbxS2 (x=0~0.15) through the measurements of the resistivity, thermal expansion, specific heat, and Seebeck coefficient. These transport and thermal properties show anomalies at certain temperatures (T*) for x${\\geq}$0.08. The large thermal expansion anomalies, specific heat anomalies, and the existence of hystereses in the above measurements indicate a first-order structural phase transition at T*. Additionally, the Seebeck coefficient indicates that the anomalies at T* are related not only 

Unnamed: 0,arxiv_id,entry_id,doi,title,abstract,authors,authors_str,published,updated,year,primary_category,categories,pdf_url,comment,journal_ref
0,2406.01263v2,http://arxiv.org/abs/2406.01263v2,10.7566/JPSJ.93.074707,Pb Substitution Effects on Lattice and Electro...,We examined the effect of Pb substitution in t...,"[Miku Sasaki, Kotaro Inada, Fumito Mori, Takaa...","Miku Sasaki, Kotaro Inada, Fumito Mori, Takaak...",2024-06-03T12:24:51+00:00,2024-06-05T06:50:28+00:00,2024,cond-mat.supr-con,[cond-mat.supr-con],https://arxiv.org/pdf/2406.01263v2,"11 pages, 7 figures, to appear in JPSJ","J. Phys. Soc. Jpn. 93, 074707 (2024)"
1,2405.09129v2,http://arxiv.org/abs/2405.09129v2,,Aging Effects on Superconducting Properties of...,Decomposition of superconductors sometimes bec...,"[Poonam Rani, Rajveer Jha, V. P. S. Awana, Yos...","Poonam Rani, Rajveer Jha, V. P. S. Awana, Yosh...",2024-05-15T06:46:28+00:00,2024-06-25T15:26:31+00:00,2024,cond-mat.supr-con,[cond-mat.supr-con],https://arxiv.org/pdf/2405.09129v2,"13 pages, 6 figures",
2,2308.04081v2,http://arxiv.org/abs/2308.04081v2,,Absence of Tc-Pinning Phenomenon Under High Pr...,"Recently, robustness of superconductivity (tra...","[Yoshikazu Mizuguchi, Kazuki Yamane, Ryo Matsu...","Yoshikazu Mizuguchi, Kazuki Yamane, Ryo Matsum...",2023-08-08T06:38:27+00:00,2023-09-12T03:06:13+00:00,2023,cond-mat.supr-con,"[cond-mat.supr-con, cond-mat.mtrl-sci]",https://arxiv.org/pdf/2308.04081v2,"10 pages, 4 figures, to appear in JPSJ",


In [None]:
# --- CREATE A MANUAL SAMPLE FROM THE CORPUS ---

sample_folder = "/content/drive/MyDrive/TFM/data/raw/pdfs/sample_pdfs"

sample_ids = [
    '1508.04820v1', '1701.07575v1', '2001.07928v1', '1712.06815v1',
    '1210.1305v1', '1508.01656v1', '1409.2189v2', '1306.3346v2',
    '1404.6359v2', '1810.08404v3'
]

# Filter corpus using predefined arXiv identifiers
sample_df = df_corpus[df_corpus["arxiv_id"].isin(sample_ids)].copy()

# --- BASIC VERIFICATION ---

print(f"Requested papers: {len(sample_ids)}")
print(f"Papers found in corpus: {len(sample_df)}")

missing_ids = set(sample_ids) - set(sample_df["arxiv_id"])
if missing_ids:
    print("\nMissing arXiv IDs:")
    for mid in sorted(missing_ids):
        print(f" - {mid}")
else:
    print("\nAll requested papers were found in the corpus.")

sample_df.head()


Requested papers: 10
Papers found in corpus: 10

All requested papers were found in the corpus.


Unnamed: 0,arxiv_id,entry_id,doi,title,abstract,authors,authors_str,published,updated,year,primary_category,categories,pdf_url,comment,journal_ref
17,2001.07928v1,http://arxiv.org/abs/2001.07928v1,10.7566/JPSJ.89.064702,Bulk superconductivity induced by Se substitut...,We report the Se substitution effects on the c...,"[Ryosuke Kiyama, Yosuke Goto, Kazuhisa Hoshi, ...","Ryosuke Kiyama, Yosuke Goto, Kazuhisa Hoshi, R...",2020-01-22T09:35:42+00:00,2020-01-22T09:35:42+00:00,2020,cond-mat.supr-con,"[cond-mat.supr-con, cond-mat.mtrl-sci]",https://arxiv.org/pdf/2001.07928v1,"15 pages, 8 figures",
22,1810.08404v3,http://arxiv.org/abs/1810.08404v3,,Bulk superconductivity in La2O2M4S6-type layer...,"Recently, we reported the observation of super...","[Rajveer Jha, Yosuke Goto, Tatsuma D. Matsuda,...","Rajveer Jha, Yosuke Goto, Tatsuma D. Matsuda, ...",2018-10-19T08:55:02+00:00,2019-04-15T07:46:09+00:00,2018,cond-mat.supr-con,[cond-mat.supr-con],https://arxiv.org/pdf/1810.08404v3,"18 Pages, 8 figures. The title has been change...","Scientific Reports 9, 13346 (2019)"
27,1712.06815v1,http://arxiv.org/abs/1712.06815v1,10.7566/JPSJ.87.023704,Evolution of Anisotropic Displacement Paramete...,In order to understand the mechanisms behind t...,"[Yoshikazu Mizuguchi, Kazuhisa Hoshi, Yosuke G...","Yoshikazu Mizuguchi, Kazuhisa Hoshi, Yosuke Go...",2017-12-19T08:25:07+00:00,2017-12-19T08:25:07+00:00,2017,cond-mat.supr-con,"[cond-mat.supr-con, cond-mat.mtrl-sci]",https://arxiv.org/pdf/1712.06815v1,,"J. Phys. Soc. Jpn. 87, 023704 (2018)"
34,1701.07575v1,http://arxiv.org/abs/1701.07575v1,10.1088/1742-6596/871/1/012007,Synchrotron powder X-ray diffraction and struc...,Eu0.5La0.5FBiS2-xSex is a new BiS2-based super...,"[K. Nagasaka, G. Jinno, O. Miura, A. Miura, C....","K. Nagasaka, G. Jinno, O. Miura, A. Miura, C. ...",2017-01-26T04:36:20+00:00,2017-01-26T04:36:20+00:00,2017,cond-mat.supr-con,[cond-mat.supr-con],https://arxiv.org/pdf/1701.07575v1,"5 pages, 2 figures, to appear in proceedings o...",
47,1508.04820v1,http://arxiv.org/abs/1508.04820v1,10.1016/j.ssc.2016.07.001,Applying experimental constraints to a one-dim...,"Recent ARPES measurements [Phys. Rev. B 92, 04...","[M. A. Griffith, K. Foyevtsova, M. A. Continen...","M. A. Griffith, K. Foyevtsova, M. A. Continent...",2015-08-19T22:39:35+00:00,2015-08-19T22:39:35+00:00,2015,cond-mat.supr-con,"[cond-mat.supr-con, cond-mat.str-el]",https://arxiv.org/pdf/1508.04820v1,4pages+references+supplemental material,"Solid State Communications 244,57 (2016)"


## 2.5 Creating a Test Subset (Optional)
To test our download pipeline without processing the entire corpus, we generate a small, reproducible random sample. This allows us to debug connection issues or file handling logic quickly before committing to the full dataset.

* **Sample Size:** Defined by `NUM_PAPERS`.
* **Reproducibility:** Controlled by `RANDOM_SEED` to ensure we test on the same papers every time we run this cell.

In [9]:
# --- Configuration ---
NUM_PAPERS = 10
RANDOM_SEED = 42

# --- Create Sample ---
# We sample directly from the full dataframe loaded in the previous step
test_corpus = df_corpus.sample(n=NUM_PAPERS, random_state=RANDOM_SEED)

print("\n" + "="*40)
print(f"üß™ Test Corpus Generated")
print("="*40)
print(f"Size: {len(test_corpus)} papers")
print(f"Seed: {RANDOM_SEED}")

# Display the sample arXiv IDs to verify variety
print(f"Sample IDs: {test_corpus['arxiv_id'].tolist()}")


üß™ Test Corpus Generated
Size: 10 papers
Seed: 42
Sample IDs: ['1411.6903v1', '1603.02819v1', '1911.02337v1', '1708.03840v1', '1310.1213v2', '1410.6775v2', '1402.1833v1', '1909.01710v1', '1308.1072v3', '1801.06568v1']


# 3 . Core Logic: PDF Download Function
We define a robust function to handle the retrieval of PDF files. This function is designed with several safeguards to ensure data integrity and respect server limits:

1.  **URL Normalization:** Automatically converts arXiv abstract URLs (`/abs/`) to direct PDF links (`/pdf/`).
2.  **Idempotency:** Checks if the file already exists locally to prevent redundant downloads and save bandwidth.
3.  **Resilience:** Implements a retry mechanism (backoff) to handle network timeouts or temporary server errors.
4.  **Integrity:** Calculates an MD5 checksum of the downloaded content to verify that the file is not corrupted.
5.  **Dynamic Storage:** Accepts a target directory argument, allowing us to easily switch between saving to the "Sample" folder or the "Full Corpus" folder.

In [18]:
def download_pdf(
    arxiv_id: str,
    pdf_url: str,
    save_directory: Path = CorpusConfig.PDF_DIR
) -> tuple[str, Optional[Path]]: # Changed return type
    """
    Download a single PDF and save it with the ArXiv ID as the filename.

    Args:
        arxiv_id (str): ArXiv identifier (e.g., "2301.12345").
        pdf_url (str): The URL to the PDF resource.
        save_directory (Path): The folder where the PDF should be saved.
                               Defaults to the main PDF corpus directory.

    Returns:
        tuple[str, Optional[Path]]: A tuple containing a status string ('downloaded', 'skipped', 'failed')
                                     and the path to the saved PDF, or None if the download failed.
    """
    # --- 1. Normalize URL ---
    # Ensure we are targeting the PDF binary, not the abstract HTML page
    if "arxiv.org/abs/" in pdf_url:
        pdf_url = pdf_url.replace("/abs/", "/pdf/")
    if not pdf_url.endswith(".pdf"):
        pdf_url += ".pdf"

    # --- 2. Define Target Path ---
    pdf_path = save_directory / f"{arxiv_id}.pdf"

    # --- 3. Check for Existing File ---
    if pdf_path.exists():
        logging.info(f"‚è≠Ô∏è  Skipped (Exists): {arxiv_id}")
        return "skipped", pdf_path # Return 'skipped' status

    # --- 4. Download Execution ---
    # User-Agent is set to indicate academic research and avoid being blocked as a generic bot
    headers = {'User-Agent': 'Mozilla/5.0 (Academic Research; BiS2-Project)'}

    for attempt in range(CorpusConfig.MAX_RETRIES):
        try:
            response = requests.get(
                pdf_url,
                headers=headers,
                timeout=CorpusConfig.TIMEOUT
            )

            if response.status_code == 200:
                # Save binary content to disk
                with open(pdf_path, 'wb') as f:
                    f.write(response.content)

                # Calculate checksum for integrity verification
                checksum = hashlib.md5(response.content).hexdigest()
                logging.info(f"‚¨áÔ∏è  Downloaded: {arxiv_id} (MD5: {checksum[:8]})")

                return "downloaded", pdf_path # Return 'downloaded' status

            elif response.status_code == 404:
                logging.warning(f"‚ùå PDF not found (404): {arxiv_id}")
                return "failed", None # Return 'failed' status

            else:
                logging.warning(f"‚ö†Ô∏è  Status {response.status_code} for {arxiv_id}")

        except Exception as e:
            logging.warning(f"‚ö†Ô∏è  Attempt {attempt+1} failed for {arxiv_id}: {e}")
            if attempt < CorpusConfig.MAX_RETRIES - 1:
                time.sleep(CorpusConfig.RETRY_DELAY)

    logging.error(f"‚ùå Failed to download {arxiv_id} after {CorpusConfig.MAX_RETRIES} attempts")
    return "failed", None # Return 'failed' status

## 3.1 Batch Processing and Manifest Generation
We now implement the higher-level logic to process the entire DataFrame.

### `build_local_corpus`
This function iterates through the list of papers and triggers the download for each. It tracks success/failure statistics to provide a summary report.

### `save_corpus_manifest`
Crucially, we generate a **Manifest File (`.json`)** alongside the PDF downloads. This manifest serves as a snapshot of the local dataset, recording:
* Which files were successfully downloaded.
* File sizes and MD5 checksums (for verifying data integrity later).
* The relative path to each file.

In [19]:
def save_corpus_manifest(
    metadata_df: pd.DataFrame,
    download_stats: Dict,
    save_directory: Path,
    filename: str = "corpus_manifest.json"
):
    """
    Create a detailed JSON manifest documenting the downloaded corpus.
    This ensures we have a snapshot of exactly what files exist locally.
    """

    manifest = {
        "corpus_version": CorpusConfig.CORPUS_VERSION,
        "creation_date": CorpusConfig.CREATION_DATE,
        "total_papers": len(metadata_df),
        "download_stats": download_stats,
        "papers": []
    }

    print(f"üìù Generating manifest in: {save_directory.name}...")

    for idx, row in metadata_df.iterrows():
        arxiv_id = row['arxiv_id']
        pdf_path = save_directory / f"{arxiv_id}.pdf"

        paper_info = {
            "arxiv_id": arxiv_id,
            "title": row.get('title', 'Unknown'),
            "published": row.get('published', 'Unknown'),
            "pdf_url": row['pdf_url'],
            # Store path relative to project root for portability
            "local_path": str(pdf_path.relative_to(CorpusConfig.BASE_DIR)) if pdf_path.exists() else None,
            "file_exists": pdf_path.exists()
        }

        if pdf_path.exists():
            # Add file metadata
            paper_info["file_size_kb"] = round(pdf_path.stat().st_size / 1024, 2)
            try:
                with open(pdf_path, 'rb') as f:
                    paper_info["md5_checksum"] = hashlib.md5(f.read()).hexdigest()
            except Exception as e:
                paper_info["md5_checksum"] = "error_reading_file"

        manifest["papers"].append(paper_info)

    # Save manifest to the same directory as the PDFs
    manifest_path = save_directory / filename
    with open(manifest_path, 'w', encoding='utf-8') as f:
        json.dump(manifest, f, indent=2)

    logging.info(f"Manifest saved: {manifest_path}")
    print(f"‚úÖ Manifest saved: {manifest_path}")


def build_local_corpus(
    metadata_df: pd.DataFrame,
    save_directory: Path = CorpusConfig.PDF_DIR,
    manifest_filename: str = "corpus_manifest.json"
) -> Dict:
    """
    Orchestrate the download of all PDFs in the metadata DataFrame.

    Args:
        metadata_df: DataFrame containing 'arxiv_id' and 'pdf_url'.
        save_directory: Target folder for PDFs (Sample or Full).
        manifest_filename: Name for the summary JSON file.

    Returns:
        Dictionary containing execution statistics.
    """
    stats = {
        "total": len(metadata_df),
        "success": 0,
        "failed": 0,
        "already_existed": 0, # Added for clarity
        "failed_ids": []
    }

    logging.info(f"Starting download for {len(metadata_df)} papers into {save_directory.name}")
    print(f"\n{'='*60}")
    print(f"üöÄ STARTING CORPUS BUILD")
    print(f"Target Directory: {save_directory}")
    print(f"{'-'*60}") # Changed from '=' for visual distinction
    print(f"Total papers to process: {stats['total']}") # Added for clarity
    print(f"{'-'*60}\n") # Changed from '=' for visual distinction

    for idx, row in metadata_df.iterrows():
        arxiv_id = row['arxiv_id']

        # Pass the dynamic save_directory to the download function
        download_status, result_path = download_pdf(arxiv_id, row['pdf_url'], save_directory=save_directory) # Unpack status

        # Update Statistics based on status
        if download_status == "downloaded":
            stats["success"] += 1
        elif download_status == "skipped":
            stats["success"] += 1 # Count as success as the file is available
            stats["already_existed"] += 1 # Increment 'already_existed'
        else: # download_status == "failed"
            stats["failed"] += 1
            stats["failed_ids"].append(arxiv_id)

        # Progress update every 10 papers
        if (idx + 1) % 10 == 0 or (idx + 1) == len(metadata_df): # Added (idx+1)==len(metadata_df) to ensure final update
            print(f"‚è≥ Progress: {idx+1}/{len(metadata_df)}... (Downloaded: {stats['success'] - stats['already_existed']}, Skipped: {stats['already_existed']}, Failed: {stats['failed']})") # Detailed progress

        # Rate Limiting: Be polite to ArXiv
        time.sleep(1)

    # --- Summary ---
    print(f"\n{'='*60}")
    print("üèÅ DOWNLOAD COMPLETE")
    print(f"{'-'*60}") # Changed from '=' for visual distinction
    print(f"Total processed: {stats['total']}")
    print(f"Successfully retrieved (new downloads + existing): {stats['success']} ({(stats['success']/stats['total'])*100:.1f}%) -- New Downloads: {stats['success'] - stats['already_existed']}, Skipped (Already Existed): {stats['already_existed']}") # Detailed success breakdown
    print(f"Failed:  {stats['failed']}")

    if stats['failed_ids']:
        print(f"‚ö†Ô∏è Failed IDs: {stats['failed_ids']}")

    # --- Generate Manifest ---
    save_corpus_manifest(metadata_df, stats, save_directory, manifest_filename)

    return stats

## 3.2 Execution: Downloading the Sample Corpus
We first run the pipeline on the `test_corpus` (10 papers). This verifies that:
1.  The connection to arXiv is working.
2.  PDFs are saving correctly to the `sample_pdfs` folder.
3.  The manifest JSON is generated accurately.

In [20]:
# --- Run Pipeline on Sample ---
sample_stats = build_local_corpus(
    metadata_df=test_corpus,
    save_directory=CorpusConfig.SAMPLE_PDF_DIR,
    manifest_filename="sample_corpus_manifest.json"
)

# --- Verification ---
# List the files in the sample directory to confirm
print(f"\nüìÇ Contents of {CorpusConfig.SAMPLE_PDF_DIR.name}:")
for f in list(CorpusConfig.SAMPLE_PDF_DIR.glob("*.pdf"))[:5]:  # Show first 5
    print(f" - {f.name} ({round(f.stat().st_size / 1024)} KB)")


üöÄ STARTING CORPUS BUILD
Target Directory: /content/drive/My Drive/TFM/data/raw/pdfs/sample_pdfs
------------------------------------------------------------
Total papers to process: 10
------------------------------------------------------------

‚è≥ Progress: 20/10... (Downloaded: 0, Skipped: 3, Failed: 0)
‚è≥ Progress: 70/10... (Downloaded: 0, Skipped: 7, Failed: 0)

üèÅ DOWNLOAD COMPLETE
------------------------------------------------------------
Total processed: 10
Successfully retrieved (new downloads + existing): 10 (100.0%) -- New Downloads: 0, Skipped (Already Existed): 10
Failed:  0
üìù Generating manifest in: sample_pdfs...
‚úÖ Manifest saved: /content/drive/My Drive/TFM/data/raw/pdfs/sample_pdfs/sample_corpus_manifest.json

üìÇ Contents of sample_pdfs:
 - 1411.6903v1.pdf (602 KB)
 - 1603.02819v1.pdf (267 KB)
 - 1911.02337v1.pdf (829 KB)
 - 1708.03840v1.pdf (1497 KB)
 - 1310.1213v2.pdf (289 KB)


## 3.3 Preparation for Full Corpus Download
Before executing the download on the entire dataset, we define a flexible loading utility. This function ensures we can ingest metadata from either JSON or CSV formats and performs a critical check on the `pdf_url` column to identify missing links.

* **Robustness:** Handles different file extensions automatically.
* **Validation:** Checks for missing PDF URLs, which would cause the download loop to fail.

In [None]:
# Assuming your JSON file is corpus_manifest.json as created previously

json_file_path = Path("/content/drive/MyDrive/TFM/data/raw/pdfs/corpus_v1.0_pdfs/corpus_manifest.json")

if json_file_path.exists():
    with open(json_file_path, 'r') as f:
        json_data = json.load(f)

    print(f"Loaded JSON file: {json_file_path}")
    print(f"Top-level type: {type(json_data)}")

    if isinstance(json_data, list):
        print(f"Number of items in the list: {len(json_data)}")
    elif isinstance(json_data, dict):
        print(f"Number of top-level keys: {len(json_data.keys())}")
        print(f"Top-level keys: {list(json_data.keys())}")

    # Optionally, display a part of the data to see its structure
    # For large files, print only a small sample
    print("\nFirst few items/keys and their values (truncated if large):")
    if isinstance(json_data, list):
        for i, item in enumerate(json_data[:2]): # Display first 2 items
            print(f"Item {i}: {json.dumps(item, indent=2)[:500]}...") # Truncate output
    elif isinstance(json_data, dict):
        for key, value in list(json_data.items()): # Display first 3 key-value pairs
            print(f"Key '{key}': {json.dumps(value, indent=2)[:500]}...") # Truncate output

else:
    print(f"Error: JSON file not found at {json_file_path}")


Loaded JSON file: /content/drive/MyDrive/TFM/data/raw/pdfs/corpus_v1.0_pdfs/corpus_manifest.json
Top-level type: <class 'dict'>
Number of top-level keys: 5
Top-level keys: ['corpus_version', 'creation_date', 'total_papers', 'download_stats', 'papers']

First few items/keys and their values (truncated if large):
Key 'corpus_version': "v1.0"...
Key 'creation_date': "2026-01-14"...
Key 'total_papers': 130...
Key 'download_stats': {
  "total": 130,
  "success": 130,
  "failed": 0,
  "already_existed": 130,
  "failed_ids": []
}...
Key 'papers': [
  {
    "arxiv_id": "2406.01263v2",
    "title": "Pb Substitution Effects on Lattice and Electronic System of the BiS2-based Superconductors La(O F)BiS2",
    "published": "2024-06-03T12:24:51+00:00",
    "pdf_url": "https://arxiv.org/pdf/2406.01263v2",
    "local_path": "data/raw/pdfs/corpus_v1.0_pdfs/2406.01263v2.pdf",
    "file_exists": true,
    "file_size_kb": 3881.12890625,
    "md5_checksum": "308efa46a8624aba94793d0bdf23f2d1"
  },
  {
    "

## 3.4 Text Extraction Logic
Now that we have the PDF files, we need to extract the raw text content to build our NLP dataset. We use `fitz` (PyMuPDF) to iterate through the pages of each PDF.

We define two functions:
1.  **`extract_text_from_pdf`**: A pure function that takes a specific PDF file and saves the text to a target folder. It also calculates metadata (word count, page count).
2.  **`batch_extract_texts`**: A wrapper that iterates through a source directory (e.g., `sample_pdfs` or `corpus_pdfs`) and triggers the extraction for every file found.

In [25]:
def extract_text_from_pdf(
    pdf_path: Path,
    output_dir: Path
) -> Dict:
    """
    Extract full text from a specific local PDF file using PyMuPDF.

    Args:
        pdf_path: Path object pointing to the source PDF.
        output_dir: Path object pointing to where the .txt file should be saved.

    Returns:
        Dict containing metadata (word count, extraction status, etc.)
    """
    arxiv_id = pdf_path.stem  # Extract filename without extension (e.g., '2301.12345')
    text_filename = f"{arxiv_id}_full.txt"
    output_path = output_dir / text_filename

    metadata = {
        "arxiv_id": arxiv_id,
        "page_count": 0,
        "word_count": 0,
        "char_count": 0,
        "extraction_method": "pymupdf",
        "source_file": pdf_path.name,
        "error": None
    }

    # --- Check for existing text file ---
    if output_path.exists():
        logging.info(f"‚è≠Ô∏è  Skipped text extraction (Exists): {arxiv_id}")
        # Optionally, you could try to load existing metadata here if needed.
        # For simplicity, we just mark it as skipped and return basic info.
        metadata["status"] = "skipped_existing"
        metadata["char_count"] = output_path.stat().st_size # Approximate size without reading file
        try:
            with open(output_path, 'r', encoding='utf-8') as f:
                content = f.read()
                metadata["word_count"] = len(content.split())
        except Exception: # Handle potential issues reading the file just for word count
            metadata["word_count"] = 0
        return metadata

    try:
        # Open PDF
        with fitz.open(pdf_path) as doc:
            metadata["page_count"] = len(doc)
            full_text = ""

            # Iterate pages
            for page in doc:
                # Get text blocks
                full_text += page.get_text() + "\n\n"

        # Update metadata
        metadata["char_count"] = len(full_text)
        metadata["word_count"] = len(full_text.split())
        metadata["status"] = "extracted"

        # Save Text to Disk
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(full_text)

        logging.info(f"‚úÖ Extracted: {arxiv_id} ({metadata['word_count']} words)")

    except Exception as e:
        logging.error(f"‚ùå Error extracting {arxiv_id}: {e}")
        metadata["error"] = str(e)
        metadata["status"] = "failed_extraction"

    return metadata


def batch_extract_texts(
    source_pdf_dir: Path,
    target_text_dir: Path
) -> List[Dict]:
    """
    Iterate through all PDFs in the source directory and extract text.

    Args:
        source_pdf_dir: Directory containing .pdf files.
        target_text_dir: Directory where .txt files will be saved.

    Returns:
        List of dictionaries containing extraction metadata for the processed batch.
    """
    # Ensure target directory exists
    target_text_dir.mkdir(parents=True, exist_ok=True)

    # Find all PDFs
    pdf_files = list(source_pdf_dir.glob("*.pdf"))

    print(f"\n{'='*60}")
    print(f"üìÑ STARTING TEXT EXTRACTION")
    print(f"Source: {source_pdf_dir.name}")
    print(f"Target: {target_text_dir.name}")
    print(f"Files to process: {len(pdf_files)}")
    print(f"{'-'*60}\n") # Changed from '=' for visual distinction

    results = []

    for i, pdf_path in enumerate(pdf_files):
        # Execute Extraction
        meta = extract_text_from_pdf(pdf_path, target_text_dir)
        results.append(meta)

        # Progress logging
        if (i + 1) % 10 == 0 or (i + 1) == len(pdf_files):
            # Count actual extractions vs. skipped ones for clearer progress
            extracted_count = sum(1 for r in results if r.get('status') == 'extracted')
            skipped_count = sum(1 for r in results if r.get('status') == 'skipped_existing')
            failed_count = sum(1 for r in results if r.get('status') == 'failed_extraction')

            print(f"[{datetime.now().strftime('%H:%M:%S')}] Processed {i+1}/{len(pdf_files)}... (Extracted: {extracted_count}, Skipped: {skipped_count}, Failed: {failed_count})")

    print(f"\n‚úÖ Extraction Complete!")
    return results

## 3.5 Execution: Extracting Text from Sample PDFs
We run the extraction pipeline on our downloaded **sample PDFs**.
This serves as a quality check:
1.  Verify that `fitz` can open the downloaded files (checking for corruption).
2.  Inspect the resulting `.txt` files to ensure character encoding (UTF-8) is correct.

In [23]:
# --- Run Extraction on Sample ---
sample_extraction_results = batch_extract_texts(
    source_pdf_dir=CorpusConfig.SAMPLE_PDF_DIR,
    target_text_dir=CorpusConfig.SAMPLE_TEXT_DIR
)

# --- Verification ---
# Display the first few results to check word counts
print("\nüîç Sample Extraction Stats:")
df_sample_stats = pd.DataFrame(sample_extraction_results)
# Using standard pandas display for clearer table formatting
display(df_sample_stats[['arxiv_id', 'word_count', 'error', 'status']].head(10))


üìÑ STARTING TEXT EXTRACTION
Source: sample_pdfs
Target: sample_text_dumps
Files to process: 10
------------------------------------------------------------

[12:27:32] Processed 10/10... (Extracted: 0, Skipped: 10, Failed: 0)

‚úÖ Extraction Complete!

üîç Sample Extraction Stats:


Unnamed: 0,arxiv_id,word_count,error,status
0,1411.6903v1,7408,,skipped_existing
1,1603.02819v1,9531,,skipped_existing
2,1911.02337v1,3149,,skipped_existing
3,1708.03840v1,3790,,skipped_existing
4,1310.1213v2,2554,,skipped_existing
5,1410.6775v2,2955,,skipped_existing
6,1402.1833v1,4196,,skipped_existing
7,1909.01710v1,4379,,skipped_existing
8,1308.1072v3,4676,,skipped_existing
9,1801.06568v1,4446,,skipped_existing


# 4 . Execution

## 4.1 Full Corpus Download
**‚ö†Ô∏è Warning: Long-Running Process**
This cell triggers the download for the entire dataset (`corpus_df`). Depending on the number of papers and the `time.sleep` interval (politeness policy), this may take a significant amount of time.

* **Input:** `corpus_df` (Metadata from Notebook 1)
* **Output:** PDFs saved to `data/raw/pdfs/corpus_vX.X_pdfs`
* **Manifest:** A `corpus_manifest.json` will be generated in the same folder.

In [21]:
# --- 1. Execute Full Download ---
# We pass the main configuration directory for the full corpus
full_download_stats = build_local_corpus(
    metadata_df=corpus_df,
    save_directory=CorpusConfig.PDF_DIR,
    manifest_filename="corpus_manifest.json"
)

# --- 2. Save Statistics for Reference ---
# It's good practice to save the run statistics to a log file
log_path = CorpusConfig.LOG_DIR / f"download_stats_{datetime.now().strftime('%Y%m%d')}.json"
with open(log_path, 'w') as f:
    json.dump(full_download_stats, f, indent=2)

print(f"\nüìù Detailed download logs saved to: {log_path}")


üöÄ STARTING CORPUS BUILD
Target Directory: /content/drive/My Drive/TFM/data/raw/pdfs/corpus_v1.0_pdfs
------------------------------------------------------------
Total papers to process: 130
------------------------------------------------------------

‚è≥ Progress: 10/130... (Downloaded: 0, Skipped: 10, Failed: 0)
‚è≥ Progress: 20/130... (Downloaded: 0, Skipped: 20, Failed: 0)
‚è≥ Progress: 30/130... (Downloaded: 0, Skipped: 30, Failed: 0)
‚è≥ Progress: 40/130... (Downloaded: 0, Skipped: 40, Failed: 0)
‚è≥ Progress: 50/130... (Downloaded: 0, Skipped: 50, Failed: 0)
‚è≥ Progress: 60/130... (Downloaded: 0, Skipped: 60, Failed: 0)
‚è≥ Progress: 70/130... (Downloaded: 0, Skipped: 70, Failed: 0)
‚è≥ Progress: 80/130... (Downloaded: 0, Skipped: 80, Failed: 0)
‚è≥ Progress: 90/130... (Downloaded: 0, Skipped: 90, Failed: 0)
‚è≥ Progress: 100/130... (Downloaded: 0, Skipped: 100, Failed: 0)
‚è≥ Progress: 110/130... (Downloaded: 0, Skipped: 110, Failed: 0)
‚è≥ Progress: 120/130... (Downloade

## 4.2 Full Corpus Text Extraction
Once the PDFs are secured locally, we process them into plain text. This data will form the primary input for the NLP and Knowledge Graph construction in subsequent notebooks.

* **Input:** PDFs from `data/raw/pdfs/corpus_vX.X_pdfs`
* **Output:** Text files saved to `data/processed/corpus_vX.X_text_dumps`

In [30]:
# --- 1. Execute Full Extraction ---
# This iterates through the directory populated in the previous step
full_extraction_results = batch_extract_texts(
    source_pdf_dir=CorpusConfig.PDF_DIR,
    target_text_dir=CorpusConfig.TEXT_DIR
)

# --- 2. Save Extraction Metadata ---
# We convert the list of dictionaries (metadata) into a DataFrame and save it.
# This serves as an index for our text corpus (e.g., mapping arXiv IDs to word counts).
df_extraction_meta = pd.DataFrame(full_extraction_results)

meta_save_path = f"/content/drive/MyDrive/TFM/data/processed/corpus_v1.0_text_dumps/text_extraction_metadata_{CorpusConfig.CORPUS_VERSION}.csv"
df_extraction_meta.to_csv(meta_save_path, index=False)

print(f"\n‚úÖ Extraction metadata saved to: {meta_save_path}")
print("üéâ Notebook 2 Complete. Data is ready for NLP processing.")


üìÑ STARTING TEXT EXTRACTION
Source: corpus_v1.0_pdfs
Target: corpus_v1.0_text_dumps
Files to process: 130
------------------------------------------------------------

[12:37:31] Processed 10/130... (Extracted: 0, Skipped: 10, Failed: 0)
[12:37:32] Processed 20/130... (Extracted: 0, Skipped: 20, Failed: 0)
[12:37:32] Processed 30/130... (Extracted: 0, Skipped: 30, Failed: 0)
[12:37:32] Processed 40/130... (Extracted: 0, Skipped: 40, Failed: 0)
[12:37:32] Processed 50/130... (Extracted: 0, Skipped: 50, Failed: 0)
[12:37:32] Processed 60/130... (Extracted: 0, Skipped: 60, Failed: 0)
[12:37:32] Processed 70/130... (Extracted: 0, Skipped: 70, Failed: 0)
[12:37:32] Processed 80/130... (Extracted: 0, Skipped: 80, Failed: 0)
[12:37:32] Processed 90/130... (Extracted: 0, Skipped: 90, Failed: 0)
[12:37:32] Processed 100/130... (Extracted: 0, Skipped: 100, Failed: 0)
[12:37:32] Processed 110/130... (Extracted: 0, Skipped: 110, Failed: 0)
[12:37:32] Processed 120/130... (Extracted: 0, Skipped:

In [None]:
# Folder size verification

def get_folder_size(path: Path) -> tuple[int, int]:
    """
    Calculates the total size and number of files in a given directory.

    Args:
        path: The Path object of the directory.

    Returns:
        A tuple containing (total_size_bytes, num_files).
    """
    total_size = 0
    num_files = 0
    if path.exists() and path.is_dir():
        for file_path in path.rglob('*'):
            if file_path.is_file():
                total_size += file_path.stat().st_size
                num_files += 1
    return total_size, num_files

def format_size(size_bytes: int) -> str:
    """
    Formats a size in bytes to a human-readable string (KB, MB, GB).
    """
    if size_bytes < 1024: # Bytes
        return f"{size_bytes} B"
    elif size_bytes < 1024**2: # Kilobytes
        return f"{size_bytes / 1024:.2f} KB"
    elif size_bytes < 1024**3: # Megabytes
        return f"{size_bytes / (1024**2):.2f} MB"
    else: # Gigabytes
        return f"{size_bytes / (1024**3):.2f} GB"

print("\n" + "="*40)
print("FOLDER SIZE VERIFICATION")
print("="*40)

# Verify main text dump folder
main_text_folder = CorpusConfig.TEXT_DIR
main_size, main_files = get_folder_size(main_text_folder)
print(f"Main Text Dump Folder: {main_text_folder}")
print(f"  - Total Size: {format_size(main_size)}")
print(f"  - Number of Files: {main_files}")

print("\n" + "-"*40)

# Verify sample text dump folder
sample_text_folder = CorpusConfig.SAMPLE_TEXT_DIR
sample_size, sample_files = get_folder_size(sample_text_folder)
print(f"Sample Text Dump Folder: {sample_text_folder}")
print(f"  - Total Size: {format_size(sample_size)}")
print(f"  - Number of Files: {sample_files}")

print("\n" + "="*40)


FOLDER SIZE VERIFICATION
Main Text Dump Folder: /content/drive/My Drive/TFM/data/processed/corpus_v1.0_text_dumps
  - Total Size: 2.99 MB
  - Number of Files: 130

----------------------------------------
Sample Text Dump Folder: /content/drive/My Drive/TFM/data/processed/sample_text_dumps
  - Total Size: 226.20 KB
  - Number of Files: 10



# 5 . Final Verification: Data Volume and Integrity
As a final step, we programmatically verify the storage footprint of our new dataset. This confirms that:
1.  The files were physically written to the Google Drive.
2.  The file counts match our expected corpus size (e.g., if we processed 100 papers, we expect ~100 text files).
3.  The file sizes are reasonable (e.g., if total size is 0 KB, something went wrong).

In [31]:
def get_folder_stats(path: Path) -> tuple[int, int]:
    """
    Calculates the total size and number of files in a given directory recursively.

    Args:
        path: The Path object of the directory.

    Returns:
        A tuple containing (total_size_bytes, num_files).
    """
    total_size = 0
    num_files = 0

    if path.exists() and path.is_dir():
        for file_path in path.rglob('*'):
            if file_path.is_file():
                total_size += file_path.stat().st_size
                num_files += 1
    return total_size, num_files

def format_size(size_bytes: int) -> str:
    """
    Formats a size in bytes to a human-readable string (KB, MB, GB).
    """
    if size_bytes < 1024:
        return f"{size_bytes} B"
    elif size_bytes < 1024**2:
        return f"{size_bytes / 1024:.2f} KB"
    elif size_bytes < 1024**3:
        return f"{size_bytes / (1024**2):.2f} MB"
    else:
        return f"{size_bytes / (1024**3):.2f} GB"

# --- Execute Verification ---
print("\n" + "="*50)
print("üóÑÔ∏è  DATASET STORAGE VERIFICATION")
print("="*50)

# 1. Verify Main Corpus (Text)
main_size, main_files = get_folder_stats(CorpusConfig.TEXT_DIR)
print(f"\nüìÇ Main Text Corpus")
print(f"   Path:  {CorpusConfig.TEXT_DIR}")
print(f"   Count: {main_files} files")
print(f"   Size:  {format_size(main_size)}")

# 2. Verify Sample Corpus (Text)
sample_size, sample_files = get_folder_stats(CorpusConfig.SAMPLE_TEXT_DIR)
print(f"\nüìÇ Sample Text Corpus")
print(f"   Path:  {CorpusConfig.SAMPLE_TEXT_DIR}")
print(f"   Count: {sample_files} files")
print(f"   Size:  {format_size(sample_size)}")

# 3. Verify PDF Cache (Optional but useful)
pdf_size, pdf_files = get_folder_stats(CorpusConfig.PDF_DIR)
print(f"\nüìÇ Raw PDF Cache")
print(f"   Path:  {CorpusConfig.PDF_DIR}")
print(f"   Count: {pdf_files} files")
print(f"   Size:  {format_size(pdf_size)}")

print("\n" + "="*50)
print("‚úÖ NOTEBOOK 2 COMPLETE")


üóÑÔ∏è  DATASET STORAGE VERIFICATION

üìÇ Main Text Corpus
   Path:  /content/drive/My Drive/TFM/data/processed/corpus_v1.0_text_dumps
   Count: 131 files
   Size:  3.00 MB

üìÇ Sample Text Corpus
   Path:  /content/drive/My Drive/TFM/data/processed/sample_text_dumps
   Count: 10 files
   Size:  280.06 KB

üìÇ Raw PDF Cache
   Path:  /content/drive/My Drive/TFM/data/raw/pdfs/corpus_v1.0_pdfs
   Count: 131 files
   Size:  126.15 MB

‚úÖ NOTEBOOK 2 COMPLETE
