
## Required Python Libraries

Your workflow depends on several libraries for interacting with the arXiv API, handling files, managing concurrency, monitoring system resources, and processing text.

Below is a quick overview of what each library does:

* **arxiv** – Fetching paper metadata and downloading PDFs from arXiv.
* **requests** – Sending HTTP requests (used for downloading files or checking links).
* **psutil** – Monitoring memory, CPU usage, and process information.
* **threading** – Running tasks concurrently to speed up download/extraction.
* **queue** – Thread-safe task queues for worker threads.
* **tarfile** – Extracting `.tar.gz` or `.tgz` source files from arXiv.
* **re** – Handling regular expressions for parsing text.
* **json** – Reading and writing JSON configurations or metadata.
* **time** – Sleeping, benchmarking, and timing processes.
* **os** – Path handling, checking directories, cleaning files.
* **random** – Random selection or sampling of papers.
* **string** – Sanitizing filenames and keys.
* **shutil** – High-level file operations such as removing directory trees.
* **traceback** – Formatting exception stack traces for error reports.
* **logging** – Suppressing or handling logs cleanly (the `arxiv` library can be noisy).

In [1]:
!pip install arxiv requests psutil



In [2]:
import logging
logging.getLogger("arxiv").setLevel(logging.ERROR)
import arxiv
import re
import os
import tarfile
import requests
import string
import json
import time
import threading
import queue
import psutil
import random
import shutil
import traceback

# **I. Overview: arXiv ID discovery and range utilities**

This module provides a set of functions for generating, validating, and enumerating arXiv IDs within a given date range. Since January 2015, arXiv IDs have the format `YYMM.NNNNN`, where `NNNNN` is a paper number that restarts at `00001` each month (earlier identifiers used other layouts, which this code does not handle). Because arXiv does not provide an API to directly list all valid IDs, the code uses exponential and binary search to infer the boundaries of valid IDs.

---

## **I.1. ID Generation**

### `get_ID(month, year, number)`

Generates an arXiv ID in the form `YYMM.NNNNN`.

* `year % 100` extracts the last two digits of the year
* `month:02d` ensures the month is zero-padded
* `number:05d` produces a five-digit paper index

Example:

```python
get_ID(4, 2023, 12)  # "2304.00012"
```

---

## **I.2. Checking whether an arXiv ID exists or not**

### `id_exists(paper_id)`

Determines whether a specific arXiv ID corresponds to a real paper by performing a lookup using the arXiv API.

Behavior:

* Returns `True` if metadata for the ID is found
* Returns `False` if:

  * The ID does not exist
  * The API returns no results
  * Any network or parsing error occurs

This function is the foundation for locating valid ID boundaries.

---

## **I.3. Locating the first valid ID of a month**

### `find_first_id(year, month)`

Finds the first valid ID for a given month using a two-stage search strategy:

1. **Exponential search**
   Doubles the index (`high = 1, 2, 4, 8, ...`) until it finds a valid ID.

2. **Binary search**
   Narrows the range between `low` and `high` to identify the smallest existing index.


Returns:

* The earliest valid index
* `None` if the month contains no papers
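
The two-stage search can be tried offline by replacing the network lookup with a predicate. The following is a minimal sketch, not the notebook's function: `first_true_index` and the `exists` callable are hypothetical names, it assumes that every index above the first existing one also exists within the probed range, and unlike the notebook it clamps the last probe to `limit` instead of stopping once doubling passes it.

```python
from typing import Callable, Optional

def first_true_index(exists: Callable[[int], bool], limit: int = 99999) -> Optional[int]:
    """Smallest n in [1, limit] with exists(n) true, or None if there is none.

    Assumes exists(n) is false below some threshold and true from there on.
    """
    low, high = 0, 1
    # Stage 1: exponential search. Double `high` until it hits an existing index.
    while not exists(high):
        if high >= limit:
            return None
        low, high = high, min(high * 2, limit)
    # Stage 2: binary search. Invariant: `low` is missing (or 0), `high` exists.
    while low + 1 < high:
        mid = (low + high) // 2
        if exists(mid):
            high = mid
        else:
            low = mid
    return high

# Pretend the month's numbering starts at 37.
print(first_true_index(lambda n: n >= 37))  # 37
```

In the notebook the predicate would be `lambda n: id_exists(get_ID(month, year, n))`, so every probe costs one API request.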

---

## **I.4. Locating the last valid ID of a month**

### `find_last_id(year, month)`

Finds the last valid ID for a month using an approach similar to `find_first_id`:

1. **Exponential search**
   Increases the index until encountering a missing (nonexistent) ID.

2. **Binary search**
   Determines the highest index that still corresponds to a real paper.

Returns:

* The last valid index
* `0` if paper `00001` of the month does not exist
* `99999` if every probed index up to `65536` exists, since the next doubling would exceed the five-digit range
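
To get a feel for how few lookups this strategy needs, the sketch below runs the same idea against a fake month and counts the probes. `last_true_index`, `fake_exists`, and the paper count are made up for illustration, and the sketch assumes indices `1..k` all exist with nothing after `k`.

```python
from typing import Callable

def last_true_index(exists: Callable[[int], bool], limit: int = 99999) -> int:
    """Largest n in [1, limit] with exists(n) true; 0 if index 1 is missing."""
    low, high = 0, 1
    # Stage 1: exponential search. Double `high` until it hits a missing index.
    while exists(high):
        if high >= limit:
            return limit
        low, high = high, min(high * 2, limit)
    # Stage 2: binary search. Invariant: `low` exists (or is 0), `high` is missing.
    while low + 1 < high:
        mid = (low + high) // 2
        if exists(mid):
            low = mid
        else:
            high = mid
    return low

calls = 0

def fake_exists(n: int) -> bool:
    """Stand-in for an API lookup: pretend the month has 18,342 papers."""
    global calls
    calls += 1
    return n <= 18342

print(last_true_index(fake_exists), calls)  # 18342 30
```

Thirty lookups replace a linear scan of more than eighteen thousand, which matters when each lookup is a rate-limited API call.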

---

## **I.5. Listing IDs within a single month**

### `get_IDs_month(month, year, start_number, end_number)`

Generates all IDs between `start_number` and `end_number` for the specified month.

Output format:

```
["YYMM.00001", "YYMM.00002", ...]
```

Assumes the caller has already determined valid start and end indices.

---

## **I.6. Generating IDs across multiple months**

### `get_IDs_All(start_month, start_year, start_ID, end_month, end_year, end_ID)`

Constructs a complete list of all arXiv IDs between two points in time, even when the range spans multiple months or years.

Method:

1. For each month in the range:

   * Determine the ending index:

     * If the month is the last one: use `end_ID`
     * Otherwise: detect automatically using `find_last_id`
   * Collect all IDs for that month using `get_IDs_month`
2. Move to the next month, adjusting the year when necessary.
3. Reset the starting index using `find_first_id` for each new month.

This allows flexible queries, for example from March 2022 (starting at ID 2000) to July 2023 (ending at ID 0500).
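
The month-stepping part of the method can be isolated from the ID lookups. The sketch below uses a hypothetical helper, `iter_months`, which is not part of the notebook; it only walks the calendar.

```python
from typing import Iterator, Tuple

def iter_months(start_year: int, start_month: int,
                end_year: int, end_month: int) -> Iterator[Tuple[int, int]]:
    """Yield (year, month) pairs from the start month to the end month, inclusive."""
    y, m = start_year, start_month
    while (y, m) <= (end_year, end_month):
        yield y, m
        m += 1
        if m > 12:  # roll over into January of the next year
            y, m = y + 1, 1

print(list(iter_months(2022, 11, 2023, 2)))
# [(2022, 11), (2022, 12), (2023, 1), (2023, 2)]
```

Comparing `(year, month)` tuples means the loop yields nothing when the end precedes the start. A loop that only stops when it lands exactly on the end month would keep running in that case.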

---

## **I.7. Formatting arXiv IDs for filenames or keys**

### `format_arxiv_id_for_key(arxiv_id)`

Transforms an arXiv ID from:

```
YYMM.NNNNN
```

to:

```
yyyymm-NNNNN
```

This format is often more convenient for filenames, folders, or BibTeX keys.

Examples:

```
2304.07856 → 202304-07856
1912.00123 → 201912-00123
```

If the input does not match the expected format, the function returns it unchanged.

---

In [3]:
def get_ID(month, year, number):
    """Return arXiv ID in YYMM.NNNNN format."""
    return f"{year % 100:02d}{month:02d}.{number:05d}"

def id_exists(paper_id):
    """Check if a specific arXiv ID exists."""
    search = arxiv.Search(id_list=[paper_id])
    client = arxiv.Client(page_size=1, delay_seconds=0.2)
    try:
        next(client.results(search))
        return True
    except StopIteration:
        return False
    except Exception:
        # Network or parsing error: assume not found for safety
        return False

def find_first_id(year, month):
    """Find the first valid arXiv ID of a given month using exponential + binary search."""
    low, high = 1, 1
    # Exponential search upward until we find a valid ID
    while not id_exists(get_ID(month, year, high)):
        high *= 2
        if high > 99999:
            return None  # No valid papers this month

    # Now binary search between low and high to find the *first* valid ID
    while low + 1 < high:
        mid = (low + high) // 2
        if id_exists(get_ID(month, year, mid)):
            high = mid
        else:
            low = mid
    return high

def find_last_id(year, month):
    """Find the last valid arXiv ID of a given month."""
    low, high = 1, 1
    # Exponential search upward until we find a missing ID
    while id_exists(get_ID(month, year, high)):
        high *= 2
        if high > 99999:
            return 99999
    low = high // 2
    # Binary search for the last existing ID
    while low + 1 < high:
        mid = (low + high) // 2
        if id_exists(get_ID(month, year, mid)):
            low = mid
        else:
            high = mid
    return low

def get_IDs_month(month, year, start_number, end_number):
    """Get all valid arXiv IDs in a given month."""
    return [get_ID(month, year, i) for i in range(start_number, end_number + 1)]

def get_IDs_All(start_month, start_year, start_ID, end_month, end_year, end_ID):
    """Get all valid arXiv IDs in the given range."""
    ids = []
    y, m = start_year, start_month
    n_start = start_ID
    n_end = None
    while True:
        if y == end_year and m == end_month:
            n_end = end_ID
        else:
            n_end = find_last_id(y, m)  # returns 0 if the month has no papers

        # find_first_id returns None for an empty month, so guard the comparison
        if n_start is not None and n_start <= n_end:
            ids.extend(get_IDs_month(m, y, n_start, n_end))

        if y == end_year and m == end_month:
            break
        m += 1
        if m > 12:
            m, y = 1, y + 1
        n_start = find_first_id(y, m)  # reset numbering

    return ids

def format_arxiv_id_for_key(arxiv_id):
    """
    Convert arXiv ID from YYMM.NNNNN format to yyyymm-id format.
    Examples:
        2304.07856 -> 202304-07856
        1912.00123 -> 201912-00123
    """
    match = re.match(r'^(\d{2})(\d{2})\.(\d{5})$', arxiv_id)
    if match:
        yy, mm, id_num = match.groups()
        # Convert YY to YYYY (assuming 20YY for papers after 2000)
        yyyy = f"20{yy}"
        return f"{yyyy}{mm}-{id_num}"
    return arxiv_id  # Return as-is if format doesn't match



# **II. Metadata extraction and storage**

This section describes the functions responsible for converting an `arxiv.Result` object into structured metadata and saving it in JSON format. These utilities help standardize how each downloaded paper is documented, including its versions, authors, categories, and other descriptive attributes.

---
## **II.1. Converting an arXiv Result into metadata**

### `create_metadata(paper)`

Transforms an `arxiv.Result` object into a dictionary containing relevant metadata fields.

Key steps:

1. **Extract base ID and version**
   `paper.get_short_id()` returns values like `2305.00633v4`.

   * The function separates the base ID (`2305.00633`)
   * And the version number (`4`)

2. **Generate PDF URLs**

   * If the paper has multiple versions, PDF URLs for all versions are created
   * Otherwise, only the current version URL is included

3. **Construct the metadata dictionary**, which includes:

   * `arxiv_id`: base ID without version suffix
   * `paper_title`: title string
   * `authors`: list of author names
   * `submission_date`: original publication date
   * `revised_dates`: list holding the latest update date when it differs from the submission date (empty otherwise)
   * `latest_version`: integer version number
   * `categories`: classification tags
   * `abstract`: cleaned summary
   * `pdf_urls`: one or more PDF download URLs

4. **Optional fields**

   * `publication_venue` is set to `paper.comment` when it is present, and to `None` otherwise
   * `doi` is included if available

The function returns the full metadata dictionary.
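
Steps 1 and 2 can be sketched without an `arxiv.Result` object. The helper names below (`split_short_id`, `version_pdf_urls`) are hypothetical, and the regex is one possible way to split off the version; it cuts at the last `v` so that an ID containing another `v` is not split early.

```python
import re
from typing import List, Tuple

def split_short_id(short_id: str) -> Tuple[str, int]:
    """Split '2305.00633v4' into ('2305.00633', 4)."""
    match = re.fullmatch(r"(.+)v(\d+)", short_id)
    if match is None:
        raise ValueError(f"no version suffix in {short_id!r}")
    return match.group(1), int(match.group(2))

def version_pdf_urls(short_id: str) -> List[str]:
    """Build one PDF URL per version, from v1 up to the version in short_id."""
    base_id, latest = split_short_id(short_id)
    # https is used here; the URL pattern itself follows the notebook.
    return [f"https://arxiv.org/pdf/{base_id}v{i}" for i in range(1, latest + 1)]

print(split_short_id("2305.00633v4"))    # ('2305.00633', 4)
print(version_pdf_urls("2305.00633v2"))
# ['https://arxiv.org/pdf/2305.00633v1', 'https://arxiv.org/pdf/2305.00633v2']
```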

---

## **II.2. Saving metadata**

### `save_metadata(paper, folder)`

Stores the metadata for a single paper into a `metadata.json` file located in the specified folder.

Steps performed:

1. Calls `create_metadata(paper)` to generate the metadata dictionary.
2. Ensures the target folder exists (creates it if necessary).
3. Writes the metadata to `metadata.json` using UTF-8 encoding and readable indentation.
4. Prints a message indicating the save location.
5. Returns the metadata dictionary for further use.

This function ensures that each paper's descriptive information is saved consistently alongside its downloaded files.
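
The effect of the JSON settings in step 3 is easiest to see on a tiny record. This sketch writes a made-up record to a temporary folder; the field values are invented for illustration.

```python
import json
import os
import tempfile

# Made-up record with a non-ASCII author name.
record = {"arxiv_id": "2305.00633", "authors": ["José Álvarez"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metadata.json")
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps 'José' readable instead of 'Jos\u00e9'.
        json.dump(record, f, indent=4, ensure_ascii=False)
    with open(path, encoding="utf-8") as f:
        text = f.read()

print("José" in text)              # True
print(json.loads(text) == record)  # True
```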


In [4]:
def create_metadata(paper):
    """
    Convert an arxiv.Result object into a metadata dictionary.
    """
    arxiv_id = paper.get_short_id()         # e.g. '2305.00633v4'
    base_id, _, version_str = arxiv_id.rpartition('v')  # split at the last 'v' only, e.g. '2305.00633'
    version = int(version_str)              # e.g. 4

    # Generate all version URLs if version > 1
    if version > 1:
        pdf_urls = [f"http://arxiv.org/pdf/{base_id}v{i}" for i in range(1, version + 1)]
    else:
        pdf_urls = [f"http://arxiv.org/pdf/{arxiv_id}"]

    metadata = {
        "arxiv_id": base_id,
        "paper_title": paper.title.strip(),
        "authors": [author.name for author in paper.authors],
        "submission_date": paper.published.strftime("%Y-%m-%d"),
        "revised_dates": [
            paper.updated.strftime("%Y-%m-%d")
        ] if paper.updated != paper.published else [],
        "latest_version": version,
        "categories": paper.categories,
        "abstract": paper.summary.strip(),
        "pdf_urls": pdf_urls
    }

    # Optional metadata fields
    if paper.comment:
        metadata["publication_venue"] = paper.comment.strip()
    else:
        metadata["publication_venue"] = None

    if paper.doi:
        metadata["doi"] = paper.doi

    return metadata


def save_metadata(paper, folder):
    """
    Save metadata.json for a single paper into the given folder.
    """
    metadata = create_metadata(paper)

    folder_path = os.path.abspath(folder)
    os.makedirs(folder_path, exist_ok=True)
    save_path = os.path.join(folder_path, "metadata.json")

    with open(save_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=4, ensure_ascii=False)

    print(f"💾 Saved metadata to {save_path}")
    return metadata


# **III. Utility functions for handling arXiv source downloads**

This module contains helper functions for preparing valid filenames, safely extracting `.tar.gz` source archives, downloading files from arXiv, cleaning extracted folders, and orchestrating the full download process for all versions of a paper. These functions work together to ensure that downloaded source code is stored in a consistent, safe, and organized folder structure.

---

## **III.1. Formatting and Sanitization**

### `format_yymm_id(base_id: str) -> str`

Converts an arXiv base ID from:

```
2303.07856
```

to:

```
2303-07856
```


### `sanitize_filename(name: str) -> str`

Cleans a path or filename by replacing any unsafe characters with underscores.
Allowed characters include alphanumeric characters, underscores, hyphens, dots, and slashes.

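The purpose is to avoid:

* Invalid filesystem paths
* Issues caused by special characters in filenames

Because dots and slashes are allowed, sanitization alone does not neutralize `..` components; directory traversal is blocked by the separate checks in `safe_extract_tar`.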
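
A quick way to see what the character filter does and does not change is to run it on two sample names. The sketch below restates the same allow-list under the hypothetical name `sanitize`.

```python
import string

SAFE_CHARS = set(f"-_.{string.ascii_letters}{string.digits}/")

def sanitize(name: str) -> str:
    """Replace every character outside SAFE_CHARS with an underscore."""
    return "".join(c if c in SAFE_CHARS else "_" for c in name)

print(sanitize("figs/plot (final).tex"))  # figs/plot__final_.tex
print(sanitize("../../etc/passwd"))       # ../../etc/passwd
```

The second name passes through unchanged, which is why parent-directory components have to be rejected by a separate check before extraction.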

---

## **III.2. Safe extraction of tar.gz files**

### `safe_extract_tar(tar_path: str, extract_to: str) -> None`

Extracts a `.tar.gz` archive while applying safety checks:

* Rejects symbolic links and hard links
* Rejects absolute paths
* Rejects any entry containing `..`
* Sanitizes filenames before extraction
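* Extracts each entry with `tar.extract(..., filter="data")` (the `filter` argument needs Python 3.12, or an older release that received the extraction-filter backport)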
* Creates necessary directories before writing files
* Skips any broken or unreadable entries rather than stopping the entire process

If an unrecoverable error occurs, it prints a message but does not raise an exception.

This ensures source archives extracted from arXiv cannot overwrite unintended paths or create unsafe filesystem structures.
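
An alternative to string checks is to resolve the target path and confirm it stays inside the extraction folder. The sketch below is one way to express that; `is_within_directory` is a hypothetical helper, not part of the notebook, and the example paths assume a POSIX system.

```python
import os

def is_within_directory(base_dir: str, member_name: str) -> bool:
    """True if base_dir/member_name resolves to a path inside base_dir."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, member_name))
    # commonpath equals `base` only when `target` is `base` or lies below it.
    return os.path.commonpath([base, target]) == base

print(is_within_directory("/tmp/extract_demo", "paper/main.tex"))  # True
print(is_within_directory("/tmp/extract_demo", "../evil.tex"))     # False
print(is_within_directory("/tmp/extract_demo", "/etc/passwd"))     # False
```

Unlike a substring test for `..`, this accepts harmless names such as `notes..tex` while still rejecting paths that escape the folder.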

---

## **III.3. Downloading remote files**

### `download_url(url: str, out_path: str) -> bool`

Downloads a file from a given URL and streams it to disk in chunks.

Characteristics:

* Uses a custom User-Agent
* Streams data to avoid high memory usage
* Ensures the output directory exists
* Returns `True` if the HTTP response is OK (`200`)
* Otherwise prints an error and returns `False`

This is a minimal downloader without retry or backoff logic.
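
If retries are wanted, they can be layered on top without touching the downloader. The sketch below is an optional add-on, not something the notebook does: `with_retries` is a hypothetical wrapper, and the attempt count and delays are arbitrary choices.

```python
import time
from typing import Callable

def with_retries(action: Callable[[], bool], attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call action() until it returns True or the attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    """
    for attempt in range(attempts):
        if action():
            return True
        if attempt < attempts - 1:  # no wait after the final attempt
            sleep(base_delay * 2 ** attempt)
    return False

# Demo with a fake action that fails twice; waits are recorded, not slept.
outcomes = iter([False, False, True])
waits = []
print(with_retries(lambda: next(outcomes), sleep=waits.append), waits)
# True [1.0, 2.0]
```

With the notebook's function, the call would look like `with_retries(lambda: download_url(src_url, tar_path))`.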

---

## **III.4. Cleaning extracted source files**

### `cleanup_non_tex_bib_files(folder: str) -> None`

Walks through the extracted directory and removes every file except:

* `.tex` files
* `.bib` files

This keeps only the essential LaTeX source code and bibliography files, ignoring figures or auxiliary files.
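
The filter can be tried safely on a throwaway folder. In the sketch below, `keep_only` is a hypothetical variant that compares extensions case-insensitively; the notebook's version is case-sensitive, so it would also delete a file named `APPENDIX.TEX`.

```python
import os
import tempfile

def keep_only(folder: str, extensions=(".tex", ".bib")) -> list:
    """Delete files not ending in one of `extensions`; return the removed names."""
    removed = []
    for root, _, files in os.walk(folder):
        for name in files:
            if not name.lower().endswith(extensions):
                os.remove(os.path.join(root, name))
                removed.append(name)
    return sorted(removed)

with tempfile.TemporaryDirectory() as tmp:
    for name in ("main.tex", "refs.bib", "fig1.png", "APPENDIX.TEX"):
        open(os.path.join(tmp, name), "w").close()
    print(keep_only(tmp))           # ['fig1.png']
    print(sorted(os.listdir(tmp)))  # ['APPENDIX.TEX', 'main.tex', 'refs.bib']
```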

---

## **III.5. Full download procedure for all versions**

### `download(list_download: list, base_dir: str) -> None`

Coordinates the process of downloading all available versions of an arXiv paper, extracting source files, cleaning them, and saving metadata.

Main steps:

1. **Validate input**

   * Ensures the list of results is not empty
   * Extracts the base ID using regex (four digits, dot, five digits)

2. **Prepare folder structure**

   * Creates a directory named after the paper (using `YYMM-NNNNN`)
   * Creates a `tex` subfolder
   * Creates per-version subfolders inside `tex`

3. **Download each version**
   For each paper version:

   * Build the `/src/{id}` URL
   * Download the corresponding `.tar.gz` archive
   * Validate that it is a real tar archive
   * Extract it safely using `safe_extract_tar`
   * Clean non-LaTeX files
   * Remove the `.tar.gz` after extraction

4. **Save metadata**

   * After all versions have been processed, calls `save_metadata`
   * Stores a `metadata.json` file in the paper's root folder

Errors encountered during any stage are reported but do not stop the overall process unless critical information is missing.

In [5]:
ARXIV_HOST = "https://arxiv.org"

def format_yymm_id(base_id: str) -> str:
    """'2303.07856' -> '2303-07856'"""
    return base_id.replace('.', '-')

def sanitize_filename(name: str) -> str:
    """
    Replace unsafe characters with underscores.
    Keeps only alphanumeric characters, underscores, hyphens, dots, and slashes.
    """
    safe_chars = f"-_.{string.ascii_letters}{string.digits}/"
    return ''.join(c if c in safe_chars else '_' for c in name)

def safe_extract_tar(tar_path: str, extract_to: str) -> None:
    """Safely extract a tar.gz file using 'filter=data', skipping broken entries."""
    try:
        with tarfile.open(tar_path, "r:gz") as tar:
            for member in tar.getmembers():
                try:
                    # Skip links, absolute paths, and names containing '..' (security)
                    if member.islnk() or member.issym() or member.name.startswith("/") or ".." in member.name:
                        continue

                    member.name = sanitize_filename(member.name)
                    target_path = os.path.join(extract_to, member.name)
                    target_dir = os.path.dirname(target_path)
                    os.makedirs(target_dir, exist_ok=True)

                    # Extract safely
                    tar.extract(member, path=extract_to, filter="data")
                except (FileNotFoundError, OSError, tarfile.ExtractError) as inner_e:
                    print(f"⚠️ Skipped bad entry in {os.path.basename(tar_path)}: {member.name} ({inner_e})")
                    continue
    except Exception as e:
        print(f"[Error] Extraction failed for {tar_path}: {e}")


def download_url(url: str, out_path: str) -> bool:
    """Basic downloader (no retry, no backoff)."""
    headers = {"User-Agent": "arxiv-downloader/1.0 (+https://github.com/AnhTtis/Data_Science_Project)"}
    try:
        with requests.get(url, headers=headers, stream=True, timeout=30) as r:
            if r.status_code == 200:
                os.makedirs(os.path.dirname(out_path), exist_ok=True)
                with open(out_path, "wb") as f:
                    for chunk in r.iter_content(8192):
                        if chunk:
                            f.write(chunk)
                return True
            print(f"HTTP {r.status_code} for {url}")
            return False
    except requests.RequestException as e:
        print(f"Download failed for {url}: {e}")
        return False


def cleanup_non_tex_bib_files(folder: str) -> None:
    """Remove all non-.tex and non-.bib files."""
    for root, _, files in os.walk(folder):
        for file in files:
            if not (file.endswith(".tex") or file.endswith(".bib")):
                try:
                    os.remove(os.path.join(root, file))
                except OSError as e:
                    print(f"Warning: could not remove {file}: {e}")


def download(list_download: list, base_dir: str) -> None:
    """
    Downloads all versions of an arXiv paper using /src/{id} URL.
    Extracts .tex/.bib files and saves metadata.
    """
    if not list_download:
        print("⚠️ list_download is empty; skipping.")
        return

    match = re.match(r"^(\d{4}\.\d{5})", list_download[0].get_short_id())
    if not match:
        print(f"Invalid arXiv ID format: {list_download[0].get_short_id()}")
        return

    arxiv_id = match.group(1)
    folder_arxiv = os.path.join(base_dir, format_yymm_id(arxiv_id))
    print(f"Processing {arxiv_id} → {folder_arxiv}")

    os.makedirs(folder_arxiv, exist_ok=True)
    tex_root = os.path.join(folder_arxiv, "tex")
    os.makedirs(tex_root, exist_ok=True)

    for result in list_download:
        full_id = result.get_short_id()  # e.g. '2305.00633v4'
        folder_version = os.path.join(tex_root, full_id)  # put all versions under .../<paper>/tex/<version>
        os.makedirs(folder_version, exist_ok=True)

        src_url = f"{ARXIV_HOST}/src/{full_id}"
        tar_path = os.path.join(folder_version, f"{full_id}.tar.gz")
        print(f"Attempting source: {src_url}")

        if not download_url(src_url, tar_path):
            print(f"Source unavailable for {full_id}")
            continue

        # Validate and extract
        if not tarfile.is_tarfile(tar_path):
            print(f"Invalid tar archive for {full_id}. Removing file.")
            try:
                os.remove(tar_path)
            except OSError as e:
                print(f"Could not remove invalid file {tar_path}: {e}")
            continue

        try:
            safe_extract_tar(tar_path, folder_version)
            cleanup_non_tex_bib_files(folder_version)
            print(f"✅ Extracted to {folder_version}")
        except Exception as e:
            print(f"⚠️ Extraction failed for {full_id}: {e}")
        finally:
            try:
                os.remove(tar_path)
            except OSError:
                pass

    # Save metadata once, from the last result in the list
    try:
        save_metadata(list_download[-1], folder_arxiv)
    except Exception as e:
        print(f"⚠️ Metadata save failed for {arxiv_id}: {e}")


# **IV. Semantic Scholar reference extraction module**

This module handles fetching citation references from the Semantic Scholar Graph API, converting them into structured metadata, and saving them in JSON format. It includes a rate limiter to ensure API compliance, helper functions for converting IDs, and utilities for preparing a standardized references dataset for each arXiv paper.

---

## **IV.1. Rate limiting for API requests**

### `wait_for_rate_limit()`

Ensures that at least one second passes between requests.
A global timestamp `_last_request_time` is shared across threads, protected by a lock:

* Prevents exceeding Semantic Scholar's free-tier request limits
* Enforces a fixed delay before each new API call

This is critical for multi-threaded reference extraction.
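
The same idea can be packaged as a small object, which avoids the module-level global. This is a sketch under assumptions: `MinIntervalLimiter` is a hypothetical class, and it uses `time.monotonic()` rather than `time.time()` so that system clock adjustments cannot shorten the wait.

```python
import threading
import time

class MinIntervalLimiter:
    """Make consecutive wait() returns at least `interval` seconds apart,
    across all threads that share the instance."""

    def __init__(self, interval: float = 1.0) -> None:
        self.interval = interval
        self._lock = threading.Lock()
        self._last = float("-inf")  # no request made yet

    def wait(self) -> None:
        # Sleeping while holding the lock makes other threads queue up behind us.
        with self._lock:
            remaining = self.interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
            self._last = time.monotonic()

limiter = MinIntervalLimiter(interval=0.05)  # short interval for the demo
start = time.monotonic()
for _ in range(3):
    limiter.wait()
elapsed = time.monotonic() - start
print(f"{elapsed:.2f}s for 3 calls")  # the first call is free, the next two wait
```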

---

## **IV.2. Fetching references from Semantic Scholar**

### `get_paper_references(arxiv_id, delay=1)`

Retrieves reference metadata for a given arXiv paper from the Semantic Scholar API.

Workflow:

1. **Normalize the ID** by removing version suffix (`v1`, `v2`, etc.).
2. Build the API endpoint:
   `https://api.semanticscholar.org/graph/v1/paper/arXiv:{id}/references`
3. Use `fields=` to request:

   * title
   * authors
   * year
   * venue
   * externalIds (ArXiv, DOI)
   * publicationDate
4. Apply strict rate limiting before each request.
5. Handle error conditions:

   * `429`: wait and retry
   * `404`: return an empty list
   * other errors: retry with delay
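6. Return the list of reference entries, or an empty list if the paper is not found. Because the retry loop has no attempt limit, a persistent failure other than `404` keeps the function retrying indefinitely.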
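
Steps 1 to 3 amount to building a URL and a `fields` parameter, which can be checked without sending a request. The helper and constant names below are hypothetical.

```python
import re
from typing import Dict, Tuple

S2_GRAPH_URL = "https://api.semanticscholar.org/graph/v1"
REFERENCE_FIELDS = ("title", "authors", "year", "venue", "externalIds", "publicationDate")

def build_references_request(arxiv_id: str) -> Tuple[str, Dict[str, str]]:
    """Return (url, params) for the references endpoint of one arXiv paper."""
    clean_id = re.sub(r"v\d+$", "", arxiv_id)  # '2304.07856v2' -> '2304.07856'
    url = f"{S2_GRAPH_URL}/paper/arXiv:{clean_id}/references"
    return url, {"fields": ",".join(REFERENCE_FIELDS)}

url, params = build_references_request("2304.07856v2")
print(url)     # https://api.semanticscholar.org/graph/v1/paper/arXiv:2304.07856/references
print(params)  # {'fields': 'title,authors,year,venue,externalIds,publicationDate'}
```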

---

## **IV.3. Converting Semantic Scholar data to standard format**

### `convert_to_references_dict(references)`

Transforms API results into a dictionary indexed by normalized IDs.

Key behaviors:

* Extract `citedPaper` from each reference.
* Skip invalid or missing entries.
* Keep only references **with an arXiv ID**; all others are skipped.
* Use `format_yymm_id()` to convert IDs such as `2304.07856` → `2304-07856`.

Metadata fields created per reference:

* `title`
* `authors`
* `submission_date` (from publication date or constructed from year)
* `revised_dates` (empty, since Semantic Scholar does not track versions)
* optional:

  * `doi`
  * `arxiv_id`
  * `venue`
  * `year`

This produces a clean dictionary suitable for storage as JSON.
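
A small made-up payload shows the filtering and key format in action. The entries below are invented and only mimic the shape described above (a list of objects with a `citedPaper` field); `arxiv_references` is a condensed, hypothetical stand-in for the notebook's converter that builds only the required fields.

```python
from typing import Any, Dict, List

sample_response = [  # made-up entries shaped like the API's "data" list
    {"citedPaper": {"title": "Paper A", "authors": [{"name": "A. Author"}],
                    "year": 2023, "publicationDate": "2023-04-17",
                    "externalIds": {"ArXiv": "2304.07856", "DOI": "10.1000/xyz"}}},
    {"citedPaper": {"title": "Book B", "authors": [], "year": 1999,
                    "publicationDate": None, "externalIds": {"DOI": "10.1000/abc"}}},
    {"citedPaper": None},
]

def arxiv_references(refs: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    out = {}
    for ref in refs:
        paper = ref.get("citedPaper") or {}
        arxiv_id = (paper.get("externalIds") or {}).get("ArXiv")
        if not arxiv_id:
            continue  # no arXiv ID: the entry is dropped
        # Fall back to January 1st of the year when no exact date is given.
        date = paper.get("publicationDate") or (
            f"{paper['year']}-01-01" if paper.get("year") else "")
        out[arxiv_id.replace(".", "-")] = {
            "title": paper.get("title", ""),
            "authors": [a["name"] for a in (paper.get("authors") or []) if a.get("name")],
            "submission_date": date,
            "revised_dates": [],
        }
    return out

print(arxiv_references(sample_response))
# {'2304-07856': {'title': 'Paper A', 'authors': ['A. Author'],
#                 'submission_date': '2023-04-17', 'revised_dates': []}}
```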

---

## **IV.4. Saving references**

### `save_references(arxiv_id, paper_folder, verbose=True)`

Retrieves references for a specific arXiv ID and writes them to:

```
<folder>/references.json
```

Process:

1. Create the target folder if missing.
2. Fetch references via `get_paper_references`.
3. Convert results to normalized metadata via `convert_to_references_dict`.
4. Save the dictionary as JSON using UTF-8 encoding and pretty indentation.

If no references are found, the file is still created but contains an empty dictionary.

---

## **IV.5. Extracting references**

### `extract_references_for_paper(paper_id, base_data_dir="../data")`

Fetches and stores the references for one paper. The lookup uses the base ID without a version suffix, so a single `references.json` covers all versions.

Steps:

1. Normalize the paper folder name using `format_yymm_id()`.
2. Build the paper's directory path under `base_data_dir`.
3. Call `save_references()` to fetch and store reference information.

This function is designed to integrate with your main download pipeline so that downloading a paper and extracting its references can be performed in a consistent workflow.

---


In [6]:
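# Semantic Scholar API key, read from the environment so the secret stays out of the notebook.
# The variable name SEMANTIC_SCHOLAR_API_KEY is an assumed convention; if it is unset,
# API_KEY is None and requests omits the x-api-key header.
API_KEY = os.environ.get("SEMANTIC_SCHOLAR_API_KEY")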

# Rate limiter: 1 request per second across all threads
_rate_limit_lock = threading.Lock()
_last_request_time = 0

def wait_for_rate_limit():
    """Ensure at least 1 second has passed since last API call"""
    global _last_request_time
    with _rate_limit_lock:
        current_time = time.time()
        time_since_last = current_time - _last_request_time
        if time_since_last < 1.0:
            sleep_time = 1.0 - time_since_last
            time.sleep(sleep_time)
        _last_request_time = time.time()

def get_paper_references(arxiv_id, delay=1):
    """
    Fetch references for a paper from Semantic Scholar API.

    Args:
        arxiv_id: arXiv ID (format: YYMM.NNNNN or YYMM.NNNNNvN)
        delay: delay between retries in seconds

    Returns:
        list: List of references with detailed information
    """
    # Clean arxiv_id (remove version suffix if present)
    clean_id = re.sub(r'v\d+$', '', arxiv_id)
    url = f"https://api.semanticscholar.org/graph/v1/paper/arXiv:{clean_id}/references"
    params = {
        "fields": "title,authors,year,venue,externalIds,publicationDate"
    }
    headers = {
        "x-api-key": API_KEY,
        "Accept": "application/json"
    }

    while True:
        try:
            # Wait to respect rate limit before making request
            wait_for_rate_limit()

            response = requests.get(url, params=params, headers=headers, timeout=10)
            if response.status_code == 200:
                data = response.json()
                # The API returns {"data": [list of references]}
                return data.get("data", [])
            elif response.status_code == 429:
                print(f"  Rate limit hit. Waiting {delay}s before retry...")
                time.sleep(delay)
            elif response.status_code == 404:
                print(f"  Paper {arxiv_id} not found in Semantic Scholar")
                return []
            else:
                print(f"  API returned status {response.status_code}, retrying in {delay}s...")
                time.sleep(delay)
        except requests.exceptions.RequestException as e:
            print(f"  Request error: {e}, retrying in {delay}s...")
            time.sleep(delay)


def convert_to_references_dict(references):
    """
    Convert Semantic Scholar references to the required format:
    a dictionary keyed by arXiv ID in "yymm-id" format (e.g. "2304-07856").
    References without an arXiv ID are skipped.

    Args:
        references: List of references from Semantic Scholar API

    Returns:
        dict: Dictionary with paper IDs as keys and metadata as values
    """
    result = {}

    for ref in references:
        # Extract citedPaper from the reference object
        cited_paper = ref.get("citedPaper", {})

        # Skip if citedPaper is None or empty
        if not cited_paper:
            continue

        # Extract external IDs (may be None)
        external_ids = cited_paper.get("externalIds") or {}

        arxiv_id = external_ids.get("ArXiv", "")
        doi = external_ids.get("DOI", "")

        # Only keep references that have an arXiv ID
        if not arxiv_id:
            continue

        # Key the entry by arXiv ID in yymm-id format, e.g. '2304-07856'
        key = format_yymm_id(arxiv_id)

        # Extract authors (guard against a null value instead of a list)
        authors_list = cited_paper.get("authors") or []
        authors = [author.get("name", "") for author in authors_list if author.get("name")]

        # Extract dates (use publicationDate if available)
        publication_date = cited_paper.get("publicationDate", "")
        year = cited_paper.get("year")

        # If no publication date but have year, create an ISO-like format
        if not publication_date and year:
            publication_date = f"{year}-01-01"  # Use Jan 1st as placeholder

        # Build metadata dictionary with required fields
        metadata = {
            "title": cited_paper.get("title", ""),
            "authors": authors,
            "submission_date": publication_date if publication_date else "",
            "revised_dates": []  # Semantic Scholar doesn't provide revision history
        }

        # Add optional fields for reference
        if doi:
            metadata["doi"] = doi
        if arxiv_id:
            metadata["arxiv_id"] = arxiv_id
        if cited_paper.get("venue"):
            metadata["venue"] = cited_paper.get("venue")
        if year:
            metadata["year"] = year

        result[key] = metadata

    return result


def save_references(arxiv_id, paper_folder, verbose=True):
    """
    Fetch the references of a paper and save them as references.json.

    Args:
        arxiv_id: arXiv ID (e.g., "2304.07856" or "2304.07856v1")
        paper_folder: Path to the paper folder (e.g., "data/2304-07856/")
        verbose: Whether to print progress messages

    Returns:
        bool: True if successful, False otherwise
    """
    # Check if the folder exists, if not, create it
    if not os.path.exists(paper_folder):
        os.makedirs(paper_folder, exist_ok=True)

    if verbose:
        print(f"Fetching references for {arxiv_id}...")

    references = get_paper_references(arxiv_id)

    if not references:
        if verbose:
            print(f"  No references found for {arxiv_id}")
        json_path = os.path.join(paper_folder, "references.json")
        with open(json_path, 'w', encoding='utf-8') as f:
            json.dump({}, f, indent=2, ensure_ascii=False)
        return False

    json_path = os.path.join(paper_folder, "references.json")
    references_dict = convert_to_references_dict(references)
    try:
        with open(json_path, 'w', encoding='utf-8') as f:
            json.dump(references_dict, f, indent=2, ensure_ascii=False)
        if verbose:
            print(f"  Saved {len(references_dict)} references to references.json")
        return True
    except Exception as e:
        print(f"  Error saving JSON: {e}")
        return False



def extract_references_for_paper(paper_id, base_data_dir="../data"):
    """
    Fetch and save the references of one paper into its data folder.

    Args:
        paper_id: arXiv paper ID without version (e.g., "2304.07856")
        base_data_dir: Base directory containing data folders

    Returns:
        The value returned by save_references
    """
    paper_id_key = format_yymm_id(paper_id)
    paper_folder = os.path.join(base_data_dir, paper_id_key)

    return save_references(paper_id, paper_folder)



# **V. Performance monitoring utilities**

This module provides tools for monitoring RAM usage, disk consumption, and runtime performance during long-running data-processing pipelines. It includes helpers for measuring memory usage, estimating disk footprint, and a `Benchmark` class that aggregates performance metrics.

---

## **V.1. RAM & Disk helpers**

These functions allow the pipeline to track memory and storage usage accurately.


### **`ram_process_mb()`**

Returns the memory usage **of the current Python process** in megabytes.

* Uses `psutil.Process(os.getpid())` to access the running process.
* Retrieves `rss` (Resident Set Size):
  the actual memory held in RAM.
* Converts bytes → megabytes (divides by `1024**2`).


### **`ram_system_mb()`**

Returns the **total RAM used by the operating system**, not just Python.

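* Reads `/proc/meminfo` (Linux only; values are reported in kB).
* Extracts:

  * `MemTotal`
  * `MemFree`
  * `Buffers`
  * `Cached`
* Computes:
  **used = MemTotal − (MemFree + Buffers + Cached)**
  and divides by 1024 to convert kB to MB.
* If `/proc/meminfo` cannot be read or parsed, falls back to `psutil.virtual_memory()` and returns `total − available` in MB.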
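As a rough illustration of the calculation above, the sketch below parses a `/proc/meminfo`-style string rather than the real file, so it runs on any platform. The function name `used_mb_from_meminfo` and the sample values are hypothetical and exist only for this example.

```python
def used_mb_from_meminfo(text):
    """Compute used RAM in MB from /proc/meminfo-style text (values in kB)."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key.strip()] = int(value.split()[0])  # drop the trailing "kB"
    # Free memory plus buffers and page cache is treated as not in use
    reclaimable = fields["MemFree"] + fields.get("Buffers", 0) + fields.get("Cached", 0)
    return (fields["MemTotal"] - reclaimable) / 1024  # kB -> MB


# Made-up sample values, for illustration only
sample = """MemTotal:       4096000 kB
MemFree:        1024000 kB
Buffers:         512000 kB
Cached:          512000 kB
"""
print(f"{used_mb_from_meminfo(sample):.2f} MB")
```

Keeping the parsing separate from the file read makes the arithmetic easy to check without a Linux host.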

### **`print_ram_report(title="")`**

Prints a formatted summary of RAM usage:

```
===================== [title] =====================
🔹 RAM Python-process : XX.XX MB
🔹 RAM System-global  : YY.YY MB
===================================================
```


### **`folder_size_mb(folder)`**

Recursively calculates the total size of a directory in megabytes.

* Walks the directory using `os.walk`.
* For each file:

  * Retrieves file size via `os.path.getsize()`
  * Accumulates total size
* Converts bytes → MB.
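A small usage sketch of the directory walk described above. It restates the helper so that the example is self-contained, then measures a temporary directory; the file names and sizes are arbitrary choices for this example.

```python
import os
import tempfile


def folder_size_mb(folder):
    """Sum the sizes of all regular files under `folder`, in MB."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total / 1024**2


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "nested"))
    with open(os.path.join(tmp, "a.bin"), "wb") as f:
        f.write(b"\0" * 1024 * 1024)   # 1 MB at the top level
    with open(os.path.join(tmp, "nested", "b.bin"), "wb") as f:
        f.write(b"\0" * 512 * 1024)    # 0.5 MB in a subfolder
    size = folder_size_mb(tmp)

print(f"{size:.2f} MB")
```

Because the walk is repeated on every call, the cost grows with the number of files already on disk, which may matter when it is invoked after every paper.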


---

## **V.2. Benchmarking Class**

The `Benchmark` class collects performance metrics over the lifetime of the pipeline. It centralizes timing, memory, and disk usage measurements.


### **Initialization**

```python
self.start_time = time.time()
self.id_fetch_time = 0
self.download_times = {}
self.reference_times = {}
self.ram_samples = []
self.peak_disk_mb = 0
```

### **Tracks**

* **Pipeline start time**
* **Time spent fetching arXiv IDs**
* **Per-paper download durations**
* **Per-paper reference extraction durations**
* **RAM samples over time**
* **Peak disk usage** (computed from output folder)


### **`update_disk(base_dir)`**

Updates the peak disk usage metric:

* Computes current folder size with `folder_size_mb()`
* Updates `peak_disk_mb` to the highest observed value

### **`report(base_dir, total_papers)`**

Prints a complete performance summary.

### **Metrics computed**

* **Total pipeline runtime**
* **Time spent discovering IDs**
* **Average download time per paper**
* **Average reference extraction time per paper**
* **Maximum process RAM usage**
* **Average process RAM usage**
* **Peak disk usage during processing**
* **Final disk size of output directory**
* **Total number of processed papers**
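The averages and maxima listed above can be sketched as a standalone function. `summarize` is a hypothetical name used only here, and the measurements are invented. The guards mirror the approach taken in `report`, so empty inputs do not raise.

```python
def summarize(download_times, ram_samples):
    """Return (average download time, max RAM, average RAM) with empty-input guards."""
    # max(1, ...) avoids a ZeroDivisionError when nothing was downloaded
    avg_download = sum(download_times.values()) / max(1, len(download_times))
    max_ram = max(ram_samples) if ram_samples else 0
    avg_ram = sum(ram_samples) / len(ram_samples) if ram_samples else 0
    return avg_download, max_ram, avg_ram


# Hypothetical measurements, for illustration only
times = {"2303.07856": 0.5, "2303.07857": 1.5}
ram = [100.0, 140.0, 120.0]
print(summarize(times, ram))
print(summarize({}, []))
```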

### **Sample output**

```
===================== PERFORMANCE REPORT =====================
Total runtime pipeline         : 182.43 sec
Time for entry discovery       : 12.48 sec
Average download time/paper    : 0.73 sec
Average reference time/paper   : 0.21 sec
Python process max RAM         : 745.22 MB
Python process avg RAM         : 512.88 MB
Peak disk usage during runtime : 912.41 MB
Final disk usage               : 890.55 MB
Total papers processed         : 200
==============================================================
```



In [7]:
# ---------------------------
# RAM & Disk helpers
# ---------------------------
def ram_process_mb():
    return psutil.Process(os.getpid()).memory_info().rss / 1024**2

def ram_system_mb():
    try:
        # Linux: /proc/meminfo reports each field in kB
        meminfo = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, val = line.split(":", 1)
                meminfo[key.strip()] = int(val.strip().split()[0])
        mem_total = meminfo["MemTotal"]
        # Memory that is free or reclaimable (buffers and page cache)
        mem_free = meminfo["MemFree"] + meminfo.get("Buffers", 0) + meminfo.get("Cached", 0)
        return (mem_total - mem_free) / 1024  # kB -> MB
    except Exception:
        # Other platforms, or unreadable /proc: fall back to psutil (values in bytes)
        vm = psutil.virtual_memory()
        return (vm.total - vm.available) / 1024**2

def print_ram_report(title=""):
    print(f"""
===================== {title} =====================
🔹 RAM Python-process : {ram_process_mb():.2f} MB
🔹 RAM System-global  : {ram_system_mb():.2f} MB
===================================================
""")

def folder_size_mb(folder):
    total = 0
    for dirpath, dirnames, filenames in os.walk(folder):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total / 1024**2

# ---------------------------
# Benchmarking class
# ---------------------------
class Benchmark:
    def __init__(self):
        self.start_time = time.time()
        self.id_fetch_time = 0
        self.download_times = {}
        self.reference_times = {}
        self.ram_samples = []
        self.peak_disk_mb = 0

    def sample_ram(self):
        self.ram_samples.append(ram_process_mb())

    def update_disk(self, base_dir):
        self.peak_disk_mb = max(self.peak_disk_mb, folder_size_mb(base_dir))

    def report(self, base_dir, total_papers):
        total_time = time.time() - self.start_time
        avg_download = sum(self.download_times.values()) / max(1,len(self.download_times))
        avg_reference = sum(self.reference_times.values()) / max(1,len(self.reference_times))
        max_ram = max(self.ram_samples) if self.ram_samples else 0
        avg_ram = sum(self.ram_samples)/len(self.ram_samples) if self.ram_samples else 0
        disk_mb = folder_size_mb(base_dir)

        print("\n===================== PERFORMANCE REPORT =====================")
        print(f"Total runtime pipeline         : {total_time:.2f} sec")
        print(f"Time for entry discovery       : {self.id_fetch_time:.2f} sec")
        print(f"Average download time/paper    : {avg_download:.2f} sec")
        print(f"Average reference time/paper   : {avg_reference:.2f} sec")
        print(f"Python process max RAM         : {max_ram:.2f} MB")
        print(f"Python process avg RAM         : {avg_ram:.2f} MB")
        print(f"Peak disk usage during runtime : {self.peak_disk_mb:.2f} MB")
        print(f"Final disk usage               : {disk_mb:.2f} MB")
        print(f"Total papers processed         : {total_papers}")
        print("==============================================================\n")


# **VI. Pipeline overview and execution flow**

This module coordinates a **multi-threaded arXiv processing pipeline**, responsible for:

* Fetching arXiv IDs
* Downloading papers (including all versions)
* Extracting references
* Measuring performance (RAM, disk, timing)
* Detecting and recovering missing papers

---

## **VI.1. Global Settings**

```python
DOWNLOAD_THREAD_COUNT = 3
REFERENCE_THREAD_COUNT = 2

benchmark = Benchmark()
```

* The pipeline uses *separate thread pools* for downloading PDFs and extracting references.
* A single global `Benchmark()` instance gathers performance metrics across the full run.

---

## **VI.2. Detecting existing papers**

### `collect_existing_ids(base_dir, target_months, yymm_ranges)`

Scans the data directory and collects existing folders matching the pattern:

```
YYMM-NNNNN → e.g., 2401-01532
```

It records only IDs whose tail number falls within the requested range for that month (`yymm_ranges`, bounds inclusive); folders for other months or outside the range are ignored.

**Functional steps:**

* Initialize an empty set per target month.
* Scan folder names using a regex matcher.
* Validate that the folder's tail number falls inside its allowed range.
* Return a dictionary:

```python
{
  "2401": {1532, 1533, ...},
  "2312": {8911, 8912, ...}
}
```
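The steps above can be sketched on a plain list of folder names, which avoids touching the file system. `filter_existing` and the sample names are hypothetical; the regular expression matches the `YYMM-NNNNN` folder pattern described in this section.

```python
import re

FOLDER_PATTERN = re.compile(r"^(\d{4})-(\d{5})$")


def filter_existing(folder_names, yymm_ranges):
    """Group tail numbers by month, keeping only tails inside the inclusive range."""
    existing = {yymm: set() for yymm in yymm_ranges}
    for name in folder_names:
        match = FOLDER_PATTERN.match(name)
        if not match:
            continue  # not a paper folder
        yymm, tail = match.group(1), int(match.group(2))
        if yymm not in yymm_ranges:
            continue  # month was not requested
        start_tail, end_tail = yymm_ranges[yymm]
        if start_tail <= tail <= end_tail:
            existing[yymm].add(tail)
    return existing


# Invented folder names: two valid, one out of range, two malformed
names = ["2401-01532", "2401-09999", "2312-08911", "notes", "2401-1533"]
print(filter_existing(names, {"2401": [1500, 1700], "2312": [8900, 9020]}))
```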

---

## **VI.3. Fetching ID ranges**

### `fetch_ids_worker(...)`

This function fetches all arXiv IDs within a date+tail range, then optionally slices the list depending on whether the user wants:

* **All papers** from the range, or
* A subset (`start_index`, `num_papers`)

It also constructs the internal representation:

```
yymm_ranges = {
    "2401": [1500, 1700],
    "2312": [8900, 9020]
}
```

This becomes the backbone for:

* Detecting missing downloads
* Restricting folder scanning
* Recovering re-downloadable entries
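A minimal sketch of how per-month ranges can be derived from a list of IDs, assuming every ID has the `YYMM.NNNNN` form. `build_yymm_ranges` is a hypothetical name for this example.

```python
def build_yymm_ranges(arxiv_ids):
    """Map each YYMM prefix to [lowest tail, highest tail] seen in arxiv_ids."""
    ranges = {}
    for arxiv_id in arxiv_ids:
        yymm, tail = arxiv_id.split(".")
        tail = int(tail)
        low, high = ranges.get(yymm, (tail, tail))
        ranges[yymm] = [min(low, tail), max(high, tail)]
    return ranges


ids = ["2303.07856", "2303.07858", "2304.00001", "2304.00003"]
print(build_yymm_ranges(ids))
```

Note that a range only records the lowest and highest tail, so it may cover IDs that were never selected; any later check against the range still needs the selected list itself.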


---

## **VI.4. Downloading papers**

### `download_worker(id_queue, download_queue, base_data_dir, delay=1.0)`

This worker thread:

1. Consumes IDs from `id_queue`
2. Purges partially downloaded folders (`remove_folder_if_has_data`)
3. Downloads **all versions** of each paper using retry logic
4. Stores results in `base_data_dir`
5. Sends IDs to `download_queue` so the reference extractor can process them
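To illustrate step 3, the sketch below expands a versioned short ID into the IDs of all its versions. It is a simplification: the worker in this notebook fetches the latest result first and then requests the earlier versions one by one. `all_version_ids` is a hypothetical helper, and treating an ID without a version suffix as version 1 is an assumption.

```python
import re


def all_version_ids(short_id):
    """Expand e.g. '2303.07856v3' into ['2303.07856v1', ..., '2303.07856v3']."""
    match = re.fullmatch(r"(.+)v(\d+)", short_id)
    if match:
        base, latest = match.group(1), int(match.group(2))
    else:
        # No version suffix: assume a single version
        base, latest = short_id, 1
    return [f"{base}v{v}" for v in range(1, latest + 1)]


print(all_version_ids("2303.07856v3"))
print(all_version_ids("2303.07856"))
```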

### Retry logic includes:

* Exponential backoff for 429/503 rate limits
* Graceful handling of missing versions
* Full traceback logging for unexpected errors
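The backoff schedule can be sketched in isolation. The constants (60-second base, 600-second cap, up to 5 seconds of jitter) follow `download_with_retries` in this notebook; the function name `backoff_delay` and the injectable `rng` parameter are additions made here so the schedule can be inspected deterministically.

```python
import random


def backoff_delay(attempt, base=60, cap=600, jitter=5.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Double the delay on each attempt, cap it, then add random jitter
    return min(base * (2 ** attempt), cap) + rng() * jitter


# With jitter disabled, the schedule is 60, 120, 240, 480, 600 seconds
schedule = [backoff_delay(a, rng=lambda: 0.0) for a in range(5)]
print(schedule)
```

The jitter spreads out retries from several threads that were rate-limited at the same moment, so they are less likely to hit the server again simultaneously.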

### Benchmarking:

Each download updates:

* `benchmark.download_times[arxiv_id]`
* RAM sampling
* Disk peak tracking

---

## **VI.5. Extracting references**

### `reference_worker(download_queue, base_data_dir, delay=0.5)`

This worker:

* Consumes IDs from `download_queue`
* Calls `extract_references_for_paper()`
* Tracks per-paper extraction time
* Updates RAM and disk usage

Using a dedicated queue allows reference extraction to start **immediately** after the first paper finishes downloading, with no need to wait for the full batch.

---

## **VI.6. Recovering missing papers**

### `recover_missing_papers(base_dir, yymm_ranges, selected_ids)`

This is a robust mechanism to **ensure the entire intended dataset exists**, even if:

* A previous run crashed
* Network or rate-limit failures occurred
* Disk was cleaned manually
* Files were partially written

### Recovery steps:

1. Compare all expected arXiv IDs (`selected_ids`) to the folders physically present.
2. Detect missing items **only within the user-selected set**.
3. Spawn download and reference threads to reprocess the missing entries.
4. Print a RAM report and the elapsed time for the recovery pass.
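Steps 1 and 2 reduce to a set-membership check. The sketch below is a simplified variant that iterates over the selected IDs directly instead of over the numeric ranges; `find_missing` and the sample data are hypothetical.

```python
def find_missing(selected_ids, existing):
    """Return the selected IDs whose tail is not recorded in existing[yymm]."""
    missing = []
    for arxiv_id in selected_ids:
        yymm, tail = arxiv_id.split(".")
        if int(tail) not in existing.get(yymm, set()):
            missing.append(arxiv_id)
    return missing


selected = ["2303.07856", "2303.07857", "2303.07858"]
on_disk = {"2303": {7856, 7858}}   # folders found by the directory scan
print(find_missing(selected, on_disk))
```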

---

## **VI.7. Thread coordination**

The system uses two queues:

### 1. `id_queue`

* Filled with all IDs to download
* Ends with N `None` sentinels → one for each downloader thread

### 2. `download_queue`

* Receives finished downloads
* Ends with M `None` sentinels → one for each reference extractor

Finally, the system waits using:

```
id_queue.join()
for _ in range(REFERENCE_THREAD_COUNT):
    download_queue.put(None)
download_queue.join()
for t in threads:
    t.join()
```

This guarantees a clean shutdown of all workers.
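A toy version of this two-queue pattern, with arithmetic standing in for the download and extraction work. All names here (`stage1`, `stage2`, `run_pipeline`, the worker counts) are hypothetical; only the sentinel and `join()` ordering is meant to reflect the pipeline.

```python
import queue
import threading

STAGE1_WORKERS = 3
STAGE2_WORKERS = 2


def stage1(in_q, out_q):
    while True:
        item = in_q.get()
        try:
            if item is None:          # sentinel: stop this worker
                break
            out_q.put(item * 2)       # stand-in for "download"
        finally:
            in_q.task_done()


def stage2(in_q, results, lock):
    while True:
        item = in_q.get()
        try:
            if item is None:
                break
            with lock:
                results.append(item + 1)   # stand-in for "extract references"
        finally:
            in_q.task_done()


def run_pipeline(items):
    id_q, done_q = queue.Queue(), queue.Queue()
    results, lock = [], threading.Lock()
    for item in items:
        id_q.put(item)
    for _ in range(STAGE1_WORKERS):
        id_q.put(None)

    threads = [threading.Thread(target=stage1, args=(id_q, done_q))
               for _ in range(STAGE1_WORKERS)]
    threads += [threading.Thread(target=stage2, args=(done_q, results, lock))
                for _ in range(STAGE2_WORKERS)]
    for t in threads:
        t.start()

    id_q.join()                        # every stage-1 item and sentinel handled
    for _ in range(STAGE2_WORKERS):    # only now can stage 2 be told to stop
        done_q.put(None)
    done_q.join()
    for t in threads:
        t.join()
    return sorted(results)


print(run_pipeline([1, 2, 3, 4]))
```

The second batch of sentinels is enqueued only after `id_q.join()` returns; adding them earlier could stop a stage-2 worker while stage-1 output is still arriving.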

---

## **VI.8. Benchmark integration**

The pipeline gathers:

* Total pipeline runtime
* Time to fetch IDs
* Average download time per paper
* Average reference extraction time
* Max and average RAM usage
* Peak and final disk size
* Total processed paper count


In [8]:
# ---------------------------
# MAIN
# ---------------------------
DOWNLOAD_THREAD_COUNT = 3
REFERENCE_THREAD_COUNT = 2

#---------------------------
# Global benchmark instance
#---------------------------

benchmark = Benchmark()
# ---------------------------
# Missing papers helpers
# ---------------------------
PATTERN = re.compile(r"^(\d{4})-(\d{5})$")

def collect_existing_ids(base_dir: str, target_months: list, yymm_ranges: dict):
    """
    Collect existing folder tails only for target_months and only within
    the start/end ranges defined in yymm_ranges.

    Returns dict: { yymm: set(tail_ints) } limited to those ranges.
    """
    existing = {ym: set() for ym in target_months}
    if not os.path.isdir(base_dir):
        return existing

    for entry in os.scandir(base_dir):
        if not entry.is_dir():
            continue
        m = PATTERN.match(entry.name)
        if not m:
            continue
        yymm, tail = m.group(1), int(m.group(2))
        if yymm not in target_months:
            continue
        # Only consider folders whose tail falls inside the requested yymm_ranges.
        # yymm_ranges is expected to contain [start_tail, end_tail] for each yymm.
        if yymm not in yymm_ranges:
            # If no explicit range provided for this month, ignore it.
            continue
        start_tail, end_tail = yymm_ranges[yymm]
        if start_tail is None or end_tail is None:
            continue
        if start_tail <= tail <= end_tail:
            existing[yymm].add(tail)

    return existing


# def find_missing_ids(yymm, existing_set, start_tail, end_tail):
#     return sorted(set(range(start_tail, end_tail + 1)) - existing_set)

def format_arxiv_ids(yymm, tails):
    return [f"{yymm}.{t:05d}" for t in tails]


# ---------------------------
# Fetch IDs Worker
# ---------------------------
def fetch_ids_worker(start_month, start_year, start_ID,
                     end_month, end_year, end_ID,
                     start_index, num_papers,
                     download_all):
    t0 = time.time()
    print_ram_report("Before fetch_ids_worker")
    ids = get_IDs_All(start_month, start_year, start_ID,
                      end_month, end_year, end_ID)

    if download_all:
        selected_ids = ids
    else:
        selected_ids = ids[start_index:start_index + num_papers]

    yymm_ranges = {}
    for arxiv_id in selected_ids:
        yymm, tail = arxiv_id.split(".")
        tail = int(tail)
        if yymm not in yymm_ranges:
            yymm_ranges[yymm] = [tail, tail]
        else:
            yymm_ranges[yymm][0] = min(yymm_ranges[yymm][0], tail)
            yymm_ranges[yymm][1] = max(yymm_ranges[yymm][1], tail)

    benchmark.id_fetch_time = time.time() - t0
    print_ram_report("After fetch_ids_worker")
    print(f" Time for fetch_ids_worker: {benchmark.id_fetch_time:.2f} sec\n")
    return selected_ids, yymm_ranges

# ---------------------------
# Download helpers
# ---------------------------
def download_with_retries(client, arxiv_id, max_retries=5):
    for attempt in range(max_retries):
        try:
            search = arxiv.Search(id_list=[arxiv_id])
            return next(client.results(search))
        except Exception as e:
            if "429" in str(e) or "503" in str(e):
                wait = min(60*(2**attempt), 600) + random.random()*5
                print(f"[Download] Rate-limited: {e}, retry in {wait:.1f}s")
                time.sleep(wait)
            else:
                if attempt < max_retries - 1:
                    time.sleep(1+random.random())
                    continue
                raise
    raise RuntimeError(f"[Download] Failed after retries: {arxiv_id}")

def remove_folder_if_has_data(base_dir, arxiv_id):
    yymm, tail = arxiv_id.split(".")
    folder_name = f"{yymm}-{int(tail):05d}"
    folder_path = os.path.join(base_dir, folder_name)
    if os.path.isdir(folder_path) and any(os.scandir(folder_path)):
        shutil.rmtree(folder_path)
        print(f"Removed existing folder: {folder_name}")
        return True
    return False

# ---------------------------
# Download & Reference workers
# ---------------------------
def download_worker(id_queue, download_queue, base_data_dir, delay=1.0):
    client = arxiv.Client()
    processed = 0

    while True:
        arxiv_id = id_queue.get()
        if arxiv_id is None:
            id_queue.task_done()
            print(f"[Download] Thread exit. Total downloaded: {processed}")
            break
        t0 = time.time()
        try:
            remove_folder_if_has_data(base_data_dir, arxiv_id)
            print(f"[Download] Start {arxiv_id}")
            result_latest = download_with_retries(client, arxiv_id)
            short = result_latest.get_short_id()
            latest_version = int(short.split('v')[-1]) if 'v' in short else 1
            results_all = [result_latest]
            for v in range(1, latest_version):
                try:
                    res = download_with_retries(client, f"{arxiv_id}v{v}")
                    results_all.append(res)
                except Exception as e:
                    print(f"[Download] Warning: couldn't fetch version {v} for {arxiv_id}: {e}")
            download(results_all, base_data_dir)
            processed += 1
            print(f"[Download] Done {arxiv_id} (Total {processed})")
            download_queue.put(arxiv_id)
            time.sleep(delay)
        except Exception as e:
            print(f"[Download] Error {arxiv_id}: {e}")
            traceback.print_exc()
        finally:
            benchmark.download_times[arxiv_id] = time.time() - t0
            benchmark.sample_ram()
            benchmark.update_disk(base_data_dir)
            id_queue.task_done()

def reference_worker(download_queue, base_data_dir, delay=0.5):
    processed = 0
    while True:
        arxiv_id = download_queue.get()
        if arxiv_id is None:
            download_queue.task_done()
            print(f"[Reference] Thread exit. Total extracted: {processed}")
            break
        t0 = time.time()
        try:
            print(f"[Reference] Start {arxiv_id}")
            extract_references_for_paper(arxiv_id, base_data_dir)
            processed += 1
            print(f"[Reference] Done {arxiv_id} (Total {processed})")
            time.sleep(delay)
        except Exception as e:
            print(f"[Reference] Error {arxiv_id}: {e}")
            traceback.print_exc()
        finally:
            benchmark.reference_times[arxiv_id] = time.time() - t0
            benchmark.sample_ram()
            benchmark.update_disk(base_data_dir)
            download_queue.task_done()

# ---------------------------
# Recover missing papers
# ---------------------------
def recover_missing_papers(base_dir, yymm_ranges, selected_ids):
    """
    Recover missing papers, strictly within selected_ids and the yymm_ranges.
    This function:
      - builds the exact expected IDs from yymm_ranges,
      - intersects with selected_ids so only requested slice is recovered,
      - runs the download+reference workers with correct sentinel handling.
    """
    print("\n⏳ Starting: recover_missing_papers")
    t0 = time.time()
    print_ram_report("Before recover_missing_papers")

    # Fast lookup of the user's requested set
    selected_set = set(selected_ids)

    # Collect existing IDs strictly inside the numeric ranges (using fixed collect_existing_ids)
    target_months = list(yymm_ranges.keys())
    existing = collect_existing_ids(base_dir, target_months, yymm_ranges)

    # Build the *expected* IDs strictly from yymm_ranges, then keep only those in selected_set
    missing_ids = []
    for yymm, (start_tail, end_tail) in yymm_ranges.items():
        if start_tail is None or end_tail is None:
            continue
        for tail in range(start_tail, end_tail + 1):
            aid = f"{yymm}.{tail:05d}"
            # only consider if user selected it
            if aid not in selected_set:
                continue
            # only consider missing if not present in existing[yymm]
            if tail not in existing.get(yymm, set()):
                missing_ids.append(aid)

    if not missing_ids:
        print("\n✅ No missing papers left for the selected range")
        print_ram_report("After recover_missing_papers")
        print(f"⏱️ Time for recover_missing_papers: {time.time() - t0:.2f} sec\n")
        return

    print(f"\n⚠️ Missing papers detected ({len(missing_ids)}): {missing_ids[:20]}{'...' if len(missing_ids) > 20 else ''}")

    # Setup queues and worker threads
    id_queue = queue.Queue(maxsize=len(missing_ids) + DOWNLOAD_THREAD_COUNT + 2)
    download_queue = queue.Queue(maxsize=len(missing_ids) + REFERENCE_THREAD_COUNT + 2)

    # Enqueue missing IDs
    for aid in missing_ids:
        id_queue.put(aid)
    # Put sentinel None for each download thread so they exit when done
    for _ in range(DOWNLOAD_THREAD_COUNT):
        id_queue.put(None)

    # Start download workers
    download_threads = []
    for _ in range(DOWNLOAD_THREAD_COUNT):
        t = threading.Thread(target=download_worker, args=(id_queue, download_queue, base_dir))
        t.start()
        download_threads.append(t)

    # Start reference workers (they will wait on download_queue)
    reference_threads = []
    for _ in range(REFERENCE_THREAD_COUNT):
        t = threading.Thread(target=reference_worker, args=(download_queue, base_dir))
        t.start()
        reference_threads.append(t)

    # Wait until all enqueued download tasks are done
    id_queue.join()
    # signal reference workers to exit (one None per reference thread)
    for _ in range(REFERENCE_THREAD_COUNT):
        download_queue.put(None)
    # wait for reference processing to finish
    download_queue.join()

    # join all threads cleanly
    for t in download_threads:
        t.join()
    for t in reference_threads:
        t.join()

    print_ram_report("After recover_missing_papers")
    print(f"⏱️ Time for recover_missing_papers: {time.time() - t0:.2f} sec\n")


# **VII. Google Drive integration & pipeline execution**

---

## **VII.1. Mounting Google Drive**

To ensure all downloaded PDFs, extracted references, and metadata are stored persistently, the pipeline will attempt to mount Google Drive automatically.

```python
print("Attempting to mount Google Drive...")
try:
    from google.colab import drive
    drive.mount('/content/drive')
    print("Google Drive mounted!\n")
except Exception:
    print("Not running on Google Colab → skipping Google Drive mount.\n")
```

---

## **VII.2. Setting up the Data Directory**

All downloaded papers and extracted references are stored in a configurable base folder:

```python
base_data_dir = "/content/drive/MyDrive/arxiv_data"
os.makedirs(base_data_dir, exist_ok=True)
```

* This folder lives inside your mounted Google Drive → **persistent between notebook sessions**.
* The pipeline automatically creates it if it does not already exist.

---

## **VII.3. Configuring the pipeline run**

You choose the input arXiv ranges and how many papers to download:

```python
start_month, start_year = 3, 2023
end_month, end_year = 4, 2023
start_ID, end_ID = 7856, 4606
start_index, num_papers = 0, 10
download_all = False
```

### Meaning of parameters:

* **(start_month, start_year)** → beginning of the date range
* **(end_month, end_year)** → end of the date range
* **start_ID, end_ID** → tail number to start from in the first month and tail number to stop at in the last month (here `2303.07856` through `2304.04606`)
* **start_index, num_papers** → take a slice from the full ID list
* **download_all**

  * `False` → take only a subset
  * `True` → download the entire range

This gives full flexibility for partial downloads, sampling, or full-range crawls.
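The interaction between `download_all`, `start_index`, and `num_papers` comes down to a single slice. The sketch below uses a hypothetical `select_ids` helper and made-up IDs to show the behavior, including the case where the requested slice runs past the end of the list.

```python
def select_ids(ids, start_index, num_papers, download_all):
    """Return every ID, or the slice [start_index, start_index + num_papers)."""
    if download_all:
        return list(ids)
    # Python slicing clamps silently, so an oversized request returns fewer IDs
    return ids[start_index:start_index + num_papers]


all_ids = [f"2303.{n:05d}" for n in range(7856, 7876)]   # 20 made-up IDs
print(select_ids(all_ids, 0, 10, download_all=False))
print(len(select_ids(all_ids, 15, 10, download_all=False)))
print(len(select_ids(all_ids, 0, 10, download_all=True)))
```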

---

## **VII.4. Fetching IDs and preparing ranges**

The pipeline starts by discovering all arXiv IDs that match your parameters:

```python
selected_ids, yymm_ranges = fetch_ids_worker(...)
```

### Outputs:

* **selected_ids** → a flat list of full arXiv IDs
* **yymm_ranges** → a dictionary describing:

  ```
  { "YYMM": [start_tail, end_tail], ... }
  ```

These ranges are later used for:

* folder existence checks
* missing-paper recovery
* consistency validation

---

## **VII.5. Launching download & reference threads**

The worker queues coordinate the pipeline:

```python
id_queue = queue.Queue(maxsize=len(selected_ids)+DOWNLOAD_THREAD_COUNT+2)
download_queue = queue.Queue(maxsize=len(selected_ids)+REFERENCE_THREAD_COUNT+2)

for aid in selected_ids:
    id_queue.put(aid)

# One sentinel per downloader
for _ in range(DOWNLOAD_THREAD_COUNT):
    id_queue.put(None)
```

### Threads started:

```python
download_threads = [...]
reference_threads = [...]
```

* **Download workers** fetch PDFs (all versions), clean old folders, and push completed IDs into `download_queue`.
* **Reference workers** extract citation metadata from each just-downloaded folder.

### Thread termination:

```python
id_queue.join()
for _ in range(REFERENCE_THREAD_COUNT):
    download_queue.put(None)
download_queue.join()
for t in download_threads + reference_threads:
    t.join()
```

This ensures a clean shutdown with no deadlocks.

---

## **VII.6. Automatic recovery of missing papers**

After the main pipeline finishes, the system ensures full integrity:

```python
recover_missing_papers(base_data_dir, yymm_ranges, selected_ids)
```

This step will:

* Re-scan the storage directory
* Detect selected IDs that have no folder on disk (folder contents are not checked)
* Re-launch a mini-pipeline to re-fetch them
* Print diagnostic information

This guarantees **dataset completeness** even if the run was interrupted earlier.

---

## **VII.7. Performance report**

At the end of the execution, a full performance summary is printed:

```python
benchmark.report(base_data_dir, total_papers=len(selected_ids))
```

The report includes:

* Total runtime
* Time spent fetching IDs
* Average download time
* Average reference extraction time
* Max & average RAM usage
* Peak disk growth
* Final disk footprint
* Total processed papers

In [9]:
# ---------------------------
# Google Drive mount
# ---------------------------
print("Attempting to mount Google Drive...")
try:
    from google.colab import drive
    drive.mount('/content/drive')
    print("Google Drive mounted!\n")
except Exception:
    print("Not running on Google Colab → skipping Google Drive mount.\n")


# ---------------------------
# MAIN
# ---------------------------
base_data_dir = "/content/drive/MyDrive/arxiv_data_paper_test2"
os.makedirs(base_data_dir, exist_ok=True)

start_month, start_year = 3, 2023
end_month, end_year = 4, 2023
start_ID, end_ID = 7856, 4606
start_index, num_papers = 0, 10
download_all = False

print("Starting pipeline...\n")
print_ram_report("Pipeline Start")
selected_ids, yymm_ranges = fetch_ids_worker(
        start_month, start_year, start_ID,
        end_month, end_year, end_ID,
        start_index, num_papers,
        download_all
    )

print("→ Implemented ranges (yymm -> start_tail,end_tail):", yymm_ranges)
print(f"→ Selected IDs (first 20): {selected_ids[:20]}{'...' if len(selected_ids)>20 else ''}")

# Queues for threading
id_queue = queue.Queue(maxsize=len(selected_ids)+DOWNLOAD_THREAD_COUNT+2)
download_queue = queue.Queue(maxsize=len(selected_ids)+REFERENCE_THREAD_COUNT+2)
for aid in selected_ids: id_queue.put(aid)
for _ in range(DOWNLOAD_THREAD_COUNT): id_queue.put(None)

download_threads = [threading.Thread(target=download_worker, args=(id_queue, download_queue, base_data_dir)) for _ in range(DOWNLOAD_THREAD_COUNT)]
for t in download_threads: t.start()
reference_threads = [threading.Thread(target=reference_worker, args=(download_queue, base_data_dir)) for _ in range(REFERENCE_THREAD_COUNT)]
for t in reference_threads: t.start()

id_queue.join()
for _ in range(REFERENCE_THREAD_COUNT): download_queue.put(None)
download_queue.join()
for t in download_threads + reference_threads: t.join()

recover_missing_papers(base_data_dir, yymm_ranges, selected_ids)

# Final performance report
benchmark.report(base_data_dir, total_papers=len(selected_ids))

Attempting to mount Google Drive...
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Google Drive mounted!

Starting pipeline...


🔹 RAM Python-process : 108.75 MB
🔹 RAM System-global  : 1264.15 MB


🔹 RAM Python-process : 108.75 MB
🔹 RAM System-global  : 1264.15 MB


🔹 RAM Python-process : 109.84 MB
🔹 RAM System-global  : 1314.21 MB

 Time for fetch_ids_worker: 1.94 sec

→ Implemented ranges (yymm -> start_tail,end_tail): {'2303': [7856, 7865]}
→ Selected IDs (first 20): ['2303.07856', '2303.07857', '2303.07858', '2303.07859', '2303.07860', '2303.07861', '2303.07862', '2303.07863', '2303.07864', '2303.07865']
[Download] Start 2303.07856
[Download] Start 2303.07857
[Download] Start 2303.07858
Processing 2303.07858 → /content/drive/MyDrive/arxiv_data_paper_test2/2303-07858
Processing 2303.07857 → /content/drive/MyDrive/arxiv_data_paper_test2/2303-07857
Attempting source: https://arx