### 1. Mount Google Drive
This cell mounts your Google Drive in the Colab environment so that the large OpenAlex 2025 dataset is written directly to Drive and persists across sessions.

In [None]:
from google.colab import drive
drive.mount("/content/drive", force_remount=True)


Mounted at /content/drive


### 2. Import Libraries
Here we import the necessary Python libraries:
*   `requests`: For making HTTP requests to the OpenAlex API.
*   `json`: For parsing and saving data in JSON Lines format.
*   `tqdm`: For displaying a progress bar during the long download process.
*   `time.sleep`: For pausing before a retry when a request fails or is rate limited.

In [None]:
import requests
import json
from tqdm import tqdm
from time import sleep


### 3. Configuration & Setup
We define key constants for the data collection process:
*   `BASE_URL`: The OpenAlex API endpoint for works (papers).
*   `CS_CONCEPT_ID`: The specific ID for 'Computer Science' to filter the dataset.
*   `YEAR`: We are targeting papers from 2025.
*   `OUTPUT_PATH`: The directory in Google Drive where data will be saved.
*   `OUTPUT_FILE`: The JSON Lines file inside `OUTPUT_PATH` that receives one record per paper.

We also ensure the output directory exists.

In [None]:
BASE_URL = "https://api.openalex.org/works"
CS_CONCEPT_ID = "C41008148"   # Computer Science
YEAR = 2025

OUTPUT_PATH = "/content/drive/MyDrive/OpenAlex_CS_2025"
OUTPUT_FILE = f"{OUTPUT_PATH}/openalex_cs_2025.jsonl"

import os
os.makedirs(OUTPUT_PATH, exist_ok=True)
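
To make the role of these constants concrete, the following sketch builds the URL of the first request from them using only the standard library. The `params` dictionary mirrors the one used by the fetch function in the next section; `first_request_url` is a name introduced here for illustration, and the exact escaping produced by `requests` may differ slightly.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openalex.org/works"
CS_CONCEPT_ID = "C41008148"  # Computer Science
YEAR = 2025

# Same query parameters the fetch function sends on its first request.
params = {
    "filter": f"publication_year:{YEAR},concepts.id:{CS_CONCEPT_ID}",
    "per-page": 200,
    "cursor": "*",
}

# urlencode percent-encodes ':', ',' and '*'; the server decodes them again.
first_request_url = f"{BASE_URL}?{urlencode(params)}"
print(first_request_url)
```

Pasting the printed URL into a browser is a quick way to preview one page of results before starting the full download.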


### 4. Define Fetching Function
This is the core function `fetch_cs_2025` that handles the data retrieval:
*   **Filters**: It sets up the API parameters to fetch Computer Science papers from 2025.
*   **Pagination**: It uses cursor-based pagination, starting at `cursor=*` and passing each response's `meta.next_cursor` to the next request, to iterate through the entire dataset.
*   **Error Handling**: It includes a retry mechanism for failed requests.
*   **Data Extraction**: For each paper, it extracts relevant fields (ID, DOI, title, abstract, authors, concepts, venue, etc.) and saves them to the JSONL file.
*   **Progress Tracking**: A progress bar shows the number of papers saved.
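
The pagination step can be sketched in isolation. The example below is a minimal illustration, not the notebook's implementation: `iterate_pages` and `FAKE_PAGES` are hypothetical names, and the fake pages stand in for API responses so the loop can run without network access.

```python
from typing import Callable, Iterator, Optional


def iterate_pages(fetch_page: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every record by following `meta.next_cursor` until it runs out."""
    cursor: Optional[str] = "*"  # "*" asks the API to start a new cursor walk
    while cursor:
        page = fetch_page(cursor)
        results = page.get("results", [])
        if not results:
            break  # an empty page marks the end of the result set
        yield from results
        cursor = page.get("meta", {}).get("next_cursor")


# Hypothetical stand-in for the API: two pages of results, then an empty one.
FAKE_PAGES = {
    "*": {"results": [{"id": "W1"}, {"id": "W2"}], "meta": {"next_cursor": "abc"}},
    "abc": {"results": [{"id": "W3"}], "meta": {"next_cursor": "def"}},
    "def": {"results": [], "meta": {"next_cursor": None}},
}

ids = [record["id"] for record in iterate_pages(FAKE_PAGES.__getitem__)]
print(ids)
```

Passing the page fetcher in as a callable keeps the cursor logic testable separately from the HTTP call.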

In [None]:
def fetch_cs_2025():
    params = {
        "filter": f"publication_year:{YEAR},concepts.id:{CS_CONCEPT_ID}",
        "per-page": 200,
        "cursor": "*"
    }

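    total_saved = 0
    failures = 0       # consecutive failed requests
    MAX_FAILURES = 10  # give up instead of retrying forever

    with open(OUTPUT_FILE, "w", encoding="utf-8") as out_fp:
        with tqdm(desc="Fetching CS 2025 papers") as pbar:
            while True:
                try:
                    # The timeout keeps a stalled connection from hanging the loop.
                    resp = requests.get(BASE_URL, params=params, timeout=60)
                    resp.raise_for_status()
                except requests.RequestException as exc:
                    failures += 1
                    if failures >= MAX_FAILURES:
                        raise
                    print(f"Request failed ({exc}), retrying...")
                    sleep(5)
                    continue
                failures = 0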

                data = resp.json()
                results = data.get("results", [])

                if not results:
                    break

                for paper in results:
                    # Safe venue extraction
                    venue = None
                    primary_location = paper.get("primary_location")
                    if primary_location and primary_location.get("source"):
                        venue = primary_location["source"].get("display_name")

                    record = {
                        "openalex_id": paper.get("id"),
                        "doi": paper.get("doi"),
                        "title": paper.get("title"),
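                        # OpenAlex serves abstracts as `abstract_inverted_index`
                        # (word -> list of positions), not as an `abstract` string.
                        "abstract": " ".join(
                            word for _, word in sorted(
                                (pos, word)
                                for word, positions in (paper.get("abstract_inverted_index") or {}).items()
                                for pos in positions
                            )
                        ) or None,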
                        "publication_year": paper.get("publication_year"),
                        "publication_date": paper.get("publication_date"),

                        "authors": [
                            {
                                "author_id": a.get("author", {}).get("id"),
                                "name": a.get("author", {}).get("display_name")
                            }
                            for a in paper.get("authorships", [])
                        ],

                        "concepts": [
                            {"id": c.get("id"), "name": c.get("display_name")}
                            for c in paper.get("concepts", [])
                        ],

                        "venue": venue,
                        "citation_count": paper.get("cited_by_count"),
                        "is_open_access": paper.get("open_access", {}).get("is_oa"),
                        "oa_status": paper.get("open_access", {}).get("oa_status"),
                        "url": paper.get("id"),
                    }

                    out_fp.write(json.dumps(record) + "\n")
                    total_saved += 1
                    pbar.update(1)

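                # Follow the cursor; stop when the API reports no further page.
                next_cursor = data.get("meta", {}).get("next_cursor")
                if not next_cursor:
                    break
                params["cursor"] = next_cursor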

    print(f"\n✅ Total CS 2025 papers saved: {total_saved}")
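
One way to structure the retry step is a small helper with a capped number of attempts and exponential backoff. This is a sketch under assumptions: `with_retries`, `max_attempts`, and `base_delay` are names introduced here, and the injectable `sleep` argument exists only so the helper can be exercised without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The helper only retries when `call` raises, so a wrapped request would need to raise on a failed response, for example by calling `raise_for_status()` on the `requests` response inside `call`.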


### 5. Execute Data Collection
This cell calls the `fetch_cs_2025()` function defined above to start the download. At 200 records per request, the 331,600 papers counted in section 10 correspond to about 1,660 sequential API calls, so expect a long run.

In [None]:
fetch_cs_2025()

### 6. Domain Partitioning
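**Note:** This cell calls `clean_paper` and `detect_domain`, which are not defined in this notebook and must be defined before the cell runs. It also reads `openalex_cs_2025_RAW.jsonl`, whereas the fetch step above writes `openalex_cs_2025.jsonl`; point `INPUT_FILE` at whichever file holds the raw records.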
Its purpose is to:
1.  Read the raw collected data (`openalex_cs_2025_RAW.jsonl`).
2.  Create separate output files for each domain (AI, ML, DL, NLP, CV, RL, Other CS).
3.  Iterate through each paper, clean it, detect its specific domain, and write it to the corresponding domain-specific file.
This is the critical step that enables the domain-aware routing architecture.
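
Because the two helper functions are not shown, the following is only a hypothetical sketch of what they might look like, assuming records shaped like those written by the fetch step (a `concepts` list of `{"id", "name"}` dictionaries). The keyword map `DOMAIN_KEYWORDS` and its matching rule are invented for illustration and are not the author's definitions.

```python
# Hypothetical keyword map: domain key -> substrings searched in concept names.
# More specific domains come first so they win over "ml" and "ai".
DOMAIN_KEYWORDS = {
    "nlp": ("natural language processing", "language model", "machine translation"),
    "cv": ("computer vision", "image segmentation", "object detection"),
    "rl": ("reinforcement learning",),
    "dl": ("deep learning", "convolutional neural network", "artificial neural network"),
    "ml": ("machine learning",),
    "ai": ("artificial intelligence",),
}


def clean_paper(paper: dict) -> dict:
    """Return a copy with a whitespace-normalised title and a concepts list."""
    cleaned = dict(paper)
    cleaned["title"] = " ".join((paper.get("title") or "").split())
    cleaned["concepts"] = paper.get("concepts") or []
    return cleaned


def detect_domain(concepts: list) -> str:
    """Return the first domain whose keyword appears in any concept name."""
    names = [(c.get("name") or "").lower() for c in concepts]
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(kw in name for name in names for kw in keywords):
            return domain
    return "other_cs"  # fallback key, matching the `files` dictionary below
```

The returned keys match the seven output files opened in the cell that follows, so every record has a destination.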

In [None]:
INPUT_FILE = "/content/drive/MyDrive/OpenAlex_CS_2025/openalex_cs_2025_RAW.jsonl"
OUTPUT_DIR = "/content/drive/MyDrive/OpenAlex_CS_2025_Domains_processed/"

os.makedirs(OUTPUT_DIR, exist_ok=True)

files = {
    "ai": open(OUTPUT_DIR + "ai.jsonl", "a"),
    "ml": open(OUTPUT_DIR + "ml.jsonl", "a"),
    "dl": open(OUTPUT_DIR + "dl.jsonl", "a"),
    "nlp": open(OUTPUT_DIR + "nlp.jsonl", "a"),
    "cv": open(OUTPUT_DIR + "cv.jsonl", "a"),
    "rl": open(OUTPUT_DIR + "rl.jsonl", "a"),
    "other_cs": open(OUTPUT_DIR + "other_cs.jsonl", "a")
}

with open(INPUT_FILE, "r", encoding="utf-8") as f:
    for line in tqdm(f, desc="Domain split"):
        paper = json.loads(line)
        cleaned = clean_paper(paper)
        domain = detect_domain(cleaned["concepts"])
        cleaned["domain"] = domain
        files[domain].write(json.dumps(cleaned) + "\n")

for f in files.values():
    f.close()


### 7. Verify Data (Count & Print)
This cell performs a quick verification by reading the saved file, counting the papers, and printing every record with its running index. Colab truncates the streamed output to the last 5,000 lines, so only the tail of the file is visible below.

In [None]:
import json

FILE_PATH = "/content/drive/MyDrive/OpenAlex_CS_2025/openalex_cs_2025.jsonl"

count = 0
with open(FILE_PATH, "r", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            paper = json.loads(line)
            count += 1
            print(f"{count} and {paper}")

print("Total papers in file:", count)


Streaming output truncated to the last 5000 lines.
7384 and {'openalex_id': 'https://openalex.org/W4406833239', 'doi': 'https://doi.org/10.1016/j.jcis.2025.01.222', 'title': 'Enhanced carrier separation and decreased reaction barrier in cobalt-doped CdIn2S4 nanosheets for photocatalytic hydrogen evolution', 'abstract': None, 'publication_year': 2025, 'publication_date': '2025-01-26', 'authors': [{'author_id': 'https://openalex.org/A5074463904', 'name': 'Liang Mao'}, {'author_id': 'https://openalex.org/A5023442149', 'name': 'Qinran Li'}, {'author_id': 'https://openalex.org/A5017278139', 'name': 'Qing Zhou'}, {'author_id': 'https://openalex.org/A5022904115', 'name': 'Wenqiang Dang'}, {'author_id': 'https://openalex.org/A5083398971', 'name': 'Yulong Zhao'}, {'author_id': 'https://openalex.org/A5100298360', 'name': 'Yuzhen Sun'}, {'author_id': 'https://openalex.org/A5101696190', 'name': 'Xiaoyan Cai'}], 'concepts': [{'id': 'https://openalex.org/C65165184', 'name': 'Photocatal
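
Printing every record makes problems hard to spot in a file this large. As a complementary check, the sketch below reports what fraction of records have a non-empty value for selected fields. `field_coverage` is a hypothetical helper, and the two sample lines are made-up records used only to show the calculation.

```python
import json
from collections import Counter
from typing import Iterable


def field_coverage(lines: Iterable[str], fields: tuple) -> dict:
    """Return, per field, the fraction of records where it is non-empty."""
    filled = Counter()
    total = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        for field in fields:
            if record.get(field):  # None, "" and [] all count as empty
                filled[field] += 1
    return {f: (filled[f] / total if total else 0.0) for f in fields}


# Made-up sample records; in the notebook an open file handle would be passed.
sample = [
    '{"title": "A", "abstract": null, "authors": [{"name": "X"}]}',
    '{"title": "B", "abstract": "text", "authors": []}',
]
coverage = field_coverage(sample, ("title", "abstract", "authors"))
print(coverage)
```

To run it on the dataset, pass the handle from `open(FILE_PATH, encoding="utf-8")` in place of `sample`.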

### 8. Check File Size
This utility cell calculates and prints the size of the generated dataset file in Megabytes (MB) to monitor storage usage.

In [1]:
import os

FILE_PATH = "/content/drive/MyDrive/OpenAlex_CS_2025/openalex_cs_2025.jsonl"

print("File size (MB):", round(os.path.getsize(FILE_PATH) / (1024*1024), 3))


File size (MB): 537.383


### 9. Inspect First Record
This cell reads and prints the first paper object from the dataset. This is useful for inspecting the JSON structure and checking whether fields such as `abstract`, `authors`, and `concepts` are populated. In the output below, `abstract` is `null`.

In [2]:
import json

with open(FILE_PATH, "r", encoding="utf-8") as f:
    first_line = f.readline().strip()

if not first_line:
    print("❌ File is empty")
else:
    paper = json.loads(first_line)
    print("✅ First paper object:\n")
    print(json.dumps(paper, indent=2))


✅ First paper object:

{
  "openalex_id": "https://openalex.org/W1803273808",
  "doi": "https://doi.org/10.4135/9781036235611",
  "title": "The Coding Manual for Qualitative Researchers",
  "abstract": null,
  "publication_year": 2025,
  "publication_date": "2025-01-01",
  "authors": [
    {
      "author_id": "https://openalex.org/A5058231379",
      "name": "Johnny Salda\u00f1a"
    }
  ],
  "concepts": [
    {
      "id": "https://openalex.org/C179518139",
      "name": "Coding (social sciences)"
    },
    {
      "id": "https://openalex.org/C2780031656",
      "name": "Glossary"
    },
    {
      "id": "https://openalex.org/C73231260",
      "name": "Tunstall coding"
    },
    {
      "id": "https://openalex.org/C130811719",
      "name": "Shannon\u2013Fano coding"
    },
    {
      "id": "https://openalex.org/C60603091",
      "name": "Variable-length code"
    },
    {
      "id": "https://openalex.org/C41008148",
      "name": "Computer science"
    },
    {
      "id": "htt

### 10. Fast Line Count
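This cell uses the system command `wc -l` (word count, line mode) to quickly count the papers in the JSONL file. Because `wc -l` counts newline characters without parsing any JSON, it is typically much faster than decoding every line in Python, and it matches the record count because each record is written with a trailing newline.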

In [3]:
import subprocess

FILE_PATH = "/content/drive/MyDrive/OpenAlex_CS_2025/openalex_cs_2025.jsonl"

result = subprocess.run(
    ["wc", "-l", FILE_PATH],
    capture_output=True,
    text=True
)

print("Total papers (lines):", result.stdout.split()[0])


Total papers (lines): 331600
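
Where `wc` is unavailable, a similar count can be approximated in pure Python by reading the file in binary chunks and counting newline bytes. This is a sketch; `count_lines` and its `chunk_size` default are choices made here, not part of the notebook.

```python
def count_lines(path: str, chunk_size: int = 1 << 20) -> int:
    """Count newline bytes by reading the file in binary chunks."""
    total = 0
    with open(path, "rb") as fp:
        # Read 1 MiB at a time so memory use stays flat for large files.
        while chunk := fp.read(chunk_size):
            total += chunk.count(b"\n")
    return total
```

Like `wc -l`, this counts newline characters, so a final line without a trailing newline is not included in the total.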
