# OpenAlex
OpenAlex looks more promising than Semantic Scholar, if only because its full dataset can be downloaded. But let's take things one step at a time.

We are given:
- the researcher's pid from dblp.org
- a list of DOIs for papers by that researcher

---

## API Rate Limits
- Requests per second: up to ~1 request per second recommended
- Burst limit: short bursts slightly above 1 rps allowed
- Requests per day: not officially documented; flexible within reason
- Pagination limit: max 200 results per page (default 25)
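
A minimal sketch of a client-side throttle that enforces the ~1 request per second guideline listed above. `MinIntervalThrottle` and `polite_params` are hypothetical helpers, not part of OpenAlex or `requests`. The `mailto` query parameter is, as far as I know, how OpenAlex lets clients identify themselves for its polite pool; the address shown is a placeholder.

```python
import time


class MinIntervalThrottle:
    """Block so that consecutive wait() calls are at least min_interval seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._last = None        # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def polite_params(extra=None, mailto="you@example.org"):
    """Merge query parameters with a mailto address (placeholder; use your own)."""
    params = {"mailto": mailto}
    params.update(extra or {})
    return params
```

Calling `throttle.wait()` right before each `requests.get(...)` sleeps only for the time still missing since the previous call, instead of a fixed one second after every request.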


## API Retrieval

### Get publication data using DBLP pid

In [1]:
import requests
import json
import time

# --- CONFIGURATION ---
OPENALEX_API_BASE = "https://api.openalex.org"
HEADERS = {"Accept": "application/json"}

# Adjustable parameters
RATE_LIMIT_DELAY = 1  # seconds between requests (for polite use)



In [2]:
# ----------------------------
# 1. Get Author Publications via DBLP ID
# ----------------------------
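def get_author_id_by_dblp(dblp_id: str):
    """Return the OpenAlex ID of the first author matching a DBLP pid, or None."""
    # NOTE: when run, this request returned HTTP 403 (see the output below).
    url = f"{OPENALEX_API_BASE}/authors?filter=ids.dblp:{dblp_id}"
    response = requests.get(url, headers=HEADERS, timeout=30)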
    if response.status_code == 200:
        results = response.json().get("results", [])
        if results:
            author = results[0]  # Take first match
            print(f"Found author: {author['display_name']} (OpenAlex ID: {author['id']})")
            return author['id']
        else:
            print("No author found with given DBLP ID.")
    else:
        print(f"Failed to query author: {response.status_code}")
    return None


def get_author_works(author_id: str, max_results=50):
    works = []
    page = 1
    per_page = 25  # API default; the maximum is 200, and larger pages mean fewer requests
    print(f"Fetching works for author {author_id}...")
    
    while len(works) < max_results:
        url = f"{OPENALEX_API_BASE}/works?filter=authorships.author.id:{author_id}&per-page={per_page}&page={page}"
        response = requests.get(url, headers=HEADERS, timeout=30)
        if response.status_code == 200:
            page_results = response.json().get("results", [])
            if not page_results:
                break  # No more results
            works.extend(page_results)
            print(f"Fetched {len(works)} works so far...")
            page += 1
            time.sleep(RATE_LIMIT_DELAY)
        else:
            print(f"Failed to fetch works: {response.status_code}")
            break
    return works[:max_results]




# Example 1: Author publications via DBLP ID
dblp_id = "https://dblp.org/pid/32/5357"  # Replace with actual DBLP ID
author_id = get_author_id_by_dblp(dblp_id)
if author_id:
    author_works = get_author_works(author_id, max_results=20)  # Limit to 20 works
    with open("author_publications.json", "w", encoding="utf-8") as f:
        json.dump(author_works, f, indent=4)
    print(f"Saved {len(author_works)} works to 'author_publications.json'.")



Failed to query author: 403
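
Since the lookup by DBLP pid fails, one possible fallback is to go through the DOI list we already have: fetch each work and take the author ID that appears in the most of them. A minimal sketch, assuming work records shaped like the `/works` responses (each with `authorships[].author.id` and `display_name`). `most_common_author` is a hypothetical helper and the sample records are made up.

```python
from collections import Counter


def most_common_author(works):
    """Return (author_id, display_name, count) for the author found in the most works, or None."""
    counts = Counter()
    names = {}
    for work in works:
        seen = set()  # count each author at most once per work
        for authorship in work.get("authorships") or []:
            author = authorship.get("author") or {}
            author_id = author.get("id")
            if author_id and author_id not in seen:
                seen.add(author_id)
                counts[author_id] += 1
                names[author_id] = author.get("display_name")
    if not counts:
        return None
    author_id, count = counts.most_common(1)[0]
    return author_id, names[author_id], count


# Made-up records for illustration only
sample_works = [
    {"authorships": [{"author": {"id": "A1", "display_name": "Ada"}},
                     {"author": {"id": "A2", "display_name": "Bob"}}]},
    {"authorships": [{"author": {"id": "A1", "display_name": "Ada"}}]},
]
print(most_common_author(sample_works))
```

A co-author who appears on every given paper would tie with the researcher, so the result is worth checking against the expected name before using it.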


### Get publication data via DOI

This works well. We get citation counts and an abstract.

**Problem**: For polite API usage we have to wait 1 second between API calls, which can add up to hundreds of seconds of wait time. The returned data is also quite large, but that is the least of our issues.

In [3]:
# ----------------------------
# 2. Get Publication Data via DOIs
# ----------------------------
def get_work_by_doi(doi: str):
    """Fetch a single work record by DOI; returns the parsed JSON or None."""
    url = f"{OPENALEX_API_BASE}/works/https://doi.org/{doi}"
    response = requests.get(url, headers=HEADERS, timeout=30)
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Failed to fetch data for DOI {doi}: {response.status_code}")
    return None


def fetch_bulk_paper_data(dois: list):
    papers_data = []
    for doi in dois:
        paper = get_work_by_doi(doi)
        if paper:
            papers_data.append(paper)
        time.sleep(RATE_LIMIT_DELAY)  # Respect polite API usage
    return papers_data


# Example 2: Bulk paper data via DOIs
dois = [
    "10.1145/3366423.3380250",
    "10.1038/s41586-020-2649-2"
    # Add more DOIs as needed
]
papers_data = fetch_bulk_paper_data(dois)
with open("papers_metadata.json", "w", encoding="utf-8") as f:
    json.dump(papers_data, f, indent=4)
print(f"Saved metadata for {len(papers_data)} papers to 'papers_metadata.json'.")


Saved metadata for 2 papers to 'papers_metadata.json'.
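
The saved records are large, and to my knowledge the abstract arrives as an `abstract_inverted_index` (word to positions) rather than plain text. A minimal sketch of trimming a record down to the fields these notes care about and rebuilding the abstract. `abstract_from_inverted_index`, `summarize_work` and `SELECT_FIELDS` are hypothetical names; passing `SELECT_FIELDS` as the `select` query parameter to shrink the response is an assumption to verify against the API docs.

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild plain text from a {word: [positions]} mapping; returns "" if missing."""
    if not inverted_index:
        return ""
    positioned = []
    for word, positions in inverted_index.items():
        for pos in positions:
            positioned.append((pos, word))
    positioned.sort()  # order words by their position in the abstract
    return " ".join(word for _, word in positioned)


def summarize_work(work):
    """Keep only a handful of fields from a full work record."""
    return {
        "id": work.get("id"),
        "doi": work.get("doi"),
        "title": work.get("title"),
        "cited_by_count": work.get("cited_by_count"),
        "abstract": abstract_from_inverted_index(work.get("abstract_inverted_index")),
    }


# Assumed value for the `select` query parameter (request only these fields)
SELECT_FIELDS = "id,doi,title,cited_by_count,abstract_inverted_index"
```

Applying `summarize_work` to each item of `papers_data` before `json.dump` keeps the output file small and directly readable.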


### Retrieving data in chunks
See `asyn_fetch.py`.
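
Independently of that script (this is not its contents), one way to cut the number of requests is to ask for several DOIs per call by OR-ing them in a single `doi:` filter with `|`. A minimal sketch that only builds the query parameters. `chunked`, `doi_filter` and `batch_params` are hypothetical helpers, and the batch size of 50 is an assumption about how many OR-ed values one filter accepts.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def doi_filter(dois):
    """Build a filter value such as 'doi:10.1/a|10.2/b' (OR over the DOIs)."""
    return "doi:" + "|".join(dois)


def batch_params(dois, batch_size=50):
    """Return one query-parameter dict per batch of DOIs."""
    return [
        # per-page is set to the batch length so one page holds the whole batch
        {"filter": doi_filter(batch), "per-page": len(batch)}
        for batch in chunked(list(dois), batch_size)
    ]
```

Each dict can be passed as `params=` to `requests.get(f"{OPENALEX_API_BASE}/works", ...)`. Under that assumption, 200 DOIs need 4 requests instead of 200, so the 1-second delay costs a few seconds rather than minutes.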

## Downloading OpenAlex dataset



- AWS Open Data registry entry: https://registry.opendata.aws/openalex/
- Snapshot documentation: https://docs.openalex.org/download-all-data/openalex-snapshot
- PostgreSQL schema: https://github.com/ourresearch/openalex-documentation-scripts/blob/main/openalex-pg-schema.sql

The abstracts data alone takes 412 GB of space, which is bananas, and that is the compressed size. Uncompressed it would be more than 1 TB. So let's stick to querying the API.

https://openalex.s3.amazonaws.com/browse.html#data/works/updated_date=2024-10-02/
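
Before committing to a download, the size of an entity's snapshot can be estimated from its manifest file. A minimal sketch, assuming the manifest is JSON with an `entries` list whose items carry `meta.content_length` (compressed bytes) and `meta.record_count`. Both that layout and the manifest location (something like `data/works/manifest` in the bucket) are assumptions to check against the snapshot documentation. `manifest_totals` and `human_size` are hypothetical helpers.

```python
def manifest_totals(manifest):
    """Sum compressed bytes and record counts over the manifest's entries."""
    total_bytes = 0
    total_records = 0
    for entry in manifest.get("entries", []):
        meta = entry.get("meta", {})
        total_bytes += meta.get("content_length", 0)
        total_records += meta.get("record_count", 0)
    return total_bytes, total_records


def human_size(num_bytes):
    """Format a byte count using binary units."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"
```

With a manifest loaded via `json.load`, `human_size(manifest_totals(manifest)[0])` gives the compressed size to compare against available disk space.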
