# Semantic Scholar data retrieval
We are typically given:
- the researcher's pid from dblp.org
- a list of DOIs of the researcher's papers

We need to get:
- citation counts for these papers
- paper abstracts

API documentation: https://api.semanticscholar.org/api-docs/graph

---

## API rate limits
Understanding API rate limits and usage policies is essential to avoid getting blocked and to optimize data collection. 

1. Request rate: 100 requests per 5 minutes (average of 20 requests/min)
2. Requests per day: 1,000 requests per day (anonymous use)
3. Batch lookup size: max 100 paper IDs per `/batch` call
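
The limits listed above suggest two simple client-side measures: split ID lists into chunks of at most 100, and space requests roughly 3 seconds apart (300 s / 100 requests). The following is a minimal sketch under those assumptions; the names `chunked`, `paced`, and the constants are illustrative and not part of any library.

```python
import time
from typing import Iterable, Iterator, List, Sequence

MAX_BATCH_SIZE = 100       # max paper IDs per /batch call (value from the list above)
WINDOW_S = 5 * 60          # rate-limit window in seconds
REQUESTS_PER_WINDOW = 100  # allowed requests per window
MIN_INTERVAL_S = WINDOW_S / REQUESTS_PER_WINDOW  # 3.0 seconds between requests


def chunked(ids: Sequence[str], size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids` with at most `size` elements each."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def paced(batches: Iterable[List[str]], interval_s: float = MIN_INTERVAL_S) -> Iterator[List[str]]:
    """Yield batches one by one, sleeping between them to stay under the average rate."""
    for i, batch in enumerate(batches):
        if i > 0:
            time.sleep(interval_s)
        yield batch
```

A loop such as `for batch in paced(chunked(all_ids)): ...` would then issue one `/batch` request per iteration. This only bounds the average rate; it does not track the daily limit.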

## API calls to Semantic Scholar

Let's look at which API calls Semantic Scholar supports for the inputs we have.


In [7]:
import requests

name = "Rainer Gemulla"
pid = "32/5357"
doi = [
    '10.1145/3583780.3614895', 
    '10.18653/v1/2023.findings-emnlp.713', 
    '10.18653/v1/2023.repl4nlp-1.11', 
    '10.48550/arXiv.2305.13059'
    ]

### Fetching by PID
Semantic Scholar does not directly support fetching by dblp pid. It would require matching the author based on paper titles, which we want to avoid.

### Fetching by DOI
Does not work for all papers: in the example below, two of the four DOIs come back as `null`.

In [None]:
import json 

r = requests.post(
    'https://api.semanticscholar.org/graph/v1/paper/batch',
    params={'fields': 'referenceCount,citationCount,title'},
    json={"ids": [f"DOI:{i}" for i in doi]})
print(json.dumps(r.json(), indent=2))

[
  {
    "paperId": "4978520a959b103f9dd55ec3ce4545ab06191a18",
    "title": "Good Intentions: Adaptive Parameter Management via Intent Signaling",
    "referenceCount": 75,
    "citationCount": 1
  },
  null,
  null,
  {
    "paperId": "d573ea0c11a20a53651550b8dc9af7c99814d1c8",
    "title": "Friendly Neighbors: Contextualized Sequence-to-Sequence Link Prediction",
    "referenceCount": 22,
    "citationCount": 5
  }
]
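
Since unresolved DOIs show up as `null` entries, it helps to pair each input DOI with its result and collect the misses. The sketch below assumes the response preserves the order of the requested IDs, as the output above suggests; `split_found_missing` is a hypothetical helper name, and the sample data is copied from that output and trimmed.

```python
from typing import Dict, List, Optional, Tuple


def split_found_missing(
    dois: List[str], results: List[Optional[dict]]
) -> Tuple[Dict[str, dict], List[str]]:
    """Pair each DOI with its result; DOIs whose entry is None are reported as missing."""
    if len(dois) != len(results):
        raise ValueError("response length does not match the number of requested DOIs")
    found = {d: r for d, r in zip(dois, results) if r is not None}
    missing = [d for d, r in zip(dois, results) if r is None]
    return found, missing


# Sample data copied from the response above (trimmed to the relevant fields).
doi = [
    "10.1145/3583780.3614895",
    "10.18653/v1/2023.findings-emnlp.713",
    "10.18653/v1/2023.repl4nlp-1.11",
    "10.48550/arXiv.2305.13059",
]
response = [
    {"paperId": "4978520a959b103f9dd55ec3ce4545ab06191a18", "citationCount": 1},
    None,
    None,
    {"paperId": "d573ea0c11a20a53651550b8dc9af7c99814d1c8", "citationCount": 5},
]

found, missing = split_found_missing(doi, response)
print(missing)  # the two DOIs that resolved to null
```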


### Fetch by dblp pub id

Does not work. The batch endpoint rejects the `DBLP:`-prefixed ID in the request below.

In [13]:
dblp_paper_id = [
    "conf/cikm/Renz-WielandKGG23",
    "conf/emnlp/KochsiekG23",
    "conf/rep4nlp/KochsiekSNG23",
    "journals/corr/abs-2305-13059"
]

r = requests.post(
    'https://api.semanticscholar.org/graph/v1/paper/batch',
    params={'fields': 'referenceCount,citationCount,title'},
    json={"ids": ["DBLP:conf/acl/LoWNKW20"]})
print(json.dumps(r.json(), indent=2))


{
  "error": "No valid paper ids given"
}


### Hybrid strategy

Fetch one publication by **DOI**, read the **author_id** from it, and then use that ID to fetch the author's publication data.
This would require two calls.

**!!** Since we already cannot retrieve all papers by DOI, there is no point in pursuing this further.
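
For reference, the two calls could look roughly like the sketch below. It has not been run as part of these notes. It assumes the Graph API's single-paper lookup (`/paper/DOI:{doi}` with `fields=authors`) and the author papers listing (`/author/{authorId}/papers`); `pick_author_id` and `fetch_author_papers` are hypothetical helper names, and matching the author by exact name is a simplification.

```python
from typing import List, Optional

BASE = "https://api.semanticscholar.org/graph/v1"


def pick_author_id(paper: dict, name: str) -> Optional[str]:
    """Return the authorId of the author whose name equals `name` (case-insensitive)."""
    for author in paper.get("authors") or []:
        if (author.get("name") or "").casefold() == name.casefold():
            return author.get("authorId")
    return None


def fetch_author_papers(doi: str, name: str, timeout: float = 30.0) -> Optional[List[dict]]:
    """Resolve the author through one DOI, then list that author's papers (two calls)."""
    import requests  # imported here so pick_author_id stays usable without it

    # Call 1: look up one paper by DOI and request its author list.
    r = requests.get(f"{BASE}/paper/DOI:{doi}", params={"fields": "authors"}, timeout=timeout)
    r.raise_for_status()
    author_id = pick_author_id(r.json(), name)
    if author_id is None:
        return None

    # Call 2: list the author's papers with the fields we are interested in.
    r = requests.get(
        f"{BASE}/author/{author_id}/papers",
        params={"fields": "title,abstract,citationCount,externalIds", "limit": 100},
        timeout=timeout,
    )
    r.raise_for_status()
    return r.json().get("data", [])
```

Authors with more papers than the `limit` would need pagination, which is omitted here.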

### Search by paper name
Searching by paper title works, but we would have to identify the correct paper by its author.
It also seems that the search only returns the Semantic Scholar paperId, so fetching the remaining data would require additional calls. Not good.
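
A rough sketch of the title search with author matching follows. It assumes the `/paper/search` endpoint accepts a `fields` parameter like the other Graph API endpoints; this should be checked against the API docs, since it determines whether the extra calls are needed. The helpers `normalize`, `match_candidate`, and `search_paper` are hypothetical names, and exact title matching after whitespace and case normalization is a deliberate simplification.

```python
from typing import List, Optional

BASE = "https://api.semanticscholar.org/graph/v1"


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so titles and names compare robustly."""
    return " ".join(text.casefold().split())


def match_candidate(candidates: List[dict], title: str, author_name: str) -> Optional[dict]:
    """Return the first candidate with a matching title that lists `author_name` as an author."""
    for cand in candidates:
        if normalize(cand.get("title") or "") != normalize(title):
            continue
        names = {normalize(a.get("name") or "") for a in cand.get("authors") or []}
        if normalize(author_name) in names:
            return cand
    return None


def search_paper(title: str, author_name: str, timeout: float = 30.0) -> Optional[dict]:
    """Search by title and keep only the hit written by the expected author."""
    import requests  # imported here so the matching helpers work without it

    r = requests.get(
        f"{BASE}/paper/search",
        # Assumption: `fields` is honoured by the search endpoint.
        params={"query": title, "fields": "title,authors,citationCount,abstract", "limit": 5},
        timeout=timeout,
    )
    r.raise_for_status()
    return match_candidate(r.json().get("data", []), title, author_name)
```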

## Downloading Semantic Scholar dataset

This does not seem possible without an API key, and issuance of new keys is paused due to high demand.

Reference here: https://github.com/allenai/s2orc