## What is a third-party API?


Third-party APIs are provided by outside organizations, typically companies such as Facebook, Twitter, or Google, so that you can access their functionality and data from your own code.

## Sending an API request in Python 

Let's say we want to find scientific papers about "ChatGPT" published in 2023.
Several APIs let you gather research articles and related metadata.
Some examples are:

- [Semantic Scholar Academic Graph API](https://api.semanticscholar.org/api-docs/graph)
- [OpenAlex](https://docs.openalex.org/)
- [CrossRef API](https://www.crossref.org/documentation/retrieve-metadata/rest-api/)

In this example, I am going to use the Semantic Scholar Academic Graph API. 

In [1]:
import requests


In [2]:
BASE_URL = "https://api.semanticscholar.org/graph/"
VERSION = "v1/"
RESOURCE = "paper/search"


params = {"query": "ChatGPT",
          "year": 2023,
          "offset": 0,
          "limit": 100,
          "fields": "title,year,authors"}

In [3]:
my_url = BASE_URL + VERSION + RESOURCE

In [4]:
r = requests.get(my_url, params=params)
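To see what `requests.get(my_url, params=params)` sends, it can help to build the final URL by hand. The sketch below does this with the standard library's `urllib.parse.urlencode`. The helper name `build_url` is an illustrative assumption, not part of `requests` or of the Semantic Scholar API.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.semanticscholar.org/graph/"
VERSION = "v1/"
RESOURCE = "paper/search"


def build_url(base, version, resource, params):
    """Join the URL parts and append the URL-encoded query string."""
    return f"{base}{version}{resource}?{urlencode(params)}"


params = {
    "query": "ChatGPT",
    "year": 2023,
    "offset": 0,
    "limit": 100,
    "fields": "title,year,authors",
}

url = build_url(BASE_URL, VERSION, RESOURCE, params)
print(url)
# https://api.semanticscholar.org/graph/v1/paper/search?query=ChatGPT&year=2023&offset=0&limit=100&fields=title%2Cyear%2Cauthors
```

Note that the commas in `fields` are percent-encoded as `%2C`. After a real request, `r.url` shows the URL that `requests` actually used, which is a quick way to check the parameters.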

In [5]:
r.json()["data"]

[{'paperId': 'eddfb9be78cfe94193766e3722eb0e56c3d24cef',
  'title': 'ChatGPT is fun, but not an author',
  'year': 2023,
  'authors': [{'authorId': '2003404994', 'name': 'H. Thorp'}]},
 {'paperId': '8f64f4633d9c482bb826b7a9fe9c1493837d7112',
  'title': 'Is ChatGPT A Good Translator? A Preliminary Study',
  'year': 2023,
  'authors': [{'authorId': '12386833', 'name': 'Wenxiang Jiao'},
   {'authorId': '2144328160', 'name': 'Wenxuan Wang'},
   {'authorId': '2161306685', 'name': 'Jen-tse Huang'},
   {'authorId': '2144800839', 'name': 'Xing Wang'},
   {'authorId': '2909321', 'name': 'Zhaopeng Tu'}]},
 {'paperId': '50aea7bae4c478e850f218e58da0e24f501ab8fc',
  'title': 'Abstracts written by ChatGPT fool scientists',
  'year': 2023,
  'authors': [{'authorId': '40898374', 'name': 'Holly Else'}]},
 {'paperId': '0570e8fc8b02e7eb66e798b00726fba0592ea90f',
  'title': 'ChatGPT listed as author on research papers: many scientists disapprove',
  'year': 2023,
  'authors': [{'authorId': '1413934873', '
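The `offset` and `limit` parameters in the request above return one page of at most 100 results. To collect more, the offset can be advanced page by page. The following is a minimal sketch of that loop, not the API's own client: `fetch_page` is a hypothetical callable that returns the list stored under `"data"` for a given offset, and `max_items` is an arbitrary safety cap chosen for illustration.

```python
def fetch_all(fetch_page, limit=100, max_items=1000):
    """Collect results page by page until a short or empty page is returned.

    fetch_page(offset, limit) is a hypothetical callable returning the list
    of records for that page (for example a thin wrapper around requests.get).
    """
    results = []
    offset = 0
    while offset < max_items:
        page = fetch_page(offset, limit)
        results.extend(page)
        if len(page) < limit:  # a short page means there is nothing left
            break
        offset += limit
    return results


# Stand-in for the real API: 250 fake records served in slices.
fake_records = [{"paperId": str(i)} for i in range(250)]


def fake_fetch(offset, limit):
    return fake_records[offset:offset + limit]


papers = fetch_all(fake_fetch, limit=100)
print(len(papers))  # 250
```

Passing the fetch function in as an argument keeps the paging logic testable without touching the network.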

# Week 2

In [11]:
import pandas as pd
import numpy as np
df2019 = pd.read_csv("2019_poster.csv")
df2020 = pd.read_csv("2020.csv")
df2021 = pd.read_csv("2021.csv")

names2019 = np.array(df2019.author)
names2020 = np.array(df2020.author)
names2021 = np.array(df2021.author)

names = np.concatenate([names2019, names2020, names2021])
names_final = np.unique(names)
df = pd.DataFrame(names_final, columns=['author'])
df.head()


| | author |
|---|---|
| 0 | A. Gül Gökay Emel |
| 1 | Aaron Clauset |
| 2 | Aaron Cluaset |
| 3 | Aaron Halfaker |
| 4 | Aaron Schecter |
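`np.unique` removes exact duplicates only, so spelling variants such as "Aaron Clauset" and "Aaron Cluaset" both survive and will each trigger an author search. One way to flag such pairs for manual review is a string-similarity check. The sketch below uses `difflib.SequenceMatcher` from the standard library; the function name and the 0.9 threshold are assumptions chosen for illustration, not values tuned on this data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(names, threshold=0.9):
    """Return pairs of distinct names whose similarity ratio meets the threshold."""
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((a, b))
    return pairs


sample = ["A. Gül Gökay Emel", "Aaron Clauset", "Aaron Cluaset",
          "Aaron Halfaker", "Aaron Schecter"]
print(near_duplicates(sample))
```

On this sample only the Clauset/Cluaset pair is reported. The comparison is quadratic in the number of names, so on a list of a few thousand names it takes noticeably longer than the deduplication itself.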


In [19]:
df.to_csv("authors.csv", index=False)

In [18]:
names_final[0]

'A. Gül Gökay Emel'

In [154]:
BASE_URL = "https://api.semanticscholar.org/graph/"
VERSION = "v1/"
RESOURCE = "author/search"

params = {'query': names_final[813],
          'limit': 1,
          'fields': "papers.authors"
}

my_url = BASE_URL + VERSION + RESOURCE

params

{'query': 'Jennifer Pan', 'limit': 1, 'fields': 'papers.authors'}

In [136]:
import time
from tqdm import tqdm

BASE_URL = "https://api.semanticscholar.org/graph/"
VERSION = "v1/"
RESOURCE = "author/search"

my_url = BASE_URL + VERSION + RESOURCE

ids = []

for name in names_final:
    params = {'query':name,
          'limit': 1,
          'fields': "papers.authors"
          }

    r = requests.get(my_url, params=params)
    if len(r.json()['data']) == 0:
        continue
    if r.status_code == 200 and len(r.json()['data']) > 0:
        temp = r.json()['data'][0]['papers'][0]['authors']
        temp_ids = [author['authorId'] for author in temp]
        ids.extend(temp_ids)
    else:
        status = False
        while status == False:
            print("Sleepin'")
            time.sleep(30)
            r = requests.get(my_url, params=params)
            if len(r.json()['data']) == 0:
                continue
            if r.status_code == 200 and len(r.json()['data']) > 0:
                temp = r.json()['data'][0]['papers'][0]['authors']
                temp_ids = [int(author['authorId']) for author in temp]
                ids.extend(temp_ids)
                status = True
                
            
ids


KeyError: 'data'
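The `KeyError` arises because `r.json()['data']` is evaluated before the status code is checked, and a response to a failed request need not contain a `data` key. A minimal sketch of a more defensive parser follows. The function name and the example error body are illustrative assumptions, not documented API output.

```python
def first_paper_author_ids(payload):
    """Return the author IDs of the first paper of the first search hit.

    Returns an empty list when the payload has no "data" key, no hits,
    or a hit without papers.
    """
    hits = payload.get("data") or []
    if not hits:
        return []
    papers = hits[0].get("papers") or []
    if not papers:
        return []
    return [author["authorId"] for author in papers[0].get("authors", [])]


ok = {"data": [{"authorId": "1", "papers": [
    {"authors": [{"authorId": "1"}, {"authorId": "2"}]}]}]}
error = {"message": "rate limited"}  # hypothetical error body

print(first_paper_author_ids(ok))     # ['1', '2']
print(first_paper_author_ids(error))  # []
```

Returning an empty list in every failure case lets the caller use `ids.extend(...)` unconditionally.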

In [141]:
BASE_URL = "https://api.semanticscholar.org/graph/"
VERSION = "v1/"
RESOURCE = "author/search"

my_url = BASE_URL + VERSION + RESOURCE

ids = []

    

In [162]:
def get_request(name):
    params = {'query':name,
              'limit': 1,
              'fields': "papers.authors"
              }
    r = requests.get(my_url, params=params)
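    if r.status_code != 200:
        # Wait, then retry and return the result of the retry;
        # without the return, the failed response below would be parsed instead.
        time.sleep(30)
        return get_request(name)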

    return r.json()
    
def get_ids(r):
    if 'data' not in r.keys():
        return None
    
    if len(r['data']) == 0:
        return None

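    if len(r['data'][0]['papers']) == 0:
        # No papers listed: keep the author's own ID.
        # append, not extend, so the ID string is not split into characters.
        ids.append(r['data'][0]['authorId'])
        return None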
        
    temp = r['data'][0]['papers'][0]['authors']
    temp_ids = [author['authorId'] for author in temp]
    ids.extend(temp_ids)
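The recursive retry in `get_request` has no upper bound on attempts, so a request that keeps failing would eventually exhaust the recursion limit. One alternative is a loop with a fixed number of attempts. The sketch below is written under assumptions: `fetch` is a hypothetical zero-argument callable returning `(status_code, payload)`, and the `sleep` parameter exists only so the waiting can be replaced in a test.

```python
import time


def fetch_with_retry(fetch, max_attempts=5, wait_seconds=30, sleep=time.sleep):
    """Call fetch() until it reports status 200 or the attempts run out.

    Returns the payload on success and None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        status, payload = fetch()
        if status == 200:
            return payload
        if attempt < max_attempts:  # no point in waiting after the last try
            sleep(wait_seconds)
    return None


# Stand-in that fails twice with 429 and then succeeds.
responses = iter([(429, None), (429, None), (200, {"data": []})])
waits = []
result = fetch_with_retry(lambda: next(responses), sleep=waits.append)
print(result, waits)  # {'data': []} [30, 30]
```

In the notebook, `fetch` could be a small lambda wrapping `requests.get(my_url, params=params)` that returns `r.status_code` and `r.json()`.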

In [None]:
for name in tqdm(names_final):
    get_ids(get_request(name))

In [163]:
for name in tqdm(names_final[812:]):
    get_ids(get_request(name))

100%|██████████| 1095/1095 [36:48<00:00,  2.02s/it]  


In [168]:
#len(ids)
ids_short = pd.DataFrame(ids, columns=['authorId'])
ids_short.to_csv('ids_short.csv', index=False)


In [171]:
BASE_URL = "https://api.semanticscholar.org/graph/"
VERSION = "v1/"
RESOURCE = "author/search"

params = {'query':names_final[100],
          'fields': "papers.authors"
}

my_url = BASE_URL + VERSION + RESOURCE

In [184]:
r = requests.get(my_url, params=params)
ost = []
for paper in r.json()['data'][0]['papers']:
    aut = paper['authors']
    aut_ids = [author['authorId'] for author in aut]
    ost.extend(aut_ids)


## Get ALL Ids

In [None]:
all_ids = []

In [224]:
def get_request(name):
    params = {'query':name,
              'fields': "papers.authors"
              }
    r = requests.get(my_url, params=params)

    if r.status_code == 500:
        return {}
        
    if r.status_code != 200:
        # Wait, then retry and return the result of the retry.
        time.sleep(30)
        return get_request(name)

    return r.json()
    
def get_all_ids(r):
    if 'data' not in r.keys():
        return None
    
    if len(r['data']) == 0:
        return None

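    if len(r['data'][0]['papers']) == 0:
        # No papers listed: keep the author's own ID in all_ids.
        # append, not extend, so the ID string is not split into characters.
        all_ids.append(r['data'][0]['authorId'])
        return None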
        
    for paper in r['data'][0]['papers']:
        aut = paper['authors']
        aut_ids = [author['authorId'] for author in aut]
        all_ids.extend(aut_ids)


In [238]:
for name in tqdm(names_final):
    get_all_ids(get_request(name))


In [227]:
authorIds = list(set(all_ids))

In [237]:
all_authors = pd.DataFrame(authorIds, columns=['authorId'])
all_authors.to_csv("all_authors.csv", index=False)
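`list(set(all_ids))` removes duplicates, but the order of the result is arbitrary, so the row order of `all_authors.csv` (and therefore any batch index computed from it later) can differ between runs. A small order-preserving alternative is sketched below. The function name is hypothetical, and dropping `None` is an assumption based on the later `dropna()` call, which suggests that some IDs come back missing.

```python
def unique_in_order(items):
    """Drop duplicates and None values while keeping first-seen order."""
    return [item for item in dict.fromkeys(items) if item is not None]


all_ids_example = ["2773799", "50197963", "2773799", None, "3530609", "50197963"]
print(unique_in_order(all_ids_example))
# ['2773799', '50197963', '3530609']
```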


## Data set construction

We now have all the author IDs we need, so we can construct the data sets.

In [272]:
all_authors.head()

| | authorId |
|---|---|
| 0 | 2773799 |
| 1 | 1399712184 |
| 2 | 50197963 |
| 3 | 3530609 |
| 4 | 84242025 |


In [395]:
BASE_URL = "https://api.semanticscholar.org/graph/"
VERSION = "v1/"
RESOURCE = "author/batch"
FIELDS = "?fields=name,aliases,papers.title,papers.abstract,papers.year,papers.externalIds,papers.s2FieldsOfStudy,papers.citationCount,papers.authors"

params = {"ids": ['2773799','1399712184']}

my_url = BASE_URL + VERSION + RESOURCE + FIELDS



In [422]:
r = requests.post(my_url, json=params)
r.json()[0]['papers'][0]


{'paperId': '825f85375bba977cd3ad78ac1ba22c7fae5609fb',
 'externalIds': {'PubMedCentral': '9045324',
  'DOI': '10.1128/spectrum.02434-21',
  'CorpusId': 247938320,
  'PubMed': '35377231'},
 'title': 'Reference-Grade Genome and Large Linear Plasmid of Streptomyces rimosus: Pushing the Limits of Nanopore Sequencing',
 'abstract': 'The genomes of Streptomyces species are difficult to assemble due to long repeats, extrachromosomal elements (giant linear plasmids [GLPs]), rearrangements, and high GC content. To improve the quality of the S. rimosus ATCC 10970 genome, producer of oxytetracycline, we validated the assembly of GLPs by applying a new approach to combine pulsed-field gel electrophoresis separation and GLP isolation and sequenced the isolated GLP with Oxford Nanopore technology. ABSTRACT Streptomyces rimosus ATCC 10970 is the parental strain of industrial strains used for the commercial production of the important antibiotic oxytetracycline. As an actinobacterium with a large lin
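The `author/batch` request above sends its IDs as a JSON list in the POST body, so a long ID list has to be split into smaller groups before it is sent. A minimal sketch of that splitting step follows. The helper name `chunked` and the group size of 2 are illustrative; the notebook itself later uses groups of 10.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


author_ids = ['2773799', '1399712184', '50197963', '3530609', '84242025']
bodies = [{"ids": chunk} for chunk in chunked(author_ids, 2)]
print(bodies)
# [{'ids': ['2773799', '1399712184']}, {'ids': ['50197963', '3530609']}, {'ids': ['84242025']}]
```

Each element of `bodies` has the same shape as the `params` dictionary passed as `json=` to `requests.post` above.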

In [100]:
author_data = {
    "ids": [],
    "names": [],
    "aliases": [],
    "citationCount": [],
    "field": []
    }

paper_data = {
    "paperId": [],
    "title": [],
    "year": [],
    "DOI": [],
    "citationCount": [],
    "fields": [],
    "authorIds": []
}

abstract_data = {
    "paperId": [],
    "abstract": []
}

In [101]:
all_authors = pd.read_csv("all_authors.csv")

all_authors = all_authors.dropna()
all_authors = all_authors.astype(int).astype(str)

list(all_authors.iloc[0:5]['authorId'])

['2773799', '1399712184', '50197963', '3530609', '84242025']

In [102]:
from collections import Counter
from tqdm import tqdm
import pandas as pd
import time
import requests

BASE_URL = "https://api.semanticscholar.org/graph/"
VERSION = "v1/"
RESOURCE = "author/batch"
FIELDS = "?fields=name,aliases,citationCount,papers.title,papers.abstract,papers.year,papers.externalIds,papers.s2FieldsOfStudy,papers.citationCount,papers.authors"

my_url = BASE_URL + VERSION + RESOURCE + FIELDS


def most_frequent(List):
    occurence_count = Counter(List)
    return occurence_count.most_common(1)[0][0]

def get_data(idx, url, status_count = 0):
    ids = list(all_authors.iloc[idx:idx+10]['authorId'])
    
    r = requests.post(url, json={"ids": ids})

    if r.status_code == 500 or r.status_code == 504:
        if status_count == 2:
            print(f'Failed on {idx} with status code {r.status_code}')
            return None
        return get_data(idx, url, status_count + 1)
        
    if r.status_code != 200:
        time.sleep(30)
        print(f"Got status code: {r.status_code}. Trying Again!")
        # Return the retry's result; otherwise the failed response is parsed below.
        return get_data(idx, url, status_count)

    return r.json()

def make_data(r):
    if r is None or r == [None]:
        return None
    for author in r:
        # Skip missing entries but keep processing the rest of the batch.
        if author is None or author == [None]:
            continue
        author_data['ids'].append(author['authorId'])
        author_data['names'].append(author['name'])
        author_data['aliases'].append(str(author['aliases']) if author['aliases'] is not None else None)
        author_data['citationCount'].append(int(author['citationCount']) if author['citationCount'] is not None else 0)
        temp_fields = []
        for paper in author['papers']:
            # Only papers with a DOI are kept.
            if 'DOI' not in paper['externalIds']:
                continue
            categories = None
            if paper['s2FieldsOfStudy'] is not None:
                categories = [field['category'] for field in paper['s2FieldsOfStudy']]
                temp_fields.extend(categories)
            coauthor_ids = None
            if paper['authors'] is not None:
                coauthor_ids = [coauthor['authorId'] for coauthor in paper['authors']]
            paper_data["paperId"].append(paper['paperId'])
            paper_data["title"].append(paper['title'])
            paper_data["year"].append(int(paper['year']) if paper['year'] is not None else None)
            paper_data["DOI"].append(paper['externalIds']['DOI'])
            paper_data["citationCount"].append(int(paper['citationCount']) if paper['citationCount'] is not None else 0)
            paper_data["fields"].append(str(categories) if categories is not None else None)
            paper_data["authorIds"].append(coauthor_ids)
            abstract_data["paperId"].append(paper['paperId'])
            abstract_data["abstract"].append(paper['abstract'])
        # The author's field is the most common category across their papers.
        author_data['field'].append(most_frequent(temp_fields) if temp_fields else None)
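Each author's `field` is the category that occurs most often across their papers, as computed by `most_frequent`. The short example below restates that helper on its own so its tie-breaking behavior is visible. The sample list is made up for illustration.

```python
from collections import Counter


def most_frequent(values):
    """Return the most common value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]


fields = ['Biology', 'Engineering', 'Engineering', 'Biology', 'Psychology']
print(most_frequent(fields))  # Biology
```

Because ties are resolved by first appearance, an author with equally many papers in two fields is assigned whichever field happens to come first in the API response.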
        

In [103]:
for idx in range(0, len(all_authors), 10):
    make_data(get_data(idx, my_url, 0))
    if idx % 100 == 0:
        author_df = pd.DataFrame(author_data, columns=['ids', 'names', 'aliases', 'citationCount', 'field'])
        paper_df = pd.DataFrame(paper_data, columns=['paperId', 'title', 'year', 'DOI', 'citationCount', 'fields', 'authorIds'])
        abstract_df = pd.DataFrame(abstract_data, columns=['paperId', 'abstract'])
        author_df.to_csv("author_data.csv", index=False)
        paper_df.to_csv("paper_data.csv", index=False)
        abstract_df.to_csv("abstract_data.csv", index=False)
        print(f"Saved at {idx}!")




Saved at 0!
Saved at 100!
Saved at 200!
Saved at 300!
Saved at 400!
Saved at 500!
Saved at 600!
Failed on 650 with status code 500
Saved at 700!
Saved at 800!
Saved at 900!
Saved at 1000!
Failed on 1020 with status code 504
Saved at 1100!
Failed on 1120 with status code 500
Failed on 1140 with status code 500
Saved at 1200!
Saved at 1300!
Failed on 1340 with status code 500
Saved at 1400!
Saved at 1500!
Failed on 1520 with status code 500
Saved at 1600!
Saved at 1700!
Failed on 1720 with status code 500
Saved at 1800!
Saved at 1900!
Failed on 1910 with status code 504
Saved at 2000!
Saved at 2100!
Failed on 2110 with status code 500
Saved at 2200!
Saved at 2300!
Saved at 2400!
Saved at 2500!
Failed on 2580 with status code 500
Saved at 2600!
Saved at 2700!
Failed on 2730 with status code 500
Failed on 2800 with status code 500
Saved at 2800!
Saved at 2900!
Saved at 3000!
Saved at 3100!
Failed on 3150 with status code 500
Saved at 3200!
Failed on 3260 with status code 500
Saved at 3300!

KeyboardInterrupt: 
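The log shows batches such as 650 and 1020 failing, and those author IDs are never requested again. One option is to record the failed start indices so they can be retried in a second pass. The sketch below is a simplified stand-in for the loop above, written under assumptions: `fetch_batch` and `handle` are hypothetical callables playing the roles of `get_data` and `make_data`.

```python
def run_batches(starts, fetch_batch, handle):
    """Process each batch start index and return the indices that failed.

    fetch_batch(start) returns the parsed response, or None on failure;
    handle(response) stores a successful response.
    """
    failed = []
    for start in starts:
        response = fetch_batch(start)
        if response is None:
            failed.append(start)
        else:
            handle(response)
    return failed


# Stand-in fetcher: batches 650 and 1020 fail, as in the log above.
def fake_fetch(start):
    return None if start in (650, 1020) else [{"authorId": str(start)}]


stored = []
failed = run_batches(range(0, 1100, 10), fake_fetch, stored.extend)
print(failed)       # [650, 1020]
print(len(stored))  # 108
```

Calling `run_batches(failed, ...)` again after a pause gives the failed batches a second chance without repeating the successful ones.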

In [104]:
for idx in range(0, len(all_authors), 10):
    make_data(get_data(idx, my_url, 0))
    if idx % 100 == 0:
        author_df = pd.DataFrame(author_data, columns=['ids', 'names', 'aliases', 'citationCount', 'field'])
        paper_df = pd.DataFrame(paper_data, columns=['paperId', 'title', 'year', 'DOI', 'citationCount', 'fields', 'authorIds'])
        abstract_df = pd.DataFrame(abstract_data, columns=['paperId', 'abstract'])
        author_df.to_csv("author_data.csv", index=False)
        paper_df.to_csv("paper_data.csv", index=False)
        abstract_df.to_csv("abstract_data.csv", index=False)
        print(f"Saved at {idx}!")


| | paperId | title | year | DOI | citationCount | fields | authorIds |
|---|---|---|---|---|---|---|---|
| 0 | 825f85375bba977cd3ad78ac1ba22c7fae5609fb | Reference-Grade Genome and Large Linear Plasmi... | 2022.0 | 10.1128/spectrum.02434-21 | 1 | ['Biology', 'Engineering'] | [7635770, 49271298, 2773799, 5179706, 14009059... |
| 1 | ada32be518f395002d90c52befabcd8735128192 | Synthetic biology approaches to actinomycete s... | 2021.0 | 10.1093/femsle/fnab060 | 2 | ['Engineering', 'Biology'] | [2661384, 8646979, 144778908, 51935399, 277379... |
| 2 | 765919117cdcc0c736502d0c08f3d3b985e017f8 | Dynamical Criticality: Overview and Open Quest... | 2017.0 | 10.1007/s11424-017-6117-5 | 0 | ['Psychology'] | [1763293, 145479337, 2773799, 143851134] |
| 3 | 0d783a10ec4f9293ffdf4bddf4c63f4e2b54b8b1 | Beyond Networks: Search for Relevant Subsets i... | 2016.0 | 10.1007/978-3-319-24391-7_12 | 1 | ['Computer Science'] | [1763293, 145479337, 2773799, 143851134] |
| 4 | 25cb12204aeb78762dc00eab20e7c24927f96ea0 | Dynamical regimes in non-ergodic random Boolea... | 2016.0 | 10.1007/s11047-016-9552-7 | 0 | ['Computer Science'] | [145479337, 2221493, 35063515, 1763293, 277379... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 332423 | ef410d9cc52dad4e03b42eed82c4b4e3d42b8822 | Discriminating among single locus models using... | 1980.0 | 10.1002/AJMG.1320060307 | 16 | ['Biology'] | [3206382] |
| 332424 | 3925928b3ce53ef594dabe35320ddf8e7a41f326 | Author Correction: A multi-country test of bri... | 2022.0 | 10.1038/s41562-022-01441-4 | 1 | [] | [1753652328, 40484649, 83360346, 2109507581, 3... |
| 332425 | 8cdd312f17b47574fa8b54dd1b86eee0dbc27142 | In COVID-19 Health Messaging, Loss Framing Inc... | 2022.0 | 10.1007/s42761-022-00128-3 | 1 | ['Psychology'] | [83360346, 5209814, 2016220783, 47013143, 1439... |
| 332426 | c7293f9dd2ec3d34edf5331eb5a6ad7614723197 | A multi-country test of brief reappraisal inte... | 2021.0 | 10.1038/s41562-021-01173-x | 39 | ['Psychology'] | [1753652328, 40484649, 83360346, 2109507581, 3... |


In [118]:

for idx in range(5700, len(all_authors), 10):
    make_data(get_data(idx, my_url, 0))
    if idx % 100 == 0:
        author_df = pd.DataFrame(author_data, columns=['ids', 'names', 'aliases', 'citationCount', 'field'])
        paper_df = pd.DataFrame(paper_data, columns=['paperId', 'title', 'year', 'DOI', 'citationCount', 'fields', 'authorIds'])
        abstract_df = pd.DataFrame(abstract_data, columns=['paperId', 'abstract'])
        author_df.to_csv("author_data.csv", index=False)
        paper_df.to_csv("paper_data.csv", index=False)
        abstract_df.to_csv("abstract_data.csv", index=False)
        print(f"Saved at {idx}!")

Saved at 5700!
Saved at 5800!
Saved at 5900!
Saved at 6000!
Saved at 6100!
Saved at 6200!
Saved at 6300!
Saved at 6400!
Failed on 6430 with status code 500
Saved at 6500!
Saved at 6600!
Saved at 6700!
Saved at 6800!
Saved at 6900!
Saved at 7000!
Saved at 7100!
Saved at 7200!
Saved at 7300!
Saved at 7400!
Saved at 7500!
Saved at 7600!
Failed on 7700 with status code 500
Saved at 7700!
Failed on 7770 with status code 500
Saved at 7800!
Saved at 7900!
Saved at 8000!
Saved at 8100!
Saved at 8200!
Failed on 8260 with status code 504
Saved at 8300!
Saved at 8400!
Saved at 8500!
Failed on 8590 with status code 500
Saved at 8600!
Saved at 8700!
Failed on 8720 with status code 500
Saved at 8800!
Saved at 8900!
Failed on 8910 with status code 500
Saved at 9000!
Saved at 9100!
Failed on 9110 with status code 500
Saved at 9200!
Failed on 9270 with status code 504
Saved at 9300!
Failed on 9360 with status code 500
Saved at 9400!
Saved at 9500!
Saved at 9600!
Saved at 9700!
Saved at 9800!
Failed on 

KeyboardInterrupt: 

In [123]:
author_df.drop_duplicates().dropna().to_csv("author_clean.csv", index=False)

In [None]:
paper_df.drop_duplicates(subset=['paperId']).dropna().to_csv("paper_clean.csv", index=False)