# Using WoS Starter API to retrieve author information

To run the code below, you will need to request an API key for the Web of Science Starter API.

See this repository's [README.md](../../README.md) for more information on saving your API key to a `.env` file and retrieving it from there.

This notebook demonstrates how to take a list of authors and retrieve information about their publication records using the Starter API.

### I. Getting Started

1. **Import all necessary packages.** If the packages are not yet installed in your local environment, you can install them with the requirements.txt file. Open a terminal / command prompt within this project's folder and type:

```
pip install -r requirements.txt
```

Then you can import these packages.

In [13]:
import requests
import time
import os
import urllib.parse
import pandas as pd
import random
from bs4 import BeautifulSoup   #for parsing xml and html
from random import randint  
from dotenv import load_dotenv 
load_dotenv()

True

2. **Store the API base URL and retrieve your API key.**

In [14]:
BASEURL_ST = 'https://api.clarivate.com/apis/wos-starter/v1/'
HEADERS_ST = {'X-APIKey': os.getenv("APIKEY")}
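
If the `.env` file is missing or the variable name differs, `os.getenv("APIKEY")` silently returns `None` and every later request fails with an authorization error. The following is a minimal sketch of an early check, assuming the key is stored under the name `APIKEY` as in the cell above; `build_headers` is a hypothetical helper, not part of the original notebook.

```python
import os


def build_headers(env_var: str = "APIKEY") -> dict:
    """Return the request headers, or fail early if the key is not set.

    `build_headers` is an illustrative helper; the notebook itself builds
    HEADERS_ST directly from os.getenv("APIKEY").
    """
    key = os.getenv(env_var)
    if not key:
        raise RuntimeError(
            f"Environment variable {env_var!r} is not set; check your .env file."
        )
    return {"X-APIKey": key}


# Possible usage once load_dotenv() has run:
# HEADERS_ST = build_headers()
```

Failing at this point gives a clearer message than a failed request further down the notebook.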

In [15]:
import Levenshtein as lev

# weights (costs) for insertions, deletions, and substitutions, in that order
lev_wts = (1, 1, 1)
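
To illustrate what the three weights control, here is a pure-Python sketch of a weighted edit distance. It is not the `Levenshtein` library's implementation, only a small stand-in for experimenting with different costs; the function name `weighted_edit_distance` and the candidate names are made up for the example.

```python
def weighted_edit_distance(a: str, b: str, weights=(1, 1, 1)) -> int:
    """Edit distance from a to b with (insertion, deletion, substitution) costs."""
    ins_cost, del_cost, sub_cost = weights
    # prev[j] holds the cost of turning the first i-1 characters of a
    # into the first j characters of b
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * del_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + del_cost,                           # delete ca
                curr[j - 1] + ins_cost,                       # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),  # keep or substitute
            ))
        prev = curr
    return prev[-1]


# Illustrative (made-up) candidate names compared with a search name
search_name = "alesina, a"
for candidate in ["alesina, a", "alesina, al", "perotti, r"]:
    print(candidate, weighted_edit_distance(search_name, candidate))
```

With the default weights of `(1, 1, 1)`, an exact match scores 0 and one extra initial scores 1, which is why a small threshold can separate near matches from unrelated co-authors.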

In [16]:
authordf = pd.read_csv("../../data/sample_author-names-and-insts.csv", encoding = 'utf-8')
authordf.head()

Unnamed: 0,University Name,Name,Position
0,Harvard College,Alberto Alesina,Professor
1,Harvard College,Isaiah Andrews,Professor
2,Harvard College,Pol Antràs,Professor
3,Harvard College,Robert J. Barro,Professor
4,Harvard College,Emily Breza,Assistant Professor


In [17]:

print(authordf.shape)
authordf = authordf.dropna(how="all")

print(authordf.shape)
authordf.head()


(107, 3)
(107, 3)


Unnamed: 0,University Name,Name,Position
0,Harvard College,Alberto Alesina,Professor
1,Harvard College,Isaiah Andrews,Professor
2,Harvard College,Pol Antràs,Professor
3,Harvard College,Robert J. Barro,Professor
4,Harvard College,Emily Breza,Assistant Professor


### Convert "First M. Last, Suffix" to "Last, FM"

The Starter API searches for authors by name in the format "LASTNAME, FIRST_INITIALS".

So, for the table above, we need to convert an author named "Alex Brown" into "Brown, A". The following cell defines a function that does just that, and the cell after it stores the results in five new columns of our data frame: lastname ("Brown"), firstnames ("Alex"), firstinits ("A"), suffix (empty in this example), and finally reordered_name ("Brown, A").

In [18]:
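def convert_name_first2last_to_last_fminits(orig_name: str):
    """
    Split a name written as "First M. Last" or "First M. Last, Suffix" into its parts.

    Returns a tuple: (last_name, first_names, first_inits, suffix, reordered_name),
    where reordered_name has the form "Last, FM".
    """
    # Missing values (e.g. NaN) and blank strings yield five empty strings
    if not isinstance(orig_name, str) or orig_name.strip() == "":
        return "", "", "", "", ""
    # Text after a ", " separator is treated as a suffix (e.g. "Jr.")
    comma_split = orig_name.split(", ")
    suffix = comma_split[-1] if len(comma_split) > 1 else ""
    name_parts = comma_split[0].split()
    if not name_parts:
        return "", "", "", suffix, ""
    # The final token is the last name; the tokens before it are first and middle names
    last_name = name_parts.pop(-1)
    first_names = ' '.join(name_parts)
    first_inits = ''.join([part[0] for part in name_parts])
    reordered_name = last_name + ", " + first_inits
    return last_name, first_names, first_inits, suffix, reordered_name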


In [19]:
authordf["lastname"], authordf["firstnames"], authordf["firstinits"], authordf["suffix"], authordf["reordered_name"] = \
    zip(*authordf["Name"].apply(convert_name_first2last_to_last_fminits))
authordf.head()

Unnamed: 0,University Name,Name,Position,lastname,firstnames,firstinits,suffix,reordered_name
0,Harvard College,Alberto Alesina,Professor,Alesina,Alberto,A,,"Alesina, A"
1,Harvard College,Isaiah Andrews,Professor,Andrews,Isaiah,I,,"Andrews, I"
2,Harvard College,Pol Antràs,Professor,Antràs,Pol,P,,"Antràs, P"
3,Harvard College,Robert J. Barro,Professor,Barro,Robert J.,RJ,,"Barro, RJ"
4,Harvard College,Emily Breza,Assistant Professor,Breza,Emily,E,,"Breza, E"
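
To try the conversion on a few names without loading the CSV file, the sketch below uses a compact stand-in that returns only the reordered name. `to_last_first_inits` is a hypothetical helper written for this example, and "John Smith, Jr." and "Jan van der Waals" are made-up test names rather than rows from the sample data.

```python
def to_last_first_inits(full_name: str) -> str:
    """Illustrative helper: 'Robert J. Barro' -> 'Barro, RJ'."""
    # Drop anything after ", " (treated as a suffix such as "Jr.")
    name = full_name.split(", ")[0]
    parts = name.split()
    if not parts:
        return ""
    last_name = parts[-1]
    # First character of every token before the last name
    initials = "".join(part[0] for part in parts[:-1])
    return f"{last_name}, {initials}"


for name in ["Alberto Alesina", "Robert J. Barro", "John Smith, Jr.", "Jan van der Waals"]:
    print(name, "->", to_last_first_inits(name))
```

The last example comes out as "Waals, Jvd", which shows a limitation of splitting on whitespace: multi-word surnames are not recognized, so such names may need to be corrected by hand before searching.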


## Apply to dataframe of multiple authors

The following functions, **get_author_info()** and **get_author_metadata()**, take an author name from one row of the dataframe,
search the WoS database for that name, and return author and document
information for every document with an author who closely matches the search name.

Note that because authors are searched by last name and first initial (e.g. "Brown, A"),
many author searches will return documents by several different authors. Parsing a search that retrieves many
candidate authors may take more time than doing a manual search for the author on the WoS
online database. For this reason, **get_author_metadata()** does not retrieve results for
any author whose search returns more than 10 pages of results (more than 500 documents, at 50 per page).
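
The page arithmetic behind this cutoff can be sketched as follows. The constants mirror the values used in this notebook (50 records per page, at most 10 pages); the function names `pages_needed` and `too_many_matches` are hypothetical and chosen only for this illustration.

```python
import math

PAGE_SIZE = 50   # records requested per page in this notebook
MAX_PAGES = 10   # searches needing more pages than this are skipped


def pages_needed(num_results: int, page_size: int = PAGE_SIZE) -> int:
    """Number of pages required to hold num_results records."""
    return math.ceil(num_results / page_size)


def too_many_matches(num_results: int, max_pages: int = MAX_PAGES) -> bool:
    """True when the search is too broad to be worth parsing."""
    return pages_needed(num_results) > max_pages


for total in (0, 1, 50, 51, 500, 501):
    print(total, pages_needed(total), too_many_matches(total))
```

A search with exactly 500 results still fits in 10 pages and is retrieved; 501 results would need an 11th page and is skipped.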

In [20]:
def get_author_info(authorname: str, page: int):
    """
    Takes an author's name (formatted as "LastName, FirstInits") and the number of the
    results page to request from the Starter API. Records are requested 50 at a time,
    so record 51 is the first record on page 2.

    The name is placed in an author (AU) search query, and the requested page of
    matching documents is retrieved. For each document, the listed author whose name is
    closest to the search name (Levenshtein distance of at most 2) is kept.

    Returns a list with one entry per matching document.
    """
    
    print("\n***Searching for:", authorname)
    SEARCH_QUERY = f"AU={authorname}"
    request = requests.get(f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}&limit=50&page={page}', headers=HEADERS_ST)
    data = request.json()
    doclist = []
    # An error response may lack the 'hits' key; treat it as an empty result page
    hits = data.get('hits', [])
    for hit in hits:
        try:
            authors = hit['names']['authors']
            short_names = [au['wosStandard'] for au in authors]
            short_names_lower = [name.lower() for name in short_names]
            ids = [au['researcherId'] if 'researcherId' in au.keys() else '' for au in authors]
            lev_scores = [lev.distance(authorname.lower(), name, weights = lev_wts) for name in short_names_lower]
            min_idx = lev_scores.index(min(lev_scores))
            if lev_scores[min_idx] > 2:
                print("no close match:", short_names[min_idx])
                # return something or set some value??
                continue
            #use min idx to get name and id
            tgt_shortname = short_names[min_idx]
            tgt_id = ids[min_idx]
        except KeyError:
            print("missing authors:", authorname, hit['uid'])
            continue
        # Not every record lists author keywords, so fall back to an empty list
        keywords = hit.get('keywords', {}).get('authorKeywords', [])
        if "sourceTitle" in hit['source'].keys():
            so_title = hit['source']['sourceTitle']
        else:
            so_title = ""
        if "publishYear" in hit['source'].keys():
            pubyear = [hit['source']['publishYear']]
        else:
            pubyear = []
        doc_uid = hit['uid']
        if 'doi' in hit['identifiers'].keys():
            doi = hit['identifiers']['doi']
        else:
            doi = ""
        if 'issn' in hit['identifiers'].keys():
            issn = hit['identifiers']['issn']
        else:
            issn = ""
        if 'eissn' in hit['identifiers'].keys():
            eissn = hit['identifiers']['eissn']
        else:
            eissn = ""
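        # Guard against records that carry no citation counts
        citations = hit.get('citations') or []
        citedby_count = citations[0]['count'] if citations else None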
        doclist.append([tgt_shortname, tgt_id, 
                        doc_uid, doi, issn, eissn,
                        so_title, keywords, pubyear, 
                        citedby_count])
    
    return doclist
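
The field extraction in the function above can be tried offline on a single record. The sketch below assumes the record layout implied by the keys that function reads (`uid`, `source`, `identifiers`, `keywords`, `citations`); `parse_hit` is a hypothetical helper and `sample_hit` is made-up data, not a real API response.

```python
def parse_hit(hit: dict) -> dict:
    """Pull the document fields used in this notebook out of one record,
    falling back to empty values when an optional field is absent."""
    source = hit.get("source", {})
    identifiers = hit.get("identifiers", {})
    citations = hit.get("citations") or []
    return {
        "doc_uid": hit.get("uid", ""),
        "doi": identifiers.get("doi", ""),
        "issn": identifiers.get("issn", ""),
        "eissn": identifiers.get("eissn", ""),
        "so_title": source.get("sourceTitle", ""),
        "pubyear": source.get("publishYear"),
        "au_keywords": hit.get("keywords", {}).get("authorKeywords", []),
        "citedby_count": citations[0].get("count") if citations else None,
    }


# Made-up record for illustration only
sample_hit = {
    "uid": "WOS:000000000000001",
    "source": {"sourceTitle": "EXAMPLE JOURNAL", "publishYear": 2001},
    "identifiers": {"issn": "0000-0000"},
    "keywords": {"authorKeywords": ["example"]},
    "citations": [{"db": "WOS", "count": 7}],
}
print(parse_hit(sample_hit))
```

Using `dict.get` with defaults keeps one incomplete record from stopping the whole batch; the trade-off is that missing values need to be handled later, when the results are analysed.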

In [21]:
#get_author_info(authorname="Fryer, RG", page=1)
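
Before spending API calls, it can help to inspect the URL that a search will request. This sketch rebuilds the same URL string as `get_author_info()` without sending anything; `build_search_url` is a hypothetical helper introduced for the example.

```python
import urllib.parse

BASEURL_ST = 'https://api.clarivate.com/apis/wos-starter/v1/'


def build_search_url(authorname: str, page: int, limit: int = 50) -> str:
    """Build the documents-search URL for an author (AU) query."""
    # quote() percent-encodes '=', ',' and spaces so the query is URL-safe
    query = urllib.parse.quote(f"AU={authorname}")
    return f"{BASEURL_ST}documents?db=WOS&q={query}&limit={limit}&page={page}"


print(build_search_url("Alesina, A", page=1))
# https://api.clarivate.com/apis/wos-starter/v1/documents?db=WOS&q=AU%3DAlesina%2C%20A&limit=50&page=1
```

An alternative design would be to pass the query through the `params` argument of `requests.get`, which handles the encoding automatically.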

In [22]:
def get_author_metadata(row: pd.Series, colname: str):
    """
    Takes one row of the author dataframe and the name of the column that holds
    the author's name written as "LastName, FirstInits".

    Uses the Starter API to count the documents that match this name, then
    retrieves every page of results with get_author_info().

    Returns a tuple (doclist, flag). The flag is True when the search returned
    no usable metadata or more than 10 pages of results; doclist is then empty.
    """
    print(row)
    print(row[colname])
    authorname = row[colname]
    SEARCH_QUERY = f"AU={authorname}"
    initial_request = requests.get(f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}&limit=10', headers=HEADERS_ST)
    data = initial_request.json()
    try:
        num_results = int(data['metadata']['total'])
        num_pages = 1 + (num_results -1)// 50
    except KeyError:
        print("no metadata for match! - ", authorname, "\n")
        return [], True
    try:
        hits = [hit for hit in data['hits']]
    except KeyError:
        print("##no metadata for match2! - ", num_results, ":",    authorname, "\n")
        return [], True
    doclist = []
    if num_pages <= 1:
        doclist = get_author_info(authorname, page=1)
    elif num_pages > 10:
        #row['too_many_matches'] = "yes"
        print("TOO MANY MATCHES:", authorname, "\n")
        #return(row)
        return [], True
    else:
        print("multiple pages for: ", authorname, "pages = ", num_pages)
        for i in range(1, num_pages + 1):
            doclist.extend(get_author_info(authorname, i))
    print("***doclist***:", doclist)
    #newdoclist = []
    #for doc in doclist:
    #    newdoclist.append(row + doc)
    #return newdoclist
    return doclist, False

Divide the dataframe of author names into batches of *n* rows. Each batch is processed and saved to its own CSV file.

In [23]:
numauthors = len(authordf)
print(numauthors)
n = 50
num_iterations = numauthors // n
print(num_iterations)

107
2
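
The batch boundaries for a dataframe of this size can be checked with a small generator. This is a sketch under the assumption that batches are contiguous row ranges; `batch_bounds` is a hypothetical name, not a function from the notebook.

```python
def batch_bounds(total_rows: int, batch_size: int):
    """Yield (start, end) index pairs covering total_rows in steps of batch_size."""
    for start in range(0, total_rows, batch_size):
        # Clip the end so the last pair reflects the real number of rows
        yield start, min(start + batch_size, total_rows)


print(list(batch_bounds(107, 50)))
# [(0, 50), (50, 100), (100, 107)]
```

For 107 authors this gives two full batches and one short batch of 7. When the row count is an exact multiple of the batch size, no empty trailing batch is produced.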


In [None]:
# Step through the dataframe n rows at a time; the final batch may be shorter
for startidx in range(0, numauthors, n):
    endidx = startidx + n
    #startidx = 0
    #endidx = 3
    subdf = authordf.iloc[startidx:endidx].copy()
    subdf["doclists"], subdf['toomany_orno_matches'] = \
       zip(*subdf.apply(lambda x: get_author_metadata(x, "reordered_name"), axis=1))
    #authordf.loc[startidx: endidx, "doclists"] = authordf.iloc[startidx:endidx].apply(lambda x: get_author_metadata(x, "reordered_name"), axis=1)
    docdf = subdf.explode("doclists")
    docdf['doclists'] = docdf['doclists'].apply(lambda d: d if isinstance(d, list) else [])
    #docdf['doclists'] = docdf['doclists'].fillna("").apply(list)
    docdf[["match_name", "resID", "doc_uid", "doc_doi", "doc_issn",
           "doc_eissn", "so_title", "au_keywords", "pubyear", "citedby_count"]] =\
           pd.DataFrame(docdf.doclists.tolist(), index=docdf.index)
    docdf.loc[: , "pubyear"] = docdf.loc[: , "pubyear"].str[0]
    docdf.to_csv(f"../data/docdata_{startidx}-{endidx}.csv", encoding='utf-8')   
    #df = authordf.iloc[startidx: endidx].apply(lambda x: get_author_metadata(x, "reordered_name"), axis=1)
    #authordf.to_csv(f"data/authordata_{startidx}-{endidx}.csv", encoding='utf-8')

University Name    Harvard College
Name               Alberto Alesina
Position                 Professor
lastname                   Alesina
firstnames                 Alberto
firstinits                       A
suffix                            
reordered_name          Alesina, A
Name: 0, dtype: object
Alesina, A
multiple pages for:  Alesina, A pages =  3

***Searching for: Alesina, A

***Searching for: Alesina, A

***Searching for: Alesina, A
***doclist***: [['ALESINA, A', 'CBE-0396-2022', 'WOS:A1979GV96200029', '', '0002-9939', '', 'PROCEEDINGS OF THE AMERICAN MATHEMATICAL SOCIETY', [], [1979], 0], ['ALESINA, A', 'CBE-0396-2022', 'WOS:A1981LH48600008', '10.1109/TCS.1981.1084993', '0098-4094', '', 'IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS', [], [1981], 284], ['ALESINA, A', 'CBE-0396-2022', 'WOS:A1982QW15200001', '10.2140/pjm.1982.103.251', '0030-8730', '', 'PACIFIC JOURNAL OF MATHEMATICS', [], [1982], 5], ['ALESINA, A', 'CBE-0396-2022', 'WOS:A1983RJ74800005', '', '0017-0097', '', 'GIO

In [None]:
docdf.head(10)

Unnamed: 0,University Name,Name,Position,lastname,firstnames,firstinits,suffix,reordered_name,doclists,toomany_orno_matches,match_name,resID,doc_uid,doc_doi,doc_issn,doc_eissn,so_title,au_keywords,pubyear,citedby_count
0,Harvard College,Alberto Alesina,Professor,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, CBE-0396-2022, WOS:A1979GV9620002...",False,"ALESINA, A",CBE-0396-2022,WOS:A1979GV96200029,,0002-9939,,PROCEEDINGS OF THE AMERICAN MATHEMATICAL SOCIETY,[],1979.0,0.0
0,Harvard College,Alberto Alesina,Professor,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, CBE-0396-2022, WOS:A1981LH4860000...",False,"ALESINA, A",CBE-0396-2022,WOS:A1981LH48600008,10.1109/TCS.1981.1084993,0098-4094,,IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS,[],1981.0,284.0
0,Harvard College,Alberto Alesina,Professor,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, CBE-0396-2022, WOS:A1982QW1520000...",False,"ALESINA, A",CBE-0396-2022,WOS:A1982QW15200001,10.2140/pjm.1982.103.251,0030-8730,,PACIFIC JOURNAL OF MATHEMATICS,[],1982.0,5.0
0,Harvard College,Alberto Alesina,Professor,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, CBE-0396-2022, WOS:A1983RJ7480000...",False,"ALESINA, A",CBE-0396-2022,WOS:A1983RJ74800005,,0017-0097,,GIORNALE DEGLI ECONOMISTI E ANNALI DI ECONOMIA,[],1983.0,0.0
0,Harvard College,Alberto Alesina,Professor,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, CBE-0396-2022, WOS:A1985A10860001...",False,"ALESINA, A",CBE-0396-2022,WOS:A1985A108600011,,0017-0097,,GIORNALE DEGLI ECONOMISTI E ANNALI DI ECONOMIA,[],1985.0,0.0
0,Harvard College,Alberto Alesina,Professor,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, EMJ-5429-2022, WOS:A1987J36310001...",False,"ALESINA, A",EMJ-5429-2022,WOS:A1987J363100010,10.2307/1884222,0033-5533,,QUARTERLY JOURNAL OF ECONOMICS,[],1987.0,601.0
0,Harvard College,Alberto Alesina,Professor,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, CBE-0396-2022, WOS:A1987L12860000...",False,"ALESINA, A",CBE-0396-2022,WOS:A1987L128600005,10.1111/j.1465-7295.1987.tb00764.x,0095-2583,,ECONOMIC INQUIRY,[],1987.0,143.0
0,Harvard College,Alberto Alesina,Professor,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, EMJ-5429-2022, WOS:A1988M17760001...",False,"ALESINA, A",EMJ-5429-2022,WOS:A1988M177600014,10.1016/0304-3932(88)90055-4,0304-3932,,JOURNAL OF MONETARY ECONOMICS,[],1988.0,3.0
0,Harvard College,Alberto Alesina,Professor,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, CBE-0396-2022, WOS:A1988AH4800000...",False,"ALESINA, A",CBE-0396-2022,WOS:A1988AH48000001,10.2307/3584936,0889-3365,,NBER MACROECONOMICS ANNUAL,[],1988.0,132.0
0,Harvard College,Alberto Alesina,Professor,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, EMJ-5429-2022, WOS:A1988M17930000...",False,"ALESINA, A",EMJ-5429-2022,WOS:A1988M179300004,,0022-2879,,JOURNAL OF MONEY CREDIT AND BANKING,[],1988.0,129.0
