## Using the WoS Starter API to retrieve author information

To run the code below, you need to request an API key for the Web of Science Starter API.

Once you have the key, save it in a file named `.env` in this same folder. Put the key on the first line of that file, in the following format:

    STARTERKEY = "XXXXXXXXXXX" #insert the key in place of the Xs

See this repository's README.md [**INSERT LINK**] for more on saving and retrieving your API key from a `.env` file.
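
A missing or misnamed key only shows up later as a failed request, so it can help to check for it explicitly. The sketch below assumes the variable is named `STARTERKEY`, as in the format shown above, and that `load_dotenv()` has already been called. `require_env` is a hypothetical helper, not part of this notebook.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; check that .env exists and load_dotenv() was called"
        )
    return value

# Typical use, after load_dotenv():
#     api_key = require_env("STARTERKEY")
```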

### I. Getting Started

1. **Import all necessary packages.** If the packages are not yet installed in your local environment, install them first from the `requirements.txt` file. Open a terminal or command prompt in this project's folder and run:

```
pip install -r requirements.txt
```

Then you can import these packages.

In [1]:
import requests
import time
import os
import urllib.parse
import pandas as pd
import random
from bs4 import BeautifulSoup   #for parsing xml and html
from random import randint  
from dotenv import load_dotenv 
load_dotenv()


True

2. **Store the API base URL and retrieve your API key.**

In [2]:
BASEURL_ST = 'https://api.clarivate.com/apis/wos-starter/v1/'
HEADERS_ST = {'X-APIKey': os.getenv("STARTERKEY")}  # name must match the variable saved in .env
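
The base URL above is later combined with a percent-encoded author query. As a rough illustration of what the resulting request URL looks like, the sketch below rebuilds it with the standard library only. `build_documents_url` is a hypothetical helper. The parameter names (`db`, `q`, `limit`, `page`) are the ones used by the search functions in this notebook.

```python
import urllib.parse

BASE_URL = "https://api.clarivate.com/apis/wos-starter/v1/"  # same value as BASEURL_ST

def build_documents_url(authorname: str, page: int = 1, limit: int = 50) -> str:
    """Build a documents-search URL for an author query (hypothetical helper)."""
    # quote() percent-encodes '=', ',' and spaces in the query string
    query = urllib.parse.quote(f"AU={authorname}")
    return f"{BASE_URL}documents?db=WOS&q={query}&limit={limit}&page={page}"

print(build_documents_url("Alesina, A"))
# https://api.clarivate.com/apis/wos-starter/v1/documents?db=WOS&q=AU%3DAlesina%2C%20A&limit=50&page=1
```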

In [3]:
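import Levenshtein as lev
# Edit-operation costs passed to lev.distance as (insertion, deletion, substitution);
# deletions are free, so extra characters in the query name are not penalized
lev_wts = (1, 0, 1)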
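
To make the effect of these weights concrete, here is a pure-Python stand-in for a weighted edit distance. It is a sketch, not the `Levenshtein` package's implementation, and it assumes the weights are ordered (insertion, deletion, substitution) and describe turning the first string into the second.

```python
def weighted_distance(s1: str, s2: str, weights=(1, 1, 1)) -> int:
    """Cost of turning s1 into s2 with per-operation costs (ins, del, sub)."""
    ins, dele, sub = weights
    # prev[j] holds the cost of turning the processed prefix of s1 into s2[:j]
    prev = [j * ins for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        curr = [i * dele]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + dele,                          # delete c1 from s1
                curr[j - 1] + ins,                       # insert c2 into s1
                prev[j - 1] + (0 if c1 == c2 else sub),  # match or substitute
            ))
        prev = curr
    return prev[-1]

print(weighted_distance("fryer, rg", "fryer, r", (1, 0, 1)))  # 0: dropping "g" is free
print(weighted_distance("fryer, r", "fryer, rg", (1, 0, 1)))  # 1: adding "g" costs one
```

Under these assumptions, a score with weights `(1, 0, 1)` counts the characters of the second string that cannot be matched, in order, against the first. The comparison is therefore asymmetric: which name is passed first matters.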

In [4]:
authordf = pd.read_csv("data/chor_davin_authors.csv", encoding = 'utf-8', usecols=[0,1,2,3,4,5], nrows=10000)
#authordf = authordf.iloc[:, 0:6]
print(authordf.shape)
authordf = authordf.dropna(how="all")
authordf["Ranking"] = authordf["Ranking"].ffill()
authordf['Ranking'] = authordf['Ranking'].astype(int)
authordf["University Name"] = authordf["University Name"].ffill()
#authordf = authordf.astype({"Ranking": str, "University Name": str})

print(authordf.shape)
authordf = authordf.fillna("")
authordf.head()


(10000, 6)
(2991, 6)


Unnamed: 0,Ranking,University Name,Name,Position,URL,hosting domain
0,1,Harvard College,Alberto Alesina,Professor,https://scholar.harvard.edu/alesina/publications,University webpage
1,1,Harvard College,Isaiah Andrews,Professor,https://scholar.harvard.edu/iandrews/publications,University webpage
2,1,Harvard College,Pol Antràs,Professor,https://scholar.harvard.edu/antras/publications,University webpage
3,1,Harvard College,Robert J. Barro,Professor,,
4,1,Harvard College,Emily Breza,Assistant Professor,https://sites.Google site.com/view/ebreza/rese...,Google site


### Convert First M. Last Suffix to Last FM

In [5]:
def convert_name_first2last_to_last_fminits(orig_name: str):
    """
    Split a name written as "First M. Last" or "First M. Last, Suffix".
    Returns (last_name, first_names, first_inits, suffix, "Last, FM").
    The final whitespace-separated token is treated as the last name.
    """
    # Empty or whitespace-only names yield empty fields
    if not orig_name.strip():
        return "", "", "", "", ""
    # Anything after the last ", " is treated as a suffix (e.g. "Jr.")
    comma_split = orig_name.split(", ")
    if len(comma_split) > 1:
        suffix = comma_split[-1]
    else:
        suffix = ""
    name = comma_split[0]
    name_parts = name.split()
    last_name = name_parts.pop(-1)
    first_names = ' '.join(name_parts)
    # One initial per remaining token, e.g. "Robert J." -> "RJ"
    first_inits = ''.join([part[0] for part in name_parts])
    reordered_shortname2 = last_name + ", " + first_inits
    return last_name, first_names, first_inits, suffix, reordered_shortname2
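
A quick way to see what this conversion produces is to run it on a few names. The sketch below restates the same logic compactly so that it runs on its own. `to_last_initials` is a hypothetical name, and the suffixed and multi-word inputs are illustrative, not taken from the author file.

```python
def to_last_initials(orig_name: str) -> str:
    """Compact restatement of the conversion: 'First M. Last, Suffix' -> 'Last, FM'."""
    name = orig_name.split(", ")[0]       # drop a trailing ", Suffix" if present
    parts = name.split()
    last = parts.pop(-1)                  # the final token is treated as the surname
    inits = "".join(p[0] for p in parts)  # first letter of each remaining token
    return f"{last}, {inits}"

for example in ["Robert J. Barro", "Roland G. Fryer, Jr.", "Jan van der Berg"]:
    print(example, "->", to_last_initials(example))
```

The last input shows a limitation: a multi-word surname such as "van der Berg" comes out as "Berg, Jvd", so names like this may need manual correction before searching.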


In [6]:
authordf["lastname"], authordf["firstnames"], authordf["firstinits"], authordf["suffix"], authordf["reordered_name"] = \
    zip(*authordf["Name"].apply(convert_name_first2last_to_last_fminits))
authordf.head()

Unnamed: 0,Ranking,University Name,Name,Position,URL,hosting domain,lastname,firstnames,firstinits,suffix,reordered_name
0,1,Harvard College,Alberto Alesina,Professor,https://scholar.harvard.edu/alesina/publications,University webpage,Alesina,Alberto,A,,"Alesina, A"
1,1,Harvard College,Isaiah Andrews,Professor,https://scholar.harvard.edu/iandrews/publications,University webpage,Andrews,Isaiah,I,,"Andrews, I"
2,1,Harvard College,Pol Antràs,Professor,https://scholar.harvard.edu/antras/publications,University webpage,Antràs,Pol,P,,"Antràs, P"
3,1,Harvard College,Robert J. Barro,Professor,,,Barro,Robert J.,RJ,,"Barro, RJ"
4,1,Harvard College,Emily Breza,Assistant Professor,https://sites.Google site.com/view/ebreza/rese...,Google site,Breza,Emily,E,,"Breza, E"


### Apply to a DataFrame of multiple authors

In [7]:
def get_author_info(authorname: str, page: int):
    """
    Fetch one page (up to 50 records) of WoS documents for `authorname`
    and return one list of fields per document with a close author-name match.
    """
    print("\n***Searching for:", authorname)
    SEARCH_QUERY = f"AU={authorname}"
    request = requests.get(f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}&limit=50&page={page}', headers=HEADERS_ST)
    data = request.json()
    if 'hits' not in data:
        # e.g. an error response from the API; show it instead of failing on a KeyError
        print("no hits returned:", data)
    doclist = []
    for hit in data.get('hits', []):
        try:
            authors = hit['names']['authors']
            short_names = [au['wosStandard'] for au in authors]
            ids = [au.get('researcherId', '') for au in authors]
            # Score every listed author against the query and keep the closest one
            lev_scores = [lev.distance(authorname.lower(), name.lower(), weights=lev_wts) for name in short_names]
            min_idx = lev_scores.index(min(lev_scores))
        except (KeyError, ValueError):
            # KeyError: author fields are missing; ValueError: the author list is empty
            print("missing authors:", authorname, hit['uid'])
            continue
        if lev_scores[min_idx] > 2:
            print("no close match:", short_names[min_idx])
            continue
        tgt_shortname = short_names[min_idx]
        tgt_id = ids[min_idx]
        # Optional fields fall back to empty values when the API omits them
        keywords = hit.get('keywords', {}).get('authorKeywords', [])
        source = hit.get('source', {})
        so_title = source.get('sourceTitle', "")
        pubyear = [source['publishYear']] if 'publishYear' in source else []
        doc_uid = hit['uid']
        identifiers = hit.get('identifiers', {})
        doi = identifiers.get('doi', "")
        issn = identifiers.get('issn', "")
        eissn = identifiers.get('eissn', "")
        citations = hit.get('citations') or []
        citedby_count = citations[0]['count'] if citations else None
        doclist.append([tgt_shortname, tgt_id,
                        doc_uid, doi, issn, eissn,
                        so_title, keywords, pubyear,
                        citedby_count])

    return doclist
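
Several fields in a search hit are optional, which is why the function checks for each key before reading it. A more compact way to express the same idea uses `dict.get` with defaults. The sketch below runs on a made-up `sample_hit` whose keys mirror the ones read above; the values are placeholders, not real API output.

```python
def extract_fields(hit: dict) -> dict:
    """Pull optional fields from a hit, using empty defaults when a key is absent."""
    source = hit.get("source", {})
    identifiers = hit.get("identifiers", {})
    citations = hit.get("citations") or []
    return {
        "doc_uid": hit.get("uid", ""),
        "so_title": source.get("sourceTitle", ""),
        "pubyear": source.get("publishYear"),
        "doi": identifiers.get("doi", ""),
        "issn": identifiers.get("issn", ""),
        "eissn": identifiers.get("eissn", ""),
        "au_keywords": hit.get("keywords", {}).get("authorKeywords", []),
        "citedby_count": citations[0].get("count") if citations else None,
    }

# Placeholder record: no DOI, no keywords, and an empty citations list
sample_hit = {
    "uid": "WOS:000000000000001",
    "source": {"sourceTitle": "EXAMPLE JOURNAL", "publishYear": 2020},
    "identifiers": {"issn": "0000-0000"},
    "citations": [],
}
print(extract_fields(sample_hit))
```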

In [8]:
#get_author_info(authorname="Fryer, RG", page=1)

In [9]:
def get_author_metadata(row:pd.Series, colname:str):
    """
    Reads an author name in the form "LastName, FirstInits" from row[colname].
    Uses the Starter API to search for documents by authors with this name,
    then collects author info and document ids for the closely matching authors.
    Returns (doclist, too_many_matches); the flag is True, and the list empty,
    when the search spans more than 10 pages of 50 results.
    """
    print(row)
    print(row[colname])
    authorname = row[colname]
    SEARCH_QUERY = f"AU={authorname}"
    initial_request = requests.get(f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}&limit=10', headers=HEADERS_ST)
    data = initial_request.json()
    num_results = int(data['metadata']['total'])
    num_pages = 1 + (num_results -1)// 50
    doclist = []
    if num_pages <= 1:
        doclist = get_author_info(authorname, page=1)
    elif num_pages > 10:
        #row['too_many_matches'] = "yes"
        print("TOO MANY MATCHES:", authorname, "\n")
        #return(row)
        return [], True
    else:
        print("multiple pages for: ", authorname, "pages = ", num_pages)
        for i in range(1, num_pages + 1):
            doclist.extend(get_author_info(authorname, i))
    print("***doclist***:", doclist)
    #newdoclist = []
    #for doc in doclist:
    #    newdoclist.append(row + doc)
    #return newdoclist
    return doclist, False
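
The page count in this function is a ceiling division written with integer arithmetic. The short sketch below isolates that expression to show how it behaves at the boundaries. `page_count` is a hypothetical name for the same formula.

```python
def page_count(num_results: int, page_size: int = 50) -> int:
    """Same arithmetic as num_pages above: pages needed for num_results records."""
    return 1 + (num_results - 1) // page_size

for total in (0, 1, 50, 51, 500, 501):
    print(total, "->", page_count(total))
```

With 50 results per page, 500 results still fit in 10 pages, so the "too many matches" branch starts at 501 results. Zero results gives a page count of 0, which falls into the `num_pages <= 1` branch and still triggers one search request.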

In [24]:
#doclist = []
#for row in authordf.iloc[:3].iterrows():
#    doclist.extend(get_author_metadata(row, "reordered_name"))
#authordf.loc[:3, doclist] = 
#print(len(doclist))

#authordf.loc[:1, "doclists"] = authordf.iloc[:3].apply(lambda x: get_author_metadata(x, "reordered_name"), axis=1)
#authordf.head()

In [23]:
numauthors = len(authordf)
print(numauthors)
num_iterations = numauthors // 50
print(num_iterations)

2991
59
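
Because 2991 is not a multiple of 50, integer division gives 59 full batches plus a final partial one. The sketch below shows one way to generate the slice bounds so that the last batch ends at the true row count. `batch_bounds` is a hypothetical helper, not part of this notebook.

```python
def batch_bounds(n_rows: int, batch_size: int = 50):
    """Yield (start, end) index pairs covering n_rows in chunks of batch_size."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

bounds = list(batch_bounds(2991))
print(len(bounds), bounds[0], bounds[-1])
# 60 (0, 50) (2950, 2991)
```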


In [35]:
for i in range(5, 8):#num_iterations + 1):
    startidx = i * 50
    endidx = startidx + 50
    #startidx = 0
    #endidx = 3
    subdf = authordf.iloc[startidx:endidx].copy()
    subdf["doclists"], subdf['too_many_matches'] = \
       zip(*subdf.apply(lambda x: get_author_metadata(x, "reordered_name"), axis=1))
    #authordf.loc[startidx: endidx, "doclists"] = authordf.iloc[startidx:endidx].apply(lambda x: get_author_metadata(x, "reordered_name"), axis=1)
    docdf = subdf.explode("doclists")
    docdf['doclists'] = docdf['doclists'].apply(lambda d: d if isinstance(d, list) else [])
    #docdf['doclists'] = docdf['doclists'].fillna("").apply(list)
    docdf[["match_name", "resID", "doc_uid", "doc_doi", "doc_issn",
           "doc_eissn", "so_title", "au_keywords", "pubyear", "citedby_count"]] =\
           pd.DataFrame(docdf.doclists.tolist(), index=docdf.index)
    docdf.to_csv(f"data/docdata_{startidx}-{endidx}.csv", encoding='utf-8')   
    #df = authordf.iloc[startidx: endidx].apply(lambda x: get_author_metadata(x, "reordered_name"), axis=1)
    #authordf.to_csv(f"data/authordata_{startidx}-{endidx}.csv", encoding='utf-8')

Ranking                              3
University Name    Stanford University
Name                     Paul A. David
Position                     Professor
URL                                   
hosting domain                        
lastname                         David
firstnames                     Paul A.
firstinits                          PA
suffix                                
reordered_name               David, PA
doclists                           NaN
Name: 150, dtype: object
David, PA
multiple pages for:  David, PA pages =  3

***Searching for: David, PA

***Searching for: David, PA

***Searching for: David, PA
Ranking                                                            3
University Name                                  Stanford University
Name                                                   Walter Falcon
Position                                                   Professor
URL                https://fse.fsi.stanford.edu/people/walter_p_f...
hosting domain         

ValueError: All arrays must be of the same length

In [16]:
explodedf = authordf.explode("doclists")
print(explodedf.shape)
explodedf.head(10)

(3359, 12)


Unnamed: 0,Ranking,University Name,Name,Position,URL,hosting domain,lastname,firstnames,firstinits,suffix,reordered_name,doclists
0,1,Harvard College,Alberto Alesina,Professor,https://scholar.harvard.edu/alesina/publications,University webpage,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, CBE-0396-2022, WOS:A1979GV9620002..."
0,1,Harvard College,Alberto Alesina,Professor,https://scholar.harvard.edu/alesina/publications,University webpage,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, CBE-0396-2022, WOS:A1981LH4860000..."
0,1,Harvard College,Alberto Alesina,Professor,https://scholar.harvard.edu/alesina/publications,University webpage,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, CBE-0396-2022, WOS:A1982QW1520000..."
0,1,Harvard College,Alberto Alesina,Professor,https://scholar.harvard.edu/alesina/publications,University webpage,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, CBE-0396-2022, WOS:A1983RJ7480000..."
0,1,Harvard College,Alberto Alesina,Professor,https://scholar.harvard.edu/alesina/publications,University webpage,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, CBE-0396-2022, WOS:A1985A10860001..."
0,1,Harvard College,Alberto Alesina,Professor,https://scholar.harvard.edu/alesina/publications,University webpage,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, EMJ-5429-2022, WOS:A1987J36310001..."
0,1,Harvard College,Alberto Alesina,Professor,https://scholar.harvard.edu/alesina/publications,University webpage,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, CBE-0396-2022, WOS:A1987L12860000..."
0,1,Harvard College,Alberto Alesina,Professor,https://scholar.harvard.edu/alesina/publications,University webpage,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, EMJ-5429-2022, WOS:A1988M17760001..."
0,1,Harvard College,Alberto Alesina,Professor,https://scholar.harvard.edu/alesina/publications,University webpage,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, CBE-0396-2022, WOS:A1988AH4800000..."
0,1,Harvard College,Alberto Alesina,Professor,https://scholar.harvard.edu/alesina/publications,University webpage,Alesina,Alberto,A,,"Alesina, A","[ALESINA, A, EMJ-5429-2022, WOS:A1988M17930000..."


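Each entry in `doclists` is a list with the fields below, in this order. The next cells split that list into separate columns, following the approach described at https://stackoverflow.com/questions/35491274/split-a-pandas-column-of-lists-into-multiple-columns

    [tgt_shortname, tgt_id,
     doc_uid, doi, issn, eissn,
     so_title, keywords, pubyear,
     citedby_count]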

In [18]:
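# Replace missing values with an empty list; fillna("[]") would insert the string "[]" instead
explodedf['doclists'] = explodedf['doclists'].apply(lambda d: d if isinstance(d, list) else [])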

In [19]:
explodedf[["match_name", "resID", "doc_uid", "doc_doi", "doc_issn",
           "doc_eissn", "so_title", "au_keywords", "pubyear", "citedby_count"]] =\
           pd.DataFrame(explodedf.doclists.tolist(), index=explodedf.index)
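
Splitting the list column this way relies on at least one row holding a full list of ten fields. If every row in a batch were empty, the intermediate frame would have no columns and the ten-column assignment would fail. One possible guard, sketched below under that assumption, is to normalize every record to a fixed width first. `FIELDS` and `normalize_record` are hypothetical names.

```python
FIELDS = ["match_name", "resID", "doc_uid", "doc_doi", "doc_issn",
          "doc_eissn", "so_title", "au_keywords", "pubyear", "citedby_count"]

def normalize_record(record, n_fields: int = len(FIELDS)) -> list:
    """Return the record if it is a list of n_fields items, else a row of None values."""
    if isinstance(record, list) and len(record) == n_fields:
        return record
    return [None] * n_fields

rows = [normalize_record(r) for r in (float("nan"), [], list(range(10)))]
print([len(r) for r in rows])  # every row now has ten entries
```

The normalized rows could then be passed to `pd.DataFrame(rows, columns=FIELDS, index=explodedf.index)` in place of `explodedf.doclists.tolist()`.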


In [21]:
explodedf.columns

Index(['Ranking', 'University Name', 'Name', 'Position', 'URL',
       'hosting domain', 'lastname', 'firstnames', 'firstinits', 'suffix',
       'reordered_name', 'doclists', 'match_name', 'resID', 'doc_uid',
       'doc_doi', 'doc_issn', 'doc_eissn', 'so_title', 'au_keywords',
       'pubyear', 'citedby_count'],
      dtype='object')

Unnamed: 0,Ranking,University Name,Name,Position,URL,hosting domain,lastname,firstnames,firstinits,suffix,...,match_name,resID,doc_uid,doc_doi,doc_issn,doc_eissn,so_title,au_keywords,pubyear,citedby_count
0,1,Harvard College,Alberto Alesina,Professor,https://scholar.harvard.edu/alesina/publications,University webpage,Alesina,Alberto,A,,...,"ALESINA, A",CBE-0396-2022,WOS:A1979GV96200029,,0002-9939,,PROCEEDINGS OF THE AMERICAN MATHEMATICAL SOCIETY,[],[1979],0.0
0,1,Harvard College,Alberto Alesina,Professor,https://scholar.harvard.edu/alesina/publications,University webpage,Alesina,Alberto,A,,...,"ALESINA, A",CBE-0396-2022,WOS:A1981LH48600008,10.1109/TCS.1981.1084993,0098-4094,,IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS,[],[1981],273.0
0,1,Harvard College,Alberto Alesina,Professor,https://scholar.harvard.edu/alesina/publications,University webpage,Alesina,Alberto,A,,...,"ALESINA, A",CBE-0396-2022,WOS:A1982QW15200001,10.2140/pjm.1982.103.251,0030-8730,,PACIFIC JOURNAL OF MATHEMATICS,[],[1982],5.0
0,1,Harvard College,Alberto Alesina,Professor,https://scholar.harvard.edu/alesina/publications,University webpage,Alesina,Alberto,A,,...,"ALESINA, A",CBE-0396-2022,WOS:A1983RJ74800005,,0017-0097,,GIORNALE DEGLI ECONOMISTI E ANNALI DI ECONOMIA,[],[1983],0.0
0,1,Harvard College,Alberto Alesina,Professor,https://scholar.harvard.edu/alesina/publications,University webpage,Alesina,Alberto,A,,...,"ALESINA, A",CBE-0396-2022,WOS:A1985A108600011,,0017-0097,,GIORNALE DEGLI ECONOMISTI E ANNALI DI ECONOMIA,[],[1985],0.0
