# Working with the Web of Science Starter API

This notebook demonstrates some standard means for accessing and retrieving data from the Web of Science (WoS) database using the WoS Starter API.

The Starter API allows users to automate searches of the WoS database, retrieving the following data:
+ author(s), editor(s), etc.
+ document title
+ source / publication title
+ document type
+ author keywords
+ document identifiers (ISSN, eISSN, ISBN, DOI, PubMed Id, Web of Science identifier [UT])
+ author / researcher identifiers
+ publication year
+ volume and issue
+ pages
+ times cited

*Note: The Starter API can also search the following fields, although it does not include them in the results:*
+ Keywords Plus
+ abstract

*Thus, you may search for all documents that use "nuclear disarmament" in their Keywords Plus terms or abstract, but the returned results will not include those terms or the abstract itself. If you want to examine how a particular text uses this term, you will have to visit the WoS record to review its abstract or use one of the other methods available to the Dartmouth community for accessing WoS, [detailed here](https://researchguides.dartmouth.edu/c.php?g=59725&p=9910244).*

### Limits on Starter API use

Institutional subscriptions to the Starter API (which Dartmouth Library has) allow researchers to:
+ place up to 5,000 requests per day
+ place up to 5 requests per second
+ retrieve up to 50 records per request

This means a researcher can retrieve a maximum of 250,000 records (5,000 requests × 50 records) in a given day.
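
The per-second limit above can be respected with a small pacing helper. The following is a minimal sketch, not part of the Starter API client: the `Throttle` class and its parameter names are hypothetical, and the clock and sleep functions are injectable only so the logic can be checked without actually waiting.

```python
import time


class Throttle:
    """Space out calls so that no more than `per_second` happen each second."""

    def __init__(self, per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / per_second  # 0.2 s between calls for 5 per second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if there was one

    def wait(self):
        """Pause just long enough to respect the limit and return the pause length."""
        now = self._clock()
        pause = 0.0
        if self._last is not None:
            pause = max(0.0, self.min_interval - (now - self._last))
        if pause > 0:
            self._sleep(pause)
        self._last = now + pause
        return pause
```

Calling `throttle.wait()` immediately before each `requests.get(...)` would keep a loop of requests at or below five per second.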

For a more detailed comparison of the WoS Starter and Lite APIs and of the different ways to access WoS data, which is particularly relevant for Dartmouth community members, please see the [Dartmouth Library *Accessing Web of Science Data* Guide](https://researchguides.dartmouth.edu/c.php?g=59725&p=9910244). For more on the database fields that can be searched and those that are returned by the Starter API, see the [README on the Starter API client's GitHub page](https://github.com/clarivate/wosstarter_python_client/blob/master/README.md).

## I. Getting Started

1. **Import all necessary packages.** To first install these packages to a local environment, you can use the requirements.txt file. Open a terminal / command prompt within this project's folder and type:

```
pip install -r requirements.txt
```

Then you can import these packages.

In [1]:
import requests
import time
import os
import urllib.parse
import pandas as pd
import random
from bs4 import BeautifulSoup   #for parsing xml and html
from random import randint  
from dotenv import load_dotenv 
load_dotenv()


True

2. Store the API's base URL in a variable and retrieve your API key. See this repository's README.md [**INSERT LINK**] for more on saving and retrieving your API key from an .env file.

In [2]:
BASEURL_ST = 'https://api.clarivate.com/apis/wos-starter/v1/'
HEADERS_ST = {'X-APIKey': os.getenv("APIKEY")}
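
If the `APIKEY` variable is missing from the environment, `os.getenv` silently returns `None` and the failure only surfaces later as a rejected request. A minimal sketch of an earlier check follows; the helper name `build_headers` is hypothetical, while the `X-APIKey` header and the `APIKEY` variable name come from the cell above.

```python
import os


def build_headers(env=os.environ, var_name="APIKEY"):
    """Return the request headers, failing early if the API key is not set."""
    key = (env.get(var_name) or "").strip()
    if not key:
        raise RuntimeError(
            f"No API key found in the environment variable '{var_name}'. "
            "Check that your .env file exists and that load_dotenv() has run."
        )
    return {"X-APIKey": key}
```

With this in place, `HEADERS_ST = build_headers()` would raise a readable error instead of sending a request without a key.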

## II. Search Queries

### IIa. Retrieve documents by author

3. We can begin with a simple search query for one author. For example, we can search amongst the [suspiciously prolific publications by the Spanish scholar Rafael Luque](https://english.elpais.com/science-tech/2023-04-02/one-of-the-worlds-most-cited-scientists-rafael-luque-suspended-without-pay-for-13-years.html).

In [11]:
#SEARCH_QUERY = 'AU=Lepore, Jill' 
#SEARCH_QUERY = "AU=Luque, Rafael" # Enter your search query here, in this case we are looking for an author named Rafael Luque
#SEARCH_QUERY = 'AU=Schnell, JD'  

#SEARCH_QUERY = "AI=IMS-5344-2023"
#SEARCH_QUERY = "AI=EFQ-9500-2022"
SEARCH_QUERY = "UT=WOS:000188058500010"

4. **RETRIEVE DATA**: We can then insert the API url (BASEURL_ST), the search query, and the API Key (HEADERS_ST) into a requests command.

In [12]:
initial_request = requests.get(f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}', headers=HEADERS_ST)
data = initial_request.json()
data

{'metadata': {'total': 1, 'page': 1, 'limit': 10},
 'hits': [{'uid': 'WOS:000188058500010',
   'title': 'Thoracic findings in pediatric patients with cystic fibrosis',
   'types': ['Article'],
   'sourceTypes': ['Article'],
   'source': {'sourceTitle': 'RADIOLOGE',
    'publishYear': 2003,
    'publishMonth': 'DEC',
    'volume': '43',
    'issue': '12',
    'pages': {'range': '1103-1108',
     'begin': '1103',
     'end': '1108',
     'count': 6}},
   'names': {'authors': [{'displayName': 'Wunsch, R',
      'wosStandard': 'Wunsch, R',
      'researcherId': 'EFQ-9500-2022'},
     {'displayName': 'Wunsch, C',
      'wosStandard': 'Wunsch, C',
      'researcherId': 'GDU-2826-2022'}]},
   'links': {'record': 'https://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=dartrds_jeremy_01&SrcAuth=WosAPI&KeyUT=WOS:000188058500010&DestLinkType=FullRecord&DestApp=WOS_CPL',
    'references': 'https://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=dartrds_jeremy_01&SrcAuth=WosAPI&KeyUT=WOS:0

[limits]

In [8]:
initial_request = requests.get(f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}&limit=50', headers=HEADERS_ST)
data = initial_request.json()
data

{'metadata': {'total': 50, 'page': 1, 'limit': 50},
 'hits': [{'uid': 'WOS:A1997WH47600007',
   'title': 'Lymphomas in childhood',
   'types': ['Article'],
   'sourceTypes': ['Article'],
   'source': {'sourceTitle': 'RADIOLOGE',
    'publishYear': 1997,
    'publishMonth': 'JAN',
    'volume': '37',
    'issue': '1',
    'pages': {'range': '51-61', 'begin': '51', 'end': '61', 'count': 11}},
   'names': {'authors': [{'displayName': 'Wunsch, C',
      'wosStandard': 'Wunsch, C',
      'researcherId': 'GDU-2826-2022'},
     {'displayName': 'Wunsch, R',
      'wosStandard': 'Wunsch, R',
      'researcherId': 'EFQ-9500-2022'},
     {'displayName': 'Richter, GM',
      'wosStandard': 'Richter, GM',
      'researcherId': 'FXB-9894-2022'},
     {'displayName': 'Betsch, B',
      'wosStandard': 'Betsch, B',
      'researcherId': 'CGG-7868-2022'},
     {'displayName': 'Brado, M',
      'wosStandard': 'Brado, M',
      'researcherId': 'CHD-4667-2022'},
     {'displayName': 'Kauffmann, GW',
      

The results above, however, return the work of multiple scholars with the name Rafael Luque or R. Luque. Instead we need to search for this particular scholar by using his ResearcherID.

We can filter the data already retrieved to disambiguate the different Rafael Luques. 

Let's first examine the data returned (under the name `data`).



In [53]:
def author_summary(jsondict):
    """Return one row per (document, author) pair from a Starter API response."""
    auth_rows = []
    hits = jsondict['hits']
    print(f"Titles included in response to the query: '{SEARCH_QUERY}' - {len(hits)}")
    for hit in hits:
        title = hit['title']
        year = hit['source']['publishYear']
        authors = hit['names']['authors']
        # fall back to an empty list when a record carries no author keywords
        au_keywords = hit.get('keywords', {}).get('authorKeywords', [])
        for author in authors:
            # .get() avoids a KeyError for authors without a ResearcherID
            auth_rows.append([title, year, au_keywords, author['wosStandard'], author.get('researcherId')])
    return auth_rows
rows = author_summary(data)

author_df = pd.DataFrame(rows, columns = ["title", "year", "author_keywords", "authorname_wos_std", "researcherid"])
print(author_df.shape)
author_df.head(40)


Titles included in response to the query: 'AU=Luque, Rafael' - 50
(286, 5)


Unnamed: 0,title,year,author_keywords,authorname_wos_std,researcherid
0,Development of Educative Software: A Methodolo...,2005,"[Educational software, methodology, software d...","Quintero, H",IOR-0062-2023
1,Development of Educative Software: A Methodolo...,2005,"[Educational software, methodology, software d...","Portillo, L",IHG-2338-2023
2,Development of Educative Software: A Methodolo...,2005,"[Educational software, methodology, software d...","Luque, R",IZK-3335-2023
3,Development of Educative Software: A Methodolo...,2005,"[Educational software, methodology, software d...","González, M",CTD-2653-2022
4,INFECTIOUS BACTEREMIC ECTHYMA IN A PATIENT WIT...,2005,"[ecthyma, HIV, bacteremia, Streptococcus pyoge...","Bernabeu, J",DXC-2321-2022
5,INFECTIOUS BACTEREMIC ECTHYMA IN A PATIENT WIT...,2005,"[ecthyma, HIV, bacteremia, Streptococcus pyoge...","Aparicio, R",EMT-5884-2022
6,INFECTIOUS BACTEREMIC ECTHYMA IN A PATIENT WIT...,2005,"[ecthyma, HIV, bacteremia, Streptococcus pyoge...","Luque, R",HVB-2615-2023
7,INFECTIOUS BACTEREMIC ECTHYMA IN A PATIENT WIT...,2005,"[ecthyma, HIV, bacteremia, Streptococcus pyoge...","Nieto, MD",GBN-0050-2022
8,Object recognition and inspection in difficult...,2006,[],"Domínguez, E",K-8465-2012
9,Object recognition and inspection in difficult...,2006,[],"Spinola, C",GAF-9034-2022


In [None]:
#SEARCH_QUERY = 'RI=F-9853-2010'
#type(data)

In [43]:
a = [1, 2, 3, 4]
b = [[1, 2, 3], [2, 3, 5], [2,6,7], [2, 3, 8]]
c = [[1, 2, 3], [2, 3, 5], [2,6,7], 5]

def flatten_extend(matrix: list):
    flat_list = []
    for row in matrix:
        if type(row) is list:
            flat_list.extend(row)
        else:
            flat_list.append(row)
    return flat_list

flatten_extend(a)


[1, 2, 3, 4]

In [64]:
# groupby researcherid
from collections import Counter
#author_df.groupby(by = "researcherid")["author_keywords"].apply(list)

def flatten_extend(matrix: list):
    flat_list = []
    for row in matrix:
        if type(row) is list:
            flat_list.extend(row)
        else:
            flat_list.append(row)
    return flat_list

def most_common_keywords(kwlist: list):
    """Return the five most frequent keywords as (keyword, count) pairs."""
    flat_kwlist = flatten_extend(kwlist)
    kw_ctr = Counter(flat_kwlist)
    return kw_ctr.most_common(5)

# use a new name here so that the author_summary() function defined earlier is not overwritten
author_counts = author_df.groupby(by = "researcherid")\
    .agg({"authorname_wos_std": "first", "title": "count", "author_keywords": lambda x: most_common_keywords(x)})

author_counts = author_counts.sort_values(by = "title", ascending=False)
author_counts_sub = author_counts[author_counts['authorname_wos_std'].str.startswith("Luque")]
author_counts_sub.head()

Unnamed: 0_level_0,authorname_wos_std,title,author_keywords
researcherid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
F-9853-2010,"Luque, R",40,"[(heterogeneous catalysis, 8), (mesoporous mat..."
Q-8132-2018,"Luque, RM",4,"[(network security, 2), (data mining, 2), (Com..."
HVB-2615-2023,"Luque, R",2,"[(ecthyma, 1), (HIV, 1), (bacteremia, 1), (Str..."
IZK-3335-2023,"Luque, R",2,"[(Educational software, 1), (methodology, 1), ..."
DWC-6013-2022,"Luque, RJ",1,"[(bladder cancer, 1), (apoptosis, 1), (Ki-67, 1)]"
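
Once the ResearcherID of interest is known, the hits already retrieved can be narrowed to that one person. The sketch below assumes the response shape shown in the outputs above (`names` → `authors` → `researcherId`); the function name `hits_for_researcher` is hypothetical.

```python
def hits_for_researcher(hits, researcher_id):
    """Keep only the hits that list `researcher_id` among their authors."""
    selected = []
    for hit in hits:
        # tolerate records where 'names' or 'authors' is missing or None
        authors = (hit.get("names") or {}).get("authors") or []
        if any(author.get("researcherId") == researcher_id for author in authors):
            selected.append(hit)
    return selected
```

For example, `hits_for_researcher(data['hits'], "F-9853-2010")` would keep only the documents attributed to the most prolific "Luque, R" in the table above.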


5. **VIEW SUMMARY INFORMATION FROM WOS REQUEST**: As the output above shows, the request returns a JSON document, which `.json()` parses into a Python dictionary. We can pull specific values out of it just as we would from any dictionary. To view only the summary metadata for the response, we can run the following:

In [None]:
data['metadata']  #returns the total number of records found, the current page, and the per-request limit

To retrieve only summary information for each publication (title, source, year, and page range) we can run the following function:

In [None]:
def title_summary(jsondict):
    """Print the year, title, source, and page range of each hit in a response."""
    hits = jsondict['hits']
    print(f"Titles included in response to the query: '{SEARCH_QUERY}'")
    for hit in hits:
        title = hit['title']
        so_title = hit['source']['sourceTitle']
        year = hit['source']['publishYear']
        # some records have no page range, so avoid splitting on "-" and indexing the result
        pages = hit['source'].get('pages', {}).get('range', 'no pages listed')
        print(f"{year}. '{title}', {so_title}: {pages}")

In [None]:
title_summary(data)

6. **RECORD LIMITS**: Although this query finds 50 records, only 10 are returned because the API's default `limit` is 10 records per request. Let's raise it to the maximum of 50.

In [None]:
initial_request = requests.get(f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}&limit=50', headers=HEADERS_ST)
data = initial_request.json()
data['metadata']

In [None]:
title_summary(data)
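
The request URLs above are assembled by hand each time. As a sketch of one alternative, the helper below builds the same kind of URL from the `db`, `q`, `limit`, and `page` parameters used in this notebook; the function name `documents_url` is hypothetical, and the 1 to 50 bound on `limit` reflects the per-request maximum described in the introduction.

```python
from urllib.parse import quote, urlencode

BASEURL_ST = "https://api.clarivate.com/apis/wos-starter/v1/"


def documents_url(query, limit=10, page=1, db="WOS"):
    """Build a /documents request URL with every parameter percent-encoded."""
    if not 1 <= limit <= 50:
        raise ValueError("limit must be between 1 and 50")
    params = {"db": db, "q": query, "limit": limit, "page": page}
    # quote_via=quote encodes spaces as %20 rather than '+'
    return f"{BASEURL_ST}documents?{urlencode(params, quote_via=quote)}"
```

The result can be passed straight to `requests.get(documents_url(SEARCH_QUERY, limit=50), headers=HEADERS_ST)`.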

### IIb. Other Search Fields

7. **SEARCHING THE WOS DATASET USING OTHER FIELDS**: We can search by a variety of fields besides author ("AU") and even combine searches by multiple fields.

Search fields for the Web of Science Starter API are listed in the [project's README page](https://github.com/clarivate/wosstarter_python_client) but are also copied here: 

| Field Tag | Description                                                                                                                                                 |
|-----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
| TI        | Title of document                                                                                                                                           |
| IS        | ISSN or ISBN                                                                                                                                                |
| SO        | Source title - The result contains all source titles within product database (for example, journal titles and/or book titles if the product includes books) |
| VL        | Volume                                                                                                                                                      |
| PG        | Page                                                                                                                                                        |
| CS        | Issue                                                                                                                                                       |
| PY        | Year Published                                                                                                                                              |
| AU        | Author                                                                                                                                                      |
| AI        | Author Identifier                                                                                                                                                      |
| UT        | Accession Number                                                                                                                                            |
| DO        | DOI                                                                                                                                                         |
| DT        | [Document Type](https://webofscience.help.clarivate.com/en-us/Content/document-types.html)                                                                                                                                                         |
| PMID      | PubMed ID                                                                                                                                                   |
| OG        | Search for preferred organization names and/or their name variants from the Preferred Organization Index. <p> A search on a preferred organization name returns all records that contain the preferred name and all records that contain its name variants. A search on a name variant returns all records that contain the variant. For example, Cornell Law Sch returns all records that contain Cornell Law Sch in the Addresses field. <p> When searching for organization names that contain a Boolean (AND, NOT, NEAR, and SAME), always enclose the word in quotation marks ( " " ). For example: <p>   - OG=(Japan Science "and" Technology Agency (JST))      <br>   - OG=("Near" East Univ)         <br> - OG=("OR" Hlth Sci Univ)                           |
| TS        | Searches for topic terms in the following fields within a document: <p> - Title <br> - Abstract <br> - Author keywords <br> - Keywords Plus |


Allowed tags are AI, AU, CS, DO, DT, IS, OG, PG, PMID, PY, SO, SUR, TI, TS, UT, VL
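
A query can be checked against this list before it is sent, which avoids spending a request on a mistyped tag. The following is a rough sketch only: `unknown_tags` is a hypothetical helper, it treats any run of letters directly before an `=` as a field tag (so a value that itself contains `=` would be misread), and it assumes tags can be compared case-insensitively.

```python
import re

# the tags listed as allowed in the line above
ALLOWED_TAGS = {"AI", "AU", "CS", "DO", "DT", "IS", "OG", "PG", "PMID",
                "PY", "SO", "SUR", "TI", "TS", "UT", "VL"}


def unknown_tags(query):
    """Return the field tags used in `query` that are not in ALLOWED_TAGS."""
    tags = re.findall(r"\b([A-Za-z]+)\s*=", query)
    return sorted({tag.upper() for tag in tags if tag.upper() not in ALLOWED_TAGS})
```

An empty list means every tag in the query is on the allowed list; anything else names the tags to double-check.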


### IIc. Search by Publication

8. **SEARCH FOR A SPECIFIC JOURNAL**: You may also, for example, search by journal. We recommend looking up the journal's ISSN number rather than searching by journal title. You can look up a journal's ISSN using the [ISSN Portal](https://portal.issn.org/).

For example, we can search in ***The William and Mary Quarterly*** using its ISSN ("0043-5597"). This query identifies 6,082 results (although it only returns 10\*). 

*\*Note: we are keeping the 'limit' at 10 for this experimentation. But, if you want the API to retrieve the maximum allowed number of sources (50), add in "&limit=50" as we did in Step #6 above. If we wanted to retrieve all 6,082 results, for example, we could do so using a loop that first retrieves records 1-50, then 51-100, and so on for 122 times until it finishes. Part III shows how to do so below.*

In [None]:
SEARCH_QUERY = "IS=0043-5597"
initial_request = requests.get(f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}', headers=HEADERS_ST)
initial_request.json()
 

9. **FILTER BY YEAR(S)**: We can then narrow our search of this one journal to a range of years:

In [None]:
SEARCH_QUERY = "IS=0043-5597 AND PY=(2000-2010)"
initial_request = requests.get(f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}', headers=HEADERS_ST)
initial_request.json()

10. **FILTER BY DOCUMENT TYPE**: We can filter by Document Type (DT) to retrieve only articles (rather than book reviews and other types) from this journal. As you can see, we have reduced the number of results to 222.

In [None]:
SEARCH_QUERY = "IS=0043-5597 AND PY=(2000-2010) AND DT=Article"
initial_request = requests.get(f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}', headers=HEADERS_ST)

initial_request.json()
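
The three queries in this section grow by appending one `AND` clause at a time. A small sketch of composing them programmatically follows. The helper `build_query` is hypothetical, and it wraps every value in parentheses, as the `PY=(2000-2010)` clause above does; that the API accepts the parenthesised form for every tag is an assumption that should be checked against a live query.

```python
def build_query(**fields):
    """Join tag/value pairs into a single query string with AND."""
    parts = [f"{tag.upper()}=({value})" for tag, value in fields.items() if value]
    if not parts:
        raise ValueError("at least one field is required")
    return " AND ".join(parts)
```

For example, `build_query(IS="0043-5597", PY="2000-2010", DT="Article")` produces `IS=(0043-5597) AND PY=(2000-2010) AND DT=(Article)`.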

### IId. Search by keyword

11. **TOPIC TERMS SEARCH**: We can also use the Starter API's Topic Terms (TS) field to search for keywords and terms found in a text's Title, Author Keywords, Keywords Plus, and Abstract fields. As noted in the introduction to this notebook, you can search across all of these fields, but of them only the Title and Author Keywords are returned in the results; Keywords Plus and the Abstract are not.

In the example below, we retrieve all works that have the term "humanized landscape" stored in these fields. 

In [None]:
#SEARCH_QUERY = "TS=(ecology AND humanized landscape)"
SEARCH_QUERY = "TS=humanized landscape"
initial_request = requests.get(f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}', headers=HEADERS_ST)
data = initial_request.json()
data['metadata']

In [None]:
data

## III. Placing Retrieved Data in a Dataframe and saving to a csv

The Starter API returns query results in the format of [JSON data](https://www.w3schools.com/js/js_json_intro.asp), which is hierarchical in format. 

Often, however, researchers will want to place this data in a two-dimensional data table (in Python's pandas library, these tables are known as dataframes). This part shows how to transform the returned JSON data into a dataframe.
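
To make the idea of moving from hierarchical to tabular data concrete, the sketch below flattens one nested record into a single-level dictionary whose keys could serve as column names. It is an illustration in plain Python rather than the approach taken in the steps that follow; the function name `flatten` and the dotted key style are arbitrary choices.

```python
def flatten(record, parent_key="", sep="."):
    """Collapse nested dictionaries into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # recurse into nested dictionaries such as 'source' -> 'pages'
            flat.update(flatten(value, name, sep))
        else:
            # lists (for example the authors list) are kept as single values
            flat[name] = value
    return flat
```

Applied to a hit, a nested value such as `hit['source']['pages']['begin']` ends up under the key `source.pages.begin`. The pandas function `pd.json_normalize` offers a comparable flattening for lists of records.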

12. **Retrieve 50 or fewer records and place in a dataframe**

12a. We will continue with our most recent SEARCH_QUERY. First, let's retrieve the number of records this query found:

In [None]:
print(f"Our current search query: {SEARCH_QUERY}")
total_records = data['metadata']['total']
print(total_records)

12b. In a later step, we will work on retrieving all 306 records. For the moment, however, let's just request the first 50.

In [None]:
# retrieve the first page of results (50 records or fewer)
datadict = {}
initial_request = requests.get(
        f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}&limit=50', headers=HEADERS_ST)
data = initial_request.json()
datadict = data
print(f"Total number of records pulled: {len(datadict['hits'])}")
print(f"for the search query: {SEARCH_QUERY}")
uids = {hit['uid'] for hit in datadict['hits']}   # a set keeps each id only once
print(f"Total number of unique ids: {len(uids)}")
print(f"Number of requests remaining today: {initial_request.headers['X-RateLimit-Remaining-Day']}.")

12c. The code above retrieved the data ("initial_request"), parsed the JSON response into a Python dictionary ("data"), stored that dictionary under the name "datadict", and then built a set of the unique IDs found in it ("uids").

Let's examine the results:

In [None]:
list(uids)

In [None]:
datadict.keys()
#there are two keys in this dictionary: 'metadata' and 'hits'. 
## Let's examine the values stored with the 'hits' key more closely.

In [None]:
type(datadict['hits'])
# Information stored under the 'hits' key is stored in a list. 
# Let's examine how many items are in the list:

In [None]:
len(datadict['hits'])
#... and the results stored within the first item in the list:

In [None]:
datadict['hits'][0]

12d. Now that we have a better idea how data is returned by the WOS Starter API, we can create a function that converts the JSON results (which we have stored in a Python dictionary) into a dataframe.

In [None]:
def retrieve_all_data(datahits):
    """Convert a list of Starter API hits into a dataframe, one row per document."""
    datalist = []
    for hit in datahits:
        # `or {}` / `or []` guard against keys that are missing or set to None
        source = hit.get("source") or {}
        pages = source.get("pages") or {}
        authors = (hit.get("names") or {}).get("authors") or []
        identifiers = hit.get("identifiers") or {}
        citations = hit.get("citations") or []
        keywords = (hit.get("keywords") or {}).get("authorKeywords") or []
        links = hit.get("links") or {}
        row = {}
        row['uid'] = hit.get("uid", "")
        row['title'] = hit.get("title", "")
        row['authors'] = "; ".join([name.get('wosStandard', '') for name in authors])
        row['researcherIds'] = "; ".join([str(name.get('researcherId')) for name in authors])
        row['pubyear'] = source.get("publishYear")
        row['source_title'] = source.get("sourceTitle")
        row['volume'] = source.get("volume")
        row['page_start'] = pages.get("begin")
        row['page_end'] = pages.get("end")
        row['page_count'] = pages.get("count")
        row['doi'] = identifiers.get("doi")
        row['issn'] = identifiers.get("issn")
        row['eissn'] = identifiers.get("eissn")
        row['isbn'] = identifiers.get("isbn")
        # the citations key holds a list of dicts; use the count from the first entry, if any
        row['citation_counts'] = citations[0].get("count") if citations else None
        row['author_keywords'] = "; ".join([kw.lower() for kw in keywords])
        row['record_links'] = links.get("record")
        row['citing_links'] = links.get("citingArticles")
        row['reference_links'] = links.get("references")
        row['related_links'] = links.get("related")
        datalist.append(row)
    return pd.DataFrame(datalist)

12e. With this function we can now convert our dictionary ("datadict") into a dataframe.

In [None]:
df1 = retrieve_all_data(datadict['hits'])
df1.head()

12f. ...and export this dataframe into a csv.

In [None]:
# write results to csv
df1.to_csv(f"wos_results_{SEARCH_QUERY}_page1only.csv", encoding = 'utf-8')

13. **SEND MULTIPLE REQUESTS TO RETRIEVE 50+ RECORDS FROM A QUERY**:

However, our search query ("TS=humanized landscape") returned more than 50 results (306 in total).

When a query returns more than 50 records, we need to write an iterative loop that retrieves 50 records at a time.

13a. Create a for loop to retrieve 50 records at a time and assemble the result into one Python dictionary:

In [None]:
total_records = data['metadata']['total']
print(f"Our current search query: {SEARCH_QUERY} returned {total_records} records.")

requests_required = ((total_records - 1) // 50) + 1  #306 records - 1 = 305 // 50 = 6 + 1 = 7
print(requests_required)
datadict = {}
if requests_required > 1:
    print(f"API requests required to get all data from the query - '{SEARCH_QUERY}': {requests_required}")
for i in range(requests_required):
    subsequent_response = requests.get(
        f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}&limit=50&page={i+1}', headers=HEADERS_ST)
    data = subsequent_response.json()
    if i == 0:
        print(data['metadata'])
        datadict = data
    else:
        datadict['hits'].extend(data['hits'])
    print(f"**Pulling from Page {i+1} of {requests_required}**")
    time.sleep(0.25)  # brief pause to stay under the limit of 5 requests per second
print(f"Total number of records pulled: {len(datadict['hits'])}")
uids = set([hit['uid'] for hit in datadict['hits']])
print(f"Total number of unique ids: {len(uids)}")
print(f"Number of requests remaining today: {subsequent_response.headers['X-RateLimit-Remaining-Day']}.")    
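
The same paging logic can be wrapped in a reusable function. The sketch below is one possible arrangement, not the notebook's own method: `fetch_all_hits` is a hypothetical name, and it takes a caller-supplied `fetch_page` function (one that returns the parsed JSON for a given page number) so that the paging arithmetic can be exercised without contacting the API.

```python
def fetch_all_hits(fetch_page, per_page=50, max_requests=None):
    """Collect the hits from every page of a query.

    `fetch_page(page_number)` must return a dict with 'metadata' and 'hits'
    keys, like the parsed responses shown earlier in this notebook.
    """
    first = fetch_page(1)
    total = first["metadata"]["total"]
    hits = list(first["hits"])
    # ceiling division: 306 records at 50 per page -> 7 pages
    pages = (total + per_page - 1) // per_page
    if max_requests is not None:
        pages = min(pages, max_requests)  # optional cap to protect the daily quota
    for page in range(2, pages + 1):
        hits.extend(fetch_page(page)["hits"])
    return hits
```

In this notebook, `fetch_page` could be a short function that calls `requests.get(...)` with `&limit=50&page={page}` and returns `.json()`.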

13b. Next, we can transform the dictionary into a dataframe:

In [None]:
df_all = retrieve_all_data(datadict['hits'])
df_all.head()

13c. ... and export it as a csv:

In [None]:
df_all.to_csv(f"wos_results_{SEARCH_QUERY}_all.csv", encoding = 'utf-8')
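
Because the file name is built from the search query, characters such as `=`, `:`, or spaces end up in it, and some of these are not allowed in file names on every operating system. A minimal sketch of a safer naming helper follows; `safe_filename` is a hypothetical function, and replacing every non-alphanumeric run with an underscore is just one reasonable convention.

```python
import re


def safe_filename(query, suffix="all"):
    """Turn a search query into a portable csv file name."""
    # collapse anything that is not a letter or digit into a single underscore
    slug = re.sub(r"[^A-Za-z0-9]+", "_", query).strip("_")
    return f"wos_results_{slug}_{suffix}.csv"
```

For the current query this gives `wos_results_TS_humanized_landscape_all.csv`, which could be passed to `df_all.to_csv(...)` in place of the f-string above.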

That's it. We can review the number of requests we have remaining for today:

In [None]:
subsequent_response.headers['X-RateLimit-Remaining-Day']
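
The header value is returned as text, so it needs converting before it can be used in arithmetic. The sketch below shows one way to do that and to estimate how many more records could still be retrieved today. Both function names are hypothetical; the header name is the one used above, and the factor of 50 is the per-request maximum from the introduction.

```python
def requests_remaining(headers, header_name="X-RateLimit-Remaining-Day"):
    """Return today's remaining request count as an int, or None if unavailable."""
    try:
        return int(headers.get(header_name))
    except (TypeError, ValueError):
        # header missing or not a number
        return None


def records_still_retrievable(headers, per_request=50):
    """Estimate how many more records can be pulled today at `per_request` each."""
    remaining = requests_remaining(headers)
    return None if remaining is None else remaining * per_request
```

For example, `records_still_retrievable(subsequent_response.headers)` could be checked before starting a large multi-page download.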