# Intro to the Web of Science Starter API

This notebook demonstrates some standard means for accessing and retrieving data from the Web of Science (WoS) database using the WoS Starter API.

The Starter API allows users to automate searches of the WoS database, retrieving the following data:
+ author(s), editor(s), etc.
+ document title
+ source / publication title
+ document type
+ author keywords
+ document identifiers (ISSN, eISSN, ISBN, DOI, PubMed Id, Web of Science identifier [UT])
+ author /researcher identifiers
+ publication year
+ volume and issue
+ pages
+ times cited

*Note: The Starter API can also search the following fields, although their contents are not included in the results:*
+ Keywords Plus
+ Abstract

*Thus, you may search for all documents that use "nuclear disarmament" in their Keywords Plus or abstract, but the returned results will not include the Keywords Plus terms or the abstract themselves. If you want to examine how a particular text uses this term, you will have to visit the WoS record to review its abstract or use one of the other methods available to the Dartmouth community for accessing WoS, [detailed here](https://researchguides.dartmouth.edu/c.php?g=59725&p=9910244).*

### Limits on Starter API use

Institutional subscriptions to the Starter API (which Dartmouth Library has) allow researchers to:
+ place up to 5,000 requests per day
+ place up to 5 requests per second
+ retrieve up to 50 records per request

At 50 records per request, 5,000 requests allow a researcher to retrieve a maximum of 250,000 records (5,000 × 50) in a given day.
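
To stay under the per-second limit when sending many requests in a row, a script can pause between calls. The sketch below is one minimal way to do that. The `Throttle` class and its parameter names are illustrative assumptions and are not part of the Starter API or its client library.

```python
import time

class Throttle:
    """Space out calls so that no more than `max_per_second` happen each second."""

    def __init__(self, max_per_second=5, clock=time.monotonic, sleep=time.sleep):
        # Minimum number of seconds that must separate two consecutive calls
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect the minimum interval, then record the call time."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Creating one `Throttle()` and calling its `wait()` method immediately before each `requests.get(...)` should keep a loop at or below five requests per second.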

For more information on the WoS Starter and Lite APIs and on the different ways to access WoS data (particularly relevant for Dartmouth community members), please see the [Dartmouth Library *Accessing Web of Science Data* Guide](https://researchguides.dartmouth.edu/c.php?g=1324739&p=11200897&preview=f2c39380122f0fc7a1ebb71e8a4b55ea). For more on the database fields that may be searched and those that will be returned by the Starter API, see the [README on the Starter API client's GitHub page](https://github.com/clarivate/wosstarter_python_client/blob/master/README.md).

## I. Getting Started

1. **Import all necessary packages.** To first install these packages to a local environment, you can use the requirements.txt file. Open a terminal / command prompt within this project's folder and type:

```
pip install -r requirements.txt
```

Then you can import these packages.

In [137]:
import requests
import time
import os
from pathlib import Path
import urllib.parse
import pandas as pd
import random
from bs4 import BeautifulSoup   #for parsing xml and html
from random import randint  
from dotenv import load_dotenv 
load_dotenv()

True

2. Store the API base URL in memory and retrieve your API key. See this repository's [README.md](../../README.md) for more on acquiring, saving, and retrieving your API key from a .env file.

In [138]:
BASEURL_ST = 'https://api.clarivate.com/apis/wos-starter/v1/'
HEADERS_ST = {'X-APIKey': os.getenv("APIKEY")}
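
If the .env file is missing or the variable has a different name, `os.getenv("APIKEY")` returns `None`, and requests made with those headers should be expected to fail authentication. A small check, sketched below, surfaces that problem early. The function name `build_headers` is hypothetical; the variable name `APIKEY` and the `X-APIKey` header are the ones used in this notebook.

```python
import os

def build_headers(env=os.environ, var_name="APIKEY"):
    """Return the request headers, or raise a clear error if the key is not set."""
    api_key = env.get(var_name)
    if not api_key:
        raise RuntimeError(
            f"No API key found in the environment variable '{var_name}'. "
            "Check that your .env file exists and that load_dotenv() has run."
        )
    return {"X-APIKey": api_key}
```

With this helper, the headers line above could be written as `HEADERS_ST = build_headers()`.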

## II. Search Queries

### IIa. Retrieve documents by author

1. We can begin with a simple search query for one author. For example, we can search amongst the [suspiciously prolific publications by the Spanish scholar Rafael Luque](https://english.elpais.com/science-tech/2023-04-02/one-of-the-worlds-most-cited-scientists-rafael-luque-suspended-without-pay-for-13-years.html).

In [139]:
#SEARCH_QUERY = 'AU=Lepore, Jill' 
SEARCH_QUERY = "AU=Luque, Rafael" # Enter your search query here, in this case we are looking for an author named Rafael Luque
#SEARCH_QUERY = 'AU=Schnell, JD'  

#SEARCH_QUERY = "AI=IMS-5344-2023"
#SEARCH_QUERY = "AI=EFQ-9500-2022"
#SEARCH_QUERY = "UT=WOS:000188058500010"

2. **RETRIEVE DATA**: We can then insert the API url (BASEURL_ST), the search query, and the API Key (HEADERS_ST) into a requests command.

In [140]:
initial_request = requests.get(f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}', headers=HEADERS_ST)
data = initial_request.json()
data

{'metadata': {'total': 1066, 'page': 1, 'limit': 10},
 'hits': [{'uid': 'WOS:001591054100001',
   'title': 'Supported metal nanoparticles on porous materials. Methods and applications (Vol 38, pg 481, 2009)',
   'types': ['Correction', 'Early Access'],
   'sourceTypes': ['Correction', 'Early Access'],
   'source': {'sourceTitle': 'CHEMICAL SOCIETY REVIEWS',
    'publishYear': 2025,
    'pages': {'count': 2}},
   'names': {'authors': [{'displayName': 'White, Robin J.',
      'wosStandard': 'White, RJ',
      'researcherId': 'CFN-0008-2022'},
     {'displayName': 'Luque, Rafael',
      'wosStandard': 'Luque, R',
      'researcherId': 'F-9853-2010'},
     {'displayName': 'Budarin, Vitaliy L.',
      'wosStandard': 'Budarin, VL',
      'researcherId': 'LMY-8679-2024'},
     {'displayName': 'Clark, James H.',
      'wosStandard': 'Clark, JH',
      'researcherId': 'LRH-3540-2024'},
     {'displayName': 'Macquarrie, Duncan J.',
      'wosStandard': 'Macquarrie, DJ',
      'researcherId': 'DX
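
The request URL above is assembled by hand with an f-string. To see exactly what `urllib.parse.quote` does to the query, the sketch below builds the same URL without sending it. `build_search_url` is a hypothetical helper written for this illustration; the base URL and the `db`, `q`, `limit`, and `page` parameters are the ones used in this notebook.

```python
import urllib.parse

BASEURL_ST = "https://api.clarivate.com/apis/wos-starter/v1/"

def build_search_url(query, limit=None, page=None, base_url=BASEURL_ST):
    """Assemble a Starter API documents URL with a percent-encoded query."""
    url = f"{base_url}documents?db=WOS&q={urllib.parse.quote(query)}"
    if limit is not None:
        url += f"&limit={limit}"
    if page is not None:
        url += f"&page={page}"
    return url

print(build_search_url("AU=Luque, Rafael"))
# https://api.clarivate.com/apis/wos-starter/v1/documents?db=WOS&q=AU%3DLuque%2C%20Rafael
```

The `=`, `,`, and space characters in the query are replaced by `%3D`, `%2C`, and `%20`, so they cannot be confused with the `=` and `&` that separate the URL's own parameters.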

3. **RECORD LIMITS**: Although this query finds 1066 records, only 10 are returned due to the default limit of records. We can increase that to 50 using the `limit` parameter:

In [141]:
initial_request = requests.get(f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}&limit=50', headers=HEADERS_ST)
data = initial_request.json()

4. **VIEW SUMMARY INFORMATION FROM WOS REQUEST**: As you can see, `.json()` parses the JSON response into a Python dictionary, so we can retrieve specific information from it the same way we would from any dictionary. To retrieve only the summary metadata for the query, we can run the following:

In [142]:
data['metadata']  #returns total number of records 

{'total': 1066, 'page': 1, 'limit': 50}

5. **CONVERT TO DATAFRAME TO DISAMBIGUATE AUTHORS**: 

The results above return the work of multiple scholars with the name Rafael Luque or R. Luque. Instead we need to search for this particular scholar by using his ResearcherID.

We can filter the data already retrieved to disambiguate the different Rafael Luques. 

Let's first examine the data returned (under the name `data`).



In [143]:
def author_summary(jsondict):
    """
    This function reads in the json returned with a WoS search query
    and then converts that response into a dataframe of authors
    """
    auth_rows = []
    hits = [hit for hit in jsondict['hits']]
    print(f"Titles (and their authors) included in response to the query: '{SEARCH_QUERY}' - first {len(hits)} hits")
    for hit in hits:
        title = hit.get('title')
        year = hit.get('source', {}).get('publishYear')
        authors = hit.get('names', {}).get('authors')
        au_keywords = hit.get('keywords', {}).get('authorKeywords')
        #print(authors)
        for author in authors:
            # print("\nauthor = ", author)
            au_std_name = author.get('wosStandard')
            au_res_id = author.get('researcherId')
            auth_rows.append([title, year, au_keywords, au_std_name, au_res_id])
    return(auth_rows)    

rows = author_summary(data)

author_df = pd.DataFrame(rows, columns = ["title", "year", "author_keywords", "authorname_wos_std", "researcherid"])
print(author_df.shape)
author_df.head()


Titles (and their authors) included in response to the query: 'AU=Luque, Rafael' - first 50 hits
(284, 5)


Unnamed: 0,title,year,author_keywords,authorname_wos_std,researcherid
0,Supported metal nanoparticles on porous materi...,2025,[],"White, RJ",CFN-0008-2022
1,Supported metal nanoparticles on porous materi...,2025,[],"Luque, R",F-9853-2010
2,Supported metal nanoparticles on porous materi...,2025,[],"Budarin, VL",LMY-8679-2024
3,Supported metal nanoparticles on porous materi...,2025,[],"Clark, JH",LRH-3540-2024
4,Supported metal nanoparticles on porous materi...,2025,[],"Macquarrie, DJ",DXM-7702-2022


6. **GROUP BY AUTHOR ID TO CREATE A SUMMARY TABLE OF AUTHORS**: 

In [144]:
# groupby researcherid
from collections import Counter
#author_df.groupby(by = "researcherid")["author_keywords"].apply(list)

def flatten_extend(matrix: list):
    flat_list = []
    for row in matrix:
        if type(row) is list:
            flat_list.extend(row)
        else:
            flat_list.append(row)
    return flat_list

def most_common_keywords(kwlist: list):
    """Return the five most frequent keywords across a group of records."""
    flat_kwlist = flatten_extend(kwlist)
    kw_ctr = Counter(flat_kwlist)
    return kw_ctr.most_common(5)

# Use a new name for the summary table so that it does not overwrite
# the author_summary() function defined in the previous step
author_counts = author_df.groupby(by = "researcherid")\
    .agg({"authorname_wos_std": "first", "title": "count", "author_keywords": lambda x: most_common_keywords(x)})

author_counts = author_counts.rename(columns = {"title": "au_count"})

author_counts = author_counts.sort_values(by = "au_count", ascending=False)
author_summary_sub = author_counts[author_counts['authorname_wos_std'].str.startswith("Luque")]
author_summary_sub

researcherid,authorname_wos_std,au_count,author_keywords
F-9853-2010,"Luque, R",40,"[(heterogeneous catalysis, 8), (mesoporous mat..."
Q-8132-2018,"Luque, RM",4,"[(network security, 2), (data mining, 2), (Com..."
HVB-2615-2023,"Luque, R",2,"[(ecthyma, 1), (HIV, 1), (bacteremia, 1), (Str..."
IGB-3297-2023,"Luque, R",2,"[(Educational software, 1), (methodology, 1), ..."
DWC-6013-2022,"Luque, RJ",2,"[(bladder cancer, 1), (apoptosis, 1), (Ki-67, 1)]"


In [145]:
list(author_summary_sub.loc[ :, "author_keywords"])

[[('heterogeneous catalysis', 8),
  ('mesoporous materials', 4),
  ('Ti-MCM-41', 2),
  ('Organic functionalisation', 2),
  ('Microwaves', 2)],
 [('network security', 2),
  ('data mining', 2),
  ('Competitive learning', 1),
  ('intrusion detection system', 1),
  ('Multiagent system', 1)],
 [('ecthyma', 1),
  ('HIV', 1),
  ('bacteremia', 1),
  ('Streptococcus pyogenes', 1),
  ('Escherichia coli', 1)],
 [('Educational software', 1),
  ('methodology', 1),
  ('software development', 1)],
 [('bladder cancer', 1), ('apoptosis', 1), ('Ki-67', 1)]]
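
Once the summary shows which ResearcherID belongs to the author of interest, the same selection can be expressed in plain Python. The sketch below counts records per ResearcherID for names that start with a given surname. It operates on rows shaped like those built in step 5 (`[title, year, keywords, standard name, researcher id]`); the function name, the sample rows, and the placeholder IDs are invented for illustration.

```python
from collections import Counter

def researcher_id_counts(rows, surname):
    """Count rows per ResearcherID for authors whose standard name starts with `surname`."""
    counts = Counter()
    for title, year, keywords, std_name, researcher_id in rows:
        if std_name and std_name.startswith(surname) and researcher_id:
            counts[researcher_id] += 1
    # most_common() sorts from the most to the least frequent ID
    return counts.most_common()

# Invented sample rows, in the same column order as the author table above
sample_rows = [
    ["Paper A", 2020, [], "Luque, R", "ID-1"],
    ["Paper A", 2020, [], "Clark, JH", "ID-2"],
    ["Paper B", 2021, [], "Luque, R", "ID-1"],
    ["Paper C", 2019, [], "Luque, RM", "ID-3"],
]
print(researcher_id_counts(sample_rows, "Luque"))
# [('ID-1', 2), ('ID-3', 1)]
```

The first entry is the ID attached to the most records, which still needs a manual check against the keywords before it is used in a search.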

7. **SEARCH BY AUTHOR ID**: We can now refine our search using a **WoS ResearcherID** (the `AI` field tag) to focus on "R Luque" the chemist:

In [146]:
SEARCH_QUERY2 = 'AI=F-9853-2010'

initial_request = requests.get(f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY2)}&limit=50', headers=HEADERS_ST)
data2 = initial_request.json()


In [147]:
data2['metadata']  #returns total number of records 

{'total': 984, 'page': 1, 'limit': 50}

In [148]:
len(data2['hits'])

50

8. **RETRIEVE A LIST OF TITLES BY ONE AUTHOR**: To retrieve only summary information for each publication (title, source, year, and page range) we can run the following function:

In [149]:
def title_summary(jsondict):
    hits = [hit for hit in jsondict['hits']]
    print(f"Titles included in response to the query: '{SEARCH_QUERY2}'")
    for hit in hits:
        title = hit.get('title')
        so_title = hit.get('source', {}).get('sourceTitle')
        year = hit.get('source', {}).get('publishYear')
        pages = hit.get('source', {}).get('pages', {}).get('range', {})
        if isinstance(pages, str):
            pages = pages.split("-")
        print(f"{year}. '{title}', {so_title}: {pages}")

In [150]:
title_summary(data2)

Titles included in response to the query: 'AI=F-9853-2010'
2025. 'Supported metal nanoparticles on porous materials. Methods and applications (Vol 38, pg 481, 2009)', CHEMICAL SOCIETY REVIEWS: {}
1994. 'COTTARDS-SYMDROME - HISTORICAL AND CONCEPTUAL ASPECTS', ACTAS LUSO-ESPANOLAS DE NEUROLOGIA PSIQUIATRIA Y CIENCIAS AFINES: ['178', '188']
1995. 'THE CAMBRIDGE NEUROLOGICAL INVENTORY - A CLINICAL INSTRUMENT FOR ASSESSMENT OF SOFT NEUROLOGICAL SIGNS IN PSYCHIATRIC-PATIENTS', PSYCHIATRY RESEARCH: ['183', '204']
1996. 'Demographic and phenomenological features of a Spanish population of patients with late paraphrenia', INTERNATIONAL JOURNAL OF GERIATRIC PSYCHIATRY: ['745', '747']
2003. 'Effect of phosphate precursor and organic additives on the structural and catalytic properties of amorphous mesoporous AlPO<sub>4</sub> materials', CHEMISTRY OF MATERIALS: ['3352', '3364']
2005. 'Cyclohexene conversion and toluene methylation with dimethyl carbonate over Al-MCM-41 catalysts', MOLECULAR SIEVES
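
Page ranges come back as strings such as `'183-197'`, and some records carry placeholders (the `'75-&'` range that appears later in this notebook is one). A small parser, sketched below under the hypothetical name `parse_page_range`, makes the handling of such cases explicit instead of relying on `split` alone.

```python
def parse_page_range(page_range):
    """Split a 'begin-end' string into (begin, end); non-numeric parts become None."""
    if not isinstance(page_range, str) or "-" not in page_range:
        return (None, None)
    begin, _, end = page_range.partition("-")
    # Page labels that are not plain digits (for example '&') are reported as None
    begin = int(begin) if begin.isdigit() else None
    end = int(end) if end.isdigit() else None
    return (begin, end)

print(parse_page_range("183-197"))  # (183, 197)
print(parse_page_range("75-&"))     # (75, None)
```

Returning integers makes it possible to sort or subtract page numbers later, at the cost of discarding non-numeric page labels.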

### IIb. Other Search Fields

1. **SEARCHING THE WOS DATASET USING OTHER FIELDS**: We can search by a variety of fields besides author ("AU") and even combine searches by multiple fields.

Search fields for the Web of Science Starter API are listed in the [project's README page](https://github.com/clarivate/wosstarter_python_client) but are also copied here: 

| Field Tag | Description                                                                                                                                                 |
|-----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
| TI        | Title of document                                                                                                                                           |
| IS        | ISSN or ISBN                                                                                                                                                |
| SO        | Source title - The result contains all source titles within product database (for example, journal titles and/or book titles if the product includes books) |
| VL        | Volume                                                                                                                                                      |
| PG        | Page                                                                                                                                                        |
| CS        | Issue                                                                                                                                                       |
| PY        | Year Published                                                                                                                                              |
| AU        | Author                                                                                                                                                      |
| AI        | Author Identifier                                                                                                                                                      |
| UT        | Accession Number                                                                                                                                            |
| DO        | DOI                                                                                                                                                         |
| DT        | [Document Type](https://webofscience.help.clarivate.com/en-us/Content/document-types.html)                                                                                                                                                         |
| PMID      | PubMed ID                                                                                                                                                   |
| OG        | Search for preferred organization names and/or their name variants from the Preferred Organization Index. <p> A search on a preferred organization name returns all records that contain the preferred name and all records that contain its name variants. A search on a name variant returns all records that contain the variant. For example, Cornell Law Sch returns all records that contain Cornell Law Sch in the Addresses field. <p> When searching for organization names that contain a Boolean operator (AND, OR, NOT, NEAR, or SAME), always enclose the word in quotation marks ( " " ). For example: <p> - OG=(Japan Science "and" Technology Agency (JST)) <br> - OG=("Near" East Univ) <br> - OG=("OR" Hlth Sci Univ) |
| TS        | Searches for topic terms in the following fields within a document: <p> - Title <br> - Abstract <br> - Author keywords <br> - Keywords Plus |


Allowed tags are AI, AU, CS, DO, DT, IS, OG, PG, PMID, PY, SO, SUR, TI, TS, UT, VL
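
Field tags can be combined with Boolean operators, as the later examples in this notebook do with `AND`. As a rough sketch, a query string can be assembled from a dictionary of tags while checking each tag against the allowed list above. `build_query` is a hypothetical helper, not part of the API or its client library.

```python
ALLOWED_TAGS = {"AI", "AU", "CS", "DO", "DT", "IS", "OG", "PG", "PMID",
                "PY", "SO", "SUR", "TI", "TS", "UT", "VL"}

def build_query(criteria, operator="AND"):
    """Join {tag: value} pairs into one query string, rejecting unknown tags."""
    parts = []
    for tag, value in criteria.items():
        if tag not in ALLOWED_TAGS:
            raise ValueError(f"'{tag}' is not an allowed field tag")
        parts.append(f"{tag}={value}")
    return f" {operator} ".join(parts)

print(build_query({"IS": "0043-5597", "PY": "(2000-2010)", "DT": "Article"}))
# IS=0043-5597 AND PY=(2000-2010) AND DT=Article
```

Catching a mistyped tag locally avoids spending one of the day's requests on a query the API would reject.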


### IIc. Search by Publication

1. **SEARCH FOR A SPECIFIC JOURNAL**: You may also, for example, search by journal. We recommend looking up the journal's ISSN number rather than searching by journal title. You can look up a journal's ISSN using the [ISSN Portal](https://portal.issn.org/).

For example, we can search in ***The William and Mary Quarterly*** using its ISSN ("0043-5597"). This query identifies 6,218 results (although it only returns 10\*). 

*\*Note: we are keeping the 'limit' at 10 for this experimentation. But, if you want the API to retrieve the maximum allowed number of records (50), add "&limit=50" as we did in step 3 of Part IIa above. If we wanted to retrieve all 6,218 results, for example, we could do so using a loop that first retrieves records 1-50, then 51-100, and so on for 125 requests until it finishes. Part III below shows how to do so.*
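
The number of requests needed is the total number of records divided by the page size, rounded up. The sketch below works this out and shows which records a given page would cover. The function names are illustrative assumptions.

```python
import math

def pages_needed(total_records, page_size=50):
    """Number of requests required to retrieve every record."""
    return math.ceil(total_records / page_size)

def page_bounds(page, total_records, page_size=50):
    """First and last record numbers (1-based) covered by a given page."""
    first = (page - 1) * page_size + 1
    last = min(page * page_size, total_records)
    return first, last

total = 6218  # the 'total' value in the metadata returned for this ISSN query
print(pages_needed(total))       # 125
print(page_bounds(125, total))   # (6201, 6218)
```

The last page is usually shorter than the others, which is why `page_bounds` caps the final record number at the total.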

In [151]:
SEARCH_QUERY = "IS=0043-5597"
initial_request = requests.get(f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}', headers=HEADERS_ST)
initial_request.json()
 

{'metadata': {'total': 6218, 'page': 1, 'limit': 10},
 'hits': [{'uid': 'WOS:A1956CCF2500007',
   'title': 'ASTROLOGY IN COLONIAL AMERICA - AN EXTENDED QUERY',
   'types': ['Article'],
   'sourceTypes': ['Article'],
   'source': {'sourceTitle': 'WILLIAM AND MARY QUARTERLY',
    'publishYear': 1956,
    'volume': '13',
    'issue': '4',
    'pages': {'range': '551-563', 'begin': '551', 'end': '563', 'count': 13}},
   'names': {'authors': [{'displayName': 'STAHLMAN, WD',
      'wosStandard': 'STAHLMAN, WD',
      'researcherId': 'DYG-3714-2022'}]},
   'links': {'record': 'https://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=dartrds_jeremy_01&SrcAuth=WosAPI&KeyUT=WOS:A1956CCF2500007&DestLinkType=FullRecord&DestApp=WOS_CPL',
    'citingArticles': 'https://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=dartrds_jeremy_01&SrcAuth=WosAPI&KeyUT=WOS:A1956CCF2500007&DestLinkType=CitingArticles&DestApp=WOS_CPL',
    'references': 'https://www.webofscience.com/api/gateway?GWVersion=2&Sr

2. **FILTER BY YEAR(S)**: We may then narrow our search of this one journal by limiting it to a range of years:

In [152]:
SEARCH_QUERY = "IS=0043-5597 AND PY=(2000-2010)"
initial_request = requests.get(f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}', headers=HEADERS_ST)
initial_request.json()

{'metadata': {'total': 1054, 'page': 1, 'limit': 10},
 'hits': [{'uid': 'WOS:000084741500025',
   'title': 'The crisis of the standing order: Clerical intellectuals and cultural authority in Massachusetts, 1780-1833',
   'types': ['Review'],
   'sourceTypes': ['Book Review'],
   'source': {'sourceTitle': 'WILLIAM AND MARY QUARTERLY',
    'publishYear': 2000,
    'publishMonth': 'JAN',
    'volume': '57',
    'issue': '1',
    'pages': {'range': '244-247', 'begin': '244', 'end': '247', 'count': 4}},
   'names': {'authors': [{'displayName': 'Andrew, J',
      'wosStandard': 'Andrew, J',
      'researcherId': 'EKB-2906-2022'}]},
   'links': {'record': 'https://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=dartrds_jeremy_01&SrcAuth=WosAPI&KeyUT=WOS:000084741500025&DestLinkType=FullRecord&DestApp=WOS_CPL',
    'references': 'https://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=dartrds_jeremy_01&SrcAuth=WosAPI&KeyUT=WOS:000084741500025&DestLinkType=CitedReferences&DestApp=WOS',


3. **FILTER BY DOCUMENT TYPE**: We can filter by Document Type (DT) to retrieve only articles (rather than book reviews and other types) from this journal. As you can see, we have reduced the number of results to 222.

In [153]:
SEARCH_QUERY = "IS=0043-5597 AND PY=(2000-2010) AND DT=Article"
initial_request = requests.get(f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}', headers=HEADERS_ST)

initial_request.json()

{'metadata': {'total': 222, 'page': 1, 'limit': 10},
 'hits': [{'uid': 'WOS:000084741500009',
   'title': "Jefferson's rationalizations",
   'types': ['Article'],
   'sourceTypes': ['Article'],
   'source': {'sourceTitle': 'WILLIAM AND MARY QUARTERLY',
    'publishYear': 2000,
    'publishMonth': 'JAN',
    'volume': '57',
    'issue': '1',
    'pages': {'range': '183-197', 'begin': '183', 'end': '197', 'count': 15}},
   'names': {'authors': [{'displayName': 'Burstein, A',
      'wosStandard': 'Burstein, A',
      'researcherId': 'ERQ-1357-2022'}]},
   'links': {'record': 'https://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=dartrds_jeremy_01&SrcAuth=WosAPI&KeyUT=WOS:000084741500009&DestLinkType=FullRecord&DestApp=WOS_CPL',
    'citingArticles': 'https://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=dartrds_jeremy_01&SrcAuth=WosAPI&KeyUT=WOS:000084741500009&DestLinkType=CitingArticles&DestApp=WOS_CPL',
    'references': 'https://www.webofscience.com/api/gateway?GWVersion=2

### IId. Search by keyword

1. **TOPIC TERMS SEARCH**: We can also use the Starter API's Topic Terms (TS) field to search for keywords and terms found in a text's Title, Author Keywords, Keywords Plus, and Abstract fields. As noted in the introduction to this notebook, you can search through all of these fields, but the results include only the Title and Author Keywords; Keywords Plus and the Abstract are not returned. 

In the example below, we retrieve all works that have the term "humanized landscape" stored in these fields. 

In [154]:
#SEARCH_QUERY = "TS=(ecology AND humanized landscape)"
SEARCH_QUERY = "TS=humanized landscape"
initial_request = requests.get(f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}', headers=HEADERS_ST)
data = initial_request.json()
data['metadata']

{'total': 421, 'page': 1, 'limit': 10}

In [155]:
data

{'metadata': {'total': 421, 'page': 1, 'limit': 10},
 'hits': [{'uid': 'WOS:A1978FT34300005',
   'title': 'LANDSCAPED CAMPUS SITE HUMANIZES LIFE FOR ELDERLY',
   'types': ['Article'],
   'sourceTypes': ['Article'],
   'source': {'sourceTitle': 'HOSPITALS',
    'publishYear': 1978,
    'volume': '52',
    'issue': '20',
    'pages': {'range': '75-&', 'begin': '75', 'end': '&', 'count': 0}},
   'names': {'authors': [{'displayName': 'SMART, JD',
      'wosStandard': 'SMART, JD',
      'researcherId': 'FYX-7131-2022'}]},
   'links': {'record': 'https://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=dartrds_jeremy_01&SrcAuth=WosAPI&KeyUT=WOS:A1978FT34300005&DestLinkType=FullRecord&DestApp=WOS_CPL'},
   'citations': [{'db': 'WOS', 'count': 0}],
   'identifiers': {'issn': '0018-5973', 'pmid': '689636'},
   'keywords': {'authorKeywords': []}},
  {'uid': 'WOS:A1992JH71700003',
   'title': 'THE PRISTINE MYTH - THE LANDSCAPE OF THE AMERICA IN 1492',
   'types': ['Article'],
   'sourceTypes':

## III. Placing Retrieved Data in a Dataframe and saving to a csv

The Starter API returns query results in the format of [JSON data](https://www.w3schools.com/js/js_json_intro.asp), which is hierarchical in format. 

Often, however, researchers will want to place this data in a two-dimensional data table (in Python - these data tables are known as dataframes). This part shows how to transform this returned JSON data into a dataframe.

### IIIa. Retrieve 50 or fewer records and place in a dataframe

1. We will continue with our most recent SEARCH_QUERY. First, let's retrieve the number of records this query found:

In [156]:
print(f"Our current search query: {SEARCH_QUERY}")
total_records = data.get('metadata', {}).get('total')
print(total_records)

Our current search query: TS=humanized landscape
421


1b. In a later step, we will work on retrieving all of these records. For the moment, however, let's just request the first 50.

In [157]:
# 50 or fewer results
datadict = {}
initial_request = requests.get(
        f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}&limit=50', headers=HEADERS_ST)
data = initial_request.json()
datadict = data
print(f"Total number of records pulled: {len(datadict['hits'])}")
print(f"for the search query: {SEARCH_QUERY}")
uids = set([hit['uid'] for hit in datadict['hits']])
print(f"Total number of unique ids: {len(uids)}")
print(f"Number of requests remaining today: {initial_request.headers['X-RateLimit-Remaining-Day']}.")

Total number of records pulled: 50
for the search query: TS=humanized landscape
Total number of unique ids: 50
Number of requests remaining today: 4947.
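
The `X-RateLimit-Remaining-Day` header arrives as a string, and it may be absent on some responses (for example on error responses; this is an assumption, not documented behaviour). A defensive reader, sketched here under the hypothetical name `remaining_requests`, converts it to an integer when possible.

```python
def remaining_requests(headers, name="X-RateLimit-Remaining-Day"):
    """Return the remaining daily request count as an int, or None if unavailable."""
    value = headers.get(name)
    try:
        return int(value)
    except (TypeError, ValueError):
        # Header missing (None) or not a number
        return None

print(remaining_requests({"X-RateLimit-Remaining-Day": "4947"}))  # 4947
```

In the notebook it would be called as `remaining_requests(initial_request.headers)`, and the result can be compared with the number of requests a large job is about to make.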


1c. The code above retrieved the data ("initial_request"), parsed the JSON response into a Python dictionary ("data"), stored that dictionary under the name "datadict", and then built a set of the unique IDs found in it ("uids").

Let's examine the results:

In [158]:
list(uids)

['WOS:A1997WD25200006',
 'WOS:000297917200005',
 'WOS:000239166900006',
 'WOS:000169881400010',
 'WOS:000081973600046',
 'WOS:000280915400009',
 'WOS:000274206500014',
 'WOS:000231043700001',
 'WOS:000275947300124',
 'WOS:000086006700005',
 'WOS:000075020300018',
 'WOS:000270600800009',
 'WOS:000291370500030',
 'WOS:000280784600010',
 'WOS:000436985900009',
 'WOS:000231927600005',
 'WOS:000260575400005',
 'WOS:000280633800009',
 'WOS:000165903400002',
 'WOS:000075608300011',
 'WOS:000243899200008',
 'WOS:A1978FT34300005',
 'WOS:000256236000011',
 'WOS:A1996UK81200005',
 'WOS:000286419400006',
 'WOS:000287956600016',
 'WOS:000300167300002',
 'WOS:000090103900001',
 'WOS:000075183900005',
 'WOS:000296955600003',
 'WOS:000232979900001',
 'WOS:000286541400011',
 'WOS:000186870800010',
 'WOS:000260514400014',
 'WOS:000086292600015',
 'WOS:000257297800007',
 'WOS:000394684200028',
 'WOS:A1992JH71700003',
 'WOS:000245219600015',
 'WOS:000242239000002',
 'WOS:000081123600003',
 'WOS:0002642611

1d. Now we can create a function that converts the JSON results (which we have stored in a Python dictionary) into a dataframe.

In [159]:
def retrieve_all_data(datahits):
    """Flatten a list of WoS Starter API hits into a dataframe, one row per record."""
    datalist = []
    for hit in datahits:
        # Fall back to empty containers so that records with missing sections do not raise errors
        authors = (hit.get("names") or {}).get("authors") or []
        source = hit.get("source") or {}
        pages = source.get("pages") or {}
        identifiers = hit.get("identifiers") or {}
        citations = hit.get("citations") or []
        keywords = (hit.get("keywords") or {}).get("authorKeywords") or []
        links = hit.get("links") or {}

        datadict = {}
        datadict['uid'] = hit.get("uid", "")
        datadict['title'] = hit.get("title", "")
        datadict['authors'] = "; ".join([str(name.get('wosStandard')) for name in authors])
        datadict['researcherIds'] = "; ".join([str(name.get('researcherId')) for name in authors])
        datadict['pubyear'] = source.get("publishYear")
        datadict['source_title'] = source.get("sourceTitle")
        datadict['volume'] = source.get("volume")
        datadict['page_start'] = pages.get("begin")
        datadict['page_end'] = pages.get("end")
        datadict['page_count'] = pages.get("count")
        datadict['doi'] = identifiers.get("doi")
        datadict['issn'] = identifiers.get("issn")
        datadict['eissn'] = identifiers.get("eissn")
        datadict['isbn'] = identifiers.get("isbn")
        # 'citations' is a list of dicts, each with a 'db' and a 'count'; use the first entry if there is one
        datadict["citation_counts"] = citations[0].get("count") if citations else None
        datadict['author_keywords'] = "; ".join([kw.lower() for kw in keywords])
        datadict['record_links'] = links.get("record")
        datadict['citing_links'] = links.get("citingArticles")
        datadict['reference_links'] = links.get("references")
        datadict['related_links'] = links.get("related")
        datalist.append(datadict)
    return pd.DataFrame(datalist)
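
Most of the function above consists of nested lookups with fallbacks. One way to keep such lookups short and tolerant of missing keys is a small helper like the hypothetical `dig` below. The sample record is abridged from the output shown earlier in this notebook.

```python
def dig(record, *keys, default=None):
    """Follow `keys` through nested dictionaries, returning `default` if any step is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# Abridged from a record shown earlier in this notebook
hit = {
    "uid": "WOS:A1978FT34300005",
    "source": {"sourceTitle": "HOSPITALS", "pages": {"begin": "75", "end": "&"}},
}
print(dig(hit, "source", "pages", "begin"))  # 75
print(dig(hit, "identifiers", "doi"))        # None
```

A helper like this is a design choice rather than a requirement: chained `.get(..., {})` calls work equally well as long as every level supplies a default.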

1e. With this function we can now convert our dictionary ("datadict") into a dataframe.

In [160]:
df1 = retrieve_all_data(datadict['hits'])
df1.head()

Unnamed: 0,uid,title,authors,researcherIds,pubyear,source_title,volume,page_start,page_end,page_count,doi,issn,eissn,isbn,citation_counts,author_keywords,record_links,citing_links,reference_links,related_links
0,WOS:A1978FT34300005,LANDSCAPED CAMPUS SITE HUMANIZES LIFE FOR ELDERLY,"SMART, JD",FYX-7131-2022,1978,HOSPITALS,52,75,&,0,,0018-5973,,,0,,https://www.webofscience.com/api/gateway?GWVer...,,,
1,WOS:A1992JH71700003,THE PRISTINE MYTH - THE LANDSCAPE OF THE AMERI...,"DENEVAN, WM",ESD-1505-2022,1992,ANNALS OF THE ASSOCIATION OF AMERICAN GEOGRAPHERS,82,369,385,17,10.1111/j.1467-8306.1992.tb01965.x,0004-5608,1467-8306,,931,"pristine myth; 1492; columbus, native american...",https://www.webofscience.com/api/gateway?GWVer...,https://www.webofscience.com/api/gateway?GWVer...,https://www.webofscience.com/api/gateway?GWVer...,https://www.webofscience.com/api/gateway?GWVer...
2,WOS:A1996UK81200005,Landscapes without peasants,"Prado, P",IHD-6041-2023,1996,HOMME,36,111,120,18,,0439-4216,,,0,,https://www.webofscience.com/api/gateway?GWVer...,,https://www.webofscience.com/api/gateway?GWVer...,https://www.webofscience.com/api/gateway?GWVer...
3,WOS:A1997WD25200006,Incremental agroforestry: Enriching Pacific la...,"Clarke, WC; Thaman, RR",HWS-5742-2023; EAS-9162-2022,1997,CONTEMPORARY PACIFIC,9,121,148,28,,1043-898X,1527-9464,,12,,https://www.webofscience.com/api/gateway?GWVer...,https://www.webofscience.com/api/gateway?GWVer...,https://www.webofscience.com/api/gateway?GWVer...,https://www.webofscience.com/api/gateway?GWVer...
4,WOS:000075020300018,Autonomous development. Humanizing the landsca...,Savyasachi,HWM-7416-2023,1998,CONTRIBUTIONS TO INDIAN SOCIOLOGY,32,140,140,1,10.1177/006996679803200117,0069-9659,0973-0648,,0,,https://www.webofscience.com/api/gateway?GWVer...,,https://www.webofscience.com/api/gateway?GWVer...,https://www.webofscience.com/api/gateway?GWVer...


1f. ...and export this dataframe into a csv.

In [161]:
# write results to csv
df1.to_csv(f"wos_results_{SEARCH_QUERY}_page1only.csv", encoding = 'utf-8')
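
Because the search query is embedded in the file name, queries containing characters such as `:`, `"`, or `/` can produce file names that some operating systems reject. The sketch below shows one way to derive a safer name. `safe_filename` is a hypothetical helper, and the naming pattern simply mirrors the one used above.

```python
import re

def safe_filename(query, suffix="page1only", extension="csv"):
    """Build a file name from a query, replacing anything other than letters, digits, '-' and '_'."""
    slug = re.sub(r"[^A-Za-z0-9_-]+", "_", query).strip("_")
    return f"wos_results_{slug}_{suffix}.{extension}"

print(safe_filename("TS=humanized landscape"))
# wos_results_TS_humanized_landscape_page1only.csv
```

The result can be passed to `to_csv` in place of the f-string file name.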

### IIIb. SEND MULTIPLE REQUESTS TO RETRIEVE 50+ RECORDS FROM A QUERY

However, our search query ("TS=humanized landscape") returned more than 50 results (421 in total).

When a query returns more than 50 records, we need to write an iterative loop that retrieves 50 records at a time.

1a. **RETRIEVE RESULTS AND PLACE IN A DICTIONARY**: Create a for loop to retrieve 50 records at a time and assemble the result into one Python dictionary:

In [162]:
total_records = data['metadata']['total']
print(f"Our current search query: {SEARCH_QUERY} returned {total_records} records.")

requests_required = ((total_records - 1) // 50) + 1  # e.g. 421 records: (421 - 1) // 50 = 8, then 8 + 1 = 9 requests
print(requests_required)
datadict = {}
if requests_required > 1:
    print(f"API requests required to get all data from the query - '{SEARCH_QUERY}': {requests_required}")
for i in range(requests_required):
    subsequent_response = requests.get(
        f'{BASEURL_ST}documents?db=WOS&q={urllib.parse.quote(SEARCH_QUERY)}&limit=50&page={i+1}', headers=HEADERS_ST)
    data = subsequent_response.json()
    if i == 0:
        print(data['metadata'])
        datadict = data
    else:
        datadict['hits'].extend(data['hits'])
    print(f"**Pulling from Page {i+1} of {requests_required}**")
print(f"Total number of records pulled: {len(datadict['hits'])}")
uids = set([hit['uid'] for hit in datadict['hits']])
print(f"Total number of unique ids: {len(uids)}")
print(f"Number of requests remaining today: {subsequent_response.headers['X-RateLimit-Remaining-Day']}.")    

Our current search query: TS=humanized landscape returned 421 records.
9
API requests required to get all data from the query - 'TS=humanized landscape': 9
{'total': 421, 'page': 1, 'limit': 50}
**Pulling from Page 1 of 9**
**Pulling from Page 2 of 9**
**Pulling from Page 3 of 9**
**Pulling from Page 4 of 9**
**Pulling from Page 5 of 9**
**Pulling from Page 6 of 9**
**Pulling from Page 7 of 9**
**Pulling from Page 8 of 9**
**Pulling from Page 9 of 9**
Total number of records pulled: 421
Total number of unique ids: 421
Number of requests remaining today: 4938.
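
The loop above can also be wrapped in a function that stops early if a page comes back empty. In the sketch below, the HTTP call is passed in as a function (`get_page`, a hypothetical name) so that the paging logic can be tried without contacting the API. In the notebook, that function would wrap the `requests.get` call shown above and return `response.json()`.

```python
def fetch_all_hits(get_page, page_size=50):
    """Collect hits from every page, using the total reported on the first page."""
    first = get_page(1)
    hits = list(first.get("hits", []))
    total = first.get("metadata", {}).get("total", 0)
    pages = ((total - 1) // page_size) + 1 if total > 0 else 0
    for page in range(2, pages + 1):
        page_hits = get_page(page).get("hits", [])
        if not page_hits:  # stop if a page unexpectedly comes back empty
            break
        hits.extend(page_hits)
    return hits

# Stand-in for the API: 120 fake records served 50 at a time
fake_records = [{"uid": f"FAKE:{n}"} for n in range(120)]

def fake_get_page(page, page_size=50):
    start = (page - 1) * page_size
    return {"metadata": {"total": len(fake_records), "page": page, "limit": page_size},
            "hits": fake_records[start:start + page_size]}

print(len(fetch_all_hits(fake_get_page)))  # 120
```

Separating the paging logic from the HTTP call also makes it easier to add pauses or retries in one place.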


1b. **TRANSFORM DICTIONARY INTO A DATAFRAME**: Next, we can transform the dictionary into a dataframe:

In [163]:
df_all = retrieve_all_data(datadict['hits'])
df_all.head()

Unnamed: 0,uid,title,authors,researcherIds,pubyear,source_title,volume,page_start,page_end,page_count,doi,issn,eissn,isbn,citation_counts,author_keywords,record_links,citing_links,reference_links,related_links
0,WOS:A1978FT34300005,LANDSCAPED CAMPUS SITE HUMANIZES LIFE FOR ELDERLY,"SMART, JD",FYX-7131-2022,1978,HOSPITALS,52,75,&,0,,0018-5973,,,0,,https://www.webofscience.com/api/gateway?GWVer...,,,
1,WOS:A1992JH71700003,THE PRISTINE MYTH - THE LANDSCAPE OF THE AMERI...,"DENEVAN, WM",ESD-1505-2022,1992,ANNALS OF THE ASSOCIATION OF AMERICAN GEOGRAPHERS,82,369,385,17,10.1111/j.1467-8306.1992.tb01965.x,0004-5608,1467-8306,,931,"pristine myth; 1492; columbus, native american...",https://www.webofscience.com/api/gateway?GWVer...,https://www.webofscience.com/api/gateway?GWVer...,https://www.webofscience.com/api/gateway?GWVer...,https://www.webofscience.com/api/gateway?GWVer...
2,WOS:A1996UK81200005,Landscapes without peasants,"Prado, P",IHD-6041-2023,1996,HOMME,36,111,120,18,,0439-4216,,,0,,https://www.webofscience.com/api/gateway?GWVer...,,https://www.webofscience.com/api/gateway?GWVer...,https://www.webofscience.com/api/gateway?GWVer...
3,WOS:A1997WD25200006,Incremental agroforestry: Enriching Pacific la...,"Clarke, WC; Thaman, RR",HWS-5742-2023; EAS-9162-2022,1997,CONTEMPORARY PACIFIC,9,121,148,28,,1043-898X,1527-9464,,12,,https://www.webofscience.com/api/gateway?GWVer...,https://www.webofscience.com/api/gateway?GWVer...,https://www.webofscience.com/api/gateway?GWVer...,https://www.webofscience.com/api/gateway?GWVer...
4,WOS:000075020300018,Autonomous development. Humanizing the landsca...,Savyasachi,HWM-7416-2023,1998,CONTRIBUTIONS TO INDIAN SOCIOLOGY,32,140,140,1,10.1177/006996679803200117,0069-9659,0973-0648,,0,,https://www.webofscience.com/api/gateway?GWVer...,,https://www.webofscience.com/api/gateway?GWVer...,https://www.webofscience.com/api/gateway?GWVer...


1c. **EXPORT THE RESULTS DATAFRAME INTO A CSV**:

In [164]:
df_all.to_csv(f"wos_results_{SEARCH_QUERY}_all.csv", encoding = 'utf-8')

That's it. We can review the number of requests we have remaining for today:

In [165]:
subsequent_response.headers['X-RateLimit-Remaining-Day']

'4938'