# Document Search and Retrieval

In [3]:
import pybliometrics
from pybliometrics.scopus import ScopusSearch
pybliometrics.scopus.init()
#from pybliometrics.scopus.utils import config

## Writing a Scopus Query

You can search for documents using author name (as we did in notebook "02") or by a variety of other fields (affiliation, publication name, date range, etc.).

Often, however, researchers want to search for a term or set of terms in a particular field or across several fields. Some example single-field queries are below:

* `ABS(dopamine)` returns documents where "dopamine" is in the document abstract.
* `AUTHKEY(stroke)` returns documents where "stroke" is an author keyword.
* `CHEMNAME(oxidopamine)` returns documents with "oxidopamine" in the chemical name field.
* `FUND-ACR(NASA)` returns documents with NASA mentioned as the sponsor acronym in the acknowledgements section of the article. 
* `LANGUAGE(french)` returns documents originally written in French.
* `OPENACCESS(1)` returns Open Access content indexed in Scopus.
* `OPENACCESS(0)` returns subscription-based content indexed in Scopus.
* `PUBYEAR > 1994` returns documents with a publication year after 1994.
* `PUBYEAR < 1994` returns documents with a publication year before 1994.
* `PUBYEAR = 1994` returns documents with a publication year of 1994. 
* `REF(darwin 1859)` returns documents where the search terms "darwin" and "1859" occur in the same reference.
* `SRCTYPE(j)` returns documents from journal sources.
* `TITLE("neuropsychological evidence")` returns documents with the phrase "neuropsychological evidence" in their title.

If you want to search across multiple fields, you can also use the following combined fields:
* `ALL("heart attack")` returns documents with "heart attack" in any field.
* `KEY(oscillator)` returns documents where "oscillator" is a keyword.
    * searches the AUTHKEY, INDEXTERMS, TRADENAME, and CHEMNAME fields
* `TITLE-ABS-KEY("heart attack")` returns documents with "heart attack" in their abstracts, article titles, or keyword fields.
* `TITLE-ABS-KEY-AUTH(heart attack)` returns documents with "heart attack" in their abstracts, article titles, keywords, or author name fields.

For more on search queries within Scopus, see its [Search Tips page](https://dev.elsevier.com/sc_search_tips.html).
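Field clauses like the ones listed above can be joined with `AND`, as the filtered searches later in this notebook do. The sketch below builds such a query string from its parts. The helpers `field_clause` and `build_query` are our own illustrative names and are not part of pybliometrics; they only assemble a string that you would then pass to `ScopusSearch`.

```python
def field_clause(field, term, phrase=False):
    """Return one Scopus field clause, e.g. TITLE-ABS-KEY(bibliometrics)."""
    if phrase:
        # Wrap the term in double quotes, as in the quoted examples above
        term = f'"{term}"'
    return f"{field}({term})"


def build_query(clauses, pubyear_after=None):
    """Join clauses with AND, optionally prepending a PUBYEAR lower bound."""
    parts = list(clauses)
    if pubyear_after is not None:
        parts.insert(0, f"PUBYEAR > {pubyear_after}")
    return " AND ".join(parts)


query = build_query(
    [field_clause("TITLE-ABS-KEY", "bibliometrics scopus")],
    pubyear_after=2022,
)
print(query)
# PUBYEAR > 2022 AND TITLE-ABS-KEY(bibliometrics scopus)
```

Building the string in one place makes it easier to try several variations of a query without retyping the field codes.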

In [4]:
# A simple search.
# This takes only seconds on the Scopus web platform, but through the API
# it takes roughly 30-60 seconds for every 1000 records.
query = "TITLE-ABS-KEY(bibliometrics)"

# To speed things up, we recommend adding the parameter download=False
# to your first search. Once you are sure you are getting the type and
# number of results you want, you can remove this parameter.

# results
s = ScopusSearch(query, download=False)
print(s)


Search 'TITLE-ABS-KEY(bibliometrics)' yielded 39,248 documents as of 2025-10-29, which have not been downloaded


## Task

The above search yields tens of thousands of records. Unfortunately, the Scopus API is slow to retrieve this many records. While you are testing your code, it therefore makes sense to retrieve a smaller subset of results. You can always go back and request a larger number of results once you know your code is working well.
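To put a rough number on "slow", we can apply the 30-60 seconds per 1,000 records quoted in the code comment above to the result count. This is only back-of-the-envelope arithmetic: the function name `estimate_download_seconds` is our own, and actual speed will depend on your connection and API quota.

```python
def estimate_download_seconds(n_records, sec_per_1000=(30, 60)):
    """Return a (low, high) estimate in seconds for downloading n_records."""
    low_rate, high_rate = sec_per_1000
    return (n_records / 1000 * low_rate, n_records / 1000 * high_rate)


# Result count reported by the bibliometrics search above
low, high = estimate_download_seconds(39_248)
print(f"{low / 60:.0f} to {high / 60:.0f} minutes")
# 20 to 39 minutes
```

The timings noted in later cells (1,366 records in 1m 6s and 467 records in 27s) both fall inside the range this estimate gives for those counts.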

Below, we will write some code that narrows the search results to only those publications published since 2023 that also use the term "scopus" in one of their key fields. We will then place the results in a dataframe.

In [5]:
# searching for multiple terms

query = "TITLE-ABS-KEY(bibliometrics scopus)"
s2 = ScopusSearch(query, download=False)
print(s2)

Search 'TITLE-ABS-KEY(bibliometrics scopus)' yielded 6,886 documents as of 2025-10-29, which have not been downloaded


In [21]:
s2.results

In [27]:
# searching and filtering

query = "PUBYEAR > 2022 AND TITLE-ABS-KEY(bibliometrics scopus)"
# download=True is the default; the records must be downloaded before their EIDs can be listed
s3 = ScopusSearch(query, download=True)
print(s3)
# 1366 results in 1m 6s when download=True


Search 'PUBYEAR > 2022 AND TITLE-ABS-KEY(bibliometrics scopus)' yielded 1,366 documents as of 2024-04-15:
    2-s2.0-85187537046
    2-s2.0-85185243349
    2-s2.0-85182643737
    2-s2.0-85189041916
    2-s2.0-85188636339
    2-s2.0-85188013593
    2-s2.0-85186720130
    2-s2.0-85185567160
    2-s2.0-85185180408
    2-s2.0-85182361252
    2-s2.0-85179489401
    2-s2.0-85178248788
    2-s2.0-85189754170
    2-s2.0-85189566084
    2-s2.0-85189558401
    2-s2.0-85189452146
    2-s2.0-85189288345
    2-s2.0-85187780984
    2-s2.0-85186975954
    2-s2.0-85185835560
    2-s2.0-85182917255
    2-s2.0-85188676726
    2-s2.0-85188671888
    2-s2.0-85187715391
    2-s2.0-85186687338
    2-s2.0-85125960781
    2-s2.0-85189161372
    2-s2.0-85189109028
    2-s2.0-85188719728
    2-s2.0-85188701330
    2-s2.0-85188242194
    2-s2.0-85188210306
    2-s2.0-85187204502
    2-s2.0-85186648051
    2-s2.0-85186494507
    2-s2.0-85184390401
    2-s2.0-85184034789
    2-s2.0-85183964622
    2-s2.0-851836610

In [8]:
s3.get_eids()

['2-s2.0-85187537046',
 '2-s2.0-85185243349',
 '2-s2.0-85182643737',
 '2-s2.0-85189041916',
 '2-s2.0-85188636339',
 '2-s2.0-85188013593',
 '2-s2.0-85186720130',
 '2-s2.0-85185567160',
 '2-s2.0-85185180408',
 '2-s2.0-85182361252',
 '2-s2.0-85179489401',
 '2-s2.0-85178248788',
 '2-s2.0-85189754170',
 '2-s2.0-85189566084',
 '2-s2.0-85189558401',
 '2-s2.0-85189452146',
 '2-s2.0-85189288345',
 '2-s2.0-85187780984',
 '2-s2.0-85186975954',
 '2-s2.0-85185835560',
 '2-s2.0-85182917255',
 '2-s2.0-85188676726',
 '2-s2.0-85188671888',
 '2-s2.0-85187715391',
 '2-s2.0-85186687338',
 '2-s2.0-85125960781',
 '2-s2.0-85189161372',
 '2-s2.0-85189109028',
 '2-s2.0-85188719728',
 '2-s2.0-85188701330',
 '2-s2.0-85188242194',
 '2-s2.0-85188210306',
 '2-s2.0-85187204502',
 '2-s2.0-85186648051',
 '2-s2.0-85186494507',
 '2-s2.0-85184390401',
 '2-s2.0-85184034789',
 '2-s2.0-85183964622',
 '2-s2.0-85183661016',
 '2-s2.0-85183622536',
 '2-s2.0-85180268544',
 '2-s2.0-85175452777',
 '2-s2.0-85173544436',
 '2-s2.0-85

In [9]:
query = "PUBYEAR > 2022 AND KEY(bibliometrics scopus)"
s4 = ScopusSearch(query)
print(s4)
# 467 results in 27s

Search 'PUBYEAR > 2022 AND KEY(bibliometrics scopus)' yielded 467 documents as of 2024-04-15:
    2-s2.0-85182643737
    2-s2.0-85189041916
    2-s2.0-85185180408
    2-s2.0-85182361252
    2-s2.0-85179489401
    2-s2.0-85189288345
    2-s2.0-85186975954
    2-s2.0-85182917255
    2-s2.0-85125960781
    2-s2.0-85189161372
    2-s2.0-85187204502
    2-s2.0-85186648051
    2-s2.0-85186494507
    2-s2.0-85183661016
    2-s2.0-85180268544
    2-s2.0-85153496541
    2-s2.0-85181115488
    2-s2.0-85187644844
    2-s2.0-85141517452
    2-s2.0-85187674682
    2-s2.0-85187507366
    2-s2.0-85186065231
    2-s2.0-85180410383
    2-s2.0-85171581202
    2-s2.0-85169846143
    2-s2.0-85168293931
    2-s2.0-85144201081
    2-s2.0-85185707760
    2-s2.0-85185557764
    2-s2.0-85184864918
    2-s2.0-85184505525
    2-s2.0-85184284667
    2-s2.0-85184068878
    2-s2.0-85181884769
    2-s2.0-85181804203
    2-s2.0-85175567362
    2-s2.0-85174552634
    2-s2.0-85141971904
    2-s2.0-85182501153
    2-s2.

In [10]:
import pandas as pd

df = pd.DataFrame(s4.results)
print(df.shape)
df.head()

(467, 36)


| | eid | doi | pii | pubmed_id | title | subtype | subtypeDescription | creator | afid | affilname | ... | pageRange | description | authkeywords | citedby_count | openaccess | freetoread | freetoreadLabel | fund_acr | fund_no | fund_sponsor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2-s2.0-85182643737 | 10.1007/s10238-023-01254-3 | | 38252392.0 | Global landscape of COVID-19 research: a visua... | ar | Article | Zyoud S.H. | 60072733 | An-Najah National University | ... | | The emergence of COVID-19 in 2019 has resulted... | Bibliometric \| COVID-19 \| Randomized controlle... | 0 | 1 | all publisherhybridgold | All Open Access Hybrid Gold | NIH | | National Institutes of Health |
| 1 | 2-s2.0-85189041916 | 10.1016/j.radphyschem.2024.111705 | S0969806X2400197X | | Science mapping of the development of scintill... | re | Review | Ardiansyah A. | 60107208;60078536;60069390;60021097 | Sunway University;King Saud bin Abdulaziz Univ... | ... | | This paper presents a comprehensive bibliometr... | Bibliometric analysis \| Neutron detection \| Ne... | 0 | 0 | | | | | |
| 2 | 2-s2.0-85185180408 | 10.1016/j.mex.2024.102601 | S2215016124000554 | | Using Scopus and OpenAlex APIs to retrieve bib... | ar | Article | Harder R. | 60027152 | Sveriges lantbruksuniversitet | ... | | Evidence synthesis methodologies rely on bibli... | Bibliographic analysis \| Bibliometric analysis... | 0 | 1 | | | | 20200021 | Familjen Kamprads Stiftelse |
| 3 | 2-s2.0-85182361252 | 10.1016/j.mex.2023.102484 | S2215016123004806 | | Combining methods to conduct a systematic revi... | ar | Article | Eyzaguirre I.A. . | 60001890;130710886;129976216 | Universidade Federal do Pará;Sarambui Civil So... | ... | | This study aims to present a combination of me... | Bibliometric \| Conceptual framework \| Mangrove... | 0 | 1 | all repository repositoryvor | All Open Access Green | CAPES | 36919-2 | Coordenação de Aperfeiçoamento de Pessoal de N... |
| 4 | 2-s2.0-85179489401 | 10.1016/j.inat.2023.101879 | S2214751923001627 | | Landscape of epilepsy research: Analysis and f... | re | Review | Sharma M. | 60113205;60093233 | Chitkara University, Punjab;Mody University of... | ... | | Epilepsy is a neurological condition character... | Bibliometrix analysis \| Epilepsy \| Seizure \| T... | 0 | 1 | all publisherfullgold | All Open Access Gold | CUSAT | | Cochin University of Science and Technology |


In [11]:
df.columns

Index(['eid', 'doi', 'pii', 'pubmed_id', 'title', 'subtype',
       'subtypeDescription', 'creator', 'afid', 'affilname',
       'affiliation_city', 'affiliation_country', 'author_count',
       'author_names', 'author_ids', 'author_afids', 'coverDate',
       'coverDisplayDate', 'publicationName', 'issn', 'source_id', 'eIssn',
       'aggregationType', 'volume', 'issueIdentifier', 'article_number',
       'pageRange', 'description', 'authkeywords', 'citedby_count',
       'openaccess', 'freetoread', 'freetoreadLabel', 'fund_acr', 'fund_no',
       'fund_sponsor'],
      dtype='object')

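We can then use the **.get_eids()** method to retrieve the EID of each document that matches the search query above. The list comprehension in the next cell then strips the `2-s2.0-` prefix, leaving only the numeric part of each EID.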

In [12]:
eids = s4.get_eids()
eids = [eid.split("-")[-1] for eid in eids]
eids

['85182643737',
 '85189041916',
 '85185180408',
 '85182361252',
 '85179489401',
 '85189288345',
 '85186975954',
 '85182917255',
 '85125960781',
 '85189161372',
 '85187204502',
 '85186648051',
 '85186494507',
 '85183661016',
 '85180268544',
 '85153496541',
 '85181115488',
 '85187644844',
 '85141517452',
 '85187674682',
 '85187507366',
 '85186065231',
 '85180410383',
 '85171581202',
 '85169846143',
 '85168293931',
 '85144201081',
 '85185707760',
 '85185557764',
 '85184864918',
 '85184505525',
 '85184284667',
 '85184068878',
 '85181884769',
 '85181804203',
 '85175567362',
 '85174552634',
 '85141971904',
 '85182501153',
 '85150753624',
 '85189559644',
 '85189507225',
 '85189149342',
 '85188943940',
 '85188627067',
 '85188615762',
 '85188446096',
 '85188442116',
 '85188429994',
 '85188308077',
 '85188045497',
 '85188027618',
 '85187801911',
 '85187664644',
 '85187486974',
 '85186950349',
 '85186920950',
 '85186722031',
 '85186654415',
 '85186199259',
 '85185951079',
 '85185450225',
 '851853
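The expression `eid.split("-")[-1]` used above keeps whatever follows the last hyphen in each EID. As a minimal sketch of the same step written more explicitly, the function below removes the `2-s2.0-` prefix only when it is present. The name `strip_eid_prefix` is our own and is not a pybliometrics function; the two sample EIDs are taken from the output above.

```python
EID_PREFIX = "2-s2.0-"


def strip_eid_prefix(eid):
    """Return the numeric part of an EID; leave other strings unchanged."""
    if eid.startswith(EID_PREFIX):
        return eid[len(EID_PREFIX):]
    return eid


sample = ["2-s2.0-85182643737", "2-s2.0-85189041916"]
print([strip_eid_prefix(e) for e in sample])
# ['85182643737', '85189041916']
```

Checking for the prefix first means the function can safely be applied twice, or to a list that already contains bare numeric IDs.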

## Retrieve Abstracts

If we want more detailed information about the documents we identified above, we can use pybliometrics' **AbstractRetrieval** class to retrieve it.

In [13]:
from pybliometrics.scopus import AbstractRetrieval

ab = AbstractRetrieval(eids[1])

In [14]:
dir(ab)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_cache_file_path',
 '_confevent',
 '_head',
 '_json',
 '_mdate',
 '_ref',
 '_refresh',
 '_view',
 'abstract',
 'affiliation',
 'aggregationType',
 'authkeywords',
 'authorgroup',
 'authors',
 'chemicals',
 'citedby_count',
 'citedby_link',
 'confcode',
 'confdate',
 'conflocation',
 'confname',
 'confsponsor',
 'contributor_group',
 'copyright',
 'copyright_type',
 'correspondence',
 'coverDate',
 'date_created',
 'description',
 'document_entitlement_status',
 'doi',
 'eid',
 'endingPage',
 'funding',
 'funding_text',
 'get_bibtex',
 'get_cache_file_age',
 'get_cache_file_mdate',
 'get_html',
 'get_key_

In [15]:
ab.description  # if abstract is empty, try description instead (and vice versa)

'This paper presents a comprehensive bibliometric analysis to understand the evolution of scintillators as neutron detection from 2014 to 4 July 2023, utilizing data sourced from Scopus. The 312 selected articles were visualized using vosViewer and Tableau. This study delves into critical aspects, such as the growth of publication in this field over time, contributions made by various countries and their collaborative networks, top journals publishing articles related to scintillators as neutron detection, frequently cited documents used as references, and research trends over specific periods. The results show that research on scintillators for neutron detection has been popular since 2014, with at least 20 articles published yearly. The United States, China, France, and Italy have published the most papers, but Lithuania, Russia, India, and China have been the most active recently. The article provides researchers with an extensive overview of their field. This information empowers t
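Because the text may sit in either `abstract` or `description`, it can be convenient to check both in a fixed order. The sketch below shows one way to do that. The function `first_available_text` is a hypothetical helper of our own, and the `SimpleNamespace` object is only a stand-in so the example runs without an API key; in practice you would pass the `ab` object retrieved above.

```python
from types import SimpleNamespace


def first_available_text(doc, fields=("abstract", "description")):
    """Return the first non-empty attribute among `fields`, or None."""
    for name in fields:
        value = getattr(doc, name, None)
        if value:
            return value
    return None


# Stand-in for a retrieved document whose abstract attribute is empty
doc = SimpleNamespace(abstract=None, description="Example description text.")
print(first_available_text(doc))
# Example description text.
```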

In [16]:
import sys
sys.path.insert(0, '../code')
import scopusapi_functions

In the dictionary below, we specify the data fields we want to retrieve for each document. A Python dictionary maps keys to values. Here, each entry has the format:

    [KEY: name we want to give the column in the resulting dataframe] : [VALUE: name of the field as set by Scopus]

In [17]:
coldict = {"doc_eid": "eid", "doi": "doi", "authors": "authors",
           "title": "title", 
           "pub_title": "publicationName", "volume": "volume",
           "date": "coverDate", 
           "abstract": "abstract", "description": "description",
           "citedby_count": "citedby_count", 
           "authkeywords": "authkeywords", 
           }
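The source of `get_scopus_abstractinfo` is not shown in this notebook, so the following is only a guess at how a mapping like `coldict` could be applied: read each named attribute from a document object and store it under the new column name. The function `extract_fields` and the stand-in document are our own assumptions for illustration, not the helper's actual implementation.

```python
from types import SimpleNamespace


def extract_fields(doc, coldict):
    """Build {output name: attribute value}; missing attributes become None."""
    return {
        out_name: getattr(doc, scopus_name, None)
        for out_name, scopus_name in coldict.items()
    }


# Stand-in object with a few attributes named like the Scopus fields
doc = SimpleNamespace(
    eid="2-s2.0-85182643737",
    doi="10.1007/s10238-023-01254-3",
    volume="24",
)
small_coldict = {"doc_eid": "eid", "doi": "doi", "pub_title": "publicationName"}
row = extract_fields(doc, small_coldict)
print(row)
```

Here `pub_title` comes back as `None` because the stand-in object has no `publicationName` attribute, which mirrors how a field missing from a record could surface as an empty cell in the dataframe.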

In [18]:
eid = eids[0]
scopusapi_functions.get_scopus_abstractinfo(eid, coldict)

{'doc_eid': '2-s2.0-85182643737',
 'doi': '10.1007/s10238-023-01254-3',
 'authors': [Author(auid=58486874500, indexed_name='Zyoud S.H.', surname='Zyoud', given_name='Sa’ed H.', affiliation='60072733;60072733')],
 'title': 'Global landscape of COVID-19 research: a visualization analysis of randomized clinical trials',
 'pub_title': 'Clinical and Experimental Medicine',
 'volume': '24',
 'date': '2024-12-01',
 'abstract': None,
 'description': 'The emergence of COVID-19 in 2019 has resulted in a significant global health crisis. Consequently, extensive research was published to understand and mitigate the disease. In particular, randomized controlled trials (RCTs) have been considered the benchmark for assessing the efficacy and safety of interventions. Hence, the present study strives to present a comprehensive overview of the global research landscape pertaining to RCTs and COVID-19. A bibliometric analysis was performed using the Scopus database. The search parameters included article

In [19]:
# Collect one dictionary per document, then build the dataframe in a single step.
# This avoids re-copying the dataframe on every iteration with pd.concat
# (DataFrame.append was removed in pandas 2.0).
records = [scopusapi_functions.get_scopus_abstractinfo(eid, coldict) for eid in eids]
ab_df = pd.DataFrame(records)

print(ab_df.shape)
ab_df.head()

(467, 11)


| | doc_eid | doi | authors | title | pub_title | volume | date | abstract | description | citedby_count | authkeywords |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2-s2.0-85182643737 | 10.1007/s10238-023-01254-3 | [(58486874500, Zyoud S.H., Zyoud, Sa’ed H., 60... | Global landscape of COVID-19 research: a visua... | Clinical and Experimental Medicine | 24 | 2024-12-01 | | The emergence of COVID-19 in 2019 has resulted... | 0 | |
| 1 | 2-s2.0-85189041916 | 10.1016/j.radphyschem.2024.111705 | [(58339362500, Ardiansyah A., Ardiansyah, Ardi... | Science mapping of the development of scintill... | Radiation Physics and Chemistry | 220 | 2024-07-01 | | This paper presents a comprehensive bibliometr... | 0 | |
| 2 | 2-s2.0-85185180408 | 10.1016/j.mex.2024.102601 | [(55413737300, Harder R., Harder, Robin, 60027... | Using Scopus and OpenAlex APIs to retrieve bib... | MethodsX | 12 | 2024-06-01 | | Evidence synthesis methodologies rely on bibli... | 0 | |
| 3 | 2-s2.0-85182361252 | 10.1016/j.mex.2023.102484 | [(57212528168, Eyzaguirre I.A.L., Eyzaguirre, ... | Combining methods to conduct a systematic revi... | MethodsX | 12 | 2024-06-01 | | This study aims to present a combination of me... | 0 | |
| 4 | 2-s2.0-85179489401 | 10.1016/j.inat.2023.101879 | [(36680805500, Sharma M., Sharma, Manisha, 600... | Landscape of epilepsy research: Analysis and f... | Interdisciplinary Neurosurgery: Advanced Techn... | 36 | 2024-06-01 | | Epilepsy is a neurological condition character... | 0 | |


## Exercise 1

Run your own query against the Scopus database using the ScopusSearch class. Save a list of the eids of the documents it returns.

## Exercise 2

Using the list of eids created above, use the AbstractRetrieval class to retrieve detailed information about each document. Place the results in a dataframe and export it as a csv.