# Search for Authors and Retrieve Their Data

## Setting up the Scopus API Key

You should have your Scopus API Key available. If you have not yet requested a key or do not know where to find it, review the documentation in notebook "01".

The first time you run the code cell below, it will open a small prompt window (usually appearing near the top of the screen), which asks you to paste in your API Key.

![Initial prompt window for entering your API Key](../images/enter_key.png)

Then a second window will appear asking for an institutional token. A token is not necessary, so just press Enter.

In [1]:
import pybliometrics
from pybliometrics.scopus.utils import config

Now, your Scopus API Key is stored in a configuration file on your computer so you should not need to enter it again, unless you move or delete the configuration file. To find where the configuration file is on your computer, run the following code:

In [3]:
pybliometrics.scopus.utils.constants.CONFIG_FILE

WindowsPath('C:/Users/F0040RP/.config/pybliometrics.cfg')
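If you want to confirm that a key was actually saved without printing it, the configuration file can be inspected with Python's standard `configparser`. The following is a minimal sketch: it assumes the file is in INI format with an `[Authentication]` section containing an `APIKey` entry, which may differ between pybliometrics versions. The helper `has_api_key` and the sample key are hypothetical.

```python
import configparser


def has_api_key(cfg_text: str) -> bool:
    """Return True if the INI text holds a non-empty APIKey entry."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # Assumed layout: an [Authentication] section with an APIKey option.
    key = parser.get("Authentication", "APIKey", fallback="")
    return bool(key.strip())


# Made-up sample text standing in for the real configuration file
sample = """
[Authentication]
APIKey = 0123456789abcdef
InstToken =
"""
print(has_api_key(sample))
```

To check your own file, you could pass the text of the path returned by `CONFIG_FILE` to this helper instead of the sample string.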

## Use your Scopus API from within the Dartmouth online network

+ Be sure that you are within the campus network (either on campus or logged into the VPN) to ensure the API will retrieve all requested results.

+ Otherwise some requests will return the error: `Scopus401Error: The requestor is not authorized to access the requested view or fields of the resource`

## Types of Bibliometric Data

Most bibliometric data is stored at the document level. That is, bibliometric databases record metadata for each individual article, report, book, or other paper. However, this data can also be aggregated in various ways. Thus, some common types of bibliometric data include:

* document-level data
* author-level summary data, plus document-level data for each document that an author (co-)authored
* publication-level summary data and metrics (measuring the "impact" of a journal, for example, by quantifying the number of citations to its articles)
* institution-level metadata

In this lesson, we will begin with an author's name, distinguish this particular author from others with the same name, and then retrieve data for the documents (co-)authored by this researcher.

## Get Information for a Single Author

Using the [Pybliometrics](https://pybliometrics.readthedocs.io/en/stable/) Python library, we can begin by extracting metadata for a single author. However, unless you have an unusual first and last name combination (like me), you will first need to identify the correct individual. For example, if you search for "Jane Smith", you might need to sift through data for multiple authors named "Jane Smith" and identify the correct matches.

For example, Jane Smith at Dartmouth may be a different person than Jane Smith at Vassar, but she may be the same person as Jane Smith at UNH (Scopus records often have not been aggregated to merge records of the same person when they move to another institution).

To begin, we will search for the [Spanish chemist Rafael Luque, who has been suspended by his institution in Spain for academic impropriety](https://cen.acs.org/research-integrity/Highly-cited-chemist-suspended-claiming-to-be-affiliated-with-Russian-and-Saudi-universities/101/i12). The suspension relates to a highly dubious publication profile (co-authoring 60-70 papers annually) and to accepting salaries as an adjunct scholar at Saudi and Russian universities while still employed in Spain. Those universities wanted his publication and citation record to boost their rankings.

We will first use the **AuthorSearch API** to find the correct Rafael Luque. We will then use the **AuthorRetrieval API** to retrieve information about his documents.

In [4]:
from pybliometrics.scopus import AuthorSearch
lastname = "Luque"
firstname = "Rafael"
au = AuthorSearch(f"AUTHLAST({lastname}) and AUTHFIRST({firstname})")

In [8]:
au.authors

[Author(eid='9-s2.0-26643003700', orcid='0000-0003-4190-1916', surname='Luque', initials='R.G.', givenname='Rafael Geraldo', affiliation='RUDN University', documents=898, affiliation_id='60015024', city='Moscow', country='Russian Federation', areas='CHEM (644); CENG (622); ENVI (611)'),
 Author(eid='9-s2.0-57194868074', orcid='0000-0002-4671-2957', surname='Luque', initials='R.', givenname='Rafael', affiliation='The University of Chicago', documents=126, affiliation_id='60029278', city='Chicago', country='United States', areas='PHYS (121); EART (115); MULT (6)'),
 Author(eid='9-s2.0-57535563900', orcid='0000-0001-5536-1805', surname='Luque-Baena', initials='R.M.', givenname='Rafael Marcos', affiliation='Universidad de Málaga', documents=100, affiliation_id='60003662', city='Malaga', country='Spain', areas='COMP (161); MATH (43); ENGI (25)'),
 Author(eid='9-s2.0-58220142700', orcid='0000-0003-1963-0523', surname='López-Luque', initials='R.', givenname='Rafael', affiliation='Universidad 

The `AuthorSearch` command sends a **request** for information using the search query above. The API then sends a **response** with the requested information, which we have saved in the variable `au`.

If we just call `au`, we receive only a wrapper for the information. To retrieve specific information about the authors that matched this query, we need to access one of its attributes, such as `au.authors`.

In [7]:
?au

Type:           AuthorSearch
String form:
Search 'AUTHLAST(Luque) and AUTHFIRST(Rafael)' yielded 29 authors as of 2024-04-05:
           Luque, Ra <...> l; AUTHOR_ID:57213514771 (1 document(s))
           Luque, Rafael; AUTHOR_ID:35810760000 (1 document(s))
File:           c:\users\f0040rp\documents\dartlib_rds\projects\bibliometrics\.venv\lib\site-packages\pybliometrics\scopus\author_search.py
Docstring:      <no docstring>
Init docstring:
Interaction with the Author Search API.

:param query: A string of the query.  For allowed fields and values see
              https://dev.elsevier.com/sc_author_search_tips.html.
:param refresh: Whether to refresh the cached file if it exists or not.
                If `int` is passed, cached file will be refreshed if the
                number of days since last modification exceeds that value.
:param download: Whether to download results (if they have not been
                 cached).
:pa

In [9]:
dir(au)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_action',
 '_cache_file_path',
 '_integrity',
 '_json',
 '_mdate',
 '_n',
 '_query',
 '_refresh',
 '_view',
 'authors',
 'get_cache_file_age',
 'get_cache_file_mdate',
 'get_key_remaining_quota',
 'get_key_reset_time',
 'get_results_size']

In [10]:
au.authors

#eid: '9-s2.0-26643003700'
#orcid: '0000-0003-4190-1916'

[Author(eid='9-s2.0-26643003700', orcid='0000-0003-4190-1916', surname='Luque', initials='R.G.', givenname='Rafael Geraldo', affiliation='RUDN University', documents=898, affiliation_id='60015024', city='Moscow', country='Russian Federation', areas='CHEM (644); CENG (622); ENVI (611)'),
 Author(eid='9-s2.0-57194868074', orcid='0000-0002-4671-2957', surname='Luque', initials='R.', givenname='Rafael', affiliation='The University of Chicago', documents=126, affiliation_id='60029278', city='Chicago', country='United States', areas='PHYS (121); EART (115); MULT (6)'),
 Author(eid='9-s2.0-57535563900', orcid='0000-0001-5536-1805', surname='Luque-Baena', initials='R.M.', givenname='Rafael Marcos', affiliation='Universidad de Málaga', documents=100, affiliation_id='60003662', city='Malaga', country='Spain', areas='COMP (161); MATH (43); ENGI (25)'),
 Author(eid='9-s2.0-58220142700', orcid='0000-0003-1963-0523', surname='López-Luque', initials='R.', givenname='Rafael', affiliation='Universidad 

In [11]:
?AuthorSearch

Init signature:
AuthorSearch(
    query: str,
    refresh: Union[bool, int] = False,
    verbose: bool = False,
    download: bool = True,
    integrity_fields: Union[List[str], Tuple[str, ...]] = None,
    integrity_action: str = 'raise',
    count: int = 200,
    **kwds: 

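The fields and values allowed in an author search query are listed at https://dev.elsevier.com/sc_author_search_tips.html. For example, once we know an author's ID, we can search for that author directly: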

In [10]:
#eid = '9-s2.0-26643003700'  # the full EID doesn't work in AU-ID()
eid = '26643003700'  # the numeric author ID (the last part of the EID) works
au2 = AuthorSearch(f"AU-ID({eid})")

In [11]:
au2.authors

[Author(eid='9-s2.0-26643003700', orcid='0000-0003-4190-1916', surname='Luque', initials='R.G.', givenname='Rafael Geraldo', affiliation='RUDN University', documents=898, affiliation_id='60015024', city='Moscow', country='Russian Federation', areas='CHEM (644); CENG (622); ENVI (611)')]
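Because the `AU-ID()` query above accepted only the numeric part of the EID, it can help to strip the prefix programmatically rather than by hand. This is a small sketch; the helper name `author_id_from_eid` is hypothetical and not part of pybliometrics.

```python
def author_id_from_eid(eid: str) -> str:
    """Return the numeric author ID at the end of a Scopus author EID."""
    # '9-s2.0-26643003700' -> '26643003700'; a bare ID is returned unchanged
    return eid.rsplit("-", 1)[-1]


full_eid = "9-s2.0-26643003700"
author_id = author_id_from_eid(full_eid)
query = f"AU-ID({author_id})"
print(query)  # AU-ID(26643003700)
```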

To retrieve a specific piece of information about one author, index into `au.authors` and access the field by name:

In [17]:
au.authors[2].country

'Spain'
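When a search returns many candidates, reading the raw list is tedious. One option is to load the records into a pandas DataFrame and filter them. The sketch below uses a stand-in `Author` namedtuple holding only a few of the fields from the three complete records printed earlier, so that it runs without an API key. With live results you could pass `au.authors` to `pd.DataFrame` instead.

```python
from collections import namedtuple

import pandas as pd

# Stand-in for the namedtuples in `au.authors`, limited to a few fields
Author = namedtuple(
    "Author", "eid surname givenname affiliation documents country"
)

candidates = [
    Author("9-s2.0-26643003700", "Luque", "Rafael Geraldo",
           "RUDN University", 898, "Russian Federation"),
    Author("9-s2.0-57194868074", "Luque", "Rafael",
           "The University of Chicago", 126, "United States"),
    Author("9-s2.0-57535563900", "Luque-Baena", "Rafael Marcos",
           "Universidad de Málaga", 100, "Spain"),
]

# A list of namedtuples becomes one row per author, one column per field
authors_df = pd.DataFrame(candidates)

# Keep exact surname matches and list the most prolific profile first
exact = authors_df[authors_df["surname"] == "Luque"]
exact = exact.sort_values("documents", ascending=False)
print(exact[["eid", "givenname", "affiliation", "documents"]])
```

Sorting by document count is only a heuristic for spotting the profile of interest here. The affiliation and subject areas still need to be checked by eye.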

Other ways to narrow down author searches:
* include affiliations or affiliation ids
* include subject areas
* include middle names or initials
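The narrowing strategies above can be sketched as a small query builder. The helper `build_author_query` is hypothetical. The field names `AFFIL` and `SUBJAREA` are taken on the assumption that they are accepted by the Author Search API, so check them against the Scopus author search tips page before relying on them.

```python
def build_author_query(lastname, firstname=None, affiliation=None,
                       subject_area=None):
    """Combine the supplied criteria into one author search query string."""
    parts = [f"AUTHLAST({lastname})"]
    if firstname:
        parts.append(f"AUTHFIRST({firstname})")
    if affiliation:
        # Assumed field name for an affiliation name
        parts.append(f"AFFIL({affiliation})")
    if subject_area:
        # Assumed field name for a subject area code such as CHEM
        parts.append(f"SUBJAREA({subject_area})")
    return " AND ".join(parts)


query = build_author_query("Luque", "Rafael", subject_area="CHEM")
print(query)  # AUTHLAST(Luque) AND AUTHFIRST(Rafael) AND SUBJAREA(CHEM)
```

The resulting string could then be passed to `AuthorSearch(query)` in the same way as the query used earlier.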

## Author Information Retrieval

In [13]:
from pybliometrics.scopus import AuthorRetrieval
eid = '26643003700'
ar = AuthorRetrieval(eid)
ar


<pybliometrics.scopus.author_retrieval.AuthorRetrieval at 0x16bc0ed73d0>

In [22]:
# Type "ar." and press Tab to view the available attributes
print(ar.indexed_name)
print(ar.affiliation_current)
print("Number of (co-)authored documents:", ar.document_count)
print("Number of citations in these documents:", ar.citation_count)
print("Number of papers citing this author's documents:", ar.cited_by_count)

Luque R.
[Affiliation(id=60015024, parent=None, type='parent', relationship='author', afdispname=None, preferred_name='RUDN University', parent_preferred_name=None, country_code='rus', country='Russian Federation', address_part='Miklukho-Maklaya str.6', city='Moscow', state='Moscow Oblast', postal_code='117198', org_domain='rudn.ru', org_URL='https://eng.rudn.ru/')]
Number of (co-)authored documents: 898
Number of citations in these documents: 41852
Number of papers citing this author's documents: 33930


In [25]:
ar.get_documents()

[Document(eid='2-s2.0-85175836371', doi='10.1007/s40820-023-01221-3', pii=None, pubmed_id=None, title='Understanding Bridging Sites and Accelerating Quantum Efficiency for Photocatalytic CO<inf>2</inf> Reduction', subtype='ar', subtypeDescription='Article', creator='Wang K.', afid='60110590;60021182;60009506;60003138', affilname='Universidad ECOTEC;Sun Yat-Sen University;King Fahd University of Petroleum and Minerals;Universidad de Córdoba', affiliation_city='Samborondon;Guangzhou;Dhahran;Cordoba', affiliation_country='Ecuador;China;Saudi Arabia;Spain', author_count='11', author_names='Wang, Kangwang;Hu, Zhuofeng;Yu, Peifeng;Balu, Alina M.;Li, Kuan;Li, Longfu;Zeng, Lingyong;Zhang, Chao;Luque, Rafael;Yan, Kai;Luo, Huixia', author_ids='57217179138;55859110700;57226131094;22940159800;57454617900;58604868000;57218134989;57199502932;26643003700;35732459200;35272439700', author_afids='60021182;60021182;60021182;60003138;60021182;60021182;60021182;60021182;60009506-60110590;60021182;60021182'
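Several fields in each document record, such as `author_names` and `author_ids`, pack multiple values into one semicolon-delimited string. A short sketch of unpacking them, using the values from the first record shown above:

```python
# Values copied from the first Document record in the output above
author_names = (
    "Wang, Kangwang;Hu, Zhuofeng;Yu, Peifeng;Balu, Alina M.;Li, Kuan;"
    "Li, Longfu;Zeng, Lingyong;Zhang, Chao;Luque, Rafael;Yan, Kai;Luo, Huixia"
)
author_ids = (
    "57217179138;55859110700;57226131094;22940159800;57454617900;"
    "58604868000;57218134989;57199502932;26643003700;35732459200;35272439700"
)

names = author_names.split(";")
ids = author_ids.split(";")

# The two lists are in the same order, so they can be paired up
id_to_name = dict(zip(ids, names))

print(len(names))                 # 11, matching author_count in the record
print(id_to_name["26643003700"])  # Luque, Rafael
```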

In [27]:
import pandas as pd

doc_df = pd.DataFrame(ar.get_documents())
doc_df.head()

Unnamed: 0,eid,doi,pii,pubmed_id,title,subtype,subtypeDescription,creator,afid,affilname,...,pageRange,description,authkeywords,citedby_count,openaccess,freetoread,freetoreadLabel,fund_acr,fund_no,fund_sponsor
0,2-s2.0-85175836371,10.1007/s40820-023-01221-3,,,Understanding Bridging Sites and Accelerating ...,ar,Article,Wang K.,60110590;60021182;60009506;60003138,Universidad ECOTEC;Sun Yat-Sen University;King...,...,,We report a novel double-shelled nanoboxes pho...,Bridging sites | CO reduction 2 | Electronic ...,6,1,repositoryvor,Green,NSFC,11922415,National Natural Science Foundation of China
1,2-s2.0-85188795658,10.1016/j.jphotochem.2024.115625,S1010603024001692,,Synthesis of Cu-doped TiO<inf>2</inf> modified...,ar,Article,Jabbari P.,60110590;60028174;60003662;60003138;116418480,Universidad ECOTEC;Isfahan University of Techn...,...,,"In the present study, a photocatalytic oxidati...",Bismuth vanadate | Cu-doped TiO 2 | Dibenzothi...,0,0,,,IUT,PID2021‐126235OB‐C32,Isfahan University of Technology
2,2-s2.0-85187783663,10.1016/j.scp.2024.101520,S2352554124000950,,Green chemistry in Italy and Spain (1999–2019)...,ar,Article,Ciriminna R.,60110590;60030318;60008737,Universidad ECOTEC;Università degli Studi di M...,...,,The study of green chemistry uptake in Italy a...,Benign by design | Green chemistry | Green che...,0,1,publisherhybridgold,Hybrid Gold,UniMi,PE00000004,Università degli Studi di Milano
3,2-s2.0-85183933285,10.1016/j.ccr.2024.215660,S0010854524000067,,Flexible–robust MOFs/HOFs for challenging gas ...,re,Review,Ebadi Amooghin A.,60110590;60018844;60011476;60005010;116418480,Universidad ECOTEC;Fujian Normal University;Un...,...,,The critical separation and purification indus...,Benchmark Materials | Challenging Gas Separati...,1,0,,,,undefined,
4,2-s2.0-85183426715,10.1016/j.fuel.2023.130761,S0016236123033756,,Co-pyrolysis of biomass and polyethylene terep...,ar,Article,Cupertino G.F.M.,60110590;60028426;60008088;60004923;127387421,Universidad ECOTEC;Universidade Federal do Esp...,...,,Disposal of waste plastics is an environmental...,Biomass | Circular economy | Plastic waste con...,0,0,,,CAPES,14/2023,Coordenação de Aperfeiçoamento de Pessoal de N...
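Once the documents are in a DataFrame, ordinary pandas operations give quick summaries. The sketch below rebuilds three of the columns from the five rows previewed above so that it runs on its own. On the full data, the same calls could be applied to `doc_df`, assuming the count columns are numeric (otherwise convert them first with `pd.to_numeric`).

```python
import pandas as pd

# Three columns copied from the five preview rows above
sample_df = pd.DataFrame({
    "subtypeDescription": ["Article", "Article", "Article", "Review", "Article"],
    "citedby_count": [6, 0, 0, 1, 0],
    "openaccess": [1, 0, 1, 0, 0],
})

# How many documents of each type?
type_counts = sample_df["subtypeDescription"].value_counts()

# Total citations received by these documents
total_citations = sample_df["citedby_count"].sum()

# Share of documents flagged as open access (1 = open, 0 = not)
open_access_share = sample_df["openaccess"].mean()

print(type_counts)
print(total_citations, open_access_share)
```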


In [31]:
doc_df.to_csv(f"../data/{lastname}_{firstname}_{eid}_documents.csv", encoding = 'utf-8')

'26643003700'
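The `Unnamed: 0` column in the table preview above is what pandas names the saved row index when a CSV written with default settings is read back in. If that column is not wanted, one option is to write the file with `index=False`. The sketch below shows the round trip on a tiny made-up frame and a temporary file, rather than the real `doc_df` and `../data/` path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in for doc_df
df = pd.DataFrame({
    "eid": ["2-s2.0-85175836371", "2-s2.0-85188795658"],
    "citedby_count": [6, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "documents.csv"
    # index=False leaves the row index out of the file
    df.to_csv(path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))  # ['eid', 'citedby_count']
```

Alternatively, a file that was already saved with the index can be read with `pd.read_csv(path, index_col=0)` to restore the index instead of creating an extra column.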