# Search for Authors and Retrieve Their Data

## Setting up the Scopus API Key

You should have your Scopus API Key available. If you have not yet requested a key or do not know where to find it, review the documentation in notebook "01".

The first time you run the code cell below, it will open a small prompt window (usually appearing near the top of the screen), which asks you to paste in your API Key.

![Initial prompt window for entering your API Key](../images/enter_key.png)

Then, a second window will appear asking for an institutional token. That is not necessary, so just press Enter.

In [1]:
import pybliometrics
from pybliometrics.scopus.utils import config

Now, your Scopus API Key is stored in a configuration file on your computer so you should not need to enter it again, unless you move or delete the configuration file. To find where the configuration file is on your computer, run the following code:

In [2]:
pybliometrics.scopus.utils.constants.CONFIG_FILE

WindowsPath('C:/Users/F0040RP/.config/pybliometrics.cfg')
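If you want to confirm what was saved without displaying your key on screen, a sketch like the following may help. It assumes the configuration file is plain INI text that Python's `configparser` can read. The helper name `summarize_config` and the section and option names in the sample are hypothetical, used only for illustration.

```python
import configparser

def summarize_config(text):
    """Return {section: {option: value}} for INI text, masking likely secrets."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names as written (no lowercasing)
    parser.read_string(text)
    summary = {}
    for section in parser.sections():
        summary[section] = {}
        for option, value in parser.items(section):
            looks_secret = any(word in option.lower() for word in ("key", "token"))
            if looks_secret and value:
                # Show only the last four characters of anything that looks secret.
                value = "****" + value[-4:]
            summary[section][option] = value
    return summary

# Hypothetical sample content; your real file may use different names.
sample = """
[Authentication]
APIKey = 0123456789abcdef
InstToken =
"""
print(summarize_config(sample))
```

To inspect your own file, you could read its text from the path reported by `pybliometrics.scopus.utils.constants.CONFIG_FILE` and pass that text to the helper.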

## Use your Scopus API from within the Dartmouth online network

+ Be sure that you are within the campus network (either on campus or logged into the VPN) so that the API returns all requested results.

+ Otherwise some requests will return the error: `Scopus401Error: The requestor is not authorized to access the requested view or fields of the resource`

## Types of Bibliometric Data

Most bibliometric data is stored at the document level. That is, bibliometric databases record metadata for each individual article, report, book, or other paper. However, this data can also be aggregated in various ways. Thus, some common types of bibliometric data include:

* document-level data
* author-level summary data, plus document-level data for each document that an author (co-)authored
* publication-level summary data and metrics (measuring the "impact" of a journal, for example, by quantifying the number of citations to its articles)
* institution-level metadata

In this lesson, we will begin with an author's name, distinguish this particular author from others with the same name, and then retrieve data for the documents (co-)authored by this researcher.

## Get Information for one single author

Using the [Pybliometrics](https://pybliometrics.readthedocs.io/en/stable/) Python library, we can begin by extracting metadata for a single author. However, unless you have an unusual first and last name combination (like me), you will first need to identify the correct individual. For example, if you search for "Jane Smith", you might need to sift through data for multiple authors named "Jane Smith" and identify the correct matches.

For example, Jane Smith at Dartmouth may be a different person than Jane Smith at Vassar, but she may be the same person as Jane Smith at UNH (Scopus records often have not been aggregated to merge records of the same person when they move to another institution).

To begin, we will search for the Spanish chemist Rafael Luque, who [has been suspended by his institution in Spain for academic impropriety](https://cen.acs.org/research-integrity/Highly-cited-chemist-suspended-claiming-to-be-affiliated-with-Russian-and-Saudi-universities/101/i12). The suspension relates to a highly dubious publication profile (co-authoring 60-70 papers annually) and to his accepting salaries as an adjunct scholar at Saudi and Russian universities while still employed in Spain. Those universities wanted his publication and citation record to boost their rankings.

We will first use the **AuthorSearch API** to find the correct Rafael Luque. We will then use the **AuthorRetrieval API** to retrieve information about his documents.

In [3]:
from pybliometrics.scopus import AuthorSearch
lastname = "Luque"
firstname = "Rafael"
au = AuthorSearch(f"AUTHLAST({lastname}) and AUTHFIRST({firstname})")

In [4]:
au.authors

[Author(eid='9-s2.0-26643003700', orcid='0000-0003-4190-1916', surname='Luque', initials='R.G.', givenname='Rafael Geraldo', affiliation='RUDN University', documents=898, affiliation_id='60015024', city='Moscow', country='Russian Federation', areas='CHEM (644); CENG (622); ENVI (611)'),
 Author(eid='9-s2.0-57194868074', orcid='0000-0002-4671-2957', surname='Luque', initials='R.', givenname='Rafael', affiliation='The University of Chicago', documents=126, affiliation_id='60029278', city='Chicago', country='United States', areas='PHYS (121); EART (115); MULT (6)'),
 Author(eid='9-s2.0-57535563900', orcid='0000-0001-5536-1805', surname='Luque-Baena', initials='R.M.', givenname='Rafael Marcos', affiliation='Universidad de Málaga', documents=100, affiliation_id='60003662', city='Malaga', country='Spain', areas='COMP (161); MATH (43); ENGI (25)'),
 Author(eid='9-s2.0-58220142700', orcid='0000-0003-1963-0523', surname='López-Luque', initials='R.', givenname='Rafael', affiliation='Universidad 

In [5]:
# retrieve specific info
au.authors[2].country

'Spain'

The AuthorSearch command sends a **request** for information using the search query above. The API then sends a **response** with the requested information, which we have saved in the variable `au`.

If we evaluate `au` on its own, we only see a wrapper around that information. To retrieve specific information about the authors that matched this query, we need to be more specific. Try:

* `? au`
* `dir(au)`

In [6]:
dir(au)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_action',
 '_cache_file_path',
 '_integrity',
 '_json',
 '_mdate',
 '_n',
 '_query',
 '_refresh',
 '_view',
 'authors',
 'get_cache_file_age',
 'get_cache_file_mdate',
 'get_key_remaining_quota',
 'get_key_reset_time',
 'get_results_size']

In [7]:
au.authors

[Author(eid='9-s2.0-26643003700', orcid='0000-0003-4190-1916', surname='Luque', initials='R.G.', givenname='Rafael Geraldo', affiliation='RUDN University', documents=898, affiliation_id='60015024', city='Moscow', country='Russian Federation', areas='CHEM (644); CENG (622); ENVI (611)'),
 Author(eid='9-s2.0-57194868074', orcid='0000-0002-4671-2957', surname='Luque', initials='R.', givenname='Rafael', affiliation='The University of Chicago', documents=126, affiliation_id='60029278', city='Chicago', country='United States', areas='PHYS (121); EART (115); MULT (6)'),
 Author(eid='9-s2.0-57535563900', orcid='0000-0001-5536-1805', surname='Luque-Baena', initials='R.M.', givenname='Rafael Marcos', affiliation='Universidad de Málaga', documents=100, affiliation_id='60003662', city='Malaga', country='Spain', areas='COMP (161); MATH (43); ENGI (25)'),
 Author(eid='9-s2.0-58220142700', orcid='0000-0003-1963-0523', surname='López-Luque', initials='R.', givenname='Rafael', affiliation='Universidad 

The first entry in the results above, the Rafael Luque affiliated with RUDN University, appears to be our suspiciously prolific author. Note, however, that he may have more than one record, because Scopus often creates separate records for individuals who have worked at multiple institutions. For this exercise, we will retrieve information only for the Rafael Luque at RUDN.
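When a search returns many near-matches (hyphenated surnames, for instance), it can help to filter the candidates before choosing one by eye. The sketch below is one possible approach, not part of pybliometrics: the `Candidate` tuple is a trimmed-down stand-in for the records shown above, and the helper name `exact_surname_matches` is hypothetical.

```python
from collections import namedtuple

# A trimmed-down stand-in for the records in `au.authors` (illustrative only).
Candidate = namedtuple("Candidate", "eid surname givenname affiliation documents")

candidates = [
    Candidate("9-s2.0-26643003700", "Luque", "Rafael Geraldo", "RUDN University", 898),
    Candidate("9-s2.0-57194868074", "Luque", "Rafael", "The University of Chicago", 126),
    Candidate("9-s2.0-57535563900", "Luque-Baena", "Rafael Marcos", "Universidad de Málaga", 100),
]

def exact_surname_matches(records, surname):
    """Keep records whose surname matches exactly, most documents first."""
    wanted = surname.casefold()
    matches = [r for r in records if r.surname.casefold() == wanted]
    return sorted(matches, key=lambda r: r.documents, reverse=True)

for r in exact_surname_matches(candidates, "Luque"):
    print(f"{r.documents:>4}  {r.givenname} {r.surname}, {r.affiliation}  ({r.eid})")
```

Because the real records expose the same attribute names (`surname`, `documents`, and so on), the same helper should also accept `au.authors` directly.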

In [8]:
full_eid = au.authors[0].eid
full_eid

'9-s2.0-26643003700'

In [9]:
full_eid.split("-")[-1]

'26643003700'

In [10]:
eid = full_eid.split("-")[-1]
eid

'26643003700'
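If you expect to repeat this step for several authors, the split can be wrapped in a small function that also checks the result looks like an ID. This is a minimal sketch; the function name `author_id_from_eid` is hypothetical and not part of pybliometrics.

```python
def author_id_from_eid(eid):
    """Return the numeric author ID at the end of an author EID."""
    author_id = eid.rsplit("-", 1)[-1]
    if not author_id.isdigit():
        raise ValueError(f"Unexpected EID format: {eid!r}")
    return author_id

print(author_id_from_eid("9-s2.0-26643003700"))  # prints 26643003700
```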

Other ways to narrow down author searches:
* include affiliations or affiliation ids
* include subject areas
* include middle names or initials
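
The narrowing options above can be sketched as a small query builder. `AUTHLAST` and `AUTHFIRST` come from the query used earlier in this lesson. The `AFFIL` and `SUBJAREA` field codes are assumptions here and should be checked against the Scopus Search Tips page before use. The function name `build_author_query` is hypothetical.

```python
def build_author_query(lastname, firstname=None, affiliation=None, subject_area=None):
    """Combine the supplied pieces into one AND-joined author search query."""
    parts = [f"AUTHLAST({lastname})"]
    if firstname:
        parts.append(f"AUTHFIRST({firstname})")
    if affiliation:
        parts.append(f"AFFIL({affiliation})")  # assumed field code
    if subject_area:
        parts.append(f"SUBJAREA({subject_area})")  # assumed field code
    return " AND ".join(parts)

query = build_author_query("Luque", "Rafael", affiliation="RUDN")
print(query)  # AUTHLAST(Luque) AND AUTHFIRST(Rafael) AND AFFIL(RUDN)
```

The resulting string can then be passed to `AuthorSearch(query)` in place of the hand-written f-string.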

## Exercise

Search for an author you know well (could be yourself or a colleague!). How hard is it to parse their publication record from the record of authors with similar names?  

For authors with common names, you can further filter the results by adding in affiliation or other information. See the [Search Tips page](https://dev.elsevier.com/sc_search_tips.html) for more information about these search fields.

## Author Information Retrieval

With **pybliometrics** and the Scopus APIs, you will want to use the **SEARCH** APIs to retrieve IDs for authors, documents, or affiliations. To retrieve more extensive information about those authors, documents, or affiliations, however, you will want to plug these IDs into the **RETRIEVAL** APIs.

For example, we can pass the author ID for RUDN's Rafael Luque to the **AuthorRetrieval** API.

In [11]:
from pybliometrics.scopus import AuthorRetrieval
ar = AuthorRetrieval(full_eid)
ar


<pybliometrics.scopus.author_retrieval.AuthorRetrieval at 0x1ddd11b4bd0>

In [12]:
print(ar.indexed_name)
print(ar.affiliation_current)
print("Number of (co-)authored documents:", ar.document_count)
print("Number of citations to these documents:", ar.citation_count)
print("Number of papers citing this author's documents:", ar.cited_by_count)

Luque R.
[Affiliation(id=60015024, parent=None, type='parent', relationship='author', afdispname=None, preferred_name='RUDN University', parent_preferred_name=None, country_code='rus', country='Russian Federation', address_part='Miklukho-Maklaya str.6', city='Moscow', state='Moscow Oblast', postal_code='117198', org_domain='rudn.ru', org_URL='https://eng.rudn.ru/')]
Number of (co-)authored documents: 898
Number of citations to these documents: 41852
Number of papers citing this author's documents: 33930
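
The totals printed above can be combined into a rough summary figure. The sketch below simply divides the citation total by the document total. It is an average only and says nothing about how citations are distributed across individual papers.

```python
document_count = 898    # ar.document_count from the output above
citation_count = 41852  # ar.citation_count from the output above

citations_per_document = citation_count / document_count
print(f"Average citations per document: {citations_per_document:.1f}")
```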


In [13]:
import pandas as pd

df = pd.DataFrame(ar.get_documents())
df



Unnamed: 0,eid,doi,pii,pubmed_id,title,subtype,subtypeDescription,creator,afid,affilname,...,pageRange,description,authkeywords,citedby_count,openaccess,freetoread,freetoreadLabel,fund_acr,fund_no,fund_sponsor
0,2-s2.0-85175836371,10.1007/s40820-023-01221-3,,,Understanding Bridging Sites and Accelerating ...,ar,Article,Wang K.,60110590;60021182;60009506;60003138,Universidad ECOTEC;Sun Yat-Sen University;King...,...,,We report a novel double-shelled nanoboxes pho...,Bridging sites | CO reduction 2 | Electronic ...,6,1,repositoryvor,Green,NSFC,11922415,National Natural Science Foundation of China
1,2-s2.0-85188795658,10.1016/j.jphotochem.2024.115625,S1010603024001692,,Synthesis of Cu-doped TiO<inf>2</inf> modified...,ar,Article,Jabbari P.,60110590;60028174;60003662;60003138;116418480,Universidad ECOTEC;Isfahan University of Techn...,...,,"In the present study, a photocatalytic oxidati...",Bismuth vanadate | Cu-doped TiO 2 | Dibenzothi...,0,0,,,IUT,PID2021‐126235OB‐C32,Isfahan University of Technology
2,2-s2.0-85187783663,10.1016/j.scp.2024.101520,S2352554124000950,,Green chemistry in Italy and Spain (1999–2019)...,ar,Article,Ciriminna R.,60110590;60030318;60008737,Universidad ECOTEC;Università degli Studi di M...,...,,The study of green chemistry uptake in Italy a...,Benign by design | Green chemistry | Green che...,0,1,publisherhybridgold,Hybrid Gold,UniMi,PE00000004,Università degli Studi di Milano
3,2-s2.0-85183933285,10.1016/j.ccr.2024.215660,S0010854524000067,,Flexible–robust MOFs/HOFs for challenging gas ...,re,Review,Ebadi Amooghin A.,60110590;60018844;60011476;60005010;116418480,Universidad ECOTEC;Fujian Normal University;Un...,...,,The critical separation and purification indus...,Benchmark Materials | Challenging Gas Separati...,1,0,,,,undefined,
4,2-s2.0-85183426715,10.1016/j.fuel.2023.130761,S0016236123033756,,Co-pyrolysis of biomass and polyethylene terep...,ar,Article,Cupertino G.F.M.,60110590;60028426;60008088;60004923;127387421,Universidad ECOTEC;Universidade Federal do Esp...,...,,Disposal of waste plastics is an environmental...,Biomass | Circular economy | Plastic waste con...,0,0,,,CAPES,14/2023,Coordenação de Aperfeiçoamento de Pessoal de N...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
893,2-s2.0-24344497922,10.1016/j.micromeso.2005.05.013,S1387181105002052,,NH<inf>4</inf>F effect in post-synthesis treat...,ar,Article,Luque R.,60003138,Universidad de Córdoba,...,11-20,A series of Al-MCM-41 mesoporous molecular sie...,Heterogeneous catalysis | MCM-41 | Mesoporous ...,49,0,,,MICYT,CTQ2004-02200/BQU,Ministerio de Ciencia y Tecnología
894,2-s2.0-14344260111,10.1016/j.jcat.2004.12.004,S002195170400586X,,Synthesis of acidic Al-MCM-48: Influence of th...,ar,Article,Campelo J.M.,60016476;60003138,Universidad de Cádiz;Universidad de Córdoba,...,327-338,Al-MCM-48 mesostructures with different Al con...,Heterogeneous catalysis | MCM-48 | Mesoporous ...,73,0,,,FEDER,BQU2001-2605,European Regional Development Fund
895,2-s2.0-30844455578,10.1016/s0167-2991(05)80488-4,S0167299105804884,,Cyclohexene conversion and toluene methylation...,cp,Conference Paper,Campelo J.M.,60003138,Universidad de Córdoba,...,1383-1390,Cyclohexene conversion and toluene alkylation ...,,0,0,,,ERDF,CTQ2004-2200,European Regional Development Fund
896,2-s2.0-22544485064,10.1039/b501130b,,,A novel highly active biomaterial supported pa...,ar,Article,Gronnow M.J.,60016418;60003138,University of York;Universidad de Córdoba,...,552-557,We have developed an expanded starch supported...,,116,0,,,,undefined,


In [15]:
df.to_csv("test.csv")

In [14]:
df.columns

Index(['eid', 'doi', 'pii', 'pubmed_id', 'title', 'subtype',
       'subtypeDescription', 'creator', 'afid', 'affilname',
       'affiliation_city', 'affiliation_country', 'author_count',
       'author_names', 'author_ids', 'author_afids', 'coverDate',
       'coverDisplayDate', 'publicationName', 'issn', 'source_id', 'eIssn',
       'aggregationType', 'volume', 'issueIdentifier', 'article_number',
       'pageRange', 'description', 'authkeywords', 'citedby_count',
       'openaccess', 'freetoread', 'freetoreadLabel', 'fund_acr', 'fund_no',
       'fund_sponsor'],
      dtype='object')
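
Several of these columns (`afid`, `affilname`, and others) pack multiple values into one semicolon-separated string, as the table above shows. The sketch below shows one way to count how often each affiliation appears. It uses three rows copied from that table, and the helper name `count_affiliations` is hypothetical.

```python
from collections import Counter

# Three rows copied from the table above, reduced to one column.
rows = [
    {"affilname": "Universidad de Córdoba"},
    {"affilname": "Universidad de Cádiz;Universidad de Córdoba"},
    {"affilname": "University of York;Universidad de Córdoba"},
]

def count_affiliations(records, column="affilname", sep=";"):
    """Count how many records mention each affiliation name."""
    counts = Counter()
    for record in records:
        value = record.get(column)
        if not isinstance(value, str):
            continue  # skip missing values (None or NaN)
        # A set avoids double-counting a name repeated within one record.
        counts.update({name.strip() for name in value.split(sep) if name.strip()})
    return counts

print(count_affiliations(rows).most_common(1))  # [('Universidad de Córdoba', 3)]
```

With the full dataframe, the same helper could be fed `df.to_dict("records")`.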

**TASK**: Retrieve information for all documents written by this author, place it in a dataframe, and export it as a CSV file.

## Exercise

Retrieve information for all documents written by an author of your choosing. Following the code above, place this information into a dataframe and export it as a CSV file.

