# Document Search and Retrieval

In [None]:
import pybliometrics
from pybliometrics.scopus.utils import config

In [None]:
from pybliometrics.scopus import ScopusSearch

## Writing a Scopus Query

You can search for documents using author name (as we did in notebook "02") or by a variety of other fields (affiliation, publication name, date range, etc.).

Often, though, researchers want to search for a term or set of terms within specific fields or across several fields at once. Some example field queries are below:

* `ABS(dopamine)` returns documents where "dopamine" is in the document abstract.
* `AUTHKEY(stroke)` returns documents where "stroke" is an author keyword.
* `CHEMNAME(oxidopamine)` returns documents with "oxidopamine" in the chemical name field.
* `FUND-ACR(NASA)` returns documents with NASA mentioned as the sponsor acronym in the acknowledgements section of the article. 
* `LANGUAGE(french)` returns documents originally written in French.
* `OPENACCESS(1)` returns Open Access content indexed in Scopus.
* `OPENACCESS(0)` returns subscription-based content indexed in Scopus.
* `PUBYEAR > 1994` returns documents with a publication year after 1994.
* `PUBYEAR < 1994` returns documents with a publication year before 1994.
* `PUBYEAR = 1994` returns documents with a publication year of 1994. 
* `REF(darwin 1859)` returns documents where both search terms occur in the same reference.
* `SRCTYPE(j)` returns documents from journal sources.
* `TITLE("neuropsychological evidence")` returns documents with the phrase "neuropsychological evidence" in their title.

If you want to search across multiple fields, you can also use the following combined fields:

* `ALL("heart attack")` returns documents with "heart attack" in any field.
* `KEY(oscillator)` returns documents where "oscillator" is a keyword.
    * searches the AUTHKEY, INDEXTERMS, TRADENAME, and CHEMNAME fields
* `TITLE-ABS-KEY("heart attack")` returns documents with "heart attack" in their abstracts, article titles, or keyword fields.
* `TITLE-ABS-KEY-AUTH(heart attack)` returns documents with "heart attack" in their abstracts, article titles, keywords, or author name fields.

For more on search queries within Scopus, review its [Search Tips page](https://dev.elsevier.com/sc_search_tips.html).
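
The field codes above can be combined into a single query string with the `AND` operator. The snippet below is a minimal sketch of one way to assemble such a string in plain Python; the helper names `field` and `all_of` are made up for illustration and are not part of pybliometrics.

```python
def field(code, term, phrase=False):
    """Wrap a search term in a Scopus field code, e.g. TITLE-ABS-KEY(term).

    Hypothetical helper: set phrase=True to put the term in double quotes.
    """
    value = f'"{term}"' if phrase else term
    return f"{code}({value})"


def all_of(*clauses):
    """Join query clauses with AND so that every clause must match."""
    return " AND ".join(clauses)


query = all_of(
    field("TITLE-ABS-KEY", "heart attack", phrase=True),
    field("LANGUAGE", "french"),
    "PUBYEAR > 1994",
)
print(query)
```

The resulting string can be passed to `ScopusSearch` like any hand-written query.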

In [None]:
# A simple search.
# This takes only seconds on the Scopus web platform, but through the API
# expect roughly 30-60 seconds for every 1,000 records.
query = "TITLE-ABS-KEY(bibliometrics)"

# To speed things up, we recommend adding the parameter download=False
# to your first search. Once you are sure you are getting the type and
# number of results you want, you can remove this parameter.

# results
s = ScopusSearch(query, download=False)
print(s)


## Task

The above search yields tens of thousands of records. Unfortunately, the Scopus API is slow to retrieve that many records. While you are testing your code, it therefore makes sense to retrieve a smaller subset of results. You can always go back and request a larger number of results once you know your code is working well.

Below, we will write some code that narrows the search results to publications from 2023 onward that also use the term "scopus" in one of their key fields. We will then place the results in a dataframe.

## Task

We can then use the **.get_eids()** method to retrieve the EID (the unique identifier Scopus assigns to each document) of every document that matches the search query above.
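
The two tasks above could be approached along the following lines. This is only a sketch: the function names and the `max_records` threshold are assumptions made for illustration, and the pybliometrics and pandas imports are placed inside the function so the cell can be defined without an API key configured.

```python
def narrow_query(base_query, extra_term, since_year):
    """Restrict base_query to documents that also mention extra_term in
    their title, abstract, or keywords and were published in since_year
    or later."""
    return (f"{base_query} AND TITLE-ABS-KEY({extra_term}) "
            f"AND PUBYEAR > {since_year - 1}")


def search_to_dataframe(query, max_records=2000):
    """Run the query and return (dataframe of results, list of EIDs).

    Assumes pybliometrics is already configured with a valid API key.
    max_records is an arbitrary safety limit chosen for this sketch.
    """
    import pandas as pd
    from pybliometrics.scopus import ScopusSearch

    # Check the number of hits first, without downloading any records
    size = ScopusSearch(query, download=False).get_results_size()
    if size > max_records:
        raise ValueError(f"{size} results; narrow the query before downloading")

    s = ScopusSearch(query)
    # s.results is a list of namedtuples, or None when nothing matched
    return pd.DataFrame(s.results or []), s.get_eids()


narrowed_query = narrow_query("TITLE-ABS-KEY(bibliometrics)", "scopus", 2023)
print(narrowed_query)
```

With an API key in place, `df, eids = search_to_dataframe(narrowed_query)` would then give both the dataframe and the list of EIDs used in the rest of this notebook.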

## Retrieve Abstracts

If we want to get more detailed information about the documents we identified above, we can use pybliometrics' **AbstractRetrieval** class to retrieve this info.

In [None]:
from pybliometrics.scopus import AbstractRetrieval


### Create a dataframe of document data

#### Creating and using a function to compile document data

We can create and call a function to place data from multiple documents into a dataframe, while also specifying the columns we will be using.

In [None]:
import sys
sys.path.insert(0, '../code')
import scopusapi_functions

In the dictionary below, we specify the data fields we want to retrieve for each document. A Python dictionary holds a collection of key/value pairs. Below, the format is:

    [KEY: name we want to give the column in the resulting dataframe] : [VALUE: name of the field as set by Scopus]



In [None]:
coldict = {"doc_eid": "eid", "doi": "doi", "authors": "authors",
           "title": "title", 
           "pub_title": "publicationName", "volume": "volume",
           "date": "coverDate", 
           "abstract": "abstract", "description": "description",
           "citedby_count": "citedby_count", 
           "authkeywords": "authkeywords", 
           }

In [None]:
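import pandas as pd

# `eids` is the list of EIDs returned by .get_eids() in the task above
rows = []
for eid in eids:
    # one dictionary per document, holding the fields named in coldict
    rows.append(scopusapi_functions.get_scopus_abstractinfo(eid, coldict))

# build the dataframe once instead of concatenating inside the loop
ab_df = pd.DataFrame(rows)

print(ab_df.shape)
ab_df.head()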
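
The helper called in the loop above lives in the workshop's `../code` folder and is not shown in this notebook. As a rough sketch of what such a helper might do (this is an assumption, not the actual implementation), it could read each attribute named in `coldict` from an `AbstractRetrieval` object. The names `extract_fields` and `get_abstractinfo_sketch` and the example EID are invented for illustration.

```python
from types import SimpleNamespace


def extract_fields(record, coldict):
    """Return {column name: value} by reading each mapped attribute.

    Attributes missing from the record are stored as None.
    """
    return {col: getattr(record, attr, None) for col, attr in coldict.items()}


def get_abstractinfo_sketch(eid, coldict):
    """Hypothetical stand-in for scopusapi_functions.get_scopus_abstractinfo.

    Assumes pybliometrics is already configured with a valid API key.
    """
    from pybliometrics.scopus import AbstractRetrieval

    ab = AbstractRetrieval(eid)
    return extract_fields(ab, coldict)


# Offline demonstration with a stand-in object instead of a real API call.
# The EID and title are placeholders, not real Scopus data.
fake_record = SimpleNamespace(eid="2-s2.0-0000000000", title="An example title")
demo = extract_fields(
    fake_record, {"doc_eid": "eid", "title": "title", "volume": "volume"}
)
print(demo)
```

Storing `None` for a missing attribute keeps every row the same shape, so the rows can be stacked into one dataframe even when some documents lack a field.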

## Exercise 1

Conduct your own search of the Scopus dataset using the ScopusSearch class. Save a list of the EIDs of the matching documents.

## Exercise 2

Using the list of EIDs created above, use the AbstractRetrieval class to retrieve detailed information about each document. Place the results in a dataframe and export it as a CSV file.