# Extracting a list of DOIs from `Elsapy`
---
In this notebook we describe methods to find and download papers to text mine their synthesis protocols. As this is a worked example, we go through the process with a single paper only.

The steps taken to find and download a paper are:
1. Searching the Scopus database for a target paper using `elsapy`
2. Identifying key metadata about the paper, such as the publisher and DOI
3. Using this metadata to download the paper


## Importing useful libraries

The first step is to import the `Elsapy` library and some ancillary tools.

In [1]:
from elsapy.elsclient import ElsClient # The Elsevier input/output client
from elsapy.elsprofile import ElsAuthor, ElsAffil # Author and affiliation profile classes
from elsapy.elsdoc import FullDoc, AbsDoc # Different paper document types
from elsapy.elssearch import ElsSearch # The actual search engine we'll be using
import requests # For accessing REST APIs over the internet
import json # For reading JSON (JavaScript Object Notation) files, such as our configuration file
import numpy as np # Numerical Python, essential for handling mathematical terms and matrices
import pandas as pd # A common data frame system, letting us access spreadsheet-like data in Python

## Setting up `Elsapy`
Then, using credentials from [dev.elsevier.com](https://dev.elsevier.com/), instantiate the `ElsClient` object. 

N.B. For this section, an institution token isn't needed, only the API key. However, if you have an institution token, you **must** supply it anyway.
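The configuration file used in this section is a JSON file holding an `apikey` entry and an `insttoken` entry. As a minimal sketch, assuming that layout, the hypothetical helper `load_elsevier_config` below (not part of `elsapy`) reads the file and fails early if the API key is missing. The credentials shown are placeholders.

```python
import json
import os
import tempfile


def load_elsevier_config(path):
    """Read the JSON config and check that an API key is present.

    Hypothetical helper for illustration; it is not part of elsapy.
    """
    with open(path) as handle:
        config = json.load(handle)
    if not config.get("apikey"):
        raise KeyError("config file is missing a non-empty 'apikey' entry")
    # Assumption: treat the institution token as optional, defaulting to None.
    config.setdefault("insttoken", None)
    return config


# Demonstration with placeholder credentials written to a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "elsapy_config.json")
    with open(demo_path, "w") as handle:
        json.dump({"apikey": "YOUR-API-KEY", "insttoken": "YOUR-INST-TOKEN"}, handle)
    demo_config = load_elsevier_config(demo_path)

print(sorted(demo_config))
```

Keeping the credentials in a separate file, rather than in the notebook itself, makes it easier to share the notebook without sharing the keys.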

In [2]:
## Load configuration
with open("./elsapy_config.json") as con_file:
    config = json.load(con_file)

## Initialize client
# The credentials are not printed, because notebook outputs are saved with the file
client = ElsClient(config['apikey'])
client.inst_token = config['insttoken']


## Defining search terms
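Here, we instantiate an `ElsSearch` object using the same search terms we use when searching the Web of Science.
The filter syntax is somewhat complex, but the query below showcases a few options: a title search, a source-type filter (`J` for journals), a publication-year filter, and a publisher filter.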
Further instructions for search string entry can be found [here](https://dev.elsevier.com/sc_search_tips.html).
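A query string of the kind used in this section can also be assembled from its parts, which helps when the same search is repeated with different terms. The sketch below is one possible way to do this; `build_scopus_query` and its parameters are hypothetical names, and it only covers the four field codes that appear in this notebook.

```python
def build_scopus_query(title_terms, source_type=None, after_year=None, publisher=None):
    """Assemble a Scopus search string from optional parts.

    Hypothetical helper for illustration; only the field codes used in
    this notebook (TITLE, SRCTYPE, pubyear, PUBLISHER) are handled.
    """
    # All title terms must match, so they are joined with AND.
    clauses = [f"TITLE({' AND '.join(title_terms)})"]
    if source_type:
        clauses.append(f"SRCTYPE({source_type})")
    if after_year is not None:
        clauses.append(f"pubyear > {after_year}")
    if publisher:
        clauses.append(f"PUBLISHER({publisher})")
    return " AND ".join(clauses)


query = build_scopus_query(
    ["ZIF-8", "synthesis"],
    source_type="J",
    after_year=2022,
    publisher="Elsevier",
)
print(query)
```

The resulting string can be passed to `ElsSearch` as its first argument, in place of a hand-written query.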

In [18]:
myDocSrch = ElsSearch('TITLE(ZIF-8 AND synthesis) AND SRCTYPE(J) AND pubyear > 2022 AND PUBLISHER(Elsevier)','scopus')

We then run the search as a separate step and count how many results we get. In the first instance we set `get_all=False` so that the search runs faster and we can check that we've found a reasonable number of papers.

In [19]:
myDocSrch.execute(client, get_all=False)
print(myDocSrch.tot_num_res, 'papers found')

13 papers found


Once that is confirmed, we re-run the search with `get_all=True` and view the results in table form.

In [20]:
myDocSrch.execute(client, get_all=True)
myDocSrch.results_df.head()

Unnamed: 0,@_fa,link,prism:url,dc:identifier,eid,dc:title,dc:creator,prism:publicationName,prism:issn,prism:volume,...,subtype,subtypeDescription,article-number,source-id,openaccess,openaccessFlag,prism:eIssn,pubmed-id,freetoread,freetoreadLabel
0,True,{'self': 'https://api.elsevier.com/content/abs...,https://api.elsevier.com/content/abstract/scop...,SCOPUS_ID:85151550892,2-s2.0-85151550892,High catalytic performance of CoCuFe<inf>2</in...,Moghaddam F.M.,Journal of Molecular Structure,222860,1285,...,ar,Article,135496,24642,0,False,,,,
1,True,{'self': 'https://api.elsevier.com/content/abs...,https://api.elsevier.com/content/abstract/scop...,SCOPUS_ID:85153174695,2-s2.0-85153174695,Plasma-assisted synthesis of ZIF-8 membrane fo...,Shan Y.,Separation and Purification Technology,13835866,317,...,ar,Article,123871,14292,0,False,18733794.0,,,
2,True,{'self': 'https://api.elsevier.com/content/abs...,https://api.elsevier.com/content/abstract/scop...,SCOPUS_ID:85149059141,2-s2.0-85149059141,Synthesis and ciprofloxacin adsorption of Gum ...,Yang D.,Colloids and Surfaces A: Physicochemical and E...,9277757,664,...,ar,Article,131196,26589,0,False,18734359.0,,,
3,True,{'self': 'https://api.elsevier.com/content/abs...,https://api.elsevier.com/content/abstract/scop...,SCOPUS_ID:85150042265,2-s2.0-85150042265,Molten NaCl assisted pyrolysis of ZIF-8/PAN el...,Ran S.,Chemical Engineering Journal,13858947,463,...,ar,Article,142174,16398,0,False,,,,
4,True,{'self': 'https://api.elsevier.com/content/abs...,https://api.elsevier.com/content/abstract/scop...,SCOPUS_ID:85144627163,2-s2.0-85144627163,Bicomponent hydrogels assisted templating synt...,Zheng H.,Journal of Molecular Structure,222860,1277,...,ar,Article,134824,24642,0,False,,,,


The table contains lots of columns, many of which are not necessary for our purposes.

In [21]:
from pprint import pprint
print(len( myDocSrch.results_df.columns), 'columns')
pprint([x for x in myDocSrch.results_df.columns])

28 columns
['@_fa',
 'link',
 'prism:url',
 'dc:identifier',
 'eid',
 'dc:title',
 'dc:creator',
 'prism:publicationName',
 'prism:issn',
 'prism:volume',
 'prism:pageRange',
 'prism:coverDate',
 'prism:coverDisplayDate',
 'prism:doi',
 'pii',
 'citedby-count',
 'affiliation',
 'prism:aggregationType',
 'subtype',
 'subtypeDescription',
 'article-number',
 'source-id',
 'openaccess',
 'openaccessFlag',
 'prism:eIssn',
 'pubmed-id',
 'freetoread',
 'freetoreadLabel']


Instead, we can select just the columns we care about like this:

In [22]:
necessary_columns = ['dc:title', 'dc:creator', 'affiliation', 'prism:publicationName', 'prism:doi', 'pii']
paper_info = myDocSrch.results_df[necessary_columns]
paper_info.head(20)

Unnamed: 0,dc:title,dc:creator,affiliation,prism:publicationName,prism:doi,pii
0,High catalytic performance of CoCuFe<inf>2</in...,Moghaddam F.M.,"[{'@_fa': 'true', 'affilname': 'Sharif Univers...",Journal of Molecular Structure,10.1016/j.molstruc.2023.135496,S0022286023005938
1,Plasma-assisted synthesis of ZIF-8 membrane fo...,Shan Y.,"[{'@_fa': 'true', 'affilname': 'Jiangxi Normal...",Separation and Purification Technology,10.1016/j.seppur.2023.123871,S1383586623007797
2,Synthesis and ciprofloxacin adsorption of Gum ...,Yang D.,"[{'@_fa': 'true', 'affilname': 'Wuhan Polytech...",Colloids and Surfaces A: Physicochemical and E...,10.1016/j.colsurfa.2023.131196,S0927775723002807
3,Molten NaCl assisted pyrolysis of ZIF-8/PAN el...,Ran S.,"[{'@_fa': 'true', 'affilname': 'Dalian Univers...",Chemical Engineering Journal,10.1016/j.cej.2023.142174,S1385894723009051
4,Bicomponent hydrogels assisted templating synt...,Zheng H.,"[{'@_fa': 'true', 'affilname': 'Henan Universi...",Journal of Molecular Structure,10.1016/j.molstruc.2022.134824,S002228602202470X
5,Facile synthesis of ZIF-8 incorporated electro...,Mohammed Y.A.Y.A.,"[{'@_fa': 'true', 'affilname': 'University of ...",Chemical Engineering Journal,10.1016/j.cej.2023.141972,S1385894723007039
6,An ultra-sensitive electrochemical aptasensor ...,Zhang Y.,"[{'@_fa': 'true', 'affilname': 'Chongqing Univ...",Microchemical Journal,10.1016/j.microc.2022.108316,S0026265X22011444
7,Synthesis of glycidol via transesterification ...,Timofeeva M.N.,"[{'@_fa': 'true', 'affilname': 'Novosibirsk St...",Molecular Catalysis,10.1016/j.mcat.2023.113014,S2468823123001001
8,ZIF-8-templated synthesis of core-shell struct...,Jiao S.,"[{'@_fa': 'true', 'affilname': 'Qingdao Univer...",Electrochimica Acta,10.1016/j.electacta.2023.141817,S001346862300004X
9,Facile synthesis of dual-hydrolase encapsulate...,Li M.,"[{'@_fa': 'true', 'affilname': 'Huaibei Coal I...",Chemosphere,10.1016/j.chemosphere.2022.137673,S0045653522041662
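The table above shows that each `affiliation` entry is a list of records with an `affilname` key, and that each paper has a `prism:doi`. As a rough sketch of how these could be unpacked, the hypothetical helpers below work on plain Python records. The affiliation name in the example is a placeholder, and the second record stands in for a paper with missing metadata.

```python
def affiliation_names(affiliations):
    """Return the 'affilname' of each affiliation record, skipping incomplete ones."""
    if not isinstance(affiliations, list):
        return []  # Missing values (None, NaN) give an empty list
    return [
        entry["affilname"]
        for entry in affiliations
        if isinstance(entry, dict) and entry.get("affilname")
    ]


def collect_dois(records):
    """Return the DOI of every record that has one."""
    return [record["prism:doi"] for record in records if record.get("prism:doi")]


# Placeholder records mirroring the column names in the table above.
records = [
    {
        "prism:doi": "10.1016/j.cej.2023.141972",
        "affiliation": [{"@_fa": "true", "affilname": "Example University"}],
    },
    {"prism:doi": None, "affiliation": None},
]

dois = collect_dois(records)
names = [affiliation_names(record["affiliation"]) for record in records]
print(dois)
print(names)
```

On the data frame itself, the same ideas could be applied with `paper_info['affiliation'].apply(affiliation_names)` and `paper_info['prism:doi'].dropna().tolist()`.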


## Downloading the paper
Now that we've identified a paper that we want to download and text mine, we need to access it.
While Elsevier allows easy access to their articles through the `elsapy` interface, I've found that accessing the web version simplifies later text mining. Accordingly, we're going to download the paper using `requests`, together with the paper identifier (PII) we found using `elsapy` and our access tokens.

This returns the manuscript as an XML document in raw bytes, which can be written to a file.

In [23]:
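paper = paper_info.loc[5, 'pii']  # PII of the paper in row 5 of the table above

# Request the full text of the article, identified by its PII
response = requests.get(
    f'https://api.elsevier.com/content/article/pii/{paper}',
    params={
        'apiKey': config['apikey'],
        'insttoken': config['insttoken']
    }
)
response.raise_for_status()  # Stop here if the request failed (e.g. bad credentials)

# Write the raw XML bytes to a file named after the PII
with open(f'./{paper}.xml', 'wb') as f:
    f.write(response.content)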
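Once the XML has been saved, a first pass at text mining might simply strip the markup and keep the text. The sketch below uses the standard library's `xml.etree.ElementTree` for this. The sample document and its namespace are invented for illustration and do not reflect the real Elsevier schema, and `xml_to_text` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def xml_to_text(xml_bytes):
    """Concatenate all text in an XML document and normalise whitespace."""
    root = ET.fromstring(xml_bytes)
    # itertext() walks every element in document order, whatever its tag or namespace.
    raw_text = "".join(root.itertext())
    return " ".join(raw_text.split())


# Invented sample; a real article uses Elsevier's own elements and namespaces.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<doc xmlns:ex="http://example.org/ns">
  <ex:title>Example title</ex:title>
  <ex:para>ZIF-8 was prepared at <ex:b>room temperature</ex:b>.</ex:para>
</doc>"""

text = xml_to_text(sample)
print(text)
```

For a saved paper, the bytes could be read back with `open(path, 'rb').read()` and passed to the same function. Extracting only the synthesis section would need the actual element names from the downloaded file.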