# Extracting a list of DOIs from `Elsapy`
---
In this notebook we describe methods to find and download papers to text mine their synthesis protocols. As this is a worked example, we go through the process with a single paper only.

The steps taken to find and download a paper are:
1. Searching the SCOPUS database for a target paper using `elsapy`
2. Identifying key metadata about the paper, such as the publisher and DOI
3. Using this metadata to download the paper


## Importing useful libraries

The first step is to import the `Elsapy` library and some ancillary tools.

In [2]:
from elsapy.elsclient import ElsClient  # The Elsevier API client
from elsapy.elsprofile import ElsAuthor, ElsAffil  # Author and affiliation profile classes
from elsapy.elsdoc import FullDoc, AbsDoc  # Full-text and abstract document types
from elsapy.elssearch import ElsSearch  # The search interface we'll be using
import requests  # For accessing REST APIs over the internet
import json  # For reading JSON (JavaScript Object Notation) files
import numpy as np  # Numerical Python, for numerical arrays and matrices
import pandas as pd  # Data frames, letting us handle spreadsheet-like data in Python

## Setting up `Elsapy`
Then, using credentials from [dev.elsevier.com](https://dev.elsevier.com/), instantiate the `ElsClient` object. 

N.B. For this section, an institution token isn't needed, only the API key. However, if you have an institution token you **must** use it, even when it isn't strictly required.

In [3]:
## Load configuration
with open("./elsapy_config.json") as conFile:
    config = json.load(conFile)

## Initialize client
client = ElsClient(config['apikey'])
client.inst_token = config['insttoken']
# The credentials are deliberately not printed: notebook output is saved with the code
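As a minimal sketch of a more defensive way to load the credentials, the helper below checks that the API key is present, treats the institution token as optional, and masks a credential before it is displayed. The function names `load_elsevier_config` and `mask` are hypothetical and not part of `elsapy`; only the key names `apikey` and `insttoken` come from the cell above. The demonstration writes a throwaway file with placeholder values rather than reading real credentials.

```python
import json
import os
import tempfile


def load_elsevier_config(path):
    """Load the JSON config and check that the API key is present.

    Hypothetical helper: the key names 'apikey' and 'insttoken' follow
    the notebook; everything else here is an assumption.
    """
    with open(path) as config_file:
        config = json.load(config_file)
    if not config.get('apikey'):
        raise KeyError("config file must contain a non-empty 'apikey'")
    # Treat a missing institution token as "not provided"
    config.setdefault('insttoken', None)
    return config


def mask(secret, visible=4):
    """Show only the first few characters of a credential."""
    if not secret:
        return '<not set>'
    return secret[:visible] + '*' * (len(secret) - visible)


# Demonstration with a throwaway file and placeholder credentials
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'elsapy_config.json')
    with open(demo_path, 'w') as demo_file:
        json.dump({'apikey': 'abcd1234efgh5678'}, demo_file)
    demo_config = load_elsevier_config(demo_path)

print(mask(demo_config['apikey']))
print(mask(demo_config['insttoken']))
```

Masking keeps a quick visual check that the right key was loaded without storing the full credential in the notebook's saved output.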


## Defining search terms
Here, we instantiate an `ElsSearch` object with the same search terms we would use when searching the Web of Science.
The filter syntax is somewhat complex, and we've showcased a few options here.
Further instructions for search string entry can be found [here](https://dev.elsevier.com/sc_search_tips.html).

In [4]:
myDocSrch = ElsSearch('TITLE(MCM-41 AND "mesoporous molecular sieve") AND SRCTYPE(J) AND pubyear = 2000 AND PUBLISHER(Elsevier)','scopus')
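Because the query string is easy to mistype, one option is to assemble it from its parts. The sketch below uses a hypothetical helper, `build_scopus_query`, which is not part of `elsapy`. It only covers the four field codes used in the cell above and joins every clause with `AND`, so it is an illustration rather than a general query builder.

```python
def build_scopus_query(title_terms, srctype=None, pubyear=None, publisher=None):
    """Assemble a Scopus search string from a few common filters.

    Hypothetical helper: it handles only the field codes that appear
    in the notebook's example query.
    """
    def quote(term):
        # Multi-word phrases are wrapped in double quotes
        return f'"{term}"' if ' ' in term else term

    clauses = ['TITLE(' + ' AND '.join(quote(t) for t in title_terms) + ')']
    if srctype:
        clauses.append(f'SRCTYPE({srctype})')
    if pubyear is not None:
        clauses.append(f'pubyear = {pubyear}')
    if publisher:
        clauses.append(f'PUBLISHER({publisher})')
    return ' AND '.join(clauses)


query = build_scopus_query(
    ['MCM-41', 'mesoporous molecular sieve'],
    srctype='J',
    pubyear=2000,
    publisher='Elsevier',
)
print(query)
```

The resulting string could then be passed to `ElsSearch(query, 'scopus')` in place of the hand-written one.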

We run the search as a separate step and count how many results we get. In the first instance we set `get_all` to `False` so that it runs faster and we can check that we've found a reasonable number of papers.

In [5]:
myDocSrch.execute(client, get_all=False)
print(myDocSrch.tot_num_res, 'papers found')

1 papers found


Once done, we can re-run the search with `get_all=True` and view the results in table form.

In [6]:
myDocSrch.execute(client, get_all=True)
myDocSrch.results_df.head()

Unnamed: 0,@_fa,link,prism:url,dc:identifier,eid,dc:title,dc:creator,prism:publicationName,prism:issn,prism:volume,...,prism:doi,pii,citedby-count,affiliation,prism:aggregationType,subtype,subtypeDescription,source-id,openaccess,openaccessFlag
0,True,{'self': 'https://api.elsevier.com/content/abs...,https://api.elsevier.com/content/abstract/scop...,SCOPUS_ID:0034281469,2-s2.0-0034281469,Electrorheological properties of a suspension ...,Choi H.,Microporous and Mesoporous Materials,13871811,39,...,10.1016/S1387-1811(00)00167-0,S1387181100001670,83,"[{'@_fa': 'true', 'affilname': 'Inha Universit...",Journal,ar,Article,26989,0,False
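The count-first, fetch-second pattern used in the two cells above could be wrapped in a single function. This is a sketch under assumptions: `run_search_with_cap` and the cap of 500 are hypothetical, and `FakeSearch` is a stand-in that mimics only the `execute(client, get_all=...)` method and `tot_num_res` attribute used in this notebook, so the example runs without network access or credentials.

```python
def run_search_with_cap(search, client, max_results=500):
    """Count results first, then fetch everything only if the count is sane.

    `search` is expected to behave like the notebook's `ElsSearch` object.
    The cap of 500 is an arbitrary, hypothetical default.
    """
    search.execute(client, get_all=False)
    total = search.tot_num_res
    if total == 0:
        raise ValueError('search returned no results; loosen the query')
    if total > max_results:
        raise ValueError(
            f'{total} results exceeds the cap of {max_results}; tighten the query'
        )
    search.execute(client, get_all=True)
    return total


class FakeSearch:
    """Stand-in for ElsSearch so the sketch runs offline."""

    def __init__(self, n_results):
        self._n_results = n_results
        self.tot_num_res = None
        self.calls = []

    def execute(self, client, get_all=False):
        # Record how the search was run and expose the result count
        self.calls.append(get_all)
        self.tot_num_res = self._n_results


demo_search = FakeSearch(n_results=1)
n_found = run_search_with_cap(demo_search, client=None)
print(n_found, 'papers found')
```

With a real `ElsSearch` and `ElsClient`, the same function would stop before the slower `get_all=True` request whenever a query is too broad.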


The table contains lots of columns, many of which are not necessary for our purposes.

In [7]:
from pprint import pprint
print(len(myDocSrch.results_df.columns), 'columns')
pprint(list(myDocSrch.results_df.columns))

24 columns
['@_fa',
 'link',
 'prism:url',
 'dc:identifier',
 'eid',
 'dc:title',
 'dc:creator',
 'prism:publicationName',
 'prism:issn',
 'prism:volume',
 'prism:issueIdentifier',
 'prism:pageRange',
 'prism:coverDate',
 'prism:coverDisplayDate',
 'prism:doi',
 'pii',
 'citedby-count',
 'affiliation',
 'prism:aggregationType',
 'subtype',
 'subtypeDescription',
 'source-id',
 'openaccess',
 'openaccessFlag']


Instead, we can select just the columns we care about like this:

In [8]:
necessary_columns = ['dc:title', 'dc:creator', 'affiliation', 'prism:publicationName', 'prism:doi', 'pii']
paper_info = myDocSrch.results_df[necessary_columns]
paper_info.head()

Unnamed: 0,dc:title,dc:creator,affiliation,prism:publicationName,prism:doi,pii
0,Electrorheological properties of a suspension ...,Choi H.,"[{'@_fa': 'true', 'affilname': 'Inha Universit...",Microporous and Mesoporous Materials,10.1016/S1387-1811(00)00167-0,S1387181100001670
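The `affiliation` column holds a list of dictionaries rather than plain text, so it usually needs flattening before use. The sketch below assumes each entry has an `affilname` key, as the truncated output above suggests. The helper name `affiliation_names` and the sample record are hypothetical, and any other keys in the real data are ignored.

```python
def affiliation_names(affiliations):
    """Return the affiliation names from one cell of the 'affiliation' column.

    Cells that are not lists (for example, missing values) give an
    empty list instead of raising an error.
    """
    if not isinstance(affiliations, list):
        return []
    return [
        entry['affilname']
        for entry in affiliations
        if isinstance(entry, dict) and entry.get('affilname')
    ]


# Hypothetical record shaped like the entries shown in the table above
sample_cell = [{'@_fa': 'true', 'affilname': 'Example University'}]
print(affiliation_names(sample_cell))
print(affiliation_names(float('nan')))
```

On the real data this could be applied column-wise, for instance with `paper_info['affiliation'].apply(affiliation_names)`.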


## Downloading the paper
Now that we've identified a paper that we want to download and text mine, we need to access it.
While Elsevier allows easy access to their articles through the `elsapy` interface, I've found that accessing the web version simplifies later text mining. Accordingly, we're going to download the paper using `requests`, alongside the paper identifier (PII) we found using `elsapy` and our access tokens.

This returns the manuscript as an XML document in binary form, which can be written to a file.

In [9]:
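# PII (publisher item identifier) of the first search result
paper = paper_info.loc[0, 'pii']

# Request the full text from the Elsevier article retrieval API
response = requests.get(
    f'https://api.elsevier.com/content/article/pii/{paper}',
    params={
        'apiKey': config['apikey'],
        'insttoken': config['insttoken']
    }
)

# Write the raw bytes of the response to an XML file named after the PII
with open(f'./{paper}.xml', 'wb') as f:
    f.write(response.content)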
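The cell above writes whatever the server returns, so a failed request (for example, one rejected for missing entitlements) would be saved as if it were the article. A minimal sketch of a guard is shown below. `save_article` is a hypothetical helper that relies only on the `status_code` and `content` attributes of the object returned by `requests.get`, and the demonstration uses a stand-in object instead of a real HTTP call.

```python
import os
import tempfile
from types import SimpleNamespace


def save_article(response, path):
    """Write the response body to `path` only if the request succeeded."""
    if response.status_code != 200:
        raise RuntimeError(
            f'download failed with HTTP status {response.status_code}'
        )
    with open(path, 'wb') as out_file:
        out_file.write(response.content)
    return len(response.content)


# Stand-in for a successful response; no network access is needed
fake_ok = SimpleNamespace(status_code=200, content=b'<article/>')

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.xml')
    n_bytes = save_article(fake_ok, demo_path)
    with open(demo_path, 'rb') as saved_file:
        saved_bytes = saved_file.read()

print(n_bytes, 'bytes written')
```

In the notebook, the same function could be called with the real response and `f'./{paper}.xml'` as the path.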

In [17]:
from lxml import etree

with open('./els_dtd.xml', 'r') as f:
    dtd = f.read()

print(dtd)

etree.DTD(dtd)


<!--    Elsevier Journal Article Input DTD version 5.6.0p1
        Public Identifier: -//ES//DTD journal article DTD version 5.6.0//EN//XML
        
        Copyright © 1993-2018 Elsevier B.V.
        This is open access material under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

        Documentation available at https://www.elsevier.com/locate/xml
-->
<!--    Supported doctypes: article, simple-article, book-review, exam

        Typical invocations:

        <!DOCTYPE article
          PUBLIC "-//ES//DTD journal article DTD version 5.6.0//EN//XML"
          "art560.dtd">

        <!DOCTYPE simple-article 
          PUBLIC "-//ES//DTD journal article DTD version 5.6.0//EN//XML"
          "art560.dtd">

        <!DOCTYPE book-review
          PUBLIC "-//ES//DTD journal article DTD version 5.6.0//EN//XML"
          "art560.dtd">

        <!DOCTYPE exam
          PUBLIC "-//ES//DTD journal article DTD version 5.6.0//EN//XML"
          "art560.dtd">

-->

<!-- includ

DTDParseError: error parsing DTD
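
The `DTDParseError` above is consistent with how `lxml` handles its argument: `etree.DTD` treats a plain string as a file name, not as DTD text, so passing the file's contents makes it look for a file with that name. A minimal sketch of the two accepted forms follows, using a toy DTD that is hypothetical and far simpler than the Elsevier one. Whether the full Elsevier DTD then parses may also depend on any further files it includes being available locally, which is not shown here.

```python
import io

from lxml import etree

# A toy DTD (hypothetical), standing in for the much larger Elsevier DTD
toy_dtd_text = (
    '<!ELEMENT article (title, para+)>\n'
    '<!ELEMENT title (#PCDATA)>\n'
    '<!ELEMENT para (#PCDATA)>\n'
)

# Option 1: wrap the DTD text in a file-like object
dtd = etree.DTD(io.StringIO(toy_dtd_text))

# Option 2 (not run here): pass the path itself, e.g. etree.DTD('./els_dtd.xml')

# Validate two small documents against the toy DTD
valid_doc = etree.fromstring(
    '<article><title>T</title><para>P</para></article>'
)
invalid_doc = etree.fromstring('<article><para>P</para></article>')

valid_result = dtd.validate(valid_doc)
invalid_result = dtd.validate(invalid_doc)
print(valid_result, invalid_result)
```

When validation fails, `dtd.error_log` records the reason, which helps when checking downloaded articles against the schema.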