# Importing documents from Open Science sources

---
## <a name="HAL">HAL</a> (France, UPHF)
---

This document shows the data provided by the HAL repository and how to retrieve it with a specific request.

---

 - Registration required: No
 - API: Yes
 - Cost: Free
 - Documentation: Yes (in French only)
 - Access: simple GET request
 - Bibliographic reference formats: BibTeX, EndNote, COinS
 - Response format: JSON (default), XML, Atom, CSV
 - Fields: many, including title, abstract, authors, origin (conference, journal, project, ...), date

In [1]:
# import for the progress bar
from tqdm import tqdm
#import for web requests
import requests

---
### Define the request
See ["type de champs" (field types)](https://api.archives-ouvertes.fr/docs/search/?schema=fields#fields) for details about the parameters.

In [16]:
#  HAL API URL to search documents (here focused on UPHF docs)
base_url = 'https://api.archives-ouvertes.fr/search/uphf/'
#  HAL API URL to search documents (here focused on ANR projects)
#base_url = 'https://api.archives-ouvertes.fr/search/anr/'

In [41]:
# search parameters
# query = agent-related and AI-related keywords
# fields to get = title, abstract, authors, conference, DOI, ...
# format = JSON (XML, BibTeX, ... are also possible)
# number of documents to get = set by 'rows' below
# the search can target keywords, abstracts, titles, ...
query = '(agent~ OR Bayesian~ OR learning~ OR constraint~ OR holonic~ OR artificial intelligence OR self-org~ OR intelligence~ OR self-adapt)'

# Fields to get back (title, abstract, ...) (see API doc)
details = 'title_s,authFullName_s,keyword_s,conferenceTitle_s,doiId_s,journalTitle_s,proceedings_s,bookTitle_s,abstract_s,submittedDateY_i,anrProjectAcronym_s,collIsParentOfColl_fs'

params = {
    'q': "*",  # query; "*" matches every document (use `query` above to restrict the search)
    'fl': details,  # fields to return
    'start': 0,  # offset of the first document
    'sort': 'submittedDateY_i desc',  # sort by year, descending (asc for ascending)
    'wt': 'json',  # output format (JSON here, but XML, BibTeX, ... are possible)
    'rows': 10000,  # number of documents to get
}
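
To see how such a parameter dictionary ends up in the request URL, here is a minimal sketch that encodes it with the standard library only. The helper name `build_search_url` and the search term `agent` are illustrative assumptions, not part of the HAL API.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_search_url(base_url, query, fields, rows=5, start=0, fmt="json"):
    """Return the full GET URL for a search (hypothetical helper)."""
    params = {
        "q": query,               # search expression
        "fl": ",".join(fields),   # fields to return
        "start": start,           # offset of the first document
        "rows": rows,             # number of documents
        "wt": fmt,                # output format
    }
    return f"{base_url}?{urlencode(params)}"

url = build_search_url(
    "https://api.archives-ouvertes.fr/search/uphf/",
    "agent",                      # assumed example term
    ["title_s", "doiId_s"],
)
# decode the query string again to check what was encoded
decoded = parse_qs(urlsplit(url).query)
print(url)
```

`requests.get(base_url, params=params)` performs an equivalent encoding internally, so this helper is only useful for inspecting or logging the URL.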

----
### Launch the request
A simple GET request queries the HAL repository and retrieves the requested documents in the format you have chosen.


In [23]:
# GET request
response = requests.get(base_url, params=params)
#response = requests.get(base_url)
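
Asking for 10000 rows in a single call may be slow or truncated. Assuming the API treats `start` and `rows` as an offset and a page size, as the parameters above suggest, the request could be split into pages. The sketch below only builds the per-page parameter dictionaries; `page_params` and the page size of 500 are hypothetical.

```python
def page_params(base_params, total, page_size=500):
    """Yield one parameter dict per page, covering `total` documents."""
    for start in range(0, total, page_size):
        page = dict(base_params)                      # copy, keep the original intact
        page["start"] = start                         # offset of this page
        page["rows"] = min(page_size, total - start)  # last page may be shorter
        yield page

pages = list(page_params({"q": "*", "wt": "json"}, total=1200))
for p in pages:
    print(p["start"], p["rows"])
```

Each dictionary can then be passed as `params=` to `requests.get`, and the `docs` lists of the successive answers concatenated.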


In [18]:
response

<Response [200]>

In [61]:
# WARNING: you should check that the request succeeded (200 = OK, 404 = Not Found, 102 = Processing)
if response.status_code == 200: 
    print("response in JSON format : ok")


response in JSON format : ok
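
The check above only handles the value 200. As a minimal sketch, the standard library can translate any status code into its official phrase, which helps when logging failures. The helper name `describe_status` is an assumption made for this example.

```python
from http import HTTPStatus

def describe_status(code):
    """Return (success flag, reason phrase) for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # code not registered in the standard library
        phrase = "Unknown"
    return 200 <= code < 300, phrase

for code in (200, 404, 102):
    print(code, describe_status(code))
```

With `requests`, `response.raise_for_status()` is an alternative that raises an exception on 4xx and 5xx answers.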


---
### Decode the result
Choose the decoder that matches the response format:


In [62]:
data = response.json()
# data is a dictionary containing the number of articles found and the list of articles

In [63]:
# from the response of the data, get the docs
articles = data["response"]["docs"]
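
Direct indexing raises a `KeyError` if the answer has an unexpected shape, for example an error message. A defensive variant could look like the sketch below. It assumes the total count is stored under `numFound` next to `docs`; the sample payload is made up for illustration.

```python
def unpack_search_response(data):
    """Return (total number of matches, list of documents) from a decoded answer."""
    body = data.get("response", {})
    return body.get("numFound", 0), body.get("docs", [])

# made-up payload with the assumed structure
sample = {"response": {"numFound": 2, "start": 0,
                       "docs": [{"title_s": ["A"]}, {"title_s": ["B"]}]}}
total, docs = unpack_search_response(sample)
print(total, len(docs))
```

Comparing `total` with `len(docs)` shows whether the `rows` limit cut the result set.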

In [52]:
len(articles)

2000

In [66]:
articles[1]

{'proceedings_s': '0',
 'title_s': ['Literatura, literatura infantil e valores: da inculcação ao questionamento',
  'Littérature, littérature de jeunesse et valeurs : de l’inculcation à l’interrogation'],
 'keyword_s': ['Littérature de jeunesse Valeurs',
  'Inculcation ou interrogation Ethique et esthétique'],
 'abstract_s': ['A literatura sempre foi e continua sendo atravessada pela questão dos valores. Apesar de seu relativo sucesso no século XX, os formalistas e os estruturalistas não conseguiram eliminar essa questão dos valores. Pelo contrário, os trabalhos recentes lhe consagram um lugar cada vez maior. Este lugar foi profundamente renovado, especialmente na literatura para a juventude, pois se trata sempre mais decisivamente de passar de uma postura de inculcação de valores a uma didática centrada sobre sua interrogação para combinar formação literária e formação pessoal. Como descrever e explicar essas evoluções? Por que valorizar os corpus e as abordagens que trabalham para co

In [54]:
# we keep only the non-empty details for each article
tab_details = details.split(',')
true_details = []
for article in articles:
    dict_details = {}
    for d in tab_details:
        value = article.get(d, '')
        if value != '':
            dict_details[d] = value
#            print(f"{d} : {dict_details[d]}")
    true_details.append(dict_details)
#    print("="*50 + "\n")
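
The same filtering can be written more compactly with a dictionary comprehension. This is only a sketch of an equivalent formulation; `keep_non_empty` is a hypothetical helper and the sample article is invented.

```python
def keep_non_empty(article, fields):
    """Keep only the requested fields that are present and not an empty string."""
    return {f: article[f] for f in fields if article.get(f, "") != ""}

fields = ["title_s", "doiId_s", "abstract_s"]
sample_article = {"title_s": ["A title"], "doiId_s": "", "docid": "42"}
filtered = keep_non_empty(sample_article, fields)
print(filtered)
```

Applied to the whole result set, it would read `[keep_non_empty(a, tab_details) for a in articles]`.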

In [55]:
len(true_details)

2000

In [58]:
true_details[0]

{'title_s': ['The Implementation of a Quality Management Standard in a Food SME: A Network Learning Perspective'],
 'authFullName_s': ['Zam-Zam Abdirahman', 'Loïc Sauvée'],
 'keyword_s': ['Food SME innovation learning network quality management standard',
  'Food SME',
  'Innovation',
  'Learning',
  'Network',
  'Quality management standard',
  'Food Consumption/Nutrition/Food Safety',
  'Food Security and Poverty',
  'Industrial Organization',
  'Research and Development/Tech Change/Emerging Technologies',
  'Food SME',
  'Innovation',
  'Learning',
  'Network',
  'Quality management standard'],
 'doiId_s': '10.22004/ag.econ.144857',
 'journalTitle_s': 'International Journal on Food System Dynamics',
 'abstract_s': ['In the modern agrifood economies, the development of quality management standards is crucial, and food small and medium enterprises (SMEs) usually face difficulties in implementing them. In this context, the aim of the article is two-fold. Firstly it is to craft an origi

---
## Save
We complete the data of each paper from its DOI (field, references).<br>
We then save the result in a CSV file with the headers `` 'title', 'keyword', 'abstract', 'field', 'authors', 'date', 'doi', 'references'``.
The objective is to store the data from the different repositories in the same format.
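
Some fields in the results shown earlier are lists (`title_s`, `keyword_s`) while others are plain strings (`journalTitle_s`, `doiId_s`). Calling `",".join(...)` on a string would insert a comma between every character, so a common format benefits from a small normalising step. The helper `as_text` below is a sketch under that assumption, not part of any library.

```python
def as_text(value, sep=","):
    """Flatten a field to a single string, whatever its original type."""
    if value is None:
        return ""
    if isinstance(value, (list, tuple)):
        return sep.join(str(v) for v in value)
    return str(value)

print(as_text(["Food SME", "Innovation"]))
print(as_text("International Journal on Food System Dynamics"))
print(as_text(2021))
```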



In [11]:
# we choose these data from the DOI record (more is available: PDF link if provided, ...)
doi_keys = ["title", "author", "subject", "reference"]

def get_info_from_doi(doi):
    """Return an extract of the Crossref metadata for a DOI."""
    doi_info = {}
    # drop a trailing dot, which is not part of the DOI
    if doi.endswith('.'):
        doi = doi[:-1]
    url = f"https://api.crossref.org/works/{doi}"
    response = requests.get(url)
#    print(response)
    # check for success
    if response.status_code == 200:
        # extract the JSON data
        data = response.json()
        data = data["message"]
        for key in doi_keys:
            if key in data:
                doi_info[key] = data[key]
    return doi_info
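
DOIs stored in a repository sometimes carry a resolver prefix or stray punctuation. As a hedged sketch, the URL construction could be isolated in a helper that cleans the DOI first. The name `crossref_url` and the list of prefixes are assumptions for illustration.

```python
from urllib.parse import quote

def crossref_url(doi):
    """Build the Crossref works URL from a possibly untidy DOI string."""
    doi = doi.strip().rstrip(".")            # surrounding spaces, trailing dot
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]          # keep only the bare DOI
    # percent-encode unusual characters but keep the '/' separator
    return "https://api.crossref.org/works/" + quote(doi, safe="/")

print(crossref_url("https://doi.org/10.22004/ag.econ.144857."))
```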


In [12]:
import pandas as pd

# define a row with the headers title, keyword, abstract, field, authors, date, doi, references
csv_content = [['title', 'keyword', 'abstract', 'field', 'authors', 'date', 'doi', 'references']]
# build one row per article
for article in tqdm(true_details):
    title = ""
    keyword = ""
    abstract = ""
    field = ""
    authors = ""
    date = ""
    references = ""
    doi = article.get('doiId_s', '')
    if doi != "":
        doi_values = get_info_from_doi(doi)
        # get the field "subject" if it exists
        if 'subject' in doi_values: field = doi_values['subject']
        # collect the DOIs of the cited references
        if 'reference' in doi_values:
            refs = doi_values['reference']
            for ref in refs:
                if 'DOI' in ref:
                    references = references + ref['DOI'] + ", "
    if 'title_s' in article: title = article['title_s']
    if 'keyword_s' in article: keyword = article['keyword_s']
    if 'abstract_s' in article: abstract = article['abstract_s']
    if 'authFullName_s' in article: authors = article['authFullName_s']
    if 'submittedDateY_i' in article: date = article['submittedDateY_i']
    # list-valued fields are flattened into comma-separated strings
    row = [",".join(title), ",".join(keyword), ",".join(abstract), ",".join(field), ",".join(authors), date, doi, references]
    csv_content.append(row)




100%|██████████| 2843/2843 [15:45<00:00,  3.01it/s]


In [13]:
#save it
df = pd.DataFrame(csv_content[1:], columns=csv_content[0])

df.to_csv('outputHAL.csv', index=False,  encoding="utf-8")