# Importing documents from Open Science sources

---
## Repositorio (Spain, UC)

 - Registration required: Yes
 - API: Yes (provided by the DSpace tool)
 - Cost: Free
 - Documentation: Yes
 - Access: simple GET request
 - Bibliographic reference: No (to be built from the data)
 - Response format: JSON
 - Fields: authors, abstract, date, ...


In [3]:
import requests

In [4]:
# Base URL of the REST API used to search documents in the repository
base_url = "https://repositorio.unican.es/rest/"

### Use of DSpace
To retrieve an article from DSpace, several steps are needed:
  1. retrieve the <span style="color:red">communities</span> of the repository
  2. select a <span style="color:red">community</span>
  3. retrieve the <span style="color:red">collections</span> of that community (lists of articles, books, patents, theses, ...)
  4. select a <span style="color:red">collection</span>
  5. retrieve the <span style="color:red">articles</span> of that collection
  6. select an <span style="color:red">article</span>
  7. retrieve the <span style="color:red">article</span> and its metadata

----
  1. retrieve the <span style="color:red">communities</span> of the repository:
    - append **communities/** to the REST URL: this returns a JSON list of all the communities of the repository
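
Every step in this notebook repeats the same pattern: append a path to the REST base URL, send a GET request, and decode the JSON body. A minimal sketch of a helper for that pattern follows. The names `fetch_json` and `get` are hypothetical (they do not appear in the notebook), and the 30-second timeout is an arbitrary assumption.

```python
def fetch_json(base_url, path, get=None, timeout=30):
    """GET base_url + path and return the decoded JSON body.

    `get` is any callable with the same interface as requests.get; it is
    injectable so the helper can be exercised without network access.
    """
    if get is None:
        import requests  # imported lazily so the sketch loads without it
        get = requests.get
    # Join the two parts with exactly one slash between them.
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    response = get(url, timeout=timeout)
    # Raise on 4xx/5xx responses instead of trying to parse an error page.
    response.raise_for_status()
    return response.json()
```

With this helper, `fetch_json(base_url, "communities/")` would stand in for a `requests.get(...)` call followed by `.json()`.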

In [5]:
# GET request
get_communities = requests.get(base_url + "communities/")

In [6]:
print("response from "+base_url + "communities/")
print(get_communities.content)
data = get_communities.json()

response from https://repositorio.unican.es/rest/communities/
b'[{"uuid":"2e6ca748-8c15-41cc-88ed-233f92787953","name":"Centro Internacional de Estudios Superiores del Espa\xc3\xb1ol CIESE-Comillas (Centro adscrito)","handle":"10902/6182","type":"community","expand":["parentCommunity","collections","subCommunities","logo","all"],"logo":null,"parentCommunity":null,"copyrightText":"","introductoryText":"","shortDescription":"","sidebarText":"","countItems":160,"collections":[],"link":"/rest/communities/2e6ca748-8c15-41cc-88ed-233f92787953","subcommunities":[]},{"uuid":"7508978b-1eda-4fc8-8529-4bf3d930d494","name":"Departamento de Administraci\xc3\xb3n de Empresas","handle":"10902/17","type":"community","expand":["parentCommunity","collections","subCommunities","logo","all"],"logo":null,"parentCommunity":null,"copyrightText":"","introductoryText":"El Departamento de Administraci\xc3\xb3n de Empresas centra su actividad docente e investigadora en el \xc3\xa1mbito de la gesti\xc3\xb3n de em

In [7]:
data

[{'uuid': '2e6ca748-8c15-41cc-88ed-233f92787953',
  'name': 'Centro Internacional de Estudios Superiores del Español CIESE-Comillas (Centro adscrito)',
  'handle': '10902/6182',
  'type': 'community',
  'expand': ['parentCommunity',
   'collections',
   'subCommunities',
   'logo',
   'all'],
  'logo': None,
  'parentCommunity': None,
  'copyrightText': '',
  'introductoryText': '',
  'shortDescription': '',
  'sidebarText': '',
  'countItems': 160,
  'collections': [],
  'link': '/rest/communities/2e6ca748-8c15-41cc-88ed-233f92787953',
  'subcommunities': []},
 {'uuid': '7508978b-1eda-4fc8-8529-4bf3d930d494',
  'name': 'Departamento de Administración de Empresas',
  'handle': '10902/17',
  'type': 'community',
  'expand': ['parentCommunity',
   'collections',
   'subCommunities',
   'logo',
   'all'],
  'logo': None,
  'parentCommunity': None,
  'copyrightText': '',
  'introductoryText': 'El Departamento de Administración de Empresas centra su actividad docente e investigadora en el á
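
The raw listing is long, so a compact view can help when browsing it. The sketch below assumes each community dictionary carries the `name`, `uuid` and `countItems` fields visible in the output above; `summarize_communities` is a hypothetical helper name.

```python
def summarize_communities(communities):
    """Return (name, uuid, countItems) tuples sorted by item count, largest first."""
    rows = [
        (c["name"], c["uuid"], c.get("countItems", 0))
        for c in communities
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Sample entry copied from the response shown above (other fields omitted).
sample = [
    {"uuid": "2e6ca748-8c15-41cc-88ed-233f92787953",
     "name": "Centro Internacional de Estudios Superiores del Español "
             "CIESE-Comillas (Centro adscrito)",
     "countItems": 160},
]
for name, uuid, count in summarize_communities(sample):
    print(f"{count:>6}  {uuid}  {name}")
```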

### Find the relevant communities
  2. **identify** the <span style="color:red">communities</span>
   - if the ids of the communities you are looking for are unknown, you have to browse the results...

   Ex. find the communities about computer science:

In [8]:
def find_id(data, term):
    """From a list of dictionaries having 'name' and 'uuid' fields,
       return the list of uuids of the entries whose name contains the term
       (case-sensitive match)."""
    id_list = []
    for community in data:
        # keep the entry when the term occurs anywhere in its name
        if term in community['name']:
            id_list.append(community['uuid'])
    return id_list
    

In [9]:
# the term to search for is "Computación"; this exact spelling has to be known in advance
id_communities = find_id(data, "Computación")
print(id_communities)

['9ead14f9-a142-48c5-8ee8-d0448574e493', '92aaba72-8d31-41ca-a1b9-0e6df50b759b']
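
Because `find_id` matches the exact spelling, a search for "computacion" or "COMPUTACIÓN" would return nothing. A possible variant, sketched below, compares names after removing accents and case. The helper names `normalize` and `find_id_normalized` are hypothetical, and the approach relies only on Python's standard `unicodedata` module.

```python
import unicodedata


def normalize(text):
    """Strip diacritics and case, e.g. 'Computación' becomes 'computacion'."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()


def find_id_normalized(data, term):
    """Like find_id, but ignores case and accents in both the name and the term."""
    needle = normalize(term)
    return [entry["uuid"] for entry in data if needle in normalize(entry["name"])]
```

The trade-off is a looser match: terms that differ only by an accent are treated as identical.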


### Find the collections
  3. retrieve the <span style="color:red">**collections**</span> (lists of articles, books, patents, theses, ...)
    - append **communities/[id of the selected community]/collections** to the REST URL: this returns a JSON list of all the collections of the community

In [10]:
# GET request for the 2nd community (for example)
address = base_url + "communities/" + id_communities[1] + "/collections"
get_collections = requests.get(address)

In [11]:
print("response from "+address)
data = get_collections.json()
for d in data: print(d['name'])

response from https://repositorio.unican.es/rest/communities/92aaba72-8d31-41ca-a1b9-0e6df50b759b/collections
D21 Artículos
D21 Congresos
D21 Documentos de trabajo
D21 Libros y partes de libros
D21 Patentes
D21 Proyectos de Investigación
D21 Tesis


### Get the collection
  4. **identify** the <span style="color:red">collection id</span>
   - if the id of the collection you are looking for is unknown, you have to browse the results...

   Ex. find the collection of articles:

In [12]:
# the term to search for to get the articles is "Artículos"
id_collections = find_id(data, "Artículos")
print(id_collections)

['f8ba545d-52f6-415d-9211-2b9a1b974a18']


### Find the articles
  5. retrieve the <span style="color:red">**articles**</span> of a collection of a community<br>
    - append **collections/[id of the selected collection]/items** to the REST URL: this returns a JSON list of the items of the collection


In [13]:
# GET request for the articles of the collection
address = base_url + "collections/" + id_collections[0] + "/items"
get_items = requests.get(address)

In [14]:
print("response from "+address)
data = get_items.json()
for d in data: print(d['name'])

response from https://repositorio.unican.es/rest/collections/f8ba545d-52f6-415d-9211-2b9a1b974a18/items
Sobre un contraejemplo a la conjetura de Hirsch
The Covering Radius and a Discrete Surface Area for Non-Hollow Simplices
Simplified procedure for efficient and unbiased population size estimation
XLVIII Olimpiada Matemática Española, Santander, 22 al 25 de marzo de 2012
A unified approach towards the impossibility of finite time vanishing depth for incompressible free boundary flows
Blowups with log canonical singularities
Union bound minimization approach for designing grassmannian constellations
Some issues on the automatic computation of plane envelopes in interactive environments
A Framework for Identifying Sequences of Interactions That Cause Usability Problems in Collaborative Systems
Models for the Assessment of Treatment Improvement: The Ideal and the Feasible
El problema de la galería de arte: elaboración de conjeturas y demostraciones con GeoGebra
Análisis epistémico y cogn

### Get the article
  6. **identify** the <span style="color:red">article id</span>
   - if the id of the article you are looking for is unknown, you have to browse the results...

   Ex. find an article about 'human':

In [15]:
# the term to search for in the names of the articles is "Human"
id_articles = find_id(data, "Human")
print(id_articles)

['2537b71f-2e02-4da6-aa9e-cf2d8725a68e']


  7. retrieve the <span style="color:red">**article**</span> and its fields<br>
    - append **items/[id of the selected item]/metadata** to the REST URL: this returns a JSON list of the metadata linked to the article


In [16]:
# get the metadata of the article about Human
address = base_url + "items/" + id_articles[0] + "/metadata"
print("response from "+address)
get_metadata = requests.get(address)
data = get_metadata.json()

response from https://repositorio.unican.es/rest/items/2537b71f-2e02-4da6-aa9e-cf2d8725a68e/metadata


In [17]:
data

[{'key': 'dc.contributor.author',
  'value': 'Salomón García, Sergio',
  'language': '',
  'element': 'contributor',
  'qualifier': 'author',
  'schema': 'dc'},
 {'key': 'dc.contributor.author',
  'value': 'Duque Medina, Rafael',
  'language': '',
  'element': 'contributor',
  'qualifier': 'author',
  'schema': 'dc'},
 {'key': 'dc.contributor.author',
  'value': 'Bringas Tejero, Santos',
  'language': '',
  'element': 'contributor',
  'qualifier': 'author',
  'schema': 'dc'},
 {'key': 'dc.contributor.author',
  'value': 'Montaña Arnaiz, José Luis',
  'language': '',
  'element': 'contributor',
  'qualifier': 'author',
  'schema': 'dc'},
 {'key': 'dc.contributor.author',
  'value': 'Lage, Carmen',
  'language': '',
  'element': 'contributor',
  'qualifier': 'author',
  'schema': 'dc'},
 {'key': 'dc.contributor.other',
  'value': 'Universidad de Cantabria',
  'language': 'es_ES',
  'element': 'contributor',
  'qualifier': 'other',
  'schema': 'dc'},
 {'key': 'dc.date.accessioned',
  'val
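
The metadata comes back as a flat list in which a key such as `dc.contributor.author` repeats once per value. Grouping the values by key makes the list easier to query. The sketch below assumes only the `key` and `value` fields shown in the output above; `group_metadata` is a hypothetical helper name.

```python
from collections import defaultdict


def group_metadata(metadata):
    """Collect the 'value' of every entry under its 'key', preserving order."""
    grouped = defaultdict(list)
    for entry in metadata:
        grouped[entry["key"]].append(entry["value"])
    return dict(grouped)


# Entries copied from the output above (other fields omitted).
sample = [
    {"key": "dc.contributor.author", "value": "Salomón García, Sergio"},
    {"key": "dc.contributor.author", "value": "Duque Medina, Rafael"},
    {"key": "dc.contributor.other", "value": "Universidad de Cantabria"},
]
print(group_metadata(sample))
```

Every value is kept as a list, even for keys that occur once, so that callers do not need to special-case repeated fields.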

In [18]:
def get_data_article(data):
    """From the metadata of an article in the JSON format of repositorio.unican.es,
       print the authors, title, source, date and abstract."""
    authors = []
    title = ""
    abstract = ""
    source = ""
    date = ""

    for item in data:
        if item['key'] == 'dc.contributor.author':
            authors.append(item['value'])
        elif item['key'] == 'dc.title':
            title = item['value']
        elif item['key'] == 'dc.source':
            source = item['value']
        elif item['key'] == 'dc.date.issued':
            date = item['value']
        elif item['key'] == 'dc.description.abstract':
            abstract = item['value']

    print("Authors :", authors)
    print("Title :", title)
    print("In :", source)
    print("Date :", date)
    print("Abstract :", abstract)

In [19]:
get_data_article(data)

Authors : ['Salomón García, Sergio', 'Duque Medina, Rafael', 'Bringas Tejero, Santos', 'Montaña Arnaiz, José Luis', 'Lage, Carmen']
Title : A Convolutional Neural Network-Based Method for Human Movement Patterns Classification in Alzheimer?s Disease
In : Proceedings 2019, 31(1), 72
Date : 2019
Abstract : Alzheimer’s disease (AD) constitutes a neurodegenerative pathology that presents mobility disorders as one of its earliest symptoms. Current smartphones integrate accelerometers that can be used to collect mobility data of Alzheimer’s patients. This paper describes a method that processes these accelerometer data and a convolutional neural network (CNN) that classiﬁes the stage of the disease according to the mobility patterns of the patient. The method is applied in a case study with 35 Alzheimer’s patients, in which a classiﬁcation success rate of 91% was obtained
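
Since the repository does not provide a ready-made bibliographic reference, one has to be assembled from these fields. The sketch below is one possible layout, not a standard citation style. `build_reference` is a hypothetical helper, and it assumes the authors, title, source and date are available as values rather than only printed.

```python
def build_reference(authors, title, source, date):
    """Assemble a plain-text reference; the layout is an arbitrary choice."""
    author_part = "; ".join(authors) if authors else "Unknown author"
    # Put the year right after the authors when it is known.
    parts = [f"{author_part} ({date})." if date else f"{author_part}."]
    if title:
        parts.append(title.rstrip(".") + ".")
    if source:
        parts.append(source.rstrip(".") + ".")
    return " ".join(parts)


# Values taken from the output above (author list shortened).
print(build_reference(
    ["Salomón García, Sergio", "Duque Medina, Rafael"],
    "A Convolutional Neural Network-Based Method for Human Movement "
    "Patterns Classification in Alzheimer?s Disease",
    "Proceedings 2019, 31(1), 72",
    "2019",
))
```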


-----
### The DSpace documentation says that the REST API adopts the HAL format...
In the DSpace REST API, /items returns the first 100 articles:
https://repositorio.unican.es/rest/items/

Unfortunately, it is not possible to filter on the abstract, the title, etc.; only filters on the nature of the item are available.
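
Because a single call returns only the first page of results, walking through all the items needs pagination. The sketch below assumes the endpoint accepts an `offset` parameter alongside `limit`; this is an assumption that has not been checked against this repository. `iter_items` and `fetch_page` are hypothetical names.

```python
def iter_items(fetch_page, page_size=100):
    """Yield items page by page until a short or empty page is returned.

    `fetch_page(offset, limit)` must return a list of items; it is a
    hypothetical callable wrapping the actual GET request.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        # A page shorter than requested means there is nothing left.
        if len(page) < page_size:
            break
        offset += page_size
```

Here `fetch_page` could wrap a call such as `requests.get(base_url + "items", params={"limit": limit, "offset": offset}).json()`. Keeping the request outside the generator makes the paging logic easy to check without network access.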

In [26]:
base_url = "https://repositorio.unican.es/rest/"

In [27]:

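# query meant to find the term 'human' in the abstract (not used below)
query='abstract_t:(human)'


# filter parameters (not used: param_filters is overwritten in the next cell)
param_filters = {
        "filter" : "author",
        "operator" : "authority",
        "label" : "Paulo"
      }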

In [28]:
# GET request
param_filters = "?limit=3&expand=metadata&query=agent"
# Problem: the query parameter doesn't seem to work
url = base_url+"items"+param_filters
response = requests.get(url)
#response = requests.get(base_url)

In [29]:
print("response from "+url)
data = response.json()
len(data)

response from https://repositorio.unican.es/rest/items?limit=3&expand=metadata&query=agent


3

In [30]:
for d in data: print(d['name'])

Comportamiento como desulfurantes de diferentes sorbentes a baja (58ºC) y media (150ºC) temperatura
Electromagnetic behavior of plasmonic devices with nanoholes. Biosensing applications
Analysis of superregenerative oscillators in nonlinear mode


These results do not match the query.
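
Since the `query` parameter appears to have no effect, one workaround is to filter on the client side. The sketch below assumes that items requested with `expand=metadata` each carry a `metadata` list of `key`/`value` entries shaped like the per-item metadata shown earlier. That shape is an assumption, and `filter_by_abstract` is a hypothetical helper name.

```python
def filter_by_abstract(items, term):
    """Keep items whose dc.description.abstract contains the term (case-insensitive)."""
    needle = term.casefold()
    matches = []
    for item in items:
        # Items without expanded metadata are skipped silently.
        for entry in item.get("metadata") or []:
            if (entry.get("key") == "dc.description.abstract"
                    and needle in (entry.get("value") or "").casefold()):
                matches.append(item)
                break  # one matching abstract is enough for this item
    return matches
```

This only searches the items that were actually downloaded, so it has to be combined with paging through the collection to cover more than one page of results.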