# Importing documents from Open Science sources

---
## <a name="Repositorio">Repositorio</a> (Spain, UNICAN)
---

This document describes the data provided by Repositorio and shows how to retrieve it with specific requests.

---
 - Registration required: Yes
 - API: Yes (provided by the DSpace tool)
 - Cost: Free
 - Documentation: Yes (in Spanish, with an English translation)
 - Access: simple GET request
 - Bibliographic reference: No (must be built from the data)
 - Answer format: JSON

- Fields: authors, abstract, date, ...


---
The API works, but querying it is not straightforward; it takes several steps:
 - list the communities and choose the most relevant one
 - list the collections and choose the most relevant one
 - list the articles and select one

However, the website provides the list of all the articles as a CSV file. Since a CSV file is easy to process in Python, we use it first.

The code to interact with the API is available at [the bottom of this page](#queries).
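
The three API steps listed above can be chained in a single helper. The following is a minimal sketch, assuming the endpoint layout used at the bottom of this page (`communities/`, `communities/<uuid>/collections`, `collections/<uuid>/items`). The function name `find_items` and the `fetch_json` parameter are hypothetical; they are introduced here so that the logic can be tried without network access.

```python
def find_items(fetch_json, community_term, collection_term):
    """Walk communities -> collections -> items and return the items found.

    fetch_json is a hypothetical callable that takes a relative REST path and
    returns the parsed JSON. With requests it could be, for example:
        lambda path: requests.get(base_url + path).json()
    """
    items = []
    for community in fetch_json("communities/"):
        if community_term not in community["name"]:
            continue
        path = "communities/" + community["uuid"] + "/collections"
        for collection in fetch_json(path):
            if collection_term not in collection["name"]:
                continue
            items.extend(fetch_json("collections/" + collection["uuid"] + "/items"))
    return items


# Offline demonstration with made-up data (not real repository content).
fake_api = {
    "communities/": [
        {"uuid": "c1", "name": "Computación"},
        {"uuid": "c2", "name": "Historia"},
    ],
    "communities/c1/collections": [
        {"uuid": "k1", "name": "D21 Artículos"},
        {"uuid": "k2", "name": "D21 Tesis"},
    ],
    "collections/k1/items": [{"uuid": "i1", "name": "Example article"}],
}
found = find_items(fake_api.__getitem__, "Computación", "Artículos")
print([item["name"] for item in found])
```

Passing the fetch function as a parameter keeps the navigation logic separate from the HTTP calls, which makes it easier to test.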

---
### Use of CSV file
Here, we extract the data from the "publicaciones.csv" file and build a new CSV file with the headers `` 'title', 'keywords', 'abstract', 'field', 'authors', 'date', 'doi', 'references'``.
The objective is to store the data from the different repositories in one common format.

In [36]:
from tqdm import tqdm
import requests
import pandas as pd
doc_publicaciones = pd.read_csv("publicaciones.csv")
doc_publicaciones.keys()

Index(['title', 'dateIssued', 'publisher', 'issn', 'isbn', 'doi', 'abstract',
       'authorList', 'Relates_ResearchGroup'],
      dtype='object')

So the mapping is relatively simple. Only the domain (field) and the references are missing. These can be obtained from the DOI through the Crossref API.


In [37]:
# we keep only these fields from the DOI record (more are available, e.g. the PDF when provided)
doi_keys = ["title", "author", "subject", "reference"]

def get_info_from_doi(doi):
    """Return a subset of the Crossref metadata for a DOI ({} if the request fails)."""
    doi_info = {}
    # some DOIs in the CSV end with a stray dot
    doi = doi.rstrip('.')
    url = f"https://api.crossref.org/works/{doi}"
    try:
        # the timeout avoids blocking the whole run on one stalled request
        response = requests.get(url, timeout=30)
    except requests.RequestException:
        return doi_info
    # check if success
    if response.status_code == 200:
        # extract the JSON data
        data = response.json()["message"]
        for key in doi_keys:
            doi_info[key] = data.get(key, "")
    return doi_info
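
The trailing-dot fix above can be extended to other common DOI spellings. The following is a minimal sketch; the helper name `normalize_doi` is hypothetical, and whether "publicaciones.csv" actually contains DOIs written as URLs or with a `doi:` prefix is an assumption.

```python
def normalize_doi(raw):
    """Return a bare DOI string, or "" when the cell is empty or missing."""
    doi = str(raw).strip()
    # str(NaN) is "nan": this is how a missing CSV cell shows up after str().
    if doi.lower() in ("", "nan"):
        return ""
    # Drop a resolver prefix if the DOI is stored as a URL (assumed cases).
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Remove trailing dots, as get_info_from_doi does.
    return doi.rstrip(".")


print(normalize_doi("https://doi.org/10.1000/xyz."))
```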


In [41]:
# we create the new entries from those of Repositorio ('field' and 'references' are filled in afterwards)
#from ['title', 'dateIssued', 'publisher', 'issn', 'isbn', 'doi', 'abstract', 'authorList', 'Relates_ResearchGroup'],
#to ['title', 'keywords', 'abstract', 'field', 'authors', 'date', 'doi', 'references']
new_rows = []
for index, row in doc_publicaciones.iterrows():
    new_row = {}
    new_row["title"] = str(row['title']) if 'title' in row else ""
    new_row["keywords"] = ""
    new_row["abstract"] = str(row['abstract'])  if 'abstract' in row else ""
    new_row["date"] = str(row['dateIssued'])  if 'dateIssued' in row else ""
    new_row["authors"] = str(row['authorList'])  if 'authorList' in row else ""
    new_row["doi"] = str(row['doi'])  if 'doi' in row else ""
    new_rows.append(new_row)
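
Since the goal is one common format across repositories, the date column may also need normalizing. The following is a minimal sketch that converts dates to ISO 8601; the helper name `to_iso_date` is hypothetical, and the `DD/MM/YYYY` input format is an assumption based on the single sample value `'20/12/2024'` visible in the output on this page.

```python
from datetime import datetime


def to_iso_date(value):
    """Convert 'DD/MM/YYYY' to 'YYYY-MM-DD'; return the text unchanged otherwise."""
    text = str(value).strip()
    try:
        return datetime.strptime(text, "%d/%m/%Y").date().isoformat()
    except ValueError:
        # Not in the assumed format (e.g. a bare year): keep it as it is.
        return text


print(to_iso_date("20/12/2024"))  # 2024-12-20
print(to_iso_date("2019"))        # 2019
```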


In [42]:
# the field and the references are still missing: ask Crossref for them
for publication in tqdm(new_rows):
    field = []
    references = []
    doi = publication.get('doi', '')
    # str(NaN) gives "nan" for the rows without a DOI
    if doi != "" and doi != "nan":
        doi_values = get_info_from_doi(doi)
        # "subject" is a list of fields when it exists, "" otherwise
        field = doi_values.get('subject') or []
        # keep only the cited works that have a DOI
        for ref in doi_values.get('reference') or []:
            if 'DOI' in ref:
                references.append(ref['DOI'])
    publication['field'] = ",".join(field)
    publication['references'] = ", ".join(references)


100%|██████████| 8270/8270 [1:09:42<00:00,  1.98it/s]
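
The run above took more than an hour for 8270 publications, so it can be worth storing the answers on disk and resuming after an interruption. The following is a minimal sketch of such a cache; the function names and the idea of a JSON cache file are assumptions, not part of the original notebook, and `fetch` stands for any lookup function such as `get_info_from_doi`.

```python
import json
import os


def load_cache(path):
    """Load previously stored answers, or start with an empty cache."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    return {}


def save_cache(cache, path):
    """Write the cache to disk as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(cache, handle, ensure_ascii=False)


def cached_lookup(doi, cache, fetch):
    """Return the stored answer for doi, calling fetch(doi) only on a miss."""
    if doi not in cache:
        cache[doi] = fetch(doi)
    return cache[doi]


# Offline demonstration with a fake lookup that records its calls.
calls = []

def fake_fetch(doi):
    calls.append(doi)
    return {"subject": ["made-up field"]}

cache = {}
cached_lookup("10.1000/a", cache, fake_fetch)
cached_lookup("10.1000/a", cache, fake_fetch)  # served from the cache
print(len(calls))
```

In the loop, `save_cache` could be called every few hundred publications so that little work is lost if the run stops.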


In [40]:
new_rows

[{'title': 'Design and verification of integrated circuitry for real-time frailty monitoring',
  'keywords': '',
  'abstract': 'In this study, a new wireless electronic circuitry to analyze weight distribution was designed and incorporated into a chair to gather data related to common human postures (sitting and standing up). These common actions have a significant impact on various motor capabilities, including gait parameters, fall risk, and information on sarcopenia. The quality of these actions lacks an absolute measurement, and currently, there is no qualitative and objective metric for it. To address this, the designed analyzer introduces variables like Smoothness and Percussion to provide more information and objectify measurements in the assessment of stand-up/sit-down actions. Both the analyzer and the proposed variables offer additional information that can objectify assessments depending on the clinical eye of the physicians.',
  'date': '20/12/2024',
  'authors': 'LUIS RODR

In [34]:
df = pd.DataFrame(new_rows)
df.to_csv('outputRepositori.csv', index=False, header=True,  encoding="utf-8")


----
### <a name="queries">Use of DSpace</a>
To interact with DSpace, several steps are needed:
  1. retrieve the <span style="color:red">communities</span> of the repository:
    - append **communities/** to the REST URL: this returns a list of all the communities of the repository, in JSON

In [63]:
# REST API base URL used to search documents in the repository
base_url = "https://repositorio.unican.es/rest/"

In [None]:
# GET request
get_communities = requests.get(base_url + "communities/")

In [None]:
print("response from "+base_url + "communities/")
print(get_communities.content)
data = get_communities.json()

response from https://repositorio.unican.es/rest/communities/
b'[{"uuid":"2e6ca748-8c15-41cc-88ed-233f92787953","name":"Centro Internacional de Estudios Superiores del Espa\xc3\xb1ol CIESE-Comillas (Centro adscrito)","handle":"10902/6182","type":"community","expand":["parentCommunity","collections","subCommunities","logo","all"],"logo":null,"parentCommunity":null,"copyrightText":"","introductoryText":"","shortDescription":"","sidebarText":"","countItems":160,"collections":[],"link":"/rest/communities/2e6ca748-8c15-41cc-88ed-233f92787953","subcommunities":[]},{"uuid":"7508978b-1eda-4fc8-8529-4bf3d930d494","name":"Departamento de Administraci\xc3\xb3n de Empresas","handle":"10902/17","type":"community","expand":["parentCommunity","collections","subCommunities","logo","all"],"logo":null,"parentCommunity":null,"copyrightText":"","introductoryText":"El Departamento de Administraci\xc3\xb3n de Empresas centra su actividad docente e investigadora en el \xc3\xa1mbito de la gesti\xc3\xb3n de em

In [None]:
data

[{'uuid': '2e6ca748-8c15-41cc-88ed-233f92787953',
  'name': 'Centro Internacional de Estudios Superiores del Español CIESE-Comillas (Centro adscrito)',
  'handle': '10902/6182',
  'type': 'community',
  'expand': ['parentCommunity',
   'collections',
   'subCommunities',
   'logo',
   'all'],
  'logo': None,
  'parentCommunity': None,
  'copyrightText': '',
  'introductoryText': '',
  'shortDescription': '',
  'sidebarText': '',
  'countItems': 160,
  'collections': [],
  'link': '/rest/communities/2e6ca748-8c15-41cc-88ed-233f92787953',
  'subcommunities': []},
 {'uuid': '7508978b-1eda-4fc8-8529-4bf3d930d494',
  'name': 'Departamento de Administración de Empresas',
  'handle': '10902/17',
  'type': 'community',
  'expand': ['parentCommunity',
   'collections',
   'subCommunities',
   'logo',
   'all'],
  'logo': None,
  'parentCommunity': None,
  'copyrightText': '',
  'introductoryText': 'El Departamento de Administración de Empresas centra su actividad docente e investigadora en el á

### Find the best communities
  2. **identify** the <span style="color:red">communities</span>
   - if the ids of the communities you are looking for are unknown, you have to browse the results...

   Ex. find the communities about computer science articles:

In [None]:
def find_id(data, term):
    """From a list of dictionaries having 'name' and 'uuid' fields,
       return the list of uuids of the entries
       whose name contains the term"""
    id_list = []
    for community in data:
        if term in community['name']:
            id_list.append(community['uuid'])
    return id_list
    

In [None]:
# the name to search for is "Computación"; you have to know this name in advance to find the community!
id_communities = find_id(data, "Computación")
print(id_communities)

['9ead14f9-a142-48c5-8ee8-d0448574e493', '92aaba72-8d31-41ca-a1b9-0e6df50b759b']
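
Because the exact spelling has to be known, a more tolerant match can help. The following is a minimal sketch of a variant that ignores case and accents; the names `fold` and `find_id_folded` are hypothetical, and the sample list is made up for illustration (only its structure follows the API answers shown above).

```python
import unicodedata


def fold(text):
    """Lowercase the text and strip diacritics ('Computación' -> 'computacion')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def find_id_folded(data, term):
    """Like find_id, but the comparison ignores case and accents."""
    needle = fold(term)
    return [entry["uuid"] for entry in data if needle in fold(entry["name"])]


# Made-up sample with the same 'uuid' / 'name' structure as the API answers.
sample = [
    {"uuid": "a", "name": "Ciencias de la Computación"},
    {"uuid": "b", "name": "Departamento de Administración de Empresas"},
]
print(find_id_folded(sample, "computacion"))  # ['a']
```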


### Find the collections
  3. retrieve the <span style="color:red">**collections**</span> (lists of articles, books, patents, theses, ...)
    - append **communities/[id of the selected community]/collections** to the REST URL: this returns a list of all the collections of the community, in JSON

In [None]:
# GET request for the 2nd community (for example)
address = base_url + "communities/" + id_communities[1] + "/collections"
get_collections = requests.get(address )

In [None]:
print("response from "+address)
data = get_collections.json()
for d in data: print(d['name'])

response from https://repositorio.unican.es/rest/communities/92aaba72-8d31-41ca-a1b9-0e6df50b759b/collections
D21 Artículos
D21 Congresos
D21 Documentos de trabajo
D21 Libros y partes de libros
D21 Patentes
D21 Proyectos de Investigación
D21 Tesis


### Get the collection
  4. **identify** the <span style="color:red">collection id</span>
   - if the id of the collection you are looking for is unknown, you have to browse the results...

   Ex. find the collection of articles:

In [None]:
#the name to search to get articles is "Artículos"
id_collections = find_id(data, "Artículos")
print(id_collections)

['f8ba545d-52f6-415d-9211-2b9a1b974a18']


### Find the articles
  5. retrieve the <span style="color:red">**articles**</span> from a collection of a community<br>
    - append **collections/[id of the selected collection]/items** to the REST URL: this returns a list of all the items of the collection, in JSON


In [None]:
# GET request for the articles of the collection
address = base_url + "collections/" + id_collections[0] + "/items"
get_items = requests.get(address)

In [None]:
print("response from "+address)
data = get_items.json()
for d in data: print(d['name'])

response from https://repositorio.unican.es/rest/collections/f8ba545d-52f6-415d-9211-2b9a1b974a18/items
Sobre un contraejemplo a la conjetura de Hirsch
The Covering Radius and a Discrete Surface Area for Non-Hollow Simplices
Simplified procedure for efficient and unbiased population size estimation
XLVIII Olimpiada Matemática Española, Santander, 22 al 25 de marzo de 2012
A unified approach towards the impossibility of finite time vanishing depth for incompressible free boundary flows
Blowups with log canonical singularities
Union bound minimization approach for designing grassmannian constellations
Some issues on the automatic computation of plane envelopes in interactive environments
A Framework for Identifying Sequences of Interactions That Cause Usability Problems in Collaborative Systems
Models for the Assessment of Treatment Improvement: The Ideal and the Feasible
El problema de la galería de arte: elaboración de conjeturas y demostraciones con GeoGebra
Análisis epistémico y cogn
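
A single request may return only part of a large collection. The following is a minimal sketch of a paging loop; `limit` is the parameter used further down this page, while the `offset` parameter is an assumption about this DSpace REST version and should be checked against its documentation. The names `iter_all` and `fetch_page` are hypothetical.

```python
def iter_all(fetch_page, page_size=100):
    """Yield every entry, requesting pages until one comes back short.

    fetch_page(limit, offset) is a hypothetical callable returning a list.
    With requests it could be, for example:
        lambda limit, offset: requests.get(
            address, params={"limit": limit, "offset": offset}).json()
    """
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        yield from page
        if len(page) < page_size:
            break
        offset += page_size


# Offline demonstration on a made-up list of 7 items, read 3 at a time.
fake_items = [{"name": f"item {i}"} for i in range(7)]

def fake_page(limit, offset):
    return fake_items[offset:offset + limit]

names = [entry["name"] for entry in iter_all(fake_page, page_size=3)]
print(len(names))
```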

### Get the article
  6. **identify** the <span style="color:red">article id</span>
   - if the id of the article you are looking for is unknown, you have to browse the results...

   Ex. find an article about 'human':

In [None]:
#the term to search is "Human" in the name of the articles
id_articles = find_id(data, "Human")
print(id_articles)

['2537b71f-2e02-4da6-aa9e-cf2d8725a68e']


  7. retrieve the <span style="color:red">**article**</span> and its fields<br>
    - append **items/[id of the selected item]/metadata** to the REST URL: this returns a list of the metadata linked to the article, in JSON


In [64]:
# get the metadata of the article about Human
address = base_url + "items/" + id_articles[0] + "/metadata"
print("response from "+address)
get_metadata = requests.get(address )
data = get_metadata.json()

response from https://repositorio.unican.es/rest/items/2537b71f-2e02-4da6-aa9e-cf2d8725a68e/metadata


In [65]:
data

[{'key': 'dc.contributor.author',
  'value': 'Salomón García, Sergio',
  'language': '',
  'element': 'contributor',
  'qualifier': 'author',
  'schema': 'dc'},
 {'key': 'dc.contributor.author',
  'value': 'Duque Medina, Rafael',
  'language': '',
  'element': 'contributor',
  'qualifier': 'author',
  'schema': 'dc'},
 {'key': 'dc.contributor.author',
  'value': 'Bringas Tejero, Santos',
  'language': '',
  'element': 'contributor',
  'qualifier': 'author',
  'schema': 'dc'},
 {'key': 'dc.contributor.author',
  'value': 'Montaña Arnaiz, José Luis',
  'language': '',
  'element': 'contributor',
  'qualifier': 'author',
  'schema': 'dc'},
 {'key': 'dc.contributor.author',
  'value': 'Lage, Carmen',
  'language': '',
  'element': 'contributor',
  'qualifier': 'author',
  'schema': 'dc'},
 {'key': 'dc.contributor.other',
  'value': 'Universidad de Cantabria',
  'language': 'es_ES',
  'element': 'contributor',
  'qualifier': 'other',
  'schema': 'dc'},
 {'key': 'dc.date.accessioned',
  'val

In [10]:
def get_data_article(data):
    """from data about an article in the JSON format of repositorio.unican.es, get the authors, title and abstract"""
    authors = []
    title = ""
    abstract = ""
    source = ""
    date = ""

    for item in data:
        if item['key'] == 'dc.contributor.author':
            authors.append(item['value'])
        elif item['key'] == 'dc.title':
            title = item['value']
        elif item['key'] == 'dc.source':
            source = item['value']
        elif item['key'] == 'dc.date.issued':
            date = item['value']
        elif item['key'] == 'dc.description.abstract':
            abstract = item['value']

    print("Authors :", authors)
    print("Title :", title)
    print("In :", source)
    print("Date :", date)
    print("Abstract :", abstract)

In [69]:
get_data_article(data)

Authors : ['Salomón García, Sergio', 'Duque Medina, Rafael', 'Bringas Tejero, Santos', 'Montaña Arnaiz, José Luis', 'Lage, Carmen']
Title : A Convolutional Neural Network-Based Method for Human Movement Patterns Classification in Alzheimer?s Disease
In : Proceedings 2019, 31(1), 72
Date : 2019
Abstract : Alzheimer’s disease (AD) constitutes a neurodegenerative pathology that presents mobility disorders as one of its earliest symptoms. Current smartphones integrate accelerometers that can be used to collect mobility data of Alzheimer’s patients. This paper describes a method that processes these accelerometer data and a convolutional neural network (CNN) that classiﬁes the stage of the disease according to the mobility patterns of the patient. The method is applied in a case study with 35 Alzheimer’s patients, in which a classiﬁcation success rate of 91% was obtained
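
To reuse these metadata with the CSV workflow, the values can be returned instead of printed. The following is a minimal sketch that maps a metadata list to the common columns defined earlier; the function name `metadata_to_record` and the `"; "` separator for authors (chosen because the names themselves contain commas) are assumptions. Only the keys already handled by `get_data_article` are used.

```python
def metadata_to_record(metadata):
    """Map a DSpace metadata list to a row in the common CSV format."""
    record = {
        "title": "", "keywords": "", "abstract": "", "field": "",
        "authors": "", "date": "", "doi": "", "references": "",
    }
    authors = []
    for entry in metadata:
        key, value = entry["key"], entry["value"]
        if key == "dc.contributor.author":
            authors.append(value)
        elif key == "dc.title":
            record["title"] = value
        elif key == "dc.date.issued":
            record["date"] = value
        elif key == "dc.description.abstract":
            record["abstract"] = value
    record["authors"] = "; ".join(authors)
    return record


# Shortened sample in the structure shown above.
sample_metadata = [
    {"key": "dc.contributor.author", "value": "Salomón García, Sergio"},
    {"key": "dc.contributor.author", "value": "Duque Medina, Rafael"},
    {"key": "dc.date.issued", "value": "2019"},
]
print(metadata_to_record(sample_metadata)["authors"])
```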


-----
### The DSpace documentation says that the REST API adopts the HAL format...
We try the following request:

https://repositorio.unican.es/rest/filtered-items/?q=Human

In [1]:
base_url = "https://repositorio.unican.es/rest/"


In [41]:

# query to find 'human' in the abstract
query='abstract_t:(human)'


param_filters = {
        "filter" : "author",
        "operator" : "authority",
        "label" : "Paulo"
      }

In [22]:
# GET request
param_filters = "?limit=3&expand=metadata&query=agent"
# Problem: the query parameter doesn't seem to work
url = base_url+"items"+param_filters
response = requests.get(url)
#response = requests.get(base_url)

In [23]:
print("response from "+url)
data = response.json()
len(data)

response from https://repositorio.unican.es/rest/items?limit=3&expand=metadata&query=agent


3

In [25]:
for d in data: print(d['name'])

Comportamiento como desulfurantes de diferentes sorbentes a baja (58ºC) y media (150ºC) temperatura
Electromagnetic behavior of plasmonic devices with nanoholes. Biosensing applications
Analysis of superregenerative oscillators in nonlinear mode


In [11]:
response.content



---
## <a name="SINUS">SINUS</a> 
The API is not easy to use, and some queries require a login:

- articles about multiagent:
https://sin.put.poznan.pl/search/publications?personName=&personId=-1&phrase=multiagent&page=1&organizationName=&organizationId=-1&type=all&disciplineCode=&yearFrom=2004&yearTo=2024&waitingForApproval=false&hasContent=false&perPage=100&sortBy=publication-date-desc
- see the API documentation: https://docs.psnc.pl/display/SIN1/SIN+API
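
The search URL above can be rebuilt for other phrases. The following is a minimal sketch that reproduces the same query string with the standard library; the function name `sinus_search_url` is hypothetical, the parameter meanings are inferred from their names only, and this targets the website search page, not a documented API endpoint.

```python
from urllib.parse import urlencode


def sinus_search_url(phrase, year_from=2004, year_to=2024, page=1, per_page=100):
    """Build a search URL with the same parameters as the example above."""
    params = {
        "personName": "",
        "personId": -1,
        "phrase": phrase,
        "page": page,
        "organizationName": "",
        "organizationId": -1,
        "type": "all",
        "disciplineCode": "",
        "yearFrom": year_from,
        "yearTo": year_to,
        "waitingForApproval": "false",
        "hasContent": "false",
        "perPage": per_page,
        "sortBy": "publication-date-desc",
    }
    return "https://sin.put.poznan.pl/search/publications?" + urlencode(params)


print(sinus_search_url("multiagent"))
```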

