# Notebook 5.1 - Curation-Keywords (alpha release)

The final release of this notebook will implement the workflow defined in:

[Curating keywords](https://gitlab.gwdg.de/sshoc/marketplace-curation/-/issues/1#note_71056) GitLab issue

The notebook is composed of 4 sections:

0. Import the external libraries and load the MP dataset and the Google Sheet
1. Look for vocabulary terms with the value from the column *Keyword to map*
2. Look for the term in the column *Map to*
3. Go through all MP items to look for the *keywords-to-map* in the keyword-dynamic-property


## 0 Requirements to run this notebook

This section gives all the relevant information to "interact" with the MP data.

### 0.1 Libraries
*Several external libraries are needed to run the notebook.*

*In addition, a dedicated SSH Open Marketplace library, sshmarketplacelib, provides customised functions and can be imported with the standard Python import statement.*

*The imports needed to run this notebook are listed below.*

In [1]:
import numpy as np
import pandas as pd
import requests
# Import the MarketPlace library
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import eval as eva, helper as hel

### 0.2 Get the data



Get the MarketPlace dataset

In [2]:
mpdata = mpd()
df_tool_flat = mpdata.getMPItems("toolsandservices", True)
df_publication_flat = mpdata.getMPItems("publications", True)
df_trainingmaterials_flat = mpdata.getMPItems("trainingmaterials", True)
df_workflows_flat = mpdata.getMPItems("workflows", True)
df_datasets_flat = mpdata.getMPItems("datasets", True)

getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...


Get the list of keywords from the [gsheet](https://docs.google.com/spreadsheets/d/1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA/edit#gid=0)

In [3]:
sheet_id = '1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA'
sheet_name = 'Mappings'
url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}'
df_keywords=pd.read_csv(url)
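
The URL above works because the tab name `Mappings` contains no special characters. As a minimal sketch, assuming a tab name might contain spaces or other reserved characters, the name can be percent-encoded before it is placed in the query string. The helper `gsheet_csv_url` and the tab name `My Mappings` are hypothetical and only serve as illustration.

```python
from urllib.parse import quote


def gsheet_csv_url(sheet_id: str, sheet_name: str) -> str:
    """Build the CSV export URL for one tab of a Google Sheet (hypothetical helper)."""
    # Percent-encode the tab name so that spaces or '&' do not break the query string.
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/gviz/tq?tqx=out:csv&sheet={quote(sheet_name, safe='')}"
    )


# Same sheet and tab as in the cell above.
url = gsheet_csv_url("1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA", "Mappings")

# Hypothetical tab name with a space, to show the encoding.
spaced = gsheet_csv_url("SHEET_ID", "My Mappings")
```

The resulting string can be passed to `pd.read_csv` exactly as in the cell above.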

### 0.3 A look at the data

The first few rows of the gsheet:

In [19]:
df_keywords.head()

|  | Keyword to map | Map to | Comment |
|---|---|---|---|
| 0 | Linguistics | https://vocabs.acdh.oeaw.ac.at/oefosdiscipline... | Sprach- und Literaturwissenschaften |
| 1 | History | https://vocabs.acdh.oeaw.ac.at/oefosdiscipline... | Geschichte, Archäologie |
| 2 | Literature | https://vocabs.acdh.oeaw.ac.at/oefosdiscipline... | Sprach- und Literaturwissenschaften |
| 3 | Video | https://vocabs.sshopencloud.eu/vocabularies/me... | video |
| 4 | Text | https://vocabs.sshopencloud.eu/vocabularies/me... | text |


## 1 Look for vocabulary terms with the value from the column *Keyword to map*  

The function *getMPKeywordProperties(myKey)* is a custom function that calls the API endpoint

    GET https://marketplace-api.sshopencloud.eu/api/concept-search?types=keyword&q=VALUE

and returns the vocabulary terms for *myKey*.  

In the cell below, the returned dataset is filtered to keep only the terms that come from the vocabulary *sshoc-keyword* (vocabulary[code]=sshoc-keyword); the resulting dataset is then shown.


In [12]:
# In this example, the key searched for is taken from row `rown` of the Google Sheet
rown = 0
myKey = df_keywords.iloc[rown]['Keyword to map']
df_vocterms = mpdata.getMPKeywordProperties(myKey)
# Keep only the terms that belong to the sshoc-keyword vocabulary
df_vocterms = df_vocterms[df_vocterms.vocabulary == {'code': 'sshoc-keyword'}]
df_vocterms

|  | code | vocabulary | label | notation | uri | types | candidate |
|---|---|---|---|---|---|---|---|
| 0 | Linguistics | {'code': 'sshoc-keyword'} | Linguistics | Linguistics | https://vocabs.dariah.eu/sshoc-keyword/Linguis... | [{'code': 'keyword'}] | True |
| 1 | linguistics | {'code': 'sshoc-keyword'} | linguistics |  | https://vocabs.dariah.eu/sshoc-keyword/linguis... | [{'code': 'keyword'}] | False |
| 2 | computational-linguistics | {'code': 'sshoc-keyword'} | computational-linguistics |  | https://vocabs.dariah.eu/sshoc-keyword/computa... | [{'code': 'keyword'}] | False |
| 3 | historical-linguistics | {'code': 'sshoc-keyword'} | historical-linguistics |  | https://vocabs.dariah.eu/sshoc-keyword/histori... | [{'code': 'keyword'}] | False |
| 4 | Corpus+linguistics | {'code': 'sshoc-keyword'} | Corpus linguistics |  | https://vocabs.dariah.eu/sshoc-keyword/Corpus+... | [{'code': 'keyword'}] | True |
| 5 | linguistic-variation | {'code': 'sshoc-keyword'} | linguistic-variation |  | https://vocabs.dariah.eu/sshoc-keyword/linguis... | [{'code': 'keyword'}] | False |
| 6 | linguistic+resources | {'code': 'sshoc-keyword'} | linguistic resources |  | https://vocabs.dariah.eu/sshoc-keyword/linguis... | [{'code': 'keyword'}] | True |
| 7 | TEI+Lite+markup%2C+no+linguistic+annotation | {'code': 'sshoc-keyword'} | TEI Lite markup, no linguistic annotation |  | https://vocabs.dariah.eu/sshoc-keyword/TEI+Lit... | [{'code': 'keyword'}] | True |
| 8 | +syntactically+parsed%3B+Swedish+subset%3A+no+... | {'code': 'sshoc-keyword'} | syntactically parsed; Swedish subset: no ling... |  | https://vocabs.dariah.eu/sshoc-keyword/+syntac... | [{'code': 'keyword'}] | True |
| 9 | Finnish+subset%3A+MSD-tagged%2C+syntactically+... | {'code': 'sshoc-keyword'} | Finnish subset: MSD-tagged, syntactically pars... |  | https://vocabs.dariah.eu/sshoc-keyword/Finnish... | [{'code': 'keyword'}] | True |
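
For readers who want to see what the request and the vocabulary filter amount to without the library, here is a minimal sketch in plain Python. It only builds the request URL (nothing is sent) and filters a list of records by vocabulary code. The helper names are hypothetical, and the record layout (a `vocabulary` dictionary with a `code` key) is assumed from the output shown in this notebook, not from the API documentation.

```python
from urllib.parse import urlencode

API_BASE = "https://marketplace-api.sshopencloud.eu/api"


def concept_search_url(keyword: str) -> str:
    """Build the concept-search URL for a keyword (hypothetical helper, no request is sent)."""
    query = urlencode({"types": "keyword", "q": keyword})
    return f"{API_BASE}/concept-search?{query}"


def in_vocabulary(records, vocabulary_code):
    """Keep the records whose vocabulary code equals vocabulary_code."""
    return [
        r for r in records
        if (r.get("vocabulary") or {}).get("code") == vocabulary_code
    ]


# Cut-down sample records, with only the fields needed for the filter.
sample = [
    {"code": "Linguistics", "vocabulary": {"code": "sshoc-keyword"}},
    {"code": "6020", "vocabulary": {"code": "discipline"}},
]

search_url = concept_search_url("Linguistics")
kept = in_vocabulary(sample, "sshoc-keyword")
```

Comparing the nested `code` value explicitly avoids relying on how pandas compares a column of dictionaries with a dictionary literal.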


The set of vocabulary terms is then filtered to keep only those whose code matches the keyword exactly (case-insensitive).

In [13]:
df_vocterms = df_vocterms[df_vocterms.code.str.lower() == myKey.lower()]
df_vocterms

|  | code | vocabulary | label | notation | uri | types | candidate |
|---|---|---|---|---|---|---|---|
| 0 | Linguistics | {'code': 'sshoc-keyword'} | Linguistics | Linguistics | https://vocabs.dariah.eu/sshoc-keyword/Linguis... | [{'code': 'keyword'}] | True |
| 1 | linguistics | {'code': 'sshoc-keyword'} | linguistics |  | https://vocabs.dariah.eu/sshoc-keyword/linguis... | [{'code': 'keyword'}] | False |
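
Some codes in the vocabulary are URL-encoded (for example `Corpus+linguistics` or `+WSD`), so a lower-case comparison on the raw code only matches keywords without spaces or punctuation. The sketch below is a possible variation, not what the notebook currently does: it decodes the code and trims it before comparing. The helper names are hypothetical.

```python
from urllib.parse import unquote_plus


def normalise(term: str) -> str:
    """Decode '+' and %XX escapes, trim whitespace and fold case."""
    return unquote_plus(term).strip().casefold()


def exact_matches(codes, keyword):
    """Return the codes equal to keyword after normalisation."""
    target = normalise(keyword)
    return [c for c in codes if normalise(c) == target]


# Codes taken from the outputs shown in this notebook.
codes = [
    "Linguistics",
    "linguistics",
    "computational-linguistics",
    "Corpus+linguistics",
    "+WSD",
]

plain = exact_matches(codes, "Linguistics")
spaced = exact_matches(codes, "corpus linguistics")
padded = exact_matches(codes, "WSD")
```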


## 2 Look for the term in the column *Map to*  

The function *getMPConcepts()* is a custom function that calls the API endpoint

    GET https://marketplace-api.sshopencloud.eu/api/concept-search?perpage=100&q=URI

to get all the *concepts* from the MarketPlace dataset.

**Note that executing this function may take some time: 14995 records are currently returned.**



In [14]:
df_concepts=mpdata.getMPConcepts()
df_concepts.count()

code          14995
vocabulary    14995
label         14995
notation      14995
uri           14995
types         14995
candidate     14995
definition      209
dtype: int64
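
With `perpage=100` and 14995 records, collecting everything takes on the order of 150 requests, which explains the running time. How *getMPConcepts()* pages through the results internally is not shown here, so the following is only a sketch of a generic paging loop. The `fetch_page` callable is a hypothetical stand-in for the real API call, and the page numbering starting at 1 is an assumption.

```python
import math


def page_count(total_records: int, per_page: int) -> int:
    """Number of pages needed to retrieve total_records."""
    return math.ceil(total_records / per_page)


def fetch_all(fetch_page, per_page=100):
    """Collect records page by page until a page comes back short."""
    records = []
    page = 1  # assumption: pages are numbered from 1
    while True:
        batch = fetch_page(page, per_page)
        records.extend(batch)
        if len(batch) < per_page:
            break
        page += 1
    return records


def fake_fetch_page(page, per_page, total=250):
    """Stand-in for the API call: returns integers instead of concepts."""
    start = (page - 1) * per_page
    return list(range(start, min(start + per_page, total)))


pages_needed = page_count(14995, 100)
all_records = fetch_all(fake_fetch_page)
```

Passing the fetch function as a parameter keeps the loop testable without network access.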

A look at the last few returned records:

In [15]:
df_concepts.tail()

|  | code | vocabulary | label | notation | uri | types | candidate | definition |
|---|---|---|---|---|---|---|---|---|
| 14990 | +CR-tagged | {'code': 'sshoc-keyword'} | CR-tagged |  | https://vocabs.dariah.eu/sshoc-keyword/+CR-tagged | [{'code': 'keyword'}] | True |  |
| 14991 | ID-glosses | {'code': 'sshoc-keyword'} | ID-glosses |  | https://vocabs.dariah.eu/sshoc-keyword/ID-glosses | [{'code': 'keyword'}] | True |  |
| 14992 | +WSD | {'code': 'sshoc-keyword'} | WSD |  | https://vocabs.dariah.eu/sshoc-keyword/+WSD | [{'code': 'keyword'}] | True |  |
| 14993 | paragraph+aligned | {'code': 'sshoc-keyword'} | paragraph aligned |  | https://vocabs.dariah.eu/sshoc-keyword/paragra... | [{'code': 'keyword'}] | True |  |
| 14994 | +PoS+tags | {'code': 'sshoc-keyword'} | PoS tags |  | https://vocabs.dariah.eu/sshoc-keyword/+PoS+tags | [{'code': 'keyword'}] | True |  |


Search for the concept whose *uri* attribute (column) equals the value of *Map to*. There should be exactly one such concept.

In [16]:
test3 = df_concepts[df_concepts.uri == df_keywords.iloc[rown]['Map to']]
test3.head()

|  | code | vocabulary | label | notation | uri | types | candidate | definition |
|---|---|---|---|---|---|---|---|---|
| 14031 | 6020 | {'code': 'discipline'} | Linguistics and Literature | 6020 | https://vocabs.acdh.oeaw.ac.at/oefosdiscipline... | [{'code': 'discipline'}] | False |  |


The *types* attribute contains the type code of the concept (here, *discipline*).
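
The expectation that exactly one concept matches the *Map to* URI can be checked explicitly rather than assumed. The sketch below works on a plain list of dictionaries; the helper `find_unique_by_uri` is hypothetical, and the `example.org` URIs are placeholders, not real vocabulary URIs.

```python
def find_unique_by_uri(concepts, uri):
    """Return the single concept with this URI, or raise if there are zero or several."""
    matches = [c for c in concepts if c.get("uri") == uri]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one concept for {uri!r}, found {len(matches)}"
        )
    return matches[0]


# Placeholder records: the URIs are made up for the example.
concepts = [
    {"code": "6020", "uri": "https://example.org/vocab/6020"},
    {"code": "linguistics", "uri": "https://example.org/vocab/linguistics"},
]

concept = find_unique_by_uri(concepts, "https://example.org/vocab/6020")
```

Failing loudly on zero or several matches makes a wrong or ambiguous *Map to* value visible before any item is changed.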

## 3 Go through all MP items to look for the *keywords-to-map* in the keyword-dynamic-property

For this we use the custom function *getAllPropertiesBySources()*, which returns all dynamic properties; for each property it also reports the main attributes and the PID of the item it belongs to.

In [17]:
utils = hel.Util()
resultfields = ['MPUrl', 'category', 'label', 'type.code', 'type.label', 'concept.code', 'concept.label', 'concept.uri', 'concept.vocabulary.scheme']
udf_alprop = utils.getAllPropertiesBySources()
udf_alprop = udf_alprop[resultfields]

Filter the dataset by the *Keyword to map* value and show the first rows of the result.

In [18]:
udf_alprop = udf_alprop.reset_index()
udf_alprop[udf_alprop['concept.label'].str.lower() == myKey.lower()].head()

|  | index | MPUrl | category | label | type.code | type.label | concept.code | concept.label | concept.uri | concept.vocabulary.scheme |
|---|---|---|---|---|---|---|---|---|---|---|
| 373 | 373 | tool-or-service/IJjaNl | tool-or-service | An alpha version of a lexicographical platform... | keyword | Keyword | linguistics | linguistics | https://vocabs.dariah.eu/sshoc-keyword/linguis... | https://vocabs.dariah.eu/sshoc-keyword/Schema |
| 1404 | 1404 | tool-or-service/pC5SBh | tool-or-service | brat rapid annotation tool | keyword | Keyword | linguistics | linguistics | https://vocabs.dariah.eu/sshoc-keyword/linguis... | https://vocabs.dariah.eu/sshoc-keyword/Schema |
| 2617 | 2617 | tool-or-service/KE90xs | tool-or-service | CorpusExplorer | keyword | Keyword | linguistics | linguistics | https://vocabs.dariah.eu/sshoc-keyword/linguis... | https://vocabs.dariah.eu/sshoc-keyword/Schema |
| 3257 | 3257 | tool-or-service/wa0xoI | tool-or-service | DEREDEC | keyword | Keyword | linguistics | linguistics | https://vocabs.dariah.eu/sshoc-keyword/linguis... | https://vocabs.dariah.eu/sshoc-keyword/Schema |
| 4207 | 4207 | tool-or-service/3m4luN | tool-or-service | EURAC: Extended Linguistic Dependency Diagrams... | keyword | Keyword | linguistics | linguistics | https://vocabs.dariah.eu/sshoc-keyword/linguis... | https://vocabs.dariah.eu/sshoc-keyword/Schema |
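
The rows above are properties, so one item can appear several times. As a minimal sketch of how the affected items could be listed once each, the function below collects the distinct `MPUrl` values of the keyword properties whose label matches. It works on plain dictionaries with the same column names as the dataset; the helper name is hypothetical, restricting the match to `type.code == "keyword"` is an added assumption, and two of the sample rows are made up and marked as such.

```python
def items_using_keyword(properties, keyword):
    """Return the sorted, distinct MPUrl values of items carrying the keyword."""
    target = keyword.casefold()
    urls = {
        p["MPUrl"]
        for p in properties
        if p.get("type.code") == "keyword"              # assumption: only keyword properties count
        and isinstance(p.get("concept.label"), str)      # skip rows with a missing label
        and p["concept.label"].casefold() == target
    }
    return sorted(urls)


sample_properties = [
    {"MPUrl": "tool-or-service/IJjaNl", "type.code": "keyword", "concept.label": "linguistics"},
    {"MPUrl": "tool-or-service/pC5SBh", "type.code": "keyword", "concept.label": "linguistics"},
    # Hypothetical rows, added only to exercise the filter:
    {"MPUrl": "tool-or-service/pC5SBh", "type.code": "keyword", "concept.label": "annotation"},
    {"MPUrl": "tool-or-service/KE90xs", "type.code": "keyword", "concept.label": None},
]

affected = items_using_keyword(sample_properties, "Linguistics")
```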


To be completed