# Notebook 5.1 - Curation-Keywords (alpha release)

The final release of this notebook will implement the workflow defined in:

[Curating keywords](https://gitlab.gwdg.de/sshoc/marketplace-curation/-/issues/1#note_71056) GitLab issue

The notebook is composed of 4 sections:

0. Import the external libraries and load the MP dataset and the Google Sheet
1. Look for vocabulary terms with the value from the column *Keyword to map*
2. Look for the term in the column *Map to*
3. Go through all MP items to look for the *keywords-to-map* in the keyword-dynamic-property


## 0 Requirements to run this notebook

This section gives all the relevant information to "interact" with the MP data.

### 0.1 libraries
*A number of external libraries are needed to run the notebook.*

*In addition, a dedicated SSH Open Marketplace library, sshmarketplacelib, provides customised functions and can be imported with the usual Python import statements.*

*The imports needed to run this notebook are listed below.*

In [1]:
import numpy as np
import pandas as pd
import requests
# Import the SSH Open Marketplace library
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import eval as eva, helper as hel

### 0.2 Get the data



Get the MarketPlace dataset

In [2]:
mpdata = mpd()
df_tool_flat = mpdata.getMPItems("toolsandservices", True)
df_publication_flat = mpdata.getMPItems("publications", True)
df_trainingmaterials_flat = mpdata.getMPItems("trainingmaterials", True)
df_workflows_flat = mpdata.getMPItems("workflows", True)
df_datasets_flat = mpdata.getMPItems("datasets", True)

getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...


Get the list of keywords from the [gsheet](https://docs.google.com/spreadsheets/d/1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA/edit#gid=0)

In [3]:
sheet_id = '1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA'
sheet_name = 'Mappings'
url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}'
df_keywords=pd.read_csv(url)
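
The cell above interpolates the tab name directly into the URL, which works for `Mappings`. A minimal sketch of the same URL construction with the tab name percent-encoded, in case a tab name contains spaces; `build_gsheet_csv_url` is a hypothetical helper, not part of sshmarketplacelib:

```python
from urllib.parse import quote


def build_gsheet_csv_url(sheet_id: str, sheet_name: str) -> str:
    """Return the CSV export URL for one tab of a Google Sheet (hypothetical helper)."""
    # quote() percent-encodes spaces and other reserved characters in the tab name
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/gviz/tq?tqx=out:csv&sheet={quote(sheet_name)}"
    )


url = build_gsheet_csv_url("1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA", "Mappings")
# df_keywords = pd.read_csv(url)  # same call as in the cell above; needs network access
```

For the `Mappings` tab the resulting URL is identical to the one built in the cell above.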

### 0.3 A look at the data

A few lines of the gsheet

In [4]:
df_keywords.head()

| | Keyword to map | Map to | Comment |
|---|---|---|---|
| 0 | Linguistics | https://vocabs.acdh.oeaw.ac.at/oefosdiscipline... | Sprach- und Literaturwissenschaften |
| 1 | History | https://vocabs.acdh.oeaw.ac.at/oefosdiscipline... | Geschichte, Archäologie |
| 2 | Literature | https://vocabs.acdh.oeaw.ac.at/oefosdiscipline... | Sprach- und Literaturwissenschaften |
| 3 | Video | https://vocabs.sshopencloud.eu/vocabularies/me... | video |
| 4 | Text | https://vocabs.sshopencloud.eu/vocabularies/me... | text |


## 1 Look for vocabulary terms with the value from the column *Keyword to map*  

The function *getMPKeywordProperties(mKey)* is a custom function that calls the API endpoint

    GET https://marketplace-api.sshopencloud.eu/api/concept-search?types=keyword&q=VALUE

and returns the vocabulary terms for *mKey*.  

In the cell below, the returned dataset is filtered to keep only the terms that come from the vocabulary *sshoc-keyword* (vocabulary[code]=sshoc-keyword); the resulting dataset is then shown.


In [5]:
# In this example the keyword searched for is the one in row `rown` of the Google Sheet
rown=0
myKey=df_keywords.iloc[rown]['Keyword to map']
df_vocterms=mpdata.getMPKeywordProperties(myKey)
df_vocterms=df_vocterms[df_vocterms.vocabulary=={'code': 'sshoc-keyword'}]
df_vocterms

| | code | vocabulary | label | notation | uri | types | candidate |
|---|---|---|---|---|---|---|---|
| 0 | linguistics | {'code': 'sshoc-keyword'} | linguistics | | https://vocabs.dariah.eu/sshoc-keyword/linguis... | [{'code': 'keyword'}] | False |
| 1 | computational-linguistics | {'code': 'sshoc-keyword'} | computational-linguistics | | https://vocabs.dariah.eu/sshoc-keyword/computa... | [{'code': 'keyword'}] | False |
| 2 | historical-linguistics | {'code': 'sshoc-keyword'} | historical-linguistics | | https://vocabs.dariah.eu/sshoc-keyword/histori... | [{'code': 'keyword'}] | False |
| 3 | Corpus+linguistics | {'code': 'sshoc-keyword'} | Corpus linguistics | | https://vocabs.dariah.eu/sshoc-keyword/Corpus+... | [{'code': 'keyword'}] | True |
| 4 | linguistic-variation | {'code': 'sshoc-keyword'} | linguistic-variation | | https://vocabs.dariah.eu/sshoc-keyword/linguis... | [{'code': 'keyword'}] | False |
| 5 | linguistic+resources | {'code': 'sshoc-keyword'} | linguistic resources | | https://vocabs.dariah.eu/sshoc-keyword/linguis... | [{'code': 'keyword'}] | True |
| 6 | linguistic+technologies | {'code': 'sshoc-keyword'} | linguistic technologies | | https://vocabs.dariah.eu/sshoc-keyword/linguis... | [{'code': 'keyword'}] | True |
| 7 | linguistic+analysis | {'code': 'sshoc-keyword'} | linguistic analysis | | https://vocabs.dariah.eu/sshoc-keyword/linguis... | [{'code': 'keyword'}] | True |
| 8 | MSD-tags%2C+linguistic+standardness | {'code': 'sshoc-keyword'} | MSD-tags, linguistic standardness | | https://vocabs.dariah.eu/sshoc-keyword/MSD-tag... | [{'code': 'keyword'}] | True |
| 9 | TEI+Lite+markup%2C+no+linguistic+annotation | {'code': 'sshoc-keyword'} | TEI Lite markup, no linguistic annotation | | https://vocabs.dariah.eu/sshoc-keyword/TEI+Lit... | [{'code': 'keyword'}] | True |
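
The filter in the cell above compares each *vocabulary* cell with the whole dictionary `{'code': 'sshoc-keyword'}`, so it would stop matching if the dictionary ever carried additional keys. A minimal sketch of an alternative that compares only the nested code; the toy rows are shortened for illustration and `in_vocabulary` is a hypothetical helper:

```python
import pandas as pd

# Toy rows shaped like the output above (values shortened for illustration)
df = pd.DataFrame(
    {
        "code": ["linguistics", "6020"],
        "vocabulary": [{"code": "sshoc-keyword"}, {"code": "discipline"}],
    }
)


def in_vocabulary(df: pd.DataFrame, vocabulary_code: str) -> pd.DataFrame:
    """Keep the rows whose 'vocabulary' dict has the given 'code' (hypothetical helper)."""
    # Extract the nested code explicitly; non-dict cells (e.g. NaN) never match
    codes = df["vocabulary"].apply(
        lambda v: v.get("code") if isinstance(v, dict) else None
    )
    return df[codes == vocabulary_code]


keywords_only = in_vocabulary(df, "sshoc-keyword")
```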


The set of vocabulary terms is then filtered to keep only those whose code matches the keyword exactly (case-insensitive).

In [6]:
df_vocterms=df_vocterms[(df_vocterms.code).str.lower()==myKey.lower()]
df_vocterms

| | code | vocabulary | label | notation | uri | types | candidate |
|---|---|---|---|---|---|---|---|
| 0 | linguistics | {'code': 'sshoc-keyword'} | linguistics | | https://vocabs.dariah.eu/sshoc-keyword/linguis... | [{'code': 'keyword'}] | False |
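
Some codes in the earlier output are URL-encoded (for example `Corpus+linguistics` for the label "Corpus linguistics"), so a plain comparison on *code* would miss multi-word keywords. A minimal sketch, assuming the encoding is the standard form encoding that `urllib.parse.unquote_plus` reverses; `exact_matches` is a hypothetical helper:

```python
from urllib.parse import unquote_plus

import pandas as pd

# Toy codes copied from the output shown earlier
terms = pd.DataFrame(
    {
        "code": [
            "linguistics",
            "Corpus+linguistics",
            "MSD-tags%2C+linguistic+standardness",
        ]
    }
)


def exact_matches(terms: pd.DataFrame, keyword: str) -> pd.DataFrame:
    """Return rows whose decoded code equals the keyword, ignoring case."""
    key = keyword.strip().casefold()
    # '+' becomes a space and '%2C' becomes ',' before the comparison
    decoded = terms["code"].map(unquote_plus).str.casefold()
    return terms[decoded == key]


single_word = exact_matches(terms, "Linguistics")
multi_word = exact_matches(terms, "Corpus Linguistics")
```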


## 2 Look for the term in the column *Map to*  

The function *getMPConcepts()* is a custom function that calls the API endpoint

    GET https://marketplace-api.sshopencloud.eu/api/concept-search?perpage=100&q=URI

to get all the *concepts* from the MarketPlace dataset.

**Note that executing this function may take some time: 14995 records are currently returned.**



In [None]:
df_concepts=mpdata.getMPConcepts()
df_concepts.count()
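
With `perpage=100`, retrieving 14995 records takes about 150 requests, which explains the running time. How *getMPConcepts()* pages through the results is not shown in this notebook; the following is only a sketch of a generic paging loop. The response keys `pages` and `concepts` are assumptions, and the HTTP call is replaced by a stand-in so the example runs offline:

```python
from typing import Callable, Dict, List


def collect_all_pages(fetch_page: Callable[[int], Dict]) -> List[Dict]:
    """Call fetch_page(1), fetch_page(2), ... and concatenate the records.

    Assumes each response is a dict with a 'pages' total and a 'concepts'
    list; both key names are assumptions made for this sketch.
    """
    records: List[Dict] = []
    page = 1
    while True:
        response = fetch_page(page)
        records.extend(response.get("concepts", []))
        if page >= response.get("pages", 1):
            break
        page += 1
    return records


# Stand-in for the real HTTP call
fake_api = {
    1: {"pages": 2, "concepts": [{"code": "a"}, {"code": "b"}]},
    2: {"pages": 2, "concepts": [{"code": "c"}]},
}
all_concepts = collect_all_pages(lambda page: fake_api[page])
```

Passing the fetch function as an argument keeps the loop independent of the HTTP client, so it can be checked without network access.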

A look at a few of the returned records:

In [None]:
#df_concepts.tail()

Search for the concept whose *uri* column equals the *Map to* value. There should be exactly one such concept.

In [None]:
test3=df_concepts[df_concepts.uri==df_keywords.iloc[rown]['Map to']]
test3.head()
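
Since the following cells read `test3.iloc[0]`, an empty or ambiguous match would either raise an `IndexError` or silently pick the first of several rows. A minimal sketch of an explicit check; `single_match` is a hypothetical helper and the second toy URI is invented for illustration:

```python
import pandas as pd


def single_match(df: pd.DataFrame, column: str, value: str) -> pd.Series:
    """Return the only row where df[column] == value, or raise ValueError."""
    matches = df[df[column] == value]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one row with {column}={value!r}, found {len(matches)}"
        )
    return matches.iloc[0]


# Toy concepts table; the second URI is a placeholder
concepts = pd.DataFrame(
    {
        "code": ["6020", "other"],
        "uri": [
            "https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/6020",
            "https://example.org/vocab/other",
        ],
    }
)
row = single_match(
    concepts, "uri", "https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/6020"
)
```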

The attribute *types* contains the value of the type code

    {
      "type": {
        "code": "discipline"
      },
      "concept": {
        "code": "6020",
        "vocabulary": {
          "code": "discipline"
        },
        "uri": "https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/6020"
      }
    }

In [None]:
# Build the parts of the property from the single matching concept
jsonType = {"type": test3.iloc[0].vocabulary}
jsonConceptVal = {
    "code": test3.iloc[0].code,
    "vocabulary": test3.iloc[0].vocabulary,
    "uri": test3.iloc[0].uri,
}
jsonConcept = {"concept": jsonConceptVal}
# Display the concept part
jsonConcept
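
The two parts can also be assembled into one dictionary with the same shape as the JSON example above. A minimal sketch that follows the notebook in using the concept's *vocabulary* as the *type*; `build_concept_property` is a hypothetical helper, and the input row is a plain dict standing in for `test3.iloc[0]`:

```python
import json


def build_concept_property(concept) -> dict:
    """Assemble a {'type': ..., 'concept': ...} property from one concept row."""
    return {
        # As in the notebook, the type is taken from the concept's vocabulary
        "type": concept["vocabulary"],
        "concept": {
            "code": concept["code"],
            "vocabulary": concept["vocabulary"],
            "uri": concept["uri"],
        },
    }


# Stand-in for test3.iloc[0], using the values of the JSON example above
concept_row = {
    "code": "6020",
    "vocabulary": {"code": "discipline"},
    "uri": "https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/6020",
}
payload = build_concept_property(concept_row)
print(json.dumps(payload, indent=2))
```

Because the helper only uses key access, it should accept a pandas row as well as a dict.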

## 3 Go through all MP items to look for the *keywords-to-map* in the keyword-dynamic-property

For this we use the custom function *getAllPropertiesBySources()*, which returns all dynamic properties. For each property it also reports the main attributes and the PID of the item the property belongs to.

In [None]:
utils=hel.Util()
resultfields=['persistentId', 'MPUrl', 'category', 'label', 'type.code', 'type.label', 'concept.code', 'concept.label', 'concept.uri', 'concept.vocabulary.scheme']
udf_alprop=utils.getAllPropertiesBySources()
udf_alprop=udf_alprop[resultfields]

Filter the dataset using the *Keyword to map* value and show the result (or part of it).

In [None]:
udf_alprop.reset_index(inplace=True)
test=udf_alprop[udf_alprop['concept.label'].str.lower()==myKey.lower()]


In [None]:
test.iloc[0:1]
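
The filter above matches on *concept.label* only, so a property of another type that happens to carry the same label would also be selected. A minimal sketch that additionally restricts the match to keyword properties and lists the affected items; the persistent IDs are toy values, `items_with_keyword` is a hypothetical helper, and the type code `keyword` is taken from the *types* column shown in section 1:

```python
import pandas as pd

# Toy property table using column names from `resultfields`
props = pd.DataFrame(
    {
        "persistentId": ["a1", "a1", "b2", "c3"],
        "type.code": ["keyword", "discipline", "keyword", "keyword"],
        "concept.label": ["Linguistics", "Linguistics", "linguistics", None],
    }
)


def items_with_keyword(props: pd.DataFrame, keyword: str) -> pd.DataFrame:
    """Rows that are keyword properties whose label equals `keyword`, ignoring case."""
    labels = props["concept.label"].str.lower()
    mask = (props["type.code"] == "keyword") & (labels == keyword.lower())
    return props[mask]


hits = items_with_keyword(props, "Linguistics")
affected_items = sorted(hits["persistentId"].unique())
```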

In [None]:

# Property that replaces the keyword: type and concept come from the matched concept
jsonConceptVal = {
    "code": test3.iloc[0].code,
    "vocabulary": test3.iloc[0].vocabulary,
    "uri": test3.iloc[0].uri,
}
attrList = {"type": test3.iloc[0].vocabulary, "concept": jsonConceptVal}
# Filter passed to updateItems: the lower-cased keyword
filterList = {"concept": myKey.lower()}
# Apply the update to the first matching item only
mpdata.updateItems(test.iloc[0:1], attrList, filterList)
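
What *updateItems()* does internally is not shown in this notebook. As a dry run before writing anything back, the intended substitution can be simulated on a plain list of properties. This is only a sketch under assumptions: an item's properties are taken to be a list of `{"type": ..., "concept": ...}` dicts like the JSON example in section 2, and `replace_keyword_property` is a hypothetical helper:

```python
from copy import deepcopy
from typing import Dict, List


def replace_keyword_property(
    properties: List[Dict], keyword: str, new_property: Dict
) -> List[Dict]:
    """Return a new list where keyword properties labelled `keyword` are replaced."""
    key = keyword.lower()
    updated = []
    for prop in properties:
        is_keyword = prop.get("type", {}).get("code") == "keyword"
        label = prop.get("concept", {}).get("label") or ""
        if is_keyword and label.lower() == key:
            # Copy so later edits to one item do not leak into another
            updated.append(deepcopy(new_property))
        else:
            updated.append(prop)
    return updated


# Toy item properties and the replacement built from the matched concept
old_properties = [
    {"type": {"code": "keyword"}, "concept": {"label": "Linguistics"}},
    {"type": {"code": "keyword"}, "concept": {"label": "corpora"}},
]
new_property = {
    "type": {"code": "discipline"},
    "concept": {
        "code": "6020",
        "vocabulary": {"code": "discipline"},
        "uri": "https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/6020",
    },
}
new_properties = replace_keyword_property(old_properties, "linguistics", new_property)
```

The input list is left unchanged, so the old and new property lists can be compared before any update is sent.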

To be completed