# Notebook 5.1 - Curation-Keywords

This notebook implements the workflow defined in the GitLab issue [Curating keywords](https://gitlab.gwdg.de/sshoc/marketplace-curation/-/issues/1#note_71056).

The notebook works as follows:

0. Imports the external libraries and loads the MP dataset and the Google Sheet
1. Updates keywords on the MP as follows:
    1. Looks for vocabulary terms with the value from the column *Keyword to map*
    2. Looks for the concept whose URI is given in the column *Map to*
    3. Goes through all MP items to look for the *keywords-to-map* in the keyword dynamic property
    4. Replaces the *keywords-to-map* with the *Map to* concept in the MP dataset
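
The steps above can be sketched on toy data with plain Python dictionaries. This is only an illustration under stated assumptions, not the code of this notebook or of sshmarketplacelib: the helper names `build_update` and `plan_updates` and the item identifiers `item-1` and `item-2` are hypothetical, while the shape of the two dictionaries follows the "update parameters" printed later in this notebook.

```python
def build_update(concept, keyword):
    """Build the (updateList, filterList) pair for one mapping row."""
    update = {
        "type": concept["types"][0],
        "concept": {
            "code": concept["code"],
            "vocabulary": concept["vocabulary"],
            "uri": concept["uri"],
        },
    }
    # Items are matched on the lower-cased keyword label.
    item_filter = {"concept": keyword.lower()}
    return update, item_filter


def plan_updates(mappings, concepts_by_uri, item_properties):
    """Return one planned update per item property that carries a keyword to map."""
    planned = []
    for mapping in mappings:
        keyword = mapping["Keyword to map"]
        concept = concepts_by_uri.get(mapping["Map to"])
        if concept is None:
            continue  # no concept with this URI, so there is nothing to map to
        update, item_filter = build_update(concept, keyword)
        for prop in item_properties:
            if prop["concept.label"].lower() == keyword.lower():
                planned.append({
                    "persistentId": prop["persistentId"],
                    "updateList": update,
                    "filterList": item_filter,
                })
    return planned


# Toy inputs (hypothetical items; the concept values mirror the notebook output).
mappings = [{
    "Keyword to map": "Georeferencing > Enrichment-Annotation",
    "Map to": "https://vocabs.dariah.eu/tadirah/annotating",
}]
concepts_by_uri = {
    "https://vocabs.dariah.eu/tadirah/annotating": {
        "code": "annotating",
        "vocabulary": {"code": "tadirah2"},
        "uri": "https://vocabs.dariah.eu/tadirah/annotating",
        "types": [{"code": "activity"}],
    }
}
item_properties = [
    {"persistentId": "item-1", "concept.label": "georeferencing > enrichment-annotation"},
    {"persistentId": "item-2", "concept.label": "history"},
]

planned = plan_updates(mappings, concepts_by_uri, item_properties)
print(planned)
```

Only the planning is shown here; sending the planned updates to the MP is left to the library.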


## 0 Requirements to run this notebook

This section gives all the relevant information to "interact" with the MP data.

### 0.1 Libraries
*A number of external libraries are needed to run the notebook.*

*In addition, a dedicated SSH Open Marketplace library, sshmarketplacelib, provides customised functions and can be imported with the usual Python import commands.*

*The imports needed to run this notebook are listed below.*

In [1]:
import numpy as np
import pandas as pd
import requests
#import the MarketPlace Library 
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import  eval as eva, helper as hel

### 0.2 Get the data



Get the MarketPlace dataset

In [2]:
mpdata = mpd()
df_tool_flat = mpdata.getMPItems("toolsandservices", True)
df_publication_flat = mpdata.getMPItems("publications", True)
df_trainingmaterials_flat = mpdata.getMPItems("trainingmaterials", True)
df_workflows_flat = mpdata.getMPItems("workflows", True)
df_datasets_flat = mpdata.getMPItems("datasets", True)

getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...


Get the list of keywords from the [gsheet](https://docs.google.com/spreadsheets/d/1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA/edit#gid=0)

In [3]:
sheet_id = '1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA'
sheet_name = 'Mappings'
url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}'
df_keywords=pd.read_csv(url)
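
The sheet name is interpolated directly into the URL, which works for `Mappings` but would break for a tab name containing spaces or other reserved characters. A minimal sketch of a safer URL builder follows; the function name `gviz_csv_url` is hypothetical and not part of sshmarketplacelib.

```python
from urllib.parse import quote


def gviz_csv_url(sheet_id: str, sheet_name: str) -> str:
    """Build the CSV export URL of one tab, percent-encoding the tab name."""
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/gviz/tq?tqx=out:csv&sheet={quote(sheet_name)}"
    )


url = gviz_csv_url("1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA", "Mappings")
print(url)
```

The resulting string can be passed to `pd.read_csv` exactly as in the cell above.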

The function *getMPConcepts()* is a custom function that uses the API entry: 

GET https://marketplace-api.sshopencloud.eu/api/concept-search?perpage=100&q=URI

to get all the *concepts* from the MarketPlace dataset. 

**Note that executing this function may require some time**
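
A paginated search endpoint of this kind is typically read page by page, which is why the call can be slow. The loop below is a hedged sketch of that pattern and not the implementation of *getMPConcepts()*: the names `iter_concepts` and `fetch_page_http`, the `page` query parameter, and the `pages` and `concepts` response keys are assumptions that should be checked against the API documentation.

```python
from typing import Callable, Dict, Iterator

# Endpoint taken from the text above.
API_URL = "https://marketplace-api.sshopencloud.eu/api/concept-search"


def fetch_page_http(page: int, query: str = "", perpage: int = 100) -> Dict:
    """Fetch one result page over HTTP (assumes a 'page' query parameter)."""
    import requests  # imported lazily so the rest of the sketch runs offline

    response = requests.get(
        API_URL,
        params={"q": query, "perpage": perpage, "page": page},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def iter_concepts(fetch_page: Callable[[int], Dict], max_pages: int = 10_000) -> Iterator[Dict]:
    """Yield concepts page by page until the reported last page is reached.

    Assumes each page is a dict with a 'concepts' list and a 'pages' count.
    """
    page = 1
    while page <= max_pages:
        payload = fetch_page(page)
        yield from payload.get("concepts", [])
        if page >= payload.get("pages", 1):
            break
        page += 1


# Offline demonstration with two fake pages instead of real HTTP calls.
fake_pages = {
    1: {"pages": 2, "concepts": [{"code": "a"}, {"code": "b"}]},
    2: {"pages": 2, "concepts": [{"code": "c"}]},
}
codes = [concept["code"] for concept in iter_concepts(lambda p: fake_pages[p])]
print(codes)
```

With network access, `iter_concepts(fetch_page_http)` would iterate over the live results, provided the assumptions above hold.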



In [4]:
df_concepts=mpdata.getMPConcepts()

In [5]:
utils=hel.Util()
resultfields=['persistentId', 'MPUrl', 'category', 'label', 'type.code', 'type.label', 'concept.code', 'concept.label', 'concept.uri', 'concept.vocabulary.scheme']
udf_alprop=utils.getAllPropertiesBySources()
udf_alprop=udf_alprop.loc[ : ,resultfields]

A few lines of the item properties dataset (*udf_alprop*)

In [6]:
udf_alprop.head()

| | persistentId | MPUrl | category | label | type.code | type.label | concept.code | concept.label | concept.uri | concept.vocabulary.scheme |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SIU1nO | tool-or-service/SIU1nO | tool-or-service | 140kit | mode-of-use | Mode of use | webApplication | Web application | https://vocabs.sshopencloud.eu/vocabularies/in... | https://vocabs.sshopencloud.eu/vocabularies/in... |
| 1 | SIU1nO | tool-or-service/SIU1nO | tool-or-service | 140kit | activity | Activity | capturing | Capturing | https://vocabs.dariah.eu/tadirah/capturing | https://vocabs.dariah.eu/tadirah/ |
| 2 | SIU1nO | tool-or-service/SIU1nO | tool-or-service | 140kit | activity | Activity | dataVisualization | Data Visualization | https://vocabs.dariah.eu/tadirah/dataVisualiza... | https://vocabs.dariah.eu/tadirah/ |
| 3 | SIU1nO | tool-or-service/SIU1nO | tool-or-service | 140kit | activity | Activity | analyzing | Analyzing | https://vocabs.dariah.eu/tadirah/analyzing | https://vocabs.dariah.eu/tadirah/ |
| 4 | SIU1nO | tool-or-service/SIU1nO | tool-or-service | 140kit | activity | Activity | analyzing | Analyzing | https://vocabs.dariah.eu/tadirah/analyzing | https://vocabs.dariah.eu/tadirah/ |


In [7]:
#udf_alprop[udf_alprop['label']=='Manuscript Desk']

In [8]:

#df_concepts[df_concepts.uri=="https://vocabs.dariah.eu/tadirah/annotating"][0:20]

## 1 Update keywords

The function *getMPKeywordProperties(myKey)* used below is a custom function that uses the API entry

    GET https://marketplace-api.sshopencloud.eu/api/concept-search?types=keyword&q=VALUE

and returns the vocabulary terms for *mKey*.  

The returned dataset is then filtered to keep only the terms that come from the vocabulary *sshoc-keyword* (vocabulary[code]=sshoc-keyword).
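
The filtering can be illustrated on a toy DataFrame. The sketch below is one explicit way to express it and assumes, as in this notebook, that the `vocabulary` column holds dictionaries and the `code` column holds the term, possibly with an encoded `>` (`%3E`); the data and the function name `filter_keyword_terms` are hypothetical.

```python
import pandas as pd

# Hypothetical search result; the column names mirror those used in this notebook.
df_terms = pd.DataFrame({
    "code": [
        "Georeferencing %3E Enrichment-Annotation",
        "georeferencing",
        "Georeferencing",
    ],
    "vocabulary": [
        {"code": "sshoc-keyword"},
        {"code": "sshoc-keyword"},
        {"code": "tadirah2"},
    ],
})


def filter_keyword_terms(df, key, vocabulary_code="sshoc-keyword"):
    """Keep the terms of one vocabulary whose code equals key, ignoring case."""
    in_vocabulary = df["vocabulary"].apply(
        lambda v: isinstance(v, dict) and v.get("code") == vocabulary_code
    )
    # Decode the encoded '>' before comparing, as done in the cell below.
    decoded = df["code"].str.replace(" %3E ", " > ", regex=False).str.lower()
    return df.loc[in_vocabulary & (decoded == key.lower())]


matches = filter_keyword_terms(df_terms, "Georeferencing")
print(matches)
```

Here only the second row is kept: the first row has a different code, and the third belongs to another vocabulary.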


In [14]:

pd.options.mode.chained_assignment = None
selectedItems=pd.DataFrame()
#df_vocterms=pd.DataFrame()
for rown, row in df_keywords[201:208].iterrows():
    
    #read the value from the current row, which does not depend on the index of the sheet
    myKey=row['Keyword to map']
    #df_vocterms=pd.DataFrame()
    df_vocterms=mpdata.getMPKeywordProperties(myKey)
    print(f'\n***** checking {myKey}\n')
    if (df_vocterms.empty):
        print (f"#####  No vocabulary terms found for {myKey}")
        continue
    df_vocterms.loc[:]=df_vocterms.loc[(df_vocterms.vocabulary=={'code': 'sshoc-keyword'}), ]
    
    #The set of vocabulary terms is filtered to keep only the exact (case-insensitive) matches
    df_vocterms=df_vocterms.loc[(df_vocterms.code).str.replace(' %3E ', ' > ').str.lower()==myKey.lower()]

    #Search for the concept whose uri equals the value of the column Map to. There should be only one such concept.
    df_mapto=df_concepts.loc[(df_concepts.uri==row['Map to']),]
    #skip the row if no concept matches, otherwise iloc[0] below would raise an IndexError
    if (df_mapto.empty):
        print (f"#####  No concept found for {row['Map to']}")
        continue

    
    #Filter the dataset using the Keyword to map
    #(the additional filter on type.code=='keyword' is currently disabled)
    #copy() makes df_items independent of udf_alprop before new columns are added
    df_items=udf_alprop.loc[(udf_alprop['concept.label'].str.lower()==myKey.lower()), ].copy()
    
    if (df_items.empty):
        print (f"\n%%%%%%%%  No Items found for {myKey}")
        continue
    print (f'\n&&&&&  Found as {df_items.iloc[0]["type.code"]}')
    #update the MP
    jsonConcept={}
    jsonConceptVal={}
    
    jsonConceptVal["code"]=df_mapto.iloc[0].code
    jsonConceptVal["vocabulary"]=df_mapto.iloc[0].vocabulary
    jsonConceptVal["uri"]=df_mapto.iloc[0].uri
    #jsonConcept
   
    attrList={}
    attrList["type"]=df_mapto.iloc[0].types[0]
    attrList["concept"]=jsonConceptVal
    filterList={}
    filterList["concept"]=myKey.lower()
    print (f'update parameters: {attrList}, {filterList} \n')
    
    #attach the same update and filter parameters to every selected item
    df_items.loc[ : ,('updateList')]=[attrList for _ in range(df_items.shape[0])]
    df_items.loc[ : ,('filterList')]=[filterList for _ in range(df_items.shape[0])]
    
    #DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
    selectedItems=pd.concat([selectedItems, df_items])
    

attrList={}
filterList={}
selectedItems.head()
mpdata.updateItemsProperties(selectedItems)



***** checking historical geography and cartography


&&&&&  Found as keyword
update parameters: {'type': {'code': 'discipline'}, 'concept': {'code': '5070', 'vocabulary': {'code': 'discipline'}, 'uri': 'https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/5070'}}, {'concept': 'historical geography and cartography'} 


***** checking historical geography and cartography


&&&&&  Found as keyword
update parameters: {'type': {'code': 'discipline'}, 'concept': {'code': '507028', 'vocabulary': {'code': 'discipline'}, 'uri': 'https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/507028'}}, {'concept': 'historical geography and cartography'} 


***** checking Georeferencing > Enrichment-Annotation


&&&&&  Found as keyword
update parameters: {'type': {'code': 'activity'}, 'concept': {'code': 'annotating', 'vocabulary': {'code': 'tadirah2'}, 'uri': 'https://vocabs.dariah.eu/tadirah/annotating'}}, {'concept': 'georeferencing > enrichment-annotation'} 


***** checking Georeferencing > Enrichment-Annotat