# Notebook 5.1 - Curation-Keywords

This notebook implements the workflow defined in:

[Curating keywords](https://gitlab.gwdg.de/sshoc/marketplace-curation/-/issues/1#note_71056) GitLab issue

The notebook works as follows:

0. Imports the external libraries and loads the MP dataset and the Google Sheet
1. Updates keywords on the MP as follows:
    1. Looks up the vocabulary terms matching the value in the column *Keyword to map*
    2. Looks up the concept whose URI is the value in the column *Map to*
    3. Goes through all MP items to find the *keywords-to-map* in the keyword-dynamic-property
    4. Replaces the *keywords-to-map* with the mapped concepts in the MP dataset.
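
The replacement step can be pictured with a minimal sketch in plain Python. It is not the code used by the notebook: the property layout (a `type` and a `concept` label per property) and the helper `remap_keywords` are simplifying assumptions made only for illustration.

```python
def remap_keywords(properties, keyword, targets):
    """Replace every property whose concept label equals `keyword`
    (case-insensitive) with the entries in `targets`.

    `properties` is a hypothetical, simplified list of dicts; the real
    Marketplace items carry richer property objects.
    """
    remapped = []
    replaced = False
    for prop in properties:
        if prop["concept"].lower() == keyword.lower():
            # Insert the targets once, at the position of the first match.
            if not replaced:
                remapped.extend(targets)
                replaced = True
        else:
            remapped.append(prop)
    return remapped


# Illustrative data only, not taken from the Marketplace.
item_properties = [
    {"type": "keyword", "concept": "Annotation"},
    {"type": "keyword", "concept": "XML"},
]
targets = [
    {"type": "activity", "concept": "https://vocabs.dariah.eu/tadirah/annotating"}
]
updated = remap_keywords(item_properties, "annotation", targets)
print(updated)
```

Properties that do not match the keyword are kept unchanged and in their original order.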


## 0 Requirements to run this notebook

This section gives all the relevant information to "interact" with the MP data.

### 0.1 Libraries
*A number of external libraries are needed to run the notebook.*

*In addition, a dedicated SSH Open Marketplace library, sshmarketplacelib, provides customised functions and can be imported with the usual Python import statements.*

*The cell below imports the libraries needed to run this notebook.*

In [1]:
import numpy as np
import pandas as pd
import requests
# Import the Marketplace library
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import eval as eva, helper as hel

### 0.2 Get the data



Get the MarketPlace dataset

In [2]:
mpdata = mpd()
df_tool_flat = mpdata.getMPItems("toolsandservices", True)
df_publication_flat = mpdata.getMPItems("publications", True)
df_trainingmaterials_flat = mpdata.getMPItems("trainingmaterials", True)
df_workflows_flat = mpdata.getMPItems("workflows", True)
df_datasets_flat = mpdata.getMPItems("datasets", True)

getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...


Get the list of keywords from the [gsheet](https://docs.google.com/spreadsheets/d/1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA/edit#gid=0)

In [3]:
sheet_id = '1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA'
sheet_name = 'Mappings'
url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}'
df_keywords=pd.read_csv(url)

In [4]:
print (url)

https://docs.google.com/spreadsheets/d/1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA/gviz/tq?tqx=out:csv&sheet=Mappings
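
The same URL pattern is used again further down for the *Rejecting* tab, so it can be wrapped in a small helper. The sketch below is only a suggestion: the function name `gsheet_csv_url` is hypothetical, and percent-encoding the tab name is an addition that matters only when a tab name contains spaces or other reserved characters.

```python
from urllib.parse import quote


def gsheet_csv_url(sheet_id, sheet_name):
    """Build the CSV export URL for one tab of a Google Sheet.

    Hypothetical helper; it follows the URL pattern used in this notebook
    and percent-encodes the tab name.
    """
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/gviz/tq?tqx=out:csv&sheet={quote(sheet_name)}"
    )


url = gsheet_csv_url("1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA", "Mappings")
print(url)
```

For the tab names used here (`Mappings`, `Rejecting`) the result is identical to the hand-built f-string.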


In [5]:
df_keywords=df_keywords[df_keywords['Map to']!='delete']

In [6]:
df_grouped_keyword=df_keywords.groupby(['Keyword to map'])['Map to'].apply(list).reset_index(name='Maps')
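
The two cells above drop the rows marked `delete` and then collect all targets of one keyword into a single list, so that a keyword mapped to several concepts is handled in one pass. A minimal sketch on toy data (the rows below are illustrative, not a copy of the sheet):

```python
import pandas as pd

# Toy stand-in for the "Mappings" tab; values are illustrative.
toy = pd.DataFrame({
    "Keyword to map": ["Annotation", "Linked open data", "Linked open data", "old keyword"],
    "Map to": [
        "https://vocabs.dariah.eu/tadirah/annotating",
        "https://vocabs.dariah.eu/tadirah/publishing",
        "https://vocabs.dariah.eu/tadirah/linkedOpenData",
        "delete",
    ],
})

# Rows marked "delete" are excluded from the mapping step.
toy = toy[toy["Map to"] != "delete"]

# One row per keyword, with every target URI gathered in the "Maps" list.
grouped = (
    toy.groupby(["Keyword to map"])["Map to"]
    .apply(list)
    .reset_index(name="Maps")
)
print(grouped)
```

Each row of `grouped` then carries one keyword and the list of URIs it should be replaced with.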

In [7]:
df_keywords.head()

Unnamed: 0,Keyword to map,Map to,Comment,Discussion
2,activity - software development,https://vocabs.acdh.oeaw.ac.at/oefosdiscipline...,keyword format,
3,Aggregation,https://vocabs.dariah.eu/tadirah/aggregating,Aggregating,
4,Analysis,https://vocabs.dariah.eu/tadirah/analyzing,Analyzing,
5,Annotating,https://vocabs.dariah.eu/tadirah/annotating,Annotating,
6,Annotation,https://vocabs.dariah.eu/tadirah/annotating,Annotating,


The function *getMPConcepts()* is a custom function that uses the API endpoint:

GET https://marketplace-api.sshopencloud.eu/api/concept-search?perpage=100&q=URI

to get all the *concepts* from the MarketPlace dataset.

**Note that this function may take a long time to run**
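
When the search value is a URI, it has to be percent-encoded before it is placed in the query string. The sketch below only builds the request URL for the endpoint quoted above and sends nothing; the helper name `concept_search_url` is hypothetical and the internals of *getMPConcepts()* may differ.

```python
from urllib.parse import urlencode

API_BASE = "https://marketplace-api.sshopencloud.eu/api"


def concept_search_url(query, perpage=100):
    """Return the concept-search URL for `query` (hypothetical helper).

    Only the `perpage` and `q` parameters shown in this notebook are used.
    """
    params = {"perpage": perpage, "q": query}
    return f"{API_BASE}/concept-search?{urlencode(params)}"


search_url = concept_search_url("https://vocabs.dariah.eu/tadirah/annotating")
print(search_url)
```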



In [8]:
df_concepts=mpdata.getMPConcepts()

In [9]:
utils=hel.Util()
resultfields=['persistentId', 'MPUrl', 'category', 'label', 'type.code', 'type.label', 'concept.code', 'concept.label', 'concept.uri', 'concept.vocabulary.scheme']
udf_alprop=utils.getAllPropertiesBySources()
udf_alprop=udf_alprop.loc[ : ,resultfields]

In [10]:
udf_alprop.head()

Unnamed: 0,persistentId,MPUrl,category,label,type.code,type.label,concept.code,concept.label,concept.uri,concept.vocabulary.scheme
0,SIU1nO,tool-or-service/SIU1nO,tool-or-service,140kit,mode-of-use,Mode of use,webApplication,Web application,https://vocabs.sshopencloud.eu/vocabularies/in...,https://vocabs.sshopencloud.eu/vocabularies/in...
1,SIU1nO,tool-or-service/SIU1nO,tool-or-service,140kit,activity,Activity,capturing,Capturing,https://vocabs.dariah.eu/tadirah/capturing,https://vocabs.dariah.eu/tadirah/
2,SIU1nO,tool-or-service/SIU1nO,tool-or-service,140kit,activity,Activity,dataVisualization,Data Visualization,https://vocabs.dariah.eu/tadirah/dataVisualiza...,https://vocabs.dariah.eu/tadirah/
3,SIU1nO,tool-or-service/SIU1nO,tool-or-service,140kit,activity,Activity,analyzing,Analyzing,https://vocabs.dariah.eu/tadirah/analyzing,https://vocabs.dariah.eu/tadirah/
4,SIU1nO,tool-or-service/SIU1nO,tool-or-service,140kit,activity,Activity,analyzing,Analyzing,https://vocabs.dariah.eu/tadirah/analyzing,https://vocabs.dariah.eu/tadirah/


## 1 Update keywords

The function *getMPKeywordProperties(mKey)* used below is a custom function that uses the API endpoint

    GET https://marketplace-api.sshopencloud.eu/api/concept-search?types=keyword&q=VALUE

and returns the vocabulary terms for *mKey*.

The returned dataset is then filtered to keep only the values that come from the vocabulary *sshoc-keyword* (vocabulary[code]=sshoc-keyword).
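
The filtering can be sketched in plain Python as follows. The record layout (a `code` string and a `vocabulary` dict holding a `code`) is an assumption based on the filter expressions used in the cell below, and the vocabulary code `tadirah-activity` is purely illustrative.

```python
def exact_keyword_terms(terms, keyword):
    """Keep the terms from the sshoc-keyword vocabulary whose code matches
    `keyword` exactly, ignoring case (hypothetical helper)."""
    wanted = keyword.strip().lower()
    return [
        term
        for term in terms
        if term["vocabulary"]["code"] == "sshoc-keyword"
        # Codes may contain an encoded ">" separator, as handled in the notebook.
        and term["code"].replace(" %3E ", " > ").lower() == wanted
    ]


# Illustrative records only.
terms = [
    {"code": "Annotation", "vocabulary": {"code": "sshoc-keyword"}},
    {"code": "annotation tool", "vocabulary": {"code": "sshoc-keyword"}},
    {"code": "annotation", "vocabulary": {"code": "tadirah-activity"}},
]
matches = exact_keyword_terms(terms, "annotation")
print(matches)
```

Partial matches such as `annotation tool` and terms from other vocabularies are discarded.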


In [14]:
pd.options.mode.chained_assignment = None
selectedItems=pd.DataFrame()
df_items=pd.DataFrame()
#df_vocterms=pd.DataFrame()
for rown, row in df_grouped_keyword.iterrows():
    
    uk=df_grouped_keyword.iloc[rown]['Keyword to map']
    myKeys=df_grouped_keyword.iloc[rown]['Maps']
    #df_vocterms=pd.DataFrame()
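    jsonmapto=[]
    filterList={}
    # Reset per keyword so that a skipped mapping cannot reuse the items of the previous keyword
    df_items=pd.DataFrame()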
    for myKey in myKeys:
        myKey=myKey.strip()
        df_vocterms=mpdata.getMPKeywordProperties(myKey.strip())
        #print (f'n. vocabulary terms: {df_vocterms.shape[0]}')
        print(f'*****')
        print(f' Checking {uk}, {myKey}')
        if (df_vocterms.empty):
            print (f"#####  No vocabulary terms found for {myKey}")
            continue
        df_vocterms.loc[:]=df_vocterms.loc[(df_vocterms.vocabulary=={'code': 'sshoc-keyword'}), ]
    
        #The set of vocabulary terms is filtered to keep those with an exact (case-insensitive) match
        df_vocterms=df_vocterms.loc[(df_vocterms.code).str.replace(' %3E ', ' > ').str.lower()==myKey.lower()]

        #Search for the one concept that has in the uri column the value of the Map to. There should be only one such concept.
    
   
        df_mapto=df_concepts.loc[(df_concepts.uri==myKey),]
       
        if (df_mapto.empty):
            print (f"\n%%%%%%%%  No concept found for {myKey}")
            continue;
        
        #Filter the dataset using the Keyword to Map

        df_items=udf_alprop.loc[(udf_alprop['concept.label'].str.lower()==uk.lower()), ]#& (udf_alprop['type.code'].str.lower()=='keyword')]
        
        #print (f' n. prop: {df_items.shape[0]}')
        if (df_items.empty):
            print (f"\n No items found for {myKey}")
            continue;
        print (f'\n Found as {df_items.iloc[0]["type.code"]} - in {df_items.shape[0]} items')
        #update the MP
        jsonConcept={}
        jsonConceptVal={}
    
        jsonConceptVal["code"]=df_mapto.iloc[0].code
        jsonConceptVal["vocabulary"]=df_mapto.iloc[0].vocabulary
        jsonConceptVal["uri"]=df_mapto.iloc[0].uri
        #jsonConcept
        
        attrList={}
        attrList["type"]=df_mapto.iloc[0].types[0]
        attrList["concept"]=jsonConceptVal
        jsonmapto.append(attrList)
        filterList={}
        filterList["concept"]=uk.lower()
    #print (f'update parameters: {jsonmapto} - {filterList} \n')
    #test['b'] = [[5, 6, 7]] * len(test)
    if (not df_items.empty):
        df_items['updateList']=[jsonmapto] * len(df_items)
        df_items['filterList']=[filterList] * len(df_items)
        #df_items.loc[ : ,('updateList')]=[jsonmapto for _ in range(df_items.shape[0])]
        #df_items.loc[ : ,('filterList')]=[filterList for _ in range(df_items.shape[0])]
    
        selectedItems=pd.concat([selectedItems, df_items.loc[df_items.astype(str).drop_duplicates(keep='first').index]])
    

attrList={}
filterList={}
#selectedItems.head()



*****
 Checking Aggregation, https://vocabs.dariah.eu/tadirah/aggregating

 No items found for https://vocabs.dariah.eu/tadirah/aggregating
*****
 Checking Analysis, https://vocabs.dariah.eu/tadirah/analyzing

 No items found for https://vocabs.dariah.eu/tadirah/analyzing
*****
 Checking Annotating, https://vocabs.dariah.eu/tadirah/annotating

 Found as activity - in 248 items
*****
 Checking Annotation, https://vocabs.dariah.eu/tadirah/annotating

 No items found for https://vocabs.dariah.eu/tadirah/annotating
*****
 Checking Archaeology and Prehistory, https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/601021

 No items found for https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/601021
*****
 Checking Archaeology and Prehistory, https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/601003

 No items found for https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/601003
*****
 Checking Architecture, space management, https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/2012

 No items found for https://voc

*****
 Checking Information Retrieval > Analysis-Content Analysis, https://vocabs.dariah.eu/tadirah/contentAnalysis

 No items found for https://vocabs.dariah.eu/tadirah/contentAnalysis
*****
 Checking Interpretation, https://vocabs.dariah.eu/tadirah/interpreting

 No items found for https://vocabs.dariah.eu/tadirah/interpreting
*****
 Checking Linguistics, https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/6020

 No items found for https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/6020
*****
 Checking Linked open data > Enrichment-Annotation; Dissemination-Publishing, https://vocabs.dariah.eu/tadirah/publishing

 No items found for https://vocabs.dariah.eu/tadirah/publishing
*****
 Checking Linked open data > Enrichment-Annotation; Dissemination-Publishing, https://vocabs.dariah.eu/tadirah/linkedOpenData

 No items found for https://vocabs.dariah.eu/tadirah/linkedOpenData
*****
 Checking Literature, https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/6020

 No items found for https://vocabs.ac

*****
 Checking argentinian, https://vocabs.sshopencloud.eu/vocabularies/eosc-geographical-availability/ar

 No items found for https://vocabs.sshopencloud.eu/vocabularies/eosc-geographical-availability/ar
*****
 Checking austrian, https://vocabs.sshopencloud.eu/vocabularies/eosc-geographical-availability/at

 No items found for https://vocabs.sshopencloud.eu/vocabularies/eosc-geographical-availability/at
*****
 Checking authentication, https://vocabs.sshopencloud.eu/vocabularies/eosc-resource-category/subcategory-security_and_operations-security_and_identity-user_authentication

 No items found for https://vocabs.sshopencloud.eu/vocabularies/eosc-resource-category/subcategory-security_and_operations-security_and_identity-user_authentication
*****
 Checking authorization, https://vocabs.sshopencloud.eu/vocabularies/eosc-resource-category/subcategory-security_and_operations-security_and_identity-user_authentication

 No items found for https://vocabs.sshopencloud.eu/vocabularies/eosc-re

*****
 Checking graphs, https://vocabs.dariah.eu/sshoc-keyword/graph

%%%%%%%%  No concept found for https://vocabs.dariah.eu/sshoc-keyword/graph
*****
 Checking historic, https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/6010

 No items found for https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/6010
*****
 Checking historical geography and cartography, https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/5070

 No items found for https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/5070
*****
 Checking historical geography and cartography, https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/507028

 No items found for https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/507028
*****
 Checking historical onomastics and toponomastics, https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/602033

 No items found for https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/602033
*****
 Checking iiif, https://vocabs.sshopencloud.eu/vocabularies/standard/iiif

 Found as standard - in 5 items
*****
 Checking indology, https://

*****
 Checking spatial analysis, https://vocabs.dariah.eu/tadirah/spatialAnalysis

 Found as activity - in 31 items
*****
 Checking spreadsheets, https://vocabs.dariah.eu/sshoc-keyword/spreadsheet

%%%%%%%%  No concept found for https://vocabs.dariah.eu/sshoc-keyword/spreadsheet
*****
 Checking sql, https://vocabs.sshopencloud.eu/vocabularies/media-type/applicationslashsql

 No items found for https://vocabs.sshopencloud.eu/vocabularies/media-type/applicationslashsql
*****
 Checking statistical, https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/101018

 No items found for https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/101018
*****
 Checking statistics, https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/101018

 Found as discipline - in 19 items
*****
 Checking structural analysis, https://vocabs.dariah.eu/tadirah/structuralAnalysis

 Found as activity - in 29 items
*****
 Checking stylistic analysis, https://vocabs.dariah.eu/tadirah/stylisticAnalysis

 Found as activity - in 22 items
****
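
For every mapping that succeeds, the loop above builds one update entry from the target concept and one filter entry from the keyword. The sketch below isolates that step; the function name `build_update_entry` and the sample concept record are hypothetical, and the vocabulary value is illustrative.

```python
def build_update_entry(concept):
    """Build one entry of the update list from a concept record.

    `concept` is assumed to expose `code`, `vocabulary`, `uri` and `types`,
    the fields read from `df_mapto` in the loop above.
    """
    return {
        "type": concept["types"][0],
        "concept": {
            "code": concept["code"],
            "vocabulary": concept["vocabulary"],
            "uri": concept["uri"],
        },
    }


# Illustrative concept record, not fetched from the API.
concept = {
    "code": "annotating",
    "vocabulary": {"code": "tadirah-activity"},  # hypothetical vocabulary code
    "uri": "https://vocabs.dariah.eu/tadirah/annotating",
    "types": [{"code": "activity"}],
}
update_entry = build_update_entry(concept)

# The filter entry identifies the keyword to be replaced, in lower case.
filter_entry = {"concept": "Annotation".lower()}
print(update_entry, filter_entry)
```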

In [15]:
selectedItems.sort_values('concept.label').tail()

Unnamed: 0,persistentId,MPUrl,category,label,type.code,type.label,concept.code,concept.label,concept.uri,concept.vocabulary.scheme,updateList,filterList
31857,i3Tu69,training-material/i3Tu69,training-material,Webinar: CESSDA Roadshow on Climate Change,object-format,Object format,video,video,https://vocabs.sshopencloud.eu/vocabularies/me...,https://vocabs.sshopencloud.eu/vocabularies/me...,"[{'type': {'code': 'object-format'}, 'concept'...",{'concept': 'video'}
31876,1kgYH3,training-material/1kgYH3,training-material,Webinar: CESSDA Roadshow on COVID-19,object-format,Object format,video,video,https://vocabs.sshopencloud.eu/vocabularies/me...,https://vocabs.sshopencloud.eu/vocabularies/me...,"[{'type': {'code': 'object-format'}, 'concept'...",{'concept': 'video'}
31894,eJ25nS,training-material/eJ25nS,training-material,Webinar: CESSDA Roadshow on Migration,object-format,Object format,video,video,https://vocabs.sshopencloud.eu/vocabularies/me...,https://vocabs.sshopencloud.eu/vocabularies/me...,"[{'type': {'code': 'object-format'}, 'concept'...",{'concept': 'video'}
31920,VxxK6Z,training-material/VxxK6Z,training-material,Webinar: Data in Europe by Topic: Ageing,object-format,Object format,video,video,https://vocabs.sshopencloud.eu/vocabularies/me...,https://vocabs.sshopencloud.eu/vocabularies/me...,"[{'type': {'code': 'object-format'}, 'concept'...",{'concept': 'video'}
31533,ZwuIgC,training-material/ZwuIgC,training-material,Video: CESSDA Series on Variable Harmonization...,object-format,Object format,video,video,https://vocabs.sshopencloud.eu/vocabularies/me...,https://vocabs.sshopencloud.eu/vocabularies/me...,"[{'type': {'code': 'object-format'}, 'concept'...",{'concept': 'video'}
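
The columns `updateList` and `filterList` hold lists and dicts, which are not hashable, so calling `drop_duplicates` directly on such a frame typically fails. This is why the loop deduplicates on a string copy and then selects the surviving index from the original frame. A minimal sketch on toy rows:

```python
import pandas as pd

# Toy rows; the first two are exact duplicates.
rows = pd.DataFrame({
    "persistentId": ["SIU1nO", "SIU1nO", "i3Tu69"],
    "concept.label": ["Analyzing", "Analyzing", "video"],
})
rows["filterList"] = [
    {"concept": "analyzing"},
    {"concept": "analyzing"},
    {"concept": "video"},
]

# Compare the rows by their string representation, then use the surviving
# index to select from the original frame, so the dict values are preserved.
unique_index = rows.astype(str).drop_duplicates(keep="first").index
deduplicated = rows.loc[unique_index]
print(deduplicated)
```

The cells of `deduplicated` are still real dicts, because only the index of the string copy is used.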


In [None]:
mpdata.updateItemsProperties(selectedItems)

In [None]:
mylog=mpdata._getLog()
mylog.tail()

Get the list of 'Rejecting' keywords from the [gsheet](https://docs.google.com/spreadsheets/d/1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA/edit#gid=0)

In [16]:
sheet_id = '1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA'
sheet_name = 'Rejecting'
rejurl = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}'
df_rej_keywords=pd.read_csv(rejurl)

In [17]:
df_rej_keywords.head()

Unnamed: 0,Keyword to reject,Comment
0,gwt,"no relevant abbreviation; if it stands for ""Go..."
1,AAH,unknown acronym; context: lasers in conservation
2,CHM,unknown acronym; context: lasers in conservation
3,DH Answers,"keyword used once, not very convincing"
4,Rules,unspecific


In [19]:
pd.options.mode.chained_assignment = None
rejectedItems=pd.DataFrame()
for rown, row in df_rej_keywords.iterrows():
    
    rk=df_rej_keywords.iloc[rown]['Keyword to reject']
   
    df_items_wrk=udf_alprop.loc[(udf_alprop['concept.label'].str.lower()==rk.lower()), ]
        
    if (df_items_wrk.empty):
        print (f"\n%%%%%%%%  No items found for {rk}")
        continue;
    print (f'\n Keyword {rk} found as {df_items_wrk.iloc[0]["type.code"]}\n')
    jsonmapto=[]    
    #attrList={}
        
    #jsonmapto.append(attrList)
    filterList={}
    filterList["concept"]=rk.lower()
    
    df_items_wrk['filterList']=[filterList] * len(df_items_wrk)
    df_items_wrk['updateList']=[jsonmapto] * len(df_items_wrk)
    
#     df_items_wrk.loc[ : ,('filterList')]=[filterList for _ in range(df_items_wrk.shape[0])]
#     df_items_wrk.loc[ : ,('updateList')]=[jsonmapto for _ in range(df_items_wrk.shape[0])]
    rejectedItems=pd.concat([rejectedItems, df_items_wrk.loc[df_items_wrk.astype(str).drop_duplicates(keep='first').index]])
    

attrList={}
filterList={}
#rejectedItems.head()



%%%%%%%%  No items found for gwt

%%%%%%%%  No items found for AAH

%%%%%%%%  No items found for CHM

%%%%%%%%  No items found for DH Answers

%%%%%%%%  No items found for Rules

%%%%%%%%  No items found for visual

%%%%%%%%  No items found for leiden

%%%%%%%%  No items found for university

%%%%%%%%  No items found for lynks

%%%%%%%%  No items found for spc

%%%%%%%%  No items found for problem solving

%%%%%%%%  No items found for what is knowledge based system

%%%%%%%%  No items found for service - support service

%%%%%%%%  No items found for activity - resource creation


In [None]:
rejectedItems.head()

In [None]:
mpdata.updateItemsProperties(rejectedItems)