# Notebook 5.1 - Curation-Keywords (alpha release)

The final release of this notebook will implement the workflow defined in the GitLab issue [Curating keywords](https://gitlab.gwdg.de/sshoc/marketplace-curation/-/issues/1#note_71056).

The notebook works as follows:

0. Imports the external libraries and loads the MP dataset and the Google Sheet with the keyword mappings
1. Looks for the vocabulary terms that match the value in the column *Keyword to map*
2. Looks up the concept whose URI is given in the column *Map to*
3. Goes through all MP items to look for the *keywords-to-map* in the keyword dynamic property
4. Replaces the *keywords-to-map* in the MP dataset with the mapped concept.
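
The steps above can be sketched on toy data. This is only an illustration of the matching logic, not the notebook's implementation: the two small DataFrames, the `example.org` URIs and the `mapped_uri` column are made up for the example, and only the column names *Keyword to map*, *Map to* and `concept.label` come from the notebook.

```python
import pandas as pd

# Toy stand-in for the Google Sheet (real column names, made-up URIs).
mappings = pd.DataFrame({
    "Keyword to map": ["Video", "History"],
    "Map to": ["https://example.org/vocab/video",
               "https://example.org/vocab/history"],
})

# Toy stand-in for the flattened item properties (hypothetical rows).
properties = pd.DataFrame({
    "persistentId": ["a1", "b2", "c3"],
    "type.code": ["keyword", "keyword", "keyword"],
    "concept.label": ["video", "Text", "VIDEO"],
})

# Case-insensitive lookup from keyword to target URI (steps 1 and 2).
lookup = {
    keyword.lower(): uri
    for keyword, uri in zip(mappings["Keyword to map"], mappings["Map to"])
}

# Steps 3 and 4: find the properties that carry a keyword to map
# and record the URI that should replace it.
properties["mapped_uri"] = properties["concept.label"].str.lower().map(lookup)
to_update = properties.dropna(subset=["mapped_uri"])

print(to_update[["persistentId", "concept.label", "mapped_uri"]])
```

In this toy run only the items `a1` and `c3` are selected, because "Text" has no row in the mapping table.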


## 0 Requirements to run this notebook

This section gives all the relevant information to "interact" with the MP data.

### 0.1 Libraries
*A number of external libraries are needed to run the notebook.*

*In addition, a dedicated SSH Open Marketplace library, sshmarketplacelib, provides customised functions and can be imported with the usual Python import statements.*

*The cell below imports the libraries needed to run this notebook.*

In [1]:
import numpy as np
import pandas as pd
import requests
# import the MarketPlace library
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import eval as eva, helper as hel

### 0.2 Get the data



Get the MarketPlace dataset

In [2]:
mpdata = mpd()
df_tool_flat = mpdata.getMPItems("toolsandservices", True)
df_publication_flat = mpdata.getMPItems("publications", True)
df_trainingmaterials_flat = mpdata.getMPItems("trainingmaterials", True)
df_workflows_flat = mpdata.getMPItems("workflows", True)
df_datasets_flat = mpdata.getMPItems("datasets", True)

getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...


Get the list of keywords from the [gsheet](https://docs.google.com/spreadsheets/d/1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA/edit#gid=0)

In [3]:
sheet_id = '1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA'
sheet_name = 'Mappings'
url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}'
df_keywords=pd.read_csv(url)
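
The URL above works because the tab name `Mappings` contains no spaces or special characters. A minimal sketch of a safer way to build the same URL, assuming a tab name that needs percent-encoding; the helper `gsheet_csv_url` and the values `SHEET_ID` and `My Mappings` are hypothetical and not part of the notebook:

```python
from urllib.parse import quote


def gsheet_csv_url(sheet_id: str, sheet_name: str) -> str:
    """Build the gviz CSV export URL, percent-encoding the tab name."""
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/gviz/tq?tqx=out:csv&sheet={quote(sheet_name)}"
    )


# Placeholder values, for illustration only.
url = gsheet_csv_url("SHEET_ID", "My Mappings")
print(url)
```

The resulting string can be passed to `pd.read_csv` exactly as in the cell above.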

The function *getMPConcepts()* is a custom function that uses the API endpoint

    GET https://marketplace-api.sshopencloud.eu/api/concept-search?perpage=100&q=URI

to get all the *concepts* from the MarketPlace dataset.

**Note that executing this function may take some time: 14995 records are currently returned.**
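
The running time follows from the paging: at 100 records per page, 14995 records need 150 requests. How *getMPConcepts()* actually pages through the results is not shown in this notebook, so the following is only a generic sketch of page-by-page retrieval. The names `iter_pages` and `fake_fetch` are hypothetical, and the fake backend replaces the real HTTP call.

```python
import math
from typing import Callable, Dict, Iterator, List


def iter_pages(fetch_page: Callable[[int], List[Dict]]) -> Iterator[Dict]:
    """Yield records page by page until an empty page is returned."""
    page = 1
    while True:
        records = fetch_page(page)
        if not records:
            break
        yield from records
        page += 1


# Fake backend for illustration: 250 records served 100 at a time.
PER_PAGE = 100
fake_records = [{"id": i} for i in range(250)]


def fake_fetch(page: int) -> List[Dict]:
    start = (page - 1) * PER_PAGE
    return fake_records[start:start + PER_PAGE]


all_records = list(iter_pages(fake_fetch))
print(len(all_records))

# Requests needed for 14995 records at 100 records per page.
print(math.ceil(14995 / PER_PAGE))
```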



In [4]:
df_concepts=mpdata.getMPConcepts()

In [5]:
utils=hel.Util()
resultfields=['persistentId', 'MPUrl', 'category', 'label', 'type.code', 'type.label', 'concept.code', 'concept.label', 'concept.uri', 'concept.vocabulary.scheme']
udf_alprop=utils.getAllPropertiesBySources()
udf_alprop=udf_alprop.loc[ : ,resultfields]

A few lines of the gsheet

In [6]:
df_keywords.head()

| | Keyword to map | Map to | Comment |
|---|---|---|---|
| 0 | Linguistics | https://vocabs.acdh.oeaw.ac.at/oefosdiscipline... | Sprach- und Literaturwissenschaften |
| 1 | History | https://vocabs.acdh.oeaw.ac.at/oefosdiscipline... | Geschichte, Archäologie |
| 2 | Literature | https://vocabs.acdh.oeaw.ac.at/oefosdiscipline... | Sprach- und Literaturwissenschaften |
| 3 | Video | https://vocabs.sshopencloud.eu/vocabularies/me... | video |
| 4 | Text | https://vocabs.sshopencloud.eu/vocabularies/me... | text |


## Update the MP dataset

The function *getMPKeywordProperties(myKey)* is a custom function that uses the API endpoint

    GET https://marketplace-api.sshopencloud.eu/api/concept-search?types=keyword&q=VALUE

and returns the vocabulary terms for *myKey*.

In the cell below, the returned dataset is filtered to keep only the terms that come from the vocabulary *sshoc-keyword* (vocabulary[code]=sshoc-keyword) and whose code matches the keyword exactly (case-insensitive). For each keyword found in the MP items, the cell then prepares the replacement concept and finally passes the collected items to the update function.
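
The two filters can be tried out on a small, made-up DataFrame. This is a sketch under the assumption that the `vocabulary` column holds dictionaries with a `code` key, as the filter in the notebook suggests; the helper names `from_vocabulary` and `exact_match` and the sample rows are hypothetical. Unlike a direct comparison with `{'code': 'sshoc-keyword'}`, the check below also accepts dictionaries that carry additional keys.

```python
import pandas as pd

# Made-up vocabulary terms, shaped like the columns used in the notebook.
terms = pd.DataFrame({
    "code": ["Video", "video-art", "video"],
    "vocabulary": [
        {"code": "sshoc-keyword"},
        {"code": "sshoc-keyword"},
        {"code": "tadirah2"},
    ],
})


def from_vocabulary(df: pd.DataFrame, vocab_code: str) -> pd.DataFrame:
    """Keep the rows whose vocabulary dictionary has the given code."""
    mask = df["vocabulary"].apply(
        lambda v: isinstance(v, dict) and v.get("code") == vocab_code
    )
    return df.loc[mask]


def exact_match(df: pd.DataFrame, keyword: str) -> pd.DataFrame:
    """Keep the rows whose code equals the keyword, ignoring case."""
    return df.loc[df["code"].str.lower() == keyword.lower()]


result = exact_match(from_vocabulary(terms, "sshoc-keyword"), "video")
print(result)
```

Only the first row survives both filters: "video-art" is not an exact match, and the lower-case "video" belongs to a different vocabulary.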


In [7]:

selected = []  # one DataFrame per matched keyword, concatenated after the loop
for rown, row in df_keywords.iterrows():
    myKey = row['Keyword to map']

    print(f'\n***** Checking {myKey}')
    df_vocterms = mpdata.getMPKeywordProperties(myKey)

    if df_vocterms.empty:
        print(f"vvvvv No vocabulary terms found for {myKey}")
        continue
    df_vocterms = df_vocterms.loc[df_vocterms.vocabulary == {'code': 'sshoc-keyword'}]
    # Keep only the vocabulary terms that match the keyword exactly (case-insensitive)
    df_vocterms = df_vocterms.loc[df_vocterms.code.str.lower() == myKey.lower()]

    # Find the concept whose uri equals the 'Map to' value. There should be only one such concept.
    test3 = df_concepts.loc[df_concepts.uri == row['Map to']]

    # Select the item properties whose concept label matches the keyword (case-insensitive).
    # .copy() gives an independent DataFrame, so adding columns below raises no SettingWithCopyWarning.
    test = udf_alprop.loc[udf_alprop['concept.label'].str.lower() == myKey.lower()].copy()
    # Filter left disabled in the original: & (udf_alprop['type.code'].str.lower()=='keyword')

    if test.empty:
        print(f"iiiii  No Items found for {myKey}")
        continue
    print(f'fffff  Found as {test.iloc[0]["type.code"]}')
    if test3.empty:
        # Without a target concept, test3.iloc[0] below would raise an IndexError
        print(f"No concept found for the 'Map to' value of {myKey}, skipping")
        continue

    # Build the replacement concept for the update
    jsonConceptVal = {}
    jsonConceptVal["code"] = test3.iloc[0].code
    jsonConceptVal["vocabulary"] = test3.iloc[0].vocabulary
    jsonConceptVal["uri"] = test3.iloc[0].uri
    attrList = {}
    attrList["type"] = test3.iloc[0].types[0]
    attrList["concept"] = jsonConceptVal
    filterList = {}
    filterList["concept"] = myKey.lower()
    test.loc[:, 'updateList'] = [attrList for _ in range(test.shape[0])]
    test.loc[:, 'filterList'] = [filterList for _ in range(test.shape[0])]
    selected.append(test)

# DataFrame.append was removed in pandas 2.0, so the results are concatenated once here
selectedItems = pd.concat(selected) if selected else pd.DataFrame()
attrList = {}
filterList = {}
mpdata.updateItemsProperties(selectedItems, attrList, filterList)



***** Checking Linguistics
iiiii  No Items found for Linguistics

***** Checking History
iiiii  No Items found for History

***** Checking Literature
iiiii  No Items found for Literature

***** Checking Video
fffff  Found as object-format

***** Checking Text
fffff  Found as object-format

***** Checking Publishing


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = value


fffff  Found as activity

***** Checking Capture
iiiii  No Items found for Capture

***** Checking Gathering
fffff  Found as activity

***** Checking Dissemination
iiiii  No Items found for Dissemination

***** Checking Storage
iiiii  No Items found for Storage

***** Checking historic
iiiii  No Items found for historic

***** Checking Analysis
iiiii  No Items found for Analysis

***** Checking Content Analysis
fffff  Found as activity

***** Checking Enrichment
iiiii  No Items found for Enrichment

***** Checking Organizing
fffff  Found as activity

***** Checking document sharing
iiiii  No Items found for document sharing

***** Checking Annotation
iiiii  No Items found for Annotation

***** Checking virtual reality
iiiii  No Items found for virtual reality

***** Checking visualization
iiiii  No Items found for visualization

***** Checking Visualization
iiiii  No Items found for Visualization

***** Checking Discovering
fffff  Found as activity

***** Checking collaboration
iiiii  

iiiii  No Items found for Transcriber

***** Checking lemmatizer
iiiii  No Items found for lemmatizer

***** Checking merge
iiiii  No Items found for merge

***** Checking convert
iiiii  No Items found for convert

***** Checking parser
iiiii  No Items found for parser

***** Checking south african
iiiii  No Items found for south african

***** Checking scraper
vvvvv No vocabulary terms found for scraper

***** Checking structural analysis
fffff  Found as activity

***** Checking Software development
fffff  Found as discipline

***** Checking Augmented reality
fffff  Found as discipline

***** Checking Gender studies
fffff  Found as discipline

***** Checking Biological anthropology
fffff  Found as discipline

***** Checking Classical studies
fffff  Found as discipline

***** Checking Communication sciences
iiiii  No Items found for Communication sciences

***** Checking Cultural studies
fffff  Found as discipline

***** Checking Demography
iiiii  No Items found for Demography

***** C


Changing the property:  "type", from  "type": {'code': 'activity', 'label': 'Activity', 'type': 'concept', 'groupName': 'Categorisation', 'hidden': False, 'ord': 17, 'allowedVocabularies': [{'code': 'tadirah2', 'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs.dariah.eu/tadirah/', 'label': '', 'closed': True}]} 
to
 "type": {'code': 'activity'}", in item with pid: "tools/00dL9T"
(Log info: current version is: 78198)

Changing the property:  "concept", from  "concept": {'code': 'https://vocabs.dariah.eu/tadirah/contentAnalysis', 'vocabulary': {'code': 'tadirah2', 'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs.dariah.eu/tadirah/', 'label': '', 'closed': True}, 'label': 'Content Analysis', 'notation': '', 'uri': 'https://vocabs.dariah.eu/tadirah/contentAnalysis', 'candidate': False} 
to
 "concept": {'code': 'https://vocabs.dariah.eu/tadirah/contentAnalysis', 'vocabulary': {'code': 'tadirah2'}, 'uri': 'https://vocabs.dariah.eu/tadirah


Changing the property:  "concept", from  "concept": {'code': 'https://vocabs.dariah.eu/tadirah/discovering', 'vocabulary': {'code': 'tadirah2', 'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs.dariah.eu/tadirah/', 'label': '', 'closed': True}, 'label': 'Discovering', 'notation': '', 'uri': 'https://vocabs.dariah.eu/tadirah/discovering', 'candidate': False} 
to
 "concept": {'code': 'https://vocabs.dariah.eu/tadirah/discovering', 'vocabulary': {'code': 'tadirah2'}, 'uri': 'https://vocabs.dariah.eu/tadirah/discovering'}", in item with pid: "tools/TbOpKF"
(Log info: current version is: 36577)


 *** Running in DEBUG mode, Marketplace dataset not updated. *** 

Changing the property:  "type", from  "type": {'code': 'activity', 'label': 'Activity', 'type': 'concept', 'groupName': 'Categorisation', 'hidden': False, 'ord': 17, 'allowedVocabularies': [{'code': 'tadirah2', 'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs.dariah.eu/tadirah/',


Changing the property:  "concept", from  "concept": {'code': 'https://vocabs.dariah.eu/tadirah/structuralAnalysis', 'vocabulary': {'code': 'tadirah2', 'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs.dariah.eu/tadirah/', 'label': '', 'closed': True}, 'label': 'Structural Analysis', 'notation': '', 'uri': 'https://vocabs.dariah.eu/tadirah/structuralAnalysis', 'candidate': False} 
to
 "concept": {'code': 'https://vocabs.dariah.eu/tadirah/structuralAnalysis', 'vocabulary': {'code': 'tadirah2'}, 'uri': 'https://vocabs.dariah.eu/tadirah/structuralAnalysis'}", in item with pid: "tools/Q3pcO6"
(Log info: current version is: 77885)


 *** Running in DEBUG mode, Marketplace dataset not updated. *** 

Changing the property:  "type", from  "type": {'code': 'activity', 'label': 'Activity', 'type': 'concept', 'groupName': 'Categorisation', 'hidden': False, 'ord': 17, 'allowedVocabularies': [{'code': 'tadirah2', 'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 


Changing the property:  "concept", from  "concept": {'code': 'https://vocabs.dariah.eu/tadirah/publishing', 'vocabulary': {'code': 'tadirah2', 'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs.dariah.eu/tadirah/', 'label': '', 'closed': True}, 'label': 'Publishing', 'notation': '', 'uri': 'https://vocabs.dariah.eu/tadirah/publishing', 'candidate': False} 
to
 "concept": {'code': 'https://vocabs.dariah.eu/tadirah/publishing', 'vocabulary': {'code': 'tadirah2'}, 'uri': 'https://vocabs.dariah.eu/tadirah/publishing'}", in item with pid: "tools/aYYtA7"
(Log info: current version is: 77957)


 *** Running in DEBUG mode, Marketplace dataset not updated. *** 

Changing the property:  "type", from  "type": {'code': 'activity', 'label': 'Activity', 'type': 'concept', 'groupName': 'Categorisation', 'hidden': False, 'ord': 17, 'allowedVocabularies': [{'code': 'tadirah2', 'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs.dariah.eu/tadirah/', 'lab


Changing the property:  "concept", from  "concept": {'code': 'https://vocabs.dariah.eu/tadirah/contentAnalysis', 'vocabulary': {'code': 'tadirah2', 'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs.dariah.eu/tadirah/', 'label': '', 'closed': True}, 'label': 'Content Analysis', 'notation': '', 'uri': 'https://vocabs.dariah.eu/tadirah/contentAnalysis', 'candidate': False} 
to
 "concept": {'code': 'https://vocabs.dariah.eu/tadirah/contentAnalysis', 'vocabulary': {'code': 'tadirah2'}, 'uri': 'https://vocabs.dariah.eu/tadirah/contentAnalysis'}", in item with pid: "tools/K5js1D"
(Log info: current version is: 78026)


 *** Running in DEBUG mode, Marketplace dataset not updated. *** 

Changing the property:  "type", from  "type": {'code': 'activity', 'label': 'Activity', 'type': 'concept', 'groupName': 'Categorisation', 'hidden': False, 'ord': 17, 'allowedVocabularies': [{'code': 'tadirah2', 'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs

Changing the property:  "type", from  "type": {'code': 'activity', 'label': 'Activity', 'type': 'concept', 'groupName': 'Categorisation', 'hidden': False, 'ord': 17, 'allowedVocabularies': [{'code': 'tadirah2', 'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs.dariah.eu/tadirah/', 'label': '', 'closed': True}]} 
to
 "type": {'code': 'activity'}", in item with pid: "tools/NsFPLs"
(Log info: current version is: 78082)

Changing the property:  "concept", from  "concept": {'code': 'https://vocabs.dariah.eu/tadirah/gathering', 'vocabulary': {'code': 'tadirah2', 'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs.dariah.eu/tadirah/', 'label': '', 'closed': True}, 'label': 'Gathering', 'notation': '', 'uri': 'https://vocabs.dariah.eu/tadirah/gathering', 'candidate': False} 
to
 "concept": {'code': 'https://vocabs.dariah.eu/tadirah/gathering', 'vocabulary': {'code': 'tadirah2'}, 'uri': 'https://vocabs.dariah.eu/tadirah/gathering'}", in item wit

Changing the property:  "type", from  "type": {'code': 'activity', 'label': 'Activity', 'type': 'concept', 'groupName': 'Categorisation', 'hidden': False, 'ord': 17, 'allowedVocabularies': [{'code': 'tadirah2', 'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs.dariah.eu/tadirah/', 'label': '', 'closed': True}]} 
to
 "type": {'code': 'activity'}", in item with pid: "tools/E4qLgD"
(Log info: current version is: 78132)

Changing the property:  "concept", from  "concept": {'code': 'https://vocabs.dariah.eu/tadirah/communicating', 'vocabulary': {'code': 'tadirah2', 'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs.dariah.eu/tadirah/', 'label': '', 'closed': True}, 'label': 'Communicating', 'notation': '', 'uri': 'https://vocabs.dariah.eu/tadirah/communicating', 'candidate': False} 
to
 "concept": {'code': 'https://vocabs.dariah.eu/tadirah/communicating', 'vocabulary': {'code': 'tadirah2'}, 'uri': 'https://vocabs.dariah.eu/tadirah/communica

Changing the property:  "type", from  "type": {'code': 'activity', 'label': 'Activity', 'type': 'concept', 'groupName': 'Categorisation', 'hidden': False, 'ord': 17, 'allowedVocabularies': [{'code': 'tadirah2', 'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs.dariah.eu/tadirah/', 'label': '', 'closed': True}]} 
to
 "type": {'code': 'activity'}", in item with pid: "publications/B0MCj6"
(Log info: current version is: 78183)

Changing the property:  "concept", from  "concept": {'code': 'https://vocabs.dariah.eu/tadirah/annotating', 'vocabulary': {'code': 'tadirah2', 'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs.dariah.eu/tadirah/', 'label': '', 'closed': True}, 'label': 'Annotating', 'notation': '', 'uri': 'https://vocabs.dariah.eu/tadirah/annotating', 'candidate': False} 
to
 "concept": {'code': 'https://vocabs.dariah.eu/tadirah/annotating', 'vocabulary': {'code': 'tadirah2'}, 'uri': 'https://vocabs.dariah.eu/tadirah/annotating'}",


Changing the property:  "concept", from  "concept": {'code': '102022', 'vocabulary': {'code': 'discipline', 'scheme': 'https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/Schema', 'namespace': 'https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/', 'label': 'ÖFOS 2012. Austrian Fields of Science and Technology Classification 2012', 'closed': True}, 'label': 'Software development', 'notation': '102022', 'uri': 'https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/102022', 'candidate': False} 
to
 "concept": {'code': '102022', 'vocabulary': {'code': 'discipline'}, 'uri': 'https://vocabs.acdh.oeaw.ac.at/oefosdisciplines/102022'}", in item with pid: "trainingMaterials/Tz3Owh"
(Log info: current version is: 78191)


 *** Running in DEBUG mode, Marketplace dataset not updated. *** 

Changing the property:  "type", from  "type": {'code': 'discipline', 'label': 'Discipline', 'type': 'concept', 'groupName': 'Categorisation', 'hidden': False, 'ord': 19, 'allowedVocabularies': [{'code': 'discipline', 'scheme