# Notebook 5.1 - Curation-Keywords

This notebook implements the workflow defined in:

[Curating keywords](https://gitlab.gwdg.de/sshoc/marketplace-curation/-/issues/1#note_71056) GitLab issue

The notebook works as follows:

0. Imports the external libraries and loads the MP dataset and the Google Sheet
1. Updates keywords on the MP as follows:
    1. Looks for vocabulary terms with the value from the column *Keyword to map*
    2. Looks for the term in the column *Map to*
    3. Goes through all MP items to look for the *keywords-to-map* in the keyword dynamic property
    4. Replaces the *keywords-to-map* in the MP dataset.
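The steps above can be illustrated on toy data. This is only a sketch under stated assumptions: the `items` dictionary and the `remap` helper are hypothetical and stand in for the MP items and the update call, and the replacement is done in memory rather than through the Marketplace API. The column names *Keyword to map* and *Map to* are the ones used in the sheet.

```python
import pandas as pd

# Toy mapping table using the same column names as the sheet.
mappings = pd.DataFrame({
    "Keyword to map": ["nlp", "nlp", "ocr"],
    "Map to": [
        "natural language processing",
        "text analysis",
        "optical character recognition",
    ],
})

# Hypothetical items: item id -> list of keyword labels.
items = {
    "item-1": ["NLP", "corpus"],
    "item-2": ["ocr"],
}

# Lower-cased lookup: keyword to map -> list of target terms.
lookup = (
    mappings.groupby(mappings["Keyword to map"].str.lower())["Map to"]
    .apply(list)
    .to_dict()
)

def remap(keywords):
    """Replace each keyword that has a mapping; keep the others unchanged."""
    out = []
    for kw in keywords:
        out.extend(lookup.get(kw.lower(), [kw]))
    return out

updated = {item_id: remap(kws) for item_id, kws in items.items()}
print(updated)
```

One keyword can map to several terms, which is why the lookup stores a list for each keyword.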


## 0 Requirements to run this notebook

This section gives all the relevant information to "interact" with the MP data.

### 0.1 Libraries
*A number of external libraries are needed to run the notebook.*

*In addition, a dedicated SSH Open Marketplace library, sshmarketplacelib, provides customised functions and can be imported with the usual Python import statements.*

*The imports needed to run this notebook are listed below.*

In [None]:
import numpy as np
import pandas as pd
import requests
# Import the MarketPlace library
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import eval as eva, helper as hel

### 0.2 Get the data



Get the MarketPlace dataset

In [None]:
mpdata = mpd()
df_tool_flat = mpdata.getMPItems("toolsandservices", True)
df_publication_flat = mpdata.getMPItems("publications", True)
df_trainingmaterials_flat = mpdata.getMPItems("trainingmaterials", True)
df_workflows_flat = mpdata.getMPItems("workflows", True)
df_datasets_flat = mpdata.getMPItems("datasets", True)

Get the list of keywords from the [gsheet](https://docs.google.com/spreadsheets/d/1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA/edit#gid=0)
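The cell that follows inserts the tab name directly into the CSV export URL. That works for a name such as `Mappings`, but a tab name with spaces or special characters would probably need URL-encoding first. A minimal sketch of one way to do that, where `gsheet_csv_url` is a hypothetical helper (not part of sshmarketplacelib) and `SHEET_ID` is a placeholder:

```python
from urllib.parse import quote

def gsheet_csv_url(sheet_id: str, sheet_name: str) -> str:
    """Build the gviz CSV export URL, percent-encoding the tab name."""
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/gviz/tq?tqx=out:csv&sheet={quote(sheet_name)}"
    )

# Placeholder values, for illustration only.
url = gsheet_csv_url("SHEET_ID", "My Mappings")
print(url)
```

The resulting string can be passed to `pd.read_csv` in the same way as the hand-built URL.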

In [None]:
sheet_id = '1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA'
sheet_name = 'Mappings'
url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}'
df_keywords=pd.read_csv(url)

In [None]:
df_grouped_keyword=df_keywords.groupby(['Keyword to map'])['Map to'].apply(list).reset_index(name='Maps')

In [None]:
df_grouped_keyword.iloc[40:45]

The function *getMPConcepts()* is a custom function that uses the API endpoint:

GET https://marketplace-api.sshopencloud.eu/api/concept-search?perpage=100&q=URI

to get all the *concepts* from the MarketPlace dataset.

**Note that this function may take a long time to run.**



In [None]:
df_concepts=mpdata.getMPConcepts()

In [None]:
utils=hel.Util()
resultfields=['persistentId', 'MPUrl', 'category', 'label', 'type.code', 'type.label', 'concept.code', 'concept.label', 'concept.uri', 'concept.vocabulary.scheme']
udf_alprop=utils.getAllPropertiesBySources()
udf_alprop=udf_alprop.loc[ : ,resultfields]

In [None]:
udf_alprop.head()

In [None]:
df_grouped_keyword.iloc[70:90]

## 1 Update keywords

The function *getMPKeywordProperties(mKey)* used below is a custom function that uses the API endpoint

    GET https://marketplace-api.sshopencloud.eu/api/concept-search?types=keyword&q=VALUE

and returns the vocabulary terms for *mKey*.

The returned dataset is then filtered to keep only the values that come from the vocabulary *sshoc-keyword* (vocabulary[code]=sshoc-keyword).
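The filtering described above, followed by the case-insensitive exact match applied in the cell below, can be sketched on toy data. The shape of the `vocabulary` column (a dict with a `code` key) is an assumption based on the comparison used in this notebook, and the sample rows are invented for illustration.

```python
import pandas as pd

# Invented sample of vocabulary terms; the column layout is an assumption.
df_vocterms = pd.DataFrame({
    "code": ["History %3E Oral", "history > oral", "History"],
    "vocabulary": [
        {"code": "sshoc-keyword"},
        {"code": "other-vocabulary"},
        {"code": "sshoc-keyword"},
    ],
})

my_key = "History > Oral"

# Keep only the terms that belong to the sshoc-keyword vocabulary.
in_vocab = df_vocterms["vocabulary"].apply(
    lambda v: isinstance(v, dict) and v.get("code") == "sshoc-keyword"
)

# Decode the escaped separator, then compare case-insensitively.
normalised = (
    df_vocterms["code"].str.replace(" %3E ", " > ", regex=False).str.lower()
)

matches = df_vocterms.loc[in_vocab & (normalised == my_key.lower())]
print(matches)
```

Testing the dict element by element, instead of comparing the whole column against a dict, keeps the filter explicit about which key it reads.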


In [None]:
pd.options.mode.chained_assignment = None
selectedItems=pd.DataFrame()
#df_vocterms=pd.DataFrame()
for rown, row in df_grouped_keyword.iterrows():
    
    uk=df_grouped_keyword.iloc[rown]['Keyword to map']
    myKeys=df_grouped_keyword.iloc[rown]['Maps']
    #df_vocterms=pd.DataFrame()
    jsonmapto=[]
    filterList={}
    for myKey in myKeys:
        myKey=myKey.strip()
        df_vocterms=mpdata.getMPKeywordProperties(myKey.strip())
        #print (f'n. vocabulary terms: {df_vocterms.shape[0]}')
        print(f'***** checking {uk}, {myKey}')
        if (df_vocterms.empty):
            print (f"#####  No vocabulary terms found for {myKey}")
            continue
        df_vocterms.loc[:]=df_vocterms.loc[(df_vocterms.vocabulary=={'code': 'sshoc-keyword'}), ]
    
        #The set of vocabulary terms is filtered to keep only those with an exact (case-insensitive) match
        df_vocterms=df_vocterms.loc[(df_vocterms.code).str.replace(' %3E ', ' > ').str.lower()==myKey.lower()]

        #Search for the one concept that has in the uri column the value of the Map to. There should be only one such concept.
    
   
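        df_mapto=df_concepts.loc[(df_concepts.uri==myKey),]
        #Skip this mapping if no concept has this uri, otherwise iloc[0] below would fail
        if (df_mapto.empty):
            print (f"#####  No concept found with uri {myKey}")
            continue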
       
    
        #Filter the dataset using the Keyword to Map

        df_items=udf_alprop.loc[(udf_alprop['concept.label'].str.lower()==uk.lower()), ]#& (udf_alprop['type.code'].str.lower()=='keyword')]
        
        #print (f' n. prop: {df_items.shape[0]}')
        if (df_items.empty):
            print (f"\n%%%%%%%%  No items found for {uk}")
            continue
        print (f'&&&&&  Found as {df_items.iloc[0]["type.code"]}')
        #update the MP
        jsonConcept={}
        jsonConceptVal={}
    
        jsonConceptVal["code"]=df_mapto.iloc[0].code
        jsonConceptVal["vocabulary"]=df_mapto.iloc[0].vocabulary
        jsonConceptVal["uri"]=df_mapto.iloc[0].uri
        #jsonConcept
        
        attrList={}
        attrList["type"]=df_mapto.iloc[0].types[0]
        attrList["concept"]=jsonConceptVal
        jsonmapto.append(attrList)
        filterList={}
        filterList["concept"]=uk.lower()
    #print (f'update parameters: {jsonmapto} - {filterList} \n')
    #test['b'] = [[5, 6, 7]] * len(test)
    df_items['updateList']=[jsonmapto] * len(df_items)
    df_items['filterList']=[filterList] * len(df_items)
    #df_items.loc[ : ,('updateList')]=[jsonmapto for _ in range(df_items.shape[0])]
    #df_items.loc[ : ,('filterList')]=[filterList for _ in range(df_items.shape[0])]
    
    selectedItems=pd.concat([selectedItems, df_items.loc[df_items.astype(str).drop_duplicates(keep='first').index]])
    

attrList={}
filterList={}
#selectedItems.head()
mpdata.updateItemsProperties(selectedItems)


Get the list of keywords to reject from the *Rejecting* tab of the [gsheet](https://docs.google.com/spreadsheets/d/1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA/edit#gid=0)

In [None]:
sheet_id = '1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA'
sheet_name = 'Rejecting'
rejurl = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}'
df_rej_keywords=pd.read_csv(rejurl)

In [None]:
df_rej_keywords.head()

In [None]:
pd.options.mode.chained_assignment = None
rejectedItems=pd.DataFrame()
for rown, row in df_rej_keywords.iterrows():
    
    rk=df_rej_keywords.iloc[rown]['Keyword to reject']
   
    df_items_wrk=udf_alprop.loc[(udf_alprop['concept.label'].str.lower()==rk.lower()), ]
        
    if (df_items_wrk.empty):
        print (f"\n%%%%%%%%  No items found for {rk}")
        continue
    print (f'&&&&&  Found as {df_items_wrk.iloc[0]["type.code"]}\n')
    jsonmapto=[]    
    #attrList={}
        
    #jsonmapto.append(attrList)
    filterList={}
    filterList["concept"]=rk.lower()
    
    df_items_wrk['filterList']=[filterList] * len(df_items_wrk)
    df_items_wrk['updateList']=[jsonmapto] * len(df_items_wrk)
    
#     df_items_wrk.loc[ : ,('filterList')]=[filterList for _ in range(df_items_wrk.shape[0])]
#     df_items_wrk.loc[ : ,('updateList')]=[jsonmapto for _ in range(df_items_wrk.shape[0])]
    rejectedItems=pd.concat([rejectedItems, df_items_wrk.loc[df_items_wrk.astype(str).drop_duplicates(keep='first').index]])
    

attrList={}
filterList={}
#rejectedItems.head()
mpdata.updateItemsProperties(rejectedItems)