# Notebook 5.2 - Curation Keywords: Duplicates

The final release of this notebook will implement the workflow for curating duplicated dynamic property values.

The notebook works as follows:

0. Imports external libraries and loads the MP dataset and the Google Sheet
1. Searches for possible duplicates in dynamic properties (Experimental)
2. Updates keywords on MP as follows: _TBD_

## 0 Requirements to run this notebook

This section sets up everything needed to interact with the MP data: the libraries, the MP dataset and the keyword mappings.

### 0.1 Libraries
*A number of external libraries are needed to run the notebook.*

*In addition, a dedicated SSH Open Marketplace library, sshmarketplacelib, provides customised functions and can be imported with the usual Python import commands.*

*The cell below imports the libraries needed to run this notebook.*

In [1]:
import numpy as np
import pandas as pd
import requests
# Import the MarketPlace library
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import eval as eva, helper as hel

### 0.2 Get the data



Get the MarketPlace dataset

In [2]:
mpdata = mpd()
df_tool_flat = mpdata.getMPItems("toolsandservices", True)
df_publication_flat = mpdata.getMPItems("publications", True)
df_trainingmaterials_flat = mpdata.getMPItems("trainingmaterials", True)
df_workflows_flat = mpdata.getMPItems("workflows", True)
df_datasets_flat = mpdata.getMPItems("datasets", True)

getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...


Get the list of keywords from the [gsheet](https://docs.google.com/spreadsheets/d/1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA/edit#gid=0)

In [3]:
sheet_id = '1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA'
sheet_name = 'Mappings'
url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}'
df_keywords=pd.read_csv(url)
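
The cell above reads one worksheet through the CSV export URL of the Google Sheet. As a minimal sketch, the same URL can be built with the worksheet name percent-encoded, which matters if a worksheet name contains spaces or other special characters. The helper name `build_gsheet_csv_url` is hypothetical and is not part of sshmarketplacelib.

```python
from urllib.parse import quote


def build_gsheet_csv_url(sheet_id: str, sheet_name: str) -> str:
    """Return the CSV export URL for one worksheet of a Google Sheet.

    Hypothetical helper: it only reproduces the URL pattern used in the
    notebook and percent-encodes the worksheet name.
    """
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/gviz/tq?tqx=out:csv&sheet={quote(sheet_name, safe='')}"
    )


url = build_gsheet_csv_url("1-Oh9_SxIhfMAT6KNJrMf4LetCpy5s1fHZEyTL__TUVA", "Mappings")
print(url)
```

The resulting `url` can be passed to `pd.read_csv(url)` exactly as in the cell above. For a name without special characters, such as `Mappings`, the URL is identical to the one built there.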

The function *getMPConcepts()* is a custom function that uses the API endpoint: 

GET https://marketplace-api.sshopencloud.eu/api/concept-search?perpage=100&q=URI

to get all the *concepts* from the MarketPlace dataset. 

**Note that executing this function may take some time: 14995 records are currently returned.**
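
Because the endpoint returns at most `perpage` records per call, collecting all concepts means requesting one page after another. The sketch below shows one way this could be done; it is not the implementation of *getMPConcepts()*. The `page` query parameter and the response keys `concepts` and `pages` are assumptions about the API, and both function names are hypothetical.

```python
from typing import Callable, Dict, List


def collect_all_pages(fetch_page: Callable[[int], Dict],
                      items_key: str = "concepts",
                      pages_key: str = "pages") -> List[dict]:
    """Call fetch_page(1), fetch_page(2), ... and concatenate the returned items.

    items_key and pages_key are assumed names for the list of records and
    the total number of pages in each JSON response.
    """
    results: List[dict] = []
    page = 1
    while True:
        payload = fetch_page(page)
        results.extend(payload.get(items_key, []))
        # Stop once the last page reported by the server has been read
        if page >= payload.get(pages_key, 1):
            break
        page += 1
    return results


def fetch_concept_page(page: int, query: str = "URI", perpage: int = 100) -> Dict:
    """Fetch one page from the concept-search endpoint (needs network access)."""
    import requests  # imported here so the paging logic can be tested offline

    response = requests.get(
        "https://marketplace-api.sshopencloud.eu/api/concept-search",
        # "page" is an assumed parameter name; q and perpage come from the URL above
        params={"q": query, "perpage": perpage, "page": page},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


# Usage (commented out, like the getMPConcepts() call, because it is slow):
# all_concepts = collect_all_pages(fetch_concept_page)
```

Passing the page-fetching function as an argument keeps the loop independent of the network, so it can be checked with a small fake response before running it against the live API.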



In [4]:
#df_concepts=mpdata.getMPConcepts()

In [5]:
utils = hel.Util()
resultfields = ['persistentId', 'MPUrl', 'category', 'label', 'type.code', 'type.label', 'concept.code', 'concept.label', 'concept.uri', 'concept.vocabulary.scheme']
udf_alprop = utils.getAllPropertiesBySources()
# Keep only the columns of interest
udf_alprop = udf_alprop.loc[:, resultfields]

The first few rows of the gsheet:

In [6]:
df_keywords.head()

| | Keyword to map | Map to | Comment | Discussion |
|---|---|---|---|---|
| 0 | Linguistics | https://vocabs.acdh.oeaw.ac.at/oefosdiscipline... | Sprach- und Literaturwissenschaften | |
| 1 | History | https://vocabs.acdh.oeaw.ac.at/oefosdiscipline... | Geschichte, Archäologie | |
| 2 | Literature | https://vocabs.acdh.oeaw.ac.at/oefosdiscipline... | Sprach- und Literaturwissenschaften | |
| 3 | Video | https://vocabs.sshopencloud.eu/vocabularies/me... | video | |
| 4 | Text | https://vocabs.sshopencloud.eu/vocabularies/me... | text | |


## 1 Find duplicates in properties

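The code below checks all items and identifies those with possibly duplicated dynamic properties. A property is flagged when its concept code, compared case-insensitively, occurs more than once among the properties of the same item. The property type is not taken into account in the comparison, so the results are only candidates for curation.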

In [7]:
# Combine all item categories into a single DataFrame
df_all_items = pd.concat([df_tool_flat, df_publication_flat, df_trainingmaterials_flat, df_workflows_flat, df_datasets_flat])
duplKW = {"persistentId": [], "category": [], "label": [], "possibleDupProps": []}
for item in df_all_items.itertuples():
    # Concept codes (lowercased) that occur more than once among this item's properties
    seen = set()
    dupes = set()
    for x in item.properties:
        if "concept" in x:
            code = x['concept']['code'].lower()
            if code in seen:
                dupes.add(code)
            seen.add(code)
    # Every property, formatted as "type: concept", whose concept code is duplicated
    dupllist = [f"{x['type']['code'].lower()}: {x['concept']['code'].lower()}" for x in item.properties
                if "concept" in x and x['concept']['code'].lower() in dupes]
    if dupllist:
        duplKW["persistentId"].append(item.persistentId)
        duplKW["category"].append(item.category)
        duplKW["label"].append(item.label)
        duplKW["possibleDupProps"].append(", ".join(dupllist))

df_dupl_props = pd.DataFrame(duplKW)

df_dupl_props.tail()

| | persistentId | category | label | possibleDupProps |
|---|---|---|---|---|
| 1044 | xIrlJz | dataset | Corpus of Soqotri Oral Literature | discipline: 6020, discipline: 6020 |
| 1045 | sw65vM | dataset | Data for "The Life Cycles of Genres" | keyword: fiction, keyword: fiction |
| 1046 | Ihbwts | dataset | English Language Stop Words | object-format: text, object-format: text |
| 1047 | LRAZDl | dataset | ParIce | keyword: alignment, keyword: alignment |
| 1048 | dnEWZ8 | dataset | The Sign Language Analyses (SLAY) Database | keyword: sign-languages, keyword: sign-languages |
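
To make the duplicate check easier to follow, here is a minimal sketch of the same idea applied to a single item, using `collections.Counter` instead of a `seen` set. The function name `find_duplicate_props` and the toy properties are invented for illustration. The nested `type` and `concept` keys follow the structure used in the cell above, while the `value` key of the last property is an assumption.

```python
from collections import Counter


def find_duplicate_props(properties):
    """Return 'type: concept' strings for properties whose concept code repeats."""
    # Count each concept code, ignoring case; properties without a concept are skipped
    counts = Counter(
        p["concept"]["code"].lower() for p in properties if "concept" in p
    )
    return [
        f"{p['type']['code'].lower()}: {p['concept']['code'].lower()}"
        for p in properties
        if "concept" in p and counts[p["concept"]["code"].lower()] > 1
    ]


# Invented example data, not taken from the MP dataset
toy_properties = [
    {"type": {"code": "keyword"}, "concept": {"code": "Fiction"}},
    {"type": {"code": "keyword"}, "concept": {"code": "fiction"}},
    {"type": {"code": "language"}, "concept": {"code": "eng"}},
    {"type": {"code": "version"}, "value": "1.0"},  # no concept: ignored
]

print(find_duplicate_props(toy_properties))
# ['keyword: fiction', 'keyword: fiction']
```

Both occurrences of a repeated code are reported, which is why each row in the table above lists the same property at least twice.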


#### Example: a set of items with possible duplicated properties

In [8]:
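# Build the relative MP path of each item as "<category>/<persistentId>"
df_dupl_props['MPUrl'] = df_dupl_props['category'] + '/' + df_dupl_props['persistentId']
# Show the first 10 items, rendering the MPUrl column as clickable links
clickable_duplproptable = df_dupl_props.iloc[0:10].style.format({'MPUrl': utils.make_clickable})
clickable_duplproptable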

| | persistentId | category | label | possibleDupProps | MPUrl |
|---|---|---|---|---|---|
| 0 | SIU1nO | tool-or-service | 140kit | activity: capturing, activity: analyzing, activity: analyzing, activity: capturing, activity: gathering, activity: gathering | tool-or-service/SIU1nO |
| 1 | rdwzoM | tool-or-service | 4th Dimension | activity: webdevelopment, activity: webdevelopment | tool-or-service/rdwzoM |
| 2 | XsXzlp | tool-or-service | 80legs | activity: analyzing, activity: discovering, activity: analyzing, activity: discovering, activity: analyzing | tool-or-service/XsXzlp |
| 3 | uo4gCA | tool-or-service | 960 Grid System | activity: creating, activity: creating, activity: webdevelopment, activity: webdevelopment | tool-or-service/uo4gCA |
| 4 | MXsRM1 | tool-or-service | Abbot | keyword: uncategorized, keyword: uncategorized | tool-or-service/MXsRM1 |
| 5 | gpYKo6 | tool-or-service | ABBYY FineReader | activity: capturing, activity: capturing | tool-or-service/gpYKo6 |
| 6 | LtXDGc | tool-or-service | ABFREQ | activity: analyzing, activity: analyzing, activity: analyzing | tool-or-service/LtXDGc |
| 7 | y2dwph | tool-or-service | Academia.edu | activity: capturing, activity: collaborating, activity: publishing, activity: publishing, activity: collaborating, activity: capturing, activity: gathering, activity: gathering, activity: disseminating, activity: disseminating | tool-or-service/y2dwph |
| 8 | j4uXG0 | tool-or-service | Acronym Finder - Beta (TAPoRware) | keyword: uncategorized, keyword: uncategorized | tool-or-service/j4uXG0 |
| 9 | j8MRN1 | tool-or-service | Active Server Pages (ASP) | activity: publishing, activity: publishing, activity: webdevelopment, activity: webdevelopment | tool-or-service/j8MRN1 |
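
The `MPUrl` column holds a path relative to the MarketPlace site, and a formatter passed to `style.format` turns each value into a link. The sketch below shows what such a formatter might look like; it is not the actual `utils.make_clickable` from sshmarketplacelib, whose behaviour is not shown in this notebook. The function name `make_clickable_sketch` is hypothetical and the base URL is an assumption.

```python
from html import escape

# Assumed base URL of the MarketPlace front end
MP_BASE_URL = "https://marketplace.sshopencloud.eu/"


def make_clickable_sketch(relative_url: str, base_url: str = MP_BASE_URL) -> str:
    """Wrap a relative MP path in an HTML anchor for display in a styled table."""
    href = base_url + relative_url
    # Escape both the link target and the visible text so they are safe in HTML
    return f'<a href="{escape(href, quote=True)}" target="_blank">{escape(relative_url)}</a>'


print(make_clickable_sketch("tool-or-service/SIU1nO"))
# <a href="https://marketplace.sshopencloud.eu/tool-or-service/SIU1nO" target="_blank">tool-or-service/SIU1nO</a>
```

A function like this could be used in the same way as in the cell above, for example `df_dupl_props.style.format({'MPUrl': make_clickable_sketch})`.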
