# Notebook 5.2 - Curation Keywords: Duplicates

The final release of this notebook will implement the workflow to curate duplicated dynamic property values.

The notebook works as follows:

0. Imports the external libraries and loads the MP dataset and the Google Sheet
1. Searches for possible duplicates in dynamic properties (experimental)
2. Updates keywords on MP

## 0 Requirements to run this notebook

This section gives all the relevant information to "interact" with the MP data.

### 0.1 Libraries
*Several external libraries are needed to run the notebook.*

*In addition, a dedicated SSH Open Marketplace library, sshmarketplacelib, provides customised functions and can be imported with the usual Python import commands.*

*The imports needed to run this notebook are listed below.*

In [1]:
import numpy as np
import pandas as pd
import requests
# Import the SSH Open Marketplace library
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import eval as eva, helper as hel

### 0.2 Get the data



Get the MarketPlace dataset

In [2]:
# Load each item category of the Marketplace dataset as a DataFrame
mpdata = mpd()
df_tool_flat = mpdata.getMPItems("toolsandservices", True)
df_publication_flat = mpdata.getMPItems("publications", True)
df_trainingmaterials_flat = mpdata.getMPItems("trainingmaterials", True)
df_workflows_flat = mpdata.getMPItems("workflows", True)
df_datasets_flat = mpdata.getMPItems("datasets", True)

getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...


In [3]:
utils = hel.Util()
# Columns to keep from the table of all properties
resultfields = ['persistentId', 'MPUrl', 'category', 'label', 'type.code', 'type.label',
                'concept.code', 'concept.label', 'concept.uri', 'concept.vocabulary.scheme']
udf_alprop = utils.getAllPropertiesBySources()
udf_alprop = udf_alprop.loc[:, resultfields]

## 1 Find duplicates in properties

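The code below checks all items and identifies those with possibly duplicated dynamic properties, i.e. items in which the same concept code (compared case-insensitively) appears in more than one property.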

In [4]:
# Accumulator for the items that carry possibly duplicated properties
duplKW = {"persistentId": [], "category": [], "label": [], "possibleDupProps": [], "pDProps": []}
df_all_items = pd.concat([df_tool_flat, df_publication_flat, df_trainingmaterials_flat,
                          df_workflows_flat, df_datasets_flat])
for item in df_all_items.itertuples():
    seen = set()
    # Concept codes (lower-cased) that occur more than once in this item.
    # set.add() returns None, so the `or` branch records a first occurrence without selecting it.
    dupes = [x['concept']['code'].lower() for x in item.properties
             if "concept" in x
             and (x['concept']['code'].lower() in seen or seen.add(x['concept']['code'].lower()))]
    # "type:code" label for every property that shares one of the duplicated codes
    dupllist = [f"{x['type']['code'].lower()}:{x['concept']['code'].lower()}" for x in item.properties
                if "concept" in x and x['concept']['code'].lower() in dupes]
    if dupllist:
        duplKW["persistentId"].append(item.persistentId)
        duplKW["category"].append(item.category)
        duplKW["label"].append(item.label)
        duplKW["possibleDupProps"].append(", ".join(dupllist))
        duplKW["pDProps"].append(dupllist)

df_dupl_props = pd.DataFrame(duplKW)
# Keep each "type:code" label only once per item
df_dupl_props.pDProps = df_dupl_props.pDProps.apply(lambda y: list(set(y)))
df_dupl_props.head()

| | persistentId | category | label | possibleDupProps | pDProps |
|---|---|---|---|---|---|
| 0 | vJ9sE6 | tool-or-service | ArkeoGIS | keyword:csv, standard:csv | [keyword:csv, standard:csv] |
| 1 | ysUiV1 | tool-or-service | Gallica | keyword:iiif, standard:iiif | [keyword:iiif, standard:iiif] |
| 2 | Ctd5u5 | tool-or-service | Mapping Memory Manager | keyword:cidoc-crm, standard:cidoc-crm | [standard:cidoc-crm, keyword:cidoc-crm] |
| 3 | SG7WGa | tool-or-service | Ontop | activity:mapping, keyword:mapping | [activity:mapping, keyword:mapping] |
| 4 | lxjJee | tool-or-service | Srophé App (an eXist-db publishing platform fo... | keyword:tei, standard:tei | [keyword:tei, standard:tei] |
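
To make the detection rule easier to follow, here is a minimal sketch that applies the same idea to a single, hypothetical property list. It uses `collections.Counter` instead of the `seen` set from the cell above; the helper name `find_duplicate_props` and the sample data are assumptions made for illustration and are not part of sshmarketplacelib.

```python
from collections import Counter

def find_duplicate_props(properties):
    """Return 'type:code' strings for concept codes used more than once."""
    # Count each lower-cased concept code; properties without a concept are skipped
    counts = Counter(p["concept"]["code"].lower() for p in properties if "concept" in p)
    return [
        f"{p['type']['code'].lower()}:{p['concept']['code'].lower()}"
        for p in properties
        if "concept" in p and counts[p["concept"]["code"].lower()] > 1
    ]

# Hypothetical property list, shaped like the entries in item.properties
sample = [
    {"type": {"code": "keyword"}, "concept": {"code": "CSV"}},
    {"type": {"code": "standard"}, "concept": {"code": "csv"}},
    {"type": {"code": "activity"}, "concept": {"code": "mapping"}},
    {"type": {"code": "license"}, "value": "MIT"},
]
print(find_duplicate_props(sample))  # ['keyword:csv', 'standard:csv']
```

Because the comparison is done on lower-cased codes, `CSV` and `csv` are treated as the same concept, which matches the behaviour of the cell above.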


#### Example: a set of items with possible duplicated properties

In [5]:
# Build the relative Marketplace path of each item and render it as a link
df_dupl_props['MPUrl'] = df_dupl_props['category'] + '/' + df_dupl_props['persistentId']
clickable_duplproptable = df_dupl_props.iloc[0:30].style.format({'MPUrl': utils.make_clickable})
clickable_duplproptable

| | persistentId | category | label | possibleDupProps | pDProps | MPUrl |
|---|---|---|---|---|---|---|
| 0 | vJ9sE6 | tool-or-service | ArkeoGIS | keyword:csv, standard:csv | ['keyword:csv', 'standard:csv'] | tool-or-service/vJ9sE6 |
| 1 | ysUiV1 | tool-or-service | Gallica | keyword:iiif, standard:iiif | ['keyword:iiif', 'standard:iiif'] | tool-or-service/ysUiV1 |
| 2 | Ctd5u5 | tool-or-service | Mapping Memory Manager | keyword:cidoc-crm, standard:cidoc-crm | ['standard:cidoc-crm', 'keyword:cidoc-crm'] | tool-or-service/Ctd5u5 |
| 3 | SG7WGa | tool-or-service | Ontop | activity:mapping, keyword:mapping | ['activity:mapping', 'keyword:mapping'] | tool-or-service/SG7WGa |
| 4 | lxjJee | tool-or-service | Srophé App (an eXist-db publishing platform for TEI datasets) | keyword:tei, standard:tei | ['keyword:tei', 'standard:tei'] | tool-or-service/lxjJee |
| 5 | gvthYy | publication | Extracting melodies for analysis: What SPARQL can do on Music-XML | keyword:xml, standard:xml | ['keyword:xml', 'standard:xml'] | publication/gvthYy |
| 6 | l1xtpH | training-material | Aligner des données XML avec une ontologie avec 3M | keyword:cidoc-crm, standard:cidoc-crm | ['standard:cidoc-crm', 'keyword:cidoc-crm'] | training-material/l1xtpH |
| 7 | R8Uj9l | training-material | Aligner une base de données avec une ontologie avec Protégé- Ontop | activity:mapping, keyword:mapping | ['activity:mapping', 'keyword:mapping'] | training-material/R8Uj9l |
| 8 | vAbBMG | training-material | Onto Match Game | standard:cidoc-crm, keyword:cidoc-crm | ['standard:cidoc-crm', 'keyword:cidoc-crm'] | training-material/vAbBMG |
| 9 | DYykZw | training-material | Train-the-Trainers Package | keyword:trainer, intended-audience:trainer | ['keyword:trainer', 'intended-audience:trainer'] | training-material/DYykZw |
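
The implementation of `utils.make_clickable` is not shown in this notebook. As a rough sketch of what a formatter of this kind might do, the function below wraps a relative `category/persistentId` path in an HTML anchor. The function body and the `BASE_URL` value are assumptions for illustration, not the library's actual code.

```python
# Assumed base URL of the Marketplace front end; adjust to the instance in use
BASE_URL = "https://marketplace.sshopencloud.eu/"

def make_clickable(path):
    """Wrap a relative 'category/persistentId' path in an HTML anchor (hypothetical helper)."""
    return f'<a href="{BASE_URL}{path}" target="_blank">{path}</a>'

link = make_clickable("tool-or-service/vJ9sE6")
print(link)
```

A function with this shape can be passed to `DataFrame.style.format({'MPUrl': make_clickable})`, which calls it on each cell of the column and lets Jupyter render the result as a link.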


## 2 Update keywords on MP

In [None]:
mpdata.removeDuplicatedProperties(df_dupl_props)
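
The notebook does not show how `removeDuplicatedProperties` resolves a duplicate. Purely as an illustration of one possible policy, the sketch below drops a `keyword` property when the same concept code is also present under another property type. Both the policy and the helper name `drop_duplicate_keywords` are assumptions and may differ from what sshmarketplacelib actually does.

```python
def drop_duplicate_keywords(properties):
    """Drop keyword properties whose concept code also appears under another type.

    Assumed policy for illustration only; not taken from sshmarketplacelib.
    """
    # Concept codes (lower-cased) carried by non-keyword properties
    typed_codes = {
        p["concept"]["code"].lower()
        for p in properties
        if "concept" in p and p["type"]["code"].lower() != "keyword"
    }
    return [
        p for p in properties
        if not ("concept" in p
                and p["type"]["code"].lower() == "keyword"
                and p["concept"]["code"].lower() in typed_codes)
    ]

# Hypothetical property list
sample_props = [
    {"type": {"code": "keyword"}, "concept": {"code": "tei"}},
    {"type": {"code": "standard"}, "concept": {"code": "TEI"}},
    {"type": {"code": "keyword"}, "concept": {"code": "xquery"}},
]
kept = drop_duplicate_keywords(sample_props)
print([f"{p['type']['code']}:{p['concept']['code']}" for p in kept])
# ['standard:TEI', 'keyword:xquery']
```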

In [None]:
# Count how many items share each combination of duplicated properties
cases_df = df_dupl_props.groupby(['possibleDupProps'])['label'].count().reset_index(name='numberofcases')

In [None]:

cases_df.head(20)
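
The example table lists the same combination in two different orders (`keyword:cidoc-crm, standard:cidoc-crm` and `standard:cidoc-crm, keyword:cidoc-crm`), so grouping on the raw `possibleDupProps` string would count them as separate cases. A minimal sketch of one way to avoid this, assuming it is acceptable to sort the entries before grouping; the helper `normalise_pair` and the column `normalisedProps` are hypothetical names.

```python
import pandas as pd

def normalise_pair(props):
    """Sort the 'type:code' entries so equivalent combinations compare equal."""
    return ", ".join(sorted(set(p.strip() for p in props.split(","))))

# Rows copied from the example table in section 1
df = pd.DataFrame({
    "label": ["Mapping Memory Manager", "Onto Match Game", "ArkeoGIS"],
    "possibleDupProps": [
        "keyword:cidoc-crm, standard:cidoc-crm",
        "standard:cidoc-crm, keyword:cidoc-crm",
        "keyword:csv, standard:csv",
    ],
})
df["normalisedProps"] = df["possibleDupProps"].map(normalise_pair)
cases = df.groupby("normalisedProps")["label"].count().reset_index(name="numberofcases")
print(cases)
```

With the normalised column, the two `cidoc-crm` rows fall into a single group with two cases instead of two groups with one case each.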