# Notebook 4.1 - Process duplicated items

The same entity should be referenced only once in the SSH Open Marketplace (MP). Duplicate items should be merged to keep the items showcased in the portal coherent. This notebook identifies duplicates and creates merged items in the MP.

### Libraries

In [1]:
import pandas as pd  # to manage dataframes
import json  # to manage JSON objects
# import the Marketplace library
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import eval as eva, helper as hel

### Data(frames) download

In [2]:
mpdata = mpd()
df_tool_flat = mpdata.getMPItems("toolsandservices", True)
df_publication_flat = mpdata.getMPItems("publications", True)
df_trainingmaterials_flat = mpdata.getMPItems("trainingmaterials", True)
df_workflows_flat = mpdata.getMPItems("workflows", True)
df_datasets_flat = mpdata.getMPItems("datasets", True)

getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...


### Identify the duplicates

Duplicates are identified separately for each category by choosing the attribute(s) that must match for two items to be considered equal. In the next cell the function __getDuplicates(category, attributes)__ is invoked with each category's dataframe to get the subset of items that share the same value in the attribute `accessibleAt`; each set of duplicated items is stored in its own variable.

In [3]:
utils = hel.Util()
filter_attributes = 'accessibleAt'
df_tool_duplicates = utils.getDuplicates(df_tool_flat, filter_attributes)
df_publication_duplicates = utils.getDuplicates(df_publication_flat, filter_attributes)
df_trainingmaterials_duplicates = utils.getDuplicates(df_trainingmaterials_flat, filter_attributes)
df_workflows_duplicates = utils.getDuplicates(df_workflows_flat, filter_attributes)
df_datasets_duplicates = utils.getDuplicates(df_datasets_flat, filter_attributes)
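
The internals of `getDuplicates` are not shown in this notebook. As a rough illustration of the idea only, and assuming a flat dataframe with one row per item, rows that share a value in a chosen column can be selected with pandas' `DataFrame.duplicated`. The toy data and the helper name `find_duplicates` are hypothetical and not part of `sshmarketplacelib`.

```python
import pandas as pd

# Toy data standing in for a flat items dataframe (hypothetical values).
items = pd.DataFrame({
    "persistentId": ["a1", "b2", "c3", "d4"],
    "label": ["Tool A", "Tool A (copy)", "Tool B", "Tool C"],
    "accessibleAt": ["https://example.org/a", "https://example.org/a",
                     "https://example.org/b", "https://example.org/c"],
})


def find_duplicates(df, attributes):
    """Return every row whose values in `attributes` occur more than once."""
    subset = [attributes] if isinstance(attributes, str) else list(attributes)
    # keep=False marks all members of a duplicate group, not only the repeats.
    mask = df.duplicated(subset=subset, keep=False)
    return df[mask].sort_values(subset)


duplicates = find_duplicates(items, "accessibleAt")
print(duplicates[["persistentId", "accessibleAt"]])
```

Passing a list such as `["label", "accessibleAt"]` would require all listed columns to match, which is a stricter test than matching on a single attribute.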

In [4]:
print (f'Using the attribute(s) "{filter_attributes}" as filter, there are: {df_tool_duplicates.shape[0]} duplicated tools, {df_publication_duplicates.shape[0]} duplicated publications,'
       +f' {df_trainingmaterials_duplicates.shape[0]} duplicated training materials,'+
      f' {df_workflows_duplicates.shape[0]} duplicated workflows,'+
      f' {df_datasets_duplicates.shape[0]} duplicated datasets')

Using the attribute(s) "accessibleAt" as filter, there are: 562 duplicated tools, 24 duplicated publications, 20 duplicated training materials, 2 duplicated workflows, 249 duplicated datasets


In [5]:
item_vis_mask=['MPUrl','persistentId', 'label', 'accessibleAt', 'source.label']
df_tool_duplicates[item_vis_mask].head(6)

| | MPUrl | persistentId | label | accessibleAt | source.label |
|---|---|---|---|---|---|
| 6 | tool-or-service/zrfCly | zrfCly | 3DVIA Virtools | | TAPoR |
| 13 | tool-or-service/LtXDGc | LtXDGc | ABFREQ | | TAPoR |
| 16 | tool-or-service/j4uXG0 | j4uXG0 | Acronym Finder - Beta (TAPoRware) | | TAPoR |
| 19 | tool-or-service/KdAfc6 | KdAfc6 | Adobe Acrobat Distiller | | TAPoR |
| 20 | tool-or-service/nMm0wM | nMm0wM | Adobe Acrobat Reader | | |
| 21 | tool-or-service/U6hzqf | U6hzqf | Adobe After Effects | | TAPoR |


In [6]:
clickable_cmp_table = df_tool_duplicates[item_vis_mask].style.format({'MPUrl': utils.make_clickable})
clickable_cmp_table

| | MPUrl | persistentId | label | accessibleAt | source.label |
|---|---|---|---|---|---|
| 6 | tool-or-service/zrfCly | zrfCly | 3DVIA Virtools | | TAPoR |
| 13 | tool-or-service/LtXDGc | LtXDGc | ABFREQ | | TAPoR |
| 16 | tool-or-service/j4uXG0 | j4uXG0 | Acronym Finder - Beta (TAPoRware) | | TAPoR |
| 19 | tool-or-service/KdAfc6 | KdAfc6 | Adobe Acrobat Distiller | | TAPoR |
| 20 | tool-or-service/nMm0wM | nMm0wM | Adobe Acrobat Reader | | |
| 21 | tool-or-service/U6hzqf | U6hzqf | Adobe After Effects | | TAPoR |
| 22 | tool-or-service/Tpi8KE | Tpi8KE | Adobe Bridge | | TAPoR |
| 24 | tool-or-service/usExJA | usExJA | Adobe Illustrator | | TAPoR |
| 25 | tool-or-service/aeQ2f6 | aeQ2f6 | Adobe InDesign | | TAPoR |
| 28 | tool-or-service/57QHiH | 57QHiH | Aelfred | | TAPoR |
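
In the rows shown above the `accessibleAt` column appears empty. pandas treats missing values as equal to each other when looking for duplicates, so if the comparison relies on `DataFrame.duplicated` (an assumption, since the implementation of `getDuplicates` is not shown here), items without any `accessibleAt` value would all be reported as duplicates of one another. A minimal sketch on hypothetical toy data of how the two cases could be told apart:

```python
import pandas as pd

# Hypothetical items: two without a URL, two sharing a URL, one unique.
items = pd.DataFrame({
    "persistentId": ["a1", "b2", "c3", "d4", "e5"],
    "accessibleAt": [None, None, "https://example.org/x",
                     "https://example.org/x", "https://example.org/y"],
})

# Missing values count as equal, so a1 and b2 are flagged together with c3 and d4.
flagged = items[items.duplicated(subset=["accessibleAt"], keep=False)]

# Split the flagged rows: value missing vs. a real value shared by several items.
missing_value = flagged[flagged["accessibleAt"].isna()]
shared_value = flagged[flagged["accessibleAt"].notna()]

print(len(flagged), len(missing_value), len(shared_value))
```

Rows in `missing_value` are probably better compared on another attribute, such as `label`, before any merge is attempted.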


### Obtain the merged item and view it
The function __getMergedItem(category, pids)__ takes the category of the items to be merged and a comma-separated string with the *persistentId* values of those items.  
It returns two values: 
<ul><li>a dataframe that can be printed to inspect the merged item</li><li>a JSON object that must be passed to the function that writes the merged item back to the MP dataset</li></ul>
In the following cells the two items are first compared side by side, then the function is invoked and the result is displayed.

In [13]:
category = "toolsandservices"
# persistentId values of the duplicated items
# pids = "C75gkx, XE6Spj"
pids = "KdAfc6, nMm0wM"
# select the rows of the items to compare
persistentids = pids.replace(" ", "").split(',')
compareitems = df_tool_flat[df_tool_flat.persistentId.isin(persistentids)]

In [14]:
css_equal="font-size:1.5rem; border: 2px solid silver;background-color: white; padding: 10px 20px"
css_diff="background-color: lightyellow;  font-size:1.5rem; border: 2px solid silver; padding: 10px 20px"

In [15]:
# view items: a row gets css_equal when lists_to_list(...) has exactly one element, css_diff otherwise
showdiff = compareitems.T.style.apply(
    lambda row: [css_equal if len(utils.lists_to_list(row.values)) == 1 else css_diff] * len(row),
    axis=1)
showdiff

Unnamed: 0,19,20
MPUrl,tool-or-service/KdAfc6,tool-or-service/nMm0wM
id,28724,40836
category,tool-or-service,tool-or-service
label,Adobe Acrobat Distiller,Adobe Acrobat Reader
persistentId,KdAfc6,nMm0wM
lastInfoUpdate,2021-11-23T17:34:28+0000,2022-05-20T08:52:27+0000
status,approved,approved
description,"Adobe Acrobat Distiller was software for converting Postscript files to PDF. It was discontinued in 2013, with the exception of a server-based version.","PDF viewer for reading, searching, printing and interacting with PDF file."
contributors,"[{'actor': {'id': 1860, 'name': 'Adobe Systems Incorporated', 'externalIds': [{'identifierService': {'code': 'SourceActorId', 'label': 'Source ActorId', 'ord': 7, 'urlTemplate': ''}, 'identifier': '1-ef7de2bd89183a02bdf9c66eecfa112a49ad8de3e3da41176a0be1b5e4b1b305'}], 'affiliations': []}, 'role': {'code': 'creator', 'label': 'Creator', 'ord': 3}}]",[]
properties,"[{'type': {'code': 'keyword', 'label': 'Keyword', 'type': 'concept', 'groupName': 'Categorisation', 'hidden': False, 'ord': 18, 'allowedVocabularies': [{'code': 'sshoc-keyword', 'scheme': 'https://vocabs.dariah.eu/sshoc-keyword/Schema', 'namespace': 'https://vocabs.dariah.eu/sshoc-keyword/', 'label': 'Keywords from SSHOC MP', 'closed': False}]}, 'concept': {'code': 'conversion', 'vocabulary': {'code': 'sshoc-keyword', 'scheme': 'https://vocabs.dariah.eu/sshoc-keyword/Schema', 'namespace': 'https://vocabs.dariah.eu/sshoc-keyword/', 'label': 'Keywords from SSHOC MP', 'closed': False}, 'label': 'conversion', 'notation': 'conversion', 'uri': 'https://vocabs.dariah.eu/sshoc-keyword/conversion', 'candidate': True}}, {'type': {'code': 'keyword', 'label': 'Keyword', 'type': 'concept', 'groupName': 'Categorisation', 'hidden': False, 'ord': 18, 'allowedVocabularies': [{'code': 'sshoc-keyword', 'scheme': 'https://vocabs.dariah.eu/sshoc-keyword/Schema', 'namespace': 'https://vocabs.dariah.eu/sshoc-keyword/', 'label': 'Keywords from SSHOC MP', 'closed': False}]}, 'concept': {'code': 'Other', 'vocabulary': {'code': 'sshoc-keyword', 'scheme': 'https://vocabs.dariah.eu/sshoc-keyword/Schema', 'namespace': 'https://vocabs.dariah.eu/sshoc-keyword/', 'label': 'Keywords from SSHOC MP', 'closed': False}, 'label': 'Other', 'notation': 'Other', 'uri': 'https://vocabs.dariah.eu/sshoc-keyword/Other', 'candidate': True}}, {'type': {'code': 'terms-of-use', 'label': 'Terms Of Use', 'type': 'string', 'groupName': 'Access', 'hidden': False, 'ord': 3, 'allowedVocabularies': []}, 'value': 'Closed Source'}, {'type': {'code': 'activity', 'label': 'Activity', 'type': 'concept', 'groupName': 'Categorisation', 'hidden': False, 'ord': 17, 'allowedVocabularies': [{'code': 'tadirah2', 'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs.dariah.eu/tadirah/', 'label': 'TaDiRAH 2', 'closed': True}]}, 'concept': {'code': 'enriching', 'vocabulary': {'code': 'tadirah2', 
'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs.dariah.eu/tadirah/', 'label': 'TaDiRAH 2', 'closed': True}, 'label': 'Enriching', 'notation': '', 'uri': 'https://vocabs.dariah.eu/tadirah/enriching', 'candidate': False}}]","[{'type': {'code': 'curation-flag-merged', 'label': 'Curate merged items', 'type': 'boolean', 'groupName': 'Curation', 'hidden': True, 'ord': 39, 'allowedVocabularies': []}, 'value': 'TRUE'}]"
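
The comparison view above transposes the selected rows so that each attribute becomes a row, then styles a row differently when the items disagree. A self-contained sketch of the same idea on hypothetical toy data follows. `utils.lists_to_list` is replaced here by a plain comparison of string representations, which is an assumption about its purpose, and `row_style` is a hypothetical helper.

```python
import pandas as pd

# Two hypothetical items that agree on the label but not on the description.
compare = pd.DataFrame(
    {"label": ["Tool A", "Tool A"], "description": ["First text", "Second text"]},
    index=["item1", "item2"],
)

CSS_EQUAL = "background-color: white"
CSS_DIFF = "background-color: lightyellow"


def row_style(row):
    """Return one CSS string per cell: CSS_DIFF when the row's values differ."""
    distinct = {str(value) for value in row.values}
    css = CSS_EQUAL if len(distinct) == 1 else CSS_DIFF
    return [css] * len(row)


# One row per attribute, one column per item.
transposed = compare.T
styles = {name: row_style(row) for name, row in transposed.iterrows()}
print(styles)
```

In a notebook the same function can be passed to the Styler with `transposed.style.apply(row_style, axis=1)`, which expects one CSS string per cell of the row.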


In [16]:
#get the merged item
mergeditem=mpdata.getMergedItem(category, pids)

Check the merged item:

In [17]:
mergeditem[0].head(30).style.set_properties(**{'width': '75%', 'border': '1px solid silver', 'background-color': 'lightblue', 'padding': '10px 20px'})

Unnamed: 0,0
id,28724
category,tool-or-service
label,Adobe Acrobat Distiller / Adobe Acrobat Reader
persistentId,KdAfc6
lastInfoUpdate,2021-11-23T17:34:28+0000
status,approved
description,"Adobe Acrobat Distiller was software for converting Postscript files to PDF. It was discontinued in 2013, with the exception of a server-based version.  / PDF viewer for reading, searching, printing and interacting with PDF file."
contributors,"[{'actor': {'id': 1860, 'name': 'Adobe Systems Incorporated', 'externalIds': [{'identifierService': {'code': 'SourceActorId', 'label': 'Source ActorId', 'ord': 7, 'urlTemplate': ''}, 'identifier': '1-ef7de2bd89183a02bdf9c66eecfa112a49ad8de3e3da41176a0be1b5e4b1b305'}], 'affiliations': []}, 'role': {'code': 'creator', 'label': 'Creator', 'ord': 3}}]"
properties,"[{'type': {'code': 'keyword', 'label': 'Keyword', 'type': 'concept', 'groupName': 'Categorisation', 'hidden': False, 'ord': 18, 'allowedVocabularies': [{'code': 'sshoc-keyword', 'scheme': 'https://vocabs.dariah.eu/sshoc-keyword/Schema', 'namespace': 'https://vocabs.dariah.eu/sshoc-keyword/', 'label': 'Keywords from SSHOC MP', 'closed': False}]}, 'concept': {'code': 'conversion', 'vocabulary': {'code': 'sshoc-keyword', 'scheme': 'https://vocabs.dariah.eu/sshoc-keyword/Schema', 'namespace': 'https://vocabs.dariah.eu/sshoc-keyword/', 'label': 'Keywords from SSHOC MP', 'closed': False}, 'label': 'conversion', 'notation': 'conversion', 'uri': 'https://vocabs.dariah.eu/sshoc-keyword/conversion', 'candidate': True}}, {'type': {'code': 'keyword', 'label': 'Keyword', 'type': 'concept', 'groupName': 'Categorisation', 'hidden': False, 'ord': 18, 'allowedVocabularies': [{'code': 'sshoc-keyword', 'scheme': 'https://vocabs.dariah.eu/sshoc-keyword/Schema', 'namespace': 'https://vocabs.dariah.eu/sshoc-keyword/', 'label': 'Keywords from SSHOC MP', 'closed': False}]}, 'concept': {'code': 'Other', 'vocabulary': {'code': 'sshoc-keyword', 'scheme': 'https://vocabs.dariah.eu/sshoc-keyword/Schema', 'namespace': 'https://vocabs.dariah.eu/sshoc-keyword/', 'label': 'Keywords from SSHOC MP', 'closed': False}, 'label': 'Other', 'notation': 'Other', 'uri': 'https://vocabs.dariah.eu/sshoc-keyword/Other', 'candidate': True}}, {'type': {'code': 'terms-of-use', 'label': 'Terms Of Use', 'type': 'string', 'groupName': 'Access', 'hidden': False, 'ord': 3, 'allowedVocabularies': []}, 'value': 'Closed Source'}, {'type': {'code': 'activity', 'label': 'Activity', 'type': 'concept', 'groupName': 'Categorisation', 'hidden': False, 'ord': 17, 'allowedVocabularies': [{'code': 'tadirah2', 'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs.dariah.eu/tadirah/', 'label': 'TaDiRAH 2', 'closed': True}]}, 'concept': {'code': 'enriching', 'vocabulary': {'code': 'tadirah2', 
'scheme': 'https://vocabs.dariah.eu/tadirah/', 'namespace': 'https://vocabs.dariah.eu/tadirah/', 'label': 'TaDiRAH 2', 'closed': True}, 'label': 'Enriching', 'notation': '', 'uri': 'https://vocabs.dariah.eu/tadirah/enriching', 'candidate': False}}, {'type': {'code': 'curation-flag-merged', 'label': 'Curate merged items', 'type': 'boolean', 'groupName': 'Curation', 'hidden': True, 'ord': 39, 'allowedVocabularies': []}, 'value': 'TRUE'}]"
externalIds,"[{'identifierService': {'code': 'Wikidata', 'label': 'Wikidata', 'ord': 1, 'urlTemplate': 'https://www.wikidata.org/wiki/{source-item-id}'}, 'identifier': 'Q2634567'}]"
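
Judging from the output above, the merged item keeps the identifiers of the first item, joins differing text fields with ` / `, and concatenates list fields such as `properties`. The following is only a sketch of that behaviour on hypothetical, heavily simplified dictionaries. `merge_items` is not part of `sshmarketplacelib`, and the library may apply further rules that are not visible here.

```python
def merge_items(first, second, text_fields=("label", "description"), separator=" / "):
    """Merge two item dicts: the first item wins, text is joined, lists are concatenated."""
    merged = dict(first)
    for key, other in second.items():
        current = merged.get(key)
        if isinstance(current, list) and isinstance(other, list):
            # List fields (contributors, properties, ...) are concatenated.
            merged[key] = current + other
        elif key in text_fields and current and other and current != other:
            # Differing text fields are joined with the separator.
            merged[key] = f"{current}{separator}{other}"
        elif current in (None, "", []):
            # Gaps in the first item are filled from the second one.
            merged[key] = other
    return merged


# Simplified stand-ins for the two items compared in this notebook.
first = {
    "persistentId": "KdAfc6",
    "label": "Adobe Acrobat Distiller",
    "contributors": [{"name": "Adobe Systems Incorporated"}],
    "properties": [{"code": "keyword"}],
}
second = {
    "persistentId": "nMm0wM",
    "label": "Adobe Acrobat Reader",
    "contributors": [],
    "properties": [{"code": "curation-flag-merged"}],
}

merged = merge_items(first, second)
print(merged["label"])
```

Keeping the first item's `persistentId` means the order of the identifiers in `pids` decides which record survives in this sketch.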


The function __postMergedItem(JSonItem, pids)__ stores the merged item in the MP dataset. It takes the merged item as a JSON object and the *persistentId* values of the items that were merged.


In [19]:
mergeditem[0]

| | 0 |
|---|---|
| id | 28724 |
| category | tool-or-service |
| label | Adobe Acrobat Distiller / Adobe Acrobat Reader |
| persistentId | KdAfc6 |
| lastInfoUpdate | 2021-11-23T17:34:28+0000 |
| status | approved |
| description | Adobe Acrobat Distiller was software for conve... |
| contributors | [{'actor': {'id': 1860, 'name': 'Adobe Systems... |
| properties | [{'type': {'code': 'keyword', 'label': 'Keywor... |
| externalIds | [{'identifierService': {'code': 'Wikidata', 'l... |


In [18]:
mpdata.postMergedItem(mergeditem[1], pids)

curation-flag-merged
Error, please check the merged item


''
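
The call above prints `curation-flag-merged` and then reports an error. The output does not state the cause; one thing worth checking before posting is which property types the merged item carries. A small sketch, assuming the item is a dictionary whose `properties` entries have the `type.code` structure seen in the tables above (if it is a JSON string, it would first need `json.loads`). `property_codes` is a hypothetical helper, not a library function.

```python
from collections import Counter


def property_codes(item):
    """Count the property type codes found in an item dictionary."""
    return Counter(
        prop.get("type", {}).get("code") for prop in item.get("properties", [])
    )


# Simplified stand-in for the merged item shown earlier in this notebook.
item = {
    "properties": [
        {"type": {"code": "keyword"}, "concept": {"code": "conversion"}},
        {"type": {"code": "keyword"}, "concept": {"code": "Other"}},
        {"type": {"code": "curation-flag-merged"}, "value": "TRUE"},
    ]
}

codes = property_codes(item)
print(codes)
```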