# Notebook 4.1 - Process duplicated items

The same entity should be referenced only once in the SSH Open Marketplace (MP). Duplicate items should be merged so that the items showcased in the portal stay coherent. This notebook identifies duplicates and creates merged items in the MP.

### Libraries

In [1]:
import pandas as pd #to manage dataframes
import json #to manage json objects
#import the MarketPlace Library 
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import  eval as eva, helper as hel

### Data(frames) download

In [3]:
mpdata = mpd()
df_tool_flat = mpdata.getMPItems("toolsandservices", True)
df_publication_flat = mpdata.getMPItems("publications", True)
df_trainingmaterials_flat = mpdata.getMPItems("trainingmaterials", False)
df_workflows_flat = mpdata.getMPItems("workflows", True)
df_datasets_flat = mpdata.getMPItems("datasets", True)

getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...


### Identify the duplicates

Duplicates are identified per category by choosing the attribute that is inspected to decide whether two items are equal. In the next cell the function __getDuplicates(dataframe, attribute)__ is invoked once per category to get the subset of items that share the same value in the attribute `label`. Each set of duplicated items is stored in its own variable.

In [None]:
utils=hel.Util()
filter_attribute='label'
df_tool_duplicates=utils.getDuplicates(df_tool_flat, filter_attribute)
df_publication_duplicates=utils.getDuplicates(df_publication_flat, filter_attribute)
df_trainingmaterials_duplicates=utils.getDuplicates(df_trainingmaterials_flat, filter_attribute)
df_workflows_duplicates=utils.getDuplicates(df_workflows_flat, filter_attribute)
df_datasets_duplicates=utils.getDuplicates(df_datasets_flat, filter_attribute)
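
The idea behind this step can be sketched in plain pandas. This is only an illustration of attribute-based duplicate detection, not the implementation of `getDuplicates` in the library; the function name `find_duplicates` and the sample data are hypothetical.

```python
import pandas as pd

def find_duplicates(df: pd.DataFrame, attribute: str) -> pd.DataFrame:
    """Return every row whose value in `attribute` occurs more than once."""
    # Skip rows with a missing value so that empty labels are not reported as duplicates.
    candidates = df[df[attribute].notna()]
    # keep=False flags all members of a duplicate group, not only the later occurrences.
    mask = candidates.duplicated(subset=[attribute], keep=False)
    # A stable sort keeps the original order inside each group of equal values.
    return candidates[mask].sort_values(attribute, kind="stable")

# Hypothetical sample data, not taken from the Marketplace.
sample = pd.DataFrame({
    "persistentId": ["a1", "b2", "c3", "d4"],
    "label": ["Gephi", "Voyant", "Gephi", None],
})
duplicates = find_duplicates(sample, "label")
print(duplicates)
```

Sorting by the attribute places the members of each duplicate group next to each other, which makes the later manual comparison easier.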

In [None]:
print(
    f'Using the attribute "{filter_attribute}" as filter, there are: '
    f'{df_tool_duplicates.shape[0]} duplicated tools, '
    f'{df_publication_duplicates.shape[0]} duplicated publications, '
    f'{df_trainingmaterials_duplicates.shape[0]} duplicated training materials, '
    f'{df_workflows_duplicates.shape[0]} duplicated workflows, '
    f'{df_datasets_duplicates.shape[0]} duplicated datasets'
)

In [None]:
item_vis_mask=['MPUrl','persistentId', 'label', 'accessibleAt', 'source.label']
df_tool_duplicates[item_vis_mask].head(6)

In [None]:
clickable_cmp_table = df_tool_duplicates[item_vis_mask].style.format({'MPUrl': utils.make_clickable})
clickable_cmp_table

### Obtain the merged item and view it
The function __getMergedItem(category, pids)__ takes the category of the items to be merged and a comma-separated string with the *persistentId* of each item to be merged.  
It returns two values: 
<ul><li>a dataframe that can be printed to inspect the merged item</li><li>a JSON object that must be passed as a parameter to the function that writes the merged item back to the MP dataset</li></ul>
In the following cells the selected items are compared, then the function is invoked and the result is displayed.

In [None]:
category="toolsandservices"
#persistentId of duplicated items
#pids="C75gkx, XE6Spj"
pids="bKefY4, mTDlTo"
#create the data frame
persistentids=pids.replace(" ", "").split(',')
compareitems=df_tool_duplicates[df_tool_duplicates.persistentId.isin(persistentids)]
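
The string handling in the cell above removes spaces and splits on commas. A slightly more defensive variant is sketched below; the helper name `parse_pids` is hypothetical and not part of the Marketplace library.

```python
def parse_pids(pids: str) -> list[str]:
    """Split a comma-separated string of persistentIds into a clean list."""
    # Strip whitespace around each id and drop empty entries,
    # which would otherwise appear after a trailing or doubled comma.
    return [pid.strip() for pid in pids.split(",") if pid.strip()]

persistentids = parse_pids("bKefY4, mTDlTo")
print(persistentids)
```

Dropping empty entries avoids filtering the dataframe with an empty string when the input ends with a comma.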

In [None]:
css_equal="font-size:1.5rem; border: 2px solid silver;background-color: white; padding: 10px 20px"
css_diff="background-color: lightyellow;  font-size:1.5rem; border: 2px solid silver; padding: 10px 20px"

In [None]:
#view items
showdiff = compareitems.T.style.apply(lambda x: [css_equal if ((len(utils.lists_to_list(x.values))==1)) else css_diff for i in x],
                    axis=1)
showdiff
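
The highlighted table marks the attributes in which the selected items differ. The same information can be obtained as a plain list, as in the sketch below. It does not use `utils.lists_to_list`, whose behavior is not shown in this notebook; the function name `differing_attributes` and the sample data are hypothetical.

```python
import pandas as pd

def differing_attributes(items: pd.DataFrame) -> list:
    """Return the column names whose values are not identical across all rows."""
    differing = []
    for column in items.columns:
        # Compare string representations so that unhashable cell values
        # such as lists can be compared as well.
        distinct = {str(value) for value in items[column]}
        if len(distinct) > 1:
            differing.append(column)
    return differing

# Hypothetical sample data, not taken from the Marketplace.
sample = pd.DataFrame({
    "persistentId": ["x1", "y2"],
    "label": ["Gephi", "Gephi"],
    "accessibleAt": [["https://example.org/a"], ["https://example.org/b"]],
})
differences = differing_attributes(sample)
print(differences)
```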

In [None]:
#get the merged item
mergeditem=mpdata.getMergedItem(category, pids)

Check the merged item:

In [None]:
mergeditem[0].head(30).style.set_properties(**{'width': '75%', 'border': '1px solid silver', 'background-color': 'lightblue', 'padding': '10px 20px'})

The function __postMergedItem(JSonItem, pids)__ stores the merged item in the MP dataset. It takes the merged item as a JSON object and the *persistentId* values of the items that were merged.


In [None]:
mpdata.postMergedItem(mergeditem[1], pids)