# Notebook 4.3 - Actors curation: duplicates

This notebook gathers several checks that can be run together or independently of each other. Together, these checks help moderators curate duplicated actors in the SSH Open Marketplace.

This notebook is composed of the following sections:

0. Requirements to run this notebook
1. Get actors 
2. Duplicated actors

    2.1 Find duplicates for actors with same name and same website 
    
    2.2 Compare duplicated actors with same name and same website 
    
    2.3 Merge duplicated actors with same name and same website 
    
    2.4 Reload the Actors
    
    2.5 Find duplicates for actors with same name 
    
    2.6 Merge duplicated actors with same name 
    
    2.7 Merge actors after comparison step


## 0 Requirements to run this notebook

This section sets up everything needed to interact with the SSH Open Marketplace (MP) data.

### 0.1 Libraries
*A number of external libraries are needed to run the notebook.*

*In addition, a dedicated SSH Open Marketplace library, sshmarketplacelib, provides customised functions and can be imported with the usual Python import statements.*

*The imports needed to run this notebook are listed below.*

In [1]:
import pandas as pd #to manage dataframes
#import matplotlib.pyplot as plt #to create histograms and images
#import seaborn as sns #to create histograms and images
import numpy as np #numerical arrays
#import the MarketPlace Library 
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import eval as eva, helper as hel

In [2]:
mpdata = mpd()
utils=hel.Util()
check=eva.URLCheck()

### 0.2 Utility functions

In [3]:
def getDuplicateActorsWithEmptyItems(actors):
    """Return the duplicate groups in which no actor is attached to any item."""
    df_actors_ei=pd.DataFrame()
    for item in actors.itertuples():
        allEmpty=True
        for actorid in item.idstobemerged:
            actitems=mpdata.getItemsforActor(str(actorid))
            allEmpty=allEmpty and actitems.empty
            #stop at the first actor that has items
            if not allEmpty:
                break
        if allEmpty:
            entry = actors.loc[actors['name'] == item.name]
            df_actors_ei=pd.concat([df_actors_ei, entry])
    return df_actors_ei

In [4]:
def getDuplicateActorsWithDifferentItems(actors):
    """Return the duplicate groups whose actors are attached to different items."""
    df_results=pd.DataFrame()
    for item in actors.itertuples():
        if len(item.idstobemerged)<2:
            continue

        #the items of the first actor are the reference for the comparison
        tempdf_sn=mpdata.getItemsforActor(str(item.idstobemerged[0])).drop_duplicates('persistentId', keep='first')
        entry = actors.loc[actors['name'] == item.name]
        for actorid in item.idstobemerged[1:]:
            actitems=mpdata.getItemsforActor(str(actorid)).drop_duplicates('persistentId', keep='first')
            #exactly one of the two actors has items
            if tempdf_sn.empty != actitems.empty:
                df_results=pd.concat([df_results, entry])
                break
            if not actitems.empty:
                tre=(actitems['persistentId'].isin(tempdf_sn['persistentId'])).value_counts()
                tre_r=(tempdf_sn['persistentId'].isin(actitems['persistentId'])).value_counts()
                #each actor has at least one item that the other lacks
                if (False in tre.to_dict()) and (False in tre_r.to_dict()):
                    df_results=pd.concat([df_results, entry])
                    break
    return df_results


In [5]:
import functools as ft
def getDuplicateActorsWithSameItems(dfs):
    """Return the duplicate groups of dfs[2] that are in neither dfs[0] nor dfs[1].

    dfs is expected in the order [different_items, empty_items, all_duplicates].
    """
    df_actors_si=pd.DataFrame()
    act_dfs=[x for x in dfs if not x.empty]
    if (len(act_dfs)<1):
        return df_actors_si
    if (len(act_dfs)==1):
        #with a single non-empty dataframe there is nothing to exclude
        return dfs[2]
    #outer merge on name: groups missing from a dataframe get null in its idstobemerged column
    df_temp=ft.reduce(lambda left, right: pd.merge(left, right, on=['name'], how="outer"), act_dfs)
    if (len(act_dfs)==3):
        return df_temp.loc[(df_temp[['idstobemerged_x', 'idstobemerged_y']].isnull().all(axis=1)) & (df_temp['idstobemerged'].notnull())][['name', 'idstobemerged']]
    if (len(act_dfs)==2):
        testr=df_temp.loc[(df_temp['idstobemerged_x'].isnull()) & 
                         (df_temp['idstobemerged_y'].notnull())][['name', 'idstobemerged_y']]
        return testr.rename(columns = {'idstobemerged_y':'idstobemerged'})
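
The outer merge used by `getDuplicateActorsWithSameItems` may be easier to follow on a small example. The sketch below reproduces the three-dataframe case with invented names and ids; it is only an illustration of the pattern, not data from the Marketplace.

```python
import pandas as pd

# Invented toy data: group "A" has different items, group "B" has no items.
all_groups = pd.DataFrame({"name": ["A", "B", "C"],
                           "idstobemerged": [[1, 2], [3, 4], [5, 6]]})
different = pd.DataFrame({"name": ["A"], "idstobemerged": [[1, 2]]})
empty = pd.DataFrame({"name": ["B"], "idstobemerged": [[3, 4]]})

# Outer merges keep every name; a group absent from a dataframe gets a null
# in the column that dataframe contributes (suffixes _x and _y, then none).
merged = (different.merge(empty, on="name", how="outer")
                   .merge(all_groups, on="name", how="outer"))

# Keep the groups that appear only in the full list of duplicates.
remaining = merged.loc[
    merged[["idstobemerged_x", "idstobemerged_y"]].isnull().all(axis=1)
    & merged["idstobemerged"].notnull(),
    ["name", "idstobemerged"],
]
print(remaining)
```

Only group "C" remains, because it is listed neither among the groups with different items nor among the groups without items.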
    

## 1. Get actors

In [6]:
df_actors_flat =mpdata.getMPItems ("actors", False)

In [7]:
df_actors_flat.tail()

Unnamed: 0,id,name,externalIds,affiliations,website
7946,8062,Zong Peng,"[{'identifierService': {'code': 'DBLP', 'label...",[],
7947,218,Zoomify Inc.,[{'identifierService': {'code': 'SourceActorId...,[],
7948,1590,Zoppi Angela,[],[],
7949,7819,Zsófia Fellegi,"[{'identifierService': {'code': 'DBLP', 'label...",[],
7950,9882,Zsolt Szántó,[],[],


## 2. Duplicated actors

- 2.1 Get duplicates for actors using *actor.name* and *actor.website* as filters
- 2.2 Compare duplicated actors (optional)
- 2.3 Merge duplicated actors (with same name, same website)
- 2.4 Reload the actors data from the MP
- 2.5 Get duplicates for actors using *actor.name* as filter
- 2.6 Merge actors without manual checks
- 2.7 Merge actors after comparison step

### 2.1 Get duplicates for actors using *actor.name* and *actor.website* as filter

In [8]:
utils=hel.Util()
filter_attribute=['name', 'website']
df_actor_duplicates=df_actors_flat[df_actors_flat.duplicated(subset=filter_attribute, keep=False)]
dupl_actor_website=df_actor_duplicates[df_actor_duplicates['website'].notnull()].sort_values('name')

In [9]:
print (f'Using the attributes "{filter_attribute}" as filter, there are: {dupl_actor_website.shape[0]} duplicated actors')

Using the attributes "['name', 'website']" as filter, there are: 0 duplicated actors


In [None]:
actorwebsite_tomerge=dupl_actor_website.groupby(['name','website'])['id'].apply(list).reset_index(name='idstobemerged')

In [None]:
actorwebsite_tomerge.count()

In [None]:
#The number of actors with more than one duplicate
actorwebsite_tomerge[actorwebsite_tomerge.idstobemerged.map(len)>2].count()

In [None]:
actorwebsite_tomerge.head()
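
To illustrate the cells above, the sketch below applies the same `duplicated` and `groupby` pattern to a small hand-made dataframe. The actor names, ids and websites are invented for the example and do not come from the Marketplace.

```python
import pandas as pd

# Toy data (invented): two actors share both name and website,
# two others share a name but have no website.
toy_actors = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Alice Example", "Alice Example", "Bob Example", "Bob Example"],
    "website": ["https://alice.example", "https://alice.example", None, None],
})

filter_attribute = ["name", "website"]
# keep=False flags every member of a duplicate group, not only the repeats.
toy_duplicates = toy_actors[toy_actors.duplicated(subset=filter_attribute, keep=False)]
# Rows without a website are dropped, as in the notebook.
toy_with_website = toy_duplicates[toy_duplicates["website"].notnull()]

# One row per (name, website) pair, with the ids of all duplicates in a list.
toy_tomerge = (
    toy_with_website.groupby(["name", "website"])["id"]
    .apply(list)
    .reset_index(name="idstobemerged")
)
print(toy_tomerge)
```

Missing websites are treated as equal by `duplicated`, which is why the notebook filters on `website` being present before grouping.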

### 2.2 Compare duplicated actors (optional)

In [None]:
#id of duplicated actors
ids=[174, 1978]
compareitems=df_actor_duplicates[df_actor_duplicates.id.isin(ids)]

In [None]:
compareitems

In [None]:
css_equal="font-size:1.5rem; border: 2px solid silver;background-color: white; padding: 10px 20px"
css_diff="background-color: lightyellow;  font-size:1.5rem; border: 2px solid silver; padding: 10px 20px"

In [None]:
#view items
showdiff = compareitems.T.style.apply(lambda x: [css_equal if ((len(utils.lists_to_list(x.values))==1) ) else css_diff for i in x],
                    axis=1)
showdiff
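
When the styled view is not convenient (for example outside a notebook), a plain boolean summary can flag the attributes that differ between two actors. This is a minimal sketch on invented records; casting the values to strings is an assumption made here so that list-valued columns such as `externalIds` can be compared.

```python
import pandas as pd

# Two invented actor records with the same name but different externalIds.
toy_compare = pd.DataFrame({
    "id": [101, 102],
    "name": ["Alice Example", "Alice Example"],
    "externalIds": [[], [{"identifier": "abc"}]],
    "website": [None, None],
})

# One entry per attribute; True where the two actors disagree.
# Values are cast to str so that lists can be compared safely.
differs = toy_compare.astype(str).nunique() > 1
print(differs)
```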

### 2.3 Merge duplicated actors

The function <i>postMergedActors</i> uses the API endpoint <i>(POST) /api/actors/{id}/merge</i> to automatically merge the actors contained in the dataframe <i>actorwebsite_tomerge</i>.

In [None]:
for item in actorwebsite_tomerge.itertuples():
    print(item.idstobemerged[0], ", ".join(str (e) for e in item.idstobemerged[1:]))
    mpdata.postMergedActors(str(item.idstobemerged[0]), ", ".join(str (e) for e in item.idstobemerged[1:]))
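
Before running a merge for real, it can help to preview the request each row would trigger. The sketch below only builds the URL and sends nothing. `build_merge_url` is a hypothetical helper and not part of sshmarketplacelib; the URL pattern is copied from the debug output printed further down in this notebook.

```python
# Base URL as it appears in the notebook's debug output.
API_BASE = "https://marketplace-api.sshopencloud.eu/api"

def build_merge_url(ids_to_merge):
    """Build the merge URL for one group of ids (hypothetical helper).

    The first id is used as the {id} path parameter and the remaining
    ids are passed in the 'with' query parameter.
    """
    if len(ids_to_merge) < 2:
        raise ValueError("at least two actor ids are needed for a merge")
    target, *others = ids_to_merge
    with_param = ",".join(str(actor_id) for actor_id in others)
    return f"{API_BASE}/actors/{target}/merge?with={with_param}"

print(build_merge_url([2631, 2470, 2427, 578]))
```

Checking the printed URLs against the ids in the dataframe is a cheap way to catch a wrong group before any actor is merged.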

### 2.4 Reload the actors from the MP

In [11]:
df_actors_flat_new =mpdata.getMPItems ("actors", False)

### 2.5 Get duplicates for actors using actor.name as filter

In [12]:
utils=hel.Util()
filter_attribute='name'
df_actor_duplicates_new=df_actors_flat_new[df_actors_flat_new.duplicated(subset=filter_attribute, keep=False)]
actor_tomerge_new=df_actor_duplicates_new.groupby(['name'])['id'].apply(list).reset_index(name='idstobemerged')

In [13]:
print (f'Using the attributes "{filter_attribute}" as filter, there are: {actor_tomerge_new.shape[0]} duplicated actors')

Using the attributes "name" as filter, there are: 340 duplicated actors


The code below generates three different dataframes:
<ol>
<li>The dataframe <i>df_actors_samename_empty_items</i> contains the duplicated actors that were <i>never attached to any item</i>.</li>
<li>The dataframe <i>df_actors_samename_different_items</i> contains the duplicated actors that were never attached to the same items. <b>The actors in this dataframe should be checked and merged manually (see section 2.7).</b></li>
<li>The dataframe <i>df_actors_samename_sameitems</i> contains the duplicated actors that were attached to the same items.</li>
</ol>

Actors in dataframes 1 and 3 can be merged automatically. Actors in dataframe 2 should be inspected manually and merged individually (see section 2.7).
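
The three categories can be summarised with plain Python sets. The sketch below mirrors the comparisons made by the utility functions in section 0.2, where each actor is compared with the first actor of its group. `classify_duplicates` is a hypothetical helper written for this illustration, and the item ids are invented.

```python
def classify_duplicates(item_sets):
    """Classify one group of same-name actors.

    item_sets holds, for each actor of the group, the set of persistentIds
    of the items it is attached to. Returns "empty", "different" or "same".
    """
    if all(len(items) == 0 for items in item_sets):
        return "empty"
    first = item_sets[0]
    for other in item_sets[1:]:
        # Exactly one of the two actors has items.
        if bool(first) != bool(other):
            return "different"
        # Both have items, and each has at least one item the other lacks.
        if other and (other - first) and (first - other):
            return "different"
    return "same"

print(classify_duplicates([set(), set()]))          # no items at all
print(classify_duplicates([{"a", "b"}, {"a"}]))     # one set contains the other
print(classify_duplicates([{"a"}, {"b"}]))          # disjoint items
```

Under this reading, a group counts as "same" when the item set of every other actor either contains or is contained in the item set of the first actor.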

#### _Identifying actors with the exact same name that were never attached to any item_

In [14]:
df_actors_samename_empty_items= getDuplicateActorsWithEmptyItems(actor_tomerge_new)

In [15]:
df_actors_samename_empty_items.head(10)

Unnamed: 0,name,idstobemerged
117,"INRIA, Paris, France","[8968, 6441, 8970, 8593, 8960, 8119, 8966, 897..."


#### _Actors with the exact same name that were never attached to the same items_

In [16]:
df_actors_samename_different_items=getDuplicateActorsWithDifferentItems(actor_tomerge_new)

In [17]:
df_actors_samename_different_items.count()

name             337
idstobemerged    337
dtype: int64

#### _Actors with the exact same name that were attached to at least one common item._

In [18]:
dfs=[df_actors_samename_different_items, df_actors_samename_empty_items, actor_tomerge_new]
df_actors_samename_sameitems=getDuplicateActorsWithSameItems(dfs)

In [20]:
df_actors_samename_sameitems.head(10)

Unnamed: 0,name,idstobemerged
338,Digital Innovation Lab,"[2631, 2470, 2427, 578]"
339,Jakub Waszczuk,"[234, 2177, 2045]"


### 2.6 Merge actors without manual checks

The code above has generated three different dataframes:

<ol>
    <li>The dataframe <i>df_actors_samename_empty_items</i> contains the duplicated actors that were <i>never attached to any item</i>.</li>
    <li>The dataframe <i>df_actors_samename_different_items</i> contains the duplicated actors that were never attached to the same items. <b>The actors in this dataframe should be checked and merged manually (see section 2.7).</b></li>
    <li>The dataframe <i>df_actors_samename_sameitems</i> contains the duplicated actors that were attached to the same items.</li>
</ol>

The function <i>postMergedActors</i> uses the API endpoint <i>(POST) /api/actors/{id}/merge</i> to automatically merge the actors contained in dataframes 1 and 3.

##### Merging duplicates from *df_actors_samename_empty_items*

In [21]:
for item in df_actors_samename_empty_items.itertuples():
    print(item.idstobemerged[0], ", ".join(str (e) for e in item.idstobemerged[1:]))
    mpdata.postMergedActors(str(item.idstobemerged[0]), ", ".join(str (e) for e in item.idstobemerged[1:]))

8968 6441, 8970, 8593, 8960, 8119, 8966, 8972, 7953, 8962, 5845, 8964, 8958, 6036, 5404
Merging actor 8968 with actor(s) 6441, 8970, 8593, 8960, 8119, 8966, 8972, 7953, 8962, 5845, 8964, 8958, 6036, 5404...
URL: https://marketplace-api.sshopencloud.eu/api/actors/8968/merge?with=6441,8970,8593,8960,8119,8966,8972,7953,8962,5845,8964,8958,6036,5404
...not executed, running in DEBUG mode.


##### Merging duplicates from *df_actors_samename_sameitems*

In [22]:
for item in df_actors_samename_sameitems.itertuples():
    print(item.idstobemerged[0], ", ".join(str (e) for e in item.idstobemerged[1:]))
    mpdata.postMergedActors(str(item.idstobemerged[0]), ", ".join(str (e) for e in item.idstobemerged[1:]))

2631 2470, 2427, 578
Merging actor 2631 with actor(s) 2470, 2427, 578...
URL: https://marketplace-api.sshopencloud.eu/api/actors/2631/merge?with=2470,2427,578
...not executed, running in DEBUG mode.
234 2177, 2045
Merging actor 234 with actor(s) 2177, 2045...
URL: https://marketplace-api.sshopencloud.eu/api/actors/234/merge?with=2177,2045
...not executed, running in DEBUG mode.


### 2.7 Merge actors after comparison step

In [23]:
df_actors_samename_different_items.sort_values('name').to_csv(path_or_buf='data/duplicatedactors.csv', sep=',', index=False)
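
If the exported CSV is later read back into pandas, for example after a moderator has reviewed it, the `idstobemerged` column comes back as text rather than as Python lists. The sketch below shows one way to restore the lists with `ast.literal_eval`. It uses an in-memory buffer and invented data instead of the `data/duplicatedactors.csv` file.

```python
import ast
import io

import pandas as pd

# Invented example row with a list-valued column.
toy = pd.DataFrame({"name": ["Alice Example"], "idstobemerged": [[11, 12]]})

# Write to an in-memory buffer with the same options as the notebook cell.
buffer = io.StringIO()
toy.to_csv(buffer, sep=",", index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
# After the round trip the list is a plain string such as "[11, 12]".
was_text = isinstance(reloaded.loc[0, "idstobemerged"], str)

# Parse the text back into Python lists.
reloaded["idstobemerged"] = reloaded["idstobemerged"].apply(ast.literal_eval)
print(reloaded.loc[0, "idstobemerged"])
```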


In [24]:
df_actors_samename_different_items

Unnamed: 0,name,idstobemerged
0,"Poznańskie Centrum Superkomputerowo-Sieciowe,...","[1687, 3238]"
1,Alan Liu,"[149, 7812]"
2,Alan MacEachern,"[8384, 1017, 2938]"
3,Alastair Dunning,"[6070, 1262]"
4,Alex Brey,"[1105, 3008]"
...,...,...
335,lindat-help@ufal.mff.cuni.cz,"[1757, 3758]"
336,ricard.campos@coronis.es,"[3771, 728]"
337,training@cessda.eu,"[9472, 1213]"
338,user-services.fsd@tuni.fi,"[3774, 1473]"


#### Compare actors

In [25]:
#ids of the duplicated actors to compare, taken from the reloaded data (section 2.5)
ids=[1687, 3238]
compareitems=df_actor_duplicates_new[df_actor_duplicates_new.id.isin(ids)]
css_equal="font-size:1.5rem; border: 2px solid silver;background-color: white; padding: 10px 20px"
css_diff="background-color: lightyellow;  font-size:1.5rem; border: 2px solid silver; padding: 10px 20px"

In [26]:
#view items
showdiff = compareitems.T.style.apply(lambda x: [css_equal if ((len(utils.lists_to_list(x.values))==1) ) else css_diff for i in x],
                    axis=1)
showdiff

Unnamed: 0,5815,5816
id,1687,3238
name,"Poznańskie Centrum Superkomputerowo-Sieciowe, Poznań","Poznańskie Centrum Superkomputerowo-Sieciowe, Poznań"
externalIds,[],"[{'identifierService': {'code': 'SourceActorId', 'label': 'Source ActorId', 'ord': 7, 'urlTemplate': ''}, 'identifier': '4-3d636c0a83a4e1caeafe4f55598ec24a66958263a4fb81c14aa73f32ab816c6c'}]"
affiliations,[],[]
website,,


Because this comparison view is not always enough to judge whether actors should be merged or differentiated in another way, the following API call can help:

`GET /api/actors/{id}?items=true`

It shows the items an actor was or is attached to and provides useful information for telling potential homonyms apart. For example, for IDs [1687, 3238], compare https://marketplace-api.sshopencloud.eu/api/actors/1687?items=true and https://marketplace-api.sshopencloud.eu/api/actors/3238?items=true
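
Once the two responses have been retrieved, the attached items can be compared programmatically. The sketch below works on already-parsed JSON and makes no network call. It assumes that each response carries an `items` list whose entries have a `persistentId` field; that structure is an assumption for this illustration and should be checked against the actual API response. Both helper functions and the sample payloads are hypothetical.

```python
def item_ids(actor_payload):
    """Collect the persistentIds of the items in one actor payload.

    Assumes an 'items' list with a 'persistentId' key per entry
    (assumed structure, not confirmed by the notebook).
    """
    return {item["persistentId"] for item in actor_payload.get("items", [])}

def compare_actor_items(payload_a, payload_b):
    """Return the shared items and the items specific to each actor."""
    ids_a, ids_b = item_ids(payload_a), item_ids(payload_b)
    return {
        "shared": ids_a & ids_b,
        "only_first": ids_a - ids_b,
        "only_second": ids_b - ids_a,
    }

# Invented payloads standing in for the two API responses.
actor_a = {"id": 1, "items": [{"persistentId": "p1"}, {"persistentId": "p2"}]}
actor_b = {"id": 2, "items": [{"persistentId": "p2"}]}
comparison = compare_actor_items(actor_a, actor_b)
print(comparison)
```

Two actors with no shared item and unrelated items are more likely to be homonyms than duplicates, so a result with an empty "shared" set is a hint to look more closely before merging.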

In [27]:
mpdata.postMergedActors('1687', '3238')

Merging actor 1687 with actor(s) 3238...
URL: https://marketplace-api.sshopencloud.eu/api/actors/1687/merge?with=3238
...not executed, running in DEBUG mode.


''