# Notebook 4.3 - Actors curation: duplicates

This notebook gather several checks that can be run together or independently of each other. The set of these checks helps moderators to curate Duplicated actors in the SSH Open Marketplace. 

This notebook is composed of 6 sections:

0. Requirements to run this notebook
1. Get actors 
2. Duplicated actors

    2.1 Find duplicates for actors with same name and same website 
    
    2.2 Compare duplicated actors with same name and same website 
    
    2.3 Merge duplicated actors with same name and same website 
    
    2.4 Reload the Actors
    
    2.5 Find duplicates for actors with same name 
    
    2.6 Merge duplicated actors with same name 


## 0 Requirements to run this notebook

This section gives all the relevant information to "interact" with the MP data.

### 0.1 libraries
*There are a number of external libraries needed to run the notebook* 

*Furthermore, a dedicated SSH Open Marketplace library - sshmarketplacelib - with customised functions has been created and can be imported using the python import commands.* 

*Below the libraries import needed to run this notebook*

In [1]:
import pandas as pd #to manage dataframes
#import matplotlib.pyplot as plt #to create histograms and images
#import seaborn as sns #to create histograms and images
import numpy as np #to manage json objects
#import the MarketPlace Library 
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import  eval as eva, helper as hel

In [2]:
mpdata = mpd()
utils=hel.Util()
check=eva.URLCheck()

### 0.2 Utility functions

In [4]:
def getDuplicateActorsWithEmptyItems(actors):
    df_actors_ei=pd.DataFrame()
    for item in actors.itertuples():
        allEmpty=True;
        for actorid in item.idstobemerged:
            actitems=mpdata.getItemsforActor(str(actorid))
            allEmpty=allEmpty & actitems.empty
            if(not allEmpty):
                break
        if (allEmpty):
            entry = actors.loc[actors['name'] == item.name]
            df_actors_ei=pd.concat([df_actors_ei, entry])
    return df_actors_ei

In [47]:
def getDuplicateActorsWithDifferentItems(actors):
    df_results=pd.DataFrame()
    for item in actors.itertuples():  
        if (len (item.idstobemerged)<2):
            continue
        
        tempdf_sn=mpdata.getItemsforActor(str(item.idstobemerged[0])).drop_duplicates('persistentId', keep='first')
        for actorid in item.idstobemerged[1:]:
            #print (actorid)
            actitems=mpdata.getItemsforActor(str(actorid)).drop_duplicates('persistentId', keep='first')
            entry = actors.loc[actors['name'] == item.name]
            if ((tempdf_sn.empty & (not actitems.empty)) | ((not tempdf_sn.empty) & actitems.empty)):
                #print (f'{tempdf_sn.empty}, {actitems.empty}, {str(item.idstobemerged[0])}, {actorid}')
                df_results=pd.concat([df_results, entry])
                break 
            if ((not actitems.empty)):
                tre=(actitems['persistentId'].isin(tempdf_sn['persistentId'])).value_counts()
                tre_r= (tempdf_sn['persistentId'].isin(actitems['persistentId'])).value_counts()  
                
                if ((False in tre.to_dict()) & (False in tre_r.to_dict())):
                    #print (f'{item.idstobemerged[0]}, {actorid}')
                   
                    df_results=pd.concat([df_results, entry])
                    break
    return df_results


## 1. Get actors

In [3]:
df_actors_flat =mpdata.getMPItems ("actors", False)

In [None]:
df_actors_flat.tail()

## 2. Duplicated actors
    2.1 Get duplicates for actors using *actor.name* and *actor.website* as filters
    2.2 Compare duplicated actors (optional)
    2.4 Merge duplicated actors

### 2.1 Get duplicates for actors using *actor.name* and *actor.website* as filter

In [6]:
utils=hel.Util()
filter_attribute=['name', 'website']
df_actor_duplicates=df_actors_flat[df_actors_flat.duplicated(subset=filter_attribute, keep=False)]
dupl_actor_website=df_actor_duplicates[df_actor_duplicates['website'].notnull()].sort_values('name')

In [7]:
print (f'Using the attributes "{filter_attribute}" as filter, there are: {dupl_actor_website.shape[0]} duplicated actors')

Using the attributes "['name', 'website']" as filter, there are: 715 duplicated actors


In [8]:
actorwebsite_tomerge=dupl_actor_website.groupby(['name','website'])['id'].apply(list).reset_index(name='idstobemerged')

In [9]:
actorwebsite_tomerge.count()

name             346
website          346
idstobemerged    346
dtype: int64

In [10]:
#The number of actors with more than one duplicate
actorwebsite_tomerge[actorwebsite_tomerge.idstobemerged.map(len)>2].count()

name             23
website          23
idstobemerged    23
dtype: int64

In [11]:
actorwebsite_tomerge.head()

Unnamed: 0,name,website,idstobemerged
0,ARTFL Project and Digital Library Development ...,http://artfl-project.uchicago.edu/,"[842, 2720]"
1,AT&T Research,http://www.research.att.com/,"[2566, 701]"
2,ATLAS.ti Scientific Software Development GmbH,http://www.atlasti.com/copyright.html,"[1828, 25]"
3,Alan Liu,http://liu.english.ucsb.edu/,"[149, 1954]"
4,Alan Reed,http://www.textworld.com/,"[2332, 493]"


### 2.2 Compare duplicated actors

In [12]:
#id of duplicated actors
ids=[174, 1978]
compareitems=df_actor_duplicates[df_actor_duplicates.id.isin(ids)]

In [13]:
compareitems

Unnamed: 0,id,name,externalIds,affiliations,website
8193,174,University at Buffalo's Department of Classic...,[],[],
8194,1978,University at Buffalo's Department of Classic...,[{'identifierService': {'code': 'SourceActorId...,[],


In [14]:
css_equal="font-size:1.5rem; border: 2px solid silver;background-color: white; padding: 10px 20px"
css_diff="background-color: lightyellow;  font-size:1.5rem; border: 2px solid silver; padding: 10px 20px"

In [15]:
#view items
showdiff = compareitems.T.style.apply(lambda x: [css_equal if ((len(utils.lists_to_list(x.values))==1) ) else css_diff for i in x],
                    axis=1)
showdiff

Unnamed: 0,8193,8194
id,174,1978
name,"University at Buffalo's Department of Classics and Department of Linguistics, and the VAST Lab of the University of Colorado at Colorado Springs.","University at Buffalo's Department of Classics and Department of Linguistics, and the VAST Lab of the University of Colorado at Colorado Springs."
externalIds,[],"[{'identifierService': {'code': 'SourceActorId', 'label': 'Source ActorId', 'ord': 7, 'urlTemplate': ''}, 'identifier': '1-39d412d7bdd1a79bc8dfce7928d92bc4d316ab3c0b17cc0a37492e1dfe341b69'}]"
affiliations,[],[]
website,,


### 2.3 Merge items

The code above has generated three different dataframes:

<ol>
    <li>The dataframe <i>df_actors_with_same_items</i> that contains the duplicated actors that were attached to same items</li>
    <li>The dataframe <i>df_actors_empty_items</i> that contains the duplicated actors that were *never attached to any item*</li>
    <li> The dataframe <i>df_actors_with_different_items</i> that contains the duplicated actors that were never attached to the same itemsm, <b>the actors in this dataframe should be manually merged!</b></li>
</ol>
The function <i>postMergedActors </i> uses the API entry: <i>(POST) /api/actors/{id}/merge</i> to automatically merge the actors contained in dataframes 1 or 2.

In [None]:
for item in actorwebsite_tomerge.itertuples():
    print(item.idstobemerged[0], ", ".join(str (e) for e in item.idstobemerged[1:]))
    mpdata.postMergedActors(str(item.idstobemerged[0]), ", ".join(str (e) for e in item.idstobemerged[1:]))

## 2.4 Reload the actor from MP

In [16]:
df_actors_flat_new =mpdata.getMPItems ("actors", False)

### 2.4 Get duplicates for actors using actor.name as filter

In [17]:
utils=hel.Util()
filter_attribute='name'
df_actor_duplicates=df_actors_flat[df_actors_flat.duplicated(subset=filter_attribute, keep=False)]
actor_tomerge=df_actor_duplicates.groupby(['name'])['id'].apply(list).reset_index(name='idstobemerged')

In [18]:
print (f'Using the attributes "{filter_attribute}" as filter, there are: {df_actor_duplicates.shape[0]} duplicated actors')

Using the attributes "name" as filter, there are: 2837 duplicated actors


#### _Individuating actors with the exact same name that were never attached to any items._

In [19]:
df_actors_samename_empty_items= getDuplicateActorsWithEmptyItems(actor_tomerge)

In [40]:
df_actors_samename_empty_items.head(10)

Unnamed: 0,name,idstobemerged
1,Maastricht University,"[1665, 3210]"
4,Waterford Institute of Technology,"[1666, 3211]"
53,Andrew Hnatow,"[2945, 1028]"
61,Antoine Henry,"[1139, 3064, 3104]"
68,Aracele Torres,"[3094, 3054, 1127]"
72,Armando Luza,"[1111, 3011]"
74,Artefactual Systems,"[2139, 969]"
86,BBAW,"[3233, 1681]"
90,"Bar-Ilan University, Ramat Gan, Israel","[3842, 3904]"
100,Berlin-Brandenburg Academy of Sciences and Hum...,"[1661, 3246]"


#### _Different actors with the exact same name that were never attached to at least one common items_

In [48]:
df_actors_samename_different_items=getDuplicateActorsWithDifferentItems(actor_tomerge)

In [49]:
df_actors_samename_different_items.count()

name             416
idstobemerged    416
dtype: int64

#### _Actors with the exact same name that were attacheed to same items._

In [50]:
import functools as ft
dfs=[df_actors_samename_empty_items, df_actors_samename_different_items, actor_tomerge, ]
df_temp=ft.reduce(lambda left, right: pd.merge(left, right, on=['name'], how="outer"), dfs)
df_actors_samename_sameitems=df_temp.loc[(df_temp[['idstobemerged_x', 'idstobemerged_y']].isnull().all(1)) & (df_temp['idstobemerged'].notnull())][['name', 'idstobemerged']]

In [51]:
df_actors_samename_sameitems.sort_values('name').head()

Unnamed: 0,name,idstobemerged
512,"Jan Aarts, Hans van Halteren and Nelleke Oost...","[426, 2258]"
513,University at Buffalo's Department of Classic...,"[174, 1978]"
514,API,"[2200, 36]"
515,ARTFL,"[2161, 340]"
516,ARTFL Project and Digital Library Development ...,"[842, 2720]"


### Compare Actors

In [52]:
ids=[426, 2258]
compareitems=df_actor_duplicates[df_actor_duplicates.id.isin(ids)]
css_equal="font-size:1.5rem; border: 2px solid silver;background-color: white; padding: 10px 20px"
css_diff="background-color: lightyellow;  font-size:1.5rem; border: 2px solid silver; padding: 10px 20px"

In [53]:
#view items
showdiff = compareitems.T.style.apply(lambda x: [css_equal if ((len(utils.lists_to_list(x.values))==1) ) else css_diff for i in x],
                    axis=1)
showdiff

Unnamed: 0,3568,3569
id,426,2258
name,"Jan Aarts, Hans van Halteren and Nelleke Oostdijk, University of Nijmegen","Jan Aarts, Hans van Halteren and Nelleke Oostdijk, University of Nijmegen"
externalIds,[],"[{'identifierService': {'code': 'SourceActorId', 'label': 'Source ActorId', 'ord': 7, 'urlTemplate': ''}, 'identifier': '1-385568d7ab786ea0f0d6f35a7a73e948f7fdcee7346c95991e4085525eec52c2'}]"
affiliations,[],[]
website,,


## 2.5 Merge actors

In [None]:
for item in df_actors_samename_sameitems.itertuples():
    print(item.idstobemerged[0], ", ".join(str (e) for e in item.idstobemerged[1:]))
    mpdata.postMergedActors(str(item.idstobemerged[0]), ", ".join(str (e) for e in item.idstobemerged[1:]))