# Notebook 4.2 - Actors curation

This notebook gathers several checks that can be run together or independently of each other. Together, these checks help moderators improve the actor attributes in the SSH Open Marketplace. 

This notebook is composed of 6 sections:

0. Requirements to run this notebook
1. Get actors 
2. Multiple actor names
3. Check Actors not associated to any items
4. Duplicated actors 
    4.1 Get duplicates for actors
    4.2 Compare duplicated actors
    4.3 Get duplicated actors that are associated to dataset items
    4.4 Merge duplicated actors
5. Check actor.website URL Status


## 0 Requirements to run this notebook

This section gives all the relevant information to "interact" with the MP data.

### 0.1 Libraries
*A number of external libraries are needed to run the notebook.* 

*In addition, a dedicated SSH Open Marketplace library, sshmarketplacelib, provides customised functions and can be imported with the usual Python import statements.* 

*The imports needed to run this notebook are listed below.*

In [1]:
import pandas as pd #to manage dataframes
import matplotlib.pyplot as plt #to create histograms and images
import seaborn as sns #to create histograms and images
import numpy as np #to manage numerical arrays
#import the MarketPlace Library 
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import  eval as eva, helper as hel

In [2]:
mpdata = mpd()
utils=hel.Util()
check=eva.URLCheck()

In [3]:
df_tool_flat =mpdata.getMPItems ("toolsandservices", True)
df_publication_flat =mpdata.getMPItems ("publications", True)
df_trainingmaterials_flat =mpdata.getMPItems ("trainingmaterials", True)
df_workflows_flat =mpdata.getMPItems ("workflows", True)
df_datasets_flat =mpdata.getMPItems ("datasets", True)

getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...


## 1. Get actors

In [4]:
df_actors_flat =mpdata.getMPItems ("actors", True)

getting data from local repository...


In [5]:
df_actors_flat.tail()

Unnamed: 0,id,name,externalIds,affiliations,website
8977,8062,Zong Peng,"[{'identifierService': {'code': 'DBLP', 'label...",[],
8978,2029,Zoomify Inc.,[{'identifierService': {'code': 'SourceActorId...,[],
8979,218,Zoomify Inc.,[],[],
8980,1590,Zoppi Angela,[],[],
8981,7819,Zsófia Fellegi,"[{'identifierService': {'code': 'DBLP', 'label...",[],


## 2. Multiple actor names

Identify multi-value actors

This section of the notebook inspects actor names and produces a list (exported as a CSV file to the `data` folder) of the actors, together with their associated items, that have a comma in the `actor.name` field. Other separators, such as `;`, `/` or `-`, could also be used to inspect and clean the `actor.name` field.
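
As a minimal sketch of how the other separators mentioned above could be checked, the snippet below applies a regular expression to a small toy dataframe. The rows for `;` and `/` are made up for illustration, and the choice to match a hyphen only when it is surrounded by spaces is an assumption, made so that hyphenated names are not flagged.

```python
import pandas as pd

# Toy data standing in for the contributors dataframe; only 'actor.name' is used.
# The ';' and '/' examples are invented for illustration.
toy_df = pd.DataFrame({
    "actor.name": [
        "Ian Pearce, Devin Gaffney",
        "Zoomify Inc.",
        "Jane Doe; John Roe",
        "Alpha Lab / Beta Lab",
        "Jean-Luc Martin",
    ]
})

# Comma, semicolon, slash, or a hyphen surrounded by whitespace.
# A bare hyphen is excluded so that names such as "Jean-Luc" are not matched.
separator_pattern = r",|;|/|\s-\s"

# na=False treats missing names as "no separator" instead of producing NaN in the mask.
mask = toy_df["actor.name"].str.contains(separator_pattern, regex=True, na=False)
multi_name_df = toy_df[mask]
print(multi_name_df["actor.name"].tolist())
```

The same pattern could be passed to `str.contains` in the cells below in place of `","`, if a broader check is wanted.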


In [6]:
#extcontr_ei_df=extcontr_df[extcontr_df['actor.externalIds'].notnull()]
extcontr_df=utils.getContributors()
extcontr_ei_df=extcontr_df[extcontr_df['actor.name'].str.contains(",")]
extcontr_ei_df.head()

Loading Actors for toolsandservices
Loading Actors for publications
Loading Actors for trainingmaterials
Loading Actors for workflows
Loading Actors for datasets
keyword dataset is not present
concepts dataset is not present


Unnamed: 0,label,persistentId,category,actor.id,actor.name,actor.externalIds,actor.affiliations,role.code,role.label,role.ord,actor.website
0,140kit,SIU1nO,tool-or-service,2224,"Ian Pearce, Devin Gaffney",[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,
6,Abbot,MXsRM1,tool-or-service,2584,"Brian Pytlik-Zillig, Stephen Ramsay, and Marti...",[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,http://www.monkproject.org/
7,ABFREQ,LtXDGc,tool-or-service,2042,"Alastair McKinnon, McGill University",[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,
24,A-frame,2zIseh,tool-or-service,2524,"Diego Marcos, Don McCurdy, & Kevin Ngo",[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,
35,ALGOL,YKv47J,tool-or-service,2343,"Bauer, Bottenbruch, Rutishauser, Samelson, Bac...",[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,


In [7]:
# extcontr_ei_df['eilen'] = extcontr_ei_df['actor.externalIds'].apply(lambda y: len(y))

In [8]:
#items= pd.json_normalize(data=extcontr_ei_df, record_path='actor.affiliations', meta=['label', 'persistentId', 'category'])

In [9]:
extcontr_ei_df.shape

(321, 11)

In [10]:
extcontr_ei_df.iloc[0]['actor.externalIds']

[{'identifierService': {'code': 'SourceActorId',
   'label': 'Source ActorId',
   'ord': 7,
   'urlTemplate': ''},
  'identifier': '1-36fbe0d84d048a42c2c4e12f6f89467d24a57ad671acaef81c708c6ee6e134e9'}]

In [11]:
extcontr_nd_df=extcontr_df.drop_duplicates(subset=['actor.name', 'persistentId'], keep='first', inplace=False, ignore_index=True)


In [12]:
extcontr_df.shape

(11701, 11)

In [13]:
extcontr_ei_df=extcontr_nd_df[extcontr_nd_df['actor.name'].str.contains(",")]
extcontr_ei_df.shape

(319, 11)

In [14]:
extcontr_ei_df.sort_values('persistentId').to_csv(path_or_buf='data/commaactors.csv', sep=',', index=False)

## 3. Check Actors not associated to any items
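
The check in this section is an anti-join: keep the actors whose `id` never appears as an `actor.id` in the contributors table. The snippet below is a minimal sketch of that idea on toy data (all values are invented). It uses a left join rather than the outer join of the notebook cell; both give the same `left_only` rows, and the left join may avoid the `id` column being converted to floats by unmatched right-hand rows.

```python
import pandas as pd

# Toy stand-ins for df_actors_flat and the contributors dataframe.
actors = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
contributors = pd.DataFrame({
    "actor.id": [1, 1, 3],
    "persistentId": ["x", "y", "z"],
})

# Keep one row per linked actor so that the merge cannot multiply actor rows.
linked_ids = contributors[["actor.id"]].drop_duplicates()

# indicator=True adds a '_merge' column: 'both' if the actor has items, 'left_only' if not.
merged = actors.merge(
    linked_ids, left_on="id", right_on="actor.id", how="left", indicator=True
)
no_items = merged.loc[merged["_merge"] == "left_only", ["id", "name"]]
print(no_items)
```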

In [15]:
df_contrib=utils.getContributors()
df_noitems=df_actors_flat.merge(df_contrib['actor.id'], left_on='id', right_on='actor.id', how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']
df_noitems.sort_values('name').head()

Loading Actors for toolsandservices
Loading Actors for publications
Loading Actors for trainingmaterials
Loading Actors for workflows
Loading Actors for datasets
keyword dataset is not present
concepts dataset is not present


Unnamed: 0,id,name,externalIds,affiliations,website,actor.id,_merge
2361,3318.0,\n Costas\n ...,[{'identifierService': {'code': 'SourceActorId...,[],,,left_only
3521,3317.0,\n Emilie\n ...,[{'identifierService': {'code': 'SourceActorId...,[],,,left_only
4048,3311.0,\n FEETK\n ...,[{'identifierService': {'code': 'SourceActorId...,[],,,left_only
7623,3312.0,\n Maastricht Universit...,[{'identifierService': {'code': 'SourceActorId...,[],,,left_only
8091,3316.0,\n Martina\n ...,[{'identifierService': {'code': 'SourceActorId...,[],,,left_only


In [16]:
df_noitems.sort_values('name').shape

(2211, 7)

In [17]:
filter_attribute='name'
df_actor_noitems_duplicates=utils.getDuplicates(df_noitems, filter_attribute)
df_actor_noitems_duplicates.shape

(577, 8)

In [18]:
df_actor_noitems_duplicates

Unnamed: 0,MPUrl,id,name,externalIds,affiliations,website,actor.id,_merge
57,actors/3020.0,3020.0,Adam Crymble,"[{'identifierService': {'code': 'GitHub', 'lab...","[{'id': 2902, 'name': 'University College Lond...",http://adamcrymble.org,,left_only
58,actors/3020.0,3020.0,Adam Crymble,"[{'identifierService': {'code': 'GitHub', 'lab...","[{'id': 2902, 'name': 'University College Lond...",http://adamcrymble.org,,left_only
200,actors/1017.0,1017.0,Alan MacEachern,[],"[{'id': 665, 'name': 'Northwestern University,...",,,left_only
201,actors/2938.0,2938.0,Alan MacEachern,[{'identifierService': {'code': 'SourceActorId...,[],,,left_only
283,actors/192.0,192.0,Alexander Prokhorenko,[],[],https://twitter.com/iwhite,,left_only
...,...,...,...,...,...,...,...,...
13816,actors/2169.0,2169.0,ZappTek,[{'identifierService': {'code': 'SourceActorId...,[],http://www.zapptek.com,,left_only
13817,actors/346.0,346.0,ZappTek,[],[],http://www.zapptek.com,,left_only
13848,actors/3042.0,3042.0,Zoe LeBlanc,[{'identifierService': {'code': 'SourceActorId...,"[{'id': 2921, 'name': 'Princeton University, U...",http://zoeleblanc.com,,left_only
13849,actors/3089.0,3089.0,Zoe LeBlanc,"[{'identifierService': {'code': 'GitHub', 'lab...","[{'id': 2921, 'name': 'Princeton University, U...",http://zoeleblanc.com,,left_only


## 4. Duplicated actors
    4.1 Get duplicates for actors
    4.2 Compare duplicated actors
    4.3 Get duplicated actors that are associated to dataset items
    4.4 Merge duplicated actors

### 4.1 Get duplicates for actors

In [19]:
utils=hel.Util()
filter_attribute='name, website'
df_actor_duplicates=utils.getDuplicates(df_actors_flat, filter_attribute)
dupl_actor_website=df_actor_duplicates[df_actor_duplicates['website'].notnull()].sort_values('name')

In [20]:
print (f'Using the attribute(s) "{filter_attribute}" as filter, there are: {dupl_actor_website.shape[0]} duplicated actors')

Using the attribute(s) "name, website" as filter, there are: 715 duplicated actors


In [21]:
dupl_actor_website=df_actor_duplicates[df_actor_duplicates['website'].notnull()].sort_values('name')

### 4.2 Compare duplicated actors

In [22]:
#id of duplicated actors
ids=[426, 2258]
compareitems=df_actor_duplicates[df_actor_duplicates.id.isin(ids)]

In [23]:
css_equal="font-size:1.5rem; border: 2px solid silver;background-color: white; padding: 10px 20px"
css_diff="background-color: lightyellow;  font-size:1.5rem; border: 2px solid silver; padding: 10px 20px"

In [24]:
#view items
showdiff = compareitems.T.style.apply(lambda x: [css_equal if ((len(utils.lists_to_list(x.values))==1) ) else css_diff for i in x],
                    axis=1)
showdiff

Unnamed: 0,3602,3603
MPUrl,actors/426,actors/2258
id,426,2258
name,"Jan Aarts, Hans van Halteren and Nelleke Oostdijk, University of Nijmegen","Jan Aarts, Hans van Halteren and Nelleke Oostdijk, University of Nijmegen"
externalIds,[],"[{'identifierService': {'code': 'SourceActorId', 'label': 'Source ActorId', 'ord': 7, 'urlTemplate': ''}, 'identifier': '1-385568d7ab786ea0f0d6f35a7a73e948f7fdcee7346c95991e4085525eec52c2'}]"
affiliations,[],[]
website,,


### 4.3 Get duplicated actors that are associated to dataset items

The function __getDuplicatedActorsWithItems(df_actor_duplicates, props)__ returns the actors that are duplicated according to the filter defined in *props* and that are associated with one or more items in the MP dataset.  
The function returns two dataframes.
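
To illustrate the idea, and not the library's actual implementation, the sketch below flags actors that share the same `name` and `website` and then keeps only those linked to at least one item. The toy dataframes and all their values are invented.

```python
import pandas as pd

# Toy stand-ins for the actors and contributors dataframes.
actors = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "name": ["Ann Example", "Ann Example", "Bob Sample", "Cy Test"],
    "website": [
        "http://example.org/ann",
        "http://example.org/ann",
        "http://example.org/bob",
        "http://example.org/cy",
    ],
})
contributors = pd.DataFrame({
    "actor.id": [10, 11, 12],
    "persistentId": ["aaa111", "bbb222", "ccc333"],
    "category": ["publication", "training-material", "dataset"],
})

# keep=False marks every member of a duplicated (name, website) group, not just the repeats.
dup_mask = actors.duplicated(subset=["name", "website"], keep=False)
duplicates = actors[dup_mask]

# An inner join keeps only the duplicated actors that are attached to at least one item.
dup_with_items = duplicates.merge(
    contributors, left_on="id", right_on="actor.id", how="inner"
)
print(dup_with_items[["id", "name", "persistentId", "category"]])
```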

In [25]:
filter_attribute='name, website'
df_dup_withitems=utils.getDuplicatedActorsWithItems(df_actor_duplicates, filter_attribute)
df_dup_withitems[0].head(10)

Loading Actors for toolsandservices
Loading Actors for publications
Loading Actors for trainingmaterials
Loading Actors for workflows
Loading Actors for datasets
keyword dataset is not present
concepts dataset is not present


Unnamed: 0,isDuplicated,MPUrl,id,name,externalIds,affiliations,website,role.label,persistentId,label,category,item
0,yes,actors/4689,4689,Agiatis Benardou,"[{'identifierService': {'code': 'DBLP', 'label...",[],,Author,dyFa95,"Are we there yet? Functionalities, synergies a...",publication,publication/dyFa95
1,yes,actors/4689,4689,Agiatis Benardou,"[{'identifierService': {'code': 'DBLP', 'label...",[],,Author,n6aXHU,DARIAH-EU's Virtual Competency Center on Resea...,publication,publication/n6aXHU
2,yes,actors/4689,4689,Agiatis Benardou,"[{'identifierService': {'code': 'DBLP', 'label...",[],,Author,McZlzs,"#dariahTeach - online teaching, MOOCs and beyond",publication,publication/McZlzs
3,yes,actors/4689,4689,Agiatis Benardou,"[{'identifierService': {'code': 'DBLP', 'label...",[],,Author,MhxIFa,Playing With Cultural Heritage Through Digital...,publication,publication/MhxIFa
4,yes,actors/4689,4689,Agiatis Benardou,"[{'identifierService': {'code': 'DBLP', 'label...",[],,Author,DeguAO,Reflecting On And Refracting User Needs Throug...,publication,publication/DeguAO
5,yes,actors/4689,4689,Agiatis Benardou,"[{'identifierService': {'code': 'DBLP', 'label...",[],,Author,CTi1FG,Scholarly Research Activities and Digital Tool...,publication,publication/CTi1FG
6,yes,actors/4689,4689,Agiatis Benardou,"[{'identifierService': {'code': 'DBLP', 'label...",[],,Author,pbsN0x,"What's in a Discipline? Research Practices, Us...",publication,publication/pbsN0x
7,yes,actors/9111,9111,Agiatis Benardou,[{'identifierService': {'code': 'SourceActorId...,[],,Author,YomEAW,"Research Infrastructures Should Inspire, Theor...",training-material,training-material/YomEAW
8,yes,actors/9111,9111,Agiatis Benardou,[{'identifierService': {'code': 'SourceActorId...,[],,Author,uwJTCy,What is the Role of Training and Education in ...,training-material,training-material/uwJTCy
9,yes,actors/1262,1262,Alastair Dunning,[],[],,Author,dY5S7K,Minimum requirements for Europeana Cloud,publication,publication/dY5S7K


In [26]:
# need to add the item type alongside the item ID
df_dup_withitems[1].sort_values('name').to_csv(path_or_buf='data/duplicatedactorswithitems.csv', sep=',', index=False)

### 4.4 Merge duplicated actors

POST /api/actors/{id}/merge
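
The snippet below is a hedged sketch of how a request to this endpoint could be assembled with the standard library; it builds the request but does not send it. The base URL is a placeholder, the `with` query parameter carrying the ids of the actors to merge is an assumption that should be checked against the API documentation, and authentication is omitted. Which of the two ids is kept after the merge is likewise assumed here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL; replace it with the API root of the instance being curated.
API_BASE = "https://marketplace.example.org/api"


def build_merge_request(target_id, duplicate_ids):
    """Build (without sending) a POST request for /api/actors/{id}/merge."""
    # ASSUMPTION: the actors to merge into target_id are passed as a
    # comma-separated list in a 'with' query parameter.
    query = urlencode({"with": ",".join(str(i) for i in duplicate_ids)})
    url = f"{API_BASE}/actors/{target_id}/merge?{query}"
    return Request(url, method="POST")


req = build_merge_request(2505, [2266])
print(req.get_method(), req.full_url)
```

In the notebook itself the call is wrapped by `mpdata.postMergedActors`, which is left commented out below so that no merge is triggered by accident.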


In [None]:
#mpdata.postMergedActors('2505', '2266')

## 5. Check actor.website URL Status

This section of the notebook inspects the URLs in the `actor.website` field, identifies those that return a 404 status, and creates a list of the Marketplace items attached to actors whose website has a "bad URL".

It is also possible to use **notebook 3.1 Curation-flag-URL** to flag the items directly in the "Items to moderate" section.

In [None]:
extcontr_df=utils.getContributors()
extcontr_df.head()

In [None]:
extcontr_df

In [None]:
urls_df=check.checkURLValuesInDataset(extcontr_df, 'actor.website')
#urls_df.tail()

In [None]:
urls_df.shape

In [None]:
my_urls_df=urls_df.drop_duplicates(keep='first', inplace=False)

In [None]:
my_urls_df.shape

In [None]:
myw_url_status=my_urls_df[my_urls_df['status'] == 404].sort_values('category')
myclickable_table = myw_url_status.style.format({'MPUrl': utils.make_clickable})
myclickable_table

In [None]:
myw_url_status.sort_values('category').to_csv(path_or_buf='data/urlactors.csv', sep=',', index=False)