# Notebook 4.2 - Actors curation

This notebook gathers several checks that can be run together or independently of each other. Together, these checks help moderators improve the actor attributes in the SSH Open Marketplace.

This notebook is composed of 6 sections:

0. Requirements to run this notebook
1. Get actors 
2. Multiple actor names
3. Check Actors not associated to any items
4. Duplicated actors 
    4.1 Get duplicates for actors
    4.2 Compare duplicated actors
    4.3 Get duplicated actors that are associated to dataset items
    4.4 Merge duplicated actors
5. Check actor.website URL Status


## 0 Requirements to run this notebook

This section gives all the relevant information to "interact" with the MP data.

### 0.1 Libraries
*A number of external libraries are needed to run this notebook.*

*In addition, a dedicated SSH Open Marketplace library, sshmarketplacelib, provides customised functions and can be imported with the usual Python import commands.*

*The cell below imports the libraries needed to run this notebook.*

In [1]:
import pandas as pd #to manage dataframes
import matplotlib.pyplot as plt #to create histograms and images
import seaborn as sns #to create histograms and images
import numpy as np #to manage numerical arrays
#import the Marketplace library
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import eval as eva, helper as hel

In [2]:
mpdata = mpd()
utils=hel.Util()
check=eva.URLCheck()

In [3]:
df_tool_flat = mpdata.getMPItems("toolsandservices", True)
df_publication_flat = mpdata.getMPItems("publications", True)
df_trainingmaterials_flat = mpdata.getMPItems("trainingmaterials", True)
df_workflows_flat = mpdata.getMPItems("workflows", True)
df_datasets_flat = mpdata.getMPItems("datasets", True)

getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...


## 1. Get actors

In [4]:
df_actors_flat = mpdata.getMPItems("actors", False)

In [5]:
df_actors_flat

Unnamed: 0,id,name,externalIds,affiliations,website,email
0,2235,18th Connect,[{'identifierService': {'code': 'SourceActorId...,[],,
1,2773,37signals,[{'identifierService': {'code': 'SourceActorId...,[],,
2,3213,"3D Optical Metrology (3DOM) unit, Bruno Kessle...",[{'identifierService': {'code': 'SourceActorId...,[],,
3,3212,3D-SHS Huma-Num's consortium,[{'identifierService': {'code': 'SourceActorId...,[],,
4,2185,4D,[{'identifierService': {'code': 'SourceActorId...,[],http://www.4d.com/,
...,...,...,...,...,...,...
8401,8062,Zong Peng,"[{'identifierService': {'code': 'DBLP', 'label...",[],,
8402,218,Zoomify Inc.,[],[],,
8403,2029,Zoomify Inc.,[{'identifierService': {'code': 'SourceActorId...,[],,
8404,1590,Zoppi Angela,[],[],,


## 2. Multiple actor names

Identify multi-valued actors

This section inspects the actor names and produces a list (exported as a CSV file to the `data` folder) of the actors, together with their associated items, that have a comma in the `actor.name` field. Other separators such as `;`, `/` or `-` could also be used to inspect and clean the `actor.name` field.
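
As an illustration of the other separators mentioned above, the comma check can be generalised with a regular expression. This is only a sketch on invented toy rows, not part of sshmarketplacelib; the pattern is an assumption, and it deliberately matches a hyphen only when it is surrounded by spaces so that hyphenated surnames are not reported as lists of names.

```python
import pandas as pd

# Toy contributors table (invented rows); in the notebook this comes from utils.getContributors()
contributors = pd.DataFrame({
    "persistentId": ["a1", "b2", "c3", "d4", "e5"],
    "actor.name": [
        "Ian Pearce, Devin Gaffney",   # comma-separated names
        "Dassault Systemes",           # single actor
        "Jane Doe; John Roe",          # semicolon-separated names
        "Lab A / Lab B",               # slash-separated names
        "Brian Pytlik-Zillig",         # hyphenated surname, not a list
    ],
})

# Comma, semicolon or slash anywhere; a hyphen only when surrounded by spaces
separator_pattern = r"[,;/]|\s-\s"

# na=False treats missing names as "no separator" instead of propagating NaN
mask = contributors["actor.name"].str.contains(separator_pattern, regex=True, na=False)
multi_name_actors = contributors[mask]
print(multi_name_actors)
```

The same mask could replace the `str.contains(",")` filter used in the cells below if a broader check is wanted.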


In [7]:
#extcontr_ei_df=extcontr_df[extcontr_df['actor.externalIds'].notnull()]
extcontr_df=utils.getContributors()
extcontr_ei_df=extcontr_df[extcontr_df['actor.name'].str.contains(",")]
extcontr_ei_df.head()

Unnamed: 0,label,persistentId,category,actor.id,actor.name,actor.externalIds,actor.affiliations,role.code,role.label,role.ord,actor.website,actor.email
0,140kit,SIU1nO,tool-or-service,2224,"Ian Pearce, Devin Gaffney",[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,,
3,Abbot,MXsRM1,tool-or-service,2584,"Brian Pytlik-Zillig, Stephen Ramsay, and Marti...",[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,http://www.monkproject.org/,monkproject@lis.illinois.edu
4,ABFREQ,LtXDGc,tool-or-service,2042,"Alastair McKinnon, McGill University",[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,,
18,A-frame,2zIseh,tool-or-service,2524,"Diego Marcos, Don McCurdy, & Kevin Ngo",[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,,
23,ALGOL,YKv47J,tool-or-service,2343,"Bauer, Bottenbruch, Rutishauser, Samelson, Bac...",[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,,


In [None]:
# extcontr_ei_df['eilen'] = extcontr_ei_df['actor.externalIds'].apply(lambda y: len(y))

In [None]:
#items= pd.json_normalize(data=extcontr_ei_df, record_path='actor.affiliations', meta=['label', 'persistentId', 'category'])

In [8]:
extcontr_ei_df.shape

(257, 12)

In [9]:
extcontr_ei_df.iloc[0]['actor.externalIds']

[{'identifierService': {'code': 'SourceActorId',
   'label': 'Source ActorId',
   'ord': 7,
   'urlTemplate': ''},
  'identifier': '1-36fbe0d84d048a42c2c4e12f6f89467d24a57ad671acaef81c708c6ee6e134e9'}]

In [10]:
extcontr_nd_df=extcontr_df.drop_duplicates(subset=['actor.name', 'persistentId'], keep='first', inplace=False, ignore_index=True)

In [11]:
extcontr_df.shape

(3224, 12)

In [12]:
extcontr_ei_df=extcontr_nd_df[extcontr_nd_df['actor.name'].str.contains(",")]
extcontr_ei_df.shape

(254, 12)

In [13]:
extcontr_ei_df.sort_values('persistentId').to_csv(path_or_buf='data/commaactors.csv', sep=',', index=False)

## 3. Check Actors not associated to any items

This section lists the actors that are not attached to any Marketplace item, by comparing the full list of actors with the contributors of all items.

In [14]:
df_contrib=utils.getContributors()
#anti-join: keep the actors whose id never appears among the contributors of any item
df_noitems=(df_actors_flat
            .merge(df_contrib['actor.id'], left_on='id', right_on='actor.id', how='outer', indicator=True)
            .loc[lambda x: x['_merge']=='left_only'])
df_noitems.sort_values('name').head()

Unnamed: 0,id,name,externalIds,affiliations,website,email,actor.id,_merge
1663,3318.0,\n Costas\n ...,[{'identifierService': {'code': 'SourceActorId...,[],,,,left_only
2398,3317.0,\n Emilie\n ...,[{'identifierService': {'code': 'SourceActorId...,[],,,,left_only
2823,3311.0,\n FEETK\n ...,[{'identifierService': {'code': 'SourceActorId...,[],,,,left_only
5295,3312.0,\n Maastricht Universit...,[{'identifierService': {'code': 'SourceActorId...,[],,,,left_only
5571,3316.0,\n Martina\n ...,[{'identifierService': {'code': 'SourceActorId...,[],,,,left_only


In [15]:
df_noitems.sort_values('name').shape

(6688, 8)
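
The merge in cell In [14] is an anti-join: `indicator=True` adds a `_merge` column, and the rows flagged `left_only` are the actors with no match among the contributors. The sketch below reproduces the idea on invented toy data. It uses `how='left'` instead of `how='outer'`, which is an assumption about what is sufficient here; it also keeps `id` as an integer, whereas the output above shows ids as floats (e.g. `3318.0`), presumably because the outer merge introduced missing values in that column.

```python
import pandas as pd

# Toy data (invented): three actors, only actors 1 and 3 contribute to items
actors = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
contributors = pd.DataFrame({"actor.id": [1, 1, 3]})

# Left merge with an indicator column; duplicates are dropped first so that
# an actor contributing to several items appears only once
merged = actors.merge(
    contributors[["actor.id"]].drop_duplicates(),
    left_on="id",
    right_on="actor.id",
    how="left",
    indicator=True,
)

# Rows found only on the left side are actors without any item
no_items = merged.loc[merged["_merge"] == "left_only", ["id", "name"]]
print(no_items)
```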

In [16]:
filter_attribute='name'
df_actor_noitems_duplicates=utils.getDuplicates(df_noitems, filter_attribute)
df_actor_noitems_duplicates.shape

(620, 9)

In [17]:
df_actor_noitems_duplicates

Unnamed: 0,MPUrl,id,name,externalIds,affiliations,website,email,actor.id,_merge
88,actors/8383.0,8383.0,Adam Crymble,"[{'identifierService': {'code': 'DBLP', 'label...",[],,,,left_only
89,actors/3020.0,3020.0,Adam Crymble,"[{'identifierService': {'code': 'GitHub', 'lab...","[{'id': 2902, 'name': 'University College Lond...",http://adamcrymble.org,a.crymble@ucl.ac.uk,,left_only
173,actors/149.0,149.0,Alan Liu,[],[],http://liu.english.ucsb.edu/,ayliu@english.ucsb.edu,,left_only
174,actors/7812.0,7812.0,Alan Liu,"[{'identifierService': {'code': 'DBLP', 'label...",[],,,,left_only
175,actors/8384.0,8384.0,Alan MacEachern,"[{'identifierService': {'code': 'DBLP', 'label...",[],,,,left_only
...,...,...,...,...,...,...,...,...,...
9883,actors/1685.0,1685.0,ZIM-ACDH,[],[],,,,left_only
9884,actors/3236.0,3236.0,ZIM-ACDH,[{'identifierService': {'code': 'SourceActorId...,[],,,,left_only
9893,actors/7604.0,7604.0,Zoe LeBlanc,"[{'identifierService': {'code': 'DBLP', 'label...",[],,,,left_only
9897,actors/3089.0,3089.0,Zoe LeBlanc,"[{'identifierService': {'code': 'GitHub', 'lab...","[{'id': 2921, 'name': 'Princeton University, U...",http://zoeleblanc.com,zgleblanc@gmail.com,,left_only


## 4. Duplicated actors
    4.1 Get duplicates for actors
    4.2 Compare duplicated actors
    4.3 Get duplicated actors that are associated to dataset items
    4.4 Merge duplicated actors

### 4.1 Get duplicates for actors

In [18]:
utils=hel.Util()
filter_attribute='name'
df_actor_duplicates=utils.getDuplicates(df_actors_flat, filter_attribute)

In [19]:
print (f'Using the attribute "{filter_attribute}" as filter, there are: {df_actor_duplicates.shape[0]} duplicated actors')

Using the attribute "name" as filter, there are: 2721 duplicated actors


In [21]:
df_actor_duplicates.sort_values('name').head(10)

Unnamed: 0,MPUrl,id,name,externalIds,affiliations,website,email
2365,actors/1664,1664,FEETK,[],[],,
2366,actors/3209,3209,FEETK,[{'identifierService': {'code': 'SourceActorId...,[],,
3353,actors/426,426,"Jan Aarts, Hans van Halteren and Nelleke Oost...",[],[],,
3354,actors/2258,2258,"Jan Aarts, Hans van Halteren and Nelleke Oost...",[{'identifierService': {'code': 'SourceActorId...,[],,
4587,actors/3210,3210,Maastricht University,[{'identifierService': {'code': 'SourceActorId...,[],,
4588,actors/1665,1665,Maastricht University,[],[],,
6097,actors/1687,1687,"Poznańskie Centrum Superkomputerowo-Sieciowe,...",[],[],,
6098,actors/3238,3238,"Poznańskie Centrum Superkomputerowo-Sieciowe,...",[{'identifierService': {'code': 'SourceActorId...,[],,
7659,actors/174,174,University at Buffalo's Department of Classic...,[],[],,
7658,actors/1978,1978,University at Buffalo's Department of Classic...,[{'identifierService': {'code': 'SourceActorId...,[],,


### 4.2 Compare duplicated actors

In [22]:
#id of duplicated actors
ids=[3209, 1664]
compareitems=df_actor_duplicates[df_actor_duplicates.id.isin(ids)]

In [23]:
css_equal="font-size:1.5rem; border: 2px solid silver;background-color: white; padding: 10px 20px"
css_diff="background-color: lightyellow;  font-size:1.5rem; border: 2px solid silver; padding: 10px 20px"

In [24]:
#view items: attributes whose values differ between the compared actors are highlighted in yellow
showdiff = compareitems.T.style.apply(
    lambda x: [css_equal if len(utils.lists_to_list(x.values))==1 else css_diff for _ in x],
    axis=1)
showdiff

Unnamed: 0,2365,2366
MPUrl,actors/1664,actors/3209
id,1664,3209
name,FEETK,FEETK
externalIds,[],"[{'identifierService': {'code': 'SourceActorId', 'label': 'Source ActorId', 'ord': 7, 'urlTemplate': ''}, 'identifier': '4-53512830f79f3aae7f3605302d315518ed2dc1bb9457026df278ea762b17b135'}]"
affiliations,[],[]
website,,
email,,


### 4.3 Get duplicated actors that are associated to dataset items

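The function __getDuplicatedActorsWithItems(df_actor_duplicates, props)__ returns the actors that are duplicated according to the filter defined in *props* and that are associated with one or more items in the MP dataset.  
The function returns two dataframes: judging from the outputs below, the first has one row per actor and associated item, and the second has one row per actor with the list of its item identifiers.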
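
To make the shape of the two results concrete, here is a minimal pandas sketch on invented toy data. It is not the implementation of `getDuplicatedActorsWithItems`; the variable names `detailed` and `per_actor` are hypothetical, and the column names are only modelled on the outputs shown in this notebook.

```python
import pandas as pd

# Toy duplicated actors (invented): two "AJ" with items, two "FEETK" without
duplicates = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "name": ["AJ", "AJ", "FEETK", "FEETK"],
})

# Toy contributors table: which actor is attached to which item
contributors = pd.DataFrame({
    "actor.id": [10, 11, 11],
    "category": ["tool-or-service", "tool-or-service", "publication"],
    "persistentId": ["p1", "p2", "p3"],
})

# First result: one row per (actor, item); the inner merge drops actors without items
detailed = duplicates.merge(contributors, left_on="id", right_on="actor.id", how="inner")
detailed["item"] = detailed["category"] + "/" + detailed["persistentId"]

# Second result: one row per actor, with the list of its items
per_actor = (
    detailed.groupby(["id", "name"])["item"]
    .agg(list)
    .reset_index(name="itemPersistentId")
)
print(per_actor)
```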

In [26]:
df_dup_withitems=utils.getDuplicatedActorsWithItems(df_actor_duplicates, filter_attribute)
df_dup_withitems[0].head(10)

Unnamed: 0,MPUrl,id,name,externalIds,affiliations,website,email,role.label,persistentId,label,category,item
0,actors/2505,2505,AJ,[{'identifierService': {'code': 'SourceActorId...,[],https://github.com/ajlkn,aj@carrd.co,Creator,d9MQTi,carrd.co,tool-or-service,tool-or-service/d9MQTi
1,actors/2266,2266,AJ,[{'identifierService': {'code': 'SourceActorId...,[],https://aj.lkn.io/,aj@carrd.co,Creator,2XDPHW,HTMLUP,tool-or-service,tool-or-service/2XDPHW
2,actors/3008,3008,Alex Brey,[{'identifierService': {'code': 'SourceActorId...,[],,,Author,cwBoCV,Análisis de redes temporal en R,training-material,training-material/cwBoCV
3,actors/3008,3008,Alex Brey,[{'identifierService': {'code': 'SourceActorId...,[],,,Author,Ql1Fim,Temporal Network Analysis with R,training-material,training-material/Ql1Fim
4,actors/1105,1105,Alex Brey,[],[],,,Author,kXc0Ml,Temporal Network Analysis with R\n,training-material,training-material/kXc0Ml
5,actors/1035,1035,Amanda Morton,[],[],,,Reviewer,bgzUXa,Installing Python Modules with pip,training-material,training-material/bgzUXa
6,actors/1035,1035,Amanda Morton,[],[],,,Reviewer,CiaR5D,Python Introduction and Installation,training-material,training-material/CiaR5D
7,actors/1035,1035,Amanda Morton,[],[],,,Reviewer,ygzJuw,Setting up an Integrated Development Environme...,training-material,training-material/ygzJuw
8,actors/1035,1035,Amanda Morton,[],[],,,Reviewer,K4gWWE,Setting Up an Integrated Development Environme...,training-material,training-material/K4gWWE
9,actors/1035,1035,Amanda Morton,[],[],,,Reviewer,ZEHgKO,Setting Up an Integrated Development Environme...,training-material,training-material/ZEHgKO


In [32]:
df_dup_withitems[1].head(10)

Unnamed: 0,MPUrl,id,name,itemPersistentId
124,actors/2266,2266,AJ,[tool-or-service/2XDPHW]
145,actors/2505,2505,AJ,[tool-or-service/d9MQTi]
55,actors/1105,1105,Alex Brey,[training-material/kXc0Ml]
237,actors/3008,3008,Alex Brey,"[training-material/cwBoCV, training-material/Q..."
194,actors/2951,2951,Amanda Morton,"[training-material/yoiCvh, training-material/V..."
9,actors/1035,1035,Amanda Morton,"[training-material/bgzUXa, training-material/C..."
247,actors/3028,3028,Amanda Visconti,"[training-material/V9mAJr, training-material/d..."
28,actors/1061,1061,Amanda Visconti,"[training-material/RYvL5M, training-material/R..."
123,actors/2244,2244,Ambrosia Software Inc.,[tool-or-service/MDRRLD]
273,actors/412,412,Ambrosia Software Inc.,[publication/FJ9SoH]


In [28]:
# need to add the item type alongside the item ID
df_dup_withitems[1].sort_values('name').to_csv(path_or_buf='data/duplicatedactorswithitems.csv', sep=',', index=False)

### 4.4 Merge duplicated actors

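Duplicated actors are merged through the following Marketplace API endpoint. The call in the cell below is commented out; uncomment it and replace the two actor ids to run a merge.

POST /api/actors/{id}/merge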


In [None]:
#mpdata.postMergedActors('2505', '2266')

## 5. Check actor.website URL Status

This section inspects the URLs in the `actor.website` field, identifies those that return a 404 status, and creates a list of the Marketplace items attached to actors whose website URL is broken.

It is also possible to use **notebook 3.1 Curation-flag-URL** to flag the items directly in the "Items to moderate" section.
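
For readers who want to see what such a status check involves, the sketch below uses only the Python standard library. It is not the implementation of `check.checkURLValuesInDataset`; the helper names `url_status` and `broken_urls` are hypothetical, and the use of a `HEAD` request is an assumption (some servers reject it, in which case a `GET` would be needed).

```python
import urllib.error
import urllib.request


def url_status(url, timeout=10):
    """Return the HTTP status code for url, or None if it cannot be reached."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx answers are raised as exceptions that carry the status code
        return err.code
    except (urllib.error.URLError, ValueError, OSError):
        # Malformed URL, DNS failure, refused connection, timeout, ...
        return None


def broken_urls(urls, get_status=url_status):
    """Return the distinct URLs whose status is 404, in first-seen order."""
    unique_urls = dict.fromkeys(urls)  # de-duplicate while keeping order
    return [url for url in unique_urls if get_status(url) == 404]
```

Checking each distinct URL only once matters here, since many items share the same actor and therefore the same `actor.website` value.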

In [33]:
extcontr_df=utils.getContributors()
extcontr_df.head()

Unnamed: 0,label,persistentId,category,actor.id,actor.name,actor.externalIds,actor.affiliations,role.code,role.label,role.ord,actor.website,actor.email
0,140kit,SIU1nO,tool-or-service,2224,"Ian Pearce, Devin Gaffney",[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,,
1,3DVIA Virtools,zrfCly,tool-or-service,1925,Dassault Systemes,[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,,
2,80legs,XsXzlp,tool-or-service,2373,80legs,[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,,
3,Abbot,MXsRM1,tool-or-service,2584,"Brian Pytlik-Zillig, Stephen Ramsay, and Marti...",[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,http://www.monkproject.org/,monkproject@lis.illinois.edu
4,ABFREQ,LtXDGc,tool-or-service,2042,"Alastair McKinnon, McGill University",[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,,


In [34]:
extcontr_df

Unnamed: 0,label,persistentId,category,actor.id,actor.name,actor.externalIds,actor.affiliations,role.code,role.label,role.ord,actor.website,actor.email
0,140kit,SIU1nO,tool-or-service,2224,"Ian Pearce, Devin Gaffney",[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,,
1,3DVIA Virtools,zrfCly,tool-or-service,1925,Dassault Systemes,[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,,
2,80legs,XsXzlp,tool-or-service,2373,80legs,[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,,
3,Abbot,MXsRM1,tool-or-service,2584,"Brian Pytlik-Zillig, Stephen Ramsay, and Marti...",[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,http://www.monkproject.org/,monkproject@lis.illinois.edu
4,ABFREQ,LtXDGc,tool-or-service,2042,"Alastair McKinnon, McGill University",[{'identifierService': {'code': 'SourceActorId...,[],creator,Creator,3,,
...,...,...,...,...,...,...,...,...,...,...,...,...
492,YelpNYC,IdZGtV,dataset,1753,Matt Lavin,[],[],curator,Curator,11,,
493,YelpZIP,OMny6U,dataset,1752,Eva Bacas,[],[],curator,Curator,11,,
494,YelpZIP,OMny6U,dataset,1753,Matt Lavin,[],[],curator,Curator,11,,
495,"""You Are Where You Tweet: A Content-Based Appr...",YnEaU0,dataset,1752,Eva Bacas,[],[],curator,Curator,11,,


In [35]:
urls_df=check.checkURLValuesInDataset(extcontr_df, 'actor.website')
#urls_df.tail()

inspecting actor.website


In [36]:
urls_df.shape

(14005, 7)

In [37]:
my_urls_df=urls_df.drop_duplicates(keep='first', inplace=False)

In [38]:
my_urls_df.shape

(706, 7)

In [39]:
myw_url_status=my_urls_df[my_urls_df['status'] == 404].sort_values('category')
myclickable_table = myw_url_status.style.format({'MPUrl': utils.make_clickable})
myclickable_table

Unnamed: 0,MPUrl,persistentId,category,label,property,url,status
2565,tool-or-service/QKqkOF,QKqkOF,tool-or-service,Carrot2,actor.website,http://project.carrot2.org/authors.html,404
2635,tool-or-service/CxFbcY,CxFbcY,tool-or-service,Commentpress,actor.website,http://www.visudo.com/,404
2841,tool-or-service/WCXCgY,WCXCgY,tool-or-service,CSV Sort,actor.website,https://bitbucket.org/richardpenman/csvsort,404
3076,tool-or-service/V6aWAu,V6aWAu,tool-or-service,Discursis,actor.website,http://www.discursis.com/index.php/about/,404
3082,tool-or-service/evJx1C,evJx1C,tool-or-service,etcML,actor.website,http://www.etcml.com/team,404
4342,tool-or-service/cBHEp8,cBHEp8,tool-or-service,LitStats,actor.website,http://www.efs.ualberta.ca/People/Faculty/StephenReimer.aspx,404
4404,tool-or-service/XnpnZc,XnpnZc,tool-or-service,Orange,actor.website,http://www.fri.uni-lj.si/en/tomaz-curk/default.html,404
4458,tool-or-service/yvAIRA,yvAIRA,tool-or-service,RQDA,actor.website,http://homepage.fudan.edu.cn/rghuang/cv/,404
4463,tool-or-service/VP9pEA,VP9pEA,tool-or-service,SALT (Statistic Analysis of Language Transcripts),actor.website,http://www.saltsoftware.com/company/contact/,404
4733,tool-or-service/7fnuoq,7fnuoq,tool-or-service,Versioning Machine,actor.website,http://www.tcd.ie/English/staff/academic-staff/susan-schriebman.php,404


In [None]:
myw_url_status.sort_values('category').to_csv(path_or_buf='data/urlactors.csv', sep=',', index=False)