# Notebook 3.4 - Curation-flag-coverage

This notebook analyses metadata completeness in the SSH Open Marketplace and writes the results back to the system via two dedicated curation properties: `curation-flag-coverage` and `curation-detail`.

It flags Marketplace items that have too few metadata fields filled in, helping Moderators identify curation priorities to improve data quality.

This notebook is part of a series of 4 notebooks that inform the curation properties used in the SSH Open Marketplace Editorial Dashboard.

It is composed of 3 sections:

0. Requirements to run the notebook
1. Coverage of recommended fields - overview
- for tools&services
- for training materials
- for datasets
- for publications
- for workflows(?)
2. Flag priority items for curation 

## 0 Requirements to run this notebook

This section gives all the relevant information to "interact" with the MP data.

### 0.1 libraries
*A number of external libraries are needed to run the notebook.*

*In addition, a dedicated SSH Open Marketplace library, sshmarketplacelib, provides customised functions and can be imported with the usual Python import statements.*

*The cell below imports the libraries needed to run this notebook.*

In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt #to create histograms and images
import requests
#import the MarketPlace Library 
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import  eval as eva, helper as hel

### 0.2 Get the data

In [19]:
mpdata = mpd()
df_tool_flat =mpdata.getMPItems ("toolsandservices", True)
df_publication_flat =mpdata.getMPItems ("publications", True)
df_trainingmaterials_flat =mpdata.getMPItems ("trainingmaterials", True)
df_workflows_flat =mpdata.getMPItems ("workflows", True)
df_datasets_flat =mpdata.getMPItems ("datasets", True)

getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...


### 0.3 A look at the data

It can be useful to display the structure of the table. The next cell shows the structure of the __df_trainingmaterials_flat__ table (or dataframe). Note that all tables obtained with the getMPItems function have the same structure.

df_tool_flat

df_publication_flat

df_trainingmaterials_flat

df_workflows_flat

df_datasets_flat


This dataframe returns MP attributes as defined in the MP data model (see: https://doi.org/10.5281/zenodo.5749464)

In [20]:
df_trainingmaterials_flat

Unnamed: 0,id,category,label,persistentId,lastInfoUpdate,status,description,contributors,properties,externalIds,...,source.urlTemplate,dateCreated,dateLastUpdated,thumbnail.info.mediaId,thumbnail.info.category,thumbnail.info.location.sourceUrl,thumbnail.info.mimeType,thumbnail.info.hasThumbnail,thumbnail.info.filename,thumbnail.caption
0,30707,training-material,2.1 Error rates and ground truth - Text Digiti...,VsNb9e,2021-03-03T14:24:16+0000,approved,Some description here finally.,[],[],[],...,https://www.zotero.org/groups/427927/items/{so...,,,,,,,,,
1,11515,training-material,3DHOP - How To,UEAOIh,2020-12-29T17:33:44+0000,approved,No description provided.,[],[],[],...,https://www.zotero.org/groups/427927/items/{so...,,,,,,,,,
2,46009,training-material,3ds Max Tutorials: Introduction,X0yXlt,2021-08-05T14:31:24+0000,approved,No description provided.,[],[],"[{'identifierService': {'code': 'Wikidata', 'l...",...,https://www.zotero.org/groups/427927/items/{so...,,,,,,,,,
3,28014,training-material,8 Transcriptions of Speech - The TEI Guidelines,I0iXqR,2021-01-15T09:32:31+0000,approved,No description provided.,[],[],[],...,https://www.zotero.org/groups/427927/items/{so...,,,,,,,,,
4,28515,training-material,"Agisoft PhotoScan. Tutorials, beginner level",iwZacp,2021-01-15T09:33:58+0000,approved,No description provided.,[],[],[],...,https://www.zotero.org/groups/427927/items/{so...,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
238,58298,training-material,XPath for Dictionary Nerds,goaVsX,2021-12-23T13:21:55+0000,approved,XPath (XML Path Language) is a standard query ...,"[{'actor': {'id': 12583, 'name': 'Toma Tasovac...","[{'type': {'code': 'license', 'label': 'Licens...",[],...,https://campus.dariah.eu/id/{source-item-id},,,,,,,,,
239,58406,training-material,"You don't have to be a programmer, but being t...",1zexmH,2021-12-31T11:13:15+0000,approved,Martin Lhoták first began digital research in ...,"[{'actor': {'id': 12554, 'name': 'Martin Lhotá...","[{'type': {'code': 'license', 'label': 'Licens...",[],...,https://campus.dariah.eu/id/{source-item-id},,,,,,,,,
240,58382,training-material,"You don't have to be a programmer, but being t...",oBIePx,2021-12-31T11:03:09+0000,approved,Martin Lhoták first began digital research in ...,"[{'actor': {'id': 12554, 'name': 'Martin Lhotá...","[{'type': {'code': 'license', 'label': 'Licens...",[],...,https://campus.dariah.eu/id/{source-item-id},,,,,,,,,
241,58380,training-material,"You don't have to be a programmer, but being t...",A1TrrJ,2021-12-31T11:02:59+0000,approved,Martin Lhoták first began digital research in ...,"[{'actor': {'id': 12554, 'name': 'Martin Lhotá...","[{'type': {'code': 'license', 'label': 'Licens...",[],...,https://campus.dariah.eu/id/{source-item-id},,,,,,,,,


## 1. Coverage of recommended fields

Each MP item type - *tools&services; training materials; datasets; publications; workflows* - has a set of mandatory and recommended fields. These fields are listed and explained in the [MP Editorial Guidelines](https://marketplace.sshopencloud.eu/contribute/metadata-guidelines#metadata-status). 

The following cells of this notebook give an overview of the recommended fields coverage by item type.
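As a minimal sketch of what "coverage" means here, the share of items with a non-null value can be computed per field with plain pandas. The table and the `recommended` list below are toy stand-ins introduced for illustration; they are not the output of `getMPItems` and not the full list of recommended fields.

```python
import pandas as pd

# Toy stand-in for a flattened item table (assumed shape: one row per item,
# one column per field, missing values stored as None/NaN).
items = pd.DataFrame({
    "label": ["Tool A", "Tool B", "Tool C", "Tool D"],
    "license": ["MIT", None, None, "GPL-3.0"],
    "keyword": ["ocr", "3d", None, None],
    "activity": [None, None, None, "Annotating"],
})

# Hypothetical subset of recommended fields, chosen only for this example.
recommended = ["license", "keyword", "activity"]

# notnull() yields True/False per cell; the column mean is the share of filled cells.
coverage = items[recommended].notnull().mean().sort_values(ascending=False)
print(coverage)
```

Fields with the lowest share are the ones most often left empty, which is the information the pie charts in this section summarise per item rather than per field.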

### 1.1 Tools&services - coverage of recommended fields

mandatory and recommended fields for tools&services: 
 {
 "tool": {
    "required": ["label", "description"],
    "recommended": [
      "actors",
      "accessibleAt",
      "externalIds",
      "media",
      "thumbnail",
      "relatedItems",
      "version"
    ],
    "recommendedProperties": [
      "activity",
      "keyword",
      "discipline",
      "language",
      "tool-family",
      "mode-of-use",
      "intended-audience",
      "see-also",
      "user-manual-url",
      "helpdesk-url",
      "license",
      "terms-of-use-url",
      "technical-readiness-level"
    ]
  }
 }


When the second argument of `getItemsWithNullValues` is `False` (e.g. `getItemsWithNullValues(recommended_ts, False)`), the output lists the items for which AT LEAST one of the `recommended_ts` properties has a null value.

When it is `True` (e.g. `getItemsWithNullValues(recommended_ts, True)`), the output lists the items for which ALL `recommended_ts` properties have null values.
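The difference between the two settings can be illustrated with plain pandas on a toy table. This is only a sketch of the "at least one" versus "all" logic described above; it does not reproduce the internals of `getItemsWithNullValues`, and the column names are illustrative.

```python
import pandas as pd

# Toy items: A is complete, B misses one field, C misses both.
items = pd.DataFrame({
    "label": ["A", "B", "C"],
    "license": ["MIT", None, None],
    "keyword": ["ocr", "3d", None],
})
fields = ["license", "keyword"]

nulls = items[fields].isnull()  # True where a field is empty

# Analogous to the False setting: at least one field is null.
at_least_one_null = items[nulls.any(axis=1)]

# Analogous to the True setting: every field is null.
all_null = items[nulls.all(axis=1)]

print(at_least_one_null["label"].tolist())
print(all_null["label"].tolist())
```

The second selection is always a subset of the first, which is why the `True` setting produces the shorter, higher-priority list.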


In [21]:
utils=hel.Util()
recommended_ts='accessibleAt, contributors, externalIds, media, relatedItems, version, activity, keyword, discipline, language, tool-family, mode-of-use, intended-audience, see-also, user-manual-url, helpdesk-url, license, terms-of-use-url, technology-readiness-level'
recommended_ts_mask=['persistentId', 'MPUrl', 'label', 'category','accessibleAt', 'contributors', 'externalIds', 'media', 'relatedItems', 'version', 'activity', 'keyword', 'discipline', 'language', 'tool-family', 'mode-of-use', 'intended-audience', 'see-also', 'user-manual-url', 'helpdesk-url', 'license', 'terms-of-use-url', 'technology-readiness-level']


df_items_null_values=utils.getItemsWithNullValues(recommended_ts, False)
df_items_null_values_tools=df_items_null_values[(df_items_null_values['category']=='tool-or-service')]
df_items_null_values_tools.head()

  df_items=pd.merge(left=df_items, right=tmp, left_on='persistentId', right_on='ts_persistentId', how = 'outer').fillna(np.nan)


Unnamed: 0,index,id,category,label,persistentId,lastInfoUpdate,status,description,contributors,properties,...,ts_persistentId_y,user-manual-url,helpdesk-url,ts_persistentId_x,license,ts_persistentId_y.1,terms-of-use-url,ts_persistentId,technology-readiness-level,MPUrl
0,0,45953,tool-or-service,140kit,3IAyEp,2021-07-30T16:03:01+0000,approved,140kit provides a management layer for tweet c...,"[{'actor': {'id': 483, 'name': 'Ian Pearce, De...","[{'type': {'code': 'activity', 'label': 'Activ...",...,,,,,,,,,,tool-or-service/3IAyEp
1,1,49576,tool-or-service,3DF Zephyr - photogrammetry software - 3d mode...,U3gQrh,2021-09-22T15:51:38+0000,approved,,,[{'type': {'code': 'curation-flag-description'...,...,,,,,,,,,,tool-or-service/U3gQrh
2,2,49577,tool-or-service,3DHOP,MnpOWX,2021-09-22T15:51:39+0000,approved,,,[{'type': {'code': 'curation-flag-description'...,...,,,,,,,,,,tool-or-service/MnpOWX
3,3,49578,tool-or-service,3DHOP: 3D Heritage Online Presenter,gA7zFN,2021-09-22T15:51:39+0000,approved,,,[{'type': {'code': 'curation-flag-description'...,...,,,,,,,,,,tool-or-service/gA7zFN
4,4,49579,tool-or-service,3DReshaper \| 3DReshaper,Q49CiV,2021-09-22T15:51:40+0000,approved,,,[{'type': {'code': 'curation-flag-description'...,...,,,,,,,,,,tool-or-service/Q49CiV


The following cell counts, for each item, how many of the recommended fields have null values.

In [16]:
df_items_null_values_sp=df_items_null_values[(df_items_null_values['category']=='tool-or-service')]

#work on a copy so that adding a column does not raise SettingWithCopyWarning
df_coverage_sp=df_items_null_values_sp[recommended_ts_mask].copy()

df_coverage_sp['value']=df_coverage_sp.isnull().sum(axis=1) #number of null fields per item
labels_sp=df_coverage_sp[['MPUrl', 'persistentId', 'label', 'value']].groupby('value').count()['label'] #number of items per null count
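The per-item null count and its distribution can be reproduced on a toy table as follows. This is a sketch under the assumption that missing values are stored as None/NaN; the column names are illustrative. Taking a `.copy()` of the column selection before adding the `value` column is one common way to avoid the pandas copy-of-a-slice warning.

```python
import pandas as pd

items = pd.DataFrame({
    "label": ["A", "B", "C", "D"],
    "license": ["MIT", None, None, None],
    "keyword": ["ocr", "3d", None, None],
})
fields = ["license", "keyword"]

# Independent copy of the selected columns, safe to modify.
coverage = items[["label"] + fields].copy()

# Count nulls across the chosen fields only, row by row.
coverage["value"] = coverage[fields].isnull().sum(axis=1)

# Number of items for each null count.
distribution = coverage.groupby("value")["label"].count()
print(distribution)
```

Restricting `isnull()` to the recommended fields keeps identifier columns such as the label out of the count.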


### Flag items with more than *max* null values
The following cell creates the data frame that contains the items having more than a defined number of null values in the recommended fields.

In [23]:
maxnull=17 # max number of null values allowed
df_coverage_sp=df_coverage_sp.assign(property='coverage') #flag name; assign returns a new frame, avoiding SettingWithCopyWarning
#keep only the first item for each label (duplicated() alone would select the repeats instead)
df_test_cov_no_duplicates=df_coverage_sp.drop_duplicates(subset=['label'], keep='first')
df_flag_dataset=df_test_cov_no_duplicates[df_test_cov_no_duplicates.value>maxnull]
#df_flag_dataset.tail()
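A small sketch of the thresholding step on toy data may help when checking the result. Note that `DataFrame.duplicated` returns `True` for the repeated rows, so the mask has to be negated (or `drop_duplicates` used) to keep one row per label. The labels and counts below are invented for illustration.

```python
import pandas as pd

coverage = pd.DataFrame({
    "label": ["3DHOP", "3DHOP", "140kit", "Zephyr"],
    "value": [18, 18, 5, 19],  # number of null recommended fields per item
})
maxnull = 17  # same threshold as in the cell above

# True only for the second and later occurrences of a label.
dup_mask = coverage.duplicated(subset=["label"], keep="first")

# One row per label; equivalent to drop_duplicates(subset=["label"], keep="first").
unique_items = coverage[~dup_mask]

# Items exceeding the allowed number of null values.
flagged = unique_items[unique_items["value"] > maxnull]
print(flagged)
```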


In [7]:
curation_flag_property={"code": "curation-flag-coverage"}
curation_detail_property={"code": "curation-detail"}

### Flag items in the dataset

In [None]:
res_cov=mpdata.setPropertyFlags(df_flag_dataset, curation_flag_property, curation_detail_property)

The following cell creates a pie chart giving the percentage of items with a given number of empty recommended fields.

In [None]:
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.axis('equal')
explode = [0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3]

ax.pie(labels_sp.values, labels = labels_sp.index,   autopct='%1.2f%%', radius = 2)
plt.show()
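The percentages shown on the pie chart can also be obtained as numbers, which is convenient for reporting. The sketch below uses a toy series of per-item null counts; with the real data, the `value` column computed earlier would presumably play this role.

```python
import pandas as pd

# Toy per-item null counts (one entry per item).
null_counts = pd.Series([0, 1, 1, 2, 2, 2, 2, 3], name="value")

# Share of items per null count, in percent, ordered by null count.
share = null_counts.value_counts(normalize=True).sort_index() * 100
print(share.round(2))
```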

### 1.2 Training materials - coverage of recommended fields

 mandatory and recommended fields for training materials:
 
 
 "trainingMaterial": {
    "required": ["label", "description"],
    "recommended": [
      "actors",
      "accessibleAt",
      "externalIds",
      "media",
      "thumbnail",
      "relatedItems"
    ],
    "recommendedProperties": [
      "activity",
      "keyword",
      "discipline",
      "language",
      "object-format",
      "extent",
      "intended-audience",
      "see-also",
      "license",
      "terms-of-use-url",
      "year"
    ]
  }


In [None]:
utils=hel.Util()
recommended_ts='accessibleAt, contributors, externalIds, media, relatedItems, activity, keyword, discipline, language, object-format, extent, intended-audience, see-also, license, terms-of-use-url, year'
recommended_ts_mask=['persistentId', 'MPUrl', 'label','accessibleAt', 'contributors', 'externalIds', 'media', 'relatedItems', 'activity', 'keyword', 'discipline', 'language', 'object-format', 'extent', 'intended-audience', 'see-also', 'license', 'terms-of-use-url', 'year']


df_items_null_values=utils.getItemsWithNullValues(recommended_ts, False)
df_items_null_values_training=df_items_null_values[(df_items_null_values['category']=='training-material')]
df_items_null_values_training

In [None]:
df_items_null_values_sp=df_items_null_values[(df_items_null_values['category']=='training-material')]

df_coverage_sp=df_items_null_values_sp[recommended_ts_mask].copy() #copy so that the new column is added to an independent frame

df_coverage_sp['value']=df_coverage_sp.isnull().sum(axis=1)
labels_sp=df_coverage_sp[['MPUrl', 'persistentId', 'label', 'value']].groupby('value').count()['label']

In [None]:
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.axis('equal')
explode = [0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3]

ax.pie(labels_sp.values, labels = labels_sp.index,   autopct='%1.2f%%', radius = 2)
plt.show()

### 1.3 Datasets - coverage of recommended fields

mandatory and recommended fields for datasets:

"dataset": {
    "required": ["label", "description"],
    "recommended": [
      "actors",
      "accessibleAt",
      "externalIds",
      "media",
      "thumbnail",
      "relatedItems"
    ],
    "recommendedProperties": [
      "activity",
      "keyword",
      "discipline",
      "language",
      "object-format",
      "extent",
      "intended-audience",
      "see-also",
      "license",
      "year"
    ]
  }


In [None]:
utils=hel.Util()
recommended_ts='accessibleAt, contributors, externalIds, media, relatedItems, activity, keyword, discipline, language, object-format, extent, intended-audience, see-also, license, year'
recommended_ts_mask=['persistentId', 'MPUrl', 'label','accessibleAt', 'contributors', 'externalIds', 'media', 'relatedItems', 'activity', 'keyword', 'discipline', 'language', 'object-format', 'extent', 'intended-audience', 'see-also', 'license', 'year']


df_items_null_values=utils.getItemsWithNullValues(recommended_ts, False)
df_items_null_values_dataset=df_items_null_values[(df_items_null_values['category']=='dataset')]
df_items_null_values_dataset

In [None]:
df_items_null_values_sp=df_items_null_values[(df_items_null_values['category']=='dataset')]

df_coverage_sp=df_items_null_values_sp[recommended_ts_mask].copy() #copy so that the new column is added to an independent frame

df_coverage_sp['value']=df_coverage_sp.isnull().sum(axis=1)
labels_sp=df_coverage_sp[['MPUrl', 'persistentId', 'label', 'value']].groupby('value').count()['label']

In [None]:
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.axis('equal')
explode = [0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3]

ax.pie(labels_sp.values, labels = labels_sp.index,   autopct='%1.2f%%', radius = 2)
plt.show()

### 1.4 Publications - coverage of recommended fields

mandatory and recommended fields for publications:

"publication": {
    "required": ["label", "description"],
    "recommended": [
      "actors",
      "accessibleAt",
      "externalIds",
      "media",
      "thumbnail",
      "relatedItems"
    ],
    "recommendedProperties": [
      "activity",
      "keyword",
      "discipline",
      "language",
      "extent",
      "intended-audience",
      "see-also",
      "license",
      "publication-type",
      "publisher",
      "publication-place",
      "year",
      "journal",
      "conference",
      "volume",
      "issue",
      "pages"
    ]
  }


In [None]:
utils=hel.Util()
recommended_ts='accessibleAt, contributors, externalIds, media, relatedItems, activity, keyword, discipline, language, extent, intended-audience, see-also, license, publication-type, publisher, publication-place, year, journal, conference, volume, issue, pages'
recommended_ts_mask=['persistentId', 'MPUrl', 'label','accessibleAt', 'contributors', 'externalIds', 'media', 'relatedItems', 'activity', 'keyword', 'discipline', 'language', 'extent', 'intended-audience', 'see-also', 'license', 'publication-type', 'publisher', 'publication-place', 'year', 'journal', 'conference', 'volume', 'issue', 'pages']


df_items_null_values=utils.getItemsWithNullValues(recommended_ts, False)
df_items_null_values_publi=df_items_null_values[(df_items_null_values['category']=='publication')]
df_items_null_values_publi

In [None]:
df_items_null_values_sp=df_items_null_values[(df_items_null_values['category']=='publication')]

df_coverage_sp=df_items_null_values_sp[recommended_ts_mask].copy() #copy so that the new column is added to an independent frame

df_coverage_sp['value']=df_coverage_sp.isnull().sum(axis=1)
labels_sp=df_coverage_sp[['MPUrl', 'persistentId', 'label', 'value']].groupby('value').count()['label']

In [None]:
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.axis('equal')
explode = [0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3]

ax.pie(labels_sp.values, labels = labels_sp.index,   autopct='%1.2f%%', radius = 2)
plt.show()

### 1.5 Workflows - coverage of recommended fields

Open question: given the very specific nature of workflows, it might be better not to include them in this coverage flag.


## 2. Flag priority items for curation 

Given the poor initial (meta)data quality of MP items, a stricter approach than the recommended properties is taken to set up the curation-flag-coverage.

Depending on the item type, a set of fields is defined and inspected: fields that are considered high priority for the (meta)data quality of that item type and that are not already covered by other curation flags (e.g. URLs are inspected via another curation flag).

A two-step approach is defined:
- at first (early 2021), the curation flag is raised only when all the defined fields are empty: getItemsWithNullValues(*dataset chosen*, **True**)
- once the data quality has improved, the set of fields can be enlarged and/or the input parameters modified. For example, with getItemsWithNullValues(*dataset chosen*, **False**) the flag would be raised as soon as at least one of the chosen fields is null


### 2.1 Tools and services coverage 

For tools and services, `activity`, `keyword` and `license` are considered as the most important fields, and should be filled in for all tools and services items. 

This could easily be changed in the future.

Because getItemsWithNullValues is set to `True`, all items for which `activity`, `keyword` and `license` are empty (have null values) are listed and will be included in the `curation-flag-coverage`.

In [None]:
utils=hel.Util()
tool_coverage_ts='activity, keyword, license'
tool_coverage_ts_mask=['persistentId', 'MPUrl', 'label','accessibleAt', 'activity', 'keyword', 'license']


df_items_coverage_null_values=utils.getItemsWithNullValues(tool_coverage_ts, True)
df_items_coverage_null_values_tools=df_items_coverage_null_values[(df_items_coverage_null_values['category']=='tool-or-service')]
table_coverage_null_values_tools=df_items_coverage_null_values_tools[tool_coverage_ts_mask].style.format({'MPUrl': utils.make_clickable})
table_coverage_null_values_tools

In [None]:


print (f'\n There are {df_items_coverage_null_values_tools[tool_coverage_ts_mask].shape[0]} tools and services items where the values in activity, keyword, license are null \n')
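As a possible cross-check on this count, the presence of `activity`, `keyword` and `license` could be derived directly from the raw `properties` column. The sketch below assumes that each cell holds a list of dicts with a `type.code` key, as suggested by the truncated sample output earlier in the notebook, and that the list may be serialised as text when the data is read from a local cache. Both points are assumptions, and `property_codes` is a hypothetical helper, not part of sshmarketplacelib.

```python
import ast

def property_codes(raw):
    """Return the set of property codes found in one `properties` cell."""
    if isinstance(raw, str):
        # Assumed case: the list was stored as text and needs parsing.
        raw = ast.literal_eval(raw)
    if not isinstance(raw, list):
        # e.g. NaN for a missing cell
        return set()
    return {p["type"]["code"] for p in raw}

required = {"activity", "keyword", "license"}

# Toy cells, invented for illustration.
cells = [
    "[{'type': {'code': 'license', 'label': 'License'}}]",
    [{"type": {"code": "activity", "label": "Activity"}}],
    "[]",
    float("nan"),
]

# An item matches the all-null rule when none of the required codes is present.
missing_all = [required.isdisjoint(property_codes(c)) for c in cells]
print(missing_all)
```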


### 2.2

To be continued with training, datasets, publications and workflows