# Notebook 3.1 - Curation-flag-URL

This notebook analyses the URL-based fields of the SSH Open Marketplace and writes the results back to the system via two dedicated curation properties: `curation-flag-url` and `curation-detail`.

It flags Marketplace items that have errors in their URL-based fields, which helps Moderators identify curation priorities and improve data quality.

This notebook is part of a series of 4 notebooks that inform the curation properties used in the SSH Open Marketplace Editorial Dashboard.

It is composed of 4 sections:

0. Requirements to run the notebook
1. Check & flag error values in `accessibleAt`
2. Check & flag error values in URL-based properties
3. Check & flag error values in URL-based properties for a given source - example


## 0. Requirements to run this notebook

This section covers what is needed to interact with the Marketplace (MP) data: the libraries to import and the calls that load the items.

### 0.1 Libraries
*Several external libraries are needed to run the notebook.*

*In addition, a dedicated SSH Open Marketplace library, `sshmarketplacelib`, provides customised functions and can be imported with the usual Python import commands.*

*The imports needed to run this notebook are listed below.*

In [16]:
import numpy as np
import pandas as pd
import requests
# Import the SSH Open Marketplace library
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import eval as eva, helper as hel

### 0.2 Get the data



In [17]:
mpdata = mpd()
df_tool_flat = mpdata.getMPItems("toolsandservices", True)
df_publication_flat = mpdata.getMPItems("publications", True)
df_trainingmaterials_flat = mpdata.getMPItems("trainingmaterials", True)
df_workflows_flat = mpdata.getMPItems("workflows", True)
df_datasets_flat = mpdata.getMPItems("datasets", True)

getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...


### 0.3 A look at the data

`df_all_items.head()` shows the first 5 rows of the dataframe.

`df_all_items.tail()` shows the last 5 rows of the dataframe.

`df_all_items.shape` gives the dataframe shape (number of rows and columns).
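As a minimal sketch of these three inspection calls, the example below builds two small toy dataframes (the names `df_tools_demo` and `df_datasets_demo` and their values are illustrative, not Marketplace data) and concatenates them in the same way the per-category dataframes are combined into `df_all_items`. It also shows that `pd.concat` keeps each input's own row labels unless `ignore_index=True` is passed, which is why index values can repeat in the combined dataframe.

```python
import pandas as pd

# Toy stand-ins for two per-category dataframes (illustrative values only)
df_tools_demo = pd.DataFrame(
    {"persistentId": ["a1", "b2", "c3"], "category": "tool-or-service"}
)
df_datasets_demo = pd.DataFrame(
    {"persistentId": ["d4", "e5", "f6", "g7"], "category": "dataset"}
)

# Concatenate the per-category frames; each frame keeps its own row labels
df_demo = pd.concat([df_tools_demo, df_datasets_demo])

first_rows = df_demo.head()      # first 5 rows by default
last_rows = df_demo.tail(2)      # last 2 rows
n_rows, n_cols = df_demo.shape   # (number of rows, number of columns)

print(first_rows)
print(last_rows)
print(n_rows, n_cols)
print(list(df_demo.index))       # row labels repeat across the inputs
```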


In [3]:
# Combine the five per-category dataframes into a single dataframe
df_all_items = pd.concat([df_tool_flat, df_publication_flat, df_trainingmaterials_flat, df_workflows_flat, df_datasets_flat])
df_all_items.head()

Unnamed: 0,id,category,label,persistentId,lastInfoUpdate,status,description,contributors,properties,externalIds,...,thumbnail.info.mediaId,thumbnail.info.category,thumbnail.info.filename,thumbnail.info.mimeType,thumbnail.info.hasThumbnail,thumbnail.info.location.sourceUrl,thumbnail.caption,dateCreated,dateLastUpdated,composedOf
0,28230,tool-or-service,140kit,SIU1nO,2021-11-23T17:24:25+0000,approved,140kit provides a management layer for tweet c...,"[{'actor': {'id': 2224, 'name': 'Ian Pearce, D...","[{'type': {'code': 'mode-of-use', 'label': 'Mo...",[],...,,,,,,,,,,
1,36324,tool-or-service,3DF Zephyr - photogrammetry software - 3d mode...,4gDAHv,2022-01-13T11:49:02+0000,approved,3DF Zephyr\[1\]\[2\] is a commercial photogram...,[],"[{'type': {'code': 'language', 'label': 'Langu...",[],...,,,,,,,,,,
2,36552,tool-or-service,3DHOP,UcxOmD,2022-01-13T11:50:31+0000,approved,3DHOP (3D Heritage Online Presenter) is an ope...,[],"[{'type': {'code': 'language', 'label': 'Langu...",[],...,,,,,,,,,,
3,36555,tool-or-service,3DHOP: 3D Heritage Online Presenter,uFIMPQ,2022-01-13T11:50:32+0000,approved,No description provided.,[],[],[],...,,,,,,,,,,
4,36189,tool-or-service,3DReshaper \| 3DReshaper,kAkzuz,2022-01-13T11:47:44+0000,approved,No description provided.,[],"[{'type': {'code': 'language', 'label': 'Langu...",[],...,,,,,,,,,,


`df_all_items_work` selects the columns/attributes of interest 

In [4]:
df_all_items_work=df_all_items[['id', 'persistentId', 'category', 'label', 'contributors', 'accessibleAt', 'source.label']]
df_all_items_work.tail()

Unnamed: 0,id,persistentId,category,label,contributors,accessibleAt,source.label
303,12634,l8gLBb,dataset,Yelp Academic Challenge Dataset,"[{'actor': {'id': 1752, 'name': 'Eva Bacas', '...",[https://www.yelp.com/dataset],Humanities Data
304,12631,xvYQQ4,dataset,YelpCHI,"[{'actor': {'id': 1752, 'name': 'Eva Bacas', '...",[http://odds.cs.stonybrook.edu/yelpchi-dataset/],Humanities Data
305,12632,IdZGtV,dataset,YelpNYC,"[{'actor': {'id': 1752, 'name': 'Eva Bacas', '...",[http://odds.cs.stonybrook.edu/yelpnyc-dataset/],Humanities Data
306,12633,OMny6U,dataset,YelpZIP,"[{'actor': {'id': 1752, 'name': 'Eva Bacas', '...",[http://odds.cs.stonybrook.edu/yelpzip-dataset/],Humanities Data
307,12589,YnEaU0,dataset,"""You Are Where You Tweet: A Content-Based Appr...","[{'actor': {'id': 1752, 'name': 'Eva Bacas', '...",[https://archive.org/details/twitter_cikm_2010],Humanities Data


## 1. Check & flag values in `accessibleAt`

`accessibleAt` is the main URL field of MP items.

The following cell counts the items, across all categories, whose `accessibleAt` list is empty.

In [5]:
# Keep only the items whose accessibleAt list is empty
df_all_items_work_emptyurls = df_all_items_work[df_all_items_work['accessibleAt'].str.len() == 0]

# Number of items without any accessibleAt URL
empty_url_count = len(df_all_items_work_emptyurls)

print(f'\n There are {empty_url_count} items without accessibleAt URLs\n')


 There are 526 items without accessibleAt URLs
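The filter above relies on `.str.len()` returning the length of each list stored in the column. The sketch below illustrates this on a toy dataframe (the name `df_demo` and its values are illustrative). It also shows that a missing value yields `NaN` rather than `0`, so it would need a separate `isna()` check if such values occurred in the real data, which is an assumption and not something the output above confirms.

```python
import pandas as pd

# Toy dataframe: accessibleAt holds a list of URLs per item (illustrative values)
df_demo = pd.DataFrame({
    "persistentId": ["a1", "b2", "c3", "d4"],
    "accessibleAt": [
        ["https://example.org/tool"],
        [],                                    # empty list: no URL recorded
        ["https://example.org/a", "https://example.org/b"],
        None,                                  # missing value
    ],
})

url_counts = df_demo["accessibleAt"].str.len()  # list length per row, NaN if missing
is_empty_list = url_counts == 0                 # True only for the empty list
is_missing = df_demo["accessibleAt"].isna()     # True only for the missing value

print(url_counts.tolist())
print(df_demo[is_empty_list | is_missing]["persistentId"].tolist())
```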



### 1.1 Check the validity of URLs in the accessibleAt property using the HTTP Result Status code

The code below makes an HTTP call for every URL, waits for the [response status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) and records it.
Depending on the connection and on server response times, it may take several minutes to process all URLs.
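To make the idea concrete, here is a minimal sketch of a single status check written with `requests`. It is not the implementation of `checkURLValues`, whose internals are not shown in this notebook; the function name `check_url_status`, the HEAD-then-GET fallback, the 10-second timeout and the `None` return value for unreachable hosts are all assumptions made for illustration.

```python
import requests


def check_url_status(url, session=None, timeout=10):
    """Return the HTTP status code for url, or None if no response arrives.

    Hypothetical helper: a HEAD request is tried first because it avoids
    downloading the body; servers that reject HEAD get a GET instead.
    """
    http = session or requests
    try:
        response = http.head(url, allow_redirects=True, timeout=timeout)
        if response.status_code in (405, 501):  # HEAD not allowed / not implemented
            response = http.get(url, allow_redirects=True, timeout=timeout, stream=True)
        return response.status_code
    except requests.RequestException:
        # DNS failure, refused connection, timeout, invalid URL, ...
        return None
```

Calling `check_url_status("https://example.org")` performs one network round trip per URL, which is why checking every URL of a category can take several minutes; passing a shared `requests.Session()` as `session` reuses connections between calls.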


In [6]:
# Comma-separated list of the categories to check. Only tools and services are
# checked here; the other categories are: publications, trainingmaterials, workflows, datasets
categories = "toolsandservices"

check=eva.URLCheck()
df_urls=check.checkURLValues(categories, 'accessibleAt')
df_urls.head()

inspecting accessibleAt


Unnamed: 0,MPUrl,persistentId,category,label,property,url,status
0,tool-or-service/SIU1nO,SIU1nO,tool-or-service,140kit,accessibleAt,https://github.com/WebEcologyProject/140kit,200
1,tool-or-service/4gDAHv,4gDAHv,tool-or-service,3DF Zephyr - photogrammetry software - 3d mode...,accessibleAt,https://www.3dflow.net/3df-zephyr-pro-3d-model...,200
2,tool-or-service/UcxOmD,UcxOmD,tool-or-service,3DHOP,accessibleAt,http://vcg.isti.cnr.it/3dhop/,200
3,tool-or-service/uFIMPQ,uFIMPQ,tool-or-service,3DHOP: 3D Heritage Online Presenter,accessibleAt,https://github.com/cnr-isti-vclab/3DHOP,200
4,tool-or-service/kAkzuz,kAkzuz,tool-or-service,3DReshaper \| 3DReshaper,accessibleAt,https://www.3dreshaper.com/en/,200


In [7]:
df_urls.shape

(1878, 7)

In [8]:
df_urls.drop_duplicates(keep='first', inplace=True)

In [9]:
df_urls.shape

(1220, 7)

In [10]:
utils = hel.Util()
# URLs answering 404 (Not Found), sorted by item and without duplicate rows
df_http_status_nf_err = df_urls[df_urls['status'] == 404].sort_values('persistentId').drop_duplicates(keep='first')
# URLs answering 500 (Internal Server Error)
df_http_status_serv_err = df_urls[df_urls['status'] == 500].sort_values('persistentId').drop_duplicates(keep='first')
# All error rows together: not-found and server errors
df_http_status_err = pd.concat([df_http_status_nf_err, df_http_status_serv_err])
#df_http_status_err.to_pickle('data/urlstatus.pickle')
#df_http_status_nf_err.to_pickle('data/urlstatus404.pickle')
# Show the 404 rows with the MPUrl column rendered as clickable links
myclickable_table = df_http_status_nf_err.style.format({'MPUrl': utils.make_clickable})
myclickable_table

Unnamed: 0,MPUrl,persistentId,category,label,property,url,status
1380,tool-or-service/25grr1,25grr1,tool-or-service,VLE - Viennese Lexicographic Editor,accessibleAt,https://www.oeaw.ac.at/en/acdh/tools/vle/,404
375,tool-or-service/3ZpHvD,3ZpHvD,tool-or-service,Finding People or Characters from A Text (Named-Entity Recognition),accessibleAt,https://github.com/TAPoR-3-Tools/Tapor-Coding-Tools/tree/master/tapor_coding_tools/natural%20language%20processing/Finging%20people%20or%20characters%20with%20NER,404
1217,tool-or-service/6xmMnK,6xmMnK,tool-or-service,Tagger - Other (TAPoRware),accessibleAt,http://taporware.ualberta.ca/~taporware/otherTools/tagger.shtml,404
1843,tool-or-service/Aezoqo,Aezoqo,tool-or-service,Wordle,accessibleAt,http://www.wordle.net/,404
363,tool-or-service/C80AK6,C80AK6,tool-or-service,Extract Text From HTML - Beta (TAPoRware),accessibleAt,http://taporware.ualberta.ca/~taporware/betaTools/textextractor.shtml,404
379,tool-or-service/CzSzkq,CzSzkq,tool-or-service,Fixed Phrase - XML (TAPoRware),accessibleAt,http://taporware.ualberta.ca/~taporware/xmlTools/fixedphrase-xml.shtml,404
725,tool-or-service/EcLbxP,EcLbxP,tool-or-service,Integrated Authority File (GND),accessibleAt,http://www.dnb.de/EN/Standardisierung/GND/gnd_node.html,404
377,tool-or-service/F4hxi5,F4hxi5,tool-or-service,Fixed Phrase - HTML (TAPoRware),accessibleAt,http://taporware.ualberta.ca/~taporware/htmlTools/fixedphrase-html.shtml,404
114,tool-or-service/FOhmdk,FOhmdk,tool-or-service,Bibliopedia,accessibleAt,http://sul-cidr.github.io/Bibliopedia/,404
221,tool-or-service/G4iyld,G4iyld,tool-or-service,Concraft -> DependencyParser,accessibleAt,http://zil.ipipan.waw.pl/DependencyParser,404


In [11]:
df_http_status_nf_err.shape

(39, 7)
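The deduplication and filtering steps of the previous cells can be sketched on a toy status table. The dataframe `df_urls_demo` and its values are illustrative only; the column names follow those of `df_urls`.

```python
import pandas as pd

# Toy status table with the same kind of columns as df_urls (illustrative values)
df_urls_demo = pd.DataFrame({
    "persistentId": ["b2", "a1", "b2", "c3", "a1"],
    "property": ["accessibleAt"] * 5,
    "url": [
        "http://example.org/b",
        "http://example.org/a",
        "http://example.org/b",
        "http://example.org/c",
        "http://example.org/a",
    ],
    "status": [404, 200, 404, 500, 200],
})

# Drop rows that are identical in every column, keeping the first occurrence
df_unique = df_urls_demo.drop_duplicates(keep="first")

# Overview of how many distinct rows returned each status code
status_counts = df_unique["status"].value_counts()

# Keep only the not-found rows, ordered by item
df_not_found = df_unique[df_unique["status"] == 404].sort_values("persistentId")

print(df_unique.shape)
print(status_counts)
print(df_not_found)
```

Looking at `status_counts` before filtering is a quick way to see which other codes occur and may deserve their own check.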

### 1.2 Flag items with wrong accessibleAt URLs in the Dataset

In [12]:
curation_flag_property={"code": "curation-flag-url"}
curation_detail_property={"code": "curation-detail"}

In [18]:
res=mpdata.setHTTPStatusFlags(df_http_status_nf_err, curation_flag_property, curation_detail_property)

accessibleAt, tools 

The item with PID: FOhmdk has a 404 HTTP status for the property accessibleAt, (False)
flag property exists, value:  {'url': [ {"accessibleAt": "404"}]}
accessibleAt, tools 

The item with PID: I2egYT has a 404 HTTP status for the property accessibleAt, (False)
flag property exists, value:  {'url': [ {"accessibleAt": "404"}]}
accessibleAt, tools 

The item with PID: i3v5q0 has a 404 HTTP status for the property accessibleAt, (False)
Appending curation_detail_flag {'url': [ {"accessibleAt": "404"}]}



Running in debug mode, Marketplace dataset not updated.
accessibleAt, tools 

The item with PID: GwWY2h has a 404 HTTP status for the property accessibleAt, (False)
flag property exists, value:  {'url': [ {"accessibleAt": "404"}]}
accessibleAt, tools 

The item with PID: G4iyld has a 404 HTTP status for the property accessibleAt, (False)
flag property exists, value:  {'url': [ {"accessibleAt": "404"}]}
accessibleAt, tools 

The item with PID: oVSGJ5 has a 404 HTTP st
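The log above shows `curation-detail` values of the form `{'url': [ {"accessibleAt": "404"}]}`. The sketch below shows one way such a value could be assembled per item from a table of failing URLs. It is an illustration only: `build_curation_details` is a hypothetical helper, the item identifiers are made up, and the real `setHTTPStatusFlags` may build and serialise the value differently.

```python
import json

import pandas as pd


def build_curation_details(df_errors):
    """Map each persistentId to a JSON string listing its failing URL properties.

    Hypothetical helper; the JSON layout mirrors the values printed in the log.
    """
    details = {}
    for pid, group in df_errors.groupby("persistentId"):
        # One {property: status} entry per failing URL of the item
        entries = [
            {prop: str(status)}
            for prop, status in zip(group["property"], group["status"])
        ]
        details[pid] = json.dumps({"url": entries})
    return details


# Made-up identifiers, for illustration only
df_errors_demo = pd.DataFrame({
    "persistentId": ["pid001", "pid001", "pid002"],
    "property": ["accessibleAt", "see-also", "accessibleAt"],
    "status": [404, 404, 404],
})

details_demo = build_curation_details(df_errors_demo)
print(details_demo["pid001"])
```

Grouping by item keeps all failing properties of one item in a single value, so a Moderator sees every broken URL of that item at once.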

## 2. Check & flag error values in URL-based properties



In [None]:
df_properties=utils.getProperties()
df_properties.head()

In [None]:
df_properties["type.type"].unique()

In [None]:
df_properties_url=df_properties[df_properties["type.type"]=="url"]
df_properties_url

In [None]:
df_properties_url["type.code"].unique()
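Instead of typing the property codes by hand, the comma-separated string passed to `checkURLValuesInDataset` could be derived from the filtered property types. The sketch below does this on a toy table; `df_properties_demo` and its rows are illustrative, and the `", "` separator is assumed from the string used in the cell that follows.

```python
import pandas as pd

# Toy property-type table using the same column names as df_properties
df_properties_demo = pd.DataFrame({
    "type.code": ["see-also", "language", "helpdesk-url", "see-also", "user-manual-url"],
    "type.type": ["url", "concept", "url", "url", "url"],
})

# Distinct codes of the URL-typed properties, in order of first appearance
is_url = df_properties_demo["type.type"] == "url"
url_codes = df_properties_demo.loc[is_url, "type.code"].unique()

# Comma-separated string in the format the check expects (assumed separator)
props = ", ".join(url_codes)
print(props)
```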

In [None]:
urls_df_properties=check.checkURLValuesInDataset(df_all_items, 'terms-of-use-url, user-manual-url, privacy-policy-url, access-policy-url, service-level-url, see-also, helpdesk-url')
urls_df_properties.tail()

## 3. Check error values in URL-based properties for items whose source is *EOSC Catalogue*

Create a dataframe with all the items having the EOSC Catalogue source

In [None]:
df_ec_items=df_all_items[df_all_items['source.label']=='EOSC Catalogue']

In [None]:
df_ec_items.head(3)

Check the URL properties by invoking the function **checkURLValuesInDataset(dataset, props)**

In [None]:
urls_df_hd=check.checkURLValuesInDataset(df_ec_items, 'terms-of-use-url, user-manual-url, privacy-policy-url, access-policy-url, service-level-url, see-also, helpdesk-url')
urls_df_hd.head()

In [None]:
urls_df_hd_status_nf_err=urls_df_hd[urls_df_hd['status'] == 404].sort_values('persistentId').drop_duplicates(keep='first', inplace=False)

In [None]:
myclickable_table = urls_df_hd_status_nf_err.style.format({'MPUrl': utils.make_clickable})
myclickable_table
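Once the 404 rows are isolated, a short summary can help decide where to start curating. The sketch below counts broken URLs per property and the number of affected items on a toy table; `df_status_demo` and its values are illustrative, with column names taken from the status tables above.

```python
import pandas as pd

# Toy status table for several URL-based properties (illustrative values)
df_status_demo = pd.DataFrame({
    "persistentId": ["p1", "p1", "p2", "p3"],
    "property": ["helpdesk-url", "see-also", "see-also", "privacy-policy-url"],
    "status": [404, 404, 404, 200],
})

not_found = df_status_demo[df_status_demo["status"] == 404]

# Number of broken URLs per property, most affected property first
errors_per_property = not_found.groupby("property").size().sort_values(ascending=False)

# Number of distinct items with at least one broken URL
items_affected = not_found["persistentId"].nunique()

print(errors_per_property)
print(items_affected)
```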