# Notebook 3.1 - Curation-flag-URL

This notebook analyses the URL-based fields of the SSH Open Marketplace and writes the results back to the system via two dedicated curation properties: `curation-flag-url` and `curation-detail`.

This notebook flags Marketplace items that have errors in their URL-based fields, helping Moderators identify curation priorities to improve data quality. 

This notebook is part of a series of 4 notebooks that inform the curation properties used in the SSH Open Marketplace Editorial Dashboard.

It is composed of 4 sections:

0. Requirements to run the notebook
1. Check & flag error values in `accessibleAt`
2. Check & flag error values in URL-based properties
3. Check & flag error values in URL-based properties for a given source - example


## 0. Requirements to run this notebook

This section gives all the relevant information to "interact" with the Marketplace (MP) data.

### 0.1 Libraries
*A number of external libraries are needed to run the notebook.*

*Furthermore, a dedicated SSH Open Marketplace library, `sshmarketplacelib`, with customised functions has been created and can be imported using the Python import commands.*

*The cell below imports the libraries needed to run this notebook.*

In [None]:
import numpy as np
import pandas as pd
import requests
# Import the SSH Open Marketplace library
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import eval as eva, helper as hel

### 0.2 Get the data



In [None]:
mpdata = mpd()
# One flattened dataframe per item category
df_tool_flat = mpdata.getMPItems("toolsandservices", True)
df_publication_flat = mpdata.getMPItems("publications", True)
df_trainingmaterials_flat = mpdata.getMPItems("trainingmaterials", True)
df_workflows_flat = mpdata.getMPItems("workflows", True)
df_datasets_flat = mpdata.getMPItems("datasets", True)

### 0.3 A look at the data

`df_all_items.head()` shows the first 5 rows of the dataframe.

`df_all_items.tail()` shows the last 5 rows of the dataframe.

`df_all_items.shape` gives the dataframe shape (number of rows and columns).


In [None]:
df_all_items=pd.concat([df_tool_flat, df_publication_flat, df_trainingmaterials_flat, df_workflows_flat, df_datasets_flat])
df_all_items.head()

`df_all_items_work` selects the columns/attributes of interest 

In [None]:
df_all_items_work=df_all_items[['id', 'persistentId', 'category', 'label', 'contributors', 'accessibleAt', 'source.label']]
df_all_items_work.tail()

## 1. Check & flag error values in `accessibleAt`

`accessibleAt` is the main URL field of MP items.

The following cell checks if there are empty values in `accessibleAt` for all items

In [None]:
df_all_items_work_emptyurls=df_all_items_work[df_all_items_work['accessibleAt'].str.len()==0]

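# Number of rows with an empty accessibleAt value
# (len() avoids the deprecated positional indexing in count()[0])
emptyurldescriptionsn=len(df_all_items_work_emptyurls)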

print(f'\n There are {emptyurldescriptionsn} items without accessibleAt URLs\n')
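
Note that `.str.len()==0` only matches values that are present but empty; a missing value (`None`/`NaN`) has a length of `NaN` and is not counted. The following is a minimal sketch on invented toy data (not Marketplace records), assuming missing values should be counted as empty too:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "persistentId": ["a1", "b2", "c3", "d4"],
    "accessibleAt": [["https://example.org"], [], None, ""],
})

# .str.len() works on strings and lists alike; it is NaN where the value is missing
lengths = toy["accessibleAt"].str.len()

present_but_empty = lengths == 0           # what the cell above counts
empty_or_missing = lengths.fillna(0) == 0  # also counts missing values

n_present_but_empty = int(present_but_empty.sum())
n_empty_or_missing = int(empty_or_missing.sum())
print(n_present_but_empty, n_empty_or_missing)
```

Whether missing values actually occur in `accessibleAt` depends on the data returned by the library, so this variant may or may not change the count.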

### 1.1 Check the validity of URLs in the accessibleAt property using the HTTP Result Status code

The code below executes an HTTP call for every URL, waits for the [Result Status Code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) of the call and then registers the code.
Depending on the connection and on server response times, it may take several minutes to process all URLs.
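
To help read the registered codes, the sketch below groups status codes into their standard classes using only the Python standard library. The helper name `describe_status` is hypothetical and is not part of `sshmarketplacelib`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return (class label, standard reason phrase) for an HTTP status code."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    label = classes.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not a code known to the standard library
        phrase = "unregistered code"
    return label, phrase

for code in (200, 404, 500):
    print(code, describe_status(code))
```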


In [None]:
# The categories to check are given as a comma-separated string, e.g.
# "toolsandservices, publications, trainingmaterials, workflows, datasets"

categories="toolsandservices"

check=eva.URLCheck()
df_urls=check.checkURLValues(categories, 'accessibleAt')
df_urls.head()

In [None]:
df_urls.drop_duplicates(keep='first', inplace=True)
#df_urls.head()

In [None]:
utils=hel.Util()
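# Items whose URL returned 404 (Not Found)
df_http_status_nf_err=df_urls[df_urls['status'] == 404].sort_values('persistentId').drop_duplicates(keep='first', inplace=False)
# NOTE: 200 is the HTTP success code, while the variable name suggests server errors (5xx);
# adjust this filter if server errors are what should be flagged
df_http_status_serv_err=df_urls[df_urls['status'] == 200].sort_values('persistentId').drop_duplicates(keep='first', inplace=False)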
df_http_status_err=pd.concat([df_http_status_nf_err, df_http_status_serv_err])
#df_http_status_err=df_urls.sort_values('persistentId').drop_duplicates(keep='first', inplace=False)
#df_http_status_err.to_pickle('data/urlstatus.pickle')
#df_http_status_nf_err.to_pickle('data/urlstatus404.pickle')
myclickable_table = df_http_status_nf_err.style.format({'MPUrl': utils.make_clickable})
myclickable_table
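
The table above relies on `utils.make_clickable` to render the `MPUrl` column as links. Its implementation is not shown in this notebook; the sketch below is a hypothetical stand-in (named `make_clickable_sketch` to avoid confusion) that illustrates the general idea of such a formatter, assuming it wraps the URL in an HTML anchor.

```python
from html import escape

def make_clickable_sketch(url):
    """Wrap a URL in an HTML anchor so that it renders as a link."""
    # Escape the value so that characters such as & or " do not break the HTML
    safe = escape(str(url), quote=True)
    return f'<a href="{safe}" target="_blank">{safe}</a>'

print(make_clickable_sketch("https://example.org/item?id=1&lang=en"))
```

A function of this shape can be passed to `DataFrame.style.format` for a given column, in the same way as in the cell above.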

In [None]:
#df_http_status_err.to_pickle('data/test404.pickle')

### 1.2 Flag items with wrong accessibleAt URLs in the Dataset

In [None]:
curation_flag_property={"code": "curation-flag-url"}
curation_detail_property={"code": "curation-detail"}

In [None]:
res=mpdata.setHTTPStatusFlags(df_http_status_err, curation_flag_property, curation_detail_property)

## 2. Check & flag error values in URL-based properties



In [None]:
df_properties=utils.getProperties()
df_properties.head()

In [None]:
df_properties["type.type"].unique()

In [None]:
df_properties_url=df_properties[df_properties["type.type"]=="url"]
df_properties_url

In [None]:
df_properties_url["type.code"].unique()

In [None]:
df_all_items.count()

In [None]:
urls_df_properties=check.checkURLValuesInDataset(df_all_items.iloc[0:3000], 'terms-of-use-url, user-manual-url, privacy-policy-url, access-policy-url, service-level-url, see-also, helpdesk-url')
urls_df_properties.tail()
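
The call above only checks the first 3000 rows. Assuming the slice is there to keep the run time manageable, the remaining rows could be processed in further slices of the same size. The sketch below only computes the slice bounds; `batch_bounds` and `props` are hypothetical names, and the commented loop shows how the bounds might be combined with `checkURLValuesInDataset`.

```python
def batch_bounds(n_rows, batch_size):
    """Yield (start, stop) pairs that cover range(n_rows) in consecutive batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

print(list(batch_bounds(7000, 3000)))

# Possible use with the notebook's objects (not executed here):
# for start, stop in batch_bounds(len(df_all_items), 3000):
#     part = check.checkURLValuesInDataset(df_all_items.iloc[start:stop], props)
```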

## 3. Check error values in URL-based properties for a given source - example: *EOSC Catalogue*

Create a dataframe with all the items having the EOSC Catalogue source.

In [None]:
df_ec_items=df_all_items[df_all_items['source.label']=='EOSC Catalogue']

In [None]:
df_ec_items.head(3)

Check the URL properties by invoking the function **checkURLValuesInDataset(dataset, props)**

In [None]:
urls_df_hd=check.checkURLValuesInDataset(df_ec_items, 'terms-of-use-url, user-manual-url, privacy-policy-url, access-policy-url, service-level-url, see-also, helpdesk-url')
urls_df_hd.head()

In [None]:
urls_df_hd_status_nf_err=urls_df_hd[urls_df_hd['status'] == 404].sort_values('persistentId').drop_duplicates(keep='first', inplace=False)
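
The filter, sort and de-duplication chain used above can be tried out on a small example. The data below is invented for illustration; only the `persistentId` and `status` column names are taken from the notebook, and the `url` column is an assumption.

```python
import pandas as pd

# Toy results, invented for illustration only
toy = pd.DataFrame({
    "persistentId": ["z9", "a1", "a1", "m5"],
    "url": [
        "https://example.org/x",
        "https://example.org/y",
        "https://example.org/y",
        "https://example.org/z",
    ],
    "status": [404, 404, 404, 200],
})

not_found = (
    toy[toy["status"] == 404]        # keep only "Not Found" responses
    .sort_values("persistentId")     # order by item identifier
    .drop_duplicates(keep="first")   # drop rows that are identical in every column
)
print(not_found)
```

Note that `drop_duplicates` without a `subset` only removes rows that match in every column, so an item with two different broken URLs keeps both rows.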

In [None]:
myclickable_table = urls_df_hd_status_nf_err.style.format({'MPUrl': utils.make_clickable})
myclickable_table