# Notebook 3.2 - Curation-flag-description

This notebook analyses the `description` field of SSH Open Marketplace items and writes the results back to the system via two dedicated curation properties: `curation-flag-description` and `curation-detail`.

It flags Marketplace items whose descriptions are empty, too short or too long, helping Moderators identify curation priorities to improve data quality.

This notebook is part of a series of 4 notebooks that inform the curation properties used in the SSH Open Marketplace Editorial Dashboard.

It is composed of 3 sections:

0. Requirements to run the notebook
1. Check the length of the text stored in the `description` field
2. Flag items with empty/short/long `description` in the Dataset

## 0. Requirements to run this notebook

This section covers what is needed to interact with the Marketplace (MP) data.

### 0.1 Libraries
*The notebook relies on a number of external libraries.*

*In addition, a dedicated SSH Open Marketplace library, sshmarketplacelib, provides customised functions and can be imported with the usual Python import statements.*

*The imports needed to run this notebook are listed below.*

In [1]:
import numpy as np
import pandas as pd
import requests
# Import the SSH Open Marketplace library
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import eval as eva, helper as hel

### 0.2 Get the data

In [2]:
mpdata = mpd()
df_tool_flat = mpdata.getMPItems("toolsandservices", True)
df_publication_flat = mpdata.getMPItems("publications", True)
df_trainingmaterials_flat = mpdata.getMPItems("trainingmaterials", True)
df_workflows_flat = mpdata.getMPItems("workflows", True)
df_datasets_flat = mpdata.getMPItems("datasets", True)

getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...


### 0.3 A look at the data

In [3]:
df_all_items = pd.concat([df_tool_flat, df_publication_flat, df_trainingmaterials_flat, df_workflows_flat, df_datasets_flat])
df_all_items.head()

Unnamed: 0,id,category,label,persistentId,lastInfoUpdate,status,description,contributors,properties,externalIds,...,thumbnail.info.filename,thumbnail.info.mimeType,thumbnail.info.hasThumbnail,thumbnail.caption,version,thumbnail.info.location.sourceUrl,informationContributor.email,dateCreated,dateLastUpdated,composedOf
0,45953,tool-or-service,140kit,3IAyEp,2021-07-30T16:03:01+0000,approved,140kit provides a management layer for tweet c...,"[{'actor': {'id': 483, 'name': 'Ian Pearce, De...","[{'type': {'code': 'activity', 'label': 'Activ...",[],...,acdh-ch-logo96.png,image/png,True,test thumbnail of uploaded media image,,,,,,
1,49576,tool-or-service,3DF Zephyr - photogrammetry software - 3d mode...,U3gQrh,2021-09-22T15:51:38+0000,approved,No description provided.,[],[{'type': {'code': 'curation-flag-description'...,[],...,,,,,,,,,,
2,49577,tool-or-service,3DHOP,MnpOWX,2021-09-22T15:51:39+0000,approved,No description provided.,[],[{'type': {'code': 'curation-flag-description'...,[],...,,,,,,,,,,
3,49578,tool-or-service,3DHOP: 3D Heritage Online Presenter,gA7zFN,2021-09-22T15:51:39+0000,approved,No description provided.,[],[{'type': {'code': 'curation-flag-description'...,[],...,,,,,,,,,,
4,49579,tool-or-service,3DReshaper \| 3DReshaper,Q49CiV,2021-09-22T15:51:40+0000,approved,No description provided.,[],[{'type': {'code': 'curation-flag-description'...,[],...,,,,,,,,,,


In [4]:
df_all_items_work = df_all_items[['id', 'persistentId', 'category', 'label', 'description', 'contributors', 'accessibleAt', 'source.label']]
df_all_items_work.tail()

Unnamed: 0,id,persistentId,category,label,description,contributors,accessibleAt,source.label
303,12634,l8gLBb,dataset,Yelp Academic Challenge Dataset,The Yelp dataset is a subset of our businesses...,"[{'actor': {'id': 1752, 'name': 'Eva Bacas', '...",[https://www.yelp.com/dataset],Humanities Data
304,12631,xvYQQ4,dataset,YelpCHI,This dataset is collected from Yelp.com and fi...,"[{'actor': {'id': 1752, 'name': 'Eva Bacas', '...",[http://odds.cs.stonybrook.edu/yelpchi-dataset/],Humanities Data
305,12632,IdZGtV,dataset,YelpNYC,This dataset is collected from Yelp.com and fi...,"[{'actor': {'id': 1752, 'name': 'Eva Bacas', '...",[http://odds.cs.stonybrook.edu/yelpnyc-dataset/],Humanities Data
306,12633,OMny6U,dataset,YelpZIP,This dataset is collected from Yelp.com and fi...,"[{'actor': {'id': 1752, 'name': 'Eva Bacas', '...",[http://odds.cs.stonybrook.edu/yelpzip-dataset/],Humanities Data
307,12589,YnEaU0,dataset,"""You Are Where You Tweet: A Content-Based Appr...",This dataset is a collection of scraped public...,"[{'actor': {'id': 1752, 'name': 'Eva Bacas', '...",[https://archive.org/details/twitter_cikm_2010],Humanities Data
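
The `head()` and `tail()` outputs above show row labels that restart per category (0 to 4 for tools, 303 to 307 for datasets), because `pd.concat` keeps each frame's original index by default. The minimal sketch below illustrates this on two invented frames and shows the `ignore_index=True` alternative; the frame contents are made up for illustration and do not come from the Marketplace.

```python
import pandas as pd

# Two tiny made-up frames standing in for per-category item tables.
tools = pd.DataFrame({"persistentId": ["aaa111", "bbb222"], "category": "tool-or-service"})
datasets = pd.DataFrame({"persistentId": ["ccc333"], "category": "dataset"})

# Default behaviour: each frame keeps its own index, so labels repeat.
kept = pd.concat([tools, datasets])
print(kept.index.tolist())

# ignore_index=True renumbers the rows consecutively from 0.
renumbered = pd.concat([tools, datasets], ignore_index=True)
print(renumbered.index.tolist())
```

With repeated labels, a label-based selection such as `.loc[0]` on the first result returns several rows, which matters if the concatenated frame is later filtered or joined by index.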


## 1. Check the length of the text stored in the `description` field

By default, following the rules in the Editorial Guidelines [add link], the minimum description length is set to 25 characters and the maximum to 1500. Both parameters can be changed in the cell below.

An overview of item description lengths is also given in notebook *2.SSH Open Marketplace data quality overview*.

In [5]:
minchars = 25
maxchars = 1500

utils = hel.Util()

# Replace the library's empty-description marker with NaN (tools and services only)
df_items = df_tool_flat.replace(utils.empty_description, np.nan)

# Descriptions within the accepted length range
df_items_d = df_items[(df_items['description'].notnull()) & (df_items['description'].str.len()>=minchars) & (df_items['description'].str.len()<=maxchars)]

# Descriptions that are too short
df_items_min = df_items[(df_items['description'].notnull()) & (df_items['description'].str.len()<minchars)]

# Descriptions that are too long
df_items_max = df_items[(df_items['description'].notnull()) & (df_items['description'].str.len()>maxchars)]

print(f"\nThere are {df_items['description'].isna().sum()} items with empty descriptions, "+
      f" {df_items_min['description'].count()} items with description shorter than {minchars} characters,"+
      f" {df_items_max['description'].count()} Items with description longer than {maxchars} characters, "+
      f" {df_items_d['description'].count()} Items with description between {minchars} and {maxchars} characters.")
       


There are 166 items with empty descriptions,  35 items with description shorter than 25 characters, 27 Items with description longer than 1500 characters,  1638 Items with description between 25 and 1500 characters.
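
The split above can also be read as a single classification rule applied to each description. The sketch below restates that rule in plain Python on made-up strings. `classify_description` and `EMPTY_PLACEHOLDER` are hypothetical names, and treating the text `No description provided.` as empty is an assumption based on the sample rows shown earlier, not the actual content of `utils.empty_description`.

```python
from typing import Optional

# Assumption: placeholder text seen in the sample data; the library's real
# empty-description marker may differ.
EMPTY_PLACEHOLDER = "No description provided."


def classify_description(text: Optional[str], minchars: int = 25, maxchars: int = 1500) -> str:
    """Return 'empty', 'short', 'long' or 'ok' for one description."""
    if text is None or text == EMPTY_PLACEHOLDER:
        return "empty"
    length = len(text)
    if length < minchars:
        return "short"
    if length > maxchars:
        return "long"
    return "ok"


# Invented examples, one per outcome.
samples = {
    "a": None,
    "b": "Create simple charts.\n",
    "c": "x" * 1501,
    "d": "A tool for annotating and exploring text corpora.",
}
labels = {key: classify_description(value) for key, value in samples.items()}
print(labels)
```

Both thresholds are inclusive, as in the notebook cell: a description of exactly `minchars` or `maxchars` characters is accepted.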


In [6]:
df_items_null = df_items[(df_items['description'].isnull())]

In [7]:
# Collect all flagged items: too short, too long and empty descriptions
nv_description_df = pd.concat([df_items_min, df_items_max, df_items_null])
# Work on an explicit copy so the new columns are not assigned to a slice
tempdf = nv_description_df[['persistentId', 'label', 'description']].copy()
tempdf['property'] = 'description'
tempdf['value'] = tempdf['description'].str.len()


In [8]:
tempdf.to_pickle('data/descrlength.pickle')

In [9]:
tempdf.head()

Unnamed: 0,persistentId,label,description,property,value
91,zsKANt,areMediaEqual crash,areMediaEqual crash,description,19.0
174,ef1r3I,Bread!,Bread!,description,6.0
208,MWcv85,Chartle,Create simple charts.\n,description,22.0
297,xMeq9F,cool tool,very very cool tool!,description,20.0
771,31D6PR,KSHIP,TEST,description,4.0
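
The `value` column above holds floats such as `19.0` rather than integers. This is a side effect of the missing descriptions: `.str.len()` returns `NaN` for a missing entry, so the resulting lengths are stored as floating-point numbers. The sketch below reproduces this on invented data and shows one possible way to obtain integer lengths; using 0 for a missing description is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Invented descriptions, one of them missing.
descriptions = pd.Series(["TEST", np.nan, "Bread!"], name="description")

# The missing entry yields NaN, and the other lengths come back as floats.
lengths = descriptions.str.len()
print(lengths.tolist())

# Assumption: treat a missing description as length 0 and store integers.
int_lengths = lengths.fillna(0).astype(int)
print(int_lengths.tolist())
```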


## 2. Flag items with empty/short/long `description` in the Dataset

In [10]:
curation_flag_property={"code": "curation-flag-description"}
curation_detail_property={"code": "curation-detail"}

In [11]:
# categories="toolsandservices";"publications";"trainingmaterials";"workflows";"datasets"
# res=mpdata.setPropertyStatusFlags(tempdf, categories, curation_flag_property, curation_detail_property)
res_des=mpdata.setPropertyFlags(tempdf, curation_flag_property, curation_detail_property)

Creating log file...
The property: description, has value nan, in item with pid: tools/U3gQrh, (current version: 49576)
flag property exists, value:  {'description':  {"length": "0"}}
The property: description, has value nan, in item with pid: tools/MnpOWX, (current version: 49577)
flag property exists, value:  {'description':  {"length": "0"}}
The property: description, has value nan, in item with pid: tools/gA7zFN, (current version: 49578)
flag property exists, value:  {'description':  {"length": "0"}}
The property: description, has value nan, in item with pid: tools/Q49CiV, (current version: 49579)
flag property exists, value:  {'description':  {"length": "0"}}
The property: description, has value nan, in item with pid: tools/sZbdSF, (current version: 49580)
flag property exists, value:  {'description':  {"length": "0"}}
The property: description, has value nan, in item with pid: tools/wIhZrL, (current version: 49581)
flag property exists, value:  {'description':  {"length": "0"}}
T

The property: description, has value nan, in item with pid: tools/48H2xs, (current version: 49647)
flag property exists, value:  {'description':  {"length": "0"}}
The property: description, has value nan, in item with pid: tools/LWRqHf, (current version: 62702)
flag property exists, value:  {'description':  {"length": "0"}}
The property: description, has value nan, in item with pid: tools/qfWaSc, (current version: 49649)
flag property exists, value:  {'description':  {"length": "0"}}
The property: description, has value nan, in item with pid: tools/46xJuM, (current version: 49650)
flag property exists, value:  {'description':  {"length": "0"}}
The property: description, has value nan, in item with pid: tools/359eAS, (current version: 49651)
flag property exists, value:  {'description':  {"length": "0"}}
The property: description, has value 2673.0, in item with pid: tools/KD7bWX, (current version: 49652)
flag property exists, value:  {'description':  {"length": "2673"}}
The property: de

The property: description, has value nan, in item with pid: tools/8rTzjv, (current version: 49722)
flag property exists, value:  {'description':  {"length": "0"}}
The property: description, has value nan, in item with pid: tools/srPMJU, (current version: 49723)
flag property exists, value:  {'description':  {"length": "0"}}
The property: description, has value nan, in item with pid: tools/S75qHH, (current version: 49724)
flag property exists, value:  {'description':  {"length": "0"}}
The property: description, has value nan, in item with pid: tools/n2LvQz, (current version: 49725)
flag property exists, value:  {'description':  {"length": "0"}}
The property: description, has value nan, in item with pid: tools/JJFBdM, (current version: 49726)
flag property exists, value:  {'description':  {"length": "0"}}
The property: description, has value nan, in item with pid: tools/QO3kr7, (current version: 49727)
flag property exists, value:  {'description':  {"length": "0"}}
The property: descript

The property: description, has value nan, in item with pid: tools/hw6a4s, (current version: 49788)
flag property exists, value:  {'description':  {"length": "0"}}
The property: description, has value nan, in item with pid: tools/4Olcmh, (current version: 49789)
flag property exists, value:  {'description':  {"length": "0"}}
The property: description, has value 1702.0, in item with pid: tools/Qeqmj9, (current version: 49790)
flag property exists, value:  {'description':  {"length": "1702"}}
The property: description, has value nan, in item with pid: tools/M3AICu, (current version: 49791)
flag property exists, value:  {'description':  {"length": "0"}}
