# Notebook 3.2 - Curation-flag-description

This notebook analyses the `description` field of items in the SSH Open Marketplace and writes the results back to the system via two dedicated curation properties: `curation-flag-description` and `curation-detail`.

This notebook flags Marketplace items that have empty, too short or too long descriptions, helping Moderators identify curation priorities to improve data quality. 

This notebook is part of a series of 4 notebooks that inform the curation properties used in the SSH Open Marketplace Editorial Dashboard.

It is composed of 3 sections:

0. Requirements to run the notebook
1. Check the length of the text stored in the `description` field
2. Flag items with empty/short/long `description` in the Dataset

## 0. Requirements to run this notebook

This section sets up everything needed to interact with the Marketplace (MP) data.

### 0.1 Libraries
*The notebook relies on a number of external libraries.*

*In addition, a dedicated SSH Open Marketplace library, `sshmarketplacelib`, provides customised functions and can be imported with the usual Python import statements.*

*The imports needed to run this notebook are listed below.*

In [1]:
import numpy as np
import pandas as pd
import requests
# Import the SSH Open Marketplace library
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import eval as eva, helper as hel

### 0.2 Get the data

In [2]:
mpdata = mpd()
# Load each item category into its own DataFrame
df_tool_flat = mpdata.getMPItems("toolsandservices", True)
df_publication_flat = mpdata.getMPItems("publications", True)
df_trainingmaterials_flat = mpdata.getMPItems("trainingmaterials", True)
df_workflows_flat = mpdata.getMPItems("workflows", True)
df_datasets_flat = mpdata.getMPItems("datasets", True)

getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...


### 0.3 A look at the data

In [3]:
# Stack the per-category DataFrames into a single one
df_all_items = pd.concat([df_tool_flat, df_publication_flat, df_trainingmaterials_flat, df_workflows_flat, df_datasets_flat])
df_all_items.head()

| | id | category | label | persistentId | lastInfoUpdate | status | description | contributors | properties | externalIds | ... | thumbnail.info.mediaId | thumbnail.info.category | thumbnail.info.filename | thumbnail.info.mimeType | thumbnail.info.hasThumbnail | thumbnail.info.location.sourceUrl | thumbnail.caption | dateCreated | dateLastUpdated | composedOf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28230 | tool-or-service | 140kit | SIU1nO | 2021-11-23T17:24:25+0000 | approved | 140kit provides a management layer for tweet c... | [{'actor': {'id': 2224, 'name': 'Ian Pearce, D... | [{'type': {'code': 'mode-of-use', 'label': 'Mo... | [] | ... | | | | | | | | | | |
| 1 | 36324 | tool-or-service | 3DF Zephyr - photogrammetry software - 3d mode... | 4gDAHv | 2022-01-13T11:49:02+0000 | approved | 3DF Zephyr\[1\]\[2\] is a commercial photogram... | [] | [{'type': {'code': 'language', 'label': 'Langu... | [] | ... | | | | | | | | | | |
| 2 | 36552 | tool-or-service | 3DHOP | UcxOmD | 2022-01-13T11:50:31+0000 | approved | 3DHOP (3D Heritage Online Presenter) is an ope... | [] | [{'type': {'code': 'language', 'label': 'Langu... | [] | ... | | | | | | | | | | |
| 3 | 36555 | tool-or-service | 3DHOP: 3D Heritage Online Presenter | uFIMPQ | 2022-01-13T11:50:32+0000 | approved | No description provided. | [] | [] | [] | ... | | | | | | | | | | |
| 4 | 36189 | tool-or-service | 3DReshaper \| 3DReshaper | kAkzuz | 2022-01-13T11:47:44+0000 | approved | No description provided. | [] | [{'type': {'code': 'language', 'label': 'Langu... | [] | ... | | | | | | | | | | |


In [4]:
# Keep only the columns of interest for a first look at the data
df_all_items_work = df_all_items[['id', 'persistentId', 'category', 'label', 'description', 'contributors', 'accessibleAt', 'source.label']]
df_all_items_work.tail()

| | id | persistentId | category | label | description | contributors | accessibleAt | source.label |
|---|---|---|---|---|---|---|---|---|
| 303 | 12634 | l8gLBb | dataset | Yelp Academic Challenge Dataset | The Yelp dataset is a subset of our businesses... | [{'actor': {'id': 1752, 'name': 'Eva Bacas', '... | [https://www.yelp.com/dataset] | Humanities Data |
| 304 | 12631 | xvYQQ4 | dataset | YelpCHI | This dataset is collected from Yelp.com and fi... | [{'actor': {'id': 1752, 'name': 'Eva Bacas', '... | [http://odds.cs.stonybrook.edu/yelpchi-dataset/] | Humanities Data |
| 305 | 12632 | IdZGtV | dataset | YelpNYC | This dataset is collected from Yelp.com and fi... | [{'actor': {'id': 1752, 'name': 'Eva Bacas', '... | [http://odds.cs.stonybrook.edu/yelpnyc-dataset/] | Humanities Data |
| 306 | 12633 | OMny6U | dataset | YelpZIP | This dataset is collected from Yelp.com and fi... | [{'actor': {'id': 1752, 'name': 'Eva Bacas', '... | [http://odds.cs.stonybrook.edu/yelpzip-dataset/] | Humanities Data |
| 307 | 12589 | YnEaU0 | dataset | "You Are Where You Tweet: A Content-Based Appr... | This dataset is a collection of scraped public... | [{'actor': {'id': 1752, 'name': 'Eva Bacas', '... | [https://archive.org/details/twitter_cikm_2010] | Humanities Data |


## 1. Check the length of the text stored in the `description` field

By default, and in line with the rules set out in the Editorial Guidelines [add link], the minimum description length is set to 25 characters and the maximum to 1500. Both parameters can be modified in the cell below.

An overview of item description lengths is also given in notebook *2.SSH Open Marketplace data quality overview*.

In [5]:
# Thresholds (in characters) below/above which a description is flagged
minchars = 25
maxchars = 1500

utils = hel.Util()

# Note: only tools and services are analysed here.
# Replace the placeholder used for missing descriptions (e.g. "No description provided.") with NaN
df_items = df_tool_flat.replace(utils.empty_description, np.nan)

# Descriptions within the accepted length range
df_items_d = df_items[(df_items['description'].notnull()) & (df_items['description'].str.len() >= minchars) & (df_items['description'].str.len() <= maxchars)]

# Descriptions that are too short
df_items_min = df_items[(df_items['description'].notnull()) & (df_items['description'].str.len() < minchars)]

# Descriptions that are too long
df_items_max = df_items[(df_items['description'].notnull()) & (df_items['description'].str.len() > maxchars)]

print(f"\nThere are {df_items['description'].isna().sum()} items with empty descriptions, " +
      f" {df_items_min['description'].count()} items with description shorter than {minchars} characters," +
      f" {df_items_max['description'].count()} Items with description longer than {maxchars} characters, " +
      f" {df_items_d['description'].count()} Items with description between {minchars} and {maxchars} characters.")
       


There are 55 items with empty descriptions,  2 items with description shorter than 25 characters, 29 Items with description longer than 1500 characters,  1609 Items with description between 25 and 1500 characters.
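
The same thresholds can also be applied in a single vectorised pass that labels every description. The following is a minimal sketch on toy data, not the notebook's or `sshmarketplacelib`'s implementation: the function name `classify_descriptions`, the label strings and the hard-coded placeholder text are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: this placeholder text marks a missing description
EMPTY_PLACEHOLDER = "No description provided."


def classify_descriptions(descriptions, minchars=25, maxchars=1500):
    """Label each description as 'empty', 'short', 'long' or 'ok'."""
    # Turn the placeholder into a missing value
    cleaned = descriptions.mask(descriptions == EMPTY_PLACEHOLDER)
    # Missing descriptions get a length of NaN
    lengths = cleaned.str.len().astype("float64")
    # Conditions are checked in order; the first match wins
    conditions = [lengths.isna(), lengths < minchars, lengths > maxchars]
    labels = ["empty", "short", "long"]
    return pd.Series(
        np.select(conditions, labels, default="ok"), index=descriptions.index
    )


# Toy data covering the missing, placeholder, short, boundary and long cases
toy = pd.Series(
    [None, "No description provided.", "Too short", "x" * 25, "y" * 1500, "z" * 1501],
    dtype="object",
)
status = classify_descriptions(toy)
print(status.tolist())
```

As in the cell above, descriptions of exactly `minchars` or `maxchars` characters count as acceptable. Note that an empty string has length 0 and would be labelled `short`, not `empty`, in this sketch.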


In [6]:
df_items_null = df_items[(df_items['description'].isnull())]

In [7]:
# Combine the too-short, too-long and empty descriptions into one DataFrame
nv_description_df = pd.concat([df_items_min, df_items_max, df_items_null])
# Work on an explicit copy so that adding columns does not trigger pandas' SettingWithCopyWarning
tempdf = nv_description_df[['persistentId', 'label', 'description']].copy()
tempdf['property'] = 'description'
tempdf['value'] = tempdf['description'].str.len()


In [8]:
tempdf.to_pickle('data/descrlength.pickle')

In [9]:
tempdf.head()

| | persistentId | label | description | property | value |
|---|---|---|---|---|---|
| 194 | o2X2Sl | Chartle | Create simple charts.\n | description | 22.0 |
| 917 | 5jMQ6a | Offline Getting Started and Manual (pdf) | Last Update: 14.12.2017 | description | 23.0 |
| 84 | S9m9UW | Artifex Press | Artifex Press is a publishing and technology c... | description | 3256.0 |
| 87 | EdUOYX | Asset Bank | A digital asset management system for the stor... | description | 1839.0 |
| 114 | WnrT3t | Bamboo Person Service | The Bamboo Person Service can help scholars ac... | description | 1710.0 |
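
As a quick sanity check on a frame shaped like `tempdf`, one can verify that every flagged row really falls outside the accepted range. The sketch below rebuilds a few rows by hand from the output shown in this notebook. The helper name `all_out_of_range` is hypothetical and not part of `sshmarketplacelib`.

```python
import numpy as np
import pandas as pd


def all_out_of_range(frame, minchars=25, maxchars=1500):
    """Return True if every row has a missing, too short or too long description."""
    lengths = frame["value"]
    flagged = lengths.isna() | (lengths < minchars) | (lengths > maxchars)
    return bool(flagged.all())


# Hand-copied sample: five lengths from the table, plus one item with an empty description
sample = pd.DataFrame(
    {
        "persistentId": ["o2X2Sl", "5jMQ6a", "S9m9UW", "EdUOYX", "WnrT3t", "uFIMPQ"],
        "property": "description",
        "value": [22.0, 23.0, 3256.0, 1839.0, 1710.0, np.nan],
    }
)
print(all_out_of_range(sample))
```

A result of `False` would indicate that a row with an acceptable description length slipped into the frame, for example after changing `minchars` or `maxchars` without re-running the earlier cells.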


## 2. Flag items with empty/short/long `description` in the Dataset

In [10]:
curation_flag_property={"code": "curation-flag-description"}
curation_detail_property={"code": "curation-detail"}
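
To make the role of the two properties more concrete, the sketch below appends a flag entry and a detail entry to an item's `properties` list in plain Python, without contacting the Marketplace. It is an illustration only and does not reproduce what `setPropertyFlags` does internally. The `type`/`code` nesting mirrors the `properties` column shown earlier, while the `value` key, the flag value `"TRUE"`, the detail wording and the helper name `with_curation_flags` are all assumptions.

```python
import copy
import math


def with_curation_flags(item, length, flag_property, detail_property):
    """Return a copy of `item` with hypothetical flag and detail properties appended."""
    updated = copy.deepcopy(item)  # leave the caller's item untouched (dry run)
    if length is None or math.isnan(length):
        detail = "description is empty"
    else:
        detail = f"description has {int(length)} characters"
    # Assumed payload shape: {"type": {"code": ...}, "value": ...}
    updated.setdefault("properties", []).extend(
        [
            {"type": flag_property, "value": "TRUE"},
            {"type": detail_property, "value": detail},
        ]
    )
    return updated


curation_flag_property = {"code": "curation-flag-description"}
curation_detail_property = {"code": "curation-detail"}

# Toy item; only the fields needed for the illustration are included
item = {"persistentId": "S9m9UW", "properties": []}
flagged = with_curation_flags(item, 3256.0, curation_flag_property, curation_detail_property)
print(flagged["properties"])
```

Working on a deep copy keeps the sketch side-effect free, in the same spirit as the debug mode reported in the log output of this notebook.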

In [11]:
# categories="toolsandservices";"publications";"trainingmaterials";"workflows";"datasets"
# res=mpdata.setPropertyStatusFlags(tempdf, categories, curation_flag_property, curation_detail_property)
res_des=mpdata.setPropertyFlags(tempdf, curation_flag_property, curation_detail_property)

Creating log file...
The property: description, has value nan, in item with pid: uFIMPQ, (current version: 36555)
append curation_property_value
append curation_detail_value

Running in debug mode, Marketplace dataset not updated.
The property: description, has value nan, in item with pid: kAkzuz, (current version: 36189)
append curation_property_value
append curation_detail_value

Running in debug mode, Marketplace dataset not updated.
The property: description, has value 3256.0, in item with pid: S9m9UW, (current version: 27957)
append curation_property_value
append curation_detail_value

Running in debug mode, Marketplace dataset not updated.
The property: description, has value 1839.0, in item with pid: EdUOYX, (current version: 27701)
append curation_property_value
append curation_detail_value

Running in debug mode, Marketplace dataset not updated.
The property: description, has value nan, in item with pid: bhZPRG, (current version: 35694)
append curation_property_value
append curation_detail_value

[...]

The property: description, has value 2347.0, in item with pid: OJcHh4, (current version: 40455)
append curation_property_value
append curation_detail_value

Running in debug mode, Marketplace dataset not updated.
The property: description, has value nan, in item with pid: nwihVQ, (current version: 35871)
append curation_property_value
append curation_detail_value

Running in debug mode, Marketplace dataset not updated.
The property: description, has value nan, in item with pid: 7ckBdR, (current version: 35727)
append curation_property_value
append curation_detail_value

Running in debug mode, Marketplace dataset not updated.
The property: description, has value nan, in item with pid: 4n69Z0, (current version: 36123)
append curation_property_value
append curation_detail_value

Running in debug mode, Marketplace dataset not updated.
The property: description, has value nan, in item with pid: sXaPdJ, (current version: 36990)
append curation_property_value
append curation_detail_value

Running in debug mode, Marketplace dataset not updated.

[...]

The property: description, has value nan, in item with pid: HLlsDS, (current version: 36825)
append curation_property_value
append curation_detail_value

Running in debug mode, Marketplace dataset not updated.
The property: description, has value nan, in item with pid: MmB0aX, (current version: 36573)
append curation_property_value
append curation_detail_value

Running in debug mode, Marketplace dataset not updated.
The property: description, has value nan, in item with pid: xAW2Dh, (current version: 36600)
append curation_property_value
append curation_detail_value

Running in debug mode, Marketplace dataset not updated.
The property: description, has value nan, in item with pid: Vckyp6, (current version: 35718)
append curation_property_value
append curation_detail_value

Running in debug mode, Marketplace dataset not updated.
The property: description, has value nan, in item with pid: rFuWdh, (current version: 36519)
append curation_property_value
append curation_detail_value

Runnin