# Challenge 2C - Reuse the SSH Open Marketplace data using the Application Programming Interface (API)

This notebook provides examples of using the Python API to access the SSH Open Marketplace dataset.



#### Libraries needed to run the notebook.

In [1]:
import pandas as pd  # to manage dataframes
from sshmarketplacelib import MPData as mpd
from sshmarketplacelib import eval as eva, helper as hel

mpdata = mpd()

#### The *getMPItems* method downloads data from the SSH Open Marketplace (MP) dataset.



In [2]:
help(mpdata.getMPItems)

Help on method getMPItems in module sshmarketplacelib.mpdata:

getMPItems(itemscategory, localrepository=False, pages=0) method of sshmarketplacelib.mpdata.MPData instance
    Loads data from MP dataset. This method creates a dataframe, stores it in a local repository and returns it to the caller.
    
    Parameters:
    -----------
    itemscategory : str
        The category of items, possible values are: "toolsandservices", "publications", "trainingmaterials", "workflows", "dataset"
    localrepository : boolean, optional
    pages : int, optional
        The number of pages, default is all
    
    Returns:
    --------
    DataFrame: Returning value
    
    If the localrepository parameter is 'FALSE' or is not specified, the items are downloaded from the MP dataset, 
    if the localrepository parameter is 'TRUE' the items are first searched in the local repository and 
    if they are not present they are downloaded from the remote MP dataset. 
    When the items are downloaded
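
The `localrepository` behaviour described in the help text is a cache-or-download pattern. The block below is a minimal sketch of that pattern and not the library's actual code: `load_items`, `fake_download`, and the JSON file layout are hypothetical names and choices made only for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_items(category, cache_dir, download, use_local=False):
    """Hypothetical helper: return (items, origin) for one category.

    With use_local=True a cached copy is reused when it exists;
    otherwise the items are downloaded and the cache is refreshed.
    """
    cache_file = Path(cache_dir) / f"{category}.json"
    if use_local and cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8")), "local"
    items = download(category)
    cache_file.write_text(json.dumps(items), encoding="utf-8")
    return items, "remote"


def fake_download(category):
    # Stand-in for the remote call; the record below is invented sample data.
    return [{"id": 1, "category": category, "label": "Example item"}]


with tempfile.TemporaryDirectory() as tmp:
    # Nothing is cached yet, so this call falls back to the download.
    first, first_origin = load_items("workflows", tmp, fake_download, use_local=True)
    # The second call finds the file written by the first one.
    second, second_origin = load_items("workflows", tmp, fake_download, use_local=True)
    # use_local=False always downloads, even though a cached copy exists.
    third, third_origin = load_items("workflows", tmp, fake_download, use_local=False)
```

In this sketch a call with `use_local=True` falls back to the download when nothing is cached, which mirrors the "first searched in the local repository" wording of the help text.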

#### Download the items of every category

In [3]:
# False: always download the items from the remote MP dataset.
df_tool_flat = mpdata.getMPItems("toolsandservices", False)
# True: look in the local repository first and download only if nothing is found.
df_publication_flat = mpdata.getMPItems("publications", True)
df_trainingmaterials_flat = mpdata.getMPItems("trainingmaterials", True)
df_workflows_flat = mpdata.getMPItems("workflows", True)
df_datasets_flat = mpdata.getMPItems("datasets", True)

getting data from local repository...
getting data from local repository...
getting data from local repository...
getting data from local repository...


#### Show data frame info

In [4]:
df_tool_flat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1732 entries, 0 to 1731
Data columns (total 33 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   id                                       1732 non-null   int64  
 1   category                                 1732 non-null   object 
 2   label                                    1732 non-null   object 
 3   persistentId                             1732 non-null   object 
 4   lastInfoUpdate                           1732 non-null   object 
 5   status                                   1732 non-null   object 
 6   description                              1732 non-null   object 
 7   contributors                             1732 non-null   object 
 8   properties                               1732 non-null   object 
 9   externalIds                              1732 non-null   object 
 10  accessibleAt                             1732 no

In [5]:
df_tool_flat[['label', 'category', 'description', 'accessibleAt', 'source.label']].tail()

| | label | category | description | accessibleAt | source.label |
|---|---|---|---|---|---|
| 1727 | Zoomify | tool-or-service | A suite of products developed for web designer... | [] | TAPoR |
| 1728 | Zotero | tool-or-service | Zotero is a free tool that collects, manages a... | [https://www.zotero.org/] | TAPoR |
| 1729 | ZotFile | tool-or-service | Zotero add-on for Firefox to manage your attac... | [http://zotfile.com/] | TAPoR |
| 1730 | Zotpress | tool-or-service | Zotpress is a WordPress plugin that enables us... | [https://wordpress.org/plugins/zotpress/] | TAPoR |
| 1731 | Zubrag HTML Tags Stripper | tool-or-service | This tool simply takes HTML and returns it wit... | [http://www.zubrag.com/tools/html-tags-strippe... | TAPoR |
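
The `accessibleAt` column in the output above holds a list of URLs per item, and the list can be empty. The following sketch shows one way to separate items with and without an access URL. It rebuilds four of the rows shown above by hand and assumes the cells are Python lists; after a reload from the local repository they might be strings instead, in which case they would need parsing first.

```python
import pandas as pd

# Hand-built sample mirroring rows from the output above (not the full dataset).
tools = pd.DataFrame(
    {
        "label": ["Zoomify", "Zotero", "ZotFile", "Zotpress"],
        "accessibleAt": [
            [],
            ["https://www.zotero.org/"],
            ["http://zotfile.com/"],
            ["https://wordpress.org/plugins/zotpress/"],
        ],
        "source.label": ["TAPoR"] * 4,
    }
)

# Boolean mask: True where the URL list has at least one entry.
has_url = tools["accessibleAt"].map(len) > 0

# Items without any access URL, candidates for curation.
missing_url = tools[~has_url]

# Items with a URL, plus a column holding the first one.
tools_with_url = tools[has_url].copy()
tools_with_url["first_url"] = tools_with_url["accessibleAt"].map(lambda urls: urls[0])
```

With the real data, `tools` would be replaced by `df_tool_flat`.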


#### Get properties

In [6]:
utils = hel.Util()
df_tool_properties = utils.getProperties(df_tool_flat)

In [7]:
df_tool_properties.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16280 entries, 0 to 16279
Data columns (total 27 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   id                            16280 non-null  int64 
 1   category                      16280 non-null  object
 2   label                         16280 non-null  object
 3   persistentId                  16280 non-null  object
 4   accessibleAt                  16280 non-null  object
 5   description                   16280 non-null  object
 6   relatedItems                  16280 non-null  object
 7   media                         16280 non-null  object
 8   source.label                  15563 non-null  object
 9   source.url                    15563 non-null  object
 10  type.code                     16280 non-null  object
 11  type.label                    16280 non-null  object
 12  type.type                     16280 non-null  object
 13  type.groupName  

In [8]:
df_tool_properties[['category', 'label', 'source.label', 'type.label', 'concept.label', 'value']].tail(10)

| | category | label | source.label | type.label | concept.label | value |
|---|---|---|---|---|---|---|
| 16270 | tool-or-service | Zotpress | TAPoR | Activity | Disseminating | |
| 16271 | tool-or-service | Zotpress | TAPoR | Keyword | web publishing | |
| 16272 | tool-or-service | Zotpress | TAPoR | Keyword | citations | |
| 16273 | tool-or-service | Zotpress | TAPoR | Keyword | bibliography | |
| 16274 | tool-or-service | Zubrag HTML Tags Stripper | TAPoR | Terms Of Use | | Free |
| 16275 | tool-or-service | Zubrag HTML Tags Stripper | TAPoR | Mode of use | Web application | |
| 16276 | tool-or-service | Zubrag HTML Tags Stripper | TAPoR | Activity | Enriching | |
| 16277 | tool-or-service | Zubrag HTML Tags Stripper | TAPoR | Keyword | Enrichment | |
| 16278 | tool-or-service | Zubrag HTML Tags Stripper | TAPoR | Activity | Data Cleansing | |
| 16279 | tool-or-service | Zubrag HTML Tags Stripper | TAPoR | Deprecated at Source | | TRUE |
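
In the properties frame each row is one property of one item, and the property content sits either in `concept.label` or in `value`. The sketch below, built on a hand-copied sample of the rows above, merges the two columns into one and collects the keywords of each item. It assumes that empty cells are missing values (`None`/`NaN`); if the real frame stores empty strings, they would have to be converted first. The column name `property_text` is introduced here only for illustration.

```python
import pandas as pd

# Hand-built sample mirroring the rows shown above.
props = pd.DataFrame(
    [
        ("Zotpress", "Activity", "Disseminating", None),
        ("Zotpress", "Keyword", "web publishing", None),
        ("Zotpress", "Keyword", "citations", None),
        ("Zotpress", "Keyword", "bibliography", None),
        ("Zubrag HTML Tags Stripper", "Terms Of Use", None, "Free"),
        ("Zubrag HTML Tags Stripper", "Mode of use", "Web application", None),
        ("Zubrag HTML Tags Stripper", "Activity", "Enriching", None),
        ("Zubrag HTML Tags Stripper", "Keyword", "Enrichment", None),
        ("Zubrag HTML Tags Stripper", "Activity", "Data Cleansing", None),
        ("Zubrag HTML Tags Stripper", "Deprecated at Source", None, "TRUE"),
    ],
    columns=["label", "type.label", "concept.label", "value"],
)

# One text column per property: the concept label when present, else the raw value.
props["property_text"] = props["concept.label"].fillna(props["value"])

# Keep only keyword rows and gather them into one list per item.
keywords = props[props["type.label"] == "Keyword"]
keywords_per_item = keywords.groupby("label")["property_text"].apply(list)
```

With the real data, `props` would be replaced by `df_tool_properties`.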


#### _Helper_ functions: get the number of items provided by every source

In [9]:
utils = hel.Util()
df_items_by_sources = utils.getAllItemsBySources()
df_items_by_sources

dblp: DH Publications             2810
TAPoR                             1337
SSK Zotero Resources               355
Humanities Data                    302
The Programming Historian          169
NA                                 125
DARIAH-CAMPUS                       80
Language Resource Switchboard       51
DARIAH contribution tool            35
EOSC Catalogue                      30
Standardization Survival Kit        25
SSHopencloud Service Catalogue      23
Name: source.label, dtype: int64
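
The result above is a pandas Series of counts indexed by `source.label`, with an `NA` entry for items that have no source. The helper's internals are not shown in this notebook; the sketch below only illustrates how a comparable count could be produced with plain pandas on invented sample data.

```python
import pandas as pd

# Invented sample: six items, two of them without a source.
items = pd.DataFrame(
    {"source.label": ["TAPoR", "TAPoR", None, "DARIAH-CAMPUS", "TAPoR", None]}
)

# Label missing sources as "NA", then count items per source (descending).
items_by_source = items["source.label"].fillna("NA").value_counts()
```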

#### Helper functions: get the number of items per category provided by every source

In [10]:
df_categories_bysources=utils.getCategoriesBySources()
df_categories_bysources

| Categories | DARIAH contribution tool | DARIAH-CAMPUS | EOSC Catalogue | Humanities Data | Language Resource Switchboard | SSHopencloud Service Catalogue | SSK Zotero Resources | Standardization Survival Kit | TAPoR | The Programming Historian | dblp: DH Publications |
|---|---|---|---|---|---|---|---|---|---|---|---|
| dataset | 3.0 | 0.0 | 0.0 | 302.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| publication | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 136.0 | 0.0 | 0.0 | 0.0 | 2810.0 |
| tool-or-service | 31.0 | 0.0 | 30.0 | 0.0 | 51.0 | 18.0 | 156.0 | 0.0 | 1337.0 | 0.0 | 0.0 |
| training-material | 1.0 | 80.0 | 0.0 | 0.0 | 0.0 | 2.0 | 63.0 | 0.0 | 0.0 | 169.0 | 0.0 |
| workflow | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 |
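
A category-by-source table of this shape is a cross-tabulation. As a rough illustration on invented sample rows, and without claiming this is how `getCategoriesBySources` is implemented, `pandas.crosstab` gives a similar layout; it returns integer counts, whereas the helper's output above shows floats.

```python
import pandas as pd

# Invented sample rows using category and source names that appear above.
items = pd.DataFrame(
    {
        "category": [
            "tool-or-service",
            "tool-or-service",
            "publication",
            "training-material",
            "tool-or-service",
        ],
        "source.label": [
            "TAPoR",
            "TAPoR",
            "dblp: DH Publications",
            "DARIAH-CAMPUS",
            "EOSC Catalogue",
        ],
    }
)

# Rows are categories, columns are sources, cells are item counts (0 when absent).
categories_by_source = pd.crosstab(items["category"], items["source.label"])
```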


#### More examples are available in the [LibTest notebook](https://gitlab.gwdg.de/sshoc/marketplace-curation/-/blob/main/LibTest.ipynb).