# Working with ibm-watson-studio-lib

The `ibm-watson-studio-lib` library provides access to data and connection assets in projects or spaces in Watson Studio. This notebook shows you how to use some of the functions provided by the library. 

This notebook is compatible with Cloud Pak for Data (CP4D) 4.0 and `ibm-watson-studio-lib` 3.0.4.

All necessary information to set up the `ibm-watson-studio-lib` is provided by the Watson Studio environment runtime in which you launched the notebook. Get started using the library:

In [1]:
from ibm_watson_studio_lib import access_project_or_space
wslib = access_project_or_space()

In this notebook, we will look at the following sample use cases that show how to work with ibm-watson-studio-lib:
* [Saving and loading a pandas DataFrame](#pandas)
* [Pickling data](#pickling)
* [Saving a downloaded dataset in your project](#register)
* [Using your custom Python scripts in a project](#download)
* [Browsing project assets](#assetBrowsing)


<a id="pandas"></a>
## Saving and loading a pandas DataFrame

Assume you have your data in a pandas DataFrame:

In [2]:
sample_data = {
    'col1': [1, 2],
    'col2': ['a', 'b']
}

import pandas as pd
df = pd.DataFrame(data=sample_data)
df

   col1 col2
0     1    a
1     2    b


### Saving a DataFrame

There are two ways to store a DataFrame:

a) Write the data to a memory buffer and save the buffer using `ibm-watson-studio-lib`:

In [3]:
from io import BytesIO

# write the dataframe to a buffer
buffer = BytesIO()
df.to_excel(buffer, sheet_name="Example")

# reset for subsequent reading
buffer.seek(0)

# save the data with ibm-watson-studio-lib
assetname="pandas-example.xlsx"
wslib.save_data(assetname, data=buffer.read(), overwrite=True)

{'name': 'pandas-example.xlsx',
 'asset_type': 'data_asset',
 'asset_id': 'c80ab08c-e59f-4757-a7ae-7006956c2dbf',
 'attachment_id': '9ee50584-6cf1-4377-bc5f-9e04c27a08b4',
 'filepath': 'pandas-example.xlsx',
 'data_size': None,
 'mime': 'application/binary',
 'summary': ['created or overwritten file',
  'created data asset',
  'created attachment']}
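
The `buffer.seek(0)` call in the cell above matters because writing leaves the stream position at the end of the buffer, so a `read()` right after writing returns nothing. The following is a minimal standard-library sketch of that behavior. It uses plain CSV-like bytes instead of an Excel file, and none of it is specific to `ibm-watson-studio-lib`.

```python
from io import BytesIO

buffer = BytesIO()
buffer.write(b"col1,col2\n1,a\n2,b\n")

# After writing, the stream position is at the end, so read() yields no data.
at_end = buffer.read()

# Rewinding to the start makes the full content readable again.
buffer.seek(0)
after_seek = buffer.read()

# getvalue() returns the whole content regardless of the current position.
whole = buffer.getvalue()

print(len(at_end), after_seek == whole)
```

If you prefer not to track the stream position, passing `buffer.getvalue()` instead of `buffer.read()` is one possible alternative, since both produce a bytes object.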

b) You may also write the data to an intermediate file in the file system of your notebook's runtime container and upload the file to your Watson Studio project:

In [4]:
# write the dataframe to an intermediate file
df.to_excel("pandas-intermediate.xlsx", sheet_name="Example")

# upload the intermediate file to your Watson Studio project
wslib.upload_file(file_path="pandas-intermediate.xlsx", asset_name="pandas-example2.xlsx")

{'name': 'pandas-example2.xlsx',
 'asset_type': 'data_asset',
 'asset_id': 'bf3be16d-2264-4366-8f2d-72d415f88b9e',
 'attachment_id': '643db2b9-e785-4535-880d-4e0f27253cf9',
 'filepath': 'pandas-intermediate.xlsx',
 'data_size': None,
 'mime': 'application/binary',
 'summary': ['created file in storage', 'created asset', 'created attachment'],
 'input_file_copied': True}

You can use `ibm-watson-studio-lib` to show all files in your project. They are referred to as *stored data assets*, in contrast to *connected data assets*, which represent data that has to be accessed through a connection.

In [5]:
wslib.list_stored_data()

[{'name': 'helpers.py',
  'description': '',
  'asset_id': '01785492-7d38-4315-b30d-f635a1cb3e9e',
  'asset_type': 'data_asset',
  'tags': []},
 {'name': 'pandas-example2.xlsx',
  'description': '',
  'asset_id': 'bf3be16d-2264-4366-8f2d-72d415f88b9e',
  'asset_type': 'data_asset',
  'tags': []},
 {'name': 'pandas-example.xlsx',
  'description': '',
  'asset_id': 'c80ab08c-e59f-4757-a7ae-7006956c2dbf',
  'asset_type': 'data_asset',
  'tags': []}]
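
Because the result is a plain list of dictionaries, it can be reshaped with ordinary Python. As an illustration only, the sketch below builds a name-to-ID lookup from entries copied from the output above. The variable names are hypothetical, and the sketch assumes asset names are unique within the project, which may not hold in general.

```python
# Entries copied from the list_stored_data() output shown above
# (only the fields needed for the lookup are kept).
stored = [
    {"name": "helpers.py", "asset_id": "01785492-7d38-4315-b30d-f635a1cb3e9e"},
    {"name": "pandas-example2.xlsx", "asset_id": "bf3be16d-2264-4366-8f2d-72d415f88b9e"},
    {"name": "pandas-example.xlsx", "asset_id": "c80ab08c-e59f-4757-a7ae-7006956c2dbf"},
]

# Map each asset name to its ID; a later duplicate name would overwrite an earlier one.
ids_by_name = {item["name"]: item["asset_id"] for item in stored}

print(ids_by_name["pandas-example.xlsx"])
```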

### Loading data into a DataFrame

Here are two ways to load data from your project into a pandas DataFrame:

a) When loading data from your project, `ibm-watson-studio-lib` returns a `BytesIO` buffer that can be passed to `read_excel` in pandas:

In [6]:
# load data
buffer = wslib.load_data("pandas-example.xlsx")

# pass the buffer to the read function
read_df = pd.read_excel(buffer)
read_df.head()

   Unnamed: 0  col1 col2
0           0     1    a
1           1     2    b


b) In Cloud Pak for Data, the project storage is mounted in the local file system of your runtime. So you can also load the data into a DataFrame by directly reading the file in the project storage:

In [7]:
# get the file path of the file in the mounted project storage
filepath = wslib.mount.get_data_path('pandas-example.xlsx')
read_df = pd.read_excel(filepath)
read_df.head()

   Unnamed: 0  col1 col2
0           0     1    a
1           1     2    b


<a id="pickling"></a>
## Pickling data

Assume you have a Python data structure that you want to store in your Watson Studio project:

In [8]:
class SampleClass:
    def __init__(self, x):
        self.x = x
        self.y = x * x

data = [SampleClass(1), SampleClass(2), SampleClass(3)]

`pickle.dumps` returns a bytes object that can be passed directly to the `save_data` function in `ibm-watson-studio-lib`:

In [9]:
import pickle
buffer = pickle.dumps(data)

# store the data in the project
wslib.save_data("pickle-example.data", buffer, overwrite=True)

{'name': 'pickle-example.data',
 'asset_type': 'data_asset',
 'asset_id': '36d6b340-7a22-4706-bb79-ae073bf08614',
 'attachment_id': 'd4fc0549-53d1-413f-8d93-e28a2a999ff4',
 'filepath': 'pickle-example.data',
 'data_size': None,
 'mime': 'application/binary',
 'summary': ['created or overwritten file',
  'created data asset',
  'created attachment']}

To read a pickled object, you can pass the `BytesIO` buffer returned by the `load_data` function in `ibm-watson-studio-lib` to `pickle.load`:

In [10]:
buffer = wslib.load_data("pickle-example.data")

data = pickle.load(buffer)
for elem in data:
    print("x:", elem.x, " y:", elem.y)

x: 1  y: 1
x: 2  y: 4
x: 3  y: 9
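
The difference between the bytes-based and the stream-based pickle functions can be tried without a project. In the sketch below, a `BytesIO` object stands in for the buffer that `load_data` returns, and plain dictionaries replace `SampleClass` so that the example does not depend on a class definition being available when unpickling. Both substitutions are assumptions made for illustration.

```python
import pickle
from io import BytesIO

records = [{"x": n, "y": n * n} for n in (1, 2, 3)]

# pickle.dumps produces a bytes object, the kind of value passed to save_data above.
payload = pickle.dumps(records)

# Wrapping the bytes in BytesIO stands in for the buffer returned by load_data.
stream = BytesIO(payload)

# pickle.load reads from a file-like object; pickle.loads takes the bytes directly.
restored = pickle.load(stream)

print(restored == records)
```

Note that unpickling instances of a custom class such as `SampleClass` requires the class definition to be available in the session that loads the data, and that pickled data should only be loaded from sources you trust.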


<a id="register"></a>
## Saving a downloaded dataset in your project

Assume you want to download a dataset from a data repository and store it in your project for later reuse. In this example, we will download the publicly available *Iris Data set* from the UCI Machine Learning Repository.

In [11]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

--2021-06-29 05:23:39--  https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4551 (4.4K) [application/x-httpd-php]
Saving to: ‘iris.data’


2021-06-29 05:23:40 (276 MB/s) - ‘iris.data’ saved [4551/4551]



`wget` downloads the file to the local file system of your notebook's runtime container.

There are two easy ways to store the file in your project:

a) The simplest way to store the file in your project is by using the `upload_file` function in `ibm-watson-studio-lib`:

In [12]:
wslib.upload_file("iris.data")

{'name': 'iris.data',
 'asset_type': 'data_asset',
 'asset_id': 'b09e4e50-ba5e-4cae-9e91-504175951226',
 'attachment_id': '668751c8-856e-4c0a-bcec-f8c55998dbb7',
 'filepath': 'iris.data',
 'data_size': None,
 'mime': 'application/binary',
 'summary': ['created file in storage', 'created asset', 'created attachment'],
 'input_file_copied': True}

`upload_file` reads the input file, stores the contents in the project storage, and creates the corresponding data asset. Because `upload_file` reads the whole input file into memory, it is not recommended for very large files.

b) As the project storage is mounted in your notebook runtime container in Cloud Pak for Data, there is also another way to store a file as a data asset in your project:

In [13]:
# get the path of your mounted data_asset folder
wslib.mount.get_base_dir()

'/project_data/data_asset/'

In [14]:
# move the file to the data_asset folder
!mv iris.data /project_data/data_asset/

The file is now contained in the project storage, but to make it available as a data asset and see it in the project's data assets list in the UI, you need to register the file as a data asset:

In [15]:
wslib.mount.register_asset("/project_data/data_asset/iris.data", asset_name="iris.csv")

{'name': 'iris.csv',
 'asset_type': 'data_asset',
 'asset_id': 'a443f4bb-9118-498d-a8d5-bbf3dbc3b111',
 'attachment_id': '630b26ae-a4e5-4962-be80-e9637621edeb',
 'filepath': 'iris.data',
 'data_size': None,
 'mime': 'text/csv',
 'summary': ['created asset', 'created attachment']}
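
The move step in approach b) can also be done from Python instead of a shell command. The following sketch uses temporary directories as a hypothetical stand-in for the folder returned by `wslib.mount.get_base_dir()`, so it runs outside Watson Studio. It covers only the file move; registering the asset would still be a separate step, as shown above.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as workdir:
    work = Path(workdir)

    # Hypothetical stand-in for the directory returned by wslib.mount.get_base_dir().
    base_dir = work / "data_asset"
    base_dir.mkdir()

    # Sample file playing the role of the downloaded iris.data.
    source = work / "iris.data"
    source.write_text("5.1,3.5,1.4,0.2,Iris-setosa\n")

    # shutil.move relocates the file on disk, like the `mv` command above.
    target = Path(shutil.move(str(source), str(base_dir / source.name)))

    moved_ok = target.exists() and not source.exists()
    content = target.read_text()

print(moved_ok)
```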

<a id="download"></a>
## Using your custom Python scripts in a project

Assume you have a custom Python script called `helpers.py` on your local machine and you want to use functions from this file in your notebook.

Step 1: Upload the file to your project using the **Find and add data** panel on the top right of your notebook.

Step 2: Download the file to the file system of your notebook's runtime container to make it available to your notebook:

In [16]:
# download the script
wslib.download_file("helpers.py")

# import your functions to use them in your notebook
from helpers import hello
hello()

Hello World!
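
The import works once the script is located in a directory that Python searches for modules, which in a notebook typically includes the current working directory. The sketch below reproduces this mechanism without a project: it writes a small module to a temporary directory, adds that directory to `sys.path`, and imports it. The module name `helpers_demo` and its contents are hypothetical, and the sketch assumes that `download_file` places the script in a directory that is already on the module search path.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as script_dir:
    # Hypothetical module, named to avoid clashing with a real helpers.py.
    module_path = Path(script_dir) / "helpers_demo.py"
    module_path.write_text('def hello():\n    return "Hello World!"\n')

    # Make the directory importable and let the import system notice the new file.
    sys.path.insert(0, script_dir)
    importlib.invalidate_caches()
    try:
        helpers_demo = importlib.import_module("helpers_demo")
        greeting = helpers_demo.hello()
    finally:
        # Undo the changes so the sketch leaves no trace in the session.
        sys.path.remove(script_dir)
        sys.modules.pop("helpers_demo", None)

print(greeting)
```

If you change the script and download it again in the same session, `importlib.reload` may be needed to pick up the new version, because Python caches imported modules.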


<a id="assetBrowsing"></a>
## Browsing project assets

The entrypoint `wslib.assets` offers functions to list assets and retrieve the metadata of a given asset.

For some use cases, you may want to list all assets of a certain file type or that match a specific name pattern so that you can process the assets in subsequent steps. `wslib.assets.list_assets` lists all assets for a given filter object. The `query` parameter passes a Lucene query 'as is' to the [Watson Data API](https://cloud.ibm.com/apidocs/watson-data-api-cpd#searchnewassetv2), whereas the `selector` parameter is used as a post-filter by `ibm-watson-studio-lib`.

The following sections show examples of listing assets and retrieving metadata.

#### Listing assets using the *query* parameter

In [17]:
# list all files of type '.xlsx' and name starting with 'pandas-example'
pandas_files = wslib.assets.list_assets("data_asset", query="asset.name:(pandas-example*.xlsx)")
wslib.show(pandas_files)

# 0
{
  "name": "pandas-example.xlsx",
  "description": "",
  "asset_id": "c80ab08c-e59f-4757-a7ae-7006956c2dbf",
  "asset_type": "data_asset",
  "tags": []
}
# 1
{
  "name": "pandas-example2.xlsx",
  "description": "",
  "asset_id": "bf3be16d-2264-4366-8f2d-72d415f88b9e",
  "asset_type": "data_asset",
  "tags": []
}


You can use the list items to access the file contents:

In [18]:
for file in pandas_files:
    df = pd.read_excel(wslib.load_data(file))
    print(df.head())

   Unnamed: 0  col1 col2
0           0     1    a
1           1     2    b
   Unnamed: 0  col1 col2
0           0     1    a
1           1     2    b
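
To get a feeling for which names a wildcard pattern like the one above selects, you can try the pattern locally. The sketch below applies shell-style matching from the standard library to the asset names seen earlier in this notebook. This is only a client-side analogue: Lucene query syntax and shell wildcards are not the same in general, so treat the result as an approximation of what the server-side query returns.

```python
from fnmatch import fnmatchcase

# Asset names taken from the list_stored_data() output earlier in this notebook.
names = ["helpers.py", "pandas-example2.xlsx", "pandas-example.xlsx"]

# Same wildcard pattern as in the query above; `*` also matches an empty string.
pattern = "pandas-example*.xlsx"

matches = [name for name in names if fnmatchcase(name, pattern)]
print(matches)
```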


#### Listing assets using the *selector* parameter

The `selector` parameter works as a post-filter after assets are retrieved using the Watson Data API. Let's look at the metadata of an existing asset to see how the `selector` parameter works. First, you may retrieve the full set of metadata for an existing asset to see an example of the metadata structure:

In [19]:
asset_metadata = wslib.assets.get_asset("pandas-example.xlsx", "data_asset", raw=True)
wslib.show(asset_metadata)

{
  "metadata": {
    "rov": {
      "mode": 0,
      "collaborator_ids": {}
    },
    "project_id": "04463443-3881-4e6d-84f3-f6e0c36eef68",
    "sandbox_id": "04463443-3881-4e6d-84f3-f6e0c36eef68",
    "usage": {
      "last_updated_at": "2021-06-29T05:23:16Z",
      "last_updater_id": "1000330999",
      "last_update_time": 1624944196335,
      "last_accessed_at": "2021-06-29T05:23:16Z",
      "last_access_time": 1624944196335,
      "last_accessor_id": "1000330999",
      "access_count": 0
    },
    "name": "pandas-example.xlsx",
    "description": "",
    "tags": [],
    "asset_type": "data_asset",
    "origin_country": "us",
    "resource_key": "pandas-example.xlsx",
    "rating": 0.0,
    "total_ratings": 0,
    "catalog_id": "f6f9d65e-c280-455b-970b-95d98a1715ca",
    "created": 1624944196053,
    "created_at": "2021-06-29T05:23:16Z",
    "owner_id": "1000330999",
    "size": 5428,
    "version": 2.0,
    "asset_state": "available",
    "asset_attributes": [
      "data_asset"

The `selector` parameter takes a function that works on the retrieved metadata:

In [20]:
# List all data assets that are larger than 5000 bytes.
# The selector function follows the metadata structure seen above.
sizeFilter = lambda asset: asset['metadata']['size'] > 5000
assets = wslib.assets.list_assets("data_asset", selector=sizeFilter, raw=True)
wslib.show(assets)

# 0
{
  "metadata": {
    "rov": {
      "mode": 0,
      "collaborator_ids": {}
    },
    "project_id": "04463443-3881-4e6d-84f3-f6e0c36eef68",
    "sandbox_id": "04463443-3881-4e6d-84f3-f6e0c36eef68",
    "usage": {
      "last_updated_at": "2021-06-29T05:23:16Z",
      "last_updater_id": "1000330999",
      "last_update_time": 1624944196335,
      "last_accessed_at": "2021-06-29T05:23:16Z",
      "last_access_time": 1624944196335,
      "last_accessor_id": "1000330999",
      "access_count": 0
    },
    "name": "pandas-example.xlsx",
    "description": "",
    "tags": [],
    "asset_type": "data_asset",
    "origin_country": "us",
    "resource_key": "pandas-example.xlsx",
    "rating": 0.0,
    "total_ratings": 0,
    "catalog_id": "f6f9d65e-c280-455b-970b-95d98a1715ca",
    "created": 0,
    "created_at": "2021-06-29T05:23:16Z",
    "owner_id": "1000330999",
    "size": 5428,
    "version": 0.0,
    "asset_state": "available",
    "asset_attributes": [
      "data_asset"
    ],
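
A selector is an ordinary function that receives one asset's metadata and returns a truthy value to keep it, so it can be developed and tested against sample dictionaries before it is passed to `list_assets`. The sketch below builds a reusable size selector. The helper name `larger_than` and two of the sample entries are hypothetical, and the defensive handling of a missing `size` field is an assumption, since the outputs above do not show whether every asset carries that field.

```python
def larger_than(min_bytes):
    """Build a selector that keeps assets whose metadata size exceeds min_bytes."""
    def selector(asset):
        # Use .get so that an asset without a size field is skipped, not an error.
        size = asset.get("metadata", {}).get("size")
        return size is not None and size > min_bytes
    return selector


sample_assets = [
    {"metadata": {"name": "pandas-example.xlsx", "size": 5428}},  # from the output above
    {"metadata": {"name": "small.txt", "size": 120}},             # hypothetical
    {"metadata": {"name": "no-size-field"}},                      # hypothetical
]

size_filter = larger_than(5000)
selected = [a["metadata"]["name"] for a in sample_assets if size_filter(a)]
print(selected)
```

A function built this way could be passed as the `selector` argument in place of the lambda shown above.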


#### Listing assets by a specific asset type

You can use `wslib.assets.list_assets` without a filter to retrieve all assets of a given asset type. Use `wslib.assets.list_asset_types` to get a list of all available asset types. Or you can use the generic asset type `asset` to retrieve all assets, including connections, notebooks and others.

In [21]:
# list all assets in your project
all_assets = wslib.assets.list_assets("asset")
wslib.show(all_assets)

# 0
{
  "name": "pandas-example.xlsx",
  "description": "",
  "asset_id": "c80ab08c-e59f-4757-a7ae-7006956c2dbf",
  "asset_type": "data_asset",
  "tags": []
}
# 1
{
  "name": "helpers.py",
  "description": "",
  "asset_id": "01785492-7d38-4315-b30d-f635a1cb3e9e",
  "asset_type": "data_asset",
  "tags": []
}
# 2
{
  "name": "Working with ibm-watson-studio-lib in CPD",
  "description": "Sample use cases for ibm-watson-studio-lib in Cloud Pak for Data",
  "asset_id": "67dc9ade-ebae-4d9f-b96f-47c316a75a3c",
  "asset_type": "notebook",
  "tags": [
    "notebook"
  ]
}
# 3
{
  "name": "pandas-example2.xlsx",
  "description": "",
  "asset_id": "bf3be16d-2264-4366-8f2d-72d415f88b9e",
  "asset_type": "data_asset",
  "tags": []
}
# 4
{
  "name": "pickle-example.data",
  "description": "",
  "asset_id": "36d6b340-7a22-4706-bb79-ae073bf08614",
  "asset_type": "data_asset",
  "tags": []
}
# 5
{
  "name": "iris.data",
  "description": "",
  "asset_id": "b09e4e50-ba5e-4cae-9e91-504175951226",
  "asse