# W&B Artifacts API Demo

In this notebook, we'll show you how to use W&B Artifacts to keep track of dataset versions in Weights & Biases.

### How it works
Using our Artifacts API, you can log artifacts as outputs of W&B runs, or use artifacts as input to runs.

 ![](https://gblobscdn.gitbook.com/assets%2F-Lqya5RvLedGEWPhtkjU%2F-M94QAXA-oJmE6q07_iT%2F-M94QJCXLeePzH1p_fW1%2Fsimple%20artifact%20diagram%202.png?alt=media&token=94bc438a-bd3b-414d-a4e4-aa4f6f359f21)

Since a run can use another run’s output artifact as input, artifacts and runs together form a directed graph. You don’t need to define pipelines ahead of time. Just use and log artifacts, and we’ll stitch everything together.
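To make the graph idea concrete, here is a minimal sketch in plain Python. It does not call the W&B API: the run names, artifact names, and the `build_graph` and `downstream` helpers are all made up for illustration. It only shows how "run logs artifact" and "run uses artifact" edges add up to a directed graph you can walk.

```python
from collections import defaultdict

# Hypothetical records: each run lists the artifacts it used and the ones it logged.
runs = {
    "preprocess": {"inputs": [], "outputs": ["raw-dataset:v0"]},
    "train": {"inputs": ["raw-dataset:v0"], "outputs": ["model:v0"]},
    "evaluate": {"inputs": ["model:v0", "raw-dataset:v0"], "outputs": ["report:v0"]},
}

def build_graph(runs):
    """Return adjacency lists with artifact -> run (used by) and run -> artifact (logged) edges."""
    edges = defaultdict(list)
    for run_name, io in runs.items():
        for artifact in io["inputs"]:
            edges[artifact].append(run_name)
        for artifact in io["outputs"]:
            edges[run_name].append(artifact)
    return edges

def downstream(edges, start):
    """Collect every run and artifact reachable from `start` with a depth-first search."""
    seen, stack = [], [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.append(nxt)
                stack.append(nxt)
    return seen

graph = build_graph(runs)
# Everything that depends, directly or indirectly, on the first dataset version.
print(sorted(downstream(graph, "raw-dataset:v0")))
```

Nothing here declares a pipeline up front. The graph falls out of the per-run input and output lists, which is the same information you give W&B when you use and log artifacts.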

In [1]:
import os
import random
import wandb

## 1. Initialize a run

Initialize a wandb run by calling `wandb.init`. Use a run to track any script in your pipeline, anything from training and evaluation to scraping and preprocessing data. Specify what type of run it is in the **job_type** argument, and you'll be able to filter and group by **job_type** in the web interface.

In [2]:
# This will create a new run in the W&B database, and start tracking stdout/stderr and system metrics automatically.
run = wandb.init(project='artifacts-demo', job_type='producer')

Failed to query for notebook name, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable
wandb: Currently logged in as: jrose (use `wandb login --relogin` to force relogin)
wandb: wandb version 0.10.5 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade


## 2. Create an artifact
Now, let's create our first artifact, add a file to it, and save it as output of our run.

In [3]:
# Create an Artifact. Give it a name and a type. Type is used for
# organizational purposes and should typically be "dataset" or "model".
artifact = wandb.Artifact('hello-dataset', type='dataset', metadata={})

# Store a new file in the artifact, and write something into its contents.
with artifact.new_file('hello.txt') as f:
    f.write('my first artifact!\n')

# Save the artifact to W&B. It will be tracked as output of the current run
# and appended to the Artifact Sequence called 'hello-dataset'.
run.log_artifact(artifact)

# End the current run (useful in notebooks)
wandb.join()

wandb: Adding directory to artifact (/tmp/tmp7w61sx9s)... Done. 0.1s


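The artifact above was created with an empty `metadata={}`. As a rough sketch of what you might put there, the snippet below builds a small summary dict for a file. `describe_file` is a hypothetical helper, not part of wandb, and the fields it records are just one possible choice.

```python
import hashlib
import os
import tempfile

def describe_file(path):
    """Summarize a file as a plain dict that could serve as artifact metadata."""
    with open(path, "rb") as f:
        payload = f.read()
    return {
        "size_bytes": len(payload),
        "num_lines": payload.count(b"\n"),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }

# Demo on a temporary file with the same contents as hello.txt above.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hello.txt")
    with open(path, "wb") as f:
        f.write(b"my first artifact!\n")
    metadata = describe_file(path)

print(metadata["size_bytes"], metadata["num_lines"])

# The dict could then be passed when creating the artifact, for example:
# wandb.Artifact('hello-dataset', type='dataset', metadata=metadata)
```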

In [1]:
from pyleaves.pipelines.WandB_Leaves_vs_PNAS import *






In [5]:
TARGET_SIZE = (768,768)
BSZ = 12
RANDOM_STATE = 237
VALIDATION_SPLIT = 0.1

from pyleaves.utils import set_tf_config
set_tf_config(gpu_num=None, num_gpus=1)

import tensorflow as tf
from tensorflow.keras import backend as K
K.clear_session()
from pyleaves.utils.pipeline_utils import build_model
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping
from pyleaves.utils import pipeline_utils
from tensorflow.keras.applications.resnet_v2 import preprocess_input
from pyleaves.utils.WandB_artifact_utils import load_Leaves_Minus_PNAS_dataset, load_Leaves_Minus_PNAS_test_dataset

train_df, test_df, pnas_train_df = load_Leaves_Minus_PNAS_test_dataset()
train_df, val_df = train_test_split(train_df, test_size=VALIDATION_SPLIT, random_state=RANDOM_STATE, shuffle=True, stratify=train_df.family)

train_data_info = data_df_2_tf_data(train_df,
                                    x_col='archive_path',
                                    y_col='family',
                                    training=True,
                                    preprocess_input=preprocess_input,
                                    seed=RANDOM_STATE,
                                    target_size=TARGET_SIZE,
                                    batch_size=BSZ,
                                    augmentations={'flip':1.0,'rotate':1.0},
                                    num_parallel_calls=-1,
                                    cache=False,
                                    shuffle_first=True,
                                    fit_class_weights=True)



setGPU: Setting GPU to: [6]
Initial visible GPUs: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
visible GPUs: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Successfully set memory_growth=True and limited GPUs visible to tensorflow.

Now using GPU(s):
['/physical_device:GPU:0']


./artifacts/Leaves-PNAS_test:v2



target_size =  (768, 768)


In [13]:
data = train_data_info['data_table']
class_encodings = train_data_info['encoder']
y_col='family'
x_col='archive_path'
data = data.assign(y_true=data[y_col].apply(lambda x: class_encodings[x]),
                   x_true=data[x_col])

In [19]:
data

from more_itertools import unzip

In [24]:
train_data = train_data_info['data']

train_iter = list(unzip([batch for batch in iter(train_data.take(3).unbatch())]))
train_iter

[<map at 0x7fadf9e8ab90>, <map at 0x7fadf9e8abd0>]
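The output above shows two `map` objects instead of data because `unzip` returns lazy iterables that have to be consumed before you can see their contents. Here is a small sketch of the same transposition using only built-ins. The `pairs` list is stand-in data, not the real `train_data` batches.

```python
# Stand-in (image, label) pairs; in the notebook these would come from train_data.
pairs = [("img0", 0), ("img1", 1), ("img2", 0)]

# zip(*pairs) transposes the pairs eagerly into one tuple of images and one of labels.
images, labels = zip(*pairs)

# A lazy version behaves like the map objects printed above: nothing is
# computed until the iterator is consumed, for example with list().
lazy_images = map(lambda pair: pair[0], pairs)
materialized = list(lazy_images)

print(images, labels)
print(materialized)
```

For large image tensors, the lazy form avoids holding everything in memory at once. The eager form is easier to inspect in a notebook.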

In [76]:
a = lambda: ((img,label) for img, label in iter(train_data.take(3).unbatch()))

In [29]:
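# `a` is a zero-argument function that returns a generator, so call it and take the first pair.
first_img, first_label = next(a())
print(first_img.shape, first_label.shape)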

(768, 768, 3) (253,)


In [64]:
a

<function __main__.<lambda>()>

In [78]:
import numpy as np


bb = np.stack([img for img, _ in a()])
bb.shape

(36, 768, 768, 3)

In [79]:
bb = np.stack([lbl for _, lbl in a()])
bb.shape

(36, 253)

In [80]:
bb

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

In [69]:
list(a())[1]

1

In [17]:
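from wandb.keras import WandbCallback

wandb.init()
# WandbCallback expects training_data as an (X, y) tuple, so passing a tf.data.Dataset raises the ValueError below.
WandbCallback(training_data=train_data.take(20))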




ValueError: training data must be a tuple of length two

In [56]:
a = lambda : (i for i in (0,1,2,3,4))

In [57]:
b = list(iter(a()))

In [58]:
b

[0, 1, 2, 3, 4]

In [59]:
a

<function __main__.<lambda>()>

In [61]:
c = list(a())

In [62]:
c

[0, 1, 2, 3, 4]
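The cells above wrap a generator expression in a `lambda` so that every call produces a fresh generator. This sketch spells out why that matters: a generator can only be iterated once. The name `make_numbers` is just for illustration.

```python
def make_numbers():
    """Return a fresh generator each time it is called."""
    return (i for i in (0, 1, 2, 3, 4))

gen = make_numbers()
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing left: generators are single-use

fresh_pass = list(make_numbers())  # calling the factory again starts over
print(first_pass, second_pass, fresh_pass)
```

The same reasoning applies to `a` in the earlier cells: calling `a()` twice lets the images and the labels be stacked in two separate passes over the data.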

Open [wandb.ai](https://app.wandb.ai/home) and click on your latest run to see the artifact tab appear on the left sidebar.

## 3. Use an artifact
Next, let's use that artifact as input to another run.

In [5]:
# Start a new run
run = wandb.init(project='artifacts-demo', job_type='consumer')


In [6]:
# We'll use the artifact we created in the previous run as input to this run.
# This will fetch the latest entry in the 'hello-dataset' Artifact Sequence.
artifact = run.use_artifact('hello-dataset:latest')

# Download all of the files contained in the artifact.
artifact_dir = artifact.download()

# Let's take a look at the downloaded files.
print(os.listdir(artifact_dir))

['hello.txt']


In [7]:
artifact_dir

'./artifacts/hello-dataset:v0'

In [9]:
os.path.abspath(artifact_dir)

'/home/jacob/projects/pyleaves/notebooks/artifacts/hello-dataset:v0'
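Notice that we asked for `hello-dataset:latest`, but the download directory is named `hello-dataset:v0`: the `latest` alias was resolved to a concrete version. Below is a small sketch for splitting such strings into name and alias. `split_artifact_ref` is a hypothetical helper, not part of wandb, and defaulting to `latest` when no alias is given is an assumption made for this example.

```python
import os

def split_artifact_ref(ref, default_alias="latest"):
    """Split 'name:alias' into (name, alias); hypothetical helper, not a wandb function."""
    name, sep, alias = ref.rpartition(":")
    if not sep:
        # No colon present: treat the whole string as the name.
        return ref, default_alias
    return name, alias

# The last path component of the download directory carries the resolved version.
artifact_dir = "./artifacts/hello-dataset:v0"
name, version = split_artifact_ref(os.path.basename(artifact_dir))
print(name, version)
```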

In [8]:
with open(os.path.join(artifact_dir, 'hello.txt')) as f:
    print(f.read())

my first artifact!
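With more than one file in an artifact, you may want to read everything in the download directory at once. Here is a minimal sketch using `pathlib`. `read_text_files` is a hypothetical helper, and a temporary directory stands in for `artifact_dir` so the snippet runs without a W&B download.

```python
import tempfile
from pathlib import Path

def read_text_files(root):
    """Map each file's path (relative to `root`) to its text contents."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): p.read_text()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

# Stand-in for artifact_dir, populated with the same file as the artifact above.
with tempfile.TemporaryDirectory() as fake_artifact_dir:
    Path(fake_artifact_dir, "hello.txt").write_text("my first artifact!\n")
    contents = read_text_files(fake_artifact_dir)

print(contents)
```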



Let's log an output artifact too. In this example, it's just a fake model file.

In [10]:
artifact = wandb.Artifact('run-%s-model' % run.id, type='model')

# This time we'll use artifact.add_file to add a file that already exists on disk.
with open('mymodel.txt', 'w') as f:
    f.write('This is a really awesome trained model: %s' % random.random())

artifact.add_file('mymodel.txt')
run.log_artifact(artifact)

# end the current run
wandb.join()


Now you can navigate to your project page (linked above) and click on the artifacts tab to dig into all the artifacts you've created so far.

If you click through to an artifact, and then click on the "Graph" tab, you'll see a visualization that shows how your artifacts and runs are related to each other.

## Documentation

For more details, [see the docs →](https://docs.wandb.com/artifacts), which cover:
- Storing directories in artifacts
- Referring to external data using references
- Automatic file and artifact deduplication
- Best practices for dataset versioning and model management
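
To give a feel for the deduplication topic in the list above, here is a toy content-addressed store in plain Python. It is only a sketch of the general idea (identical bytes hash to the same digest and are stored once). It is not how W&B implements it, and `ContentStore` is a made-up class.

```python
import hashlib

class ContentStore:
    """Toy content-addressed store: identical bytes are kept only once."""

    def __init__(self):
        self.blobs = {}      # digest -> bytes
        self.manifests = {}  # artifact name -> {file name: digest}

    def add(self, artifact, files):
        manifest = {}
        for name, payload in files.items():
            digest = hashlib.sha256(payload).hexdigest()
            self.blobs.setdefault(digest, payload)  # keep the existing blob if already stored
            manifest[name] = digest
        self.manifests[artifact] = manifest

store = ContentStore()
store.add("dataset:v0", {"a.txt": b"alpha", "b.txt": b"beta"})
store.add("dataset:v1", {"a.txt": b"alpha", "b.txt": b"beta v2"})

# Four file entries across two versions, but only three distinct blobs are stored.
print(len(store.blobs))
```

In this toy model, each version is just a manifest of digests, so a new version that changes one file only adds one blob.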