<img src="http://wandb.me/logo-im-png" width="400" alt="Weights & Biases" />

<!--- @wandbcode{artifacts-fundamentals} -->


# Set Up

## 🪄 Install `wandb` library and login


Start by installing the library and logging in to your free account.



In [None]:
!pip install wandb -qU

## Log in to W&B
- You can log in explicitly with `wandb login` (CLI) or `wandb.login()` (Python, see below).
- Alternatively, you can set environment variables. Several environment variables change the behavior of W&B logging; the most important are:
    - `WANDB_API_KEY` - your API key, found in the "Settings" section under your profile
    - `WANDB_BASE_URL` - the URL of the W&B server
- Find your API key under "Profile" -> "Settings" in the W&B App
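If you prefer the environment-variable route, the two variables above can be set from Python before `wandb.login()` is called. The following is a minimal sketch: `configure_wandb_env` is a hypothetical helper written for this notebook, not part of the `wandb` library, and it only touches a variable when you pass a value for it.

```python
import os

def configure_wandb_env(api_key=None, base_url=None):
    """Set W&B environment variables; arguments left as None are not touched."""
    if api_key is not None:
        os.environ["WANDB_API_KEY"] = api_key
    if base_url is not None:
        os.environ["WANDB_BASE_URL"] = base_url
    # Report which variables are defined, without ever printing the key itself
    return {name: name in os.environ for name in ("WANDB_API_KEY", "WANDB_BASE_URL")}

# Calling it with no arguments changes nothing and only reports the current state.
print(configure_wandb_env())

# Example with placeholder values (substitute your own before uncommenting):
# configure_wandb_env(api_key="<your-api-key>", base_url="https://<host-url>")
```

Avoid hard-coding a real API key in a notebook that you intend to share.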


In [None]:
import wandb

WANDB_ENTITY = None #@param #Entity can be your personal entity or point to a team you are a part of!
WANDB_PROJECT = "wandb-artifacts-registry" #@param
WANDB_HOST= None #@param
YOUR_NAME = None # Used for filtering and grouping so that you can easily identify your runs in the project



In [None]:

#wandb.login(host=WANDB_HOST) #or wandb.login(key=<key>, host=<host>)
wandb.login()

# Artifacts

Use W&B Artifacts to track and version data as the inputs and outputs of your W&B Runs. In addition to logging hyperparameters, metadata, and metrics to a run, you can use an artifact to log the dataset used to train the model as input and the resulting model checkpoints as outputs.

<img src="https://drive.google.com/uc?export=view&id=1i8L9OxTwtIKCA8beTN8u_jVQzIi1ga4y" alt="Artifacts Simple" width="700" height="200">

## Create a Dataset
Let's create some datasets that we can work with in this example.

In [None]:
import os
import numpy as np
import csv

directory = "dataset"
os.makedirs(directory, exist_ok=True)
file1, file2 = os.path.join(directory, "file1.csv"), os.path.join(directory, "file2.csv")

def generate_dummy_data(num_samples):
    data = [
        np.random.normal(50, 10, num_samples),
        np.random.randint(1, 100, num_samples),
        np.random.choice(['A', 'B', 'C', 'D'], num_samples),
        np.random.uniform(0.0, 1.0, num_samples)
    ]
    return zip(*data)

def save_to_csv(file, data):
    with open(file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['feature1', 'feature2', 'feature3', 'feature4'])
        writer.writerows(data)

num_samples = 100
save_to_csv(file1, generate_dummy_data(num_samples))
save_to_csv(file2, generate_dummy_data(num_samples))

## Create An Artifact

The general workflow for creating an Artifact is:


1.   Initialize a run.
2.   Create an Artifact.
3.   Add any files or directories that you want to track and version to the new Artifact.
4.   Log the Artifact to the W&B platform.

The most straightforward way of accomplishing this is the second line of code in the example below, which will log, track and version a new dataset (i.e. do points 2, 3, and 4 above in one step).

In [None]:
run = wandb.init(entity=WANDB_ENTITY, project=WANDB_PROJECT)
run.log_artifact(artifact_or_path=f"{directory}/file1.csv", name=f"my_first_artifact_{YOUR_NAME}", type="dataset")
run.finish()

In the second line we log the Artifact with [`run.log_artifact()`](https://docs.wandb.ai/ref/python/public-api/run#log_artifact). In this example, we use three common arguments to the function.
1. With `artifact_or_path` we specify the path to the data we want to version. Any file or directory can be added here.
2. With `name` we give the artifact a name within Weights & Biases that we will use to access it.
3. With `type` we give the artifact a higher-level grouping. For example, we may have multiple artifacts of type `dataset` and multiple artifacts of type `model`.


See the [Artifacts Reference](https://docs.wandb.ai/ref/python/artifact) guide for more information and other commonly used arguments, including how to store additional metadata.

Each time the above `log_artifact` is executed, wandb will create a new version of the Artifact within Weights & Biases if the underlying data has changed.
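One way to picture "the underlying data has changed" is as a comparison of content checksums. The sketch below is only a conceptual illustration built on the standard library (`file_digest` is a hypothetical helper and SHA-256 is an arbitrary choice); it does not describe how W&B computes or stores digests internally.

```python
import hashlib
import os
import tempfile

def file_digest(path):
    """Return a hex digest of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(64 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file1.csv")
    with open(path, "w") as f:
        f.write("feature1,feature2\n1,2\n")
    before = file_digest(path)
    unchanged = file_digest(path)   # same bytes, so the digest is identical
    with open(path, "a") as f:
        f.write("3,4\n")
    after = file_digest(path)       # contents changed, so the digest differs

print(before == unchanged)  # True
print(before == after)      # False
```

In this picture, logging identical contents again maps to the same version, while any change to the bytes yields a new one.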


An alternative approach that offers more control (at the expense of more lines of code) can be seen below.

In [None]:
run = wandb.init(entity=WANDB_ENTITY, project=WANDB_PROJECT)

artifact = wandb.Artifact(f"my_first_artifact_{YOUR_NAME}", type="dataset")
# Add two individual files to the artifact.
artifact.add_file(local_path=f"{directory}/file1.csv")
artifact.add_file(local_path=f"{directory}/file2.csv")
# Alternatively, add the entire directory contents.
artifact.add_dir(local_path=f"{directory}")
# Explicitly log the artifact to Weights & Biases.
run.log_artifact(artifact)

wandb.finish()

In the above example, `wandb.Artifact(...)` creates a new Artifact object. With the resulting object, you can call [`artifact.add_file`](https://docs.wandb.ai/ref/python/artifact#add_file) or [`artifact.add_dir`](https://docs.wandb.ai/ref/python/artifact#add_dir) to add as many files and directories to the Artifact as you want. Once the files are added, the artifact must be explicitly logged to Weights & Biases with `run.log_artifact(artifact)`.

## Use an Artifact

When you want to use a specific version of an Artifact in a downstream task, you can select it either by version index (`v0`, `v1`, `v2`, and so on) or by an alias you have added. The `latest` alias always refers to the most recently logged version of the Artifact.
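Artifact references in this notebook are plain strings of the form `name:alias`, optionally prefixed with `entity/project/`. As a small sketch of how such strings are assembled, the hypothetical helper `artifact_ref` below (not part of `wandb`) builds both forms; the names `alice` and `my-team` are made-up example values.

```python
def artifact_ref(name, alias="latest", entity=None, project=None):
    """Build 'name:alias' or 'entity/project/name:alias'."""
    if (entity is None) != (project is None):
        raise ValueError("pass both entity and project, or neither")
    prefix = f"{entity}/{project}/" if entity is not None else ""
    return f"{prefix}{name}:{alias}"

print(artifact_ref("my_first_artifact_alice"))
# my_first_artifact_alice:latest
print(artifact_ref("my_first_artifact_alice", alias="v0"))
# my_first_artifact_alice:v0
print(artifact_ref("my_first_artifact_alice", "subsampled", "my-team", "wandb-artifacts-registry"))
# my-team/wandb-artifacts-registry/my_first_artifact_alice:subsampled
```

The resulting string is what gets passed as `artifact_or_name` to `run.use_artifact()`.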

The following code snippet specifies that the W&B Run will use the artifact `my_first_artifact_<name>` with the alias `latest`:


In [None]:
run = wandb.init(entity=WANDB_ENTITY, project=WANDB_PROJECT)
artifact = run.use_artifact(artifact_or_name=f"my_first_artifact_{YOUR_NAME}:latest") # this creates a reference within Weights & Biases that this artifact was used by this run.
path = artifact.download() # this downloads the artifact from Weights & Biases to your local system where the code is executing.
print(f"Data directory located at {path}")
#run training with the downloaded artifact
run.finish()

For more information on ways to customize your Artifact download, including via the command line, see the [Download and Usage guide](https://docs.wandb.ai/guides/artifacts/download-and-use-an-artifact).

## Create a new Artifact version

Let's say we want to modify our dataset while also tracking and versioning these changes. In the below example we will subsample our dataset and save it as a new file. We will use the [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) library to read our CSV file.

In the second block of code we will log it to Weights & Biases under the same Artifact name (*my_first_artifact*) so that Weights & Biases knows that this is a new version of an existing artifact.

In [None]:
import pandas
df = pandas.read_csv(f"{directory}/file1.csv")
# subsample to 50% of the original size
df_subsampled = df.sample(frac=0.5, random_state=1)
# save the subsampled dataframe to a new file.
df_subsampled.to_csv(f"{directory}/file1.csv", index=False)

Now that we have a new subsampled version of our dataset locally, we can log the new version to Weights & Biases.

In [None]:
run = wandb.init(entity=WANDB_ENTITY, project=WANDB_PROJECT)
run.log_artifact(artifact_or_path=f"{directory}/file1.csv", name=f"my_first_artifact_{YOUR_NAME}", type="dataset", aliases =["subsampled"])
run.finish()

Now the sampled dataset will be logged to the `my_first_artifact_<name>` Artifact as a new version.

The Artifact version has also been given a custom alias, `subsampled`, which is a unique label for this version within the Artifact. In addition, every version automatically receives a version index `vN`, where `N` starts at 0 and increments with each new version. You can always access a specific version of an Artifact by its version index or by one of its aliases.

## Update Artifact version metadata

You can update the `description`, `metadata`, and `alias` of an artifact on the W&B platform during or outside a W&B Run.
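Outside a run, the same properties can be edited on an artifact fetched through the public API (`wandb.Api()` and `api.artifact(...)`, both used later in this notebook). The sketch below wraps that pattern in a hypothetical helper, `update_artifact_properties`, which takes the API client as an argument so that the logic can be read and tested without a network connection.

```python
def update_artifact_properties(api, artifact_name, description=None, metadata=None):
    """Fetch an artifact via the given API client, edit its properties, and save them."""
    artifact = api.artifact(artifact_name)
    if description is not None:
        artifact.description = description
    if metadata is not None:
        artifact.metadata = metadata
    artifact.save()  # persists the edited properties
    return artifact

# Usage (requires wandb to be installed and logged in):
# import wandb
# update_artifact_properties(
#     wandb.Api(),
#     f"{WANDB_ENTITY}/{WANDB_PROJECT}/my_first_artifact_{YOUR_NAME}:subsampled",
#     description="Edited outside a run.",
# )
```

No run is created in this variant, so the edit does not appear as a run in the project.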


This example changes the `description` and `metadata` of the `my_first_artifact_<name>` artifact inside a run:

In [None]:
run = wandb.init(entity=WANDB_ENTITY, project=WANDB_PROJECT)
artifact = run.use_artifact(artifact_or_name=f"my_first_artifact_{YOUR_NAME}:subsampled")
artifact.description = "This is an edited description."
artifact.metadata = {"source": "local disk", "internal data owner": "platform team"}
artifact.save()  # persists changes to an Artifact's properties
run.finish()

## Use the Artifact within your pipelines
Once the artifact is tracked and versioned within Weights & Biases it's now easy to integrate it into your ML workflows.

In [None]:
run = wandb.init(entity=WANDB_ENTITY, project=WANDB_PROJECT)
artifact = run.use_artifact(artifact_or_name=f"my_first_artifact_{YOUR_NAME}:latest")
# the below is left as an exercise to the reader
# train model
# log model as artifact
run.finish()


## Navigate the Artifacts UI

You can also manage your Artifacts via the W&B platform. This can give you insight into your model's performance or dataset versioning. To navigate to the relevant information, click this [link](https://wandb.ai/wandb-smle/artifact_workflow/overview), then click on the **Artifacts** tab.

Navigating to the **Lineage** section in the tab will show the dependency graph formed by calling `run.use_artifact()` when an Artifact is an input to a run, and `run.log_artifact()` when an Artifact is output to a run. This helps visualize the relationship between different model versions and other objects like datasets and jobs in your project. Click [this](https://wandb.ai/wandb-smle/artifact_workflow/artifacts/dataset/preprocessed/v6/lineage) link to navigate to the project's lineage page.
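To make the idea of a lineage graph concrete, here is a toy model of it in plain Python. `LineageGraph` is a hypothetical class written for illustration (it is not how W&B stores lineage), the run and artifact names are made up, and the graph is assumed to be acyclic.

```python
from collections import defaultdict

class LineageGraph:
    """Toy lineage model: runs use (consume) and log (produce) artifacts."""

    def __init__(self):
        self.inputs = defaultdict(list)   # run -> artifacts it used
        self.outputs = defaultdict(list)  # run -> artifacts it logged

    def use_artifact(self, run, artifact):
        self.inputs[run].append(artifact)

    def log_artifact(self, run, artifact):
        self.outputs[run].append(artifact)

    def upstream(self, artifact):
        """Return all artifacts the given artifact depends on, directly or indirectly."""
        found = []
        for run, produced in self.outputs.items():
            if artifact not in produced:
                continue
            for parent in self.inputs.get(run, []):
                for candidate in [parent] + self.upstream(parent):
                    if candidate not in found:
                        found.append(candidate)
        return found

graph = LineageGraph()
graph.log_artifact("upload-run", "dataset:v0")
graph.use_artifact("train-run", "dataset:v0")
graph.log_artifact("train-run", "model:v0")

print(graph.upstream("model:v0"))  # ['dataset:v0']
```

Each `use_artifact` call adds an edge into a run and each `log_artifact` call adds an edge out of it, which is the structure the **Lineage** view draws.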

## **Artifacts Time-to-live (TTL)**

W&B Artifacts supports setting a time-to-live (TTL) policy on each version of an Artifact. The following examples show how to use a TTL policy in a common Artifact logging workflow. We'll cover:

* Setting a TTL policy when creating an Artifact
* Retroactively setting a TTL policy on a specific Artifact alias
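A TTL is expressed as a `datetime.timedelta`. As a quick worked example of what a given TTL means in calendar terms, the sketch below adds a TTL to a creation timestamp. It assumes the TTL is counted from the version's creation time; `ttl_expiry` is a hypothetical helper and the creation date is made up.

```python
from datetime import datetime, timedelta, timezone

def ttl_expiry(created_at, ttl):
    """Return the moment at which a version with this TTL would expire."""
    return created_at + ttl

created = datetime(2023, 1, 1, tzinfo=timezone.utc)  # hypothetical creation time

print(ttl_expiry(created, timedelta(days=10)))   # 2023-01-11 00:00:00+00:00
print(ttl_expiry(created, timedelta(days=365)))  # 2024-01-01 00:00:00+00:00
```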


## Setting TTL on New Artifacts
Below we create two new Artifacts from files in the `sample_data` directory that Colab provides:
- mnist_test.csv
- mnist_train_small.csv

We add each file to an artifact of type `mnist_dataset` and assign the artifact a TTL.


In [None]:
from datetime import timedelta

run = wandb.init(entity=WANDB_ENTITY,
                project=WANDB_PROJECT,
                job_type="raw-data")

raw_mnist_train = wandb.Artifact(
    f"mnist_train_small_{YOUR_NAME}",
    type="mnist_dataset",
    description="Small MNIST Training Set"
)

raw_mnist_train.add_file("sample_data/mnist_train_small.csv")
raw_mnist_train.ttl = timedelta(days=10)
run.log_artifact(raw_mnist_train, aliases=["small", "mnist", "train"])

raw_mnist_test = wandb.Artifact(
    f"mnist_test_small_{YOUR_NAME}",
    type="mnist_dataset",
    description="Small MNIST Test Set"
)

raw_mnist_test.add_file("sample_data/mnist_test.csv")
raw_mnist_test.ttl = timedelta(days=10)
run.log_artifact(raw_mnist_test, aliases=["small", "mnist", "test"])

run.finish()

## Retroactively Setting TTL for a Specific Artifact Alias

In [None]:
from datetime import timedelta

run = wandb.init(entity=WANDB_ENTITY,
                project=WANDB_PROJECT,
                job_type="modify-ttl")

test_art = run.use_artifact(f"{WANDB_ENTITY}/{WANDB_PROJECT}/mnist_test_small_{YOUR_NAME}:latest")
test_art.ttl = timedelta(days=365)  # Delete in a year
test_art.save()

train_art = run.use_artifact(f"{WANDB_ENTITY}/{WANDB_PROJECT}/mnist_train_small_{YOUR_NAME}:latest")
train_art.ttl = timedelta(days=2)  # Delete in 2 days
train_art.save()

print(test_art.ttl)
print(train_art.ttl)

run.finish()

## Artifact References

Artifacts currently support the following URI schemes:

* **http(s)://:** A path to a file accessible over HTTP. The artifact will track checksums in the form of etags and size metadata if the HTTP server supports the ETag and Content-Length response headers.
* **s3://:** A path to an object or object prefix in S3. The artifact will track checksums and versioning information (if the bucket has object versioning enabled) for the referenced objects. Object prefixes are expanded to include the objects under the prefix, default up to 100,000 objects.
* **gs://:** A path to an object or object prefix in GCS. The artifact will track checksums and versioning information (if the bucket has object versioning enabled) for the referenced objects. Object prefixes are expanded to include the objects under the prefix, default up to 100,000 objects.
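Before adding a reference, it can be handy to check which scheme a URI uses. The sketch below does this with the standard library; `reference_scheme` is a hypothetical helper, and `SCHEMES_IN_THIS_SECTION` only contains the schemes that appear in this notebook (including `file`, used in the example that follows), not an authoritative list of what the library accepts. The bucket name is made up.

```python
from urllib.parse import urlparse

# Schemes mentioned in this section; an assumption, not an exhaustive list.
SCHEMES_IN_THIS_SECTION = {"http", "https", "s3", "gs", "file"}

def reference_scheme(uri):
    """Return the URI scheme, or raise if it is not one covered in this section."""
    scheme = urlparse(uri).scheme.lower()
    if scheme not in SCHEMES_IN_THIS_SECTION:
        raise ValueError(f"scheme {scheme!r} is not covered here: {uri}")
    return scheme

print(reference_scheme("s3://my-bucket/datasets/mnist"))  # s3
print(reference_scheme("file:///content/sample_data"))    # file
```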

See below for an example of referencing local files with the `file://` scheme:

In [None]:
run = wandb.init(entity=WANDB_ENTITY,
                project=WANDB_PROJECT,
                job_type="upload-references")
artifact = wandb.Artifact(name=f"local-file-references_{YOUR_NAME}", type="reference-dataset")
artifact.add_reference("file:///content/sample_data", checksum=True)
run.log_artifact(artifact)
run.finish()

## **Considerations: Artifact Caching and Staging Directories**

By default, W&B artifacts require up to 2x their size in available local storage, for the following reason.

When you upload artifacts, wandb creates two local copies: one in the `.cache` directory, which lets `use_artifact` fetch files locally instead of downloading them again, and another in `.local/share/wandb/artifacts/staging`, where files are duplicated so that modifying them during the upload does not cause problems. Staging exists to protect the files you added from being modified before the upload completes. For example, if you call `artifact.add_file("file.txt")` and then modify `file.txt` later in your code, W&B currently guarantees that it uploads the content `file.txt` had when you added it.

How do you set where artifacts are cached?
- Set the `WANDB_CACHE_DIR` environment variable to a directory where the user has read/write access.

How do you set where artifacts are staged?
- Set the `WANDB_DATA_DIR` environment variable to a directory where the user has read/write access.
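Both variables can be set from Python. The sketch below assumes they should be in place before the first run starts writing files; `set_wandb_dirs` is a hypothetical helper, and the temporary directory is only a stand-in for a location with enough free space.

```python
import os
import tempfile

def set_wandb_dirs(cache_dir, data_dir):
    """Create both directories if needed and point W&B's env vars at them."""
    for path in (cache_dir, data_dir):
        os.makedirs(path, exist_ok=True)
    os.environ["WANDB_CACHE_DIR"] = cache_dir
    os.environ["WANDB_DATA_DIR"] = data_dir
    return cache_dir, data_dir

# Stand-in location for illustration; replace with a directory of your choice.
base = tempfile.mkdtemp(prefix="wandb-demo-")
cache_dir, data_dir = set_wandb_dirs(os.path.join(base, "cache"), os.path.join(base, "data"))
print(os.environ["WANDB_CACHE_DIR"])
print(os.environ["WANDB_DATA_DIR"])
```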

Caching and staging can be controlled directly in the `add_dir` and `add_file` methods through the `skip_cache` and `policy` arguments.

Here is a list of all the possible combinations:

- With `skip_cache=True` and `policy="mutable"`, only staging files are created.
- With `skip_cache=False` and `policy="mutable"`, staging and cache files are created, but the staging files are deleted while caching.
- With `skip_cache=True` and `policy="immutable"`, neither staging nor cache files are created.
- With `skip_cache=False` and `policy="immutable"`, only cache files are created.
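The four combinations above reduce to two independent rules: staging files are created only for the mutable policy, and cache files are created unless the cache is skipped. The sketch below encodes the list as a lookup; `local_copies` is a hypothetical helper that restates this section, not a call into `wandb`.

```python
def local_copies(skip_cache, policy):
    """Return which local copies are created for a given combination."""
    if policy not in ("mutable", "immutable"):
        raise ValueError(f"unknown policy: {policy!r}")
    return {
        "staging": policy == "mutable",  # transient when a cache copy is also made
        "cache": not skip_cache,
    }

for skip_cache in (True, False):
    for policy in ("mutable", "immutable"):
        print(skip_cache, policy, local_copies(skip_cache, policy))

# With an artifact object, the arguments are passed like this:
# artifact.add_file(local_path="file.txt", skip_cache=True, policy="immutable")
```

If local disk space is the constraint, `skip_cache=True` with `policy="immutable"` is the combination that creates no extra copies, at the cost of the protection that staging provides.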

# Artifact Training Pipeline Example

The below pipeline includes:

* Data Versioning: The Heart Disease dataset is split into training, validation, and test sets, each logged as a W&B artifact for easy tracking and reproducibility.

* Model Training: A neural network is trained on the training set, with performance monitored on the validation set. The best model version is saved and versioned as a W&B artifact.

* Model Evaluation: The best model is retrieved from W&B and evaluated on the test set to demonstrate how W&B ensures reproducibility and traceability throughout the ML lifecycle.

* What It Showcases:
    * How to use W&B for seamless data and model versioning.
    * Best practices for tracking model performance during training and evaluation.
    * Efficient and reproducible ML workflows in a production-ready environment.

### Data Preparation and Uploading to W&B

In [None]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np

# Load the Heart Disease dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
data = pd.read_csv(url, header=None, names=columns)

# Replace missing values ('?') with NaN and drop rows with NaN values
data.replace('?', np.nan, inplace=True)
data = data.dropna().astype(float)

# Convert target variable: 0 = no heart disease, 1 = presence of heart disease
data['target'] = (data['target'] > 0).astype(int)

# Shuffle the dataset to ensure random distribution
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

# Perform a train/validation/test split (60/20/20)
train_size = int(0.6 * len(data))
val_size = int(0.2 * len(data))
test_size = len(data) - train_size - val_size

train_data = data[:train_size]
val_data = data[train_size:train_size + val_size]
test_data = data[train_size + val_size:]

# Save the entire dataset as a CSV file
data.to_csv("heart_disease_full_dataset.csv", index=False)

# Simple helper that saves a dataset to CSV and logs it as an artifact
def save_and_log_dataset(data, filename, artifact_name, aliases):
    # Save the dataset as a CSV file
    data.to_csv(filename, index=False)

    # Create and log the dataset artifact
    dataset_artifact = wandb.Artifact(name=artifact_name, type='dataset')
    dataset_artifact.add_file(filename)
    wandb.log_artifact(dataset_artifact, aliases=aliases)

#Upload data to wandb
run = wandb.init(entity=WANDB_ENTITY,
                project=WANDB_PROJECT,
                group = YOUR_NAME,
                job_type="heart-disease-data-uploads",
                name = f"heart_disease_data_uploads_{YOUR_NAME}",
                tags = ["data-upload"]
                )

# Save and log the entire dataset
save_and_log_dataset(data, "heart_disease_full_dataset.csv", f'heart_disease_full_dataset_{YOUR_NAME}', ["initial_commit", "complete_dataset"])

# Save and log the training dataset
save_and_log_dataset(train_data, "heart_disease_train_dataset.csv", f'heart_disease_train_dataset_{YOUR_NAME}', ["initial_commit", "train_split"])

# Save and log the validation dataset
save_and_log_dataset(val_data, "heart_disease_val_dataset.csv", f'heart_disease_validation_dataset_{YOUR_NAME}', ["initial_commit", "validation_split"])

# Save and log the test dataset
save_and_log_dataset(test_data, "heart_disease_test_dataset.csv", f'heart_disease_test_dataset_{YOUR_NAME}', ["initial_commit", "test_split"])


#Log all dataset to W&B tables for visual analysis
wandb.log({f"train_data_table_{YOUR_NAME}": wandb.Table(dataframe=train_data),
           f"test_data_table_{YOUR_NAME}": wandb.Table(dataframe=test_data),
           f"validation_data_table_{YOUR_NAME}": wandb.Table(dataframe=val_data)})

wandb.finish()

### Model Training and Logging Artifacts
In the example below we:

1. Download the training and validation datasets using `artifact.download()` inside `load_data`.
2. Use the training data to execute a training run and save the best version of the model: whenever the current `val_loss < best_performance`, a new model artifact version is created and logged with the `best` alias.
3. Afterwards, evaluate the `best` model against the test dataset.
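The "save only when the validation loss improves" rule from step 2 can be seen in isolation in the sketch below. `best_checkpoints` is a hypothetical helper and the loss values are made up; in the training cell, each returned entry corresponds to one logged model version, with the last one holding the `best` alias.

```python
def best_checkpoints(val_losses):
    """Return the (epoch, loss) pairs at which a new best model would be saved."""
    best_performance = float("inf")
    saved = []
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_performance:  # same condition as in the training loop
            best_performance = val_loss
            saved.append((epoch, val_loss))
    return saved

print(best_checkpoints([0.70, 0.65, 0.68, 0.60, 0.61]))
# [(0, 0.7), (1, 0.65), (3, 0.6)]
```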

In [None]:
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

#simple function for loading artifacts from wandb
def load_data(entity, project, artifact_name, your_name, split_name):

    artifact_full_name = f'{entity}/{project}/{artifact_name}_{your_name}:latest'
    artifact = wandb.use_artifact(artifact_full_name, type='dataset')
    # use_artifact() records the lineage; download() fetches the files to the local machine.
    # If the files are already available locally, use_artifact() alone is enough to record the lineage.
    artifact_dir = artifact.download()

    data = pd.read_csv(f"{artifact_dir}/heart_disease_{split_name}_dataset.csv")

    X = torch.tensor(data.drop("target", axis=1).values, dtype=torch.float32)
    y = torch.tensor(data["target"].values, dtype=torch.float32)

    return X, y

# Initialize training run
run = wandb.init(entity=WANDB_ENTITY,
                project=WANDB_PROJECT,
                group = YOUR_NAME,
                job_type="heart-disease-training",
                name = f"heart_disease_training_validation_{YOUR_NAME}"
                )

# Load training data
X_train, y_train = load_data(WANDB_ENTITY, WANDB_PROJECT, 'heart_disease_train_dataset', YOUR_NAME, 'train')

# Load validation data
X_val, y_val = load_data(WANDB_ENTITY, WANDB_PROJECT, 'heart_disease_validation_dataset', YOUR_NAME, 'val')

# Define a simple neural network model
class HeartDiseaseModel(nn.Module):
    def __init__(self, input_size):
        super(HeartDiseaseModel, self).__init__()
        self.fc1 = nn.Linear(input_size, 128)
        self.bn1 = nn.BatchNorm1d(128)
        self.fc2 = nn.Linear(128, 64)
        self.bn2 = nn.BatchNorm1d(64)
        self.fc3 = nn.Linear(64, 32)
        self.bn3 = nn.BatchNorm1d(32)
        self.fc4 = nn.Linear(32, 16)
        self.fc5 = nn.Linear(16, 1)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        x = F.relu(self.bn1(self.fc1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.fc2(x)))
        x = self.dropout(x)
        x = F.relu(self.bn3(self.fc3(x)))
        x = self.dropout(x)
        x = F.relu(self.fc4(x))
        x = torch.sigmoid(self.fc5(x))
        return x

model = HeartDiseaseModel(input_size=X_train.shape[1])
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

best_performance = float('inf')
version = 1

for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train).squeeze()
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()

    # Calculate and log training accuracy
    predictions = (outputs >= 0.5).float()
    train_accuracy = (predictions == y_train).float().mean().item()

    wandb.log({"train/epoch": epoch, "train/train_loss": loss.item(), "train/train_accuracy": train_accuracy})

    # Evaluate the model on validation set
    model.eval()
    with torch.no_grad():
        val_outputs = model(X_val).squeeze()
        val_loss = criterion(val_outputs, y_val).item()

        # Calculate and log validation accuracy
        val_predictions = (val_outputs >= 0.5).float()
        val_accuracy = (val_predictions == y_val).float().mean().item()

        wandb.log({"val/val_loss": val_loss, "val/val_accuracy": val_accuracy})

        if val_loss < best_performance:
            best_performance = val_loss
            model_path = f"heart_disease_model_v{version}.pth"
            torch.save(model.state_dict(), model_path)
            artifact = wandb.Artifact(name=f'heart_disease_model_{YOUR_NAME}', type='model')
            artifact.add_file(model_path)
            wandb.log_artifact(artifact, aliases=[f"v{version}", "best"])
            version += 1

wandb.finish()


### Evaluation Phase Using the Versioned Model Artifact
We will now use the best model logged in our training loop and evaluate it against our test dataset.

In [None]:
# Start a new W&B run for evaluation
run = wandb.init(entity=WANDB_ENTITY,
                project=WANDB_PROJECT,
                group=YOUR_NAME,
                job_type="heart-disease-testing",
                name=f"heart_disease_testing_{YOUR_NAME}"
                )

# Load test data
X_test, y_test = load_data(WANDB_ENTITY, WANDB_PROJECT, 'heart_disease_test_dataset', YOUR_NAME, 'test')

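# Retrieve the best model version logged during training and load its weights
best_model_artifact = run.use_artifact(f"{WANDB_ENTITY}/{WANDB_PROJECT}/heart_disease_model_{YOUR_NAME}:best", type="model")
best_model_dir = best_model_artifact.download()
best_model_file = next(f for f in os.listdir(best_model_dir) if f.endswith(".pth"))
model.load_state_dict(torch.load(os.path.join(best_model_dir, best_model_file)))
model.eval()  # disable dropout and use the running batch-norm statistics

# Perform evaluation on the test set
with torch.no_grad():
    outputs = model(X_test).squeeze()
    test_loss = criterion(outputs, y_test).item()
    wandb.log({"test/final_test_loss": test_loss})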

wandb.finish()



## **Automations**

Automations trigger automatic workflows via webhooks.

Create an automation to trigger workflow steps, such as automated model testing and deployment, when a new version of an artifact is added to a registry or an alias is added to a linked version. You can read more about this [here](https://docs.wandb.ai/guides/model_registry/model-registry-automations).

## **Example**

<img src="https://drive.google.com/uc?export=view&id=1yG_cO2chVw8vN0snZn5lcEs95Mx_LcwZ" alt="Automation" width="700" height="400">

**Demo Workflow: Automations to open a "dataset merge request" issue in GH**

**Context:** An approval workflow is required (similar to approving PRs in GitHub) when a new dataset is logged to W&B.

**Workflow:**

➕ Log new dataset version: A review requestor logs a new version to a dataset artifact inside a project, using `wandb.log_artifact()`.

🚀 Automation to open GH issue: The new dataset version (event) triggers GitHub Actions to create a GitHub Issue (action) for a reviewer to assess.

✅ Review, Approval & Linking: The reviewer can discuss the request with the requestor via the GH issue comment thread. Once approved, the reviewer links the logged dataset to the "approved" Dataset Registry in W&B (it is now an approved dataset version usable by others).

📦 [Optional] Aggregation: A new version in the "approved" Dataset Registry triggers a second data merge workflow back on GH Actions, which may log a new version of the merged & approved dataset to a separate "golden" Dataset Registry.

Workflow Resources

* GitHub repo with the GitHub Actions workflow ([YAML file](https://github.com/ijdoc/wandb-recipes/blob/aa787b0613089021b6e1567e15c78b775b5cfee0/.github/workflows/data-pr.yml))

* W&B project - [PR creation automation](https://wandb.ai/smle-reg-team-1/demo-data-pr/automations) - view details through options menu to see payload example

# Registry

W&B Registry is a curated central repository that stores assets and provides versioning, aliases, lineage tracking, and governance for them. Registry allows individuals and teams across the entire organization to share and collaboratively manage the lifecycle of all models, datasets, and other artifacts. The registry can be accessed directly in SaaS by visiting https://wandb.ai/registry or on your private instance through `https://<host-url>/registry`.

The Registry has two core registries, `Model` and `Datasets`. Users can also create their own custom registries; see the Registry documentation linked under "Next steps" at the end of this notebook for more details.

Some use cases for Registry:

- A registry for non-model objects like forecasts or datasets that need to be surfaced as valid for production runs, even if models are published in different projects.
- Comprehensive tracking beyond models and experiments, including datasets and other elements like prompts.
- Enables the handoff of datasets from the 'dataset manufacturing' team to a centralized, discoverable list. Once promoted to the Registry, these datasets can be used by any team with the proper approvals for model training.
- Allows sharing of models across the organization without granting access to the project/team where the model was developed.
- The model registry supports teams in logging and consuming models with traceability, but larger organizations need enhanced collaboration and sharing across different roles.


<img src="https://drive.google.com/uc?export=view&id=10qkxTZza7kg3j7cYESNfH5Lrw4y8Kv5V" alt="Registry" width="700" height="300">



## Track and Publish dataset

Within the core Dataset registry we will create a collection called `heart_disease_production` to promote our entire heart disease dataset to production. A collection is a set of linked artifact versions in a registry.

To create a collection we need to do two things:
1. Specify the collection and registry we want to link our artifact version to. To do this, we specify a "target path" for our artifact version.
2. Use the `run.link_artifact` method and pass our artifact object and the target path.

The target path consists of the name of the organization your team belongs to, the name of the registry, and the name of the collection. There are two ways to get the target path: [interactively with the W&B App UI](https://docs.wandb.ai/guides/registry/link_version#confirm-the-path-of-a-registry-in-the-wb-app-ui) or programmatically with the W&B Python SDK.

For this example, we will programmatically create the collection target path:

<!-- #### Interactively get target path of a collection

1. Navigate to the Registry app at https://wandb.ai/registry/
2. Click on the registry you want to link your artifact version to.
3. At the top of the page, you will see an autogenerated code snippet. Copy the string next to the `target_path` parameter in `run.link_artifact()`. -->


#### Programmatically make the collection target path

The target path of a collection consists of three parts:
* The name of the organization
* The name of the registry
* The name of the collection within the registry

If you know these three fields, you can create the full name yourself with string concatenation, f-strings, and so forth:
```python
target_path = f"{ORG_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}"
```

In [None]:
%%script false --no-raise-error
# Remove the `%%script` line above when you are ready to run this cell

# See the example path at the top right of <host_url>/registry for your org name,
# e.g. run.link_artifact(artifact, "<org-name>/wandb-registry-model/<collection-name>")
ORG_NAME = "moe_SAMSBNIZPAIR4"
DATASET_REGISTRY_NAME = "dataset"
DATASET_COLLECTION_NAME = "heart_disease_production"

# Path to link the artifact to a collection
dataset_target_path = f"{ORG_NAME}/wandb-registry-{DATASET_REGISTRY_NAME}/{DATASET_COLLECTION_NAME}"


dataset_artifact_name = f"{WANDB_ENTITY}/{WANDB_PROJECT}/heart_disease_full_dataset_{YOUR_NAME}:latest"

# Promote your <entity>/<project>/heart_disease_full_dataset_<your_name> dataset to the registry
run = wandb.init(entity=WANDB_ENTITY,
                project=WANDB_PROJECT,
                group=YOUR_NAME,
                name=f"dataset_registry_promotion_{YOUR_NAME}"
                )

dataset_artifact = run.use_artifact(artifact_or_name=dataset_artifact_name, type="dataset")

run.link_artifact(artifact=dataset_artifact,
                  target_path=dataset_target_path,
                  aliases=["production"],
                  )
run.finish()

## Track and Publish Model

We will now follow a similar approach to promote our best model to the `Model` registry.

In [None]:
%%script false --no-raise-error
# Remove the `%%script` line above when you are ready to run this cell

ORG_NAME = "moe_SAMSBNIZPAIR4"
MODEL_REGISTRY_NAME = "model"  # yields the path segment `wandb-registry-model`
MODEL_COLLECTION_NAME = "heart_disease_production"

# Path to link the artifact to a collection
model_target_path = f"{ORG_NAME}/wandb-registry-{MODEL_REGISTRY_NAME}/{MODEL_COLLECTION_NAME}"


model_artifact_name = f"{WANDB_ENTITY}/{WANDB_PROJECT}/heart_disease_model_{YOUR_NAME}:latest"

# Promote your <entity>/<project>/heart_disease_model_<your_name> model to the registry
run = wandb.init(entity=WANDB_ENTITY,
                project=WANDB_PROJECT,
                group=YOUR_NAME,
                name=f"model_registry_promotion_{YOUR_NAME}"
                )

model_artifact = run.use_artifact(artifact_or_name=model_artifact_name, type="model")

run.link_artifact(artifact=model_artifact,
                  target_path=model_target_path,
                  aliases=["production"],
                  )
run.finish()

## Unlinking an Artifact from a collection

There are times when you want to unlink an artifact from a collection. This can be done in two ways:

1. Directly, in association with a run
2. Using the W&B API

In [None]:
%%script false --no-raise-error
# Remove the `%%script` line above when you are ready to run this cell
# Unlink the dataset artifact from the registry via a run
dataset_artifact_name = f"{ORG_NAME}/wandb-registry-{DATASET_REGISTRY_NAME}/{DATASET_COLLECTION_NAME}:latest"

run = wandb.init(entity=WANDB_ENTITY,
                project=WANDB_PROJECT,
                group=YOUR_NAME,
                name=f"dataset_registry_unlinking_{YOUR_NAME}"
                )

dataset_artifact = run.use_artifact(artifact_or_name=dataset_artifact_name, type="dataset")

dataset_artifact.unlink()
run.finish()

In [None]:
%%script false --no-raise-error
# Remove the `%%script` line above when you are ready to run this cell
# Unlink the model artifact from the registry via the W&B API
model_artifact_name = f"{ORG_NAME}/wandb-registry-{MODEL_REGISTRY_NAME}/{MODEL_COLLECTION_NAME}:latest"

api = wandb.Api()
artifact = api.artifact(model_artifact_name)
artifact.unlink()

# Next steps - Resources

**Artifacts**
1. [Artifacts Python reference documentation](https://docs.wandb.ai/ref/python/artifact): Deep dive into artifact parameters and advanced methods.
2. [Lineage](https://docs.wandb.ai/guides/artifacts/explore-and-traverse-an-artifact-graph): View lineage graphs, which are built automatically when you use the W&B artifact system and provide an auditable visual overview of the relationships between specific artifact versions, datasets, models, and runs.
3. [Artifact Automations](https://docs.wandb.ai/guides/artifacts/project-scoped-automations): Automatically run specific Weights & Biases jobs based on changes to your artifacts, such as automatically training a new model each time a new version of the training data is logged.
4. [Reference Artifacts](https://docs.wandb.ai/guides/artifacts/track-external-files#download-a-reference-artifact): Track files saved outside the W&B server, like Amazon S3 buckets, GCS buckets, Azure blobs, and more.
5. [Artifact TTL](https://docs.wandb.ai/guides/artifacts/ttl): Schedule when artifacts are deleted from W&B with W&B Artifact time-to-live (TTL) policy.

**Registry**
6. [Registry](https://docs.wandb.ai/guides/registry): Learn how to centralize your best artifact versions in a shared registry across your organization.
7. [Registry Zoo Example](https://colab.research.google.com/drive/1RTUJgmJqqAElW0n7uy6NRQQQgjPHIcXr?usp=sharing): A secondary registry example representing the following scenario:

  Team 1 / Project 1: Creates datasets and publishes them to the dataset registry, then trains a model on those datasets and promotes it to the model registry.

  Team 2 / Project 2: Pulls the dataset and model above to perform inference from a completely different team. Testing this requires multi-team membership.

## **In-depth Resources on CI/CD with webhook automations:**

8. [Model CI/CD Course: Enterprise Model Management features](https://www.youtube.com/watch?v=VWdRQL0CsAk&list=PLD80i8An1OEGECFPgY-HPCNjXgGu-qGO6&index=1)

9. [Model CI/CD with webhook automations on Weights & Biases Report](https://wandb.ai/wandb/wandb-model-cicd/reports/Model-CI-CD-with-W-B--Vmlldzo0OTcwNDQw#using-webhook-automations-in-weights-&-biases)