# Introduction to the Lab Client <a class="tocSkip">

This tutorial shows you how to connect to a Lab instance, manage data, and run your experiments.

**In this notebook:**

* [Initialize environment from ML Lab](#Initialize-Environment)
* [Create an experiment and store result data](#Create-Experiment)
* [Load & upload data from/to remote storage](#Load-files-from-remote-storage)
* [Run an experiment, track & share experiment metadata](#Create-and-Run-Experiment) 
* [Create and upload unified model](#Create-Unified-Model)
* [Access Lab functionality via the Lab API](#Access-Lab-Functionality-via-the-Lab-API)

_The library and this notebook are currently only tested with Python 3._

# Dependencies
In the first step, we import all dependencies required for this notebook.

In [None]:
# System libraries
from __future__ import absolute_import, division, print_function
import logging, os, sys

# Enable logging
logging.basicConfig(format='[%(levelname)s] %(message)s', level=logging.INFO, stream=sys.stdout)

# Initialize tqdm to always use the notebook progress bar
import tqdm
tqdm.tqdm = tqdm.tqdm_notebook

# Lab libraries
from lab_client import Environment

# Initialize Environment
The **environment** provides functionality to manage all project-related files (e.g. models & datasets) and experiments. A **project** is a digital space for tackling a specific data science use-case that consists of multiple datasets, experiments, models, services, and jobs.

The environment can be connected to a Lab instance by providing the `lab_endpoint` and a valid `lab_api_token`. If connected, the environment provides easy access to the Lab API and capabilities to download/upload data and sync experiments. Locally, the environment has a [dedicated folder](../environment) structure to save and load datasets, models, and experiment data. The location of this folder can be configured via the `DATA_ENVIRONMENT` environment variable or provided via the `root_path` parameter during initialization.

<div class="alert alert-info">
To run this code, you need to create a project within the Lab instance and replace LAB_PROJECT with your project!
</div>

In [None]:
# Initialize environment
env = Environment(project="LAB_PROJECT",  # Lab project you want to work on
                  # Only required in stand-alone workspace deployments 
                  # lab_endpoint="LAB_ENDPOINT", 
                  # lab_api_token="LAB_TOKEN" 
                 )

# Show environment information
env.print_info()
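
The root-folder configuration described above can be pictured with a small standalone sketch. This is not the `lab_client` implementation: the helper name `resolve_root_path`, the default folder, and the precedence (explicit argument first, then `DATA_ENVIRONMENT`) are assumptions made for illustration only.

```python
import os
from pathlib import Path


def resolve_root_path(root_path=None, default="~/lab-environment"):
    """Pick a root folder for datasets, models, and experiment data.

    Hypothetical helper: the precedence and the default are assumptions,
    not documented lab_client behavior.
    """
    if root_path:
        # An explicitly passed path is assumed to win
        return Path(root_path).expanduser()
    env_value = os.environ.get("DATA_ENVIRONMENT")
    if env_value:
        # Otherwise fall back to the environment variable
        return Path(env_value).expanduser()
    # Last resort: an illustrative default location
    return Path(default).expanduser()
```

With `DATA_ENVIRONMENT` set, calling `resolve_root_path()` returns that folder, while `resolve_root_path("/some/path")` ignores the variable.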

# Create Experiment
An **experiment** is a single execution of data science code with specific parameters, data, and results. An experiment in the data science field usually refers to a single model training run, but can also be any other computational task, such as a data transformation, that requires a certain configuration and has some measurable results.

You can initialize a new experiment with a given name from the environment via `create_experiment`. The `Experiment` instance manages all metadata and resources such as parameters, timestamps, input & output files, resulting metrics (e.g. accuracy), and other related information. The experiment metadata is automatically synced to the connected Lab instance to enable reproducibility, transparency, and collaboration.

In [None]:
exp = env.create_experiment('Environment Tutorial')

<div class="alert alert-success">
We recommend initializing one <i>Experiment</i> instance per notebook or script. For tasks such as hyper-parameter optimization, you can also create multiple <i>Experiment</i> instances.
</div>

# Store information within the experiment folder
Every experiment has its own unique folder within the environment that should be used to store all data related to the experiment.

In [None]:
print(exp.output_path)

Use `create_file_path` on the `Experiment` instance to get the correct path to the experiment folder for a given filename. The file will also be tracked as an artifact in the experiment metadata.

In [None]:
new_file_path = exp.create_file_path("test.txt")

# Create test.txt file in experiment folder
with open(new_file_path, "w") as text_file:
    text_file.write("test content")
print("Created new file in experiment folder: " + new_file_path)

This is only a demonstration of how to use the experiment instance to save data. We will later use the same functionality in the model training to save the trained model file.

# Load files from remote storage
Use the `get_file` method to request a file (e.g. dataset or model) either from the local directory or the remote storage from the connected Lab. This method will return the full path to the specified file if it exists. If the file does not exist locally, it will be automatically downloaded (cached) from the remote storage into the project folder. Since the remote storage supports versioning of files, you will always get the newest version from remote storage. You can also directly load a specific version by attaching the `.v` suffix to the key (e.g. `.v2` for the second version of the file).

<div class="alert alert-info">
A <b>dataset</b> is a single file that contains raw or processed data, usually preselected for a certain experiment. It is recommended to have the dataset in an easily readable format such as CSV for structured data or GraphML for graph data. If your dataset consists of multiple files (e.g. a collection of images), we recommend packaging this dataset into a single file.
</div>

For this tutorial, we have prepared a dataset that contains about 5k news articles categorized into topics. Our goal is to train a classification model that is able to predict the topic of a news article. You can download the dataset [**here**](/docs/walkthrough/data/download-dataset.html). Please upload this dataset to your project on Lab, copy the key to the clipboard, and insert it into the `get_file` function below.

<div class="alert alert-info">
To run this code you need to have uploaded the dataset to the Lab instance!
</div>

In [None]:
# tutorial file -> datasets/news-categorized.csv (needs to be uploaded)
text_corpus_path = env.get_file('datasets/news-categorized.csv')
print("Remote file saved to: " + text_corpus_path)
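
The version suffix mentioned above (e.g. `.v2`) can be handled with plain string operations. The two helpers below are hypothetical and not part of `lab_client`. They assume the suffix is simply appended to the end of the key, as the `.v2` example suggests, so check the keys shown in your Lab instance before relying on this format.

```python
import re


def with_version(key, version):
    """Append a `.v<version>` suffix to a storage key (assumed key format)."""
    return f"{key}.v{version}"


def split_version(key):
    """Return (base_key, version); version is None if the key has no suffix."""
    match = re.fullmatch(r"(.+)\.v(\d+)", key)
    if match:
        return match.group(1), int(match.group(2))
    return key, None


versioned_key = with_version("datasets/news-categorized.csv", 2)
print(versioned_key)
print(split_version(versioned_key))
```

A key built this way could then be passed to `env.get_file` to request that specific version instead of the newest one.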

Read the downloaded file into a dataframe and split it for model training by using default pandas and numpy functionalities. For more information about pandas or numpy, please refer to the [pandas-tutorial](./pandas-tutorial.ipynb) and [numpy-tutorial](./numpy-tutorial.ipynb).

In [None]:
import numpy as np
import pandas as pd

# Load dataset via pandas
df = pd.read_csv(text_corpus_path, sep=";")

# only keep labels that have more than 15 items
df = df.groupby("label").filter(lambda x: len(x) > 15)

# Split this dataset into train (80%), and test (20%)
train_df, test_df = np.split(df.sample(frac=1, random_state=1), [int(.8*len(df))])

# add dataframes to experiment (will be logged and accessible within the experiment)
exp.add_artifact("train_data", train_df)
exp.add_artifact("test_data", test_df)

# Show sample
train_df.head()
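
To see what the filter and the 80/20 split do without the real dataset, here is a small sketch on made-up toy data. The column names match the tutorial, while the rows, the lower threshold of 5, and the use of `iloc` slicing instead of `np.split` are choices made only for this illustration.

```python
import pandas as pd

# Toy data standing in for the news dataset (values are made up)
toy_df = pd.DataFrame({
    "text": [f"article {i}" for i in range(25)],
    "label": ["sports"] * 12 + ["tech"] * 10 + ["rare"] * 3,
})

# Keep only labels with more than 5 rows, so "rare" is dropped
filtered = toy_df.groupby("label").filter(lambda group: len(group) > 5)

# Shuffle reproducibly, then cut at the 80% mark
shuffled = filtered.sample(frac=1, random_state=1)
cut = int(0.8 * len(shuffled))
toy_train, toy_test = shuffled.iloc[:cut], shuffled.iloc[cut:]

print(len(toy_train), len(toy_test))
```

The split is a plain shuffle, so rare labels may end up unevenly distributed between train and test. A stratified split is one alternative if that matters for your data.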

You can also use the `env.get_file` method to load files from a URL:

In [None]:
# Get file from URL
env.get_file('https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz')

# Create and Run Experiment 
Use the `run_exp` method of the experiment instance to execute your experiment. This method takes a function that contains your actual experiment logic (code) and a Python dictionary that holds your experiment parameters (configuration). All experiment metadata is continuously synchronized with the Lab instance during the experiment run.

## Define Experiment
Before we can run the experiment, we have to define our experiment code in the `train` function (you can name this function however you like):

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.calibration import CalibratedClassifierCV

# Define a function with the required code to run the experiment (e.g. train model)
def train(exp, params, artifacts):
    # exp (= Experiment instance)
    # params (= parameter dictionary)
    # artifacts (= dictionary of added artifacts)
    # these variables are automatically provided (but not required)
    
    # Get artifacts for the experiment run
    train_df = artifacts["train_data"]
    test_df = artifacts["test_data"]
    
    # Experiment Implementation
    classification_pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(analyzer=lambda x: x, min_df=params['min_df'])),
        ("lsvc_calib", CalibratedClassifierCV(LinearSVC(verbose=0), method="isotonic", cv=3))
    ])
    
    sklearn_classifier = classification_pipeline.fit(
        [str(item).split() for item in train_df["text"].tolist()], train_df["label"].tolist()
    )
    
    # Add trained model instance to experiment -> it can be accessed after the exp-run is finished
    exp.add_artifact("sklearn_classifier", sklearn_classifier)
    
    # Evaluate trained model
    score = sklearn_classifier.score(
        [str(item).split() for item in test_df["text"].tolist()], test_df["label"].tolist()
    )
    
    # log a metric to the current experiment
    exp.log_metric("accuracy", score)
    
    # optional: return the most descriptive metric (main objective) for the experiment
    return score

## Run Experiment
Run the experiment with a defined set of parameters via the `run_exp` function of the `Experiment` instance.

In [None]:
# Define parameter configuration for experiment run
params = {
    'min_df': 2 # value should be string, int or float
}

# Run experiment and automatically sync all metadata
exp.run_exp(train, params)
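
If you want to try several configurations, one way to prepare the parameter dictionaries is to expand a small grid first. This is a sketch using only the standard library. `expand_grid` and `check_param_types` are hypothetical helpers, and `C` is a hypothetical second parameter that the `train` function above does not read.

```python
from itertools import product

# Candidate values per parameter ("min_df" is from the tutorial, "C" is hypothetical)
search_space = {"min_df": [1, 2, 5], "C": [0.1, 1.0]}


def expand_grid(space):
    """Return one parameter dictionary per combination of candidate values."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def check_param_types(params):
    """Raise if a value is not a string, int, or float."""
    for key, value in params.items():
        if not isinstance(value, (str, int, float)):
            raise TypeError(f"Parameter {key!r} has unsupported type {type(value).__name__}")


param_grid = expand_grid(search_space)
for candidate in param_grid:
    check_param_types(candidate)

print(len(param_grid), param_grid[0])
```

Each dictionary in `param_grid` could then be passed to `run_exp` on its own `Experiment` instance, in line with the recommendation for hyper-parameter optimization earlier in this notebook.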

Each experiment execution automatically records a variety of information and also allows you to manually log information, such as:

- Parameters: Key-value input parameters (hyperparameters) for the experiment
- Metrics: Key-value result metrics for the experiment
- Start-/Finish-time, duration, and current status
- Host Information: CPU, GPUs, hostname, os, python version...
- Dependencies: pip-installable dependencies with versions required for the experiment
- Git Information: repo url, commit hash, branch, user
- Resources: Input & output files (stored in Lab project), source-code,  and stdout logs

In [None]:
# Print experiment information
exp.print_info()
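
To give a feel for the host information listed above, the following sketch gathers a few comparable fields with the standard library. It only illustrates the kind of data involved. It is not how `lab_client` collects its metadata, and the field names are assumptions.

```python
import os
import platform
import socket


def collect_host_info():
    """Gather a few host details similar to those recorded per experiment."""
    return {
        "hostname": socket.gethostname(),
        "os": platform.platform(),
        "python_version": platform.python_version(),
        "cpu_count": os.cpu_count(),
    }


host_info = collect_host_info()
for name, value in host_info.items():
    print(f"{name}: {value}")
```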

During the experimentation process, one or more parameters are changed by the data scientist in an organized manner, and the effects of these changes on associated metrics are measured, recorded, and analyzed. Data scientists may try many different approaches and parameters, and fail many times, before the required level of a metric is achieved.

# Create Unified Model

A **model** is an artifact that is created in the training process. A model is something that predicts an output for a given input. Any file created after training from an ML algorithm is a model. The model needs to be deployed as a service in order to offer the model's prediction capabilities for integration into applications.

In order to easily share and deploy the trained model, we will use the [unified model library](#TODO) to create a self-contained executable model file. You can find more information about this library [here](#TODO) and in the [unified-model tutorial](./unified-model-tutorial.ipynb).

In [None]:
from unified_model.predefined_models.sklearn_models import SklearnTextClassifier

# Optional: create a descriptive name for the model
model_name = "news-categorized_sklearn_classifier.model"

# Initialize model instance with trained model
unified_model = SklearnTextClassifier(
    sklearn_classifier=exp.get_artifact("sklearn_classifier"), 
    name=model_name)

# Get a file path for the model within the dedicated experiment folder
model_path = exp.create_file_path(model_name)
# Save unified model
unified_model_path = unified_model.save(model_path)

# Test prediction functionality
unified_model.predict("mobile apps")

# Upload to remote storage

 

## Upload a file
Use `upload_file` to upload the trained model to the remote storage. This is the best way to release finished models or share data for usage on other machines and for other developers.

In [None]:
uploaded_model_key = env.upload_file(unified_model_path, "model")

## Upload a folder
Use `upload_folder` to upload a folder to the remote storage. The folder will be automatically zipped before it is uploaded as an archive file.

In [None]:
zipped_folder_key = env.upload_folder(exp.output_path, "dataset")

# Get a folder from the remote storage
Use `get_file` with the `unpack=True` option to download the archive file and automatically unpack it. This method directly returns the path to the folder. A variety of compression formats are supported.

In [None]:
env.get_file(zipped_folder_key, unpack=True)
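
The zip-then-unpack round trip from the last two sections can be reproduced locally with the standard library. This sketch uses `shutil` and a temporary directory. It does not show how `lab_client` packs or unpacks folders internally, and the helper names and folder names are made up for the example.

```python
import shutil
import tempfile
from pathlib import Path


def zip_folder(folder, archive_base):
    """Zip the contents of `folder` and return the path of the created archive."""
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(folder))


def unpack(archive_path, target_dir):
    """Unpack an archive (format detected from the file extension)."""
    shutil.unpack_archive(str(archive_path), str(target_dir))
    return Path(target_dir)


with tempfile.TemporaryDirectory() as tmp:
    # Create a small folder with one file, similar to the experiment folder above
    source = Path(tmp) / "experiment-output"
    source.mkdir()
    (source / "test.txt").write_text("test content")

    archive = zip_folder(source, Path(tmp) / "experiment-output")
    restored = unpack(archive, Path(tmp) / "restored")
    restored_text = (restored / "test.txt").read_text()

print(restored_text)
```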

# Access Lab Functionality via the Lab API
The ML Lab provides a REST API to access information and functionality. API documentation is available at `<LAB_ENDPOINT>/api`. An API client is also available within the environment, as shown below:

In [None]:
# Deploy unified model as service
env.lab_handler.lab_api.deploy_model(project=env.project, file_key=uploaded_model_key)

In this example, we have deployed the uploaded model via the Lab API as a service. In your Lab Instance, go to `Services` and click `Access` on the deployed model service to access the API Explorer of the Unified Model Service.

<div class="alert alert-info">
A <b>service</b> is a software component that implements and provides specific capabilities. In our landscape, services are deployed as Docker containers and are usually accessible via a REST API over HTTP(S).
</div>

# Use Lab Client via CLI

In [None]:
!lab --help

In [None]:
!lab get-file datasets/news-categorized.csv --project=$env.project

# Next Steps

- [Lab Walkthrough](/docs/walkthrough/lab-walkthrough/): A step-by-step tutorial from project-creation to model-deployment.
- [Unified Model Tutorial](./unified-model-tutorial.ipynb): Package your model logic, requirements, and artifacts into a single self-contained & executable file. 
- [Jupyter Tips & Tricks](./jupyter-tipps.ipynb): Explore some amazing functionalities that you can use with Jupyter within the workspace.
- [Experiment Template](../templates/experiment-template.ipynb): Start your own high-quality reusable experiment notebook with this template.