# API Basics

In this tutorial, we cover the basic functionality of `lineapy` using simple examples.

**Table of Contents**

- [Storing an artifact with save()](#Storing-an-artifact-with-save())
- [Retrieving an artifact with get()](#Retrieving-an-artifact-with-get())
- [Listing artifacts with catalog()](#Listing-artifacts-with-catalog())
- [Using artifacts to build pipelines](#Using-artifacts-to-build-pipelines)

In [1]:
import os
import lineapy
import pandas as pd

First, let’s load the toy data to use.

In [2]:
# Load toy data to use
df = pd.read_csv("data/biometrics.csv")

In [3]:
# View data
df

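| Unnamed: 0 | name | gender | height | weight |
|---|---|---|---|---|
| 0 | John | M | 183 | 85 |
| 1 | Mary | F | 175 | 70 |
| 2 | Nick | M | 170 | 63 |
| 3 | Stacy | F | 162 | 50 |
| 4 | Tom | M | 168 | 75 |
| 5 | Ava | F | 185 | 72 |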


Now, we might be interested in seeing if the data reflects any gender differences in these physical traits.

In [4]:
# Calculate male averages
avg_male_height = df.query("gender == 'M'")["height"].mean()
avg_male_weight = df.query("gender == 'M'")["weight"].mean()

In [5]:
# Calculate female averages
avg_female_height = df.query("gender == 'F'")["height"].mean()
avg_female_weight = df.query("gender == 'F'")["weight"].mean()

In [6]:
# Calculate gender differences
diff_avg_height = avg_male_height - avg_female_height
diff_avg_weight = avg_male_weight - avg_female_weight

In [7]:
# View result
print("Difference in average height:", diff_avg_height)
print("Difference in average weight:", diff_avg_weight)

Difference in average height: -0.3333333333333428
Difference in average weight: 10.333333333333329


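In the current data set, the average heights of males and females are nearly identical (a difference of about -0.33). On the other hand, males are on average about 10.3 heavier than females. With only six records, these figures are illustrative rather than statistically meaningful.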
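
As a sanity check on the numbers above, the same averages can be recomputed without pandas. This is a minimal sketch using only the standard library; the `rows` list simply restates the six records shown earlier, and the helper name `mean_by_gender` is hypothetical rather than part of any library.

```python
from statistics import mean

# The six records from the table above: (name, gender, height, weight)
rows = [
    ("John", "M", 183, 85),
    ("Mary", "F", 175, 70),
    ("Nick", "M", 170, 63),
    ("Stacy", "F", 162, 50),
    ("Tom", "M", 168, 75),
    ("Ava", "F", 185, 72),
]

HEIGHT, WEIGHT = 2, 3  # tuple positions of the two numeric columns


def mean_by_gender(records, gender, column):
    """Average one numeric column (by tuple index) over rows of one gender."""
    return mean(r[column] for r in records if r[1] == gender)


diff_height = mean_by_gender(rows, "M", HEIGHT) - mean_by_gender(rows, "F", HEIGHT)
diff_weight = mean_by_gender(rows, "M", WEIGHT) - mean_by_gender(rows, "F", WEIGHT)

print(round(diff_height, 2), round(diff_weight, 2))
```

Rounded to two decimals, the two differences agree with the values printed by the notebook cell.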

## Storing an artifact with `save()`

Let’s say we are particularly interested in tracking the average height difference. For instance, we might want to use this variable later for population-level modeling.

The `save()` method allows us to store a variable's value *and* history as a data type called `LineaArtifact`. Note that a `LineaArtifact` holds more than the final state of the variable: it also captures the complete development process behind the variable, which allows for full reproducibility. For more information about artifacts in LineaPy, please check the Concepts section of the documentation.

The method requires two arguments: the variable to save and the string name to save it as. It returns the saved artifact.

In [8]:
# Store a variable as an artifact
artifact = lineapy.save(diff_avg_height, "gender_diff_avg_height")

In [9]:
# Check object type
print(type(artifact))

<class 'lineapy.graph_reader.apis.LineaArtifact'>


A `LineaArtifact` object exposes two main things:

- `value`: Final state of the artifact
- `get_code()`: Minimal essential code to get to the final state of the artifact

Hence, for the current artifact, we see:

In [10]:
# Check the final state of the artifact
print(artifact.value)

-0.3333333333333428


In [11]:
# Check minimal essential code to get to the final state of the artifact
print(artifact.get_code())

import pandas as pd
df = pd.read_csv("data/biometrics.csv")
avg_male_height = df.query("gender == 'M'")["height"].mean()
avg_female_height = df.query("gender == 'F'")["height"].mean()
diff_avg_height = avg_male_height - avg_female_height



Note that code irrelevant to the artifact (e.g., operations relating only to `diff_avg_weight`) has been stripped out. This process is known as “slicing”.
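
To make the idea of slicing concrete, here is a toy backward slicer built on Python's `ast` module. It is only a sketch of the general technique and makes no claim about how LineaPy implements slicing internally: it looks at top-level statements only, is conservative about reassignments, and ignores side effects. The function name `backward_slice` and the variable `notebook_code` are hypothetical.

```python
import ast


def backward_slice(source: str, target: str) -> str:
    """Return the top-level statements that `target` may depend on."""
    statements = ast.parse(source).body
    needed = {target}  # names whose definitions we still have to keep
    kept = []
    # Walk from the last statement to the first, following dependencies back.
    for stmt in reversed(statements):
        defined, used = set(), set()
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name):
                # Store context means assignment; anything else counts as a read.
                (defined if isinstance(node.ctx, ast.Store) else used).add(node.id)
            elif isinstance(node, ast.alias):
                # `import pandas as pd` binds the name `pd`.
                defined.add((node.asname or node.name).split(".")[0])
        if defined & needed:
            kept.append(stmt)
            needed |= used
    return "\n".join(ast.get_source_segment(source, s) for s in reversed(kept))


# The notebook cells so far, as one string (parsed only, never executed).
notebook_code = '''
import os
import pandas as pd
df = pd.read_csv("data/biometrics.csv")
avg_male_height = df.query("gender == 'M'")["height"].mean()
avg_male_weight = df.query("gender == 'M'")["weight"].mean()
avg_female_height = df.query("gender == 'F'")["height"].mean()
avg_female_weight = df.query("gender == 'F'")["weight"].mean()
diff_avg_height = avg_male_height - avg_female_height
diff_avg_weight = avg_male_weight - avg_female_weight
'''

sliced = backward_slice(notebook_code, "diff_avg_height")
print(sliced)
```

For this straight-line input, the sketch keeps the pandas import, the `read_csv` call, and the three height-related assignments, and drops `import os` and every weight-related line.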

## Retrieving an artifact with `get()`

We can also retrieve any stored artifact using the `get()` method. This comes in handy when we work across multiple sessions/phases of a project (or even across different projects) as we can easily build on the previous work.

For example, say we have done other exploratory analyses and are finally starting our work on population-level modeling. This is likely done in a new Jupyter notebook (possibly in a different subdirectory) and we need an easy way to load artifacts from our past work. We can use the `get()` method for this.

The method takes the string name of the artifact as its argument and returns the corresponding artifact, like so:

In [12]:
# Retrieve a saved artifact
artifact2 = lineapy.get("gender_diff_avg_height")

Let’s confirm that we loaded the artifact correctly:

In [13]:
# Check the final state of the artifact
print(artifact2.value)

-0.3333333333333428


In [14]:
# Check minimal essential code to get to the final state of the artifact
print(artifact2.get_code())

import pandas as pd
df = pd.read_csv("data/biometrics.csv")
avg_male_height = df.query("gender == 'M'")["height"].mean()
avg_female_height = df.query("gender == 'F'")["height"].mean()
diff_avg_height = avg_male_height - avg_female_height



## Listing artifacts with `catalog()`

Over time, we will likely forget which artifacts we saved and under what names. The `catalog()` method lists all previously saved artifacts, like so:

In [15]:
# NBVAL_IGNORE_OUTPUT

# List all saved artifacts
lineapy.catalog()

biometrics_df_preprocessed:2022-04-14T09:07:13 created on 2022-04-14 09:07:13.976459
biometrics_preprocessed:2022-04-14T12:54:45 created on 2022-04-14 12:54:45.318541
biometrics_preprocessed:2022-04-14T12:54:59 created on 2022-04-14 12:55:00.001584
biometrics_preprocessed:2022-04-14T13:01:49 created on 2022-04-14 13:01:49.844173
biometrics_df_preprocessed:2022-04-14T13:03:01 created on 2022-04-14 13:03:01.885532
biometrics_preprocessed:2022-04-14T13:05:17 created on 2022-04-14 13:05:17.947993
biometrics_preprocessed:2022-04-14T13:05:49 created on 2022-04-14 13:05:49.589564
biometrics_preprocessed:2022-04-14T13:06:08 created on 2022-04-14 13:06:08.299027
gender_diff_avg_height:2022-04-14T15:26:28 created on 2022-04-14 15:26:28.257525
gender_diff_avg_weight:2022-04-14T15:29:57 created on 2022-04-14 15:29:57.683048
gender_diff_avg_height:2022-04-14T15:41:03 created on 2022-04-14 15:41:03.239228

We can reference this list to decide which artifacts to load to continue our work.

Note that the catalog lists each artifact together with its creation time, and that the same artifact name can appear more than once: multiple versions can be stored under one name. To retrieve a particular version of the artifact, we can pass the optional argument `version` (e.g., `"2022-04-10T20:33:52"`), like so:

In [16]:
# Get version info of the first artifact saved in current tutorial
desired_version = artifact.version

In [17]:
# NBVAL_IGNORE_OUTPUT

# Check the version value
print(desired_version)
print(type(desired_version))

2022-04-14T15:41:03
<class 'str'>


In [18]:
# Retrieve the same version of the artifact
artifact3 = lineapy.get("gender_diff_avg_height", version=desired_version)

In [19]:
# NBVAL_IGNORE_OUTPUT

# Confirm the right version has been retrieved
print(artifact3.name)
print(artifact3.version)

gender_diff_avg_height
2022-04-14T15:41:03
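
The versioning behavior described in this section can be modeled with a small in-memory store. This is a toy illustration only, not LineaPy's storage layer: the class `ToyArtifactStore` and its methods are hypothetical, the stored values are made up, and it assumes that omitting `version` returns the most recent version, which the tutorial does not state explicitly.

```python
from datetime import datetime


class ToyArtifactStore:
    """Hypothetical in-memory stand-in for a versioned artifact catalog."""

    def __init__(self):
        self._versions = {}  # name -> {version string: value}

    def save(self, name, value, when):
        # Use a second-resolution timestamp as the version label.
        version = when.strftime("%Y-%m-%dT%H:%M:%S")
        self._versions.setdefault(name, {})[version] = value
        return version

    def get(self, name, version=None):
        versions = self._versions[name]
        if version is None:
            # Timestamps in this format sort chronologically as strings,
            # so max() picks the newest version (an assumed default).
            version = max(versions)
        return versions[version]

    def catalog(self):
        return [(name, v) for name, vs in self._versions.items() for v in vs]


store = ToyArtifactStore()
# Made-up values, saved under the same name at two different times.
first = store.save("gender_diff_avg_height", -0.25, datetime(2022, 4, 14, 15, 26, 28))
store.save("gender_diff_avg_height", -0.33, datetime(2022, 4, 14, 15, 41, 3))

print(store.get("gender_diff_avg_height"))                 # newest version
print(store.get("gender_diff_avg_height", version=first))  # pinned version
```

Pinning a version string, as done with `desired_version` above, keeps later saves under the same name from silently changing what gets loaded.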


## Using artifacts to build pipelines

Now consider the case where our source data (i.e. `biometrics.csv`) gets updated. Moreover, the update is not a one-time event; the data is planned to be updated on a regular basis as new participant records arrive.

Since the `gender_diff_avg_height` artifact was derived from the `biometrics.csv` data, this means that we need to rerun the artifact’s code lest its value be stale. Given the recurring updates in the source data, we may want to build and schedule a pipeline to automatically rerun the artifact’s code on a regular basis.

Because an artifact captures the complete development process, LineaPy makes it easy for us to turn the desired artifact into a deployable pipeline. For instance, [Airflow](https://airflow.apache.org/) is a popular tool for pipeline building and management, and we can turn the `gender_diff_avg_height` artifact into a set of files that can be deployed as an Airflow DAG, like so:

In [20]:
# Retrieve the desired artifact
artifact4 = lineapy.get("gender_diff_avg_height")

In [21]:
# NBVAL_IGNORE_OUTPUT

# Build an Airflow pipeline using a LineaPy artifact
lineapy.to_pipeline(
    artifacts=[artifact4.name],
    pipeline_name="demo_pipeline",
    framework="AIRFLOW",
    output_dir="output/00_api_basics/demo_pipeline/",
)

PosixPath('output/00_api_basics/demo_pipeline')

where

- `artifacts` is the list of artifact names to be used for the pipeline
- `pipeline_name` is the name of the pipeline
- `output_dir` is the location to put the files for running the pipeline
- `framework` is the name of the orchestration framework to use (currently `SCRIPTS` and `AIRFLOW` are supported)

And we see the following files have been generated:

In [22]:
# NBVAL_IGNORE_OUTPUT

# Check the generated files for running the pipeline
os.listdir("output/00_api_basics/demo_pipeline/")

['demo_pipeline_requirements.txt',
 'demo_pipeline_Dockerfile',
 'demo_pipeline_dag.py',
 'demo_pipeline.py']

where

- `[PIPELINE-NAME].py` contains the artifact’s (sliced) code packaged as a function
- `[PIPELINE-NAME]_dag.py` uses the packaged function to define the pipeline
- `[PIPELINE-NAME]_requirements.txt` lists dependencies for running the pipeline
- `[PIPELINE-NAME]_Dockerfile` contains commands to set up the environment to run the pipeline

These files, once placed in the location that Airflow expects (usually `dags/` under Airflow’s home directory), should let us immediately execute the pipeline from the Airflow UI or CLI.

Note that a pipeline can include more than one artifact. Say we are now also interested in using the average *weight* difference in our population-level modeling. We then want this variable to be tracked and updated on a regular basis as well, in which case we can store it as another artifact and build a combined pipeline with both the height-related and the weight-related artifacts:

In [23]:
# Store weight variable as an artifact too
artifact5 = lineapy.save(diff_avg_weight, "gender_diff_avg_weight")

In [24]:
# NBVAL_IGNORE_OUTPUT

# Build an Airflow pipeline using two LineaPy artifacts
lineapy.to_pipeline(
    artifacts=[artifact4.name, artifact5.name],
    pipeline_name="demo_pipeline2",
    framework="AIRFLOW",
    output_dir="output/00_api_basics/demo_pipeline2/",
)

PosixPath('output/00_api_basics/demo_pipeline2')

In [25]:
# NBVAL_IGNORE_OUTPUT

# Check the generated files for running the pipeline
os.listdir("output/00_api_basics/demo_pipeline2/")

['demo_pipeline2_requirements.txt',
 'demo_pipeline2_dag.py',
 'demo_pipeline2.py',
 'demo_pipeline2_Dockerfile']

For a more detailed illustration of pipeline building, please see the [pipeline building tutorial](https://docs.lineapy.org/en/latest/tutorials/04_build_pipelines.html).