<p align="center">
    <a href="https://colab.research.google.com/drive/1Dh0wlXNEbr6YaB1wqadQhFC_hwKBe8nd?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>
</p>

In [1]:
import lineapy

In [2]:
%%sh
cp -r ./.lineapy ~

# Use LineaPy to Discover and Trace Past Work

## Scenario

Data science is often a team effort in which one person's work builds on results from another's. For instance, a data scientist building a model may use features pre-processed by colleagues.

## What might happen next?

When using results generated by other people, we may encounter issues such as missing values, suspicious-looking numbers, and uninformative or unintelligible variable names. When that happens, we need to check how these results came into being in the first place. Often, this means tracing back the code that generated the result in question (e.g., a feature table). In practice, this can be challenging because:

* It may not be clear who produced the result.

* Even if we know who to ask, the person may not remember where the exact version of the code is.

* The person may have overwritten the code without version control.

* The person may have left the organization without a proper handover of the relevant knowledge.

Any of these can make it nearly impossible to identify the root of the issue, which may render the result unreliable or even unusable.

## How can LineaPy help here?

In LineaPy, any intermediate result from a data science process (e.g., a table or a model) can be stored as an "artifact", which encapsulates both the value and the code that produced it (you can read more about LineaPy artifacts [here](https://docs.lineapy.org/en/latest/fundamentals/concepts.html#artifact)). This makes it straightforward to trace back the code that produced the result in question:

```python
import lineapy

# List all artifacts
lineapy.catalog()

# Retrieve the desired artifact
artifact = lineapy.get("artifact_name")

# Retrieve code that generated the artifact
print(artifact.get_code())
```

## What will we learn in the rest of the notebook?

In the demo below, we will walk through a hands-on example of discovering and tracing past code using LineaPy artifacts. Specifically, we start with an issue with a pre-trained model we want to use, and then systematically trace back the past code to identify the root cause of the issue.

We strongly encourage you to try this demo on your own and check out the official [documentation](https://docs.lineapy.org/en/latest/index.html) to learn about more use cases of LineaPy.

## Demo

### Spotting model misbehavior

Consider the case where we are trying to use a pre-trained model to make predictions. We can list past artifacts to see which models are available.

In [3]:
# List past artifacts
lineapy.catalog()

iris_preprocessed:0 created on 2022-05-14 02:28:54.416753
iris_preprocessed:1 created on 2022-05-14 02:29:13.409242
iris_preprocessed:2 created on 2022-05-14 02:29:18.527883
iris_preprocessed:3 created on 2022-05-14 02:29:40.464884
toy_artifact:0 created on 2022-05-14 02:30:30.905480
toy_artifact:1 created on 2022-05-14 02:30:36.688703
toy_artifact:2 created on 2022-05-14 02:30:44.837418
iris_model:0 created on 2022-05-14 02:31:04.725541
iris_model:1 created on 2022-05-14 02:31:59.457096
iris_model:2 created on 2022-05-14 02:32:13.827217

Say we decide to use the latest version of "iris_model", so we retrieve it.

In [4]:
# Retrieve desired model artifact
model_artifact = lineapy.get("iris_model", version=2)
model = model_artifact.get_value()
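The version number above was read off the catalog listing by eye. As a minimal sketch, assuming the listing is available as plain text in the `name:version created on ...` format printed above (the `latest_versions` helper is hypothetical and not part of LineaPy), the highest version per artifact could be picked out like this:

```python
import re

# Listing text copied from the catalog output shown above
listing = """\
iris_preprocessed:0 created on 2022-05-14 02:28:54.416753
iris_preprocessed:1 created on 2022-05-14 02:29:13.409242
iris_preprocessed:2 created on 2022-05-14 02:29:18.527883
iris_preprocessed:3 created on 2022-05-14 02:29:40.464884
toy_artifact:0 created on 2022-05-14 02:30:30.905480
toy_artifact:1 created on 2022-05-14 02:30:36.688703
toy_artifact:2 created on 2022-05-14 02:30:44.837418
iris_model:0 created on 2022-05-14 02:31:04.725541
iris_model:1 created on 2022-05-14 02:31:59.457096
iris_model:2 created on 2022-05-14 02:32:13.827217
"""

# Matches lines of the form "<name>:<version> created on ..."
ENTRY = re.compile(r"^(?P<name>[^:\s]+):(?P<version>\d+) created on ")


def latest_versions(text):
    """Hypothetical helper: map each artifact name to its highest listed version."""
    latest = {}
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            name = match["name"]
            version = int(match["version"])
            latest[name] = max(version, latest.get(name, -1))
    return latest


versions = latest_versions(listing)
print(versions["iris_model"])
```

Note that the model's training code may depend on a specific, not necessarily latest, version of another artifact, so version numbers are worth recording explicitly.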

We can then use the model to make predictions. Specifically, we are trying to use an iris's petal width to predict its sepal width.

In [5]:
import pandas as pd

# Enter data to make predictions on
df = pd.DataFrame({
    "petal.width": [1.3, 5.2, 0.3, 1.5, 4.9],
    "d_versicolor": [1, 0, 0, 1, 0],
    "d_virginica": [0, 1, 0, 0, 1],
})

# Check
df

| | petal.width | d_versicolor | d_virginica |
|---|---|---|---|
| 0 | 1.3 | 1 | 0 |
| 1 | 5.2 | 0 | 1 |
| 2 | 0.3 | 0 | 0 |
| 3 | 1.5 | 1 | 0 |
| 4 | 4.9 | 0 | 1 |


In [6]:
# Predict
df["sepal.width.pred"] = model.predict(df)

# Check
df

| | petal.width | d_versicolor | d_virginica | sepal.width.pred |
|---|---|---|---|---|
| 0 | 1.3 | 1 | 0 | 1.3 |
| 1 | 5.2 | 0 | 1 | 5.2 |
| 2 | 0.3 | 0 | 0 | 0.3 |
| 3 | 1.5 | 1 | 0 | 1.5 |
| 4 | 4.9 | 0 | 1 | 4.9 |


Something looks suspicious here: every prediction is exactly equal to the `petal.width` predictor it was computed from. It appears that the model was trained incorrectly. Can we get better insight into how the model was trained?
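This kind of check does not have to be done by eye. The following is a minimal sketch, using the values from the prediction table above, of a test that flags a model whose output merely echoes one of its inputs. The `echoes_feature` helper and its tolerance are illustrative assumptions, not part of LineaPy or pandas.

```python
import pandas as pd

# Values copied from the prediction table above
preds = pd.DataFrame({
    "petal.width": [1.3, 5.2, 0.3, 1.5, 4.9],
    "sepal.width.pred": [1.3, 5.2, 0.3, 1.5, 4.9],
})


def echoes_feature(frame, feature, prediction, tol=1e-9):
    """Hypothetical helper: True if every prediction matches the feature within tol."""
    diff = (frame[prediction] - frame[feature]).abs()
    return bool((diff <= tol).all())


is_echo = echoes_feature(preds, "petal.width", "sepal.width.pred")
print(is_echo)
```

A positive result is a hint rather than proof of a bug, but it is a strong signal that the training target and the predictor were not independent.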

### Tracing past code

With a LineaPy artifact, this becomes as easy as the following:

In [7]:
# Get code that generated the model artifact
print(model_artifact.get_code())

from sklearn.linear_model import LinearRegression

import lineapy

art_df_processed = lineapy.get("iris_preprocessed", version=2)
df_processed = art_df_processed.get_value()
mod = LinearRegression()
mod.fit(
    X=df_processed[["petal.width", "d_versicolor", "d_virginica"]],
    y=df_processed["sepal.width"],
)



Based on the code above, there does not seem to be an issue in the training process.

Suspicion then shifts to the training data, which, as the code shows, was itself loaded from an artifact (`iris_preprocessed`, version 2). Let's retrieve that artifact.

In [8]:
# Retrieve desired data artifact
data_artifact = lineapy.get("iris_preprocessed", version=2)
data = data_artifact.get_value()

In [9]:
# Check
data

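| | sepal.length | sepal.width | petal.length | petal.width | variety | variety_color | d_versicolor | d_virginica |
|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 0.2 | 1.4 | 0.2 | Setosa | green | 0 | 0 |
| 1 | 4.9 | 0.2 | 1.4 | 0.2 | Setosa | green | 0 | 0 |
| 2 | 4.7 | 0.2 | 1.3 | 0.2 | Setosa | green | 0 | 0 |
| 3 | 4.6 | 0.2 | 1.5 | 0.2 | Setosa | green | 0 | 0 |
| 4 | 5.0 | 0.2 | 1.4 | 0.2 | Setosa | green | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 2.3 | 5.2 | 2.3 | Virginica | red | 0 | 1 |
| 146 | 6.3 | 1.9 | 5.0 | 1.9 | Virginica | red | 0 | 1 |
| 147 | 6.5 | 2.0 | 5.2 | 2.0 | Virginica | red | 0 | 1 |
| 148 | 6.2 | 2.3 | 5.4 | 2.3 | Virginica | red | 0 | 1 |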


Upon inspection, we notice that the `sepal.width` and `petal.width` columns have exactly the same values, which certainly looks like an error! How did this come about? Is it an inherent error in the source data? Let's dig into the code for this data artifact.
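Before looking at that code, the visual impression can be confirmed programmatically, which matters when the displayed table is truncated. Below is a minimal sketch that lists every pair of columns with identical values; the `identical_columns` helper is hypothetical, and the small `sample` frame only reuses a few rows from the table above rather than the full artifact.

```python
from itertools import combinations

import pandas as pd


def identical_columns(frame):
    """Hypothetical helper: return pairs of column names whose values are identical."""
    return [
        (left, right)
        for left, right in combinations(frame.columns, 2)
        if frame[left].equals(frame[right])
    ]


# A few numeric rows copied from the table above (not the full data set)
sample = pd.DataFrame({
    "sepal.length": [5.1, 4.9, 6.7, 6.3],
    "sepal.width": [0.2, 0.2, 2.3, 1.9],
    "petal.length": [1.4, 1.4, 5.2, 5.0],
    "petal.width": [0.2, 0.2, 2.3, 1.9],
})

dupes = identical_columns(sample)
print(dupes)
```

On the real artifact, the same function could be applied to the frame returned by `data_artifact.get_value()`.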

In [10]:
# Get code that generated the data artifact
print(data_artifact.get_code())

import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv"
)
color_map = {"Setosa": "green", "Versicolor": "blue", "Virginica": "red"}
df["variety_color"] = df["variety"].map(color_map)
df["sepal.width"] = df["petal.width"]
df["d_versicolor"] = df["variety"].apply(lambda x: 1 if x == "Versicolor" else 0)
df["d_virginica"] = df["variety"].apply(lambda x: 1 if x == "Virginica" else 0)



Finally, we are able to identify the root of the issue: the data pre-processing code contains an erroneous operation that overwrites `sepal.width` with the values of `petal.width`:

```python
df["sepal.width"] = df["petal.width"]
```
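This single assignment is enough to explain the predictions seen earlier. The sketch below uses NumPy's least-squares solver as a stand-in for `LinearRegression()` (an ordinary least-squares fit with an intercept by default) on a handful of illustrative rows. When the target is a copy of one predictor, the fitted coefficient on that predictor is 1 and all other terms are 0, so the model simply returns `petal.width`. The row values are assumptions chosen for illustration, not the full data set.

```python
import numpy as np

# Illustrative rows: two per species
petal_width = np.array([0.2, 0.4, 1.3, 1.5, 2.3, 1.9])
d_versicolor = np.array([0, 0, 1, 1, 0, 0])
d_virginica = np.array([0, 0, 0, 0, 1, 1])

# Target overwritten with the predictor, as in the flawed pre-processing step
sepal_width = petal_width.copy()

# Design matrix: intercept, petal.width, and the two species dummies
X = np.column_stack([
    np.ones_like(petal_width),
    petal_width,
    d_versicolor,
    d_virginica,
])

# Ordinary least squares: minimizes ||X @ coef - sepal_width||
coef, *_ = np.linalg.lstsq(X, sepal_width, rcond=None)

# Order: intercept, petal.width, d_versicolor, d_virginica
print(np.round(coef, 6))
```

In other words, the training code was fine; it faithfully learned the identity relationship that the corrupted data contained.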

But why was this inserted? We can check the entire session code for the data artifact, which includes any comments by the author:

In [11]:
# Get full session code that generated the data artifact
print(data_artifact.get_session_code())

# List past artifacts
lineapy.catalog()
import os
import lineapy
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression







# Load data
df = pd.read_csv("https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv")

# View data
df



# Plot petal vs. sepal width
df.plot.scatter("petal.width", "sepal.width")
plt.show()


# Calculate correlation coefficient
df[["petal.width", "sepal.width"]].corr(method="pearson")



# Map each species to a color
color_map = {"Setosa": "green", "Versicolor": "blue", "Virginica": "red"}
df["variety_color"] = df["variety"].map(color_map)

# Plot petal vs. sepal width by species
df.plot.scatter("petal.width", "sepal.width", c="variety_color")
plt.show()



# Check species and their counts
df["variety"].value_counts()



# Create dummy variables encoding species
df["d_versicolor"] = df["variety"].apply(lambda x: 1 if x == "Versicolor" else 0)
df["d_virginica"] = df["variety"].ap

With some digging, we spot the line:

```python
# Swap variable
df["sepal.width"] = df["petal.width"]
```

Based on the comment, it appears that the author was trying to swap values between the `sepal.width` and `petal.width` columns but left the swap incomplete: the line overwrites `sepal.width` without first saving its original values, so both columns end up holding `petal.width`. We may not fully understand why this operation was necessary, but we now better understand what caused the model misbehavior. More importantly, we can now fix the error with confidence.
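If a swap really was the intent (an assumption based only on the author's comment), a minimal sketch of a complete swap in pandas looks like the following. The sample values are hypothetical; whether the columns should be swapped at all is a question for the original author.

```python
import pandas as pd

# Hypothetical sample values for illustration
df = pd.DataFrame({
    "sepal.width": [3.5, 3.0, 3.2],
    "petal.width": [0.2, 0.3, 0.4],
})

# Both right-hand sides are evaluated (and copied) before either column is
# overwritten, so no original values are lost
df["sepal.width"], df["petal.width"] = (
    df["petal.width"].copy(),
    df["sepal.width"].copy(),
)

print(df)
```

If the swap was not needed, the fix is simply to remove the offending line from the pre-processing code and regenerate the data and model artifacts.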

## Recap

In this demo, we saw that LineaPy can make it easy to discover and systematically inspect past work. This helps preserve and grow institutional knowledge, which in turn can improve the organization's insight and efficiency over time.

To learn more about LineaPy, check out the project [documentation](https://docs.lineapy.org/en/latest/index.html).