# Using Lineapy to Share Data

Data science work often involves creating data to share with a different project, or vice versa. There are many ways to share data: a shared file system (e.g., [S3](https://aws.amazon.com/s3/), [Databricks File System](https://docs.databricks.com/data/databricks-file-system.html)), a database, or passing files around via Git, Slack, or email.

Linea aims to provide a simple way to share data while also supporting all existing modalities of sharing and ensuring reproducibility.

In this notebook, we'll first go over how `lineapy` supports sharing through its own APIs, and then how you can use `lineapy` alongside your existing workflow.

## `lineapy` in-house support

`lineapy.save(a_variable, "the_name")` captures both the value of `a_variable` and the code required to produce it. `lineapy.get("the_name")` retrieves the stored information.

As shown in `2_APIs.ipynb`, `neighbor_area_art.value` retrieves the dataframe saved in `1_Explorations.ipynb` through the `.get` API.

```python
neighbor_area_art = lineapy.get("neighbothood_area_mean")
```

### Serialization mechanism

Currently, we use only the `pickle` library, which we recognize is limited. Our roadmap includes support for specialized use cases, such as saving a `matplotlib` figure as a PNG (more portable and readable than a raw pickle).
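To make the limitation concrete, here is a minimal sketch using only the standard library. It is not `lineapy`'s serialization code; the `summary` dictionary and the `can_pickle` helper are hypothetical names made up for illustration. It shows that plain data survives a pickle round trip, while an object such as a lambda cannot be pickled at all.

```python
import pickle

# Made-up data for illustration; a plain structure round-trips unchanged.
summary = {"neighborhood": "NAmes", "mean_area": 1310.5, "counts": [3, 5, 8]}
restored = pickle.loads(pickle.dumps(summary))


def can_pickle(obj) -> bool:
    """Return True if `obj` can be serialized with pickle."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Functions are pickled by name, and a lambda has no importable name,
# so pickling it fails.
lambda_ok = can_pickle(lambda x: x + 1)

print(restored == summary)
print(lambda_ok)
```

Values that fail this kind of check are the ones that need a specialized format or an external storage path.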

If you would like to use a specific serialization mechanism, such as PyTorch's C++ support via [`torch.jit.save`](https://pytorch.org/docs/stable/notes/serialization.html), you can rely on `lineapy`'s support for external file systems and databases, which we discuss below.

### Sharing

In the demo, the state for `lineapy` artifacts is stored in a local SQLite file, which facilitates sharing for the same user on the same machine. Under the hood, we use a combination of Python pickling and file storage.
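As a rough mental model of "SQLite plus pickled files", consider the toy store below. It is a sketch under assumptions, not `lineapy`'s actual schema or code: the `ToyArtifactStore` class, its table layout, and the file naming are all hypothetical.

```python
import pickle
import sqlite3
import tempfile
from pathlib import Path


class ToyArtifactStore:
    """Hypothetical store: metadata in SQLite, pickled values on disk."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)
        self.conn = sqlite3.connect(str(root / "artifacts.sqlite"))
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS artifacts ("
            "name TEXT PRIMARY KEY, code TEXT NOT NULL, value_path TEXT)"
        )

    def save(self, name: str, value, code: str) -> None:
        # The value goes to a pickle file; the database only keeps its path.
        value_path = self.root / f"{name}.pkl"
        value_path.write_bytes(pickle.dumps(value))
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO artifacts VALUES (?, ?, ?)",
                (name, code, str(value_path)),
            )

    def get(self, name: str):
        row = self.conn.execute(
            "SELECT code, value_path FROM artifacts WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            raise KeyError(name)
        code, value_path = row
        return pickle.loads(Path(value_path).read_bytes()), code


store = ToyArtifactStore(Path(tempfile.mkdtemp()))
store.save("mean_area", 1310.5, "mean_area = df['Gr_Liv_Area'].mean()")
value, code = store.get("mean_area")
print(value, code)
```

Because both the database and the pickle files live on the local disk, a layout like this only helps users who can reach that disk, which is why sharing across users needs a hosted database.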

To share between users, one could use a hosted database (docs coming soon!). We are also developing Linea Platform, which will make sharing even easier in the future.

## Supporting sharing through external filesystems/databases

### File System

If you want to use your own platform and tools to share values, `lineapy` can still add value by capturing the process that affects external state.

Consider the example in `Demo_1_Prepprocessing.ipynb`. The following command captures the process:

```python
artifact = lineapy.save(lineapy.file_system, "cleaned_data_housing")
```

`lineapy.file_system` refers to changes in the file system. In that particular example, the change was made through the `to_csv` function call.

```python
cleaned_data.filter(
    regex="Neighborhood=.|Gr_Liv_Area|Garage_Area|SalePrice"
).to_csv("outputs/cleaned_data_housing.csv", index=False)
```

Here, invoking `lineapy.get("cleaned_data_housing")` would return an artifact with the corresponding `code`, but a `None` value. From the code, the user can figure out how to access the value (in this case, on the local file system).
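In practice, the consumer reads the path out of the artifact's code and loads the file directly. The sketch below illustrates that step with the standard library only. The temporary directory and the two data rows are made up so that the example runs on its own; with pandas installed, `pandas.read_csv` on the same path would work equally well.

```python
import csv
import tempfile
from pathlib import Path

# Stand-in for the producer step: write a tiny CSV at the relative path
# used in the captured code. The rows are made up for illustration.
project_dir = Path(tempfile.mkdtemp())
csv_path = project_dir / "outputs" / "cleaned_data_housing.csv"
csv_path.parent.mkdir(parents=True)
with csv_path.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Gr_Liv_Area", "Garage_Area", "SalePrice"])
    writer.writerow([1656, 528, 215000])
    writer.writerow([896, 730, 105000])

# Consumer step: the artifact value is None, so load the file from the
# path found in the artifact's code.
with csv_path.open(newline="") as f:
    rows = list(csv.DictReader(f))

print(len(rows), rows[0]["SalePrice"])
```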

Note that using S3 with `boto3` would also count towards a `file_system` change (a remote file system).

### DB

Similarly, `lineapy.db` refers to changes in databases. Taking the previous example in `Demo_1_Prepprocessing.ipynb`, we could change `to_csv` to [to_sql](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html), as follows:

```python
import sqlite3

# pandas accepts a plain sqlite3 connection as `con`
conn = sqlite3.connect("my_db.sqlite")
cleaned_data.filter(
    regex="Neighborhood=.|Gr_Liv_Area|Garage_Area|SalePrice"
).to_sql("users", con=conn)
```

Then the equivalent lineapy capture would be the following (`s/file_system/db`).

```python
artifact = lineapy.save(lineapy.db, "cleaned_data_housing DB process")
```
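Assuming the database artifact behaves like the file-system one (code captured, value accessed separately), a consumer would query the table named in the captured code. The sketch below is illustrative only: it uses an in-memory SQLite database and made-up rows in place of `my_db.sqlite`, so it runs without pandas or any existing file.

```python
import sqlite3

# Stand-in for the producer step; an in-memory database replaces
# 'my_db.sqlite' so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (Gr_Liv_Area INTEGER, SalePrice INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1656, 215000), (896, 105000)],
)
conn.commit()

# Consumer step: query the table named in the captured code.
rows = conn.execute(
    "SELECT Gr_Liv_Area, SalePrice FROM users ORDER BY SalePrice"
).fetchall()
conn.close()
print(rows)
```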

Currently, we do not support capturing the change for a specific file or a specific network connection. Please open an issue if you'd like to see this feature supported!


### Contributing

**Instrumenting more libraries**: `lineapy` learns which functions modify which external state through manual instrumentation. You can find the documentation in [lib_annotations.rst](https://github.com/LineaLabs/lineapy/blob/main/docs/source/lib_annotations.rst) (live documentation coming soon!).

**Supporting more side-effects**: We are in the process of adding `network` (e.g., `requests.get` and `requests.put`). If there are any other side-effects that you care about, please let us know!

## Closing

We are still rapidly iterating on the ideal UX. If you have any feedback on our design, feature requests, or other use cases we haven't discussed, please let us know!