# Sciflow 🔬
*Iterate from idea to impact*


> This library bridges the gap between research and production for Data Science. It takes a different approach from many others: the interactive notebook is the primary driver of exploration. Exploration-driven development is a different paradigm from most software engineering exercises. `sciflow` combines the strengths of the notebook environment (flexibility, access to data and constant feedback) with the strengths of production-like workflows (resilience and dedicated compute).

## Features
* Separate the aspects of your research that need to be validated in a production environment from your exploration journey.
* Know that your notebooks are of high quality, have a consistent style and are always in good working order.
* Automatically convert your exploratory workflows to highly visible, managed workflows.
* Explore faster: execute your current notebook-based workflow while writing it, accelerating the pace at which you can explore your problem space.
* Track experiments simply with almost no modification to your existing workflow.

## Purpose
Scientists can benefit from:
* Trying more ideas
* Easier collaboration
* Workflow portability
* Tracked Experiments

Data Science teams can benefit from:
* Shorter production lead times
* Easier collaboration between anybody involved in turning research into production
* Knowing what is important from a deployment perspective within a notebook
* Knowing that research code still has a quality standard

*Note: Sciflow is built on top of the excellent `nbdev` library (by fastai) and wouldn't be possible without the literate programming environment that library provides.*

# Getting Started

## Install

* `pip install sciflow`

# How to annotate your workflow

See `examples/hello_sciflow.ipynb` for details of how to mark your notebook-based workflows for conversion. Once your notebooks have been annotated correctly you can start using `SciFlow`.

## 1. Initialise your project using `sciflow_init`

\> `sciflow_init`

\> `source ~/.sciflow/env` # Sourcing the environment file adds the variables to the environment. The path is left unquoted so that the shell expands `~`.

## 2. Edit your settings.ini file

You can use the settings.ini file from `sciflow` as a base for your own changes.

## 3. Test your setup

### 3.1 Convert your notebooks to Python modules, then test that all is working

> The order may seem unusual (build the modules, then test the notebooks), but your project notebooks will likely import from other modules in your project, so you want to test against the latest version of these modules.

```console
~/codedir/project: sciflow_build_lib
~/codedir/project: nbdev_test_nbs --pause=3
```

### 3.2 Ensure your notebooks have a consistent style

```console
~/codedir/project: sciflow_tidy
```

### 3.3 Inspect your notebooks for any potential quality issues [Experimental]

```console
~/codedir/project: sciflow_lint
```

### 3.4 Convert your Python Modules to Workflows

```console
~/codedir/project: sciflow_metaflow
~/codedir/project: sciflow_sagemaker
~/codedir/project: sciflow_check_metaflows
~/codedir/project: sciflow_check_sagemaker_flows
```

### 3.5 Running your workflows

```console
~/codedir/project: sciflow_run_metaflows
~/codedir/project: sciflow_run_sagemaker_flows
```

# Environment Constraints

`SciFlow`'s aim is to provide a repeatable path from idea to production without you having to deploy and manage a complex tool suite. This is achieved by constraining ourselves to a toolset that covers many industry Data Science users today. We assume that you are running a notebook environment on a public cloud provider, have access to lake storage files (e.g. Parquet), and can query these files through an ODBC interface.

# Motivation

A popular approach to productionising research code in notebooks is to lift the workflow from notebooks into Python modules via a manual rewrite. Sometimes one person will have the skills and ability to perform this task, but mostly it will be a collaborative effort to understand the code and rewrite it into a more robust, production-ready form.

As a team you can get good at this process, but there is always a lag between the exploration being performed now and the ability to test it in live environments. If the person writing the experimental approach were able, without switching hats to "development mode", to see their exploration validated safely by production users, this could be transformative for the impact of Data Science in your organisation.

The discipline of Software Development offers many approaches to ensuring code is of high quality, with consistent style and testable assertions on known data. Data Science and Machine Learning introduce new Software Development challenges that are not as well explored by the community. In Data Science, behaviour is a more fundamental unit for testing than logic. For instance, tests on real user data will show whether your model works as expected; synthetic data is unlikely to achieve the same.

Modern notebook environments have a common practical advantage over many local development environments: Data Scientists can write code against real data, with all the quirks, anomalies and underlying behaviour that they are actually trying to develop a solution for. We start by assuming only that there are high levels of uncertainty and that the code needs not only to work but also to be well conditioned to the real operating environment.

# Progressive Consolidation

Progressive consolidation is a development ideology for exploratory programming. You start by writing code that optimises for speed of exploration. You will mostly have scripted code cells with few or no functions and minimal code re-use. As time goes by you will learn which parts of your notebook are important, and you consolidate those to a higher level of quality.

In `sciflow` we simplify quality to mean functions and tests for those functions. `nbdev` lets us mark functions and code that is important for the production element of our work using the `# export` directive. 

See https://nbdev.fast.ai/tutorial.html to get started using nbdev.

## How to use

If any functions are important and you want to bring them with you to your production experiment then export them using nbdev.
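
As a rough illustration of progressive consolidation, the sketch below shows a scripted cell and the consolidated function that might later replace it. The data, the function name `mean_ignoring_missing` and the inline `assert` test are hypothetical and chosen only for illustration. `# export` is the nbdev directive mentioned above and is an ordinary comment when run outside a notebook.

```python
# Early exploration: a scripted cell with no functions, optimised for speed.
raw_prices = [10.0, 12.5, None, 11.0]  # hypothetical sample data
clean_prices = [p for p in raw_prices if p is not None]
mean_price = sum(clean_prices) / len(clean_prices)

# Later consolidation: the logic that proved important becomes a function,
# marked for export so that it is carried into the generated module.

# export
def mean_ignoring_missing(values):
    """Return the mean of `values`, skipping None entries."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no non-missing values to average")
    return sum(present) / len(present)

# A simple test kept next to the function in the notebook.
assert mean_ignoring_missing(raw_prices) == mean_price
```

The scripted version stays useful as a record of the exploration, while only the exported function and its test need to meet the higher quality bar.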

# Components

* Ensure notebooks meet style standards (`nbqa`)
* Create workflow from notebook steps (`nbdev`)
* Experiment tracking (`sacred/incense`)

# Concepts

## Steps

Steps are functions which can be executed independently. Structuring your code into steps brings many benefits:

* Save Time:
    * Checkpointing: can skip having to run expensive steps again
    * Re-use: write a step once and use in many different workflows
    
* Easier to debug:
    * You can narrow down where the problem is happening more quickly and can use print statements or a debugger within the few lines of a function rather than a longer script.
    
* Portability
    * Steps can be run on different machines; potentially in parallel.
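
The checkpointing benefit can be sketched in plain Python as follows. This is not `sciflow`'s implementation: `run_with_checkpoint`, `expensive_step` and the pickle-file layout are hypothetical names and choices made for this example only.

```python
import pickle
import tempfile
from pathlib import Path


def run_with_checkpoint(step, checkpoint_path, *args):
    """Run `step(*args)` once; later calls load the saved result instead."""
    checkpoint_path = Path(checkpoint_path)
    if checkpoint_path.exists():
        with checkpoint_path.open("rb") as f:
            return pickle.load(f)
    result = step(*args)
    with checkpoint_path.open("wb") as f:
        pickle.dump(result, f)
    return result


call_count = {"expensive": 0}  # tracks how often the step body really runs


def expensive_step(n):
    """Stand-in for a slow step, e.g. a large query or a model fit."""
    call_count["expensive"] += 1
    return sum(i * i for i in range(n))


checkpoint_file = Path(tempfile.mkdtemp()) / "expensive_step.pkl"
first = run_with_checkpoint(expensive_step, checkpoint_file, 1000)
second = run_with_checkpoint(expensive_step, checkpoint_file, 1000)  # loaded from disk
```

Because the step is a self-contained function, its result can be stored and reused without re-running the code that produced it. Note that this simple sketch does not invalidate the checkpoint when the arguments change.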

## Flows

Flow is short for workflow. Flows help you structure your work into something that can be executed from start to finish. Structuring your work into flows has the following benefits:

* Ordered execution: anyone can run your workflow because the order is defined.
* Portability: writing your research as a flow helps to draw out dependencies on libraries or anything that can run in your environment but not elsewhere.
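
A minimal sketch of ordered execution, assuming each step is a function that receives the previous step's output. The step names and the `run_flow` helper are hypothetical and are not part of the `sciflow`, Metaflow or SageMaker APIs.

```python
def load_data():
    """First step: produce the raw data (hard-coded here for illustration)."""
    return [3, 1, 2]


def preprocess(data):
    """Second step: put the data into a known order."""
    return sorted(data)


def summarise(data):
    """Final step: reduce the sorted data to a few summary values."""
    return {"min": data[0], "max": data[-1], "count": len(data)}


# The flow is the explicit, ordered list of steps.
FLOW = [load_data, preprocess, summarise]


def run_flow(steps):
    """Run the first step with no input, then feed each result to the next step."""
    result = steps[0]()
    for step in steps[1:]:
        result = step(result)
    return result


summary = run_flow(FLOW)
```

Anyone with this file can reproduce `summary` by calling `run_flow(FLOW)`, because the order is written down rather than implied by which notebook cells were run.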

# Commands
       
* nbdev_diff_nbs                   
* nbdev_fix_merge          
* nbdev_test_nbs  
* nbdev_clean_nbs                                  
* nbdev_new     
* sciflow_tidy
* sciflow_build_lib
* sciflow_prepare
* sciflow_build
* sciflow_generate
* sciflow_check_flows
* sciflow_release