# nbmodular

> Convert notebooks to modular code.

Convert data science notebooks with poor modularity to fully modular notebooks that are automatically exported as python modules.

## Motivation

In data science, it is common to develop quickly and experimentally in notebooks, with little regard for software engineering practices and modularity. It can be challenging to start working on someone else's notebooks when they are not split into separate functions and contain a great deal of code duplicated across notebooks. This makes it difficult to understand the logic in terms of semantically separate units, to see the commonalities and differences between the notebooks, and to extend, generalize, and configure the current solution.

## Objectives

`nbmodular` is a library designed to help convert the cells of a notebook into separate functions with clear dependencies in terms of inputs and outputs. This is done through a combination of tools that semi-automatically infer the data flow in the code, based on mild assumptions about its structure. It also helps test the current logic and compare it against the modularized solution, to make sure that the refactored code is equivalent to the original.

## Features

- Convert cells to functions.
- The logic of a single function can be written across multiple cells.
- Optional: processed cells can continue to operate as cells or be only used as functions from the moment they are converted.
- Create an additional pipeline function that provides the data-flow from the first to the last function call in the notebook.
- Write all the notebook functions to a separate python module.
- Compare the result of the pipeline with the result of running the original notebook.
- Converted functions act as nodes in a dependency graph. These nodes can optionally hold the values of local variables for inspection outside of the function. This is similar to having a single global scope, which is the original situation. Since this is memory-consuming, it is optional and may not be the default.
- Optional: Once we are able to construct a graph, we may be able to draw it or show it in text, and pass it to DAG processors that can run functions sequentially or in parallel.
- Persist the inputs and outputs of functions, so that we may decide to reuse previous results without running the whole notebook. 
- Optional: if we have the dependency graph and persisted inputs / outputs, we may decide to only run those cells that are predecessors of the current one, i.e., the ones that provide the inputs needed by the current cell. 
- Optional: if we associate a hash code to input data, we may only run the cells when the input data changes. Similarly, if we associate a hash code with AST-converted function code, we may only run those cells whose code has been updated. 
- Optional: have a mechanism for indicating test examples that go into different test python files. 
- Optional: the output of a test cell can be used for assertions, where we require that the current output is the same as the original one.
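
The idea of hashing AST-converted code, mentioned in the features above, can be illustrated with a short sketch. This is not `nbmodular`'s implementation: the helper name `code_fingerprint` and the choice of SHA-256 are assumptions made for illustration. The point is that hashing the parsed tree rather than the raw text makes the hash insensitive to spacing and comments, so only real logic changes would trigger a re-run.

```python
import ast
import hashlib


def code_fingerprint(source):
    """Hash the AST of `source` (hypothetical helper, not part of nbmodular)."""
    tree = ast.parse(source)
    # ast.dump omits line and column attributes by default, so layout,
    # spacing, and comments do not affect the result.
    normalized = ast.dump(tree)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


original = "a = a + d\nb = b + d\n"
reformatted = "a = a+d  # same logic, different spacing\nb = b+d\n"
modified = "a = a + d\nb = b - d\n"

print(code_fingerprint(original) == code_fingerprint(reformatted))  # True
print(code_fingerprint(original) == code_fingerprint(modified))     # False
```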

## Roadmap

- [ ] Convert cell code into functions:
    - [x] Inputs are those variables detected in current cell and also detected in previous cells. This solution requires that created variables have unique names across the notebook. However, even if a new variable with the same name is defined inside the cell, the resulting function is still correct.
    - Outputs are, at this moment, all the variables detected in current cell that are also detected in posterior cells. 
    
- Filter out outputs:
    - Variables detected in current cell, and also detected in previous cells, might not be needed as outputs of the current cell, if the current cell doesn't modify those variables. To detect potential modifications:
        - AST: 
            - If variable appears only on the right of assign statements or in if statements.
            - If it appears only as argument of functions which we know don't modify the variable, such as `print`.
        - Comparing variable values before and after cell:
            - Good for small variables where doing a deep copy is not computationally expensive.
        - Using type checker:
            - Making the variable `Final` and using mypy or other type checker to see if it is modified in the code.
    - Provide hints:
        - Variables that come from other cells might not be needed as output. The remaining are most probably needed.
        - Variables that are modified are clearly needed.
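
The input and output rule described in the roadmap can be sketched with Python's standard `ast` module. This is a simplified illustration under stated assumptions, not the library's code: the helper names `variables_in` and `inputs_and_outputs` are hypothetical, and builtins such as `print` are excluded so that they are not mistaken for notebook variables.

```python
import ast
import builtins

BUILTIN_NAMES = set(dir(builtins))


def variables_in(code):
    """Names used anywhere in `code`, ignoring builtins such as `print`."""
    names = {n.id for n in ast.walk(ast.parse(code)) if isinstance(n, ast.Name)}
    return names - BUILTIN_NAMES


def inputs_and_outputs(cells):
    """Apply the roadmap rule to an ordered list of cell sources.

    Inputs: variables in the current cell that also appear in previous cells.
    Outputs: variables in the current cell that also appear in later cells.
    """
    per_cell = [variables_in(code) for code in cells]
    result = []
    for i, here in enumerate(per_cell):
        before = set().union(*per_cell[:i])
        after = set().union(*per_cell[i + 1:])
        result.append((sorted(here & before), sorted(here & after)))
    return result


cells = [
    "a = 2\nb = 3\nc = a+b\nprint (a+b)",   # get_initial_values
    "d = 10",                               # get_d
    "a = a + d\nb = b + d\nc = c + d",      # add_all
    "print (a, b, c, d)",                   # print_all
]
for inputs, outputs in inputs_and_outputs(cells):
    print(inputs, outputs)
# [] ['a', 'b', 'c']
# [] ['d']
# ['a', 'b', 'c', 'd'] ['a', 'b', 'c', 'd']
# ['a', 'b', 'c', 'd'] []
```

Note that `d` is reported as an output of the third cell even though that cell never modifies it. This is the over-reporting that the "Filter out outputs" item aims to remove.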

## Install

```sh
pip install nbmodular
```

## Usage

Load the IPython extension:

In [None]:
%load_ext nbmodular.core.cell2func

Use the IPython cell magic `%%function`, passing it the name of the function you want:

In [None]:
%%function get_initial_values
a = 2
b = 3
c = a+b
print (a+b)

5



This runs the code in the cell, placing the variables in the global scope so we can access them in other cells:

In [None]:
a, b, c

(2, 3, 5)

At the same time it defines the function `get_initial_values`, as follows:

In [None]:
%print get_initial_values

def get_initial_values():
    a = 2
    b = 3
    c = a+b
    print (a+b)



which we can invoke again by calling it:

In [None]:
get_initial_values ()

5
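
To make the dual behavior above concrete (the cell runs in the shared scope, and a function is defined from the same code), here is a minimal sketch in plain Python. It does not use IPython and is not how `nbmodular` is implemented. The helper `cell_to_function` and the `shared` dictionary standing in for the notebook's global scope are assumptions for illustration.

```python
import textwrap


def cell_to_function(name, cell_code, namespace):
    """Run `cell_code` in `namespace` and return equivalent function source.

    Hypothetical helper: arguments and return values are not handled here.
    """
    exec(cell_code, namespace)  # variables land in the shared namespace
    body = textwrap.indent(cell_code.strip(), "    ")
    return f"def {name}():\n{body}\n"


shared = {}  # stands in for the notebook's global scope
cell = "a = 2\nb = 3\nc = a+b\nprint (a+b)"
source = cell_to_function("get_initial_values", cell, shared)

print(shared["a"], shared["b"], shared["c"])  # values are visible "globally"
print(source)

exec(source, shared)               # define the function from the generated source
shared["get_initial_values"]()     # calling it runs the cell logic again
```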


The detected inputs and outputs change dynamically according to the other functions that depend on them. For example, let us define three more functions: `get_d`, which defines a fourth variable; `add_all`, which depends on the two previous functions; and `print_all`, which prints all the previous variables and therefore depends on the functions that define them:

In [None]:
%%function get_d
d = 10

In [None]:
%print get_d

def get_d():
    d = 10



In [None]:
d

10

In [None]:
%%function add_all
a = a + d
b = b + d
c = c + d

In [None]:
%%function print_all
print (a, b, c, d)

12 13 15 10


In [None]:
%print all

def get_initial_values():
    a = 2
    b = 3
    c = a+b
    print (a+b)
    return a,b,c

def get_d():
    d = 10
    return d

def add_all(a, b, c, d):
    a = a + d
    b = b + d
    c = c + d
    return a,b,c,d

def print_all(a, b, c, d):
    print (a, b, c, d)



In [None]:
%print_pipeline

def index_pipeline ():
    a, b, c = get_initial_values ()
    d = get_d ()
    a, b, c, d = add_all (a, b, c, d)
    print_all (a, b, c, d)



In [None]:
index_pipeline()

5
12 13 15 10
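
The data flow in the pipeline above can also be read as a dependency graph between functions. The following sketch derives that graph from each function's arguments and return values. The helper names `build_dependencies` and `predecessors` and the tuple-based function descriptions are assumptions for illustration, not part of the library's API.

```python
def build_dependencies(functions):
    """Map each function name to the functions that produce its arguments.

    `functions` is an ordered list of (name, arguments, return_values).
    """
    last_producer = {}  # variable name -> function that last returned it
    depends_on = {}
    for name, arguments, return_values in functions:
        depends_on[name] = {last_producer[a] for a in arguments if a in last_producer}
        for variable in return_values:
            last_producer[variable] = name
    return depends_on


def predecessors(name, depends_on):
    """All direct and indirect dependencies of `name`."""
    found = set()
    stack = [name]
    while stack:
        for parent in depends_on[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


functions = [
    ("get_initial_values", [], ["a", "b", "c"]),
    ("get_d", [], ["d"]),
    ("add_all", ["a", "b", "c", "d"], ["a", "b", "c", "d"]),
    ("print_all", ["a", "b", "c", "d"], []),
]
graph = build_dependencies(functions)
print(sorted(graph["add_all"]))                  # ['get_d', 'get_initial_values']
print(sorted(graph["print_all"]))                # ['add_all']
print(sorted(predecessors("print_all", graph)))  # ['add_all', 'get_d', 'get_initial_values']
```

With such a graph, only the predecessors of a given function would need to run to reproduce its inputs.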


We can examine the local variables, arguments and results from a given function:

In [None]:
get_d_info = %function_info get_d

Since `get_d_info` is a Bunch, we can inspect its fields using either dict-like or attribute-like syntax. The keys are:

In [None]:
get_d_info.keys()

dict_keys(['idx', 'original_code', 'name', 'values_before', 'tab_size', 'call', 'values_here', 'variables_here', 'new_variables', 'previous_variables', 'posterior_variables', 'arguments', 'return_values', 'code'])
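
For readers unfamiliar with the Bunch pattern, the sketch below shows a minimal stand-in that supports both access styles. It is an illustrative class written for this example, not the class that `nbmodular` actually uses, and the field values are copied from the outputs shown in this section.

```python
class Bunch(dict):
    """Dictionary whose keys can also be read and written as attributes."""

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails, so dict methods
        # such as keys() keep working.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key) from None

    def __setattr__(self, key, value):
        self[key] = value


info = Bunch(name="get_d", arguments=[], return_values=["d"])
print(info["name"], info.name)   # get_d get_d
info.arguments = ["x"]           # attribute writes update the dict
print(info["arguments"])         # ['x']
print(list(info.keys()))         # ['name', 'arguments', 'return_values']
```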

In [None]:
get_d_info.arguments

[]

In [None]:
get_d_info.return_values

['d']

In [None]:
get_d_info.values_here

{'d': 10}

In [None]:
get_d_info.values_before.keys()

dict_keys(['In', 'Out', 'a', 'b', 'c'])

In [None]:
for k, v in get_d_info.values_before.items():
    if k not in {'In', 'Out'}:
        print (k, v, '\n')

a 2 

b 3 

c 5 



In [None]:
print (get_d_info.code)

def get_d():
    d = 10
    return d



In [None]:
print (get_d_info.original_code)

d = 10



In [None]:
print (get_d_info.get_ast (code=get_d_info.original_code))

Module(
  body=[
    Assign(
      targets=[
        Name(id='d', ctx=Store())],
      value=Constant(value=10))],
  type_ignores=[])
None


In [None]:
print (get_d_info.get_ast (code=get_d_info.code))

Module(
  body=[
    FunctionDef(
      name='get_d',
      args=arguments(
        posonlyargs=[],
        args=[],
        kwonlyargs=[],
        kw_defaults=[],
        defaults=[]),
      body=[
        Assign(
          targets=[
            Name(id='d', ctx=Store())],
          value=Constant(value=10)),
        Return(
          value=Name(id='d', ctx=Load()))],
      decorator_list=[])],
  type_ignores=[])
None


Now, we can define another function in a cell that uses variables from the previous function.

In [None]:
%print all

def get_initial_values():
    a = 2
    b = 3
    c = a+b
    print (a+b)
    return a,b,c

def get_d():
    d = 10
    return d

def add_all(a, b, c, d):
    a = a + d
    b = b + d
    c = c + d
    return a,b,c,d

def print_all(a, b, c, d):
    print (a, b, c, d)



In [None]:
get_d = %function_info get_d

In [None]:
get_d.arguments, get_d.return_values

([], ['d'])

In [None]:
get_d.values_here

{'d': 10}

In [None]:
this_function = get_d
this_function.arguments = [x for x in this_function.arguments if x in this_function.values_here]
this_function.return_values = [x for x in this_function.return_values if x in this_function.values_here]
this_function.update_code (arguments=this_function.arguments, return_values=this_function.return_values, display=True)

def get_d():
    d = 10
    return d



In [None]:
%print all

def get_initial_values():
    a = 2
    b = 3
    c = a+b
    print (a+b)
    return a,b,c

def get_d():
    d = 10
    return d

def add_all(a, b, c, d):
    a = a + d
    b = b + d
    c = c + d
    return a,b,c,d

def print_all(a, b, c, d):
    print (a, b, c, d)



In [None]:
%print_pipeline

def index_pipeline ():
    a, b, c = get_initial_values ()
    d = get_d ()
    a, b, c, d = add_all (a, b, c, d)
    print_all (a, b, c, d)



## print

In [None]:
cell_processor = %cell_processor

In [None]:
f=get_d_info

In [None]:
f.keys()

dict_keys(['idx', 'original_code', 'name', 'values_before', 'tab_size', 'call', 'values_here', 'variables_here', 'new_variables', 'previous_variables', 'posterior_variables', 'arguments', 'return_values', 'code'])

In [None]:
%print_pipeline

def index_pipeline ():
    a, b, c = get_initial_values ()
    d = get_d ()
    a, b, c, d = add_all (a, b, c, d)
    print_all (a, b, c, d)



In [None]:
cell_processor.file_name

'index'

In [None]:
def function_pipeline (self):
    code = ""
    for func in cell_processor.function_list:
        argument_list_str = ", ".join(func.arguments)
        return_list_str = f'{", ".join(func.return_values)} = ' if len(func.return_values)>0 else ''
        code += f'{return_list_str}{func.name} ({argument_list_str})\n'
    return code
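
The loop above can be tried outside a live notebook with stand-in function descriptors. In this sketch, `SimpleNamespace` objects replace the entries of `cell_processor.function_list`, and the function takes the list as a parameter. Both choices, and the name `pipeline_body`, are assumptions made so that the example is self-contained.

```python
from types import SimpleNamespace


def pipeline_body(function_list):
    """Emit one call line per function, wiring return values to arguments."""
    lines = []
    for func in function_list:
        arguments = ", ".join(func.arguments)
        # Functions without return values are called without an assignment.
        assignment = f'{", ".join(func.return_values)} = ' if func.return_values else ""
        lines.append(f"{assignment}{func.name} ({arguments})")
    return "\n".join(lines)


function_list = [
    SimpleNamespace(name="get_initial_values", arguments=[], return_values=["a", "b", "c"]),
    SimpleNamespace(name="get_d", arguments=[], return_values=["d"]),
    SimpleNamespace(name="add_all", arguments=["a", "b", "c", "d"], return_values=["a", "b", "c", "d"]),
    SimpleNamespace(name="print_all", arguments=["a", "b", "c", "d"], return_values=[]),
]
print(pipeline_body(function_list))
# a, b, c = get_initial_values ()
# d = get_d ()
# a, b, c, d = add_all (a, b, c, d)
# print_all (a, b, c, d)
```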

In [None]:
d = 10

In [None]:
%%function add_new_values
a = a + d
b = b + d
c = c + d
print (a, b, c, d)

22 23 25 10


In [None]:
%print add_new_values

def add_new_values(a, b, c, d):
    a = a + d
    b = b + d
    c = c + d
    print (a, b, c, d)



By default, we can use variables from the previous cell as we normally do, i.e., values are still global. However, we can also opt to run the code encapsulated in 