# nbmodular

> Convert notebooks to modular code.

Convert data science notebooks with poor modularity to fully modular notebooks that are automatically exported as python modules.

## Motivation

In data science, it is common to develop quickly and experimentally in notebooks, with little regard for software engineering practices and modularity. It can be challenging to start working on someone else's notebooks when they have no modularity in terms of separate functions and a great deal of duplicated code across notebooks. This makes it difficult to understand the logic in terms of semantically separate units, to see the commonalities and differences between the notebooks, and to extend, generalize, and configure the current solution.

## Objectives

`nbmodular` is a library designed to help convert the cells of a notebook into separate functions with clear dependencies in terms of inputs and outputs. This is done through a combination of tools that semi-automatically infer the data flow in the code, based on mild assumptions about its structure. It also helps test the current logic and compare it against a modularized solution, to make sure that the refactored code is equivalent to the original.

## Features

- [x] Convert cells to functions.
- [x] The logic of a single function can be written across multiple cells.
- [x] Functions can be either regular functions or unit test functions.
- [x] Functions and tests are exported to separate python modules. 
- [ ] TODO: use nbdev to sync the exported python module with the notebook code, so that changes to the module are reflected back in the notebook.
- [x] Processed cells can continue to operate as cells or be only used as functions.
- [x] A pipeline function is automatically created and updated. This pipeline provides the data-flow from the first to the last function call in the notebook.
- [x] Functions act as nodes in a dependency graph. These nodes can optionally hold the values of local variables for inspection outside of the function. This is similar to having a single global scope, which is the original situation. Since this is memory-consuming, storing local variables is optional.
- [x] Local variables are persisted to disk, so that we may decide to reuse previous results without running the whole notebook.
- [ ] TODO: Once we are able to construct a graph, we may be able to draw it or show it in text, and pass it to DAG (directed acyclic graph) processors that can run functions sequentially or in parallel.
- [ ] TODO: if we have the dependency graph and persisted inputs / outputs, we may decide to only run those cells that are predecessors of the current one, i.e., the ones that provide the inputs needed by the current cell. 
- [ ] TODO: if we associate a hash code to input data, we may only run the cells when the input data changes. Similarly, if we associate a hash code with AST-converted function code, we may only run those cells whose code has been updated. 
- [ ] TODO:  the output of a test cell can be used for assertions, where we require that the current output is the same as the original one.
- [ ] TODO: Compare the result of the pipeline with the result of running the original notebook.
- [ ] TODO: Currently, AST processing is used to assess whether variables are modified in the cell or just read. This only gives an estimate. We may want to compare the values of existing variables before and after running the code in the cell. We may also use a type checker such as mypy to assess whether a variable is immutable in the cell (e.g., mark the variable as `Final` and see if mypy complains).
- [ ] TODO: allow selected tests to be used as examples in docstrings. An optional flag would indicate that the next cell's output should be converted to text and included as example output in the docstring.
- [ ] TODO: make it possible to write the tests in the same module as the functions, where each test goes after the function it tests. This can serve as a form of documentation for the function, especially if the test code is not included in the function's docstring.
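
The idea of hashing AST-converted code, mentioned in the list above, can be illustrated with the standard library alone. The following is only a sketch of the idea, not part of `nbmodular` (the feature is still a TODO); `code_fingerprint` is a hypothetical helper name.

```python
import ast
import hashlib


def code_fingerprint(code: str) -> str:
    """Hash the AST of `code`, so that formatting and comments do not matter."""
    tree = ast.parse(code)
    # ast.dump omits line/column attributes by default, so the dump
    # depends only on the structure of the code.
    normalized = ast.dump(tree)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


original = "a = 2\nb = 3\nc = a+b\n"
reformatted = "a = 2  # first operand\nb = 3\n\nc = a + b\n"
modified = "a = 2\nb = 4\nc = a+b\n"

# Same structure gives the same fingerprint; a real change gives a new one.
same = code_fingerprint(original) == code_fingerprint(reformatted)
changed = code_fingerprint(original) != code_fingerprint(modified)
print(same, changed)
```

With a fingerprint like this, a cell would only need to be re-run when its fingerprint differs from the one stored during the previous execution.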

## Install

```sh
pip install nbmodular
```

## Usage

Load the IPython extension:

<div style="background-color: rgb(250, 250, 250);">
```python
%load_ext nbmodular.core.cell2func
```
</div>

This allows us to use the following magic commands, among others:

- `function <name_of_function_to_define>`
- `print <name_of_previous_function>`
- `function_info <name_of_previous_function>`
- `print_pipeline`

Let's go through them one by one.

### function

#### Basic usage

The magic command `function` runs the code in the cell as usual and, at the same time, performs a number of additional steps. Let's go over each one in turn through the following example:

<div style="background-color: rgb(250, 250, 250);">
```python
%%function two_plus_three
a = 2
b = 3
c = a+b
print (f'The result of adding {a}+{b} is {c}')
```

In [6]:
%%function two_plus_three
#|echo: false
a = 2
b = 3
c = a+b
print (f'The result of adding {a}+{b} is {c}')

The result of adding 2+3 is 5


In [5]:
(a, b, c)

(2, 3, 5)

As we can see, the previous cell runs just as it normally would. In addition, the code is parsed into an abstract syntax tree (AST) and analyzed, and the result of this analysis is stored in a new object called `two_plus_three_info`. Let's look at some of the information provided by this object.

First, the object stores the list of variables that were created inside this function:

In [6]:
two_plus_three_info.created_variables

['a', 'b', 'c']

By default, this object also stores the values of those variables:

In [7]:
two_plus_three_info.current_values

{'a': 2, 'b': 3, 'c': 5}

It stores the names of the variables used by this function and created before calling it:

In [8]:
two_plus_three_info.previous_variables

[]

In the previous example, there are no previous variables. We will see later an example which makes use of previous variables.
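
To give an idea of the kind of analysis involved, here is a minimal sketch based on Python's standard `ast` module. It is not `nbmodular`'s actual implementation: `analyze_cell` is a hypothetical helper, and it only handles simple top-level statements (loops, comprehensions, and augmented assignments would need extra care).

```python
import ast
import builtins


def analyze_cell(code: str):
    """Return (created, previous) variable names for simple cell code."""
    created, previous = [], []
    for statement in ast.parse(code).body:
        names = [n for n in ast.walk(statement) if isinstance(n, ast.Name)]
        # Names read before being assigned in the cell must come from outside.
        for node in names:
            if (isinstance(node.ctx, ast.Load)
                    and node.id not in created
                    and node.id not in previous
                    and not hasattr(builtins, node.id)):
                previous.append(node.id)
        # Names assigned in this statement are created by the cell.
        for node in names:
            if isinstance(node.ctx, ast.Store) and node.id not in created:
                created.append(node.id)
    return created, previous


cell = "a = 2\nb = 3\nc = a+b\nprint (f'The result of adding {a}+{b} is {c}')"
print(analyze_cell(cell))                    # (['a', 'b', 'c'], [])
print(analyze_cell("counter = counter + 1")) # (['counter'], ['counter'])
```

The second call shows a variable that is both read from outside the cell and assigned a new value inside it.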

In [9]:
#| hide
assert (a, b, c) == (2, 3, 5)
assert two_plus_three_info.created_variables==['a', 'b', 'c']
assert two_plus_three_info.current_values=={'a': 2, 'b': 3, 'c': 5}
assert two_plus_three_info.previous_variables==[]
assert two_plus_three_info.arguments==[]
assert two_plus_three_info.return_values==[]

In addition to this, the cell magic `%%function <my_function>` creates a new function `<my_function>` which can be called normally later on. In our previous example, a function called `two_plus_three` has been created. Let's call it:

In [10]:
two_plus_three ()

The result of adding 2+3 is 5


We can also print the code of that function, using the line magic `%print <my_function>`:

<div style="background-color: rgb(250, 250, 250);">
```python
%print two_plus_three
```
</div>

In [11]:
%print two_plus_three 
#| echo: false

def two_plus_three():
    a = 2
    b = 3
    c = a+b
    print (f'The result of adding {a}+{b} is {c}')



Using the cell magic `%%function` is handy when we want to inspect the variables created in the cell. In the near future, we plan to make it possible to prevent some of the variables from persisting outside of the cell, to avoid memory issues. We are considering the following options:

- Delete the variable (`del`), with the disadvantage that we won't be able to inspect it later on.
- Delete the variable only when a new cell magic is executed, so that we can still inspect the variables created in the last cell, and then move on to execute the next cell, at which point we remove previous variables that were memory-consuming.
- In the longer term, delete variables based on how much memory they consume, using a threshold parameter.

Let's see now an example which uses variables created elsewhere:

In [12]:
my_previous_variable=10

<div style="background-color: rgb(250, 250, 250);">
```python
%%function add_100
my_previous_variable = my_previous_variable + 100
print (f'The result of adding 100 to my_previous_variable is {my_previous_variable}')
```
</div>

In [14]:
%%function add_100
#|echo: false
my_previous_variable = my_previous_variable + 100
print (f'The result of adding 100 to my_previous_variable is {my_previous_variable}')

The result of adding 100 to my_previous_variable is 110

Since `my_previous_variable` was defined before the cell ran, it is listed among the previous variables:

In [None]:
add_100_info.previous_variables

['my_previous_variable']

`my_previous_variable` is also included in the list of `created_variables`, since a new value for this variable has been generated:

In [None]:
add_100_info.created_variables

['my_previous_variable']

In [None]:
#| hide
assert my_previous_variable==110
assert add_100_info.created_variables==['my_previous_variable']
assert add_100_info.previous_variables==['my_previous_variable']
assert add_100_info.current_values=={'my_previous_variable': 110}
assert add_100_info.arguments==[]
assert add_100_info.return_values==[]

In [None]:
#| hide
cell_processor = %cell_processor
function_call = ('add_100', "#|echo: false\nmy_previous_variable = my_previous_variable + 100\nprint (f'The result of adding 100 to my_previous_variable is {my_previous_variable}')\n")
cell_processor.process_function_call (*function_call)
assert add_100_info.previous_variables==['my_previous_variable']

The result of adding 100 to my_previous_variable is 210


All the functions created so far can be printed at once using `print all`: 

<div style="background-color: rgb(250, 250, 250);">
```python
%print all
```
</div>

In [None]:
%print all
#| echo: false

def two_plus_three():
    a = 2
    b = 3
    c = a+b
    print (f'The result of adding {a}+{b} is {c}')

def add_100(my_previous_variable):
    my_previous_variable = my_previous_variable + 100
    print (f'The result of adding 100 to my_previous_variable is {my_previous_variable}')



They are also written to a Python module with the same name as the notebook (the current notebook being called "index.ipynb"):

In [None]:
!cat ../nbmodular/index.py

#|echo: false
import pandas as pd
def get_my_previous_variable():
    my_previous_variable = 100
    return my_previous_variable

def two_plus_three():
    a = 2
    b = 3
    c = a+b
    print (f'The result of adding {a}+{b} is {c}')
    return a, b, c

def add_100(my_previous_variable):
    my_previous_variable = my_previous_variable + 100
    print (f'The result of adding 100 to my_previous_variable is {my_previous_variable}')
    return my_previous_variable

def multiply_by_two(c):
    d = c*2
    print (f'Two times {c} is {d}')
    return d

def analyze(x):
    x = [1, 2, 3]
    y = [100, 200, 300]
    z = [u+v for u,v in zip(x,y)]
    product = [u*v for u, v in zip(x,y)]
    return x, y, x, y

def determine_approximate_age(name, birthday_year=2000):
    current_year = datetime.datetime.today().year
    approximate_age = current_year-birthday_year
    print (f'hello {name}, your approximate age is {approximate_age}')
    return approximate_age, current_year

def use_current_year(c

In [None]:
#| hide
cell_processor = %cell_processor
function_call = ('hybrid', 'x = 3\nx = x + 4\nprint (x)\n')
cell_processor.process_function_call (*function_call)
cell_processor.process_function_call (*function_call)
assert hybrid_info.arguments==[]
assert hybrid_info.previous_variables==[]

7
7


In [None]:
#|hide
%delete_function hybrid

#### Dynamic outputs

So far, none of the created functions returns any result. This is because no other function needs any of the variables created inside either `two_plus_three` or `add_100`. Let's see what happens when we add a new function that requires the variable `c`, which was created in `two_plus_three`:

<div style="background-color: rgb(250, 250, 250);">
```python
%%function multiply_by_two
#|echo: false
d = c*2
print (f'Two times {c} is {d}')
```

In [None]:
%%function multiply_by_two
#|echo: false
d = c*2
print (f'Two times {c} is {d}')

Two times 5 is 10


In [None]:
#| hide
c = two_plus_three()
multiply_by_two (c)
multiply_by_two_info = %function_info multiply_by_two
assert multiply_by_two_info.d == 10

The result of adding 2+3 is 5
Two times 5 is 10


Our new function makes use of the result computed in `two_plus_three`, so we need that function to return this result. This is done automatically, and the function `two_plus_three` is updated accordingly:

<div style="background-color: rgb(250, 250, 250);">
```python
%print two_plus_three
```

In [None]:
%print two_plus_three
#|echo: false

def two_plus_three():
    a = 2
    b = 3
    c = a+b
    print (f'The result of adding {a}+{b} is {c}')
    return c



We can see that `two_plus_three` now returns `c`. We can call it with the updated signature:

In [None]:
my_new_c = two_plus_three ()
my_new_c

The result of adding 2+3 is 5


5
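
The rule described in this section (a function returns only those variables that a later function needs) can be pictured with a small sketch. This is an illustration under simplifying assumptions, not `nbmodular`'s implementation; `infer_return_values` and the tuple layout are hypothetical.

```python
def infer_return_values(functions):
    """Map each function name to the created variables needed by later functions.

    `functions` is a list of (name, created_variables, previous_variables)
    tuples, given in pipeline order.
    """
    returns = {}
    for position, (name, created, _) in enumerate(functions):
        needed_later = set()
        for _, _, previous in functions[position + 1:]:
            needed_later.update(previous)
        returns[name] = [var for var in created if var in needed_later]
    return returns


functions = [
    ("two_plus_three", ["a", "b", "c"], []),
    ("add_100", ["my_previous_variable"], ["my_previous_variable"]),
    ("multiply_by_two", ["d"], ["c"]),
]
print(infer_return_values(functions))
# {'two_plus_three': ['c'], 'add_100': [], 'multiply_by_two': []}
```

Only `two_plus_three` gets a return value here, because `c` is the only created variable that a later function reads.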

#### Indicating function position

When adding a new function, we can indicate the position of the pipeline at which it should be added. By default, it is added at the end. To indicate the position, pass `--position` to the cell magic:

```ipython
%%function my_function_in_pos_2 --position 2
<my code...>
```

Section `print_pipeline` below includes an example of this.

### print

We can see each of the defined functions with `print my_function`:

<div style="background-color: rgb(250, 250, 250);">
```python
%print multiply_by_two
```

In [None]:
%print multiply_by_two
#|echo: false

def multiply_by_two(c):
    d = c*2
    print (f'Two times {c} is {d}')



We can print all the functions defined so far with `%%function` using `print all`:

<div style="background-color: rgb(250, 250, 250);">
```python
%print all
```

In [None]:
%print all
#|echo: false

def two_plus_three():
    a = 2
    b = 3
    c = a+b
    print (f'The result of adding {a}+{b} is {c}')
    return c

def add_100(my_previous_variable):
    my_previous_variable = my_previous_variable + 100
    print (f'The result of adding 100 to my_previous_variable is {my_previous_variable}')

def multiply_by_two(c):
    d = c*2
    print (f'Two times {c} is {d}')



### print_pipeline

As we add functions to the notebook, a pipeline function is defined. We can print this pipeline with the magic `print_pipeline`:

<div style="background-color: rgb(250, 250, 250);">
```python
%print_pipeline
```

In [None]:
%print_pipeline
#|echo: false

# -----------------------------------------------------
# pipeline
# -----------------------------------------------------
def index_pipeline (test=False, load=True, save=True, result_file_name="index_pipeline"):
    """Pipeline calling each one of the functions defined in this module."""
    
    # load result
    result_file_name += '.pk'
    path_variables = Path ("index") / result_file_name
    if load and path_variables.exists():
        result = joblib.load (path_variables)
        return result


    # save result
    result = Bunch ()
    if save:    
        path_variables.parent.mkdir (parents=True, exist_ok=True)
        joblib.dump (result, path_variables)
    return result



As we can see, the first and last parts of the pipeline function are dedicated to loading previously stored results, if the pipeline was run before, and saving the results of this execution. The central part calls the functions defined so far, using proper inputs and outputs. Having a pipeline function implemented for us is handy to see the data-flow (in terms of inputs and outputs) from the first function call to the last one.

One detail that we can see in the previous pipeline is that the variable `my_previous_variable` has not been defined before being used. However, if we try to call the pipeline function, it will not fail. This is because `my_previous_variable` exists in the global scope, and it is therefore treated as a global variable. If we want to make sure that all variables are local, we can do:

In [None]:
#| hide
raised_exception=False
try:
    index_pipeline()
except Exception as e:
    print (f'could not run pipeline: {e}')
    raised_exception=True
assert not raised_exception
os.remove ('index/index_pipeline.pk')

<div style="background-color: rgb(250, 250, 250);">
```python
%delete_globals
```

In [None]:
%delete_globals
#| echo: false

In [None]:
index_pipeline??

Signature:
index_pipeline(
    test=False,
    load=True,
    save=True,
    result_file_name='index_pipeline',
)
Source:   
def index_pipeline (test=False, load=True, save=True, result_file_name="index_pipeline"):
    """Pipeline calling each one of the functions defined in this module."""
    
    # load result
    result_fil

In [None]:
raised_exception=False
try:
    index_pipeline()
except Exception as e:
    print (f'could not run pipeline: {e}')
    raised_exception=True
assert raised_exception

AssertionError: 

We can then add a new function that will provide a value for `my_previous_variable`:

<div style="background-color: rgb(250, 250, 250);">
```python
%%function get_my_previous_variable --position 0
my_previous_variable = 100
```

In [None]:
%%function get_my_previous_variable --position 0
#| echo: false
my_previous_variable = 100

<div style="background-color: rgb(250, 250, 250);">
```python
%print_pipeline
```

In [None]:
%print_pipeline
#| echo: false

Now we can call the pipeline without issues

In [None]:
index_pipeline()

In [None]:
#| hide
os.remove ('index/index_pipeline.pk')
raised_exception=False
try:
    index_pipeline()
except Exception as e:
    print (f'could not run pipeline: {e}')
    raised_exception=True
assert not raised_exception
os.remove ('index/index_pipeline.pk')

We can see that the returned value for `my_previous_variable` is the original value, since this value was not returned by `add_100`. If we want this function to return that variable, we need to either create another function that makes use of that value, or explicitly indicate that we want `add_100` to return that variable, as follows:

<div style="background-color: rgb(250, 250, 250);">
```python
%%function add_100 --include-output my_previous_variable
my_previous_variable = my_previous_variable + 100
print (f'The result of adding 100 to my_previous_variable is {my_previous_variable}')
```

In [None]:
%%function add_100 --include-output my_previous_variable
#| echo: false
my_previous_variable = my_previous_variable + 100
print (f'The result of adding 100 to my_previous_variable is {my_previous_variable}')

We can see that `my_previous_variable` was added in the output:

<div style="background-color: rgb(250, 250, 250);">
```python
%print add_100
```

In [None]:
%print add_100

Now we can call the function and obtain the output we indicated:

In [None]:
add_100(50)==150

In [None]:
#|hide
assert add_100(50)==150

Another possibility is to modify the signature of a previously defined function using the line magic `add_to_signature`. Let's do that with `multiply_by_two`. As we can see in the code above, this function doesn't return anything at the moment.

<div style="background-color: rgb(250, 250, 250);">
```python
%print multiply_by_two
```

In [None]:
%print multiply_by_two
#| echo: false

Let's call `add_to_signature` on it:

<div style="background-color: rgb(250, 250, 250);">
```python
%add_to_signature multiply_by_two --output d
```

In [None]:
%add_to_signature multiply_by_two --output d
#| echo: false

and check the result:

<div style="background-color: rgb(250, 250, 250);">
```python
%print multiply_by_two
```

In [None]:
%print multiply_by_two
#| echo: false

In [None]:
multiply_by_two (150)

In [None]:
#|hide
assert multiply_by_two(150)==300

The pipeline is updated with these changes:

<div style="background-color: rgb(250, 250, 250);">
```python
%print_pipeline
```

In [None]:
%print_pipeline
#| echo: false

Let's check the result of calling the new pipeline:

In [None]:
#| hide
raised_exception=False
try:
    index_pipeline()
except Exception as e:
    print (f'could not run pipeline: {e}')
    raised_exception=True
assert not raised_exception
os.remove ('index/index_pipeline.pk')

In [None]:
cell_processor.call_history

### function_info

We can get access to many of the details of each of the defined functions by calling `function_info` on a given function name:

<div style="background-color: rgb(250, 250, 250);">
```python
two_plus_three_info = %function_info two_plus_three
```

In [None]:
two_plus_three_info = %function_info two_plus_three
#| echo: false

This allows us to see:

- The name and value (at the time of running) of the local variables, arguments and results from the function:

In [None]:
two_plus_three_info.arguments

In [None]:
two_plus_three_info.current_values

The variables in current_values can be accessed directly as attributes of `two_plus_three_info`:

In [None]:
two_plus_three_info.a, two_plus_three_info.b, two_plus_three_info.c
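
Attribute-style access to stored values, as described above, is a common Python pattern. A minimal sketch of how such an object could work is shown below; `FunctionInfoSketch` is a hypothetical class for illustration only and does not reflect the actual class used by `nbmodular`.

```python
class FunctionInfoSketch:
    """Toy info object exposing stored values as attributes."""

    def __init__(self, current_values):
        self.current_values = dict(current_values)

    def __getattr__(self, name):
        # __getattr__ is only called when normal attribute lookup fails,
        # so regular attributes such as `current_values` are unaffected.
        try:
            return self.__dict__["current_values"][name]
        except KeyError:
            raise AttributeError(name) from None


info = FunctionInfoSketch({"a": 2, "b": 3, "c": 5})
print(info.a, info.b, info.c)
```

Unknown names still raise `AttributeError`, so `hasattr` and similar tools keep working as expected.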

We can also see the return values of the function:

In [None]:
two_plus_three_info.return_values

In [None]:
#|hide
assert two_plus_three_info.arguments==[]
assert two_plus_three_info.current_values=={'a': 2, 'b': 3, 'c': 5}
assert two_plus_three_info.return_values==['c']
assert (two_plus_three_info.a, two_plus_three_info.b, two_plus_three_info.c) == (2, 3, 5)

We can inspect the original code written in the cell...

In [None]:
print (two_plus_three_info.original_code)

the code of the function we just created:

In [None]:
print (two_plus_three_info.code)

... and the abstract syntax trees (ASTs):

In [None]:
print (two_plus_three_info.get_ast (code=two_plus_three_info.original_code))

In [None]:
print (two_plus_three_info.get_ast (code=two_plus_three_info.code))

### cell_processor

This line magic gives us access to the `CellProcessor` object that manages the logic for running the above magic commands, which can come in handy:

<div style="background-color: rgb(250, 250, 250);">
```python
cell_processor = %cell_processor
```

In [None]:
cell_processor = %cell_processor
#| echo: false

## Merging function cells

In order to explore intermediate results, it is convenient to split the code of a function across different cells. This can be done by passing the flag `--merge`:

<div style="background-color: rgb(250, 250, 250);">
```python
%%function analyze
x = [1, 2, 3]
y = [100, 200, 300]
z = [u+v for u,v in zip(x,y)]
```

In [None]:
del x

In [None]:
%%function analyze
#| echo: false
x = [1, 2, 3]
y = [100, 200, 300]
z = [u+v for u,v in zip(x,y)]

In [None]:
z

In [None]:
#| hide
analyze_info=%function_info analyze
assert analyze_info.current_values=={'x': [1, 2, 3], 'y': [100, 200, 300], 'z': [101, 202, 303]}

<div style="background-color: rgb(250, 250, 250);">
```python
%print analyze
```

In [None]:
%print analyze
#| echo: false

<div style="background-color: rgb(250, 250, 250);">
```python
%%function analyze --merge
product = [u*v for u, v in zip(x,y)]
```

In [None]:
%%function analyze --merge
#| echo: false
product = [u*v for u, v in zip(x,y)]

In [None]:
#| hide
analyze_info=%function_info analyze
assert analyze_info.current_values=={'x': [1, 2, 3],
 'y': [100, 200, 300],
 'z': [101, 202, 303],
 'product': [100, 400, 900]}

<div style="background-color: rgb(250, 250, 250);">
```python
%print analyze
```

In [None]:
%print analyze
#| echo: false
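
Conceptually, merging amounts to appending the code of the new cell to the code already stored for that function. The sketch below illustrates this idea only; `register_cell` and `cell_code` are hypothetical names, and the real tool additionally derives arguments and return values.

```python
import textwrap

# Hypothetical registry mapping function names to the cell code seen so far.
cell_code = {}


def register_cell(name, code, merge=False):
    """Store `code` under `name` and return the source of the wrapper function."""
    if merge and name in cell_code:
        # Append the new cell to the code already stored for this function.
        cell_code[name] = cell_code[name].rstrip("\n") + "\n" + code
    else:
        cell_code[name] = code
    body = textwrap.indent(cell_code[name].rstrip("\n"), "    ")
    return f"def {name}():\n{body}\n"


register_cell("analyze", "x = [1, 2, 3]\ny = [100, 200, 300]\nz = [u+v for u,v in zip(x,y)]\n")
source = register_cell("analyze", "product = [u*v for u, v in zip(x,y)]\n", merge=True)
print(source)
```

Calling `register_cell` again without `merge=True` replaces the stored code instead of extending it.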

## Test functions

Test functions are implemented taking `pytest` as target test engine. 

By passing the flag `--test` we indicate that the logic in the cell is dedicated to testing other functions in the notebook.

This has the following consequences:

- The test function is not included in the overall pipeline.
- It has no inputs and outputs.
- Required variables are obtained by calling a *data* function (see below) in the body, rather than taking those as input of the function.

Let's see an example:

<div style="background-color: rgb(250, 250, 250);">
```python
%%function multiply_by_two --test
assert multiply_by_two(150)==300
```

In [None]:
%%function multiply_by_two --test
#|echo: false
assert multiply_by_two(150)==300

Let's look at the code generated for this test function:

<div style="background-color: rgb(250, 250, 250);">
```python
%print test_multiply_by_two --test
```

In [None]:
%print test_multiply_by_two --test
#|echo: false

Now, imagine that we need some code to obtain the input to `multiply_by_two`. We can define a data function that encapsulates this code and returns the input to our test function:

<div style="background-color: rgb(250, 250, 250);">
```python
%%function input_multiply_by_two --test --data
factors=[2, 2, 3, 5, 5]
value_to_multiply = 1
for factor in factors:
    value_to_multiply *= factor
```

In [None]:
%%function input_multiply_by_two --test --data
factors=[2, 2, 3, 5, 5]
value_to_multiply = 1
for factor in factors:
    value_to_multiply *= factor

Now we slightly change `test_multiply_by_two` to use `value_to_multiply` as input of `multiply_by_two`:

<div style="background-color: rgb(250, 250, 250);">
```python
%%function multiply_by_two --test
print(multiply_by_two(value_to_multiply))
```

In [None]:
%%function multiply_by_two --test
#|echo: false
print(multiply_by_two(value_to_multiply))

Let's see how `test_multiply_by_two` is implemented after applying the previous change:

<div style="background-color: rgb(250, 250, 250);">
```python
%print test_multiply_by_two --test
```

In [None]:
%print test_multiply_by_two --test
#|echo: false

We can see that the variable `value_to_multiply` is returned by calling the *"test data"* function `test_input_multiply_by_two`. We use this type of implementation to make it possible to use test engines such as `pytest`, where the test functions need to be self-contained, i.e., they need to operate independently of other functions. Although `pytest` uses fixtures for this purpose, our test data functions provide an alternative to them.

We can see that `test_input_multiply_by_two` returns the required `value_to_multiply`, so that it can be used by `test_multiply_by_two`.

<div style="background-color: rgb(250, 250, 250);">
```python
%print test_input_multiply_by_two --test --data
```

In [None]:
%print test_input_multiply_by_two --test --data
#|echo: false
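
The output of the `%print` commands is not reproduced here. As an illustration only, and assuming the generated code follows the description above, the test data function and the test function might look roughly like this (the definition of `multiply_by_two` is repeated so that the sketch is self-contained):

```python
def multiply_by_two(c):
    d = c * 2
    print(f"Two times {c} is {d}")
    return d


def test_input_multiply_by_two():
    # Test data function: computes and returns the input needed by the test.
    factors = [2, 2, 3, 5, 5]
    value_to_multiply = 1
    for factor in factors:
        value_to_multiply *= factor
    return value_to_multiply


def test_multiply_by_two():
    # The test obtains its input by calling the data function in its body,
    # so it takes no arguments and is self-contained.
    value_to_multiply = test_input_multiply_by_two()
    print(multiply_by_two(value_to_multiply))
```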

To prevent conflicts, two *test data* functions cannot return a variable with the same name:

<div style="background-color: rgb(250, 250, 250);">
```python
%%function second_function --test --data
value_to_multiply = 10
```

If we run the previous code, we get a `ValueError` exception with the following message:

```
ValueError: detected common variables with other test data functions {'value_to_multiply'}:
```
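
A check of this kind can be expressed as a set intersection. The following sketch is a guess at the general idea rather than the library's code; `check_no_common_variables` is a hypothetical helper.

```python
def check_no_common_variables(new_outputs, existing_outputs):
    """Raise ValueError if a new test data function reuses a variable name."""
    common = set(new_outputs) & set(existing_outputs)
    if common:
        raise ValueError(
            f"detected common variables with other test data functions {common}"
        )


# Variables already returned by previously defined test data functions.
existing = ["value_to_multiply"]

check_no_common_variables(["other_value"], existing)  # no conflict, passes

try:
    check_no_common_variables(["value_to_multiply"], existing)
except ValueError as error:
    message = str(error)
    print(message)
```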

Test functions are written to a separate test module, with the prefix `test_`:

In [None]:
os.listdir ('../tests')

In [None]:
assert os.listdir ('../tests')==['test_index.py']

## Imports

In order to include libraries in our Python module, we can use the cell magic `%%imports`. The imports will be written at the beginning of the module:

<div style="background-color: rgb(250, 250, 250);">
```python
%%imports
import pandas as pd
```

In [None]:
%%imports
#|echo: false
import pandas as pd

In [None]:
!cat ../nbmodular/index.py

Imports can be indicated separately for the test module by passing the flag `--test`:

<div style="background-color: rgb(250, 250, 250);">
```python
%%imports --test
import matplotlib.pyplot as plt
```

In [None]:
%%imports --test
#|echo: false
import matplotlib.pyplot as plt

In [None]:
!cat ../tests/test_index.py

## Defined functions

The cell magic `%%function` can also be used on cells that define functions:

In [None]:
import datetime
name = 'Jaume'

In [None]:
%%function
def determine_approximate_age (name, birthday_year=2000):
    #|echo: false
    current_year = datetime.datetime.today().year
    approximate_age = current_year-birthday_year
    print (f'hello {name}, your approximate age is {approximate_age}')
    return approximate_age

In [None]:
determine_approximate_age_info

In [None]:
determine_approximate_age_info.approximate_age, determine_approximate_age_info.current_year

In [None]:
%%function use_current_year
#|echo: false
print (current_year)

In [None]:
%print determine_approximate_age

Functions can be included with their signature and return values already defined. The only caveat is that, if we want the function to be executed, the variables in the argument list need to be created outside of the function. Otherwise, we need to pass the flag `--not-run` to avoid errors:

<div style="background-color: rgb(250, 250, 250);">
```python
%%function --not-run
def myfunc (x, y, a=1, b=3):
    #|echo: false
    print ('hello', a, b)
    c = a+b
    return c
```

In [None]:
%%function --not-run
def myfunc (x, y, a=1, b=3):
    #|echo: false
    print ('hello', a, b)
    c = a+b
    return c

Although the internal code of the function is not executed, it is still parsed using an AST:

In [None]:
myfunc_info.created_variables

In [None]:
myfunc_info.previous_variables

This makes it possible to provide tentative *warnings* regarding names not found in the argument list:

<div style="background-color: rgb(250, 250, 250);">
```python
def other_func (x, y):
    #|echo: false
    print ('hello', a, b)
    c = a+b
    return c
```

In [None]:
%%function --not-run
def other_func (x, y):
    #|echo: false
    print ('hello', a, b)
    c = a+b
    return c

Let's do the same but running the function:

In [None]:
a=1
b=3

<div style="background-color: rgb(250, 250, 250);">
```python
%%function
def myfunc (x, y, a=1, b=3):
    #|echo: false
    print ('hello', a, b)
    c = a+b
    return c
```

In [None]:
%%function
def myfunc (x, y, a=1, b=3):
    #|echo: false
    print ('hello', a, b)
    c = a+b
    return c

In [None]:
myfunc (10, 20)

<div style="background-color: rgb(250, 250, 250);">
```python
myfunc_info = %function_info myfunc
```

In [None]:
myfunc_info = %function_info myfunc
#|echo: false

In [None]:
myfunc_info

In [None]:
myfunc_info.c

## Storing local variables in memory

By default, when we run a cell function its local variables are stored in a dictionary called `current_values`:

<div style="background-color: rgb(250, 250, 250);">
```python
%%function my_new_function
my_new_local = 3
my_other_new_local = 4
```

In [None]:
%%function my_new_function
#|echo: false
my_new_local = 3
my_other_new_local = 4

The stored variables can be accessed by calling the magic `function_info`:

In [None]:
my_new_function_info = %function_info my_new_function

In [None]:
my_new_function_info.current_values

This default behaviour can be overridden by passing the flag `--not-store`:

<div style="background-color: rgb(250, 250, 250);">
```python
%%function my_second_new_function --not-store
my_second_variable = 100
my_second_other_variable = 200
```

In [None]:
%%function my_second_new_function --not-store
#|echo: false
my_second_variable = 100
my_second_other_variable = 200

In [None]:
my_second_new_function_info = %function_info my_second_new_function

In [None]:
my_second_new_function_info.current_values

## (Un)packing Bunch I/O

In [None]:
from sklearn.utils import Bunch

<div style="background-color: rgb(250, 250, 250);">
```python
%%function bunch_data
x = Bunch (a=1, b=2)
```

In [None]:
%%function bunch_data
#|echo: false
x = Bunch (a=1, b=2)

<div style="background-color: rgb(250, 250, 250);">
```python
%%function bunch_processor --unpack-bunch x --include-input "day=1"
c = 3
a = 4
```

In [None]:
%%function bunch_processor --unpack-bunch x --include-input "day=1"
#|echo: false
c = 3
a = 4

<div style="background-color: rgb(250, 250, 250);">
```python
%print bunch_processor
```

In [None]:
%print bunch_processor
#|echo: false

## Function's info object holding local variables

In [None]:
#| hide
import pandas as pd

In [None]:
df = pd.DataFrame (dict(Year=[1,2,3], Month=[1,2,3], Day=[1,2,3]))
fy = '2023'

In [None]:
%%function
#|echo: false
def days (df, fy, x=1, /, y=3, *, n=4):
    df_group = df.groupby(['Year','Month']).agg({'Day': lambda x: len (x)})
    df_group = df_group.reset_index()
    print ('other args: fy', fy, 'x', x, 'y', y)
    return df_group

An info object named `<function_name>_info` is created in memory and can be used to access the function's local variables:

In [None]:
days_info.df_group

There is more information in this object: previous variables, code, etc.

In [None]:
days_info.current_values

In [None]:
days_info

The function can also be called directly:

In [None]:
days (df*100, 100, 4)

# Saving and loading

## Saving / loading previous results

Functions can load previously computed results and save the results of the current execution. Let's see an example:

In [None]:
x = 3
n = 5

In [None]:
#| hide
import shutil
shutil.rmtree ('results', ignore_errors=True)

In [None]:
del factors

In [None]:
%%function --save
def multiples (x, n):
    print ('computing multiples')
    factors = range(n)
    result = [x*i for i in factors]
    return result

After running the previous cell, we can load the result of the function from disk:

In [None]:
import joblib
joblib.load ('results/index/multiples_result.pickle')

By default, the path to the file where the results are saved is determined as follows:
- The root folder is `results`, inside the directory where the notebook is run.
- Inside `results`, a folder named after the current notebook is created, without the `.ipynb` extension (`index` in our case, since the notebook is `index.ipynb`).
- The file name is the name of the function (`multiples` in our example) followed by the suffix `_result`.
- The default file type is `pickle`.

All of these options can be changed as we will see later.
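
The default rules listed above can be summarised with a small helper. This is only a sketch: `default_result_path` and its parameter names are hypothetical and chosen for illustration, not taken from the `nbmodular` API.

```python
from pathlib import Path


def default_result_path(function_name, notebook_name, root="results",
                        suffix="_result", io_type="pickle"):
    """Build the default results path (hypothetical helper, for illustration)."""
    folder = Path(notebook_name).stem  # "index.ipynb" -> "index"
    return Path(root) / folder / f"{function_name}{suffix}.{io_type}"


path = default_result_path("multiples", "index.ipynb")
print(path.as_posix())  # results/index/multiples_result.pickle
```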

We can avoid re-computing the results by passing the flag `--load`:

In [None]:
%%function --load 
def multiples (x, n):
    print ('computing multiples')
    factors = range(n)
    result = [x*i for i in factors]
    return result

As we can see, the function hasn't run, since no message is printed on screen. Without the `--load` flag, it runs normally:

In [None]:
%%function
def multiples (x, n):
    print ('computing multiples')
    factors = range(n)
    result = [x*i for i in factors]
    return result

## Saving / loading local variables

Instead of saving / loading the variables returned by the function, we can save or load its local variables by passing the flag `--io-locals`:

In [None]:
%%function --save --io-locals
def multiples (x, n):
    print ('computing multiples')
    factors = range(n)
    result = [x*i for i in factors]
    return result

After running the previous cell, we will have a file with path `locals/index/multiples_locals.pickle`, storing the local variables of the function:

In [None]:
joblib.load ('locals/index/multiples_locals.pickle')

By default, the file is saved in the folder `locals/<name of notebook>` inside the current directory, and its name is the name of the function followed by the suffix `_locals`. The default file type is `pickle`. All of these options can be changed, as we will see later.

Again, we can avoid re-computing the results by passing the flag `--load`. This will load the local variables into the notebook's memory. To demonstrate that, let's first delete those variables from memory:

In [None]:
del factors
del result

We now load them from disk by passing the flags `--load` and `--io-locals`:

In [None]:
%%function --load --io-locals
def multiples (x, n):
    print ('computing multiples')
    factors = range(n)
    result = [x*i for i in factors]
    return result

As we can see, the function hasn't run, since there is no printed message, and the local variables have been loaded and are now available:

In [None]:
print (f'factors: {factors}, result: {result}')
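
Conceptually, saving and loading local variables amounts to serialising a dictionary of names and values, then merging it back into a namespace. The following is a minimal sketch of that idea. It assumes the standard-library `pickle` module instead of `joblib` to stay self-contained, and `save_locals` / `load_locals` are hypothetical names, not `nbmodular` functions.

```python
import pickle
import tempfile
from pathlib import Path


def save_locals(local_vars, path):
    """Write a dictionary of local variables to disk (illustrative helper)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(local_vars, f)


def load_locals(path, namespace):
    """Read the stored variables and merge them into the given namespace."""
    with open(path, "rb") as f:
        namespace.update(pickle.load(f))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "locals" / "index" / "multiples_locals.pickle"
    factors = range(5)
    save_locals({"factors": factors, "result": [3 * i for i in factors]}, path)

    restored = {}  # stands in for the notebook's global namespace
    load_locals(path, restored)

print(restored)  # {'factors': range(0, 5), 'result': [0, 3, 6, 9, 12]}
```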

## Loading / saving in the function's code

We can insert loading / saving code into the function being defined by passing the flag `--io-code`:

In [None]:
%%function --io-code
def multiples (x, n):
    print ('computing multiples')
    factors = range(n)
    result = [x*i for i in factors]
    return result

Calling this function with `save=True` saves the result to `results/index/multiples_result.pickle` by default. This is the same path as the one used before, so let us remove the file from disk first:

In [None]:
import os
os.remove ('results/index/multiples_result.pickle')

In [None]:
multiples (7, 5, save=True)

In [None]:
joblib.load ('results/index/multiples_result.pickle')

We can also skip the computation in subsequent calls, by passing `load=True`:

In [None]:
multiples (7, 5, load=True)

As we can see, no message has been printed by calling the function, since the result is loaded from disk and the rest of the function is skipped.
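
To make the behaviour concrete, here is a hand-written function with `load` and `save` keyword arguments that behaves as described. It is only a sketch of what such a function might look like: the code actually generated by `--io-code` may differ, the `path` argument is an assumption added for the demonstration, and `pickle` is used instead of `joblib` to keep the example self-contained.

```python
import pickle
import tempfile
from pathlib import Path


def multiples(x, n, load=False, save=False,
              path=Path("results/index/multiples_result.pickle")):
    # Loading branch: return the stored result and skip the computation.
    if load and path.exists():
        with open(path, "rb") as f:
            return pickle.load(f)
    print("computing multiples")
    factors = range(n)
    result = [x * i for i in factors]
    # Saving branch: persist the result for later calls.
    if save:
        path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(result, f)
    return result


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "multiples_result.pickle"
    first = multiples(7, 5, save=True, path=demo_path)   # prints the message
    second = multiples(7, 5, load=True, path=demo_path)  # prints nothing
```

Note that in this sketch the loading branch returns whatever is on disk, regardless of the arguments passed, so a stale file yields a stale result.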

## Loading / saving config parameters

We saw earlier how the path to the results file is constructed by default. To change this path, we can pass the following flags:

- `--io-type`: indicates both the file extension and the type of file to be saved. The current possibilities are `pickle`, `csv`, and `parquet`; the default is `pickle`.
- `--io-root-path`: root folder inside the current directory. The default is `results`.
- `--io-folder`: sub-folder, inside the root folder, where the file is stored. The default is the name of the current notebook. On the rare occasions where the name of the current notebook cannot be automatically detected, the name `temporary` is used instead. See the section on setting global parameters below for how to manually indicate the file name of the current notebook.
- `--io-file`: name of the file, without extension. The default is derived from the name of the function (`multiples_result` in the earlier example).

Let's see an example:

In [None]:
%%function --save --io-type csv --io-root-path results_df --io-folder csv_files --io-file computed_multiples
def multiples (x, n):
    print ('computing multiples')
    factors = range(n)
    result = pd.DataFrame ({
        'factor':factors,
        f'{x} * factor':[x*i for i in factors],
    })
    return result

By running the previous cell, the file `results_df/csv_files/computed_multiples.csv` is created, which can be read using pandas `read_csv`:

In [None]:
pd.read_csv ('results_df/csv_files/computed_multiples.csv', index_col=0)
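
How the four parameters of this section combine into a path, and how the file type selects the writer, can be sketched as follows. This is an illustration under stated assumptions: `save_result` is a hypothetical helper, it uses only the standard library (`csv` and `pickle`) rather than pandas, and it leaves out `parquet`.

```python
import csv
import pickle
import tempfile
from pathlib import Path


def save_result(rows, io_type="pickle", io_root_path="results",
                io_folder="temporary", io_file="result"):
    """Save `rows` (a list of dicts) using a path built from the parameters."""
    path = Path(io_root_path) / io_folder / f"{io_file}.{io_type}"
    path.parent.mkdir(parents=True, exist_ok=True)
    if io_type == "pickle":
        with open(path, "wb") as f:
            pickle.dump(rows, f)
    elif io_type == "csv":
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
    else:
        raise ValueError(f"unsupported io_type in this sketch: {io_type}")
    return path


rows = [{"factor": i, "3 * factor": 3 * i} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp:
    written = save_result(rows, io_type="csv",
                          io_root_path=Path(tmp) / "results_df",
                          io_folder="csv_files", io_file="computed_multiples")
    relative = written.relative_to(tmp).as_posix()
    header = written.read_text().splitlines()[0]

print(relative)  # results_df/csv_files/computed_multiples.csv
print(header)    # factor,3 * factor
```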

# Setting global parameters

Some of the default behaviour of the `nbmodular` extension can be changed by setting global parameters with the `set` magic command. Two important cases are setting the name of the file where the python module is saved, and setting the logging level:

In [None]:
%set file_name 'example.py'

In [None]:
cell_processor = %cell_processor
cell_processor.write ()

This causes the code of the notebook to be exported to the indicated file, `example.py`:

In [None]:
os.path.exists ('example.py')

This is particularly important when the name of the current notebook cannot be automatically detected. In that case, the default name of the python module is `temporary.py`, and it is advisable to manually indicate the name using the above command.

We can also indicate the full path as follows:

In [None]:
%set file_path folder/subfolder/example.py

In [None]:
cell_processor = %cell_processor
print (f'new file path: {cell_processor.file_path}')
print ('exporting code to that file')
cell_processor.write ()
print (f'The file has been created: {os.path.exists ("folder/subfolder/example.py")}')

We can also set the logging level:

In [None]:
%set log_level DEBUG

In [None]:
cell_processor.logger.debug ('This is a debug-level message')
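
The effect of a logging level can be checked with the standard `logging` module. The sketch below assumes that the value given to `%set log_level` corresponds to one of the standard level names. The logger name and the `set_log_level` helper are hypothetical.

```python
import logging


def set_log_level(logger, level_name):
    """Set the level of `logger` from a standard level name such as 'DEBUG'."""
    logger.setLevel(getattr(logging, level_name.upper()))


logger = logging.getLogger("nbmodular_sketch")

set_log_level(logger, "INFO")
debug_enabled_at_info = logger.isEnabledFor(logging.DEBUG)   # False

set_log_level(logger, "DEBUG")
debug_enabled_at_debug = logger.isEnabledFor(logging.DEBUG)  # True

print(debug_enabled_at_info, debug_enabled_at_debug)
```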

In [None]:
%set default_load True

We can indicate whether or not a function is run by default:

In [None]:
%reload_ext nbmodular.core.cell2func

In [None]:
%set default_run False

In [None]:
my_par = 'My Parameter'

In [None]:
%%function
def my_new_function (my_par):
    print (my_par)

In [None]:
%set default_run True

In [None]:
%%function
def my_new_function (my_par):
    print (my_par)

In [None]:
%%function --not-run
def my_new_function (my_par):
    print (my_par)

In [None]:
%set overriden_run True

In [None]:
%%function --not-run
def my_new_function (my_par):
    print (my_par)

In [None]:
%set overriden_run None

In [None]:
%%function --not-run
def my_new_function (my_par):
    print (my_par)
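
The cells above combine three settings: the global `default_run`, the per-cell flag `--not-run`, and the global `overriden_run`. One plausible reading of how they interact is sketched below. The precedence shown is an assumption inferred from the parameter names and is not confirmed by this document, and `should_run` is a hypothetical helper.

```python
def should_run(not_run_flag=False, default_run=True, overriden_run=None):
    """Decide whether a cell function is executed (assumed precedence)."""
    # Assumption: a global override, when set, wins over everything else.
    if overriden_run is not None:
        return overriden_run
    # Assumption: otherwise the per-cell --not-run flag disables execution.
    if not_run_flag:
        return False
    # Otherwise fall back to the global default.
    return default_run


print(should_run())                                       # True
print(should_run(default_run=False))                      # False
print(should_run(not_run_flag=True))                      # False
print(should_run(not_run_flag=True, overriden_run=True))  # True
```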

# Classes

In [None]:
%%add_class 
class Person ():

In [None]:
self

In [None]:
name = 'Lex'
year_born=1983

In [None]:
%%method
def __init__ (
    self,
    name,
    year_born,
):
    self.name = name
    self.year_born = year_born

In [None]:
Person.__init__??

In [None]:
print (f'self has name {self.name} and was born in {self.year_born}')

In [None]:
andrew = Person ('Andrew', 1973)

In [None]:
print (f'andrew has name {andrew.name} and was born in {andrew.year_born}')

In [None]:
%%method
def approximate_age_in_2023 (self):
    approximate_age = 2023 - self.year_born 
    return approximate_age

In [None]:
print (f'{self.name} had approximately {self.approximate_age_in_2023 ()} years in 2023')
print (f'{andrew.name} had approximately {andrew.approximate_age_in_2023 ()} years in 2023')
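
Building a class one method at a time, each in its own cell, has a plain-Python analogue: attaching functions to an existing class. The sketch below shows that analogue only. The `add_method` decorator is a hypothetical helper, and the `%%method` magic may work differently.

```python
class Person:
    pass


def add_method(cls):
    """Return a decorator that attaches the decorated function to `cls`."""
    def decorator(func):
        setattr(cls, func.__name__, func)
        return func
    return decorator


@add_method(Person)
def __init__(self, name, year_born):
    self.name = name
    self.year_born = year_born


@add_method(Person)
def approximate_age_in_2023(self):
    return 2023 - self.year_born


andrew = Person("Andrew", 1973)
print(andrew.name, andrew.approximate_age_in_2023())  # Andrew 50
```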