
<center> <h1> 
    7. Sharing code with others
    </h1> </center>

<center> <h3> October 23, 2024 </h3> </center>
<h3> <p style='text-align: center;'>  Jong-Hwan Lee  </p> </h3> 

<p style='text-align: center; font-size: 15px'> Reference: Neuroimaging and Data Science by Ariel Rokem & Tal Yarkoni, 2021 (http://neuroimaging-data-science.org) 
</p>

Collaboration is an important part of science. 

The principles discussed in this chapter apply to collaborations with others, with your closest collaborator, and even with yourself from six months ago.
- ultimately, your code can be used by complete strangers
    - this would provide the ultimate proof of its reproducibility and give more impact than it could have if you were the only one using it

## 7.1. What should be shareable?

Jupyter notebooks can be used as a way to prototype code and to present ideas, but the notebook format does not, by itself, readily support the implementation of reusable code or code that is easy to test.

Thus, it is usually recommended to move your code from notebook files into Python files (`.py`).

Here, we'll learn a particular organization that facilitates the emergence of reusable libraries of code that you can work on with others and that broadly follows the conventions of the Python language.

The pieces of code deserve to be written and shared in a manner that others can easily adopt into their code.
- to this end, the code needs to be packaged into a library

We'll do this by way of a simplified example.

## 7.2. From notebook to module

Suppose we wrote the following code in the course of our work on the analysis of some MRI data.

In [2]:
from math import pi
import pandas as pd

blob_data = pd.read_csv('./input_data/blob.csv')

blob_radius = blob_data['radius']

blob_area = pi * blob_radius ** 2
blob_circ = 2 * pi * blob_radius

output_data = pd.DataFrame({"area": blob_area, "circ": blob_circ})
output_data.to_csv('./output_data/blob_properties.csv')

- Pandas is covered in more detail in Section 9
  - for now: Pandas is a Python library that knows how to read data from comma-separated value (csv) files, and how to write this data back out

This code is unfortunately not very reusable, even though the results are reproducible.
- this is because it mixes file input and output with computations, and different computations with each other

Good software engineering aims towards _modularity_ and _separation of concerns_.
- e.g., one part of the code performs the calculations, another part reads and manipulates the data, and yet other functions visualize the results or produce statistical summaries

Our first step is to identify the reusable components of this script and to move these components into a module.
- i.e., area and circumference calculations in this code 

Let's isolate them and rewrite them as functions:

In [3]:
from math import pi
import pandas as pd

def calculate_area(r):
    area = pi * r **2
    return area

def calculate_circ(r):
    circ = 2 * pi * r
    return circ

blob_data = pd.read_csv('./input_data/blob.csv')
blob_radius = blob_data['radius']
blob_area = calculate_area(blob_radius)
blob_circ = calculate_circ(blob_radius)

output_data = pd.DataFrame({"area": blob_area, "circ": blob_circ})
output_data.to_csv('./output_data/blob_properties.csv')
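The same separation can be extended to file input and output. The sketch below is one possible arrangement, not the chapter's own code: it uses the standard-library `csv` module instead of Pandas so that it runs without extra dependencies, and the helper names `read_radii`, `write_properties`, and `main` are hypothetical.

```python
import csv
from math import pi


def calculate_area(r):
    return pi * r ** 2


def calculate_circ(r):
    return 2 * pi * r


def read_radii(path):
    """Read the 'radius' column of a CSV file as a list of floats."""
    with open(path, newline="") as f:
        return [float(row["radius"]) for row in csv.DictReader(f)]


def write_properties(path, radii):
    """Write one row of area and circumference per radius."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["area", "circ"])
        for r in radii:
            writer.writerow([calculate_area(r), calculate_circ(r)])


def main(in_path, out_path):
    # Only this function knows about both the files and the computations.
    write_properties(out_path, read_radii(in_path))
```

With this layout, the calculation functions can be tested without touching any file, and the file format can change without touching the calculations.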

The next step is to move these functions into a separate file.

Let's name this file `geometry.py` and document what the functions do:

In [4]:
from math import pi

def calculate_area(r):
    """
    Calculates the area of a circle.

    Parameters
    ----------
    r : numerical
        The radius of a circle

    Returns
    -------
    area : numerical
        The calculated area
    """
    area = pi * r **2
    return area


def calculate_circ(r):
    """
    Calculates the circumference of a circle.

    Parameters
    ----------
    r : numerical
        The radius of a circle

    Returns
    -------
    circ : float or array
        The calculated circumference
    """
    circ = 2 * pi * r
    return circ

- documentation should include at least a one-sentence description of the function, and detailed descriptions of the input parameters and of the outputs (returns)
    - the docstrings in this example are carefully written to comply with the numpy docstring guide
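A docstring is not only for readers of the source file: Python stores it on the function object, where tools can retrieve it. As a small illustration (the variable names `doc` and `summary` are only for this sketch), the standard-library `inspect.getdoc` function returns the docstring with its indentation cleaned up:

```python
import inspect
from math import pi


def calculate_area(r):
    """
    Calculates the area of a circle.

    Parameters
    ----------
    r : numerical
        The radius of a circle

    Returns
    -------
    area : numerical
        The calculated area
    """
    return pi * r ** 2


# inspect.getdoc strips the common indentation and surrounding blank lines;
# help(calculate_area) would display the same text interactively.
doc = inspect.getdoc(calculate_area)
summary = doc.splitlines()[0]
print(summary)
```

This is why the one-sentence description comes first: it is the line that interactive help and many documentation tools show as the summary.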

### 7.2.1. Importing and using functions

Before we see how we'll use the `geometry` module that we created, let's learn a bit about what happens when we call `import` statements in Python.
- when we call the `import geometry` statement, Python starts by looking for a file called `geometry.py` in your present working directory

Once you have saved `geometry.py`, you can rewrite the analysis script as:


In [5]:
import geometry as geo
import pandas as pd

blob_data = pd.read_csv('./input_data/blob.csv')
blob_radius = blob_data['radius']
blob_area = geo.calculate_area(blob_radius)
blob_circ = geo.calculate_circ(blob_radius)
output_data = pd.DataFrame({"area": blob_area, "circ": blob_circ})
output_data.to_csv('./output_data/blob_properties.csv')

- this is good because you can import and reuse these functions across different analysis scripts
  
We have just seen how to transition part of your code from a one-off notebook or script to a module.

Next, let's see how you transition from a module to a library.

## 7.3. From module to package

So far, we can use the code in the `geometry` module only from scripts that sit in the same directory as `geometry.py`.

The next level of reusability is to create a library, or a _package_, that can be installed and imported across multiple different projects.

Again, when `import geometry` is called and there is no file called `geometry.py` in the present working directory (pwd), Python looks for a Python package called `geometry`.

What is a Python package? 
- it's a folder that has a file called `__init__.py` 

This can be imported just like a module.
- if the folder is in your pwd, importing it will execute the code in `__init__.py`

For example, if you were to put the functions available in `geometry.py` in `geometry/__init__.py`
- you could import them from the directory that contains the `geometry` directory

More typically, a package might contain different modules that each have some code.

For example:

```
    .
    └── geometry
        ├── __init__.py
        └── circle.py
```

- the code that we previously had in `geometry.py` is now in the `circle.py` module of the `geometry` package

To make the names in `circle.py` available to us, we can import them explicitly as follows:

In [4]:
from geometry import circle
circle.calculate_area(blob_radius)


Or, we can have the `__init__.py` file import them for us by adding this code to the `__init__.py` file:

``from .circle import calculate_area, calculate_circ``

This way, we can import our functions as follows:

In [11]:
from geometry import calculate_area

This implies that the `__init__.py` file can manage all the imports from the multiple modules that we want to add to the package.
- also it can perform other operations that you might want to do whenever you import the package
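To see the re-export in action without touching your own project, the sketch below builds a throwaway package in a temporary directory. The package name `geometry_pkg_demo` is hypothetical (chosen to avoid clashing with any real `geometry` on your machine), and the `sys.path` line only makes the temporary directory visible to `import` for this demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build the package layout in a temporary directory.
root = Path(tempfile.mkdtemp())
pkg = root / "geometry_pkg_demo"
pkg.mkdir()
(pkg / "circle.py").write_text(
    "from math import pi\n\n"
    "def calculate_area(r):\n"
    "    return pi * r ** 2\n\n"
    "def calculate_circ(r):\n"
    "    return 2 * pi * r\n"
)
# __init__.py re-exports the functions defined in circle.py.
(pkg / "__init__.py").write_text(
    "from .circle import calculate_area, calculate_circ\n"
)

# Make the temporary directory importable for this demonstration only.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

demo = importlib.import_module("geometry_pkg_demo")
# Both spellings refer to the same function object.
same = demo.calculate_area is demo.circle.calculate_area
print(same)
```

The function is defined once, in `circle.py`; the package-level name is just a second reference to it, so users can choose the shorter import without any duplication of code.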

Now that you have your code in a package, you'll want to install it on your machine so that you can import the code from anywhere on your machine (not only from this particular directory).
- eventually, others can also easily install it and run it on their machines

Here, we need to understand one more thing about the `import` statement.
- if `import` cannot find a module or package locally in the pwd, it will proceed to look for this name somewhere in the Python _path_

The Python path is a list of file system locations that Python uses to search for packages and modules to import.

Let's try this:

In [2]:
import sys
print(sys.path)

['/Users/jhlee/opt/anaconda3/envs/my_env_py3p12/lib/python312.zip', '/Users/jhlee/opt/anaconda3/envs/my_env_py3p12/lib/python3.12', '/Users/jhlee/opt/anaconda3/envs/my_env_py3p12/lib/python3.12/lib-dynload', '', '/Users/jhlee/opt/anaconda3/envs/my_env_py3p12/lib/python3.12/site-packages', '/Users/jhlee/opt/anaconda3/envs/my_env_py3p12/lib/python3.12/site-packages/setuptools/_vendor']


- we need to copy the code into one of these file system locations 
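The role of the path can be observed directly. The following sketch is an illustration of how the search works, not the recommended way to install code: it writes a hypothetical module `geometry_demo.py` into a temporary directory and uses the standard-library `importlib.util.find_spec` to ask where Python would find it, before and after that directory is added to `sys.path`.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path

# Write a small module file into a temporary directory.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "geometry_demo.py").write_text(
    "from math import pi\n\n"
    "def calculate_area(r):\n"
    "    return pi * r ** 2\n"
)

# The directory is not on the path yet, so the module cannot be located.
spec_before = importlib.util.find_spec("geometry_demo")

# Adding the directory to sys.path makes the module discoverable.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
spec_after = importlib.util.find_spec("geometry_demo")

print(spec_before)        # None unless a module with this name already exists
print(spec_after.origin)  # location of geometry_demo.py inside module_dir
```

Editing `sys.path` by hand like this only lasts for the current session, which is one reason to prefer a proper installation.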

To this end, let's let Python do this for us using the `setuptools` library.

The main instrument for `setuptools` operations is a file called `setup.py`, which will be introduced in the next section.

## 7.4. The setup file

Suppose that you want to use the code that you have written across multiple projects, or share it with others for them to use in their projects.

You also want to organize the files in a separate directory devoted to your library:

```
    .
    └── geometry
        ├── geometry
        │   ├── __init__.py
        │   └── circle.py
        └── setup.py
```

Notice that we have two directories called `geometry`:
- the top-level directory contains both our Python package (i.e., the `geometry` package) and other files to organize our project

For example, the file called `setup.py` is saved in the top-level directory of our library.
- it tells Python how to set our software up and how to install it
- within this file, we rely on the [setuptools](https://setuptools.readthedocs.io/en/latest/) library (a third-party package, not part of the Python standard library) to do a lot of the work
    - we need to provide some metadata about our software and some information about the available packages within our software

For example, here's a minimal setup file:

In [5]:
from setuptools import setup, find_packages

with open("README.md", "r") as fh:
    long_description = fh.read()

setup(
    name="geometry",
    version="0.0.1",
    author="Ariel Rokem",
    author_email="author@example.com",
    description="Calculating geometric things",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/arokem/geometry",
    packages=find_packages(),
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
        "Intended Audience :: Science/Research",
        "Topic :: Scientific/Engineering"
    ],
    python_requires='>=3.8',
    install_requires=["pandas"]
)



The code in this file is a call to a function called `setup`.
- when the file is run from the command line, it accepts [many different commands](https://setuptools.readthedocs.io/en/latest/setuptools.html#command-reference)
- one of these commands is `install`: it takes all the steps needed to properly install the software into your Python path

Once you are done writing and organizing the files and folders in your Python library in the right way, you can call:

`python setup.py install`

- then, you will be able to find this library from anywhere in your filesystem and use the functions stored within it
- recent versions of `setuptools` deprecate calling `setup.py` directly; running `pip install .` in the same directory is the currently recommended equivalent

Next, let's look at the contents of the file section by section.

### 7.4.1. Contents of a setup.py file

The first thing in `setup.py` (after the `import` statements) reads a `long_description` from a README file.
- if you use GitHub to track the changes in your code and to collaborate with others, it is a good idea to use the markdown format (with the `.md` extension) for this file

Let's write something informative in the README.md file:

```
# geometry

This is a library of functions for geometric calculations.

# Contributing

We welcome contributions from the community. Please create a fork of the
project on GitHub and use a pull request to propose your changes. We strongly encourage creating
an issue before starting to work on major changes, to discuss these changes first.

# Getting help

Please post issues on the project GitHub page.
```

The second thing that happens is a call to the `setup` function.
- it takes several keyword arguments

The first few are general metadata about the software:
```
name="geometry",
author="Ariel Rokem",
author_email="author@example.com",
description="Calculating geometric things",
long_description=long_description,
```

The next one makes sure that the long description gets properly rendered in web pages (e.g., the Python package index, [PyPI](https://pypi.org/)):

`long_description_content_type="text/markdown",`

Another kind of metadata is classifiers, which catalog the software within PyPI so that interested users can more easily find it:

```
classifiers=[
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
    "Intended Audience :: Science/Research",
    "Topic :: Scientific/Engineering"
],
```

- license: it is best to use a standard [OSI-approved license](https://opensource.org/licenses); the MIT license is a reasonable choice if you want to provide the software publicly and allow any use of it, including in commercial applications

Then, the version of the software
- it follows the [semantic versioning conventions](https://semver.org/)
- e.g., `version="0.0.1",`
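Under semantic versioning, the three numbers stand for MAJOR (incompatible changes), MINOR (new backward-compatible functionality), and PATCH (backward-compatible bug fixes). The sketch below shows how such a string can be parsed, compared, and incremented; the helper names `parse_version` and `bump` are hypothetical, and pre-release suffixes such as `-rc1` are deliberately not handled.

```python
def parse_version(version):
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump(version, part):
    """Return the next version string after a 'major', 'minor', or 'patch' change."""
    major, minor, patch = parse_version(version)
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


# Tuples compare element by element, so 0.10.0 is newer than 0.9.3,
# which a plain string comparison would get wrong.
print(parse_version("0.10.0") > parse_version("0.9.3"))
print(bump("0.0.1", "minor"))
```

Comparing integer tuples rather than raw strings is the design point here: as strings, `"0.10.0"` sorts before `"0.9.3"`.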

A URL for the software
- e.g., `url="https://github.com/arokem/geometry",`

The next item calls a `setuptools` function 
- automatically traverse the filesystem in this directory and find the packages/sub-packages
- e.g., `packages = find_packages(),` 

Alternatively, we can explicitly write out the names of the packages to install as part of the software
- e.g., `packages=['geometry']`
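To build intuition for what automatic package discovery involves, here is a simplified stand-in written with the standard library only. It is not the `setuptools` implementation and ignores options such as exclusions; the function name `find_packages_sketch` is hypothetical. It treats a directory as a package when it, and every directory between it and the project root, contains an `__init__.py` file.

```python
import tempfile
from pathlib import Path


def find_packages_sketch(where):
    """List dotted names of directories under `where` that contain __init__.py.

    A simplified stand-in for illustration; it is not the setuptools code.
    """
    where = Path(where)
    found = []
    for init_file in sorted(where.rglob("__init__.py")):
        parts = init_file.parent.relative_to(where).parts
        # Keep a directory only if every ancestor up to `where` is a package too.
        if parts and all(
            (where.joinpath(*parts[:i]) / "__init__.py").exists()
            for i in range(1, len(parts) + 1)
        ):
            found.append(".".join(parts))
    return found


# Recreate the project layout from this section in a temporary directory.
project = Path(tempfile.mkdtemp())
(project / "geometry").mkdir()
(project / "geometry" / "__init__.py").write_text("")
(project / "geometry" / "circle.py").write_text("")
(project / "setup.py").write_text("")

print(find_packages_sketch(project))
```

Note that the result names the inner `geometry` directory (the package), not the outer project directory that holds `setup.py`.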

The last two items define the dependencies of the software: the Python versions it supports and the other packages that must be installed for it to work.
- e.g., 

```
python_requires = '>=3.8',
install_requires = ["pandas"]
```
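The `python_requires` entry is a version specifier that installers compare against the running interpreter. As a rough sketch of that comparison for the `'>=3.8'` case only (this is not how pip evaluates specifiers in general, and the function name `satisfies_minimum` is hypothetical):

```python
import sys


def satisfies_minimum(version_info, minimum=(3, 8)):
    """Return True if the (major, minor, ...) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


# sys.version_info describes the interpreter running this code.
print(satisfies_minimum(sys.version_info))
```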




## 7.5. A complete project

Now, our project is starting to take shape with the following filesystem of our library:

```
    .
    └── geometry
        ├── LICENSE
        ├── README.md
        ├── geometry
        │   ├── __init__.py
        │   └── circle.py
        └── setup.py
```

We can add all these files and then push into a repository on GitHub!

A few further steps remain.

### 7.5.1. Testing and continuous integration

As introduced in Section 6, tests are particularly useful if they are automated and run repeatedly.

In the context of a well-organized Python project, this can be achieved by including a test module for every package in the library.

For example, we can add a `tests` package within our `geometry` package:

```
    .
    └── geometry
        ├── LICENSE
        ├── README.md
        ├── geometry
        │   ├── __init__.py
        │   ├── tests
        │   │   ├── __init__.py
        │   │   └── test_circle.py
        │   └── circle.py
        └── setup.py
```

Here, `__init__.py` is an empty file, signaling that the `tests` folder is a package as well, and the `test_circle.py` file may contain a simple set of test functions such as:

In [None]:
from geometry.circle import calculate_area
from math import pi

def test_calculate_area():
    assert calculate_area(1) == pi

- this will test that the `calculate_area` function does the right thing
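Exact equality works for a radius of 1, but most floating-point results are better compared with a tolerance. The sketch below adds two more hypothetical tests in the same style, using the standard-library `math.isclose`; the two functions are repeated as stand-ins so that the example runs even when the `geometry` package is not installed.

```python
from math import isclose, pi


# Stand-ins for the functions in geometry/circle.py.
def calculate_area(r):
    return pi * r ** 2


def calculate_circ(r):
    return 2 * pi * r


def test_calculate_area_scales_with_radius_squared():
    # Doubling the radius should quadruple the area.
    assert isclose(calculate_area(2.0), 4 * calculate_area(1.0))


def test_calculate_circ():
    # Compare floats with a tolerance instead of exact equality.
    assert isclose(calculate_circ(0.1), 0.2 * pi)
```

Testing a property (area grows with the square of the radius) alongside a specific value makes it less likely that a wrong formula slips through by coincidence.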

To automate the running of tests as much as possible, we can use a system known as a "test harness"; one popular test harness for Python is [Pytest](https://docs.pytest.org/).
- the Pytest test harness identifies functions that are software tests by looking for them in files whose names start with `test_` or end with `_test.py`
  - it runs these functions and keeps track of the functions that pass the test (do not raise errors) and those that fail (do raise errors)
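The core idea of a test harness (find the test functions, run them, record which ones raise errors) fits in a few lines. The following is a toy illustration only; Pytest itself does far more, and the names `run_tests` and `example_namespace` are hypothetical.

```python
def run_tests(namespace):
    """Run every callable whose name starts with 'test_' and record outcomes.

    A toy illustration of the idea; Pytest itself does far more than this.
    """
    results = {}
    for name, obj in sorted(namespace.items()):
        if name.startswith("test_") and callable(obj):
            try:
                obj()
                results[name] = "passed"
            except Exception:
                # AssertionError is a subclass of Exception, so a failed
                # assert lands here just like any other error.
                results[name] = "failed"
    return results


def example_namespace():
    """Build a small namespace with one passing and one failing test."""
    def test_addition():
        assert 1 + 1 == 2

    def test_broken():
        assert 1 + 1 == 3  # deliberately wrong

    return {
        "test_addition": test_addition,
        "test_broken": test_broken,
        "helper": lambda: None,  # ignored: name does not start with 'test_'
    }


print(run_tests(example_namespace()))
```

Because the harness keeps going after a failure, a single run reports the state of every test rather than stopping at the first problem.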

Another approach for automating your testing is called "continuous integration".
- the system that keeps track of versions of your code (e.g., the GitHub website) also automatically runs all of the tests that you wrote every time that you make changes to the code
  - the tests can be run on the code before it is integrated into the `main` branch, allowing contributors to fix changes that cause test failures before they are merged
- continuous integration is implemented in GitHub through a system called "[GitHub Actions](https://github.com/features/actions)"


### 7.5.2. Documentation

A further step is to write more detailed documentation and make the documentation available together with your software.
- a system routinely used for this across the Python universe is [Sphinx](https://www.sphinx-doc.org/en/master/)


## 7.6. Summary

When you make the software that you write for your use easy to install and openly available, you'll make your work easier to reproduce, and also easier to extend.
- other people might start using it
    - some of them might run into bugs and issues with the software; some of them might even contact you to ask for help with the software
    - this could lead to fruitful collaborations with other researchers who use your software

It could have an impact on the understanding of the universe, and the improvement of the human condition.
- some people have made careers out of building and supporting a community of users and developers around software that they write and maintain

It is also fine to let people know that the software is provided openly, but with no assurance of any support.

### 7.6.1. Software citation and attribution

It is common to cite a paper for the findings and ideas in it when we perform our research.
- the notion that software should also be cited is less common

In recent years, there has been an increased effort to provide ways for researchers to cite software, and for researchers who share their software to be cited.
- to do this, make sure that your software has a Digital Object Identifier (DOI)
    - a DOI uniquely identifies a digital object; many journals already require one for papers/articles
    - one way to obtain a DOI for software is through a service administered by the European Organization for Nuclear Research (CERN) called [Zenodo](https://zenodo.org/)

## 7.7. Additional resources

The Python community put together the [Python Packaging Authority (PyPA) website](https://www.pypa.io/en/latest/)
- explains how to package and distribute Python code

We learned about the conda package manager in Section 4.1.
- another great way to distribute scientific software using conda is provided through the [Conda Forge](https://conda-forge.org/) project

Jake Vanderplas's useful [blog post](https://www.astrobetter.com/blog/2014/03/10/the-whys-and-hows-of-licensing-scientific-code/) on the topic of scientific software licensing

A book, [Producing Open Source Software](https://producingoss.com/) by Karl Fogel
- everything from naming an open-source software project to legal and management issues such as licensing, distribution, and intellectual property rights


<center> <h1> Thank you! </h1>

<h1> Q/A? </h1> </center>
