Python packaging
================

When you import a Python library, you are importing code from a module or a package (a package is a directory of modules). Python does some work behind the scenes to find that code. For example, consider this simple import from the standard library. Where does that code live?



In [None]:
import os



We can find where the code for that library resides using the `__file__` attribute.



In [None]:
os.__file__



We can import this file without saying where it is because Python keeps a list of directories to search. That list is available as `sys.path` in the `sys` module. Here, Python looks for a file named os.py in one of those directories. Your path may look different from this.



In [None]:
import sys
sys.path



You can see how this works here.



In [None]:
for path in sys.path:
    if os.path.exists(os.path.join(path, 'os.py')):
        print(path)
        break
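
The loop above only checks for a file literally named os.py. As a rough alternative, the standard library function `importlib.util.find_spec` asks the import system where it would load a name from, without importing it. This is a minimal sketch: the helper name `locate` and the module name `not_a_real_module_xyz` are made up for illustration.

```python
import importlib.util


def locate(name):
    """Return (origin, is_package) for a top-level name, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Packages carry a list of directories to search for submodules.
    is_package = spec.submodule_search_locations is not None
    return spec.origin, is_package


# textwrap is a single-file module, json is a package (a directory).
for name in ("textwrap", "json", "not_a_real_module_xyz"):
    print(name, "->", locate(name))
```

The paths printed depend on your Python installation, so they will differ from machine to machine.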
                      



# Anatomy of a package

A Python package is a collection of files and directories that follow some conventions. It is common for the whole set to be in a single root directory. This is helpful to isolate the files from other files, so they are easy to move later.

In the package root, you need several files:

- [README.md](./package-root/README.md) :: A text file describing the package
- [setup.py](./package-root/setup.py) :: A Python file for installing the package
- [LICENSE](./package-root/LICENSE) :: A file containing the terms of use for your package.

There are a lot of licenses: https://opensource.org/licenses. We will primarily focus on the MIT license.

We put the source for our package in a directory inside called *testpack*.

Inside the testpack directory there must be an `__init__.py` file, and maybe additional package source files (.py files). 

Check out [\_\_init__.py](./package-root/testpack/__init__.py). This file is run every time you import the package. We define a single function in this file that we can use later, and there is a diagnostic line that should print when we import the package later.

Finally see the overall structure here.



In [None]:
! tree package-root



We cannot directly import this package yet. Try it:



In [None]:
import testpack
testpack.__file__



That fails because the package is not found anywhere on your Python path. Usually, we install a package to make it importable, but for development purposes we will first modify the path manually. `sys.path` is just a list of directories, and we can insert or append directories using Python. The change is only temporary and lasts while this notebook kernel is alive. We use a relative path here, which assumes the working directory is the directory that contains this notebook. If you haven't specifically changed that, it should be. If in doubt, you can also use an absolute path.



In [None]:
import sys
sys.path.insert(0, 'package-root')
import testpack
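
To see the whole mechanism in one place, here is a self-contained sketch that builds a throwaway package in a temporary directory and imports it through `sys.path`. The package name `demo_pkg_sketch` and its `hello` function are hypothetical stand-ins, not the contents of testpack. An absolute path is used so the working directory does not matter.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package name, chosen so it cannot clash with testpack.
PKG_NAME = "demo_pkg_sketch"

root = Path(tempfile.mkdtemp()).resolve()  # stands in for package-root
pkg_dir = root / PKG_NAME
pkg_dir.mkdir()

# __init__.py runs on import: one diagnostic line and one function.
(pkg_dir / "__init__.py").write_text(
    "print('importing demo_pkg_sketch')\n"
    "\n"
    "def hello(name):\n"
    "    return f'Hello {name}'\n"
)

sys.path.insert(0, str(root))    # make the root directory searchable
importlib.invalidate_caches()    # the directory was created after startup
demo = importlib.import_module(PKG_NAME)

print(demo.__file__)
print(demo.hello("John"))
```

Inserting at position 0 means this directory is searched first, so a package here would shadow an installed package of the same name.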



Now we can access the `hello` function defined in the `__init__.py` file. We use dot notation on the package name to reach it.



In [None]:
testpack.hello('John')



# Version control

It is tempting to start modifying the package right away. That would probably be a mistake though. What if we do something that breaks it? How would we recover back to a working state? The solution to this problem is called *version control*. It is an essential part of software development. We will use git (https://git-scm.com/doc) for version control. 

With git, we will create a *repository* in our package-root. Then we can *commit* changes we make to files in the repository as we go. If some changes don't work out, we can *revert* them. We can also make *branches* to test ideas out on. 

To get started, we need to tell git about ourselves. Open a terminal, and run these commands (obviously, change the name and email to yours):

    git config --global user.name "John Doe"
    git config --global user.email johndoe@example.com

That should create a file called ~/.gitconfig. Check out the contents:

    cat ~/.gitconfig 



Next, change into the package-root directory in your shell.

    cd ~/s23-06682/lectures/03-python-packaging/package-root
    
and in this directory run this command to create a git repository.

    git init
    
You should see something like:

    Initialized empty Git repository in /home/jupyter-jkitchin@andrew.cm-11dd7/src/lectures/03-python-packaging/package-root/.git/
    
A new directory called .git has been created in package-root. This is where your git repository is stored. So far, there are no commits in it. Let's check the status.



In [None]:
%%bash
cd package-root
git init
git status



git is telling us that we are on the master branch and we have many untracked files. Today it is preferred for the default branch to be called `main` rather than master (https://www.theserverside.com/feature/Why-GitHub-renamed-its-master-branch-to-main). Let's change that. We simply check out a new branch called main.

    git checkout -b main
    



In [None]:
%%bash
cd package-root
git checkout -b main



Now, we can add files. There are some files we want to ignore. For example, .ipynb_checkpoints does not need to be under version control, and there is a `__pycache__` directory we don't need in the repository. Let us set up a .gitignore file, which goes in the package-root directory. I do it here with shell commands, but you can also open an editor and write it directly. Once the file exists, running `git status` should no longer show those files.

We use > to redirect output into a file. This will overwrite the file each time you use it. To append, we use >>.



In [None]:
! echo __pycache__ > package-root/.gitignore
! echo .ipynb_checkpoints >> package-root/.gitignore  
! cat package-root/.gitignore
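
The same overwrite-versus-append distinction exists in Python's file modes, which gives another way to write the file. This is a minimal sketch that writes to a temporary directory (an assumption made so it does not touch package-root); mode `"w"` plays the role of `>` and mode `"a"` the role of `>>`.

```python
import tempfile
from pathlib import Path

# A temporary file stands in for package-root/.gitignore.
gitignore = Path(tempfile.mkdtemp()) / ".gitignore"

# Mode "w" truncates the file before writing, like `>` in the shell.
with open(gitignore, "w") as f:
    f.write("__pycache__\n")

# Mode "a" adds to the end of the file, like `>>`.
with open(gitignore, "a") as f:
    f.write(".ipynb_checkpoints\n")

print(gitignore.read_text())
```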



The next step is to add and commit the files. Since we have set up the .gitignore file, we will take a shortcut this time, and add everything. Then, we commit the files.

    git add *
    git commit -m "First commit"



In [None]:
%%bash
cd package-root
git add *
git commit -m "First commit"



In [None]:
%%bash
cd package-root
git status



Note that the wildcard did not match the .gitignore file, because the shell's `*` does not match names that start with a dot. We have to add and commit that file separately.



In [None]:
%%bash
cd package-root
git add .gitignore
git commit -m "Add the .gitignore file"
git status
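
Python's `glob` module follows the same convention as the shell for names that start with a dot, so it can be used to illustrate why `*` skipped .gitignore. This is only a sketch in a temporary directory with two empty files; it demonstrates the matching rule and does not run git.

```python
import glob
import os
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
for name in ("README.md", ".gitignore"):
    (root / name).touch()

base = glob.escape(str(root))  # guard against special characters in the path

# "*" skips dotfiles; a pattern must itself start with "." to match them.
visible = sorted(os.path.basename(p) for p in glob.glob(os.path.join(base, "*")))
hidden = sorted(os.path.basename(p) for p in glob.glob(os.path.join(base, ".*")))

print(visible)
print(hidden)
```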



Now we have a "clean" repository. All files are added and committed, and `git status` tells us everything is good. We have made two commits so far.



In [None]:
%%bash
cd package-root
git log



In the log, you can see the two commits. Each one is identified by a long hash, e.g. commit 33a50e04b75c90b34a274aea287dd1e6c6c045de. This is a cryptographic hash computed from the committed content together with its metadata (author, date, message, and parent commit), and we can use it to see what changed, to revert changes, etc. We will return to that later. Now, we are ready to safely make some changes to our package. By safely, I mean we will be able to see what changes were made and undo them if needed.



# First package modification

There are lots of ways to use git. Here we explore the idea of using a *feature branch*. We have a working package, and we want to add a new feature in a way that minimizes the risk of messing up the current state. The strategy is to make a new branch, do all our work there, and when we are satisfied with it, merge it back into main.

Let's see what we have so far. Our commit history is linear, and the current position is at the HEAD commit on `main`.



In [None]:
! cd package-root; git log --graph --oneline



## A feature branch
We are going to check out a new branch; let's call it `feature`.



In [None]:
%%bash
cd package-root
git checkout -b feature
git status
git log --graph --oneline



Now we can add some new features. Let's add a new function to the `__init__.py` file:

```
def goodbye(name):
    return f'Goodbye {name}'
```

After you add that, save the file, and check your git status:



In [None]:
%%bash
cd package-root
git status



This is telling us two things:
1. We are on the feature branch
2. There is a modified file.

Now, let's commit this change.



In [None]:
%%bash
cd package-root
git commit testpack/__init__.py -m "Add a new function"
git log --graph --oneline



## Back and forth on branches

Before we go further, let's see that we can go back to the main branch where that addition does not exist, and then come back. First, we see what is in the file right now.



In [None]:
%%bash
cd package-root
git status
cat testpack/__init__.py



Now, we checkout the main branch. The change we made does not exist there.



In [None]:
%%bash
cd package-root
git checkout main
git status
cat testpack/__init__.py



And now back to our feature branch. Now you see the new feature is back.



In [None]:
%%bash
cd package-root
git checkout feature
git status
cat testpack/__init__.py



## Add a commit on main

git allows us to have many branches where we can add features, fix bugs, try new implementations, etc. Each branch has its own history, so commits made on one branch do not affect the others until you merge. For example, let's go back to the main branch to add some detail to the README.



In [None]:
%%bash
cd package-root
git checkout main
echo -e "\n\nThere is one function: testpack.hello." >> README.md
git commit README.md -m "document the function in the package"
git log --graph --oneline



If we switch back to our feature branch, you will see that this new change does not exist.



In [None]:
%%bash
cd package-root
git checkout feature
cat README.md



## Merge main into the feature branch

Before we continue, we should merge the new change in main into our feature branch. 



In [None]:
%%bash
cd package-root
git merge main
git log --graph --oneline



Now, we can finish up our feature branch. Let's add some documentation to the README.md. Add some text about the new function you added, then commit the change.



Finally, when we are satisfied with the feature branch, we go back to the main branch and merge the feature into it. Once you are done with the branch, it is good practice to delete it.



In [None]:
%%bash
cd package-root
git checkout main
git merge feature
git branch --delete feature
git log --graph --oneline



Let's take some time to review what this git log shows. You can see there was some branching, with commits on different branches. You can see where the main branch was merged into the feature branch, and at the end where the feature branch was merged back into main.
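
One way to read the graph is to think of each commit as a node that remembers its parents, and each branch as a name pointing at one commit. The following is a toy model of that idea, not how git is implemented; `ToyRepo`, `Commit`, and `ancestors` are hypothetical names, and the message of the documentation commit is invented because the text above leaves it to you.

```python
class Commit:
    """A snapshot that remembers its parent commit(s)."""

    def __init__(self, message, parents=()):
        self.message = message
        self.parents = tuple(parents)


def ancestors(commit):
    """Return every commit reachable from `commit`, including itself."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current is not None and current not in seen:
            seen.add(current)
            stack.extend(current.parents)
    return seen


class ToyRepo:
    def __init__(self):
        self.branches = {"main": None}  # branch name -> tip commit
        self.head = "main"              # name of the checked-out branch

    def commit(self, message):
        tip = self.branches[self.head]
        self.branches[self.head] = Commit(message, (tip,) if tip else ())

    def checkout(self, name, create=False):
        if create:
            self.branches[name] = self.branches[self.head]
        self.head = name

    def merge(self, other):
        ours, theirs = self.branches[self.head], self.branches[other]
        if theirs in ancestors(ours):
            return "already up to date"
        if ours in ancestors(theirs):
            self.branches[self.head] = theirs  # only the pointer moves
            return "fast-forward"
        message = f"Merge {other} into {self.head}"
        self.branches[self.head] = Commit(message, (ours, theirs))
        return "merge commit"


repo = ToyRepo()
repo.commit("First commit")
repo.commit("Add the .gitignore file")
repo.checkout("feature", create=True)
repo.commit("Add a new function")
repo.checkout("main")
repo.commit("document the function in the package")
repo.checkout("feature")
first = repo.merge("main")                 # histories diverged
repo.commit("Document the new function")   # hypothetical message
repo.checkout("main")
second = repo.merge("feature")             # main is behind feature
print(first, "/", second)
```

In this model the first merge needs a commit with two parents, while the second only moves `main` forward to the tip of `feature`. By default git also fast-forwards in that situation, so the log may not show a separate merge commit for the final merge.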



In [None]:
# Check we don't have the branch anymore
! cd package-root; git branch -a



# Try the new Python function

We might naively just try it, but it does not work.



In [None]:
testpack.goodbye('John')



That fails because Python caches a package the first time it is imported, and the kernel still holds the version from before we added `goodbye`. This is a consequence of how Python (and in particular the persistent environment in JupyterLab) loads packages. We either have to restart the kernel or reload the package like this.



In [None]:
import importlib
importlib.reload(testpack)
testpack.goodbye('John')
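
The caching behavior can be reproduced in isolation. This sketch writes a throwaway module to a temporary directory, imports it, edits the file on disk, and shows that the new function only appears after `importlib.reload`. The module name `reload_demo_sketch` is hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
source = root / "reload_demo_sketch.py"  # hypothetical module name
source.write_text("def hello(name):\n    return f'Hello {name}'\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()
mod = importlib.import_module("reload_demo_sketch")

# Edit the file on disk, as if a new function had just been merged in.
with open(source, "a") as f:
    f.write("\ndef goodbye(name):\n    return f'Goodbye {name}'\n")

# The module object already in memory has not changed yet.
had_goodbye_before = hasattr(mod, "goodbye")

# reload re-executes the source file in the existing module object.
mod = importlib.reload(mod)
print(had_goodbye_before, mod.goodbye("John"))
```

Note that names bound earlier with `from module import name` still point at the old objects after a reload, which is one reason restarting the kernel is sometimes the simpler option.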



# Summary

We learned how to:

1. Initialize a git repo
2. Add files and commit them to the repo
3. Edit files and commit changes
4. Create a feature branch
5. Make changes on the feature branch
6. Switch between branches
7. Merge changes in branches
8. Delete a feature branch
9. Look at the commit log

git is an iceberg. You can learn a lot more from the [Pro Git book](https://git-scm.com/book/en/v2) and the [reference manual](https://git-scm.com/docs).

You should also read https://merely-useful.tech/py-rse/git-advanced.html.

Today we learned about using branches to try making a change. The nice thing about branches is if you don't like the change, you can simply delete the branch, or go back to the main branch. If you do like it, then you just merge it in, and get on with your work.

There is still quite a bit to learn about git. We will get into some of these things next time, including dealing with merge conflicts.

