# Jupyter Notebook tweaks for the presentation when using RISE
(RISE is a Jupyter Notebook extension)

In [1]:
%%HTML
<style>
 /* cheat codes for improving the presentation font sizes (and also tables).
    Will screw up normal notebook view though */
.CodeMirror {
    width: 100vw;
}

.container {
    width: 99% !important;
}

.rendered_html {
  font-size:0.6em;
}
.rendered_html table, .rendered_html th, .rendered_html tr, .rendered_html td {
     font-size: 150%;
}

</style>

# Packaging/making a library in Python
![Presentation logo](imgs/presentation_logo_transparent.png)


By **Thibault Bétrémieux** _Data Scientist at port-neo Freiburg GmbH (part of port-neo GmbH)_

thibault.betremieux@port-neo.com

# Reasons to use Python libraries

* No more copy-pasting code everywhere. Just import functions from a library
* If you use a library instead of spreading repeated code everywhere, it makes the following easier:
    * Testing (local or with continuous integration, we will see what that is later)
    * Versioning (with documentation of every improvement along the way and of every release)
    * Collaborative development
* Code faster. For instance I can clean names with just a few lines of code that I know from memory
* Compartmentalizing your code in modules/packages will make it more comprehensible and easier to test
* Sharing code with your colleagues and the world and making it easy to install
* Lots of other reasons...

# Creating and publishing a simple library, example with "pangres"

_What is pangres?_

Basically it is a library to update records in a PostgreSQL database very conveniently by using table objects from pandas (pandas DataFrames).

It is installable with <code>pip install pangres</code> since I published it to PyPI (Python Package Index).

I'd be very happy if you check out the repo [here](https://github.com/ThibTrip/pangres) (it gives a bit more details on the library as well) and give me some feedback or ask me some questions 🙂! 

# Live demo of "pangres" for the attendees of my presentation 😎

In this demo we will create an initial dataset and an "updated" dataset (just as if we collected some marketing data at different time periods). 

From the initial dataset we will create a PostgreSQL table using "pangres". Then we will update our table with the "updated" dataset also using "pangres".

## Pretend to collect data from a marketing API

In [2]:
import pandas as pd
# suppose we got this data from some marketing API
df = (pd.DataFrame({'profileid':[123,124],
                    'first_name':['Thibault','John'],
                    'last_name':['Bétrémieux','Rambo'],
                    'is_subscribed':[None, None]})
      .set_index('profileid'))
# then we receive some new data
new_df = (pd.DataFrame({'profileid':[124,125],
                       'first_name':['John','Arnold'],
                       'last_name':['Travolta','Schwarzenegger'],
                       'is_subscribed':[1, 0],
                       'likes_dancing':[1,0]})
          .set_index('profileid'))
# see next slide for viewing the data

In [3]:
print('initial data'); display(df)
print('new data')
print("""* John Rambo became John Travolta
* is_subscribed now has data but it is of integer dtype and in the database it defaulted to "TEXT" since we had no data...
* we have a new profile (125) but a profile disappeared (123)
* we have a new column (likes_dancing)""")
display(new_df)

initial data

| profileid | first_name | last_name | is_subscribed |
|---|---|---|---|
| 123 | Thibault | Bétrémieux | |
| 124 | John | Rambo | |

new data
* John Rambo became John Travolta
* is_subscribed now has data but it is of integer dtype and in the database it defaulted to "TEXT" since we had no data...
* we have a new profile (125) but a profile disappeared (123)
* we have a new column (likes_dancing)

| profileid | first_name | last_name | is_subscribed | likes_dancing |
|---|---|---|---|---|
| 124 | John | Travolta | 1 | 1 |
| 125 | Arnold | Schwarzenegger | 0 | 0 |


In [4]:
# let's create the table using pg_upsert
# first create a sqlalchemy engine to interact with our PostgreSQL database
# more info here https://docs.sqlalchemy.org/en/13/core/engines.html
from sqlalchemy import create_engine
from pangres import pg_upsert
engine = create_engine('postgresql://test_user:password@localhost:5432/test') # replace with your connection string!

# remove previous tests
engine.execute('DROP TABLE IF EXISTS tests.test;')

# push df and create the table at the same time
pg_upsert(engine=engine, df=df,
          table_name='test',
          if_exists='upsert_overwrite',
          schema='tests',
          create_schema=True)

# now push the new data (new_df)
pg_upsert(engine=engine, df=new_df,
          table_name='test',
          # with "upsert_overwrite" John Rambo will become John Travolta
          # with "upsert_keep" John Rambo stays John Rambo 💪
          if_exists='upsert_overwrite', 
          schema='tests',
          add_new_columns=True, # will create column "likes_dancing"
          adapt_dtype_of_empty_db_columns=True) # will change "is_subscribed" dtype to integer

2020-01-24 23:26:27 | INFO     | pangres     | helpers:add_new_columns:281 - Added column likes_dancing BIGINT in tests."test"
2020-01-24 23:26:27 | INFO     | pangres     | helpers:adapt_dtype_of_empty_db_columns:369 - Adapted column type in postgres according to frame, column "is_subscribed" is now of dtype integer


In [5]:
print('initial data'); display(df)
print('new data'); display(new_df)
print('data in the database thanks to pangres 🐼🐘')
(pd.read_sql('SELECT * FROM tests.test', con=engine, index_col='profileid')
 .astype({'is_subscribed':'Int64', 'likes_dancing':'Int64'})) # use nullable integer dtype ("Int64")

initial data


| profileid | first_name | last_name | is_subscribed |
|---|---|---|---|
| 123 | Thibault | Bétrémieux | |
| 124 | John | Rambo | |


new data


| profileid | first_name | last_name | is_subscribed | likes_dancing |
|---|---|---|---|---|
| 124 | John | Travolta | 1 | 1 |
| 125 | Arnold | Schwarzenegger | 0 | 0 |


data in the database thanks to pangres 🐼🐘


| profileid | first_name | last_name | is_subscribed | likes_dancing |
|---|---|---|---|---|
| 123 | Thibault | Bétrémieux | | |
| 124 | John | Travolta | 1.0 | 1.0 |
| 125 | Arnold | Schwarzenegger | 0.0 | 0.0 |
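
To make the difference between the two `if_exists` modes used in the demo more concrete, here is a minimal sketch that mimics them on plain dictionaries, where the dictionary keys play the role of the primary key `profileid`. The function `upsert_rows` is hypothetical and only models the behaviour described in the comments of the demo. It is not how pangres is implemented, since pangres works against PostgreSQL.

```python
def upsert_rows(table, new_rows, if_exists="upsert_overwrite"):
    """Return a copy of `table` after upserting `new_rows` (hypothetical helper)."""
    if if_exists not in ("upsert_overwrite", "upsert_keep"):
        raise ValueError(f"unknown mode: {if_exists}")
    result = {key: dict(row) for key, row in table.items()}
    for key, row in new_rows.items():
        if key not in result:
            result[key] = dict(row)  # unknown primary key: plain insert
        elif if_exists == "upsert_overwrite":
            result[key].update(row)  # known primary key: new values win
        # with "upsert_keep" an existing row is left untouched
    return result


table = {
    123: {"first_name": "Thibault", "last_name": "Bétrémieux"},
    124: {"first_name": "John", "last_name": "Rambo"},
}
new_rows = {
    124: {"first_name": "John", "last_name": "Travolta"},
    125: {"first_name": "Arnold", "last_name": "Schwarzenegger"},
}

overwritten = upsert_rows(table, new_rows, "upsert_overwrite")
kept = upsert_rows(table, new_rows, "upsert_keep")
print(overwritten[124]["last_name"])  # Travolta
print(kept[124]["last_name"])  # Rambo
```

In both modes profile 123 stays in the table and profile 125 is inserted; only the fate of the already existing profile 124 differs.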


# Base structure of "pangres"

```
pangres/ # on this level lie setup/config files, requirements etc. 
│        # on GitHub it would be the repository name
│   LICENSE # this has been generated by GitHub when I created the repo
│   README.md # essential info about your library (what it does,
│             # how to use it...) in Markdown language
│             # Alternatively people use README.rst in 
│             # reStructuredText language
│
├───pangres  # on this level and below lies the code of the library 
│   │        # (it is recommended to use the same name as the 
│   │        # repository/parent folder)
│   │   core.py
│   │   helpers.py
│   │   __init__.py
│   │       
│   ├───tests
│   │   │   conftest.py # things to run before/after tests 
│   │   │               # (used by the library pytest)
│   │   │   test_chunsize.py
│   │   │   test_index.py
│   │   │   test_pandas_special_engine.py
│   │   │   test_pg_upsert.py
│   │   │   test_pg_upsert2.py
│   │   │   test_pg_upsert_speed.py
│   │   │   __init__.py
```

## About \_\_init\_\_.py files

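Those files have two purposes:
* The presence of the file tells Python that the folder is a package
* The content of the file tells Python how to initialize the library, for instance which names can be imported directly from the package

```python
# when __init__.py is empty
from pangres.core import pg_upsert
# when __init__.py itself imports pg_upsert from core.py
from pangres import pg_upsert
```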

Note that we could have the same effect by putting all the code in core.py inside \_\_init\_\_.py but that does not seem very Pythonic to me (I think \_\_init\_\_.py should be reserved for initialization as the name indicates).
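
As a rough illustration of the second purpose, the sketch below builds a throwaway package in a temporary folder and shows that an import placed in \_\_init\_\_.py enables the short import form. The package name `demo_pkg` and the function `greet` are made up for this example and have nothing to do with pangres.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    # create the hypothetical package demo_pkg/ with a module core.py
    pkg = Path(root) / "demo_pkg"
    pkg.mkdir()
    (pkg / "core.py").write_text("def greet(name):\n    return f'Hello {name}'\n")
    # __init__.py re-exports greet so users do not need to know about core.py
    (pkg / "__init__.py").write_text("from demo_pkg.core import greet\n")

    sys.path.insert(0, root)  # make the temporary folder importable
    importlib.invalidate_caches()
    from demo_pkg import greet  # short form enabled by __init__.py
    from demo_pkg.core import greet as greet_from_core  # long form still works
    sys.path.remove(root)

print(greet("Thibault"))
print(greet is greet_from_core)
```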

# Testing

The library pytest (<code>pip install pytest</code>) automatically collects the tests located in the "tests" folder of my library. In every module prefixed with "test\_", all the functions whose names are prefixed with "test\_" will be run by pytest.
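
A minimal sketch of what such a test module could look like is shown below. The module name `tests/test_names.py` and the function `clean_name` are hypothetical. In a real library `clean_name` would be imported from the library instead of being defined next to the tests.

```python
# content of a hypothetical module tests/test_names.py


def clean_name(name: str) -> str:
    """Strip surrounding whitespace and capitalize each part of a name."""
    return name.strip().title()


# pytest runs every function whose name starts with "test_";
# a test fails when one of its assert statements fails
def test_clean_name_strips_whitespace():
    assert clean_name("  john ") == "John"


def test_clean_name_capitalizes_each_part():
    assert clean_name("john rambo") == "John Rambo"
```

Running <code>pytest</code> from the root of the repository would then report two passed tests for this module.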

Testing should be done in an **isolated environment**. You can create so-called virtual environments with lots of tools. I recommend using conda (can be installed by installing [Miniconda](https://docs.conda.io/en/latest/miniconda.html) or [Anaconda](https://www.anaconda.com/distribution/#download-section)).

You can also test code examples in docstrings using doctest. doctest can [work together with pytest](http://doc.pytest.org/en/latest/doctest.html).
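
As a small sketch of the idea, the function below carries an example in its docstring and the standard library module doctest checks that the example still produces the documented output. The function `decompose_name` is only an illustration.

```python
import doctest


def decompose_name(full_name):
    """
    Split a full name into a first name and a last name.

    Examples
    --------
    >>> decompose_name('John Rambo')
    {'first_name': 'John', 'last_name': 'Rambo'}
    """
    first_name, last_name = full_name.split(' ', 1)
    return {'first_name': first_name, 'last_name': last_name}


# collect the examples of the docstring and run them
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner()
for test in finder.find(decompose_name, globs={'decompose_name': decompose_name}):
    runner.run(test)

print(f'{runner.tries} example(s) tried, {runner.failures} failure(s)')
```

With pytest the same docstring examples can be collected by running <code>pytest --doctest-modules</code>.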

# How I packaged and published the library to PyPI (Python Package Index)

(using the old way because I got lazy 🐢 and didn't know how to do it the "new way")

Seriously though, there is probably nothing wrong with this method if you understand how it works. Besides, it is used by the great majority of Python libraries nowadays. But the "new way" (described later) is much more convenient.

Everything is very well documented in this [article](https://medium.com/@joel.barmettler/how-to-upload-your-python-package-to-pypi-65edc5fe9c56). Also [here](https://python-packaging.readthedocs.io/en/latest/minimal.html) is a great minimal package structure example.

**Important**: you should upload your library to [TestPyPI](https://test.pypi.org/) before uploading it to [PyPI](https://pypi.org/)! You'll need to create a separate account for each of them.

See the structure of pangres after I was done packaging it (minus some exotic folders and files such as ".circleci" which I'll talk about later) on the next slide.

```
pangres/
│   .gitignore # lists files to ignore in the "pangres" repo
│              # (for instance pycache), added by GitHub
│   LICENSE
│   MANIFEST.in # lists non-Python files to include when pip installing
│               # so they are copied over to site-packages
│               # (for instance README.md)
│   README.md
│   requirements.txt # lists dependencies for end users e.g. pandas
│   requirements-dev.txt # lists dependencies for developers e.g. pytest
│   setup.cfg # additional configuration for the setup
│             # (e.g. which type of distribution)
│   setup.py # script containing essential metadata on how to install
│            # the library (author, version,...).
│            # it reuses the content of requirements.txt
│               
├───pangres
│   │   core.py
│   │   helpers.py
│   │   __init__.py
│   │       
│   ├───tests
│   │   │   conftest.py
│   │   │   test_chunsize.py
│   │   │   test_index.py
│   │   │   test_pandas_special_engine.py
│   │   │   test_pg_upsert.py
│   │   │   test_pg_upsert2.py
│   │   │   test_pg_upsert_speed.py
│   │   │   __init__.py
```
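
To give an idea of how a setup.py can reuse the content of requirements.txt, here is a minimal sketch. It is not the actual setup.py of pangres: the library name `mylib`, the author and the listed requirements are placeholders, and the requirements are read from a string instead of a file so that the example is self-contained.

```python
def parse_requirements(text):
    """Return the requirement lines, skipping blank lines and comments."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


# in a real setup.py this text would be read from requirements.txt
requirements_txt = """
# dependencies for end users
pandas>=0.25
sqlalchemy
"""

# metadata that would be handed over to setuptools (all values are placeholders)
setup_kwargs = {
    "name": "mylib",
    "version": "0.1.0",
    "author": "Jane Doe",
    "packages": ["mylib"],
    "install_requires": parse_requirements(requirements_txt),
}
print(setup_kwargs["install_requires"])

# a real setup.py would end with:
# from setuptools import setup
# setup(**setup_kwargs)
```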

# Packaging using poetry
As with the other method, you will need to have your code hosted in a repository online (e.g. on GitHub).

Let's repackage "pangres" using the tool "poetry".
I recommend this [guide](https://johnfraney.ca/posts/2019/05/28/create-publish-python-package-poetry/) for learning how to publish with poetry.

We will use a single file called "pyproject.toml" which holds all the information that was previously spread over:
* MANIFEST.in
* requirements.txt
* requirements-dev.txt
* setup.cfg
* setup.py

## A. Install poetry

**osx / linux / bashonwindows**

<code>curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python</code>

**windows powershell**

<code>(Invoke-WebRequest -Uri https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py -UseBasicParsing).Content | python</code>


## B. Create a pyproject.toml with poetry

<code>poetry init</code>

Let poetry guide you to create the pyproject.toml file.

## C. Let's put pyproject.toml into our structure

Note that I could also have created the library structure using <code>poetry new pangres</code> but I would have had to change too many things in the default structure, so in this case it is not helpful.

```
pangres/
│   LICENSE
│   -> pyproject.toml
│   README.md
│
├───pangres
│   │   core.py
│   │   helpers.py
│   │   _config.py
│   │   __init__.py
│   │       
│   ├───tests
│   │   │   conftest.py
│   │   │   test_chunsize.py
│   │   │   test_index.py
│   │   │   test_pandas_special_engine.py
│   │   │   test_pg_upsert.py
│   │   │   test_pg_upsert2.py
│   │   │   test_pg_upsert_speed.py
│   │   │   __init__.py
```

## D. Let's install and test pangres

```
poetry install
# pytest will run inside a virtual environment made/selected by poetry 😎!
poetry run pytest pangres
```
If you change some code in your library you can just rerun <code>poetry run pytest pangres</code>.

WARNING: at the time of writing it is not possible to use <code>pip install -e .</code> (install the library in editable mode) without a "setup.py" file. So if you want to use your library outside of poetry you will have to use <code>pip install .</code> (a normal, non-editable install). To update your local installation you will have to rebuild the library (see step E.) and then run pip install again (I hope editable installs will be possible soon as this is important to me).

```
$ pip install -e .
ERROR: File "setup.py" not found. Directory cannot be installed in editable mode: /home/thib/Documents/pangres
(A "pyproject.toml" file was found, but editable mode currently requires a setup.py based build.)
```

## E. Build, test publish and publish
Build:
```
poetry build
```

Test publish using TestPyPI:
```
# add TestPyPI as an alternate package repository
poetry config repositories.testpypi https://test.pypi.org/legacy/
# publish to TestPyPI then test a pip install from TestPyPI
poetry publish -r testpypi
pip install --index-url https://test.pypi.org/simple/ pangres
```

If we can use the library as expected, we publish it on PyPI:
```
poetry publish
```

## Additional cool stuff with poetry

* Adding a new dependency: <code>poetry add pandas</code>
* Removing a dependency: <code>poetry remove pandas</code>
* Adding a new dev dependency: <code>poetry add pytest-cov --dev</code>
* Removing a dev dependency: <code>poetry remove pytest-cov --dev</code>
* Getting the latest version of the dependencies: <code>poetry update</code>
* Showing dependencies: <code>poetry show</code>
* [Working with local packages](https://python-poetry.org/docs/versions/#path-dependencies) 
* etc.

See more [here](https://python-poetry.org/docs/cli/).

# Continuous integration and coverage

* Sign up on [circleci.com](https://circleci.com/) using GitHub
* Sign up on [codecov.io](https://codecov.io/) using GitHub
* Create a folder ".circleci" in the root of your repo then create a file called "config.yml" inside it which will contain the configuration for the tests. Example [here](https://github.com/ThibTrip/pangres/blob/master/.circleci/config.yml).
* Follow the project on CircleCI
* Follow the project on Codecov

# Additional recommendations for libraries (and when writing Python in general)


## Docstrings
For the docstrings I highly recommend [pandas' docstring guide](https://dev.pandas.io/docs/development/contributing_docstring.html#docstring). All docstrings in pangres are written according to the pandas guide. Alternatively you may prefer the [Google docstring style](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html).
pandas indicates parameter types and types of return values within docstrings. You probably do not need to do that since there is a better way to do this directly within the code by using [type annotations](https://docs.python.org/3/library/typing.html). Here is a simple example of type annotations:
```python
from typing import Dict

def decompose_name(full_name: str) -> Dict[str, str]: # returns dict with keys and vals of type str
    splitted = full_name.split(' ')
    return {'first_name':splitted[0], 'last_name':splitted[1]} 
```
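
To show how a docstring and type annotations can complement each other, here is a sketch of a function documented with the section layout of the pandas guide (Parameters, Returns, Raises), where the types are left to the annotations. The extra parameter `sep` and the error handling are additions made for the sake of the example.

```python
from typing import Dict


def decompose_name(full_name: str, sep: str = ' ') -> Dict[str, str]:
    """
    Split a full name into a first name and a last name.

    Parameters
    ----------
    full_name
        First name and last name separated by `sep`.
    sep
        Separator between the first name and the last name.

    Returns
    -------
    dict
        Dictionary with the keys 'first_name' and 'last_name'.

    Raises
    ------
    ValueError
        If `sep` does not occur in `full_name`.
    """
    first_name, found_sep, last_name = full_name.partition(sep)
    if not found_sep:
        raise ValueError(f'separator {sep!r} not found in {full_name!r}')
    return {'first_name': first_name, 'last_name': last_name}


print(decompose_name('John Rambo'))
```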


## Code style / Linting
I recommend a code formatter like [black](https://github.com/psf/black) so that Python code always looks the same and you can focus on what it does. I personally used [yapf](https://github.com/google/yapf) (google flavor) because I don't like how black places brackets and parentheses (it looks like JavaScript 😬) and because method chaining, which I use a lot with pandas, is often screwed up. I find yapf-google much better for me but it's still far from perfect.

You should also use [flake8](http://flake8.pycqa.org/en/latest/) to check for syntax errors, superfluous spaces, undefined names etc.

# Publishing documentation


I will indicate three ways to create documentation.

## First way: auto generated documentation using docstrings (and optionally your own files)

Sphinx is very popular for doing that (<code>pip install sphinx</code>) and it does look good. See an example of documentation generated with Sphinx [here](https://pandas.pydata.org/pandas-docs/stable/index.html) (this pandas documentation is now old; they have a [new website](https://dev.pandas.io/)).

The procedure with Sphinx goes something like this:
* run sphinx-quickstart (sets up the skeleton of your documentation)
* run [sphinx-apidoc](https://www.sphinx-doc.org/en/master/man/sphinx-apidoc.html) (to auto-generate documentation from Python docstrings)
* add your own Markdown files to the mix (you'll have to do some trickery to get Markdown working, otherwise use reStructuredText)
* use [readthedocs](https://readthedocs.org/) to host your documentation for free (I think GitHub Pages would work too but readthedocs picks it up from your repo which is convenient)
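
Sphinx is configured through a Python file called conf.py that sphinx-quickstart generates. A minimal sketch of such a file, assuming the docstrings follow the pandas (NumPy) or Google style, could look like this. The project name, author and version are placeholders.

```python
# docs/conf.py (hypothetical minimal configuration)

project = "mylib"
author = "Jane Doe"
release = "0.1.0"

extensions = [
    "sphinx.ext.autodoc",  # pulls the documentation out of the docstrings
    "sphinx.ext.napoleon",  # understands NumPy and Google style docstrings
]

html_theme = "alabaster"  # the default theme shipped with Sphinx
```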

I have managed this in the past but somehow could not reproduce it. Online guides were not very helpful either as following them never produced the expected output.
Perhaps it's just me and it is actually straightforward. Anyhow there are lots of other tools out there to do this such as [mkdocs](https://www.mkdocs.org/).

## Second way: GitHub wiki

There is a great example [here](https://github.com/Netflix/Hystrix/wiki). It seems very convincing and this is something I would definitely try.

## Third way (well yes but actually no): Markdown madness
I presented this during the meetup so I included this third "way" but now I realize that it's a bit stupid, especially when there is GitHub wiki which seems quite easy to use 🙈. So basically I just put everything in README.md. It's perhaps practical for small docs but with lots of functions it will be quite tedious... Here is an [example](https://github.com/VingtCinq/python-mailchimp). You can also do a table of contents BTW:

![Markdown doc](imgs/markdown_doc.png)

Note: you will have to replace spaces with "-" in anchor links (e.g. #Do**<span style="color:red">-</span>**Stuff)
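
If you do not want to write the table of contents by hand, a few lines of Python can generate it. The helper `toc_entry` below is a hypothetical sketch that only applies the rule mentioned in the note (spaces become "-"). The rules GitHub actually uses are stricter (anchors are lowercased and most punctuation is dropped) and the sketch would also pick up "#" lines inside code blocks, so treat it as a starting point.

```python
def toc_entry(heading: str) -> str:
    """Turn a Markdown heading such as '## Do Stuff' into a bullet with an anchor link."""
    level = len(heading) - len(heading.lstrip("#"))  # number of leading "#"
    title = heading.lstrip("#").strip()
    anchor = title.replace(" ", "-")  # spaces are not allowed in anchor links
    indent = "    " * (level - 1)  # nest sub-headings below their parent
    return f"{indent}* [{title}](#{anchor})"


readme = """# My Library
Some text
## Do Stuff
## Do More Stuff"""

toc = [toc_entry(line) for line in readme.splitlines() if line.startswith("#")]
print("\n".join(toc))
```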

# Conclusion

* Making libraries can be very useful. It does require some knowledge but this is good knowledge to have (for instance git, docstring formatting or continuous integration)!
* If you decide to publish a library you may help other people and have other people help you improve your library! People may even create new features or other libraries based on it
* poetry makes things a little easier 🐢
* Publishing documentation still seems very impractical to me but I am very hopeful about GitHub wiki, thanks for the tip during the presentation 🐓!