
# Packaging/making a library in Python
![Presentation logo](imgs/presentation_logo.png)


By **Thibault Bétrémieux** _Data Scientist at port-neo Freiburg GmbH (part of port-neo GmbH)_

# Introduction

# About port-neo GmbH
![port-neo logo](imgs/port-neo_logo.png)

* 360 marketing agency
    * Marketing strategy
    * Market and needs analysis
    * Digital brand management
    * Customer Journey
    * ...
* ~120 employees distributed amongst Freiburg, Stuttgart, Munich and Zürich
* The philosophy of port-neo is to be an important partner for its customers and grow along with them
* Motto: "Data meets empathy"
* Website: https://www.port-neo.com/

# About port-neo Freiburg GmbH

port-neo Freiburg GmbH focuses on newsletter communication (so-called "Dialogmarketing").

Main activities (my activities in **<span style="color:green">green</span>**):
* **Support**, management, distribution of newsletter accounts (Evalanche, PROMIO, Elaine)
* Newsletter templates (1000+ templates designed), design and content
* Subscription pages (for subscribing to a newsletter)
* Marketing dashboards - **<span style="color:green">I collect and prepare all the data for the dashboards</span>**
* **<span style="color:green">Data cleaning</span>**
    * Encoding issues e.g. <code>"Falsches Ãœben von Xylophonmusik quÃ¤lt jeden grÃ¶ÃŸeren Zwerg."</code> and superfluous characters issues e.g. <code>") Thomas   Müller () ("</code>
    * Inferring and completing/correcting genders using first names and countries e.g. <code>("Andrea", "IT") => "male"</code>|<code>("Andrea", "DE") => "female"</code> (probabilities and sample sizes are used)
    * Name cleaning e.g. <code>"Pr. Dr. Otto-H von Heinrich (cooler Typ)" => {'first_name':'Otto', 'last_name':'Heinrich', 'title':'Pr. Dr.', ...}</code>
    * and much more...
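
To make the first bullet concrete, here is a minimal sketch of the two cleaning steps applied to the examples above. It is not the code of my private libraries: the function names `fix_mojibake` and `strip_superfluous` are hypothetical, and the sketch assumes the garbled text is UTF-8 that was wrongly decoded as Windows-1252.

```python
import re


def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Windows-1252 (assumption)."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # the text was not garbled in the assumed way: leave it untouched
        return text


def strip_superfluous(text):
    """Collapse whitespace, then remove spaces and parentheses at both ends."""
    collapsed = re.sub(r"\s+", " ", text)
    return collapsed.strip(" ()")


print(fix_mojibake("Falsches Ãœben von Xylophonmusik quÃ¤lt jeden grÃ¶ÃŸeren Zwerg."))
# Falsches Üben von Xylophonmusik quält jeden größeren Zwerg.
print(strip_superfluous(") Thomas   Müller () ("))
# Thomas Müller
```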

# Why I use Python

* Data cleaning (9 private Python libraries dedicated to data cleaning)
* Data collection (text files, Excel, REST/SOAP, ...) from customers and marketing accounts
* Data transformations
* SQL database management (mostly PostgreSQL and SQLite3)
* Web apps experiments with dash and flask
* Making libraries
* Lots of other reasons...

My work relies heavily on the Python library pandas 🐼. Amongst the 20+ private libraries I have developed, only one or two do not use pandas.

Each new major version of pandas is like a small Christmas 🎅 for me and I am starting to contribute.

# My environment for the presentation

* OS: Pop OS! (Ubuntu clone)
* Jupyter Lab. My favorite programming environment, which does about everything apart from making coffee. I mostly write Python in it but have also tried out a bit of JavaScript and Julia. With extensions you can do crazy stuff like using draw.io within it.
* DBeaver for SQL databases management/exploration

# Reasons to use Python libraries

* No more copy pasting code everywhere. Just import functions from a library
* If you use a library instead of spreading repeated code everywhere it makes it easier for:
    * Testing (local or with continuous integration)
    * Versioning (with the documentation of every improvement along the way and releases)
    * Collaborative development
* Code faster. For instance I can clean names with just a few lines of code that I know from memory
* Compartmentalizing your code in modules/packages will make it more comprehensible and easier to test
* Sharing code with your colleagues and the world and making it easy to install
* Lots of other reasons...

# Creating and publishing a simple library, example with "pangres"

_What is pangres?_

<blockquote cite="https://github.com/ThibTrip/pangres">
Postgres upsert with pandas DataFrames (ON CONFLICT DO NOTHING or ON CONFLICT DO UPDATE) with some additional optional features:

* Create columns in DataFrame to upsert that do not yet exist in the postgres database
* Alter column data types in postgres for empty columns that do not match the data types of the DataFrame to upsert.
</blockquote>
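
For readers who have never seen an upsert, here is a minimal sketch of the two `ON CONFLICT` behaviours. It uses the standard library module `sqlite3` as a stand-in for PostgreSQL (SQLite 3.24+ accepts a very similar syntax), so it only illustrates the SQL semantics and is not how pangres is implemented. The table name `profiles` is made up for the example.

```python
import sqlite3

# in-memory database used as a stand-in for PostgreSQL (requires SQLite >= 3.24)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE profiles (profileid INTEGER PRIMARY KEY, last_name TEXT)")
con.execute("INSERT INTO profiles VALUES (124, 'Rambo')")

# ON CONFLICT DO NOTHING: the existing row wins
con.execute("INSERT INTO profiles VALUES (124, 'Travolta') "
            "ON CONFLICT (profileid) DO NOTHING")
kept = con.execute("SELECT last_name FROM profiles WHERE profileid = 124").fetchone()[0]

# ON CONFLICT DO UPDATE: the incoming row wins
con.execute("INSERT INTO profiles VALUES (124, 'Travolta') "
            "ON CONFLICT (profileid) DO UPDATE SET last_name = excluded.last_name")
overwritten = con.execute("SELECT last_name FROM profiles WHERE profileid = 124").fetchone()[0]

print(kept, overwritten)  # Rambo Travolta
con.close()
```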

# Live demo of "pangres" for the attendees to my presentation 😎

In this demo we will create an initial dataset and an "updated" dataset (just as if we collected some marketing data at different time periods). 

From the initial dataset we will create a PostgreSQL table using "pangres". Then we will update our table with the "updated" dataset also using "pangres".

## Pretend to collect data from a marketing API

In [1]:
import pandas as pd

# suppose we got this data from some marketing API
print('initial data')
df = (pd.DataFrame({'profileid':[123,124],
                    'first_name':['Thibault','John'],
                    'last_name':['Bétrémieux','Rambo'],
                    'is_subscribed':[None, None]})
      .set_index('profileid'))
display(df)

# then we receive some new data
print('new data')
print('> John Rambo became John Travolta')
print('> is_subscribed now has data')
print('> we have a new profile (125) but a profile disappeared (123)')
print('> we have a new column (likes_dancing)')
new_df = (pd.DataFrame({'profileid':[124,125],
                        'first_name':['John','Arnold'],
                        'last_name':['Travolta','Schwarzenegger'],
                        'is_subscribed':[1, 0],
                        'likes_dancing':[1,0]})
          .set_index('profileid'))
display(new_df)

initial data

| profileid | first_name | last_name | is_subscribed |
|---|---|---|---|
| 123 | Thibault | Bétrémieux | None |
| 124 | John | Rambo | None |

new data
> John Rambo became John Travolta
> is_subscribed now has data
> we have a new profile (125) but a profile disappeared (123)
> we have a new column (likes_dancing)

| profileid | first_name | last_name | is_subscribed | likes_dancing |
|---|---|---|---|---|
| 124 | John | Travolta | 1 | 1 |
| 125 | Arnold | Schwarzenegger | 0 | 0 |


In [None]:
# let's create the table using pg_upsert
from sqlalchemy import create_engine
from pangres import pg_upsert

# first create a sqlalchemy engine to interact with our PostgreSQL database
# more info here https://docs.sqlalchemy.org/en/13/core/engines.html
# replace '<CONNECTION_STRING>' with a connection string to a PostgreSQL database
engine = create_engine('<CONNECTION_STRING>')

# push df and create the table at the same time
pg_upsert(engine=engine,
          df=df,
          table_name='test',
          if_exists='upsert_overwrite',
          schema='tests',
          create_schema=True)

# now push the new data (new_df)
pg_upsert(engine=engine,
          df=new_df,
          table_name='test',
          # with "upsert_overwrite" John Rambo will become John Travolta
          # with "upsert_keep" John Rambo stays John Rambo 💪
          if_exists='upsert_overwrite',
          schema='tests',
          add_new_columns=True, # will create column "likes_dancing"
          adapt_dtype_of_empty_db_columns=True) # will change "is_subscribed" dtype to integer

# clean up: drop the demo table
with engine.connect() as con:
    con.execute('DROP TABLE IF EXISTS tests.test;')

# Base structure of "pangres"

```
pangres/ # on this level lies setup/config files, requirements etc. on GitHub it would be the repository name
│   LICENSE # this has been generated by GitHub when I created the repo
│   README.md # essential information about your library (what it does, how to use it...) written in Markdown language
│             # Alternatively people use README.rst in reStructuredText language
│
├───pangres  # on this level and below lies the code of the library 
│   │        # (it is recommended to use the same name as the repository/parent folder)
│   │   core.py
│   │   helpers.py
│   │   _config.py # the prefix "_" indicates it is a private (hidden) module and 
│   │              # it won't appear the same way in auto-completion
│   │   __init__.py
│   │       
│   ├───tests
│   │   │   test_chunsize.py
│   │   │   test_index.py
│   │   │   test_pandas_special_engine.py
│   │   │   test_pg_upsert.py
│   │   │   test_pg_upsert2.py
│   │   │   test_pg_upsert_speed.py
│   │   │   __init__.py
```

## About \_\_init\_\_.py files

These files have two purposes:
* The presence of the file tells Python that the folder is a package
* Its content tells Python how to initialize the library
```python
# without __init__.py
from pangres.core import pg_upsert
# with __init__.py
from pangres import pg_upsert
```

Note that we could have the same effect by putting all the code in core.py inside \_\_init\_\_.py but that does not seem very Pythonic to me (I think \_\_init\_\_.py should be reserved for initialization as the name indicates).
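
The re-export mechanism can be tried out without touching a real library. The sketch below builds a throwaway package on disk (the names `demo_pkg` and `greet` are hypothetical) whose \_\_init\_\_.py contains a single import line, then imports the function from the top level of the package.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # build a tiny package: demo_pkg/core.py and demo_pkg/__init__.py
    package_dir = Path(tmp) / "demo_pkg"
    package_dir.mkdir()
    (package_dir / "core.py").write_text("def greet():\n    return 'hello from core'\n")
    # this single line is what makes "from demo_pkg import greet" possible
    (package_dir / "__init__.py").write_text("from demo_pkg.core import greet\n")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        demo_pkg = importlib.import_module("demo_pkg")
        message = demo_pkg.greet()  # greet lives in core.py but is exposed by __init__.py
    finally:
        sys.path.remove(tmp)

print(message)  # hello from core
```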

# Testing

The library pytest (<code>pip install pytest</code>) automatically collects tests located in the "tests" folder of my library. In all modules prefixed with "test_", all the functions whose names are prefixed with "test_" will be run by pytest.
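
As an illustration, a test module could look like the sketch below. The function `split_full_name` is a made-up example and not part of pangres; in a real library it would live in the package and be imported by a module such as "tests/test_names.py" (hypothetical file name).

```python
def split_full_name(full_name):
    """Split a full name into (first_name, last_name) at the first space."""
    first_name, _, last_name = full_name.strip().partition(" ")
    return first_name, last_name.strip()


# pytest collects these functions because their names start with "test_"
def test_split_full_name_simple():
    assert split_full_name("John Rambo") == ("John", "Rambo")


def test_split_full_name_extra_spaces():
    assert split_full_name("  John   Travolta ") == ("John", "Travolta")
```

A failing `assert` is all pytest needs to report a failed test, no special assertion methods are required.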

Testing should be done in an **isolated environment**. You can create so-called virtual environments with lots of tools. I recommend using conda (can be installed by installing [Miniconda](https://docs.conda.io/en/latest/miniconda.html) or [Anaconda](https://www.anaconda.com/distribution/#download-section)).

You can also test code examples in docstrings using doctest. doctest can [work together with pytest](http://doc.pytest.org/en/latest/doctest.html).
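
Here is a small sketch of what a doctest looks like and how it can be run from Python with the standard library. The function `format_profile` is hypothetical. With pytest the same check is triggered by `pytest --doctest-modules`.

```python
import doctest


def format_profile(profileid, first_name, last_name):
    """Return a short label for a profile.

    >>> format_profile(124, "John", "Travolta")
    '124: John Travolta'
    """
    return f"{profileid}: {first_name} {last_name}"


# collect the examples found in the docstring and run them
tests = doctest.DocTestFinder().find(format_profile,
                                     name="format_profile",
                                     globs={"format_profile": format_profile})
runner = doctest.DocTestRunner(verbose=False)
for test in tests:
    runner.run(test)

print(f"{runner.failures} failed out of {runner.tries} examples")
```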

# How I packaged and published the library to PyPI (Python Package Index)

(using the old way because I got lazy 🐢 and didn't know how to do it the "new way")

Seriously though there is probably nothing wrong with this method if you can understand how it works. Besides it is used by the great majority of Python libraries nowadays. But the "new way" (described after) is much more convenient.

Everything is very well documented in this [article](https://medium.com/@joel.barmettler/how-to-upload-your-python-package-to-pypi-65edc5fe9c56). **Important**: you should upload to [TestPyPI](https://test.pypi.org/) before uploading it to [PyPI](https://pypi.org/)! You'll need to create different accounts for both.

This is what the structure of pangres looked like after I was done (minus some exotic folders and files such as ".circleci" which I'll talk about later):
```
pangres/
│   .gitignore # lists files to ignore in the "pangres" repo (for instance __pycache__), added by GitHub
│   LICENSE
│   MANIFEST.in # lists non-Python files to include when pip installing (for instance README.md)
│               # so they are copied over to site-packages
│   README.md
│   requirements.txt # lists dependencies for end users e.g. pandas
│   requirements-dev.txt # lists dependencies for developers e.g. pytest
│   setup.cfg # this is for additional configuration for the setup somehow (e.g. which type of distribution)
│   setup.py # script containing essential metadata on how to install the library (author, version,...).
│            # it reuses the content of requirements.txt  
│               
├───pangres
│   │   core.py
│   │   helpers.py
│   │   _config.py
│   │   __init__.py
│   │       
│   ├───tests
│   │   │   test_chunsize.py
│   │   │   test_index.py
│   │   │   test_pandas_special_engine.py
│   │   │   test_pg_upsert.py
│   │   │   test_pg_upsert2.py
│   │   │   test_pg_upsert_speed.py
│   │   │   __init__.py
```
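
To give an idea of how a setup.py can reuse requirements.txt, here is a rough sketch. It is not the actual setup.py of pangres: the helper `parse_requirements`, the requirement strings and the metadata values are assumptions made for the example. In a real setup.py the text would be read from the file and the dictionary passed to `setuptools.setup(**setup_kwargs)`.

```python
def parse_requirements(text):
    """Return the requirement lines of a requirements.txt, skipping blanks and comments."""
    requirements = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            requirements.append(line)
    return requirements


# stand-in for the content of requirements.txt (hypothetical)
example_requirements = """
# runtime dependencies
pandas
sqlalchemy
"""

# hypothetical metadata; a real setup.py would call setuptools.setup(**setup_kwargs)
setup_kwargs = dict(name="my_library",
                    version="0.1.0",
                    packages=["my_library"],
                    install_requires=parse_requirements(example_requirements))

print(setup_kwargs["install_requires"])  # ['pandas', 'sqlalchemy']
```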

# Packaging using poetry
Like the other method you will need to have your code hosted in a repository online (e.g. on GitHub).

Let's repackage "pangres" using the tool "poetry" and publish a new release at the same time.
I recommend this [guide](https://johnfraney.ca/posts/2019/05/28/create-publish-python-package-poetry/) for learning how to publish with poetry.

We will use a single file called "pyproject.toml" which replaces all the information spread across:
* MANIFEST.in
* requirements.txt
* requirements-dev.txt
* setup.cfg
* setup.py
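
For orientation, here is roughly what such a file may contain. All values (name, version, author, dependencies) are placeholders, not the real pangres configuration. The TOML is embedded in a Python string so that it can be parsed with `tomllib`, which is only available in the standard library from Python 3.11 onwards (hence the guarded import).

```python
# placeholder content of a pyproject.toml as generated by "poetry init"
PYPROJECT_TOML = """
[tool.poetry]
name = "my-library"
version = "0.1.0"
description = "A very cool library"
authors = ["Jane Doe <jane@example.com>"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.6"
pandas = "*"

[tool.poetry.dev-dependencies]
pytest = "*"

[build-system]
requires = ["poetry>=0.12"]
build-backend = "poetry.masonry.api"
"""

try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    tomllib = None

if tomllib is not None:
    config = tomllib.loads(PYPROJECT_TOML)
    # runtime dependencies replace requirements.txt, dev-dependencies replace requirements-dev.txt
    print(sorted(config["tool"]["poetry"]["dependencies"]))
    print(sorted(config["tool"]["poetry"]["dev-dependencies"]))
```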

## A. Install poetry

**osx / linux / bashonwindows**

<code>curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python</code>

**windows powershell**

<code>(Invoke-WebRequest -Uri https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py -UseBasicParsing).Content | python</code>


## B. Create a pyproject.toml with poetry

<code>poetry init</code>

Let poetry guide you to create the pyproject.toml file.

## C. Let's put pyproject.toml into our structure

Note that I could also have created the library structure using <code>poetry new pangres</code>, but I would have had to change so many things in the default structure that it would not have been helpful.

```
pangres/
│   LICENSE
│   -> pyproject.toml
│   README.md
│
├───pangres
│   │   core.py
│   │   helpers.py
│   │   _config.py
│   │   __init__.py
│   │       
│   ├───tests
│   │   │   test_chunsize.py
│   │   │   test_index.py
│   │   │   test_pandas_special_engine.py
│   │   │   test_pg_upsert.py
│   │   │   test_pg_upsert2.py
│   │   │   test_pg_upsert_speed.py
│   │   │   __init__.py
```

## D. Let's install and test pangres

```
poetry install
# the statement below will make pytest run inside a virtual environment made/selected by poetry 😎!
poetry run pytest pangres
```

## E. Build, test publish and publish
Build:
```
poetry build
```

Test publish using TestPyPI:
```
# add TestPyPI as an alternate package repository
poetry config repositories.testpypi https://test.pypi.org/legacy/
# publish to TestPyPI then test a pip install from TestPyPI
poetry publish -r testpypi
pip install --index-url https://test.pypi.org/simple/ pangres
```

If we can use the library as expected, publish on PyPI:
```
poetry publish
```
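
After a pip install (from TestPyPI or PyPI) it can be handy to check which version actually ended up in the environment. A minimal sketch using the standard library module `importlib.metadata` (Python 3.8+); the helper name `installed_version` is made up:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# prints a version string if pangres is installed in this environment, otherwise None
print(installed_version("pangres"))
```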

## Additional cool stuff with poetry

* Adding a new dependency <code>poetry add pandas</code>
* Removing a dependency <code>poetry remove pandas</code>
* Adding a new dev dependency <code>poetry add pytest-cov --dev</code>
* Removing a dev dependency <code>poetry remove pytest-cov --dev</code>
* Get latest version of dependencies <code>poetry update</code>
* Showing dependencies <code>poetry show</code>
* Working in editable mode: <code>poetry install --develop</code>
* [Working with local packages](https://python-poetry.org/docs/versions/#path-dependencies) 
* etc.

See more [here](https://python-poetry.org/docs/cli/).

# Continuous integration and coverage

* Sign up on [circleci.com](https://circleci.com/) using GitHub
* Sign up on [codecov.io](https://codecov.io/) using GitHub
* Create a folder ".circleci" in the root of your repo then create a file called "config.yml" inside it

<details>
    <summary><b>Example of config.yml for circleci</b></summary>

```
version: 2 # version of circleci
jobs:
  build: # set up the environment
    docker: # use docker images of Python and postgres
    - image: circleci/python:3.7.3
    - image: circleci/postgres:12
      environment:
        POSTGRES_USER: circleci_user
        POSTGRES_DB: circleci_test
    working_directory: ~/pangres
    steps:
    - checkout
    - restore_cache: # try to get dependencies from a cache if they have been installed before (speeds up test time)
        keys:
        - v1-dependencies-{{ checksum "requirements.txt" }}
        - v1-dependencies-
    - run: # install dependencies if necessary
        name: Install dependencies
        command: |
          python3 -m venv venv
          . venv/bin/activate
          pip install -r requirements.txt
    - save_cache: # cache dependencies
        paths:
        - ./venv
        key: v1-dependencies-{{ checksum "requirements.txt" }}
    - run:
        name: run tests
        command: |
          . venv/bin/activate
          # install package (fetches setup.py in current directory)
          pip install .
          pip install codecov
          pip install coverage
          pip install numpy
          pip install pytest
          pip install pytest-benchmark
          pip install pytest-cov
          # somehow the commented line below does not do coverage of non tests files
          # pytest pangres --cov=./
          # use this instead
          coverage run -m pytest
          codecov
workflows:
  version: 2
  workflow:
    jobs:
    - build
```
    
</details>

* Follow the project on circleci
* Follow the project on Codecov

# Additional recommendations for libraries (and when writing Python in general)

## Docstrings
For the docstrings I highly recommend [pandas' docstring guide](https://dev.pandas.io/docs/development/contributing_docstring.html#docstring). All docstrings in pangres are written according to the pandas guide. Alternatively you may prefer [Google docstring style](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html).
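
As a rough illustration of the pandas (numpydoc) layout, here is a sketch with the most common sections. The function `subscription_rate` is a made-up example and not taken from pangres.

```python
def subscription_rate(flags):
    """
    Compute the share of subscribed profiles.

    Unknown values (None) are ignored.

    Parameters
    ----------
    flags : list of {0, 1, None}
        Subscription flag of each profile.

    Returns
    -------
    float
        Number of subscribed profiles divided by the number of known flags.

    Raises
    ------
    ValueError
        If there is no known flag at all.

    Examples
    --------
    >>> subscription_rate([1, 0, None, 1, 1])
    0.75
    """
    known = [flag for flag in flags if flag is not None]
    if not known:
        raise ValueError("no known subscription flag")
    return sum(known) / len(known)


print(subscription_rate([1, 0, None, 1, 1]))  # 0.75
```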

## Code style / Linting
I recommend using a code formatter like [black](https://github.com/psf/black) so Python code always looks the same and you can focus on what it does. I personally used [yapf](https://github.com/google/yapf) (Google flavor) because I don't like how black places brackets and parentheses (it looks like JavaScript 😬) and because black often messes up method chaining (which I use a lot with pandas). yapf-google works much better for me but it's still far from perfect.

You should also use [flake8](http://flake8.pycqa.org/en/latest/) to check for syntax errors, superfluous spaces, undefined names etc.

# Publishing documentation


I can recommend at least two ways to create documentation using docstrings + your own files.




## Auto generated documentation using docstrings

Sphinx is very popular for this (<code>pip install sphinx</code>) and the result does look good. Example of documentation generated with Sphinx [here](https://pandas.pydata.org/pandas-docs/stable/index.html) (this is the old pandas documentation; they now have a new website).

The procedure with Sphinx goes something like this:
* sphinx-quickstart (will start building your documentation)
* sphinx-apidoc (to auto generate documentation with python docstrings)
* add your own Markdown (you'll have to do some trickery to get Markdown working otherwise use reStructuredText) files in the mix
* readthedocs to host your documentation (I think github pages would work too but readthedocs picks it up from your repo which is convenient)
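
For reference, the configuration file "conf.py" created by sphinx-quickstart is plain Python. Below is a minimal sketch of the settings that matter for the steps above. The project metadata is a placeholder, and `myst_parser` is only one possible way to get Markdown working: it is a third-party extension (<code>pip install myst-parser</code>) and an assumption on my part, not something used for pangres.

```python
# conf.py (sketch): placeholder project metadata
project = "my_library"
author = "Jane Doe"
release = "0.1.0"

extensions = [
    "sphinx.ext.autodoc",   # pulls documentation out of docstrings
    "sphinx.ext.napoleon",  # understands numpy (pandas) and Google style docstrings
    "myst_parser",          # assumption: third-party extension for Markdown files
]

# accept both reStructuredText and Markdown source files
source_suffix = {
    ".rst": "restructuredtext",
    ".md": "markdown",
}

html_theme = "alabaster"  # default Sphinx theme
```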

I have managed this in the past but somehow could not reproduce it. Online guides were not very helpful either, as following them never produced the expected output.

Perhaps it's just me and it is actually straightforward. In either case the second method is stupidly simple, always works and is perfect for private repositories if you don't want to bother with a private website to host your documentation.

## Markdown FTW
Just put everything in README.md. With a table of contents, python flavored Markdown etc. this can be really practical for the user but can become quite tedious for developers if there are lots of modules and lots of functions to document...

Here is a great [example](https://github.com/VingtCinq/python-mailchimp).

You can even make a table of contents like this:

````markdown
# Table of contents

[Description](#Description) # anchor link to chapter "Description"

[Usage](#Usage)

&ensp;&ensp;[Do cool stuff](#Do-cool-stuff) # use &ensp; (HTML entity for an en space) to indent subchapters (idk about a better alternative)
                                            # also beware: use "-" and not " " in the anchor (since spaces are not URL compatible)

[Contributing](#Contributing)

[Testing](#Testing)

# Description

A very cool library

# Usage
## Do cool stuff
```python
from very_cool_library import do_stuff
do_stuff()
wow great stuff was done!
```

# Contributing

Contributions are welcome...


etc.

````

Once rendered this produces:

# Table of contents

[Description](#Description)

[Usage](#Usage)

&ensp;&ensp;[Do cool stuff](#Do-cool-stuff)

[Contributing](#Contributing)

[Testing](#Testing)

# Description

A very cool library

# Usage
## Do cool stuff

```python
from very_cool_library import do_stuff
do_stuff()
wow great stuff was done!
```

# Contributing

Contributions are welcome...


etc.
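
Writing such a table of contents by hand gets tedious for long READMEs. The sketch below generates one from the headings, following the same conventions as above (spaces replaced by "-" in anchors, `&ensp;` for indentation). The function name `build_toc` is made up, and anchor rules differ between Markdown renderers, so treat this as a starting point.

```python
FENCE = "`" * 3  # marker that opens/closes a fenced code block


def build_toc(markdown_text):
    """Build a linked table of contents from the headings of a Markdown text."""
    entries = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence  # ignore "#" comments inside code blocks
            continue
        if in_fence or not line.startswith("#"):
            continue
        hashes, _, title = line.partition(" ")
        title = title.strip()
        if not title or set(hashes) != {"#"}:
            continue
        indent = "&ensp;&ensp;" * (len(hashes) - 1)
        anchor = title.replace(" ", "-")
        entries.append(f"{indent}[{title}](#{anchor})")
    return "\n\n".join(entries)


readme = "\n".join(["# Description", "A very cool library",
                    "# Usage", "## Do cool stuff",
                    FENCE + "python", "# a comment, not a heading", FENCE])
print(build_toc(readme))
```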

# Conclusion

* Making libraries can be very useful. It does require some knowledge but this is good knowledge to have! For instance git, docstring formatting or continuous integration
* If you decide to publish a library you may help other people and have other people help you improve your library! People may even create new features or other libraries based on it
* poetry makes things a little easier 🐢
* Publishing documentation still seems very impractical (to me at least)