# 1. This is my house

### Environment reproducibility for Python

## 1.1 The [watermark](https://github.com/rasbt/watermark) extension

Tell everyone when your notebook was run, and with which packages. This is especially useful for nbviewer, blog posts, and other media where you are not sharing the notebook as executable code.

In [None]:
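# if you don't have the watermark extension installed
# (%install_ext was deprecated in IPython 4 and later removed; on newer versions use `pip install watermark`):
%install_ext https://raw.githubusercontent.com/rasbt/watermark/master/watermark.py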
    
# once it is installed, you'll just need this in future notebooks:
%load_ext watermark

In [None]:
%watermark -a "Peter Bull" -d -v -p numpy,pandas -g
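
If the extension is unavailable, a rough equivalent can be put together from the standard library. This is only a sketch of the idea, not how `watermark` itself works: the function name `environment_stamp` is made up for illustration, and it assumes Python 3.8+ for `importlib.metadata`.

```python
import platform
from datetime import date
from importlib import metadata


def environment_stamp(packages):
    """Collect the run date, Python version, and installed package versions."""
    stamp = {
        "date": date.today().isoformat(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            stamp[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap instead of failing, so the stamp is always printable.
            stamp[name] = "not installed"
    return stamp


for key, value in environment_stamp(["numpy", "pandas"]).items():
    print(f"{key}: {value}")
```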

## 1.2 Laying the foundation

[`virtualenv`](https://virtualenv.pypa.io/en/latest/installation.html) and [`virtualenvwrapper`](http://virtualenvwrapper.readthedocs.org/en/latest/#) give you a new foundation.

 - Start from "scratch" on each project
 - Choose Python 2 or 3 as appropriate
 - Packages are cached locally, so no need to wait for download/compile on every new env
 
Installation is as easy as:
 - `pip install virtualenv`
 - `pip install virtualenvwrapper`
 - Add the following lines to `~/.bashrc`:
 
------

```
export WORKON_HOME=$HOME/.virtualenvs
export PROJECT_HOME=$HOME/Devel
source /usr/local/bin/virtualenvwrapper.sh
```

-----


To create a virtual environment:

 - `mkvirtualenv <name>`
 
To work in a particular virtual environment:

 - `workon <name>`
 
To leave a virtual environment:

 - `deactivate`
 
 
**`#lifehack`: create a new virtual environment for every project you work on**
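
To confirm from inside Python that a kernel is really running in a virtual environment, a check along these lines may help. It is a sketch: the function name `in_virtual_environment` is made up here, and it relies on `sys.base_prefix` (Python 3.3+) with a fallback to `sys.real_prefix`, which older `virtualenv` releases set.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter appears to run inside a virtual environment."""
    # Older virtualenv releases record the original interpreter in sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases leave sys.base_prefix pointing at the original.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print(in_virtual_environment())
```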


## 1.3 The `pip` [requirements.txt](https://pip.readthedocs.org/en/1.1/requirements.html) file

Track your "minimum reproducible environment" in a `requirements.txt` file.

**`#lifehack`: never again run `pip install <package>`. Instead, update `requirements.txt` and run `pip install -r requirements.txt`**

In [None]:
!head -n 15 ../requirements.txt
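
One way to notice when the active environment has drifted from `requirements.txt` is to compare its exact pins against what is installed. The helper below is a hypothetical sketch, not part of `pip`: the name `check_pins` is invented, it only understands `name==version` lines, and it assumes Python 3.8+ for `importlib.metadata`.

```python
from importlib import metadata


def check_pins(requirement_lines):
    """Return a list of mismatches between `name==version` pins and installed packages."""
    problems = []
    for raw_line in requirement_lines:
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "==" not in line:
            continue  # this sketch skips anything that is not an exact pin
        name, wanted = (part.strip() for part in line.split("==", 1))
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (wanted {wanted})")
            continue
        if installed != wanted:
            problems.append(f"{name}: installed {installed}, wanted {wanted}")
    return problems
```

A typical call would be `check_pins(open("../requirements.txt"))`, printing whatever it returns.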

# 2. The Life-Changing Magic of Tidying Up

## 2.1 Consistent project structure means

 - relative paths work
 - other collaborators know what to expect
 - order of scripts is self-documenting

In [None]:
! tree ..
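
With a fixed layout, paths can be derived once rather than retyped in every cell. The sketch below assumes notebooks live one level below the project root, next to `data/` and `src/` directories (as the `../data/...` and `../src/...` paths in this notebook suggest); the helper name `project_paths` is invented for illustration.

```python
from pathlib import Path


def project_paths(notebook_dir=None):
    """Build the standard project directories relative to the notebook folder."""
    notebook_dir = Path(notebook_dir) if notebook_dir else Path.cwd()
    root = notebook_dir.resolve().parent  # the project root sits one level up
    return {"root": root, "data": root / "data", "src": root / "src"}


paths = project_paths()
print(paths["data"] / "water-pumps.csv")
```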

# 3. Edit-run-repeat: how to stop the cycle of pain

The goal: don't edit, execute and verify any more. It's a fine way to start a project, but it doesn't scale as code runs longer and gets more complex.

### Debugging, refactoring, testing

 - Start with repeated code
 - Write functions - test with asserts
 - Refactor to modules - test with `unittest` 
 - Special testing tools for data science (`numpy.testing`, `engarde`)

## 3.1 No more docs-guessing


In [5]:
import pandas as pd

In [None]:
df = pd.read_csv("../data/water-pumps.csv")
df.head(1)

## STEP: Try adding parameter index_col=0

In [None]:
pd.read_csv?

In [None]:
df = pd.read_csv("../data/water-pumps.csv",
                 index_col=0,
                 parse_dates=["date_recorded"])  # parse_dates expects a list of column names
df.head()

pd.read_csv()

**`#lifehack`: in addition to the `?` operator, Jupyter notebooks have great "intellisense": try `tab` when typing the name of a function, and `shift+tab` when inside a method call.**
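
Outside of a notebook, similar information is available from the standard library's `inspect` module. A small sketch, using `json.dumps` purely as a stand-in for whatever function you are curious about:

```python
import inspect
import json

# json.dumps stands in for any function whose parameters you want to look up.
signature = inspect.signature(json.dumps)
print(signature)

# First line of the docstring, similar to the summary that `?` shows.
print(inspect.getdoc(json.dumps).splitlines()[0])
```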

## 3.2 No more copy pasta

Don't repeat yourself.

In [6]:
import matplotlib.pyplot as plt  # needed for the plt.show() calls below
import seaborn as sns



In [None]:
plot_data = df['construction_year']
plot_data = plot_data[plot_data != 0]
sns.kdeplot(plot_data, bw=0.1)
plt.show()

plot_data = df['longitude']
plot_data = plot_data[plot_data != 0]
sns.kdeplot(plot_data, bw=0.1)
plt.show()

## STEP: Paste for 'amount_tsh'
## STEP: Paste for 'latitude'

In [None]:
def kde_plot(dataframe, variable, upper=0.0, lower=0.0, bw=0.1):
    plot_data = dataframe[variable]
    plot_data = plot_data[(plot_data > lower) & (plot_data < upper)]
    sns.kdeplot(plot_data, bw=bw)
    plt.show()

In [None]:
kde_plot(df, 'construction_year', upper=2016)
kde_plot(df, 'longitude', upper=42)

In [None]:
kde_plot(df, 'amount_tsh', lower=20000, upper=400000)

## 3.3 No more guess-and-check

Interrupt execution with:
 - `%debug` magic: drops you out into pdb in IPython
 - `import q;q.d()`: drops you into pdb, even outside of IPython
 
Interrupt execution on an exception with the `%pdb` magic. Use [pdb](https://docs.python.org/2/library/pdb.html), the Python debugger, to debug inside a notebook. Key commands for `pdb` are:

 - `p`: Evaluate and print Python code
 
 
 - `w`: Where in the stack trace am I?
 - `u`: Go up a frame in the stack trace.
 - `d`: Go down a frame in the stack trace.
 
 
 - `c`: Continue execution
 - `q`: Stop execution

In [None]:
kde_plot(df, 'date_recorded')

In [None]:
def kde_plot_debug(dataframe, variable, upper=0.0, lower=0.0, bw=0.1):
    plot_data = dataframe[variable]
    plot_data = plot_data[(plot_data > lower) & (plot_data < upper)]
    
    %debug
    
    sns.kdeplot(plot_data, bw=bw)
    plt.show()
    
kde_plot_debug(df, 'date_recorded')

In [None]:
# "1" turns pdb on, "0" turns pdb off
%pdb 1

kde_plot(df, 'date_recorded')

In [None]:
# turn off debugger
%pdb 0

**`#lifehack`: `%debug` and `%pdb` are great, but pdb can be clunky. Try the `q` module. Adding the line `import q;q.d()` anywhere in a project gives you a normal Python console at that point. This is great if you're running outside of IPython.**
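
If installing `q` is not an option, the standard library's `pdb.set_trace()` offers a comparable "stop here" hook. The sketch below guards it with a `debug` flag so the function still runs unattended; the function name and the flag are illustrative assumptions, not something from this project.

```python
import pdb


def mean_with_breakpoint(values, debug=False):
    """Compute a mean, optionally pausing in pdb right before the division."""
    values = list(values)
    if debug:
        # Execution stops here; use p, w, u, d, c, q as described above.
        pdb.set_trace()
    return sum(values) / len(values)


print(mean_with_breakpoint(range(10)))
```

Calling `mean_with_breakpoint(range(10), debug=True)` drops into the debugger with `values` in scope.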

## 3.4 No more "Restart & Run All"

`assert` is the poor man's unit test: stops execution if condition is `False`, continues silently if `True`

In [11]:
import numpy as np

In [17]:
def gimme_the_mean(series):
    return np.mean(series)

assert gimme_the_mean([0.0]*10) == 0.0
assert gimme_the_mean(range(10)) == 5

4.5


AssertionError: 
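
The second assertion fails because the mean of `0..9` is 4.5, not 5. A sketch of the same checks with the expected value corrected, a message attached to each `assert`, and `math.isclose` for the floating-point comparison (the pure-Python mean here is a stand-in for `np.mean` so the example has no dependencies):

```python
import math


def gimme_the_mean(values):
    """Arithmetic mean of a finite, non-empty iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# The message after the comma is shown in the AssertionError when the check fails.
assert gimme_the_mean([0.0] * 10) == 0.0, "mean of zeros should be zero"
assert math.isclose(gimme_the_mean(range(10)), 4.5), "mean of 0..9 should be 4.5"
```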

## 3.5 No more copy-pasta between notebooks 

Refactor to module

In [1]:
import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

# import my method from the source code
from preprocess.build_features import remove_invalid_data

df = remove_invalid_data("../data/water-pumps.csv")
print(df.shape)

(59400, 39)


In [2]:
# TRY ADDING print("lalalala") to the method
df = remove_invalid_data("../data/water-pumps.csv")

Restart the kernel, let's try this again....

In [1]:
# Load the "autoreload" extension
%load_ext autoreload

# always reload modules marked with "%aimport"
%autoreload 1

import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

# import my method from the source code
%aimport preprocess.build_features
from preprocess.build_features import remove_invalid_data

In [4]:
df = remove_invalid_data("../data/water-pumps.csv")
df.head()

lalalala


| id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | num_private | basin | ... | payment_type | water_quality | quality_group | quantity | quantity_group | source | source_type | source_class | waterpoint_type | waterpoint_type_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 69572 | 6000 | 2011-03-14 | Roman | 1390 | Roman | 34.938093 | -9.856322 | none | 0 | Lake Nyasa | ... | annually | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe |
| 8776 | 0 | 2013-03-06 | Grumeti | 1399 | GRUMETI | 34.698766 | -2.147466 | Zahanati | 0 | Lake Victoria | ... | never pay | soft | good | insufficient | insufficient | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe |
| 34310 | 25 | 2013-02-25 | Lottery Club | 686 | World vision | 37.460664 | -3.821329 | Kwa Mahundi | 0 | Pangani | ... | per bucket | soft | good | enough | enough | dam | dam | surface | communal standpipe multiple | communal standpipe |
| 67743 | 0 | 2013-01-28 | Unicef | 263 | UNICEF | 38.486161 | -11.155298 | Zahanati Ya Nanyumbu | 0 | Ruvuma / Southern Coast | ... | never pay | soft | good | dry | dry | machine dbh | borehole | groundwater | communal standpipe multiple | communal standpipe |
| 19728 | 0 | 2011-07-13 | Action In A | 0 | Artisan | 31.130847 | -1.825359 | Shuleni | 0 | Lake Victoria | ... | never pay | soft | good | seasonal | seasonal | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe |
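
What `autoreload` automates can be sketched by hand with `importlib.reload`. The example below writes a throwaway module to a temporary directory, imports it, edits the file on disk, and reloads it. The module name `scratch_features` and its `label` function are invented for the demonstration and are not part of this project.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip bytecode caching so the edited source is always what gets reloaded.
sys.dont_write_bytecode = True

with tempfile.TemporaryDirectory() as tmp_dir:
    module_path = Path(tmp_dir) / "scratch_features.py"  # hypothetical module
    module_path.write_text("def label():\n    return 'first version'\n")

    sys.path.insert(0, tmp_dir)
    importlib.invalidate_caches()
    import scratch_features

    before = scratch_features.label()

    # Simulate editing the source file while the kernel keeps running.
    module_path.write_text("def label():\n    return 'second version, edited on disk'\n")
    importlib.reload(scratch_features)
    after = scratch_features.label()

    sys.path.remove(tmp_dir)

print(before)
print(after)
```

Without the `reload` call, the second `label()` would still return the first version, which is the behaviour that forced the kernel restart earlier.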


## 3.6 No more letting other people (including future you) break your things

`unittest` is built in to Python. See `src/preprocess/tests.py` for an example.

In [19]:
%run ../src/preprocess/tests.py

...
----------------------------------------------------------------------
Ran 3 tests in 0.006s

OK
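
The contents of `src/preprocess/tests.py` are not shown here, so the following is only a sketch of what a small `unittest` module can look like. The function under test, `drop_zero_years`, is hypothetical; it mirrors the `plot_data != 0` filter used earlier on `construction_year`.

```python
import unittest


def drop_zero_years(years):
    """Remove placeholder zeros from a list of construction years (hypothetical helper)."""
    return [year for year in years if year != 0]


class DropZeroYearsTest(unittest.TestCase):
    def test_zeros_are_removed(self):
        self.assertEqual(drop_zero_years([0, 1999, 0, 2010]), [1999, 2010])

    def test_valid_years_are_kept(self):
        self.assertEqual(drop_zero_years([1985, 2004]), [1985, 2004])

    def test_empty_input(self):
        self.assertEqual(drop_zero_years([]), [])


if __name__ == "__main__":
    # argv and exit=False keep this usable from a notebook cell as well as a script.
    unittest.main(argv=["ignored"], exit=False)
```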


## 3.7 Special treats for data science testing

### `numpy.testing`
Provides useful assertion methods for values that are numerically close and for numpy arrays.

In [21]:
data = np.random.normal(0.0, 1.0, 1000000)
assert gimme_the_mean(data) == 0.0

AssertionError: 

In [31]:
np.testing.assert_almost_equal(gimme_the_mean(data),
                               0.0,
                               decimal=1)

In [32]:
a = np.random.normal(0, 0.0001, 10000)
b = np.random.normal(0, 0.0001, 10000)

np.testing.assert_array_equal(a, b)

AssertionError: 
Arrays are not equal

(mismatch 100.0%)
 x: array([  7.230267e-05,   8.902187e-05,   4.363191e-05, ...,   3.418931e-05,
        -2.389755e-04,   1.099132e-05])
 y: array([  9.644248e-05,  -1.471220e-04,   2.087174e-04, ...,   2.100400e-05,
        -6.046026e-05,   5.240043e-05])

In [33]:
np.testing.assert_array_almost_equal(a, b, decimal=3)
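
`np.testing.assert_allclose` is another option when you would rather state a tolerance directly than count decimals. A short sketch, assuming NumPy 1.17+ for `default_rng`; the seed and tolerances are arbitrary choices for illustration:

```python
import numpy as np

# A fixed seed makes the check repeatable from run to run.
rng = np.random.default_rng(seed=0)
data = rng.normal(0.0, 1.0, 1_000_000)

# The expected value is 0, so a relative tolerance is useless here; use atol.
np.testing.assert_allclose(data.mean(), 0.0, atol=0.01)

# For values away from zero, a relative tolerance is usually the better fit.
a = np.array([1.0, 2.0, 3.0])
b = a + 1e-9
np.testing.assert_allclose(a, b, rtol=1e-6)
```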

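### `engarde`
Provides decorators that check assumptions about a `pandas` DataFrame (for example, that no values are missing) whenever a decorated function returns one.

In [None]:
import engarde.decorators as ed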

# 4. Next-level code inspection

## 4.1 Code coverage

`coverage.py` is an _amazing_ tool for seeing what code gets executed when you run your test suite. You can print a summary report in the terminal or generate an HTML report to browse.

In [36]:
!coverage run ../src/preprocess/tests.py
!coverage report

...
----------------------------------------------------------------------
Ran 3 tests in 0.007s

OK
Name                                                                    Stmts   Miss  Cover
-------------------------------------------------------------------------------------------
/Users/bull/data-science-is-software/src/preprocess/build_features.py       9      1    89%
/Users/bull/data-science-is-software/src/preprocess/tests.py               26      0   100%
-------------------------------------------------------------------------------------------
TOTAL                                                                      35      1    97%


In [45]:
!coverage html

from IPython.display import HTML, IFrame
IFrame("htmlcov/index.html", 800, 300)

## 4.2 Code profiling

## 4.3 The world beyond Jupyter

### Linting and Graphical Debugging (IDEs)

[PyCharm](https://www.jetbrains.com/pycharm/download/) is a fully-featured Python IDE. It has _tons_ of integrations with the normal development flow. The features I use most are:

 - `git` integration
 - interactive graphical debugger
 - flake8 linting
 - smart refactoring/go to