# Do I need to read this?

Read this if any of the following statements of your knowledge are *not* true

* Can get cli arguments inside a script
* Know what a `defaultdict` is 
* Know how to compute the permutations of a list using `itertools`
* Can dump a python object to disk using `cPickle`
* Can construct a cli argument parser using `argparse`
* Know what `doctest` is and how it can be helpful for testing functions
* Can log using `logging` and know when `print` should be used and when logging methods should be used
* Know how to use the python debugger called `pdb` and use it within Jupyter.

# Introduction to the standard library

Python's philosophy to the standard library is "batteries included", i.e. it has *a lot* of stuff!

Peruse [The Python Standard Library](https://docs.python.org/2/library/index.html) table of contents just to see the breadth and depth of the included features.

We'll take a look at few of the modules inside the standard library that may well be of use throughout the course:

* [`sys`](https://docs.python.org/2/library/sys.html)
* [`os`](https://docs.python.org/2/library/os.html) (in particular, [`os.path`](https://docs.python.org/2/library/os.path.html#module-os.path))
* [`collections`](https://docs.python.org/2/library/collections.html)
* [`random`](https://docs.python.org/2/library/random.html)
* [`itertools`](https://docs.python.org/2/library/itertools.html)
* [`functools`](https://docs.python.org/2/library/functools.html)
* [`tempfile`](https://docs.python.org/2/library/tempfile.html)
* [`cPickle`](https://docs.python.org/2/library/pickle.html#module-cPickle)
* [`argparse`](https://docs.python.org/2/library/argparse.html)
* [`logging`](https://docs.python.org/2/library/logging.html)
* [`subprocess`](https://docs.python.org/2/library/subprocess.html)
* [`doctest`](https://docs.python.org/2/library/doctest.html)
* [`pdb`](https://docs.python.org/2/library/pdb.html)


## `sys`

This module provides a bunch of useful things like `sys.argv`, `sys.argc`, `sys.exit`, `sys.maxint`, `sys.path`, `sys.stdin`, `sys.stdout`, `sys.stderr`; all of which are fairly self explanatory.

## `os`


This module provides operating specific functionality, one of the main sub modules being `os.path` for path manipulation. 


In [1]:
import os

In [2]:
os.environ['PATH']

'/usr/bin:/home/will/.nvm/versions/node/v6.11.3/bin:/home/will/.gem/ruby/2.4.0/bin:/home/will/.yarn/bin:/home/will/.pyenv/shims:/opt/sonar-scanner/bin:/home/will/.local/bin:/home/will/.gem/ruby/2.4.0/bin:/home/will/.gem/ruby/2.3.0/bin:/home/will/.bin:/home/will/code/go/bin:/home/will/.fzf/bin:/home/will/.gem/ruby/2.4.0/bin:/home/will/.yarn/bin:/home/will/.pyenv/shims:/opt/sonar-scanner/bin:/home/will/.local/bin:/home/will/.gem/ruby/2.4.0/bin:/home/will/.gem/ruby/2.3.0/bin:/home/will/.bin:/home/will/code/go/bin:/home/will/.cargo/bin:/opt/google-cloud-sdk/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/android-sdk/platform-tools:/opt/android-sdk/tools:/opt/android-sdk/tools/bin:/opt/COMODO:/opt/cxoffice/bin:/usr/lib/jvm/default/bin:/opt/opencascade/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/opt/sonar-scanner/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/usr/lib/jvm/default/bin'

In [3]:
try:
    del os.environ['NEW_ENV_VAR'] # clear env var if already present
except KeyError:
    pass

try:
    print(os.environ['NEW_ENV_VAR'])
except KeyError:
    print('NEW_ENV_VAR not present in environment')
    
os.environ['NEW_ENV_VAR'] = '1234'

print(os.environ['NEW_ENV_VAR'])

NEW_ENV_VAR not present in environment
1234


In [4]:
os.uname()

('Linux',
 'petal',
 '4.13.3-1-ARCH',
 '#1 SMP PREEMPT Thu Sep 21 20:33:16 CEST 2017',
 'x86_64')

In [5]:
with os.tmpfile() as f:
    f.write('Hello World')
    f.seek(0)
    for line in f.readlines():
        print(line)

Hello World


In [6]:
print(os.getcwd())
os.chdir('/tmp')
print(os.getcwd())

/home/will/cloud/teaching/adl/public-labsheets/Lab_0_Python_Intro
/tmp


In [7]:
try:
    os.mkdir('/tmp/new-temp')
except OSError:
    pass

!ls -al /tmp/new-temp

total 0
drwxr-xr-x  2 will will   40 Oct  2 21:47 .
drwxrwxrwt 19 root root 1220 Oct  2 21:47 ..


In [8]:
!rm -r /tmp/new

try:
    os.mkdir('/tmp/new/nested/directories')
except OSError:
    print("os.mkdir can't handle creation of nested directories")
    
os.makedirs('/tmp/new/nested/directories')
print("but os.makedirs can")

!find /tmp/new

rm: cannot remove '/tmp/new': No such file or directory
os.mkdir can't handle creation of nested directories
but os.makedirs can
/tmp/new
/tmp/new/nested
/tmp/new/nested/directories


In [9]:
os.getuid()

1000

In [10]:
os.getppid()

25779

In [11]:
# filesystem path separator
os.sep

'/'

In [12]:
# Path separator for PATH variable
os.pathsep

':'

In [13]:
# Line separator \n on linux, and \r\n on windos
os.linesep

'\n'

In [14]:
# path to null device
os.devnull

'/dev/null'

## `os.path`

This module is a submodule of the `os` module and contains a high level interface for path manipulations, it has been superseded by `pathlib` in python 3.

In [15]:
import os.path

os.chdir(os.path.expanduser("~"))

In [16]:
os.path.abspath("folder")

'/home/will/folder'

In [17]:
os.path.join(os.getcwd(), 'a', 'b', 'c')

'/home/will/a/b/c'

In [18]:
os.path.exists('/tmp/does-not-exist')

False

In [19]:
os.path.exists('/tmp/')

True

In [20]:
os.path.abspath('$HOME/asdf')

'/home/will/$HOME/asdf'

In [21]:
os.path.abspath(os.path.expandvars('$HOME/asdf'))

'/home/will/asdf'

In [22]:
os.path.abspath(os.path.expanduser('~/asdf/$VAR'))

'/home/will/asdf/$VAR'

In [23]:
os.environ['VAR'] = 'b'
path = os.path.expanduser(os.path.expandvars('~/a/$VAR'))
del os.environ['VAR']
path

'/home/will/a/b'

In [24]:
os.path.dirname('/tmp/script.py')

'/tmp'

In [25]:
os.path.basename('/tmp/script.py')

'script.py'

In [26]:
os.path.split('/tmp/a/b/c/d')

('/tmp/a/b/c', 'd')

In [27]:
os.path.split('/tmp/a/b/c/d/')

('/tmp/a/b/c/d', '')

In [28]:
def walker(dirname, names):
    # do something with dirname and names
    return dirname, names
    
list(os.walk('/etc/systemd/', walker))[0]

('/etc/systemd/',
 ['user', 'system', 'network'],
 ['timesyncd.conf',
  'journal-upload.conf',
  'journald.conf',
  'coredump.conf',
  'logind.conf',
  'journal-remote.conf',
  'system.conf',
  'user.conf',
  'resolved.conf'])

---

## `collections`

The python collections module is similar to the Java collections API providing a variety of data structures.

| Data structure name | Description |
|---------------------|-------------|
| `namedtuple()`      | factory function for creating tuple subclasses with named fields |
| `deque`             | list-like container with fast appends and pops on either end |
| `Counter`	          | dict subclass for counting hashable objects |
| `OrderedDict`       | dict subclass that remembers the order entries were added |
| `defaultdict`	      | dict subclass that calls a factory function to supply missing values |

`namedtuple` makes for very nice to use data objects

In [29]:
import collections

In [30]:
DataSet = collections.namedtuple('DataSet', ['examples', 'labels'])
ds = DataSet([[0, 1], [0, 1], [0, 0], [1, 1]], [1, 1, 0, 0])
print(ds)

print("Examples: {}".format(ds.examples))
print("Labels: {}".format(ds.labels))

DataSet(examples=[[0, 1], [0, 1], [0, 0], [1, 1]], labels=[1, 1, 0, 0])
Examples: [[0, 1], [0, 1], [0, 0], [1, 1]]
Labels: [1, 1, 0, 0]


---

In [31]:
# deques are generalisations of stacks and queues. 
# Deque is short for double ended queue and pronounced 'deck'
d = collections.deque(range(10))
d

deque([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [32]:
print(d.pop())
d


9


deque([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [33]:
d.popleft()
d

deque([1, 2, 3, 4, 5, 6, 7, 8])

In [34]:
d.append(20)
d

deque([1, 2, 3, 4, 5, 6, 7, 8, 20])

In [35]:
d.appendleft(10)
d

deque([10, 1, 2, 3, 4, 5, 6, 7, 8, 20])

---

In [36]:
def default_factory():
    return ""

normal_dict = dict()
dict_with_default = collections.defaultdict(default_factory)

In [37]:
normal_dict['a'] = 'b'
dict_with_default['a'] = 'b'

In [38]:
normal_dict['a']

'b'

In [39]:
dict_with_default['a']

'b'

In [40]:
normal_dict['c']

KeyError: 'c'

In [41]:
dict_with_default['c']

''

---

In [42]:
c = collections.Counter(['a', 'a', 'a', 'b', 'b', 'c', 'b', 'd', 'e', 'a'])
c

Counter({'a': 4, 'b': 3, 'c': 1, 'd': 1, 'e': 1})

In [43]:
c.keys()

['a', 'c', 'b', 'e', 'd']

---

## `itertools`

`itertools` is a beautiful module for performing operations on iterators (something that iterates over a container, be it a list, tree, or something else). If you like Haskell, definitely take a look at this module.

The [documentation](https://docs.python.org/2/library/itertools.html) is excellent and provides a concise table of all the functions and their purposes.

All the functions return iterators that can be materialized by calling `list` on the result, however this is inefficient unless you actually need the full result, i.e. you don't need to materialize the iterator if you're going to loop over the element etc.

In [44]:
import itertools

In [45]:
l1 = list('abcdefgh')
l1

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

In [46]:
l2 = list(range(len(l1)))
l2

[0, 1, 2, 3, 4, 5, 6, 7]

In [47]:
zip(l1, l2) # this returns a list (allocating memory)

[('a', 0),
 ('b', 1),
 ('c', 2),
 ('d', 3),
 ('e', 4),
 ('f', 5),
 ('g', 6),
 ('h', 7)]

In [48]:
itertools.izip(l1, l2) # returns an iterator taking constant memory

<itertools.izip at 0x7fecf00cb0e0>

In [49]:
list(itertools.chain(l1, l2))

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 0, 1, 2, 3, 4, 5, 6, 7]

In [50]:
list(itertools.compress(l1, [1, 0, 0, 1])) # selectively pull elements out of l1

['a', 'd']

In [51]:
list(itertools.dropwhile(lambda char: char < 'd', l1))

['d', 'e', 'f', 'g', 'h']

In [52]:
from collections import namedtuple
Point = namedtuple('Point', ['x', 'y'])

l = [Point(x, y) for x in range(4) for y in range(4)]

points_grouped_by_x = itertools.groupby(l, lambda p: p.x)
for i, point_group in points_grouped_by_x:
    print("Grouped by x = {}, values: {}".format(i, list(point_group)))
    

Grouped by x = 0, values: [Point(x=0, y=0), Point(x=0, y=1), Point(x=0, y=2), Point(x=0, y=3)]
Grouped by x = 1, values: [Point(x=1, y=0), Point(x=1, y=1), Point(x=1, y=2), Point(x=1, y=3)]
Grouped by x = 2, values: [Point(x=2, y=0), Point(x=2, y=1), Point(x=2, y=2), Point(x=2, y=3)]
Grouped by x = 3, values: [Point(x=3, y=0), Point(x=3, y=1), Point(x=3, y=2), Point(x=3, y=3)]


In [53]:
# same as filter in haskell
list(itertools.ifilter(lambda char: char  < 'd', l1))

['a', 'b', 'c']

In [54]:
# equivalent to negating predicate in the above example
list(itertools.ifilterfalse(lambda char: char  < 'd', l1))

['d', 'e', 'f', 'g', 'h']

In [55]:
# The iterating version of l1[start:stop:step]
list(itertools.islice(l1, 1, 6, 2))

['b', 'd', 'f']

In [56]:
# iterating version of map. In Python 3 itertools.imap replaces the builtin map
list(itertools.imap(lambda x: x**2 , l2))

[0, 1, 4, 9, 16, 25, 36, 49]

In [57]:
# starmap expands nested lists for 
nested_list = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
list(itertools.starmap(lambda *args: sum(args), nested_list))

[6, 9, 12]

In [58]:
# What if we want to duplicate the iterator? 
# You can't just iterate over the same repeatedly as iterating mutates the interal state of the iterator
i1 = iter(l1)
for entry in i1:
    print(entry)
    
print("\nIterating over the same iterator")
for entry in i1:
    print(entry)
print("End iterating over the same iterator\n")

copies = 2
iterators = itertools.tee(iter(l1), copies)
for i, iterator in enumerate(iterators):
    print("Iterating over copy {} of iterator".format(i + 1))
    for entry in iterator:
        print(entry)
    print("End of copy {}\n".format(i + 1))

a
b
c
d
e
f
g
h

Iterating over the same iterator
End iterating over the same iterator

Iterating over copy 1 of iterator
a
b
c
d
e
f
g
h
End of copy 1

Iterating over copy 2 of iterator
a
b
c
d
e
f
g
h
End of copy 2



In [59]:
list(itertools.takewhile(lambda x: x < 5, l2))

[0, 1, 2, 3, 4]

Now for the *combinatoric generators*

In [60]:
l1 = list(range(3))
l2 = list('abc')

In [61]:
list(itertools.product(l1, l2))

[(0, 'a'),
 (0, 'b'),
 (0, 'c'),
 (1, 'a'),
 (1, 'b'),
 (1, 'c'),
 (2, 'a'),
 (2, 'b'),
 (2, 'c')]

In [62]:
list(itertools.permutations(l1))

[(0, 1, 2), (0, 2, 1), (1, 0, 2), (1, 2, 0), (2, 0, 1), (2, 1, 0)]

In [63]:
list(itertools.combinations(l1, 2))

[(0, 1), (0, 2), (1, 2)]

In [64]:
list(itertools.combinations_with_replacement(l1, 2))

[(0, 0), (0, 1), (0, 2), (1, 1), (1, 2), (2, 2)]

And finally, the infinite iterators

In [65]:
list(itertools.takewhile(lambda x: x < 10, itertools.count())) 
# list(itertools.count()) would hang trying to materialise an infinite list

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [66]:
i = 0
for x in itertools.cycle(l1):
    print(x)
    i += 1
    if i > 10:
        break

0
1
2
0
1
2
0
1
2
0
1


In [67]:
i = 0
for x in itertools.repeat(1):
    print(x)
    i += 1
    if i > 10:
        break

1
1
1
1
1
1
1
1
1
1
1


---

## `tempfile`

`tempfile` is useful for writing intermediate results to disk, however this might be too slow for you, so check out [`StringIO`](https://docs.python.org/2/library/stringio.html) for memory mapped files.

`tempfile` is also useful in unit testing contexts where you have code that operates on files, you can create a temporary directory to exercise the unit under test (although this can be slow as you're hitting a real filesystem, consider using mocks here instead)

In [68]:
import tempfile

In [69]:
# a temp file exists until it is closed e.g. with `.close()` or on leaving a context manager

with tempfile.TemporaryFile() as f:
    f.writelines("Hello World!")
    f.seek(0)
    print(f.readlines()[0])

Hello World!


In [70]:
# Sometimes you might want the path to the file, in which case use `tempfile.mkdtemp`
fd, path = tempfile.mkstemp(prefix='frame-', suffix='.png')
print(path)

/tmp/frame-HeiXBi.png


In [71]:
# Directories are supported too
path = tempfile.mkdtemp()
print(path)

/tmp/tmpcpeyYr


---

## `cPickle`

Got an expensive computation, want to save the result? `cPickle` to the rescue! `cPickle` is a serialisation modules that will *pickle* (serialise) POPOs (plain old python objects) to files, and *unpickle* (deserialise) them back into POPOs.

In [72]:
import cPickle

In [73]:
import numpy as np
import tempfile
xs = np.linspace(0, 5, 1000)

with tempfile.TemporaryFile() as f:
    cPickle.dump(xs, f)
    f.seek(0)
    unpickled_xs = cPickle.load(f)
    print(type(unpickled_xs))
    print(len(unpickled_xs ))

<type 'numpy.ndarray'>
1000


In [74]:
# Numpy provides some nice helper methods for dumping ndarrays to disk
with tempfile.TemporaryFile(suffix='.npy') as f:
    np.save(f, xs)
    f.seek(0) # rewind file
    loaded_xs = np.load(f)
    print(type(loaded_xs))
    print(len(loaded_xs))
    
# This is more portable than pickling as numpy persists arrays in its own format which is
# portable across python versions whereas pickles are python version depedent

# See https://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html for more

<type 'numpy.ndarray'>
1000


---

## `argparse`

This module provides the argument parsing for CLI apps. It's very helpful in data science where you might want to run on different data sets (e.g. train, validation, test), tweak parameters, toggle options etc.

In [None]:
# %load argparse_example.py

import os

if __name__ == '__main__':
    # imports can occur any where in the program
    # it is good practice to put module dependencies at the top of the file
    # and CLI dependency imports inside the `__name__ == '__main__'` conditional block
    # so they are only imported when the script is run as a program and not imported as a library 
    import argparse
    
    parser = argparse.ArgumentParser(description='Train CNN on MNIST dataset')
    parser.add_argument('dataset-dir', 
                        type=str, 
                        help='Directory in which to download')
    parser.add_argument('--stride', 
                        type=str, 
                        default='2x2')
    parser.add_argument('--batch-size', 
                        type=int, 
                        default=32)

    args = parser.parse_args()
    print(args)

In [2]:
%run argparse_example.py --help

usage: argparse_example.py [-h] [--stride STRIDE] [--batch-size BATCH_SIZE]
                           dataset-dir

Train CNN on MNIST dataset

positional arguments:
  dataset-dir           Directory in which to download

optional arguments:
  -h, --help            show this help message and exit
  --stride STRIDE
  --batch-size BATCH_SIZE


In [3]:
%run argparse_example /tmp/mnist --stride 1x1 --batch-size 64

Namespace(batch_size=64, dataset-dir='/tmp/mnist', stride='1x1')


---

## `logging`

This module provides a standard logging package for Python, it is expected that all libraries implement logging with this module so that applications have a unified logging stack.

You should never use `print` unless you're printing out CLI args help, that's about the only time you *should* use `print`, for everything else use a logger from `logging`.

In [78]:
import logging

Each logger has a `name`, which must adhere to the same naming standards as python packages. Individual loggers form part of a logger hierarchy allowing users to turn logging for components on and off easily.


In [9]:
# %load logging_example.py
#!/usr/bin/env python

import logging

top_level_logger = logging.getLogger('app')
subcomponent_logger = logging.getLogger('app.component')


if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)

    top_level_logger.debug('starting main function')
    top_level_logger.info('ready')
    subcomponent_logger.debug('initialising')
    subcomponent_logger.debug('ready')


In [13]:
!./logging_example.py

DEBUG:app:starting main function
INFO:app:ready
DEBUG:app.component:initialising
DEBUG:app.component:ready


---

## `doctest`

Doctests are examples contained in a [docstring](https://www.python.org/dev/peps/pep-0257/#what-is-a-docstring) that can be run to verify the behaviour of a function. They work quite well for simple idempotent functions.

Alternatively the [unittest](https://docs.python.org/2/library/unittest.html) module provides a standard xUnit-style testing library. 

In [80]:
import doctest

In [81]:
# %load doctest_example.py

def sum(*xs):
    """
    Sum the value of all arguments
    
    >>> sum(1)
    1
    >>> sum(1, 2)
    3
    >>> sum(3, 0, 8)
    11
    """
    sum = 0
    for x in xs:
        sum += x
    return sum

if __name__ == '__main__':
    import doctest
    doctest.testmod(verbose=True)

Trying:
    sum(1)
Expecting:
    1
ok
Trying:
    sum(1, 2)
Expecting:
    3
ok
Trying:
    sum(3, 0, 8)
Expecting:
    11
ok
11 items had no tests:
    __main__
    __main__.DataSet
    __main__.DataSet.__dict__
    __main__.DataSet.examples
    __main__.DataSet.labels
    __main__.Point
    __main__.Point.__dict__
    __main__.Point.x
    __main__.Point.y
    __main__.default_factory
    __main__.walker
1 items passed all tests:
   3 tests in __main__.sum
3 tests in 12 items.
3 passed and 0 failed.
Test passed.


In [14]:
%run doctest_example.py

Trying:
    sum(1)
Expecting:
    1
ok
Trying:
    sum(1, 2)
Expecting:
    3
ok
Trying:
    sum(3, 0, 8)
Expecting:
    11
ok
1 items had no tests:
    __main__
1 items passed all tests:
   3 tests in __main__.sum
3 tests in 2 items.
3 passed and 0 failed.
Test passed.


As you can see the examples in the docstring take the form of a REPL session. `>>>` lines prefix the input and the output from the function invocation appears directly below.

---

## `pdb`

`pdb` is the python debugger. If you're coming from a `gdb` background you're going to be surprised. Unlike gdb where you point it at a binary and tell it to break on a specific line, `pdb` is imported into the source code and a new statement is added to the program to trigger a breakpoint.

In [83]:
d = {
    'a': 1,
    'b': 2
}
# TODO: Uncomment line below and run code
#import pdb; pdb.set_trace()

In [84]:
# Using the ipython `%pdb` magic you can drop into a pdb prompt
# when an exception 
%pdb on

d = {
    'a': 1,
    'b': 2
}
# Trigger `KeyError` exception which will cause Jupyter to drop into pdb debugger
# TODO: Uncomment line below
#print(d['c'])

Automatic pdb calling has been turned ON


---