![](images/LogoPNG.png)
# Python Development and Data analysis

-----

**Alex Mitchell @data_alex**

**Prash Majmudar @prashmaj**

## Python
### Obligatory history

* Python is over 20 years old
* Creator Guido van Rossum has written up the history of Python / language design:
http://python-history.blogspot.co.uk/2009/01/introduction-and-overview.html
* Python is a multi-purpose language: it is dynamically and strongly typed, and interpreted

### This talk

* Focus is on Python 2.7 (Python 3 features might be mentioned)
* Will cover idiomatic Python aspects (Pythonic approaches)
* When we talk about Python, we mean CPython (the reference implementation, written in C), not Jython, IronPython, PyPy etc.

#### Structure

* First half will cover core language concepts
* Second half will cover Python for data analysis using the Pandas library

In [24]:
import this
reload(this)

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


<module 'this' from '/home/prash/.pyenv/versions/2.7.9/lib/python2.7/this.pyc'>

## Why choose Python?

### Readable / Easy to get started 

* Readability is core to the language design
 * Implies maintainable, re-usable code
 
### Productivity
* Interpreted --> faster to iterate (no compile + link phase), e.g. with a REPL
* Large standard library
* Core language has higher order functions (hides implementation detail) - generally means less code required to get stuff done (less to maintain / debug etc.)
* Large community support

### It's fun

```import antigravity```

## Why Not Python?

* Dynamic typing, lack of control over memory management: discipline is required
* If you need to write fast, CPU-intensive code you may want to use C / C++ (though extensions such as NumPy exist)
* ...

## Use cases

- Web Development (Django, Flask, Tornado, ...)
- Machine learning (scikit-learn, PyBrain, PyMC, theano, ...)
- Data analysis (Pandas, IPython NB)
- Data pipelines (Luigi, PySpark, mrjob, disco)
- Natural Language Processing (NLTK, gensim)
- Machine Vision (OpenCV bindings)
- Scripting / sysadmin (large standard lib for sysadmin tasks)
- Networking (Twisted)
- GUIs (Tkinter)
- Games (pygame)
- Animation

Python bindings exist for most external services you might use (SQL / NoSQL databases, AWS)



##  Getting started

### Pick a development environment

Some choices:

* PyCharm - IDE

* IPython + Editor (Vim, Emacs, Sublime).
  NB: Vim / Emacs are harder to learn (they need setup, plus key bindings). Productive once set up (but it can take years to perfect the setup!). Available on the server and locally - same toolset across environments.


### Source control

* Git hosted on GitHub is a good choice


### Package managers and dependency isolation

* Pip + virtualenv (https://virtualenv.pypa.io/en/latest/):
 * `pip install --upgrade pip`
 * `pip install virtualenv`
 
* Start your env
 * `virtualenv TESTENV`
 * `source TESTENV/bin/activate`  and `deactivate`
 
* Update requirements files
 * `pip freeze > requirements.txt` (save the installed packages)
 * `pip install -r requirements.txt` (install from the file)
 
* Anaconda

### Other tools

* Pylint

### Resources

* The Hitchhiker's Guide to Python (http://docs.python-guide.org/en/latest/)
* Full Stack Python (https://www.fullstackpython.com/)
* Python docs are extensive (https://docs.python.org/2/)
 * e.g. Python HOW-TOs (https://docs.python.org/2/howto/)


In [53]:
# Python borrows heavily  from C i.e. if else, while, for etc.

# Create a file (Python Module) with the following code and run it:
print "Spam"
for x in range(10):
    if x % 2:
        print x
    else:
        print "An even number", x
    


Spam
An even number 0
1
An even number 2
3
An even number 4
5
An even number 6
7
An even number 8
9


### What's happening here?

* Python compiles the source files into bytecode (note this is not machine code)
* The bytecode of imported modules is cached in .pyc files - recompilation only occurs if the source timestamp differs
* The bytecode is run by the Python interpreter (the Python Virtual Machine - platform independent)
* ASIDE: The CPython interpreter uses a Global Interpreter Lock (GIL): only one thread executes Python bytecode at a time. This has consequences for multi-threading in Python (CPU-bound threads do not run in parallel)

<img src="images/python_compile.png">

### IPython terminal

* Install IPython (an enhanced Python Interpreter)
* Great way to explore package / module APIs and quickly debug / run code
* More on IPython later!

In [26]:
import os

# use "?" to explore the API
os.path?

#  use dir(os.path)
dir(os.path)

['__all__',
 '__builtins__',
 '__doc__',
 '__file__',
 '__name__',
 '__package__',
 '_joinrealpath',
 '_unicode',
 '_uvarprog',
 '_varprog',
 'abspath',
 'altsep',
 'basename',
 'commonprefix',
 'curdir',
 'defpath',
 'devnull',
 'dirname',
 'exists',
 'expanduser',
 'expandvars',
 'extsep',
 'genericpath',
 'getatime',
 'getctime',
 'getmtime',
 'getsize',
 'isabs',
 'isdir',
 'isfile',
 'islink',
 'ismount',
 'join',
 'lexists',
 'normcase',
 'normpath',
 'os',
 'pardir',
 'pathsep',
 'realpath',
 'relpath',
 'samefile',
 'sameopenfile',
 'samestat',
 'sep',
 'split',
 'splitdrive',
 'splitext',
 'stat',
 'supports_unicode_filenames',
 'sys',
 'walk',

### Exercise

* Check the version of Python you have. Check you have pip!
* Install and activate virtualenv
* Use pip freeze to check installed packages in your virtualenv
* Install the `requests` python package
* Fire up IPython, import requests and work out how to download the content from "growthintel.com" (GET request and print the content)
* Use pip freeze to save your deps to a requirements file
* Create and run a Python script (get.py) to save your code and run it

### Solution

## Core types  / structures

* Ints
* Strings
* Lists
* Dicts
* Tuples
* Sets
* Bools
* None

A few other points to note:

* Variables in Python are basically aliases / names for objects
 * Useful explanation here: http://robertheaton.com/2014/02/09/pythons-pass-by-object-reference-as-explained-by-philip-k-dick/
* Nearly everything is an object, including functions - these are first-class citizens. This means we can assign functions to variables, pass them into functions / methods, and operate on them

## Strings (Python  2)

### Unicode and str

* Strings have two representations in Python 2:
 * Byte strings - `str`
 * Unicode - `unicode`
 
* Byte strings - just a series of bytes. Text characters are encoded into bytes using an encoding scheme:
 * ASCII - a 7-bit encoding scheme where a single byte represents a character (128 code points) with a direct mapping.
 * UTF-8, UTF-16, UTF-32
 * Latin-1
 * CP-1252
 * Other.... 
* Unicode (`unicode`) is Python's internal representation of text as code points; literals must be prefixed with `u`, e.g. `u'Hello'`

In [27]:
# Bytestrings in Python
# \x is the escape sequence for a byte given in hexadecimal. Hex 48 == 72 in decimal (ASCII 'H')
print '\x48'

#  Equivalent to
print 'H'

H
H


### Unicode
* Many characters (e.g. from non-English languages, symbols) cannot be represented in ASCII. Unicode defines a codespace of over 1M code points (characters)
  * READ the Unicode Primer if you're working with text!: https://docs.python.org/2/howto/unicode.html

In [28]:
# Create a Unicode string as follows
unistr1 = u'bytestring'

# OR convert a byte string (decoded as ASCII by default)
unistr2 = unicode('bytestring')

print repr(unistr1)
print repr(unistr2)

# Characters can be encoded as a series of 8-bit (1 byte) units - UTF-8
a = u'£'.encode('utf-8')

# Characters encoded as a series of 16-bit units - UTF-16
b = u'£'.encode('utf-16')

#  UTF-32
c = u'£'.encode('utf-32')
print repr(a), a
print repr(b), b
print repr(c), c


# Note you can also specify the codepoint number itself using \u. i.e. U+00A3
print u'\u00a3'


u'bytestring'
u'bytestring'
'\xc2\xa3' £
'\xff\xfe\xa3\x00' ��� 
'\xff\xfe\x00\x00\xa3\x00\x00\x00' ��  �   
£



<img style="float: left" src="images/Gotcha_small.png">
### How to handle encoding /  decoding of data


* As early as possible, DECODE from bytes into Python's Unicode representation (e.g. when reading in a file). Lots of IO libraries support specifying the codec when reading / writing (e.g. Pandas)
* Work with Unicode in your code
* When writing data out, ENCODE from Unicode into bytes

* If using non-ASCII characters in your code (e.g. accented characters) then add a coding comment to the top of the file to declare the encoding (see Unicode Primer)

```
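data = data.decode('utf-8')          # bytes -> unicode on the way in
out_data = transform(data)           # transform is a placeholder for your own unicode-only logic
out_data = out_data.encode('utf-8')  # unicode -> bytes on the way out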
```


* Sometimes you might not know what your data is encoded in, consider using:
 * `pip install chardet`

### Raw strings

In [29]:
# In raw strings backslashes are kept literally rather than starting an escape sequence
# Useful for Regular expressions
import re
regex_str = r'\bHello\b'
hello_regex = re.compile(regex_str)
matches = hello_regex.search('Good day and Hello    to you all')
print matches.group()

# Otherwise I need to do this (less readable for complex regular expressions)
regex_str = '\\bHello\\b'
hello_regex = re.compile(regex_str)
matches = hello_regex.search('Good day and Hello    to you all')
print matches.group()

Hello
Hello


### String manipulation

* The string API is rich (uppercasing, titlecasing, checking case). Prefer built-in methods e.g.
 * `text.startswith('Prash')`
* Strings are immutable. Prefer `join` to repeatedly "adding" strings together: each addition creates a copy and returns a new string, which can be inefficient (slower, more memory) for very long lists of strings.


In [30]:
def concat_address(addr):
    """Concat addresses."""
    
    return ", ".join(addr)
    
address = ["Level 42", "1 Canada Sq", "Canary Wharf"] 
print concat_address(address)

Level 42, 1 Canada Sq, Canary Wharf


### Exercise

* Write out some non-ASCII data to file (use `open` to create a file object)
 * E.g. write out text containing £ symbols etc.
 * Be explicit about the encoding: set the encoding to utf-16
* Try reading the data back in without setting the encoding
* Try and detect the encoding by installing the chardet package



### Solution

In [10]:
import chardet

text = u"£100 of spam and eggs"

# Write the encoded bytes in binary mode
with open('test_encoding.txt', 'wb') as f:
    f.write(text.encode('utf-16'))

# Read the raw bytes back without decoding
with open('test_encoding.txt', 'rb') as f:
    data = f.read()

print data

# Detect the encoding
encoding = chardet.detect(data)
print encoding

# Decode using detected encoding
print data.decode(encoding['encoding'])

��� 1 0 0   o f   s p a m   a n d   e g g s 
{'confidence': 1.0, 'encoding': 'UTF-16LE'}
﻿£100 of spam and eggs


### Lists

* Mutable, ordered collections of objects

In [31]:

names =  ['alex', 'bob', 'fred', 'alice']
print "Original list: ", names

last = names.pop()
print "Last list element: ", last

# Lists are mutable
names.append('james')
print "List is now: ", names

# Count number of values
names.count('bob')




Original list:  ['alex', 'bob', 'fred', 'alice']
Last list element:  alice
List is now:  ['alex', 'bob', 'fred', 'james']


1

### Slicing

In [54]:
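# Slicing
# NB: the output below comes from this list (it is defined again in the Sets section),
# not from the list built in the previous cell
names = ["bob", "spam", "eggs", "bob", "eggs", "eggs"]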
print "start of array up to (not including) index of 2", names[:2]
print "Last element: ", names[-1]
print "First element: ", names[0]
print "Start at index 2, up to (not inc) index 4 ", names[2:4]

start of array up to (not including) index of 2 ['bob', 'spam']
Last element:  eggs
First element:  bob
Start at index 2, up to (not inc) index 4  ['eggs', 'bob']


<img style="float: left" src="images/Gotcha_small.png">

## Aside: Mutable default parameters

In [55]:
# What's wrong with this?
def test(a=12, b=[]):
    b.append(a)
    print b
    
test(a=31)
test()

[31]
[31, 12]


What has happened?

* When this code is run (i.e. when the function is defined), a list object is created once as part of the function definition. The default parameter b is an alias for this object, and the object persists, i.e. the SAME list is used for each successive call to this function.
* Don't do it! Instead use `b=None` and create the list inside the function `if b is None`
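
The `b=None` fix suggested above can be sketched as follows. The function name `append_to` is made up for illustration, and it returns the list rather than printing it so that the result is easy to inspect. It should run on both Python 2.7 and Python 3.

```python
def append_to(a=12, b=None):
    """Append a to b, creating a fresh list when b is not supplied."""
    if b is None:
        b = []  # a new list is created on every call
    b.append(a)
    return b

first = append_to(a=31)
second = append_to()

print(first)
print(second)
```

Each call now gets its own list, so the second call no longer sees the `31` appended by the first.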

### Dictionaries

In [35]:
# Key-value collection
url_visits = {"test.com":1256,  "growthintel.com":5000, "bbc.co.uk":5e6}
print url_visits
print "Gi has: ", url_visits["growthintel.com"]


{'bbc.co.uk': 5000000.0, 'test.com': 1256, 'growthintel.com': 5000}
Gi has:  5000


In [36]:
# Key does not exist
try:
    url_visits["bob.com"]
except KeyError as e:
    print "Got a Key Error:", e

Got a Key Error: 'bob.com'


In [37]:
print "Value for bob.com is: ", url_visits.get("bob.com", None)

# OR check key exists first

if "bob.com" in url_visits:
    print "Key exists!"
else:
    print "Key does not exist!"

Value for bob.com is:  None
Key does not exist!


### Defaultdicts

* Extends dictionaries with a default factory that supplies a value whenever a missing key is accessed

In [38]:
from collections import defaultdict
dict_of_lists = defaultdict(list)
dict_of_lists['a'].append('bob')
print dict_of_lists

defaultdict(<type 'list'>, {'a': ['bob']})


### Exercise:

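* Install the following package (https://github.com/joke2k/faker):
 - `pip install fake-factory` (published on PyPI as `Faker` in later releases)
* The `json` module is part of the standard library, so there is nothing to install for it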


* Use the fake factory to create 10 dummy example records and store in a dict.
* Each record should contain an example Name, Address, Email and two phone numbers
* Serialise the dict to JSON (hint: use IPython to find the proper method)

### Solution

In [75]:
import json
import faker

fixture_fact = faker.Factory.create()

records = []

for i in xrange(10):
    record = {}
    record['name'] = fixture_fact.name()
    record['email'] = fixture_fact.email()
    record['address'] = fixture_fact.address()
    record['numbers'] = [fixture_fact.phone_number(), fixture_fact.phone_number()]
    records.append(record)
    
print len(records)

# JSON data can be stored in a DB or passed / returned to web APIs
records_json = json.dumps(records)
print records_json



10
[{"numbers": ["1-129-652-7002", "09484502465"], "address": "PSC 3270, Box 1734\nAPO AA 60351", "name": "Dr. Lute Collins", "email": "leonor.hansen@hotmail.com"}, {"numbers": ["(597)146-3530", "391-906-6271x5365"], "address": "8910 Junius Drives Suite 247\nLake Coby, TX 28927-1633", "name": "Roderick Kling III", "email": "cam.mccullough@gmail.com"}, {"numbers": ["653-540-7450x563", "1-578-610-3449x68117"], "address": "35557 Noble Squares Apt. 951\nKirlinbury, PA 62312", "name": "Woodie Will", "email": "fitzgerald79@muller.com"}, {"numbers": ["503.141.3100", "(819)199-4591"], "address": "03681 Rogahn Isle Apt. 991\nEast Waldemar, IA 04386", "name": "Mr. Colton Cole", "email": "romaguera.nola@hotmail.com"}, {"numbers": ["868-643-7698x29154", "(355)719-6248x466"], "address": "3247 Crete Branch Suite 879\nEast Easter, MH 77488-3617", "name": "Aidyn Doyle", "email": "zrosenbaum@yahoo.com"}, {"numbers": ["(119)267-5467x856", "1-261-411-6059x125"], "address": "6892 Beatty Pine Suite 418\nRi

### Sets

* Collection of **unordered**, **unique** objects
* Sets are mutable
* Useful for testing memberships (object in set, are sets supersets or subsets of each other)
* Useful for calculating intersections, unions between sets

In [39]:
# Useful if you want to find the unique values of a collection
names = ["bob", "spam", "eggs", "bob", "eggs", "eggs"]
print set(names)

set(['bob', 'eggs', 'spam'])


In [40]:
# Note braces are used for sets, be careful not to confuse dict notation and set notation!
mp_names = {"eggs", "spam"}
print type(mp_names)

# Be careful - this is an empty dict!
empty_ = {}
empty_set = set()

# Intersection of names and mp_names
mp_names.intersection(names)

<type 'set'>


{'eggs', 'spam'}
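
The membership, subset / superset and union operations listed above can be sketched as follows. The set `menu` and its contents are illustrative only, and the results are sorted before printing because sets have no defined order. It should run on both Python 2.7 and Python 3.

```python
menu = {"bob", "spam", "eggs"}
mp_names = {"eggs", "spam"}

# Membership test
has_bob = "bob" in menu

# Subset / superset checks
is_subset = mp_names.issubset(menu)
is_superset = menu.issuperset(mp_names)

# Union and difference return new sets and leave the originals untouched
all_names = menu.union({"alice"})
only_in_menu = menu.difference(mp_names)

print(has_bob)
print(is_subset)
print(sorted(all_names))
print(sorted(only_in_menu))
```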

### Tuples

* Tuples are **immutable**.
 * Can be used as keys in dictionaries (as long as their elements are hashable)
 * Useful as data containers to be passed to functions / methods - the data cannot be modified (also see: `namedtuple`)
* Unfortunately the syntax can be a bit clunky

In [41]:
# The comma creates the tuple - GOTCHA: watch for trailing commas, you'll be constructing a tuple!
a = 1,

# BUT we recommend using parentheses: (1,)

# Single-item tuples MUST include the trailing comma
b = (1,) # Don't use (1) - that is just the int 1

# Empty tuples MUST use parentheses
c =  ()
c  = tuple()

d = 1,2,3,4,"hello"

print a
print b
print c
print d

(1,)
(1,)
()
(1, 2, 3, 4, 'hello')


#### Example: Named tuples as a data container

* `namedtuple` is a factory function for creating tuple subclasses with named fields

In [16]:
import collections

Person = collections.namedtuple('Person',  ['name', 'email'])
fred = Person(name='fred', email='fred@gmail.com')
print fred

# Named Tuple is immutable
try:
    fred.email = 'bob@gmail.com'
except AttributeError as e:
    print e
    print "You can't do that!"

Person(name='fred', email='fred@gmail.com')
can't set attribute
You can't do that!


### Exercise

* Create two sets with 

### Solution

### Sequences

* Sequence slicing (saw this earlier with lists)
* Sequences can be unpacked (tuple, list, string unpacking)
* Iterating on sequences, iterating on sequences in parallel (zip), generators, itertools

In [43]:
a,b,c = ["bob", "fred", "alice"]
print "a: ", a
print "b: ", b
print "c: ", c

# You can swap objects efficiently - TUPLE unpacking
b,c = c,b

print "Swapped b: ", b
print "Swapped c: ", c

# Unpack strings
d,e,f = "def"
print d
print e
print f

a:  bob
b:  fred
c:  alice
Swapped b:  alice
Swapped c:  fred
d
e
f


## Comprehensions

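`map` and `filter` -> you rarely need them, use list comprehensions instead! (`reduce` has no comprehension equivalent; use a loop or a built-in such as `sum`.)

List / dict comprehensions are generally preferable and more readable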






In [44]:
def square(x):
    return x**2

# Equivalent of map
squares = [square(x) for x in range(10)]
print squares

# Map and Filter
large_squares = [square(x) for x in range(10) if x > 2]
print large_squares

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
[9, 16, 25, 36, 49, 64, 81]
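
Dict comprehensions are mentioned above but not shown, so here is a small sketch (set comprehensions use the same syntax). The list `words` is illustrative only. The syntax is available from Python 2.7 onwards, and the snippet should also run on Python 3.

```python
words = ["spam", "eggs", "bacon"]

# Dict comprehension: map each word to its length
word_lengths = {word: len(word) for word in words}

# Dict comprehension with a filter: keep only the four-letter words
short_words = {word: len(word) for word in words if len(word) == 4}

# Set comprehension: the distinct lengths
lengths = {len(word) for word in words}

print(sorted(word_lengths.items()))
print(sorted(short_words))
print(sorted(lengths))
```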


## Generators

* Lazily evaluated sequences.
* Functions can be converted into generators. Calling the function creates an iterator whose values are produced by `yield` (rather than `return`) - the function is paused between each request for the next value.
* Generators are great for long (even infinite) sequences
* Generators are great for reading from queues or from large files
* Why? Because they're memory efficient - they yield rather than creating the whole sequence in memory

In [45]:
import csv

# Create a generator
file_gen = (x for x in csv.reader(open('datasets/test_gen.csv', 'r')))
for line in file_gen:
    print line

['1']
['2']
['3']
['4']
['5']
['6']
['7']
['8']
['9']
['10']


### Turning functions into generators

In [46]:
# Functions can be converted to generators by using yield

def infinite_squares():
    a = 1
    while True:
        yield a*a
        a += 1
        
# The function call creates my generator
squares_gen = infinite_squares()

print type(squares_gen)
print squares_gen.next()
print squares_gen.next()
print squares_gen.next()
print squares_gen.next()

# I could do the following forever
# for square in squares_gen:

<type 'generator'>
1
4
9
16
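
Rather than looping forever, you can take a bounded slice of an infinite generator with `itertools`. The sketch below redefines `infinite_squares` so that it is self-contained, and should run on both Python 2.7 and Python 3.

```python
import itertools

def infinite_squares():
    """Yield 1, 4, 9, ... without ever terminating."""
    a = 1
    while True:
        yield a * a
        a += 1

# islice lazily takes the first five values and then stops
first_five = list(itertools.islice(infinite_squares(), 5))
print(first_five)

# takewhile stops as soon as the condition fails
below_fifty = list(itertools.takewhile(lambda sq: sq < 50, infinite_squares()))
print(below_fifty)
```

Both helpers consume only as many values as they need, so the infinite generator is never exhausted.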


In [47]:
# xrange is another type of lazily evaluated sequence
print range(2,12)

# xrange does not build the list, but returns a lazy sequence object - therefore more memory efficient
print xrange(2,12)

# You can still perform sequence type operations
lazy_list = xrange(2,12)
print lazy_list[0]
print len(lazy_list)


[2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
xrange(2, 12)
2
10


### Iterating over multiple sequences in parallel

In [48]:
#  ZIP

a = [1,2,3,4]
b  = ['a', 'b', 'c', 'd']

# Use zip to create a new list of tuples to pass to the dict constructor
dict(zip(a,b))

{1: 'a', 2: 'b', 3: 'c', 4: 'd'}

In [49]:
# For long lists consider using izip to return an iterable rather than a new list
import itertools

print zip(a,b)

print itertools.izip(a,b)

zip_gen = itertools.izip(a,b)
zip_gen.next()
print type(zip_gen)

[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
<itertools.izip object at 0x7f7ec8223758>
<type 'itertools.izip'>


### More reading

In [50]:
# EXPLORE itertools and collections

import itertools
print ", ".join(dir(itertools))

import collections
print ", ".join(dir(collections))


__doc__, __file__, __name__, __package__, chain, combinations, combinations_with_replacement, compress, count, cycle, dropwhile, groupby, ifilter, ifilterfalse, imap, islice, izip, izip_longest, permutations, product, repeat, starmap, takewhile, tee
Callable, Container, Counter, Hashable, ItemsView, Iterable, Iterator, KeysView, Mapping, MappingView, MutableMapping, MutableSequence, MutableSet, OrderedDict, Sequence, Set, Sized, ValuesView, __all__, __builtins__, __doc__, __file__, __name__, __package__, _abcoll, _chain, _class_template, _eq, _field_template, _get_ident, _heapq, _imap, _iskeyword, _itemgetter, _repeat, _repr_template, _starmap, _sys, defaultdict, deque, namedtuple


### Exercises

* Write a list comprehension to return even numbers up to 20
* Use a list comprehension to clean (titlecase, strip whitespace) and create a list of dicts where the key is the index into the list and the value is the cleaned up string. Apply to the following list of strings:
 * `['  the start  ', '  the middle', '  the end ']`
 * Hint: Have a look at the `enumerate` function

### Solution

In [43]:
# even numbers up to (not including) 20
a = [x for x in range(20) if not x%2]

# NB a better way without a list comp!
b = range(0,20,2)

print a
print b

# cleaning and converting to list of dicts
text_list = ['  the start  ', '  the middle', '  the end ']
def cleanline(index, line):
    val = line.strip().title()
    return {index:val}

[cleanline(i, x) for i, x in enumerate(text_list)]

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]


[{0: 'The Start'}, {1: 'The Middle'}, {2: 'The End'}]

## Further reading

### Functions

* Functions are first-class
* You may want to look at decorators - common cross-cutting code can be abstracted from the logic of your code with syntactic sugar
* Have a look at functools
* Understand argument and keyword argument unpacking (`*args` and `**kwargs`) - can be useful for functions with an unknown number of arguments
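
As a starting point for the reading above, here is a minimal sketch that combines a decorator, `functools.wraps` and `*args` / `**kwargs` forwarding. The names `count_calls` and `add` are made up for illustration. It should run on both Python 2.7 and Python 3.

```python
import functools

def count_calls(func):
    """Decorator that counts how many times func has been called."""
    @functools.wraps(func)  # keep func's name and docstring on the wrapper
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        # Forward whatever positional / keyword arguments were given
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@count_calls
def add(x, y=0):
    """Add two numbers."""
    return x + y

total = add(1, y=2)
add(5)

print(total)
print(add.calls)
print(add.__name__)
```

The `@count_calls` line is syntactic sugar for `add = count_calls(add)`. The counting logic lives in one place and `add` itself stays free of it.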

### OO in Python:  Classes

* New-style classes inherit from `object` - be consistent and always use them in Python 2
* Generally advisable not to use multiple inheritance, unless *mixing-in* methods
* Use `super` to call parent class methods
* Inspect `__mro__`  to understand method resolution order if you're not sure
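
The points above can be sketched with a small class hierarchy. `Animal`, `LoudMixin` and `Dog` are hypothetical names used only for illustration. The explicit two-argument form of `super` is used so that the snippet should run on both Python 2.7 and Python 3.

```python
class Animal(object):  # inheriting from object gives a new-style class in Python 2
    def __init__(self, name):
        self.name = name

    def describe(self):
        return "%s is an animal" % self.name

class LoudMixin(object):
    """Mixin that only adds behaviour, no state."""
    def shout(self):
        return self.describe().upper()

class Dog(LoudMixin, Animal):
    def __init__(self, name):
        # super follows the MRO: LoudMixin has no __init__, so Animal's is called
        super(Dog, self).__init__(name)

    def describe(self):
        base = super(Dog, self).describe()
        return base + " (a dog)"

rex = Dog("Rex")
mro_names = [cls.__name__ for cls in Dog.__mro__]

print(rex.describe())
print(rex.shout())
print(mro_names)
```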

##  Building an Application

### Setup a virtualenv

* Ensure you have an up to date requirements.txt

### Follow a styleguide

* Read PEP8 - Python Style guide

### Organising your code

* Python code is organised into files (modules)
* Collections of modules are packages

* Python uses the file system (directory structure) to resolve module imports
 * A directory is treated as a package only if it contains an `__init__.py` file
 * The project root must be on the import path, e.g. by setting PYTHONPATH:
 * `export PYTHONPATH=/home/prash/projects`
* Use module namespacing to achieve isolation
* Use Classes (OO) where appropriate

### Packages



### Modules

Modules are imported once: the module's code is executed on first import, and the resulting module object (its namespace) is cached and reused by later imports.
Note: Modules are a sensible way to have an object created once only (Singleton)

* AVOID `from module import *` - it imports every public name into the current namespace, potentially overriding existing names
* PREFER fully qualified imports e.g. `import pandas` or `import os.path`
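
The "imported once" behaviour can be checked with the module cache in `sys.modules`. This is a small sketch using the standard library `json` module as an arbitrary example, and the path passed to `os.path.basename` is made up. It should run on both Python 2.7 and Python 3.

```python
import sys
import json            # the first import in a process runs the module code and caches it
import json as json2   # later imports are served from the cache, nothing is re-run

# Both names point at the single module object stored in sys.modules
same_object = json is json2 is sys.modules["json"]
print(same_object)

# Fully qualified access keeps the origin of each name obvious
import os.path
script_name = os.path.basename("/home/user/get.py")
print(script_name)
```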

### Testing

* nose is a good framework that extends the standard library unittest framework
* Create a tests directory that mirrors your app
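
As a sketch of what a test module might contain, here is a standard library `unittest` test case, which nose can discover and run. The function under test, `clean_line`, is a hypothetical helper defined inline to keep the example self-contained. In a real project it would be imported from your application package.

```python
import unittest

def clean_line(line):
    """Function under test: strip whitespace and title-case."""
    return line.strip().title()

class CleanLineTest(unittest.TestCase):
    def test_strips_and_titlecases(self):
        self.assertEqual(clean_line("  the start  "), "The Start")

    def test_empty_string(self):
        self.assertEqual(clean_line(""), "")

# Build and run the suite explicitly so the example also works inside a notebook
suite = unittest.TestLoader().loadTestsFromTestCase(CleanLineTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```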

### App summary

* Consider using docopt for a nicer command line interface
* Understand DB Access with an Object Relational Mapper (ORM) e.g. sqlalchemy
* Use the logging library to log to file (prefer over print statements)
* Use sensible imports (group standard library and third-party package imports etc.)
* Use docstrings!
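
A minimal sketch of logging to a file with the standard library `logging` module follows. The logger name `myapp`, the log format and the temporary log location are assumptions made for illustration. It should run on both Python 2.7 and Python 3.

```python
import logging
import os
import tempfile

# Hypothetical log location; a real app would use a configured path
log_path = os.path.join(tempfile.mkdtemp(), "app.log")

logger = logging.getLogger("myapp")  # "myapp" is an illustrative logger name
logger.setLevel(logging.INFO)

handler = logging.FileHandler(log_path)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logger.addHandler(handler)

logger.info("Processed %d records", 10)
logger.debug("This is below INFO, so it is not written")

# Close the handler, then read the file back to show what was logged
handler.close()
with open(log_path) as f:
    log_text = f.read()
print(log_text)
```

Unlike `print`, the level (here `INFO`) can be changed in one place to make the application more or less verbose without touching the call sites.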