<DIV ALIGN=CENTER>

# Advanced Python Concepts
## Professor Robert J. Brunner
  
</DIV>  
-----
-----

## Introduction 

In this lesson, we build on the foundation provided by the previous two
Python lessons to introduce file input/output, the use of external
packages to include new functionality, and the creation and application
of user-defined functions. When complete, you will be able to read and
write Python programs, which we will demonstrate by writing a Python
program that we will execute at the Unix command line.

-----

## Working with files

When working with files, or any other system object, we must be careful
about properly managing the underlying resource. In this particular
case, that means a file and the associated file descriptor that the host
operating system uses to reference the actual file. While modern
operating systems can typically manage a very large number of file
descriptors, when we use virtualization, like with a Docker container,
we want to minimize our server footprint. Thus, we need to carefully
husband resources like file descriptors to avoid exhausting our server
resources.

But a more important aspect is that whenever we open a file, we want to
be sure that the file is properly closed, and that any data that a
program wrote to the file has actually been written to permanent
storage. To open a file, Python has a built-in `open` function that
opens the named file and returns a file object that you either read from
or write to, depending on the mode used to open the file. Conversely,
every file object has a `close` method that closes the file.

To explicitly state why a file is being opened, the `open` function
accepts a _mode_ argument, whose default value is `rt`, or _open for
reading text data_. The allowed modes are detailed in the following
table.

| Mode | Description |
| ---- | ----- |
| 'r'  | reading (default) | 
| 'w'  | writing, truncate file first |
| 'x'  | exclusive creation for writing, fails if the file already exists |
| 'a'  | writing, append to the end of the file if it exists |
| 'b'  | binary mode |
| 't'  | text mode (default) |
| '+'  | open for reading and writing |

Normally, and especially for the purposes of this class, we will only
read from a text file or write to a text file when using traditional
Python file input and output. Thus, to open a text file named `test.txt`
for writing without truncating the existing file contents (i.e.,
append), you would use `f = open('test.txt', 'a')` and after all
operations on the file are complete, you would use `f.close()` to close
the file and release all associated resources. One last item: when
opening a file for both reading and writing, the `+` mode follows either
a `w`, which opens the file but truncates its contents, or an `r`, which
opens the file without truncation.
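
The explicit open and close pattern described above can be sketched as
follows. This is a minimal illustration, not the only way to do it: the
scratch path is hypothetical and created in a temporary directory, and
the `try`/`finally` blocks are one way to make sure `close` is called
even if a write fails.

```python
import os
import tempfile

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# Create the file with one line ('w' truncates any existing content).
f = open(path, 'w')
try:
    f.write("first line\n")
finally:
    f.close()  # always release the file descriptor

# Re-open in append mode: the existing content is kept.
f = open(path, 'a')
try:
    f.write("second line\n")
finally:
    f.close()

# Read everything back to confirm that both lines are present.
f = open(path, 'r')
try:
    contents = f.read()
finally:
    f.close()

print(contents)
```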

In Python 3, the preferred approach to file input/output uses a runtime
[context](https://docs.python.org/3/reference/datamodel.html?highlight=context%20manager#with-statement-context-managers),
which is a way to enforce what should happen when a code block is
entered and exited. The _context_ is created by using the `with`
statement in Python, where the rest of the line following `with` creates
the actual context manager that manages the entry into and exit from the
enclosed code block. For our purposes, the standard application for a
Python context is opening and closing files. As demonstrated in the
following code block, we can now open a file, perform operations on the
file, and no longer worry about closing the file, which is now taken
care of automatically by the context.

```
# `data` stands for any string that should be appended to the file
data = "Some text\n"

with open('temp.txt', 'a') as fout:
    fout.write(data)
```
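
To make the automatic closing concrete, the following minimal sketch
checks the `closed` attribute of the file object inside and after a
`with` block, including the case where the block exits because of an
error. The scratch path and the simulated `ValueError` are assumptions
made only for this illustration.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

with open(path, 'w') as fout:
    fout.write("Hello World!\n")
    inside = fout.closed      # False: the file is still open here

after = fout.closed           # True: closed when the block exited

# The file is also closed when the block exits because of an error.
try:
    with open(path, 'a') as fout2:
        raise ValueError("simulated failure")
except ValueError:
    closed_after_error = fout2.closed

print(inside, after, closed_after_error)
```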
-----

In [1]:
# File writing demonstration

# We explicitly place a newline at the end of each string
with open('temp.txt', 'w') as fout:
    fout.write("Hello World!\n")
    fout.write("Goodbye World!\n")

In [2]:
!cat temp.txt

Hello World!
Goodbye World!


----- 

To read data with Python 3, we simply open the file (in a context). By
default, for a text file, we simply iterate through the file object,
which returns each line of the text file as a Python string, including
its trailing newline character.

```
with open('temp.txt', 'r') as fin:
    for line in fin:
        # Each line already ends with a newline, so suppress print's own
        print(line, end='')
```

The `open` function also takes an `encoding` argument that can be used
to specify the character encoding used in the file. For example, the
airline data we have used previously has a character encoding of
`latin-1`. Originally, the dominant character encoding used by computers
was ASCII, which requires only 7 bits to represent each character. This
encoding only represented the standard American typewriter characters,
and thus failed to work for non-English languages or words. To support
character encodings for any language, the [Unicode
Consortium](http://www.unicode.org) was formed, and standardized
character encodings have subsequently been developed. One of the most
popular current character encodings is `utf-8`, which is a Unicode
standard.
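
As a small illustration of why the encoding matters, the sketch below
encodes the same text with `latin-1` and with `utf-8` and then tries to
decode the `latin-1` bytes as `utf-8`. The sample word is an assumption
chosen only because it contains one non-ASCII character; it is not taken
from the airline data.

```python
# Sample text with one non-ASCII character (e with an acute accent).
text = "caf\u00e9"

# The same text becomes different bytes under different encodings.
latin1_bytes = text.encode('latin-1')   # one byte for the accented letter
utf8_bytes = text.encode('utf-8')       # two bytes for the accented letter

# Decoding with the matching encoding recovers the original text.
round_trip = latin1_bytes.decode('latin-1')

# Decoding latin-1 bytes as utf-8 fails, which is the kind of problem
# that passing the correct `encoding` to open() avoids.
try:
    latin1_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(latin1_bytes, utf8_bytes, round_trip == text, decode_failed)
```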

In the following set of code blocks, we first grab the airline data,
uncompress it, extract the first one thousand lines for simplicity, and
use a small Python program to read the lines from this file and display
which airline flights left the Baltimore airport (code: BWI).

Note that this file is already cached locally in the `data` directory, so you can skip the first two code cells and simply use the `head` command to grab the first 1000 lines from the cached file `../data/2001.csv`.

-----

In [3]:
# First we will grab the data of interest
!wget http://stat-computing.org/dataexpo/2009/2001.csv.bz2

--2015-08-16 22:30:31--  http://stat-computing.org/dataexpo/2009/2001.csv.bz2
Resolving stat-computing.org (stat-computing.org)... 54.231.161.99
Connecting to stat-computing.org (stat-computing.org)|54.231.161.99|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 83478700 (80M) [application/x-bzip2]
Saving to: ‘2001.csv.bz2’


2015-08-16 22:31:17 (1.76 MB/s) - ‘2001.csv.bz2’ saved [83478700/83478700]



In [4]:
# Now decompress
!bzip2 -d 2001.csv.bz2

In [7]:
# Now extract the first thousand lines to save time

!head -1000 2001.csv > temp.csv

In [8]:
# Now we can use Python to read in the file

# Here is our formatted print string
fString = "Flight {0} departed from Baltimore on {1}/{2}/{3}"

# Now loop through the file. The encoding is latin-1, failure to 
# specify this encoding will cause problems

with open('temp.csv', 'r', encoding='latin-1') as fin:
    for line in fin:
        cols = line.split(',')
        if cols[16] == 'BWI':
            print(fString.format(cols[9], cols[1], cols[2], cols[0]))

Flight 375 departed from Baltimore on 1/17/2001
Flight 375 departed from Baltimore on 1/18/2001
Flight 375 departed from Baltimore on 1/19/2001
Flight 375 departed from Baltimore on 1/20/2001
Flight 375 departed from Baltimore on 1/21/2001
Flight 375 departed from Baltimore on 1/22/2001
Flight 375 departed from Baltimore on 1/23/2001
Flight 375 departed from Baltimore on 1/24/2001
Flight 375 departed from Baltimore on 1/25/2001
Flight 375 departed from Baltimore on 1/26/2001
Flight 375 departed from Baltimore on 1/27/2001
Flight 375 departed from Baltimore on 1/28/2001
Flight 375 departed from Baltimore on 1/29/2001
Flight 375 departed from Baltimore on 1/30/2001
Flight 375 departed from Baltimore on 1/31/2001


-----
## Python Packages

As the Python language has become more popular, individuals and
organizations have invested considerable time, energy, and effort in
developing Python applications. Fortunately, the Python language
supports encapsulating code into
[modules](https://docs.python.org/3/tutorial/modules.html), which are
essentially files containing Python definitions, for example functions,
classes, or variables. A _module_ can be imported into another Python
file, allowing the definitions to be reused. 

When one or more modules are widely used, they can be bundled together
into a Python package, which can provide enhanced functionality. To
import a package (or module) into another Python program, you use the
`import` statement, which has the following forms:

1. `import numpy`
2. `import numpy as np`
3. `from numpy import arange`
4. `from numpy import *`

The first form makes the entire contents of the numpy package available
to the current program, but leaves all items in the numpy namespace.
Thus, to refer to a particular definition, like `arange`, one must use
the `numpy` prefix, as in `numpy.arange()`. The second form is similar
to the first, but the prefix has been shortened to `np`. The third form
imports only the single, listed definition, which is brought into the
current namespace and thus does not require any prefix. The last form
brings the entire contents of the _numpy_ package into the current file
and namespace. As a result, the chance of name collisions increases, and
thus the last form is strongly discouraged.
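
The first three forms can be compared side by side in a short sketch.
To keep the example runnable without any third-party installation, it
uses the Standard Library `math` module as a stand-in for `numpy`; that
substitution is an assumption of this example, and the same pattern
applies to `numpy.arange`.

```python
# Form 1: names stay in the module's namespace.
import math
a = math.sqrt(16.0)

# Form 2: the same module, reached through a shorter prefix.
import math as m
b = m.sqrt(16.0)

# Form 3: a single name is brought into the current namespace.
from math import sqrt
c = sqrt(16.0)

# Form 4 (`from math import *`) is omitted here because it would bring
# every public name of the module into the current namespace.

print(a, b, c)
```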

Many popular packages have been included with the standard Python
distributions and are known collectively as the Standard Library. Other
packages are available from third parties, and can be very useful in
specific circumstances. The following table lists some of the more
popular Python packages that are relevant for this course:

| Name | Description |
| --- | --- |
| [numpy][1] | Fast numerical arrays and matrices |
| [scipy][2] | Comprehensive set of scientific and engineering functions |
| [matplotlib][3] | Comprehensive plotting library |
| [seaborn][4] | Better data plotting |
| [pandas][5] | Provides data structures that simplify data analysis tasks |
| [csv][6] | Easily read and write CSV files |
| [bz2][7] | Supports compressing and decompressing data by using the bzip2 compression algorithm |
| [scikit-learn][8] | Provides machine learning tools |

In addition to these listed packages, many other packages exist. The
official repository for public Python packages is PyPI, the [Python
Package Index][pypi], as shown below. These libraries can generally be
installed with the Python package management tool known as [pip][pip].
When more than one Python version is installed, the `pip3` command is
typically the same `pip` tool, configured to install packages for
Python 3.

-----

[1]: http://www.numpy.org
[2]: http://www.scipy.org/scipylib/index.html
[3]: http://matplotlib.org
[4]: http://web.stanford.edu/~mwaskom/software/seaborn/index.html
[5]: http://pandas.pydata.org
[6]: https://docs.python.org/3/library/csv.html
[7]: https://docs.python.org/3/library/bz2.html
[8]: http://scikit-learn.org/stable/index.html
[pypi]: https://pypi.python.org/pypi
[pip]: https://python-packaging-user-guide.readthedocs.org/en/latest/current.html

In [5]:
from IPython.display import HTML
HTML('<iframe src=https://pypi.python.org/pypi width=800 height=400></iframe>')

----


A caveat, however, to blindly using libraries from PyPI or any other
distribution mechanism is that while a particular library may simplify
the development of a Python program, this same library may conversely
complicate the distribution and maintenance of a Python program by
introducing extra dependencies that are possibly out of the control of
the developer. Thus, the benefits and risks of using any Python package
should be judiciously evaluated before its adoption. The Python packages
listed previously, as well as other community-standard Python packages,
are generally safe to adopt as they are well supported and widely
available.

The maintenance problem is usually not the result of the Python package
itself, but with its dependencies. As an example, the popular
[scipy](http://scipy.org) package requires external C and Fortran
libraries that provide the actual implementation of basic linear algebra
and special mathematical functions. To acquire these libraries for any
given operating system and hardware platform can be difficult and might
require compiling the original sources, further increasing any dependency
issues that are not handled by `pip`.

While ongoing efforts exist in the community to provide a solution to
these dependency issues, the current recommended approach is to use the
[Anaconda Python][AP] distribution from Continuum Analytics. Anaconda is
freely available, and provides a complete Python installation along with
a number of the more popular Python packages, available for most
operating systems. The Anaconda website is shown below.

-----
[AP]: https://store.continuum.io/cshop/anaconda/

In [6]:
HTML('<iframe src=https://store.continuum.io/cshop/anaconda/ width=800 height=400></iframe>')


In [9]:
# Now we read from a CSV file using the CSV package

import csv

# Here is our formatted print string
fString = "Flight {0} departed from Baltimore on {1}/{2}/{3}"

# Now loop through the file. The encoding is latin-1, failure to 
# specify this encoding will cause problems

with open('temp.csv', 'r', encoding = 'latin-1') as csvfile:
    for row in csv.reader(csvfile):
        if row[16] == 'BWI':
            print(fString.format(row[9], row[1], row[2], row[0]))

Flight 375 departed from Baltimore on 1/17/2001
Flight 375 departed from Baltimore on 1/18/2001
Flight 375 departed from Baltimore on 1/19/2001
Flight 375 departed from Baltimore on 1/20/2001
Flight 375 departed from Baltimore on 1/21/2001
Flight 375 departed from Baltimore on 1/22/2001
Flight 375 departed from Baltimore on 1/23/2001
Flight 375 departed from Baltimore on 1/24/2001
Flight 375 departed from Baltimore on 1/25/2001
Flight 375 departed from Baltimore on 1/26/2001
Flight 375 departed from Baltimore on 1/27/2001
Flight 375 departed from Baltimore on 1/28/2001
Flight 375 departed from Baltimore on 1/29/2001
Flight 375 departed from Baltimore on 1/30/2001
Flight 375 departed from Baltimore on 1/31/2001
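
One reason to prefer the `csv` package over splitting each line on
commas is that it handles quoted fields that themselves contain commas.
The following minimal sketch uses a made-up two-field sample (an
assumption for illustration, not a row from the airline data) and an
in-memory `io.StringIO` object in place of a real file.

```python
import csv
import io

# Hypothetical two-field sample; the second field contains a comma.
sample = 'BWI,"Baltimore, MD"\n'

# Naive splitting breaks the quoted field into two pieces.
naive = sample.strip().split(',')

# csv.reader respects the quotes and returns exactly two fields.
parsed = next(csv.reader(io.StringIO(sample)))

print(naive)
print(parsed)
```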


-----
## Functions

The Python language supports the creation and application of
user-defined functions, which can simplify program development by
promoting code reuse (in a similar manner as importing packages
developed by others, but in this case you are both the code provider and
the consumer). In Python, a function is actually an object, like
everything else, and can thus be created and passed around dynamically
like any other value.

A Python function is defined by using the `def` keyword, followed by the
function name. After the function name is a set of matching parentheses
that enclose the arguments to the function. A colon character follows
the closing parenthesis and signifies the start of the code block that
provides the function implementation, which is known as the function
body. As a simple example, the following function takes no arguments and
simply prints a standard message:

```
def hello():
    print("Hello World!")
```

This function is called in a Python program by simply using its name
followed by the parentheses, `hello()`, which will print out `Hello
World!` to the display.

### Doc Strings

A standard practice is to employ a docstring comment immediately after
the function definition line to provide documentation for the function.
The Python interpreter will by default use this docstring as the
official function documentation, which typically is accessed by using
the built-in `help()` function. This is demonstrated in the following
two code blocks, where we define and call this function, and
subsequently access the documentation.

-----

In [10]:
def hello():
    """Display a welcome message to the user."""
    print("Hello World!")

hello()

Hello World!


In [11]:
help(hello)

Help on function hello in module __main__:

hello()
    Display a welcome message to the user.



-----

### Function Arguments

A function can accept zero or more arguments by simply listing the
argument names between the parentheses. These argument names are the
names you use to access the values contained in these arguments within
the function body. For example, we can modify the original `hello`
function to take a `name` argument that is used when printing out a
welcome message:

```
def hello(name):
    """Display a welcome message to the user."""
    print("Hello {0}".format(name))
```

When called with the argument `'Alexander'`, this function will print
`Hello Alexander`. Functions can take multiple arguments, which are
simply separated by commas, and they can take data structures, like
lists or tuples, as arguments.

```
def hellolist(names, text):
    """
    Display a welcome message to the user(s) listed in the names, 
    which is assumed to be a list. 
    
    """

    for name in names:
        print("Hello {0}, {1}".format(name, text))
```

If this function is called as 
`hellolist(['Alex', 'Joe', 'Jane'], "welcome to class.")`, 
the following output is displayed:

```
Hello Alex, welcome to class.
Hello Joe, welcome to class.
Hello Jane, welcome to class.
```

In some cases, a function accepts one or more arguments that often have
_default_ values. Python supports default arguments that enable a Python
programmer to specify a default value for an argument, which can be
overridden if the user supplies a specific value. A default argument is
specified by simply including an equal sign and the default value after
a specific argument, like `text='welcome to class.'`. With this
default argument for the `hellolist` function, we could leave off the
second argument if desired.

Default arguments are often used in functions that are part of large
packages, like `numpy` or `matplotlib`, to simplify their use. New users
can quickly call the functions, while advanced users can achieve more
control of the function by specifying additional arguments explicitly.
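
A minimal sketch of a default argument, based on the `hellolist`
function above, might look like the following. As an assumption made so
the result is easy to inspect, this variant returns the messages as a
list rather than printing them.

```python
def hellolist(names, text='welcome to class.'):
    """Return one greeting per name, using a default message."""
    return ["Hello {0}, {1}".format(name, text) for name in names]

# The second argument is omitted, so the default text is used.
default_msgs = hellolist(['Alex', 'Joe'])

# The second argument is supplied, so it overrides the default.
custom_msgs = hellolist(['Jane'], 'see you next week.')

print(default_msgs)
print(custom_msgs)
```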

### Keyword Arguments

One last aspect of function arguments is that when a function is called,
the argument names listed in the function definition can be explicitly
specified, along with the values they should take when the function is
called. This type of function call is said to be using _keyword
arguments_. When using keyword arguments, the order of the arguments
listed in the function call is arbitrary and does not need to explicitly
match the argument order listed in the function definition. For example,
we could call the `hellolist` function as follows:

```
hellolist(text='welcome to class.', names=['Alex', 'Joe', 'Jane'])
```
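
To illustrate that keyword arguments make the call order irrelevant,
the following sketch calls a small function positionally, with keywords
in a different order, and with a mix of both. The `describe` function
and its arguments are hypothetical names introduced only for this
example.

```python
def describe(name, company, year):
    """Return a short description built from the three arguments."""
    return "{0} ({1}, {2})".format(name, company, year)

# Positional call: values are matched to arguments by position.
positional = describe('Alex', 'UI', 2015)

# Keyword call: the order of the arguments no longer matters.
keywords = describe(year=2015, name='Alex', company='UI')

# Mixed call: positional arguments must come before keyword arguments.
mixed = describe('Alex', year=2015, company='UI')

print(positional, keywords, mixed)
```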

### Returning Values

A function can return values by using the `return` keyword. Single
values are returned as the type of the value, while multiple values are
returned as a tuple of values. These two cases are demonstrated in the
following sample code:

```
def hello2():
    return "Hello World!"

def hello3(name):
    return "Hello, ", name
```

Calling the first function in this sample code as `msg = hello2()` will
assign the string `Hello World!` to the `msg` variable, while calling
the second of these functions as `msg = hello3(name)` will assign the
tuple `('Hello, ', name)` to the `msg` variable. Note that for the
`hello3` function, the argument is not required, as written, to be a
string; therefore, the return tuple will have a string as the first
element and the second element will be the same type as the argument
`name`.

Note, when returning multiple values as a tuple, we can either enclose
the values in parentheses, or simply separate them with commas as shown
in the example.
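
A returned tuple can be kept whole or unpacked directly into separate
variables, as in the following minimal sketch. The `min_max` function is
a hypothetical example, not part of the lesson's code.

```python
def min_max(values):
    """Return the smallest and largest value as a tuple."""
    return min(values), max(values)

# Keep the result as a single tuple ...
result = min_max([3, 1, 4, 1, 5])

# ... or unpack it into two variables in one assignment.
lo, hi = min_max([3, 1, 4, 1, 5])

print(result, lo, hi)
```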

### Lambda Functions

Python also supports the ability to create unnamed functions, which are
short functions that are defined and used in place. An unnamed function
in Python is called a `lambda` function, and is defined by using the
`lambda` keyword. Lambda functions are often used in comprehensions or
in function calls, when an argument expects a function. For example, we
can create a `lambda` function that takes multiple arguments and assign
it to a variable for later invocation:

```
f = lambda x, y: x**2 + y**3

print(f(3, 4))  # prints 73
```

We can also use a lambda function within a list comprehension, either
through a previously defined function, like the one in the previous
example, or explicitly with an in-place `lambda` function. In either
case, we can use the `map` function to apply the function to a range of
input arguments.

```
[x for x in map(lambda x: (3*x**2 + 2*x + 4), range(-3,4))]
```
This will create the following list:

    [25, 12, 5, 4, 9, 20, 37]
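
The same list can be produced in more than one way, which the following
sketch compares. It also shows a lambda passed as the `key` argument of
the built-in `sorted` function, a common case of an argument that
expects a function. The word list is an assumption used only for
illustration.

```python
# A lambda assigned to a name, then applied with map.
poly = lambda x: 3*x**2 + 2*x + 4
via_map = list(map(poly, range(-3, 4)))

# The equivalent list comprehension without map or lambda.
via_comprehension = [3*x**2 + 2*x + 4 for x in range(-3, 4)]

# A lambda used in place as the sort key.
words = ['pandas', 'csv', 'numpy']
by_length = sorted(words, key=lambda w: len(w))

print(via_map)
print(by_length)
```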

-----

## Writing a Python program

While the majority of the Python code we will write in this course will
be done within an IPython Notebook, you also can write Python programs
that run at the Unix command line. This can be useful when a command
needs to be repeatedly run, or when you want to share just a Python
script or program with others. While you can write a Python program by
using a development environment like [Spyder][sp], [emacs][em], or even
[vim][vim], we will write a Python program within an IPython Notebook,
write the program to a file, and run this file as a Unix command.

First, we will use the `%%writefile` cell magic to write the contents of
a cell to a file. The first two lines of a Python program that is
designed to run at the Unix command prompt contain special data. The
first line indicates how the program should be run, and by default, for
a Python3 program, has the following form:

    #!/usr/bin/env python3

The second line, which is optional, specifies the character encoding
used within the file, which enables your Python program to contain
Unicode characters. In this case, you most likely would use the
following character encoding, which is the default in Python 3:

    # -*- coding: UTF-8 -*-

After these two lines, you simply write legal Python statements as you
would in an IPython Notebook code cell. We demonstrate this in the
subsequent code block, where we create a simple Python "Hello World!"
program.

-----

[sp]: https://code.google.com/p/spyderlib/
[em]: https://www.gnu.org/software/emacs/
[vim]: http://www.vim.org

In [12]:
%%writefile test.py
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

print("Hello World!")

Writing test.py


In [13]:
!cat test.py

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

print("Hello World!")

-----

In the previous two code blocks, we first write out a simple Python
program to the `test.py` file and then use the Unix `cat` command to
display the contents of the file. In order to execute this program,
however, we need to change the permission of the new file to enable
execution. If you recall, we do this by calling the `chmod` command to
allow the user to execute the program with `u+x`. After this, you can
run the program by specifying the path to the file, which we can
shortcut with `./test.py` to specify that the file is located in the
current directory. Alternatively, we can specify the full path, for
example, on the course JupyterHub server this would likely be
`/home/data_scientist/rp-pdss15/notebooks/test.py`. When in doubt,
simply execute a `pwd` command and use the resulting full path info when
running the program.
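
For readers curious about what `chmod u+x` changes, the following
minimal sketch performs the same permission change from Python with the
standard `os` and `stat` modules. It assumes a Unix-like system and
writes a throwaway script to a temporary directory; the path is
hypothetical and unrelated to the course server.

```python
import os
import stat
import tempfile

# Write a small throwaway script to a temporary directory.
script = os.path.join(tempfile.mkdtemp(), 'test.py')
with open(script, 'w') as fout:
    fout.write('#!/usr/bin/env python3\n')
    fout.write('print("Hello World!")\n')

# Add the "user may execute" bit to the existing permissions,
# which is what `chmod u+x` does at the command line.
mode = os.stat(script).st_mode
os.chmod(script, mode | stat.S_IXUSR)

# Confirm that the execute bit for the owner is now set.
is_executable = bool(os.stat(script).st_mode & stat.S_IXUSR)
print(is_executable)
```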

-----

In [18]:
!chmod u+x test.py

In [19]:
!./test.py

Hello World!


In [17]:
# We assume the current directory is the course notebook folder in our JupyterHub server

!/home/data_scientist/notebooks/test.py

Hello World!


-----
## Writing a Module

We can also develop our own Python modules, which can have their own
variables, functions, and classes, to promote code reuse and thereby
minimize repetition and possible errors. In the following code block, we
create a new module, which is simply a Python program file that is
designed to be imported into other Python programs rather than executed
directly. In this new file, we include a module docstring, a module
variable, a module function, and a module class, in order to demonstrate
how each of these can be created inside a module, and subsequently
accessed and used by other Python programs.

-----

In [20]:
%%writefile rppds.py

# RP Practical Data Science Test Module
# We include a module docstring
"""
This is a Test Module for the Research Park 2015 Practical Data Science course.
This module contains some variables, functions, and a simple class solely for 
demonstration purposes.
"""

# module specific variables: 

year = 2015
location = "EnterpriseWorks"

# Module Functions
# We include a docstring for the function

def rp(name):
    """
    This function welcomes a student to the course. The name of the student
    is required and is included in the welcome message.
    """
    
    fmt = "Welcome {0} to the RP Data Science course"
    return (fmt.format(name))

# Module Classes
# We include a docstring for the class.
class student:
    """
    This class represents a student in the RP Practical Data Science course.
    """
    
    def __init__(self, name, company = "UI"):
        """
        Create and initialize a new student object.
        """
        self.name = name
        self.company = company
        
    def welcome(self):
        """
        Create and return a class welcome message for this student.
        """
        return rp(self.name)

Writing rppds.py


-----

With the module file written, we can simply import the new module into a
Python program, or in this case, an IPython Notebook code cell. Note
that executing an `import` statement a second time does not reload a
module that has already been imported. So if you make changes to the
module cell block and want to import the new version, you can call
`importlib.reload` on the module, restart the IPython kernel, or change
the name of the file and the subsequent import.

One technique to simplify this process during development is to use the
`import xyz as abc` form, which insulates the rest of the code from the
actual module filename. If you change the module code above, simply
change the name of the file that is written, for example to rppds2.py,
and the following `import` statement should be changed to 

    import rppds2 as rp

and the rest of the program will work seamlessly.
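
As an alternative to renaming the file, the Standard Library function
`importlib.reload` re-executes a module that has already been imported.
The following is a minimal sketch under stated assumptions: the module
name `reload_demo` and its temporary location are hypothetical, and
bytecode caching is disabled so that the edited file is always re-read
in this short demonstration.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical module name and location, used only for this sketch.
module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, 'reload_demo.py')
sys.path.insert(0, module_dir)
sys.dont_write_bytecode = True  # avoid a stale cached .pyc in this demo

# Write the first version of the module and import it.
with open(module_path, 'w') as fout:
    fout.write('location = "EnterpriseWorks"\n')

import reload_demo
before = reload_demo.location

# Edit the module file on disk.
with open(module_path, 'w') as fout:
    fout.write('location = "Research Park"\n')

# A second `import reload_demo` would return the already loaded module;
# importlib.reload re-executes the file and picks up the change.
importlib.invalidate_caches()
reload_demo = importlib.reload(reload_demo)
after = reload_demo.location

print(before, after)
```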

-----

In [21]:
import rppds as rp

test = rp.student("Alexander")

print(test.welcome())

Welcome Alexander to the RP Data Science course


-----

When we created our new module, we created document strings for the
module itself as well as for the new class and any defined functions.
This information can be displayed interactively in the IPython Notebook,
for example, by entering `rp.` and hitting tab in a code cell; via the
IPython Notebook help display, for example, by entering `rp?` and
executing the code cell; or, most easily, by using the built-in Python
`help` function and passing in the module name, which in this case is
our abbreviation `rp`.
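
The text that `help` displays is stored on the object itself and can
also be read programmatically. The following minimal sketch defines a
stand-in function (a simplified, hypothetical version of the module's
`rp` function) and reads its docstring both through the `__doc__`
attribute and through `inspect.getdoc`, which strips the indentation.

```python
import inspect

def rp(name):
    """
    Welcome the named student to the course.
    """
    return "Welcome {0} to the RP Data Science course".format(name)

# The raw docstring keeps the surrounding newlines and indentation.
raw = rp.__doc__

# inspect.getdoc returns a cleaned-up version of the same text.
clean = inspect.getdoc(rp)

print(repr(raw))
print(clean)
```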

-----

In [22]:
help(rp)

Help on module rppds:

NAME
    rppds

DESCRIPTION
    This is a Test Module for the Research Park 2015 Practical Data Science course.
    This module contains some variables, functions, and a simple class solely for 
    demonstration purposes.

CLASSES
    builtins.object
        student
    
    class student(builtins.object)
     |  This class represents a student in the RP Practical Data Science course.
     |  
     |  Methods defined here:
     |  
     |  __init__(self, name, company='UI')
     |      Create and initialize a new student object.
     |  
     |  welcome(self)
     |      Create and return a class welcome message for this student.
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)

FUNCTIONS
    rp(name)
        Thi

In [20]:
# Execute this code to see the IPython Notebook documentation
rp?

-----
## Additional Python Concepts

In this course, we have insufficient time to fully explore the Python
language. We have, however, covered the majority of the language,
especially those aspects that are relevant to using Python for data
science applications. For those interested in a more complete mastery of
the Python language, the following topics are recommended for further
study.

- [Object-Oriented Programming](https://docs.python.org/3/tutorial/classes.html)
- [Exception Handling](https://docs.python.org/3/library/exceptions.html)
- [Regular Expressions](https://docs.python.org/3/library/re.html)

-----

### Additional References

1. [Dive into Python3](http://getpython3.com/diveintopython3/)

-----

### Return to the [Course Index](index.ipynb).

-----