# Using Python for Data

## Useful Packages
+ [`numpy`](http://www.numpy.org/): Adds ability to deal with multi-dimensional arrays and vectorized math functions
+ [`scipy`](http://www.scipy.org/): Extends `numpy` by adding common scientific functions such as ODE integration, statistical analysis, linear algebra, and FFT
+ [`matplotlib`](http://matplotlib.org/): A useful plotting package
+ [`astroML`](http://www.astroml.org/): Common statistical analysis and machine learning tools used in astronomy

## Installing python
The easiest way to install python on any OS is to use [anaconda python](https://www.continuum.io/downloads).  This will install a local version of python on your system so you don't need admin privileges to install new packages.  Most of the packages listed above are installed by default with anaconda.  For this class we will be using python 3, and I recommend you use this version for your research (unless you have a very good reason to use python 2).  In these notes I have marked where the syntax or behavior has changed between python 2 and 3.

## Text editors
Although there are numerous IDEs (e.g. IDLE, Spyder) for python, for most everyday use you will likely be writing python code in a text editor and running your programs via the command line.  In this case it is important to have a good text editor that supports syntax highlighting and possibly live linting (syntax and style checking).  I use the [atom](https://atom.io/) text editor, a 'hackable' text editor that offers a large range of add-ons to support your coding style.  If you decide to use atom you will want the following add-ons: `language-python`, `linter`, and `linter-python`, along with the python packages `pylama` and `pylama-pylint`.  As a bonus the atom editor has full support for `git` and GitHub.

## Coding style
When working on code with others, it is helpful to define a coding style for a project.  That way the code is written in a predictable way and it is easy to read.  Many projects use [PEP 8](https://www.python.org/dev/peps/pep-0008/) as a starting point for a style.

## Basic syntax examples
For a general overview of python's syntax head over to [codecademy](https://www.codecademy.com/learn/python) and take their interactive tutorial.  This class will highlight some of the more important things.

### importing packages
Any package or code from another `.py` file can be imported with a simple `import` statement.  By default all imported code has its own name space, so you don't have to worry about overwriting existing functions.  The final line of this code block is a "magic" `Jupyter` function needed to make interactive plots inside of `Jupyter notebooks`.

In [0]:
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
%matplotlib notebook

### math

Basic math operations work (mostly) as expected:

In [0]:
# addition
print(1 + 1)

# subtraction
print(1 - 1)

# multiplication
print(3 * 4)

# division
print(5 / 4)

# integer division
print(5 // 4)

# exponents
print(2**5)

# modulo
print(5 % 2)

**Note:** In python 2 division defaults to integer division if both values are integers!  This was an easy error to make (and difficult to debug/notice) so the default was changed in python 3: `/` always performs true division, and the `//` operator is used for integer (floor) division.
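The difference between the two division operators is easiest to see side by side.  This is a minimal sketch for python 3 (the variable names are only for illustration); note that `//` rounds toward negative infinity, which matters for negative numbers.

```python
# true division always returns a float in python 3
true_div = 7 / 2

# floor division rounds toward negative infinity
floor_pos = 7 // 2
floor_neg = -7 // 2

# `divmod` returns the floor quotient and the remainder together
quotient, remainder = divmod(7, 2)

print(true_div, floor_pos, floor_neg)
print(quotient, remainder)
```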

### data containers
Data inside of python can be stored in several different types of containers.  The most basic ones are:

+ `list`: an indexed data structure that can hold any objects as an element
+ `tuple`: same as a `list` except the data is immutable
+ `dictionary`: objects stored as a `{key: value}` set (note: any hashable object can be used as a key, including a tuple whose elements are all hashable)

In [0]:
# a list
example_list = [1, 2, 3]

# a tuple
example_tuple = (1, 2, 3)

# a dictionary
example_dict = {'key1': 1, 'key2': 2, ('key', 3): 3}

Elements in these objects can be accessed using a zero-based index (`list` and `tuple`) or key (`dict`).

In [0]:
print(example_list[0], example_list[-1])
print(example_tuple[1])
print(example_dict['key1'], example_dict[('key', 3)])

Each of these objects has various methods that can be called on it.  To learn what methods can be called you can look at the python documentation (e.g. https://docs.python.org/3/tutorial/datastructures.html) or you can inspect the object directly and use python's `dir` and `help` functions to get the methods and doc string.

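**Note:** Names that start with a single `_` are private by convention and are not designed to be called directly on the object.  Names wrapped in double underscores (e.g. `__len__`) are special methods that python calls for you when you use built-in syntax or functions such as `len()`.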

In [0]:
# print the names for all the methods of a list
print(dir(example_list))

print('=========')

# print the help text for the `pop` method
help(example_list.pop)
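The output of `dir` is long because it includes all of the underscore names.  As a small sketch (the name `public_methods` is just for illustration), you can filter the result down to the methods you are meant to call directly:

```python
example_list = [1, 2, 3]

# keep only the names that do not start with an underscore
public_methods = []
for name in dir(example_list):
    if not name.startswith('_'):
        public_methods.append(name)

print(public_methods)
```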

### Slicing lists
Many times it is useful to slice and manipulate lists.  The format for slicing a list is: `list[start_index:end_index:step_size]`

**Note:** `end_index` is not inclusive.

In [0]:
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# print the full list
print(a)

# print the first 3 elements
print(a[:3])

# print the middle 4 elements
print(a[3:7])

# print the last 3 elements
print(a[7:])

# you can also use neg index
print(a[-3:])

# print only even index
print(a[::2])

# print only odd index
print(a[1::2])

# print the reverse list
print(a[::-1])
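Two details of slicing are worth seeing in action: a slice of a list is a new list, and slice bounds that run past the end are quietly clipped.  A short sketch, reusing the list `a` from above:

```python
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# a slice returns a new list, so changing it leaves `a` untouched
first_three = a[:3]
first_three[0] = 100
print(first_three)
print(a)

# slice bounds past the end of the list are clipped instead of raising an error
clipped = a[8:20]
print(clipped)

# a single out-of-range index does raise an IndexError
try:
    a[20]
except IndexError as error:
    print('IndexError:', error)
```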

### Looping over `list`s and `dict`s
There are several ways to loop over a `list` or `dict` depending on what values you want access to.

**Note:** Two of the print statements in this example use string formatting. `'{0} {2} {1}'.format(a, b, c)` will replace `{0}` with `a` (the 0th argument of the format function), `{1}` with `b`, and `{2}` with `c`.

In [0]:
# loop over values in a list
for i in example_list:
    print(i)
print('=========')

# loop over values in a list with index
for idx, i in enumerate(example_list):
    print('{0}: {1}'.format(idx, i))
print('=========')

# loop over keys in dict
for i in example_dict:
    print(i)
print('=========')

# loop over values in dict
for i in example_dict.values():
    print(i)
print('=========')

# loop over keys and values in dict
for key, value in example_dict.items():
    print('{0}: {1}'.format(key, value))

**Note**: In python 2 the final loop would have been over `example_dict.iteritems()`.
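The string formatting used in the loops above has a few more options that come up often.  This is a brief sketch with made-up values; the f-string form at the end needs python 3.6 or newer.

```python
a, b, c = 'first', 'second', 'third'

# positional fields can be used in any order
reordered = '{0} {2} {1}'.format(a, b, c)
print(reordered)

# fields can also be named, and a format spec can follow the colon
named = '{name}: {value:.2f}'.format(name='pi', value=3.14159)
print(named)

# an f-string evaluates the expression inside the braces directly
value = 3.14159
f_string = f'pi: {value:.2f}'
print(f_string)
```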

### list/dict comprehension
If you need to make a `list` or `dict` as the result of a loop you can use comprehension. 

**Note:** Comprehension is usually faster than the equivalent loop because the interpreter builds the result with specialized bytecode and avoids looking up and calling `append` on every iteration.

In [0]:
# slower method
list_loop = []
dict_loop = {}
for i in a:
    list_loop.append(i**2)
    dict_loop['key{0}'.format(i)] = i

print(list_loop)
print(dict_loop)

In [0]:
# faster method
list_comp = [i**2 for i in a]
dict_comp = {'key{0}'.format(i): i for i in a}
print(list_comp)
print(dict_comp)
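A comprehension can also filter as it goes by adding an `if` clause at the end.  A small sketch, reusing the list `a` (the other names are only for illustration):

```python
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# keep only the even values, then square them
even_squares = [i**2 for i in a if i % 2 == 0]
print(even_squares)

# the same idea works for a dict
odd_squares = {i: i**2 for i in a if i % 2 == 1}
print(odd_squares)

# the list version written as a normal loop, for comparison
even_squares_loop = []
for i in a:
    if i % 2 == 0:
        even_squares_loop.append(i**2)
```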

## Writing reusable code
It is always best to keep your code DRY (don't repeat yourself).  If you find yourself writing the same block of code more than 2 times you should think about extracting it to a function.  If you need to create a custom object that has its own methods assigned to it you should create a custom class.

### functions

In [0]:
def cube(x):
    result = x ** 3
    return result

print(cube(3))

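**Note:** In python, functions use a local name space, so don't worry about reusing variable names.  The function looks in the global name space only if a variable is not found in the local one.  Reassigning an argument inside a function (e.g. `x = x + 1`) only rebinds the local name, so the caller's variable is unchanged.  Modifying a mutable argument in place (e.g. assigning to `x[0]` for a list) changes the object itself, so the change is visible outside the function.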

In [0]:
# numbers passed into a function are immutable
def alpha(x):
    x = x + 1
    return x

x = 1
print(alpha(x))
print(x)

print('=======')

# lists passed into a function are not immutable!
def beta(x):
    x[0] = x[0] + 1
    return x

x = [1]
print(beta(x))
print(x)
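If you do not want a function to change the list that was passed in, one option is to copy it first.  This is a minimal sketch (`beta_copy` is a hypothetical variant of `beta` above):

```python
def beta_copy(x):
    # copy the list first so the caller's list is left unchanged
    x = list(x)
    x[0] = x[0] + 1
    return x

original = [1]
result = beta_copy(original)
print(result)
print(original)
```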

### classes
Classes are useful when you will have multiple instances of an object type:

In [0]:
class Shape:
    # the `__init__` method is run when an instance of the class is initialized
    def __init__(self, x, y, cx=0.0, cy=0.0):
        self.name = 'rectangle'
        self.x = x
        self.y = y
        self.cx = cx
        self.cy = cy

    def area(self):
        return self.x * self.y

    def move(self, dx, dy):
        self.cx += dx
        self.cy += dy

    def get_position(self):
        return '[x: {0}, y: {1}]'.format(self.cx, self.cy)


# make a sub-class of Shape
class Square(Shape):
    # This will override the `__init__` method of the super-class
    def __init__(self, x, cx=0.0, cy=0.0):
        self.name = 'square'
        self.x = x
        self.y = x
        self.cx = cx
        self.cy = cy
    # all methods that are not overridden are inherited from the super-class

# make another sub-class of Shape
class Circle(Shape):
    # This will override the `__init__` method of the super-class
    def __init__(self, r, cx=0.0, cy=0.0):
        self.name = 'circle'
        self.r = r
        self.cx = cx
        self.cy = cy

    # This will override the `area` method of the super-class
    # The block quote at the top of the function will be returned when `help` is called
    def area(self):
        '''Return the area of the circle'''
        return np.pi * self.r**2

# Make some instance of each class
shape_list = [Shape(1, 2), Square(3), Circle(5)]
for sdx, s in enumerate(shape_list):
    # move each instance a different amount
    s.move(sdx, sdx)
    # print the results of different method calls
    print('{0} area: {1}, position: {2}'.format(s.name, s.area(), s.get_position()))
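The `Square` class above repeats most of the assignments from `Shape.__init__`.  One way to keep this DRY is to call the super-class initializer with `super()`.  The sketch below uses a trimmed-down copy of `Shape` so that it runs on its own:

```python
class Shape:
    def __init__(self, x, y, cx=0.0, cy=0.0):
        self.name = 'rectangle'
        self.x = x
        self.y = y
        self.cx = cx
        self.cy = cy

    def area(self):
        return self.x * self.y


class Square(Shape):
    def __init__(self, x, cx=0.0, cy=0.0):
        # let the super-class set the shared attributes
        super().__init__(x, x, cx=cx, cy=cy)
        # then change only what is different
        self.name = 'square'


square = Square(3)
print(square.name, square.area())

# an instance of a sub-class is also an instance of the super-class
print(isinstance(square, Shape))
```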


As demonstrated before, you can show all the methods available to a class by using the `dir` function.  If a docstring is defined (a triple-quoted string on the first line of a function) it will be displayed if `help` is called on the function.

In [0]:
print(dir(Circle))
print('=========')
# `help` prints its output directly, so it does not need to be wrapped in `print`
help(shape_list[2].area)

### `if __name__ == '__main__':`
Sometimes you want a file to run a bit of code when called directly from the command line, but not run that code if it is imported into another file.  This can be done by checking the value of the global variable `__name__`: when a file is run directly `__name__` will be `'__main__'`, when it is imported it will not.

In [0]:
if __name__ == '__main__':
    # code that is only run when this file is directly called from the command line
    # This is a good place to put example code for the functions and classes defined in the file
    print('An example')

### `with` blocks
When working with objects that have `__enter__` and `__exit__` methods defined (most commonly the file object returned by the `open` function), you can use a `with` block to automatically call `__enter__` at the start and `__exit__` at the end, even if an error is raised inside the block.  A typical use case is automatically closing files after you are done reading/writing data:

In [0]:
with open('data.csv', 'r') as file:
    print(file.readline())
    
print('=======')

# This line should fail with a `ValueError` since the file is automatically closed by the `with` block
print(file.readline())
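To see when `__enter__` and `__exit__` are called, you can write a small class of your own.  This is only a sketch: `Resource` is a made-up class that records each step in a list.

```python
class Resource:
    '''A hypothetical object used only to show when each method is called'''
    def __init__(self):
        self.log = []

    def __enter__(self):
        self.log.append('enter')
        # the returned value is bound to the name after `as`
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.log.append('exit')
        # returning False lets any exception continue to propagate
        return False


with Resource() as resource:
    resource.log.append('inside')

print(resource.log)
```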

## NumPy
NumPy extends Python to provide n-dimensional arrays along with a wealth of statistical and mathematical functions.

In [0]:
# creating a 2D array
b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(b)

There are several ways to create arrays of a given size:

In [0]:
# a 3D array of zeros
zero = np.zeros((2, 2, 3))
print(zero)

print('========')

# a 2D array of ones
one = np.ones((2, 4))
print(one)

print('========')

# a 2D empty array
empty = np.empty((3, 3))
print(empty)

**Note:** `np.empty` does not initialize the array, so it contains whatever happened to be in that bit of memory earlier!
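A few other constructors are handy when you need something other than zeros or ones.  A short sketch (the variable names are only for illustration):

```python
import numpy as np

# an array filled with a chosen value
sevens = np.full((2, 3), 7.0)
print(sevens)

# evenly spaced integers (the end value is not included)
steps = np.arange(0, 10, 2)
print(steps)

# a fixed number of evenly spaced points (the end value is included)
points = np.linspace(0.0, 1.0, 5)
print(points)

# zeros with the same shape and dtype as an existing array
like = np.zeros_like(sevens)
print(like.shape, like.dtype)
```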

### Basic operations
Arrays typically act element by element, or broadcast the operation across arrays of different shapes in "obvious" ways:
![Array broadcasting](http://www.astroml.org/_images/fig_broadcast_visual_1.png)
-image ref: http://www.astroML.org

In [0]:
print(b)
print('========')

# element-wise addition
print(b + b)
print('========')

# multiply all elements by 3
print(3 * b)
print('========')

# row-wise addition
d = np.array([1, 2, 3])
print(d)
print(b + d)
print('========')

# column-wise addition
e = np.array([[1], [2], [3]])
print(e)
print(b + e)
print('========')

# outer addition
print(d + e)
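It helps to look at the shapes involved when arrays are combined this way.  The sketch below rebuilds `b` and `d` so it runs on its own, and also shows what happens when the shapes cannot be matched:

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
d = np.array([1, 2, 3])

# `d` has shape (3,) and is treated as a (1, 3) row repeated down the rows
print(b.shape, d.shape, (b + d).shape)

# adding a new axis turns `d` into a (3, 1) column
column = d[:, np.newaxis]
print(column.shape)

# (3, 1) + (3,) broadcasts to (3, 3)
outer = column + d
print(outer)

# shapes that cannot be matched raise a ValueError
try:
    b + np.array([1, 2])
except ValueError as error:
    print('ValueError:', error)
```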

### Methods
Arrays also have methods such as `sum()`, `min()`, `max()` and these also take axis arguments to operate just over one index.

In [0]:
# sum of all elements
print(b.sum())

# sum down each column (collapses axis 0)
print(b.sum(axis=0))

# sum across each row (collapses axis 1)
print(b.sum(axis=1))
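The axis argument pairs well with the broadcasting rules from the previous section.  A small sketch, assuming the same array `b` (the other names are only for illustration):

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# mean of each column (one value per column)
column_mean = b.mean(axis=0)
print(column_mean)

# subtracting it broadcasts the means across every row
centered = b - column_mean
print(centered)

# keepdims=True keeps the reduced axis with length 1,
# so a per-row result broadcasts back against the rows
row_max = b.max(axis=1, keepdims=True)
print(row_max.shape)
scaled = b / row_max
print(scaled)
```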

### Slices
Works the same as lists, just provide a slice for each dimension:

In [0]:
print(b[0, 0:2])
print('=======')

print(b[:, 0:2])
print('=======')

print(b[0:2, 2:])

### Iterating
When using an array as an iterator it will loop over the first index of the array (e.g. for a 2d array it loops row-by-row).  Loop over the resulting object to loop over the second index, etc...

In [0]:
for row in b:
    print(row)
    print('-------')
    for col in row:
        print(col)
    print('=======')

### Masking arrays
Many times you want to find the values in an array that pass a particular condition (e.g. `B-V < 0.3`).  This can be done with array masks:

In [0]:
mask = b >= 5
print(mask)
print(b[mask])

You can also combine multiple masks with the _bitwise_ operators (`&`, `|`, `~`, `^`):

In [0]:
mask2 = b <= 7
print(mask2)

# and
print(b[mask & mask2])

# or
print(b[mask | mask2])

# xor
print(b[mask ^ mask2])

# not (combined with or): keep elements where `mask` is False or `mask2` is True
print(b[~mask | mask2])
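When you write the comparisons inline instead of storing each mask first, remember the parentheses.  The bitwise operators bind more tightly than the comparisons, so leaving them out changes the meaning.  A short sketch with the same array `b`:

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# wrap each comparison in parentheses before combining
in_range = (b >= 5) & (b <= 7)
print(b[in_range])

# without parentheses python evaluates `5 & b` first and then chains
# the comparisons, which fails for arrays
try:
    b[b >= 5 & b <= 7]
except ValueError as error:
    print('ValueError:', error)
```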

You can also create masks based on parts of an array (e.g. the first column) and apply it to other parts of the array (e.g. the second column):

In [0]:
# mask of the first column only
mask3 = b[:, 0] <= 4
print(mask3)

# apply that mask to each of the columns
print(b[:, 0][mask3])
print(b[:, 1][mask3])
print(b[:, 2][mask3])
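Since `mask3` has one entry per row, it can also be used to index the full array directly.  A minimal sketch that rebuilds `b` and `mask3` so it runs on its own:

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
mask3 = b[:, 0] <= 4

# a 1D mask with one entry per row selects whole rows
selected_rows = b[mask3]
print(selected_rows)

# the mask can be combined with a column index
second_column = b[mask3, 1]
print(second_column)
```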

### Looking at source code
`NumPy` also has a function that lets you take a look at source code:

In [0]:
np.source(plt.figure)

# Astropy
This package is the magic that will make your astronomy code easier to write.  There are already functions for many of the things you would want to do, e.g. `.fits` reading/writing, data table reading/writing, sky coordinate transformations, cosmology calculations, and more.

## Reading tables
You won't want to type most data directly into your python code, instead you can use [`astropy.table`](http://docs.astropy.org/en/stable/io/unified.html) (see also: http://docs.astropy.org/en/stable/table/) to read the data in from a file.  The following file formats are directly supported:

+ fits
+ ascii
+ aastex
+ basic
+ cds
+ daophot
+ ecsv
+ fixed_width
+ html
+ ipac
+ latex
+ rdb
+ sextractor
+ tab
+ csv
+ votable

For other formats you can extend the existing `table` class to support it.

In [0]:
import astropy
print(astropy.__version__)

In [0]:
from astropy.table import Table
t = Table.read('data.csv', format='ascii.csv')
print(t)
print('==========')
print(t.info)
print('==========')
print(t.colnames)

The columns of `t` can be accessed by name:

In [0]:
print(t['ID', 'pxy'])

And math can be applied:

In [0]:
print(np.sqrt(t['sx']**2 + t['sy']**2))

If you have multiple data tables you can also stack them (vertically or horizontally) or join them (see http://docs.astropy.org/en/stable/table/operations.html).