# Using Python for Data

## Useful Packages
+ [`astropy`](http://www.astropy.org/): Includes functions for reading/writing data files (including `.fits`), cosmology calculations, astronomical constants and coordinate systems, image processing, and much more
+ [`numpy`](http://www.numpy.org/): Adds ability to deal with multi-dimensional arrays and vectorized math functions
+ [`scipy`](http://www.scipy.org/): Extends `numpy` by adding common scientific functions such as ODE integration, statistical analysis, linear algebra, and FFT
+ [`matplotlib`](http://matplotlib.org/): A useful plotting package
+ [`pandas`](https://pandas.pydata.org/): Package for dealing with data tables
+ [`astroML`](http://www.astroml.org/): Common statistical analysis and machine learning tools used in astronomy
+ [`scikit-learn`](http://scikit-learn.org/stable/): More machine learning tools written in python

## Installing python
The easiest way to install python on any OS is to use [anaconda python](https://www.continuum.io/downloads).  This will install a local version of python on your system so you don't need admin rights to install new packages.  Most of the packages listed above are installed by default with anaconda.  For this class we will be using python 3, and I recommend you use this version for your research (unless you have a very good reason to use python 2).

### Note
As of October 2019 python 2.7 is officially deprecated and will only receive security updates, and in December 2021 python 3.6 will be officially deprecated as well.  Many of the major packages listed above have already dropped python 2 support and are starting to drop support for python 3.6 and lower.

## Text editors
Although there are numerous IDEs (e.g. IDLE, Spyder) for python, for most everyday use you will likely be writing python code in a text editor and running your programs via the command line. In this case it is important to have a good text editor that supports syntax highlighting and live linting (syntax and style checking), and that is easy to configure the way you want. I can highly recommend [VScode](https://code.visualstudio.com/) as a free text editor with all the features above.

For python coding in VScode you will want to install the `Python` extension by Microsoft (you will be prompted to install it when you first open a .py file) and the `Jupyter` extension by Microsoft. Other useful extensions are the `Excel Viewer` extension for easier viewing of CSV files, `open in browser` for an option to open HTML files in your browser, `MyST-Markdown` for rendering markdown files, and `Code Spell Checker` for basic spell checking.

## Coding style
What is a coding style?  Beyond the syntax of a coding language, a coding style is a set of conventions that can be followed to make it easier for other developers (including your future self) to read your code and to understand the intention behind it.  For python coding the style most developers use has its basis in [PEP 8](https://peps.python.org/pep-0008/).

>A style guide is about consistency. Consistency with this style guide is important. Consistency within a project is more important. Consistency within one module or function is the most important.
>
>However, know when to be inconsistent – sometimes style guide recommendations just aren’t applicable. When in doubt, use your best judgment. Look at other examples and decide what looks best. And don’t hesitate to ask!

Here are some examples of PEP 8 conventions:

- Use 4 spaces to indent lines (rather than a tab)
- A max line limit of 79 characters (preferred by people who use command line editors, I typically override this to be higher)
- Constants are defined at the module level with names in `ALL_CAPS`
- Class names should normally use the `CapWords` convention
- Function names should be `lowercase`, with words separated by underscores as necessary to improve readability
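
As a rough illustration of the conventions listed above, here is a small sketch.  The names (`SPEED_OF_LIGHT_KM_S`, `SpectralLine`, `redshift_to_velocity`) are made up for this example and do not come from any package.

```python
# Module-level constant: ALL_CAPS
SPEED_OF_LIGHT_KM_S = 299792.458


class SpectralLine:
    """Class names use the CapWords convention."""

    def __init__(self, name, rest_wavelength):
        # 4 spaces per indentation level
        self.name = name
        self.rest_wavelength = rest_wavelength


def redshift_to_velocity(z):
    """Function names are lowercase with underscores.

    Uses the low-redshift approximation v = c * z.
    """
    return SPEED_OF_LIGHT_KM_S * z


line = SpectralLine('H-alpha', 656.28)
print(line.name, redshift_to_velocity(0.01))
```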

## Basic syntax examples
For a general overview of python's syntax head over to [codecademy](https://www.codecademy.com/learn/python) and take their interactive tutorial.  In this class we will only be covering what is necessary for data analysis.

### importing packages
Any package or code from another `.py` file can be imported with a simple `import` statement.  By default all imported code has its own name space, so you don't have to worry about overwriting existing functions.  The final line of this code block is a "magic" `Jupyter` function needed to display plots inline inside of `Jupyter notebooks`.

In [2]:
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
%matplotlib inline
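
To make the name space point a bit more concrete, here is a minimal sketch.  The local `sqrt` function is a made-up example; it sits next to `math.sqrt` and `np.sqrt` without replacing either of them.

```python
import math
import numpy as np


def sqrt(x):
    """A hypothetical local function that happens to share a name."""
    return 'my own sqrt of {0}'.format(x)


# each name space keeps its own version of `sqrt`
print(sqrt(4))
print(math.sqrt(4))
print(np.sqrt(4))
```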

### data containers
Data inside of python can be stored in several different types of containers.  The most basic ones are:

+ `list`: an indexed data structure that can hold any objects as an element
+ `tuple`: same as a `list` except the data is immutable
+ `dictionary`: objects stored as a `{key: value}` set (note: any immutable object can be used as a key including a tuple)

In [3]:
example_list = [1, 2, 3]
example_tuple = (1, 2, 3)
example_dict = {'key1': 1, 'key2': 2, ('key', 3): 3}

Elements in these objects can be accessed using a zero-based index (`list` and `tuple`) or key (`dict`).

In [4]:
print(example_list[0], example_list[-1])
print(example_tuple[1])
print(example_dict['key1'], example_dict[('key', 3)])

1 3
2
1 3
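
The difference between mutable and immutable containers is easiest to see by trying to change an element.  The sketch below redefines the example objects so it can be run on its own:

```python
example_list = [1, 2, 3]
example_tuple = (1, 2, 3)

# lists are mutable, so elements can be reassigned in place
example_list[0] = 10
print(example_list)

# tuples are immutable, so the same operation raises a TypeError
try:
    example_tuple[0] = 10
except TypeError as error:
    print('TypeError:', error)

# mutable objects such as lists cannot be used as dict keys
try:
    bad_dict = {[1, 2]: 'value'}
except TypeError as error:
    print('TypeError:', error)
```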


Each of these objects has various methods that can be called on it.  To learn what methods can be called you can look at the python documentation (e.g. https://docs.python.org/3/tutorial/datastructures.html) or you can inspect the object directly and use python's `help` function to get the doc string.

Note: Names that start with `_` are private by convention and are not designed to be called directly on the object.  Names that start and end with `__` are special methods that python calls for you (e.g. `__len__` is used by `len()`).

In [5]:
print(dir(example_list))
print('\n\n')
help(example_list.pop)

['__add__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']



Help on built-in function pop:

pop(index=-1, /) method of builtins.list instance
    Remove and return item at index (default last).
    
    Raises IndexError if list is empty or index is out of range.
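
Since the output of `dir` is dominated by names starting with an underscore, it can help to filter them out.  This is just one way to do it:

```python
example_list = [1, 2, 3]

# keep only the names that do not start with an underscore
public_methods = [name for name in dir(example_list) if not name.startswith('_')]
print(public_methods)
```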



### Slicing lists
Many times it is useful to slice and manipulate lists:

In [6]:
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(a)
# print the first 3 elements
print(a[:3])
# print the middle 4 elements
print(a[3:7])
# print the last 3 elements
print(a[7:])
# you can also use neg index
print(a[-3:])
# print only even index
print(a[::2])
# print only odd index
print(a[1::2])
# print the reverse list
print(a[::-1])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2]
[3, 4, 5, 6]
[7, 8, 9]
[7, 8, 9]
[0, 2, 4, 6, 8]
[1, 3, 5, 7, 9]
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]


### Looping over `list`s and `dict`s
There are several ways to loop over a `list` or `dict` depending on what values you want access to.

In [7]:
# loop over values in a list
for i in example_list:
    print(i)
print('=========')

# loop over values in a list with index
for idx, i in enumerate(example_list):
    print('{0}: {1}'.format(idx, i))
print('=========')

# loop over keys in dict
for i in example_dict:
    print(i)
print('=========')

# loop over values in dict
for i in example_dict.values():
    print(i)
print('=========')

# loop over keys and values in dict
for key, value in example_dict.items():
    print('{0}: {1}'.format(key, value))

1
2
3
0: 1
1: 2
2: 3
key1
key2
('key', 3)
1
2
3
key1: 1
key2: 2
('key', 3): 3
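
Another common case is looping over two lists at the same time, which can be done with the built-in `zip` function.  The lists `names` and `values` below are made up for this sketch:

```python
names = ['a', 'b', 'c']
values = [1, 2, 3]

# loop over two lists at the same time
for name, value in zip(names, values):
    print('{0}: {1}'.format(name, value))

# zip can also be used to build a dict from two lists
paired = dict(zip(names, values))
print(paired)
```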


### list/dict comprehension
If you need to make a `list` or `dict` as the result of a loop you can use comprehension. **Note** comprehension is typically faster than a normal loop, since the interpreter builds the container with specialized bytecode and avoids looking up and calling `append` (or assigning a key) on every iteration.

In [8]:
# slower method
list_loop = []
dict_loop = {}
for i in a:
    list_loop.append(i**2)
    dict_loop['key{0}'.format(i)] = i
print(list_loop)
print(dict_loop)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
{'key0': 0, 'key1': 1, 'key2': 2, 'key3': 3, 'key4': 4, 'key5': 5, 'key6': 6, 'key7': 7, 'key8': 8, 'key9': 9}


In [9]:
# faster method
list_comp = [i**2 for i in a]
dict_comp = {'key{0}'.format(i): i for i in a}
print(list_comp)
print(dict_comp)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
{'key0': 0, 'key1': 1, 'key2': 2, 'key3': 3, 'key4': 4, 'key5': 5, 'key6': 6, 'key7': 7, 'key8': 8, 'key9': 9}
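
If you want to check the speed difference on your own machine, the standard library `timeit` module can be used.  This is a rough sketch: the function names are made up, and the exact timings will depend on your computer and python version, so no numbers are quoted here.

```python
import timeit

a = list(range(1000))


def with_loop():
    result = []
    for i in a:
        result.append(i**2)
    return result


def with_comprehension():
    return [i**2 for i in a]


# both approaches build the same list
assert with_loop() == with_comprehension()

# total time in seconds for 1000 calls of each function
print('loop:         ', timeit.timeit(with_loop, number=1000))
print('comprehension:', timeit.timeit(with_comprehension, number=1000))
```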


## Writing reusable code
It is always best to keep your code DRY (don't repeat yourself).  If you find yourself writing the same block of code more than 2 times you should think about extracting it to a function.  If you need to create a custom object that has its own methods assigned to it you should create a custom class.

### functions
In python functions use a local name space, so don't worry about reusing variable names.  Only if a variable is not in the local name space will the function look to the global name space.  Arguments are passed as references to objects: reassigning an argument inside a function only changes the local name, but modifying a mutable argument (e.g. a `list`) in place also changes the object seen by the caller.

In [10]:
def alpha(x):
    x = x + 1
    return x

x = 1
print(alpha(x))
print(x)

def beta(x):
    x[0] = x[0] + 1
    return x

x = [1]
print(beta(x))
print(x)

2
1
[2]
[2]
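
If you do not want a function to change a mutable argument, one option is to work on a copy inside the function.  The two function names below are made up for this sketch:

```python
def append_in_place(values):
    # mutates the caller's list
    values.append(0)
    return values


def append_to_copy(values):
    # works on a copy, so the caller's list is left alone
    new_values = list(values)
    new_values.append(0)
    return new_values


original = [1, 2]
print(append_to_copy(original), original)   # [1, 2, 0] [1, 2]
print(append_in_place(original), original)  # [1, 2, 0] [1, 2, 0]
```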


### classes
Classes are useful when you will have multiple instances of an object type:

In [11]:
class Shape:
    def __init__(self, x, y, cx=0.0, cy=0.0):
        self.name = 'rectangle'
        self.x = x
        self.y = y
        self.cx = cx
        self.cy = cy

    def area(self):
        return self.x * self.y

    def move(self, dx, dy):
        self.cx += dx
        self.cy += dy

    def get_position(self):
        return '[x: {0}, y: {1}]'.format(self.cx, self.cy)


class Square(Shape):
    def __init__(self, x, cx=0.0, cy=0.0):
        self.name = 'square'
        self.x = x
        self.y = x
        self.cx = cx
        self.cy = cy


class Circle(Shape):
    def __init__(self, r, cx=0.0, cy=0.0):
        self.name = 'circle'
        self.r = r
        self.cx = cx
        self.cy = cy

    def area(self):
        '''Return the area of the circle'''
        return np.pi * self.r**2

shape_list = [Shape(1, 2), Square(3), Circle(5)]
for sdx, s in enumerate(shape_list):
    s.move(sdx, sdx)
    print('{0} area: {1}, position: {2}'.format(s.name, s.area(), s.get_position()))


rectangle area: 2, position: [x: 0.0, y: 0.0]
square area: 9, position: [x: 1.0, y: 1.0]
circle area: 78.53981633974483, position: [x: 2.0, y: 2.0]
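
The subclasses above repeat most of the assignments made in `Shape.__init__`.  One possible way to stay DRY is to call the parent's `__init__` with `super()`.  In this sketch the `name` keyword on `Shape` is my own addition and is not part of the class defined above:

```python
class Shape:
    def __init__(self, x, y, cx=0.0, cy=0.0, name='rectangle'):
        self.name = name
        self.x = x
        self.y = y
        self.cx = cx
        self.cy = cy

    def area(self):
        return self.x * self.y


class Square(Shape):
    def __init__(self, x, cx=0.0, cy=0.0):
        # reuse the parent's __init__ instead of repeating each assignment
        super().__init__(x, x, cx=cx, cy=cy, name='square')


s = Square(3)
print(s.name, s.area(), isinstance(s, Shape))  # square 9 True
```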


As demonstrated before, you can show all the methods available to a class by using the `dir` function.  If a docstring is defined (a triple-quoted string as the first statement of a function) it will be displayed if `help` is called on the function.

In [12]:
print(dir(Circle))
print('\n\n')
help(shape_list[2].area)

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'area', 'get_position', 'move']



Help on method area in module __main__:

area() method of __main__.Circle instance
    Return the area of the circle


### `if __name__ == '__main__':`
Sometimes you want a file to run a bit of code when called directly from the command line, but not run that code if it is imported into another file.  This can be done by checking the value of the global variable `__name__`: when a file is run directly `__name__` will be `'__main__'`, when it is imported it will not.

In [13]:
if __name__ == '__main__':
    # code that is only run when this file is directly called from the command line
    # This is a good place to put example code for the functions and classes defined in the file
    print('An example')

An example
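
A common layout is to put the example code in a `main` function and call it from this check.  The sketch below shows what a hypothetical file called `shapes.py` could look like:

```python
# contents of a hypothetical file called shapes.py
def rectangle_area(x, y):
    """Return the area of a rectangle with sides x and y."""
    return x * y


def main():
    # example usage, only run when the file is executed directly
    print('area:', rectangle_area(2, 3))


if __name__ == '__main__':
    main()
```

Running `python shapes.py` would print the area, while `import shapes` from another file would only define the two functions.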


### `with` blocks
When working with objects that have `__enter__` and `__exit__` methods defined, you can use a `with` block to automatically call `__enter__` at the start and `__exit__` at the end.  A typical use case is automatically closing files after you are done reading/writing data:

In [14]:
with open('data.csv', 'r') as file:
    print(file.readline())
    
print(file.readline())

ID,x,y,sy,sx,pxy



ValueError: I/O operation on closed file.
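
You can also define `__enter__` and `__exit__` on your own classes.  As a minimal sketch, here is a made-up `Timer` class that records how long the code inside the `with` block takes to run:

```python
import time


class Timer:
    """Hypothetical context manager that records how long a block takes."""

    def __enter__(self):
        # called at the start of the `with` block
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # called at the end of the block, even if an exception was raised
        self.elapsed = time.perf_counter() - self.start
        return False  # do not suppress exceptions


with Timer() as timer:
    total = sum(range(100000))

print('sum: {0}, elapsed: {1:.6f} s'.format(total, timer.elapsed))
```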

## Numpy
NumPy extends Python to provide n-dimensional arrays along with a wealth of statistical and mathematical functions.

In [15]:
b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(b)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


There are several ways to create arrays of a given size:

In [16]:
zero = np.zeros((2, 2, 3))
print(zero)
one = np.ones((2, 4))
print(one)
empty = np.empty((3, 3))
print(empty)

[[[0. 0. 0.]
  [0. 0. 0.]]

 [[0. 0. 0.]
  [0. 0. 0.]]]
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]]
[[6.90800305e-310 4.68277252e-310 0.00000000e+000]
 [0.00000000e+000 0.00000000e+000 0.00000000e+000]
 [0.00000000e+000 0.00000000e+000 3.95252517e-322]]


Note: empty fills the array with whatever happened to be in that bit of memory earlier!
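
A few other creation functions I find useful are sketched below.  The variable names are only for illustration:

```python
import numpy as np

# an array filled with a chosen value
sevens = np.full((2, 3), 7.0)
print(sevens)

# evenly spaced integers, similar to python's range
steps = np.arange(0, 10, 2)
print(steps)

# a fixed number of evenly spaced points, end point included
grid = np.linspace(0.0, 1.0, 5)
print(grid)

# a new array with the same shape and data type as an existing one
like = np.zeros_like(sevens)
print(like.shape, like.dtype)
```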

### Basic operations
Arrays typically act element by element or try to cast the operations in "obvious" ways:
![Array broadcasting](./images/array_broadcasting.png)

-image ref: http://www.astroML.org

In [None]:
print(b)
print('========')

print(b + b)
print('========')

print(3 * b)
print('========')

d = np.array([1, 2, 3])
print(d)
print(b + d)
print('========')

e = np.array([[1], [2], [3]])
print(e)
print(b + e)
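
To see how the shapes are matched up during broadcasting, it can help to print the shape of each result.  This sketch redefines `b`, `d`, and `e` so it can be run on its own:

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
d = np.array([1, 2, 3])        # shape (3,)
e = np.array([[1], [2], [3]])  # shape (3, 1)

# d is stretched along the rows: it is added to every row of b
print((b + d).shape)

# e is stretched along the columns: it is added to every column of b
print((b + e).shape)

# a (3,) array and a (3, 1) array broadcast to a (3, 3) array
print(d + e)

# shapes that cannot be matched raise a ValueError
try:
    b + np.array([1, 2])
except ValueError as error:
    print('ValueError:', error)
```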

### Methods
Arrays also have methods such as `sum()`, `min()`, `max()` and these also take axis arguments to operate just over one index.

In [None]:
print(b.sum())
print(b.sum(axis=0))
print(b.sum(axis=1))
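
The `axis` argument names the index that gets collapsed.  As a small sketch, the example below also uses the `keepdims` argument, which keeps the collapsed axis around with length 1 so the result can be broadcast against the original array:

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# axis=0 collapses the rows, leaving one value per column
print(b.sum(axis=0))  # [12 15 18]

# axis=1 collapses the columns, leaving one value per row
print(b.sum(axis=1))  # [ 6 15 24]

# keepdims=True keeps the collapsed axis with length 1
row_totals = b.sum(axis=1, keepdims=True)
print(row_totals.shape)  # (3, 1)

# each row divided by its own total
print(b / row_totals)
```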

### Slices
Works the same as lists, just provide a slice for each dimension:

In [None]:
print(b[0, 0:2])
print(b[:, 0:2])
print(b[0:2, 2:])
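
One difference from lists that is worth knowing: slicing a list makes a copy, while slicing an array gives a view onto the same data.  A minimal sketch (the variable names are made up):

```python
import numpy as np

# a list slice is a copy, so the original list is unchanged
original_list = [0, 1, 2, 3]
list_slice = original_list[:2]
list_slice[0] = 99
print(original_list)

# an array slice is a view, so the original array is changed too
original_array = np.array([0, 1, 2, 3])
array_slice = original_array[:2]
array_slice[0] = 99
print(original_array)

# use .copy() if you need an independent array
safe_array = np.array([0, 1, 2, 3])
safe_slice = safe_array[:2].copy()
safe_slice[0] = 99
print(safe_array)
```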

### Iterating
When using an array as an iterator it will loop over the first index of the array (e.g. for a 2d array it loops row-by-row).  Loop over the resulting object to loop over the second index, etc...

In [None]:
for row in b:
    print(row)
    for col in row:
        print(col)

### Masking arrays
Many times you want to find the values in an array that pass a particular condition (e.g. `B-V < 0.3`).  This can be done with array masks:

In [None]:
mask = b >= 5
print(mask)
print(b[mask])

You can also combine multiple masks with the _bitwise_ operators (`&`, `|`, `~`, `^`):

In [None]:
mask2 = b <= 7
print(mask2)
print(b[mask & mask2])
print(b[mask | mask2])
print(b[~mask | mask2])
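
When the conditions are written on one line, each one needs its own parentheses.  The sketch below also shows why the python keywords `and`/`or` cannot be used here, and uses `np.where` to get the positions of the selected values:

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# parentheses are needed because & binds more tightly than >= and <=
in_range = (b >= 5) & (b <= 7)
print(b[in_range])  # [5 6 7]

# the python keywords `and`/`or` do not work element by element
try:
    (b >= 5) and (b <= 7)
except ValueError as error:
    print('ValueError:', error)

# np.where gives the row and column indices where the mask is True
rows, cols = np.where(in_range)
print(rows, cols)  # [1 1 2] [1 2 0]
```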

You can also create masks based on parts of an array (e.g. the first column) and apply it to other parts of the array (e.g. the second column):

In [None]:
mask3 = b[:, 0] <= 4
print(mask3)
print(b[:, 0][mask3])
print(b[:, 1][mask3])
print(b[:, 2][mask3])
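
The same mask can also be used directly as the first index of the array, which selects whole rows in one step.  A short sketch, redefining `b` and `mask3` so it runs on its own:

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
mask3 = b[:, 0] <= 4

# a 1d mask used as the first index keeps the matching rows
print(b[mask3])

# it can be combined with a column index or slice
print(b[mask3, 1])
print(b[mask3, 1:])
```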

### Looking at source code
`NumPy` also has a function that lets you take a look at source code:

In [None]:
np.source(plt.figure)
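
As far as I know, newer releases of NumPy (2.0 and later) no longer provide `np.source` in the main name space, so treat that as something to check against your installed version.  The standard library `inspect` module offers the same kind of look at source code for functions written in python.  The sketch below uses `textwrap.dedent` only as a convenient example function:

```python
import inspect
import textwrap

# get the source of a pure python function as a string
source = inspect.getsource(textwrap.dedent)

# print just the first line, which holds the function signature
print(source.splitlines()[0])
```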

# Astropy
This package is the magic that will make your astronomy code easier to write.  There are already functions for many of the things you would want to do, e.g. `.fits` reading/writing, data table reading/writing, sky coordinate transformations, cosmology calculations, and more.

## Reading tables
You won't want to type most data directly into your python code, instead you can use [`astropy.table`](http://docs.astropy.org/en/stable/io/unified.html) (see also: http://docs.astropy.org/en/stable/table/) to read the data in from a file.  The following data types are directly supported:

+ fits
+ ascii
+ aastex
+ basic
+ cds
+ daophot
+ ecsv
+ fixed_width
+ html
+ ipac
+ latex
+ rdb
+ sextractor
+ tab
+ csv
+ votable

For other formats you can extend the existing `table` class to support it.

In [None]:
import astropy
print(astropy.__version__)

In [None]:
from astropy.table import Table
t = Table.read('data.csv', format='ascii.csv')
display(t)
print(t.info)
print(t.colnames)

The columns of `t` can be accessed by name:

In [None]:
print(t['ID', 'pxy'])

And math can be applied:

In [None]:
print(np.sqrt(t['sx']**2 + t['sy']**2))

If you have multiple data tables you can also stack them (vertically or horizontally) or join them (see http://docs.astropy.org/en/stable/table/operations.html).

## Constants and Units
Many of the constants you would need can be found in [`astropy.constants`](http://docs.astropy.org/en/stable/constants/).  You can also assign units to your values using [`astropy.units`](http://docs.astropy.org/en/stable/units/).

In [None]:
from astropy import constants as const
print(const.c)

In [None]:
from astropy import units as u
wavelength = [1000., 2000., 3000.] * u.nm
print(wavelength)
# convert to meters
print(wavelength.to(u.m))
# convert to frequency
freq = wavelength.to(u.Hz, equivalencies=u.spectral())
print(freq)
# convert to velocity from a rest wavelength of 2000 nm
freq_to_vel = u.doppler_optical(2000 * u.nm)
vel = freq.to(u.km / u.s, equivalencies=freq_to_vel)
print(vel)

# Pandas

Data tables can also be read in with pandas:

In [None]:
import pandas
data = pandas.read_csv('data.csv')
display(data)
print(data.columns)

The columns can be accessed with 'dot' notation or by name:

In [None]:
print(data.x)
print(data[['x', 'y']])

As before, math can be done directly on the columns:

In [None]:
print(np.sqrt(data.sx**2 + data.sy**2))

Pandas treats these `DataFrames` like databases, so most database operations (e.g. join, merge, groupby, etc.) can be done on a data table.
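
As a short sketch of two of these operations, the example below uses `merge` and `groupby` on two small tables.  The tables are made up for illustration and are not related to `data.csv`:

```python
import pandas as pd

obs = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'band': ['g', 'r', 'g', 'r'],
    'mag': [18.2, 17.9, 19.1, 18.4],
})
names = pd.DataFrame({'ID': [1, 2, 3], 'name': ['A', 'B', 'C']})

# left merge: keep every row of obs and add the name where the ID matches
merged = obs.merge(names, on='ID', how='left')
print(merged)

# group the rows by band and take the mean magnitude of each group
mean_mag = obs.groupby('band')['mag'].mean()
print(mean_mag)
```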