# ADACS - Introduction to computing for astronomers

This lesson material was developed for ADACS face-to-face training. However, as the material is fairly comprehensive, you can also work through the notebooks on your own. Because the face-to-face training is a live coding workshop, a solutions notebook is supplied to help anyone working through the notebooks at their own pace.

This notebook was put together by:
- Rebecca Lange | Curtin Institute for Computation
- Paul Hancock | Curtin Institute for Radio Astronomy Research

Inspiration and material for this notebook was taken from:
- Data Carpentry
- Towards Data Science blog
- ...

## Introduction to Jupyter notebooks 


Jupyter Notebook Cheat Sheet
![Jupyter Notebook Cheat Sheet -- courtesy of DataCamp.com](https://cdn-images-1.medium.com/max/800/1*_nFAOrPMxYwE7cBt-ryqZA.png)


## Introduction to Python, Jupyter notebooks and coding best practices

Python is a high-level, interpreted programming language. This means the code is easy for humans to read, there is no need for us to compile it, and in many cases we do not have to think much about the underlying system (for example, memory usage).

As a consequence, we can use it in two ways:
- Using the interpreter as an "advanced calculator" in interactive mode:
- Executing programs/scripts saved as a text file, usually with *.py extension:

In [None]:
2+2

In [None]:
print("Hello")

In [None]:
%run my_script.py

# Types of Data

How information is stored in a DataFrame or a Python object affects what we can do with it and the outputs of calculations as well. There are two main types of data that we'll explore in this lesson: numeric and character types.


## Numeric Data Types

Numeric data types include integers and floats. A **floating point** number
(known as a float) has a decimal point even if the value after the decimal point
is 0. For example: 1.13, 2.0, 1234.345. If we have a column that contains both
integers and floating point numbers, Pandas will assign the entire column to the
float data type so the decimal points are not lost. In a vector or DataFrame (we will learn about these types later) the entire object or an entire column will be of the same type.

An **integer** never has a decimal point. Converting 1.13 to an integer gives 1,
and 1234.345 becomes 1234. You will often see the data type `int64` in pandas and
NumPy, which stands for 64-bit integer. The 64 refers to the memory allocated to
store each value, which sets the range of numbers that each "cell" can hold.
Allocating space ahead of time allows computers to
optimize storage and processing efficiency.
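
The difference between the two numeric types can be checked directly with the built-in `type` and `int` functions. This is a minimal sketch using plain Python numbers; the variable names are chosen only for illustration.

```python
# Inspect the type Python assigns to a few literal values
whole = 1234
fractional = 1234.345

print(type(whole))        # <class 'int'>
print(type(fractional))   # <class 'float'>

# int() drops the fractional part (it truncates, it does not round)
truncated = int(fractional)
print(truncated)          # 1234

# Mixing an int and a float in arithmetic gives a float
mixed = whole + 2.0
print(type(mixed))        # <class 'float'>
```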



## Character Data Types

Strings are values that contain numbers and/or characters.
For example, a string might be a word, a sentence, or several sentences.
A string can also contain or consist of numbers. For instance, '1234' could be stored as a
string. As could '10.23'. However, **strings that contain numbers cannot be used
for mathematical operations** until they are converted to a numeric type!
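
As a small illustration of this point, the sketch below shows that `+` joins two strings instead of summing them, and that `int` and `float` turn numeric text into numbers. The variable names are made up for this example.

```python
number_as_text = '1234'
decimal_as_text = '10.23'

# Adding two strings joins them instead of summing them
joined = number_as_text + number_as_text
print(joined)                      # 12341234

# Convert to a numeric type before doing arithmetic
total = int(number_as_text) + float(decimal_as_text)
print(total)
```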





In [None]:
text = "Data Carpentry"
number = 42
pi_value = 3.1415

Here we've assigned data to variables, namely `text`, `number` and `pi_value`,
using the assignment operator `=`. The variable called `text` is a string which
means it can contain letters and numbers. We could reassign the variable `text`
to an integer too - but be careful reassigning variables as this can get 
confusing.

To print out the value stored in a variable we can simply type the name of the
variable into the interpreter:

In [None]:
text

however, in scripts we must use the `print` function:

In [None]:
# Comments start with #
# Next line will print out text
print(text)

In [None]:
# We also need the print statement if we want to see more than one variable
text
number

In [None]:
print(text, number, pi_value)

### Operators

We can perform mathematical calculations in Python using the basic operators
 `+, -, *, /, %` and `**` (exponentiation):

In [None]:
6*7
2**16
13 % 5

**In Python 2, if we divide one integer by another, we get an integer!**
In Python 3 the result is different: we get a float.
If you are running Python 2, remember to convert your integers to floats when you want floating point precision for divisions!

In [None]:
10/3

In [None]:
# convert to integer
a = 6.6
int(a)

In [None]:
# convert to float
b=5
float(b)

In [None]:
10/float(3)
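
In Python 3 the two kinds of division have separate operators. The following sketch, assuming a Python 3 interpreter, compares true division, floor division and the remainder for the same pair of integers.

```python
# In Python 3, / always returns a float
true_division = 10 / 3
print(true_division)       # 3.3333333333333335

# // is floor division: it rounds down to a whole number
floor_division = 10 // 3
print(floor_division)      # 3

# % gives the remainder of that division
remainder = 10 % 3
print(remainder)           # 1
```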

We can also use comparison operators
(`<, >, ==, !=, <=, >=`) and logical operators
(`and, or, not`). The data type returned by these is
called a _boolean_.

In [None]:
3>4
True and False
True or False
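
A notebook cell only displays the value of its last expression, so it can help to store and print each result. The sketch below, with illustrative variable names, also shows how comparisons can be combined.

```python
is_greater = 3 > 4
print(is_greater)            # False
print(type(is_greater))      # <class 'bool'>

# Comparisons can be combined with and, or, not
in_range = (0 <= 0.5) and (0.5 <= 1)
print(in_range)              # True
print(not in_range)          # False
```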

## Sequential types: Lists and Tuples

### Lists

**Lists** are a common data structure to hold an ordered sequence of
elements. Each element can be accessed by an index.  Note that Python
indexes start with 0 instead of 1:

In [None]:
numbers = [1,2,3]
numbers[0]

To add elements to the end of a list, we can use the `append` method:

In [None]:
numbers.append(4)
print(numbers)

**Methods** are a way to interact with an object (a list, for example). We can invoke 
a method using the dot `.` followed by the method name and a list of arguments in parentheses. 
To find out what methods are available for an object, we can use the built-in `help` command:

In [None]:
help(numbers)

We can also access a list of methods using `dir`. Some method names are
surrounded by double underscores. Those methods are called "special", and
usually we access them in a different way. For example, the `__add__` method is
responsible for the `+` operator.

In [None]:
dir(numbers)
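
To see the link between a special method and its operator, the sketch below calls `__add__` directly and compares the result with `+`. It also filters the output of `dir` to hide the special methods; the list contents are just an example.

```python
numbers = [1, 2, 3, 4]

# The + operator on lists calls the special method __add__ behind the scenes
with_operator = numbers + [5]
with_method = numbers.__add__([5])

print(with_operator)                  # [1, 2, 3, 4, 5]
print(with_operator == with_method)   # True

# Filter out the special methods to see only the "ordinary" ones
public_methods = [name for name in dir(numbers) if not name.startswith('__')]
print(public_methods)
```

In everyday code we would always write `numbers + [5]`; calling `__add__` by hand is only useful here to show what the operator does.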

### Tuples

A tuple is similar to a list in that it's an ordered sequence of elements. However,
tuples cannot be changed once created (they are "immutable"). Tuples are
created by placing comma-separated values inside parentheses `()`.

In [None]:
a_tuple = (1,2,3)
another_tuple = ("blue", "green", "red")

### Challenge
1. What happens when you type `a_tuple[2]=5` vs `a_list[1]=5` ?
2. Type `type(a_tuple)` into python - what is the object type?


In [None]:
a_list=[1,2,3]

In [None]:
a_list[1]=5

In [None]:
a_list

## Dictionaries

A **dictionary** is a container that holds pairs of objects - keys and values.

Dictionaries work a lot like lists, except that you index them with *keys*.
You can think of a key as a name or unique identifier for a value
in the dictionary. Keys can only have particular types: they have to be
"hashable". Strings and numeric types are acceptable, but lists aren't.

In [None]:
translation = {"one":1, "two":2}

In [None]:
translation["one"]

In [None]:
rev = {1:"one", 2:["two", "birds"]}
rev

In [None]:
bad = {[1,2,3]:3}
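
The cell above fails because a list is not hashable. As a sketch of the difference, a tuple with the same contents is accepted as a key, and the error raised for the list can be caught and inspected. The dictionary name `coords` is made up for this example.

```python
# A tuple is immutable and therefore hashable, so it can be a key
coords = {(1, 2, 3): 'a tuple key works'}
print(coords[(1, 2, 3)])

# A list is mutable and not hashable, so using it as a key raises TypeError
try:
    bad = {[1, 2, 3]: 3}
except TypeError as error:
    message = str(error)
    print('TypeError:', message)
```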

To add an item to the dictionary we assign a value to a new key:

In [None]:
rev[3]="three"

In [None]:
rev

### Challenge

Can you do reassignment in a dictionary? Give it a try. 

1. First check what `rev` is right now (remember `rev` is the name of our dictionary). 
    
2. Try to reassign the second value (in the *key value pair*) so that it no longer reads "two" but instead reads "apple-sauce". 

3. Now display `rev` again to see if it has changed. 

It is important to note that in versions of Python before 3.7 dictionaries were
"unordered": they did not remember the order in which key:value pairs were added,
so loops over a dictionary could return items in a seemingly random order.
Since Python 3.7, dictionaries keep their items in insertion order, but they are
still accessed by key rather than by position.

In [None]:
rev

In [None]:
rev[2]="apple-sauce"

In [None]:
rev

In [None]:
#if you need to change your directory
import os

In [None]:
os.chdir("../")  # make sure you enter the correct file path

In [None]:
os.listdir("./")
os.chdir("data/")

# STOP HERE


# Automating data processing using For Loops

So far, we've used Python and the pandas library to explore and manipulate
individual datasets by hand, much like we would do in a spreadsheet. The beauty
of using a programming language like Python, though, comes from the ability to
automate data processing through the use of loops and functions.

## For loops

Loops allow us to repeat a workflow (or series of actions) a given number of
times or while some condition is true. We would use a loop to automatically
process data that's stored in multiple files (daily values with one file per
year, for example). Loops lighten our work load by performing repeated tasks
without our direct involvement and make it less likely that we'll introduce
errors by making mistakes while processing each file by hand.

Let's write a simple for loop that simulates what a kid might see during a
visit to the zoo:

In [None]:
animals = ['lion','tiger','crocodile','vulture','hippo']
print(animals)

In [None]:
for creatures in animals:
    print(creatures)

The line defining the loop must start with `for` and end with a colon, and the
body of the loop must be indented.

In this example, `creatures` is the loop variable that takes the value of the next
entry in `animals` every time the loop goes around. We can call the loop variable
anything we like. After the loop finishes, the loop variable will still exist
and will have the value of the last entry in the collection:

In [None]:
for creatures in animals:
    pass

In [None]:
creatures

We are not asking Python to print the value of the loop variable anymore, but
the for loop still runs and the value of `creatures` changes on each pass through
the loop. The statement `pass` in the body of the loop just means "do nothing".


The file we've been using so far, `GalaxyZoo1.csv`, contains tens of thousands of observations and
is very large. We would like to separate the data for each galaxy class.

Let's start by making a new directory inside the folder `data` to store all of
these files using the module `os`:

In [None]:
os.mkdir("gals_by_class")

The command `os.mkdir` is equivalent to `mkdir` in the shell. Just so we are
sure, we can check that the new directory was created within the `data` folder:

In [None]:
os.listdir(".")

The command `os.listdir` is equivalent to `ls` in the shell.
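
Note that `os.mkdir` raises a `FileExistsError` if the directory already exists, which happens easily when a notebook cell is re-run. One possible alternative is `os.makedirs` with `exist_ok=True`. The sketch below works in a temporary directory (an assumption made so that it does not touch your `data` folder).

```python
import os
import tempfile

# Work inside a throwaway directory so the sketch does not touch real data
base_dir = tempfile.mkdtemp()
target = os.path.join(base_dir, 'gals_by_class')

# exist_ok=True makes the call safe to repeat
os.makedirs(target, exist_ok=True)
os.makedirs(target, exist_ok=True)   # second call does not raise

print(os.listdir(base_dir))          # ['gals_by_class']
```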

Previously, we saw how to use the library pandas to load the galaxy
data into memory as a DataFrame, how to select a subset of the data using some
criteria, and how to write the DataFrame into a csv file. Let's write a script
that performs those three steps in sequence for selecting clockwise spirals:

```python
import pandas as pd

# Load the data into a DataFrame
gal_df = pd.read_csv('GalaxyZoo1.csv',
                     keep_default_na=False, na_values=["NA"])

# Select only clockwise spirals
# (`class` is a reserved word in Python, so we write gal_df['class'], not gal_df.class)
cw_gals = gal_df[gal_df['class'] == 'CW']

# Write the new DataFrame to a csv file
cw_gals.to_csv('gals_by_class/cw_gals.csv')
```

To create files for each class, we could repeat the last two commands over and
over, once for each class of galaxy. Repeating code is neither elegant nor
practical, and is very likely to introduce errors into your code. We want to
turn what we've just written into a loop that repeats the last two commands for
every class in the dataset.

Let's start by writing a loop that simply prints the names of the files we want
to create. The dataset we are using covers the classes CW, ACW, EDGE, E0 through E7 and U, and we'll create
a separate file for each of those classes. Listing the filenames is a good way to
confirm that the loop is behaving as we expect.

We have seen that we can loop over a list of items, so we need a list of galaxy classes 
to loop over. We can get the unique classes in our DataFrame with:

In [None]:
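import pandas as pd

gal_df = pd.read_csv('GalaxyZoo1.csv',
                     keep_default_na=False, na_values=["NA"])
# `class` is a reserved word, so the column is selected with gal_df['class']
cw_gals = gal_df[gal_df['class'] == 'CW']
cw_gals.to_csv('gals_by_class/cw_gals.csv')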

In [None]:
gal_df['class'].unique()

Putting this into our for loop we get

In [None]:
for galclass in gal_df['class'].unique():
    filename = 'gals_by_class/' + galclass + '_gals.csv'
    print(filename)

We can now add the rest of the steps we need to create separate text files.
Once finished look inside the `gals_by_class` directory and check a couple of the files you
just created to confirm that everything worked as expected.

In [None]:
for galclass in gal_df['class'].unique():
    filename = 'gals_by_class/' + galclass + '_gals.csv'
    # extracting data of a specific class
    class_df = gal_df[gal_df['class'] == galclass]
    # writing to file
    class_df.to_csv(filename)

## Writing Unique FileNames

Notice that the code above created a unique filename for each class.

	 filename = 'gals_by_class/' + galclass + '_gals.csv'

Let's break down the parts of this name:

* The first part is simply some text that specifies the directory to store our
  data file in 
* We can concatenate this with the value of a variable, in this case `galclass` by
  using the plus `+` sign and the variable we want to add to the file name: `+
  galclass`
  _Note:_ if you wanted to concatenate a number in the filename convert it to a string first using str(number)
* Then we add the file extension and a short descriptor as another text string: `+ '_gals.csv'`

Notice that we use single quotes to add text strings. The variable is not
surrounded by quotes.
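
As a sketch of the note about numbers, the example below builds a filename that contains a number, first with `+` and `str()` and then with an f-string (available from Python 3.6). The variable `batch_number` is hypothetical and not part of the dataset.

```python
galclass = 'CW'
batch_number = 3   # hypothetical number, used only for illustration

# Plain concatenation: numbers must be converted with str() first
filename = 'gals_by_class/' + galclass + '_' + str(batch_number) + '_gals.csv'
print(filename)    # gals_by_class/CW_3_gals.csv

# An f-string builds the same name and converts the number for us
same_name = f'gals_by_class/{galclass}_{batch_number}_gals.csv'
print(same_name == filename)   # True
```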

### Challenge

1. Some of the entries are missing data (i.e. NaN for the probability measurements). Modify the for loop so that the entries with null values are not included in the class files.

## Building reusable and modular code with functions

Suppose that separating large data files into individual files is a task
that we frequently have to perform. We could write a **for loop** like the one above
every time we needed to do it but that would be time consuming and error prone.
A more elegant solution would be to create a reusable tool that performs this
task with minimum input from the user. To do this, we are going to turn the code
we've already written into a function.

Functions are reusable, self-contained pieces of code that are called with a
single command. They can be designed to accept arguments as input and return
values, but they don't need to do either. Variables declared inside functions
only exist while the function is running and if a variable within the function
(a local variable) has the same name as a variable somewhere else in the code,
the local variable hides but doesn't overwrite the other.

Every method used in Python (for example, `print`) is a function, and the
libraries we import (say, `pandas`) are a collection of functions. We will only
use functions that are housed within the same code that uses them, but it's also
easy to write functions that can be used by different programs.

Functions are declared following this general structure:

```python
def this_is_the_function_name(input_argument1, input_argument2):

    # The body of the function is indented
    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, 
    '(this is done inside the function!)')

    # And returns their product
    return input_argument1 * input_argument2
```

The function declaration starts with the word `def`, followed by the function
name and any arguments in parentheses, and ends in a colon. The body of the
function is indented just like loops are. If the function returns something when
it is called, it includes a return statement at the end.

In [None]:
#let's define this function
def function_name(input_argument1, input_argument2):

    # The body of the function is indented
    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')

    # And returns their product
    return input_argument1 * input_argument2

In [None]:
#and now let's call the function:
result = function_name(4,4)
result

### Challenge:

1. Change the values of the arguments in the function and check its output
2. Try calling the function by giving it the wrong number of arguments (not 2)
   or not assigning the function call to a variable (no `result =`)
3. Declare a variable inside the function and test to see where it exists (Hint:
   can you print it from outside the function?)
4. Explore what happens when a variable both inside and outside the function
   have the same name. What happens to the global variable when you change the
   value of the local variable?

Now let's write a function to save galaxy data for a range of probabilities.
Let's first write a function that separates the data for just one probability value and saves that data to a file.
To make this easier, we will round the probabilities to one decimal place.
For this we will use the NumPy `round` function, so we have to import numpy as well.


```python
import numpy as np

def one_prob_csv_writer(this_prob, all_data):
    """
    Writes a csv file for data for a given probability.

    this_prob --- probability for which data is extracted
    all_data --- DataFrame with the galaxy data
    """

    # Select data whose rounded probability matches this_prob
    class_df = all_data[np.round(all_data.p_e, 1) == this_prob]

    # create new file name
    filename = 'gals_by_class/Probability' + str(this_prob) + '_gals.csv'

    # Write the new DataFrame to a csv file
    class_df.to_csv(filename)
```

In [None]:
import numpy as np

def one_prob_csv_writer(this_prob, all_data):
    """
    Writes a csv file for data for a given probability.

    this_prob --- probability for which data is extracted
    all_data --- DataFrame with the galaxy data
    """

    # Select data whose rounded probability matches this_prob
    class_df = all_data[np.round(all_data.p_e, 1) == this_prob]

    # create new file name
    filename = 'gals_by_class/Probability' + str(this_prob) + '_gals.csv'

    # Write the new DataFrame to a csv file
    class_df.to_csv(filename)

The text between the two sets of triple double quotes is called a docstring and
contains the documentation for the function. It does nothing when the function
is running and is therefore not necessary, but it is good practice to include
docstrings as a reminder of what the code does. Docstrings in functions also
become part of their 'official' documentation:

In [None]:
one_prob_csv_writer?
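
The `?` syntax only works in IPython and Jupyter. In plain Python the same documentation is available through the `__doc__` attribute and the built-in `help` function. The function `double` below is a made-up example used only to show this.

```python
def double(value):
    """Return twice the input value."""
    return 2 * value

# The docstring is stored on the function object itself
print(double.__doc__)    # Return twice the input value.

# help() shows the signature together with the docstring
help(double)
```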


In [None]:
one_prob_csv_writer(0.5,gal_df)

Check the `gals_by_class` directory for the file. Did it do what you expect?

What we really want to do, though, is create files for multiple probabilities without
having to request them one by one. Let's write another function that replaces
the entire for loop by simply looping through a sequence of probabilities and repeatedly
calling the function we just wrote, `one_prob_csv_writer`:


```python
def prob_data_csv_writer(start_prob, end_prob, all_data):
    """
    Writes separate csv files for each probability in the data.

    start_prob --- the first probability of data we want
    end_prob --- the last probability of data we want
    all_data --- DataFrame with the galaxy data
    """

    # "end_prob" is the last prob of data we want to pull, so we loop to end_prob+0.1
    for prob in np.arange(start_prob, end_prob+0.1, 0.1):
        # round to 1 digit so floating point error does not break the comparison
        one_prob_csv_writer(np.round(prob, 1), all_data)
```

In [None]:
def prob_data_csv_writer(start_prob, end_prob, all_data):
    """
    Writes separate csv files for each probability in the data.

    start_prob --- the first probability of data we want
    end_prob --- the last probability of data we want
    all_data --- DataFrame with the galaxy data
    """

    # "end_prob" is the last prob of data we want to pull, so we loop to end_prob+0.1
    # (the built-in range() only accepts integers, so we use np.arange)
    for prob in np.arange(start_prob, end_prob+0.1, 0.1):
        # round to 1 digit so floating point error does not break the comparison
        one_prob_csv_writer(np.round(prob, 1), all_data)

Because people will naturally expect that the end probability is the last
value for which a file is written, the for loop inside the function ends at `end_prob + 0.1`.
This is because `np.arange()`, like the built-in `range()`, does not include the last number; try it for yourself.
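
The excluded end point is easiest to see with small integers. The sketch below uses the built-in `range`, and then shows one possible way (an assumption, not the approach used in the function above) to build evenly spaced decimal steps from an integer counter.

```python
# range() stops just before its second argument
stops_early = list(range(1, 4))
print(stops_early)     # [1, 2, 3]

# Adding one step to the end value makes the end point inclusive
inclusive = list(range(1, 4 + 1))
print(inclusive)       # [1, 2, 3, 4]

# range() only accepts integers, so build decimal steps from an integer counter
probs = [round(0.3 + 0.1 * i, 1) for i in range(3)]
print(probs)           # [0.3, 0.4, 0.5]
```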

By writing the entire loop into a function, we've made a reusable tool for whenever
we need to break a large data file into one file per probability value. Because we can specify the
first and last probability for which we want files, we can even use this function to
create files for a subset of the probabilities available. This is how we call this
function:

In [None]:
prob_data_csv_writer(0.3,0.5,gal_df)

**BEWARE!** If you are using IPython Notebooks and you modify a function, you MUST
re-run that cell in order for the changed function to be available to the rest
of the code. Nothing will visibly happen when you do this, though, because
simply defining a function without *calling* it doesn't produce an output. Any
cells that use the now-changed functions will also have to be re-run for their
output to change.

### Challenge:

1. Add two arguments to the functions we wrote that take the path of the
   directory where the files will be written and the root of the file name.
   Create a new set of files with a different name in a different directory.

The functions we wrote demand that we give them a value for every argument.
Ideally, we would like these functions to be as flexible and independent as
possible. Let's modify the function `prob_data_csv_writer` so that the
`start_prob` and `end_prob` default to the full range of the data if they are
not supplied by the user. Arguments can be given default values with an equal
sign in the function declaration. Any argument without a default
value (here, `all_data`) is required and MUST come before the
arguments with default values (which are optional in the function call).

```python
def prob_range_data_arg_test(all_data, start_prob=None, end_prob=None):
    """
    Modified from prob_data_csv_writer to test default argument values!

    all_data --- DataFrame with the galaxy data
    start_prob --- the min probability of data we want --- default: None - check all_data
    end_prob --- the max probability of data we want --- default: None - check all_data
    """

    # "is None" keeps a valid probability of 0.0 from being treated as missing
    if start_prob is None:
        start_prob = min(all_data.p_e)
    if end_prob is None:
        end_prob = max(all_data.p_e)

    return start_prob, end_prob
```

In [None]:
#define function

In [None]:
#test function

The default values of the `start_prob` and `end_prob` arguments in the function
`prob_range_data_arg_test` are now `None`. This is a built-in constant in Python
that indicates the absence of a value: essentially, the variable exists in
the namespace of the function (the directory of variable names) but
doesn't hold a meaningful value yet.
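
When checking whether an optional argument was supplied, comparing with `is None` is usually safer than testing `not argument`, because a legitimate value such as a probability of `0.0` also counts as false. The function `describe_cut` and its fallback of 0.5 are hypothetical, chosen only to illustrate the pattern.

```python
def describe_cut(min_prob=None):
    """Return the cut that will be applied (hypothetical helper)."""
    # 'is None' detects a missing argument without rejecting a valid 0.0
    if min_prob is None:
        min_prob = 0.5   # fall back to an assumed default
    return min_prob

print(describe_cut())       # 0.5
print(describe_cut(0.0))    # 0.0
print(describe_cut(0.8))    # 0.8

# 'not min_prob' would treat 0.0 as missing, because 0.0 is falsy
print(not 0.0)              # True
```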

The body of the test function now has two conditional statements (if statements) that
check the values of `start_prob` and `end_prob`. An if statement executes its
body when some condition is met. They commonly look something like this:

```python
a = 5

if a < 0:  # meets first condition?

    # if a IS less than zero
    print('a is a negative number')

elif a > 0:  # did not meet first condition. meets second condition?

    # if a ISN'T less than zero and IS more than zero
    print('a is a positive number')

else:  # met neither condition

    # if a ISN'T less than zero and ISN'T more than zero
    print('a must be zero!')

# Output:
# a is a positive number
```

Change the value of `a` to see how this code works. The statement `elif`
means "else if", and all of the conditional statements must end in a colon.
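
Instead of editing `a` by hand, the same conditions can be wrapped in a function and tried on several values in a loop. This is a sketch that combines the ideas from this lesson; the function name `describe_sign` is made up for the example.

```python
def describe_sign(a):
    """Return a short description of the sign of a number."""
    if a < 0:
        return 'negative'
    elif a > 0:
        return 'positive'
    else:
        return 'zero'

# Try one value for each branch of the if statement
for value in [-2, 0, 5]:
    print(value, 'is', describe_sign(value))
```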

Some more useful info to get you started and keep you going with notebooks and pandas:

**Notebook tips and tricks**
https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/

**Pandas cheat sheet**
https://www.analyticsvidhya.com/blog/2015/07/11-steps-perform-data-analysis-pandas-python/


**Join the ADACS facebook group to stay up-to-date with upcoming training!**