# Introduction to Python and Jupyter notebooks
_A python exercise notebook written by Rita Tojeiro, October 2017; last revised in August 2020. 
This notebook has benefited from examples provided by Britt Lundgren (University of North Carolina) and Jordan Raddick (Johns Hopkins University), and resource lists by Rick Muller (Sandia National Laboratories)._

Welcome to the AS4010 computational exercise. The aim is to enable and encourage you to explore the largest astronomical dataset in existence, and derive some conclusions about galaxy properties, large-scale structure, and spectral data. This will help you to consolidate the knowledge you gain from lectures, as well as bring you into direct contact with the data that your lecturers are using every day to conduct their research.


Throughout the semester you will work through three Python notebooks which will:
   1. Refresh your memory about python, or introduce you to python if you've never used it before (unassessed).
   2. Introduce you to the analysis of Sloan Digital Sky Survey (SDSS) data (unassessed).
   
We recommend that you complete these in the first 3-4 weeks of the semester.
   
After that, you will conduct a short research project using the tools you have learned (assessed - 20% of the total module mark).

It is our sincere hope that you take this brief introduction and explore this vast dataset in your own time - the sky is your limit, and at your fingertips is now the most used astronomy dataset in the world.

## Jupyter notebooks and completing this notebook

Python is a programming language, and notebooks are web applications consisting of sequences of code cells that allow you to run snippets of code and keep a record of your work - just like a lab book. They are also a useful way to guide users through a pre-established workflow, and that is the way we will use them here. 

Cells in notebooks can have multiple functions. In our case we will use _code_ cells or _markdown_ cells. You can change the function of a cell on the dropdown menu at the header of the page. Code cells interpret and execute code, and markdown cells allow you to write down text - just as I am doing now.

To run the code in a cell you simply type in your commands, then press **shift+enter** to execute (just pressing enter will give you a new line). 

The way this exercise is structured is via a combination of pre-filled code cells, which you need to execute, and empty code cells, which you need to fill in and then execute. If you wish to add a new cell, either press the "+" sign in the menu at the top of the page, or click Insert->Insert Cell Above/Below.

Jupyter notebooks are auto-saved with some frequency, but you are strongly encouraged to File->Save and Checkpoint at the end of every completed exercise. When you want to close your notebook, you need to **File->Close and Halt** (instead of just closing your browser window).

This first introductory notebook is a highly incomplete, extremely brief introduction to Python. It is designed to give you the minimum set of tools to interact with data, and successfully complete the coursework. You are encouraged to use additional resources to solidify your knowledge of Python. 

## Submission

**This notebook is NOT assessed, and it is OPTIONAL.**

You do not need to hand in your notebook for marking. If you are unfamiliar with Python you are strongly encouraged to follow the instructions and experiment with the code provided. If you're familiar with Python, Numpy, Matplotlib and Pandas, you may work through it a little faster.


### Python resources

* [Python for Astronomers](https://prappleizer.github.io/index.html) An introduction to Python, aimed at complete beginners. Chapters 2, 4 and 7 are particularly relevant for this project, and recommended. You can download the interactive tutorials and upload them onto SciServer to run them (select Python 2 if SciServer asks you for a kernel when you open the tutorial).

* [Python Tutorial](http://docs.python.org/2/tutorial/) First stop reference for basic tasks (looping, string formatting, classes, functions, etc). A good place to look for specific examples. Sections 3, 4 and 5 are the most relevant for this module, but go beyond what you will need here. 
* [Python Standard Library](http://docs.python.org/2/library/) The definitive reference for everything Python can do out of the box. Many common and not so common tasks are already taken care of. Not suitable for the complete beginner.


## Part 1 - Getting started with Python and Jupyter notebooks

### 1.1 The basics
We will begin with the traditional start in any programming language - by typing in "Hello world!", and asking python to display the message using the print() function.

I've filled in the python command in the cell below for you. To execute it, select the cell with your mouse and press shift+enter.

In [None]:
print('Hello world!')

The print() function can print the value of a wide range of variables. In the next cells, we define 5 variables, and print their values in different ways. Execute it. Do all the printed values make sense? You can change the variable values and the operations within the print function argument to make sure you understand what is happening.

In [None]:
x = 10 #an integer
y = 2.0 #a float
given_name = 'Rita'#a string
favourite_colour = 'purple'#another string
blank = ' '

In [None]:
print(x,y,given_name)

In [None]:
print(x + y)

In [None]:
print(x/y)

In [None]:
print(given_name + blank + str(x))
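
The `str(x)` in the last cell is there because Python will not join a string and a number with `+`. As a minimal sketch of what happens without it, and of an f-string as one possible alternative (the name `message` is just illustrative, and the `try`/`except` is only used here so that the example keeps running after the error):

```python
x = 10
given_name = 'Rita'

# Adding a string and an integer directly raises a TypeError
try:
    print(given_name + x)
except TypeError as err:
    print('TypeError:', err)

# Converting the integer to a string first works
message = given_name + ' ' + str(x)
print(message)

# An f-string converts the value to text for you
print(f'{given_name} {x}')
```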

Fill in the values in the cell below, and _print_.

In [None]:
my_given_name=
my_favourite_colour=


In our first example, we asked python to print the sum of two numbers. Python performed the computation, and printed the outcome. We can also create new variables by performing operations on variables that we have previously named, like so:

In [None]:
h = x + y
print(h)

In [None]:
h = x - y
print(h)

In [None]:
h = x/y
print(h)

In [None]:
h = x*y
print(h)

In [None]:
h = x**y
print(h)

From the example above, you can also see that if you assign a different value to a variable, you overwrite its previous value. Variables have no memory!

### 1.2 Error messages ###

If you make an error in syntax or do something else wrong (like calling a variable that isn't defined), python will print an error message. Don't panic. Python error messages can be verbose and long, but typically the very last line will give you a clue as to the reason for the error. Execute the next two cells: they will result in an error. Amend the code in each cell and re-execute until you don't get an error message.

In [None]:
print(t)

In [None]:
print('My surname is ' my_surname)
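
The last line of an error message usually has the form `ErrorType: description`. As a rough sketch of how to read it, the example below triggers an error on purpose and prints only that last line. The name `undefined_name` is hypothetical and deliberately never defined, and the `try`/`except` construct is just a way of catching the error so that it can be printed.

```python
# 'undefined_name' is a hypothetical variable that is never defined
try:
    print(undefined_name)
except NameError as err:
    # Build the same "ErrorType: description" text that ends the error message
    last_line = type(err).__name__ + ': ' + str(err)
    print(last_line)
```

A `NameError` generally means that a variable was used before it was given a value, or that its name was mistyped.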

### 1.3 Loops

**Loops** allow you to execute a set of instructions a number of times, either according to a counter (in a *for* loop), or for as long as some condition is satisfied (in a *while* loop). For example, here's a quick way to print the numbers 0 to 9:

In [None]:
for i in range(0,10):
    print(i)
print("loop is finished")

**Indentation** is extremely important in Python, and it defines whether lines of code fall within a loop (or function), or not. In the example above, we created a loop where a variable i varies from 0 to 9. Within each iteration of the loop we printed the value of i. When the loop was finished, we printed a statement - note the indentation of the print statement. 

What would happen if the print statement was indented? Write down your answer **before** you execute the next code cell.
 

**Write your answer here** (double-click this cell to edit it): 

In [None]:
for i in range(0,10):
    print(i)
    print("loop is finished")

Use the next cell to write some code that will:

 1. in each iteration compute a variable called t that is 2 times i, and print its value.
 2. print "loop is running" in each iteration

In [None]:
#answer here
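
This section mentions *while* loops but only shows a loop driven by a counter. As a minimal sketch, here is a while loop that keeps doubling a number for as long as it is no greater than 100 (the names `value` and `n_steps` are just illustrative):

```python
value = 1
n_steps = 0

# The indented lines repeat for as long as the condition is True
while value <= 100:
    value = value * 2
    n_steps = n_steps + 1

# This line is not indented, so it runs once, after the loop has ended
print(value, n_steps)
```

Unlike the counter loop, the number of iterations is not fixed in advance: it depends on when the condition stops being true.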

### 1.4 Lists and arrays

A variable doesn't need to hold a single value. It can hold a **list** of numbers, strings, or combinations of any type of variable. Below are examples of lists, some operations, and examples of how to access certain elements. 

Does addition do what you expect it to do?

In [None]:
a = [0,1,2,3]
b = ['0','1','2','3']
c = ['hello', 'world', '37']

In [None]:
print(a+b)

In [None]:
print(c + b)

In [None]:
print(c)

In [None]:
print(c[0])

In [None]:
print(a[1] + a[2])

Now access different elements of the lists yourself. Write code that prints the last element of variables `a` and `b`.

In [None]:
#Answer here:

### Indices

The position of a certain element in a list (or array - you can think of arrays as special types of lists, that only hold one type of variable) is called an **index**. As you can see from the above examples, the first element has an index of `0`. So to access the first element of variable `a`, I typed `a[0]`. The second element is at index `1`, and to access it I would type `a[1]`.
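
As a short sketch of how indices work, using a made-up list called `colours` (not one of the variables defined above):

```python
colours = ['red', 'green', 'blue', 'purple']  # hypothetical example list

print(colours[0])                 # first element
print(colours[1])                 # second element
print(len(colours))               # number of elements in the list
print(colours[len(colours) - 1])  # last element: its index is one less than the length
print(colours[-1])                # negative indices count backwards from the end
```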


### 1.5 Numpy

Generally speaking, when doing data analysis and numerical calculations, python lists just don't cut it. We will be using the **Numpy library** to create and manipulate arrays, and operate on them using built in functions.

The Numpy library is the math workhorse of Python. Everything is vectorized, so `a*b+c**d` makes sense if a is a number **or** an n-dimensional array. This tutorial gives you some very limited examples of the potential of numpy; below are some extra resources.

* [Numpy Example List](http://wiki.scipy.org/Numpy_Example_List_With_Doc) Examples using every Numpy function. Keep close at hand!
* [Numpy for Matlab Users](http://wiki.scipy.org/NumPy_for_Matlab_Users) Know Matlab? Then use this translation guide.
* [Numpy Reference](http://docs.scipy.org/doc/numpy/reference/) Main documentation

Numpy is not automatically loaded when you start up python (or a notebook), so we have to explicitly import it. We'll call it **np** for short.


In [None]:
import numpy as np

We'll define some numpy arrays using np.array() and make some simple calculations. Does it behave the way you expect? You can use the last cell to experiment, and make other computations with the arrays.

In [None]:
a = np.array([0,1,2,3]) #a numpy array
b = np.array([4,5,6,7]) #another numpy array

In [None]:
print(a + b)

In [None]:
print(a * b)

In [None]:
print(a - b)

In [None]:
#experiment here
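
To make the difference from section 1.4 explicit, here is a small side-by-side sketch of the same operations on plain lists and on numpy arrays (the names `list_a`, `list_b`, `array_a` and `array_b` are illustrative):

```python
import numpy as np

list_a = [0, 1, 2, 3]
list_b = [4, 5, 6, 7]
array_a = np.array(list_a)
array_b = np.array(list_b)

print(list_a + list_b)    # lists are joined end to end
print(array_a + array_b)  # arrays are added element by element

print(list_a * 2)         # the list is repeated twice
print(array_a * 2)        # each element of the array is doubled
```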

Next we will work with larger arrays. We'll begin by using a numpy function, np.linspace(), to create an array, and demonstrate how to access elements in several ways.

At its simplest, the command `np.linspace(i_start, i_stop, N)` will return an array with `N` elements, where the first element is set to `i_start`, the last element is set to `i_stop`, and the other elements are linearly spaced in between these two. More information here:

https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.linspace.html


In [None]:
N=100
i_start=0
i_stop=99
x = np.linspace(i_start,i_stop,N)
print(x)
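
A smaller case makes the spacing easier to see. In this sketch (the names `z` and `spacing` are illustrative), five points are requested between 0 and 1:

```python
import numpy as np

# 5 points from 0 to 1, with both end points included
z = np.linspace(0, 1, 5)
print(z)

# The spacing is (last - first) / (N - 1), because N points have N - 1 gaps
spacing = z[1] - z[0]
print(spacing)
```

This is why the cell above uses `i_stop=99` with `N=100`: it gives a spacing of exactly 1.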

You can operate on the whole array, like so:

In [None]:
y = x + 1
print(y)


In [None]:
y = x*2
print(y)

And you can access and operate on individual elements of the array, by specifying their **index**, like so:

In [None]:
a = x[20]
print(a)

In [None]:
b = x[20] + x[21]
print(b)

You can also access more than one element of arrays, in a specified range of **indices**, for example:

In [None]:
x[0:10]

In [None]:
x[:8]

In [None]:
x[40:45]
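
The rules behind these ranges can be sketched on a shorter array (here a made-up array `v` holding the numbers 0 to 9):

```python
import numpy as np

v = np.linspace(0, 9, 10)   # illustrative array: 0., 1., ..., 9.

print(v[2:5])       # indices 2, 3 and 4: the stop index 5 is not included
print(v[:3])        # leaving out the start means "from the beginning"
print(v[7:])        # leaving out the stop means "up to the end"
print(v[::2])       # a third number sets the step: every second element
print(len(v[2:5]))  # a range [start:stop] holds stop - start elements
```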

The following cell will give an error. Why?

In [None]:
x[200]

Finally, you can also define an **array of indices**, to make it easier to access certain parts of your array. For example, if you were only interested in every 10th element of an array, you could define a new variable, which we will call `indices`, and that we can use to access those elements quickly, by writing `x[indices]`:

In [None]:
indices = np.array([10,20,30])
print(x[indices])
print(y[indices])
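
The cell above types three indices by hand. To pick out every 10th element of the whole array, one option is the numpy function `np.arange(start, stop, step)`, which builds the array of indices for you. A minimal sketch, with illustrative names `x_demo` and `every_tenth`:

```python
import numpy as np

x_demo = np.linspace(0, 99, 100)

# Indices 0, 10, 20, ..., 90 (the stop value 100 itself is not included)
every_tenth = np.arange(0, 100, 10)
print(every_tenth)
print(x_demo[every_tenth])
```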

This is where things get more complicated, but also more powerful. We will use a Numpy function, called `where()`, to return the indices of all array elements that pass a certain condition.

(https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.where.html) 

For example, let us say that you want to know all the elements of `y` that have a value that is less than 10. 

We begin by using `where()` to return the indices that hold elements for which that is true (note, because `where` is a Numpy function, we have to precede it with `np.`, so we write `np.where`):

In [None]:
y_is_small = np.where(y < 10)[0]

(aside: every time we use `np.where` we need to add a `[0]` at the end of the command. If you really want to know why, ask me, but you don't really need to, as long as you remember. Hint: next week's Lab will be harder if you forget!)

In [None]:
print(y_is_small)

So there are 5 elements of the array `y` that have a value less than 10. We can check that this is true, by printing the values of `y` at these indices:

In [None]:
print(y[y_is_small])

Good!
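
For the curious, here is a short sketch of where the `[0]` comes from. When it is given only a condition, `np.where` returns a tuple holding one array of indices per dimension of the input, so for a one-dimensional array the indices we want are the first (and only) item of that tuple. The array `w` and the names `result` and `mask` are illustrative.

```python
import numpy as np

w = np.array([0, 2, 4, 6, 8, 10, 12])

result = np.where(w < 10)
print(type(result))   # a tuple, not an array
print(len(result))    # one entry, because w is one-dimensional
print(result[0])      # the array of indices itself

# The condition on its own is an array of True/False values (a "mask"),
# which can also be used to select elements directly
mask = w < 10
print(w[mask])
```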

Multidimensional arrays (matrices) are straightforward. Let's define an array `a` that has two rows:

In [None]:
a = np.array( [[1,2,3], [4,5,6]])
print(a)

Element access is done in order row->column. So to access the element in row 1 and column 0 (counting starts at 0, so this is the second row and the first column) you would write:

In [None]:
a[1,0]
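
A few more ways of reaching into a two-dimensional array, sketched on the same numbers (the name `m` is illustrative):

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

print(m.shape)   # (number of rows, number of columns)
print(m[1, 0])   # row 1, column 0
print(m[0, :])   # the whole of row 0
print(m[:, 2])   # the whole of column 2
```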

Numpy arrays have methods (functions) that allow you to easily compute basic statistics. To compute the maximum value, the mean value, and the number of elements in `y`, you can write:

In [None]:
print(y.max(), y.mean(), len(y))

### 1.6 Visualisation

In this Lab, you will be using Python to interact with data. A large part of that is visualisation. Being able to explore aspects of your data using graphs is a fundamental skill in academia and industry at large.

We will use the Python library **matplotlib** - an extensive library with extraordinary functionality. We won't even scratch the surface here, and instead focus on being able to produce basic, clear plots.

More information here: http://matplotlib.org/api/pyplot_summary.html

First we will import the library's plotting module, `pyplot`, which we'll call plt for short.

In [None]:
import matplotlib.pyplot as plt

We're ready for our first plot. Throughout this lab you will be asked to produce plots in exercises. **Always label your axes, provide a sensible title to your plot, and add a legend where necessary!** 

Let's do a simple **line plot** of y vs x, and label everything sensibly. Line plots will link data points with a line.

In [None]:
x = np.linspace(0,99,100)
y = x**2

In [None]:
plt.figure(figsize=(10,8))
plt.plot(x,y, label='y = x$^2$')
plt.xlabel('x', fontsize=20)
plt.ylabel('y', fontsize=20)
plt.title('Hello matplotlib!')
plt.legend(loc='upper left')

Let's zoom in at the very start of our curve, and explicitly mark the actual data points with small circles. Experiment with some of the commands and parameters below.

In [None]:
plt.figure(figsize=(10,8))
plt.plot(x,y, label='y = x$^2$', marker='o') #see https://matplotlib.org/api/markers_api.html
plt.xlabel('x', fontsize=20)
plt.ylabel('y', fontsize=20)
plt.title('Hello matplotlib!')
plt.legend(loc='upper left')
plt.xlim(0,10)
plt.ylim(0,100)

**Histograms** are simply counting plots, separating data points into **bins** according to the value of the data points. 

We will begin by creating two arrays, each with 5000 points drawn randomly from a Gaussian distribution with a mean of 0 and a standard deviation of 1. 

https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.random.html

In [None]:
y1 = np.random.randn(5000)
y2 = np.random.randn(5000)

You can visualise your data by plotting the value of an array (say y1) as a function of its index number, as exemplified below.

In [None]:
plt.figure(figsize=(15,6))
plt.plot(y1, marker='o', linestyle='none')
plt.xlabel('array index number i', fontsize=15)
plt.ylabel('y1[i]', fontsize=15)

By eye, it looks like most elements of y1 have a value around zero, and most are perhaps between -1 and 1. A histogram bins the values of an array according to their value, so you can easily inspect the most frequent values.

Let us now make a histogram for each variable. We will count how many array elements have a certain value, in 20 bins ranging from -5 to 5.

In [None]:
plt.figure(figsize=(10,8))
plt.hist(y1, bins=20, range=(-5,5), label='y1', color='purple', histtype='step') #histtype='step' draws the histogram as a line, rather than filled boxes like the next one
plt.hist(y2, bins=20, range=(-5,5), alpha=0.4, label='y2', color='green') #note the alpha value - it makes the second plot slightly transparent (alpha=1 is opaque)
plt.ylabel('N(value)', fontsize=15)
plt.xlabel('value of random number', fontsize=15)
plt.title('A histogram', fontsize=15)
plt.legend( fontsize=15)

Experiment with the plt.hist() command, changing the width of the bins, or the number of bins. You can also change the colour, labels, etc (**N.B.: the keyword uses American spelling - you'll need to use `color`**).
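
To see the counting that sits behind a histogram, the numpy function `np.histogram` returns the numbers without drawing anything: the count in each bin, and the edges of the bins. A minimal sketch with the same binning as above; the seeded generator `rng` and the name `sample` are illustrative choices that keep the example self-contained and repeatable:

```python
import numpy as np

rng = np.random.default_rng(42)      # seeded, so the numbers are repeatable
sample = rng.standard_normal(5000)   # 5000 Gaussian random numbers

counts, edges = np.histogram(sample, bins=20, range=(-5, 5))

print(len(counts))          # one count per bin
print(len(edges))           # one more edge than there are bins
print(edges[1] - edges[0])  # bin width: (5 - (-5)) / 20
print(counts.sum())         # how many values fell inside the range
print(counts)               # the bins near zero hold most of the values
```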

Finally, we will look at **scatter** plots, useful for when we need to plot data that we don't necessarily think should be connected with a line. For example, when we are looking at how variables correlate with one another, or where they sit in some parameter space.

The y1 and y2 arrays are sets of (uncorrelated) random numbers. That means that the value of element 1 (for example) in y1 will be independent of (or uncorrelated with) the value of element 1 in y2. Let's plot the values of y1 against the values of y2. Can you predict what you will see?

In [None]:
plt.figure(figsize=(10,10))
plt.scatter(y1,y2, marker='*', s=50)
plt.xlabel('y$_1$', fontsize=20)
plt.ylabel('y$_2$', fontsize=20)
plt.title('A scatter plot')
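
One way to put a number on "uncorrelated" is the correlation coefficient, which numpy computes with `np.corrcoef`. The sketch below uses its own seeded random arrays `r1` and `r2` (illustrative names, standing in for y1 and y2) so that it can be run on its own:

```python
import numpy as np

rng = np.random.default_rng(1)   # seeded, so the numbers are repeatable
r1 = rng.standard_normal(5000)
r2 = rng.standard_normal(5000)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation coefficient between the two arrays
corr = np.corrcoef(r1, r2)[0, 1]
print(corr)   # close to 0 for independent arrays

# For comparison, an array is perfectly correlated with a scaled copy of itself
corr_self = np.corrcoef(r1, 2 * r1)[0, 1]
print(corr_self)
```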

You can easily select data elements from your dataset according to their value, and see where they sit in your plot. For example, let us look at where all the datapoints with negative y1 are - we will use the `np.where` function again.

In [None]:
y1_neg = np.where(y1 < 0)[0]
plt.figure(figsize=(10,10))
plt.scatter(y1,y2, marker='*', s=50)
plt.scatter(y1[y1_neg],y2[y1_neg], marker='*', s=50, color='red')
plt.xlabel('y$_1$', fontsize=20)
plt.ylabel('y$_2$', fontsize=20)

You can also have more than one plot in the same figure, using the plt.subplot() command. See the example below. 

https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplot.html?highlight=subplot#matplotlib.pyplot.subplot

In [None]:
y1_neg = np.where(y1 < 0)[0]

plt.figure(figsize=(20,10)) #notice we changed the size of our figure, to fit two square subplots side by side.

#first subplot
plt.subplot(121)    #the arguments passed on to subplot 
                    #define the size of the grid of plots, 
                    #and which cell in the grid you want to plot.
                    #In this case we are working on a grid of 1 row and 2 columns,
                    #and this is plot number 1
plt.scatter(y1,y2, marker='*', s=50)
plt.xlabel('y$_1$', fontsize=20)
plt.ylabel('y$_2$', fontsize=20)
plt.title('all points')
plt.xlim(-5,5)


plt.subplot(122)   #now on to plot number 2
plt.scatter(y1[y1_neg],y2[y1_neg], marker='*', s=50, color='red')
plt.xlabel('y$_1$', fontsize=20)
plt.ylabel('y$_2$', fontsize=20)
plt.title('negative y1')
plt.xlim(-5,5)
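
The three digits passed to `plt.subplot` can also be written as three separate numbers, so `plt.subplot(121)` is the same as `plt.subplot(1, 2, 1)`. As a sketch of how the plot number works on a larger grid, the example below makes four empty plots on a grid of 2 rows and 2 columns. The `matplotlib.use('Agg')` line is an assumption that makes the example run outside a notebook; you would not need it here.

```python
import matplotlib
matplotlib.use('Agg')   # assumption: draw off-screen; not needed in a notebook
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 8))
for k in range(1, 5):
    # 2 rows, 2 columns, plot number k
    # (plot numbers start at 1 and run along each row in turn)
    plt.subplot(2, 2, k)
    plt.title('plot number ' + str(k))
plt.tight_layout()   # adjusts the spacing so that titles do not overlap
```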

### 1.7 Pandas

The next library we will consider is called **pandas** - the Python data analysis library. The workhorse of pandas is an object called a **DataFrame**. Think of it as a **table of data** that can be queried like a **database** and manipulated very efficiently, especially for large datasets. Some of the objects used in next week's session are DataFrames, so let us have a quick tour of how to access DataFrame elements (DataFrames are arguably one of the most powerful tools for data analysis in Python - we will not use them to their full potential).

We will begin by importing pandas, as pd for short.

https://pandas.pydata.org

In [None]:
import pandas as pd

Next we will create a dataframe, called df, with two columns, each holding one of our arrays of random elements. DataFrames are indexed (that's the first column you see below), allowing one to access elements very efficiently. The next few cells demonstrate how to access DataFrame elements. 

In [None]:
df = pd.DataFrame({'y1':y1, 'y2':y2})
df

To access a specific row of data, you do so by specifying the location of that row. E.g., to see the values of y1 and y2 in row 10:

In [None]:
df.loc[10]

If you want to access a column specifically, say the value of y1 in row 10:

In [None]:
df.loc[10]['y1']
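
The same single value can be reached in more than one way. A brief sketch on a small made-up DataFrame (`small_df` and its values are illustrative, not part of the exercise):

```python
import pandas as pd

# A small illustrative DataFrame with made-up values
small_df = pd.DataFrame({'y1': [0.1, -0.4, 1.2], 'y2': [0.3, 0.9, -0.7]})

print(small_df.loc[1]['y1'])   # select the row, then the column
print(small_df.loc[1, 'y1'])   # row label and column name in a single call
print(small_df['y1'][1])       # select the column, then the row label
```

All three print the same number. The single-call form `df.loc[row, column]` is often preferred in pandas code, but the form used in this notebook works just as well for reading values.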

You can also slice data frames in the same way you did with numpy arrays. For example, to first select all rows where y2 is greater than 0 but less than 0.5, and then print those y2 values, you might do:

In [None]:
y2_pos = np.where( (y2 >0) & (y2 < 0.5))[0]
df.loc[y2_pos]['y2']
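
An alternative to `np.where` that you may come across is to select rows with a column of True/False values, often called a mask. A minimal sketch on a made-up DataFrame (`toy_df`, `mask` and `selected` are illustrative names):

```python
import pandas as pd

# A small illustrative DataFrame with made-up values
toy_df = pd.DataFrame({'y1': [0.1, -0.4, 1.2, 0.8],
                       'y2': [0.3, 0.9, -0.7, 0.2]})

# Each comparison gives a column of True/False values, and & combines
# them row by row. The parentheses are needed because & is evaluated
# before > and < in Python.
mask = (toy_df['y2'] > 0) & (toy_df['y2'] < 0.5)
print(mask)

# Keep only the rows where the mask is True
selected = toy_df[mask]
print(selected)
```

The same parentheses appear in the `np.where` call above, for the same reason.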

Similarly to what you did before, you can then plot slices of your data frame in different colours. E.g.:

In [None]:
plt.figure(figsize=(10,8))
plt.scatter(df['y1'],df['y2'], marker='*', s=50)
plt.scatter(df.loc[y2_pos]['y1'],df.loc[y2_pos]['y2'], marker='*', s=50, color='red')
plt.xlabel('y$_1$', fontsize=20)
plt.ylabel('y$_2$', fontsize=20)


**Congratulations!** You should now have a reasonable working knowledge of how python can be used to explore and analyse data. In the next notebook we will put these commands into practice to investigate some SDSS data. 