# Exercise 01 | Data Basics

Written By: Aiden Zelakiewicz (asz39@cornell.edu)

Inspired By: tOSU ASTRO1221 HW02 by [Prof. Donald Terndrup](https://astronomy.osu.edu/people/terndrup.1)

In this exercise, you will be introduced to manipulating data in Python with a couple of different packages.
We will use selected stars from the [Landolt 1992](https://ui.adsabs.harvard.edu/abs/1992AJ....104..340L/abstract) photometric standards catalog.
The file included in this exercise, `landolt.dat`, contains stellar names, Right Ascension, Declination, $V$-band magnitude, $B-V$ color, and parallax in milliarcsec.

When you see a "<---TO DO--->" header in this notebook, that is an exercise for you to complete! If you are struggling and have questions, ask one of the graduate students for help and we will lead you in the right direction :D


## Goal for This Exercise:

Create a Distance vs Absolute Magnitude Diagram while learning how to

1. Import data

2. Create Functions

3. Use if/else Statements

4. Loops and list comprehension

5. Perform Operations on Data

6. Export your data to file

## Importing Data

Importing data, as one might guess, is necessary for most astronomy applications of Python. There are various Python packages you might find useful to do so, such as `numpy` or `pandas`.

### Using Python

Using native Python to import data can be tedious, and I do not recommend doing so.
For completeness' sake, I will include it... hesitantly.

By default, every line is read in as a single `string` and needs to be split apart manually.
If you want a line to contain various data types, you need to convert each column manually, which can be very time-consuming.

In [None]:
fname = "landolt.dat"

# Read in the file
file = open(fname, 'r')
lines = file.readlines()
file.close() # Make sure to close the file

# Remove the first 23 lines, which includes comments and header information
landolt_dat = lines[23:]

# Split by whitespace using list comprehension
# We will return to list comprehension later in the bootcamp
landolt_dat = [line.split() for line in landolt_dat]

# Change all columns to floats
landolt_dat = [[float(col) for col in row] for row in landolt_dat]

print(landolt_dat[0])

print("Type:", type(landolt_dat))

That was very difficult! 99% of the time you should avoid using this, as packages make it much easier. Your final datatype will be a `list`, which is a very common way to organize and manipulate data.
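
If you do end up reading a file by hand, a `with` block is a common way to make sure the file gets closed. The sketch below is only an illustration: the two-row table, its column names, and the temporary file are made up, and it shows one way to convert mixed data types column by column.

```python
import os
import tempfile

# Made-up two-row table standing in for a real data file (assumption)
sample_text = "# name V BV\nstar_a 11.2 0.55\nstar_b 9.8 1.10\n"

# Write the sample to a temporary file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as tmp:
    tmp.write(sample_text)
    tmp_name = tmp.name

rows = []
# The with block closes the file automatically, even if an error occurs
with open(tmp_name, "r") as f:
    for line in f:
        if line.startswith("#"):  # skip comment lines
            continue
        name, v_mag, bv = line.split()
        # Convert each column to the type we want by hand
        rows.append([name, float(v_mag), float(bv)])

os.remove(tmp_name)  # clean up the temporary file
print(rows)
```

Each row keeps the name as a `string` while the other two columns become `float`s.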

### Using Numpy

There are a handful of functions to import data using Numpy. Some key ones to remember are `np.loadtxt()` and `np.genfromtxt()`. The latter offers functionality similar to Pandas dataframes and Python dictionaries, in that you can access the columns by name.

Numpy organizes data in `ndarray`s, which are basically fancy and improved `list`s. We will explore some of the benefits of Numpy `ndarray`s further on.
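
As a small aside, here is a sketch of what accessing columns by name with `np.genfromtxt()` can look like. The table below is made up (it is not the Landolt file) and is read from an in-memory string so the example runs on its own.

```python
import io
import numpy as np

# Made-up table with a header row (assumption, not the Landolt file)
sample = io.StringIO("V BV\n11.2 0.55\n9.8 1.10\n")

# names=True tells genfromtxt to take the column names from the first line
table = np.genfromtxt(sample, names=True)

print(table.dtype.names)  # the column names
print(table["V"])         # select a column by name
```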

In [None]:
# Importing the package numpy and giving it the alias np
# This is a common convention and allows us to refer to numpy functions as np.function_name
import numpy as np 

fname = "landolt.dat"

landolt_arr = np.loadtxt(fname, skiprows=23) # Skip the first 23 lines of comments and header

print("Type:", type(landolt_arr))

# You can access rows by index
print(landolt_arr[0])

# Or columns through a cheeky index method
print(landolt_arr[:,0]) # [:] means all rows, then the comma separates the column index

### Using Pandas

Pandas is another package which has become increasingly popular in data science recently due to its easy-to-use and intuitive `DataFrames`.
While Numpy `ndarray`s are fancy `list`s, Pandas `DataFrame`s are more akin to fancy dictionaries (`dict`s).
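
To make the dictionary analogy concrete, here is a minimal sketch with made-up values: a `dict` of equal-length lists can be turned into a `DataFrame`, and columns are then looked up by key in the same way.

```python
import pandas as pd

# Made-up values for illustration (assumption)
star_dict = {"V": [11.2, 9.8], "BV": [0.55, 1.10]}

# A dictionary of equal-length lists converts directly into a DataFrame
star_df = pd.DataFrame(star_dict)

# Columns are looked up by key, just like a dictionary
print(star_dict["V"])
print(star_df["V"].tolist())
```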

In [None]:
import pandas as pd

fname = "landolt.dat"

# Often pandas dataframes have the alias df
landolt_df = pd.read_csv(fname, skiprows=22, sep=r"\s+") # sep=r"\s+" tells pandas to split by ANY whitespace

print("Type:", type(landolt_df))

# We can print the column names and first few rows using the .head() method
# It's a useful way to get a quick look at the data and make sure it was read in correctly
print(landolt_df.head())

In [None]:
# We can access columns by name (more common in pandas)
print(landolt_df['BV'], '\n') # \n is a newline character, just to make the output easier to read

# Or you can access rows by index using .iloc[row_index]
print(landolt_df.iloc[0])

### <---TO DO--->

Included in this folder is a file titled `example.csv` in a common text file format called "comma-separated values", or CSV for short.
View the CSV to see its structure and how it is laid out.
In the following cell, import the file using Pandas (`read_csv`).

**Print the Name and Age of one of the entries.**

NOTE: The delimiter (what separates the data) isn't whitespace anymore; it is a comma!


**HINT:**

Both Numpy and Pandas accept an argument of `delimiter=" "` to define what the delimiter is.

If you have a list inside of a list (or an array in an array) indexed as (i, k), you can access the kth element of the ith list using my_list[i][k]
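
As a quick illustration of nested indexing, here is a sketch on a made-up list (not the exercise data): the first index picks the inner list and the second picks the element inside it.

```python
# Made-up nested list: each inner list is one row (assumption)
my_list = [["a", "b", "c"],
           ["d", "e", "f"]]

i, k = 1, 2
# First index picks the inner list (row), second picks the element in it
element = my_list[i][k]
print(element)  # f
```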


In [None]:
# Write the path to the file
fname = "path/to/file.extension"

# Read in the file
# YOUR CODE HERE

# Print the name and age of an entry
print("YOUR CODE HERE")

## Functions and Conditional Statements

A good motto to live by when writing code is **DRY: Don't Repeat Yourself**!
If you need to write code twice, you should create a function for it.
Here we will create some functions to analyze the data using a couple of different methods.
We will also explore some conditional statements that will give us flexibility and add logic into our code!

Python is organized by indents, which is uncommon among programming languages.
This means that to tell Python that a certain block of code belongs together it must be indented together.
This will become clearer when we create a function and use conditional statements.

### Getting Distances

When observing stars close to us, we can use the observed *parallax* of the star to get a distance.
In fact, a *parsec* is defined as the distance an object is from Earth when the observed parallax is 1 arcsecond ($1"=\frac{1^\circ}{3600}$).

![Image showing the definition of a parsec.](https://wwwhip.obspm.fr/~arenou/images/parsec/pc-def.jpg)

Using a small angle approximation, we can get the distance ($D$) in parsecs from the parallax ($p$) in arcseconds using the formula:
$$
D = 1/p
$$
Because we will probably want to use this formula a lot, it might be a good idea to make a function for it!
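
To get a feel for how good the small angle approximation is, the sketch below compares $D = 1/p$ with the full tangent relation. It assumes the parsec definition given above (1 AU subtending 1 arcsecond), and `exact_distance_pc` is a hypothetical helper written just for this check.

```python
import math

ARCSEC_IN_RAD = math.radians(1 / 3600)  # one arcsecond in radians

def exact_distance_pc(p_arcsec):
    """Distance in parsecs from the full tangent relation (hypothetical helper)."""
    # Taking one parsec as the distance at which 1 AU subtends 1 arcsecond,
    # D [pc] = tan(1") / tan(p)
    return math.tan(ARCSEC_IN_RAD) / math.tan(p_arcsec * ARCSEC_IN_RAD)

p = 0.1  # parallax in arcseconds
approx = 1 / p
exact = exact_distance_pc(p)
print(approx, exact, abs(approx - exact))
```

For parallaxes this small, the two values differ by far less than any realistic measurement uncertainty.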

### Functions

In Python, functions are declared using `def`.
The format for creating a function is `def <function name>(<parameters>):`, where "\<function name\>" is replaced by whatever you want to call your function and "\<parameters\>" by its inputs (if any).
Function, and class, naming follows the same rules as variables.

All code that you want included in the function must be indented so that Python knows it is part of the function, with the standard being four spaces (which most editors insert when you press Tab).
Functions are usually closed using `return`, which can be used to output an object or just tell Python the function has ended.


Functions should include documentation to explain what they do, called a docstring.
Docstrings come immediately after the function declaration and are wrapped in a pair of triple quotation marks (`"""DOCSTRING"""`).
There are some common formatting styles for creating docstrings, but any kind of docstring is better than none!

A second type of function is the `lambda` function.
These are less common but can be very quick and handy.
They are defined and returned all in a single line of the form `function_name = lambda x: (operation on variable x)`.

In [None]:
def dfromp(p):
    """
    Returns the distance to a star in parsecs given its parallax in arcseconds.

    Parameters
    ----------
    p : float
        parallax in arcseconds

    Returns
    -------
    d : float
        distance in parsecs
    """
    # Above is a sample docstring in the numpy style, my personal favorite

    d = 1/p
    return d

# Creating a lambda function that does the same thing
dfromp_lambda = lambda p: 1/p

In [None]:
# Now let's test the function on a parallax of 0.1 arcseconds
p = 0.1

print(dfromp(p))
print(dfromp_lambda(p))

Let's say someone else is using a function you created by importing your file (like we did with numpy!) and they are confused about your organization.
Someone can simply type `help()` with the function name to view the docstring.

In [None]:
help(dfromp)

### If/Else Statements

One of the most important logical tools in coding is the `if` and `else` statement.
These can be used to execute a chunk of code `if` a given condition is met or, if not, default to something `else`.

`if`/`else` statements follow the same indentation rules as functions.
The `if` condition is also set up using the following format: `if CONDITION:`.
`else` does not require a condition though: `else:`.

We can also stack conditions using operators like `and` as well as `or`, or set different cases to check using `elif`.
`elif` falls in between `if` and `else`, and follows the same formatting as `if`.
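
Here is a minimal sketch of stacking conditions with `and` and `or`. The variable names, values, and thresholds are made up purely for illustration.

```python
v_mag = 11.2   # made-up V magnitude (assumption)
bv = 0.55      # made-up B-V color (assumption)

# `and` requires both conditions to be True
if v_mag < 12 and bv < 1.0:
    label = "both conditions met"
# `or` requires at least one condition to be True
elif v_mag < 12 or bv < 1.0:
    label = "only one condition met"
else:
    label = "neither condition met"

print(label)
```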

In [None]:
x = 3

if x > 0:
    print("x is positive")
elif x == 0:
    print("x is zero")
else:
    print("x is negative")

We can apply these statements to our distance function earlier to tell it whether we are working in arcseconds or milliarcseconds.
We achieve this by adding an additional parameter into our function, `is_milli`.
Parameters in functions are separated by a comma, and can be given a default value by using an equals sign (`=`).
Note that parameters without equal signs, namely required parameters, must be declared before optional parameters.

In [None]:
def dfromp(p, is_milli=True):
    """
    Returns the distance to a star in parsecs given its parallax in arcseconds or milliarcseconds.

    Parameters
    ----------
    p : float
        parallax in arcseconds or milliarcseconds
    
    is_milli : bool, optional
        whether the parallax is in milliarcseconds

    Returns
    -------
    d : float
        distance in parsecs
    """
    
    # Notice the nested indentation here! 1 for the function, 1 for the if statement
    if is_milli:
        d = 1000/p
    else:
        d = 1/p

    return d

# Testing on the same parallax
print('Distance from parallax = 0.1 milliarcsec:', dfromp(p), "parsecs")
print('Distance from parallax = 0.1 arcsec:', dfromp(p, is_milli=False), "parsecs")

### For (and while) loops

Often we want to loop over a list of values to do an operation, or keep performing an operation until a condition is met.
This is where the `for` loop and `while` loop come in.

A `for` loop is run with the following syntax: `for x in y:`, where `x` takes the value of each item in the iterable `y`, one at a time.
But what does that mean?
Consider a loop over our data of the form `for i in range(len(landolt_dat)):`.
Here `range(len(landolt_dat))` creates an iterable of integers, one for each row of the data.
Python then loops over these integers, so `i` is the index location of each entry.
The loop starts at `i=0` (Python always starts at index=0) and runs until `i=len(landolt_dat)-1`.
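
A minimal sketch of an index-based loop on a made-up list (standing in for the rows of a data file) might look like the following. The `enumerate` version at the end is a common alternative that hands you the index and the item together.

```python
# Made-up list standing in for the rows of a data file (assumption)
data = ["row_a", "row_b", "row_c"]

visited = []
# range(len(data)) yields the integers 0, 1, 2
for i in range(len(data)):
    visited.append((i, data[i]))
    print(i, data[i])

# enumerate gives the index and the item together
for i, row in enumerate(data):
    print(i, row)
```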

A fun application for `for` loops is to loop directly over a `list` or `array`.
It is identical to the above case, but we use the list itself as the iterable instead of a range of indices, like so:

In [None]:
my_list = [1, 2, 3, 4, 5]
for item in my_list:
    calculation = item**2
    print(calculation)

While `for` loops run until the last element, `while` loops keep running as long as a condition holds, or until you force an escape through a `break` command.
You can often avoid using a `break` by smartly defining your `while` conditions, but it is not always possible for complex problems.

In [None]:
x = 10

while x > 7:
    print(x)
    x -= 1 # x -= 1 is equivalent to x = x - 1


x = 10
while x > 0:
    print(x)
    x -= 1

    if x < 8:
        break # Breaks out of the loop when x is less than 8

Now getting back to astronomy, let's say we want to calculate distances for ALL of our data.
We now have all of the tools at our disposal to do so by combining a `for` loop with the `.append()` method.
We can add it as a new column into our data list.

In [None]:
print("Shape of landolt_dat before append:", np.shape(landolt_dat))

for star in landolt_dat:

    # Because parallax is last column, we can access it with -1
    d = dfromp(star[-1])

    # Append to end of row
    star.append(d)

print("Shape of landolt_dat after append:", np.shape(landolt_dat))

In [None]:
# Delete last column we just added
landolt_dat = [star[:-1] for star in landolt_dat]

#### List Comprehension

Now, you may have seen me do some sneaky techniques of putting for loops in a `list`.
This is known as **list comprehension** and it generally is a more compact and computationally efficient way of looping through lists.
When doing the simple `for` loop, Python needs to access and load both the `list` and its `.append()` function on each iteration which takes time, especially on very large datasets.
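
As a minimal side-by-side sketch (on a made-up list of numbers), the two versions below produce the same result:

```python
values = [1, 2, 3, 4, 5]

# Version 1: a for loop with .append()
squares_loop = []
for v in values:
    squares_loop.append(v**2)

# Version 2: the same operation as a list comprehension
squares_comp = [v**2 for v in values]

print(squares_loop == squares_comp)
```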

List comprehension follows a fairly simple logic structure, which is identical to the `for` loop.
Take the cell above, where I deleted the appended data.
The first chunk in the list is the action being taken which would have been indented in a `for` loop.
Then, the standard `for` loop declaration is made.
Let us re-append our data.

In [None]:
# Append the distance to the list using a list comprehension
landolt_dat = [star + [dfromp(star[-1])] for star in landolt_dat]
# When using LISTS, '+' concatenates lists
# When using NUMPY ARRAYS, '+' adds the elements of the arrays. This is a key difference between the two data structures

An even more intuitive (and faster!) way to do this is to use Numpy arrays!
Numpy arrays allow you to pass a function an array and it will apply the function to every element of the array, which we can take advantage of.
We can do the calculation and concatenation all in one line, but I will split it up to be more readable.
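
The sketch below illustrates both points on made-up numbers: how `+` behaves for lists versus arrays, and how a function built from ordinary arithmetic acts on every element of an array. The `double` function is a hypothetical helper for this example only.

```python
import numpy as np

a_list = [1, 2, 3]
b_list = [10, 20, 30]
# With lists, + joins the two lists end to end
print(a_list + b_list)

a_arr = np.array(a_list)
b_arr = np.array(b_list)
# With arrays, + adds element by element
print(a_arr + b_arr)

def double(x):
    """Hypothetical helper: works on a single number or a whole array."""
    return 2 * x

# Passing an array applies the arithmetic to every element at once
print(double(a_arr))
```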

We can either use Numpy's column stack (`np.column_stack()`) or concatenate (`np.concatenate()`).
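
For comparison, here is a sketch of both approaches on a made-up 2x2 table. Note that `np.concatenate()` needs the new column reshaped to two dimensions first, while `np.column_stack()` accepts the 1D array directly.

```python
import numpy as np

table = np.array([[1.0, 2.0],
                  [3.0, 4.0]])    # made-up 2x2 table (assumption)
new_col = np.array([10.0, 20.0])  # one value per row

# column_stack accepts the 1D array directly
stacked = np.column_stack((table, new_col))

# concatenate needs the new column reshaped to 2D, shape (2, 1)
joined = np.concatenate((table, new_col[:, np.newaxis]), axis=1)

print(stacked)
print(np.array_equal(stacked, joined))
```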

In [None]:
# [:] selects every row, then -1 selects the last column
distances = dfromp(landolt_arr[:,-1], is_milli=True)

# Append the distances to the array using np.column_stack
landolt_arr = np.column_stack((landolt_arr, distances))

**A Note on For Loops:**

While `for` loops are great, they are often the culprit of slow code!
Whenever someone asks me to help speed up their code, the first thing I look for are redundant `for` loops that could be replaced with matrix/vector calculations.
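
As a small sketch of that kind of replacement, the loop and the vectorized expression below compute the same distances from made-up parallaxes in milliarcseconds:

```python
import numpy as np

parallax_mas = np.array([5.0, 10.0, 20.0])  # made-up parallaxes (assumption)

# Loop version: one element at a time
dist_loop = np.empty(len(parallax_mas))
for i in range(len(parallax_mas)):
    dist_loop[i] = 1000 / parallax_mas[i]

# Vectorized version: the whole array in one expression
dist_vec = 1000 / parallax_mas

print(np.allclose(dist_loop, dist_vec))
```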

### <---TO DO--->

So by now we have distances in both `landolt_dat` and `landolt_arr`. One thing that might be useful to know about the star is its *absolute visual magnitude*, or $M_V$.
This is defined as the apparent $V$ magnitude the star would have if it were located 10 parsecs away.

Given a distance $D$ in pc and apparent $V$ magnitude, we can derive $M_V$ from the equation below:
$$
V-M_V = 5 \log_{10}(D)-5
$$
Rearrange this equation for $M_V$ and **create a function that takes $D$ and $V$ as arguments and returns $M_V$**. Use Numpy's `np.log10()` to compute the base 10 logarithm.
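
A quick sanity check you can use on your function, following from the definition of $M_V$: a star at exactly $D = 10$ pc should satisfy
$$
V - M_V = 5 \log_{10}(10) - 5 = 5 - 5 = 0
$$
so its apparent and absolute magnitudes should be equal.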

In [None]:
### YOUR CODE HERE

Next, apply your function to the Numpy `ndarray` to get the absolute magnitudes for each star and concatenate it using `np.column_stack()`. Be careful you are choosing the correct column that corresponds to the $V$-band magnitude and $D$ distance. Remember if you need help, ask one of the bootcamp leads :)

In [None]:
### YOUR CODE HERE

Print out the value of $M_V$ for the last star in your array with some "flavor" text to make sure you did the calculation correctly. You should get an absolute magnitude of ~2.76 mag.

In [None]:
### YOUR CODE HERE

### Saving Data

Finally, let's save a subset of the data we have created since our collaborator (you in Exercise02) wants to make some figures.
We will only need the $V$-band magnitude, $B-V$ color, $D$ distance, and $M_V$ absolute magnitude.
While you can save a file in plain Python, `Numpy`, or `Pandas`, we will just use `Numpy`'s `np.savetxt()` function.
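
Before saving the real data, here is a sketch of a save-and-reload round trip on a made-up array. The file name `toy.csv` is hypothetical, and everything is written to a temporary folder so nothing is left behind.

```python
import os
import tempfile
import numpy as np

toy = np.array([[11.2, 0.55],
                [9.8, 1.10]])  # made-up values (assumption)

# Save into a temporary folder so nothing is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.csv")  # hypothetical file name
    np.savetxt(toy_path, toy, delimiter=",", header="V, B-V")

    # The header line starts with '#', so loadtxt skips it as a comment
    toy_back = np.loadtxt(toy_path, delimiter=",")

print(np.allclose(toy, toy_back))
```

Reading the file back in like this is a simple way to confirm that it was written the way you intended.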

In [None]:
# Start by gathering the columns we want to save to a new file
columns = [7,8,10,11] # Corresponding to V, B-V, distance, and M_V

landolt_slice = landolt_arr[:,columns] # This will select all rows and only the columns we want

# Save the file using np.savetxt as a csv
np.savetxt("landolt_subset.csv", landolt_slice, delimiter=',', header="V, B-V, Distance, M_V")