# Data Wrangling: Input/Output

In this lecture/notebook, we'll look at the following:
* Different data file types used in Astronomy/Astrophysics
* Ways of opening data files and accessing the data within Python
* Ways of saving data from a Python session to a file
* Best Practices for dealing with files

For this lesson, we're going to be using the following packages:
* `numpy`
* `pandas`
* `astropy`

In [None]:
import numpy as np
import pandas as pd
from astropy import table


## Data Types we deal with in Astronomy

In astronomy (and in coding more generally), you'll encounter data in a variety of file formats:
* Plain, Unstructured Text 
* Structured Plain Text (most commonly, Comma-Separated Values or `csv`)
* Numpy files (`.npy` or `.npz`)
* FITS files<sup>1</sup>
* Structured Markup (such as HTML)

---
1: We won't go into FITS files in detail. What you _should_ know about them, in a nutshell: they contain groups of data, most often images (which are just 2D or 3D arrays of numbers, usually floating point) or tables of data. The tables are relatively easy to read in with `astropy`.

## The most basic of file I/O: a plain unstructured text file
* Opening the file
* Reading all of the content within the file
* The file as a string
* Writing out a file
* Closing a file

#### Reading a File

In [None]:
input_file = open("input.txt", 'r')

type(input_file)

In [None]:
input_string = input_file.read()
input_string

<div class="alert alert-block alert-info"> 
    <b>Tip:</b> One of the most useful habits in Python is regularly checking the type of your variables. Many problems are solved simply by knowing the variable type.
</div>

In [None]:
type(input_string)

In [None]:
input_file.close()
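
A common alternative to calling `close()` by hand is to open the file inside a `with` block, which closes the file for you when the block ends (even if an error occurs partway through). Here is a minimal sketch; `demo_input.txt` is a made-up scratch file written to a temporary directory so that the example doesn't depend on `input.txt`:

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration
demo_path = os.path.join(tempfile.mkdtemp(), "demo_input.txt")

# Write two lines so that there is something to read back
with open(demo_path, "w") as demo_file:
    demo_file.write("first line\nsecond line\n")

# The file is closed automatically when the with-block ends
with open(demo_path, "r") as demo_file:
    demo_string = demo_file.read()

print(type(demo_string))
print(demo_file.closed)
```

After the block ends, `demo_file.closed` reports whether the file handle was released.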

Instead of loading the file as one large string, you can read it line by line:

In [None]:
input_file = open("input.txt", 'r')

print(input_file.readline())

print("This is the next line:")

print(input_file.readline())

input_file.close()

Or, if you want to loop over the whole file, you can use the file object as an _iterable_ (i.e., throw it in a `for` loop).

In [None]:
input_file = open("input.txt", 'r')

for i, line in enumerate(input_file):
    print("This is line %i: %s" % (i, line))
    
input_file.close()

**Question**: What is the enumerate function doing here?

#### Writing a File

In [None]:
output_file = open("output.txt", 'w')

output_file.write("This is my output file")
output_file.write("\nThis is its second line")

output_file.close()

In [None]:
# Wrong way (if you meant to add to the file): 'w' mode overwrites the existing contents

output_file = open("output.txt", 'w')

output_file.write("This will overwrite the file")

output_file.close()


In [None]:
# The right way: 'a' mode appends to the end of the existing file

output_file = open("output.txt", 'a')

output_file.write("\nThis will append to the file")

output_file.close()
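
To see the difference between the `'w'` and `'a'` modes side by side, here is a small sketch that writes to a made-up scratch file (`mode_demo.txt`, in a temporary directory) and reads the contents back after each step:

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration
mode_path = os.path.join(tempfile.mkdtemp(), "mode_demo.txt")

with open(mode_path, "w") as f:
    f.write("original line")

# 'a' keeps what is already there and adds to the end
with open(mode_path, "a") as f:
    f.write("\nappended line")

with open(mode_path, "r") as f:
    after_append = f.read()

# 'w' throws away the existing contents before writing
with open(mode_path, "w") as f:
    f.write("replacement line")

with open(mode_path, "r") as f:
    after_overwrite = f.read()

print(after_append)
print(after_overwrite)
```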


### Exercise:

1. Write a plain text file that contains a 20-30 word bio of yourself.
2. Read in the plain text file, and using string functions, create a Python List of strings containing each individual word from your bio. 
3. Determine the total number of words in your bio. Also, determine the number of _unique_ words in your bio.

## Reading and Writing to NumPy Arrays
* Remembering the basics of Numpy Arrays (dimensionality, broadcasting)
* Loading in from a text file (tab-separated, comma-separated)
* Saving and Loading as Numpy Files (`.npy`, `.npz`)

Let's make the most basic of Numpy arrays.

In [None]:
# The Simplest of Arrays
new_array = np.array([])

Every NumPy array has a few properties that you should always check: the number of dimensions (`ndim`), the total number of elements (`size`), the length of each dimension (`shape`), and the data type (`dtype`).

In [None]:
print("Number of Dimensions: %i" % new_array.ndim)
print("Number of Elements: %i" % new_array.size)
print("Shape of Array: %s" % str(new_array.shape))
print("Data Type: %s" % new_array.dtype)

In [None]:
second_array = np.linspace(0, 5, 20)

print("Number of Dimensions: %i" % second_array.ndim)
print("Number of Elements: %i" % second_array.size)
print("Shape of Array: %s" % str(second_array.shape))
print("Data Type: %s" % second_array.dtype)

In [None]:
third_array = np.ones((10, 5))

print("Number of Dimensions: %i" % third_array.ndim)
print("Number of Elements: %i" % third_array.size)
print("Shape of Array: %s" % str(third_array.shape))
print("Data Type: %s" % third_array.dtype)

When you do a mathematical operation on NumPy arrays, NumPy will try to perform the operation element-wise. If the shapes differ but are compatible, it will _broadcast_ the smaller array across the rows or columns of the larger one. For instance:

In [None]:
np.array([1, 2, 3, 4, 5]) * third_array

However, you'll need to make sure the axes line up. Otherwise, it'll throw an error:

In [None]:
# This won't work
np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) * third_array

Here, NumPy is trying to line up the array of size 10 with the second (last) dimension of `third_array`, which has length 5, so the shapes are incompatible. However, you can `reshape` the array to make it a 2-D array whose second dimension has length 1:

In [None]:
# Reshaping the array so that it's two dimensions, with length 1 on the second dimension

reshaped_array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
print(reshaped_array.shape)

reshaped_array * third_array

<div class="alert alert-block alert-info"> 
    <b>Tip:</b> In reshapes, you can use the number "-1" to indicate that you want NumPy to automatically calculate how long that particular axis should be. Note that you can only do this for one axis in any given reshape.
</div>
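
If you're not sure whether two shapes will line up, you can ask NumPy before doing the arithmetic. The sketch below assumes a NumPy version that provides `np.broadcast_shapes` (1.20 or later); the array names are made up for this illustration:

```python
import numpy as np

grid = np.ones((10, 5))
row_values = np.arange(1, 6)       # shape (5,)
column_values = np.arange(1, 11)   # shape (10,)

# Shapes are compared starting from the last axis and working backwards
print(np.broadcast_shapes(row_values.shape, grid.shape))

# A length-10 array cannot line up with a last axis of length 5
try:
    np.broadcast_shapes(column_values.shape, grid.shape)
    shapes_compatible = True
except ValueError:
    shapes_compatible = False
print(shapes_compatible)

# Adding a trailing axis of length 1 makes the shapes compatible
as_column = column_values[:, np.newaxis]   # shape (10, 1)
product = as_column * grid
print(product.shape)
```

Indexing with `np.newaxis` here has the same effect as `reshape(-1, 1)`.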

Why don't we try to read in a structured plain text file? If everything is perfect, you can use the `loadtxt` function:

In [None]:
# Reading in simple file: 

lots_of_numbers = np.loadtxt("lots_of_numbers.txt")

In [None]:
lots_of_numbers

However, for more complicated files, this doesn't work. Take a look at the file "more_numbers.txt" and try to determine why this doesn't work.

In [None]:
# Reading in more complicated file:

# This won't work
more_numbers = np.loadtxt("more_numbers.txt")

But you can handle these kinds of files with the `genfromtxt` function:

In [None]:
more_numbers = np.genfromtxt("more_numbers.txt", missing_values="null")

In [None]:
more_numbers
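
To get a feel for what `genfromtxt` does with an entry it can't convert, here is a minimal sketch using made-up text in place of a file (the real `more_numbers.txt` may look different). With the default floating-point data type, the `null` entry should come back as `nan`, while `loadtxt` refuses to read the same text:

```python
import io
import numpy as np

# Hypothetical file contents, used only for this illustration
demo_text = "1.0 2.0 3.0\n4.0 null 6.0\n"

# loadtxt raises an error when an entry cannot be converted to a number
try:
    np.loadtxt(io.StringIO(demo_text))
    loadtxt_ok = True
except ValueError:
    loadtxt_ok = False
print(loadtxt_ok)

# genfromtxt marks the entry as missing instead of failing
demo_array = np.genfromtxt(io.StringIO(demo_text), missing_values="null")
print(demo_array)
print(np.isnan(demo_array).sum())
```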

However, Numpy Arrays have their limits. Generally, they need to be rectangular (or hyper-rectangular), and all of the same data type. So a file like this will be a pain to try to load in:

In [None]:
# Missing Values

# Why doesn't this work?
new_table = np.genfromtxt("more_numbers_missing_values.txt", missing_values="null")

Sometimes, you don't need a human-readable file -- in these cases, you can save a numpy array as an `npy` file. For these files, it's one array per file, but they're easily saved and loaded:

In [None]:
# Npy Files

x1 = np.array([2.2, 3.4, 2.1, 3.5, 9.3])
np.save("x1.npy", x1)

In [None]:
new_x1 = np.load("x1.npy")
new_x1

If you want to save more than one array, you can use a Numpy Zip file (or `npz`), which treats all the variables as dictionary-like elements when you load them in:

In [None]:
# Npz Files
x2 = np.array([12.3, 21, 32, np.nan])

np.savez("another_variable.npz", x1=x1, x2=x2)

In [None]:
more_variables = np.load("another_variable.npz")

print(list(more_variables.keys()))

# Access the loaded array by its key, rather than the original variable
print(more_variables["x2"])
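
As a self-contained sketch of the full round trip, the example below saves two made-up arrays to a scratch `npz` file in a temporary directory and reads them back by key. Opening the archive in a `with` block is one way to make sure the underlying file gets closed:

```python
import os
import tempfile
import numpy as np

# Hypothetical scratch file, created only for this illustration
npz_path = os.path.join(tempfile.mkdtemp(), "demo_arrays.npz")

first = np.array([2.2, 3.4, 2.1])
second = np.arange(6).reshape(2, 3)

# The keyword names become the keys inside the archive
np.savez(npz_path, first=first, second=second)

with np.load(npz_path) as archive:
    stored_names = sorted(archive.files)
    loaded_second = archive["second"]

print(stored_names)
print(loaded_second)
```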

<div class="alert alert-block alert-danger">
<b>Warning:</b> Numpy files, while easy to use, are not great for long-term storage or distribution. A file written on one computer or with a specific version of Numpy/Python is not guaranteed to work on another. These are best used as intermediate saves for data that you intend to open on the same computer, relatively soon.
</div>

### Exercise:

1. Create a two-dimensional and a three-dimensional numpy array (containing random numbers), multiply them to form a new array, and save all three as an npz file.


## Astropy Tables
* A table versus an array
* Columns and Rows
* Reading in various formats
* Saving out various formats

Numpy Arrays have certain limitations, for instance:
* They need to be all the same data type
* They need to be rectangular/hyper-rectangular (i.e., missing values are hard to deal with)
* They are inherently multi-dimensional

Often, what you want to deal with is a certain number of objects that each have a bunch of different properties. For this, what you'll want to use is a **Table**. Let's create an Astropy-style table:

In [None]:
new_table = table.Table()

The above table starts empty. Tables consist of **columns**, each of which is a discrete property of the objects you're describing, and **rows**, each of which is a discrete object. Columns are dictionary-like (i.e., you access them by key), and rows are array-like (i.e., you access them by index, in an explicit order).

Let's make a bunch of columns:

In [None]:
new_table['name'] = ["alpha", "beta", "gamma"] 
new_table['mass'] = [2.1, 2.3, 4.2]
new_table['temp'] = [6000, 2323, 233]

In [None]:
new_table

One of the nice things about tables is that they're easy to look at and diagnose. Notice, in this table, we're told the data type of each column -- and each one is different. Let's add another object:

In [None]:
new_table.add_row(("delta", 6.3, 10002))

In [None]:
new_table

Let's say that, for these objects, I only care about the `temp` column. I can grab just that:

In [None]:
new_table["temp"]

Or maybe, I care about both the `mass` and `temp`:

In [None]:
new_table[["mass", "temp"]]

And I can easily just grab the third row (remember, 0 indexing in python):

In [None]:
new_table[2]

One of the great things about Astropy tables is that you can easily write them out to a bunch of different formats:

In [None]:
# Saving as plain text

new_table.write("my_astropy_table.csv", format="ascii.csv")

new_table.write("my_fixedwidth_table.txt", format="ascii.fixed_width")

new_table.write("my_latex_table.tex", format="ascii.latex")

You can see all of the formats that you can read and write to here: [Formats that Astropy Tables can Read and Write](https://docs.astropy.org/en/stable/io/unified.html#built-in-readers-writers)

For instance, here's an example of an [IPAC](https://irsa.ipac.caltech.edu/frontpage/) table. This format comes from the NASA/IPAC Infrared Science Archive, which hosts lots of data from different astronomical surveys.

In [None]:
# Opening an IPAC table

ipac = table.Table.read("ipac.tbl.txt", format="ascii.ipac")

In [None]:
ipac

### Exercise: 

1. Go to the [IPAC Web Interface for the WISE survey](https://irsa.ipac.caltech.edu/applications/Gator/) and, using the WISE All-Sky Source Catalog, search for all the objects within 5 arcminutes of your favourite astronomical object. (Hint: The Object name field is generally smart enough to take common names of astronomical objects. You probably shouldn't choose a solar system object -- sorry, Mars lovers).
2. Download the `ipac` formatted file of the default columns from the WISE object search, and open it in your Jupyter Notebook. 
3. Select the columns for the Object Name, the Position (RA and Dec), and `w1mpro` Magnitude, and save it to a LaTeX table. 

## Reading and Writing to Pandas Dataframes
* What are Pandas dataframes?
* Making a data frame
* Automagically reading in csv files
* Writing out different data formats

Pandas is the primary data analysis library in Python, and it makes dealing with tabular data _much easier_. It's used well beyond astronomy, so it is incredibly robust and feature-rich. The most basic element of Pandas is the DataFrame. Let's make the simplest of them:

In [None]:
df1 = pd.DataFrame()

Like the Astropy tables, columns are dictionary-like, and rows are array-like. Let's make some columns:

In [None]:
df1["col1"] = [2, 3, 4]
df1["col4"] = [-12.2, 23.1, 984.2]

In [None]:
df1

We can even make columns using data from previous columns:

In [None]:
df1["col5"] = df1["col4"] ** df1["col1"]

And we can add new columns with completely different data types right from the start:

In [None]:
df1["name"] = ["row1", "row2", "row3"]

In [None]:
df1

To select individual rows (or groups of rows), you can use the `iloc` property of the DataFrame:

In [None]:
df1.iloc[0:2]
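
Position-based slicing with `iloc` is only one way to pick out rows. The sketch below, which uses a small made-up DataFrame, also shows selecting rows with a boolean condition and grabbing a single value by row label and column name with `loc`:

```python
import pandas as pd

# Small made-up DataFrame, used only for this illustration
demo_df = pd.DataFrame({"col1": [2, 3, 4], "name": ["row1", "row2", "row3"]})

# Rows by integer position (the end of the slice is excluded)
first_two = demo_df.iloc[0:2]

# Rows where a condition holds
large_rows = demo_df[demo_df["col1"] > 2]

# A single value by row label and column name
single_value = demo_df.loc[1, "name"]

print(first_two)
print(large_rows)
print(single_value)
```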

Some of the magic starts right off the bat, with reading in a file. CSVs are some of the most common files you'll encounter for distributing data. Let's open one up from a query from the Sloan Digital Sky Survey database:

In [None]:
# Reading in a CSV file

sdss_df = pd.read_csv("sdss_query.csv")

In [None]:
sdss_df

Notice how it automatically populates the column names? It would also handle missing data without intervention, marking the missing entries as `NaN`.
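
Here is a small sketch of that missing-data behaviour, using made-up CSV text in place of a file. The column names and values are invented for this illustration and are not taken from the actual SDSS query:

```python
import io
import pandas as pd

# Hypothetical CSV contents; the second row has no redshift value
demo_csv = (
    "objid,ra,dec,redshift\n"
    "1,150.1,2.2,0.05\n"
    "2,150.3,2.4,\n"
    "3,150.7,2.1,0.12\n"
)

demo_sdss = pd.read_csv(io.StringIO(demo_csv))

print(demo_sdss.dtypes)
# Count how many redshift entries were read in as missing
print(demo_sdss["redshift"].isna().sum())
```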

And, as always, it's really easy to save your data out to a bunch of different formats:

In [None]:
# To a CSV
df1.to_csv("pd_df1.csv")

# How about to HTML?
df1.to_html("pd_df1.html")

# Or maybe even Excel?
df1.to_excel("pd_df1.xlsx")

**Note:** If the last line doesn't work, make sure you have the `openpyxl` package installed via pip.
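
One thing to watch for: by default, `to_csv` also writes the row index as an unnamed first column. If you don't want that, you can pass `index=False`. A minimal sketch with a made-up DataFrame, writing to strings instead of files so the difference is easy to see:

```python
import io
import pandas as pd

# Small made-up DataFrame, used only for this illustration
demo_df = pd.DataFrame({"col1": [2, 3, 4], "col4": [-12.2, 23.1, 984.2]})

# With no path given, to_csv returns the CSV text as a string
with_index = demo_df.to_csv()
without_index = demo_df.to_csv(index=False)

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the index-free version back recovers the original columns
reloaded = pd.read_csv(io.StringIO(without_index))
print(list(reloaded.columns))
```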

### Exercises
1. Take the IPAC table you created earlier in the Astropy Table section and save it as a CSV.
2. Read it in as a Pandas DataFrame, and make a new column that is the distance of each object from the first object in the table (remember that these are celestial coordinates, and not simple Euclidean coordinates).
3. Save the new dataframe as an Excel file.