# Data Wrangling: Data Operations and Selections

In this lecture/notebook, we'll look at the following:
* Indexing
* Selecting Data (the `where` command)
* Performing Mathematical and/or Statistical Operations on Data

For this lesson, we're going to be using the following packages:
* `numpy`
* `pandas`
* `astropy`

In [None]:
# All of the basic imports:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import astropy

## What's the deal with indexing?
* Basics of Indexes
* Indexing sequences
* Special Numpy indexes

Let's start with the simplest case, a one-dimensional array:

In [None]:
x = np.arange(10)
x

![indexing1.png](img/indexing1.png)

In [None]:
print(x[0])
print(x[5])

Or we can reverse index:

![indexing2.png](img/indexing2.png)

In [None]:
print(x[-10])
print(x[-5])

Or we can choose a range:

![indexing3.png](img/indexing3.png)

In [None]:
print(x[1:])
print(x[5:8])
print(x[1:-1])

And we can change the step size, so that we only pick every n-th element:

![indexing4.png](img/indexing4.png)

In [None]:
print(x[::2])
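
As a small sketch of the general slice form `start:stop:step` (the variable names `evens`, `odds`, `middle` and `backwards` are only illustrative), the same array can be sliced in a few more ways:

```python
import numpy as np

x = np.arange(10)

# A slice is written start:stop:step; any of the three parts may be omitted.
evens = x[::2]        # every second element, starting at index 0
odds = x[1::2]        # every second element, starting at index 1
middle = x[2:8:3]     # indices 2 and 5 (the stop index 8 is excluded)
backwards = x[::-1]   # a negative step walks through the array in reverse

print(evens)
print(odds)
print(middle)
print(backwards)
```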

With numpy arrays, you now have multiple indices to index with. Let's take the case of a simple two-dimensional array:

In [None]:
array1 = np.arange(90).reshape((-1, 10))
array1

What's the shape of the array?

In [None]:
array1.shape

Now, we can index in two different dimensions:

![indexing5.png](img/indexing5.png)

In [None]:
array1[5, 7]

You can select a row or column, or just part of one:

![indexing6.png](img/indexing6.png)

In [None]:
array1[5, 1:8]

Or you can subselect a box:

![indexing7.png](img/indexing7.png)

In [None]:
array1[5:-1, 1:8]

You can even pick out non-sequential indices:

![indexing8.png](img/indexing8.png)

In [None]:
array1[(5, 7), (6, 8)]
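
It may help to spell out what the two index sequences do. In the sketch below, the sequences are paired up element by element, so only two values come back. If you wanted the whole block formed by those rows and columns instead, one option is `np.ix_`, which is not used elsewhere in this notebook and is shown here only as an alternative:

```python
import numpy as np

array1 = np.arange(90).reshape((-1, 10))

# The two sequences are paired element by element:
# this picks array1[5, 6] and array1[7, 8], i.e. two single values.
pairs = array1[(5, 7), (6, 8)]

# np.ix_ combines every listed row with every listed column instead,
# giving the 2x2 block made of rows 5 and 7 and columns 6 and 8.
block = array1[np.ix_((5, 7), (6, 8))]

print(pairs)
print(block)
```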

## Finding the data you want
* Introduction to the `where` command
* Boolean Conditionals
* What comes out of a `where`
* Best practices with `where`
* Using `where` in Pandas

Let's start by making a random two-dimensional numpy array:

In [None]:
random_array = np.random.randn(10, 5)
random_array

The `where` command lets you know where (heh) in the array the conditions that you're looking for are met. So, for instance, if I want to know the indices where the values of the above array are greater than 0:

In [None]:
chosen_items = np.where(random_array > 0)

type(chosen_items)

Looking at the output, you'll always get a `tuple`. What's in that tuple?

In [None]:
chosen_items

The output of a `where` command is a tuple that is as long as the number of dimensions of the original array. Each element of the tuple is a numpy array that contains the indices, along that dimension, of the elements that meet the criteria inside the `where`. All of these arrays have the same length, which is the number of elements that meet the criteria.
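
Because the random array changes on every run, a tiny hand-written array can make this structure easier to see. The array `small` and the names `rows`, `cols` and `positive_values` below are made up for this illustration:

```python
import numpy as np

small = np.array([[1, -2, 3],
                  [-4, 5, -6]])

# One index array per dimension: row indices first, then column indices.
rows, cols = np.where(small > 0)

# Reading the two arrays side by side gives the positions
# (0, 0), (0, 2) and (1, 1) of the positive values.
print(rows)
print(cols)

# Indexing with both arrays pulls out the matching values.
positive_values = small[rows, cols]
print(positive_values)
```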

In [None]:
print("Number of Dimensions: %i" % len(chosen_items) )

Checking the size of each dimension: 

In [None]:
print("First Dimension Length: %i " % chosen_items[0].size)
print("Second Dimension Length: %i " % chosen_items[1].size)

You can use the indices returned by the `where` function to replace values:

In [None]:
random_array[chosen_items] = 0.0

In [None]:
random_array
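
As a side note, the same replacement can usually be written without the intermediate index tuple. The sketch below compares three routes on a seeded random array (the seed and the names `data`, `via_where`, `via_mask` and `clipped` are assumptions made for this example): indexing with the `where` output, indexing with the boolean condition directly, and the three-argument form of `np.where`, which returns a new array instead of modifying one in place.

```python
import numpy as np

rng = np.random.default_rng(42)      # seeded only so the example is repeatable
data = rng.standard_normal((10, 5))

# Route 1: indices from np.where, as in the cells above
via_where = data.copy()
via_where[np.where(via_where > 0)] = 0.0

# Route 2: use the boolean condition itself as the index
via_mask = data.copy()
via_mask[via_mask > 0] = 0.0

# Route 3: np.where(condition, value_if_true, value_if_false) builds a new array
clipped = np.where(data > 0, 0.0, data)

print(np.array_equal(via_where, via_mask))
print(np.array_equal(via_mask, clipped))
```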

If you have arrays that are the same shape, you can mix and match them in the search condition:

In [None]:
random_array = np.random.randn(10, 5)
second_array = np.random.random(random_array.shape)
second_array.shape

In [None]:
joint_indicies = np.where((random_array >= 0.5) & (second_array < 0.6))
print(joint_indicies)

<div class="alert alert-block alert-info"> 
    <b>Note:</b> Boolean queries do not simply run left to right: the <code>&amp;</code> operator is evaluated before comparisons such as greater-than and less-than. Brackets are important to make sure the query works in the way you want it to. For instance: 
    <pre>arr1 > 2 & arr3 < 2</pre>
is read as a chained comparison against <code>2 &amp; arr3</code>, and will result in an error.
    
To make it work, put brackets around what you want to run first:
    <pre>(arr1 > 2) & (arr3 < 2)</pre>
    
</div>
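
To see the note in action, here is a minimal sketch with two small integer arrays. The values in `arr1` and `arr3` are invented for the example, and the exact error text may differ between numpy versions:

```python
import numpy as np

arr1 = np.array([1, 3, 5, 7])
arr3 = np.array([0, 1, 2, 3])

# Without brackets, Python groups this as arr1 > (2 & arr3) < 2,
# a chained comparison that needs a single True/False from a whole array.
try:
    arr1 > 2 & arr3 < 2
    failed = False
except ValueError as err:
    failed = True
    print("Error:", err)

# With brackets, each comparison is done first and the results are combined.
mask = (arr1 > 2) & (arr3 < 2)
print(mask)
```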

In pandas, putting a condition on one or more columns inside the square brackets selects the matching rows across the entire table:

In [None]:
stars = pd.DataFrame(
    {
        "name": ["alpha", "beta", "gamma", "theta"],
        "magnitude": [13.0, 15.9, 14.3, 15.1],
        "mass": [1e33, 1.2e32, 8.5e32, 2.1e33],
    }
)

stars

In [None]:
stars[stars.magnitude > 14]

You can even use conditionals with multiple columns:

In [None]:
stars[(stars.magnitude > 14) & (stars.mass > 1e33)]

Or if you want to use the standard `np.where` command, you can do that easily with the `.iloc` property of a Pandas dataframe:

In [None]:
tmp_ind = np.where(stars.magnitude > 14)
stars.iloc[tmp_ind]
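
For comparison, here is a sketch of three ways to get the same rows: the square-bracket mask, label-based `.loc` (which also lets you choose columns at the same time), and position-based `.iloc` fed with the first element of the `where` tuple. The names `faint`, `by_mask`, `by_loc`, `positions` and `by_iloc` are only illustrative:

```python
import numpy as np
import pandas as pd

stars = pd.DataFrame(
    {
        "name": ["alpha", "beta", "gamma", "theta"],
        "magnitude": [13.0, 15.9, 14.3, 15.1],
        "mass": [1e33, 1.2e32, 8.5e32, 2.1e33],
    }
)

faint = stars.magnitude > 14          # a boolean Series, one entry per row

by_mask = stars[faint]                              # all columns of the matching rows
by_loc = stars.loc[faint, ["name", "magnitude"]]    # matching rows, chosen columns

# np.where returns a tuple; its first element holds the row positions
positions = np.where(faint.to_numpy())[0]
by_iloc = stars.iloc[positions]

print(by_mask.equals(by_iloc))
print(by_loc)
```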

### Exercises: 
1. Create an array of normally-distributed values using the `randn` function of shape `(40, 40)`
2. Find all values within the array that lie between -1 and 1
3. How many values of the array fall between -1 and 1? What fraction of the total array is it? Is this what you should expect? Why?
4. Set the values between -1 and 1 to 0. Manually calculate the mean of the entire array (i.e., without the `mean` function that you may or may not know about). What is the mean of the array? Is this what you should expect?

## Now that I have the data, what can I do with it?
* Numpy: performing mathematical operations over an array
* Pandas: performing mathematical operations over a dataframe


With Numpy arrays, most operations assume you're trying to apply them element-wise. For instance:

In [None]:
random_array * 5

This will multiply each value by that one number. Similarly:

In [None]:
random_array * random_array 

This works because both of these arrays are the same shape (in this case, they are the very same array).

Note, this is the same as doing `random_array**2` or `np.power(random_array, 2)`. In Numpy, there are often multiple ways to do the same thing.
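
A quick way to convince yourself of this equivalence is to compare the results directly. The sketch below uses a small hand-written array `a` (an assumption for the example) and also shows what happens when the shapes do not line up:

```python
import numpy as np

a = np.array([[1.0, -2.0],
              [3.0, 0.5]])

squared_mul = a * a             # element-wise product
squared_pow = a ** 2            # power operator
squared_func = np.power(a, 2)   # function form

same = np.allclose(squared_mul, squared_pow) and np.allclose(squared_mul, squared_func)
print(same)

# Arrays whose shapes cannot be matched up raise an error instead.
try:
    a * np.ones(3)
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)
    print("Error:", mismatch_error)
```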

Also built in to Numpy are statistical functions, such as `mean` or `median`. You can apply them on the whole array:

In [None]:
# Calculating some simple statistical operations
print(np.mean(random_array))
print(np.median(random_array))

But often, you'll want to take the mean along a specific axis (i.e., you want the mean for every row or column). Let's take a look at the shape of `random_array`:

In [None]:
random_array.shape

Let's say we want to take the mean of each column (that is, one value for each entry along the second dimension). We can use the `axis` keyword:

In [None]:
# Taking the mean along the first axis (one value per column):
mean_val = np.mean(random_array, axis=0)
print(mean_val)
print("Shape of Array: %s" % str(mean_val.shape))


Or in the opposite direction:

In [None]:
# Taking the mean along the second axis (one value per row):
mean_val = np.mean(random_array, axis=1)
print(mean_val)
print("Shape of Array: %s" % str(mean_val.shape))
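
Since the values in `random_array` are hard to check by eye, here is the same idea on a tiny array whose means can be worked out by hand. The array `grid` and the result names are made up for this sketch:

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)
# grid is:
# [[0 1 2]
#  [3 4 5]]

# axis=0 collapses the rows, leaving one mean per column
col_means = grid.mean(axis=0)

# axis=1 collapses the columns, leaving one mean per row
row_means = grid.mean(axis=1)

print(col_means, col_means.shape)
print(row_means, row_means.shape)
```

The axis you name is the one that disappears from the shape of the result.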


With Pandas dataframes, you can apply the numpy functions on individual columns, or you can use the built-in methods:

In [None]:
stars.magnitude.mean()

This is equivalent to:

In [None]:
stars['magnitude'].mean()

One of the nice things about Pandas is that it lets you look at all of the statistical properties at once: 

In [None]:
stars.describe()

It calculates a variety of statistical properties (count, mean, standard deviation, minimum, maximum, and quartiles) for each numeric column separately.
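
The result of `describe` is itself a dataframe, so individual statistics can be looked up by row and column label. A minimal sketch, rebuilding the `stars` table so that it runs on its own (the names `summary` and `median_mag` are only illustrative):

```python
import pandas as pd

stars = pd.DataFrame(
    {
        "name": ["alpha", "beta", "gamma", "theta"],
        "magnitude": [13.0, 15.9, 14.3, 15.1],
        "mass": [1e33, 1.2e32, 8.5e32, 2.1e33],
    }
)

summary = stars.describe()

# Rows are the statistics, columns are the numeric columns of the table;
# the text column "name" is left out by default.
print(summary.index.tolist())
print(summary.columns.tolist())

# The "50%" row is the median of each column
median_mag = summary.loc["50%", "magnitude"]
print(median_mag)
```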

## Exercises

1. Go to the [IPAC Web Interface for the WISE survey](https://irsa.ipac.caltech.edu/applications/Gator/) and using the WISE All-Sky Source Catalog search for all the objects within 10 arcminutes around your favourite astronomical object. 

2. Download the `ipac` formatted file of the default columns from the WISE object search, and open it in your Jupyter Notebook. 

3. Use the same parameters for 2MASS All-Sky Point Source Catalog and do the same:
![ipac.png](img/ipac.png)

4. Save both of these files as `csv`, and read them both in as Pandas dataframes 

5. Make a function to calculate the distance between two sky locations. 

6. Make an algorithm to go through the WISE results, and match the closest source from 2MASS into a new table. Hints:
    - Make a new pandas dataframe to put the new results in
    - For each WISE source, create a new column for the distance to each 2MASS source
    - On the distance column, use the `np.min` function alongside a `where` command to find the closest source.