# Module 4 Challenge | Analyze Gaia Data With Pandas

For astronomers who are interested in understanding how stars form and evolve, the place to look is young star clusters and stellar associations. These are places where lots of stars have recently formed. The Gaia mission will provide the 3D location and proper motions for over 1 billion stars, making it the perfect telescope to discover new star clusters and further characterize known clusters.

## Introduction

Young star clusters and stellar associations are important sites for understanding the stellar birth environment and stellar evolution.

In this challenge, you'll use [NumPy](http://www.numpy.org), [Matplotlib](https://matplotlib.org), and [Pandas](http://pandas.pydata.org) to explore a piece of the initial Gaia data release.

## The Data
This dataset combines Hipparcos and Tycho-2 data with new Gaia observations to provide accurate 3D positions and proper motions (i.e., the 2D angular velocity) on the sky (no radial velocities yet).

For this exercise, we've collected the relevant data and stored it in the HDF5 data file <em>alldata.hdf</em>, which should be in the Module_3 folder.

If you are interested, the full datasets can be downloaded in chunks from the Gaia website, [here](http://cdn.gea.esac.esa.int/Gaia/tgas_source/). A description of all of the columns can be found [here](https://gaia.esac.esa.int/documentation/GDR1/datamodel/Ch1/tgas_source.html).

<strong>Load the data by reading it in with Pandas.</strong>

In [None]:
# Load data into DataFrame named 'dat'
import FILL IN CODE
dat  = FILL IN CODE

You can think of this simple Pandas DataFrame as a large table which has built in functions to process rows and columns, read and write to many different formats on disk, and interact with other DataFrames.
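
To make the "large table" picture concrete, here is a minimal sketch on a toy table. The values are made up (they are not Gaia data), and the names `toy`, `errors`, and `combined` are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Toy table with made-up values (not Gaia data)
toy = pd.DataFrame({
    "ra": [10.0, 120.5, 250.25],
    "dec": [-45.0, 0.5, 60.0],
    "parallax": [2.0, 0.5, 4.0],
})

# Columns are accessed by label, and many reductions are built in
mean_parallax = toy["parallax"].mean()

# DataFrames can interact with each other; join() aligns rows on the shared index
errors = pd.DataFrame({"parallax_error": [0.3, 0.2, 0.25]})
combined = toy.join(errors)

print(toy.shape)
print(mean_parallax)
print(combined.columns.tolist())
```

The same ideas apply to <code>dat</code>, only with many more rows and columns.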

<strong>Try printing the DataFrame to see a small sample. You'll have to scroll down to view it all.</strong>

In [None]:
FILL IN CODE

We see that we have 2,057,050 stars in our dataset, each of which has a measured position on the sky (both (ra,dec) and (l,b)), a parallax, a G band magnitude, and proper motion on the sky (pmra,pmdec). Each of these measurements also has an error associated with it.

(In a moment, we'll be plotting the values ra [right ascension], dec [declination], and parallax. If you're interested in learning more about these quantities, see the Wikipedia page on [Star Positions](https://en.wikipedia.org/wiki/Star_position) and the references therein.)

We can see how much memory our DataFrame object is taking up with:

In [None]:
print('{:d} rows'.format(len(dat)))
print('{:.1f} MB'.format(dat.memory_usage(index=True,deep=True).sum()/1e6))

We can try reducing this by loading only the columns we'll be working with, which are <em>'ra', 'ra_error', 'dec', 'dec_error', 'parallax', 'parallax_error'</em>.
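
To see why fewer columns means less memory, here is a rough sketch on a toy table of random numbers. The column names `c0` to `c9` are hypothetical. Note that this sketch selects columns after the table already exists, whereas in the exercise you select them while reading the file.

```python
import numpy as np
import pandas as pd

# Toy table: 1000 rows, 10 float64 columns with hypothetical names c0..c9
rng = np.random.default_rng(0)
full = pd.DataFrame(rng.normal(size=(1000, 10)),
                    columns=[f"c{i}" for i in range(10)])

# Keep only three columns; .copy() makes an independent, smaller table
subset = full[["c0", "c1", "c2"]].copy()

# Same memory measurement as used for 'dat' in this notebook
full_mb = full.memory_usage(index=True, deep=True).sum() / 1e6
subset_mb = subset.memory_usage(index=True, deep=True).sum() / 1e6
print('{:.3f} MB -> {:.3f} MB'.format(full_mb, subset_mb))
```

Since every column here holds the same number of 8-byte floats, memory scales roughly with the number of columns kept.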

<strong>In the cell below:

1. Re-load the data to the same DataFrame (named dat), this time only the 6 columns above

2. Note how much the size of the DataFrame has been reduced compared to the original</strong>


In [None]:
dat  = pd.read_hdf('alldata.hdf',columns=[FILL IN RELEVANT COLUMNS HERE])
print('{:d} rows'.format(len(dat)))
print('{:.1f} MB'.format(dat.memory_usage(index=True,deep=True).sum()/1e6))

print(dat)

The DataFrame object contains some built-in convenience functions for quickly getting a sense of your data. For example, we can quickly make histograms of different columns with the <code>dat.hist()</code> method.
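
As a small illustration of the <code>hist()</code> method, here is a sketch on random toy data with two hypothetical columns, `angle` and `value`. The `matplotlib.use("Agg")` line is an assumption that lets the sketch run as a plain script; inside the notebook you would rely on `%matplotlib inline` instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only needed outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data (not Gaia data): one uniform column, one Gaussian column
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "angle": rng.uniform(0, 360, 500),
    "value": rng.normal(0, 1, 500),
})

# One subplot per column; hist() draws each listed column into its own axis
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
toy.hist(["angle", "value"], ax=axes, bins=20)
```

Each panel is titled with the column label, so the number of axes you create should match the number of columns you list.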

<strong>Plot histograms of the 'ra', 'dec' and 'parallax' columns by filling in the missing code below:</strong>

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
fig,axes = plt.subplots(1,3,figsize=(17,3))
dat.hist([FILL IN CODE - ENTER ALL THREE COLUMN LABELS IN HERE],ax=axes,xlabelsize=15,ylabelsize=15,bins=50);

We can see right away that Gaia is an all-sky survey, since it covers the full range of right ascension and declination. The parallax histogram looks a little funny, though: there seems to be some bad parallax data that we should remove before proceeding. Remember that parallax is related to distance via

$$\text{distance in pc} = 1 / (\text{parallax in arcsec}) .$$

The Gaia parallaxes are reported in milliarcsec, so taking the reciprocal directly gives distances in kpc. Looking at the histogram for parallax, we see a problem: there are a number of negative parallaxes, which would correspond to negative distances.
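
As a quick check of the units, here is a small worked example with made-up parallax values (not taken from the dataset). It computes the distance both ways: converting to arcsec first to get pc, and taking the reciprocal of milliarcsec directly to get kpc.

```python
import numpy as np

# Made-up parallaxes in milliarcseconds (mas); the last one is negative
parallax_mas = np.array([10.0, 2.0, 0.5, -0.4])

# Route 1: convert mas -> arcsec, then distance in pc = 1 / (parallax in arcsec)
dist_pc = 1.0 / (parallax_mas / 1000.0)

# Route 2: the reciprocal of the parallax in mas is the distance in kpc
dist_kpc = 1.0 / parallax_mas

print(dist_pc)
print(dist_kpc)
```

A parallax of 10 mas corresponds to 100 pc, or equivalently 0.1 kpc, and the negative parallax produces a negative, unphysical distance.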


<strong>Verify this by slicing our DataFrame object to only show the rows where the parallax is negative</strong>

In [None]:
FILL IN CODE HERE

<strong>Now, remove these by using the <code>dat.drop()</code> method, and re-plot the three histograms.</strong>
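
If boolean slicing and <code>drop()</code> are new to you, here is a minimal sketch of the general pattern on a toy table with made-up values. The names `toy`, `is_negative`, `negative_rows`, and `cleaned` are hypothetical, and this is one possible approach rather than the only one.

```python
import pandas as pd

# Toy table with made-up values (not Gaia data)
toy = pd.DataFrame({"parallax": [2.0, -0.4, 0.5, -1.2, 4.0]})

# Boolean mask: True wherever the condition holds
is_negative = toy["parallax"] < 0

# Slicing with the mask shows only the matching rows
negative_rows = toy[is_negative]
print(negative_rows)

# drop() takes the index labels of the rows to remove and returns a new DataFrame
cleaned = toy.drop(negative_rows.index)
print('{:d} rows'.format(len(cleaned)))
```

Because <code>drop()</code> returns a new DataFrame by default, the result has to be assigned to a variable to be kept.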

In [None]:
dat = FILL IN CODE TO DROP THE STARS WITH NEGATIVE PARALLAX VALUES
FILL IN CODE TO REPLOT THE SAME HISTOGRAMS

<strong>After removing the negative parallax values, how many stars remain in the sample?</strong>

In [None]:
# Just to be safe, let's reload the 6 columns as before
dat  = FILL IN CODE TO LOAD DATAFRAME
dat = FILL IN CODE TO DROP THE STARS WITH NEGATIVE PARALLAX VALUES
print('{:d} rows'.format(len(dat)))

<strong>Add a new column to the DataFrame called 'dist' that contains the distances to the stars in kpc</strong>, computed from the parallax column using the relationship between distance and parallax (note that the parallax values are in milliarcseconds, which means distances will be in kpc).

<strong>What is the distance to the nearest star in the sample, and what is the mean distance of all the stars?</strong>
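
For reference, here is a short sketch of how a derived column can be added and summarized, using a toy table with made-up parallaxes. The names `toy`, `nearest`, and `mean_dist` are hypothetical.

```python
import pandas as pd

# Toy table with made-up parallaxes in milliarcseconds (not Gaia data)
toy = pd.DataFrame({"parallax": [10.0, 2.0, 0.5]})

# Assigning to a new label creates the column; the arithmetic is element-wise
toy["dist"] = 1.0 / toy["parallax"]  # kpc, since parallax is in mas

# Built-in reductions summarize the new column
nearest = toy["dist"].min()
mean_dist = toy["dist"].mean()
print('nearest: {:.3f} kpc, mean: {:.3f} kpc'.format(nearest, mean_dist))
```

Calling <code>toy["dist"].describe()</code> would report the minimum and mean, along with other summary statistics, in one step.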

In [None]:
FILL IN CODE TO ADD NEW COLUMN
FILL IN CODE TO OBTAIN INFORMATION ABOUT DISTANCES

Let's visualize the 2D positions of the stars in our sample. Create a simple scatter plot of the <em>ra</em> and <em>dec</em> values using <code>dat.plot.scatter(VALUES TO PLOT)</code>. 

<strong>What observation can you make about the dataset from this plot?</strong>

In [None]:
FILL IN CODE HERE

Another way to visualize this data is to create a 2D histogram and represent it as a <em>heatmap</em>. Instead of viewing the positions of individual stars, we view it as a density distribution of the stars. 
One way to do this is with the pyplot function <em>hexbin</em>, which you can learn more about here: [hexbin demo](http://matplotlib.org/1.4.0/examples/pylab_examples/hexbin_demo.html)

Create a new plot by replacing "scatter" with "hexbin". Then play around with the attributes to get a really cool looking plot: try setting <code>gridsize</code> to values between 30 and 300 (you'll see a noticeable difference). To adjust the colors, try setting <code>cmap</code> equal to "inferno" or "gray", or visit this page to learn more about [choosing colormaps](https://matplotlib.org/users/colormaps.html).
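
To show how the two plot types relate, here is a rough sketch on random toy data with hypothetical columns `x` and `y`. The `matplotlib.use("Agg")` line is an assumption so that the sketch runs as a plain script; it is not needed inside the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only needed outside a notebook
import numpy as np
import pandas as pd

# Toy data (not Gaia data): a dense clump on top of a sparse background
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "x": np.concatenate([rng.normal(0, 1, 2000), rng.uniform(-5, 5, 500)]),
    "y": np.concatenate([rng.normal(0, 1, 2000), rng.uniform(-5, 5, 500)]),
})

# Scatter plot: one marker per row
ax_scatter = toy.plot.scatter(x="x", y="y", s=1)

# Hexbin: counts rows per hexagonal cell and colors each cell by that count
ax_hexbin = toy.plot.hexbin(x="x", y="y", gridsize=40, cmap="inferno")
```

A larger <code>gridsize</code> gives smaller hexagons and finer detail, at the cost of noisier counts in each cell.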

<strong>How would you describe the shape of the stellar density distribution you obtain? By the way, what you're seeing is the Milky Way through the eyes of Gaia!</strong>


In [None]:
FILL IN CODE HERE