# HPF exploratory data analysis

In the last notebook we started by plotting one of the $\sim$410 available Goldilocks spectra.  In this notebook we will begin exploring the data by quantifying **line strengths**.  The signals fall into a few distinct categories:

1. Earth Atmosphere: Absorption
2. Earth Atmosphere: Emission
3. Stellar Absorption
4. Exoplanet Atmosphere Transmission (if in-transit)

Our ultimate goal is to separate signals into these categories.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import glob
from astropy.io import fits
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format='retina'
sns.set_context('paper')

Raw HPF data are processed with two different data reduction pipelines: *Goldilocks* and the default facility data reduction from the Penn State Instrument Team.  The two reductions have slightly different precisions but are essentially indistinguishable for our purposes.  However, their data columns have different names, so the two file types have to be handled differently.

In [None]:
goldilocks_files = glob.glob('../data/HPF/Helium-transit-data/**/Goldilocks*.fits', recursive=True)
pennstate_files = glob.glob('../data/HPF/Helium-transit-data/**/Slope*.fits', recursive=True)
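
Since the two reductions need different handling, it can help to tag each file by pipeline before reading it. The sketch below reuses the `Goldilocks` and `Slope` filename prefixes from the glob patterns above; the function name `classify_reduction` and the example file names are hypothetical and only for illustration.

```python
import os

def classify_reduction(fn):
    """Return 'goldilocks', 'pennstate', or 'unknown' from the file's base name."""
    base = os.path.basename(fn)
    # Same prefixes as the glob patterns: Goldilocks*.fits and Slope*.fits
    if base.startswith('Goldilocks') and base.endswith('.fits'):
        return 'goldilocks'
    if base.startswith('Slope') and base.endswith('.fits'):
        return 'pennstate'
    return 'unknown'

# Made-up file names, for illustration only
examples = ['night1/Goldilocks_example.fits', 'night1/Slope_example.fits', 'notes.txt']
labels = [classify_reduction(fn) for fn in examples]
```

A label like this could then be used to choose which reader function to call for a given file.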

Let's define a function that takes a Goldilocks FITS filename and returns a pandas DataFrame with one row per pixel and an `order` column identifying the echelle order.

In [None]:
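def get_goldilocks_dataframe(fn):
    """Return a pandas DataFrame given a Goldilocks FITS file name"""
    frames = []
    # The context manager closes the FITS file once the data have been read
    with fits.open(fn) as hdus:
        for j in range(28):  # loop over the 28 echelle orders
            df = pd.DataFrame()
            for i in range(1, 10):  # extensions 1-9 hold the data arrays
                name = hdus[i].name
                df[name] = hdus[i].data[j, :]
            df['order'] = j
            frames.append(df)
        # DataFrame.append was removed in pandas 2.0; concatenate once instead
        df_original = pd.concat(frames, ignore_index=True)
    # Drop pixels where any of the first six columns is exactly zero
    keep_mask = df_original[df_original.columns[0:6]] != 0.0
    df_original = df_original[keep_mask.all(axis=1)].reset_index(drop=True)

    return df_original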
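
The same long-format table can also be built without a per-order loop. The sketch below is a possible alternative, not the notebook's method: it assumes each extension has already been read into a 2D array of shape `(n_orders, n_pixels)`, and the function name `stack_orders` and the synthetic input are hypothetical.

```python
import numpy as np
import pandas as pd

def stack_orders(extensions):
    """Flatten {name: 2D array (n_orders, n_pixels)} into one long DataFrame."""
    names = list(extensions)
    n_orders, n_pixels = np.shape(extensions[names[0]])
    # Casting to float also yields native byte order, which pandas expects
    data = {name: np.asarray(extensions[name], dtype=float).ravel() for name in names}
    # Row-major ravel puts all pixels of order 0 first, then order 1, and so on
    data['order'] = np.repeat(np.arange(n_orders), n_pixels)
    df = pd.DataFrame(data)
    # Drop pixels where any of the first six extension columns is exactly zero
    keep = (df[names[:6]] != 0.0).all(axis=1)
    return df[keep].reset_index(drop=True)

# Synthetic stand-in for real extensions: 3 orders of 5 pixels each
rng = np.random.default_rng(0)
demo = {name: rng.uniform(1.0, 2.0, size=(3, 5)) for name in ['Sci Flux', 'Sci Wavl']}
demo['Sci Flux'][1, 2] = 0.0  # one masked pixel in order 1
demo_df = stack_orders(demo)
```

Building every column in one pass avoids repeatedly growing a DataFrame, which tends to matter more as the number of orders and files increases.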

In [None]:
index = 123 # Pick any index from 0 to len(goldilocks_files) - 1 (about 410 files)
fn = goldilocks_files[index]

In [None]:
%time df = get_goldilocks_dataframe(fn)

In [None]:
sns.set_palette("Reds", n_colors=28)

In [None]:
plt.figure(figsize=(16, 5))
for order, group in df.groupby('order'):
    plt.plot(group['Sci Wavl'], group['Sci Flux'], label=order);
plt.xlabel(r'$\lambda$ ($\AA$)');  # raw string avoids invalid escape sequences
plt.ylabel('Raw Flux')
plt.ylim(-3, 30)
plt.legend(ncol=7, title='Echelle Order #', fontsize=11);

In [None]:
plt.figure(figsize=(16, 5))
order = 16
mask = df.order == order
plt.step(df['Sci Wavl'][mask], df['Sci Flux'][mask], label=order, color='#2980b9');
plt.xlabel(r'$\lambda$ ($\AA$)');  # raw string avoids invalid escape sequences
plt.ylabel('Raw Flux')
plt.legend(ncol=7, title='Echelle Order #', fontsize=11);
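
As a first step toward comparing line depths across orders, one option is to divide each order by its median flux. The sketch below assumes a dataframe with the `order` and `Sci Flux` columns used in the plots above; the function name `normalize_by_order` and the `norm_flux` column are hypothetical, and the median is only a crude stand-in for a proper continuum fit.

```python
import numpy as np
import pandas as pd

def normalize_by_order(df, flux_col='Sci Flux'):
    """Return a copy of df with flux divided by the median flux of its order."""
    out = df.copy()
    # transform('median') broadcasts each order's median back to its own rows
    order_median = out.groupby('order')[flux_col].transform('median')
    out['norm_flux'] = out[flux_col] / order_median
    return out

# Synthetic two-order example with very different raw flux levels
toy = pd.DataFrame({
    'order': [0, 0, 0, 1, 1, 1],
    'Sci Flux': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
toy_norm = normalize_by_order(toy)
```

After this scaling, both toy orders sit near unity, so dips below 1 can be read roughly as fractional absorption depths.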