<a href="https://colab.research.google.com/github/EiffL/Tutorials/blob/master/TimeSeries/QuasarLightcurvesLSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### Copyright 2017-2021 Francois Lanusse.

Licensed under the Apache License, Version 2.0 (the "License");

# Quasar classification by LSTM


Author: [@EiffL](https://github.com/EiffL) (Francois Lanusse)

### Overview

In this notebook, we use LSTMs to classify objects as stars or quasars, using only measurements of their flux at different points in time.

The idea is to tell stars and quasars apart based on how their fluxes change with time.

We use real light curve data from the SDSS survey, with a catalog of known quasars as our training set.

### Learning objectives:
In this notebook, we will learn how to:

  - Handle time-series data of different lengths
  - Use Keras to build an LSTM


### Instructions for enabling GPU access

By default, notebooks are started without acceleration. To make sure that the runtime is configured for using GPUs, go to `Runtime > Change runtime type`, and select GPU in `Hardware Accelerator`.



### Installs and Imports


In [None]:
%pylab inline
import tensorflow as tf
from astropy.table import Table

### Checking for GPU access


In [None]:
#Checking for GPU access
if tf.test.gpu_device_name() != '/device:GPU:0':
  print('WARNING: GPU device not found.')
else:
  print('SUCCESS: Found GPU: {}'.format(tf.test.gpu_device_name()))

## Retrieving data

In this first section, we retrieve a dataset of lightcurves as generated by this [script](https://github.com/McWilliamsCenter/CMUCosmoML/tree/master/applications/quasar_classification) by running an SQL query on the SDSS servers.

In [None]:
# Google Cloud Storage bucket for Estimator logs and storing
# the training dataset.
bucket = 'ahw2019' # Bucket setup for this AHW2019 tutorial
print('Using bucket: {}'.format(bucket))

In [None]:
!gsutil -m cp gs://{bucket}/quasar/ligthcurve_data.fits.gz .

This downloads the dataset locally. We can now load it to build our input pipeline.


We start by loading the table and splitting its rows into training and testing indices:

In [None]:
# Loading dataset
data_table = Table.read('ligthcurve_data.fits.gz')

# Splitting training and testing data
randomize_inds = range(len(data_table))
randomize_inds = permutation(randomize_inds)
randomized_inds_train = randomize_inds[0:45000]
randomized_inds_test  = randomize_inds[45000:]
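
The split above draws a new permutation on every run, so the training and testing sets change each time the cell is executed. As a minimal sketch of a reproducible alternative, assuming NumPy's `default_rng` generator is acceptable here, one could seed the shuffle. The helper name `split_indices` and the toy sizes are hypothetical and not part of the original notebook.

```python
import numpy as np

def split_indices(n_total, n_train, seed=0):
    """Return (train, test) index arrays from a seeded random permutation."""
    rng = np.random.default_rng(seed)   # fixed seed -> same split on every run
    shuffled = rng.permutation(n_total)
    return shuffled[:n_train], shuffled[n_train:]

# Toy sizes for illustration; the notebook uses 45000 training entries.
train_inds, test_inds = split_indices(n_total=10, n_train=8, seed=42)
print(len(train_inds), len(test_inds))
```

With the notebook's table, the same helper could be called as `split_indices(len(data_table), 45000)`.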

`data_table` contains all of the data available to us from the database:

In [None]:
data_table

In particular, it has a fixed-size `time_series` field and an `obs_len` field. The `obs_len` field tells us how many points the time series actually contains.

In [None]:
# Let's plot the observation of the flux of some quasars/stars in different
# filters:
plot(data_table['time_series'][0][:,1], '+')
plot(data_table['time_series'][0][:,2], '+')
plot(data_table['time_series'][0][:,3], '+')
plot(data_table['time_series'][0][:,4], '+')
axvline(data_table['obs_len'][0])

Let's have a look at a different entry:


In [None]:
plot(data_table['time_series'][1][:,1], '+')
plot(data_table['time_series'][1][:,2], '+')
plot(data_table['time_series'][1][:,3], '+')
plot(data_table['time_series'][1][:,4], '+')
axvline(data_table['obs_len'][1])

They have different lengths. How do we provide this information to the RNN?

We can use a `masking` mechanism: we always send arrays of the same size, but time steps that were not actually observed are set to a specific value (here -99), and the network skips these missing steps.
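
To make the idea concrete, here is a small NumPy-only sketch on toy arrays. The helper `pad_series` and the toy shapes are hypothetical. The rule used to recover the mask (a time step is skipped only when all of its features equal the mask value) is the one the Keras `Masking` layer applies.

```python
import numpy as np

MASK_VALUE = -99.

def pad_series(series, n_steps, mask_value=MASK_VALUE):
    """Pad (or truncate) a [time, features] array to exactly n_steps rows."""
    series = np.asarray(series, dtype='float32')
    padded = np.full((n_steps, series.shape[1]), mask_value, dtype='float32')
    n_obs = min(len(series), n_steps)
    padded[:n_obs] = series[:n_obs]
    return padded

# Two toy light curves with 3 and 5 observed time steps, 2 features each.
short_curve = np.ones((3, 2))
long_curve = np.ones((5, 2))

# After padding, both fit in one fixed-size batch of shape [2, 6, 2].
batch = np.stack([pad_series(short_curve, 6), pad_series(long_curve, 6)])

# A time step counts as observed unless all of its features equal the mask value.
observed = np.any(batch != MASK_VALUE, axis=-1)
print(batch.shape, observed.sum(axis=1))
```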

Let's create a pre-processing function that can format our data on this principle:

In [None]:
def mapping_function(x):
    def extract_batch(inds):
        inds = randomized_inds_train[inds]
        ts = clip(data_table['time_series'][inds].astype('float32'),-10,10) 
        length = clip(data_table['obs_len'][inds],0,89).astype('int32')
        ts[length:,:] = -99. # Every point from index `length` onward (beyond the observed light curve) is set to -99
        return data_table['coadd_label'][inds].astype('float32'), ts
    a,b =tf.py_function( extract_batch, [x], [tf.float32, tf.float32])
    a.set_shape([])
    b.set_shape([90,12])
    return a,b

In [None]:
# And we can apply this pre-processing function on our data to build a tf.data.Dataset
dataset = tf.data.Dataset.range(len(randomized_inds_train))
dataset = dataset.map(mapping_function)

In [None]:
dataset

In [None]:
# Let's grab an example
for batch in dataset.take(1):
  plot(batch[1][:,1], '+')
  plot(batch[1][:,2], '+')
  plot(batch[1][:,3], '+')

Ok great, now we are going to create functions that can produce datasets for various training and testing scenarios.

In the functions below, what changes is which dataset we are using (training or testing), and whether or not entries are shuffled.

In [None]:
# Define input function for training 
def input_fn_train():
  def mapping_function(x):
      def extract_batch(inds):
          inds = randomized_inds_train[inds]
          ts = clip(data_table['time_series'][inds].astype('float32'),-10,10) 
          length = clip(data_table['obs_len'][inds],0,89).astype('int32')
          ts[length:,:] = -99. 
          return data_table['coadd_label'][inds].astype('float32'), ts
      a,b =tf.py_function( extract_batch, [x], [tf.float32, tf.float32])
      a.set_shape([]) # This is the label
      b.set_shape([90,12]) # This is the input light curve
      return b,a

  dataset = tf.data.Dataset.range(len(randomized_inds_train))
  dataset = dataset.map(mapping_function)
  dataset = dataset.cache()
  dataset = dataset.repeat().shuffle(20000).batch(256)
  return  dataset

# Define input function for testing on the training set
def input_fn_train_test():
  def mapping_function(x):
      def extract_batch(inds):
          inds = randomized_inds_train[inds]
          ts = clip(data_table['time_series'][inds].astype('float32'),-10,10) 
          length = clip(data_table['obs_len'][inds],0,89).astype('int32')
          ts[length:,:] = -99. 
          return data_table['coadd_label'][inds].astype('float32'), ts
      a,b =tf.py_function( extract_batch, [x], [tf.float32, tf.float32])
      a.set_shape([]) # This is the label
      b.set_shape([90,12]) # This is the input light curve
      return b,a

  dataset = tf.data.Dataset.range(len(randomized_inds_train))
  dataset = dataset.map(mapping_function)
  dataset = dataset.batch(256)
  return  dataset

# Define input function for testing on the testing set
def input_fn_test():
  def mapping_function(x):
      def extract_batch(inds):
          inds = randomized_inds_test[inds]
          ts = clip(data_table['time_series'][inds].astype('float32'),-10,10) 
          length = clip(data_table['obs_len'][inds],0,89).astype('int32')
          ts[length:,:] = -99. 
          return data_table['coadd_label'][inds].astype('float32'), ts
      a,b =tf.py_function( extract_batch, [x], [tf.float32, tf.float32])
      a.set_shape([]) # This is the label
      b.set_shape([90,12]) # This is the input light curve
      return b,a

  dataset = tf.data.Dataset.range(len(randomized_inds_test))
  dataset = dataset.map(mapping_function)
  dataset = dataset.batch(256)
  return  dataset
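
The three input functions above differ only in which indices they read and in whether the entries are repeated and shuffled. As a rough sketch of how they could be folded into a single helper, the block below precomputes the clipping and masking in NumPy and then uses `tf.data.Dataset.from_tensor_slices`. This replaces the `tf.py_function` approach used above and assumes the data fits in memory. The name `make_dataset` and the toy arrays are hypothetical.

```python
import numpy as np
import tensorflow as tf

def make_dataset(time_series, obs_len, labels, training, batch_size=256):
    """Build a dataset of (light curve, label) batches from in-memory arrays."""
    ts = np.clip(np.asarray(time_series, dtype='float32'), -10, 10)
    n_steps = ts.shape[1]
    lengths = np.clip(np.asarray(obs_len), 0, n_steps - 1).astype('int32')
    # Mark every time step at or after the clipped length, as in the cells above.
    is_padding = np.arange(n_steps)[None, :] >= lengths[:, None]
    ts[is_padding] = -99.
    labels = np.asarray(labels, dtype='float32')
    dataset = tf.data.Dataset.from_tensor_slices((ts, labels))
    if training:
        dataset = dataset.repeat().shuffle(20000)
    return dataset.batch(batch_size)

# Toy data with the notebook's shape: 90 time steps and 12 features per entry.
rng = np.random.default_rng(0)
toy_ts = rng.normal(size=(8, 90, 12))
toy_len = rng.integers(10, 90, size=8)
toy_labels = rng.integers(0, 2, size=8)

ds = make_dataset(toy_ts, toy_len, toy_labels, training=False, batch_size=4)
x_batch, y_batch = next(iter(ds))
print(x_batch.shape, y_batch.shape)
```

With the notebook's table, the inputs would presumably be slices such as `data_table['time_series'][randomized_inds_train]`, with `training=True` for the training pipeline and `training=False` for the two evaluation pipelines.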

## Building the Neural Network

Now that we have the tools to load the data, the next step is to build the LSTM model. We will start with the simplest possible model, an LSTM layer that processes the time series, followed by a Dense layer that outputs the probability that the time series belongs to a quasar:

In [None]:
tfkl = tf.keras.layers

model = tf.keras.Sequential([
  tfkl.InputLayer([90,12]),
  tfkl.Masking(mask_value=-99.), # This is to tell Keras to skip the missing time steps

  # Create your LSTM model :-) 
  # Here are a few hints:
  # - You probably want to have one LSTM layer, followed by one or more Dense layers
  # - You probably want to use tfkl.LSTM (https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM)
  # - The output should be a detection probability, so between 0 and 1
  # .....

])

In [None]:
model.summary()

In [None]:
model.compile(optimizer='adam',
              loss= # What loss function should we use?
              ) 
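
If you get stuck on the two blanks above, the block below shows one possible way to fill them in. It is a sketch rather than the reference solution: the 64 LSTM units are an arbitrary choice, and binary cross-entropy is assumed as the loss because the labels are 0 or 1 and the output is a single sigmoid probability. The names `sketch_model` and `toy_batch` are hypothetical and only serve to check the output shape.

```python
import numpy as np
import tensorflow as tf

tfkl = tf.keras.layers

sketch_model = tf.keras.Sequential([
    tf.keras.Input(shape=(90, 12)),
    tfkl.Masking(mask_value=-99.),        # skip the missing time steps
    tfkl.LSTM(64),                        # arbitrary size, returns the last state only
    tfkl.Dense(1, activation='sigmoid'),  # probability between 0 and 1
])

sketch_model.compile(optimizer='adam',
                     loss='binary_crossentropy',
                     metrics=['accuracy'])

# Quick shape check on a random batch whose last 40 time steps are masked.
toy_batch = np.random.uniform(-1, 1, size=(4, 90, 12)).astype('float32')
toy_batch[:, 50:, :] = -99.
toy_prob = sketch_model(toy_batch).numpy()
print(toy_prob.shape)
```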

In [None]:
# Let's build the input dataset
dataset_training = input_fn_train()

In [None]:
# And fit the model
model.fit(dataset_training, 
          steps_per_epoch=45000//256,
          epochs=5)

### Applying the neural network on the testing set

In [None]:
# Evaluating performance on testing set
dataset_testing = input_fn_test()
test_prob = np.concatenate([model(batch[0]) for batch in dataset_testing])

# Attaching the predicted probabilities to the testing table
table_test = data_table[randomized_inds_test]
table_test['p'] = test_prob.squeeze()

In [None]:
# Evaluating performance on training set
dataset_training = input_fn_train_test()
train_prob = np.concatenate([model(batch[0]) for batch in dataset_training])

# Attaching the predicted probabilities to the training table
table_train = data_table[randomized_inds_train]
table_train['p'] = train_prob.squeeze()

In [None]:
# Compute ROC curves 
from sklearn.metrics import roc_curve

fpr1, tpr1, thr1 = roc_curve(table_train['coadd_label'], table_train['p'])
fpr2, tpr2, thr2 = roc_curve(table_test['coadd_label'],  table_test['p'])

plot(fpr1, tpr1,label='training set')
plot(fpr2, tpr2,label='testing set')
grid(True)
xscale('log')
legend()

In [None]:
from sklearn.metrics import roc_auc_score

print("Training Set ROC AUC score:", roc_auc_score(table_train['coadd_label'], table_train['p']))
print("Test Set ROC AUC score:", roc_auc_score(table_test['coadd_label'], table_test['p']))
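
As a reminder of what this score measures: the ROC AUC equals the probability that a randomly chosen positive example (here, a quasar) receives a higher score than a randomly chosen negative one, with ties counted as one half. The NumPy sketch below computes it by brute force on a toy example. The helper `pairwise_auc` and the toy values are hypothetical, and the pairwise comparison is only practical for small arrays.

```python
import numpy as np

def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]
    neg = scores[~labels]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

toy_labels = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```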

And here we go! A near-perfect quasar detector :-)

## Going Further

This is pretty much all you need to know to implement a Recurrent Neural Network. You can find the TensorFlow Keras guide to RNNs [here](https://www.tensorflow.org/guide/keras/rnn).

To go further in this tutorial, you can do the following things:

  - LSTMs are **notorious for overfitting** data very easily. To mitigate this, you can add a `dropout` value to the `LSTM` layer.
  - You can replace the `LSTM` layer with a different RNN type, for instance [GRU](https://keras.io/api/layers/recurrent_layers/gru/).
  - You can add a second LSTM layer, like so:
```
...
  tfkl.LSTM(128, return_sequences=True),
  tfkl.LSTM(128),
...
```
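
Putting the three suggestions together, a deeper model might look like the sketch below. The dropout rate of 0.2 and the use of a `GRU` as the second recurrent layer are arbitrary choices made for illustration, and `deeper_model` is a hypothetical name. Treat these values as a starting point to tune rather than a recommendation.

```python
import numpy as np
import tensorflow as tf

tfkl = tf.keras.layers

deeper_model = tf.keras.Sequential([
    tf.keras.Input(shape=(90, 12)),
    tfkl.Masking(mask_value=-99.),
    # return_sequences=True passes the full sequence on to the next recurrent layer
    tfkl.LSTM(128, return_sequences=True, dropout=0.2),
    # Second recurrent layer, here a GRU in place of an LSTM
    tfkl.GRU(128, dropout=0.2),
    tfkl.Dense(1, activation='sigmoid'),
])
deeper_model.compile(optimizer='adam', loss='binary_crossentropy')

# Quick shape check on a random batch whose last 30 time steps are masked.
check_batch = np.random.uniform(-1, 1, size=(2, 90, 12)).astype('float32')
check_batch[:, 60:, :] = -99.
check_prob = deeper_model(check_batch).numpy()
print(check_prob.shape)
```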

