In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import numpy as np
import scipy.stats as stats
import scipy.optimize as optimize

plt.rcParams['font.size'] = 14

# Searching for Exoplanets

## Introduction

Recall that in week 8 of this course, we explored analyzing data in the frequency domain using Fourier transforms. In this project, you will further explore this concept by applying some of the basic statistical techniques that we covered at the beginning of the course, but now in the frequency domain. Fourier analysis has many applications in physics and astronomy, and in this project, you will investigate its use in the search for exoplanets.

There are now many methods for detecting exoplanets, but the first exoplanet discoveries in the late 1980s and early '90s were made using the *radial velocity method*. This technique is based on the fact that the gravitational force from an exoplanet will have a small, but measurable, effect on the motion of the host star. Recall that in a planetary system, both the planet and the star will orbit around the center of mass of the system. Even though the star will typically have much more mass than the planet, the center of mass of the combined system will not be exactly at the center of mass of the star. This will cause the star to "wobble".
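To get a feel for the size of this wobble, here is a minimal sketch for a circular orbit, where momentum balance about the center of mass gives $M_\star v_\star = m_p v_p$. The Sun-Jupiter numbers are rounded textbook values chosen for illustration (they are not part of the project data), and `reflex_speed` is a hypothetical helper name.

```python
# Rounded textbook values, used only for illustration
M_SUN_KG = 1.989e30          # mass of the Sun
M_JUPITER_KG = 1.898e27      # mass of Jupiter
V_JUPITER_M_PER_S = 1.307e4  # Jupiter's mean orbital speed

def reflex_speed(m_planet, m_star, v_planet):
    """Speed of the star around the center of mass, assuming a circular orbit."""
    return v_planet * m_planet / m_star

v_star = reflex_speed(M_JUPITER_KG, M_SUN_KG, V_JUPITER_M_PER_S)
print(f"Sun's reflex speed due to Jupiter: {v_star:.1f} m/s")
```

An observer only measures the component of this motion along the line of sight, so the measured radial velocity can be smaller than this value.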

Due to the Doppler effect, this slight wobble will cause small shifts in the spectrum of the host star over time. These radial velocities can be very small -- the first exoplanets discovered induced stellar velocities of only tens of m/s -- so measurements of the spectrum must be extremely precise. This measurement technique is summarized in the image below.

<img src="figures/Radial-Velocity-Method-star-orbits.png" width=600>

The key to detecting these extremely small shifts and distinguishing them from background fluctuations is that they will be periodic. This is why Fourier analysis is such a powerful technique for analyzing this type of data. When we look at the data in the time domain, it may not be possible to pick out the periodic signal from the background. When we convert to the frequency domain, however, a small periodic signal can become quite obvious.
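The idea can be sketched with a toy time series. Everything in this example is an assumption made for illustration: the 4-day period, the 50 m/s amplitude, the noise level, and the even sampling four times per day are not taken from the 51 Pegasi data.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible

# Hypothetical toy signal, sampled every 0.25 days for 40 days
dt = 0.25
t = np.arange(0, 40, dt)
true_period = 4.0                # days
signal = 50.0 * np.sin(2 * np.pi * t / true_period)

# Gaussian noise with a standard deviation equal to the signal amplitude
noisy = signal + rng.normal(scale=50.0, size=t.size)

# Magnitude of the Fourier coefficients after removing the mean
amps = np.abs(np.fft.rfft(noisy - noisy.mean()))
freqs = np.fft.rfftfreq(t.size, dt)

peak_freq = freqs[np.argmax(amps)]
print(f"Recovered period ~ {1 / peak_freq:.2f} days")
```

Each individual point is as noisy as the signal is large, but the sinusoid adds up coherently in a single frequency bin while the noise is spread over all of them.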

In this project, you will consider some effects that can make it easier or harder to detect a periodic signal.

1. The quality of the measurements. We will study this effect by adding noise to the measurements of radial velocity, thus degrading the signal.

2. The number of measurements over time. If we take more measurements over a longer time range, it will be easier to distinguish the periodic signal from background.

We have provided a data sample to use in your investigation. This is idealized data, based on radial velocity measurements taken from [51 Pegasi](https://en.wikipedia.org/wiki/51_Pegasi) in 1995, which led to the first discovery of an exoplanet orbiting a Sun-like star. 

## Potential goals for this project:

1. Run through the four cases in this notebook that demonstrate the effects of adding noise and reducing the observation time.

2. Decide how to characterize the statistical significance of the peak in the Fourier transform, and whether it constitutes a discovery.

3. Explore how the statistical significance changes as we degrade the data, and discuss your findings using both text and plots.

4. Imagine you are conducting an exoplanet search using a detector with a known level of noise. Estimate the minimum observation time you need to make a discovery with a high degree of confidence (i.e. at least 5$\sigma$).

These four goals are suggestions, and you should feel free to substitute your own ideas, as long as your overall project has a similar level of complexity to the listed goals. We have provided a few functions and code templates that you will probably find useful, and you are encouraged to use these as starting points for writing your own code. To get you started, four scenarios are briefly discussed below to illustrate some of the effects of degrading the data.
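As a reference point for the 5$\sigma$ threshold mentioned in goal 4, the sketch below converts a number of standard deviations into a one-sided tail probability. It assumes the background fluctuations are Gaussian, which is a convention rather than something guaranteed here: the magnitudes of Fourier coefficients of pure Gaussian noise are not themselves Gaussian, so treat the result as a rough guide.

```python
from scipy import stats

# One-sided probability that a Gaussian fluctuation exceeds n_sigma
p_values = {}
for n_sigma in (3, 5):
    p_values[n_sigma] = stats.norm.sf(n_sigma)
    print(f"{n_sigma} sigma -> one-sided p-value {p_values[n_sigma]:.2e}")
```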

## Useful functions

### Plot the radial velocity as a function of time

In [None]:
def plot_rvel(date, rvel):
    plt.scatter(date, rvel)
    plt.xlabel(r'Observation Time [days]')
    plt.ylabel(r'$\Delta v_{\rm rad} [\frac{m}{s}]$')

### Degrade data by adding noise and/or reducing number of observations

This function allows you to:
- Add random noise to the radial velocity signal, drawn from a Gaussian distribution with $\sigma = $ `noise_scale`
- Restrict the measurements to the first fraction `tfrac` of the observed time range, where $ 0 < $ `tfrac` $ \leq 1$
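The sketch below shows the same masking and noise steps on a made-up time series, so you can see how many points survive for a given `tfrac`. The evenly spaced `toy_date` array and the flat `toy_rvel` signal are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

toy_date = np.linspace(0.0, 40.0, 161)  # hypothetical: 4 samples per day for 40 days
toy_rvel = np.zeros_like(toy_date)      # flat signal, so the output is pure noise

kept = []
for tfrac in (1.0, 0.5, 0.25):
    # Keep only observations strictly before tfrac * (last observation time)
    mask = toy_date < tfrac * np.max(toy_date)
    noisy = toy_rvel[mask] + rng.normal(loc=0, scale=10.0, size=mask.sum())
    kept.append(int(mask.sum()))
    print(f"tfrac={tfrac}: {mask.sum()} points kept, sample std = {noisy.std():.1f}")
```

Because the comparison is a strict `<`, the final observation is dropped even when `tfrac` is 1.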

In [None]:
rng = np.random.default_rng()

def degrade_data(date, rvel, noise_scale=0.01, tfrac=1.):
    tmax = tfrac*np.max(date)
    mask = date < tmax
    return (date[mask], rvel[mask] + rng.normal(loc=0, scale=noise_scale, size=mask.sum()))

### Take the Fourier transform of a time series

This FFT (fast Fourier transform) function returns `freqs, fft_vals`, where `freqs` holds the frequencies and `fft_vals` holds the complex Fourier coefficients of the transformed data. Take `np.abs(fft_vals)` to get the amplitude at each frequency. `freqs` has units of days$^{-1}$ and `fft_vals` is in arbitrary units.
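The following sketch shows how the frequency grid and the amplitudes relate to the sampling. It assumes evenly spaced samples and a sinusoid whose frequency falls exactly on a grid point; the values of `N`, `T`, and `A` are made up for illustration.

```python
import numpy as np

N, T = 160, 0.25                      # hypothetical: 160 samples, 0.25 days apart
t = np.arange(N) * T
A = 50.0
x = A * np.sin(2 * np.pi * 0.25 * t)  # 0.25 cycles per day, i.e. a 4-day period

freqs = np.fft.rfftfreq(N, T)         # non-negative frequencies only
fft_vals = np.fft.rfft(x)             # complex coefficients
amps = np.abs(fft_vals)               # amplitude at each frequency

print("frequency resolution:", freqs[1] - freqs[0], "vs 1/(N*T) =", 1 / (N * T))
print("highest frequency:", freqs[-1], "vs 1/(2*T) =", 1 / (2 * T))
print("peak amplitude * 2/N:", amps.max() * 2 / N)  # recovers A for an on-grid sinusoid
```

The frequency resolution is set by the total observation time `N*T`, so halving the observation time doubles the spacing between frequency bins.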

In [None]:
def do_fft(date, rvel, plot=True):
    # Number of sample points
    N = len(date)

    # Arbitrary offset in data
    offset = np.mean(rvel)

    # sample spacing
    T = np.mean(date[1:] - date[0:-1])

    freqs = np.fft.rfftfreq(N, T)
    fft_vals = np.fft.rfft(rvel - offset)

    if plot:
        plt.plot(freqs, np.abs(fft_vals))    
        plt.ylabel("Signal [arbitrary units]")
        plt.xlabel(r"Frequency [${\rm days}^{-1}$]")

        freq_max = freqs[np.argmax(np.abs(fft_vals))]
        period = 1./freq_max
        annotation = (f"f ~ {freq_max:0.2f} days^-1\n"
                      f"P ~ {period:0.2f} days"
                     )
        plt.annotate(annotation, (0.6, 0.8), xycoords="figure fraction")
        
    return freqs, fft_vals

### Characterize the peak and noise level in the FFT

This function returns a dictionary containing statistical information about the transformed data.
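One possible way to combine these statistics into a single number is to ask how many noise standard deviations the peak sits above the mean noise level. This is only a sketch of one choice, not a required definition: `peak_significance` is a hypothetical helper, and the numbers in `example_stats` are invented rather than taken from the data.

```python
def peak_significance(peak_height, noise_mean, noise_std):
    """Number of noise standard deviations by which the peak exceeds the noise mean."""
    return (peak_height - noise_mean) / noise_std

# Invented values with the same keys as the dictionary described above
example_stats = {"peak_height": 4000.0, "mean": 600.0, "std": 300.0}

sig = peak_significance(example_stats["peak_height"],
                        example_stats["mean"],
                        example_stats["std"])
print(f"Peak is {sig:.1f} standard deviations above the noise mean")
```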

In [None]:
def fft_noise_stats(freqs, fft_vals, verbose=False):
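    # rfft returns complex coefficients, so work with their magnitudes
    amps = np.abs(fft_vals)
    result = {}
    result["peak_height"] = np.max(amps)
    result["peak_freq"] = freqs[np.argmax(amps)]
    # Treat everything above 0.5 days^-1 as background, away from the peak
    min_bin = np.searchsorted(freqs, 0.5)
    other_data = amps[min_bin:]
    result["mean"] = np.mean(other_data)
    result["std"] = np.std(other_data)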
    if verbose:
        print("The peak height is {peak_height:.2f} at frequency {peak_freq:.2f} day^-1".format(**result))
        print("Away from the peak, the mean amplitude is {mean:.2f} and the standard deviation is {std:.2f}".format(**result))
    return result

## Data taking scenarios

The following scenarios are provided to illustrate how the data changes in both the time and frequency domains, for various values of noise level and observation time.

### 1. Idealized data (no noise)

This case represents some very idealized measurements. There is no instrumental error, and we observe the star a few times a day for 40 days. The result is a very clear and convincing periodic signal, visible in both the time and frequency domain plots.

In [None]:
data = np.loadtxt("../data/51peg_model_rvs.txt", usecols=range(2))

# This is how we pull out the data from columns in the array.
date = data[:,0] - np.min(data[:,0])
rvel = data[:,1]

plot_rvel(date, rvel)

In [None]:
freqs, fft_vals = do_fft(date, rvel)
fft_noise_stats(freqs, fft_vals, True)

### 2. Shorter observation, no noise

In this case we still have no measurement error, but we only have 20 days of observation data instead of 40. We still get a very clear signal in both plots.

In [None]:
less_data = degrade_data(date, rvel, tfrac=0.5)
plot_rvel(less_data[0], less_data[1])

In [None]:
freqs, fft_vals = do_fft(less_data[0], less_data[1])
fft_noise_stats(freqs, fft_vals, True)

### 3. Full observation time, with noise added

In this case we have added about $100$ m/s of noise in the measurements of $v_\mathrm{rad}$. Even though we can't really see a periodic signal in the time domain, the signal is very clear in the Fourier transform.

In [None]:
noisy_data = degrade_data(date, rvel, noise_scale=100.)
plot_rvel(noisy_data[0], noisy_data[1])

In [None]:
freqs, fft_vals = do_fft(noisy_data[0], noisy_data[1])
fft_noise_stats(freqs, fft_vals, True)

### 4. Shorter observation, with noise added

In this case we have both a shorter observation time and noisy data, and there is no longer a visible signal.

In [None]:
noisy_short_data = degrade_data(date, rvel, noise_scale=100., tfrac=0.5)
plot_rvel(noisy_short_data[0], noisy_short_data[1])

In [None]:
freqs, fft_vals = do_fft(noisy_short_data[0], noisy_short_data[1])
fft_noise_stats(freqs, fft_vals)

## Code templates

The following templates are provided for you to fill in with code and use in your analysis. You can use these, as well as the functions above, as examples to guide you in writing your own analysis code.

### Calculate statistical significance for a given combination of noise and observation time

In [None]:
def get_significance(noise, tfrac):
    noisy_data = degrade_data(date, rvel, noise, tfrac)
    freqs, fft_vals = do_fft(noisy_data[0], noisy_data[1], plot=False)
    fft_stats = fft_noise_stats(freqs, fft_vals)
    
    ## your code here -- calculate and return the significance ##

### Plot significance as a function of both noise level and observation time

The template below illustrates one way of visualizing the data that may be helpful. You may also want to write code to make some one-dimensional plots, using `plt.plot(x, y)`.
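The grid construction can be sketched with a stand-in function before you plug in your own significance calculation. Here `toy_significance` is a hypothetical placeholder with an arbitrary formula, and the grid bounds are made up.

```python
import numpy as np

def toy_significance(noise, tfrac):
    # Hypothetical stand-in: grows with observation time, falls with noise
    return 100.0 * np.sqrt(tfrac) / noise

noises = np.linspace(10.0, 100.0, 4)   # x axis: 4 noise levels
t_fracs = np.linspace(0.25, 1.0, 3)    # y axis: 3 time fractions

# Both grids have shape (len(t_fracs), len(noises))
noise_grid, tfrac_grid = np.meshgrid(noises, t_fracs)
sig_2d = np.vectorize(toy_significance)(noise_grid, tfrac_grid)

print(sig_2d.shape)    # rows follow t_fracs, columns follow noises
print(sig_2d[-1, 0])   # entry for tfrac = 1.0 and noise = 10.0
```

With this layout, rows correspond to the vertical axis and columns to the horizontal axis, which matches what `plt.imshow` expects when `origin="lower"` is used.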


In [None]:
## Use np.linspace to get an evenly spaced vector of values 
## Replace the values in the arguments below with start, stop, and number of steps
noises = np.linspace(0., 0., 0)

## Your t_fracs should be at least 0.025, otherwise you will have zero data points in your
## time series, which will cause an error. Feel free to experiment with the bounds as long
## as 0.025 <= t_frac <= 1.0
t_fracs = np.linspace(0.025, 1.0, 0)

nn, tt = np.meshgrid(noises, t_fracs) # converts the above vectors to grids: nn holds noise values, tt holds time fractions

## this converts your significance function so that it can be used on numpy arrays
v_get_significance = np.vectorize(get_significance)
sig_2d = v_get_significance(nn, tt)

## Make the 2D plot. Comment or remove "norm=colors.LogNorm()" to switch between
## linear and logarithmic scale
plt.imshow(sig_2d, extent=(noises[0], noises[-1], t_fracs[0], t_fracs[-1]), aspect="auto", 
           origin="lower", norm=colors.LogNorm())
plt.colorbar()
plt.xlabel("Noise [m/s]")
plt.ylabel("Time fraction")

## Uncomment the line below to draw contour(s) over the plot
## Change `levels` below to your list of contour levels
#plt.contour(noises, t_fracs, sig_2d, levels=[0.], colors=["red"], origin="lower")

plt.show()