In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import scipy.optimize as optimize
import datetime
from scipy.fft import fft, fftfreq

# Searching for Exoplanets


## Introduction

Recall that in week 8 of this course, we explored analyzing data in the frequency domain using Fourier transforms. In this project, you will further explore this concept by applying some of the basic statistical techniques that we covered at the beginning of the course, but now in the frequency domain. Fourier analysis has many applications in physics and astronomy, and in this project, you will investigate its use in the search for exoplanets.

There are now many methods for detecting exoplanets, but the first exoplanet discoveries in the late 1980s and early '90s were made using the *radial velocity method*. This technique is based on the fact that the gravitational force from an exoplanet will have a small, but measurable, effect on the motion of the host star. Recall that in a planetary system, both the planet and the star orbit around the center of mass of the system. Even though the star will typically have much more mass than the planet, the center of mass of the combined system will not be exactly at the center of mass of the star. This will cause the star to "wobble".

Due to the Doppler effect, this slight wobble will cause small shifts in the spectrum of the host star over time. The corresponding radial velocities can be very small (51 Pegasi, for example, moves by only a few tens of m/s), so measurements of the spectrum must be extremely precise. This measurement technique is summarized in the image below.

<img src="figures/Radial-Velocity-Method-star-orbits.png" width=600>
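
To get a feel for how small these spectral shifts are, here is a minimal sketch using the non-relativistic Doppler relation $\Delta\lambda / \lambda = v_{\rm rad} / c$. The function name `doppler_shift`, the 500 nm line, and the 50 m/s velocity are illustrative assumptions, not values taken from the project data.

```python
# Non-relativistic Doppler shift: delta_lambda / lambda = v_rad / c
C_M_PER_S = 299_792_458.0  # speed of light in m/s

def doppler_shift(wavelength_nm, v_rad_m_per_s):
    """Return the wavelength shift (in nm) for a given radial velocity."""
    return wavelength_nm * v_rad_m_per_s / C_M_PER_S

# Illustrative numbers (assumptions): a 500 nm spectral line and a 50 m/s wobble
shift_nm = doppler_shift(500.0, 50.0)
print(f"Wavelength shift: {shift_nm:.2e} nm")
```

The shift comes out at a little under $10^{-4}$ nm, which is why the spectrum must be measured so precisely.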

The key to detecting these extremely small shifts and distinguishing them from background fluctuations is that they will be periodic. This is why Fourier analysis is such a powerful technique for analyzing this type of data. When we look at the data in the time domain, it may not be possible to pick out the periodic signal from the background. When we convert to the frequency domain, however, a small periodic signal can become quite obvious.
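
The idea can be sketched with a toy example: a weak sinusoid buried in much larger Gaussian noise, whose frequency is nevertheless recovered from the tallest bin of the Fourier transform. All of the numbers below (sample spacing, signal frequency, noise level, random seed) are assumptions chosen for illustration and are unrelated to the 51 Pegasi data.

```python
import numpy as np
from scipy.fft import fft, fftfreq

rng_demo = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

# Assumed toy values, not taken from the project data
n_samples = 1000
spacing = 0.04        # days between samples (40 days in total)
true_freq = 0.25      # cycles per day, i.e. a 4-day period

t = np.arange(n_samples) * spacing
signal = np.sin(2 * np.pi * true_freq * t)                      # amplitude 1
noisy = signal + rng_demo.normal(scale=2.5, size=n_samples)     # noise sigma 2.5

# Amplitude spectrum of the positive frequencies, after removing the mean
amp = 2.0 / n_samples * np.abs(fft(noisy - noisy.mean())[: n_samples // 2])
freqs = fftfreq(n_samples, spacing)[: n_samples // 2]

recovered_freq = freqs[np.argmax(amp)]
print(f"Recovered frequency: {recovered_freq:.3f} per day")
```

In the time series the noise is larger than the signal at every point, but in the frequency domain the signal power piles up in one bin while the noise is spread over all of them.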

In this project, you will consider some effects that can make it easier or harder to detect a periodic signal.

1. The quality of the measurements. We will study this effect by adding noise to the measurements of radial velocity, thus degrading the signal.

2. The number of measurements over time. If we take more measurements over a longer time range, it will be easier to distinguish the periodic signal from the background.

We have provided a data sample to use in your investigation. This is idealized data, based on radial velocity measurements taken from [51 Pegasi](https://en.wikipedia.org/wiki/51_Pegasi) in 1995, which led to the first discovery of an exoplanet orbiting a Sun-like star. 

## Potential goals for this project:

1. Run through the four cases in this notebook that demonstrate idealized data, and what happens as it gets degraded.

2. Come up with a way to define the significance of the peak in the Fourier transform and decide whether it constitutes a discovery.

3. Explore how the significance changes as we degrade the data, and characterize those changes.

4. Study how the level of noise affects the amount of time we will need to observe a star in order to make a discovery. 

We have provided a few simple functions, including functions to degrade the data, extract the Fourier transform of the data, and quantify both the size of the signal peak and the amount of background in the Fourier transform.

## Useful functions

### Plot the radial velocity as a function of time

In [None]:
def plot_rvel(date, rvel):
    plt.scatter(date, rvel)
    plt.xlabel(r'Observation Time [days]')
    plt.ylabel(r'$\Delta v_{\rm rad} [\frac{m}{s}]$')

### Degrade data by adding noise and/or reducing number of observations

This function allows you to:
- Add random noise to the radial velocity signal, drawn from a Gaussian distribution with $\sigma = $ `noise_scale`
- Reduce the time range of the measurements by a factor of `tfrac`, where $ 0 \leq $ `tfrac` $ \leq 1$

In [None]:
rng = np.random.default_rng()

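def degrade_data(date, rvel, noise_scale=0.01, tfrac=1.):
    """Keep the first `tfrac` of the time range and add Gaussian noise to `rvel`."""
    tmax = tfrac*np.max(date)
    # Use <= so that tfrac=1 keeps every observation, including the last one
    mask = date <= tmax
    noise = rng.normal(loc=0, scale=noise_scale, size=mask.sum())
    return (date[mask], rvel[mask] + noise)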

### Take the Fourier transform of a time series

This FFT (fast Fourier transform) function returns `xf, yf` where `xf` and `yf` are the frequency and amplitude of the transformed data, respectively. `xf` has units of days $^{-1}$ and `yf` is in arbitrary units.

In [None]:
def do_fft(date, rvel, plot=True):
    """Return (xf, amplitude) for the positive-frequency half of the FFT of `rvel`."""
    # Number of sample points
    N = len(date)

    # Subtract the mean so a constant offset does not dominate the zero-frequency bin
    offset = np.mean(rvel)

    # Mean sample spacing; the FFT treats the samples as evenly spaced in time
    T = np.mean(date[1:] - date[0:-1])

    yf = fft(rvel-offset)
    xf = fftfreq(N, T)[:N//2]

    # Normalized amplitude of the positive-frequency half of the spectrum
    amp = 2.0/N * np.abs(yf[0:N//2])

    if plot:
        plt.plot(xf, amp)
        plt.ylabel("Signal [arbitrary units]")
        plt.xlabel(r"Frequency [${\rm days}^{-1}$]")

        # Frequency and period of the tallest peak
        freq_max = xf[np.argmax(amp)]
        period = 1./freq_max

        plt.annotate(r"$f \sim %0.2f {\rm days}^{-1}$" % freq_max, (2.0, 35))
        plt.annotate(r"$P \sim %0.2f {\rm days}$" % period, (2.0, 32))

    return xf, amp
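
One reason the length of the observation matters is the frequency resolution of the FFT: for $N$ evenly spaced samples with spacing $T$, adjacent frequency bins are separated by $\Delta f = 1/(NT)$, the inverse of the total observing time. The sketch below checks this with `fftfreq`. The helper name `frequency_resolution` and the sampling of 4 observations per day are assumptions for illustration.

```python
import numpy as np
from scipy.fft import fftfreq

def frequency_resolution(n_samples, spacing_days):
    """Spacing between adjacent FFT frequency bins, in days^-1."""
    freqs = fftfreq(n_samples, spacing_days)
    return freqs[1] - freqs[0]

# Assumed toy sampling: 4 observations per day
spacing = 0.25
res_40 = frequency_resolution(160, spacing)  # 40 days of data
res_20 = frequency_resolution(80, spacing)   # 20 days of data
print(f"40 days: {res_40:.3f} per day, 20 days: {res_20:.3f} per day")
```

Halving the observing time doubles the bin width, so the peak becomes broader and its frequency less precisely determined.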

### Characterize the peak and noise level in the FFT

This function returns a dictionary containing statistical information about the transformed data.

In [None]:
def fft_noise_stats(xf, yf, verbose=False):
    """Summarize the tallest peak and the background level of an FFT amplitude spectrum."""
    result = {}
    result["peak_height"] = np.max(yf)
    result["peak_freq"] = xf[np.argmax(yf)]
    # Treat every bin at or above 0.5 days^-1 as background, away from the expected peak
    min_bin = np.searchsorted(xf, 0.5)
    other_data = yf[min_bin:]
    result["mean"] = np.mean(other_data)
    result["std"] = np.std(other_data)
    if verbose:
        print("The peak height is {peak_height:.2f} at frequency {peak_freq:.2f} day^-1".format(**result))
        print("Away from the peak, the mean amplitude is {mean:.2f} and the standard deviation is {std:.2f}".format(**result))
    return result
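
As a starting point for defining a significance, one possibility (among many) is to ask how many background standard deviations the peak sits above the background mean. The helper `peak_significance` below is hypothetical and not part of the provided functions, and the numbers in `example` are made up to mimic the dictionary format returned by `fft_noise_stats`.

```python
def peak_significance(stats_dict):
    """Background standard deviations by which the peak exceeds the background mean."""
    return (stats_dict["peak_height"] - stats_dict["mean"]) / stats_dict["std"]

# Made-up numbers in the same format as the fft_noise_stats dictionary
example = {"peak_height": 12.0, "peak_freq": 0.24, "mean": 2.0, "std": 2.5}
significance = peak_significance(example)
print(f"Peak significance: {significance:.1f} sigma")
```

Keep in mind that FFT amplitudes are not Gaussian-distributed, so this number is a rough score and should not be read directly as a Gaussian probability.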

# Case 1: using idealized data

This case represents some very idealized measurements.  There is no instrumental error, and we observe the star a few times a day for 40 days.  This results in a very clear and convincing signal.

In [None]:
# Passing the path directly lets NumPy open and close the file itself
data = np.loadtxt("../data/51peg_model_rvs.txt", usecols=range(2))

# This is how we pull out the data from columns in the array.
date = data[:,0] - np.min(data[:,0])
rvel = data[:,1]

plot_rvel(date, rvel)

In [None]:
xf, yf = do_fft(date, rvel)
fft_noise_stats(xf, yf, True)

# Case 2: shorter observation, but still idealized

In this case we still have no measurement error, but we only observed for 20 days.  We still get a very nice clear signal.

In [None]:
less_data = degrade_data(date, rvel, tfrac=0.5)
plot_rvel(less_data[0], less_data[1])

In [None]:
xf, yf = do_fft(less_data[0], less_data[1])
fft_noise_stats(xf, yf)

# Case 3: full observation time, but noisy data

In this case we have about $100 \frac{m}{s}$ of noise in the measurements of $v_\mathrm{rad}$.  Even though we can't really see a signal in the time series, we can see a pretty clear signal in the Fourier transform.

In [None]:
noisy_data = degrade_data(date, rvel, noise_scale=100.)
plot_rvel(noisy_data[0], noisy_data[1])

In [None]:
xf, yf = do_fft(noisy_data[0], noisy_data[1])
fft_noise_stats(xf, yf)

# Case 4: less observation time, and noisy data

In this case we have both measurement noise and a shorter observation, and the signal is becoming marginal.

In [None]:
noisy_short_data = degrade_data(date, rvel, noise_scale=100., tfrac=0.5)
plot_rvel(noisy_short_data[0], noisy_short_data[1])

In [None]:
xf, yf = do_fft(noisy_short_data[0], noisy_short_data[1])
fft_noise_stats(xf, yf)
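
The four cases above suggest a more systematic scan over noise level and observing time. The sketch below shows one way such a scan might look. It is self-contained and uses a synthetic sinusoid as a stand-in for the provided data; the amplitude, period, sampling, noise levels, number of trials, and the helper `toy_significance` are all assumptions for illustration. The score is a simple (peak minus background mean) divided by background standard deviation, which is only one possible choice.

```python
import numpy as np
from scipy.fft import fft, fftfreq

rng_scan = np.random.default_rng(0)  # seeded so the scan is repeatable

# Toy stand-in for the provided data (assumed values, not the real file)
SPACING = 0.25                                         # days between samples
toy_date = np.arange(0, 40, SPACING)                   # 40 days of observations
toy_rvel = 50.0 * np.sin(2 * np.pi * toy_date / 4.0)   # 50 m/s, 4-day period

def toy_significance(noise_scale, tfrac):
    """Hypothetical score: (peak - background mean) / background std."""
    keep = toy_date <= tfrac * np.max(toy_date)
    n = int(keep.sum())
    noisy = toy_rvel[keep] + rng_scan.normal(scale=noise_scale, size=n)
    amp = 2.0 / n * np.abs(fft(noisy - noisy.mean())[: n // 2])
    freqs = fftfreq(n, SPACING)[: n // 2]
    background = amp[freqs > 0.5]                      # bins well above the signal
    return (amp.max() - background.mean()) / background.std()

# Average the score over repeated noise realizations for each setting
results = {}
for noise_scale in (25.0, 100.0, 400.0):
    for tfrac in (0.5, 1.0):
        trials = [toy_significance(noise_scale, tfrac) for _ in range(20)]
        results[(noise_scale, tfrac)] = np.mean(trials)
        print(f"noise={noise_scale:6.1f}  tfrac={tfrac:.1f}  "
              f"mean significance={results[(noise_scale, tfrac)]:.1f}")
```

Averaging over several noise realizations matters here, because a single noisy data set can give a score that is unusually high or low by chance.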