# Using minimizers and the parameter choice in model fitting

### Goals:

1. To continue fitting models of time-variability to the Vela pulsar data.
2. To understand how our choice of model parameters can affect the quality of the results.

### Timing

1. Try to finish this notebook in 35-40 minutes. 

### Question and Answer Template

You can go to the link below and choose "File" -> "Make a copy" to make yourself a Google Doc that you can use to fill in the answers to the questions in this week's notebooks.

https://docs.google.com/document/d/1Q2jJXXkg0rISnucJdY6voYC9vC_xVOljSD6p3Mm4FiY/edit?usp=sharing

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import scipy.optimize as optimize
import datetime

plt.rcParams['font.size'] = 14

### New functions we will use in this module

| Function Name            | What it does |
| - | - |
| scipy.optimize.minimize  | Find the set of parameters that minimize a function |


### Ok let's pick up where we left off with the Vela pulsar data

(This cell is just a repeat of loading the Vela data)

In [None]:
data = np.loadtxt(open("../data/Vela_Flux.txt", 'rb'), usecols=range(7))

## This is how we pull out the data from columns in the array.

## This is the date in "Mission Elapsed Time". For the Fermi mission, this 
## is defined to be the number of seconds since the start of 2001.
date_MET = data[:,0]

## This is the offset in seconds between the Fermi "MET" and the UNIX 
## "epoch" used by matplotlib
MET_To_Unix = 978336000

## These are the number of photons observed from Vela each week in 
## the "low" Energy Band (100 MeV - 800 MeV)
nObs_LE = data[:,1]

## These are the number of photons expected from Vela each week, under 
## the assumption that it is not varying at all, and the only differences 
## depend on how long we spent looking at Vela that particular week
nExp_LE = data[:,2]

## These are the band bounds, in MeV
LE_bounds = (100., 800.)

## We will also take a look at data in the "high" energy band 
## (800 MeV - 10000 MeV)
nObs_HE = data[:,4]
nExp_HE = data[:,5]
signif_HE = data[:,6]
HE_bounds = (800., 10000.)

## This converts the dates to something that matplotlib understands
dates = [datetime.datetime.fromtimestamp(date + MET_To_Unix) for date in date_MET]

## Convert the dates to years to make the numbers more sensible
date_YEAR = 2001 +  (date_MET / (24*3600*365))
years_since_mid_2014 = date_YEAR  - 2014.5

### Now let's add the functions we needed to minimize the $\chi^2$

That includes the model, a function to calculate the residuals and a function to compute the $\chi^2$.

These are exactly the same functions that we used last week.
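
For reference, the statistic computed by `chi2_function` can be written out as follows, where $f(x; p_0, p_1) = p_0 + p_1 x$ is the linear model and $\sigma_i$ is the uncertainty on the $i$-th data point (the index $i$ is notation introduced here for illustration):

$$
\chi^2(p_0, p_1) = \sum_i \left( \frac{y_i - f(x_i; p_0, p_1)}{\sigma_i} \right)^2
$$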

In [None]:
def linear_function(xvals, params):
    return params[0] + xvals*params[1]

## Function to calculate an array of residuals as (data) - (model)
def residual_function(data_x, data_y, model_function, params):
    model_y = model_function(data_x, params)
    residual = data_y - model_y
    return residual

## Compute the chi-squared test statistic for a particular data set, 
## model, and set of parameters. We'll minimize this statistic as a 
## function of the model parameters.
def chi2_function(data_x, data_y, data_sigma_y, model_function, params):
    model_y = model_function(data_x, params)
    chi2 = ((data_y - model_y)/(data_sigma_y))**2
    return np.sum(chi2)

# Function minimizers

Finding the set of parameters that minimize a function is a very common thing to do in research. In our case, we are looking for the set of model parameters that give us the smallest $\chi^2$, i.e., the best fit.

Last week, we did this by hand by varying two parameters in a linear model and calculating the $\chi^2$ for each set of values. 

Since this is such a common thing to do, there are many software packages that will do it.  Typically they refer to the function that is being minimized as the "cost function", and they expect you to provide a function that takes only the model parameters as inputs.

So we are going to write a "cost function" that just calls our $\chi^2$ with the right versions of the data and model.  

Because the software we use (`scipy.optimize`) has a slightly different convention than what we are using, we will multiply the $\chi^2$ by a factor of 0.5 so that the uncertainty estimates returned by the minimizer will be correct.
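
One common way to motivate the factor of 0.5 is sketched here, assuming Gaussian uncertainties $\sigma_i$ (the symbols $L$, $H$, and $C$ are notation introduced for this sketch). Up to a constant, the negative log-likelihood is half the $\chi^2$, and the parameter covariance is estimated from the inverse of its matrix of second derivatives (the Hessian):

$$
-\ln L(p) = \frac{1}{2}\chi^2(p) + \text{const}, \qquad
H_{jk} = \frac{\partial^2}{\partial p_j \, \partial p_k}\left[\frac{1}{2}\chi^2(p)\right], \qquad
C = H^{-1}
$$

If we minimized $\chi^2$ itself, the Hessian would be twice as large, its inverse half as large, and the quoted uncertainties too small by a factor of $\sqrt{2}$.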

In [None]:
## Calculate our excess counts from the observed and expected data.
## This will be the dataset to which we fit a linear model
excess_counts = nObs_LE-nExp_LE

## Assign Poisson-like uncertainties to the measured excess counts.
sigma_counts =  np.sqrt(nObs_LE)

## Define a cost function that takes only model parameters as input
## and returns the chi-squared statistic.
def cost_function(params):
    return 0.5*chi2_function(years_since_mid_2014, excess_counts, sigma_counts, 
                             linear_function, params)

### Invoking the minimizer and looking at the result

First we are going to invoke the minimizer in the next cell.

It is worth spending a bit of time thinking about what it is doing.

Note that we pass three things to the minimizer:
   1. the `cost_function` (you might not have explicitly noticed this before, but we can pass functions to other functions)
   2. an initial guess as to the parameter values.  In this case we will start with (0., 0.), i.e., slope and offset are both zero.
   3. The method of minimization. There are many algorithms that have been developed to minimize functions. Here we choose one of the speedier algorithms, but you can specify different algorithms, or allow SciPy to make a choice for you based on the optional arguments you pass to the function (see the [SciPy reference](https://docs.scipy.org/doc/scipy/reference/optimize.html)).
   
And note that the minimizer returns a `result` object to us, which we will explore in the second cell.
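
If you would like to see the three inputs and the `result` object in isolation first, here is a minimal sketch on a toy problem. The function `toy_cost` is a made-up quadratic bowl (not the Vela cost function) whose minimum is known to be at (1, -2), so it is easy to check what the minimizer hands back.

```python
import numpy as np
import scipy.optimize as optimize

## Hypothetical toy cost function: a quadratic bowl with its minimum at (1, -2).
## This is an illustration only, not the chi-squared for the Vela data.
def toy_cost(params):
    return 0.5 * (((params[0] - 1.0) / 0.5) ** 2 + ((params[1] + 2.0) / 2.0) ** 2)

## Pass the function itself, an initial guess, and the method
toy_result = optimize.minimize(toy_cost, [0.0, 0.0], method="BFGS")

## The result can be read with dictionary keys or with attributes
print("Best parameters:  ", toy_result['x'])
print("Cost at minimum:  ", toy_result['fun'])
print("Function calls:   ", toy_result['nfev'])
print("Inverse Hessian:\n", toy_result['hess_inv'])
```

For BFGS the `hess_inv` entry is an approximation that the algorithm builds up as it iterates, so it is worth treating the uncertainties derived from it as estimates.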

In [None]:
result = optimize.minimize(cost_function, [0., 0.], method="BFGS")

In [None]:
## Extract the best fit parameters and the value of the cost 
## function at the minimum. The 'result' is a dictionary with 
## many different keys
pars = result['x']
fmin = result['fun']
p0_best = pars[0]
p1_best = pars[1]

## Extract the covariance in parameters, as estimated from the
## inverse of the Hessian matrix.
cov = result['hess_inv']

## Compute uncertainties as the square root of the diagonal
## elements of the covariance matrix, and then compute the
## correlation of the parameters based on these values
p0_err = np.sqrt(cov[0,0])
p1_err = np.sqrt(cov[1,1])
correl = cov[1,0]/(p0_err*p1_err)

print("Fitter result:")
print(result)

print("")
print("Human readable version ---------------")
print(f"               p0 best fit value: {p0_best:.1f} ± {p0_err:.1f} counts")
print(f"               p1 best fit value: {p1_best:.1f} ± {p1_err:.1f} counts / year")
print(f"  Minimum value of cost function: {fmin:.1f}")
print(f"         Minimum value of chi**2: {(2*fmin):.1f}")
print(f"   Correlation between p0 and p1: {correl:.2f}")
print(f"  Number of times cost function was evaluated to find minimum: {result['nfev']}")


### Ok, now let's draw the fit result on top of the contours we made last week

This first cell is just a copy of something we did last week, although the number of manual scan points has been reduced to better illustrate the inaccuracy as compared to the numerical minimizer.
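
As an aside, the same kind of grid scan can be written without explicit loops by using NumPy broadcasting. The sketch below uses synthetic stand-in data (an assumption, not the Vela data) and a grid with different sizes along each axis, and it uses `np.unravel_index` to turn the flat index from `argmin` back into a (row, column) pair.

```python
import numpy as np

## Synthetic stand-in data for illustration (NOT the Vela data)
rng = np.random.default_rng(0)
x = np.linspace(-6.0, 6.0, 50)
sigma = np.full_like(x, 2.0)
y = 1.0 + 0.5 * x + rng.normal(0.0, sigma)

## Grid of parameter values; the two axes deliberately have different lengths
offsets = np.linspace(-10, 10, 21)
slopes = np.linspace(-5, 5, 11)

## Broadcast to an array of shape (n_offsets, n_slopes, n_data),
## then sum over the data axis to get chi^2 at every grid point
model = offsets[:, None, None] + slopes[None, :, None] * x[None, None, :]
chi2_grid = np.sum(((y - model) / sigma) ** 2, axis=-1)

## Convert the flat argmin index back into grid indices
i_best, j_best = np.unravel_index(np.argmin(chi2_grid), chi2_grid.shape)
print(f"Grid minimum at offset = {offsets[i_best]:.1f}, slope = {slopes[j_best]:.1f}")
```

Using `np.unravel_index` avoids having to remember which of `nx` or `ny` belongs in the integer division when the grid is not square.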

In [None]:
## Alternate number of points to scan in this 2D implementation
nx = 21
ny = 21

params = np.array([0., 0.])
chi2_2d_scan_vals = np.zeros((nx, ny))

## Scan the offset between -10 and 10, and the slope between -5 and 5
offset_2d_scan_points = np.linspace(-10, 10, nx)
slope_2d_scan_points = np.linspace(-5, 5, ny)

## Double loop for 2d scan, sampling the value of chi^2 at each combination of 
## parameter values: offset and slope
for i in range(nx):
    params[0] = offset_2d_scan_points[i]
    for j in range(ny):
        params[1] = slope_2d_scan_points[j]

        chi2_2d_scan_vals[i,j] = chi2_function(years_since_mid_2014, excess_counts, sigma_counts,
                                               linear_function, params)

## Subtract the minimum chi2 value to get the "delta chi^2"
min_chi2 = chi2_2d_scan_vals.min()
chi2_2d_scan_vals -= min_chi2

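## This next bit gets the x and y axis indices for the grid point at the minimum.
## argmin() returns an index into the flattened (nx, ny) array, so we convert
## it back into a (row, column) pair.
idx = chi2_2d_scan_vals.argmin()
idx_x, idx_y = np.unravel_index(idx, chi2_2d_scan_vals.shape)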

## Get the values of the offset and slope associated to the minimum chi^2
scan_min_x = offset_2d_scan_points[idx_x]
scan_min_y = slope_2d_scan_points[idx_y]

This second cell is almost a copy of the plot from last week, but with the numerical minimizer result also plotted.

In [None]:
## Now let's plot it!
fig, ax = plt.subplots(figsize=(8, 6))

## Plot the chi^2 values as a 2D image, with the color scale 
## indicating the value of chi^2
img = ax.imshow(chi2_2d_scan_vals.T, extent=(-10, 10, -5, 5), origin='lower', aspect='auto', cmap='cividis')
fig.colorbar(img, label=r"$\Delta \chi^2$")

## Plot some contours associated to different values of delta chi^2, 
## and thus different uncertainties on the computed parameters. We'll 
## plot contours for 1, 4, and 9, which correspond to 1, 2, and 3 
## sigma uncertainties
ax.contour(offset_2d_scan_points, slope_2d_scan_points, chi2_2d_scan_vals.T, levels=[1, 4, 9], colors="white")

## Plot the manual best-fit point
ax.scatter(scan_min_x, scan_min_y, color='red', marker='x', s=50, label="manual")

## Plot the NEW fit point from our numerical minimizer
ax.errorbar(p0_best, p1_best, xerr=p0_err, yerr=p1_err, color='cyan', label="minimized")

ax.set_xlabel(r"$p_0$ = Offset [counts]")
ax.set_ylabel(r"$p_1$ = Slope [counts/year]")

ax.legend(fontsize=10)

fig.tight_layout()

plt.show()

### Questions for discussion

#### 1.1  Let's make sure that you understand what we have done so far. Describe in some detail the relationship between the colormap, the white contours, the cyan error bars, and the red X. What does each thing represent physically? How do they correspond to the information in the "human readable" printout of the result?  (Note that this is the same colormap that we made at the end of last week's second notebook when we did the two dimensional scan over the parameter values.)

#### 1.2 To make the colormap we had to do a double loop over a grid of points for the parameters $p_0$ and $p_1$. The grid was 21x21, for a total of 441 points, meaning we had to evaluate the cost function 441 times. In this relatively simple example, we only had 2 parameters; imagine instead that we had 3 or 4 parameters. How would that affect the time it took to evaluate the cost function on a grid over all the parameters? Compare that to the number of calls that the fitter makes in order to find the minimum.

### Reproducibility

Let's play around with the fitter and try it with 10 different initial guesses and compare the results and the number of calls to the cost function it takes.
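
Because the starting points are random, every run of the cell will print different initial guesses. If you want to be able to repeat exactly the same set of trials later, one option is to draw the starting points from a seeded random generator. This is a minimal sketch; the helper name `make_starting_points` and the seed value are arbitrary choices made for illustration.

```python
import numpy as np

## Hypothetical helper: draw starting points from a seeded generator so that
## the same seed always gives the same list of initial guesses.
def make_starting_points(seed, n_starts=10):
    rng = np.random.default_rng(seed)
    offsets = rng.uniform(-10, 10, size=n_starts)   # offset between -10 and 10
    slopes = rng.uniform(-5, 5, size=n_starts)      # slope between -5 and 5
    return np.column_stack([offsets, slopes])

starts_a = make_starting_points(seed=12345)
starts_b = make_starting_points(seed=12345)

## Same seed, same starting points
print(np.array_equal(starts_a, starts_b))
```

Each row of the returned array could then be passed to the minimizer as `x0`.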

In [None]:
## Loop over 10 different random starting points, and do the minimization
## for each of them and print the result.
for i in range(10):

    ## Generate random starting point, sampling the constant from a 
    ## uniform distribution between -10 and 10, and the slope from a
    ## uniform distribution between -5 and 5
    x0 = [np.random.uniform(-10, 10), np.random.uniform(-5, 5)]

    ## Optimize our cost function and extract the best fit parameters
    result = optimize.minimize(cost_function, x0=x0)
    best_fit = result['x']

    ## Print the best fit parameters and the number of times the cost
    ## function was evaluated to find the minimum.
    print(f"Initial guess: ({x0[0]:+.2f} {x0[1]:+.2f}): found fit result: ({best_fit[0]:.6f} {best_fit[1]:.6f}) after {result['nfev']} calls to the cost function")

### Question for discussion

#### 2.1 The results for the 10 trials are very similar but not quite identical. Does this make sense to you? Do you have an idea about why the results aren't identical? How do you think the fitter decides that it is done? (Don't worry if you don't know the answer, but think about how it might decide and elaborate on some of those thoughts here)

# Correlation between model parameters

In last week's notebook, you may have noted that we set t=0 to be in the middle of 2014, which might seem arbitrary. We've reproduced the plot below as a reference to remind you about what the data looked like.

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))

ax.errorbar(years_since_mid_2014, excess_counts, 
            fmt='ko', ms=4, ecolor='pink', yerr=sigma_counts)

ax.set_xlabel(r"Time since mid 2014 [years]")
ax.set_ylabel(r"$n_{\rm excess}$ [per week]")

fig.tight_layout()

plt.show()

There was actually a very good reason to do that, which we will examine now. 

The Fermi telescope actually launched in 2008, so we could set t=0 to be in January 2008.

In [None]:
years_since_2008 = date_YEAR  - 2008

In [None]:
## Making the same plot, but with our new x-axis values
fig, ax = plt.subplots(figsize=(8, 6))

ax.errorbar(years_since_2008, excess_counts, 
            fmt='ko', ms=4, ecolor='pink', yerr=sigma_counts)

ax.set_xlabel(r"Time since 2008 [years]")
ax.set_ylabel(r"$n_{\rm excess}$ [per week]")

fig.tight_layout()

plt.show()

Now, let's think about what happens to fitting our linear model when we make this change. 

Keep in mind that the model function we are using is a simple equation for a line: $y = p_0 + p_1 x$

By moving the zero point of the x-axis, we are changing how that function describes the data, essentially adding a horizontal (time) offset to the $x$ points. The model parameters are now more *correlated*. To demonstrate this, we will make a plot for different values of $p_1$.
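
The shift can be written out explicitly. As a sketch, write the new time axis as $x' = x + \Delta$, where $\Delta = 6.5$ years is the offset between January 2008 and mid 2014 ($\Delta$ and the primed symbols are notation introduced here). The same straight line then becomes

$$
y = p_0 + p_1 x = p_0 + p_1 (x' - \Delta) = (p_0 - p_1 \Delta) + p_1 x' \equiv p_0' + p_1' x'
$$

so the slope is unchanged, $p_1' = p_1$, while the new offset $p_0' = p_0 - p_1 \Delta$ now depends on the slope.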

In [None]:
## Start by making the same plot as before with our excess counts and 
## associated uncertainties, but now plotted with a different x-axis
fig, ax = plt.subplots(figsize=(8, 6))

ax.errorbar(years_since_2008, excess_counts, 
            alpha=0.2, yerr=sigma_counts)

ax.set_xlabel(r"Time since 2008 [years]")
ax.set_ylabel(r"$n_{\rm excess}$ [per week]")

## Plot the linear model for a few different values of the slope parameter
## i.e. parameter p1, i.e. params[1], all with a constant offset of 0
xvals = years_since_2008
## Use a float array so that non-integer slopes (e.g. 7.5) are not truncated
params = np.array([0., 0.])
for slope in np.linspace(-15, 15, 5):
    params[1] = slope
    ax.plot(xvals, linear_function(xvals, params), lw=2,
            label=rf"Slope = {slope:0.1f}", zorder=5)

ax.legend(fontsize=10)

fig.tight_layout()

plt.show()

As you can see, all the lines cross at t=0, which is now off the left side of the plot.  Before it was more or less in the middle of the plot.  

**What this means is that if you were to pick a value like $p_1 = 7.5$, the model tends to be above the average of the data for the entire time.  This means that you would have to change the offset $p_0$ to a negative number to compensate.**

That is exactly what we mean when we say that the parameters have become more correlated: a change in one parameter necessitates a change in another parameter in order to keep the model consistent with the data. 
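
For a straight-line fit, this correlation can also be computed directly, without running a minimizer, because the Hessian of $\frac{1}{2}\chi^2$ depends only on the $x$ values and the uncertainties. The sketch below uses synthetic, evenly spaced times with unit uncertainties (an assumption, not the Vela data), and the helper names are made up for illustration.

```python
import numpy as np

## Hypothetical helper: covariance of (p0, p1) for the model y = p0 + p1*x,
## taken as the inverse of the Hessian of 0.5 * chi^2.
def linear_fit_covariance(x, sigma):
    w = 1.0 / sigma**2
    hessian = np.array([[np.sum(w),     np.sum(w * x)],
                        [np.sum(w * x), np.sum(w * x**2)]])
    return np.linalg.inv(hessian)

## Correlation coefficient from a 2x2 covariance matrix
def correlation(cov):
    return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

## Synthetic time axes: one centered on zero, one shifted by 6.5 years
x_centered = np.linspace(-6.0, 6.0, 200)
x_shifted = x_centered + 6.5
sigma = np.ones_like(x_centered)

cov_centered = linear_fit_covariance(x_centered, sigma)
cov_shifted = linear_fit_covariance(x_shifted, sigma)

print(f"Correlation with centered x: {correlation(cov_centered):+.2f}")
print(f"Correlation with shifted x:  {correlation(cov_shifted):+.2f}")
```

The off-diagonal term of the Hessian is the weighted sum of the $x$ values, which is why placing $t=0$ near the middle of the data keeps the two parameters close to uncorrelated.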

Let's explore this effect with the minimizer.

First we have to make a version of the cost function that uses this version of the x-axis data. 

In [None]:
def cost_function_bad(params):
    return 0.5*chi2_function(years_since_2008, excess_counts, sigma_counts, linear_function, params)

Now, let's minimize the cost function again to find the best fit parameters:

In [None]:
## Minimize our new cost function, starting at (0, 0)
result_bad = optimize.minimize(cost_function_bad, [0., 0.])

## Extract the best fit parameters and the value of the cost 
## function at the minimum. 
pars_bad = result_bad['x']
fmin_bad = result_bad['fun']
p0_best_bad = pars_bad[0]
p1_best_bad = pars_bad[1]

## Extract the covariance in parameters, as estimated from the
## inverse of the Hessian matrix.
cov_bad = result_bad['hess_inv']

## Compute uncertainties as the square root of the diagonal
## elements of the covariance matrix, and then compute the
## correlation of the parameters based on these values
p0_err_bad = np.sqrt(cov_bad[0,0])
p1_err_bad = np.sqrt(cov_bad[1,1])
correl_bad = cov_bad[1,0]/(p0_err_bad*p1_err_bad)

print("")
print("Human readable version: 'bad' idea ---------------")
print(f"  Minimum value of cost function: {fmin_bad:.1f}")
print(f"         Minimum value of chi**2: {(2*fmin_bad):.1f}")
print(f"                     p0 best fit: {p0_best_bad:.1f} ± {p0_err_bad:.1f} counts")
print(f"                     p1 best fit: {p1_best_bad:.1f} ± {p1_err_bad:.1f} counts / year")
print(f"   Correlation between p0 and p1: {correl_bad:.2f}")

Let's make a 2D colormap similar to the one we made before with our manual scan, but now using the "bad" cost function.

First, we'll generate the scan points to make the colormap itself, then we'll plot the minimized result on top. For this step, we've increased the number of scan points from 21 to 51 to improve the smoothness of the visualization.

In [None]:
## Alternate number of points to scan in this 2D implementation
nx = 51
ny = 51

params = np.array([0., 0.])
chi2_2d_scan_vals_bad = np.zeros((nx, ny))

## Scan the offset between -10 and 10, and the slope between -5 and 5
offset_2d_scan_points = np.linspace(-10, 10, nx)
slope_2d_scan_points = np.linspace(-5, 5, ny)

## Double loop for 2d scan, sampling the value of chi^2 at each combination of 
## parameter values: offset and slope
for i in range(nx):
    params[0] = offset_2d_scan_points[i]
    for j in range(ny):
        params[1] = slope_2d_scan_points[j]

        ## Scan the "bad" values simply by using the "years_since_2008" array
        ## instead of the "years_since_mid_2014" array.
        chi2_2d_scan_vals_bad[i,j] = chi2_function(years_since_2008, excess_counts, sigma_counts,
                                               linear_function, params)

## Subtract the minimum chi2 value to get the "delta chi^2"
min_chi2_bad = chi2_2d_scan_vals_bad.min()
chi2_2d_scan_vals_bad -= min_chi2_bad

Now we plot our minimized result together with the "BAD" 2D scan:

In [None]:
## Now let's plot our "BAD" data
fig, ax = plt.subplots(figsize=(8, 6))

## Plot the chi^2 values as a 2D image, with the color scale 
## indicating the value of chi^2
img_bad = ax.imshow(chi2_2d_scan_vals_bad.T, extent=(-10, 10, -5, 5), origin='lower', aspect='auto', cmap='cividis')
fig.colorbar(img_bad, label=r"$\Delta \chi^2$")

## Plot some contours associated to different values of delta chi^2, 
## and thus different uncertainties on the computed parameters. We'll 
## plot contours for 1, 4, and 9, which correspond to 1, 2, and 3 
## sigma uncertainties
ax.contour(offset_2d_scan_points, slope_2d_scan_points, chi2_2d_scan_vals_bad.T, levels=[1, 4, 9], colors="white")

## Plot the best fit from the minimizer run on the "bad" cost function
ax.errorbar(p0_best_bad, p1_best_bad, xerr=p0_err_bad, yerr=p1_err_bad, color='cyan')

ax.set_xlabel(r"$p_0$ = Offset [counts]")
ax.set_ylabel(r"$p_1$ = Slope [counts/year]")

fig.tight_layout()

plt.show()

## Print some comparisons of our two fits, using the chi^2 at each
## minimizer result (twice the minimum value of the cost function)
print(f" New best fit chi**2 is {(2*fmin_bad):0.1f} for ({p0_best_bad:+0.1f} ± {p0_err_bad:0.1f}, " 
        + f"{p1_best_bad:+0.1f} ± {p1_err_bad:0.1f})" )
print(f"Original fit chi**2 was {(2*fmin):0.1f} for ({p0_best:+0.1f} ± {p0_err:0.1f}, " 
        + f"{p1_best:+0.1f} ± {p1_err:0.1f})", end='\n\n')
print(f"Original correlation was {correl:+0.2f}, now it is {correl_bad:+0.2f}")

### Questions for discussion

#### 3.1 What is going on in this plot? Why are the contours tilted? Why are the error bars larger? Explain your answers to these questions in some amount of detail.

#### 3.2 What does this tell us about what we should consider when building models?