# Data analysis - Introduction for FYSC12 Labs

## Table of Contents

* [About this Notebook](#about)


* [Importing python packages](#import)


* [Read experimental data from file](#read)
    * [Loading spectrum](#load)
    * [Plotting the data](#plot)


* [Analyzing the data](#fit)
    * [Fitting a Gaussian](#gaussian)
        * [Calculate Peak Area](#peak_area)
        * [Analysis code](#code_gauss)
        * [Gaussian widget](#widget_gauss)
    * [Fitting a Gaussian with linear background](#background)
        * [Analysis code](#code_background)
        * [Background widget](#widget_background)
    * [Fit a line - Energy calibration](#line)
    * [Statistical analysis - Error propagation](#stat)

## About this Notebook <a name="about"></a>

The purpose of this _Jupyter_ notebook is to introduce data analysis in the
context of gamma spectroscopy. The example programming language is _Python3_, but
most programming languages can do the job equally well. If you have never
programmed before, there are many good tutorials available on the web,
including _Open Online Courses_ such as
https://www.coursera.org/learn/python. Have a look around for the one
that suits you best. Note, however, that you do not need to be an expert in
Python to pass the lab.

The data analysis can roughly be divided into four steps:
1. Read experimental data from file.
2. Fit Gaussians to peaks.
3. Calibrate the detector response.
4. Perform a statistical analysis (e.g. error propagation) and present results.

A dedicated Python library, i.e. a folder with already written code, located in
`HelpCode`, has been implemented for the data analysis connected to your labs. The folder comprises functions that support steps 1-3 of the
list above.

Full Python3 coding examples of how to perform the different steps of the data
analysis are given below. Every example ends with a template showing how the
`HelpCode` folder can be used to perform the same calculations.


## Importing python packages <a name="import"></a>

Here is **the full list of packages** needed to run the code in this Jupyter Notebook. 

In [1]:
# Packages to help importing files 
import sys, os
sys.path.append('../')

# Package that supports working with large arrays
import numpy as np  

# Package for plotting 
import matplotlib
# choose a backend for web applications; remove for stand-alone applications:
matplotlib.use('Agg')
# enable interactive notebook plots (alternative: use 'inline' instead of 'notebook'/'widget' for static images)
%matplotlib notebook

# The following line is the ONLY one needed in stand-alone applications!
import matplotlib.pyplot as plt

# Function that fits a curve to data 
from scipy.optimize import curve_fit

# Package to create interactive plots 
from ipywidgets import interact, interactive, fixed, widgets, Button, Layout


%load_ext autoreload
%autoreload 2

# Custom packages prepared for you to use when analyzing experimental data from labs 
import fithelpers, histhelpers, MCA, fittingFunctions

The line `sys.path.append('../')` adds the parent directory to the path so that the analysis code in `fithelpers.py`, `histhelpers.py`, `MCA.py` and `fittingFunctions.py` can be found by Python.

In [2]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

--------------------------------------------------------------------------------------------------------------

# Read experimental data from file <a name="read"></a>

## Loading spectrum <a name="load"></a>

With the help of the function `load_spectrum` from the `MCA` package, the experimental data can be read from a data file as follows:

In [3]:
data = MCA.load_spectrum("test_data.Spe")

_If you are interested in how to read and write files in Python see e.g. http://www.pythonforbeginners.com/files/reading-and-writing-files-in-python or you could have a look at the source code in [MCA.py](../MCA.py)._

`data` is an object of the class `Spectrum`, which stores the information about our histogram: the variables `bin_edges` and `bin_centers` describe the **channels** and `y` holds the **counts** (cf. [MCA.py](../MCA.py)). For instance:

In [4]:
print('bin edges = ', data.bin_edges)
print('bin centers = ', data.bin_centers)
print('y = ', data.y)

bin edges =  [   0    1    2 ... 8190 8191 8192]
bin centers =  [5.0000e-01 1.5000e+00 2.5000e+00 ... 8.1895e+03 8.1905e+03 8.1915e+03]
y =  [0. 0. 0. ... 0. 0. 0.]
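
The relation between the two channel arrays in the printout can be checked directly: each bin center is the midpoint of two neighbouring bin edges. A minimal sketch, assuming 8192 channels of unit width as suggested by the output above (the arrays here are local stand-ins, not the attributes of `Spectrum` itself):

```python
import numpy as np

# Stand-in for data.bin_edges: 8192 channels of unit width (assumed from the printout)
bin_edges = np.arange(0, 8193)

# Each center is the midpoint between the lower and upper edge of a bin
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])

print("first centers:", bin_centers[:3])
print("last center:", bin_centers[-1])
```

There is always one more edge than there are centers, which is why `plt.hist` takes the edges as `bins` while the counts are indexed like the centers.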


## Plotting the data <a name="plot"></a>

It is always good to visualise your data. This is how you can plot and visualise it:

In [5]:
plt.figure(figsize=(12, 8))
# with the data read in with the first routine
plt.step(data.bin_centers, data.y, where='mid', label='step')

plt.title("Test spectrum") # set title of the plot
plt.xlabel("Channels")     # set label for x-axis 
plt.ylabel("Counts")       # set label for y-axis 
#plt.savefig("test_spectrum.png") #This is how you save the figure


## Could be useful to see this in log scale..?
# plt.yscale('log')
# plt.ylim(bottom=1)


You can also use `plt.hist` to plot a "proper histogram", where you pass the bin centers together with the bin edges and use the counts as weights:

In [6]:
plt.figure(figsize=(12, 8))
# with the data read in with the first routine

plt.step(data.bin_centers, data.y, where='mid', label='step plot', lw=0.1)

plt.hist(data.bin_centers, bins=data.bin_edges, weights=data.y, color='red', label='hist plot')

#plt.show()
plt.title("Test spectrum")
plt.xlabel("Channels")
plt.ylabel("Counts")
plt.legend()
#plt.savefig("test_spectrum.png") #This is how you save the figure



--------------------------------------------------------------------------------------------------------------

# Analyzing data <a name="fit"></a>

In $\gamma$-ray (or other radiation) spectroscopy measurements, the goal is usually to determine the energy and the intensity of the radiation. To find the energy, the centroid of a peak must be determined. The area of the peak represents the intensity of the radiation. A good way to find the peak centroid and area is to fit a Gaussian to the peak.
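
Before fitting, the centroid and area can also be estimated directly from the counts, which is a useful cross-check and a source of initial guesses. The sketch below uses a synthetic, background-free peak on unit-width channels (all numbers are invented for illustration and are not lab data):

```python
import numpy as np

# Synthetic peak for illustration only: Gaussian counts on unit-width channels
channels = np.arange(3270, 3331) + 0.5
true_A, true_mu, true_sigma = 1700.0, 3300.7, 2.7
counts = true_A * np.exp(-(channels - true_mu)**2 / (2.0 * true_sigma**2))

# Area: sum of the counts in the window (valid for unit-width bins and no background)
area_est = counts.sum()

# Centroid: count-weighted mean channel
centroid_est = np.sum(channels * counts) / area_est

# Width: count-weighted standard deviation around the centroid
sigma_est = np.sqrt(np.sum(counts * (channels - centroid_est)**2) / area_est)

print("centroid =", centroid_est, "sigma =", sigma_est, "area =", area_est)
```

With real data a background under the peak biases all three numbers, which is one reason why a fit is usually preferred.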


## Fitting a Gaussian <a name="gaussian"></a>

Read up on the Gaussian function here: [https://en.wikipedia.org/wiki/Gaussian_function](https://en.wikipedia.org/wiki/Gaussian_function)

The following code shows how to use the function `curve_fit` to fit a peak in
the data that was read in above (i.e. you will need to execute the above code
section before this section will work).

_The function `curve_fit` from the `scipy.optimize` module does the job for you, and the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html) describes how to use it. It uses the method of least squares, which you can read about in most course literature on statistics
and, for instance, on [Wolfram MathWorld](http://mathworld.wolfram.com/LeastSquaresFitting.html)._

To fit a Gaussian to your peak you need to provide `curve_fit` with an initial guess for its parameters:

In [32]:
##### Your initial guess here:

mu_guess = 3300 # a guess for position of peak centroid
n = 30          # number of points on each side to include in fit


##### Now we can perform the fit:

peak_center_idx = (np.abs(data.bin_centers-mu_guess)).argmin() # find index of the mu guess value in bin_centers array 
A_guess = data.y[peak_center_idx]                              # a guess for the amplitude of the peak (you do not need to change it)
sigma_guess = 1                                                # guess for sigma 

# select values from bin_centers and y arrays of your data that correspond to your initial guess of a peak
peak_x = data.bin_centers[peak_center_idx-n:peak_center_idx+n]
peak_y = data.y[peak_center_idx-n:peak_center_idx+n]

from scipy.optimize import curve_fit

def GaussFunc(x, A, mu, sigma):
    return A*np.exp(-(x-mu)**2/(2.*sigma**2))

guess = [A_guess, mu_guess, sigma_guess] # our initial guess of parameters for a Gaussian fit 

estimates, covar_matrix = curve_fit(GaussFunc, # name of a function with which you want to perform your fit
                                    peak_x,    # your xdata
                                    peak_y,    # your ydata
                                    p0=guess)  # initial guess for the parameters

A, mu, sigma = estimates[0], estimates[1], estimates[2]

# Plot your Gaussian fit
plt.figure()
plt.step(peak_x, peak_y, where='mid', color='cornflowerblue', label='data')                      # plotting the data 
plt.plot(peak_x, GaussFunc(peak_x, A, mu, sigma), color='forestgreen', label = 'Gaussian fit')   # plotting the peak
plt.legend(loc='upper right', frameon=False)                                                     # adding a legend to your plot
plt.show()

# Print information about your parameters
print("Estimates of (A mu sigma) = (", A, mu, sigma, ")\n")
print("Covariance matrix = \n", covar_matrix, "\n")
print("Uncertainties in the estimated parameters: \n[ sigma^2(A) sigma^2(mu), sigma^2(sigma) ] = \n[", covar_matrix[0][0], covar_matrix[1][1], covar_matrix[2][2], "]" )



Estimates of (A mu sigma) = ( 1701.438287027319 3300.6739637051596 2.690781632536818 )

Covariance matrix = 
 [[ 2.62193671e+02  5.05274805e-06 -2.76435636e-01]
 [ 5.05274805e-06  8.74353634e-04 -7.99117031e-09]
 [-2.76435636e-01 -7.99117031e-09  8.74353625e-04]] 

Uncertainties in the estimated parameters: 
[ sigma^2(A) sigma^2(mu), sigma^2(sigma) ] = 
[ 262.19367129620366 0.0008743536336408339 0.0008743536253622023 ]
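
Note that the diagonal elements printed above are variances. The standard uncertainties of the parameters are their square roots, as in this short sketch that reuses the covariance matrix from the output above (the names `A_err`, `mu_err` and `sigma_err` are introduced here for illustration):

```python
import numpy as np

# Covariance matrix copied from the fit output above (parameter order: A, mu, sigma)
covar_matrix = np.array([
    [ 2.62193671e+02,  5.05274805e-06, -2.76435636e-01],
    [ 5.05274805e-06,  8.74353634e-04, -7.99117031e-09],
    [-2.76435636e-01, -7.99117031e-09,  8.74353625e-04],
])

# Standard uncertainties are the square roots of the variances on the diagonal
A_err, mu_err, sigma_err = np.sqrt(np.diag(covar_matrix))

print("Standard uncertainties (A, mu, sigma) =", A_err, mu_err, sigma_err)
```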


### Calculating Peak Area <a name="peak_area"></a>

There are different ways to calculate the area of a peak in a spectrum. By far the easiest is to calculate the area of the fitted Gaussian function, $\sqrt{2\pi}\,A\,|\sigma|$ (see [https://en.wikipedia.org/wiki/Gaussian_function](https://en.wikipedia.org/wiki/Gaussian_function)).

In [8]:
Area = np.sqrt(2*np.pi)*A*np.abs(sigma)
print('Area of peak is: ', Area)

Area of peak is:  11475.842788639093
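
The uncertainty of the area follows from first-order error propagation. Because `A` and `sigma` are correlated in the fit, the covariance term should be included. The sketch below uses the estimates and covariance elements printed by the fit above; the variable names are introduced here for illustration only:

```python
import numpy as np

# Values copied from the fit output above
A, sigma = 1701.438287027319, 2.690781632536818
var_A = 2.62193671e+02          # covar_matrix[0][0]
var_sigma = 8.74353625e-04      # covar_matrix[2][2]
cov_A_sigma = -2.76435636e-01   # covar_matrix[0][2]

area = np.sqrt(2 * np.pi) * A * np.abs(sigma)

# Partial derivatives of Area = sqrt(2*pi) * A * sigma (sigma is positive here)
dArea_dA = np.sqrt(2 * np.pi) * np.abs(sigma)
dArea_dsigma = np.sqrt(2 * np.pi) * A

# First-order propagation including the correlation between A and sigma
var_area = (dArea_dA**2 * var_A
            + dArea_dsigma**2 * var_sigma
            + 2 * dArea_dA * dArea_dsigma * cov_A_sigma)
area_err = np.sqrt(var_area)

print("Area =", area, "+/-", area_err)
```

Leaving out the covariance term would overestimate the uncertainty here, since the negative correlation between `A` and `sigma` partly cancels the individual contributions.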


### Analysis code <a name="code_gauss"></a>

To produce the same results you can simply use the function `perform_Gaussian_fit` from the `fittingFunctions` package.


**You can just copy the following cell and use it in your Jupyter Notebooks with solutions for laboratories.**

In [25]:
mu_guess = 3300 # guess of a position of a peak centroid 
n = 30 #number of points on each side to include in fit

gauss = fittingFunctions.perform_Gaussian_fit(data.bin_centers, data.y, mu_guess, n)

Area = np.sqrt(2*np.pi)*gauss.A*np.abs(gauss.sigma)
print('Area of peak is: ', Area)


Estimates of (A mu sigma) = ( 1701.438287027319 3300.6739637051596 2.690781632536818 )

Covariance matrix = 
 [[ 2.62193671e+02  5.05274805e-06 -2.76435636e-01]
 [ 5.05274805e-06  8.74353634e-04 -7.99117031e-09]
 [-2.76435636e-01 -7.99117031e-09  8.74353625e-04]] 

Uncertainties in the estimated parameters: 
[ sigma^2(A) sigma^2(mu), sigma^2(sigma) ] = 
[ 262.19367129620366 0.0008743536336408339 0.0008743536253622023 ]

Area of peak is:  11475.842788639093


### Influence of initial guess on your Gaussian fit - widget <a name="widget_gauss"></a>

Now let's look at how our initial guess of the peak centroid position and the number of points influence our fit. Change the values of `mu_guess` and `n`, and check how the change affects your fit.

In [15]:
mu_guess_widget = widgets.IntSlider(value=3300, min=3200, max=3400, step=1, description=r'mu_guess')
n_widget = widgets.IntSlider(value=30, min=15, max=45, step=1, description=r'n_widget')

# Only re-run the fit when the "Update" button is pressed
manual_interact = interact.options(manual=True, manual_name="Update")

# The function gets its own name so that it does not overwrite the decorator above,
# and it uses the slider values that are passed in as arguments
@manual_interact(mu_guess=mu_guess_widget, n=n_widget)
def interactive_gaussian_fit(mu_guess, n):
    fittingFunctions.perform_Gaussian_fit(data.bin_centers, data.y, mu_guess, n)


## Improving your fit - accounting for a linear background  <a name="background"></a>

Often we want to subtract the background from our peak, as the peak may sit on the Compton continuum of other peaks at higher energy. This is needed to correctly determine the intensity of the peak.

In [11]:
#### Your values go here: 

mu_guess = 3300 # guess of a position of a peak centroid
n = 30          #number of points on each side to include in fit

#Let's select channels on both sides of our fit to which we want to fit our line:
left_selection = [3285, 3290]
right_selection = [3310, 3312]

##### Now we can perform the fit:


peak_center_idx = (np.abs(data.bin_centers-mu_guess)).argmin() # find index of the mu guess value in bin_centers array 
# select values from bin_centers and y arrays of your data that correspond to your initial guess of a peak
peak_x = data.bin_centers[peak_center_idx-n:peak_center_idx+n]
peak_y = data.y[peak_center_idx-n:peak_center_idx+n]


############ Selecting points to fit linear function

left_idx = [(np.abs(data.bin_centers-left_selection[0])).argmin(), (np.abs(data.bin_centers-left_selection[1])).argmin()]
right_idx = [(np.abs(data.bin_centers-right_selection[0])).argmin(), (np.abs(data.bin_centers-right_selection[1])).argmin()]

left_x = data.bin_centers[left_idx[0]:(left_idx[1]+1)]
right_x = data.bin_centers[right_idx[0]:(right_idx[1]+1)]

left_y = data.y[left_idx[0]:(left_idx[1]+1)]
right_y = data.y[right_idx[0]:(right_idx[1]+1)]

lin_x = np.concatenate([left_x, right_x])
lin_y = np.concatenate([left_y, right_y])

############ Fitting linear function to selected points

guess = [2, 1]

estimates_lin, covar_matrix = curve_fit(fittingFunctions.LineFunc,
                                    lin_x,
                                    lin_y,
                                    p0 = guess)


############ Subtracting the linear background

peak_lin = fittingFunctions.LineFunc(peak_x, estimates_lin[0], estimates_lin[1]) # fitted background evaluated under the peak
y_subst = peak_y - peak_lin

############ Fit the Gaussian to the peak without background

A_guess = data.y[peak_center_idx]                              # a guess for the amplitude of the peak (you do not need to change it)
sigma_guess = 1                                                # guess for sigma 
guess = [A_guess, mu_guess, sigma_guess]

estimates, covar_matrix = curve_fit(fittingFunctions.GaussFunc,
                                    peak_x,
                                    y_subst,
                                    p0=guess)
g_final = fittingFunctions.Gauss(estimates[0], estimates[1], estimates[2], covar_matrix )



############ Plotting results 
plt.figure()
plt.step(peak_x, peak_y, where='mid', label='data')
# #plot points to which linear function is fitted
plt.step(left_x, left_y, where='mid', color='y')
plt.step(right_x, right_y, where='mid', color='y')
# #plot support lines
plt.plot([left_idx[0]+0.5, left_idx[0]+0.5], [data.y[left_idx[0]]+0.5, g_final.A], color='y', linestyle="--")
plt.plot([left_idx[1]+0.5, left_idx[1]+0.5], [data.y[left_idx[1]]+0.5, g_final.A], color='y', linestyle="--")
plt.plot([right_idx[0]+0.5, right_idx[0]+0.5], [data.y[right_idx[0]]+0.5, g_final.A], color='y', linestyle="--")
plt.plot([right_idx[1]+0.5, right_idx[1]+0.5], [data.y[right_idx[1]]+0.5, g_final.A], color='y', linestyle="--")
# plot Gaussian
#plt.plot(x_fin, GaussFunc(x_fin, g_final.A, g_final.mu, g_final.sigma), color='gray')
plt.plot(peak_x, peak_lin + fittingFunctions.GaussFunc(peak_x, g_final.A, g_final.mu, g_final.sigma), 'forestgreen', label='Gaussian fit')
# plot linear fit
plt.plot(lin_x, fittingFunctions.LineFunc(lin_x, estimates_lin[0], estimates_lin[1]), color='r', label = 'linear fit', alpha=0.6) 
plt.legend(loc='upper right', frameon=False)
plt.show()

########### Printing results
print("Linear fit estimates (k m) = (", estimates_lin[0], estimates_lin[1], ")\n")
print("Estimates of (A mu sigma) = (", g_final.A, g_final.mu, g_final.sigma, ")\n")
print("Covariance matrix = \n", g_final.covar_matrix, "\n")
print("Uncertainties in the estimated parameters: \n[ sigma^2(A) sigma^2(mu), sigma^2(sigma) ] = \n[", g_final.covar_matrix[0][0], g_final.covar_matrix[1][1], g_final.covar_matrix[2][2], "]" )



Linear fit estimates (k m) = ( -0.4516607354738622 1514.0357354716984 )

Estimates of (A mu sigma) = ( 1686.1666652864853 3300.6806786415364 2.6354174721402606 )

Covariance matrix = 
 [[ 8.83293440e+01  1.71721456e-06 -9.20375608e-02]
 [ 1.71721456e-06  2.87704369e-04 -2.68428601e-09]
 [-9.20375608e-02 -2.68428601e-09  2.87704366e-04]] 

Uncertainties in the estimated parameters: 
[ sigma^2(A) sigma^2(mu), sigma^2(sigma) ] = 
[ 88.32934398495689 0.00028770436882524883 0.0002877043663450329 ]
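
An alternative to the two-step procedure above (and not the method used in this notebook) is to fit the Gaussian and the straight line simultaneously with a single five-parameter function. The sketch below does this on synthetic data; `GaussLineFunc` and all numbers are made up for illustration and are not part of `fittingFunctions`:

```python
import numpy as np
from scipy.optimize import curve_fit

def GaussLineFunc(x, A, mu, sigma, k, m):
    # Gaussian peak sitting on a straight-line background
    return A * np.exp(-(x - mu)**2 / (2.0 * sigma**2)) + k * x + m

# Synthetic spectrum window with Poisson noise (illustration only, not lab data)
rng = np.random.default_rng(seed=1)
x = np.arange(3270, 3330) + 0.5
true_params = (1700.0, 3300.7, 2.6, -0.45, 1514.0)
y = rng.poisson(GaussLineFunc(x, *true_params)).astype(float)

# Initial guesses: height above the window edge, tallest channel, rough width, flat background
guess = [y.max() - y[0], x[np.argmax(y)], 3.0, 0.0, y[0]]

estimates, covar_matrix = curve_fit(GaussLineFunc, x, y, p0=guess)
errors = np.sqrt(np.diag(covar_matrix))

for name, value, err in zip(["A", "mu", "sigma", "k", "m"], estimates, errors):
    print(name, "=", value, "+/-", err)
```

A simultaneous fit propagates the background uncertainty into the peak parameters through the covariance matrix, at the price of needing reasonable initial guesses for all five parameters.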


### Analysis code  <a name="code_background"></a>

To make a Gaussian fit that takes the background into account, use the function `perform_Gaussian_fit` from the `fittingFunctions` package and specify the `left_selection` and `right_selection` arrays. _In case you are interested in how the fit is performed, have a look at the function `perform_Gaussian_fit` in [fittingFunctions.py](../fittingFunctions.py)._


**You can just copy the following cell and use it in your Jupyter Notebooks with solutions for laboratories.**

In [30]:
mu_guess = 3300 # guess of a position of a peak centroid 
n = 30 #number of points on each side to include in fit

#Let's select channels on both sides of our fit to which we want to fit our line: 
left_selection = [3285, 3290]
right_selection = [3310, 3312]

gauss = fittingFunctions.perform_Gaussian_fit(data.bin_centers, data.y, mu_guess, n, left_selection, right_selection)


Estimates of (A mu sigma) = ( 1686.1666572422214 3300.68067863157 2.6354174972521753 )

Covariance matrix = 
 [[ 8.83293280e+01  1.71797675e-06 -9.20375679e-02]
 [ 1.71797675e-06  2.87704464e-04 -2.68488877e-09]
 [-9.20375679e-02 -2.68488877e-09  2.87704465e-04]] 

Uncertainties in the estimated parameters: 
[ sigma^2(A) sigma^2(mu), sigma^2(sigma) ] = 
[ 88.32932801661778 0.0002877044638852663 0.0002877044649540639 ]



### Widget background  <a name="widget_background"></a>

In [31]:
mu_guess_widget = widgets.IntSlider(value=3300, min=3250, max=3350, step=1, description=r'mu_guess')
n_widget = widgets.IntSlider(value=30, min=15, max=45, step=1, description=r'n_widget')
# The keyword for the initial (lower, upper) selection is `value`
left_selection_widget = widgets.IntRangeSlider(value=[3285, 3290], min=(mu_guess_widget.value - n_widget.value), max=(mu_guess_widget.value - int(n_widget.value/4)))
right_selection_widget = widgets.IntRangeSlider(value=[3310, 3312], min=(mu_guess_widget.value + int(n_widget.value/4)), max=(mu_guess_widget.value + n_widget.value))

def set_bounds(slider, new_min, new_max):
    # Set the bounds in an order that never makes min exceed max
    if new_min > slider.max:
        slider.max = new_max
        slider.min = new_min
    else:
        slider.min = new_min
        slider.max = new_max

def update_left(change):
    set_bounds(left_selection_widget, mu_guess_widget.value - n_widget.value, mu_guess_widget.value - int(n_widget.value/4))
def update_right(change):
    set_bounds(right_selection_widget, mu_guess_widget.value + int(n_widget.value/4), mu_guess_widget.value + n_widget.value)

# Keep the allowed selection ranges in step with mu_guess and n
for w in (mu_guess_widget, n_widget):
    w.observe(update_left, 'value')
    w.observe(update_right, 'value')

interactive_plot2=interact.options(manual=True, manual_name="Update")

@interactive_plot2(
    mu_guess = mu_guess_widget, n = n_widget, left_selection=left_selection_widget, right_selection=right_selection_widget)
def interactive_plot_background(mu_guess, n, left_selection, right_selection):
    fittingFunctions.perform_Gaussian_fit(data.bin_centers, data.y, mu_guess, n, left_selection, right_selection)


## Fit a line - Energy calibration <a name="line"></a>

In spectroscopy experiments it is often essential to calibrate the detector response with respect to known energies emitted from a so-called calibration source. The relationship between the detector response and the energy is usually assumed to be linear. The code below exemplifies how to estimate the linear calibration for 'random data'.

In [14]:
# x and y are some 'random data'
x = np.asarray([1,3,5,7])
y = np.asarray([1.3, 2.1, 2.9, 4.2])

# If your y-values have different uncertainties, these can be taken into account in the fit by defining them here and passing them as `sigma` to curve_fit.
sigmay = np.asarray([0.5, 0.3, 0.1, 0.2])

# Define the linear function which you want to fit.
def LineFunc(x, k, m):
    return k*x+m

# As for the Gaussian fit the function curve_fit needs a guess for the parameters to be estimated.
guess = [2, 1]

# Perform the fit
estimates, covar_matrix = curve_fit(LineFunc,
                                    x,
                                    y,
                                    p0 = guess,
                                    sigma = sigmay)

print("Estimates of (k m) = (", estimates[0], estimates[1], ")\n")

# plot the result
plt.figure()
plt.plot(x,y, linestyle="", marker="*", label='data points')
plt.plot(x, LineFunc(x, estimates[0], estimates[1]), label='linear fit')
plt.legend(loc='upper left')
plt.show()

Estimates of (k m) = ( 0.51544342764661 0.4022935634807532 )
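
Once `k` and `m` are known, the calibration is applied by evaluating the line at a measured peak centroid. The sketch below illustrates this with hypothetical centroids (the channel numbers are invented; the energies are the well-known lines of Cs-137 and Co-60) and uses `np.polyfit` instead of `curve_fit` for brevity. The helper `channel_to_energy` is introduced here for illustration:

```python
import numpy as np

# Hypothetical peak centroids in channels (invented for illustration)
centroids = np.array([1640.2, 2906.5, 3300.7])
# Known gamma-ray energies in keV: Cs-137 (661.7), Co-60 (1173.2 and 1332.5)
energies = np.array([661.7, 1173.2, 1332.5])

# Straight-line fit: energy = k * channel + m
k, m = np.polyfit(centroids, energies, deg=1)

def channel_to_energy(channel):
    # Apply the linear calibration to a channel number or an array of channels
    return k * channel + m

print("k =", k, "keV/channel, m =", m, "keV")
print("Channel 2000 corresponds to", channel_to_energy(2000.0), "keV")
```

The residuals `channel_to_energy(centroids) - energies` give a quick indication of how well the linear assumption holds.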




## Statistical analysis - Error propagation<a name="stat"></a>

Background theory and instructions on how to perform statistical analysis on
experimental data, with error propagation, can be found in the document
http://www.fysik.lu.se/fileadmin/fysikportalen/UDIF/Bilder/FYSA31_KF_error.pdf.
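
As a small illustration of the kind of calculation involved, the sketch below propagates uncertainties through a count rate R = N / t. It assumes Poisson counting statistics (uncertainty sqrt(N)) and independent uncertainties in N and t; all numbers are invented:

```python
import numpy as np

# Invented example values
N = 11476.0              # counts in a peak
t, t_err = 300.0, 0.5    # measurement time and its uncertainty in seconds

N_err = np.sqrt(N)       # Poisson counting uncertainty

rate = N / t

# For a quotient of independent quantities, relative uncertainties add in quadrature
rate_err = rate * np.sqrt((N_err / N)**2 + (t_err / t)**2)

print("Rate =", rate, "+/-", rate_err, "counts/s")
```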