# Ahmed Mohamed Session 4: Fitting to arbitrary functions

_Script author: louise.dash@ucl.ac.uk    
Updated: 08/01/2019_

<div class="alert alert-success"> <p><b>Intended learning outcomes:</b> </p>
By the end of this session, you should be able to:
<ul>
<li> fit data to any arbitrary function using scipy.optimize.curve_fit; </li>
<li> quantitatively evaluate the goodness of fit;  </li>
<li> reach physical conclusions based on these results. </li>
</ul>
</div>

We've already seen how to fit histograms to a Gaussian, and how to use a polynomial to fit a set of data. The last thing we're going to do in this Data Analysis part of the course is to see how to perform a fit to an arbitrary function. 

In these examples, we'll be looking at whether a Lorentzian or a Gaussian function provides a better fit to some optical lineshape data. However, you can use the same method to fit *any* function, provided you can write a suitable Python function to describe your target "fit" function.

### Context for this example

The data we'll use for this session is taken from the Lab 3 Zeeman effect experiment, which some of you will do yourselves in PHAS0058. 

The Zeeman effect occurs when a spectral line is split into different components by a magnetic field. The physics of the Zeeman effect will be covered in detail in PHAS0023 "Atomic and Molecular Physics".

The Lab 3 experiment examines how the lines in the emission spectrum of a mercury discharge lamp split under a magnetic field. The student records the spectrum using a CCD camera, which yields data in the form of recorded intensity (in counts per second) vs pixel position (in pixels). 

We're not going to be considering the *positions* of the spectral lines in this task; instead, we're going to be looking at the *lineshapes*. Rather than a spectral line with a single energy, the line is broadened into a wider peak by various physical effects. For example, the uncertainty principle leads to broadening which has a Lorentzian form, while there will also be thermal broadening effects, which are Gaussian in nature (there are also several other sources of broadening, with different effects). In theory, for this experiment, Lorentzian broadening is expected to dominate.

In this task we will look at an experimental spectral line recorded by a student in the Lab 3 experiment, fit it to both a Gaussian and Lorentzian, and determine which provides a better fit.

### Getting started with the code

First, we'll import the modules we will need. The new function we import in the cell below comes from the scipy.optimize library - more on this later, when we come to use it. 

In [1]:
%matplotlib notebook

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.optimize import curve_fit # we're importing just this one function from scipy.optimize

plt.rcParams["patch.force_edgecolor"] = True # include outlines on histograms etc



Now we can import the csv (Comma Separated Value) file with the data the student collected, and plot it. You will need to download this file from Moodle, and as usual, put it in the same directory as this notebook.

In [2]:
# import the data...
xdata,ydata = np.loadtxt('Zeeman_data.csv',delimiter=",",unpack=True) # reminder: need to set delimiter for csv files

# ...and plot it.
plt.figure()
plt.plot(xdata,ydata, 'o')
plt.xlabel("Pixel position")
plt.ylabel("Pixel value (counts/second)")
plt.title("Data from Zeeman effect experiment");


We can see that we have a single peak with a constant background level. It looks feasible to attempt fitting this to a Gaussian.


In order to use `curve_fit` to fit this to a Gaussian, we need to write a "target" function to fit to, which in this case will be


$$
f(x) = y_0 + h \exp \left(\frac{-(x-x_0)^2}{2 \sigma^2}\right)
$$

(This is a slightly different definition from the one we used when we were fitting histograms to Gaussians in Session 2. Can you see why?)

The parameters for our Gaussian fit will be the mean value (`x0`), the standard deviation (`sigma`), the background value `y0` and the peak height, `h`. Here is a function that will do exactly this.

In [3]:
def gaussian(x,x0,sigma, y0, h):
    '''Returns a single value or 1D array of Gaussian function values for 
    - input x-value or array of x-values: x
    - mean value of distribution: x0
    - standard deviation of distribution: sigma
    - background value y0
    - peak height, h (measured from background level y0)'''
    gauss = h * np.exp(-(x-x0)**2/(2*sigma**2)) + y0 # the gaussian itself
    return gauss

The four parameters, `x0`, `sigma`, `y0` and `h`, are (as yet) unknown. To find them, we use the `scipy.optimize.curve_fit` function. The full documentation for this is here: http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html#scipy.optimize.curve_fit

We're going to do this in the simplest way possible for the moment, by just sending curve_fit the target function (our "`gaussian`" function), the independent variable (`xdata`) and the dependent variable (`ydata`). We can also, optionally, choose to send an initial guess of the parameters, as well as weightings for each of the ydata data points, but for the moment we won't do that.
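
As a minimal sketch of that simplest call, the example below fits synthetic data (not the Zeeman data) to a hypothetical `decay` function. The function name, the "true" parameter values and the noise level are all assumptions chosen purely for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(x, a, k, c):
    '''Hypothetical target function: exponential decay with a constant offset.'''
    return a * np.exp(-k * x) + c

# Synthetic data: known parameters plus a little Gaussian noise (illustrative values)
rng = np.random.default_rng(seed=1)
x_demo = np.linspace(0, 4, 50)
y_demo = decay(x_demo, 2.5, 1.3, 0.5) + rng.normal(0, 0.05, x_demo.size)

# Simplest possible call: target function, independent variable, dependent variable
popt_demo, pcov_demo = curve_fit(decay, x_demo, y_demo)

print("fitted parameters (a, k, c):", popt_demo)
print("shape of covariance matrix: ", pcov_demo.shape)
```

When no initial guess is given, `curve_fit` starts every parameter at 1. That happens to be close enough for this synthetic example, but it will not be for every data set.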

The `curve_fit` routine returns two arrays. 
 * The first of these is an array of the fitted parameters - in our case this array will have four elements, as we have four parameters, `x0`, `sigma`, `y0`, and `h`.
 * The second is the matrix of covariance - an indication of the goodness of fit. We covered this in Session 3 when we were doing polynomial fitting.
 
 Let's do this, and see what results we get back:

In [4]:
#popt: Optimized parameters
#pcov: matrix of covariance.
popt,pcov = curve_fit(gaussian,xdata,ydata)

print ("popt :\n", popt)
print ("pcov :\n", pcov)

popt :
 [1.         1.         5.30499999 1.        ]
pcov :
 [[inf inf inf inf]
 [inf inf inf inf]
 [inf inf inf inf]
 [inf inf inf inf]]




We can see that this hasn't worked so well - `curve_fit` hasn't been able to find a fit to the data.

Instead, we'll try to make life easier for `curve_fit` by giving an initial guess for the parameters. From looking at the plot of the data, we can see that the peak is at around $x=75$, and the background around $y=3.5$. We'll try a value of 10 for $\sigma$ and 18 for the peak height $h$. These values need to be given in the form of a Python list of numbers, in the same order as the parameters are given to our "`gaussian`" function. Remember - in Python we use `[` square brackets `]` to define a list, with the elements separated by commas.

In [5]:
guess = [75,10,3.5,18] # list of initial guess parameters
# what type of object does the variable "guess" represent?
print ("The variable 'guess' is a ", type(guess) )

The variable 'guess' is a  <class 'list'>
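
Instead of reading the guess off the plot by eye, it can also be estimated from the data. The sketch below uses a noise-free synthetic peak in place of `Zeeman_data.csv`, and the helper `estimate_gaussian_guess` is hypothetical. The heuristics inside it (median for the background, maximum for the peak, width from the points above half maximum) are assumptions that suit a single peak on a flat background.

```python
import numpy as np

def gaussian(x, x0, sigma, y0, h):
    '''Gaussian peak of height h on a constant background y0.'''
    return h * np.exp(-(x - x0)**2 / (2 * sigma**2)) + y0

def estimate_gaussian_guess(x, y):
    '''Hypothetical helper: rough [x0, sigma, y0, h] for one peak on a flat background.'''
    y0 = np.median(y)                       # most points lie on the background
    i_peak = np.argmax(y)                   # index of the highest point
    x0 = x[i_peak]
    h = y[i_peak] - y0                      # height measured from the background
    above_half = x[y > y0 + h / 2]          # x-values above half maximum
    fwhm = above_half.max() - above_half.min()
    fwhm = max(fwhm, x[1] - x[0])           # guard against a zero width
    sigma = fwhm / (2 * np.sqrt(2 * np.log(2)))  # FWHM = 2*sqrt(2 ln 2)*sigma
    return [x0, sigma, y0, h]

# Noise-free synthetic peak (illustrative parameter values)
x_demo = np.arange(40.0, 110.0)
y_demo = gaussian(x_demo, 72.5, 3.0, 3.9, 13.4)

guess_demo = estimate_gaussian_guess(x_demo, y_demo)
print("estimated [x0, sigma, y0, h]:", guess_demo)
```

A guess like this only needs to be in the right region; `curve_fit` does the refinement.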


Now we can retry the fit:

In [6]:
popt,pcov = curve_fit(gaussian,xdata,ydata,p0=guess)
print ("popt :\n", popt)
print ("pcov :\n", pcov)

popt :
 [72.50930905  3.01525268  3.85742572 13.40680375]
pcov :
 [[ 2.02507205e-03 -3.74999230e-10  4.61324597e-12  1.63272286e-09]
 [-3.74999230e-10  2.22561783e-03 -6.30519907e-04 -4.05620889e-03]
 [ 4.61324597e-12 -6.30519907e-04  1.98236679e-03 -1.40174504e-03]
 [ 1.63272286e-09 -4.05620889e-03 -1.40174504e-03  3.10175059e-02]]


This has worked (or it should have done)! We can use the information from the matrix of covariance to calculate the error on each parameter, just as we did in the previous session for the polynomial coefficients. Remember, the errors on the parameters are given by the *square roots* of the diagonal elements of the matrix of covariance.
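
A minimal sketch of that calculation, using a small made-up covariance matrix rather than the `pcov` from the fit above (the numbers and variable names here are purely illustrative):

```python
import numpy as np

# Made-up best-fit values and covariance matrix for two parameters
popt_demo = np.array([72.5, 3.0])
pcov_demo = np.array([[4.0e-4, 1.0e-5],
                      [1.0e-5, 9.0e-4]])

# The diagonal elements are the variances; their square roots are the errors
perr_demo = np.sqrt(np.diag(pcov_demo))

for name, value, error in zip(["x0", "sigma"], popt_demo, perr_demo):
    print(f"{name} = {value:.2f} +/- {error:.2f}")
```

The off-diagonal elements are not used here; they describe how the uncertainties on different parameters are correlated.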


**A Python aside / hint:** When dealing with an array like `popt` that contains numbers each representing different variables, it's sometimes useful to be able to "unpack" the array into different variables - we've already seen examples of this in the code cell above and in the second code cell when unpacking the data from the file. To unpack `popt`, we could use a line of code like:

           x0_fit, sigma_fit, y0_fit, h_fit = popt

If we wanted to then calculate the fitted line at a given x-value (in this case at x = 65), we could then use something like:
        
           fitted_point = gaussian(65, x0_fit, sigma_fit, y0_fit, h_fit)
           
or, if we don't want/need to assign individual variable names to the elements of `popt` (or whichever array we are dealing with), we could use:

           fitted_point = gaussian(65, popt[0], popt[1], popt[2], popt[3])
           
This is a bit unwieldy though, so sometimes it's useful to be able to unpack the array automatically when calling a function by using \* syntax, like this:

           fitted_point = gaussian(65, *popt)
This is much easier to deal with! You can find a fuller discussion of this in [Hill: Learning Scientific Programming with Python](http://sfx.ucl.ac.uk/sfx_local?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2016-07-18T13%3A15%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Journal-UCL_LMS_DS&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=book&rft.atitle=&rft.jtitle=&rft.btitle=Learning%20scientific%20programming%20with%20Python&rft.aulast=Hill&rft.auinit=&rft.auinit1=&rft.auinitm=&rft.ausuffix=&rft.au=Hill,%20Christian,%201974-,%20author&rft.aucorp=&rft.volume=&rft.issue=&rft.part=&rft.quarter=&rft.ssn=&rft.spage=&rft.epage=&rft.pages=&rft.artnum=&rft.issn=&rft.eissn=&rft.isbn=9781107075412&rft.sici=&rft.coden=&rft_id=info:doi/&rft.object_id=&rft.856_url=&rft_dat=%3CUCL_LMS_DS%3E002240476%3C/UCL_LMS_DS%3E&rft.eisbn=&rft_id=info:oai/&req.language=eng) section 2.4.3 (page 49).

The code cell below demonstrates that the two methods do give identical results:

In [7]:
# specifying the elements by hand:
print("At x = 65 our fitted Gaussian has a value of: ", gaussian(65, popt[0], popt[1], popt[2], popt[3]))

# use *syntax to unpack the elements of popt automatically:
print("Calculating the same value using * syntax:    ", gaussian(65,*popt)) 
print("Both give the same result!")

At x = 65 our fitted Gaussian has a value of:  4.460698388343549
Calculating the same value using * syntax:     4.460698388343549
Both give the same result!


<div class="alert alert-success"> 
In the cell below, you should:
<ul>
<li> calculate the errors on the parameters </li>
<li>output each parameter with its error and an appropriate text string </li>
<li>plot the original data and the fitted line on a single, appropriately labelled graph </li>
</ul>
</div>

In [8]:
### STUDENT COMPLETED CELL ###

for i in range(np.size(popt)): #for loop
    print (popt[i], " with error ", np.diag(pcov)[i]**0.5)   
    
    
plt.figure() # create a new figure window


x = np.linspace(40,110,200) # array of 200 x values from 40 to 110 
y = gaussian(x,*popt) #gaussian y values for the x arrays.

#Plots graph, labels and legend
plt.plot(xdata,ydata,'b.', label="Measured data")
plt.plot(x,y,'r-', label="Fitted Gaussian")
plt.title('Data from Zeeman effect experiment') 
plt.xlabel('Pixel position') 
plt.ylabel('Pixel value (counts/second)')
plt.legend(loc=0) 


72.50930904832356  with error  0.04500080058737785
3.0152526754151086  with error  0.04717645417786105
3.8574257202078033  with error  0.04452377780513221
13.406803754684061  with error  0.17611787490240707



If you've done this correctly, you should obtain a good fit to the data.

In theory we'd expect a Lorentzian to produce a better fit for this data. Now you're going to try this out and see if this is what we find for this particular data set!

The appropriate form for this is 
$$
f(x) = y_0 + \frac{h}{1 + ((x - x_0)/b)^2}
$$
where $y_0$ is the background level, $x_0$ is the peak position, $b$ is the half-width at half-maximum (HWHM) of the peak, and $h$ is the height of the peak relative to the background level.

<div class="alert alert-success"> 
First, write a properly formatted python function, similar in form to the "gaussian" function above, that will return a Lorentzian function for these parameters.
</div>

In [9]:
### STUDENT COMPLETED CELL ###
def Lorentzian(x,x0, y0, h, b): # defines the Lorentzian target function
    '''Returns a single value or 1D array of Lorentzian function values for 
    - input x-value or array of x-values: x
    - peak position: x0
    - background level: y0
    - peak height, h (measured from background level y0)
    - half-width at half-maximum (HWHM) of the peak: b'''
    lorent = y0 + h/(1 + (((x-x0)/b)**2)) # the Lorentzian itself
    return lorent
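
A quick sanity check on a function of this form is that it falls to half its peak height (measured from the background) at a distance $b$ from $x_0$. The sketch below redefines the function under the hypothetical name `lorentzian_demo`, with illustrative parameter values. It also converts a Gaussian $\sigma$ to a HWHM using the standard relation $\mathrm{HWHM} = \sigma\sqrt{2\ln 2}$, so that the widths of the two lineshapes can be compared on the same footing; the helper name `gaussian_hwhm` is likewise an assumption.

```python
import numpy as np

def lorentzian_demo(x, x0, y0, h, b):
    '''Lorentzian peak of height h and HWHM b on a constant background y0.'''
    return y0 + h / (1 + ((x - x0) / b)**2)

# Illustrative values only
x0_demo, y0_demo, h_demo, b_demo = 72.5, 3.4, 14.9, 3.1

at_peak = lorentzian_demo(x0_demo, x0_demo, y0_demo, h_demo, b_demo) - y0_demo
at_hwhm = lorentzian_demo(x0_demo + b_demo, x0_demo, y0_demo, h_demo, b_demo) - y0_demo
print("height above background at the peak:   ", at_peak)   # equals h
print("height above background at x0 + b:     ", at_hwhm)   # equals h/2

def gaussian_hwhm(sigma):
    '''Hypothetical helper: HWHM of a Gaussian with standard deviation sigma.'''
    return sigma * np.sqrt(2 * np.log(2))

print("HWHM of a Gaussian with sigma = 3.0:   ", gaussian_hwhm(3.0))
```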

<div class="alert alert-success"> 
Now provide an initial guess for these parameters, and use curve_fit to calculate the best Lorentzian fit for this data. Output the calculated parameters and the matrix of covariance, just like we did for the Gaussian fit.
</div>

In [10]:
### STUDENT COMPLETED CELL ###
guessl = [75,10,10,18] # initial guess for the Lorentzian, in the order x0, y0, h, b

lopt,lcov = curve_fit(Lorentzian,xdata,ydata,p0=guessl) # curve fit for lorentzian 
print ("lopt :\n", lopt)
print ("lcov :\n", lcov)

lopt :
 [72.48067209  3.3614566  14.88519044  3.08300161]
lcov :
 [[ 3.41522509e-03 -3.26711382e-06  1.19655382e-06  1.97142818e-06]
 [-3.26711382e-06  4.90692293e-03 -1.07970208e-03 -3.39310988e-03]
 [ 1.19655382e-06 -1.07970208e-03  7.99004136e-02 -1.57727283e-02]
 [ 1.97142818e-06 -3.39310988e-03 -1.57727283e-02  9.19528928e-03]]


<div class="alert alert-success"> 
Now use these results to
<ul>
<li>calculate the error on each parameter</li>
<li>output each parameter with its error (and an appropriate text string)</li>
<li>plot the data, the fitted Gaussian and the fitted Lorentzian, all on the same labelled graph.</li>
</ul>
</div>

In [11]:
### STUDENT COMPLETED CELL ###
plt.figure() # create a new figure window

x = np.linspace(40,110,200) # array of 200 x values from 40 to 110 
y = gaussian(x,*popt)

for i in range(np.size(lopt)): # for loop for lorentzian
    print ("lopt is",lopt[i], " with error ", np.diag(lcov)[i]**0.5)   

xl = np.linspace(40,110,200) # array of x values for the Lorentzian
yl = Lorentzian(xl,*lopt)

#Plots graph, labels and legend
plt.plot(xdata,ydata,'b.', label="Measured data")
plt.plot(xl,yl,'r-', label="Fitted Lorentzian")
plt.plot(x,y,'g-', label="Fitted Gaussian")
plt.title('Data from Zeeman effect experiment') 
plt.xlabel('Pixel position') 
plt.ylabel('Pixel value (counts/second)')
plt.legend(loc=0) 


lopt is 72.4806720948727  with error  0.05843992716494373
lopt is 3.361456596966064  with error  0.07004943203417825
lopt is 14.885190436426374  with error  0.2826666121064862
lopt is 3.083001614078908  with error  0.09589207099910162



### Evaluating the goodness of fit

So now we have two potential fits to our data. Looking at them, it's hard to tell which one provides the better fit. We can do this quantitatively by calculating $\chi^2$ for each fit, in the same way as we did in the previous session. 

We'll also need to know the y-error on the data points for this - which for this experiment was estimated to be $\pm 1$ counts/second.
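
The same y-error can also be passed to `curve_fit` through its optional `sigma` argument, with `absolute_sigma=True` so that the returned covariance matrix is not rescaled. This is not needed for the task below; the sketch uses synthetic data, and the parameter values, noise level and initial guess are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, x0, sigma, y0, h):
    '''Gaussian peak of height h on a constant background y0.'''
    return h * np.exp(-(x - x0)**2 / (2 * sigma**2)) + y0

# Synthetic peak with a known, constant y-error on every point
rng = np.random.default_rng(seed=4)
x_demo = np.arange(40.0, 110.0)
yerr_demo = 0.3
y_demo = gaussian(x_demo, 72.5, 3.0, 3.9, 13.4) + rng.normal(0, yerr_demo, x_demo.size)

# One y-error per data point, passed as the weighting for the fit
yerr_array = np.full(x_demo.size, yerr_demo)
popt_w, pcov_w = curve_fit(gaussian, x_demo, y_demo, p0=[73, 4, 4, 12],
                           sigma=yerr_array, absolute_sigma=True)
perr_w = np.sqrt(np.diag(pcov_w))

print("fitted parameters:", popt_w)
print("parameter errors: ", perr_w)
```

With a constant y-error the best-fit parameters are the same as for an unweighted fit; what changes is how the covariance matrix, and hence the parameter errors, are scaled.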

#### 1. Gaussian.

First we'll look at the Gaussian fit. 

<div class="alert alert-success"> 
In the cell below, 
<ul>
<li>calculate the residuals divided by the y-error</li>
<li>calculate the number of degrees of freedom</li>
<li>hence calculate $\chi^2$ for the Gaussian fit.</li>
</ul>
(Refer back to the previous session if you need a reminder of any of the definitions)
</div>


In [12]:
### STUDENT COMPLETED CELL ###
Residuals = (gaussian(xdata,*popt)-ydata)# equation for the distance between the gaussian line and the blue points 
freedom = len(xdata)-len(popt)

yerror = 1

# prints the normalised residuals, the number of degrees of freedom and the reduced chi^2 (chi^2 divided by the degrees of freedom)
print("the residuals are",Residuals/yerror)
print ("the freedom is",freedom)
print("the chi^2 is",(np.sum((Residuals/yerror)**2)/freedom))

the residuals are [ 2.17425720e-01  4.07425720e-01  3.57425720e-01  2.67425720e-01
  3.07425720e-01  4.47425720e-01  2.67425720e-01 -2.32574280e-01
  1.77425720e-01 -1.42574280e-01 -9.25742797e-02 -1.92574279e-01
 -2.82574269e-01 -5.25741917e-02 -1.42573641e-01 -9.25701344e-02
 -2.32550164e-01 -5.24485997e-02  3.80124833e-02 -1.20185692e-04
  6.62071869e-03 -2.01710854e-01 -1.89769869e-01  1.74171524e-02
  2.30698388e-01  7.15956825e-02  3.33154749e-01  5.59415262e-01
  1.18065904e-01 -3.39775070e-01 -4.04414244e-01 -6.56668665e-02
 -1.82127135e-01  2.21974201e-01  4.38896638e-01  4.87055192e-01
  4.99823915e-02 -5.88237395e-01 -9.10928842e-01 -7.99964725e-01
 -5.68193327e-01 -5.47946765e-01 -3.41040045e-01 -1.83160172e-01
 -2.30056556e-01 -5.19710692e-02 -9.24448119e-02 -9.25493862e-02
 -9.25699919e-02 -1.42573618e-01 -2.32574188e-01 -3.22574268e-01
 -2.32574279e-01 -1.92574280e-01 -2.82574280e-01 -1.42574280e-01
  8.74257202e-02  3.74257202e-02  3.74257202e-02  1.27425720e-01
  1.274

#### 2. Lorentzian

<div class="alert alert-success"> Now do the same for the Lorentzian fit, in the cell below.</div>

In [13]:
### STUDENT COMPLETED CELL ###
Residuals = (Lorentzian(xdata,*lopt)-ydata)# equation for the distance between the lorentzian line and the blue points 
freedom = len(xdata)-len(lopt)

yerror = 1

# prints the normalised residuals, the number of degrees of freedom and the reduced chi^2 (chi^2 divided by the degrees of freedom)
print("the residuals are",Residuals/yerror)
print ("the freedom is",freedom)
print("the chi^2 is",(np.sum((Residuals/yerror)**2)/freedom))


the residuals are [-0.13713719  0.06219792  0.02248531 -0.05614151 -0.00352488  0.15052242
 -0.01377599 -0.49615128 -0.06627805 -0.36375954 -0.28810813 -0.35871941
 -0.41483726 -0.14550621 -0.18950579 -0.08525828 -0.16069768  0.09692015
  0.28129711  0.35751308  0.50267024  0.45690241  0.64499033  1.00899765
  1.29263381  1.0184705   0.88942052  0.46427983 -0.64449251 -1.28312945
 -0.63324223  0.75340358  0.5659338  -0.16197842 -0.66329637 -0.40538393
 -0.14131651 -0.09864504 -0.00807317  0.23365997  0.40508721  0.27454783
  0.3088939   0.30641029  0.12253656  0.18728115  0.05363588 -0.02341569
 -0.08753175 -0.19142572 -0.32714167 -0.45624136 -0.39993353 -0.38916484
 -0.50468476 -0.38709261 -0.17687206 -0.24441679 -0.26004976 -0.18403786
 -0.19660315 -0.1579316   0.15182016  0.05251946  0.04405331  0.12632505
 -0.20074843  0.15276111 -0.03320854  0.05128848]
the freedom is 66
the chi^2 is 0.1927755642575313
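
The calculation in the two cells above can be wrapped in a small reusable function. The name `chi_squared` and the made-up numbers below are assumptions for illustration. Note that the helper returns both $\chi^2$ and the reduced $\chi^2$ (that is, $\chi^2$ divided by the number of degrees of freedom); the cells above print the reduced value.

```python
import numpy as np

def chi_squared(y_obs, y_model, y_err, n_params):
    '''Hypothetical helper: returns (chi2, reduced chi2, degrees of freedom).'''
    normalised_residuals = (y_obs - y_model) / y_err
    chi2 = np.sum(normalised_residuals**2)
    dof = np.size(y_obs) - n_params      # data points minus fitted parameters
    return chi2, chi2 / dof, dof

# Made-up observations and model values, with a y-error of 0.5 and 2 fitted parameters
y_obs_demo = np.array([1.0, 2.5, 2.0, 4.5, 5.0, 5.5])
y_model_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

chi2_demo, red_chi2_demo, dof_demo = chi_squared(y_obs_demo, y_model_demo, 0.5, 2)
print("chi^2 =", chi2_demo, " degrees of freedom =", dof_demo,
      " reduced chi^2 =", red_chi2_demo)
```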


<div class="alert alert-success">
<b> Are these the results you'd expect? Discuss briefly in a text cell.</b>
</div>

### STUDENT COMPLETED TEXT CELL ###

### Analysing the residuals

Another way of verifying the validity of our fits is to check the distribution of the residuals, and see if they follow a normal (Gaussian) distribution. Again, follow the same procedure as we did in the previous session and check the distribution of the calculated residuals for both fits. 




<div class="alert alert-success">
Do this in the code cells below. 
<ul>
<li>You can copy, paste and edit code from Session 3 if you want, rather than writing this from scratch. </li>
<li>Then, use a text cell to discuss (briefly) what you conclude from these results.</li>
<li> You will also find it useful to look at the $x_0$ and $\sigma$ of the pdf of the residuals. Try changing your value of the yerror in the data (that you used to calculate the $\chi^2$) to the $\sigma$ you obtain here. What does this tell you?</li>
<li>Don't forget to change the yerror back to the value recorded by the student before submitting!</li>
</ul>
</div>
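
A short calculation shows what to expect from the last two points. If the residuals $r_i$ have a mean of zero, `stats.norm.fit` returns $\sigma^2 = \frac{1}{N}\sum r_i^2$, so setting the y-error equal to that $\sigma$ gives a reduced $\chi^2$ of $N/(N-p)$, where $N$ is the number of data points and $p$ the number of fitted parameters. The sketch below checks this with synthetic residuals; the values and variable names are illustrative and not taken from the Zeeman data.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(seed=7)
n_points, n_params = 70, 4

# Synthetic residuals; a least-squares fit with a free background level
# leaves residuals with zero mean, so we remove the mean here too
residuals_demo = rng.normal(0, 0.4, n_points)
residuals_demo -= residuals_demo.mean()

mean_demo, sigma_demo = stats.norm.fit(residuals_demo)
dof_demo = n_points - n_params

# Reduced chi^2 with an assumed y-error of 1, then with y-error = sigma
red_chi2_original = np.sum((residuals_demo / 1.0)**2) / dof_demo
red_chi2_rescaled = np.sum((residuals_demo / sigma_demo)**2) / dof_demo

print("reduced chi^2 with y-error = 1:    ", red_chi2_original)
print("reduced chi^2 with y-error = sigma:", red_chi2_rescaled)
print("N / (N - p):                       ", n_points / dof_demo)
```

Because the rescaled value depends only on $N$ and $p$, it cannot by itself distinguish between two models; it indicates whether the originally assumed y-error was realistic.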

In [14]:
### STUDENT COMPLETED CELL ###
plt.figure()

Residuals = (gaussian(xdata,*popt)-ydata) # residuals of the Gaussian fit

x = np.linspace(-1,1,1000) # array of 1000 x values

x0,sigma = stats.norm.fit(Residuals) # gives the mean and standard deviation of the residuals

print("the mean is",x0,"the standard deviation",sigma)

gaussian_check = stats.norm.pdf(x,x0,sigma) # normal pdf with the fitted mean and standard deviation


plt.hist(Residuals, density=True,alpha=0.25,edgecolor='k') # plots the normalised histogram with black edges and transparent bars


plt.plot(x,gaussian_check,'r-.', label="Gaussian stats.norm.pdf") # the generated normal curve

# plot labels
plt.legend()
plt.xlabel('Residual')
plt.ylabel('Probability density')
title_label=(r'Line fitted with Gaussian $x_0$ = {0:8.2e}, $\sigma$ = {1:8.2e}'.format(x0,sigma))
# n.b. number format 8.2e : *e*xponential format, *8* chars total, with *2* decimal places
plt.title(title_label) ;

the mean is -2.1760387777395147e-10 the standard deviation 0.31759831590779153


In [15]:
### STUDENT COMPLETED CELL ###
plt.figure()

Residuals = (Lorentzian(xdata,*lopt)-ydata) # residuals of the Lorentzian fit

x = np.linspace(-1.5,1.5,1000) # array of 1000 x values

x1,sigma1 = stats.norm.fit(Residuals) # gives the mean and standard deviation of the residuals

print("the mean is",x1,"the standard deviation",sigma1)

gaussian_check = stats.norm.pdf(x,x1,sigma1) # normal pdf with the fitted mean and standard deviation


plt.hist(Residuals, density=True,alpha=0.25,edgecolor='k') # plots the normalised histogram with black edges and transparent bars


plt.plot(x,gaussian_check,'r-.', label="Lorentzian stats.norm.pdf") # the generated normal curve

# plot labels
plt.legend()
plt.xlabel('Residual')
plt.ylabel('Probability density')
title_label=(r'Line fitted with Lorentzian $x_0$ = {0:8.2e}, $\sigma$ = {1:8.2e}'.format(x1,sigma1))
# n.b. number format 8.2e : *e*xponential format, *8* chars total, with *2* decimal places
plt.title(title_label) ;

the mean is 1.0179234471203407e-09 the standard deviation 0.4263329892566719


### STUDENT COMPLETED TEXT CELL ###
### Conclusion ###

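The Gaussian provides a better fit to this data set than the Lorentzian, which contradicts the theoretical expectation. Its residuals are smaller (standard deviation 0.32, compared with 0.43 for the Lorentzian), so its $\chi^2$ is lower, and the histogram of its residuals follows the fitted normal curve much more closely. However, both reduced $\chi^2$ values are well below one with the recorded y-error of $\pm 1$ counts/second, which suggests that this error, recorded by the student only to the nearest whole number, is an overestimate. When the y-error is reduced to the $\sigma$ of the residuals, the reduced $\chi^2$ moves close to one.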
