# Session 7: How to fit data

<a rel="license" href="https://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons Licence" style="border-width:0" src="https://licensebuttons.net/l/by-sa/4.0/88x31.png" title='This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.' align="right"/></a>

Author: Dr Antonia Mey   
Email: antonia.mey@ed.ac.uk
  

Some material was contributed by Dr Matteo Degiacomi and Dr Valentina Erastova   

## Learning outcomes
1. Get more practice with plotting data and computing molecular properties.
2. Test how correlated two datasets are using `scipy`
3. Understand how to find the minimum of a function computationally.
4. Use the library `scipy` to find a line of best fit.
5. Use the library `scipy` to be able to fit an exponential function.
6. Know of other fitting functions, such as polynomial or Gaussian fits. 

**Jupyter cheat sheet**:
- to run the currently highlighted cell, hold <kbd>&#x21E7; Shift</kbd> and press <kbd>&#x23ce; Enter</kbd>;
- to get help for a specific function, place the cursor within the function's brackets, hold <kbd>&#x21E7; Shift</kbd>, and press <kbd>&#x21E5; Tab</kbd>;

# Table of Contents
1. [Recap: molecular geometries and plotting](#plotting)    

2. [Computing correlations](#Correlation)   
   2.1. [Pearson's correlation coefficient](#pearson)    
   2.2. [Spearman's rank correlation coefficient](#spearman)    

3. [Fitting data](#minimization)    
   3.1. [Finding the minimum of a function](#optimization)   
   3.2. [Line of best fit and residuals](#residuals)    
   3.3. [Fitting non-linear functions](#advanced_fitting_exponential)    
   3.4. [Fitting a Gaussian distribution](#advanced_fitting_gaussian)    
4. [Feedback](#feedback)

<div class="alert alert-danger"><b>
⚠️ Execute the cell below! It will allow you to run the notebook properly! ⚠️
</b></div>

In [3]:
import sys
import os.path
sys.path.append(os.path.abspath('../'))
from helper_functions.mentimeter import Mentimeter
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import math

# 1. Recap: molecular geometries and plotting
<a id='plotting'></a>

## 1.1 Recap: molecular geometries 
<a id='geometries'></a>

### Bonds
The length of a bond **a** is the length of the vector connecting the two atoms A and B, which is given by:

$\vert\vert \mathbf{a}\vert\vert = \sqrt{(x_B-x_A)^2+(y_B-y_A)^2+(z_B-z_A)^2}$

In Python, `np.linalg.norm(B-A)` is a fast way of computing the distance between two points, provided the positions `A` and `B` are given as numpy arrays.

In [None]:
# Example positions of a water molecule

position_atom_H1 = np.array([0.758602,0.000000,0.504284])
position_atom_O = np.array([0.000,0.000,0.000])
position_atom_H2 = np.array([0.260455, 0.000000, -0.872893])

In [None]:
def compute_bond_length(atom1, atom2):
    """ This function computes the bond length between two atoms
    
    Parameters:
    -----------
    atom1:numpy array 
        contains 3 entries as x, y and z coordinates
    atom2:numpy array
        contains 3 entries as x, y and z coordinates
    
    Returns:
    --------
    bond_length :float
        value of the bond length
    """
    
    bond_length = np.linalg.norm(atom1-atom2)
    return bond_length

In [None]:
bond_length = compute_bond_length(position_atom_H1, position_atom_O)
print(f'The H-O bondlength is: {bond_length:.2f} Å.')

### Angles

Here is an example of an angle in a water molecule, where the vectors *H1*, *O*, and *H2* give the positions of the atoms in space.

![indexing](images/bond_angles.png)

The bond length between H1 and O is the length of the vector connecting these two atoms (**a** in the image) and can be computed using the above formula. 


To determine the angle between two vectors you can use the scalar product: 
$$\mathbf{a}\cdot \mathbf{b} = \vert\vert\mathbf{a}\vert\vert \,\vert\vert\mathbf{b}\vert\vert\cos \phi,$$
where $\mathbf{a}$ and $\mathbf{b}$ are vectors, and  $\phi$ is the valence angle we are after. Rearranging the above equation for the valence angle $\phi$ gives:
$$\phi = \arccos\big(\frac{\mathbf{a}\cdot\mathbf{b}}{\vert\vert\mathbf{a}\vert\vert \,\vert\vert\mathbf{b} \vert\vert}\big)$$

You can use the `math` library to compute the arccos of a number, e.g. with `math.acos()`. The result is an angle in radians.

The scalar product or dot product can be computed using `np.dot()` in Python.

In [None]:
def compute_angle_water(O_position,H1_position, H2_position):
    """This function computes the H-O-H angle of a water molecule from the positions of its three atoms
    
    Parameters:
    -----------
    H1_position:numpy array 
        contains 3 entries as x, y and z coordinates
    O_position:numpy array
        contains 3 entries as x, y and z coordinates
    H2_position:numpy array
        contains 3 entries as x, y and z coordinates
    
    Returns:
    --------
    angle :float
        value of the angle
    """
    vector_of_bond_a = H1_position-O_position
    vector_of_bond_b = H2_position-O_position

    bond_length_a = compute_bond_length(H1_position, O_position)
    bond_length_b = compute_bond_length(O_position, H2_position)
    
    angle = math.acos(np.dot(vector_of_bond_a,vector_of_bond_b)/(bond_length_a*bond_length_b))
    return np.degrees(angle)
    

In [None]:
angle = compute_angle_water(position_atom_O,position_atom_H1,position_atom_H2)
print(f'The angle of a water molecule is: {angle:.2f}°.')

### Recap: Reminder of further resources
If you ever get stuck with Matplotlib, they have some very helpful [cheatsheets](https://matplotlib.org/cheatsheets/), one of which is shown below:

![Matplotlib beginner cheat sheet](https://matplotlib.org/cheatsheets/handout-beginner.png)

## 1.2 Recap: Plotting distributions

Take a look at the following code:

```python

# Generate 10000 random samples from a normal distribution 
X = np.random.normal(4, 0.3, 10000)
# initiate the plot
fig, ax = plt.subplots()
fig.set_figwidth(4)
fig.set_figheight(4)
# Use numpy to compute a histogram
prob, edges = np.histogram(X, density = True, bins=30)
half_width = (edges[1]-edges[0])/2
bin_centres = edges[:-1]+half_width
# plot the probability density from the histogram
ax.plot(bin_centres, prob, marker='o', color='red')


```

What would you expect the final plot to look like?

In [None]:
Mentimeter(vote = 'https://www.menti.com/aladq88k3pq6').show()

## Tasks 1

<div class="alert alert-success">
    <b>Task 1.1 </b> : generate a 1D array, $x$, and plot $x^2$ using non-default line types and colours, label the plot
</div>

 Practice labelling your plot as well!
 - `xlabel()`
 - `ylabel()` 
 - `title()`

<div class="alert alert-info">
    <b>Hint</b> : To neatly write sub- and superscripts on the plots, like  $x_2$  or $x^2$ in the example above, use the $LaTeX$ notation in the code - <code>$x_2$</code> and <code>$x^2$</code> respectively. For  <a href="https://matplotlib.org/3.1.1/tutorials/text/mathtext.html">more examples see here</a>.

</div>



In [None]:
# Task 1: Test out the solution in this cell:



<details><summary {style='color:green;font-weight:bold'}> Click here to see solution to Task. </summary>
    
```python

#generating an array
x = np.linspace(-10, 10, 21) 
y = x**2

#plotting with x markers in a named colour, connected by a dotted line of a declared width
plt.plot(x, y, 'x:', color='tomato', linewidth=1.5) 

#adding labels
plt.xlabel('x')
plt.ylabel('y')
plt.title('my plot $y=x^2$')

plt.show()

```

</details>

<div class="alert alert-success">
    <b>Task 1.2 </b> : The file <code>data/water.xyz</code> contains a cluster of ice, i.e. many water molecules in a solid state. It is in the <code>xyz</code> file format, and some help on how to read data from the file is given below. 
</div>

1. Take a look at how the file is read and make sure you understand it. This is one example way of reading this file. There are many other options too.   
2. Compute the angle of each water molecule using the function defined above and append each angle to a list of angles.      
3. Plot the distribution of the angles in the list and report its mean and standard deviation.     


In [None]:
# Have a look at the data file first
!head data/water.xyz

In [None]:
# reading the data in
# This generates a numpy array with the coordinates
data = np.genfromtxt('data/water.xyz', skip_header=1, usecols=[1,2,3])
# We don't want to use the first row and the first column; this is what skip_header and usecols do

# now we loop over this in threes to group the molecules together:
water_molecules = []
for i in range(0,len(data),3):
    # This selects each water molecule
    water_molecule = data[i:i+3]
    # Uncomment this line to see what is happening in detail
    # print(water_molecule)
    water_molecules.append(water_molecule)
print(f'We have {len(water_molecules)} water molecules in our file.')

In [None]:
# Solution to task here:



<details><summary {style='color:green;font-weight:bold'}> Click here to see solution to Task. </summary>
    
```python

#subtask 2
# computing angle
angles = []
for water in water_molecules:
    angle = compute_angle_water(water[0], water[1], water[2])
    angles.append(angle)
    
#subtask 3
# plotting the distribution
plt.hist(angles, bins=30)
plt.xlabel('angle in degree')
plt.ylabel('Count')

print(f'The mean is {np.mean(angles):.2f}')
print(f'The standard deviation is {np.std(angles):.2f}')

```

</details>

<div class="alert alert-success">
    <b>Task 1.3 </b> : Working with data. Use the file <code>data/anscombes_quartet.dat</code>. This is a tab-delimited file with 8 columns. The first and second columns make up one data set, the third and fourth the next one, and so forth.
</div>

1. Read the data into a pandas dataframe,    
2. Create four subplots of the data,       
3. Answer the mentimeter question.      


In [None]:
# Task 3: Test out the solution in this cell:

#An example of naming the columns of the file
colnames=['X1', 'Y1', 'X2', 'Y2', 'X3', 'Y3', 'X4', 'Y4']

data = pd.read_csv(# Your code here
                   skiprows=2, names=colnames)

# Setup your 4 subplots
fig, axs = plt.subplots(2, 2)

# Set the figure size
fig.set_figwidth(8)
fig.set_figheight(8)

# add data to plot
axs[0, 0].scatter(data['X1'], data['Y1'])
# ...
# ...

# make sure it is labelled
for ax in axs.flat:
    ax.set(xlabel='x-data', ylabel='y-data')
    
for ax in axs.flat:
    ax.label_outer()

# Set the ranges of all axes
plt.setp(axs, xlim=(4,20), ylim=(3,13))

<details><summary {style='color:green;font-weight:bold'}> Click here to see solution to Task. </summary>
    
```python

# Loading the dataset
colnames=['X1', 'Y1', 'X2', 'Y2', 'X3', 'Y3', 'X4', 'Y4']
data = pd.read_csv('data/anscombes_quartet.dat', delimiter='\t', skiprows=2, names=colnames)

# Setup your 4 subplots
fig, axs = plt.subplots(2, 2)
fig.set_figwidth(8)
fig.set_figheight(8)

# add data to plot
axs[0, 0].scatter(data['X1'], data['Y1'])
axs[0, 1].scatter(data['X2'], data['Y2'])
axs[1, 0].scatter(data['X3'], data['Y3'])
axs[1, 1].scatter(data['X4'], data['Y4'])

# make sure it is labelled
for ax in axs.flat:
    ax.set(xlabel='x-data', ylabel='y-data')
    
for ax in axs.flat:
    ax.label_outer()

# Setting the values for all axes.
plt.setp(axs, xlim=(4,20), ylim=(3,13))

```

</details>

Data is said to be **perfectly correlated** if all points fall exactly onto a straight line with positive slope, for example $y=x$. Take a look at your data: which of the four plots do you think is the most correlated?

In [None]:
Mentimeter(vote = 'https://www.menti.com/alr8x73cpd77').show()

<div class="alert alert-info">
    <b>Task 4 (advanced) </b> : Working with data. Use the file <code>data/ramachandran.dat</code>. It contains dihedral angles of the backbone of a protein in two columns. Column 1 is the $\phi$ angle and column 2 the $\psi$ angle. To find out more about Ramachandran diagrams take a look <a href="https://en.wikipedia.org/wiki/Ramachandran_plot">here</a>, and for more on dihedral angles see <a href="https://en.wikipedia.org/wiki/Dihedral_angle">here</a>. 
<p>1. Read the data into a pandas dataframe,  </p>  
<p>2. Create a single plot that is a 2D density map of $\phi$ against $\psi$,  </p>
<p>3. Make sure your plot is labelled correctly and displays a colour bar!   </p>   
</div>

In [None]:
# Task 4: Test out the solution in this cell:



# 2. Computing correlations
<a id='Correlation'></a>

## Reminder: mean $\mu$ and standard deviation $\sigma$

The **mean** $\mu$ is given by:

\begin{equation}
\mu = \frac{1}{N} \sum_{i=1}^N x_i ,
\end{equation}

where $N$ is the number of samples. As $N$ increases, the mean gets closer to the 'true' value. 


```python
mu = np.sum(x) / len(x)
```

or as `np.mean(x)`.

_Note:_ The **median** is the middle value separating the greater and lesser halves of a data set. Since the normal distribution is symmetric, its mean and median are equal. 



The **standard deviation** (STD), $\sigma$, quantifies how much the numbers in our set deviate from the mean, $\mu$:

\begin{equation}
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N(x_i-\mu)^2}.
\end{equation}

It can be written as:

```python
sigma = np.sqrt( np.sum( ( x - np.mean(x))**2 ) / len(x) )
```

or as `np.std(x)`.
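
As a quick check, the sketch below compares the hand-written formulas with the NumPy built-ins on a small made-up array (the values are purely illustrative). It also shows that `np.std` divides by $N$ by default, as in the formula above, while passing `ddof=1` divides by $N-1$ instead.

```python
import numpy as np

# Small illustrative dataset (made-up values)
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Mean and standard deviation written out by hand
mu_manual = np.sum(x) / len(x)
sigma_manual = np.sqrt(np.sum((x - mu_manual) ** 2) / len(x))

# The same quantities from NumPy
mu_numpy = np.mean(x)
sigma_numpy = np.std(x)            # divides by N (ddof=0), as in the formula above
sigma_sample = np.std(x, ddof=1)   # divides by N-1 instead

print(mu_manual, mu_numpy)
print(sigma_manual, sigma_numpy, sigma_sample)
```

For large $N$ the two versions of the standard deviation are almost identical; the difference only matters for small datasets.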



For a **normal distribution**, the values that are less than 1 $\sigma$ away from the mean, $\mu$, account for 68.27% of the set. This range is our **confidence interval**.

<img src="images/NormalDist.png" width="500">
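
The 68.27% figure can be checked numerically. The sketch below draws samples from a normal distribution (the mean, width, sample size and seed are arbitrary choices for illustration) and counts the fraction that lies within one standard deviation of the mean.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seeded so the result is reproducible
samples = rng.normal(loc=4.0, scale=0.3, size=100_000)

mu = np.mean(samples)
sigma = np.std(samples)

# Fraction of samples that lie within one standard deviation of the mean
within_one_sigma = np.mean(np.abs(samples - mu) < sigma)
print(f'Fraction within 1 sigma: {within_one_sigma:.3f}')
```

With a finite number of samples the fraction will only be approximately 0.6827; it gets closer as the sample size grows.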



## 2.1. Pearson's correlation coefficient
<a id='pearson'></a>

One way of quantifying the correlation between two datasets is to compute their **Pearson correlation coefficient $R$**. 
- If $R$ is 1, or close to 1, the data is highly correlated.
- If $R$ is around 0, the data is not correlated.
- If $R$ is close to -1, the data is anticorrelated.
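
To get a feeling for these three cases, here is a small sketch on synthetic data. It uses `np.corrcoef`, NumPy's built-in way of obtaining $R$; the datasets, noise level and seed are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
x = rng.normal(size=2000)
noise = rng.normal(size=2000)

y_correlated = x + 0.1 * noise        # follows x closely
y_uncorrelated = noise                # independent of x
y_anticorrelated = -x + 0.1 * noise   # decreases as x increases

for label, y in [('correlated', y_correlated),
                 ('uncorrelated', y_uncorrelated),
                 ('anticorrelated', y_anticorrelated)]:
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is R
    R = np.corrcoef(x, y)[0, 1]
    print(f'{label:>15}: R = {R:+.2f}')
```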

Mathematically the correlation coefficient is defined as:

$R = \frac{\langle(X-\mu_X)(Y-\mu_Y)\rangle}{\sigma_X\sigma_Y}$,

where $\sigma$ is the standard deviation of the data set $X$ or $Y$ and the symbol $\langle \cdot \rangle$ denotes computing the mean of the quantities inside the angular bracket.

The equation contains exactly what you learned last week!

Can you think of examples of correlated data?

In [None]:
# Mentimeter wordcloud
Mentimeter(vote='https://www.menti.com/alrkms7ko5ap').show()

In [None]:
Mentimeter(result='https://www.mentimeter.com/app/presentation/alvhzfdayjzxnd5detr24zse7jq15wur').show()

## Tasks 2

<div class="alert alert-success">
    <b>Task 2.1 </b> : Write a function that computes the Pearson correlation coefficient between two datasets, making use of the numpy functions <code>np.mean()</code> and <code>np.std()</code> to compute the mean and standard deviation.
</div>

In [None]:
X = 20 * np.random.randn(1000) + 100
Y = X + (10 * np.random.randn(1000) + 50)

In [None]:
# Task 1: Test out the solution in this cell:
def compute_pearson_r(X, Y):
    r''' function that computes the Pearson correlation coefficient
    Parameters
    ----------
    Computes the correlation between X and Y
    
    X : 1-d numpy array
        dataset 1
    Y : 1-d numpy array
        dataset 2
        
    Returns:
    --------
    R : float
        value of pearson R
    '''
    
    R = None
    # Your code here
    
    
    return R

<details><summary {style='color:green;font-weight:bold'}> Click here to see solution to Task. </summary>
    
```python


def compute_pearson_r(X,Y):
    r''' function that computes the Pearson correlation coefficient
    Parameters
    ----------
    Computes the correlation between X and Y
    
    X : 1-d numpy array
        dataset 1
    Y : 1-d numpy array
        dataset 2
        
    Returns:
    --------
    R : float
        value of pearson R
    '''
    
    R = None
    mean_x = np.mean(X)
    mean_y = np.mean(Y)
    std_x = np.std(X)
    std_y = np.std(Y)
    covariance = np.mean((X-mean_x)*(Y-mean_y))
    R = covariance/(std_x*std_y)
    
    return R

```

</details>

<div class="alert alert-success">
    <b>Task 2.2 </b> : Does your function work correctly? Check if you get the same answers as from the built-in function <code>pearsonr</code> in the <code>scipy.stats</code> package. 
</div>

In [None]:
from scipy.stats import pearsonr
# you use pearsonr from the scipy.stats package in the following way:

# pearsonr(dataset1, dataset2)[0]
# Check what happens when you remove the [0] at the end and print the output. 

In [None]:
# Task 2: Test out the solution in this cell:


<details><summary {style='color:green;font-weight:bold'}> Click here to see solution to Task. </summary>
    
```python

pearson1 = pearsonr(X, Y)[0]
pearson2 = compute_pearson_r(X, Y)
print(pearson1, pearson2)

```

</details>

<div class="alert alert-success">
    <b>Task 2.3 </b> : Compute the correlation coefficient of all 4 datasets of the Anscombe's quartet. What do you observe?
</div>

In [None]:
# Task 3: Test out the solution in this cell:


<details><summary {style='color:green;font-weight:bold'}> Click here to see solution to Task. </summary>

```Python
pearson1 = pearsonr(data['X1'], data['Y1'])[0]
pearson2 = pearsonr(data['X2'], data['Y2'])[0]
pearson3 = pearsonr(data['X3'], data['Y3'])[0]
pearson4 = pearsonr(data['X4'], data['Y4'])[0]
print(f'{pearson1}\n{pearson2}\n{pearson3}\n{pearson4}')
```

</details>

## 2.2. Spearman's Rank Correlation coefficient
<a id='spearman'></a>
There are other ways of measuring correlation. Take a look at the documentation of the Spearman rank correlation coefficient in the scipy package [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html) and a bit more background on it [here](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient). 
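
Spearman's coefficient is Pearson's coefficient computed on the ranks of the data, so it measures whether the relationship is monotonic and not whether it is linear. The sketch below illustrates the difference on a made-up dataset, an exponential curve, which increases steadily but is far from a straight line.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(0, 5, 50)
y = np.exp(x)   # increases monotonically with x, but not linearly

r_pearson = pearsonr(x, y)[0]
r_spearman = spearmanr(x, y)[0]
print(f'Pearson R  = {r_pearson:.3f}')
print(f'Spearman R = {r_spearman:.3f}')
```

Because the ranks of `x` and `y` agree perfectly, Spearman's coefficient is 1, while Pearson's coefficient is noticeably lower.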

<div class="alert alert-info">
    <b>Task 2.4 (advanced) </b> : Compute Spearman's rank correlation coefficient for Anscombe's quartet. 
</div>

In [None]:
# Task 4: Test out the solution in this cell:


<details><summary {style='color:green;font-weight:bold'}> Click here to see solution to Task. </summary>

```Python
from scipy.stats import spearmanr
    
spearman1 = spearmanr(data['X1'], data['Y1'])[0]
spearman2 = spearmanr(data['X2'], data['Y2'])[0]
spearman3 = spearmanr(data['X3'], data['Y3'])[0]
spearman4 = spearmanr(data['X4'], data['Y4'])[0]
print(f'{spearman1}\n{spearman2}\n{spearman3}\n{spearman4}')
```

</details>



# Break
<img src="images/break.png" alt="drawing" width="200"/>

# 3. Finding the line of best fit between two sets of data points
<a id='minimization'></a>

From Anscombe's quartet we have learned that the correlation coefficient alone will not tell us everything about the data. We also need to ask:
- How many data points are we correlating?
- Are there any outliers?
- What is the best fitting line that goes through the data?

## Reminder: functions and graphs

Plots usually show the relationship between two related quantities: one (or more) independent variable(s) and a dependent variable. In an experiment:
- Independent variables (often denoted $x$) are measurable and unaffected by the value of other variables
- Dependent variables (often denoted $y$) have values that are affected by the value of independent variables

Mathematically one can say: $y = \mathrm{function}(x)$, or $y = f(x)$.

#### Typical functions in chemistry
1. Linear function: $y=mx+b$
2. Polynomial functions: $y=ax^n+bx^m+c$
3. Power function: $y=ax^m$   
Power functions become linear when taking the logarithm of both sides. Remember:
$\log y = m \log x + \log a$
4. Exponential functions: $y=ae^{mx}$   
Exponential functions are linear for a plot of the natural logarithm of the dependent variable against the independent variable: $\ln y = mx+\ln a$
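
The linearisation of an exponential function can be sketched as follows. The data is synthetic and noise-free, and the parameter values `a_true` and `m_true` are arbitrary assumptions; a straight-line fit to $\ln y$ against $x$ then recovers both of them.

```python
import numpy as np

# Synthetic exponential data y = a * exp(m * x) with assumed parameters
a_true, m_true = 2.5, -0.8
x = np.linspace(0, 5, 20)
y = a_true * np.exp(m_true * x)

# ln(y) = m * x + ln(a) is a straight line in x,
# so a first-order polynomial fit recovers m and ln(a)
m_fit, ln_a_fit = np.polyfit(x, np.log(y), 1)
a_fit = np.exp(ln_a_fit)

print(f'm = {m_fit:.3f}, a = {a_fit:.3f}')
```

The same idea works for a power function if you take the logarithm of both $x$ and $y$ before fitting.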

#### Example: Solubility of sodium chloride
The maximum amount of sodium chloride you can dissolve in water changes as a function of temperature. The file <code>sodium_data.dat</code> contains some measurements by a student raising the water temperature from 0$^\circ$C to 100$^\circ$C.
For example, if you have 26.1 g of salt it will dissolve in 100 g of water at 20$^\circ$C, but if you add another 2 g the remaining salt will just stay solid. However, at 70$^\circ$C the additional 2 g will also dissolve. Solubility is the dependent variable, as it depends on the independent variable temperature. 

The independent variable is often plotted along the x-axis and the dependent variable along the y-axis. 

In [None]:
data = pd.read_csv('data/sodium_data.dat', delimiter='\t')

In [None]:
## Plot the data here, and answer the question below


<details><summary {style='color:green;font-weight:bold'}> Click here to see the solution. </summary>

```Python
temp = data['temperature/C']
solubility = data['solubility/g NaCl/100g water']
plt.plot(temp,solubility)
plt.xlabel('temperature/C')
plt.ylabel('solubility/g')
```

</details>



In [None]:
Mentimeter(vote="https://www.menti.com/al8svn8ad9as").show()

In [None]:
Mentimeter(result="https://www.mentimeter.com/app/presentation/al44gz6x2z8hwd19stuxngoukg88sss3").show()

## Mean Square Error (MSE)

How do we find a line of best fit through our data?
1. ~~Eyeball it by hand~~
2. Programmatically

To find the best fit, the computer needs to be able to quantify how well a line fits the data.

As an example, let's take some scattered data (green points below). We want to fit a linear model (blue line) through them. By eye, we can see that a line at $y = 0$ does not fit the data well. 
<img src="images/scatter_data.png" alt="drawing" width="300"/>

Each green point is located at a certain distance from the blue line. We call this distance a **residual**. A good fit leads to small residuals.
<img src="images/initial_guess.png" alt="drawing" width="300"/>

We can quantify how well a line fits the data by calculating the **mean square error (MSE)**. The MSE is a *second-order polynomial* in the residuals and is defined as:

$\mathrm{MSE} = \frac{1}{n}\sum(Y_i - \hat Y_i)^2$

$n$ is the number of data points,   
$Y_i$ is the observed value (i.e. the measured data point),   
$\hat Y_i$ is the predicted value (i.e. the value that lies on the line of best fit).   

To find the line of best fit, we need to find which combination of its parameters (for a linear function, slope and intercept) leads to the smallest MSE.
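
As a small illustration of the definition, the sketch below computes the MSE for two candidate lines through a handful of made-up data points. The helper `mean_square_error` is a hypothetical name introduced only for this example.

```python
import numpy as np

# Made-up observations that roughly follow y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_observed = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

def mean_square_error(y_obs, y_pred):
    """Average of the squared residuals."""
    return np.mean((y_obs - y_pred) ** 2)

# Two candidate lines: a flat line at y = 0 and the line y = 2x + 1
mse_flat = mean_square_error(y_observed, 0.0 * x)
mse_line = mean_square_error(y_observed, 2.0 * x + 1.0)

print(f'MSE for y = 0:      {mse_flat:.3f}')
print(f'MSE for y = 2x + 1: {mse_line:.3f}')
```

The line that follows the data has a much smaller MSE than the flat line, which is exactly the criterion a fitting routine uses to compare candidate parameters.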

<img src="images/best_fit_loss.png" alt="drawing" width="300"/>

 How do we minimise a function again?

## 3.1. Finding the minimum of a function
<a id='optimization'></a>

To remind ourselves of how the minimum of a function can be found, we will take as an example a diatomic molecule, e.g. $O_2$.
We can model the bond between the two atoms as a harmonic oscillator (i.e. a spring): $y=a(x+b)^2$.

### Tasks 3

<div class="alert alert-success">
    <b>Task 3.1 </b> : Find the minimum of $f(r) = 0.5(r -2)^2$ manually.  
</div>

<details><summary {style='color:green;font-weight:bold'}> Click here to see solution to Task. </summary>

We start with the first-order derivative of the function:

$f'(r) = 2 \cdot 0.5\,(r-2) = r-2$

To find the minimum we set the first-order derivative to zero:

$f'(r) = r-2 = 0$

Solving for $r$, the stationary point is at $r=2$. Since the second derivative $f''(r) = 1$ is positive, this point is a minimum.

</details>

<div class="alert alert-success">
    <b>Task 3.2 </b> : Find the minimum of $f(r) = 0.5(r -2)^2$ by using <code>scipy.optimize.minimize</code>
</div>

You can find more information on the minimize function in the documentation [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html). Let's arbitrarily choose an initial guess of $r = 4$, and use the minimize function. What do the outputs mean? Does the output change based on the solver method used? Can you also plot the function?

*Hint*: Start by writing a function `def f(r)` defining our harmonic oscillator.
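
To help you read the output, here is a sketch that applies `minimize` to a deliberately different function, $g(x) = (x+1)^2 + 3$ (the name `g` and the starting point are arbitrary choices for this example), and prints a few fields of the returned result object.

```python
from scipy import optimize

def g(x):
    # x arrives as a one-element array; return a plain scalar
    return (x[0] + 1.0) ** 2 + 3.0

result = optimize.minimize(g, x0=[5.0])

print(result.x)        # position of the minimum found (an array)
print(result.fun)      # value of the function at that position
print(result.success)  # True if the optimiser reports convergence
print(result.nit)      # number of iterations taken
```

Printing `result` itself shows all fields at once, including a message from the solver describing why it stopped.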

In [None]:
# Task 3.2: Test out the solution in this cell:
from scipy import optimize

def f(r):
    pass  # Fix me: return the value of the harmonic oscillator at r

# create an array of values from -10 to 10 using np.arange
r = None  # Fix me

# plot the function r v. f(r)

# use optimize.minimize to find the minimum

# what happens if you use a different starting point? Try a different optimizer? 



<details><summary {style='color:green;font-weight:bold'}> Click here to see solution to Task. </summary>
    
```python
from scipy import optimize
def f(r):
    return 0.5*(r-2)**2

# defining the r values
r = np.arange(-10,10,0.1)
plt.plot(r, f(r))
print(optimize.minimize(f, x0=4))

# Trying with a different method and starting point
print(optimize.minimize(f, x0=7, method="L-BFGS-B"))

```
</details>



<div class="alert alert-info">
    <b>Task 3.3 (advanced) </b> : Find the minimum of $f(x) = x^4+x^3-6x^2$ by using <code>scipy.optimize.minimize</code>. Try different starting points and plot the function!
</div>

In [None]:
# Your code here


<details><summary {style='color:green;font-weight:bold'}> Click here to see solution to Task. </summary>

```Python
    
from scipy import optimize
def f(x):
    return np.power(x,4)+np.power(x,3)-6*x**2

# defining the r values
r = np.arange(-4.1,3.7,0.1)
plt.plot(r, f(r))
print(optimize.minimize(f, x0=4))

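# Trying with a different method and a starting point on the other side of the
# local maximum at x = 0: the optimiser may now end up in a different minimum
print(optimize.minimize(f, x0=-4, method="L-BFGS-B"))
```

</details>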

## 3.2. Line of best fit and residuals

<a id='residuals'></a>

Take a look at the [linregress](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html) function in scipy. It will allow you to find the line of best fit.

To find the line of best fit, we need to find the minimum of the $\mathrm{MSE}$ function, which is given by $\mathrm{MSE}=\frac{1}{n}\sum(Y_i-\hat Y_i)^2$. The difficulty is that this function no longer depends on a single data point $Y$ but on many $Y_i$, so we minimise over the parameters of the line (slope and intercept) for all points at once. You can think of it graphically: you are trying to minimise the total area of the squares drawn on your residuals. 

You can use linear least squares if your model parameters combine linearly. 

<img src="images/least_square_bad.jpg" alt="drawing" width="300"/>

Then the best line will have the smallest area of all squares of your residuals:

<img src="images/least_square.jpg" alt="drawing" width="300"/>
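
For a straight line $\hat Y = mX + b$, setting the derivatives of the MSE with respect to $m$ and $b$ to zero gives a closed-form answer: the slope is the covariance of $X$ and $Y$ divided by the variance of $X$, and the intercept is $\mu_Y - m\,\mu_X$. The sketch below checks this against `scipy.stats.linregress` on synthetic data; the line parameters, noise level and seed are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
x = np.linspace(0, 10, 30)
y = 1.5 * x + 4.0 + rng.normal(scale=0.5, size=x.size)  # noisy straight line

# Closed-form least-squares solution for a straight line
slope = np.mean((x - x.mean()) * (y - y.mean())) / np.mean((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Compare with scipy's linear regression
res = stats.linregress(x, y)
print(slope, res.slope)
print(intercept, res.intercept)
```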

## Tasks 4

<div class="alert alert-success">
    <b>Task 4.1 </b> : Discuss how you would try and find the line of best fit algorithmically.   
</div>

<details>
<summary> <mark> Mathematical background:</mark> </summary>
    
There are many different algorithms for this problem. If the model is linear in its parameters and the problem is overdetermined (more data points than parameters), the analytical linear least squares solution is the best choice.     
https://en.wikipedia.org/wiki/Least_squares   
https://www.youtube.com/watch?v=YwZYSTQs-Hk

</details>


<div class="alert alert-success">
    <b>Task 4.2 </b> : Compute the line of best fit using the <code>linregress</code> function in <code>scipy.stats</code> for the solubility data of NaCl. 
</div>

In [None]:
# Your code here


<details><summary {style='color:green;font-weight:bold'}> Click here to see solution to Task. </summary>
    
```python
from scipy import stats
data = pd.read_csv('data/sodium_data.dat', delimiter='\t')
temp = data.iloc[:, 1].to_list()
solubility = data.iloc[:, 2].to_list()
res = stats.linregress(temp,solubility)

```

</details>

<div class="alert alert-success">
    <b>Task 4.3 </b> : Plot your line of best fit, the data and a histogram of residuals. 
</div>

In [None]:
# Your code here


<details>
 <summary {style='color:green;font-weight:bold'}> SOLUTION plotting fit line: </summary>
    
```python

# plotting the line of best fit
plt.plot(temp, solubility, 'o', label='solubility data')
plt.plot(np.array(temp), res.intercept + res.slope*np.array(temp), 'r', label='fitted line')
plt.legend()
plt.xlabel(r'Temperature/$^{\circ}$C')
plt.ylabel('Solubility/g NaCl/100g water')
    
```

</details>

<details>
<summary {style='color:green;font-weight:bold'}> SOLUTION plotting residuals: </summary>
    
```python
# plotting the residuals
residual = solubility -(res.intercept + res.slope*np.array(temp))
plt.plot(temp,residual, 'o', color='darkblue')
plt.title("Residual Plot")
plt.xlabel("Independent Variable")
plt.ylabel("Residual")
```

</details>

<details>
<summary {style='color:green;font-weight:bold'}> SOLUTION plotting histogram of residuals: </summary>
    
```python
# plotting a histogram of the residuals
histogram = plt.hist(residual, bins=20)
plt.xlabel('Residual')
plt.ylabel('Frequency of Residual')
```

</details>

#### Sanity checking your linear regression

When looking at the distribution of residuals, you expect them to be normally distributed. This means that the regression model (your line of best fit) should over- and under-predict the data points at random, without any systematic trend. You can check this by plotting a histogram of your residuals: the fit behaves as expected if the histogram resembles a normal distribution centred on zero. For more analysis you can do on your regression fit see [here](https://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm).
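
Besides looking at the histogram, you can also check the residuals numerically. The sketch below is one possible approach on synthetic data (the line, noise level and seed are assumptions): it confirms that the residuals average to zero and runs the Shapiro-Wilk normality test from `scipy.stats`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
x = np.linspace(0, 10, 200)
y = 0.7 * x + 2.0 + rng.normal(scale=0.3, size=x.size)

res = stats.linregress(x, y)
residuals = y - (res.intercept + res.slope * x)

# For a least-squares line with an intercept the residuals average to zero
print(f'mean residual: {np.mean(residuals):.2e}')

# Shapiro-Wilk test: a small p-value (e.g. below 0.05) suggests the
# residuals are unlikely to come from a normal distribution
statistic, p_value = stats.shapiro(residuals)
print(f'Shapiro-Wilk p-value: {p_value:.3f}')
```

A normality test is only a guide: with very few data points it has little power, so it is worth looking at the residual plot as well.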

## 3.3. Fitting a non-linear function
<a id='advanced_fitting_exponential'></a>

The time a drug survives in the body can often be described by a single exponential process, $C(t) = C(0)\exp(-kt)$, where $k$ is the rate constant and $C(t)$ is the concentration of e.g. a drug in the blood after time $t$. Let's look at an example measurement of concentrations over time and see if we can determine the rate constant $k$.

In [None]:
## Loading the data
exp_data = pd.read_csv('data/drug_concentration.txt', delimiter='\t')

In [None]:
time = exp_data.iloc[:, 1].to_list()
concentration = exp_data.iloc[:, 2].to_list()
plt.scatter(time, concentration)
plt.xlabel('time')
plt.ylabel('Concentration')

#### Defining a fitting function
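We need to define the type of function we want to fit. The data looks like an exponential decay, so we can define an exponential function to be fitted. We use a slightly more general form of the expression above, $f(x) = a\exp(kx)+b$, where $a$ plays the role of $C(0)$, $b$ is a constant offset, and the sign of the exponent is absorbed into $k$, so a decay gives a negative fitted $k$. From the fit we can then determine the rate constant. 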


In [None]:
def exp_func(x, a, k, b):
    return a*np.exp(x*k) + b

#### Initial guesses
Just like for the minimisation we saw previously, we need to give the function that fits this exponential curve initial guesses for the parameters. Now we don't just have $x_0$, but initial guesses for the three parameters $a$, $k$ and $b$. These can be passed as a list or an array!

The actual curve fitting is done with the `scipy.optimize.curve_fit` function. Take a look at the documentation [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html).
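
Before applying this to the measured data, it can help to see `curve_fit` recover parameters that are known in advance. The sketch below generates synthetic "measurements" from assumed values (`a_true`, `k_true`, `b_true`, the noise level and the seed are all made up) and fits them with the same functional form; `exp_model` is a hypothetical name used only here.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_model(x, a, k, b):
    return a * np.exp(k * x) + b

# Synthetic "measurements" generated from known (assumed) parameters
a_true, k_true, b_true = 5.0, -0.4, 0.5
rng = np.random.default_rng(seed=4)
t = np.linspace(0, 15, 40)
c = exp_model(t, a_true, k_true, b_true) + rng.normal(scale=0.05, size=t.size)

# p0 holds the initial guesses for a, k and b
popt, pcov = curve_fit(exp_model, t, c, p0=[1, -0.5, 1])
perr = np.sqrt(np.diag(pcov))   # one-standard-deviation uncertainties

for name, value, error in zip(['a', 'k', 'b'], popt, perr):
    print(f'{name} = {value:.3f} ± {error:.3f}')
```

If the initial guesses are very far from sensible values (for example a positive `k` for decaying data), the fit may fail to converge, which is why choosing `p0` with some care matters.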

In [None]:
import scipy.optimize
# using the scipy library to fit the x- and y-axis data 
# p0 is where you give the function guesses for the fitting parameters
# this function returns:
#   popt_exponential: this contains the fitting parameters
#   pcov_exponential: estimated covariance of the fitting parameters
popt_exponential, pcov_exponential = scipy.optimize.curve_fit(exp_func, time, concentration, p0=[1,-0.5, 1])

# we can then find the error of the fitting parameters
# from the diagonal of the pcov_exponential array
perr_exponential = np.sqrt(np.diag(pcov_exponential))

In [None]:
# this cell prints the fitting parameters with their errors
print(f"pre-exponential factor = {popt_exponential[0]:2.2f} ± {perr_exponential[0]:2.2f}")
print(f"rate constant = {popt_exponential[1]:2.2f} ± {perr_exponential[1]:2.2f} ")

<div class="alert alert-info">
    <b>Task (advanced) </b> : Plot your line of best fit, the data and a histogram of residuals for the exponential fit! 
</div>

In [None]:
# Task (advanced): Test out the solution in this cell:


<details><summary {style='color:green;font-weight:bold'}> Click here to see solution to Task. </summary>

```Python
plt.scatter(time, concentration)
plt.plot(time,exp_func(np.array(time),popt_exponential[0],popt_exponential[1],popt_exponential[2]))
plt.xlabel('time')
plt.ylabel('Concentration')
```

</details>

## 3.4. Fitting a Gaussian distribution
<a id='advanced_fitting_gaussian'></a>

Let's revisit the dataset from the previous session and fit a Gaussian or normal distribution to the pH values of white wine. 

In [None]:
# Loading the white wine dataset
df_whites = pd.read_csv("data/winequality-white.csv", delimiter=';')
pH_whites = df_whites['pH']
plt.hist(pH_whites, bins=50, alpha=0.5)

In [None]:
from scipy.optimize import curve_fit
from numpy import exp  # numpy's exp; importing it via scipy is deprecated
heights, edges = np.histogram(pH_whites, bins=50, density=True)

# Just an easy way of defining x and y
x = edges[:49]
y = heights[:49]

plt.plot(x,y)

In [None]:
# initial guesses for the fit: mean and standard deviation of the histogram,
# using the bin heights y as weights
mean = np.sum(x*y)/np.sum(y)
sigma = np.sqrt(np.sum(y*(x-mean)**2)/np.sum(y))

# defining the Gaussian distribution
def Gauss(X,C,X_mean,sigma):
    return C*exp(-(X-X_mean)**2/(2*sigma**2))

# The actual curve fitting
popt,pcov = curve_fit(Gauss,x,y,p0=[max(y),mean, sigma],maxfev=5000)

# plotting the result
plt.plot(x,y,'b+:',label='data')
plt.plot(x,Gauss(x,*popt),'ro:',label='fit')

<div class="alert alert-success">
    <b>Task 4.4 </b> : Fit a Gaussian to the citric acid of white wine and plot the fitted Gaussian over your histogram
</div>
Hint: use the above code as a template

In [None]:
# Your code here



<details><summary {style='color:green;font-weight:bold'}> Click here to see solution to Task. </summary>

```Python
citric_acid = df_whites['citric acid'] # This is the main difference: I am now selecting the citric acid column
heights, edges = np.histogram(citric_acid, bins=50, density=True)
x = edges[:49]
y = heights[:49]

# Copying code from before
# initial guesses: weighted mean and standard deviation of the histogram
mean = np.sum(x*y)/np.sum(y)
sigma = np.sqrt(np.sum(y*(x-mean)**2)/np.sum(y))

def Gauss(X,C,X_mean,sigma):
    return C*exp(-(X-X_mean)**2/(2*sigma**2))

popt,pcov = curve_fit(Gauss,x,y,p0=[max(y),mean, sigma],maxfev=5000)
plt.plot(x,y,'b+:',label='data')
plt.plot(x,Gauss(x,*popt),'ro:',label='fit')
```

</details>

<div class="alert alert-info">
    <b>Something to try </b> : You can also use <code>scipy.optimize.curve_fit</code> to fit a linear function. The way you do this is by defining the fitting function as a linear function. 
</div>

Hint: use

```Python
def linear_function(x, slope, intercept):
    return slope*x + intercept
```
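
As a sketch of how this could look, the example below fits a straight line to synthetic data with `curve_fit` and compares the result with `linregress`. The data, noise level and seed are made up for illustration.

```python
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

def linear_function(x, slope, intercept):
    return slope * x + intercept

rng = np.random.default_rng(seed=5)
x = np.linspace(0, 10, 25)
y = -2.0 * x + 7.0 + rng.normal(scale=0.4, size=x.size)

popt, pcov = curve_fit(linear_function, x, y)   # default initial guess is all ones
res = stats.linregress(x, y)

print(f'curve_fit:  slope = {popt[0]:.4f}, intercept = {popt[1]:.4f}')
print(f'linregress: slope = {res.slope:.4f}, intercept = {res.intercept:.4f}')
```

Both routines minimise the same sum of squared residuals, so they should agree to within numerical precision.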

# 4. Feedback
<a id='feedback'></a>

### Positive feedback for today's session:

In [4]:
Mentimeter(vote = 'https://www.menti.com/alueo3qdf23e').show()

### Things to be improved in today's session:

In [5]:
Mentimeter(vote = 'https://www.menti.com/al93adwyg2bw').show()

# END

Next session will look at some chemistry applications of what you have learned!