# PHAS0007 Session 5: Notebook 1
# Fitting data with a least-squares fit: Theory

louise.dash@ucl.ac.uk Last updated 16.10.2018 


### Learning objectives
By the end of this session, you should understand the principles of linear least-squares fitting, and be able to produce and plot your own straight-line fit to some data.

## Contents
This session has been split into two separate notebooks. This one covers the theory of least-squares fitting; the second covers implementing this in Python (and details of the task for this session).



#  The theory behind least-squares fitting

Imagine that we've successfully recorded some experimental data in our lab book, and plotted it on a beautiful graph. 

Now we want to find the best possible straight line through it. How do we go about this? Before we can implement this in Python, we need to understand the underlying maths. This is also covered in Dr Bartlett's Data Analysis lectures, so it should be a review of material you've already met.

A straight line has the equation

 $$y = mx + c$$

where $m$ is the slope / gradient of the line, and $c$ the intercept with the $y$-axis.

Imagine we have $n$ pairs of data points:

$$(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n) $$

We can plot these, and then draw a straight line through them. For the moment this isn't necessarily the best straight line, just any old straight line, to illustrate the principle:

<img src="./Leastsquaresfit-animline.gif" width=400>

(if you can't see the image, make sure you've also downloaded the image file(s) from Moodle and they are in the same directory as this notebook)

For each point, we can then measure the vertical deviation $\Delta y_i$ between the data point and the line. Note that some of these will be positive, and some negative. We can write this mathematically as

$$\Delta y_i = y_i - mx_i - c$$

Now, if we were to add up all these deviations to use as a measure of goodness of fit of the line, we'd have a problem, because many of the positive and negative values would cancel out. Instead, we need to find a positive number that will account for all the deviations.

We therefore take the square of each deviation and add them all up:

$$S = \sum(y_i - mx_i -c)^2$$

<img src="./LeastSquares-Squares.gif" width=400>

The best straight-line fit to our data will then be the one with the smallest value of $S$: i.e.  the one whose sum of all the squares is least.
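
As a minimal sketch of this idea, the snippet below evaluates $S$ for two trial lines. The data are the five points used by the interactive example later in this notebook; the function name `sum_of_squares` and the two trial $(m, c)$ pairs are illustrative assumptions, not part of the task.

```python
import numpy as np

# Data points used by the interactive example later in this notebook
x = np.array([10, 20, 30, 40, 50])
y = np.array([14.3, 20.68, 33.1, 42, 47.7])

def sum_of_squares(m, c, x, y):
    """Return S, the sum of squared vertical deviations from y = m*x + c."""
    deviations = y - m * x - c
    return np.sum(deviations**2)

# Two trial lines (values chosen purely for illustration)
S_a = sum_of_squares(0.9, 5.0, x, y)
S_b = sum_of_squares(1.0, 2.0, x, y)

print(f"Trial line a: S = {S_a:.4f}")  # S = 12.9724
print(f"Trial line b: S = {S_b:.4f}")  # S = 26.7324
```

Of these two trial lines, line a has the smaller $S$ and is therefore the better fit, although neither is necessarily the best possible line.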

So how do we find this best fit line? Mathematically, we can minimise $S$ with respect to our fit parameters $c$ and $m$:

$$\frac{\partial S}{\partial m} = -2 \sum x_i(y_i - mx_i - c) = 0$$

$$ \frac{\partial S}{\partial c} = -2 \sum (y_i - mx_i - c) = 0 $$
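
If you'd like to convince yourself that these two expressions are right, one option is to compare them with a numerical (central finite-difference) estimate of the derivatives. The sketch below does this for the data used later in this notebook; the helper names `S`, `dS_dm` and `dS_dc`, the trial point `(m0, c0)` and the step size `h` are all illustrative assumptions.

```python
import numpy as np

# Data points used by the interactive example later in this notebook
x = np.array([10, 20, 30, 40, 50])
y = np.array([14.3, 20.68, 33.1, 42, 47.7])

def S(m, c):
    """Sum of squared deviations for the line y = m*x + c."""
    return np.sum((y - m * x - c)**2)

def dS_dm(m, c):
    """Analytic partial derivative of S with respect to m."""
    return -2 * np.sum(x * (y - m * x - c))

def dS_dc(m, c):
    """Analytic partial derivative of S with respect to c."""
    return -2 * np.sum(y - m * x - c)

# Trial point and step size (illustrative choices)
m0, c0 = 0.9, 5.0
h = 1e-4

# Central finite-difference estimates of the same derivatives
num_dm = (S(m0 + h, c0) - S(m0 - h, c0)) / (2 * h)
num_dc = (S(m0, c0 + h) - S(m0, c0 - h)) / (2 * h)

print(dS_dm(m0, c0), num_dm)
print(dS_dc(m0, c0), num_dc)
```

The analytic and numerical values should agree to within rounding error. Neither derivative is zero at this trial point, which tells us it is not the best fit.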

At the minimum of $S$, both of these derivatives must be zero at the same time, so we end up with two simultaneous equations:

$$ m \sum x_i ^2 + c \sum x_i = \sum x_i y_i $$

$$ m \sum x_i + n c = \sum y_i $$

(remember, $n$ is the number of data pairs in the set)
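
For a particular data set, the two simultaneous equations above are just a 2×2 linear system in $m$ and $c$, so they can also be solved numerically. The sketch below does this with `np.linalg.solve` for the data used later in this notebook; the variable names `A` and `b` are illustrative assumptions.

```python
import numpy as np

# Data points used by the interactive example later in this notebook
x = np.array([10, 20, 30, 40, 50])
y = np.array([14.3, 20.68, 33.1, 42, 47.7])
n = len(x)

# Write the two simultaneous equations as A @ [m, c] = b
A = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    n]])
b = np.array([np.sum(x * y), np.sum(y)])

m, c = np.linalg.solve(A, b)
print(f"m = {m:.4f}, c = {c:.3f}")  # m = 0.8812, c = 5.120
```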

Solving these gives us

$$ m = \frac{\sum(x_i - \bar{x}) y_i}{\sum (x_i - \bar{x})^2} = \frac{\sum x_i (y_i - \bar{y})}{\sum x_i (x_i - \bar{x})} $$

$$ c = \bar{y} - m \bar{x} $$

where $ \bar{x} = \frac{1}{n} \sum x_i$ and $\bar{y} = \frac{1}{n} \sum y_i$, i.e. the mean values of the data.
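
A minimal sketch of these formulas in NumPy is shown below, again using the data from the interactive example later in this notebook. As a cross-check (an addition here, not something the task requires), the result is compared with `np.polyfit`, which performs a least-squares polynomial fit and returns the coefficients with the highest power first.

```python
import numpy as np

# Data points used by the interactive example later in this notebook
x = np.array([10, 20, 30, 40, 50])
y = np.array([14.3, 20.68, 33.1, 42, 47.7])

# Mean values of the data
x_bar = np.mean(x)
y_bar = np.mean(y)

# Closed-form expressions for the slope and intercept
m = np.sum((x - x_bar) * y) / np.sum((x - x_bar)**2)
c = y_bar - m * x_bar

# Independent cross-check: degree-1 least-squares fit
m_check, c_check = np.polyfit(x, y, 1)

print(f"m = {m:.4f}, c = {c:.3f}")  # m = 0.8812, c = 5.120
```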

## The leastsquaresfitometer

These equations can be a bit abstract, and they're easier to understand if we can visualise them. To help with this, I wrote some code that lets you interactively change the parameters of a straight line to see how the squares of the deviations (the residuals) change. This is an updated version of the visualisation you saw in the screencast.

Move the two sliders for the slope and intercept around and you will see the straight line moving, and the size of the squares changing. The initial position of the line is close to, but not exactly, the best possible fit. Play with this, and adjust the sliders until you find the best fit - that which minimizes the sum of the areas of the squares.

If you don't see the plot and sliders below, select "Run All" from the "Cell" menu.

In [1]:
# script in this cell is from http://blog.nextgenetics.net/?e=102
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
This cell (and the whole notebook) doesn't display the code by default, as it's the interactive plot I'd like you to play around with. 
If you do want to look at the code itself, you can toggle it on/off by clicking <a href="javascript:code_toggle()">here</a>.
(We will be looking more at how to create interactive plots after Reading week.)''')

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from matplotlib.collections import PatchCollection

from ipywidgets import interactive # Anaconda version 2.4+
from IPython.display import display

# widgets don't work well with notebook backend, so use:
%matplotlib inline

### set "data"
# For this visualization, best if slope ~ 1
# feel free to vary the values!
x = np.array([10,20,30,40,50])
y = np.array([14.3,20.68,33.1,42,47.7])


### Fit the data

mean_x = np.mean(x)
mean_y = np.mean(y)
slope = np.sum((x - mean_x)*y) / np.sum((x - mean_x)*x)
intercept = mean_y - slope*mean_x
sum_of_squares = np.sum((y - slope*x - intercept)**2)


# initialize, only once
# basic information from the data
x_min = np.min(x)
x_max = np.max(x)
y_min = np.min(y)
y_max = np.max(y)


### Now the main plotting function ###
def plotsquares(slope,intercept):
    '''Function that plots data and squares for a given slope and intercept'''
    fig = plt.figure(figsize=(8,8))
    plt.rc("font", size=14) # increase default font size
    ax = fig.add_subplot(111, aspect='equal') # need this to use add_artist

    # generate the x-points for the fitted line
    # (a straight line only needs 2 points: its two ends)
    x_points = np.linspace(0,x_max*2.,2)
    y_points = slope*x_points + intercept

    plt.plot(x, y, 'ro')
    line, = plt.plot(x_points, y_points, 'g-') # nb comma: plt.plot returns a list of lines
    #plt.xlim(0.8*x_min,1.1*x_max)
    #plt.ylim(0.8*y_min,1.1*y_max)
    plt.xlim(0,1.5*x_max)
    plt.ylim(0,1.5*y_max)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title('Interactive least-squares fit')
    
    sum_of_squares = np.sum((y - slope*x - intercept)**2)
    plt.text(5,65, "Sum of squares is {0:0.3f}".format(sum_of_squares))
    plt.text(5,60, "Slope is {0:0.3f}".format(slope))
    plt.text(5,55, "Intercept is {0:0.3f}".format(intercept))

    ###  set up the squares for the initial plot
    sqcol = '#99b267'
    patches = []
    for xx,yy in zip(x,y):
        # calculate vertical distance from datapoint to line
        dist = yy - (slope*xx + intercept)
        if slope > 0:
            if dist > 0:
                # datapoint is TOP RIGHT of the square
                posx = xx-dist
                posy = yy-dist
                wid = dist
                hei = dist
            else: # dist is negative
                # datapoint is BOTTOM LEFT of the square
                posx = xx
                posy = yy
                wid = -dist
                hei = -dist
        else: #slope is negative
            if dist > 0:
                # datapoint is TOP LEFT of the square
                posx = xx
                posy = yy - dist
                wid = dist
                hei = dist
            else: # dist is negative
                # datapoint is BOTTOM RIGHT of the square
                posx = xx+dist
                posy = yy
                wid = -dist
                hei = -dist

        square = Rectangle(xy=(posx,posy), width=wid, height=hei)
        patches.append(square)

    sq_collection = PatchCollection(patches, alpha=0.4, facecolor=sqcol) # use the colour defined above
    #sq_collection.set_array(np.array(colors))
    ax.add_collection(sq_collection)
    plt.show()

## Calculate and display the interactive plot ##
interplot = interactive(plotsquares, slope=(0.4, 1.4, 0.001), intercept=(2.0,8.0,0.001))
display(interplot)



Once you're fairly sure you've grasped the basic principles here (you'll need to implement the equations for $m$ and $c$ in the task, but **you won't need to plot the residuals or the squares themselves**), proceed to the second notebook, which will guide you through implementing a simple linear least-squares fit in Python.