# 2  Fitting a straight line

## 2.1  Using <code>polyfit</code> to fit a straight line

Suppose you have a set of data that, when plotted, looks more or less like a straight line. This would indicate that there is a linear relationship between the two quantities. So, if we write each data point as a pair of values $(x_i, y_i)$, it should be possible to find the equation of a straight line

$$y=ax+b$$

where the values for $a$ and $b$ are such that the resulting line is the best fit to the data. (See *Maths for Science*, Chapter 7, if you need to revise the concepts and ideas related to straight lines.) There is a function in the `numpy` package, called `polyfit`, which can be used to fit a straight line. Before we see how it is used, we need to remind ourselves of what a polynomial is.

A polynomial is a function of the type:

$$c_0+c_1x+c_2x^2+c_3x^3+c_4x^4+\dots$$

where the $c_i$ are integer or real numbers. The number of terms is finite and any of the $c_i$ can be zero. (If you want to remind yourself about polynomials, look at MST124: *Essential mathematics*, Unit 3.) The equation of a straight line is a polynomial in which $c_0=b$, $c_1=a$ and all other $c_i$ are zero. The highest power of $x$ in a polynomial is called the **degree** or **order** of the polynomial. The polynomial that gives a straight line is a polynomial of degree or order 1.
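To make the definition concrete, here is a minimal sketch that evaluates a polynomial from its coefficients. The function name `evaluate_polynomial` and the values $a=2$, $b=1$ are illustrative choices for this sketch and are not part of `numpy`.

```python
# Coefficients in the order used in the text: c_0, c_1, c_2, ...
# Here c_0 = b = 1.0 and c_1 = a = 2.0, so the polynomial is y = 2x + 1.
coeffs = [1.0, 2.0]

def evaluate_polynomial(coeffs, x):
    """Return c_0 + c_1*x + c_2*x**2 + ... for the given coefficients."""
    return sum(c * x**power for power, c in enumerate(coeffs))

# The degree is the highest power of x (assuming the last coefficient is not zero)
degree = len(coeffs) - 1
y_at_3 = evaluate_polynomial(coeffs, 3.0)

print("degree:", degree)
print("value at x = 3:", y_at_3)
```

With only $c_0$ and $c_1$ present, the polynomial has degree 1 and describes a straight line.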

The example below shows how  to determine the gradient and intercept of a best fit line using `polyfit`.


In [None]:
# Example of how to do a linear fit to a set of data

import numpy as np
import matplotlib.pyplot as plt

# Create two arrays
x_values = np.array([0.053,0.042,0.029,0.025,0.017,0.010,0.008,0.002],float)
y_values = np.array([7.05,5.93,4.08,4.01,2.83,2.05,1.393,0.452],float)

grad, intc =  np.polyfit(x_values, y_values, deg=1) # Call polyfit to fit a straight line (polynomial of degree 1)
                                                    # to the data provided. 

plt.plot(x_values,y_values,'sr')    # Plot the data provided 
plt.plot(x_values,x_values*grad+intc)   # Plot the best-fit line obtained from polyfit

You can see that `polyfit` takes 3 arguments, in this order: the independent variable (the $x$ values), the dependent variable (the $y$ values) and `deg`, which indicates the order of the polynomial to fit. Since we are fitting a straight line, the order is 1 and so `deg=1`.

`polyfit` then returns two values, which in the program above I have labelled `grad` and `intc`: `grad` is the gradient of the line and `intc` is the intercept. `polyfit` returns the coefficients starting with the highest power of $x$, which is why the gradient comes first. Finally, in the last line of the program I have used these returned values to generate a line.
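One way to check the order of the returned values is to fit data that lie exactly on a known line. The sketch below uses made-up data generated from $y = 3x + 2$ (illustrative values only, not from the course material).

```python
import numpy as np

# Made-up data lying exactly on the line y = 3x + 2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0

# polyfit returns the coefficients with the highest power of x first
coefficients = np.polyfit(x, y, deg=1)
grad, intc = coefficients

print("gradient is", grad)
print("intercept is", intc)
```

The gradient should come out very close to 3 and the intercept very close to 2, apart from small rounding differences.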

There are functions in other packages that allow you to perform a linear fit but in SM123 you are expected to use `polyfit` in your assessment. 

> Would the program work if  `x_values` and `y_values` were lists instead of arrays? Hint: you may want to try this!

:::{hint} Answer
:class: dropdown
It will partially work. Calling the function `polyfit` will work because `numpy` functions can accept lists of numbers and convert them to arrays. We saw this when plotting at the end of Python 4 Notebook 1. However, the last line of the program will fail because for `x_values*grad + intc` to work, `x_values` needs to be an array.
:::

>  How would you modify the program above so that it prints the  intercept and gradient?

:::{hint} Answer
:class: dropdown
Add the following lines after line 11:

<code>print("gradient is =", grad)</code>

<code>print("intercept is =", intc)</code>

:::

### Exercise 2.1

Write a program that uses `polyfit` to determine the best-fit straight line for data giving the force applied to an elastic band and the extension it causes. Use the data in the file `Extension_force.csv`. Print the values of the gradient and intercept.

Once you've answered the exercise, click on the <u>**+ 1 cell hidden**</u> button below to see a possible solution.

In [None]:
# Example program that reads in  data from a file and performs 
# a linear regression to obtain the gradient and intercept 
# of the best-fit line and plots the data

import csv   
import numpy as np
import matplotlib.pyplot as plt

# Create empty lists to store values 
extension = []
force = []

with open('Extension_force.csv', mode='r') as input_file: # open CSV file from which data will be read
    data_of_extension = csv.DictReader(input_file)       # read the rows, using the header row for column names

    # Iterate over each row to create two lists with the read data
    for i_row in data_of_extension:
        extension.append(i_row['Extension'])
        force.append(i_row['Force'])

# Convert the data read from the file into arrays of real numbers 
xarray=np.array(extension,float)
Farray=np.array(force,float)

# Use the polyfit function in numpy to fit a line to the data
m, c =  np.polyfit(xarray, Farray, deg=1)
print ("intercept is=",c,"N")
print ("gradient is=",m,"N/m") 

# Plot the points and the best-fit line. Add labels to the axes
plt.plot(xarray,Farray,'sr')
plt.plot(xarray,xarray*m + c)
plt.xlabel('Extension / m')
plt.ylabel('Force / N')


:::{hint} Hint
:class: dropdown

You will need to turn the data you have read into a list or an array to use it in `polyfit`.
:::
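The reason a conversion is needed is that the `csv` module returns every value as a string. The sketch below illustrates this with a small made-up table held in a string (via `io.StringIO`) rather than a real file. The column names `Time` and `Distance` and the values are hypothetical.

```python
import csv
import io
import numpy as np

# Hypothetical file contents, used here in place of a real CSV file
csv_text = "Time,Distance\n0.0,1.5\n1.0,3.5\n2.0,5.5\n"

time_values = []
distance_values = []

reader = csv.DictReader(io.StringIO(csv_text))
for row in reader:
    time_values.append(row['Time'])          # each value is a string, e.g. '0.0'
    distance_values.append(row['Distance'])

first_raw = time_values[0]
print(type(first_raw))

# Convert the lists of strings into arrays of real numbers before fitting
time_array = np.array(time_values, float)
distance_array = np.array(distance_values, float)

grad, intc = np.polyfit(time_array, distance_array, deg=1)
print("gradient is", grad)
print("intercept is", intc)
```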

In [None]:
# Write your python code here.




## 2.2   Calculating the Hubble constant

The Hubble relationship (Topic 9) tells us that the further away a galaxy is from us, the faster it is receding. In Topic 9, Section 1.3, we expressed this relationship in the following way:

$$v = H_0 \times d$$

where $v$ is the apparent speed, $d$ is the distance of the galaxy from us and $H_0$ is the Hubble constant. $H_0$ quantifies the expansion rate of the Universe and can also be used to calculate its age.
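As a rough illustration of the link between $H_0$ and the age of the Universe, a common simple estimate takes the age to be about $1/H_0$, assuming that the expansion rate has been constant. The sketch below uses $H_0 = 70$ km s$^{-1}$ Mpc$^{-1}$ purely as an illustrative value; it is not a result obtained from the data in this notebook, and the conversion factors are approximate.

```python
# Illustrative value only, not a result derived from the data in this notebook
H0_km_s_Mpc = 70.0

KM_PER_MPC = 3.0857e19        # kilometres in one megaparsec (approximate)
SECONDS_PER_YEAR = 3.156e7    # seconds in one year (approximate)

# Convert H0 to units of 1/s by cancelling the two distance units
H0_per_second = H0_km_s_Mpc / KM_PER_MPC

# The simple age estimate is the reciprocal of H0
hubble_time_s = 1.0 / H0_per_second
hubble_time_gyr = hubble_time_s / SECONDS_PER_YEAR / 1e9

print("Estimated age is approximately", round(hubble_time_gyr, 1), "billion years")
```

For this illustrative value the estimate comes out at roughly 14 billion years.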

In Topic 9, Activity 1.2, you measured the speeds of recession and distances to eight clusters of galaxies. You plotted a graph of speed against distance and then calculated the gradient of the resulting line. That gradient corresponds to $H_0$. Here, you will do the same, but you will use `polyfit` to obtain the gradient of the straight line the data describes.

The data you need are the distances of some astronomical objects and their speeds, and are contained in the file `Hubble_data.csv`. In this case the data are for some Type Ia supernovae in distant galaxies. Note that the distances are given in megaparsecs (the definition of this unit was given in Topic 9, Section 1.1) and the speeds in km s$^{-1}$. The observational data come from a paper by Riess et al. (1996) and have already been converted to distances and speeds so that they are easier to plot and fit.

### Python activity 2.1 Calculating the Hubble constant

*Allow approximately 1 hour*

Write a Python program that determines the value of the Hubble constant from the data above using `polyfit`. The program should also plot the input data and the fit line. You may base this on the program you've just completed in Exercise 2.1.

(A day or two before the completion of this Python study week, suggested programs that accomplish what is required by the various Activities this week will  be made available. We do this towards the end of the study week in order to encourage you to find your own solutions before seeing them.)

In [None]:
# Write your python code here.




**In this notebook you have used the `numpy` function `polyfit` to fit a straight line to data and to calculate the Hubble constant. You should now move on to Python 4, Notebook 3, Assessing the quality of a fit.**