# Unit 05: Introduction to plotting

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons Licence" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" title='This work is licensed under a Creative Commons Attribution 4.0 International License.' align="right"/></a>

Authors: Dr Valentina Erastova, Dr Matteo Degiacomi, Dr Antonia SJS Mey, Hannah Pollak

Email: <valentina.erastova@ed.ac.uk>, <antonia.mey@ed.ac.uk>


## Learning objectives  <a id="learning"></a>

By the end of this unit, you should be able to:
* use the `numpy` library 
* perform mathematical operations on `numpy` arrays in 1D and in 2D
* access parts of arrays
* load arrays to or from files
* plot data using `matplotlib`
* understand sources of errors and how to mitigate them
* analyse numerical data statistically using in-built functions
* quantify uncertainties

Some of the material was adapted from [Dr Matteo Degiacomi](https://github.com/Degiacomi-Lab/python4science/blob/master/2_Python_numerical_data.ipynb).


# Table of contents

* [1. Arrays and NumPy](#1-arrays-and-numpy)
    * [1.1 1D Arrays](#11-1d-arrays)
* [Tasks 1](#tasks-1)
* [2. Mathematical operations on 1D arrays](#2-mathematical-operations-on-1d-arrays)
* [3. Accessing slices of 1D arrays](#3-accessing-slices-of-1d-arrays)
* [4. Multidimensional arrays](#4-multidimensional-arrays)
    * [4.1 Generating 2D arrays](#41-generating-2d-arrays)
    * [4.2 Slicing 2D arrays](#42-slicing-2d-arrays)
* [Tasks 2](#tasks-2)
* [5. Mathematical operations on multidimensional arrays](#5-mathematical-operations-on-multidimensional-arrays)
* [6. Plotting data](#6-plotting-data)
    * [6.1 Quick aside on string formatting](#61-quick-aside-on-string-formatting) 
* [7. Errors: a discussion](#7-errors-a-discussion)
    * [7.1 Sources of Errors and Uncertainties](#71-sources-of-errors-and-uncertainties)
    * [7.2 Accuracy vs Precision](#72-accuracy-vs-precision)
* [8. Introduction to Statistics](#8-introduction-to-statistics)
    * [8.1 Statistical distributions](#81-statistical-distributions)
    * [8.2 Distribution of measurements](#82-distribution-of-measurements)
    * [8.3 Quantifying Uncertainty](#83-quantifying-uncertainty)
* [Recap](#recap)
* [Feedback](#feedback)


**<span style="color:black">Jupyter Cheat Sheet</span>**
- To run the currently highlighted cell and move focus to the next cell, hold <kbd>&#x21E7; Shift</kbd> and press <kbd>&#x23ce; Enter</kbd>;
- To run the currently highlighted cell and keep focus in the same cell, hold <kbd>Ctrl</kbd> and press <kbd>&#x23ce; Enter</kbd>;
- To get help for a specific function, place the cursor within the function's brackets, hold <kbd>&#x21E7; Shift</kbd>, and press <kbd>&#x21E5; Tab</kbd>;

### Links to documentation

You can find useful information about using `numpy` and `matplotlib` at
* [NumPy](https://numpy.org)
* [matplotlib](https://matplotlib.org)


### Further reading for this topic

# FIXME

# Import libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Make the helper functions accessible
import sys
import os.path
sys.path.append(os.path.abspath('../'))
from helper_functions.mentimeter import Mentimeter

# 1. Arrays and NumPy <a id="1-arrays-and-numpy"></a>

An **array** is a smart way of storing multidimensional numerical data.

**NumPy**, which stands for *Numerical Python*, is a module consisting of multidimensional array objects and a collection of routines for processing those arrays. 

We can use NumPy to perform mathematical and logical operations on arrays.

NumPy is a base for many other modules, including Pandas, and so they can be used together.

## Import the NumPy library

For NumPy, the standard-practice alias is `np`:

In [None]:
import numpy as np

## 1.1 1D Arrays <a id="11-1d-arrays"></a>

NumPy arrays can only contain **one datatype**, i.e. all integers, all floats, etc. This is in contrast to lists, which can contain a mix of datatypes.
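As a small illustration of the one-datatype rule, the sketch below (variable names are purely illustrative) shows what happens when integers and a float are passed to `np.array`: NumPy stores every element with a single common type, whereas the list keeps each element's own type.

```python
import numpy as np

# Mixing integers and a float: NumPy stores every element as a float
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers, mixed_numbers.dtype)

# A list keeps the original type of each element
mixed_list = [1, 2.5, 3]
print([type(item).__name__ for item in mixed_list])
```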


### Creating 1D arrays 

To create an array of integers (whole numbers like 1, 2, 3, 4, 5), we can convert a list to an array:

```python
import numpy as np

my_list = [1, 2, 3, 4, 5]

my_array = np.array(my_list)
```


### Example 1

In [None]:
# Create a 1D numpy array:

# FIXME

<details><summary {style="color:green;font-weight:bold"}> Click here to see the solution to Example 1.</summary>

```python
a = [1, 2, 3, 4, 5] # Your list can be of any length
my_array = np.array(a)
```

### Example 2

Let's look at some of the **properties** of our array. 

How do you get the **dimensions**, **shape**, **size** and **datatype** of an array?

In [None]:
# Create a 1D array

# Check the properties of this 1D array

# dimensions?

# shape?

# size? 

# datatype?


<details><summary {style="color:green;font-weight:bold"}>Click here to see the solution to Example 2.</summary>

```python
# Create a 1D array
a = [1, 2, 3, 4, 5]
my_array = np.array(a)

# Check the properties of this 1D array
print(f"Dimensions {my_array.ndim}")
print(f"Shape {my_array.shape}")
print(f"Size {my_array.size}")
print(f"Datatype {my_array.dtype}")
```

### Example 3

We can also use **functions** to generate arrays.

Similarly to the in-built function `range`, we can generate one-dimensional arrays of equally-spaced numbers with:
* `np.linspace(start, end, quantity)` or
* `np.arange(start, end, step_size)`

We can also generate multidimensional arrays filled with zeros or ones with NumPy functions:
* `np.zeros(shape)`
* `np.ones(shape)`

where `shape` has to be an `int` for 1D arrays and `tuple`, such as `(5, 6)`, for creating a 2D array.

**Let's use `np.zeros(shape)` to create a 1D array full of zeros:**

In [None]:
# FIXME


<details><summary {style="color:green;font-weight:bold"}> Click here to see the solution to Example 3. </summary>

```python
z = np.zeros(10)
print(f"My array of zeros {z} is of type {z.dtype}")

```

# Tasks 1 <a id="tasks-1"></a>

We will continue to generate 1D arrays, access parts of an array and perform some mathematical operations on them. 


<div class="alert alert-success">
    <b>Task 1.1 </b> : Generate a 1D array of length 5, filled with ones.
</div>



In [None]:
# FIXME



<details><summary {style="color:green;font-weight:bold"}> Click here to see the solution to Task 1.1 </summary>

```python
ones = np.ones(5)
print(f"Array of five ones: {ones}")
```

<div class="alert alert-success"><b> Task 1.2: Create an array with `np.arange`</b>

Using `np.arange`, create a 1D array as a sequence from 0 to 20 in steps of 2.

</div>

In [None]:
# FIXME

<details><summary {style="color:green;font-weight:bold"}> Click here to see the solution to Task 1.2</summary>

```python
sequence = np.arange(0, 21, 2) 
print(sequence)
```

<div class="alert alert-success"><b>Question</b>: What number did you have to stop at to include 20 as a last number? Why?
</div>

<details><summary {style="color:green;font-weight:bold"}>Click here to see the answer to the above question.</summary>

Python starts counting from 0 and in `np.arange(start, stop, step)`, the `stop` value is not inclusive.

<div class="alert alert-warning"><b> Advanced task 1.3</b> <a id="task-13"></a>

Find the last number in an array `np.arange(0, 20, 2)`.

Is the answer as you expected?
</div>

In [None]:
# FIXME

<details><summary {style="color:green;font-weight:bold"}>Click here to see the solution to the Advanced task 1.3.</summary>

```python
a = np.arange(0, 20, 2)
last = a[-1]
print(last)
```


<div class="alert alert-success"><b> Task 1.4: Generate another array</b>

Generate the same array as in Task 1.2, `np.arange(0, 21, 2)`, but this time using `np.linspace(start, stop, n_steps)`.

How do these two functions differ?

</div>

In [None]:
# FIXME


<details><summary {style="color:green; font-weight:bold"}> Click here to see the solution to Task 1.4</summary>

```python
b = np.linspace(0, 20, 11)
print(b)
```

Note that in this case, the end point is included in the generated array. This is also explained in the [documentation](https://numpy.org).

# 2. Mathematical operations on 1D arrays <a id="2-mathematical-operations-on-1d-arrays"></a>

All mathematical operations between NumPy arrays act element by element. This is not the same for lists, which is why using NumPy is so useful. 

Operations with scalar numbers act on every element of the array. 

For example:

If we define: 
```python
a = np.array([1, 2, 3])
b = np.array([0, 1, 2])
```
then
* `a * b` returns the array `[0, 2, 6]`
* `a - b` returns the array `[1, 1, 1]`
* `a + 1` returns the array `[2, 3, 4]`

We can see that arrays behave much like *vectors* in maths. They can be used to conduct mathematical operations in a compact way. If we were using *lists*, we would have to loop through each element of the list to perform similar operations.

We will see some examples of this below.
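To make the contrast with lists concrete, here is a minimal sketch (the variable names are illustrative only). The same operators that act element by element on an array repeat or join a list, so the list version needs an explicit loop.

```python
import numpy as np

numbers_list = [1, 2, 3]
numbers_array = np.array(numbers_list)

# For a list, * repeats the list and + joins two lists together
list_times_two = numbers_list * 2
list_plus_list = numbers_list + numbers_list

# For an array, the same operators act on every element
array_times_two = numbers_array * 2
array_plus_array = numbers_array + numbers_array

# Doubling every element of a list needs a loop (here a list comprehension)
looped_times_two = [value * 2 for value in numbers_list]

print(list_times_two)
print(array_times_two)
print(looped_times_two)
```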

<div class="alert alert-success"><b> Task 1.5: Add a scalar to an array </b>

Create an array called `my_array` containing the numbers 3, 6, 7, 2 and 8. Add the number 3 to every number of the array.

</div>

In [None]:
# FIXME


<details><summary {style="color:green; font-weight:bold"}> Click here to see the solution to Task 1.5 </summary>

```python

my_array = np.array([3, 6, 7, 2, 8])

new_array = my_array + 3

print(f"my_array + 3 = {new_array}")
```

We can also do mathematical operations between two arrays. 

**Note** the arrays have to have the same shape.
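As a quick check of what happens when the shapes do not match, the sketch below (with made-up arrays) tries to add a 3-element array to a 4-element array. NumPy raises a `ValueError`, which is caught here so that the cell keeps running.

```python
import numpy as np

three_values = np.array([1, 2, 3])
four_values = np.array([1, 2, 3, 4])

try:
    result = three_values + four_values
except ValueError as error:
    # The shapes (3,) and (4,) cannot be combined element by element
    result = None
    print(f"Could not add the arrays: {error}")
```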

<div class="alert alert-success">
    <b>Task 1.6: Mathematical operations between two arrays.</b>

   Create 2 arrays of your liking and perform mathematical operations.
   
   For example - multiply them, subtract one from another and add them up.
   
   Print the answers.
</div>




In [None]:
# FIXME
a = 
b = 

print(f"multiplication a * b = {___}")
print(f"subtraction a - b = {___}")
print(f"addition a + b = {___}")


<details><summary {style="color:green; font-weight:bold"}> Click here to see solution to Task 1.6 </summary>

```python
a = np.array([1, 2, 4])
b = np.array([0, 1, 2])

print(f"multiplication a * b = {a * b}")
print(f"subtraction a - b = {a - b}")
print(f"addition a + b = {a + b}")
```

 <div class="alert alert-success">
    <b>Task 1.7: Square each value in `my_array`</b> 
</div>

<div class="alert alert-info"><b> Hint</b>

You can use `**` as an operator to raise to a power, i.e. $x^2$ would be written as `x**2` in python.

</div>

In [None]:
# FIXME

<details><summary {style="color:green; font-weight:bold"}> Click here to see the solution to Task 1.7. </summary>

```python

my_array = np.array([3, 6, 7, 2, 8])
my_array_squared = my_array ** 2

print(my_array_squared)

```

### Example 4

What is the difference between using `numpy` and using `math`?

How do you calculate
* the square-root of a single number?
* the square-root of a list?
* the square-root of an array?

See what happens when you run the below code.

<div class="alert alert-info">
<b>In this unit we import the math library with the short alias m.</b>
</div>

In [None]:
import math as m
import numpy as np

# Square-root of a single number:
# with math
print(m.sqrt(4))
# with numpy
print(np.sqrt(4))
# mathematically, by calculating 4^{1/2}
print(4**0.5)

# Square-root of a list of numbers
l = [4, 9, 16]
# numpy: square root of every element
print(np.sqrt(l))
# Can you use math here?
print(m.sqrt(l))

# Square-root of an array
a = np.array(l)
# square root of every element of a numpy array
print(np.sqrt(a))
# would this work?
print(m.sqrt(a))

# 3. Accessing *slices* of 1D arrays <a id="3-accessing-slices-of-1d-arrays"></a>


<img src="images/slicing1.png" width="500">

We will learn about *slicing* in the below task.
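Before attempting the task, it may help to see the general pattern `array[start:stop:step]` on a small array. The sketch below uses an illustrative array of the numbers 10 to 19; as with `np.arange`, the `stop` index is not included.

```python
import numpy as np

values = np.arange(10, 20)        # [10 11 12 13 14 15 16 17 18 19]

first_three = values[0:3]         # indices 0, 1, 2
every_other = values[::2]         # whole array in steps of 2
stepped_window = values[2:8:3]    # indices 2 and 5
last_two = values[-2:]            # negative indices count from the end

print(first_three)
print(every_other)
print(stepped_window)
print(last_two)
```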

<div class="alert alert-success"><b> Task 1.8: Slicing arrays </b>

1. Generate a 1D array of 20 elements and fill it with random numbers.
2. Pick every 3rd value within the first 10 values.
3. Print how many values you get
4. What is the last number in your array? (See [Advanced task 1.3](#task-13))
</div>


<div class="alert alert-info"><b> Hint</b>

 Try  `np.random.default_rng(seed)`

This is a random number generator, where the `seed` is used to "initialise" the number generator. You can read more about this in the [Random Generator Documentation from NumPy](https://numpy.org/doc/stable/reference/random/generator.html).

 </div>

In [None]:
# 1. Generate a 1D array of 20 elements and fill it with random numbers.
# FIXME

# 2. Pick every 3rd value within the first 10 values.
# FIXME

# 3. Print how many values you get
# FIXME

# 4. What is the last number in your array?
# FIXME


<details><summary {style="color:green; font-weight:bold"}> Click here to see the solution to Task 1.8 </summary>

```python

# 1. Generate a 1D array of 20 elements and fill it with random numbers.

random_generator = np.random.default_rng(12345)
random_numbers = random_generator.random(20)
print(random_numbers)

# 2. Pick every 3rd value within the first 10 values.
picked = random_numbers[0:10:3]

# 3. Print how many values you get
print(len(random_numbers))
print(len(picked))

# 4. What is the last number in your array?    
last = random_numbers[-1]
print(last)
```

# 4. Multidimensional arrays <a id="4-multidimensional-arrays"></a>

We have met 1D arrays above, now let's have a look at the multidimensional cases:

<img src="images/scalar-tensor.png" width="500">


## 4.1 Generating 2D arrays  <a id="41-generating-2d-arrays"></a>

Just like with 1D arrays, we can also create a 2D array in the following manner:

```python
import numpy as np

a = [[1, 2], [3, 4], [5, 6]]
my_2d_array = np.array(a)

```

Sometimes it's nice to write out the array in separate lines to see the columns and the rows more clearly. However, it doesn't change the way python sees the array.

```python
a = [[1, 2], 
     [3, 4], 
     [5, 6]]
my_2d_array = np.array(a)
```


### Example 5

Create a two-dimensional array.

In [None]:
# FIXME


<details><summary {style="color:green; font-weight:bold"}> Click here to see solution to Example 5.</summary>

```python
b = [[1, 2], [3, 4], [5, 6]]
my_2d_array = np.array(b)

print(my_2d_array)

```

# FIXME

<details>
    <summary> <b> Question:</b> What is the difference between <code>tuple</code>, <code>array</code> and a <code>list</code>? </summary>
    

**ANSWER**:

**List**: A list is an ordered collection that is mutable, which means its values can be modified after creation. A list can be indexed and sliced, and each element can be accessed or changed using its index. The following are the main characteristics of a List:

- A list is an ordered collection of items.
- A list is mutable.
- Lists are dynamic and can contain objects of different data types.
- List elements can be accessed by index number.

```python
# use a name other than `list`, which would hide the built-in type
fruits = ["mango", "strawberry", "orange",
          "apple", "banana"]
print(fruits)

# we can specify the range of the
# index by specifying where to start
# and where to end
print(fruits[2:4])

# we can also change the item in the
# list by using its index number
fruits[1] = "grapes"
print(fruits[1])

```
    
    
**Array**:  An array is a collection of items stored at contiguous memory locations. The idea is to store multiple items of the same type together. This makes it easier to calculate the position of each element by simply adding an offset to a base value, i.e., the memory location of the first element of the array (generally denoted by the name of the array). The following are the main characteristics of an Array:
    
- An array is an ordered collection of items of the same data type.
- An array is mutable.
- An array can be accessed by using its index number.

```python
# importing "array" for array creations
import array as arr

# creating an array with integer type
a = arr.array('i', [1, 2, 3])

# printing original array
print("The newly created array is:", end=" ")
for i in range(0, 3):
    print(a[i], end=" ")
print()

# creating an array with float type
b = arr.array('d', [2.5, 3.2, 3.3])

```

**Tuple**:  A tuple is an ordered and immutable data type, which means we cannot change its values. Tuples are written in round brackets, and we can access an element by giving its index number inside square brackets. The following are the main characteristics of a Tuple:

- Tuples are immutable and can store items of any data type.
- A tuple is defined using ().
- A tuple cannot be changed or have its items replaced, as it is an immutable data type.

```python
# use a name other than `tuple`, which would hide the built-in type
fruits = ("orange", "apple", "banana")
print(fruits)

# we can access the items in
# the tuple by its index number
print(fruits[2])

# we can specify the range of the
# index by specifying where to start
# and where to end
print(fruits[0:2])
```
Taken from www.geeksforgeeks.org

</details>

### Array properties of 2D arrays

Consider the array 

```python
a = np.array([[0, 1, 2, 3],
              [10, 11, 12, 13],
              [20, 21, 22, 23]])
```

* The number of dimensions or axes of the array is given by `a.ndim` and in this case returns `2`
* The shape of the array, i.e. the size of each dimension is given by `a.shape`, which returns a tuple `(3, 4)`
* The size of the array, i.e. the total number of elements in the array is given by `a.size`, which returns `12`
* The datatype of each element is given by `a.dtype`, which typically returns `int64` (the default integer type can depend on your system)

### Example 6 

Print the number of dimensions, shape and size of `my_2d_array` from above.

In [None]:
# FIXME

<details><summary {style="color:green; font-weight:bold"}>Click here to see the solution to Example 6. </summary>

```python
print(f"dimension: {my_2d_array.ndim}")
print(f"shape: {my_2d_array.shape}")
print(f"size: {my_2d_array.size}")
```

**Note** how in the example above, the shape of the matrix is defined as ```(rows, columns)``` - the number of *rows* and then *columns*. 

The output of `shape` is written in round brackets, i.e. it is a *tuple*, and a tuple cannot be modified in place.


### Example 7

Let's try to create an array filled with predefined values and check its properties.


We can use `np.ones` to fill it with ones, or `np.zeros` to fill up an array with zeros. If we want to use a specific value to fill an array with, we can use the function `np.full`.

Generate an array of shape `(4, 5)` filled with the number `1.234`.

In [None]:
# FIXME

<details><summary {style="color:green; font-weight:bold"}> Click here to see solution to Example 7. </summary>

```python 

# Generate an array of shape (4, 5) filled with the value 1.234
f = np.full((4, 5), 1.234)

# Check its properties
print(f"Dimensions {f.ndim}")
print(f"shape {f.shape}")
print(f"Size {f.size}")

```

## 4.2 Slicing 2D arrays <a id="42-slicing-2d-arrays"></a>

We can access data in a multidimensional array by slicing it, in a similar way to 1D arrays:

<img src="images/slicing2.png" width="600">
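The indexing pattern for 2D arrays is `array[rows, columns]`, where each part can be a single index or a `start:stop:step` slice. A minimal sketch on a small illustrative array (not the one from the examples below):

```python
import numpy as np

grid = np.array([[0, 1, 2, 3],
                 [10, 11, 12, 13],
                 [20, 21, 22, 23]])

second_row = grid[1, :]          # row index 1, all columns
third_column = grid[:, 2]        # all rows, column index 2
top_left_block = grid[0:2, 0:2]  # first two rows and first two columns
single_value = grid[2, 3]        # one element: row index 2, column index 3

print(second_row)
print(third_column)
print(top_left_block)
print(single_value)
```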

### Example 8

Create an array of shape `(5, 7)` filled with random integers.

We can again use `np.random.default_rng(seed)` to create a random number generator and `generator.integers(low, high, size)` to generate an array filled with random integers.

In [None]:
# FIXME


<details><summary {style="color:green; font-weight:bold"}> Click here to see the solution to Example 8. </summary>

```python

number_generator = np.random.default_rng(12345)
random_big_array = number_generator.integers(low=1, high=50, size=(5, 7))

print(random_big_array)
```

### Example 9

Use slicing on `random_big_array` to select:

* the first column
* the last column
* the 4th row
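* a rectangular area, e.g. the first two rows of the last four columns
* regularly spaced samples, e.g. every 2nd row and every 3rd column within a region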

In [None]:
# FIXME

<details><summary {style="color:green; font-weight:bold"}> Click here to see the solution to Example 9. </summary>

```python
print(f"first column {random_big_array[:, 0]}")
print(f"last column {random_big_array[:, -1]}")
print(f"4th row {random_big_array[3, :]}")
print(f"selected area {random_big_array[0:2, 3:7]}")
print(f"samples {random_big_array[1:5:2, 3:10:3]}")

```

# Tasks 2 <a id="tasks-2"></a>

Here we will load some data from a file, do some maths with the data, use slicing to sub-sample the data and visualise the data with a simple plot.


## Loading an array to/from a file <a id="loading-an-array-to-from-a-file"></a>

Just as you have loaded data with `pandas` before, NumPy can load arrays from a plain text file. 


The function `np.loadtxt` offers many options for loading a file.

To load a file `array.txt`: 

```python

loaded_array = np.loadtxt("array.txt")

```

We can skip some lines, for example in the case where the file has a header over the first 5 lines of the file, using the option `skiprows`. 

Similarly, if the file contains comments, we can use the option `comments` to specify the character used for comments, so that these lines also get ignored by python. 
```python
clean_array = np.loadtxt("array.txt", comments="#", skiprows=5)
```
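To see `skiprows` and `comments` in action without needing a file on disk, the sketch below feeds `np.loadtxt` an in-memory text buffer (`io.StringIO`) that stands in for a file. The file contents are made up for illustration.

```python
import io
import numpy as np

# Made-up file contents: one header line, one comment line, three data rows
file_contents = """m/z intensity
# a comment line that should be ignored
1.0 10.0
2.0 20.0
3.0 30.0
"""

# io.StringIO behaves like an open text file
text_file = io.StringIO(file_contents)

# Skip the header line and ignore lines starting with '#'
loaded = np.loadtxt(text_file, comments="#", skiprows=1)

print(loaded.shape)
print(loaded)
```

With a real file, you would pass the file name as the first argument instead of `text_file`.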

To save the array called `my_array` to a file, use `np.savetxt`:

```python
np.savetxt("my_array.txt", my_array)
```

<div class="alert alert-success"><b>Task 2.1: Load data to and from a file with arrays</b>

1. Load in the file `data/slice_me.txt` and skip the first row. (The `data/` part specifies the folder in which the file is.)
2. Print the shape of this data
3. Save this to another file called `data/slice_me_copy.txt`

</div>

In [None]:
# 1. Load in the file data/slice_me.txt and skip the first row.
# FIXME

# 2. Print the shape of this data
# FIXME

# 3. Save this to another file called data/slice_me_copy.txt
# FIXME


<details><summary {style="color:green; font-weight:bold"}> Click here to see the solution to Task 2.1</summary>

```python

# 1. Load in the file data/slice_me.txt and skip the first row.
data = np.loadtxt("data/slice_me.txt", skiprows=1)

# 2. Print the shape of this data
print(data.shape)

# 3. Save this to another file called data/slice_me_copy.txt
np.savetxt("data/slice_me_copy.txt", data)

```

# 5. Mathematical operations on multidimensional arrays <a id="5-mathematical-operations-on-multidimensional-arrays"></a>

All mathematical operations between arrays act element by element, similarly to 1D arrays. 

With 2D arrays, we can also choose an axis of operation. 

For example, consider the array

```python
my_list = [[0, 1, 2, 3],
           [10, 11, 12, 13],
           [20, 21, 22, 23]]
my_array = np.array(my_list)
```

### Getting sums from arrays

We can use the function `np.sum()` to get the sum of all elements in our array with:
```python
total_sum = np.sum(my_array) 
print(total_sum)
```

This prints `138`.

What if we want to add up the values down each column, i.e. sum along the rows? We set `axis=0`:

```python
column_sums = np.sum(my_array, axis=0)
print(column_sums)
```

This prints `[30 33 36 39]`, one value per column.

Similarly, to add up the values across each row, i.e. sum along the columns, we set `axis=1`:

```python
row_sums = np.sum(my_array, axis=1)
print(row_sums)
```

This prints `[ 6 46 86]`, one value per row.
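One way to keep the `axis` argument straight is to look at the shape of the result: the axis you sum over is the one that disappears. A minimal sketch using the same array (the variable names `sum_axis_0` and `sum_axis_1` are illustrative):

```python
import numpy as np

my_array = np.array([[0, 1, 2, 3],
                     [10, 11, 12, 13],
                     [20, 21, 22, 23]])

sum_axis_0 = np.sum(my_array, axis=0)  # collapses the 3 rows
sum_axis_1 = np.sum(my_array, axis=1)  # collapses the 4 columns

print(my_array.shape)    # shape of the original array
print(sum_axis_0.shape)  # one value per column
print(sum_axis_1.shape)  # one value per row
```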

<div class="alert alert-success"><b> Task 2.2: Sum of array elements</b>

1. Calculate the sum of all the elements in the file `data/slice_me_copy.txt` that you created in the previous task.
2. Calculate the "vertical sum", i.e. the sum along the rows (one value per column).
3. Calculate the "horizontal sum", i.e. the sum along the columns (one value per row).

</div>

In [None]:
# 1. Calculate the sum of all the elements in the file `data/slice_me_copy.txt` that you created in the previous task.
# FIXME

# 2. Calculate the "vertical sum", i.e. the sum along the rows.
# FIXME

# 3. Calculate the "horizontal sum", i.e. the sum along the columns.
# FIXME

<details><summary {style="color:green; font-weight:bold"}> Click here to see the solution to Task 2.2</summary>

```python
array = np.loadtxt("data/slice_me_copy.txt")

# 1. Calculate the sum of all the elements in the file `data/slice_me_copy.txt` that you created in the previous task.
total_sum = np.sum(array)
print(f"total sum {total_sum}")

# 2. Calculate the "vertical sum", i.e. the sum along the rows.
vertical_sum = np.sum(array, axis=0)
print(f"vertical sum {vertical_sum}")

# 3. Calculate the "horizontal sum", i.e. the sum along the columns.
horizontal_sum = np.sum(array, axis=1)
print(f"horizontal sum {horizontal_sum}")
```

<div class="alert alert-success"><b> Task 2.3: Slicing data arrays</b> <a id="task-23"></a>

The folder `data` contains a file called `ms.txt`, which contains mass spectrometry data given in two columns: m/z and intensity.

1. Read in the file `ms.txt`
2. Create a sub-sample of the intensities data by extracting every 10th line into a variable called `subdata`.
3. Save `subdata` into a new file called `data/sub_intensities.txt`.

</div>

**Note: it might be a good idea to print the shapes of `data` and `subdata` to check if your slicing is correct after step 2.**

In [None]:
# 1. Read in the file ms.txt
# FIXME

# 2. Create a sub-sample of the data by extracting every 10th line into a variable called `subdata`.
# FIXME

# 3. Save the intensities column from `subdata` into a new file.
# FIXME


<details><summary {style="color:green; font-weight:bold"}> Click here to see the solution to Task 2.3 </summary>

```python
# 1. Read in the file ms.txt
data = np.loadtxt("data/ms.txt")

# 2. Create a sub-sample of the data by extracting every 10th line into a variable called `subdata`.
subdata = data[::10, 1]

# Check the shapes of the datasets
print(data.shape)
print(subdata.shape)

# 3. Save the intensities column from `subdata` into a new file.
np.savetxt("data/sub_intensities.txt", subdata)

```

<div class="alert alert-warning"><b> Advanced task 2.4</b>

Can you do the above without numpy, only using in-built python functionality?

</div>

In [None]:
# FIXME


<details><summary {style="color:green; font-weight:bold"}> Click here to see the solution to the Advanced task 2.4 </summary>

```python

# Read file in line by line
with open("data/ms.txt", "r") as input_file:
    lines = input_file.readlines()

# Counter for counting every 10th line
counter = 0

# Create an empty list to store intensity values
intensities = []

# Loop over the lines in the file
for line in lines:

    # If counter is divisible by 10
    if counter % 10 == 0:
        # split the line (string) into two columns:
        columns = line.split()

        # the second column is intensity
        intensity = columns[1]

        # append intensity value to intensities list
        intensities.append(intensity)

    # increment the counter
    counter += 1

# Open file for writing:
with open("data/sub_densities.txt", "w") as output_file:
    # Loop over all the values in the list intensities
    for intensity in intensities:
        # Write each intensity to the file on separate lines
        output_file.write(f"{intensity} \n")

```


<div class="alert alert-warning"><b> Advanced task 2.5</b>

Using the mass spectrometry data, find the m/z values in the region between 6400 and 6600.

Also find the maximum peak value in this region and the corresponding m/z value.

**Hint: You will need to use Boolean indexing.** This was covered in [Unit 03 Part II](../Unit_03/Unit_03_loops_II.ipynb)
</div>

In [None]:
# FIXME


<details><summary {style="color:green; font-weight:bold"}> Click here to see the solution to Advanced task 2.5</summary>

```python
# Load in data
data = np.loadtxt("data/ms.txt")

# Create criterion
greater_than = data[:,0] > 6400
less_than = data[:, 0] < 6600
criterion = greater_than & less_than

# slice the array
sliced_array = data[criterion, :]

# Get the maximum peak value
maximum_value = np.max(sliced_array[:, 1])
index_of_max = np.argmax(sliced_array[:, 1])
mz_at_max = sliced_array[index_of_max, 0]

print(f"peak {maximum_value} is at m/z {mz_at_max}")
```

# 6. Plotting data <a id="6-plotting-data"></a>

We can use [matplotlib](http://matplotlib.org) to plot data in python.  

We first look at the `pyplot` functional interface, which allows us to manipulate a given current figure. 

`pyplot` is great for quickly visualizing the data we are working with, but it becomes awkward for plots of multiple data quantities, subplots, or more complex customizations. 

In these cases, the *object-oriented interface* is the better choice. We will discuss object-oriented plotting below. If you are eager to know more, please see the discussion on [PyPlot vs. Object Oriented Interfaces](https://matplotlib.org/matplotblog/posts/pyplot-vs-object-oriented-interface/) on the matplotlib blog.
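As a brief preview of the object-oriented style, here is a minimal sketch that draws two subplots side by side. The data (a sine and a cosine curve) and the names `left_axes` and `right_axes` are chosen purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)

# plt.subplots returns a Figure and, here, two Axes objects
fig, (left_axes, right_axes) = plt.subplots(1, 2, figsize=(8, 3))

# Each Axes object is plotted on and labelled separately
left_axes.plot(x, np.sin(x))
left_axes.set_title("sin(x)")
left_axes.set_xlabel("x")

right_axes.plot(x, np.cos(x))
right_axes.set_title("cos(x)")
right_axes.set_xlabel("x")

fig.tight_layout()
plt.show()
```

Because every call names the axes it acts on, there is no ambiguity about which subplot is being modified.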


### Example 10 

As always, we begin with **importing the `matplotlib.pyplot` module** with the alias `plt`. 

This is the community-agreed alias for `matplotlib.pyplot`.

In [None]:
# FIXME 


# Keep this, it's needed for jupyter notebooks
%matplotlib inline  


<details><summary {style="color:green; font-weight:bold"}> Click here to see the solution to Example 10.</summary>

```python
import matplotlib.pyplot as plt

%matplotlib inline
```

### Example 11

To create a plot, we use the `matplotlib` function `plt.plot()`. 

Load in the file `data/sub_intensities.txt` that you created in [Task 2.3](#task-23) and plot it.

It's good practice to use `plt.show()` to show the plot, even though the plot will pop up in Jupyter without this as well.

In [None]:
# FIXME


<details><summary {style="color:green; font-weight:bold"}> Click here to see the solution to Example 11.</summary>

```python

# Read the file
data = np.loadtxt("data/sub_intensities.txt")

# Plot 
plt.plot(data)
plt.show()
```


**Note** that the displayed plot is generated from the sub-sampled data, which only has intensities.<br>
This data does not have the m/z column, so the x-axis is just the row number.


### Labeling the plot and the data <a class="anchor" id="labelplt"></a>

It is always good practice to **label the plots**.

Use the following commands to add the labels to your plot:
 - `xlabel()`
 - `ylabel()` 
 - `title()`

<div class="alert alert-success">
    <b>Task 2.6</b> : Plot the <code>ms.txt</code> data as m/z vs Intensity, label the plot.
        
</div>


In [None]:
# FIXME


<details><summary {style="color:green; font-weight:bold"}> Click here to see the solution to Task 2.6.</summary>

```python
# Load in the data
data = np.loadtxt("data/ms.txt")

# Assign the columns to 'mz' and 'intensity'
mz = data[:,0]
intensity = data[:,1]

# plot mz against intensity
plt.plot(mz, intensity)

# label the plot
plt.title("Mass spectrometry")
plt.xlabel("m/z")
plt.ylabel("Intensity")

# save the plot
plt.savefig("images/myfigure.png")

# show the plot
plt.show()
```

## 6.1 Quick aside on string formatting <a id="61-quick-aside-on-string-formatting"></a>

We can use f-strings to format strings in a nice way. This is useful for e.g. labelling scientific plots.

For example, let's say we want to create a plot label for pressure as "Pressure ($\mathrm{N / m}^2$)" in python:

```python
plt.plot(x, y)
plt.xlabel(f"pressure (N / m$^2$)")
```

We can do this using LaTeX notation given inside the `$ $` signs. 

[Click here](https://oeis.org/wiki/List_of_LaTeX_mathematical_symbols) for a list of some of the mathematical symbols you can write in this format. 

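Some of the most useful ones for chemists are **superscripts** `$^{-2}$` and **subscripts** `$_{\mathrm{exp}}$`. The expression `\mathrm{}` stands for "maths roman", which ensures the text inside it is written in non-italic. Note that inside an f-string, literal curly braces must be doubled (`{{` and `}}`), because single braces mark a value to be inserted.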

You can use this "math mode" in markdown cells in a similar way to write equations. 
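Curly braces have a special meaning in f-strings, so LaTeX braces need care. The sketch below (with illustrative label text) shows two ways to get the same label: a plain raw string, and a raw f-string in which the LaTeX braces are doubled while `{unit}` is still replaced by a value.

```python
unit = "N"

# Plain raw string: braces and backslashes are kept exactly as typed
plain_label = r"Pressure (N m$^{-2}$)"

# Raw f-string: {unit} is replaced, doubled braces become single braces
formatted_label = rf"Pressure ({unit} m$^{{-2}}$)"

# The same idea with a subscript written in roman (non-italic) type
subscript_label = rf"$k_{{\mathrm{{exp}}}}$ ({unit})"

print(plain_label)
print(formatted_label)
print(subscript_label)
```

Any of these strings can be passed to `plt.xlabel()`, `plt.ylabel()` or `plt.title()`.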

Another useful feature of f-strings is controlling how many decimal places of a value are shown. For example, let's say we want to print the mass of something with 2 decimal places:

```python
mass = 0.198 # in g
print(f"The final mass is {mass:.2f} g.")
```

which prints: `The final mass is 0.20 g.`
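The `.2f` specifier fixes the number of digits after the decimal point. If you want to control the number of *significant figures* instead, the `g` and `e` specifiers are worth knowing. A short sketch with illustrative values:

```python
mass = 0.198          # in g
large_value = 1234.5678

two_decimals = f"{mass:.2f}"            # 2 digits after the decimal point
two_sig_figs = f"{mass:.2g}"            # 2 significant figures
scientific = f"{large_value:.3e}"       # scientific notation, 3 decimals
three_sig_figs = f"{large_value:.3g}"   # 3 significant figures

print(two_decimals)
print(two_sig_figs)
print(scientific)
print(three_sig_figs)
```

Note that `g` drops trailing zeros and switches to scientific notation for large or very small values.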


# 7. Errors: a discussion. <a id="7-errors-a-discussion"></a>

<div class="alert alert-info"><b>In groups, discuss errors in scientific experiments and data handling</b>
    
Here are some questions to help you get started:

- What kinds of errors are often found in scientific experiments?

- Are there any less obvious sources that may go unnoticed?

- What are the sources of uncertainty? 

- How can we mitigate the errors?

- What about the code we write? Can we make it more reproducible, minimising human error? 

- What are the differences between *random error*, *systematic error* and *mistakes*?

- How does repeating measurements reduce (or not?) the effect of the above errors? 

- What is the difference between *accuracy* and *precision*?

- Can you give examples of situations where accuracy is important and where it is not?

- Why are repeat measurements important for characterising accuracy? What about precision?
    
</div>




## 7.1 - Sources of Errors and Uncertainties <a id="71-sources-of-errors-and-uncertainties"></a>

 - **Random Error:** 
     - Noise in the experimental data 
     - Some scatter of the values 
     - Repeated careful experiments can reduce this error
     - Statistical tools are used to deal with these errors
     - Not present in calculations: a calculation returns the same output value every time (within the numerical precision)

      



 - **Systematic Error:**
     - Values are systematically shifted by a fixed amount or percentage
     - Must be handled at the source, for example by recalibration of equipment
     - May be accounted for during data processing if identified, for example by shifting all weights by 15 g
     - Calculations are rarely exact, and so are subject to this error through any approximations that are used
     

     
     
     
 - **Mistakes:**
     - Mainly human, may be in the equipment or in the code
     - Must identify, redo/debug
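
The difference between random and systematic error can be illustrated with simulated measurements. The sketch below assumes a hypothetical "true" weight, a hypothetical noise level and a hypothetical calibration offset; none of these numbers come from a real experiment.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed so the run is repeatable

true_value = 10.0   # hypothetical "true" weight in g
noise = 0.5         # spread of the random error in g
offset = 0.3        # systematic shift in g, e.g. a miscalibrated balance

n = 10_000
random_only = true_value + rng.normal(0.0, noise, size=n)
with_offset = random_only + offset

# Averaging many repeats shrinks the effect of the random error ...
print(f"mean, random error only:     {random_only.mean():.3f} g")
# ... but the systematic shift survives the averaging
print(f"mean, with systematic shift: {with_offset.mean():.3f} g")
```

Repeating the measurement brings the first mean close to the true value, while the second mean stays shifted by the offset no matter how many repeats are taken.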

## 7.2 - Accuracy vs Precision <a id="72-accuracy-vs-precision"></a>

<img src="images/DartPic2.png" width="500">


 - High **precision** = a low spread of results (low random error)
 - High **accuracy** = the average result is close to the “true” answer (low systematic error)

High precision and high accuracy are always desirable, but not always essential.

<br>
We can only assess accuracy and precision from multiple data points!


# 8. Introduction to Statistics    <a id="8-introduction-to-statistics"></a>

First, let's return to Plotting! 

[Previously](#6-plotting-data), we made only very basic plots with `pyplot`.

In this section, we will need slightly more complex plots, so we will switch to object-oriented plotting with [matplotlib](http://matplotlib.org).

We have also put together a summary, a cheatsheet and an example in this [Reference document](Plotting.ipynb).



### Object Oriented Plotting <a class="anchor" id="matplotOO"></a>

This gives us control over many parameters, as illustrated here:

<img src="images/anatomy-of-a-figure.webp" width="600">

To achieve this, we start by declaring an *object*, which is a container for all elements (shown in <span style="color:blue"> *blue* </span>) that are rendered onto it, i.e. our **figure**.

### 1. Declare a figure *object*:

```python
fig, ax = plt.subplots()
```

Here we have only 1 axes, but we can have many: 
    
```python
# an empty figure with no Axes
fig = plt.figure()  
# a figure with a single Axes
fig, ax = plt.subplots()  
# a figure with a 2x2 grid of Axes
fig, axs = plt.subplots(2, 2)  
```

### 2. Add the data onto the axes of the plot with:  

```python
ax.plot(time, distance)
```

We can also include labels, markers, colors:
    
```python
# Plot some data on the axes
ax.plot(x, x, label="linear")  
# Plot more data on the axes...
ax.plot(x, x**2, "x", label="quadratic")  
# ... and some more:
ax.plot(x, x**3, label="cubic", color="orange")
```


### 3. Add other elements, such as labels:

```python
# Add a y-label to the axes.
ax.set_ylabel("Distance (m)")
# Add an x-label to the axes. 
ax.set_xlabel("Time (s)")
# Add a title to the axes.  
ax.set_title("My plot")  
# Add a legend.
ax.legend()  
```


### 4. Adjust figure size and resolution:  

```python
fig.set_size_inches(6,4)
fig.set_dpi(200)
```


### 5. To finish the figure, render it together:

```python
plt.show()
```

It is best to try with an example below: 
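
As a first illustration, the five steps above can be put together in a single cell. This is a sketch with made-up data (`x` is simply 50 evenly spaced values), not one of the data sets used in this unit.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: 50 points between 0 and 2
x = np.linspace(0, 2, 50)

# 1. Declare the figure and a single axes
fig, ax = plt.subplots()

# 2. Add the data (a format string such as "x" must come before keyword arguments)
ax.plot(x, x, label="linear")
ax.plot(x, x**2, "x", label="quadratic")
ax.plot(x, x**3, color="orange", label="cubic")

# 3. Add labels, a title and a legend
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("My plot")
ax.legend()

# 4. Adjust size and resolution
fig.set_size_inches(6, 4)
fig.set_dpi(200)

# 5. Render the figure
plt.show()
```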


## 8.1 Statistical distributions <a id="81-statistical-distributions"></a>



### Example 12 
    
<div class="alert alert-warning">
A set of 50 samples was weighed in the lab, returning the following results:
</div>



| Sample No. | Weight, g | | Sample No. | Weight, g |
| ---- | ----- | --- | ---- | ----- |
| 1 | 12.7867 | | 26 | 13.060 |
| 2 | 11.2558 | | 27 | 12.67 |
| 3 | 11.8226 | | 28 | 9.284 |
| 4 | 14.2157 | | 29 | 11.32 |
| 5 | 11.9955 | | 30 | 12.57 |
| 6 | 12.753 | | 31 | 11.909 |
| 7 | 10.604 | | 32 | 12.055 |
| 8 | 12.7267 | | 33 | 11.98 |
| 9 | 11.3204 | | 34 | 11.48 |
| 10 | 11.3616 | | 35 | 10.99 |
| 11 | 12.1384 | | 36 | 11.79 |
| 12 | 12.301 | | 37 | 11.357 |
| 13 | 11.032 | | 38 | 10.196 |
| 14 | 10.8086 | | 39 | 12.16 |
| 15 | 13.58 | | 40 | 11.01 |
| 16 | 12.59 | | 41 | 12.33 |
| 17 | 11.93 | | 42 | 12.14 |
| 18 | 12.41 | | 43 | 11.711 |
| 19 | 12.426 | | 44 | 12.373 |
| 20 | 10.435 | | 45 | 13.26 |
| 21 | 10.39 | | 46 | 11.26 |
| 22 | 12.89 | | 47 | 12.79 |
| 23 | 11.49 | | 48 | 12.11 |
| 24 | 12.45 | | 49 | 11.831 |
| 25 | 12.022 | | 50 | 10.810 |


The data is stored in a file `data/Weights.txt` and may have a header! 
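
How a header is handled depends on how it is written. The sketch below uses `io.StringIO` as a stand-in for a file on disk; the two-column layout and the header text are assumptions for illustration, and the real `data/Weights.txt` may differ.

```python
import io
import numpy as np

# Stand-in for a file with a plain (uncommented) header line
text = """Sample Weight
1 12.7867
2 11.2558
3 11.8226
"""

# skiprows=1 skips the first line before parsing the numbers
data_skipped = np.loadtxt(io.StringIO(text), skiprows=1)
print(data_skipped.shape)

# Stand-in for a file whose header starts with the comment character
commented = "# Sample Weight\n1 12.7867\n2 11.2558\n"

# Lines starting with "#" are ignored, so no skiprows is needed
data_commented = np.loadtxt(io.StringIO(commented), comments="#")
print(data_commented.shape)
```

Opening the file in a text editor first is the quickest way to see which of the two cases applies.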

Let's load the data, plot it and get some statistics!



In [None]:
# FIXME

<details><summary style="color:green; font-weight:bold"> Click here to see the solution to Example 12. </summary>

```python
import numpy as np
import matplotlib.pyplot as plt

# Load data (lines starting with "#" are treated as comments and skipped)
data = np.loadtxt("data/Weights.txt", comments="#")

# Initialise the figure object
fig, ax = plt.subplots()

# Add data and labels
ax.plot(data[:,0], data[:,1], "X", color="red")
ax.set_xlabel("Sample No.")
ax.set_ylabel("Weight (g)")

# Show plot
plt.show()
```

</details>


## 8.2 - Distribution of measurements 


### Example 13

If we measure a value many times, we get a distribution of results, which can be visualised as a **histogram**.

*A **histogram** shows how many measurements fall into each interval (bin). Its shape is characteristic of the underlying statistical (random) process.*

Here, we look at the histogram for a **population** of 50 measurements.

We can get a histogram using the [```numpy.histogram(a, bins=10)```](https://numpy.org/doc/stable/reference/generated/numpy.histogram.html#numpy.histogram) function.

How many bins do you think are needed? Try it!
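
Before trying it on the weights, it may help to see what `np.histogram` returns. The sketch below uses made-up, normally distributed values (not the lab data) and two arbitrary bin numbers.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
values = rng.normal(loc=12.0, scale=1.0, size=50)  # made-up measurements

for n_bins in (5, 15):
    counts, edges = np.histogram(values, bins=n_bins)
    # There is always one more bin edge than there are bins,
    # and the counts add up to the number of measurements
    print(n_bins, len(counts), len(edges), counts.sum())
```

The second array holds the bin edges, not the bin centres, which is why it is one element longer than the array of counts.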

In [None]:
# FIXME

<details><summary style="color:green; font-weight:bold"> Click here to see the solution to Example 13.</summary>

```python
# Take the weights column and sort the values into 15 bins
w = data[:,1]
counts, bins = np.histogram(w, bins=15)
print(counts, bins)
```

</details>

### Example 14

We can now plot it, using ```ax.stairs(counts, bins)```:

In [None]:
# FIXME

<details><summary style="color:green; font-weight:bold"> Click here to see the solution to Example 14 </summary>

```python
fig, ax = plt.subplots()

# Bin edges (weights) go on the x-axis, counts on the y-axis
ax.stairs(counts, bins)
ax.set_xlabel("Weight (g)")
ax.set_ylabel("Count")

plt.show()
```

</details>

### Example 15

Alternatively, we can use the function ```plt.hist(a, bins=10)```:

In [None]:
# FIXME


<details><summary style="color:green; font-weight:bold"> Click here to see the solution to Example 15</summary>

```python
fig, ax = plt.subplots()

# hist bins the data and draws the bars in one step
ax.hist(w, bins=15)
ax.set_xlabel("Weight (g)")
ax.set_ylabel("Count")

plt.show()
```

</details>

## 8.3 Normalising the data <a id="83-normalising-the-data"></a> 

In the examples above we created a **histogram** with 15 bins (the default is 10).

If we change the number of bins, the distribution changes.

If we add more samples, it changes again. It is difficult to compare two datasets of different sizes. 

Therefore, we should express the data as a **probability distribution** instead of just a sample count.

We can do this by **normalising** the data:

\begin{equation}
x_{\mathrm{norm}} = \dfrac{x-x_{\mathrm{min}}}{x_{\mathrm{max}}-x_{\mathrm{min}}},
\end{equation}

where $x$ is the value of the sample being normalised, while $x_{\mathrm{max}}$ and $x_{\mathrm{min}}$ are the maximum and minimum values, respectively.

We can do this in Python by writing a function:

```python
def normalise(data):
    """Rescale the values in data to the range 0 to 1."""
    data = np.asarray(data, dtype=float)
    min_value = np.min(data)
    max_value = np.max(data)
    # Return a new array so that the original data is left unchanged
    return (data - min_value) / (max_value - min_value)

# To have the data in percentages, multiply by 100:
n_ints = normalise(data[:, 1]) * 100
```

Or, you can also use ```np.histogram(w, bins=15, density=True)``` to obtain a probability density, i.e. a normalised histogram.
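
With `density=True`, the histogram is scaled so that the total area of the bars is 1, which is what makes histograms from datasets of different sizes comparable. The check below is a sketch on made-up values rather than the lab data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
values = rng.normal(loc=12.0, scale=1.0, size=50)  # made-up measurements

density, edges = np.histogram(values, bins=15, density=True)
widths = np.diff(edges)  # width of each bin

# With density=True the bar areas (height x width) sum to 1
area = np.sum(density * widths)
print(f"Total area under the histogram: {area:.3f}")
```

Note that it is the area, not the sum of the bar heights, that equals 1, so individual heights can be larger than 1 when the bins are narrow.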


* What does this histogram tell us about the data?
* How do **random** and **systematic errors** show up in histograms like this one? 

This is another way to show the **accuracy** vs **precision** we saw on the 'dart board':

<img src="images/Accuracy_Precision.png" width="500">


## 8.4 Quantifying Uncertainty <a id="83-quantifying-uncertainty"></a> 

Let's analyse this data a bit more to quantify the **uncertainties**.

We first represent data as a **normal distribution** of the population. The normal distribution, or Gaussian distribution, is a distribution centered around the **mean value** and having a spread of the **standard deviation**. 

### The mean, $\mu$ <a class="anchor" id="mean"></a> 

\begin{equation}
\mu = \frac{1}{N} \sum_{i=1}^N x_i ,
\end{equation}

where $N$ is the number of samples. As $N$ increases, the mean gets closer to the 'true' value. This is known as the [law of large numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers). 


```python
mu = np.sum(a) / len(a)
```

or, we can just use the NumPy function `np.mean(a)`.
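
The law of large numbers can be seen numerically by drawing samples of increasing size. The sketch below assumes a hypothetical "true" value and normally distributed noise; the sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
true_mean = 12.0  # hypothetical "true" value

for n in (10, 1_000, 100_000):
    sample = rng.normal(loc=true_mean, scale=1.0, size=n)
    mu = np.mean(sample)
    print(f"N = {n:>7}: mean = {mu:.4f}, error = {abs(mu - true_mean):.4f}")
```

The error does not have to shrink at every single step, but it tends towards zero as $N$ grows.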


_Note:_ The **median** is the middle value separating the greater and lesser halves of a data set. Since the normal distribution is symmetric, its mean and median are equal. 



### The standard deviation (STD), $\sigma$<a class="anchor" id="STD"></a> 
The STD quantifies how much the numbers in our set deviate from the mean, $\mu$

\begin{equation}
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N(x_i-\mu)^2}.
\end{equation}

We can compute this in Python:

```python
import math as m
sigma = m.sqrt(np.sum((a - np.mean(a))**2) / len(a))
```

or, we can just use the NumPy function `np.std(a)`.
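
By default, `np.std` divides by $N$, matching the formula above. When the data are a sample drawn from a larger population, it is common to divide by $N-1$ instead, which `np.std` does when `ddof=1` is passed. A small sketch with made-up numbers:

```python
import numpy as np

a = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up values

sigma_manual = np.sqrt(np.sum((a - np.mean(a))**2) / len(a))
sigma_population = np.std(a)       # divides by N (default, ddof=0)
sigma_sample = np.std(a, ddof=1)   # divides by N - 1

print(sigma_manual, sigma_population, sigma_sample)
```

The two versions differ noticeably for small data sets and become nearly identical as $N$ grows.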

In a **normal distribution**, the values that are less than 1 $\sigma$ away from the mean, $\mu$, account for 68.27% of the set. This range is our **confidence interval**. The probability density of the normal distribution is

\begin{equation}
f(x) = \frac{1}{\sigma \sqrt{2 \pi} }  \exp\left(\frac{-(x-\mu)^2}{2\sigma^2} \right).
\end{equation}



<img src="images/NormalDist.png" width="500">
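
The 68.27% figure can be checked numerically by writing the density as a function and integrating it between $\mu - \sigma$ and $\mu + \sigma$. In the sketch below, the function name `gaussian` and the values of `mu` and `sigma` are illustrative choices, and the integral uses a simple trapezoid rule.

```python
import numpy as np

def gaussian(x, mu, sigma):
    """Normal probability density with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

mu, sigma = 12.0, 1.0  # illustrative values

# Evaluate the density on a fine grid between mu - sigma and mu + sigma
x = np.linspace(mu - sigma, mu + sigma, 2001)
y = gaussian(x, mu, sigma)

# Trapezoid rule: average of neighbouring heights times the grid spacing
fraction = np.sum((y[:-1] + y[1:]) / 2 * np.diff(x))

print(f"Fraction within one sigma: {fraction:.4f}")
```

Changing `mu` or `sigma` should leave the fraction unchanged, since it depends only on the width of the interval measured in units of $\sigma$.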



<div class="alert alert-warning"><b>Exercise 2.7: 
Analyse the data of weights of 50 samples given above.</b>

Find the lightest and the heaviest samples, calculate the mean and standard deviation.

Plot the normal distribution for this data:

Create a plot that shows:
- a normalised histogram, shaded with some transparency 
- lines for the mean and the median (are they the same?)
- the normalised probability distribution
- axis labels and a legend

**Hint**

You can use [`scipy.stats` python package](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html#scipy.stats.norm) to plot the **normal probability distribution** of our data.

```python
stats.norm.pdf(a, loc, scale)
```
where `loc` specifies the mean and `scale` specifies the standard deviation.
</div>
 

In [None]:
# FIXME

<details><summary style="color:green; font-weight:bold"> Click here to see the solution to Exercise 2.7</summary>

```python
from scipy import stats

# Smallest value and its index
print(f"Lightest sample weight {np.min(w)} g and the sample no. is {np.argmin(w) + 1}")

# Biggest value and its index
print(f"Heaviest sample weight {np.max(w)} g and the sample no. is {np.argmax(w) + 1}")
# Note we add +1 to the output of argmin/argmax, as they begin counting at 0 

# Mean and standard deviation
print(f"The mean value is {np.mean(w)}")
print(f"The standard deviation is {np.std(w)}")

# Calculate the probability distribution function (pdf) at each x
pdf = stats.norm.pdf(w, loc=np.mean(w), scale=np.std(w))

# Initialise the figure object 
fig, ax = plt.subplots(1, 1)

# Add a normalised histogram
ax.hist(w, density=True, bins=10, color="lime", alpha=0.2, label="histogram")

# Add mean and a median as a line
ax.axvline(np.mean(w), color="darkorange", label="mean")
ax.axvline(np.median(w), color="magenta", label="median")

# Add a PDF
ax.plot(w, pdf, ".", label="normal distribution")

# Add labels
ax.set_xlabel("Weight (g)")
ax.set_ylabel("Probability density, p(w)")

# Add the legend
ax.legend()
plt.show()
```

</details>


<div class="alert alert-success"><b>Task 2.8: Analyse the kinetic data for a reaction at 250 and 300 K given below.</b>
    
Plot the relative likelihood that a particular value of the rate constant, $K$, would be measured.
    
Produce a histogram for the data.    
</div>

In [None]:
# Here are some rates, K, at a T:
K_250 = np.array([2.567111, 2.562323, 2.61557, 2.4366565, 2.495657, 2.516454, 3.671456])
K_300 = np.array([2.5700804, 2.5660756, 2.6201404, 2.437922,  2.4999964, 2.5190192, 3.6754052])

# FIXME

<details><summary style="color:green; font-weight:bold"> Click here to see the solution to Task 2.8.</summary>

```python

# Print the data
print(f"250 K mean = {np.mean(K_250):.3f}, std = {np.std(K_250):.3f}")
print(f"300 K mean = {np.mean(K_300):.3f}, std = {np.std(K_300):.3f}")

# Generate 100 linearly spaced x values 
# Start a bit before and finish a bit after the min and max of K_250
start = np.min(K_250) - 0.5
finish = np.max(K_250) + 0.5
x = np.linspace(start, finish, 100)

# Calculate the probability distribution at each x
y = stats.norm.pdf(x, loc=np.mean(K_250), scale=np.std(K_250))

# Plot
plt.plot(x, y, ".")
plt.xlabel("Rate constant, K")
plt.ylabel("Probability density, p(K)")
plt.show()


normal_distribution = stats.norm(loc=np.mean(K_250), scale=np.std(K_250))
values = normal_distribution.rvs(5000)

# Plot a histogram
plt.hist(values, density=True, bins=50, alpha=0.5)

# Use min and max of random numbers to create a range
x = np.linspace(values.min(), values.max(), 100)

# Plot the probability in that range
plt.plot(x, normal_distribution.pdf(x))
plt.xlabel("Rate constant, K")
plt.ylabel("Probability density, p(K)")
plt.show()
```

</details>

# Recap <a id="recap"></a>

You should now be able to use a **collection of methods** within NumPy to process and analyse your data:

 - `numpy.min(a)` find min value in the array
 - `numpy.argmin(a)` find position (AKA index) of the min value in the array
 - `numpy.max(a)` find max value in the array
 - `numpy.argmax(a)` find position (AKA index) of the max value in the array
 - `numpy.unique(a)` return the sorted unique elements of the array
 - `numpy.sort(a)` return a sorted copy of the array, min to max
 - `numpy.sum(a)` sum the elements of an array
 - `numpy.mean(a)` and `numpy.std(a)` compute mean and standard deviation of array values
 - `numpy.median(a)` compute the median of array values
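
As a quick reminder of how these behave, the sketch below applies each function to a small made-up array.

```python
import numpy as np

a = np.array([3.0, 1.0, 2.0, 3.0])  # made-up values

print(np.min(a), np.argmin(a))   # smallest value and its index
print(np.max(a), np.argmax(a))   # largest value and the index of its first occurrence
print(np.unique(a))              # sorted unique values
print(np.sort(a))                # sorted copy, smallest first
print(np.sum(a), np.mean(a), np.std(a), np.median(a))
```

To sort from max to min, the sorted array can be reversed with `np.sort(a)[::-1]`.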


# Feedback <a id="feedback"></a>