# Analysing data with NumPy

- toc:false
- branch: master
- badges: true
- comments: false
- categories: [data, python]
- hide: true

Questions:
- How can I import and analyse tabular data files in Python?

Objectives:
- Read tabular data from a file into a program using `numpy`.
- Select individual values and subsections from data.
- Perform operations on arrays of data.

Keypoints:
- Use the `numpy` library to work with arrays in Python.
- The expression `array.shape` gives the shape of an array.
- Use `array[x, y]` to select a single element from a 2D array.
- Array indices start at 0, not 1.
- All the indexing and slicing that we've used on lists and strings also works on arrays.
- Use `low:high` to specify a `slice` that includes indices from `low` to `high-1`.
- Arithmetic operations are done element-by-element.
- Use `numpy.mean(array)`, `numpy.max(array)`, and `numpy.min(array)` to calculate simple statistics.
- Use `numpy.mean(array, axis=0)` or `numpy.mean(array, axis=1)` to calculate the mean along a particular column or row.

### Use the numpy library to work with arrays in Python.

The [NumPy](http://docs.scipy.org/doc/numpy/) library is the usual choice in Python
for numerical work, especially when your data are stored in matrices or arrays.

In [3]:
import numpy

First, let's ask the library to read our cleaned data
file for us:

In [5]:
numpy.loadtxt(fname='../data/UVVis-01-cleaned.csv', delimiter=',')

array([[ 4.47125000e-04,  6.55591000e-04,  8.64056000e-04, ...,
         1.00000000e+01,  1.29667747e+00,  1.66669679e+00],
       [-3.66223800e-03, -3.49741500e-03, -3.34321500e-03, ...,
        -1.22419536e-01, -7.07442700e-03, -1.82473719e-01],
       [ 2.23267300e-03,  2.29731000e-03,  2.47505900e-03, ...,
         3.31975669e-01,  3.77199233e-01,  3.53418890e-02],
       ...,
       [ 1.20771340e-02,  1.22769590e-02,  1.24000520e-02, ...,
         3.11538220e-02,  1.53292596e-01, -2.67419547e-01],
       [ 3.98183100e-03,  4.22229500e-03,  4.32843200e-03, ...,
        -1.33138746e-01, -6.67433520e-02,  1.55003861e-01],
       [ 4.21040200e-03,  4.36906300e-03,  4.38802100e-03, ...,
         8.95578190e-02,  8.41182170e-02,  1.43565789e-01]])

The expression `numpy.loadtxt(...)` is a function call
that asks Python to run the function `loadtxt` which
belongs to the `numpy` library. 

Here we pass `numpy.loadtxt` two arguments: the name of the file
we want to read and the delimiter that separates values on
a line. Both need to be character strings,
so we put them in quotes.
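
To try `numpy.loadtxt` without the real data file, here is a minimal sketch that reads a few made-up comma-separated values from an in-memory text buffer. The numbers and the name `fake_file` are assumptions for illustration only; they are not taken from `UVVis-01-cleaned.csv`.

```python
import io
import numpy

# Hypothetical stand-in for a small CSV file: two rows, three columns.
fake_file = io.StringIO("0.1,0.2,0.3\n0.4,0.5,0.6\n")

# loadtxt accepts an open file-like object as well as a file name.
example = numpy.loadtxt(fname=fake_file, delimiter=',')
print(example)
```

Each line of text becomes one row of the array, and the delimiter decides where one value ends and the next begins.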


Let's re-run
`numpy.loadtxt` and save the returned data:

In [6]:
data = numpy.loadtxt(fname='../data/UVVis-01-cleaned.csv', delimiter=',')


Remember, this statement doesn't produce any output because we've assigned the output to the variable `data`.
If we want to check that the data have been loaded,
we can print the variable's value:

In [7]:
print(data)

[[ 4.47125000e-04  6.55591000e-04  8.64056000e-04 ...  1.00000000e+01
   1.29667747e+00  1.66669679e+00]
 [-3.66223800e-03 -3.49741500e-03 -3.34321500e-03 ... -1.22419536e-01
  -7.07442700e-03 -1.82473719e-01]
 [ 2.23267300e-03  2.29731000e-03  2.47505900e-03 ...  3.31975669e-01
   3.77199233e-01  3.53418890e-02]
 ...
 [ 1.20771340e-02  1.22769590e-02  1.24000520e-02 ...  3.11538220e-02
   1.53292596e-01 -2.67419547e-01]
 [ 3.98183100e-03  4.22229500e-03  4.32843200e-03 ... -1.33138746e-01
  -6.67433520e-02  1.55003861e-01]
 [ 4.21040200e-03  4.36906300e-03  4.38802100e-03 ...  8.95578190e-02
   8.41182170e-02  1.43565789e-01]]



Next,
let's ask what type of thing `data` refers to:


In [8]:
print(type(data))

<class 'numpy.ndarray'>


The output tells us that `data` currently refers to
an N-dimensional array, the functionality for which is provided by the NumPy library.
The rows are the individual samples, and the columns
are the absorption at each wavelength.

> Note: A NumPy array contains one or more elements
of the same type. The `type` function will only tell you that
a variable is a NumPy array; it won't tell you the type of
thing inside the array.
We can find out the type
of the data contained in the NumPy array using `print(data.dtype)`.
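
The difference between `type` and `dtype` can be seen in a small sketch. The arrays `floats` and `ints` below are made up for illustration and are not part of the absorption data.

```python
import numpy

# Made-up values; NumPy picks one element type for the whole array.
floats = numpy.array([[0.5, 1.5], [2.5, 3.5]])
ints = numpy.array([1, 2, 3])

print(type(floats))   # the container: numpy.ndarray
print(floats.dtype)   # the element type: float64
print(ints.dtype)     # an integer type (the exact width depends on the platform)
```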

### The expression `array.shape` gives the shape of an array.

With the following command, we can see the array's shape:

In [10]:
print(data.shape)

(10, 1301)


The output tells us that the `data` array variable contains 10 rows and 1301 columns. When we
created the variable `data` to store our absorption data, we didn't just create the array; we also
created information about the array, called members or
attributes. This extra information describes `data` in the same way an adjective describes a noun.
`data.shape` is an attribute of `data` which describes the dimensions of `data`. We use the same
dotted notation for the attributes of variables that we use for the functions in libraries because
they have the same part-and-whole relationship.
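
As a small sketch of attributes in action, the hypothetical array `example` below (filled with zeros, not real data) carries its shape along with it, together with a couple of related attributes:

```python
import numpy

# A hypothetical array with 2 rows and 5 columns, filled with zeros.
example = numpy.zeros((2, 5))

print(example.shape)  # (rows, columns)
print(example.ndim)   # number of dimensions
print(example.size)   # total number of elements
```

Note that attributes such as `shape` are written without parentheses, because we are looking up stored information rather than calling a function.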

### Use `array[x, y]` to select a single element from a 2D array.

If we want to get a single number from the array, we must provide an
index in square brackets after the variable name, just as we
do in math when referring to an element of a matrix.  Our absorption data has two dimensions, so
we will need to use two indices to refer to one specific value:

In [11]:
print('first value in data:', data[0, 0])

first value in data: 0.000447125


In [12]:
print('middle value in data:', data[5, 600])

middle value in data: 0.05236074


### Array indices start at 0, not 1.

The expression `data[5, 600]` accesses the element at row 5, column 600. While this expression may
not surprise you, `data[0, 0]` might.
Programming languages like Fortran, MATLAB and R start counting at 1
because that's what human beings have done for thousands of years.
Languages in the C family (including C++, Java, Perl, and Python) count from 0
because it represents an offset from the first value in the array (the second
value is offset by one index from the first value). This is closer to the way
that computers represent arrays (if you are interested in the historical
reasons behind counting indices from zero, you can read
[Mike Hoye's blog post](http://exple.tive.org/blarg/2013/10/22/citation-needed/)).
As a result,
if we have an M×N array in Python,
its indices go from 0 to M-1 on the first axis
and 0 to N-1 on the second.
It takes a bit of getting used to,
but one way to remember the rule is that
the index is how many steps we have to take from the start to get the item we want.
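
The rule can be checked on a small array. This is only a sketch: the array `example` is made up so that each value is easy to locate, and the names `n_rows` and `n_cols` are chosen for illustration.

```python
import numpy

# Made-up 3x4 array holding the values 0..11.
example = numpy.arange(12).reshape(3, 4)
n_rows, n_cols = example.shape

first = example[0, 0]                   # zero steps from the start on both axes
last = example[n_rows - 1, n_cols - 1]  # the largest valid index on each axis
print(first, last)

# Asking for row index 3 in a 3-row array is one step too far.
try:
    example[n_rows, 0]
except IndexError as error:
    print('IndexError:', error)
```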



![](../images/python-zero-index.png)

> Note: What may also surprise you is that when Python displays an array, it shows the element with index `[0, 0]` in the upper left corner
 rather than the lower left.
 This is consistent with the way mathematicians draw matrices
 but different from the Cartesian coordinates.
 The indices are (row, column) instead of (column, row) for the same reason,
 which can be confusing when plotting data.

### All the indexing and slicing that we've used on lists and strings also works on arrays.

An index like `[5, 600]` selects a single element of an array,
but we can select whole sections as well.
For example,
we can select the values at the first ten wavelengths (columns)
for the first four samples (rows) like this:

In [14]:
print(data[0:4, 0:10])

[[ 0.00044712  0.00065559  0.00086406  0.00107252  0.00128099  0.00148945
   0.00169792  0.00190638  0.00211485  0.00232331]
 [-0.00366224 -0.00349741 -0.00334322 -0.0036817  -0.00405294 -0.00324795
  -0.00336376 -0.00375587 -0.00342078 -0.00319713]
 [ 0.00223267  0.00229731  0.00247506  0.00222341  0.00225055  0.00260366
   0.00255431  0.00229944  0.00254705  0.00278302]
 [ 0.0060851   0.00631194  0.00641883  0.00616071  0.00553703  0.00655359
   0.00648117  0.00620936  0.00630154  0.00666987]]


### Use `low:high` to specify a `slice` that includes indices from `low` to `high-1`.

The slice `0:4` means, "Start at index 0 and go up to, but not
including, index 4." Again, the up-to-but-not-including takes a bit of getting used to, but the
rule is that the difference between the upper and lower bounds is the number of values in the slice.

We don't have to start slices at 0:

In [15]:
print(data[5:10, 0:10])

[[0.01210117 0.01231773 0.01241982 0.01212973 0.01202739 0.01261753
  0.01251138 0.01232068 0.01247477 0.01260437]
 [0.01206768 0.01227019 0.01239462 0.01212571 0.01220084 0.01219452
  0.0122772  0.01225797 0.01248005 0.01262111]
 [0.01207713 0.01227696 0.01240005 0.01214027 0.01222512 0.01238692
  0.01246438 0.01226141 0.01244613 0.01270402]
 [0.00398183 0.0042223  0.00432843 0.00407767 0.00415076 0.00426605
  0.00438313 0.00419297 0.00439255 0.0046121 ]
 [0.0042104  0.00436906 0.00438802 0.00414603 0.0041499  0.00446744
  0.00435411 0.00416835 0.00443187 0.00467494]]


We also don't have to include both the upper and lower bounds of the slice.  If we don't include the lower
bound, Python uses 0 by default; if we don't include the upper, the slice runs to the end of the
axis; and if we don't include either (i.e., if we just use ':' on its own), the slice includes
everything:

In [16]:
small = data[:3, 36:]
print('small is:')
print(small)

small is:
[[ 7.95187800e-03  8.16034400e-03  8.36880900e-03 ...  1.00000000e+01
   1.29667747e+00  1.66669679e+00]
 [-2.69293800e-03 -2.44559000e-03 -2.69407000e-03 ... -1.22419536e-01
  -7.07442700e-03 -1.82473719e-01]
 [ 3.43972300e-03  3.70568000e-03  3.41978900e-03 ...  3.31975669e-01
   3.77199233e-01  3.53418890e-02]]


The above example selects rows 0 through 2 and columns 36 through to the end of the array.
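
The slicing rules can be tried out on a small made-up array. In this sketch the array `example` and the names `block`, `head` and `tail` are assumptions for illustration:

```python
import numpy

# Made-up 4x6 array holding the values 0..23.
example = numpy.arange(24).reshape(4, 6)

block = example[1:3, 2:5]   # rows 1-2, columns 2-4
print(block.shape)          # (3 - 1, 5 - 2)

head = example[:2, :]       # missing lower bound defaults to 0
tail = example[:, 4:]       # missing upper bound runs to the end of the axis
print(head.shape, tail.shape)
```

In each case the number of values selected along an axis is the upper bound minus the lower bound.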

### Arithmetic operations are done element-by-element

Arrays also know how to perform common mathematical operations on their values.  The simplest
operations with data are arithmetic: addition, subtraction, multiplication, and division.  When you
do such operations on arrays, the operation is done element-by-element.  Thus:

In [19]:
doubledata = data * 2.0

will create a new array `doubledata`,
each element of which is twice the value of the corresponding element in `data`:

In [20]:
print('original:')
print(data[:3, 36:])
print('doubledata:')
print(doubledata[:3, 36:])

original:
[[ 7.95187800e-03  8.16034400e-03  8.36880900e-03 ...  1.00000000e+01
   1.29667747e+00  1.66669679e+00]
 [-2.69293800e-03 -2.44559000e-03 -2.69407000e-03 ... -1.22419536e-01
  -7.07442700e-03 -1.82473719e-01]
 [ 3.43972300e-03  3.70568000e-03  3.41978900e-03 ...  3.31975669e-01
   3.77199233e-01  3.53418890e-02]]
doubledata:
[[ 1.59037560e-02  1.63206880e-02  1.67376180e-02 ...  2.00000000e+01
   2.59335494e+00  3.33339357e+00]
 [-5.38587600e-03 -4.89118000e-03 -5.38814000e-03 ... -2.44839072e-01
  -1.41488540e-02 -3.64947438e-01]
 [ 6.87944600e-03  7.41136000e-03  6.83957800e-03 ...  6.63951338e-01
   7.54398466e-01  7.06837780e-02]]


If, instead of doing arithmetic with an array and a single value (as above), you do the
arithmetic with another array of the same shape, the operation is done on
corresponding elements of the two arrays.  Thus:

In [21]:
tripledata = doubledata + data

will give you an array where `tripledata[0,0]` will equal `doubledata[0,0]` plus `data[0,0]`,
and so on for all other elements of the arrays.

In [22]:
print('tripledata:')
print(tripledata[:3, 36:])

tripledata:
[[ 2.38556340e-02  2.44810320e-02  2.51064270e-02 ...  3.00000000e+01
   3.89003241e+00  5.00009036e+00]
 [-8.07881400e-03 -7.33677000e-03 -8.08221000e-03 ... -3.67258608e-01
  -2.12232810e-02 -5.47421157e-01]
 [ 1.03191690e-02  1.11170400e-02  1.02593670e-02 ...  9.95927007e-01
   1.13159770e+00  1.06025667e-01]]
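
The element-by-element behaviour is easier to verify by eye on two tiny arrays. The arrays `a` and `b` here are made up for illustration:

```python
import numpy

a = numpy.array([[1.0, 2.0], [3.0, 4.0]])
b = numpy.array([[10.0, 20.0], [30.0, 40.0]])

scaled = a * 2.0   # every element is multiplied by the single value 2.0
summed = a + b     # elements in matching positions are added
print(scaled)
print(summed)
```

Note that `a * b` would also multiply matching elements; it is not a matrix product.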


### Use `numpy.mean(array)`, `numpy.max(array)`, and `numpy.min(array)` to calculate simple statistics.

Often, we want to do more than add, subtract, multiply, and divide array elements.  NumPy knows how
to do more complex operations, too.  If we want to find the average absorption for all samples across all wavelengths, for example, we can ask NumPy to compute `data`'s mean value:


In [23]:
print(numpy.mean(data))

0.0814548568076864


`mean` is a function that takes
an array as an argument.

> Note: Generally, a function uses inputs to produce outputs.
 However, some functions produce outputs without
 needing any input. For example, checking the current time
 with `print(time.ctime())` doesn't require any input.
 For functions that don't take in any arguments,
 we still need parentheses `()`
 to tell Python to go and do something for us.


NumPy has lots of useful functions that take an array as input.
Let's use three of those functions to get some descriptive values about the dataset.
We'll also use multiple assignment,
a convenient Python feature that will enable us to do this all in one line.

In [26]:
maxval, minval, stdval = numpy.max(data), numpy.min(data), numpy.std(data)

print('maximum absorption:', maxval)
print('minimum absorption:', minval)
print('standard deviation:', stdval)

maximum absorption: 10.0
minimum absorption: -1.036568046
standard deviation: 0.24849228257073133


Here we've assigned the return value from `numpy.max(data)` to the variable `maxval`, the value
from `numpy.min(data)` to `minval`, and so on.
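
Multiple assignment can be seen in isolation in the sketch below, which uses a made-up array `example` whose statistics are easy to work out by hand:

```python
import numpy

example = numpy.array([1.0, 2.0, 3.0, 4.0])

# Multiple assignment: the values on the right are matched, in order,
# to the names on the left.
meanval, maxval, minval = numpy.mean(example), numpy.max(example), numpy.min(example)
print(meanval, maxval, minval)
```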

### Use `numpy.mean(array, axis=0)` or `numpy.mean(array, axis=1)` to calculate the mean along a particular column or row

When analyzing data, though,
we often want to look at variations in statistical values,
such as the maximum absorption per sample
or the average absorption per wavelength.
One way to do this is to create a new temporary array of the data we want,
then ask it to do the calculation:

In [29]:
sample_0 = data[0, :] # 0 on the first axis (rows), everything on the second (columns)
print('maximum absorption for sample 0:', sample_0.max())

maximum absorption for sample 0: 10.0


> Tip: Everything in a line of code following the '#' symbol is a
comment that is ignored by Python.
Comments allow programmers to leave explanatory notes for other
programmers or their future selves.

We don't actually need to store the row in a variable of its own.
Instead, we can combine the selection and the function call:

In [30]:
print('maximum absorption for sample 2:', numpy.max(data[2, :]))

maximum absorption for sample 2: 0.377199233


What if we need the maximum absorption for each sample over all wavelengths (as in the
diagram below on the left) or the average for each wavelength (as in the
diagram on the right)? As the diagram shows, we want to perform the
operation across an axis:
    
![](../images/python-operations-across-axes.png)

To support this functionality,
most array functions allow us to specify the axis we want to work on.
If we ask for the average across axis 0 (rows in our 2D example),
we get:

In [32]:
print(numpy.mean(data, axis=0))

[0.00556981 0.00575402 0.00588247 ... 1.07087656 0.34124763 0.27600167]


As a quick check,
we can ask this array what its shape is:

In [34]:
print(numpy.mean(data, axis=0).shape)

(1301,)



The expression `(1301,)` tells us we have a one-dimensional array with 1301 elements,
so this is the average absorption at each wavelength across all samples.
If we average across axis 1 (columns in our 2D example), we get:


In [36]:
print(numpy.mean(data, axis=1))

[0.13104075 0.02947129 0.02323768 0.08578812 0.07745822 0.10012283
 0.10339795 0.09871813 0.08216419 0.08314941]




which is the average absorption per sample across all wavelengths.
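
Because it is easy to mix up the two axes, it can help to try both on an array small enough to check by hand. This is only a sketch: `example` is made-up data with 2 rows and 3 columns, and the names `per_column` and `per_row` are chosen for illustration.

```python
import numpy

# Made-up data: 2 rows (think "samples") and 3 columns (think "wavelengths").
example = numpy.array([[1.0, 2.0, 3.0],
                       [3.0, 4.0, 5.0]])

per_column = numpy.mean(example, axis=0)  # collapses the rows
per_row = numpy.mean(example, axis=1)     # collapses the columns
print(per_column, per_column.shape)
print(per_row, per_row.shape)
```

The axis we name is the one that disappears: averaging over axis 0 leaves one value per column, and averaging over axis 1 leaves one value per row.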

---

Do [the quick-test](https://nu-cem.github.io/CompPhys/2021/08/02/Analysing-Data-Qs.html).

Back to [data analysis and visualisation](https://nu-cem.github.io/CompPhys/2021/08/02/Data_analysis.html).

---