# Introduction to `numpy`

`numpy` is short for NUMerical PYthon. This package contains a powerful library of numerical tools. To use these tools, you must have the following line at the beginning of your file (or near the top of your Jupyter notebook, as in the cell below):

In [1]:
import numpy as np

Once you run that cell (such as by clicking into it and then hitting Shift-Enter), you'll be able to refer to variables and functions from numpy by typing `np.<name of variable or function>`.

The [`numpy` documentation](https://numpy.org/doc/stable/) includes guides somewhat akin to this notebook, as well as comprehensive descriptions of all of `numpy`'s capabilities and options.

## `numpy` arrays

The "array" is the workhorse of the numpy library; under the hood, its type is called `ndarray`.  Arrays are like Python's built-in Lists in that both can hold a sequence of values.  Unlike a List, which can contain elements of different "types" (strings, ints, lists, etc.), all the elements of an ndarray must be of the same "type."

(Perhaps confusingly, other programming languages use the term "array" to refer to a data type that is much more like Python's List.  But saying "array" in a Python context probably refers to `numpy`'s `ndarray`.)

The easiest way to create an array is from a List, like this:

In [10]:
array1 = np.array([1, 2, 3, -1, -2, -3])
array1 # Including this line makes Jupyter print the array:

array([ 1,  2,  3, -1, -2, -3])

That created an array of integers.  We can check that by reading the "data type" off of `array1` using a dot, like this:

In [6]:
array1.dtype

dtype('int64')

(The `int` part means the data type is "integer."  The `64` means "64-bit".)

If your array includes decimal values, it will have a `float` data type instead.  `float` is short for "floating-point number," which you can pretty much interpret to mean "decimal" in computer programming.
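To see the difference between the two data types, here is a small sketch.  The variable names `int_array` and `float_array` are made up for this illustration, and the exact integer type you see may depend on your platform and `numpy` version.

```python
import numpy as np

# All whole numbers: numpy picks an integer data type.
int_array = np.array([1, 2, 3])

# A single decimal value is enough to make the whole array a float array.
float_array = np.array([1, 2.5, 3])

print(int_array.dtype)    # an integer type, such as int64
print(float_array.dtype)  # float64
```

Notice that the `1` and `3` in the second array get stored as floats too, because every element must share one type.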

Arrays are basically tensors.  `array1` is one-dimensional, so it's a vector.  You will often work with 2D arrays, which are matrices.  But it's possible to create arrays with 3 or more dimensions.  `numpy` keeps track of the "shape" of an array for you:

In [11]:
array1.shape

(6,)

This means `array1` has just one dimension, with 6 entries.  Here's a 2D example: 

In [17]:
array2 = np.array([
    [1, 2],
    [3, -1],
    [-2, -3],
])
array2.shape

(3, 2)

This array represents the matrix

$$
\begin{pmatrix}
1 & 2 \\
3 & -1 \\
-2 & -3
\end{pmatrix}
$$

If you import a table of data from an Excel spreadsheet, you'll typically end up with a 2D array like this!  You can get the transpose like this:

In [16]:
array2.T

array([[ 1,  3, -2],
       [ 2, -1, -3]])
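Here is a short sketch of how the shape changes under a transpose, and what a 3D array looks like.  The names `matrix` and `cube` are just for illustration; `ndim` is the attribute that reports the number of dimensions.

```python
import numpy as np

matrix = np.array([
    [1, 2],
    [3, -1],
    [-2, -3],
])
print(matrix.ndim)     # 2
print(matrix.shape)    # (3, 2)
print(matrix.T.shape)  # (2, 3)  (rows and columns are swapped)

# Nesting the lists one level deeper gives a 3D array.
cube = np.array([
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
])
print(cube.ndim)   # 3
print(cube.shape)  # (2, 2, 2)
```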

### Creating a blank array

Sometimes you want an array that's filled with zeros to start.  To do this, you use the `np.zeros` function.  You have to pass it the "shape" of the array you want to create.  For example, if I wanted a 3-row, 4-column matrix filled with zeros, I'd write:

In [18]:
np.zeros((3, 4))

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

Notice the extra parentheses in the `(3, 4)`.  We're creating an ordered pair of shape values, known as a `tuple` in Python.  (If you just want a one-dimensional array, you can omit the extra parentheses.)
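The sketch below shows the one-dimensional and two-dimensional forms side by side, along with what happens if you forget the extra parentheses.  The variable names are made up for this example; the `dtype` argument is optional and is shown here only to illustrate that you can ask for integers instead of the default floats.

```python
import numpy as np

vector = np.zeros(3)        # one dimension: a plain number is enough
matrix = np.zeros((3, 4))   # two dimensions: the shape is a tuple
print(vector.shape)         # (3,)
print(matrix.shape)         # (3, 4)

# Zeros are floats by default; pass dtype to get integers instead.
counts = np.zeros((2, 2), dtype=int)
print(counts.dtype)         # an integer type, such as int64

# Without the extra parentheses, numpy reads the 4 as a data type and complains.
try:
    np.zeros(3, 4)
except TypeError as error:
    print("TypeError:", error)
```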

## Indexing and Slicing

Let's include some (real PER) data so we have something to start working with.

In [20]:
responses = np.array([
    [6, 4, 4, 4, 3, 2, 3, 3, 2, 1, 2, 2],
    [6, 2, 1, 2, 3, 3, 2, 4, 2, 4, 4, 4],
    [5, 5, 6, 7, 5, 3, 5, 5, 5, 8, 6, 5],
    [3, 7, 6, 10, 0, 5, 5, 7, 7, 7, 7, 8],
    [7, 7, 5, 4, 5, 7, 7, 9, 6, 9, 8, 8],
    [0, 5, 5, 6, 7, 6, 6, 4, 5, 5, 7, 7],
    [5, 5, 5, 7, 4, 4, 0, 4, 10, 5, 1, 2],
    [3, 4, 5, 2, 0, 3, 6, 0, 2, 4, 4, 5],
    [2, 0, 2, 0, 2, 3, 0, 0, 1, 1, 1, 3],
    [0, 1, 2, 2, 2, 3, 0, 4, 4, 2, 0, 0],
    [5, 5, 5, 4, 5, 0, 0, 5, 5, 5, 5, 0],
    [5, 5, 5, 5, np.nan, 5, 5, 5, 5, 5, 5, 5],
    [4, 6, 6, 4, 4, 4, 6, 3, 3, 3, np.nan, 8],
    [1, 6, 3, 2, 1, 2, 3, 4, 3, 3, 2, np.nan],
    [7, 5, 4, 2, 2, 2, 1, 2, 1, 2, 2, np.nan],
    [5, 5, 1, 0, 5, 5, 2, 5, 0, 0, 0, np.nan],
    [4, 7, 7, 5, 6, 5, np.nan, 8, 5, 5, 5, 7],
    [np.nan, np.nan, 4, 2, 2, 0, 4, 5, 4, 5, 4, 6],
    [np.nan, np.nan, 2, 8, 5, 7, 8, 0, 3, 5, 7, np.nan],
    [3, 5, 4, 2, 3, 4, np.nan, 5, 0, 5, np.nan, np.nan],
    [4, 5, 0, np.nan, 4, 2, np.nan, 2, 3, 3, np.nan, 5],
    [4, 4, np.nan, 0, 0, 2, np.nan, 3, np.nan, 0, 0, 0],
    [7, 7, 4, np.nan, 7, 5, 6, np.nan, 6, np.nan, np.nan, np.nan],
    [0, 0, np.nan, np.nan, 0, 0, np.nan, np.nan, np.nan, np.nan, 0, 0],
    [0, 5, 5, 5, np.nan, np.nan, 5, np.nan, np.nan, np.nan, 5, np.nan],
    [np.nan, np.nan, np.nan, np.nan, np.nan, 8, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
])

These data reflect students' responses to a weekly survey.  Each row represents a student, and each column represents a week.  `np.nan` stands for "not a number."  It's the value we use to indicate missing data.
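Missing values behave a little strangely, so here is a minimal sketch using a short made-up row (not taken from the real data).  It uses `np.isnan()` to find the missing entries, since comparing against `np.nan` with `==` doesn't work.

```python
import numpy as np

# A hypothetical row with one missing response.
row = np.array([4, 5, 0, np.nan, 4, 2])

# nan is a float, so the whole array becomes a float array.
print(row.dtype)         # float64

# nan is never equal to anything, not even itself.
print(np.nan == np.nan)  # False

# np.isnan() marks the missing entries with True.
missing = np.isnan(row)
print(missing.sum())     # 1  (True counts as 1 when summing)
```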

As with Lists, you can read values out of `numpy` arrays using "index notation."  To get the first row (the first student's responses) from the array, I can do:

In [21]:
responses[0]

array([6., 4., 4., 4., 3., 2., 3., 3., 2., 1., 2., 2.])

Remember, counting starts at `0`!  But `numpy` arrays are more advanced than Lists: you can give one index per dimension, separated by commas.  If I want to access the response that the 3rd student gave in the 7th week, I can do:

In [23]:
responses[2, 6]

5.0

You can also change the values in an array by assignment:

In [25]:
responses[2, 6] = 9.0
print(responses[2, 6])

9.0


But let's change it back—no falsifying data!

In [26]:
responses[2, 6] = 5.0

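Sometimes you don't just want one value, but you want a range of values.  For this, you perform an operation called "slicing."  A slice `start:stop` includes the start index but stops just before the stop index.  To get the data for the 3rd through 6th students only (indices 2, 3, 4, and 5), you can do: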

In [27]:
responses[2:6]

array([[ 5.,  5.,  6.,  7.,  5.,  3.,  5.,  5.,  5.,  8.,  6.,  5.],
       [ 3.,  7.,  6., 10.,  0.,  5.,  5.,  7.,  7.,  7.,  7.,  8.],
       [ 7.,  7.,  5.,  4.,  5.,  7.,  7.,  9.,  6.,  9.,  8.,  8.],
       [ 0.,  5.,  5.,  6.,  7.,  6.,  6.,  4.,  5.,  5.,  7.,  7.]])
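Slicing also works along columns, and you can combine it with the comma notation.  Here is a small sketch on a made-up array called `grid` (chosen so that each value tells you its row and column):

```python
import numpy as np

grid = np.array([
    [ 0,  1,  2,  3],
    [10, 11, 12, 13],
    [20, 21, 22, 23],
])

# A bare colon means "everything along this dimension."
first_column = grid[:, 0]
print(first_column)   # [ 0 10 20]

# Rows 1 and 2, columns 0 and 1.
block = grid[1:3, 0:2]
print(block)

# Negative indices count from the end, just like with Lists.
last_row = grid[-1]
print(last_row)       # [20 21 22 23]
```

In the `responses` array, the same pattern `responses[:, 0]` would pick out every student's answer for the first week.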

You can also filter to values that satisfy a certain comparison.  For example, to count how many responses are equal to `5.0`, you'd do:

In [34]:
len(responses[responses == 5])

66
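It can help to take that filtering expression apart.  The sketch below uses a short made-up array named `scores`; the comparison produces an array of `True`/`False` values (often called a "mask"), and indexing with the mask keeps only the `True` positions.

```python
import numpy as np

scores = np.array([5, 3, 5, np.nan, 7, 5])

# Step 1: the comparison runs element-wise and gives booleans.
is_five = scores == 5
print(is_five)

# Step 2: indexing with the mask keeps only the matching values.
print(scores[is_five])             # [5. 5. 5.]

# Counting the True values gives the same answer as len(...).
print(np.count_nonzero(is_five))   # 3

# Any comparison works.  Comparisons with nan are False, so nan is dropped.
high = scores[scores > 4]
print(high)                        # [5. 5. 7. 5.]
```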

There's a lot to indexing and slicing in `numpy`.  Learn more in the [documentation](https://numpy.org/doc/stable/reference/arrays.indexing.html).

## Operations on arrays

You can perform arithmetic operations on `numpy` arrays.  They operate element-wise!  Let's just consider some examples.

In [35]:
np.array([1, 2, 3]) + np.array([1, 2, 3])

array([2, 4, 6])

In [40]:
np.array([1, 2, 3]) - np.array([1, 2, 3])

array([0, 0, 0])

In [39]:
np.array([1, 2, 3]) * np.array([1, 2, 3])

array([1, 4, 9])

In [38]:
np.array([1, 2, 3]) / np.array([1, 2, 3])

array([1., 1., 1.])

You can also multiply and divide by scalars:

In [41]:
5 * np.array([1, 2, 3])

array([ 5, 10, 15])

Interestingly, you can also add and subtract a scalar.  This also works element-wise:

In [42]:
5 - np.array([1, 2, 3])

array([4, 3, 2])

Under the hood, `numpy` stretches the scalar to match the shape of the array, a behavior called "broadcasting."  In effect, it decides that what you really meant to do is:

In [43]:
np.array([5, 5, 5]) - np.array([1, 2, 3])

array([4, 3, 2])

In general, `numpy` is pretty good at guessing your intention like this.  But sometimes it can't—for example, it makes no sense to add an array of length 4 to an array of length 3!

In [44]:
np.array([1, 2, 3]) + np.array([1, 2, 3, 4])

ValueError: operands could not be broadcast together with shapes (3,) (4,) 
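The same stretching works between a 2D array and a 1D array, as long as the length of the 1D array matches the number of columns.  This is only a sketch on made-up values (`table` and `offsets` are illustrative names):

```python
import numpy as np

table = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
])
offsets = np.array([10.0, 20.0, 30.0])

# offsets has one entry per column, so it gets added to every row.
shifted = table + offsets
print(shifted)
# [[11. 22. 33.]
#  [14. 25. 36.]]

# An array of length 2 doesn't line up with 3 columns, so this fails.
try:
    table + np.array([1.0, 2.0])
except ValueError as error:
    print("ValueError:", error)
```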

## `numpy`'s mathematical and statistical functions

`numpy` includes implementations of many common functions that can work on arrays.  For example, here is a trigonometric function:

In [45]:
np.sin(
    np.array([np.pi/6, np.pi/4, np.pi/2])
)

array([0.5       , 0.70710678, 1.        ])

The `np.sin()` function and others like it operate element-wise.

There are also statistical functions, including:

- `np.mean()`
- `np.median()`
- `np.std()` (standard deviation)

These functions _don't_ operate element-wise (which wouldn't really make sense).  For example:

In [46]:
np.median(np.array([1, 2, 3]))

2.0

You'll have to be more thoughtful, however, when using these functions on a multi-dimensional array.  See the exercises for this!  Look [here](https://numpy.org/doc/stable/reference/routines.statistics.html) for the list of built-in statistical functions.
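As a starting point, here is a minimal sketch of how the `axis` argument changes what gets averaged on a small made-up 2D array.  With no `axis`, every value is pooled together; `axis=0` collapses the rows (one result per column), and `axis=1` collapses the columns (one result per row).

```python
import numpy as np

table = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
])

overall = np.mean(table)             # all six values together
per_column = np.mean(table, axis=0)  # one mean per column
per_row = np.mean(table, axis=1)     # one mean per row

print(overall)      # 3.5
print(per_column)   # [2.5 3.5 4.5]
print(per_row)      # [2. 5.]
```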

## Exercises

Let's redo some exercises from the introduction, but now using `numpy` functions!  Here's the `weekly_avg` array we used in those exercises, this time computed with a `numpy` function.  (The `np.nanmean()` function computes the mean, but ignores `nan` values, just like Excel ignores blank cells.)

Make sure you've executed the cell that includes the definition of the `responses` variable above.

In [56]:
weekly_avg = np.nanmean(responses, axis=0)
weekly_avg

array([3.73913043, 4.56521739, 3.95652174, 3.77272727, 3.26086957,
       3.6       , 3.7       , 3.95454545, 3.72727273, 3.95454545,
       3.57142857, 4.16666667])
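To see why the `nan`-aware version matters, compare it with the plain `np.mean()` on a short made-up array (`week` is just an illustrative name):

```python
import numpy as np

week = np.array([4.0, np.nan, 6.0, 2.0])

# A single nan "poisons" the ordinary mean.
plain = np.mean(week)
print(plain)      # nan

# np.nanmean() averages only the three real values: (4 + 6 + 2) / 3.
ignoring_nan = np.nanmean(week)
print(ignoring_nan)   # 4.0
```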

### Exercise 1

Use your search engine of choice to find the `numpy` function to compute the sum of all the values in the `weekly_avg` array.  Test it out here:

In [57]:
# Your code here.

The answer is just under `46`.

### Exercise 2

Use a `numpy` function to compute the overall average of the `weekly_avg` array.  The answer is around `3.8`.

In [58]:
# Your code here.

### Exercise 3

Use `numpy` to calculate the standard deviation of the `weekly_avg` array.  The answer is about `0.31`.

In [59]:
# Your code here.

### Exercise 4

Use `numpy` to produce an array that holds the sum of each row in the `responses` array.  You can do this with a for loop, but you can also do this with one numpy function.  What happens with the `nan` values?  You're welcome to look up the answer; just make sure you fully understand the code you find before you use it!

The answer is:

```
array([36,  37,  65,  72,  82,  63,  52,  38,  15,
       20,  44,  55,  51,  30,  30,  28,  64,  36,
       45,  31,  28,  13,  42,   0,  25,   8])
```

In [93]:
# Your code here.
np.nansum(responses, axis=1)

array([36., 37., 65., 72., 82., 63., 52., 38., 15., 20., 44., 55., 51.,
       30., 30., 28., 64., 36., 45., 31., 28., 13., 42.,  0., 25.,  8.])

### Exercise 5

Compute the per-student averages (that is, the averages of the rows in the `responses` array).  Again, you can do this with one numpy function, and you have to attend to the `nan` values.

The answer is:

```
array([3, 3.08333333, 5.41666667, 6, 6.83333333,
 5.25, 4.33333333, 3.16666667, 1.25, 1.66666667,
 3.66666667,  5, 4.63636364, 2.72727273, 2.72727273,
 2.54545455, 5.81818182, 3.6, 5, 3.44444444,
 3.11111111, 1.44444444, 6, 0, 4.16666667,
 8])
```

In [92]:
# Replace the None with your answer
averages = None

For fun, execute the next cell to create a plot of the averages!

In [74]:
from matplotlib import pyplot as plt

plt.plot(averages)

### Exercise 6

The `np.linspace()` function behaves as follows:

In [78]:
np.linspace(0, 1, 2)

array([0., 1.])

In [79]:
np.linspace(0, 1, 3)

array([0. , 0.5, 1. ])

In [82]:
np.linspace(0, 1, 11)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

In [84]:
np.linspace(1, 4, 4)

array([1., 2., 3., 4.])
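Putting those examples together: the third argument is the number of points (not the step size), and both endpoints are included.  As a quick check, the optional `retstep=True` argument makes `np.linspace()` also return the spacing it used; the names `points` and `step` are just for this sketch.

```python
import numpy as np

# Ask for 5 evenly spaced points from 0 to 1, plus the spacing between them.
points, step = np.linspace(0, 1, 5, retstep=True)

print(points)   # [0.   0.25 0.5  0.75 1.  ]
print(step)     # 0.25

# 5 points means 4 gaps, so the spacing is (1 - 0) / 4.
print(len(points))   # 5
```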

Use `np.linspace()` and `np.sin()` to plot $\sin(x)$ on $[0, 2\pi]$.  (`np.pi` holds the value of $\pi$.)

In [90]:
# Do what you need to do to set this variable correctly
thing_to_plot = None

plt.plot(thing_to_plot)