What's the name(s) of the person(s) sitting next to you? And what is their favorite (or most used) emoji?

# Scientific Computing I: NumPy

<div class="alert alert-success">
<b>Scientific Computing</b> is the application of computer programming to scientific applications: data analysis, simulation & modeling, plotting, etc. 
</div>

A key reason this course uses Python is that Python is popular in scientific computing. There are packages (functions and classes that you can use) for many, many, many use cases.

## Scientific Python: SciPy Stack

SciPy = Scientific Python

- `numpy` - for lists of numbers
- `pandas` - for heterogeneous data (i.e. not just numbers)
- `scipy` - for statistical analysis
- `sklearn` - for machine learning

The goal of today's lecture and the next is to look at these _briefly_, so you have a basic idea of _when_ to reach for them.

To actually use them you will have to read docs and tutorials, although you will get some practice in the later coding labs and assignments.

<div class="alert alert-success">
<b><code>SciPy</code></b> is an <i>ecosystem</i>, including a collection of open-source packages for scientific computing in Python.
</div>

A 'family' of packages that all work well together to do scientific computing.

Not made by the same people who manage Python's standard library.

**Packages must be installed before you can use them.** Install once. Import as many times as you want.

The packages we're using today 1) have already been installed for you on datahub and 2) are part of the Anaconda distribution.

However, for your final project, you may want to install additional packages. To do so on the command line: `pip install --user package` (where you replace `package` with the package to be installed.)

## NumPy is for lists of numbers (or lists of lists of numbers)

- recordings of values from experimental participants
- heights or quantitative information from survey data

NumPy stands for "numerical python".

NumPy exists for at least two reasons:

1. Python is slow for large calculations. Many NumPy operations are fast.
2. Python does not let you write e.g. `[1, 2, 3] + 1` and other kinds of common operations on lists of numbers.

In [None]:
# Without NumPy, this will error. (Which is reasonable, since lists do not always contain numbers.)

[1, 2, 3] + 1

In [None]:
# Without NumPy, this will concatenate the lists. (Also reasonable.)

[1, 2, 3] + [4, 5, 6]

In [None]:
# Now let's look at how NumPy operates differently.
# 
# Everyone imports NumPy as np so we can
# write np.something instead of numpy.something

import numpy as np

`np.array()` turns a list into a NumPy "array"

In [None]:
np.array([1, 2, 3])

An "array" is basically a list, _but_ it is stored more efficiently than an ordinary list and many operations on it are faster.

And, importantly, we can perform operations on lots of numbers more easily.
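As a rough illustration of both points (speed and convenience), here is a minimal sketch that adds 1 to a million numbers two ways. The variable names and the array size are made up for this example, and the timings will differ from machine to machine, so treat the numbers as a ballpark only.

```python
import timeit

import numpy as np

# One million numbers, once as a plain list and once as a NumPy array.
numbers_list = list(range(1_000_000))
numbers_arr = np.arange(1_000_000)

# Add 1 to every element: a list comprehension vs. a single array expression.
list_result = [x + 1 for x in numbers_list]
arr_result = numbers_arr + 1

# Both approaches produce the same values.
print(arr_result.tolist() == list_result)

# Time each approach (10 repetitions each). Exact timings depend on your machine.
list_time = timeit.timeit(lambda: [x + 1 for x in numbers_list], number=10)
arr_time = timeit.timeit(lambda: numbers_arr + 1, number=10)
print(f"list: {list_time:.4f} s, array: {arr_time:.4f} s")
```

On most machines the array version should come out noticeably faster, but the point here is the comparison, not the exact figures.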

In [None]:
# If we add 1 to a NumPy "array", it adds 1 to each element.

np.array([1, 2, 3]) + 1

In [None]:
# If we add NumPy arrays together, they are added element-wise instead of being concatenated.

np.array([1, 2, 3]) + np.array([4, 5, 6])

In [None]:
# If we need a normal list back, use the .tolist() method

my_np_arr = np.array([1, 2, 3]) + np.array([4, 5, 6])
my_np_arr.tolist()

In [None]:
# Define some random data

data = np.random.default_rng().integers(low=0, high=10, size=8)
data

Now that we have some data, let's see what else we can do.

In [None]:
# Hmmm, what will happen here?

first_and_third = [True, False, True, False, False, False, False, False]

data[first_and_third]

⬆︎ we can index a NumPy array by _a list of booleans!_

In [None]:
# Hmmm, what will happen here?

data > 3

⬆︎ we can _compare all the items against a number!_ The result is an array of booleans.

In [None]:
data

In [None]:
# Hmm, what will happen here?

data[data > 3]

In [None]:
data

⬆︎ we can _compare all the items against a number,_ resulting in an array of booleans, and then _index with those booleans_, thereby filtering down to _only those items that match!_
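Because `data` is random, your output will differ from run to run. Here is the same two-step filtering written out with fixed values, so you can check it by hand. The `scores` array is a made-up example, not part of the lecture data.

```python
import numpy as np

# Hypothetical fixed values, so the output is the same on every run.
scores = np.array([7, 2, 9, 3, 5])

mask = scores > 3          # step 1: compare every item against 3
print(mask)                # [ True False  True False  True]

filtered = scores[mask]    # step 2: keep only the items where the mask is True
print(filtered)            # [7 9 5]
```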

In [None]:
# Get a list of indices that are true in a list of booleans.

np.where([True, False, True, False, False])

In [None]:
np.where(data > 3)

⬆︎ we can _see which indices match._

(The odd-looking output is a single-element tuple whose only element is the array. `np.where` returns one array of indices per dimension, which matters for the multidimensional arrays we will discuss in a bit. Index into the tuple to get just the array: `np.where(data > 3)[0]`)

In [None]:
np.where(data > 3)[0]
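Looking ahead a little, this sketch shows why `np.where` hands back a tuple: you get one array of indices per dimension. The arrays `one_d` and `two_d` are made up for illustration.

```python
import numpy as np

# One-dimensional case: a tuple holding a single array of indices.
one_d = np.array([5, 1, 7, 2])
result_1d = np.where(one_d > 3)
print(len(result_1d))   # 1
print(result_1d[0])     # [0 2]

# Two-dimensional case: one array of row indices and one of column indices.
two_d = np.array([[5, 1],
                  [2, 7]])
rows, cols = np.where(two_d > 3)
print(rows)   # [0 1]
print(cols)   # [0 1]
```

Reading `rows` and `cols` together, the matches sit at positions (0, 0) and (1, 1).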

In [None]:
data

In [None]:
# Oh, here's a fun one. What might this do?

indices = [0, 3, 3, 3, 2]

data[indices]

⬆︎ we can _index by a list of indices!_

In [None]:
data

In [None]:
# Other helpful methods.

print(data.min())  # Minimum
print(data.max())  # Maximum
print(data.sum())  # Sum
print(data.mean()) # Mean (average)

#### Class Question #1

What is the output of the following?

```python
data = np.array([2, 4, 5, 9, 0, 1])
data.max()
```

- **(a)** Error
- **(b)** `False`
- **(c)** `9`
- **(d)** `(array([3]),)`
- **(e)** `array([False, False, False,  True, False, False])`

#### Class Question #2

What is the output of the following?

```python
data = np.array([2, 4, 5, 9, 0, 1])
data == data.max()
```

- **(a)** Error
- **(b)** `False`
- **(c)** `9`
- **(d)** `(array([3]),)`
- **(e)** `array([False, False, False,  True, False, False])`

#### Class Question #3

What is the output of the following?

```python
data = np.array([2, 4, 5, 9, 0, 1])
data[[False, True, False, True, True, False]]
```

- **(a)** Error
- **(b)** `array([4, 9, 0])`
- **(c)** `(array([1, 3, 4]),)`
- **(d)** `array([False, 4, False, 9, 0, False])`
- **(e)** `array([False, True, False, True, True, False])`

#### Class Question #4

What is the output of the following?

```python
data = np.array([2, 4, 5, 9, 0, 1])
data[data >= 4]
```

- **(a)** Error
- **(b)** `array([4, 5, 9])`
- **(c)** `(array([1, 2, 3]),)`
- **(d)** `array([0, 4, 5, 9, 0, 0])`
- **(e)** `array([False, True, True, True, False, False])`

#### Class Question #5

What is the output of the following?

```python
data = np.array([2, 4, 5, 9, 0, 1])
data[data == data.max()]
```

- **(a)** `9`
- **(b)** `array([3])`
- **(c)** `array([9])`
- **(d)** `(array([3]),)`
- **(e)** `array([False, False, False, True, False, False])`

#### Class Question #6

Which of the following could we put in the blank if we wanted the first three items? If there is more than one right answer, note them all.

```python
data = np.array([2, 4, 5, 9, 0, 1])

data[____________]
```

- **(a)** `0, 1, 2`
- **(b)** `[0, 1, 2]`
- **(c)** `range(0,3)`
- **(d)** `np.array(range(0,6)) < 3`
- **(e)** `[True, True, True, False, False, False]`
- **(f)** `np.array([True, True, True, False, False, False])`
- **(g)** `np.arange(6) < 3`
- **(h)** `:3`

In [None]:
# use this cell to test


### NumPy arrays can be _multi-dimensional_

Nice for linear algebra (i.e. matrix multiplication), if you need to do that sort of thing.

In [None]:
# Create some 2-dimensional arrays (i.e. matrices)
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[1, 2], [3, 4], [5, 6]])

In [None]:
arr1

In [None]:
arr2

In [None]:
# index into multidimensional arrays with _two_ numbers, row then column
arr1[0,2]

In [None]:
arr2[1,0]

In [None]:
# You can slice too
arr2[1, 0:2]

In [None]:
arr2[:, 1]

In [None]:
# Check the shape of the array
arr1.shape

In [None]:
# It's (rows, cols)
arr2.shape

#### Class Question #7

What's the output?

```python
arr1 = np.array([[1, 2, 3],
                 [4, 5, 6]])
arr1[1,0]
```

#### Class Question #8

How should we index to pull out the number `6`?

```python
arr2 = np.array([[1, 2],
                 [3, 4],
                 [5, 6]])
arr2[______]
```

#### Class Question #9

What's the output?

```python
arr1 = np.array([[1, 2, 3],
                 [4, 5, 6]])
arr1.shape
```

### Some Matrix Operations

In [None]:
arr1 = np.array([[1, 2, 3],
                 [4, 5, 6]])

arr2 = np.array([[1, 2],
                 [3, 4],
                 [5, 6]])

In [None]:
arr1 * arr2  # Raises a ValueError: * multiplies element-wise, and shapes (2, 3) and (3, 2) don't match

In [None]:
# Matrix multiplication
np.matmul(arr1, arr2)

In [None]:
np.matmul(arr2, arr1)
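To see why the order matters, it helps to look at the shapes. A minimal sketch, reusing `arr1` and `arr2` from above (the `product` name is just for this example):

```python
import numpy as np

arr1 = np.array([[1, 2, 3],
                 [4, 5, 6]])      # shape (2, 3)
arr2 = np.array([[1, 2],
                 [3, 4],
                 [5, 6]])         # shape (3, 2)

# (2, 3) times (3, 2) gives (2, 2): the inner dimensions must match.
product = np.matmul(arr1, arr2)
print(product.shape)                          # (2, 2)
print(product.tolist())                       # [[22, 28], [49, 64]]

# Swapping the order gives (3, 2) times (2, 3), which is (3, 3).
print(np.matmul(arr2, arr1).shape)            # (3, 3)

# The @ operator is shorthand for matrix multiplication.
print(np.array_equal(arr1 @ arr2, product))   # True
```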

Sum and mean.

In [None]:
arr1 = np.array([[1, 2, 3],
                 [4, 5, 6]])
arr1.sum()

In [None]:
arr1.sum(axis=0)

In [None]:
arr1.sum(axis=1)

It's a little tricky to think about: the axis number you give is the _dimension that will disappear._

`0` is the rows, so the rows will collapse and you will be left with one value per column.<br>
`1` is the columns, so the columns will collapse and you will be left with one value per row.<br>
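One way to check the "dimension that disappears" idea is to compare shapes before and after. A small sketch using `arr1` (the names `col_totals` and `row_totals` are just for this example):

```python
import numpy as np

arr1 = np.array([[1, 2, 3],
                 [4, 5, 6]])
print(arr1.shape)               # (2, 3): 2 rows, 3 columns

# axis=0 collapses the 2 rows, leaving one total per column.
col_totals = arr1.sum(axis=0)
print(col_totals.shape)         # (3,)
print(col_totals.tolist())      # [5, 7, 9]

# axis=1 collapses the 3 columns, leaving one total per row.
row_totals = arr1.sum(axis=1)
print(row_totals.shape)         # (2,)
print(row_totals.tolist())      # [6, 15]
```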

In [None]:
# min, max, and mean work like this too
arr1.mean(axis=0)

#### Class Question #10

What's the output?

```python
arr2 = np.array([[1, 2],
                 [3, 4],
                 [5, 6]])
arr2.min(axis=1)
```

- **(a)** `1`
- **(b)** `array([1, 2])`
- **(c)** `array([9, 12])`
- **(d)** `array([1, 3, 5])`
- **(e)** `array([3, 6, 11])`

### Other Array Notes

In [None]:
# Each inner list needs the same length (recent NumPy versions raise a ValueError here)
np.array([[1, 2, 3, 4], [2, 3, 4]])

In [None]:
# Arrays are for "homogeneous" data (usually numbers).
#
# Here, the numbers are converted to strings to make the collection homogeneous (one data type).
np.array([1, 2, 'cogs18', 4])
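You can check what type an array settled on with its `.dtype` attribute. A quick sketch (the names `ints` and `mixed` are made up here, and the exact integer type printed can vary by system):

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2, 'cogs18', 4])

print(ints.dtype)     # an integer type, e.g. int64 on most systems
print(mixed.dtype)    # a Unicode string type

# Because every item in `mixed` is now a string, + joins text instead of adding numbers.
print(mixed[0] + mixed[1])   # 12
```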

Okay, that's enough about NumPy!

## A brief aside: `zip()`

`zip()` takes two or more iterables (things you can loop over) and lets you loop over them together.

In [None]:
for a, b, c in zip([1,2,True,1000], ['a','b','c','d'], range(0,10000)):
    print("Values:", a, b, c)
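Notice that the loop above stops after four rounds even though the `range` has 10,000 items: `zip()` stops as soon as the shortest iterable runs out. A small sketch of the same behavior (the names `letters`, `numbers`, and `pairs` are just for this example):

```python
letters = ['a', 'b', 'c', 'd']
numbers = range(0, 10000)

# zip() pairs items up position by position and stops at the shortest input.
pairs = list(zip(letters, numbers))

print(len(pairs))   # 4
print(pairs)        # [('a', 0), ('b', 1), ('c', 2), ('d', 3)]
```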

#### Class Question #11

What will it print?

```python
data = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])
 
output = []
for d1, d2 in zip(data[0, :], data[1, :]):
    output.append(d1 + d2)

print(output)
```

- **(a)** `[1, 2, 3, 4]`
- **(b)** `[1, 2, 3, 4, 5, 6, 7, 8]`
- **(c)** `[6, 8, 10, 12]`
- **(d)** `[10, 26]`
- **(e)** `[36]`

In [None]:
data = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])

data.sum(axis=0)

(But if you find yourself looping over arrays...there is probably a better way with NumPy.)
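As a sketch of what "a better way" can look like, here is the loop from the question next to two loop-free versions. The names `looped` and `vectorized` are made up for this comparison.

```python
import numpy as np

data = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])

# Loop version: add the two rows one pair of items at a time.
looped = []
for d1, d2 in zip(data[0, :], data[1, :]):
    looped.append(d1 + d2)

# NumPy version: add the two rows in a single expression.
vectorized = data[0, :] + data[1, :]

# All three approaches give the same values.
print(np.array_equal(vectorized, looped))             # True
print(np.array_equal(vectorized, data.sum(axis=0)))   # True
```

The NumPy versions are shorter and, on large arrays, usually faster as well.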

Response cell below for impromptu question.