# Numerical Data in Python

*2 hours*

**Contents:**

- [Introducing NumPy](#Introducing-NumPy)
- [Working with NumPy Arrays](#Working-with-NumPy-Arrays)
- [Calculations on NumPy Arrays](#Calculations-on-NumPy-Arrays)
- [Sorting and Filtering Arrays](#Sorting-and-Filtering-Arrays)
- [Reshaping Arrays](#Reshaping-Arrays)
- [Plotting with `matplotlib`](#Plotting-with-matplotlib)

---

## Introducing NumPy

So far, we've been working with built-in Python data structures and functions. We've loaded data from some text files and worked with sequences of numbers. But most scientific datasets have more *structure* than we're able to represent with a CSV file. Many are also large enough that a CSV file isn't the best format to use.

Today, we'll introduce a powerful and widely used Python extension that is essential for working with numerical data. NumPy has been around for over 20 years and thousands of users have contributed code to the library. It's no exaggeration to say that NumPy is probably the most important Python library ever developed, as countless other libraries are built on top of NumPy.

In [None]:
import numpy as np

It's a strong convention to abbreviate the name of `numpy` to `np`.

**The key component of `numpy` is the `numpy.ndarray` data type,** or $n$-dimensional array.

In [None]:
arr = np.array([1, 2, 3])
arr

If we compare the `ndarray` to the most similar built-in data structure, the `list`, we'll see some important differences.

In [None]:
np.array([1, True, 3])

**First, unlike lists, `numpy` arrays can only hold one type of data.** There are multiple data types to choose from and every array has an explicitly defined type.

In [None]:
arr.dtype

In [None]:
np.array([3.14, 2.1]).dtype
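
To see the one-type rule in action, here is a minimal sketch of how NumPy converts mixed inputs to a single common type. The variable names are only for illustration, and the exact integer type can vary by platform.

```python
import numpy as np

# Mixing a boolean with integers: True is converted to the integer 1
mixed_int = np.array([1, True, 3])
print(mixed_int)          # [1 1 3]
print(mixed_int.dtype)    # an integer type (its exact width depends on the platform)

# Mixing an integer with a float: every element becomes a float
mixed_float = np.array([1, 2.5])
print(mixed_float.dtype)  # float64

# We can also request a type explicitly with the dtype argument
as_float = np.array([1, 2, 3], dtype=float)
print(as_float)           # [1. 2. 3.]
```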

**Second, a `numpy` array can have multiple dimensions, or axes,** whereas a `list` can only contain nested lists, which is a confusing way to try to represent more than one dimension.

In [None]:
xy = [[42.0, 45.5], [-118.1, -118.2]]
xy

In [None]:
arr = np.array(xy)
arr

We can index a `numpy` array just like other sequences. We'll learn more about this in a moment.

In [None]:
arr[0]

In this example, it might make more sense to have the latitudes and longitudes in separate columns (instead of in rows). We can get a *transposed* copy of the array by typing:

In [None]:
arr.T

In [None]:
arr.ndim

In [None]:
arr.shape

You may be tempted to ask how many elements are in a `numpy` array...

In [None]:
# The WRONG way: len() only counts the first axis (here, the rows)
len(arr)

In [None]:
# The correct way
arr.size
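
The difference is easier to see with an array that is not square. A small sketch, using a hypothetical 2-by-3 array named `grid`:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(len(grid))    # 2: only the length of the first axis (rows)
print(grid.shape)   # (2, 3): the length of every axis
print(grid.size)    # 6: the total number of elements
```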

**Finally, operations on `numpy` arrays can be executed much more quickly and efficiently than on `lists`.** If you are working with numeric data of any size, you're much better off putting it in a `numpy` array than a `list` or `tuple`. This is especially true for very large datasets.
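
One rough way to check this claim yourself is to time the same calculation both ways. The sketch below is only an illustration: the array length and the number of repetitions are arbitrary choices, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

n = 100_000
values_list = list(range(n))
values_arr = np.arange(n, dtype=np.int64)

# The same calculation two ways: the sum of the squared values
list_result = sum(v * v for v in values_list)
arr_result = int((values_arr * values_arr).sum())
print(list_result == arr_result)

# Time 10 repetitions of each approach
t_list = timeit.timeit(lambda: sum(v * v for v in values_list), number=10)
t_arr = timeit.timeit(lambda: (values_arr * values_arr).sum(), number=10)
print(f"list: {t_list:.4f} s, array: {t_arr:.4f} s")
```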

---

## Working with NumPy Arrays

For the rest of this lesson, we'll be working with [data on near-surface air temperatures from the NOAA Climate Prediction Center (CPC).](http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/.CPC/.GHCN_CAMS/.gridded/.deg0p5/index.html) 

Though the CPC has the word "prediction" in its name, it also performs what are called *re-analyses,* where historical climate data are reproduced by combining weather station data with computer models. Here, we use a re-analysis of the past 75 years of air temperatures across the globe.

**I subset these data to the city of Algiers, Algeria.**

In [None]:
import pandas as pd

temps = pd.read_csv(
    'http://files.ntsg.umt.edu/data/GIS_Programming/data/NOAA_NCEP_CPC_gridded_deg0p5_1948-2022_Algiers.txt',
    header = None).to_numpy()

Again, a chief advantage of `numpy` arrays is their ability to represent multi-dimensional data. **What do we mean by multi-dimensional?**

A collection of data values can be *structured* in multiple ways...

- Time series data consist of values that are associated with a series of points in time.
- Spatial data consist of values that are associated, usually, with a series of numbers that represent spatial coordinates.

Imagine we're collecting soil samples along a 100-meter transect in a forest. For every soil sample measurement, we'll mark how far along the transect it was collected, from 0 to 100, indicating the distance from one end of the transect. If we collected a sample every 1 meter, starting at 0 and ending at 100, we'd have 101 soil sample measurements. The numbers 0 to 100 then also represent a way of indexing those measurements. This is an example of one-dimensional data, structured by a single axis; in this case, distance.

Now imagine we want to collect soil samples in a 100 m-by-100 m grid (10,000 square meters). In this case, we need two numbers to represent the location of each soil sample, so the data we collect will be 2-dimensional, with one axis representing the distance along one side of the grid and the other axis representing the distance along the other.
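
The two sampling designs might be sketched as arrays like this. The measurement values are made-up placeholders, and the names `transect` and `grid` are only for illustration.

```python
import numpy as np

# 1D: one measurement per meter along the transect (positions 0, 1, ..., 100)
distance = np.arange(0, 101)
transect = np.zeros(distance.size)   # placeholder measurements
print(transect.ndim, transect.shape)

# 2D: one measurement per cell of a 100-by-100 grid
grid = np.zeros((100, 100))
print(grid.ndim, grid.shape)

# One index locates a sample on the transect; two are needed on the grid
transect[25] = 4.2     # sample taken 25 m along the transect
grid[25, 60] = 4.2     # sample at position 25 on the first axis, 60 on the second
```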

**The NOAA CPC provides mean, monthly near-surface air temperatures each year. The shape of our multi-dimensional array is...**

In [None]:
temps.shape

**What do these two axes correspond to?**

How can we get the first year of data?

In [None]:
temps[0]

What about the first month of data?

In [None]:
temps[0,0]

When we index (2-dimensional) `numpy` arrays, remember that we count rows first and columns second. So, for example, if we want to get the month of January in every year...

In [None]:
temps[:,0]

If we want just the first 5 years, we could write:

In [None]:
temps[0:5]

**The colon symbol, `:`, is the *slicing* operator.**

Above, `temps[0:5]` can be read as "Start at index 0 and go up to, *but not including,* index 5." Because Python starts counting at zero, we get 5 years of data; recall that index 5 would refer to the *sixth* year of data, in this example.

When we use the slicing operator without any numbers, it means "take everything" on that axis. Hence, the following two examples are the same:

In [None]:
temps[0,:]

In [None]:
temps[0]

Note that we can also use *negative indexing* in `numpy` arrays, just as with Python lists or tuples...

In [None]:
# The last year of temperature data
temps[-1]

In [None]:
# The last three years of temperature data
temps[-3:]

Here are some other examples of indexing an array that has three rows and two columns:

In [None]:
data = np.array([[1, 2], [3, 4], [5, 6]])
data

![](./assets/numpy-matrix-indexing.png)

*Image is from a presentation by Mauricio Sevilla.*
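
The same kinds of indexing can be tried directly on the small `data` array. This is a short sketch; the particular rows and columns picked here are arbitrary.

```python
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]])

print(data[0, 1])     # 2: first row, second column
print(data[1:3])      # the rows at index 1 and 2: [3, 4] and [5, 6]
print(data[0:2, 0])   # first column of the first two rows: 1 and 3
print(data[:, 1])     # second column of every row: 2, 4, and 6
print(data[-1])       # the last row: [5, 6]
```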

---

### Challenge: Working with Multi-dimensional Arrays

1. What's the average July temperature in Algiers over the years?
2. What was the minimum monthly temperature in 1979? Recall that the years of this data extend from 1948 through 2022.

---

## Calculations on NumPy Arrays

Because `numpy.ndarray` is an object, NumPy arrays know how to perform mathematical operations on themselves. This allows us to treat them like pure numbers!

In [None]:
# Convert temperatures from degrees Celsius to kelvins
temps_k = temps + 273.15
temps_k[0]

**How can we calculate the average of the first two years of temperatures in this record?**

In [None]:
(temps[0] + temps[1]) / 2

Of course, we're ultimately going to want to do much more than add and multiply arrays together. NumPy arrays also have more sophisticated methods built-in.

In [None]:
temps.mean()

If we don't provide any arguments to the `mean()` method when it is called on a `numpy.ndarray`, then we get the *overall mean,* i.e., the mean of the entire dataset.

But what if we wanted to calculate the mean January temperature? We could either filter the data **or we could tell the `temps` array that we want to calculate the mean across a given axis.**

In [None]:
temps.mean(axis = 0)

Above, `axis = 0` indicates that we want to *collapse* the years axis (first axis, or axis at position 0) when calculating the mean. If we're in doubt, we can always ask for:

In [None]:
temps.mean(axis = 0).shape

**Can you guess what this array represents?**

In [None]:
temps.mean(axis = 1)

It can be very difficult to remember what `axis` to use in calculating a summary... Here's a helpful visual representation.

![](assets/numpy-axis.jpg)

*Image courtesy of Alex Riley*
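
A small array makes the effect of `axis` easier to verify by hand. The sketch below uses made-up values, with 3 "years" as rows and 4 "months" as columns.

```python
import numpy as np

# Hypothetical data: 3 "years" (rows) by 4 "months" (columns)
small = np.array([[10.0, 12.0, 14.0, 16.0],
                  [11.0, 13.0, 15.0, 17.0],
                  [12.0, 14.0, 16.0, 18.0]])

# axis=0 collapses the rows (years): one mean per column (month)
by_month = small.mean(axis=0)
print(by_month, by_month.shape)   # [11. 13. 15. 17.] (4,)

# axis=1 collapses the columns (months): one mean per row (year)
by_year = small.mean(axis=1)
print(by_year, by_year.shape)     # [13. 14. 15.] (3,)
```

The axis we name is the one that disappears from the shape of the result.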

**We might also ask: How warm was the warmest July in Algiers?**

In [None]:
temps[:,6].max()

---

### Challenge: Statistical Summary of an Array

What are the minimum, maximum, and mean monthly temperatures for August in Algiers?

---

## Break!

*A 10-minute break for learners.*

---

## Sorting and Filtering Arrays

How can we filter the contents of an array?

For example, suppose that we wanted to use these temperature data to predict when we should wear a jacket (*veste*) in Algiers: "jacket weather" is defined as any month when the mean monthly temperature is below 15 deg C.

In [None]:
jacket = temps < 15
jacket[0]

Just like addition and multiplication, we can perform *logical* operations on arrays. In the above example, we can ask *where* in the `numpy` array the temperature is less than 15 deg C. The result is a new type of array, a boolean array.

In [None]:
jacket.dtype

And this *boolean array* has the same shape as `temps`:

In [None]:
jacket.shape

**How can we use this insight to filter arrays?**

In [None]:
temps[temps < 15]

**This is the second way of indexing the contents of an array.**

- We can index an array based on numeric indices, for example: `temps[0]` or `temps[0:5]`
- Or, we can index an array by providing a *boolean array* of the same shape, for example: `temps[temps > 20]`
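
Boolean indexing can be checked by eye on a small array. A minimal sketch with made-up temperatures (the name `sample` is hypothetical):

```python
import numpy as np

# Hypothetical monthly temperatures (deg C): 2 years by 4 months
sample = np.array([[11.0, 12.5, 14.0, 16.5],
                   [10.5, 15.0, 18.0, 21.0]])

mask = sample < 15       # boolean array with the same shape as sample
print(mask)
print(sample[mask])      # 1D array of the values where mask is True
print(mask.sum())        # True counts as 1, so this counts the matches
```

Note that filtering a 2D array this way returns a flat, 1D array: the matching values no longer form a rectangle, so the original shape can't be kept.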

For example, if we wanted to create a new dataset that contains 1 when it is "jacket weather" but 0 otherwise, we could write...

In [None]:
# Create a new array, full of zeros, with the same shape as temps
jacket = np.zeros(temps.shape)
jacket.shape

We can use the filtering syntax to assign values to an array *only where the conditional expression evaluates to true.*

In [None]:
jacket[temps < 15] = 1

print(temps[0])
print(jacket[0])

An easier way of doing this is to use the `numpy.where()` function:

In [None]:
# np.where() takes three arguments, the first is a conditional expression;
#    2nd argument is the value to return if the expression is true
#    3rd argument is the value to return if the expression is false
jacket2 = np.where(temps < 15, 1, 0)
jacket2[0]

Are we getting the same result?

In [None]:
np.equal(jacket, jacket2).all()

This is great so far, but our `jacket` array doesn't tell us *which months* are "jacket weather." What if we wanted to know *when* we should wear a jacket?

In [None]:
when_jacket = np.argwhere(temps < 15)

# Just the first 5 rows
when_jacket[0:5]

`np.argwhere()` returns an array with N columns, where N is the number of dimensions of the input boolean array. Our `temps` array and our boolean array, `temps < 15`, both have 2 dimensions:

- Year dimension (first axis)
- Month dimension (second axis)

So, if we want to see all the months where `temps < 15`, we should index the second column of the result of `np.argwhere(temps < 15)`:

In [None]:
when_jacket[:,1]

The output of `numpy.argwhere()` is an array that gives the indices of where an array expression evaluates to `True`. Because our `temps` array is 2D, the result of `argwhere()` is a 2D array.

- Every row of `when_jacket` represents an entry where the boolean array `temps < 15` is `True`
- The first column of `when_jacket` indicates the row *in the boolean array* where an entry is `True`
- The second column of `when_jacket` indicates the column *in the boolean array* where an entry is `True`
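
The row-and-column structure of the `argwhere()` result is easier to read on a tiny array. A sketch with made-up values:

```python
import numpy as np

sample = np.array([[11.0, 16.5, 14.0],
                   [18.0, 10.5, 21.0]])

idx = np.argwhere(sample < 15)
print(idx)
# Each row of idx is a (row, column) pair: (0, 0), (0, 2), and (1, 1)

rows = idx[:, 0]
cols = idx[:, 1]
# Indexing with the row and column positions recovers the matching values
print(sample[rows, cols])
```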

For example, we can see that "jacket weather" occurs in the following months...

In [None]:
# Add 1 to get a familiar month number (because Python starts counting at zero)
np.unique(when_jacket[:,1]) + 1

That is: January, February, March, April, November, and December.

---

## Reshaping Arrays

As we've seen, if we want to get the transpose of an array, we can write:

In [None]:
temps.T.shape

In [None]:
# Now, the first row is the first month (January) of every year, not the first year
temps.T[0]

Are there other ways we can change the shape of arrays?

In [None]:
year1 = temps[0]
year1

In [None]:
year1.shape

In [None]:
# For example, what if we wanted an axis for seasons?
year1.reshape((4, 3))

Because `year1` has 12 months (one year), we can easily reshape that 1D array into a 2D array with $4 \times 3 = 12$ elements. Now:

- The first axis (4 elements) represents the season, approximated here as three-month blocks starting in January: Winter, Spring, Summer, Fall
- The second axis (3 elements) represents the month within each season

We can do this for the entire `temps` array, too!

In [None]:
# The same idea for the whole array: (years, seasons, months-within-season)
temps2 = temps.reshape((75, 4, 3))
temps2[0]

In the above example, we kept the first axis at 75 elements--that means 75 years. But we created a new, third axis, so the last two axes are now season (axis with 4 elements) and month-within-season (axis with 3 elements).

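`reshape()` only works if the new shape holds exactly the same number of elements as the original array. Our `temps` array has $75 \times 12 = 900$ elements, so the following raises a `ValueError`: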

In [None]:
temps.reshape((10, 12))

In [None]:
75 * 4 * 3

In [None]:
temps.size

When we use `reshape()` we want to be careful to make sure that the result is what we expected. There's obviously more than one way to take the rows and columns of the original array and break them apart into a different shape. [You can read more about the *order* in which array elements are reshaped here,](https://numpy.org/devdocs/user/absolute_beginners.html#can-you-reshape-an-array) but this topic is beyond what you need to know for this course.
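
As a quick check of what `reshape()` does by default, here is a short sketch using the month numbers 1 through 12 as stand-in data (the variable names are hypothetical):

```python
import numpy as np

months = np.arange(1, 13)          # 1, 2, ..., 12

# By default, reshape fills the new array row by row ("C" order)
quarters = months.reshape((4, 3))
print(quarters)                    # rows: [1 2 3], [4 5 6], [7 8 9], [10 11 12]

# -1 asks NumPy to infer the length of that axis from the total size
inferred = months.reshape((-1, 3))
print(inferred.shape)              # (4, 3)

# A shape with the wrong number of elements raises a ValueError
try:
    months.reshape((5, 3))
except ValueError as err:
    print("Cannot reshape:", err)
```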

**Often, we want to combine two or more arrays into a single array. There are several ways to do that...**

In [None]:
# Stack the first two years on top of each other (vertically, i.e., "v" in "vstack")
np.vstack([temps[0], temps[1]])

In [None]:
# Stack the first two years horizontally; result is a single 1D array of 24 values
np.hstack([temps[0], temps[1]])

The more general `stack()` function will let you specify the axis along which the arrays should be joined. For instance, if we wanted to stack the years together as columns instead of as rows...

In [None]:
np.stack([temps[0], temps[1]], axis = 1)
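
Comparing the shapes of the results is a quick way to tell these functions apart. The sketch below uses two made-up 12-element arrays as stand-ins for `temps[0]` and `temps[1]`.

```python
import numpy as np

year_a = np.arange(12)         # stand-in for temps[0]
year_b = np.arange(12, 24)     # stand-in for temps[1]

print(np.vstack([year_a, year_b]).shape)          # (2, 12): one row per year
print(np.hstack([year_a, year_b]).shape)          # (24,): one long 1D array
print(np.stack([year_a, year_b], axis=0).shape)   # (2, 12): same as vstack here
print(np.stack([year_a, year_b], axis=1).shape)   # (12, 2): one column per year
```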

---

## Plotting with matplotlib

There's a lot more we can do with NumPy arrays, but first we should learn how to visualize multi-dimensional array data, at least to make things more interesting.

**`matplotlib` is a basic plotting library for Python. Most of the time we use it, we'll actually import the `pyplot` submodule.**

In [None]:
from matplotlib import pyplot

pyplot.imshow(temps, aspect = 1/4)

Above, we've created a heatmap using `pyplot.imshow()` or "image-show," a quick way to plot 2D arrays as if they were images. With the default colormap, brighter (yellow) colors represent higher temperatures. Can you identify a seasonal pattern across the years?

Next, let's look at average temperature over the years.

In [None]:
avg_temp = temps.mean(axis = 1)
pyplot.plot(avg_temp)

**Let's try to understand this last example.**

```py
help(pyplot.plot)
```

**How can we improve how this plot looks?**

In [None]:
years = np.arange(1948, 2023)
years
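
Like slicing, `np.arange()` stops *before* its second argument. The sketch below confirms that, and also shows one possible way to look up the row index for a given year; 1990 is just an illustrative choice.

```python
import numpy as np

years = np.arange(1948, 2023)    # the stop value, 2023, is not included
print(years[0], years[-1], years.size)   # 1948 2022 75

# Find the row index for a given year
row = int(np.argwhere(years == 1990)[0, 0])
print(row)                       # 42, the same as 1990 - 1948
```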

In [None]:
pyplot.plot(years, avg_temp)
pyplot.show()

In [None]:
pyplot.plot(years, avg_temp)
pyplot.xlabel('Year')
pyplot.ylabel('Mean Monthly Temperature (deg C)')
pyplot.show()

In [None]:
pyplot.plot(years, avg_temp)
pyplot.xlabel('Year')
pyplot.ylabel('Mean Monthly Temperature (deg C)')
pyplot.xlim(1948, 2000)
pyplot.show()

**What other kinds of plots can we make?**

Earlier, when we typed `pyplot.plot(avg_temp)`, we got a line plot. `pyplot` got a 1D array of values and just assumed the data were equally spaced and that we wanted to connect them with a line.

But `pyplot.plot()` is a more general tool; it can plot any `y` value against any `x` value.

In [None]:
# Plotting July temperatures against January temperatures
pyplot.plot(temps[:,0], temps[:,6])
pyplot.show()

This would look a lot cleaner if we told `pyplot` how to represent the points; and, specifically, to represent them as dots instead of a connected line.

In [None]:
pyplot.plot(temps[:,0], temps[:,6], 'k.')
pyplot.show()

`'k.'` is a formatting string; it's a short-hand notation understood by `pyplot` that codes for two things:

- Show the points as dots, `.` as the dot symbol
- Color the points as black, `k` for "blacK" ("b" represents blue)

[You can see more examples of formats in the `pyplot` documentation.](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot)
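
Here is a short sketch of a few other format strings, drawn with made-up data. Each call to `pyplot.plot()` returns a list of line objects, which we can inspect to see how the format string was interpreted.

```python
import numpy as np
from matplotlib import pyplot

x = np.arange(10)
y = x ** 2

# 'r--' is a red dashed line; 'k.' is black dots; 'bo' is blue circles
(dashed,) = pyplot.plot(x, y, 'r--')
(dots,) = pyplot.plot(x, y + 10, 'k.')
(circles,) = pyplot.plot(x, y + 20, 'bo')

# Inspect what pyplot understood from the 'k.' format string
print(dots.get_color(), dots.get_marker())
```

As in the earlier examples, call `pyplot.show()` afterward to display the figure.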

Alternatively, we could write:

In [None]:
pyplot.scatter(temps[:,0], temps[:,6])
pyplot.show()

---

## More Resources

- [Visual introduction to NumPy](https://jmsevillam.github.io/slides/Python/Numpy.slides.html#/)