# Python Packages  <a name='home' />

**Table of Contents:**
- <a href=#bookmark0>0. Overview on Modules</a> 
- <a href=#bookmark1>1. Numpy</a> 
- <a href=#bookmark2>2. Matplotlib</a> 
- <a href=#bookmark3>3. Cartopy</a> 

## Numpy, MatPlotLib, Cartopy, and Netcdfs/xarray <a name='bookmark0' />

In Python, a package is a bundle of pre-built functions that adds to the functionality available in base Python. Base Python can do many things such as perform math and other operations. However, Python packages can significantly extend this functionality.

You can think of a Python package as a toolbox filled with tools. The tools in the toolbox can be used to do things that you would have to otherwise hand code in base Python. These tasks are things that many people might want to do in Python, thus warranting the creation of a package. It doesn’t make sense for everyone to hand code everything.

There are many different packages available for Python. Some of these are optimized for scientific tasks such as:
* Statistics
* Machine learning
* Using geospatial data
* Plotting & visualizing data
* Accessing data programmatically

Some important packages include: 
* **os:** handle files and directories.
* **glob:** create lists of files and directories for batch processing.
* **matplotlib:** plot data.
* **numpy:** work with data in array formats (often related to imagery and raster format data).
* **scipy:** work with many algorithms for doing scientific computing, e.g. statistical analyses.
* **pandas:** work with tabular data in a DataFrame format.
* **rasterio:** work with raster (image and arrays) data.
* **geopandas:** work with vector format (shapefiles, geojson - points, lines and polygons) using a geodataframe format.
* **cartopy:** plot and manipulate spatial data (raster and vector).

### Python Packages Can Contain Modules
Every Python package should have a unique name. This allows you to import the package using the name with the `import` command.

For example, the command below imports the matplotlib package.

```python
import matplotlib
```

Packages often have modules (i.e. units of code) that each provide different functions and can build on each other. Stealing directly from this [StackOverflow answer](https://stackoverflow.com/questions/7948494/whats-the-difference-between-a-module-and-package-in-python), *Any Python file is a module, its name being the file's base name without the `.py` extension.
A package is a collection of Python modules: while a module is a single Python file, a package is a directory of Python modules containing an additional `__init__.py` file, to distinguish a package from a directory that just happens to contain a bunch of Python scripts. Packages can be nested to any depth, provided that the corresponding directories contain their own `__init__.py` file.
The distinction between module and package seems to hold just at the file system level. When you import a module or a package, the corresponding object created by Python is always of type module. Note, however, when you import a package, only variables/functions/classes in the `__init__.py` file of that package are directly visible, not sub-packages or modules.*
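To make the module/package distinction concrete, here is a small sketch. It uses the standard library's `json` package purely as a stand-in (it is not otherwise part of this lesson), because it ships with every Python install. Both imports give you an object of type `module`, but only the package carries a `__path__` attribute pointing at its directory.

```python
import types
import json          # a package: a directory containing __init__.py
import json.decoder  # a module: a single .py file inside that package

# Once imported, both are the same kind of object
print(isinstance(json, types.ModuleType))
print(isinstance(json.decoder, types.ModuleType))

# Only packages have a __path__ attribute (the directory they live in)
is_package = hasattr(json, "__path__")
is_plain_module = not hasattr(json.decoder, "__path__")
print(is_package, is_plain_module)
```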

The matplotlib package has a commonly used module called **pyplot**. **Pyplot** makes it easier to quickly set up plots. You can import a specific module like **pyplot** by first calling the package name and then the module name, using `.` to separate the names like this:

```python
import matplotlib.pyplot
```

But the better way to import both packages and modules is to use an *alias*:
```python
import numpy as np
import matplotlib.pyplot as plt
```

Aliases allow you to call functions from the imported package and/or module using the short name, rather than having to type out the full name of the package and/or module each time that you want to call a function from it. For numpy, we're importing all of the scripts associated with that package. Matplotlib is huge, so we often only want to import the scripts associated with pyplot.

Hot tip: You can get a list of everything available in a package or module using `dir(np)`. A list of names (functions, classes, submodules, and constants) will appear.


In [None]:
import numpy as np
dir(np)

<a href=#home>Return to Top</a> 

## 1. Numpy <a name='bookmark1' />

Numpy arrays are a commonly used data structure in Python. You can think of arrays as a grid, or matrix.

Like lists, numpy arrays are also composed of ordered values (called elements) and use indexing to organize and manipulate the elements in the numpy arrays. But, unlike lists, all elements in the array must be the same data type (i.e., all integers, floats, strings, etc). 

Numpy arrays can also have N dimensions (while lists, and tuples, etc only have one, and pandas dataframes have two). So think, raster stack. 

And, unlike lists, which you can just use in native python, you need packages to deal with arrays. Numpy is the most commonly-used package for arrays. And it is also the package that more sophisticated packages, like [xarray](https://xarray.dev/) and [rioxarray](https://corteva.github.io/rioxarray/stable/), are built off of.

Arrays are created with the `array()` function, e.g. `np.array([1, 2, 3])`.

#### Key differences between lists and Numpy Arrays
* Unlike a list, elements in a numpy array must be the same data type
* Because of this, numpy arrays support arithmetic and other mathematical operations that run on each element of the array (e.g. element-by-element multiplication). You cannot directly apply a numeric calculation to a list; typically you have to loop over each element and apply the calculation one element at a time.
* Unlike a list, you can't remove or add elements to an array, though you can modify elements already in an array. If you add rows or columns, you will be creating a new array each time. 
* Numpy arrays can store data along N DIMENSIONS. This makes numpy arrays super efficient if you're using a large dataset.
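The differences above can be sketched in a few lines. The numbers are made up for illustration and the variable names (`temps_list`, `temps_arr`, and so on) are not used elsewhere in this lesson.

```python
import numpy as np

temps_list = [30.0, 45.5, 60.0]
temps_arr = np.array(temps_list)

# Multiplying a list repeats it; multiplying an array scales each element
doubled_list = temps_list * 2
doubled_arr = temps_arr * 2
print(doubled_list)
print(doubled_arr)

# Mixing integers and floats: numpy converts everything to one common type
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# "Adding" an element returns a brand new array; the original is unchanged
longer = np.append(temps_arr, 75.0)
print(temps_arr.size, longer.size)
```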

### 1.1. Dimensionality of Numpy Arrays
Numpy arrays can be
* 1. 1-D composed of values along one dimension - resembling a list
* 2. 2-D composed of rows of individual arrays with one or more columns - like a pandas data frame but just the values without headers.
* 3. N-D composed of nested arrays with one or more dimensions. Prof Spera uses a lot of 3D datasets because there's a time dimension. A global daily precipitation dataset will have rows/columns with data points of precip values and then a third, z / [time dimension](https://xarray.pydata.org/en/stable/_images/dataset-diagram.png). Dr. Yang mostly works with 2D arrays, but there's functionality for N-d arrays, such as connecting latitude, longitude, time, and a genotype, for example.

Brackets `[]` define the dimensions of a numpy array: each additional level of nested brackets adds a dimension.

In [None]:
# Make a 1-d array of average precipitation for Jan, Feb, and March in Boulder, CO in 2023
avg_monthly_pcp = np.array([0.7, 0.75, 1.85])
print(avg_monthly_pcp)
print(type(avg_monthly_pcp))

In [None]:
# Create a 2-d array of monthly precip Jan through Mar in 2002 and 2013
pcp02_13 = np.array([
    [1.07, 0.44, 1.50],
    [0.27, 1.13, 1.72]
])
print(pcp02_13)


### 1.2 Reading in a dataset
Prof Spera is of the mind that one of the best ways to learn numpy is to manipulate real data. So let's have at it. First let's inspect the data, which is here in this directory: `/scratch/myang_shared/lab/PythonBootcamp/Sp25/resources/Lesson4_2025/`. These data were downloaded from NOAA [here](https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USW00013740/detail).

You can either look at the contents in the Terminal using some of our handy Linux commands. Or, in Jupyter Lab, navigate to the `Lesson4_2025/` folder and double click on RVA_janfeb2025.csv to quickly look at the data. 

Regardless of how you inspect the data, you should see a description of the data, some column/variable headings, and then some data itself. Let's load this into numpy. 

(Running the next code block will kick out an error - before reading on, can you make a guess as to why?)

In [None]:
import numpy as np
file = "/scratch/myang_shared/lab/PythonBootcamp/Sp25/resources/Lesson4_2025/RVA_janfeb2025.csv"
tmpdata = np.genfromtxt(file)


Note the metadata/header at the top of the file:

    ```
    "Daily temperatures (mean, min, max) for Richmond, VA. Jan 1, 2025 - Feb 18, 2025",,,
    Datasource: NCDC Daily Data,,,
    "Station: USW00013740, Richmond Airport",,,
    ,,,
    YEARMONTHDAY,TAVG,TMAX,TMIN
    ```

This part of the text file provides basic information about the file contents. While that is helpful to us, it is upsetting numpy. Remember, we only want the numpy elements to be the data themselves. Because all the data have to be the same data type, we only want the rows after this text block to be incorporated in the numpy array. Note also that the array needs to be a rectangle or square.

So, let's try and read it in again. This time, 1) use `skip_header` to skip the header lines, and 2) specify it's a comma delimited file using `delimiter`. Note that we skip the top 5 lines and not just the 3 description lines, because we also need to ignore the empty `,,,` line (4th line) and the column names (5th line).

In [None]:
tmpdata = np.genfromtxt(file, skip_header=5, delimiter=',')
print(tmpdata)

type(tmpdata)

So, we just read our data into a NumPy ndarray, which is NumPy's n-dimensional structure for storing data like a matrix. In our case we have a two dimensional data structure similar to a spreadsheet. Everything is together in a single large data structure at the moment, but we’ll see later in the lesson how to divide up our data and make interacting with it easier.
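If you don't have access to the shared directory, the same read can be tried on an in-memory copy. In this sketch the five header lines are copied from the file shown above, but the three data rows are invented for illustration, and `sample_text` and `sample` are names made up for this example.

```python
import io
import numpy as np

# Header copied from the lesson file; the three data rows are invented
sample_text = '''"Daily temperatures (mean, min, max) for Richmond, VA. Jan 1, 2025 - Feb 18, 2025",,,
Datasource: NCDC Daily Data,,,
"Station: USW00013740, Richmond Airport",,,
,,,
YEARMONTHDAY,TAVG,TMAX,TMIN
20250101,45,55,38
20250102,40,48,33
20250103,36,44,29
'''

# io.StringIO lets genfromtxt treat the string like an open file
sample = np.genfromtxt(io.StringIO(sample_text), skip_header=5, delimiter=',')
print(sample.shape)
print(sample)
```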

### 1.3 Exploring a dataset

Let’s now have a look at the data types in our ndarray. We can find this in the `dtype` attribute that is part of the ndarray data type, something that is known automatically for this kind of data.

In [None]:
print(tmpdata.dtype)

Our data are floating point values with 64-bit precision. Apart from a few special cases, all of the elements in a NumPy array share the same data type.

Now let's check how many dimensions there are using `ndim` and how many rows and columns we have using `shape`.

In [None]:
print('dimensions:', tmpdata.ndim)
print('shape:', tmpdata.shape)

We have 49 rows of data (one per day from Jan 1 through Feb 18) and 4 columns.

### 1.4 Indexing our array

Like lists, arrays can be indexed and sliced using index values. Let's look at our data again and focus on the data in the third row and first column.
The key thing to remember is that Python indexing starts at 0.

In [None]:
##Uncomment below if you want to have a look at the entire array. 
##print(tmpdata)

##The last line of this cell just shows the number that is found in the third row, first column.
tmpdata[2,0]

`tmpdata[2,0]` gives us the value of the data in the third row (which is row 2 in python because python indexing starts at 0), and first column. 

Similar to list or string slicing, we can also obtain ranges of rows and columns using `:`. For example, to get the first 10 rows in the first column, we slice rows [0, 10): index 0 up to, but not including, index 10.

In [None]:
tmpdata[:10, 0] #here, we get all the rows up to, but not including, index 10
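Here is a small sketch of the same indexing rules on a toy array, so you can check each result by eye. The array `grid` is made up for this example and simply holds the numbers 0 through 11.

```python
import numpy as np

# A small 4-row, 3-column array standing in for the temperature table
grid = np.arange(12).reshape(4, 3)
print(grid)

third_row_first_col = grid[2, 0]   # a single element
first_two_rows_col0 = grid[:2, 0]  # rows 0 and 1 of column 0
whole_second_col = grid[:, 1]      # every row of column 1
last_row = grid[-1, :]             # negative indices count from the end

print(third_row_first_col)
print(first_two_rows_col0)
print(whole_second_col)
print(last_row)
```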

The column headers for this dataset were YEARMONTHDAY, TAVG, TMAX, and TMIN. Sometimes it's easier to break the data apart and put them back together. So, I'm going to extract all the values from a given column and make them each individual variables (date, average temp (tavg), maximum temp (tmax), and minimum temp (tmin)).

In [None]:
date = tmpdata[:,0]
# print dates
print("dates:", date)

tavg = tmpdata[:,1]
tmax = tmpdata[:,2]
tmin = tmpdata[:,3]

# print min temps
print("min temps:", tmin)

Remember those magic commands? We can see all the variables we have defined at this point, along with their data types, using the `%whos` magic command. When we do this, we might realize that it is very foolish - even though this is a small file - to be storing these data that have no precision past a decimal as floating point, so let's at least convert the date column into the `int` datatype.

In [None]:
%whos

date = date.astype('int')

print(date)


### 1.5 Basic data calculations

Numpy arrays have a set of attributes and methods to make calculations using the data in the array. Useful methods include `.mean()`, `.min()`, `.max()`, and `.std()`. Arrays do not have a `.median()` method, so for the median use the function `np.median()` instead.

Let's find the average temperature it's been in Richmond in 2025. (The historical average is around 37 deg F. This is the first time in about 40 years we've had a winter with close to average (1850-2010) temperatures; last winter was 7 degrees warmer.)


In [None]:
print(tavg.mean()) 

QUICK NOTE: We can do the same calculation on the big dataset by summarizing along an axis. `axis=0` summarizes down the vertical axis, over all the rows in each column; `axis=1` summarizes along the horizontal axis, over all the columns in each row.

In [None]:
#np.set_printoptions(suppress=True) # only toggle this on if you hate scientific notation as much as Prof Spera (switch to False to return to scientific notation)
print(np.mean(tmpdata, axis=0)) # this will print the average of each column: date, tavg, tmax, and tmin
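The two axis options are easier to see on a tiny array. This sketch uses a made-up 2-row, 3-column `table` rather than the temperature data.

```python
import numpy as np

table = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

col_means = table.mean(axis=0)  # collapses the rows: one value per column
row_means = table.mean(axis=1)  # collapses the columns: one value per row

print(col_means)
print(row_means)
```

A handy way to remember it: the axis you name is the one that disappears from the result's shape.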

People love numpy because it is super easy to do calculations. Remember back in Lesson 1 where we had to figure out how to take one value in Fahrenheit and convert it to Celsius? Well, now that we have numpy:

In [None]:
print("average RVA temp in F:", tavg)

tavg_C = (tavg - 32) / (9/5)

print("average RVA temp in C:", tavg_C)

### 1.6 Filtering data and using masks

We can also filter data based on certain criteria. Let's say we wanted to know which days had average temps above 60 F/15 C in 2025, and then do the same but only for the month of January (the dataset currently includes both January and February).

In [None]:
# First let's see if there were any days in our whole dataset above 60 F (about 15 C).
warmTemps = tavg_C[tavg_C > 15]
print(warmTemps)

# There was one day. Did it occur in Jan?
warmTempsJan = tavg_C[(tavg_C > 15) & (date < 20250201)]
print(warmTempsJan) 
print("There were", len(warmTempsJan), "days in January w average temps above 15 C")

Last year, there were 3 days > 60F in Jan.

It's more likely that instead of filtering data, you'll want to mask it. So, what if we want to identify the dates with temperatures above 15C and keep only those dates in all of our other data columns. To do that, we'll use a mask array. 

A mask array is basically a boolean (True/False) array that can be used to take a subset of data from other data. So, instead of extracting warm temperatures directly, we'll first identify where those elements in `tavg_C` are greater than 15C (`True`).  

In [None]:
warmTempsMask = tavg_C > 15
print(warmTempsMask)

We can see there was one day in Feb above 60. Now we can use this mask on other columns of data.

In [None]:
warmTempDates = date[warmTempsMask]
print(warmTempDates) 
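As a self-contained sketch of the masking idea, the example below uses a few invented values (not taken from the lesson file) and names prefixed with `demo_` so it won't overwrite the variables you already made. It also shows that a mask can be counted with `.sum()`, since `True` counts as 1.

```python
import numpy as np

# Invented values for illustration only
demo_date = np.array([20250130, 20250131, 20250201, 20250202])
demo_tavg_C = np.array([16.0, 3.0, 17.5, 9.0])
demo_tmin_C = np.array([10.0, -2.0, 11.0, 4.0])

demo_warm = demo_tavg_C > 15
# Parentheses around the comparison are required when combining with &
demo_warm_jan = demo_warm & (demo_date < 20250201)

n_warm = demo_warm.sum()               # number of True values
warm_jan_dates = demo_date[demo_warm_jan]
warm_day_mins = demo_tmin_C[demo_warm]  # same mask applied to another column

print(n_warm)
print(warm_jan_dates)
print(warm_day_mins)
```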

### 1.7 Removing missing and bad data

In some cases, a data file might contain missing values or values that cannot be read. These may be replaced by `nan` values when you look at your data. `nan` stands for “not a number”, and often we want to get rid of these things.

Let’s consider a case where we have an array `badData` that is full of zeros, has the same size as `date` and the other arrays from our data file, and the first 5 rows have `nan` values.

In [None]:
#Create some bad data in an array the length of our date matrix just as an example
badData = np.zeros(len(date))
badData[:5] = np.nan
print(badData)

If we wanted to include values for the date column that only correspond to locations in `badData` where we do not have a nan value, we can use the `isfinite()` function in NumPy. `isfinite()` checks to see if a value is defined (i.e., is not `nan` or infinite (`inf`)). 

So, we'll make a mask to find where the data is **finite** and only pull the dates associated with the good data.

In [None]:
badDataMask = np.isfinite(badData)
print(badDataMask)
goodDates = date[badDataMask]
print(goodDates)
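Why bother removing `nan` values? A short sketch with made-up readings shows what happens to a calculation when one value is missing, and two ways around it: masking with `isfinite()` as above, or using NumPy's `nanmean()` function, which skips `nan` values.

```python
import numpy as np

readings = np.array([1.0, np.nan, 3.0])

plain_mean = readings.mean()                           # nan: one missing value spoils the result
masked_mean = readings[np.isfinite(readings)].mean()   # mask first, then average
skip_nan_mean = np.nanmean(readings)                   # ignores nan values directly

print(plain_mean, masked_mean, skip_nan_mean)
```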

### 1.8 Re-creating and saving an array

We've mostly worked with single columns that we sliced out after bringing in our temperature data. We can re-create the 2d data structure by stacking these columns back together. Let's pull together our `date` and `tavg_C` columns in a new array called `tempdata_C`. We'll use the `vstack()` function.

In [None]:
tempdata_C = np.vstack((date,tavg_C))
print(tempdata_C)

Something looks weird here. Let's investigate.

In [None]:
print(tempdata_C.shape)

It looks like there are two rows and 49 columns. The columns and rows need to be flipped. We can do this using `transpose()`.

In [None]:
tempdata_C = np.transpose(tempdata_C)
print(tempdata_C.shape)
print(tempdata_C)

Now that our data looks good, we can save it using `savetxt()`.

In [None]:
np.savetxt('rva_temps_C.csv', tempdata_C, delimiter=',') #check to see if it saved!

Note that instead of using the `vstack()` function followed by the `transpose()` function, we could have used the `column_stack()` function to directly get the data set up as we wanted. (The `hstack()` function would not do this: on 1-D arrays it simply joins them end to end into one long 1-D array.)
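The stacking functions are easy to mix up, so here is a sketch comparing the shapes they produce for two short 1-D arrays. The values and the `demo_` names are made up for illustration.

```python
import numpy as np

demo_dates = np.array([20250101, 20250102, 20250103])
demo_temps = np.array([7.2, 4.4, 2.2])

stacked_rows = np.vstack((demo_dates, demo_temps))        # each array becomes a row
flipped = np.transpose(stacked_rows)                      # rows and columns swapped
stacked_cols = np.column_stack((demo_dates, demo_temps))  # each array becomes a column
joined = np.hstack((demo_dates, demo_temps))              # 1-D arrays joined end to end

print(stacked_rows.shape)
print(flipped.shape)
print(stacked_cols.shape)
print(joined.shape)
```

On 1-D inputs, `column_stack()` gives the same result as `vstack()` followed by `transpose()`, while `hstack()` stays one-dimensional.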

## Knowledge Check 1
Find the average minimum temperature and average maximum temperature during the first week of February. Then, determine how many days went below freezing and print out those dates. 

<a href=#home>Return to Top</a> 

# 2. Matplotlib <a name='bookmark2' />

Python has a **ton** of useful [libraries for plotting](https://rougier.github.io/python-visualization-landscape/landscape-colors.png). Matplotlib is probably the most widely used and it is [*polarizing*](https://www.reddit.com/r/Python/comments/p59091/anyone_else_despises_matplotlib/). The *cooler* [Seaborn](https://seaborn.pydata.org/) is built on top of Matplotlib.

Matplotlib used to have a 'basemap' toolkit for making maps, but that's being deprecated in favor of [Cartopy](https://scitools.org.uk/cartopy/docs/latest/), which we'll briefly touch on below.

[Bokeh](https://docs.bokeh.org/en/latest/), [Plotly](https://plotly.com/python/), and [Dash](https://plotly.com/dash/) are fancy plotters.

And, if you love ggplot2 (ALEX), python has its own [ggplot](https://yhat.github.io/ggpy/). Here, we'll walk you through the standard matplotlib package, but great documentation exists for all of these plotting packages.

## 2.1 Plotting basics
There are a bunch of different types of plots:
* [Bar chart](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html)
* [Histogram](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html)
* [Scatter plot](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html)
* [Line chart](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html)
* [Pie chart](https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_features.html)
* [Violin chart](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.violinplot.html) # a better box plot
* [Dendrogram](https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html)
 
Most - if not all - of these plots have common elements. The figure hyperlinked [here](https://geo-python-site.readthedocs.io/en/2022.0/_images/basic-elements-of-plot.png) has a lot of the basic plot elements. 

Some common terms are:
* *axis* - axes of graph (x, y, z if you're extra)
* *title* - plot title
* *label* - name of axis (xlabel or ylabel)
* *legend* - plot legend
* *tick label* - text/values represented on the axis
* *symbol* - symbol for data points (on a scatter plot usually) that can be presented with different shapes and colors
* *size* - size of a point on a scatter plot OR text sizes on a plot
* *line style* - how the line should be drawn: solid, dashed, those stars
* *line width* - width of line
* *alpha* - transparency level of a filled element on a plot: 0 = fully transparent, 1 = fully opaque 
* *ticks* - tick marks
* *annotation* - text added to a plot
* *padding* - distance between an axis/tick label and axis


## 2.2 Loading and manipulating our data

We need to import matplotlib, and we're going to use our dataset from earlier for some visualizations.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

filepath = "/scratch/myang_shared/lab/PythonBootcamp/Sp25/resources/Lesson4_2025/RVA_janfeb2025.csv"
data = np.genfromtxt(filepath, skip_header=5, delimiter=',')
print(data)

#I'm going to subset the data to just february
data=data[31:]

In [None]:
# Pulling out each column like above
date = data[:,0]
tavg = data[:,1]
tmax = data[:,2]
tmin = data[:,3]


## 2.3 Plotting

Plots are automatically displayed in Jupyter notebooks, so you don't need `plt.show()`. But, if you're just working in Anaconda or from the terminal, you'll need to add that text.

In [None]:
x = date
y = tavg
plt.plot(x, y)

#plt.show()

In [None]:
#let's make our graph better
plt.plot(x, y, 'ro--')
plt.title('RVA temperatures Feb 2025')
plt.xlabel('Day of Month')
plt.ylabel('Temperature (°F)')

Now we see our temperature data as a red dashed line with circles showing the data points. This comes from the additional `ro--` used with `plt.plot()`. In this case, `r` tells the `plt.plot()` function to use red [color](https://matplotlib.org/stable/gallery/color/named_colors.html), `o` tells it to show circles at the [points](https://matplotlib.org/1.4.1/api/markers_api.html), and `--` says to use a dashed line. You can use `help(plt.plot)` to find out more about formatting plots, OR, click on [this link](https://matplotlib.org/1.4.1/api/axes_api.html#matplotlib.axes.Axes.plot) to see all of the great documentation.
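The `'ro--'` shorthand can also be spelled out with keyword arguments, which some people find easier to read. This is a sketch with made-up numbers (`demo_x`, `demo_y`); it asks the line object that `plt.plot()` returns what color, marker, and line style it ended up with.

```python
import matplotlib.pyplot as plt

demo_x = [1, 2, 3, 4]
demo_y = [40, 45, 38, 50]

# Same look as the 'ro--' shorthand, written with keyword arguments
lines = plt.plot(demo_x, demo_y, color='r', marker='o', linestyle='--')

# plt.plot() returns a list with one line object per plotted line
line = lines[0]
print(line.get_color(), line.get_marker(), line.get_linestyle())
```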

### Embiggening

While the plot sizes we’re working with are OK, you can display them to be any size you want in Jupyter notebooks. To set the default plot size to be larger, simply run the Python code in a new cell below.

```python
plt.rcParams['figure.figsize'] = [12, 6]
```
That would set the default plot size to be 12 inches wide by 6 inches tall. You can run that and then rerun any of the plotting cells and see what happens.

### Text labels
If you want to add text to plots, you can do that using `plt.text()`.

```python
plt.text(20240126, 72, 'Too hot')
```

This will display the aforementioned text at the location `x = 20240126` (i.e., Jan 26, 2024), at `y = 72`, on the plot. We’ll see how to do this in a real example in just a second. With our approach to plotting thus far (we'll see another approach in Section 2.5), the commands related to an individual plot should all be in the same Python cell.

### Changing the axis ranges

We can change the plot axes using `plt.axis()`
```python
plt.axis([20240101, 20240131, 20, 80])
```

The format for `plt.axis()` is `[xmin, xmax, ymin, ymax]` in square brackets (because it's a list!). Let's put this together.

In [None]:
plt.plot(x, y, 'ro--')
plt.title('RVA temperatures Feb 2025')
plt.xlabel('Date')
plt.ylabel('Temperature (°F)')
plt.text(20250204, 62, 'Too hot')
plt.axis([20250201, 20250218, 30, 65])
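Because the dates are stored as plain numbers like 20250201, the x-axis tick labels can look awkward. One possible workaround, assuming the dates are numbers in YEARMONTHDAY form as in this file, is to split them into year, month, and day with integer arithmetic and plot against the day of the month instead. The `demo_dates` array below is made up for illustration.

```python
import numpy as np

demo_dates = np.array([20250201, 20250202, 20250218])

day_of_month = demo_dates % 100     # last two digits
month = (demo_dates // 100) % 100   # middle two digits
year = demo_dates // 10000          # first four digits

print(year)
print(month)
print(day_of_month)
```

You could then pass `day_of_month` as the x values to `plt.plot()` and label the axis 'Day of Month'.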

## 2.4 Customizing Plots

We don't need to use numpy data or pandas data or even text data to create plots. We'll work with two lists to learn some of the myriad ways we can customize our plots.

We're going to look at average monthly precipitation in Richmond between 1991 and 2020. 

In [None]:
# creating lists using data found here: 
# https://www.weather.gov/media/akq/climateRECORDS/RIC_Climate_Records.pdf

rva_pcp = [3.21, 2.61, 4.0, 3.18, 4.0, 4.64, 4.37, 4.9, 4.61, 3.39, 3.06, 3.51]
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

plt.plot(months, rva_pcp)

#plt.show()

Note that the output displays the object type as well as the unique identifier (or the memory location) for the figure.

You can hide this information from the output by adding `plt.show()` (or uncommenting it above) as the last line you call in your plot code.

#### Let's get customizing. 
By default, `plt.plot()` creates a line plot. This is not an ideal situation for plotting precipitation. 

What about a scatter plot?


In [None]:
plt.scatter(months,rva_pcp)
plt.show()

Okay but this is useless for precipitation too. Let's make a bar chart instead.

In [None]:
plt.bar(months, rva_pcp)
plt.show()

Great, let's add a title and axis labels like we learned above, but with a fancy line break designated by `\n`.

In [None]:
plt.bar(months, rva_pcp)

plt.title('Average Monthly Precipitation in RVA\n1991-2020')
plt.xlabel('Month')
plt.ylabel('Precipitation\n(inches)')
plt.show()

Now let's adjust the transparency using `alpha` and color. 

In [None]:
plt.bar(months, 
        rva_pcp, 
        color = 'dodgerblue', 
        edgecolor = 'darkslateblue', 
        alpha = 0.25)

plt.title('Average Monthly Precipitation in RVA\n1991-2020')
plt.xlabel('Month')
plt.ylabel('Precipitation\n(inches)')
plt.show()

#### Saving your plots

Saving plots created using Matplotlib can be done several ways. The recommendation for use outside of Jupyter notebooks is to use the `plt.savefig()` function. When using `plt.savefig()`, you simply give a list of commands to generate a plot and list `plt.savefig()` with some parameters as the last command. The file name is required, and the image format will be determined based on the listed file extension.

Matplotlib plots can be saved in a number of useful file formats, including PNG, PDF, and EPS. PNG is a nice format for raster images, and EPS is probably easiest to use for vector graphics. Let’s check out an example and save our lovely bar plot.

In [None]:
plt.bar(months, 
        rva_pcp, 
        color = 'dodgerblue', 
        edgecolor = 'darkslateblue', 
        alpha = 0.25)

plt.title('Average Monthly Precipitation in RVA\n1991-2020')
plt.xlabel('Month')
plt.ylabel('Precipitation\n(inches)')
plt.savefig('rva_month_pcp.png')
plt.show()

Check your folder to see if a PNG file was created - click on it to take a look!
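`plt.savefig()` also accepts optional parameters that are often useful. The sketch below uses two of them: `dpi` (resolution) and `bbox_inches='tight'` (trim extra white space). It writes to a temporary folder, an assumption made here just so the example doesn't clutter your working directory, and uses only the first three months of the precipitation list.

```python
import os
import tempfile
import matplotlib.pyplot as plt

demo_months = ["Jan", "Feb", "Mar"]
demo_pcp = [3.21, 2.61, 4.0]   # first three values from the list above

plt.figure()                   # start a fresh figure
plt.bar(demo_months, demo_pcp, color='dodgerblue')
plt.ylabel('Precipitation\n(inches)')

# Save into a temporary folder at a higher resolution with tight margins
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "demo_pcp.png")
plt.savefig(out_path, dpi=150, bbox_inches='tight')

print(os.path.exists(out_path))
```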

### Knowledge Check 2
Using *RVA_PRECIP_janfeb2025.csv* data found in that same Lesson4_2025 folder, plot either total daily precipitation, snowfall, or snow depth in **January 2025** as a bar graph. Try out some formatting to make your graph look nice!

<a href=#home>Return to Top</a> 

## 2.5 Subplots

If we wanted to level up, we could use subplots. Earlier we said that everything needed to be in the same Jupyter notebook cell block if we wanted it to all plot together. But, that wasn't giving matplotlib enough credit. 

Matplotlib uses an object oriented approach to plotting. This means that plots can be built step-by-step by adding new elements to the plot.

There are two primary objects associated with a matplotlib plot:
* `figure` object: the overall figure space that can contain one or more plots.
* `axis` objects: the individual plots that are rendered within the figure.

You can think of the figure object as your plot canvas. You can think about the axis object as an individual plot. See [this image](https://www.earthdatascience.org/images/earth-analytics/plot-data/fig-1-plot.png).

A figure can hold one or more axis objects. This structure allows you to create figures with one or more plots on them.

To create a plot using matplotlib’s object oriented approach, you first create the figure (which you can call `fig`) and at least one axis (which you can call `ax`) using the `subplots()` function from the pyplot module:

`fig, ax = plt.subplots()`

Notice that the `fig` and `ax` are created at the same time by setting them equal to the output of the `pyplot.subplots()` function. As no other arguments have been provided, the result is a figure with one plot that is empty but ready for data visualization.

In [None]:
# Create figure and one plot (axis object) 
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()

This format comes in handy, particularly, when working with [multiplot figures](https://www.earthdatascience.org/images/earth-analytics/plot-data/fig-4-plots.png).

Note: In the example above, `fig` and `ax` are variable names for the figure and axis objects. You can call these items whatever you want. For example, you might see f, ax or fig, axis1 used.

When adding more than one axis object, it is good practice to give them distinct names (such as `ax1` and `ax2`), so you can easily work with each `axis` individually.

You will need to provide new arguments to `plt.subplots` for the layout of the figure: number of rows and columns:

`plt.subplots(1, 2)`

In this example, 1, 2 indicates that you want the plot layout to be 1 row across 2 columns.

In [None]:
# Figure with two plots
fig, (ax1, ax2) = plt.subplots(1, 2)

Conversely, (2, 1) indicates that you want the plot layout to be 2 rows across one column.

In [None]:
# Figure with two plots
fig, (ax1, ax2) = plt.subplots(2, 1)

A key benefit of the matplotlib object oriented approach is that each axis is its own object and can be customized independently of the other plots in the figure.
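As a sketch of that independence, the example below puts a bar chart and a line chart side by side and labels each one separately. The numbers are the first three monthly precipitation values from Section 2.4, and the names `fig_demo`, `ax_left`, and `ax_right` are made up for this example. Note that with axis objects the methods are `set_title()`, `set_xlabel()`, and `set_ylabel()` rather than `plt.title()` and friends.

```python
import matplotlib.pyplot as plt

fig_demo, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# Each axis has its own plotting and labeling methods
ax_left.bar(["Jan", "Feb", "Mar"], [3.21, 2.61, 4.0], color="dodgerblue")
ax_left.set_title("Bar chart")
ax_left.set_ylabel("Precipitation (inches)")

ax_right.plot([1, 2, 3], [3.21, 2.61, 4.0], "ro--")
ax_right.set_title("Line chart")
ax_right.set_xlabel("Month number")
```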

Let's make a subplot with some seasonal minimum temperature data from Acadia National Park from 1950-2021 that Prof Spera has worked with.
The data files are in the `/scratch/myang_shared/lab/PythonBootcamp/Sp25/resources/Lesson4_2025/` directory.
* djf_tmin_acadia.csv
* mam_tmin_acadia.csv
* jja_tmin_acadia.csv
* son_tmin_acadia.csv

Check out the data before you load it. Note that each data file indicates the average minimum temperature over 1950-2021 for a set of three months. 'djf' stands for 'December, January, February', and thus indicates the average minimum temperatures in winter.

In [None]:
# file paths
djf_file = "/scratch/myang_shared/lab/PythonBootcamp/Sp25/resources/Lesson4_2025/djf_tmin_acadia.csv" #winter
mam_file = "/scratch/myang_shared/lab/PythonBootcamp/Sp25/resources/Lesson4_2025/mam_tmin_acadia.csv" #spring
jja_file = "/scratch/myang_shared/lab/PythonBootcamp/Sp25/resources/Lesson4_2025/jja_tmin_acadia.csv" #summer
son_file = "/scratch/myang_shared/lab/PythonBootcamp/Sp25/resources/Lesson4_2025/son_tmin_acadia.csv" #autumn

# get time series data
djfTS = np.genfromtxt(djf_file, skip_header=1, delimiter=',')
mamTS = np.genfromtxt(mam_file, skip_header=2, delimiter=',') 
jjaTS = np.genfromtxt(jja_file, skip_header=2, delimiter=',')
sonTS = np.genfromtxt(son_file, skip_header=2, delimiter=',')

# turn the dates into integers
years = (djfTS[:,0]).astype(int)

# pull minimum temperature info from each dataset
djf_tmin = djfTS[:,1]
mam_tmin = mamTS[:,1]
jja_tmin = jjaTS[:,1]
son_tmin = sonTS[:,1]

Let's plot the seasons separately, quickly, just to get a feel for what they look like.

In [None]:
plt.plot(djf_tmin)

In [None]:
plt.plot(mam_tmin)

In [None]:
plt.plot(jja_tmin)

In [None]:
plt.plot(son_tmin)

Okayyy, let's make these subplots. First, we can create a 2x2 panel for our visualization using Matplotlib’s `subplots()` function where we specify how many rows and columns we want to have in our figure. We can also specify the size of our figure with the `figsize` parameter as we have seen earlier with pandas. `figsize` takes the width and height values (in inches!) as inputs.

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))
axes

You can see that the result is a 2-D NumPy array of axis objects, which prints much like a list containing two nested lists: the first one contains the axes for columns 1 and 2 on row 1, and the second contains the axes for columns 1 and 2 on row 2.
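Since the object returned for a multi-panel figure is a NumPy array, everything from the Numpy section applies to it. A minimal sketch (with the made-up names `fig_demo` and `axes_demo`) that checks the type and shape, grabs one panel by row and column, and loops over all four panels:

```python
import numpy as np
import matplotlib.pyplot as plt

fig_demo, axes_demo = plt.subplots(nrows=2, ncols=2)

print(type(axes_demo))   # a NumPy array, not a plain list
print(axes_demo.shape)   # (2, 2)

# Array-style indexing works: row 0, column 1 is the top-right panel
top_right = axes_demo[0, 1]

# .flat visits every panel in turn, row by row
for number, ax in enumerate(axes_demo.flat, start=1):
    ax.set_title(f"Panel {number}")
```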

We can split these axes into their own variables so it is easier to work with them.

```
ax11 = axes[0][0]
ax12 = axes[0][1]
ax21 = axes[1][0]
ax22 = axes[1][1]
```

After running that, we have four variables for our plot axes that we can use for the different panels in our figure. We will use them to plot the seasonal data. Let’s begin by plotting the seasons, using a different color for each line.

In [None]:
ax11 = axes[0][0]
ax12 = axes[0][1]
ax21 = axes[1][0]
ax22 = axes[1][1]

In [None]:
# Set the plot line width
line_width = 1

# Plot data
ax11.plot(years, djf_tmin, c="slategrey", lw=line_width)
ax12.plot(years, mam_tmin, c="orchid", lw=line_width)
ax21.plot(years, jja_tmin, c="olivedrab", lw=line_width)
ax22.plot(years, son_tmin, c="orangered", lw=line_width)

# Set figure title
fig.suptitle("Average seasonal minimum temperature at \nAcadia National Park, Maine")

# Rotate the x-axis labels so they don't overlap
plt.setp(ax11.xaxis.get_majorticklabels(), rotation=20)
plt.setp(ax12.xaxis.get_majorticklabels(), rotation=20)
plt.setp(ax21.xaxis.get_majorticklabels(), rotation=20)
plt.setp(ax22.xaxis.get_majorticklabels(), rotation=20)

# Axis labels
ax11.set_ylabel("Temperature [°C]")
ax21.set_ylabel("Temperature [°C]")

# Season label text
# The first two numbers are the x and y positions (in data coordinates) where the text goes
ax11.text(2010, -8, "Winter")
ax12.text(2010, 0.5, "Spring")
ax21.text(2010, 12.5, "Summer")
ax22.text(2010, 6.5, "Autumn")
# Display the figure
fig

FUN!

If you run stats on these data, you'd see a statistically significant increase in minimum temperatures in fall, winter, and summer in Acadia. Nothing is happening in the spring!

Fall nights are about 1.2°C (2.1°F) warmer on average now than in 1950.
Winter nights are about 1.4°C (2.5°F) warmer on average now than in 1950.
And summer nights are about 1.6°C (2.9°F) warmer on average now than in 1950.
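
The lesson does not say which statistical test produced these numbers. As one common approach, the sketch below fits a straight line with `scipy.stats.linregress` and reads off the slope and p-value. It runs on a synthetic series with a built-in trend, not the Acadia data, so the printed values are only illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in series (NOT the Acadia data): a built-in warming trend plus noise
rng = np.random.default_rng(42)
demo_years = np.arange(1950, 2021)
true_slope = 0.02  # degrees C per year, chosen only for illustration
demo_tmin = -10.0 + true_slope * (demo_years - 1950) + rng.normal(0.0, 0.3, demo_years.size)

# Ordinary least-squares fit of temperature against year
fit = stats.linregress(demo_years, demo_tmin)

# Total change implied by the fitted line over the whole record
total_change = fit.slope * (demo_years[-1] - demo_years[0])

print(f"slope: {fit.slope:.4f} °C/year, p-value: {fit.pvalue:.3g}")
print(f"fitted change 1950-2020: {total_change:.2f} °C")
```

To try this on the real data, you could swap in `years` and one of the seasonal arrays such as `djf_tmin`. A p-value below 0.05 is the usual (if somewhat arbitrary) threshold for calling a trend statistically significant.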

### Knowledge Check 3
In the temperature data from the numpy section, you had three columns of temperature data - min, max, and average. Make three plots in one figure that depict each of these. Remember to make your graph look nice using good formatting.

<a href=#home>Return to Top</a> 

# 3. Cartopy <a name='bookmark3' />

Cartopy is one of many packages that focuses on geospatial data. This means it can be used for plotting data on a map, but it can also be used for plotting geographical images such as sections of a Google map. 

Like Matplotlib, Cartopy has several different modules. We will be working with two important modules in this tutorial: `crs` and `feature`. 

`crs` is the basic map-making module that creates the coordinates that we want. 
`feature` allows us to add coastlines, borders, oceans, and other important distinctions to our maps.

More info about Cartopy can be found [here](https://scitools.org.uk/cartopy/docs/latest).

**Important: Before we move on, we have to change conda environments because cartopy is a tricky package. At the top of the page, click 'Kernel'>'Change kernel'>'qgis.' If people are really interested in why we have to do this, we can chat about conda environments.**

In [None]:
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
from cartopy import crs, feature 

In [None]:
# we'll be using ozone data
data = pd.read_csv('/scratch/myang_shared/lab/PythonBootcamp/Sp25/resources/Lesson4_2025/july20_ozone.csv')

Examine the data. What are the column names? What's the shape and size? Etc.

Now let's pull some stuff together from a bunch of previous lessons to manipulate the data and make sure we're doing what we want to do.

In [None]:
# Let's deal with the dates
dates = pd.to_datetime(data.loc[:, 'Date Local'])

# add the datetimes back to the dataframe
data['datetimes'] = dates

# and now just double check that we did it right
data

Let's look at some trends over time using the super handy `groupby`.

In [None]:
print(type(data['Date Local'][0]))
print(type(data['datetimes'][0]))

In [None]:
daily_avg_o3 = data.filter(items=['Ozone (ppm)', 'datetimes']).groupby(data['datetimes'].dt.day).mean()
daily_avg_o3
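
If the `groupby` step feels like magic, here is a small sketch of the same idea on a tiny made-up table (not the real monitor data). It converts date strings with `pd.to_datetime`, then averages ozone by day of month. The names `demo` and `demo_daily` are hypothetical.

```python
import pandas as pd

# Tiny made-up table (NOT the real monitor data): two readings on each of two days
demo = pd.DataFrame({
    "Date Local": ["2020-07-01", "2020-07-01", "2020-07-02", "2020-07-02"],
    "Ozone (ppm)": [0.030, 0.040, 0.050, 0.070],
})

# Strings become pandas Timestamps, which unlocks the .dt accessor
demo["datetimes"] = pd.to_datetime(demo["Date Local"])

# Group rows by day of month, then average the ozone column within each group
demo_daily = demo.groupby(demo["datetimes"].dt.day)["Ozone (ppm)"].mean()
print(demo_daily)
```

Each group collects every row that shares a day of month, so the result has one averaged value per day. Grouping by `.dt.day` only makes sense here because the data covers a single month.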

In [None]:
# quick plot
y=daily_avg_o3['Ozone (ppm)']
x=daily_avg_o3.index

fig=plt.figure(figsize=(8,6))
plt.plot(x,y)
plt.xlabel('Day of Month')
plt.ylabel('Ozone (ppm)')
plt.title('July Daily Average Ozone')

This plot is telling us that the average ozone concentration across the entire United States fluctuates by less than 10 ppb during the month of July. But what is happening at specific locations? Can we zoom in to an area to better understand how air quality is changing there?

Let's look specifically at Massachusetts. (Virginia was boring.)

In [None]:
ma = data.loc[data['State Name']=='Massachusetts']
ma

ma_avg_o3 = ma.filter(items=['Ozone (ppm)', 'datetimes']).groupby(ma['datetimes'].dt.day).mean()

y_ma = ma_avg_o3['Ozone (ppm)']
x_ma = ma_avg_o3.index

fig=plt.figure(figsize=(8,6))
plt.plot(x_ma,y_ma)
plt.xlabel('Day of Month')
plt.ylabel('Ozone (ppm)')
plt.title('July Daily Average Ozone in Massachusetts')

Okay, it seems like there is something else going on in Massachusetts. We can assume ozone, then, varies in time and space. Let's make a map to see which areas in MA have high concentrations of ozone and which have low concentrations.

## 3.1. Making a map

We generally need to follow the same steps to make a map:
* a. Identify the coordinates of your region of interest (ROI)
* b. Make a grid on your coordinates of interest
* c. Select a map projection
* d. Create a pyplot figure using your grid and map projection
* e. Add the relevant features to your figure
* f. Plot the data

### 3.1.a. Identify the coordinates of your ROI

Sometimes, your data will have a pre-made grid of the latitude and longitude coordinates for you. This makes it really easy. In other cases, you might need to decide on the coordinates yourself. Our data does have latitude and longitude coordinates, but it does not tell us the size of the grid we want. We can use our latitude and longitude data to make an appropriately sized grid.

In [None]:
latmin = ma['Latitude'].min()
latmax = ma['Latitude'].max()
lonmin = ma['Longitude'].min()
lonmax = ma['Longitude'].max()
print('Lat Range: %.4f - %.4f' % (latmin,latmax))
print('Lon Range: %.4f - %.4f' % (lonmin,lonmax))

Now that we know our max/min - we can make a box around our data. I like adding a little bit of a spatial buffer.

In [None]:
# buffered lats and lons
latmin = 41
latmax = 43
lonmin = -74
lonmax = -69.5

### 3.1.b. Make a grid

This step will vary depending on the type of data you are plotting. We will be making a simple dot map, so we basically need to set a max and min range for our map's box size. If we were using gridded data that covered the whole area, like satellite data or a model output, we would need to make a meshgrid that had a lat and lon coordinate pair for each grid cell. Prof Spera can write up something on this if people are interested.
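
For the curious, here is a rough sketch of what a meshgrid looks like for gridded data. It assumes a hypothetical 0.5-degree grid over the buffered Massachusetts box from above; the grid spacing and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical 0.5-degree grid over the buffered Massachusetts box
demo_lons = np.linspace(-74.0, -69.5, 10)  # 10 longitudes, 0.5 degrees apart
demo_lats = np.linspace(41.0, 43.0, 5)     # 5 latitudes, 0.5 degrees apart

# meshgrid repeats the 1-D coordinates so every grid cell gets a (lon, lat) pair
lon_grid, lat_grid = np.meshgrid(demo_lons, demo_lats)

print(lon_grid.shape)  # rows follow latitude, columns follow longitude
print(lon_grid[0, 0], lat_grid[0, 0])      # one corner of the box
print(lon_grid[-1, -1], lat_grid[-1, -1])  # the opposite corner
```

Both output arrays have the same shape, so `lon_grid[i, j]` and `lat_grid[i, j]` together give the location of grid cell `(i, j)`.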

`extent` below indicates the boundaries of our map's box size.

In [None]:
extent = [lonmin, lonmax, latmin, latmax]
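
Instead of typing the buffered values by hand, you could compute them. The helper below is a hypothetical convenience function (not part of Cartopy) that pads the data's range by a fixed number of degrees and returns the values in the same longitude-first order used for `extent`.

```python
def buffered_extent(lats, lons, pad=0.5):
    """Return [lonmin, lonmax, latmin, latmax] padded by `pad` degrees on every side."""
    return [min(lons) - pad, max(lons) + pad, min(lats) - pad, max(lats) + pad]


# Made-up monitor coordinates, for illustration only
demo_lats = [41.5, 42.0, 42.5]
demo_lons = [-73.0, -71.5, -70.0]

demo_extent = buffered_extent(demo_lats, demo_lons, pad=0.5)
print(demo_extent)  # [-73.5, -69.5, 41.0, 43.0]
```

With the real data, something like `buffered_extent(ma['Latitude'], ma['Longitude'])` should work the same way, since `min()` and `max()` accept pandas columns.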

### 3.1.c. Select a map projection.
We also need to decide on a map projection. A map projection tells you how the map will be displayed. If you've taken Geog/Envs 260, you'll hopefully have heard all about projections. There's more info [here](https://scitools.org.uk/cartopy/docs/v0.15/crs/projections.html), but the basic idea is that a map represents a sphere (our globe), so it can look really different in 2D depending on the projection used to flatten it. Here, we'll use the Plate Carree projection.

In [None]:
# Projection
proj = crs.PlateCarree()

### 3.1.d. Create a Figure

Let's try and put everything together in a pyplot figure to set up our map background. 

In [None]:
# create a pyplot figure
fig = plt.figure(figsize=(8,10))
# create a new axes instance with the map information
ax = fig.add_subplot(1,1,1,projection=proj)
# ax.set_extent([lonmin,lonmax,latmin,latmax],crs.PlateCarree())
ax.set_extent(extent)
# add gridlines
gl = ax.gridlines(crs.PlateCarree(),draw_labels=True,linewidth=1,color='gray',alpha=0.5,linestyle='--')


Note that cartopy is built on top of matplotlib tools! Thus, knowing matplotlib is important for a lot of visualization techniques. 

### 3.1.e. Add relevant features.
Let's make sure this is in the right place by adding details like rivers, state borders, etc.

Note: If you are getting DownloadWarnings but the figure does appear, then don't worry. If you rerun the cell, you will see the DownloadWarning has disappeared. If you add new features, you will get a new warning, but it'll go away the next time you use that feature. When Cartopy was initially installed, it didn't download all the different map features with it. So it has to download them the first time you want to use those features in a plot - this might make the job run a bit longer than normal - it should still finish in ~1 minute.

In [None]:
# create a pyplot figure
# IF YOU GET DOWNLOAD WARNINGS, JUST WAIT THEM OUT. THINGS WILL PROCESS. 
fig = plt.figure(figsize=(8,10))
# create a new axes instance with the map information
ax = fig.add_subplot(1,1,1,projection=proj)
ax.set_extent(extent)
# add gridlines
gl = ax.gridlines(crs.PlateCarree(),draw_labels=True,linewidth=1,color='gray',alpha=0.5,linestyle='--')
# add features
ax.add_feature(feature.STATES)
ax.add_feature(feature.RIVERS)
#plt.show()

Note that the `.add_feature()` method is from cartopy and will pull out built-in data on state boundaries and river layouts. 

### 3.1.f. Plot the data

The daily ozone data has lat and lon coordinates. But we have data for every day in the month of July, which is too much to plot all at once. Instead, let's pick just one day and plot that data first.

In [None]:
# get data from July 1
july_1=ma[ma['datetimes']=='2020-07-01']

july_1.head()

In [None]:
# get lat, lon and y-data
lats = july_1['Latitude']
lons = july_1['Longitude']
o3 = july_1['Ozone (ppm)']

In [None]:
# create a pyplot figure
fig = plt.figure(figsize=(10,10))

# create a new axes instance with the map information
ax = fig.add_subplot(1,1,1,projection=proj)
ax.set_extent(extent)

# add gridlines
gl = ax.gridlines(crs.PlateCarree(),draw_labels=True,linewidth=1,color='gray',alpha=0.5,linestyle='--')

# add features
ax.add_feature(feature.STATES)

# ax.add_feature(feature.RIVERS)

# plot the data -- note, we use plt.scatter to add data on top of the map!
plt.scatter(lons,lats,s=50,c=o3)
plt.colorbar(label='Ozone (ppm)',shrink=0.5, pad = 0.1)
plt.show()

Wow, so it seems like southeastern Massachusetts, where Prof Spera is from, has higher concentrations of ozone than the rest of the state. But, if you remember from the line graph above, ozone was highest across the state on July 19. Let's see if we have the same spatial pattern then as well.

In [None]:
# select monitor data from July 19th
july19 = ma.loc[ma['datetimes'] == '2020-07-19']

# get the lat, lon, and y-data
lats = july19['Latitude']
lons = july19['Longitude']
o3 = july19['Ozone (ppm)']

# create a pyplot figure
fig = plt.figure(figsize=(8,10))
# create a new axes instance with the map information
ax = fig.add_subplot(1,1,1,projection=proj)
ax.set_extent(extent)
# add gridlines
gl = ax.gridlines(crs.PlateCarree(),draw_labels=True,linewidth=1,color='gray',alpha=0.5,linestyle='--')
# add features
ax.add_feature(feature.STATES)
# ax.add_feature(feature.RIVERS)
# plot the data -- note, we use plt.scatter to add data on top of the map!
plt.scatter(lons,lats,s=50,c=o3)
plt.colorbar(label='Ozone (ppm)',shrink=0.5, pad = 0.1)
plt.show()

Totally different pattern! Atmospheric chemistry changes in space & time.

Note here that the heatmap key is covering a bit of the y-axis label on the right. Mess around with the `plt.colorbar` options - can you make it so the heatmap key doesn't cover parts of the graph?

### Knowledge Check 4
Map a state that isn't Massachusetts using the steps above.

<a href=#home>Return to Top</a> 

## 3.2. A projection deep dive (skip this section if you're not interested)
To evaluate the importance of projections, we really need to look on a larger scale than the state level. Since our ozone data is for the entire US, let's look at the total area covered by the data.
[Projections](https://www.leventhalmap.org/digital-exhibitions/bending-lines/interactives/projection-face/) are essentially how we get a 3-d surface (the surface of the earth, which is a sphere), to a 2-d surface (a map). If you haven't taken Geog 260, see this [figure](https://media.licdn.com/dms/image/C4D12AQGszV4q-00Lew/article-inline_image-shrink_1500_2232/0/1634289520658?e=1712793600&v=beta&t=II_5VLlRGOnrkkU_dZwp6VOYR3q3gDdPwEUSNCPNBkQ).

In [None]:
# what are the min and max coordinates of monitors in our dataset?
latmin = data['Latitude'].min()
latmax = data['Latitude'].max()
lonmin = data['Longitude'].min()
lonmax = data['Longitude'].max()
print('Lat Range: %.4f - %.4f' % (latmin,latmax))
print('Lon Range: %.4f - %.4f' % (lonmin,lonmax))

Let's go through our map-making steps, but for the whole US.

In [None]:
# 1. Identify the coordinates
# buffered lat/lon values
latmin = 15
latmax = 65
lonmin = -150
lonmax = -60

In [None]:
# 2. Make the grid
# set the extents of our box
extent = [lonmin,lonmax,latmin,latmax]

In [None]:
# 3. Select the map projection
# first, we'll stick with PlateCarree
proj = crs.PlateCarree()

In [None]:
# 4. Create a figure
# create a pyplot figure
fig = plt.figure(figsize=(8,10))
# create a new axes instance with the map information
ax = fig.add_subplot(1,1,1,projection=proj)
ax.set_extent(extent)
# add gridlines
gl = ax.gridlines(crs.PlateCarree(),draw_labels=True,linewidth=1,color='gray',alpha=0.5,linestyle='--')
# add features
ax.add_feature(feature.STATES)
plt.show()

Okay, it appears we have states but not the outlines of neighboring countries. Let's fix that by adding coastlines.

In [None]:
# create a pyplot figure
fig = plt.figure(figsize=(8,10))
# create a new axes instance with the map information
ax = fig.add_subplot(1,1,1,projection=proj)
ax.set_extent(extent)
# add gridlines
gl = ax.gridlines(crs.PlateCarree(),draw_labels=True,linewidth=1,color='gray',alpha=0.5,linestyle='--')
# add features
ax.add_feature(feature.STATES)
ax.add_feature(feature.COASTLINE)
plt.show()

Does the US seem to be the right shape? Or do things look a bit flattened? That's what happens with PlateCarree: it flattens out the map. Let's try some other projections.

In [None]:
# Albers Equal Area projection
proj = crs.AlbersEqualArea()

In [None]:
# create a pyplot figure
fig = plt.figure(figsize=(8,10))
# create a new axes instance with the map information
ax = fig.add_subplot(1,1,1,projection=proj)
ax.set_extent(extent)
# add gridlines
gl = ax.gridlines(crs.PlateCarree(),draw_labels=True,linewidth=1,color='gray',alpha=0.5,linestyle='--')
# add features
ax.add_feature(feature.STATES)
ax.add_feature(feature.COASTLINE)
plt.show()

The Albers Equal Area projection is a conic projection, meaning it projects the globe onto a cone that is then unrolled into a flat map. Not great either.

In [None]:
# Lambert Conformal. This projection is useful if you ever use CMAQ or WRF data
proj = crs.LambertConformal()

# create a pyplot figure
fig = plt.figure(figsize=(8,10))
# create a new axes instance with the map information
ax = fig.add_subplot(1,1,1,projection=proj)
ax.set_extent(extent)
# add gridlines
gl = ax.gridlines(crs.PlateCarree(),draw_labels=True,linewidth=1,color='gray',alpha=0.5,linestyle='--')
# add features
ax.add_feature(feature.STATES)
ax.add_feature(feature.COASTLINE)
plt.show()

We're getting better. At least the US is nearly centered.

In [None]:
# Robinson projection
proj = crs.Robinson()

# create a pyplot figure
fig = plt.figure(figsize=(8,10))
# create a new axes instance with the map information
ax = fig.add_subplot(1,1,1,projection=proj)
ax.set_extent(extent)
# add gridlines
gl = ax.gridlines(crs.PlateCarree(),draw_labels=True,linewidth=1,color='gray',alpha=0.5,linestyle='--')
# add features
ax.add_feature(feature.STATES)
ax.add_feature(feature.COASTLINE)
plt.show()

This projection is good if you're mapping most of the world. You can find all of cartopy's projections [here](https://scitools.org.uk/cartopy/docs/v0.15/crs/projections.html). 

We've examined a few projections. Let's see how this can affect the data. Let's stick with the Robinson projection just for kicks and pick just one random day to focus on, and plot all monitor locations just to prove another point.

In [None]:
# let's map July 4th, 2020
sel_date = data[data['datetimes'] == '2020-07-04']

# get the lat, lon, and y-data
lats = sel_date['Latitude']
lons = sel_date['Longitude']
o3 = sel_date['Ozone (ppm)']

# create a pyplot figure
fig = plt.figure(figsize=(8,10))
# create a new axes instance with the map information
ax = fig.add_subplot(1,1,1,projection=proj)
ax.set_extent(extent)
# add gridlines
gl = ax.gridlines(crs.PlateCarree(),draw_labels=True,linewidth=1,color='gray',alpha=0.5,linestyle='--')
# add features
ax.add_feature(feature.STATES)
ax.add_feature(feature.COASTLINE)
# add data
plt.scatter(lons,lats,s=10,c=o3)
plt.colorbar(label='Ozone (ppm)',shrink=0.5)
plt.show()

Wait! What happened? Where is the data? The lats, lons, and ozone variables definitely have values, so that isn't the problem. Why didn't any dots show up on the map?

The answer is in the projection. We changed the map projection from Plate Carree to Robinson, but we left the data in Plate Carree (flat) coordinates. In order to map our flat data onto a not-flat map, we need to transform it by telling Matplotlib which coordinate system the data is in. Keeping track of whether the data and the map are in the same projection gets a lot of people stuck.

In [None]:
# create a pyplot figure
fig = plt.figure(figsize=(8,10))
# create a new axes instance with the map information
ax = fig.add_subplot(1,1,1,projection=proj)
ax.set_extent(extent)
# add gridlines
gl = ax.gridlines(crs.PlateCarree(),draw_labels=True,linewidth=1,color='gray',alpha=0.5,linestyle='--')
# add features
ax.add_feature(feature.STATES)
ax.add_feature(feature.COASTLINE)
# add data
plt.scatter(lons,lats,s=10,c=o3,transform=crs.PlateCarree()) # added transform here
plt.colorbar(label='Ozone (ppm)',shrink=0.5)
plt.show()

### Knowledge Check 5

Pick a different projection and plot the average July ozone across just the contiguous US (no Alaska, Hawaii, or Puerto Rico). Notice any interesting spatial patterns across the US?

<a href=#home>Return to Top</a> 