# Seeing Change Over Time: An Introduction to Time Series Analysis

<div class="alert alert-info">

Welcome back! This activity is part of an introduction to computational notebooks, designed specifically for K-12 educators. Here, we'll dive into a specific, powerful type of analysis: exploring data that changes over time. This is called **time series analysis**.

</div>

Data that unfolds over time is everywhere: daily temperature, stock market prices, student attendance, historical events, and ecological changes. 
    

## Key Ideas in Time Series Analysis

To explore some of the key ideas related to the code, math, and science of time series analysis, in this notebook we will be exploring **air quality data**.

**Our Learning Goals:**
* **Handling Dates and Times:** Time is a special kind of data that can be tricky to work with. Computational notebooks have specialized tools to handle things like leap years, and to calculate the duration between two dates or times. In this activity you will learn how to combine columns to create a `datetime` object, and how to use `datetime` objects to organize your analysis.
* **Visualizing Trends:** A line plot is a powerful tool for seeing patterns in time. We will learn the `lineplot` function and use it to create dynamic, detailed, and accurate line plots directly from our data.
* **Statistics/Data:** Time series analysis involves examining trends, cycles, or other interesting patterns that emerge over time. Learn to identify these patterns, and to exercise caution in interpreting patterns in data by using what you know about measurement and baseline data to separate "signal" from "noise." 
* **Content:** We will explore different ways that air quality data is measured, and uncover real-world patterns in air pollution data. We will also consider what real-world events or behaviors might be impacting air quality.
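
As a quick preview of the first goal, here is a minimal sketch of how pandas takes care of calendar details for us. The two dates are made up for illustration; they are not from the air quality dataset.

```python
import pandas as pd

# two made-up moments in time, on either side of a leap day
start = pd.Timestamp("2024-02-28 08:00")
end = pd.Timestamp("2024-03-01 08:00")

# subtracting one datetime from another gives the duration between them
duration = end - start

print(start.is_leap_year)  # is 2024 a leap year?
print(duration)            # how much time passed between the two moments
```

Because 2024 has a February 29, the duration comes out one day longer than it would in a non-leap year, and we did not have to remember that ourselves.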

## Introducing Some Key Ideas

We will start with a quick introduction to some of the technical and conceptual foundations for time series data: showing you how to look at a dataset of hourly air quality measurements. 

For this part, we will focus on hourly data reporting the Air Quality Index or **AQI**, a measure used by many health and government organizations to communicate with the public. AQI takes into account the presence and levels of several different air pollutants at once, to give an overall assessment of whether the air is safe to breathe. 

The AQI data we will be exploring is based on an air quality monitoring station located in West Oakland, California.

### Part I: Loading the Data

We're going to load a dataset that reports the hourly AQI over the course of one week from a monitoring station in West Oakland. The data are from the Bay Area Air Quality Management District, an air pollution control agency that monitors and reports on air quality in the greater San Francisco Bay region.

In [None]:
import pandas as pd # pandas is for making data tables

aqi_data = pd.read_csv('baaqmd_data.csv') # read the data
aqi_data.head() # show the first 5 rows of data

<div class="alert alert-success">

**Considering the data you see above,**


* What do you notice and wonder about this dataset?


* What are some questions you could explore?

</div>

We will use this dataset to practice code and visualizations for time series analysis. Remember, good documentation explains both the "why" (in text like this) and the "how" (in `#code comments`).

### Part II: Working with `datetime`

One thing you may have noticed above is that the month, day, year, and time of each AQI record are stored in separate columns. If we tried to graph the data, we would not be able to put "time" correctly on the x-axis.

We need to convert all this separate information into a single `datetime`. A `datetime` is a special format that computers use to keep time information organized.

Let's show a quick example with one date, and then we will use the same process to convert all the records in our AQI dataset. Try replacing the code below with a date that is special to you!

In [None]:
# often, datasets express date and time separately. 
# we will practice with a single, split up date
year = 2025 # a number format
month = 9
day = 9
time = "17:53" # a string/text format

# let's split the time up into hours and minutes
hours, minutes = time.split(':')

# first, we put these together into a common date format 
consolidated = str(month) + "-" + str(day) + "-" + str(year) + " " + str(hours) + ":" + str(minutes)
consolidated # show the text string we made

Now, we want to take the combined date and convert it to the `datetime` object.

In [None]:
# if the text string is a common format, the to_datetime
# function will automatically read and convert it to a
# datetime

pd.to_datetime(consolidated)
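
Automatic conversion is convenient, but a text string like `9-10-2025` could mean September 10 or October 9. One way to remove the ambiguity is to spell out the layout yourself with the `format` argument. This is a minimal sketch, assuming the same month-day-year layout we built in the example above; the variable names are our own.

```python
import pandas as pd

# the same kind of text string as above (month-day-year hour:minute)
date_text = "9-9-2025 17:53"

# %m = month, %d = day, %Y = four-digit year, %H = hour (24-hour clock), %M = minute
parsed = pd.to_datetime(date_text, format="%m-%d-%Y %H:%M")

print(parsed)
print(parsed.month, parsed.day)  # check which number was read as the month
```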

We are going to do this same process for all the records in the air quality dataset. In the example above, we did this for only one set of date information. Below, the code will look a bit different because we are doing the process for *all* the records in `aqi_data`.

In [None]:
# each record in the data table has a Month, Day, Year, and Time.
# Below, we split the Time up into Hours and Minutes, like we did in the example.

# Make new columns in aqi_data called "Hours" and "Minutes."
# Take what's in the "Time" column and split it up. The numbers before
# the colon become hours, the numbers after the colon become minutes.
aqi_data[['Hours', 'Minutes']] = aqi_data['Time'].str.split(':', expand=True).astype(int)

# for each record, store the values that represent year,
# month, day, etc. to be converted to a datetime
datetime_info = {
    'year': aqi_data['Year'],
    'month': aqi_data['Month'],
    'day': aqi_data['Day'],
    'hour': aqi_data['Hours'],
    'minute': aqi_data['Minutes']
}

# Add a new column called 'datetime' to aqi_data.
# Convert the information for each record that is saved 
# in datetime_info into a datetime and put it in the column.
aqi_data['datetime'] = pd.to_datetime(datetime_info)

# show the dataset with the new datetime column
aqi_data
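
To try the same steps in isolation, here is a small sketch on a made-up table with the same column names (`Year`, `Month`, `Day`, `Time`, `AQI`). The values are invented for illustration and are not from the monitoring station. The last few lines show two things a `datetime` column lets pandas work out directly.

```python
import pandas as pd

# a tiny, made-up table shaped like aqi_data
toy = pd.DataFrame({
    "Year": [2025, 2025, 2025],
    "Month": [7, 7, 7],
    "Day": [4, 4, 5],
    "Time": ["21:00", "22:00", "09:00"],
    "AQI": [80, 150, 40],
})

# split Time into hours and minutes, as in the cell above
toy[["Hours", "Minutes"]] = toy["Time"].str.split(":", expand=True).astype(int)

# build the datetime column from the separate pieces
toy["datetime"] = pd.to_datetime({
    "year": toy["Year"], "month": toy["Month"], "day": toy["Day"],
    "hour": toy["Hours"], "minute": toy["Minutes"],
})

# name of the day of the week for each record
toy["weekday"] = toy["datetime"].dt.day_name()

# hours that have passed since the earliest record
elapsed = toy["datetime"] - toy["datetime"].min()
toy["hours_since_start"] = elapsed.dt.total_seconds() / 3600

toy[["datetime", "weekday", "hours_since_start", "AQI"]]
```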

<div class="alert alert-success">

**Explore the new table above:**


* How might the new `datetime` column help with analysis?


* What, if any, questions do you still have about what we've done here?

</div>


### Part III: Visualizing the Data

Now that we have a proper `datetime` column, we can create a line plot. This will let us see the "shape" of the data over time. 

When you are trying to interpret a graph, it's also important to pay attention to what increases or decreases in a quantity actually mean. If you are not aware of the scale of measurements, something that looks like a large increase on a graph might not mean anything at all. 

Below is the U.S. Environmental Protection Agency's guide to interpreting AQI measurements. You may find it useful in making sense of the graph we will create.

<div>

<img src="epa_aqi_2.jpg" width="550"> 

</div>
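
If it helps to connect numbers to categories, here is a minimal sketch that labels AQI values with the standard EPA category names. The sample values are made up, and the column name `category` is our own choice rather than something in the dataset.

```python
import pandas as pd

# made-up AQI readings for illustration
sample = pd.DataFrame({"AQI": [35, 50, 51, 120, 180, 250, 320]})

# edges of the EPA AQI categories (each bin includes its upper edge)
edges = [-1, 50, 100, 150, 200, 300, float("inf")]
names = ["Good", "Moderate", "Unhealthy for Sensitive Groups",
         "Unhealthy", "Very Unhealthy", "Hazardous"]

# pd.cut sorts each AQI value into the bin it falls in
sample["category"] = pd.cut(sample["AQI"], bins=edges, labels=names)
sample
```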

Now, let's plot the AQI for our week of data to see what it looks like.

In [None]:
import seaborn as sns # make nice looking graphs
import matplotlib.pyplot as plt # this helps us change axis labels
import matplotlib.dates as mdates # this helps us format datetimes

sns.set_theme() # this sets a nice theme with a grid on the plots

# we want to format the datetimes so they look nice on the plot
# the line below says to show the day of the week "%a", then make a new line "\n"
# and put the month and day on the new line "%b-%d" (for example, Jul-04)
date_format = mdates.DateFormatter('%a\n%b-%d')

# the line below sends the new date format to the plot
plt.gca().xaxis.set_major_formatter(date_format)

# draw the plot
sns.lineplot(
    data=aqi_data, 
    x='datetime', 
    y='AQI'
)
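
The format codes can be tried out on a single date before using them on a plot. A small sketch, using a made-up date and time:

```python
import pandas as pd

example = pd.Timestamp("2025-07-04 21:00")  # a made-up date and time

# %a = short weekday name, %b = short month name, %d = day of the month
label = example.strftime("%a\n%b-%d")
print(label)
```

Swapping in other codes, such as `%H:%M` for the hour and minute, is a quick way to see what each one does.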

<div class="alert alert-success">

**Considering the lineplot above,**


                                
* What do you notice and wonder?



* What could have happened to create the pattern you see above?



* There are many rises and falls in the data. What patterns in this graph would you consider to be "signals" and what might be "noise"? How are you making those decisions?



* You may have noticed that the large spike in the graph occurs in the evening of July 4. How did/does this impact your interpretation of the graph?

</div>
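
One common tool for thinking about signal and noise is a rolling average, which replaces each reading with the average of the readings around it. This is a sketch of a technique the notebook does not otherwise use, shown on made-up hourly values with one spike. The 3-hour window is an arbitrary choice, and `AQI_smooth` is our own column name.

```python
import pandas as pd

# made-up hourly AQI values: small wiggles plus one large spike
hours = pd.date_range("2025-07-04 18:00", periods=8, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({
    "datetime": hours,
    "AQI": [42, 45, 41, 44, 150, 60, 43, 46],
})

# average each reading with its two neighbors (a centered 3-hour window)
# the first and last rows have only one neighbor, so they are left empty (NaN)
toy["AQI_smooth"] = toy["AQI"].rolling(window=3, center=True).mean()
toy
```

The small wiggles mostly flatten out after smoothing, while the spike is still visible but shorter. Whether a rise that survives smoothing counts as a "signal" still depends on what you know about the measure and its scale.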

**Summary.** Now you know about `datetime`, plotting line graphs, and exploring trends over time based on what you know about a measure. Let's apply these to a more interactive data investigation.

# Getting Clear on Air Quality: An Investigation of Pollutants During San Francisco's Orange Skies Day

On September 9, 2020, residents in the greater San Francisco Bay Area woke up to a sky that was a deep, saturated orange color. What was later called "Orange Skies Day" was caused by air pollution generated by a combination of intense wildfires nearby and weather conditions that suspended particulate matter high in the atmosphere for some time, creating the orange color effect.

<div style="display: flex; justify-content: space-around;">

<img src="bridge.jpg" width="250"> 

<img src="financial.jpg" width="250"> 

<img src="nature.jpg" width="250"> 

</div>

In this activity, we will look at hourly air quality measurements taken in Richmond, CA, a northern East Bay suburb of San Francisco, from September 3 through September 17, 2020.

### Part I: Loading the Data

First, let's take a look at the dataset to see what we will be working with. Before you play the next cell, take some time to think about what attributes (columns) might be included in the dataset.

In [None]:
fire_data = pd.read_csv('sept2020.csv')

fire_data # this will show the first 5 records and the last 5 records in the dataset.

<div class="alert alert-success">

* What do you notice and wonder about the dataset above?


* What questions could you explore using this dataset?


* What information would you need to interpret these readings? 

</div>

Let's take a look at some summary descriptions of the data to get a better sense of what we are working with. We're going to use the `describe()` function, which you've seen before. You'll notice that there are different measurements all stored in the same column. We can use `groupby` to treat them differently.

In [None]:
# show summary statistics, but do it by the different pollutant groups.
fire_data.groupby('pollutant').describe()
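
To see why grouping matters, here is a small sketch on a made-up table in the same layout, with one row per reading and the pollutant named in its own column. The values are invented, and the labels `O3` and `CO` are assumptions; check the `pollutant` column of `fire_data` for the labels it actually uses.

```python
import pandas as pd

# a tiny, made-up table with one row per reading
toy = pd.DataFrame({
    "pollutant": ["PM2.5", "PM2.5", "O3", "O3", "CO", "CO"],
    "value": [12.0, 80.0, 0.03, 0.05, 0.4, 1.2],
})

# without groupby, all pollutants are mixed into one average
overall_mean = toy["value"].mean()

# with groupby, each pollutant gets its own summary
by_pollutant = toy.groupby("pollutant")["value"].agg(["min", "max", "mean"])

print(overall_mean)
print(by_pollutant)
```

The single overall average blends measurements taken on different scales, so it does not describe any one pollutant. The grouped summary keeps them apart.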

As you can probably tell, different pollutants are measured on different scales. We can't be sure about what is meaningful without understanding more about how these are measured, and what the measures mean. 

Here are some important things to know about the pollutants that are in this dataset, and some current EPA National Ambient Air Quality Standards for each.

* **PM2.5** describes very small particles, like black carbon or ash, that are in the air. PM2.5 can irritate the lungs, causing respiratory problems. These particles are so small that they can even pass through the lung barrier and get into the bloodstream, causing other health problems as well. PM2.5 concentrations are usually measured in micrograms per cubic meter (a microgram is one millionth of a gram). The U.S. EPA specifies PM2.5 values of 0-9 as "Good", 9.1-35.4 as "Moderate", 35.5-55.4 as "Unhealthy for Sensitive Groups," and values of 55.5 and above as "Unhealthy" for all groups.

* **Ozone (O3)** is a chemical that exists as a natural part of Earth's upper atmosphere, and that is also created at ground level when various other pollutants, especially those from motor vehicles, interact with heat and light. Ozone causes a variety of respiratory problems, especially for children and people with asthma. The EPA designates ozone levels over 0.071 ppm as "marginal", over 0.081 ppm as "moderate", and over 0.093 ppm as "serious" in terms of the potential for harm to public health.

* **Carbon monoxide (CO)** is a tiny molecule that interferes with the body's ability to carry oxygen in blood. At high levels it can cause acute poisoning, loss of consciousness, and death. At lower levels, exposure can cause headaches, nausea, muscle pain, and cognitive impairment. The EPA has designated CO levels over 9 ppm (over 8 hours) or 35 ppm (over 1 hour) as harmful to public health. The biggest sources of CO are motor vehicles and residential and commercial heating and cooking appliances.
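
Benchmarks like these can be turned into simple yes/no questions about each reading. Here is a minimal sketch on made-up PM2.5 values, using the 35.5 cutoff for "Unhealthy for Sensitive Groups" mentioned above. The readings and variable names are invented for illustration.

```python
import pandas as pd

# made-up hourly PM2.5 readings, in micrograms per cubic meter
pm25 = pd.Series([8.0, 20.5, 36.0, 90.2, 55.4, 12.0])

# True wherever a reading is at or above the benchmark
threshold = 35.5
above = pm25 >= threshold

print(above.sum())   # how many readings are at or above the benchmark
print(above.mean())  # what fraction of all readings that is
```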

### Part II: Visualizing the Data Over Time

Now, we're going to look at some lineplots to visualize concentrations of each of these pollutants during the days immediately before, during, and after the "Orange Skies" event in September 2020. The code below will create a lineplot showing all three pollutants together. 

<div class="alert alert-success">

**Before we create the lineplot, consider:**


* What do you expect the plot of PM2.5 levels to look like?


* If you were to plot the other pollutants, what would you expect those to look like?



</div>

In [None]:
# Plot the three pollutants together. Because the pollutants are measured
# on very different scales, PM2.5 may look like you would expect, but it
# can be hard to see what's happening with O3 and CO.

# since the dataset has a column that describes datetime, we can
# easily convert that to Python datetimes using to_datetime.
fire_data['datetime'] = pd.to_datetime(fire_data['datetime_local'])

## FORMATTING STUFF
# set the date format, like we did in the first example above
date_format = mdates.DateFormatter('%a\n%b-%d')  # set the format
plt.gca().xaxis.set_major_formatter(date_format) # give format to the plot

# we have lots of days, so let's only add a tick to the axis
# for every 48 hours
locator = mdates.HourLocator(interval=48)  # set the tick spacing
plt.gca().xaxis.set_major_locator(locator) # give the tick spacing to the plot

# now, create a plot showing levels of each pollutant over time
sns.lineplot(
    data=fire_data, 
    x='datetime', 
    y='value',
    hue = 'pollutant'
)

<div class="alert alert-success">

* Is this what you expected? Why or why not?


* What do you notice/wonder here?


* What of the data you see above would you consider to be a "signal"? What would you consider to be "noise"?


* What additional information do you need to make progress in understanding how air quality changed during the "Orange Skies" incident?


</div>

### Part III: Visualizing Trends for Each Pollutant

Since these pollutants are measured and interpreted using very different scales, it is helpful to look at each separately. We can do this by creating a line plot that is filtered to show only one pollutant at a time. The code below creates a filter, so that only PM2.5 data is used to create the line plot. 

In [None]:
## FORMATTING STUFF
# set the date format, like we did in the first example above
date_format = mdates.DateFormatter('%a\n%b-%d')  # set the format
plt.gca().xaxis.set_major_formatter(date_format) # give format to the plot

# we have lots of days, so let's only add a tick to the axis
# for every 48 hours
locator = mdates.HourLocator(interval=48)  # set the tick spacing
plt.gca().xaxis.set_major_locator(locator) # give the tick spacing to the plot

## FILTER TO THE POLLUTANT
# True for rows that hold PM2.5 readings, False for all other rows
# (we avoid the name "filter" because Python already uses it)
pm25_filter = fire_data['pollutant'] == "PM2.5"

sns.lineplot(data=fire_data[pm25_filter], x='datetime', y='value')

What do you think the trends will be in the graphs for the other pollutants? Edit and rerun the code above as a tool to explore the concentrations of PM2.5, Ozone, and Carbon Monoxide separately to get a better sense of each pattern at an interpretable scale.
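
One way to make this exploration easier is to wrap the filter in a small helper function. This is a sketch rather than part of the original activity: `select_pollutant` is a hypothetical name, the toy values are made up, and the label `O3` is an assumption, so check the `pollutant` column of `fire_data` for the exact labels it uses.

```python
import pandas as pd

def select_pollutant(data, name):
    """Return only the rows of data whose pollutant column equals name."""
    matches = data["pollutant"] == name
    return data[matches]

# a tiny, made-up table in the same layout as fire_data
toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2020-09-09 08:00", "2020-09-09 08:00",
                                "2020-09-09 09:00", "2020-09-09 09:00"]),
    "pollutant": ["PM2.5", "O3", "PM2.5", "O3"],
    "value": [60.0, 0.02, 75.0, 0.03],
})

ozone_only = select_pollutant(toy, "O3")
print(ozone_only)
```

With the real data, the result could be passed straight to the plot, for example `sns.lineplot(data=select_pollutant(fire_data, "PM2.5"), x='datetime', y='value')`.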

<div class="alert alert-success">

* What new things did you discover when looking at the plots separately?



* What information helped you make sense of the plots for each pollutant? What additional information might be helpful?


                                                      
* What questions do you still have?



</div>

### Part IV: Looking Across Pollutants Simultaneously

You may have noticed that not all of the pollutants increased or decreased in concentration at the same time during the wildfires. 

When doing time series analyses, it is sometimes helpful to align multiple plots to see what was happening with each value at the same point in time. Below, we generate a plot with all three pollutants aligned.

In [None]:
g = sns.FacetGrid(fire_data, 
                  row="pollutant", 
                  aspect=4, # this makes the graphs 4x wider than they are tall
                  sharey=False)

g.map(sns.lineplot, "datetime", "value")
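
Another way to line pollutants up in time is to reshape the table so that each pollutant has its own column and each row is one moment in time. Here is a sketch on made-up values, assuming the same `datetime`, `pollutant`, and `value` columns. The labels `O3` and `CO` are assumptions about how the dataset names them.

```python
import pandas as pd

# made-up readings: three pollutants measured at the same two times
times = pd.to_datetime(["2020-09-09 08:00", "2020-09-09 09:00"])
toy = pd.DataFrame({
    "datetime": list(times) * 3,
    "pollutant": ["PM2.5"] * 2 + ["O3"] * 2 + ["CO"] * 2,
    "value": [60.0, 75.0, 0.02, 0.03, 0.9, 1.1],
})

# one row per datetime, one column per pollutant
wide = toy.pivot(index="datetime", columns="pollutant", values="value")
print(wide)
```

Reading across a single row of the reshaped table shows what every pollutant was doing at that moment, which is the same comparison the aligned plots support visually.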

<div class="alert alert-success">

* Which patterns among the three graphs you see above are "worth talking about"? Why?



* If you had to tell the story of air quality during the weeks featured above, what story would you tell?



* Would you say there is anything "interesting" happening with Ozone? Why or why not?



* There is a "dip" in PM2.5 levels between 9-11 and 9-20, but not in CO levels. Why might this be? Is it something worth paying attention to? Why or why not?


* What additional questions do these graphs bring up for you?

</div>

# Where Else Do We See Time Series?

The skills you just practiced—creating datetime objects and plotting line graphs—are foundational for analyzing any data that has a time component.

Think about how you could apply this same "bundle" of skills to other content areas:

* **Ecology:** Tracking the population of a predator and its prey over several decades to see their cyclical relationship.
* **Social Studies / History:** Plotting immigration rates, economic indicators (like GDP), or census data over time to analyze historical trends.
* **Finance:** Analyzing the stock price of a company over a year to identify volatility and trends.
* **School Improvement:** Plotting daily student attendance rates over a school year to identify days or weeks with low attendance that might need intervention.

Any time you have a question that includes the words "over time," the tools and concepts in this notebook are a great place to start your investigation.

## In the Classroom

This notebook models a core scientific activity: observing, processing, and visualizing time series data to find patterns.

**How to Use This in a Lesson:**
* **Science:** In an Earth Science or Environmental Science unit, students could use this notebook to explore real-world pollution data. You could pose questions like: "Are pollution levels higher during certain times of the day? Or certain seasons?" They would modify the code to filter for different time periods to answer those questions.
* **Math:** In a statistics or data analysis unit, the focus could be on the visualization itself. Why is a line graph appropriate here? What does a sharp spike represent? What rises and falls in the concentrations are worth paying attention to, and which ones might not actually reflect a "signal" among the "noise"?
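
For example, the question about times of day could be approached by filtering to a time window and then averaging by hour. This is a minimal sketch on made-up readings; the dates, values, and variable names are invented for illustration.

```python
import pandas as pd

# made-up readings: two per day, at 08:00 and 17:00, over three days
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2020-09-08 08:00", "2020-09-08 17:00",
        "2020-09-09 08:00", "2020-09-09 17:00",
        "2020-09-10 08:00", "2020-09-10 17:00",
    ]),
    "value": [10.0, 30.0, 20.0, 50.0, 90.0, 70.0],
})

# keep only the readings from September 8 and 9
in_window = (toy["datetime"] >= "2020-09-08") & (toy["datetime"] < "2020-09-10")
window = toy[in_window]

# average the remaining readings by hour of the day
by_hour = window.groupby(window["datetime"].dt.hour)["value"].mean()
print(by_hour)
```

Students could change the dates in the filter, or group by `dt.day_name()` instead of `dt.hour`, to ask the same question about days of the week.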

What other air quality events or patterns might be worth looking at?

## Credit

* *North Complex smoke in San Francisco - Bay Bridge and Financial District.jpg* and *North Complex smoke in San Francisco - Financial District.jpg* by Christopher Michel / Wikipedia / CC-BY-2.0
* *Observing the Orange Sky in San Francisco, California, USA, September 9, 2020.jpg* by Semiautonomous / Wikipedia / CC-BY-SA-4.0