# ***Assignment 2: CO<sub>2</sub> Mauna Loa Data***


This assignment is due on Thursday 2024-1-18, before class. Change the name of your notebook to tpp_assignment_2_sunetID.ipynb. Share your completed notebook with the TAs (akroo@stanford.edu and flora221@stanford.edu) using the share banner at the top. For help submitting, see the Canvas walkthrough. If you are still having technical difficulties, email us before the deadline.

## Introduction

This week’s assignment will get us thinking about using ground-based sensor data to monitor CO2. We will analyze data from the oldest CO2 monitoring station with the goal of understanding how and why the concentration of CO2 in the atmosphere is changing.

The goal of this assignment is to answer the question: how and why does the concentration of CO2 change at Mauna Loa? Be as quantitative as you can in your answers. To do this, we need to identify the key parameters that affect the concentration of CO2, and we have sensor data available to help. It is up to you to decide how to analyze the data to answer the question.






## Datasets:
For this assignment, you will be using the Mauna Loa carbon dioxide dataset. The data have been collected from 1958 to the present day at the longest-running carbon dioxide measurement station in the world. The longevity of this dataset gives us a good picture of how the carbon cycle has changed as climate change has advanced.

This dataset contains columns for the month and year that the data were taken. We will not be using these columns except to construct the date at the beginning of our program. This construction is included in the cell that loads the data.

This dataset also contains multiple columns that contain CO<sub>2</sub> information.

The monthly averaged data (monthly_average_co2) is the average, or mean, of the CO<sub>2</sub> concentration values at the Mauna Loa observation site over each month. For example, there is a data point for January 1959 and a data point for February 1959. These data points do not represent a single reading, which would be more susceptible to daily and even hourly variations, but are instead the averages of all of the measurements made during that month.

The other CO<sub>2</sub> data available to us from this dataset is the "de-seasonalized" data (de-seasonalized_co2). The monthly averaged data contain an oscillation that repeats once per year as a result of seasonal variations. Various averaging schemes can remove this seasonal variation from the signal so that the overall long-term trend is easier to see. This cleaning/filtering process has already been performed on the data, and the result is included in the dataset as a column named 'de-seasonalized_co2'.

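To make the idea of "averaging out" a seasonal cycle concrete, here is a minimal sketch using a 12-month moving average. It runs on a synthetic series (a made-up linear trend plus a made-up yearly sine wave), not on the Mauna Loa data, and it is only one possible averaging scheme; the method actually used to produce the de-seasonalized_co2 column may differ.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (NOT the Mauna Loa data): linear trend plus a yearly cycle
dates = pd.date_range("2000-01-01", periods=48, freq="MS")
t = np.arange(48)
trend = 370.0 + 0.15 * t                      # made-up trend, rising 0.15 per month
seasonal = 3.0 * np.sin(2 * np.pi * t / 12)   # made-up cycle repeating every 12 months
monthly = pd.Series(trend + seasonal, index=dates)

# A 12-month moving average spans exactly one full cycle,
# so the seasonal part cancels and only the trend remains.
# The first 11 entries are NaN because the window is not yet full.
smoothed = monthly.rolling(window=12).mean()

print(monthly.head(14))
print(smoothed.head(14))
```

Because this is a trailing window, each smoothed value reflects the trend about 5.5 months earlier than its timestamp; a centered window would remove that lag.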

If you want to learn more about this dataset see: https://gml.noaa.gov/ccgg/about/co2_measurements.html




## Units:
The unit for reporting CO2 is parts per million (ppm).


## Toolbox:
All the Python functions and packages you will use in this assignment are in the toolbox for the course. We add new tools to the toolbox with each assignment as new ways of analyzing and visualizing data are introduced.

https://colab.research.google.com/drive/1gQxlpnogdJzykfNqjIGO4VLMnHHhrvcC?usp=sharing

This week you will be working with data that can be organized as a table.

numpy (short for Numerical Python) is a package of Python tools that handles mathematical operations.

pandas (referred to as pd in lines of code) is a package of Python tools for working efficiently with datasets. Once a dataset is loaded into pandas, it is referred to as a dataframe.

matplotlib is a package that is useful for generating plots.

## THE LEARNING GOALS FOR THE WEEK
(where the course learning goals are in plain text, and the focus this week is in italics)

● become familiar with the wide range of sensors available to study various components of the Earth system. These include sensors on satellites, aircraft, *ground-based platforms, and deployed above or beneath the surface on land or water. This week we will work with sensors and learn how these measurements can be made on a range of platforms from hand-held to satellites.*

● *become familiar with the basic physical principles (resolution, sampling, processing workflows, etc.) common to all sensors.* This week you will think about the impact of temporal and spatial sampling on your data. *You will use a very simple workflow to go from data to analysis.*

● *work with various sources of data, learning how to access, analyze, synthesize, and describe the data to quantify trends; think critically and creatively about how to project these trends into the future. This week you will consider how best to interpret your data and use some simple analysis tools.*

● *become motivated to think about new sensors and new ways of using sensor data to study the planet.  This week you will consider the advantages and disadvantages of local sensor measurements.*


-----------------------------------------------------------------

#### 1) **Install Packages**: numpy, pandas, matplotlib (See Toolbox)



#### 2) **Download the Data**:

We begin by downloading the dataset. This dataset is much larger than our class data, so it may take a minute to load. Do not worry about the specifics of the code in this section.

The dataset's layout is awkward for plotting. It comes with separate month and year columns, and if you put "year" on the x axis, the individual months would not be distinguishable. In the code block below, we combine the month and year columns into a "datetime" object that is easier to use when plotting. Don't worry about the manipulations that we are using here; just keep in mind what this block of code does.

In [None]:
# DOWNLOADING DATASET
# connect to data in server
!git clone https://premonition.stanford.edu/taking-the-pulse-of-the-planet/homework-data

# cleans up output by removing superfluous warning notices
import warnings
warnings.filterwarnings('ignore')

# pull the data from the server into a useful pandas format
import pandas as pd  # harmless if pandas was already imported in step 1
df_co2 = pd.read_csv('./homework-data/co2_global.csv')

#-----------------------------------#
# GENERATE DATE FROM MONTHS + YEARS
import datetime

#reorganizing datetime objects
years = df_co2['year'].values.astype(int)
months = df_co2['month'].values.astype(int)
times = [datetime.date(years[ii], months[ii], 1) for ii in range(df_co2.shape[0])]
time_series = pd.to_datetime(times)

# set index to be datetime object
df_co2['time'] = time_series
df_co2.set_index('time',drop=False,inplace=True);

Cloning into 'homework-data'...
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 12 (delta 1), reused 0 (delta 0), pack-reused 6
Receiving objects: 100% (12/12), 9.49 KiB | 9.49 MiB/s, done.
Resolving deltas: 100% (1/1), done.


| time | year | month | monthly_average_co2 | de-seasonalized_co2 | time |
|---|---|---|---|---|---|
| 1958-03-01 | 1958.0 | 3.0 | 315.70 | 314.43 | 1958-03-01 |
| 1958-04-01 | 1958.0 | 4.0 | 317.45 | 315.16 | 1958-04-01 |
| 1958-05-01 | 1958.0 | 5.0 | 317.51 | 314.71 | 1958-05-01 |
| 1958-06-01 | 1958.0 | 6.0 | 317.24 | 315.14 | 1958-06-01 |
| 1958-07-01 | 1958.0 | 7.0 | 315.86 | 315.18 | 1958-07-01 |
| ... | ... | ... | ... | ... | ... |
| 2022-07-01 | 2022.0 | 7.0 | 418.90 | 418.59 | 2022-07-01 |
| 2022-08-01 | 2022.0 | 8.0 | 417.19 | 419.16 | 2022-08-01 |
| 2022-09-01 | 2022.0 | 9.0 | 415.95 | 419.50 | 2022-09-01 |
| 2022-10-01 | 2022.0 | 10.0 | 415.78 | 419.14 | 2022-10-01 |

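For those curious about the date construction, the following sketch shows one alternative way to build a datetime column from year and month columns. It uses a small toy table (`df_toy`, a hypothetical name) whose three rows copy the first values shown in the output above. It is an illustration, not a replacement for the loading cell.

```python
import pandas as pd

# Toy table with float year/month columns, mimicking the layout of the CO2 file
df_toy = pd.DataFrame({
    "year": [1958.0, 1958.0, 1958.0],
    "month": [3.0, 4.0, 5.0],
    "monthly_average_co2": [315.70, 317.45, 317.51],
})

# pd.to_datetime can assemble dates from columns named year, month, and day.
# The file has no day column, so we assign day=1 to every row.
date_parts = df_toy[["year", "month"]].astype(int).assign(day=1)
df_toy["time"] = pd.to_datetime(date_parts)

# Use the new datetime column as the index, keeping it as a regular column too
df_toy = df_toy.set_index("time", drop=False)
print(df_toy)
```

With a datetime index, plotting functions can space the points correctly along the time axis instead of treating year and month as separate numbers.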

#### **3)** Plotting Multiple Variables

##### **a)** Plot both of the CO2 parameters on the same plot so that you can easily compare them. Describe, in your own words, the difference between the two data categories.

#### **4)** Averaging

##### **a)** Calculate the mean of the CO2 concentration data

#### **5)** Data Analysis: Creating Subsets of Data

In this section, we will get rid of the seasonal variations by looking at yearly means of the monthly data. To do this we will use the function “groupby.” Groupby operates on a dataframe and pulls out groups of rows that share the same value in a chosen column. In this first interaction with the groupby function, we will group our data by year so that we can operate on those groups. The example below groups the data by year and then gets the group of data points for the year 2001. Notice that the dates in the output are all from 2001.

In [None]:
df_co2_grouped_year = df_co2.groupby('year')
df_co2_grouped_year.get_group(2001)

| | year | month | monthly_average_co2 | de-seasonalized_co2 | time |
|---|---|---|---|---|---|
| 514 | 2001.0 | 1.0 | 370.76 | 370.6 | 2001-01-01 |
| 515 | 2001.0 | 2.0 | 371.69 | 370.95 | 2001-02-01 |
| 516 | 2001.0 | 3.0 | 372.63 | 371.06 | 2001-03-01 |
| 517 | 2001.0 | 4.0 | 373.55 | 370.99 | 2001-04-01 |
| 518 | 2001.0 | 5.0 | 374.03 | 371.11 | 2001-05-01 |
| 519 | 2001.0 | 6.0 | 373.4 | 371.17 | 2001-06-01 |
| 520 | 2001.0 | 7.0 | 371.68 | 371.08 | 2001-07-01 |
| 521 | 2001.0 | 8.0 | 369.78 | 371.39 | 2001-08-01 |
| 522 | 2001.0 | 9.0 | 368.34 | 371.61 | 2001-09-01 |
| 523 | 2001.0 | 10.0 | 368.61 | 371.85 | 2001-10-01 |


One reason groupby is so useful is that we can perform operations on each group of data that we pull out of our set.

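As a small illustration of operating on a group, the sketch below uses a toy dataframe with made-up values (the names `df_toy`, `value`, and `rows_2001` are hypothetical and not part of the assignment data). It pulls out one year with get_group and then computes summary values for just that group.

```python
import pandas as pd

# Toy measurements (made-up values) spread over two years
df_toy = pd.DataFrame({
    "year": [2001, 2001, 2001, 2002, 2002, 2002],
    "value": [10.0, 12.0, 14.0, 20.0, 22.0, 24.0],
})

grouped = df_toy.groupby("year")

# Pull out the rows belonging to one group
rows_2001 = grouped.get_group(2001)

# Any column operation now applies only to that group's rows
mean_2001 = rows_2001["value"].mean()
spread_2001 = rows_2001["value"].max() - rows_2001["value"].min()
print(mean_2001, spread_2001)  # 12.0 4.0
```

The same pattern should carry over to the CO2 dataframe by swapping in its column names.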
##### **a)** Choose three years from across the dataset and compare their average CO2 concentrations using the function mean() (see toolbox). What do you notice about the relationship between these concentrations?

Instead of getting the group for a particular year, we can perform the same averaging operation on all of the groups at once by doing the following.

In [None]:
df_co2_annual = df_co2.groupby('year')['monthly_average_co2'].mean()

##### **b)** What does df_co2_annual look like? What does each row represent? Visualize df_co2_annual with a plot.

#### **6)** Selecting Groups of Groupby Subsets

There are many different subsets of data that one could be interested in within a time series such as the CO<sub>2</sub> dataset we have been working with. Let's look at data from 2000 through 2010 to see what the more recent CO<sub>2</sub> values look like. The following code selects a subset of the whole dataset. It first defines the range of years that you want to select. It then goes through each year in this range, grabs all of the data for that year using groupby, and adds it to our subset dataframe. In this code, you will only need to modify the year range that you want.


In [None]:
years = range(2000, 2011)  # years 2000 through 2010 (range excludes the end value, 2011)
df_subset = pd.DataFrame()  # empty dataframe to collect the selected rows

# group once, then loop over the years of interest and append each year's rows to the subset
grouped_by_year = df_co2.groupby('year')
for i in years:
  df_subset = pd.concat([df_subset, grouped_by_year.get_group(i)])

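The loop above is one way to build a subset. As a hedged aside, a boolean mask can select the same kind of year range in a single step. The sketch below uses a toy dataframe with made-up values (`df_toy` and `df_subset_toy` are hypothetical names), so it does not touch the assignment data.

```python
import pandas as pd

# Toy table (made-up values), one row per year from 1998 to 2012
df_toy = pd.DataFrame({
    "year": list(range(1998, 2013)),
    "value": [float(v) for v in range(15)],
})

# between() builds a True/False mask; by default it includes both end points,
# so this keeps the years 2000 through 2010
mask = df_toy["year"].between(2000, 2010)
df_subset_toy = df_toy[mask]

print(df_subset_toy["year"].min(), df_subset_toy["year"].max())  # 2000 2010
```

Note the difference in end-point handling: range(2000, 2011) stops before 2011, while between(2000, 2010) includes 2010 explicitly.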
##### **a)** Plot the CO2 datasets for the years from 2013 to 2023.

#### **7)** Making Predictions

The dataframe that we generated in the last section on creating subsets is just like the dataframes we have been working with in both assignment 1 and the rest of this assignment. As such, we can perform a linear fit on it just as we have done with other dataframes.

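As a reminder of what a linear fit involves, here is a minimal sketch using numpy's polyfit on made-up values that lie exactly on a straight line. The course toolbox may use a different fitting function, and the variable names here are hypothetical, so treat this as an illustration of the idea rather than the expected method.

```python
import numpy as np

# Made-up yearly values that follow an exact straight line: 2 units per year
years_toy = np.arange(2000, 2011)
values_toy = 100.0 + 2.0 * (years_toy - 2000)

# Fit a degree-1 polynomial (a line). Measuring x from the first year
# keeps the numbers small and the intercept easy to interpret.
x = years_toy - years_toy[0]
slope, intercept = np.polyfit(x, values_toy, 1)

# Evaluate the fitted line at a later year (an extrapolation beyond the data)
predicted_2015 = slope * (2015 - years_toy[0]) + intercept
print(slope, intercept, predicted_2015)
```

The slope is the rate of change per year over the fitted window. An extrapolation like this assumes the trend in that window continues unchanged.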

##### **a)** Choosing various subsets of the data, discuss how the rate of change of CO2 concentration has changed over time.


##### **b)** Predict the CO2 concentration in 2050. Explain how you did this and why you made the decisions that you did in your method. (There is not one correct way to do this.)