<div class="alert alert-block alert-danger">

# 07B: Methane (COMPLETE)

*This notebook is intended for students who have completed:*
 
**Page 7.9**

</div>

<div class="alert alert-block alert-warning">

#### Summary of Notebook:

In this lesson, students will look at data on temperature and methane in the environment. They will attempt to debunk a claim made by a climate skeptic whose analysis shows that as methane goes up, temperature actually goes down. Students will build models that use methane to predict temperature anomaly and come to realize that the model shows a negative trend because it fails to account for season (summer vs. winter).

#### Includes:

- Fitting and interpreting simple and multiple regression models
- Exploring the dangers of failing to include an important variable in a model
- Using visualizations to explore which variables should be included in a model

</div>

<div class="alert alert-block alert-success">

## Approximate time to complete Notebook: 45-55 Mins

</div>

In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

<img src="https://i.postimg.cc/2CdHrqmS/methane-in-the-environment.jpg" alt="A chimney burning methane" width = 50%>

## Climate and Methane

Changes in our climate (e.g., more extreme hurricanes, droughts, warming temperatures) have been linked to increases in pollutants and greenhouse gases (such as methane). But there are skeptics who do not believe human activities are linked to changes in climate.

A climate change skeptic chose one summer month and one winter month from each of the years 1999-2018. For each of those months, the skeptic found data on the average global temperature and atmospheric methane level. 

The skeptic claims that these data show a **negative relationship** between methane and temperature. In other words, higher concentrations of the greenhouse gas tend to be associated with lower average global temperature. 

We will take a look at this data (in the data frame `temp_data`). Is the climate change skeptic correct? Could this be evidence that greenhouse gases and human pollution aren't driving climate change? Let's explore.

### Motivating Question: Is Climate Change Real?

### The Dataset `temp_data`
##### Description
A climate change skeptic (non-randomly) chose one summer month and one winter month from each of the years 1999-2018. For each of those months, the skeptic found data on the average global temperature and atmospheric methane level. 

##### Variables
- `year`: Year
- `month`: Month (1 = Jan, 2 = Feb, etc.)
- `decimal`: Month as a decimal of the year
- `season`: Winter or Summer
- `methane`: Atmospheric methane concentration, parts per trillion (ppt)
- `temp_anomaly`: Average global temperature, relative to the mean temperature from 1980 - 2015, in degrees Celsius

**Data Sources:**

- Temperature readings from GISS/NASA: https://data.giss.nasa.gov/gistemp/graphs_v4/. Shows global temperature anomaly relative to mean temp from 1980 - 2015. 
- Methane levels from NOAA: https://gml.noaa.gov/ccgg/trends_ch4. Converted ppb to ppt (divided by 1,000). Measurements conducted monthly from 1999 - 2018.

<div class="alert alert-block alert-success">

### 1.0 - Approximate Time:  10-15 mins

</div>

### 1.0 - A climate skeptic's analysis

<div class="alert alert-block alert-warning">

<b>Note to Instructors:</b> 

Even though we typically make a visualization here, we hope students "forget" to do that in this section... they will make a visualization in 2.0 though.

</div>

**1.1 -** The code below reads in the data. Take a look at the data frame. Do you have any questions about the variables?

In [None]:
# Read in data
temp_data <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQxO1NNRbXecdTvW8bfSlEytylLLv3jZ_ElehIakBQ157vKHQywyDs_cmyHZG9S0pjQN_SMKrAwSHEy/pub?gid=2129057940&single=true&output=csv")

# Preview the first few rows
head(temp_data)


**1.2 -** What is the climate skeptic's hypothesis about temperature and methane? Write it as a regular sentence as well as a word equation.

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

Their hypothesis is that there is a relationship between methane and temperature, and that as methane goes up, temperature goes down.

temp_anomaly = methane + other stuff


</div>

**1.3 -** Create a model to examine the climate skeptic's hypothesis.

In [None]:
# simple linear regression (temp ~ methane)
methane_model <- lm(temp_anomaly ~ methane, data = temp_data)
methane_model

**1.4 -** Is the skeptic's claim true that these two factors (temperature anomaly and methane levels) are negatively associated? How can you tell? (Support your answer by interpreting the parameter estimates or other statistics.)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

In our model, we see that `methane` has a negative coefficient value. This means that methane concentrations and temperature are negatively associated. As methane rises, the temperature tends to be lower. Specifically, for every 1 part per trillion increase in methane, our model predicts an 84.49 degree Celsius decrease in temperature anomaly.

</div>
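
One way to make the slope interpretation concrete is to compare two predictions that are exactly one unit apart. The sketch below is a minimal illustration in base R, assuming a made-up data frame called `toy_data` (simulated values, not the skeptic's `temp_data`) that reuses the column names from this notebook.

```r
# Hypothetical toy data (NOT temp_data): simulated so that temp_anomaly
# tends to fall as methane rises
set.seed(10)
toy_data <- data.frame(methane = runif(30, 1, 2))
toy_data$temp_anomaly <- 1 - 2 * toy_data$methane + rnorm(30, sd = 0.2)

# Simple regression, same form as the model in 1.3
toy_model <- lm(temp_anomaly ~ methane, data = toy_data)
b1 <- coef(toy_model)[["methane"]]

# Predictions at two methane values that are one unit apart
preds <- predict(toy_model, newdata = data.frame(methane = c(1, 2)))
pred_change <- unname(preds[2] - preds[1])

b1           # estimated slope
pred_change  # change in predicted temp_anomaly for a one-unit increase
```

The two printed values match, which is what "for every one-unit increase in methane, the model predicts a change of b1" means.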

<div class="alert alert-block alert-success">

### 2.0 - Approximate Time:  15-20 mins

</div>

### 2.0 - Exploring the data behind the model

Typically, we start by exploring the variation with a visualization. Let's check out this climate skeptic's data this way.

**2.1 -** Visualize the relationship between temperature anomaly and methane. What do you notice?

In [None]:
# scatter, with temp ~ methane
gf_point(temp_anomaly ~ methane, data = temp_data)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

It seems that there is some sort of bifurcation in the data, where both sets of data show slight increases in temperature for more methane, but there's a big starting gap in temperature between these sets of data.

</div>

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

**2.2 -** Look back at the dataset. Is there a variable that may explain the pattern you see above? Test your idea by creating a new visual or set of visuals. 

</div>

In [None]:
# Sample Responses

# jitterplot with temperature shown by season of month
gf_jitter(temp_anomaly ~ season, data = temp_data, width = 0.1)

# scatter, with temp ~ methane, colored by season
gf_point(temp_anomaly ~ methane, data = temp_data, color = ~season)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

It seems clear that seasonality plays a huge role in temperature. As expected, temperatures are much lower in winter than in summer. So, it seems as though we should control for season in our analyses.


<b> Instructor Note: </b> 

It may take students a while to explore the dataset visually and find the patterns associated with season. Encourage students to collaborate on this task, with group members trying out different variables and different visualizations. Challenge the groups to come up with the most effective visualizations to share with the rest of the class.


</div>

**2.3 -** Below, we've visualized the relationship between temperature and methane in our dataset, showing the line of best fit (used in the climate skeptic's model). Does the model fit the data well? Why does the model have a negative slope?


In [None]:
methane_model <- lm(temp_anomaly ~ methane, data = temp_data)

# Scatterplot of temp anomaly vs. methane, with the skeptic's model overlaid
gf_point(temp_anomaly ~ methane, data = temp_data, color = ~season) %>%
  gf_labs(title = "Global Temps vs. Global Methane Concentration (1999 - 2018)",
    x = "Methane Concentration (parts per trillion)",
    y = "Temperature Anomaly (C), w.r.t 1980-2015 Avg Temp") %>%
  gf_model(methane_model)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

The model has a negative slope because the winter months have lower temperature and higher methane. The summer months have lower methane but higher temperature. 

A model that uses methane to predict temperature generally predicts higher temps when there are lower concentrations of methane (these are the data from the summer months) and lower temps when there are higher levels of methane (the winter months). So, the overall relationship is negative. 

However, this is not a very well-fitting model. There are very large residuals, as the line does not come close to predicting many of the temperatures. Interestingly, *within* each season, it appears that temperature tends to increase as methane rises. We may have to find a model that takes season into account.


<b> Instructor Note: </b> 

It may be difficult to see why the line has a negative slope. The effect here is commonly known as Simpson's paradox. You might consider having students explore Simpson's paradox on Google or Wikipedia (e.g., https://en.wikipedia.org/wiki/Simpson%27s_paradox).

</div>
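
The reversal described above can be reproduced with made-up numbers. The sketch below is a minimal illustration, assuming a hypothetical data frame `toy_data` (simulated, not the skeptic's `temp_data`). Within each season temperature rises with methane, but the season with higher methane values sits much lower on the temperature axis.

```r
# Hypothetical toy data (NOT temp_data): positive trend within each season,
# but winter has higher methane and much lower temperature overall
set.seed(2)
n <- 20
toy_data <- data.frame(
  season  = rep(c("summer", "winter"), each = n),
  methane = c(runif(n, 1.0, 2.0), runif(n, 1.5, 2.5))
)
toy_data$temp_anomaly <- 0.5 * toy_data$methane -
  3 * (toy_data$season == "winter") + rnorm(2 * n, sd = 0.1)

# Slope when season is ignored
overall_slope <- coef(lm(temp_anomaly ~ methane, data = toy_data))[["methane"]]

# Slopes when each season is modeled on its own
summer_slope <- coef(lm(temp_anomaly ~ methane,
                        data = subset(toy_data, season == "summer")))[["methane"]]
winter_slope <- coef(lm(temp_anomaly ~ methane,
                        data = subset(toy_data, season == "winter")))[["methane"]]

c(overall = overall_slope, summer = summer_slope, winter = winter_slope)
```

In this simulation the overall slope comes out negative while both within-season slopes are positive, which is the signature of Simpson's paradox.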

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>


**2.4 -** Create a subset of the data that is *only* the winter months. Then, visualize the relationship between methane and temperature. What do you notice? Do the same with the summer month data. What do you notice?

</div>

In [None]:
# Collect just the winter month data
winter_data <- filter(temp_data, season == "winter")

# Visualize just winter month data
gf_point(temp_anomaly ~ methane, data = winter_data) %>%
  gf_labs(title = "Only Winter Months",
    x = "Methane Concentration (parts per trillion)",
    y = "Temperature Anomaly (C), w.r.t 1980-2015 Avg Temp")

In [None]:
# Collect just the summer month data
summer_data <- filter(temp_data, season == "summer")

# Visualize just summer month data
gf_point(temp_anomaly ~ methane, data = summer_data) %>%
  gf_labs(title = "Only Summer Months",
    x = "Methane Concentration (parts per trillion)",
    y = "Temperature Anomaly (C), w.r.t 1980-2015 Avg Temp")

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

Clearly, *within* each season, it appears that temperature tends to increase as methane rises. We may have to find a model that takes season into account.

</div>

<div class="alert alert-block alert-success">

### 3.0 - Approximate Time:  20-25 mins

</div>

### 3.0 - Getting multivariate

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

**3.1 -** Create and fit a model predicting temperature from methane, taking season into account. How does the coefficient value for `methane` differ from the one seen in the initial model in the notebook? Why do you think this change occurred?

</div>

In [None]:
# Multiple regression model, controlling for seasons (summer vs. winter)
methane_season_model <- lm(temp_anomaly ~ methane + season, data = temp_data)
methane_season_model

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

Whereas the coefficient was negative in our prior model, the coefficient for `methane` is now positive. This means that, controlling for season, temperature tends to increase as methane increases. The key difference is that we are now controlling for season - sort of like we're only comparing data values within the same season (summer against summer, winter against winter). When looking *within* season, there is a positive relationship between the greenhouse gas and temperature.


<b> Instructor Note: </b> 

In this case, it's extremely important to include season in the model. Because of this, students may walk away from this lesson with the idea that *every possible variable* should be included in multiple regressions. This is not always the case and, in fact, adding too many variables can also cause problems for model interpretations and predictions. We'll explore some of those cases in later notebooks. 
 
</div>

**3.2 -** Create a scatterplot showing temperature as predicted by methane, and color the data according to season. Then, visualize your multiple regression fit on the graph.

In [None]:
# scatter, with temp ~ methane, colored by season, showing the multiple regression fit
gf_point(temp_anomaly ~ methane, data = temp_data, color = ~season) %>%
  gf_labs(title = "Global Temps vs. Global Methane Concentration, by Season",
    x = "Methane Concentration (parts per trillion)",
    y = "Temperature Anomaly (C), w.r.t 1980-2015 Avg Temp") %>%
  gf_model(methane_season_model)

**3.3 -** Your model had a coefficient value for `seasonwinter`. What feature of your graph does this value describe?

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

`seasonwinter` is the baseline difference in temperature anomaly between summer and winter months, controlling for methane levels. On our graph, the coefficient value for `seasonwinter` appears as the vertical distance between the two lines. We subtract 3.77 degrees Celsius from the summer predictions (note: the coefficient value for `seasonwinter` was negative) in order to get the winter predictions.

</div>
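
To check that a season coefficient really is the vertical gap between the two lines, predict both seasons at the same methane value and subtract. The sketch below is a minimal illustration, assuming a hypothetical data frame `toy_data` (simulated, not `temp_data`) and an arbitrary methane value of 1.8.

```r
# Hypothetical toy data (NOT temp_data), with a built-in winter offset
set.seed(3)
n <- 20
toy_data <- data.frame(
  season  = rep(c("summer", "winter"), each = n),
  methane = c(runif(n, 1.0, 2.0), runif(n, 1.5, 2.5))
)
toy_data$temp_anomaly <- 0.5 * toy_data$methane -
  3 * (toy_data$season == "winter") + rnorm(2 * n, sd = 0.1)

# Same model form as methane_season_model
toy_model <- lm(temp_anomaly ~ methane + season, data = toy_data)

# Predict summer and winter at one shared (arbitrary) methane value
new_points <- data.frame(methane = c(1.8, 1.8),
                         season  = c("summer", "winter"))
preds <- predict(toy_model, newdata = new_points)
gap <- unname(preds[2] - preds[1])  # winter prediction minus summer prediction

coef(toy_model)[["seasonwinter"]]  # coefficient from the model
gap                                # vertical distance between the two lines
```

Because the model has no interaction term, the two lines are parallel, so the gap is the same at any methane value you choose.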

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>


**3.4 -** What feature of your graph does the `methane` coefficient describe? How does its value contradict the claim made by the climate change skeptic?

</div>

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

The methane coefficient is the slope of both lines in the graph: it describes the relationship between methane and temperature, controlling for season (i.e., within season). Because its value is positive, we have shown that temperature actually tends to rise as methane concentration increases (controlling for season). This is consistent with many scientists' claim that greenhouse gases (such as methane) contribute to warming the earth, and it contrasts with the skeptic's claim that temperatures tend to fall as methane becomes more prevalent.

</div>

**3.5 -** Which model is more convincing: The skeptic's single-predictor model or the multivariate model? Provide evidence using the visualizations created above and supernova tables.

In [None]:
supernova(methane_season_model)

In [None]:
supernova(methane_model)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

Looking at the multivariate regression plot, the fit is *clearly* better than the earlier fit. The temperatures are more closely predicted by our model (the data are closer, along the y-axis, to the lines of best fit). If we were making predictions, accounting for season would make our predictions more accurate. Looking at the supernova tables, we see the PRE jumps from 0.182 to 0.9955 when accounting for season. This provides further evidence that our multivariate model is a better fit to the data, as a much higher percentage of the variation in temperature is explained by our multivariate model.


<b> Instructor Note: </b> 

To reiterate - In this case, it's extremely important to include season in the model. Because of this, students may walk away from this lesson with the idea that *every possible variable* should be included in multiple regressions. This is not always the case and, in fact, adding too many variables can also cause problems for model interpretations and predictions. We'll explore some of those cases in later notebooks.  

</div>
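
For students who want to see where PRE comes from, it can be computed by hand as the proportion of total variation that a model's error leaves explained. The sketch below is a minimal illustration in base R, assuming a hypothetical data frame `toy_data` (simulated, not `temp_data`) and a helper function `pre()` written only for this example.

```r
# Hypothetical toy data (NOT temp_data)
set.seed(4)
n <- 20
toy_data <- data.frame(
  season  = rep(c("summer", "winter"), each = n),
  methane = c(runif(n, 1.0, 2.0), runif(n, 1.5, 2.5))
)
toy_data$temp_anomaly <- 0.5 * toy_data$methane -
  3 * (toy_data$season == "winter") + rnorm(2 * n, sd = 0.1)

simple_model <- lm(temp_anomaly ~ methane, data = toy_data)
full_model   <- lm(temp_anomaly ~ methane + season, data = toy_data)

# Illustrative helper: PRE = (SS Total - SS Error) / SS Total
pre <- function(model) {
  y <- model.response(model.frame(model))
  ss_total <- sum((y - mean(y))^2)   # error from the empty (mean-only) model
  ss_error <- sum(resid(model)^2)    # error left over after fitting the model
  (ss_total - ss_error) / ss_total
}

c(simple = pre(simple_model), full = pre(full_model))
```

For models with an intercept, this quantity is the same as the R-squared reported by `summary()`, so either can be used to compare how much variation each model explains.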