# Academic Integrity Statement

As a matter of Departmental policy, **we are required to give you a 0** unless you **type your name** after the following statement: 

> *I certify on my honor that I have neither given nor received any help, nor used any non-permitted resources, while completing this evaluation.*

\[TYPE YOUR NAME HERE\]

### Partial Credit

Let us give you partial credit! If you're stuck on a problem and just can't get your code to run: 

First, **breathe**. Then, do any or all of the following: 
    
1. Write down everything relevant that you know about the problem, as comments where your code would go. 
2. If you have non-functioning code that demonstrates some correct ideas, indicate that and keep it (commented out). 
3. Write down pseudocode (written instructions) outlining your solution approach. 

In brief, even if you can't quite get your code to work, you can still **show us what you know.**

## Problem 1: Estimating The Pace of Global Warming (65 points)

Global warming is one of the most important issues of our time, primarily because it threatens the natural habitat of nature's most majestic creature. 

<figure class="image" style="width:50%">
  <img src="http://c.files.bbci.co.uk/CF3C/production/_111925035_penguino.jpg" alt="">
  <figcaption><i>So precious and beautiful.</i></figcaption>
</figure>

If that wasn't bad enough, global warming reduces snowfall, thereby endangering our ability to produce videos of panda bears falling off of snowmen. 

<figure class="image" style="width:50%">
  <img src="https://media1.tenor.com/images/c20400089a4f275b4c1b32ebe309d93f/tenor.gif" alt="">
  <figcaption><i></i></figcaption>
</figure>

For these and other, less important reasons, it is urgent to estimate how quickly global warming is taking place. In this problem, we will use linear regression to perform a *very* simple estimate of the rate of global warming. In particular, we will consider the following, somewhat more restricted question: 

> What is the average rate of temperature increase per year in the contiguous United States? 

By "contiguous United States," we refer to the US excluding Alaska, Hawaii, and various territories and outlying islands. 

<figure class="image" style="width:50%">
  <img src="https://www.worldatlas.com/r/w768/upload/84/2f/7d/contiguousunitedstates.gif" alt="">
  <figcaption><i>The contiguous US.</i></figcaption>
</figure>

The data we will use to approach this question comes from the Global Historical Climatology Network, compiled by the National Centers for Environmental Information of the US National Oceanic and Atmospheric Administration. This data set contains temperature measurements at a variety of weather stations throughout the world, often going back many years. I have prepared a portion of this data set which includes only measurements of January temperatures at stations in the contiguous US. Additionally, I have filtered out stations that were missing a large number of measurements, as this can complicate analysis. 

### A Note

Before we begin, I'd like to emphasize that climate modeling is an extremely sophisticated and challenging profession. Our estimates in this problem should be viewed as very approximate. 

## Part A (10 points)

Run the code below to acquire the data. Each row of the data gives information about the average temperature in January at a given weather station. 

- The `NAME` column contains the name of the weather station. 
- The `YEAR` column contains the year of the measurement. Measurements are averages over the month of January.
- The `STNELEV` column describes the elevation (meters above sea level) of the station.
- The `LATITUDE` and `LONGITUDE` columns describe the geospatial coordinates of the station. 
- The `TEMP` column gives the average temperature in January, in °C. 

In [None]:
# run this block
import pandas as pd
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16A/master/datasets/US-temp-timeseries.csv"
temps = pd.read_csv(url)

In [None]:
# use this space to inspect the data


Write a few short lines of code, in the boxes indicated, to answer the following questions: 

1. How many weather stations are present in the data? 
2. Which weather station at elevation less than 400m recorded the hottest January in 2021? 
3. How does the mean elevation of stations *west* of -105 degrees longitude (i.e. longitude < -105) differ from the mean elevation of stations *east* of -105 degrees longitude (i.e. longitude >= -105)?

#### Specs

- It is not required that you write any comments or explanation of your results, provided that your code is sufficiently clear. If you're in doubt, a few comments never hurt. 
- For full credit, no `for`-loops. Make full use of Pandas tools. 
- For questions 2 and 3, there are multiple appropriate answers which will receive full credit. For example, on Q2, you may either simply print the name of the station or show the entire data frame row corresponding to that observation. On Q3, you might show a summary table or compute the difference in the averages; either is fine. 
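
As an illustration of the loop-free style requested above, here is a minimal sketch on a small made-up data frame. The data frame `toy` and its columns `site`, `height`, and `reading` are invented for this example and are not part of the course data. The same kinds of pandas tools (`nunique()`, boolean masks, `idxmax()`, and `groupby()`) can be applied to `temps`, though other approaches are equally valid.

```python
import pandas as pd

# toy data invented for illustration; not the course data set
toy = pd.DataFrame({
    "site":    ["a", "a", "b", "b", "c", "c"],
    "height":  [10, 10, 500, 500, 40, 40],
    "reading": [1.5, 2.5, 9.0, 8.0, 3.0, 4.0],
})

# count the distinct values in a column
n_sites = toy["site"].nunique()

# filter rows with a boolean mask, then locate the row with the largest value
low = toy[toy["height"] < 100]
top_row = low.loc[low["reading"].idxmax()]

# compare group means by grouping on a boolean condition
mean_by_group = toy.groupby(toy["height"] < 100)["reading"].mean()

print(n_sites)
print(top_row["site"])
print(mean_by_group)
```

None of these steps requires an explicit `for`-loop: each operates on whole columns at once.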

In [None]:
# 1. How many weather stations are represented in the data? 


In [None]:
# 2. Which weather station at elevation less than 400m 
# recorded the hottest January in 2021?


In [None]:
# 3. How does the average elevation of stations *west* 
# of -105 degrees longitude (i.e. longitude < -105) 
# differ from the average elevation of stations *east* 
# of -105 degrees longitude (i.e. longitude >= -105)?


## Part B (10 points)

Create and fit a linear regression model. First, use only the `YEAR` column as the predictor variable. Use `TEMP` as the target variable. 

**It is not necessary to perform a train-test split in this or any other parts of the problem.**

In [None]:
# create and fit your model
# not necessary to display any output here


The `coef_` attribute of the fitted model gives the model's estimate of the average change in temperature per year. For example, if the coefficient of `YEAR` has value `0.05`, this indicates that, according to the model, average temperatures at these weather stations are rising at a rate of approximately 0.05 °C per year.
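
To make the interpretation of `coef_` concrete, here is a minimal sketch on synthetic data in which the true trend is known to be 0.05 per year. The data frame `demo` and the variable names `LR` and `slope` are invented for this illustration; the numbers have nothing to do with the real temperature data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# synthetic data invented for illustration: a known trend of 0.05 per year
demo = pd.DataFrame({"YEAR": np.arange(1950, 2000)})
demo["TEMP"] = -100.0 + 0.05 * demo["YEAR"]

# the predictor data must be two-dimensional, hence the double brackets
LR = LinearRegression()
LR.fit(demo[["YEAR"]], demo["TEMP"])

# coef_ holds one slope per predictor column
slope = LR.coef_[0]
print(round(slope, 3))
```

Because this synthetic data has no noise, the fitted slope recovers the trend that was built in. With real measurements, the coefficient is only an estimate.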

What is your estimate of the rate of temperature increase from this simple model? 

In [None]:
# show the estimate here


Evaluate the average CV score of this model, which uses only `YEAR` as the predictor data. Use 5-fold cross validation. Based on the CV score, do you trust this model? 

**Note**: We haven't talked much about the score function for linear regression models. What you need to know is that the highest possible score is `1.0`. A score of `0.0` corresponds to an uninformative "constant" model that doesn't include any change over time.
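
For reference, the score that scikit-learn reports for `LinearRegression` is the coefficient of determination, often written R². The sketch below, on synthetic data invented for this illustration, shows how the 5-fold CV score behaves in two extreme cases: a target that depends strongly on the predictor, and a target that is unrelated to it. The variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# synthetic data invented for illustration
X = rng.uniform(0, 10, size=(200, 1))
y_trend = 2.0 * X[:, 0] + rng.normal(0, 1, size=200)  # depends strongly on X
y_noise = rng.normal(0, 1, size=200)                  # unrelated to X

# cv=5 requests 5-fold cross-validation; one score is returned per fold
scores_trend = cross_val_score(LinearRegression(), X, y_trend, cv=5)
scores_noise = cross_val_score(LinearRegression(), X, y_noise, cv=5)

print(scores_trend.mean())  # close to 1: the predictor is informative
print(scores_noise.mean())  # near 0 (possibly slightly negative)
```

A score near 0 does not mean the fitted slope is wrong, only that the predictors explain little of the variation in the target.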

In [None]:
# compute and show the CV score here


Do you trust this model to give a good estimate of the mean yearly change in temperature? 

```
```
---

[***Your thoughts here***]

```
```
------

## Part C (10 points)

Now, create and fit a model that also uses the elevation, latitude, and longitude variables as predictors, in addition to the year. Check the coefficients again. What is the model's estimate for the average change in temperature per year? How does it compare to your answer from the previous part?

***Note***: *when using multiple variables in linear regression, the `coef_` attribute is in the same order as the variables supplied. So, if `YEAR` is the first column of the predictor data, then the first entry of `coef_` will correspond to `YEAR`.*
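
The ordering described in this note can be checked on a small synthetic example. In the sketch below, the columns `u`, `v`, and `w` and their coefficients are invented for illustration; pairing the column names with `coef_` via `zip` is one convenient way to keep track of which coefficient belongs to which predictor.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# synthetic data invented for illustration, with known coefficients
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "u": rng.normal(size=100),
    "v": rng.normal(size=100),
})
demo["w"] = 3.0 * demo["u"] - 2.0 * demo["v"] + 1.0

cols = ["u", "v"]
LR = LinearRegression().fit(demo[cols], demo["w"])

# pair each predictor name with its coefficient, in column order
coefs = dict(zip(cols, LR.coef_))
print(coefs)
```

Here the coefficient for `u` is positive and the coefficient for `v` is negative, matching the signs used to construct `w`.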

In [None]:
# create your model here
# not necessary to display anything


In [None]:
# show the coefficients here


There are also coefficients corresponding to station elevation, latitude, and longitude. Write a sentence for each of these indicating what they mean. You only need to comment on the sign of the coefficient, not the precise value. For example, here's one for the year:  

> *Because the `YEAR` coefficient is positive, the model predicts that temperatures get warmer as `YEAR` increases*.

You are welcome but not required to include any scientific knowledge you may have in your sentences. For example, you may know something about the relationship between temperature and elevation, or temperature and latitude. However, this aspect is not a graded criterion for this part of the problem. 

```
```
---

[***Your sentences here***]

```
```
---

Now check the cross-validation score. Has the model improved? Do you trust this model more than your model from Part B?

In [None]:
# compute and show the CV score here


## Part D (15 points)

Local weather is often subject to "warm spells" or "cold spells" that can make it difficult to form reliable estimates. For example, if we happen to begin our measurement period during a "warm spell" and end it during a "cold spell," then we could be led to the conclusion that weather is on average becoming cooler. Handling temporal correlations such as these is one of the main reasons that time-series analysis is challenging. In this part, we'll perform the data preparation for one approach to this problem. This simplistic approach is to include *lagged* versions of the temperature in our model.  

To begin, run the block below to ensure that the data is sorted correctly. 

In [None]:
# run this block
temps = temps.sort_values(["NAME", "STNELEV", "YEAR"])

Next, write a function called `lag()` which accepts two arguments: 

- `x`, a numpy array. 
- `num_places`, the number of places by which to lag `x`. 

The result of `lag(x, num_places)` should be a new array `y` which is a copy of `x` that has been "shifted over" by `num_places`. You may assume that the input `x` is a 1d numpy array of `float`s. The first `num_places` elements of `y` should be `nan`. 

A bit more technically, `y[i] = x[i - num_places]` if `i >= num_places`, and `y[i] = np.nan` if `i < num_places`.

Here's an example:

```python
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
lag(x, 2)
```

```
# output
array([nan, nan,  1.,  2.,  3.])
```

#### Specs and Hints

- For full credit, please do not use `for` loops. 
- Please include an **informative docstring** for your function describing its inputs and outputs. 
- Comments are only necessary if your solution exceeds five lines. 
- You may assume that `x` is a 1d array of floats and that `num_places` is an integer which does not exceed the length of `x`. Type checking is not necessary. 

**Hint**: There are a number of good solutions, but one approach involves `np.roll()` ([documentation](https://numpy.org/doc/stable/reference/generated/numpy.roll.html)). 
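
If you have not used `np.roll()` before, the short sketch below shows what it does to the example array. Note that it is not a complete solution: `np.roll()` wraps the elements that fall off the end back around to the front, so those leading entries would still need to be dealt with.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# np.roll shifts elements to the right and wraps the overflow around to the front
rolled = np.roll(x, 2)
print(rolled)
```

The call returns a new array and leaves `x` unchanged.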


In [None]:
# define lag() here
# don't forget a docstring!


In [None]:
# test your lag function here
# desired output: array([nan, nan,  1.,  2.,  3.])


Next, run the following code to add 10 new columns to the data. 

In [None]:
# run this block
for i in range(1, 11):
    temps["lag" + str(i)] = temps.groupby(["NAME"])["TEMP"].transform(lambda x: lag(x, i))
temps = temps.dropna()

The columns you added are *lagged* versions of the `TEMP` column, each lagged by a certain number of years. For example, as you can check, the value of `lag4` at Station `ALMA_BACON_CO_AP` in the year 1955 is the value of `TEMP` at `ALMA_BACON_CO_AP` in the year 1951.
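
To visualize what a per-station lag looks like, here is a minimal sketch on toy data. It uses pandas' built-in `shift()` rather than the `lag()` function from this problem, purely as an illustration; the data frame `toy` and its columns are invented for the example.

```python
import pandas as pd

# toy data invented for illustration; not the course data set
toy = pd.DataFrame({
    "station": ["a", "a", "a", "b", "b", "b"],
    "year":    [2000, 2001, 2002, 2000, 2001, 2002],
    "value":   [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# shift values by one row within each station; groups do not leak into each other
toy["value_lag1"] = toy.groupby("station")["value"].shift(1)
print(toy)
```

The first row of each station has no earlier value to draw on, so its lagged entry is `NaN`. This is why calling `dropna()` after adding lagged columns removes the earliest years for every station.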

In [None]:
# use this cell to take another look at the data


## Part E (15 points)

Incorporating lagged versions of the `TEMP` variable is a simple way to control for "hot spells" and "cold spells" during the measurement period. 

Now, let's see how the incorporation of these lagged columns impacts the model. Create a figure with two subplots. 

- The horizontal axis of both subplots should be `t_lag`, the number of lagged columns included in the model. `t_lag` should range from `0` to `10`, inclusive.
- The vertical axis of the first subplot should be the 5-fold CV score of the model which includes the columns `YEAR`, `LATITUDE`, `LONGITUDE`, `STNELEV`, and all the `lag` columns up to and including `t_lag`. For example, if `t_lag = 3`, then `lag1`, `lag2`, and `lag3` should be included in the model. If `t_lag = 0`, then no lagged columns are included.
- The vertical axis of the second subplot should be the estimate of the yearly rate of increase in temperature, obtained by inspecting the `coef_` attribute of the fitted linear regression model. 

In other words, you should train a distinct model for each value of `t_lag`. 

Give your figure appropriate horizontal and vertical axis labels. Your output should look something like this: 

<figure class="image" style="width:80%">
  <img src="https://raw.githubusercontent.com/PhilChodrow/PIC16A/master/_images/delta-C-output.png" alt="">
  <figcaption><i></i></figcaption>
</figure>

Feel free to get creative with the color, axis labels, etc. 
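
If the mechanics of a two-panel figure are the sticking point, the sketch below shows one possible skeleton. The plotted values are placeholders invented for illustration and say nothing about what your results should look like; the variable names are likewise assumptions.

```python
import numpy as np
from matplotlib import pyplot as plt

# placeholder values invented for illustration; replace with your own results
t_lags = np.arange(0, 11)
placeholder_scores = np.linspace(0.1, 0.5, 11)
placeholder_rates = np.linspace(0.03, 0.01, 11)

# one row, two columns of axes
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].plot(t_lags, placeholder_scores)
axes[0].set(xlabel="t_lag", ylabel="CV score")

axes[1].plot(t_lags, placeholder_rates)
axes[1].set(xlabel="t_lag", ylabel="Estimated yearly change (°C)")

plt.tight_layout()
```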

*Yes, the example plot shown above is in fact the plot I got when writing this problem, so you can use it to check your previous steps.* 

#### Alternative for Partial Credit

If you are struggling to create this plot, then you may, for partial credit (up to 10/15 points), demonstrate any other way to compare the estimates of the temperature change and CV score as more lagged columns are incorporated. For instance, you could print out this information, or collect it in a `DataFrame` and display that. 

#### Specs and Hints

- For full credit, you should not write out the individual names of the lagged columns. For example, a solution including a line such as `X = temps[["lag1", "lag2", "lag3", "lag4"]]` will not receive full credit.
- You are not required to add gridlines, but if you would like to, the following lines will do the trick for a single axis `ax`: 
```python
ax.set_axisbelow(True)
ax.grid(True)
```
- `for`-loops over the axes or possible values of `t_lag` are fine. Other `for`-loops should be minimized in favor of `pandas` and `numpy` operations.
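
One way to avoid typing out the lagged column names is to build them from `t_lag`. The sketch below is one possible approach; the helper name `predictor_cols` and the variable `base_cols` are hypothetical and not part of the problem statement.

```python
# build the predictor column names for a given t_lag without typing them out
base_cols = ["YEAR", "LATITUDE", "LONGITUDE", "STNELEV"]

def predictor_cols(t_lag):
    """Return the base columns plus lag1, ..., lag{t_lag} (hypothetical helper)."""
    return base_cols + ["lag" + str(i) for i in range(1, t_lag + 1)]

print(predictor_cols(0))
print(predictor_cols(3))
```

With `t_lag = 0` the list contains only the base columns, which matches the requirement that no lagged columns are included in that case.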

In [None]:
# create your visualization (or alternative solution) here


## Part F (5 points)

***Note***: You can still do this part even if you didn't fully complete Part E above, using the supplied example plot. 

We have now generated 11 estimates of the yearly increase in temperature, one for each possible value of `t_lag` (between 0 and 10 inclusive).

A reasonable approach to choosing an estimate from the previous part is to choose a value of `t_lag` such that higher values of `t_lag` don't generate substantially higher CV scores. There is a considerable amount of subjectivity in this choice. Suggest a value of `t_lag` based on your plot above, and write a few sentences to justify your choice.
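
Reading the value off your plot by eye is perfectly acceptable. For the curious, the sketch below shows one way such a rule could be made explicit: pick the smallest `t_lag` whose score is within some tolerance of the best score. The scores and the tolerance `tol` are invented for illustration, and the choice of tolerance is exactly where the subjectivity enters.

```python
import numpy as np

# hypothetical CV scores for t_lag = 0, ..., 5, invented for illustration
scores = np.array([0.10, 0.30, 0.42, 0.45, 0.46, 0.46])

# assumed tolerance: gains smaller than this are treated as not substantial
tol = 0.02

# smallest t_lag whose score is within tol of the best score
chosen = int(np.argmax(scores >= scores.max() - tol))
print(chosen)
```

A different tolerance would select a different `t_lag`, so any such rule should come with a sentence or two of justification.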

What is your corresponding estimate of the average change in temperature per year, in °C? How does it compare to your previous estimates in earlier parts of this problem?

```
```
---

[***Your thoughts here***]

```
```
---

[According to the EPA](https://www.epa.gov/climate-indicators/climate-change-indicators-us-and-global-temperature#:~:text=Since%201901%2C%20the%20average%20surface,decade%20(see%20Figure%201).), average temperatures in the contiguous US have risen by about 0.008 °C per year on average since 1901. Since the 1970s, this rate has increased, and is estimated to be between 0.016 °C and 0.026 °C per year. Our data set contains measurements from 1941 through 2020.

Do you find that your estimate roughly matches the EPA estimates? What about the estimates from Parts B and C? 

**Note**: It is ok not to be sure! Just do your best and explain your reasoning. 

---

[***Your thoughts here***]

```
```
---