# Exploring climate change in Finland using Python

You are almost certain to have heard of climate change and the threats it poses to animals, the environment, and society worldwide. And you may have even seen images of changes in global temperatures over time and the warming trend observed in the past 150 years or so. But what about in Finland? Can we observe the same kind of warming here? And if so, how much has it warmed and which seasons are warming the most? Let's see what the data have to say...

In this exercise you will use temperature data from a Finnish Meteorological Institute observation station in Sodankylä to determine whether the climate is warming in northern Finland, and if so, how quickly. We will explore the changing climate by season, ultimately producing some plots of how temperatures vary with time for each season.

## About the data

The temperature data used in this exercise come from the US National Oceanic and Atmospheric Administration (NOAA) climate database and consist of temperature observations from 1908 to 2020 at Sodankylä. We will explore the data at the start of our exercise so you can see what the dataset includes, how often temperatures are recorded, etc.

## Tools used in this exercise

We will be using the Python programming language for this exercise inside a web application called a Jupyter notebook. Jupyter notebooks are broken up into sections called cells, and each cell can either contain code or text instructing you what to do in the exercise. We will be running the notebooks using a cloud computing service called Binder, which is free and easy to use for this purpose. However, Binder sessions will terminate after about 10 minutes of inactivity, so you may encounter some problems with Binder if you leave it idle for too long. Binder issues and other common problems you may encounter are briefly described below.

We can start by testing things out using the code cell below. Type in your name between the quotation marks below, and then press **Shift-Enter** to run the cell and see its output. For example, if the first line read

```python
name = 'Dave'
```

then when you run the cell the output should be

```
Hello, Dave. We hope you enjoy today's Python exercise!
```

Please test this below!

In [None]:
name = ''
print('Hello, '+name+'. We hope you enjoy today\'s Python exercise!')

## What if I have a problem?

Below are a few common problems you might encounter.

### Binder dies

If your Binder session terminates you may see a message indicating that Binder has died and could not automatically be restarted. Click on the red button to restart, and you may have to re-run all of the earlier cells down to the point where you were when Binder died. You won't lose your work this way, at least :).

### Some or all plot data is not visible

If you are not able to see some of the data in your plot you can try the following:

- Click on the legend items to make them visible
- Reset the plot data range by clicking on the button with the two arrows in a circle

### An error is displayed when you run a cell

The best thing to do here is post a message in chat and ask Leevi for assistance. Error messages can be quite variable.

## Getting started

The first step in this exercise is to load some software libraries that will be used to read in the data file we will be using and to plot our results.

You can load the libraries by clicking on the cell below and pressing **Shift-Enter** or pressing the play button in the toolbar at the top of this notebook.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import pandas_bokeh

### Reading the data

Now we will read in the [text file containing the data used in this exercise](data/2315676.txt). The data file contains temperature measurements and their dates from a Finnish Meteorological Institute observation station in Sodankylä. Again, you can press **Shift-Enter** to run the cell.

Note that we do not expect you to understand the code in the cell below. We will have a look at the data that you loaded in just a moment.

In [None]:
# Read the data, treating the value -9999 as missing
fp = r'data/2315676.txt'
data = pd.read_csv(fp, sep=r'\s+', skiprows=[1],
                   usecols=['DATE', 'TAVG', 'TMAX', 'TMIN'],
                   na_values='-9999', index_col='DATE', parse_dates=True)
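If you are curious about what the options in the cell above do, here is a minimal sketch that reads a tiny made-up sample instead of the real file. The layout of the sample (a header row, a row of dashes, then whitespace-separated values with `-9999` for missing observations) is an assumption based on the options used above, and the station name `SAMPLE`, the ISO-style dates, and the values are invented for illustration.

```python
import io
import pandas as pd

# Made-up sample in the layout we assume for the real file: a header row,
# a row of dashes (skipped with skiprows=[1]), then whitespace-separated values.
sample_text = """STATION DATE TAVG TMAX TMIN
------- ---------- ----- ----- -----
SAMPLE 1908-01-01 -9999 2 -37
SAMPLE 1908-01-02 -9999 6 -26
SAMPLE 1908-01-03 10 -9999 -9999
"""

sample = pd.read_csv(io.StringIO(sample_text), sep=r'\s+', skiprows=[1],
                     usecols=['DATE', 'TAVG', 'TMAX', 'TMIN'],
                     na_values='-9999', index_col='DATE', parse_dates=True)

# The -9999 entries are now missing values (NaN) and DATE is the index
print(sample)
print(sample.isna().sum())
```

The second `print` counts the missing values in each column, which is a quick way to see how `na_values` was applied.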

### Exploring the data

Let's start by checking out the data we have loaded. We can do this by looking first at the top 10 rows of the data we have loaded from the data file, and then the bottom 10 rows. Press **Shift-Enter** to run both cells below.

In [None]:
data.head(10)

In [None]:
data.tail(10)

#### Stop and think!

At this point, we have just loaded some climate data from a file, but we have not told you what kinds of data we are working with, the range of dates for the data, or anything else about the data file. But we can use the output above to answer some questions. You do not need to write anything down for this, but just think about the answers.

1. What is the earliest date in the climate data file?
2. What is the date of the most recent observation?
3. What do you think the headings `TAVG`, `TMIN`, and `TMAX` mean?
4. How frequent are the climate observations reported in the data file?
5. When you look at the temperatures, do you see anything unexpected? What kinds of things do you see, and what do you think they mean?

***Think about the answers to the questions above before continuing with the exercise.***

### Exploring the data, part 2

Now let's look at our data in another way, using a data visualization. Here we will make a few plots of a range of dates to explore further.

You can start by running the cell below by pressing **Shift-Enter**. This cell configures the plotting that is used in the exercise to work in this notebook.

In [None]:
pandas_bokeh.output_notebook()
pd.set_option('plotting.backend', 'pandas_bokeh')

Now that the plotting has been configured we can take a look at some temperature data. Let's start by plotting the temperatures in the spring (March-May) of 1908. We can create the plot by pressing **Shift-Enter** to run the cell below.

In [None]:
ax = data.plot(title='Sodankylä temperatures',
               xlabel='Date', ylabel='Temperature [°F]',
               xlim=['03-01-1908', '05-31-1908'], figsize=(800, 500))

So, what do we see?

Well first, we see there are two lines, one labeled "TMAX" and one labeled "TMIN". These are the maximum and minimum temperatures for the dates we plotted. We also have a label for "TAVG", but it seems there is no blue line on the plot.

#### Stop and think!

Looking at the plot above, please answer the following questions:

1. What is the coldest minimum temperature in the spring of 1908? **Note**: The plot above is interactive, and you can move your mouse cursor on the plot to see the plotted values!
2. What is the warmest maximum temperature in the spring of 1908?
3. What do you think about the range of temperatures that are plotted? Are these reasonable temperatures for Sodankylä in spring? Why or why not?
4. Why do you think there are gaps in the plotted lines?

***Think about the answers to the questions above before continuing with the exercise.***
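Reading values off an interactive plot works well, but the same answers can also be checked numerically. The sketch below shows one way to do it on a tiny made-up table (the dates and temperatures are invented, and `toy` stands in for the `data` table used in this notebook): select a range of dates with `.loc` and then ask for the minimum and maximum.

```python
import pandas as pd

# Made-up temperatures (°F) on four dates; only two fall within March-May
toy = pd.DataFrame(
    {'TMAX': [30.0, 41.0, 55.0, 60.0], 'TMIN': [5.0, 20.0, 33.0, 45.0]},
    index=pd.to_datetime(['1908-02-28', '1908-03-15', '1908-05-31', '1908-06-01']),
)

# Select all rows from 1 March to 31 May (both ends are included)
spring = toy.loc['1908-03-01':'1908-05-31']

coldest_min = spring['TMIN'].min()
warmest_max = spring['TMAX'].max()
print(coldest_min, warmest_max)
```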

Let's now look at another set of temperatures from the spring of 2020, and then we'll discuss a few things we can observe. Run the cell below to create the plot.

In [None]:
ax = data.plot(title='Sodankylä temperatures',
               xlabel='Date', ylabel='Temperature [°F]',
               xlim=['03-01-2020', '05-31-2020'], figsize=(800, 500))

Looks fairly similar to the data from 1908, but now we can see the blue line as well. The blue line for "TAVG" is the average temperature for each date in the plotted data.

#### Our data explained

So, we have seen a few different things at this point and we should explain them.

1. We have seen that our data file has daily temperature data from 1908 to 2020. This gives us more than 100 years of climate data to explore!
2. We have also seen that we have minimum ("TMIN"), average ("TAVG"), and maximum ("TMAX") temperature observations for each date.
3. In addition, there are some missing temperature observations. These are the gaps in the lines that are plotted, where there was no temperature value recorded for that date. **Note**: The missing values appear as `NaN`, which stands for "not a number", in the data we loaded.
4. You may have noticed quite high temperatures in general. This is because our temperatures are in Fahrenheit, since the data file is from an American database of climate data. We will convert the temperatures to Celsius later.

### A final introductory plot

Finally, you may have noticed that we chose to plot only a small range of dates in our data file, rather than all 100+ years. Why not plot all of the data at once? Run the cell below to find out!

In [None]:
ax = data.plot(title='Sodankylä temperatures',
               xlabel='Date', ylabel='Temperature [°F]', figsize=(800, 500))

OK, that's a mess. Perhaps now it is clear why we only looked at a smaller range of dates.

With daily observations over more than 100 years, we have quite a lot of values to plot. If you type `len(data)` into the cell below and run it you will see how many dates we have in total.
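To get a feeling for the size of that number, we can estimate an upper bound with a short calculation. This sketch assumes there is one observation on every single day from 1 January 1908 to 31 December 2020; the real file may start or end partway through a year and may skip some dates, so `len(data)` can be smaller.

```python
from datetime import date

# Assumption: one observation per day for the full years 1908-2020
first_day = date(1908, 1, 1)
last_day = date(2020, 12, 31)

# Number of calendar days in the range, counting both end dates
max_days = (last_day - first_day).days + 1
print(max_days)
```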

## Processing the data

As noted above the average daily temperatures in the column `TAVG` are missing in some cases. The average temperatures are the preferred values to use, however, if we want to see how the temperatures have changed over time for different seasons of each year in the dataset. Thus, we need to find some way to "fill in" some of the missing temperatures.

Here, we will fill in these missing values using an estimate for the average daily temperature. We will estimate the daily temperatures by calculating the average of the daily min and max temperatures, `TMEAN` = (`TMAX` + `TMIN`) / 2.

We can calculate the new `TMEAN` values by running the cell below.

In [None]:
data['TMEAN'] = (data['TMAX'] + data['TMIN']) / 2.0

Our next step is to fill in any places where the `TAVG` value is missing with our estimate. We can fill these values into a new column titled `TEMP`. Run the cell below to do that.

In [None]:
# Use TAVG where it exists, and the TMEAN estimate where TAVG is missing
data['TEMP'] = data['TAVG'].fillna(data['TMEAN'])

Lastly, we can remove any dates where the daily average temperature is still missing. This could be the case if the `TMIN` or `TMAX` value was not recorded for one of the dates where we needed to use the estimated daily temperature.

In [None]:
data.dropna(subset=['TEMP'], inplace=True)

Let's now have a look at what our data look like by inspecting the first and last 10 rows of the data. You can copy and paste the code from the cells above where we did the same. *Hint: We used the `head` and `tail` functions for this when we first loaded the data file in the cells above*.

We can also have another look at how many dates in total we have in the data (after dropping missing `TEMP` values). You can type `len(data)` in the cell below and run it to see how many rows remain.

#### Stop and think!

Now that we have made some changes to the data, we should think a bit about what has changed. Looking at the outputs above, consider the following questions:

1. What data are in the columns "TAVG", "TMEAN", and "TEMP"? How do they differ?
2. What does the column "TEMP" contain if a value is present in column "TAVG", but missing in column "TMEAN"?
3. What does the column "TEMP" contain if a value is present in column "TMEAN", but missing in column "TAVG"?
4. Do you understand what has happened to fill in the missing values for the column "TEMP"? Why or why not?

***Think about the answers to the questions above before continuing with the exercise.***
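If you would like to check your answers, the three processing steps (estimate, fill in, drop) can be sketched on a tiny made-up table. The values below are invented so that each row shows a different case, and the fill step uses the pandas `fillna` method, which is one way to take `TAVG` where it exists and `TMEAN` otherwise.

```python
import numpy as np
import pandas as pd

# Made-up rows: (0) everything present, (1) TAVG missing,
# (2) TAVG and TMAX missing, (3) TMAX and TMIN missing
toy = pd.DataFrame({
    'TAVG': [21.0, np.nan, np.nan, 15.0],
    'TMAX': [25.0, 30.0, np.nan, np.nan],
    'TMIN': [15.0, 10.0, 5.0, np.nan],
})

# Step 1: estimate the daily mean from the daily max and min
toy['TMEAN'] = (toy['TMAX'] + toy['TMIN']) / 2.0

# Step 2: use TAVG where it exists, otherwise the TMEAN estimate
toy['TEMP'] = toy['TAVG'].fillna(toy['TMEAN'])

# Step 3: drop the rows where no temperature could be found at all
toy = toy.dropna(subset=['TEMP'])
print(toy)
```

Row 2 is the only one removed, because it has neither an observed average nor both of the values needed for the estimate.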

### Exploring our modified data

Now, we can also look at a visualization of our data to see how things have changed. Run the cell below to plot the data for spring of 1908 again.

In [None]:
ax = data.plot(title='Sodankylä temperatures',
               xlabel='Date', ylabel='Temperature [°F]',
               xlim=['03-01-1908', '05-31-1908'], figsize=(800, 500))

If you click on some of the labels in the legend of the plot above you can hide them. Click on the label for "TEMP" in the plot above. Do you see any other plot data when you hide the "TEMP" values? What about if you hide "TMEAN"? Think about this for a moment...

What do we see now? Well, we can see that there are no "TAVG" values for spring of 1908, so the "TEMP" values are filled in from the estimate of the temperature from the average of the "TMIN" and "TMAX" columns.

Let's have a look at spring of 2020 now.

In [None]:
ax = data.plot(title='Sodankylä temperatures',
               xlabel='Date', ylabel='Temperature [°F]',
               xlim=['03-01-2020', '05-31-2020'], figsize=(800, 500))

What happens when you hide "TEMP" now?

#### Stop and think!

Looking at the two plots above:

1. What value has been used to fill in the temperatures in the spring of 2020? Does this make sense?
2. How close are the estimates for the daily average temperature from the "TMIN" and "TMAX" columns compared to the observed average daily temperature in column "TAVG"?
3. Is the estimated average temperature fairly close to the observed daily temperature?

***Think about the answers to the questions above before continuing with the exercise.***

## Converting temperatures to Celsius

Up to this point we've been working with temperatures in degrees Fahrenheit, but we should convert these to Celsius since those are the preferred units in most of the world (including Finland). We can do this using a "function".

We can create a simple function to convert temperatures from Fahrenheit to Celsius by running the cell below. The equation we will use is

$$
\Large
T_{\mathrm{C}} = \frac{5}{9} \left(T_{\mathrm{F}} - 32\right)
$$

In [None]:
def f2c(temp):
    """Converts temperatures from Fahrenheit to Celsius"""
    return 5/9 * (temp - 32.0)

The cell above only defines the "function" so that it can be used. In order to use it to convert our temperatures we need to run the cell below.

By running the cell below we will convert the temperatures in column "TEMP" to a new column "TEMP_C", which will be in Celsius.

In [None]:
data['TEMP_C'] = data['TEMP'].apply(f2c)
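A quick way to convince yourself that a conversion function like this behaves as expected is to try it on a few temperatures whose Celsius values are well known: water freezes at 32°F (0°C) and boils at 212°F (100°C), and -40 is the same in both scales. The sketch below repeats the function definition so that it can be run on its own.

```python
import math

def f2c(temp):
    """Converts temperatures from Fahrenheit to Celsius"""
    return 5/9 * (temp - 32.0)

# Well-known reference points as (Fahrenheit, Celsius) pairs
checks = [(32.0, 0.0), (212.0, 100.0), (-40.0, -40.0)]

for temp_f, expected_c in checks:
    result = f2c(temp_f)
    print(temp_f, '°F ->', round(result, 2), '°C')
    # Compare with a small tolerance, since these are floating point numbers
    assert math.isclose(result, expected_c, abs_tol=1e-9)
```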

At this point we can check out our data by typing `data` into the cell below and running it.

#### Stop and think!

Looking at the output above, think about the following questions:

1. Do the temperatures in the column "TEMP_C" make sense to you?
2. Do these seem like more reasonable temperatures for Sodankylä for the dates you can see?

***Think about the answers to the questions above before continuing with the exercise.***

## Calculating monthly average temperatures

Our overall goal in this exercise is to look at how the temperatures in different seasons have changed in the years covered in our data. Although we could calculate seasonal average temperatures right away, we will start by converting our daily temperature observations into averages for each month in the dataset. Later, we will use the monthly values to find seasonal averages.

To get started we first need to take our temperature observations and group them together by each month of each year. You can run the cell below to do this.

In [None]:
grouped = data.groupby(pd.Grouper(freq='M'))

Now that we have our monthly groups, we can run the cell below which will create a new data table called `monthly_data` containing average temperatures for each month. Note that this might take a moment to complete.

In [None]:
# The columns that we want to include
mean_cols = ['TEMP_C']

# Calculate the mean of each monthly group. The result is a new DataFrame
# whose DATE index holds the key (i.e. the date+time information) of each group
monthly_data = grouped[mean_cols].mean()
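To see what monthly averaging does, here is a minimal sketch on five made-up daily values. It groups the days by calendar month using `to_period('M')`, which is a slightly different grouping key from the one used in the notebook (it labels each group by the month itself), so treat it as an illustration of the idea rather than the exact method used above.

```python
import pandas as pd

# Made-up daily temperatures (°C): three days in January, two in February
dates = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03',
                        '2020-02-01', '2020-02-02'])
toy = pd.DataFrame({'TEMP_C': [-10.0, -12.0, -14.0, -8.0, -6.0]}, index=dates)

# Group the days by calendar month and average each group
toy_monthly = toy.groupby(toy.index.to_period('M'))['TEMP_C'].mean()
print(toy_monthly)
```

Five daily rows become two monthly rows, one per month, each holding the average of the days in that month.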

Let's now have a look at what our new data table looks like. You can see it by typing `monthly_data` into the cell below and running it.

#### Stop and think!

Based on the output from the cell above:

1. What do you think happened when the new `monthly_data` table was created?
2. What do the values in the `TEMP_C` column represent here? Do the values make sense to you?
3. Do you see anything unexpected in the output?

***Think about the answers to the questions above before continuing with the exercise.***

Hopefully you have some idea of what the temperature values in the `monthly_data` table represent, but we will return to this in case you're confused.

For now, let's have a quick look at all of our monthly temperatures in a plot and see whether it might help us understand how temperatures have varied since 1908. You can run the cell below to produce a plot of the monthly temperature values.

In [None]:
ax = monthly_data.plot(title='Sodankylä temperatures',
                       xlabel='Date', ylabel='Temperature [°C]', figsize=(800,500))

Hmm...not that helpful still. We can perhaps see a bit more of how the temperatures have varied for the monthly averages, but it is still hard to see any clear trends in the data, such as a gradual warming of temperatures. To see this we'll need to do the next two analysis steps.

**Note**: In case you're wondering about the "unexpected" thing above, it is that the monthly temperatures are shown as values for the last day in each month. For this kind of date format it is expected that every date have a year, month, and day.

## Calculating monthly average temperatures for a reference period

Our next few steps might be a bit confusing, but we'll try to explain and ask some questions along the way to help you understand. In this section and the one that follows we are going to calculate a temperature anomaly for each month in our dataset. A *temperature anomaly* is a way for us to see how much warmer or cooler a given month's temperature is compared to a "typical" month from some reference period. In our case, we are going to calculate the average temperature for each month in the period 1951-1980 and use that as our "typical" value. For example, we will find that the average temperature for January in the period 1951-1980 is about -14°C. With this value, we can then look at every January in our monthly temperature data and calculate how much warmer or cooler it was than the January in the reference period (1951-1980). Essentially, this helps us see whether temperatures are warming or cooling over time.
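In symbols, the idea can be written compactly. The notation here is our own shorthand rather than something taken from the data source: $T_{m,y}$ is the average temperature of month $m$ in year $y$, $\overline{T}_{m}^{\mathrm{ref}}$ is the average temperature of month $m$ over the reference period, and $\Delta T_{m,y}$ is the anomaly.

$$
\Large
\Delta T_{m,y} = T_{m,y} - \overline{T}_{m}^{\mathrm{ref}}
$$

For instance, with a January reference value of about -14°C, a hypothetical January that averaged -10°C would have an anomaly of -10 - (-14) = +4°C, meaning it was 4°C warmer than a "typical" January.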

To get started we will group our data once again, this time grouping by month for each month in the period 1951-1980. You can run the cell below to execute this grouping.

In [None]:
monthly_data['MONTH'] = monthly_data.index.month
grouped = monthly_data.loc['1951-01-01':'1980-12-31'].groupby('MONTH')

Now that we have grouped all of the months in our reference period, we can calculate the average temperature for each month in the period 1951-1980. Again, this will give us 12 values in total, one for each month, where each value is the average temperature for a "typical" month in that range of years.

In [None]:
# Calculate the mean temperature of each group (one value per calendar month)
ref_temps = grouped['TEMP_C'].mean()

# Convert the result to a DataFrame with the columns MONTH and REF_TEMP
ref_temps = ref_temps.rename('REF_TEMP').reset_index()
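The idea of a reference temperature can be sketched on a tiny made-up table. Here we pretend that the reference period is only 1951-1952 and that we only have values for January and July; all of the numbers are invented for illustration.

```python
import pandas as pd

# Made-up monthly mean temperatures (°C), listed in date order
toy_monthly = pd.DataFrame(
    {'TEMP_C': [-16.0, 13.0, -12.0, 15.0, -5.0]},
    index=pd.to_datetime(['1951-01-31', '1951-07-31', '1952-01-31',
                          '1952-07-31', '1990-01-31']),
)
toy_monthly['MONTH'] = toy_monthly.index.month

# Keep only the pretend reference period, then average by calendar month
toy_ref = (toy_monthly.loc['1951-01-01':'1952-12-31']
           .groupby('MONTH')['TEMP_C'].mean()
           .rename('REF_TEMP')
           .reset_index())
print(toy_ref)
```

The January of 1990 is not part of the reference period, so it does not affect the January reference value.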

Let's now have a quick look at our new data. You can do this by typing the name of the new data table `ref_temps` in the cell below and running it.

#### Stop and think!

Looking at the output above, think about the following questions:

1. What do the numbers in the `MONTH` column mean?
2. What do the `REF_TEMP` numbers mean?
3. What would you say if you were asked to find the average temperature for August in the period 1951-1980?

***Think about the answers to the questions above before continuing with the exercise.***

Hopefully you have understood at this point that our new data table contains average temperatures for each month in the period 1951-1980. The months here are given as numbers (1-12), where January would be month 1 and December would be month 12. In that way, the average temperature in August in the reference period would be about 5.9°C.

## Calculating temperature anomalies for all months

Now that we have some "reference" temperatures we can proceed with calculating our monthly temperature anomalies. As a reminder, a monthly temperature anomaly is the difference in temperature between a given month of a given year and that month in some reference period. In our case, the reference period is 1951-1980. Thus, we can now calculate these anomalies and see that January of 1908 was about 0.3°C cooler than January in the reference period (1951-1980), while January in 1909 was about 6.1°C warmer than January in the reference period. We will do this and explore our temperature anomalies below.

If you run the cell below it will do two things:

1. It will add our "reference" temperatures to our monthly data table
2. It will calculate a new column titled "DIFF" that contains the temperature anomaly for each month

Run the cell below to calculate the temperature anomalies.

In [None]:
# Join the data from ref_temps to monthly_data using an outer join on MONTH
monthly_data = monthly_data.reset_index().merge(ref_temps, on='MONTH', how='outer')

# Set index
monthly_data = monthly_data.set_index('DATE')

# Calculate the temperature anomalies
monthly_data['DIFF'] = monthly_data['TEMP_C'] - monthly_data['REF_TEMP']
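Here is a minimal sketch of the same two steps on made-up values, so you can follow the arithmetic by hand. Note one deliberate difference: the sketch uses a left join (`how='left'`), which keeps the rows in their original date order, whereas the cell above uses an outer join. The numbers are invented for illustration.

```python
import pandas as pd

# Made-up monthly temperatures (°C) and made-up reference temperatures
toy_monthly = pd.DataFrame({
    'DATE': pd.to_datetime(['1908-01-31', '1908-07-31', '1909-01-31']),
    'TEMP_C': [-15.0, 14.5, -8.0],
    'MONTH': [1, 7, 1],
})
toy_ref = pd.DataFrame({'MONTH': [1, 7], 'REF_TEMP': [-14.0, 14.0]})

# Step 1: attach the matching reference temperature to every row
merged = toy_monthly.merge(toy_ref, on='MONTH', how='left').set_index('DATE')

# Step 2: anomaly = observed monthly temperature - reference temperature
merged['DIFF'] = merged['TEMP_C'] - merged['REF_TEMP']
print(merged)
```

Both Januarys are compared with the same January reference value, and the sign of `DIFF` shows whether the month was warmer (positive) or cooler (negative) than the reference.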

Let's see what our monthly temperature data table looks like now. You can do this by typing `monthly_data` in the cell below and running it.

#### Stop and think!

Looking at the output above, think about the following questions:

1. What does a positive value for the "DIFF" column mean? How does that temperature compare to the reference period temperature?
2. What about a negative "DIFF" value, what does that mean?
3. Looking at the temperatures for January 1908 to 1912, are the temperatures back then always colder than the reference period, or are some Januarys warmer than the reference period?

***Think about the answers to the questions above before continuing with the exercise.***

### Calculating seasonal temperature anomalies

At this point we have the temperature anomalies for each month of our dataset with respect to the reference period. Next we need to calculate the seasonal temperature anomalies by combining months. We'll use the following months for each season:

- Winter: December (of previous year), January, February
- Spring: March, April, May
- Summer: June, July, August
- Autumn: September, October, November

The cell below contains quite a few lines of code, but basically this just combines our monthly temperatures into seasonal averages for each year. Run the cell and we'll continue from there.

In [None]:
# Create lists for collecting the values for each season
winter_rows = []
spring_rows = []
summer_rows = []
autumn_rows = []

# Loop over all years and calculate seasonal temperature anomalies and average temperatures
for year in range(1909, 2021):
    winter_diffs = monthly_data['DIFF'].loc[(monthly_data.index >= str(year-1)+'-12') & (monthly_data.index < str(year)+'-03')]
    winter_temp_c = monthly_data['TEMP_C'].loc[(monthly_data.index >= str(year-1)+'-12') & (monthly_data.index < str(year)+'-03')]
    spring_diffs = monthly_data['DIFF'].loc[(monthly_data.index >= str(year)+'-03') & (monthly_data.index < str(year)+'-06')]
    spring_temp_c = monthly_data['TEMP_C'].loc[(monthly_data.index >= str(year)+'-03') & (monthly_data.index < str(year)+'-06')]
    summer_diffs = monthly_data['DIFF'].loc[(monthly_data.index >= str(year)+'-06') & (monthly_data.index < str(year)+'-09')]
    summer_temp_c = monthly_data['TEMP_C'].loc[(monthly_data.index >= str(year)+'-06') & (monthly_data.index < str(year)+'-09')]
    autumn_diffs = monthly_data['DIFF'].loc[(monthly_data.index >= str(year)+'-09') & (monthly_data.index < str(year)+'-12')]
    autumn_temp_c = monthly_data['TEMP_C'].loc[(monthly_data.index >= str(year)+'-09') & (monthly_data.index < str(year)+'-12')]

    # For all seasons that contain 3 months of data, store the year,
    # the mean temperature, and the mean temperature anomaly
    if len(winter_diffs) == 3:
        winter_rows.append([year, winter_temp_c.mean(), winter_diffs.mean()])
    if len(spring_diffs) == 3:
        spring_rows.append([year, spring_temp_c.mean(), spring_diffs.mean()])
    if len(summer_diffs) == 3:
        summer_rows.append([year, summer_temp_c.mean(), summer_diffs.mean()])
    if len(autumn_diffs) == 3:
        autumn_rows.append([year, autumn_temp_c.mean(), autumn_diffs.mean()])

# Create a DataFrame with named columns for each season
columns = ['Year', 'Temperature', 'Anomaly']
winter_temps = pd.DataFrame(winter_rows, columns=columns)
spring_temps = pd.DataFrame(spring_rows, columns=columns)
summer_temps = pd.DataFrame(summer_rows, columns=columns)
autumn_temps = pd.DataFrame(autumn_rows, columns=columns)

# Set the date as the index for each DataFrame
winter_temps['Year'] = pd.to_datetime(winter_temps['Year'], format='%Y')
winter_temps = winter_temps.set_index('Year')
spring_temps['Year'] = pd.to_datetime(spring_temps['Year'], format='%Y')
spring_temps = spring_temps.set_index('Year')
summer_temps['Year'] = pd.to_datetime(summer_temps['Year'], format='%Y')
summer_temps = summer_temps.set_index('Year')
autumn_temps['Year'] = pd.to_datetime(autumn_temps['Year'], format='%Y')
autumn_temps = autumn_temps.set_index('Year')
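The one tricky detail in the seasonal grouping is winter, because it spans two calendar years: December belongs to the winter of the *following* year. The sketch below shows that rule with a small helper function. The function name `season_and_year` and the lookup table are our own, added for illustration, and are not used elsewhere in this notebook.

```python
from datetime import date

# Season for each month number, following the list of seasons given above
SEASONS = {12: 'winter', 1: 'winter', 2: 'winter',
           3: 'spring', 4: 'spring', 5: 'spring',
           6: 'summer', 7: 'summer', 8: 'summer',
           9: 'autumn', 10: 'autumn', 11: 'autumn'}

def season_and_year(day):
    """Return (season, year), counting December toward the next year's winter"""
    year = day.year + 1 if day.month == 12 else day.year
    return SEASONS[day.month], year

# December 1908 and February 1909 belong to the same winter
print(season_and_year(date(1908, 12, 15)))
print(season_and_year(date(1909, 2, 1)))
print(season_and_year(date(1909, 7, 1)))
```

This is also why the seasonal values can only start in 1909: the first winter needs December of the year before.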

## Plotting the seasonal temperature data

Now that we have all of the seasonal temperature anomalies we can make some plots to explore the results. You can do this by running the cell below, which will create a four-panel plot with each season.

In [None]:
# Create winter temperature anomaly plot
winter_plot = winter_temps['Anomaly'].plot(
    xlabel='Date',
    ylabel='Temperature anomaly (°C)',
    title='Winter',
    color='#1f77b4',
    plot_data_points=True,
    marker='o',
    ylim=[-7.5, 7.5],
    legend = 'bottom_right',
    show_figure=False)

# Create spring temperature anomaly plot
spring_plot = spring_temps['Anomaly'].plot(
    xlabel='Date',
    ylabel='Temperature anomaly (°C)',
    title='Spring',
    color='#ff7f0e',
    plot_data_points=True,
    marker='o',
    ylim=[-7.5, 7.5],
    legend = 'bottom_right',
    show_figure=False)

# Create summer temperature anomaly plot
summer_plot = summer_temps['Anomaly'].plot(
    xlabel='Date',
    ylabel='Temperature anomaly (°C)',
    title='Summer',
    color='#2ca02c',
    plot_data_points=True,
    marker='o',
    ylim=[-7.5, 7.5],
    legend = 'bottom_right',
    show_figure=False)

# Create autumn temperature anomaly plot
autumn_plot = autumn_temps['Anomaly'].plot(
    xlabel='Date',
    ylabel='Temperature anomaly (°C)',
    title='Autumn',
    color='#d62728',
    plot_data_points=True,
    marker='o',
    ylim=[-7.5, 7.5],
    legend = 'bottom_right',
    show_figure=False)

# Plot all seasonal anomaly plots in a grid
grid_plot = pandas_bokeh.plot_grid([[winter_plot, spring_plot],
                                    [summer_plot, autumn_plot]], 
                                    plot_width=450)

OK, now we have something interesting! Here we can see how the temperatures for each season have varied since 1908, compared to our reference period of 1951-1980. You can perhaps even get a sense of the trend in seasonal temperatures since 1908.

#### Stop and think!

Looking at the output above, think about the following questions:

1. How do the temperature anomalies change for each season over the range of dates we have?
2. Are the anomalies always increasing or decreasing, or do they vary from year to year?
3. In which season do the anomalies seem to vary the most? And the least?
4. If you had to estimate, do the anomalies suggest temperatures are increasing in every season, or only in some seasons?
5. In which season does it appear the average temperatures are increasing the fastest?

***Think about the answers to the questions above before continuing with the exercise.***

## Fitting a regression line to our temperature anomalies

In this section we will calculate a linear fit to our temperature data in order to see how fast temperatures are increasing on average in each season. Our linear fit, or regression line, is the best fit line that models our temperature data. The slope of the line will tell us the average rate of temperature change in each season in degrees Celsius per year.

To get started, we need to create another "function". Our function in this case will allow us to calculate the slope and y-intercept for the best-fit line using our temperature data. You can run the cell below to define our function.

In [None]:
def least_squares(x, y):
    """Returns the coefficients for a least-squares regression line to x-y data"""
    n = len(x)
    delta = n * (x**2).sum() - x.sum()**2
    a = ((x**2).sum() * y.sum() - x.sum() * (x * y).sum()) / delta
    b = (n * (x * y).sum() - x.sum() * y.sum()) / delta
    return a, b
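One way to build trust in a function like this is to give it points that lie exactly on a known line and check that it recovers the slope and intercept. The sketch below repeats the function so that it runs on its own, uses made-up points on the line y = 0.5x - 990, and also compares the result with NumPy's built-in `polyfit` function as an independent cross-check.

```python
import numpy as np

def least_squares(x, y):
    """Returns the coefficients for a least-squares regression line to x-y data"""
    n = len(x)
    delta = n * (x**2).sum() - x.sum()**2
    a = ((x**2).sum() * y.sum() - x.sum() * (x * y).sum()) / delta
    b = (n * (x * y).sum() - x.sum() * y.sum()) / delta
    return a, b

# Made-up points that lie exactly on the line y = 0.5 * x - 990
x = np.array([1990, 1991, 1992, 1993, 1994])
y = 0.5 * x - 990.0

# a is the y-intercept and b is the slope
a, b = least_squares(x, y)
print('intercept:', a, 'slope:', b)

# Independent check: np.polyfit returns the slope first, then the intercept
slope, intercept = np.polyfit(x, y, 1)
print('polyfit intercept:', intercept, 'polyfit slope:', slope)
```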

Now that we have a function that can be used to fit a line to our temperature data, we can use it to find the slope and y-intercept for lines that fit the temperature anomaly data in each season. You can run the cell below to calculate the slopes and intercepts for each season.

In [None]:
# Calculate linear fit for winter
a_winter, b_winter = least_squares(winter_temps.index.year.values, winter_temps['Anomaly'].values)
winter_temps['Trendline'] = winter_temps.index.year * b_winter + a_winter

# Calculate linear fit for spring
a_spring, b_spring = least_squares(spring_temps.index.year.values, spring_temps['Anomaly'].values)
spring_temps['Trendline'] = spring_temps.index.year * b_spring + a_spring

# Calculate linear fit for summer
a_summer, b_summer = least_squares(summer_temps.index.year.values, summer_temps['Anomaly'].values)
summer_temps['Trendline'] = summer_temps.index.year * b_summer + a_summer

# Calculate linear fit for autumn
a_autumn, b_autumn = least_squares(autumn_temps.index.year.values, autumn_temps['Anomaly'].values)
autumn_temps['Trendline'] = autumn_temps.index.year * b_autumn + a_autumn

OK, now we have some lines that fit our plot data. Before we move on to producing the plots, we can check some of the values using the two cells below. For example, you can check the y-intercept for winter by typing `a_winter` in the cell below, and the slope for winter by typing `b_winter` into the cell below that. Run both cells to check the slope and intercept values.

#### Stop and think!

Looking at the output above, think about the following questions:

1. What does the y-intercept value tell you in general?
2. What does the y-intercept for winter mean then, in this case? Does the value make sense to you?
3. Based on the output, how fast are temperatures increasing on average in Sodankylä? Does this value make sense to you?

***Think about the answers to the questions above before continuing with the exercise.***
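If the y-intercept looked strange, it may help to remember that the regression line is just `intercept + slope * year`, so the intercept is the value the line would have at year 0, long before our data begin. The sketch below uses hypothetical coefficients chosen for illustration only (they are not results from the Sodankylä data), and the helper `trend_value` is our own.

```python
import math

def trend_value(year, intercept, slope):
    """Value of the regression line (temperature anomaly in °C) at a given year"""
    return intercept + slope * year

# Hypothetical coefficients for illustration only (not results from the data)
example_intercept = -39.5   # anomaly the line would predict at year 0
example_slope = 0.02        # change in °C per year

# At year 0 the line simply returns the intercept
print(trend_value(0, example_intercept, example_slope))

# Within the range of the data the line gives much more sensible values
print(trend_value(1975, example_intercept, example_slope))

# The slope is often easier to read as a change per decade
print(example_slope * 10)
```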

Now we can move on to making our plots with our best-fit lines. You can run the cell below to create the plots, again with four panels.

In [None]:
# Create winter temperature anomaly plot with regression line
winter_plot = winter_temps[['Anomaly', 'Trendline']].plot(
    xlabel='Date',
    ylabel='Temperature anomaly (°C)',
    title='Winter',
    color='#1f77b4',
    alpha=0.6,
    plot_data_points=True,
    marker='o',
    ylim=[-7.5, 7.5],
    legend = 'bottom_right',
    show_figure=False)

# Create spring temperature anomaly plot with regression line
spring_plot = spring_temps[['Anomaly', 'Trendline']].plot(
    xlabel='Date',
    ylabel='Temperature anomaly (°C)',
    title='Spring',
    color='#ff7f0e',
    alpha=0.6,
    plot_data_points=True,
    marker='o',
    ylim=[-7.5, 7.5],
    legend = 'bottom_right',
    show_figure=False)

# Create summer temperature anomaly plot with regression line
summer_plot = summer_temps[['Anomaly', 'Trendline']].plot(
    xlabel='Date',
    ylabel='Temperature anomaly (°C)',
    title='Summer',
    color='#2ca02c',
    alpha=0.6,
    plot_data_points=True,
    marker='o',
    ylim=[-7.5, 7.5],
    legend = 'bottom_right',
    show_figure=False)

# Create autumn temperature anomaly plot with regression line
autumn_plot = autumn_temps[['Anomaly', 'Trendline']].plot(
    xlabel='Date',
    ylabel='Temperature anomaly (°C)',
    title='Autumn',
    color='#d62728',
    alpha=0.6,
    plot_data_points=True,
    marker='o',
    ylim=[-7.5, 7.5],
    legend = 'bottom_right',
    show_figure=False)

# Plot seasonal temperature anomaly plots in a grid
grid_plot = pandas_bokeh.plot_grid([[winter_plot, spring_plot],
                                    [summer_plot, autumn_plot]], 
                                    plot_width=450)

#### Stop and think!

Looking at the output above, think about the following questions:

1. Looking at the four plot panels above, in which season does it seem temperatures are increasing the fastest? How about the slowest?
2. You can estimate the rate of change in temperature using the plots above by zooming in on each panel and finding places where the plotted lines cross the gridlines in the background. Using the zoom tool, zoom in and estimate the rate of change of temperature for each season. Write these estimates down.

***Think about the answers to the questions above before continuing with the exercise.***
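
Estimating a rate of change from two points on a line is a rise-over-run calculation. The sketch below shows the arithmetic; the helper `slope_from_two_points` and the two readings are hypothetical, so replace them with the values you read from each panel.

```python
def slope_from_two_points(x1, y1, x2, y2):
    """Return the rate of change (rise over run) between two points."""
    return (y2 - y1) / (x2 - x1)


# Hypothetical readings: (year, anomaly in °C) where a line crosses gridlines
year_1, anomaly_1 = 1920, -1.0
year_2, anomaly_2 = 2020, 1.0

rate = slope_from_two_points(year_1, anomaly_1, year_2, anomaly_2)
print(f"Estimated rate of change: {rate:.3f} °C/year")
```

Choosing two points that are far apart along the line makes the estimate less sensitive to small errors in reading the values from the plot.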

So we have seen that we can plot best-fit lines along with the temperature anomalies to see how fast temperatures are increasing on average in each season. You have also seen that you can zoom in on the plots to estimate the rate of temperature change in each season. However, we can do better! We have already calculated the slopes of the plotted lines, so we can add those values to our plots to show how quickly temperatures change in each season.
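
The panel titles in the next cell are built with Python's `str.format` method. As a small illustration, here is the same pattern applied to a made-up slope value (`example_slope` is not computed from the Sodankylä data):

```python
# Made-up slope in °C/year, for illustration only
example_slope = 0.0123456

# {0:.3f} inserts the first argument rounded to three decimal places
title = 'Winter (slope = {0:.3f}°C/year)'.format(example_slope)
print(title)  # Winter (slope = 0.012°C/year)
```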

You can run the cell below to produce a plot like that above, but with the rate of change of temperature in the title of each panel.

In [None]:
# Create winter temperature anomaly plot with regression line
winter_plot = winter_temps[['Anomaly', 'Trendline']].plot(
    xlabel='Date',
    ylabel='Temperature anomaly (°C)',
    title='Winter (slope = {0:.3f}°C/year)'.format(b_winter),
    color='#1f77b4',
    alpha=0.6,
    plot_data_points=True,
    marker='o',
    ylim=[-7.5, 7.5],
    legend = 'bottom_right',
    show_figure=False)

# Create spring temperature anomaly plot with regression line
spring_plot = spring_temps[['Anomaly', 'Trendline']].plot(
    xlabel='Date',
    ylabel='Temperature anomaly (°C)',
    title='Spring (slope = {0:.3f}°C/year)'.format(b_spring),
    color='#ff7f0e',
    alpha=0.6,
    plot_data_points=True,
    marker='o',
    ylim=[-7.5, 7.5],
    legend = 'bottom_right',
    show_figure=False)

# Create summer temperature anomaly plot with regression line
summer_plot = summer_temps[['Anomaly', 'Trendline']].plot(
    xlabel='Date',
    ylabel='Temperature anomaly (°C)',
    title='Summer (slope = {0:.3f}°C/year)'.format(b_summer),
    color='#2ca02c',
    alpha=0.6,
    plot_data_points=True,
    marker='o',
    ylim=[-7.5, 7.5],
    legend = 'bottom_right',
    show_figure=False)

# Create autumn temperature anomaly plot with regression line
autumn_plot = autumn_temps[['Anomaly', 'Trendline']].plot(
    xlabel='Date',
    ylabel='Temperature anomaly (°C)',
    title='Autumn (slope = {0:.3f}°C/year)'.format(b_autumn),
    color='#d62728',
    alpha=0.6,
    plot_data_points=True,
    marker='o',
    ylim=[-7.5, 7.5],
    legend = 'bottom_right',
    show_figure=False)

# Plot seasonal temperature anomaly plots in a grid
grid_plot = pandas_bokeh.plot_grid([[winter_plot, spring_plot],
                                    [summer_plot, autumn_plot]], 
                                    plot_width=450)

#### Stop and think!

Looking at the output above, think about the following questions:

1. Looking back at the estimates you made for each season's rate of temperature change, how do they compare to the values in the plot above? Were some seasons easier to estimate than others?
2. Does a line seem like a good model for the temperature changes in each season, or would a different shape provide a better fit?
3. Does a line provide a better fit in some seasons than others?

***Think about the answers to the questions above before continuing with the exercise.***
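
One common way to put a number on how well a line fits is the coefficient of determination, R². The sketch below is not part of the original exercise: `r_squared` is a hypothetical helper, and the two "seasons" are synthetic series with the same underlying trend but different amounts of year-to-year scatter.

```python
import numpy as np


def r_squared(x, y):
    """Return R^2 for a straight-line least-squares fit of y against x."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up data: the same 0.02 °C/year trend with small and large scatter
rng = np.random.default_rng(0)
years = np.arange(1910, 2011).astype(float)
trend = 0.02 * (years - 1960)
quiet_season = trend + rng.normal(0.0, 0.5, years.size)
noisy_season = trend + rng.normal(0.0, 2.5, years.size)

print(f"R^2, small scatter: {r_squared(years, quiet_season):.2f}")
print(f"R^2, large scatter: {r_squared(years, noisy_season):.2f}")
```

An R² value close to 1 means the line accounts for most of the variation in the data. A low value does not by itself mean there is no trend; it can also mean that the year-to-year scatter is large compared with the trend.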

## Concluding remarks and broader questions to consider

So that is it for our temperature data exercise, and we hope you feel you have learned a few things about temperature data analysis and how to process temperature data to detect things like climate change. There have been many shorter questions above, but we would now ask you to consider a few broader questions related to climate change, climate data, and data visualization. There are no right or wrong answers here, just some ideas to consider.

1. What were some of the challenges in trying to observe warming of the local climate in Sodankylä using all of the daily temperature observations?
2. What are some of the benefits and challenges of using temperature anomalies instead of measured temperatures?
3. How much does the average temperature in winter vary from year to year in Sodankylä? Do you think this might make it hard for people to sense the climate is warming?
4. What might be some of the challenges in informing the public about evidence of climate change? How would you try to make an argument and demonstrate the climate is changing?