# Lab 7: Inference and Global Climate Change 

By the end of this lab, you should know how to:

1. Test whether observed data appears to be a random sample from a distribution.
2. Analyze a natural experiment.
3. Implement and interpret a sign test.
4. Create a function to run a general hypothesis test.
5. Analyze visualizations and draw conclusions from them.

In [None]:
name = ...

In [None]:
## import statements
try:
    # These lines load the tests. 
    from gofer.ok import check
    
except ImportError:
    # Install a pip package in the current Jupyter kernel
    import sys
    !{sys.executable} -m pip install git+https://github.com/grading/gradememaybe.git

    # These lines load the tests. 
    from gofer.ok import check
import numpy as np
from datascience import *
import pandas as pd
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## Overview 

Climate change usually refers to the general trend of warming temperatures globally. Unusual shifts in weather activity, such as hurricanes, storms, and winds, are also usually classified under climate change. While the climate can shift due to natural occurrences, scientists have found that human interventions have potentially caused the warming trend. One proposed explanation for the warming is increased solar activity; however, scientists have found that solar activity has not generally increased during the period when temperatures have increased.

<img src='solar_temp.jpg' width="500" height="340">

Links: [NASA](https://climate.nasa.gov/causes/), [Canada](https://www.canada.ca/en/environment-climate-change/services/climate-change/causes.html)

### Data 

While there are several different metrics we could analyze to make inferences about overall trends in global climate change, for simplicity's sake we will focus on land temperature across different countries. The original table came from this [database](https://github.com/OpenFloodAI/Climate-Change-Datasets); however, it has been reformatted to make the downstream analyses easier. There are 15 columns: the year, the country, the average temperature of that country in that year ('avg'), and one column of temperatures for each month of that year.

In [None]:
temps = Table.read_table('temp_per_country.csv')
temps

**Question 1:** Let's explore this data a bit. The cell below creates a pivot table with years as the rows and each country as a new column. Select two countries from our dataset and draw a line plot of the changes in temperature over time. You only want to graph the years that have data for both of your countries of interest (Hint: You may want to use where and are.above() to select the years with data). There is no autocheck for this question, as your answers will differ depending on the countries you pick.

In [None]:
np.unique(temps['country'])

In [None]:
pivotTable = temps.select('year', 'country', 'avg').pivot('country', 'year', 'avg', sum)
yourCountries = pivotTable.select('year', ... , ...) 
yourCountries.show()

In [None]:
yourCountries = ... 
yourCountries.plot('year')

**Question 1 continued:** In this markdown cell, explain an observation you see from the figure you generated.


...

**Question 2:** Let's visualize the change in temperature for the United States.  

In [None]:
us = ... 

plt.figure(figsize = (10, 5))
plt.plot(us['year'].astype(int), us['avg'])
plt.xticks(np.arange(1750, 2025, 25))
plt.show()

In [None]:
check('tests/q2.py')

**Question 3:** Null and alternative hypotheses. We may be curious whether temperatures globally are more likely to increase or decrease on average. Based on our preliminary figures and what we know about creating good hypotheses, state the null and alternative hypotheses below:

- Null hypothesis:   ... 


- Alternative hypothesis: ...


To test the null hypothesis, we're interested in identifying whether the temperature increased or decreased in each time period.
Temperatures vary widely across countries and years, presumably due to the vast array of differences among climates and human intervention. Rather than attempting to analyze the temperatures themselves, here we will restrict our analysis to whether temperatures increased or decreased over certain time spans. We will not concern ourselves with how much temperatures changed, only with the direction of each change.

The np.diff function takes a list or array of values and computes the differences between adjacent items, like this:

    [item 1 - item 0 , item 2 - item 1 , item 3 - item 2, ...]

Instead, we may wish to compute the difference between items that are two positions apart. For example, given a 5-element array, we may want:

    [item 2 - item 0 , item 3 - item 1 , item 4 - item 2]

The diff_n function below computes this result. Don't worry if the implementation uses unfamiliar features of Python, as long as you understand its behavior.

In [None]:
def diff_n(values, n):
    return np.array(values)[n:] - np.array(values)[:-n]

diff_n(make_array(1, 10, 100, 1000, 10000), 2)
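
As a quick sanity check on the description above, the sketch below compares np.diff with diff_n on the same five values. It uses plain NumPy arrays instead of make_array so that it runs without the datascience package; that substitution is an assumption made for illustration only.

```python
import numpy as np

def diff_n(values, n):
    # Same definition as in the cell above: differences between items n positions apart.
    values = np.array(values)
    return values[n:] - values[:-n]

values = np.array([1, 10, 100, 1000, 10000])

adjacent = np.diff(values)      # item[i+1] - item[i]
one_apart = diff_n(values, 1)   # should match np.diff
two_apart = diff_n(values, 2)   # item[i+2] - item[i]

print(adjacent)
print(one_apart)
print(two_apart)
```

With n = 1 the two functions agree, and with n = 2 the result has one fewer element, since each difference needs a partner two positions later.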

**Question 4:** Implement the function changes that takes an array of temperatures for a country, ordered by increasing year. For all two-year periods (e.g., from 1960 to 1962), it computes and returns the number of increases minus the number of decreases.

For example, the array r = make_array(10, 7, 12, 9, 13, 9, 11) contains 3 increases (10 to 12, 7 to 9, and 12 to 13), 1 decrease (13 to 11), and 1 change that is neither an increase nor a decrease (9 to 9). Therefore, changes(r) would return 2, the difference between 3 increases and 1 decrease.

Hint: Consider using the diff_n function.
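
The counts in the example above can be checked by hand with a few lines of NumPy. This is only a sketch of the arithmetic on the example array r, written with np.sign as one possible way to classify each difference; it is not the required implementation of changes.

```python
import numpy as np

r = np.array([10, 7, 12, 9, 13, 9, 11])

# Differences between temperatures two positions apart.
two_year_diffs = r[2:] - r[:-2]

# np.sign maps each difference to +1 (increase), -1 (decrease), or 0 (no change).
signs = np.sign(two_year_diffs)

num_increases = np.count_nonzero(signs > 0)
num_decreases = np.count_nonzero(signs < 0)
num_unchanged = np.count_nonzero(signs == 0)

print(two_year_diffs)
print(num_increases, num_decreases, num_unchanged)
print(num_increases - num_decreases)
```

Note that the unchanged pair (9 to 9) contributes nothing to either count, so it does not affect the final statistic.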

In [None]:
def changes(rates, years = 2):
    "Return the number of increases minus the number of decreases after two years."
    ...

In [None]:
check('tests/q4.py')

**Question 5:** Assign changes_by_country to a table with one row per country that has two columns: the Country name and the Temperature changes statistic computed across all years in our data set for that country. It may be useful to split this process into two steps.   The final table's first 2 rows should look like this:

|country    |avg changes| 
|-----------|-----------| 
|Afghanistan|18         | 
|Africa     |8          |

In [None]:

changes_by_country = ...
changes_by_country

In [None]:
check('tests/q5.py')

**Question 6:** Assign test_stat to the total increases minus the total decreases for all two-year periods and all countries in our data set. For example, if the temperature in Albania went up 23 times and fell 17 times, the total change for Albania would be 6. We want the total value for all the countries together.

In [None]:
test_stat = ...
print('Total increases minus total decreases, across all countries and years:', test_stat)

In [None]:
check('tests/q6.py')

"More increases than decreases," one person exclaims, "Temperatures tend to go up across two-year periods. What dire times we live in."

"Not so fast," another person replies, "Even if temperatures just moved up and down uniformly at random, there would be some difference between the increases and decreases. There were a lot of countries and a lot of years, so there were many chances for changes to happen. If countries' temperatures increase and decrease at random with equal probability, perhaps this difference was simply due to chance!"

Based on the null hypothesis above that country temperatures increase and decrease by chance, we can simulate our test statistic.  Our test statistic should depend only on whether temperature increased or decreased, not on the size of any change. Thus we choose:

    Test Statistic: The number of increases minus the number of decreases

The cell below samples increases and decreases at random from a uniform distribution 100 times. The final column of the resulting table gives the number of increases and decreases that resulted from sampling in this way. Using sample_from_distribution is faster than using sample followed by group to compute the same result.

In [None]:
uniform = Table().with_columns(
    "Change", make_array('Increase', 'Decrease'),
    "Chance", make_array(0.5,        0.5))
uniform.sample_from_distribution('Chance', 100)
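
For readers who want to see the same idea without the Table machinery, here is a rough NumPy-only sketch of drawing 100 changes with equal chances and computing the test statistic from them. The use of np.random.default_rng and the fixed seed are assumptions made so the sketch is reproducible; the lab itself uses sample_from_distribution.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility only

# Draw 100 changes, each an increase or a decrease with probability 0.5.
draws = rng.choice(np.array(['Increase', 'Decrease']), size=100, p=[0.5, 0.5])

num_increases = np.count_nonzero(draws == 'Increase')
num_decreases = np.count_nonzero(draws == 'Decrease')

# Test statistic: number of increases minus number of decreases.
statistic = num_increases - num_decreases
print(num_increases, num_decreases, statistic)
```

Because the two counts always add up to 100, the statistic equals twice the number of increases minus 100, so it is always even and centered around 0 under the null.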

**Question 7:** Complete the simulation below, which samples num_changes increases/decreases at random many times and forms an empirical distribution of your test statistic under the null hypothesis. Your job is to

- fill in the function simulate_under_null, which simulates a single sample under the null hypothesis, and
- fill in its argument when it's called below.

As a hint, num_changes should be approximately the number of countries times the number of year comparisons (you can find the number of year comparisons by using diff_n()).
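
To make the hint concrete, the sketch below counts comparisons on made-up sizes. The values of num_years and num_countries are hypothetical, and the sketch assumes every country has the same number of yearly readings, which may not hold in the real table.

```python
import numpy as np

num_years = 10      # hypothetical number of yearly readings for one country
num_countries = 3   # hypothetical number of countries
n = 2               # two-year periods, as in diff_n

readings = np.arange(num_years)  # placeholder readings; only the length matters

# Differences n positions apart leave num_years - n comparisons.
comparisons_per_country = len(readings[n:] - readings[:-n])
num_changes_estimate = num_countries * comparisons_per_country

print(comparisons_per_country, num_changes_estimate)
```

The result is only approximate for the real data, since periods with no change are neither increases nor decreases and some countries may have missing years.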

In [None]:
def simulate_under_null(num_chances_to_change):
    """Simulates some number changing several times, with an equal
    chance to increase or decrease.  Returns the value of your
    test statistic for these simulated changes.
    
    num_chances_to_change is the number of times the number changes.
    """
    uniform = Table().with_columns(
        "Change", make_array('Increase', 'Decrease'),
        "Chance", make_array(0.5,        0.5))
    sample = ... 
    
    ... 
    
    return ...


In [None]:
def empirical_distribution(tbl):
    num_changes = ...
    samples = make_array()
    for i in np.arange(10000):
        samples = np.append(samples, simulate_under_null(...)) 
    # Under the null the statistic is centered at 0, so the bins must cover negative values too.
    Table().with_column('Test statistic under null', samples).hist(bins=np.arange(min(samples), max(samples) + 4, 2))
    return samples

In [None]:
samples = empirical_distribution(...) 

In [None]:
check('tests/q7.py')

**Question 8:**  Complete the analysis as follows:

1. Compute a P-value. (Hint: you can use np.count_nonzero())
2. Using a 5% P-value cutoff, draw a conclusion about the null and alternative hypotheses.
3. Describe your findings using simple, non-technical language. What does your analysis tell you about temperature changes over time? What can you claim about causation from your statistical analysis?
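
As an illustration of step 1, here is a minimal sketch of an empirical P-value computed with np.count_nonzero. The simulated values and the observed statistic are invented for the example, and the one-sided direction (at least as large as observed) is an assumption; the right direction depends on the alternative hypothesis you chose in Question 3.

```python
import numpy as np

# Hypothetical simulated statistics under the null (not the lab's real output).
simulated = np.array([-4, -2, 0, 0, 2, 2, 4, 6, 8, 10])
observed = 6  # hypothetical observed test statistic

# Proportion of simulated statistics at least as large as the observed one.
p_value = np.count_nonzero(simulated >= observed) / len(simulated)
print(p_value)
```

Here 3 of the 10 simulated values are at least 6, so the empirical P-value is 0.3, which would not fall below a 5% cutoff.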

**P-value:** ... 


**Conclusion about the hypotheses:** ...



**Findings:** ...

In [None]:
pvalue = ...
pvalue

Some countries have implemented policies and laws to counteract climate change, whereas others have not. We have a table that contains a boolean indicating whether a country has any policies or laws to protect the earth, along with the number of policies and laws implemented in that country. We can test whether countries that have implemented policies to counteract climate change show any difference in temperature changes from countries that have not. A natural experiment happens when something other than experimental design applies a treatment to one group and not to another (control) group, and we have some hope that the treatment and control groups don't have any other systematic differences. This is likely not the case globally, but if we did believe that the countries didn't have other systematic differences, how would we set up the experiment?

Data Source: [Climate Change Laws of the World](https://climate-laws.org/legislation_and_policies)

**Question 9:** Describe this investigation in terms of an experiment. What population are we studying? What is the control group? What is the treatment group? What outcome are we measuring? Be precise!

*Write your answers below.*

- Population:  ...

- Control:  ...

- Treatment:  ...

- Outcome:  ...



In [None]:
laws = Table.read_table('laws.csv')
laws

**Question 10:** Let's set up to compute an empirical distribution for countries that have laws and policies that attempt to counteract climate change and an empirical distribution for countries that have not implemented such laws and policies. We want to focus on the time range between 1990 and 2020, as the majority of laws were implemented in this time period. We're going to split this up into three steps.
1. Combine the temperature table and the laws table.  
2. Set year_range to the correct time period.  
3. Create two tables: one of countries that have climate change laws and one for countries that do not.  

In [None]:
temp_law = ...
year_range = ...
haveLaws = ...
noLaws = ...

In [None]:
check('tests/q10.py')

**Question 11:** Calculate the test statistic for both subsets of countries: those that have implemented climate change laws and those that have not implemented these laws.

In [None]:
laws_test_stat = ...
laws_test_stat

In [None]:
nolaws_test_stat = ...
nolaws_test_stat

In [None]:
check('tests/q11.py')

**Question 12:** Now, using the tables from Question 10 and the test statistics calculated in Question 11, create an empirical distribution and calculate a p-value for each group of countries.

In [None]:
haveLawsSamples = ...
lawsPvalue = ...
print("P-value for countries that have implemented policies to counteract climate change from 1990 to 2020: " + str(lawsPvalue))

In [None]:
noLawsSamples = ...
nolawsPvalue = ...
print("P-value for countries that have NOT implemented policies to counteract climate change from 1990 to 2020: " + str(nolawsPvalue))

**Question 13:** Explain what our results show in the markdown cell below: 

...

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import glob
from gofer.ok import check
correct = 0
checks = [2, 4, 5, 6, 7, 10, 11]
total = len(checks)
for x in checks:
    print('Testing question {}: '.format(str(x)))
    g = check('tests/q{}.py'.format(str(x)))
    if g.grade == 1.0:
        print("Passed")
        correct += 1
    else:
        print('Failed')
        display(g)

print('Grade:  {}'.format(str(correct/total)))