# Introduction to Statistics Part III



Now that we have learned how to manipulate basic statistics, we will look at how to perform *significance tests*.

## Significance Testing

In [0]:
# Load pandas, numpy, and scipy.stats

import pandas as pd
import numpy as np
import scipy.stats as stats

Run the cell below to load a table of daily temperatures in Detroit since 1937:

In [0]:
data_table = pd.read_csv( '../SampleData/detroit_weather_2.csv' ) # Data from Mathematica WeatherData, 2019

Print out the data to see what the format looks like:

In [0]:
# View the head of data_table to see what its format looks like

data_table.head()

Unnamed: 0,YEAR,MONTH,DAY,Temperature
0,1937,1,1,0.5
1,1937,1,2,0.17
2,1937,1,3,-1.06
3,1937,1,4,-3.89
4,1937,1,5,-0.17


In [0]:
# View the tail of data_table to see what its format looks like

data_table.tail()

`data_table` contains one row for each day since 1937, where the column 'Temperature' contains the average temperature for that day (in Celsius).

We will use this data to **test** whether global warming has occurred in Detroit in the years since 1937.

In [0]:
# Select two temperature ranges from data_table, one from a long time ago and one more recent:

temps_1940 = data_table.query("YEAR >= 1940 and YEAR < 1950")["Temperature"]
temps_2005 = data_table.query("YEAR >= 2005 and YEAR < 2015")["Temperature"] 
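The `query` strings above select rows by year. As a rough illustration of what that selection does, the sketch below builds a small synthetic stand-in for `data_table` (the name `demo_table` and its random temperatures are assumptions, not the Detroit data) and checks that `query` returns the same rows as an explicit boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_table: one row per day, random temperatures
days = pd.date_range("1937-01-01", "2018-12-31", freq="D")
rng = np.random.default_rng(0)
demo_table = pd.DataFrame({
    "YEAR": days.year.to_numpy(),
    "MONTH": days.month.to_numpy(),
    "DAY": days.day.to_numpy(),
    "Temperature": rng.normal(10.0, 10.0, size=len(days)),
})

# Selection with query(), written the same way as in the lesson
by_query = demo_table.query("YEAR >= 1940 and YEAR < 1950")["Temperature"]

# The same selection written as a boolean mask
mask = (demo_table["YEAR"] >= 1940) & (demo_table["YEAR"] < 1950)
by_mask = demo_table.loc[mask, "Temperature"]

# Both approaches pick the same rows: every day from 1940 through 1949
print("Same selection:", by_query.equals(by_mask))
print("Number of days selected:", len(by_query))
```

Note that `YEAR < 1950` excludes 1950 itself, so each range in the lesson covers exactly ten calendar years.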

Using what we learned this morning, calculate the mean for each of the two temperature ranges:

In [0]:
# Calculate the mean of your two temperature ranges:

print("Average temperature 1940-1949:", np.mean(temps_1940))
print("Average temperature 2005-2014:", np.mean(temps_2005))

Average temperature 1940-1949: 9.48353134410074
Average temperature 2005-2014: 10.230466136550573

In [0]:
# Calculate the difference between the two means:

print("Difference between the means:", np.mean(temps_2005) - np.mean(temps_1940))

Difference between the means: 0.7469347924498333

Here, we see that there was an increase of 0.75 degrees Celsius between these two time periods. *Statistical tests* are used to determine whether a difference like this is likely due to chance or reflects an actual change.

We will use one of these tests, a `t-test`, to calculate how probable a difference this large would be if it were due to chance alone:

In [0]:
# Use the scipy stats module to calculate a t-test from the data above

stats.ttest_ind(temps_1940, temps_2005).pvalue

0.0025808160556977724

This *p-value* tells us that, if there were no real difference in mean temperature between the two periods, random fluctuation would produce a difference at least this large only about 0.26% of the time, which is very low! This shows that our data supports the idea of global warming, even here in Detroit.
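One way to build intuition for what a p-value measures is a permutation test: shuffle the group labels many times and count how often chance alone produces a difference as large as the observed one. The sketch below is only an illustration on synthetic data (the arrays `early` and `late` are made-up stand-ins, not the Detroit temperatures), and it is a different technique from the t-test used in the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two temperature samples (assumed values)
early = rng.normal(9.5, 10.0, size=500)
late = rng.normal(10.2, 10.0, size=500)

# Difference in means that we actually observed
observed = late.mean() - early.mean()

# Pool the samples: under "no real difference", the labels are arbitrary
pooled = np.concatenate([early, late])

n_shuffles = 2000
count = 0
for _ in range(n_shuffles):
    shuffled = rng.permutation(pooled)
    # Split the shuffled values into two fake groups of the original sizes
    diff = shuffled[500:].mean() - shuffled[:500].mean()
    if abs(diff) >= abs(observed):
        count += 1

# Fraction of shuffles with a difference at least as large as observed
p_perm = count / n_shuffles
print("Observed difference:", observed)
print("Permutation p-value:", p_perm)
```

A small fraction means that random relabeling rarely reproduces the observed difference, which is the same idea the t-test's p-value expresses.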

Let's redo this analysis using only temperature values from December:

In [0]:
# Reselect the data, now only including data points in December

temps_1940_dec = data_table.query("YEAR >= 1940 and YEAR < 1950 and MONTH == 12")["Temperature"]
temps_2005_dec = data_table.query("YEAR >= 2005 and YEAR < 2015 and MONTH == 12")["Temperature"] 

In [0]:
# Calculate the mean of your two temperature ranges:

print("Average temperature 1940-1949:", np.mean(temps_1940_dec))
print("Average temperature 2005-2014:", np.mean(temps_2005_dec))

Average temperature 1940-1949: -1.7202903225806458
Average temperature 2005-2014: -0.41761437908496746

In [0]:
# Calculate the difference between the two means:

print("Difference between the means:", np.mean(temps_2005_dec) - np.mean(temps_1940_dec))

Difference between the means: 1.3026759434956783

In [0]:
# Re-run the statistical test on these subset datasets

stats.ttest_ind(temps_1940_dec, temps_2005_dec).pvalue

0.0008126329856967648

We can see that the difference in temperature is even greater when we focus on just December. A *p-value* of 0.08% indicates that the change is even more significant than the difference in temperature for the entire year.
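Both tests above call `stats.ttest_ind` with its default settings, which assume that the two samples have equal variance. If that assumption is in doubt, passing `equal_var=False` runs Welch's t-test instead. The sketch below compares the two on synthetic data; `early` and `late` are made-up stand-ins with deliberately different spreads, not the Detroit temperatures.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)

# Synthetic samples with different spreads (assumed values)
early = rng.normal(loc=9.5, scale=10.0, size=3650)
late = rng.normal(loc=10.2, scale=11.0, size=3650)

# Default: Student's t-test, which assumes equal variances
student = stats.ttest_ind(early, late)

# Welch's t-test, which does not assume equal variances
welch = stats.ttest_ind(early, late, equal_var=False)

print("Student p-value:", student.pvalue)
print("Welch p-value:  ", welch.pvalue)
```

When the two samples are the same size, as here, the two versions give the same test statistic and very similar p-values; they can diverge more when the sample sizes differ.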

In this lesson you learned how to:

- Perform a `t-test` on a two-class dataset
- Interpret the results from statistical tests

Now, let's continue to practice with your partner!
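As a possible starting point for practice, the steps from this lesson can be wrapped in a small helper so the comparison is easy to repeat for other months or decades. The function name `compare_periods`, its parameters, and the synthetic `demo_table` are all hypothetical; with the real data you would pass `data_table` instead.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats


def compare_periods(table, early_start, late_start, span=10, month=None):
    """Return (difference in means, t-test p-value) for two periods.

    Each period covers `span` years starting at the given year.
    If `month` is given, only rows from that month are used.
    """
    def select(start):
        rows = table[(table["YEAR"] >= start) & (table["YEAR"] < start + span)]
        if month is not None:
            rows = rows[rows["MONTH"] == month]
        return rows["Temperature"]

    early, late = select(early_start), select(late_start)
    result = stats.ttest_ind(early, late)
    return late.mean() - early.mean(), result.pvalue


# Hypothetical stand-in for data_table with random temperatures
days = pd.date_range("1937-01-01", "2018-12-31", freq="D")
rng = np.random.default_rng(1)
demo_table = pd.DataFrame({
    "YEAR": days.year.to_numpy(),
    "MONTH": days.month.to_numpy(),
    "DAY": days.day.to_numpy(),
    "Temperature": rng.normal(10.0, 10.0, size=len(days)),
})

# Same comparison as in the lesson, restricted to December
diff, pvalue = compare_periods(demo_table, 1940, 2005, month=12)
print("Difference between the means:", diff)
print("p-value:", pvalue)
```

Because `demo_table` is random noise, its output says nothing about Detroit; the point is only the shape of the workflow.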