# Introduction to Statistics Part III



Now that we have learned how to compute basic statistics, we will look at how to perform *significance tests* and measure the *correlation* between two variables.

## Significance Testing

In [None]:
# Load pandas, numpy, and scipy.stats




Run the cell below to load a table of daily temperatures in Detroit since 1937:

In [None]:
data_table = pd.read_csv( '../SampleData/detroit_weather_2.csv' ) # Data from Mathematica WeatherData, 2019

Print out the data to see what the format looks like:

In [None]:
# View the head of data_table to see what its format looks like



In [None]:
# View the tail of data_table to see what its format looks like




`data_table` contains one row for each day since 1937, where the column 'Temperature' contains the average temperature for that day (in Celsius).

We will use this data to **test** whether or not the average temperature in Detroit has changed significantly in the years since 1937. To do this, let's first select two equally sized date ranges to generate our averages: 1940-1950 and 2005-2015.
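As a minimal sketch of this kind of selection, the cell below builds a small synthetic table and picks out the two date ranges with boolean masks. The `Date` column name, the `demo_table` name, and the generated values are assumptions made for illustration; only the `Temperature` column is described above, so check the real column names with `data_table.head()` before adapting it.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data_table: one row per day (values are made up)
rng = np.random.default_rng(0)
dates = pd.date_range('1937-01-01', '2018-12-31', freq='D')
demo_table = pd.DataFrame({
    'Date': dates,  # hypothetical column name
    'Temperature': rng.normal(loc=10, scale=8, size=len(dates)),
})

# Boolean masks keep only the rows whose year falls inside each range
# (both end years are included in this sketch)
years = demo_table['Date'].dt.year
early = demo_table.loc[(years >= 1940) & (years <= 1950), 'Temperature']
late = demo_table.loc[(years >= 2005) & (years <= 2015), 'Temperature']

# Number of days in each range, then the difference between the two means
print(len(early), len(late))
print(late.mean() - early.mean())
```

If the date column is stored as text, it would first need converting with `pd.to_datetime` before `.dt.year` can be used.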

In [None]:
# Select two temperature ranges from data_table, one from a long time ago and one more recent:



Using what we learned this morning, calculate the mean for each of the date ranges.

In [None]:
# Calculate the mean of your two temperature ranges:




In [None]:
# Calculate the difference between the two means:



Here, we see that there was an increase of 0.75 degrees Celsius between the average temperature of these two time periods. *Statistical tests* are used to determine if this difference is likely due to chance or due to an actual change.

We will use one of these tests, a **t-test**, to calculate the **probability** that a temperature difference this large would appear by chance alone.

For this, we'll use the `ttest_ind` **function** from the `stats` **module** of the `scipy` **package**. You'll notice that the arguments we passed to `ttest_ind` are the full vectors of daily temperatures for each date range, rather than just the averages. This is because the outcome of the test depends on the **distribution** of the data, not only on its mean.
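Here is a hedged sketch of the call on made-up numbers rather than the Detroit data. The sample names, means, and spreads are assumptions chosen only to show the shape of the input and output.

```python
import numpy as np
from scipy import stats

# Two made-up samples of daily temperatures (not the Detroit data)
rng = np.random.default_rng(42)
sample_early = rng.normal(loc=9.0, scale=10.0, size=4000)
sample_late = rng.normal(loc=10.0, scale=10.0, size=4000)

# Pass the full arrays of daily values, not the two means
result = stats.ttest_ind(sample_late, sample_early)

# The result carries both the test statistic and the p-value
print(result.statistic, result.pvalue)
```

The sign of the statistic follows the order of the arguments: it is positive when the first sample has the larger mean.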

In [None]:
# Use the scipy stats module to calculate a t-test from the data above



`ttest_ind` returns a test statistic and a *p-value*. The *p-value* tells us the **probability** of seeing a difference at least as large as the one we observed if there were no real difference between the two groups we are testing. Here, the result tells us that there is only a 0.25% chance that a difference of this size would arise from random fluctuation alone, which is very low! Since the average of the later dates was higher than that of the earlier dates, our data is consistent with the idea of global warming, even here in Detroit.
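One way to build intuition for what a p-value measures is to shuffle the group labels many times and count how often chance alone produces a difference as large as the observed one. Note that this is a permutation test, a different technique from the t-test used above, shown only as an illustration on made-up data. All names and values here are assumptions.

```python
import numpy as np

# Two made-up groups with a modest difference in their means
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=0.5, scale=1.0, size=200)

# The difference we actually observed
observed = abs(group_b.mean() - group_a.mean())

# If the groups were really the same, the labels would not matter,
# so shuffle the pooled values and re-split them many times
pooled = np.concatenate([group_a, group_b])
n_shuffles = 2000
count = 0
for _ in range(n_shuffles):
    shuffled = rng.permutation(pooled)
    diff = abs(shuffled[:len(group_a)].mean() - shuffled[len(group_a):].mean())
    if diff >= observed:
        count += 1

# Fraction of shuffles with a difference at least as large as the observed one
p_estimate = count / n_shuffles
print(p_estimate)
```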

Let's redo this analysis using only temperature values from December:

In [None]:
# Reselect the data, now only including data points in December

 

In [None]:
# Calculate the mean of your two temperature ranges:




In [None]:
# Calculate the difference between the two means:



In [None]:
# Re-run the statistical test on these subset datasets



We can see that the difference in temperature is even greater when we focus on just December. A *p-value* of 0.08% indicates that the change is even more significant than the change in temperature over the entire year.
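A month-only subset like the December one used here can be selected with a mask on the month number. This is a minimal sketch on a synthetic two-year table. The `Date` column name and `demo_table` are assumptions, so adapt them to the actual columns of `data_table`.

```python
import numpy as np
import pandas as pd

# Synthetic two-year table standing in for data_table (values are made up)
rng = np.random.default_rng(0)
dates = pd.date_range('1940-01-01', '1941-12-31', freq='D')
demo_table = pd.DataFrame({
    'Date': dates,  # hypothetical column name
    'Temperature': rng.normal(loc=10, scale=8, size=len(dates)),
})

# Month 12 is December; the mask is True only for December rows
is_december = demo_table['Date'].dt.month == 12
december_temps = demo_table.loc[is_december, 'Temperature']

# Two Decembers of 31 days each
print(len(december_temps))
```

The same mask can be combined with a year mask using `&` to get December values for one date range at a time.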

## Correlations

A *correlation* is a measure of the statistical relationship between two variables. Correlation values range from -1 to 1, where the absolute value of the correlation indicates the strength of the relationship and the sign of the correlation represents the direction of the relationship. 
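For reference, here is a sketch of the usual definition, assuming the Pearson correlation coefficient is the measure meant here. For $n$ paired observations $(x_i, y_i)$ with means $\bar{x}$ and $\bar{y}$ (symbols introduced only for this illustration):

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

When the points lie exactly on a rising line, $r = 1$; when they lie exactly on a falling line, $r = -1$; and when there is no linear relationship, $r$ is close to 0.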

We will use the `corrcoef` function from `numpy` to calculate correlation values.

In [None]:
# positive and negative correlation examples
data_1 = np.array([1,2,3,4,5])
data_2 = data_1 * 4
data_3 = data_1 * -2



This function returns a *correlation matrix*, which always has 1's along the diagonal and is *symmetric* (i.e. same values above the diagonal as below). This is so you can compute correlations of more than one variable. Let's illustrate with another example.

In [None]:
a = np.array([1,2,3,4,6,7,8,9])
b = np.array([2,4,6,8,10,12,13,15])
c = np.array([-1,-2,-2,-3,-4,-6,-7,-8])



The resulting correlation matrix has the following form (values rounded to three decimal places):

|   | a | b | c |
|---|---|---|---|
| a | 1 | 0.995 | -0.981 |
| b | 0.995 | 1 | -0.972 |
| c | -0.981 | -0.972 | 1 |

Now, it should be clear why a correlation matrix always has 1's along the diagonal - every variable has perfect positive correlation with itself. Furthermore, it is symmetric because the correlation of `a` & `b` is the same as the correlation of `b` & `a`. 
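These two properties can be checked directly. The sketch below recomputes the matrix for `a`, `b`, and `c` and tests the diagonal and the symmetry; the variable name `corr` is introduced here only for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 6, 7, 8, 9])
b = np.array([2, 4, 6, 8, 10, 12, 13, 15])
c = np.array([-1, -2, -2, -3, -4, -6, -7, -8])

# Each array in the list is treated as one variable (one row)
corr = np.corrcoef([a, b, c])
print(corr)

# Every variable correlates perfectly with itself
print(np.allclose(np.diag(corr), 1.0))

# corr[i, j] equals corr[j, i], so the matrix equals its transpose
print(np.allclose(corr, corr.T))
```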

Now that we understand our output, let's check the correlations between the variables in the `iris` dataset.

In [None]:
# load and preview iris
iris = pd.read_csv('../SampleData/iris.csv')


In [None]:
# find correlations between sepal_length, sepal_width, petal_length, petal_width


You'll notice this time we included the `rowvar` parameter - this is because, by default, the `corrcoef` function expects that each row represents a variable, with observations in the columns. In our case it is the opposite - each column represents a variable, while the rows contain observations. So here we change the value of `rowvar` from the default `True` to `False`. 
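To see the effect of `rowvar`, here is a sketch on a tiny made-up table laid out like `iris`, with one column per variable. The five rows of measurements are invented for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up measurements: columns are variables, rows are observations
demo = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 6.3, 5.8, 7.1],
    'sepal_width': [3.5, 3.0, 3.3, 2.7, 3.0],
    'petal_length': [1.4, 1.4, 6.0, 5.1, 5.9],
    'petal_width': [0.2, 0.2, 2.5, 1.9, 2.1],
})

# rowvar=False: each column is a variable, giving a 4 x 4 matrix
corr_cols = np.corrcoef(demo.to_numpy(), rowvar=False)

# Default rowvar=True: each row is treated as a variable, giving 5 x 5,
# which is not what we want for this layout
corr_default = np.corrcoef(demo.to_numpy())

print(corr_cols.shape, corr_default.shape)

# pandas computes the same column-wise correlations with DataFrame.corr()
print(np.allclose(corr_cols, demo.corr().to_numpy()))
```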

In this lesson you learned how to:

* Perform a t-test on a two-class dataset
* Interpret the results from statistical tests
* Compute correlations for multiple variables
     
Now, let's continue to practice with your partner!