# Finding Correlations Using Pandas and SciPy
*Curtis Miller*

**Correlation** is a measure of how strongly two variables are related to one another. The most common measure of correlation is the **Pearson correlation coefficient**, which, for two sets of paired data $x_i$ and $y_i$, is defined as

$$r = \frac{1}{n - 1}\sum_{i = 1}^n \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

Here $\bar{x}$ and $\bar{y}$ are the sample means and $s_x$ and $s_y$ are the sample standard deviations. $r$ is a number between -1 and 1, with $r > 0$ indicating a positive relationship ($x$ and $y$ increase together) and $r < 0$ a negative relationship ($y$ decreases as $x$ increases). When $|r| = 1$, there is a perfect *linear* relationship, while if $r = 0$ there is no *linear* relationship ($r$ may fail to capture non-linear relationships). In practice, $r$ is almost never exactly 0, so a value of $r$ with small magnitude is treated as "no correlation". $|r| = 1$ does occur, usually when two variables effectively describe the same phenomenon (for example, height in meters vs. height in centimeters, or grocery bill and sales tax).
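
As a minimal sketch of the definition above, the cell below computes $r$ directly from the formula and compares it with NumPy's `corrcoef()`. The helper name `pearson_r`, the simulated heights, and the seed are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed directly from the defining formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # ddof=1 gives the sample standard deviation (n - 1 denominator)
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return (zx * zy).sum() / (n - 1)

rng = np.random.default_rng(0)                 # arbitrary seed
height_m = rng.normal(1.7, 0.1, size=200)      # simulated heights in meters
height_cm = 100 * height_m                     # same quantity, different units
noisy = height_m + rng.normal(0, 0.1, size=200)

r_units = pearson_r(height_m, height_cm)       # perfect linear relationship
r_noisy = pearson_r(height_m, noisy)           # positive but imperfect
r_numpy = np.corrcoef(height_m, noisy)[0, 1]   # NumPy's value for the same pair

x = np.linspace(-1, 1, 201)
r_parabola = pearson_r(x, x ** 2)              # exact but non-linear relationship

print(r_units, r_noisy, r_numpy, r_parabola)
```

The unit conversion gives $r = 1$ up to floating-point error, while the parabola gives $r \approx 0$ even though $y$ is completely determined by $x$, which illustrates that $r$ only measures *linear* association.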

## Loading the Boston House Price Dataset

The Boston housing prices dataset is included with **sklearn** as a "toy" dataset (one used to experiment with statistical and machine learning methods). It contains the results of a survey of house prices in various areas of Boston, along with variables such as the crime rate of an area and the age of its homes. While many applications focus on predicting the price of housing based on these variables, I'm only interested in the correlation between these variables (perhaps this will suggest a model later).

Below I load in the dataset and create a Pandas `DataFrame` from it.

In [None]:
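# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import needs an older scikit-learn release
from sklearn.datasets import load_boston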
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
boston = load_boston()

In [None]:
print(boston.DESCR)

In [None]:
boston.data

In [None]:
boston.feature_names

In [None]:
boston.target

In [None]:
temp = DataFrame(boston.data, columns=pd.Index(boston.feature_names))
boston = temp.join(DataFrame(boston.target, columns=["PRICE"]))
boston

## Correlation Between Two Variables

We could use NumPy's `corrcoef()` function if we wanted the correlation between two variables, say, the local area crime rate (CRIM) and the price of a home (PRICE).

In [None]:
from numpy import corrcoef

In [None]:
boston.CRIM.to_numpy()    # As a NumPy array (as_matrix() was removed in pandas 1.0)

In [None]:
corrcoef(boston.CRIM.to_numpy(), boston.PRICE.to_numpy())

The numbers in the off-diagonal entries correspond to the correlation between the two variables. In this case, there is a negative relationship, which makes sense (more crime is associated with lower prices), but the correlation is only moderate.
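
To make the layout of the returned matrix concrete, here is a small sketch on simulated data. The arrays `crime` and `price` are hypothetical stand-ins for CRIM and PRICE, generated with an arbitrary seed; they are not the Boston data.

```python
import numpy as np

rng = np.random.default_rng(1)                           # arbitrary seed
crime = rng.exponential(scale=3.0, size=500)             # hypothetical stand-in for CRIM
price = 30 - 1.5 * crime + rng.normal(0, 8, size=500)    # hypothetical stand-in for PRICE

R = np.corrcoef(crime, price)   # 2 x 2 matrix: rows/columns are (crime, price)
r = R[0, 1]                     # off-diagonal entry: correlation between the two

print(R)
print(r)
```

The diagonal entries are each variable's correlation with itself (always 1) and the matrix is symmetric, so `R[0, 1]` and `R[1, 0]` hold the same number.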

## Computing a Correlation Matrix

When we have several variables we may want to see what correlations there are among them. We can compute a **correlation matrix** that includes the correlations between the different variables in the dataset.

Once the data is loaded into a Pandas `DataFrame`, we can use the `corr()` method to get the correlation matrix.

In [None]:
boston.corr()

This table contains a lot of information, but it's not easy to read. Let's visualize the correlations with a heatmap.

In [None]:
import seaborn as sns    # Allows for easy plotting of heatmaps

In [None]:
sns.heatmap(boston.corr(), annot=True)

The heatmap reveals some interesting patterns. We can see

* A strong positive relationship between home prices and the average number of rooms for homes in that area (RM)
* A strong negative relationship between home prices and the percentage of lower status of the population (LSTAT)
* A strong positive relationship between accessibility to radial highways (RAD) and property taxes (TAX)
* A negative relationship between nitric oxides concentration (NOX) and distance to major employment areas in Boston (DIS)
* No strong relationship between the Charles River variable (CHAS) and any other variable
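
Patterns like the ones listed above can also be pulled out of a correlation matrix programmatically. The sketch below ranks variable pairs by the magnitude of their correlation; the `DataFrame` and its columns `A` through `D` are simulated, hypothetical data rather than the Boston variables.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)   # arbitrary seed
n = 300
a = rng.normal(size=n)
df = pd.DataFrame({
    "A": a,
    "B": 2 * a + rng.normal(scale=0.5, size=n),   # strongly positive with A
    "C": -a + rng.normal(scale=2.0, size=n),      # weaker, negative with A
    "D": rng.normal(size=n),                      # unrelated to the others
})

corr = df.corr()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal of 1s is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Sort pairs by absolute correlation, strongest first
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Applied to `boston.corr()`, the same steps would list the strongest pairs at the top, which is a useful check on what the eye picks out of the heatmap.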

## Statistical Test for Correlation

Suppose we want extra assurance that two variables are correlated. We could perform a statistical test of the hypotheses

$$H_0: \rho = 0$$
$$H_A: \rho \neq 0$$

where $\rho$ is the population, or "true", correlation. SciPy provides this test.

In [None]:
from scipy.stats import pearsonr

In [None]:
# Test to see if crime rate and house prices are correlated
pearsonr(boston.CRIM, boston.PRICE)

The first number in the returned tuple is the computed sample correlation coefficient $r$, and the second number is the p-value of the test. In this case, the evidence that there is *any* non-zero correlation is strong. That said, just because we can conclude that the correlation is not zero does not mean that the correlation is meaningful.
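
The gap between "not zero" and "meaningful" is easy to see on simulated data. In the sketch below (the sample size, the slope of 0.05, and the seed are all arbitrary assumptions), the true correlation is tiny, yet with enough observations the test rejects $H_0$ decisively.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)       # arbitrary seed
n = 100_000                          # large sample
x = rng.normal(size=n)
y = 0.05 * x + rng.normal(size=n)    # true correlation is roughly 0.05

r, p = pearsonr(x, y)                # sample correlation and p-value
print(f"r = {r:.3f}, p-value = {p:.2e}, r^2 = {r ** 2:.4f}")
```

Here the p-value is extremely small, but $r^2$ shows that $x$ accounts for well under 1% of the variation in $y$, so the p-value should be read alongside the size of $r$ rather than in place of it.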