### Pearson Correlation Coefficient

In [13]:
import numpy as np
from scipy.stats import pearsonr
np.random.seed(0)

size = 300

x = np.random.normal(0,1, size)

print('Lower Noise: ', pearsonr(x, x + np.random.normal(0, 1, size)))
print('Higher Noise: ', pearsonr(x, x + np.random.normal(0, 10, size)))

Lower Noise:  (0.7182483686213841, 7.32401731299835e-49)
Higher Noise:  (0.057964292079338155, 0.3170099388532475)


>In this example we compare a variable with a noisy version of itself. With a small amount of noise, the correlation is relatively strong and the p-value is very low. With a large amount of noise, the correlation is very small and the p-value is high, meaning that a correlation of this size could easily be observed on a dataset of this size purely by chance.

In [14]:
import numpy as np
from scipy.stats import pearsonr
np.random.seed(42)

size = 1000

x = np.random.normal(0,1, size)

print('Lower Noise: ', pearsonr(x, x + np.random.normal(0, 1, size)))
print('Higher Noise: ', pearsonr(x, x + np.random.normal(0, 10, size)))

Lower Noise:  (0.6857150036658147, 7.860081779159927e-140)
Higher Noise:  (0.12083574866694574, 0.00012792152128382383)


>Similarly, in this example the correlation in the noisy comparison remains relatively weak, but with the larger sample the p-value is small, meaning that a correlation of this size is unlikely to be observed by chance.

>Scikit-learn provides the `f_regression` function for computing the p-values (and underlying F-values) in batch for a set of features, so it is a convenient way to run a correlation test on a whole dataset in one go and, for example, include it in a scikit-learn pipeline.
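
As a minimal sketch of the batch test, the cell below runs `f_regression` on a small synthetic dataset. The data, the feature names `x0`, `x1`, `x2`, and the coefficients are assumptions made up for illustration; only the `f_regression(X, y)` call itself comes from scikit-learn.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.RandomState(0)
n = 300

# Hypothetical data: x0 drives y strongly, x1 weakly, x2 not at all
X = rng.normal(0, 1, (n, 3))
y = X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1, n)

# One call returns an F-value and a p-value per feature
f_values, p_values = f_regression(X, y)

for name, f, p in zip(['x0', 'x1', 'x2'], f_values, p_values):
    print(name, round(float(f), 2), p)
```

Each feature is tested against `y` on its own, so the p-values are directly comparable to running `pearsonr` one column at a time.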

>One obvious drawback of Pearson correlation as a feature ranking mechanism is that it is only sensitive to a linear relationship. If the relationship is non-linear, Pearson correlation can be close to zero even if one variable is completely determined by the other.
For example, the correlation between x and x² is zero when x is distributed symmetrically around 0.

In [17]:
x = np.random.uniform(-1, 1, 100000)
print (pearsonr(x, x**2)[0])

-0.004932875787130827


### Mutual information and maximal information coefficient (MIC)

Mutual information (MI) measures the mutual dependence between variables, typically in bits. It is **not** normalized, so it can be difficult to use in certain circumstances:

>- MI values can be incomparable between two datasets.
>- It can be difficult to compute for continuous variables. The variables need to be discrete; continuous variables must be binned first, and the result can be sensitive to bin selection.

Therefore another technique was developed to address the shortcomings. 
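
The sensitivity to bin selection can be illustrated with a simple histogram-based estimate. This is only a sketch: the helper `binned_mutual_information`, the synthetic data, and the bin counts are assumptions chosen for illustration, and a plug-in estimate like this one is biased upward as the bins get finer.

```python
import numpy as np

def binned_mutual_information(x, y, bins):
    """Plug-in estimate of mutual information (in bits) from a 2D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()              # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)    # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal of y, shape (1, bins)
    nonzero = pxy > 0                      # skip empty cells (0 * log 0 = 0)
    ratio = pxy[nonzero] / (px @ py)[nonzero]
    return float(np.sum(pxy[nonzero] * np.log2(ratio)))

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 10000)
y = x**2 + rng.normal(0, 0.1, 10000)

# Same data, different bin counts, different MI estimates
scores = {bins: binned_mutual_information(x, y, bins) for bins in (5, 20, 100)}
for bins, mi in scores.items():
    print(bins, round(mi, 3))
```

The estimate changes with the bin count even though the data are identical, which is why raw MI values are hard to compare.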


### Maximal Information Coefficient (MIC)

The maximal information coefficient searches for the optimal binning and turns the mutual information score into a metric that lies in the range [0, 1]. It is available via the minepy library, whose `MINE` class computes it.

In [22]:
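# Requires the minepy package (pip install minepy)

from minepy import MINE

m = MINE()
x = np.random.uniform(-1, 1, 10000)

# Compute the statistics for x versus x**2 before querying the MIC
m.compute_score(x, x**2)
print(m.mic())

ModuleNotFoundError: No module named 'minepy'

The cell above failed because minepy was not installed in this environment. With minepy available, the MIC for x versus x**2 should come out close to 1, which is the dependence that Pearson correlation missed.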

Further reading on MIC's statistical power: 
> http://ie.technion.ac.il/~gorfinm/files/science6.pdf
> http://www-stat.stanford.edu/~tibs/reshef/comment.pdf
> http://en.wikipedia.org/wiki/Statistical_power


### Distance Correlation

>Another robust, competing method of correlation estimation is distance correlation, designed explicitly to address the shortcomings of Pearson correlation. While a Pearson correlation of 0 does not imply independence (as we saw in the x vs x² example), a distance correlation of 0 does imply that there is no dependence between the two variables.

[python gist](https://gist.github.com/josef-pkt/2938402)
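
For readers who do not want to follow the link, here is a minimal NumPy sketch of the sample distance correlation. It is not the code from the gist; the function name `distance_correlation` and the test data are assumptions for illustration, and the pairwise matrices make it O(n²) in memory, so it is only suitable for modest sample sizes.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation of two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise absolute-difference matrices
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double-centre: subtract row and column means, add back the grand mean
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = max((A * B).mean(), 0.0)   # squared distance covariance
    dvar_x = (A * A).mean()            # squared distance variance of x
    dvar_y = (B * B).mean()            # squared distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    return float(np.sqrt(dcov2 / denom)) if denom > 0 else 0.0

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 1000)

pearson_quadratic = np.corrcoef(x, x**2)[0, 1]
dcor_quadratic = distance_correlation(x, x**2)
dcor_noise = distance_correlation(x, rng.uniform(-1, 1, 1000))

print('Pearson, x vs x**2:  ', pearson_quadratic)
print('Distance, x vs x**2: ', dcor_quadratic)
print('Distance, x vs noise:', dcor_noise)
```

On the quadratic example the Pearson coefficient stays near zero while the distance correlation is clearly larger than it is for an unrelated variable.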

> There are at least two reasons to prefer Pearson correlation over more sophisticated methods such as MIC or distance correlation when the relationship is close to linear. First, computing the Pearson correlation is much faster, which may be important for big datasets. Second, the range of the correlation coefficient is [-1, 1] (instead of [0, 1] for MIC and distance correlation). The sign conveys useful extra information on whether the relationship is negative or positive, i.e. whether higher feature values imply higher values of the response variable or vice versa. Of course, the question of negative versus positive correlation is only well-posed if the relationship between the two variables is monotonic.

### Model Based Ranking

>Finally, one can use an arbitrary machine learning method to build a predictive model for the response variable using each individual feature, and measure the performance of each model. **In fact, this is already put to use with Pearson’s correlation coefficient, since it is equivalent to the standardized regression coefficient that is used for prediction in linear regression.**
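
The equivalence can be checked numerically. The sketch below uses made-up data (the slope of 3 and the noise level are arbitrary assumptions): after standardizing both variables, the least-squares slope of a single-feature regression matches the Pearson coefficient.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(0, 1, 500)
y = 3.0 * x + rng.normal(0, 2, 500)

# Standardize both variables to zero mean and unit variance
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# Least-squares slope of zy on zx (polyfit returns [slope, intercept])
slope = np.polyfit(zx, zy, 1)[0]

# Pearson correlation of the original variables
r = np.corrcoef(x, y)[0, 1]

print(slope, r)
```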

>If the relationship between a feature and the response variable is non-linear, there are a number of alternatives, for example tree-based methods (decision trees, random forests) or linear models with basis expansion. Tree-based methods are probably among the easiest to apply, since they can model non-linear relations well and don’t require much tuning. The main thing to avoid is overfitting, so the depth of the tree(s) should be relatively small, and cross-validation should be applied.

In [29]:
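import numpy as np
from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older release
from sklearn.ensemble import RandomForestRegressor

# Boston Housing Data
boston = load_boston()
X = boston['data']
y = boston['target']
names = boston['feature_names']

model = RandomForestRegressor(n_estimators=20, max_depth=4)
scores = []

# Score each feature on its own with 3 random 70/30 train/test splits
cv = ShuffleSplit(n_splits=3, test_size=0.3)
for i in range(X.shape[1]):
    score = cross_val_score(model, X[:, i:i+1], y, scoring='r2', cv=cv)
    scores.append((round(np.mean(score), 3), names[i]))

print(sorted(scores, reverse=True))

[(-2.378, 'DIS'), (-2.6, 'LSTAT'), (-2.727, 'PTRATIO'), (-2.977, 'RM'), (-3.299, 'TAX'), (-4.422, 'CRIM'), (-4.827, 'ZN'), (-4.843, 'INDUS'), (-6.667, 'B'), (-6.948, 'RAD'), (-8.123, 'AGE'), (-8.771, 'NOX'), (-12.315, 'CHAS')]

>The scores printed above come from a run that called `ShuffleSplit(len(X), 3, .3)`. `sklearn.model_selection.ShuffleSplit` reads those positional arguments as `n_splits=506, test_size=3, train_size=0.3`, and R² on 3-sample test sets is extremely noisy, which likely explains the large negative values. The keyword call in the cell requests the intended 3 splits with a 30% test set.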
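
Because the Boston dataset is no longer shipped with recent scikit-learn releases, the same per-feature ranking can be sketched on synthetic data. Everything about the data here is an assumption for illustration: the feature names `linear`, `quadratic`, and `noise`, the coefficients, and the sample size.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score

rng = np.random.RandomState(0)
n = 500

# Hypothetical features: one linear, one quadratic, one irrelevant
X = rng.uniform(-1, 1, (n, 3))
y = 2 * X[:, 0] + 3 * X[:, 1] ** 2 + rng.normal(0, 0.3, n)
names = ['linear', 'quadratic', 'noise']

# Shallow trees to limit overfitting, as recommended above
model = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=0)
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

scores = []
for i in range(X.shape[1]):
    # Fit and score a model that sees only feature i
    score = cross_val_score(model, X[:, i:i + 1], y, scoring='r2', cv=cv)
    scores.append((round(float(np.mean(score)), 3), names[i]))

by_name = {name: s for s, name in scores}
print(sorted(scores, reverse=True))
```

The quadratic feature has almost no linear correlation with `y`, yet the tree-based model still assigns it a clearly positive R², while the irrelevant feature scores near or below zero.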
