# EDA 2 - Covariance, Correlation and Non Parametric Testing


- For **numeric features**, the **Pearson** and **Spearman** correlation coefficients are the most frequently used measures of association among features.

- For **categorical variables**, the **chi-square** statistic is the most frequently used measure of association among features.

## Part 1: Importing database
- The iris dataset is used in this experiment.
- In the following code we examine only the Pearson and Spearman measurements.

In [5]:
import pandas as pd
from sklearn.datasets import load_iris

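# Load the iris dataset (150 samples, 4 numeric features, 3 species).
iris = load_iris()
iris_nparray = iris.data

# Build a DataFrame of the numeric features and add the species as a categorical column.
iris_dataframe = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_dataframe["group"] = pd.Series([iris.target_names[k] for k in iris.target], dtype="category")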

## Part 2: Covariance

**Why covariance?** Because it tells us whether two features behave in a coincident way with respect to their means. Three scenarios are examined:

- If the individual values of the two variables are usually both above or both below their respective averages, the two variables have a positive association.

- If one variable is usually above its average when the other is below its average, the two variables have a negative association. (It is still fine to proceed with predictions in such cases, just to see how the variables behave with each other.)

- If the two variables neither agree nor disagree systematically, their covariance tends to be zero, a sign that they share little linear information and behave largely independently.
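The three scenarios above can be sketched with a hand-written covariance. This is a minimal illustration, not part of the original notebook: the helper `covariance` and the toy arrays are hypothetical, and the `n - 1` denominator is chosen to match the default of pandas' `DataFrame.cov()`.

```python
import numpy as np

def covariance(x, y):
    """Sample covariance: sum of products of deviations from the means, divided by n - 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Hypothetical toy data, chosen only to illustrate the three scenarios.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
same_side = np.array([2.0, 4.0, 6.0, 8.0, 10.0])      # above/below its mean together with x
opposite_side = np.array([10.0, 8.0, 6.0, 4.0, 2.0])  # above its mean when x is below its mean
no_pattern = np.array([3.0, 1.0, 4.0, 1.0, 3.0])      # no consistent agreement with x

cov_positive = covariance(x, same_side)       # positive association
cov_negative = covariance(x, opposite_side)   # negative association
cov_near_zero = covariance(x, no_pattern)     # deviations cancel out

print(cov_positive, cov_negative, cov_near_zero)
```

The sign of each result follows from whether the paired deviations mostly share a sign, mostly differ in sign, or cancel out.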

When the **target feature** is a numeric variable and the covariance is clearly positive or negative, we have **information redundancy**, which means that the two variables carry overlapping information.

In [6]:
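# numeric_only=True skips the categorical "group" column, which recent pandas versions would otherwise reject.
iris_dataframe.cov(numeric_only=True)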

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| sepal length (cm) | 0.685694 | -0.042434 | 1.274315 | 0.516271 |
| sepal width (cm) | -0.042434 | 0.189979 | -0.329656 | -0.121639 |
| petal length (cm) | 1.274315 | -0.329656 | 3.116278 | 1.295609 |
| petal width (cm) | 0.516271 | -0.121639 | 1.295609 | 0.581006 |


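From the above output we can draw two tentative conclusions:

- There is almost no linear relationship between sepal length and sepal width (-0.04).
- There is a positive relationship between petal width and petal length (1.30).

Because covariance depends on the units and spread of each feature, these magnitudes are not directly comparable; the correlation in Part 3 puts them on a common scale.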
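A caveat when reading raw covariance values is that they change with the measurement units. The sketch below uses hypothetical synthetic data (the names `length_cm` and `width_cm` and the generating parameters are assumptions for illustration only) to show that converting centimetres to millimetres multiplies the covariance by 100 while leaving the correlation unchanged.

```python
import numpy as np

# Hypothetical synthetic measurements; not the iris data.
rng = np.random.default_rng(0)
length_cm = rng.normal(5.8, 0.8, size=200)
width_cm = 0.4 * length_cm + rng.normal(0.0, 0.3, size=200)

# Same data expressed in two different units.
cov_cm = np.cov(length_cm, width_cm)[0, 1]
cov_mm = np.cov(length_cm * 10, width_cm * 10)[0, 1]

corr_cm = np.corrcoef(length_cm, width_cm)[0, 1]
corr_mm = np.corrcoef(length_cm * 10, width_cm * 10)[0, 1]

print("covariance ratio (mm / cm):", cov_mm / cov_cm)
print("correlation in cm and mm:", corr_cm, corr_mm)
```

This is why a covariance of 1.30 cannot be judged "large" or "small" without knowing the scale of the features involved.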

## Part 3: Correlation / Pearson
Two computations are presented below:

- Correlation, which is the same as covariance after we standardize the features, and
- Correlation squared, which gives the proportion of variance that two features share through a linear relationship. Thus, a 0.96 correlation implies that about 93% (0.96² ≈ 0.93) of the variance is shared.

**NOTE**: Pearson correlation should be used only when the relationship is linear.
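The statement that correlation is covariance computed on standardized features can be checked numerically. The following is a minimal sketch on hypothetical synthetic data; the helper `standardize` is an illustrative name, and it uses the sample standard deviation (`ddof=1`) so that it is consistent with the `n - 1` denominator of the covariance.

```python
import numpy as np

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation (ddof=1)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)

# Hypothetical synthetic features; not the iris data.
rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = 2.0 * a + rng.normal(size=100)

cov_of_standardized = np.cov(standardize(a), standardize(b))[0, 1]
pearson_r = np.corrcoef(a, b)[0, 1]

print(cov_of_standardized, pearson_r)  # the two values agree up to floating-point error
```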

In [7]:
# numeric_only=True skips the categorical "group" column.
iris_dataframe.corr(numeric_only=True)

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| sepal length (cm) | 1.0 | -0.11757 | 0.871754 | 0.817941 |
| sepal width (cm) | -0.11757 | 1.0 | -0.42844 | -0.366126 |
| petal length (cm) | 0.871754 | -0.42844 | 1.0 | 0.962865 |
| petal width (cm) | 0.817941 | -0.366126 | 0.962865 | 1.0 |


- We can now see that petal width and petal length are strongly positively correlated (0.96), while sepal length and sepal width are only weakly negatively correlated (-0.12).

In [8]:
# Squared correlation: proportion of variance shared by each pair of features.
iris_dataframe.corr(numeric_only=True)**2

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| sepal length (cm) | 1.0 | 0.013823 | 0.759955 | 0.669028 |
| sepal width (cm) | 0.013823 | 1.0 | 0.183561 | 0.134048 |
| petal length (cm) | 0.759955 | 0.183561 | 1.0 | 0.92711 |
| petal width (cm) | 0.669028 | 0.134048 | 0.92711 | 1.0 |
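One way to read a squared correlation is as the share of variance in one feature that a straight-line fit on the other feature explains. The sketch below illustrates this on hypothetical synthetic data (not the iris features), using `np.polyfit` for an ordinary least-squares line; the variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical synthetic features with a linear relationship plus noise.
rng = np.random.default_rng(2)
x = rng.normal(size=150)
y = 1.5 * x + rng.normal(scale=0.5, size=150)

# Pearson's r and its square.
r = np.corrcoef(x, y)[0, 1]

# Least-squares line y ≈ slope * x + intercept (polyfit returns the highest degree first).
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Share of the variance of y explained by the line.
explained_share = 1.0 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

print(r**2, explained_share)  # equal up to floating-point error
```

Read this way, the 0.92711 entry for petal length and petal width says that a linear fit on one of them accounts for roughly 93% of the variance of the other.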


## Part 4: Non parametric test / Spearman
When:

- The features are ordinal or
- We suspect some nonlinearity due to non-normal distributions in our data

Then:

- A good approach is to **test the doubtful correlation with a nonparametric correlation** such as Spearman's correlation.

Spearman's correlation transforms the numeric values into rankings and then correlates the rankings, thus minimizing the influence of any nonlinear (but monotonic) relationship between the two variables under scrutiny.
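The rank-then-correlate idea can be sketched directly. This is a minimal illustration on hypothetical data with no tied values; the helper `ranks` is an assumption for this example and does not handle ties, which a library routine such as `scipy.stats.spearmanr` resolves with average ranks.

```python
import numpy as np

def ranks(values):
    """Rank values from 1 to n (assumes no ties, which holds for the toy data below)."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(1, len(values) + 1)
    return result

def pearson(x, y):
    """Pearson's r via numpy's correlation matrix."""
    return np.corrcoef(x, y)[0, 1]

# Hypothetical data: y grows monotonically but nonlinearly with x.
x = np.linspace(0.0, 5.0, 30)
y = np.exp(x)

pearson_r = pearson(x, y)                    # penalized by the curvature
spearman_rho = pearson(ranks(x), ranks(y))   # Pearson's r computed on the ranks

print("Pearson: %0.3f  Spearman: %0.3f" % (pearson_r, spearman_rho))
```

Because the ranks of `x` and `y` are identical here, the rank correlation is perfect even though the raw values do not lie on a straight line.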

We will apply the nonparametric test to **sepal length** and **sepal width** to double-check the weak relationship between the two features.

In [9]:
from scipy.stats import spearmanr
from scipy.stats import pearsonr

# Each function returns the coefficient and its p-value.
spearmanr_coef, spearmanr_p = spearmanr(iris_dataframe["sepal length (cm)"], iris_dataframe["sepal width (cm)"])
pearsonr_coef, pearsonr_p = pearsonr(iris_dataframe["sepal length (cm)"], iris_dataframe["sepal width (cm)"])

print("Pearson correlation  %0.3f\nSpearman correlation %0.3f" % (pearsonr_coef, spearmanr_coef))

Pearson correlation  -0.118
Spearman correlation -0.167



**Conclusion**: The above code confirms the weak association between the two variables using the nonparametric test.

- When we checked the correlation of variables with notable skewness and kurtosis, the results were disappointing. Nonparametric tests told us more, and transforming the data may provide a solution.
- Using a for loop, we will now experiment with how Pearson's r between two features changes under different transformations.
- A power transformation can change the correlation between two features and, eventually, the algorithm's final performance.


In [31]:
import numpy as np

# Candidate transformations applied to sepal width before computing Pearson's r.
transformations = {
    'x': lambda x: x, '1/x': lambda x: 1/x,
    'x**2': lambda x: x**2, 'x**3': lambda x: x**3,
    'log(x)': lambda x: np.log(x), 'sqrt(x)': lambda x: np.sqrt(x),
    'exp(x)': lambda x: np.exp(x), 'log(1/x)': lambda x: np.log(1/x),
}
for name, transform in transformations.items():
    pearsonr_coef, pearsonr_p = pearsonr(iris_dataframe['sepal length (cm)'], transform(iris_dataframe['sepal width (cm)']))
    print("Transformation: %s       \t    Pearson's r:  %0.3f" % (name, pearsonr_coef))

Transformation: x       	    Pearson's r:  -0.118
Transformation: 1/x       	    Pearson's r:  0.080
Transformation: x**2       	    Pearson's r:  -0.131
Transformation: x**3       	    Pearson's r:  -0.140
Transformation: log(x)       	    Pearson's r:  -0.100
Transformation: sqrt(x)       	    Pearson's r:  -0.109
Transformation: exp(x)       	    Pearson's r:  -0.142
Transformation: log(1/x)       	    Pearson's r:  0.100
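In the output above, no transformation moves Pearson's r far from its original value, which is consistent with a genuinely weak association. For contrast, here is a minimal sketch on hypothetical synthetic data where the relationship really is nonlinear (`y` is constructed as a function of `1/x`), so the matching transformation raises the absolute correlation to 1. The data and the reduced set of transformations are assumptions for illustration.

```python
import numpy as np

# A reduced, hypothetical set of candidate transformations.
transformations = {
    "x": lambda v: v,
    "1/x": lambda v: 1 / v,
    "log(x)": np.log,
    "sqrt(x)": np.sqrt,
}

# Hypothetical data: y depends on x only through 1/x (x is strictly positive).
x = np.linspace(1.0, 10.0, 50)
y = 3.0 / x + 0.5

# Pearson's r between each transformed x and y.
results = {name: np.corrcoef(func(x), y)[0, 1] for name, func in transformations.items()}
best_name = max(results, key=lambda name: abs(results[name]))

for name, r in results.items():
    print("Transformation: %-8s Pearson's r: %0.3f" % (name, r))
print("Strongest linear association after:", best_name)
```

Note that `log` and `sqrt` are only defined for positive inputs here, which is why the toy `x` starts at 1; the same restriction applies to any feature passed through these transformations.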
