<a href="https://colab.research.google.com/github/Rachita-G/Python_Practice/blob/main/Model_Concepts/Correlation_and_Coefficient.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CORRELATION AND COVARIANCE

Correlation and covariance are two commonly used statistical measures of the linear relationship between two variables in a dataset.

Covariance indicates the direction in which two variables vary together, whereas correlation indicates both the direction and the strength of that relationship on a unit-free scale.

Although the two terms are closely related, they are not the same.

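$$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

Covariance ranges from $-\infty$ to $\infty$.

$$\mathrm{Corr}(x, y) = \frac{\mathrm{Cov}(x, y)}{s_x \, s_y}$$

Here $s_x$ and $s_y$ are the standard deviations of x and y. Correlation ranges from -1 to 1.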
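The two definitions can be checked numerically. The following is a minimal sketch that applies them to a small, hypothetical pair of arrays (`x` and `y` are made-up values, not taken from the iris data) and compares the results with NumPy's built-ins.

```python
import numpy as np

# Hypothetical toy data, chosen only to illustrate the formulas.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
n = len(x)

# Sample covariance: sum of products of deviations, divided by n - 1.
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Sample standard deviations (ddof=1 matches the n - 1 denominator).
sd_x = x.std(ddof=1)
sd_y = y.std(ddof=1)

# Correlation: covariance divided by the product of the standard deviations.
corr_xy = cov_xy / (sd_x * sd_y)

# For this data, cov_xy is 4.0 and corr_xy is 0.8, up to floating-point rounding.
print(cov_xy, corr_xy)

# The off-diagonal entries of NumPy's matrices hold the same quantities.
print(np.cov(x, y)[0, 1], np.corrcoef(x, y)[0, 1])
```

The manual values and the NumPy values agree because `np.cov` also uses the n - 1 denominator by default.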

In [None]:
import numpy as np
import pandas as pd

In [None]:
from sklearn.datasets import load_iris
iris=load_iris() # returns a dictionary-like Bunch object holding the data and its metadata

In [29]:
iris.keys() # the types of information present in the dictionary loaded

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [None]:
# print(iris.DESCR)

In [None]:
iris_feat=pd.DataFrame(iris.data,columns=iris.feature_names)
iris_feat.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [None]:
iris_feat.columns

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')

In [None]:
iris_feat[['sepal length (cm)', 'sepal width (cm)']].cov()

| | sepal length (cm) | sepal width (cm) |
|---|---|---|
| sepal length (cm) | 0.685694 | -0.042434 |
| sepal width (cm) | -0.042434 | 0.189979 |


Note: The covariance of a variable with itself is its variance, so the diagonal entries of the covariance matrix are the variances of the individual variables.
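As a quick check of this note, the sketch below compares the diagonal of a covariance matrix with the sample variances. The arrays `a` and `b` are randomly generated placeholders, not iris columns; the same comparison should hold for any pair of numeric columns.

```python
import numpy as np

# Hypothetical data: two random samples used only for illustration.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)

# 2 x 2 covariance matrix; np.cov divides by n - 1 by default.
cov_matrix = np.cov(a, b)

# The diagonal entries match the sample variances (ddof=1).
print(np.isclose(cov_matrix[0, 0], a.var(ddof=1)))
print(np.isclose(cov_matrix[1, 1], b.var(ddof=1)))
```

Note that `ddof=1` is passed explicitly because NumPy's `var` defaults to the population variance, whereas `np.cov` defaults to the sample version.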

In [None]:
iris_feat[['sepal length (cm)', 'sepal width (cm)']].corr()

| | sepal length (cm) | sepal width (cm) |
|---|---|---|
| sepal length (cm) | 1.0 | -0.11757 |
| sepal width (cm) | -0.11757 | 1.0 |


Note: By the formula, correlation is the covariance standardised by the two standard deviations. Let us standardise the data in Python, compute its covariance, and compare the result with the correlation above.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaled_feat=scaler.fit_transform(iris_feat)
scaled_feat=pd.DataFrame(scaled_feat,columns=iris_feat.columns)
scaled_feat.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | -0.900681 | 1.019004 | -1.340227 | -1.315444 |
| 1 | -1.143017 | -0.131979 | -1.340227 | -1.315444 |
| 2 | -1.385353 | 0.328414 | -1.397064 | -1.315444 |
| 3 | -1.506521 | 0.098217 | -1.283389 | -1.315444 |
| 4 | -1.021849 | 1.249201 | -1.340227 | -1.315444 |


In [None]:
scaled_feat[['sepal length (cm)', 'sepal width (cm)']].cov()

| | sepal length (cm) | sepal width (cm) |
|---|---|---|
| sepal length (cm) | 1.006711 | -0.118359 |
| sepal width (cm) | -0.118359 | 1.006711 |


In [None]:
# for comparison with above!
scaled_feat[['sepal length (cm)', 'sepal width (cm)']].corr()

| | sepal length (cm) | sepal width (cm) |
|---|---|---|
| sepal length (cm) | 1.0 | -0.11757 |
| sepal width (cm) | -0.11757 | 1.0 |


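Here, the correlation of the original data is close to, but not exactly equal to, the covariance of the standardised data. The small deviation arises because StandardScaler divides by the population standard deviation (denominator n), while cov() uses the sample denominator n-1, so the two matrices differ by the constant factor n/(n-1) = 150/149 ≈ 1.006711. For applications like PCA, either matrix gives the same principal directions, because the eigenvalues differ only by that constant factor.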
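The gap between the two matrices can be reproduced without scikit-learn. The sketch below is an illustration on synthetic data (the arrays `x`, `y` and the helper `standardise` are hypothetical, introduced only for this example). It standardises once with the population standard deviation (`ddof=0`, which is what `StandardScaler` uses) and once with the sample standard deviation (`ddof=1`).

```python
import numpy as np

# Hypothetical synthetic data with the same sample size as iris (150 rows).
rng = np.random.default_rng(42)
n = 150
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

def standardise(v, ddof):
    """Centre v and divide by its standard deviation computed with the given ddof."""
    return (v - v.mean()) / v.std(ddof=ddof)

corr = np.corrcoef(x, y)[0, 1]

# Population standardisation (ddof=0) followed by the sample covariance (n - 1).
cov_pop = np.cov(standardise(x, 0), standardise(y, 0))[0, 1]

# Sample standardisation (ddof=1) followed by the sample covariance (n - 1).
cov_samp = np.cov(standardise(x, 1), standardise(y, 1))[0, 1]

print(np.isclose(cov_pop, corr * n / (n - 1)))  # off by the factor n / (n - 1)
print(np.isclose(cov_samp, corr))               # matches the correlation
print(n / (n - 1))                              # the factor itself, about 1.006711
```

When the denominators are consistent, the covariance of standardised data equals the correlation; the mismatch only appears when a population standard deviation is mixed with a sample covariance.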

Alternatively, we can use NumPy directly: `np.cov(x, y)` returns the covariance matrix and `np.corrcoef(x, y)` returns the correlation matrix.

In [None]:
np.corrcoef(iris_feat['sepal length (cm)'], iris_feat['sepal width (cm)'])

array([[ 1.        , -0.11756978],
       [-0.11756978,  1.        ]])

In [None]:
np.cov(scaled_feat['sepal length (cm)'],scaled_feat['sepal width (cm)'])

array([[ 1.00671141, -0.11835884],
       [-0.11835884,  1.00671141]])

#### CONCLUSION

Correlation and covariance are closely related, but they are not interchangeable when it comes to choosing between them. Most analyses prefer correlation because it is easier to interpret and is not affected by the scale and units of the data.
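The effect of units can be illustrated with a short sketch. The data below is synthetic (`length_cm` and `width_cm` are hypothetical measurements, not the iris columns): converting both variables from centimetres to millimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import numpy as np

# Hypothetical measurements in centimetres, generated only for illustration.
rng = np.random.default_rng(1)
length_cm = rng.normal(loc=5.8, scale=0.8, size=100)
width_cm = 0.3 * length_cm + rng.normal(scale=0.4, size=100)

# The same measurements expressed in millimetres.
length_mm = length_cm * 10
width_mm = width_cm * 10

cov_cm = np.cov(length_cm, width_cm)[0, 1]
cov_mm = np.cov(length_mm, width_mm)[0, 1]

corr_cm = np.corrcoef(length_cm, width_cm)[0, 1]
corr_mm = np.corrcoef(length_mm, width_mm)[0, 1]

print(np.isclose(cov_mm, 100 * cov_cm))  # covariance scales with the units
print(np.isclose(corr_mm, corr_cm))      # correlation does not
```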