
#### Pearson correlation

* Pearson correlation measures the strength and direction of the linear relationship between 2 variables.

* It assumes the 2 variables come from a roughly normal distribution.

* +1 -> perfect positive linear correlation
* -1 -> perfect negative linear correlation
* 0 -> no **linear correlation**, though the variables may still be non-linearly related
* The sign gives the direction and the magnitude gives the strength of the relationship. If the variables are highly correlated (close to -1 or +1), we can use 1 variable to predict the other.

* The null hypothesis of the Pearson correlation test is that there is no linear correlation between two variables X and Y; the alternative hypothesis is that a linear correlation exists.
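
To make the three cases above concrete, here is a minimal sketch using NumPy's `np.corrcoef` on small made-up arrays (the data and variable names are illustrative assumptions, not taken from the notebook's dataset). It shows that a perfect quadratic relationship can still give a Pearson coefficient of 0.

```python
import numpy as np

# Illustrative data (assumed for this sketch): x is symmetric around 0
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r between the two inputs
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]    # increasing straight line
r_neg = np.corrcoef(x, -3 * x)[0, 1]       # decreasing straight line
r_quad = np.corrcoef(x, x ** 2)[0, 1]      # y depends on x, but not linearly

print(r_pos, r_neg, r_quad)
```

Here `r_pos` and `r_neg` land at the two extremes, while `r_quad` is 0 even though `x ** 2` is fully determined by `x`.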

#### Spearman correlation

* Non-parametric measure of statistical dependence between 2 variables. It is non-parametric because it doesn't assume any distribution for the 2 variables.

* It is useful when the 2 variables do not come from a normal distribution, and it is also robust to the effect of outliers because it works on ranks rather than raw values.

* The alternative hypothesis for Spearman’s correlation test is that the relationship between two variables X and Y is monotonic.
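
Since Spearman's coefficient is the Pearson coefficient computed on ranks, it can be sketched in a few lines. The helper name `spearman_from_ranks` and the data are hypothetical, and the double-`argsort` ranking assumes there are no tied values (ties would need average ranks instead).

```python
import numpy as np

def spearman_from_ranks(x, y):
    """Spearman correlation as Pearson correlation of ranks (assumes no ties)."""
    # argsort of argsort gives each value's 0-based rank when values are distinct
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]

# Illustrative data: y grows monotonically but not linearly with x
x = np.arange(1.0, 7.0)
y = np.exp(x)

pearson_r = np.corrcoef(x, y)[0, 1]      # below 1: the relationship is not a straight line
spearman_r = spearman_from_ranks(x, y)   # 1 up to rounding: the relationship is monotonic

print(pearson_r, spearman_r)
```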

In [None]:
import numpy as np

def getPearsonCorrelation(df, feature1, feature2):
    """
    Returns the Pearson correlation coefficient between feature1 and feature2.
    eg. getPearsonCorrelation(sample, 'votes', 'approx_cost(for two people)')
    """
    x, y = df[feature1], df[feature2]
    mu1, mu2 = x.mean(), y.mean()
    # ddof=0 (population std) to match the 1/n covariance below;
    # pandas' default ddof=1 would scale r by (n - 1) / n
    std1, std2 = x.std(ddof=0), y.std(ddof=0)
    cov = np.mean((x - mu1) * (y - mu2))
    r = cov / (std1 * std2)
    return r

### Coefficient of Variation

The coefficient of variation is a statistical measure of the relative dispersion of data points around the mean. The higher the value, the greater the dispersion relative to the mean.

$ CV = \frac{\sigma}{\mu} $
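
As a small worked example of the formula, the sketch below compares two made-up samples with the same spread but different means. It uses the sample standard deviation (`ddof=1`), which is also the pandas default for `.std()`; the arrays are illustrative assumptions.

```python
import numpy as np

a = np.array([9.0, 10.0, 11.0])       # mean 10, sample std 1
b = np.array([99.0, 100.0, 101.0])    # mean 100, sample std 1

cv_a = np.std(a, ddof=1) / np.mean(a)
cv_b = np.std(b, ddof=1) / np.mean(b)

print(cv_a, cv_b)   # cv_a is ten times cv_b
```

Both samples have the same standard deviation, but `a` is ten times more dispersed relative to its mean.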

In [None]:
def getCoefficientOfVariation(df, featureCol):
    """
    Returns Coefficient of Variation (CV) for featureCol
    CV = std / mu

    eg. getCoefficientOfVariation(zomato_df, 'votes')
    """
    cv = df[featureCol].std() / df[featureCol].mean()
    return cv

### Kurtosis

Kurtosis is the 4th central moment of a dataset divided by the square of its variance. It measures the heaviness of the tails of a distribution compared to a normal distribution with the same variance.

* heavy tails -> high kurtosis (called leptokurtic). eg. Student's t distribution, whose kurtosis is $ \infty $ for 2 < df <= 4
* light tails -> low kurtosis (called platykurtic). eg. uniform distribution with kurtosis = 1.8 (excess kurtosis = -1.2)

The kurtosis of a normal distribution is 3. Fisher's definition of kurtosis is the excess kurtosis, which subtracts 3 so that a normal distribution scores 0:

excess kurtosis = kurtosis - 3
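
A rough way to see excess kurtosis distinguish tail weights is to compute it for a few samples. In the sketch below, the helper `excess_kurtosis`, the seed, and the sample sizes are assumptions; the values for the random samples are estimates that vary with the seed.

```python
import numpy as np

def excess_kurtosis(values):
    """Population (1/n) excess kurtosis: 4th central moment / variance^2 - 3."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 4) / np.mean(centered ** 2) ** 2 - 3

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily

k_uniform = excess_kurtosis(np.linspace(0.0, 1.0, 100_001))   # evenly spaced grid: light tails
k_normal = excess_kurtosis(rng.normal(size=200_000))          # reference: close to 0
k_laplace = excess_kurtosis(rng.laplace(size=200_000))        # heavier tails than normal

print(k_uniform, k_normal, k_laplace)
```

The uniform grid comes out negative (platykurtic), the normal sample lands near 0, and the Laplace sample comes out positive (leptokurtic).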



In [None]:
import numpy as np

def computeKurtosis(df, featureCol):
    """
    Returns Fisher's (excess) kurtosis for featureCol by subtracting 3 from k.
    eg. computeKurtosis(zomato_df, 'votes')
    """
    mu = df[featureCol].mean()
    # ddof=0 (population variance) to match the 1/n fourth moment below
    var = df[featureCol].var(ddof=0)
    fourth_moment = np.mean(np.power(df[featureCol].values - mu, 4))   # 4th central moment
    k = fourth_moment / np.square(var)
    k = k - 3                                        # Fisher's definition
    return k

#### References

* [Probability and Statistics for Programmers - PDF](http://greenteapress.com/thinkstats/thinkstats.pdf)

* https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0697-7