# Descriptive statistics (and correlations)

The main descriptive statistics are the mean, minimum, 25th percentile, median, 75th percentile, maximum, and standard deviation.

The main packages that provide these statistics are numpy, pandas, statistics, and scipy.stats.


Scipy stats: See https://docs.scipy.org/doc/scipy/reference/stats.html

## Descriptive statistics

In [None]:
# main packages
import numpy as np
import pandas as pd
import statistics
import scipy.stats

# for nan
import math

In [None]:
# Sample dataset (first 1000 rows only)
df = pd.read_excel(r'..\datasets\Compustat-Funda.xlsx', nrows=1000)

In [None]:
df.head()

### Using pandas

For a numeric column, the `describe()` method returns the count, mean, standard deviation, minimum, quartiles, and maximum in one call:

In [None]:
# get descriptive statistics for variable sales (sale)
df['sale'].describe()
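
The result of `describe()` is itself a pandas Series, so individual statistics can be looked up by label. The sketch below uses a small made-up series (not the Compustat data) and the optional `percentiles` argument to request extra percentiles; missing values are skipped.

```python
import pandas as pd

# Hypothetical toy data standing in for df['sale'], with one missing value
sales = pd.Series([10.0, 20.0, 30.0, 40.0, None], name='sale')

# describe() ignores missing values; extra percentiles are optional
summary = sales.describe(percentiles=[0.10, 0.25, 0.50, 0.75, 0.90])
print(summary)

# The result is a Series, so single statistics can be selected by label
print('count:', summary['count'])
print('median:', summary['50%'])
```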

### Using numpy

- average: `mean`
- median: `median`
- standard deviation: `std`; pass `ddof=1` (degrees of freedom) to get the sample standard deviation, since the default `ddof=0` divides by n
- percentiles: `percentile`, with an argument to specify the percentile (25, 75, etc.)

In [None]:
print('mean:', np.mean(df['sale']))
print('median:', np.median(df['sale']))
print('standard deviation:', np.std(df['sale'], ddof=1))  # ddof=1 divides by n-1 instead of n
print('25th percentile:', np.percentile(df['sale'], 25))
print('75th percentile:', np.percentile(df['sale'], 75))
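
Real data often contains missing values, and on a plain numpy array the functions above return `nan` as soon as one is present. A minimal sketch, using a made-up array rather than the Compustat data, of the NaN-aware counterparts `nanmean`, `nanmedian`, `nanstd`, and `nanpercentile`:

```python
import numpy as np

# Hypothetical toy array with one missing value
values = np.array([1.0, 2.0, 3.0, 4.0, np.nan])

plain_mean = np.mean(values)            # nan, because of the missing value
nan_mean = np.nanmean(values)           # ignores the missing value
nan_median = np.nanmedian(values)
sample_std = np.nanstd(values, ddof=1)  # divides by n-1 (sample)
pop_std = np.nanstd(values)             # default ddof=0 divides by n (population)
q25 = np.nanpercentile(values, 25)
q75 = np.nanpercentile(values, 75)

print('plain mean:', plain_mean)
print('mean:', nan_mean, 'median:', nan_median)
print('sample std:', sample_std, 'population std:', pop_std)
print('25th percentile:', q25, '75th percentile:', q75)
```

The sample standard deviation is always at least as large as the population version, because it divides the same sum of squares by n-1 instead of n.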

## Correlations

There are 2 main types of correlations: 
- Pearson
- Spearman (based on ranking)

Since the correlation between x and y is the same as between y and x, a correlation matrix is symmetric. Both types can therefore be shown in a single matrix, with Pearson in one half and Spearman in the other.
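
One possible way to build such a combined matrix is sketched here on randomly generated data. The column name `ni` and the choice to put Pearson above the diagonal are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'ni' is a made-up third column
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=['sale', 'at', 'ni'])

pearson = toy.corr(method='pearson')
spearman = toy.corr(method='spearman')

# Boolean mask that is True strictly above the diagonal
upper = np.triu(np.ones(pearson.shape, dtype=bool), k=1)

# Keep Pearson where the mask is True, use Spearman elsewhere
# (the diagonal is 1 in both matrices)
combined = pearson.where(upper, spearman)
print(combined.round(3))
```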

### Pearson correlation

How it is calculated: https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/

In [None]:
from IPython.display import Image
Image("images/pearson-correlation.jpg")

From Wikipedia, https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

In [None]:
# calculate 'by hand'
# this code assumes both series have the same length (and no missing values)
s = df['sale']
a = df['at']
n = len(s)
s_mean, a_mean = np.mean(s), np.mean(a)
# sample variances (divide by n - 1)
var_s = sum((item - s_mean)**2 for item in s) / (n - 1)
var_a = sum((item - a_mean)**2 for item in a) / (n - 1)
std_s, std_a = var_s ** 0.5, var_a ** 0.5
# sample covariance; zip pairs the values by position, regardless of index labels
cov_sa = sum((x - s_mean) * (y - a_mean) for x, y in zip(s, a)) / (n - 1)
r = cov_sa / (std_s * std_a)
r

### Pandas corr

Pandas' `corr` returns the correlation table for all pairs of columns. The `method` parameter specifies the type ('pearson' or 'spearman').

However, it does not output p-values.

See: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html


In [None]:
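# numeric_only=True skips non-numeric columns; without it, pandas 2.0 and later
# raise an error if the dataset contains text columns
corrM = df.corr(method='pearson', numeric_only=True)
corrM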
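
Because `corr` returns only the coefficients, a matching table of p-values has to be built separately. The following is a rough sketch on made-up data (the column `ni` and the variable names are hypothetical) that loops over column pairs and calls `scipy.stats.pearsonr` on each, dropping rows with missing values per pair.

```python
import numpy as np
import pandas as pd
import scipy.stats

# Hypothetical toy data: 'at' is built to correlate with 'sale', 'ni' is unrelated noise
rng = np.random.default_rng(1)
x = rng.normal(size=100)
toy = pd.DataFrame({
    'sale': x,
    'at': 2 * x + rng.normal(size=100),
    'ni': rng.normal(size=100),
})

cols = toy.columns
# Start with an all-NaN table; the diagonal is left as NaN
pvalues = pd.DataFrame(np.nan, index=cols, columns=cols)

for i in cols:
    for j in cols:
        if i == j:
            continue
        pair = toy[[i, j]].dropna()  # use only rows where both values are present
        coef, p_val = scipy.stats.pearsonr(pair[i], pair[j])
        pvalues.loc[i, j] = p_val

print(toy.corr(method='pearson').round(3))
print(pvalues.round(4))
```

Swapping `pearsonr` for `spearmanr` would give the p-values for the rank correlation in the same layout.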

### Using scipy.stats

In [None]:
# using scipy
r, p = scipy.stats.pearsonr(s, a)
print('correlation:', r, 'p-value:', p)

### Spearman rank correlation

For an explanation on how it is calculated, see: https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/spearman-rank-correlation-definition-calculate/

In [None]:
# Spearman
r, p = scipy.stats.spearmanr(s, a)
print('correlation:', r, 'p-value:', p)
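
The Spearman coefficient is the Pearson correlation computed on the ranks of the values instead of on the values themselves. A small check of that relationship on made-up data (the variable names here are illustrative):

```python
import numpy as np
import scipy.stats

# Hypothetical toy data with a monotonic but non-linear relationship
rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = np.exp(x) + rng.normal(scale=0.5, size=200)

# Spearman directly
rho, p_rho = scipy.stats.spearmanr(x, y)

# Pearson on the ranks (rankdata assigns average ranks to ties)
rank_x = scipy.stats.rankdata(x)
rank_y = scipy.stats.rankdata(y)
r_ranks, p_ranks = scipy.stats.pearsonr(rank_x, rank_y)

print('spearman:', rho)
print('pearson on ranks:', r_ranks)
```

Because only the ordering matters, Spearman is less sensitive to outliers and to non-linear (but monotonic) relationships than Pearson.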