kmedian/korr

korr

A collection of utility functions for correlation analysis.

Usage

Check the examples folder for notebooks.

Compute correlation matrix and its p-values

  • pearson -- Pearson/Sample correlation (interval- and ratio-scale data)
  • kendall -- Kendall's tau rank correlation (ordinal data)
  • spearman -- Spearman's rho rank correlation (ordinal data)
  • mcc -- Matthews correlation coefficient between binary variables
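To make the "correlation matrix plus p-value matrix" idea concrete, here is a minimal standard-library sketch. It does not call korr. The names pearson_r and corr_and_pval are hypothetical, and the p-values come from a permutation test, which is an assumption of this sketch and not necessarily how korr computes them.

```python
import math
import random


def pearson_r(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def corr_and_pval(columns, n_perm=500, seed=0):
    """Return (r, pval) as nested lists for a list of column vectors.

    Assumption: two-sided permutation p-values, estimated by shuffling
    one column and counting how often |r| is at least as large as observed.
    """
    rng = random.Random(seed)
    k = len(columns)
    r = [[1.0] * k for _ in range(k)]
    pval = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            observed = pearson_r(columns[i], columns[j])
            shuffled = list(columns[j])
            hits = 0
            for _ in range(n_perm):
                rng.shuffle(shuffled)
                if abs(pearson_r(columns[i], shuffled)) >= abs(observed):
                    hits += 1
            # Add-one correction keeps the estimate strictly above zero.
            p = (hits + 1) / (n_perm + 1)
            r[i][j] = r[j][i] = observed
            pval[i][j] = pval[j][i] = p
    return r, pval


# Made-up toy data: column 1 is roughly twice column 0, column 2 is shuffled.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1],
    [5.0, 1.0, 4.0, 8.0, 2.0, 7.0, 3.0, 6.0],
]
r, pval = corr_and_pval(columns)
```

Both results are symmetric k-by-k matrices, so entry (i, j) of r pairs with entry (i, j) of pval.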

EDA, Dig deeper into results

  • flatten -- A table (pandas) with one row per correlation pair, containing the variable indices, the correlation, and the p-value. For example, find "good" cutoffs with corr_vs_pval first, then look up the variable indices with flatten.
  • slice_yx -- slice a correlation and p-value matrix of a (y,X) dataset into a (y,x_i) vector and (x_j, x_k) matrices
  • corr_vs_pval -- Histogram to find p-value cutoffs (alpha) for a) highly correlated pairs, b) unrelated pairs, c) the mixed results.
  • bracket_pval -- Histogram with more fine-grained p-value brackets.
  • corrgram -- Correlogram, heatmap of correlations with p-values in brackets
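The flattening step can be pictured as walking the upper triangle of the two matrices. The sketch below is a plain-Python illustration, not korr's flatten: the function name flatten_pairs, the record keys, the toy matrices, and the cutoff alpha are all assumptions made for this example.

```python
def flatten_pairs(r, pval):
    """List one record per unordered variable pair (i < j)."""
    rows = []
    k = len(r)
    for i in range(k):
        for j in range(i + 1, k):
            rows.append({"i": i, "j": j, "cor": r[i][j], "pval": pval[i][j]})
    return rows


# Made-up symmetric correlation and p-value matrices for three variables.
r = [[1.0, 0.8, -0.1],
     [0.8, 1.0, 0.3],
     [-0.1, 0.3, 1.0]]
pval = [[0.0, 0.001, 0.70],
        [0.001, 0.0, 0.04],
        [0.70, 0.04, 0.0]]

table = flatten_pairs(r, pval)

# Hypothetical cutoff, e.g. one read off a p-value histogram.
alpha = 0.05
significant = [(row["i"], row["j"]) for row in table if row["pval"] < alpha]
```

With k variables the table has k(k-1)/2 rows, which makes filtering by a p-value cutoff a simple row selection.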

Utility functions

  • confusion -- Confusion matrix. Required for the Matthews correlation (mcc), and a bit faster than sklearn's.
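To show why the confusion matrix is the building block for the Matthews correlation, here is a small standard-library sketch. The function names confusion_counts and matthews and the (tp, fp, fn, tn) return order are assumptions for illustration; korr's confusion and mcc may use a different signature and ordering.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for two binary sequences."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t and p:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def matthews(tp, fp, fn, tn):
    """Matthews correlation coefficient from the four confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Made-up binary variables.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
score = matthews(*counts)
```

Here counts is (3, 1, 1, 3), so the coefficient is (3*3 - 1*1) / sqrt(4*4*4*4) = 0.5.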

Parameter Stability

  • bootcorr -- Estimate multiple correlation matrices from bootstrapped samples. From these you can assess how stable the correlation estimates are, i.e. how sensitive they are to in-sample variation. For example, stable estimates are good candidates for modeling, whereas unstable correlation pairs are prone to p-hacking and non-reproducible results.
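The bootstrap idea can be sketched for a single pair of variables: resample rows with replacement, re-estimate the correlation each time, and look at the spread. This is a minimal standard-library illustration, not korr's bootcorr; the function names, the number of draws, and the toy data are assumptions.

```python
import math
import random
import statistics


def pearson_r(a, b):
    """Sample Pearson correlation, or None if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return None
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cov / math.sqrt(var_a * var_b)


def bootstrap_corr(a, b, n_draws=200, seed=42):
    """Re-estimate the correlation on rows resampled with replacement."""
    rng = random.Random(seed)
    n = len(a)
    estimates = []
    for _ in range(n_draws):
        idx = [rng.randrange(n) for _ in range(n)]
        est = pearson_r([a[i] for i in idx], [b[i] for i in idx])
        if est is not None:  # skip degenerate resamples
            estimates.append(est)
    return estimates


# Made-up data: b tracks a closely, c has no obvious relation to a.
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
b = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 19.8, 22.1, 24.0]
c = [7, 2, 11, 5, 9, 1, 12, 4, 8, 3, 10, 6]

stable = bootstrap_corr(a, b)
unstable = bootstrap_corr(a, c)

spread_stable = statistics.stdev(stable)
spread_unstable = statistics.stdev(unstable)
```

A small spread across resamples suggests the estimate is robust to in-sample variation, while a large spread flags a pair whose correlation depends heavily on which rows happened to be sampled.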

Variable Selection, Search Functions

  • mincorr -- From all estimated correlation pairs, pick a given number (n = 3, 5, ...) of variables with low and insignificant correlations among each other. (See the binsel package for an application.)
  • find_best -- Find the N "best", i.e. high and most significant, correlations
  • find_worst -- Find the N "worst", i.e. insignificant/random and low, correlations
  • find_unrelated -- Return the variable indices of unrelated pairs (in terms of insignificant p-values)
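The selection and search ideas can be illustrated with a brute-force sketch over a small matrix. This is not korr's implementation: pick_least_correlated and rank_pairs are hypothetical names, the subset search here only minimizes the sum of absolute correlations (it ignores p-values), and an exhaustive search is only feasible for a handful of variables.

```python
from itertools import combinations


def pick_least_correlated(r, n):
    """Exhaustively find the n variables with the smallest sum of |r|."""
    def cost(subset):
        return sum(abs(r[i][j]) for i, j in combinations(subset, 2))
    return min(combinations(range(len(r)), n), key=cost)


def rank_pairs(r, pval, n, best=True):
    """Return n pairs ranked by high |r| and low p-value, or the reverse."""
    pairs = list(combinations(range(len(r)), 2))
    ranked = sorted(pairs, key=lambda ij: (-abs(r[ij[0]][ij[1]]),
                                           pval[ij[0]][ij[1]]))
    return ranked[:n] if best else ranked[::-1][:n]


# Made-up symmetric matrices for four variables.
r = [[1.0, 0.9, 0.1, 0.2],
     [0.9, 1.0, 0.3, 0.05],
     [0.1, 0.3, 1.0, 0.6],
     [0.2, 0.05, 0.6, 1.0]]
pval = [[0.0, 0.001, 0.6, 0.4],
        [0.001, 0.0, 0.2, 0.9],
        [0.6, 0.2, 0.0, 0.01],
        [0.4, 0.9, 0.01, 0.0]]

chosen = pick_least_correlated(r, 3)
best_two = rank_pairs(r, pval, 2, best=True)
worst_two = rank_pairs(r, pval, 2, best=False)
```

In this toy example the subset (0, 2, 3) has the lowest total absolute correlation, the strongest pairs are (0, 1) and (2, 3), and the weakest are (1, 3) and (0, 2).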

Appendix

Installation

The korr git repo is available as a PyPI package:

pip install korr

Install a virtual environment

python3.7 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt --no-cache-dir
pip install -r requirements-dev.txt --no-cache-dir
pip install -r requirements-demo.txt --no-cache-dir

(If your git repo is stored in a folder whose path contains whitespace, don't use the .venv subfolder. Use an absolute path without whitespace instead.)

Commands

  • Check syntax: flake8 --ignore=F401
  • Run Unit Tests: pytest
  • Remove .pyc files: find . -type f -name "*.pyc" | xargs rm
  • Remove __pycache__ folders: find . -type d -name "__pycache__" | xargs rm -rf

Publish

pandoc README.md --from markdown --to rst -s -o README.rst
python setup.py sdist 
twine upload -r pypi dist/*

Support

Please open an issue for support.

Contributing

Please contribute using GitHub Flow. Create a branch, add commits, and open a pull request.