# Entropy, conditional entropy, and mutual information

Overview:
* In this tutorial, we will learn how to describe the information that is shared between two variables (mutual information). In other words: how much can we reduce our uncertainty about variable 1 by measuring variable 2?
* These concepts were initially developed in communication theory to describe the efficacy of transmitting signals over a noisy medium (like a noisy telephone line). For example, suppose that we want to know how good a communication channel is, i.e. how efficiently and reliably it relays a message from point A (a 'sender') to point B (a 'receiver').
* Basically this is just like asking, "we know how good the signal is at A, and we received the message at B - how much information about A is still in the received signal B?".
* So that is the general gist of it, but right away you can see the potential applicability of this metric in many fields: neuroscience, psychology, engineering, etc. In neuroscience, we're basically dealing with a series of communication channels that are corrupted by noise (i.e. synapses). It is therefore reasonable to ask: how much information from neuron A effectively propagates to neuron B? (or conversely, how much information is lost?).
* However, this logic works for any combination of variables: two continuous variables, two discrete variables, one continuous and one discrete, etc. As a result, we can ask questions about any two variables really: how much information about median home price is reflected in stock market fluctuations? etc.
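
For reference, the "uncertainty reduction" described above is usually written as follows. These are the standard textbook definitions for discrete variables; the symbols $X$, $Y$, and $p$ are notation assumed here for illustration and may differ from what later cells use.

$$
H(X) = -\sum_{x} p(x)\,\log_2 p(x), \qquad
H(X \mid Y) = -\sum_{x,y} p(x,y)\,\log_2 p(x \mid y)
$$

$$
I(X;Y) = H(X) - H(X \mid Y) = \sum_{x,y} p(x,y)\,\log_2 \frac{p(x,y)}{p(x)\,p(y)}
$$

Here $H(X)$ is the uncertainty about $X$ before measuring $Y$, $H(X \mid Y)$ is the uncertainty that remains afterwards, and their difference $I(X;Y)$ is the mutual information. With base-2 logarithms all three are measured in bits.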

* A few notes before we get started. First, we're going to be talking a lot about 'uncertainty' and 'uncertainty reduction'. While this is basically complementary to talking about certainty and an increase in certainty, we'll stick with the former terminology because it is embedded in some of the concepts that we'll discuss.
* This framing can take a while to get used to.
* Second, we'll be dealing with variability in data, and how we can attribute that variability either to 'noise' or to 'signals'. I.e. is the variability in one variable random with respect to another variable? Or does the variability in one variable change systematically with the variability in another?
* Finally, a lot of people think at this point, "why not just correlate the variables using a normal r-value?". There are a few answers to this, but the simplest is this: correlation assumes a linear relationship (or, in more complex forms, a *known* relationship) between variables. Mutual information does not, and can generally capture any form of co-dependence between two variables.
* In addition, MI has a very intuitive interpretation in terms of the amount of information that is shared between two variables, and we'll get to that in a few minutes.
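
To make the correlation-versus-MI point concrete, here is a minimal sketch. Everything in it is an assumption made for illustration rather than part of the tutorial's own code: the helper `binned_mi`, the variables `x_demo` and `y_demo`, the noisy quadratic relationship, and the simple histogram-based (plug-in) MI estimate, which is not the jackknife estimator discussed below.

```python
import numpy as np

def binned_mi(x, y, bins=10):
    """Plug-in MI estimate (in bits) from a 2D histogram of x and y.

    Hypothetical helper for illustration only; it is biased for small samples.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probability table
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)
    nz = pxy > 0                              # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
x_demo = rng.uniform(-1, 1, 5000)
# y depends strongly on x, but the dependence is symmetric, not linear
y_demo = x_demo**2 + rng.normal(0, 0.05, 5000)

r_demo = np.corrcoef(x_demo, y_demo)[0, 1]    # Pearson correlation
mi_demo = binned_mi(x_demo, y_demo)           # binned mutual information

print(f"Pearson r = {r_demo:.3f}")
print(f"binned MI = {mi_demo:.3f} bits")
```

Because the relationship is symmetric around zero, the linear correlation should come out close to zero even though `y_demo` is almost fully determined by `x_demo`, while the MI estimate should be clearly positive.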

[jackknife correction](https://www.pnas.org/content/115/40/9956)

<div class="alert alert-warning">
From the above linked paper from Zeng, Xia, and Tong (their Abstract): Quantifying the dependence between two random variables is a fundamental issue in data analysis, and thus many measures have been proposed. Recent studies have focused on the renowned mutual information (MI) [Reshef DN, et al. (2011) Science 334:1518–1524]. However, “Unfortunately, reliably estimating mutual information from finite continuous data remains a significant and unresolved problem” [Kinney JB, Atwal GS (2014) Proc Natl Acad Sci USA 111:3354–3359]. In this paper, we examine the kernel estimation of MI and show that the bandwidths involved should be equalized. We consider a jackknife version of the kernel estimate with equalized bandwidth and allow the bandwidth to vary over an interval. We estimate the MI by the largest value among these kernel estimates and establish the associated theoretical underpinnings.
</div>

* This is a very important concept to deal with - estimates of MI, especially for continuous variables, are highly unstable and require correction procedures to counter the bias that is inherent in estimating MI from small data sets.
* So while at the start of the tutorial we'll use discrete arrays of numbers that have few unique entries to demonstrate the basic concepts (like binary weighted coin flips, for example), things will get a bit crazier when we move on to continuous arrays of values.
* Note also that the proposed jackknife correction from this PNAS paper is just one approach...I'm implementing it because in the few cases I've tried, it seems to be pretty numerically stable. However, there are other slightly more straightforward approaches that build on the ideas that we discussed in the "randomization" tutorial a few weeks back (and indeed, the jackknife approach is logically related as well).
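
The small-sample bias mentioned above can be seen with a rough sketch like the one below. It assumes a naive histogram (plug-in) estimator with 10 bins per variable, which is not the jackknife kernel estimator from the paper; the helper names `plugin_mi_bits` and `mean_mi_independent` are hypothetical and used only for this illustration.

```python
import numpy as np

def plugin_mi_bits(a, b, bins=10):
    """Naive (plug-in) MI estimate in bits from a 2D histogram of a and b."""
    counts, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = counts / counts.sum()              # joint probability table
    p_a = p_ab.sum(axis=1, keepdims=True)     # marginal of a, shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)     # marginal of b, shape (1, bins)
    mask = p_ab > 0                           # skip empty cells
    return float(np.sum(p_ab[mask] * np.log2(p_ab[mask] / (p_a @ p_b)[mask])))

rng = np.random.default_rng(1)

def mean_mi_independent(n_samples, n_repeats=200):
    """Average plug-in MI over repeated draws of two independent uniform variables."""
    estimates = [
        plugin_mi_bits(rng.random(n_samples), rng.random(n_samples))
        for _ in range(n_repeats)
    ]
    return float(np.mean(estimates))

# The true MI is 0 bits in both cases, because the variables are independent.
mi_small = mean_mi_independent(50)
mi_large = mean_mi_independent(5000)

print(f"n=50:   mean MI estimate = {mi_small:.3f} bits")
print(f"n=5000: mean MI estimate = {mi_large:.3f} bits")
```

Both estimates should be positive even though the true value is zero, and the small-sample estimate should be much larger. That upward bias is what correction procedures (jackknife, randomization-based baselines, etc.) try to remove.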

## Imports

In [25]:
import numpy as np

from scipy.special import gamma,psi
from scipy import ndimage
from scipy.linalg import det
from numpy import pi
import matplotlib.pyplot as plt


In [None]:
x = np.random.rand(1000)
y = np.random.rand(1000)
