# Understanding 3-way mutual information

This notebook explores the relationship between the relative counts of the value combinations of three binary variables and the 3-way mutual information calculated from them.

Firstly, I load all required packages. 

In [2]:
# Packages used in this notebook.
list_of_packages <- c("tidyverse", "entropy", "ggplot2")
# Install any that are missing, then attach them all.
new_packages <- list_of_packages[!(list_of_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
for (i in seq_along(list_of_packages))
{
  library(list_of_packages[i], character.only = TRUE)
}

Installing package into ‘/home/jupyter/.R/library’
(as ‘lib’ is unspecified)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Next, I set up the scenarios that will be simulated.
</br>
A sample of three binary variables defines a cube of mutual counts, i.e. $\{n_{0,0,0}, n_{0,0,1}, \ldots, n_{1,1,1}\}$. For simplicity, I will only look at _high_ and _low_ counts, relative to some overall count, $N$. The proportion of $N$ assigned to a _high_ or a _low_ cell will depend on how many of the eight cells in the counts cube are _high_ and how many are _low_.
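
As an illustration of the counts cube, the sketch below stores eight counts in a 2 x 2 x 2 array indexed by the values of the three variables. The counts (seven _low_ cells of 100 and one _high_ cell) and the names `cube`, `n_000` and `n_111` are assumptions chosen for the example, not values taken from the simulation.

```r
# Hypothetical example: seven "low" cells and one "high" cell, n_{1,1,1}.
N <- 10000
low <- 100
counts <- c(rep(low, 7), N - 7 * low)

# Arrange the eight counts as a 2 x 2 x 2 array indexed by (x, y, z).
# array() fills with the first index (x) varying fastest.
cube <- array(
    counts,
    dim = c(2, 2, 2),
    dimnames = list(x = c("0", "1"), y = c("0", "1"), z = c("0", "1"))
)

# Look up single cells by the values of the three variables.
n_000 <- cube["0", "0", "0"]
n_111 <- cube["1", "1", "1"]

# The cells partition the sample, so they sum to N.
total <- sum(cube)
```

Summing the array over one or two of its dimensions gives the pairwise and single-variable counts, which is what an entropy-based calculation needs later on.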
</br>
</br>
There are five classes of scenarios:
1. All count cells are equal.
2. Only one count cell is different to the rest.
3. Two count cells are different to the rest.
4. Three count cells are different to the rest.
5. Four count cells are different to the rest.
</br>
</br>

Each class of scenario has multiple unique instances:

|Class #|Class|Count of instances|Explanation|
|---|---|---|---|
|1|All count cells are equal.|2|All _high_ or all _low_|
|2|Only one count cell is different to the rest.|16|One instance for each count cell, with that cell either _high_ or _low_|
|3|Two count cells are different to the rest.|56|One instance for each count cell pair, with the pair either _high_ or _low_|
|4|Three count cells are different to the rest.|112|One instance for each count cell triplet, with the triplet either _high_ or _low_|
|5|Four count cells are different to the rest.|70|One instance for each count cell quadruplet; swapping _high_ and _low_ gives another quadruplet, so the count is not doubled|
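
The instance counts can be cross-checked with binomial coefficients. The sketch below is a quick sanity check rather than part of the simulation, and the variable names are hypothetical. In it the four-cell class comes to 70, because the complement of a four-cell subset is itself a four-cell subset, and the five classes together cover all $2^8 = 256$ assignments of _high_ and _low_.

```r
n_cells <- 8

# Number of ways to pick k "different" cells out of eight, for k = 0..4.
k <- 0:4
subsets <- choose(n_cells, k)

# For k < 4 the "different" cells can be the high ones or the low ones, so each
# subset gives two instances. For k = 4, swapping high and low just yields
# another four-cell subset, so no doubling applies.
instances <- ifelse(k < 4, 2 * subsets, subsets)
names(instances) <- paste0("class_", k + 1)

instances
sum(instances)
```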

Mutual information is symmetric so, for example, the scenario where five count cells are different will yield the same mutual information as when three count cells are different. Mutual information is also unchanged if the binary values 1 and 0 are swapped. To minimise the number of simulations, only one scenario from each set of symmetrically-equivalent scenarios will be run. Also, to minimise the size of the simulated datasets, the "different" count cells in classes 2-5 will be _high_. To simulate class 1, all cell counts will be even portions of $N$.
</br>
</br>
Below is the full table of 163 scenarios that will be simulated.
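
The figure of 163 can be confirmed by brute-force enumeration. This base-R sketch (the names `all_scenarios`, `n_high` and `kept` are hypothetical) lists every assignment of high (1) or low (0) to the eight cells and keeps those with at most four _high_ cells, which is the same condition as the `sumCols < 5` filter in the next cell.

```r
# Enumerate all 2^8 assignments of high (1) / low (0) to the eight cells.
all_scenarios <- expand.grid(rep(list(c(1, 0)), 8))

# Number of "high" cells in each assignment.
n_high <- rowSums(all_scenarios)

# Keep assignments with at most four "high" cells and tally them by that number.
kept <- table(n_high[n_high < 5])
kept
sum(kept)
```

The tally has one entry per number of _high_ cells, from zero to four.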

In [43]:
# Define the count of observations to simulate.
N <- 100000

# Each of the eight count cells can be high (1) or low (0).
cell_levels <- c(1, 0)
cell_names <- c("000", "001", "010", "011", "100", "101", "110", "111")

# Define the scenario grid: one row per scenario with at most four high cells.
scenarioGrid <-
    expand.grid(rep(list(cell_levels), 8)) %>%
    `colnames<-`(cell_names) %>%
    dplyr::mutate(sumCols = rowSums(across(everything()))) %>%
    dplyr::filter(sumCols < 5) %>%
    dplyr::arrange(sumCols) %>%
    dplyr::select(-sumCols) %>%
    dplyr::mutate(across(everything(), ~ ifelse(.x == 1, "high", "low")))
scenarioGrid

|000|001|010|011|100|101|110|111|
|---|---|---|---|---|---|---|---|
|`<chr>`|`<chr>`|`<chr>`|`<chr>`|`<chr>`|`<chr>`|`<chr>`|`<chr>`|
|low|low|low|low|low|low|low|low|
|low|low|low|low|low|low|low|high|
|low|low|low|low|low|low|high|low|
|low|low|low|low|low|high|low|low|
|low|low|low|low|high|low|low|low|
|low|low|low|high|low|low|low|low|
|low|low|high|low|low|low|low|low|
|low|high|low|low|low|low|low|low|
|high|low|low|low|low|low|low|low|
|low|low|low|low|low|low|high|high|


Next, I simulate a dataset with the proportions of 1s and 0s stipulated by a given row in `scenarioGrid`.
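
One way to turn a scenario row into cell counts is sketched here, under the assumption that the eight counts should always sum to $N$. The helper `scenario_counts()` and its `low_share` argument are hypothetical; they are not defined elsewhere in this notebook. What is left of $N$ after the _low_ cells is split evenly among the _high_ cells, and $N$ is split evenly when no cell is _high_.

```r
# Hypothetical helper: turn one scenario (a vector of "high"/"low" labels)
# into eight cell counts that sum to N.
scenario_counts <- function(labels, N = 10000, low_share = 0.01) {
    val_low <- N * low_share
    n_high <- sum(labels == "high")
    if (n_high == 0) {
        # All cells equal: split N evenly.
        return(rep(N / length(labels), length(labels)))
    }
    n_low <- length(labels) - n_high
    # Split what is left after the "low" cells evenly among the "high" cells.
    val_high <- (N - n_low * val_low) / n_high
    ifelse(labels == "high", val_high, val_low)
}

# Example scenario: cells 010, 100, 110 and 111 are "high".
labels <- c("low", "low", "high", "low", "high", "low", "high", "high")
counts <- scenario_counts(labels)

# Variable combinations in the order 000, 001, ..., 111 (z varies fastest).
combos <- expand.grid(z = c(0, 1), y = c(0, 1), x = c(0, 1))[, c("x", "y", "z")]

# Repeat each combination by its count to build the simulated dataset.
# This assumes the counts are whole numbers, as they are for these defaults.
sim <- combos[rep(seq_len(nrow(combos)), times = counts), ]
```

Building the dataset with a single `rep()` call avoids growing a data frame row block by row block inside a loop.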

In [93]:
# Define comboTable, which lists the unique combinations of variable values.
comboTable <-
    expand.grid(x = c(0,1), y = c(0,1), z = c(0,1)) %>%
    dplyr::arrange(x, y, z) %>%
    `rownames<-`(colnames(scenarioGrid))

# Set overall number of observations, N.
N <- 10000

# Set value for "low" cells.
val_low_cells <- N*0.01



#### Body of the loop over scenarios (run here for a single scenario) ####
i <- 100
# Select scenario from scenarioGrid.
i_scenarioGrid <- scenarioGrid[i, ]

# How many "low" and "high" count cells?
n_low_cells <- sum(i_scenarioGrid == "low")
n_high_cells <- 8 - n_low_cells

# Value of "high" cells, given quantity of "low" cells.
# Note: every "high" cell receives this full value, so the total row count
# exceeds N whenever a scenario has more than one "high" cell.
val_high_cells <- N - (n_low_cells * val_low_cells)

# Count for each variable combination, in the row order of comboTable.
cell_counts <- ifelse(unlist(i_scenarioGrid) == "high", val_high_cells, val_low_cells)
for (i_row in seq_len(nrow(comboTable)))
    {
    print(paste0("Appending ", rownames(comboTable)[i_row], " ", cell_counts[i_row], " times."))
    }
rowcount <- sum(cell_counts)

# Make the simulated dataset by repeating each row of comboTable by its count.
i_data <- comboTable[rep(seq_len(nrow(comboTable)), times = cell_counts), ]

# Print feedback.
print(paste0("There should be ", rowcount, " rows."))
print(paste0("The actual number of rows is ", nrow(i_data), "."))

# Calculate the 3-way mutual information.
# Note: entropy.empirical() expects a vector of counts, not a data frame of
# observations, so this call does not yet return the intended quantity.
entropy.empirical(i_data)


[1] "Appending 000 100 times."
[1] "Appending 001 100 times."
[1] "Appending 010 9600 times."
[1] "Appending 011 100 times."
[1] "Appending 100 9600 times."
[1] "Appending 101 100 times."
[1] "Appending 110 9600 times."
[1] "Appending 111 9600 times."
[1] "There should be 38800 rows."
[1] "The actual number of rows is 38800."


ERROR: Error in sum(ifelse(freqs > 0, freqs * log(freqs), 0)): invalid 'type' (list) of argument
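
The error arises because `entropy.empirical()` expects a vector of counts, whereas `i_data` is a data frame, which R treats as a list. One possible way forward is sketched below in base R. It assumes that "3-way mutual information" means the co-information $I(X;Y;Z) = H(X) + H(Y) + H(Z) - H(X,Y) - H(X,Z) - H(Y,Z) + H(X,Y,Z)$; sign conventions differ between authors, so this choice is an assumption and not necessarily the definition this notebook will settle on. The functions `plugin_entropy()` and `coinformation()` are hypothetical names.

```r
# Plug-in (maximum-likelihood) entropy in nats from a vector or array of counts.
plugin_entropy <- function(counts) {
    p <- counts / sum(counts)
    p <- p[p > 0]
    -sum(p * log(p))
}

# Co-information of three discrete variables from their 3-D table of counts.
# Assumed sign convention: H(X)+H(Y)+H(Z) - H(X,Y)-H(X,Z)-H(Y,Z) + H(X,Y,Z).
coinformation <- function(cube) {
    singles <- sapply(1:3, function(m) plugin_entropy(apply(cube, m, sum)))
    pairs <- list(c(1, 2), c(1, 3), c(2, 3))
    doubles <- sapply(pairs, function(m) plugin_entropy(apply(cube, m, sum)))
    sum(singles) - sum(doubles) + plugin_entropy(cube)
}

# Stand-in dataset: z is the XOR of x and y, each (x, y) pair seen 25 times.
obs <- data.frame(
    x = rep(c(0, 0, 1, 1), each = 25),
    y = rep(c(0, 1, 0, 1), each = 25)
)
obs$z <- as.integer(xor(obs$x, obs$y))

# table() turns the observations into the 2 x 2 x 2 cube of counts.
mi_xor <- coinformation(table(obs))

# Three independent, uniform variables for comparison.
mi_indep <- coinformation(array(1, dim = c(2, 2, 2)))
```

Under this convention the XOR example gives $-\log 2$ and the independent example gives 0. For the simulated data, the analogous call would be `coinformation(table(i_data))`.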


In [None]:
Check these packages for ready-made mutual information functions:
http://www.bioconductor.org/packages/devel/bioc/vignettes/Informeasure/inst/doc/Informeasure.html#mi.measure-mutual-information

https://elife-asu.github.io/rinform/

https://rdrr.io/cran/NlinTS/man/mi_disc.html

And read Srinivasa, "A Review on Multivariate Mutual Information".