# Learning the topology of a Bayesian Network from a database of cases using the K2 algorithm

In [None]:
library(tidyverse)

# Installed just by typing install.packages('bnlearn')
library(bnlearn)

# This was a bit more complicated to install:
# install.packages("BiocManager")
# BiocManager::install("Rgraphviz")
library(Rgraphviz)

_Bayesian Networks_ are graphical models in which the nodes are random variables and the directed edges represent the conditional dependencies between them. They are applied in many fields, ranging from image processing to medical diagnosis. A critical task in their application is finding the correct network structure, i.e. the correct edges (the causal relations among the variables) given the nodes.

Finding the best network structure can be expressed mathematically as:

$$goal:\quad \max_x\left[P(B_{x}|D)\right]\tag{1}$$

That is, we look for the Bayesian Network structure $B_x$ that is most probable given a set of data samples $D$.

Through the definition of conditional probability, we can rearrange the formula (1):

$$P(B_{x}|D)=\frac{P(B_{x},D)}{P(D)}$$

Since $P(D)$ is the same for every model, the task reduces to finding the model that maximizes the quantity $P(B_x,D)$.

 
 
The following formula is proved in [1]:

$$P(B_x,D)=P(B_x)\prod_{i=1}^n\prod_{j=1}^{q_i}\frac{(r_i-1)!}{(N_{ij}+r_i - 1)!}\prod_{k=1}^{r_i}N_{ijk}! \tag{2}$$

* $P(B_x)$ is the prior probability of the model; if every model is assumed to be equally likely a priori, it is just a constant
* $n$ is the number of nodes (variables) in the network
* $r_i$ is the number of possible values of the i-th variable $x_i$, denoted $v_{i1},\dots,v_{ir_i}$
* $\pi_i$ is the set of parents of node i
* $q_i$ is the number of unique instantiations of $\pi_i$ relative to D
* $w_{ij}$ denotes the j-th unique instantiation of $\pi_i$ relative to D
* $N_{ijk}$ is the number of cases in D in which the variable $x_i$ has the value $v_{ik}$ and $\pi_i$ is instantiated as $w_{ij}$
* $N_{ij}=\sum_{k=1}^{r_i}N_{ijk}$
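
Because (2) is a product of factorials, it can help to look at its logarithm, where every product becomes a sum. This is only a rewriting of (2), not an additional result taken from [1]:

$$\log P(B_x,D)=\log P(B_x)+\sum_{i=1}^n\sum_{j=1}^{q_i}\left[\log (r_i-1)! - \log (N_{ij}+r_i-1)! + \sum_{k=1}^{r_i}\log N_{ijk}!\right]$$

Each node $i$ contributes its own term to the outer sum, so the contribution of a node can be evaluated knowing only that node and its parents.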

### Explanation of equation (2)
Consider the following Bayesian Network:

<img src="https://i.ibb.co/n3hgVxm/RCOv-Zs-QPAZmc-Sym-L.png" width="300"/>

An example of this Network can be:
* **X1**: it has rained
* **X2**: the automatic dripper was recently on
* **X3**: I slip and fall
* **X4**: I arrive late to work

The event _slip and fall_ depends on whether the floor is wet, which in turn depends on the rain and on the dripper. _Falling_ may in turn cause _arriving late to work_.\
Let D be the data about these events; consider this set:

| X1 | X2 | X3 | X4 |
|----|----|----|----|
| 0  | 0  | 0  | 0  |
| 0  | 0  | 0  | 1  |
| 0  | 1  | 0  | 0  |
| 1  | 1  | 1  | 1  |
| 0  | 1  | 1  | 1  |
| 1  | 0  | 1  | 1  |


The probability of this Bayesian Network being correct given the data is proportional to $P(B_x,D)$, which is given by formula (2).

For simplicity, assume every node can only take the values 0 or 1:
$$X1,X2,X3,X4\, \in\,\{0,1\} \qquad\left(=\,\{FALSE,TRUE\}\right)$$

The outer product in (2) runs over the nodes; let us write out its terms one by one:
* $i=1$
    * $r_1=2;\hspace{2cm}$  X1 can be either 0 or 1
    * $\pi_1 = \emptyset;\hspace{1.8cm}$ since X1 has no parents
    * $q_1 = 1;\hspace{1.95cm}$ since $\pi_1=\emptyset$, there is only one (empty) parent instantiation
    * $N_{11k};\hspace{2.4cm}$ since the node has no parents, this is simply the number of cases in which X1 takes its k-th value
    * $N_{11};\hspace{2.5cm}$ the sum of the $N_{11k}$ over k, here the total number of cases
    
    $$\frac{(2-1)!}{(N_{11}+2 - 1)!}\prod_{k=1}^{2}N_{11k}! = \frac{\prod_{k=1}^{2}N_{11k}!}{(N_{11} +1)!}=$$
    (For k=1 we count the zeros, for k=2 we count the ones: X1 is 0 four times and 1 twice, so $N_{11}=6$)
    $$=\frac{4!2!}{(4+2 +1)!}=0.0095$$

* $i=2$\
    Exactly the same, only the counts $N_{2j}$ and $N_{2jk}$ change (X2 is 0 three times and 1 three times):
    $$\frac{(2-1)!}{(N_{21}+2 - 1)!}\prod_{k=1}^{2}N_{21k}! = \frac{\prod_{k=1}^{2}N_{21k}!}{(N_{21} +1)!}=\frac{3!3!}{(3+3+1)!}=0.0071$$

* $i=3$\
    X3 has parents X1 and X2, this means:
    * $r_3=2;\hspace{2cm}$  X3 can be either 0 or 1
    * $\pi_3 = \{X1,X2\};$
    * $q_3 = 4;\hspace{1.95cm}$ since both X1 and X2 can take two values, there are 4 possible joint instantiations of the parents
    * $w_{3j}\in\{ \{0,0\},\{0,1\},\{1,0\},\{1,1\}\}$
    * $N_{3jk};\hspace{2.3cm}$ the number of cases in D in which X3 takes its k-th value and $\{X1,X2\}$ is instantiated as $w_{3j}$
    
    $$\prod_{j=1}^{4}\frac{(2-1)!}{(N_{3j}+ 2- 1)!}\prod_{k=1}^{2}N_{3jk}! = \prod_{j=1}^{4}\frac{\prod_{k=1}^{2}N_{3jk}!}{(N_{3j}+ 1)!}$$
     * j = 1\
     We consider the cases in which X1 and X2 take the value $w_{31}=\{0,0\}$.
     When $\{X1,X2\}=\{0,0\}$, X3 is 0 twice (k=1) and is never 1 (k=2)
     
     * j = 2\
     We consider the cases in which X1 and X2 take the value $w_{32}=\{0,1\}$.
     When $\{X1,X2\}=\{0,1\}$, X3 is 0 once (k=1) and 1 once (k=2)
     
     * and so on...
     
     $$\prod_{j=1}^{4}\frac{\prod_{k=1}^{2}N_{3jk}!}{(N_{3j}+ 1)!}=\frac{2!0!}{3!}\frac{1!1!}{3!}\frac{0!1!}{2!}\frac{0!1!}{2!}=\frac{1}{72}\approx 0.014$$

* $i=4$\
    For X4 the computation is simpler than for X3 because there is only one parent node (X3), so $q_4=2$:
    $$\prod_{j=1}^{2}\frac{(2-1)!}{(N_{4j}+ 2- 1)!}\prod_{k=1}^{2}N_{4jk}!=\prod_{j=1}^{2}\frac{ \prod_{k=1}^{2}N_{4jk}! }{(N_{4j}+ 2- 1)!}=$$
    
    To compute $N_{4jk}$ we count how many times X4 is 0 or 1 when X3 is 0 (j=1) and when X3 is 1 (j=2):
    
    $$=\frac{2!1!}{4!}\frac{0!3!}{4!}=0.021$$
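
The four factors above can be cross-checked numerically. The sketch below is plain base R and independent of the functions defined later in this notebook; `node.factor` and `parents.of` are names introduced here for illustration only. It relies on the fact that parent instantiations that never occur in D have $N_{ij}=0$ and therefore contribute a factor of 1, so only the observed ones need to be tabulated.

```r
# The six cases of the example table
D <- data.frame(
    X1 = c(0, 0, 0, 1, 0, 1),
    X2 = c(0, 0, 1, 1, 1, 0),
    X3 = c(0, 0, 0, 1, 1, 1),
    X4 = c(0, 1, 0, 1, 1, 1)
)

# Factor of formula (2) for one node (illustrative helper, not part of bnlearn)
node.factor <- function(D, child, parents = character(0)) {
    r <- length(unique(D[[child]]))
    # One label per observed parent instantiation (a single label if no parents)
    cfg <- if (length(parents) == 0) rep(1L, nrow(D)) else interaction(D[parents], drop = TRUE)
    N.ijk <- unclass(table(cfg, D[[child]]))   # rows: j, columns: k
    N.ij <- rowSums(N.ijk)
    prod(factorial(r - 1) / factorial(N.ij + r - 1)) * prod(factorial(N.ijk))
}

# Structure of the example: X1 and X2 are roots, X3 <- {X1, X2}, X4 <- X3
parents.of <- list(X1 = character(0), X2 = character(0),
                   X3 = c("X1", "X2"), X4 = "X3")

factors <- sapply(names(parents.of), function(v) node.factor(D, v, parents.of[[v]]))
print(signif(factors, 2))
print(prod(factors))
```

The last line multiplies the four factors, which gives $P(B_x,D)/P(B_x)$ for this structure.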

\
Now that we can compute the probability of a model given the data (up to the constant $P(D)$), we could in principle evaluate it for every model and keep the one with the highest value. However, as Robinson's recursive formula shows, the number of possible network structures grows super-exponentially with the number of nodes:
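
Written out, the recurrence implemented in the next cell reads as follows, where $f(n)$ (a symbol introduced here for convenience) is the number of directed acyclic graphs on $n$ labelled nodes:

$$f(n)=\sum_{i=1}^{n}(-1)^{i+1}\binom{n}{i}\,2^{i(n-i)}\,f(n-i),\qquad f(0)=1$$

For instance, $f(2)=\binom{2}{1}2^{1}f(1)-\binom{2}{2}2^{0}f(0)=4-1=3$: two nodes can be unconnected or joined by an arc in either direction.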

In [None]:
# Robinson's recursive formula for the number of possible belief-network structures
# (directed acyclic graphs) that contain n nodes
# From: A Bayesian Method for the Induction of Probabilistic Networks from Data, p. 319
n.networks.structure <- function(n)
    {
        if(n <= 1){return(1)}
        else
            {
                i <- 1
                res <- 0
                while(i <= n)
                    {
                        res <- res + ((i%%2)*2 - 1)*(choose(n,i)*(2^(i*(n-i)))*n.networks.structure(n-i))
                        i <- i + 1
                    }

                return(res)
            }
    }

for(i in 1:8)
    {
        cat('For n=',i,'the number of possible structures is: ',n.networks.structure(i),'\n')
    }


\
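Even for small networks an exhaustive search is therefore impractical, and a heuristic search is preferable. The approach chosen here is the K2 algorithm, introduced in [1] and illustrated in [2]. K2 takes an ordering of the nodes and an upper bound $u$ on the number of parents per node; for each node it greedily adds, among the nodes that precede it in the ordering, the parent that most increases the node's factor in (2), and stops when no addition improves it.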


## 1. Implement the K2 algorithm in R and check its performance with the test data set given in [2]

In [None]:
############ AUXILIARY FUNCTIONS ###############

# Factor of formula (2) for a node without parents (q_i = 1),
# multiplied into the running product `prod`
prob.noparents <- function(D,namecol,prod)
    {
        col <- dplyr::pull(D, namecol)
        nunique <- length(unique(col))          # r_i
        prod <- prod*factorial(nunique-1)       # (r_i - 1)!
        den <- nunique - 1                      # accumulates N_ij + r_i - 1
    
        for(i in 1:nunique)
            {
                nk <- length(col[col == unique(col)[i]])   # N_ijk for the i-th value
                prod <- prod*factorial(nk)
                den <- den + nk
            }
        prod <- prod/factorial(den)
    
        return(prod)
    }

# Element-wise comparison of a data row with a candidate instantiation
is.eq <- function(row1,row2){return(row1 == row2)}

# Factor of formula (2) for the node `namecol` given its parents in BN,
# multiplied into the running product `prod`
prob.parents <- function(BN,D,namecol,prod)
    {
        col <- dplyr::pull(D, namecol)
        parents <- parents(BN, namecol)
        col.parents <- D[parents]
        r <- length(unique(col))              # r_i: number of values of the node

        # q_i and the list of all parent instantiations w_ij
        q <- 1
        combined <- list()
        for(j in 1:length(parents))
            {
                q <- q*length(unique(col.parents[,j]))
                combined[[j]] <- unique(col.parents[,j])    
            }
        combinations <- do.call(expand.grid, combined)
        # for j in 1:qi
        for(j in 1:q)
            {
                w  <- combinations[j,]
                # Compute Nijk!
                nij <- 0
                for(k in 1:r)
                    {
                        # Count the rows where the parents equal w and the node takes its k-th value
                        wij <- c(w,unique(col)[k])
                        nijk <- sum(apply(apply(cbind(col.parents, col),1,is.eq,row2=wij),2,all))
                        nij <- nij + nijk
                        prod <- prod*factorial(nijk)
                    }
                # (r_i - 1)! / (N_ij + r_i - 1)!
                prod <- prod*factorial(r - 1)/factorial(nij + r - 1)
            }
    
        return(prod)

    }

#############################################

In [None]:
# Product over all nodes of the factors in formula (2), i.e. P(BN,D)/P(BN)
prob.model <- function(BN,D)
    {
        nvar <- length(nodes(BN))
        prod <- 1
        for(i in 1:nvar)
            {
                if(length(parents(BN, nodes(BN)[i])) == 0)
                    {prod <- prob.noparents(D,nodes(BN)[i],prod)}
                else
                    {prod <- prob.parents(BN,D,nodes(BN)[i],prod)}
                
            }
    
        return(prod)
    }

\
`prob.model(BN,D)` computes $P(BN,D)/P(BN)$, i.e. the product in formula (2) without the prior. To check that it is implemented correctly, its output is compared with the values reported in [1]:

In [None]:
ex <- read.table("./dataset/cooper.txt", header = TRUE, stringsAsFactors = TRUE)

In [None]:
ex.bn1 <- model2network("[X1][X2|X1][X3|X2]")
graphviz.plot(ex.bn1,layout = 'neato')
cat('In the paper the result is:              P(B1,D)=P(B1) 2.23*10^(-9)')
cat('\nthrough the function implemented we get: P(B1,D)=P(B1)',prob.model(ex.bn1,ex),'\n\n')

In [None]:
ex.bn2 <- model2network("[X1][X2|X1][X3|X1]")
graphviz.plot(ex.bn2)
cat('In the paper the result is:              P(B2,D)=P(B2) 2.23*10^(-10)')
cat('\nthrough the function implemented we get: P(B2,D)=P(B2)',prob.model(ex.bn2,ex),'\n')

The function `prob.model(BN,D)` reproduces the values reported in the paper, so the implementation appears to be correct.
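
One caveat applies to larger databases: in double precision `factorial()` returns `Inf` for arguments above 170, so a product of factorials such as formula (2) can turn into `Inf/Inf`. The snippet below illustrates this for a single parentless node, using a synthetic column of 400 cases (made up for this illustration, not one of the notebook's data sets), and compares it with the same quantity evaluated on the log scale with `lfactorial()`.

```r
# Synthetic binary column: 200 zeros followed by 200 ones
x <- rep(c(0, 1), each = 200)
counts <- as.vector(table(x))    # N_k for each observed value
r <- length(counts)              # number of values of the variable

# Direct evaluation of (r - 1)! * prod_k(N_k!) / (N + r - 1)!
direct <- factorial(r - 1) * prod(factorial(counts)) / factorial(sum(counts) + r - 1)

# The same quantity on the log scale
log.value <- lfactorial(r - 1) + sum(lfactorial(counts)) - lfactorial(sum(counts) + r - 1)

print(direct)      # NaN: both numerator and denominator overflow to Inf
print(log.value)   # finite
```

Comparing models through `log.value` preserves their ranking, because the logarithm is monotonic.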

### Implementing the K2 algorithm

In [None]:
# Factor of formula (2) contributed by the i-th node of BN alone
f <- function(BN,D,i)
    {
        prod <- 1
    
        if(length(parents(BN, nodes(BN)[i])) == 0)
            {prod <- prob.noparents(D,nodes(BN)[i],prod)}
        else
            {prod <- prob.parents(BN,D,nodes(BN)[i],prod)}
       
        return(prod)
    }

In [None]:
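# K2 greedy search
# N: node names, in the ordering assumed by K2 (parents can only precede their children)
# D: data frame of cases
# u: maximum number of parents per node
K2 <- function(N,D,u)
    {
        BN <- empty.graph(nodes = N)
    
        for(i in 2:length(N))
            {
                # Score of node i with its current (empty) parent set
                p.old <- f(BN,D,i)
                ok.to.proceed <- TRUE
                                
                while(ok.to.proceed && (length(parents(BN,nodes(BN)[i])) < u) )
                    {   
                        # Score every predecessor of node i as an additional parent
                        j <- i - 1
                        proposal <- c()
                    
                        while(j > 0)
                            {
                                    BN.proposal <- set.arc(BN, from=nodes(BN)[j], to=nodes(BN)[i])
                                    proposal <- c(proposal, f(BN.proposal,D,i))

                                    j <- j - 1 
                            }
                        
                        # proposal[m] is the score obtained with predecessor i - m
                        p.new <- max(proposal)
                        best.BN   <- set.arc(BN, from=nodes(BN)[i - match(max(proposal),proposal)], to=nodes(BN)[i])
                        # Keep the best arc only if it improves the score, otherwise stop
                        if(p.new > p.old)
                            {
                                    p.old <- p.new
                                    BN <- best.BN
                            }
                        else{ok.to.proceed <- FALSE}
                    }
                
            }
        
        return(BN)
    }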

In [None]:
graphviz.plot(K2(c('X1','X2','X3'),ex,3))

### Testing on SURVEY database

In [None]:
# Setting the nodes
tus_dag <- empty.graph(nodes = c("A", "S", "E", "O", "R", "T"))

# Setting the arc (the edges)
tus_dag <- set.arc(tus_dag, from = "A", to = "E")
tus_dag <- set.arc(tus_dag, from = "S", to = "E")
tus_dag <- set.arc(tus_dag, from = "E", to = "O")
tus_dag <- set.arc(tus_dag, from = "E", to = "R")
tus_dag <- set.arc(tus_dag, from = "O", to = "T")
tus_dag <- set.arc(tus_dag, from = "R", to = "T")

survey <- read.table("./dataset/survey.txt", header = TRUE, stringsAsFactors = TRUE)
graphviz.plot(tus_dag)

In [None]:
# This function should compute the best survey model and plot it but it does not work right now
# graphviz.plot(K2(c("A", "S", "E", "O", "R", "T"),survey,3))
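
For comparison, here is a minimal sketch of the same greedy search written in base R only, with every score kept on the log scale through `lfactorial()` so that large counts cannot overflow. It is an illustration under assumptions, not a diagnosis of the failing call above: the cause of that failure has not been checked here. `log.g` and `k2.sketch` are hypothetical names, the result is a plain list of parent sets rather than a `bnlearn` object, and the data frame is the six-case example used to explain equation (2).

```r
# Log of one node's factor in formula (2). Parent instantiations that never
# occur in D contribute a factor of 1, so only observed ones are tabulated.
log.g <- function(D, child, parents = character(0)) {
    r <- length(unique(D[[child]]))
    cfg <- if (length(parents) == 0) rep(1L, nrow(D)) else interaction(D[parents], drop = TRUE)
    counts <- unclass(table(cfg, D[[child]]))   # rows: j, columns: k
    sum(lfactorial(r - 1) - lfactorial(rowSums(counts) + r - 1)) +
        sum(lfactorial(counts))
}

# Greedy K2-style search: `ordering` is the assumed node ordering,
# `u` the maximum number of parents per node.
k2.sketch <- function(D, ordering, u) {
    parents.of <- setNames(vector("list", length(ordering)), ordering)
    for (i in seq_along(ordering)) {
        node <- ordering[i]
        parents <- character(0)
        best <- log.g(D, node, parents)
        candidates <- ordering[seq_len(i - 1)]   # only predecessors may be parents
        while (length(parents) < u && length(candidates) > 0) {
            scores <- sapply(candidates, function(z) log.g(D, node, c(parents, z)))
            if (max(scores) <= best) break       # no candidate improves the score
            z <- candidates[which.max(scores)]
            best <- max(scores)
            parents <- c(parents, z)
            candidates <- setdiff(candidates, z)
        }
        parents.of[[node]] <- parents
    }
    parents.of
}

D <- data.frame(
    X1 = c(0, 0, 0, 1, 0, 1),
    X2 = c(0, 0, 1, 1, 1, 0),
    X3 = c(0, 0, 0, 1, 1, 1),
    X4 = c(0, 1, 0, 1, 1, 1)
)
learned <- k2.sketch(D, c("X1", "X2", "X3", "X4"), u = 2)
print(learned)
```

Unlike the `K2` function above, this version never proposes a node that is already a parent, and it returns the parent sets directly; converting them into a `bnlearn` graph would be a separate step.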

## 2. Implement and test the K2 algorithm with the test data sets ([2]). Investigate if it is possible to code it inside the bnstruct R package

## References
[1] G. F. Cooper and E. Herskovits, A Bayesian Method for the Induction of Probabilistic Networks from Data, Machine Learning 9 (1992) 309\
[2] C. Ruiz, Illustration of the K2 Algorithm for learning Bayes Net Structures, http://web.cs.wpi.edu/~cs539/s11/Projects/k2_algorithm.pdf