# ASSIGNMENT 1                                                McGill:COMP766-2 - Winter 2021 


In [2]:
import numpy as np
import networkx as nx
from utils import tensor_mult

### Problem 1. (5 points)

- __a__: 2 points) Derive the __Markov__ inequality below for a non-negative discrete random variable $X$ and any $a > 0$
(_Hint:_ rearrange to prove $a P(X \geq a) \leq \mathbb{E}[X]$ and substitute the definition of expectation.)
$$P(X \geq a) \leq \frac{\mathbb{E}[X]}{a}$$
- __b__: 3 points) Using this inequality, prove the following, known as the __Chebyshev__ inequality:
$$P(|X - \mathbb{E}[X]| > a) \leq \frac{Var(X)}{a^2}$$

### Solution
__a__) Let $a > 0$. We start from the definition of $\mathbb{E}[X]$ and split the sum at $a$:

$$\mathbb{E}[X] = \sum_x xp(x) = \sum_{x < a} xp(x) + \sum_{x \geq a} xp(x)$$

Since $X$ is non-negative, the first sum is non-negative, and every term of the second sum has $x \geq a$:

$$\Rightarrow \mathbb{E}[X] \geq \sum_{x \geq a} xp(x) \geq \sum_{x \geq a} ap(x) = a\sum_{x \geq a} p(x)$$

$$\Rightarrow \mathbb{E}[X] \geq a\sum_{x \geq a} p(x) = aP(X\geq a)$$

$$\Rightarrow P(X \geq a) \leq \frac{\mathbb{E}[X]}{a}$$

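__b__) We apply the previous inequality to the non-negative random variable $(X-\mathbb{E}[X])^2$ with threshold $a^2$, where $a > 0$. Because $P(Y > a^2) \leq P(Y \geq a^2)$ for any random variable $Y$, the bound also holds with a strict inequality inside the probability.

Since $a > 0$, we know that: $$P((X-\mathbb{E}[X])^2>a^2) = P(|X-\mathbb{E}[X]|>a)$$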

Therefore:
$$P((X-\mathbb{E}[X])^2>a^2) \leq \frac{\mathbb{E}[(X-\mathbb{E}[X])^2]}{a^2}$$

$$ \Rightarrow P(|X-\mathbb{E}[X]|>a) \leq \frac{\mathbb{E}[(X-\mathbb{E}[X])^2]}{a^2}$$

$$ \Rightarrow P(|X-\mathbb{E}[X]|>a) \leq \frac{Var(X)}{a^2}$$
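
Both bounds can be sanity-checked numerically. The snippet below is a minimal sketch, not part of the assignment: the distribution in `values` and `probs` and the list of thresholds are made-up examples, and the helper names are hypothetical.

```python
import numpy as np

# Hypothetical non-negative discrete random variable (values and their probabilities).
values = np.array([0, 1, 2, 5, 10])
probs = np.array([0.3, 0.3, 0.2, 0.15, 0.05])

mean = np.sum(values * probs)
var = np.sum((values - mean) ** 2 * probs)

def markov_holds(a):
    # Markov: P(X >= a) <= E[X] / a
    tail = probs[values >= a].sum()
    return tail <= mean / a

def chebyshev_holds(a):
    # Chebyshev: P(|X - E[X]| > a) <= Var(X) / a^2
    tail = probs[np.abs(values - mean) > a].sum()
    return tail <= var / a ** 2

thresholds = [0.5, 1, 2, 3, 5, 10]
print(all(markov_holds(a) for a in thresholds))
print(all(chebyshev_holds(a) for a in thresholds))
```

A check like this does not replace the proof, but it catches sign or direction mistakes in the inequalities quickly.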


### Problem 2. (5 points)

In showing that factorization in a Markov Network leads to local CIs we used the
following fact. Prove it using the definition of conditional independence:

$$
P \models (X \perp Y \mid Z) \quad \Leftrightarrow \quad P(X=x, Y=y, Z=z) = f(x, z)g(y,z) 
$$


### Solution

- $\Rightarrow$) 2 points 

We know that: $$P \models (X \perp Y \mid Z) \Leftrightarrow P(X,Y|Z) = P(X|Z)P(Y|Z)$$

From chain rule we have: $$P(X,Y,Z) = P(X,Y|Z)P(Z)$$

Therefore: $$P(X,Y,Z) = P(X|Z)P(Y|Z)P(Z) = P(X|Z)P(Y,Z) $$

This is a product of two factors: $P(X=x|Z=z)$, which depends only on $(x,z)$ and which we call $f(x,z)$, and $P(Y=y,Z=z)$, which depends only on $(y,z)$ and which we call $g(y,z)$.

Therefore: $$P(X=x,Y=y,Z=z) = f(x,z)g(y,z)$$


- $\Leftarrow$) 3 points

We can calculate $P(X|Z)$ and $P(Y|Z)$ separately: 

$$P(X,Z) = \sum_Y P(X,Y,Z) = \sum_{y \in Y} f(x,z)g(y,z) = f(x,z)\sum_{y \in Y} g(y,z) $$

$$P(Y,Z) = \sum_X P(X,Y,Z) = \sum_{x \in X} g(y,z)f(x,z) = g(y,z)\sum_{x \in X} f(x,z) $$

$$P(Z) = \sum_X \sum_Y P(X,Y,Z) = \sum_{x \in X} \sum_{y \in Y} f(x,z)g(y,z) = \sum_{x \in X} f(x,z) \sum_{y \in Y} g(y,z)$$

$$\Rightarrow P(X|Z) = \frac{P(X,Z)}{P(Z)} =\frac{f(x,z)\sum_{y \in Y} g(y,z)}{\sum_{x \in X} f(x,z) \sum_{y \in Y} g(y,z)} = \frac{f(x,z)}{\sum_{x \in X} f(x,z)}$$

$$\Rightarrow P(Y|Z) = \frac{P(Y,Z)}{P(Z)} =\frac{g(y,z)\sum_{x \in X} f(x,z)}{\sum_{x \in X} f(x,z) \sum_{y \in Y} g(y,z)} = \frac{g(y,z)}{\sum_{y \in Y} g(y,z)}$$

$P(X,Y|Z)$ can be written as: $$P(X,Y|Z) = \frac{P(X,Y,Z)}{P(Z)} = \frac{f(x,z)g(y,z)}{\sum_{x \in X} f(x,z) \sum_{y \in Y} g(y,z)} = \frac{f(x,z)}{\sum_{x \in X} f(x,z)} \frac{g(y,z)}{\sum_{y \in Y} g(y,z)} = P(X|Z)P(Y|Z)$$

Therefore: $$P \models (X \perp Y \mid Z)$$
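
The $\Leftarrow$ direction can also be checked numerically. The following is a small sketch under stated assumptions: the domain sizes and the random non-negative factors `f` and `g` are arbitrary choices for illustration, and the normalization constant is simply absorbed into the joint.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y, n_z = 3, 4, 2  # hypothetical domain sizes

# Arbitrary non-negative factors f(x, z) and g(y, z).
f = rng.random((n_x, n_z))
g = rng.random((n_y, n_z))

# Joint P(x, y, z) proportional to f(x, z) g(y, z), with axes ordered (x, y, z).
joint = f[:, None, :] * g[None, :, :]
joint /= joint.sum()

p_z = joint.sum(axis=(0, 1))            # P(z)
p_xy_given_z = joint / p_z              # P(x, y | z)
p_x_given_z = joint.sum(axis=1) / p_z   # P(x | z), shape (x, z)
p_y_given_z = joint.sum(axis=0) / p_z   # P(y | z), shape (y, z)

# Conditional independence: P(x, y | z) equals P(x | z) P(y | z) for all x, y, z.
product = p_x_given_z[:, None, :] * p_y_given_z[None, :, :]
is_ci = np.allclose(p_xy_given_z, product)
print(is_ci)
```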

### Problem 3. (10 points)

Here, we want to represent the joint probability $P(D,I,G,S,L)$ and answer arbitrary queries such as $P(D,I \mid G=0, L=1)$ for the running example below, which we used extensively in class.
<img src="3_4.png" width="400">
For this, we need to define CPTs. We use the ```networkx``` package to represent a DAG and add CPTs as node attributes. The CPT for a node with $K$ parents is a $(K+1)$-dimensional array, where the first dimension is for the child node and the order of the parents follows their order when the method ```DAG.predecessors(node)``` is called. This is the order in which the corresponding edges are added. 

Your task is to write the body of the function ```Pr()``` that calculates the array of the posterior marginal, given a DAG -- e.g., $P(D, L \mid G= 2, I = 1)$. 
For your implementation you can use the ```tensor_mult``` helper function provided in ```utility.py```. 

You can try implementing this function in three steps:

- calculate the joint PMF
- condition on the evidence (e.g., by multiplying the joint array with an appropriate tensor of 0s and 1s)
- marginalize and normalize the final results (normalization might be necessary depending on your implementation of conditioning on the evidence)
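
The three steps above can be illustrated on a toy example. This is only a sketch: the two-node network `A -> B` and its CPT values are hypothetical and are not the network from this problem, and plain NumPy broadcasting is used instead of the provided helper.

```python
import numpy as np

# Hypothetical two-node network A -> B (not the network from this problem).
p_a = np.array([0.6, 0.4])                 # P(A)
p_b_given_a = np.array([[0.9, 0.2],        # P(B | A), first axis is B, second is A
                        [0.1, 0.8]])

# Step 1: joint PMF with axes ordered (A, B).
joint = p_a[:, None] * p_b_given_a.T

# Step 2: condition on B = 1 by multiplying with a tensor of 0s and 1s.
mask = np.zeros_like(joint)
mask[:, 1] = 1.0
conditioned = joint * mask

# Step 3: marginalize out B and normalize to obtain P(A | B = 1).
posterior = conditioned.sum(axis=1)
posterior /= posterior.sum()
print(posterior)
```

With the mask approach the conditioned array no longer sums to one, which is why the final normalization is needed.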

There are more efficient ways of calculating the posterior marginals that we study in the __inference__ lectures.

### Solution

In [189]:
# creating the BayesNet
BN = nx.DiGraph()
BN.add_node('D', cpt=np.array([.6,.4]))
BN.add_node('I', cpt=np.array([.7,.3]))

#a 3-dimensional array of shape 3x2x2 representing P(G|I,D).  
#note that the order of parents matters, here we have made sure the order is the same as
#the order returned by BN.predecessors('G')
BN.add_node('G', cpt=np.array([[[.3,.05],[.9,.5]],[[.4,.25],[.08,.3]],[[.3,.7],[.02,.2]]]))
BN.add_node('L', cpt=np.array([[.1,.4,.99],[.9,.6,.01]]))
BN.add_node('S', cpt=np.array([[.95,.2],[.05,.8]]))

# adding edges (note that the order of I,D -> G is important)
BN.add_edge('I','G')
BN.add_edge('D','G')
BN.add_edge('G', 'L')
BN.add_edge('I', 'S')

print("Is this a DAG? {}".format(nx.is_directed_acyclic_graph(BN)))
# we can use topological sort to get a topological ordering of nodes. What is the complexity of this sorting op.?
# The complexity is O(V+E): each node and each edge is processed once (e.g., with a DFS traversal or Kahn's algorithm).

list(nx.topological_sort(BN))




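def Pr(target, # list of nodes whose posterior marginal is returned -- e.g., ['D', 'L']
       evidence, # a dictionary of observations to condition on -- e.g., {'L':0, 'I':1}
       DAG #DAG containing the CPTs (BN above)
      ):
    
    # alias the argument so the function uses the DAG it is given rather than the global BN
    BN = DAG
    sorted_BN = list(nx.topological_sort(BN))
    cpt = nx.get_node_attributes(BN, "cpt")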
    
    # creating the tensor of joint probabilities 
    Prob = cpt[sorted_BN[0]]
    visited = []
    visited.append(sorted_BN[0])
    for node in sorted_BN[1:]:
        axis1 = []
        axis2 = []
        if ( set(visited) & set(BN.predecessors(node)) == set(BN.predecessors(node)) and len(list(BN.predecessors(node)))>0):
            i = 1
            for p in list(BN.predecessors(node)):
                axis1.append(visited.index(p))
                axis2.append(i)
                i +=1
        Prob = tensor_mult(Prob, cpt[node], axis1, axis2)
        visited.append(node)      
    P = Prob
        
        
        

    #Calculating the margin of non evidences
    evidence_keys = list((evidence.keys()))
    evidence_values = list((evidence.values()))
    margin_denum = P.copy()
    margin_axises = []
    margin_values = []
    ax = 0
    for i,node in enumerate(sorted_BN):
        if node not in evidence_keys:
            margin_axises.append(ax)
            margin_denum = margin_denum.sum(axis=ax)
            
        else:
            ax += 1
            margin_values.append(evidence[node]) #topological sorting evidence values
    
    
    
    #Calculating the marginal of the joint probability of target and evidence nodes
    joint_axises = []
    join_nodes = []
    margin_joint = P.copy()
    ax = 0
    for i,node in enumerate(sorted_BN):
        if node not in evidence_keys and node not in target:
            joint_axises.append(ax)
            margin_joint = margin_joint.sum(axis=ax)
            
        else:
            ax += 1
            join_nodes.append(node)

    #calculating the margin given the evidence values
    idx = 0  
    for i,node in enumerate(join_nodes):
        if node in evidence_keys:
            margin_joint = margin_joint.take(indices=evidence[node],axis=idx)
        else:
            idx +=1
    # the remaining axes are the target nodes in topological order;
    # reorder them to match the order requested in `target`
    target_sorted = [node for node in sorted_BN if node in target]
    margin_joint = np.transpose(margin_joint, [target_sorted.index(node) for node in target])
 
            
    
    marginal = margin_joint/margin_joint.sum()
    return marginal



evidence = {'L':0,'G':2} 
target = ['I']

res = Pr(target,evidence,BN)
print("Result is",res)



Is this a DAG? True
Result is [0.92105263 0.07894737]
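
As an independent cross-check of the query above, the same posterior can be computed directly with `np.einsum`. This is a sketch that does not use `tensor_mult`; the CPT values are copied from the network definition, and the variable names and the axis order `(D, I, G, S, L)` of the joint are choices made here for illustration.

```python
import numpy as np

# CPTs copied from the network definition above.
p_d = np.array([.6, .4])
p_i = np.array([.7, .3])
p_g = np.array([[[.3, .05], [.9, .5]],
                [[.4, .25], [.08, .3]],
                [[.3, .7], [.02, .2]]])          # axes (G, I, D)
p_l = np.array([[.1, .4, .99], [.9, .6, .01]])   # axes (L, G)
p_s = np.array([[.95, .2], [.05, .8]])           # axes (S, I)

# Joint P(D, I, G, S, L) as the product of all CPTs.
joint = np.einsum('d,i,gid,lg,si->digsl', p_d, p_i, p_g, p_l, p_s)

# Condition on G = 2 and L = 0 by indexing, then sum out D and S.
unnormalized = joint[:, :, 2, :, 0].sum(axis=(0, 2))
posterior_i = unnormalized / unnormalized.sum()
print(posterior_i)
```

Indexing the evidence axes plays the same role as multiplying with a 0/1 tensor, and the result agrees with the output of `Pr` shown above.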


### Problem 4. (10 points)

Your task in this problem is to implement the __D-separation__ algorithm for a DAG representation, similar to what we used in the previous problem. Note that (assuming non-deterministic factors) D-separation does not need access to the CPTs. The following function returns ```True``` if the given CI holds and ```False``` otherwise.

In [194]:
def is_collider(X,Y,Z,DAG):
    # Z is a collider on the trail X - Z - Y when both neighbours point into Z
    return DAG.has_edge(X, Z) and DAG.has_edge(Y, Z)

def is_cond_independent(X, #non-empty list of nodes -- e.g., ['I', 'D'] 
                        Y, #non-empty list of nodes. It has no intersection with X -- e.g., ['S']
                        Z, #list of nodes -- e.g., []
                        DAG #networkx DAG -- e.g., BN defined above
                       ):
    observed = set(Z)

    # a collider is active when it or one of its descendants is observed,
    # i.e., when it is an observed node or an ancestor of an observed node
    active_colliders = set(observed)
    for z in observed:
        active_colliders |= nx.ancestors(DAG, z)

    skeleton = DAG.to_undirected()
    for x in X:
        for y in Y:
            # X and Y are d-separated only if every trail between them is blocked
            for path in nx.all_simple_paths(skeleton, x, y):
                blocked = False
                for i in range(1, len(path)-1):
                    v = path[i]
                    if is_collider(path[i-1], path[i+1], v, DAG):
                        blocked = v not in active_colliders
                    else:
                        blocked = v in observed
                    if blocked:
                        break
                if not blocked:
                    return False  # found an active trail
    return True

ci = is_cond_independent('D','L',['G'],BN)
print(ci)

True
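
An equivalent way to test D-separation avoids enumerating trails: restrict the DAG to the query nodes and their ancestors, moralize it, remove the observed nodes, and check whether any connection remains. The sketch below assumes this standard moralized-ancestral-graph criterion. The function name `d_separated_moral` is hypothetical, and the DAG is rebuilt here (without CPTs) so that the block is self-contained.

```python
import itertools
import networkx as nx

def d_separated_moral(dag, xs, ys, zs):
    """Check X _|_ Y | Z using the moralized ancestral graph (illustrative helper)."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    # 1. Keep only the query nodes and their ancestors.
    keep = xs | ys | zs
    for node in list(keep):
        keep |= nx.ancestors(dag, node)
    sub = dag.subgraph(keep)
    # 2. Moralize: connect every pair of co-parents, then drop edge directions.
    moral = sub.to_undirected()
    for node in sub.nodes:
        for u, v in itertools.combinations(sub.predecessors(node), 2):
            moral.add_edge(u, v)
    # 3. Remove the observed nodes; X and Y are d-separated iff they are disconnected.
    moral.remove_nodes_from(zs)
    return not any(nx.has_path(moral, x, y) for x in xs for y in ys)

# Same structure as the network above (edges only).
dag = nx.DiGraph([('I', 'G'), ('D', 'G'), ('G', 'L'), ('I', 'S')])
print(d_separated_moral(dag, ['D'], ['L'], ['G']))
print(d_separated_moral(dag, ['D'], ['S'], ['G']))
```

Unlike trail enumeration, which can be exponential in the size of the graph, this check runs in time polynomial in the number of nodes and edges.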


Let us verify that the CIs we get from D-separation match the definition of conditional independence, using the conditional probabilities that we can calculate numerically with your solution to Problem 3. 
In the following we look at all queries in the form of $P \overset{?}{\models} {D} \perp {S} \mid \mathbf{Z} = \mathbf{0}$, where $\mathbf{Z} \subseteq \{I,G,L\}$ may contain any of the remaining variables. 

In [193]:
domains = {'I':[0,1], 'L':[0,1], 'G':[0,1,2]}
for Z in [{},{'I':0}, {'G':0}, {'L':0}, {'I':0,'G':0}, {'I':0,'L':0}, {'L':0,'G':0}, {'I':0,'G':0,'L':0}]:
    #conditional independence from D-separation
    CI_alg = is_cond_independent(['D'], ['S'], Z.keys(), BN)
    #conditional independence according to conditional probabilities
    CI_num = np.max(np.abs(Pr(['D','S'], Z, BN) - np.outer(Pr(['D'], Z, BN),Pr(['S'], Z, BN)))) < 1e-10
    #they should match
    assert(CI_num == CI_alg)
print("passed the minimal test")


passed the minimal test
