# Inference and Reasoning with Bayesian Networks

###### COMP4670/8600 - Introduction to Statistical Machine Learning - Assignment 2

Name: Rui Zhang

Student ID: u5963436

## Instructions

|             |Notes|
|:------------|:--|
|Maximum marks| 19|
|Weight|19% of final grade|
|Format| Complete this ipython notebook. Do not forget to fill in your name and student ID above|
|Submission mode| Use [wattle](https://wattle.anu.edu.au/)|
|Formulas| All formulas which you derive need to be explained unless you use very common mathematical facts. Picture yourself as explaining your arguments to somebody who is just learning about your assignment. In other words, do not assume that the person marking your assignment knows all the background and therefore you can just write down the formulas without any explanation. It is your task to convince the reader that you know what you are doing when you derive an argument. Typeset all formulas in $\LaTeX$.|
| Code quality | Python code should be well structured, use meaningful identifiers for variables and subroutines, and provide sufficient comments. Please refer to the examples given in the tutorials. |
| Code efficiency | An efficient implementation of an algorithm uses fast subroutines provided by the language or additional libraries. For the purpose of implementing Machine Learning algorithms in this course, that means using the appropriate data structures provided by Python and in numpy/scipy (e.g. Linear Algebra and random generators). |
| Cooperation | All assignments must be done individually. Cheating and plagiarism will be dealt with in accordance with University procedures (please see the ANU policies on [Academic Honesty and Plagiarism](http://academichonesty.anu.edu.au)). Hence, for example, code for programming assignments must not be developed in groups, nor should code be shared. You are encouraged to broadly discuss ideas, approaches and techniques with a few other students, but not at a level of detail where specific solutions or implementation issues are described by anyone. If you choose to consult with other students, you will include the names of your discussion partners for each solution. If you have any questions on this, please ask the lecturer before you act. |

$\newcommand{\dotprod}[2]{\left\langle #1, #2 \right\rangle}$
$\newcommand{\onevec}{\mathbb{1}}$

Setting up the environment (there is some hidden latex which needs to be run in this cell).

In [2]:
import itertools, copy
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display, Image

%matplotlib inline

## Part 1: Graphical Models

### Problem setting

We are interested in predicting the outcome of the election in an imaginary country, called Under Some Assumptions (USA). There are four candidates for whom the citizens can **Vote**: Bernie, Donald, Hillary, and Ted. The citizens live in four **Region**s: north, south, east and west. We have general demographic information about the people, namely: **Gender** (male, female) and **Hand**edness (right, left). Based on surveys done by an external company, we believe that the **Region** and **Gender** affect whether the people use their **Jacket**s full time, part time or never. Surprisingly, the company told us that the **Age** of their shoes (new, worn, old) depends on how often they wear their **Jacket**s. Furthermore, the **Gender** and their preferred **Hand** affect the **Colour** of their hat (white, black). Finally, surveys say that the citizens will **Vote** based on their **Region**, the **Age** of their shoes and the **Colour** of their hats.

The directed graphical model is depicted below:

In [3]:
Image(url="https://machlearn.gitlab.io/isml2017/assignments/election_model.png")

### Conditional probability tables

After paying the survey firm some more money, they provided the following conditional probability tables.

|$p(R)$ | R=n | R=s | R=e | R=w |
|:-----:|:--:|:--:|:--:|:--:|
|marginal| 0.2 | 0.1 | 0.5 | 0.2 |

|$p(G)$ | G=m | G=f |
|:-----:|:--:|:--:|
|marginal| 0.3 | 0.7 |

|$p(H)$ | H=r | H=l |
|:-----:|:--:|:--:|
|marginal| 0.9 | 0.1 |

| $p(J|R,G)$ | R=n,G=m | R=n,G=f | R=s,G=m | R=s,G=f | R=e,G=m | R=e,G=f | R=w,G=m | R=w,G=f |
|:-----:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|**J**=full $\quad$  |0.9 |0.8 |0.1 | 0.3 |0.4 |0.01| 0.02 | 0.2  |
|**J**=part $\quad$  |0.08|0.17|0.03| 0.35|0.05|0.01| 0.2  | 0.08 |
|**J**=never $\quad$ |0.02|0.03|0.87| 0.35|0.55|0.98| 0.78 | 0.72 |

| $p(A|J)$ | J=full | J=part | J=never |
|:-----:|:--:|:--:|:--:|
|**A**=new  |0.01|0.96|0.3|
|**A**=worn |0.98|0.03|0.5|
|**A**=old  |0.01|0.01|0.2|

| $p(C|G,H)$ | G=m,H=r | G=m,H=l | G=f,H=r | G=f,H=l |
|:-----:|:--:|:--:|:--:|:--:|
|**C**=black $\quad$ |0.9 |0.83 |0.17 | 0.3 |
|**C**=white $\quad$ |0.1 |0.17|0.83 | 0.7 |

The final conditional probability table is given by the matrix below. The order of the rows and columns is also given below.

In [4]:
vote_column_names = ['north,new,black', 'north,new,white', 'north,worn,black', 'north,worn,white', 
                'north,old,black', 'north,old,white', 'south,new,black', 'south,new,white', 
                'south,worn,black', 'south,worn,white', 'south,old,black', 'south,old,white', 
                'east,new,black', 'east,new,white', 'east,worn,black', 'east,worn,white', 
                'east,old,black', 'east,old,white', 'west,new,black', 'west,new,white', 
                'west,worn,black', 'west,worn,white', 'west,old,black', 'west,old,white']

vote_outcomes = ('bernie','donald','hillary','ted')

vote_pmf_array = np.array(
        [
            [0.1,0.1,0.4,0.02,0.2,0.1,0.1,0.04,0.2,0.1,0.1 ,0.1,0.4 ,0.1 ,0.1,0.1 ,0.1,0.04,0.3,0.2,0.1,0.3,0.34,0.35],
            [0.3,0.4,0.2,0.5 ,0.1,0.2,0.1,0.5 ,0.1,0.2,0.5 ,0.3,0.2 ,0.42,0.2,0.67,0.4,0.4 ,0.1,0.1,0.5,0.1,0.1 ,0.1],
            [0.5,0.4,0.3,0.3 ,0.5,0.6,0.6,0.3 ,0.5,0.4,0.36,0.3,0.28,0.3 ,0.4,0.1 ,0.4,0.16,0.4,0.2,0.3,0.3,0.4 ,0.5],
            [0.1,0.1,0.1,0.18,0.2,0.1,0.2,0.16,0.2,0.3,0.04,0.3,0.12,0.18,0.3,0.13,0.1,0.4 ,0.2,0.5,0.1,0.3,0.16,0.05],
        ]
)

The 7 conditional probability tables are encoded in Python below.

**Base your subsequent computations on these objects**.

In [5]:
class RandomVariable(object):
    def __init__(self, name, parents, outcomes, pmf_array):
        assert isinstance(name, str)
        assert all(isinstance(_, RandomVariable) for _ in parents)
        assert isinstance(outcomes, tuple)
        assert all(isinstance(_, str) for _ in outcomes)
        assert isinstance(parents, tuple)
        assert isinstance(pmf_array, np.ndarray)
        keys = tuple(map(tuple, itertools.product(*[_.outcomes for _ in parents])))
        assert np.allclose(np.sum(pmf_array, 0), 1)
        expected_shape = (len(outcomes), len(keys))
        assert pmf_array.shape == expected_shape, (pmf_array.shape, expected_shape)
        pmfs = {k: {outcome: probability for outcome, probability in zip(outcomes, probabilities)} 
                for k, probabilities in zip(keys, pmf_array.T)}
        self.name, self.parents, self.outcomes, self.pmfs = name, parents, outcomes, pmfs

class BayesianNetwork(object):
    def __init__(self, *random_variables):
        assert all(isinstance(_, RandomVariable) for _ in random_variables)
        self.random_variables = random_variables
        
region_pmf_array = np.array([[0.2, 0.1, 0.5, 0.2]]).T
region = RandomVariable(
    name='region',
    parents=tuple(),
    outcomes=('north', 'south', 'east', 'west'), 
    pmf_array = region_pmf_array,
)

gender_pmf_array = np.array([[0.3, 0.7]]).T
gender = RandomVariable(
    name='gender',
    parents=tuple(),
    outcomes=('male', 'female'), 
    pmf_array = gender_pmf_array
)

hand_pmf_array = np.array([[0.9, 0.1]]).T
hand = RandomVariable(
    name='hand',
    parents=tuple(),
    outcomes=('left', 'right'), 
    pmf_array = hand_pmf_array
)

jacket_pmf_array = np.array(
        [
            [0.9,0.8,0.1,0.3,0.4,0.01,0.02,0.2],
            [0.08,0.17,0.03,0.35,0.05,0.01,0.2,0.08],
            [0.02,0.03,0.87,0.35,0.55,0.98,0.78,0.72],
        ]
    )
jacket = RandomVariable(
    name='jacket',
    parents=(region, gender),
    outcomes=('full', 'part', 'never'), 
    pmf_array = jacket_pmf_array
)

age_pmf_array = np.array(
        [
            [0.01,0.96,0.3],
            [0.98,0.03,0.5],
            [0.01,0.01,0.2],
        ]
    )
age = RandomVariable(
    name='age',
    parents=(jacket, ),
    outcomes=('new', 'worn', 'old'), 
    pmf_array = age_pmf_array
)

colour_pmf_array = np.array(
        [
            [0.9,0.83,0.17,0.3],
            [0.1,0.17,0.83,0.7],
        ]
    )

colour = RandomVariable(
    name='colour',
    parents=(gender, hand),
    outcomes=('black', 'white'),
    pmf_array = colour_pmf_array
)

vote = RandomVariable(
    name='vote',
    parents=(region, age, colour),
    outcomes=vote_outcomes,
    pmf_array = vote_pmf_array
)


election_model = BayesianNetwork(region, gender, hand, jacket, age, colour, vote)

### 1A (1 mark) Joint distribution of a subset

- Compute the joint distribution of **Jacket**, **Region** and **Gender**. Print in a tab separated format the resulting distribution.

In [6]:
# Traverse nodes
print('J \t R \t G \t Probability')
for jo in jacket.outcomes:
    for ro in region.outcomes:
        for go in gender.outcomes:
            p = jacket.pmfs[(ro,go)][jo]*region.pmfs[()][ro]*gender.pmfs[()][go]
            print('{}\t{}\t{}\t{:.6}'.format(jo,ro,go,p))

J 	 R 	 G 	 Probability
full	north	male	0.054
full	north	female	0.112
full	south	male	0.003
full	south	female	0.021
full	east	male	0.06
full	east	female	0.0035
full	west	male	0.0012
full	west	female	0.028
part	north	male	0.0048
part	north	female	0.0238
part	south	male	0.0009
part	south	female	0.0245
part	east	male	0.0075
part	east	female	0.0035
part	west	male	0.012
part	west	female	0.0112
never	north	male	0.0012
never	north	female	0.0042
never	south	male	0.0261
never	south	female	0.0245
never	east	male	0.0825
never	east	female	0.343
never	west	male	0.0468
never	west	female	0.1008
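
The same joint table can be cross-checked with NumPy broadcasting. The sketch below does not use the assignment objects: it re-enters the three tables from above as plain arrays (the names `p_r`, `p_g`, `p_j_given_rg` and `joint` are chosen here for illustration) and assumes the columns of $p(J|R,G)$ are ordered with the region varying slowest, as in the table.

```python
import numpy as np

# Priors copied from the tables above (illustrative names, not the assignment objects)
p_r = np.array([0.2, 0.1, 0.5, 0.2])   # north, south, east, west
p_g = np.array([0.3, 0.7])             # male, female

# p(J | R, G): rows are J = full, part, never; columns run over (R, G) pairs
p_j_given_rg = np.array([
    [0.9, 0.8, 0.1, 0.3, 0.4, 0.01, 0.02, 0.2],
    [0.08, 0.17, 0.03, 0.35, 0.05, 0.01, 0.2, 0.08],
    [0.02, 0.03, 0.87, 0.35, 0.55, 0.98, 0.78, 0.72],
]).reshape(3, 4, 2)                    # axes: (J, R, G)

# p(J, R, G) = p(J | R, G) p(R) p(G), broadcast over the (R, G) axes
joint = p_j_given_rg * p_r[None, :, None] * p_g[None, None, :]

print(joint[0, 0, 0])   # entry for (full, north, male)
print(joint.sum())      # a valid joint distribution sums to 1
```

Keeping the three axes separate, rather than flattening them, makes later marginalisation a single `sum` over the unwanted axes.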


### 1B1 (2 marks) Variable Ordering

1. Implement a function which determines an appropriate ordering of the variables for use in the following scheme:
    - For the first node R, draw a sample from p(R).
    - For each subsequent node, draw a sample from the conditional distribution $p(X \,|\, \text{parents}(X))$ where $\text{parents}(X)$ are the parents of the variable $X$ in the graphical model.
- Use your function to compute such an ordering and print the result in a human friendly format.

In [7]:
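# Determine an ordering in which every variable comes after all of its parents
def topologicalOrdering(network):
    done = [] # Variables already placed in the ordering
    npoints = len(network.random_variables) # The number of variables
    print("The generated ordering is:")
    
    # Sweep the network until every variable is placed, so that the result
    # does not rely on parents being listed before their children
    while len(done) < npoints:
        placed = len(done)
        for rv in network.random_variables:
            # A variable is ready once all of its parents are in the ordering
            if rv not in done and all(rvp in done for rvp in rv.parents):
                print(rv.name,end=" ")
                done.append(rv)
        # Without progress the loop would never end (cycle or missing parent)
        assert len(done) > placed, 'no valid ordering exists'

    return done

# Test (the function has its own name so the result does not shadow it)
ordering = topologicalOrdering(election_model)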

The generated ordering is:
region gender hand jacket age colour vote 
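
An ordering of this kind is a topological sort of the graph. As a point of comparison, here is a minimal sketch of Kahn's algorithm on a plain dictionary of parent names. The dictionary `parents` and the function `topological_order` are illustrative names and are not part of the assignment code. Unlike a single pass over the variables, it does not depend on the order in which the variables are listed.

```python
from collections import deque

# Parent lists of the election model, keyed by variable name (illustrative encoding)
parents = {
    'vote': ['region', 'age', 'colour'],
    'colour': ['gender', 'hand'],
    'age': ['jacket'],
    'jacket': ['region', 'gender'],
    'hand': [],
    'gender': [],
    'region': [],
}

def topological_order(parents):
    """Return the variable names so that every parent precedes its children."""
    # Number of parents of each node that have not been placed yet
    missing = {node: len(ps) for node, ps in parents.items()}
    # Invert the parent lists to find the children of each node
    children = {node: [] for node in parents}
    for node, ps in parents.items():
        for p in ps:
            children[p].append(node)

    # Nodes without unplaced parents can be emitted immediately
    ready = deque(node for node, n in missing.items() if n == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            missing[child] -= 1
            if missing[child] == 0:
                ready.append(child)

    if len(order) != len(parents):
        raise ValueError('the graph contains a cycle')
    return order

order = topological_order(parents)
print(order)
```

A network can have several valid orderings, so the output need not match the one printed above; any of them works for the sampling scheme.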

### 1B2 (2 marks) Sampling

1. Given the ordering you have determined above, implement the sampler described in the previous question. If you were unable to compute the ordering in the previous question, use ``ordering = (hand, region, gender, colour, jacket, age, vote)``.
2. Draw a single sample and print the result in a human friendly format.

In [15]:
# Draw a sample from a distribution by inverting its cumulative sum
# 
#   pmfs: a dictionary mapping each outcome to its probability
def sampling(pmfs):
    k = np.random.uniform()
    sumOfP = 0
    
    for key in pmfs:
        sumOfP = sumOfP + pmfs[key]
        if sumOfP>k:
            return key
    
    # Rounding can leave the cumulative sum just below k; return the last outcome
    return key

# Sampling with previously determined ordering
# 
#   ordering: an array containing random variables
#   nsamples: sampling times
def simpleSampler(ordering, nsamples):
    samples = dict()
    for i in range(nsamples):
        sample = dict()
        for rv in ordering:
            p = () # Save parents
            for rvp in rv.parents:
                p = p + (sample[rvp],)
            sample[rv]=sampling(rv.pmfs[p])
        samples[i] = sample
    return samples

# Test
nsample = 1
res = simpleSampler(ordering,nsample)
print('Ordering: ', [ rv.name for rv in ordering])
print('\nSamples')
print('id\t{}\t{}\t{}\t{}\t{}\t{}\t{}'.format(ordering[0].name,ordering[1].name,ordering[2].name,ordering[3].name,\
                                              ordering[4].name,ordering[5].name,ordering[6].name))
for i in range(nsample):
    print(i+1,end='\t')
    sample = {}
    for rv in res[i]:
        print(res[i][rv], end='\t')
    print()
    

Ordering:  ['region', 'gender', 'hand', 'jacket', 'age', 'colour', 'vote']

Samples
id	region	gender	hand	jacket	age	colour	vote
1	west	female	left	never	old	white	bernie	
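
The sampler above is an instance of ancestral sampling. The following sketch shows the same idea using NumPy's `Generator.choice` in place of a hand-written cumulative sum. To stay self-contained it only covers the sub-network of region, gender and jacket, and the tuple-based `network` encoding and the function name `ancestral_sample` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

# Each entry: (name, parent names, outcomes, table keyed by parent outcomes).
# Entries are listed so that parents come before their children.
network = [
    ('region', (), ('north', 'south', 'east', 'west'), {(): [0.2, 0.1, 0.5, 0.2]}),
    ('gender', (), ('male', 'female'), {(): [0.3, 0.7]}),
    ('jacket', ('region', 'gender'), ('full', 'part', 'never'), {
        ('north', 'male'): [0.9, 0.08, 0.02], ('north', 'female'): [0.8, 0.17, 0.03],
        ('south', 'male'): [0.1, 0.03, 0.87], ('south', 'female'): [0.3, 0.35, 0.35],
        ('east', 'male'): [0.4, 0.05, 0.55], ('east', 'female'): [0.01, 0.01, 0.98],
        ('west', 'male'): [0.02, 0.2, 0.78], ('west', 'female'): [0.2, 0.08, 0.72],
    }),
]

def ancestral_sample(network, rng):
    """Draw one joint sample by sampling each variable given its sampled parents."""
    sample = {}
    for name, parent_names, outcomes, table in network:
        # The parents were sampled earlier, so their values select the table row
        key = tuple(sample[p] for p in parent_names)
        sample[name] = str(rng.choice(outcomes, p=table[key]))
    return sample

sample = ancestral_sample(network, rng)
print(sample)
```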


### 1B3 (2 marks) Marginals

1. Calculate (and show in LaTeX) the marginal distribution for **Jacket**.
- Implement a function which computes the marginal distribution of each variable. It should make use of the ordering used by your sampler.
- Implement a function which takes a list of samples from the model and computes the empirical marginal distribution of each variable.
- Plot the theoretical and approximate marginals which you obtain along with the absolute percent error between the two, in the format below:

1. Calculate the marginal distribution for Jacket. $R$ and $G$ have no parents, so they are independent and $P(R=r,G=g)=P(R=r)P(G=g)$:
    \begin{align}
    P(J=\text{full}) &= \sum_{r,g} P(J=\text{full},R=r,G=g) \\
    &= \sum_{r,g} P(J=\text{full} \mid R=r, G=g)P(R=r,G=g) \\
    &= \sum_{r,g} P(J=\text{full} \mid R=r,G=g)P(R=r)P(G=g)\\
    &= 0.2\times0.3\times0.9  + 0.2\times0.7\times0.8 + 0.1\times0.3\times0.1 + \\
    &\qquad 0.1\times0.7\times0.3 + 0.5\times0.3\times0.4  + 0.5\times0.7\times0.01 +\\
    &\qquad 0.2\times0.3\times0.02  + 0.2\times0.7\times0.2\\
    &= 0.2827
    \end{align}
    Similarly, we calculate $P(J=\text{part})$ and $P(J=\text{never})$:
    \begin{align}
        P(J=\text{part}) &= 0.0882\\
        P(J=\text{never}) &= 0.6291
    \end{align}
    The three probabilities sum to 1, as required.
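
The hand calculation can be verified numerically as a matrix-vector product: the $3\times 8$ table $p(J|R,G)$ times the flattened prior $p(R)p(G)$. This is a small sketch with illustrative array names, with the table values copied from above, and is independent of the `RandomVariable` objects.

```python
import numpy as np

p_r = np.array([0.2, 0.1, 0.5, 0.2])   # north, south, east, west
p_g = np.array([0.3, 0.7])             # male, female

# p(J | R, G): rows are J = full, part, never; columns are (R, G) pairs
p_j_given_rg = np.array([
    [0.9, 0.8, 0.1, 0.3, 0.4, 0.01, 0.02, 0.2],
    [0.08, 0.17, 0.03, 0.35, 0.05, 0.01, 0.2, 0.08],
    [0.02, 0.03, 0.87, 0.35, 0.55, 0.98, 0.78, 0.72],
])

# R and G are independent roots, so p(R, G) = p(R) p(G).
# ravel() flattens row by row, which matches the column order of the table.
p_rg = np.outer(p_r, p_g).ravel()

p_j = p_j_given_rg @ p_rg
print(p_j)   # marginal of J in the order full, part, never
```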

In [10]:
# Calculate the marginal distribution of every variable in the ordering
# 
#   ordering: a list of random variables in which parents precede children
# Note: create a new copy of a dictionary each time; otherwise, the old one
#       will be changed
# 
# * Recursively calculating marginal distribution (commented out, not used) *
# import copy
# def marginalDistribution(randomVariable):
#     if len(randomVariable.parents) == 0:
#         return copy.deepcopy(randomVariable.pmfs) 
    
#     tmpPmfs = copy.deepcopy(randomVariable.pmfs)

#     # Joint probabilities
#     for condition in tmpPmfs:
#         for ii in range(len(condition)):
#             parentPmfs = marginalDistribution(randomVariable.parents[ii])
#             for key in tmpPmfs[condition]:
#                 tmpPmfs[condition][key] = tmpPmfs[condition][key]*parentPmfs[()][condition[ii]] # product of probabilities

#     # Adding joint probabilities to get marginal distributions
#     resPmfs = dict()
#     resPmfs[()] = dict()
#     tmpPmfsArrays = [ list(v.values()) for v in tmpPmfs.values()]
#     for ii in range(len(randomVariable.outcomes)):
#         resPmfs[()][randomVariable.outcomes[ii]] = sum([row[ii] for row in tmpPmfsArrays])
#     return resPmfs

import copy
def marginalDistribution(ordering):
    
    resPmfs = dict()
    for randomVariable in ordering:
        if len(randomVariable.parents) == 0:
            resPmfs[randomVariable] = randomVariable.pmfs[()]

        tmpPmfs = dict()
        # Traverse the ordering
        for rv in ordering:
            if len(tmpPmfs) == 0:
                tmpPmfs = copy.deepcopy(rv.pmfs)
            else:
                if len(rv.pmfs) == 1:
                    tmpPmfs2 = dict()
                    tmpPmfs2[()] = dict()

                    # Joint probabilities
                    for key in tmpPmfs[()]:
                        for key2 in rv.pmfs[()]:
                            if isinstance(key,tuple):
                                tmpPmfs2[()][key+(key2,)] = tmpPmfs[()][key]*rv.pmfs[()][key2]
                            else:
                                tmpPmfs2[()][(key,key2)] = tmpPmfs[()][key]*rv.pmfs[()][key2]
                    tmpPmfs = copy.deepcopy(tmpPmfs2)
                else:
                     if len(rv.pmfs) >1:
                        tmpPmfs2 = dict()
                        tmpPmfs2[()] = dict()

                        # Joint probabilities
                        for condition in rv.pmfs:
                            for key in rv.pmfs[condition]:
                                for key2 in tmpPmfs[()]:
                                    if all( t in key2 for t in condition):
                                        tmpPmfs2[()][key2+(key,)] = tmpPmfs2[()].get(key2+(key,),0)+\
                                        tmpPmfs[()][key2]*rv.pmfs[condition][key]
                        tmpPmfs = copy.deepcopy(tmpPmfs2)
            if rv == randomVariable:
                break

        # Marginal probabilities
        resPmfs[randomVariable] = dict()
        for label in randomVariable.outcomes:
            for value in tmpPmfs[()]:
                if label in value:
                    resPmfs[randomVariable][label] = resPmfs[randomVariable].get(label,0) + tmpPmfs[()][value]
    return resPmfs

# Calculate empirical marginal distribution
def empiricalMarginalDistribution(samples):
    res = dict()
    randomVariables = samples[0].keys() # Index 0 always exists, even for a single sample
    for randomVariable in randomVariables:
        res[randomVariable] = dict()
        for label in randomVariable.outcomes:
            p = np.mean([samples[i][randomVariable] == label for i in samples])
            res[randomVariable][label] = p
    return res

# # Test
pmfs = marginalDistribution(ordering)

nsample = 20000
samples = simpleSampler(ordering,nsample)
pmfs2 = empiricalMarginalDistribution(samples)

# Plot errors
for rv in pmfs2:
    print(rv.name)
    print('outcome exact approx error (%)')
    for key in pmfs2[rv]:
        print(key,end='\t')
        print('%.2f'%(pmfs[rv][key]),end='\t')
        print('%.2f'%(pmfs2[rv][key]),end = '\t')
        print('%.2f'%(abs(pmfs2[rv][key]-pmfs[rv][key])*100/pmfs[rv][key]),end='\t')
        print()
    print()

region
outcome exact approx error (%)
north	0.20	0.20	0.98	
south	0.10	0.10	0.25	
east	0.50	0.50	0.02	
west	0.20	0.20	0.80	

gender
outcome exact approx error (%)
male	0.30	0.30	0.10	
female	0.70	0.70	0.04	

hand
outcome exact approx error (%)
left	0.90	0.90	0.26	
right	0.10	0.10	2.35	

jacket
outcome exact approx error (%)
full	0.28	0.28	0.65	
part	0.09	0.09	0.96	
never	0.63	0.63	0.43	

age
outcome exact approx error (%)
new	0.28	0.28	0.17	
worn	0.59	0.60	0.86	
old	0.13	0.12	3.57	

colour
outcome exact approx error (%)
black	0.40	0.40	0.01	
white	0.60	0.60	0.01	

vote
outcome exact approx error (%)
bernie	0.15	0.16	4.61	
donald	0.35	0.34	0.69	
hillary	0.30	0.30	0.44	
ted	0.20	0.20	2.96	
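
In the tables above, rare outcomes such as `right` or `old` tend to show the largest percent error, which is what one would expect from independent sampling. Assuming the $n$ samples are independent, the empirical frequency of an outcome with probability $p$ has standard deviation $\sqrt{p(1-p)/n}$, so its error relative to $p$ is roughly $\sqrt{(1-p)/(pn)}$. The helper name below is hypothetical and only illustrates this formula.

```python
import math

def relative_standard_error(p, n):
    """Standard deviation of an empirical frequency, relative to the true p."""
    return math.sqrt((1.0 - p) / (p * n))

n = 20000  # number of samples drawn above
for p in (0.1, 0.5, 0.9):
    print('p = {:.1f}: about {:.2f}% relative error'.format(
        p, 100 * relative_standard_error(p, n)))
```

For $p=0.1$ this gives an expected relative error of about 2%, the same order as the errors observed for the rarest outcomes.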



### 1B4 (1 mark) Easy conditional probabilities

Compute $p(X \,|\, G=\text{female})$ for all $X$ other than $G$,
1. Approximately, using your sampler
- Exactly, using your marginal calculating function from the previous question. Hint: what happens if you set  $p(G=\text{female})=1$ in your model?
- Plot the results side by side in the same format as the previous question.
- State for which variables other than $G$ the theoretical scheme above can be used to compute such conditionals, and why.

In [11]:
# Calculate the empirical distribution of each variable given G=female
def empiricalProbabilities(samples):
    res = dict()
    randomVariables = [ k for k in samples[0].keys() if k != gender]
    for randomVariable in randomVariables:
        res[randomVariable] = dict()
        for label in randomVariable.outcomes:
            p = np.mean([samples[i][randomVariable]==label for i in samples if samples[i][gender]=='female'])
            res[randomVariable][label] = p
    return res

pmfs2 = empiricalProbabilities(samples)

tmpOrdering = copy.deepcopy(ordering)
tmpOrdering[ordering.index(gender)].pmfs[()] = {'male':0,'female':1}
pmfs = marginalDistribution(tmpOrdering)

# Plot errors
for rv in pmfs2:
    print(rv.name)
    print('outcome exact approx error (%)')
    for key in pmfs2[rv]:
        print(key,end='\t')
        print('%.2f'%(pmfs[tmpOrdering[ordering.index(rv)]][key]),end='\t')
        print('%.2f'%(pmfs2[rv][key]),end = '\t')
        print('%.2f'%(abs(pmfs2[rv][key]-pmfs[tmpOrdering[ordering.index(rv)]][key])*100\
                      /pmfs[tmpOrdering[ordering.index(rv)]][key]),end='\t')
        print()
    print()

region
outcome exact approx error (%)
north	0.20	0.20	0.51	
south	0.10	0.10	0.74	
east	0.50	0.50	0.11	
west	0.20	0.20	0.15	

hand
outcome exact approx error (%)
left	0.90	0.90	0.27	
right	0.10	0.10	2.47	

jacket
outcome exact approx error (%)
full	0.24	0.24	0.78	
part	0.09	0.09	0.92	
never	0.68	0.67	0.15	

age
outcome exact approx error (%)
new	0.29	0.29	1.62	
worn	0.57	0.58	1.97	
old	0.14	0.13	4.72	

colour
outcome exact approx error (%)
black	0.18	0.18	0.24	
white	0.82	0.82	0.05	

vote
outcome exact approx error (%)
bernie	0.13	0.14	4.11	
donald	0.39	0.39	0.18	
hillary	0.27	0.27	1.33	
ted	0.21	0.20	3.95	



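4. By the definition of conditional probability,
\begin{align}
p(X\mid G=female) = \dfrac{p(X,G=female)}{p(G=female)}
\end{align}
$G$ has no parents in the graphical model, so $p(G)$ enters the factorisation of the joint distribution as a stand-alone factor. We can therefore write $p(X,G=female)$ as $A\,p(G=female)$, where $A$ is the sum, over the remaining variables, of the product of all other factors evaluated at $G=female$. The factor $p(G=female)$ then cancels:
$$
\dfrac{p(X,G=female)}{p(G=female)}=\dfrac{Ap(G=female)}{p(G=female)}=A
$$
Therefore, replacing $p(G=female)$ with 1 (and $p(G=male)$ with 0) and computing the marginal of $X$ gives exactly $A = p(X\mid G=female)$. The same argument holds for any variable without parents, so the variables other than $G$ that can be conditioned on in this way are $H$ and $R$.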
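
The argument can be checked numerically on the jacket variable. The sketch below uses illustrative array names, with the table values copied from above, and compares the two routes: clamping the prior of the root $G$ to a point mass, and applying the definition of conditional probability directly.

```python
import numpy as np

p_r = np.array([0.2, 0.1, 0.5, 0.2])   # north, south, east, west
p_g = np.array([0.3, 0.7])             # male, female

# p(J | R, G) with axes (J, R, G)
p_j_given_rg = np.array([
    [0.9, 0.8, 0.1, 0.3, 0.4, 0.01, 0.02, 0.2],
    [0.08, 0.17, 0.03, 0.35, 0.05, 0.01, 0.2, 0.08],
    [0.02, 0.03, 0.87, 0.35, 0.55, 0.98, 0.78, 0.72],
]).reshape(3, 4, 2)

def jacket_marginal(prior_g):
    """p(J) = sum_{r,g} p(J | r, g) p(r) p(g) for a given prior over G."""
    return np.einsum('jrg,r,g->j', p_j_given_rg, p_r, prior_g)

# Scheme from the hint: clamp the prior of the root G to a point mass on female
clamped = jacket_marginal(np.array([0.0, 1.0]))

# Definition: p(J | G=female) = p(J, G=female) / p(G=female)
joint_j_female = np.einsum('jr,r->j', p_j_given_rg[:, :, 1], p_r) * p_g[1]
by_definition = joint_j_female / p_g[1]

print(clamped)
print(np.allclose(clamped, by_definition))
```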


### 1B5 (3 marks) General conditional probabilities

1. Write down the expression of the joint probability $p(R,G,H,J,A,C,V)$ in terms of the conditional probabilities in the graphical model.
- Derive $p(G = male \,|\, V = Donald)$ in terms of the conditional probabilities of the graphical model.
- Compute and display in a human friendly format the conditional distributions $p(G=g \,|\, V=v)$ for all genders $g$ and votes $v$, by naively marginalising the other variables (other than $G$ and $V$, that is).

### Solution description

1. According to the graphical model, we can write 
   $$
   p(R,G,H,J,A,C,V) = p(R)p(G)p(H)p(J|R,G)p(A|J)p(C|G,H)p(V|R,A,C)
   $$
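2. Derive $p(G = male | V = Donald)$. We apply the definition of conditional probability and obtain the numerator by marginalising the joint distribution over all variables other than $G$ and $V$:
   \begin{align}
   &p(G = male | V = Donald)\\
   &= \dfrac{p(G = male, V = Donald)}{p(V = Donald)}\\
   &= \dfrac{\sum_{r,h,j,a,c}p(R=r, G =male, H=h,J=j,A=a,C=c, V = Donald)}{p(V=Donald)}\\
   &= \dfrac{\sum_{r,h,j,a,c}p(G=male)p(R=r)p(H=h)p(J=j|R=r,G=male)p(A=a|J=j)p(C=c|G=male,H=h)p(V=Donald|R=r,A=a,C=c)}{p(V=Donald)}
   \end{align}
   The denominator is obtained in the same way, by additionally summing over the gender:
   \begin{align}
   p(V=Donald) = \sum_{g}\sum_{r,h,j,a,c}p(G=g)p(R=r)p(H=h)p(J=j|R=r,G=g)p(A=a|J=j)p(C=c|G=g,H=h)p(V=Donald|R=r,A=a,C=c)
   \end{align}
3. The conditional distributions $p(G=g | V=v)$ for all genders and votes are computed by the code below.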

In [12]:
# Marginal distribution
pmfs = marginalDistribution(ordering)

# Traverse nodes and naively marginalising other variable
print(' P(G|V) \t V=bernie \t V=donald \t V=hillary \t V=ted')
for go in gender.outcomes:
    print(' G={}'.format(go),end='       ')
    for vo in vote.outcomes:
        p = 0
        for co in colour.outcomes:
            for ao in age.outcomes:
                for jo in jacket.outcomes:
                    for ho in hand.outcomes:
                        for ro in region.outcomes:
                            p = p + vote.pmfs[(ro,ao,co)][vo]*colour.pmfs[(go,ho)][co]*age.pmfs[(jo,)][ao]*\
                            jacket.pmfs[(ro,go)][jo]*hand.pmfs[()][ho]*region.pmfs[()][ro]
        print(p*gender.pmfs[()][go]/pmfs[vote][vo],end='\t')
    print()

 P(G|V) 	 V=bernie 	 V=donald 	 V=hillary 	 V=ted
 G=male       0.398132990181	0.214321628636	0.363555529929	0.278437575417	
 G=female       0.601867009819	0.785678371364	0.636444470071	0.721562424583	
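
The nested loops can also be written as a single tensor contraction. The sketch below is an alternative to the loop-based solution and does not replace it: it re-enters all seven tables as NumPy arrays (the array names are illustrative, the values and their ordering are copied from the encoding above) and uses `np.einsum` to multiply the factors and sum out $R$, $H$, $J$, $A$ and $C$ in one call.

```python
import numpy as np

p_r = np.array([0.2, 0.1, 0.5, 0.2])
p_g = np.array([0.3, 0.7])
p_h = np.array([0.9, 0.1])   # same order as the `hand` object above
p_j = np.array([
    [0.9, 0.8, 0.1, 0.3, 0.4, 0.01, 0.02, 0.2],
    [0.08, 0.17, 0.03, 0.35, 0.05, 0.01, 0.2, 0.08],
    [0.02, 0.03, 0.87, 0.35, 0.55, 0.98, 0.78, 0.72],
]).reshape(3, 4, 2)          # axes (J, R, G)
p_a = np.array([
    [0.01, 0.96, 0.3],
    [0.98, 0.03, 0.5],
    [0.01, 0.01, 0.2],
])                           # axes (A, J)
p_c = np.array([
    [0.9, 0.83, 0.17, 0.3],
    [0.1, 0.17, 0.83, 0.7],
]).reshape(2, 2, 2)          # axes (C, G, H)
p_v = np.array([
    [0.1,0.1,0.4,0.02,0.2,0.1,0.1,0.04,0.2,0.1,0.1 ,0.1,0.4 ,0.1 ,0.1,0.1 ,0.1,0.04,0.3,0.2,0.1,0.3,0.34,0.35],
    [0.3,0.4,0.2,0.5 ,0.1,0.2,0.1,0.5 ,0.1,0.2,0.5 ,0.3,0.2 ,0.42,0.2,0.67,0.4,0.4 ,0.1,0.1,0.5,0.1,0.1 ,0.1],
    [0.5,0.4,0.3,0.3 ,0.5,0.6,0.6,0.3 ,0.5,0.4,0.36,0.3,0.28,0.3 ,0.4,0.1 ,0.4,0.16,0.4,0.2,0.3,0.3,0.4 ,0.5],
    [0.1,0.1,0.1,0.18,0.2,0.1,0.2,0.16,0.2,0.3,0.04,0.3,0.12,0.18,0.3,0.13,0.1,0.4 ,0.2,0.5,0.1,0.3,0.16,0.05],
]).reshape(4, 4, 3, 2)       # axes (V, R, A, C)

# p(G, V): multiply all seven factors and sum out R, H, J, A and C
p_gv = np.einsum('r,g,h,jrg,aj,cgh,vrac->gv', p_r, p_g, p_h, p_j, p_a, p_c, p_v)

# p(G | V) = p(G, V) / p(V); each column (fixed V) sums to 1
p_g_given_v = p_gv / p_gv.sum(axis=0, keepdims=True)
print(np.round(p_g_given_v, 4))
```

Rows correspond to male and female, columns to bernie, donald, hillary and ted, as in the table printed above.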


### 1B6 (2 marks) Variable elimination

Denote the graphical model considered thus far $\mathcal{M}$.

1. Derive $p(R,G,J,A,C,V)$ by marginalising over $H$ in $\mathcal{M}$. 
- Describe how the structure (connectivity) of the new graphical model (call it $\mathcal{M}_H$) over all variables other than $H$ changes after eliminating $H$ in this way.
- Describe which conditional(s) in the new graphical model differ from the original model.
- Encode the $\mathcal{M}_H$ in python similarly to $\mathcal{M}$. 

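### Solution description
1. \begin{align}
    p(R,G,J,A,C,V) &= \sum_{h} p(R,G,H=h,J,A,C,V)\\
    &= \sum_{h} p(R) p(G) p(H=h) p(J|R,G) p(A|J)p(C|G,H=h)p(V|R,A,C)\\
    &= p(R) p(G) p(J|R,G) p(A|J)p(V|R,A,C) \sum_{h} p(H=h)p(C|G,H=h)\\
    &= p(R) p(G) p(J|R,G) p(A|J) p(C|G) p(V|R,A,C)
    \end{align}
    where $p(C|G) = \sum_{h} p(H=h)\,p(C|G,H=h)$, because $H$ is independent of $G$.
2. Eliminating $H$ removes the node $H$ and the edge $H \to C$ from the graph. As a result, $G$ is the only remaining parent of $C$; all other edges are unchanged.
3. The original $p(C|G,H)$ is replaced by $p(C|G)$; all other conditionals are the same as in $\mathcal{M}$.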

In [13]:
# p(C|G) = sum_h p(H=h) p(C|G,H=h), computed from the original tables;
# rows index C (black, white), columns index G (male, female)
new_colour_pmf_array = np.array(
        [
            [sum(hand.pmfs[()][ho]*colour.pmfs[(go,ho)][co] for ho in hand.outcomes)
             for go in gender.outcomes]
            for co in colour.outcomes
        ]
    ) # approximately [[0.893, 0.183], [0.107, 0.817]]

newColour = RandomVariable(
    name='colour',
    parents=(gender,),
    outcomes=('black', 'white'),
    pmf_array = new_colour_pmf_array
)

# Vote keeps its table, but its colour parent must be the node of the new model
newVote = RandomVariable(
    name='vote',
    parents=(region, age, newColour),
    outcomes=vote_outcomes,
    pmf_array = vote_pmf_array
)

newElectionModel = BayesianNetwork(region, gender, jacket, age, newColour, newVote)

### 1B6 (3 marks) General estimation of conditional probabilities (revisited)

1. As you did earlier, compute and display in a human friendly format the conditional distributions $p(G=g \,|\, V=v)$ for all genders $g$ and votes $v$, by naively marginalising the other variables (other than $G$ and $V$, that is). This time however, do so using $\mathcal{M}_H$ rather than $\mathcal{M}$.
- Quantify the computational advantages of using $\mathcal{M}_H$ in this way rather than $\mathcal{M}$.
- Which variable (or variables) would be the best to eliminate in this way, in terms of the aforementioned computational advantages, and why?
- Pick a (best) variable from the previous question and call it $X$. Assuming we have eliminated $X$, which would be the "best" variable to subsequently eliminate in a similar fashion?

In [14]:
# Traverse nodes and naively marginalise the other variables
print(' P(G|V) \t V=bernie \t V=donald \t V=hillary \t V=ted')
for go in gender.outcomes:
    print(' G={}'.format(go),end='       ')
    for vo in newVote.outcomes:
        p = 0
        for co in newColour.outcomes:
            for ao in age.outcomes:
                for jo in jacket.outcomes:
                    for ro in region.outcomes:
                        p = p + newVote.pmfs[(ro,ao,co)][vo]*newColour.pmfs[(go,)][co]*age.pmfs[(jo,)][ao]*\
                        jacket.pmfs[(ro,go)][jo]*region.pmfs[()][ro]
        print(p*gender.pmfs[()][go]/pmfs[vote][vo],end='\t')
    print()

 P(G|V) 	 V=bernie 	 V=donald 	 V=hillary 	 V=ted
 G=male       0.398132990181	0.214321628636	0.363555529929	0.278437575417	
 G=female       0.601867009819	0.785678371364	0.636444470071	0.721562424583	


2. We quantify the computational cost of computing $p(G=g \mid V=v)$ by the number of additions and multiplications needed to marginalise the other variables. For $\mathcal{M}$, the marginalised variables ($R$, $H$, $J$, $A$ and $C$) have 4, 2, 3, 3 and 2 outcomes respectively, so the sum has $4\times2\times3\times3\times2=144$ terms and requires $143$ additions. Each term is a joint probability, a product of 7 factors, which requires 6 multiplications. Thus, the total number of operations is $143+144\times6 = 1007$.
For $\mathcal{M}_H$, we marginalise the variables $R,J,A,C$, which gives $4\times3\times3\times2=72$ terms and $71$ additions. Each joint probability is a product of 6 factors and requires 5 multiplications, so the total number of operations is $71+72\times5=431$. The latter is only $431/1007\times 100\%=42.8\%$ of the former.

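3. The best variable to eliminate is $R$, because $R$ has the most outcomes (four). $G$ and $V$ are the query variables and cannot be eliminated, and eliminating a variable with $k$ outcomes divides the number of terms in the naive sum by $k$. Eliminating $R$ from $\mathcal{M}$ reduces the sum from $144$ terms to $2\times3\times3\times2 = 36$ terms, which is the largest reduction a single variable can give.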
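
The operation counts in item 2 follow one pattern: a sum with $T$ terms, each a product of $F$ factors, needs $T-1$ additions and $T(F-1)$ multiplications. A small sketch of this bookkeeping, where the function name `naive_cost` is hypothetical:

```python
from math import prod

def naive_cost(outcome_counts, n_factors):
    """Additions plus multiplications for one naive marginalisation."""
    n_terms = prod(outcome_counts)             # one term per joint configuration
    additions = n_terms - 1
    multiplications = n_terms * (n_factors - 1)
    return additions + multiplications

cost_m = naive_cost([4, 2, 3, 3, 2], n_factors=7)   # sum over R, H, J, A, C
cost_mh = naive_cost([4, 3, 3, 2], n_factors=6)     # sum over R, J, A, C
print(cost_m, cost_mh, round(100 * cost_mh / cost_m, 1))
```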


## Part 2: Theory

### 2A (3 marks) Functions of random variables

$u$ and $v$ are independently sampled from the standard uniform distribution on the unit interval, $[0,1]$. 

If $s=(2u-1)^2+(2v-1)^2 \geq 1$ then $x$ is sampled from the standard normal distribution, $\mathcal{N}(0,1)$. Otherwise $x=(2u-1)\sqrt{-2 \log(s)/s}$.

How is $x$ distributed?

### Solution description
Because $u,v \in [0,1]$, we get $2u-1,2v-1 \in [-1,1]$. Sampling a pair $(u,v)$ is equivalent to sampling a point $(u',v')$ uniformly in the square $\{(u',v')| u'=2u-1 \in [-1,1], v'=2v-1 \in [-1,1]\}$, and $s=u'^2+v'^2$ is the squared distance between the point and the origin.

The probability density function of $x$ can be computed as:
$$
p(x) = \int_{s\geq 1} p(x\mid s)p(s)ds+\int_{0\leq s<1}p(x\mid s)p(s)\,ds
$$
Firstly, we compute the first part $\int_{s\geq 1} p(x\mid s)p(s)ds$. For $s \geq 1$, $x$ is drawn from $\mathcal{N}(0,1)$ regardless of the value of $s$, so $p(x\mid s) = p(x \mid s\geq 1)$ can be taken out of the integral:
\begin{align}
\int_{s\geq 1} p(x\mid s)p(s)ds &= p(x \mid s\geq 1) \int_{s\geq 1} p(s)ds\\
& =  p(x \mid s\geq 1)\,\dfrac{Area(\{(u',v')|u'\in [-1,1], v'\in [-1,1], u'^2+v'^2 \geq 1 \})}{Area(\{(u',v')|u'\in [-1,1], v'\in [-1,1]\})}\\
& = p(x \mid s\geq 1) \dfrac{4-\pi}{4}\\
& = \dfrac{4-\pi}{4} \dfrac{1}{\sqrt{2\pi}} e^{-x^2/2}
\end{align}
"Area$(\{(u',v')|u'\in [-1,1], v'\in [-1,1], u'^2+v'^2 \geq 1 \})$" denotes the area of the part of the square $\{(u',v')| u'\in [-1,1], v'\in [-1,1]\}$ that lies outside the circle $\{(u',v')|u'^2+v'^2=1\}$. "Area$(\{(u',v')|u'\in [-1,1], v'\in [-1,1] \})$" is the area of the whole square.

Secondly, we calculate $\int_{0\leq s<1}p(x\mid s)p(s)\,ds$. For $0\leq s<1$, the event $S\leq s$ corresponds to the disc of radius $\sqrt{s}$, which lies entirely inside the square, so $P(S\leq s)=\pi s/4$:
\begin{align}
\int_{0\leq s<1}p(x\mid s)p(s)\,ds &= \int_{0\leq s<1} p(x\mid s)\dfrac{d\,P(S\leq s)}{ds}\,ds\\
&= \int_{0\leq s<1} p(x\mid s)\dfrac{\pi}{4}\,ds
\end{align}

The density $p(x\mid s)$ is non-zero only when $-\sqrt{s} \leq \dfrac{x}{\sqrt{-2log(s)/s}} \leq \sqrt{s}$, i.e., $x^2 \leq -2log(s)$, i.e., $s \leq e^{-x^2/2}$. So, we get:

\begin{align}
&\int_{0\leq s<1} p(x\mid s)\dfrac{\pi}{4}\,ds\\
& = \int_{0\leq s \leq e^{-x^2/2}} p(x\mid s)\dfrac{\pi}{4}\,ds\\
& = \int_{0\leq s \leq e^{-x^2/2}} p(u'=x/\sqrt{-2log(s)/s}\mid s)/\sqrt{-2log(s)/s}\dfrac{\pi}{4}\,ds
\end{align}
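Here, $p(u'=x/\sqrt{-2log(s)/s}\mid s)$ is the probability density function of the x coordinate of a point uniformly distributed on the circle of radius $\sqrt{s}$: given $s$, the point $(u',v')$ is uniformly distributed on that circle by the rotational symmetry of the uniform distribution on the disc. This density equals the length of the two arcs whose x coordinates fall in an infinitesimal interval around $u'=x/\sqrt{-2log(s)/s}$, divided by the perimeter of the circle, which gives $\dfrac{\sqrt{1+u'^2/(s-u'^2)}}{\pi\sqrt{s}}$.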
\begin{align}
&\int_{0\leq s \leq e^{-x^2/2}} p(u'=x/\sqrt{-2log(s)/s}\mid s)/\sqrt{-2log(s)/s}\dfrac{\pi}{4}\,ds\\
& = \int_{0\leq s \leq e^{-x^2/2}} \dfrac{\sqrt{1+u'^2/(s-u'^2)}}{\pi \sqrt{s}\sqrt{-2log(s)/s}}\dfrac{\pi}{4}\,ds\\
& = \int_{0\leq s \leq e^{-x^2/2}} \dfrac{\sqrt{s/(s-u'^2)}}{4\sqrt{-2log(s)}}\,ds\\
& = \int_{0\leq s \leq e^{-x^2/2}} \dfrac{1}{4 \sqrt{-2log(s)-x^2}} \,ds
\end{align}


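By using http://www.wolframalpha.com, we get $\int \dfrac{1}{4 \sqrt{-2\log(s)-x^2}}ds = -\dfrac{1}{4}\sqrt{\pi/2}\,e^{-x^2/2}\operatorname{erf}\left[\sqrt{-x^2/2-\log(s)}\right]+\text{constant}$. At the upper limit $s=e^{-x^2/2}$ the argument of $\operatorname{erf}$ is $0$, and as $s\to 0^+$ it tends to infinity, where $\operatorname{erf}$ tends to $1$, so
\begin{align}
&\int_{0\leq s \leq e^{-x^2/2}} \dfrac{1}{4 \sqrt{-2\log(s)-x^2}}ds \\
&= -\dfrac{1}{4}\sqrt{\pi/2}\,e^{-x^2/2}\operatorname{erf}[0]+\dfrac{1}{4}\sqrt{\pi/2}\,e^{-x^2/2}\lim_{s\to 0^+}\operatorname{erf}\left[\sqrt{-x^2/2-\log(s)}\right]\\
&=\dfrac{1}{4}\sqrt{\pi/2}\,e^{-x^2/2}
\end{align}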

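Finally, since $\dfrac{1}{4}\sqrt{\pi/2} = \dfrac{\pi}{4}\dfrac{1}{\sqrt{2\pi}}$, we get:
$$
p(x) = \int_{s\geq 1} p(x\mid s)p(s)ds+\int_{0\leq s<1}p(x\mid s)p(s)\,ds = \dfrac{4-\pi}{4} \dfrac{1}{\sqrt{2\pi}} e^{-x^2/2}+\dfrac{1}{4}\sqrt{\pi/2}e^{-x^2/2} = \dfrac{1}{\sqrt{2\pi}}e^{-x^2/2}
$$
Therefore $x$ follows the standard normal distribution, $x \sim \mathcal{N}(0,1)$.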
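
The result can be sanity-checked by simulation. The sketch below implements the sampling scheme from the question with NumPy and compares a few summary statistics with those of $\mathcal{N}(0,1)$. The sample size and seed are arbitrary choices. This is an empirical check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

u = rng.uniform(size=n)
v = rng.uniform(size=n)
up = 2 * u - 1                      # u' in the derivation above
s = up ** 2 + (2 * v - 1) ** 2

# Points with s >= 1 get a direct standard normal draw
x = rng.standard_normal(n)

# Points strictly inside the unit circle use the transformation instead;
# s == 0 has probability zero and is excluded to avoid log(0)
inside = (s > 0) & (s < 1)
x[inside] = up[inside] * np.sqrt(-2 * np.log(s[inside]) / s[inside])

print('fraction inside the circle:', inside.mean())   # theory: pi / 4
print('mean:', x.mean(), 'variance:', x.var())        # theory: 0 and 1
print('P(|x| < 1):', np.mean(np.abs(x) < 1))          # about 0.6827 for N(0, 1)
```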