# Homework 4

## Graphical Parameters and Model Structure

In the previous homework, you performed queries on a graphical model of possible murders and murder weapons. Now, you will estimate model parameters and structure using data.   

As a reminder, the joint probability distribution is:

$$p(B,C,W,MO,M)$$     

where the letters indicate the following variables:   
$B = $ butler committed the crime, {not murderer, murderer},   
$C = $ cook committed the crime, {not murderer, murderer},    
$W = $ choice of weapon, {poison, knife}, conditional on butler and cook,  
$MO = $ motive for the murder, {no motive, has motive}, conditional on butler and cook,   
$M = $ murderer {butler or cook, third party alone}.    

We have determined that the joint distribution can be factored:

$$p(B,C,W,MO,M) = p(B)\ p(C)\ p(W\ |\ B, C)\ p(MO\ |\ B,C)\ p(M\ |\ B,C,MO,W)$$  
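
Each factor on the right-hand side names one node and the variables it is conditioned on, so the factorization can be read directly as a map from each node to its parents. The sketch below is illustrative only: the `parents` dictionary and the `edges_from_parents` helper are hypothetical names introduced here, not part of pgmpy.

```python
# Parent sets read off the factorization: one factor p(X | parents(X)) per node X.
parents = {
    "B": [],
    "C": [],
    "W": ["B", "C"],
    "MO": ["B", "C"],
    "M": ["B", "C", "MO", "W"],
}

def edges_from_parents(parent_map):
    """Return the directed edges (parent, child) implied by a parent map."""
    return [(p, child) for child, ps in parent_map.items() for p in ps]

edges = edges_from_parents(parents)
print(len(edges), "directed edges")
for edge in edges:
    print(edge)
```

A list of (parent, child) tuples like this is the usual way to describe the edges of a directed model.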

A graph of the model is shown below. 

<img src="MurderDirected.JPG" alt="Drawing" style="width:500px; height:300px"/>
<center> DAG for murder evidence </center>

Notice that the skeleton of this graph does not have a tree structure. This fact will limit how well estimation algorithms will work, particularly for graph structure. Keep this fact in mind as you proceed. 

As a first step, execute the code in the cell below to simulate 25 cases from the Bayesian directed model you created previously. Examine the code to see the CPD tables for this simulation. 

> **Note:** You must change the name of the pickled model file in the `open` statement to match the file name you are using. 

In [None]:
## Simulate the binary tables
import numpy as np
import numpy.random as nr
import pandas as pd
from pgmpy.sampling import BayesianModelSampling
from pgmpy.models import BayesianModel
from pgmpy.estimators import MaximumLikelihoodEstimator, BayesianEstimator
from pgmpy.inference import BeliefPropagation
import pickle

## Load the model from a file
with open('my_model.pickle', 'rb') as pkl:
    murder_model = pickle.load(pkl)
print('The model loaded correctly: {}'.format(murder_model.check_model()))

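## Simulate values from the DAG
def simulate_from_DAG(model, nsamps=25, set_seed=234):
    ## Seed NumPy's global generator so the simulated cases are reproducible
    nr.seed(set_seed)
    simulation = BayesianModelSampling(model)
    ## Forward sampling: parents are drawn first, then children given their parents
    return simulation.forward_sample(size=nsamps, return_type='dataframe')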


nsamps = 25
samples_25 = simulate_from_DAG(murder_model, nsamps = nsamps)
samples_25
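
To make the idea of forward sampling concrete, here is a minimal standard-library sketch on a three-node fragment (B, C, W). The CPD values are hypothetical and chosen only for illustration; they are not the values stored in the pickled model, and `forward_sample` here is a stand-alone toy function, not the pgmpy method.

```python
import random

# Hypothetical CPD values, chosen only for illustration (NOT the homework model).
p_b = 0.6                      # P(B = 1)
p_c = 0.2                      # P(C = 1)
p_w_given_bc = {               # P(W = 1 | B, C)
    (0, 0): 0.5, (0, 1): 0.2,
    (1, 0): 0.8, (1, 1): 0.6,
}

def forward_sample(n, seed=234):
    """Draw n cases by sampling the parents first, then the child given its parents."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n):
        b = int(rng.random() < p_b)
        c = int(rng.random() < p_c)
        w = int(rng.random() < p_w_given_bc[(b, c)])
        cases.append({"B": b, "C": c, "W": w})
    return cases

toy_samples = forward_sample(25)
print(toy_samples[:3])
```

Fixing the seed makes the draw reproducible, which is why the simulation function above takes a seed argument.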

## Part 1: Parameter Estimation

With the dataset generated you will now estimate the parameters of the graphical model using both maximum likelihood and Bayesian methods. 

As a first step execute the code in the cell below to load the packages you will need.

In [None]:
from pgmpy.models import BayesianModel
from pgmpy.estimators import MaximumLikelihoodEstimator, BayesianEstimator
from pgmpy.inference import BeliefPropagation
from pgmpy.estimators import HillClimbSearch, BicScore, K2Score, StructureScore

Now, create and execute the code in the cell below to estimate and display the parameters of the CPDs using the **maximum likelihood method** from the simulated data.
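
For discrete CPDs, maximum likelihood estimation amounts to computing relative frequencies within each parent configuration. The sketch below shows this on a hypothetical toy dataset with one binary parent and one binary child; `toy_cases` and `mle_cpd` are illustrative names, not pgmpy functions.

```python
from collections import Counter

# Hypothetical toy data: (parent value, child value) pairs, not the homework samples.
toy_cases = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 1)]

def mle_cpd(cases, child_states=(0, 1)):
    """Maximum likelihood CPD: P(child | parent) = count(parent, child) / count(parent)."""
    pair_counts = Counter(cases)
    parent_counts = Counter(parent for parent, _ in cases)
    return {
        parent: {s: pair_counts[(parent, s)] / total for s in child_states}
        for parent, total in parent_counts.items()
    }

cpd = mle_cpd(toy_cases)
for parent, column in sorted(cpd.items()):
    print("parent =", parent, column)
```

In this toy data the combination (parent = 1, child = 0) never occurs, so its estimate is exactly 0.0. With few cases and many parent configurations, unobserved combinations like this are common.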

Examine these results and answer the following questions:
1. How many parameters are there in the CPD tables?
2. Keeping in mind that the probabilities in each column of a CPT must add to 1, how many free parameters must be fit for this model?
3. Given the number of parameters, and the sample size of 25 cases, is this MLE problem under-determined and why? 
4. Notice the number of 0.0 and 1.0 parameter values. Is this evidence of an under-fit model, and why? 
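
When counting parameters, it can help to compute the table sizes mechanically. The following sketch does so for a hypothetical three-node network (not the murder model); `cpd_sizes`, the node names, and the cardinalities are all assumptions made for illustration.

```python
from math import prod

def cpd_sizes(cardinality, parents):
    """Return {node: (table entries, free parameters)} for a discrete Bayesian network."""
    sizes = {}
    for node, card in cardinality.items():
        # One column per joint configuration of the parents (1 if there are none).
        n_columns = prod(cardinality[p] for p in parents[node])
        entries = card * n_columns
        # Each column sums to 1, which removes one free entry per column.
        free = (card - 1) * n_columns
        sizes[node] = (entries, free)
    return sizes

# Hypothetical three-node example (not the murder model).
cardinality = {"Rain": 2, "Sprinkler": 2, "WetGrass": 3}
parents = {"Rain": [], "Sprinkler": ["Rain"], "WetGrass": ["Rain", "Sprinkler"]}

sizes = cpd_sizes(cardinality, parents)
for node, (entries, free) in sizes.items():
    print(node, "entries:", entries, "free parameters:", free)
```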

ANS 1:      
ANS 2:        
ANS 3:      
ANS 4:     

Next, you will estimate the CPD parameters using a **Bayesian estimator**. For this first estimate use the following moderately weak and uniform prior distributions (pseudo counts):

- C: [3,3]
- B: [3,3]
- W: [[3,3,3,3], [3,3,3,3]]
- MO: [[3,3,3,3], [3,3,3,3]]
- M: [[2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2], [2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2], [2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2]]
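
Pseudo counts act like imaginary observations added to each cell of a CPD column before normalizing. The sketch below assumes the estimate is the posterior mean under a Dirichlet prior; this is an assumption about how the estimator behaves, so check the pgmpy documentation for the exact computation. The function name and the observed counts are hypothetical.

```python
def bayes_column(counts, pseudo_counts):
    """Posterior-mean estimate for one CPD column under a Dirichlet prior."""
    total = sum(counts) + sum(pseudo_counts)
    return [(n + a) / total for n, a in zip(counts, pseudo_counts)]

# Hypothetical observed counts for one binary column: state 0 was never observed.
observed = [0, 5]

print("no pseudo counts (MLE):", bayes_column(observed, [0, 0]))
print("pseudo counts [3, 3]:  ", bayes_column(observed, [3, 3]))
```

With pseudo counts of 3 per state, the unobserved state receives probability 3/11 rather than 0, which is the smoothing effect the prior is meant to provide.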

In the cell below create and execute the code to perform Bayes estimation and display the CPD parameters. 

Focus your attention on the M and W CPDs. In terms of extreme values, how does the table computed with the Bayesian method compare to the table computed with MLE? Is this behavior evidence of the regularization property of the Bayesian estimator?

ANS 1:       
ANS 2:    

Next, verify that the independencies of all the variables in your model are correct using the `local_independencies` method. Create and execute the code in the cell below to display the independencies of the model. 
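
As a cross-check on what to expect, the local Markov property (each node is independent of its non-descendants given its parents) can be evaluated by hand from the parent sets. This standard-library sketch uses the parent sets implied by the factorization above; the helper functions are hypothetical and are not the pgmpy implementation.

```python
# Parent sets taken from the factorization of the murder model.
parents = {
    "B": [], "C": [],
    "W": ["B", "C"], "MO": ["B", "C"],
    "M": ["B", "C", "MO", "W"],
}

def descendants(node, parent_map):
    """All nodes reachable from `node` by following directed edges."""
    children = {n: [c for c, ps in parent_map.items() if n in ps] for n in parent_map}
    found, stack = set(), [node]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def local_independencies(parent_map):
    """Map each node to (non-descendants excluding parents, parents)."""
    result = {}
    for node, ps in parent_map.items():
        others = set(parent_map) - {node} - set(ps) - descendants(node, parent_map)
        result[node] = (sorted(others), sorted(ps))
    return result

for node, (indep, given) in local_independencies(parents).items():
    if indep:
        print(node, "is independent of", indep, "given", given)
    else:
        print(node, "has no local independence assertion")
```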

**Question:** Is your graph an I-map of the factorized distribution and why?

ANS:     

Now you are ready to perform inference on your model. Use the belief propagation method to query the M variable. 
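
A marginal query sums the joint distribution over every variable other than the one queried. For a very small model this can be done by brute-force enumeration, which is a useful sanity check on what an inference routine returns. The sketch below uses a hypothetical three-variable network (X and Y are parents of Z) with made-up CPD values; it is not the murder model.

```python
from itertools import product

# Hypothetical CPD values for a toy network X -> Z <- Y (not the homework model).
p_x = {0: 0.7, 1: 0.3}
p_y = {0: 0.4, 1: 0.6}
p_z1_given_xy = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}  # P(Z = 1 | X, Y)

def marginal_z():
    """P(Z) obtained by summing the joint p(X) p(Y) p(Z | X, Y) over X and Y."""
    p_z1 = sum(p_x[x] * p_y[y] * p_z1_given_xy[(x, y)]
               for x, y in product((0, 1), repeat=2))
    return {0: 1.0 - p_z1, 1: p_z1}

print(marginal_z())
```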

Compare the resulting marginal distribution to the marginal distribution you obtained for the same query in the previous homework using the CPD tables provided. How do these distributions differ?

ANS:   

Next, you will estimate the CPD parameters using the Bayesian estimator with a moderately weak but biased prior distribution. Such a prior distribution can be constructed from some combination of data from previous murder cases and the opinions of several investigators (experts). For this estimate use the following prior distributions (pseudo counts):

- C: [[8], [2]]
- B: [[2],[8]]
- W: [[2,4,2,3], [4,2,4,3]]
- MO: [[3,2,4,3], [3,4,2,3]]
- M: [[1,1,1,1,4,4,4,4,1,1,1,1,2,2,2,2], [1,1,1,1,1,1,1,1,4,4,4,4,2,2,2,2], [4,4,4,4,4,1,1,1,1,1,1,1,1,1,1,1]]

In the cell below create and execute the code to perform Bayes estimation and display the CPD parameters. 

Compare the parameter tables computed with the biased prior to those estimated with a uniform prior. How are these different, and is this expected given the change in prior?

ANS:    

But, how much does the prior matter in terms of inference? Use the belief propagation method to query the M variable and display the results.  

Compare this marginal distribution to the one obtained with the uniform prior. Would you say these differences are significant, and why?

ANS:    

Perhaps more data will improve the estimation of the model parameters, particularly for the maximum likelihood method. In the cell below compute a new data set with 250 cases. Use a random seed value of 5678. Be sure to give your new data frame a different name. 

Now, compute the model parameters using the 250 sample dataset and the **maximum likelihood estimator**.

Compare these results to the MLE results you computed with 25 data cases. Are there fewer extreme values? But, does the presence of extreme values still indicate the model is under-fit despite an order of magnitude increase in the number of data cases? 

ANS 1:      
ANS 2:    

Next, you will compare the MLE values to those produced by the **Bayesian estimator** using the same uniform prior as the first Bayes estimate. 

In terms of evidence of under-fitting, such as repeated parameter values, how do these estimates compare to the Bayes estimates using 25 cases? Also, are the probability tables for the butler and the cook closer to the original values from the DAG model used for the simulation, when compared to the parameters computed with the Bayes estimator from 25 cases? 

ANS 1:    
ANS 2:    

### Bayesian Estimation of Parameters

To gain a feel for how a prior distribution changes the parameter estimates you will perform random sampling of data from the DAG model and estimate a model parameter. As a first step, write a function(s) that randomly samples a dataset and then estimates the parameter, $\theta$. Your function(s) should do the following:

1. Use the DAG model you imported for the simulation to create each dataset realization. 
2. Arguments should include the $\alpha$ and $\beta$ prior pseudo counts.
3. The parameter, $\theta$, is estimated for the butler, B, table. 
4. The number of samples per realized dataset is 25. 
5. Compute 1,000 estimates of $\theta$, using 1,000 independent sample datasets.
6. Use an initial random seed of 345, and increase your seed value by 100 for each realization.
7. Return an array-like data structure containing your 1,000 parameter estimates. 
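
The overall shape of such a function can be sketched without pgmpy by replacing the DAG sampler with a stand-in. The code below is only a sketch under stated assumptions: each dataset is drawn as independent Bernoulli trials with a hypothetical `true_theta`, $\alpha$ is taken to be the pseudo count for the state whose probability is $\theta$, and the estimate is the posterior mean. Your own function should sample from the DAG model instead.

```python
import random

def simulate_theta_estimates(alpha, beta, true_theta=0.6, n_cases=25,
                             n_realizations=1000, seed=345, seed_step=100):
    """Estimate theta = P(state 1) from repeated simulated datasets.

    Each dataset is a stand-in for sampling the B column from the DAG:
    n_cases Bernoulli(true_theta) draws. `true_theta` is a hypothetical value.
    The estimate is (alpha + k) / (alpha + beta + n), which reduces to the
    maximum likelihood estimate k / n when alpha = beta = 0.
    """
    estimates = []
    for i in range(n_realizations):
        rng = random.Random(seed + i * seed_step)  # a new seed for every realization
        k = sum(rng.random() < true_theta for _ in range(n_cases))
        estimates.append((alpha + k) / (alpha + beta + n_cases))
    return estimates

theta_mle = simulate_theta_estimates(alpha=0, beta=0)
theta_bayes = simulate_theta_estimates(alpha=6, beta=4)
print("MLE range:  ", min(theta_mle), max(theta_mle))
print("Bayes range:", min(theta_bayes), max(theta_bayes))
```

Because both calls use the same seeds, they see the same simulated datasets, so any difference between the two sets of estimates comes from the prior alone.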

Create the code in the cell below. Then execute your code for a **maximum likelihood estimate** of the 1,000 values of theta by setting $\alpha$ and $\beta$ both to 0, and save the results. 

The next step is to visualize the results as a histogram. Create a function to plot a histogram of your parameter estimates using 50 bins and with x-axis limits of (0.2,0.9). Make sure to label your axes. Then, plot the histogram and examine the results. 

> **Note:** Since the DAG has a limited number of discrete valued nodes, expect the histogram to have a number of discrete values.

Examine the dispersion of the parameter estimates and the most likely value (mode). Given that the parameter, $\theta$, must be in the range $0.0 \le \theta \le 1.0$, would you say there is significant dispersion in these estimates, and why? 

ANS:    

Next, simulate a new realization of the 1,000 datasets, estimating the parameter, $\theta$, using a prior with pseudo counts, alpha = 6, beta = 4. Then, plot the histogram to compare with the previous results. 

How has adding this prior changed the characteristics of the distribution of the parameter, $\theta$?

ANS:    

Now, you will explore the learning properties of the Bayes estimator when very little data is available. You will compute realizations of datasets with just 5 samples and plot the histogram of the parameter estimates. Continue using a prior with pseudo counts, alpha = 6, beta = 4. 

Examine the resulting histogram. The result has fewer discrete values than the histogram computed using a sample size of 25. But, are the dispersion and mode of these two distributions nearly the same and why? 

ANS:    

Finally, you will investigate the effect of a strong prior. A strong prior arises in cases where there is considerable experience with the problem. In such cases, the new observations only incrementally change the parameter estimates. 
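
One way to see why a strong prior changes the estimate only incrementally is to write the Bayes estimate as a weighted average of the prior mean and the maximum likelihood estimate. The derivation below assumes a Beta prior with pseudo counts $\alpha$ and $\beta$, $k$ observations of the state of interest in $n$ cases, and the posterior mean as the estimate:

$$\hat{\theta} = \frac{\alpha + k}{\alpha + \beta + n} = w\ \frac{\alpha}{\alpha + \beta} + (1 - w)\ \frac{k}{n}, \qquad w = \frac{\alpha + \beta}{\alpha + \beta + n}$$

The weight $w$ on the prior mean grows as the total pseudo count $\alpha + \beta$ grows relative to $n$. For example, with pseudo counts of 6 and 4, $w = 10/35 \approx 0.29$ when $n = 25$, but $w = 10/15 \approx 0.67$ when $n = 5$.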

Simulate a new realization of datasets with 25 samples each, estimating the parameter, $\theta$, using a prior with pseudo counts, alpha = 16, beta = 24. Then, plot the histogram to compare with the previous results. 

Compare these results to those obtained using the weaker prior (alpha = 6, beta = 4) with 25 samples per realization. What are the key differences between these distributions of the parameters, and why? 

ANS:    

## Part 2: Learning Structure

Now you will explore how well the structure of the graph can be estimated. **Keep in mind that the graph used for the simulation is not a tree**. Answer the questions based on the results you find. 

With the dataset simulated, you will now try estimating the model structure. Use the hill climb search algorithm with the BIC scoring function to estimate the model structure, using the dataset with 250 samples. Set a `numpy.random.seed` of 5566, before computing the model. Create and execute the code in the cell below to estimate the model structure and display the identified edges. 

In [None]:
nr.seed(5566) 


Examine these edges. How does this model compare to the model used to simulate the data?

ANS:    

How good is this structure fit? To answer this question you will need to compare the BIC score of the graph used for the simulation with the BIC score of the estimated structure. You must create a baseline DAG structure and compare the BIC score to the score of the estimated model. 

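Notice that in practice, you will never know the true graph structure. Else, why estimate it? In such cases, the best you can do is test several models and select the one with the best BIC score that also honors any constraints known from expert opinion. Note that the conventional BIC is minimized, whereas pgmpy's `BicScore` is a penalized log-likelihood, so a higher (less negative) score indicates a better model. 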

In the cell below create the code to compute and display the BIC score of both the graph used for the simulation and the estimated structure, using the 250 sample dataset.
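
To build intuition for what a BIC comparison measures, the sketch below scores two candidate structures on a hypothetical two-variable dataset. It assumes the penalized log-likelihood form, in which a higher score is better; the data, function names, and structures are illustrative and not taken from the homework.

```python
from collections import Counter
from math import log

# Hypothetical toy data for two binary variables (X, Y), strongly dependent.
data = [(0, 0)] * 18 + [(0, 1)] * 2 + [(1, 0)] * 2 + [(1, 1)] * 18

def log_likelihood(counts_by_column):
    """Maximized log-likelihood of one CPD given counts per parent configuration."""
    ll = 0.0
    for counts in counts_by_column:
        total = sum(counts)
        ll += sum(n * log(n / total) for n in counts if n > 0)
    return ll

def bic(ll, n_free_params, n_samples):
    """BIC in the 'higher is better' form: log-likelihood minus a complexity penalty."""
    return ll - 0.5 * log(n_samples) * n_free_params

n = len(data)
x_counts = list(Counter(x for x, _ in data).values())
y_counts = list(Counter(y for _, y in data).values())
y_given_x = [[sum(1 for x, y in data if x == xv and y == yv) for yv in (0, 1)]
             for xv in (0, 1)]

# Structure 1: no edge (X and Y independent). Structure 2: X -> Y.
bic_empty = bic(log_likelihood([x_counts]) + log_likelihood([y_counts]), 2, n)
bic_edge = bic(log_likelihood([x_counts]) + log_likelihood(y_given_x), 3, n)
print("no edge:", round(bic_empty, 2), " X -> Y:", round(bic_edge, 2))
```

The extra edge costs one additional free parameter, but on this strongly dependent data the gain in log-likelihood outweighs the penalty.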

Are these BIC scores different and what does this mean in terms of how good the estimated model is? 

ANS:    

Next, you will apply the K2 score method to the 250 case dataset. In the cell below, create the code to use the hill climbing search with the K2 score to estimate and display the model structure. Set a `numpy.random.seed` of 6565, before computing the model. 

In [None]:
nr.seed(6565)


Is this graph structure any different from the one obtained with the BIC score and what does this mean in terms of the independency structure?

ANS:    

Now, compare the K2 score for the baseline DAG model with the estimated model using the 250 case dataset. In the cell below create and execute the code to compute and display these scores. 

Are these K2 scores different and what does this indicate about the estimated model? 

ANS:   

In the cell below create and execute the code to display the independencies of the graph structure you have found. 

Notice that these local independencies have some problematic characteristics. What statement can you make about these problems for the murderer variable, M?

ANS:    

Perhaps, a larger dataset will yield better DAGs? In the cell below, compute a dataset with 25,000 samples using the DAG model. Use a random seed of 9898. 

With the larger dataset available, you can now determine if using 2 orders of magnitude more data improves the model structure estimates. In the cell below, use the hill climb search algorithm with the BIC scoring function to estimate the model structure. Set a `numpy.random.seed` of 4567, before computing the model. Make sure you give your model a unique name so you can make comparisons later.

In [None]:
nr.seed(4567)


Next, use the 25,000 sample data set with the hill climb search algorithm using the K2 scoring function to estimate the model structure. Set a `numpy.random.seed` of 765, before computing the model. Make sure you give your model a unique name so you can make comparisons later.

In [None]:
nr.seed(765)


The two scoring methods have arrived at different models, even when a larger dataset is used.  What are some key differences between these models, and between each of them and the original model? **Hint:** Look at the numbers of directed edges.

ANS:   

The K2 score models have been created using a Dirichlet uniform prior, starting with a completely unconnected model. See the [pgmpy documentation for more details](http://pgmpy.org/estimators.html).   
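
For reference, the K2 score in its classical form is the log marginal likelihood of the data under a uniform Dirichlet prior with every pseudo count equal to 1. The sketch below implements that textbook formula with `math.lgamma` on hypothetical counts; pgmpy's implementation may differ in details, so treat this as an illustration rather than a description of the library.

```python
from math import lgamma

def k2_node_score(counts_by_column):
    """Log K2 score of one node.

    counts_by_column[j][k] is the number of cases with parent configuration j
    and node state k.
    """
    score = 0.0
    for counts in counts_by_column:
        r = len(counts)  # number of states of the node
        score += lgamma(r) - lgamma(sum(counts) + r)
        score += sum(lgamma(n + 1) for n in counts)
    return score

# Hypothetical counts for two binary variables X and Y from 40 cases.
x_counts = [[20, 20]]                  # X has no parents: a single column
y_counts_no_parent = [[20, 20]]        # Y scored without parents
y_counts_given_x = [[18, 2], [2, 18]]  # Y scored with X as its parent

k2_empty = k2_node_score(x_counts) + k2_node_score(y_counts_no_parent)
k2_edge = k2_node_score(x_counts) + k2_node_score(y_counts_given_x)
print("no edge:", round(k2_empty, 2), " X -> Y:", round(k2_edge, 2))
```

As with the BIC, the score decomposes into one term per node, which is what lets a hill climbing search evaluate a single edge change without rescoring the whole graph.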

The addition of a prior in the form of an initial DAG model might make a difference. In the cell below a simple initial model is defined. You can specify an initial model using the `start` argument to the `estimate` method. 

Using the 25,000 case dataset and the initial model, use the K2 score to find a model structure. Set a `numpy.random.seed` of 543, before computing the model. Make sure you give your model a unique name so you can make comparisons later.

Has the use of the prior or initial model changed the result? 

ANS:    

Finally, compare the BIC and K2 scores of the three models you created with the K2 and BIC score methods on the 25,000 case dataset. In the cell below create and execute the code to compute and display these 6 scores.  

Examine these results. Do the results indicate any one model is substantially better than the others? Does this outcome help explain the ambiguity you have seen in estimating model structure, and why? 

ANS 1:   
ANS 2:    

Finally, create and execute the code in the cell below, to print the local independencies of the models estimated using the K2 and BIC score methods on the 25,000 case dataset.

How are the local independencies different? Which structure makes more sense when compared to the original DAG used for the simulation?

ANS 1:     
ANS 2:    