In [1]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.insert(0,'../../modules')

In [2]:
import numpy as np
import factors
import factors_sampling

# Maximum Likelihood Estimate (MLE)
One way to make predictions is to plug in the single most likely setting of the model parameters, rather than integrating over all possible parameters.
$$P(X_\text{new}|X_\text{old})\approx P(X_\text{new}|\hat\theta)$$
where $$\hat\theta = \text{argmax}_\theta P(X_\text{old}|\theta)$$
We use $D$ to refer to old data in some cases. <br>
Often it is assumed that the data is independent and identically distributed, which means:
$$P(D|\theta)=\prod_i P(D_i|\theta)$$
Another common practice is to maximize the log likelihood instead. Because the logarithm is monotonically increasing, the maximizer is unchanged: $$\text{argmax}_\theta P(D|\theta) = \text{argmax}_\theta \log P(D|\theta)$$
This turns the above product into a sum:
$$\log P(D|\theta)= \sum_i \log P(D_i|\theta)$$
This is much more numerically stable, as a product of many numbers less than 1 quickly underflows to zero in floating point. <br>
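
The stability argument can be illustrated with a minimal sketch. The per-point likelihood values below are made up for illustration (they are not drawn from any model in this notebook), and the seed is fixed only for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
# 2000 hypothetical per-point likelihoods, each between 0.01 and 0.5
point_likelihoods = rng.uniform(0.01, 0.5, size=2000)

# the direct product is far below the smallest representable float64
naive_product = np.prod(point_likelihoods)
# the sum of logs is an ordinary (large, negative) finite number
log_likelihood = np.sum(np.log(point_likelihoods))

print("product of likelihoods:", naive_product)
print("sum of log likelihoods:", log_likelihood)
```

The product is at most $0.5^{2000}$, which cannot be represented in double precision, whereas the log likelihood remains usable for comparing parameter settings.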
**Examples of maximum likelihood estimates:**

### Categorical
The binomial distribution describes a variable with only two possible outcomes, governed by a single parameter $\theta$, the probability of the first outcome. Samples are akin to flipping a coin with a certain bentness. The observed value $k$ is the number of times one outcome occurs in a given number of samples $n$; the other outcome occurs $n-k$ times. Given $\theta$, the probability of $k$ is: <br>
$$P(k|n,\theta)=\frac{\theta^k (1-\theta)^{n-k}n!}{k!(n-k)!}$$
We want a point estimate of $\theta$ and are using maximum likelihood to get it, so we need the $\text{argmax}_\theta$ of the above. Factors that do not depend on $\theta$ can thus be removed: 
$$P(k|n,\theta)\propto \theta^k (1-\theta)^{n-k}$$
Taking the log does not change the maximizer, and gives the log likelihood up to an additive constant:
$$l(\theta)= k\ln(\theta)+(n-k)\ln(1-\theta)+\text{const}$$
As this function is concave in $\theta$, we can set the gradient to 0 to get the maximum:
$$
\begin{aligned}
  \nabla l(\theta)&=\frac{k}{\theta}-\frac{n-k}{1-\theta} \\
  0&=\frac{k}{\theta}-\frac{n-k}{1-\theta} \\
  \frac{n-k}{1-\theta}&=\frac{k}{\theta} \\
  \frac{(n-k)\theta}{(1-\theta)\theta}&=\frac{k(1-\theta)}{(1-\theta)\theta} \\
  (n-k)\theta&=k(1-\theta) \\
  n\theta-k\theta&=k-k\theta \\
  n\theta&=k \\
  \theta&=\frac{k}{n}
\end{aligned}$$

So the best estimate for the "bentness" is just the observed fraction of samples with that outcome. With enough data this approaches the true value. E.g. for a coin:

In [3]:
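total_heads = 0
total_flips = 0
theta = 0.6  # true probability of heads
for sample in range(10000):
    # each flip lands heads with probability theta
    if np.random.rand() < theta:
        total_heads += 1
    total_flips += 1
# MLE: fraction of flips that came up heads (k/n)
prob_heads_estimate = total_heads / total_flips
print("theta estimate", prob_heads_estimate)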

theta estimate 0.6
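
The closed-form result $\theta = k/n$ can also be checked numerically. The following is a minimal sketch, assuming hypothetical counts `n = 50` and `k = 30`; it evaluates the log likelihood on a grid and compares the grid maximizer with $k/n$.

```python
import numpy as np

n, k = 50, 30  # hypothetical data: 30 heads in 50 flips

def binomial_log_likelihood(theta, k, n):
    # log likelihood up to an additive constant that does not depend on theta
    return k * np.log(theta) + (n - k) * np.log(1 - theta)

# grid over (0, 1), excluding the end points where the logs diverge
theta_grid = np.linspace(0.001, 0.999, 999)
grid_argmax = theta_grid[np.argmax(binomial_log_likelihood(theta_grid, k, n))]

print("closed form k/n:", k / n)
print("grid argmax    :", grid_argmax)
```

The two values agree up to the grid spacing, which is what the derivation predicts.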


The same formula also applies to a discrete variable with more than two outcomes. If outcome $j$ is observed $k_j$ times, the maximum likelihood estimate is:
$$\theta_j=\frac{k_j}{\sum_{j'} k_{j'}}$$
Example of a fair die:

In [4]:
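totals = np.zeros(6)  # k_j: number of times each face was rolled
dice_true_probs = np.ones(6)*(1/6)
for sample in range(10000):
    # draw a face in 1..6 according to the true probabilities
    roll = np.random.choice(np.arange(1,7),p=dice_true_probs)
    totals[roll-1]+=1
# MLE: theta_j = k_j / (total number of rolls)
estimated_probs = totals/np.sum(totals)
print("truth   ",dice_true_probs.round(4))
print("estimate",estimated_probs.round(4))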

truth    [0.1667 0.1667 0.1667 0.1667 0.1667 0.1667]
estimate [0.1663 0.1696 0.1672 0.1623 0.1726 0.162 ]
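
The same counts can be collected without an explicit loop. This is a sketch of one alternative, not the approach used in the cell above; it assumes the faces are encoded as integers `0..5` and uses `np.bincount` to obtain the $k_j$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
true_probs = np.full(6, 1 / 6)
# 10000 rolls of a fair die, faces encoded as 0..5
rolls = rng.choice(6, size=10000, p=true_probs)

counts = np.bincount(rolls, minlength=6)  # k_j for each face
estimated = counts / counts.sum()         # theta_j = k_j / sum of all counts

print("estimate", estimated.round(4))
```

`minlength=6` keeps a zero entry for any face that happens never to be rolled, so the estimate always has one entry per outcome.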


### Gaussian
The Gaussian pdf is:
$$ p(x|\mu,\sigma^2)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2}}$$
The log is:
$$ \begin{aligned}
    \log p(x|\mu,\sigma^2)&=\log\left(\frac{1}{\sigma\sqrt{2\pi}}\right)-\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2} \\
    &=-\log(\sigma\sqrt{2\pi})-\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2} \\
    &=-\log(\sigma) -\log(\sqrt{2\pi})-\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2}
\end{aligned}
$$
So for $n$ iid (independent identically distributed) data points this becomes the log likelihood:
$$-n\log(\sigma) -n\log(\sqrt{2\pi})-\sum_{i=1}^n \frac{1}{2} \frac{(x_i-\mu)^2}{\sigma^2}$$
Which is: 
$$-n\log(\sigma) -n\log(\sqrt{2\pi})- \frac{1}{2} \frac{\sum_{i=1}^n(x_i-\mu)^2}{\sigma^2}$$
The constant $-n\log(\sqrt{2\pi})$ can be dropped when doing MLE. <br>
Call the remaining log likelihood $l$. It is concave in $\mu$, and in $\sigma$ it has a single stationary point, which is its maximum, so we can set the partial derivative with respect to each parameter to 0.
$$ \begin{aligned}
    \frac{\partial l}{\partial\mu}&=\frac{\sum_{i=1}^n(x_i-\mu)}{\sigma^2} \\
    0&=\frac{\sum_{i=1}^n(x_i-\mu)}{\sigma^2} \\
    0&=\sum_{i=1}^n(x_i-\mu) \\
    \sum_{i=1}^n \mu&=\sum_{i=1}^n x_i \\
    n\mu&=\sum_{i=1}^n x_i \\
    \mu&=\frac{1}{n}\sum_{i=1}^n x_i
\end{aligned}
$$
So $\mu$ is just the mean of the samples. Differentiating the log likelihood $l$ with respect to $\sigma$ gives:
$$ \begin{aligned}
    \frac{\partial l}{\partial\sigma}&=-\frac{n}{\sigma}+\frac{\sum_{i=1}^n(x_i-\mu)^2}{\sigma^3} \\
    0&=-\frac{n}{\sigma}+\frac{\sum_{i=1}^n(x_i-\mu)^2}{\sigma^3} \\
    \frac{n}{\sigma}&=\frac{\sum_{i=1}^n(x_i-\mu)^2}{\sigma^3} \\
    n&=\frac{\sum_{i=1}^n(x_i-\mu)^2}{\sigma^2} \\
    \sigma^2&=\frac{\sum_{i=1}^n(x_i-\mu)^2}{n}
\end{aligned}
$$
So $\sigma^2$ is just the variance of the samples around $\mu$, computed with divisor $n$.

In [5]:
true_mean = 3.4053
true_sigma = 1.4233
gaussian_samples = np.random.normal(true_mean,true_sigma,10000)
estimated_mean = np.mean(gaussian_samples)
estimated_sigma = np.sqrt(np.mean((gaussian_samples-estimated_mean)**2))
print("true mean     ",true_mean,"true sigma     ",true_sigma)
print("estimated mean",estimated_mean.round(4),"estimated sigma",estimated_sigma.round(4))

true mean      3.4053 true sigma      1.4233
estimated mean 3.418 estimated sigma 1.4077
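
One practical consequence of the divisor $n$ is worth a quick check. The sketch below reuses the mean and sigma from the cell above, with a fixed seed assumed only for reproducibility. It shows that the MLE variance matches NumPy's default `np.var` (which uses `ddof=0`), while `ddof=1` gives the unbiased estimator, which differs by a factor of $n/(n-1)$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = rng.normal(3.4053, 1.4233, size=10000)

mu_mle = x.mean()
var_mle = np.mean((x - mu_mle) ** 2)  # divisor n, as in the derivation

# np.var defaults to ddof=0, which is the maximum likelihood estimate
assert np.isclose(var_mle, np.var(x))

# ddof=1 divides by n - 1 instead and is slightly larger
var_unbiased = np.var(x, ddof=1)
print("MLE variance     ", var_mle)
print("unbiased variance", var_unbiased)
```

For large $n$ the two estimates are nearly identical; the distinction matters mainly for small samples.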


### Bayesian Networks
Say we have variables $X_{1:n}$ and a set of $m$ data points $D$ for a Bayesian network. A single point $d$ in $D$ assigns a value to each variable $X_{1:n}$. The function $\text{par}(d,i)$ returns the values that the parents of variable $i$ take in $d$. Variable $i$ has $r_i$ different possible values and $q_i$ possible parent configurations; we write $\pi_{ij}$ for the $j$-th configuration of the parents of node $i$. The probability of a single variable being set to $k$, given parent configuration $\pi_{ij}$, is given by the factor:
$$p(X_i=k|\pi_{ij})=\theta_{ijk}$$
For $d$, the probability of a single variable is the value $\theta_{ijk}$ for which $d_i=k$ and $\text{par}(d,i)=\pi_{ij}$. Using an indicator function $\mathbf{1}(\cdot)$, which is 1 where its argument is true and 0 otherwise, this can be expressed by placing the indicators in the exponent, so that every non-matching factor becomes 1: 
$$p(d_i|\text{par}(d,i))=\prod_{k=1}^{r_i}\prod_{j=1}^{q_i}\theta_{ijk}^{\mathbf{1}(d_i=k)\mathbf{1}(\text{par}(d,i)=\pi_{ij})}$$
As the joint distribution of a Bayesian network factorizes into each variable given its parents, the probability of the whole data point is:
$$p(d)=\prod_{i=1}^n\prod_{k=1}^{r_i}\prod_{j=1}^{q_i}\theta_{ijk}^{\mathbf{1}(d_i=k)\mathbf{1}(\text{par}(d,i)=\pi_{ij})}$$
As each data point is independent, the likelihood can be expressed:
$$l(D)=\prod_{u=1}^m\prod_{i=1}^n\prod_{k=1}^{r_i}\prod_{j=1}^{q_i}\theta_{ijk}^{\mathbf{1}(D_{ui}=k)\mathbf{1}(\text{par}(D_u,i)=\pi_{ij})}$$
The product over all data items can be replaced by counts $M_{ijk}$:
$$l(D)=\prod_{i=1}^n\prod_{k=1}^{r_i}\prod_{j=1}^{q_i}\theta_{ijk}^{M_{ijk}}$$
Where $M_{ijk}$ is the number of times variable $i$ is set to $k$ with parent configuration $j$ across the data.<br>
Similarly to the categorical distribution, the maximum likelihood estimate in the network is:
$$\theta_{ijk}=\frac{M_{ijk}}{\sum_{k'} M_{ijk'}}$$
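
The counting formula can be sketched in plain NumPy, independently of the `factors` module. The two-node network `A -> C` and its probabilities below are hypothetical and chosen only for illustration; `M[j, k]` plays the role of $M_{ijk}$ for the single child variable `C`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
m = 5000                        # number of data points

# hypothetical two-node network A -> C, both variables binary
p_a = np.array([0.7, 0.3])            # P(A)
p_c_given_a = np.array([[0.2, 0.8],   # P(C | A=0)
                        [0.6, 0.4]])  # P(C | A=1)

# sample the parent first, then the child given the parent
a = rng.choice(2, size=m, p=p_a)
c = (rng.random(m) < p_c_given_a[a, 1]).astype(int)

# M[j, k] = number of samples with parent configuration A=j and child C=k
M = np.zeros((2, 2))
np.add.at(M, (a, c), 1)

# theta[j, k] = M[j, k] / sum over k of M[j, k]
theta_hat = M / M.sum(axis=1, keepdims=True)
print(theta_hat.round(3))
```

Each row of `theta_hat` is normalized separately, which mirrors the sum over $k$ in the denominator: the estimate for one parent configuration uses only the data points in which that configuration occurs.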
**Example:**

In [6]:
factorA = factors.Factor(["A"],[2])
factorB = factors.Factor(["B"],[2])
factorC_givenAB = factors.Factor(["C","A","B"],[2,2,2])
factorD_givenC = factors.Factor(["D","C"],[2,2])
factorE_givenC = factors.Factor(["E","C"],[2,2])
factorA.set_all([0.7,0.3])
factorB.set_all([0.75,0.25])
factorC_givenAB.set_all([0.05,0.5,0.7,0.45,0.95,0.5,0.3,0.55])
factorD_givenC.set_all([0.2,0.7,0.8,0.3])
factorE_givenC.set_all([0.6,0.15,0.4,0.85])
all_true_factors = [factorA,factorB,factorC_givenAB,factorD_givenC,factorE_givenC]
samples = []
for s in range(1000):
    sample_variable_names,sample = factors_sampling.joint_sample_top_down(all_true_factors)
    samples.append(sample)
samples = np.array(samples)
empty_factors = [f.copy_zeros() for f in all_true_factors]

In [7]:
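def MLE_directed_bayes_net(all_factors,sample_variable_names,samples):
    # Estimate every factor by counting matching samples (the M_ijk above).
    # Assumes the first name in each factor is the child and the rest are its parents.
    new_factors = [f.copy() for f in all_factors]
    for j in range(len(all_factors)):
        # every joint assignment of this factor's variables, one per row
        indexes = all_factors[j].indexes
        # columns of `samples` that hold this factor's variables
        factor_to_sample_index = [sample_variable_names.index(name) for name in all_factors[j].names]
        selected_samples = samples[:,factor_to_sample_index]
        counts = np.zeros(indexes.shape[0])
        for i in range(indexes.shape[0]):
            # a sample counts only if it agrees with assignment i on every variable
            all_match = np.all(selected_samples==indexes[i],axis=1)
            counts[i] = np.sum(all_match)
            new_factors[j].set(indexes[i],counts[i])
        # condition on the parents (all names after the first) to turn counts into probabilities
        new_factors[j]=factors.condition(new_factors[j],axis=new_factors[j].names[1:])
    return new_factors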
learned_factors = MLE_directed_bayes_net(empty_factors,sample_variable_names,samples)

In [8]:
for a in range(len(all_true_factors)):
    print("TRUE")
    print(all_true_factors[a])
    print("LEARNED")
    print(learned_factors[a])

TRUE
A  Values (10 dp)
0  0.7
1  0.3

LEARNED
A  Values (10 dp)
0  0.741
1  0.259

TRUE
B  Values (10 dp)
0  0.75
1  0.25

LEARNED
B  Values (10 dp)
0  0.733
1  0.267

TRUE
C  A  B  Values (10 dp)
0  0  0  0.05
0  0  1  0.5
0  1  0  0.7
0  1  1  0.45
1  0  0  0.95
1  0  1  0.5
1  1  0  0.3
1  1  1  0.55

LEARNED
C  A  B  Values (10 dp)
0  0  0  0.0560747664
0  0  1  0.4854368932
0  1  0  0.7070707071
0  1  1  0.4426229508
1  0  0  0.9439252336
1  0  1  0.5145631068
1  1  0  0.2929292929
1  1  1  0.5573770492

TRUE
D  C  Values (10 dp)
0  0  0.2
0  1  0.7
1  0  0.8
1  1  0.3

LEARNED
D  C  Values (10 dp)
0  0  0.202020202
0  1  0.6870554765
1  0  0.797979798
1  1  0.3129445235

TRUE
E  C  Values (10 dp)
0  0  0.6
0  1  0.15
1  0  0.4
1  1  0.85

LEARNED
E  C  Values (10 dp)
0  0  0.5824915825
0  1  0.146514936
1  0  0.4175084175
1  1  0.853485064

