# Inference

## Eric He

## Homework 2 - Problem 5

In problem 5, we are asked to provide code to build a Hidden Markov Model and train it on the sequence in `sequence.txt`.

We can find the parameters of the Hidden Markov Model using the Baum-Welch algorithm, an expectation-maximization style algorithm that works as follows:

# Baum-Welch algorithm

## Expectation: Given model parameters, compute the probability of the data

This probability, when instead viewed as a function of the model parameters, is the likelihood function. 

## Maximization: Given the likelihood function 

# Import necessary packages and read in sequence.txt

I've changed `sequence.txt` into a Python file and imported that instead.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('whitegrid')

import sequence
s = sequence.sequence

# Define Hidden Markov Model class

In [2]:
class HMM():
    def __init__(self, N, Y):
        # N is the number of hidden nodes
        self.N = N
        self.i = None
        if Y is not None:
            ### initialize params ###
            # store the data; map the input to easy 0-based index
            self.Y = [y - 1 for y in Y]
            # T is the number of observations
            self.T = len(Y)
            # K is the number of observable states
            self.K = len(set(self.Y))
            # A is the matrix of hidden state transition probabilities
            # random initialization
            self.A = np.ones([self.N, self.N])
            self.A = (self.A.transpose() / self.A.sum(axis=1)).transpose()
            # B is the estimated probabilities for each observable state given a hidden state
            # random initialization
            self.B = np.random.rand(self.N, self.K)
            self.B = (self.B.transpose() / self.B.sum(axis=1)).transpose()
            # P is the initial probability of the hidden states
            # uniform initialization
            self.P = np.ones(self.N) / N

            self.previous_A = self.A.copy()
            self.previous_B = self.B.copy()
        else:
            self.Y = None
    
    def fit(self, Y, max_iter=10):
        ### initialize params ###
        # store the data; map the input to easy 0-based index
        self.Y = [y - 1 for y in Y]
        # T is the number of observations
        self.T = len(Y)
        # K is the number of observable states
        self.K = len(set(self.Y))
        # A is the matrix of hidden state transition probabilities
        # random initialization
        self.A = np.random.rand(self.N, self.N)
        self.A = (self.A.transpose() / self.A.sum(axis=1)).transpose()
        # B is the estimated probabilities for each observable state given a hidden state
        # random initialization
        self.B = np.random.rand(self.N, self.K)
        self.B = (self.B.transpose() / self.B.sum(axis=1)).transpose()
        # P is the initial probability of the hidden states
        # uniform initialization
        self.P = np.ones(self.N) / self.N
        
        self.previous_A = self.A.copy()
        self.previous_B = self.B.copy()
        
        self.likelihoods_historical = []
        
        self.likelihoods = np.ones(self.T)
        
        # expectation maximization loop
        while self.convergence_check(max_iter=max_iter):
            alphas, betas = self.expectation()
            self.maximization(alphas, betas)
    
    def forward(self):
        """Computes alphas, the probability of seeing observations y1...yt and having state i at time t"""
        alphas = np.ones([self.N, self.T])
        for n in range(self.N):
            alphas[:, 0] = self.P * self.B[:,self.Y[0]]
        alphas[:, 0] = alphas[:, 0] / alphas[:, 0].sum()
            
        for t in range(1, self.T):
            alphas[:, t] = self.B[:,self.Y[t]] * (alphas[:,t-1] @ self.A)
            self.likelihoods[t] = alphas[:, t].sum()
            alphas[:, t] = alphas[:, t] / self.likelihoods[t]
            
        return alphas
    
    def backward(self):
        betas = np.ones([self.N, self.T]) / 2
        
        for t in reversed(range(self.T - 1)):
            betas[:, t] = self.A @ (self.B[:,self.Y[t+1]] * betas[:,t+1])
            betas[:, t] = betas[:, t] / betas[:, t].sum()
            
        return betas
    
    # check 
    def convergence_check(self, max_iter=10):
        if self.i is None:
            self.i = 1
            return True
        if self.i > max_iter:
            return False
        else:
            self.i += 1
            difference_A = np.abs(self.previous_A - self.A).sum()
            difference_B = np.abs(self.previous_B - self.B).sum()
            print('Log-likelihood is {}'.format(np.log(self.likelihoods).sum()))
            print('Difference in A: {}'.format(difference_A))
            print('Difference in B: {}'.format(difference_B))
            print('\n')
            self.previous_A = self.A
            self.previous_B = self.B
            
            self.likelihoods_historical.append(self.likelihoods.sum())
            
            if (difference_A == 0) & (difference_B == 0):
                return False
            
            return True
    
    def expectation(self):
        alphas = self.forward()
        betas = self.backward()
        return alphas, betas
    
    def maximization(self, alphas, betas):
        gammas = alphas * betas
        gammas = gammas / gammas.sum(axis=0)
        
        xis = np.zeros([self.N, self.N, self.T - 1])
        for t in range(self.T - 1):
            xis[:,:,t] = self.A * (alphas[:,t] @ (self.B[:,self.Y[t]] * betas[:,t+1]).T)       
            
        self.P = gammas[:,0]
        
        xis_sum = xis.sum(axis=2)
        self.A = (xis_sum.transpose() / xis_sum.sum(axis=1)).transpose()
        
        self.xis = xis
        self.gammas = gammas
        
        for j in range(len(set(self.Y))):
            self.B[:,j] = self.gammas[:,(np.array(self.Y) == j)].sum(axis=1) / self.gammas.sum(axis=1)
        # reset the fair die
        self.B[0,:] = 1.0 / self.K

# Test run with 2

In [3]:
hmm = HMM(2,s)
hmm.fit(s, max_iter=100)

Log-likelihood is -10887.682081741354
Difference in A: 0.0
Difference in B: 1.4783297874533952


Log-likelihood is -8887.282511444013
Difference in A: 5.551115123125783e-17
Difference in B: 0.0


Log-likelihood is -8802.589020972615
Difference in A: 1.6653345369377348e-16
Difference in B: 0.0


Log-likelihood is -8773.995354822864
Difference in A: 2.7755575615628914e-16
Difference in B: 0.0


Log-likelihood is -8764.766344053765
Difference in A: 2.220446049250313e-16
Difference in B: 0.0


Log-likelihood is -8761.209396670087
Difference in A: 1.6653345369377348e-16
Difference in B: 0.0


Log-likelihood is -8759.467366451274
Difference in A: 2.220446049250313e-16
Difference in B: 0.0


Log-likelihood is -8758.444693723606
Difference in A: 2.7755575615628914e-16
Difference in B: 0.0


Log-likelihood is -8757.784686753117
Difference in A: 2.7755575615628914e-16
Difference in B: 0.0


Log-likelihood is -8757.34216398986
Difference in A: 3.3306690738754696e-16
Difference in B: 0.0


Log-lik

In [4]:
with pd.option_context('display.float_format', lambda x: '{:.2f}%'.format(x * 100)):
    display(pd.DataFrame(hmm.A))

Unnamed: 0,0,1
0,66.56%,33.44%
1,64.00%,36.00%


In [5]:
with pd.option_context('display.float_format', lambda x: '{:.2f}%'.format(x * 100)):
    display(pd.DataFrame(hmm.B))

Unnamed: 0,0,1,2,3,4,5
0,16.67%,16.67%,16.67%,16.67%,16.67%,16.67%
1,43.86%,8.27%,6.24%,7.26%,6.65%,27.71%


In [6]:
with pd.option_context('display.float_format', lambda x: '{:.2f}%'.format(x * 100)):
    display(pd.DataFrame(hmm.P))

Unnamed: 0,0
0,100.00%
1,0.00%


# Run it back

On this second iteration, we have wildly different A and B, even though we have achieved the same log-likelihood.

In [7]:
hmm = HMM(2,s)
hmm.fit(s, max_iter=100)

Log-likelihood is -10589.99441664798
Difference in A: 1.1102230246251565e-16
Difference in B: 0.9233532373608645


Log-likelihood is -9482.486393359826
Difference in A: 1.1449174941446927e-16
Difference in B: 0.0


Log-likelihood is -8830.124678858549
Difference in A: 1.700029006457271e-16
Difference in B: 0.0


Log-likelihood is -8757.381462368176
Difference in A: 2.220446049250313e-16
Difference in B: 0.0


Log-likelihood is -8756.968348321316
Difference in A: 5.551115123125783e-17
Difference in B: 0.0


Log-likelihood is -8756.96995702456
Difference in A: 0.0
Difference in B: 0.0




In [8]:
with pd.option_context('display.float_format', lambda x: '{:.2f}%'.format(x * 100)):
    display(pd.DataFrame(hmm.A))

Unnamed: 0,0,1
0,32.64%,67.36%
1,3.07%,96.93%


In [9]:
with pd.option_context('display.float_format', lambda x: '{:.2f}%'.format(x * 100)):
    display(pd.DataFrame(hmm.B))

Unnamed: 0,0,1,2,3,4,5
0,16.67%,16.67%,16.67%,16.67%,16.67%,16.67%
1,26.40%,13.65%,12.96%,13.30%,13.07%,20.62%


In [10]:
with pd.option_context('display.float_format', lambda x: '{:.2f}%'.format(x * 100)):
    display(pd.DataFrame(hmm.P))

Unnamed: 0,0
0,61.44%
1,38.56%


# The 3-dice case

Not really seeing any particular properties with 3 dice, unfortunately. The more likely case should be a function of the number of distinct runs in the particular sequence; for example, if numbers 1, 2, and 3 see large amounts of runs, then its likely there are three dice rather than 2.

In [11]:
hmm = HMM(3,s)
hmm.fit(s, max_iter=100)

Log-likelihood is -9116.624925407283
Difference in A: 1.1102230246251565e-16
Difference in B: 1.1961384544930036


Log-likelihood is -8823.116064833743
Difference in A: 3.0531133177191805e-16
Difference in B: 0.0


Log-likelihood is -8767.067342207018
Difference in A: 2.498001805406602e-16
Difference in B: 0.0


Log-likelihood is -8761.196733037374
Difference in A: 2.220446049250313e-16
Difference in B: 0.0


Log-likelihood is -8760.184215214342
Difference in A: 2.220446049250313e-16
Difference in B: 0.0


Log-likelihood is -8759.812235335698
Difference in A: 5.551115123125783e-17
Difference in B: 0.0


Log-likelihood is -8759.546799660486
Difference in A: 2.7755575615628914e-16
Difference in B: 0.0


Log-likelihood is -8759.308051892609
Difference in A: 1.942890293094024e-16
Difference in B: 0.0


Log-likelihood is -8759.082388745956
Difference in A: 2.220446049250313e-16
Difference in B: 0.0


Log-likelihood is -8758.866815319565
Difference in A: 3.3306690738754696e-16
Difference in 

In [12]:
with pd.option_context('display.float_format', lambda x: '{:.2f}%'.format(x * 100)):
    display(pd.DataFrame(hmm.A))

Unnamed: 0,0,1,2
0,19.94%,35.44%,44.62%
1,40.24%,23.50%,36.26%
2,13.39%,55.56%,31.06%


In [13]:
with pd.option_context('display.float_format', lambda x: '{:.2f}%'.format(x * 100)):
    display(pd.DataFrame(hmm.B))

Unnamed: 0,0,1,2,3,4,5
0,16.67%,16.67%,16.67%,16.67%,16.67%,16.67%
1,26.12%,5.84%,18.48%,10.38%,3.83%,35.36%
2,32.24%,20.17%,5.08%,14.61%,20.57%,7.33%


In [14]:
with pd.option_context('display.float_format', lambda x: '{:.2f}%'.format(x * 100)):
    display(pd.DataFrame(hmm.P))

Unnamed: 0,0
0,27.47%
1,0.00%
2,72.53%
