# Problem 3(b): Viterbi Decoding vs. Smoothing

The Viterbi algorithm finds the most likely *state trajectory* of a hidden Markov model (HMM)
given a sequence of observations \(Y_1,\dots,Y_t\). Specifically, it solves the decoding problem

$$
(x_1^*, \dots, x_t^*)
=
\arg\max_{(x_1,\dots,x_t)}
P(X_1 = x_1, \dots, X_t = x_t \mid Y_1, \dots, Y_t).
$$
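
As a minimal sketch of this decoding step, the following pure-Python function runs the Viterbi recursion on a small discrete HMM. Since \(P(Y_1,\dots,Y_t)\) does not depend on the trajectory, it maximizes the joint probability \(P(x_{1:t}, y_{1:t})\), which has the same maximizer as the posterior above. The names `pi`, `T`, `E`, and `obs`, and the example numbers, are assumptions made for illustration and are not part of the problem statement.

```python
def viterbi(pi, T, E, obs):
    """Most likely state path and its joint probability P(x_1..x_t, y_1..y_t).

    pi[i]   : P(X_1 = i)                     (assumed initial distribution)
    T[i][j] : P(X_{k+1} = j | X_k = i)       (assumed transition matrix)
    E[i][y] : P(Y_k = y | X_k = i)           (assumed emission matrix)
    obs     : observation indices y_1, ..., y_t
    """
    n = len(pi)
    # delta[i]: best joint probability of any path that ends in state i
    delta = [pi[i] * E[i][obs[0]] for i in range(n)]
    backptr = []
    for y in obs[1:]:
        prev = delta
        # best[j]: best predecessor state for each current state j
        best = [max(range(n), key=lambda i: prev[i] * T[i][j]) for j in range(n)]
        delta = [prev[best[j]] * T[best[j]][j] * E[j][y] for j in range(n)]
        backptr.append(best)
    # Backtrack from the best final state
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for best in reversed(backptr):
        path.append(best[path[-1]])
    path.reverse()
    return path, delta[last]


# Illustrative two-state example (made-up numbers)
path, prob = viterbi(
    pi=[0.5, 0.5],
    T=[[0.9, 0.1], [0.1, 0.9]],
    E=[[0.8, 0.2], [0.2, 0.8]],
    obs=[0, 0, 1],
)
```

For long sequences one would normally work with log-probabilities to avoid underflow; plain products are kept here for readability.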

Smoothing, on the other hand, computes the marginal posterior distribution of the state at
each time step. Taking the pointwise MAP estimate of these marginals then selects the most likely
state independently at each time \(k = 1, \dots, t\):

$$
\hat{x}_k = \arg\max_x P(X_k = x \mid Y_1, \dots, Y_t),
$$

resulting in the sequence
$$
(\hat{x}_1, \dots, \hat{x}_t).
$$
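
The smoothed marginals can be computed with the forward-backward recursions. The sketch below is one possible pure-Python version, under the same assumed parameterization as a standard discrete HMM (`pi`, `T`, `E`, and `obs` are illustrative names, and the numbers are made up); `pointwise_map` is a hypothetical helper that takes the argmax of each marginal.

```python
def smooth(pi, T, E, obs):
    """Return gamma with gamma[k][i] = P(X_{k+1} = i | y_1..y_t), k zero-based."""
    n, t = len(pi), len(obs)
    # Forward pass: alpha[k][i] = P(X_{k+1} = i, y_1..y_{k+1})
    alpha = [[pi[i] * E[i][obs[0]] for i in range(n)]]
    for y in obs[1:]:
        prev = alpha[-1]
        alpha.append(
            [sum(prev[i] * T[i][j] for i in range(n)) * E[j][y] for j in range(n)]
        )
    # Backward pass: beta[k][i] = P(later observations | X_{k+1} = i)
    beta = [[1.0] * n for _ in range(t)]
    for k in range(t - 2, -1, -1):
        beta[k] = [
            sum(T[i][j] * E[j][obs[k + 1]] * beta[k + 1][j] for j in range(n))
            for i in range(n)
        ]
    # Combine and normalize at each time step
    gamma = []
    for a, b in zip(alpha, beta):
        w = [ai * bi for ai, bi in zip(a, b)]
        z = sum(w)
        gamma.append([wi / z for wi in w])
    return gamma


def pointwise_map(gamma):
    """Pick the most likely state at each time step independently."""
    return [max(range(len(g)), key=lambda i: g[i]) for g in gamma]


# Illustrative two-state example (made-up numbers)
gamma = smooth(
    pi=[0.5, 0.5],
    T=[[0.9, 0.1], [0.1, 0.9]],
    E=[[0.8, 0.2], [0.2, 0.8]],
    obs=[0, 0, 1],
)
x_hat = pointwise_map(gamma)
```

Note that `pointwise_map` never looks at `T`: once the marginals are formed, each time step is decided in isolation.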

In general, these two sequences are **not the same**.

---

## Why the two solutions are different in general

The key reason is that the two methods optimize **different objective functions**.

- **Viterbi decoding** maximizes the *joint posterior probability* of the entire state trajectory.
  It explicitly takes into account:
  - the initial state distribution,
  - the state transition probabilities,
  - and the observation likelihoods,
  over all time steps simultaneously.

- **Smoothing with pointwise MAP** maximizes the *marginal posterior probability* at each time
  step independently. The most likely state at time \(k\) is chosen without considering whether
  this choice is consistent with high-probability transitions to neighboring time steps.

Because maximization and marginalization do not commute (the maximum of a sum over the other
states is not the sum's largest term), the sequence formed by independently maximizing each
marginal posterior does not, in general, maximize the joint posterior probability of the full
trajectory.
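
As a small worked illustration, consider a hypothetical posterior over just two consecutive states. The numbers are made up for illustration and are not derived from a particular HMM:

$$
\begin{array}{c|cc|c}
 & X_2 = c & X_2 = d & P(X_1 \mid Y) \\
\hline
X_1 = a & 0 & 0.40 & 0.40 \\
X_1 = b & 0.35 & 0.25 & 0.60 \\
\hline
P(X_2 \mid Y) & 0.35 & 0.65 &
\end{array}
$$

The joint maximizer is \((a, d)\) with probability \(0.40\), whereas maximizing each marginal separately gives \((b, d)\), whose joint probability is only \(0.25\).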

---

## A counterexample

Consider an HMM in which, at some time step \(k\),

$$
P(X_k = a \mid Y_{1:t}) > P(X_k = b \mid Y_{1:t}),
$$

so smoothing selects state \(a\) at time \(k\).

However, suppose that the transition probabilities satisfy

$$
T_{a \to c} \approx 0,
\qquad
T_{b \to c} \text{ is large},
$$

where \(c\) is the state with the highest smoothed marginal at time \(k+1\).

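In this case, although state \(a\) has a slightly higher marginal posterior probability at time \(k\),
any trajectory that moves from \(a\) at time \(k\) to \(c\) at time \(k+1\) has a very low joint
probability because of the near-impossible transition. The marginal mass of \(a\) comes instead from
trajectories that continue to states other than \(c\), none of which needs to be individually likely.
When the single best trajectory passes through \(c\) at time \(k+1\), the Viterbi algorithm therefore
selects state \(b\) at time \(k\) in order to maximize the joint probability of the entire trajectory.

Thus, the state selected by smoothing at time \(k\) differs from the state chosen by Viterbi
decoding. Moreover, the pointwise sequence contains the step \(a \to c\), so it has near-zero
posterior probability as a trajectory.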
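
To make the counterexample concrete, here is a brute-force check on a three-state HMM with two time steps. All parameter values are assumptions chosen to match the scenario described (states `a`, `b`, `c` are indices 0, 1, 2; `T[a][c] = 0.02`, `T[b][c] = 1.0`); the code enumerates every trajectory rather than running Viterbi or forward-backward, which is only feasible for tiny examples.

```python
from itertools import product

# Assumed illustrative parameters (states: 0 = a, 1 = b, 2 = c)
pi = [0.6, 0.4, 0.0]
T = [
    [0.49, 0.49, 0.02],  # a -> c is nearly impossible
    [0.00, 0.00, 1.00],  # b -> c is certain
    [0.00, 0.00, 1.00],
]
E = [[0.6, 0.4], [0.5, 0.5], [0.3, 0.7]]  # E[state][observation]
obs = [0, 1]


def joint(path):
    """P(x_1..x_t, y_1..y_t) for one state trajectory."""
    p = pi[path[0]] * E[path[0]][obs[0]]
    for k in range(1, len(path)):
        p *= T[path[k - 1]][path[k]] * E[path[k]][obs[k]]
    return p


# Enumerate all trajectories and their joint probabilities
probs = {path: joint(path) for path in product(range(3), repeat=len(obs))}
Z = sum(probs.values())  # P(y_1..y_t)

# Joint MAP trajectory (what Viterbi would return)
joint_map = max(probs, key=probs.get)

# Smoothed marginals and their pointwise argmax
marginals = [
    [sum(p for path, p in probs.items() if path[k] == s) / Z for s in range(3)]
    for k in range(len(obs))
]
pointwise = tuple(max(range(3), key=lambda s: m[s]) for m in marginals)

names = "abc"
print("joint MAP:", [names[s] for s in joint_map], probs[joint_map] / Z)
print("pointwise:", [names[s] for s in pointwise], probs[pointwise] / Z)
```

With these numbers the joint MAP trajectory is \((b, c)\), with posterior probability of roughly 0.46, while the pointwise choice is \((a, c)\), whose posterior probability as a trajectory is below 0.02 even though \(a\) has the larger marginal at the first step (about 0.54).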

---

## When are the two solutions the same?

The two solutions coincide in special cases where temporal dependencies do not affect the
optimal path. For example:

1. **Independent states across time**  
   If the states are independent a priori, meaning the transition probability \(T_{i \to j}\)
   does not depend on the current state \(i\), then the joint posterior factorizes into a product
   of marginal posteriors. In this case, maximizing the joint probability is equivalent to
   maximizing each marginal independently.

2. **Uniform or uninformative transition matrix**  
   If the transition matrix assigns equal probability to all state transitions, every path
   receives the same transition weight, so the transitions do not favor any particular path.
   This is a special case of item 1. Consequently, the Viterbi solution reduces to selecting
   the most likely state at each time step, which matches the smoothing result.

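3. **Deterministic, one-to-one transitions**  
   If each state has exactly one feasible successor and no two states share a successor (the
   transition matrix is a permutation matrix), then the whole trajectory is fixed by \(x_1\) and
   every smoothed marginal is a relabeling of the posterior over \(X_1\), so both methods produce
   the same state sequence. Determinism alone is not sufficient: if two states lead to the same
   successor, their posterior mass merges there, and the pointwise MAP state can leave the most
   likely trajectory.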
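
The special cases above can be spot-checked numerically. The sketch below uses brute-force enumeration on two small, made-up HMMs: one with a uniform transition matrix and one whose transitions form a cyclic shift (a deterministic chain in which each state has a distinct successor). The helper `decode_both` and all parameter values are assumptions for illustration; agreement on these examples is a sanity check, not a proof.

```python
from itertools import product


def decode_both(pi, T, E, obs):
    """Return (joint MAP path, pointwise MAP path) by enumerating all paths."""
    n, t = len(pi), len(obs)

    def joint(path):
        p = pi[path[0]] * E[path[0]][obs[0]]
        for k in range(1, t):
            p *= T[path[k - 1]][path[k]] * E[path[k]][obs[k]]
        return p

    probs = {path: joint(path) for path in product(range(n), repeat=t)}
    joint_map = max(probs, key=probs.get)
    # Unnormalized marginals are enough for taking the argmax
    pointwise = tuple(
        max(range(n), key=lambda s: sum(p for path, p in probs.items() if path[k] == s))
        for k in range(t)
    )
    return joint_map, pointwise


# Uniform transitions: every path gets the same transition weight
uniform = decode_both(
    pi=[0.3, 0.7],
    T=[[0.5, 0.5], [0.5, 0.5]],
    E=[[0.9, 0.1], [0.2, 0.8]],
    obs=[0, 1, 0],
)

# Cyclic shift 0 -> 1 -> 2 -> 0: deterministic with distinct successors
cyclic = decode_both(
    pi=[0.2, 0.5, 0.3],
    T=[[0, 1, 0], [0, 0, 1], [1, 0, 0]],
    E=[[0.7, 0.3], [0.4, 0.6], [0.1, 0.9]],
    obs=[0, 1, 1],
)

print(uniform[0] == uniform[1], cyclic[0] == cyclic[1])
```

In both examples the two decoders return the same sequence; ties between equally likely states would need a consistent tie-breaking rule for the comparison to be meaningful.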

---

## Summary

In general, the Viterbi decoding solution is not equal to the sequence obtained by selecting the
most likely state at each time step from smoothing. Smoothing performs pointwise maximization
of marginal posterior probabilities, whereas Viterbi decoding maximizes the joint posterior
probability of the entire state trajectory. The two solutions are guaranteed to coincide only in
special cases where temporal dependencies do not influence the optimal path, although they may
also happen to agree for particular observation sequences.
