# POS tagging

Part-of-speech (POS) tagging is the task of assigning a part-of-speech label (tag) to each word in a given text.

It can be solved with a hidden Markov model (HMM) decoded by the Viterbi algorithm, or treated as a classification problem (MLP, LSTM, etc.).

## Bayesian method

An HMM is an extension of Naive Bayes to sequences.

The Bayesian method considers the posterior probability $P(T|W)$ over **all possible** tag sequences $T$ given the words $W$, and picks the most probable sequence.

It uses Bayes' theorem to relate the likelihood $P(W|T)$, the prior $P(T)$, and the posterior $P(T|W)$.

\begin{align}
T &= \arg\max_T P(T|W)\\[1em]
\text{(Bayes' theorem)}&= \arg\max_T\frac{P(T)P(W|T)}{P(W)}\\[1em]
\text{(ignore } P(W)\text{, constant in } T\text{)}&=\arg\max_T P(T)P(W|T)\\[1em]
\text{(chain rule)}&= \arg\max_T \prod_{i=1}^{n} P(t_i|t_1, ..., t_{i-1})P(w_i|w_1, ..., w_{i-1}, t_1, ..., t_n)\\[1em]
\text{(simplify)}&\approx \arg\max_T \prod_{i=1}^{n} P(t_i|t_{i-1})P(w_i|t_i)
\end{align}


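Simplifications:
1. likelihood $P(W|T) \approx \prod_{i=1}^{n} P(w_i|t_i)$ (each word depends only on its own tag)
2. prior $P(T) \approx \prod_{i=1}^{n} P(t_i|t_{i-1})$ (bigram approximation: each tag depends only on the previous tag)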
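
A minimal sketch of how the two simplified factors might be estimated and combined, assuming relative-frequency (maximum-likelihood) counts over a small hand-made tagged corpus. The corpus, the `<s>` start pseudo-tag (standing in for $t_0$), and all function names are illustrative assumptions, not part of the notes above. No smoothing is applied, so unseen pairs get probability zero.

```python
from collections import Counter

# Hypothetical toy corpus of tagged sentences, for illustration only.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]

START = "<s>"  # assumed start-of-sentence pseudo-tag, so that t_0 is defined

transition_counts = Counter()  # counts of (t_{i-1}, t_i)
prev_counts = Counter()        # counts of t_{i-1} as the conditioning tag
emission_counts = Counter()    # counts of (t_i, w_i)
tag_counts = Counter()         # counts of t_i as the emitting tag

for sentence in corpus:
    prev = START
    for word, tag in sentence:
        transition_counts[(prev, tag)] += 1
        prev_counts[prev] += 1
        emission_counts[(tag, word)] += 1
        tag_counts[tag] += 1
        prev = tag


def p_transition(prev, tag):
    """Relative-frequency estimate of P(t_i | t_{i-1}), without smoothing."""
    if prev_counts[prev] == 0:
        return 0.0
    return transition_counts[(prev, tag)] / prev_counts[prev]


def p_emission(tag, word):
    """Relative-frequency estimate of P(w_i | t_i), without smoothing."""
    if tag_counts[tag] == 0:
        return 0.0
    return emission_counts[(tag, word)] / tag_counts[tag]


def joint_score(words, tags):
    """P(T) P(W|T) under both simplifications: prod P(t_i|t_{i-1}) P(w_i|t_i)."""
    score = 1.0
    prev = START
    for word, tag in zip(words, tags):
        score *= p_transition(prev, tag) * p_emission(tag, word)
        prev = tag
    return score


# 2/3 * 1 * 1 * 1/3 * 1 * 1/3 = 2/27, about 0.074
print(joint_score(["the", "dog", "barks"], ["DET", "NOUN", "VERB"]))
```

In this toy corpus "the" is never tagged `NOUN`, so any tag sequence starting with `NOUN` for that sentence scores zero, which is the kind of brittleness smoothing is normally used to avoid.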


## HMM: Viterbi algorithm

HMM task: find the most likely state sequence $Q$ given an observation sequence $O$ and an HMM $\mu$

$$
Q=\arg\max_{Q}P(Q|O, \mu)
$$

POS tagging: find the **most likely** tag sequence $T=\{t_1, ..., t_n\}$ given a sequence of words $W=\{w_1, ..., w_n\}$ (tags play the role of the states $Q$, words the role of the observations $O$)

$$
T=\arg\max_{T}P(T|W)
$$

Possible solutions:

- Brute force: Using the HMM $\mu$, compute the probability $P(T|W)$ for every possible tag sequence $T$ and keep the best.

  cons: computationally infeasible, since $N$ tags and $n$ words give $N^n$ candidate sequences.

- Greedy Search: Choose the locally best tag for each word, without considering the overall sequence of tags.

  cons: can return suboptimal solutions, since dependencies between tags are not fully taken into account.

- Beam Search: A more efficient approach than brute force that keeps a limited set of partial hypotheses.

  At each step, only the top $k$ best hypotheses are retained.

  cons: may miss the optimal solution due to early pruning.

- **Viterbi algorithm**: finds the exact optimal sequence in $O(N^2 n)$ time using dynamic programming.
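
The first three strategies can be compared on a tiny example. The sketch below assumes a two-tag HMM with made-up probabilities (`PI`, `A`, `B` are hypothetical values chosen so that the strategies disagree). Greedy search is read here as committing to the locally best tag at each position, left to right, given the previously chosen tag; that is one possible interpretation of the description above.

```python
from itertools import product

# Hypothetical toy HMM; all numbers are invented for illustration.
TAGS = ["NOUN", "VERB"]
PI = {"NOUN": 0.5, "VERB": 0.5}                       # initial probabilities
A = {"NOUN": {"NOUN": 0.9, "VERB": 0.1},              # A[prev][next]
     "VERB": {"NOUN": 0.5, "VERB": 0.5}}
B = {"NOUN": {"fish": 0.4, "tank": 0.6},              # B[tag][word]
     "VERB": {"fish": 0.6, "tank": 0.4}}


def path_score(words, tags):
    """Joint score P(T) P(W|T) of one complete tag sequence."""
    score = PI[tags[0]] * B[tags[0]][words[0]]
    for i in range(1, len(words)):
        score *= A[tags[i - 1]][tags[i]] * B[tags[i]][words[i]]
    return score


def brute_force(words):
    """Score all N^n tag sequences and return the best one."""
    candidates = product(TAGS, repeat=len(words))
    return max(candidates, key=lambda tags: path_score(words, tags))


def greedy(words):
    """Commit to the locally best tag at each position, left to right."""
    tags = [max(TAGS, key=lambda t: PI[t] * B[t][words[0]])]
    for word in words[1:]:
        prev = tags[-1]
        tags.append(max(TAGS, key=lambda t: A[prev][t] * B[t][word]))
    return tuple(tags)


def beam_search(words, k):
    """Keep only the k best partial hypotheses after each word."""
    beam = sorted(((PI[t] * B[t][words[0]], (t,)) for t in TAGS), reverse=True)[:k]
    for word in words[1:]:
        candidates = [
            (score * A[tags[-1]][t] * B[t][word], tags + (t,))
            for score, tags in beam
            for t in TAGS
        ]
        beam = sorted(candidates, reverse=True)[:k]
    return beam[0][1]


words = ["fish", "tank"]
print(brute_force(words))     # ('NOUN', 'NOUN')
print(greedy(words))          # ('VERB', 'NOUN')
print(beam_search(words, 1))  # ('VERB', 'NOUN')
print(beam_search(words, 2))  # ('NOUN', 'NOUN')
```

With these numbers, greedy search and a beam of width 1 commit to `VERB` for the first word and never recover, while brute force and a beam of width 2 find the higher-scoring `NOUN NOUN` sequence.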

### Viterbi algorithm

The Viterbi algorithm is an efficient dynamic-programming algorithm for finding the most likely state sequence of an HMM; here it is used for POS tagging.

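Features:

* dynamic programming: the best path into a state at one step is built from the best paths into each state at the previous step, so the cost grows linearly, not exponentially, with sentence length.

* memoization: intermediate scores are stored to avoid redundant computation.

* backpointers: the best predecessor of each state is recorded so the optimal path can be traced back.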

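Algorithm

Notation: $N$ is the number of states and $T$ is the length of the observation sequence $o_1, ..., o_T$ (in this subsection $T$ is a length, not the tag sequence). $\pi_j$ is the initial probability of state $j$, $a_{ij}$ the transition probability from state $i$ to state $j$, $b_j(o_t)$ the probability of emitting $o_t$ in state $j$, and $v_t(j)$ the score of the best path that ends in state $j$ at time $t$.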

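The backpointer is written $\mathit{bt}_t(j)$ to keep it distinct from the emission probability $b_j(o_t)$.

1. Initialization, for $1 \leq j \leq N$:

   $$
   \begin{aligned}
   v_1(j) &= \pi_j b_j(o_1)\\[1em]
   \mathit{bt}_1(j) &= 0
   \end{aligned}
   $$

2. Recursion, for $1 \leq j \leq N$ and $1 < t \leq T$:

   $$
   \begin{aligned}
   v_t(j) &= \max_{1 \leq i \leq N} v_{t-1}(i) a_{ij} b_j(o_t) \\[1em]
   \mathit{bt}_t(j) &= \arg\max_{1 \leq i \leq N} v_{t-1}(i) a_{ij} b_j(o_t)
   \end{aligned}
   $$

3. Termination:

   Best score: $P^* = \max_{1 \leq i \leq N} v_T(i)$

   Start of backtrace: $q_T^* = \arg\max_{1 \leq i \leq N} v_T(i)$

   Backtrace, for $t = T-1, ..., 1$: $q_t^* = \mathit{bt}_{t+1}(q_{t+1}^*)$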
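
The initialization, recursion, and termination steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the dictionary-based parameter layout, the function name `viterbi`, and the toy probabilities are assumptions, and raw probabilities are multiplied directly (a practical tagger would usually work with log-probabilities to avoid underflow on long sentences).

```python
def viterbi(obs, states, pi, a, b):
    """Return (best_path, best_score) for the observation sequence obs.

    pi[j]   : initial probability of state j
    a[i][j] : transition probability from state i to state j
    b[j][o] : probability of emitting observation o in state j
    """
    # Initialization: v_1(j) = pi_j * b_j(o_1); no predecessor yet.
    v = [{j: pi[j] * b[j][obs[0]] for j in states}]
    backptr = [{j: None for j in states}]

    # Recursion: extend the best path into each state j at time t.
    for t in range(1, len(obs)):
        v.append({})
        backptr.append({})
        for j in states:
            # b_j(o_t) does not depend on i, so it can be left out of the argmax.
            best_prev = max(states, key=lambda i: v[t - 1][i] * a[i][j])
            v[t][j] = v[t - 1][best_prev] * a[best_prev][j] * b[j][obs[t]]
            backptr[t][j] = best_prev

    # Termination: pick the best final state, then follow the backpointers.
    last = max(states, key=lambda j: v[-1][j])
    best_score = v[-1][last]
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return path, best_score


# Hypothetical toy HMM; all numbers are invented for illustration.
states = ["NOUN", "VERB"]
pi = {"NOUN": 0.5, "VERB": 0.5}
a = {"NOUN": {"NOUN": 0.9, "VERB": 0.1},
     "VERB": {"NOUN": 0.5, "VERB": 0.5}}
b = {"NOUN": {"fish": 0.4, "tank": 0.6},
     "VERB": {"fish": 0.6, "tank": 0.4}}

path, score = viterbi(["fish", "tank"], states, pi, a, b)
print(path)  # ['NOUN', 'NOUN']
```

Each time step does $N$ maximizations over $N$ predecessors, so the sketch runs in $O(N^2 T)$ time and stores $O(NT)$ scores and backpointers.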
