<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/task_8_transitions_solved.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 8: POS transition probabilities

In the lecture, we briefly saw the concept of hidden Markov models and transition probabilities, as applied to POS tags. In the simplest case, these probabilities model sequences of POS tag pairs: e.g. the probability of DET -> NOUN is the probability of seeing a NOUN, having just seen a DET (determiner), i.e. more formally P(NOUN|DET). We also had the intuition that, for example, DET -> NOUN should be much more probable than, say, DET -> VERB. And of course, since these are probabilities, P(x|y) summed over all x should equal 1 for any given y. These probabilities can be easily estimated by counting from the data: the probability of the DET -> NOUN transition, P(NOUN|DET), is simply the count of how many times you saw NOUN following a DET, divided by how many times you saw DET followed by any tag.
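
As a small illustration of the counting estimate, the sketch below works through a made-up tag sequence. The toy data and the names `toy_tags`, `pair_counts`, `from_counts` and `transition_prob` are assumptions for illustration only; they are not part of the task data or of any library.

```python
from collections import Counter

# Toy tag sequence, invented for illustration only
toy_tags = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"]

# Count every adjacent (previous tag, next tag) pair
pair_counts = Counter(zip(toy_tags, toy_tags[1:]))

# Count how often each tag occurs as the first member of a pair
# (the very last tag is excluded, since nothing follows it)
from_counts = Counter(toy_tags[:-1])

def transition_prob(prev_tag, next_tag):
    """Estimate P(next_tag | prev_tag) by relative frequency."""
    return pair_counts[(prev_tag, next_tag)] / from_counts[prev_tag]

print("P(NOUN|DET) =", transition_prob("DET", "NOUN"))
print("P(ADJ|DET)  =", transition_prob("DET", "ADJ"))
```

In this toy sequence DET is followed by something three times, twice by NOUN and once by ADJ, so the two estimates are 2/3 and 1/3 and together they sum to 1.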

Your task is to pick a Universal Dependencies dataset of your choice, e.g. UD_English-EWT training data, calculate these transition probabilities, pretty-print them if you can, and check that our intuitions hold, i.e. that for example DET -> NOUN is substantially more likely than, say, DET -> VERB.



In [1]:
# Grab the data

!wget https://github.com/UniversalDependencies/UD_English-EWT/archive/refs/heads/master.zip
!unzip master.zip

--2024-04-10 10:06:37--  https://github.com/UniversalDependencies/UD_English-EWT/archive/refs/heads/master.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/UniversalDependencies/UD_English-EWT/zip/refs/heads/master [following]
--2024-04-10 10:06:37--  https://codeload.github.com/UniversalDependencies/UD_English-EWT/zip/refs/heads/master
Resolving codeload.github.com (codeload.github.com)... 140.82.114.10
Connecting to codeload.github.com (codeload.github.com)|140.82.114.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip’

master.zip              [     <=>            ]   7.24M  5.65MB/s    in 1.3s    

2024-04-10 10:06:39 (5.65 MB/s) - ‘master.zip’ saved [7588960]

Archive:  master.zip
f91f4cd038fe8aad3563340782714aefce702ce2
   creating: UD_English

In [2]:
# I guess this conllu library is quite useful after all

!pip install conllu

Collecting conllu
  Downloading conllu-4.5.3-py2.py3-none-any.whl (16 kB)
Installing collected packages: conllu
Successfully installed conllu-4.5.3


In [3]:
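import conllu

inp_data="UD_English-EWT-master/en_ewt-ud-train.conllu"
# Read the file as UTF-8 (closing it afterwards) and parse it into a list of sentences
with open(inp_data, encoding="utf-8") as f:
    conllu_data=conllu.parse(f.read())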

In [22]:
# Let's get the counts into a numpy 2D array, but
# of course there are very many ways to achieve
# this same job

import numpy as np
import itertools

# I grabbed this list from the documentation on the UD pages
# and added a "START" tag to mark the start of the sentence;
# "_" is the placeholder UPOS that shows up e.g. on multiword-token lines
all_tags= "START ADJ ADP PUNCT ADV AUX SYM INTJ CCONJ X NOUN DET PROPN NUM VERB PART PRON SCONJ _".split()

def idx(tag):
    """Utility function which turns a tag into a 0-based integer index"""
    return all_tags.index(tag)

counts=np.zeros((len(all_tags),len(all_tags)),dtype=float) #0-filled array of the correct size

for sentence in conllu_data:
    #get the sequence of POS tags in the sentence, pre-pended with START
    pos_sequence=["START"]+[token['upos'] for token in sentence]
    #now loop over all adjacent tag pairs - there are many ways to do that, this one is quite elegant I think
    #(note: itertools.pairwise needs Python 3.10 or newer)
    for pos_from, pos_to in itertools.pairwise(pos_sequence):
        pos_from_i=idx(pos_from)
        pos_to_i=idx(pos_to)
        counts[pos_from_i,pos_to_i]+=1

probs=counts/np.sum(counts,axis=1)[:,None] #normalize into probabilities by dividing with row sums

# Print a few to check that they make sense (they do)
print("DET -> NOUN", probs[idx("DET"),idx("NOUN")])
print("DET -> VERB", probs[idx("DET"),idx("VERB")])
print("DET -> DET", probs[idx("DET"),idx("DET")])
print("AUX -> VERB", probs[idx("AUX"),idx("VERB")])
print("VERB -> AUX", probs[idx("VERB"),idx("AUX")])



DET -> NOUN 0.5875207067918278
DET -> VERB 0.018958218295600956
DET -> DET 0.009877906620038039
AUX -> VERB 0.33824561403508774
VERB -> AUX 0.006738783472246853
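
One thing worth checking after normalizing is that every row really sums to 1, and that a tag which never occurs as the first member of a pair does not produce a division by zero. The sketch below shows one possible way to do both on a made-up count matrix; `toy_counts` and `toy_probs` are hypothetical names, and the numbers are invented rather than taken from the treebank.

```python
import numpy as np

# Toy count matrix (made up): the last row has no outgoing transitions
toy_counts = np.array([[2.0, 6.0, 0.0],
                       [1.0, 1.0, 2.0],
                       [0.0, 0.0, 0.0]])

# Row sums as a column vector, same trick as sums[:, None] in the notebook
row_sums = np.sum(toy_counts, axis=1)[:, None]

# Divide only where the row sum is non-zero; all-zero rows stay at 0
toy_probs = np.divide(toy_counts, row_sums,
                      out=np.zeros_like(toy_counts),
                      where=row_sums > 0)

# Every row that has observations should now sum to 1
seen = row_sums.ravel() > 0
assert np.allclose(toy_probs[seen].sum(axis=1), 1.0)
print(toy_probs)
```

With a plain `counts/row_sums` division, the all-zero row would come out as NaN values instead of zeros.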


In [27]:
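# Let's make sure we don't repeat the division mistake: to normalize each row,
# the row sums must be a column vector, otherwise they broadcast across columns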
array=np.array([[1,2],[3,4]])
print("sum1",np.sum(array))
print("sum2",np.sum(array,0))
print("sum3",np.sum(array,1))
sums=np.sum(array,1)
print("Sums as row vector",sums)
print("Sums as column vector", sums[:,None])
print("division with sums as row vector", array/sums) ## Sums interpreted as row vector
print("division with sums as column vector", array/sums[:,None]) #Make the sums into a column vector



sum1 10
sum2 [4 6]
sum3 [3 7]
Sums as row vector [3 7]
Sums as column vector [[3]
 [7]]
division with sums as row vector [[0.33333333 0.28571429]
 [1.         0.57142857]]
division with sums as column vector [[0.33333333 0.66666667]
 [0.42857143 0.57142857]]
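
An alternative to indexing with `[:,None]` is to ask NumPy to keep the summed axis, which gives the column vector directly. This is a minimal sketch on the same 2x2 example; the variable names `row_sums` and `normalized` are assumptions for illustration.

```python
import numpy as np

array = np.array([[1, 2], [3, 4]])

# keepdims=True keeps the summed axis with length 1, so the shape is (2, 1)
row_sums = np.sum(array, axis=1, keepdims=True)
print("shape of row sums:", row_sums.shape)

# Broadcasting now divides every row by its own sum
normalized = array / row_sums

# Same result as dividing by sums[:, None]
assert np.allclose(normalized, array / np.sum(array, axis=1)[:, None])
# Each row sums to 1
assert np.allclose(normalized.sum(axis=1), 1.0)
print(normalized)
```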


In [28]:
# Now we can do some pretty-printing
# this is what google thought to be a good way, can't argue with that :)
import pandas
df = pandas.DataFrame(probs, columns=all_tags, index=all_tags)
df

from \ to,START,ADJ,ADP,PUNCT,ADV,AUX,SYM,INTJ,CCONJ,X,NOUN,DET,PROPN,NUM,VERB,PART,PRON,SCONJ,_
START,0.0,0.040816,0.043686,0.035236,0.075016,0.025351,0.007733,0.032286,0.023198,0.012277,0.062022,0.100446,0.111687,0.039142,0.05963,0.004703,0.251993,0.035555,0.039222
ADJ,0.0,0.054495,0.076782,0.129904,0.013815,0.003358,0.001526,0.000305,0.042894,0.001221,0.517631,0.005037,0.065639,0.008319,0.00893,0.031598,0.01183,0.022897,0.003816
ADP,0.0,0.071204,0.030347,0.024165,0.01596,0.000843,0.003316,0.000281,0.005845,0.001293,0.16219,0.358379,0.13291,0.040238,0.006463,0.000843,0.133191,0.003091,0.009441
PUNCT,0.0,0.053647,0.050289,0.08496,0.061534,0.027565,0.005388,0.011479,0.109402,0.008746,0.12705,0.064032,0.098001,0.032563,0.081446,0.010542,0.122677,0.030923,0.019756
ADV,0.0,0.141085,0.091577,0.171644,0.089295,0.040976,0.002976,0.000992,0.026094,0.00129,0.014188,0.045838,0.009227,0.018256,0.195357,0.01766,0.083639,0.035619,0.014287
AUX,0.0,0.106589,0.032827,0.016608,0.140741,0.079064,0.00078,0.000312,0.001949,0.000234,0.011852,0.080312,0.006394,0.009045,0.338246,0.118441,0.046628,0.008265,0.001715
SYM,0.0,0.041074,0.011058,0.044234,0.004739,0.004739,0.017378,0.00158,0.039494,0.0,0.137441,0.006319,0.041074,0.609795,0.022117,0.00158,0.012638,0.00316,0.00158
INTJ,0.0,0.001477,0.008863,0.385524,0.033973,0.019202,0.002954,0.023634,0.011817,0.0,0.019202,0.014771,0.025111,0.008863,0.354505,0.0,0.073855,0.004431,0.011817
CCONJ,0.0,0.087659,0.027973,0.013014,0.081975,0.049065,0.004039,0.002992,0.00015,0.000598,0.1454,0.097233,0.075991,0.014361,0.173373,0.015707,0.167091,0.017053,0.026328
X,0.0,0.016548,0.044917,0.264775,0.002364,0.004728,0.0,0.0,0.01182,0.513002,0.068558,0.007092,0.018913,0.007092,0.009456,0.002364,0.021277,0.004728,0.002364
