<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/task_8_transitions_solved.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 8: POS transition probabilities

In the lecture, we briefly saw the concept of hidden Markov models and transition probabilities, as applied to POS tags. In the simplest case, these probabilities model sequences of POS tag pairs: the probability of DET -> NOUN is the probability of seeing a NOUN immediately after a DET (determiner), or more formally P(NOUN|DET). We also had the intuition that, for example, DET -> NOUN should be much more probable than, say, DET -> VERB. And of course, since these are probabilities, P(x|y) summed over all x should equal 1 for any given y. These probabilities can be easily estimated by counting from the data: P(NOUN|DET) is simply the number of times you saw a NOUN directly following a DET, divided by the number of times you saw a DET with any tag following it.
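
To make the counting recipe concrete, here is a minimal sketch on three invented tag sequences (toy data, not taken from any treebank). The names `pair_counts` and `transition_prob` are made up for this illustration.

```python
from collections import Counter, defaultdict

# Toy tag sequences, invented for illustration (not taken from any treebank)
sentences = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "ADJ", "NOUN", "VERB"],
    ["PRON", "VERB", "DET", "NOUN"],
]

# pair_counts[prev][nxt] = how many times nxt directly follows prev
pair_counts = defaultdict(Counter)
for tags in sentences:
    for prev, nxt in zip(tags, tags[1:]):
        pair_counts[prev][nxt] += 1

def transition_prob(prev, nxt):
    """Estimate P(nxt | prev) as count(prev -> nxt) / count(prev -> anything)."""
    total = sum(pair_counts[prev].values())
    return pair_counts[prev][nxt] / total if total else 0.0

print("DET -> NOUN", transition_prob("DET", "NOUN"))
print("DET -> ADJ", transition_prob("DET", "ADJ"))
print("sum over x of P(x|DET)", sum(transition_prob("DET", x) for x in pair_counts["DET"]))
```

In this toy data DET is followed by NOUN three times and by ADJ once, so the two estimates are 3/4 and 1/4, and together they sum to 1.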

Your task is to pick a Universal Dependencies dataset of your choice, e.g. UD_English-EWT training data, calculate these transition probabilities, pretty-print them if you can, and check that our intuitions hold, i.e. that for example DET -> NOUN is substantially more likely than, say, DET -> VERB.



In [None]:
# Grab the data

!wget https://github.com/UniversalDependencies/UD_English-EWT/archive/refs/heads/master.zip
!unzip master.zip

In [None]:
# I guess this conllu library is quite useful after all

!pip install conllu

Collecting conllu
  Downloading conllu-4.5.3-py2.py3-none-any.whl (16 kB)
Installing collected packages: conllu
Successfully installed conllu-4.5.3


In [None]:
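import conllu
inp_data="UD_English-EWT-master/en_ewt-ud-train.conllu"
# read the whole file as UTF-8 (closing it afterwards) and parse it into a list of sentences
with open(inp_data, encoding="utf-8") as f:
    conllu_data=conllu.parse(f.read())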
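
To see where the tags come from, it helps to look at raw CoNLL-U lines. The sketch below uses only the standard library and a hand-written sentence (a hypothetical example, not copied from EWT); it is not how the `conllu` library is implemented, and `upos_tags` is a made-up helper. Each token line has ten tab-separated columns, with UPOS in the fourth. A multiword token such as "Don't" gets an extra line with a range ID (`1-2`) and `_` as its UPOS, which is presumably why `_` needs a slot in the tag list unless such lines are filtered out.

```python
# A hand-written CoNLL-U fragment (hypothetical example, not taken from EWT)
sample = "\n".join([
    "# text = Don't panic.",
    "1-2\tDon't\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tDo\tdo\tAUX\t_\t_\t3\taux\t_\t_",
    "2\tn't\tnot\tPART\t_\t_\t3\tadvmod\t_\t_",
    "3\tpanic\tpanic\tVERB\t_\t_\t0\troot\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
])

def upos_tags(block, skip_ranges=False):
    """Return the UPOS column (4th field) of every token line in one sentence block."""
    tags = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        token_id, upos = cols[0], cols[3]
        if skip_ranges and "-" in token_id:
            continue  # multiword token range such as "1-2"
        tags.append(upos)
    return tags

print(upos_tags(sample))
print(upos_tags(sample, skip_ranges=True))
```

Whether to keep or skip the range lines is a design choice: keeping them adds transitions into and out of `_`, while skipping them leaves only transitions between real tags.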

In [None]:
# Let's get the counts into a numpy 2D array, but
# of course there are very many ways to achieve
# this same job

import numpy as np
import itertools

# I grabbed this list from the documentation on the UD pages
# and added a "START" tag to mark the start of the sentence;
# "_" covers lines in the data that carry no UPOS tag (e.g. multiword token ranges)
all_tags= "START ADJ ADP PUNCT ADV AUX SYM INTJ CCONJ X NOUN DET PROPN NUM VERB PART PRON SCONJ _".split()

def idx(tag):
    """Utility function which turns a tag into a 0-based integer index"""
    return all_tags.index(tag)

counts=np.zeros((len(all_tags),len(all_tags)),dtype=float) #0-filled array of the correct size

for sentence in conllu_data:
    #get the sequence of POS tags in the sentence, pre-pended with START
    pos_sequence=["START"]+[token['upos'] for token in sentence]
    #now loop over all adjacent tag pairs - there are so many ways to do that, this one is quite elegant I think (itertools.pairwise needs Python 3.10+)
    for pos_from, pos_to in itertools.pairwise(pos_sequence):
        pos_from_i=idx(pos_from)
        pos_to_i=idx(pos_to)
        counts[pos_from_i,pos_to_i]+=1

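# normalize into probabilities by dividing each row by its own row sum;
# keepdims=True keeps the sums as a column vector, so row i is divided by sum i
# (a plain 1-D vector of sums would broadcast across the columns instead)
probs=counts/np.sum(counts,axis=1,keepdims=True)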

# Print a few to see if they make any sense, and they do
print("DET -> NOUN", probs[idx("DET"),idx("NOUN")])
print("DET -> VERB", probs[idx("DET"),idx("VERB")])
print("DET -> DET", probs[idx("DET"),idx("DET")])
print("AUX -> VERB", probs[idx("AUX"),idx("VERB")])
print("VERB -> AUX", probs[idx("VERB"),idx("AUX")])



DET -> NOUN 0.279517790945445
DET -> VERB 0.013699237453449193
DET -> DET 0.009877906620038039
AUX -> VERB 0.19232133356978187
VERB -> AUX 0.011851851851851851
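
Since P(x|y) summed over all x must equal 1, each row of `probs` should sum to 1, and that is easy to test. The sketch below uses made-up counts for three tags (an assumption, chosen only so that the row sums differ) and compares two ways of dividing by row sums in NumPy. Dividing by the 1-D result of `sum(axis=1)` broadcasts the sums across the columns, so entry `[i, j]` gets divided by the sum of row `j`; with `keepdims=True` the sums stay a column vector and each row is divided by its own sum.

```python
import numpy as np

# Toy transition counts (made up): rows = "from" tag, columns = "to" tag
tags = ["DET", "NOUN", "VERB"]
counts = np.array([
    [0.0, 9.0, 1.0],    # DET  -> DET, NOUN, VERB
    [1.0, 2.0, 7.0],    # NOUN -> DET, NOUN, VERB
    [12.0, 6.0, 2.0],   # VERB -> DET, NOUN, VERB
])

row_sums = counts.sum(axis=1)   # shape (3,): [10, 10, 20]

# 1-D divisor: broadcast along the last axis, entry [i, j] is divided by row_sums[j]
by_column = counts / row_sums

# (3, 1) divisor: entry [i, j] is divided by row_sums[i], which is what we want
by_row = counts / counts.sum(axis=1, keepdims=True)

print("row sums with 1-D divisor:   ", by_column.sum(axis=1))
print("row sums with keepdims=True: ", by_row.sum(axis=1))
print("P(NOUN|DET):", by_row[tags.index("DET"), tags.index("NOUN")])
```

The values printed above are worth re-checking this way: rows of `probs` that do not sum to 1 mean the division went along the wrong axis.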


In [None]:
# Now we can do some pretty-printing
# this is what google thought to be a good way, can't argue with that :)
import pandas
df = pandas.DataFrame(probs, columns=all_tags, index=all_tags)
df

from \ to,START,ADJ,ADP,PUNCT,ADV,AUX,SYM,INTJ,CCONJ,X,NOUN,DET,PROPN,NUM,VERB,PART,PRON,SCONJ,_
START,0.0,0.039078,0.030797,0.034515,0.093362,0.024795,0.153239,0.598227,0.04353,0.364066,0.022709,0.077305,0.119867,0.121866,0.033162,0.010268,0.169673,0.116176,0.188289
ADJ,0.0,0.054495,0.056536,0.132906,0.017958,0.003431,0.031596,0.005908,0.084069,0.037825,0.197963,0.004049,0.07358,0.027054,0.005187,0.07205,0.00832,0.078145,0.019135
ADP,0.0,0.096703,0.030347,0.033578,0.028177,0.00117,0.093207,0.007386,0.015557,0.054374,0.084241,0.391251,0.202344,0.177712,0.005098,0.002611,0.127214,0.014327,0.064294
PUNCT,0.0,0.052435,0.036192,0.08496,0.078182,0.027524,0.109005,0.217134,0.209574,0.264775,0.047491,0.05031,0.107375,0.1035,0.04624,0.023495,0.084326,0.103152,0.096824
ADV,0.0,0.108533,0.051871,0.135093,0.089295,0.032203,0.047393,0.014771,0.039342,0.030733,0.004174,0.028345,0.007957,0.045669,0.087294,0.030978,0.04525,0.093514,0.055109
AUX,0.0,0.104335,0.02366,0.016633,0.179085,0.079064,0.015798,0.005908,0.00374,0.007092,0.004437,0.063194,0.007016,0.028791,0.192321,0.264358,0.032099,0.027611,0.008419
SYM,0.0,0.001984,0.000393,0.002186,0.000298,0.000234,0.017378,0.001477,0.00374,0.0,0.002539,0.000245,0.002225,0.095805,0.000621,0.000174,0.000429,0.000521,0.000383
INTJ,0.0,7.6e-05,0.000337,0.020381,0.002282,0.001014,0.00316,0.023634,0.001197,0.0,0.000379,0.000614,0.001454,0.001489,0.01064,0.0,0.002684,0.000781,0.003062
CCONJ,0.0,0.044726,0.010509,0.006794,0.05437,0.025575,0.042654,0.029542,0.00015,0.009456,0.028372,0.03988,0.043463,0.023827,0.051383,0.018274,0.059957,0.029695,0.067356
X,0.0,0.000534,0.001068,0.008746,9.9e-05,0.000156,0.0,0.0,0.000748,0.513002,0.000846,0.000184,0.000684,0.000745,0.000177,0.000174,0.000483,0.000521,0.000383
