# 1) PoS Tagger for Dead Languages

Build a statistical, HMM-based PoS tagger for Ancient Greek and Latin.

- A. Implement Learning (counting) and Decoding (Viterbi)
- B. Train the system on Greek and Latin (separately) using 2 treebanks from the UD project
- C. Evaluate the system using different smoothing strategies
- D. Evaluate against an easy baseline and a hard one
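
The plan above does not say which baselines are meant. One common choice for the easy baseline, shown here purely as an assumption, is to tag every word with its most frequent training tag and to fall back to the overall most frequent tag for unknown words. The function names `train_baseline` and `tag_baseline` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Map each word to its most frequent tag; also return the overall most frequent tag."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            per_word[word][tag] += 1
            overall[tag] += 1
    best = {word: counts.most_common(1)[0][0] for word, counts in per_word.items()}
    default_tag = overall.most_common(1)[0][0]
    return best, default_tag

def tag_baseline(words, best, default_tag):
    """Tag known words with their usual tag and unknown words with the default tag."""
    return [best.get(w, default_tag) for w in words]

# Toy training data (hypothetical), as lists of (word, tag) pairs
toy_train = [
    [("in", "ADP"), ("Deus", "PROPN"), ("nomen", "NOUN"), (".", "PUNCT")],
    [("nomen", "NOUN"), ("regnum", "NOUN"), (".", "PUNCT")],
]
best, default_tag = train_baseline(toy_train)
baseline_tags = tag_baseline(["in", "regnum", "Carolus"], best, default_tag)
```

Any HMM tagger should beat this kind of baseline, so it is a useful sanity check before comparing smoothing strategies.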

I need to determine:

- transition probabilities
- emission probabilities

To do this, I start from an annotated corpus on which to perform **LEARNING**. I can then move on to the **DECODING** phase, which returns the most probable sequence of tags for an input sentence and thereby solves the PoS tagging problem.

The goal is to find the tag sequence T=(t1,t2,…,tN) that best fits a given sequence of observed words O=(o1,o2,…,oN). The **Viterbi algorithm** finds the best tag sequence using dynamic programming. The idea is to compute an optimal tag sequence (that is, one that maximizes the probability) recursively from optimal solutions to subproblems, each of which considers a truncated version of the observation sequence. In practice, I reuse the prefixes already computed: at each new step the probability is derived from the best scores of the previous step, which are stored in the **Viterbi matrix**.
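
The recursion can be sketched on a toy model with two tags and two words. This is a minimal illustration in log space, not the notebook's final implementation; the function name `viterbi_toy` and all probabilities are made-up assumptions.

```python
import numpy as np

def viterbi_toy(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for obs (log-space Viterbi)."""
    T, M = len(obs), len(states)
    with np.errstate(divide="ignore"):  # log(0) = -inf marks impossible paths
        log_start = np.log(start_p)
        log_trans = np.log(trans_p)
        log_emit = np.log(emit_p)
    score = np.full((T, M), -np.inf)   # the Viterbi matrix
    back = np.zeros((T, M), dtype=int)  # best predecessor of each cell
    score[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        # cand[i, j]: best score ending in state i at t-1, then moving to j
        cand = score[t - 1][:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[:, obs[t]]
    # Follow the back-pointers from the best final state
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[i] for i in reversed(path)]

states = ["NOUN", "VERB"]
start_p = np.array([0.8, 0.2])                 # P(tag at sentence start)
trans_p = np.array([[0.3, 0.7], [0.8, 0.2]])   # rows: previous tag, columns: next tag
emit_p = np.array([[0.9, 0.1], [0.2, 0.8]])    # rows: tag, columns: word index
best_path = viterbi_toy([0, 1, 0], states, start_p, trans_p, emit_p)
```

Working with log probabilities turns products into sums and avoids numerical underflow on long sentences.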

## The Learning Problem

The procedure leading up to the Viterbi algorithm consists of the following steps:

- Lists of words and PoS tags
- PoS->PoS probabilities: P(ti|ti-1) = C(ti-1,ti)/C(ti-1)
- PoS->Word probabilities: P(wi|ti) = C(ti,wi)/C(ti)
- Viterbi
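
The two counting formulas can be illustrated on a tiny hand-made corpus. This is a sketch under assumptions: the toy sentences and the helper names `p_transition` and `p_emission` are hypothetical and not part of the notebook.

```python
from collections import Counter

# Toy corpus (hypothetical): each sentence is a list of (word, tag) pairs
toy = [
    [("rex", "NOUN"), ("regnat", "VERB"), (".", "PUNCT")],
    [("rex", "NOUN"), ("uenit", "VERB"), (".", "PUNCT")],
]

tag_count = Counter()     # C(t)
bigram_count = Counter()  # C(t_prev, t)
emit_count = Counter()    # C(t, w)
for sentence in toy:
    tags = [t for _, t in sentence]
    tag_count.update(tags)
    bigram_count.update(zip(tags, tags[1:]))
    emit_count.update((t, w) for w, t in sentence)

def p_transition(prev_tag, tag):
    """P(tag | prev_tag) = C(prev_tag, tag) / C(prev_tag)."""
    return bigram_count[(prev_tag, tag)] / tag_count[prev_tag]

def p_emission(word, tag):
    """P(word | tag) = C(tag, word) / C(tag)."""
    return emit_count[(tag, word)] / tag_count[tag]
```

For example, `p_transition("NOUN", "VERB")` is 2/2 because both NOUN tokens are followed by a VERB, and `p_emission("regnat", "VERB")` is 1/2 because one of the two VERB tokens is `regnat`.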

### Latino

For Latin we use the Late Latin Charter Treebank (LLCT) as the corpus: https://universaldependencies.org/treebanks/la_llct/index.html

Zeman, Daniel; et al., 2021, Universal Dependencies 2.8.1, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-3687.

In [1]:
#pyconll is designed as a flexible wrapper around the CoNLL-U format
import pyconll

In [2]:
#Loading CoNLL-U files
train = pyconll.load_from_file('UD_Latin-LLCT/la_llct-ud-train.conllu')
dev = pyconll.load_from_file('UD_Latin-LLCT/la_llct-ud-dev.conllu')
test = pyconll.load_from_file('UD_Latin-LLCT/la_llct-ud-test.conllu')

In [3]:
elenco_parole = []
elenco_frasi = []
elenco_PoST = []
elenco_frasi_post = []
counter = -1

for sentence in train:
    counter += 1
    elenco_frasi.append([])
    elenco_frasi_post.append([])
    for token in sentence:
        elenco_parole.append(token.lemma)
        elenco_PoST.append(token.upos)
        elenco_frasi[counter].append(token.lemma)
        elenco_frasi_post[counter].append(token.upos)

In [4]:
elenco_frasi

[['+',
  'in',
  'Deus',
  'nomen',
  'regno',
  'domnus',
  'noster',
  'Carolus',
  'rex',
  'franci',
  'et',
  'langobardi',
  ',',
  'annus',
  'regnum',
  'is',
  'qui',
  'capio',
  'Langobardia',
  'primus',
  ',',
  'septimus',
  'decimus',
  'kalendae',
  'augustus',
  ',',
  'per',
  'indictio',
  'duodecimus',
  '.'],
 ['consto',
  'ego',
  'Sanitulus',
  'filius',
  'quondam',
  'Cichus',
  'de',
  'locus',
  'Brancalum',
  'praesens',
  'dies',
  'per',
  'hic',
  'chartula',
  'uendo',
  'et',
  'trado',
  'praeuideo',
  'tu',
  'Rachiprandus',
  'presbyter',
  'ecclesia',
  'sanctus',
  'Maria',
  'sino',
  'in',
  'Sextum',
  'duo',
  'petia',
  'de',
  'uinea',
  'meus',
  'qui',
  'habeo',
  'in',
  'locus',
  'Medianum',
  ';'],
 ['unus',
  'petia',
  'sum',
  'teneo',
  'unus',
  'caput',
  'in',
  'uinea',
  'ecclesia',
  'sanctus',
  'Maria',
  'et',
  'ecclesia',
  'sanctus',
  'Fridianus',
  ',',
  'alius',
  'caput',
  'in',
  'uirgareum',
  'sanctus',
  'Mari

In [10]:
elenco_PoST

['PUNCT',
 'ADP',
 'PROPN',
 'NOUN',
 'VERB',
 'NOUN',
 'DET',
 'PROPN',
 'NOUN',
 'NOUN',
 'CCONJ',
 'NOUN',
 'PUNCT',
 'NOUN',
 'NOUN',
 'PRON',
 'PRON',
 'VERB',
 'PROPN',
 'ADJ',
 'PUNCT',
 'ADJ',
 'ADJ',
 'NOUN',
 'ADJ',
 'PUNCT',
 'ADP',
 'NOUN',
 'ADJ',
 'PUNCT',
 'VERB',
 'PRON',
 'PROPN',
 'NOUN',
 'ADJ',
 'PROPN',
 'ADP',
 'NOUN',
 'PROPN',
 'ADJ',
 'NOUN',
 'ADP',
 'DET',
 'NOUN',
 'VERB',
 'CCONJ',
 'VERB',
 'VERB',
 'PRON',
 'PROPN',
 'NOUN',
 'NOUN',
 'ADJ',
 'PROPN',
 'VERB',
 'ADP',
 'PROPN',
 'NUM',
 'NOUN',
 'ADP',
 'NOUN',
 'DET',
 'PRON',
 'VERB',
 'ADP',
 'NOUN',
 'PROPN',
 'PUNCT',
 'NUM',
 'NOUN',
 'AUX',
 'VERB',
 'NUM',
 'NOUN',
 'ADP',
 'NOUN',
 'NOUN',
 'ADJ',
 'PROPN',
 'CCONJ',
 'NOUN',
 'ADJ',
 'PROPN',
 'PUNCT',
 'DET',
 'NOUN',
 'ADP',
 'NOUN',
 'ADJ',
 'PROPN',
 'CCONJ',
 'DET',
 'NOUN',
 'ADP',
 'NOUN',
 'DET',
 'NOUN',
 'ADJ',
 'PROPN',
 'PUNCT',
 'DET',
 'NOUN',
 'AUX',
 'VERB',
 'DET',
 'NOUN',
 'CCONJ',
 'DET',
 'NOUN',
 'ADP',
 'NOUN',
 'DET',
 'N

In [5]:
#these lists will hold the sentence-initial words and their PoS tags
IN_parole = []
IN_PoST = []

#Handle the sentence-initial words
for sentence in train:
    for token in sentence:
        IN_parole.append(token.lemma)
        IN_PoST.append(token.upos)
        break

In [6]:
len(IN_parole)

7289

In [11]:
#list of unique words
words = set(elenco_parole)

#List of unique PoS tags
PoST = set(elenco_PoST)

In [12]:
words

{'ascribo',
 'fodio',
 'Landipertus',
 'testimonium',
 'Cosella',
 'Bonus',
 'fero',
 'denarius',
 'Rotpaldus',
 'Leo',
 'ab',
 'Arenarium',
 'uerus',
 'scriptio',
 'mandatum',
 'fiducia',
 'praeparo',
 'Rotchis',
 'lupus',
 'Teutprandulus',
 'Segium',
 'Eleazar',
 'Fuselprandus',
 'Tachiprandus',
 'nosmetipse',
 'affirmo',
 'molinum',
 'Paldus',
 'Totus',
 'Periprandus',
 'Walprandus',
 'galleta',
 'Ghaudiprandus',
 'pondus',
 'considero',
 'ususfructus',
 'uino',
 'uernius',
 'magnificus',
 'transigo',
 'expendibilis',
 'superpono',
 'Pipinus',
 'Sichualdus',
 'Atripaldulus',
 'do',
 'Lamari',
 'Daghittus',
 'bene',
 'nutrimen',
 'Jordanes',
 'appareo',
 'Deblum',
 'Barucciolus',
 'Rupta',
 'saeculum',
 'Liutcharus',
 'delecto',
 'Ghiselprandus',
 'humilis',
 'lignamen',
 'admembro',
 'priuo',
 'episcopium',
 'impendo',
 'ingredior',
 'Gaudiosulus',
 'uenio',
 'mensis',
 'pagina',
 'legalis',
 'quia',
 'Clarissimus',
 'Vincentius',
 'uassallus',
 'Magnentiolus',
 'illustris',
 'Ilmer

In [13]:
PoST

{'ADJ',
 'ADP',
 'ADV',
 'AUX',
 'CCONJ',
 'DET',
 'NOUN',
 'NUM',
 'PART',
 'PRON',
 'PROPN',
 'PUNCT',
 'SCONJ',
 'VERB',
 'X'}

In [14]:
from collections import Counter
PoST_dict = Counter(elenco_PoST)
PoST_dict

Counter({'PUNCT': 28612,
         'ADP': 17782,
         'PROPN': 15994,
         'NOUN': 41461,
         'VERB': 24010,
         'DET': 15271,
         'CCONJ': 11107,
         'PRON': 14406,
         'ADJ': 10551,
         'NUM': 1844,
         'AUX': 2228,
         'SCONJ': 3355,
         'ADV': 6917,
         'PART': 533,
         'X': 79})

In [15]:
#Number of times each PoS appears at the start of a sentence
IN_dict = Counter(IN_PoST)
IN_dict

Counter({'PUNCT': 2757,
         'VERB': 542,
         'NUM': 10,
         'DET': 254,
         'CCONJ': 1395,
         'ADV': 512,
         'NOUN': 1016,
         'PROPN': 72,
         'PRON': 189,
         'ADP': 167,
         'ADJ': 339,
         'SCONJ': 28,
         'PART': 4,
         'X': 1,
         'AUX': 3})

In [16]:
import numpy as np
import pandas as pd

# Starting point (probability of each PoS at the start of a sentence)
s = pd.DataFrame(index = [0], columns=sorted(PoST))
for i in PoST:
    s[str(i)] = IN_dict[i]/len(IN_PoST)
print(s.shape)
print(s)

# Transition probabilities (for now a matrix of zeros to be filled in)
a = pd.DataFrame(0., index=PoST, columns=PoST)
print(a.shape)

# Emission probabilities
b = pd.DataFrame(0., index=PoST, columns=words)
print(b.shape)

(1, 15)
        ADJ       ADP       ADV       AUX     CCONJ       DET      NOUN  \
0  0.046508  0.022911  0.070243  0.000412  0.191384  0.034847  0.139388   

        NUM      PART      PRON     PROPN     PUNCT     SCONJ      VERB  \
0  0.001372  0.000549  0.025929  0.009878  0.378241  0.003841  0.074359   

          X  
0  0.000137  
(15, 15)
(15, 3235)


In [22]:
#filling a
transitions = elenco_PoST[:5000] #for now only the first 5000 of 194,150 PoS tags are used (for computational reasons)
#pair each tag with the one that follows it (the last tag has no successor)
df = pd.DataFrame({'state': transitions[:-1], 'next_state': transitions[1:]})
cross_tab = pd.crosstab(df['state'], df['next_state'])
a = cross_tab.div(cross_tab.sum(axis=1), axis=0)

#Add the starting point on top of the transition matrix
a = pd.concat([pd.DataFrame(s), a])
a

next_state,ADJ,ADP,ADV,AUX,CCONJ,DET,NOUN,NUM,PART,PRON,PROPN,PUNCT,SCONJ,VERB,X
0,0.046508,0.022911,0.070243,0.000412,0.191384,0.034847,0.139388,0.001372,0.000549,0.025929,0.009878,0.378241,0.003841,0.074359,0.000137
ADJ,0.069597,0.018315,0.0,0.029304,0.07326,0.007326,0.230769,0.0,0.0,0.014652,0.245421,0.278388,0.003663,0.029304,0.0
ADP,0.098765,0.004115,0.00823,0.0,0.0,0.298354,0.347737,0.006173,0.0,0.084362,0.109053,0.0,0.0,0.04321,0.0
ADV,0.051852,0.185185,0.022222,0.014815,0.007407,0.037037,0.096296,0.0,0.0,0.044444,0.103704,0.051852,0.014815,0.37037,0.0
AUX,0.056604,0.113208,0.018868,0.0,0.018868,0.09434,0.018868,0.0,0.0,0.132075,0.037736,0.320755,0.018868,0.169811,0.0
CCONJ,0.047619,0.166667,0.047619,0.006803,0.0,0.136054,0.231293,0.040816,0.006803,0.044218,0.034014,0.010204,0.047619,0.180272,0.0
DET,0.025974,0.075758,0.054113,0.004329,0.04329,0.071429,0.471861,0.002165,0.0,0.02381,0.0671,0.047619,0.017316,0.095238,0.0
NOUN,0.102302,0.111679,0.012788,0.005115,0.070759,0.12532,0.086957,0.007673,0.0,0.056266,0.102302,0.184996,0.005115,0.127877,0.000853
NUM,0.0,0.078947,0.026316,0.0,0.052632,0.0,0.736842,0.0,0.0,0.0,0.0,0.105263,0.0,0.0,0.0
PART,0.0,0.090909,0.0,0.090909,0.181818,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.636364,0.0
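
Because the cell above reads tags from the flat list `elenco_PoST`, it also counts a transition from the last tag of each sentence to the first tag of the next one. A possible alternative, sketched here under assumptions, counts pairs sentence by sentence on input shaped like `elenco_frasi_post`. The function name `transition_matrix` and the toy input are hypothetical.

```python
import pandas as pd

def transition_matrix(tag_sentences):
    """Row-normalised tag->tag counts, never pairing tags across sentence boundaries."""
    pairs = [(prev, nxt)
             for tags in tag_sentences
             for prev, nxt in zip(tags, tags[1:])]
    df = pd.DataFrame(pairs, columns=["state", "next_state"])
    counts = pd.crosstab(df["state"], df["next_state"])
    # Divide every row by its total so that each row sums to 1
    return counts.div(counts.sum(axis=1), axis=0)

# Toy input (hypothetical) with the same shape as elenco_frasi_post
toy_tags = [["NOUN", "VERB", "PUNCT"], ["NOUN", "ADJ", "VERB", "PUNCT"]]
a_toy = transition_matrix(toy_tags)
```

In the toy input, `PUNCT` only ends sentences, so it gets no row: the boundary between the two sentences contributes no `PUNCT -> NOUN` pair.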


In [780]:
#filling b: count each (PoS, word) pair, then divide every row by the PoS count
for (pos, word), n in Counter(zip(elenco_PoST, elenco_parole)).items():
    b.loc[pos, word] = n
for p in PoST:
    b.loc[p] = b.loc[p]/PoST_dict[p]
b

Unnamed: 0,conuenientia,Teutfridus,Airualdus,manifestatio,etiam,impono,Ermualdus,octo,Turris,apostolus,...,Damianus,Liutpertus,altercatio,clauaca,Hildipertus,altar,compositio,nepos,Grumulum,Quesina
NOUN,0.000289,0.0,0.0,4.8e-05,0.0,0.0,0.0,0.0,0.0,0.000555,...,0.0,0.0,0.000145,4.8e-05,0.0,2.4e-05,0.000169,0.000892,0.0,0.0
ADJ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CCONJ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
X,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
VERB,0.0,0.0,0.0,0.0,0.0,0.000208,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
NUM,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014642,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SCONJ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DET,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PRON,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AUX,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
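
Most entries of the emission matrix are zero, and a word never seen in training has no column at all, so any path through it gets probability zero. One simple smoothing strategy, shown as a sketch and not as the notebook's chosen method, is add-alpha (Laplace) smoothing with an extra column for unknown words. It works on raw counts rather than on the relative frequencies stored in `b`; the function name `smooth_emissions` and the `<UNK>` label are hypothetical.

```python
import pandas as pd

def smooth_emissions(counts, alpha=1.0, unk="<UNK>"):
    """Add-alpha smoothing of a tag x word count table, with one extra column for unknown words."""
    counts = counts.copy()
    counts[unk] = 0.0                  # unknown words start with a zero count
    smoothed = counts + alpha          # every (tag, word) cell gets alpha pseudo-counts
    return smoothed.div(smoothed.sum(axis=1), axis=0)

# Toy count table (hypothetical): rows are tags, columns are words
toy_counts = pd.DataFrame([[3.0, 0.0], [0.0, 1.0]],
                          index=["NOUN", "VERB"], columns=["rex", "regnat"])
b_toy = smooth_emissions(toy_counts)
```

At decoding time, a test word missing from the columns would be looked up in the `<UNK>` column, so it no longer zeroes out every path.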


In [25]:
#here I consider the test set
test_set = []
test_pos = []
counter = -1

for sentence in test:
    counter += 1
    test_set.append([])
    test_pos.append([])
    for token in sentence:
        test_set[counter].append(token.lemma)
        test_pos[counter].append(token.upos)

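#sentences have different lengths, so store them as an object array of lists
V = np.array(test_set, dtype=object)
P = np.array(test_pos, dtype=object)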
print(V)
print(P)

[list(['+', 'in', 'Deus', 'omnipotens', 'nomen', ',', 'regno', 'domnus', 'noster', 'Carolus', 'diuinus', 'faueo', 'clementia', 'imperator', 'augustus', ',', 'annus', 'imperium', 'is', 'septimus', ',', 'pridie', 'idus', 'augustus', 'indictio', 'quintus', '.'])
 list(['manifestus', 'sum', 'ego', 'Teutpertus', 'diaconus', ',', 'filius', 'bonus', 'memoria', 'Teutpertus', ',', 'quia', 'tu', 'Gherardus', ',', 'gratia', 'Deus', 'hic', 'sanctus', 'lucanus', 'ecclesia', 'humilis', 'episcopus', ',', 'per', 'chartula', 'libellarius', 'nomen', 'ad', 'census', 'perexsoluo', 'do', 'ego', 'is', 'sum', 'res', 'meus', 'in', 'locus', 'et', 'finis', 'ubi', 'uocito', 'Jouerianum', ',', 'pertineo', 'ecclesia', 'sanctus', 'Petrus', 'ubi', 'uocito', 'Sumualdus', 'sino', 'foras', 'ciuitas', 'iste', 'lucensis', 'qui', 'sum', 'desub', 'potestas', 'ipse', 'episcopatus', 'uester', 'sanctus', 'Martinus', ',', 'et', 'ipse', 'res', 'Leulus', 'et', 'Dominicus', 'ad', 'manus', 'suus', 'modo', 'habeo', 'uideo', ',', 't

In [784]:
V[0]

['+',
 'in',
 'Deus',
 'omnipotens',
 'nomen',
 ',',
 'regno',
 'domnus',
 'noster',
 'Carolus',
 'diuinus',
 'faueo',
 'clementia',
 'imperator',
 'augustus',
 ',',
 'annus',
 'imperium',
 'is',
 'septimus',
 ',',
 'pridie',
 'idus',
 'augustus',
 'indictio',
 'quintus',
 '.']

In [26]:
P[0]

['PUNCT',
 'ADP',
 'PROPN',
 'ADJ',
 'NOUN',
 'PUNCT',
 'VERB',
 'NOUN',
 'DET',
 'PROPN',
 'ADJ',
 'VERB',
 'NOUN',
 'NOUN',
 'NOUN',
 'PUNCT',
 'NOUN',
 'NOUN',
 'PRON',
 'ADJ',
 'PUNCT',
 'ADP',
 'NOUN',
 'NOUN',
 'NOUN',
 'ADJ',
 'PUNCT']

In [937]:
def viterbi (obs, transition, emission):
    post = []
    T = len(obs) #number of observations
    M = transition.shape[0]-1 #number of PoS tags (excluding the starting point)
    
    prev = np.zeros((T - 1, M))
    
    #Create the Viterbi matrix to be filled in
    viterbi_matrix = pd.DataFrame(0., index=PoST, columns=obs)
    
    #initialization step
    #NOTE: the positional lookup assumes transition columns and emission rows list the tags in the same order
    for i in range(M):
        viterbi_matrix.iloc[i, 0] = transition.iloc[0, i] * emission.iloc[i][obs[0]]
    
    #recursion step (still to be implemented in this first draft)
    for t in range(1, T):
        for j in range(M):
            pass
    
    print(viterbi_matrix)
    
    return post

In [938]:
viterbi(V[0], a, b)

              +   in  Deus  omnipotens  nomen    ,  regno  domnus  noster  \
NOUN   0.000000  0.0   0.0         0.0    0.0  0.0    0.0     0.0     0.0   
ADJ    0.000000  0.0   0.0         0.0    0.0  0.0    0.0     0.0     0.0   
CCONJ  0.000000  0.0   0.0         0.0    0.0  0.0    0.0     0.0     0.0   
X      0.000000  0.0   0.0         0.0    0.0  0.0    0.0     0.0     0.0   
VERB   0.000000  0.0   0.0         0.0    0.0  0.0    0.0     0.0     0.0   
NUM    0.000000  0.0   0.0         0.0    0.0  0.0    0.0     0.0     0.0   
SCONJ  0.000000  0.0   0.0         0.0    0.0  0.0    0.0     0.0     0.0   
DET    0.000000  0.0   0.0         0.0    0.0  0.0    0.0     0.0     0.0   
PRON   0.000000  0.0   0.0         0.0    0.0  0.0    0.0     0.0     0.0   
AUX    0.000000  0.0   0.0         0.0    0.0  0.0    0.0     0.0     0.0   
PUNCT  0.001248  0.0   0.0         0.0    0.0  0.0    0.0     0.0     0.0   
PART   0.000000  0.0   0.0         0.0    0.0  0.0    0.0     0.0     0.0   

[]

In [922]:
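ini = s.iloc[0] #start probabilities from the starting-point row s, indexed by PoS tag

def viterbi(V, a, b, initial_distribution):
    tags = [t for t in a.columns if t in a.index] #tags that have a transition row
    T = len(V)
    M = len(tags)

    #look everything up by label so that a, b and the start probabilities agree on tag order
    A = a.loc[tags, tags].fillna(0.).to_numpy(dtype=float)
    #words unseen in training get probability 0 (no smoothing yet)
    B = b.reindex(index=tags, columns=list(V), fill_value=0.).to_numpy(dtype=float)

    omega = np.zeros((T, M))
    prev = np.zeros((T - 1, M), dtype=int)

    with np.errstate(divide='ignore'): #log(0) = -inf marks impossible paths
        logA, logB = np.log(A), np.log(B)
        omega[0, :] = np.log(initial_distribution[tags].to_numpy(dtype=float)) + logB[:, 0]

    for t in range(1, T):
        for j in range(M):
            # Same as Forward Probability, with max in place of sum
            probability = omega[t - 1] + logA[:, j] + logB[j, t]

            # This is our most probable state given previous state at time t (1)
            prev[t - 1, j] = np.argmax(probability)

            # This is the probability of the most probable state (2)
            omega[t, j] = np.max(probability)

    #backtrack from the best final state
    path = [int(np.argmax(omega[T - 1]))]
    for t in range(T - 2, -1, -1):
        path.append(int(prev[t, path[-1]]))
    return [tags[i] for i in reversed(path)]

In [923]:
viterbi(V[0],a,b,ini)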
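
Step C of the plan calls for evaluating the tagger. A minimal sketch of token-level accuracy, assuming the predictions are stored per sentence in the same shape as the gold tags in `test_pos`, could look like this. The function name `accuracy` and the toy tag sequences are hypothetical.

```python
def accuracy(gold_sentences, predicted_sentences):
    """Token-level accuracy over parallel lists of tag sequences."""
    correct = total = 0
    for gold, pred in zip(gold_sentences, predicted_sentences):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sequences differ in length")
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return correct / total if total else 0.0

# Toy gold and predicted tags (hypothetical): 4 of the 5 tokens match
gold = [["NOUN", "VERB", "PUNCT"], ["ADP", "NOUN"]]
pred = [["NOUN", "NOUN", "PUNCT"], ["ADP", "NOUN"]]
acc = accuracy(gold, pred)
```

The same function can score a baseline and each smoothing variant, which keeps the comparison in steps C and D on a single metric.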