![Alt text](nlp.jpg "Optional title text")







<h1 style="color: #ffffff; font-size: 40px; font-weight: bold; text-align: center;
background-color: #2e2e2e; border-radius: 6px; padding: 20px; box-shadow: 3px 3px 10px #000000;
text-shadow: 0 0 5px #2ecc71, 0 0 10px #2ecc71, 0 0 20px #2ecc71, 0 0 40px #0e0e0e,
0 0 80px #0e0e0e, 0 0 90px #0e0e0e;">
  Building a maximum entropy Markov model
</h1>




In [None]:
!pip install conllu
!pip install stanza
!git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora.git

### Training the model - computing counts

For this model we estimate the following probabilities from counts:

$$P(t_i | w_i, t_{i-1}) =\frac{C(w_i, t_i, t_{i-1})}{C(w_i, t_{i-1})} $$

* `uniqueFeatureDict` $C(tag|word,prevtag) = C(w_i, t_i, t_{i-1})$
* `contextDict` $C(word,prevtag) = C(w_i, t_{i-1})$

When we consider the first element of a sentence, $w_0$, there is no previous element $w_{-1}$ and therefore no previous tag $t_{-1}$. We can model this case by assuming that a placeholder tag "None" precedes the sentence:

$$P(t_0|w_0,t_{-1}) = P(t_0|w_0,\text{"None"})$$
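
As a minimal sketch of how the two count tables and the "None" start tag fit together, the block below uses a hand-made three-sentence corpus instead of AnCora. The names `toy_corpus`, `feature_counts` and `context_counts` are illustrative assumptions and are not part of the notebook.

```python
from collections import Counter

# Hypothetical toy corpus: each sentence is a list of (word, tag) pairs
toy_corpus = [
    [("el", "DET"), ("mundo", "NOUN")],
    [("el", "DET"), ("vino", "NOUN")],
    [("él", "PRON"), ("vino", "VERB")],
]

feature_counts = Counter()  # C(w_i, t_i, t_{i-1}), keyed as 'tag|word,prevtag'
context_counts = Counter()  # C(w_i, t_{i-1}), keyed as 'word,prevtag'

for sentence in toy_corpus:
    prevtag = "None"  # placeholder tag before the first word
    for word, tag in sentence:
        feature_counts[tag + "|" + word + "," + prevtag] += 1
        context_counts[word + "," + prevtag] += 1
        prevtag = tag

print(feature_counts)
print(context_counts)
```

Both sentences that start with "el" contribute to the same key `DET|el,None`, while "vino" ends up in two different contexts because its previous tag differs.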

In [None]:
from conllu import parse_incr

uniqueFeatureDict = {}  # C(w_i, t_i, t_{i-1}), keyed as 'tag|word,prevtag'
contextDict = {}        # C(w_i, t_{i-1}), keyed as 'word,prevtag'

tagtype = 'upos'

# Accumulate the raw counts (pre-probabilities) over the training corpus
with open("UD_Spanish-AnCora/es_ancora-ud-train.conllu", "r", encoding="utf-8") as data_file:
  for tokenlist in parse_incr(data_file):
    prevtag = "None"  # placeholder tag before the first word of the sentence
    for token in tokenlist:
      tag = token[tagtype]
      word = token['form'].lower()
      # C(tag|word,prevtag)
      largeKey = tag+'|'+word+','+prevtag
      uniqueFeatureDict[largeKey] = uniqueFeatureDict.get(largeKey, 0) + 1
      # C(word,prevtag)
      key = word+','+prevtag
      contextDict[key] = contextDict.get(key, 0) + 1
      prevtag = tag

### Training the model - computing probabilities

$$P(t_i | w_i, t_{i-1}) = \frac{C(w_i, t_i, t_{i-1})}{C(w_i, t_{i-1})}$$
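
To make the ratio concrete, here is a small worked example with made-up counts for the word "vino" after a determiner. The counts and the names `feature_counts`, `context_counts` and `posterior` are assumptions chosen for illustration only.

```python
# Hypothetical counts: 'vino' seen 4 times after a DET, 3 times as NOUN and once as VERB
feature_counts = {"NOUN|vino,DET": 3, "VERB|vino,DET": 1}
context_counts = {"vino,DET": 4}

posterior = {}
for key, count in feature_counts.items():
    # everything after the first '|' is the 'word,prevtag' context
    context = key.split("|", 1)[1]
    posterior[key] = count / context_counts[context]

print(posterior)
```

The two probabilities share the same denominator, so they add up to 1 for the context `vino,DET`.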

In [None]:
posteriorProbDict = {}

# P(tag | word, prevtag) = C(word, tag, prevtag) / C(word, prevtag)
for key, count in uniqueFeatureDict.items():
  # split only at the first '|' so that a token that is itself '|'
  # stays inside the 'word,prevtag' context
  context = key.split('|', 1)[1]
  posteriorProbDict[key] = count/contextDict[context]

In [None]:
# Check that, for every context 'word,prevtag', the probabilities
# over all tags add up to 1.0. The sums are accumulated in a single
# pass instead of rescanning posteriorProbDict for every context.
probSum = {}
itemCount = {}
for key, prob in posteriorProbDict.items():
  context = key.split('|', 1)[1]
  probSum[context] = probSum.get(context, 0) + prob
  itemCount[context] = itemCount.get(context, 0) + 1

for base_context in contextDict.keys():
  print(base_context, itemCount[base_context], probSum[base_context])

### Initial distribution of latent states

In [None]:
# the unique 'upos' grammatical categories found in the corpus
stateSet = {'ADJ',
 'ADP',
 'ADV',
 'AUX',
 'CCONJ',
 'DET',
 'INTJ',
 'NOUN',
 'NUM',
 'PART',
 'PRON',
 'PROPN',
 'PUNCT',
 'SCONJ',
 'SYM',
 'VERB',
 '_'}
# number the categories so that each one maps to a row of the
# Viterbi matrix (set iteration order is arbitrary, so the indices
# can differ between runs)
tagStateDict = {}
for i, state in enumerate(stateSet):
  tagStateDict[state] = i
tagStateDict

{'ADJ': 9,
 'ADP': 8,
 'ADV': 1,
 'AUX': 14,
 'CCONJ': 4,
 'DET': 12,
 'INTJ': 15,
 'NOUN': 2,
 'NUM': 7,
 'PART': 3,
 'PRON': 5,
 'PROPN': 0,
 'PUNCT': 10,
 'SCONJ': 6,
 'SYM': 16,
 'VERB': 11,
 '_': 13}
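
The indices above come from iterating over a set, whose order Python does not guarantee. If reproducible row numbers are needed, one possible variation is to sort the tags before enumerating them. The names `tags`, `tag_to_row` and `row_to_tag` below are illustrative and not part of the notebook.

```python
# Same tag inventory as stateSet, repeated here so the block is self-contained
tags = {'ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM',
        'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', '_'}

# Sorting first makes the tag -> row mapping identical on every run
tag_to_row = {tag: row for row, tag in enumerate(sorted(tags))}

# Inverse mapping, handy for turning a row index back into a tag
row_to_tag = {row: tag for tag, row in tag_to_row.items()}

print(tag_to_row)
```

With this ordering the row numbers differ from the ones printed in this notebook, so any matrix output would list the tags in alphabetical order instead.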

In [None]:
from conllu import parse_incr

initTagStateProb = {} # \rho_i^{(0)}
count = 0 # number of sentences in the corpus
with open("UD_Spanish-AnCora/es_ancora-ud-train.conllu", "r", encoding="utf-8") as data_file:
  for tokenlist in parse_incr(data_file):
    count += 1
    # tag of the first token of the sentence
    tag = tokenlist[0]['upos']
    initTagStateProb[tag] = initTagStateProb.get(tag, 0) + 1

# turn the counts into relative frequencies
for key in initTagStateProb.keys():
  initTagStateProb[key] /= count

initTagStateProb

{'ADJ': 0.010136315973435861,
 'ADP': 0.1574274729115694,
 'ADV': 0.07577770010485844,
 'AUX': 0.022789234533379936,
 'CCONJ': 0.036980076896190144,
 'DET': 0.34799021321216356,
 'INTJ': 0.0020272631946871723,
 'NOUN': 0.025026214610276126,
 'NUM': 0.0068507514854945824,
 'PART': 0.002446696959105208,
 'PRON': 0.04173365955959455,
 'PROPN': 0.10506815798671792,
 'PUNCT': 0.09143656064313177,
 'SCONJ': 0.027123383432366307,
 'SYM': 0.0004893393918210416,
 'VERB': 0.04557846906675987,
 '_': 0.0011184900384480952}
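
The same relative-frequency estimate can be sketched on a tiny hand-made list of sentence-initial tags. The list `first_tags` and the name `init_prob` are assumptions for illustration, not data from AnCora.

```python
from collections import Counter

# Hypothetical first tags of five sentences
first_tags = ["DET", "DET", "ADP", "PROPN", "DET"]

counts = Counter(first_tags)
total = sum(counts.values())  # number of sentences

# relative frequency of each tag in sentence-initial position
init_prob = {tag: n / total for tag, n in counts.items()}

print(init_prob)
```

Tags that never start a sentence simply do not appear in the dictionary, which is equivalent to giving them an initial probability of zero.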

### Building the Viterbi algorithm

Given a sequence of words $\{p_1, p_2, \dots, p_n \}$ and a set of grammatical categories given by the `upos` convention, the Viterbi probability matrix is laid out as follows:

$$
\begin{array}{c c}
\begin{array}{c c c c}
\text{ADJ} \\
\text{ADV}\\
\text{PRON} \\
\vdots \\
{}
\end{array}
&
\left[
\begin{array}{c c c c}
\nu_1(\text{ADJ}) & \nu_2(\text{ADJ}) & \dots  & \nu_n(\text{ADJ})\\
\nu_1(\text{ADV}) & \nu_2(\text{ADV}) & \dots  & \nu_n(\text{ADV})\\
\nu_1(\text{PRON}) & \nu_2(\text{PRON}) & \dots  & \nu_n(\text{PRON})\\
\vdots & \vdots & \dots & \vdots \\ \hdashline
p_1 & p_2 & \dots & p_n
\end{array}
\right]
\end{array}
$$

where the Viterbi probabilities in the first column (for a category $i$) are given by:

$$
\nu_1(i) = \underbrace{\rho_i^{(0)}}_{\text{initial probability}} \times P(i \vert p_1, \text{"None"})
$$

and for the following columns:

$$
\nu_{t}(j) = \max_i \{ \overbrace{\nu_{t-1}(i)}^{\text{previous state}} \times P(j \vert p_t, i) \}
$$
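
The two recurrences can be traced by hand on a toy problem with two tags and two words. All probabilities below are made up for illustration, and the names `tags`, `words`, `init_prob`, `posterior` and `nu` are assumptions rather than objects from the notebook.

```python
import numpy as np

tags = ["DET", "NOUN"]
words = ["el", "mundo"]

# Hypothetical model parameters
init_prob = {"DET": 0.5, "NOUN": 0.5}
posterior = {
    "DET|el,None": 1.0,
    "NOUN|mundo,DET": 0.75,
    "DET|mundo,DET": 0.25,
}

nu = np.zeros((len(tags), len(words)))

# first column: initial probability times P(tag | first word, "None")
for i, tag in enumerate(tags):
    nu[i, 0] = init_prob[tag] * posterior.get(tag + "|" + words[0] + ",None", 0.0)

# remaining columns: best previous state times P(tag | word, previous tag)
for t in range(1, len(words)):
    for j, tag in enumerate(tags):
        nu[j, t] = max(
            nu[i, t - 1] * posterior.get(tag + "|" + words[t] + "," + prev, 0.0)
            for i, prev in enumerate(tags)
        )

print(nu)
```

In the second column the NOUN row receives 0.5 x 0.75 = 0.375 and the DET row 0.5 x 0.25 = 0.125, so NOUN is the more probable tag for "mundo" in this toy setting.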


In [None]:
import numpy as np
import stanza
stanza.download('es')
nlp = stanza.Pipeline('es', processors='tokenize')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.1.0.json: 122kB [00:00, 9.28MB/s]                    
2020-08-18 02:14:23 INFO: Downloading default packages for language: es (Spanish)...
2020-08-18 02:14:25 INFO: File exists: /root/stanza_resources/es/default.zip.
2020-08-18 02:14:31 INFO: Finished downloading models and saved to /root/stanza_resources.


In [None]:
def ViterbiMatrix(secuencia, posteriorProbDict=posteriorProbDict, initTagStateProb=initTagStateProb):
  doc = nlp(secuencia)
  if len(doc.sentences)>1:
    raise ValueError('secuencia must contain a single sentence!')
  seq = [word.text for word in doc.sentences[0].words]
  # one row per tag, one column per word
  viterbiProb = np.zeros((len(tagStateDict), len(seq)))

  # initialize the first column
  for tag in tagStateDict.keys():
    tag_row = tagStateDict[tag]
    key = tag+'|'+seq[0].lower()+','+"None"
    # combinations never seen in training keep probability 0
    viterbiProb[tag_row, 0] = initTagStateProb.get(tag, 0)*posteriorProbDict.get(key, 0)

  # compute the remaining columns
  for col in range(1, len(seq)):
    for tag in tagStateDict.keys():
      tag_row = tagStateDict[tag]
      possible_probs = []
      for prevtag in tagStateDict.keys():
        prevtag_row = tagStateDict[prevtag]
        key = tag+'|'+seq[col].lower()+','+prevtag
        possible_probs.append(
            viterbiProb[prevtag_row, col-1]*posteriorProbDict.get(key, 0))
      viterbiProb[tag_row, col] = max(possible_probs)

  return viterbiProb

ViterbiMatrix('el mundo es pequeño')

array([[0.00000000e+00, 8.22024126e-03, 1.13643888e-04, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 3.39769972e-01, 5.94927966e-03, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.33820692e-01],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [3.47990213e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e

In [None]:
def ViterbiTags(secuencia, posteriorProbDict=posteriorProbDict, initTagStateProb=initTagStateProb):
  doc = nlp(secuencia)
  if len(doc.sentences)>1:
    raise ValueError('secuencia must contain a single sentence!')
  seq = [word.text for word in doc.sentences[0].words]
  # one row per tag, one column per word
  viterbiProb = np.zeros((len(tagStateDict), len(seq)))

  # initialize the first column
  for tag in tagStateDict.keys():
    tag_row = tagStateDict[tag]
    key = tag+'|'+seq[0].lower()+','+"None"
    # combinations never seen in training keep probability 0
    viterbiProb[tag_row, 0] = initTagStateProb.get(tag, 0)*posteriorProbDict.get(key, 0)

  # compute the remaining columns
  for col in range(1, len(seq)):
    for tag in tagStateDict.keys():
      tag_row = tagStateDict[tag]
      possible_probs = []
      for prevtag in tagStateDict.keys():
        prevtag_row = tagStateDict[prevtag]
        key = tag+'|'+seq[col].lower()+','+prevtag
        possible_probs.append(
            viterbiProb[prevtag_row, col-1]*posteriorProbDict.get(key, 0))
      viterbiProb[tag_row, col] = max(possible_probs)

  # build the tag sequence: for each word, take the tag whose row
  # holds the highest probability in that column
  rowToTag = {row: tag for tag, row in tagStateDict.items()}
  res = []
  for i, p in enumerate(seq):
    res.append((p, rowToTag[int(np.argmax(viterbiProb[:, i]))]))

  return res

ViterbiTags('el mundo es pequeño')

[('el', 'DET'), ('mundo', 'NOUN'), ('es', 'AUX'), ('pequeño', 'ADJ')]

In [None]:
ViterbiTags('estos instrumentos han de rasgar')

[('estos', 'DET'),
 ('instrumentos', 'NOUN'),
 ('han', 'AUX'),
 ('de', 'ADP'),
 ('rasgar', 'PROPN')]
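
`ViterbiTags` picks the most probable tag in each column independently. The textbook form of Viterbi decoding instead stores a back-pointer for every cell and traces the best path from the last column to the first. The block below is a minimal sketch of that variant, not the notebook's method: `viterbi_decode` is a hypothetical helper that takes an already tokenized list of words (so it does not call stanza), and the probabilities are made up for illustration.

```python
import numpy as np

def viterbi_decode(words, tags, init_prob, posterior):
    """Return the most probable tag path using back-pointers (illustrative helper)."""
    n_tags, n_words = len(tags), len(words)
    nu = np.zeros((n_tags, n_words))
    back = np.zeros((n_tags, n_words), dtype=int)

    # first column: initial probability times P(tag | first word, "None")
    for i, tag in enumerate(tags):
        key = tag + "|" + words[0].lower() + ",None"
        nu[i, 0] = init_prob.get(tag, 0.0) * posterior.get(key, 0.0)

    # remaining columns: keep both the best score and the row it came from
    for t in range(1, n_words):
        for j, tag in enumerate(tags):
            scores = [
                nu[i, t - 1] * posterior.get(tag + "|" + words[t].lower() + "," + prev, 0.0)
                for i, prev in enumerate(tags)
            ]
            back[j, t] = int(np.argmax(scores))
            nu[j, t] = scores[back[j, t]]

    # trace the best path from the last column back to the first
    row = int(np.argmax(nu[:, -1]))
    path = [row]
    for t in range(n_words - 1, 0, -1):
        row = back[row, t]
        path.append(row)
    path.reverse()
    return [(w, tags[r]) for w, r in zip(words, path)]

# Hypothetical toy model
tags = ["DET", "NOUN", "AUX", "ADJ"]
init_prob = {"DET": 1.0}
posterior = {
    "DET|el,None": 1.0,
    "NOUN|mundo,DET": 1.0,
    "AUX|es,NOUN": 0.9,
    "NOUN|es,NOUN": 0.1,
    "ADJ|pequeño,AUX": 1.0,
}

decoded = viterbi_decode(["el", "mundo", "es", "pequeño"], tags, init_prob, posterior)
print(decoded)
```

On this toy model both approaches agree, but they can differ when the highest value in an intermediate column does not lie on the best complete path. Note also that a word never seen in training, such as "rasgar" above, yields a column of zeros, and `np.argmax` then returns row 0 regardless of the decoding strategy.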