In [1]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [2]:
import numpy as np
import matplotlib.pyplot as plt
import torch
from torch import nn
import pandas as pd
import re
import nltk.corpus as corpus
import nltk
import spacy
import math
from warnings import simplefilter
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)
#import csv
nltk.download("brown")
nltk.download("treebank")
nltk.download('universal_tagset')
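nltk.download('treebank_tagset')  # not a package id in the NLTK index (see the error in the output); the tagset documentation ships as 'tagsets'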
pd.options.mode.chained_assignment = None  # default='warn'

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
[nltk_data] Error loading treebank_tagset: Package 'treebank_tagset'
[nltk_data]     not found in index


In [3]:
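def removePunctuation(s):
  # Drop everything except word characters, whitespace and combining diacritics
  out = re.sub(r"[^\w\s\u0300-\u036f]", "", s)
  # Replace runs of digits with a space
  out = re.sub(r"\d+", " ", out)
  out = re.sub(r":", "", out)
  # Turn backslashes and newlines into spaces, then collapse repeated spaces
  out = re.sub(r"[\\\n]", " ", out)
  out = re.sub(r" +", " ", out)
  return out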

In [4]:
nlp = spacy.load("en_core_web_sm")
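corpus_treebank = corpus.treebank.tagged_sents()
# NOTE: despite its name, corpus_brown loads the Treebank sample again, not Brown,
# so the combined data contains the same sentences twice. The outputs in this
# notebook come from this setup; corpus.brown uses a different tagset.
corpus_brown = corpus.treebank.tagged_sents()
tagged_sents_full = corpus_brown + corpus_treebank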

In [5]:
lim = 10000
tagged_sents = tagged_sents_full[:lim]
tagged_sents

[[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')], [('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'NNP'), ('publishing', 'VBG'), ('group', 'NN'), ('.', '.')], ...]

# Parts of speech and Named entities

Part of speech (POS) refers, roughly, to the morphosyntactic class a word belongs to. A tag is assigned to each word individually or, in some languages, to each morpheme. Named entities are, roughly, anything that can be referred to with a proper name, such as a person, a location or an organization.

POS tags and named entities are useful clues to sentence structure and meaning, and POS tagging is a key step in parsing. In a sequence of words, each word is assigned a tag according to its word class (noun, verb, adjective...) and other properties such as inflection. The task of named entity recognition (NER) is to assign words or phrases tags like "ORGANIZATION", "PERSON" or "LOCATION".

The task of giving each word in a sequence a tag is called sequence labeling. Two classic models for it are the hidden Markov model (HMM, generative) and the conditional random field (CRF, discriminative). There are also more advanced neural methods, such as recurrent neural networks (RNNs).

## Parts of speech

Parts of speech are defined by a word's grammatical relationship with its neighboring words or by the morphological properties of its affixes and inflections. There are a lot of distinctions between word classes, such as open versus closed classes or functional versus conceptual words. The most widely used tagset for English is the Penn Treebank POS tagset.

### Tagging
Tagging is a disambiguation process, because many words can take more than one tag. POS tagging algorithms are, in general, highly accurate. Ambiguous words make up only 14-15% of the vocabulary (word types), but they are very frequent in running text, where they account for 55-67% of the tokens. A simple baseline is therefore to give each ambiguous word the tag it most frequently carries in the training corpus.
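
The most-frequent-tag baseline can be sketched in a few lines. This is only an illustration on a tiny hand-made corpus: `toy_corpus`, `train_baseline` and `tag_baseline` are hypothetical names that do not appear elsewhere in the notebook, and falling back to the overall most common tag for unseen words is an assumption.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus (illustrative only, not real treebank data).
toy_corpus = [
    [("the", "DT"), ("back", "NN"), ("door", "NN")],
    [("on", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("win", "VB"), ("it", "PRP"), ("back", "RB")],
]

def train_baseline(tagged_sents):
    """Count how often each (lowercased) word carries each tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return counts

def tag_baseline(words, counts, default_tag):
    """Give every word its most frequent training tag; unseen words get default_tag."""
    out = []
    for w in words:
        tags = counts.get(w.lower())
        out.append((w, tags.most_common(1)[0][0] if tags else default_tag))
    return out

counts = train_baseline(toy_corpus)
# Assumption: unseen words fall back to the most common tag in the corpus.
all_tags = Counter(tag for sent in toy_corpus for _, tag in sent)
default_tag = all_tags.most_common(1)[0][0]

baseline_tags = tag_baseline(["the", "back", "seat"], counts, default_tag)
print(baseline_tags)
```

Here "back" is ambiguous (NN twice, RB once), so the baseline always tags it NN, whatever the context.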

## Named entities

Named entities are anything that can be referred to with a proper name. The task of named entity recognition (NER) is to find spans of text that constitute proper names and tag the type of entity. The four most common tags are PER (person), LOC (location), ORG (organization) and GPE (geo-political entity). Many applications need additional, domain-specific entity types.

NER is an important step in many NLP tasks. In sentiment analysis, for example, we might want to know the consumer's sentiment toward a particular entity.

Named entities have ambiguous segmentation: it is hard to decide what is part of the entity and what is not, that is, where it starts and ends. The standard approach to NER is BIO tagging, which lets us treat the problem as a word-by-word labeling task by encoding both the boundary and the entity type in each tag: B marks the first token of an entity, I a token inside it and O a token outside any entity. There are other schemes, such as IO (I for tokens inside a named entity, O for the rest) and BIOES (which adds E for the end of an entity and S for a single-token entity).
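
As a small illustration of BIO tagging, the sketch below converts entity spans into one tag per token. The helper `spans_to_bio`, the example sentence and its spans are made up for this note; spans are assumed to be `(start, end_exclusive, label)` token offsets that do not overlap.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans (start, end_exclusive, label) into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags

# Made-up example sentence with two entities.
tokens = ["Jane", "Villanueva", "of", "United", "Airlines", "Holding", "spoke", "."]
spans = [(0, 2, "PER"), (3, 6, "ORG")]

bio_tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

Once every token has a tag, NER can be handled by the same sequence labeling models used for POS tagging.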

## Hidden Markov Model POS Tagging

An HMM tagger assigns a label to each unit in the sequence. It is a probabilistic model: it computes a probability distribution over possible label sequences and chooses the best one.

### Markov Chains

A Markov chain is a model that tells us something about the probabilities of sequences of random variables. It makes the assumption that, to predict the future of the sequence, all that matters is the current state.

    P(q_i = a | q_1, q_2 ... q_{i-1}) = P(q_i = a | q_{i-1})

So it defines the probability of each transition out of the current state, and the transition probabilities leaving any one state sum to one. This is similar to a bigram model. It has the following components:

* A set of N states: Q = q_1, q_2... q_N
* A transition probability matrix A: each a_ij represents the probability of moving from state *i* to state *j*.
* An initial probability distribution over states (pi): pi_i is the probability that the Markov chain will start in state *i*.
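
To make these components concrete, here is a minimal sketch of a three-state Markov chain. The states and all the numbers are invented for illustration; the point is only that the probability of a sequence is the initial probability of its first state times one transition probability per step.

```python
import numpy as np

states = ["HOT", "COLD", "WARM"]   # hypothetical weather states
pi = np.array([0.1, 0.7, 0.2])     # initial distribution, sums to 1
A = np.array([                     # A[i, j] = P(next state = j | current state = i)
    [0.6, 0.1, 0.3],
    [0.1, 0.8, 0.1],
    [0.3, 0.1, 0.6],
])
assert np.allclose(A.sum(axis=1), 1.0)  # every row is a probability distribution

def sequence_probability(seq, states, pi, A):
    """P(q_1 ... q_n) = pi[q_1] * product of A[q_{i-1}, q_i]."""
    idx = [states.index(s) for s in seq]
    p = pi[idx[0]]
    for prev, cur in zip(idx, idx[1:]):
        p *= A[prev, cur]
    return p

p_seq = sequence_probability(["COLD", "COLD", "WARM"], states, pi, A)
print(p_seq)
```

For this sequence the product is 0.7 × 0.8 × 0.1 = 0.056, up to floating-point rounding.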

### Hidden Markov Model

A Markov chain is useful for computing the probability of a sequence of observable events, but most of the time we are interested in hidden events. These events are hidden because we cannot observe them directly; we must infer them. In POS tagging, the words are observed and the tags are hidden. The hidden Markov model has the following components:

* A set of N states: Q = q_1, q_2... q_N
* A transition probability matrix A: each a_ij represents the probability of moving from state *i* to state *j*.
* A sequence of T observations: O = o_1, o_2... o_T, each one drawn from a vocabulary V = v_1, v_2... v_|V|
* A sequence of observation likelihoods B = b_i(o_t), also called emission probabilities, each representing the probability of an observation o_t being generated from a state q_i.
* An initial probability distribution over states (pi): pi_i is the probability that the Markov chain will start in state *i*.

The model assumes, first, that the probability of a particular state depends only on the previous state. Second, it assumes that the probability of an output observation o_i depends only on the state q_i that produced it, and not on any other states or observations.
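
Under these two assumptions, the joint probability of a tag sequence and a word sequence factorizes into one transition and one emission term per position. The sketch below shows this with invented numbers; `states`, `vocab`, `pi`, `A` and `B` are toy values, not the matrices estimated in this notebook.

```python
import numpy as np

states = ["DT", "NN", "VB"]          # hidden states (tags)
vocab = ["the", "dog", "barks"]      # observations (words)
pi = np.array([0.8, 0.1, 0.1])       # P(first tag)
A = np.array([                       # A[i, j] = P(tag j | previous tag i)
    [0.0, 0.9, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
])
B = np.array([                       # B[i, k] = P(word k | tag i)
    [1.0, 0.0, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.2, 0.8],
])

def joint_probability(words, tags):
    """P(O, Q) = pi[q_1] * b_{q_1}(o_1) * prod_t a_{q_{t-1} q_t} * b_{q_t}(o_t)."""
    q = [states.index(t) for t in tags]
    o = [vocab.index(w) for w in words]
    p = pi[q[0]] * B[q[0], o[0]]
    for t in range(1, len(q)):
        p *= A[q[t - 1], q[t]] * B[q[t], o[t]]
    return p

sentence = ["the", "dog", "barks"]
p_good = joint_probability(sentence, ["DT", "NN", "VB"])
p_bad = joint_probability(sentence, ["DT", "VB", "NN"])
print(p_good, p_bad)
```

With these toy numbers the tagging DT NN VB is far more probable than DT VB NN, which is the kind of comparison a tagger has to make over all candidate tag sequences.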

The basic components of an HMM tagger are the two probability matrices A and B. The matrix A contains the tag transition probabilities, that is, the probability of a tag occurring given the previous tag. It is computed through maximum likelihood estimation (MLE):

    P(t_i | t_{i-1}) = count(t_{i-1}, t_i) / count(t_{i-1})

Remember that this estimates the probability of the tag *t_i* conditioned on the previous tag, which is why the denominator is the count of the previous tag.

The matrix B is composed of the emission probabilities and represents the probability, given a tag, that it'll be associated with a given word, also computed through MLE:

    P(w_i | t_i) = count(t_i, w_i)/count(t_i)
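
The two MLE formulas can be sketched with plain counters. This is an illustration on a hand-made corpus, not the notebook's implementation: `toy_corpus`, `p_transition` and `p_emission` are hypothetical names, and `"<s>"` is used as a start-of-sentence pseudo-tag.

```python
from collections import Counter

# Hand-made tagged sentences (illustrative only).
toy_corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]

tag_counts = Counter()         # count(t); "<s>" counts sentence starts
transition_counts = Counter()  # count(t_{i-1}, t_i)
emission_counts = Counter()    # count(t_i, w_i)

for sent in toy_corpus:
    prev = "<s>"
    tag_counts[prev] += 1
    for word, tag in sent:
        transition_counts[(prev, tag)] += 1
        emission_counts[(tag, word)] += 1
        tag_counts[tag] += 1
        prev = tag

def p_transition(prev_tag, tag):
    """P(t_i | t_{i-1}) = count(t_{i-1}, t_i) / count(t_{i-1})."""
    return transition_counts[(prev_tag, tag)] / tag_counts[prev_tag]

def p_emission(tag, word):
    """P(w_i | t_i) = count(t_i, w_i) / count(t_i)."""
    return emission_counts[(tag, word)] / tag_counts[tag]

print(p_transition("DT", "NN"), p_emission("NN", "dog"))
```

In this toy corpus every DT is followed by NN, so that transition gets probability 1, while "dog" accounts for two of the three NN tokens.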

The task of determining the sequence of hidden variables that corresponds to a sequence of observations is called decoding. It consists of finding the most probable sequence of states (Q = q_1, q_2... q_n) given the two matrices A and B and a sequence of observations (O = o_1, o_2... o_n), which here is the sequence of n words w_1, w_2... w_n.

The two assumptions are:

* The probability of a word appearing depends only on its own tag.
* The probability of a tag depends only on the previous tag.
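
The usual way to decode an HMM exactly is the Viterbi algorithm, a dynamic program over all state sequences. Note that the `markov` function in this notebook takes a different route and picks tags greedily from left to right. The sketch below is a minimal Viterbi implementation on invented toy matrices (`pi`, `A`, `B` are assumptions, not the matrices estimated here); it multiplies raw probabilities, so a practical version would work in log space to avoid underflow on long sentences.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Return the most probable state index sequence for the observation indices obs."""
    n_states, T = A.shape[0], len(obs)
    best = np.zeros((T, n_states))             # best[t, j]: probability of the best path ending in j at t
    back = np.zeros((T, n_states), dtype=int)  # back[t, j]: previous state on that best path
    best[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = best[t - 1][:, None] * A      # scores[i, j]: best path into i, then move i -> j
        back[t] = scores.argmax(axis=0)
        best[t] = scores.max(axis=0) * B[:, obs[t]]
    # Follow the back-pointers from the best final state.
    path = [int(best[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

states = ["DT", "NN", "VB"]
vocab = ["the", "dog", "barks"]
pi = np.array([0.8, 0.1, 0.1])
A = np.array([[0.0, 0.9, 0.1],
              [0.1, 0.3, 0.6],
              [0.5, 0.4, 0.1]])
B = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.2, 0.8]])

obs = [vocab.index(w) for w in ["the", "dog", "barks"]]
decoded = [states[i] for i in viterbi(obs, pi, A, B)]
print(decoded)
```

Unlike a greedy tagger, Viterbi can revise an early choice when later words make another path more probable, because it keeps the best path into every state at every position.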

In [7]:
# The two matrices A (trans_prob) and B (emission_prob)
try:
  # Reuse previously computed matrices if the cache file can be read
  trans_prob = pd.read_excel("/content/ProbabilityMatrix.xlsx", sheet_name="Transmission Probabilities", index_col=0, engine="openpyxl")
  emission_prob = pd.read_excel("/content/ProbabilityMatrix.xlsx", sheet_name="Emission Probabilities", index_col=0, engine="openpyxl")
except Exception:
  # No usable cache: count tag transitions and word emissions from the corpus.
  # Rows hold the conditioning tag; "<s>" marks the start of a sentence.
  trans_prob = pd.DataFrame(index=["<s>"])
  emission_prob = pd.DataFrame()

  for sent in tagged_sents:
    for j, (word, tag) in enumerate(sent):
      # Emission counts: row = tag, column = word
      if word not in emission_prob.columns:
        emission_prob[word] = 0
      if tag not in emission_prob.index:
        emission_prob.loc[tag] = 0
      emission_prob.loc[tag, word] += 1

      # Transition counts: row = previous tag, column = current tag
      if tag not in trans_prob.columns:
        trans_prob[tag] = 0
        trans_prob.loc[tag] = 0

      prev_tag = sent[j-1][1] if j != 0 else "<s>"
      trans_prob.loc[prev_tag, tag] += 1

  # Normalize each row so that it sums to one (MLE)
  trans_prob = trans_prob.div(trans_prob.sum(axis=1), axis="index")
  emission_prob = emission_prob.div(emission_prob.sum(axis=1), axis="index")
  # Note: the matrices are saved to out.xlsx, while the cache above is read from ProbabilityMatrix.xlsx
  with pd.ExcelWriter("/content/out.xlsx", engine="openpyxl") as writer:
    trans_prob.to_excel(writer, sheet_name="Transmission Probabilities")
    emission_prob.to_excel(writer, sheet_name="Emission Probabilities")

In [70]:
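def getMaxHMM(w, prev_tag):
  # Greedy step: pick the tag c that maximizes P(c | prev_tag) * P(w | c).
  # The square roots do not change which tag wins.
  out = ""
  currentMax = 0

  if w in emission_prob.columns:
    for c in trans_prob.columns:
      p = (trans_prob.loc[prev_tag, c]**0.5)*(emission_prob.loc[c, w]**0.5)
      if p > currentMax:
        out = c
        currentMax = p

  # Unknown word, or no tag with non-zero probability:
  # fall back to the most likely tag after prev_tag
  if currentMax <= 0:
    out = trans_prob.loc[prev_tag].idxmax()

  return out

def markov(s):
  # Tokenize with spaCy and tag left to right, feeding each predicted tag
  # into the next step (greedy decoding, not a search over whole sequences)
  words = ["<s>"] + [i.text for i in nlp(s)]
  out = [(words[0], words[0])]
  for i in range(1, len(words)):
    out.append((words[i], getMaxHMM(words[i], out[i-1][1])))
  return out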

In [74]:
test_sentence = "more than half of the journeys taken from London City airport last year can be reached in six hours or less by train, data reveals."
print(markov(test_sentence))

[('<s>', '<s>'), ('more', 'JJR'), ('than', 'IN'), ('half', 'DT'), ('of', 'IN'), ('the', 'DT'), ('journeys', 'NN'), ('taken', 'VBN'), ('from', 'IN'), ('London', 'NNP'), ('City', 'NNP'), ('airport', 'NNP'), ('last', 'JJ'), ('year', 'NN'), ('can', 'MD'), ('be', 'VB'), ('reached', 'VBN'), ('in', 'IN'), ('six', 'CD'), ('hours', 'NNS'), ('or', 'CC'), ('less', 'JJR'), ('by', 'IN'), ('train', 'NN'), (',', ','), ('data', 'NNS'), ('reveals', 'IN'), ('.', '.')]


## Conditional Random Fields (CRFs)

A CRF is better at tagging words that are not in the vocabulary, because it can use features based on morphology or capitalization. It can also take the previous or following words into account. In general, this model combines arbitrary features in a principled way, just like log-linear models such as logistic regression. The problem is that logistic regression assigns a class to a single observation, whereas we need a sequence model.

The most commonly used version of the CRF in NLP is the linear-chain CRF, whose conditioning closely matches that of the HMM. Assume we have a sequence of input words X = x_1, x_2... x_n and want to compute the sequence of output tags Y = y_1, y_2... y_n. In an HMM we rely on Bayes' rule and the likelihood P(X | Y). In a CRF we compute the posterior p(Y | X) directly, training the CRF to discriminate among the possible tag sequences. At each time step the CRF computes log-linear functions over a set of relevant features, and these local features are aggregated and normalized to produce a global probability for the whole sequence.
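
The idea of aggregating local features and normalizing over whole sequences can be illustrated with a brute-force sketch. Everything here is an assumption made for illustration: the tag set, the feature templates in `local_features` and the hand-picked `weights` (a real CRF learns its weights from data). The normalizer is computed by enumerating every tag sequence, which is only feasible for tiny inputs; practical implementations use dynamic programming instead.

```python
import itertools
import math

TAGS = ["NNP", "VBZ", "NN"]

def local_features(prev_tag, tag, words, i):
    """Names of the (hypothetical) features that fire at position i."""
    w = words[i]
    feats = [f"prev={prev_tag},tag={tag}"]
    if w[0].isupper():
        feats.append(f"capitalized,tag={tag}")
    if w.endswith("s"):
        feats.append(f"suffix=s,tag={tag}")
    return feats

# Hand-picked weights for illustration; unlisted features weigh 0.
weights = {
    "capitalized,tag=NNP": 2.0,
    "suffix=s,tag=VBZ": 1.5,
    "prev=<s>,tag=NNP": 0.5,
    "prev=NNP,tag=VBZ": 1.0,
    "prev=VBZ,tag=NN": 1.0,
}

def score(words, tags):
    """Sum of the weights of all local features along the sequence."""
    total, prev = 0.0, "<s>"
    for i, tag in enumerate(tags):
        total += sum(weights.get(f, 0.0) for f in local_features(prev, tag, words, i))
        prev = tag
    return total

def crf_probability(words, tags):
    """p(Y | X) = exp(score(X, Y)) / sum over all Y' of exp(score(X, Y'))."""
    z = sum(math.exp(score(words, y))
            for y in itertools.product(TAGS, repeat=len(words)))
    return math.exp(score(words, tags)) / z

words = ["Janet", "likes", "music"]
posterior = {y: crf_probability(words, y)
             for y in itertools.product(TAGS, repeat=len(words))}
best_sequence = max(posterior, key=posterior.get)
print(best_sequence, round(posterior[best_sequence], 3))
```

Because the features look at capitalization and suffixes rather than word identity, the same weights would apply to a name or verb the model has never seen, which is the advantage over the HMM described above.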

A CRF assigns a probability to an entire output (tag) sequence, out of all possible sequences, given the entire input (word) sequence. In regular logistic regression the feature function *f* computes the features of a tuple (x, y). In a CRF the function *F* maps an entire input sequence *X* and an entire output sequence *Y* to a feature vector