# Introduction

This notebook is a companion to the [first session](https://github.com/SunoikisisDC/SunoikisisDC-2017-2018/wiki/Python-1:-Entity-extraction) of the Sunoikisis-DC (2017-2018) class on Python and Natural Language Processing (NLP). In it, you will learn how to use the basics of Python that you learned in the first half of the session to perform a fundamental task of NLP, [Named-Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition).

## What is NER?

Named-Entity Recognition is the task of Natural Language Processing that aims to recognize and label **Named Entities** in texts automatically.

In their manual of NLP, [Jurafsky and Martin](https://web.stanford.edu/~jurafsky/slp3/21.pdf) define Named Entities as: "roughly speaking, anything that can be referred to with a proper name: a person, a location, an organization".

Note that:
* proper names are linguistic objects that "function as interface points between language as system and particular individuals in the world" ([P. Hanks](https://doi.org/10.1016/B0-08-044854-2/05280-9))
* often, in concrete applications, the notion of `NE` is extended to include other words in a text that do not refer to individual entities

Consider this example (adapted from Jurafsky and Martin):

```html
Citing high fuel prices, <ORG>United Airlines</ORG> said <TIME>Friday</TIME> it has increased fares by <MONEY>$6</MONEY> per round trip on flights to some cities also served by lower-cost carriers. <ORG>American Airlines</ORG>, a unit of <ORG>AMR Corp.</ORG>, immediately matched the move, spokesman <PER>Tim Wagner</PER> said. <ORG>United</ORG>, a unit of <ORG>UAL Corp.</ORG>, said the increase took effect <TIME>Thursday</TIME> and applies to most routes where it competes against discount carriers, such as <LOC>Chicago</LOC> to <LOC>Dallas</LOC> and <LOC>Denver</LOC> to <LOC>San Francisco</LOC>.
```

Three important things can be seen from this example:
1. NER is not just about *recognizing* names, it is also about *classifying* them (distinguishing e.g. persons from locations, organizations, time expressions and money amounts);
2. NER is already a powerful tool to **extract information** from texts and attempt something like an automatic summarization;
3. although `TIME EXPRESSIONS` and `AMOUNT of MONEY` are not at all prototypical entities with proper names, your goal of information extraction (e.g. understanding how airline fares fluctuate) might require you to extract and classify them too! For the sake of this NER application, `$6` or `Thursday` become Named Entities too, like `San Francisco` or `Tim Wagner`.

## Getting ready to code!

External modules and libraries can be imported using `import` statements.

For this notebook, we will need the [Natural Language ToolKit (NLTK)](http://www.nltk.org/), the [Classical Language ToolKit (CLTK)](http://cltk.org/), and some local libraries. `NLTK` is a comprehensive library that collects many of the most important resources (corpora and tools) for Natural Language Processing in Python. `CLTK` is a sort of "spin-off" that provides specialized tools and access to corpora for ancient languages, including Latin and Greek.

**Remember!** Everything that is not part of Python's [Standard Library](https://docs.python.org/3/library/index.html) must be installed before it can be imported. Most likely, you will have to install both `NLTK` and `CLTK` on your machine before you can run the following lines.

In [20]:
import cltk
from cltk.corpus.utils.importer import CorpusImporter

# Getting the texts

For this exercise, we will work with a Latin text. Let's say, we want to recognize (and possibly, classify) proper names in Caesar's *Bellum Gallicum*. The first thing we need is a digitized version of the text!

Obviously, we could download a copy, then open it and read it in Python like any other file. But we can also make use of the functionality to download and [import corpora](http://docs.cltk.org/en/latest/importing_corpora.html) offered by `CLTK`: this is handy when you want to work on a whole corpus of many files!

Though support for the corpus reader is still limited (and the documentation is quite confusing: see [this issue](https://github.com/cltk/cltk/issues/615)), the corpus reader for the texts from the Latin Library works well enough for our purposes.

We start by creating a corpus importer for Latin.

In [3]:
corpus_importer = CorpusImporter('latin')

Then we list all the possible corpora and resources (dictionaries, word lists etc) that can be obtained via `CLTK`

In [17]:
corpus_importer.list_corpora

['latin_text_perseus',
 'latin_treebank_perseus',
 'latin_text_latin_library',
 'phi5',
 'phi7',
 'latin_proper_names_cltk',
 'latin_models_cltk',
 'latin_pos_lemmata_cltk',
 'latin_treebank_index_thomisticus',
 'latin_lexica_perseus',
 'latin_training_set_sentence_cltk',
 'latin_word2vec_cltk',
 'latin_text_antique_digiliblt',
 'latin_text_corpus_grammaticorum_latinorum',
 'latin_text_poeti_ditalia']

Next, we download the `latin_text_latin_library` corpus. This will create a local copy of the texts from the [Latin Library](http://www.thelatinlibrary.com/) on our computer.

In [18]:
corpus_importer.import_corpus('latin_text_latin_library')

Downloaded 4% 1.29 MiB | 2.54 MiB/s ... Downloaded 28% 5.70 MiB | 5.65 MiB/s ...

Done! But where is it stored? All the data downloaded by `CLTK` are kept in a folder named `cltk_data` (exactly like everything you download with `NLTK` ends up in `nltk_data`). Its location is stored in the `cltk_path` variable; this is the root folder on my machine:

In [24]:
cltk.corpus.latin.cltk_path

'/Users/fmambrini/cltk_data'

In [26]:
corpus = cltk.corpus.latin.PlaintextCorpusReader("/Users/fmambrini/cltk_data/latin/text/latin_text_latin_library/", r".*\.txt")

In [28]:
corpus.fileids()

['12tables.txt',
 '1644.txt',
 'abbofloracensis.txt',
 'abelard/dialogus.txt',
 'abelard/epistola.txt',
 'abelard/historia.txt',
 'addison/barometri.txt',
 'addison/burnett.txt',
 'addison/hannes.txt',
 'addison/machinae.txt',
 'addison/pax.txt',
 'addison/praelium.txt',
 'addison/preface.txt',
 'addison/resurr.txt',
 'addison/sphaer.txt',
 'adso.txt',
 'aelredus.txt',
 'agnes.txt',
 'alanus/alanus1.txt',
 'alanus/alanus2.txt',
 'albertanus/albertanus.arsloquendi.txt',
 'albertanus/albertanus.liberconsol.txt',
 'albertanus/albertanus.sermo.txt',
 'albertanus/albertanus.sermo1.txt',
 'albertanus/albertanus.sermo2.txt',
 'albertanus/albertanus.sermo3.txt',
 'albertanus/albertanus.sermo4.txt',
 'albertanus/albertanus1.txt',
 'albertanus/albertanus2.txt',
 'albertanus/albertanus3.txt',
 'albertanus/albertanus4.txt',
 'albertofaix/hist1.txt',
 'albertofaix/hist10.txt',
 'albertofaix/hist11.txt',
 'albertofaix/hist12.txt',
 'albertofaix/hist2.txt',
 'albertofaix/hist3.txt',
 'albertofaix/his

That's huge! How many files have we loaded? And how many words?

You now have access to all the typical methods of `NLTK` to answer these questions. If you want a fantastic introduction to the world of corpus exploration using `NLTK`, you *have to* check out [this amazing book](http://www.nltk.org/book/).

In [30]:
len(corpus.fileids())

2164

In [31]:
# this will take a while to run...
len(corpus.words())

16764097

# NER on Latin!

Let's review what a NER task involves. A NER task can be broken down into a sequence of operations in which you take all the "units" of your text and then, for each of them, you:

* decide whether it is a name or not
* decide what type of entity it is (e.g. Person, Location, or No-Entity)
* decide whether it is a *part* of a name (e.g. "New" in "New York City")

Those "units" that are processed one at a time are called **"tokens"** in NLP. Most of the time (as in the example above), they correspond to our own intuitive idea of a "word". But this is not always the case; ultimately, a token is simply the product of a process called "tokenization", which splits a text into such units.

The second task requires you to define a `tagset` (i.e. a list of tags to assign) and, consequently, a taxonomy of the entities that you want to consider. Some popular options:

* 3-tag system: PER, LOC, ORG, with "O" being used for no entity
* a binary classification: Entity, O

In [33]:
#a question for you: how would you tokenize the following string using Python?
s = "Gallia est omnis divisa in partes tres."
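One possible answer, as a minimal sketch: `str.split()` separates on whitespace but leaves the final period attached to `tres`, while a regular expression from the standard `re` module can split punctuation into its own token. The pattern below is only an illustrative assumption; it is not necessarily what the corpus reader does internally.

```python
import re

s = "Gallia est omnis divisa in partes tres."

# naive approach: split on whitespace (punctuation stays glued to the word)
naive_tokens = s.split()

# regex approach: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", s)

print(naive_tokens)
print(regex_tokens)
```

Which of the two is preferable depends on whether punctuation should count as a token in its own right; the corpus output shown later in this notebook treats it that way.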

NER methods generally take some lexical or other linguistic features and try to associate them with a label (generally, using complex statistical models to calculate combined probabilities).

Let us try to build a binary classifier (two answers: Entity, O). What can be two of the easiest features to associate with entities?

## A "baseline"

Now let's write what in NLP jargon is called a *baseline*, that is a starting-point method for extracting named entities that can serve as a term of comparison to evaluate the accuracy of other methods. 

**Baseline method**: 
- cycle through each token of the text
- if the token starts with a capital letter it's a named entity (`Entity`), otherwise not (`O`)

**excursus**: how do we store annotations? Which of the Python built-in data types is the most useful?

Linguistic information comes grouped at different hierarchical levels. The "text" (in our case: Caesar's *Bellum Gallicum*, book 1) is the widest unit. The text can be conceptualized as a series of tokens, each of which holds the text and the tag.

Optionally, we can create an intermediate subdivision for the text, the sentence.

The best datatype to use for the token is a **tuple**: (token, annotation). Our text can thus be a list of tuples (tokens) or a list of lists (sentences) of tuples (tokens).

```python
#text and tokens
text = [(token1, label), (token2, label), (token3, label), ...]

#text, sentences, tokens
text = [ [(sent1_token1, label), (sent1_token2, label)], [(sent2_token1, label), ...], ...]
```
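As a concrete illustration of the two layouts sketched above, here is a small runnable version. The tokens and labels are written by hand for the example; they are not produced by a tagger.

```python
# flat layout: one list of (token, label) tuples
flat_text = [("Gallia", "Entity"), ("est", "O"), ("omnis", "O")]

# nested layout: a list of sentences, each a list of (token, label) tuples
nested_text = [
    [("Gallia", "Entity"), ("est", "O"), ("omnis", "O")],
    [("Hi", "O"), ("omnes", "O"), ("differunt", "O")],
]

# tuples can be unpacked directly in a for loop
for token, label in flat_text:
    print(token, label)

# flattening the nested layout gives back a flat list of tuples
flattened = [pair for sentence in nested_text for pair in sentence]
print(len(flattened))
```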

How do we verify if each token starts with a capital letter? Surely, we could use a regular expression for that. But there is no need. Python provides a series of built-in string methods to check the case of strings. Those methods are:

* `isupper`: check whether the string is all uppercase
* `islower`: check whether the string is all lowercase
* `istitle`: check whether the string is titlecase, i.e. the first character is uppercase and the rest is lower

**bonus question**: what do these string methods return?

Since we're dealing with string *methods*, we simply call them with dot notation on a string of text.

In [40]:
print("Gallia".islower())
print("Gallia".isupper())
print("Gallia".istitle())

False
False
True


`istitle()` is clearly what we are looking for! Now we just have to loop over the tokens and add the correct label according to whether the `istitle` method returns `True` or `False`.

We can write a loop right away. But as we might want to reuse our code for future texts, let us do the correct thing and write a **function**.

What type of *parameters* will our function require? What will it *return*?

In [46]:
def tag_baseline(tokens):
    """
    Loop over a tokenized text and append a NE tag to each token:
    - Entity: if the token is titlecase
    - O: in all other cases

    :param tokens: the tokenized text to tag
    :type tokens: list of str
    :return: a list of tuples, where tuple[0] is the token and tuple[1] is the named entity tag
    :rtype: list of tuples
    """
    tagged_tokens = []
    for token in tokens:
        if token.istitle():
            tagged_tokens.append((token, "Entity"))
        else:
            tagged_tokens.append((token, "O"))
    return tagged_tokens

Now we can just tag whatever text we want by invoking:

```python
tagged_base = tag_baseline(tokens)
```

**Careful**: as the docstring of our function says, we need to pass a **list** of tokens to `tag_baseline`; it won't work if we pass a raw text, or, to be more specific, it would produce a very funny output, not quite what you're expecting!

That means that you will have to tokenize the text yourself before you can feed it to this tagger.
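To see why, recall that looping over a Python string yields its characters one by one. The sketch below repeats the `tag_baseline` logic in compact form so that it runs on its own, and uses a naive whitespace split as a stand-in tokenizer (an assumption made for brevity):

```python
def tag_baseline(tokens):
    """Tag titlecase tokens as 'Entity', everything else as 'O'."""
    return [(t, "Entity" if t.istitle() else "O") for t in tokens]

raw = "Gallia est"

# passing the raw string: the function iterates over single characters
tagged_chars = tag_baseline(raw)
print(tagged_chars[:3])

# passing a list of tokens: the function iterates over words
tagged_words = tag_baseline(raw.split())
print(tagged_words)
```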

We can access the words and the sentences of Caesar's *Bellum Gallicum* by calling the appropriate methods of the corpus (`words()` and `sents()`) and passing them a `fileid` argument.

In [73]:
bg_tokens = corpus.words("caesar/gall1.txt")
bg_tokens[1:78]

['.',
 'IVLI',
 'CAESARIS',
 'COMMENTARIORVM',
 'DE',
 'BELLO',
 'GALLICO',
 'LIBER',
 'PRIMUS',
 'C',
 '.',
 'IVLI',
 'CAESARIS',
 'COMMENTARIORVM',
 'DE',
 'BELLO',
 'GALLICO',
 'LIBER',
 'PRIMVS',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '20',
 '21',
 '22',
 '23',
 '24',
 '25',
 '26',
 '27',
 '28',
 '29',
 '30',
 '31',
 '32',
 '33',
 '34',
 '35',
 '36',
 '37',
 '38',
 '39',
 '40',
 '41',
 '42',
 '43',
 '44',
 '45',
 '46',
 '47',
 '48',
 '49',
 '50',
 '51',
 '52',
 '53',
 '54',
 '[',
 '1',
 ']',
 'Gallia']

In fact, there is a lot of prefatory material before the text. The actual book 1 starts only with the token indexed 77. Let's re-slice it:

In [74]:
bg_tokens = bg_tokens[77:]

In [75]:
bg1_base = tag_baseline(bg_tokens)

In [76]:
# let's inspect some tokens
bg1_base[:50]

[('Gallia', 'Entity'),
 ('est', 'O'),
 ('omnis', 'O'),
 ('divisa', 'O'),
 ('in', 'O'),
 ('partes', 'O'),
 ('tres', 'O'),
 (',', 'O'),
 ('quarum', 'O'),
 ('unam', 'O'),
 ('incolunt', 'O'),
 ('Belgae', 'Entity'),
 (',', 'O'),
 ('aliam', 'O'),
 ('Aquitani', 'Entity'),
 (',', 'O'),
 ('tertiam', 'O'),
 ('qui', 'O'),
 ('ipsorum', 'O'),
 ('lingua', 'O'),
 ('Celtae', 'Entity'),
 (',', 'O'),
 ('nostra', 'O'),
 ('Galli', 'Entity'),
 ('appellantur', 'O'),
 ('.', 'O'),
 ('Hi', 'Entity'),
 ('omnes', 'O'),
 ('lingua', 'O'),
 (',', 'O'),
 ('institutis', 'O'),
 (',', 'O'),
 ('legibus', 'O'),
 ('inter', 'O'),
 ('se', 'O'),
 ('differunt', 'O'),
 ('.', 'O'),
 ('Gallos', 'Entity'),
 ('ab', 'O'),
 ('Aquitanis', 'Entity'),
 ('Garumna', 'Entity'),
 ('flumen', 'O'),
 (',', 'O'),
 ('a', 'O'),
 ('Belgis', 'Entity'),
 ('Matrona', 'Entity'),
 ('et', 'O'),
 ('Sequana', 'Entity'),
 ('dividit', 'O'),
 ('.', 'O')]

Bonus question: how do we visualize **only** the tokens that were labelled as Entities? And how do we extract this list without repetitions?

In [60]:
# answer to the first question: list comprehension!
ents = [t[0] for t in bg1_base if t[1] == 'Entity']
len(ents)

978

The syntax of list comprehensions may seem very esoteric, but in fact it is very, very intuitive; let us break it down into its syntactic components:

```python
[ # I want to create a new list!
    t[0] # made of elements taken from a variable called t (we keep only the element indexed 0)
    for t in bg1_base # this is a regular for loop: t is, in turn, each item of the list bg1_base
    if t[1] == 'Entity' # this is just a condition, as in an if statement:
                        # keep t only if its tag == Entity
] # we close the list
```

The same list can be created and populated using your elementary Python syntax of for, if and append! It is absolutely equivalent:

```python
ents = [] # we initialize an empty list
for t in bg1_base: # we start the for loop
    if t[1] == 'Entity': # we check that t is tagged as Entity
        ents.append(t[0]) # we append the token
```

But we used 4 lines for something that can readily (and more safely) be written in just 1 line...

In [67]:
# now we answer the second question: sets!
len(set(ents))

369

In [66]:
set(ents[:50])

{'Apud',
 'Aquitani',
 'Aquitania',
 'Aquitanis',
 'Belgae',
 'Belgarum',
 'Belgis',
 'C',
 'Celtae',
 'Eorum',
 'Galli',
 'Gallia',
 'Galliae',
 'Gallos',
 'Garumna',
 'Germanis',
 'Helvetii',
 'Helvetiis',
 'Helvetios',
 'Hi',
 'Hispaniam',
 'Horum',
 'Id',
 'Is',
 'M',
 'Matrona',
 'Messala',
 'Oceani',
 'Oceano',
 'Orgetorix',
 'P',
 'Pisone',
 'Pyrenaeos',
 'Qua',
 'Rheni',
 'Rhenum',
 'Rhodano',
 'Sequana',
 'Sequanis'}

This method of NER is very trivial. Yet it is interesting, isn't it? It got a lot of true entities. But it also intercepted a lot of false positives. Can you guess why?

## NER using CLTK

I said above that there are two very trivial linguistic features of the tokens that we can use to construct a NE tagger that implements a binary distinction. One is the graphical convention of capitalizing the first character of proper names.

The second one is the one that is implemented by the `CLTK` NE tagger. Can you guess what it is?

Anyway, let's see it in action!

The first thing we have to do is to import the tagger into our code.

In [77]:
from cltk.tag import ner

In [94]:
bg1_cltk = ner.tag_ner("latin", input_text=list(bg_tokens))

Now, let us try to display the two annotations side by side. To do so, we need a function that loops simultaneously through two lists of equal length and does something with the items of each list. Python provides a built-in function for exactly that, called `zip`!
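A minimal sketch of how `zip` pairs up items, using two short hand-made lists rather than the real annotations:

```python
baseline_tags = [("Gallia", "Entity"), ("est", "O"), ("Hi", "Entity")]
other_tags = [("Gallia", "Entity"), ("est", "O"), ("Hi", "O")]

# zip yields one pair of items per position, stopping at the shorter list
for b, c in zip(baseline_tags, other_tags):
    print(b, c)

# the pairs can also be collected into a list for later inspection
pairs = list(zip(baseline_tags, other_tags))
```

Because `zip` silently stops at the end of the shorter input, it is worth checking that both lists have the same length before comparing them.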

In [102]:
#let's check that the two lists have the same length
len(bg1_base) == len(bg1_cltk)

True

In [103]:
for b,c in zip(bg1_base, bg1_cltk):
    print(b,c)

('Gallia', 'Entity') ('Gallia', 'Entity')
('est', 'O') ('est',)
('omnis', 'O') ('omnis',)
('divisa', 'O') ('divisa',)
('in', 'O') ('in',)
('partes', 'O') ('partes',)
('tres', 'O') ('tres',)
(',', 'O') (',',)
('quarum', 'O') ('quarum',)
('unam', 'O') ('unam',)
('incolunt', 'O') ('incolunt',)
('Belgae', 'Entity') ('Belgae', 'Entity')
(',', 'O') (',',)
('aliam', 'O') ('aliam',)
('Aquitani', 'Entity') ('Aquitani', 'Entity')
(',', 'O') (',',)
('tertiam', 'O') ('tertiam',)
('qui', 'O') ('qui',)
('ipsorum', 'O') ('ipsorum',)
('lingua', 'O') ('lingua',)
('Celtae', 'Entity') ('Celtae', 'Entity')
(',', 'O') (',',)
('nostra', 'O') ('nostra',)
('Galli', 'Entity') ('Galli', 'Entity')
('appellantur', 'O') ('appellantur',)
('.', 'O') ('.',)
('Hi', 'Entity') ('Hi', 'Entity')
('omnes', 'O') ('omnes',)
('lingua', 'O') ('lingua',)
(',', 'O') (',',)
('institutis', 'O') ('institutis',)
(',', 'O') (',',)
('legibus', 'O') ('legibus',)
('inter', 'O') ('inter',)
('se', 'O') ('se',)
('differunt', 'O') ('differu

('.', 'O') ('.',)
('In', 'Entity') ('In', 'Entity')
('eo', 'O') ('eo',)
('itinere', 'O') ('itinere',)
('persuadet', 'O') ('persuadet',)
('Castico', 'Entity') ('Castico', 'Entity')
(',', 'O') (',',)
('Catamantaloedis', 'Entity') ('Catamantaloedis',)
('filio', 'O') ('filio',)
(',', 'O') (',',)
('Sequano', 'Entity') ('Sequano', 'Entity')
(',', 'O') (',',)
('cuius', 'O') ('cuius',)
('pater', 'O') ('pater',)
('regnum', 'O') ('regnum',)
('in', 'O') ('in',)
('Sequanis', 'Entity') ('Sequanis', 'Entity')
('multos', 'O') ('multos',)
('annos', 'O') ('annos',)
('obtinuerat', 'O') ('obtinuerat',)
('et', 'O') ('et',)
('a', 'O') ('a',)
('senatu', 'O') ('senatu',)
('populi', 'O') ('populi',)
('Romani', 'Entity') ('Romani', 'Entity')
('amicus', 'O') ('amicus',)
('appellatus', 'O') ('appellatus',)
('erat', 'O') ('erat',)
(',', 'O') (',',)
('ut', 'O') ('ut',)
('regnum', 'O') ('regnum',)
('in', 'O') ('in',)
('civitate', 'O') ('civitate',)
('sua', 'O') ('sua',)
('occuparet', 'O') ('occuparet',)
(',', 'O') 

('concedendum', 'O') ('concedendum',)
('non', 'O') ('non',)
('putabat', 'O') ('putabat',)
(';', 'O') (';',)
('neque', 'O') ('neque',)
('homines', 'O') ('homines',)
('inimico', 'O') ('inimico',)
('animo', 'O') ('animo',)
(',', 'O') (',',)
('data', 'O') ('data',)
('facultate', 'O') ('facultate',)
('per', 'O') ('per',)
('provinciam', 'O') ('provinciam',)
('itineris', 'O') ('itineris',)
('faciundi', 'O') ('faciundi',)
(',', 'O') (',',)
('temperaturos', 'O') ('temperaturos',)
('ab', 'O') ('ab',)
('iniuria', 'O') ('iniuria',)
('et', 'O') ('et',)
('maleficio', 'O') ('maleficio',)
('existimabat', 'O') ('existimabat',)
('.', 'O') ('.',)
('Tamen', 'Entity') ('Tamen',)
(',', 'O') (',',)
('ut', 'O') ('ut',)
('spatium', 'O') ('spatium',)
('intercedere', 'O') ('intercedere',)
('posset', 'O') ('posset',)
('dum', 'O') ('dum',)
('milites', 'O') ('milites',)
('quos', 'O') ('quos',)
('imperaverat', 'O') ('imperaverat',)
('convenirent', 'O') ('convenirent',)
(',', 'O') (',',)
('legatis', 'O') ('legatis',)

('atque', 'O') ('atque',)
('ita', 'O') ('ita',)
('exercitum', 'O') ('exercitum',)
('traducit', 'O') ('traducit',)
('.', 'O') ('.',)
('Helvetii', 'Entity') ('Helvetii',)
('repentino', 'O') ('repentino',)
('eius', 'O') ('eius',)
('adventu', 'O') ('adventu',)
('commoti', 'O') ('commoti',)
('cum', 'O') ('cum',)
('id', 'O') ('id',)
('quod', 'O') ('quod',)
('ipsi', 'O') ('ipsi',)
('diebus', 'O') ('diebus',)
('XX', 'O') ('XX',)
('aegerrime', 'O') ('aegerrime',)
('confecerant', 'O') ('confecerant',)
(',', 'O') (',',)
('ut', 'O') ('ut',)
('flumen', 'O') ('flumen',)
('transirent', 'O') ('transirent',)
(',', 'O') (',',)
('illum', 'O') ('illum',)
('uno', 'O') ('uno',)
('die', 'O') ('die',)
('fecisse', 'O') ('fecisse',)
('intellegerent', 'O') ('intellegerent',)
(',', 'O') (',',)
('legatos', 'O') ('legatos',)
('ad', 'O') ('ad',)
('eum', 'O') ('eum',)
('mittunt', 'O') ('mittunt',)
(';', 'O') (';',)
('cuius', 'O') ('cuius',)
('legationis', 'O') ('legationis',)
('Divico', 'Entity') ('Divico',)
('prince

('propter', 'O') ('propter',)
('eam', 'O') ('eam',)
('adfinitatem', 'O') ('adfinitatem',)
(',', 'O') (',',)
('odisse', 'O') ('odisse',)
('etiam', 'O') ('etiam',)
('suo', 'O') ('suo',)
('nomine', 'O') ('nomine',)
('Caesarem', 'Entity') ('Caesarem', 'Entity')
('et', 'O') ('et',)
('Romanos', 'Entity') ('Romanos', 'Entity')
(',', 'O') (',',)
('quod', 'O') ('quod',)
('eorum', 'O') ('eorum',)
('adventu', 'O') ('adventu',)
('potentia', 'O') ('potentia',)
('eius', 'O') ('eius',)
('deminuta', 'O') ('deminuta',)
('et', 'O') ('et',)
('Diviciacus', 'Entity') ('Diviciacus',)
('frater', 'O') ('frater',)
('in', 'O') ('in',)
('antiquum', 'O') ('antiquum',)
('locum', 'O') ('locum',)
('gratiae', 'O') ('gratiae',)
('atque', 'O') ('atque',)
('honoris', 'O') ('honoris',)
('sit', 'O') ('sit',)
('restitutus', 'O') ('restitutus',)
('.', 'O') ('.',)
('Si', 'Entity') ('Si',)
('quid', 'O') ('quid',)
('accidat', 'O') ('accidat',)
('Romanis', 'Entity') ('Romanis', 'Entity')
(',', 'O') (',',)
('summam', 'O') ('summ

('aberat', 'O') ('aberat',)
(',', 'O') (',',)
('rei', 'O') ('rei',)
('frumentariae', 'O') ('frumentariae',)
('prospiciendum', 'O') ('prospiciendum',)
('existimavit', 'O') ('existimavit',)
(';', 'O') (';',)
('itaque', 'O') ('itaque',)
('iter', 'O') ('iter',)
('ab', 'O') ('ab',)
('Helvetiis', 'Entity') ('Helvetiis',)
('avertit', 'O') ('avertit',)
('ac', 'O') ('ac',)
('Bibracte', 'Entity') ('Bibracte', 'Entity')
('ire', 'O') ('ire',)
('contendit', 'O') ('contendit',)
('.', 'O') ('.',)
('Ea', 'Entity') ('Ea', 'Entity')
('res', 'O') ('res',)
('per', 'O') ('per',)
('fugitivos', 'O') ('fugitivos',)
('L', 'Entity') ('L',)
('.', 'O') ('.',)
('Aemilii', 'Entity') ('Aemilii', 'Entity')
(',', 'O') (',',)
('decurionis', 'O') ('decurionis',)
('equitum', 'O') ('equitum',)
('Gallorum', 'Entity') ('Gallorum', 'Entity')
(',', 'O') (',',)
('hostibus', 'O') ('hostibus',)
('nuntiatur', 'O') ('nuntiatur',)
('.', 'O') ('.',)
('Helvetii', 'Entity') ('Helvetii',)
(',', 'O') (',',)
('seu', 'O') ('seu',)
('quod'

('arma', 'O') ('arma',)
('ferre', 'O') ('ferre',)
('possent', 'O') ('possent',)
(',', 'O') (',',)
('et', 'O') ('et',)
('item', 'O') ('item',)
('separatim', 'O') ('separatim',)
(',', 'O') (',',)
('quot', 'O') ('quot',)
('pueri', 'O') ('pueri',)
(',', 'O') (',',)
('senes', 'O') ('senes',)
('mulieresque', 'O') ('mulieresque',)
('.', 'O') ('.',)
('[', 'O') ('[',)
('Quarum', 'Entity') ('Quarum',)
('omnium', 'O') ('omnium',)
('rerum', 'O') ('rerum',)
(']', 'O') (']',)
('summa', 'O') ('summa',)
('erat', 'O') ('erat',)
('capitum', 'O') ('capitum',)
('Helvetiorum', 'Entity') ('Helvetiorum',)
('milium', 'O') ('milium',)
('CCLXIII', 'O') ('CCLXIII',)
(',', 'O') (',',)
('Tulingorum', 'Entity') ('Tulingorum',)
('milium', 'O') ('milium',)
('XXXVI', 'O') ('XXXVI',)
(',', 'O') (',',)
('Latobrigorum', 'Entity') ('Latobrigorum',)
('XIIII', 'O') ('XIIII',)
(',', 'O') (',',)
('Rauracorum', 'Entity') ('Rauracorum', 'Entity')
('XXIII', 'O') ('XXIII',)
(',', 'O') (',',)
('Boiorum', 'Entity') ('Boiorum', 'Ent

('a', 'O') ('a',)
('provincia', 'O') ('provincia',)
('nostra', 'O') ('nostra',)
('Rhodanus', 'Entity') ('Rhodanus', 'Entity')
('divideret', 'O') ('divideret',)
('];', 'O') ('];',)
('quibus', 'O') ('quibus',)
('rebus', 'O') ('rebus',)
('quam', 'O') ('quam',)
('maturrime', 'O') ('maturrime',)
('occurrendum', 'O') ('occurrendum',)
('putabat', 'O') ('putabat',)
('.', 'O') ('.',)
('Ipse', 'Entity') ('Ipse',)
('autem', 'O') ('autem',)
('Ariovistus', 'Entity') ('Ariovistus',)
('tantos', 'O') ('tantos',)
('sibi', 'O') ('sibi',)
('spiritus', 'O') ('spiritus',)
(',', 'O') (',',)
('tantam', 'O') ('tantam',)
('arrogantiam', 'O') ('arrogantiam',)
('sumpserat', 'O') ('sumpserat',)
(',', 'O') (',',)
('ut', 'O') ('ut',)
('ferendus', 'O') ('ferendus',)
('non', 'O') ('non',)
('videretur', 'O') ('videretur',)
('.', 'O') ('.',)
('[', 'O') ('[',)
('34', 'O') ('34',)
(']', 'O') (']',)
('Quam', 'Entity') ('Quam',)
('ob', 'O') ('ob',)
('rem', 'O') ('rem',)
('placuit', 'O') ('placuit',)
('ei', 'O') ('ei',)
('u

('perspecta', 'O') ('perspecta',)
('eum', 'O') ('eum',)
('neque', 'O') ('neque',)
('suam', 'O') ('suam',)
('neque', 'O') ('neque',)
('populi', 'O') ('populi',)
('Romani', 'Entity') ('Romani', 'Entity')
('gratiam', 'O') ('gratiam',)
('repudiaturum', 'O') ('repudiaturum',)
('.', 'O') ('.',)
('Quod', 'Entity') ('Quod',)
('si', 'O') ('si',)
('furore', 'O') ('furore',)
('atque', 'O') ('atque',)
('amentia', 'O') ('amentia',)
('impulsum', 'O') ('impulsum',)
('bellum', 'O') ('bellum',)
('intulisset', 'O') ('intulisset',)
(',', 'O') (',',)
('quid', 'O') ('quid',)
('tandem', 'O') ('tandem',)
('vererentur', 'O') ('vererentur',)
('?', 'O') ('?',)
('Aut', 'Entity') ('Aut', 'Entity')
('cur', 'O') ('cur',)
('de', 'O') ('de',)
('sua', 'O') ('sua',)
('virtute', 'O') ('virtute',)
('aut', 'O') ('aut',)
('de', 'O') ('de',)
('ipsius', 'O') ('ipsius',)
('diligentia', 'O') ('diligentia',)
('desperarent', 'O') ('desperarent',)
('?', 'O') ('?',)
('Factum', 'Entity') ('Factum', 'Entity')
('eius', 'O') ('eius',)

('respondit', 'O') ('respondit',)
(',', 'O') (',',)
('de', 'O') ('de',)
('suis', 'O') ('suis',)
('virtutibus', 'O') ('virtutibus',)
('multa', 'O') ('multa',)
('praedicavit', 'O') ('praedicavit',)
(':', 'O') (':',)
('transisse', 'O') ('transisse',)
('Rhenum', 'Entity') ('Rhenum', 'Entity')
('sese', 'O') ('sese',)
('non', 'O') ('non',)
('sua', 'O') ('sua',)
('sponte', 'O') ('sponte',)
(',', 'O') (',',)
('sed', 'O') ('sed',)
('rogatum', 'O') ('rogatum',)
('et', 'O') ('et',)
('arcessitum', 'O') ('arcessitum',)
('a', 'O') ('a',)
('Gallis', 'Entity') ('Gallis', 'Entity')
(';', 'O') (';',)
('non', 'O') ('non',)
('sine', 'O') ('sine',)
('magna', 'O') ('magna',)
('spe', 'O') ('spe',)
('magnisque', 'O') ('magnisque',)
('praemiis', 'O') ('praemiis',)
('domum', 'O') ('domum',)
('propinquosque', 'O') ('propinquosque',)
('reliquisse', 'O') ('reliquisse',)
(';', 'O') (';',)
('sedes', 'O') ('sedes',)
('habere', 'O') ('habere',)
('in', 'O') ('in',)
('Gallia', 'Entity') ('Gallia', 'Entity')
('ab', 'O') 

('.', 'O') ('.',)
('[', 'O') ('[',)
('Hic', 'Entity') ('Hic', 'Entity')
('locus', 'O') ('locus',)
('ab', 'O') ('ab',)
('hoste', 'O') ('hoste',)
('circiter', 'O') ('circiter',)
('passus', 'O') ('passus',)
('DC', 'O') ('DC',)
(',', 'O') (',',)
('uti', 'O') ('uti',)
('dictum', 'O') ('dictum',)
('est', 'O') ('est',)
(',', 'O') (',',)
('aberat', 'O') ('aberat',)
('.]', 'O') ('.]',)
('Eo', 'Entity') ('Eo', 'Entity')
('circiter', 'O') ('circiter',)
('hominum', 'O') ('hominum',)
('XVI', 'O') ('XVI',)
('milia', 'O') ('milia',)
('expedita', 'O') ('expedita',)
('cum', 'O') ('cum',)
('omni', 'O') ('omni',)
('equitatu', 'O') ('equitatu',)
('Ariovistus', 'Entity') ('Ariovistus',)
('misit', 'O') ('misit',)
(',', 'O') (',',)
('quae', 'O') ('quae',)
('copiae', 'O') ('copiae',)
('nostros', 'O') ('nostros',)
('terrerent', 'O') ('terrerent',)
('et', 'O') ('et',)
('munitione', 'O') ('munitione',)
('prohiberent', 'O') ('prohiberent',)
('.', 'O') ('.',)
('Nihilo', 'Entity') ('Nihilo',)
('setius', 'O') ('seti

('trans', 'O') ('trans',)
('Rhenum', 'Entity') ('Rhenum', 'Entity')
('nuntiato', 'O') ('nuntiato',)
(',', 'O') (',',)
('Suebi', 'Entity') ('Suebi', 'Entity')
(',', 'O') (',',)
('qui', 'O') ('qui',)
('ad', 'O') ('ad',)
('ripas', 'O') ('ripas',)
('Rheni', 'Entity') ('Rheni', 'Entity')
('venerant', 'O') ('venerant',)
(',', 'O') (',',)
('domum', 'O') ('domum',)
('reverti', 'O') ('reverti',)
('coeperunt', 'O') ('coeperunt',)
(';', 'O') (';',)
('quos', 'O') ('quos',)
('ubi', 'O') ('ubi',)
('qui', 'O') ('qui',)
('proximi', 'O') ('proximi',)
('Rhenum', 'Entity') ('Rhenum', 'Entity')
('incolunt', 'O') ('incolunt',)
('perterritos', 'O') ('perterritos',)
('senserunt', 'O') ('senserunt',)
(',', 'O') (',',)
('insecuti', 'O') ('insecuti',)
('magnum', 'O') ('magnum',)
('ex', 'O') ('ex',)
('iis', 'O') ('iis',)
('numerum', 'O') ('numerum',)
('occiderunt', 'O') ('occiderunt',)
('.', 'O') ('.',)
('Caesar', 'Entity') ('Caesar', 'Entity')
('una', 'O') ('una',)
('aestate', 'O') ('aestate',)
('duobus', 'O') 

Let's just visualize the cases of mismatch. That's easy but, before we can do that, we have to fix one problem: the non-entities are not tagged in the CLTK output.

In [109]:
len(bg1_cltk)

9595

In [110]:
bg1_cltk_fixed = [t if len(t) > 1 else (t[0], 'O') for t in bg1_cltk]
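The list comprehension above uses a conditional expression (`x if condition else y`) to pad every one-element tuple with an explicit `O` tag. A minimal sketch on hand-made data, shaped like the tagger output shown earlier:

```python
# one-element tuples stand for untagged tokens, as in the output shown earlier
sample = [("Gallia", "Entity"), ("est",), ("omnis",)]

# keep tagged tuples unchanged, add an 'O' tag to the untagged ones
fixed = [t if len(t) > 1 else (t[0], "O") for t in sample]
print(fixed)
```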

In [113]:
for b,c in zip(bg1_base[:10], bg1_cltk_fixed[:10]):
    print(b,c)

('Gallia', 'Entity') ('Gallia', 'Entity')
('est', 'O') ('est', 'O')
('omnis', 'O') ('omnis', 'O')
('divisa', 'O') ('divisa', 'O')
('in', 'O') ('in', 'O')
('partes', 'O') ('partes', 'O')
('tres', 'O') ('tres', 'O')
(',', 'O') (',', 'O')
('quarum', 'O') ('quarum', 'O')
('unam', 'O') ('unam', 'O')


In [114]:
missmatches = [(b[0], b[1], c[1]) for b,c in zip(bg1_base, bg1_cltk_fixed) if b[1] != c[1]]
len(missmatches)

373

In [115]:
missmatches[:20]

[('Garumna', 'Entity', 'O'),
 ('Horum', 'Entity', 'O'),
 ('Qua', 'Entity', 'O'),
 ('Helvetii', 'Entity', 'O'),
 ('Garumna', 'Entity', 'O'),
 ('Helvetiis', 'Entity', 'O'),
 ('Garumna', 'Entity', 'O'),
 ('Apud', 'Entity', 'O'),
 ('Helvetios', 'Entity', 'O'),
 ('M', 'Entity', 'O'),
 ('P', 'Entity', 'O'),
 ('M', 'Entity', 'O'),
 ('Helvetii', 'Entity', 'O'),
 ('Helvetium', 'Entity', 'O'),
 ('Helvetios', 'Entity', 'O'),
 ('Helvetiis', 'Entity', 'O'),
 ('Ad', 'Entity', 'O'),
 ('Ad', 'Entity', 'O'),
 ('Catamantaloedis', 'Entity', 'O'),
 ('Diviciaci', 'Entity', 'O')]
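To get a quick overview of where the two taggers disagree, one option is to count the mismatches by tag pair and by token with `collections.Counter` from the standard library. The sketch below runs on a few triples copied from the output above; on the real data you would pass the full `missmatches` list instead.

```python
from collections import Counter

# a few (token, baseline_tag, cltk_tag) triples copied from the output above
sample = [
    ("Garumna", "Entity", "O"),
    ("Horum", "Entity", "O"),
    ("Helvetii", "Entity", "O"),
    ("Garumna", "Entity", "O"),
]

# how often does each (baseline_tag, cltk_tag) combination occur?
by_direction = Counter((base, cltk) for _, base, cltk in sample)

# which tokens are responsible for most disagreements?
by_token = Counter(token for token, _, _ in sample)

print(by_direction.most_common())
print(by_token.most_common(2))
```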

So which method looks better? Can we improve on them? These questions are really for the next class... ;-)