[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CLARIN-PL/NlpRest2-Tutorials/blob/master/part3.ipynb)

# Part 3 — Building an index of named entities

This tutorial builds an index of named entities. For each named entity, the index stores its category and the list of documents that mention it. Named entities are recognized with Liner2, and the mentions are lemmatized with the Polem tool.
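
As a rough picture of the target structure, the index can be seen as a mapping from a (category, entity) pair to the set of documents that mention it. The sketch below is only an illustration: the document names are made up and `build_index` is a hypothetical helper, not part of the CLARIN-PL services.

```python
from collections import defaultdict

def build_index(mentions):
    """Map (category, entity) -> set of document names.

    `mentions` is an iterable of (document, category, entity) triples.
    """
    index = defaultdict(set)
    for document, category, entity in mentions:
        index[(category, entity)].add(document)
    return dict(index)

# Made-up sample data, for illustration only.
sample_mentions = [
    ("doc_1.txt", "nam_loc", "Londyn"),
    ("doc_2.txt", "nam_loc", "Londyn"),
    ("doc_2.txt", "nam_org", "UE"),
    ("doc_2.txt", "nam_org", "UE"),  # repeated mention in the same document
]

index = build_index(sample_mentions)
for (category, entity), documents in sorted(index.items()):
    print("%4d [%s] %s" % (len(documents), category, entity))
```

Using a set per entry means that repeated mentions within one document are counted once, so the number printed is a document count rather than a mention count.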

### Actions
1. Upload sample dataset to CLARIN-PL WS
2. Run the Liner2 task on the uploaded dataset
3. Download the result of processing
4. Extract named entity mentions from the documents
5. Lemmatize named entity mentions

### Links
1. Liner2 demo — https://ws.clarin-pl.eu/ner.shtml
2. Liner2 as a standalone tool — https://github.com/CLARIN-PL/Liner2
3. Polem as a standalone tool — https://github.com/CLARIN-PL/Polem

## 1. Upload sample dataset to CLARIN-PL WS

In [1]:
import json
import requests

clarinpl_url = "http://ws.clarin-pl.eu/nlprest2/base"
user_mail = "demo2019@nlpday.pl"

In [2]:
import urllib.request

url = clarinpl_url + "/upload/"
url_zip = "https://www.dropbox.com/s/54gmpdd6x3rx4gq/brexit_pl.zip?dl=1"

doc = urllib.request.urlopen(url_zip).read()
    
print("Size of the package: %d" % len(doc))

Size of the package: 800523


In [3]:
headers = {'content-type': 'binary/octet-stream'}

file_handler = requests.post(url, data=doc, headers=headers).text
print("File handler: %s" % file_handler)
print("URL: %s/download%s" % (clarinpl_url, file_handler))

File handler: /users/default/4957e32d-1a7b-4107-84c6-8acd7527c624
URL: http://ws.clarin-pl.eu/nlprest2/base/download/users/default/4957e32d-1a7b-4107-84c6-8acd7527c624


## 2. Run the Liner2 task on the uploaded dataset

In [4]:
import time

url = clarinpl_url + "/startTask"
lpmn = 'filezip(%s)|wcrft2|liner2({"model":"top9"})|dir|makezip' % file_handler
print("LPMN: %s" % lpmn)

payload = {'lpmn': lpmn, 'user': user_mail}
headers = {'content-type': 'application/json'}

start = time.time()
task_id = requests.post(url, data=json.dumps(payload), headers=headers).text
print("Task id: %s" % task_id)

# Check task status
processing = True
file_id = None

while processing:
    data = requests.get(clarinpl_url + "/getStatus/" + task_id).text
    result = json.loads(data)
    end = time.time()
    if result["status"] == "PROCESSING":
        print("[%3d s] Status: %s; Progress: %6.2f%%" % (end-start, result["status"], result["value"]*100))
        time.sleep(1)
    elif result["status"] == "DONE":
        file_id = result["value"][0]["fileID"]
        processing = False
        print("[%3d s] Status: DONE      ; Progress: 100.00%%" % (end-start))
    else:
        # Any other status ends the loop; print the raw response for inspection
        print(data)
        processing = False
    
print("Result file id: %s" % file_id)

LPMN: filezip(/users/default/4957e32d-1a7b-4107-84c6-8acd7527c624)|wcrft2|liner2({"model":"top9"})|dir|makezip
Task id: 288d4e73-f54a-487e-b177-38183f7a2cbd
[  0 s] Status: PROCESSING; Progress:   0.00%
[  1 s] Status: PROCESSING; Progress:   0.00%
[  2 s] Status: PROCESSING; Progress:   5.09%
[  3 s] Status: PROCESSING; Progress:   7.89%
[  5 s] Status: PROCESSING; Progress:  11.29%
[  6 s] Status: PROCESSING; Progress:  15.28%
[  7 s] Status: PROCESSING; Progress:  20.08%
[  8 s] Status: PROCESSING; Progress:  22.28%
[  9 s] Status: PROCESSING; Progress:  24.28%
[ 10 s] Status: PROCESSING; Progress:  27.67%
[ 12 s] Status: PROCESSING; Progress:  30.27%
[ 13 s] Status: PROCESSING; Progress:  33.47%
[ 14 s] Status: PROCESSING; Progress:  36.46%
[ 15 s] Status: PROCESSING; Progress:  39.66%
[ 16 s] Status: PROCESSING; Progress:  42.06%
[ 18 s] Status: PROCESSING; Progress:  45.65%
[ 19 s] Status: PROCESSING; Progress:  49.45%
[ 20 s] Status: PROCESSING; Progress:  51.65%
[ 21 s] Status:
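
The polling loop in this step runs until the service reports `DONE` or an unexpected status. A possible variation, sketched below, wraps the same logic in a function with a timeout so that a stalled task cannot block the notebook forever. The helper name `wait_for_task`, its parameters, and the `QUEUE` status name are assumptions made for this illustration; the demonstration uses canned responses instead of real HTTP calls.

```python
import time

def wait_for_task(fetch_status, timeout=300.0, interval=1.0, sleep=time.sleep):
    """Poll `fetch_status()` until the task finishes or `timeout` seconds pass.

    `fetch_status` is any callable returning a dict with a "status" key,
    for example the parsed JSON of a status response.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_status()
        status = result.get("status")
        if status == "DONE":
            return result["value"][0]["fileID"]
        # "QUEUE" is an assumed name for a waiting state; adjust as needed.
        if status not in ("PROCESSING", "QUEUE"):
            raise RuntimeError("Unexpected task status: %r" % (result,))
        if time.monotonic() >= deadline:
            raise TimeoutError("Task did not finish within %s seconds" % timeout)
        sleep(interval)

# Offline demonstration with canned responses (no network access needed).
canned = iter([
    {"status": "PROCESSING", "value": 0.0},
    {"status": "PROCESSING", "value": 0.5},
    {"status": "DONE", "value": [{"fileID": "/requests/makezip/example-id"}]},
])
demo_file_id = wait_for_task(lambda: next(canned), sleep=lambda seconds: None)
print("Result file id: %s" % demo_file_id)
```

With the variables from this notebook, the call could look like `wait_for_task(lambda: requests.get(clarinpl_url + "/getStatus/" + task_id).json())`.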

## 3. Download the result of processing

In [5]:
path = "result.zip"

url = clarinpl_url + "/download" + file_id
print(url)
data = requests.get(url).content
with open(path, "wb") as f:
    f.write(data)

print("Saved to %s" % path)

http://ws.clarin-pl.eu/nlprest2/base/download/requests/makezip/d743d991-f81b-4af4-bedf-4247dff825fc
Saved to result.zip


In [6]:
import zipfile

zf = zipfile.ZipFile(path, 'r')

print("Number of documents: %d" % len(zf.namelist()))

print("")
print("First 10 files in the package:")
print(zf.namelist()[:10])

print("")
print("Content of the first file:")
data = zf.read(zf.namelist()[0]).decode("utf-8-sig")
print(data)

Number of documents: 500

First 10 files in the package:
['brexit_pl%brexit_pl.txt_file_700.txt', 'brexit_pl%brexit_pl.txt_file_274.txt', 'brexit_pl%brexit_pl.txt_file_1337.txt', 'brexit_pl%brexit_pl.txt_file_1918.txt', 'brexit_pl%brexit_pl.txt_file_1302.txt', 'brexit_pl%brexit_pl.txt_file_1934.txt', 'brexit_pl%brexit_pl.txt_file_1683.txt', 'brexit_pl%brexit_pl.txt_file_1441.txt', 'brexit_pl%brexit_pl.txt_file_626.txt', 'brexit_pl%brexit_pl.txt_file_1233.txt']

Content of the first file:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
 <chunk type="p" id="ch1">
  <sentence id="s1">
   <tok>
    <orth>pl</orth>
    <lex disamb="1"><base>Polska</base><ctag>brev:npun</ctag></lex>
    <ann chan="nam_liv">0</ann>
   </tok>
   <ns/>
   <tok>
    <orth>-</orth>
    <lex disamb="1"><base>-</base><ctag>interp</ctag></lex>
    <ann chan="nam_liv">0</ann>
   </tok>
   <ns/>
   <tok>
    <orth>700</orth>
    <lex disamb="1"><base>700</base><ctag>num:pl:nom:

## 4. Extract named entity mentions from the documents

### Auxiliary classes to represent tokens and annotations

In [7]:
class Token:
    
    def __init__(self, orth, base, ctag):
        self.orth = orth
        self.base = base
        self.ctag = ctag
        
    def get_orth(self):
        return self.orth
    
    def get_base(self):
        return self.base
    
    def get_ctag(self):
        return self.ctag
        

class Annotation:
    
    def __init__(self, category, tokens):
        self.category = category
        self.tokens = tokens
        self.lemma = self.get_orth()
        
    def get_category(self):
        return self.category
    
    def get_tokens(self):
        return self.tokens
    
    def get_orth(self):
        return " ".join([token.get_orth() for token in self.tokens])

    def get_base(self):
        return " ".join([token.get_base() for token in self.tokens])
    
    def get_ctag(self):
        return " ".join([token.get_ctag() for token in self.tokens])
    
    def get_space(self):
        return " ".join(["True" for token in self.tokens])
    
    def get_lemma(self):
        return self.lemma
    
    def set_lemma(self, lemma):
        self.lemma = lemma

    def __str__(self):
        return "[%s] %s" % (self.get_category(), self.get_lemma())

### Convert a CCL document into a list of Annotation objects

```
<tok>...</tok>       | Ala | mieszka | w | Zielonej | Górze | , | a | Marta | w | Warszawie
---------------------|-----|----------------------------------------------------------------
<ann chan="nam_loc"> | 0   | 0       | 0 | 1        | 1     | 0 | 0 | 0     | 0 | 2
<ann chan="nam_per"> | 1   | 0       | 0 | 0        | 0     | 0 | 0 | 2     | 0 | 0
```
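
In this layout every token carries one `<ann>` value per channel. A value of `0` means the token is outside any annotation in that channel, and tokens sharing the same positive value within a channel form one mention, with each separate mention getting its own index. The sketch below rebuilds the example sentence as a minimal hand-made `<sentence>` element (not real service output) and groups the tokens accordingly; `group_mentions` is a hypothetical helper written for this illustration.

```python
import xml.etree.ElementTree as ET

orths   = ["Ala", "mieszka", "w", "Zielonej", "Górze", ",", "a", "Marta", "w", "Warszawie"]
nam_loc = [0, 0, 0, 1, 1, 0, 0, 0, 0, 2]
nam_per = [1, 0, 0, 0, 0, 0, 0, 2, 0, 0]

# Build a minimal <sentence> element that mimics the CCL token layout.
sentence = ET.Element("sentence")
for orth, loc, per in zip(orths, nam_loc, nam_per):
    tok = ET.SubElement(sentence, "tok")
    ET.SubElement(tok, "orth").text = orth
    ET.SubElement(tok, "ann", chan="nam_loc").text = str(loc)
    ET.SubElement(tok, "ann", chan="nam_per").text = str(per)

def group_mentions(sentence):
    """Return {(channel, index): [orth, ...]} for all non-zero indices."""
    mentions = {}
    for tok in sentence.iter("tok"):
        orth = tok.find("orth").text
        for ann in tok.iter("ann"):
            index = int(ann.text)
            if index > 0:
                mentions.setdefault((ann.attrib["chan"], index), []).append(orth)
    return mentions

mentions = group_mentions(sentence)
for (chan, index), words in sorted(mentions.items()):
    print("[%s] #%d: %s" % (chan, index, " ".join(words)))
```

Keying on the (channel, index) pair keeps *Zielonej Górze* together as one two-token location while *Warszawie* stays a separate mention in the same channel.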

In [8]:
import xml.etree.ElementTree as ET

def sentence_ner(sentence):
    channels = {}
    for token in sentence.iter("tok"):
        orth = token.find("./orth").text
        base = token.find("./lex/base").text
        ctag = token.find("./lex/ctag").text
        t = Token(orth, base, ctag)
        for channel in token.iter("ann"):            
            index = int(channel.text)
            chan = channel.attrib["chan"]            
            if index > 0:                
                channels.setdefault(chan, {}) \
                        .setdefault(index, []) \
                        .append(t)
                
    annotations = []
    for (ann_type, group) in channels.items():
        for tokens in group.values():            
            an = Annotation(ann_type, tokens)
            annotations.append(an)
    
    return annotations
                

def ccl_ner(ccl):
    tree = ET.fromstring(ccl)
    annotations = []
    for sentence in tree.iter("sentence"):
        annotations += sentence_ner(sentence)
    return annotations


# Test on a single document

ccl = zf.read(zf.namelist()[0]).decode("utf-8-sig")
annotations = ccl_ner(ccl)
for annotation in annotations:
    print(annotation)

[nam_liv] Borisa Johnsona
[nam_org] Unii Europejskiej
[nam_adj] Brytyjski
[nam_liv] Boris Johnson
[nam_loc] Wielkiej Brytanii
[nam_org] Uniwersytetu Warszawskiego
[nam_liv] Krzysztof Winler


### Load annotations from all documents

In [9]:
document_annotations = {}
for file in zf.namelist():
    ccl = zf.read(file)
    document_annotations[file] = ccl_ner(ccl)

print("Done")

Done


### Create a map from annotations to the sets of documents that mention them

In [10]:
def group_annotations(document_annotations):
    annotation_documents = {}
    for (document, annotations) in document_annotations.items():
        distinct_annotations = {str(an) for an in annotations}
        for an in distinct_annotations:
            annotation_documents.setdefault(an, set()).add(document)
    return annotation_documents

annotation_documents = group_annotations(document_annotations)

### Annotations in alphabetical order

In [11]:
def print_annotation_index_in_alphabetical_order(annotation_documents):
    for an, documents in sorted(annotation_documents.items()):
        print("%4d %s" % (len(documents), an))
        
print_annotation_index_in_alphabetical_order(annotation_documents)

   1 [nam_adj] Amerykańscy
   3 [nam_adj] Amerykański
   1 [nam_adj] Amerykańskie
   1 [nam_adj] Belgijska
   3 [nam_adj] Belgijski
   1 [nam_adj] Belgijskie
   3 [nam_adj] Brytyjscy
  30 [nam_adj] Brytyjska
  28 [nam_adj] Brytyjski
   7 [nam_adj] Brytyjskie
   1 [nam_adj] Brytyjskiej
   2 [nam_adj] Brytyjskim
   1 [nam_adj] Europejscy
   3 [nam_adj] Europejska
   1 [nam_adj] Europejski
   2 [nam_adj] Francuski
   2 [nam_adj] Francuskie
   1 [nam_adj] Hiszpański
   1 [nam_adj] Holenderska
   1 [nam_adj] Irlandzcy
   1 [nam_adj] Irlandzki
   2 [nam_adj] Londyński
   3 [nam_adj] Niemiecka
   4 [nam_adj] Niemiecki
   1 [nam_adj] Niemieckie
   3 [nam_adj] Polscy
   2 [nam_adj] Polska
   1 [nam_adj] Polskie
   2 [nam_adj] Rosyjski
   1 [nam_adj] Szkocki
   1 [nam_adj] Unijna
   5 [nam_adj] Unijni
  11 [nam_adj] Unijny
   1 [nam_adj] afrykańskie
   4 [nam_adj] amerykańska
  13 [nam_adj] amerykański
  11 [nam_adj] amerykańskich
   7 [nam_adj] amerykańskiego
   4 [nam_adj] amerykańskiej
   3 [

   1 [nam_org] Unii Polska
   1 [nam_org] Unii Theresa May
   1 [nam_org] Unii Wielka Brytania
   1 [nam_org] Unii Zjednoczonego Królestwa
   1 [nam_org] Unii i W
   1 [nam_org] Unii i Wlk
   1 [nam_org] Unii o Bałkany Zachodnie
   1 [nam_org] Unii sektorów lotnictwa
   1 [nam_org] Unii z Wielka Brytanią
   1 [nam_org] Unii z inną grupą - " Vote Leave "
   2 [nam_org] Unijnych
   1 [nam_org] University College Dublin
   1 [nam_org] University College London
   1 [nam_org] University College London's Centre
   1 [nam_org] Uniwersytecie Cambridge
   1 [nam_org] Uniwersytecie Cardiff w Walii
   1 [nam_org] Uniwersytecie Oksfordzkim
   1 [nam_org] Uniwersytecie Technicznym w Chemnitz
   1 [nam_org] Uniwersytetem
   1 [nam_org] Uniwersytetem Cambridge
   2 [nam_org] Uniwersytetu Warszawskiego
   1 [nam_org] Uniwersytetu w Białymstoku
   1 [nam_org] Uniwersytetu w Cambridge
  16 [nam_org] Unią
  44 [nam_org] Unią Europejską
   1 [nam_org] Unią Europejską "
   1 [nam_org] Unią Europejską a Wi

   2 [nam_pro] Rzeczpospolita
   5 [nam_pro] Rzeczpospolitej
   1 [nam_pro] Rzeczpospolitą
   1 [nam_pro] Saksonia
   1 [nam_pro] Salisbury
   1 [nam_pro] Salvini use Kremlin money and intel
   2 [nam_pro] Sam Coates
   1 [nam_pro] Shannonowi
   1 [nam_pro] Shutterstock.com
   2 [nam_pro] Skandynawii
   1 [nam_pro] Skład Parlamentu Europejskiego
   1 [nam_pro] Sojuszu Północnoatlantyckiego
   1 [nam_pro] Soros
   1 [nam_pro] Spiegel
   1 [nam_pro] St . Kitt's and Nevis na Karaibach
   1 [nam_pro] Stała Konferencja do spraw Europejskich
   1 [nam_pro] Strefa Biznesu - Dolar
   1 [nam_pro] Strefa Biznesu - Dynamika PKB
   1 [nam_pro] Strefa Biznesu - Młodzi Polacy
   5 [nam_pro] Sunday Times
   5 [nam_pro] Sunday Timesa
   1 [nam_pro] Sussex
   1 [nam_pro] Szczególne obawy
   1 [nam_pro] Szczere zainteresowanie premier polską społecznością
   7 [nam_pro] Szkocja
   2 [nam_pro] Szkocji
   2 [nam_pro] Szkocji z polskimi
   7 [nam_pro] TVN
   1 [nam_pro] TVN24
   4 [nam_pro] TVN24 BiS
   1 

### Annotations in frequency order

In [12]:
def print_annotation_index_in_frequency_order(annotation_documents):
    for an, documents in sorted(annotation_documents.items(), key=lambda x: len(x[1]), reverse=True):
        print("%4d %s" % (len(documents), an))
        
print_annotation_index_in_frequency_order(annotation_documents)

 344 [nam_org] UE
 331 [nam_loc] Wielkiej Brytanii
 220 [nam_loc] Wielka Brytania
 219 [nam_org] Unii Europejskiej
 127 [nam_org] Unii
  98 [nam_loc] Brukseli
  96 [nam_adj] brytyjskiego
  94 [nam_org] Brytyjczycy
  90 [nam_adj] brytyjski
  86 [nam_loc] Londynie
  85 [nam_liv] May
  84 [nam_loc] Polski
  83 [nam_org] Brytyjczyków
  76 [nam_loc] Londynu
  71 [nam_adj] brytyjskich
  70 [nam_loc] Londyn
  69 [nam_loc] Wielką Brytanią
  69 [nam_org] Polaków
  68 [nam_adj] unijnych
  67 [nam_loc] Wielką Brytanię
  64 [nam_loc] Polsce
  63 [nam_oth] euro
  62 [nam_loc] Polska
  58 [nam_liv] Theresy May
  57 [nam_liv] Theresa May
  57 [nam_adj] unijnego
  56 [nam_loc] Europie
  54 [nam_loc] Brukselą
  51 [nam_org] Partii Konserwatywnej
  50 [nam_adj] europejskich
  49 [nam_org] Rady Europejskiej
  46 [nam_adj] brytyjska
  46 [nam_org] PAP
  45 [nam_loc] Irlandią
  45 [nam_loc] USA
  44 [nam_org] Unią Europejską
  43 [nam_loc] Wyspach
  41 [nam_loc] Europy
  38 [nam_loc] Brytanii
  38 [nam_org

   1 [nam_org] Trybunału Konstytucyjnego Julią Przyłębską
   1 [nam_liv] Stanisława Gawłowskiego
   1 [nam_liv] Marcin Makowski
   1 [nam_liv] Johan Lundgren
   1 [nam_fac] European Research Group
   1 [nam_pro] BBC Inside Out South West team challenged five young Brits
   1 [nam_org] Kanadyjczykom Demokratyczna Partia Unionistyczna
   1 [nam_liv] Alex Begg
   1 [nam_org] Komisji Konstytucyjnej
   1 [nam_oth] https://www.polskieradio.pl/5/3/Artykul/2109729,Szef-KE-o-budzecie-ambitny-zbilansowany-wazna-praworzadnosc Szef KE
   1 [nam_liv] Wallis Simpson
   1 [nam_oth] http://www.rp.pl/Rzad-PiS/180319903-Radoslaw-Sikorski-w-Wielkiej-Brytanii-Kaczynski-moze-doprowadzic-do-Polexitu.html Sikorski
   1 [nam_liv] Hamburga
   1 [nam_oth] http://fakty.interia.pl/swiat/news-w-brytania-wiceminister-rezygnuje-i-chce-drugiego-referendum,nId,2593102 W
   1 [nam_org] AOC w Zjednoczonym Królestwie
   1 [nam_pro] Uwalniając Demony : Kulisy Brexitu
   1 [nam_oth] http://www.farmer.pl/agroskop/analizy-i-

   1 [nam_org] Polsko - Brytyjskiej Izby Handlowej
   1 [nam_liv] dziejesie.wp.pl Zobacz
   1 [nam_loc] Polski Miasteczko Great Yamouth
   1 [nam_liv] Bibby MSP Index
   1 [nam_oth] https://www.polityka.pl/tygodnikpolityka/swiat/1735702,1,europosel-charles-tannock-o-tym-jak-wielka-brytania-mierzy-sie-z-wyjsciem-z-ue.read?utm_source=rss&utm_medium=rss&utm_campaign=rss Europoseł
   1 [nam_org] Unii sektorów lotnictwa
   1 [nam_adj] afrykańskie
   1 [nam_oth] http://www.rp.pl/Unia-Europejska/304019971-Brexit-zatrzymac-Wielka-Brytanie-w-UE.html Zatrzymać Brytanię
   1 [nam_org] PLL LOT
   1 [nam_org] SVP
   1 [nam_org] Europejskiego Komitetu Regionów Karla
   1 [nam_org] Laburzystów
   1 [nam_org] Polskiej Agencji Inwestycji i Handlu
   1 [nam_org] Unii z inną grupą - " Vote Leave "
   1 [nam_pro] PE dla Polski
   1 [nam_loc] Amerykańskim
   1 [nam_liv] Daniel Kawczyński
   1 [nam_loc] Serbii
   1 [nam_loc] Most Johnsona
   1 [nam_liv] Andrew Haines
   1 [nam_pro] Europejska piąta kolumna


   1 [nam_org] Wspólną Polityka Rolną
   1 [nam_oth] http://www.rp.pl/Spoleczenstwo/180119647-Brytyjczycy-zapragneli-zostac-Francuzami.html
   1 [nam_eve] Szkocji Nicoli Sturgeon
   1 [nam_liv] Przemysław Barbrich
   1 [nam_liv] Philippe Lamberts
   1 [nam_liv] Czaputowiczowi
   1 [nam_org] Banku Zachodniego WBK
   1 [nam_liv] Marek Tarczyński
   1 [nam_oth] https://www.euractiv.pl/section/polityka-regionalna/news/komitet-regionow-chce-utworzenia-funduszu-stabilizacyjnego-dla-regionow-ktore-ucierpia-z-powodu-brexitu/
   1 [nam_liv] Craig Oliver
   1 [nam_liv] Paweł Świeboda
   1 [nam_loc] Inverness
   1 [nam_adj] Amerykańskie
   1 [nam_pro] Bohaterowie brexitu ?
   1 [nam_pro] UKiP Nigela Farage ’ a oraz Vote Leave
   1 [nam_org] University College London
   1 [nam_oth] http://www.rp.pl/Biznes/306249977-Brexit-zagrozeniem-dla-sieci-energetycznych-w-Europie.html
   1 [nam_pro] Kurz w wywiadzie dla austriackiego dziennika " Der Standard
   1 [nam_liv] Brexitu Michelem Barnierem
   1 [nam

In [13]:
def print_documents_with_named_entity(annotation_documents, ne):
    for an, documents in annotation_documents.items():
        if ne in an:
            print(an)
            for document in documents:
                print(" - %s" % document)
            print("Number of documents: %d" % len(documents))
            
print_documents_with_named_entity(annotation_documents, "[nam_loc] Irlandia Północna")
print("")

print_documents_with_named_entity(annotation_documents, "[nam_loc] Irlandii Północnej")

[nam_loc] Irlandia Północna
 - brexit_pl%brexit_pl.txt_file_607.txt
 - brexit_pl%brexit_pl.txt_file_1399.txt
 - brexit_pl%brexit_pl.txt_file_1441.txt
 - brexit_pl%brexit_pl.txt_file_1337.txt
 - brexit_pl%brexit_pl.txt_file_1546.txt
 - brexit_pl%brexit_pl.txt_file_1212.txt
 - brexit_pl%brexit_pl.txt_file_297.txt
 - brexit_pl%brexit_pl.txt_file_547.txt
 - brexit_pl%brexit_pl.txt_file_582.txt
 - brexit_pl%brexit_pl.txt_file_597.txt
 - brexit_pl%brexit_pl.txt_file_1393.txt
 - brexit_pl%brexit_pl.txt_file_469.txt
 - brexit_pl%brexit_pl.txt_file_131.txt
 - brexit_pl%brexit_pl.txt_file_1643.txt
 - brexit_pl%brexit_pl.txt_file_1543.txt
 - brexit_pl%brexit_pl.txt_file_164.txt
Number of documents: 16

[nam_loc] Irlandii Północnej
 - brexit_pl%brexit_pl.txt_file_566.txt
 - brexit_pl%brexit_pl.txt_file_1539.txt
 - brexit_pl%brexit_pl.txt_file_193.txt
 - brexit_pl%brexit_pl.txt_file_1509.txt
 - brexit_pl%brexit_pl.txt_file_1103.txt
 - brexit_pl%brexit_pl.txt_file_1072.txt
 - brexit_pl%brexit_pl.txt

## 5. Lemmatize named entity mentions

The goal is to group the different inflected forms of the same named entity, e.g. *Partii Konserwatywnej*, *Partia Konserwatywna*, *Partię Konserwatywną* => *Partia Konserwatywna*.

This can be done by lemmatizing the annotations with [Polem](https://github.com/CLARIN-PL/Polem), a rule-based lemmatizer for Polish.
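
To illustrate the effect on the index, the sketch below compares grouping by surface form with grouping by lemma. The orth-to-lemma mapping is written by hand for this example (in the tutorial it comes from Polem), and the document names and the `index_by` helper are hypothetical.

```python
# Hand-written orth -> lemma mapping, standing in for the lemmatizer output.
orth_to_lemma_demo = {
    "Partii Konserwatywnej": "Partia Konserwatywna",
    "Partia Konserwatywna": "Partia Konserwatywna",
    "Partię Konserwatywną": "Partia Konserwatywna",
    "Londynu": "Londyn",
}

# Made-up document -> mentions data.
doc_mentions = {
    "doc_1.txt": ["Partii Konserwatywnej", "Londynu"],
    "doc_2.txt": ["Partia Konserwatywna"],
    "doc_3.txt": ["Partię Konserwatywną"],
}

def index_by(doc_mentions, normalize):
    """Map normalize(mention) -> set of documents containing that mention."""
    index = {}
    for document, mentions in doc_mentions.items():
        for mention in mentions:
            index.setdefault(normalize(mention), set()).add(document)
    return index

by_orth = index_by(doc_mentions, lambda mention: mention)
by_lemma = index_by(doc_mentions, orth_to_lemma_demo.get)

print(len(by_orth), "entries before lemmatization")
print(len(by_lemma), "entries after lemmatization")
print(sorted(by_lemma["Partia Konserwatywna"]))
```

After normalization the three inflected forms collapse into a single entry whose document set is the union of the three original sets.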

In [14]:
phrases = []
for (document, annotations) in document_annotations.items():
    for an in annotations:
        phrases.append([an.get_orth(), an.get_base(), an.get_ctag(), an.get_space(), an.get_category()])

print("Number of phrases to lemmatize: %d" % len(phrases))
        
payload = {'lexeme_polem': phrases, 'tool': 'polem', 'options': [], 'lexeme':'', 'task':'all'}
headers = {'content-type': 'application/json'}
url = 'http://ws.clarin-pl.eu/lexrest/lex'

start = time.time()
response = requests.post(url, data=json.dumps(payload), headers=headers).text
print("Lemmatized in %d second(s)" % (time.time()-start))

Number of phrases to lemmatize: 14694
Lemmatized in 18 second(s)


In [15]:
result = json.loads(response)

groups = {}
orths = set()
lemmas = set()
orth_to_lemma = {}

for (phrase, lemma) in zip(result["input"]["lexeme_polem"], result["results"]):
    groups.setdefault(lemma, set([])).add(phrase[0])
    orths.add(phrase[0])
    lemmas.add(lemma)
    orth_to_lemma[phrase[0]] = lemma
    
print("Number of distinct forms before lemmatization: %d" % len(orths) )
print("Number of distinct forms after lemmatization : %d" % len(lemmas) )
print("-"*60)
    
for (lemma, forms) in sorted(groups.items(), key=lambda x: x[0]):
    if len(forms)>1:
        print("%-25s: %s" % (lemma, ", ".join(forms)))

Number of distinct forms before lemmatization: 2854
Number of distinct forms after lemmatization : 2419
------------------------------------------------------------
Adam Dąbrowski           : Adam Dąbrowski, Adamowi Dąbrowskiemu
Afganistan               : Afganistanie, Afganistanu
Afryka                   : Afryce, Afryki
Airbus                   : Airbus, Airbusa
Ameryka Południowa       : Ameryce Południowej, Amerykę Południową
Ameryka Północna         : Amerykę Północną, Ameryce Północnej
Amerykanie               : Amerykanie, Amerykanach
Amsterdam                : Amsterdamu, Amsterdamie, Amsterdam
Android                  : Android, Androidem
Angela Merkel            : Angeli Merkel, Angelą Merkel, Angelę Merkel
Anglia                   : Anglia, Anglii
Anglicy                  : Anglicy, Anglikami, Anglikom, Anglików
Anna                     : Anna, Anny
Australia                : Australia, Australii, Australią
Austria                  : Austria, Austrią, Austrii
Bank           

In [16]:
import copy

document_annotations_lemma = copy.deepcopy(document_annotations)

for document, annotations in document_annotations_lemma.items():
    for an in annotations:
        lemma = orth_to_lemma[an.get_orth()]
        an.set_lemma(lemma)
        
annotation_documents_lemma = group_annotations(document_annotations_lemma)

print_annotation_index_in_frequency_order(annotation_documents_lemma)

 384 [nam_loc] Wielka Brytania
 344 [nam_org] UE
 261 [nam_org] Unia Europejska
 255 [nam_adj] brytyjski
 189 [nam_loc] Londyn
 155 [nam_loc] Bruksela
 147 [nam_org] Brytyjczycy
 139 [nam_org] Unia
 131 [nam_loc] Polska
 105 [nam_adj] unijny
  95 [nam_loc] Europa
  89 [nam_adj] unijni
  85 [nam_liv] May
  82 [nam_adj] europejski
  80 [nam_org] Polacy
  78 [nam_adj] brytyjscy
  74 [nam_loc] Niemcy
  73 [nam_org] Komisja Europejska
  71 [nam_loc] Irlandia
  70 [nam_adj] europejscy
  69 [nam_loc] Wielką Brytanią
  63 [nam_oth] euro
  61 [nam_loc] Irlandia Północna
  60 [nam_loc] Wyspa
  59 [nam_org] Partia Konserwatywna
  58 [nam_liv] Theresy May
  57 [nam_liv] Theresa May
  54 [nam_org] Parlament Europejski
  52 [nam_loc] Francja
  49 [nam_org] Rady Europejska
  46 [nam_org] PAP
  45 [nam_loc] USA
  41 [nam_org] Partia Pracy
  39 [nam_loc] Brytania
  38 [nam_adj] polski
  38 [nam_org] KE
  38 [nam_liv] Donald Tusk
  37 [nam_org] Wspólnota
  36 [nam_adj] irlandzki
  32 [nam_liv] Tusk
  30

   1 [nam_loc] Shahmir Sanni
   1 [nam_org] ministerstwo spraw wewnętrznych
   1 [nam_loc] Petersburg
   1 [nam_org] Komisja Kontroli Europejskiej
   1 [nam_liv] Elton John
   1 [nam_liv] John Lynch
   1 [nam_pro] Nasze
   1 [nam_loc] Radom
   1 [nam_liv] Clifford Chance
   1 [nam_org] Centrum Studiów
   1 [nam_pro] https://www.polskieradio.pl/130/5153/Artykul/2089455 9 . 04 . 2018 Frans Timmermans w Polsce
   1 [nam_loc] Irlandia Pn
   1 [nam_org] Wikingowie
   1 [nam_pro] dziennik The Guardian
   1 [nam_org] Saksońskiego Urzędu Pracy
   1 [nam_liv] Nick Otterwell
   1 [nam_liv] Henry Bolton
   1 [nam_oth] http://www.gazetaprawna.pl/artykuly/1108889,wynik-wloskich-wyborow-ostrzezeniem-dla-ue.html Brytyjskie
   1 [nam_liv] Sorosa
   1 [nam_loc] Italia
   1 [nam_org] Polska Agencja Inwestycji i Handlu
   1 [nam_org] PRS
   1 [nam_liv] Farage
   1 [nam_liv] Viktor Orban
   1 [nam_liv] Christian Dustmann
   1 [nam_loc] Strefa Biznesu
   1 [nam_eve] druga wojny światowej
   1 [nam_liv] Pow

In [17]:
print_documents_with_named_entity(annotation_documents_lemma, "[nam_loc] Irlandia Północna")

[nam_loc] Irlandia Północna
 - brexit_pl%brexit_pl.txt_file_566.txt
 - brexit_pl%brexit_pl.txt_file_1084.txt
 - brexit_pl%brexit_pl.txt_file_1545.txt
 - brexit_pl%brexit_pl.txt_file_1244.txt
 - brexit_pl%brexit_pl.txt_file_1753.txt
 - brexit_pl%brexit_pl.txt_file_1539.txt
 - brexit_pl%brexit_pl.txt_file_193.txt
 - brexit_pl%brexit_pl.txt_file_1103.txt
 - brexit_pl%brexit_pl.txt_file_1509.txt
 - brexit_pl%brexit_pl.txt_file_1072.txt
 - brexit_pl%brexit_pl.txt_file_1622.txt
 - brexit_pl%brexit_pl.txt_file_1643.txt
 - brexit_pl%brexit_pl.txt_file_513.txt
 - brexit_pl%brexit_pl.txt_file_1050.txt
 - brexit_pl%brexit_pl.txt_file_587.txt
 - brexit_pl%brexit_pl.txt_file_1337.txt
 - brexit_pl%brexit_pl.txt_file_13.txt
 - brexit_pl%brexit_pl.txt_file_1799.txt
 - brexit_pl%brexit_pl.txt_file_1179.txt
 - brexit_pl%brexit_pl.txt_file_1546.txt
 - brexit_pl%brexit_pl.txt_file_1618.txt
 - brexit_pl%brexit_pl.txt_file_582.txt
 - brexit_pl%brexit_pl.txt_file_1594.txt
 - brexit_pl%brexit_pl.txt_file_1510

[Back to agenda](agenda.ipynb)