## Tagger Comparison

To compare how different POS taggers perform, I chose the following tools, all of which can be driven from Python:

* TreeTagger (via the `treetaggerwrapper` package)
* pymystem3
* pymorphy2

Being a big fan of the legendary Russian punk-rock group "Король и Шут", I decided to take the first verse and chorus of the song "Свой среди чужих" and run them through the taggers.

In [13]:
from nltk.tokenize import word_tokenize
lyrics_raw = "Как злобно сверкают глаза маньяка, Он проклинает всех живых, Но невдомёк ему, однако, Что есть свой среди чужих! Леса и просёлки, деревья да ёлки, А в посёлке воют волки. Опять с ним кто-то хочет поиграть. Никого вокруг, только сердца стук, И во власти рук заточенный сук - его верный друг. Что-то здесь не так, будто рядом враг. Как свинец кулак, подруга луна, подай верный знак!"
lyrics = word_tokenize(lyrics_raw)

First comes **TreeTagger**. It looks nice and simple, but it takes some time to run, gives only one interpretation per token, and returns no grammatical information beyond what is packed into the tag string itself. A few tokens weren't identified at all (*'просёлки', 'посёлки', 'ёлки'*); the problem is probably related to the letter 'Ёё'.

In [14]:
import pprint
import treetaggerwrapper
tagger = treetaggerwrapper.TreeTagger(TAGLANG='ru')
tt_raw = tagger.tag_text(lyrics)
tt = treetaggerwrapper.make_tags(tt_raw)
pprint.pprint(tt)

[Tag(word='Как', pos='C', lemma='как'),
 Tag(word='злобно', pos='R', lemma='злобно'),
 Tag(word='сверкают', pos='Vmip3p-a-e', lemma='сверкать'),
 Tag(word='глаза', pos='Ncmpan', lemma='глаз'),
 Tag(word='маньяка', pos='Ncmsgy', lemma='маньяк'),
 Tag(word=',', pos=',', lemma=','),
 Tag(word='Он', pos='P-3msnn', lemma='он'),
 Tag(word='проклинает', pos='Vmip3s-a-e', lemma='проклинает'),
 Tag(word='всех', pos='P---pga', lemma='весь'),
 Tag(word='живых', pos='Afpmpgf', lemma='живой'),
 Tag(word=',', pos=',', lemma=','),
 Tag(word='Но', pos='C', lemma='но'),
 Tag(word='невдомёк', pos='Ncmsnn', lemma='невдомёк'),
 Tag(word='ему', pos='P-3msdn', lemma='он'),
 Tag(word=',', pos=',', lemma=','),
 Tag(word='однако', pos='C', lemma='однако'),
 Tag(word=',', pos=',', lemma=','),
 Tag(word='Что', pos='P--nsnn', lemma='что'),
 Tag(word='есть', pos='Vmip3s-a-e', lemma='быть'),
 Tag(word='свой', pos='P--msaa', lemma='свой'),
 Tag(word='среди', pos='Sp-g', lemma='среди'),
 Tag(word='чужих', pos='Afpmpg
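
One way to test the 'Ёё' hypothesis mentioned above is to replace 'ё' with 'е' before tagging and check whether the unknown tokens get recognised. This is only a minimal sketch: `normalize_yo` is a hypothetical helper, not part of treetaggerwrapper, and whether the replacement actually helps TreeTagger is an assumption to verify.

```python
# Hypothetical helper: map 'ё'/'Ё' to 'е'/'Е' before tagging.
YO_TABLE = str.maketrans({"ё": "е", "Ё": "Е"})


def normalize_yo(text):
    """Return text with every 'ё' replaced by 'е' (case preserved)."""
    return text.translate(YO_TABLE)


# A fragment of the lyrics that contains two of the problematic tokens.
sample = "Леса и просёлки, деревья да ёлки"
print(normalize_yo(sample))
```

The normalised string can then be tokenized and passed to `tagger.tag_text` in the same way as in the cell above.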

Then **pymystem3** appears on the stage. It didn't strike me as a convenient tool for POS tagging because of its user-unfriendly output structure, although it does give several grammatical interpretations in some cases (I wish pymystem3 provided their probabilities...). The 'Ёё' problem is solved here, and so is disambiguation, which is a positive sign. But still... the whole output looks as bulky as an opera singer.

In [11]:
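import json
from pymystem3 import Mystem
m = Mystem()
# Serialize the analysis to JSON, keeping the Cyrillic text readable
ms = json.dumps(m.analyze(lyrics_raw), ensure_ascii=False)
# Crude pretty-printing: start a new line at every analysed token
for i in ms.split('{"analysis": '):
    print(i)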

[
[{"gr": "ADVPRO=", "lex": "как"}], "text": "Как"}, {"text": " "}, 
[{"gr": "ADV=", "lex": "злобно"}], "text": "злобно"}, {"text": " "}, 
[{"gr": "V,несов,нп=непрош,мн,изъяв,3-л", "lex": "сверкать"}], "text": "сверкают"}, {"text": " "}, 
[{"gr": "S,муж,неод=(вин,мн|род,ед|им,мн)", "lex": "глаз"}], "text": "глаза"}, {"text": " "}, 
[{"gr": "S,муж,од=(вин,ед|род,ед)", "lex": "маньяк"}], "text": "маньяка"}, {"text": ", "}, 
[{"gr": "SPRO,ед,3-л,муж=им", "lex": "он"}], "text": "Он"}, {"text": " "}, 
[{"gr": "V=непрош,ед,изъяв,3-л,несов,пе", "lex": "проклинать"}], "text": "проклинает"}, {"text": " "}, 
[{"gr": "SPRO,мн=(пр|вин|род)", "lex": "все"}], "text": "всех"}, {"text": " "}, 
[{"gr": "A=(пр,мн,полн|вин,мн,полн,од|род,мн,полн)", "lex": "живой"}], "text": "живых"}, {"text": ", "}, 
[{"gr": "CONJ=", "lex": "но"}], "text": "Но"}, {"text": " "}, 
[{"gr": "ADV,прдк=", "lex": "невдомек"}], "text": "невдомёк"}, {"text": " "}, 
[{"gr": "SPRO,ед,3-л,муж=дат", "lex": "он"}], "text": "ему"}, {"t
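
The bulky structure above can be flattened into one row per interpretation. The sketch below assumes the analysis is a list of dicts shaped like the output shown above; `iter_analyses` is a hypothetical helper, and the sample data is copied from that output so the block runs without pymystem3 installed.

```python
def iter_analyses(analysis):
    """Yield (text, lemma, grammar) for every interpretation of every token.

    Items without an "analysis" key (spaces, punctuation) are skipped,
    and so are items whose "analysis" list is empty.
    """
    for item in analysis:
        for variant in item.get("analysis", []):
            yield item["text"], variant["lex"], variant["gr"]


# Sample copied from the output above; in the notebook you would pass
# m.analyze(lyrics_raw) instead.
sample = [
    {"analysis": [{"gr": "ADVPRO=", "lex": "как"}], "text": "Как"},
    {"text": " "},
    {"analysis": [{"gr": "ADV=", "lex": "злобно"}], "text": "злобно"},
    {"text": " "},
]

for text, lemma, gram in iter_analyses(sample):
    print(text, lemma, gram, sep="\t")
```

Working on the Python objects directly avoids splitting the JSON string by hand, at the cost of losing the whitespace and punctuation items.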

Finally, let's look at **pymorphy2**. The first thing to notice here is that almost every input token gets multiple interpretations. They are ranked by probability score, a feature I find quite useful for a wide range of research tasks. The output structure (imho) is more readable than that of the previous package, especially since the output can be filtered by probability value.

In [12]:
import pymorphy2
morph = pymorphy2.MorphAnalyzer()
for i in lyrics:
    print(morph.parse(i),'\n')

[Parse(word='как', tag=OpencorporaTag('CONJ'), normal_form='как', score=0.875, methods_stack=((<DictionaryAnalyzer>, 'как', 1673, 0),)), Parse(word='как', tag=OpencorporaTag('ADVB,Ques'), normal_form='как', score=0.09375, methods_stack=((<DictionaryAnalyzer>, 'как', 1674, 0),)), Parse(word='как', tag=OpencorporaTag('PRCL'), normal_form='как', score=0.03125, methods_stack=((<DictionaryAnalyzer>, 'как', 22, 0),))] 

[Parse(word='злобно', tag=OpencorporaTag('ADVB'), normal_form='злобно', score=0.666666, methods_stack=((<DictionaryAnalyzer>, 'злобно', 3, 0),)), Parse(word='злобно', tag=OpencorporaTag('ADJS,Qual neut,sing'), normal_form='злобный', score=0.333333, methods_stack=((<DictionaryAnalyzer>, 'злобно', 12, 29),))] 

[Parse(word='сверкают', tag=OpencorporaTag('VERB,impf,intr plur,3per,pres,indc'), normal_form='сверкать', score=1.0, methods_stack=((<DictionaryAnalyzer>, 'сверкают', 15, 6),))] 

[Parse(word='глаза', tag=OpencorporaTag('NOUN,inan,masc sing,gent'), normal_form='глаз', sc
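
Filtering by probability, as mentioned above, could look roughly like this. The sketch uses a namedtuple stand-in (`FakeParse`, hypothetical) with the same `tag` and `score` fields that appear in the output above, so it runs without pymorphy2; the 0.5 threshold is an arbitrary assumption.

```python
from collections import namedtuple

# Stand-in for pymorphy2's Parse objects, with only the fields used here.
FakeParse = namedtuple("FakeParse", "word tag normal_form score")


def filter_by_score(parses, min_score=0.5):
    """Keep only the interpretations whose score is at least min_score."""
    return [p for p in parses if p.score >= min_score]


# Scores copied from the output for 'как' above.
sample = [
    FakeParse("как", "CONJ", "как", 0.875),
    FakeParse("как", "ADVB,Ques", "как", 0.09375),
    FakeParse("как", "PRCL", "как", 0.03125),
]

for p in filter_by_score(sample):
    print(p.word, p.tag, p.score)
```

In the notebook the same function should accept the real results, e.g. `filter_by_score(morph.parse(token))`, since it only reads the `score` attribute.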

As for speed, pymorphy2 leads the race, pymystem3 is pedaling in the peloton behind it, and TreeTagger is bringing up the rear.
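
To back the speed ranking with numbers, each tagger call could be timed. The helper below is a generic sketch (`best_time` is a hypothetical name); the commented-out calls assume the `tagger`, `m`, `morph`, `lyrics` and `lyrics_raw` objects from the cells above are still in scope.

```python
import time


def best_time(fn, repeats=3):
    """Run fn() `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Placeholder workload so the sketch runs on its own.
print(f"{best_time(lambda: sum(range(10_000))):.6f} s")

# In the notebook (assuming the objects from the cells above exist):
# best_time(lambda: tagger.tag_text(lyrics))
# best_time(lambda: m.analyze(lyrics_raw))
# best_time(lambda: [morph.parse(t) for t in lyrics])
```

Taking the minimum over a few repeats reduces the effect of one-off costs such as process start-up, which matters for tools that launch an external binary.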

In terms of accuracy, all three taggers give fairly consistent results, though TreeTagger does slightly worse. In my future work I'd rather rely on pymorphy2, given its tunable parameters, its probability scores and its suitability for Russian texts.