## Part-of-Speech tagging

This part was produced using [Jupyter](http://jupyter.org).  
If you are used to Jupyter, you can [download the corresponding notebook code from here](TP1-PoSTagging.ipynb). If not, no problem: the notebook is not mandatory, simply proceed as usual in your favorite Python environment.

### Preliminary steps

Let us try NLTK with a simple example. The application considered here consists in associating "_syntactic tags_" (called "_Part-of-Speech tags_") with the words in a text, i.e. determining the grammatical role of each word in the context of the sentence.

The applications of PoS tagging are numerous. For instance:

* help with "lemmatization" (obtain the words’ canonical forms);
* disambiguate words for higher level treatments (e.g. information extraction);
* provide syntactic clues ("roles") for unknown words ("guessers"), ...

PoS taggers usually reach 95-99% accuracy, depending on the language considered, the application and the granularity of the tagset.

More about Part-of-Speech tagging will be presented during the semester (Week 5). The purpose of the present exercise is simply to illustrate this NLP task and to check NLTK installation.

In order to tag a sentence in NLTK from the Python interpreter, you first need to download the required models, if not already done:

In [1]:
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\leose\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\leose\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

Then proceed as follows to tag a sentence:

In [17]:
def tag(sentence):
    # Split the sentence into tokens, then assign a PoS tag to each token.
    return nltk.pos_tag(nltk.word_tokenize(sentence))

tag("Your sentence comes here.")

[('Your', 'PRP$'),
 ('sentence', 'NN'),
 ('comes', 'VBZ'),
 ('here', 'RB'),
 ('.', '.')]
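
The result is a plain Python list of `(word, tag)` pairs, so it can be post-processed with ordinary list operations. Here is a minimal sketch that reuses the output above as a literal (so it runs without NLTK); the helper name `words_with_tag` is ours and purely illustrative:

```python
# Output of tag("Your sentence comes here."), copied from above as a literal.
tagged = [('Your', 'PRP$'), ('sentence', 'NN'), ('comes', 'VBZ'),
          ('here', 'RB'), ('.', '.')]

def words_with_tag(tagged_sentence, prefix):
    """Return the words whose tag starts with `prefix` (hypothetical helper)."""
    return [word for word, pos in tagged_sentence if pos.startswith(prefix)]

nouns = words_with_tag(tagged, 'NN')   # Penn Treebank noun tags start with "NN"
verbs = words_with_tag(tagged, 'VB')   # Penn Treebank verb tags start with "VB"
print(nouns, verbs)
```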

NLTK provides some help about the tagset used. If not already done, first download the required package:

In [None]:
nltk.download('tagsets')

Then you can request an explanation of some tag, e.g.:

In [None]:
nltk.help.upenn_tagset('RB')

or even some set of tags using regular expressions, e.g.:

In [None]:
nltk.help.upenn_tagset('NN.*')
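
To see what such a pattern selects without relying on NLTK, here is a small sketch using only the standard `re` module. The tag list is a hand-picked subset of Penn Treebank tag names and `select_tags` is a hypothetical helper; we assume here that the whole tag name must match the pattern, which may differ from what NLTK does internally:

```python
import re

# A hand-picked subset of Penn Treebank tag names (illustrative, not the full tagset).
TAGS = ['JJ', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'RB', 'RBR', 'RBS', 'VB', 'VBZ']

def select_tags(pattern, tags=TAGS):
    """Return the tag names that `pattern` matches entirely (assumed semantics)."""
    regexp = re.compile(pattern)
    return [t for t in tags if regexp.fullmatch(t)]

print(select_tags('NN.*'))   # all the noun tags of the subset
print(select_tags('RB'))     # a plain tag name selects only itself
```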

### Your turn

Try tagging the following sentence:

In [None]:
tag("This is only a sample sentence.")

What are the Part-of-Speech tags? Do you understand them? Do they make sense?

Then try replacing the word "_sample_" with some unknown form, e.g.:

In [None]:
tag("This is only a xxx sentence.")

What happens? What do you get for "_xxx_"?

**Explanation:** Since "_xxx_" is unlikely to be a word that the PoS tagger has seen before, it tries to **guess** the part-of-speech tag (we will explore this topic in more detail during the semester, in the corresponding dedicated practical session). The most probable category between a determiner and a noun is an adjective, so this simple tagger guesses "_xxx_" to be an adjective.
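
The idea of guessing from context can be sketched in a few lines. This is only a toy illustration, not the way NLTK's tagger works: the counts are made up, and real guessers typically also use clues such as suffixes or capitalization. The table `MIDDLE_TAG_COUNTS` and the function `guess_tag` are hypothetical names:

```python
from collections import Counter

# Made-up counts of the tag observed between a left and a right neighbour tag.
# These numbers are illustrative only; they do not come from any corpus.
MIDDLE_TAG_COUNTS = {
    ('DT', 'NN'): Counter({'JJ': 70, 'NN': 25, 'VBG': 5}),
    ('DT', 'VBZ'): Counter({'NN': 90, 'JJ': 10}),
}

def guess_tag(left_tag, right_tag, default='NN'):
    """Guess the tag of an unknown word from the tags of its two neighbours."""
    counts = MIDDLE_TAG_COUNTS.get((left_tag, right_tag))
    if not counts:
        return default          # no evidence for this context: fall back
    return counts.most_common(1)[0][0]

# "a xxx sentence": determiner on the left, noun on the right.
print(guess_tag('DT', 'NN'))
```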

Then try with some highly ambiguous sentence, e.g.:

In [None]:
tag("Time flies like an arrow.")

Is the obtained result consistent with one of the possible interpretations of this sentence?  
(reminder: make use of `nltk.help.upenn_tagset()` if needed)

Does it correspond to your "most intuitive interpretation"?

**Comment:** the tagger output is due to the limited scope of the algorithms used. This scope is limited for the sake of efficiency: $O(n)$ here, versus $O(n^3)$ for broader-scope algorithms (e.g. a CFG parser), where $n$ is the sentence length.

Finally, try a less ambiguous sentence that still contains some ambiguous words. For instance (before submitting it to the tagger, ask yourself which words are ambiguous, and between which readings; then proceed):

In [None]:
tag("This girl can beat.")

Can you explain why "_can_" and "_beat_" are correctly tagged as a modal verb and a verb, respectively, instead of nouns (given that both words are also sometimes nouns, and that nouns are more frequent than verbs)?

### Final comments

**Note 1:** A parser can be used instead of a corpus-based PoS tagger to disambiguate words in context. But this demands more work on resource development (writing grammars), which is costly. The processing is also more costly: context-free parsing takes cubic time (w.r.t. sentence length), as opposed to PoS tagging, which takes linear time.

This is a good illustration of the trade-off (i.e. compromise) between "descriptive power" and "processing speed" presented in today's lecture.
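
To make the linear-time claim more concrete, here is a minimal sketch of a classic bigram (HMM-style) Viterbi decoder. It is not claimed to be the algorithm used by NLTK's tagger, and all the probabilities in `TRANS` and `EMIT` are made up for the example. The point is the shape of the computation: a single left-to-right pass over the words, with a bounded amount of work per word for a fixed tagset.

```python
# Toy model with made-up probabilities (not estimated from any corpus).
# TRANS[prev][tag] = P(tag | previous tag); '<s>' marks the sentence start.
TRANS = {
    '<s>': {'DT': 0.7, 'NN': 0.2, 'MD': 0.05, 'VB': 0.05},
    'DT':  {'DT': 0.01, 'NN': 0.9, 'MD': 0.04, 'VB': 0.05},
    'NN':  {'DT': 0.1, 'NN': 0.4, 'MD': 0.25, 'VB': 0.25},
    'MD':  {'DT': 0.05, 'NN': 0.05, 'MD': 0.05, 'VB': 0.85},
    'VB':  {'DT': 0.4, 'NN': 0.4, 'MD': 0.1, 'VB': 0.1},
}
# EMIT[word][tag] = lexical score of each reading; "can" and "beat" are ambiguous.
EMIT = {
    'this': {'DT': 1.0},
    'girl': {'NN': 1.0},
    'can':  {'NN': 0.5, 'MD': 0.5},
    'beat': {'NN': 0.5, 'VB': 0.5},
}

def viterbi(words):
    """Return the best-scoring tag sequence under the toy bigram model."""
    # best[tag] = (score of the best path ending in `tag`, that path)
    best = {'<s>': (1.0, [])}
    for word in words:                       # one pass: linear in sentence length
        new_best = {}
        for tag, p_emit in EMIT[word].items():
            # Pick the best previous tag: at most one candidate per tag.
            score, path = max(
                (prev_score * TRANS[prev][tag] * p_emit, prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    return max(best.values())[1]

print(viterbi(['this', 'girl', 'can', 'beat']))
```

With these toy numbers, the noun reading of "_can_" scores higher locally, but the sequence modal + verb wins overall because of the strong modal-to-verb transition. The context used is still only the neighbouring tag, which is what keeps the cost linear in the number of words.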

**Note 2:** Systematically disambiguating at each level of language (lexical, syntactic, semantic) is most of the time not necessary. It is often better to leave the disambiguation to a later, more informed, processing level, provided that the resulting complexity can be managed. This is a more robust way to proceed. So, do not assume that you have to settle on a unique result at each step where you get some. Imagine solutions in which several candidate results can be carried along and handled together.
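
As a rough sketch of what carrying several solutions along could look like, the example below keeps a set of candidate tags per word and lets a later stage filter the combinations. The candidate sets are hand-written, and the constraint `has_one_verb` is a hypothetical stand-in for a more informed processing level:

```python
from itertools import product

# Candidate tags kept per word instead of a single early decision.
# The candidate sets are hand-written for illustration.
LATTICE = [
    ('Time',  ['NN', 'VB']),
    ('flies', ['NNS', 'VBZ']),
    ('like',  ['IN', 'VBP']),
    ('an',    ['DT']),
    ('arrow', ['NN']),
]

def readings(lattice):
    """Enumerate every tag sequence still allowed by the lattice."""
    return [list(tags) for tags in product(*(cands for _, cands in lattice))]

def has_one_verb(tags):
    """Hypothetical later-stage constraint: exactly one verb in the sentence."""
    return sum(tag.startswith('VB') for tag in tags) == 1

all_readings = readings(LATTICE)
kept = [r for r in all_readings if has_one_verb(r)]
print(len(all_readings), len(kept))
for r in kept:
    print(r)
```

Enumerating all combinations grows exponentially with the number of ambiguous words, so this only works for toy inputs; it is meant to show the principle of deferring the decision, not a practical data structure.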