# Examination of the NLTK package
The **N**atural **L**anguage **T**ool**k**it (NLTK) is a well-known Python library devoted to linguistic research of various kinds, as opposed to _spaCy_, which targets production-ready enterprise solutions. In this notebook I will investigate its strengths and draw a conclusion about what it is best suited for.

In [18]:
import nltk  # Natural Language Toolkit
import nltk.corpus as nlc  # Corpus readers
import os  # For file illustrations
import nltk.tokenize as nlt  # Tokenisers
nltk.download('all')  # Download every NLTK data package (corpora, taggers, grammars)
# Read the sample text that the rest of the notebook works on
with open("relations.txt") as f:
    source = f.read()
print(source)

The United States officially recognized the independence of Ukraine on December 25, 1991.
The United States upgraded its consulate in the capital, Kyiv, to embassy status on January 21,
1992. In 2002, relations between the United States and Ukraine deteriorated after one of the
recordings made during the Cassette Scandal revealed an alleged transfer of a sophisticated
Ukrainian defense system to Saddam Hussein's Iraq. Following the 2014 annexation of Crimea by the
Russian Federation, the USA became one of the largest defense partners of Ukraine.

The current ambassador of the United States to Ukraine is Bridget A. Brink. The current Ukrainian
Ambassador to the United States is Oksana Markarova.

As of 2009, the United States supports Ukraine's bid to join NATO.

According to documents uncovered during the United States diplomatic cables leak, American
diplomats defends Ukrainian sovereignty in meetings with other diplomats.

Ukrainians have generally viewed the U.S. positively, with 80

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /home/codespace/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /home/codespace/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /home/codespace/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /home/codespace/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /home/codespace/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]    | Downloadi

# Word processing
## Tokenisation
NLTK does a great job of splitting a text stream into tokens such as words and punctuation marks. In a way it is similar to `str.split()`, except that it also separates special characters such as commas, full stops and quotation marks into tokens of their own. NLTK tokenisation is much more lightweight than the spaCy implementation, but it can also look inaccurate: it treats the possessive _'s_ as a lexeme separate from the base word, as shown in the example below.

In [19]:
lexemes = nlt.word_tokenize(source)
print(lexemes)
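The difference from `str.split()` can be seen on a single sentence. The sketch below is only an approximation of the cell above: it assumes `TreebankWordTokenizer`, NLTK's rule-based word tokeniser that needs no downloaded data, as a stand-in for `word_tokenize`, which additionally splits the text into sentences first, so results on longer texts may differ. The sample sentence is made up for illustration.

```python
from nltk.tokenize import TreebankWordTokenizer

# Illustrative sentence (an assumption, shortened from the sample text)
sentence = "It revealed a transfer to Saddam Hussein's Iraq."

# Plain whitespace splitting keeps punctuation glued to the neighbouring word
whitespace_tokens = sentence.split()

# The Treebank tokeniser separates punctuation and the possessive 's
treebank_tokens = TreebankWordTokenizer().tokenize(sentence)

print(whitespace_tokens)
print(treebank_tokens)
```

Splitting off _'s_ follows the Penn Treebank convention, which is what lets a tagger label it separately as a possessive ending.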



## Stemming and lemmatisation
In natural language processing, extracting the base form of a word is a frequent task, since text often has to be normalised across grammatical forms and derivations. NLTK offers two ways to do it: **stemming** and **lemmatisation**.  

Stemming extracts the base algorithmically, by analysing the word's structure and stripping affixes and inflections. This makes it lightweight, but it cannot handle irregular words that fall outside its rules. For those cases, _lemmatisation_ relies on a dictionary in which word forms are grouped around their lemma, so the task largely comes down to a lookup. This solution is more accurate when that is what the task demands.

In [20]:
stemmer = nltk.stem.PorterStemmer()
print(stemmer.stem("caged"))   # Correct: cage
print(stemmer.stem("beloved")) # Incorrect: belov
lemmatizer = nltk.stem.WordNetLemmatizer()
# lemmatize() treats words as nouns unless told otherwise (pos="v" marks a verb)
print(lemmatizer.lemmatize("caged"))   # Unchanged: caged
print(lemmatizer.lemmatize("beloved")) # Correct: beloved

cage
belov
caged
beloved
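The contrast between rule-based stemming and dictionary lookup can be sketched without any downloaded data. In the snippet below, `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names invented for illustration; the three-entry table merely stands in for a real lemma dictionary such as WordNet and is not how `WordNetLemmatizer` is implemented.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical toy lookup table standing in for a real lemma dictionary;
# the entries are illustrative only
TOY_LEMMAS = {"went": "go", "geese": "goose", "caged": "cage"}


def toy_lemmatize(word: str) -> str:
    """Return the dictionary lemma, or the word unchanged if it is unknown."""
    return TOY_LEMMAS.get(word.lower(), word)


# The stemmer only strips suffixes by rule, so irregular forms such as
# "went" can never be mapped to "go"; the lookup table handles them trivially
for word in ("caged", "went", "geese"):
    print(f"{word}: stem={stemmer.stem(word)!r}, lemma={toy_lemmatize(word)!r}")
```

The trade-off is coverage: the stemmer accepts any string, while a lookup can only return lemmas for forms its dictionary already contains.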


# Part-of-speech tagging
Part-of-speech (POS) tagging is one of the most widespread techniques in NLP and a common first step towards named entity recognition (NER). NLTK assigns every token a _tag_ from the Penn Treebank tagset, among which there are:
* **CC** coordinating conjunction;
* **CD** cardinal number;
* **DT** determiner;
* **FW** foreign word;
* **IN** preposition/subordinating conjunction;
* **JJ** adjective (e.g. _large_);
* **JJR** comparative adjective;
* **JJS** superlative adjective;
* **LS** list marker;
* **NN** singular noun;
* **NNS** plural noun;
* **NNP** proper singular noun;
* **NNPS** proper plural noun;
* **POS** possessive ending;
* **PRP** personal pronoun, etc.

In [21]:
text = "The United States officially recognized the independence of Ukraine on December 25, 1991."
tags = nltk.pos_tag(nltk.word_tokenize(text))
print(tags)
# Documentation and examples for each tag can be printed on demand.
# upenn_tagset() prints by itself and returns None, so it is not wrapped in print():
nltk.help.upenn_tagset('RB')
nltk.help.upenn_tagset('VBN')

[('The', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('officially', 'RB'), ('recognized', 'VBN'), ('the', 'DT'), ('independence', 'NN'), ('of', 'IN'), ('Ukraine', 'NNP'), ('on', 'IN'), ('December', 'NNP'), ('25', 'CD'), (',', ','), ('1991', 'CD'), ('.', '.')]
RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
VBN: verb, past participle
    multihulled dilapidated aerosolized chaired languished panelized used
    experimented flourished imitated reunifed factored condensed sheared
    unsettled primed dubbed desired ...
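Since `nltk.pos_tag` returns a plain list of `(token, tag)` tuples, ordinary Python is enough to work with the result. A minimal sketch, with the tagged tokens copied from the output above so that it runs without NLTK; the names `tokens_by_tag` and `proper_nouns` are chosen here for illustration.

```python
from collections import defaultdict

# Tagged tokens copied from the nltk.pos_tag() output shown above
tags = [('The', 'DT'), ('United', 'NNP'), ('States', 'NNPS'),
        ('officially', 'RB'), ('recognized', 'VBN'), ('the', 'DT'),
        ('independence', 'NN'), ('of', 'IN'), ('Ukraine', 'NNP'),
        ('on', 'IN'), ('December', 'NNP'), ('25', 'CD'), (',', ','),
        ('1991', 'CD'), ('.', '.')]

# Group the tokens by their tag
tokens_by_tag = defaultdict(list)
for token, tag in tags:
    tokens_by_tag[tag].append(token)

# NNP and NNPS both mark proper nouns, so a prefix test catches both
proper_nouns = [token for token, tag in tags if tag.startswith("NNP")]

print(proper_nouns)
print(dict(tokens_by_tag))
```

Proper nouns are only a rough hint at named entities: the tagger marks _December_ as `NNP` as well, and nothing yet groups _United_ and _States_ into one name.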


# Chunking
**Chunking** is the process of grouping individual tokens into larger related units, such as noun phrases, that can be processed together, for example in question answering. It is similar to what spaCy does with spans.

In [26]:
# As always, we first need to tokenise the text to work with it:
tokens = nltk.word_tokenize(source)
# The grammar is a regular expression over POS tags: a noun phrase (NP) is
# an optional determiner, any number of adjectives, and a singular noun
grammar = r"NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)
tree = parser.parse(nltk.pos_tag(tokens))
print(tree)

(S
  The/DT
  United/NNP
  States/NNPS
  officially/RB
  recognized/VBN
  (NP the/DT independence/NN)
  of/IN
  Ukraine/NNP
  on/IN
  December/NNP
  25/CD
  ,/,
  1991/CD
  ./.
  The/DT
  United/NNP
  States/NNPS
  upgraded/VBD
  its/PRP$
  (NP consulate/NN)
  in/IN
  (NP the/DT capital/NN)
  ,/,
  Kyiv/NNP
  ,/,
  to/TO
  embassy/VB
  (NP status/NN)
  on/IN
  January/NNP
  21/CD
  ,/,
  1992/CD
  ./.
  In/IN
  2002/CD
  ,/,
  relations/NNS
  between/IN
  the/DT
  United/NNP
  States/NNPS
  and/CC
  Ukraine/NNP
  deteriorated/VBD
  after/IN
  one/CD
  of/IN
  the/DT
  recordings/NNS
  made/VBN
  during/IN
  the/DT
  Cassette/NNP
  Scandal/NNP
  revealed/VBD
  an/DT
  alleged/VBN
  (NP transfer/NN)
  of/IN
  (NP a/DT sophisticated/JJ Ukrainian/JJ defense/NN)
  (NP system/NN)
  to/TO
  Saddam/NNP
  Hussein/NNP
  's/POS
  Iraq/NNP
  ./.
  Following/VBG
  the/DT
  2014/CD
  (NP annexation/NN)
  of/IN
  Crimea/NNP
  by/IN
  the/DT
  Russian/JJ
  Federation/NNP
  ,/,
  the/DT
  USA/NNP
  bec
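The grammar above only matches singular common nouns (`NN`), which is why proper nouns such as _United States_ stay outside any chunk in the output. A minimal sketch of a broader pattern follows. Using `<NN.*>+` to accept one or more noun tags of any kind is an assumption made for illustration, and the tagged tokens are copied from the POS tagging output of the first sentence so that no tagger model is needed.

```python
from nltk import RegexpParser

# Tagged tokens copied from the POS tagging output of the first sentence
tagged = [('The', 'DT'), ('United', 'NNP'), ('States', 'NNPS'),
          ('officially', 'RB'), ('recognized', 'VBN'), ('the', 'DT'),
          ('independence', 'NN'), ('of', 'IN'), ('Ukraine', 'NNP'),
          ('on', 'IN'), ('December', 'NNP'), ('25', 'CD'), (',', ','),
          ('1991', 'CD'), ('.', '.')]

# Broadened pattern (an assumption for illustration): optional determiner,
# any adjectives, then one or more nouns of any kind (NN, NNS, NNP, NNPS)
grammar = r"NP: {<DT>?<JJ>*<NN.*>+}"
tree = RegexpParser(grammar).parse(tagged)

# Collect the text of every NP subtree, skipping the root S node
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)
```

With this pattern, consecutive noun tags end up in a single chunk, so multi-word names are kept together instead of being left as loose tokens.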

# Conclusions 
NLTK is built for research purposes and does not provide as much robustness as its counterpart, spaCy. It processes text in a C-like fashion, as plain strings and lists of strings that carry no information about parts of speech, spans or other properties; those are instead computed on demand by calling specialised functions. On top of that, NLTK leans mostly on rule-based and classical statistical methods (its default POS tagger, for instance, is an averaged perceptron model) rather than on the pretrained neural pipelines that spaCy ships, even though it comes with rich corpora.   

With that being said, I consider the Natural Language Toolkit suitable for small-scale linguistic research and quick ad hoc scripts, and would prefer spaCy for any other kind of language processing.

# Sources and references
1. POS Tagging with NLTK and Chunking in NLP: _https://www.guru99.com/pos-tagging-chunking-nltk.html_
