# Intro to NLTK

[NLTK](https://www.nltk.org/) stands for Natural Language Toolkit. It's a particularly interesting platform for working with NLP (Natural Language Processing) in Python.

### Table of Contents

- <a href='#tokenization'>Tokenization</a>
- <a href='#text_normalization'>Text Normalization</a>
- <a href='#similarity'>Similarity</a>
- <a href='#pos_tagging'>POS Tagging</a>
- <a href='#wordnet'>WordNet</a>
- <a href='#disambiguation'>Disambiguation</a>


### Dependencies

In [1]:
import nltk

To process text further, NLTK provides *corpora* and trained models that contain information about different languages. Here's how you can download the ones used in this notebook:

In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
nltk.download('treebank')
nltk.download('omw')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\T-Gamer\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\T-Gamer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\T-Gamer\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\T-Gamer\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\T-Gamer\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\T-Gamer\AppData\Roaming\nltk_data...
[nltk_data]   Package treebank is already up-to

True

( Based on *"Mastering Natural Language Processing with Python" (Chopra, Joshi, Mathur)* )

Let's use a short reference text to show how the library works.

In [3]:
text = "Welcome students. NLP is an interesting subject. However, knowledge about many things is needed to achieve full undestanding."

## Tokenization <a id='tokenization'></a>

Tokenization is a very important task in NLP.

Given a *document unit* (for instance, a phrase), *tokenizing it* means splitting it into pieces (tokens).

Most commonly, tokenization is done at the word level, that is, the text is broken up into words (**word-level** tokenization). You can see it here:

In [4]:
nltk.word_tokenize(text)

['Welcome',
 'students',
 '.',
 'NLP',
 'is',
 'an',
 'interesting',
 'subject',
 '.',
 'However',
 ',',
 'knowledge',
 'about',
 'many',
 'things',
 'is',
 'needed',
 'to',
 'achieve',
 'full',
 'undestanding',
 '.']

But it's also possible to apply **sentence-level** tokenization:

In [5]:
from nltk.tokenize import sent_tokenize
sent_tokenize(text)

['Welcome students.',
 'NLP is an interesting subject.',
 'However, knowledge about many things is needed to achieve full undestanding.']
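
To see why a trained sentence tokenizer is useful, here is a minimal sketch of a naive alternative: splitting after every `.`, `!` or `?` that is followed by whitespace. The helper name `naive_sent_split` is made up for this illustration and is not part of NLTK.

```python
import re

def naive_sent_split(text):
    # split at whitespace that follows sentence-ending punctuation
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_sent_split("Welcome students. NLP is an interesting subject."))
# ['Welcome students.', 'NLP is an interesting subject.']

# an abbreviation is wrongly treated as a sentence boundary
print(naive_sent_split("He will join the board Nov. 29."))
# ['He will join the board Nov.', '29.']
```

Abbreviations like this one are the kind of case the pre-trained Punkt models behind `sent_tokenize` are meant to cope with.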

Although most things in NLTK are pre-defined to work with English, **NLTK also provides data for other languages**.

In [6]:
port_tokenizer=nltk.data.load('tokenizers/punkt/portuguese.pickle')
port_text='Este texto em português é "tokenizado". Cortesia do NLTK.'
port_tokenizer.tokenize(port_text)

['Este texto em português é "tokenizado".', 'Cortesia do NLTK.']

There are some **design decisions** that you can make on **how to handle tokenization**. For instance, when tokenizing the word *"don't"* at the word level, you could treat it as one token ("don't"), as two tokens ("do", "n't"), or even as three tokens ("don", "'", "t"). Below you can see how NLTK handles each case.

word_tokenize:

In [7]:
nltk.word_tokenize("I don't know.")

['I', 'do', "n't", 'know', '.']

WordPunctTokenizer:

In [8]:
from nltk.tokenize import WordPunctTokenizer
tokenizer=WordPunctTokenizer()
tokenizer.tokenize(" I don't know.")

['I', 'don', "'", 't', 'know', '.']

Traditional Split:

In [9]:
"I don't know".split()

['I', "don't", 'know']
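
The same decisions can be made explicit with regular expressions. The sketch below uses only the standard `re` module; the second pattern is, to my understanding, the one `WordPunctTokenizer` is built on, while the first is just one possible way to keep contractions together.

```python
import re

sentence = "I don't know."

# keep contractions together: a word may carry one internal apostrophe
keep_contractions = re.findall(r"\w+(?:'\w+)?|[^\w\s]+", sentence)
print(keep_contractions)  # ['I', "don't", 'know', '.']

# split words and punctuation runs apart
split_punct = re.findall(r"\w+|[^\w\s]+", sentence)
print(split_punct)        # ['I', 'don', "'", 't', 'know', '.']
```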

## Text Normalization <a id='text_normalization'></a>

Cleaning your text will often be necessary in order to make the right information get through. 

Applying both sentence and word-level tokenization:

In [10]:
[nltk.word_tokenize(a) for a in sent_tokenize(text)]

[['Welcome', 'students', '.'],
 ['NLP', 'is', 'an', 'interesting', 'subject', '.'],
 ['However',
  ',',
  'knowledge',
  'about',
  'many',
  'things',
  'is',
  'needed',
  'to',
  'achieve',
  'full',
  'undestanding',
  '.']]

It's a **common practice to make everything lowercase**. This way, whether "welcome" is typed as "Welcome" or "WeLcOmE", it becomes the same token.

In [11]:
text.lower()

'welcome students. nlp is an interesting subject. however, knowledge about many things is needed to achieve full undestanding.'

**Many words carry little semantic meaning**, such as "a", "the" and "is". They are called **'stopwords'**. In many applications it may be useful to *remove* them.

In [12]:
from nltk.corpus import stopwords

stops=set(stopwords.words('english'))
words=nltk.word_tokenize(text)
[word for word in words if word not in stops]

['Welcome',
 'students',
 '.',
 'NLP',
 'interesting',
 'subject',
 '.',
 'However',
 ',',
 'knowledge',
 'many',
 'things',
 'needed',
 'achieve',
 'full',
 'undestanding',
 '.']
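
NLTK's stopword lists are lowercase, so a capitalized token such as "The" would not be matched by a case-sensitive comparison like the one above, and punctuation is kept as well. A minimal sketch of a stricter filter follows; `toy_stops` is a hypothetical mini list used only so the example runs without downloads.

```python
# hypothetical mini stopword list; NLTK's real English list is much longer
toy_stops = {"the", "are", "is", "a"}

tokens = ["The", "students", "are", "learning", "NLP", "."]

# compare the lowercased token and keep only alphabetic tokens
content_words = [t.lower() for t in tokens
                 if t.isalpha() and t.lower() not in toy_stops]
print(content_words)  # ['students', 'learning', 'nlp']
```

Whether to drop punctuation and case depends on the application, so treat this as one option rather than a rule.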

Each language has its own stopword list, which can be used through NLTK like this:

In [13]:
from nltk.corpus import stopwords
stops=set(stopwords.words('portuguese'))
words=nltk.word_tokenize(port_text)
[word for word in words if word not in stops]

['Este',
 'texto',
 'português',
 'é',
 '``',
 'tokenizado',
 "''",
 '.',
 'Cortesia',
 'NLTK',
 '.']

You can inspect the stopword list and adapt it to your application's needs.

In [14]:
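w=stopwords.words('portuguese')
# preview the list in rows of 8 words
# note: row k starts at index 9*k, so one word is skipped between consecutive rows;
# w[k*8:(k+1)*8] would give contiguous rows instead
[[w[i+k*8] for i in range(k,k+8)] for k in range(7)]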

[['de', 'a', 'o', 'que', 'e', 'do', 'da', 'em'],
 ['para', 'com', 'não', 'uma', 'os', 'no', 'se', 'na'],
 ['mais', 'as', 'dos', 'como', 'mas', 'ao', 'ele', 'das'],
 ['seu', 'sua', 'ou', 'quando', 'muito', 'nos', 'já', 'eu'],
 ['só', 'pelo', 'pela', 'até', 'isso', 'ela', 'entre', 'depois'],
 ['mesmo', 'aos', 'seus', 'quem', 'nas', 'me', 'esse', 'eles'],
 ['essa', 'num', 'nem', 'suas', 'meu', 'às', 'minha', 'numa']]

**Stemming** is another text normalization technique. It's the process of reducing inflected or derived words to their stem (a root-like form that is not always a dictionary word). This way ["work", "working", "worked"] will all be represented identically (as "work").

In [32]:
from nltk.stem import PorterStemmer
stemmerporter = PorterStemmer()
stemmerporter.stem('working')

'work'

In [33]:
stemmerporter.stem('happiness')

'happi'

In [34]:
stemmerporter.stem('happy')

'happi'

In [23]:
stemmerporter.stem('sensation')

'sensat'

NLTK also provides other stemming techniques, like LancasterStemmer.

In [24]:
stemmerlan=nltk.stem.LancasterStemmer()
stemmerlan.stem('sensation'), stemmerlan.stem('happiness'), stemmerlan.stem('working')

('sens', 'happy', 'work')

and also SnowballStemmer.

In [26]:
from nltk.stem import SnowballStemmer as snow
sstem=snow('portuguese')
[sstem.stem(x) for x in "comendo comida alegria trabalhismo sensação sentimento".split()]

['com', 'com', 'alegr', 'trabalh', 'sensaçã', 'sentiment']

An important question is: what are the major differences between these three stemmers?

- **Porter:** A fairly gentle stemmer. It is also the most computationally intensive of the three (though not by a very significant margin), and the oldest of the three.

- **Lancaster:** A very aggressive stemming algorithm, sometimes to a fault. Its stems are not as intuitive as those of the other two, and short words might become totally obfuscated. It is the fastest of the three and reduces your working set of words hugely (at the price of less distinction between words).

- **Snowball** *(its English stemmer is also known as Porter2)*: An improvement over Porter, with slightly faster computation time. This will almost always be the go-to if you are not sure which one to use.
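
All three stemmers are rule-based: they strip or rewrite suffixes without consulting a dictionary. The toy sketch below shows only that core idea. `toy_stem` and its suffix list are invented for this illustration and are far simpler than the rules used by Porter, Lancaster or Snowball.

```python
# hypothetical suffix list for illustration; real stemmers use many more rules
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, as long as enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ["work", "working", "worked", "works"]])
# ['work', 'work', 'work', 'work']
print(toy_stem("sing"))     # sing (too short to strip "ing")
print(toy_stem("meeting"))  # meet
```

How many rules exist and how eagerly they fire is, roughly, what makes one stemmer gentler or more aggressive than another.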

**Lemmatization**

It is similar to stemming, but lemmatization uses a vocabulary and morphological analysis to return the dictionary form (lemma) of a word. For instance, stemming "meeting" returns "meet", since it is the root of the word. However, "meeting" can also be a noun instead of a verb (as in "in our last meeting"). `WordNetLemmatizer.lemmatize` treats words as nouns by default (`pos="n"`), so it preserves the token as "meeting", as you can observe below:

In [38]:
stemmerporter.stem('meeting')

'meet'

In [37]:
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer_output=WordNetLemmatizer()
lemmatizer_output.lemmatize('meeting')

'meeting'

In [28]:
[lemmatizer_output.lemmatize(x, pos="n") for x in "working works unhappiness sensation sentiment journey journal information".split()]

['working',
 'work',
 'unhappiness',
 'sensation',
 'sentiment',
 'journey',
 'journal',
 'information']

Handling **spellchecking** is another text normalization task that, depending on your application, may be necessary. You can come across words missing spaces, flipped characters and much more.

In [None]:
# !apt-get install aspell-pt # package for GNU spellchecker (in portuguese)
# !apt-get install enchant # spellchecking software
# !pip install pyenchant # spellchecking library for python

In [None]:
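import enchant

s=enchant.Dict("en_US")

def tokenize(st1):
  """Greedily split a string without spaces into the longest dictionary words."""
  tok=[]  # local list, so repeated calls do not accumulate old tokens
  while st1:
    # try the longest prefix first; stop at j=1 so the empty string is never checked
    for j in range(len(st1),0,-1):
      if s.check(st1[:j]):
        tok.append(st1[:j])
        st1=st1[j:]
        break
    else:
      break  # no prefix is a known word: stop instead of looping forever
  return tok

tokenize("whatistheproblemwiththisbook")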

In [32]:
s.suggest('prone')

['probe',
 'pron',
 'pone',
 'prose',
 'prole',
 'crone',
 'drone',
 'prune',
 'prong',
 'phone',
 'prove',
 'krone',
 'pr one',
 'pr-one',
 'pron e']

Dictionaries are also available for other languages (if you install them):

In [33]:
enchant.list_dicts()

[('en_US', <Enchant: Myspell Provider>),
 ('pt_BR', <Enchant: Aspell Provider>),
 ('pt_PT', <Enchant: Aspell Provider>)]

In [34]:
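s=enchant.Dict("pt_BR")

def tokenize(st1):
  """Greedily split a string without spaces into the longest dictionary words."""
  tok=[]  # local list, so repeated calls do not accumulate old tokens
  while st1:
    # try the longest prefix first; stop at j=1 so the empty string is never checked
    for j in range(len(st1),0,-1):
      if s.check(st1[:j]):
        tok.append(st1[:j])
        st1=st1[j:]
        break
    else:
      break  # no prefix is a known word: stop instead of looping forever
  return tok

tokenize("eunãotenhoamenoridéiadecomoseseparaisso")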

['eu',
 'não',
 'tenho',
 'ameno',
 'ri',
 'd',
 'é',
 'ia',
 'de',
 'como',
 'se',
 'separais',
 's',
 'o']
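
The output above shows the weakness of always taking the longest prefix: "ameno" swallows the start of "a menor", and the rest of the split never recovers. One common alternative is dynamic programming over all split points. The sketch below assumes a toy vocabulary (a plain Python set, not an enchant dictionary) and prefers the segmentation with the fewest words; `segment` is a name made up for this example.

```python
def segment(text, vocabulary):
    """Return the segmentation of text with the fewest vocabulary words, or None."""
    # best[i] holds the shortest word list covering text[:i], or None if impossible
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]

toy_vocabulary = {"the", "them", "menu", "me"}  # hypothetical word list
print(segment("themenu", toy_vocabulary))  # ['the', 'menu']
print(segment("xyz", toy_vocabulary))      # None
```

A greedy splitter would pick "them" first and then get stuck on "enu"; the table of partial results lets the shorter "the" win because it leads to a complete split.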

## Similarity <a id='similarity'></a>

Calculating similarity between *tokens* is an important task. Different similarity measures capture different notions of closeness, so the one you choose affects how the results should be interpreted.

Since words can be stored as vectors, you could use **classification metrics** to evaluate how similar two vectors are:

In [24]:
from nltk.metrics import *

a = [1, 2, 3, 4, 5]
b = [1, 1, 2, 4, 5]

accuracy(a,b), precision(set(a),set(b)), recall(set(a),set(b))

(0.6, 1.0, 0.8)
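
To make those three numbers less of a black box, here is a plain-Python sketch of the same calculations, treating the first argument as the reference and the second as the test data. The variable names are chosen for this illustration.

```python
reference = [1, 2, 3, 4, 5]
test = [1, 1, 2, 4, 5]

# accuracy: fraction of positions where the two sequences agree
accuracy_value = sum(r == t for r, t in zip(reference, test)) / len(reference)

ref_set, test_set = set(reference), set(test)
common = ref_set & test_set
precision_value = len(common) / len(test_set)  # share of test items that are in the reference
recall_value = len(common) / len(ref_set)      # share of reference items found in the test set

print(accuracy_value, precision_value, recall_value)  # 0.6 1.0 0.8
```

Note that accuracy compares the lists position by position, while precision and recall work on sets and therefore ignore order and duplicates.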

A more direct way is to calculate the **edit distance** between two words. It represents the minimum number of changes needed to turn one word into the other (*insertions, deletions and substitutions*).

In [16]:
edit_distance("amor","ambar")

2

In [18]:
edit_distance("sunday","sydney")

4
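
Edit distance is usually computed with dynamic programming. The following is a minimal sketch of the classic Levenshtein recurrence (no transpositions), written for illustration rather than copied from NLTK; `levenshtein` is a name chosen here.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions."""
    # previous[j] = edits needed to turn the processed prefix of source into target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into the empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                       # delete s_char
                current[j - 1] + 1,                    # insert t_char
                previous[j - 1] + (s_char != t_char),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("amor", "ambar"))     # 2
print(levenshtein("sunday", "sydney"))  # 4
```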

**Jaccard distance** measures how dissimilar two sets are. It is one minus the size of the intersection divided by the size of the union, so two identical sets have a Jaccard distance of zero. Note that this approach ignores element order and duplicates, because it operates on sets.

In [25]:
jaccard_distance(set(a),set(b))

0.2

In [27]:
jaccard_distance(set([1,2,3,4]), set([4,3,2,1]))

0.0
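
Written out in plain Python, the calculation is a one-liner. The sketch below is an illustration with a made-up function name; the handling of two empty sets is an assumption of this example.

```python
def jaccard(set_a, set_b):
    """1 - |A ∩ B| / |A ∪ B|"""
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty sets count as identical
    return 1 - len(set_a & set_b) / len(union)

print(round(jaccard({1, 2, 3, 4, 5}, {1, 2, 4, 5}), 3))  # 0.2
print(jaccard({1, 2, 3, 4}, {4, 3, 2, 1}))               # 0.0
print(jaccard({1, 2}, {3, 4}))                           # 1.0
```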

A very simple distance is the **binary distance**. It simply checks whether the inputs are equal or not. (0 if identical, 1 if different)

In [19]:
binary_distance([1,2,3],[1,4,5])

1.0

In [30]:
binary_distance([1,2,3],[1,2,3])

0.0

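**MASI distance** is also a set-based measure; it gives credit for partial agreement when multiple labels are assigned to an item. ([Paper](http://www.cs.columbia.edu/nlp/papers/2006/passonneau_06.pdf))

It is computed following the formula:

<center>1 - (len_intersection / len_union) * m</center>

In NLTK's implementation, `m` is 1 when the sets are identical, 0.67 when one set is a subset of the other, 0.33 when they merely overlap, and 0 when they are disjoint.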

In [31]:
masi_distance(set(a),set(b))

0.46399999999999997
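
The sketch below reproduces that value in plain Python, assuming the weights 1, 0.67, 0.33 and 0 for identical, subset, overlapping and disjoint sets (the ones I understand NLTK to use). The function name and the empty-set handling are choices made for this example.

```python
def masi(set_a, set_b):
    """1 - Jaccard similarity weighted by how the two sets relate."""
    intersection = set_a & set_b
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty sets count as identical
    if set_a == set_b:
        m = 1.0
    elif intersection in (set_a, set_b):  # one set contains the other
        m = 0.67
    elif intersection:                    # some overlap, neither contains the other
        m = 0.33
    else:                                 # disjoint
        m = 0.0
    return 1 - len(intersection) / len(union) * m

print(round(masi({1, 2, 3, 4, 5}, {1, 2, 4, 5}), 3))  # 0.464
print(masi({1, 2}, {3}))                              # 1.0
```

Here `{1, 2, 4, 5}` is a subset of `{1, 2, 3, 4, 5}`, so the Jaccard similarity 0.8 is scaled by 0.67 before being subtracted from 1.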

## POS Tagging <a id='pos_tagging'></a>

POS stands for **Part-of-Speech**. POS tagging assigns a part-of-speech tag to each word (*a-ha!*), taking into account its relationship with adjacent and related words in the text. NLTK has a built-in module for POS tagging:

In [36]:
text1=nltk.word_tokenize("It is a pleasant day today")
nltk.pos_tag(text1)

[('It', 'PRP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('pleasant', 'JJ'),
 ('day', 'NN'),
 ('today', 'NN')]

It assigned those tags, and you can check what each tag means through the code below:

In [37]:
nltk.help.upenn_tagset('VBZ')

VBZ: verb, present tense, 3rd person singular
    bases reconstructs marks mixes displeases seals carps weaves snatches
    slumps stretches authorizes smolders pictures emerges stockpiles
    seduces fizzes uses bolsters slaps speaks pleads ...


In [38]:
nltk.pos_tag(nltk.word_tokenize("Now, I cannot bear the pain of bear"))

[('Now', 'RB'),
 (',', ','),
 ('I', 'PRP'),
 ('can', 'MD'),
 ('not', 'RB'),
 ('bear', 'VB'),
 ('the', 'DT'),
 ('pain', 'NN'),
 ('of', 'IN'),
 ('bear', 'NN')]

Other taggers are available as well:

In [39]:
from nltk.tag import BigramTagger
from nltk.corpus import treebank
training_1= treebank.tagged_sents()[:7000]
bigramtagger=BigramTagger(training_1)
treebank.sents()[0]

['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

In [40]:
bigramtagger.tag(treebank.sents()[0])

[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.')]

In [41]:
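# note: training_1 was sliced with [:7000], which covers the whole treebank sample,
# so these test sentences were also seen in training and the score below is optimistic
testing_1 = treebank.tagged_sents()[2000:]
bigramtagger.evaluate(testing_1) #evaluate based on the ground truth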

0.9171131227292321
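
To give an intuition for what a bigram tagger learns, here is a toy sketch: for every pair (previous tag, current word) seen in training it remembers the most frequent tag. The training sentences are made up for this example, and the code is a simplification for illustration, not NLTK's implementation.

```python
from collections import Counter, defaultdict

def train_bigram_tagger(tagged_sentences):
    """Map (previous tag, word) to the tag most often seen in that context."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        previous_tag = "<s>"  # marker for the start of a sentence
        for word, tag in sentence:
            counts[(previous_tag, word)][tag] += 1
            previous_tag = tag
    return {context: tags.most_common(1)[0][0] for context, tags in counts.items()}

def tag(model, words):
    tagged, previous_tag = [], "<s>"
    for word in words:
        current = model.get((previous_tag, word))  # None when the context is unseen
        tagged.append((word, current))
        previous_tag = current
    return tagged

# hypothetical hand-made training data
toy_training = [
    [("the", "DT"), ("bear", "NN"), ("sleeps", "VBZ")],
    [("I", "PRP"), ("bear", "VB"), ("pain", "NN")],
]
model = train_bigram_tagger(toy_training)
print(tag(model, ["I", "bear", "pain"]))    # [('I', 'PRP'), ('bear', 'VB'), ('pain', 'NN')]
print(tag(model, ["a", "bear", "sleeps"]))  # [('a', None), ('bear', None), ('sleeps', None)]
```

The second call shows the main weakness: one unseen context leaves every following word untagged, which is why bigram taggers are usually combined with a simpler fallback tagger.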

You can then use the tags to group words into chunks, for example noun phrases (NP), with a tag-pattern grammar:

In [42]:
sent=[("A","DT"),("wise", "JJ"), ("small", "JJ"),("girl", "NN"),
("of", "IN"), ("village", "NN"), ("became", "VBD"), ("leader", "NN")]
grammar = "NP: {<DT>?<JJ>*<NN><IN>?<NN>*}"
find = nltk.RegexpParser(grammar)
res = find.parse(sent)
print(res)

(S
  (NP A/DT wise/JJ small/JJ girl/NN of/IN village/NN)
  became/VBD
  (NP leader/NN))
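
A tag pattern such as `<DT>?<JJ>*<NN><IN>?<NN>*` behaves much like a regular expression over the sequence of tags. The sketch below mimics that idea with the standard `re` module: the tags are joined into one string, matched, and the matches are mapped back to words. It is a simplified illustration with made-up names, not how `RegexpParser` is implemented.

```python
import re

# the NP pattern from above, rewritten as an ordinary regular expression
NP_PATTERN = r"(?:<DT>)?(?:<JJ>)*<NN>(?:<IN>)?(?:<NN>)*"

def chunk(tagged, pattern):
    """Return the word groups whose tag sequence matches the pattern."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    # starts[i] = character offset at which the tag of token i begins
    starts, offset = [], 0
    for _, tag in tagged:
        starts.append(offset)
        offset += len(tag) + 2  # +2 for the angle brackets
    chunks = []
    for match in re.finditer(pattern, tag_string):
        first = starts.index(match.start())
        last = first + match.group().count("<")  # one "<" per matched tag
        chunks.append([word for word, _ in tagged[first:last]])
    return chunks

sent = [("A", "DT"), ("wise", "JJ"), ("small", "JJ"), ("girl", "NN"),
        ("of", "IN"), ("village", "NN"), ("became", "VBD"), ("leader", "NN")]
print(chunk(sent, NP_PATTERN))
# [['A', 'wise', 'small', 'girl', 'of', 'village'], ['leader']]
```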


And then you can also structure it as a graph (here in Graphviz DOT format):

In [43]:
node=0

def graph_tree(tree):
  global node
  node=0
  def graf(tree):
    global node
    outs=""
    if type(tree)==nltk.tree.Tree:
      s=str(node)
      node += 1
      outs += s+'[label="'+tree.label()+'"];\n'
      for c in tree:
        outs += s+"->"+str(node)+";\n"
        outs += graf(c)
    else:
      s=str(node)
      node += 1
      outs += s+'[label="'+str(tree[0])+'\\n'+str(tree[1])+'"];\n'
    return outs
  
  return "digraph G {\n"+graf(tree)+"}"


g=graph_tree(res)
print(g)

digraph G {
0[label="S"];
0->1;
1[label="NP"];
1->2;
2[label="A\nDT"];
1->3;
3[label="wise\nJJ"];
1->4;
4[label="small\nJJ"];
1->5;
5[label="girl\nNN"];
1->6;
6[label="of\nIN"];
1->7;
7[label="village\nNN"];
0->8;
8[label="became\nVBD"];
0->9;
9[label="NP"];
9->10;
10[label="leader\nNN"];
}


## WordNet <a id='wordnet'></a>

WordNet is a lexical database for English. It encodes relations between words that have already been curated, so you can reuse that work to process your own texts.

For instance, you can use it to find **synonyms**. In particular, it is organized in "synsets", sets of synonyms that are interchangeable in some context.

In [48]:
from nltk.corpus import wordnet as wn

wn.synsets('cat')

[Synset('cat.n.01'),
 Synset('guy.n.01'),
 Synset('cat.n.03'),
 Synset('kat.n.01'),
 Synset('cat-o'-nine-tails.n.01'),
 Synset('caterpillar.n.02'),
 Synset('big_cat.n.01'),
 Synset('computerized_tomography.n.01'),
 Synset('cat.v.01'),
 Synset('vomit.v.01')]

In [49]:
wn.synsets('flake',pos=wn.VERB)

[Synset('flake.v.01'), Synset('flake.v.02'), Synset('peel_off.v.04')]

In [50]:
wn.synset('flake.v.02').definition()

'cover with flakes or as if with flakes'

In [51]:
wn.synset('flake.v.01').examples()

['The substances started to flake']

In [52]:
wn.synset('cat.n.01').lemmas('por')

[Lemma('cat.n.01.bichano'),
 Lemma('cat.n.01.gata'),
 Lemma('cat.n.01.gato'),
 Lemma('cat.n.01.gato-doméstico'),
 Lemma('cat.n.01.Gato_doméstico'),
 Lemma('cat.n.01.Gato-doméstico')]

In [53]:
wn.synsets('carro',lang='por')

[Synset('beach_wagon.n.01'),
 Synset('car.n.01'),
 Synset('car.n.02'),
 Synset('cart.n.01')]

And even **hypernyms**! (A hypernym is a superordinate term, like "musical instrument" is for "guitar".)

In [54]:
wn.synset('tomato.n.01').hypernyms()

[Synset('solanaceous_vegetable.n.01')]

In [55]:
wn.synset('solanaceous_vegetable.n.01').hyponyms()

[Synset('eggplant.n.01'),
 Synset('pepper.n.04'),
 Synset('potato.n.01'),
 Synset('tomatillo.n.03'),
 Synset('tomato.n.01')]

In [56]:
wn.synset('solanaceous_vegetable.n.01').hypernyms()

[Synset('vegetable.n.01')]

**Hyponyms** as well! (These are subordinate terms, the opposite relationship of hypernyms.)

In [57]:
wn.synset('vegetable.n.01').hyponyms()

[Synset('artichoke.n.02'),
 Synset('artichoke_heart.n.01'),
 Synset('asparagus.n.02'),
 Synset('bamboo_shoot.n.01'),
 Synset('cardoon.n.02'),
 Synset('celery.n.02'),
 Synset('cruciferous_vegetable.n.01'),
 Synset('cucumber.n.02'),
 Synset('fennel.n.02'),
 Synset('greens.n.01'),
 Synset('gumbo.n.03'),
 Synset('julienne.n.01'),
 Synset('leek.n.02'),
 Synset('legume.n.03'),
 Synset('mushroom.n.05'),
 Synset('onion.n.03'),
 Synset('pieplant.n.01'),
 Synset('plantain.n.03'),
 Synset('potherb.n.01'),
 Synset('pumpkin.n.02'),
 Synset('raw_vegetable.n.01'),
 Synset('root_vegetable.n.01'),
 Synset('solanaceous_vegetable.n.01'),
 Synset('squash.n.02'),
 Synset('truffle.n.02')]

In [58]:
wn.synset('cruciferous_vegetable.n.01').hyponyms()

[Synset('broccoli.n.02'),
 Synset('broccoli_rabe.n.02'),
 Synset('brussels_sprouts.n.01'),
 Synset('cabbage.n.01'),
 Synset('cauliflower.n.02'),
 Synset('kohlrabi.n.02'),
 Synset('mustard.n.03'),
 Synset('radish.n.01'),
 Synset('turnip.n.02')]

In [59]:
wn.synset('vegetable.n.01').hypernyms()

[Synset('produce.n.01')]

In [60]:
wn.synset('produce.n.01').hyponyms()

[Synset('eater.n.02'), Synset('edible_fruit.n.01'), Synset('vegetable.n.01')]

In [61]:
wn.synset('cabbage.n.01').lowest_common_hypernyms(wn.synset('tomato.n.01'))

[Synset('vegetable.n.01')]

## Disambiguation <a id='disambiguation'></a>

Words are often ambiguous. In WordNet, path similarity can be used to **measure how close two *synsets* are**: it returns a score of at most 1, and higher means more similar.

In [62]:
wn.synset('cabbage.n.01').path_similarity(wn.synset('tomato.n.01'))

0.2

In [63]:
wn.synset('cabbage.n.01').path_similarity(wn.synset('car.n.01'))

0.058823529411764705
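
Path similarity can be read as 1 / (1 + length of the shortest path between the two synsets in the hypernym graph). The sketch below rebuilds the 0.2 for cabbage and tomato on a tiny hand-copied slice of the hierarchy shown earlier. It assumes every node has a single hypernym, which is a simplification (real synsets can have several).

```python
# child -> hypernym, copied from the WordNet outputs shown earlier
PARENT = {
    "tomato": "solanaceous_vegetable",
    "solanaceous_vegetable": "vegetable",
    "cabbage": "cruciferous_vegetable",
    "cruciferous_vegetable": "vegetable",
    "vegetable": "produce",
}

def hypernym_path(node):
    """List the node followed by all of its ancestors."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def toy_path_similarity(a, b):
    steps_from_a = {node: steps for steps, node in enumerate(hypernym_path(a))}
    for steps_from_b, node in enumerate(hypernym_path(b)):
        if node in steps_from_a:  # first shared ancestor on the way up
            return 1 / (steps_from_a[node] + steps_from_b + 1)
    return None  # no common ancestor in this toy hierarchy

print(toy_path_similarity("cabbage", "tomato"))  # 0.2
print(toy_path_similarity("tomato", "tomato"))   # 1.0
```

Cabbage and tomato meet at "vegetable", two steps up from each, so the path has four edges and the score is 1 / 5.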

NLTK also has *Lesk*, which can be used to **pick the sense (synset) of a word in the context of a sentence.** As the outputs below show, its choices are not always the ones a human would make.

In [64]:
from nltk.wsd import lesk
sent = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
lesk(sent,'bank'), lesk(sent,'bank').definition()

(Synset('savings_bank.n.02'),
 'a container (usually with a slot in the top) for keeping money at home')

In [65]:
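sent = 'The bank is steep'.split()
# note: the first call restricts the search to nouns ('n') but the second does not,
# so the definition shown comes from a different sense than the synset next to it
lesk(sent,'bank','n'), lesk(sent,'bank').definition()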

(Synset('bank.n.07'), 'put into a bank account')

In [66]:
lesk('I bear the pain of the bear.'.split(),'bear').definition()

"take on as one's own the expenses or debts of another person"

In [67]:
lesk('I got used to the pain of the bear'.split(),'bear','n').definition()

'an investor with a pessimistic market outlook; an investor who expects prices to fall and so sells now in order to buy later at a lower price'

In [77]:
lesk('Drop the pine cone'.split(),'cone','n').definition()

'a visual receptor cell in the retina that is sensitive to bright light and to color'
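
The idea behind Lesk is simple: pick the sense whose dictionary gloss shares the most words with the sentence. Below is a minimal sketch of that idea with a hypothetical two-sense inventory; the glosses are invented for this example and are not WordNet's wording.

```python
# hypothetical sense inventory with made-up glosses
SENSES = {
    "bank (finance)": "an institution that keeps money for its customers",
    "bank (river)": "sloping land beside a river or a lake",
}

def simple_lesk(context_words, senses):
    """Return the sense whose gloss overlaps most with the context words."""
    context = {word.lower() for word in context_words}

    def overlap(sense):
        return len(context & set(senses[sense].split()))

    return max(senses, key=overlap)  # ties go to the first sense listed

print(simple_lesk("I went to the bank to deposit money".split(), SENSES))
# bank (finance)
print(simple_lesk("We sat on the bank of the river".split(), SENSES))
# bank (river)
```

Because plain word overlap also counts incidental words, a short context can easily tip the choice towards an unexpected sense, which may explain some of the surprising results above.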

## For now.. that's it!