# NLTK tutorial
(From https://www.nltk.org/)
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

This tutorial covers the following sections:

1. Tokenizer
2. Stemmer
3. WordNet
4. Tips for the assignments

In [1]:
# !pip install nltk
# !pip install numpy





# 1. NLTK Tokenizer

In [1]:
import nltk
nltk.download('punkt') # required by nltk.word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('wordnet') 

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
text1 = "Text mining is to identify useful information."
text2 = "Current NLP models isn't able to solve NLU perfectly."

print("string.split tokenizer", text1.split(" "))
print("string.split tokenizer", text2.split(" "))

string.split tokenizer ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information.']
string.split tokenizer ['Current', 'NLP', 'models', "isn't", 'able', 'to', 'solve', 'NLU', 'perfectly.']


`str.split` cannot deal with punctuation: full stops and apostrophes stay attached to the words.

In [3]:
import re # regular expression (standard library)
# \w+ matches a run of word characters; [^\w\s] matches a single punctuation mark
print("regular expression tokenizer", re.findall(r"\w+|[^\w\s]", text1))
print("regular expression tokenizer", re.findall(r"\w+|[^\w\s]", text2))

regular expression tokenizer ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
regular expression tokenizer ['Current', 'NLP', 'models', 'isn', "'", 't', 'able', 'to', 'solve', 'NLU', 'perfectly', '.']


- `str.split` cannot separate punctuation from the words it is attached to.
- A simple regular expression can handle most punctuation but may fail on contractions such as "isn't", "wasn't", and "can't".
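
The contraction problem can be reduced by giving the pattern explicit alternatives for `n't` and for clitics such as `'s`. The following is a minimal sketch: `regex_tokenize` and its pattern are illustrative assumptions, not NLTK's implementation, and they cover far fewer cases than `nltk.word_tokenize`.

```python
import re

# Illustrative pattern (an assumption, not NLTK's rules). Alternatives are tried in order:
#   n't          the negation clitic, kept as its own token
#   '\w+         other clitics such as 's or 've
#   \w+(?=n't)   a word stem directly followed by n't (e.g. "is" in "isn't")
#   \w+          any other word
#   [^\w\s]      a single punctuation character
TOKEN_PATTERN = re.compile(r"n't|'\w+|\w+(?=n't)|\w+|[^\w\s]")


def regex_tokenize(text):
    """Split text into word and punctuation tokens with one regular expression."""
    return TOKEN_PATTERN.findall(text)


print(regex_tokenize("Current NLP models isn't able to solve NLU perfectly."))
print(regex_tokenize("Bob's text mining skills are perfect."))
```

Even with these alternatives, a hand-written pattern needs a new rule for every special case, which is the main reason to prefer a maintained tokenizer.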

In [4]:
def tokenize(text):
    """
    :param text: a doc with multiple sentences, type: str
    return a word list, type: list
    e.g.
    Input: 'Text mining is to identify useful information.'
    Output: ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
    """
    return nltk.word_tokenize(text)

In [5]:
print(tokenize(text1))
print(tokenize(text2))

['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
['Current', 'NLP', 'models', 'is', "n't", 'able', 'to', 'solve', 'NLU', 'perfectly', '.']


In [6]:
# Other examples:
# 1. Apostrophes: possessives (Bob's) and contractions (isn't, I've, ...)
tokens = tokenize("Bob's text mining skills are perfect.")
print(tokens)
# 2. Parentheses
tokens = tokenize("Bob's text mining skills (or, NLP) are perfect.")
print(tokens)
# 3. Ellipsis
tokens = tokenize("Bob's text mining skills are perfect...")
print(tokens)

['Bob', "'s", 'text', 'mining', 'skills', 'are', 'perfect', '.']
['Bob', "'s", 'text', 'mining', 'skills', '(', 'or', ',', 'NLP', ')', 'are', 'perfect', '.']
['Bob', "'s", 'text', 'mining', 'skills', 'are', 'perfect', '...']


# 2. Stemming and lemmatization

(https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

Stemming: chops off the ends of words to acquire the root, and often includes the removal of derivational affixes. 

e.g., wanted -> want, trees -> tree. Irregular forms such as gone -> go cannot be reached by chopping off a suffix and need lemmatization.

Lemmatization: uses a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Differences:
Stemming most commonly collapses derivationally related words, whereas lemmatization commonly collapses only the different inflectional forms of a lemma (it keeps the concrete semantic meaning of the word).

E.g., useful -> use (stemming), useful -> useful (lemmatization)
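
To make "vocabulary plus morphological analysis" concrete, here is a toy lemmatizer. It is only a sketch: `VOCABULARY`, `IRREGULAR`, `DETACHMENT_RULES`, and `toy_lemmatize` are invented for illustration and are much smaller than what a real lemmatizer uses. It checks an exception table first, then tries part-of-speech-specific suffix rules and accepts a candidate only if it is a known word.

```python
# Toy data, invented for illustration; a real lemmatizer relies on a full dictionary.
VOCABULARY = {"tree", "want", "go", "be", "mine", "use", "useful"}
IRREGULAR = {("gone", "v"): "go", ("is", "v"): "be"}
# (suffix, replacement) pairs tried in order for each part of speech.
DETACHMENT_RULES = {
    "n": [("ies", "y"), ("s", "")],
    "v": [("ing", "e"), ("ing", ""), ("ed", ""), ("s", "")],
}


def toy_lemmatize(word, pos="n"):
    """Return a dictionary form of `word` for the given part of speech."""
    # 1. Irregular forms are looked up directly.
    if (word, pos) in IRREGULAR:
        return IRREGULAR[(word, pos)]
    # 2. Otherwise strip a suffix, but only accept candidates found in the vocabulary.
    for suffix, replacement in DETACHMENT_RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in VOCABULARY:
                return candidate
    # 3. Fall back to the word itself.
    return word


for word, pos in [("trees", "n"), ("gone", "v"), ("mining", "v"), ("mining", "n"), ("useful", "n")]:
    print(word, pos, "->", toy_lemmatize(word, pos))
```

The vocabulary check is what keeps `useful` intact here, while a stemmer has no such check and reduces it to `use`.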

PorterStemmer:

A rule-based method that rewrites suffixes, e.g., SSES -> SS (misses -> miss), IES -> I (flies -> fli), S -> (empty) (nouns -> noun).

Doc: https://www.nltk.org/api/nltk.stem.html
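
The suffix rules listed above correspond to the first step (1a) of the original Porter algorithm. The sketch below applies only that step: `porter_step_1a` is a hypothetical helper written for this tutorial, and NLTK's `PorterStemmer` runs several further steps and some extensions, so its output can differ.

```python
def porter_step_1a(word):
    """Apply the plural-stripping rules of Porter step 1a; the first matching suffix wins."""
    rules = [
        ("sses", "ss"),  # misses -> miss
        ("ies", "i"),    # flies  -> fli
        ("ss", "ss"),    # caress -> caress (protects words ending in -ss)
        ("s", ""),       # nouns  -> noun
    ]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for word in ["misses", "flies", "caress", "nouns", "tree"]:
    print(word, "->", porter_step_1a(word))
```

The order of the rules matters: checking `ss` before `s` prevents `caress` from being cut to `cares`.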

In [7]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

def stem(tokens):
    """
    :param tokens: a list of tokens, type: list
    return a list of stemmed words, type: list
    e.g.
    Input: ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
    Output: ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']
    """
    ### equivalent code
    # results = list()
    # for token in tokens:
    #     results.append(ps.stem(token))
    # return results

    return [ps.stem(token) for token in tokens]

In [8]:
tokens = stem(tokenize("Text mining is to identify useful information."))
print(tokens)

['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']


In [9]:
from nltk.stem import WordNetLemmatizer
lm = WordNetLemmatizer()
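def lemmatize(tokens):
    # lemmatize() assumes pos='n' (noun) by default, so verb forms such as 'is' and 'mining' are left unchanged
    return [lm.lemmatize(token) for token in tokens]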

In [10]:
tokens = lemmatize(tokenize("Text mining is to identify useful information."))
print(tokens)

['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']


# 3. WordNet

https://www.nltk.org/howto/wordnet.html

WordNet is
- a semantically oriented dictionary of English,
- similar to a traditional thesaurus but with a richer structure.

In [11]:
from nltk.corpus import wordnet as wn

### 3.1 synsets

A synset is a set of one or more **synonyms** that are interchangeable in some context without changing the truth value of the proposition in which they are embedded.

In [12]:
# Look up a word using synsets()
wn.synsets('chase')

[Synset('pursuit.n.01'),
 Synset('chase.n.02'),
 Synset('chase.n.03'),
 Synset('chase.v.01'),
 Synset('chase.v.02'),
 Synset('chase.v.03'),
 Synset('furrow.v.03')]

In [13]:
wn.synsets('dog', pos=wn.NOUN)

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01')]

In [14]:
wn.synsets('bank')

[Synset('bank.n.01'),
 Synset('depository_financial_institution.n.01'),
 Synset('bank.n.03'),
 Synset('bank.n.04'),
 Synset('bank.n.05'),
 Synset('bank.n.06'),
 Synset('bank.n.07'),
 Synset('savings_bank.n.02'),
 Synset('bank.n.09'),
 Synset('bank.n.10'),
 Synset('bank.v.01'),
 Synset('bank.v.02'),
 Synset('bank.v.03'),
 Synset('bank.v.04'),
 Synset('bank.v.05'),
 Synset('deposit.v.02'),
 Synset('bank.v.07'),
 Synset('trust.v.01')]

In [15]:
print("synset","\t","definition")
for synset in wn.synsets('bank'):
    print(synset, '\t', synset.definition())

synset 	 definition
Synset('bank.n.01') 	 sloping land (especially the slope beside a body of water)
Synset('depository_financial_institution.n.01') 	 a financial institution that accepts deposits and channels the money into lending activities
Synset('bank.n.03') 	 a long ridge or pile
Synset('bank.n.04') 	 an arrangement of similar objects in a row or in tiers
Synset('bank.n.05') 	 a supply or stock held in reserve for future use (especially in emergencies)
Synset('bank.n.06') 	 the funds held by a gambling house or the dealer in some gambling games
Synset('bank.n.07') 	 a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
Synset('savings_bank.n.02') 	 a container (usually with a slot in the top) for keeping money at home
Synset('bank.n.09') 	 a building in which the business of banking transacted
Synset('bank.n.10') 	 a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turni

In [16]:
# this function has an optional pos argument which lets you constrain the part of speech of the word:
# pos: part-of-speech
wn.synsets('bank', pos=wn.NOUN)

[Synset('bank.n.01'),
 Synset('depository_financial_institution.n.01'),
 Synset('bank.n.03'),
 Synset('bank.n.04'),
 Synset('bank.n.05'),
 Synset('bank.n.06'),
 Synset('bank.n.07'),
 Synset('savings_bank.n.02'),
 Synset('bank.n.09'),
 Synset('bank.n.10')]

In [48]:
wn.synsets('dish')

[Synset('dish.n.01'),
 Synset('dish.n.02'),
 Synset('dish.n.03'),
 Synset('smasher.n.02'),
 Synset('dish.n.05'),
 Synset('cup_of_tea.n.01'),
 Synset('serve.v.06'),
 Synset('dish.v.02')]

In [50]:
for synset in wn.synsets('dish'):
    print(f"Definition of {synset} : {synset.definition()}")

Definition of Synset('dish.n.01') : a piece of dishware normally used as a container for holding or serving food
Definition of Synset('dish.n.02') : a particular item of prepared food
Definition of Synset('dish.n.03') : the quantity that a dish will hold
Definition of Synset('smasher.n.02') : a very attractive or seductive looking woman
Definition of Synset('dish.n.05') : directional antenna consisting of a parabolic reflector for microwave or radio frequency radiation
Definition of Synset('cup_of_tea.n.01') : an activity that you like or at which you are superior
Definition of Synset('serve.v.06') : provide (usually but not necessarily food)
Definition of Synset('dish.v.02') : make concave; shape like a dish


In [78]:
for synset in wn.synset('dish.n.01').hyponyms():
    for lemma in synset.lemmas():
        print(lemma)

Lemma('bowl.n.03.bowl')
Lemma('butter_dish.n.01.butter_dish')
Lemma('casserole.n.02.casserole')
Lemma('coquille.n.02.coquille')
Lemma('gravy_boat.n.01.gravy_boat')
Lemma('gravy_boat.n.01.gravy_holder')
Lemma('gravy_boat.n.01.sauceboat')
Lemma('gravy_boat.n.01.boat')
Lemma('petri_dish.n.01.Petri_dish')
Lemma('ramekin.n.02.ramekin')
Lemma('ramekin.n.02.ramequin')
Lemma('serving_dish.n.01.serving_dish')
Lemma('sugar_bowl.n.01.sugar_bowl')
Lemma('watch_glass.n.01.watch_glass')


In [71]:
wn.synset('ramekin.n.02').lemma_names()

['ramekin', 'ramequin']

In [17]:
wn.synset('dog.n.01')

Synset('dog.n.01')

In [18]:
print(wn.synset('dog.n.01').definition())

a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds


In [19]:
wn.synset('dog.n.01').examples()

['the dog barked all night']

In [20]:
wn.synset('dog.n.01').lemma_names()

['dog', 'domestic_dog', 'Canis_familiaris']

In [21]:
dir(wn.synset('dog.n.01'))
# isA: hyponyms, hypernyms
# part_of: member_holonyms, substance_holonyms, part_holonyms
# has_part: member_meronyms, substance_meronyms, part_meronyms
# domains: topic_domains, region_domains, usage_domains
# attribute: attributes
# entailments: entailments
# causes: causes
# also_sees: also_sees
# verb_groups: verb_groups
# similar_to: similar_tos

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_all_hypernyms',
 '_definition',
 '_examples',
 '_frame_ids',
 '_hypernyms',
 '_instance_hypernyms',
 '_iter_hypernym_lists',
 '_lemma_names',
 '_lemma_pointers',
 '_lemmas',
 '_lexname',
 '_max_depth',
 '_min_depth',
 '_name',
 '_needs_root',
 '_offset',
 '_pointers',
 '_pos',
 '_related',
 '_shortest_hypernym_paths',
 '_wordnet_corpus_reader',
 'acyclic_tree',
 'also_sees',
 'attributes',
 'causes',
 'closure',
 'common_hypernyms',
 'definition',
 'entailments',
 'examples',
 'frame_ids',
 'hypernym_distances',
 'hypernym_paths',
 'hypernyms',
 'hyponyms',
 'in_region_domains',
 'in_topic_domains',
 'in_u

See more relations at http://www.nltk.org/api/nltk.corpus.reader.html?highlight=wordnet

In [22]:
# hypernyms: abstraction
# hyponyms: instantiation

dog = wn.synset('dog.n.01')
print("hypernyms:", dog.hypernyms())
print("hyponyms:", dog.hyponyms())

hypernyms: [Synset('canine.n.02'), Synset('domestic_animal.n.01')]
hyponyms: [Synset('basenji.n.01'), Synset('corgi.n.01'), Synset('cur.n.01'), Synset('dalmatian.n.02'), Synset('great_pyrenees.n.01'), Synset('griffon.n.02'), Synset('hunting_dog.n.01'), Synset('lapdog.n.01'), Synset('leonberg.n.01'), Synset('mexican_hairless.n.01'), Synset('newfoundland.n.01'), Synset('pooch.n.01'), Synset('poodle.n.01'), Synset('pug.n.01'), Synset('puppy.n.01'), Synset('spitz.n.01'), Synset('toy_dog.n.01'), Synset('working_dog.n.01')]


In [23]:
print(dog.hypernyms()[0].hypernyms()) # the hypernym of canine: carnivore (animals that feed on flesh)
print(dog.hypernyms()[0].hypernyms()[0].hypernyms()) # the hypernym of carnivore: placental (placental mammals)
print(dog.hypernyms()[0].hypernyms()[0].hypernyms()[0].hypernyms()) # the hypernym of placental: mammal
# ...
print("root hypernyms for dog:", dog.root_hypernyms())

[Synset('carnivore.n.01')]
[Synset('placental.n.01')]
[Synset('mammal.n.01')]
root hypernyms for dog: [Synset('entity.n.01')]


`lowest_common_hypernyms()` locates the lowest hypernym that is shared by two given synsets.

In [24]:
# find common hypernyms
print("hypernyms for cat:", wn.synset('cat.n.01').hypernyms())
print("root hypernyms for cat:", wn.synset('cat.n.01').root_hypernyms())
print("the lowest common hypernyms of dog and cat")
print(wn.synset('dog.n.01').lowest_common_hypernyms(wn.synset('cat.n.01')))

hypernyms for cat: [Synset('feline.n.01')]
root hypernyms for cat: [Synset('entity.n.01')]
the lowest common hypernyms of dog and cat
[Synset('carnivore.n.01')]
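
The idea behind `lowest_common_hypernyms()` can be sketched without WordNet. The example below assumes a toy taxonomy in which every node has a single parent (`PARENT` and both helper functions are hypothetical). Real synsets may have several hypernyms, as `dog.n.01` does, so NLTK's implementation is more general.

```python
# Toy taxonomy (child -> parent), loosely following the chain printed above.
PARENT = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "placental",
    "placental": "mammal",
}


def ancestors(node):
    """Return [node, parent, grandparent, ...] up to the root."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def lowest_common_hypernym(a, b):
    """Return the first ancestor of `a` that is also an ancestor of `b`, or None."""
    ancestors_of_b = set(ancestors(b))
    for candidate in ancestors(a):
        if candidate in ancestors_of_b:
            return candidate
    return None


print(lowest_common_hypernym("dog", "cat"))  # carnivore
```

Walking upward from `a` guarantees that the first shared node found is the lowest one, since any deeper shared node would have been visited earlier.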


In [72]:
motorcar = wn.synset('car.n.01')
paths = motorcar.hypernym_paths()

In [73]:
paths

[[Synset('entity.n.01'),
  Synset('physical_entity.n.01'),
  Synset('object.n.01'),
  Synset('whole.n.02'),
  Synset('artifact.n.01'),
  Synset('instrumentality.n.03'),
  Synset('container.n.01'),
  Synset('wheeled_vehicle.n.01'),
  Synset('self-propelled_vehicle.n.01'),
  Synset('motor_vehicle.n.01'),
  Synset('car.n.01')],
 [Synset('entity.n.01'),
  Synset('physical_entity.n.01'),
  Synset('object.n.01'),
  Synset('whole.n.02'),
  Synset('artifact.n.01'),
  Synset('instrumentality.n.03'),
  Synset('conveyance.n.03'),
  Synset('vehicle.n.01'),
  Synset('wheeled_vehicle.n.01'),
  Synset('self-propelled_vehicle.n.01'),
  Synset('motor_vehicle.n.01'),
  Synset('car.n.01')]]
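
`car.n.01` has two hypernym paths because one of its ancestors, `wheeled_vehicle.n.01`, has two hypernyms. The sketch below reproduces that behaviour on a toy graph: `HYPERNYMS` is a hand-copied, shortened fragment of the output above with `instrumentality` treated as the root, and the function here is an illustration, not NLTK's code.

```python
# Toy DAG (node -> list of direct hypernyms), shortened from the paths printed above.
HYPERNYMS = {
    "car": ["motor_vehicle"],
    "motor_vehicle": ["self_propelled_vehicle"],
    "self_propelled_vehicle": ["wheeled_vehicle"],
    "wheeled_vehicle": ["container", "vehicle"],  # two hypernyms -> two paths
    "container": ["instrumentality"],
    "vehicle": ["conveyance"],
    "conveyance": ["instrumentality"],
    "instrumentality": [],  # treated as the root in this toy example
}


def hypernym_paths(node):
    """Return every path from a root down to `node`, root first."""
    parents = HYPERNYMS.get(node, [])
    if not parents:
        return [[node]]
    return [path + [node] for parent in parents for path in hypernym_paths(parent)]


for path in hypernym_paths("car"):
    print(len(path), " -> ".join(path))
```

As in the WordNet output, the two paths differ in length by one because the route through `conveyance` and `vehicle` has an extra node.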

### 3.2 Similarity

In [25]:
dog = wn.synset('dog.n.01')
corgi = wn.synset('corgi.n.01')
basenji = wn.synset('basenji.n.01')
cat = wn.synset('cat.n.01')

In [26]:
dog.path_similarity(cat) # dog -> canine -> carnivore <- feline <- cat: 4 edges, 1 / (1 + 4)

0.2

In [27]:
corgi.path_similarity(dog) # corgi -> dog: 1 edge, 1 / (1 + 1)

0.5

In [28]:
canine = wn.synset('canine.n.02')
canine.path_similarity(dog)

0.5

In [29]:
corgi.path_similarity(basenji) # corgi -> dog <- basenji: 2 edges, 1 / (1 + 2)

0.3333333333333333

In [30]:
hit = wn.synset('hit.v.01')
slap = wn.synset('slap.v.01')
jump = wn.synset('jump.v.01')
run = wn.synset('run.v.01')

In [31]:
hit.path_similarity(slap) # 1/7

0.14285714285714285

In [32]:
hit.path_similarity(jump) # 1/6

0.16666666666666666

Also check:
- wup_similarity: 2 * depth(lowest common hypernym) / (depth(syn1) + depth(syn2))
- lch_similarity: -log(p / 2d), where p is the shortest path length between the two synsets and d is the maximum depth of the taxonomy
- res_similarity: based on the information content of the lowest common hypernym (requires an information content corpus)
...
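
The first two formulas can be written as plain functions. This is a sketch of the formulas only, taking p as the shortest path length and d as the maximum taxonomy depth for `lch`. NLTK applies its own conventions for counting depth and path length, so these helpers (hypothetical names, made-up input numbers) may not reproduce its exact values.

```python
import math


def wup_formula(depth_common, depth_a, depth_b):
    """Wu-Palmer: 2 * depth(lowest common hypernym) / (depth(a) + depth(b))."""
    return 2 * depth_common / (depth_a + depth_b)


def lch_formula(path_length, max_depth):
    """Leacock-Chodorow: -log(p / (2 * d)); requires path_length > 0."""
    return -math.log(path_length / (2 * max_depth))


# Made-up numbers: two synsets at depth 5 whose lowest common hypernym is at depth 3.
print(wup_formula(depth_common=3, depth_a=5, depth_b=5))
# Made-up numbers: shortest path of 2 in a taxonomy of maximum depth 10.
print(lch_formula(path_length=2, max_depth=10))
```

Both scores grow as the two synsets get closer: `wup_formula` rewards a deep shared ancestor, while `lch_formula` rewards a short path relative to the size of the taxonomy.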

Find more on https://www.nltk.org/howto/wordnet.html

### 3.3 Traverse the synsets to build a graph

In [33]:
wn_graph_hypernyms = {}
# or you could use the networkx package

# only the first 10 noun synsets are used to keep the example small
for synset in list(wn.all_synsets('n'))[:10]:
    for hyp_syn in synset.hypernyms():
        # map each synset name to a dict of its direct hypernym names
        wn_graph_hypernyms.setdefault(synset.name(), {})[hyp_syn.name()] = True

In [34]:
wn_graph_hypernyms['physical_entity.n.01']['entity.n.01']

True
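
Once the graph is stored as a dict of dicts, standard traversals work on it directly. As a sketch, the function below collects all direct and indirect hypernyms of a node. `toy_graph` is a hand-written stand-in with the same shape as `wn_graph_hypernyms`, and `all_hypernyms` is a hypothetical helper.

```python
# Toy graph in the same shape as wn_graph_hypernyms: {synset name: {hypernym name: True}}.
toy_graph = {
    "physical_entity.n.01": {"entity.n.01": True},
    "object.n.01": {"physical_entity.n.01": True},
    "whole.n.02": {"object.n.01": True},
}


def all_hypernyms(graph, name):
    """Collect every node reachable from `name` by following hypernym edges."""
    seen = set()
    stack = [name]
    while stack:
        current = stack.pop()
        # Nodes without an entry (such as the root) simply have no outgoing edges.
        for parent in graph.get(current, {}):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(all_hypernyms(toy_graph, "whole.n.02")))
```

The `seen` set keeps the traversal correct when a synset can be reached through more than one hypernym path.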

# 4. Tips for the assignments

Some corpora available in NLTK.

Reference: https://www.nltk.org/book/ch02.html. Search that page for `gutenberg` and `brown` for detailed documentation.

### 4.1 gutenberg corpus

In [35]:
from nltk.corpus import gutenberg as gb
nltk.download("gutenberg")

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [36]:
file_id = 'austen-sense.txt'
word_list = gb.words(file_id)

In [37]:
print(word_list[:100])

['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', 'Austen', '1811', ']', 'CHAPTER', '1', 'The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.', 'Their', 'estate', 'was', 'large', ',', 'and', 'their', 'residence', 'was', 'at', 'Norland', 'Park', ',', 'in', 'the', 'centre', 'of', 'their', 'property', ',', 'where', ',', 'for', 'many', 'generations', ',', 'they', 'had', 'lived', 'in', 'so', 'respectable', 'a', 'manner', 'as', 'to', 'engage', 'the', 'general', 'good', 'opinion', 'of', 'their', 'surrounding', 'acquaintance', '.', 'The', 'late', 'owner', 'of', 'this', 'estate', 'was', 'a', 'single', 'man', ',', 'who', 'lived', 'to', 'a', 'very', 'advanced', 'age', ',', 'and', 'who', 'for', 'many', 'years', 'of', 'his', 'life', ',', 'had', 'a', 'constant', 'companion']


In [38]:
sents = gb.sents(file_id)

In [39]:
sents[0]

['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', 'Austen', '1811', ']']
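
Corpus readers return plain token sequences, so standard-library tools apply to them directly. As a sketch, assuming a task that needs word frequencies (the assignments themselves are not described here), `collections.Counter` can count tokens. The sample list is copied from the output above so that the block runs without NLTK, and `word_frequencies` is a hypothetical helper.

```python
from collections import Counter

# Tokens copied from the start of 'austen-sense.txt' as printed above;
# with the corpus installed, gb.words(file_id) could be passed instead.
sample_tokens = [
    'The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in',
    'Sussex', '.', 'Their', 'estate', 'was', 'large', ',', 'and', 'their',
    'residence', 'was', 'at', 'Norland', 'Park', ',',
]


def word_frequencies(tokens):
    """Count lowercased alphabetic tokens, ignoring punctuation and numbers."""
    return Counter(token.lower() for token in tokens if token.isalpha())


counts = word_frequencies(sample_tokens)
print(counts.most_common(3))
```

Lowercasing merges `Their` and `their`; whether that is desirable depends on the task.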

### 4.2 brown corpus

In [40]:
from nltk.corpus import brown
nltk.download("brown")
print(brown.categories())

romance_word_list = brown.words(categories='romance')

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']


In [41]:
romance_word_list

['They', 'neither', 'liked', 'nor', 'disliked', 'the', ...]

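### 4.3 reuters corpus and nltk.book texts

In [42]:
from nltk.corpus import reuters
# nltk.download("reuters")  # run once if the corpus is not installed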

In [43]:
cats = ['naphtha','gas','fuel','trade']
# keep the files tagged with at least one of the categories in cats
fileids = [i for i in reuters.fileids() if not set(cats).isdisjoint(reuters.categories(i))]

In [44]:
reuters.categories(fileids[1])

['corn', 'grain', 'rice', 'rubber', 'sugar', 'tin', 'trade']

In [45]:
len(reuters.fileids())

10788

In [46]:
from nltk.book import * # note: this rebinds text1 and text2 (strings defined earlier) to nltk.Text objects

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [47]:
text4.concordance('nation')

Displaying 25 of 316 matches:
 to the character of an independent nation seems to have been distinguished by
f Heaven can never be expected on a nation that disregards the eternal rules o
first , the representatives of this nation , then consisting of little more th
, situation , and relations of this nation and country than any which had ever
, prosperity , and happiness of the nation I have acquired an habitual attachm
an be no spectacle presented by any nation more pleasing , more noble , majest
party for its own ends , not of the nation for the national good . If that sol
tures and the people throughout the nation . On this subject it might become m
if a personal esteem for the French nation , formed in a residence of seven ye
f our fellow - citizens by whatever nation , and if success can not be obtaine
y , continue His blessing upon this nation and its Government and give it all 
powers so justly inspire . A rising nation , spread over a wide and fruitful l
ing now decided by the