# NLTK tutorial
(From https://www.nltk.org/)
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

This tutorial covers the following sections:

1. Tokenizer: a preprocessing step that turns a `str` into a list of tokens
2. Stemmer: a preprocessing step that turns a list of tokens into a cleaned (normalized) list of tokens
3. WordNet
4. Tips to the assignments

In [1]:
!pip install nltk
!pip install numpy



# 1. NLTK Tokenizer

In [14]:
import nltk
nltk.download() # opens the interactive NLTK downloader

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [2]:
import nltk
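nltk.download('punkt') # tokenizer models used by nltk.word_tokenize (recent NLTK releases may ask for 'punkt_tab' instead)
nltk.download('wordnet') # WordNet data used by WordNetLemmatizer and nltk.corpus.wordnet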

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/yanzheyuan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/yanzheyuan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [16]:
text1 = "Text mining is to identify useful information."
text2 = "Current NLP models isn't able to solve NLU perfectly."

print("string.split tokenizer", text1.split(" "))
print("string.split tokenizer", text2.split(" "))

string.split tokenizer ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information.']
string.split tokenizer ['Current', 'NLP', 'models', "isn't", 'able', 'to', 'solve', 'NLU', 'perfectly.']


`str.split` cannot deal with punctuation: the full stop stays attached to the last word, and the apostrophe in "isn't" is not split off.

In [17]:
import re # regular expressions (Python standard library)
print("regular expression tokenizer", re.split(r"[\s.]", text1))
print("regular expression tokenizer", re.split(r"[\s.]", text2))

regular expression tokenizer ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '']
regular expression tokenizer ['Current', 'NLP', 'models', "isn't", 'able', 'to', 'solve', 'NLU', 'perfectly', '']


- The `str.split` method cannot deal with punctuation.
- A simple regular expression handles most punctuation but still fails on contractions such as "isn't", "wasn't", and "can't".
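To make the limitation above concrete, here is a minimal sketch of a slightly richer regular-expression tokenizer that also splits off contractions and clitics. The pattern and the name `regex_tokenize` are illustrative assumptions, not part of NLTK, and the sketch does not reproduce everything `nltk.word_tokenize` does (for example, adjacent punctuation such as `).` stays together as one token).

```python
import re

# Alternatives are tried left to right at each position:
#   \w+?(?=n't)  the part of a word before an "n't" contraction (is|n't)
#   n't          the contraction itself
#   '\w+         clitics such as 's or 've
#   \w+          ordinary words
#   [^\w\s]+     runs of punctuation such as "." or "..."
TOKEN_PATTERN = re.compile(r"\w+?(?=n't)|n't|'\w+|\w+|[^\w\s]+")

def regex_tokenize(text):
    """Hypothetical helper: split text into word, clitic and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(regex_tokenize("Text mining is to identify useful information."))
# ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
print(regex_tokenize("Current NLP models isn't able to solve NLU perfectly."))
# ['Current', 'NLP', 'models', 'is', "n't", 'able', 'to', 'solve', 'NLU', 'perfectly', '.']
```

Hand-written patterns like this grow quickly as more cases appear, which is the main reason to prefer a maintained tokenizer.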

In [18]:
def tokenize(text):
    """
    :param text: a doc with multiple sentences, type: str
    return a word list, type: list
    e.g.
    Input: 'Text mining is to identify useful information.'
    Output: ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
    """
    return nltk.word_tokenize(text)

In [19]:
print(tokenize(text1))
print(tokenize(text2))

['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
['Current', 'NLP', 'models', 'is', "n't", 'able', 'to', 'solve', 'NLU', 'perfectly', '.']


In [20]:
# Other examples:
# 1. Possessive case and other apostrophes (Bob's, isn't, I've, ...)
tokens = tokenize("Bob's text mining skills are perfect.")
print(tokens)
# 2. Parentheses (parenthetical remarks)
tokens = tokenize("Bob's text mining skills (or, NLP) are perfect.")
print(tokens)
# 3. Ellipsis
tokens = tokenize("Bob's text mining skills are perfect...")
print(tokens)

['Bob', "'s", 'text', 'mining', 'skills', 'are', 'perfect', '.']
['Bob', "'s", 'text', 'mining', 'skills', '(', 'or', ',', 'NLP', ')', 'are', 'perfect', '.']
['Bob', "'s", 'text', 'mining', 'skills', 'are', 'perfect', '...']


# 2. Stemming and lemmatization

(https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

Stemming: chops off the ends of words to reach the root, and often includes the removal of derivational affixes.

E.g., wanted -> want, trees -> tree, playing -> play.

Lemmatization: uses a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. E.g., gone -> go.

Differences: stemming most commonly collapses derivationally related words, whereas lemmatization commonly collapses only the different inflectional forms of a lemma, so the result keeps the word's concrete meaning.

E.g., useful -> use (stemming), useful -> useful (lemmatization)

PorterStemmer:

A rule-based method. E.g., SSES -> SS (misses -> miss), IES -> I (flies -> fli), and a trailing S is dropped (nouns -> noun).
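The suffix rules above resemble the first step of the Porter algorithm (often called step 1a). The sketch below applies only those rules in plain Python. The function name `porter_step_1a` is an assumption made for illustration; the real algorithm has several further steps, and NLTK's `PorterStemmer` adds its own extensions, so results can differ for some words.

```python
def porter_step_1a(word):
    """Toy version of a single Porter step: plural-like suffix rules only."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS, e.g. misses -> miss
    if word.endswith("ies"):
        return word[:-2]   # IES -> I, e.g. flies -> fli
    if word.endswith("ss"):
        return word        # SS -> SS, e.g. caress stays caress
    if word.endswith("s"):
        return word[:-1]   # S -> (nothing), e.g. nouns -> noun
    return word

for w in ["misses", "flies", "caress", "nouns"]:
    print(w, "->", porter_step_1a(w))
```

The order of the checks matters: the longest matching suffix has to be tested first, otherwise "misses" would only lose its final "s".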

Doc: https://www.nltk.org/api/nltk.stem.html

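- stemming: strips affixes from a word to return its stem, e.g., playing -> play
- lemmatization: restores a word to its dictionary form using vocabulary knowledge, e.g., drove -> drive
- PorterStemmer: a rule-based stemmer whose rules specify how affixes are removed; available from `nltk.stem`
- WordNetLemmatizer: a WordNet-based lemmatizer; available from `nltk.stem`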

In [21]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

def stem(tokens):
    """
    :param tokens: a list of tokens, type: list
    return a list of stemmed words, type: list
    e.g.
    Input: ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
    Output: ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']
    """
    ### equivalent code
    # results = list()
    # for token in tokens:
    #     results.append(ps.stem(token))
    # return results

    return [ps.stem(token) for token in tokens]

In [22]:
tokens = stem(tokenize("Text mining is to identify useful information."))
print(tokens)

['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']


In [23]:
from nltk.stem import WordNetLemmatizer
lm = WordNetLemmatizer()
def lemmatize(tokens):
    # lemmatize() defaults to pos='n', so every token is treated as a noun
    return [lm.lemmatize(token) for token in tokens]

In [24]:
tokens = lemmatize(tokenize("Text mining is to identify useful information."))
print(tokens)

['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']


### Self practice
- Lemmatization and stemming can be stacked, but in industry the preprocessing pipeline usually seems to be:
  - tokenize
  - lemmatize/stem
  - stop_words
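The three-step pipeline above can be sketched in plain Python. Everything here is a toy stand-in chosen for illustration: the stop-word set is hand-picked and `simple_stem` only strips a plural "s". In practice the helpers would be replaced by `nltk.word_tokenize`, `PorterStemmer` or `WordNetLemmatizer`, and a stop-word list such as `nltk.corpus.stopwords`.

```python
import re

# Hypothetical, hand-picked stop-word list for illustration only.
STOP_WORDS = {"is", "to", "a", "the", "as", "if", "you"}

def simple_tokenize(text):
    # Lowercased alphabetic tokens only; punctuation is dropped.
    return re.findall(r"[a-z]+", text.lower())

def simple_stem(token):
    # Toy normalizer: strips a plural "s" (this is not the Porter algorithm).
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(text):
    tokens = simple_tokenize(text)                     # 1. tokenize
    stems = [simple_stem(t) for t in tokens]           # 2. lemmatize/stem
    return [t for t in stems if t not in STOP_WORDS]   # 3. remove stop words

print(preprocess("Text minings is to identify useful informations."))
# ['text', 'mining', 'identify', 'useful', 'information']
```

Because stop words are removed after stemming in this ordering, the stop-word list has to contain the stemmed forms of the words it is meant to catch.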

In [16]:
# test the combination of lemmatization and stemming
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = nltk.word_tokenize('Text minings is to identify useful informations. lighted up as if you had a chooses, wolves')
print(tokens)

ps_test = PorterStemmer()
lm_test = WordNetLemmatizer()

def stem(ps_test, tokens):
    return [ps_test.stem(token) for token in tokens]

def lemmatize(lm_test, tokens):
    return [lm_test.lemmatize(token) for token in tokens]

print(stem(ps_test, tokens))                      # stem only
print(lemmatize(lm_test, tokens))                 # lemmatize only
print(stem(ps_test, lemmatize(lm_test, tokens)))  # lemmatize, then stem
print(lemmatize(lm_test, stem(ps_test, tokens)))  # stem, then lemmatize

['Text', 'minings', 'is', 'to', 'identify', 'useful', 'informations', '.', 'lighted', 'up', 'as', 'if', 'you', 'had', 'a', 'chooses', ',', 'wolves']
['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.', 'light', 'up', 'as', 'if', 'you', 'had', 'a', 'choos', ',', 'wolv']
['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.', 'lighted', 'up', 'a', 'if', 'you', 'had', 'a', 'chooses', ',', 'wolf']
['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.', 'light', 'up', 'a', 'if', 'you', 'had', 'a', 'choos', ',', 'wolf']
['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.', 'light', 'up', 'a', 'if', 'you', 'had', 'a', 'choos', ',', 'wolv']


# 3. WordNet

https://www.nltk.org/howto/wordnet.html

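- A semantically oriented dictionary of English
- Similar to a traditional thesaurus but with a richer structure: it groups synonyms and records how word senses differ

- Applications
  - Listing synonyms
  - Methods of the synsets returned by `wn.synsets`
    - `definition`, `examples`, `lemma_names`, etc.
    - Listing the hypernyms/hyponyms of a word
      - The root hypernym of a word
      - The hypernyms/hyponyms shared by two words
    - Similarity between two synsets
      - path_similarity: based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy.
      - lch_similarity: Leacock-Chodorow Similarity, based on the same shortest path, scaled by the maximum depth of the taxonomy.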
      - wup_similarity: Wu-Palmer Similarity, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer.
      - res_similarity: Resnik Similarity, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).
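For reference, the four measures are commonly written as below. The symbols are assumptions introduced here: $d(s_1, s_2)$ is the shortest-path length in edges between two senses, $D$ is the maximum depth of the taxonomy, $\mathrm{LCS}$ is the Least Common Subsumer, and $P(c)$ is the corpus probability of concept $c$. This is a hedged summary; the exact depth and root conventions in NLTK may differ in detail.

```latex
\begin{aligned}
\mathrm{sim}_{\mathrm{path}}(s_1, s_2) &= \frac{1}{d(s_1, s_2) + 1} \\
\mathrm{sim}_{\mathrm{lch}}(s_1, s_2) &= -\log \frac{d(s_1, s_2) + 1}{2D} \\
\mathrm{sim}_{\mathrm{wup}}(s_1, s_2) &= \frac{2 \, \mathrm{depth}(\mathrm{LCS}(s_1, s_2))}{\mathrm{depth}(s_1) + \mathrm{depth}(s_2)} \\
\mathrm{sim}_{\mathrm{res}}(s_1, s_2) &= \mathrm{IC}(\mathrm{LCS}(s_1, s_2)) = -\log P(\mathrm{LCS}(s_1, s_2))
\end{aligned}
```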

In [15]:
from nltk.corpus import wordnet as wn

### 3.1 synsets

A set of one or more **synonyms** that are interchangeable in some context without changing the truth value of the proposition in which they are embedded.

- Synonyms: words that can replace each other in some context without changing the original meaning


In [17]:
# Look up a word using synsets()
wn.synsets('dog')


[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01'),
 Synset('chase.v.01')]

In [14]:
wn.synsets('bank')

[Synset('bank.n.01'),
 Synset('depository_financial_institution.n.01'),
 Synset('bank.n.03'),
 Synset('bank.n.04'),
 Synset('bank.n.05'),
 Synset('bank.n.06'),
 Synset('bank.n.07'),
 Synset('savings_bank.n.02'),
 Synset('bank.n.09'),
 Synset('bank.n.10'),
 Synset('bank.v.01'),
 Synset('bank.v.02'),
 Synset('bank.v.03'),
 Synset('bank.v.04'),
 Synset('bank.v.05'),
 Synset('deposit.v.02'),
 Synset('bank.v.07'),
 Synset('trust.v.01')]

In [15]:
print("synset","\t","definition")
for synset in wn.synsets('bank'):
    print(synset, '\t', synset.definition())

synset 	 definition
Synset('bank.n.01') 	 sloping land (especially the slope beside a body of water)
Synset('depository_financial_institution.n.01') 	 a financial institution that accepts deposits and channels the money into lending activities
Synset('bank.n.03') 	 a long ridge or pile
Synset('bank.n.04') 	 an arrangement of similar objects in a row or in tiers
Synset('bank.n.05') 	 a supply or stock held in reserve for future use (especially in emergencies)
Synset('bank.n.06') 	 the funds held by a gambling house or the dealer in some gambling games
Synset('bank.n.07') 	 a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
Synset('savings_bank.n.02') 	 a container (usually with a slot in the top) for keeping money at home
Synset('bank.n.09') 	 a building in which the business of banking transacted
Synset('bank.n.10') 	 a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turni

In [16]:
# this function has an optional pos argument which lets you constrain the part of speech of the word:
# pos: part-of-speech
wn.synsets('bank', pos=wn.NOUN)

[Synset('bank.n.01'),
 Synset('depository_financial_institution.n.01'),
 Synset('bank.n.03'),
 Synset('bank.n.04'),
 Synset('bank.n.05'),
 Synset('bank.n.06'),
 Synset('bank.n.07'),
 Synset('savings_bank.n.02'),
 Synset('bank.n.09'),
 Synset('bank.n.10')]

In [17]:
wn.synset('dog.n.01')

Synset('dog.n.01')

In [18]:
print(wn.synset('dog.n.01').definition())

a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds


In [19]:
wn.synset('dog.n.01').examples()

['the dog barked all night']

In [20]:
wn.synset('dog.n.01').lemma_names()

['dog', 'domestic_dog', 'Canis_familiaris']

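- Without arguments, `dir()` returns the names (variables, methods, and defined types) in the current scope; with an argument, it returns the attributes and methods of that object.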
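Since `dir()` also lists private and dunder names, it is often handy to filter them out. The sketch below shows the idea on a hypothetical stand-in class (`Demo` is not an NLTK type); the same list comprehension can be applied to a synset object.

```python
class Demo:
    """Hypothetical stand-in for a Synset-like object."""

    def __init__(self):
        self._name = "demo"  # private attribute, filtered out below

    def definition(self):
        return "a toy object"

    def examples(self):
        return []

# Keep only the public names (those not starting with an underscore).
public_names = [name for name in dir(Demo()) if not name.startswith("_")]
print(public_names)  # ['definition', 'examples']
```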

In [21]:
dir(wn.synset('dog.n.01'))
# is-a: hyponyms, hypernyms (more specific terms, more general terms)
# part of: member_holonyms, substance_holonyms, part_holonyms
# has part: member_meronyms, substance_meronyms, part_meronyms
# domains: topic_domains, region_domains, usage_domains
# attribute: attributes
# entailments: entailments
# causes: causes
# also_sees: also_sees
# verb_groups: verb_groups
# similar_to: similar_tos

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_all_hypernyms',
 '_definition',
 '_examples',
 '_frame_ids',
 '_hypernyms',
 '_instance_hypernyms',
 '_iter_hypernym_lists',
 '_lemma_names',
 '_lemma_pointers',
 '_lemmas',
 '_lexname',
 '_max_depth',
 '_min_depth',
 '_name',
 '_needs_root',
 '_offset',
 '_pointers',
 '_pos',
 '_related',
 '_shortest_hypernym_paths',
 '_wordnet_corpus_reader',
 'also_sees',
 'attributes',
 'causes',
 'closure',
 'common_hypernyms',
 'definition',
 'entailments',
 'examples',
 'frame_ids',
 'hypernym_distances',
 'hypernym_paths',
 'hypernyms',
 'hyponyms',
 'in_region_domains',
 'in_topic_domains',
 'in_usage_domains',
 '

Check more relations in http://www.nltk.org/api/nltk.corpus.reader.html?highlight=wordnet

In [22]:
# hypernyms: more abstract (general) concepts
# hyponyms: more concrete (specific) concepts

dog = wn.synset('dog.n.01')
print("hypernyms:", dog.hypernyms())
print("hyponyms:", dog.hyponyms())

hypernyms: [Synset('canine.n.02'), Synset('domestic_animal.n.01')]
hyponyms: [Synset('basenji.n.01'), Synset('corgi.n.01'), Synset('cur.n.01'), Synset('dalmatian.n.02'), Synset('great_pyrenees.n.01'), Synset('griffon.n.02'), Synset('hunting_dog.n.01'), Synset('lapdog.n.01'), Synset('leonberg.n.01'), Synset('mexican_hairless.n.01'), Synset('newfoundland.n.01'), Synset('pooch.n.01'), Synset('poodle.n.01'), Synset('pug.n.01'), Synset('puppy.n.01'), Synset('spitz.n.01'), Synset('toy_dog.n.01'), Synset('working_dog.n.01')]


In [23]:
print(dog.hypernyms()[0].hypernyms()) # the hypernym of canine
# animals that feed on flesh
print(dog.hypernyms()[0].hypernyms()[0].hypernyms()) # the hypernym of carnivore
# placental mammals
print(dog.hypernyms()[0].hypernyms()[0].hypernyms()[0].hypernyms()) # the hypernym of placental
# mammals
# ...
print("root hypernyms for dog:", dog.root_hypernyms())

[Synset('carnivore.n.01')]
[Synset('placental.n.01')]
[Synset('mammal.n.01')]
root hypernyms for dog: [Synset('entity.n.01')]


In [24]:
# find common hypernyms
print("hypernyms for cat:", wn.synset('cat.n.01').hypernyms())
print("root hypernyms for cat:", wn.synset('cat.n.01').root_hypernyms())
print("the lowest common hypernyms of dog and cat")
print(wn.synset('dog.n.01').lowest_common_hypernyms(wn.synset('cat.n.01')))

hypernyms for cat: [Synset('feline.n.01')]
root hypernyms for cat: [Synset('entity.n.01')]
the lowest common hypernyms of dog and cat
[Synset('carnivore.n.01')]


### 3.2 Similarity

In [25]:
dog = wn.synset('dog.n.01')
corgi = wn.synset('corgi.n.01')
basenji = wn.synset('basenji.n.01')
cat = wn.synset('cat.n.01')

In [26]:
dog.path_similarity(cat) # dog <- canine <- carnivore -> feline -> cat; 1/5

0.2

In [27]:
dog.path_similarity(corgi) # corgi <- dog; 1/2

0.5

In [28]:
corgi.path_similarity(basenji) # basenji <- dog -> corgi; 1/3

0.3333333333333333
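The three scores above are consistent with the formula 1 / (shortest path length + 1). The sketch below recomputes them on a hand-written toy graph whose edges are copied from the paths in the comments. It is an illustration under simplifying assumptions, not NLTK's implementation: the real `dog.n.01` has a second hypernym, and the names `IS_A` and `toy_path_similarity` are hypothetical.

```python
from collections import deque

# Toy is-a edges (child -> parents), taken from the paths in the comments above.
IS_A = {
    "corgi": ["dog"],
    "basenji": ["dog"],
    "dog": ["canine"],
    "canine": ["carnivore"],
    "cat": ["feline"],
    "feline": ["carnivore"],
}

def neighbors(node):
    # Treat is-a edges as undirected: a node's parents plus its children.
    parents = IS_A.get(node, [])
    children = [child for child, ps in IS_A.items() if node in ps]
    return parents + children

def shortest_path_length(start, goal):
    # Breadth-first search; returns the number of edges, or None if unreachable.
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def toy_path_similarity(a, b):
    d = shortest_path_length(a, b)
    return None if d is None else 1 / (d + 1)

print(toy_path_similarity("dog", "cat"))        # 0.2
print(toy_path_similarity("dog", "corgi"))      # 0.5
print(toy_path_similarity("corgi", "basenji"))  # 0.3333333333333333
```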

In [29]:
hit = wn.synset('hit.v.01')
slap = wn.synset('slap.v.01')
jump = wn.synset('jump.v.01')
run = wn.synset('run.v.01')

In [30]:
hit.path_similarity(slap) # 1/7

0.14285714285714285

In [31]:
hit.path_similarity(jump) # 1/6

0.16666666666666666

Also check:
- wup_similarity
- lch_similarity
- res_similarity
...

Find more on https://www.nltk.org/howto/wordnet.html

### 3.3 Traverse the synsets to build a graph

In [25]:
wn_graph_hypernyms = {}
# or you could use networkx package

for synset in list(wn.all_synsets('n'))[:10]:
    for hyp_syn in synset.hypernyms():
        # record the edge synset -> hypernym in a nested dict
        wn_graph_hypernyms.setdefault(synset.name(), {})[hyp_syn.name()] = True

In [26]:
wn_graph_hypernyms['physical_entity.n.01']['entity.n.01']

True
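Once the graph is stored as a nested dict, it can be traversed without touching WordNet again. The sketch below walks from a synset name up to the root. The small `toy_graph` is hand-written in the same `{child: {parent: True}}` shape as `wn_graph_hypernyms`, and `hypernym_chain` is a hypothetical helper that assumes the graph is acyclic and follows only the first recorded hypernym.

```python
# Hand-written toy data in the same shape as wn_graph_hypernyms.
toy_graph = {
    "physical_entity.n.01": {"entity.n.01": True},
    "object.n.01": {"physical_entity.n.01": True},
}

def hypernym_chain(graph, name):
    """Follow the first recorded hypernym repeatedly until a root is reached."""
    chain = [name]
    while graph.get(name):
        name = next(iter(graph[name]))  # first hypernym in insertion order
        chain.append(name)
    return chain

print(hypernym_chain(toy_graph, "object.n.01"))
# ['object.n.01', 'physical_entity.n.01', 'entity.n.01']
```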

# 4. Tips to the assignments

Some corpora available in NLTK.

Reference: https://www.nltk.org/book/ch02.html. Search that page for `gutenberg` and `brown` for detailed documentation.

### 4.1 gutenberg corpus

In [27]:
from nltk.corpus import gutenberg as gb
nltk.download("gutenberg")

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/yanzheyuan/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [28]:
file_id = 'austen-sense.txt'
word_list = gb.words(file_id)

In [29]:
print(word_list[:100])

['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', 'Austen', '1811', ']', 'CHAPTER', '1', 'The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.', 'Their', 'estate', 'was', 'large', ',', 'and', 'their', 'residence', 'was', 'at', 'Norland', 'Park', ',', 'in', 'the', 'centre', 'of', 'their', 'property', ',', 'where', ',', 'for', 'many', 'generations', ',', 'they', 'had', 'lived', 'in', 'so', 'respectable', 'a', 'manner', 'as', 'to', 'engage', 'the', 'general', 'good', 'opinion', 'of', 'their', 'surrounding', 'acquaintance', '.', 'The', 'late', 'owner', 'of', 'this', 'estate', 'was', 'a', 'single', 'man', ',', 'who', 'lived', 'to', 'a', 'very', 'advanced', 'age', ',', 'and', 'who', 'for', 'many', 'years', 'of', 'his', 'life', ',', 'had', 'a', 'constant', 'companion']


In [30]:
sents = gb.sents(file_id)

In [31]:
sents[0]

['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', 'Austen', '1811', ']']

### 4.2 brown corpus

In [32]:
from nltk.corpus import brown
nltk.download("brown")
print(brown.categories())

romance_word_list = brown.words(categories='romance')

[nltk_data] Downloading package brown to
[nltk_data]     /Users/yanzheyuan/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']


In [33]:
romance_word_list

['They', 'neither', 'liked', 'nor', 'disliked', 'the', ...]