# NLTK Basics - Stop words, CMU WordList, WordNet, Tokenisation with POS tagging

** *Stop words* are words that are filtered out before natural language data is processed. They are generally the most common words in a language. **

** NLTK ships ready-made stop word lists in its *stopwords* corpus, which can be imported from nltk.corpus. Calling *stopwords.words('english')* returns the English list. **

In [2]:
from nltk.corpus import stopwords
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each
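Once a stop word list is available, filtering a text is a simple membership test. The snippet below is a minimal sketch of that step: `SAMPLE_STOPWORDS` and `remove_stopwords` are hypothetical names, and the set holds only a handful of words copied from the list above so that the example runs without the NLTK corpus. In practice one would presumably pass `set(stopwords.words('english'))` instead.

```python
# A small sample of the English stop words listed above (assumption: used here
# only so the example runs without downloading the NLTK stopwords corpus).
SAMPLE_STOPWORDS = {"i", "me", "my", "the", "a", "an", "is", "are",
                    "of", "in", "and", "to"}


def remove_stopwords(tokens, stop_words=SAMPLE_STOPWORDS):
    """Return the tokens whose lower-cased form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "The quick brown fox is in the garden".split()
filtered = remove_stopwords(tokens)
print(filtered)
```

Using a set rather than a list keeps each membership test cheap, which matters when the same list is checked against every token of a large corpus.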

** Stop word lists for languages other than English (such as German) can be obtained by passing the name of the target language as the argument. **

In [3]:
stopwords.words('german')

['aber',
 'alle',
 'allem',
 'allen',
 'aller',
 'alles',
 'als',
 'also',
 'am',
 'an',
 'ander',
 'andere',
 'anderem',
 'anderen',
 'anderer',
 'anderes',
 'anderm',
 'andern',
 'anderr',
 'anders',
 'auch',
 'auf',
 'aus',
 'bei',
 'bin',
 'bis',
 'bist',
 'da',
 'damit',
 'dann',
 'der',
 'den',
 'des',
 'dem',
 'die',
 'das',
 'dass',
 'daß',
 'derselbe',
 'derselben',
 'denselben',
 'desselben',
 'demselben',
 'dieselbe',
 'dieselben',
 'dasselbe',
 'dazu',
 'dein',
 'deine',
 'deinem',
 'deinen',
 'deiner',
 'deines',
 'denn',
 'derer',
 'dessen',
 'dich',
 'dir',
 'du',
 'dies',
 'diese',
 'diesem',
 'diesen',
 'dieser',
 'dieses',
 'doch',
 'dort',
 'durch',
 'ein',
 'eine',
 'einem',
 'einen',
 'einer',
 'eines',
 'einig',
 'einige',
 'einigem',
 'einigen',
 'einiger',
 'einiges',
 'einmal',
 'er',
 'ihn',
 'ihm',
 'es',
 'etwas',
 'euer',
 'eure',
 'eurem',
 'euren',
 'eurer',
 'eures',
 'für',
 'gegen',
 'gewesen',
 'hab',
 'habe',
 'haben',
 'hat',
 'hatte',
 'hatten',
 '

In [4]:
import nltk

### Analysis of CMU Wordlist

** The *Carnegie Mellon University Pronouncing Dictionary* is an open-source machine-readable pronunciation dictionary for North American English that contains over 134,000 words and their pronunciations.**

** We can load the entries of the CMU WordList with *cmudict.entries()* and count them with *len()*. **

In [5]:
entries=nltk.corpus.cmudict.entries()

In [6]:
len(entries)

133737

** Each entry is a (word, pronunciation) pair, where the pronunciation is a list of phoneme symbols. We can print the entries within a particular range of indices by slicing the list as follows. **

In [8]:
for entry in entries[300:400]:
    print(entry)

('abreu', ['AH0', 'B', 'R', 'UW1'])
('abridge', ['AH0', 'B', 'R', 'IH1', 'JH'])
('abridged', ['AH0', 'B', 'R', 'IH1', 'JH', 'D'])
('abridgement', ['AH0', 'B', 'R', 'IH1', 'JH', 'M', 'AH0', 'N', 'T'])
('abridges', ['AH0', 'B', 'R', 'IH1', 'JH', 'AH0', 'Z'])
('abridging', ['AH0', 'B', 'R', 'IH1', 'JH', 'IH0', 'NG'])
('abril', ['AH0', 'B', 'R', 'IH1', 'L'])
('abroad', ['AH0', 'B', 'R', 'AO1', 'D'])
('abrogate', ['AE1', 'B', 'R', 'AH0', 'G', 'EY2', 'T'])
('abrogated', ['AE1', 'B', 'R', 'AH0', 'G', 'EY2', 'T', 'IH0', 'D'])
('abrogating', ['AE1', 'B', 'R', 'AH0', 'G', 'EY2', 'T', 'IH0', 'NG'])
('abrogation', ['AE2', 'B', 'R', 'AH0', 'G', 'EY1', 'SH', 'AH0', 'N'])
('abrol', ['AH0', 'B', 'R', 'OW1', 'L'])
('abron', ['AH0', 'B', 'R', 'AA1', 'N'])
('abrupt', ['AH0', 'B', 'R', 'AH1', 'P', 'T'])
('abruptly', ['AH0', 'B', 'R', 'AH1', 'P', 'T', 'L', 'IY0'])
('abruptness', ['AH0', 'B', 'R', 'AH1', 'P', 'T', 'N', 'AH0', 'S'])
('abrutyn', ['EY1', 'B', 'R', 'UW0', 'T', 'IH0', 'N'])
('abruzzese', ['AA0',
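Because each entry pairs a word with its list of phonemes, simple properties can be read straight off the pronunciation. In the CMU dictionary, vowel phonemes carry a trailing stress digit (0, 1 or 2), so counting those gives the number of syllables. The sketch below uses a few entries copied from the output above. `sample_entries` and `count_syllables` are illustrative names, not part of NLTK.

```python
# Entries copied from the output above; nltk.corpus.cmudict.entries()
# yields (word, phonemes) pairs of the same shape.
sample_entries = [
    ('abridge', ['AH0', 'B', 'R', 'IH1', 'JH']),
    ('abroad', ['AH0', 'B', 'R', 'AO1', 'D']),
    ('abrogate', ['AE1', 'B', 'R', 'AH0', 'G', 'EY2', 'T']),
    ('abruptly', ['AH0', 'B', 'R', 'AH1', 'P', 'T', 'L', 'IY0']),
]


def count_syllables(phones):
    """Count vowel phonemes, i.e. the symbols that end in a stress digit."""
    return sum(1 for p in phones if p[-1].isdigit())


syllables = {word: count_syllables(phones) for word, phones in sample_entries}
print(syllables)
```

A word can appear in the dictionary more than once when it has several pronunciations, so a plain dictionary like `syllables` keeps only the last one seen. Mapping each word to a list of pronunciations avoids that loss.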

### WordNet

** *WordNet* is a lexical database of semantic relations between words in more than 200 languages. WordNet links words into semantic relations including synonyms, hyponyms, and meronyms. **

In [9]:
from nltk.corpus import wordnet as wn

** Synonyms are grouped into *synsets*, each with a short definition and usage examples.
We get the synsets that a word belongs to using the *wn.synsets()* functionality. It has an optional pos argument which lets you constrain the part of speech of the word. **

In [19]:
wn.synsets('car')

[Synset('car.n.01'),
 Synset('car.n.02'),
 Synset('car.n.03'),
 Synset('car.n.04'),
 Synset('cable_car.n.01')]

** A synset is identified by a 3-part name of the form *word.pos.nn*. Given such a name, *wn.synset()* returns the synset, and its *lemma_names()* method lists the words (lemmas) that belong to it. **

In [21]:
wn.synset('car.n.04').lemma_names()

['car', 'elevator_car']
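The 3-part name can be taken apart with ordinary string handling. The following is a small sketch, with `parse_synset_name` and `POS_LABELS` being hypothetical helpers rather than NLTK functions. It splits from the right so that any extra periods end up in the word part.

```python
# WordNet part-of-speech codes as they appear in synset names.
POS_LABELS = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "s": "adjective satellite",
    "r": "adverb",
}


def parse_synset_name(name):
    """Split 'word.pos.nn' into (word, pos, sense_number)."""
    word, pos, sense = name.rsplit(".", 2)
    return word, pos, int(sense)


names = ["car.n.01", "car.n.04", "cable_car.n.01"]
parsed = [parse_synset_name(n) for n in names]
for word, pos, sense in parsed:
    print(word, POS_LABELS[pos], sense)
```

The sense number distinguishes the different meanings of the same word and part of speech, which is why 'car' appears with several numbers in the output above.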

### NLTK Pipeline

In [22]:
import nltk

In [23]:
texts="""Benedict Timothy Carlton Cumberbatch CBE (born 19 July 1976) is an English actor. A graduate of the Victoria University of Manchester, he continued his training at the London Academy of Music and Dramatic Art, obtaining a Master of Arts in Classical Acting. He first performed at the Open Air Theatre, Regent's Park in Shakespearean productions and made his West End debut in Richard Eyre's revival of Hedda Gabler in 2005. Since then, he has starred in the Royal National Theatre productions After the Dance (2010) and Frankenstein (2011). In 2015, he played William Shakespeare's Hamlet at the Barbican Theatre."""

** Tokenisation proceeds in two steps, *sentence tokenisation* followed by *word tokenisation*, i.e. the text is first split into sentences at the detected sentence boundaries, and each sentence is then split into words and punctuation tokens. **

** Next, each extracted word is assigned a *part-of-speech* tag. **

In [24]:
sentences=nltk.sent_tokenize(texts) #Tokenising the whole text into sentences
for sentence in sentences:
    words=nltk.word_tokenize(sentence) #Tokenising the words in the given sentence
    tagged_words=nltk.pos_tag(words) #Assigning POS
    print(tagged_words)

** Note that *texts* is a single string, so it is passed to *sent_tokenize()* directly. Wrapping the code in *for text in texts:* would iterate over the string one character at a time and tag every character in isolation, giving output such as the following. **

[('B', 'NN')]
[('e', 'NN')]
[('n', 'NN')]
[('e', 'NN')]
[('d', 'NN')]
[('i', 'NN')]
[('c', 'NNS')]
[('t', 'NN')]
...
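The two-step structure of the pipeline (sentences first, then words within each sentence) can be sketched without NLTK. The helpers below, `naive_sent_tokenize` and `naive_word_tokenize`, are hypothetical regex-based stand-ins and are far cruder than `nltk.sent_tokenize` and `nltk.word_tokenize`. For instance, they would wrongly split after an abbreviation such as "Dr.". They are meant only to show the shape of the result, a list of token lists.

```python
import re


def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def naive_word_tokenize(sentence):
    """Separate runs of word characters from single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "He is an English actor. He made his West End debut in 2005."
# The whole string goes to the sentence splitter, then each sentence
# goes to the word splitter.
nested = [naive_word_tokenize(s) for s in naive_sent_tokenize(text)]
for tokens in nested:
    print(tokens)
```

Each inner list is what would then be handed to a part-of-speech tagger, one sentence at a time, so that the tagger sees every word in its context.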

### Implementing Tokenisation

** Twitter aware tokenizer. **

In [25]:
import nltk

** The *TweetTokenizer* class imported from *nltk.tokenize* is designed for tokenising informal text such as tweets. **

In [26]:
from nltk.tokenize import TweetTokenizer

In [41]:
text='THANK YOU TOLEDO, OHIO :('

In [42]:
twtkn=TweetTokenizer()

** It keeps keyboard emoticons that are part of the text together as single tokens instead of splitting them into separate punctuation marks. **

In [43]:
twtkn.tokenize(text)

['THANK', 'YOU', 'TOLEDO', ',', 'OHIO', ':(']
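To see why emoticon handling needs special care, compare a plain punctuation-splitting pattern with one that tries an emoticon pattern first. This is a much-simplified sketch of the general idea, not the way *TweetTokenizer* is implemented. The patterns `PLAIN` and `EMOTICON_AWARE` are assumptions that cover only a few common emoticons.

```python
import re

# Splits every punctuation mark into its own token.
PLAIN = re.compile(r"\w+|[^\w\s]")

# Tries a small emoticon pattern (eyes, optional nose, mouth) before
# falling back to words and single punctuation marks.
EMOTICON_AWARE = re.compile(r"[:;=][\-']?[()DPp/\\|]|\w+|[^\w\s]")

text = 'THANK YOU TOLEDO, OHIO :('
plain_tokens = PLAIN.findall(text)
tokens = EMOTICON_AWARE.findall(text)
print(plain_tokens)
print(tokens)
```

The order of the alternatives matters: because the emoticon pattern comes first, ':(' is consumed as one token before the single-punctuation alternative gets a chance to split it.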

#### Further Analysis using Brown corpus

** We extract the words of the Brown corpus that belong to the specified category (here, *news*). **

In [33]:
from nltk.corpus import brown

In [36]:
news_text=brown.words(categories='news')

In [37]:
print(news_text)

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]


** To normalise the words so that differently capitalised forms are counted together, we convert them all to lower case before building the frequency distribution. **

In [38]:
f=nltk.FreqDist(w.lower() for w in news_text) 

** Next we look up the number of occurrences of each modal verb in the frequency distribution. **

In [44]:
modals=['can','could','may','might','will','must']

In [45]:
for a in modals:
    print(a+':',f[a],end=' ')

can: 94 could: 87 may: 93 might: 38 will: 389 must: 53 

In [46]:
print(f)

<FreqDist with 13112 samples and 100554 outcomes>
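In the summary above, *samples* is the number of distinct (lower-cased) words and *outcomes* is the total number of tokens counted. The same bookkeeping can be reproduced with `collections.Counter`, which `nltk.FreqDist` builds on. The sketch below uses a short made-up word list, an assumption chosen so that the example runs without the Brown corpus.

```python
from collections import Counter

# A tiny made-up token list standing in for brown.words(categories='news').
words = ["The", "jury", "may", "decide", "that", "the", "court",
         "will", "act", ",", "and", "it", "Will", "."]

# Lower-casing first means 'The'/'the' and 'Will'/'will' are counted together.
freq = Counter(w.lower() for w in words)

modals = ['can', 'could', 'may', 'might', 'will', 'must']
modal_counts = {m: freq[m] for m in modals}  # missing keys count as 0

samples = len(freq)            # number of distinct words
outcomes = sum(freq.values())  # total number of tokens
print(modal_counts)
print(samples, outcomes)
```

Looking up a word that never occurs returns 0 rather than raising an error, which is why the loop over the modal list needs no special handling for absent words.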
