# NLTK Basics Part 2

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import nltk
%matplotlib inline 
plt.style.use('ggplot')

## 2.1 Using WordNet in Text Engineering

### Senses and Synonyms

WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. <BR>NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets. <BR>We'll begin by looking at synonyms and how they are accessed in WordNet.
Consider the following two sentences:
1. Benz is credited with the invention of the motorcar.
2. Benz is credited with the invention of the automobile.
<BR>


From sentence 1 to 2, we change only one word (motorcar --> automobile), and the meaning of the sentence stays the same. <BR>
Since everything else in the sentence has remained unchanged, we can conclude that the words *motorcar* and *automobile* have the same meaning, i.e. they are <b>synonyms</b>. We can explore these words with the help of WordNet:

In [None]:
from nltk.corpus import wordnet as wn
wn.synsets('motorcar')

Thus, *motorcar* has just one possible meaning and it is identified as *car.n.01*, the first noun sense of *car*. The entity *car.n.01* is called a <b>synset</b>, or "synonym set", a collection of synonymous words (or "<b>lemmas</b>"):
 --> Lemmatization is a crucial step in text mining.

In [None]:
wn.synset('car.n.01').lemma_names()

Each word of a synset can have several meanings, e.g., car can also signify a train carriage, a gondola, or an elevator car. However, we are only interested in the single meaning that is common to all words of the above synset, which its prose definition spells out:

In [None]:
wn.synset('car.n.01').definition()

Besides a prose definition, synsets also come with some example sentences:

In [None]:
wn.synset('car.n.01').examples()

Although definitions help humans to understand the intended meaning of a synset, the words of the synset are often more useful for our programs. To eliminate ambiguity, we will identify these words as car.n.01.automobile, car.n.01.motorcar, and so on. This pairing of a synset with a word is called a lemma. <BR>We can get all the lemmas for a given synset:

In [None]:
wn.synset('car.n.01').lemmas()

Look up a particular lemma:

In [None]:
wn.lemma('car.n.01.automobile')

Get the synset corresponding to a lemma:

In [None]:
wn.lemma('car.n.01.automobile').synset()

Get the "name" of a lemma:

In [None]:
wn.lemma('car.n.01.automobile').name()

Unlike the word *motorcar*, which is unambiguous and has one synset, the word *car* is ambiguous, having five synsets:

In [None]:
wn.synsets('car')

Now let us look at all the synonyms of *car* in all senses:

In [None]:
for synset in wn.synsets('car'):
    print(synset.lemma_names())

### WordNet Hierarchy

WordNet synsets correspond to abstract concepts, and they don't always have corresponding words in English. These concepts are linked together in a hierarchy. Some concepts are very general, such as *Entity, State, Event* — these are called <b>unique beginners</b> or root synsets. Others, such as *gas guzzler* and *hatchback*, are much more specific. <BR>A small portion of a concept hierarchy is illustrated below:<BR>
*Nodes* correspond to synsets; edges indicate the hypernym/hyponym relation, i.e. the relation between superordinate and subordinate concepts.

<img src = 'http://www.nltk.org/images/wordnet-hierarchy.png' alt = "WordNet Hierarchy" style = "height: 50%; width: 50%" align = 'center'/>

WordNet makes it easy to navigate between concepts. <BR>
For example, given a concept like *motorcar*, we can look at the concepts that are more specific; the (immediate) <b>hyponyms</b>.

In [None]:
motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()
#types_of_motorcar[0]
for synset in types_of_motorcar:
    print (synset) 

In [None]:
# How many types of car do we know? List every lemma name of the hyponyms.
sorted(lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas())

We can also navigate up the hierarchy by visiting hypernyms. Some words have multiple paths, because they can be classified in more than one way. There are two paths between *car.n.01* and *entity.n.01* because *wheeled_vehicle.n.01* can be classified as both a vehicle and a container.

In [None]:
motorcar.hypernyms()

In [None]:
paths = motorcar.hypernym_paths()
len(paths)

So now we know there are two paths linking <b> Entity </b> and <b> Car </b>. What are they then?

In [None]:
[synset.name() for synset in paths[0]]

Here is the other path:

In [None]:
[synset.name() for synset in paths[1]]

We can get the most general <b>hypernyms</b> (or root hypernyms) of a synset as follows:

In [None]:
motorcar.root_hypernyms()
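
To see why a single concept can have more than one hypernym path, here is a minimal sketch in plain Python. The graph `TOY_HYPERNYMS` and the function `toy_hypernym_paths` are hypothetical names invented for this illustration; the graph is hand-written and heavily simplified, so it is not the real WordNet data and omits most intermediate synsets.

```python
# Toy hypernym graph: each concept maps to its direct hypernyms (parents).
# Hand-written and heavily simplified; NOT the real WordNet data.
TOY_HYPERNYMS = {
    "car": ["motor_vehicle"],
    "motor_vehicle": ["wheeled_vehicle"],
    "wheeled_vehicle": ["vehicle", "container"],  # two parents, so two paths
    "vehicle": ["artifact"],
    "container": ["artifact"],
    "artifact": ["entity"],
    "entity": [],  # root concept: no hypernyms
}


def toy_hypernym_paths(concept, graph=TOY_HYPERNYMS):
    """Return every path from a root concept down to `concept`."""
    parents = graph[concept]
    if not parents:
        return [[concept]]
    paths = []
    for parent in parents:
        for path in toy_hypernym_paths(parent, graph):
            paths.append(path + [concept])
    return paths


toy_paths = toy_hypernym_paths("car")
for path in toy_paths:
    print(" -> ".join(path))
```

Because one node in the toy graph has two parents, every concept below it inherits two routes to the root, which is the same effect we observed for *car.n.01*.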

## YOUR TURN HERE

Try out NLTK's convenient graphical WordNet browser: nltk.app.wordnet(). Explore the WordNet hierarchy by following the hypernym and hyponym links.

In [None]:
# call nltk.app.wordnet() here


## 2.2 Preprocessing Text

The most important source of texts is undoubtedly the Web. It's convenient to have existing text collections to explore, such as the corpora we saw in the previous chapters. However, you probably have your own text sources in mind, and need to learn how to access them. <BR>

The goal of this chapter is to answer the following questions:<BR>

1. How can we write programs to access text from local files and from the web, in order to get hold of an unlimited range of language material?
2. How can we split documents up into individual words and punctuation symbols, so we can carry out the same kinds of analysis we did with text corpora in earlier chapters?
3. How can we write programs to produce formatted output and save it in a file?

In order to address these questions, we will be covering key concepts in NLP, including tokenization and stemming.<BR> Along the way you will consolidate your Python knowledge and learn about strings, files, and regular expressions. Since so much text on the web is in HTML format, we will also see how to dispense with markup.



In [None]:
from __future__ import division  # Python 2 users only
import nltk, re, pprint
from nltk import word_tokenize

### The NLP Pipeline

The following diagram summarizes what we cover in this section, including the process of building a vocabulary that we saw in Part 1.

<img src='http://www.nltk.org/images/pipeline1.png' alt='The NLP Pipeline' style='height: 75%; width: 75%'/>

There's a lot going on in this pipeline. To understand it properly, it helps to be clear about the type of each variable that it mentions. We find out the type of any Python object x using type(x), e.g. type(1) is “int” since 1 is an integer. <BR>
When we load the contents of a URL or file, and when we strip out HTML markup, we are dealing with strings, i.e. Python's `str` data type.
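
As a quick check of the types involved, here is a minimal sketch. The string `raw_html` is a made-up stand-in for downloaded content, and the markup removal and whitespace split are deliberately crude placeholders, not NLTK's tokenizer.

```python
# Hypothetical stand-in for text fetched from a URL or file
raw_html = "<p>Benz built the motorcar.</p>"

# Crude markup removal and whitespace tokenization, for illustration only
raw_text = raw_html.replace("<p>", "").replace("</p>", "")
tokens = raw_text.split()

print(type(1))         # an integer
print(type(raw_html))  # a string
print(type(raw_text))  # still a string after stripping markup
print(type(tokens))    # a list of strings after tokenization
```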

### Regular Expressions for Detecting Word Patterns

Many linguistic processing tasks involve pattern matching. For example, we can find words ending with ed using endswith('ed').  Regular expressions give us a more powerful and flexible method for describing the character patterns we are interested in. <BR>
<B> NOTE: </B>
There are many other published introductions to regular expressions, organized around the syntax of regular expressions and applied to searching text files. Instead of doing this again, we focus on the use of regular expressions at different stages of linguistic processing. As usual, we'll adopt a problem-based approach and present new features only as they are needed to solve practical problems. In our discussion we will mark regular expressions using chevrons like this: «patt».<BR>
To use regular expressions in Python we need to import the re library using: *import re*. We also need a list of words to search; we'll use the Words Corpus again. We will preprocess it to remove any proper names.

In [None]:
import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

#### Using Basic Meta-Characters

Let's find words ending with ed using the regular expression «ed$». We will use the *re.search(p, s)* function to check whether the pattern <b> p </b> can be found somewhere inside the string <b>s</b>. We need to specify the characters of interest, and use the *dollar sign* which has a special behavior in the context of regular expressions in that it matches the end of the word:

In [None]:
[w for w in wordlist if re.search('ed$', w)]
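
To see what the dollar sign contributes, the following sketch compares «ed$» with the unanchored «ed» on a small hand-picked list. The list `sample_words` is an assumption made for this illustration and is not taken from the Words Corpus.

```python
import re

# Hand-picked words for illustration (not from the Words Corpus)
sample_words = ["abandoned", "education", "red", "editor", "walked"]

# With $, the pattern must match at the end of the string
ends_with_ed = [w for w in sample_words if re.search(r"ed$", w)]

# Without $, the pattern may match anywhere inside the string
contains_ed = [w for w in sample_words if re.search(r"ed", w)]

print(ends_with_ed)
print(contains_ed)
```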

The . <b>wildcard </b> symbol matches any single character. Suppose we have room in a crossword puzzle for an 8-letter word with j as its third letter and t as its sixth letter. In place of each blank cell we use a period:

In [None]:
[w for w in wordlist if re.search('^..j..t..$', w)]

### YOUR TURN HERE

The caret symbol ^ matches the start of a string, just like the $ matches the end. What results do we get with the above example if we leave out both of these, and search for «..j..t..»?

In [None]:
### Your Code Here
[w for w in wordlist if re.search('..j..t..', w)]

#### Ranges and Closures

The T9 system is used for entering text on mobile phones. Two or more words that are entered with the same sequence of keystrokes are known as textonyms. For example, both hole and golf are entered by pressing the sequence 4653. What other words could be produced with the same sequence? Here we use the regular expression «^[ghi][mno][jlk][def]$»:

<img src='http://www.nltk.org/images/T9.png' alt='T9' style='height: 50%; width: 50%'/>

In [None]:
[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]

The first part of the expression, «^[ghi]», matches the start of a word followed by g, h, or i. The next part of the expression, «[mno]», constrains the second character to be m, n, or o. The third and fourth characters are also constrained. Only four words satisfy all these constraints. Note that the order of characters inside the square brackets is not significant, so we could have written «^[hig][nom][ljk][fed]$» and matched the same words.
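
The same kind of pattern can be generated from any key sequence. The sketch below assumes the standard phone keypad layout; the names `T9_KEYS`, `t9_pattern`, and `candidates` are hypothetical helpers introduced only for this example.

```python
import re

# Standard phone keypad letters (assumed layout for this sketch)
T9_KEYS = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}


def t9_pattern(digits):
    """Build an anchored regex with one bracketed set per key press."""
    return "^" + "".join("[" + T9_KEYS[d] + "]" for d in digits) + "$"


pattern = t9_pattern("4653")
print(pattern)

# Hand-picked candidates for illustration (not from the Words Corpus)
candidates = ["hole", "golf", "gold", "home", "hold"]
textonyms = [w for w in candidates if re.search(pattern, w)]
print(textonyms)
```

Here *home* is rejected because its third letter, m, is not on key 5.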

### YOUR TURN HERE

Look for some "finger-twisters", by searching for words that only use part of the number-pad. For example,  "^[ghijklmno]+$", or more concisely, "^[g-o]+$", will match words that only use keys 4, 5, 6 in the center row, and «^[a-fj-o]+$» will match words that use keys 2, 3, 5, 6 in the top-right corner. What do - and + mean?

In [None]:
#### YOUR CODE HERE


Let's explore the + symbol a bit further. Notice that it can be applied to individual letters, or to bracketed sets of letters:

In [None]:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search('^m+i+n+e+$', w)]

In [None]:
[w for w in chat_words if re.search('^[ha]+$', w)]

It should be clear that + simply means "one or more instances of the preceding item", which could be an individual character like m, a set like [fed] or a range like [d-f]. Now let's replace + with *, which means "zero or more instances of the preceding item". The regular expression «^m*i*n*e*$» will match everything that we found using «^m+i+n+e+$», but also words where some of the letters don't appear at all, e.g. me, min, and mmmmm. Note that the + and * symbols are sometimes referred to as Kleene closures, or simply closures. <BR>
The ^ operator has another function when it appears as the first character inside square brackets. For example «[^aeiouAEIOU]» matches any character other than a vowel. We can search the NPS Chat Corpus for words that are made up entirely of non-vowel characters using «^[^aeiouAEIOU]+$» to find items like these: :):):), grrr, cyb3r and zzzzzzzz. Notice this includes non-alphabetic characters.
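
The difference between +, *, and a negated set is easy to check on a few strings. The list `samples` below is made up for this sketch and is not drawn from the NPS Chat Corpus.

```python
import re

# Made-up strings for illustration (not from the NPS Chat Corpus)
samples = ["mine", "miiinnee", "me", "min", "mmmmm", "grrr", "cyb3r", "hello"]

# + requires at least one of each letter, in order
plus_matches = [w for w in samples if re.search(r"^m+i+n+e+$", w)]

# * also allows any of the letters to be missing entirely
star_matches = [w for w in samples if re.search(r"^m*i*n*e*$", w)]

# ^ inside brackets negates the set: only non-vowel characters allowed
no_vowels = [w for w in samples if re.search(r"^[^aeiouAEIOU]+$", w)]

print(plus_matches)
print(star_matches)
print(no_vowels)
```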

Here are some more examples of regular expressions being used to find tokens that match a particular pattern, illustrating the use of some new symbols: \, {}, (), and |:

In [None]:
wsj = sorted(set(nltk.corpus.treebank.words()))
# find all decimal numbers: digits, a literal period, then more digits
# (a raw string keeps the backslash from being read as a Python escape)
[w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)]

In [None]:
# find all tokens made of uppercase letters followed by a literal $ sign
[w for w in wsj if re.search(r'^[A-Z]+\$$', w)]

In [None]:
# if we want to search for any 4-digit numbers
[w for w in wsj if re.search('^[0-9]{4}$', w)]

In [None]:
#### put your comment here - what does this search for?
[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]

In [None]:
#### put your comment here - what does this search for?
[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]

In [None]:
#### put your comment here - what does this search for?
[w for w in wsj if re.search('(ed|ing)$', w)]

### YOUR TURN HERE

Study the above examples and try to work out what the \, {}, (), and | notations mean before you read on.
<B> YOUR NOTES HERE: </b>

double click to mark your notes

### Other RegEx Applications

The above examples all involved searching for words w that match some regular expression regexp using re.search(regexp, w). Apart from checking if a regular expression matches a word, we can use regular expressions to extract material from words, or to modify words in specific ways.


#### Extracting Word Pieces

The re.findall() ("find all") method finds all (non-overlapping) matches of the given regular expression. Let's find all the vowels in a word, then count them:

In [None]:
word = 'supercalifragilisticexpialidocious'

Where did this word come from? Please view the following video:

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('uZNRzc3hWvE')
# Supercalifragilisticexpialidocious (from "Mary Poppins") - Julie Andrews, Dick Van Dyke

In [None]:
# let us find all vowels in this ridiculously long word
print(re.findall(r'[aeiou]', word))
print(len(re.findall(r'[aeiou]', word)))
# 16 vowels - what a ridiculously long word!

Let's try a more typical example: we look for all sequences of two or more vowels in some text and determine their relative frequency:

In [None]:
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj
                   for vs in re.findall(r'[aeiou]{2,}', word))
fd.most_common(12)
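
The same counting idea can be tried without a corpus. In this sketch, `collections.Counter` stands in for `nltk.FreqDist`, and `sample_text` is a made-up sentence used only for illustration.

```python
import re
from collections import Counter

# Made-up sentence for illustration (not from the treebank corpus)
sample_text = "she said again that the audience paid"

# Collect every run of two or more vowels, word by word
vowel_runs = Counter(
    vs for word in sample_text.split()
    for vs in re.findall(r"[aeiou]{2,}", word)
)

print(vowel_runs.most_common(3))
```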

#### Doing More with Word Pieces

Once we can use re.findall() to extract material from words, there are interesting things to do with the pieces, like gluing them back together or plotting them.

It is sometimes noted that English text is highly redundant, and it is still easy to read when word-internal vowels are left out. For example, declaration becomes dclrtn, and inalienable becomes inlnble, retaining any initial or final vowel sequences. The regular expression in our next example matches initial vowel sequences, final vowel sequences, and all consonants; everything else is ignored. This three-way disjunction is processed left-to-right: if one of the three parts matches the word, any later parts of the regular expression are ignored. We use re.findall() to extract all the matching pieces, and ''.join() to join them together.

In [None]:
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

# define a function that drops word-internal vowels from a word
def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

english_udhr = nltk.corpus.udhr.words("English-Latin1")
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))
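
To check the two examples from the text without loading a corpus, the sketch below repeats the pattern and function under the hypothetical names `vowel_regexp` and `compress_word`, so that it runs on its own.

```python
import re

# Same three-way disjunction as above: initial vowels, final vowels, consonants
vowel_regexp = r"^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]"


def compress_word(word):
    """Keep initial and final vowel runs and every consonant."""
    return "".join(re.findall(vowel_regexp, word))


print(compress_word("declaration"))
print(compress_word("inalienable"))
```

Word-internal vowels match none of the three alternatives, so re.findall() simply skips over them.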

Next, let's combine regular expressions with conditional frequency distributions. Here we will extract all consonant-vowel sequences from the words of Rotokas, such as ka and si. Since each of these is a pair, it can be used to initialize a conditional frequency distribution. We then tabulate the frequency of each pair:

In [None]:
rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()
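
The reason a list of two-character strings can initialize a conditional frequency distribution is that each string unpacks into a (condition, event) pair. The sketch below imitates this with a plain dictionary of counters; `toy_words` and `pair_counts` are hypothetical names, and the strings are illustrative stand-ins rather than entries read from the Rotokas lexicon.

```python
import re
from collections import Counter, defaultdict

# Illustrative strings only (not read from rotokas.dic)
toy_words = ["kasuari", "kaapo", "sipiri"]

# Extract every consonant-vowel pair, as in the cell above
toy_cvs = [cv for w in toy_words for cv in re.findall(r"[ptksvr][aeiou]", w)]
print(toy_cvs)

# Each two-character string unpacks into (consonant, vowel)
pair_counts = defaultdict(Counter)
for consonant, vowel in toy_cvs:
    pair_counts[consonant][vowel] += 1

for consonant in sorted(pair_counts):
    print(consonant, dict(pair_counts[consonant]))
```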

## Q1.
Define the variable saying to contain the list 
        ['After', 'all', 'is', 'said', 'and', 'done', ',', 'more', 'is', 'said', 'than', 'done', '.']. 

Process this list using a for loop, and store the length of each word, and the word itself in a dictionary. 
### Hint:
1. begin by assigning the empty list to lengths, using lengths = []. 
2. Then each time through the loop, use append() to add another length value to the list. 
3. Now do the same thing using a list comprehension - and construct a dictionary from it.

In [None]:
#### Your code here


## Q2. 
Define a variable **silly** to contain the string:

    'newly formed bland ideas are inexpressible in an infuriating way'. 
    
(This happens to be the legitimate interpretation that bilingual English-Spanish speakers can assign to Chomsky's famous nonsense phrase, *colorless green ideas sleep furiously* according to [Wikipedia](https://en.wikipedia.org/wiki/Colorless_green_ideas_sleep_furiously)). 

Now write code to perform the following tasks:

1. Split silly into a list of strings, one per word, using Python's split() operation, and save this to a variable called **bland**.
2. Extract the *second* letter of each word in **silly** and join them into a string, to get '*eoldrnnnna*'.
3. Combine the words in **bland** back into a single string, using **join()**. Make sure the words in the resulting string are separated with whitespace.
4. Print the words of **silly** in alphabetical order, one per line.

In [None]:
#### Your code here
