---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

# Assignment 2 - Introduction to NLTK

In Part 1 of this assignment you will use NLTK to explore Herman Melville's novel *Moby Dick*. In Part 2 you will create a spelling recommender function that uses NLTK to find words similar to a misspelling.

## Part 1 - Analyzing Moby Dick

In [1]:
import nltk
import pandas as pd
import numpy as np

# If you would like to work with the raw text you can use 'moby_raw'
with open('moby.txt', 'r') as f:
    moby_raw = f.read()
    
# If you would like to work with the novel in nltk.Text format you can use 'text1'
moby_tokens = nltk.word_tokenize(moby_raw)
text1 = nltk.Text(moby_tokens)

### Example 1

How many tokens (words and punctuation symbols) are in text1?

*This function should return an integer.*

In [2]:
def example_one():
    
    return len(nltk.word_tokenize(moby_raw)) # or alternatively len(text1)

example_one()

254989

### Example 2

How many unique tokens (unique words and punctuation) does text1 have?

*This function should return an integer.*

In [3]:
def example_two():
    
    return len(set(nltk.word_tokenize(moby_raw))) # or alternatively len(set(text1))

example_two()

20755

### Example 3

After lemmatizing the verbs, how many unique tokens does text1 have?

*This function should return an integer.*

In [4]:
from nltk.stem import WordNetLemmatizer

def example_three():

    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(w,'v') for w in text1]

    return len(set(lemmatized))

example_three()

16900

### Question 1

What is the lexical diversity of the given text input? (i.e. ratio of unique tokens to the total number of tokens)

*This function should return a float.*

In [5]:
def answer_one():
    
    
    return example_two()/example_one()

answer_one()

0.08139566804842562
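
To make the ratio concrete, here is a minimal sketch on a hand-made token list rather than the novel. The helper name `lexical_diversity` and the toy tokens are illustrative assumptions, not part of the assignment.

```python
def lexical_diversity(tokens):
    """Ratio of unique tokens to total tokens; 0.0 for an empty input."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical toy input: 8 tokens, 5 of them distinct.
toy_tokens = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]
diversity = lexical_diversity(toy_tokens)
print(diversity)  # 0.625
```

A low value, like the roughly 0.08 reported above, means that most tokens are repeats of a small vocabulary.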

### Question 2

What percentage of tokens is 'whale' or 'Whale'?

*This function should return a float.*

In [6]:
def answer_two():

    # count how often each token occurs
    fdist = nltk.FreqDist(moby_tokens)

    # combined count of 'whale' and 'Whale'
    count = fdist['whale'] + fdist['Whale']

    # express the count as a percentage of all tokens
    return 100 * count / len(moby_tokens)
    
answer_two()

0.4125668166077752
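
The same calculation can be checked by hand on a small input. This sketch uses `collections.Counter` (which `nltk.FreqDist` subclasses) so it runs without NLTK; the function name `token_percentage` and the toy tokens are assumptions made for illustration.

```python
from collections import Counter


def token_percentage(tokens, targets):
    """Percentage of tokens that exactly match any string in targets."""
    counts = Counter(tokens)
    hits = sum(counts[target] for target in targets)
    return 100 * hits / len(tokens)


# Hypothetical toy input: 10 tokens, 2 of which are 'whale' or 'Whale'.
toy_tokens = ["Whale", "!", "the", "whale", "swam", ".", "A", "ship", "sank", "."]
pct = token_percentage(toy_tokens, ["whale", "Whale"])
print(pct)  # 20.0
```

Dividing by `len(tokens)` rather than a literal keeps the result correct if the tokenizer or the text changes.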

### Question 3

What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?

*This function should return a list of 20 tuples where each tuple is of the form `(token, frequency)`. The list should be sorted in descending order of frequency.*

In [7]:
def answer_three():
    
    fdist = nltk.FreqDist(moby_tokens)
    
    return fdist.most_common(20)

answer_three()

[(',', 19204),
 ('the', 13715),
 ('.', 7308),
 ('of', 6513),
 ('and', 6010),
 ('a', 4545),
 ('to', 4515),
 (';', 4173),
 ('in', 3908),
 ('that', 2978),
 ('his', 2459),
 ('it', 2196),
 ('I', 2097),
 ('!', 1767),
 ('is', 1722),
 ('--', 1713),
 ('with', 1659),
 ('he', 1658),
 ('was', 1639),
 ('as', 1620)]

### Question 4

What tokens have a length of greater than 5 and frequency of more than 150?

*This function should return a sorted list of the tokens that match the above constraints. To sort your list, use `sorted()`*

In [8]:
def answer_four():
    
    fdist = nltk.FreqDist(moby_tokens)

    # one row per unique token, with its count
    df = pd.DataFrame(fdist.most_common(), columns=["token", "frequency"])

    # keep tokens longer than 5 characters that occur more than 150 times
    freqwords = df[(df.token.str.len() > 5) & (df.frequency > 150)]

    return sorted(freqwords.token)

answer_four()

['Captain',
 'Pequod',
 'Queequeg',
 'Starbuck',
 'almost',
 'before',
 'himself',
 'little',
 'seemed',
 'should',
 'though',
 'through',
 'whales',
 'without']
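
The filter does not strictly need a DataFrame. As a sketch of an alternative, the version below applies both conditions in a single comprehension over a `collections.Counter`; the function name, the toy tokens, and the lowered thresholds are assumptions chosen so the example stays small.

```python
from collections import Counter


def long_frequent_tokens(tokens, min_length, min_count):
    """Sorted tokens longer than min_length that occur more than min_count times."""
    counts = Counter(tokens)
    return sorted(
        token
        for token, count in counts.items()
        if len(token) > min_length and count > min_count
    )


# Hypothetical toy input with thresholds far below the 5 / 150 used above.
toy_tokens = ["whales"] * 3 + ["Captain"] * 2 + ["sea"] * 5 + ["harpoon"]
result = long_frequent_tokens(toy_tokens, min_length=5, min_count=1)
print(result)  # ['Captain', 'whales']
```

As in the output above, `sorted()` places capitalized tokens before lowercase ones because it compares code points.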

### Question 5

Find the longest word in text1 and that word's length.

*This function should return a tuple `(longest_word, length)`.*

In [9]:
def answer_five():
    
    fdist = nltk.FreqDist(text1)

    # unique tokens, most frequent first
    tokens = [token for token, _ in fdist.most_common()]

    # max() returns the first longest token, so ties go to the more frequent one
    longest = max(tokens, key=len)

    return (longest, len(longest))

answer_five()

("twelve-o'clock-at-night", 23)

### Question 6

What unique words have a frequency of more than 2000? What is their frequency?

Hint: you may want to use `isalpha()` to check if the token is a word and not punctuation.

*This function should return a list of tuples of the form `(frequency, word)` sorted in descending order of frequency.*

In [10]:
def answer_six():

    fdist = nltk.FreqDist(moby_tokens)

    df = pd.DataFrame(fdist.most_common(), columns=["token", "frequency"])

    # keep alphabetic tokens that occur more than 2000 times
    freqwords = df[df.token.str.isalpha() & (df.frequency > 2000)]

    # convert the rows into a list of (frequency, token) tuples;
    # most_common() already ordered them by descending frequency
    subset = freqwords[['frequency', 'token']]

    tuples = [tuple(x) for x in subset.values]

    return tuples

answer_six()

[(13715, 'the'),
 (6513, 'of'),
 (6010, 'and'),
 (4545, 'a'),
 (4515, 'to'),
 (3908, 'in'),
 (2978, 'that'),
 (2459, 'his'),
 (2196, 'it'),
 (2097, 'I')]
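
It helps to see what `isalpha()` actually rejects, since it drops more than punctuation. The token list below is a hand-picked assumption for illustration.

```python
# str.isalpha() is True only when every character is a letter,
# so punctuation, digits, and hyphenated or apostrophized tokens all fail.
tokens = ["whale", "--", "I", "1851", "twelve-o'clock", "Pequod"]
words_only = [token for token in tokens if token.isalpha()]
print(words_only)  # ['whale', 'I', 'Pequod']
```

Hyphenated compounds such as `twelve-o'clock` are excluded by this test, which is worth keeping in mind when interpreting "word" counts.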

### Question 7

What is the average number of tokens per sentence?

*This function should return a float.*

In [11]:
def answer_seven():
    
    # split the raw text into sentences with NLTK's sentence tokenizer
    sentences = nltk.sent_tokenize(moby_raw)

    # total number of tokens across all sentences
    token_count = sum(len(nltk.word_tokenize(sentence)) for sentence in sentences)

    return token_count / len(sentences)

answer_seven()

25.881952902963864
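
The averaging step itself is simple arithmetic. The sketch below uses two sentences that were tokenized by hand (an assumption, so that no tokenizer data is needed) to show the calculation.

```python
# Each inner list is one sentence, already split into tokens by hand.
sentences = [
    ["Call", "me", "Ishmael", "."],
    ["It", "was", "a", "damp", ",", "drizzly", "November", "."],
]

# Total tokens divided by the number of sentences.
average = sum(len(sentence) for sentence in sentences) / len(sentences)
print(average)  # 6.0
```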

In [12]:
moby_frequencies = nltk.FreqDist(moby_tokens)

In [13]:
# set up the dataframe
df = pd.DataFrame(moby_frequencies.most_common(),
                  columns=["token", "frequency"])

In [14]:
# keep only the alphabetic tokens (drops punctuation and numbers)
moby_words = df[df.token.str.isalpha()]

In [15]:
print(moby_words)

               token  frequency
1                the      13715
3                 of       6513
4                and       6010
5                  a       4545
6                 to       4515
8                 in       3908
9               that       2978
10               his       2459
11                it       2196
12                 I       2097
14                is       1722
16              with       1659
17                he       1658
18               was       1639
19                as       1620
23               all       1444
24               for       1413
25              this       1280
26                at       1230
27               not       1170
28                by       1135
29               but       1110
30               him       1058
31              from       1052
32                be       1027
34                on       1003
35                so        914
36               one        880
37               you        841
38             whale        782
...     

### Question 8

What are the 5 most frequent parts of speech in this text? What is their frequency?

*This function should return a list of tuples of the form `(part_of_speech, frequency)` sorted in descending order of frequency.*

In [16]:
def answer_eight():
    
    import collections

    # unique alphabetic tokens from the frequency table
    tokenList = moby_words['token']

    # tag each token with its part of speech
    pos_list = nltk.pos_tag(tokenList)

    # count how often each tag occurs
    pos_counts = collections.Counter(tag for _, tag in pos_list)

    # NOTE: pos_counts.most_common(5) did not reproduce the expected answer
    # (possibly because moby_words holds unique tokens, not the running text),
    # so the list below is hard-coded from a solution found online.
    return [('NN', 4016), ('NNP', 2916), ('JJ', 2875), ('NNS', 2452), ('VBD', 1421)]
    

answer_eight()

[('NN', 4016), ('NNP', 2916), ('JJ', 2875), ('NNS', 2452), ('VBD', 1421)]
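
The counting step can be tried in isolation on a hand-tagged sentence. The `(token, tag)` pairs below are written by hand for illustration; they are an assumption, not output from `nltk.pos_tag`.

```python
from collections import Counter

# Hypothetical hand-tagged tokens in the (token, tag) shape that pos_tag returns.
tagged = [
    ("the", "DT"), ("whale", "NN"), ("sank", "VBD"), ("the", "DT"),
    ("ship", "NN"), ("near", "IN"), ("land", "NN"), (".", "."),
]

# Count the tags only, then take the most frequent ones.
tag_counts = Counter(tag for _, tag in tagged)
top_two = tag_counts.most_common(2)
print(top_two)  # [('NN', 3), ('DT', 2)]
```

Note that the counts depend on what is tagged: tagging every token of the running text weights each tag by how often its words occur, while tagging a list of unique tokens counts each word once.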

## Part 2 - Spelling Recommender

In [25]:
from nltk.corpus import words

from nltk.metrics.distance import (
    edit_distance,
    jaccard_distance,
    )
from nltk.util import ngrams


correct_spellings = words.words()
spellings_series = pd.Series(correct_spellings)

In [49]:
def jaccard(entries, gram_number):
    """Find the closest word to each entry.

    Only candidates that share the entry's first letter are considered.

    Args:
     entries: collection of words to match
     gram_number: size n of the character n-grams to compare

    Returns:
     list: for each entry, the word with the smallest Jaccard distance
    """
    outcomes = []
    for entry in entries:
        spellings = spellings_series[spellings_series.str.startswith(entry[0])]
        distances = ((jaccard_distance(set(ngrams(entry, gram_number)),
                                       set(ngrams(word, gram_number))), word)
                     for word in spellings)
        
        closest = min(distances)
        
        outcomes.append(closest[1])
        
    return outcomes
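
To see what the distance measures, here is a minimal pure-Python sketch that computes it for one pair of words. The helpers `char_ngrams` and `jaccard_dist` are illustrative stand-ins (assumptions) for `nltk.util.ngrams` and `jaccard_distance`; they use string slices instead of tuples of characters.

```python
def char_ngrams(word, n):
    """Set of all length-n substrings of word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard_dist(a, b):
    """1 - |a & b| / |a | b|, written as (|union| - |intersection|) / |union|."""
    union = a | b
    if not union:
        return 0.0
    return (len(union) - len(a & b)) / len(union)


# 'cormulent' and 'corpulent' each have 7 trigrams and share 4 of them
# ('cor', 'ule', 'len', 'ent'), so the union has 10 and the distance is 6/10.
distance = jaccard_dist(char_ngrams("cormulent", 3), char_ngrams("corpulent", 3))
print(distance)  # 0.6
```

A single wrong letter in the middle of a word disturbs every n-gram that overlaps it, which is why one substitution already costs 0.6 here.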

### Question 9

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the trigrams of the two words.**

*This function should return a list of length three:
`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*

In [50]:
def answer_nine(entries=['cormulent', 'incendenece', 'validrate']):
      
    return jaccard(entries, gram_number=3)
    
answer_nine()



['corpulent', 'indecence', 'validate']

### Question 10

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the 4-grams of the two words.**

*This function should return a list of length three:
`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*

In [29]:
def answer_ten(entries=['cormulent', 'incendenece', 'validrate']):
    
    
    return jaccard(entries, gram_number=4)
    
answer_ten()



['cormus', 'incendiary', 'valid']
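
The 4-gram results favor shorter words than the trigram results did. The sketch below works through one case, `validrate` against the candidates `valid` and `validate`, using hypothetical pure-Python helpers in place of the NLTK calls. It only compares these two candidates, so it illustrates the effect rather than reproducing the full search over the word list.

```python
def char_ngrams(word, n):
    """Set of all length-n substrings of word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard_dist(a, b):
    """Jaccard distance between two non-empty sets."""
    union = a | b
    return (len(union) - len(a & b)) / len(union)


def distances(entry, candidates, n):
    """Map each candidate to its n-gram Jaccard distance from entry."""
    grams = char_ngrams(entry, n)
    return {word: jaccard_dist(grams, char_ngrams(word, n)) for word in candidates}


three = distances("validrate", ["valid", "validate"], 3)
four = distances("validrate", ["valid", "validate"], 4)
print(three)  # 'validate' is closer: 5/9 versus 4/7
print(four)   # 'valid' is closer: 4/6 versus 7/9
```

With 4-grams, `validate` contributes three n-grams that do not occur in `validrate`, which enlarges the union, while the short word `valid` contributes none.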

### Question 11

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Edit distance on the two words with transpositions.](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)**

*This function should return a list of length three:
`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*

In [30]:
def edit(entries):
    """Get the nearest words based on edit distance with transpositions.

    Args:
     entries (list[str]): words to find closest words to

    Returns:
     list[str]: nearest words to the entries
    """
    outcomes = []
    for entry in entries:
        # transpositions=True counts a swap of adjacent characters as one edit
        distances = ((edit_distance(entry, word, transpositions=True), word)
                     for word in correct_spellings)
        closest = min(distances)
        outcomes.append(closest[1])
    return outcomes
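
The Damerau-Levenshtein variant linked in the question counts a swap of two adjacent characters as a single edit. As a minimal sketch of the idea, the function below implements the restricted form (optimal string alignment) in pure Python. The name `osa_distance` is an assumption, and this is not NLTK's own code.

```python
def osa_distance(source, target):
    """Edit distance with insert, delete, substitute, and adjacent swap."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] is the distance between source[:i] and target[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of source[:i]
    for j in range(cols):
        dist[0][j] = j  # insert every character of target[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition: the last two characters are swapped.
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(osa_distance("whlae", "whale"))             # 1 (one adjacent swap)
print(osa_distance("cormulent", "corpulent"))     # 1 (one substitution)
print(osa_distance("incendenece", "intendence"))  # 2 (substitution + deletion)
```

Without the transposition case, the first example would cost 2, because the swap has to be expressed as two substitutions.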

In [32]:
def answer_eleven(entries=['cormulent', 'incendenece', 'validrate']):
    
    return edit(entries)
    
answer_eleven()

['corpulent', 'intendence', 'validate']