---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

# Assignment 2 - Introduction to NLTK

In part 1 of this assignment you will use NLTK to explore Herman Melville's novel Moby Dick. In part 2 you will create a spelling recommender function that uses NLTK to find the correctly spelled words closest to a misspelling.

## Part 1 - Analyzing Moby Dick

In [4]:
import nltk
import pandas as pd
import numpy as np
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize

# If you would like to work with the raw text you can use 'moby_raw'
with open('moby.txt', 'r') as f:
    moby_raw = f.read()
    
# If you would like to work with the novel in nltk.Text format you can use 'text1'
moby_tokens = nltk.word_tokenize(moby_raw)
text1 = nltk.Text(moby_tokens)

In [16]:
moby_tokens[:10]

['[',
 'Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 ']',
 'ETYMOLOGY',
 '.']

In [17]:
text1[:10]

['[',
 'Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 ']',
 'ETYMOLOGY',
 '.']

In [2]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### Example 1

How many tokens (words and punctuation symbols) are in text1?

*This function should return an integer.*

In [8]:
def example_one():
    
    return len(text1)  # equivalently, len(moby_tokens)

example_one()

254989

### Example 2

How many unique tokens (unique words and punctuation) does text1 have?

*This function should return an integer.*

In [7]:
def example_two():
    
    return len(set(text1))  # equivalently, len(set(moby_tokens))

example_two()

20755

### Example 3

After lemmatizing the verbs, how many unique tokens does text1 have?

*This function should return an integer.*

In [9]:
from nltk.stem import WordNetLemmatizer

def example_three():

    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(w,'v') for w in text1]

    return len(set(lemmatized))

example_three()

16900

### Question 1

What is the lexical diversity of the given text input? (i.e. ratio of unique tokens to the total number of tokens)

*This function should return a float.*

In [15]:
def answer_one():
    ratio = len(set(text1))/len(text1)
    return ratio # Your answer here

answer_one()

0.08139566804842562

### Question 2

What percentage of tokens is 'whale' or 'Whale'?

*This function should return a float.*

In [19]:
def answer_two():
    dist = nltk.FreqDist(text1)
    
    percentage = (dist['whale'] + dist['Whale'])/sum(dist.values()) *100
    
    return percentage # Your answer here

answer_two()

0.4125668166077752
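
To check the arithmetic behind Questions 1 and 2 by hand, here is a minimal sketch on a made-up token list. It uses `collections.Counter` from the standard library (which `nltk.FreqDist` subclasses), so it runs without the novel. The sample tokens and the helper names `lexical_diversity` and `percent_of_tokens` are illustrative assumptions, not part of the assignment.

```python
from collections import Counter

# Hypothetical toy corpus standing in for text1 (10 tokens, 9 of them unique).
toy_tokens = ['The', 'whale', ',', 'the', 'Whale', '!', 'A', 'whale', '.', 'Call']

def lexical_diversity(tokens):
    """Ratio of unique tokens to total tokens."""
    return len(set(tokens)) / len(tokens)

def percent_of_tokens(tokens, targets):
    """Percentage of all tokens that equal one of the target strings."""
    counts = Counter(tokens)
    return 100 * sum(counts[t] for t in targets) / len(tokens)

print(lexical_diversity(toy_tokens))                      # 9 unique / 10 total
print(percent_of_tokens(toy_tokens, ['whale', 'Whale']))  # 3 matches out of 10
```

Note that matching is case-sensitive: 'The' and 'the' count as different tokens, which is why Question 2 has to add the counts for 'whale' and 'Whale' explicitly.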

In [20]:
dist = nltk.FreqDist(text1)

### Question 3

What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?

*This function should return a list of 20 tuples where each tuple is of the form `(token, frequency)`. The list should be sorted in descending order of frequency.*

In [27]:
def answer_three():
    dist = nltk.FreqDist(text1)
    # most_common(n) returns (token, frequency) pairs in descending order of frequency.
    return dist.most_common(20)

answer_three()

[(',', 19204),
 ('the', 13715),
 ('.', 7308),
 ('of', 6513),
 ('and', 6010),
 ('a', 4545),
 ('to', 4515),
 (';', 4173),
 ('in', 3908),
 ('that', 2978),
 ('his', 2459),
 ('it', 2196),
 ('I', 2097),
 ('!', 1767),
 ('is', 1722),
 ('--', 1713),
 ('with', 1659),
 ('he', 1658),
 ('was', 1639),
 ('as', 1620)]

### Question 4

What tokens have a length of greater than 5 and frequency of more than 150?

*This function should return a sorted list of the tokens that match the above constraints. To sort your list, use `sorted()`*

In [29]:
def answer_four():
    dist = nltk.FreqDist(text1)
    # Keep tokens longer than 5 characters that occur more than 150 times.
    return sorted(w for w in dist if len(w) > 5 and dist[w] > 150)

answer_four()

['Captain',
 'Pequod',
 'Queequeg',
 'Starbuck',
 'almost',
 'before',
 'himself',
 'little',
 'seemed',
 'should',
 'though',
 'through',
 'whales',
 'without']

### Question 5

Find the longest word in text1 and that word's length.

*This function should return a tuple `(longest_word, length)`.*

In [32]:
maxlen = max(len(w) for w in text1)
[w for w in text1 if len(w) == maxlen], maxlen

(["twelve-o'clock-at-night"], 23)

In [34]:
def answer_five():
    # max() with key=len returns the first token of maximal length.
    longest_word = max(text1, key=len)
    return longest_word, len(longest_word)

answer_five()

("twelve-o'clock-at-night", 23)

### Question 6

What unique words have a frequency of more than 2000? What is their frequency?

Hint: you may want to use `isalpha()` to check if the token is a word and not punctuation.

*This function should return a list of tuples of the form `(frequency, word)` sorted in descending order of frequency.*

In [42]:
def answer_six():
    dist = nltk.FreqDist(text1)
    # isalpha() drops punctuation tokens such as ',' and '.'.
    frequent = [(dist[w], w) for w in dist if w.isalpha() and dist[w] > 2000]
    return sorted(frequent, reverse=True)

answer_six()

[(13715, 'the'),
 (6513, 'of'),
 (6010, 'and'),
 (4545, 'a'),
 (4515, 'to'),
 (3908, 'in'),
 (2978, 'that'),
 (2459, 'his'),
 (2196, 'it'),
 (2097, 'I')]

### Question 7

What is the average number of tokens per sentence?

*This function should return a float.*

In [47]:
def answer_seven():
    sentences = nltk.sent_tokenize(moby_raw)
    # Total tokens across all sentences divided by the number of sentences.
    total_tokens = sum(len(nltk.word_tokenize(s)) for s in sentences)
    return total_tokens / len(sentences)

answer_seven()

25.881952902963864

### Question 8

What are the 5 most frequent parts of speech in this text? What is their frequency?

*This function should return a list of tuples of the form `(part_of_speech, frequency)` sorted in descending order of frequency.*

In [88]:
def answer_eight():
    pos = nltk.pos_tag(text1)
    pos_t=list(zip(*pos))[1]
    dist = nltk.FreqDist(pos_t)
    return dist.most_common(5) # Your answer here

answer_eight()

[('NN', 32730), ('IN', 28657), ('DT', 25867), (',', 19204), ('JJ', 17620)]

## Part 2 - Spelling Recommender

For this part of the assignment you will create three different spelling recommenders, each of which takes a list of misspelled words and recommends a correctly spelled word for every word in the list.

For every misspelled word, the recommender should find the word in `correct_spellings` that has the shortest distance* and starts with the same letter as the misspelled word, and return that word as a recommendation.

*Each of the three different recommenders will use a different distance measure (outlined below).

Each of the recommenders should provide recommendations for the three default words provided: `['cormulent', 'incendenece', 'validrate']`.
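
The procedure described above (keep only candidates with the same first letter, score each with a distance, return the closest) can be sketched independently of any particular metric. The example below is a rough illustration only: the vocabulary is a made-up toy list, `recommend` is a hypothetical helper, and the stand-in distance is built on `difflib.SequenceMatcher` from the standard library, which is not one of the three metrics this assignment asks for.

```python
from difflib import SequenceMatcher

def similarity_distance(a, b):
    """Stand-in distance (an assumption for this sketch): 1 - difflib similarity ratio."""
    return 1 - SequenceMatcher(None, a, b).ratio()

def recommend(entries, vocabulary, distance):
    """For each entry, return the closest vocabulary word sharing its first letter."""
    results = []
    for entry in entries:
        candidates = [w for w in vocabulary if w[0] == entry[0]]
        # Tuples compare by distance first, so min() selects the closest word.
        best_distance, best_word = min((distance(entry, w), w) for w in candidates)
        results.append(best_word)
    return results

# Hypothetical toy vocabulary; the assignment uses nltk.corpus.words instead.
toy_vocabulary = ['corpulent', 'cormorant', 'valid', 'validate', 'apple']
print(recommend(['cormulent', 'validrate'], toy_vocabulary, similarity_distance))
```

Passing the distance in as a function means the same loop can serve all three recommenders; only the metric changes.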

In [96]:
from nltk.corpus import words

correct_spellings = words.words()
spellings_series = pd.Series(correct_spellings)

In [97]:
from nltk.corpus import words
from nltk.metrics.distance import (
    edit_distance,
    jaccard_distance,
    )
from nltk.util import ngrams

In [98]:
entries=['cormulent', 'incendenece', 'validrate']

In [108]:
spellings_series[spellings_series.str.startswith(entries[0][0])]

28167                  c
28168                 ca
28169               caam
28170              caama
28171            caaming
28172            caapeba
28173           caatinga
28174                cab
28175               caba
28176             cabaan
28177             caback
28178             cabaho
28179              cabal
28180             cabala
28181         cabalassou
28182          cabaletta
28183            cabalic
28184           cabalism
28185           cabalist
28186         cabalistic
28187       cabalistical
28188     cabalistically
28189           caballer
28190          caballine
28191              caban
28192             cabana
28193            cabaret
28194              cabas
28195           cabasset
28196           cabassou
               ...      
236035        comparison
236036       competition
236037          complete
236038           complex
236039         condition
236040        connection
236041         conscious
236042           control
236043              cook


In [102]:
entry_1 = set(ngrams(entries[0], 3))
entry_1

{('c', 'o', 'r'),
 ('e', 'n', 't'),
 ('l', 'e', 'n'),
 ('m', 'u', 'l'),
 ('o', 'r', 'm'),
 ('r', 'm', 'u'),
 ('u', 'l', 'e')}

In [105]:
spell_series = set(ngrams(correct_spellings[13], 3))
spell_series

{('A', 'a', 'r'),
 ('a', 'r', 'o'),
 ('i', 't', 'i'),
 ('n', 'i', 't'),
 ('o', 'n', 'i'),
 ('r', 'o', 'n'),
 ('t', 'i', 'c')}

In [106]:
jaccard_distance(entry_1,spell_series)

1.0
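
A distance of 1.0 means the two trigram sets share nothing. As a minimal sketch of what the metric computes, the pure-Python version below builds character trigram sets and applies the Jaccard distance, 1 - |A ∩ B| / |A ∪ B|. The helper names `char_ngrams` and `jaccard` are illustrative assumptions that stand in for `set(ngrams(...))` and `jaccard_distance`; the word 'Aaronitic' is read off the trigrams printed above.

```python
def char_ngrams(word, n):
    """Set of character n-grams of `word`, as tuples (like set(ngrams(word, n)))."""
    return {tuple(word[i:i + n]) for i in range(len(word) - n + 1)}

def jaccard(a, b):
    """Jaccard distance between two sets, assuming their union is non-empty."""
    union = a | b
    return (len(union) - len(a & b)) / len(union)

entry = char_ngrams('cormulent', 3)
close = char_ngrams('corpulent', 3)
far = char_ngrams('Aaronitic', 3)

print(sorted(entry & close))   # the trigrams the two words share
print(jaccard(entry, close))   # 4 shared out of 10 distinct trigrams
print(jaccard(entry, far))     # no shared trigrams
```

Because 'cormulent' and 'corpulent' share 4 of their 10 distinct trigrams, the distance is 6/10, well below the 1.0 obtained for an unrelated word.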

### Question 9

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the trigrams of the two words.**

*This function should return a list of length three:
`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*

In [90]:
entries=['cormulent', 'incendenece', 'validrate']

In [109]:
def answer_nine(entries=['cormulent', 'incendenece', 'validrate']):
    gram_number = 3
    results = []
    for entry in entries:
        # Build the misspelling's n-gram set once, not once per candidate.
        entry_grams = set(ngrams(entry, gram_number))
        candidates = [w for w in correct_spellings if w[0] == entry[0]]
        distances = ((jaccard_distance(entry_grams, set(ngrams(word, gram_number))), word)
                     for word in candidates)
        # Tuples compare by distance first, so min() picks the closest word.
        closest = min(distances)
        results.append(closest[1])
    return results
    
answer_nine()



['corpulent', 'indecence', 'validate']

### Question 10

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the 4-grams of the two words.**

*This function should return a list of length three:
`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*

In [111]:
def answer_ten(entries=['cormulent', 'incendenece', 'validrate']):
    gram_number = 4
    results = []
    for entry in entries:
        # Build the misspelling's n-gram set once, not once per candidate.
        entry_grams = set(ngrams(entry, gram_number))
        candidates = [w for w in correct_spellings if w[0] == entry[0]]
        distances = ((jaccard_distance(entry_grams, set(ngrams(word, gram_number))), word)
                     for word in candidates)
        # Tuples compare by distance first, so min() picks the closest word.
        closest = min(distances)
        results.append(closest[1])
    return results
    
answer_ten()



['cormus', 'incendiary', 'valid']

### Question 11

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Edit distance on the two words with transpositions.](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)**

*This function should return a list of length three:
`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*

In [112]:
def answer_eleven(entries=['cormulent', 'incendenece', 'validrate']):
    results = []
    for entry in entries:
        candidates = [w for w in correct_spellings if w[0] == entry[0]]
        # transpositions=True counts an adjacent swap as a single edit.
        distances = ((nltk.edit_distance(entry, word, transpositions=True), word)
                     for word in candidates)
        closest = min(distances)
        results.append(closest[1])
    return results
    
answer_eleven()

['corpulent', 'intendence', 'validate']
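
To make "edit distance with transpositions" concrete, here is a minimal pure-Python sketch of the dynamic-programming table, where a swap of two adjacent characters can optionally count as one edit. It is an illustration under stated assumptions (the function name `edit_distance_sketch` is hypothetical, and this is the restricted "optimal string alignment" variant), not a claim about how `nltk.edit_distance` is implemented internally.

```python
def edit_distance_sketch(s, t, transpositions=False):
    """Edit distance with unit-cost insert, delete, substitute, and optional adjacent swap."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all i characters of s
    for j in range(cols):
        d[0][j] = j          # insert all j characters of t
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match or substitution
            if (transpositions and i > 1 and j > 1
                    and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]

print(edit_distance_sketch('abdc', 'abcd'))                       # two substitutions
print(edit_distance_sketch('abdc', 'abcd', transpositions=True))  # one swap
print(edit_distance_sketch('validrate', 'validate', transpositions=True))
```

The swapped pair 'dc' versus 'cd' costs 2 without transpositions and 1 with them, which is the only situation in which the two variants differ.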