---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

# Assignment 2 - Introduction to NLTK

In part 1 of this assignment you will use nltk to explore Herman Melville's novel *Moby Dick*. In part 2 you will create a spelling recommender function that uses nltk to find correctly spelled words similar to a misspelling.

## Part 1 - Analyzing Moby Dick

In [1]:
import nltk
import pandas as pd
import numpy as np
#from nltk import *

#nltk.download('book')
# If you would like to work with the raw text you can use 'moby_raw'
with open('moby.txt', 'r') as f:
    moby_raw = f.read()
    
# If you would like to work with the novel in nltk.Text format you can use 'text1'
moby_tokens = nltk.word_tokenize(moby_raw)
text1 = nltk.Text(moby_tokens)


### Example 1

How many tokens (words and punctuation symbols) are in text1?

*This function should return an integer.*

In [4]:
def example_one():
    
    return len(nltk.word_tokenize(moby_raw)) # or alternatively len(text1)

example_one()

254989

### Example 2

How many unique tokens (unique words and punctuation) does text1 have?

*This function should return an integer.*

In [21]:
def example_two():
    
    return len(set(nltk.word_tokenize(moby_raw))) # or alternatively len(set(text1))

example_two()

20755

### Example 3

After lemmatizing the verbs, how many unique tokens does text1 have?

*This function should return an integer.*

In [24]:
from nltk.stem import WordNetLemmatizer

def example_three():

    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(w,'v') for w in text1]

    return len(set(lemmatized))

example_three()

16900

### Question 1

What is the lexical diversity of the given text input? (i.e. ratio of unique tokens to the total number of tokens)

*This function should return a float.*

In [5]:
def answer_one():
    
    # tokenize once and reuse the result for both counts
    moby_tok = nltk.word_tokenize(moby_raw)
    
    return len(set(moby_tok)) / len(moby_tok)

answer_one()

0.08139566804842562
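
The ratio can be checked by hand on a toy token list. This is a minimal sketch: the sample tokens and the helper name `lexical_diversity` are illustrative and not part of the assignment.

```python
def lexical_diversity(tokens):
    """Ratio of unique tokens to total tokens; 0.0 for an empty list."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Hypothetical, pre-tokenized sample (case-sensitive, punctuation kept as tokens)
sample = ["Call", "me", "Ishmael", ".", "Call", "me", "again", "."]
print(lexical_diversity(sample))  # 5 unique / 8 total = 0.625
```

A value close to 0 means heavy repetition, and a value of 1 means every token is distinct.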

### Question 2

What percentage of tokens is 'whale' or 'Whale'?

*This function should return a float.*

In [8]:
def answer_two():
    
    moby_tok = nltk.word_tokenize(moby_raw)
    freq = nltk.FreqDist(moby_tok)
    whale_fre = freq["whale"]
    Whale_fre = freq["Whale"]
    percentage = ((whale_fre + Whale_fre) / len(moby_tok))*100
    return percentage

answer_two()

0.4125668166077752

### Question 3

What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?

*This function should return a list of 20 tuples where each tuple is of the form `(token, frequency)`. The list should be sorted in descending order of frequency.*

In [7]:
def answer_three():
    
    # turn the raw text into tokens
    moby_tok = nltk.word_tokenize(moby_raw)
    # count how often each token occurs
    freq = nltk.FreqDist(moby_tok)
    
    # most_common(20) returns the 20 (token, frequency) pairs with the highest
    # counts, in descending order; looking tokens up by count instead would
    # emit duplicates whenever two tokens tie
    return freq.most_common(20)

answer_three()

[(',', 19204),
 ('the', 13715),
 ('.', 7308),
 ('of', 6513),
 ('and', 6010),
 ('a', 4545),
 ('to', 4515),
 (';', 4173),
 ('in', 3908),
 ('that', 2978),
 ('his', 2459),
 ('it', 2196),
 ('I', 2097),
 ('!', 1767),
 ('is', 1722),
 ('--', 1713),
 ('with', 1659),
 ('he', 1658),
 ('was', 1639),
 ('as', 1620)]
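
`nltk.FreqDist` is a subclass of `collections.Counter`, so a ranking of this shape can be reproduced with the standard library alone. The sketch below uses a hypothetical toy token list in which two tokens tie on purpose.

```python
from collections import Counter

# Hypothetical toy tokens; "b" and "c" tie on purpose
tokens = ["a", "b", "a", "c", "b", "a", "c", "d"]
counts = Counter(tokens)

# most_common(n) returns (token, count) pairs in descending order of count;
# each token appears exactly once, even when counts tie
top3 = counts.most_common(3)
print(top3)  # [('a', 3), ('b', 2), ('c', 2)]
```

Tied tokens come back in the order they were first seen, so the result is deterministic for a given input.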

### Question 4

What tokens have a length of greater than 5 and frequency of more than 150?

*This function should return an alphabetically sorted list of the tokens that match the above constraints. To sort your list, use `sorted()`*

In [26]:
def answer_four():
    
    moby_token = nltk.word_tokenize(moby_raw)
    # count how often each token occurs
    freq_dic = nltk.FreqDist(moby_token)
    
    # keep tokens longer than 5 characters that occur more than 150 times;
    # iterating over the FreqDist visits each unique token once
    tokens = [w for w in freq_dic if len(w) > 5 and freq_dic[w] > 150]
                
    # sort them alphabetically
    return sorted(tokens)

answer_four()

['Captain',
 'Pequod',
 'Queequeg',
 'Starbuck',
 'almost',
 'before',
 'himself',
 'little',
 'seemed',
 'should',
 'though',
 'through',
 'whales',
 'without']

### Question 5

Find the longest word in text1 and that word's length.

*This function should return a tuple `(longest_word, length)`.*

In [38]:
def answer_five():
    moby_token = nltk.word_tokenize(moby_raw)
    longest_len = max([len(w) for w in moby_token])
    longest_word = [w for w in moby_token if len(w)==longest_len]
    
    return (longest_word[0], longest_len)

answer_five()

("twelve-o'clock-at-night", 23)
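
The same lookup can be written as a single pass with `max()` and `key=len`. This is a small sketch on a hypothetical token list, not the full novel.

```python
# Hypothetical toy tokens
tokens = ["whale", "twelve-o'clock-at-night", "sea", "harpooneer"]

# max() with key=len returns the first token of maximal length in one pass
longest = max(tokens, key=len)
result = (longest, len(longest))
print(result)  # ("twelve-o'clock-at-night", 23)
```

If several tokens share the maximal length, `max()` returns the one that appears first.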

### Question 6

What unique words have a frequency of more than 2000? What is their frequency?

Hint: you may want to use `isalpha()` to check if the token is a word and not punctuation.

*This function should return a list of tuples of the form `(frequency, word)` sorted in descending order of frequency.*

In [50]:
def answer_six():
    
    moby_token = nltk.word_tokenize(moby_raw)
    # keep only purely alphabetic tokens; isalpha() filters out punctuation and numbers
    moby_words = [w for w in moby_token if w.isalpha()]
    # count how often each word occurs
    freq_dic = nltk.FreqDist(moby_words)
    
    # most_common() yields (word, frequency) pairs in descending order of frequency;
    # keep those above 2000 and flip each pair to (frequency, word)
    return [(v, k) for k, v in freq_dic.most_common() if v > 2000]

answer_six()

[(13715, 'the'),
 (6513, 'of'),
 (6010, 'and'),
 (4545, 'a'),
 (4515, 'to'),
 (3908, 'in'),
 (2978, 'that'),
 (2459, 'his'),
 (2196, 'it'),
 (2097, 'I')]
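
The effect of the `isalpha()` filter is easiest to see on a handful of tokens. The list and the threshold of 1 below are made up for illustration.

```python
from collections import Counter

# Hypothetical toy tokens mixing words, punctuation, a number and a contraction
tokens = ["the", ",", "whale", "--", "the", "1851", "o'clock", "the", "whale"]

# str.isalpha() is True only when every character is a letter
words = [t for t in tokens if t.isalpha()]

# (count, word) pairs, most frequent first, keeping counts above 1
pairs = [(c, w) for w, c in Counter(words).most_common() if c > 1]
print(pairs)  # [(3, 'the'), (2, 'whale')]
```

Note that `isalpha()` also drops tokens such as `o'clock`, because the apostrophe is not a letter.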

### Question 7

What is the average number of tokens per sentence?

*This function should return a float.*

In [59]:
def answer_seven():
    
    # creates a list of tokenized sentences
    moby_sent = nltk.sent_tokenize(moby_raw)
    
    # collect the number of tokens in each sentence of the novel
    len_tokens = []
    for sent in moby_sent:
        len_tokens.append(len(nltk.word_tokenize(sent)))
    
    # returns the average sentence length
    return sum(len_tokens)/len(len_tokens)

answer_seven()

25.881952902963864
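
The average is simply the total token count divided by the number of sentences. A minimal sketch, assuming the sentences have already been split and tokenized (the two sentences below are hypothetical input):

```python
# Hypothetical, already tokenized sentences
sentences = [
    ["Call", "me", "Ishmael", "."],
    ["It", "was", "a", "damp", ",", "drizzly", "November", "."],
]

# total tokens across all sentences divided by the number of sentences
avg = sum(len(s) for s in sentences) / len(sentences)
print(avg)  # (4 + 8) / 2 = 6.0
```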

### Question 8

What are the 5 most frequent parts of speech in this text? What is their frequency?

*This function should return a list of tuples of the form `(part_of_speech, frequency)` sorted in descending order of frequency.*

In [50]:
def answer_eight():
    # create tokens from the text
    moby_token = nltk.word_tokenize(moby_raw)
    # create (token, POS tag) tuples
    moby_pos = nltk.pos_tag(moby_token)
    
    # count how often each POS tag occurs
    pos_freq = nltk.FreqDist(tag for _, tag in moby_pos)
    
    # most_common(5) returns the five (tag, frequency) pairs with the highest counts
    return pos_freq.most_common(5)

answer_eight()

[('NN', 32730), ('IN', 28657), ('DT', 25867), (',', 19204), ('JJ', 17620)]
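
Counting tags only needs the second element of each `(token, tag)` pair. The sketch below uses hand-written pairs in the shape that `nltk.pos_tag` returns; the tags are assigned by hand for illustration, not produced by the tagger.

```python
from collections import Counter

# Hypothetical (token, tag) pairs in the shape nltk.pos_tag returns
tagged = [("the", "DT"), ("whale", "NN"), ("swam", "VBD"),
          ("in", "IN"), ("the", "DT"), ("sea", "NN"), (".", ".")]

# count tags only, ignoring the tokens
tag_counts = Counter(tag for _, tag in tagged)
top2 = tag_counts.most_common(2)
print(top2)  # [('DT', 2), ('NN', 2)]
```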

## Part 2 - Spelling Recommender

For this part of the assignment you will create three different spelling recommenders, each of which takes a list of misspelled words and recommends a correctly spelled word for every word in the list.

For every misspelled word, the recommender should find the word in `correct_spellings` that has the shortest distance* and starts with the same letter as the misspelled word, and return that word as a recommendation.

*Each of the three different recommenders will use a different distance measure (outlined below).

Each of the recommenders should provide recommendations for the three default words provided: `['cormulent', 'incendenece', 'validrate']`.

In [51]:
from nltk.corpus import words

correct_spellings = words.words()

### Question 9

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the trigrams of the two words.**

*This function should return a list of length three:
`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*

In [96]:
def answer_nine(entries=['cormulent', 'incendenece', 'validrate']):
    
    recommendations = []
    for entry in entries:
        # character trigrams of the misspelled word
        entry_grams = set(nltk.ngrams(entry, n=3))
        # candidate words are the correct spellings that start with the same letter
        candidates = [w for w in correct_spellings if w[0] == entry[0]]
        # Jaccard distance between the trigram sets of the misspelling and each candidate
        distances = [(w, nltk.jaccard_distance(entry_grams, set(nltk.ngrams(w, n=3))))
                     for w in candidates]
        # recommend the candidate with the smallest distance (first one wins on ties)
        recommendations.append(min(distances, key=lambda tup: tup[1])[0])

    return recommendations
    
answer_nine()



['corpulent', 'indecence', 'validate']
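
To make the metric concrete, the Jaccard distance between two trigram sets is one minus the size of their intersection divided by the size of their union. The sketch below works this out exactly for one pair of words using only the standard library; `char_ngrams` and `jaccard_distance` are illustrative helpers written here, not the nltk functions.

```python
from fractions import Fraction

def char_ngrams(word, n):
    """Set of character n-grams of `word` (empty if the word is shorter than n)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard_distance(a, b):
    """1 - |intersection| / |union|, as an exact fraction. Assumes the union is non-empty."""
    return 1 - Fraction(len(a & b), len(a | b))

a = char_ngrams("validrate", 3)  # 7 trigrams
b = char_ngrams("validate", 3)   # 6 trigrams
print(sorted(a & b))             # ['ali', 'ate', 'lid', 'val']
print(jaccard_distance(a, b))    # 5/9
```

Four shared trigrams out of nine distinct ones give a distance of 1 - 4/9 = 5/9, or about 0.556.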

### Question 10

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the 4-grams of the two words.**

*This function should return a list of length three:
`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*

In [97]:
def answer_ten(entries=['cormulent', 'incendenece', 'validrate']):
    
    recommendations = []
    for entry in entries:
        # character 4-grams of the misspelled word
        entry_grams = set(nltk.ngrams(entry, n=4))
        # candidate words are the correct spellings that start with the same letter
        candidates = [w for w in correct_spellings if w[0] == entry[0]]
        # Jaccard distance between the 4-gram sets of the misspelling and each candidate
        distances = [(w, nltk.jaccard_distance(entry_grams, set(nltk.ngrams(w, n=4))))
                     for w in candidates]
        # recommend the candidate with the smallest distance (first one wins on ties)
        recommendations.append(min(distances, key=lambda tup: tup[1])[0])

    return recommendations
    
answer_ten()



['cormus', 'incendiary', 'valid']
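
The choice of n can change which candidate wins. With longer n-grams a short word has very few n-grams, so its union with the misspelling stays small and its distance can drop below that of a closer-looking word. The sketch below shows this for `validrate` against a hypothetical two-word dictionary; the helpers are standard-library stand-ins, not the nltk functions.

```python
def char_ngrams(word, n):
    """Set of character n-grams of `word`."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard_distance(a, b):
    """1 - |intersection| / |union|. Assumes the union is non-empty."""
    return 1 - len(a & b) / len(a | b)

def closest(misspelling, candidates, n):
    """Candidate whose n-gram set has the smallest Jaccard distance to the misspelling."""
    grams = char_ngrams(misspelling, n)
    return min(candidates, key=lambda c: jaccard_distance(grams, char_ngrams(c, n)))

candidates = ["validate", "valid"]  # hypothetical two-word dictionary
print(closest("validrate", candidates, 3))  # validate (5/9 vs 4/7)
print(closest("validrate", candidates, 4))  # valid    (7/9 vs 2/3)
```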

### Question 11

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Edit distance on the two words with transpositions.](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)**

*This function should return a list of length three:
`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*

In [99]:
def answer_eleven(entries=['cormulent', 'incendenece', 'validrate']):
    
    recommendations = []
    for entry in entries:
        # candidate words are the correct spellings that start with the same letter
        candidates = [w for w in correct_spellings if w[0] == entry[0]]
        # edit distance with transpositions between the misspelling and each candidate
        distances = [(w, nltk.edit_distance(entry, w, transpositions=True))
                     for w in candidates]
        # recommend the candidate with the smallest distance (first one wins on ties)
        recommendations.append(min(distances, key=lambda tup: tup[1])[0])

    return recommendations
    
answer_eleven()

['corpulent', 'intendence', 'validate']
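
Allowing transpositions means that swapping two adjacent characters costs one edit instead of two. The sketch below implements the restricted (optimal string alignment) variant of Damerau-Levenshtein distance in plain Python to illustrate the idea. The function name `osa_distance` is made up here, and whether it agrees with `nltk.edit_distance(..., transpositions=True)` on every input has not been checked.

```python
def osa_distance(s, t):
    """Edit distance with adjacent transpositions (optimal string alignment)."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of s
    for j in range(cols):
        d[0][j] = j  # insert all j characters of t
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]

print(osa_distance("whale", "wahle"))          # 1 (one swap; plain Levenshtein gives 2)
print(osa_distance("cormulent", "corpulent"))  # 1 (one substitution)
print(osa_distance("validrate", "validate"))   # 1 (one deletion)
```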