### Text Functions

Most of the interesting work in this assignment will happen when you create your own file `text_functions.py`. This next cell will load the higher-level functions into the kernel. 

**Note**: When you submit this, leave the output of code _I've_ written printed to the screen. This will make it easier for me to check your work. If you print some large stuff to the screen, you can delete those cells or just suppress the printing. 

In [1]:
from nltk.corpus import reuters
from text_functions import *
import nltk
from nltk import FreqDist

Now we'll just test them out. We'll use information from the Reuters corpus. More information can be found [here](https://www.nltk.org/book/ch02.html) in section 1.4.

In [2]:
categories = reuters.categories()

In [3]:
crop_cats = ["barley","corn","cotton","grain","potato","rye","sugar","wheat"]
mining_cats = ["alum","copper","silver","gold","iron-steel","tin","zinc"]

The Reuters corpus has 1.3M articles arranged into these categories. Let's build some big sets of text based on these categories. Articles can be in multiple categories. (Quick: what type of corpus do we call that?) So we'll pull articles that are exclusively in one of our categories. 

In [4]:
crop_articles = set()
mining_articles = set()

for cat in crop_cats : 
    for article in reuters.fileids(cat) : 
        crop_articles.add(article)
        
for cat in mining_cats : 
    for article in reuters.fileids(cat) : 
        mining_articles.add(article)


In [5]:
in_both = crop_articles.intersection(mining_articles)
crop_articles = crop_articles - in_both
mining_articles = mining_articles - in_both

In [6]:
crop_text = []
mining_text = []

for article in crop_articles :
    # Categories are stored in the article in upper case
    article_text = [w for w in reuters.words(article) if w != w.upper()]
    crop_text.extend(article_text)

for article in mining_articles :
    # Categories are stored in the article in upper case
    article_text = [w for w in reuters.words(article) if w != w.upper()]
    mining_text.extend(article_text)
    

Now we're in a position to test our code! 

### Cleaning and Tokenizing

First we'll clean and tokenize, sending one set of text in as a list and the other in as a string, just to make sure both options work. 

In [7]:
holder = crop_text
crop_text = clean_tokenize(holder)
crop_text_2 = clean_tokenize(holder,remove_sw=False,remove_non_alpha=False)
mining_text = clean_tokenize(" ".join(mining_text),remove_sw=True,lowercase=True,remove_non_alpha=True)

In [8]:
assert(len(crop_text)==69727)
assert(len(mining_text)==31275)
assert(len([w for w in crop_text if w != w.lower()])==0)
assert(len([w for w in mining_text if w != w.lower()])==0)
assert(len(crop_text_2) - len(crop_text)==42870)
print("Passed all assertion tests!")

Passed all assertion tests!


### Patterns in a Corpus

In [9]:
def get_patterns(text,num_words=10)  :
    """Computes basic statistics on a text corpus. 
    
       This function takes text as an input and returns a dictionary of statistics,
       after cleaning the text. 
       
       Args: 
           text: a list of tokens. Calls `clean_tokenize` on the text.
           num_words: Number of words to include in the FreqDist object
           in the results. Defaults to 10. 
           
       Returns: 
           A dictionary with the following keys: 
           * tokens
           * unique_tokens
           * avg_token_length
           * lexical_diversity
           * top_words: The value is a result of a call to 
             FreqDist(text).most_common(num_words)
        
    """
    
    if(len(text)==0) :
        raise ValueError("Can't work with empty text object.")
    else :
        text = clean_tokenize(text)

    # Calculate total tokens
    total_tokens = len(text)
    
    # Calculating unique tokens
    unique_tokens = len(list(FreqDist(text).keys()))

    #Calculating average token length
    avg_token_len = sum([len(word) for word in text]) / len(text)
    
    #Calculating lexical diversity
    lex_diversity = unique_tokens/total_tokens
    
    #Calculating top 10 
    top_words = FreqDist(text).most_common(num_words)
        
    
    #Results to a dictionary
    results = {'tokens':total_tokens,
               'unique_tokens':unique_tokens,
               'avg_token_length':avg_token_len,
               'lexical_diversity':lex_diversity,
               'top_words':top_words}
    
    return(results)

In [10]:
get_patterns(crop_text,num_words=10)

{'tokens': 69727,
 'unique_tokens': 6682,
 'avg_token_length': 6.191073759089019,
 'lexical_diversity': 0.09583088330202073,
 'top_words': [('said', 2323),
  ('tonnes', 1462),
  ('mln', 1291),
  ('wheat', 813),
  ('year', 661),
  ('sugar', 592),
  ('pct', 529),
  ('grain', 523),
  ('would', 487),
  ('last', 474)]}

In [11]:
get_patterns(mining_text,10)

{'tokens': 31275,
 'unique_tokens': 4656,
 'avg_token_length': 6.127769784172662,
 'lexical_diversity': 0.14887290167865708,
 'top_words': [('said', 1171),
  ('gold', 444),
  ('mln', 321),
  ('pct', 311),
  ('year', 301),
  ('tonnes', 260),
  ('dlrs', 236),
  ('company', 210),
  ('lt', 208),
  ('copper', 206)]}

### Comparing Corpora

In [12]:
def compare_texts(corpus_1, corpus_2, num_words = 10, ratio_cutoff = 5):

#error check and cleaning
    c1_results = get_patterns(corpus_1)
    c2_results = get_patterns(corpus_2)
    
    #ratio cutoff = 5
    #ratio cutoff is the number of words in each dataset.
    c1_dict = {}
    for key, value in FreqDist(corpus_1).items():
        if value > ratio_cutoff:
            c1_dict[key] = value/len(corpus_1)
        else:
            continue

    c2_dict = {}
    for key, value in FreqDist(corpus_2).items():
        if value > ratio_cutoff:
            c2_dict[key] = value/len(corpus_1)
        else:
            continue

    #print(c1_dict)
    #print(c2_dict)

    c_1VSc_2 = {}
    c_2VSc_1 = {}
    
    #adding comparison to dictionaries
    for key in c1_dict.keys():
        if key in c2_dict.keys():
            c_1VSc_2[key] = c1_dict[key]/c2_dict[key]
            c_2VSc_1[key] = c2_dict[key]/c1_dict[key]
    
    #sorting dictionary
    #num words is the number of words to be displayed
    one_vs_two = sorted(c_1VSc_2.items(), key=lambda k: k[1], reverse = True)[:num_words]
    two_vs_one = sorted(c_2VSc_1.items(), key=lambda k: k[1], reverse = True)[:num_words]
    
    #Adding to dictionary
    comparison_dict = {'one': c1_results,
                      'two': c2_results,
                      'one_vs_two': one_vs_two,
                      'two_vs_one': two_vs_one}
    
    return(comparison_dict)

In [13]:
compare_texts(crop_text,mining_text)

{'one': {'tokens': 69727,
  'unique_tokens': 6682,
  'avg_token_length': 6.191073759089019,
  'lexical_diversity': 0.09583088330202073,
  'top_words': [('said', 2323),
   ('tonnes', 1462),
   ('mln', 1291),
   ('wheat', 813),
   ('year', 661),
   ('sugar', 592),
   ('pct', 529),
   ('grain', 523),
   ('would', 487),
   ('last', 474)]},
 'two': {'tokens': 31275,
  'unique_tokens': 4656,
  'avg_token_length': 6.127769784172662,
  'lexical_diversity': 0.14887290167865708,
  'top_words': [('said', 1171),
   ('gold', 444),
   ('mln', 321),
   ('pct', 311),
   ('year', 301),
   ('tonnes', 260),
   ('dlrs', 236),
   ('company', 210),
   ('lt', 208),
   ('copper', 206)]},
 'one_vs_two': [('nil', 34.300000000000004),
  ('soviet', 21.5),
  ('department', 21.3125),
  ('previous', 14.285714285714286),
  ('vs', 14.0),
  ('offer', 13.300000000000002),
  ('administration', 12.285714285714285),
  ('french', 12.125),
  ('area', 11.300000000000002),
  ('private', 10.833333333333334)],
 'two_vs_one': [('

### Spelling

In [14]:
import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

In [15]:
assert(isinstance(crop_text,(list)))
corrected_words = dict()

for word in crop_text[:1000] :
    cw = correction(word)
    if cw != word :
        corrected_words[word] = cw

In [16]:
len(corrected_words)

55

In [17]:
for w, cw in corrected_words.items() :
    print(f"{w} was corrected to {cw}")

mln was corrected to man
dlrs was corrected to days
earmarked was corrected to remarked
exporters was corrected to exports
tonnes was corrected to tones
iraq was corrected to ran
algeria was corrected to algebra
paddy was corrected to daddy
hectares was corrected to hectare
milled was corrected to killed
portland was corrected to poland
kan was corrected to an
reagan was corrected to began
reps was corrected to rep
minn was corrected to mind
dorgan was corrected to organ
pct was corrected to put
generic was corrected to genetic
alfredo was corrected to alfred
ricart was corrected to cart
constantin was corrected to constantine
reuters was corrected to renters
pact was corrected to part
allocations was corrected to allocation
souffle was corrected to scuffle
fob was corrected to for
graniere was corrected to grangers
companie was corrected to companies
miguel was corrected to michel
braceras was corrected to braces
resende was corrected to presence
sao was corrected to so
paulo was corr