### Text Functions

Most of the interesting work in this assignment happens in a file you create yourself, `text_functions.py`. The next cell loads its higher-level functions into the kernel.

**Note**: When you submit this notebook, leave the output of the code _I've_ written visible. This makes it easier for me to check your work. If your own cells print a large amount of output, you can delete those cells or suppress the printing.

In [None]:
from nltk.corpus import reuters
from text_functions import clean_tokenize, get_patterns, compare_texts, correction

Now we'll test them out using text from the Reuters corpus. More information about the corpus can be found [here](https://www.nltk.org/book/ch02.html) in section 1.4.

In [None]:
categories = reuters.categories()

In [None]:
crop_cats = ["barley","corn","cotton","grain","potato","rye","sugar","wheat"]
mining_cats = ["alum","copper","silver","gold","iron-steel","tin","zinc"]

The Reuters corpus has 10,788 news documents, totaling about 1.3 million words, arranged into these categories. Let's build some big sets of text based on these categories. Articles can be in multiple categories. (Quick: what type of corpus do we call that?) So we'll pull articles that fall exclusively in one of our two groups, crops or mining, and drop any that appear in both.

In [None]:
crop_articles = set()
mining_articles = set()

for cat in crop_cats : 
    for article in reuters.fileids(cat) : 
        crop_articles.add(article)
        
for cat in mining_cats : 
    for article in reuters.fileids(cat) : 
        mining_articles.add(article)


In [None]:
in_both = crop_articles.intersection(mining_articles)
crop_articles = crop_articles - in_both
mining_articles = mining_articles - in_both
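
To see what the set operations above do, here is a toy version using made-up file IDs (these IDs are illustrative and not taken from the corpus). It is a minimal sketch of the same intersection-then-difference idea, not a replacement for the cells above.

```python
# Made-up file IDs standing in for reuters.fileids(...) results.
crop_ids = {"test/1", "test/2", "training/7"}
mining_ids = {"test/2", "training/9"}

# Articles that appear in both groups.
shared = crop_ids & mining_ids

# Remove the shared articles so each group is exclusive.
crop_only = crop_ids - shared
mining_only = mining_ids - shared

print(sorted(shared))
print(sorted(crop_only))
print(sorted(mining_only))
```

After this step the two sets have no article in common, which is what lets us treat them as separate corpora later.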

In [None]:
crop_text = []
mining_text = []

for article in crop_articles :
    # Drop tokens that equal their own upper-case form: all-caps words,
    # and also numbers and punctuation
    article_text = [w for w in reuters.words(article) if w != w.upper()]
    crop_text.extend(article_text)

for article in mining_articles :
    # Drop tokens that equal their own upper-case form: all-caps words,
    # and also numbers and punctuation
    article_text = [w for w in reuters.words(article) if w != w.upper()]
    mining_text.extend(article_text)
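
The filter `w != w.upper()` removes more than the all-caps words. Any token with no lowercase letters, including numbers and punctuation, is equal to its own upper-case form and is dropped too. A quick check on a made-up token list (the tokens are illustrative, not from the corpus):

```python
# Illustrative tokens: an all-caps word, ordinary words, numbers, punctuation.
tokens = ["GRAIN", "exports", "rose", "3", ".", "5", "pct", ",", "Wheat"]

# Same condition as the cells above: keep a token only if upper-casing changes it.
kept = [w for w in tokens if w != w.upper()]

print(kept)
```

Only `exports`, `rose`, `pct`, and `Wheat` survive. Mixed-case words like `Wheat` are kept because upper-casing changes them.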
    

Now we're in a position to test our code! 

### Cleaning and Tokenizing

First we'll clean and tokenize, sending one set of text in as a list and the other in as a string, just to make sure both options work. 
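
As a rough illustration of what "list or string" handling can look like, here is a minimal sketch. The name `clean_tokenize_sketch`, the tiny `TOY_STOPWORDS` set, and the whitespace split are all assumptions made for this example. The keyword names mirror the calls in this notebook, but your own `clean_tokenize` may reasonably work differently.

```python
# Hypothetical, deliberately tiny stopword set used only for this sketch.
TOY_STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def clean_tokenize_sketch(text, lowercase=True, remove_sw=True,
                          remove_non_alpha=True):
    """Accept a string or a list of tokens and return a cleaned token list."""
    # Normalize the input: split a string on whitespace, copy a list as-is.
    if isinstance(text, str):
        tokens = text.split()
    else:
        tokens = list(text)

    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_non_alpha:
        # Keep only tokens made up entirely of alphabetic characters.
        tokens = [t for t in tokens if t.isalpha()]
    if remove_sw:
        # Compare in lower case so "The" is caught even if lowercase=False.
        tokens = [t for t in tokens if t.lower() not in TOY_STOPWORDS]
    return tokens


as_list = ["The", "price", "of", "wheat", "rose", "3", "pct", "."]
as_string = " ".join(as_list)

print(clean_tokenize_sketch(as_list))
print(clean_tokenize_sketch(as_string))
print(clean_tokenize_sketch(as_list, remove_sw=False, remove_non_alpha=False))
```

Checking the input type once at the top keeps the rest of the function identical for both cases, so the list and string calls return the same tokens here.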

In [None]:
holder = crop_text
crop_text = clean_tokenize(holder)
crop_text_2 = clean_tokenize(holder,remove_sw=False,remove_non_alpha=False)
mining_text = clean_tokenize(" ".join(mining_text),remove_sw=True,lowercase=True,remove_non_alpha=True)

In [None]:
assert(len(crop_text)==69727)
assert(len(mining_text)==31275)
assert(len([w for w in crop_text if w != w.lower()])==0)
assert(len([w for w in mining_text if w != w.lower()])==0)
assert(len(crop_text_2) - len(crop_text)==42870)
print("Passed all assertion tests!")

### Patterns in a Corpus

In [None]:
get_patterns(crop_text,10)

In [None]:
get_patterns(mining_text,10)

### Comparing Corpora

In [None]:
compare_texts(crop_text,mining_text)

### Spelling

In [None]:
assert(isinstance(crop_text,(list)))
corrected_words = dict()

for word in crop_text[:1000] :
    cw = correction(word)
    if cw != word :
        corrected_words[word] = cw

In [None]:
len(corrected_words)

In [None]:
for w, cw in corrected_words.items() :
    print(f"{w} was corrected to {cw}")