# Prepare Exercises
The end result of this exercise should be a file named ```prepare.py``` that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

In [3]:
# unicode, regex, json for text digestion
import unicodedata
import re
import json

# nltk: natural language toolkit -> tokenization, stopwords
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

# pandas dataframe manipulation, acquire script, time formatting
import pandas as pd
import acquire
from time import strftime

# shh, down in front
import warnings
warnings.filterwarnings('ignore')

## 1. Define a function named ```basic_clean```. It should take in a string and apply some basic text cleaning to it:
* Lowercase everything
* Normalize unicode characters
* Remove anything that is not a letter, number, whitespace or a single quote.

In [4]:
def basic_clean(string):
    '''
    Description:
    This function takes in a string and returns the string normalized, cleaned, and lowercase.
    '''
    string = unicodedata.normalize('NFKD', string)\
             .encode('ascii', 'ignore')\
             .decode('utf-8', 'ignore')
    # lowercase first so a-z covers every letter; \w would also let underscores through
    string = re.sub(r"[^a-z0-9'\s]", '', string.lower())
    return string

In [9]:
basic_clean('I think THAT "S@t#uff$" will work 4-real!')

'i think that stuff will work 4real'
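
To see what each step of `basic_clean` contributes, the sketch below runs the unicode normalization and the regex filter separately. The sample string and the variable names are made up for illustration and are not part of the acquired data.

```python
import re
import unicodedata

# Hypothetical input containing accented letters and punctuation
sample = 'Café Niño "naïve" résumé!'

# NFKD splits each accented letter into a base letter plus a combining mark
decomposed = unicodedata.normalize('NFKD', sample)

# Encoding to ASCII with errors='ignore' drops the combining marks,
# leaving only the base letters
ascii_only = decomposed.encode('ascii', 'ignore').decode('utf-8', 'ignore')

# Keep only lowercase letters, digits, single quotes, and whitespace
cleaned = re.sub(r"[^a-z0-9'\s]", '', ascii_only.lower())

print(ascii_only)  # Cafe Nino "naive" resume!
print(cleaned)     # cafe nino naive resume
```

Without the normalization step, the regex would delete the accented letters outright instead of reducing them to their base letters.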

## 2. Define a function named ```tokenize```. It should take in a string and tokenize all the words in the string.

In [10]:
def tokenize(string):
    '''
    This function takes in a string and returns it tokenized.
    '''
    tokenizer = nltk.tokenize.ToktokTokenizer()
    string = tokenizer.tokenize(string, return_str = True)
    
    return string

In [12]:
tokenize('worked hated hours')

'worked hated hours'

## 3. Define a function named ```stem```. It should accept some text and return the text after applying stemming to all the words.

In [13]:
def stem(string):
    '''
    This function takes in a string and returns the stemmed words.
    '''
    ps = nltk.stem.PorterStemmer()
    stems = [ps.stem(word) for word in string.split()]
    string = ' '.join(stems)
    
    return string

In [14]:
stem('worked hated hours')

'work hate hour'

## 4. Define a function named ```lemmatize```. It should accept some text and return the text after applying lemmatization to each word.

In [15]:
def lemmatize(string):
    '''
    This function takes in a string and returns a string with words lemmatized.
    '''
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    string = ' '.join(lemmas)
    return string

In [16]:
lemmatize('Im not quite sure that these words are long enough to be lemmatized')

'Im not quite sure that these word are long enough to be lemmatized'

## 5. Define a function named ```remove_stopwords```. It should accept some text and return the text after removing all the stopwords.
* This function should define two optional parameters, ```extra_words``` and ```exclude_words```. 
* These parameters should define any additional stop words to include, and any words that we _**don't**_ want to remove.

In [None]:
def remove_stopwords(string, extra_words=None, exclude_words=None):
    '''
    This function takes in a string plus optional extra_words and exclude_words lists.
    It returns the string with stopwords removed: the NLTK English stopwords,
    plus extra_words, minus exclude_words.
    '''
    # start from the NLTK English list, drop the words to keep, add the extra ones
    stopword_set = set(stopwords.words('english'))
    stopword_set -= set(exclude_words or [])
    stopword_set |= set(extra_words or [])
    filtered_words = [word for word in string.split() if word not in stopword_set]
    return ' '.join(filtered_words)
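
The set arithmetic behind `extra_words` and `exclude_words` can be tried out without downloading the NLTK corpus. The sketch below uses a tiny hard-coded stop word set as a stand-in for `stopwords.words('english')`. The helper name `filter_words`, the toy set, and the sample sentence are all assumptions made for illustration.

```python
def filter_words(string, base_stopwords, extra_words=None, exclude_words=None):
    """Hypothetical helper: drop stop words from a whitespace-delimited string."""
    stopword_set = set(base_stopwords)
    stopword_set -= set(exclude_words or [])  # words we do NOT want removed
    stopword_set |= set(extra_words or [])    # additional words to remove
    return ' '.join(w for w in string.split() if w not in stopword_set)

# Toy stand-in for the NLTK English stop word list (assumption, not the real list)
toy_stopwords = {'the', 'is', 'not', 'a'}
sentence = 'the blog is not a news article'

default_result = filter_words(sentence, toy_stopwords)
keep_not = filter_words(sentence, toy_stopwords, exclude_words=['not'])
drop_blog = filter_words(sentence, toy_stopwords, extra_words=['blog'])

print(default_result)  # blog news article
print(keep_not)        # blog not news article
print(drop_blog)       # news article
```

Excluding a word such as `not` can matter for tasks like sentiment analysis, where negation changes the meaning of the text.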

## 6. Use the data from your acquire script to produce a dataframe of the news articles. Name the dataframe ```news_df```.

## 7. Make another dataframe for the Codeup blog posts. Name the dataframe ```codeup_df```.

## 8. For each dataframe, produce the following columns:
* ```title``` to hold the title
* ```original``` to hold the original article/post content
* ```clean``` to hold the normalized and tokenized original with the stopwords removed.
* ```stemmed``` to hold the stemmed version of the cleaned data.
* ```lemmatized``` to hold the lemmatized version of the cleaned data.
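
One way to produce these columns is to chain the preparation functions with `Series.apply`. The sketch below is only an assumption about how the pieces could fit together: the helper name `add_prepared_columns`, the `content` column name, the sample row, and the toy stand-in functions are all hypothetical. In `prepare.py`, the stand-ins would be replaced by compositions of `basic_clean`, `tokenize`, `remove_stopwords`, `stem`, and `lemmatize`.

```python
import pandas as pd

def add_prepared_columns(df, content_col, clean_fn, stem_fn, lemmatize_fn):
    """Hypothetical helper: build title/original/clean/stemmed/lemmatized columns."""
    out = pd.DataFrame({
        'title': df['title'],
        'original': df[content_col],
    })
    out['clean'] = out['original'].apply(clean_fn)
    # stemmed and lemmatized are both derived from the cleaned text
    out['stemmed'] = out['clean'].apply(stem_fn)
    out['lemmatized'] = out['clean'].apply(lemmatize_fn)
    return out

# Toy stand-ins so the sketch runs without NLTK (assumptions, not real cleaners)
def toy_clean(text):
    return text.lower()

def toy_stem(text):
    return ' '.join(w[:-1] if w.endswith('s') else w for w in text.split())

def toy_lemmatize(text):
    return text

# Hypothetical acquired data with a 'content' column
raw = pd.DataFrame({'title': ['First Post'], 'content': ['Two Blogs']})
prepared = add_prepared_columns(raw, 'content', toy_clean, toy_stem, toy_lemmatize)
print(prepared)
```

Passing the functions in as arguments keeps the same helper usable for both `news_df` and `codeup_df`, even if their content columns are named differently.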

## 9. Ask yourself:
* If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
* If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
* If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?