# Prepare Exercises

The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

 - Lowercase everything
 - Normalize unicode characters
 - Remove anything that is not a letter, number, whitespace, or a single quote.
 

2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

6. Define a function named prep_article that takes in the dictionary representing an article and returns a dictionary that looks like this:

{

    'title': 'the original title',
    
    'original': original,
    
    'stemmed': article_stemmed,
    
    'lemmatized': article_lemmatized,
    
    'clean': article_without_stopwords
}

Note that if the original dictionary has a title property, it should remain unchanged (the same goes for the category property).

7. Define a function named prepare_article_data that takes in the list of article dictionaries, applies the prep_article function to each one, and returns the transformed data.

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import acquire

import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords


In [None]:
articles, df = acquire.get_blog_articles()

## Clean using dataframe of strings

In [3]:
df

Unnamed: 0,title,body
0,Codeup’s Data Science Career Accelerator is He...,\nThe rumors are true! The time has arrived. C...
1,Data Science Myths - Codeup,\nBy Dimitri Antoniou and Maggie Giust\nData S...
2,Data Science VS Data Analytics: What’s The Dif...,"\nBy Dimitri Antoniou\nA week ago, Codeup laun..."
3,10 Tips to Crush It at the SA Tech Job Fair - ...,\n10 Tips to Crush It at the SA Tech Job Fair\...
4,Competitor Bootcamps Are Closing. Is the Model...,\nCompetitor Bootcamps Are Closing. Is the Mod...


In [4]:
df_bodies = df.body

In [5]:
df_bodies

0    \nThe rumors are true! The time has arrived. C...
1    \nBy Dimitri Antoniou and Maggie Giust\nData S...
2    \nBy Dimitri Antoniou\nA week ago, Codeup laun...
3    \n10 Tips to Crush It at the SA Tech Job Fair\...
4    \nCompetitor Bootcamps Are Closing. Is the Mod...
Name: body, dtype: object

In [23]:
# basic_clean takes in a string and applies some basic text cleaning to it
def basic_clean(string):
    '''
    Lowercase the string, normalize unicode characters to ASCII, remove
    anything that is not a letter, number, single quote, or whitespace,
    and collapse runs of whitespace into single spaces.
    '''
    
    # lowercase capitalized letters
    string = string.lower()
    
    # normalize unicode and drop any characters with no ASCII equivalent
    string = unicodedata.normalize('NFKD', string)\
        .encode('ascii', 'ignore')\
        .decode('utf-8', 'ignore')

    # remove anything that is not a through z, a number, a single quote, or whitespace
    string = re.sub(r"[^a-z0-9'\s]", '', string)
    
    # convert newlines, tabs, and repeated spaces to a single space
    string = re.sub(r'\s+', ' ', string)
    
    # remove leading and trailing whitespace
    return string.strip()

In [24]:
df.body.apply(basic_clean)

0    the rumors are true the time has arrived codeu...
1    by dimitri antoniou and maggie giust data scie...
2    by dimitri antoniou a week ago codeup launched...
3    10 tips to crush it at the sa tech job fair sa...
4    competitor bootcamps are closing is the model ...
Name: body, dtype: object
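
The unicode normalization step inside basic_clean can also be checked on its own. The following is a minimal sketch, not part of the exercise: the helper name `to_ascii` and the sample text are made up for illustration. NFKD decomposes an accented letter into a base letter plus a combining mark, and encoding to ASCII with `'ignore'` then discards whatever cannot be represented.

```python
import unicodedata

def to_ascii(text):
    # NFKD splits characters such as 'é' into 'e' plus a combining accent;
    # encoding to ASCII with 'ignore' then discards the non-ASCII pieces.
    return (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore'))

sample = 'Codeup’s café résumé'  # hypothetical sample text
print(to_ascii(sample))
```

A curly apostrophe has no ASCII decomposition, so it is dropped rather than turned into a straight single quote. A word like "Codeup’s" therefore loses its apostrophe during normalization, before the regex step ever sees it.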

In [25]:
# tokenize takes in a string and tokenizes all the words in the string
def tokenize(string):
    tokenizer = nltk.tokenize.ToktokTokenizer()
    return tokenizer.tokenize(string, return_str=True)

In [26]:
# stem accepts some text and returns the text after applying stemming to all the words.
def stem(string):
    ps = nltk.stem.PorterStemmer()
    stems = [ps.stem(word) for word in string.split()]
    string_of_stems = ' '.join(stems)
    return string_of_stems

In [27]:
# lemmatize accepts some text and returns the text after applying lemmatization to each word
def lemmatize(string):
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    string_of_lemmas = ' '.join(lemmas)
    return string_of_lemmas

In [28]:
# remove_stopwords accepts some text and returns the text after removing all the stopwords.
def remove_stopwords(string, extra_words=None, exclude_words=None):
    # Tokenize the string
    string = tokenize(string)

    words = string.split()
    stopword_list = stopwords.words('english')

    # remove the excluded words from the stopword list
    stopword_list = set(stopword_list) - set(exclude_words or [])

    # add in the user specified extra words
    stopword_list = stopword_list.union(set(extra_words or []))

    filtered_words = [w for w in words if w not in stopword_list]
    final_string = " ".join(filtered_words)
    return final_string
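
The way extra_words and exclude_words adjust the stopword list can be tried without the NLTK corpus. The sketch below is an illustration only: `BASE_STOPWORDS` is a tiny made-up stand-in for `stopwords.words('english')`, `filter_stopwords` is a hypothetical name, and plain `str.split` is assumed in place of the tokenizer.

```python
# Tiny made-up stand-in for stopwords.words('english') (assumption)
BASE_STOPWORDS = {'the', 'is', 'a', 'no', 'and'}

def filter_stopwords(text, extra_words=None, exclude_words=None):
    # Start from the base list, drop the words we want to keep in the text,
    # then add the extra words we also want removed.
    stopword_set = (BASE_STOPWORDS - set(exclude_words or [])) | set(extra_words or [])
    return ' '.join(w for w in text.split() if w not in stopword_set)

text = 'the model is no good and codeup is a bootcamp'
print(filter_stopwords(text))
print(filter_stopwords(text, extra_words=['codeup'], exclude_words=['no']))
```

With the defaults, "no" is treated as a stopword and removed. Passing it in exclude_words keeps it in the text, which matters when negation carries meaning, while extra_words removes domain words such as "codeup" that appear in nearly every article.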

In [None]:
# prep_article takes in the dictionary representing an article and returns a dictionary; see the top cell for an example of what it should look like.
def prep_article(article):
    pass  # not implemented yet
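
One possible shape for prep_article is sketched here under stated assumptions, not as the finished solution. The keyword parameters `clean`, `stem`, `lemmatize`, and `remove_stopwords` are hypothetical hooks with placeholder defaults so that the block runs without NLTK; in the notebook the real functions defined above would be passed in or called directly. It also assumes each article stores its text under a 'body' key, and that 'clean' means the cleaned text with stopwords removed.

```python
def prep_article(article, clean=str.lower, stem=lambda s: s,
                 lemmatize=lambda s: s, remove_stopwords=lambda s: s):
    # Hypothetical hooks: pass in the real basic_clean, stem, lemmatize and
    # remove_stopwords; the defaults are placeholders so this runs standalone.
    cleaned = clean(article['body'])

    # Copy title and category over unchanged, but only if they are present
    prepped = {key: article[key] for key in ('title', 'category') if key in article}

    prepped.update({
        'original': article['body'],
        'stemmed': stem(cleaned),
        'lemmatized': lemmatize(cleaned),
        'clean': remove_stopwords(cleaned),
    })
    return prepped

# Made-up example article
example = {'title': 'Data Science Myths', 'body': 'Data Science Myths Debunked'}
print(prep_article(example))
```

Building a new dictionary instead of modifying `article` in place leaves the acquired data untouched, so the preparation can be rerun with different settings.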
    

In [None]:
# prepare_article_data takes in the list of article dictionaries, applies the prep_article function to each one, and returns the transformed data.
def prepare_article_data(articles):
    pass  # not implemented yet
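
Applying a per-article function to the whole list can be as small as a list comprehension. This is a sketch only: the `prep` parameter is a hypothetical hook (in the notebook it would be prep_article), and the sample articles and the `add_clean` placeholder are made up so the block runs on its own.

```python
def prepare_article_data(articles, prep=lambda article: dict(article)):
    # Apply the per-article preparation to every dictionary in the list.
    # `prep` is a hypothetical hook; in the notebook it would be prep_article.
    return [prep(article) for article in articles]

# Made-up sample data with a placeholder preparation step
sample_articles = [{'title': 'A', 'body': 'First Body'},
                   {'title': 'B', 'body': 'Second Body'}]
add_clean = lambda a: {**a, 'clean': a['body'].lower()}

prepared = prepare_article_data(sample_articles, prep=add_clean)
print(prepared)
```

The result is a list of dictionaries, which can be handed to `pd.DataFrame` if a dataframe is preferred for later steps.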
    

## Clean using string data (needs a for loop to go through all of them)

In [None]:
article1 = articles[0]['body']

In [None]:
article1

In [None]:
# lowercase all capitalized letters
article1 = article1.lower()

In [None]:
# normalizing the data by removing non-ASCII characters
article1 = unicodedata.normalize('NFKD', article1)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')
print(article1)

In [None]:
# remove whitespace
article1 = article1.strip()

In [None]:
article1

In [None]:
# remove anything that is not a through z, a number, a single quote, or whitespace
article1 = re.sub(r"[^a-z0-9'\s]", '', article1)
print(article1)
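
The cell-by-cell steps above only clean one article. A minimal sketch of the loop that the section heading calls for follows, assuming `articles` is a list of dictionaries with a 'body' key as in `articles[0]['body']`; the two sample articles are made up so the block runs on its own.

```python
import re
import unicodedata

# Made-up stand-in for the `articles` list from the acquire step
articles = [{'title': 'One', 'body': '\nThe rumors are TRUE!\n'},
            {'title': 'Two', 'body': 'Data Science: what’s new?'}]

cleaned_bodies = []
for article in articles:
    # same steps as the cells above, applied to each body in turn
    body = article['body'].lower()
    body = (unicodedata.normalize('NFKD', body)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore'))
    body = body.strip()
    body = re.sub(r"[^a-z0-9'\s]", '', body)
    cleaned_bodies.append(body)

print(cleaned_bodies)
```

Wrapping the loop body in a function such as basic_clean removes the need to repeat these steps by hand for every article.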
