# Preparation
The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

In [1]:
from acquire_codeup_blog import get_blog_articles
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
nltk.download('wordnet')

import pandas as pd

# We don't need to install nltk, it should come with Anaconda,
# but nltk does need to download some data.
nltk.download('stopwords')


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/sandragraham/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sandragraham/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
articles = get_blog_articles()
article_index = 0
article = articles[article_index]['content']
original = article
print(article)

The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in Glassdoor’s #1 Best Job in America.
Data Science is a method of providing actionable intelligence from data. The data revolution has hit San Antonio, resulting in an explosion in Data Scientist positions across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen UTSA invest $70 M for a Cybersecurity Center and School of Data Science. We built a program to specifically meet the growing demands of this industry.
Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Students will work with real

1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:
    - lowercase everything
    - normalize unicode characters
    - replace anything that is not a letter, number, whitespace, or a single quote with a space

In [3]:
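def basic_clean(article):
    # lowercase, then turn every whitespace character into a plain space
    new_article = article.lower()
    new_article = re.sub(r'\s', ' ', new_article)
    # decompose unicode characters and drop whatever has no ASCII equivalent
    normalized = unicodedata.normalize('NFKD', new_article)\
                .encode('ascii', 'ignore')\
                .decode('utf-8')
    # keep only letters, numbers, whitespace, and single quotes
    without_special_chars = re.sub(r"[^a-z0-9'\s]", ' ', normalized)
    # collapse runs of whitespace into single spaces
    return ' '.join(without_special_chars.split())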

article = basic_clean(article)
print(article)

the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america data science is a method of providing actionable intelligence from data the data revolution has hit san antonio resulting in an explosion in data scientist positions across companies like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demands of this industry our program will be 18 weeks long full time hands on and project based our curriculum development and instruction is led by senior data scientist maggie giust who has worked at heb capital group and rackspace along with input from dozens of practitioners and hiring partners students will work with real data sets realistic problem
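
The cleaned output above shows "glassdoors" and "weve" with no apostrophe. That comes from the unicode step rather than the regex: the source text uses curly apostrophes, which have no ASCII equivalent. A minimal sketch of that step in isolation (`to_ascii` and the sample string are made up for illustration):

```python
import unicodedata

def to_ascii(text):
    # NFKD splits accented characters into a base letter plus combining marks
    # and maps compatibility characters (such as the non-breaking space)
    # to their plain equivalents
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with errors='ignore' drops everything that is still
    # non-ASCII, including combining marks and curly quotes
    return decomposed.encode('ascii', 'ignore').decode('utf-8')

# Hypothetical sample: an accent, a curly apostrophe, and a non-breaking space
sample = 'Café Glassdoor’s\xa0#1'
print(to_ascii(sample))
```

If apostrophes need to survive cleaning, one option is to replace curly quotes with straight ones before the ASCII encoding step.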

2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [4]:
def tokenize(article):
    tokenizer = ToktokTokenizer()
    # return_str=True gives back one space-joined string instead of a list of tokens
    new_article = tokenizer.tokenize(article, return_str=True)
    return new_article

article = tokenize(article)
print(article)

the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america data science is a method of providing actionable intelligence from data the data revolution has hit san antonio resulting in an explosion in data scientist positions across companies like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demands of this industry our program will be 18 weeks long full time hands on and project based our curriculum development and instruction is led by senior data scientist maggie giust who has worked at heb capital group and rackspace along with input from dozens of practitioners and hiring partners students will work with real data sets realistic problem

3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [5]:
def print_stem_counts(article):
    # Create the nltk stemmer object, then use it on each word
    ps = nltk.stem.PorterStemmer()
    stems = [ps.stem(word) for word in article.split()]
    # Show how often each stem occurs
    print(pd.Series(stems).value_counts())

print_stem_counts(article)

and           13
data          13
to             9
a              8
in             8
scienc         7
our            7
will           6
the            6
with           6
learn          6
of             6
program        5
ha             4
machin         4
for            4
on             4
is             4
we             3
codeup         3
san            3
from           3
time           3
avail          3
antonio        3
develop        3
are            3
appli          2
4              2
deploy         2
              ..
com            1
process        1
7              1
individu       1
career         1
18             1
entir          1
focus          1
allen          1
11             1
domain         1
respond        1
start          1
america        1
languag        1
method         1
industri       1
cloud          1
pipelin        1
intellig       1
revolut        1
classif        1
best           1
senior         1
realist        1
true           1
cybersecur     1
group         

In [6]:
def stem(article):
    ps = nltk.stem.PorterStemmer()
    # split into words first; iterating over the string itself yields characters
    stems = [ps.stem(word) for word in article.split()]
    return ' '.join(stems)

article_stemmed = stem(article)
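
One thing to watch when applying a stemmer or lemmatizer word by word: iterating directly over a string visits single characters, not words, so a per-word function has nothing to work with and the text comes back unchanged. A minimal sketch of the difference, using a toy suffix stripper (`toy_stem` is a hypothetical stand-in, not the Porter algorithm):

```python
def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strip one trailing "s"
    return word[:-1] if len(word) > 1 and word.endswith('s') else word

text = 'rumors seats weeks'

# Iterating over the string visits one character at a time,
# so every "word" is a single character and nothing is stripped
by_char = ''.join(toy_stem(ch) for ch in text)

# Splitting first visits whole words, which are then rejoined with spaces
by_word = ' '.join(toy_stem(word) for word in text.split())

print(by_char)
print(by_word)
```

The same reasoning applies to `lemmatize` in the next step.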

4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [7]:
def lemmatize(article):
    wnl = nltk.stem.WordNetLemmatizer()
    # split into words first; iterating over the string itself yields characters
    lemmatized_words = [wnl.lemmatize(word) for word in article.split()]
    return ' '.join(lemmatized_words)

article_lemmatized = lemmatize(article)

5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

    This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [8]:
def remove_stopwords(article, extra_words=None, exclude_words=None):
    # get basic stopword list
    stopword_list = stopwords.words('english')
    # add extra words (both parameters are optional and default to nothing)
    stopword_list = stopword_list + list(extra_words or [])
    # remove excluded words
    exclude_words = set(exclude_words or [])
    stopword_set = {x for x in stopword_list if x not in exclude_words}

    without_stopwords = [word for word in article.split() if word not in stopword_set]
    article_without_stopwords = ' '.join(without_stopwords)
    return article_without_stopwords

extra_words = ['codeup']
exclude_words = []
article_without_stopwords = remove_stopwords(article, extra_words, exclude_words)
print(article_without_stopwords)

rumors true time arrived officially opened applications new data science career accelerator 25 seats available immersive program one kind san antonio help land job glassdoors 1 best job america data science method providing actionable intelligence data data revolution hit san antonio resulting explosion data scientist positions across companies like usaa accenture booz allen hamilton heb weve even seen utsa invest 70 cybersecurity center school data science built program specifically meet growing demands industry program 18 weeks long full time hands project based curriculum development instruction led senior data scientist maggie giust worked heb capital group rackspace along input dozens practitioners hiring partners students work real data sets realistic problems entire data science pipeline collection deployment receive professional development training resume writing interviewing continuing education prepare smooth transition workforce focus applied data science immediate impact r
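
The way the two optional parameters from step 5 interact can be sketched without NLTK. This is an illustration only: `remove_stopwords_sketch` and its five-word stopword set are hypothetical stand-ins for the real function and for `stopwords.words('english')`.

```python
def remove_stopwords_sketch(text, extra_words=None, exclude_words=None):
    # Tiny hypothetical stopword set, standing in for stopwords.words('english')
    base = {'the', 'are', 'has', 'to', 'our'}
    # Add the extra words, then take out the words we want to keep
    stopword_set = (base | set(extra_words or [])) - set(exclude_words or [])
    return ' '.join(word for word in text.split() if word not in stopword_set)

text = 'the rumors are true codeup has opened applications to our program'

# No optional arguments: only the base stopwords are removed
print(remove_stopwords_sketch(text))

# 'codeup' is now treated as a stopword, while 'our' is kept
print(remove_stopwords_sketch(text, extra_words=['codeup'], exclude_words=['our']))
```

Defaulting to `None` rather than `[]` avoids sharing one mutable default list across calls.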

6. Define a function named prep_article that takes in the dictionary representing an article and returns a dictionary that looks like this:
```
{
    'title': 'the original title',
    'original': original,
    'stemmed': article_stemmed,
    'lemmatized': article_lemmatized,
    'clean': article_without_stopwords
}
```
Note that if the original dictionary has a title property, it should remain unchanged (same goes for the category property).

In [27]:
def prep_article(this_dict):
    # work from this article's own content rather than notebook globals
    original = this_dict['content']
    article = tokenize(basic_clean(original))
    this_entry = {
         'title': this_dict['title'],
         'original': original,
         # keep an existing category unchanged; default to 'blog' when absent
         'category': this_dict.get('category', 'blog'),
         'stemmed': stem(article),
         'lemmatized': lemmatize(article),
         'clean': remove_stopwords(article, ['codeup'], [])
        }
    return this_entry

this_dict = articles[article_index]
prep_article(this_dict)

{'title': 'codeups-data-science-career-accelerator-is-here',
 'original': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, 
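
The note about the category property amounts to "keep it when present, otherwise fall back to a default". A minimal sketch of that lookup, assuming blog articles lack a `category` key and news articles carry one (both sample dictionaries and the `category_of` helper are hypothetical):

```python
# Hypothetical article dictionaries; the content is elided on purpose
blog_post = {'title': 'a-blog-post', 'content': '...'}
news_item = {'title': 'a-news-item', 'content': '...', 'category': 'business'}

def category_of(article_dict, default='blog'):
    # dict.get returns the stored value when the key exists, else the default
    return article_dict.get('category', default)

print(category_of(blog_post))
print(category_of(news_item))

# Wrapping a conditional expression in square brackets builds a one-element
# list, which is usually not what is wanted for a single category value
wrapped = [news_item['category'] if 'category' in news_item else 'blog']
print(wrapped)
```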

7. Define a function named prepare_article_data that takes in the list of article dictionaries, applies the prep_article function to each one, and returns the transformed data.

In [28]:
from acquire_codeup_blog import get_blog_articles
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
nltk.download('wordnet')

import pandas as pd

# We don't need to install nltk, it should come with Anaconda,
# but nltk does need to download some data.
nltk.download('stopwords')


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/sandragraham/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sandragraham/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [29]:
def basic_clean(article):
    '''
    take in a string (article) and return it after applying some basic text cleaning to it:
        - lowercase everything
        - normalize unicode characters
        - replace anything that is not a letter, number, whitespace or a single quote
    '''
    new_article = article.lower()
    new_article = re.sub(r'\s', ' ', new_article)
    # decompose unicode characters and drop whatever has no ASCII equivalent
    normalized = unicodedata.normalize('NFKD', new_article)\
                .encode('ascii', 'ignore')\
                .decode('utf-8')
    # keep only letters, numbers, whitespace, and single quotes
    without_special_chars = re.sub(r"[^a-z0-9'\s]", ' ', normalized)
    # collapse runs of whitespace into single spaces
    return ' '.join(without_special_chars.split())

In [30]:
def tokenize(article):
    '''tokenize all the words in the string, article'''
    tokenizer = ToktokTokenizer()
    # return_str=True gives back one space-joined string instead of a list of tokens
    new_article = tokenizer.tokenize(article, return_str=True)
    return new_article

In [31]:
def print_stem_counts(article):
    '''accept some text, apply stemming to all of the words,
        and print the value counts for all the stemmed words'''
    # Create the nltk stemmer object, then use it on each word
    ps = nltk.stem.PorterStemmer()
    stems = [ps.stem(word) for word in article.split()]
    print(pd.Series(stems).value_counts())

In [32]:
def stem(article):
    '''accept a string and return it after applying stemming to all the words'''
    ps = nltk.stem.PorterStemmer()
    # split into words first; iterating over the string itself yields characters
    stems = [ps.stem(word) for word in article.split()]
    return ' '.join(stems)

In [33]:
def lemmatize(article):
    '''accept a string and return it after applying lemmatization to each word.'''
    wnl = nltk.stem.WordNetLemmatizer()
    # split into words first; iterating over the string itself yields characters
    lemmatized_words = [wnl.lemmatize(word) for word in article.split()]
    return ' '.join(lemmatized_words)

In [34]:
def remove_stopwords(article, extra_words=None, exclude_words=None):
    '''remove all the stopwords, including all the words in extra_words and excluding
    all the words in exclude_words (both parameters are optional)'''

    # get basic stopword list
    stopword_list = stopwords.words('english')
    # add extra words
    stopword_list = stopword_list + list(extra_words or [])
    # remove excluded words
    exclude_words = set(exclude_words or [])
    stopword_set = {x for x in stopword_list if x not in exclude_words}

    without_stopwords = [word for word in article.split() if word not in stopword_set]
    article_without_stopwords = ' '.join(without_stopwords)
    return article_without_stopwords

In [35]:
def prep_article(this_dict):
    '''
    takes in a dictionary representing an article and returns a dictionary that 
    looks like this:
            {
             'title': 'the original title',
             'original': original,
             'stemmed': article_stemmed,
             'lemmatized': article_lemmatized,
             'clean': article_without_stopwords
            }
    Note that if the original dictionary has a title property, it will remain unchanged 
    (same goes for the category property; articles without one are labeled 'blog').
    '''
    # keep the untouched content as the original
    original = this_dict['content']

    # basic cleaning: lowercase everything, normalize unicode characters, and
    # replace anything that is not a letter, number, whitespace or a single quote
    article = basic_clean(original)

    # tokenize all the words in the string
    article = tokenize(article)

    # apply stemming to all the words in the string
    article_stemmed = stem(article)

    # apply lemmatization to each word in the string
    article_lemmatized = lemmatize(article)

    # extra stopwords to remove, and stopwords to keep
    extra_words = ['codeup']
    exclude_words = []

    # remove all the stopwords, including the words in extra_words and
    # excluding the words in exclude_words
    article_without_stopwords = remove_stopwords(article, extra_words, exclude_words)

    new_dict = {
         'title': this_dict['title'],
         'original': original,
         # keep an existing category unchanged; default to 'blog' when absent
         'category': this_dict.get('category', 'blog'),
         'stemmed': article_stemmed,
         'lemmatized': article_lemmatized,
         'clean': article_without_stopwords
        }
    return new_dict

In [52]:
def prepare_article_data(articles):
    '''
    takes in the list of article dictionaries, applies the prep_article
    function to each one, and returns the transformed data.
    '''
    # prep_article builds a new dictionary on every call, so no copy is needed
    return [prep_article(article) for article in articles]

articles = get_blog_articles()
transformed_data = prepare_article_data(articles)
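
A quick sanity check on the transformed data can catch entries that are missing one of the requested keys. The following is a sketch under assumptions: `EXPECTED_KEYS`, `missing_keys`, and the two sample entries are hypothetical and stand in for the real output of `prepare_article_data`.

```python
# Keys requested in step 6 of the exercise
EXPECTED_KEYS = {'title', 'original', 'stemmed', 'lemmatized', 'clean'}

def missing_keys(transformed):
    # Map the index of each incomplete entry to the keys it lacks
    report = {}
    for index, entry in enumerate(transformed):
        missing = EXPECTED_KEYS - set(entry)
        if missing:
            report[index] = sorted(missing)
    return report

# Hypothetical stand-in data: the second entry is deliberately incomplete
sample = [
    {'title': 't1', 'original': 'A b', 'stemmed': 'a b',
     'lemmatized': 'a b', 'clean': 'b'},
    {'title': 't2', 'original': 'C d'},
]
print(missing_keys(sample))
```

An empty dictionary from `missing_keys(transformed_data)` would indicate that every entry has the full set of keys.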