In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire

import warnings
warnings.filterwarnings("ignore")

### In this exercise we define functions to prepare text data. They should work equally well on the Codeup blog articles and the news articles acquired previously.

In [2]:
# using functions from acquire to gather blogs and news data
blogs = acquire.get_blog_articles()
news = acquire.get_news_articles()

In [3]:
# saving one article from news and blogs to test functions with
blog_one = blogs[0]['article']
news_one = news[0]['article']

## Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:
- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace or a single quote.

In [4]:
def basic_clean(s):
    """
    Accepts a string and returns it lowercased, with unicode characters normalized to ASCII.
    Removes anything that is not a letter, number, whitespace or a single quote.
    """
    # convert string to lower case
    s = s.lower()
    # decompose accented characters (NFKD), then drop anything that is not ASCII
    s = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    # remove anything that is not a letter, number, single quote or whitespace
    s = re.sub(r"[^a-z0-9'\s]", '', s)
    return s
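
To make the three cleaning steps easier to follow, here is a minimal sketch that applies them one at a time. The sample string is made up for illustration; the calls are the same ones used in basic_clean.

```python
import re
import unicodedata

# Hypothetical sample string, chosen to exercise each cleaning step.
sample = "Café's RÉSUMÉ costs 100%!"

# Step 1: lowercase everything.
lowered = sample.lower()

# Step 2: NFKD splits accented letters into a base letter plus a combining mark;
# encoding to ASCII with errors ignored then drops the combining marks.
normalized = (
    unicodedata.normalize("NFKD", lowered)
    .encode("ascii", "ignore")
    .decode("utf-8")
)

# Step 3: remove everything except a-z, 0-9, single quotes and whitespace.
cleaned = re.sub(r"[^a-z0-9'\s]", "", normalized)

print(lowered)
print(normalized)
print(cleaned)
```

The order matters: the regular expression only keeps a-z, so running it before lowercasing or normalizing would delete uppercase and accented letters instead of converting them.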

In [5]:
# testing function on blog article
blog_one = basic_clean(blog_one)

blog_one

'the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america\ndata science is a method of providing actionable intelligence from data the data revolution has hit san antonio resulting in an explosion in data scientist positions across companies like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demands of this industry\nour program will be 18 weeks long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who has worked at heb capital group and rackspace along with input from dozens of practitioners and hiring partners students will work with real data sets realistic problem

In [6]:
# testing function on news article
news_one = basic_clean(news_one)

news_one

"american biotechnology company moderna on monday announced its experimental vaccine was 945 effective in preventing covid19 based on interim data from a latestage clinical trial moderna's interim analysis was based on 95 infections among trial participants who received either a placebo or the vaccine among those only five infections occurred in those who received the vaccine"

## Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [7]:
def tokenize(s):
    """
    Accepts a string and returns it tokenized.
    Breaks down words and any punctuation left over into discrete units.
    """
    # create the tokenizer using the ToktokTokenizer class imported above
    tokenizer = ToktokTokenizer()
    # use tokenize on string and return
    return tokenizer.tokenize(s, return_str=True)

In [8]:
# testing function on blog article
blog_one = tokenize(blog_one)

blog_one

'the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america\ndata science is a method of providing actionable intelligence from data the data revolution has hit san antonio resulting in an explosion in data scientist positions across companies like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demands of this industry\nour program will be 18 weeks long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who has worked at heb capital group and rackspace along with input from dozens of practitioners and hiring partners students will work with real data sets realistic problem

In [20]:
# testing function on news article
news_one = tokenize(news_one)

news_one

"american biotechnology company moderna on monday announced its experimental vaccine was 945 effective in preventing covid19 based on interim data from a latestage clinical trial moderna ' s interim analysis was based on 95 infections among trial participants who received either a placebo or the vaccine among those only five infections occurred in those who received the vaccine"

## Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [10]:
def stem(s):
    """
    Accepts string. Returns string after stemming it.
    """
    # saving Porter stemmer to variable (accessed through nltk.stem)
    ps = nltk.stem.PorterStemmer()
    # applying stemmer to each word in string
    stems = [ps.stem(word) for word in s.split()]
    # joining words together
    article_stemmed = ' '.join(stems)
    # returning string
    return article_stemmed

In [11]:
# testing function on blog article
blog_one_stem = stem(blog_one)

blog_one_stem

'the rumor are true the time ha arriv codeup ha offici open applic to our new data scienc career acceler with onli 25 seat avail thi immers program is one of a kind in san antonio and will help you land a job in glassdoor 1 best job in america data scienc is a method of provid action intellig from data the data revolut ha hit san antonio result in an explos in data scientist posit across compani like usaa accentur booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecur center and school of data scienc we built a program to specif meet the grow demand of thi industri our program will be 18 week long fulltim handson and projectbas our curriculum develop and instruct is led by senior data scientist maggi giust who ha work at heb capit group and rackspac along with input from dozen of practition and hire partner student will work with real data set realist problem and the entir data scienc pipelin from collect to deploy they will receiv profession develop train in resu

In [22]:
# testing function on news article
news_one_stem = stem(news_one)

news_one_stem

"american biotechnolog compani moderna on monday announc it experiment vaccin wa 945 effect in prevent covid19 base on interim data from a latestag clinic trial moderna ' s interim analysi wa base on 95 infect among trial particip who receiv either a placebo or the vaccin among those onli five infect occur in those who receiv the vaccin"

## Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [30]:
def lemmatize(s):
    """
    Accepts string. Returns string after lemmatizing it.
    """
    # saving lemmatizer to variable
    wnl = nltk.stem.WordNetLemmatizer()
    # applying lemmatizer to each word in string
    lemmas = [wnl.lemmatize(word) for word in s.split()]
    # rejoining words
    article_lemmatized = ' '.join(lemmas)
    # returning string
    return article_lemmatized

In [31]:
# testing function on blog article
blog_one_lem = lemmatize(blog_one)

blog_one_lem

'the rumor are true the time ha arrived codeup ha officially opened application to our new data science career accelerator with only 25 seat available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america data science is a method of providing actionable intelligence from data the data revolution ha hit san antonio resulting in an explosion in data scientist position across company like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demand of this industry our program will be 18 week long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who ha worked at heb capital group and rackspace along with input from dozen of practitioner and hiring partner student will work with real data set realistic problem and the entire dat

In [32]:
# testing function on news article
news_one_lem = lemmatize(news_one)

news_one_lem

"american biotechnology company moderna on monday announced it experimental vaccine wa 945 effective in preventing covid19 based on interim data from a latestage clinical trial moderna ' s interim analysis wa based on 95 infection among trial participant who received either a placebo or the vaccine among those only five infection occurred in those who received the vaccine"

## Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [33]:
def remove_stopwords(s, extra_words=None, exclude_words=None):
    """
    Accepts string and returns it with stopwords removed.
    extra_words: additional words to treat as stopwords.
    exclude_words: stopwords that should be kept in the text.
    """
    # loading the English stopwords into a set for fast membership checks
    stopword_set = set(stopwords.words('english'))
    # removing no and not from the stopword set so they stay in the text
    stopword_set -= {'no', 'not'}
    # adding any extra stopwords, then dropping any excluded ones
    stopword_set |= set(extra_words or [])
    stopword_set -= set(exclude_words or [])
    # filtering out stop words
    filtered_words = [w for w in s.split() if w not in stopword_set]
    # rejoining the remaining words and returning the string
    return ' '.join(filtered_words)
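
The effect of the two optional parameters named in the prompt, extra_words and exclude_words, can be shown on a toy example. This is a sketch under assumptions: `TOY_STOPWORDS` is a made-up stand-in for the NLTK English list so the example runs without NLTK data, and `remove_stopwords_sketch` is a hypothetical name.

```python
# Toy stand-in for NLTK's English stopword list (an assumption for this demo).
TOY_STOPWORDS = {"the", "is", "a", "no", "not"}


def remove_stopwords_sketch(s, extra_words=None, exclude_words=None):
    """Remove stopwords from s, honoring the two optional word lists."""
    stopword_set = set(TOY_STOPWORDS)
    # extra_words are added to the stopwords, so they get removed too
    stopword_set |= set(extra_words or [])
    # exclude_words are taken out of the stopwords, so they are kept
    stopword_set -= set(exclude_words or [])
    return " ".join(w for w in s.split() if w not in stopword_set)


text = "the vaccine is not a cure"
print(remove_stopwords_sketch(text))
print(remove_stopwords_sketch(text, extra_words=["vaccine"]))
print(remove_stopwords_sketch(text, exclude_words=["not"]))
```

Defaulting both parameters to None, rather than to an empty list, avoids sharing one mutable default between calls.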

In [34]:
# testing function on blog article
blog_one_lem = remove_stopwords(blog_one_lem)

blog_one_lem

'rumor true time ha arrived codeup ha officially opened application new data science career accelerator 25 seat available immersive program one kind san antonio help land job glassdoors 1 best job america data science method providing actionable intelligence data data revolution ha hit san antonio resulting explosion data scientist position across company like usaa accenture booz allen hamilton heb weve even seen utsa invest 70 cybersecurity center school data science built program specifically meet growing demand industry program 18 week long fulltime handson projectbased curriculum development instruction led senior data scientist maggie giust ha worked heb capital group rackspace along input dozen practitioner hiring partner student work real data set realistic problem entire data science pipeline collection deployment receive professional development training resume writing interviewing continuing education prepare smooth transition workforce focus applied data science immediate im

In [35]:
# testing function on news article
news_one_lem = remove_stopwords(news_one_lem)

news_one_lem

"american biotechnology company moderna monday announced experimental vaccine wa 945 effective preventing covid19 based interim data latestage clinical trial moderna ' interim analysis wa based 95 infection among trial participant received either placebo vaccine among five infection occurred received vaccine"

## Use the data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

In [43]:
newsdf = pd.DataFrame(news)

## Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [44]:
blogdf = pd.DataFrame(blogs)

blogdf

Unnamed: 0,title,article
0,Codeup’s Data Science Career Accelerator is Here!,The rumors are true! The time has arrived. Cod...
1,Data Science Myths,By Dimitri Antoniou and Maggie Giust\nData Sci...
2,Data Science VS Data Analytics: What’s The Dif...,"By Dimitri Antoniou\nA week ago, Codeup launch..."
3,10 Tips to Crush It at the SA Tech Job Fair,SA Tech Job Fair\nThe third bi-annual San Anto...
4,Competitor Bootcamps Are Closing. Is the Model...,Competitor Bootcamps Are Closing. Is the Model...


## For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.
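
One possible way to build these columns for any dataframe is a small helper that takes the preparation functions as arguments. This is a sketch, not the notebook's own code: `add_prepared_columns` is a hypothetical name, and the lambdas in the demo are trivial stand-ins so the example runs without NLTK.

```python
import pandas as pd


def add_prepared_columns(df, clean_fn, stem_fn, lemmatize_fn):
    """Return a copy of df with clean, stemmed and lemmatized columns.

    Expects df to already have 'title' and 'original' columns.
    """
    out = df[["title", "original"]].copy()
    out["clean"] = out["original"].apply(clean_fn)
    # stemmed and lemmatized are both derived from the clean column
    out["stemmed"] = out["clean"].apply(stem_fn)
    out["lemmatized"] = out["clean"].apply(lemmatize_fn)
    return out


# Demo data and stand-in functions (assumptions for illustration only).
demo = pd.DataFrame({"title": ["Hello"], "original": ["The Rumors Are True!"]})
prepared = add_prepared_columns(
    demo,
    clean_fn=lambda s: s.lower().replace("!", ""),
    stem_fn=lambda s: s,
    lemmatize_fn=lambda s: s,
)
print(prepared.columns.tolist())
```

With the functions defined in this notebook, clean_fn could be `lambda s: remove_stopwords(tokenize(basic_clean(s)))`, which matches the description of the clean column above, with `stem` and `lemmatize` passed for the other two arguments.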

In [None]:
newsdf = newsdf.rename(columns = {'article':'original'})
blogdf = blogdf.rename(columns = {'article':'original'})

In [None]:
# clean holds the basic_clean output only; tokenize and remove_stopwords are not applied here
newsdf['clean'] = newsdf['original'].apply(basic_clean)
newsdf['stemmed'] = newsdf['clean'].apply(stem)
newsdf['lemmatized'] = newsdf['clean'].apply(lemmatize)

newsdf

In [66]:
# clean holds the basic_clean output only; tokenize and remove_stopwords are not applied here
blogdf['clean'] = blogdf['original'].apply(basic_clean)
blogdf['stemmed'] = blogdf['clean'].apply(stem)
blogdf['lemmatized'] = blogdf['clean'].apply(lemmatize)

blogdf

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Codeup’s Data Science Career Accelerator is Here!,The rumors are true! The time has arrived. Cod...,the rumors are true the time has arrived codeu...,the rumor are true the time ha arriv codeup ha...,the rumor are true the time ha arrived codeup ...
1,Data Science Myths,By Dimitri Antoniou and Maggie Giust\nData Sci...,by dimitri antoniou and maggie giust\ndata sci...,by dimitri antoni and maggi giust data scienc ...,by dimitri antoniou and maggie giust data scie...
2,Data Science VS Data Analytics: What’s The Dif...,"By Dimitri Antoniou\nA week ago, Codeup launch...",by dimitri antoniou\na week ago codeup launche...,by dimitri antoni a week ago codeup launch our...,by dimitri antoniou a week ago codeup launched...
3,10 Tips to Crush It at the SA Tech Job Fair,SA Tech Job Fair\nThe third bi-annual San Anto...,sa tech job fair\nthe third biannual san anton...,sa tech job fair the third biannual san antoni...,sa tech job fair the third biannual san antoni...
4,Competitor Bootcamps Are Closing. Is the Model...,Competitor Bootcamps Are Closing. Is the Model...,competitor bootcamps are closing is the model ...,competitor bootcamp are close is the model in ...,competitor bootcamps are closing is the model ...


### Ask yourself:

- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
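
One way to inform these answers is to time both approaches on a sample of the corpus and extrapolate to the full size. The sketch below is a rough starting point: `time_transform` is a hypothetical helper, and the two transforms in the demo are stand-ins. In this notebook you would pass `stem` and `lemmatize` instead.

```python
import time


def time_transform(fn, texts, repeats=3):
    """Return the best wall-clock time in seconds for applying fn to every text."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            fn(text)
        # keep the fastest run to reduce noise from other processes
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in data and transforms (assumptions for illustration only).
texts = ["the rumors are true the time has arrived"] * 1000
first = time_transform(str.lower, texts)
second = time_transform(lambda s: " ".join(sorted(s.split())), texts)
print(f"first: {first:.4f}s, second: {second:.4f}s")
```

Dividing the measured time by the size of the sample gives a rough cost per megabyte, which can then be scaled to 493KB, 25MB or 200TB.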