# Exercises
The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire

In [3]:
# get a string of text to test
news_df = acquire.get_news_articles(cached=False)
news_df.head()

Unnamed: 0,topic,title,author,content
0,business,Moderna's early data shows its COVID-19 vaccin...,Pragya Swastik,American biotechnology company Moderna on Mond...
1,business,15 countries sign world's biggest free-trade p...,Pragya Swastik,Fifteen Asia-Pacific countries signed the Regi...
2,business,How does Moderna's COVID-19 vaccine candidate ...,Pragya Swastik,Moderna's initial results of late-stage trial ...
3,business,"Reduce foreign funding to 26% by Oct 15, 2021:...",Pragya Swastik,The I&B Ministry on Monday asked digital media...
4,business,Reliance Retail buys 96% stake in Urban Ladder...,Rishabh Bhatnagar,Reliance Industries' retail arm Reliance Retai...
...,...,...,...,...
94,entertainment,Women are not offered cliched roles like befor...,Kriti Sharma,"Actress Samantha Akkineni, who will star in th..."
95,entertainment,Feel like you've gone for a long shoot: Irrfan...,Kriti Sharma,"Irrfan Khan's son Babil Khan, on Sunday, poste..."
96,entertainment,We decided to have the baby in Canada: Karanvi...,Kriti Sharma,Television actor Karanvir Bohra and his wife T...
97,entertainment,Aditya made me feel comfortable on the sets of...,Kriti Sharma,Speaking about her experience of working with ...


In [4]:
sstring = news_df.content[0]
sstring

"American biotechnology company Moderna on Monday announced its experimental vaccine was 94.5% effective in preventing COVID-19 based on interim data from a late-stage clinical trial. Moderna's interim analysis was based on 95 infections among trial participants who received either a placebo or the vaccine. Among those, only five infections occurred in those who received the vaccine."

1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:
- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

In [5]:
def basic_clean(text):
    '''
    Initial basic cleaning/normalization of text string
    '''
    # change to all lowercase
    low_case = text.lower()
    # normalize unicode (NFKD), drop characters with no ASCII form, then decode back to str
    recode = unicodedata.normalize('NFKD', low_case).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    # remove anything that is not a letter, number, whitespace or a single quote
    cleaned = re.sub(r"[^a-z0-9'\s]", '', recode)
    return cleaned

In [6]:
clean_string = basic_clean(sstring)
clean_string

"american biotechnology company moderna on monday announced its experimental vaccine was 945 effective in preventing covid19 based on interim data from a latestage clinical trial moderna's interim analysis was based on 95 infections among trial participants who received either a placebo or the vaccine among those only five infections occurred in those who received the vaccine"

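To see what the normalization step in basic_clean does on its own, here is a minimal sketch that uses only the standard library. The helper name `to_ascii` and the sample strings are illustrative assumptions, not part of the exercise.

```python
import unicodedata

def to_ascii(text):
    # NFKD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    # encoding to ASCII with errors='ignore' drops the combining marks and
    # any other character that has no ASCII equivalent
    return decomposed.encode('ascii', 'ignore').decode('utf-8')

print(to_ascii('café naïve'))  # accents are stripped, base letters survive
print(to_ascii('aren’t'))      # the curly apostrophe is dropped entirely
```

One consequence worth keeping in mind: curly apostrophes like the ones in the Codeup blog posts have no ASCII decomposition, so they disappear at this step instead of surviving as a single quote.
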
2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [7]:
def tokenize(text):
    '''
    Use the NLTK ToktokTokenizer to separate/tokenize text
    '''
    # create the NLTK tokenizer object
    tokenizer = nltk.tokenize.ToktokTokenizer()
    # return_str=True returns one space-joined string instead of a list of tokens
    return tokenizer.tokenize(text, return_str=True)

In [8]:
tokenized = tokenize(clean_string)
tokenized

"american biotechnology company moderna on monday announced its experimental vaccine was 945 effective in preventing covid19 based on interim data from a latestage clinical trial moderna ' s interim analysis was based on 95 infections among trial participants who received either a placebo or the vaccine among those only five infections occurred in those who received the vaccine"

3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [11]:
def stem(text):
    '''
    Apply NLTK Porter stemming to each word, stripping common suffixes
    '''
    # Create the nltk stemmer object, then use it
    ps = nltk.porter.PorterStemmer()
    stems = [ps.stem(word) for word in text.split()]
    article_stemmed = ' '.join(stems)
    return article_stemmed


In [12]:
stemmed = stem(tokenized)
stemmed

"american biotechnolog compani moderna on monday announc it experiment vaccin wa 945 effect in prevent covid19 base on interim data from a latestag clinic trial moderna ' s interim analysi wa base on 95 infect among trial particip who receiv either a placebo or the vaccin among those onli five infect occur in those who receiv the vaccin"

4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [21]:
def lemmatize(text):
    '''
    Apply NLTK WordNet lemmatization to reduce each word to its dictionary form (lemma)
    '''
    # Create the nltk lemmatizer object, then use it
    wnl = nltk.stem.WordNetLemmatizer()
    # lemmatize() treats every word as a noun unless a pos argument is given
    lemmas = [wnl.lemmatize(word) for word in text.split()]
    article_lemmatized = ' '.join(lemmas)
    return article_lemmatized

In [22]:
article_lemmatized = lemmatize(tokenized)
article_lemmatized

"american biotechnology company moderna on monday announced it experimental vaccine wa 945 effective in preventing covid19 based on interim data from a latestage clinical trial moderna ' s interim analysis wa based on 95 infection among trial participant who received either a placebo or the vaccine among those only five infection occurred in those who received the vaccine"

5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.     
This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [40]:
def remove_stopwords(text, extra_words=None, exclude_words=None):
    '''
    Remove stopwords from text. extra_words are added to the stopword list;
    exclude_words are taken out of it so they stay in the text.
    '''
    # define initial stopwords list
    stopword_list = stopwords.words('english')
    # add additional stopwords (None means no extras)
    stopword_list.extend(extra_words or [])
    # take the words we want to keep out of the stopword list
    for word in exclude_words or []:
        # skip words that are not stopwords to avoid a ValueError
        if word in stopword_list:
            stopword_list.remove(word)
    # split the string into words
    words = text.split()
    # filter the words
    filtered_words = [w for w in words if w not in stopword_list]
    # print number of stopwords removed
    print('Removed {} stopwords'.format(len(words) - len(filtered_words)))
    # produce string without stopwords
    article_without_stopwords = ' '.join(filtered_words)
    return article_without_stopwords

In [41]:
remove_words2 = remove_stopwords(stemmed)

Removed 17 stopwords


In [34]:
# lowercase, because the text was lowercased by basic_clean and matching is case-sensitive
extra_words = ['usaa', 'codeup']
exclude_words = ['no', 'not']
remove_words1 = remove_stopwords(article_lemmatized, extra_words, exclude_words)
remove_words1

Removed 18 stopwords


"american biotechnology company moderna monday announced experimental vaccine wa 945 effective preventing covid19 based interim data latestage clinical trial moderna ' interim analysis wa based 95 infection among trial participant received either placebo vaccine among five infection occurred received vaccine"

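The stopword handling can also be expressed with sets. The sketch below is only an illustration: it uses a tiny hard-coded stopword set (`BASE_STOPWORDS`, an assumption standing in for NLTK's English list) so that it runs without any downloads, and `filter_stopwords` is a hypothetical name.

```python
# Stand-in for stopwords.words('english'); an assumption for this sketch only
BASE_STOPWORDS = {'the', 'a', 'on', 'its', 'was', 'in', 'not', 'no'}

def filter_stopwords(text, extra_words=None, exclude_words=None):
    # union adds extra stopwords; difference takes out words we want to keep
    stopword_set = (BASE_STOPWORDS | set(extra_words or [])) - set(exclude_words or [])
    return ' '.join(w for w in text.split() if w not in stopword_set)

sample = 'the vaccine was not tested on codeup students'
print(filter_stopwords(sample))
print(filter_stopwords(sample, extra_words=['codeup'], exclude_words=['not']))
# a capitalized extra word never matches the lowercased text
print(filter_stopwords(sample, extra_words=['Codeup']))
```

Matching is case-sensitive, so extra stopwords should be lowercase when the text has already been through basic_clean. Set membership checks are also cheaper than scanning a list, which may matter on a large corpus.
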
6. Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

In [35]:
news_df

Unnamed: 0,topic,title,author,content
0,business,Moderna's early data shows its COVID-19 vaccin...,Pragya Swastik,American biotechnology company Moderna on Mond...
1,business,15 countries sign world's biggest free-trade p...,Pragya Swastik,Fifteen Asia-Pacific countries signed the Regi...
2,business,How does Moderna's COVID-19 vaccine candidate ...,Pragya Swastik,Moderna's initial results of late-stage trial ...
3,business,"Reduce foreign funding to 26% by Oct 15, 2021:...",Pragya Swastik,The I&B Ministry on Monday asked digital media...
4,business,Reliance Retail buys 96% stake in Urban Ladder...,Rishabh Bhatnagar,Reliance Industries' retail arm Reliance Retai...
...,...,...,...,...
94,entertainment,Women are not offered cliched roles like befor...,Kriti Sharma,"Actress Samantha Akkineni, who will star in th..."
95,entertainment,Feel like you've gone for a long shoot: Irrfan...,Kriti Sharma,"Irrfan Khan's son Babil Khan, on Sunday, poste..."
96,entertainment,We decided to have the baby in Canada: Karanvi...,Kriti Sharma,Television actor Karanvir Bohra and his wife T...
97,entertainment,Aditya made me feel comfortable on the sets of...,Kriti Sharma,Speaking about her experience of working with ...


7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [38]:
urls = acquire.get_all_urls()
codeup_df = acquire.get_blog_articles(urls, cached=False)
codeup_df.head()

Unnamed: 0,title,content
0,Codeup Launches Houston!,"Houston, we have a problem: there aren’t enoug..."
1,How to Succeed in a Coding Bootcamp,We held a virtual event called “How to Succeed...
2,What is Python?,If you’ve been digging around our website or r...
3,How Codeup Alumni are Helping to Make Water,Imagine having a kit mailed to you with all th...
4,What to Expect at Codeup,"Setting Expectations for Life Before, During, ..."


8. For each dataframe, produce the following columns:
- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.

In [None]:
# inspect the raw content of the first blog post
codeup_df.content[0]

In [None]:
# only got here before walkthrough
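
One possible way to finish exercise 8 is sketched below. It is not the notebook author's solution (the cell above was left unfinished); `prep_article_data` and its parameter names are hypothetical. The text-processing functions are passed in as arguments so the sketch stays self-contained, and the demonstration uses simple stand-in functions in place of the real pipeline.

```python
import pandas as pd

def prep_article_data(df, clean_fn, stem_fn, lemmatize_fn, column='content'):
    # keep the title and carry the raw text over as 'original'
    prepped = df[['title']].copy()
    prepped['original'] = df[column]
    # clean_fn is expected to normalize, tokenize, and remove stopwords
    prepped['clean'] = prepped['original'].apply(clean_fn)
    # stemmed and lemmatized are both derived from the cleaned text
    prepped['stemmed'] = prepped['clean'].apply(stem_fn)
    prepped['lemmatized'] = prepped['clean'].apply(lemmatize_fn)
    return prepped

# Tiny demonstration with stand-in functions (assumptions, not the real pipeline)
demo_df = pd.DataFrame({'title': ['Post'], 'content': ['Hello World']})
demo = prep_article_data(demo_df, str.lower, str.upper, str.title)
print(demo.columns.tolist())
```

With the functions defined earlier in this notebook, the call might look like `prep_article_data(news_df, lambda t: remove_stopwords(tokenize(basic_clean(t))), stem, lemmatize)`, and the same call with codeup_df for the blog posts. Note that remove_stopwords prints a count on every call, so this would print one line per article.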

9. Ask yourself:
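- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
    - lemmatized: the corpus is small, so the slower dictionary lookups cost little and the output stays readable
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
    - either: lemmatizing is still feasible, but stemming is a reasonable choice if processing time matters
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
    - stemmed: rule-based stemming is cheaper to compute than lemmatization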

In [20]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/ryvyny/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True