# Prepare Exercises

<p>The end result of this exercise should be a file named <code>prepare.py</code> that defines
the requested functions.</p>
<p>In this exercise we will be defining some functions to prepare textual data.
These functions should apply equally well to both the Codeup blog articles and
the news articles that were previously acquired.</p>


In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

from time import strftime
import acquire_codeup_blog
import acquire_inshots

1. <p>Define a function named <code>basic_clean</code>. It should take in a string and apply
   some basic text cleaning to it:</p>
<ul>
<li>Lowercase everything</li>
<li>Normalize unicode characters</li>
<li>Remove anything that is not a letter, number, whitespace or a single quote.</li>
</ul>


In [2]:
def basic_clean(corpus):
    '''
    Takes in a string and returns it lowercased, with unicode characters
    normalized to ASCII and everything that is not a letter, number,
    whitespace or single quote removed.
    '''
    lower_corpus = corpus.lower()
    # NFKD splits accented characters into a base letter plus a combining mark;
    # the ASCII round trip then drops whatever has no ASCII equivalent.
    normal_corpus = unicodedata.normalize('NFKD', lower_corpus)\
        .encode('ascii', 'ignore')\
        .decode('utf-8', 'ignore')
    # Keep only lowercase letters, digits, single quotes and whitespace.
    basic_clean_corpus = re.sub(r"[^a-z0-9'\s]", '', normal_corpus)
    return basic_clean_corpus
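
To see what each step of the cleaning contributes, the sketch below runs one made-up sample string through the same three operations. It re-declares the function using only the standard library so that it runs on its own; the sample string is an illustrative assumption, not data from the articles.

```python
import re
import unicodedata

def basic_clean(corpus):
    # Same three steps as the function above: lowercase, normalize, filter.
    lowered = corpus.lower()
    ascii_only = (unicodedata.normalize('NFKD', lowered)
                  .encode('ascii', 'ignore')
                  .decode('utf-8', 'ignore'))
    return re.sub(r"[^a-z0-9'\s]", '', ascii_only)

sample = "Café's NEW menu: $5 crêpes!"  # hypothetical input
cleaned = basic_clean(sample)
print(cleaned)  # cafe's new menu 5 crepes
```

Accented letters survive as their base letter, while symbols with no ASCII equivalent are dropped entirely. This is also why `C$10` and `₹58` show up as `c10` and `58` in the cleaned news text further down.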

2. Define a function named <code>tokenize</code>. It should take in a string and tokenize all the words in the string.



In [3]:
def tokenize(string):
    '''Tokenizes a string with NLTK's ToktokTokenizer and returns the tokens joined back into one string.'''
    tokenizer = ToktokTokenizer()
    return tokenizer.tokenize(string, return_str=True)

3. Define a function named <code>stem</code>. It should accept some text and return the text after applying stemming to all the words.
   


In [4]:
def stem(text):
    '''
    Takes in text and returns it with each whitespace-separated word replaced by its NLTK Porter stem.
    '''
    ps = nltk.stem.PorterStemmer()
    stems = [ps.stem(word) for word in text.split()]
    stemmed_text = ' '.join(stems)
    return stemmed_text


4. Define a function named <code>lemmatize</code>. It should accept some text and return the text after applying lemmatization to each word.



In [5]:
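def lemmatize(text):
    '''
    Takes in text and returns it with each whitespace-separated word lemmatized
    by NLTK's WordNetLemmatizer (requires the WordNet corpus: nltk.download('wordnet')).
    '''
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in text.split()]
    lemmatized_text = ' '.join(lemmas)
    return lemmatized_text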
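
The two approaches differ in mechanism: a stemmer applies suffix rules and can produce non-words (the outputs further down include `nearli` and `campu`), while a lemmatizer looks words up and returns dictionary forms. The sketch below mimics that contrast with deliberately tiny stand-ins: a hypothetical three-rule suffix stripper and a hypothetical four-entry lookup table. Neither is the Porter algorithm or WordNet; they only illustrate the idea.

```python
# Toy stand-ins for illustration only (NOT Porter, NOT WordNet).
SUFFIX_RULES = [("ies", "i"), ("ed", ""), ("s", "")]    # hypothetical rules
LEMMA_TABLE = {"studies": "study", "climbed": "climb",  # hypothetical entries
               "shares": "share", "mice": "mouse"}

def toy_stem(word):
    # Apply the first matching suffix rule; no dictionary is consulted.
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    # Look the word up; unknown words pass through unchanged.
    return LEMMA_TABLE.get(word, word)

words = "studies climbed shares campus mice".split()
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)   # ['studi', 'climb', 'share', 'campu', 'mice']
print(lemmas)  # ['study', 'climb', 'share', 'campus', 'mouse']
```

The rule-based version is cheap but blind: it truncates `campus` and cannot relate `mice` to `mouse`. The lookup version returns real words but is only as good as its table.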

5. <p>Define a function named <code>remove_stopwords</code>. It should accept some text and return the text after removing all the stopwords.</p>
<p>This function should define two optional parameters, <code>extra_words</code> and <code>exclude_words</code>. These parameters should define any additional stop words to
include, and any words that we <em>don't</em> want to remove.</p>



In [6]:
def remove_stopwords(string, extra_words=None, exclude_words=None):
    '''
    Takes in a string plus optional lists of extra_words (additional stopwords
    to remove) and exclude_words (stopwords to keep), and returns the string
    with the stopwords removed.
    '''
    # Start from NLTK's English stopword list (requires nltk.download('stopwords')).
    stopword_list = stopwords.words('english')
    
    # Remove 'exclude_words' from stopword_list to keep these in my text.
    stopword_list = set(stopword_list) - set(exclude_words or [])
    
    # Add in 'extra_words' to stopword_list.
    stopword_list = stopword_list.union(set(extra_words or []))

    # Split words in string.
    words = string.split()
    
    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_list]
    
    # Join words in the list back into strings and assign to a variable.
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords
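
A quick way to check the two optional parameters is to run the same set logic against a small, hard-coded stopword set. The sketch below does that. `BASE_STOPWORDS` is a hypothetical stand-in for `stopwords.words('english')` so the example runs without downloading the NLTK corpus, and the sample sentence is made up.

```python
# Hypothetical stand-in for stopwords.words('english').
BASE_STOPWORDS = {"the", "is", "no", "a", "of", "and"}

def remove_stopwords_demo(string, extra_words=None, exclude_words=None):
    # Same set arithmetic as remove_stopwords above: subtract, then add.
    stopword_set = (BASE_STOPWORDS - set(exclude_words or [])) | set(extra_words or [])
    return ' '.join(word for word in string.split() if word not in stopword_set)

text = "the answer is no and the reason is a secret"
default_result = remove_stopwords_demo(text)
keep_no = remove_stopwords_demo(text, exclude_words=["no"])
drop_secret = remove_stopwords_demo(text, extra_words=["secret"])

print(default_result)  # answer reason secret
print(keep_no)         # answer no reason secret
print(drop_secret)     # answer reason
```

Because the extra words are added after the excluded words are subtracted, a word passed in both lists ends up being removed.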

6. Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe <code>news_df</code>.



In [9]:
news_df = acquire_inshots.get_articles()
news_df.head()




Results saved to CSV file


Unnamed: 0,title,author,body,date,category
0,Bezos gets richer by $19 bn in a day as Amazon...,Hiral Goyal,Amazon Founder Jeff Bezos' net worth climbed b...,05 Feb 2022,business
1,It's not my money: Musk as GoFundMe blocks C$1...,Kiran Khatri,After GoFundMe blocked C$10 million (over ₹58 ...,05 Feb 2022,business
2,Dutch locals to throw eggs at Bezos' superyach...,Kiran Khatri,After reports said the 1927 'De Hef' bridge in...,05 Feb 2022,business
3,Amazon adds $135 bn in one of the biggest 1-da...,Hiral Goyal,Amazon added more than $135 billion in market ...,04 Feb 2022,business
4,Amazon adds $191 billion in value in biggest 1...,Kiran Khatri,Amazon shares jumped nearly 14% on Friday to a...,05 Feb 2022,business


7. Make another dataframe for the Codeup blog posts. Name the dataframe <code>codeup_df</code>.
   


In [7]:
codeup_df = acquire_codeup_blog.get_blog()




Results saved to CSV file


In [8]:
codeup_df.head()

Unnamed: 0,title,date,post
0,Codeup Dallas Open House,"Nov 30, 2021",Come join us for the re-opening of our Dallas ...
1,Codeup’s Placement Team Continues Setting Records,"Nov 19, 2021",Our Placement Team is simply defined as a grou...
2,"IT Certifications 101: Why They Matter, and Wh...","Nov 18, 2021","AWS, Google, Azure, Red Hat, CompTIA…these are..."
3,A rise in cyber attacks means opportunities fo...,"Nov 17, 2021","In the last few months, the US has experienced..."
4,Use your GI Bill® benefits to Land a Job in Tech,"Nov 4, 2021","As the end of military service gets closer, ma..."


8. <p>For each dataframe, produce the following columns:</p>
<ul>
<li><code>title</code> to hold the title</li>
<li><code>original</code> to hold the original article/post content</li>
<li><code>clean</code> to hold the normalized and tokenized original with the stopwords removed.</li>
<li><code>stemmed</code> to hold the stemmed version of the cleaned data.</li>
<li><code>lemmatized</code> to hold the lemmatized version of the cleaned data.</li>
</ul>


In [10]:
codeup_df.head()

Unnamed: 0,title,date,post
0,Codeup Dallas Open House,"Nov 30, 2021",Come join us for the re-opening of our Dallas ...
1,Codeup’s Placement Team Continues Setting Records,"Nov 19, 2021",Our Placement Team is simply defined as a grou...
2,"IT Certifications 101: Why They Matter, and Wh...","Nov 18, 2021","AWS, Google, Azure, Red Hat, CompTIA…these are..."
3,A rise in cyber attacks means opportunities fo...,"Nov 17, 2021","In the last few months, the US has experienced..."
4,Use your GI Bill® benefits to Land a Job in Tech,"Nov 4, 2021","As the end of military service gets closer, ma..."


In [11]:
news_df.head()

Unnamed: 0,title,author,body,date,category
0,Bezos gets richer by $19 bn in a day as Amazon...,Hiral Goyal,Amazon Founder Jeff Bezos' net worth climbed b...,05 Feb 2022,business
1,It's not my money: Musk as GoFundMe blocks C$1...,Kiran Khatri,After GoFundMe blocked C$10 million (over ₹58 ...,05 Feb 2022,business
2,Dutch locals to throw eggs at Bezos' superyach...,Kiran Khatri,After reports said the 1927 'De Hef' bridge in...,05 Feb 2022,business
3,Amazon adds $135 bn in one of the biggest 1-da...,Hiral Goyal,Amazon added more than $135 billion in market ...,04 Feb 2022,business
4,Amazon adds $191 billion in value in biggest 1...,Kiran Khatri,Amazon shares jumped nearly 14% on Friday to a...,05 Feb 2022,business


In [12]:
news_df.rename(columns={'body': 'original'}, inplace=True)
codeup_df.rename(columns={'post': 'original'}, inplace=True)

In [13]:
news_df.head()

Unnamed: 0,title,author,original,date,category
0,Bezos gets richer by $19 bn in a day as Amazon...,Hiral Goyal,Amazon Founder Jeff Bezos' net worth climbed b...,05 Feb 2022,business
1,It's not my money: Musk as GoFundMe blocks C$1...,Kiran Khatri,After GoFundMe blocked C$10 million (over ₹58 ...,05 Feb 2022,business
2,Dutch locals to throw eggs at Bezos' superyach...,Kiran Khatri,After reports said the 1927 'De Hef' bridge in...,05 Feb 2022,business
3,Amazon adds $135 bn in one of the biggest 1-da...,Hiral Goyal,Amazon added more than $135 billion in market ...,04 Feb 2022,business
4,Amazon adds $191 billion in value in biggest 1...,Kiran Khatri,Amazon shares jumped nearly 14% on Friday to a...,05 Feb 2022,business


In [14]:
codeup_df.head()

Unnamed: 0,title,date,original
0,Codeup Dallas Open House,"Nov 30, 2021",Come join us for the re-opening of our Dallas ...
1,Codeup’s Placement Team Continues Setting Records,"Nov 19, 2021",Our Placement Team is simply defined as a grou...
2,"IT Certifications 101: Why They Matter, and Wh...","Nov 18, 2021","AWS, Google, Azure, Red Hat, CompTIA…these are..."
3,A rise in cyber attacks means opportunities fo...,"Nov 17, 2021","In the last few months, the US has experienced..."
4,Use your GI Bill® benefits to Land a Job in Tech,"Nov 4, 2021","As the end of military service gets closer, ma..."


In [15]:
def prep_article_data(df, column, extra_words=None, exclude_words=None):
    '''
    Takes in a df and the name of its text column, with optional lists of
    extra_words and exclude_words for stopword removal. Adds clean, stemmed and
    lemmatized columns to df (in place) and returns a df with the title, the
    original text and the three new columns.
    '''
    extra_words = extra_words or []
    exclude_words = exclude_words or []

    # Normalize and tokenize once; all three columns start from this.
    tokenized = df[column].apply(basic_clean).apply(tokenize)

    df['clean'] = tokenized.apply(remove_stopwords,
                                  extra_words=extra_words,
                                  exclude_words=exclude_words)

    # Stopwords are removed after stemming/lemmatizing, so a reduced form of a
    # stopword is only dropped if it is passed in extra_words.
    df['stemmed'] = tokenized.apply(stem)\
                             .apply(remove_stopwords,
                                    extra_words=extra_words,
                                    exclude_words=exclude_words)

    df['lemmatized'] = tokenized.apply(lemmatize)\
                                .apply(remove_stopwords,
                                       extra_words=extra_words,
                                       exclude_words=exclude_words)

    return df[['title', column, 'clean', 'stemmed', 'lemmatized']]

In [16]:
prep_article_data(news_df, 'original', extra_words = ['ha'], exclude_words = ['no']).head()


Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Bezos gets richer by $19 bn in a day as Amazon...,Amazon Founder Jeff Bezos' net worth climbed b...,amazon founder jeff bezos ' net worth climbed ...,amazon founder jeff bezo ' net worth climb 188...,amazon founder jeff bezos ' net worth climbed ...
1,It's not my money: Musk as GoFundMe blocks C$1...,After GoFundMe blocked C$10 million (over ₹58 ...,gofundme blocked c10 million 58 crore raised c...,gofundm block c10 million 58 crore rais canadi...,gofundme blocked c10 million 58 crore raised c...
2,Dutch locals to throw eggs at Bezos' superyach...,After reports said the 1927 'De Hef' bridge in...,reports said 1927 ' de hef ' bridge netherland...,report said 1927 ' de hef ' bridg netherland '...,report said 1927 ' de hef ' bridge netherlands...
3,Amazon adds $135 bn in one of the biggest 1-da...,Amazon added more than $135 billion in market ...,amazon added 135 billion market value one bigg...,amazon ad 135 billion market valu one biggest ...,amazon added 135 billion market value one bigg...
4,Amazon adds $191 billion in value in biggest 1...,Amazon shares jumped nearly 14% on Friday to a...,amazon shares jumped nearly 14 friday add 191 ...,amazon share jump nearli 14 friday add 191 bil...,amazon share jumped nearly 14 friday add 191 b...


In [17]:
prep_article_data(codeup_df, 'original', extra_words = ['ha'], exclude_words = ['no']).head()


Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Codeup Dallas Open House,Come join us for the re-opening of our Dallas ...,come join us reopening dallas campus drinks sn...,come join us reopen dalla campu drink snack co...,come join u reopening dallas campus drink snac...
1,Codeup’s Placement Team Continues Setting Records,Our Placement Team is simply defined as a grou...,placement team simply defined group manages re...,placement team simpli defin group manag relati...,placement team simply defined group manages re...
2,"IT Certifications 101: Why They Matter, and Wh...","AWS, Google, Azure, Red Hat, CompTIA…these are...",aws google azure red hat comptiathese big name...,aw googl azur red hat comptiathes big name onl...,aws google azure red hat comptiathese big name...
3,A rise in cyber attacks means opportunities fo...,"In the last few months, the US has experienced...",last months us experienced dozens major cybera...,last month us experienc dozen major cyberattac...,last month u experienced dozen major cyberatta...
4,Use your GI Bill® benefits to Land a Job in Tech,"As the end of military service gets closer, ma...",end military service gets closer many transiti...,end militari servic get closer mani transit se...,end military service get closer many transitio...
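
Both calls above pass `extra_words = ['ha']`. One plausible reason is the order of operations in `prep_article_data`: words are stemmed or lemmatized before stopwords are removed, so a stopword whose reduced form is no longer in the stopword list slips through. The sketch below reproduces that effect with hypothetical stand-ins (a stemmer that only strips a trailing `s`, and a three-word stopword set). It is not the NLTK behavior itself.

```python
STOPWORDS = {"the", "has", "of"}  # hypothetical stopword set

def toy_stem(word):
    # Hypothetical stemmer: strip one trailing 's' from words longer than two letters.
    return word[:-1] if word.endswith("s") and len(word) > 2 else word

def drop_stopwords(words):
    return [w for w in words if w not in STOPWORDS]

words = "the team has dozens of records".split()

# Order used in prep_article_data: reduce the words first, then filter.
stem_then_filter = drop_stopwords([toy_stem(w) for w in words])
# Alternative order: filter first, then reduce.
filter_then_stem = [toy_stem(w) for w in drop_stopwords(words)]

print(stem_then_filter)  # ['team', 'ha', 'dozen', 'record']
print(filter_then_stem)  # ['team', 'dozen', 'record']
```

Either removing stopwords before stemming or adding the reduced forms to `extra_words` gets rid of the leftover `ha`.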




9. <p>Ask yourself:</p>
<ul>
<li>If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?</li>
<li>If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?</li>
<li>If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?</li>
</ul>

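Lemmatize for the first two: at 493KB or 25MB, the slower, dictionary-based lemmatization is affordable and gives cleaner word forms. For the 200TB corpus, stemming is likely the better choice because it is cheaper to compute, unless it gives poor results or you have deep pockets and resources to spare.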
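
The reasoning is about scale: whatever extra work lemmatization costs per megabyte is multiplied by the corpus size. The sketch below makes that concrete. The sizes come from the questions above (decimal units); the price per megabyte and the 3x cost ratio are hypothetical placeholders, not measurements, so substitute your own.

```python
# Corpus sizes from the three questions, in megabytes (decimal units).
CORPUS_MB = {"493 KB": 0.493, "25 MB": 25.0, "200 TB": 200 * 1000 * 1000}

STEM_PRICE_PER_MB = 0.001  # HYPOTHETICAL price for stemming one MB
LEMMA_COST_RATIO = 3       # HYPOTHETICAL: lemmatizing costs 3x as much

def extra_cost(size_mb):
    # Additional spend from choosing lemmatization over stemming.
    return size_mb * STEM_PRICE_PER_MB * (LEMMA_COST_RATIO - 1)

for label, size_mb in CORPUS_MB.items():
    print(f"{label:>7}: extra cost of lemmatizing = {extra_cost(size_mb):,.3f}")
```

With these placeholder numbers the difference is negligible for the two small corpora and large for the 200TB one, which is the shape of the argument regardless of the actual prices.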