In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import acquire

In this exercise we will define some functions to prepare text data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:
    * Lowercase everything
    * Normalize unicode characters
    * Remove anything that is not a letter, number, whitespace, or a single quote.

In [2]:
def basic_clean(article):
    """ lowercases, normalizes, and destroys special characters of an article (string) """
    # lowercase
    article = article.lower()
    # normalize
    article = unicodedata.normalize('NFKD', article).encode('ascii', 'ignore').decode('utf-8')
    # remove special characters
    article = re.sub(r"[^a-z0-9'\s]", "", article)
    
    return article
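To see what each of the three steps does, here is a minimal self-contained sketch that repeats the same logic on a made-up sample string (the sample text is an assumption, not taken from the acquired data). Note that the ASCII encoding step drops the curly apostrophe entirely, while a straight single quote survives the regex.

```python
import re
import unicodedata

def basic_clean(article):
    """Lowercase, strip non-ASCII characters, and remove special characters."""
    article = article.lower()
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with 'ignore' then drops whatever is not ASCII
    article = unicodedata.normalize('NFKD', article).encode('ascii', 'ignore').decode('utf-8')
    # keep only lowercase letters, digits, single quotes, and whitespace
    return re.sub(r"[^a-z0-9'\s]", "", article)

# \u2019 is a curly apostrophe, \u00e9 is an accented e
curly = basic_clean("Codeup\u2019s Caf\u00e9: 25% OFF!")
straight = basic_clean("China's top hot pot chain")

print(curly)     # codeups cafe 25 off
print(straight)  # china's top hot pot chain
```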

2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [3]:
def tokenize(article):
    """ tokenize a basic_clean-ed article (string) """
    tokenizer = nltk.tokenize.ToktokTokenizer() # create tokenizer
    article = tokenizer.tokenize(article, return_str = True) # tokenize
    
    return article
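A short sketch of the tokenizer call used above, on a made-up sample string. With return_str=True the tokenizer returns a single space-separated string rather than a list, which is why the later steps can call .split() on the result. It also separates the apostrophe into its own token.

```python
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()
text = "china's top hot pot chain"

# return_str=True gives one string with tokens separated by spaces
as_string = tokenizer.tokenize(text, return_str=True)
# the default (return_str=False) gives a list of tokens
as_list = tokenizer.tokenize(text)

print(as_string)
print(as_list)
```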

3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [4]:
def stem(article):
    """ stem all words in an article (string) """
    ps = nltk.porter.PorterStemmer() # create stemmer
    stems = [ps.stem(word) for word in article.split()] # list comprehension of stems
    article_stemmed = ' '.join(stems) # re-join list as article
    
    return article_stemmed
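The Porter stemmer works by stripping suffixes with fixed rules, so it can produce strings that are not real words. A small sketch with a few words that also appear in the news articles:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ['launched', 'offering', 'has', 'his']

# stem each word independently; no dictionary lookup is involved
stems = [ps.stem(word) for word in words]

print(dict(zip(words, stems)))
# {'launched': 'launch', 'offering': 'offer', 'has': 'ha', 'his': 'hi'}
```

The stems 'ha' and 'hi' show the trade-off: stemming is fast and needs no corpus download, but it can collapse short words into non-words.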

4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [5]:
def lemmatize(article):
    """ lemmatize all words in an article (string) """
    nltk.download('wordnet') # make sure the WordNet corpus is available (checked on every call)
    wnl = nltk.stem.WordNetLemmatizer() # create lemmatizer
    lemmas = [wnl.lemmatize(word) for word in article.split()] # list comp of lemmas
    article_lemmatized = ' '.join(lemmas) # re-join list as article
    
    return article_lemmatized

5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.
    * This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [6]:
def remove_stopwords(article, extra_words=None, exclude_words=None):
    """ remove stopwords from an article (string) """
    stopword_list = set(stopwords.words('english')) # get default stopword list
    stopword_list |= set(extra_words or []) # add any extra stopwords
    stopword_list -= set(exclude_words or []) # keep any words we don't want removed
    words = article.split() # split for stopword removal
    filtered_words = [word for word in words if word not in stopword_list] # ignore stopwords
    article_without_stopwords = ' '.join(filtered_words) # re-join list to article
    
    return article_without_stopwords
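The effect of the optional extra_words and exclude_words parameters described in the task can be illustrated with a stand-alone sketch. It uses a tiny hard-coded stopword set (DEMO_STOPWORDS, an assumption standing in for NLTK's English list) so that it runs without downloading any corpus, and the sample sentence is made up.

```python
# Hypothetical stand-in for stopwords.words('english')
DEMO_STOPWORDS = {'the', 'is', 'a', 'has', 'and', 'on'}

def remove_stopwords(article, extra_words=None, exclude_words=None):
    """Remove stopwords, optionally adding or sparing specific words."""
    # start from the base set, add extras, then take out the exclusions
    stopword_set = (DEMO_STOPWORDS | set(extra_words or [])) - set(exclude_words or [])
    return ' '.join(word for word in article.split() if word not in stopword_set)

text = 'apollo 247 is offering a flat 25 discount on the first order'

default = remove_stopwords(text)
with_extra = remove_stopwords(text, extra_words=['apollo'])
with_exclude = remove_stopwords(text, exclude_words=['on'])

print(default)       # apollo 247 offering flat 25 discount first order
print(with_extra)    # 247 offering flat 25 discount first order
print(with_exclude)  # apollo 247 offering flat 25 discount on first order
```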

6. Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

In [7]:
news_df = acquire.get_news()
news_df

Unnamed: 0,headline,publish_time,category,content
0,Refer friends & get a chance to win Bitcoin wo...,11:39 am,Business,CoinSwitch Kuber has launched 'CSK Referral Le...
1,Making quality medicines super affordable with...,10:00 am,Business,Apollo 24|7 is offering flat 25% discount on f...
2,China's new COVID-19 outbreak wipes $4 billion...,03:51 pm,Business,China's top hot pot chain has lost $4 billion ...
3,Shiba Inu jumps 40% to record high after anony...,02:46 pm,Business,Meme-based cryptocurrency Shiba Inu (SHIB) jum...
4,"Wow, 13 years ago: Musk on old video from when...",03:46 pm,Business,Tesla CEO and the world's richest person Elon ...
...,...,...,...,...
20,Was heartbreaking to see SRK going to jail to ...,06:31 pm,Entertainment,Adhyayan Summan spoke about his tweet wherein ...
21,"Went for drive, forgot Mehr when she was 40 da...",06:33 pm,Entertainment,Actress Neha Dhupia has revealed that she and ...
22,"'Dirty Little Billy', 'Star Trek' actor Richar...",07:58 pm,Entertainment,Veteran Hollywood actor Richard Evans passed a...
23,Sequel to Shahid's debut film 'Ishq Vishk' in ...,08:37 pm,Entertainment,"A sequel to the 2003 film 'Ishq Vishk', which ..."


7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [8]:
codeup_df = acquire.get_blogs()
codeup_df

Unnamed: 0,title,date,category,content
0,Codeup’s Data Science Career Accelerator is Here!,"Sep 30, 2018",Data Science,The rumors are true! The time has arrived. Cod...
1,Data Science Myths,"Oct 31, 2018",Data Science,By Dimitri Antoniou and Maggie Giust Data Scie...
2,Data Science VS Data Analytics: What’s The Dif...,"Oct 17, 2018",Data Science,"By Dimitri Antoniou A week ago, Codeup launche..."
3,10 Tips to Crush It at the SA Tech Job Fair,"Aug 14, 2018",Tips for Prospective Students,The third bi-annual San Antonio Tech Job Fair ...
4,Competitor Bootcamps Are Closing. Is the Model...,"Aug 14, 2018",Codeup News,"In recent news, DevBootcamp and The Iron Yar..."


8. For each dataframe, produce the following columns:
    * title to hold the title
    * original to hold the original article/post content
    * clean to hold the normalized and tokenized original with the stopwords removed.
    * stemmed to hold the stemmed version of the cleaned data.
    * lemmatized to hold the lemmatized version of the cleaned data.

In [9]:
news_df['clean'] = news_df['content'].apply(basic_clean).apply(tokenize)
news_df['stemmed'] = news_df['clean'].apply(stem)
news_df['lemmatized'] = news_df['clean'].apply(lemmatize)
news_df

[nltk_data] Downloading package wordnet to /Users/jake/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

Unnamed: 0,headline,publish_time,category,content,clean,stemmed,lemmatized
0,Refer friends & get a chance to win Bitcoin wo...,11:39 am,Business,CoinSwitch Kuber has launched 'CSK Referral Le...,coinswitch kuber has launched ' csk referral l...,coinswitch kuber ha launch ' csk referr leagu ...,coinswitch kuber ha launched ' csk referral le...
1,Making quality medicines super affordable with...,10:00 am,Business,Apollo 24|7 is offering flat 25% discount on f...,apollo 247 is offering flat 25 discount on fir...,apollo 247 is offer flat 25 discount on first ...,apollo 247 is offering flat 25 discount on fir...
2,China's new COVID-19 outbreak wipes $4 billion...,03:51 pm,Business,China's top hot pot chain has lost $4 billion ...,china ' s top hot pot chain has lost 4 billion...,china ' s top hot pot chain ha lost 4 billion ...,china ' s top hot pot chain ha lost 4 billion ...
3,Shiba Inu jumps 40% to record high after anony...,02:46 pm,Business,Meme-based cryptocurrency Shiba Inu (SHIB) jum...,memebased cryptocurrency shiba inu shib jumped...,memebas cryptocurr shiba inu shib jump over 40...,memebased cryptocurrency shiba inu shib jumped...
4,"Wow, 13 years ago: Musk on old video from when...",03:46 pm,Business,Tesla CEO and the world's richest person Elon ...,tesla ceo and the world ' s richest person elo...,tesla ceo and the world ' s richest person elo...,tesla ceo and the world ' s richest person elo...
...,...,...,...,...,...,...,...
20,Was heartbreaking to see SRK going to jail to ...,06:31 pm,Entertainment,Adhyayan Summan spoke about his tweet wherein ...,adhyayan summan spoke about his tweet wherein ...,adhyayan summan spoke about hi tweet wherein h...,adhyayan summan spoke about his tweet wherein ...
21,"Went for drive, forgot Mehr when she was 40 da...",06:33 pm,Entertainment,Actress Neha Dhupia has revealed that she and ...,actress neha dhupia has revealed that she and ...,actress neha dhupia ha reveal that she and hus...,actress neha dhupia ha revealed that she and h...
22,"'Dirty Little Billy', 'Star Trek' actor Richar...",07:58 pm,Entertainment,Veteran Hollywood actor Richard Evans passed a...,veteran hollywood actor richard evans passed a...,veteran hollywood actor richard evan pass away...,veteran hollywood actor richard evans passed a...
23,Sequel to Shahid's debut film 'Ishq Vishk' in ...,08:37 pm,Entertainment,"A sequel to the 2003 film 'Ishq Vishk', which ...",a sequel to the 2003 film ' ishq vishk ' which...,a sequel to the 2003 film ' ishq vishk ' which...,a sequel to the 2003 film ' ishq vishk ' which...


In [10]:
codeup_df['clean'] = codeup_df['content'].apply(basic_clean).apply(tokenize)
codeup_df['stemmed'] = codeup_df['clean'].apply(stem)
codeup_df['lemmatized'] = codeup_df['clean'].apply(lemmatize)
codeup_df

[nltk_data] Downloading package wordnet to /Users/jake/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,title,date,category,content,clean,stemmed,lemmatized
0,Codeup’s Data Science Career Accelerator is Here!,"Sep 30, 2018",Data Science,The rumors are true! The time has arrived. Cod...,the rumors are true the time has arrived codeu...,the rumor are true the time ha arriv codeup ha...,the rumor are true the time ha arrived codeup ...
1,Data Science Myths,"Oct 31, 2018",Data Science,By Dimitri Antoniou and Maggie Giust Data Scie...,by dimitri antoniou and maggie giust data scie...,by dimitri antoni and maggi giust data scienc ...,by dimitri antoniou and maggie giust data scie...
2,Data Science VS Data Analytics: What’s The Dif...,"Oct 17, 2018",Data Science,"By Dimitri Antoniou A week ago, Codeup launche...",by dimitri antoniou a week ago codeup launched...,by dimitri antoni a week ago codeup launch our...,by dimitri antoniou a week ago codeup launched...
3,10 Tips to Crush It at the SA Tech Job Fair,"Aug 14, 2018",Tips for Prospective Students,The third bi-annual San Antonio Tech Job Fair ...,the third biannual san antonio tech job fair i...,the third biannual san antonio tech job fair i...,the third biannual san antonio tech job fair i...
4,Competitor Bootcamps Are Closing. Is the Model...,"Aug 14, 2018",Codeup News,"In recent news, DevBootcamp and The Iron Yar...",in recent news devbootcamp and the iron yard a...,in recent news devbootcamp and the iron yard a...,in recent news devbootcamp and the iron yard a...
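The cells above add clean, stemmed, and lemmatized columns but do not build the title and original columns listed in the task. One possible way to produce all five columns for either dataframe is sketched below. The helper name prepare_columns and its parameters are hypothetical, and the demo passes simple stand-in functions so that the sketch runs without any NLTK data.

```python
import pandas as pd

def prepare_columns(df, title_col, content_col, clean_fn, stem_fn, lemmatize_fn):
    """Return a new dataframe with title, original, clean, stemmed, lemmatized."""
    out = pd.DataFrame({
        'title': df[title_col],       # e.g. 'headline' for news, 'title' for blogs
        'original': df[content_col],  # untouched article text
    })
    out['clean'] = out['original'].apply(clean_fn)
    # stemmed and lemmatized are both derived from the cleaned text
    out['stemmed'] = out['clean'].apply(stem_fn)
    out['lemmatized'] = out['clean'].apply(lemmatize_fn)
    return out

# Made-up one-row dataframe and stand-in step functions for the demo
demo = pd.DataFrame({'headline': ['Hello World'], 'content': ['The Cats Sat']})
result = prepare_columns(
    demo, 'headline', 'content',
    clean_fn=str.lower,       # stand-in for the cleaning chain
    stem_fn=lambda s: s,      # stand-in: returns text unchanged
    lemmatize_fn=lambda s: s, # stand-in: returns text unchanged
)
print(result)
```

In this notebook, clean_fn would be a function that chains basic_clean, tokenize, and remove_stopwords, while stem_fn and lemmatize_fn would be the stem and lemmatize functions defined earlier.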


9. Ask yourself:
    * If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
    * If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
    * If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?