In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/curtisjohansen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# define categories
categories = ["business", "sports", "technology", "entertainment"]

# use the get_all_news_articles function from the acquire.py file

news_df = acquire.get_all_news_articles(categories)

In [4]:
# look at the head of dataframe
news_df.head()

| | title | content | category |
|---|---|---|---|
| 0 | Facebook changes its company name to 'Meta' | Facebook on Thursday announced it's changing t... | business |
| 1 | 'Man who takes 6 months parental leave is a lo... | Several Twitter users criticised US-based Pala... | business |
| 2 | Delhi HC notice to RBI, SBI over banning UPI p... | The Delhi High Court on Thursday issued notice... | business |
| 3 | Paytm will not force employees to come to offi... | Paytm will continue to allow employees to work... | business |
| 4 | Legacy companies eat Ola, Ather, Tork & SmartE... | Bajaj Auto's MD Rajiv Bajaj on Thursday took a... | business |


In [5]:
# let's use the content of the first news item as 'article' to test my functions

article = news_df.content[0]
article

'Facebook on Thursday announced it\'s changing the company\'s name to \'Meta\' to reflect its focus on \'metaverse\'. Using technologies like augmented reality and virtual reality, Facebook plans to create a greater sense of "virtual presence" to mimic the experience of interacting in person. "Our apps and our brands, they\'re not changing," Zuckerberg said.'

#### In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

### 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:
- Lowercase everything
- Normalize unicode characters
- Replace anything that is not a letter, number, whitespace or a single quote

In [6]:
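def basic_clean(string):
    '''
    This function takes in a string and returns it normalized:
    unicode characters converted to ASCII, punctuation removed, and lowercased.
    Note: the pattern below also strips single quotes (e.g. "it's" -> "its").
    '''
    # decompose accented characters, then drop anything that is not ASCII
    string = unicodedata.normalize('NFKD', string)\
             .encode('ascii', 'ignore')\
             .decode('utf-8', 'ignore')
    # remove every character that is not a word character or whitespace, then lowercase
    string = re.sub(r'[^\w\s]', '', string).lower()
    return string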

In [7]:
# use the function defined above

basic_clean(article)

'facebook on thursday announced its changing the companys name to meta to reflect its focus on metaverse using technologies like augmented reality and virtual reality facebook plans to create a greater sense of virtual presence to mimic the experience of interacting in person our apps and our brands theyre not changing zuckerberg said'
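
The exercise asks to keep single quotes, while the output above shows them removed ("its", "companys", "theyre"). A minimal sketch of a variant that keeps them is shown below. The name `basic_clean_keep_quotes` and the sample sentence are hypothetical and not part of the original notebook; the rest of this notebook keeps using `basic_clean` as defined above.

```python
import re
import unicodedata

def basic_clean_keep_quotes(text):
    """Hypothetical variant of basic_clean that preserves single quotes."""
    # decompose accented characters and drop whatever is not ASCII
    text = unicodedata.normalize('NFKD', text)\
                      .encode('ascii', 'ignore')\
                      .decode('utf-8', 'ignore')
    # lowercase first so the character class only needs a-z
    text = text.lower()
    # keep letters, digits, whitespace, and single quotes; remove the rest
    return re.sub(r"[^a-z0-9'\s]", '', text)

print(basic_clean_keep_quotes('Café\'s "Menu" costs $5!'))
# cafe's menu costs 5
```

Unlike `\w`, the explicit character class also removes underscores. Whether that matters depends on the text being cleaned.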

### 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [8]:
def tokenize(string):
    '''
    This function takes in a string and returns a tokenized string.
    
    '''
    # create the tokenizer
    tokenizer = nltk.tokenize.ToktokTokenizer()
    # Use the tokenizer
    string = tokenizer.tokenize(string, return_str=True)
    
    return string

In [9]:
# Use the function defined above

tokenize(article)

'Facebook on Thursday announced it \' s changing the company \' s name to \' Meta \' to reflect its focus on \' metaverse \' . Using technologies like augmented reality and virtual reality , Facebook plans to create a greater sense of " virtual presence " to mimic the experience of interacting in person. " Our apps and our brands , they \' re not changing , " Zuckerberg said .'

### 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [10]:
def stem(string):
    '''
    This function takes in a string and
    returns a string with words stemmed.
    '''
    # Create porter stemmer.
    ps = nltk.porter.PorterStemmer()
    
    # Use the stemmer to stem each word in the list of words we created by using split.
    stems = [ps.stem(word) for word in string.split()]
    
    # Join our lists of words into a string again and assign to a variable.
    string = ' '.join(stems)
    
    return string

In [11]:
# use the function defined above

stem(article)

'facebook on thursday announc it\' chang the company\' name to \'meta\' to reflect it focu on \'metaverse\'. use technolog like augment realiti and virtual reality, facebook plan to creat a greater sens of "virtual presence" to mimic the experi of interact in person. "our app and our brands, they\'r not changing," zuckerberg said.'

### 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [12]:
def lemmatize(string):
    '''
    This function takes in a string and
    returns a string with words lemmatized.
    '''
    # Create the lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Use the lemmatizer on each word in the list of words we created by using split.
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    # Join our list of words into a string again and assign to a variable.
    string = ' '.join(lemmas)
    
    return string

In [13]:
# use the function defined above

lemmatize(article)

'Facebook on Thursday announced it\'s changing the company\'s name to \'Meta\' to reflect it focus on \'metaverse\'. Using technology like augmented reality and virtual reality, Facebook plan to create a greater sense of "virtual presence" to mimic the experience of interacting in person. "Our apps and our brands, they\'re not changing," Zuckerberg said.'

### 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.
This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [14]:
def remove_stopwords(string, extra_words = [], exclude_words = []):
    '''
    This function takes in a string, optional extra_words and exclude_words parameters
    with default empty lists and returns a string.
    '''
    # Create stopword_list.
    stopword_list = stopwords.words('english')
    
    # Remove 'exclude_words' from stopword_list to keep these in my text.
    stopword_list = set(stopword_list) - set(exclude_words)
    
    # Add in 'extra_words' to stopword_list.
    stopword_list = stopword_list.union(set(extra_words))
    
    # Split words in string.
    words = string.split()
    
    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_list]
    
    # Join words in the list back into strings and assign to a variable.
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords

In [15]:
# use the function defined above

remove_stopwords(article)

'Facebook Thursday announced changing company\'s name \'Meta\' reflect focus \'metaverse\'. Using technologies like augmented reality virtual reality, Facebook plans create greater sense "virtual presence" mimic experience interacting person. "Our apps brands, they\'re changing," Zuckerberg said.'
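
To see how `extra_words` and `exclude_words` interact, the same set logic can be sketched in isolation. The tiny stopword set and the `filter_words` helper below are hypothetical stand-ins for illustration only; the function above uses NLTK's English stopword list.

```python
# Hypothetical miniature stopword list, used only for illustration;
# the notebook's function uses stopwords.words('english') instead.
base_stopwords = {'the', 'is', 'no', 'to', 'a'}

def filter_words(text, extra_words=(), exclude_words=()):
    # drop the words we want to keep, then add the extra words to remove
    stopword_set = (base_stopwords - set(exclude_words)) | set(extra_words)
    return ' '.join(word for word in text.split() if word not in stopword_set)

text = 'there is no path to the ha server'

print(filter_words(text))
# there path ha server

print(filter_words(text, extra_words=['ha'], exclude_words=['no']))
# there no path server
```

The membership check is case-sensitive, which is consistent with the output above: the raw article keeps capitalized words such as 'Using' and '"Our' because it was not lowercased or tokenized first.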

### 6. Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

In [16]:
# check head of my news_df dataframe:

news_df.head()

| | title | content | category |
|---|---|---|---|
| 0 | Facebook changes its company name to 'Meta' | Facebook on Thursday announced it's changing t... | business |
| 1 | 'Man who takes 6 months parental leave is a lo... | Several Twitter users criticised US-based Pala... | business |
| 2 | Delhi HC notice to RBI, SBI over banning UPI p... | The Delhi High Court on Thursday issued notice... | business |
| 3 | Paytm will not force employees to come to offi... | Paytm will continue to allow employees to work... | business |
| 4 | Legacy companies eat Ola, Ather, Tork & SmartE... | Bajaj Auto's MD Rajiv Bajaj on Thursday took a... | business |


In [17]:
# use all the functions to see if they work on news_df's content column

news_df['content'].apply(basic_clean)\
.apply(tokenize)\
.apply(lemmatize)\
.apply(remove_stopwords)

0     facebook thursday announced changing company n...
1     several twitter user criticised usbased palant...
2     delhi high court thursday issued notice rbi sb...
3     paytm continue allow employee work home force ...
4     bajaj auto md rajiv bajaj thursday took jibe s...
                            ...                        
95    actress nitu chandra ha revealed chose stunt w...
96    bigg bos 13 fame news anchor shefali bagga wa ...
97    actor emraan hashmi ha said witnessed exorcism...
98    mahesh manjrekar wa diagnosed urinary bladder ...
99    actor chris evans ha said pinch every day voic...
Name: content, Length: 100, dtype: object

### 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [18]:
codeup_df = acquire.get_blogs()

In [19]:
codeup_df.head()

| | title | date | category | content |
|---|---|---|---|---|
| 0 | Codeup Launches First Podcast: Hire Tech | Aug 25, 2021 | Codeup News | Any podcast enthusiasts out there? We are plea... |
| 1 | Why Should I Become a System Administrator? | Aug 23, 2021 | Tips for Prospective Students | With so many tech careers in demand, why choos... |
| 2 | Announcing our Candidacy for Accreditation! | Jun 30, 2021 | Codeup News | Did you know that even though we’re an indepen... |
| 3 | Codeup Takes Over More of the Historic Vogue B... | Jun 21, 2021 | Codeup News | Codeup is moving into another floor of our His... |
| 4 | Inclusion at Codeup During Pride Month (and Al... | Jun 4, 2021 | Codeup News | Happy Pride Month! Pride Month is a dedicated ... |


### 8. For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.

In [20]:
def prep_article_data(df, column, extra_words=[], exclude_words=[]):
    '''
    This function takes in a df and the string name of a text column, with the
    option to pass lists for extra_words and exclude_words. It adds clean, stemmed,
    and lemmatized columns to df (each normalized, tokenized, and with stopwords
    removed) and returns the title, original text, and the three new columns.
    '''
    # normalize and tokenize once; all three output columns start from this
    tokenized = df[column].apply(basic_clean).apply(tokenize)
    
    df['clean'] = tokenized.apply(remove_stopwords, 
                                  extra_words=extra_words, 
                                  exclude_words=exclude_words)
    
    df['stemmed'] = tokenized.apply(stem)\
                             .apply(remove_stopwords, 
                                    extra_words=extra_words, 
                                    exclude_words=exclude_words)
    
    df['lemmatized'] = tokenized.apply(lemmatize)\
                                .apply(remove_stopwords, 
                                       extra_words=extra_words, 
                                       exclude_words=exclude_words)
    
    return df[['title', column, 'clean', 'stemmed', 'lemmatized']]

In [21]:
# use the function defined above for news_df's content column.

prep_article_data(news_df, 'content', extra_words = ['ha'], exclude_words = ['no']).head()

| | title | content | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Facebook changes its company name to 'Meta' | Facebook on Thursday announced it's changing t... | facebook thursday announced changing companys ... | facebook thursday announc chang compani name m... | facebook thursday announced changing company n... |
| 1 | 'Man who takes 6 months parental leave is a lo... | Several Twitter users criticised US-based Pala... | several twitter users criticised usbased palan... | sever twitter user criticis usbas palantir tec... | several twitter user criticised usbased palant... |
| 2 | Delhi HC notice to RBI, SBI over banning UPI p... | The Delhi High Court on Thursday issued notice... | delhi high court thursday issued notice rbi sb... | delhi high court thursday issu notic rbi sbi n... | delhi high court thursday issued notice rbi sb... |
| 3 | Paytm will not force employees to come to offi... | Paytm will continue to allow employees to work... | paytm continue allow employees work home force... | paytm continu allow employe work home forc com... | paytm continue allow employee work home force ... |
| 4 | Legacy companies eat Ola, Ather, Tork & SmartE... | Bajaj Auto's MD Rajiv Bajaj on Thursday took a... | bajaj autos md rajiv bajaj thursday took jibe ... | bajaj auto md rajiv bajaj thursday took jibe s... | bajaj auto md rajiv bajaj thursday took jibe s... |


### 9. Ask yourself:

- **If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?**
- I would lemmatize it so that the words returned are real words. The corpus is small, so I don't see lemmatizing as a waste of resources compared with stemming.


- **If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?**
- I would still lemmatize it; 25MB isn't too large.


- **If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?**
- Stemmed. I would work with what stemming gives me, because lemmatizing a corpus that large on compute billed by the megabyte would be very expensive.
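
One way to back these answers with numbers is to time each function on a small sample and scale up. The sketch below is a rough estimate under stated assumptions: `seconds_per_megabyte` is a hypothetical helper, `str.upper` is only a placeholder for the function being measured, and the projection assumes cost grows linearly with corpus size.

```python
import time

def seconds_per_megabyte(func, sample_text, repeats=3):
    """Estimate how many seconds func needs per MB of text, from a small sample."""
    size_mb = len(sample_text.encode('utf-8')) / 1_000_000
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        func(sample_text)
        # keep the fastest run to reduce noise from other processes
        best = min(best, time.perf_counter() - start)
    return best / size_mb

# repeated sample sentence; a real estimate should use actual article text
sample = 'facebook plans to create a greater sense of virtual presence ' * 2000

# str.upper is a placeholder; swap in the function under test
rate = seconds_per_megabyte(str.upper, sample)

# 200TB is 200,000,000MB in decimal units
projected_seconds = rate * 200_000_000
print(f'{rate:.6f} s/MB, about {projected_seconds:,.0f} s for 200TB')
```

Passing `stem` and `lemmatize` from this notebook in place of `str.upper` would give a per-megabyte rate for each, which makes the trade-off in the last question concrete.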
