# NLP Prepare Exercises

1. Define a function named ```basic_clean```. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace or a single quote.

2. Define a function named ```tokenize```. It should take in a string and tokenize all the words in the string.

3. Define a function named ```stem```. It should accept some text and return the text after applying stemming to all the words.

4. Define a function named ```lemmatize```. It should accept some text and return the text after applying lemmatization to each word.

5. Define a function named ```remove_stopwords```. It should accept some text and return the text after removing all the stopwords.

- This function should define two optional parameters, ```extra_words``` and ```exclude_words```. These parameters should define any additional stop words to include, and any words that we don't want to remove.

6. Use your data from the acquire phase to produce a dataframe of the news articles. Name the dataframe ```news_df```.

7. Make another dataframe for the Codeup blog posts. Name the dataframe ```codeup_df```.

8. For each dataframe, produce the following columns:

- ```title``` to hold the title
- ```original``` to hold the original article/post content
- ```clean``` to hold the normalized and tokenized original with the stopwords removed.
- ```stemmed``` to hold the stemmed version of the cleaned data.
- ```lemmatized``` to hold the lemmatized version of the cleaned data.

9. Ask yourself:

- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

<hr style="border:2px solid gray">

In [1]:
#unicode, regex, json for text digestion
import unicodedata
import re
import json

#natural language toolkit -> tokenization, stopwords
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

#standard ds imports
import pandas as pd
from time import strftime
from requests import get
from bs4 import BeautifulSoup

#custom import
import acquire

#ignore warnings import
import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/natasharivers/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/natasharivers/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/natasharivers/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [2]:
#specify the categories desired
categories = ["business", "sports", "technology", "entertainment"]

#use function from acquire.py
news_df = acquire.get_news_articles_data(categories)

In [3]:
#take a look
news_df.head()

Unnamed: 0,title,content,category
0,Women to get free entry to matches in inaugura...,Women and girls will be getting free entry to ...,national
1,Let's win polls first: Farooq to Kharge on Opp...,National Conference patron Farooq Abdullah on ...,national
2,Harassed for independent thinking: INC on CPR'...,Senior Congress leader Jairam Ramesh said Cent...,national
3,2024 polls are about who should be defeated: T...,"Tamil Nadu CM MK Stalin, speaking at his birth...",national
4,UK's legal process independent of govt: UK min...,British Foreign Secretary James Cleverly on We...,national


In [4]:
#use first article [0] to use as test string
test_string = news_df.content[0]
test_string

"Women and girls will be getting free entry to matches in the inaugural season of Women's Premier League (WPL), which will begin on March 4. Meanwhile, for men and boys tickets are being sold at a price of ₹100 and ₹400. The tournament will be held in Mumbai's Brabourne Stadium and Navi Mumbai's Dr DY Patil Sports Academy."

<hr style="border:2px solid black">

### Let's get into the exercises

<b>1. Define a function named ```basic_clean```</b>
It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace or a single quote.

<div class="alert alert-block alert-info">
<b>Remember:</b> 
<br>

- unicodedata.normalize('NFKD', string): This function normalizes the string by decomposing any composed characters into their basic components using the Unicode Normalization Form Compatibility Decomposition (NFKD) algorithm. This is useful for dealing with certain types of characters that might cause issues when processing text data.

- .encode('ascii', 'ignore'): This function encodes the normalized string into ASCII format, ignoring any non-ASCII characters in the process. This is useful if you want to remove any non-ASCII characters from the string.

- .decode('utf-8', 'ignore'): This method decodes the ASCII bytes back into a Python string. ASCII is a subset of UTF-8, so no decoding errors are expected here. The characters dropped during encoding are not restored; the result is an ASCII-only version of the original text.
</div>
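The three steps above can be traced one at a time. This is a minimal sketch using only the standard library; the helper name `to_ascii` and the sample strings are assumptions made for illustration and are not part of the exercise.

```python
import unicodedata

def to_ascii(text):
    # Step 1: NFKD splits composed characters into a base character plus
    # separate combining marks (for example "é" becomes "e" + U+0301).
    decomposed = unicodedata.normalize('NFKD', text)
    # Step 2: encoding to ASCII with errors='ignore' drops every character
    # that has no ASCII representation, including the combining marks.
    ascii_bytes = decomposed.encode('ascii', 'ignore')
    # Step 3: decoding turns the bytes back into a Python str.
    return ascii_bytes.decode('utf-8', 'ignore')

cafe = 'caf\u00e9'        # "café" written with a single composed character
rupees = '\u20b9100'      # "₹100"

print(len(cafe), len(unicodedata.normalize('NFKD', cafe)))  # 4 5
print(to_ascii(cafe))     # cafe
print(to_ascii(rupees))   # 100
```

The accented letter keeps its base character because NFKD separated the accent before the ASCII step, whereas the rupee sign has no decomposition and is dropped entirely.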

In [5]:
def basic_clean(string):
    '''
    This function takes in the original text.
    The text is lowercased and unicode-normalized (NFKD),
    encoded to ascii with any non-ascii characters dropped,
    and decoded back into a string.
    Anything that is not a letter, number, whitespace or single quote is then removed.
    The cleaned text is returned.
    '''
    #lowercase
    string = string.lower()
    
    #normalize
    string = unicodedata.normalize('NFKD', string)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')
    
    #remove anything that is not a lowercase letter, digit, single quote or whitespace
    string = re.sub(r"[^a-z0-9'\s]", '', string)
    
    return string

In [6]:
#make sure function works
basic_clean(test_string)

"women and girls will be getting free entry to matches in the inaugural season of women's premier league wpl which will begin on march 4 meanwhile for men and boys tickets are being sold at a price of 100 and 400 the tournament will be held in mumbai's brabourne stadium and navi mumbai's dr dy patil sports academy"

<div class="alert alert-block alert-success">
<b>Takeaways:</b>
<br>

- All words have been lowercased
- Periods, commas, parentheses and the ₹ symbol have been removed (deleted, not replaced with a space); apostrophes are kept
</div>

<hr style="border:1px solid black">

<b>2. Define a function named ```tokenize```</b>
It should take in a string and tokenize all the words in the string.

<div class="alert alert-block alert-info">
<b>Remember:</b> 
<br>

- <b>Tokenization</b>: Breaks a text document into individual words or terms, also called tokens. This can be done by splitting the text at whitespace, punctuation marks, or any other separator.
<br>

<b>Note:</b> There are several ways to tokenize: word tokenization (at white space), sentence tokenization (at period) and character tokenization (at each character)
    
</div>
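The three levels of tokenization mentioned in the note can be compared side by side. The sketch below uses plain string methods and a regular expression as naive stand-ins; it is not how NLTK's tokenizers work internally, and the sample sentence is made up for illustration.

```python
import re

text = "Entry is free for women. Tickets for men cost 100."

# Word-level: split on runs of whitespace (punctuation stays attached).
words = text.split()

# Sentence-level: split on whitespace that follows ., ! or ?
# (naive: an abbreviation such as "Dr." would be split incorrectly).
sentences = re.split(r'(?<=[.!?])\s+', text)

# Character-level: every character becomes its own token.
characters = list("free")

print(words)
print(sentences)
print(characters)
```

Note that whitespace splitting leaves `women.` and `100.` with their periods attached, which is one reason a dedicated tokenizer is used in the function below.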

In [7]:
def tokenize(string):
    '''
    This function takes in a string
    and returns the string as individual tokens put back into the string
    '''
    #create the tokenizer
    tokenizer = nltk.tokenize.ToktokTokenizer()

    #use the tokenizer
    string = tokenizer.tokenize(string, return_str = True)

    return string

In [29]:
#make sure function works
tokenize(test_string)

"Women and girls will be getting free entry to matches in the inaugural season of Women ' s Premier League ( WPL ) , which will begin on March 4. Meanwhile , for men and boys tickets are being sold at a price of ₹ 100 and ₹ 400. The tournament will be held in Mumbai ' s Brabourne Stadium and Navi Mumbai ' s Dr DY Patil Sports Academy ."

<div class="alert alert-block alert-success">
<b>Takeaways:</b>
<br>

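- This function returns the same text as a single string, with the tokens separated by spaces.
- Punctuation such as apostrophes, parentheses and commas is split off into its own token (for example, Women's becomes Women ' s).
</div>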

<hr style="border:1px solid black">

<b>3. Define a function named ```stem```</b>
It should accept some text and return the text after applying stemming to all the words.

<div class="alert alert-block alert-info">
<b>Remember:</b> 
<br>

<b>stemming</b>: strips suffixes (plurals, verb endings, etc.) to reduce each word to its <u>stem</u>.
- Because of this, the word returned may not be an actual word...
    - (ex): house --> returns hous
    - (ex): calls, called, calling --> are all returned to <u>call</u>.
<br>
</div>
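To make the idea of suffix stripping concrete, here is a deliberately crude sketch. The function `toy_stem` and its suffix list are hypothetical and far simpler than the Porter algorithm used in the exercise; the sketch only shows why a stem such as "hous" need not be a real word.

```python
def toy_stem(word):
    """Hypothetical, deliberately crude stemmer: strip the first matching suffix."""
    for suffix in ('ing', 'ed', 's', 'e'):
        # Keep at least three characters so short words such as "be" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ['calls', 'called', 'calling', 'house']])
# ['call', 'call', 'call', 'hous']
```

A real stemmer applies many more rules in several passes, but the trade-off is the same: the rules are cheap to apply and make no attempt to check the result against a dictionary.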

In [40]:
def stem(string):
    '''
    This function takes in text
    and returns the stem word joined back into the text
    '''
    #create porter stemmer
    ps = nltk.porter.PorterStemmer()
    
    #use the stem, split string using each word
    stems = [ps.stem(word) for word in string.split()]
    
    #join stem word to string
    string = ' '.join(stems)

    return string

In [10]:
#make sure function works, only root words (no past tense)
stem(test_string)

"women and girl will be get free entri to match in the inaugur season of women' premier leagu (wpl), which will begin on march 4. meanwhile, for men and boy ticket are be sold at a price of ₹100 and ₹400. the tournament will be held in mumbai' brabourn stadium and navi mumbai' dr dy patil sport academy."

<div class="alert alert-block alert-success">
<b>Takeaways:</b>
<br>

- Now all words are returned to their <u>stem</u> word.
</div>

<hr style="border:1px solid black">

<b>4. Define a function named ```lemmatize```</b> 
It should accept some text and return the text after applying lemmatization to each word.

<div class="alert alert-block alert-info">
<b>Remember:</b> 
<br>

- <b>lemmatization</b>: very similar to stemming, <u>but</u> the word returned must be an actual word present in the dictionary. It returns the word to its dictionary form (its <u>lemma</u>) as opposed to its <u>stem</u> like in stemming.
</div>
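The dictionary-lookup idea behind lemmatization can be sketched with a tiny hand-written table. The `LEMMAS` mapping and `toy_lemmatize` below are hypothetical; the WordNet lemmatizer used in the exercise relies on the WordNet database rather than a table like this one.

```python
# Hypothetical mini-dictionary mapping inflected forms to their lemmas.
LEMMAS = {'women': 'woman', 'girls': 'girl', 'matches': 'match', 'entries': 'entry'}

def toy_lemmatize(text):
    """Replace each whitespace-separated word with its lemma when one is known."""
    return ' '.join(LEMMAS.get(word, word) for word in text.split())

print(toy_lemmatize('women and girls get free entries to matches'))
# woman and girl get free entry to match
```

Because the result comes from a lookup, irregular forms such as "women" map to a real word ("woman"), which simple suffix stripping cannot do.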

In [12]:
def lemmatize(string):
    '''
    This function takes in a string
    and returns the lemmatized word joined back into the string
    '''
    #create the lemmatizer
    wnl = nltk.stem.WordNetLemmatizer()
    
    #lemmatize each word in the string
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    #join lemmatized words into article
    string = ' '.join(lemmas)

    return string

In [13]:
#make sure function works
lemmatize(test_string)

"Women and girl will be getting free entry to match in the inaugural season of Women's Premier League (WPL), which will begin on March 4. Meanwhile, for men and boy ticket are being sold at a price of ₹100 and ₹400. The tournament will be held in Mumbai's Brabourne Stadium and Navi Mumbai's Dr DY Patil Sports Academy."

In [41]:
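#create a lemmatizer here; the one inside lemmatize() is local to that function
wnl = nltk.stem.WordNetLemmatizer()
lemmas = [wnl.lemmatize(word) for word in test_string.split()]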
lemmas

['Women',
 'and',
 'girl',
 'will',
 'be',
 'getting',
 'free',
 'entry',
 'to',
 'match',
 'in',
 'the',
 'inaugural',
 'season',
 'of',
 "Women's",
 'Premier',
 'League',
 '(WPL),',
 'which',
 'will',
 'begin',
 'on',
 'March',
 '4.',
 'Meanwhile,',
 'for',
 'men',
 'and',
 'boy',
 'ticket',
 'are',
 'being',
 'sold',
 'at',
 'a',
 'price',
 'of',
 '₹100',
 'and',
 '₹400.',
 'The',
 'tournament',
 'will',
 'be',
 'held',
 'in',
 "Mumbai's",
 'Brabourne',
 'Stadium',
 'and',
 'Navi',
 "Mumbai's",
 'Dr',
 'DY',
 'Patil',
 'Sports',
 'Academy.']

<div class="alert alert-block alert-success">
<b>Takeaways:</b>
<br>

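- Plural nouns are now returned as their lemma (girls becomes girl, matches becomes match).
- WordNetLemmatizer treats each word as a noun by default, so verb forms such as "getting" are left unchanged.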
</div>

<hr style="border:1px solid black">

<b>5. Define a function named ```remove_stopwords```</b>
- It should accept some text and return the text after removing all the stopwords.
- This function should define two optional parameters, ```extra_words``` and ```exclude_words```. These parameters should define any additional stop words to include, and any words that we don't want to remove.

<div class="alert alert-block alert-info">
<b>Remember:</b> 
<br>

- <b>stopwords</b>: refers to words that have little significance. They are typically articles, conjunctions and prepositions.
    - articles: a, an, the, etc
    - conjunctions: for, and, nor, but, yet, etc
    - prepositions: in, at, on, by, with, etc
<br>
</div>

In [15]:
def remove_stopwords(string, extra_words = [], exclude_words = []):
    '''
    This function takes in text, extra words and exclude words
    and returns the text as a string with the stopwords removed
    '''
    #create stopword list
    stopword_list = stopwords.words('english')
    
    #remove excluded words from list
    stopword_list = set(stopword_list) - set(exclude_words)
    
    #add the extra words to the list
    stopword_list = stopword_list.union(set(extra_words))
    
    #split the string into different words
    words = string.split()
    
    #create a list of words that are not in the list
    filtered_words = [word for word in words if word not in stopword_list]
    
    #join the words that are not stopwords (filtered words) back into the string
    string = ' '.join(filtered_words)
    
    return string

In [16]:
remove_stopwords(test_string)

"Women girls getting free entry matches inaugural season Women's Premier League (WPL), begin March 4. Meanwhile, men boys tickets sold price ₹100 ₹400. The tournament held Mumbai's Brabourne Stadium Navi Mumbai's Dr DY Patil Sports Academy."

<div class="alert alert-block alert-success">
<b>Takeaways:</b>
<br>

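- Lowercase stopwords have been removed from the string. Because the match is case-sensitive and this test string was not cleaned first, the capitalized "The" is still present.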
</div>
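The effect of case on stopword matching, and the role of the two optional parameters, can be seen with a small self-contained sketch. The `BASE_STOPWORDS` set and the `drop_stopwords` helper are assumptions made for illustration; NLTK's English stopword list is much longer.

```python
# Hypothetical, tiny stopword set; NLTK's English list is much longer.
BASE_STOPWORDS = {'the', 'and', 'will', 'be', 'in', 'of', 'for'}

def drop_stopwords(text, extra_words=None, exclude_words=None):
    """Remove stopwords from whitespace-separated text (case-sensitive match)."""
    # Take excluded words out of the stopword set, then add the extra words.
    stopword_set = (BASE_STOPWORDS - set(exclude_words or [])) | set(extra_words or [])
    return ' '.join(word for word in text.split() if word not in stopword_set)

sample = 'The tournament will be held in Mumbai'

print(drop_stopwords(sample))                                  # The tournament held Mumbai
print(drop_stopwords(sample.lower()))                          # tournament held mumbai
print(drop_stopwords(sample.lower(), extra_words=['held']))    # tournament mumbai
print(drop_stopwords(sample.lower(), exclude_words=['will']))  # tournament will held mumbai
```

Lowercasing before removing stopwords is what makes "The" match "the", which is why the cleaning step comes first in the later pipeline.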

<hr style="border:1px solid black">

<b>6. Use your data from the acquire phase to produce a dataframe of the news articles.</b>
Name the dataframe ```news_df```.

<div class="alert alert-block alert-info">
<b>Note:</b> 
<br>

We created a function in the acquire phase to acquire news articles.
<br>
</div>

In [17]:
#specify the categories desired
categories = ["business", "sports", "technology", "entertainment"]

#use function from acquire.py
news_df = acquire.get_news_articles_data(categories)

In [18]:
#take a look at news_df
news_df.head()

Unnamed: 0,title,content,category
0,Women to get free entry to matches in inaugura...,Women and girls will be getting free entry to ...,national
1,Let's win polls first: Farooq to Kharge on Opp...,National Conference patron Farooq Abdullah on ...,national
2,Harassed for independent thinking: INC on CPR'...,Senior Congress leader Jairam Ramesh said Cent...,national
3,2024 polls are about who should be defeated: T...,"Tamil Nadu CM MK Stalin, speaking at his birth...",national
4,UK's legal process independent of govt: UK min...,British Foreign Secretary James Cleverly on We...,national


In [19]:
#apply all functions just created to the content column to prep
news_df['content'].apply(basic_clean)\
.apply(tokenize)\
.apply(lemmatize)\
.apply(remove_stopwords)

0      woman girl getting free entry match inaugural ...
1      national conference patron farooq abdullah wed...
2      senior congress leader jairam ramesh said cent...
3      tamil nadu cm mk stalin speaking birthday meet...
4      british foreign secretary james cleverly wedne...
                             ...                        
295    turkish president recep tayyip erdogan ha indi...
296    moody ' investor service said expects global g...
297    akasa air founder ceo vinay dube announced air...
298    indian government ha said ' concerned european...
299    sebi barred two individual security market one...
Name: content, Length: 300, dtype: object

In [20]:
#apply all functions just created to the title column to prep
news_df['title'].apply(basic_clean)\
.apply(tokenize)\
.apply(lemmatize)\
.apply(remove_stopwords)

0        woman get free entry match inaugural season wpl
1      let ' win poll first farooq kharge opposition ...
2      harassed independent thinking inc cpr ' licenc...
3                2024 poll defeated tamil nadu cm stalin
4      uk ' legal process independent govt uk ministe...
                             ...                        
295    turkey prez poll likely held may amid quake cr...
296      economic growth g20 economy slow 2 2023 moody '
297           akasa air hire 300 pilot next 12 month ceo
298    india feeling ' little challenged ' eu ' carbo...
299    sebi ban 2 1 yr levy 25lakh penalty insider tr...
Name: title, Length: 300, dtype: object

<div class="alert alert-block alert-success">
<b>Takeaways:</b>
<br>

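- Chaining the functions returns cleaned, tokenized and lemmatized versions of the content and title columns with the stopwords removed. The results are only displayed here; they are not assigned back to news_df.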
    
</div>

<hr style="border:1px solid black">

<b>7. Make another dataframe for the Codeup blog posts</b>
Name the dataframe ```codeup_df```.

<div class="alert alert-block alert-info">
<b>Note:</b> 
<br>

We created a function in the acquire phase to acquire Codeup blog articles.
<br>
</div>

In [21]:
#bring in blogs using acquire.py
codeup_df = acquire.get_blog_articles_data()

In [22]:
#take a look at the material
codeup_df.head()

Unnamed: 0,title,content
0,Black Excellence in Tech: Panelist Spotlight –...,Black excellence in tech: Panelist Spotlight –...
1,Black excellence in tech: Panelist Spotlight –...,Black excellence in tech: Panelist Spotlight –...
2,Black excellence in tech: Panelist Spotlight –...,Black excellence in tech: Panelist Spotlight –...
3,Black excellence in tech: Panelist Spotlight –...,Black excellence in tech: Panelist Spotlight –...
4,Coding Bootcamp or Self-Learning? Which is Bes...,If you’re interested in embarking on a career ...


In [23]:
#apply all functions just created to the content column to prep
codeup_df['content'].apply(basic_clean)\
.apply(tokenize)\
.apply(lemmatize)\
.apply(remove_stopwords)

0    black excellence tech panelist spotlight wilma...
1    black excellence tech panelist spotlight steph...
2    black excellence tech panelist spotlight james...
3    black excellence tech panelist spotlight jeani...
4    youre interested embarking career tech likely ...
5    codeup pleased announce ranked among 58 best c...
Name: content, dtype: object

<div class="alert alert-block alert-success">
<b>Takeaways:</b>
<br>

- Chaining the functions returns a cleaned, tokenized and lemmatized version of the content column with the stopwords removed. The result is only displayed here; it is not assigned back to codeup_df.
    
</div>

<hr style="border:1px solid black">

<b>8. For each dataframe, produce the following columns</b>:

- ```title``` to hold the title
- ```original``` to hold the original article/post content
- ```clean``` to hold the normalized and tokenized original with the stopwords removed.
- ```stemmed``` to hold the stemmed version of the cleaned data.
- ```lemmatized``` to hold the lemmatized version of the cleaned data.

In [24]:
################################ PREP ARTICLES ################################

#take dataframe, specify the column, extra and exclude words
def prep_article_data(df, column, extra_words=[], exclude_words=[]):
    '''
    This function takes in a df and the string name of a text column, with the
    option to pass lists for extra_words and exclude_words. It adds original,
    clean, stemmed and lemmatized columns to the df (the df is modified in place)
    and returns a df with the title, original text, cleaned text (normalized and
    tokenized with stopwords removed), stemmed text and lemmatized text.
    '''
    #original text from the specified column
    df['original'] = df[column]
    
    #chain together clean, tokenize, remove stopwords
    df['clean'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    #stem the cleaned text
    df['stemmed'] = df['clean'].apply(stem)
    
    #lemmatize the cleaned text
    df['lemmatized'] = df['clean'].apply(lemmatize)
    
    return df[['title', 'original', 'clean', 'stemmed', 'lemmatized']]

In [25]:
#assign variable to our new prepped dataframe
prep_news = prep_article_data(news_df, 'content', extra_words =[], exclude_words=[])

#take a look
prep_news.head(5)

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Women to get free entry to matches in inaugura...,Women and girls will be getting free entry to ...,women girls getting free entry matches inaugur...,women girl get free entri match inaugur season...,woman girl getting free entry match inaugural ...
1,Let's win polls first: Farooq to Kharge on Opp...,National Conference patron Farooq Abdullah on ...,national conference patron farooq abdullah wed...,nation confer patron farooq abdullah wednesday...,national conference patron farooq abdullah wed...
2,Harassed for independent thinking: INC on CPR'...,Senior Congress leader Jairam Ramesh said Cent...,senior congress leader jairam ramesh said cent...,senior congress leader jairam ramesh said cent...,senior congress leader jairam ramesh said cent...
3,2024 polls are about who should be defeated: T...,"Tamil Nadu CM MK Stalin, speaking at his birth...",tamil nadu cm mk stalin speaking birthday meet...,tamil nadu cm mk stalin speak birthday meet we...,tamil nadu cm mk stalin speaking birthday meet...
4,UK's legal process independent of govt: UK min...,British Foreign Secretary James Cleverly on We...,british foreign secretary james cleverly wedne...,british foreign secretari jame cleverli wednes...,british foreign secretary james cleverly wedne...


In [30]:
#take a look at one article
prep_news.iloc[1]

title         Let's win polls first: Farooq to Kharge on Opp...
original      National Conference patron Farooq Abdullah on ...
clean         national conference patron farooq abdullah wed...
stemmed       nation confer patron farooq abdullah wednesday...
lemmatized    national conference patron farooq abdullah wed...
Name: 1, dtype: object

In [27]:
#assign variable to our new prepped dataframe
prep_codeup = prep_article_data(codeup_df, 'content', extra_words =[], exclude_words=[])

#take a look
prep_codeup.head(5)

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Black Excellence in Tech: Panelist Spotlight –...,Black excellence in tech: Panelist Spotlight –...,black excellence tech panelist spotlight wilma...,black excel tech panelist spotlight wilmari de...,black excellence tech panelist spotlight wilma...
1,Black excellence in tech: Panelist Spotlight –...,Black excellence in tech: Panelist Spotlight –...,black excellence tech panelist spotlight steph...,black excel tech panelist spotlight stephani j...,black excellence tech panelist spotlight steph...
2,Black excellence in tech: Panelist Spotlight –...,Black excellence in tech: Panelist Spotlight –...,black excellence tech panelist spotlight james...,black excel tech panelist spotlight jame coope...,black excellence tech panelist spotlight james...
3,Black excellence in tech: Panelist Spotlight –...,Black excellence in tech: Panelist Spotlight –...,black excellence tech panelist spotlight jeani...,black excel tech panelist spotlight jeanic fre...,black excellence tech panelist spotlight jeani...
4,Coding Bootcamp or Self-Learning? Which is Bes...,If you’re interested in embarking on a career ...,youre interested embarking career tech likely ...,your interest embark career tech like taken lo...,youre interested embarking career tech likely ...


In [31]:
#take a look at one article
prep_codeup.iloc[4]

title         Coding Bootcamp or Self-Learning? Which is Bes...
original      If you’re interested in embarking on a career ...
clean         youre interested embarking career tech likely ...
stemmed       your interest embark career tech like taken lo...
lemmatized    youre interested embarking career tech likely ...
Name: 4, dtype: object

<hr style="border:1px solid black">

<b>9. Ask yourself</b>:

- a. If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
- b. If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
- c. If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

<b>In general</b>:
- lemmatizing: slower, but returns real dictionary words; better suited to smaller datasets
- stemming: faster, but may return non-words; better suited to larger datasets

<b>Answers</b>: 
- a. Lemmatize your text. The corpus is fairly small.
- b. Stem your text. The dataset is larger and stemming is faster.
- c. If you are being charged, you'd probably want to stem: it is faster, so you'd be charged less.