# NLP: Prepare


In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd
import acquire
from time import strftime

import warnings
warnings.filterwarnings('ignore')

## Exercises
The end result of this exercise should be a file named `prepare.py` that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

1. Define a function named `basic_clean`. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

In [2]:
def basic_clean(string):
    '''
    This function takes in a string and
    returns the string normalized.
    '''
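    # Decompose accented characters (NFKD), then drop anything non-ASCII.
    string = unicodedata.normalize('NFKD', string)\
             .encode('ascii', 'ignore')\
             .decode('utf-8', 'ignore')
    # Strip every character that is not a word character or whitespace, then lowercase.
    # Note: \w keeps underscores, and this pattern also removes single quotes.
    string = re.sub(r'[^\w\s]', '', string).lower()
    return string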
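
A quick check of what this function does to a sample string can be sketched as follows. The sample text and the `basic_clean_keep_quotes` variant are illustrative assumptions, not part of the original notebook. The variant shows one possible way to keep single quotes, as the exercise text asks, whereas the `\w`-based pattern strips them.

```python
import re
import unicodedata


def basic_clean(string):
    """Same logic as the notebook's basic_clean, repeated so this runs alone."""
    string = unicodedata.normalize('NFKD', string)\
             .encode('ascii', 'ignore')\
             .decode('utf-8', 'ignore')
    return re.sub(r'[^\w\s]', '', string).lower()


def basic_clean_keep_quotes(string):
    """Hypothetical variant: lowercase first, then keep only a-z, 0-9, whitespace and '."""
    string = unicodedata.normalize('NFKD', string)\
             .encode('ascii', 'ignore')\
             .decode('utf-8', 'ignore')\
             .lower()
    return re.sub(r"[^a-z0-9'\s]", '', string)


sample = "Café's BEST deal: 50% off!"  # made-up sample text

cleaned = basic_clean(sample)
cleaned_keep_quotes = basic_clean_keep_quotes(sample)

print(cleaned)              # cafes best deal 50 off
print(cleaned_keep_quotes)  # cafe's best deal 50 off
```

Keeping the quote changes what the tokenizer later sees, so either choice should be applied consistently to every document in the corpus.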

2. Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.

In [3]:
def tokenize(string):
    '''
    This function takes in a string and
    returns a tokenized string.
    '''
    # Create tokenizer.
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    # Use tokenizer
    string = tokenizer.tokenize(string, return_str = True)
    
    return string
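
As a rough illustration of the `return_str` flag used in this function, the sketch below tokenizes a short, already-cleaned sample string both ways. With `return_str = True` the tokenizer hands back one space-separated string; without it, a list of tokens. The sample string is only an assumption for the demonstration.

```python
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()

# Short sample that has already been lowercased and stripped of punctuation.
text = "facebook parent meta lost 251 billion"

as_string = tokenizer.tokenize(text, return_str=True)  # one string
as_list = tokenizer.tokenize(text)                     # list of tokens

print(type(as_string).__name__)
print(as_list)
```

Returning a string keeps every step in the pipeline string-in, string-out, which is what lets the later functions simply call `.split()`.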

3. Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.

In [4]:
def stem(string):
    '''
    This function takes in a string and
    returns a string with words stemmed.
    '''
    # Create porter stemmer.
    ps = nltk.porter.PorterStemmer()
    
    # Use the stemmer to stem each word in the list of words we created by using split.
    stems = [ps.stem(word) for word in string.split()]
    
    # Join our lists of words into a string again and assign to a variable.
    string = ' '.join(stems)
    
    return string
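
To make the effect of stemming concrete, here is a small sketch that runs the Porter stemmer on a few individual words. The word list is chosen for illustration; note that a stem does not have to be a dictionary word.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Illustrative words; the stemmer works on one token at a time.
words = ['industries', 'added', 'value', 'platforms']
stems = [ps.stem(word) for word in words]

for word, stemmed in zip(words, stems):
    print(f'{word} -> {stemmed}')
```

Results such as `industri` and `valu` are expected: the stemmer strips suffixes by rule rather than looking words up.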

4. Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.

In [5]:
def lemmatize(string):
    '''
    This function takes in a string and
    returns a string with words lemmatized.
    '''
    # Create the lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Use the lemmatizer on each word in the list of words we created by using split.
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    # Join our list of words into a string again and assign to a variable.
    string = ' '.join(lemmas)
    
    return string

5. Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords. This function should define two optional parameters, `extra_words` and `exclude_words`. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [6]:
def remove_stopwords(string, extra_words = [], exclude_words = []):
    '''
    This function takes in a string, optional extra_words and exclude_words parameters
    with default empty lists and returns a string.
    '''
    # Create stopword_list.
    stopword_list = stopwords.words('english')
    
    # Remove 'exclude_words' from stopword_list to keep these in my text.
    stopword_list = set(stopword_list) - set(exclude_words)
    
    # Add in 'extra_words' to stopword_list.
    stopword_list = stopword_list.union(set(extra_words))

    # Split words in string.
    words = string.split()
    
    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_list]
    
    # Join words in the list back into strings and assign to a variable.
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords
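
The set arithmetic behind `extra_words` and `exclude_words` can be sketched without the NLTK corpus. In the example below, `base_stopwords` is a tiny stand-in for `stopwords.words('english')` and `remove_stopwords_demo` is a hypothetical helper; both are assumptions made so the snippet runs on its own.

```python
# Stand-in for stopwords.words('english'); the real NLTK list is much longer.
base_stopwords = ['the', 'is', 'no', 'a', 'has']


def remove_stopwords_demo(string, extra_words=(), exclude_words=()):
    """Same set logic as remove_stopwords, but with the small stand-in list."""
    stopword_set = (set(base_stopwords) - set(exclude_words)) | set(extra_words)
    return ' '.join(word for word in string.split() if word not in stopword_set)


text = 'the market has no clear winner'

default_result = remove_stopwords_demo(text)
custom_result = remove_stopwords_demo(text,
                                      extra_words=['market'],
                                      exclude_words=['no'])

print(default_result)  # market clear winner
print(custom_result)   # no clear winner
```

Excluding `no` keeps negation in the text, which can matter for tasks such as sentiment analysis.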

6. Use your data from the acquire step to produce a dataframe of the news articles. Name the dataframe `news_df`.

In [7]:
# Use the get_inshorts_articles function from the acquire.py file.
news_df = acquire.get_inshorts_articles()

In [8]:
news_df.head()

| | title | author | content | date | category |
|---|---|---|---|---|---|
| 0 | Ambani, Adani become richer than Zuckerberg af... | Pragya Swastik | Reliance Industries Chairman Mukesh Ambani and... | 04 Feb 2022,Friday | business |
| 1 | Drop in Meta's market value more than the tota... | Arshiya Chopra | After Facebook parent Meta lost $251 billion i... | 04 Feb 2022,Friday | business |
| 2 | Amazon adds $135 bn in one of the biggest 1-da... | Hiral Goyal | Amazon added more than $135 billion in market ... | 04 Feb 2022,Friday | business |
| 3 | Meta drops below Berkshire Hathaway in market ... | Hiral Goyal | Meta Platforms is now worth about $50 billion ... | 04 Feb 2022,Friday | business |
| 4 | Facebook's user growth in India slowed due to ... | Sakshita Khosla | Facebook's user growth in India was hit due to... | 04 Feb 2022,Friday | business |


7. Make another dataframe for the Codeup blog posts. Name the dataframe `codeup_df`.

In [9]:
codeup_df = acquire.get_codeup_blogs()

In [10]:
codeup_df.head()

| | title | published | content |
|---|---|---|---|
| 0 | Codeup Dallas Open House | Nov 30, 2021 | Come join us for the re-opening of our Dallas ... |
| 1 | Codeup’s Placement Team Continues Setting Records | Nov 19, 2021 | Our Placement Team is simply defined as a grou... |
| 2 | IT Certifications 101: Why They Matter, and Wh... | Nov 18, 2021 | AWS, Google, Azure, Red Hat, CompTIA…these are... |
| 3 | A rise in cyber attacks means opportunities fo... | Nov 17, 2021 | In the last few months, the US has experienced... |
| 4 | Use your GI Bill® benefits to Land a Job in Tech | Nov 4, 2021 | As the end of military service gets closer, ma... |


8. For each dataframe, produce the following columns:
- `title` to hold the title
- `original` to hold the original article/post content
- `clean` to hold the normalized and tokenized original with the stopwords removed.
- `stemmed` to hold the stemmed version of the cleaned data.
- `lemmatized` to hold the lemmatized version of the cleaned data.

In [11]:
news_df.rename(columns={'content': 'original'}, inplace=True)
codeup_df.rename(columns={'content': 'original'}, inplace=True)

In [12]:
def prep_article_data(df, column, extra_words=[], exclude_words=[]):
    '''
    This function takes in a df and the string name of a text column, with the
    option to pass lists for extra_words and exclude_words, and returns a df with
    the article title, the original text, and the clean, stemmed, and lemmatized
    versions of that text, each with stopwords removed.
    Note: the new columns are also added to the df that was passed in.
    '''
    df['clean'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    df['stemmed'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(stem)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    df['lemmatized'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(lemmatize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    return df[['title', column,'clean', 'stemmed', 'lemmatized']]
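
`Series.apply` forwards extra keyword arguments to the function it calls, which is how `extra_words` and `exclude_words` reach `remove_stopwords` in the function above. A minimal sketch of that mechanism, using a toy DataFrame and a simplified stand-in function (`drop_words` and `toy_df` are hypothetical names, not part of `prepare.py`):

```python
import pandas as pd


def drop_words(string, extra_words=(), exclude_words=()):
    """Hypothetical stand-in for remove_stopwords with a two-word stopword list."""
    stopword_set = ({'the', 'no'} - set(exclude_words)) | set(extra_words)
    return ' '.join(word for word in string.split() if word not in stopword_set)


# Toy data made up for this sketch.
toy_df = pd.DataFrame({'original': ['the team has no plan', 'the plan works']})

# Keyword arguments after the function name are passed through to drop_words.
toy_df['clean'] = toy_df['original'].apply(drop_words,
                                           extra_words=['has'],
                                           exclude_words=['no'])

clean_values = toy_df['clean'].tolist()
print(clean_values)  # ['team no plan', 'plan works']
```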

In [13]:
# Apply the function defined above to news_df's 'original' column.

prep_article_data(news_df, 'original', extra_words = ['ha'], exclude_words = ['no']).head()

| | title | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Ambani, Adani become richer than Zuckerberg af... | Reliance Industries Chairman Mukesh Ambani and... | reliance industries chairman mukesh ambani ada... | relianc industri chairman mukesh ambani adani ... | reliance industry chairman mukesh ambani adani... |
| 1 | Drop in Meta's market value more than the tota... | After Facebook parent Meta lost $251 billion i... | facebook parent meta lost 251 billion market v... | facebook parent meta lost 251 billion market v... | facebook parent meta lost 251 billion market v... |
| 2 | Amazon adds $135 bn in one of the biggest 1-da... | Amazon added more than $135 billion in market ... | amazon added 135 billion market value one bigg... | amazon ad 135 billion market valu one biggest ... | amazon added 135 billion market value one bigg... |
| 3 | Meta drops below Berkshire Hathaway in market ... | Meta Platforms is now worth about $50 billion ... | meta platforms worth 50 billion less berkshire... | meta platform worth 50 billion less berkshir h... | meta platform worth 50 billion le berkshire ha... |
| 4 | Facebook's user growth in India slowed due to ... | Facebook's user growth in India was hit due to... | facebooks user growth india hit due hike prepa... | facebook user growth india wa hit due hike pre... | facebooks user growth india wa hit due hike pr... |
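
The `wa` tokens in the `stemmed` column above come from the order of operations in `prep_article_data`: the text is stemmed before stopwords are filtered, so `was` has already become `wa` by the time the filter runs and no longer matches the stopword list. The same effect on `has` is presumably why `extra_words = ['ha']` is passed. A minimal sketch of the two orderings follows; it uses a tiny stand-in stopword set rather than the full NLTK list, an assumption made so the example needs no corpus download.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Tiny stand-in for the NLTK English stopword list (assumption for this sketch).
stopword_set = {'was', 'has'}

words = 'user growth was hit'.split()

# Order used for the 'stemmed' column: stem first, then filter.
stem_then_filter = [s for s in (ps.stem(w) for w in words) if s not in stopword_set]

# Alternative order: filter first, then stem.
filter_then_stem = [ps.stem(w) for w in words if w not in stopword_set]

print(' '.join(stem_then_filter))  # user growth wa hit
print(' '.join(filter_then_stem))  # user growth hit
print(ps.stem('has'))              # ha
```

Filtering first would be one way to avoid the leftover fragments; adding them through `extra_words`, as the notebook does for `ha`, is another.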


In [14]:
# Apply the function defined above to codeup_df's 'original' column.

prep_article_data(codeup_df, 'original', extra_words = ['ha'], exclude_words = ['no']).head()

| | title | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Codeup Dallas Open House | Come join us for the re-opening of our Dallas ... | come join us reopening dallas campus drinks sn... | come join us reopen dalla campu drink snack co... | come join u reopening dallas campus drink snac... |
| 1 | Codeup’s Placement Team Continues Setting Records | Our Placement Team is simply defined as a grou... | placement team simply defined group manages re... | placement team simpli defin group manag relati... | placement team simply defined group manages re... |
| 2 | IT Certifications 101: Why They Matter, and Wh... | AWS, Google, Azure, Red Hat, CompTIA…these are... | aws google azure red hat comptiathese big name... | aw googl azur red hat comptiathes big name onl... | aws google azure red hat comptiathese big name... |
| 3 | A rise in cyber attacks means opportunities fo... | In the last few months, the US has experienced... | last months us experienced dozens major cybera... | last month us experienc dozen major cyberattac... | last month u experienced dozen major cyberatta... |
| 4 | Use your GI Bill® benefits to Land a Job in Tech | As the end of military service gets closer, ma... | end military service gets closer many transiti... | end militari servic get closer mani transit se... | end military service get closer many transitio... |


Ask yourself:

- If your corpus is 493 KB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 25 MB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 200 TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?