# Exercises

* The end result of this exercise should be a file named prepare.py that defines the requested functions.

* In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:
* Lowercase everything
* Normalize unicode characters
* Replace anything that is not a letter, number, whitespace, or a single quote with a space.

2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.
   This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

6. Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe news_df.

7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

8. For each dataframe, produce the following columns:
* title to hold the title
* original to hold the original article/post content
* clean to hold the normalized and tokenized original with the stopwords removed.
* stemmed to hold the stemmed version of the cleaned data.
* lemmatized to hold the lemmatized version of the cleaned data.

9. Ask yourself:
* If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
* If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
* If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

In [14]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

import pandas as pd
import prepare
import acquire


In [2]:
def basic_clean(text):
    """
    This function takes in a string and applies some basic text cleaning to it:
    * Lowercase everything
    * Normalize unicode characters
    * Replace anything that is not a letter, number, whitespace or a single quote with a space.
    """
    # Lowercase the text
    text = text.lower()
    
    # Normalize unicode characters
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    
    # Replace everything that is not a letter, number, whitespace or a single quote with a space
    text = re.sub(r"[^a-z0-9\s']", ' ', text)
    
    # Remove extra whitespaces
    text = re.sub(r"\s+", ' ', text).strip()
    
    return text
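
To make the effect of each step concrete, the sketch below reruns the same four operations on a single sample string and records the intermediate results. The helper name `clean_steps` and the sample sentence are illustrative assumptions, not part of the exercise.

```python
import re
import unicodedata


def clean_steps(text):
    """Return the text after each cleaning stage, in order (illustrative helper)."""
    lowered = text.lower()
    # NFKD splits accented letters into a base letter plus a combining mark;
    # encoding to ASCII with 'ignore' then drops the combining marks.
    normalized = (
        unicodedata.normalize("NFKD", lowered)
        .encode("ascii", "ignore")
        .decode("utf-8", "ignore")
    )
    # Anything other than a-z, 0-9, whitespace or a single quote becomes a space.
    replaced = re.sub(r"[^a-z0-9\s']", " ", normalized)
    # Collapse runs of whitespace and trim the ends.
    collapsed = re.sub(r"\s+", " ", replaced).strip()
    return [lowered, normalized, replaced, collapsed]


for stage in clean_steps("Erdős's  Café: 100% OPEN!"):
    print(repr(stage))
```

The last stage should come out as `erdos's cafe 100 open`: the accented letters keep their base letter, the punctuation is gone, and the apostrophe survives.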


In [3]:
def tokenize(text):
    """
    This function takes in a string and tokenizes all the words in the string.
    """
    # Tokenize the text using the nltk library
    tokens = nltk.word_tokenize(text)
    
    return tokens


In [4]:
def stem(text):
    """
    This function takes in a string and returns the text after applying stemming to all the words.
    """
    # Tokenize the text using the nltk library
    tokens = nltk.word_tokenize(text)
    
    # Apply stemming to each token using the PorterStemmer from nltk
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    
    # Join the stemmed tokens back into a single string
    stemmed_text = ' '.join(stemmed_tokens)
    
    return stemmed_text


In [5]:
def lemmatize(text):
    """
    This function takes in a string and returns the text after applying lemmatization to each word.
    """
    # Tokenize the text using the nltk library
    tokens = nltk.word_tokenize(text)
    
    # Apply lemmatization to each token using the WordNetLemmatizer from nltk
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    # Join the lemmatized tokens back into a single string
    lemmatized_text = ' '.join(lemmatized_tokens)
    
    return lemmatized_text


In [6]:
def remove_stopwords(text, extra_words=None, exclude_words=None):
    """
    This function takes in a string and returns the text after removing all the stopwords.
    It has two optional parameters: extra_words and exclude_words to define additional stop words to include
    and words that we don't want to remove.
    """
    # Start from the nltk English stopwords; a set gives fast membership checks
    stopword_set = set(stopwords.words('english'))

    # Add any extra stop words to the set
    if extra_words:
        stopword_set |= set(extra_words)

    # Take out any words that we do not want removed from the text
    if exclude_words:
        stopword_set -= set(exclude_words)

    # Tokenize the text using the nltk library
    tokens = nltk.word_tokenize(text)

    # Remove the stop words from the tokens
    filtered_tokens = [token for token in tokens if token.lower() not in stopword_set]
    
    # Join the filtered tokens back into a single string
    filtered_text = ' '.join(filtered_tokens)
    
    return filtered_text
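
The interplay of the two optional parameters can be checked without downloading any NLTK data. The sketch below is a simplified stand-in, not the function above: `filter_stopwords` and the tiny `BASE_STOPWORDS` set are hypothetical names, and whitespace splitting replaces `nltk.word_tokenize`.

```python
# Hypothetical miniature stopword list; the real one comes from nltk.corpus.stopwords.
BASE_STOPWORDS = {"the", "is", "not", "on", "a"}


def filter_stopwords(text, extra_words=None, exclude_words=None):
    """Remove stopwords from already-cleaned text (simplified sketch)."""
    # Extra words are added first and excluded words are removed afterwards,
    # so a word passed in both lists ends up being kept in the text.
    stopword_set = (BASE_STOPWORDS | set(extra_words or [])) - set(exclude_words or [])
    return " ".join(word for word in text.split() if word not in stopword_set)


sentence = "the cat is not on the mat"
print(filter_stopwords(sentence))                         # cat mat
print(filter_stopwords(sentence, exclude_words=["not"]))  # cat not mat
print(filter_stopwords(sentence, extra_words=["cat"]))    # mat
```

Keeping "not" through `exclude_words` is a common choice when negation matters, for example in sentiment analysis.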


In [7]:
news_df = acquire.get_news_articles_data(refresh=False)

In [8]:
codeup_df = acquire.get_blog_articles_data(refresh=False)

In [9]:
# clean the title text
news_df['title'] = news_df['title'].apply(basic_clean)

# clean the article content (this overwrites the raw text stored in 'content')
news_df['content'] = news_df['content'].apply(basic_clean)

# tokenize the cleaned content (note: 'original' holds a list of tokens here, not the raw text)
news_df['original'] = news_df['content'].apply(tokenize)

# stem every word of the cleaned content
news_df['stemmed'] = news_df['content'].apply(stem)

# lemmatize every word of the cleaned content
news_df['lemmatized'] = news_df['content'].apply(lemmatize)

# remove stop words from the cleaned content
news_df['clean'] = news_df['content'].apply(remove_stopwords)

# print the first few rows of the dataframe
print(news_df.head())

                                               title  \
0  whatsapp responds to int'l calls scam announce...   
1  beyonce wears colour changing dress during con...   
2  complaint filed against prabhas kriti sanon's ...   
3   gauahar khan zaid darbar blessed with a baby boy   
4  yuzvendra chahal creates history takes most wi...   

                                             content  category  \
0  whatsapp has ramped up its ai and machine lear...  national   
1  singer beyonce wore a colour changing dress du...  national   
2  a complaint has been filed against prabhas and...  national   
3  actress gauahar khan and her husband zaid darb...  national   
4  rr leg spinner yuzvendra chahal has created hi...  national   

                                            original  \
0  [whatsapp, has, ramped, up, its, ai, and, mach...   
1  [singer, beyonce, wore, a, colour, changing, d...   
2  [a, complaint, has, been, filed, against, prab...   
3  [actress, gauahar, khan, and, her, husb

In [10]:
# clean the title text
codeup_df['title'] = codeup_df['title'].apply(basic_clean)

# clean the post content (this overwrites the raw text stored in 'content')
codeup_df['content'] = codeup_df['content'].apply(basic_clean)

# tokenize the cleaned content (note: 'original' holds a list of tokens here, not the raw text)
codeup_df['original'] = codeup_df['content'].apply(tokenize)

# stem every word of the cleaned content
codeup_df['stemmed'] = codeup_df['content'].apply(stem)

# lemmatize every word of the cleaned content
codeup_df['lemmatized'] = codeup_df['content'].apply(lemmatize)

# remove stop words from the cleaned content
codeup_df['clean'] = codeup_df['content'].apply(remove_stopwords)

# print the first few rows of the dataframe
print(codeup_df.head())

                                               title  \
0    women in tech panelist spotlight magdalena rahn   
1  women in tech panelist spotlight rachel robbin...   
2      women in tech panelist spotlight sarah mellor   
3  women in tech panelist spotlight madeleine capper   
4  black excellence in tech panelist spotlight wi...   

                                             content  \
0  codeup is hosting a women in tech panel in hon...   
1  codeup is hosting a women in tech panel in hon...   
2  codeup is hosting a women in tech panel in hon...   
3  codeup is hosting a women in tech panel in hon...   
4  codeup is hosting a black excellence in tech p...   

                                            original  \
0  [codeup, is, hosting, a, women, in, tech, pane...   
1  [codeup, is, hosting, a, women, in, tech, pane...   
2  [codeup, is, hosting, a, women, in, tech, pane...   
3  [codeup, is, hosting, a, women, in, tech, pane...   
4  [codeup, is, hosting, a, black, excellence,
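
Step 8 asks for `original` to keep the untouched content and for `stemmed` and `lemmatized` to be derived from `clean`. One way to get that layout is sketched below on plain dictionaries, so it runs without pandas or NLTK data. `build_columns` is a hypothetical helper, and the lambdas passed to it are placeholders for `basic_clean`, `remove_stopwords`, `stem` and `lemmatize` from the cells above.

```python
def build_columns(record, clean_fn, stem_fn, lemmatize_fn):
    """Return the five columns from step 8 for one article (illustrative sketch)."""
    original = record["content"]  # raw text, left untouched
    clean = clean_fn(original)    # normalized text with stopwords removed
    return {
        "title": record["title"],
        "original": original,
        "clean": clean,
        "stemmed": stem_fn(clean),          # derived from clean, not from original
        "lemmatized": lemmatize_fn(clean),  # derived from clean as well
    }


# Placeholder functions so the example is self-contained. In the notebook these
# would be lambda t: remove_stopwords(basic_clean(t)), stem and lemmatize.
row = build_columns(
    {"title": "Sample Title", "content": "The Cats Are Running"},
    clean_fn=lambda t: " ".join(w for w in t.lower().split() if w not in {"the", "are"}),
    stem_fn=lambda t: f"stem({t})",
    lemmatize_fn=lambda t: f"lemma({t})",
)
print(row)
```

A list of such dictionaries can be passed to `pd.DataFrame` to obtain a dataframe with exactly these five columns.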

9. If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
* If the goal is a more precise and accurate analysis of the corpus, lemmatized text is the better choice. If the goal is to process the text quickly and extract general patterns, stemmed text might be more appropriate.

9. If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
* I prefer lemmatizing for accuracy most of the time.

9. If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
* I'd sacrifice accuracy for cost in this case and go with stemmed text.
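
The cost difference behind these answers comes from how the two techniques work: stemming applies suffix rules to each word, while lemmatization typically relies on a vocabulary lookup (WordNet, in the case of `WordNetLemmatizer`). The toy sketch below illustrates the contrast only. `toy_stem`, `toy_lemmatize` and the two-entry `LEMMAS` table are hypothetical and far simpler than `PorterStemmer` or WordNet.

```python
# Hypothetical lookup table standing in for a real lemma dictionary.
LEMMAS = {"studies": "study", "mice": "mouse"}


def toy_stem(word):
    """Strip a few common suffixes by rule; no vocabulary is needed."""
    for suffix, replacement in (("ies", "i"), ("ing", ""), ("s", "")):
        # Only strip when a reasonably long stem would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word


def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


for word in ["studies", "mice"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The rule-based stemmer produces the non-word `studi` and leaves `mice` alone, while the lookup returns `study` and `mouse`. That extra accuracy is what the lookup cost pays for.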