# Prepare Exercises
The end result of this exercise should be a file named ```prepare.py``` that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

In [1]:
# unicode, regex, json for text digestion
import unicodedata
import re
import json

# nltk: natural language toolkit -> tokenization, stopwords
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

# pandas dataframe manipulation, acquire script, time formatting
import pandas as pd
import acquire
from time import strftime

# shh, down in front
import warnings
warnings.filterwarnings('ignore')

In [2]:
!code prepare.py

## 1. Define a function named ```basic_clean```. It should take in a string and apply some basic text cleaning to it:
* Lowercase everything
* Normalize unicode characters
* Remove anything that is not a letter, number, whitespace, or a single quote.
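
The unicode step is the least obvious of the three. Below is a minimal sketch, using only the standard library, of how NFKD normalization followed by an ASCII round-trip strips accents and drops characters that have no ASCII equivalent. The sample strings are made up for illustration.

```python
import unicodedata

sample = "Caf\u00e9 d\u00e9j\u00e0 vu"  # "Café déjà vu", hypothetical sample text

# NFKD splits each accented letter into a base letter plus a combining mark
decomposed = unicodedata.normalize("NFKD", sample)

# encoding to ASCII with errors="ignore" silently drops the combining marks
ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
print(ascii_only)

# a symbol with no ASCII decomposition (the rupee sign) disappears entirely
rupee = unicodedata.normalize("NFKD", "\u20b9300")
rupee_ascii = rupee.encode("ascii", "ignore").decode("ascii")
print(rupee_ascii)
```

Because unmappable characters are dropped rather than replaced, this approach suits English-language text; it would discard most of a document written in a non-Latin script.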

In [3]:
def basic_clean(string):
    '''
    Description:
    This function takes in the string argument and returns the string normalized, cleaned, and lowercase.
    
    Required Imports:
    import re
    import unicodedata
    
    Arguments:
    string = The string of text to be cleaned
    
    Returns:
    string - After being cleaned
    '''
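    # decompose accented characters, then drop anything outside ASCII
    string = unicodedata.normalize('NFKD', string)\
             .encode('ascii', 'ignore')\
             .decode('utf-8', 'ignore')
    # lowercase, then keep only letters, digits, single quotes, and whitespace
    # (an explicit a-z0-9 class is used because \w would also keep underscores)
    string = re.sub(r"[^a-z0-9'\s]", '', string.lower())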
    return string

In [4]:
basic_clean('I think THAT "S@t#uff$" will work 4-real!')

'i think that stuff will work 4real'
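
One subtlety when writing the character class for this step: `\w` already covers digits and also matches the underscore, which is neither a letter nor a number. The sketch below compares a `\w`-based class with an explicit one on a made-up sample string.

```python
import re

text = "snake_case isn't #1"  # hypothetical sample

# \w matches letters, digits AND the underscore, so the underscore survives
kept_underscore = re.sub(r"[^\w'\s]", "", text)
print(kept_underscore)

# an explicit class applied to lowercased text removes the underscore too
explicit = re.sub(r"[^a-z0-9'\s]", "", text.lower())
print(explicit)
```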

## 2. Define a function named ```tokenize```. It should take in a string and tokenize all the words in the string.

In [5]:
def tokenize(string):
    '''
    Description:
    This function takes in the string argument and returns the string tokenized.
    
    Required Imports:
    import nltk
    from nltk.tokenize.toktok import ToktokTokenizer
    
    Arguments:
    string = The string of text to be tokenized
    
    Returns:
    string - The tokens joined by single spaces
    '''
    tokenizer = nltk.tokenize.ToktokTokenizer()
    string = tokenizer.tokenize(string, return_str = True)
    
    return string

In [6]:
tokenize('worked hated hours')

'worked hated hours'
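
The sample above contains no punctuation, so tokenizing it changes nothing. The effect shows up on text with apostrophes: the `clean` column produced later in this notebook contains lone `'` tokens, which is consistent with the tokenizer separating the apostrophe from the word and the stopword filter then dropping the leftover `s`. The sketch below imitates that with a plain regex and a two-word stopword set. Both are stand-ins for illustration, not Toktok's actual rules or NLTK's stopword list.

```python
import re

text = "maharashtra's jalna district"

# stand-in for the tokenizer: pad apostrophes with spaces, collapse whitespace
tokenized = re.sub(r"\s+", " ", re.sub(r"'", " ' ", text)).strip()
print(tokenized)

# stand-in stopword set (hypothetical, only for this example)
toy_stopwords = {"s", "the"}

# dropping the stopword "s" leaves the apostrophe behind as its own token
filtered = " ".join(w for w in tokenized.split() if w not in toy_stopwords)
print(filtered)
```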

## 3. Define a function named ```stem```. It should accept some text and return the text after applying stemming to all the words.

In [7]:
def stem(string):
    '''
    Description:
    This function takes in the string argument and returns the stemmed words.
    
    Required Imports:
    import nltk
    
    Arguments:
    string = The string of text to be stemmed
    
    Returns:
    string - The stemmed words joined by single spaces
    '''
    ps = nltk.porter.PorterStemmer()
    stems = [ps.stem(word) for word in string.split()]
    string = ' '.join(stems)
    
    return string

In [8]:
stem('worked hated hours')

'work hate hour'

## 4. Define a function named ```lemmatize```. It should accept some text and return the text after applying lemmatization to each word.

In [9]:
def lemmatize(string):
    '''
    Description:
    This function takes in the string argument and returns a string with words lemmatized.
    
    Required Imports:
    import nltk
    
    Arguments:
    string = The string of text to be lemmatized
    
    Returns:
    string - The lemmatized words joined by single spaces
    '''
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    string = ' '.join(lemmas)
    return string

In [10]:
lemmatize('Im not quite sure that these words are long enough to be lemmatized')

'Im not quite sure that these word are long enough to be lemmatized'

## 5. Define a function named ```remove_stopwords```. It should accept some text and return the text after removing all the stopwords.
* This function should define two optional parameters, ```extra_words``` and ```exclude_words```. 
* These parameters should define any additional stop words to include, and any words that we _**don't**_ want to remove.
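
The two optional parameters amount to set arithmetic on the stopword list. Below is a minimal sketch of that logic without NLTK. `build_stopword_set` and the four-word base list are hypothetical stand-ins for the real English stopword list.

```python
def build_stopword_set(base_words, extra_words=None, exclude_words=None):
    """Return base_words minus exclude_words, plus extra_words."""
    # None defaults avoid sharing one mutable list between calls
    extra = set(extra_words or [])
    exclude = set(exclude_words or [])
    # exclusions are applied first, so a word listed in both ends up included
    return (set(base_words) - exclude) | extra

base = ["the", "to", "of", "out"]  # hypothetical stand-in stopword list
stops = build_stopword_set(base, extra_words=["text"], exclude_words=["out"])

sentence = "I want to see the stop words out of the text"
kept = [w for w in sentence.split() if w not in stops]
print(" ".join(kept))
```

Here `out` survives because it was excluded from the stopwords, and `text` is removed because it was added to them.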

In [11]:
def remove_stopwords(string, extra_words=None, exclude_words=None):
    '''
    Description:
    This function takes in a string plus optional extra_words and exclude_words lists
    and returns the string with the stopwords removed.
    
    Required Imports:
    import nltk
    from nltk.corpus import stopwords
    
    Arguments:
           string = The string of text to be filtered
      extra_words = List of words to add to the stopword list (optional)
    exclude_words = List of words to remove from the stopword list (optional)
    
    Returns:
    string - The text with stopwords removed
    '''
    # None defaults avoid sharing one mutable list between calls
    extra_words = extra_words or []
    exclude_words = exclude_words or []
    stopword_list = stopwords.words('english')
    stopword_list = set(stopword_list) - set(exclude_words)
    stopword_list = stopword_list.union(set(extra_words))
    words = string.split()
    filtered_words = [word for word in words if word not in stopword_list]
    string = ' '.join(filtered_words)
    return string

In [12]:
remove_stopwords('I want to see the stop words out of the text', extra_words = [], exclude_words = [])

'I want see stop words text'

## 6. Use the data from your acquire script to produce a dataframe of the news articles. Name the dataframe ```news_df```.

In [13]:
import acquire as a
url = 'https://inshorts.com/en/read'
category_list = ['technology']

articles = a.get_news_articles(url, category_list)

In [14]:
news_df = pd.DataFrame(articles)
news_df

| | title | content | category |
|---|---|---|---|
| 0 | Supreme Court rejects Google's plea against re... | The Supreme Court on Thursday refused to enter... | technology |
| 1 | 101 people complain in one day of ₹300 crore l... | Maharashtra's Jalna district police received 1... | technology |
| 2 | Instagram launches 'Quiet Mode' to help people... | Instagram's new 'Quiet Mode' will help users t... | technology |
| 3 | Amazon beats Apple to become world's most valu... | Amazon has beaten Apple to become the world's ... | technology |
| 4 | Which are the world's 10 most valuable brands? | Amazon has beaten Apple to become the world's ... | technology |
| 5 | Which are the world's 10 most valuable IT serv... | Accenture, with a brand value of $39.9 billion... | technology |
| 6 | Wikipedia changes its look for the first time ... | Wikipedia has unveiled a new look for the firs... | technology |
| 7 | SpaceX launches advanced GPS satellite for US ... | SpaceX on Wednesday launched an advanced GPS s... | technology |
| 8 | Boeing wins NASA deal for greener, more fuel-e... | NASA has issued an award to Boeing to build, t... | technology |
| 9 | Despicable human: Musk as PR firm CEO says Twi... | Twitter owner Elon Musk reacted to Richard Ede... | technology |


## 7. Make another dataframe for the Codeup blog posts. Name the dataframe ```codeup_df```.

In [15]:
import acquire as a
url = 'https://codeup.com/blog/'

articles = a.get_blog_articles(url)

In [16]:
codeup_df = pd.DataFrame(articles)
codeup_df

| | title | content |
|---|---|---|
| 0 | Codeup Among Top 58 Best Coding Bootcamps of 2023 | Codeup is pleased to announce we have been ran... |
| 1 | Become a Data Scientist in 6 Months! | Are you feeling unfulfilled in your work but w... |
| 2 | Hiring Tech Talent Around the Holidays | Are you a hiring manager having trouble fillin... |
| 3 | Cloud Administration Program New Funding Options | Finding resources to fund your educational goa... |
| 4 | Why Dallas is a Great Location for IT Professi... | When breaking into a new career, it is importa... |
| 5 | Codeup is ranked #1 Best in DFW 2022 | We are excited to announce that Codeup ranked ... |


## 8. For each dataframe, produce the following columns:
* ```title``` to hold the title
* ```original``` to hold the original article/post content
* ```clean``` to hold the normalized and tokenized original with the stopwords removed.
* ```stemmed``` to hold the stemmed version of the cleaned data.
* ```lemmatized``` to hold the lemmatized version of the cleaned data.

In [17]:
def prep_article_data(df, column_name, extra_words=None, exclude_words=None):
    '''
    Description:
    This function takes in a df and the name of a text column, with optional
    lists for extra_words and exclude_words, and returns a df with the article
    title plus the original, cleaned, stemmed, and lemmatized text.
    The new columns are also added to the df that was passed in.
        
    Required Imports:
    import nltk
    import pandas as pd
    
    Arguments:
               df = DataFrame
      column_name = The name of the column that holds the target text to be prepared
      extra_words = List of words to add to the stopword list (optional)
    exclude_words = List of words to remove from the stopword list (optional)
        
    Returns:
    df - DataFrame with each of the columns: 'title', 'original', 'clean', 'stemmed', 'lemmatized'    
    '''
    # normalize the optional arguments so remove_stopwords always receives lists
    extra_words = extra_words or []
    exclude_words = exclude_words or []
    
    df['clean'] = df[column_name].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords,
                                  extra_words=extra_words,
                                  exclude_words=exclude_words)
    
    df['original'] = df[column_name] 
    
    df['stemmed'] = df['clean'].apply(stem)
    
    df['lemmatized'] = df['clean'].apply(lemmatize)
    
    return df[['title', 'original', 'clean', 'stemmed', 'lemmatized']]

In [18]:
prep_article_data(news_df, 'content').head()

| | title | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Supreme Court rejects Google's plea against re... | The Supreme Court on Thursday refused to enter... | supreme court thursday refused entertain googl... | suprem court thursday refus entertain googl ' ... | supreme court thursday refused entertain googl... |
| 1 | 101 people complain in one day of ₹300 crore l... | Maharashtra's Jalna district police received 1... | maharashtra ' jalna district police received 1... | maharashtra ' jalna district polic receiv 101 ... | maharashtra ' jalna district police received 1... |
| 2 | Instagram launches 'Quiet Mode' to help people... | Instagram's new 'Quiet Mode' will help users t... | instagram ' new ' quiet mode ' help users take... | instagram ' new ' quiet mode ' help user take ... | instagram ' new ' quiet mode ' help user take ... |
| 3 | Amazon beats Apple to become world's most valu... | Amazon has beaten Apple to become the world's ... | amazon beaten apple become world ' valuable br... | amazon beaten appl becom world ' valuabl brand... | amazon beaten apple become world ' valuable br... |
| 4 | Which are the world's 10 most valuable brands? | Amazon has beaten Apple to become the world's ... | amazon beaten apple become world ' valuable br... | amazon beaten appl becom world ' valuabl brand... | amazon beaten apple become world ' valuable br... |


In [19]:
prep_article_data(codeup_df, 'content').head()

| | title | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Codeup Among Top 58 Best Coding Bootcamps of 2023 | Codeup is pleased to announce we have been ran... | codeup pleased announce ranked among 58 best c... | codeup pleas announc rank among 58 best code b... | codeup pleased announce ranked among 58 best c... |
| 1 | Become a Data Scientist in 6 Months! | Are you feeling unfulfilled in your work but w... | feeling unfulfilled work want avoid returning ... | feel unfulfil work want avoid return tradit ed... | feeling unfulfilled work want avoid returning ... |
| 2 | Hiring Tech Talent Around the Holidays | Are you a hiring manager having trouble fillin... | hiring manager trouble filling position around... | hire manag troubl fill posit around holiday co... | hiring manager trouble filling position around... |
| 3 | Cloud Administration Program New Funding Options | Finding resources to fund your educational goa... | finding resources fund educational goals possi... | find resourc fund educ goal possibl largest ob... | finding resource fund educational goal possibl... |
| 4 | Why Dallas is a Great Location for IT Professi... | When breaking into a new career, it is importa... | breaking new career important explore job oppo... | break new career import explor job opportun ex... | breaking new career important explore job oppo... |


In [20]:
import prepare as p

In [21]:
p.prep_article_data(news_df, 'content').head()

| | title | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Supreme Court rejects Google's plea against re... | The Supreme Court on Thursday refused to enter... | supreme court thursday refused entertain googl... | suprem court thursday refus entertain googl ' ... | supreme court thursday refused entertain googl... |
| 1 | 101 people complain in one day of ₹300 crore l... | Maharashtra's Jalna district police received 1... | maharashtra ' jalna district police received 1... | maharashtra ' jalna district polic receiv 101 ... | maharashtra ' jalna district police received 1... |
| 2 | Instagram launches 'Quiet Mode' to help people... | Instagram's new 'Quiet Mode' will help users t... | instagram ' new ' quiet mode ' help users take... | instagram ' new ' quiet mode ' help user take ... | instagram ' new ' quiet mode ' help user take ... |
| 3 | Amazon beats Apple to become world's most valu... | Amazon has beaten Apple to become the world's ... | amazon beaten apple become world ' valuable br... | amazon beaten appl becom world ' valuabl brand... | amazon beaten apple become world ' valuable br... |
| 4 | Which are the world's 10 most valuable brands? | Amazon has beaten Apple to become the world's ... | amazon beaten apple become world ' valuable br... | amazon beaten appl becom world ' valuabl brand... | amazon beaten apple become world ' valuable br... |


In [22]:
p.prep_article_data(codeup_df, 'content').head()

| | title | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Codeup Among Top 58 Best Coding Bootcamps of 2023 | Codeup is pleased to announce we have been ran... | codeup pleased announce ranked among 58 best c... | codeup pleas announc rank among 58 best code b... | codeup pleased announce ranked among 58 best c... |
| 1 | Become a Data Scientist in 6 Months! | Are you feeling unfulfilled in your work but w... | feeling unfulfilled work want avoid returning ... | feel unfulfil work want avoid return tradit ed... | feeling unfulfilled work want avoid returning ... |
| 2 | Hiring Tech Talent Around the Holidays | Are you a hiring manager having trouble fillin... | hiring manager trouble filling position around... | hire manag troubl fill posit around holiday co... | hiring manager trouble filling position around... |
| 3 | Cloud Administration Program New Funding Options | Finding resources to fund your educational goa... | finding resources fund educational goals possi... | find resourc fund educ goal possibl largest ob... | finding resource fund educational goal possibl... |
| 4 | Why Dallas is a Great Location for IT Professi... | When breaking into a new career, it is importa... | breaking new career important explore job oppo... | break new career import explor job opportun ex... | breaking new career important explore job oppo... |


## 9. Ask yourself:
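* If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
  * **Lemmatized.** At this size the extra processing time is negligible, and lemmas stay readable words (compare ```supreme``` with the stem ```suprem``` in the tables above).
* If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
  * **Lemmatized.** The corpus is still small enough that the slower, dictionary-based approach is affordable.
* If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
  * **Stemmed.** At this scale cost dominates, and rule-based stemming is the cheaper operation.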
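
The trade-off behind these answers can be illustrated with a toy comparison. The sketch below is not the Porter stemmer or the WordNet lemmatizer: `toy_stem`, `toy_lemmatize`, the suffix list, and the lookup table are all hypothetical stand-ins. It only shows the structural difference. Stemming applies cheap string rules and can return a non-word or the wrong word, while lemmatization depends on a dictionary lookup.

```python
# Toy stand-ins for illustration only, NOT Porter or WordNet.
SUFFIXES = ("ing", "ed", "s")  # hypothetical rule list

def toy_stem(word):
    """Strip the first matching suffix; no dictionary is needed."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# hypothetical lookup table standing in for a lexical database
LEMMA_TABLE = {"worked": "work", "hated": "hate", "hours": "hour"}

def toy_lemmatize(word):
    """Return the dictionary form if known, else the word unchanged."""
    return LEMMA_TABLE.get(word, word)

words = "worked hated hours".split()
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The crude rule turns `hated` into `hat`, a different word, whereas the lookup returns `hate`. A real stemmer has better rules, but the same kind of error is the price of skipping the dictionary.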