# Import

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire

# Exercises

In these exercises we define functions to prepare textual data. The functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Replace anything that is not a letter, number, whitespace or a single quote.

2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

6. Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

8. For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.

9. Ask yourself:

- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

## Exercise 1. 
Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Replace anything that is not a letter, number, whitespace or a single quote.

In [2]:
urls = acquire.get_blog_urls()
articles = acquire.get_news_articles()

In [3]:
string = articles.content[0]
string

"India's GDP grew at 13.5% in the first quarter of FY23, achieving its fastest annual expansion in a year, government data showed. However, it is lower than the Reserve Bank of India's (RBI) projection of 16.2% GDP growth in the first quarter of FY23. India's GDP growth in the first quarter of FY22 was 20.1%."

In [4]:
# lowercase everything
string.lower()

"india's gdp grew at 13.5% in the first quarter of fy23, achieving its fastest annual expansion in a year, government data showed. however, it is lower than the reserve bank of india's (rbi) projection of 16.2% gdp growth in the first quarter of fy23. india's gdp growth in the first quarter of fy22 was 20.1%."

In [5]:
# Normalizing unicode characters
unicodedata.normalize('NFKD', string)\
    .encode('ascii', 'ignore')\
    .decode('UTF-8')

"India's GDP grew at 13.5% in the first quarter of FY23, achieving its fastest annual expansion in a year, government data showed. However, it is lower than the Reserve Bank of India's (RBI) projection of 16.2% GDP growth in the first quarter of FY23. India's GDP growth in the first quarter of FY22 was 20.1%."

In [6]:
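# Remove anything that is not a lowercase letter, digit, or whitespace.
# Run on the raw (not yet lowercased) string, this also strips capital letters,
# which is why basic_clean lowercases first.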
re.sub(r"[^a-z0-9\s]", '', string)

'ndias  grew at 135 in the first quarter of 23 achieving its fastest annual expansion in a year government data showed owever it is lower than the eserve ank of ndias  projection of 162  growth in the first quarter of 23 ndias  growth in the first quarter of 22 was 201'

In [7]:
# creating the function
def basic_clean(string):
    
    # lowercase everything
    string = string.lower()
    
    # normalize unicode characters:
    # decompose with NFKD, encode to ASCII (dropping anything non-ASCII),
    # then decode the bytes back into a str
    string = unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('UTF-8')
    
    # remove anything that is not a lowercase letter, digit, or whitespace
    # (note: this pattern also removes single quotes)
    string = re.sub(r"[^a-z0-9\s]", '', string)
    
    return string

In [8]:
cleaned_string = basic_clean(string)
cleaned_string

'indias gdp grew at 135 in the first quarter of fy23 achieving its fastest annual expansion in a year government data showed however it is lower than the reserve bank of indias rbi projection of 162 gdp growth in the first quarter of fy23 indias gdp growth in the first quarter of fy22 was 201'
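
Two details of `basic_clean` are easy to miss on this sample: the article is already plain ASCII, so the normalization step changes nothing, and the exercise statement asks to keep single quotes while the pattern above removes them (hence `indias`). The sketch below is a variant that keeps the quote and also shows the normalization on an accented string. The name `basic_clean_keep_quotes` and the sample strings are hypothetical, not part of the notebook.

```python
import re
import unicodedata


def basic_clean_keep_quotes(string):
    """Lowercase, normalize unicode, and keep only letters, digits, whitespace and single quotes."""
    string = string.lower()
    # NFKD splits accented letters into a base letter plus a combining mark;
    # encoding to ASCII with errors='ignore' then drops the combining marks
    string = unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('utf-8')
    # the apostrophe inside the character class is the only change from basic_clean
    return re.sub(r"[^a-z0-9'\s]", '', string)


print(basic_clean_keep_quotes("India's GDP grew at 13.5%"))  # india's gdp grew at 135
print(basic_clean_keep_quotes("Café Naïve"))  # cafe naive
```

A curly apostrophe (’) has no ASCII decomposition, so the normalization step drops it even in this variant (compare `codeups` and `youre` in the Codeup output further down). Whether keeping the straight quote is useful depends on the later steps: ToktokTokenizer splits `India's` into separate tokens, as the tokenized raw string in Exercise 2 shows.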

## Exercise 2. 
Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [9]:
# create the tokenizer
token = nltk.tokenize.ToktokTokenizer()

In [10]:
# use the tokenizer; return_str=True returns a string rather than a list of tokens
string = token.tokenize(string, return_str=True)
string

"India ' s GDP grew at 13.5 % in the first quarter of FY23 , achieving its fastest annual expansion in a year , government data showed. However , it is lower than the Reserve Bank of India ' s ( RBI ) projection of 16.2 % GDP growth in the first quarter of FY23. India ' s GDP growth in the first quarter of FY22 was 20.1 % ."

In [11]:
def tokenize(string):
    """
    This function will take in a string, tokenize the string and 
    return the tokenized string
    """
    
    # create the tokenizer
    token = nltk.tokenize.ToktokTokenizer()
    
    # use the tokenizer; return_str=True returns a string rather than a list of tokens
    string = token.tokenize(string, return_str=True)
    
    return string

In [12]:
token = tokenize(cleaned_string)
token

'indias gdp grew at 135 in the first quarter of fy23 achieving its fastest annual expansion in a year government data showed however it is lower than the reserve bank of indias rbi projection of 162 gdp growth in the first quarter of fy23 indias gdp growth in the first quarter of fy22 was 201'

## Exercise 3. 
Define a function named stem. It should accept some text and return the text after applying stemming to all the words.


In [13]:
# Create the nltk stemmer object, then use it
ps = nltk.porter.PorterStemmer()

In [14]:
# apply the stemmer to each word in the string and create a list
# of stemmed words
stems = [ps.stem(word) for word in string.split()]


In [15]:
def stem(string):
    """
    This function will accept some text (a string) and return a stemmed 
    version of the text
    """
    
    # create the Porter stemmer
    ps = nltk.porter.PorterStemmer()
    
    # apply the stemmer to each word in the string and create a list
    # of stemmed words
    stems = [ps.stem(word) for word in string.split()]
    
    # join the stemmed words back into a single string
    stemmed_string = ' '.join(stems)
    
    return stemmed_string

In [16]:
#test
stemmed_string = stem(token)
stemmed_string

'india gdp grew at 135 in the first quarter of fy23 achiev it fastest annual expans in a year govern data show howev it is lower than the reserv bank of india rbi project of 162 gdp growth in the first quarter of fy23 india gdp growth in the first quarter of fy22 wa 201'

## Exercise 4. 
Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.


In [17]:
# download the WordNet data used by the lemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/davidschneemann/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [18]:
# download omw-1.4 for nltk
# nltk.download('omw-1.4')

In [19]:
wnl = nltk.stem.WordNetLemmatizer()

In [20]:
[wnl.lemmatize(word) for word in string.split()]

['India',
 "'",
 's',
 'GDP',
 'grew',
 'at',
 '13.5',
 '%',
 'in',
 'the',
 'first',
 'quarter',
 'of',
 'FY23',
 ',',
 'achieving',
 'it',
 'fastest',
 'annual',
 'expansion',
 'in',
 'a',
 'year',
 ',',
 'government',
 'data',
 'showed.',
 'However',
 ',',
 'it',
 'is',
 'lower',
 'than',
 'the',
 'Reserve',
 'Bank',
 'of',
 'India',
 "'",
 's',
 '(',
 'RBI',
 ')',
 'projection',
 'of',
 '16.2',
 '%',
 'GDP',
 'growth',
 'in',
 'the',
 'first',
 'quarter',
 'of',
 'FY23.',
 'India',
 "'",
 's',
 'GDP',
 'growth',
 'in',
 'the',
 'first',
 'quarter',
 'of',
 'FY22',
 'wa',
 '20.1',
 '%',
 '.']

In [21]:
def lemmatize(string):
    """This function takes in a string and returns a lemmatized 
    version of the string"""
    
    # create the lemmatizer
    wnl = nltk.stem.WordNetLemmatizer()
    
    # lemmatize each word in the string
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    # join the lemmas back into a single string
    string_lemmatize = ' '.join(lemmas)
    
    return string_lemmatize

In [22]:

lemmatized = lemmatize(token)
lemmatized

'india gdp grew at 135 in the first quarter of fy23 achieving it fastest annual expansion in a year government data showed however it is lower than the reserve bank of india rbi projection of 162 gdp growth in the first quarter of fy23 india gdp growth in the first quarter of fy22 wa 201'

## Exercise 5.
Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.


In [23]:
def remove_stopwords(string, extra_words=None, exclude_words=None):
    """
    This function will take in a string, filter out stop words from the nltk standard english list 
    as well as any other extra words, and return a version of the text without these stopwords.
    It includes optional parameters allowing the user to add extra words to remove 
    or to exclude words from the stopword list.
    """
    
    # get english stopwords from nltk as a set for fast membership checks
    stop_words = set(stopwords.words('english'))
    
    # add extra words to be removed to the stop word set
    stop_words |= set(extra_words or [])
    
    # drop words to be excluded from the stop word set
    # (set difference ignores words that are not stop words, where list.remove would raise ValueError)
    stop_words -= set(exclude_words or [])
    
    # create a list of words to be checked by splitting the string
    words = string.split()
    
    # filter out all of the stop words
    filtered_words = [word for word in words if word not in stop_words]
    
    # join the list of filtered words into a string
    filtered_string = ' '.join(filtered_words)
    
    return filtered_string

In [24]:
removed_stopwords = remove_stopwords(token)
removed_stopwords

'indias gdp grew 135 first quarter fy23 achieving fastest annual expansion year government data showed however lower reserve bank indias rbi projection 162 gdp growth first quarter fy23 indias gdp growth first quarter fy22 201'

In [25]:
# Test using extra_words and exclude_words options
extra_words = ['russia', 'matter']
exclude_words = ['the']

filters = remove_stopwords(token, extra_words, exclude_words)
filters

'indias gdp grew 135 the first quarter fy23 achieving fastest annual expansion year government data showed however lower the reserve bank indias rbi projection 162 gdp growth the first quarter fy23 indias gdp growth the first quarter fy22 201'
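
To see how the two optional parameters interact without downloading the NLTK stopword corpus, here is a small sketch using a stand-in stopword set. The set `DEMO_STOPWORDS` and the function name `remove_stopwords_demo` are assumptions for illustration only; the real list comes from `stopwords.words('english')`.

```python
# Stand-in for stopwords.words('english'); deliberately tiny and hypothetical
DEMO_STOPWORDS = {'the', 'in', 'of', 'at', 'is', 'it'}


def remove_stopwords_demo(string, extra_words=None, exclude_words=None):
    """Remove stop words, adding extra_words and sparing exclude_words."""
    # union adds the extra words, difference spares the excluded ones
    stop_words = (DEMO_STOPWORDS | set(extra_words or [])) - set(exclude_words or [])
    return ' '.join(word for word in string.split() if word not in stop_words)


text = 'gdp grew in the first quarter of fy23'

print(remove_stopwords_demo(text))
# gdp grew first quarter fy23
print(remove_stopwords_demo(text, extra_words=['gdp'], exclude_words=['the']))
# grew the first quarter fy23
```

Set difference ignores an excluded word that is not a stop word, whereas `list.remove` raises `ValueError` in that case.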

## Exercise 6.
Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

In [26]:
news_df = acquire.get_news_articles()
news_df

Unnamed: 0,category,title,author,date,content
0,business,India's GDP grows at 13.5% in first quarter of...,Anmol Sharma,2022-08-31,India's GDP grew at 13.5% in the first quarter...
1,business,Musk seeks to delay Twitter trial to Nov amid ...,Ridham Gambhir,2022-08-31,Tesla CEO Elon Musk is seeking to delay the tr...
2,business,"Snap to lay off 20% of staff, cancel several p...",Ananya Goyal,2022-08-31,Snap said on Wednesday it will lay off 20% of ...
3,business,2 top executives at Snap quit hours after repo...,Ridham Gambhir,2022-08-31,Two senior advertising executives at Snap quit...
4,business,Viral video shows Amazon parcels thrown out of...,Apaar Sharma,2022-08-31,A video from Guwahati railway station has gone...
...,...,...,...,...,...
95,entertainment,S Korea may hold survey on BTS members' mandat...,Indraneel Sen,2022-08-31,South Korea is considering a survey to determi...
96,entertainment,Judge 'The Kashmir...' according to its cinema...,Amartya Sharma,2022-08-31,'The Kashmir Files' actress Pallavi Joshi has ...
97,entertainment,Response to 'Delhi Crime 2' is enormous: Actre...,Amartya Sharma,2022-08-31,"Actress Shefali Shah, speaking about the recen..."
98,entertainment,We have been inviting Bappa for almost 60 year...,Amartya Sharma,2022-08-31,"Actress Shraddha Kapoor, speaking on the occas..."


## Exercise 7.
Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [27]:
codeup_df = acquire.get_blog_articles()
codeup_df

Unnamed: 0,title,date,category,content
0,Is a Career in Tech Recession-Proof?,2022-08-12,Cloud Administration,"Given the current economic climate, many econo..."
1,Codeup X Superhero Car Show & Comic Con,2022-08-10,Codeup News,Codeup had a blast at the San Antonio Superher...
2,What Jobs Can You Get After a Coding Bootcamp?...,2022-08-02,Featured,If you’re considering a career in web developm...
3,Codeup’s New Dallas Campus,2022-07-25,Codeup News,Codeup’s Dallas campus has a new location! For...
4,Codeup TV Commercial,2022-07-20,Codeup News,Codeup has officially made its TV debut! Our c...
5,What Jobs Can You Get After a Coding Bootcamp?...,2022-07-14,Featured,Have you been considering a career in Cloud Ad...


## Exercise 8.
For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.

In [28]:
# Rename the content columns to original 
news_df.rename(columns = {'content':'original'}, inplace = True)
codeup_df.rename(columns = {'content':'original'}, inplace = True)

In [29]:
## Starting with the news_df

# create the clean column by applying the basic_clean, tokenize, and remove_stopwords functions
news_df['clean'] = news_df['original'].apply(basic_clean).apply(tokenize).apply(remove_stopwords)

# create the stemmed column by applying the stem function to the cleaned text
news_df['stemmed'] = news_df['clean'].apply(stem)

# create the lemmatized column by applying the lemmatize function to the cleaned text
news_df['lemmatized'] = news_df['clean'].apply(lemmatize)

news_df

Unnamed: 0,category,title,author,date,original,clean,stemmed,lemmatized
0,business,India's GDP grows at 13.5% in first quarter of...,Anmol Sharma,2022-08-31,India's GDP grew at 13.5% in the first quarter...,indias gdp grew 135 first quarter fy23 achievi...,india gdp grew 135 first quarter fy23 achiev f...,india gdp grew 135 first quarter fy23 achievin...
1,business,Musk seeks to delay Twitter trial to Nov amid ...,Ridham Gambhir,2022-08-31,Tesla CEO Elon Musk is seeking to delay the tr...,tesla ceo elon musk seeking delay trial twitte...,tesla ceo elon musk seek delay trial twitter n...,tesla ceo elon musk seeking delay trial twitte...
2,business,"Snap to lay off 20% of staff, cancel several p...",Ananya Goyal,2022-08-31,Snap said on Wednesday it will lay off 20% of ...,snap said wednesday lay 20 staff shut original...,snap said wednesday lay 20 staff shut origin s...,snap said wednesday lay 20 staff shut original...
3,business,2 top executives at Snap quit hours after repo...,Ridham Gambhir,2022-08-31,Two senior advertising executives at Snap quit...,two senior advertising executives snap quit ho...,two senior advertis execut snap quit hour repo...,two senior advertising executive snap quit hou...
4,business,Viral video shows Amazon parcels thrown out of...,Apaar Sharma,2022-08-31,A video from Guwahati railway station has gone...,video guwahati railway station gone viral show...,video guwahati railway station gone viral show...,video guwahati railway station gone viral show...
...,...,...,...,...,...,...,...,...
95,entertainment,S Korea may hold survey on BTS members' mandat...,Indraneel Sen,2022-08-31,South Korea is considering a survey to determi...,south korea considering survey determine wheth...,south korea consid survey determin whether mem...,south korea considering survey determine wheth...
96,entertainment,Judge 'The Kashmir...' according to its cinema...,Amartya Sharma,2022-08-31,'The Kashmir Files' actress Pallavi Joshi has ...,kashmir files actress pallavi joshi responded ...,kashmir file actress pallavi joshi respond opp...,kashmir file actress pallavi joshi responded o...
97,entertainment,Response to 'Delhi Crime 2' is enormous: Actre...,Amartya Sharma,2022-08-31,"Actress Shefali Shah, speaking about the recen...",actress shefali shah speaking recently release...,actress shefali shah speak recent releas new s...,actress shefali shah speaking recently release...
98,entertainment,We have been inviting Bappa for almost 60 year...,Amartya Sharma,2022-08-31,"Actress Shraddha Kapoor, speaking on the occas...",actress shraddha kapoor speaking occasion gane...,actress shraddha kapoor speak occas ganesh cha...,actress shraddha kapoor speaking occasion gane...


In [30]:
## Now apply the same to the codeup_df

# create the clean column by applying the basic_clean, tokenize, and remove_stopwords functions
codeup_df['clean'] = codeup_df['original'].apply(basic_clean).apply(tokenize).apply(remove_stopwords)

# create the stemmed column by applying the stem function to the cleaned text
codeup_df['stemmed'] = codeup_df['clean'].apply(stem)

# create the lemmatized column by applying the lemmatize function to the cleaned text
codeup_df['lemmatized'] = codeup_df['clean'].apply(lemmatize)

codeup_df

Unnamed: 0,title,date,category,original,clean,stemmed,lemmatized
0,Is a Career in Tech Recession-Proof?,2022-08-12,Cloud Administration,"Given the current economic climate, many econo...",given current economic climate many economists...,given current econom climat mani economist con...,given current economic climate many economist ...
1,Codeup X Superhero Car Show & Comic Con,2022-08-10,Codeup News,Codeup had a blast at the San Antonio Superher...,codeup blast san antonio superhero car show co...,codeup blast san antonio superhero car show co...,codeup blast san antonio superhero car show co...
2,What Jobs Can You Get After a Coding Bootcamp?...,2022-08-02,Featured,If you’re considering a career in web developm...,youre considering career web development dont ...,your consid career web develop dont know expec...,youre considering career web development dont ...
3,Codeup’s New Dallas Campus,2022-07-25,Codeup News,Codeup’s Dallas campus has a new location! For...,codeups dallas campus new location two years c...,codeup dalla campu new locat two year codeup o...,codeups dallas campus new location two year co...
4,Codeup TV Commercial,2022-07-20,Codeup News,Codeup has officially made its TV debut! Our c...,codeup officially made tv debut community stud...,codeup offici made tv debut commun student sta...,codeup officially made tv debut community stud...
5,What Jobs Can You Get After a Coding Bootcamp?...,2022-07-14,Featured,Have you been considering a career in Cloud Ad...,considering career cloud administration idea j...,consid career cloud administr idea job titl po...,considering career cloud administration idea j...
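
The two cells above repeat the same steps for each dataframe. One possible way to factor that out is a helper that takes the dataframe and the preparation functions as arguments. This is a sketch only: `add_prepared_columns` is a hypothetical name, and the lambdas in the usage are stand-ins for `basic_clean`, `tokenize`, `remove_stopwords`, `stem`, and `lemmatize` so that the block runs without NLTK.

```python
import pandas as pd


def add_prepared_columns(df, clean_fns, stem_fn, lemmatize_fn, source='original'):
    """Return a copy of df with clean, stemmed and lemmatized columns added."""
    df = df.copy()
    cleaned = df[source]
    # apply each cleaning step in order, e.g. basic_clean, tokenize, remove_stopwords
    for fn in clean_fns:
        cleaned = cleaned.apply(fn)
    df['clean'] = cleaned
    # both derived columns start from the cleaned text
    df['stemmed'] = df['clean'].apply(stem_fn)
    df['lemmatized'] = df['clean'].apply(lemmatize_fn)
    return df


# Stand-in data and functions, for illustration only
demo_df = pd.DataFrame({'title': ['Demo'], 'original': ['The Cats Sat']})
prepared = add_prepared_columns(
    demo_df,
    clean_fns=[str.lower, lambda s: s.replace('the ', '')],
    stem_fn=lambda s: s.replace('cats', 'cat'),
    lemmatize_fn=lambda s: s.replace('cats', 'cat'),
)
print(prepared[['clean', 'stemmed', 'lemmatized']])
```

With the notebook's own functions, the call would look like `add_prepared_columns(news_df, [basic_clean, tokenize, remove_stopwords], stem, lemmatize)`, and likewise for `codeup_df`.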


## Exercise 9.
Ask yourself:

- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text? 
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text? 
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

Asking myself complete :P
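
One way to ground these answers is to time both functions on a sample of the corpus and scale the result to the full size. A minimal sketch using only the standard library; `time_per_megabyte` is a hypothetical helper, and the two lambdas stand in for the `stem` and `lemmatize` functions defined earlier.

```python
import time


def time_per_megabyte(fn, sample_text, repeats=5):
    """Estimate seconds needed for fn to process one megabyte of text like sample_text."""
    size_mb = len(sample_text.encode('utf-8')) / 1_000_000
    start = time.perf_counter()
    for _ in range(repeats):
        fn(sample_text)
    # average time for one pass over the sample
    elapsed = (time.perf_counter() - start) / repeats
    return elapsed / size_mb


# Stand-ins for stem and lemmatize, for illustration only
fast_fn = lambda text: text.lower()
slow_fn = lambda text: ' '.join(word.lower() for word in text.split())

sample = 'government data showed however it is lower ' * 1000
estimates = {
    'fast_fn': time_per_megabyte(fast_fn, sample),
    'slow_fn': time_per_megabyte(slow_fn, sample),
}
print(estimates)
```

Multiplying the per-megabyte estimate by the corpus size gives a rough cost for each option. The extrapolation assumes processing time grows linearly with text length, which should be checked on a larger sample before relying on it.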