In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire
from time import strftime

## Exercises
The end result of this exercise should be a file named `prepare.py` that defines the requested functions.

In this exercise we will be defining some functions to prepare textual data. These functions should apply equally well to both the codeup blog articles and the news articles that were previously acquired.

1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Replace anything that is not a letter, number, whitespace or a single quote.

2. Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.

3. Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.

4. Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.

5. Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords.

    This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.
    

6. Use your data from the acquire to produce a dataframe of the news articles. Name the dataframe `news_df`.

7. Make another dataframe for the Codeup blog posts. Name the dataframe `codeup_df`.

8. For each dataframe, produce the following columns:
- `title` to hold the title
- `original` to hold the original article/post content
- `clean` to hold the normalized and tokenized original with the stopwords removed.
- `stemmed` to hold the stemmed version of the cleaned data.
- `lemmatized` to hold the lemmatized version of the cleaned data.

Ask yourself:

If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

In [2]:
inshort = acquire.get_inshorts_articles()
inshort



  soup = BeautifulSoup(response.text)


Unnamed: 0,title,author,content,date,category
0,Drop in Meta's market value more than the tota...,Arshiya Chopra,After Facebook parent Meta lost $251 billion i...,"04 Feb 2022,Friday",business
1,Meta drops below Berkshire Hathaway in market ...,Hiral Goyal,Meta Platforms is now worth about $50 billion ...,"04 Feb 2022,Friday",business
2,Amazon adds $135 bn in one of the biggest 1-da...,Hiral Goyal,Amazon added more than $135 billion in market ...,"04 Feb 2022,Friday",business
3,Facebook's user growth in India slowed due to ...,Sakshita Khosla,Facebook's user growth in India was hit due to...,"04 Feb 2022,Friday",business
4,"Mukesh Ambani buys ₹13cr Rolls-Royce SUV, one ...",Arshiya Chopra,Reliance Industries Chairman Mukesh Ambani has...,"04 Feb 2022,Friday",business
...,...,...,...,...,...
95,"Riteish Deshmukh, Genelia to star in comedy fi...",Udit Gupta,Riteish Deshmukh and his actress-wife Genelia ...,"04 Feb 2022,Friday",entertainment
96,Waheeda Rehman stood barefoot in temple set at...,Udit Gupta,Rakeysh Omprakash Mehra recalled shooting for ...,"04 Feb 2022,Friday",entertainment
97,2022 will be a busy year: Disha Patani on upco...,Ramanpreet Singh Virdi,Actress Disha Patani has said 2022 will be a b...,"04 Feb 2022,Friday",entertainment
98,Rejected H'wood projects where Indians weren't...,Udit Gupta,"Nitu Chandra, who made her Hollywood debut wit...","04 Feb 2022,Friday",entertainment


In [3]:
inshort1 = inshort.content[0]
inshort1

'After Facebook parent Meta lost $251 billion in market value, suffering the biggest wipeout in US market\'s history, Kotak Mahindra Bank CEO Uday Kotak said, "That is more than the total value of India\'s largest company." "It highlights the fragility and fickleness of our times. Welcome to the never normal world," he added.'

### 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Replace anything that is not a letter, number, whitespace or a single quote.

In [4]:
# #lowercase
# article = original.lower()
# # unicode
# article = unicodedata.normalize('NFKD', article)\
#     .encode('ascii', 'ignore')\
#     .decode('utf-8', 'ignore')
# #replace
# article = re.sub(r"[^a-z0-9'\s]", '', article)

In [5]:
def basic_clean(article):
    '''input article to lowercase, normalize unicode characters, 
    and replace anything that is not a letter, number, whitespace, 
    or single quote'''
    article = article.lower()
    article = unicodedata.normalize('NFKD', article)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')
    article = re.sub(r"[^a-z0-9'\s]", '', article)
    
    return article

In [6]:
basic_clean(inshort1)

"after facebook parent meta lost 251 billion in market value suffering the biggest wipeout in us market's history kotak mahindra bank ceo uday kotak said that is more than the total value of india's largest company it highlights the fragility and fickleness of our times welcome to the never normal world he added"

### 2. Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.


In [7]:
# tokenizer = nltk.tokenize.ToktokTokenizer()

# print(tokenizer.tokenize(original, return_str=True))

In [29]:
def tokenize(article):
    #create tokenizer
    tokenizer = nltk.tokenize.ToktokTokenizer()
    #return
    article = tokenizer.tokenize(article, return_str = True)
    
    return article

In [9]:
tokenizer(inshort1)

'After Facebook parent Meta lost $ 251 billion in market value , suffering the biggest wipeout in US market \' s history , Kotak Mahindra Bank CEO Uday Kotak said , " That is more than the total value of India \' s largest company. " " It highlights the fragility and fickleness of our times. Welcome to the never normal world , " he added .'

### 3. Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.

In [10]:
# # Create the nltk stemmer object, then use it
# ps = nltk.porter.PorterStemmer()

# ps.stem('call'), ps.stem('called'), ps.stem('calling')
###
# stems = [ps.stem(word) for word in article.split()]
# article_stemmed = ' '.join(stems)
# print(article_stemmed)

In [11]:
def stem(article):
    ps = nltk.porter.PorterStemmer()
    # 
    stems = [ps.stem(word) for word in article.split()]

    article = ' '.join(stems)
    
    return article

In [12]:
stem(inshort1)

'after facebook parent meta lost $251 billion in market value, suffer the biggest wipeout in us market\' history, kotak mahindra bank ceo uday kotak said, "that is more than the total valu of india\' largest company." "it highlight the fragil and fickl of our times. welcom to the never normal world," he added.'

### 4. Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.

In [13]:
# wnl = nltk.stem.WordNetLemmatizer()

# for word in 'study studies'.split():
#     print('stem:', ps.stem(word), '-- lemma:', wnl.lemmatize(word))


# lemmas = [wnl.lemmatize(word) for word in article.split()]
# article_lemmatized = ' '.join(lemmas)


In [14]:
def lemmatize(article):
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in article.split()]
    article = ' '.join(lemmas)
    
    return article

In [15]:
lemmatize(inshort1)

'After Facebook parent Meta lost $251 billion in market value, suffering the biggest wipeout in US market\'s history, Kotak Mahindra Bank CEO Uday Kotak said, "That is more than the total value of India\'s largest company." "It highlight the fragility and fickleness of our times. Welcome to the never normal world," he added.'

### 5. Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords.

    This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [16]:
# list for words to include
# list for words we don't want to remove

In [17]:
#should accept some text and return the text after removing all the stopwords - two optional parameters

def remove_stopwords(article, extra_words= [], exclude_words = []):    

    #english list
    stopword_list = stopwords.words('english')
    
    #remove excluded words from list
    stopword_list = set(stopword_list) - set(exclude_words)
    
    #add in determined extra words into list
    stopword_list = stopword_list.union(set(extra_words))

    #have to split words in string
    words = article.split()
    
    # list 
    filtered_words = [w for w in words if w not in stopword_list]
    
    article_without_stopwords = ' '.join(filtered_words)

    print(article_without_stopwords)

In [18]:
remove_stopwords(inshort1)

After Facebook parent Meta lost $251 billion market value, suffering biggest wipeout US market's history, Kotak Mahindra Bank CEO Uday Kotak said, "That total value India's largest company." "It highlights fragility fickleness times. Welcome never normal world," added.


### 6. Use your data from the acquire to produce a dataframe of the news articles. Name the dataframe `news_df`.

In [19]:
import warnings
warnings.filterwarnings('ignore')

In [20]:
news_df = acquire.get_inshorts_articles()


In [21]:
news_df.head()


Unnamed: 0,title,author,content,date,category
0,Drop in Meta's market value more than the tota...,Arshiya Chopra,After Facebook parent Meta lost $251 billion i...,"04 Feb 2022,Friday",business
1,Meta drops below Berkshire Hathaway in market ...,Hiral Goyal,Meta Platforms is now worth about $50 billion ...,"04 Feb 2022,Friday",business
2,Amazon adds $135 bn in one of the biggest 1-da...,Hiral Goyal,Amazon added more than $135 billion in market ...,"04 Feb 2022,Friday",business
3,Facebook's user growth in India slowed due to ...,Sakshita Khosla,Facebook's user growth in India was hit due to...,"04 Feb 2022,Friday",business
4,"Mukesh Ambani buys ₹13cr Rolls-Royce SUV, one ...",Arshiya Chopra,Reliance Industries Chairman Mukesh Ambani has...,"04 Feb 2022,Friday",business


### 7. Make another dataframe for the Codeup blog posts. Name the dataframe `codeup_df`.


In [22]:
codeup_df = acquire.get_blog_articles()


In [23]:
codeup_df.head()

Unnamed: 0,title,published,content
0,Codeup Dallas Open House,"Nov 30, 2021",Come join us for the re-opening of our Dallas ...
1,Codeup’s Placement Team Continues Setting Records,"Nov 19, 2021",Our Placement Team is simply defined as a grou...
2,"IT Certifications 101: Why They Matter, and Wh...","Nov 18, 2021","AWS, Google, Azure, Red Hat, CompTIA…these are..."
3,A rise in cyber attacks means opportunities fo...,"Nov 17, 2021","In the last few months, the US has experienced..."
4,Use your GI Bill® benefits to Land a Job in Tech,"Nov 4, 2021","As the end of military service gets closer, ma..."


### 8. For each dataframe, produce the following columns:
- `title` to hold the title
- `original` to hold the original article/post content
- `clean` to hold the normalized and tokenized original with the stopwords removed.
- `stemmed` to hold the stemmed version of the cleaned data.
- `lemmatized` to hold the lemmatized version of the cleaned data.

Ask yourself:

If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?


In [24]:
### Pulled from class review for this:

In [25]:
news_df.rename(columns={'content': 'original'}, inplace=True)
codeup_df.rename(columns={'content': 'original'}, inplace=True)

In [30]:
def prep_article_data(df, column, extra_words=[], exclude_words=[]):
    '''
    This function take in a df and the string name for a text column with 
    option to pass lists for extra_words and exclude_words and
    returns a df with the text article title, original text, stemmed text,
    lemmatized text, cleaned, tokenized, & lemmatized text with stopwords removed.
    '''
    df['clean'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    df['stemmed'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(stem)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    df['lemmatized'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(lemmatize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    return df[['title', column,'clean', 'stemmed', 'lemmatized']]

In [31]:
# use the function defined above for news_df's content column.

prep_article_data(news_df, 'original', extra_words = ['ha'], exclude_words = ['no']).head()


facebook parent meta lost 251 billion market value suffering biggest wipeout us market ' history kotak mahindra bank ceo uday kotak said total value india ' largest company highlights fragility fickleness times welcome never normal world added
meta platforms worth 50 billion less berkshire hathaway social media giant ' stock plunged 27 thursday wiping 250 billion market capitalisation meta overtook billionaire investor warren buffettled berkshire market value nearly two years ago buffett richer meta ceo mark zuckerberg lost 31 billion thursday
amazon added 135 billion market value one biggest singleday gains us stock market history shares surged much 12 friday company ' stock surged beat profit expectations fourth quarter thursday amazon stock declined 78 wiping 110 billion market value
facebook ' user growth india hit due hike prepaid data prices fourth quarter 2021 meta cfo david wehner said came social media giant recorded drop daily active users daus first time 18year history wehne

england men ' team assistant coach graham thorpe left position confirmed ecb friday came managing director ashley giles head coach chris silverwood stepped respective role recently following england ' 04 ash test series loss australia ' fortunate worked many good player coach said thorpe
former manchester united striker dimitar berbatov said hope see indian play english club one day adding would great success see indian player european championship even come india football european level need believe 41yearold said
exindia pacer ajit agarkar said rohit sharma fly captain challenge india ' new whiteball skipper remain fit ahead t20 odi world cup scheduled next 24 month added fitness wa one strength excaptains dhoni virat kohli rarely missed game
facebook ' parent meta ' share plunged 27 thursday ' collapse wiped 230 billion company ' market value ' biggest collapse market value u company ' no certainty loss hold given volatility bloomberg said come facebook ' daily active user fell firs

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Drop in Meta's market value more than the tota...,After Facebook parent Meta lost $251 billion i...,,,
1,Meta drops below Berkshire Hathaway in market ...,Meta Platforms is now worth about $50 billion ...,,,
2,Amazon adds $135 bn in one of the biggest 1-da...,Amazon added more than $135 billion in market ...,,,
3,Facebook's user growth in India slowed due to ...,Facebook's user growth in India was hit due to...,,,
4,"Mukesh Ambani buys ₹13cr Rolls-Royce SUV, one ...",Reliance Industries Chairman Mukesh Ambani has...,,,


In [32]:
prep_article_data(codeup_df, 'original', extra_words = ['ha'], exclude_words = ['no']).head()


come join us reopening dallas campus drinks snacks codeup curious campus looks like click register free event come join us reopening dallas campus drinks snacks codeup curious campus looks like interested web development career accelerator keen chat instructor financial aid rep open house answer questions meet codeup instructor help explain whats taught classes answer questions understand join one upcoming cohort dec 6th dont miss opportunity learn start new year transitioning new exciting career tech answer questions may codeup future take first step new career today create tomorrow
placement team simply defined group manages relationships employer partners graduating students help get graduating students hired last quarter placement team helped 48 students get hired lifechanging careers tech last month placement team already placed 40 students top tech companies want send huge thank placement team employer partners done tremendous job helping codeup empower life change students exact

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Codeup Dallas Open House,Come join us for the re-opening of our Dallas ...,,,
1,Codeup’s Placement Team Continues Setting Records,Our Placement Team is simply defined as a grou...,,,
2,"IT Certifications 101: Why They Matter, and Wh...","AWS, Google, Azure, Red Hat, CompTIA…these are...",,,
3,A rise in cyber attacks means opportunities fo...,"In the last few months, the US has experienced...",,,
4,Use your GI Bill® benefits to Land a Job in Tech,"As the end of military service gets closer, ma...",,,
