# Parsing Text (aka Prepping Text Data)
In this lesson we'll take our acquired data and parse it. That is, we'll better understand the text data by breaking it down into smaller components.

We'll be using the `nltk` package, which requires a little bit of up-front setup:

In [1]:
# We don't need to install nltk, it should come with anaconda, but nltk needs to download data.
# python -c "import nltk; nltk.download('stopwords')"

In [2]:
# unicode, regex, json for text digestion
import unicodedata
import re
import json

# nltk: natural language toolkit -> tokenization, stopwords (more on this soon)
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

# pandas dataframe manipulation, acquire script, time formatting
import pandas as pd
import acquire
from time import strftime

# shh, down in front
import warnings
warnings.filterwarnings('ignore')

Here's our plan for parsing the text data:

1. Convert text to all lower case for consistency.
1. Remove accented and other non-ASCII characters.
1. Remove special characters.
1. Tokenize the words.
1. Stem or lemmatize the words.
1. Remove stopwords.
1. Store the clean text and the original text for use in future notebooks.

In [3]:
original = '''
What are the Math and Stats Principles You Need for Data Science?Oct 21, 2020 | Data Science


Coming into our Data Science program, you will need to know some math and stats. However, many of our applicants actually learn in the application process – you don’t need to be an expert before applying! Data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. But what “skills” do we mean, exactly? Just what exactly are the data science math and stats principles you need to know?
What are the main math principles you need to know to get into Codeup’s Data Science program?


Algebra
Do you know PEMDAS and can you solve for x? You will need to be or become comfortable with the following:

Variables (x, y, n, etc.)
Formulas, functions, and variable manipulations (e.g. x^2 = x + 6, solve for x).
Order of evaluation: PEMDAS: parentheses, exponents, then multiplication, division, addition, and subtraction
Commutativity where a + b = b + a
Associativity where a + (b + c) = (a + b) + c
Adding and subtracting matrices
A conceptual understanding of exponential growth/decay- things can increase at an increasing rate

Descriptive Statistics
Know what a min, max, mode, median, and average are. Have a conceptual understanding that stats/probability is about trying to quantify uncertainty.
Data Visualization
Know what a scatterplot is and how to read a barplot.
How to Learn and Expand on These Concepts
There are a number of great resources out there to teach you these and similar concepts. Khan Academy is a great starting place for data science math! If you want to know what exactly we assign our applicants, you’ll just have to apply!
 
What about once you’re in Codeup?


What You Won’t Do
Do we do any mathematical proofs for concepts or perform derivations? No. 
Do we do any calculus and probability calculating by hand? No.
Are we transforming equations, where we cancel out units or terms and do lots of algebraic gymnastics? No
What You Will Do
Will we have Python solve our linear algebra problems for us? Yes
Will we have Python calculate probabilities, the area under a curve, and the slope of a line for us? Yes
Will we have Python do all of the calculus for us? Yes
 
See, the data science math and stats slice of the pie is certainly doable. If you like problem-solving and are ready to challenge yourself, you’ll love data science! If you are interested in learning about data science, just apply! Our Admissions Manager can work with you to get you where you need to be starting from where you are now. Let us help you get there so you can launch a great new career.

Request More Info





Our ProgramsFull Stack Web Development
Data Science
Cyber Cloud
Systems Engineering





Latest Blog Articles
Codeup Dallas Open House
Codeup’s Placement Team Continues Setting Records
IT Certifications 101: Why They Matter, and Why They Don’t
A rise in cyber attacks means opportunities for veterans in San Antonio
Use your GI Bill® benefits to Land a Job in Tech























More From This Category






Codeup Dallas Open House
Nov 30, 2021 | Dallas Newsletter, EventsCome join us for the re-opening of our Dallas Campus with some drinks and snacks at Codeup! Curious about what our campus looks like? Click here to register for free About this event Come join us for the re-opening of our Dallas Campus with some drinks and snacks at...




Codeup’s Placement Team Continues Setting Records
Nov 19, 2021 | Codeup News, EmployersOur Placement Team is simply defined as a group that manages relationships with our employer partners and our graduating students to help get our graduating students hired. Last quarter the Placement Team helped 48 students get hired to life-changing careers in tech....




IT Certifications 101: Why They Matter, and Why They Don’t
Nov 18, 2021 | Cybersecurity, IT Training, Tips for Prospective StudentsAWS, Google, Azure, Red Hat, CompTIA...these are big names in IT! And not only for their products, but also for the certifications they offer. If you’re new to tech, you might be wondering: Do certifications really matter? Welcome to IT Certifications 101! What’s the...

'''

#### Step 1: Convert all text to lower case for consistency

In [4]:
article = original.lower()
print(article)



what are the math and stats principles you need for data science?oct 21, 2020 | data science


coming into our data science program, you will need to know some math and stats. however, many of our applicants actually learn in the application process – you don’t need to be an expert before applying! data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. but what “skills” do we mean, exactly? just what exactly are the data science math and stats principles you need to know?
what are the main math principles you need to know to get into codeup’s data science program?


algebra
do you know pemdas and can you solve for x? you will need to be or become comfortable with the following:

variables (x, y, n, etc.)
formulas, functions, and variable manipulations (e.g. x^2 = x + 6, solve for x).
order of evaluation: pemdas: parentheses, exponents, then multiplication, division, addition, and

#### Step 2: Remove accented and other non-ASCII characters

Text corpora often contain accented characters and other non-ASCII symbols. If we only want to analyze English-language text, we need to convert and standardize these characters into ASCII. A simple example is converting é to e.

We'll go about this in three steps:

1. `unicodedata.normalize` with the `'NFKD'` form decomposes each character into a compatibility-equivalent base character plus any combining marks (for example, é becomes e followed by a combining accent).
1. `.encode` converts the resulting string to the ASCII character set. We ignore any errors in conversion, which drops anything that isn't an ASCII character, including the combining marks.
1. `.decode` turns the resulting bytes object back into a string.

ℌ ̧ ==> H ̧ ==> Ḩ

In [5]:
article = unicodedata.normalize('NFKD', article)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')

print(article)


what are the math and stats principles you need for data science?oct 21, 2020 | data science


coming into our data science program, you will need to know some math and stats. however, many of our applicants actually learn in the application process  you dont need to be an expert before applying! data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. but what skills do we mean, exactly? just what exactly are the data science math and stats principles you need to know?
what are the main math principles you need to know to get into codeups data science program?


algebra
do you know pemdas and can you solve for x? you will need to be or become comfortable with the following:

variables (x, y, n, etc.)
formulas, functions, and variable manipulations (e.g. x^2 = x + 6, solve for x).
order of evaluation: pemdas: parentheses, exponents, then multiplication, division, addition, and subt
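
The three steps above can be traced one at a time on a short string. This is a minimal sketch using only the standard library; the sample text is made up for illustration and is not part of the article.

```python
import unicodedata

# Sample text (an assumption for illustration): an accented letter,
# an en dash, and a curly apostrophe.
text = "caf\u00e9 \u2013 don\u2019t"

# Step 1: NFKD splits "é" into "e" plus a combining acute accent.
decomposed = unicodedata.normalize("NFKD", text)

# Step 2: encoding to ASCII with errors ignored drops the combining accent,
# the en dash, and the curly apostrophe.
ascii_bytes = decomposed.encode("ascii", "ignore")

# Step 3: decode the bytes back into a str.
cleaned = ascii_bytes.decode("utf-8", "ignore")

print(len(text), len(decomposed))  # 12 13
print(cleaned)                     # cafe  dont
```

Note that characters with no ASCII equivalent are dropped rather than replaced, which is why "don’t" becomes "dont" in the article output.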

#### Step 3: Remove special characters

Special characters and symbols are usually non-alphanumeric characters or even occasionally numeric characters (depending on the problem), which add to the extra noise in unstructured text. Usually, simple regular expressions (regexes) can be used to remove them.

In [6]:
# remove anything that is not a through z, a number, a single quote, or whitespace
article = re.sub(r"[^a-z0-9'\s]", '', article)
print(article)



what are the math and stats principles you need for data scienceoct 21 2020  data science


coming into our data science program you will need to know some math and stats however many of our applicants actually learn in the application process  you dont need to be an expert before applying data science is a very accessible field to anyone dedicated to learning new skills and we can work with any applicant to help them learn what they need to know but what skills do we mean exactly just what exactly are the data science math and stats principles you need to know
what are the main math principles you need to know to get into codeups data science program


algebra
do you know pemdas and can you solve for x you will need to be or become comfortable with the following

variables x y n etc
formulas functions and variable manipulations eg x2  x  6 solve for x
order of evaluation pemdas parentheses exponents then multiplication division addition and subtraction
commutativity where a  b  b  a
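
To see what this character class keeps and drops, here is a small sketch that applies the same pattern to a few short strings. The last line illustrates why lowercasing has to happen first: the pattern only allows `a-z`, so capital letters are removed as well.

```python
import re

# Same pattern as the cell above: keep a-z, digits, single quotes, whitespace.
pattern = r"[^a-z0-9'\s]"

equation = re.sub(pattern, "", "x^2 = x + 6, solve for x")
contraction = re.sub(pattern, "", "we might've tokenized!")
not_lowercased = re.sub(pattern, "", "Data Science")

print(equation)        # x2  x  6 solve for x
print(contraction)     # we might've tokenized
print(not_lowercased)  # ata cience
```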


#### Step 4: Tokenization

After removing non-ASCII and special characters, it's common to tokenize the text. Tokenization is the process of breaking something down into discrete units. In the context of NLP, this means breaking text down into discrete words, punctuation marks, and so on.

We will use `nltk` to do tokenization for us:

In [7]:
tokenizer = nltk.tokenize.ToktokTokenizer()

print(tokenizer.tokenize(original, return_str=True))

What are the Math and Stats Principles You Need for Data Science?Oct 21 , 2020 &#124; Data Science


Coming into our Data Science program , you will need to know some math and stats. However , many of our applicants actually learn in the application process – you don ’ t need to be an expert before applying ! Data science is a very accessible field to anyone dedicated to learning new skills , and we can work with any applicant to help them learn what they need to know. But what “skills ” do we mean , exactly ? Just what exactly are the data science math and stats principles you need to know ? 
What are the main math principles you need to know to get into Codeup ’ s Data Science program ? 


Algebra
Do you know PEMDAS and can you solve for x ? You will need to be or become comfortable with the following : 

Variables ( x , y , n , etc. ) 
Formulas , functions , and variable manipulations ( e.g. x^2 = x + 6 , solve for x ) .
Order of evaluation : PEMDAS : parentheses , exponents , then 

In [8]:
print(tokenizer.tokenize("Here is a string that we might've tokenized", return_str=True))

Here is a string that we might ' ve tokenized
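
To make the idea concrete, here is a rough sketch of tokenization with a plain regular expression. It is not how `ToktokTokenizer` works internally; `simple_tokenize` is a hypothetical helper that only shows what "discrete units" means.

```python
import re

def simple_tokenize(text):
    # \w+ grabs runs of letters, digits, or underscores;
    # [^\w\s] grabs each remaining non-space character as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Here is a string that we might've tokenized")
print(tokens)
# ['Here', 'is', 'a', 'string', 'that', 'we', 'might', "'", 've', 'tokenized']
```

A dedicated tokenizer handles many more cases (abbreviations, numbers, URLs), which is why we rely on `nltk` in this lesson.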


#### Step 5: Stemming and Lemmatization

Usually you will want to use lemmatization. We will demonstrate why that is the case by looking at both here.

#### Stemming
A word's stem is its base form.

We create new words by attaching affixes in a process known as inflection. For example, "calls", "called", and "calling" all share the base stem "call".

The Porter stemmer implements the algorithm developed by Dr. Martin Porter. The original algorithm is said to reduce inflections to their stems in five phases, each with its own set of rules.

Note that stemming applies a fixed set of rules, so the resulting stems are not always valid words. A stem may not appear in the dictionary at all (as we'll see in the output of stemming).
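
As a rough illustration of why rule-based stems can fall outside the dictionary, here is a toy suffix stripper. It is not the Porter algorithm; `toy_stem` and its four suffix rules are made up for this sketch.

```python
def toy_stem(word, suffixes=("ing", "ed", "es", "s")):
    # Strip the first matching suffix, as long as at least 3 letters remain.
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["calls", "called", "calling", "studies"]:
    print(w, "->", toy_stem(w))
# calls -> call
# called -> call
# calling -> call
# studies -> studi
```

The rules never check whether the result is a real word, so "studies" becomes "studi".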

In [9]:
# Create the nltk stemmer object (via the documented nltk.stem path), then use it
ps = nltk.stem.PorterStemmer()

ps.stem('call'), ps.stem('calling'), ps.stem('called')


('call', 'call', 'call')

Now we can apply this stemming transformation to all the words in the article.

In [10]:
stems = [ps.stem(word) for word in article.split()]
article_stemmed = ' '.join(stems)
print(article_stemmed)

what are the math and stat principl you need for data scienceoct 21 2020 data scienc come into our data scienc program you will need to know some math and stat howev mani of our applic actual learn in the applic process you dont need to be an expert befor appli data scienc is a veri access field to anyon dedic to learn new skill and we can work with ani applic to help them learn what they need to know but what skill do we mean exactli just what exactli are the data scienc math and stat principl you need to know what are the main math principl you need to know to get into codeup data scienc program algebra do you know pemda and can you solv for x you will need to be or becom comfort with the follow variabl x y n etc formula function and variabl manipul eg x2 x 6 solv for x order of evalu pemda parenthes expon then multipl divis addit and subtract commut where a b b a associ where a b c a b c ad and subtract matric a conceptu understand of exponenti growthdecay thing can increas at an in

In [11]:
pd.Series(stems).value_counts().head(5)

to     25
and    23
you    21
a      18
the    15
dtype: int64

#### Lemmatization
Lemmatization is very similar to stemming, but the base form it produces is the root word rather than the root stem. The difference is that the root word, also known as the lemma, is always a lexicographically correct word (present in the dictionary), while the root stem may not be.

Note that lemmatization is considerably slower than stemming because it involves an additional step: the affix is removed from the word if and only if the resulting lemma is present in the dictionary.
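
The "only if the lemma is in the dictionary" condition can be sketched with a toy lookup. This is an illustration only: `VOCAB`, `RULES`, and `toy_lemmatize` are hypothetical, and the WordNet lemmatizer uses its own rules and a much larger dictionary.

```python
# Tiny stand-in dictionary (an assumption for this sketch).
VOCAB = {"call", "study", "principle", "science"}

# (suffix, replacement) pairs tried in order.
RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_lemmatize(word, vocab=VOCAB):
    if word in vocab:
        return word
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            # Accept the candidate only when it is a known word.
            if candidate in vocab:
                return candidate
    # No rule produced a dictionary word, so leave the input unchanged.
    return word

for w in ["studies", "principles", "calling", "applying"]:
    print(w, "->", toy_lemmatize(w))
# studies -> study
# principles -> principle
# calling -> call
# applying -> applying
```

Because "apply" is not in the toy dictionary, "applying" is left as it is rather than being cut to a non-word.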

Let's compare stemming and lemmatization on a few simple tokens:

In [12]:
# python3 -c "import nltk; nltk.download('all')"
# OR
# python3 -c "import nltk; nltk.download('wordnet')"

In [13]:
wnl = nltk.stem.WordNetLemmatizer()

for word in "would ' ve".split():
    print('lemma: ', wnl.lemmatize(word), '------ stem: ', ps.stem(word))


lemma:  would ------ stem:  would
lemma:  ' ------ stem:  '
lemma:  ve ------ stem:  ve


And now we can apply lemmatization to our entire document:

In [14]:
lemmas = [wnl.lemmatize(word) for word in article.split()]
article_lemmatized = ' '.join(lemmas)

print(article_lemmatized)

what are the math and stats principle you need for data scienceoct 21 2020 data science coming into our data science program you will need to know some math and stats however many of our applicant actually learn in the application process you dont need to be an expert before applying data science is a very accessible field to anyone dedicated to learning new skill and we can work with any applicant to help them learn what they need to know but what skill do we mean exactly just what exactly are the data science math and stats principle you need to know what are the main math principle you need to know to get into codeups data science program algebra do you know pemdas and can you solve for x you will need to be or become comfortable with the following variable x y n etc formula function and variable manipulation eg x2 x 6 solve for x order of evaluation pemdas parenthesis exponent then multiplication division addition and subtraction commutativity where a b b a associativity where a b 

Now that we have a list of the lemmas, we can take a look at the most frequent words.

In [15]:
pd.Series(lemmas).value_counts().head(5)

to     25
and    23
you    21
a      19
the    15
dtype: int64

#### Step 6: Removing stopwords

Words that carry little or no significance, especially when constructing meaningful features from text, are known as stop words (or stopwords). These are usually the words with the highest counts in a simple word-frequency tally of a corpus. Typically, they are articles, conjunctions, prepositions, and so on. Some examples of stopwords: a, an, the, and the like.

While there is no universal stopword list, we will use a standard English language stopwords list from nltk. You can also add your own domain-specific stopwords as needed.

Before removing stopwords, we want to segment text into linguistic units such as words or numbers. This process is called tokenization.

In [16]:
stopword_list = stopwords.words('english')

In [17]:
list(stopword_list)

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [18]:
stopword_list[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [19]:
len(stopword_list)

179

In [20]:
stopword_list.remove('no')

In [21]:
stopword_list.remove('not')

In [22]:
type(article)

str

In [23]:
words = article.split()
filtered_words = [w for w in words if w not in stopword_list]

print('Removed {} stopwords'.format(len(words) - len(filtered_words)))
print('---')

article_without_stopwords = ' '.join(filtered_words)

print(article_without_stopwords)

Removed 307 stopwords
---
math stats principles need data scienceoct 21 2020 data science coming data science program need know math stats however many applicants actually learn application process dont need expert applying data science accessible field anyone dedicated learning new skills work applicant help learn need know skills mean exactly exactly data science math stats principles need know main math principles need know get codeups data science program algebra know pemdas solve x need become comfortable following variables x n etc formulas functions variable manipulations eg x2 x 6 solve x order evaluation pemdas parentheses exponents multiplication division addition subtraction commutativity b b associativity b c b c adding subtracting matrices conceptual understanding exponential growthdecay things increase increasing rate descriptive statistics know min max mode median average conceptual understanding statsprobability trying quantify uncertainty data visualization know scatte

In [24]:
pd.Series(filtered_words).value_counts().head(5)

data       12
science    10
know        8
need        8
math        6
dtype: int64
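
The stopword list is just a collection of strings, so it can be adjusted with ordinary set operations. This is a minimal sketch, assuming a tiny hand-written stopword set rather than the `nltk` list; treating "data" as a domain-specific stopword is only an example.

```python
# Tiny stand-in stopword set (an assumption for this sketch).
stopword_set = {"a", "an", "the", "and", "to", "you", "no", "not"}

# Keep negations, as we did by removing 'no' and 'not' above.
stopword_set -= {"no", "not"}

# Add a domain-specific stopword.
stopword_set |= {"data"}

sentence = "you do not need to be an expert in data science"
kept = [w for w in sentence.split() if w not in stopword_set]
result = " ".join(kept)
print(result)  # do not need be expert in science
```

A set also makes each membership check fast, which helps on longer documents.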

## Exercises
The end result of this exercise should be a file named `prepare.py` that defines the requested functions.

In this exercise we will be defining some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

1. Define a function named `basic_clean`. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

In [27]:
df = acquire.scrape_busines_shorts()
df

Unnamed: 0,title,date,author,content
0,Rupee closes at all-time low of 77.50 against ...,"(09 May, 08:57 pm)",Pragya Swastik,The Indian rupee weakened further on Monday to...
1,"Bitcoin briefly drops below $30,000 for first ...","(10 May, 10:03 am)",Ridham Gambhir,"Bitcoin, in the early hours of Tuesday, fell b..."
2,Twitter will comply with EU content rules afte...,"(10 May, 01:07 pm)",Ridham Gambhir,Tesla CEO Elon Musk has said that Twitter will...
3,"Office as we know it, is over: Airbnb CEO on l...","(10 May, 02:49 pm)",Sakshita Khosla,After Airbnb allowed its employees to work rem...
4,When are you coming to deliver 1st Tesla? Payt...,"(10 May, 10:38 am)",Ridham Gambhir,Paytm CEO Vijay Shekhar Sharma took to Twitter...
5,Musk's $44 bn Twitter deal at risk of being re...,"(10 May, 02:46 pm)",Pragya Swastik,Elon Musk's $44 billion offer to buy Twitter c...
6,Microsoft to help cover US employees' travel c...,"(10 May, 09:12 am)",Ridham Gambhir,Microsoft has said that it will cover travel c...
7,"After Musk's Taj Mahal tweet, his mother says ...","(10 May, 09:48 am)",Apaar Sharma,After Elon Musk tweeted he visited Taj Mahal i...
8,Layout of 'world's first Bitcoin City' in El S...,"(10 May, 06:54 pm)",Hiral Goyal,El Salvador's President Nayib Bukele has share...
9,Zomato's market cap falls below its last priva...,"(10 May, 07:45 pm)",Hiral Goyal,Zomato's market capitalisation fell below its ...


In [36]:
article = df.content[0]
article

'The Indian rupee weakened further on Monday to close at a new all-time low of 77.50 against the US dollar, 60 paise over its previous close. During the trading session, the rupee touched its lifetime low of 77.52. The currency was weighed down by elevated crude oil prices and a widening trade deficit.'

In [37]:
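def basic_clean(string):
    '''
    This function takes in a string and returns it normalized:
    ASCII-only, stripped of punctuation, and lowercased.
    '''
    # decompose accented characters, then drop anything that is not ASCII
    string = unicodedata.normalize('NFKD', string)\
             .encode('ascii', 'ignore')\
             .decode('utf-8', 'ignore')
    # [^\w\s] removes everything except letters, digits, underscores, and
    # whitespace, so single quotes are stripped here as well
    string = re.sub(r'[^\w\s]', '', string).lower()
    return string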

In [42]:
basic_clean(article)

'the indian rupee weakened further on monday to close at a new alltime low of 7750 against the us dollar 60 paise over its previous close during the trading session the rupee touched its lifetime low of 7752 the currency was weighed down by elevated crude oil prices and a widening trade deficit'
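
The function above strips single quotes along with other punctuation. If you want to keep them, as the exercise description asks, one possible variant reuses the pattern from Step 3. This is a sketch, and `basic_clean_keep_quotes` is a hypothetical name chosen here to avoid clashing with the function above.

```python
import re
import unicodedata

def basic_clean_keep_quotes(string):
    # Lowercase first, because the pattern below only allows a-z.
    string = string.lower()
    # Decompose accented characters, then drop anything that is not ASCII.
    string = (unicodedata.normalize("NFKD", string)
              .encode("ascii", "ignore")
              .decode("utf-8", "ignore"))
    # Keep letters, digits, single quotes, and whitespace.
    return re.sub(r"[^a-z0-9'\s]", "", string)

cleaned_title = basic_clean_keep_quotes("Elon Musk's $44 billion offer")
print(cleaned_title)  # elon musk's 44 billion offer
```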

2. Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.

In [40]:
def tokenize(string):
    '''
    This function takes in a string and
    returns a tokenized string.
    '''
    tokenizer = nltk.tokenize.ToktokTokenizer()
    string = tokenizer.tokenize(string, return_str=True)
    return string

In [41]:
tokenize(article)

'The Indian rupee weakened further on Monday to close at a new all-time low of 77.50 against the US dollar , 60 paise over its previous close. During the trading session , the rupee touched its lifetime low of 77.52. The currency was weighed down by elevated crude oil prices and a widening trade deficit .'

3. Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.

In [43]:
def stem(string):
    '''
    This function takes in a string and
    returns a string with words stemmed.
    '''
    # PorterStemmer is exposed under the documented nltk.stem path
    ps = nltk.stem.PorterStemmer()
    stems = [ps.stem(word) for word in string.split()]
    string = ' '.join(stems)
    return string


In [44]:
stem(article)

'the indian rupe weaken further on monday to close at a new all-tim low of 77.50 against the us dollar, 60 pais over it previou close. dure the trade session, the rupe touch it lifetim low of 77.52. the currenc wa weigh down by elev crude oil price and a widen trade deficit.'

4. Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.

In [45]:
def lemmatize(string):
    '''
    This function takes in a string and
    returns a string with words lemmatized.
    '''
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    string = ' '.join(lemmas)
    return string

In [46]:
lemmatize(article)

'The Indian rupee weakened further on Monday to close at a new all-time low of 77.50 against the US dollar, 60 paisa over it previous close. During the trading session, the rupee touched it lifetime low of 77.52. The currency wa weighed down by elevated crude oil price and a widening trade deficit.'

5. Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords. This function should define two optional parameters, `extra_words` and `exclude_words`. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [47]:
def remove_stopwords(string, extra_words=None, exclude_words=None):
    '''
    This function takes in a string plus optional extra_words and exclude_words
    lists and returns the string with stopwords removed. Matching is
    case-sensitive, so lowercase the text first to catch words like 'The'.
    '''
    stopword_list = stopwords.words('english')
    # drop the words we want to keep, then add any extra stopwords
    stopword_list = set(stopword_list) - set(exclude_words or [])
    stopword_list = stopword_list.union(set(extra_words or []))
    words = string.split()
    filtered_words = [word for word in words if word not in stopword_list]
    string_without_stopwords = ' '.join(filtered_words)
    return string_without_stopwords

In [48]:
remove_stopwords(article)

'The Indian rupee weakened Monday close new all-time low 77.50 US dollar, 60 paise previous close. During trading session, rupee touched lifetime low 77.52. The currency weighed elevated crude oil prices widening trade deficit.'

6. Use your data from the acquire exercises to produce a dataframe of the news articles. Name the dataframe `news_df`.

In [73]:
news_df = acquire.scrape_busines_shorts()
news_df

Unnamed: 0,title,date,author,content
0,"Bitcoin briefly drops below $30,000 for first ...","(10 May, 10:03 am)",Ridham Gambhir,"Bitcoin, in the early hours of Tuesday, fell b..."
1,Rupee closes at all-time low of 77.50 against ...,"(09 May, 08:57 pm)",Pragya Swastik,The Indian rupee weakened further on Monday to...
2,SBI hikes interest rates on bulk term deposits...,"(10 May, 04:52 pm)",Ridham Gambhir,SBI has increased interest rates on its bulk t...
3,"After Musk's Taj Mahal tweet, his mother says ...","(10 May, 09:48 am)",Apaar Sharma,After Elon Musk tweeted he visited Taj Mahal i...
4,Musk's $44 bn Twitter deal at risk of being re...,"(10 May, 02:46 pm)",Pragya Swastik,Elon Musk's $44 billion offer to buy Twitter c...
5,"Office as we know it, is over: Airbnb CEO on l...","(10 May, 02:49 pm)",Sakshita Khosla,After Airbnb allowed its employees to work rem...
6,Twitter will comply with EU content rules afte...,"(10 May, 01:07 pm)",Ridham Gambhir,Tesla CEO Elon Musk has said that Twitter will...
7,Microsoft to help cover US employees' travel c...,"(10 May, 09:12 am)",Ridham Gambhir,Microsoft has said that it will cover travel c...
8,Layout of 'world's first Bitcoin City' in El S...,"(10 May, 06:54 pm)",Hiral Goyal,El Salvador's President Nayib Bukele has share...
9,When are you coming to deliver 1st Tesla? Payt...,"(10 May, 10:38 am)",Ridham Gambhir,Paytm CEO Vijay Shekhar Sharma took to Twitter...


In [74]:
news_df['content'].apply(basic_clean)\
.apply(tokenize)\
.apply(lemmatize)\
.apply(remove_stopwords)

0     bitcoin early hour tuesday fell 30000 first ti...
1     indian rupee weakened monday close new alltime...
2     sbi ha increased interest rate bulk term depos...
3     elon musk tweeted visited taj mahal 2007 calle...
4     elon musk 44 billion offer buy twitter could g...
5     airbnb allowed employee work remotely forever ...
6     tesla ceo elon musk ha said twitter comply eur...
7     microsoft ha said cover travel cost employee u...
8     el salvador president nayib bukele ha shared l...
9     paytm ceo vijay shekhar sharma took twitter re...
10    zomatos market capitalisation fell last privat...
11    centre ha called meeting cab aggregator ola ub...
12    jsw group chairman sajjan jindal told financia...
13    several electric vertical takeoff landing evto...
14    tesla ha halted production shanghai plant due ...
15    toyota group ha signed mou karnataka governmen...
16    tiger global ha faced loss around 17 billion y...
17    cbi tuesday searched eight location jammu 

7. Make another dataframe for the Codeup blog posts. Name the dataframe `codeup_df`.

In [71]:
codeup_df = acquire.scrape_codeup()
codeup_df

Unnamed: 0,title,date,tags,content
0,From Bootcamp to Bootcamp | A Military Appreci...,"Apr 27, 2022","Alumni Stories, Dallas, Events, Featured, Mil...","In honor of Military Appreciation Month, join ..."
1,Our Acquisition of the Rackspace Cloud Academy...,"Apr 14, 2022","Codeup News, Featured, IT Training","Just about a year ago on April 16th, 2021 we a..."
2,Learn to Code: HTML & CSS on 4/30,"Apr 1, 2022","Virtual, Workshops",HTML & CSS are the design building blocks of a...
3,Learn to Code: Python Workshop on 4/23,"Mar 31, 2022","Events, Virtual, Workshops","According to LinkedIn, the ""#1 Most Promising ..."
4,Coming Soon: Cloud Administration,"Mar 17, 2022",Codeup News,We're launching a new program out of San Anton...
5,5 Books Every Woman In Tech Should Read,"Mar 8, 2022",Featured,On this International Women's Day 2022 we want...
6,Codeup Start Dates for March 2022,"Jan 26, 2022",Codeup News,As we approach the end of January we wanted to...
7,VET TEC Funding Now Available For Dallas Veterans,"Jan 7, 2022","Codeup News, Dallas Newsletter, Featured, Tips...",We are so happy to announce that VET TEC benef...
8,Dallas Campus Re-opens With New Grant Partner,"Dec 30, 2021","Codeup News, Featured",We are happy to announce that our Dallas campu...
9,Codeup’s Placement Team Continues Setting Records,"Nov 19, 2021","Codeup News, Employers",Our Placement Team is simply defined as a grou...


8. For each dataframe, produce the following columns:
- `title` to hold the title
- `original` to hold the original article/post content
- `clean` to hold the normalized and tokenized original with the stopwords removed.
- `stemmed` to hold the stemmed version of the cleaned data.
- `lemmatized` to hold the lemmatized version of the cleaned data.

In [75]:
news_df.rename(columns={'content':'original'}, inplace=True)

In [76]:
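# normalize and tokenize; stopwords are still present in this column
# (chain .apply(remove_stopwords) to drop them)
news_df['clean'] = news_df['original'].apply(basic_clean)\
                                    .apply(tokenize)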

In [78]:
# stem the already-cleaned text instead of re-cleaning the original
news_df['stemmed'] = news_df['clean'].apply(stem)

In [79]:
# lemmatize the already-cleaned text instead of re-cleaning the original
news_df['lemmatized'] = news_df['clean'].apply(lemmatize)

In [80]:
news_df

Unnamed: 0,title,date,author,original,clean,stemmed,lemmatized
0,"Bitcoin briefly drops below $30,000 for first ...","(10 May, 10:03 am)",Ridham Gambhir,"Bitcoin, in the early hours of Tuesday, fell b...",bitcoin in the early hours of tuesday fell bel...,bitcoin in the earli hour of tuesday fell belo...,bitcoin in the early hour of tuesday fell belo...
1,Rupee closes at all-time low of 77.50 against ...,"(09 May, 08:57 pm)",Pragya Swastik,The Indian rupee weakened further on Monday to...,the indian rupee weakened further on monday to...,the indian rupe weaken further on monday to cl...,the indian rupee weakened further on monday to...
2,SBI hikes interest rates on bulk term deposits...,"(10 May, 04:52 pm)",Ridham Gambhir,SBI has increased interest rates on its bulk t...,sbi has increased interest rates on its bulk t...,sbi ha increas interest rate on it bulk term d...,sbi ha increased interest rate on it bulk term...
3,"After Musk's Taj Mahal tweet, his mother says ...","(10 May, 09:48 am)",Apaar Sharma,After Elon Musk tweeted he visited Taj Mahal i...,after elon musk tweeted he visited taj mahal i...,after elon musk tweet he visit taj mahal in 20...,after elon musk tweeted he visited taj mahal i...
4,Musk's $44 bn Twitter deal at risk of being re...,"(10 May, 02:46 pm)",Pragya Swastik,Elon Musk's $44 billion offer to buy Twitter c...,elon musks 44 billion offer to buy twitter cou...,elon musk 44 billion offer to buy twitter coul...,elon musk 44 billion offer to buy twitter coul...
5,"Office as we know it, is over: Airbnb CEO on l...","(10 May, 02:49 pm)",Sakshita Khosla,After Airbnb allowed its employees to work rem...,after airbnb allowed its employees to work rem...,after airbnb allow it employe to work remot fo...,after airbnb allowed it employee to work remot...
6,Twitter will comply with EU content rules afte...,"(10 May, 01:07 pm)",Ridham Gambhir,Tesla CEO Elon Musk has said that Twitter will...,tesla ceo elon musk has said that twitter will...,tesla ceo elon musk ha said that twitter will ...,tesla ceo elon musk ha said that twitter will ...
7,Microsoft to help cover US employees' travel c...,"(10 May, 09:12 am)",Ridham Gambhir,Microsoft has said that it will cover travel c...,microsoft has said that it will cover travel c...,microsoft ha said that it will cover travel co...,microsoft ha said that it will cover travel co...
8,Layout of 'world's first Bitcoin City' in El S...,"(10 May, 06:54 pm)",Hiral Goyal,El Salvador's President Nayib Bukele has share...,el salvadors president nayib bukele has shared...,el salvador presid nayib bukel ha share the la...,el salvador president nayib bukele ha shared t...
9,When are you coming to deliver 1st Tesla? Payt...,"(10 May, 10:38 am)",Ridham Gambhir,Paytm CEO Vijay Shekhar Sharma took to Twitter...,paytm ceo vijay shekhar sharma took to twitter...,paytm ceo vijay shekhar sharma took to twitter...,paytm ceo vijay shekhar sharma took to twitter...


In [84]:
codeup_df.rename(columns={'content':'original'}, inplace=True)

# Normalize and tokenize once, then reuse the result for stemming and lemmatizing.
codeup_df['clean'] = codeup_df['original'].apply(basic_clean)\
                                          .apply(tokenize)

codeup_df['stemmed'] = codeup_df['clean'].apply(stem)

codeup_df['lemmatized'] = codeup_df['clean'].apply(lemmatize)

codeup_df

Unnamed: 0,title,date,tags,original,clean,stemmed,lemmatized
0,From Bootcamp to Bootcamp | A Military Appreci...,"Apr 27, 2022","Alumni Stories, Dallas, Events, Featured, Mil...","In honor of Military Appreciation Month, join ...",in honor of military appreciation month join u...,in honor of militari appreci month join us for...,in honor of military appreciation month join u...
1,Our Acquisition of the Rackspace Cloud Academy...,"Apr 14, 2022","Codeup News, Featured, IT Training","Just about a year ago on April 16th, 2021 we a...",just about a year ago on april 16th 2021 we an...,just about a year ago on april 16th 2021 we an...,just about a year ago on april 16th 2021 we an...
2,Learn to Code: HTML & CSS on 4/30,"Apr 1, 2022","Virtual, Workshops",HTML & CSS are the design building blocks of a...,html css are the design building blocks of all...,html css are the design build block of all the...,html cs are the design building block of all t...
3,Learn to Code: Python Workshop on 4/23,"Mar 31, 2022","Events, Virtual, Workshops","According to LinkedIn, the ""#1 Most Promising ...",according to linkedin the 1 most promising job...,accord to linkedin the 1 most promis job is da...,according to linkedin the 1 most promising job...
4,Coming Soon: Cloud Administration,"Mar 17, 2022",Codeup News,We're launching a new program out of San Anton...,were launching a new program out of san antoni...,were launch a new program out of san antonio w...,were launching a new program out of san antoni...
5,5 Books Every Woman In Tech Should Read,"Mar 8, 2022",Featured,On this International Women's Day 2022 we want...,on this international womens day 2022 we wante...,on thi intern women day 2022 we want to tell s...,on this international woman day 2022 we wanted...
6,Codeup Start Dates for March 2022,"Jan 26, 2022",Codeup News,As we approach the end of January we wanted to...,as we approach the end of january we wanted to...,as we approach the end of januari we want to l...,a we approach the end of january we wanted to ...
7,VET TEC Funding Now Available For Dallas Veterans,"Jan 7, 2022","Codeup News, Dallas Newsletter, Featured, Tips...",We are so happy to announce that VET TEC benef...,we are so happy to announce that vet tec benef...,we are so happi to announc that vet tec benefi...,we are so happy to announce that vet tec benef...
8,Dallas Campus Re-opens With New Grant Partner,"Dec 30, 2021","Codeup News, Featured",We are happy to announce that our Dallas campu...,we are happy to announce that our dallas campu...,we are happi to announc that our dalla campu r...,we are happy to announce that our dallas campu...
9,Codeup’s Placement Team Continues Setting Records,"Nov 19, 2021","Codeup News, Employers",Our Placement Team is simply defined as a grou...,our placement team is simply defined as a grou...,our placement team is simpli defin as a group ...,our placement team is simply defined a a group...


9. Ask yourself:
 - If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
 - If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
 - If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
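
One way to ground these questions is to time each transformation on a small sample and extrapolate to the corpus size. The sketch below is only a rough estimate under stated assumptions: the helper names `seconds_per_megabyte` and `estimated_seconds` are hypothetical, processing time is assumed to grow linearly with text size, and `placeholder_transform` stands in for whichever of `stem` or `lemmatize` you want to measure.

```python
import time


def seconds_per_megabyte(transform, sample_text, repeats=3):
    """Time transform(sample_text) and scale the best run to seconds per MB."""
    sample_mb = len(sample_text.encode("utf-8")) / 1_000_000
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        transform(sample_text)
        best = min(best, time.perf_counter() - start)
    return best / sample_mb


def estimated_seconds(transform, sample_text, corpus_megabytes):
    """Extrapolate linearly from the sample to a corpus of the given size."""
    return seconds_per_megabyte(transform, sample_text) * corpus_megabytes


# Placeholder so the sketch runs on its own; swap in stem or lemmatize.
def placeholder_transform(text):
    return " ".join(word.lower() for word in text.split())


sample = "Data science is a very accessible field. " * 2_000

# Corpus sizes from the questions above, in megabytes (decimal units).
for label, megabytes in [("493KB", 0.493), ("25MB", 25), ("200TB", 200_000_000)]:
    seconds = estimated_seconds(placeholder_transform, sample, megabytes)
    print(f"{label}: roughly {seconds:.4f} seconds")
```

Running the same loop once with `stem` and once with `lemmatize` gives two estimates per corpus size to weigh against the quality of the output each one produces.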

