# Parsing Text (aka Prepping Text Data)
In this lesson we'll take our acquired data and parse it; that is, we'll better understand the text data by breaking it down into smaller components.

We'll be using the `nltk` package, which requires a little bit of up-front setup:

In [None]:
# We don't need to install nltk, it should come with anaconda, but nltk needs to download data.
# python -c "import nltk; nltk.download('stopwords')"

In [1]:
# unicode, regex, json for text digestion
import unicodedata
import re
import json

# nltk: natural language toolkit -> tokenization, stopwords (more on this soon)
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

# pandas dataframe manipulation, acquire script, time formatting
import pandas as pd
import acquire
from time import strftime

# shh, down in front
import warnings
warnings.filterwarnings('ignore')


Here's our plan for parsing the text data:

1. Convert text to all lower case for consistency.
1. Remove accented and other non-ASCII characters.
1. Remove special characters.
1. Tokenize the words.
1. Stem or lemmatize the words.
1. Remove stopwords.
1. Store the clean text and the original text for use in future notebooks.

In [2]:
original = '''
What are the Math and Stats Principles You Need for Data Science?Oct 21, 2020 | Data Science


Coming into our Data Science program, you will need to know some math and stats. However, many of our applicants actually learn in the application process – you don’t need to be an expert before applying! Data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. But what “skills” do we mean, exactly? Just what exactly are the data science math and stats principles you need to know?
What are the main math principles you need to know to get into Codeup’s Data Science program?


Algebra
Do you know PEMDAS and can you solve for x? You will need to be or become comfortable with the following:

Variables (x, y, n, etc.)
Formulas, functions, and variable manipulations (e.g. x^2 = x + 6, solve for x).
Order of evaluation: PEMDAS: parentheses, exponents, then multiplication, division, addition, and subtraction
Commutativity where a + b = b + a
Associativity where a + (b + c) = (a + b) + c
Adding and subtracting matrices
A conceptual understanding of exponential growth/decay- things can increase at an increasing rate

Descriptive Statistics
Know what a min, max, mode, median, and average are. Have a conceptual understanding that stats/probability is about trying to quantify uncertainty.
Data Visualization
Know what a scatterplot is and how to read a barplot.
How to Learn and Expand on These Concepts
There are a number of great resources out there to teach you these and similar concepts. Khan Academy is a great starting place for data science math! If you want to know what exactly we assign our applicants, you’ll just have to apply!
 
What about once you’re in Codeup?


What You Won’t Do
Do we do any mathematical proofs for concepts or perform derivations? No. 
Do we do any calculus and probability calculating by hand? No.
Are we transforming equations, where we cancel out units or terms and do lots of algebraic gymnastics? No
What You Will Do
Will we have Python solve our linear algebra problems for us? Yes
Will we have Python calculate probabilities, the area under a curve, and the slope of a line for us? Yes
Will we have Python do all of the calculus for us? Yes
 
See, the data science math and stats slice of the pie is certainly doable. If you like problem-solving and are ready to challenge yourself, you’ll love data science! If you are interested in learning about data science, just apply! Our Admissions Manager can work with you to get you where you need to be starting from where you are now. Let us help you get there so you can launch a great new career.

Request More Info





Our ProgramsFull Stack Web Development
Data Science
Cyber Cloud
Systems Engineering





Latest Blog Articles
Codeup Dallas Open House
Codeup’s Placement Team Continues Setting Records
IT Certifications 101: Why They Matter, and Why They Don’t
A rise in cyber attacks means opportunities for veterans in San Antonio
Use your GI Bill® benefits to Land a Job in Tech























More From This Category






Codeup Dallas Open House
Nov 30, 2021 | Dallas Newsletter, EventsCome join us for the re-opening of our Dallas Campus with some drinks and snacks at Codeup! Curious about what our campus looks like? Click here to register for free About this event Come join us for the re-opening of our Dallas Campus with some drinks and snacks at...




Codeup’s Placement Team Continues Setting Records
Nov 19, 2021 | Codeup News, EmployersOur Placement Team is simply defined as a group that manages relationships with our employer partners and our graduating students to help get our graduating students hired. Last quarter the Placement Team helped 48 students get hired to life-changing careers in tech....




IT Certifications 101: Why They Matter, and Why They Don’t
Nov 18, 2021 | Cybersecurity, IT Training, Tips for Prospective StudentsAWS, Google, Azure, Red Hat, CompTIA...these are big names in IT! And not only for their products, but also for the certifications they offer. If you’re new to tech, you might be wondering: Do certifications really matter? Welcome to IT Certifications 101! What’s the...

'''

#### Step 1: Convert all text to lower case for consistency

In [3]:
article = original.lower()
print(article)



what are the math and stats principles you need for data science?oct 21, 2020 | data science


coming into our data science program, you will need to know some math and stats. however, many of our applicants actually learn in the application process – you don’t need to be an expert before applying! data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. but what “skills” do we mean, exactly? just what exactly are the data science math and stats principles you need to know?
what are the main math principles you need to know to get into codeup’s data science program?


algebra
do you know pemdas and can you solve for x? you will need to be or become comfortable with the following:

variables (x, y, n, etc.)
formulas, functions, and variable manipulations (e.g. x^2 = x + 6, solve for x).
order of evaluation: pemdas: parentheses, exponents, then multiplication, division, addition, and

#### Step 2: Remove any accented characters, non-ASCII characters.

Text corpora often contain accented characters and letters. If you only want to analyze English text, these characters should be converted and standardized into ASCII characters. A simple example is converting é to e.

We'll go about this in three steps:

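1. `unicodedata.normalize` removes any inconsistencies in unicode character encoding. With the `NFKD` form, each character is decomposed into a base character plus any combining marks (é becomes e followed by a combining accent).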
1. `.encode` to convert the resulting string to the ASCII character set. We'll ignore any errors in conversion, meaning we'll drop anything that isn't an ASCII character.
1. `.decode` to turn the resulting bytes object back into a string.

ℌ ̧ ==> H ̧ ==> Ḩ
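To see what each of the three steps contributes, here is a small sketch that runs them one at a time on a short sample string. The sample text and variable names are only for illustration.

```python
import unicodedata

# Sample with an accented letter, an en dash, and a curly apostrophe: café – don’t
sample = "caf\u00e9 \u2013 don\u2019t"

# 1. NFKD splits "é" into a plain "e" followed by a combining accent mark
decomposed = unicodedata.normalize("NFKD", sample)

# 2. Encoding to ASCII with errors="ignore" drops every non-ASCII character:
#    the combining accent, the en dash, and the curly apostrophe
ascii_bytes = decomposed.encode("ascii", "ignore")

# 3. Decoding turns the bytes object back into a str
cleaned = ascii_bytes.decode("utf-8")

print(repr(sample))
print(repr(cleaned))
```

Characters with no ASCII equivalent are dropped rather than replaced, which is why "don’t" comes out as "dont" in the article output.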

In [4]:
article = unicodedata.normalize('NFKD', article)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')

print(article)


what are the math and stats principles you need for data science?oct 21, 2020 | data science


coming into our data science program, you will need to know some math and stats. however, many of our applicants actually learn in the application process  you dont need to be an expert before applying! data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. but what skills do we mean, exactly? just what exactly are the data science math and stats principles you need to know?
what are the main math principles you need to know to get into codeups data science program?


algebra
do you know pemdas and can you solve for x? you will need to be or become comfortable with the following:

variables (x, y, n, etc.)
formulas, functions, and variable manipulations (e.g. x^2 = x + 6, solve for x).
order of evaluation: pemdas: parentheses, exponents, then multiplication, division, addition, and subt

#### Step 3: Remove special characters

Special characters and symbols are usually non-alphanumeric characters or even occasionally numeric characters (depending on the problem), which add to the extra noise in unstructured text. Usually, simple regular expressions (regexes) can be used to remove them.

In [5]:
# remove anything that is not a through z, a number, a single quote, or whitespace
article = re.sub(r"[^a-z0-9'\s]", '', article)
print(article)



what are the math and stats principles you need for data scienceoct 21 2020  data science


coming into our data science program you will need to know some math and stats however many of our applicants actually learn in the application process  you dont need to be an expert before applying data science is a very accessible field to anyone dedicated to learning new skills and we can work with any applicant to help them learn what they need to know but what skills do we mean exactly just what exactly are the data science math and stats principles you need to know
what are the main math principles you need to know to get into codeups data science program


algebra
do you know pemdas and can you solve for x you will need to be or become comfortable with the following

variables x y n etc
formulas functions and variable manipulations eg x2  x  6 solve for x
order of evaluation pemdas parentheses exponents then multiplication division addition and subtraction
commutativity where a  b  b  a
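Notice that deleting a character outright can glue neighboring words together: in the first line of the output, "science?oct" became "scienceoct". One possible variation, sketched here as an assumption rather than something this lesson uses, is to substitute a space for each special character and then collapse repeated whitespace. Both function names are hypothetical.

```python
import re

def strip_special(text):
    # Same pattern as the lesson: delete anything that is not a-z, 0-9,
    # a single quote, or whitespace
    return re.sub(r"[^a-z0-9'\s]", "", text)

def space_out_special(text):
    # Variation: swap each special character for a space, then collapse
    # runs of whitespace (this also flattens newlines into single spaces)
    spaced = re.sub(r"[^a-z0-9'\s]", " ", text)
    return re.sub(r"\s+", " ", spaced).strip()

sample = "data science?oct 21, 2020"
print(strip_special(sample))
print(space_out_special(sample))
```

Which behavior is preferable depends on the text: the variation keeps words separate, but it also splits tokens such as "e.g." into pieces.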


#### Step 4: Tokenization

After removing non-ASCII and special characters, it's common to tokenize the text. Tokenization is the process of breaking something down into discrete units; in the context of NLP, this means breaking text down into discrete words, punctuation marks, and so on.
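To make "discrete units" concrete, here is a rough regex-based sketch that uses only the standard library. It is a simplified stand-in for illustration, not how any `nltk` tokenizer works internally, and `simple_tokenize` is a hypothetical name.

```python
import re

def simple_tokenize(text):
    # A run of word characters is one token; any other non-space
    # character (punctuation, symbols) becomes a token of its own
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Here is a string that we might've tokenized")
print(tokens)
```

Note how the apostrophe is split off into its own token, so "might've" becomes three tokens.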

We will use `nltk` to do tokenization for us:

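In [8]:
# Create the tokenizer before using it
tokenizer = nltk.tokenize.ToktokTokenizer()

print(tokenizer.tokenize("Here is a string that we might've tokenized", return_str=True))

Here is a string that we might ' ve tokenized


In [6]:
# Tokenize the full original article; return_str=True returns one string instead of a list of tokens
print(tokenizer.tokenize(original, return_str=True))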

What are the Math and Stats Principles You Need for Data Science?Oct 21 , 2020 &#124; Data Science


Coming into our Data Science program , you will need to know some math and stats. However , many of our applicants actually learn in the application process – you don ’ t need to be an expert before applying ! Data science is a very accessible field to anyone dedicated to learning new skills , and we can work with any applicant to help them learn what they need to know. But what “skills ” do we mean , exactly ? Just what exactly are the data science math and stats principles you need to know ? 
What are the main math principles you need to know to get into Codeup ’ s Data Science program ? 


Algebra
Do you know PEMDAS and can you solve for x ? You will need to be or become comfortable with the following : 

Variables ( x , y , n , etc. ) 
Formulas , functions , and variable manipulations ( e.g. x^2 = x + 6 , solve for x ) .
Order of evaluation : PEMDAS : parentheses , exponents , then 

#### Step 5: Stemming and Lemmatization

Usually you will want to use lemmatization. We will demonstrate why that is the case by looking at both here.

#### Stemming
Word stems are the base form of a word.

We create new words by attaching affixes in a process known as inflection. For example, "calls", "called", and "calling" all share the base stem "call".

The Porter stemmer is based on the algorithm developed by its inventor, Dr. Martin Porter. Originally, the algorithm is said to have had a total of five different phases for reduction of inflections to their stems, where each phase has its own set of rules.

Note that stemming usually applies a fixed set of rules, so the resulting stems may not be lexically correct. That is, a stem might not be a real word found in the dictionary (as we'll see in the output of stemming).
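To illustrate how fixed rules can produce stems that are not real words, here is a toy suffix-stripping sketch. It is far simpler than the Porter algorithm and is not what `nltk` does; the rule table and the `toy_stem` name are made up for this example.

```python
# Each rule is (suffix to strip, replacement); rules are tried in order
TOY_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem_length=3):
    for suffix, replacement in TOY_RULES:
        stem_length = len(word) - len(suffix)
        # Only strip when enough of the word is left over
        if word.endswith(suffix) and stem_length >= min_stem_length:
            return word[:stem_length] + replacement
    # No rule applied: return the word unchanged
    return word

for word in ["calls", "called", "calling", "studies", "is"]:
    print(word, "->", toy_stem(word))
```

The rules never consult a dictionary, so "studies" is reduced to "studi" even though that is not an English word.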

In [9]:
# Create the nltk stemmer object, then use it
ps = nltk.stem.PorterStemmer()

ps.stem('call'), ps.stem('calling'), ps.stem('called')


('call', 'call', 'call')

Now we can apply this stemming transformation to all the words in the article.

In [10]:
stems = [ps.stem(word) for word in article.split()]
article_stemmed = ' '.join(stems)
print(article_stemmed)

what are the math and stat principl you need for data scienceoct 21 2020 data scienc come into our data scienc program you will need to know some math and stat howev mani of our applic actual learn in the applic process you dont need to be an expert befor appli data scienc is a veri access field to anyon dedic to learn new skill and we can work with ani applic to help them learn what they need to know but what skill do we mean exactli just what exactli are the data scienc math and stat principl you need to know what are the main math principl you need to know to get into codeup data scienc program algebra do you know pemda and can you solv for x you will need to be or becom comfort with the follow variabl x y n etc formula function and variabl manipul eg x2 x 6 solv for x order of evalu pemda parenthes expon then multipl divis addit and subtract commut where a b b a associ where a b c a b c ad and subtract matric a conceptu understand of exponenti growthdecay thing can increas at an in

In [12]:
pd.Series(stems).value_counts().head(5)

to     25
and    23
you    21
a      18
the    15
dtype: int64

#### Lemmatization
Lemmatization is very similar to stemming; however, the base form in this case is known as the root word rather than the root stem. The difference is that the root word, also known as the lemma, is always a lexically correct word (present in the dictionary), while the root stem may not be.

Note that the lemmatization process is considerably slower than stemming, because an additional step is involved where the root form or lemma is formed by removing the affix from the word if and only if the lemma is present in the dictionary.
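The "if and only if the lemma is present in the dictionary" condition can be sketched with a toy example. This is only an illustration of the idea under simplified assumptions: the tiny dictionary, the rule table, and `toy_lemmatize` are all hypothetical, and a real lemmatizer relies on a much larger lexical resource.

```python
# Hypothetical mini-dictionary standing in for a real lexical database
TOY_DICTIONARY = {"call", "study", "science", "principle"}

# Each rule is (suffix to strip, replacement); rules are tried in order
TOY_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_lemmatize(word):
    # Words already in the dictionary are their own lemma
    if word in TOY_DICTIONARY:
        return word
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            # Accept the candidate only when the dictionary confirms it
            if candidate in TOY_DICTIONARY:
                return candidate
    # No dictionary-backed candidate found: leave the word alone
    return word

for word in ["studies", "sciences", "called", "codeups"]:
    print(word, "->", toy_lemmatize(word))
```

Because every candidate has to be checked against the dictionary, unknown words such as "codeups" pass through unchanged, and the extra lookup is part of why lemmatization costs more than stemming.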

Let's run both the lemmatizer and the stemmer on a few tokens. For these particular tokens, the two happen to agree:

In [None]:
# python3 -c "import nltk; nltk.download('all')"
# OR
# python3 -c "import nltk; nltk.download('wordnet')"

In [15]:
wnl = nltk.stem.WordNetLemmatizer()

for word in "would ' ve".split():
    print('lemma: ', wnl.lemmatize(word), '------ stem: ', ps.stem(word))


lemma:  would ------ stem:  would
lemma:  ' ------ stem:  '
lemma:  ve ------ stem:  ve


And now we can apply lemmatization to our entire document:

In [16]:
lemmas = [wnl.lemmatize(word) for word in article.split()]
article_lemmatized = ' '.join(lemmas)

print(article_lemmatized)

what are the math and stats principle you need for data scienceoct 21 2020 data science coming into our data science program you will need to know some math and stats however many of our applicant actually learn in the application process you dont need to be an expert before applying data science is a very accessible field to anyone dedicated to learning new skill and we can work with any applicant to help them learn what they need to know but what skill do we mean exactly just what exactly are the data science math and stats principle you need to know what are the main math principle you need to know to get into codeups data science program algebra do you know pemdas and can you solve for x you will need to be or become comfortable with the following variable x y n etc formula function and variable manipulation eg x2 x 6 solve for x order of evaluation pemdas parenthesis exponent then multiplication division addition and subtraction commutativity where a b b a associativity where a b 

Now that we have a list of the lemmas, we can take a look at the most frequent words.

In [17]:
pd.Series(lemmas).value_counts().head(5)

to     25
and    23
you    21
a      19
the    15
dtype: int64

#### Step 6: Removing stopwords

Words which have little or no significance, especially when constructing meaningful features from text, are known as stop words (or stopwords). These are usually the words with the highest counts if you compute simple term frequencies over a corpus. Typically, these are articles, conjunctions, prepositions, and so on. Some examples of stopwords are "a", "an", and "the".

While there is no universal stopword list, we will use a standard English language stopwords list from nltk. You can also add your own domain-specific stopwords as needed.
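As a rough sketch of customizing a stopword list, the snippet below uses a made-up miniature list in place of the real `nltk` one. All of the names and words here are hypothetical and chosen only for illustration.

```python
# Hypothetical miniature stopword list; the real nltk list is much longer
base_stopwords = {"a", "an", "the", "and", "to", "you", "no", "not"}

# Add domain-specific words to remove, and exempt words we want to keep
extra = {"codeup"}
keep = {"no", "not"}
custom_stopwords = (base_stopwords | extra) - keep

words = "you do not need to be an expert at codeup".split()
filtered = [w for w in words if w not in custom_stopwords]
print(filtered)
```

Using a set rather than a list keeps each membership check fast, which starts to matter once the text is long.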

Before removing stopwords, we want to segment text into linguistic units such as words or numbers. This process is called tokenization.

In [18]:
stopword_list = stopwords.words('english')

In [19]:
list(stopword_list)

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [None]:
stopword_list[:10]

In [20]:
len(stopword_list)

179

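In [21]:
# Take 'no' and 'not' out of the stopword list so they stay in the text;
# negations often carry meaning we want to keep.
stopword_list.remove('no')

In [22]:
stopword_list.remove('not')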

In [23]:
type(article)

str

In [24]:
words = article.split()
filtered_words = [w for w in words if w not in stopword_list]

print('Removed {} stopwords'.format(len(words) - len(filtered_words)))
print('---')

article_without_stopwords = ' '.join(filtered_words)

print(article_without_stopwords)

Removed 307 stopwords
---
math stats principles need data scienceoct 21 2020 data science coming data science program need know math stats however many applicants actually learn application process dont need expert applying data science accessible field anyone dedicated learning new skills work applicant help learn need know skills mean exactly exactly data science math stats principles need know main math principles need know get codeups data science program algebra know pemdas solve x need become comfortable following variables x n etc formulas functions variable manipulations eg x2 x 6 solve x order evaluation pemdas parentheses exponents multiplication division addition subtraction commutativity b b associativity b c b c adding subtracting matrices conceptual understanding exponential growthdecay things increase increasing rate descriptive statistics know min max mode median average conceptual understanding statsprobability trying quantify uncertainty data visualization know scatte

In [26]:
pd.Series(filtered_words).value_counts().head(5)

data       12
science    10
need        8
know        8
us          6
dtype: int64

## Exercises
The end result of this exercise should be a file named `prepare.py` that defines the requested functions.

In this exercise we will be defining some functions to prepare textual data. These functions should apply equally well to both the codeup blog articles and the news articles that were previously acquired.

1. Define a function named `basic_clean`. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

2. Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.

3. Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.

4. Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.

5. Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords. This function should define two optional parameters, `extra_words` and `exclude_words`. These parameters should define any additional stop words to include, and any words that we don't want to remove.

6. Use the data from your acquire script to produce a dataframe of the news articles. Name the dataframe `news_df`.

7. Make another dataframe for the Codeup blog posts. Name the dataframe `codeup_df`.

8. For each dataframe, produce the following columns:
- `title` to hold the title
- `original` to hold the original article/post content
- `clean` to hold the normalized and tokenized original with the stopwords removed.
- `stemmed` to hold the stemmed version of the cleaned data.
- `lemmatized` to hold the lemmatized version of the cleaned data.

9. Ask yourself:
 - If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
 - If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
 - If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

