# Parsing Text

In this lesson we'll take the data we acquired and *parse* it: we'll break the text down into smaller components so that it is easier to analyze.

We'll be using the `nltk` package, which requires a little bit of up-front setup:

```bash
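# We don't need to install nltk, since it should come with Anaconda, but nltk
# does need to download some data: the stopword list and the WordNet corpus
# (used by the lemmatizer later in this lesson).
python -c "import nltk; nltk.download('stopwords'); nltk.download('wordnet')"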
```

Here's our plan for parsing the text data:

1. Convert text to all lower case for normalization.
1. Remove any accented or other non-ASCII characters.
1. Remove special characters.
1. Tokenize the text.
1. Stem or lemmatize the words.
1. Remove stopwords.
1. Store the clean text and the original text for use in future notebooks.

In [16]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire

In [25]:
article = "Coming into our Data Science program, you will need to know some math and \
stats. However, many of our applicants actually learn in the application process – you \
don’t need to be an expert before applying! Data science is a very accessible field to \
anyone dedicated to learning new skills, and we can work with any applicant to help them \
learn what they need to know. But what “skills” do we mean, exactly? Just what exactly \
are the data science math and stats principles you need to know?', 'What are the main \
math principles you need to know to get into Codeup’s Data Science program?'"
article

"Coming into our Data Science program, you will need to know some math and stats. However, many of our applicants actually learn in the application process – you don’t need to be an expert before applying! Data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. But what “skills” do we mean, exactly? Just what exactly are the data science math and stats principles you need to know?', 'What are the main math principles you need to know to get into Codeup’s Data Science program?'"

## Convert text to all lower case for normalization

In [26]:
article = article.lower()
print(article)

coming into our data science program, you will need to know some math and stats. however, many of our applicants actually learn in the application process – you don’t need to be an expert before applying! data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. but what “skills” do we mean, exactly? just what exactly are the data science math and stats principles you need to know?', 'what are the main math principles you need to know to get into codeup’s data science program?'


## Removing accented characters

Text corpora often contain accented characters, which add noise when we only want to analyze English text. We need to convert these characters to their closest ASCII equivalents (for example, converting é to e) and drop anything that has no ASCII equivalent.

We'll go about this in three steps:

1. `unicodedata.normalize` removes any inconsistencies in unicode character encoding. With the `NFKD` form, an accented character such as é is decomposed into a plain letter followed by a separate combining accent.
1. `.encode` converts the resulting string to the ASCII character set. We'll `ignore` any errors in conversion, meaning we'll drop anything that isn't an ASCII character (including the combining accents).
1. `.decode` turns the resulting `bytes` object back into a string.
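
To see what each of the three steps does, here is a small sketch on a made-up string. The sample text and variable names are ours for illustration and are not part of the article data.

```python
import unicodedata

# Hypothetical sample: "café naïve “quoted”", written with escapes so the
# accented letters are guaranteed to be single precomposed characters
text = "caf\u00e9 na\u00efve \u201cquoted\u201d"

# Step 1: NFKD splits each accented letter into a base letter plus a combining mark
decomposed = unicodedata.normalize("NFKD", text)

# Step 2: encoding to ASCII while ignoring errors drops the combining marks
# and the curly quotes, which have no ASCII equivalent
ascii_bytes = decomposed.encode("ascii", "ignore")

# Step 3: decode the bytes back into a regular string
ascii_text = ascii_bytes.decode("utf-8", "ignore")

print(len(text), len(decomposed))  # the decomposed string is two characters longer
print(ascii_text)                  # cafe naive quoted
```

Note that the curly quotes are dropped entirely rather than converted to straight quotes, which is the same thing that happens to the quotes and the dash in the article.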

In [4]:
article = unicodedata.normalize('NFKD', article)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')

print(article[0:500])

what are the math and stats principles you need for data science?
oct 21, 2020 | data science


coming into our data science program, you will need to know some math and stats. however, many of our applicants actually learn in the application process  you dont need to be an expert before applying! data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. but what skills do we mean, exactly?


## Removing Special Characters

Special characters and symbols are usually non-alphanumeric characters (and, depending on the problem, sometimes digits as well) that add noise to unstructured text. Simple regular expressions (regexes) can usually remove them.

In [5]:
# remove anything that is not a through z, a number, a single quote, or whitespace
article = re.sub(r"[^a-z0-9'\s]", '', article)
print(article)

what are the math and stats principles you need for data science
oct 21 2020  data science


coming into our data science program you will need to know some math and stats however many of our applicants actually learn in the application process  you dont need to be an expert before applying data science is a very accessible field to anyone dedicated to learning new skills and we can work with any applicant to help them learn what they need to know but what skills do we mean exactly
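
One side effect worth knowing about: deleting a character outright can glue neighboring words together. The sketch below uses a made-up sample string to compare the pattern above with one possible alternative, replacing each special character with a space and then collapsing the extra whitespace. The alternative is an option to consider, not the approach this lesson uses.

```python
import re

# Hypothetical sample text
sample = "don't stop: 100% data-science!"

# Same pattern as above: delete every character that is not a-z, 0-9, ', or whitespace
removed = re.sub(r"[^a-z0-9'\s]", "", sample)

# Alternative: replace each special character with a space,
# then collapse runs of whitespace into a single space
spaced = re.sub(r"[^a-z0-9'\s]", " ", sample)
spaced = re.sub(r"\s+", " ", spaced).strip()

print(removed)  # don't stop 100 datascience
print(spaced)   # don't stop 100 data science
```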


## Tokenization

After removing non-ASCII characters and special characters, it's common to **tokenize** the text. Tokenization is the process of breaking something down into discrete units; in the context of NLP, this means breaking text down into discrete words, punctuation, etc.

We will use `nltk` to do tokenization for us:

In [6]:
tokenizer = nltk.tokenize.ToktokTokenizer()

print(tokenizer.tokenize(article, return_str=True)[0:500])

what are the math and stats principles you need for data science
oct 21 2020 data science


coming into our data science program you will need to know some math and stats however many of our applicants actually learn in the application process you dont need to be an expert before applying data science is a very accessible field to anyone dedicated to learning new skills and we can work with any applicant to help them learn what they need to know but what skills do we mean exactly
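
Because our article has already had its punctuation stripped, the tokenizer output above looks almost identical to its input. To make the idea of a token more concrete, here is a rough regex-based sketch applied to text that still has punctuation. This is a simplified stand-in, not how `ToktokTokenizer` works internally, and `simple_tokenize` is a hypothetical helper name.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # A run of letters, digits, or apostrophes is one token;
    # any other non-whitespace character becomes its own token.
    return re.findall(r"[A-Za-z0-9']+|[^A-Za-z0-9'\s]", text)

tokens = simple_tokenize("you don't need to be an expert, right?")
print(tokens)
# ['you', "don't", 'need', 'to', 'be', 'an', 'expert', ',', 'right', '?']
```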


## Stemming and Lemmatization

Usually you will want to use lemmatization. We will demonstrate why that is the case by looking at both here.

### Stemming

Word **stems** are the base form of a word.

We create new words by attaching *affixes* in a process known as *inflection*. For example, "calls", "called", and "calling" all share the base stem "call".

The Porter stemmer is based on the algorithm developed by its inventor, Dr. Martin Porter. Originally, the algorithm is said to have had a total of five different phases for reduction of inflections to their stems, where each phase has its own set of rules.

Note that stemming usually applies a fixed set of rules, so the resulting stems may not be lexically correct. That is, a stem may not be a real word found in the dictionary (as we'll see in the output of stemming).
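
To see why fixed rules can produce non-words, here is a deliberately tiny suffix-stripping sketch. It is *not* the Porter algorithm; the rule list and the `toy_stem` name are made up for illustration.

```python
# Toy suffix rules, tried in order: (suffix to strip, replacement)
RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    """Apply the first matching rule, as long as at least 3 letters remain."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)] + replacement
    return word

print([toy_stem(w) for w in ["calls", "called", "calling"]])  # ['call', 'call', 'call']
print(toy_stem("studies"))  # studi
print(toy_stem("running"))  # runn
```

The rules never check whether the result is a real word, which is why "studi" and "runn" come out. A real stemmer has many more rules, but the same limitation applies.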

In [7]:
# Create the nltk stemmer object, then use it
ps = nltk.porter.PorterStemmer()

ps.stem('call'), ps.stem('called'), ps.stem('calling')

('call', 'call', 'call')

Now we can apply this stemming transformation to all the words in the article.

In [8]:
stems = [ps.stem(word) for word in article.split()]
article_stemmed = ' '.join(stems)
print(article_stemmed)

what are the math and stat principl you need for data scienc oct 21 2020 data scienc come into our data scienc program you will need to know some math and stat howev mani of our applic actual learn in the applic process you dont need to be an expert befor appli data scienc is a veri access field to anyon dedic to learn new skill and we can work with ani applic to help them learn what they need to know but what skill do we mean exactli


In [9]:
pd.Series(stems).value_counts().head(10)

to        6
need      4
data      4
scienc    4
what      3
learn     3
applic    3
and       3
you       3
skill     2
dtype: int64

### Lemmatization

Lemmatization is very similar to stemming; however, the base form in this case is known as the root word (or **lemma**) rather than the root stem. The difference is that the root word is always a lexically correct word (present in the dictionary), while the root stem may not be.

Note that the lemmatization process is considerably slower than stemming, because an additional step is involved where the root form or lemma is formed by removing the affix from the word if and only if the lemma is present in the dictionary.
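
The "only if the lemma is present in the dictionary" check can be sketched with a toy example. The vocabulary, the suffix rules, and the `toy_lemmatize` name below are all hypothetical; a real lemmatizer such as WordNet's relies on a far larger dictionary and richer rules.

```python
# A tiny stand-in for a dictionary of known root words
VOCAB = {"call", "study", "student", "science", "apply"}

# Toy suffix rules, tried in order: (suffix to strip, replacement)
SUFFIX_RULES = [("ies", "y"), ("ied", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_lemmatize(word, vocab=VOCAB):
    """Strip a suffix only when the result is a known dictionary word."""
    if word in vocab:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in vocab:  # the extra lookup that stemming skips
                return candidate
    return word  # no valid lemma found, so leave the word unchanged

print(toy_lemmatize("studies"))      # study
print(toy_lemmatize("applied"))      # apply
print(toy_lemmatize("mathematics"))  # mathematics
```

Unlike a stemmer, this never returns a made-up form: "mathematics" is left alone because "mathematic" is not in the vocabulary. The cost is an extra lookup for every candidate.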

Let's take a look at a simple example of the difference between stemming and lemmatization:

In [10]:
wnl = nltk.stem.WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."

for word in sentence.split():
    print('stem:', ps.stem(word), '-- lemma:', wnl.lemmatize(word))

stem: he -- lemma: He
stem: wa -- lemma: wa
stem: run -- lemma: running
stem: and -- lemma: and
stem: eat -- lemma: eating
stem: at -- lemma: at
stem: same -- lemma: same
stem: time. -- lemma: time.
stem: he -- lemma: He
stem: ha -- lemma: ha
stem: bad -- lemma: bad
stem: habit -- lemma: habit
stem: of -- lemma: of
stem: swim -- lemma: swimming
stem: after -- lemma: after
stem: play -- lemma: playing
stem: long -- lemma: long
stem: hour -- lemma: hour
stem: in -- lemma: in
stem: the -- lemma: the
stem: sun. -- lemma: Sun.


In [11]:
wnl = nltk.stem.WordNetLemmatizer()

for word in 'studying what they needed to study, the students studied studiously'.split():
    print('stem:', ps.stem(word), '-- lemma:', wnl.lemmatize(word))

stem: studi -- lemma: studying
stem: what -- lemma: what
stem: they -- lemma: they
stem: need -- lemma: needed
stem: to -- lemma: to
stem: study, -- lemma: study,
stem: the -- lemma: the
stem: student -- lemma: student
stem: studi -- lemma: studied
stem: studious -- lemma: studiously

Notice that the lemmatizer left words like "running", "eating", and "studied" unchanged. By default, `WordNetLemmatizer.lemmatize` treats every word as a noun; passing a part-of-speech argument (for example, `wnl.lemmatize('running', pos='v')`) lets it reduce verbs as well. Also note that because we split on whitespace here, tokens such as "time." and "study," still carry their punctuation.


And now we can apply lemmatization to our entire document:

In [12]:
lemmas = [wnl.lemmatize(word) for word in article.split()]
article_lemmatized = ' '.join(lemmas)

print(article_lemmatized)

what are the math and stats principle you need for data science oct 21 2020 data science coming into our data science program you will need to know some math and stats however many of our applicant actually learn in the application process you dont need to be an expert before applying data science is a very accessible field to anyone dedicated to learning new skill and we can work with any applicant to help them learn what they need to know but what skill do we mean exactly


Now that we have a list of the lemmas, we can take a look at the most frequent words.

In [13]:
pd.Series(lemmas).value_counts()[:10]

to           6
need         4
data         4
science      4
what         3
and          3
you          3
skill        2
know         2
applicant    2
dtype: int64

## Removing Stopwords

Words which have little or no significance, especially when constructing meaningful features from text, are known as **stop words** (or **stopwords**). These are usually the words that end up with the highest counts if you compute a simple term or word frequency over a corpus. Typically, they are articles, conjunctions, prepositions and so on. Some examples of stopwords are a, an, the, and the like.

While there is no universal stopword list, we will use a standard English language stopwords list from `nltk`. You can also add your own domain-specific stopwords as needed.

Before removing stopwords, we need to segment the text into linguistic units such as words or numbers. This is the tokenization step described earlier; below we simply split the cleaned article on whitespace.

In [14]:
stopword_list = stopwords.words('english')

# Take the negations 'no' and 'not' out of the stopword list
# so that they are kept in the text
stopword_list.remove('no')
stopword_list.remove('not')

stopword_list[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [15]:
words = article.split()
filtered_words = [w for w in words if w not in stopword_list]

print('Removed {} stopwords'.format(len(words) - len(filtered_words)))
print('---')

article_without_stopwords = ' '.join(filtered_words)

print(article_without_stopwords)

Removed 41 stopwords
---
math stats principles need data science oct 21 2020 data science coming data science program need know math stats however many applicants actually learn application process dont need expert applying data science accessible field anyone dedicated learning new skills work applicant help learn need know skills mean exactly
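
The earlier note about adding your own domain-specific stopwords can be sketched as follows. To keep the example self-contained, `base_stopwords` is a short hand-written stand-in for `stopwords.words('english')`, and the sample sentence and the choice of "data" and "science" as domain words are assumptions made for illustration.

```python
# Stand-in for stopwords.words('english'); the real list is much longer
base_stopwords = ["i", "me", "my", "we", "our", "you", "the", "a", "to", "no", "not"]

# A set makes membership checks fast on large documents
stopword_set = set(base_stopwords)

# Words we do NOT want removed (keep the negations)
stopword_set -= {"no", "not"}

# Domain-specific words that carry little information in this corpus
stopword_set |= {"data", "science"}

text = "we do not need the data science degree to learn"
kept = [word for word in text.split() if word not in stopword_set]

print(" ".join(kept))  # do not need degree learn
```

Whether a frequent domain word like "data" should be treated as a stopword depends on the question you are asking of the corpus, so treat a list like this as something to revisit.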


## Exercises

The end result of this exercise should be a file named `prepare.py` that defines
the requested functions.

In this exercise we will be defining some functions to prepare textual data.
These functions should apply equally well to both the codeup blog articles and
the news articles that were previously acquired.

1. Define a function named `basic_clean`. It should take in a string and apply
   some basic text cleaning to it:

    - Lowercase everything
    - Normalize unicode characters
    - Remove anything that is not a letter, number, whitespace, or a single quote.

1. Define a function named `tokenize`. It should take in a string and tokenize
   all the words in the string.

1. Define a function named `stem`. It should accept some text and return the
   text after applying stemming to all the words.

1. Define a function named `lemmatize`. It should accept some text and return
   the text after applying lemmatization to each word.

1. Define a function named `remove_stopwords`. It should accept some text and
   return the text after removing all the stopwords.

    This function should define two optional parameters, `extra_words` and
    `exclude_words`. These parameters should define any additional stop words to
    include, and any words that we *don't* want to remove.

1. Use the data from your `acquire` module to produce a dataframe of the news articles. Name the dataframe `news_df`.

1. Make another dataframe for the Codeup blog posts. Name the dataframe `codeup_df`.

1. For each dataframe, produce the following columns:
    - `title` to hold the title
    - `original` to hold the original article/post content
    - `clean` to hold the normalized and tokenized original with the stopwords removed.
    - `stemmed` to hold the stemmed version of the cleaned data.
    - `lemmatized` to hold the lemmatized version of the cleaned data.

1. Ask yourself: 
    - If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
    - If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
    - If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?