Natural Language Processing (NLP) is all about enabling machines to understand and process human language. One of the very first and most fundamental steps is text processing. Whether you're building a chatbot, a sentiment analysis model, or a search engine, you can't skip this step.
Let's first understand a few foundational terms often used in text processing:
- Corpus: A corpus is a large collection of textual data. It can be considered as a collection of paragraphs or documents.
- Document: A document is typically a single paragraph or a unit of text within a corpus.
- Words: These are all individual lexical units (tokens) that appear in the corpus, regardless of repetition.
- Vocabulary: The vocabulary refers to the set of unique words present in the corpus. It forms the foundation for constructing vector representations of textual data.
Tokenization is a text preprocessing technique that involves splitting raw text into smaller units called tokens. These tokens could be:
- Sentence-level tokens (splitting a paragraph into individual sentences)
- Word-level tokens (splitting a sentence into individual words)
"Better tokens lead to better models."
Each token can then be processed and transformed into numerical vectors, an essential step since the models can't understand raw text directly.
Example 1: Paragraph Tokenization
Input Paragraph:
"On a rainy afternoon, Maria decided to bake cookies. She gathered flour, sugar, and chocolate chips from the pantry."
Step 1: Sentence Tokenization
Using sentence boundary markers such as full stops (.) and exclamation marks (!), the paragraph is tokenized into sentences:
[
"On a rainy afternoon, Maria decided to bake cookies",
"She gathered flour, sugar, and chocolate chips from the pantry"
]
Step 2: Word Tokenization
Each sentence is further tokenized into words:
Sentence 1: ["On", "a", "rainy", "afternoon", "Maria", "decided", "to", "bake", "cookies"]
Sentence 2: ["She", "gathered", "flour", "sugar", "and", "chocolate", "chips", "from", "the", "pantry"]
Example 2: Vocabulary Extraction
Input Text:
"I like to drink Apple Juice. My friend likes Mango Juice."
Tokenized Sentences:
[
"I like to drink Apple Juice",
"My friend likes Mango Juice"
]
Word Tokens:
["I", "like", "to", "drink", "Apple", "Juice", "My", "friend", "likes", "Mango", "Juice"]
Total Words: 11
Unique Words (Vocabulary):
["I", "like", "to", "drink", "Apple", "Juice", "My", "friend", "likes", "Mango"]
Vocabulary Size: 10
Note: "Juice" appears twice, hence the count of total words are 11 but unique words are 10.
Download the pre-trained tokenizer models needed for sentence and word tokenization.
import nltk
nltk.download('punkt')
Function | Description |
---|---|
sent_tokenize(corpus) | Breaks a paragraph into individual sentences. |
word_tokenize(corpus) | Breaks a sentence into words and punctuation. |
wordpunct_tokenize(corpus) | Splits text into words and also separates all punctuation marks. |
TreebankWordTokenizer().tokenize(corpus) | Breaks text into words using rules from the Penn Treebank (e.g., "don't" → "do" + "n't"). |
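The functions in the table can be used as in this short sketch, applied to the paragraph from Example 1 (the exact token boundaries are whatever NLTK's default English models produce):

from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize, TreebankWordTokenizer

corpus = ("On a rainy afternoon, Maria decided to bake cookies. "
          "She gathered flour, sugar, and chocolate chips from the pantry.")

sentences = sent_tokenize(corpus)        # splits the paragraph into 2 sentences
words = word_tokenize(sentences[0])      # words and punctuation of the first sentence
print(sentences)
print(words)
print(wordpunct_tokenize("don't"))                 # ['don', "'", 't']
print(TreebankWordTokenizer().tokenize("don't"))   # ['do', "n't"]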
Stemming is the process of reducing a word to its word stem, i.e., the base form left after stripping suffixes and prefixes. The stem need not be a valid dictionary word; the valid root form of a word is known as a lemma. Stemming is important in natural language understanding (NLU) and natural language processing (NLP).
Types of Stemmers
i) Porter Stemmer
- The Porter Stemmer is one of the oldest and most widely used stemming algorithms in NLP.
- It works by applying a series of rule-based steps to remove common word endings such as -ing, -ed, and -ly.
- For example, it converts "running" to "run" and "caresses" to "caress".
- It's simple and efficient, making it useful for basic text analysis in English.
- However, a major disadvantage is that it can sometimes be too aggressive, removing parts of words inappropriately.
- For instance, "university" might be stemmed to "univers", which is not a real word.
ii) Snowball Stemmer
- The Snowball Stemmer is essentially an improved version of the Porter Stemmer, designed to be more consistent and support multiple languages, including English, French, German, and more.
- It uses a more refined set of rules and provides better accuracy for modern NLP tasks.
- Like Porter, it turns "running" into "run", but tends to avoid some of the over-stemming problems.
- The main disadvantage is that while it's better than Porter, it still doesn't always produce real root words, just shorter versions, and is still based on fixed rules rather than word meaning.
iii) Regexp Stemmer
- The Regexp Stemmer is a customizable stemmer where you define your own rules using regular expressions (regex).
- This gives you full control over how stemming is done, which is helpful in specialized tasks or domain-specific text.
- For example, you can write a rule to remove the ending "ing" from any word, so "jumping" becomes "jump".
- The advantage is flexibility, but the disadvantage is that it requires manual setup and good knowledge of regex, and can easily miss edge cases or introduce errors if not carefully designed.
import nltk
word = "running"   # example word to stem
Porter Stemmer
from nltk.stem import PorterStemmer
stemming = PorterStemmer()
stemming.stem(word)   # 'run'
Snowball Stemmer
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer('english')
snowball_stemmer.stem(word)   # 'run'
Regexp Stemmer
from nltk.stem import RegexpStemmer
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)
reg_stemmer.stem(word)   # 'runn' (only the custom regex rules are applied)
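To see the behaviour described above on concrete words, the same list can be run through both rule-based stemmers side by side:

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')

# Compare Porter vs Snowball output for each word.
for w in ["running", "caresses", "university", "fairly"]:
    print(w, "->", porter.stem(w), "|", snowball.stem(w))
# Porter maps "fairly" to "fairli", while Snowball gives "fair";
# both reduce "university" to "univers", which is not a real word.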
Lemmatization is a technique similar to stemming. The output we get after lemmatization is called a 'lemma', which is a proper root word rather than the root stem produced by stemming. After lemmatization, we always get a valid word that carries the same meaning.
Wordnet Lemmatizer: NLTK provides the WordNetLemmatizer class, which is a thin wrapper around the WordNet corpus. This class uses the morphy() function of the WordNet CorpusReader class to find a lemma.
POS | Tag |
---|---|
Noun | n |
Verb | v |
Adjective | a |
Adverb | r |
import nltk
nltk.download('wordnet')   # the lemmatizer looks up lemmas in the WordNet corpus
WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("running", pos='v')   # 'run'
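Because lemmas are looked up per part of speech, the pos argument changes the result. A short illustration using NLTK's WordNetLemmatizer:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos='a'))   # 'good'   (adjective lookup)
print(lemmatizer.lemmatize("better"))            # 'better' (default pos is noun)
print(lemmatizer.lemmatize("went", pos='v'))     # 'go'
print(lemmatizer.lemmatize("geese"))             # 'goose'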
- Stopwords are high-frequency lexical items, i.e., common words (e.g., "the", "is", "in", "and") that often do not contribute significant semantic value to text-based tasks.
- Removing stopwords is a common preprocessing step to reduce dimensionality, eliminate noise, and improve computational efficiency, especially in tasks like information retrieval, text classification, and topic modeling.
import nltk
nltk.download('stopwords')
The nltk library provides a built-in stopword corpus through nltk.corpus.stopwords, which can be accessed and filtered using functions like stopwords.words('english').
from nltk.corpus import stopwords
stopwords.words('english')
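Putting it together, a typical stopword-removal step filters the word tokens against the English stopword list (a small sketch, assuming the punkt and stopwords resources have been downloaded as shown above):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
sentence = "The food is good and the service was amazing"
tokens = word_tokenize(sentence.lower())

# Keep only tokens that are not in the stopword list.
filtered = [w for w in tokens if w not in stop_words]
print(filtered)   # ['food', 'good', 'service', 'amazing']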
Tag | Description | Example(s) |
---|---|---|
CC | Coordinating conjunction | and, but, or |
CD | Cardinal digit | one, two, 100 |
DT | Determiner | the, a, an |
EX | Existential there | there is, there exists |
FW | Foreign word | voila, bonjour |
IN | Preposition/Subordinating conjunction | in, on, because |
JJ | Adjective | big |
JJR | Adjective, comparative | bigger |
JJS | Adjective, superlative | biggest |
LS | List marker | 1), a), i) |
MD | Modal | could, will, should |
NN | Noun, singular | desk |
NNS | Noun, plural | desks |
NNP | Proper noun, singular | Harrison |
NNPS | Proper noun, plural | Americans |
PDT | Predeterminer | "all the kids" |
POS | Possessive ending | parent's |
PRP | Personal pronoun | I, he, she |
PRP$ | Possessive pronoun | my, his, hers |
RB | Adverb | very, silently |
RBR | Adverb, comparative | better |
RBS | Adverb, superlative | best |
RP | Particle | give up |
TO | "To" | to go 'to' the store |
UH | Interjection | ugh, hmm, errrrrrrrm |
VB | Verb, base form | take |
VBD | Verb, past tense | took |
VBG | Verb, gerund/present participle | taking |
VBN | Verb, past participle | taken |
VBP | Verb, non-3rd person singular present | take |
VBZ | Verb, 3rd person singular present | takes |
WDT | Wh-determiner | which |
WP | Wh-pronoun | who, what |
WP$ | Possessive wh-pronoun | whose |
WRB | Wh-adverb | where, when |
import nltk
nltk.download('averaged_perceptron_tagger')
If you want to directly pass a sentence and check the POS for each word, then do:
nltk.pos_tag(sentence.split())
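For example, tagging a short sentence produces (word, tag) pairs drawn from the Penn Treebank tag set listed above (assuming the tagger resource has been downloaded):

import nltk

sentence = "Maria decided to bake delicious cookies"
print(nltk.pos_tag(sentence.split()))
# e.g. [('Maria', 'NNP'), ('decided', 'VBD'), ('to', 'TO'),
#       ('bake', 'VB'), ('delicious', 'JJ'), ('cookies', 'NNS')]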
One-Hot Encoding represents each word as a binary vector with a single 1 at that word's position in the vocabulary. Consider the following example:
Document | Text | Output |
---|---|---|
D1 | The food is good | 1 |
D2 | The food is bad | 0 |
D3 | Pizza is Amazing | 1 |
Vocabulary size = 7
["The", "food", "is", "good", "bad", "Pizza", "Amazing"]
Word | Vector |
---|---|
The | [1, 0, 0, 0, 0, 0, 0] |
food | [0, 1, 0, 0, 0, 0, 0] |
is | [0, 0, 1, 0, 0, 0, 0] |
good | [0, 0, 0, 1, 0, 0, 0] |
bad | [0, 0, 0, 0, 1, 0, 0] |
Pizza | [0, 0, 0, 0, 0, 1, 0] |
Amazing | [0, 0, 0, 0, 0, 0, 1] |
D1: The food is good
[
[1, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0]
] # 4x7
D2: The food is bad
[
[1, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0]
] # 4x7
D3: Pizza is Amazing
[
[0, 0, 0, 0, 0, 1, 0],
[0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1]
] # 3x7
New sentence: "Burger is bad"
[
[0, 0, 0, 0, 0, 0, 0], # "Burger" is out of vocabulary
[0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0]
] # 3x7
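These matrices can be reproduced by hand in a few lines of Python. This is a minimal sketch that maps each word to its index in the example vocabulary and leaves out-of-vocabulary words as all-zero rows:

vocab = ["The", "food", "is", "good", "bad", "Pizza", "Amazing"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(sentence):
    # One row per word; out-of-vocabulary words stay all zeros.
    matrix = []
    for word in sentence.split():
        row = [0] * len(vocab)
        if word in index:
            row[index[word]] = 1
        matrix.append(row)
    return matrix

print(one_hot("The food is good"))   # the 4x7 matrix for D1
print(one_hot("Burger is bad"))      # first row is all zeros ("Burger" is OOV)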
Advantages:
- Easy to implement in Python using libraries like scikit-learn (OneHotEncoder) and pandas (get_dummies)
Disadvantages:
- Sparse matrix → Overfitting
- ML algorithms require a fixed input size, but one-hot matrices grow with sentence length
- No semantic meaning is captured
- Out of Vocabulary (OOV) issues
The Bag of Words (BoW) model is a text representation technique used in NLP, where a text corpus is represented as a "bag" of words, disregarding grammar and word order but keeping track of the word frequencies. The BoW model transforms text data into a structured format, such as a matrix, where rows represent individual text samples (e.g., documents or sentences) and columns represent words in the vocabulary.
Steps in Preprocessing for BoW:
- Text Lowercasing: The text is converted to lowercase to maintain consistency and prevent differentiating between words like "Apple" and "apple."
- Stopwords Removal: Common words (stopwords) like "and," "the," "is," etc., which do not carry significant meaning in most NLP tasks, are removed to reduce noise in the dataset.
- Tokenization: The remaining text is broken down into tokens (usually words).
- Vocabulary Construction: The distinct words (tokens) across the entire corpus are used to create a vocabulary, which forms the basis for the BoW model.
- Frequency Count or Binary Encoding: After the vocabulary is built, two primary approaches can be used to encode the presence and frequency of words; both are described below.
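Before turning to those two encodings, here is a minimal sketch of preprocessing steps 1-4 (lowercasing, stopword removal, tokenization, vocabulary construction) on a tiny corpus, assuming the NLTK resources downloaded earlier:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

corpus = ["The food is good", "The food is bad", "Pizza is Amazing"]
stop_words = set(stopwords.words('english'))

# Lowercase, tokenize, and drop stopwords for each document.
processed = [[w for w in word_tokenize(doc.lower()) if w not in stop_words]
             for doc in corpus]
vocabulary = sorted({w for doc in processed for w in doc})

print(processed)    # [['food', 'good'], ['food', 'bad'], ['pizza', 'amazing']]
print(vocabulary)   # ['amazing', 'bad', 'food', 'good', 'pizza']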
Types of Bag of Words (BoW):
i) Binary Bag of Words (Binary BoW):
- Representation: In Binary BoW, each word is represented as either present or absent in a given text sample.
- Encoding: The feature values are binary (0 or 1). If a word is present in a document, it is represented as 1; otherwise, it is 0.
- Example:
Vocabulary: ["dog", "cat", "fish"]
Document 1: "dog and cat" → [1, 1, 0]
Document 2: "fish and cat" → [0, 1, 1]
ii) Frequency-Based Bag of Words (Frequency BoW):
- Representation: Each word is represented by its frequency of occurrence in the document.
- Encoding: The feature values represent the count of times each word appears in a given document.
- Example:
Vocabulary: ["dog", "cat", "fish"]
Document 1: "dog and cat and dog" → [2, 1, 0]
Document 2: "fish and cat" → [0, 1, 1]
Example:
Text | Output |
---|---|
He is a good boy | 1 |
She is a good girl | 0 |
Boy and girl are good | 1 |
Lowercase all the words in the text and remove the stopwords.
S1 → good boy; S2 → good girl; S3 → boy girl good
Vocabulary | Frequency |
---|---|
good | 3 |
boy | 2 |
girl | 2 |
[good boy girl] Output
S1 [ 1 1 0 ] 1
S2 [ 1 0 1 ] 0
S3 [ 1 1 1 ] 1
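The same matrix can be reproduced with scikit-learn's CountVectorizer, which lowercases, drops English stopwords, and counts word frequencies (the column order differs from the table above because CountVectorizer sorts the vocabulary alphabetically; passing binary=True would give the Binary BoW variant instead):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "He is a good boy",
    "She is a good girl",
    "Boy and girl are good",
]

# Frequency-based BoW; set binary=True for Binary BoW.
vectorizer = CountVectorizer(lowercase=True, stop_words='english')
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())   # ['boy' 'girl' 'good']
print(bow.toarray())
# [[1 0 1]
#  [0 1 1]
#  [1 1 1]]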
Advantages:
- Simple and intuitive
- Fixed-size input can be given to ML algorithms
Disadvantages:
- Sparse matrix/array → Overfitting
- Word order is lost
- Out of Vocabulary (OOV)
- Semantic meaning is still not captured