Natural Language Processing (NLP) is all about enabling machines to understand and process human language. One of the very first and most fundamental steps is text processing. Whether you're building a chatbot, a sentiment analysis model, or a search engine, you can't skip this step.
Let's first understand a few foundational terms often used in text processing:
- Corpus: A corpus is a large collection of textual data. It can be considered as a collection of paragraphs or documents.
- Document: A document is typically a single paragraph or a unit of text within a corpus.
- Words: These are all individual lexical units (tokens) that appear in the corpus, regardless of repetition.
- Vocabulary: The vocabulary refers to the set of unique words present in the corpus. It forms the foundation for constructing vector representations of textual data.
Tokenization is a text preprocessing technique that involves splitting raw text into smaller units called tokens. These tokens could be:
- Sentence-level tokens (splitting a paragraph into individual sentences)
- Word-level tokens (splitting a sentence into individual words)
"Better tokens lead to better models."
Each token can then be processed and transformed into numerical vectors, an essential step since the models can't understand raw text directly.
Example 1: Paragraph Tokenization
Input Paragraph:
"On a rainy afternoon, Maria decided to bake cookies. She gathered flour, sugar, and chocolate chips from the pantry."
Step 1: Sentence Tokenization
Using sentence boundary markers such as full stops (.) and exclamation marks (!), the paragraph is tokenized into sentences:
[
"On a rainy afternoon, Maria decided to bake cookies",
"She gathered flour, sugar, and chocolate chips from the pantry"
]
Step 2: Word Tokenization
Each sentence is further tokenized into words:
Sentence 1: ["On", "a", "rainy", "afternoon", "Maria", "decided", "to", "bake", "cookies"]
Sentence 2: ["She", "gathered", "flour", "sugar", "and", "chocolate", "chips", "from", "the", "pantry"]
Example 2: Vocabulary Extraction
Input Text:
"I like to drink Apple Juice. My friend likes Mango Juice."
Tokenized Sentences:
[
"I like to drink Apple Juice",
"My friend likes Mango Juice"
]
Word Tokens:
["I", "like", "to", "drink", "Apple", "Juice", "My", "friend", "likes", "Mango", "Juice"]
Total Words: 11
Unique Words (Vocabulary):
["I", "like", "to", "drink", "Apple", "Juice", "My", "friend", "likes", "Mango"]
Vocabulary Size: 10
Note: "Juice" appears twice, hence the count of total words are 11 but unique words are 10.
Download the pre-trained tokenizer models needed for sentence and word tokenization.
import nltk
nltk.download('punkt')
Function | Description |
---|---|
sent_tokenize(corpus) | Breaks a paragraph into individual sentences. |
word_tokenize(corpus) | Breaks a sentence into words and punctuation. |
wordpunct_tokenize(corpus) | Splits text into words and also separates all punctuation marks. |
TreebankWordTokenizer().tokenize(corpus) | Breaks text into words using rules from the Penn Treebank (e.g., "don't" → "do" + "n't"). |
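The functions in the table can be used as in this short sketch, applied to the paragraph from Example 1 (the exact token boundaries are whatever NLTK's default English models produce):

from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize, TreebankWordTokenizer

corpus = ("On a rainy afternoon, Maria decided to bake cookies. "
          "She gathered flour, sugar, and chocolate chips from the pantry.")

sentences = sent_tokenize(corpus)        # splits the paragraph into 2 sentences
words = word_tokenize(sentences[0])      # words and punctuation of the first sentence
print(sentences)
print(words)
print(wordpunct_tokenize("don't"))                 # ['don', "'", 't']
print(TreebankWordTokenizer().tokenize("don't"))   # ['do', "n't"]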
Stemming is the process of reducing a word to its word stem, i.e., the base form left after stripping suffixes and prefixes. The stem need not be a valid dictionary word; the valid root form of a word is known as a lemma. Stemming is important in natural language understanding (NLU) and natural language processing (NLP).
Types of Stemmers
i) Porter Stemmer
- The Porter Stemmer is one of the oldest and most widely used stemming algorithms in NLP.
- It works by applying a series of rule-based steps to remove common word endings such as -ing, -ed, and -ly.
- For example, it converts "running" to "run" and "caresses" to "caress".
- It's simple and efficient, making it useful for basic text analysis in English.
- However, a major disadvantage is that it can sometimes be too aggressive, removing parts of words inappropriately.
- For instance, "university" might be stemmed to "univers", which is not a real word.
ii) Snowball Stemmer
- The Snowball Stemmer is essentially an improved version of the Porter Stemmer, designed to be more consistent and support multiple languages, including English, French, German, and more.
- It uses a more refined set of rules and provides better accuracy for modern NLP tasks.
- Like Porter, it turns "running" into "run", but tends to avoid some of the over-stemming problems.
- The main disadvantage is that while it's better than Porter, it still doesn't always produce real root words, just shorter versions, and is still based on fixed rules rather than word meaning.
iii) Regexp Stemmer
- The Regexp Stemmer is a customizable stemmer where you define your own rules using regular expressions (regex).
- This gives you full control over how stemming is done, which is helpful in specialized tasks or domain-specific text.
- For example, you can write a rule to remove the ending "ing" from any word, so "jumping" becomes "jump".
- The advantage is flexibility, but the disadvantage is that it requires manual setup and good knowledge of regex, and can easily miss edge cases or introduce errors if not carefully designed.
import nltk
word = "running"   # example word to stem
Porter Stemmer
from nltk.stem import PorterStemmer
stemming = PorterStemmer()
stemming.stem(word)   # 'run'
Snowball Stemmer
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer('english')
snowball_stemmer.stem(word)   # 'run'
Regexp Stemmer
from nltk.stem import RegexpStemmer
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)
reg_stemmer.stem(word)   # 'runn' (only the custom regex rules are applied)
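To see the behaviour described above on concrete words, the same list can be run through both rule-based stemmers side by side:

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')

# Compare Porter vs Snowball output for each word.
for w in ["running", "caresses", "university", "fairly"]:
    print(w, "->", porter.stem(w), "|", snowball.stem(w))
# Porter maps "fairly" to "fairli", while Snowball gives "fair";
# both reduce "university" to "univers", which is not a real word.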
Lemmatization is a technique similar to stemming. The output we get after lemmatization is called a 'lemma', which is a proper root word rather than the root stem produced by stemming. After lemmatization, we always get a valid word that carries the same meaning.
Wordnet Lemmatizer: NLTK provides the WordNetLemmatizer class, which is a thin wrapper around the WordNet corpus. This class uses the morphy() function of the WordNet CorpusReader class to find a lemma.
POS | Tag |
---|---|
Noun | n |
Verb | v |
Adjective | a |
Adverb | r |
import nltk
nltk.download('wordnet')   # the lemmatizer looks up lemmas in the WordNet corpus
WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("running", pos='v')   # 'run'
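Because lemmas are looked up per part of speech, the pos argument changes the result. A short illustration using NLTK's WordNetLemmatizer:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos='a'))   # 'good'   (adjective lookup)
print(lemmatizer.lemmatize("better"))            # 'better' (default pos is noun)
print(lemmatizer.lemmatize("went", pos='v'))     # 'go'
print(lemmatizer.lemmatize("geese"))             # 'goose'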
- Stopwords are high-frequency lexical items, i.e., common words (e.g., "the", "is", "in", "and") that often do not contribute significant semantic value to text-based tasks.
- Removing stopwords is a common preprocessing step to reduce dimensionality, eliminate noise, and improve computational efficiency, especially in tasks like information retrieval, text classification, and topic modeling.
import nltk
nltk.download('stopwords')
The nltk library provides a built-in stopword corpus through nltk.corpus.stopwords, which can be accessed and filtered using functions like stopwords.words('english').
from nltk.corpus import stopwords
stopwords.words('english')
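Putting it together, a typical stopword-removal step filters the word tokens against the English stopword list (a small sketch, assuming the punkt and stopwords resources have been downloaded as shown above):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
sentence = "The food is good and the service was amazing"
tokens = word_tokenize(sentence.lower())

# Keep only tokens that are not in the stopword list.
filtered = [w for w in tokens if w not in stop_words]
print(filtered)   # ['food', 'good', 'service', 'amazing']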
Tag | Description | Example(s) |
---|---|---|
CC | Coordinating conjunction | and, but, or |
CD | Cardinal digit | one, two, 100 |
DT | Determiner | the, a, an |
EX | Existential there | there is, there exists |
FW | Foreign word | voila, bonjour |
IN | Preposition/Subordinating conjunction | in, on, because |
JJ | Adjective | big |
JJR | Adjective, comparative | bigger |
JJS | Adjective, superlative | biggest |
LS | List marker | 1), a), i) |
MD | Modal | could, will, should |
NN | Noun, singular | desk |
NNS | Noun, plural | desks |
NNP | Proper noun, singular | Harrison |
NNPS | Proper noun, plural | Americans |
PDT | Predeterminer | "all the kids" |
POS | Possessive ending | parent's |
PRP | Personal pronoun | I, he, she |
PRP$ | Possessive pronoun | my, his, hers |
RB | Adverb | very, silently |
RBR | Adverb, comparative | better |
RBS | Adverb, superlative | best |
RP | Particle | give up |
TO | "To" | to go 'to' the store |
UH | Interjection | ugh, hmm, errrrrrrrm |
VB | Verb, base form | take |
VBD | Verb, past tense | took |
VBG | Verb, gerund/present participle | taking |
VBN | Verb, past participle | taken |
VBP | Verb, non-3rd person singular present | take |
VBZ | Verb, 3rd person singular present | takes |
WDT | Wh-determiner | which |
WP | Wh-pronoun | who, what |
WP$ | Possessive wh-pronoun | whose |
WRB | Wh-adverb | where, when |
import nltk
nltk.download('averaged_perceptron_tagger')
If you want to directly pass a sentence and check the POS for each word, then do:
nltk.pos_tag(sentence.split())
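For example, tagging a short sentence produces (word, tag) pairs drawn from the Penn Treebank tag set listed above (assuming the tagger resource has been downloaded):

import nltk

sentence = "Maria decided to bake delicious cookies"
print(nltk.pos_tag(sentence.split()))
# e.g. [('Maria', 'NNP'), ('decided', 'VBD'), ('to', 'TO'),
#       ('bake', 'VB'), ('delicious', 'JJ'), ('cookies', 'NNS')]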
One-Hot Encoding represents each word as a binary vector with a single 1 at that word's position in the vocabulary. Consider the following example:
Document | Text | Output |
---|---|---|
D1 | The food is good | 1 |
D2 | The food is bad | 0 |
D3 | Pizza is Amazing | 1 |
Vocabulary size = 7
["The", "food", "is", "good", "bad", "Pizza", "Amazing"]
Word | Vector |
---|---|
The | [1, 0, 0, 0, 0, 0, 0] |
food | [0, 1, 0, 0, 0, 0, 0] |
is | [0, 0, 1, 0, 0, 0, 0] |
good | [0, 0, 0, 1, 0, 0, 0] |
bad | [0, 0, 0, 0, 1, 0, 0] |
Pizza | [0, 0, 0, 0, 0, 1, 0] |
Amazing | [0, 0, 0, 0, 0, 0, 1] |
D1: The food is good
[
[1, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0]
] # 4x7
D2: The food is bad
[
[1, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0]
] # 4x7
D3: Pizza is Amazing
[
[0, 0, 0, 0, 0, 1, 0],
[0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1]
] # 3x7
New sentence: "Burger is bad"
[
[0, 0, 0, 0, 0, 0, 0], # "Burger" is out of vocabulary
[0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0]
] # 3x7
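These matrices can be reproduced by hand in a few lines of Python. This is a minimal sketch that maps each word to its index in the example vocabulary and leaves out-of-vocabulary words as all-zero rows:

vocab = ["The", "food", "is", "good", "bad", "Pizza", "Amazing"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(sentence):
    # One row per word; out-of-vocabulary words stay all zeros.
    matrix = []
    for word in sentence.split():
        row = [0] * len(vocab)
        if word in index:
            row[index[word]] = 1
        matrix.append(row)
    return matrix

print(one_hot("The food is good"))   # the 4x7 matrix for D1
print(one_hot("Burger is bad"))      # first row is all zeros ("Burger" is OOV)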
Advantages:
- Easy to implement in Python using libraries like scikit-learn (OneHotEncoder) and pandas (get_dummies)
Disadvantages:
- Sparse matrix → Overfitting
- ML algorithms require a fixed input size, but one-hot matrices grow with sentence length
- No semantic meaning is captured
- Out of Vocabulary (OOV) issues
The Bag of Words (BoW) model is a text representation technique used in NLP, where a text corpus is represented as a "bag" of words, disregarding grammar and word order but keeping track of the word frequencies. The BoW model transforms text data into a structured format, such as a matrix, where rows represent individual text samples (e.g., documents or sentences) and columns represent words in the vocabulary.
Steps in Preprocessing for BoW:
- Text Lowercasing: The text is converted to lowercase to maintain consistency and prevent differentiating between words like "Apple" and "apple."
- Stopwords Removal: Common words (stopwords) like "and," "the," "is," etc., which do not carry significant meaning in most NLP tasks, are removed to reduce noise in the dataset.
- Tokenization: The remaining text is broken down into tokens (usually words).
- Vocabulary Construction: The distinct words (tokens) across the entire corpus are used to create a vocabulary, which forms the basis for the BoW model.
- Frequency Count or Binary Encoding: After the vocabulary is built, two primary approaches can be used to encode the presence and frequency of words; both are described below.
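Before turning to those two encodings, here is a minimal sketch of preprocessing steps 1-4 (lowercasing, stopword removal, tokenization, vocabulary construction) on a tiny corpus, assuming the NLTK resources downloaded earlier:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

corpus = ["The food is good", "The food is bad", "Pizza is Amazing"]
stop_words = set(stopwords.words('english'))

# Lowercase, tokenize, and drop stopwords for each document.
processed = [[w for w in word_tokenize(doc.lower()) if w not in stop_words]
             for doc in corpus]
vocabulary = sorted({w for doc in processed for w in doc})

print(processed)    # [['food', 'good'], ['food', 'bad'], ['pizza', 'amazing']]
print(vocabulary)   # ['amazing', 'bad', 'food', 'good', 'pizza']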
Types of Bag of Words (BoW):
i) Binary Bag of Words (Binary BoW):
- Representation: In Binary BoW, each word is represented as either present or absent in a given text sample.
- Encoding: The feature values are binary (0 or 1). If a word is present in a document, it is represented as 1; otherwise, it is 0.
- Example:
Vocabulary: ["dog", "cat", "fish"]
Document 1: "dog and cat" → [1, 1, 0]
Document 2: "fish and cat" → [0, 1, 1]
ii) Frequency-Based Bag of Words (Frequency BoW):
- Representation: Each word is represented by its frequency of occurrence in the document.
- Encoding: The feature values represent the count of times each word appears in a given document.
- Example:
Vocabulary: ["dog", "cat", "fish"]
Document 1: "dog and cat and dog" → [2, 1, 0]
Document 2: "fish and cat" → [0, 1, 1]
Example:
Text | Output |
---|---|
He is a good boy | 1 |
She is a good girl | 0 |
Boy and girl are good | 1 |
Lowercase all the words in the text and remove the stopwords.
S1 → good boy; S2 → good girl; S3 → boy girl good
Vocabulary | Frequency |
---|---|
good | 3 |
boy | 2 |
girl | 2 |
[good boy girl] Output
S1 [ 1 1 0 ] 1
S2 [ 1 0 1 ] 0
S3 [ 1 1 1 ] 1
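The same matrix can be reproduced with scikit-learn's CountVectorizer, which lowercases, drops English stopwords, and counts word frequencies (the column order differs from the table above because CountVectorizer sorts the vocabulary alphabetically; passing binary=True would give the Binary BoW variant instead):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "He is a good boy",
    "She is a good girl",
    "Boy and girl are good",
]

# Frequency-based BoW; set binary=True for Binary BoW.
vectorizer = CountVectorizer(lowercase=True, stop_words='english')
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())   # ['boy' 'girl' 'good']
print(bow.toarray())
# [[1 0 1]
#  [0 1 1]
#  [1 1 1]]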
Advantages:
- Simple and intuitive
- Fixed-size input can be given to ML algorithms
Disadvantages:
- Sparse matrix/array → Overfitting
- Word order is lost
- Out of Vocabulary (OOV)
- Semantic meaning is still not captured