**Advanced Tokenization, N-grams, Stemming, Lemmatization & Stopwords**

Today, we continued our journey into Natural Language Processing (NLP) by exploring more tokenization techniques, generating n-grams, and learning about stemming, lemmatization, and stopwords.

# Whitespace Tokenization

**Definition:** Splits text based on whitespace (spaces, tabs, newlines) without removing punctuation.

- Useful when punctuation should be preserved as part of the tokens.

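- Fast and simple, but less precise than tokenizers such as `word_tokenize`, because punctuation stays glued to the neighbouring word (e.g., `machines.`).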

In [2]:
# Sample text; you can substitute any text of your own
AI = '''Artificial Intelligence refers to the intelligence of machines. This is in contrast to the natural intelligence of
humans and animals. With Artificial Intelligence, machines perform functions such as learning, planning, reasoning and
problem-solving. Most noteworthy, Artificial Intelligence is the simulation of human intelligence by machines.
It is probably the fastest-growing development in the World of technology and innovation. Furthermore, many experts believe
AI could solve major challenges and crisis situations.'''

In [3]:
from nltk.tokenize import WhitespaceTokenizer

wt = WhitespaceTokenizer().tokenize(AI)
print(wt)  # Split on whitespace only; punctuation stays attached to words

['Artificial', 'Intelligence', 'refers', 'to', 'the', 'intelligence', 'of', 'machines.', 'This', 'is', 'in', 'contrast', 'to', 'the', 'natural', 'intelligence', 'of', 'humans', 'and', 'animals.', 'With', 'Artificial', 'Intelligence,', 'machines', 'perform', 'functions', 'such', 'as', 'learning,', 'planning,', 'reasoning', 'and', 'problem-solving.', 'Most', 'noteworthy,', 'Artificial', 'Intelligence', 'is', 'the', 'simulation', 'of', 'human', 'intelligence', 'by', 'machines.', 'It', 'is', 'probably', 'the', 'fastest-growing', 'development', 'in', 'the', 'World', 'of', 'technology', 'and', 'innovation.', 'Furthermore,', 'many', 'experts', 'believe', 'AI', 'could', 'solve', 'major', 'challenges', 'and', 'crisis', 'situations.']


In [4]:
print(len(wt))  # Count of tokens

70
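
For ordinary text, Python's built-in `str.split()` with no argument behaves much like the whitespace tokenizer shown above. The sketch below uses a made-up `sample` string (not from the notebook) to show that tabs and newlines also act as separators and that punctuation is left attached.

```python
# Plain-Python approximation of whitespace tokenization.
# `sample` is a made-up string used only for illustration.
sample = "Hello, world!\tTabs and\nnewlines count too."

# With no argument, split() breaks on runs of spaces, tabs, and newlines.
tokens = sample.split()

print(tokens)       # punctuation stays attached: 'Hello,' and 'too.'
print(len(tokens))
```

This is handy for a quick check when NLTK is not available; it is an approximation, not a replacement for `WhitespaceTokenizer`.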


# WordPunct Tokenization
**Definition:** Splits words and punctuation into separate tokens.

- Runs of alphanumeric characters and runs of punctuation become separate tokens, so `$3.88` is split into `$`, `3`, `.`, and `88`.

In [5]:
from nltk.tokenize import wordpunct_tokenize

s = 'Good apple cost $3.88 in Hyderabad. Please buy two of them. Thanks.'
s

'Good apple cost $3.88 in Hyderabad. Please buy two of them. Thanks.'

In [6]:
print(wordpunct_tokenize(s))

['Good', 'apple', 'cost', '$', '3', '.', '88', 'in', 'Hyderabad', '.', 'Please', 'buy', 'two', 'of', 'them', '.', 'Thanks', '.']


In [7]:
print(len(wordpunct_tokenize(s)))

18


In [8]:
w_p = wordpunct_tokenize(AI)
print(w_p)
print(len(w_p))

['Artificial', 'Intelligence', 'refers', 'to', 'the', 'intelligence', 'of', 'machines', '.', 'This', 'is', 'in', 'contrast', 'to', 'the', 'natural', 'intelligence', 'of', 'humans', 'and', 'animals', '.', 'With', 'Artificial', 'Intelligence', ',', 'machines', 'perform', 'functions', 'such', 'as', 'learning', ',', 'planning', ',', 'reasoning', 'and', 'problem', '-', 'solving', '.', 'Most', 'noteworthy', ',', 'Artificial', 'Intelligence', 'is', 'the', 'simulation', 'of', 'human', 'intelligence', 'by', 'machines', '.', 'It', 'is', 'probably', 'the', 'fastest', '-', 'growing', 'development', 'in', 'the', 'World', 'of', 'technology', 'and', 'innovation', '.', 'Furthermore', ',', 'many', 'experts', 'believe', 'AI', 'could', 'solve', 'major', 'challenges', 'and', 'crisis', 'situations', '.']
85
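
The splitting behaviour above can be approximated with a single regular expression. The sketch below assumes the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of characters that are neither word characters nor whitespace); `simple_wordpunct` is a hypothetical helper name, not an NLTK function.

```python
import re

# Assumed pattern: a run of word characters OR a run of punctuation.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def simple_wordpunct(text):
    """Return word runs and punctuation runs as separate tokens."""
    return WORDPUNCT_PATTERN.findall(text)

s = 'Good apple cost $3.88 in Hyderabad. Please buy two of them. Thanks.'
result = simple_wordpunct(s)
print(result)
print(len(result))

# Consecutive punctuation marks stay together as one token.
print(simple_wordpunct("Wait... what?!"))
```

Because the price is broken into four tokens, this style of tokenizer is a poor fit when numbers or currency amounts must survive intact.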


**Summary of Tokenizers**

| Tokenizer             | Description                                   |
| --------------------- | --------------------------------------------- |
| `word_tokenize`       | Splits into words and punctuation (accurate). |
| `sent_tokenize`       | Splits into sentences.                        |
| `blankline_tokenize`  | Splits into paragraphs by blank lines.        |
| `WhitespaceTokenizer` | Splits on spaces, tabs, and newlines only.    |
| `wordpunct_tokenize`  | Splits words and punctuation separately.      |


# N-grams
**Definition:** Sequences of n consecutive tokens.

- Bigram: sequence of 2 consecutive tokens.
- Trigram: sequence of 3 consecutive tokens.
- N-gram: the general case for any n ≥ 1; bigrams and trigrams are simply the n = 2 and n = 3 cases.

In [13]:
import nltk
from nltk.util import bigrams, trigrams, ngrams

string = "we are learner of AI from 4th May 2025 till now"
string

'we are learner of AI from 4th May 2025 till now'

In [14]:
quotes_tokens = nltk.word_tokenize(string)

print(quotes_tokens)
print(len(quotes_tokens))

['we', 'are', 'learner', 'of', 'AI', 'from', '4th', 'May', '2025', 'till', 'now']
11


## Bigrams

In [15]:
# Bigrams
quotes_tokens_bi = list(nltk.bigrams(quotes_tokens))
print(quotes_tokens_bi)

[('we', 'are'), ('are', 'learner'), ('learner', 'of'), ('of', 'AI'), ('AI', 'from'), ('from', '4th'), ('4th', 'May'), ('May', '2025'), ('2025', 'till'), ('till', 'now')]


## Trigrams

In [17]:
# Trigrams
quotes_tokens_tri = list(nltk.trigrams(quotes_tokens))
print(quotes_tokens_tri)

[('we', 'are', 'learner'), ('are', 'learner', 'of'), ('learner', 'of', 'AI'), ('of', 'AI', 'from'), ('AI', 'from', '4th'), ('from', '4th', 'May'), ('4th', 'May', '2025'), ('May', '2025', 'till'), ('2025', 'till', 'now')]


## N-grams (n=8)

In [16]:
# N-grams (n=8)
quotes_tokens_n = list(nltk.ngrams(quotes_tokens, 8))
print(quotes_tokens_n)

[('we', 'are', 'learner', 'of', 'AI', 'from', '4th', 'May'), ('are', 'learner', 'of', 'AI', 'from', '4th', 'May', '2025'), ('learner', 'of', 'AI', 'from', '4th', 'May', '2025', 'till'), ('of', 'AI', 'from', '4th', 'May', '2025', 'till', 'now')]
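
An n-gram generator is just a sliding window of width n, so a list of k tokens yields k - n + 1 n-grams. The sketch below is a plain-Python illustration of that idea; `make_ngrams` is a hypothetical helper, not part of NLTK.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, which gives len(tokens) - n + 1 windows.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['we', 'are', 'learner', 'of', 'AI', 'from', '4th', 'May', '2025', 'till', 'now']

for n in (2, 3, 8):
    grams = make_ngrams(tokens, n)
    print(n, len(grams), grams[0])
```

With 11 tokens this gives 10 bigrams, 9 trigrams, and 4 eight-grams, matching the lengths of the outputs above. If n exceeds the number of tokens, the result is an empty list.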


# Stemming
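**Definition:** Reduces words to a root form (stem) by stripping suffixes with rule-based heuristics. The stem is not guaranteed to be a real word (e.g., `giv`).

Types:

- **Porter Stemmer** – The classic, widely used algorithm. Fairly conservative, and it leaves irregular forms such as `gave` unchanged.
- **Lancaster Stemmer** – More aggressive, and often over-stems (e.g., `maximum` becomes `maxim`).
- **Snowball Stemmer** – A refinement of Porter (sometimes called Porter2) that supports multiple languages.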

In [19]:
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

words_to_stem = ['given','give','giving','gave','thinking','loving','maximum','akshaybhujbal','gaved']
words_to_stem

['given',
 'give',
 'giving',
 'gave',
 'thinking',
 'loving',
 'maximum',
 'akshaybhujbal',
 'gaved']

## Porter Stemmer

In [20]:
# Porter Stemmer
pst = PorterStemmer()
for word in words_to_stem:
    print(word + ' : ' + pst.stem(word))

given : given
give : give
giving : give
gave : gave
thinking : think
loving : love
maximum : maximum
akshaybhujbal : akshaybhujb
gaved : gave


## Lancaster Stemmer

In [21]:
# Lancaster Stemmer
lst = LancasterStemmer()
for word in words_to_stem:
    print(word + ' : ' + lst.stem(word))

given : giv
give : giv
giving : giv
gave : gav
thinking : think
loving : lov
maximum : maxim
akshaybhujbal : akshaybhujb
gaved : gav


## Snowball Stemmer (English)

In [22]:
# Snowball Stemmer (English)
sbst = SnowballStemmer('english')
for word in words_to_stem:
    print(word + ' : ' + sbst.stem(word))

given : given
give : give
giving : give
gave : gave
thinking : think
loving : love
maximum : maximum
akshaybhujbal : akshaybhujb
gaved : gave


In [26]:
# Snowball Stemmer (German)
stemmer = SnowballStemmer('german')
print(stemmer.stem('samstag'))

samstag
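
To make the "chopping off suffixes" idea concrete, here is a deliberately tiny suffix stripper. It is a toy under stated assumptions: the rule list `SUFFIXES` and the helper `toy_stem` are hypothetical and far simpler than the Porter, Lancaster, or Snowball rules.

```python
# Hypothetical, minimal rule set; real stemmers use many more rules
# plus conditions on what remains after stripping.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["thinking", "giving", "gaved", "gave", "maximum"]:
    print(w, "->", toy_stem(w))
```

Even this toy shows the two typical stemming traits seen in the outputs above: stems such as `giv` are not real words, and irregular forms such as `gave` are never mapped back to `give` because no suffix rule applies.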


# Lemmatization
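**Definition:** Reduces words to their dictionary form (lemma), taking meaning and grammar into account.

- Usually more accurate than stemming because it uses vocabulary and morphological analysis.
- `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, which is why the verb forms in the output of the next cell (`giving`, `gave`, `thinking`) come back unchanged.
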
In [27]:
from nltk.stem import WordNetLemmatizer

word_lem = WordNetLemmatizer()
for word in words_to_stem:
    print(word + ' : ' + word_lem.lemmatize(word))

given : given
give : give
giving : giving
gave : gave
thinking : thinking
loving : loving
maximum : maximum
akshaybhujbal : akshaybhujbal
gaved : gaved
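
A toy sketch of why the part of speech matters for lemmatization. The lookup table `TOY_LEMMAS` and the helper `toy_lemmatize` are hypothetical stand-ins for a real lexicon such as WordNet; only the lookup-by-(word, part of speech) idea is being illustrated.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech).
TOY_LEMMAS = {
    ("giving", "v"): "give",
    ("gave", "v"): "give",
    ("given", "v"): "give",
    ("thinking", "v"): "think",
    ("loving", "v"): "love",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); return the word unchanged if absent."""
    return TOY_LEMMAS.get((word, pos), word)

for w in ["giving", "gave", "thinking", "maximum"]:
    print(w, "| as noun:", toy_lemmatize(w), "| as verb:", toy_lemmatize(w, pos="v"))
```

With the real lemmatizer, the equivalent step is to pass the part of speech explicitly, for example `word_lem.lemmatize(word, pos='v')` for verbs. Unlike a stemmer, a dictionary lookup can map an irregular form such as `gave` to `give`.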


# Stopwords
**Definition:** Commonly used words (e.g., "the", "is", "in") that are often removed in NLP tasks.
- Removing stopwords helps focus on meaningful content.

In [29]:
from nltk.corpus import stopwords

stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [30]:
len(stopwords.words('english'))

198

In [34]:
from nltk.corpus import stopwords

print("Stopwords length for French is", len(stopwords.words('french')))
print("Stopwords length for German is", len(stopwords.words('german')))
print("Stopwords length for Chinese is", len(stopwords.words('chinese')))


Stopwords length for French is 157
Stopwords length for German is 232
Stopwords length for Chinese is 841
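
Listing stopwords is only half of the task; the usual next step is to filter them out of a token list. The sketch below uses a small hand-picked set, `TOY_STOPWORDS`, as an assumption so that it runs without the NLTK corpus; in practice you would build the set from `stopwords.words('english')`.

```python
# Small hand-picked stopword set for illustration only;
# the NLTK English list is much longer.
TOY_STOPWORDS = frozenset({"the", "is", "in", "of", "and", "to", "by", "a"})

def remove_stopwords(tokens, stopword_set=TOY_STOPWORDS):
    """Keep tokens whose lowercase form is not in the stopword set."""
    return [t for t in tokens if t.lower() not in stopword_set]

tokens = ['Artificial', 'Intelligence', 'is', 'the', 'simulation',
          'of', 'human', 'intelligence', 'by', 'machines']
filtered = remove_stopwords(tokens)
print(filtered)
```

Two design choices are worth noting: the comparison is done on the lowercase form because the stopword entries are lowercase, and a set is used so that each membership test is fast even for long lists.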


**Summary**
  
* Learned WhitespaceTokenizer and wordpunct_tokenize for flexible token splitting.
* Explored n-grams for sequence-based analysis.
* Compared Porter, Lancaster, and Snowball stemmers.
* Understood lemmatization for accurate root forms.
* Reviewed stopwords in multiple languages.

# NLP Quick Reference: Tokenization, N-grams, Stemming, Lemmatization, Stopwords

## Example Text

In [36]:
text = """Natural Language Processing makes it possible for machines 
to understand human language and perform useful tasks."""

## Basic Tokenization

In [40]:
from nltk.tokenize import word_tokenize, sent_tokenize, blankline_tokenize

word_tokenize(text)         # Words + punctuation

['Natural',
 'Language',
 'Processing',
 'makes',
 'it',
 'possible',
 'for',
 'machines',
 'to',
 'understand',
 'human',
 'language',
 'and',
 'perform',
 'useful',
 'tasks',
 '.']

In [41]:
sent_tokenize(text)         # Sentences

['Natural Language Processing makes it possible for machines \nto understand human language and perform useful tasks.']

In [42]:
blankline_tokenize(text)    # Paragraphs

['Natural Language Processing makes it possible for machines \nto understand human language and perform useful tasks.']
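
The example text contains only a single newline, so it counts as one paragraph. To see paragraph splitting actually happen, the sketch below splits on blank lines with a regular expression. The pattern `\s*\n\s*\n\s*` and the helper `split_paragraphs` are assumptions made for illustration, and `doc` is a made-up string.

```python
import re

def split_paragraphs(text):
    """Split text wherever a blank line (two newlines, optional spaces) occurs."""
    parts = re.split(r"\s*\n\s*\n\s*", text)
    return [p for p in parts if p]  # drop empty pieces at the edges

doc = "First paragraph, line one.\nStill the first paragraph.\n\nSecond paragraph."
paragraphs = split_paragraphs(doc)
print(paragraphs)
print(len(paragraphs))

# A single newline is not a paragraph break.
single = "Natural Language Processing makes it possible for machines \nto understand."
print(len(split_paragraphs(single)))
```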

## Whitespace Tokenizer
Splits on whitespace only (spaces, tabs, newlines) and keeps punctuation attached to the adjacent word.

In [43]:
from nltk.tokenize import WhitespaceTokenizer
WhitespaceTokenizer().tokenize(text)

['Natural',
 'Language',
 'Processing',
 'makes',
 'it',
 'possible',
 'for',
 'machines',
 'to',
 'understand',
 'human',
 'language',
 'and',
 'perform',
 'useful',
 'tasks.']

## WordPunct Tokenizer
Splits words and punctuation separately.

In [44]:
from nltk.tokenize import wordpunct_tokenize
wordpunct_tokenize(text)

['Natural',
 'Language',
 'Processing',
 'makes',
 'it',
 'possible',
 'for',
 'machines',
 'to',
 'understand',
 'human',
 'language',
 'and',
 'perform',
 'useful',
 'tasks',
 '.']

## N-grams
Create sequences of n consecutive tokens.

In [46]:
import nltk
tokens = nltk.word_tokenize(text)

list(nltk.bigrams(tokens))        # Bigrams

[('Natural', 'Language'),
 ('Language', 'Processing'),
 ('Processing', 'makes'),
 ('makes', 'it'),
 ('it', 'possible'),
 ('possible', 'for'),
 ('for', 'machines'),
 ('machines', 'to'),
 ('to', 'understand'),
 ('understand', 'human'),
 ('human', 'language'),
 ('language', 'and'),
 ('and', 'perform'),
 ('perform', 'useful'),
 ('useful', 'tasks'),
 ('tasks', '.')]

In [47]:
list(nltk.trigrams(tokens))       # Trigrams

[('Natural', 'Language', 'Processing'),
 ('Language', 'Processing', 'makes'),
 ('Processing', 'makes', 'it'),
 ('makes', 'it', 'possible'),
 ('it', 'possible', 'for'),
 ('possible', 'for', 'machines'),
 ('for', 'machines', 'to'),
 ('machines', 'to', 'understand'),
 ('to', 'understand', 'human'),
 ('understand', 'human', 'language'),
 ('human', 'language', 'and'),
 ('language', 'and', 'perform'),
 ('and', 'perform', 'useful'),
 ('perform', 'useful', 'tasks'),
 ('useful', 'tasks', '.')]

In [48]:
list(nltk.ngrams(tokens, 4))      # 4-word n-grams

[('Natural', 'Language', 'Processing', 'makes'),
 ('Language', 'Processing', 'makes', 'it'),
 ('Processing', 'makes', 'it', 'possible'),
 ('makes', 'it', 'possible', 'for'),
 ('it', 'possible', 'for', 'machines'),
 ('possible', 'for', 'machines', 'to'),
 ('for', 'machines', 'to', 'understand'),
 ('machines', 'to', 'understand', 'human'),
 ('to', 'understand', 'human', 'language'),
 ('understand', 'human', 'language', 'and'),
 ('human', 'language', 'and', 'perform'),
 ('language', 'and', 'perform', 'useful'),
 ('and', 'perform', 'useful', 'tasks'),
 ('perform', 'useful', 'tasks', '.')]

## Stemming
Reduce words to root form (may not be real words).

In [49]:
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
words = ['running', 'runs', 'easily', 'fairly']

PorterStemmer().stem('running')

'run'

In [50]:
LancasterStemmer().stem('running')

'run'

In [51]:
SnowballStemmer('english').stem('running')

'run'

## Lemmatization
Reduce to dictionary form, considering meaning.

In [52]:
from nltk.stem import WordNetLemmatizer
WordNetLemmatizer().lemmatize('running', pos='v')  # 'run'

'run'

## Stopwords
Common words often removed in text processing.

In [53]:
from nltk.corpus import stopwords
stopwords.words('english')[:10]   # First 10 stopwords (not echoed: a cell displays only its last expression)
len(stopwords.words('english'))   # Count of English stopwords

198

**Closing Notes:**

- Tokenization is the **first step** in almost every NLP task.  
- The choice between **stemming** and **lemmatization** depends on whether you need speed (stemming) or accuracy (lemmatization).  
- **Stopword removal** helps focus on meaningful words, but check first whether any stopwords matter for your task; for example, `not` is in the English list and carries meaning in sentiment analysis.  
- **N-grams** capture word sequences and context, useful for tasks like text prediction or sentiment analysis.  
- Always experiment with different tokenizers and preprocessing techniques to see what works best for your dataset.  
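
As a small illustration of the text-prediction use of n-grams mentioned in the notes, the sketch below counts bigrams and suggests the most frequent follower of a word. The sentence, the `following` table, and the helper `predict_next` are all made up for this example; real predictors use far larger corpora and smoothing.

```python
from collections import Counter, defaultdict

# Made-up toy corpus, tokenized on whitespace.
tokens = "the cat sat on the mat and the cat slept".split()

# For each word, count which words follow it (bigram counts).
following = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))    # 'cat' follows 'the' twice, 'mat' once
print(predict_next("slept"))  # last word has no follower
```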
