## Advanced Tokenization, Stemming, Lemmatization & Stopwords

This notebook continues with Natural Language Processing (NLP) by exploring more tokenization techniques and introducing stemming, lemmatization, stopwords, and part-of-speech (POS) tagging.

# Whitespace Tokenization 
**Definition:** Splits text based on whitespace (spaces, tabs, newlines) without removing punctuation.
- Useful when punctuation should be preserved as part of the tokens.
- Faster but less precise than other tokenizers.
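
As a rough illustration of the behaviour described above, whitespace tokenization can be reproduced with Python's built-in `str.split()`. The sample string is made up for this sketch and is not part of the notebook data.

```python
# Minimal sketch: whitespace tokenization with the standard library only.
# The sample sentence is an illustrative assumption, not taken from the notebook.
sample = "Machines learn, plan\tand reason.\nHumans do too."

# str.split() with no argument splits on any run of whitespace
# (spaces, tabs, newlines) and leaves punctuation attached to the words.
tokens = sample.split()

print(tokens)
# ['Machines', 'learn,', 'plan', 'and', 'reason.', 'Humans', 'do', 'too.']
```

Note how `learn,` and `reason.` keep their punctuation, which is the trade-off this tokenizer makes for speed.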

In [3]:
import os
import nltk
#nltk.download()

In [4]:
AI = '''Artificial Intelligence refers to the intelligence of machines. This is in contrast to the natural intelligence of
humans and animals. With Artificial Intelligence, machines perform functions such as learning, planning, reasoning and
problem-solving. Most noteworthy, Artificial Intelligence is the simulation of human intelligence by machines.
It is probably the fastest-growing development in the World of technology and innovation. Furthermore, many experts believe
AI could solve major challenges and crisis situations.'''

In [5]:
AI

'Artificial Intelligence refers to the intelligence of machines. This is in contrast to the natural intelligence of\nhumans and animals. With Artificial Intelligence, machines perform functions such as learning, planning, reasoning and\nproblem-solving. Most noteworthy, Artificial Intelligence is the simulation of human intelligence by machines.\nIt is probably the fastest-growing development in the World of technology and innovation. Furthermore, many experts believe\nAI could solve major challenges and crisis situations.'

In [44]:
from nltk.tokenize import word_tokenize

In [45]:
AI_tokens = word_tokenize(AI)
AI_tokens

['Artificial',
 'Intelligence',
 'refers',
 'to',
 'the',
 'intelligence',
 'of',
 'machines',
 '.',
 'This',
 'is',
 'in',
 'contrast',
 'to',
 'the',
 'natural',
 'intelligence',
 'of',
 'humans',
 'and',
 'animals',
 '.',
 'With',
 'Artificial',
 'Intelligence',
 ',',
 'machines',
 'perform',
 'functions',
 'such',
 'as',
 'learning',
 ',',
 'planning',
 ',',
 'reasoning',
 'and',
 'problem-solving',
 '.',
 'Most',
 'noteworthy',
 ',',
 'Artificial',
 'Intelligence',
 'is',
 'the',
 'simulation',
 'of',
 'human',
 'intelligence',
 'by',
 'machines',
 '.',
 'It',
 'is',
 'probably',
 'the',
 'fastest-growing',
 'development',
 'in',
 'the',
 'World',
 'of',
 'technology',
 'and',
 'innovation',
 '.',
 'Furthermore',
 ',',
 'many',
 'experts',
 'believe',
 'AI',
 'could',
 'solve',
 'major',
 'challenges',
 'and',
 'crisis',
 'situations',
 '.']

In [46]:
from nltk.tokenize import WhitespaceTokenizer
wt = WhitespaceTokenizer().tokenize(AI)
wt

['Artificial',
 'Intelligence',
 'refers',
 'to',
 'the',
 'intelligence',
 'of',
 'machines.',
 'This',
 'is',
 'in',
 'contrast',
 'to',
 'the',
 'natural',
 'intelligence',
 'of',
 'humans',
 'and',
 'animals.',
 'With',
 'Artificial',
 'Intelligence,',
 'machines',
 'perform',
 'functions',
 'such',
 'as',
 'learning,',
 'planning,',
 'reasoning',
 'and',
 'problem-solving.',
 'Most',
 'noteworthy,',
 'Artificial',
 'Intelligence',
 'is',
 'the',
 'simulation',
 'of',
 'human',
 'intelligence',
 'by',
 'machines.',
 'It',
 'is',
 'probably',
 'the',
 'fastest-growing',
 'development',
 'in',
 'the',
 'World',
 'of',
 'technology',
 'and',
 'innovation.',
 'Furthermore,',
 'many',
 'experts',
 'believe',
 'AI',
 'could',
 'solve',
 'major',
 'challenges',
 'and',
 'crisis',
 'situations.']

In [47]:
print(len(wt))

70


In [48]:
s = 'Good apple cost $3.88 in hyd.Please buy two of them. Thanks.'
s

'Good apple cost $3.88 in hyd.Please buy two of them. Thanks.'

# WordPunct Tokenization 
**Definition:** Splits words and punctuation into separate tokens.

- Numbers, punctuation marks, and words are treated individually, so `$3.88` is split into `$`, `3`, `.`, and `88`.
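
The splitting rule can be sketched with a plain regular expression. NLTK's `wordpunct_tokenize` is based on a pattern of this shape; the constant name `WORD_OR_PUNCT` is made up for this sketch.

```python
import re

# Sketch of the WordPunct idea with a regular expression:
#   \w+       matches a run of letters, digits, or underscores
#   [^\w\s]+  matches a run of punctuation characters
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

s = 'Good apple cost $3.88 in Hyderabad. Please buy two of them. Thanks.'
tokens = WORD_OR_PUNCT.findall(s)

print(tokens)
print(len(tokens))  # 18
```

Because a price such as `$3.88` is broken into four tokens, this tokenizer is a poor fit when numbers or currency amounts must stay intact.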

In [49]:
from nltk.tokenize import wordpunct_tokenize

s = 'Good apple cost $3.88 in Hyderabad. Please buy two of them. Thanks.'
s

'Good apple cost $3.88 in Hyderabad. Please buy two of them. Thanks.'

In [50]:
print(wordpunct_tokenize(s))

['Good', 'apple', 'cost', '$', '3', '.', '88', 'in', 'Hyderabad', '.', 'Please', 'buy', 'two', 'of', 'them', '.', 'Thanks', '.']


In [51]:
print(len(wordpunct_tokenize(s)))

18


In [52]:

w_p = wordpunct_tokenize(AI)
print(w_p)
print(len(w_p))

['Artificial', 'Intelligence', 'refers', 'to', 'the', 'intelligence', 'of', 'machines', '.', 'This', 'is', 'in', 'contrast', 'to', 'the', 'natural', 'intelligence', 'of', 'humans', 'and', 'animals', '.', 'With', 'Artificial', 'Intelligence', ',', 'machines', 'perform', 'functions', 'such', 'as', 'learning', ',', 'planning', ',', 'reasoning', 'and', 'problem', '-', 'solving', '.', 'Most', 'noteworthy', ',', 'Artificial', 'Intelligence', 'is', 'the', 'simulation', 'of', 'human', 'intelligence', 'by', 'machines', '.', 'It', 'is', 'probably', 'the', 'fastest', '-', 'growing', 'development', 'in', 'the', 'World', 'of', 'technology', 'and', 'innovation', '.', 'Furthermore', ',', 'many', 'experts', 'believe', 'AI', 'could', 'solve', 'major', 'challenges', 'and', 'crisis', 'situations', '.']
85


# Stemming

**Definition:** Reduces words to a root form (stem), usually by stripping suffixes. The stem need not be a dictionary word. Types:

- Porter Stemmer – Widely used and fairly conservative; English only.
- Lancaster Stemmer – More aggressive, sometimes over-stems words (for example, `student` becomes `stud`).
- Snowball Stemmer – A refinement of Porter (also known as Porter2 for English) that supports multiple languages.
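
To see how the three stemmers differ on the same input, they can be run side by side. This is a small sketch; the names `stemmers`, `sample_words`, and `comparison` are chosen here for illustration.

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

# The three stemmers named above; none of them needs extra NLTK data downloads.
stemmers = {
    'porter': PorterStemmer(),
    'lancaster': LancasterStemmer(),
    'snowball': SnowballStemmer('english'),
}

sample_words = ['giving', 'loving', 'maximum', 'student']

# Collect each word's stem under every stemmer: {word: {stemmer_name: stem}}
comparison = {
    word: {name: st.stem(word) for name, st in stemmers.items()}
    for word in sample_words
}

# Print one row per word: word, Porter stem, Lancaster stem, Snowball stem
for word, stems in comparison.items():
    print(f"{word:10} {stems['porter']:10} {stems['lancaster']:10} {stems['snowball']:10}")
```

Putting the results in one table makes the Lancaster stemmer's more aggressive truncation easy to spot next to the Porter and Snowball columns.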



In [53]:

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

pst = PorterStemmer()  # create the stemmer before the loop below calls pst.stem()

words_to_stem=['give','giving','given','gaved','thinking','loving','maximum','student','written']

for words in words_to_stem:
    print(words+ ' : ' +pst.stem(words))

give : give
giving : give
given : given
gaved : gave
thinking : think
loving : love
maximum : maximum
student : student
written : written


## PorterStemmer

In [54]:
from nltk.stem import PorterStemmer
pst = PorterStemmer()

In [55]:
pst.stem('affection')

'affect'

In [56]:
pst.stem('playing')

'play'

In [57]:
pst.stem('maximum')

'maximum'

## LancasterStemmer

In [58]:
from nltk.stem import LancasterStemmer
lst = LancasterStemmer()

for words in words_to_stem:
    print(words+ ' : ' +lst.stem(words))

give : giv
giving : giv
given : giv
gaved : gav
thinking : think
loving : lov
maximum : maxim
student : stud
written : writ


In [59]:
from nltk.stem import SnowballStemmer
sbst = SnowballStemmer('english')  # separate name so the Lancaster stemmer in `lst` is not overwritten

for words in words_to_stem:
    print(words+ ' : ' +sbst.stem(words))

give : give
giving : give
given : given
gaved : gave
thinking : think
loving : love
maximum : maximum
student : student
written : written


## SnowballStemmer

In [60]:
stemmer = SnowballStemmer("german")     # choose a language
stemmer.stem("Autobahnen")              # stem a word

'autobahn'

# Lemmatization

**Definition:** Reduces words to their dictionary (lemma) form, considering meaning and grammar.

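- More accurate than stemming because it uses a vocabulary and morphological analysis.
- `WordNetLemmatizer.lemmatize` treats its input as a noun unless a `pos` argument is given, which is why the verb forms in the output below come back unchanged.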

In [61]:
# WordNetLemmatizer needs the WordNet corpus; run nltk.download('wordnet') once if it is missing.
from nltk.stem import WordNetLemmatizer
word_lem = WordNetLemmatizer()

In [62]:
words_to_stem

['give',
 'giving',
 'given',
 'gaved',
 'thinking',
 'loving',
 'maximum',
 'student',
 'written']

In [63]:
for words in words_to_stem:
    print(words+ ' : ' +word_lem.lemmatize(words))

give : give
giving : giving
given : given
gaved : gaved
thinking : thinking
loving : loving
maximum : maximum
student : student
written : written


# Stopwords

**Definition:** Commonly used words (e.g., "the", "is", "in") that are often removed in NLP tasks.

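- Removing stopwords helps focus on meaningful content.
- It also shrinks the vocabulary, which can reduce noise in downstream tasks such as search or text classification; it does not change how the text is tokenized.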
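
Stopword removal itself is a simple filter over the token list. The sketch below uses a tiny hand-picked set, `SMALL_STOPWORDS`, and a helper, `remove_stopwords`, both invented for this illustration; neither is part of NLTK.

```python
# Minimal sketch of stopword removal.
# SMALL_STOPWORDS is a hand-picked illustrative subset, not NLTK's full list.
SMALL_STOPWORDS = {'the', 'is', 'in', 'of', 'to', 'and', 'a'}

def remove_stopwords(tokens, stopword_set):
    """Return the tokens whose lowercase form is not in stopword_set."""
    return [t for t in tokens if t.lower() not in stopword_set]

tokens = ['Artificial', 'Intelligence', 'refers', 'to', 'the',
          'intelligence', 'of', 'machines']
filtered = remove_stopwords(tokens, SMALL_STOPWORDS)

print(filtered)
# ['Artificial', 'Intelligence', 'refers', 'intelligence', 'machines']
```

Passing `set(stopwords.words('english'))` as the second argument would apply NLTK's English list instead. A set is used because membership checks on it are much faster than on a list.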

In [64]:
from nltk.corpus import stopwords

In [65]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 ...]

In [66]:
len(stopwords.words('english'))

198

In [67]:
stopwords.words('french')

['au',
 'aux',
 'avec',
 'ce',
 'ces',
 'dans',
 'de',
 'des',
 'du',
 'elle',
 'en',
 'et',
 'eux',
 'il',
 'ils',
 'je',
 'la',
 'le',
 'les',
 'leur',
 'lui',
 'ma',
 'mais',
 'me',
 'même',
 'mes',
 'moi',
 'mon',
 'ne',
 'nos',
 'notre',
 'nous',
 'on',
 'ou',
 'par',
 'pas',
 'pour',
 'qu',
 'que',
 'qui',
 'sa',
 'se',
 'ses',
 'son',
 'sur',
 'ta',
 'te',
 'tes',
 'toi',
 'ton',
 'tu',
 'un',
 'une',
 'vos',
 'votre',
 'vous',
 'c',
 'd',
 'j',
 'l',
 'à',
 'm',
 'n',
 's',
 't',
 'y',
 'été',
 'étée',
 'étées',
 'étés',
 'étant',
 'étante',
 'étants',
 'étantes',
 'suis',
 'es',
 'est',
 'sommes',
 'êtes',
 'sont',
 'serai',
 'seras',
 'sera',
 'serons',
 'serez',
 'seront',
 'serais',
 'serait',
 'serions',
 'seriez',
 'seraient',
 'étais',
 'était',
 'étions',
 'étiez',
 'étaient',
 'fus',
 'fut',
 'fûmes',
 'fûtes',
 'furent',
 'sois',
 'soit',
 'soyons',
 'soyez',
 'soient',
 'fusse',
 'fusses',
 'fût',
 'fussions',
 'fussiez',
 'fussent',
 'ayant',
 'ayante',
 'ayantes',
 ...]


In [68]:
len(stopwords.words('french'))

157

In [69]:
stopwords.words('german')

['aber',
 'alle',
 'allem',
 'allen',
 'aller',
 'alles',
 'als',
 'also',
 'am',
 'an',
 'ander',
 'andere',
 'anderem',
 'anderen',
 'anderer',
 'anderes',
 'anderm',
 'andern',
 'anderr',
 'anders',
 'auch',
 'auf',
 'aus',
 'bei',
 'bin',
 'bis',
 'bist',
 'da',
 'damit',
 'dann',
 'der',
 'den',
 'des',
 'dem',
 'die',
 'das',
 'dass',
 'daß',
 'derselbe',
 'derselben',
 'denselben',
 'desselben',
 'demselben',
 'dieselbe',
 'dieselben',
 'dasselbe',
 'dazu',
 'dein',
 'deine',
 'deinem',
 'deinen',
 'deiner',
 'deines',
 'denn',
 'derer',
 'dessen',
 'dich',
 'dir',
 'du',
 'dies',
 'diese',
 'diesem',
 'diesen',
 'dieser',
 'dieses',
 'doch',
 'dort',
 'durch',
 'ein',
 'eine',
 'einem',
 'einen',
 'einer',
 'eines',
 'einig',
 'einige',
 'einigem',
 'einigen',
 'einiger',
 'einiges',
 'einmal',
 'er',
 'ihn',
 'ihm',
 'es',
 'etwas',
 'euer',
 'eure',
 'eurem',
 'euren',
 'eurer',
 'eures',
 'für',
 'gegen',
 'gewesen',
 'hab',
 'habe',
 'haben',
 'hat',
 'hatte',
 'hatten',
 ...]

In [70]:
len(stopwords.words('german'))

232

In [71]:
stopwords.words('chinese')

['一',
 '一下',
 '一些',
 '一切',
 '一则',
 '一天',
 '一定',
 '一方面',
 '一旦',
 '一时',
 '一来',
 '一样',
 '一次',
 '一片',
 '一直',
 '一致',
 '一般',
 '一起',
 '一边',
 '一面',
 '万一',
 '上下',
 '上升',
 '上去',
 '上来',
 '上述',
 '上面',
 '下列',
 '下去',
 '下来',
 '下面',
 '不一',
 '不久',
 '不仅',
 '不会',
 '不但',
 '不光',
 '不单',
 '不变',
 '不只',
 '不可',
 '不同',
 '不够',
 '不如',
 '不得',
 '不怕',
 '不惟',
 '不成',
 '不拘',
 '不敢',
 '不断',
 '不是',
 '不比',
 '不然',
 '不特',
 '不独',
 '不管',
 '不能',
 '不要',
 '不论',
 '不足',
 '不过',
 '不问',
 '与',
 '与其',
 '与否',
 '与此同时',
 '专门',
 '且',
 '两者',
 '严格',
 '严重',
 '个',
 '个人',
 '个别',
 '中小',
 '中间',
 '丰富',
 '临',
 '为',
 '为主',
 '为了',
 '为什么',
 '为什麽',
 '为何',
 '为着',
 '主张',
 '主要',
 '举行',
 '乃',
 '乃至',
 '么',
 '之',
 '之一',
 '之前',
 '之后',
 '之後',
 '之所以',
 '之类',
 '乌乎',
 '乎',
 '乘',
 '也',
 '也好',
 '也是',
 '也罢',
 '了',
 '了解',
 '争取',
 '于',
 '于是',
 '于是乎',
 '云云',
 '互相',
 '产生',
 '人们',
 '人家',
 '什么',
 '什么样',
 '什麽',
 '今后',
 '今天',
 '今年',
 '今後',
 '仍然',
 '从',
 '从事',
 '从而',
 '他',
 '他人',
 '他们',
 '他的',
 '代替',
 '以',
 '以上',
 '以下',
 '以为',
 '以便',
 '以免',
 '以前',
 '以及',
 '以后',
 '以外',
 '以後',
 ...]

In [72]:
len(stopwords.words('chinese'))

841

In [73]:
stopwords.words('tamil')

['அங்கு',
 'அங்கே',
 'அடுத்த',
 'அதனால்',
 'அதன்',
 'அதற்கு',
 'அதிக',
 'அதில்',
 'அது',
 'அதே',
 'அதை',
 'அந்த',
 'அந்தக்',
 'அந்தப்',
 'அன்று',
 'அல்லது',
 'அவன்',
 'அவரது',
 'அவர்',
 'அவர்கள்',
 'அவள்',
 'அவை',
 'ஆகிய',
 'ஆகியோர்',
 'ஆகும்',
 'இங்கு',
 'இங்கே',
 'இடத்தில்',
 'இடம்',
 'இதனால்',
 'இதனை',
 'இதன்',
 'இதற்கு',
 'இதில்',
 'இது',
 'இதை',
 'இந்த',
 'இந்தக்',
 'இந்தத்',
 'இந்தப்',
 'இன்னும்',
 'இப்போது',
 'இரு',
 'இருக்கும்',
 'இருந்த',
 'இருந்தது',
 'இருந்து',
 'இவர்',
 'இவை',
 'உன்',
 'உள்ள',
 'உள்ளது',
 'உள்ளன',
 'எந்த',
 'என',
 'எனக்',
 'எனக்கு',
 'எனப்படும்',
 'எனவும்',
 'எனவே',
 'எனினும்',
 'எனும்',
 'என்',
 'என்ன',
 'என்னும்',
 'என்பது',
 'என்பதை',
 'என்ற',
 'என்று',
 'என்றும்',
 'எல்லாம்',
 'ஏன்',
 'ஒரு',
 'ஒரே',
 'ஓர்',
 'கொண்ட',
 'கொண்டு',
 'கொள்ள',
 'சற்று',
 'சிறு',
 'சில',
 'சேர்ந்த',
 'தனது',
 'தன்',
 'தவிர',
 'தான்',
 'நான்',
 'நாம்',
 'நீ',
 'பற்றி',
 'பற்றிய',
 'பல',
 'பலரும்',
 'பல்வேறு',
 'பின்',
 'பின்னர்',
 'பிற',
 'பிறகு',
 'பெரும்',
 'பேர்',
 'போது',
 ...]

In [74]:
len(stopwords.words('tamil'))

125

In [75]:
stopwords.words('bengali')

['অতএব',
 'অথচ',
 'অথবা',
 'অনুযায়ী',
 'অনেক',
 'অনেকে',
 'অনেকেই',
 'অন্তত',
 'অন্য',
 'অবধি',
 'অবশ্য',
 'অর্থাত',
 'আই',
 'আগামী',
 'আগে',
 'আগেই',
 'আছে',
 'আজ',
 'আদ্যভাগে',
 'আপনার',
 'আপনি',
 'আবার',
 'আমরা',
 'আমাকে',
 'আমাদের',
 'আমার',
 'আমি',
 'আর',
 'আরও',
 'ই',
 'ইত্যাদি',
 'ইহা',
 'উচিত',
 'উত্তর',
 'উনি',
 'উপর',
 'উপরে',
 'এ',
 'এঁদের',
 'এঁরা',
 'এই',
 'একই',
 'একটি',
 'একবার',
 'একে',
 'এক্',
 'এখন',
 'এখনও',
 'এখানে',
 'এখানেই',
 'এটা',
 'এটাই',
 'এটি',
 'এত',
 'এতটাই',
 'এতে',
 'এদের',
 'এব',
 'এবং',
 'এবার',
 'এমন',
 'এমনকী',
 'এমনি',
 'এর',
 'এরা',
 'এল',
 'এস',
 'এসে',
 'ঐ',
 'ও',
 'ওঁদের',
 'ওঁর',
 'ওঁরা',
 'ওই',
 'ওকে',
 'ওখানে',
 'ওদের',
 'ওর',
 'ওরা',
 'কখনও',
 'কত',
 'কবে',
 'কমনে',
 'কয়েক',
 'কয়েকটি',
 'করছে',
 'করছেন',
 'করতে',
 'করবে',
 'করবেন',
 'করলে',
 'করলেন',
 'করা',
 'করাই',
 'করায়',
 'করার',
 'করি',
 'করিতে',
 'করিয়া',
 'করিয়ে',
 'করে',
 'করেই',
 'করেছিলেন',
 'করেছে',
 'করেছেন',
 'করেন',
 'কাউকে',
 'কাছ',
 'কাছে',
 'কাজ',
 'কাজে',
 'কারও',
 ...]

In [76]:
len(stopwords.words('bengali'))

398

# POS Tagging
**Definition:** Assigns grammatical labels (noun, verb, adjective, etc.) to each token in a sentence.

In [77]:
sent = 'sam is a natural when it comes to drawing'
sent_tokens = word_tokenize(sent)
sent_tokens

['sam', 'is', 'a', 'natural', 'when', 'it', 'comes', 'to', 'drawing']

In [78]:
from nltk.tokenize import sent_tokenize

In [79]:
AI_sent = sent_tokenize(AI)
AI_sent

['Artificial Intelligence refers to the intelligence of machines.',
 'This is in contrast to the natural intelligence of\nhumans and animals.',
 'With Artificial Intelligence, machines perform functions such as learning, planning, reasoning and\nproblem-solving.',
 'Most noteworthy, Artificial Intelligence is the simulation of human intelligence by machines.',
 'It is probably the fastest-growing development in the World of technology and innovation.',
 'Furthermore, many experts believe\nAI could solve major challenges and crisis situations.']

In [80]:
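# Each token is tagged in isolation here, so the tagger sees no surrounding context.
# Passing the whole list, nltk.pos_tag(sent_tokens), lets it use neighbouring words.
for token in sent_tokens:
    print(nltk.pos_tag([token]))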

[('sam', 'NN')]
[('is', 'VBZ')]
[('a', 'DT')]
[('natural', 'JJ')]
[('when', 'WRB')]
[('it', 'PRP')]
[('comes', 'VBZ')]
[('to', 'TO')]
[('drawing', 'VBG')]
