In [1]:
text = """Tokenization is a fundamental concept in Natural Language Processing (NLP). It involves breaking down a piece of text, such as a paragraph or a document, into smaller units called tokens. These tokens can be individual words, subwords, or even characters. Tokenization is an essential step in NLP tasks as it provides the foundation for further analysis and processing. Let's consider an example to understand tokenization better. Imagine we have the following sentence: 'I love eating pizza.' When tokenized, this sentence might be represented as ['I', 'love', 'eating', 'pizza']. Here, each word is considered a separate token. However, the tokenization process can be more complex, especially for languages with compound words or morphological variations. Tokenization techniques can vary depending on the requirements of the task and the specific language being processed. For instance, some languages might employ subword tokenization, where words are broken down into smaller units. This can help capture morphological information and handle out-of-vocabulary words more effectively. In addition to breaking down text into tokens, tokenization also involves handling punctuation, special characters, and other textual elements. For example, the sentence 'Hello, world!' might be tokenized into ['Hello', ',', 'world', '!'], where the commas and exclamation mark are treated as separate tokens. Tokenization plays a crucial role in various NLP applications. For text classification tasks, tokens serve as input features, enabling the model to understand the semantic content of the text. In machine translation, tokens help align words and phrases across different languages. Sentiment analysis, named entity recognition, and information retrieval are other areas where tokenization proves valuable. There are several libraries and tools available that offer tokenization functionalities for different programming languages. 
Python-based libraries like NLTK, spaCy, and the Hugging Face Transformers library provide easy-to-use tokenization methods. These libraries often come with pre-trained models that can handle tokenization for multiple languages. To practice tokenization, you can start by selecting a library and exploring its documentation and examples. Try tokenizing different sentences and texts, and observe the resulting tokens. Experiment with different tokenization options and consider the impact on downstream NLP tasks. Remember that tokenization is just the first step in NLP pipelines, and subsequent processing steps like stemming, lemmatization, or stop word removal may be necessary depending on the task at hand. By practicing tokenization on various texts, you can gain a better understanding of how tokens are formed and how they contribute to NLP analysis. Happy tokenizing!"""

In [2]:
import nltk
import spacy
import string

<h2>Removing Punctuation</h2>

In [3]:
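# Build a translation table that deletes every ASCII punctuation character.
translator = str.maketrans('', '', string.punctuation)
text_without_punctuation = text.translate(translator)
# Lowercase so that "Tokenization" and "tokenization" count as the same token.
text_without_punctuation = text_without_punctuation.lower()

print(text_without_punctuation)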

tokenization is a fundamental concept in natural language processing nlp it involves breaking down a piece of text such as a paragraph or a document into smaller units called tokens these tokens can be individual words subwords or even characters tokenization is an essential step in nlp tasks as it provides the foundation for further analysis and processing lets consider an example to understand tokenization better imagine we have the following sentence i love eating pizza when tokenized this sentence might be represented as i love eating pizza here each word is considered a separate token however the tokenization process can be more complex especially for languages with compound words or morphological variations tokenization techniques can vary depending on the requirements of the task and the specific language being processed for instance some languages might employ subword tokenization where words are broken down into smaller units this can help capture morphological information and
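
Deleting punctuation outright fuses contracted and hyphenated words: "Let's" becomes "lets" in the output above, and "out-of-vocabulary" shows up as "outofvocabulary" in later cells. A minimal sketch of one alternative, assuming it is acceptable to split such words instead, replaces each punctuation character with a space. The helper name `strip_punctuation_to_spaces` and the `sample` string are illustrative and not part of the original notebook.

```python
import string

def strip_punctuation_to_spaces(raw: str) -> str:
    """Replace every ASCII punctuation character with a space, then lowercase."""
    # Map each punctuation character to a single space instead of deleting it.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # split() + join() collapses the runs of whitespace the replacement creates.
    return " ".join(raw.translate(table).lower().split())

sample = "This can help handle out-of-vocabulary words. Let's consider 'Hello, world!'"
cleaned = strip_punctuation_to_spaces(sample)
print(cleaned)
```

The trade-off is that contractions leave a stray fragment ("let s"), so which variant works better depends on the downstream task.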

<h1>Lemmatization</h1>

<h2>Lemmatization using NLTK library</h2>

In [4]:
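from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
# nltk.word_tokenize also needs the 'punkt' tokenizer data
# ('punkt_tab' in recent NLTK releases) if it is not already installed.

lemmatizer = WordNetLemmatizer()

tokens = nltk.word_tokenize(text_without_punctuation)

# lemmatize() treats every word as a noun unless a pos argument is given,
# which is why verbs stay unchanged and "as" becomes "a" in the output.
for word in tokens:
    print(word + " ---> " + lemmatizer.lemmatize(word))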

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\abhik\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


tokenization ---> tokenization
is ---> is
a ---> a
fundamental ---> fundamental
concept ---> concept
in ---> in
natural ---> natural
language ---> language
processing ---> processing
nlp ---> nlp
it ---> it
involves ---> involves
breaking ---> breaking
down ---> down
a ---> a
piece ---> piece
of ---> of
text ---> text
such ---> such
as ---> a
a ---> a
paragraph ---> paragraph
or ---> or
a ---> a
document ---> document
into ---> into
smaller ---> smaller
units ---> unit
called ---> called
tokens ---> token
these ---> these
tokens ---> token
can ---> can
be ---> be
individual ---> individual
words ---> word
subwords ---> subwords
or ---> or
even ---> even
characters ---> character
tokenization ---> tokenization
is ---> is
an ---> an
essential ---> essential
step ---> step
in ---> in
nlp ---> nlp
tasks ---> task
as ---> a
it ---> it
provides ---> provides
the ---> the
foundation ---> foundation
for ---> for
further ---> further
analysis ---> analysis
and ---> and
processing ---> processing

<h2>Lemmatization using spaCy library</h2>

In [5]:
# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp(text_without_punctuation)

for token in doc:
    print(token.text + " ---> " + token.lemma_)

tokenization ---> tokenization
is ---> be
a ---> a
fundamental ---> fundamental
concept ---> concept
in ---> in
natural ---> natural
language ---> language
processing ---> processing
nlp ---> nlp
it ---> it
involves ---> involve
breaking ---> break
down ---> down
a ---> a
piece ---> piece
of ---> of
text ---> text
such ---> such
as ---> as
a ---> a
paragraph ---> paragraph
or ---> or
a ---> a
document ---> document
into ---> into
smaller ---> small
units ---> unit
called ---> call
tokens ---> token
these ---> these
tokens ---> token
can ---> can
be ---> be
individual ---> individual
words ---> word
subwords ---> subword
or ---> or
even ---> even
characters ---> character
tokenization ---> tokenization
is ---> be
an ---> an
essential ---> essential
step ---> step
in ---> in
nlp ---> nlp
tasks ---> task
as ---> as
it ---> it
provides ---> provide
the ---> the
foundation ---> foundation
for ---> for
further ---> further
analysis ---> analysis
and ---> and
processing ---> processing
lets ---> let
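
The two outputs above disagree on several tokens. The sketch below copies a few (token, lemma) pairs from the printed outputs into plain dictionaries and lists the disagreements; the names `nltk_lemmas`, `spacy_lemmas`, and `lemma_differences` are illustrative. A likely explanation is that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, whereas spaCy lemmatizes using the part of speech it predicts for each token.

```python
# (token, lemma) pairs copied from the NLTK and spaCy outputs above.
nltk_lemmas = {"is": "is", "involves": "involves", "breaking": "breaking",
               "as": "a", "smaller": "smaller", "units": "unit"}
spacy_lemmas = {"is": "be", "involves": "involve", "breaking": "break",
                "as": "as", "smaller": "small", "units": "unit"}

def lemma_differences(first: dict, second: dict) -> dict:
    """Return {token: (lemma_in_first, lemma_in_second)} where the lemmas differ."""
    shared = first.keys() & second.keys()
    return {tok: (first[tok], second[tok])
            for tok in sorted(shared) if first[tok] != second[tok]}

differences = lemma_differences(nltk_lemmas, spacy_lemmas)
for tok, (nltk_lemma, spacy_lemma) in differences.items():
    print(f"{tok}: NLTK={nltk_lemma!r}, spaCy={spacy_lemma!r}")
```

Only the plural noun "units" is handled identically in this sample; the verbs, the comparative adjective, and "as" all differ.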

<h2>Lemmatization using NLTK and POS tagging</h2>

In [9]:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('averaged_perceptron_tagger')

lemmatizer = WordNetLemmatizer()

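def pos_tagger(nltk_tag):
    """Map a Penn Treebank tag from nltk.pos_tag to a WordNet POS constant."""
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        # No WordNet equivalent (determiners, prepositions, and so on).
        return None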

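# Tag each token with its Penn Treebank part of speech.
pos_tagged = nltk.pos_tag(nltk.word_tokenize(text_without_punctuation))

# Convert the tags to WordNet constants (None where no mapping exists).
wordnet_tagged = [(word, pos_tagger(tag)) for word, tag in pos_tagged]

lemmatized_sentence = []

for word, tag in wordnet_tagged:
    if tag is None:
        # Leave tokens without a WordNet part of speech unchanged.
        lemmatized_sentence.append(word)
    else:
        lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))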

lemmatized_sentence = " ".join(lemmatized_sentence)

print(lemmatized_sentence)

tokenization be a fundamental concept in natural language processing nlp it involve break down a piece of text such as a paragraph or a document into small unit call token these token can be individual word subwords or even character tokenization be an essential step in nlp task as it provide the foundation for further analysis and processing let consider an example to understand tokenization well imagine we have the following sentence i love eat pizza when tokenized this sentence might be represent as i love eat pizza here each word be consider a separate token however the tokenization process can be more complex especially for language with compound word or morphological variation tokenization technique can vary depend on the requirement of the task and the specific language be process for instance some language might employ subword tokenization where word be break down into small unit this can help capture morphological information and handle outofvocabulary word more effectively in

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\abhik\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
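
The tag conversion in the cell above depends only on the first letter of each Penn Treebank tag, so it can also be written as a dictionary lookup. The following is a minimal sketch of that variant, not a replacement for the notebook's `pos_tagger`. It writes out the one-letter codes `'a'`, `'v'`, `'n'`, `'r'` (the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV`) so that it runs without NLTK; `PENN_PREFIX_TO_WORDNET` and `penn_to_wordnet` are illustrative names.

```python
from typing import Optional

# First letter of a Penn Treebank tag -> WordNet part-of-speech code.
# The codes mirror wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV.
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(penn_tag: str) -> Optional[str]:
    """Return the WordNet POS code for a Penn Treebank tag, or None if unmapped."""
    # Slicing with [:1] keeps an empty tag from raising an IndexError.
    return PENN_PREFIX_TO_WORDNET.get(penn_tag[:1])

for tag in ["JJR", "VBZ", "NNS", "RB", "IN", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Tags with no mapping, such as `IN` (preposition) and `DT` (determiner), return `None`, which corresponds to the branch that leaves the word unchanged.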


<h1>Stemming</h1>

<h2>Stemming using NLTK library</h2>

In [12]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text_without_punctuation)

stemmer = PorterStemmer()

# Reduce each token to its Porter stem.
stemmed_words = [stemmer.stem(token) for token in tokens]

stemmed_text = ' '.join(stemmed_words)

print(stemmed_text)

token is a fundament concept in natur languag process nlp it involv break down a piec of text such as a paragraph or a document into smaller unit call token these token can be individu word subword or even charact token is an essenti step in nlp task as it provid the foundat for further analysi and process let consid an exampl to understand token better imagin we have the follow sentenc i love eat pizza when token thi sentenc might be repres as i love eat pizza here each word is consid a separ token howev the token process can be more complex especi for languag with compound word or morpholog variat token techniqu can vari depend on the requir of the task and the specif languag be process for instanc some languag might employ subword token where word are broken down into smaller unit thi can help captur morpholog inform and handl outofvocabulari word more effect in addit to break down text into token token also involv handl punctuat special charact and other textual element for exampl 
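
Unlike the lemmatized outputs, the Porter stems above are often not dictionary words ("languag", "fundament"), and distinct words collapse to one stem. The sketch below groups a few (word, stem) pairs read off the original text and the stemmed output to make that conflation visible; `word_stem_pairs` and `group_by_stem` are illustrative names, and the pairs are a hand-picked sample rather than a full run of the stemmer.

```python
from collections import defaultdict

# (word, stem) pairs read off the original text and the Porter output above.
word_stem_pairs = [
    ("tokenization", "token"), ("tokens", "token"), ("tokenized", "token"),
    ("processing", "process"), ("processed", "process"), ("process", "process"),
    ("language", "languag"), ("languages", "languag"),
]

def group_by_stem(pairs):
    """Map each stem to the sorted list of distinct words that reduce to it."""
    groups = defaultdict(set)
    for word, stem in pairs:
        groups[stem].add(word)
    return {stem: sorted(words) for stem, words in groups.items()}

stem_groups = group_by_stem(word_stem_pairs)
for stem, words in stem_groups.items():
    print(f"{stem}: {', '.join(words)}")
```

Whether merging "tokenization" with "tokens" is desirable depends on the task: it helps recall in retrieval but discards distinctions a lemmatizer would keep.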

<h2>Stemming using spaCy library (lemmas as a stand-in)</h2>

In [13]:
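nlp = spacy.load("en_core_web_sm")

# spaCy does not ship a stemmer, so this extension returns the token's lemma
# as a stand-in; the output is lemmatized text, not Porter-style stems.
def stem_token(token):
    return token.lemma_

# force=True lets the cell be re-run without an "extension already exists" error.
spacy.tokens.Token.set_extension("stem", getter=stem_token, force=True)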

doc = nlp(text_without_punctuation)

stemmed_words = [token._.stem for token in doc]

stemmed_text = ' '.join(stemmed_words)

print(stemmed_text)

tokenization be a fundamental concept in natural language processing nlp it involve break down a piece of text such as a paragraph or a document into small unit call token these token can be individual word subword or even character tokenization be an essential step in nlp task as it provide the foundation for further analysis and processing let consider an example to understand tokenization well imagine we have the following sentence I love eat pizza when tokenize this sentence might be represent as I love eat pizza here each word be consider a separate token however the tokenization process can be more complex especially for language with compound word or morphological variation tokenization technique can vary depend on the requirement of the task and the specific language be process for instance some language might employ subword tokenization where word be break down into small unit this can help capture morphological information and handle outofvocabulary word more effectively in a