<h1>Tokenization</h1>

In [1]:
import string

In [23]:
text = """Tokenization is a fundamental concept in Natural Language Processing (NLP). It involves breaking down a piece of text, such as a paragraph or a document, into smaller units called tokens. These tokens can be individual words, subwords, or even characters. Tokenization is an essential step in NLP tasks as it provides the foundation for further analysis and processing. Let's consider an example to understand tokenization better. Imagine we have the following sentence: 'I love eating pizza.' When tokenized, this sentence might be represented as ['I', 'love', 'eating', 'pizza']. Here, each word is considered a separate token. However, the tokenization process can be more complex, especially for languages with compound words or morphological variations. Tokenization techniques can vary depending on the requirements of the task and the specific language being processed. For instance, some languages might employ subword tokenization, where words are broken down into smaller units. This can help capture morphological information and handle out-of-vocabulary words more effectively. In addition to breaking down text into tokens, tokenization also involves handling punctuation, special characters, and other textual elements. For example, the sentence 'Hello, world!' might be tokenized into ['Hello', ',', 'world', '!'], where the commas and exclamation mark are treated as separate tokens. Tokenization plays a crucial role in various NLP applications. For text classification tasks, tokens serve as input features, enabling the model to understand the semantic content of the text. In machine translation, tokens help align words and phrases across different languages. Sentiment analysis, named entity recognition, and information retrieval are other areas where tokenization proves valuable. There are several libraries and tools available that offer tokenization functionalities for different programming languages. 
Python-based libraries like NLTK, spaCy, and the Hugging Face Transformers library provide easy-to-use tokenization methods. These libraries often come with pre-trained models that can handle tokenization for multiple languages. To practice tokenization, you can start by selecting a library and exploring its documentation and examples. Try tokenizing different sentences and texts, and observe the resulting tokens. Experiment with different tokenization options and consider the impact on downstream NLP tasks. Remember that tokenization is just the first step in NLP pipelines, and subsequent processing steps like stemming, lemmatization, or stop word removal may be necessary depending on the task at hand. By practicing tokenization on various texts, you can gain a better understanding of how tokens are formed and how they contribute to NLP analysis. Happy tokenizing!"""
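
The sample text describes 'Hello, world!' being split into ['Hello', ',', 'world', '!']. As a minimal sketch of that behavior, the cell below uses a regular expression that treats each run of word characters and each punctuation mark as its own token. The helper name `simple_tokenize` and the pattern are illustrative assumptions, not part of NLTK or spaCy, and library tokenizers handle many more cases (contractions, URLs, numbers).

```python
import re

def simple_tokenize(sentence):
    """Split a sentence into word and punctuation tokens (illustrative helper)."""
    # \w+ matches a run of letters, digits, or underscores; [^\w\s] matches a
    # single character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(simple_tokenize("Hello, world!"))         # ['Hello', ',', 'world', '!']
print(simple_tokenize("I love eating pizza."))  # ['I', 'love', 'eating', 'pizza', '.']
```

Unlike the example in the sample text, this sketch keeps the final period of 'I love eating pizza.' as a token; whether to keep or drop punctuation depends on the task.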

<h2>Removing Punctuation</h2>

In [24]:
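# Build a translation table that deletes every character in string.punctuation
translator = str.maketrans('', '', string.punctuation)
text_without_punctuation = text.translate(translator)
# Lowercase the text so that "Tokenization" and "tokenization" become the same token
text_without_punctuation = text_without_punctuation.lower()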

print(text_without_punctuation)

tokenization is a fundamental concept in natural language processing nlp it involves breaking down a piece of text such as a paragraph or a document into smaller units called tokens these tokens can be individual words subwords or even characters tokenization is an essential step in nlp tasks as it provides the foundation for further analysis and processing lets consider an example to understand tokenization better imagine we have the following sentence i love eating pizza when tokenized this sentence might be represented as i love eating pizza here each word is considered a separate token however the tokenization process can be more complex especially for languages with compound words or morphological variations tokenization techniques can vary depending on the requirements of the task and the specific language being processed for instance some languages might employ subword tokenization where words are broken down into smaller units this can help capture morphological information and
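
Note that "Let's" becomes "lets" in the output above, because the apostrophe is part of `string.punctuation`. The sketch below repeats the same two steps on a short string and then splits on whitespace, which is the simplest possible word tokenizer once punctuation is gone. The helper name `normalize` and the sample string are hypothetical, chosen only for illustration.

```python
import string

def normalize(raw_text):
    """Delete ASCII punctuation and lowercase the text (hypothetical helper)."""
    table = str.maketrans('', '', string.punctuation)
    return raw_text.translate(table).lower()

sample = "Let's consider an example: 'Hello, world!'"
cleaned = normalize(sample)
print(cleaned)          # lets consider an example hello world
print(cleaned.split())  # ['lets', 'consider', 'an', 'example', 'hello', 'world']
```

Because `string.punctuation` covers ASCII punctuation only, characters such as curly quotes would pass through unchanged.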

In [25]:
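import nltk
# word_tokenize and sent_tokenize rely on NLTK's Punkt data; if it is missing, run nltk.download("punkt") ("punkt_tab" on newer NLTK releases).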

<h2>Sentence and word tokenization using the NLTK library</h2>

In [26]:
from nltk.tokenize import word_tokenize, sent_tokenize

# Keep punctuation for sentence tokenization: sentence boundaries are detected from marks such as '.', '!' and '?'.
sentences = sent_tokenize(text)

words = word_tokenize(text_without_punctuation)

print("NLTK Sentence Tokenization:")
print(sentences)
print("\nNLTK Word Tokenization:")
print(words)

NLTK Sentence Tokenization:
['Tokenization is a fundamental concept in Natural Language Processing (NLP).', 'It involves breaking down a piece of text, such as a paragraph or a document, into smaller units called tokens.', 'These tokens can be individual words, subwords, or even characters.', 'Tokenization is an essential step in NLP tasks as it provides the foundation for further analysis and processing.', "Let's consider an example to understand tokenization better.", "Imagine we have the following sentence: 'I love eating pizza.'", "When tokenized, this sentence might be represented as ['I', 'love', 'eating', 'pizza'].", 'Here, each word is considered a separate token.', 'However, the tokenization process can be more complex, especially for languages with compound words or morphological variations.', 'Tokenization techniques can vary depending on the requirements of the task and the specific language being processed.', 'For instance, some languages might employ subword tokenization,
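
To see why the sentence tokenizer is given the original text rather than the punctuation-free version, consider a deliberately naive splitter that breaks after '.', '!' or '?' when whitespace follows. This is a sketch for illustration only and is not how NLTK works internally; the name `naive_sent_split` and both sample strings are assumptions made for this example.

```python
import re

def naive_sent_split(raw_text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    return re.split(r"(?<=[.!?])\s+", raw_text.strip())

with_punct = "Tokenization is a fundamental concept. It involves breaking down text. Happy tokenizing!"
without_punct = "tokenization is a fundamental concept it involves breaking down text happy tokenizing"

print(naive_sent_split(with_punct))
print(len(naive_sent_split(with_punct)))     # 3
print(len(naive_sent_split(without_punct)))  # 1
```

Once the punctuation is stripped, a splitter that keys on terminal punctuation has nothing to work with and returns the whole text as one sentence. A rule this simple also splits wrongly after abbreviations such as "Dr.", which is one reason to prefer a library sentence tokenizer.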

<h2>Sentence and word tokenization using the spaCy library</h2>

In [27]:
# This example loads en_core_web_sm, a small English pipeline for spaCy.
# Calling nlp() on a string returns a spaCy Doc object.
# Word tokens come from iterating over the Doc; sentences come from its .sents attribute.

import spacy

nlp = spacy.load("en_core_web_sm")

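# Process both versions of the text: without punctuation for word tokens,
# and the original (with punctuation) for sentence segmentation
doc = nlp(text_without_punctuation)
doc1 = nlp(text)

# Extract word tokens and sentence spans from the spaCy documents
tokens = [token.text for token in doc]
sentences = list(doc1.sents)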

print("spaCy Word Tokenization:")
print(tokens)
print("spaCy Sentence Tokenization:")
for sentence in sentences:
    print(sentence.text)

spaCy Word Tokenization:
['tokenization', 'is', 'a', 'fundamental', 'concept', 'in', 'natural', 'language', 'processing', 'nlp', 'it', 'involves', 'breaking', 'down', 'a', 'piece', 'of', 'text', 'such', 'as', 'a', 'paragraph', 'or', 'a', 'document', 'into', 'smaller', 'units', 'called', 'tokens', 'these', 'tokens', 'can', 'be', 'individual', 'words', 'subwords', 'or', 'even', 'characters', 'tokenization', 'is', 'an', 'essential', 'step', 'in', 'nlp', 'tasks', 'as', 'it', 'provides', 'the', 'foundation', 'for', 'further', 'analysis', 'and', 'processing', 'lets', 'consider', 'an', 'example', 'to', 'understand', 'tokenization', 'better', 'imagine', 'we', 'have', 'the', 'following', 'sentence', 'i', 'love', 'eating', 'pizza', 'when', 'tokenized', 'this', 'sentence', 'might', 'be', 'represented', 'as', 'i', 'love', 'eating', 'pizza', 'here', 'each', 'word', 'is', 'considered', 'a', 'separate', 'token', 'however', 'the', 'tokenization', 'process', 'can', 'be', 'more', 'complex', 'especially'