<h1>POS Tagging</h1>

In [3]:
text = """Tokenization is a fundamental concept in Natural Language Processing (NLP). It involves breaking down a piece of text, such as a paragraph or a document, into smaller units called tokens. These tokens can be individual words, subwords, or even characters. Tokenization is an essential step in NLP tasks as it provides the foundation for further analysis and processing. Let's consider an example to understand tokenization better. Imagine we have the following sentence: 'I love eating pizza.' When tokenized, this sentence might be represented as ['I', 'love', 'eating', 'pizza']. Here, each word is considered a separate token. However, the tokenization process can be more complex, especially for languages with compound words or morphological variations. Tokenization techniques can vary depending on the requirements of the task and the specific language being processed. For instance, some languages might employ subword tokenization, where words are broken down into smaller units. This can help capture morphological information and handle out-of-vocabulary words more effectively. In addition to breaking down text into tokens, tokenization also involves handling punctuation, special characters, and other textual elements. For example, the sentence 'Hello, world!' might be tokenized into ['Hello', ',', 'world', '!'], where the commas and exclamation mark are treated as separate tokens. Tokenization plays a crucial role in various NLP applications. For text classification tasks, tokens serve as input features, enabling the model to understand the semantic content of the text. In machine translation, tokens help align words and phrases across different languages. Sentiment analysis, named entity recognition, and information retrieval are other areas where tokenization proves valuable. There are several libraries and tools available that offer tokenization functionalities for different programming languages. 
Python-based libraries like NLTK, spaCy, and the Hugging Face Transformers library provide easy-to-use tokenization methods. These libraries often come with pre-trained models that can handle tokenization for multiple languages. To practice tokenization, you can start by selecting a library and exploring its documentation and examples. Try tokenizing different sentences and texts, and observe the resulting tokens. Experiment with different tokenization options and consider the impact on downstream NLP tasks. Remember that tokenization is just the first step in NLP pipelines, and subsequent processing steps like stemming, lemmatization, or stop word removal may be necessary depending on the task at hand. By practicing tokenization on various texts, you can gain a better understanding of how tokens are formed and how they contribute to NLP analysis. Happy tokenizing!"""

<h2>Remove Punctuation</h2>

In [7]:
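import string

# Build a translation table that deletes every ASCII punctuation character
translator = str.maketrans('', '', string.punctuation)
text_without_punctuation = text.translate(translator)
# Lowercase so that "Tokenization" and "tokenization" become the same token
text_without_punctuation = text_without_punctuation.lower()

print(text_without_punctuation)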

tokenization is a fundamental concept in natural language processing nlp it involves breaking down a piece of text such as a paragraph or a document into smaller units called tokens these tokens can be individual words subwords or even characters tokenization is an essential step in nlp tasks as it provides the foundation for further analysis and processing lets consider an example to understand tokenization better imagine we have the following sentence i love eating pizza when tokenized this sentence might be represented as i love eating pizza here each word is considered a separate token however the tokenization process can be more complex especially for languages with compound words or morphological variations tokenization techniques can vary depending on the requirements of the task and the specific language being processed for instance some languages might employ subword tokenization where words are broken down into smaller units this can help capture morphological information and
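Deleting every character in `string.punctuation` has a side effect that is visible in the output above: the apostrophe in "Let's" is removed, leaving "lets", and a hyphenated word such as "out-of-vocabulary" is fused into one string. The following is a minimal sketch of one possible alternative, assuming you want to keep apostrophes and hyphens. The helper name `strip_punctuation_keep_inner` and its `keep` parameter are hypothetical and not part of the original notebook.

```python
import string


def strip_punctuation_keep_inner(raw, keep="'-"):
    """Delete ASCII punctuation except the characters in `keep`, then lowercase.

    Hypothetical helper for illustration; `keep` defaults to apostrophe and hyphen.
    """
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    return raw.translate(str.maketrans("", "", to_remove)).lower()


sample = "Let's handle out-of-vocabulary words, shall we?"

# Approach used in the cell above: every punctuation character is deleted
print(sample.translate(str.maketrans("", "", string.punctuation)).lower())

# Variant that preserves apostrophes and hyphens
print(strip_punctuation_keep_inner(sample))
```

Note that this variant also keeps apostrophes that act as quotation marks, so it is a trade-off rather than a strict improvement.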

<h2>POS tagging using NLTK</h2>

In [15]:
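import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# The NLTK tokenizer and tagger resources must be downloaded beforehand
# with nltk.download(); the exact resource names depend on the NLTK version.

# Split the cleaned text into word tokens
tokens = word_tokenize(text_without_punctuation)

# Assign a Penn Treebank tag to each token
tagged_tokens = pos_tag(tokens)

print(tagged_tokens)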

[('tokenization', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('fundamental', 'JJ'), ('concept', 'NN'), ('in', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('nlp', 'IN'), ('it', 'PRP'), ('involves', 'VBZ'), ('breaking', 'VBG'), ('down', 'RP'), ('a', 'DT'), ('piece', 'NN'), ('of', 'IN'), ('text', 'NN'), ('such', 'JJ'), ('as', 'IN'), ('a', 'DT'), ('paragraph', 'NN'), ('or', 'CC'), ('a', 'DT'), ('document', 'NN'), ('into', 'IN'), ('smaller', 'JJR'), ('units', 'NNS'), ('called', 'VBD'), ('tokens', 'NNS'), ('these', 'DT'), ('tokens', 'NNS'), ('can', 'MD'), ('be', 'VB'), ('individual', 'JJ'), ('words', 'NNS'), ('subwords', 'NNS'), ('or', 'CC'), ('even', 'RB'), ('characters', 'NNS'), ('tokenization', 'NN'), ('is', 'VBZ'), ('an', 'DT'), ('essential', 'JJ'), ('step', 'NN'), ('in', 'IN'), ('nlp', 'JJ'), ('tasks', 'NNS'), ('as', 'IN'), ('it', 'PRP'), ('provides', 'VBZ'), ('the', 'DT'), ('foundation', 'NN'), ('for', 'IN'), ('further', 'JJ'), ('analysis', 'NN'), ('and', 'CC'), ('proce
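A quick way to inspect a tagged list is to count how often each tag occurs. The sketch below hard-codes the first ten pairs from the output above so that it runs without NLTK installed; the names `sample_tagged` and `tag_counts` are illustrative only.

```python
from collections import Counter

# First ten (token, tag) pairs copied from the NLTK output above
sample_tagged = [
    ("tokenization", "NN"), ("is", "VBZ"), ("a", "DT"), ("fundamental", "JJ"),
    ("concept", "NN"), ("in", "IN"), ("natural", "JJ"), ("language", "NN"),
    ("processing", "NN"), ("nlp", "IN"),
]

# Count occurrences of each Penn Treebank tag
tag_counts = Counter(tag for _, tag in sample_tagged)
print(tag_counts.most_common())

# List the tokens that received a given tag, here IN (preposition)
print([word for word, tag in sample_tagged if tag == "IN"])
```

Listing tokens per tag makes questionable assignments easy to spot: in this sample "nlp" is tagged `IN`, which may be a consequence of lowercasing the acronym and removing its surrounding parentheses.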

<h2>POS tagging using NLTK and ML</h2>
<h3>If you have a labeled training dataset, you can train a custom POS tagger with a machine learning approach. Here's a simple example using NLTK's Hidden Markov Model (HMM) tagger, trained on a single hand-labeled sentence:</h3>

In [18]:
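import nltk

# One hand-labeled sentence: a list of (word, tag) pairs
training_data = [
    ("This", "DT"), ("is", "VBZ"), ("an", "DT"), ("example", "NN"), ("sentence", "NN"),
    ("for", "IN"), ("tagging", "NN"), (".", ".")
]

# train() expects a list of tagged sentences, hence the extra list around training_data
tagger = nltk.tag.hmm.HiddenMarkovModelTagger.train([training_data])

# Tag a new, already tokenized sentence
test_sentence = ["This", "is", "a", "test", "sentence", "."]
tagged_sentence = tagger.tag(test_sentence)
print(tagged_sentence)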

[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN'), ('sentence', 'NN'), ('.', '.')]
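To make the HMM idea more concrete, here is a rough from-scratch sketch of what such a tagger estimates and how it decodes: tag-to-tag transition counts, tag-to-word emission counts, and Viterbi search for the best tag sequence. It is not NLTK's implementation. The add-one smoothing, the `<s>` start symbol, and the names `log_p` and `viterbi` are assumptions made for this illustration.

```python
import math
from collections import Counter, defaultdict

# Same single training sentence as in the cell above
train = [[("This", "DT"), ("is", "VBZ"), ("an", "DT"), ("example", "NN"),
          ("sentence", "NN"), ("for", "IN"), ("tagging", "NN"), (".", ".")]]

tags = sorted({tag for sent in train for _, tag in sent})
vocab = {word for sent in train for word, _ in sent}

trans = defaultdict(Counter)  # trans[previous_tag][tag] -> count
emit = defaultdict(Counter)   # emit[tag][word] -> count
for sent in train:
    prev = "<s>"              # assumed sentence-start symbol
    for word, tag in sent:
        trans[prev][tag] += 1
        emit[tag][word] += 1
        prev = tag


def log_p(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words):
    """Return the most probable (word, tag) sequence under the smoothed model."""
    n_words = len(vocab) + 1  # one extra outcome reserved for unknown words
    # Each column maps tag -> (best log score so far, best previous tag)
    columns = [{
        t: (log_p(trans["<s>"], t, len(tags)) + log_p(emit[t], words[0], n_words), None)
        for t in tags
    }]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (columns[-1][p][0] + log_p(trans[p], t, len(tags)), p) for p in tags
            )
            column[t] = (score + log_p(emit[t], word, n_words), prev)
        columns.append(column)
    # Follow the back-pointers from the best final tag
    tag = max(tags, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(zip(words, reversed(path)))


print(viterbi(["This", "is", "a", "test", "sentence", "."]))
```

The unseen words "a" and "test" receive tags purely from the transition pattern learned from the training sentence, which is why a realistic tagger needs far more than one labeled sentence.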


<h2>POS tagging using spaCy</h2>

In [21]:
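import spacy

# Load the small English pipeline; the model must be installed separately
nlp = spacy.load('en_core_web_sm')

def pos_tagging(text):
    """Return (token, coarse-grained POS tag) pairs for the given text."""
    doc = nlp(text)
    tagged_words = [(token.text, token.pos_) for token in doc]
    return tagged_words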

tagged_sentence = pos_tagging(text_without_punctuation)
print(tagged_sentence)

[('tokenization', 'NOUN'), ('is', 'AUX'), ('a', 'DET'), ('fundamental', 'ADJ'), ('concept', 'NOUN'), ('in', 'ADP'), ('natural', 'ADJ'), ('language', 'NOUN'), ('processing', 'NOUN'), ('nlp', 'NOUN'), ('it', 'PRON'), ('involves', 'VERB'), ('breaking', 'VERB'), ('down', 'ADP'), ('a', 'DET'), ('piece', 'NOUN'), ('of', 'ADP'), ('text', 'NOUN'), ('such', 'ADJ'), ('as', 'ADP'), ('a', 'DET'), ('paragraph', 'NOUN'), ('or', 'CCONJ'), ('a', 'DET'), ('document', 'NOUN'), ('into', 'ADP'), ('smaller', 'ADJ'), ('units', 'NOUN'), ('called', 'VERB'), ('tokens', 'NOUN'), ('these', 'DET'), ('tokens', 'NOUN'), ('can', 'AUX'), ('be', 'AUX'), ('individual', 'ADJ'), ('words', 'NOUN'), ('subwords', 'NOUN'), ('or', 'CCONJ'), ('even', 'ADV'), ('characters', 'NOUN'), ('tokenization', 'NOUN'), ('is', 'AUX'), ('an', 'DET'), ('essential', 'ADJ'), ('step', 'NOUN'), ('in', 'ADP'), ('nlp', 'ADJ'), ('tasks', 'NOUN'), ('as', 'SCONJ'), ('it', 'PRON'), ('provides', 'VERB'), ('the', 'DET'), ('foundation', 'NOUN'), ('for', 
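The NLTK output uses Penn Treebank tags, while spaCy's `pos_` attribute uses the coarser Universal POS tags, so the two outputs cannot be compared directly. The sketch below maps one tagset onto the other and lists the tokens where the taggers disagree. It hard-codes the first ten pairs from each output above so that it runs without either library. The `PENN_TO_UPOS` table is a deliberately partial, simplified mapping written for this illustration (for example, it sends every `IN` to `ADP`), and it assumes both libraries produced the same tokens in the same order.

```python
# Partial, simplified Penn Treebank -> Universal POS mapping (assumption for this sketch)
PENN_TO_UPOS = {
    "NN": "NOUN", "NNS": "NOUN", "VBZ": "VERB",
    "DT": "DET", "JJ": "ADJ", "IN": "ADP",
}

# First ten pairs copied from the NLTK output above
nltk_tagged = [
    ("tokenization", "NN"), ("is", "VBZ"), ("a", "DT"), ("fundamental", "JJ"),
    ("concept", "NN"), ("in", "IN"), ("natural", "JJ"), ("language", "NN"),
    ("processing", "NN"), ("nlp", "IN"),
]

# First ten pairs copied from the spaCy output above
spacy_tagged = [
    ("tokenization", "NOUN"), ("is", "AUX"), ("a", "DET"), ("fundamental", "ADJ"),
    ("concept", "NOUN"), ("in", "ADP"), ("natural", "ADJ"), ("language", "NOUN"),
    ("processing", "NOUN"), ("nlp", "NOUN"),
]

# Keep the tokens whose mapped NLTK tag differs from the spaCy tag
disagreements = [
    (word, penn, upos)
    for (word, penn), (_, upos) in zip(nltk_tagged, spacy_tagged)
    if PENN_TO_UPOS.get(penn) != upos
]
print(disagreements)
```

On this sample the two mismatches are of different kinds. For "is", the difference comes from the tagsets themselves, since the Universal scheme has a separate `AUX` category that the Penn Treebank lacks. For "nlp", the taggers genuinely disagree.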