1. Tokenization and Text Cleaning
Tokenization is the process of splitting text into units (tokens) such as words, phrases, or sentences. It is the first step in most NLP pipelines, and every later step operates on its output. Text cleaning usually follows: punctuation, numbers, and other symbols are removed and case is normalized, so that later steps see a consistent set of word tokens.


In [1]:
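# !pip install nltk
import nltk
# nltk.download('punkt')  # tokenizer models, needed once; newer NLTK releases may ask for 'punkt_tab'
text = "Aftab is a student! He loves Natural Language Processing."
# Split the text into word and punctuation tokens
tokens = nltk.word_tokenize(text)
print(tokens)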

['Aftab', 'is', 'a', 'student', '!', 'He', 'loves', 'Natural', 'Language', 'Processing', '.']


In [2]:
# Clean the tokens: keep alphabetic ones only (drops punctuation and numbers) and lowercase them
cleaned_tokens = [token.lower() for token in tokens if token.isalpha()]
print(cleaned_tokens)

['aftab', 'is', 'a', 'student', 'he', 'loves', 'natural', 'language', 'processing']
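
To make the splitting step less of a black box, here is a minimal sketch of word and sentence tokenization using only regular expressions from the standard library. It is a simplified stand-in, not how `nltk.word_tokenize` works internally, and the names `sentences`, `regex_tokens`, and `regex_cleaned` are illustrative.

```python
import re

text = "Aftab is a student! He loves Natural Language Processing."

# Sentence split: break at whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word split: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

# Same cleaning rule as the cell above: keep alphabetic tokens, lowercased
regex_cleaned = [tok.lower() for tok in regex_tokens if tok.isalpha()]

print(sentences)
print(regex_tokens)
print(regex_cleaned)
```

For this sentence the regex tokens match the NLTK output above, but a pattern this simple would split contractions and abbreviations differently, which is one reason to prefer a trained tokenizer in practice.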


2. Stop Word Removal
Not all words contribute equally to the meaning of a sentence. Stop words like “the” or “and” are often filtered out to focus on more meaningful content.

In [4]:
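from nltk.corpus import stopwords
# nltk.download('stopwords')  # uncomment on first use
# Use a separate name so the imported stopwords module is not overwritten
stop_words = set(stopwords.words("english"))
# The stop list is lowercase and this match is case-sensitive, so 'He' is kept
filtered_tokens = [token for token in tokens if token not in stop_words]
print(filtered_tokens)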

['Aftab', 'student', '!', 'He', 'loves', 'Natural', 'Language', 'Processing', '.']
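
In the output above, 'He' survives the filter: NLTK's English stop list is lowercase, and the membership test is case-sensitive. A minimal sketch of a case-insensitive variant follows. The tiny stop list `demo_stop_words` is a hand-written assumption for illustration, not NLTK's list, and the result is stored under a new name (`filtered_ci`) so the later cells are unaffected.

```python
# Tokens from the tokenization step
tokens = ['Aftab', 'is', 'a', 'student', '!', 'He', 'loves',
          'Natural', 'Language', 'Processing', '.']

# Assumption: a tiny hand-written stop list, for illustration only
demo_stop_words = {"is", "a", "he", "the", "and"}

# Lowercase each token for the comparison, but keep its original casing
filtered_ci = [tok for tok in tokens if tok.lower() not in demo_stop_words]
print(filtered_ci)
```

Comparing on `tok.lower()` keeps the original capitalization in the output, which later steps such as POS tagging and NER can still use as a cue.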


3. Stemming and Lemmatizing
Stemming and lemmatization are both text normalization techniques used in Natural Language Processing (NLP) to reduce words to their base or root forms. While they share the goal of simplifying words, they operate differently in terms of the linguistic knowledge they apply.

3.a. Stemming: Reducing to Root Forms

Stemming strips affixes from a word (for the Porter stemmer used here, suffixes) to obtain its root form, known as the stem. The purpose is to treat related forms of a word as if they were the same term. Stemming is a rule-based method that doesn’t always produce a valid word (note “natur” and “languag” in the output below), but it’s computationally cheap. NLTK’s PorterStemmer also lowercases its input by default, which is why “He” comes out as “he”.

In [6]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(token) for token in filtered_tokens]
print(stemmed_words)

['aftab', 'student', '!', 'he', 'love', 'natur', 'languag', 'process', '.']
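
To show what "rule-based" means here, the following is a toy suffix stripper. It is a hypothetical four-rule sketch and far simpler than the Porter algorithm, so its stems differ from the output above; the names `SUFFIXES` and `toy_stem` are made up for this example.

```python
# Assumption: a hypothetical rule set, much smaller than Porter's
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return word

for w in ["loves", "Processing", "student", "studies"]:
    print(w, "->", toy_stem(w))
```

Even this small rule set yields non-words such as "lov" and "studi". The Porter stemmer adds many more rules and conditions, but the principle of applying string rules without consulting a dictionary is the same.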


3.b. Lemmatization: Transforming to Dictionary Form

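Lemmatization, on the other hand, reduces words to their dictionary forms, known as lemmas. It relies on a vocabulary and morphological analysis, and it works best when it is told each word’s part of speech. Lemmatization produces valid words and is more linguistically informed than stemming. Note that NLTK’s WordNetLemmatizer treats every word as a noun unless a part of speech is passed in, which is why “Processing” is left unchanged in the output below.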

In [8]:
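from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')  # uncomment on first use
lemmatizer = WordNetLemmatizer()
# No part of speech is passed, so every token is lemmatized as a noun
lemmatized_words = [lemmatizer.lemmatize(token) for token in filtered_tokens]
print(lemmatized_words)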

['Aftab', 'student', '!', 'He', 'love', 'Natural', 'Language', 'Processing', '.']
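
The part of speech matters for lemmatization, and a small sketch can show why. The lookup table below is a hypothetical stand-in for a real lexicon such as WordNet, and `toy_lemmatize` is an illustrative name. Its `pos` parameter defaults to noun, mirroring the default of `WordNetLemmatizer.lemmatize`.

```python
# Assumption: a hypothetical lookup table standing in for a real lexicon
LEMMA_TABLE = {
    ("loves", "v"): "love",
    ("loves", "n"): "love",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
    ("running", "v"): "run",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("better"))            # looked up as a noun: unchanged
print(toy_lemmatize("better", pos="a"))   # looked up as an adjective
print(toy_lemmatize("running"))           # looked up as a noun: unchanged
print(toy_lemmatize("running", pos="v"))  # looked up as a verb
```

The same word maps to different lemmas depending on the tag, so feeding POS tags into the lemmatizer generally gives better results than relying on the noun default.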


4. Part-of-Speech Tagging
Part-of-speech tagging (POS tagging) is a natural language processing task where the goal is to assign a grammatical category, such as noun, verb, or adjective, to each word in a given text. This provides a deeper understanding of the structure and function of each word in a sentence.
<a href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">The Penn Treebank POS Tag Set</a> is a widely used standard for representing these part-of-speech tags in English text.



In [17]:
from nltk import pos_tag
# nltk.download('averaged_perceptron_tagger')  # uncomment on a LookupError; the resource name may differ by NLTK version
pos_tags = pos_tag(filtered_tokens)
print(pos_tags)

[('Aftab', 'NNP'), ('student', 'NN'), ('!', '.'), ('He', 'PRP'), ('loves', 'VBZ'), ('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('.', '.')]
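
Penn Treebank tags are fine-grained (NN, NNP, VBZ, and so on), and it is often handy to collapse them into coarse categories. The sketch below does this for the tags printed above. The prefix map `COARSE` is a hand-picked assumption covering only the tags seen in this example, not a complete mapping.

```python
from collections import Counter

# Tagger output copied from the cell above
pos_tags = [('Aftab', 'NNP'), ('student', 'NN'), ('!', '.'), ('He', 'PRP'),
            ('loves', 'VBZ'), ('Natural', 'JJ'), ('Language', 'NNP'),
            ('Processing', 'NNP'), ('.', '.')]

# Assumption: hand-picked two-letter prefixes, covering only the tags seen here
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective", "PR": "pronoun"}

def coarse_category(tag):
    """Map a Penn Treebank tag to a coarse category by its first two letters."""
    return COARSE.get(tag[:2], "other")

counts = Counter(coarse_category(tag) for _, tag in pos_tags)
print(counts)
```

Matching on the first two letters works because related Penn tags share a prefix: NN, NNS, NNP, and NNPS are all nouns, and every verb tag starts with VB.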




5. Named Entity Recognition (NER)
NER identifies spans of text that refer to real-world entities and classifies them into types such as person, location, and organization. This is crucial for extracting structured information from unstructured text.

In [24]:
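from nltk import ne_chunk
# nltk.download('maxent_ne_chunker'); nltk.download('words')  # uncomment on a LookupError; names may differ by NLTK version
# ne_chunk expects POS-tagged tokens and returns a tree with labeled entity chunks
ner_tags = ne_chunk(pos_tags)
print(ner_tags)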

(S
  (GPE Aftab/NNP)
  student/NN
  !/.
  He/PRP
  loves/VBZ
  (ORGANIZATION Natural/JJ Language/NNP)
  Processing/NNP
  ./.)
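
To pull the entities out of this result as plain (label, text) pairs, one rough option is to parse the printed tree. The sketch below works on a string copy of the output above and uses only the standard library; the names `tree_str` and `entities` are illustrative. With the actual `ner_tags` object, the more usual route is to iterate over its subtrees and read each one's `label()` and `leaves()`.

```python
import re

# The printed tree from above, copied as a string
tree_str = """(S
  (GPE Aftab/NNP)
  student/NN
  !/.
  He/PRP
  loves/VBZ
  (ORGANIZATION Natural/JJ Language/NNP)
  Processing/NNP
  ./.)"""

entities = []
# Match innermost chunks of the form "(LABEL word/TAG word/TAG ...)"
for label, body in re.findall(r"\((\w+) ([^()]+)\)", tree_str):
    # Drop the "/TAG" part of each token and rejoin the words
    words = [tok.rsplit("/", 1)[0] for tok in body.split()]
    entities.append((label, " ".join(words)))

print(entities)
```

Both labels are wrong for this sentence: "Aftab" is a person rather than a geopolitical entity, and "Natural Language" is part of the name of a field, not an organization. The input here is short and has had its stop words removed, which likely gives the chunker less context to work with, so results on full sentences may differ.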
