In [34]:
pip install nltk







In [35]:
import nltk

### **Pre-Processing In NLP**

**1. Steps for Text Preprocessing**
- Tokenization
- Part-of-speech (POS) tagging
- Stop words removal
- Stemming
- Lemmatization
- Named Entity Recognition

### **1. Tokenization**
- Tokenization is the process of breaking text up into smaller parts called tokens.
- A token may be a word, part of a word, or a single character such as a punctuation mark.

### **White Space Tokenization**
- The simplest way to tokenize text is to use the whitespace within a string as the delimiter between words.
- This can be done with Python's `split` method, which is available on every string instance as well as on the built-in `str` class itself.

In [36]:
text="I was born in Ahmedabad in 2000"
text.split()

['I', 'was', 'born', 'in', 'Ahmedabad', 'in', '2000']

In [37]:
text="I was born in Ahmedabad in 2000 , and i am 24 years old"
text.split(",")

['I was born in Ahmedabad in 2000 ', ' and i am 24 years old']
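As a small illustration of the points above, the sketch below calls `split` on a string instance and through the `str` class, then contrasts the no-argument form with an explicit single-space delimiter. The string `sample` is made up for this example.

```python
# Hypothetical sample with a double space and a tab between words
sample = "I was  born in\tAhmedabad"

# No argument: split on any run of whitespace and drop empty strings
tokens_default = sample.split()

# The same method, called through the built-in str class
tokens_via_class = str.split(sample)

# Explicit " " delimiter: every single space is a boundary,
# so the double space yields an empty token and the tab is left alone
tokens_single_space = sample.split(" ")

print(tokens_default)
print(tokens_via_class)
print(tokens_single_space)
```

Calling `split()` with no argument is usually the safer choice for whitespace tokenization, since it does not produce empty tokens when words are separated by more than one space.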

### **NLTK Word Tokenize**

In [38]:
from nltk.tokenize import (sent_tokenize, word_tokenize, TreebankWordTokenizer,
                           wordpunct_tokenize, TweetTokenizer, MWETokenizer)

In [39]:
text="Natural Language processing is an amazing topic."

In [40]:
import nltk
nltk.download('punkt_tab')
word_tokenize(text)

[nltk_data] Error loading punkt_tab: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


['Natural', 'Language', 'processing', 'is', 'an', 'amazing', 'topic', '.']

In [41]:
sent_tokenize(text)

['Natural Language processing is an amazing topic.']

**Wordpunct_tokenize** splits a sentence into words on whitespace and punctuation, and keeps each punctuation mark as a separate token.

In [42]:
text="Hope is the only thing stronger than fear # HOPE # STRONG"
wordpunct_tokenize(text)

['Hope',
 'is',
 'the',
 'only',
 'thing',
 'stronger',
 'than',
 'fear',
 '#',
 'HOPE',
 '#',
 'STRONG']
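To make the splitting rule visible, the sketch below applies the pattern `\w+|[^\w\s]+` with the standard `re` module. To my understanding this is the pattern `wordpunct_tokenize` is built on, but treat the helper `wordpunct_like` as a hypothetical illustration rather than NLTK's own code.

```python
import re

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation and symbols)
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_like(text):
    """Return every non-overlapping match of the pattern, left to right."""
    return WORDPUNCT_PATTERN.findall(text)

print(wordpunct_like("Hope is stronger than fear # HOPE"))
print(wordpunct_like("It's 3.5 km, right?!"))
```

Note how the second call breaks `It's` and `3.5` apart: this pattern treats every apostrophe and period as punctuation, which is why other tokenizers add special rules for contractions and decimals.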

**Treebank word tokenizer** incorporates a variety of common rules for English word tokenization. It separates phrase-terminating punctuation such as `?!.;,` from adjacent tokens and retains decimal numbers as a single token. It also splits contractions, so `don't` becomes `do` and `n't`.

In [43]:
text="What you don't want to do with yourself don't do with others."
tokenizer=TreebankWordTokenizer()
tokenizer.tokenize(text)

['What',
 'you',
 'do',
 "n't",
 'want',
 'to',
 'do',
 'with',
 'yourself',
 'do',
 "n't",
 'do',
 'with',
 'others',
 '.']
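The rules described above can be approximated with a few regular expressions. The sketch below is a rough, simplified imitation written for illustration only: the function `treebank_like` is hypothetical, it covers just three rules (the `n't` clitic, the marks `?!;,`, and token-final periods), and it ignores the many other cases the real tokenizer handles.

```python
import re

def treebank_like(text):
    """Very rough imitation of three Treebank-style rules."""
    # Split the "n't" clitic off its verb: "don't" -> "do n't"
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text, flags=re.IGNORECASE)
    # Put spaces around ? ! ; , so they become separate tokens
    text = re.sub(r"([?!;,])", r" \1 ", text)
    # Detach a period only when it ends a token, so "3.5" stays intact
    text = re.sub(r"\.(?=\s|$)", " .", text)
    return text.split()

print(treebank_like("What you don't want."))
print(treebank_like("Don't pay 3.5 dollars, okay?"))
```

The ordering matters: the contraction rule runs first so that the later punctuation rules never see the apostrophe as something to split on.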

**Tweet tokenizer** is a rule-based tokenizer that NLTK provides for social media text such as tweets, where general-purpose tokenizers tend to break up hashtags and emojis.
- It keeps hashtags such as `#NLTK` as single tokens.
- It separates emojis into their own tokens, which is useful for tasks like sentiment analysis.

In [44]:
tweet="Just learned about #NLTK and its TweetTokenizer! 🥹 #NLP"
tokenizer=TweetTokenizer()
tokenizer.tokenize(tweet)

['Just',
 'learned',
 'about',
 '#NLTK',
 'and',
 'its',
 'TweetTokenizer',
 '!',
 '\U0001f979',
 '#NLP']
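The difference from a generic word/punctuation splitter can be sketched with two regular expressions. Both patterns and the handle `@some_user` are assumptions made for this illustration; the real `TweetTokenizer` uses a much larger rule set (URLs, emoticons, phone numbers, and so on).

```python
import re

# Generic pattern: '#' and '@' are treated like any other punctuation
GENERIC = re.compile(r"\w+|[^\w\s]+")

# Tweet-aware pattern: try "#tag" or "@handle" first, then plain words,
# then any single leftover symbol (punctuation or an emoji)
TWEET_AWARE = re.compile(r"[#@]\w+|\w+|[^\w\s]")

tweet = "Loving #NLP with @some_user! \U0001F600"

generic_tokens = GENERIC.findall(tweet)
tweet_tokens = TWEET_AWARE.findall(tweet)

print(generic_tokens)
print(tweet_tokens)
```

Because alternatives in a regular expression are tried left to right, placing `[#@]\w+` first is what keeps the hashtag and the mention in one piece.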

**MWE Tokenizer** can merge multi-word expressions into single tokens.

In [45]:
text="The natural language Toolkit is a great tool for NLP tasks."
MWE_TOKENIZER=MWETokenizer([('natural','language','Toolkit'),('NLP','tasks')])
tokens=MWE_TOKENIZER.tokenize(nltk.word_tokenize(text))
tokens

['The',
 'natural_language_Toolkit',
 'is',
 'a',
 'great',
 'tool',
 'for',
 'NLP_tasks',
 '.']
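The merging step itself is simple to sketch in plain Python. The function `merge_mwes` below is a hypothetical helper, not NLTK's implementation: it scans an already tokenized list and, wherever a listed expression starts, replaces it with one joined token, preferring longer expressions when several match.

```python
def merge_mwes(tokens, expressions, separator="_"):
    """Merge listed multi-word expressions into single tokens."""
    # Ignore empty expressions and try longer ones first
    ordered = sorted((tuple(e) for e in expressions if e), key=len, reverse=True)
    merged = []
    i = 0
    while i < len(tokens):
        for expr in ordered:
            if tuple(tokens[i:i + len(expr)]) == expr:
                merged.append(separator.join(expr))
                i += len(expr)          # skip past the whole expression
                break
        else:
            merged.append(tokens[i])    # no expression starts here
            i += 1
    return merged

tokens = ["The", "natural", "language", "Toolkit", "is", "great", "for", "NLP", "tasks", "."]
expressions = [("natural", "language", "Toolkit"), ("NLP", "tasks")]
print(merge_mwes(tokens, expressions))
```

Matching is exact and case-sensitive here, so `("natural", "language")` would not match `Natural Language`; lowercasing the tokens first is one way to relax that.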

**Regexp tokenizer** extracts the tokens that match a regular expression. Here the pattern `\d+$` matches a run of digits at the end of the text:

In [46]:
from nltk.tokenize import RegexpTokenizer
reg_tokenizer=RegexpTokenizer(pattern=r'\d+$')  # raw string so \d reaches the regex engine unchanged
tokens=reg_tokenizer.tokenize("My contact num is 8733942263")
tokens

['8733942263']
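The effect of the `$` anchor is easier to see with more than one number in the text. The sketch below uses the standard `re` module rather than `RegexpTokenizer`, and the sentence is made up for the example.

```python
import re

# Hypothetical sentence containing three separate digit runs
sentence = "Call 8733942263 before 5 pm or dial 100"

# Anchored with $: only a digit run at the very end of the text matches
digits_at_end = re.findall(r"\d+$", sentence)

# Without the anchor: every digit run is returned
all_digit_runs = re.findall(r"\d+", sentence)

print(digits_at_end)
print(all_digit_runs)
```

If the goal is to pull out every number in a text, the unanchored pattern `\d+` is the one to use.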

### **Bigrams, Trigrams & Ngrams**
- Bigrams: sequences of two consecutive tokens.
- Trigrams: sequences of three consecutive tokens.
- N-grams: sequences of any number n of consecutive tokens.

In [47]:
from nltk.util import ngrams, bigrams, trigrams

In [48]:
string="ML can be seen as a time-saving device that allows humans to explore their more creative ambitions while ML is in the background crunching numbers"
ml_tokens=nltk.word_tokenize(string)

**Bigrams**

In [49]:
ml_bigrams=list(nltk.bigrams(ml_tokens))
ml_bigrams

[('ML', 'can'),
 ('can', 'be'),
 ('be', 'seen'),
 ('seen', 'as'),
 ('as', 'a'),
 ('a', 'time-saving'),
 ('time-saving', 'device'),
 ('device', 'that'),
 ('that', 'allows'),
 ('allows', 'humans'),
 ('humans', 'to'),
 ('to', 'explore'),
 ('explore', 'their'),
 ('their', 'more'),
 ('more', 'creative'),
 ('creative', 'ambitions'),
 ('ambitions', 'while'),
 ('while', 'ML'),
 ('ML', 'is'),
 ('is', 'in'),
 ('in', 'the'),
 ('the', 'background'),
 ('background', 'crunching'),
 ('crunching', 'numbers')]

**Trigrams**

In [50]:
ml_trigrams=list(nltk.trigrams(ml_tokens))
ml_trigrams

[('ML', 'can', 'be'),
 ('can', 'be', 'seen'),
 ('be', 'seen', 'as'),
 ('seen', 'as', 'a'),
 ('as', 'a', 'time-saving'),
 ('a', 'time-saving', 'device'),
 ('time-saving', 'device', 'that'),
 ('device', 'that', 'allows'),
 ('that', 'allows', 'humans'),
 ('allows', 'humans', 'to'),
 ('humans', 'to', 'explore'),
 ('to', 'explore', 'their'),
 ('explore', 'their', 'more'),
 ('their', 'more', 'creative'),
 ('more', 'creative', 'ambitions'),
 ('creative', 'ambitions', 'while'),
 ('ambitions', 'while', 'ML'),
 ('while', 'ML', 'is'),
 ('ML', 'is', 'in'),
 ('is', 'in', 'the'),
 ('in', 'the', 'background'),
 ('the', 'background', 'crunching'),
 ('background', 'crunching', 'numbers')]
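Bigrams and trigrams are the n = 2 and n = 3 cases of the same sliding window. A minimal sketch of the general case in plain Python is shown below; `make_ngrams` is a hypothetical helper written for illustration, not the NLTK function.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # tokens[0:], tokens[1:], ..., tokens[n-1:] are zipped together, so the
    # k-th tuple is (tokens[k], tokens[k+1], ..., tokens[k+n-1]).
    # zip stops at the shortest slice, which gives len(tokens) - n + 1 tuples.
    return list(zip(*(tokens[i:] for i in range(n))))

words = ["ML", "can", "be", "seen"]
print(make_ngrams(words, 2))
print(make_ngrams(words, 3))
print(make_ngrams(words, 5))   # longer than the input: no n-grams
```

A list of m tokens yields m - n + 1 n-grams, which matches the outputs above: 25 tokens give 24 bigrams and 23 trigrams.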

**POS Tagging (Parts of Speech)**
1. Once the tokens are generated, the next step is to tag each token with its part of speech (POS).
2. POS tagging is the process of labeling each token with its part of speech, such as noun, verb, or adjective.
3. These tags take into 

In [51]:
import nltk


In [52]:
nltk.download('averaged_perceptron_tagger')


[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data]     [Errno 11001] getaddrinfo failed>


False

In [53]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Error loading averaged_perceptron_tagger_eng: <urlopen
[nltk_data]     error [Errno 11001] getaddrinfo failed>


False

In [54]:

# Define the text
text = "ML can solve problems that cannot be solved by numerical means alone"

# Tokenize and apply POS tagging
pos_tags = nltk.pos_tag(nltk.word_tokenize(text))
print(pos_tags)


[('ML', 'NNP'), ('can', 'MD'), ('solve', 'VB'), ('problems', 'NNS'), ('that', 'WDT'), ('can', 'MD'), ('not', 'RB'), ('be', 'VB'), ('solved', 'VBN'), ('by', 'IN'), ('numerical', 'JJ'), ('means', 'NNS'), ('alone', 'RB')]


In [55]:
text="jin eats a bananas"
nltk.pos_tag(nltk.word_tokenize(text))

[('jin', 'NN'), ('eats', 'VBZ'), ('a', 'DT'), ('bananas', 'NN')]
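The tags in these outputs are Penn Treebank abbreviations. The lookup table below covers only the tags that appear in the two outputs above, and the `describe` helper is a hypothetical convenience for printing them side by side.

```python
# Meanings of the Penn Treebank tags seen in the outputs above
TAG_MEANINGS = {
    "NNP": "proper noun, singular",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "MD": "modal",
    "VB": "verb, base form",
    "VBZ": "verb, 3rd person singular present",
    "VBN": "verb, past participle",
    "WDT": "wh-determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "RB": "adverb",
    "DT": "determiner",
}

def describe(tagged_tokens):
    """Attach a readable description to each (token, tag) pair."""
    return [(token, tag, TAG_MEANINGS.get(tag, "tag not in this table"))
            for token, tag in tagged_tokens]

# Pairs copied from the second output above
for token, tag, meaning in describe([("jin", "NN"), ("eats", "VBZ"),
                                     ("a", "DT"), ("bananas", "NN")]):
    print(f"{token:10} {tag:4} {meaning}")
```

Reading the second output with this table shows that `bananas` was tagged `NN` (singular) rather than `NNS` (plural), a reminder that the tagger is statistical and can mislabel tokens.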

### **Stop Words Removal - 28-01-2025**

In [56]:
from nltk.corpus import stopwords

In [57]:
nltk.download('stopwords')

[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False

In [58]:
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [59]:
from nltk.corpus import stopwords

# Another sample text
new_text="The quick brown fox jumped over the lazy dog"

# Tokenize the new text using NLTK
new_word=word_tokenize(new_text)

# Remove stop words using NLTK (build the set once for fast membership tests)
stop_words = set(stopwords.words('english'))
new_filtered_words = [word for word in new_word if word.lower() not in stop_words]

# Join the filtered words to form a clean text
clean_text=" ".join(new_filtered_words)
print("Original text: " , new_text)
print("Text after stopword removal: " , clean_text)

Original text:  The quick brown fox jumped over the lazy dog
Text after stopword removal:  quick brown fox jumped lazy dog


In [60]:
pip install gensim

Note: you may need to restart the kernel to use updated packages.





In [61]:
from gensim.parsing.preprocessing import remove_stopwords

# Another sample text
new_text="The majestic mountains provide a breathtaking view"

# Remove stop words using gensim
new_filtered_words=remove_stopwords(new_text)

print("Original text: " , new_text)
print("Text after stopword removal: " , new_filtered_words)

Original text:  The majestic mountains provide a breathtaking view
Text after stopword removal:  The majestic mountains provide breathtaking view
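In the gensim output above, the capitalized `The` survives while the lowercase `a` is removed, which is consistent with a case-sensitive comparison against a lowercase stop-word list. The sketch below reproduces that effect in plain Python; the three-word `STOP_WORDS` set and the function `remove_stop_words` are hypothetical stand-ins, not gensim's or NLTK's actual list or code.

```python
# Tiny hypothetical stop-word list, all lowercase
STOP_WORDS = {"the", "a", "over"}

def remove_stop_words(text, case_sensitive):
    """Drop stop words from whitespace-separated text."""
    kept = []
    for word in text.split():
        key = word if case_sensitive else word.lower()
        if key not in STOP_WORDS:
            kept.append(word)
    return " ".join(kept)

sentence = "The majestic mountains provide a breathtaking view"
print(remove_stop_words(sentence, case_sensitive=True))
print(remove_stop_words(sentence, case_sensitive=False))
```

Lowercasing each word before the lookup, as the NLTK cell does with `word.lower()`, is what removes the leading `The`.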


In [62]:
from nltk.stem import WordNetLemmatizer

In [63]:
lemma=WordNetLemmatizer()


### **Stemming - 30-01-2025**
Stemming reduces a word to a base form (the stem) by stripping its suffixes. The stem is not necessarily a dictionary word.

In [64]:
# Stemming using NLTK
from nltk.stem import PorterStemmer
pst=PorterStemmer()
print(pst.stem("Change"))
print(pst.stem("Changing"))
print(pst.stem("changes"))
print(pst.stem("Changed"))


chang
chang
chang
chang
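The idea of suffix stripping can be sketched in a few lines. This is deliberately naive and is not the Porter algorithm, which applies several ordered phases of rules with conditions on the remaining stem; the suffix list and the `min_stem_length` guard below are assumptions chosen for the illustration.

```python
# Hypothetical suffix list, checked in this order
SUFFIXES = ("ing", "ed", "es", "e", "s")

def naive_stem(word, min_stem_length=3):
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if enough of the word is left afterwards
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[:-len(suffix)]
    return word

for w in ["Change", "Changing", "changes", "Changed", "sing"]:
    print(w, "->", naive_stem(w))
```

All four forms of "change" collapse to `chang`, which shows why a stem need not be a real word, while the length guard keeps `sing` from being cut down to `s`.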


### **Non-English Stemmers**

In [65]:
from nltk.stem import SnowballStemmer
sbst=SnowballStemmer("spanish")
print(sbst.languages)

('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


In [66]:
from nltk.stem import SnowballStemmer
sbst=SnowballStemmer("spanish")
print(sbst.stem("produccion"))
print(sbst.stem("producto"))

produccion
product


### **Lemmatization - 31-01-2025**
- Lemmatization converts an inflected form of a word to its root, using the part of speech and context as a basis.
- It applies different rules to each POS to get the root word, called the lemma.
- The root thus obtained is a meaningful dictionary word, unlike the output of stemming.
- This additional work makes the process slower than stemming.
- For example, a lemmatizer should map gone to go.

In [67]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
word_lem=WordNetLemmatizer()
print(word_lem.lemmatize("eating", pos="v"))
print(word_lem.lemmatize("eats", pos="v"))
print(word_lem.lemmatize("ate", pos="v"))

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\divu2\AppData\Roaming\nltk_data...


eat
eat
eat
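Why the `pos` argument matters can be shown with a toy lemmatizer. The sketch below is loosely modeled on the general idea of an exception list for irregular forms plus suffix rules per part of speech, checked against a vocabulary of known lemmas. All three tables are tiny and hypothetical, and `toy_lemmatize` is not how `WordNetLemmatizer` is implemented.

```python
# Irregular forms, looked up directly (hypothetical, tiny)
EXCEPTIONS = {
    "v": {"ate": "eat", "gone": "go", "went": "go"},
    "n": {"mice": "mouse"},
}
# Suffixes to try for each part of speech
SUFFIX_RULES = {
    "v": ("ing", "es", "ed", "s"),
    "n": ("s",),
}
# Words accepted as valid lemmas for each part of speech
KNOWN_LEMMAS = {
    "v": {"eat", "go", "meet"},
    "n": {"meeting", "mouse", "dog"},
}

def toy_lemmatize(word, pos):
    """Return the lemma of word for the given part of speech ('n' or 'v')."""
    word = word.lower()
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    for suffix in SUFFIX_RULES[pos]:
        candidate = word[:-len(suffix)]
        # Accept a stripped form only if it is a known dictionary word
        if word.endswith(suffix) and candidate in KNOWN_LEMMAS[pos]:
            return candidate
    return word

print(toy_lemmatize("eating", "v"), toy_lemmatize("ate", "v"))
print(toy_lemmatize("meeting", "v"), toy_lemmatize("meeting", "n"))
```

The last line is the point of the example: the same surface form `meeting` maps to `meet` as a verb but stays `meeting` as a noun, so the lemma depends on the part of speech supplied.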


### **Comparing Lemmatization with Stemming**

In [69]:
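# Import the stemmers explicitly instead of using a wildcard import
from nltk.stem import PorterStemmer, LancasterStemmer
print("Result of WordNetLemmatizer",
      word_lem.lemmatize("gone", pos="v"))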

print("\nResult of PorterStemmer:",
      PorterStemmer().stem("gone"))

print("\nResult of LancasterStemmer:",
      LancasterStemmer().stem("gone"))

Result of WordNetLemmatizer go

Result of PorterStemmer: gone

Result of LancasterStemmer: gon


- Lemmatization generates the correct morphological root of the word using the POS tag, the Porter stemmer does not stem the word at all, whereas the Lancaster stemmer over-stems it.