## 📝 Text Preprocessing with NLTK: A Beginner's Guide

Welcome to this comprehensive guide on text preprocessing with NLTK (Natural Language Toolkit)! This notebook will walk you through various essential text preprocessing techniques, all explained in simple terms with easy-to-follow code examples. Whether you're just starting out in NLP (Natural Language Processing) or looking to brush up on your skills, you're in the right place! 🚀


NLTK provides a comprehensive suite of tools for processing and analyzing unstructured text data.

### 1. Tokenization 
Tokenization is the process of splitting text into individual words or sentences.


In [4]:
# Sentence Tokenization

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

text = "Hello World. This is NLTK. It is great for text processing."
sentences = sent_tokenize(text)
print(sentences)


['Hello World.', 'This is NLTK.', 'It is great for text processing.']


[nltk_data] Downloading package punkt to /Users/praneet/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
# Word Tokenization

from nltk.tokenize import word_tokenize

words = word_tokenize(text)
print(words)


['Hello', 'World', '.', 'This', 'is', 'NLTK', '.', 'It', 'is', 'great', 'for', 'text', 'processing', '.']


### 2. Removing Stop Words
Stop words are common words that may not be useful for text analysis (e.g., "is", "the", "and").



In [6]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)


['Hello', 'World', '.', 'NLTK', '.', 'great', 'text', 'processing', '.']


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/praneet/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
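
The built-in list rarely covers every project, so it is common to extend it with domain-specific words. The sketch below uses a small hand-written set as a stand-in for `set(stopwords.words('english'))` so that it runs without any downloads; the extra words (`hello`, `nltk`) are hypothetical examples, not a recommendation.

```python
# Stand-in for set(stopwords.words('english')): a tiny hand-picked subset (assumption).
base_stop_words = {"is", "the", "and", "it", "this", "for"}

# Hypothetical domain-specific additions for this example.
custom_stop_words = base_stop_words | {"hello", "nltk"}

def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form appears in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ['Hello', 'World', '.', 'This', 'is', 'NLTK', '.', 'It', 'is',
          'great', 'for', 'text', 'processing', '.']
print(remove_stop_words(tokens, custom_stop_words))
```

Comparing on `tok.lower()` matters because the stop word list is lowercase, while tokens such as `This` and `It` are capitalized.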


### 3. Stemming
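Stemming reduces words to a root form by stripping suffixes with heuristic rules. The result is not always a dictionary word, and NLTK's `PorterStemmer` also lowercases its input by default.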

In [7]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]
print(stemmed_words)


['hello', 'world', '.', 'nltk', '.', 'great', 'text', 'process', '.']
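
To see what "chopping off the ends" means in practice, here is a minimal sketch that stems a few related word forms. The word list is an illustrative assumption; the point is that different forms collapse to one stem and that a stem such as `studi` is not a real word.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Related forms collapse to one stem; the stem need not be a dictionary word.
examples = ["running", "runs", "studies", "processing"]
stems = {word: stemmer.stem(word) for word in examples}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

If readable output matters more than speed, lemmatization (next section) is usually the better fit.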


### 4. Lemmatization
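Lemmatization reduces words to their dictionary base form (lemma) using a vocabulary lookup rather than suffix stripping. `WordNetLemmatizer.lemmatize` treats every word as a noun unless you pass a `pos` argument, which is why "processing" is left unchanged in the output below.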

In [8]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
print(lemmatized_words)


[nltk_data] Downloading package wordnet to /Users/praneet/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['Hello', 'World', '.', 'NLTK', '.', 'great', 'text', 'processing', '.']


### 5. Part-of-Speech Tagging
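Tagging words with their parts of speech (POS) helps reveal the grammatical structure of a sentence. The tagger relies on surrounding context, so running it on a filtered token list, as done here, can produce questionable tags.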

The complete POS tag list can be accessed from the Installation and set-up notebook.

In [9]:
nltk.download('averaged_perceptron_tagger')

pos_tags = nltk.pos_tag(lemmatized_words)
print(pos_tags)


[('Hello', 'NNP'), ('World', 'NNP'), ('.', '.'), ('NLTK', 'NNP'), ('.', '.'), ('great', 'JJ'), ('text', 'JJ'), ('processing', 'NN'), ('.', '.')]


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/praneet/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### 6. Named Entity Recognition
Identify named entities such as names of people, organizations, locations, etc.



In [12]:
# Numpy is required to run this
%pip install numpy

nltk.download('maxent_ne_chunker')
nltk.download('words')
from nltk.chunk import ne_chunk

named_entities = ne_chunk(pos_tags)
print(named_entities)


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/praneet/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /Users/praneet/nltk_data...
[nltk_data]   Package words is already up-to-date!


(S
  (PERSON Hello/NNP)
  World/NNP
  ./.
  NLTK/NNP
  ./.
  great/JJ
  text/JJ
  processing/NN
  ./.)


### 7. Word Frequency Distribution
Count the frequency of each word in the text.



In [13]:
from nltk.probability import FreqDist

freq_dist = FreqDist(lemmatized_words)
print(freq_dist.most_common(5))


[('.', 3), ('Hello', 1), ('World', 1), ('NLTK', 1), ('great', 1)]


### 8. Removing Punctuation
Remove punctuation from the text.



In [14]:
import string

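# Drop tokens made up entirely of punctuation characters (also catches runs such as '...')
no_punct = [word for word in lemmatized_words if not all(ch in string.punctuation for ch in word)]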
print(no_punct)


['Hello', 'World', 'NLTK', 'great', 'text', 'processing']
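
A token-level filter leaves punctuation that is attached to a word (for example `end.`) untouched. One possible alternative, sketched here with only the standard library, deletes punctuation characters inside each token with `str.translate`. The helper name `strip_punctuation` and the sample tokens are assumptions made for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(tokens):
    """Remove punctuation characters inside tokens and drop tokens left empty."""
    cleaned = (tok.translate(PUNCT_TABLE) for tok in tokens)
    return [tok for tok in cleaned if tok]

print(strip_punctuation(['Hello', 'World', '...', 'end.', "don't", '!']))
# -> ['Hello', 'World', 'end', 'dont']
```

Note the trade-off: contractions lose their apostrophe (`don't` becomes `dont`), which may or may not be what you want.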


### 9. Lowercasing
Convert all words to lowercase.



In [16]:
lowercased = [word.lower() for word in no_punct]
print(lowercased)


['hello', 'world', 'nltk', 'great', 'text', 'processing']


### 10. Spelling Correction
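Correct misspelled words. The helper below only sends a word to the spell checker when WordNet has no entry for it, so valid words are left untouched.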



In [23]:
%pip install pyspellchecker

from nltk.corpus import wordnet
from spellchecker import SpellChecker

spell = SpellChecker()

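def correct_spelling(word):
    # Only try to correct words that WordNet does not recognize
    if not wordnet.synsets(word):
        # Fall back to the original token if the checker has no suggestion
        return spell.correction(word) or word
    return word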

lemmatized_words = ['hello', 'world', '.', 'klown', 'taxt', 'procass', '.']
words_with_corrected_spelling = [correct_spelling(word) for word in lemmatized_words]
print(words_with_corrected_spelling)



['hello', 'world', '.', 'known', 'text', 'process', '.']


### 11. Removing Numbers
Remove numerical values from the text.



In [26]:
lemmatized_words = ['hello', 'world', '88', 'text', 'process', '.']

no_numbers = [word for word in lemmatized_words if not word.isdigit()]
print(no_numbers)


['hello', 'world', 'text', 'process', '.']
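
`str.isdigit()` only recognizes tokens made up entirely of digits, so values such as `3.14`, `1,000`, or `-5` would survive the filter. As a rough sketch, a regular expression can cover those cases too. The pattern and the sample tokens are assumptions for this example and will not handle every number format (scientific notation, for instance).

```python
import re

# Optional sign, digits, then any number of '.' or ',' separated digit groups.
NUMBER_RE = re.compile(r'[+-]?\d+(?:[.,]\d+)*')

def is_number(token):
    """Return True when the whole token looks like a number."""
    return NUMBER_RE.fullmatch(token) is not None

tokens = ['hello', '88', '3.14', '1,000', '-5', 'gr8', 'world']
no_numbers = [tok for tok in tokens if not is_number(tok)]
print(no_numbers)
# -> ['hello', 'gr8', 'world']
```

Using `fullmatch` keeps mixed tokens such as `gr8` intact, since only part of the token is numeric.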


### 12. Word Replacement
Replace specific words with other words (e.g., replacing slang with formal words).



In [28]:
lemmatized_words = ['hello', 'world', 'gr8', 'text', 'NLTK', '.']
replacements = {'NLTK': 'Natural Language Toolkit', 'gr8' : 'great'}

replaced_words = [replacements.get(word, word) for word in lemmatized_words]
print(replaced_words)


['hello', 'world', 'great', 'text', 'Natural Language Toolkit', '.']
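
The dictionary lookup works on a token list. If you want to apply the same idea to a raw string before tokenizing, one option is a single regular expression built from the dictionary keys. This is a sketch under that assumption; the extra entry `'u': 'you'` is a hypothetical addition to show why word boundaries matter.

```python
import re

replacements = {'gr8': 'great', 'u': 'you', 'NLTK': 'Natural Language Toolkit'}

# One alternation of all keys (longest first), anchored on word boundaries
# so that 'u' does not match inside words like 'use' or 'should'.
keys = sorted(replacements, key=len, reverse=True)
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keys) + r')\b')

def replace_words(text):
    """Replace every whole-word occurrence of a dictionary key."""
    return pattern.sub(lambda match: replacements[match.group(0)], text)

print(replace_words("NLTK is gr8, u should use it."))
# -> Natural Language Toolkit is great, you should use it.
```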


### 13. Synonym Replacement
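Replace words with their synonyms. The helper below takes the first lemma of the first WordNet synset, so it may return the word itself, or a synonym for a different sense than the one intended.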



In [30]:
from nltk.corpus import wordnet
lemmatized_words = ['hello', 'world', 'awesome', 'text', 'great', '.']

def get_synonym(word):
    synonyms = wordnet.synsets(word)
    if synonyms:
        return synonyms[0].lemmas()[0].name()
    return word

synonym_replaced = [get_synonym(word) for word in lemmatized_words]
print(synonym_replaced)


['hello', 'universe', 'amazing', 'text', 'great', '.']


### 14. Extracting Bigrams and Trigrams
Extract bigrams (pairs of consecutive words) and trigrams (triplets of consecutive words).



In [31]:
from nltk import bigrams

bigrams_list = list(bigrams(lemmatized_words))
print(bigrams_list)


[('hello', 'world'), ('world', 'awesome'), ('awesome', 'text'), ('text', 'great'), ('great', '.')]


In [32]:
from nltk import trigrams

trigrams_list = list(trigrams(lemmatized_words))
print(trigrams_list)


[('hello', 'world', 'awesome'), ('world', 'awesome', 'text'), ('awesome', 'text', 'great'), ('text', 'great', '.')]
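
Bigrams and trigrams are special cases of n-grams. To make the mechanics visible, here is a small standard-library sketch that builds n-grams of any size by zipping shifted copies of the token list. The helper name `make_ngrams` is an assumption; NLTK offers the same functionality through `nltk.util.ngrams(sequence, n)`.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest slice, so incomplete runs at the end are dropped.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['hello', 'world', 'awesome', 'text', 'great', '.']
print(make_ngrams(tokens, 2))
print(make_ngrams(tokens, 4))
```

A list of `k` tokens yields `k - n + 1` n-grams, so the six tokens above give five bigrams and three 4-grams.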


### 15. Sentence Segmentation
Split text into sentences while considering abbreviations and other punctuation complexities.



In [34]:
import nltk.data

text = 'Hello World. This is NLTK. It is great for text preprocessing.'

# Load the sentence tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Tokenize the text into sentences
sentences = tokenizer.tokenize(text)

# Print the tokenized sentences
print(sentences)


['Hello World.', 'This is NLTK.', 'It is great for text preprocessing.']


### 16. Identifying Word Frequencies
Identify and display the frequency of words in a text.



In [36]:
from nltk.probability import FreqDist

lemmatized_words = ['hello', 'hello', 'awesome', 'text', 'great', '.', '.', '.']


word_freq = FreqDist(lemmatized_words)
for word, freq in word_freq.items():
    print(f"{word}: {freq}")


hello: 2
awesome: 1
text: 1
great: 1
.: 3


### 17. Removing HTML Tags
Remove HTML tags from the text.


In [38]:
%pip install bs4

from bs4 import BeautifulSoup

html_text = "<p>Hello World. This is NLTK.</p>"
soup = BeautifulSoup(html_text, "html.parser")
cleaned_text = soup.get_text()
print(cleaned_text)


Hello World. This is NLTK.
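
If installing an extra package is not an option, Python's built-in `html.parser` can do a basic version of the same job. The sketch below is a dependency-free alternative, not a replacement for BeautifulSoup on messy real-world HTML; the class and function names are made up for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text that appears between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(strip_tags("<p>Hello <b>World</b>. This is NLTK.</p>"))
# -> Hello World. This is NLTK.
```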


### 18. Detecting Language
Detect the language of the text.

In [41]:
%pip install langdetect

from langdetect import detect

language = detect(text)
print(language)  # 'en' (for English)


en


### 19. Tokenizing by Regular Expressions
Use Regular Expressions to tokenize text.

In [49]:
text = 'Hello World. This is NLTK. It is great for text preprocessing.'

from nltk.tokenize import regexp_tokenize

pattern = r'\w+'
regex_tokens = regexp_tokenize(text, pattern)
print(regex_tokens)


['Hello', 'World', 'This', 'is', 'NLTK', 'It', 'is', 'great', 'for', 'text', 'preprocessing']
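
The pattern `\w+` is simple, but it splits contractions and hyphenated words into pieces. As a hedged sketch, the second pattern below also allows an apostrophe or hyphen between word characters; the sample sentence and the pattern are assumptions chosen for illustration.

```python
from nltk.tokenize import regexp_tokenize

sample = "It's a state-of-the-art toolkit, isn't it?"

# Plain \w+ breaks on apostrophes and hyphens.
simple_tokens = regexp_tokenize(sample, r"\w+")
print(simple_tokens)

# Allow internal apostrophes or hyphens between runs of word characters.
joined_tokens = regexp_tokenize(sample, r"\w+(?:[-']\w+)*")
print(joined_tokens)
# -> ["It's", 'a', 'state-of-the-art', 'toolkit', "isn't", 'it']
```

The group is written as non-capturing (`(?:...)`) so that each match is returned as one whole token.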


### 20. Removing Frequent Words
To remove frequent words (also known as "high-frequency words") from a list of tokens, use `nltk.FreqDist()` to count how often each token occurs, then filter out the tokens whose count exceeds a threshold.

In [51]:
import nltk

# input text
text = "Natural language processing is a field of AI. I love AI."

# tokenize the text
tokens = nltk.word_tokenize(text)

# calculate the frequency of each word
fdist = nltk.FreqDist(tokens)

# keep only tokens whose count is below 10% of the total number of tokens
filtered_tokens = [token for token in tokens if fdist[token] < fdist.N() * 0.1]

print("Tokens without frequent words:", filtered_tokens)

Tokens without frequent words: ['Natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'I', 'love']
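
A threshold on the count is one way to define "frequent". Another is to drop the `k` most common tokens outright. The sketch below does that with `collections.Counter`, which `FreqDist` subclasses; the helper name `remove_top_k` and the choice of `k = 2` are assumptions for this example.

```python
from collections import Counter

def remove_top_k(tokens, k):
    """Drop every occurrence of the k most frequent tokens."""
    most_frequent = {tok for tok, _ in Counter(tokens).most_common(k)}
    return [tok for tok in tokens if tok not in most_frequent]

tokens = ['Natural', 'language', 'processing', 'is', 'a', 'field', 'of',
          'AI', '.', 'I', 'love', 'AI', '.']
print(remove_top_k(tokens, 2))
# -> ['Natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'I', 'love']
```

When several tokens share the same count, `most_common` keeps them in the order they were first seen, so the cut-off between tied tokens is somewhat arbitrary.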


### 21. Removing Extra Whitespace
Tokenize the input string into individual sentences, then strip any leading or trailing whitespace from each sentence.

In [53]:
import nltk.data

# Text data
text = 'Hello World. This is NLTK. It is great for text preprocessing.'

# Load the sentence tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Tokenize the text into sentences
sentences = tokenizer.tokenize(text)

# Remove extra whitespace from each sentence
sentences = [sentence.strip() for sentence in sentences]

# Print the tokenized sentences
print(sentences)


['Hello World.', 'This is NLTK.', 'It is great for text preprocessing.']
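
`strip()` only trims the ends of a string. Repeated spaces, tabs, or line breaks in the middle of the text need a separate step. A minimal sketch, assuming you want every run of whitespace collapsed into a single space (the function name is hypothetical):

```python
import re

def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into one space and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()

print(normalize_whitespace("  Hello   World.\n\nThis is\tNLTK.  "))
# -> Hello World. This is NLTK.
```

Running this before sentence tokenization keeps stray line breaks from ending up inside sentences.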


## Conclusion 🎉

Text preprocessing is a crucial step in natural language processing (NLP) and can significantly impact the performance of your models and applications. With NLTK, we have a powerful toolset that simplifies and streamlines these tasks.

I hope this guide has provided you with a solid foundation for text preprocessing with NLTK. As you continue your journey in NLP, remember that preprocessing is just the beginning. There are many more exciting and advanced techniques to explore and apply in your projects.

Happy coding! 💻