#  Natural Language Processing (NLP) Text Preprocessing: A Complete Beginner’s Guide
***In the world of Natural Language Processing (NLP), raw text data is often messy and inconsistent. Before applying machine learning models, it's crucial to clean and prepare the text—a process known as text preprocessing. This article will walk you through the most essential preprocessing steps, techniques, and libraries commonly used in NLP.***

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

## 1. Word Tokenization: splitting a text into words (tokens)


In [185]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [186]:
text ="What is NLP? Natural language processing (NLP) is a machine learning technology that gives computers the ability to interpret, manipulate, and comprehend human language. Organizations today have large volumes of voice and text data from various communication channels like emails, text messages, social media newsfeeds, video, audio, and more. They use NLP software to automatically process this data, analyze the intent or sentiment in the message, and respond in real time to human communication."

In [187]:
token=word_tokenize(text)

In [188]:
print("tokens:",token)

tokens: ['What', 'is', 'NLP', '?', 'Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'machine', 'learning', 'technology', 'that', 'gives', 'computers', 'the', 'ability', 'to', 'interpret', ',', 'manipulate', ',', 'and', 'comprehend', 'human', 'language', '.', 'Organizations', 'today', 'have', 'large', 'volumes', 'of', 'voice', 'and', 'text', 'data', 'from', 'various', 'communication', 'channels', 'like', 'emails', ',', 'text', 'messages', ',', 'social', 'media', 'newsfeeds', ',', 'video', ',', 'audio', ',', 'and', 'more', '.', 'They', 'use', 'NLP', 'software', 'to', 'automatically', 'process', 'this', 'data', ',', 'analyze', 'the', 'intent', 'or', 'sentiment', 'in', 'the', 'message', ',', 'and', 'respond', 'in', 'real', 'time', 'to', 'human', 'communication', '.']


In [189]:
len(token)

89

In [190]:
from nltk.probability import FreqDist
fdist = FreqDist()

In [191]:
for word in token:
    fdist[word.lower()]+=1
fdist

FreqDist({',': 9, 'and': 4, 'nlp': 3, 'the': 3, 'to': 3, '.': 3, 'is': 2, 'language': 2, 'human': 2, 'text': 2, ...})

In [192]:
fdist['nlp']

3

In [193]:
fdist_top5 = fdist.most_common(5)
fdist_top5

[(',', 9), ('and', 4), ('nlp', 3), ('the', 3), ('to', 3)]
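
The top five above is led by the comma, because punctuation tokens are counted like words. A minimal sketch of the same count with the standard library's `collections.Counter`, skipping tokens that consist only of punctuation. The short `sample_tokens` list is a hypothetical stand-in for the `token` list, not the notebook's data:

```python
from collections import Counter
import string

# Hypothetical stand-in for the `token` list produced by word_tokenize.
sample_tokens = ["What", "is", "NLP", "?", "NLP", "is", "fun", ",",
                 "and", "NLP", "is", "useful", "."]

# Lowercase every token and skip the ones made up only of punctuation.
word_counts = Counter(
    tok.lower()
    for tok in sample_tokens
    if not all(ch in string.punctuation for ch in tok)
)

top3 = word_counts.most_common(3)
print(top3)
```

Whether punctuation belongs in the counts depends on the task; for keyword-style summaries it is usually dropped first.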

## 2. Sentence Tokenization using NLTK

In [194]:
text2 ="NLP is interesting part in AI. CSE students should try to learn NLP. Many ML models applies in NLP."

In [195]:
sent_tok = sent_tokenize(text2)

In [196]:
print("Sentence tokens:", sent_tok)

Sentence tokens: ['NLP is interesting part in AI.', 'CSE students should try to learn NLP.', 'Many ML models applies in NLP.']
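
To see why a trained sentence tokenizer is useful, compare it with a naive rule. The sketch below is not what `sent_tokenize` does; it simply splits after `.`, `!` or `?` when whitespace follows. The function name `naive_sent_split` and the sample sentences are made up for illustration:

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

simple = naive_sent_split("NLP is fun. Students should learn it. Many models apply here.")
tricky = naive_sent_split("Dr. Smith teaches NLP. He starts today.")

print(simple)  # three clean sentences
print(tricky)  # the abbreviation "Dr." is wrongly treated as a sentence end
```

The second call shows the weakness: an abbreviation ends with a period, so the rule cuts too early. Handling cases like this is the main reason to prefer a tokenizer built for the job.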


## 3. Whitespace and Punctuation Tokenizers

In [197]:
text3 ="Natural language processing is fun & interesting part in computer science and Artificial Inteligence"

In [198]:
tokens = text3.split()
print(tokens)

['Natural', 'language', 'processing', 'is', 'fun', '&', 'interesting', 'part', 'in', 'computer', 'science', 'and', 'Artificial', 'Inteligence']


In [199]:
from nltk.tokenize import WordPunctTokenizer

In [200]:
text4 = "Don't worry! We'll tokenize: correctly?"

In [201]:
punctuation = WordPunctTokenizer()

In [202]:
tokenize_punct = punctuation.tokenize(text4)

In [203]:
print(tokenize_punct)

['Don', "'", 't', 'worry', '!', 'We', "'", 'll', 'tokenize', ':', 'correctly', '?']
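
The difference between the two tokenizers in this section can be reproduced with plain Python. The sketch assumes that splitting into runs of word characters or runs of punctuation (the regular expression `\w+|[^\w\s]+`) is a fair approximation of the punctuation-based tokenizer; it gives the same tokens as the output above for this sentence:

```python
import re

text4 = "Don't worry! We'll tokenize: correctly?"

# Whitespace splitting keeps punctuation glued to the neighbouring word.
whitespace_tokens = text4.split()

# Runs of word characters OR runs of non-word, non-space characters.
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", text4)

print(whitespace_tokens)
print(wordpunct_tokens)
```

Whitespace splitting leaves `worry!` as one token, while the punctuation-based split breaks `Don't` into three pieces. Neither is ideal for contractions, which is one reason `word_tokenize` is the usual default.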


## 4. Regex Tokenizer (keep only words)

In [204]:
from nltk import RegexpTokenizer

In [205]:
regular_expression = RegexpTokenizer(r'\w+')
# r'\w+' matches runs of word characters: letters, digits, and the underscore.
# In Python 3 this includes non-ASCII letters such as 'é', not only [a-zA-Z0-9_].

In [206]:
text5 = "Tokenize only words! Exclude punctuation."

In [207]:
tokense2 = regular_expression.tokenize(text5)

In [208]:
print(tokense2)

['Tokenize', 'only', 'words', 'Exclude', 'punctuation']
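
A `\w+` pattern drops punctuation, but it also cuts contractions at the apostrophe. One possible variation, shown with the standard `re` module, lets a word carry an optional apostrophe suffix. The names `CONTRACTION_AWARE` and `tokenize_words` are hypothetical, and only the straight apostrophe `'` is handled:

```python
import re

# A word, optionally followed by an apostrophe and more word characters.
CONTRACTION_AWARE = re.compile(r"\w+(?:'\w+)?")

def tokenize_words(text):
    """Return word tokens, keeping contractions such as "Don't" whole."""
    return CONTRACTION_AWARE.findall(text)

plain = tokenize_words("Tokenize only words! Exclude punctuation.")
contractions = tokenize_words("Don't worry! We'll tokenize: correctly?")

print(plain)
print(contractions)
```

The same pattern string could be passed to `RegexpTokenizer`, since it accepts any regular expression.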


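## 5. Blank-line Tokenizer

In [209]:
# blankline_tokenize splits a string into paragraphs at blank lines.
# It expects the raw text (a string), not the list of word tokens.
# from nltk.tokenize import blankline_tokenize
# blankline = blankline_tokenize(text)
# len(blankline)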
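
Blank-line tokenization treats each block of text separated by an empty line as one token, which is a simple way to get paragraphs. A minimal sketch with `re.split`; the sample `document` is invented and the pattern is an assumption about what counts as a blank line (two newlines with optional whitespace around them):

```python
import re

# Hypothetical three-paragraph document.
document = (
    "NLP turns raw text into features.\n"
    "Tokenization is usually the first step.\n"
    "\n"
    "Stemming and lemmatization come later.\n"
    "\n\n"
    "Stop words are often removed last."
)

# Split wherever at least one empty line separates two blocks of text.
paragraphs = re.split(r"\s*\n\s*\n\s*", document.strip())

print(len(paragraphs))
for p in paragraphs:
    print(repr(p))
```

A single newline inside a paragraph does not trigger a split, so the first paragraph keeps both of its lines.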

##  N-grams: Unigram, Bigram, Trigram

In [210]:
from nltk.util import ngrams, trigrams, bigrams

In [211]:
string = "Natural language processing (NLP) is a machine learning technology that gives computers the ability to interpret, manipulate, and comprehend human language. Organizations today have large volumes of voice and text data from various communication channels like emails, text messages, social media newsfeeds, video, audio, and more. They use NLP software to automatically process this data, analyze the intent or sentiment in the message, and respond in real time to human communication."

In [212]:
woto = word_tokenize(string)
woto

['Natural',
 'language',
 'processing',
 '(',
 'NLP',
 ')',
 'is',
 'a',
 'machine',
 'learning',
 'technology',
 'that',
 'gives',
 'computers',
 'the',
 'ability',
 'to',
 'interpret',
 ',',
 'manipulate',
 ',',
 'and',
 'comprehend',
 'human',
 'language',
 '.',
 'Organizations',
 'today',
 'have',
 'large',
 'volumes',
 'of',
 'voice',
 'and',
 'text',
 'data',
 'from',
 'various',
 'communication',
 'channels',
 'like',
 'emails',
 ',',
 'text',
 'messages',
 ',',
 'social',
 'media',
 'newsfeeds',
 ',',
 'video',
 ',',
 'audio',
 ',',
 'and',
 'more',
 '.',
 'They',
 'use',
 'NLP',
 'software',
 'to',
 'automatically',
 'process',
 'this',
 'data',
 ',',
 'analyze',
 'the',
 'intent',
 'or',
 'sentiment',
 'in',
 'the',
 'message',
 ',',
 'and',
 'respond',
 'in',
 'real',
 'time',
 'to',
 'human',
 'communication',
 '.']

In [213]:
# Use a separate name so the imported `bigrams` function is not overwritten.
bigram_list = list(nltk.bigrams(woto))
bigram_list

[('Natural', 'language'),
 ('language', 'processing'),
 ('processing', '('),
 ('(', 'NLP'),
 ('NLP', ')'),
 (')', 'is'),
 ('is', 'a'),
 ('a', 'machine'),
 ('machine', 'learning'),
 ('learning', 'technology'),
 ('technology', 'that'),
 ('that', 'gives'),
 ('gives', 'computers'),
 ('computers', 'the'),
 ('the', 'ability'),
 ('ability', 'to'),
 ('to', 'interpret'),
 ('interpret', ','),
 (',', 'manipulate'),
 ('manipulate', ','),
 (',', 'and'),
 ('and', 'comprehend'),
 ('comprehend', 'human'),
 ('human', 'language'),
 ('language', '.'),
 ('.', 'Organizations'),
 ('Organizations', 'today'),
 ('today', 'have'),
 ('have', 'large'),
 ('large', 'volumes'),
 ('volumes', 'of'),
 ('of', 'voice'),
 ('voice', 'and'),
 ('and', 'text'),
 ('text', 'data'),
 ('data', 'from'),
 ('from', 'various'),
 ('various', 'communication'),
 ('communication', 'channels'),
 ('channels', 'like'),
 ('like', 'emails'),
 ('emails', ','),
 (',', 'text'),
 ('text', 'messages'),
 ('messages', ','),
 (',', 'social'),
 ('soci

In [214]:
trigram_list = list(nltk.trigrams(woto))
trigram_list

[('Natural', 'language', 'processing'),
 ('language', 'processing', '('),
 ('processing', '(', 'NLP'),
 ('(', 'NLP', ')'),
 ('NLP', ')', 'is'),
 (')', 'is', 'a'),
 ('is', 'a', 'machine'),
 ('a', 'machine', 'learning'),
 ('machine', 'learning', 'technology'),
 ('learning', 'technology', 'that'),
 ('technology', 'that', 'gives'),
 ('that', 'gives', 'computers'),
 ('gives', 'computers', 'the'),
 ('computers', 'the', 'ability'),
 ('the', 'ability', 'to'),
 ('ability', 'to', 'interpret'),
 ('to', 'interpret', ','),
 ('interpret', ',', 'manipulate'),
 (',', 'manipulate', ','),
 ('manipulate', ',', 'and'),
 (',', 'and', 'comprehend'),
 ('and', 'comprehend', 'human'),
 ('comprehend', 'human', 'language'),
 ('human', 'language', '.'),
 ('language', '.', 'Organizations'),
 ('.', 'Organizations', 'today'),
 ('Organizations', 'today', 'have'),
 ('today', 'have', 'large'),
 ('have', 'large', 'volumes'),
 ('large', 'volumes', 'of'),
 ('volumes', 'of', 'voice'),
 ('of', 'voice', 'and'),
 ('voice', 'a

In [215]:
# ngrams(sequence, n) generalizes bigrams and trigrams; here n = 5.
fivegram_list = list(nltk.ngrams(woto, 5))
fivegram_list

[('Natural', 'language', 'processing', '(', 'NLP'),
 ('language', 'processing', '(', 'NLP', ')'),
 ('processing', '(', 'NLP', ')', 'is'),
 ('(', 'NLP', ')', 'is', 'a'),
 ('NLP', ')', 'is', 'a', 'machine'),
 (')', 'is', 'a', 'machine', 'learning'),
 ('is', 'a', 'machine', 'learning', 'technology'),
 ('a', 'machine', 'learning', 'technology', 'that'),
 ('machine', 'learning', 'technology', 'that', 'gives'),
 ('learning', 'technology', 'that', 'gives', 'computers'),
 ('technology', 'that', 'gives', 'computers', 'the'),
 ('that', 'gives', 'computers', 'the', 'ability'),
 ('gives', 'computers', 'the', 'ability', 'to'),
 ('computers', 'the', 'ability', 'to', 'interpret'),
 ('the', 'ability', 'to', 'interpret', ','),
 ('ability', 'to', 'interpret', ',', 'manipulate'),
 ('to', 'interpret', ',', 'manipulate', ','),
 ('interpret', ',', 'manipulate', ',', 'and'),
 (',', 'manipulate', ',', 'and', 'comprehend'),
 ('manipulate', ',', 'and', 'comprehend', 'human'),
 (',', 'and', 'comprehend', 'human'
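
An n-gram is just a sliding window of n consecutive tokens. The idea can be sketched in plain Python with `zip`; `make_ngrams` is a hypothetical helper written for illustration, not part of NLTK:

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, so the window never runs past the end.
    return list(zip(*(tokens[i:] for i in range(n))))

words = ["Natural", "language", "processing", "is", "fun"]

unigram_list = make_ngrams(words, 1)
bigram_list = make_ngrams(words, 2)
trigram_list = make_ngrams(words, 3)

print(bigram_list)
print(trigram_list)
```

A sequence of L tokens yields L - n + 1 n-grams, and none at all when n is larger than L.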

## Stemming: Reducing words to their base or root form

In [216]:
from nltk.stem import PorterStemmer
pst = PorterStemmer()

In [217]:
pst.stem('affective')

'affect'

In [218]:
word_to_stem=['giving','achivement','impressive','implementing','Blocking','Engineering']
for words in word_to_stem:
    print(words+ ":" +pst.stem(words))


giving:give
achivement:achiv
impressive:impress
implementing:implement
Blocking:block
Engineering:engin


In [219]:
from nltk.stem import LancasterStemmer
pst = LancasterStemmer()
for words in word_to_stem:
    print(words+ ":" +pst.stem(words))

giving:giv
achivement:ach
impressive:impress
implementing:impl
Blocking:block
Engineering:engin


In [220]:
from nltk.stem import SnowballStemmer
pst = SnowballStemmer('english')
for words in word_to_stem:
    print(words+ ":" +pst.stem(words))

giving:give
achivement:achiv
impressive:impress
implementing:implement
Blocking:block
Engineering:engin
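
Stemmers work by rule-based suffix stripping, which is why results such as `engin` are not dictionary words. The toy function below illustrates the idea only. It is not the Porter, Lancaster, or Snowball algorithm, and the suffix list and minimum stem length are arbitrary assumptions:

```python
# Arbitrary, illustrative suffix list; real stemmers use many ordered rules.
SUFFIXES = ("ing", "ment", "ive", "ed", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["giving", "impressive", "Blocking", "sing"]:
    print(w + ":" + naive_stem(w))
```

Even this crude version shows the typical trade-off: `giving` becomes `giv`, a stem that groups related forms together but is not itself a word.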


## Lemmatization: Reducing a word to its dictionary form (lemma). Unlike stemming, the result is a real word when the lemmatizer knows the input; unknown words are returned unchanged.

In [221]:
from nltk.stem import WordNetLemmatizer
# Requires the WordNet corpus: run nltk.download('wordnet') once if it is missing.
wo_lem = WordNetLemmatizer()

In [222]:
wo_lem.lemmatize('corpora')

'corpus'

In [223]:
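# lemmatize() treats every word as a noun by default (pos='n'), so verb forms
# such as 'giving' come back unchanged; pass pos='v' to lemmatize them as verbs.
for words in word_to_stem:
    print(words+ ":" +wo_lem.lemmatize(words))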

giving:giving
achivement:achivement
impressive:impressive
implementing:implementing
Blocking:Blocking
Engineering:Engineering


## Part-of-Speech (POS) Tagging

In [226]:
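sentence = "There is chance get a Bike from your mom! John."
sentence_token = word_tokenize(sentence)
# Tagging one token at a time hides the sentence context from the tagger,
# which likely explains odd tags such as 'Bike' -> IN below.
# nltk.pos_tag(sentence_token) tags the whole sentence in a single call instead.
for token in sentence_token:
    print(nltk.pos_tag([token]))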

[('There', 'EX')]
[('is', 'VBZ')]
[('chance', 'NN')]
[('get', 'VB')]
[('a', 'DT')]
[('Bike', 'IN')]
[('from', 'IN')]
[('your', 'PRP$')]
[('mom', 'NN')]
[('!', '.')]
[('John', 'NNP')]
[('.', '.')]


In [227]:
sentence2 = "We are willing to achive 1000 ds from first job! what do you think Masud?"
sent_token2 = word_tokenize(sentence2)
# Each token is again tagged in isolation, so some tags may not fit the sentence.
for token in sent_token2:
    print(nltk.pos_tag([token]))

[('We', 'PRP')]
[('are', 'VBP')]
[('willing', 'JJ')]
[('to', 'TO')]
[('achive', 'JJ')]
[('1000', 'CD')]
[('ds', 'NN')]
[('from', 'IN')]
[('first', 'RB')]
[('job', 'NN')]
[('!', '.')]
[('what', 'WP')]
[('do', 'VB')]
[('you', 'PRP')]
[('think', 'NN')]
[('Masud', 'NN')]
[('?', '.')]


## NER: Named Entity Recognition

In [228]:
from nltk import ne_chunk

In [229]:
ne_sent = "US president stays in White House."
ne_tokens = word_tokenize(ne_sent)
ne_tags = nltk.pos_tag(ne_tokens)
print(ne_tags)

[('US', 'NNP'), ('president', 'NN'), ('stays', 'NNS'), ('in', 'IN'), ('White', 'NNP'), ('House', 'NNP'), ('.', '.')]


In [230]:
# ne_chunk needs extra NLTK resources (for example 'maxent_ne_chunker' and 'words').
# NER = ne_chunk(ne_tags)
# print(NER)

## Regular Expression

| Function | What it does | Example |
|---|---|---|
| ***re.findall()*** | Find all matches of a pattern in a text | re.findall(r'\d+', text) |
| ***re.search()*** | Search for the first match | re.search(r'pattern', text) |
| ***re.match()*** | Check if the beginning of the text matches | re.match(r'Hello', text) |
| ***re.split()*** | Split text by a regex pattern | re.split(r'[,\s]', text) |
| ***re.sub()*** | Substitute (replace) matches with something else | re.sub(r'apple', 'banana', text) |
| ***re.compile()*** | Compile a regex pattern for reuse | pattern = re.compile(r'\w+') |
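
The six functions can be tried side by side on one string. This is a small sketch using only the standard `re` module; the sample sentence and variable names are invented for the example:

```python
import re

text = "Order 12 apples, 7 pears and 30 plums"

numbers = re.findall(r"\d+", text)            # every run of digits
first_pears = re.search(r"pears", text)       # first match anywhere in the text
starts_with_order = re.match(r"Order", text)  # matches only at the beginning
starts_with_apples = re.match(r"apples", text)  # None: 'apples' is not at the start
colours = re.split(r"[,\s]", "red,green blue")  # split on a comma or whitespace
swapped = re.sub(r"apples", "bananas", text)  # replace every match

word_pattern = re.compile(r"\w+")             # compile once, reuse many times
words = word_pattern.findall(text)

print(numbers)
print(first_pears.group(0), first_pears.span())
print(starts_with_order is not None, starts_with_apples is None)
print(colours)
print(swapped)
print(words)
```

The practical difference to remember is that `re.match()` only looks at the start of the string, while `re.search()` scans the whole text.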

In [231]:
import re
string = "Masud is working hard to learn AI also he is seeking to learn theories from reading books."
pattern = 'Masud'
match = re.match(pattern, string)
print(match)



<re.Match object; span=(0, 5), match='Masud'>


In [232]:
pattern2 = "learn"
search = re.search(pattern2,string)
print(search)
print(search.group(0))

<re.Match object; span=(25, 30), match='learn'>
learn


In [233]:
pattern3 = "learn"
# Avoid naming the result `all`, which would hide Python's built-in all().
matches = re.findall(pattern3, string)
print(matches)

['learn', 'learn']


In [234]:
index = re.finditer(pattern, string)
for indx in index:
    print(indx.start())

0


#### \d{2}: Matches exactly two digits (e.g., "12").
#### The pattern \d{2}-\d{2}-\d{4} matches dates like "12-05-2007".
#### \s: Matches any whitespace (space, tab, etc.).
#### So the pattern [;,\s] splits the text whenever it sees a space, comma, or semicolon.
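
A short sketch of the `[;,\s]` split described above, on a made-up string. It also shows a common surprise: when two separators are adjacent (a comma followed by a space), the single-character class produces an empty string, which adding `+` avoids:

```python
import re

raw = "apple, banana;cherry grape"

# One separator character per split: ", " yields an empty piece in between.
single = re.split(r"[;,\s]", raw)

# One or more separator characters per split: no empty pieces.
grouped = re.split(r"[;,\s]+", raw)

print(single)
print(grouped)
```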

In [235]:
text = "Masud born in 16-12-2001 in Khulna, he admitted in University on 22-07-2022."
pattern = r'\d{2}-\d{2}-\d{4}'
date = re.findall(pattern, text)
print(date)

['16-12-2001', '22-07-2022']


In [236]:
print(re.sub(pattern, "Sunday", text))

Masud born in Sunday in Khulna, he admitted in University on Sunday.
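
Capture groups make the substitution more useful than a fixed replacement word. The sketch below rewrites each date into year-month-day order and also parses it with `datetime`. It assumes the dates are written day-month-year, which the text does not state explicitly:

```python
import re
from datetime import datetime

text = "Masud born in 16-12-2001 in Khulna, he admitted in University on 22-07-2022."

# Three capture groups: day, month, year (assumed order).
DATE = re.compile(r"(\d{2})-(\d{2})-(\d{4})")

# \3-\2-\1 reorders the captured groups into year-month-day.
iso_text = DATE.sub(r"\3-\2-\1", text)

# strptime also validates the value, e.g. it rejects month 13.
parsed = [datetime.strptime(m.group(0), "%d-%m-%Y").date() for m in DATE.finditer(text)]

print(iso_text)
print(parsed)
```

The regular expression alone only checks the shape of the string; parsing is what confirms that the digits form a real calendar date.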


## Removing Stop Words
### Words like ***“is”, “the”, “and”*** carry little meaning on their own and are often filtered out.



In [237]:
from nltk.corpus import stopwords

In [238]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [239]:
text = "Natural Language Processing is a fascinating field in Artificial Intelligence."
token = nltk.word_tokenize(text)
token

['Natural',
 'Language',
 'Processing',
 'is',
 'a',
 'fascinating',
 'field',
 'in',
 'Artificial',
 'Intelligence',
 '.']

In [240]:
stop_words = set(stopwords.words('english'))
# Filter `token` (the sentence tokenized in the previous cell), comparing in lowercase.
filtered_tokens = [word for word in token if word.lower() not in stop_words]


In [241]:
print(filtered_tokens)

['Natural', 'Language', 'Processing', 'fascinating', 'field', 'Artificial', 'Intelligence', '.']
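
Stop-word removal is a membership test on the lowercased token. A self-contained sketch with a tiny hand-written stop list; `SMALL_STOP_WORDS` and `remove_stop_words` are hypothetical names, and the list is far shorter than NLTK's English list:

```python
# Tiny illustrative stop list; a set gives fast membership checks.
SMALL_STOP_WORDS = {"is", "a", "an", "the", "in", "and", "of", "to"}

def remove_stop_words(tokens, stop_words=SMALL_STOP_WORDS):
    """Keep tokens whose lowercase form is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

sample = ["Natural", "Language", "Processing", "is", "a", "fascinating",
          "field", "in", "Artificial", "Intelligence", "."]

kept = remove_stop_words(sample)
print(kept)
```

Comparing in lowercase matters because stop lists are usually stored in lowercase, so `The` at the start of a sentence would otherwise survive.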


## Remove punctuation from a text
### ***string.punctuation*** : (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~).

In [242]:
import string

In [243]:
text = "Hello! How's everything going in 2025? Let's test punctuation removal."
token = nltk.word_tokenize(text)
print(token)

['Hello', '!', 'How', "'s", 'everything', 'going', 'in', '2025', '?', 'Let', "'s", 'test', 'punctuation', 'removal', '.']


In [244]:
# Filter `token` (the sentence tokenized in the previous cell), not the older `tokens` list.
tokens_no_punct = [word for word in token if word not in string.punctuation]
print(tokens_no_punct)

['Hello', 'How', "'s", 'everything', 'going', 'in', '2025', 'Let', "'s", 'test', 'punctuation', 'removal']
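
Note that `word not in string.punctuation` is a substring test, so a multi-character token such as `...` is not found inside `string.punctuation` and slips through. A slightly stricter sketch checks every character of the token, and a second variant strips punctuation from raw text with `str.translate`. The helper name `is_punct` and the sample data are hypothetical:

```python
import string

def is_punct(tok):
    """True when every character of the token is a punctuation mark."""
    return all(ch in string.punctuation for ch in tok)

sample_tokens = ["Hello", "!", "How", "'s", "...", "2025", "?"]
kept = [tok for tok in sample_tokens if not is_punct(tok)]

# Character-level removal on raw text, before any tokenization.
raw = "Hello! How's everything going in 2025?"
no_punct_text = raw.translate(str.maketrans("", "", string.punctuation))

print(kept)
print(no_punct_text)
```

Token-level filtering keeps `'s` as a token, while character-level removal merges it into `Hows`; which behaviour is preferable depends on the later steps of the pipeline.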
