
# NLP Text Preprocessing

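This notebook walks through basic text preprocessing steps used in Natural Language Processing (NLP) with NLTK: lowercasing, punctuation removal, tokenization, stopword removal, stemming, and lemmatization.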


In [9]:
import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
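
# Download the NLTK data used below (skipped if already up to date)
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stopword lists, including English
nltk.download('punkt_tab')  # tokenizer tables required by newer NLTK releases
nltk.download('wordnet')    # WordNet database used by WordNetLemmatizer
nltk.download('omw-1.4')    # Open Multilingual WordNet data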


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [7]:
# Read the sample text; an explicit encoding avoids platform-dependent defaults
with open("sample_nlp_text.txt", "r", encoding="utf-8") as file:
    text = file.read()

print("Original Text:")
print(text)


Original Text:
Artificial Intelligence is transforming the way humans interact with machines.
Natural Language Processing allows computers to understand human language.
Text preprocessing is an important step in NLP because raw text contains noise.
Removing stopwords, punctuation, and converting text to lowercase improves model performance.
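
The preprocessing cell in this notebook lowercases the text, strips punctuation, tokenizes, and removes stopwords. As a rough illustration of what those steps do, here is a dependency-free sketch applied to the first sentence above. It is not the notebook's method: it splits on whitespace instead of calling `word_tokenize`, and `TOY_STOPWORDS` and `preprocess` are hypothetical names introduced here, with a deliberately tiny stopword set standing in for NLTK's much longer English list.

```python
import string

# Hypothetical, deliberately tiny stopword set for illustration only;
# NLTK's English stopword list is much longer.
TOY_STOPWORDS = {"is", "the", "with", "to", "in", "an", "and"}


def preprocess(sentence, stop_words=TOY_STOPWORDS):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    lowered = sentence.lower()
    # str.maketrans('', '', chars) builds a table that deletes every char in chars
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()  # whitespace split, a stand-in for word_tokenize
    return [tok for tok in tokens if tok not in stop_words]


sentence = "Artificial Intelligence is transforming the way humans interact with machines."
result = preprocess(sentence)
print(result)
# ['artificial', 'intelligence', 'transforming', 'way', 'humans', 'interact', 'machines']
```

One side effect worth knowing: because punctuation is deleted rather than replaced with a space, `preprocess("Don't stop!")` yields `['dont', 'stop']`, and hyphenated words are fused into a single token.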



In [11]:

# Lowercase
text = text.lower()

# Remove punctuation
text = text.translate(str.maketrans('', '', string.punctuation))

# Tokenization
tokens = word_tokenize(text)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]

# Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_tokens]

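# Lemmatization
# Note: lemmatize() treats each word as a noun by default (pos='n'),
# so verb forms such as 'transforming' are returned unchanged.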
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_tokens]

print("\nFiltered Tokens (first 10):")
print(filtered_tokens[:10])

print("Lemmatized Tokens:")
print(lemmatized_words)

print("Stemmed Tokens:")
print(stemmed_words)



Filtered Tokens (first 10):
['artificial', 'intelligence', 'transforming', 'way', 'humans', 'interact', 'machines', 'natural', 'language', 'processing']
Lemmatized Tokens:
['artificial', 'intelligence', 'transforming', 'way', 'human', 'interact', 'machine', 'natural', 'language', 'processing', 'allows', 'computer', 'understand', 'human', 'language', 'text', 'preprocessing', 'important', 'step', 'nlp', 'raw', 'text', 'contains', 'noise', 'removing', 'stopwords', 'punctuation', 'converting', 'text', 'lowercase', 'improves', 'model', 'performance']
Stemmed Tokens:
['artifici', 'intellig', 'transform', 'way', 'human', 'interact', 'machin', 'natur', 'languag', 'process', 'allow', 'comput', 'understand', 'human', 'languag', 'text', 'preprocess', 'import', 'step', 'nlp', 'raw', 'text', 'contain', 'nois', 'remov', 'stopword', 'punctuat', 'convert', 'text', 'lowercas', 'improv', 'model', 'perform']
