# NLP Preprocessing Notebook (Beginner-Friendly)

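Raw text is unstructured and noisy, which makes it hard for models to learn meaningful patterns. Preprocessing turns it into a clean, consistent form that is easier to use for training language models. This notebook walks through the common steps with NLTK: lowercasing, removing punctuation and numbers, tokenization, stopword removal, stemming, and lemmatization.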

### Step 1: Install required libraries (Run only once)

In [1]:
# pip install nltk    # Uncomment to install NLTK, the NLP library used below for text preprocessing

In [1]:
import re        # regular expressions, used to remove digits
import string    # string.punctuation, used to remove punctuation

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import TreebankWordTokenizer, word_tokenize

In [2]:
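# Download the NLTK data used in this notebook (skipped if already present)
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stopword lists, including English
nltk.download('wordnet')    # WordNet database used by WordNetLemmatizer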

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\naeem\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\naeem\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\naeem\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Step 2: Sample Raw Text

In [3]:
text = "Hello! Welcome to NLP with Gen AI. It's amazing, isn't it? Let's learn together in 2025!"

In [4]:
print("Original Text:")
print(text)

Original Text:
Hello! Welcome to NLP with Gen AI. It's amazing, isn't it? Let's learn together in 2025!


### Step 3: Lowercase

In [5]:
text_lower = text.lower()

In [6]:
print(text_lower)

hello! welcome to nlp with gen ai. it's amazing, isn't it? let's learn together in 2025!
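`str.lower()` is enough for the English sample above. As a small aside, and assuming the text might contain non-English characters, Python also offers `str.casefold()`, which normalizes more aggressively for caseless matching. The German word below is an illustrative input and is not part of this notebook's data.

```python
# Illustrative input (not from the notebook)
word = "Straße"

lowered = word.lower()      # lowercases letters but keeps the "ß"
folded = word.casefold()    # also expands "ß" to "ss"

print(lowered)   # straße
print(folded)    # strasse
```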


### Step 4: Remove Punctuation

In [7]:
# Delete every character in string.punctuation (apostrophes included, so "it's" becomes "its")
text_no_punct = text_lower.translate(str.maketrans('', '', string.punctuation))

In [8]:
print(text_no_punct)

hello welcome to nlp with gen ai its amazing isnt it lets learn together in 2025
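Deleting punctuation outright also deletes apostrophes, which is why `it's` and `isn't` turned into `its` and `isnt` above. The sketch below prints what `string.punctuation` contains and contrasts deleting each mark with replacing it by a space. The sample string and variable names are illustrative, and the second variant is an alternative to consider, not the method this notebook uses.

```python
import string

sample = "it's amazing, isn't it?"

# Variant used in the notebook: delete every punctuation character
delete_table = str.maketrans('', '', string.punctuation)
deleted = sample.translate(delete_table)

# Alternative (assumption): map each punctuation character to a space,
# then collapse whitespace so neighbouring words are never glued together
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
spaced = ' '.join(sample.translate(space_table).split())

print(string.punctuation)
print(deleted)   # its amazing isnt it
print(spaced)    # it s amazing isn t it
```

Neither variant keeps contractions intact. If contractions matter for the task, one option is to tokenize before stripping punctuation.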


### Step 5: Remove Numbers

In [9]:
# Remove every run of digits; the spaces around them are left in place
text_no_numbers = re.sub(r'\d+', '', text_no_punct)

In [10]:
print(text_no_numbers)

hello welcome to nlp with gen ai its amazing isnt it lets learn together in 
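Removing `2025` leaves the space that preceded it, so the string above ends with a trailing space. That is harmless here because tokenization discards it, but a minimal sketch of tidying whitespace explicitly could look like this. The extra `text_clean` step is an addition for illustration and is not part of the original notebook.

```python
import re

text_no_punct = "hello welcome to nlp with gen ai its amazing isnt it lets learn together in 2025"

# Same substitution as in the notebook
text_no_numbers = re.sub(r'\d+', '', text_no_punct)

# Extra step (assumption): collapse runs of whitespace and trim both ends
text_clean = re.sub(r'\s+', ' ', text_no_numbers).strip()

print(repr(text_no_numbers[-4:]))   # ' in '
print(repr(text_clean[-4:]))        # 'r in'
```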


### Step 6: Tokenization

In [11]:
# Rule-based Penn Treebank tokenizer; it needs no downloaded model
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(text_no_numbers)

In [12]:
print("Tokens:", tokens)

Tokens: ['hello', 'welcome', 'to', 'nlp', 'with', 'gen', 'ai', 'its', 'amazing', 'isnt', 'it', 'lets', 'learn', 'together', 'in']
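Because punctuation and digits were already removed, the tokenizer has only whitespace left to split on. As a quick check, the sketch below shows that plain `str.split()` produces the same 15 tokens for this particular input. It uses a hard-coded copy of the Step 5 output so that it runs on its own.

```python
# Copy of the Step 5 output, including its trailing space
text_no_numbers = "hello welcome to nlp with gen ai its amazing isnt it lets learn together in "

# split() with no arguments splits on any whitespace and drops empty strings
whitespace_tokens = text_no_numbers.split()

print(whitespace_tokens)
print(len(whitespace_tokens))   # 15
```

A dedicated tokenizer earns its keep when punctuation is still present, for example by separating a trailing `?` from the word before it or by splitting contractions into parts.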


### Step 7: Remove Stopwords

In [13]:
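# A set gives fast membership tests when filtering
stop_words = set(stopwords.words('english'))
# Keep only the tokens that are not in the stopword set
filtered_tokens = [w for w in tokens if w not in stop_words]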

In [14]:
print(stop_words)

{'itself', "don't", 'most', 'each', 'during', 'more', 'them', 'did', 'how', "i'd", 'needn', 'not', 'i', 'from', 'd', 'in', 't', 'which', 'hadn', "it'll", "they'd", 'having', "doesn't", 'after', 'doesn', "hasn't", 'such', "wouldn't", 'but', 'off', 'against', 'up', "they've", 'when', 'before', "couldn't", 'me', 'no', 'aren', "we'd", 'll', 'into', 'if', 'few', 'at', 'ain', 'between', 'were', "won't", 'was', 'have', 'these', 'on', 'will', "he'll", 'her', 'am', "it's", "it'd", 'so', 'where', "we're", 'or', 'won', 'above', 'now', 'what', "i've", 've', 'of', "you'll", "shouldn't", 'their', 'mustn', "aren't", 'while', 'themselves', "isn't", "i'll", 'it', 'our', "she's", 'didn', 'down', 'too', 'nor', 'other', 'whom', 'can', 'for', 'its', "she'll", 'those', 'we', 'any', 'had', 'isn', 'this', 'wouldn', 'who', 'theirs', "they'll", 'myself', 'couldn', 'ourselves', "weren't", 'your', "should've", "you've", "didn't", 'herself', 'just', 'shouldn', 'ma', "that'll", "haven't", 'himself', 'weren', 'has',

In [15]:
print(filtered_tokens)

['hello', 'welcome', 'nlp', 'gen', 'ai', 'amazing', 'isnt', 'lets', 'learn', 'together']
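Note that `isnt` survived the filter: the stopword list contains `isn't` with its apostrophe, but Step 4 had already removed the apostrophe from the text. The sketch below reproduces the effect with a small hand-picked subset of stopwords. `stop_subset` is a hypothetical stand-in chosen for illustration, not the full NLTK list.

```python
tokens = ['hello', 'welcome', 'to', 'nlp', 'with', 'gen', 'ai', 'its', 'amazing',
          'isnt', 'it', 'lets', 'learn', 'together', 'in']

# Hand-picked subset for illustration only; the notebook uses stopwords.words('english')
stop_subset = {'to', 'with', 'its', 'it', 'in', "isn't", "it's"}

kept = [w for w in tokens if w not in stop_subset]
removed = [w for w in tokens if w in stop_subset]

print(kept)
print(removed)                                        # ['to', 'with', 'its', 'it', 'in']
print('isnt' in stop_subset, "isn't" in stop_subset)  # False True
```

One way to avoid the mismatch, under the assumption that contractions should count as stopwords, is to filter stopwords before removing punctuation, or to strip apostrophes from the stopword list as well.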


### Step 8: Stemming

In [16]:
# Porter stemming strips suffixes by rule, so results need not be real words
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in filtered_tokens]

In [17]:
print(stemmed)

['hello', 'welcom', 'nlp', 'gen', 'ai', 'amaz', 'isnt', 'let', 'learn', 'togeth']


### Step 9: Lemmatization

In [18]:
# lemmatize() treats each word as a noun unless a pos argument is given
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word) for word in filtered_tokens]

In [19]:
print(lemmatized)

['hello', 'welcome', 'nlp', 'gen', 'ai', 'amazing', 'isnt', 'let', 'learn', 'together']
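To make the difference between the two steps concrete, the sketch below lines up the Step 8 and Step 9 outputs and prints only the tokens where they disagree. The three lists are copied from the outputs above so the example runs without NLTK; `differing` is an illustrative name.

```python
filtered_tokens = ['hello', 'welcome', 'nlp', 'gen', 'ai', 'amazing', 'isnt', 'lets', 'learn', 'together']
stemmed = ['hello', 'welcom', 'nlp', 'gen', 'ai', 'amaz', 'isnt', 'let', 'learn', 'togeth']
lemmatized = ['hello', 'welcome', 'nlp', 'gen', 'ai', 'amazing', 'isnt', 'let', 'learn', 'together']

# Print one row per token where the stem and the lemma differ
for token, stem, lemma in zip(filtered_tokens, stemmed, lemmatized):
    if stem != lemma:
        print(f"{token:<10} stem={stem:<10} lemma={lemma}")

differing = [t for t, s, l in zip(filtered_tokens, stemmed, lemmatized) if s != l]
print(differing)   # ['welcome', 'amazing', 'together']
```

The stems `welcom`, `amaz`, and `togeth` are not dictionary words, while the lemmas are. The lemmatizer left `amazing` unchanged because it was looked up as a noun; passing a part-of-speech argument such as `pos='v'` to `lemmatize()` asks for the verb lemma instead.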
