##Install Required Libraries


In [2]:
# 📦 Install libraries (run this cell once, then you can skip it later)
!pip install nltk spacy

# Download spaCy English model
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 103.0 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
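
After restarting the runtime, a quick check can confirm that the installed packages are visible to Python before the later cells run. The sketch below uses only the standard library; the helper name `missing_packages` is made up for this example and is not part of NLTK or spaCy.

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# The spaCy model is installed as a regular Python package, so it can be checked too.
missing = missing_packages(["nltk", "spacy", "en_core_web_sm"])
if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("All packages are importable.")
```

If anything is reported as missing, re-running the install cell and restarting the runtime again is usually the first thing to try.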


##Imports & Setup

In [3]:
# 🧠 Imports and basic setup

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

import spacy

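# 🔽 Download NLTK resources (run once)
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that word_tokenize looks for in newer NLTK releases (3.9 and later)
nltk.download('stopwords')
nltk.download('wordnet')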

# 🔁 Load spaCy model
nlp = spacy.load("en_core_web_sm")

# ✅ Prepare objects
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print("Setup done ✅")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Setup done ✅


##Sample Text

In [4]:
# 📝 Sample text (you can change this to any text you want)

text = """
Natural Language Processing (NLP) is a field of Artificial Intelligence.
It helps computers understand, interpret, and generate human language.
Students are studying NLP to build chatbots, translators, and more!
"""

print("Original Text:")
print(text)


Original Text:

Natural Language Processing (NLP) is a field of Artificial Intelligence.
It helps computers understand, interpret, and generate human language.
Students are studying NLP to build chatbots, translators, and more!



##Basic Cleaning Function

In [5]:
# 🧹 Step 1: Basic cleaning (lowercase, remove URLs, numbers, punctuation)

def basic_cleaning(text: str) -> str:
    # 1. Lowercase
    text = text.lower()

    # 2. Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)

    # 3. Remove numbers
    text = re.sub(r'\d+', '', text)

    # 4. Remove punctuation (keep only letters and spaces)
    text = re.sub(r'[^a-z\s]', ' ', text)

    # 5. Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text

cleaned_text = basic_cleaning(text)
print("Cleaned Text:")
print(cleaned_text)


Cleaned Text:
natural language processing nlp is a field of artificial intelligence it helps computers understand interpret and generate human language students are studying nlp to build chatbots translators and more
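
The sample text is fairly clean, so it does not show what the cleaning steps do to URLs, digits, contractions, or accented letters. The sketch below repeats the same five steps (so that it runs on its own) and applies them to three made-up sentences; the sentences and the `samples` list are illustrative assumptions, not part of the notebook.

```python
import re

def basic_cleaning(text: str) -> str:
    """Same steps as the notebook's function, repeated so this cell runs alone."""
    text = text.lower()                          # 1. lowercase
    text = re.sub(r'http\S+|www\S+', '', text)   # 2. drop URLs
    text = re.sub(r'\d+', '', text)              # 3. drop digits
    text = re.sub(r'[^a-z\s]', ' ', text)        # 4. anything not a-z or whitespace becomes a space
    return re.sub(r'\s+', ' ', text).strip()     # 5. collapse whitespace

samples = [
    "Visit https://example.com or www.example.org for 3 easy-to-use NLP tools!",
    "I don't like version 2.0",
    "Café prices rose 10%",
]
for s in samples:
    print(repr(basic_cleaning(s)))
# 'visit or for easy to use nlp tools'
# 'i don t like version'
# 'caf prices rose'
```

The last two lines show the cost of this simple approach: contractions are split at the apostrophe and accented letters are dropped, which may or may not be acceptable depending on the text being processed.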


##Tokenization

In [8]:
# ✂️ Step 2: Tokenization (split into words)

def tokenize(text: str):
    return word_tokenize(text)

tokens = tokenize(cleaned_text)
print("Tokens:")
print(tokens)
print("\nNumber of tokens:", len(tokens))


Tokens:
['natural', 'language', 'processing', 'nlp', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'it', 'helps', 'computers', 'understand', 'interpret', 'and', 'generate', 'human', 'language', 'students', 'are', 'studying', 'nlp', 'to', 'build', 'chatbots', 'translators', 'and', 'more']

Number of tokens: 29
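
Because the cleaned text contains only lowercase letters and single spaces, most of the work is already done before `word_tokenize` is called. As a rough comparison, the sketch below tokenizes with the standard library alone. The function names `simple_tokenize` and `regex_tokenize` are invented for this example, and NLTK's tokenizer may still split some forms differently (contractions, for instance), so the results are not guaranteed to match on every input.

```python
import re

def simple_tokenize(text):
    """Split on runs of whitespace; enough for text that is already cleaned."""
    return text.split()

def regex_tokenize(text):
    """Keep alphabetic runs only, so it also works on text that was not cleaned."""
    return re.findall(r"[a-z]+", text.lower())

cleaned = (
    "natural language processing nlp is a field of artificial intelligence "
    "it helps computers understand interpret and generate human language "
    "students are studying nlp to build chatbots translators and more"
)
print(len(simple_tokenize(cleaned)))                   # 29, same count as above
print(regex_tokenize("Students are studying NLP!"))    # ['students', 'are', 'studying', 'nlp']
```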


##Stopword Removal

In [9]:
# 🚫 Step 3: Remove stopwords (words like: is, the, and, of...)

def remove_stopwords(tokens):
    filtered = [w for w in tokens if w not in stop_words]
    return filtered

tokens_no_sw = remove_stopwords(tokens)
print("Tokens after stopword removal:")
print(tokens_no_sw)
print("\nNumber of tokens after stopword removal:", len(tokens_no_sw))


Tokens after stopword removal:
['natural', 'language', 'processing', 'nlp', 'field', 'artificial', 'intelligence', 'helps', 'computers', 'understand', 'interpret', 'generate', 'human', 'language', 'students', 'studying', 'nlp', 'build', 'chatbots', 'translators']

Number of tokens after stopword removal: 20
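
Stopword removal is a set-membership filter, so the set itself can be adjusted to the task. NLTK's English list includes negations such as "not", which can matter for tasks like sentiment analysis. The sketch below uses a tiny hand-written set (`toy_stop_words`, an assumption standing in for the NLTK list) to show how removing the negations from the stopword set keeps them in the output.

```python
# A tiny stand-in for the NLTK list so the example runs without downloads.
toy_stop_words = {"this", "is", "a", "the", "not", "no", "nor"}

def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in the given stopword set."""
    return [w for w in tokens if w not in stop_words]

tokens = ["this", "movie", "is", "not", "good"]

# Default behaviour: the negation disappears and the meaning flips.
print(remove_stopwords(tokens, toy_stop_words))           # ['movie', 'good']

# Keep negations by subtracting them from the stopword set.
keep = {"not", "no", "nor"}
print(remove_stopwords(tokens, toy_stop_words - keep))    # ['movie', 'not', 'good']
```

With the notebook's objects, the same idea would be `stop_words - {"not", "no", "nor"}`.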


##Stemming

In [10]:
# 🌱 Step 4: Stemming (strip suffixes to get a crude root; the result is often not a real word)

def apply_stemming(tokens):
    stemmed = [stemmer.stem(w) for w in tokens]
    return stemmed

stemmed_tokens = apply_stemming(tokens_no_sw)
print("Stemmed tokens:")
print(stemmed_tokens)


Stemmed tokens:
['natur', 'languag', 'process', 'nlp', 'field', 'artifici', 'intellig', 'help', 'comput', 'understand', 'interpret', 'gener', 'human', 'languag', 'student', 'studi', 'nlp', 'build', 'chatbot', 'translat']
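
Stems such as 'studi' and 'languag' appear because a stemmer applies suffix-rewriting rules instead of looking words up in a dictionary. The toy stemmer below is not the Porter algorithm; its four rules and the `min_stem` parameter are invented purely to show the mechanism and the kind of mistakes it produces.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def naive_stem(word, min_stem=3):
    """Strip one suffix if at least `min_stem` characters remain before it."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["studying", "studies", "computers", "helps", "process", "is"]:
    print(w, "->", naive_stem(w))
# studying -> study
# studies -> studi
# computers -> computer
# helps -> help
# process -> proces
# is -> is
```

The 'proces' result is an example of over-stemming; real stemmers such as `PorterStemmer` carry many more rules and conditions to reduce errors like this, but they still return stems rather than dictionary words.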


##Lemmatization (NLTK + spaCy)

In [11]:
# 🌳 Step 5: Lemmatization (get proper dictionary form of words)

# Option A: NLTK lemmatizer (treats every word as a noun by default, so verb forms like 'studying' stay unchanged)
def apply_lemmatization_nltk(tokens):
    lemmas = [lemmatizer.lemmatize(w) for w in tokens]
    return lemmas

lemmatized_tokens_nltk = apply_lemmatization_nltk(tokens_no_sw)
print("Lemmatized tokens (NLTK):")
print(lemmatized_tokens_nltk)


Lemmatized tokens (NLTK):
['natural', 'language', 'processing', 'nlp', 'field', 'artificial', 'intelligence', 'help', 'computer', 'understand', 'interpret', 'generate', 'human', 'language', 'student', 'studying', 'nlp', 'build', 'chatbots', 'translator']
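
The spaCy model loaded during setup can also produce lemmas. The function below is a minimal sketch of such an option, not the notebook author's code: the name `apply_lemmatization_spacy` is made up, and the pipeline is passed in as a parameter so the function does not depend on a global. It relies only on calling the pipeline on a string and reading each token's `lemma_` attribute.

```python
def apply_lemmatization_spacy(tokens, nlp):
    """Lemmatize tokens with a loaded spaCy pipeline (for example the notebook's `nlp`)."""
    # spaCy works on running text, so the tokens are joined back into one string.
    doc = nlp(" ".join(tokens))
    # Skip whitespace-only tokens; return the lemma of everything else.
    return [tok.lemma_ for tok in doc if not tok.is_space]
```

In this notebook it could be called as `apply_lemmatization_spacy(tokens_no_sw, nlp)`. Two caveats apply: spaCy re-tokenizes the joined string, so the number of lemmas may differ from the number of input tokens, and its lemmas depend on part-of-speech tagging, which may be less reliable on text whose stopwords have already been removed.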


##Full Pipeline Function

In [12]:
# 🔗 Full preprocessing pipeline: from raw text to all stages

def preprocess_pipeline(raw_text: str):
    print("üîπ Original Text:")
    print(raw_text)
    print("-" * 80)

    cleaned = basic_cleaning(raw_text)
    print("üîπ After Cleaning:")
    print(cleaned)
    print("-" * 80)

    tokens = tokenize(cleaned)
    print("üîπ Tokens:")
    print(tokens)
    print("Number of tokens:", len(tokens))
    print("-" * 80)

    tokens_no_sw = remove_stopwords(tokens)
    print("üîπ After Stopword Removal:")
    print(tokens_no_sw)
    print("Number of tokens (no stopwords):", len(tokens_no_sw))
    print("-" * 80)

    stemmed = apply_stemming(tokens_no_sw)
    print("üîπ After Stemming:")
    print(stemmed)
    print("-" * 80)

    lemmas_nltk = apply_lemmatization_nltk(tokens_no_sw)
    print("üîπ After Lemmatization (NLTK):")
    print(lemmas_nltk)
    print("-" * 80)

    # Return data if you want to use it later
    return {
        "cleaned": cleaned,
        "tokens": tokens,
        "tokens_no_sw": tokens_no_sw,
        "stemmed": stemmed,
        "lemmas_nltk": lemmas_nltk
    }

# ▶️ Run the pipeline on our sample text
results = preprocess_pipeline(text)


🔹 Original Text:

Natural Language Processing (NLP) is a field of Artificial Intelligence.
It helps computers understand, interpret, and generate human language.
Students are studying NLP to build chatbots, translators, and more!

--------------------------------------------------------------------------------
🔹 After Cleaning:
natural language processing nlp is a field of artificial intelligence it helps computers understand interpret and generate human language students are studying nlp to build chatbots translators and more
--------------------------------------------------------------------------------
🔹 Tokens:
['natural', 'language', 'processing', 'nlp', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'it', 'helps', 'computers', 'understand', 'interpret', 'and', 'generate', 'human', 'language', 'students', 'are', 'studying', 'nlp', 'to', 'build', 'chatbots', 'translators', 'and', 'more']
Number of tokens: 29
-------------------------------------------------------
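
Since the pipeline returns every stage in a dictionary, the stages can be compared directly instead of being read off the printed output. The sketch below lines up stems and lemmas and keeps only the words where they disagree. To stay runnable on its own it hard-codes the three lists from the stemming and lemmatization outputs shown earlier; inside the notebook they would come from `results["tokens_no_sw"]`, `results["stemmed"]`, and `results["lemmas_nltk"]`. The helper name `compare_stages` is an assumption.

```python
tokens_no_sw = ['natural', 'language', 'processing', 'nlp', 'field', 'artificial',
                'intelligence', 'helps', 'computers', 'understand', 'interpret',
                'generate', 'human', 'language', 'students', 'studying', 'nlp',
                'build', 'chatbots', 'translators']
stemmed = ['natur', 'languag', 'process', 'nlp', 'field', 'artifici', 'intellig',
           'help', 'comput', 'understand', 'interpret', 'gener', 'human', 'languag',
           'student', 'studi', 'nlp', 'build', 'chatbot', 'translat']
lemmas_nltk = ['natural', 'language', 'processing', 'nlp', 'field', 'artificial',
               'intelligence', 'help', 'computer', 'understand', 'interpret',
               'generate', 'human', 'language', 'student', 'studying', 'nlp',
               'build', 'chatbots', 'translator']

def compare_stages(tokens, stems, lemmas):
    """Return (token, stem, lemma) rows where the stem and the lemma differ."""
    return [(t, s, l) for t, s, l in zip(tokens, stems, lemmas) if s != l]

rows = compare_stages(tokens_no_sw, stemmed, lemmas_nltk)
for token, stem, lemma in rows:
    print(f"{token:<14}{stem:<12}{lemma}")
print(len(rows), "of", len(tokens_no_sw), "tokens differ")   # 11 of 20 tokens differ
```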

##Try Your Own Text

In [13]:
# ✍️ Test with your own sentence or paragraph

my_text = input("Write any English sentence or paragraph:\n")

_ = preprocess_pipeline(my_text)


Write any English sentence or paragraph:
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
🔹 Original Text:
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
--------------------------------------------------------------------------------
🔹 After Cleaning:
nltk is a leading platform for building pyt
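
The `input()` call waits for a user, so the cell stalls or fails when the notebook is run non-interactively, and an empty answer sends an empty string through the pipeline. One possible guard is sketched below; `read_text`, its `reader` parameter, and `DEFAULT_TEXT` are hypothetical names introduced for this example.

```python
DEFAULT_TEXT = "Students are studying NLP to build chatbots."

def read_text(prompt, default=DEFAULT_TEXT, reader=input):
    """Ask for text; fall back to `default` on empty input or when no input is available."""
    try:
        entered = reader(prompt).strip()
    except EOFError:
        # Raised by input() when there is nothing to read, e.g. in a batch run.
        entered = ""
    return entered or default
```

In the notebook, `my_text = read_text("Write any English sentence or paragraph:\n")` could replace the direct `input()` call before passing `my_text` to `preprocess_pipeline`.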