Q1. Text normalization, tokenization, stopword removal & frequency
Below we:

Define a 5–6 sentence paragraph on “technology”

Convert to lowercase & strip punctuation

Tokenize into sentences & words

Remove English stopwords

Compute and display word frequency (excluding stopwords)


In [1]:
import nltk
nltk.download('punkt')       # For tokenization
nltk.download('stopwords')   # For stopword removal
nltk.download('wordnet')     # For lemmatization
nltk.download('punkt_tab')   # Punkt tokenizer tables used by recent NLTK releases

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [2]:
# Q1: setup
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from collections import Counter
import string

# If running first time, you may need:
# nltk.download('punkt')
# nltk.download('stopwords')

# 1. Define paragraph
text = """
Over the past few years, artificial intelligence has dramatically changed how we engage with modern technology.
From digital assistants that handle casual conversations to autonomous vehicles maneuvering through traffic, the influence is significant.
Programmers are striving to create more intelligent systems that can adapt and learn instantly.
At the same time, issues related to data security and the responsible use of AI are becoming increasingly important.
It's evident that advancements in machine learning and deep neural networks will define the coming decade.
"""

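# 2. Lowercase & remove punctuation
# Note: apostrophes are stripped too, so "It's" becomes "its" (a stopword, dropped in step 4)
clean = text.lower().translate(str.maketrans('', '', string.punctuation))

# 3. Tokenize
# Caveat: the sentence-final periods are already gone, so sent_tokenize finds no
# boundaries in `clean` and returns the whole paragraph as one sentence.
# Passing `text` instead of `clean` should recover the five original sentences.
sentences = sent_tokenize(clean)
words = word_tokenize(clean)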

# 4. Remove stopwords
stops = set(stopwords.words('english'))
filtered = [w for w in words if w not in stops and w.isalpha()]

# 5. Word frequency
freq = Counter(filtered)
print("Sentence tokens:", sentences)
print("Word tokens:", words)
print("Filtered words:", filtered)
print("Word Frequency Distribution:")
for word, count in freq.most_common():
    print(f"{word}: {count}")


Sentence tokens: ['\nover the past few years artificial intelligence has dramatically changed how we engage with modern technology\nfrom digital assistants that handle casual conversations to autonomous vehicles maneuvering through traffic the influence is significant\nprogrammers are striving to create more intelligent systems that can adapt and learn instantly\nat the same time issues related to data security and the responsible use of ai are becoming increasingly important\nits evident that advancements in machine learning and deep neural networks will define the coming decade']
Word tokens: ['over', 'the', 'past', 'few', 'years', 'artificial', 'intelligence', 'has', 'dramatically', 'changed', 'how', 'we', 'engage', 'with', 'modern', 'technology', 'from', 'digital', 'assistants', 'that', 'handle', 'casual', 'conversations', 'to', 'autonomous', 'vehicles', 'maneuvering', 'through', 'traffic', 'the', 'influence', 'is', 'significant', 'programmers', 'are', 'striving', 'to', 'create', '
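
In the output above, "Sentence tokens" holds a single string because the punctuation was removed before sent_tokenize ran. A minimal sketch of the alternative order (split into sentences first, normalize afterwards) is shown below. It uses only the standard library: the regex is a rough stand-in for sent_tokenize, not an equivalent, and the `paragraph` string and variable names are made up for this illustration.

```python
import re
import string

# Hypothetical three-sentence sample (not the notebook's paragraph)
paragraph = ("AI is changing technology. It's evident that models learn fast! "
             "Will security keep up?")

# Split into sentences first, while the end-of-sentence punctuation is still present.
# Splitting on whitespace that follows ., ! or ? is a simplification of what
# nltk's sent_tokenize does.
sentences = re.split(r'(?<=[.!?])\s+', paragraph.strip())

# Only then lowercase and strip punctuation, one sentence at a time.
strip_punct = str.maketrans('', '', string.punctuation)
normalized = [s.lower().translate(strip_punct) for s in sentences]

print(len(sentences))
print(normalized)
```

With this order the sentence boundaries survive normalization, so per-sentence statistics remain possible later on.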

Q2. Stemming vs Lemmatization
Using the filtered word list from Q1:

Porter Stemmer

Lancaster Stemmer

WordNet Lemmatizer

Compare outputs side by side

In [3]:
# Q2: stemming & lemmatization
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer


porter = PorterStemmer()
lancaster = LancasterStemmer()
lemmatizer = WordNetLemmatizer()

results = []
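for w in filtered:
    results.append({
        'original': w,
        'porter': porter.stem(w),
        'lancaster': lancaster.stem(w),
        # lemmatize() treats every word as a noun unless pos= is given,
        # which is why verb forms such as "changed" come back unchanged
        'lemma': lemmatizer.lemmatize(w)
    })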

# Display in a neat table
import pandas as pd
df2 = pd.DataFrame(results)
print(df2)


         original        porter  lancaster         lemma
0            past          past       past          past
1           years          year       year          year
2      artificial      artifici        art    artificial
3    intelligence      intellig   intellig  intelligence
4    dramatically        dramat       dram  dramatically
5         changed         chang      chang       changed
6          engage         engag        eng        engage
7          modern        modern     modern        modern
8      technology     technolog  technolog    technology
9         digital         digit      digit       digital
10     assistants        assist     assist     assistant
11         handle         handl      handl        handle
12         casual        casual        cas        casual
13  conversations       convers    convers  conversation
14     autonomous       autonom    autonom    autonomous
15       vehicles        vehicl      vehic       vehicle
16    maneuvering        maneuv
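
The table shows stems that are not dictionary words ("chang", "vehicl") next to lemmas that are. The toy sketch below illustrates why: a stemmer chops suffixes by rule, while a lemmatizer looks words up. It is not the Porter or Lancaster algorithm and not WordNet. `SUFFIXES`, `toy_stem`, `TOY_LEMMAS`, and `toy_lemmatize` are hypothetical names invented for this example, and the rule list is deliberately tiny.

```python
# Toy illustration only: real stemmers apply many ordered rules with conditions.
SUFFIXES = ("ations", "ation", "ing", "ed", "ly", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# A lemmatizer maps to dictionary forms; here the "dictionary" is three entries.
TOY_LEMMAS = {"changed": "change", "vehicles": "vehicle", "years": "year"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for w in ["years", "changed", "vehicles", "conversations"]:
    print(f"{w:15} stem={toy_stem(w):10} lemma={toy_lemmatize(w)}")
```

Rule-based stripping is fast and needs no vocabulary, but it can return fragments; lookup returns real words, but only for entries it knows.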

Q3. Regular Expressions & Text Splitting
Starting from the original (un-normalized) text, which still has its case and punctuation:

Extract with regex:

Words > 5 letters

Numbers (if any)

Capitalized words

Using split:

Words containing only alphabets

Words starting with a vowel

In [4]:
# Q3: regex & splitting
import re

orig = text  # use the original (with punctuation & case)

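# a. Words longer than 5 letters (\w also matches digits and underscores)
long_words = re.findall(r'\b\w{6,}\b', orig)
# b. Numbers (integers or decimals)
numbers = re.findall(r'\d+(?:\.\d+)?', orig)
# c. Capitalized words: one uppercase letter followed by lowercase letters,
#    so all-caps "AI" is skipped and "It's" contributes "It"
capitalized = re.findall(r'\b[A-Z][a-z]+\b', orig)

# d. Alphabetic-only words (re.findall is used here in place of str.split;
#    "It's" is split into "It" and "s")
alpha_words = re.findall(r'\b[A-Za-z]+\b', orig)
# e. Words starting with a vowel
vowel_words = [w for w in alpha_words if re.match(r'^[AEIOUaeiou]', w)]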

print("Words >5 letters:", long_words)
print("Numbers:", numbers)
print("Capitalized words:", capitalized)
print("Alpha-only words:", alpha_words)
print("Words starting with vowel:", vowel_words)

Words >5 letters: ['artificial', 'intelligence', 'dramatically', 'changed', 'engage', 'modern', 'technology', 'digital', 'assistants', 'handle', 'casual', 'conversations', 'autonomous', 'vehicles', 'maneuvering', 'through', 'traffic', 'influence', 'significant', 'Programmers', 'striving', 'create', 'intelligent', 'systems', 'instantly', 'issues', 'related', 'security', 'responsible', 'becoming', 'increasingly', 'important', 'evident', 'advancements', 'machine', 'learning', 'neural', 'networks', 'define', 'coming', 'decade']
Numbers: []
Capitalized words: ['Over', 'From', 'Programmers', 'At', 'It']
Alpha-only words: ['Over', 'the', 'past', 'few', 'years', 'artificial', 'intelligence', 'has', 'dramatically', 'changed', 'how', 'we', 'engage', 'with', 'modern', 'technology', 'From', 'digital', 'assistants', 'that', 'handle', 'casual', 'conversations', 'to', 'autonomous', 'vehicles', 'maneuvering', 'through', 'traffic', 'the', 'influence', 'is', 'significant', 'Programmers', 'are', 'strivin
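
The task list mentions split, while the cell above relies on re.findall for the alphabetic-only words. A minimal sketch of a str.split-based variant follows. The `sample` sentence is hypothetical, chosen so that a contraction, a hyphenated word, and a number appear; the variable names are likewise made up for this example.

```python
import string

# Hypothetical sample with a contraction, a hyphenated word, and a number
sample = "It's evident that AI-driven systems, built in 2024, will define the decade."

# Split on whitespace, then trim punctuation from both ends of each piece
pieces = [p.strip(string.punctuation) for p in sample.split()]

# Keep only pieces made entirely of letters ("It's", "AI-driven", "2024" drop out)
alpha_only = [p for p in pieces if p.isalpha()]

# Of those, keep the ones that start with a vowel
vowel_start = [p for p in alpha_only if p[0].lower() in "aeiou"]

print(alpha_only)
print(vowel_start)
```

Unlike the `\b[A-Za-z]+\b` pattern, this variant discards "It's" entirely instead of splitting it into "It" and "s"; which behavior is preferable depends on the task.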

Q4. Custom Tokenizer & Regex-based Cleaning
Write a tokenizer that:

Keeps contractions together (isn't)

Treats hyphens as part of words (state-of-the-art)

Separates numbers (but keeps decimals together)

Use re.sub to:

Replace emails → <EMAIL>

Replace URLs → <URL>

Replace phone nos. (123-456-7890 or +91 9876543210) → <PHONE>

In [5]:
# Q4: custom tokenizer & cleaning
import re

def custom_tokenize(s):
    # Replace emails, URLs, and phone numbers with placeholders before tokenizing
    s = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '<EMAIL>', s)
    s = re.sub(r'https?://\S+|www\.\S+', '<URL>', s)
    # Phones: 123-456-7890, or an optional + and country code followed by 6-10 digits
    # (the earlier single pattern could not match the 123-456-7890 form)
    s = re.sub(r'\b\d{3}-\d{3}-\d{4}\b|\+?\d{1,3}[\s-]\d{6,10}', '<PHONE>', s)
    # One pattern, alternatives tried left to right:
    #   placeholders | numbers (decimals kept whole) | words with inner ' or -
    return re.findall(r"<(?:EMAIL|URL|PHONE)>|\d+(?:\.\d+)?|[A-Za-z]+(?:['-][A-Za-z]+)*", s)

sample = text.strip()
tokens_q4 = custom_tokenize(sample)
print("Custom tokens:", tokens_q4)

Custom tokens: ['Over', 'the', 'past', 'few', 'years', 'artificial', 'intelligence', 'has', 'dramatically', 'changed', 'how', 'we', 'engage', 'with', 'modern', 'technology', 'From', 'digital', 'assistants', 'that', 'handle', 'casual', 'conversations', 'to', 'autonomous', 'vehicles', 'maneuvering', 'through', 'traffic', 'the', 'influence', 'is', 'significant', 'Programmers', 'are', 'striving', 'to', 'create', 'more', 'intelligent', 'systems', 'that', 'can', 'adapt', 'and', 'learn', 'instantly', 'At', 'the', 'same', 'time', 'issues', 'related', 'to', 'data', 'security', 'and', 'the', 'responsible', 'use', 'of', 'AI', 'are', 'becoming', 'increasingly', 'important', "It's", 'evident', 'that', 'advancements', 'in', 'machine', 'learning', 'and', 'deep', 'neural', 'networks', 'will', 'define', 'the', 'coming', 'decade']
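
The paragraph from Q1 contains no emails, URLs, phone numbers, decimals, or hyphenated words, so the output above exercises only the contraction rule ("It's"). The sketch below runs the same kind of cleaning on a made-up string that covers the remaining cases. `clean_and_tokenize`, the `demo` text, and the addresses in it are hypothetical, and the patterns are illustrative rather than robust validators.

```python
import re

def clean_and_tokenize(s):
    """Mask emails, URLs, and phone numbers, then tokenize (illustrative patterns)."""
    s = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '<EMAIL>', s)
    s = re.sub(r'https?://\S+|www\.\S+', '<URL>', s)
    s = re.sub(r'\b\d{3}-\d{3}-\d{4}\b|\+?\d{1,3}[\s-]\d{6,10}', '<PHONE>', s)
    # Placeholders first, then numbers (decimals whole), then words that may
    # contain inner apostrophes or hyphens
    return re.findall(
        r"<(?:EMAIL|URL|PHONE)>|\d+(?:\.\d+)?|[A-Za-z]+(?:['-][A-Za-z]+)*", s)

# Hypothetical test string covering each rule from the task list
demo = ("Email jane.doe@example.com or visit https://example.com/docs today. "
        "Call 123-456-7890 or +91 9876543210. "
        "This state-of-the-art model isn't cheap: it costs 3.14 units for 2 runs.")

tokens = clean_and_tokenize(demo)
print(tokens)
```

Masking before tokenizing keeps the dots and hyphens inside addresses and phone numbers from being split into fragments. Note that the URL pattern consumes everything up to the next whitespace, so a URL followed directly by a period would absorb that period.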
