Q1. Write a unique paragraph (5-6 sentences) about your favorite topic (e.g., sports,
technology, food, or books), then:
1. Convert text to lowercase and remove punctuation.
2. Tokenize the text into words and sentences.
3. Remove stopwords (using NLTK's stopwords list).
4. Display word frequency distribution (excluding stopwords).

In [5]:
import nltk
import string
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist


nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')

text = """Technology has always fascinated me, especially how it keeps evolving to make life easier. 
From smartphones and smartwatches to artificial intelligence and robotics, the possibilities seem endless. 
I enjoy learning about the latest gadgets and how they’re built to solve real-world problems. 
The pace of innovation in this field is incredible, and it’s exciting to imagine what the future holds. 
Whether it’s a new app or a groundbreaking invention, technology never fails to amaze me."""

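# Step 1: lowercase and strip punctuation. string.punctuation is ASCII-only,
# so the typographic apostrophe in "they’re" and "it’s" is not removed.
text_lower_nopunct = text.lower().translate(str.maketrans('', '', string.punctuation))

# Step 2: tokenize into words and sentences. The full stops are already gone
# at this point, so sent_tokenize returns the paragraph as a single sentence.
words = word_tokenize(text_lower_nopunct)
sentences = sent_tokenize(text_lower_nopunct)

# Step 3: drop NLTK's English stopwords.
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]

# Step 4: frequency distribution over the remaining words.
fdist = FreqDist(filtered_words)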

print("Tokenized Sentences:")
print(sentences)
print("\nFiltered Words (no stopwords):")
print(filtered_words)
print("\nWord Frequency Distribution:")
print(fdist.most_common())


Tokenized Sentences:
['technology has always fascinated me especially how it keeps evolving to make life easier \nfrom smartphones and smartwatches to artificial intelligence and robotics the possibilities seem endless \ni enjoy learning about the latest gadgets and how they’re built to solve realworld problems \nthe pace of innovation in this field is incredible and it’s exciting to imagine what the future holds \nwhether it’s a new app or a groundbreaking invention technology never fails to amaze me']

Filtered Words (no stopwords):
['technology', 'always', 'fascinated', 'especially', 'keeps', 'evolving', 'make', 'life', 'easier', 'smartphones', 'smartwatches', 'artificial', 'intelligence', 'robotics', 'possibilities', 'seem', 'endless', 'enjoy', 'learning', 'latest', 'gadgets', '’', 'built', 'solve', 'realworld', 'problems', 'pace', 'innovation', 'field', 'incredible', '’', 'exciting', 'imagine', 'future', 'holds', 'whether', '’', 'new', 'app', 'groundbreaking', 'invention', 'techno

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Aashishsharma\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Aashishsharma\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Aashishsharma\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
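
The output above shows three side effects of cleaning before tokenizing: the sentence list has a single entry because the full stops were removed before `sent_tokenize` ran, the typographic apostrophe `’` survives as its own token because `string.punctuation` covers ASCII punctuation only, and `real-world` collapses to `realworld`. The following is a minimal standard-library sketch of one way to reorder the steps; it is not the NLTK pipeline used in the cell. The regex sentence splitter, the tiny `STOP_WORDS` set, and the helper names are assumptions made for illustration, standing in for `sent_tokenize` and NLTK's stopword list.

```python
import re
import string
from collections import Counter

# Hypothetical stand-in for NLTK's English stopword list (illustration only).
# Contractions lose their apostrophe during cleaning, so the stripped forms
# ("its", "theyre") are listed here.
STOP_WORDS = {"a", "about", "and", "how", "i", "it", "its", "me", "or",
              "the", "theyre", "to"}

# ASCII punctuation plus the typographic quotes that string.punctuation lacks.
PUNCT_TABLE = str.maketrans("", "", string.punctuation + "‘’“”")


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def clean_words(sentence):
    """Lowercase, turn hyphens into spaces, strip punctuation, split on whitespace."""
    lowered = sentence.lower().replace("-", " ")
    return lowered.translate(PUNCT_TABLE).split()


# Third and fifth sentences of the paragraph from the cell above.
sample = (
    "I enjoy learning about the latest gadgets and how they’re built to solve "
    "real-world problems. Whether it’s a new app or a groundbreaking invention, "
    "technology never fails to amaze me."
)

# Split into sentences first, while the full stops are still present.
sentences = split_sentences(sample)
words = [w for s in sentences for w in clean_words(s)]
filtered = [w for w in words if w not in STOP_WORDS]
freq = Counter(filtered)

print(len(sentences))
print(filtered)
print(freq.most_common(3))
```

With NLTK available, the same ordering would presumably apply: call `sent_tokenize` on the original text, then clean and word-tokenize each sentence.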


Q2. Stemming and Lemmatization
1. Take the tokenized words from Question 1 (after stopword removal).
2. Apply stemming using NLTK's PorterStemmer and LancasterStemmer.
3. Apply lemmatization using NLTK's WordNetLemmatizer.
4. Compare and display results of both techniques.

In [4]:
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

# WordNet data is required by the lemmatizer.
nltk.download('wordnet')
nltk.download('omw-1.4')

porter = PorterStemmer()
lancaster = LancasterStemmer()
lemmatizer = WordNetLemmatizer()

# filtered_words comes from Question 1 (after stopword removal).
porter_stems = [porter.stem(word) for word in filtered_words]
lancaster_stems = [lancaster.stem(word) for word in filtered_words]
# lemmatize() treats every word as a noun unless a pos argument is given.
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]

print("Original Words (after stopword removal):")
print(filtered_words)
print("\nPorter Stemmer:")
print(porter_stems)
print("\nLancaster Stemmer:")
print(lancaster_stems)
print("\nWordNet Lemmatizer:")
print(lemmatized_words)


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Aashishsharma\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Aashishsharma\AppData\Roaming\nltk_data...


Original Words (after stopword removal):
['technology', 'always', 'fascinated', 'especially', 'keeps', 'evolving', 'make', 'life', 'easier', 'smartphones', 'smartwatches', 'artificial', 'intelligence', 'robotics', 'possibilities', 'seem', 'endless', 'enjoy', 'learning', 'latest', 'gadgets', '’', 'built', 'solve', 'realworld', 'problems', 'pace', 'innovation', 'field', 'incredible', '’', 'exciting', 'imagine', 'future', 'holds', 'whether', '’', 'new', 'app', 'groundbreaking', 'invention', 'technology', 'never', 'fails', 'amaze']

Porter Stemmer:
['technolog', 'alway', 'fascin', 'especi', 'keep', 'evolv', 'make', 'life', 'easier', 'smartphon', 'smartwatch', 'artifici', 'intellig', 'robot', 'possibl', 'seem', 'endless', 'enjoy', 'learn', 'latest', 'gadget', '’', 'built', 'solv', 'realworld', 'problem', 'pace', 'innov', 'field', 'incred', '’', 'excit', 'imagin', 'futur', 'hold', 'whether', '’', 'new', 'app', 'groundbreak', 'invent', 'technolog', 'never', 'fail', 'amaz']

Lancaster Stemmer:
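
For step 4, a side-by-side view makes the differences easier to read than separate lists. The sketch below is a minimal standard-library helper, assuming the stem and lemma lists are aligned index by index with `filtered_words`. `comparison_rows` and `changed_count` are hypothetical names, and the sample values are copied from the first six entries of the original-word and Porter outputs above rather than recomputed.

```python
def comparison_rows(words, **variants):
    """Return one formatted row per word, with a column for each variant list."""
    names = ["original", *variants]
    columns = [words, *variants.values()]
    # Every list must line up with the original words, index by index.
    if any(len(col) != len(words) for col in columns):
        raise ValueError("all lists must have the same length")
    width = max(len(str(item)) for col in columns for item in col)
    width = max(width, max(len(n) for n in names)) + 2
    header = "".join(name.ljust(width) for name in names).rstrip()
    rows = ["".join(item.ljust(width) for item in row).rstrip()
            for row in zip(*columns)]
    return [header, *rows]


def changed_count(words, variant):
    """Count how many entries differ from the original word."""
    return sum(1 for w, v in zip(words, variant) if w != v)


# Sample values copied from the outputs above (first six entries).
sample_words = ["technology", "always", "fascinated", "especially", "keeps", "evolving"]
sample_porter = ["technolog", "alway", "fascin", "especi", "keep", "evolv"]

for line in comparison_rows(sample_words, porter=sample_porter):
    print(line)
print("changed by Porter:", changed_count(sample_words, sample_porter))
```

In the notebook, the call would take the full lists, for example `comparison_rows(filtered_words, porter=porter_stems, lancaster=lancaster_stems, lemma=lemmatized_words)`.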

Q3. Regular Expressions and Text Splitting
1. Take the original text from Question 1.
2. Use regular expressions to:
a. Extract all words with more than 5 letters.
b. Extract all numbers (if any exist in the text).
c. Extract all capitalized words.
3. Use text splitting techniques to:
a. Split the text into words containing only alphabets (removing digits and special characters).
b. Extract words starting with a vowel.

In [6]:
import re

text = """Technology has always fascinated me, especially how it keeps evolving to make life easier. 
From smartphones and smartwatches to artificial intelligence and robotics, the possibilities seem endless. 
I enjoy learning about the latest gadgets and how they’re built to solve real-world problems. 
The pace of innovation in this field is incredible, and it’s exciting to imagine what the future holds. 
Whether it’s a new app or a groundbreaking invention, technology never fails to amaze me."""
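
# Step 2: regex extraction.
words_gt5 = re.findall(r'\b[a-zA-Z]{6,}\b', text)          # six or more letters
numbers = re.findall(r'\b\d+\b', text)                     # standalone digit runs
capitalized_words = re.findall(r'\b[A-Z][a-z]*\b', text)   # uppercase letter, then lowercase letters

# Step 3: alphabetic runs only. "they’re" and "real-world" each yield two words here.
alphabetic_words = re.findall(r'\b[a-zA-Z]+\b', text)
vowel_words = [word for word in alphabetic_words if re.match(r'^[aeiouAEIOU]', word)]
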
print("Words with more than 5 letters:")
print(words_gt5)
print("\nNumbers in the text:")
print(numbers)
print("\nCapitalized Words:")
print(capitalized_words)
print("\nWords with only alphabets (no digits/special chars):")
print(alphabetic_words)
print("\nWords starting with a vowel:")
print(vowel_words)


Words with more than 5 letters:
['Technology', 'always', 'fascinated', 'especially', 'evolving', 'easier', 'smartphones', 'smartwatches', 'artificial', 'intelligence', 'robotics', 'possibilities', 'endless', 'learning', 'latest', 'gadgets', 'problems', 'innovation', 'incredible', 'exciting', 'imagine', 'future', 'Whether', 'groundbreaking', 'invention', 'technology']

Numbers in the text:
[]

Capitalized Words:
['Technology', 'From', 'I', 'The', 'Whether']

Words with only alphabets (no digits/special chars):
['Technology', 'has', 'always', 'fascinated', 'me', 'especially', 'how', 'it', 'keeps', 'evolving', 'to', 'make', 'life', 'easier', 'From', 'smartphones', 'and', 'smartwatches', 'to', 'artificial', 'intelligence', 'and', 'robotics', 'the', 'possibilities', 'seem', 'endless', 'I', 'enjoy', 'learning', 'about', 'the', 'latest', 'gadgets', 'and', 'how', 'they', 're', 'built', 'to', 'solve', 'real', 'world', 'problems', 'The', 'pace', 'of', 'innovation', 'in', 'this', 'field', 'is', '
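
The cell above extracts words with `re.findall`. Step 3 asks for splitting, so here is a minimal sketch of the same two results using `re.split` on runs of non-letters. The helper names are hypothetical, and the sample is the third sentence of the paragraph.

```python
import re


def split_alpha_words(text):
    """Split on any run of non-letters and drop the empty strings at the edges."""
    return [piece for piece in re.split(r"[^A-Za-z]+", text) if piece]


def starts_with_vowel(word):
    """True when the first letter is a, e, i, o or u (case-insensitive)."""
    return word[0].lower() in "aeiou"


sample = ("I enjoy learning about the latest gadgets and how they’re built "
          "to solve real-world problems.")

alpha_words = split_alpha_words(sample)
vowel_words = [w for w in alpha_words if starts_with_vowel(w)]

print(alpha_words)
print(vowel_words)
```

As with `findall`, the contraction `they’re` and the hyphenated `real-world` each break into two pieces, because the apostrophe and the hyphen count as separators.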

Q4. Custom Tokenization & Regex-based Text Cleaning
1. Take the original text from Question 1.
2. Write a custom tokenization function that:
a. Removes punctuation and special symbols, but keeps contractions (e.g.,
"isn't" should not be split into "is" and "n't").
b. Handles hyphenated words as a single token (e.g., "state-of-the-art" remains
a single token).
c. Tokenizes numbers separately but keeps decimal numbers intact (e.g., "3.14"
should remain as is).

3. Use Regex Substitutions (re.sub) to:
a. Replace email addresses with '<EMAIL>' placeholder.
b. Replace URLs with '<URL>' placeholder.
c. Replace phone numbers (formats: 123-456-7890 or +91 9876543210) with
'<PHONE>' placeholder.

In [8]:
import re

text = "Contact us at support@example.com or visit our website https://www.example.com for more info. You can also call +91 9876543210 or 123-456-7890. This isn't state-of-the-art, but it's affordable. The value is 3.14."

def clean_text(text):
    # Replace email addresses, URLs and phone numbers with placeholders.
    text = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '<EMAIL>', text)
    text = re.sub(r'https?://\S+', '<URL>', text)
    # (?<!\w) rather than a leading \b: \b cannot match between a space and '+',
    # so the +91 format would otherwise be skipped.
    text = re.sub(r'(?<!\w)(?:\+91\s?\d{10}|(?:\d{3}-){2}\d{4})\b', '<PHONE>', text)
    return text

def custom_tokenize(text):
    text = clean_text(text)
    # Alternatives are tried in order: placeholders, hyphenated words,
    # contractions, decimal numbers, then plain words.
    token_pattern = r"<[A-Z]+>|\b\w+(?:-\w+)+\b|\b\w+'\w+\b|\b\d+\.\d+\b|\b\w+\b"
    tokens = re.findall(token_pattern, text)
    return tokens

tokens = custom_tokenize(text)
print("Tokens:")
print(tokens)


Tokens:
['Contact', 'us', 'at', '<EMAIL>', 'or', 'visit', 'our', 'website', '<URL>', 'for', 'more', 'info', 'You', 'can', 'also', 'call', '<PHONE>', 'or', '<PHONE>', 'This', "isn't", 'state-of-the-art', 'but', "it's", 'affordable', 'The', 'value', 'is', '3.14']
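
One detail worth checking in phone patterns of this kind: `\b` matches only between a word character and a non-word character, so it cannot match between a space and `+`. A pattern that begins with `\b` followed by `\+91` therefore never matches a number written as `+91 9876543210` after a space. The sketch below is a small standard-library check of that behavior, using a negative lookbehind `(?<!\w)` as one possible alternative; the constant names are hypothetical.

```python
import re

# Same alternatives in both patterns; only the leading anchor differs.
PHONE_BODY = r"(?:\+91\s?\d{10}|(?:\d{3}-){2}\d{4})\b"
WITH_WORD_BOUNDARY = re.compile(r"\b" + PHONE_BODY)
WITH_LOOKBEHIND = re.compile(r"(?<!\w)" + PHONE_BODY)

sample = "You can also call +91 9876543210 or 123-456-7890."

# \b needs a word character on one side; " +" has none, so +91... is skipped.
with_boundary = WITH_WORD_BOUNDARY.sub("<PHONE>", sample)

# (?<!\w) only requires that no word character precedes the match.
with_lookbehind = WITH_LOOKBEHIND.sub("<PHONE>", sample)

print(with_boundary)
print(with_lookbehind)
```

The lookbehind still rejects a match that starts in the middle of a word, which is the protection the leading `\b` was presumably meant to provide.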
