Ques 1: Write a unique paragraph (5-6 sentences) about your favorite topic (e.g., sports,
technology, food, books, etc.).
1. Convert text to lowercase and remove punctuation.
2. Tokenize the text into words and sentences.
3. Remove stopwords (using NLTK's stopwords list).
4. Display word frequency distribution (excluding stopwords).

In [27]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.probability import FreqDist

paragraph = """The vastness of space has always fascinated humanity.
               Exploring distant galaxies, discovering exoplanets, and searching for extraterrestrial life drive scientific progress.
               Missions like the James Webb Telescope reveal stunning cosmic phenomena, while Mars rovers hunt for signs of ancient life.
               Private companies like SpaceX aim to make interstellar travel a reality.
               Every breakthrough brings us closer to understanding our place in the universe."""

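# Run once to fetch the tokenizer models and the stopword list
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'):
# nltk.download('punkt')
# nltk.download('stopwords')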

def clean_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('','',string.punctuation))
    return text

cleaned_text = clean_text(paragraph)
print("Lowercase + No Punctuation:\n",cleaned_text)

words=word_tokenize(cleaned_text)
# Sentence splitting relies on the original punctuation, so it runs on the raw paragraph.
sentences=sent_tokenize(paragraph)

print("\n2. Tokenization:")
print("- Words:", words)
print("- Sentences:", sentences)

stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]
print("\n3. After Stopword Removal:\n", filtered_words)
print("\n ",stopwords.words('english'))

fdist = FreqDist(filtered_words)
print("\n4. Word Frequency Distribution:")
for word,freq in fdist.most_common():
    print(f"{word:15}: {freq}")


Lowercase + No Punctuation:
 the vastness of space has always fascinated humanity 
               exploring distant galaxies discovering exoplanets and searching for extraterrestrial life drive scientific progress 
               missions like the james webb telescope reveal stunning cosmic phenomena while mars rovers hunt for signs of ancient life 
               private companies like spacex aim to make interstellar travel a reality 
               every breakthrough brings us closer to understanding our place in the universe

2. Tokenization:
- Words: ['the', 'vastness', 'of', 'space', 'has', 'always', 'fascinated', 'humanity', 'exploring', 'distant', 'galaxies', 'discovering', 'exoplanets', 'and', 'searching', 'for', 'extraterrestrial', 'life', 'drive', 'scientific', 'progress', 'missions', 'like', 'the', 'james', 'webb', 'telescope', 'reveal', 'stunning', 'cosmic', 'phenomena', 'while', 'mars', 'rovers', 'hunt', 'for', 'signs', 'of', 'ancient', 'life', 'private', 'companies', 'lik
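
As a cross-check that needs no NLTK downloads, the same four steps can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: `MINI_STOPWORDS` is a hypothetical hand-picked subset (NLTK's English list is much longer), the second sentence of `sample` is made up for the demo, and whitespace splitting stands in for `word_tokenize`.

```python
import string
from collections import Counter

# Hypothetical mini stopword list, for illustration only.
MINI_STOPWORDS = {"the", "of", "has", "and", "for", "to", "a", "in", "our", "us"}

# First sentence from the paragraph above; the second one is a made-up sample.
sample = ("The vastness of space has always fascinated humanity. "
          "Space missions hunt for signs of life.")

def word_frequencies(text, stop_words):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords, count."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return Counter(tok for tok in tokens if tok not in stop_words)

freqs = word_frequencies(sample, MINI_STOPWORDS)
for word, count in freqs.most_common(3):
    print(f"{word:15}: {count}")
```

`Counter.most_common()` plays the role of `FreqDist.most_common()` here; in this sample "space" is the only word that occurs twice.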

Ques 2: Stemming and Lemmatization
1. Take the tokenized words from Question 1 (after stopword removal).
2. Apply stemming using NLTK's PorterStemmer and LancasterStemmer.
3. Apply lemmatization using NLTK's WordNetLemmatizer.
4. Compare and display results of both techniques.

In [34]:
import nltk
from nltk.stem import PorterStemmer,LancasterStemmer,WordNetLemmatizer

# Stopword-filtered tokens from Question 1, copied here so the cell runs on its own.
filtered_words = [
    'vastness', 'space', 'always', 'fascinated', 'humanity', 'exploring',
    'distant', 'galaxies', 'discovering', 'exoplanets', 'searching',
    'extraterrestrial', 'life', 'drive', 'scientific', 'progress', 'missions',
    'like', 'james', 'webb', 'telescope', 'reveal', 'stunning', 'cosmic',
    'phenomena', 'mars', 'rovers', 'hunt', 'signs', 'ancient', 'life',
    'private', 'companies', 'like', 'spacex', 'aim', 'make', 'interstellar',
    'travel', 'reality', 'every', 'breakthrough', 'brings', 'us', 'closer',
    'understanding', 'place', 'universe'
]

porter = PorterStemmer()
porter_stems = [porter.stem(word) for word in filtered_words]

lancaster = LancasterStemmer()
lancaster_stems = [lancaster.stem(word) for word in filtered_words]

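# nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
# lemmatize() treats every word as a noun unless a pos argument is given,
# so verb forms such as 'fascinated' come back unchanged.
lemmas = [lemmatizer.lemmatize(word) for word in filtered_words]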

print(f"{'Original':<20} {'Porter':<20} {'Lancaster':<20} {'Lemma':<20}")
print("-" * 80)
# Loop variables get their own names so they do not overwrite the stemmer objects.
for original, p_stem, l_stem, lemma in zip(filtered_words, porter_stems, lancaster_stems, lemmas):
    print(f"{original:<20} {p_stem:<20} {l_stem:<20} {lemma:<20}")

Original             Porter               Lancaster            Lemma               
--------------------------------------------------------------------------------
vastness             vast                 vast                 vastness            
space                space                spac                 space               
always               alway                alway                always              
fascinated           fascin               fascin               fascinated          
humanity             human                hum                  humanity            
exploring            explor               expl                 exploring           
distant              distant              dist                 distant             
galaxies             galaxi               galaxy               galaxy              
discovering          discov               discov               discovering         
exoplanets           exoplanet            exoplanet            exoplanets      
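
The comparison asked for in step 4 can also be summarized in numbers. The sketch below uses only the ten rows visible in the table above, copied by hand into a hypothetical `rows` list, and counts how often each technique changed the input word. It does not call NLTK, so it says nothing about words outside those rows.

```python
# Rows copied from the table above: (original, Porter, Lancaster, lemma).
rows = [
    ("vastness",    "vast",      "vast",      "vastness"),
    ("space",       "space",     "spac",      "space"),
    ("always",      "alway",     "alway",     "always"),
    ("fascinated",  "fascin",    "fascin",    "fascinated"),
    ("humanity",    "human",     "hum",       "humanity"),
    ("exploring",   "explor",    "expl",      "exploring"),
    ("distant",     "distant",   "dist",      "distant"),
    ("galaxies",    "galaxi",    "galaxy",    "galaxy"),
    ("discovering", "discov",    "discov",    "discovering"),
    ("exoplanets",  "exoplanet", "exoplanet", "exoplanets"),
]

def changed(column):
    """Return the original words whose value in the given column differs from the input."""
    return [row[0] for row in rows if row[column] != row[0]]

porter_changed = changed(1)
lancaster_changed = changed(2)
lemma_changed = changed(3)
stemmers_agree = [row[0] for row in rows if row[1] == row[2]]

print(f"Porter changed    {len(porter_changed)}/{len(rows)} words")
print(f"Lancaster changed {len(lancaster_changed)}/{len(rows)} words")
print(f"Lemmatizer changed {len(lemma_changed)}/{len(rows)} words: {lemma_changed}")
print(f"Porter and Lancaster agree on {len(stemmers_agree)}/{len(rows)} words")
```

On these rows the stemmers rewrite most words, often into non-words such as "fascin" or "spac", and Lancaster cuts more aggressively than Porter. The lemmatizer only changes "galaxies", because it returns dictionary forms and was called with its default noun part of speech.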

Ques 3: Regular Expressions and Text Splitting
1. Take the original text from Question 1.
2. Use regular expressions to:
a. Extract all words with more than 5 letters.
b. Extract all numbers (if any exist in the text).
c. Extract all capitalized words.
3. Use text splitting techniques to:
a. Split the text into words containing only alphabets (removing digits and special
characters).
b. Extract words starting with a vowel.

In [38]:
import re

text = """The vastness of space has always fascinated humanity.
          Exploring distant galaxies, discovering exoplanets, and searching for extraterrestrial life drive scientific progress.
          Missions like the James Webb Telescope reveal stunning cosmic phenomena, while Mars rovers hunt for signs of ancient life.
          Private companies like SpaceX aim to make interstellar travel a reality.
          Every breakthrough brings us closer to understanding our place in the universe."""

long_words = re.findall(r'\b\w{6,}\b',text)
print("Words with >5 letters:              ",long_words)

numbers = re.findall(r'\b\d+\b',text)
print("Numbers found:                      ",numbers)

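# [A-Z][a-z]+ needs lowercase letters after the capital, so mixed-case names such as 'SpaceX' are skipped.
capitalized_words = re.findall(r'\b[A-Z][a-z]+\b',text)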
print("Capitalized words:                  ",capitalized_words)

clean_words = re.findall(r'\b[a-zA-Z]+\b',text)
print("Alphabetic words only:              ",clean_words)

vowel_words = re.findall(r'\b[aeiouAEIOU][a-zA-Z]*\b',text)
print("Words starting with vowels:         ",vowel_words)

Words with >5 letters:               ['vastness', 'always', 'fascinated', 'humanity', 'Exploring', 'distant', 'galaxies', 'discovering', 'exoplanets', 'searching', 'extraterrestrial', 'scientific', 'progress', 'Missions', 'Telescope', 'reveal', 'stunning', 'cosmic', 'phenomena', 'rovers', 'ancient', 'Private', 'companies', 'SpaceX', 'interstellar', 'travel', 'reality', 'breakthrough', 'brings', 'closer', 'understanding', 'universe']
Numbers found:                       []
Capitalized words:                   ['The', 'Exploring', 'Missions', 'James', 'Webb', 'Telescope', 'Mars', 'Private', 'Every']
Alphabetic words only:               ['The', 'vastness', 'of', 'space', 'has', 'always', 'fascinated', 'humanity', 'Exploring', 'distant', 'galaxies', 'discovering', 'exoplanets', 'and', 'searching', 'for', 'extraterrestrial', 'life', 'drive', 'scientific', 'progress', 'Missions', 'like', 'the', 'James', 'Webb', 'Telescope', 'reveal', 'stunning', 'cosmic', 'phenomena', 'while', 'Mars', 'rover
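
Part 3 asks for splitting, while the cell above relies on `re.findall`. A minimal sketch of the splitting route with `re.split` follows. The sample sentences are assumptions made up for the demo; the year "2021" and the token "rover2" are there only to show how digits are handled.

```python
import re

# Hypothetical sample with a number and punctuation to exercise the split.
sample = "Every breakthrough in 2021 brings us closer to understanding our place!"

# Split on every run of non-letters, then drop the empty strings left at the edges.
words = [w for w in re.split(r"[^A-Za-z]+", sample) if w]

# Keep only the words whose first letter is a vowel (case-insensitive).
vowel_words = [w for w in words if w[0].lower() in "aeiou"]

print("Alphabetic words:", words)
print("Vowel-initial words:", vowel_words)

# The two approaches differ on mixed tokens: splitting keeps the letter part,
# while the \b-anchored findall pattern skips the whole token.
mixed = "rover2 landed"
split_result = [w for w in re.split(r"[^A-Za-z]+", mixed) if w]
findall_result = re.findall(r"\b[a-zA-Z]+\b", mixed)
print(split_result, findall_result)
```

Which behavior is wanted depends on the task: splitting salvages "rover" from "rover2", whereas the `findall` pattern treats the mixed token as not purely alphabetic and drops it.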

Ques 4: Custom Tokenization & Regex-based Text Cleaning
1. Take the original text from Question 1.
2. Write a custom tokenization function that:
a. Removes punctuation and special symbols, but keeps contractions (e.g.,
"isn't" should not be split into "is" and "n't").
b. Handles hyphenated words as a single token (e.g., "state-of-the-art" remains
a single token).
c. Tokenizes numbers separately but keeps decimal numbers intact (e.g., "3.14"
should remain as is).

3. Use Regex Substitutions (re.sub) to:
a. Replace email addresses with '<EMAIL>' placeholder.
b. Replace URLs with '<URL>' placeholder.
c. Replace phone numbers (formats: 123-456-7890 or +91 9876543210) with
'<PHONE>' placeholder.

In [3]:
import re

text = """The vastness of space has always fascinated humanity. Exploring distant galaxies, discovering exoplanets, and searching for extraterrestrial life drive scientific progress.
          Missions like the James Webb Telescope reveal stunning cosmic phenomena, while Mars rovers hunt for signs of ancient life.
          Private companies like SpaceX aim to make interstellar travel a reality. Every breakthrough brings us closer to understanding our place in the universe.
          Contact us at info@example.com or visit https://www.example.org. For support, call +1 (800) 555-1234 or 9876543210."""

def custom_tokenize(text):
    # Lookarounds keep a contraction such as "isn't" from matching inside a longer word.
    contractions = r"(?<!\w)[a-zA-Z]+'[a-zA-Z]+(?!\w)"
    hyphenated = r"\b[a-zA-Z]+(?:-[a-zA-Z]+)+\b"      # e.g. "state-of-the-art"
    decimals = r"\b\d+\.\d+\b"                        # e.g. "3.14"
    integers = r"(?<!\w)\d+(?!\w)"

    # Alternatives are tried left to right, so the specific patterns come before \w+.
    # The raw f-string keeps \w literal, and the non-capturing group makes
    # re.findall return plain strings, so contractions are not emitted twice.
    pattern = rf"(?:{contractions}|{hyphenated}|{decimals}|{integers}|\w+)"
    return re.findall(pattern,text)

tokens = custom_tokenize(text)
print("Custom Tokens:",tokens)

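def clean_text(text):
    # E-mail addresses: local part, '@', domain with at least one dot.
    text = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b','<EMAIL>',text)
    # URLs: lazy match plus lookahead, so sentence punctuation after the URL is kept.
    text = re.sub(r'https?://\S+?(?=[.,;:!?)]*(?:\s|$))','<URL>',text)
    # Phones: optional country code, optional parentheses around the area code, then 3 + 4 digits.
    text = re.sub(r'(?:\+\d{1,3}\s?)?(?:\(\d{3}\)|\d{3})[-\s]?\d{3}[-\s]?\d{4}\b','<PHONE>',text)
    return text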

cleaned_text = clean_text(text)
print("\nCleaned Text:",cleaned_text)

Custom Tokens: ['The', 'vastness', 'of', 'space', 'has', 'always', 'fascinated', 'humanity', 'Exploring', 'distant', 'galaxies', 'discovering', 'exoplanets', 'and', 'searching', 'for', 'extraterrestrial', 'life', 'drive', 'scientific', 'progress', 'Missions', 'like', 'the', 'James', 'Webb', 'Telescope', 'reveal', 'stunning', 'cosmic', 'phenomena', 'while', 'Mars', 'rovers', 'hunt', 'for', 'signs', 'of', 'ancient', 'life', 'Private', 'companies', 'like', 'SpaceX', 'aim', 'to', 'make', 'interstellar', 'travel', 'a', 'reality', 'Every', 'breakthrough', 'brings', 'us', 'closer', 'to', 'understanding', 'our', 'place', 'in', 'the', 'universe', 'Contact', 'us', 'at', 'info', 'example', 'com', 'or', 'visit', 'https', 'www', 'example', 'org', 'For', 'support', 'call', '1', '800', '555', '1234', 'or', '9876543210']

Cleaned Text: The vastness of space has always fascinated humanity. Exploring distant galaxies, discovering exoplanets, and searching for extraterrestrial life drive scientific progr
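
The space paragraph contains no contractions, hyphenated words, or decimals, so the output above does not exercise requirements 2a to 2c. The sketch below checks them, and the two phone formats named in the task, on made-up sample sentences. It is an illustration, not the only way to write these patterns: the constant names are hypothetical, and the URL pattern is a variant that stops before trailing sentence punctuation.

```python
import re

CONTRACTION = r"(?<!\w)[a-zA-Z]+'[a-zA-Z]+(?!\w)"   # isn't
HYPHENATED = r"\b[a-zA-Z]+(?:-[a-zA-Z]+)+\b"        # state-of-the-art
DECIMAL = r"\b\d+\.\d+\b"                           # 3.14
INTEGER = r"(?<!\w)\d+(?!\w)"                       # 42

# Specific alternatives first, generic \w+ last; non-capturing so findall returns strings.
TOKEN = re.compile(rf"(?:{CONTRACTION}|{HYPHENATED}|{DECIMAL}|{INTEGER}|\w+)")

def tokenize(text):
    return TOKEN.findall(text)

def replace_placeholders(text):
    text = re.sub(r"\b[\w.-]+@[\w.-]+\.\w+\b", "<EMAIL>", text)
    text = re.sub(r"https?://\S+?(?=[.,;:!?)]*(?:\s|$))", "<URL>", text)
    text = re.sub(r"(?:\+\d{1,3}\s?)?(?:\(\d{3}\)|\d{3})[-\s]?\d{3}[-\s]?\d{4}\b",
                  "<PHONE>", text)
    return text

# Hypothetical test sentences, not taken from the paragraph above.
tokens = tokenize("It isn't state-of-the-art, but pi is roughly 3.14 and 42 is an integer.")
cleaned = replace_placeholders(
    "Mail info@example.com, see https://www.example.org. Call 123-456-7890 or +91 9876543210."
)

print(tokens)
print(cleaned)
```

If the placeholders should survive a later tokenization step, run the substitutions before tokenizing; otherwise addresses are broken into pieces such as 'info', 'example', 'com', as in the token list above.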