# Q1. Write a unique paragraph (5-6 sentences) about your favorite topic (e.g., sports, technology, food, books, etc.).     
- Convert text to lowercase and remove punctuation.  
- Tokenize the text into words and sentences.  
- Remove stopwords (using NLTK's stopwords list).
- Display word frequency distribution (excluding stopwords).

In [2]:
import nltk
import re
import string
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus   import stopwords
from nltk.stem     import PorterStemmer, LancasterStemmer, WordNetLemmatizer
from nltk.probability import FreqDist

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to C:\Users\A/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\A/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\A/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\A/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [3]:
from IPython.display import display, Markdown
new_paragraph = """My fascination lies deeply within the realm of quantum mechanics, a universe governed by rules that often defy our everyday intuition. The concept of superposition, where a particle can exist in multiple states simultaneously until observed, constantly sparks my curiosity about the fundamental nature of reality. Entanglement, the seemingly spooky connection between particles that allows them to instantaneously influence each other regardless of distance, hints at a deeper interconnectedness we are only beginning to grasp. Exploring the mathematical elegance that underpins these phenomena, from the Schrödinger equation to quantum field theory, feels like deciphering the very language of the cosmos. While still largely theoretical in many aspects, the potential applications of quantum computing and quantum cryptography promise to revolutionize technology as we know it, making this field an endlessly captivating frontier of scientific inquiry."""
new_lines = new_paragraph.split("\n")
print(new_paragraph)

My fascination lies deeply within the realm of quantum mechanics, a universe governed by rules that often defy our everyday intuition. The concept of superposition, where a particle can exist in multiple states simultaneously until observed, constantly sparks my curiosity about the fundamental nature of reality. Entanglement, the seemingly spooky connection between particles that allows them to instantaneously influence each other regardless of distance, hints at a deeper interconnectedness we are only beginning to grasp. Exploring the mathematical elegance that underpins these phenomena, from the Schrödinger equation to quantum field theory, feels like deciphering the very language of the cosmos. While still largely theoretical in many aspects, the potential applications of quantum computing and quantum cryptography promise to revolutionize technology as we know it, making this field an endlessly captivating frontier of scientific inquiry.


In [4]:
# Split into sentences first: sent_tokenize relies on the punctuation that is removed below
sentence_units = sent_tokenize(new_paragraph)

# Process text: lowercase and punctuation removal
processed_text = new_paragraph.lower()
# str.maketrans('', '', chars) builds a table mapping every character in chars to None, so .translate() deletes it.
processed_text = processed_text.translate(str.maketrans('', '', string.punctuation))

# Break the cleaned text down into individual words
individual_tokens = word_tokenize(processed_text)

# Filter out common, insignificant words
common_words = set(stopwords.words("english"))
meaningful_words = [item for item in individual_tokens if item not in common_words]

# Calculate and display word occurrences
frequency_distribution = FreqDist(meaningful_words)
print("Filtered Words:", meaningful_words)
print("Word Frequencies:")
frequency_distribution.pprint()

Filtered Words: ['fascination', 'lies', 'deeply', 'within', 'realm', 'quantum', 'mechanics', 'universe', 'governed', 'rules', 'often', 'defy', 'everyday', 'intuition', 'concept', 'superposition', 'particle', 'exist', 'multiple', 'states', 'simultaneously', 'observed', 'constantly', 'sparks', 'curiosity', 'fundamental', 'nature', 'reality', 'entanglement', 'seemingly', 'spooky', 'connection', 'particles', 'allows', 'instantaneously', 'influence', 'regardless', 'distance', 'hints', 'deeper', 'interconnectedness', 'beginning', 'grasp', 'exploring', 'mathematical', 'elegance', 'underpins', 'phenomena', 'schrödinger', 'equation', 'quantum', 'field', 'theory', 'feels', 'like', 'deciphering', 'language', 'cosmos', 'still', 'largely', 'theoretical', 'many', 'aspects', 'potential', 'applications', 'quantum', 'computing', 'quantum', 'cryptography', 'promise', 'revolutionize', 'technology', 'know', 'making', 'field', 'endlessly', 'captivating', 'frontier', 'scientific', 'inquiry']
Word Frequencies:
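Sentence boundaries are detected from punctuation, so the order of the cleaning steps matters. The sketch below illustrates this with a deliberately naive regex splitter standing in for `sent_tokenize`. The helper name `split_sentences` and the sample string are assumptions made for illustration, not part of the assignment.

```python
import re
import string

def split_sentences(text):
    # Naive stand-in for a sentence tokenizer:
    # split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

sample = "Superposition is strange. Entanglement is stranger!"

# Same text with all ASCII punctuation deleted
stripped = sample.translate(str.maketrans('', '', string.punctuation))

print(split_sentences(sample))    # boundaries still present
print(split_sentences(stripped))  # boundaries lost, one long "sentence"
```

Splitting the original text and cleaning it afterwards keeps both the sentence list and the punctuation-free word list meaningful.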

# Q2. Stemming and Lemmatization:  
- Take the tokenized words from Question 1 (after stopword removal).  
- Apply stemming using NLTK's PorterStemmer and LancasterStemmer.   
- Apply lemmatization using NLTK's WordNetLemmatizer.   
- Compare and display results of both techniques.  

In [6]:
# Initialize
port_stemmer = PorterStemmer()
lanc_stemmer = LancasterStemmer()
word_net_lemmatizer = WordNetLemmatizer()

# Stem and lemmatize the stopword-filtered tokens from Q1 (meaningful_words)
port_result = [port_stemmer.stem(item) for item in meaningful_words]
lanc_result = [lanc_stemmer.stem(item) for item in meaningful_words]
lemma_output = [word_net_lemmatizer.lemmatize(item) for item in meaningful_words]

# PorterStemmer: rule-based suffix stripping that needs no training. The result is a stem, not necessarily a dictionary word (e.g. 'fascination' -> 'fascin').
# LancasterStemmer: a more aggressive rule-based stemmer that tends to produce shorter stems (e.g. 'quantum' -> 'quant', where Porter keeps 'quantum').
# WordNetLemmatizer: looks words up in the WordNet lexical database and returns the lemma (dictionary form). lemmatize() assumes pos='n' (noun) unless a part of speech is passed, so verbs and adjectives may come back unchanged here.

print("Porter Stemmer:", port_result)
print("\nLancaster Stemmer:", lanc_result)
print("\nLemmatized:", lemma_output)

Porter Stemmer: ['fascin', 'lie', 'deepli', 'within', 'realm', 'quantum', 'mechan', 'univers', 'govern', 'rule', 'often', 'defi', 'everyday', 'intuit', 'concept', 'superposit', 'particl', 'exist', 'multipl', 'state', 'simultan', 'observ', 'constantli', 'spark', 'curios', 'fundament', 'natur', 'realiti', 'entangl', 'seemingli', 'spooki', 'connect', 'particl', 'allow', 'instantan', 'influenc', 'regardless', 'distanc', 'hint', 'deeper', 'interconnected', 'begin', 'grasp', 'explor', 'mathemat', 'eleg', 'underpin', 'phenomena', 'schrödinger', 'equat', 'quantum', 'field', 'theori', 'feel', 'like', 'deciph', 'languag', 'cosmo', 'still', 'larg', 'theoret', 'mani', 'aspect', 'potenti', 'applic', 'quantum', 'comput', 'quantum', 'cryptographi', 'promis', 'revolution', 'technolog', 'know', 'make', 'field', 'endlessli', 'captiv', 'frontier', 'scientif', 'inquiri']

Lancaster Stemmer: ['fascin', 'lie', 'deeply', 'within', 'realm', 'quant', 'mech', 'univers', 'govern', 'rul', 'oft', 'defy', 'everyday
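Printing three long lists makes the techniques hard to compare word by word. A minimal sketch of a side-by-side view follows. The helper `side_by_side` is hypothetical, and the sample values are copied from the first entries of the outputs above; lemmas are left out because that output is not shown.

```python
def side_by_side(columns):
    """Format equal-length lists as an aligned text table, one row per word."""
    headers = list(columns)
    rows = list(zip(*columns.values()))
    # Each column is as wide as its longest cell (or its header)
    widths = [
        max([len(header)] + [len(row[i]) for row in rows])
        for i, header in enumerate(headers)
    ]
    lines = ["  ".join(h.ljust(w) for h, w in zip(headers, widths)).rstrip()]
    for row in rows:
        lines.append("  ".join(c.ljust(w) for c, w in zip(row, widths)).rstrip())
    return "\n".join(lines)

table = side_by_side({
    "original":  ["fascination", "deeply", "quantum", "mechanics"],
    "porter":    ["fascin", "deepli", "quantum", "mechan"],
    "lancaster": ["fascin", "deeply", "quant", "mech"],
})
print(table)
```

In the notebook, the same helper could be called with `meaningful_words`, `port_result`, `lanc_result`, and `lemma_output` as the columns.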

# Q3. Regular Expressions and Text Splitting:   
- Take the original text from Question 1.   
- Use regular expressions to:     
    - Extract all words with more than 5 letters.  
    - Extract all numbers (if any exist in their text).   
    - Extract all capitalized words.    
- Use text splitting techniques to:  
    - Split the text into words containing only alphabets (removing digits and special characters).   
    - Extract words starting with a vowel. 


In [8]:
import re

# Input string for analysis
input_string = new_paragraph

# Find words exceeding five characters
long_strings = re.findall(r'\b\w{6,}\b', input_string)

# Extract any sequence of digits
found_numbers = re.findall(r'\d+', input_string)

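# Identify words starting with a capital letter followed by lowercase letters
# (ASCII-only classes: a word containing a non-ASCII letter, such as 'Schrödinger', is not matched)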
leading_capital_words = re.findall(r'\b[A-Z][a-z]*\b', input_string)

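# Select words containing only ASCII alphabetic characters
# (a word with a non-ASCII letter, such as 'Schrödinger', is skipped entirely because \b cannot match inside it)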
pure_alpha_words = re.findall(r'\b[a-zA-Z]+\b', input_string)

# Filter words that commence with a vowel
starting_vowel_words = [item for item in pure_alpha_words if item[0].lower() in 'aeiou']

print("Words > 5 letter length:", long_strings)
print("\nNumbers:", found_numbers)
print("\nCapitalized words:", leading_capital_words)
print("\nAlphabet-only words:", pure_alpha_words)
print("\nWords starting with vowels:", starting_vowel_words)

Words > 5 letter length: ['fascination', 'deeply', 'within', 'quantum', 'mechanics', 'universe', 'governed', 'everyday', 'intuition', 'concept', 'superposition', 'particle', 'multiple', 'states', 'simultaneously', 'observed', 'constantly', 'sparks', 'curiosity', 'fundamental', 'nature', 'reality', 'Entanglement', 'seemingly', 'spooky', 'connection', 'between', 'particles', 'allows', 'instantaneously', 'influence', 'regardless', 'distance', 'deeper', 'interconnectedness', 'beginning', 'Exploring', 'mathematical', 'elegance', 'underpins', 'phenomena', 'Schrödinger', 'equation', 'quantum', 'theory', 'deciphering', 'language', 'cosmos', 'largely', 'theoretical', 'aspects', 'potential', 'applications', 'quantum', 'computing', 'quantum', 'cryptography', 'promise', 'revolutionize', 'technology', 'making', 'endlessly', 'captivating', 'frontier', 'scientific', 'inquiry']

Numbers: []

Capitalized words: ['My', 'The', 'Entanglement', 'Exploring', 'While']

Alphabet-only words: ['My', 'fascinatio
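The ASCII character classes used above drop words with accented letters, which is why 'Schrödinger' is missing from the capitalized-word output. One possible Unicode-aware alternative is sketched below. It is an illustration rather than the assignment's required solution, and the sample string and variable names are assumptions.

```python
import re

text = "Entanglement and the Schrödinger equation"

# ASCII-only pattern from the cell above, kept for comparison
ascii_capitalized = re.findall(r'\b[A-Z][a-z]*\b', text)

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# i.e. any Unicode letter, so 'ö' is included.
letter_runs = re.findall(r'[^\W\d_]+', text)

capitalized = [w for w in letter_runs if w[0].isupper()]
long_words = [w for w in letter_runs if len(w) > 5]
vowel_initial = [w for w in letter_runs if w[0].lower() in 'aeiou']

print("ASCII capitalized:  ", ascii_capitalized)
print("Unicode capitalized:", capitalized)
print("More than 5 letters:", long_words)
print("Starting with vowel:", vowel_initial)
```

Extracting letter runs once and filtering them in Python keeps each criterion readable and avoids repeating the regex.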

# Q4. Custom Tokenization & Regex-based Text Cleaning:   
- Take original text from Question 1   
- Write a custom tokenization function that:   
    - Removes punctuation and special symbols, but keeps contractions (e.g., "isn't" should not be split into "is" and "n't").   
    - Handles hyphenated words as a single token (e.g., "state-of-the-art" remains a single token).   
    - Tokenizes numbers separately but keeps decimal numbers intact (e.g., "3.14" should remain as it is).   
- Use Regex Substitutions (re.sub) to:   
    - Replace email addresses with `<EMAIL>` placeholder.   
    - Replace URLs with `<URL>` placeholder.   
    - Replace phone numbers (formats: 123-456-7890 or +91 9876543210) with `<PHONE>` placeholder.

In [10]:
import re

# Example string containing contact information
example_string = "Contact us at support@website.com. See more at https://website.info or ring us at +91 7777777777."

# Specialized token extraction method
def specialized_token_extraction(text):
    # Try decimals first so "3.14" stays whole; then match words, which may contain
    # hyphens or apostrophes (whole numbers are covered by \w+)
    return re.findall(r"\d+\.\d+|\b\w+(?:[-']\w+)*\b", text)

# Generate tokens from the example string
generated_tokens = specialized_token_extraction(example_string)

# Replace identifiable patterns with placeholders
# Email: stop at the last dot-separated domain label so sentence-ending punctuation is kept
modified_text = re.sub(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', '<EMAIL>', example_string)
modified_text = re.sub(r'https?://\S+', '<URL>', modified_text)
# Phone: +91 followed by 10 digits, or the 123-456-7890 format
modified_text = re.sub(r'\+91\s?\d{10}|\d{3}-\d{3}-\d{4}', '<PHONE>', modified_text)

print("Custom Tokens:", generated_tokens)
print("Cleaned Text:", modified_text)

Custom Tokens: ['Contact', 'us', 'at', 'support', 'website', 'com', 'See', 'more', 'at', 'https', 'website', 'info', 'or', 'ring', 'us', 'at', '91', '7777777777']
Cleaned Text: Contact us at <EMAIL>. See more at <URL> or ring us at <PHONE>.
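The example string above contains no contractions, hyphenated words, or decimals, so it does not exercise the three tokenization requirements. A small check along those lines is sketched below. It assumes the decimal alternative is tried before the word alternative, and the names `TOKEN_PATTERN`, `PHONE_PATTERN`, and `tokenize` are hypothetical.

```python
import re

# Decimals first, then words that may contain internal hyphens or apostrophes
TOKEN_PATTERN = re.compile(r"\d+\.\d+|\w+(?:[-']\w+)*")

# The two phone formats named in the task: 123-456-7890 and +91 9876543210
PHONE_PATTERN = re.compile(r"\+91\s?\d{10}|\d{3}-\d{3}-\d{4}")

def tokenize(text):
    # Punctuation is dropped simply because no alternative matches it
    return TOKEN_PATTERN.findall(text)

sample = "It isn't state-of-the-art, but pi is about 3.14 and e is near 2.718."
tokens = tokenize(sample)
print(tokens)

masked = PHONE_PATTERN.sub("<PHONE>", "Call 123-456-7890 or +91 9876543210.")
print(masked)
```

Because regex alternation takes the first alternative that matches, putting the word pattern first would split "3.14" into "3" and "14".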
