1. Write a program to pre-process a text document, for example by removing stop words and stemming.

In [13]:
!pip install nltk
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


In [14]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sayal\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sayal\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sayal\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [15]:
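def pre_processing(text):
    """Tokenize one line, drop stop words and non-alphabetic tokens,
    print the stemmed and lemmatized forms, and return the stemmed tokens."""
    # Get tools ready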
    stopwords_list = set(stopwords.words('english'))
    ps = PorterStemmer()
    wnl = WordNetLemmatizer()

    # Process the text
    text = text.lower()
    words = word_tokenize(text)

    # Filter out stopwords and punctuation
    filtered_words = []
    for word in words:
        if word.isalpha() and word not in stopwords_list:
            filtered_words.append(word)

    # Now, process the filtered list in two ways
    stemmed_words = [ps.stem(word) for word in filtered_words]
    lemmatized_words = [wnl.lemmatize(word, pos='v') for word in filtered_words]


    print(f"Original: {text[:60]}...")
    print(f"  STEMMED : {stemmed_words}")
    print(f"  LEMMATIZED : {lemmatized_words}\n")

    # Return the stemmed words for the main practical requirement
    return stemmed_words

# --- 4. File Processing ---
processed_text = []
file_name = "sample.txt" # Make sure this file is in the same directory

print("--- STARTING PRE-PROCESSING (Stemming & Lemmatization) ---\n")
try:
    # Assumes sample.txt is UTF-8 encoded; adjust the encoding if it is not
    with open(file_name, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip(): # Skip empty lines
                # The function will print both, but only returns the stemmed words
                stemmed_tokens = pre_processing(line)
                processed_text.extend(stemmed_tokens)
except FileNotFoundError:
    print(f"Error: The file '{file_name}' was not found.")
print("\n--- PROCESS COMPLETE ---")

# --- 5. Print Final List (Stemmed) ---
print("\nFinal list of all *stemmed* tokens (for CSV):")
print(processed_text)

--- STARTING PRE-PROCESSING (Stemming & Lemmatization) ---

Original: in a small village nestled between rolling hills and ancient...
  STEMMED : ['small', 'villag', 'nestl', 'roll', 'hill', 'ancient', 'forest', 'mysteri', 'old', 'librari', 'seem', 'hold', 'secret', 'world', 'librari', 'built', 'weather', 'stone', 'cover', 'ivi', 'place', 'time', 'felt', 'though', 'stood', 'still', 'shelv', 'line', 'book', 'shape', 'size', 'ancient', 'page', 'threaten', 'crumbl', 'slightest', 'touch']
  LEMMATIZED : ['small', 'village', 'nestle', 'roll', 'hill', 'ancient', 'forest', 'mysterious', 'old', 'library', 'seem', 'hold', 'secrets', 'world', 'library', 'build', 'weather', 'stone', 'cover', 'ivy', 'place', 'time', 'felt', 'though', 'stand', 'still', 'shelve', 'line', 'book', 'shape', 'size', 'ancient', 'page', 'threaten', 'crumble', 'slightest', 'touch']

Original: one day, a young girl named elara stumbled upon this hidden ...
  STEMMED : ['one', 'day', 'young', 'girl', 'name', 'elara', 'stumbl

In [16]:
print(processed_text)

['small', 'villag', 'nestl', 'roll', 'hill', 'ancient', 'forest', 'mysteri', 'old', 'librari', 'seem', 'hold', 'secret', 'world', 'librari', 'built', 'weather', 'stone', 'cover', 'ivi', 'place', 'time', 'felt', 'though', 'stood', 'still', 'shelv', 'line', 'book', 'shape', 'size', 'ancient', 'page', 'threaten', 'crumbl', 'slightest', 'touch', 'one', 'day', 'young', 'girl', 'name', 'elara', 'stumbl', 'upon', 'hidden', 'treasur', 'elara', 'curiou', 'adventur', 'soul', 'alway', 'seek', 'new', 'stori', 'knowledg', 'wander', 'aisl', 'discov', 'dusti', 'old', 'book', 'cover', 'adorn', 'strang', 'symbol', 'titl', 'written', 'languag', 'recogn', 'moment', 'open', 'gust', 'wind', 'swirl', 'around', 'word', 'page', 'began', 'glow', 'book', 'spoke', 'forgotten', 'era', 'lost', 'civil', 'magic', 'realm', 'beyond', 'imagin', 'told', 'tale', 'brave', 'hero', 'power', 'sorcer', 'epic', 'battl', 'good', 'evil', 'elara', 'captiv', 'eye', 'wide', 'wonder', 'devour', 'stori', 'knew', 'found', 'someth', 'e

In [17]:
import csv
# newline="" lets the csv module control line endings (avoids blank rows on Windows)
with open("output.csv", "w", newline="") as file:
    writer = csv.writer(file, delimiter=',')
    writer.writerow(processed_text)

Purpose of the Practical
The purpose of this practical is to clean and normalize raw text data. A program that compares strings literally has no notion of meaning: it treats "running" and "ran" as two unrelated words.

Pre-processing cleans this "unstructured" data by:

Tokenizing: Splitting sentences into a list of individual words (tokens).

Removing Noise: Getting rid of common "filler" words (like 'a', 'the', 'is') and punctuation that don't add much meaning.

Normalizing: Reducing words to their common root (e.g., "running" -> "run").

This shrinks the vocabulary and makes the data easier for analysis algorithms (such as search engines or spam filters) to process.
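The three steps above can be sketched with the standard library alone. This is a toy illustration, not what the notebook runs: the regex tokenizer, the five-word STOP_WORDS set, and the strip_suffix helper are hypothetical stand-ins for word_tokenize, stopwords.words('english'), and PorterStemmer.

```python
import re

# Hypothetical, deliberately tiny stop word set (NLTK's English list is much longer).
STOP_WORDS = {"a", "the", "is", "in", "and"}

def tokenize(text):
    """Lowercase the text and pull out runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())

def strip_suffix(word):
    """Toy normalizer: drop one common suffix. This is NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_pipeline(text):
    tokens = tokenize(text)                              # 1. tokenize
    kept = [t for t in tokens if t not in STOP_WORDS]    # 2. remove noise
    return [strip_suffix(t) for t in kept]               # 3. normalize

result = toy_pipeline("The girl is reading the books in a garden.")
print(result)
```

The nine input tokens shrink to four normalized ones, which is the effect the real pipeline has on each line of sample.txt.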

🧠 Core Theory (How it Works)
There are two main techniques you're using to normalize words:

Stemming (using PorterStemmer): This is a crude, rule-based process. It simply chops off the end of words to get to a common "stem."

It's very fast.

It's "aggressive" and the resulting stem might not be a real word (e.g., "studies" becomes "studi", "history" becomes "histori").

Example: studies, studying, study all become studi.

Lemmatization (using WordNetLemmatizer): This is a "smarter," dictionary-based process. It uses the WordNet dictionary to find the actual root word (the "lemma") based on its part of speech (like a verb or noun).

It's slower (it has to look up words).

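It's more accurate, and the result is a real dictionary word whenever the input is found in WordNet under the given part of speech. Words it cannot find are returned unchanged.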

Example: studies, studying, study all become study.
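The difference between the two approaches can be sketched without NLTK. Both helpers below are hypothetical toys: toy_stem applies three suffix rules (far fewer than the Porter algorithm, so its outputs differ from PorterStemmer's), and TOY_LEMMAS is a four-entry stand-in for the WordNet dictionary.

```python
# Rule-based approach: strip suffixes with no knowledge of the vocabulary.
def toy_stem(word):
    """Toy rule-based stemmer (hypothetical; much simpler than Porter)."""
    for suffix, replacement in (("ies", "i"), ("ing", ""), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

# Dictionary-based approach: look the word up, fall back to the input unchanged.
TOY_LEMMAS = {  # hypothetical stand-in for WordNet
    "studies": "study",
    "studying": "study",
    "ran": "run",
    "running": "run",
}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

words = ["studies", "studying", "ran", "running"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The suffix rules produce non-words and cannot connect the irregular form "ran" to "run", while the lookup handles both but only for words it already knows.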

📋 Step-by-Step Code Explanation
Import Libraries: You first import nltk (Natural Language Toolkit) and the specific tools you need:

word_tokenize: To split sentences into words.

stopwords: To get the list of "filler" words.

PorterStemmer: The tool for stemming.

WordNetLemmatizer: The tool for lemmatizing.

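Download NLTK Data: The nltk.download() commands are essential. They download the 'punkt' tokenizer model (used by word_tokenize), the 'stopwords' list, and the 'wordnet' dictionary. Newer NLTK releases may look for a resource named 'punkt_tab' instead of 'punkt'; if word_tokenize raises a LookupError, download the resource named in the error message.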

Define pre_processing Function:

Initialize Tools: You create instances of PorterStemmer and WordNetLemmatizer, and load stopwords.words('english') into a set() (membership tests on a set are much faster than on a list).

Tokenize & Filter: The code takes a line of text, converts it to lowercase with .lower(), and uses word_tokenize() to get a list of tokens. It then loops through this list and keeps only the tokens that pass .isalpha() (which drops punctuation, numbers, and any token containing a non-letter, such as a hyphenated word or the "n't" piece of a contraction) and are not in the stopword list.

Stem & Lemmatize: It then processes this filtered list twice:

Once with ps.stem(word) to create the stemmed list.

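Once with wnl.lemmatize(word, pos='v') to create the lemmatized list. Because pos='v' treats every token as a verb, plural nouns are not reduced (in the output above, 'secrets' stays 'secrets').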

Process the File:

The main part of the script opens your sample.txt file.

It reads the file line by line.

It calls your pre_processing function on each line.

It adds the stemmed tokens returned by the function to a final list (processed_text) using .extend().

Save Output: Finally, it uses the csv library to write your final list of stemmed tokens to output.csv as a single comma-separated row.
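The effect of writerow() on the output layout can be checked in memory before touching a file. The sketch below uses io.StringIO in place of output.csv and only the first three stemmed tokens from the output above. The one-token-per-row layout is an alternative shown for comparison, not something the notebook does.

```python
import csv
import io

tokens = ["small", "villag", "nestl"]  # first stemmed tokens from the output above

# Layout used by the notebook: every token in a single row.
one_row = io.StringIO(newline="")
csv.writer(one_row).writerow(tokens)

# Alternative layout (an assumption, for comparison): one token per row.
per_row = io.StringIO(newline="")
csv.writer(per_row).writerows([t] for t in tokens)

print(repr(one_row.getvalue()))
print(repr(per_row.getvalue()))

# Reading the single-row layout back recovers the original list.
restored = next(csv.reader(io.StringIO(one_row.getvalue(), newline="")))
print(restored)
```

The csv writer ends each row with "\r\n" by default, which is why the csv documentation recommends opening real files with newline="" so the platform does not translate those line endings a second time.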

🛠️ Key Libraries & Functions
nltk: The Natural Language Toolkit, a widely used Python library for Natural Language Processing (NLP).

word_tokenize(text): Splits a string into a list of tokens (words and punctuation).

stopwords.words('english'): The built-in list of common English stop words.

word.isalpha(): A simple string method that returns True if all characters are letters. This is an easy way to remove numbers and punctuation (like '!', ',', '.') at the same time.

PorterStemmer(): A widely used English stemmer. You call its .stem(word) method.

WordNetLemmatizer(): The NLTK lemmatizer. You call its .lemmatize(word, pos='v') method; if pos is omitted, it defaults to 'n' (noun).
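The filtering condition used in pre_processing can be tried on its own to see exactly which tokens .isalpha() rejects. The token list and the three-word stop set below are hypothetical, hand-written to resemble tokenizer output for a short sentence. They are not produced by NLTK here.

```python
# Hypothetical token list, shaped like tokenizer output for a short sentence.
tokens = ["she", "did", "n't", "read", "well-known", "books", "in", "2024", "."]
stop_words = {"she", "did", "in"}  # tiny illustrative set, not NLTK's list

# Same condition as in pre_processing: letters only, and not a stop word.
kept = [t for t in tokens if t.isalpha() and t not in stop_words]

# Tokens rejected by isalpha() alone, regardless of the stop word set.
dropped_by_isalpha = [t for t in tokens if not t.isalpha()]

print(kept)
print(dropped_by_isalpha)
```

Note that the hyphenated word and the contraction fragment are discarded along with the number and the full stop, so this filter is convenient but lossy.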