Write a Python script that uses Gensim to preprocess data from a sample text file, following basic procedures such as tokenization, stemming, and lemmatization.


In [3]:
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.parsing.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
import nltk

# Download NLTK data for lemmatization
nltk.download('wordnet')
nltk.download('omw-1.4')

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Load sample text from a file
file_path = "sample_text.txt"  # Replace with your text file path
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

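# Step 1: Tokenization using Gensim
# simple_preprocess lowercases the text and, with its default settings,
# drops tokens shorter than 2 or longer than 15 characters
tokens = simple_preprocess(text)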
print("Tokenized Text:", tokens)

# Step 2: Remove Stopwords using Gensim
filtered_tokens = [word for word in tokens if word not in STOPWORDS]
print("Text after Removing Stopwords:", filtered_tokens)

# Step 3: Apply Stemming using Gensim's PorterStemmer
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print("Text after Stemming:", stemmed_tokens)

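# Step 4: Apply Lemmatization using NLTK's WordNetLemmatizer
# (applied to filtered_tokens, not to the stems; lemmatize() treats each word as a noun by default)
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]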
print("Text after Lemmatization:", lemmatized_tokens)

# Step 5: Combine back to a single preprocessed text (optional)
preprocessed_text = " ".join(lemmatized_tokens)
print("Final Preprocessed Text:", preprocessed_text)
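
Steps 3 and 4 above are both applied to `filtered_tokens`, so the stemmed and lemmatized lists line up token by token. A minimal sketch of how the two could be compared side by side follows. `compare_forms` is a hypothetical helper, not part of Gensim or NLTK, and the sample lists are hand-written illustrative values rather than output captured from the script.

```python
def compare_forms(tokens, stemmed, lemmatized):
    """Return (token, stem, lemma) rows where the stem and the lemma differ.

    Hypothetical helper: assumes the three lists are aligned, i.e. the
    stem and lemma at index i were both derived from tokens[i].
    """
    if not (len(tokens) == len(stemmed) == len(lemmatized)):
        raise ValueError("token lists must have the same length")
    return [
        (tok, stem, lemma)
        for tok, stem, lemma in zip(tokens, stemmed, lemmatized)
        if stem != lemma
    ]


# Illustrative, hand-written inputs (not output captured from the script above).
sample_tokens = ["computers", "language", "words"]
sample_stems = ["comput", "languag", "word"]
sample_lemmas = ["computer", "language", "word"]

differences = compare_forms(sample_tokens, sample_stems, sample_lemmas)
for tok, stem, lemma in differences:
    print(f"{tok}: stem={stem!r}, lemma={lemma!r}")
```

With the script's own variables, this could be called as `compare_forms(filtered_tokens, stemmed_tokens, lemmatized_tokens)` to see where the stem is a truncated form and the lemma is a dictionary word.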


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\edbid\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\edbid\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Tokenized Text: ['natural', 'language', 'processing', 'nlp', 'is', 'fascinating', 'domain', 'of', 'artificial', 'intelligence', 'that', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'human', 'language', 'it', 'involves', 'breaking', 'down', 'text', 'into', 'smaller', 'components', 'such', 'as', 'words', 'or', 'sentences', 'through', 'tokenization', 'additionally', 'processes', 'like', 'stemming', 'and', 'lemmatization', 'help', 'reduce', 'words', 'to', 'their', 'base', 'forms', 'ensuring', 'consistency', 'in', 'analysis', 'applications', 'of', 'nlp', 'include', 'chatbots', 'sentiment', 'analysis', 'language', 'translation', 'and', 'more', 'the', 'ability', 'to', 'process', 'and', 'understand', 'natural', 'language', 'opens', 'up', 'world', 'of', 'possibilities', 'for', 'making', 'technology', 'smarter', 'and', 'more', 'accessible', 'to', 'users']
Text after Removing Stopwords: ['natural', 'language', 'processing', 'nlp', 'fascinating', 'domain', 'artificial', 'i