Write a Python script that:
1. Use Genism to preprocess data from a sample text file, follow basic procedures like tokenization, stemming, lemmatization etc.

In [3]:
from gensim.parsing.preprocessing import remove_stopwords, preprocess_string, strip_punctuation, strip_numeric, stem_text
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
import nltk

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Load and read a sample text file
file_path = "sample_text.txt"  # Replace with the path to your file
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

# Step 1: Tokenization
tokens = word_tokenize(text.lower())  # Convert to lowercase and tokenize

# Step 2: Remove stopwords
tokens_no_stopwords = [word for word in tokens if word not in remove_stopwords(word)]

# Step 3: Stemming
stemmed_tokens = [stem_text(word) for word in tokens_no_stopwords]

# Step 4: Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in stemmed_tokens]

# Optional: Combine custom Gensim filters
custom_filters = [strip_punctuation, strip_numeric, lambda x: x.lower()]
processed_text = preprocess_string(text, custom_filters)

# Results
print("Original Text:")
print(text)
print("\nTokenized Text:")
print(tokens)
print("\nTokens without Stopwords:")
print(tokens_no_stopwords)
print("\nStemmed Tokens:")
print(stemmed_tokens)
print("\nLemmatized Tokens:")
print(lemmatized_tokens)
print("\nProcessed Text with Gensim Custom Filters:")
print(processed_text)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lokes\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\lokes\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\lokes\AppData\Roaming\nltk_data...


Original Text:
Natural language processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and human language. It enables machines to understand, interpret, and generate human languages, making it possible for computers to perform tasks like language translation, sentiment analysis, and more.


Tokenized Text:
['natural', 'language', 'processing', '(', 'nlp', ')', 'is', 'a', 'subfield', 'of', 'artificial', 'intelligence', '(', 'ai', ')', 'that', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'human', 'language', '.', 'it', 'enables', 'machines', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'languages', ',', 'making', 'it', 'possible', 'for', 'computers', 'to', 'perform', 'tasks', 'like', 'language', 'translation', ',', 'sentiment', 'analysis', ',', 'and', 'more', '.']

Tokens without Stopwords:
['is', 'a', 'of', 'that', 'on', 'the', 'between', 'and', 'it', 'to', 'and', 'it', 'for', 't