# Q1. Tokenization and Filtering of Stop Words & Punctuation

This section tokenizes the provided dataset with `split()`, regular expressions, and NLTK, then removes stop words and punctuation.

In [None]:
import nltk
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

# Load the data
file_path = '../assets/dataset-a/Data_1.txt'
with open(file_path, 'r', encoding='utf-8') as file:  # assumes the file is UTF-8 encoded
    text = file.read().strip()

print("Original Text:")
print(text)

### (a) Demonstrate word tokenisation using the split function, Regular Expression and NLTK packages separately and report the output.

In [None]:
# 1. Split function
tokens_split = text.split()
print("Tokenization using split():")
print(tokens_split[:20], "...")

# 2. Regular Expression
# \w+ matches one or more word characters (letters, digits, underscore)
tokens_re = re.findall(r'\w+', text)
print("\nTokenization using Regular Expression (\\w+):")
print(tokens_re[:20], "...")

# 3. NLTK package
tokens_nltk = word_tokenize(text)
print("\nTokenization using NLTK word_tokenize():")
print(tokens_nltk[:20], "...")

### (b) Justify the most suitable tokenisation operation for text analytics. Support your answer using obtained outputs.

**Justification:**
NLTK's `word_tokenize` is the most suitable operation for text analytics for three reasons:
1. **Punctuation Handling:** The `split()` function breaks only on whitespace, so punctuation stays attached to words (e.g., `"input."` or `"advance;"`). The same word then appears as several distinct tokens, which distorts frequency analysis and vectorization.
2. **Sophisticated Rules:** Basic regex (`\w+`) removes punctuation, but it also splits hyphenated words (e.g., `"open-class"` becomes `"open"` and `"class"`) and breaks contractions at the apostrophe. NLTK's `word_tokenize` applies Penn Treebank-style tokenization rules, which keep hyphenated words intact and split contractions consistently (e.g., `"don't"` becomes `"do"` and `"n't"`).
3. **Consistency:** NLTK emits punctuation marks as separate tokens instead of discarding them or leaving them attached, which allows systematic filtering in the next stages of the pipeline.
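
The behaviours described above can be reproduced on a single sentence using only the standard library. This is a minimal sketch: the `sample` sentence is invented for illustration (it is not taken from `Data_1.txt`), and the `treebank_like` pattern is a simplified, hypothetical stand-in for NLTK's rules, not the pattern `word_tokenize` actually uses.

```python
import re

# Invented sample sentence (an assumption, not part of the dataset)
sample = "The open-class words aren't fixed in advance; they change with input."

# 1. Whitespace split: punctuation stays attached ("advance;", "input.")
by_split = sample.split()

# 2. Basic regex: punctuation is dropped, but "open-class" and "aren't" are broken up
by_regex = re.findall(r"\w+", sample)

# 3. Simplified stand-in for a Treebank-style tokenizer: keeps internal
#    hyphens/apostrophes and emits each punctuation mark as its own token
treebank_like = re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sample)

print(by_split)
print(by_regex)
print(treebank_like)
```

Unlike `word_tokenize`, this stand-in keeps `"aren't"` as a single token instead of splitting off `"n't"`; it is only meant to show why separating punctuation while preserving hyphenated words gives cleaner tokens than either `split()` or `\w+`.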

### (c) Demonstrate stop words and punctuations removal and report the output suitably along with the stop words found in the given text corpus.

In [None]:
# Get stop words and punctuation
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

# Find stop words in the corpus (lowercased so "The" and "the" are reported once)
found_stop_words = [word.lower() for word in tokens_nltk if word.lower() in stop_words]
# Remove duplicates for reporting
unique_stop_words = sorted(set(found_stop_words))

print("Stop words found in the corpus:")
print(unique_stop_words)

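# Filter tokens: drop stop words and any token made up entirely of punctuation
# characters (this also catches multi-character tokens such as "..." or "``")
filtered_tokens = [
    word for word in tokens_nltk
    if word.lower() not in stop_words
    and not all(ch in punctuation for ch in word)
]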

print("\nFiltered tokens (Stop words and Punctuation removed):")
print(filtered_tokens)

### (d) Explain the importance of filtering the stop words and punctuations in text analytics.

**Explanation:**
Filtering stop words and punctuation matters in text analytics because:
1. **Dimensionality Reduction:** Stop words (like "the", "is", "at") occur very frequently but carry little semantic value. Removing them shrinks both the vocabulary and the total token count, which reduces the feature space and the computation needed downstream.
2. **Noise Reduction:** Punctuation and common function words act as noise. Removing them keeps the focus on "content-bearing" words (nouns, verbs, adjectives) that carry the meaning and context of the document.
3. **Improved Signal-to-Noise Ratio:** In tasks such as topic modeling or document classification, high-frequency stop words can dominate term counts and mask the terms that distinguish one class or topic from another. Removing them lets models weight those distinctive terms more heavily. The stop list should still be checked against the task: NLTK's English list includes negations such as "not", which carry meaning in sentiment analysis.
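
The effect on vocabulary size and term ranking can be illustrated with a toy example. This is a sketch under stated assumptions: the sentence and the three-word `toy_stop_words` set are invented for illustration and are much smaller than NLTK's English stop list.

```python
from collections import Counter

# Invented toy sentence and stop list (assumptions, not the dataset or NLTK's list)
toy_text = "the cat sat on the mat and the dog sat on the log"
toy_stop_words = {"the", "on", "and"}

toy_tokens = toy_text.split()
toy_filtered = [tok for tok in toy_tokens if tok not in toy_stop_words]

# Vocabulary size (number of distinct tokens) before and after filtering
vocab_before = len(set(toy_tokens))
vocab_after = len(set(toy_filtered))

# Most frequent term before and after filtering
top_before = Counter(toy_tokens).most_common(1)[0]
top_after = Counter(toy_filtered).most_common(1)[0]

print("Vocabulary size:", vocab_before, "->", vocab_after)
print("Top term:", top_before, "->", top_after)
```

On this toy sentence the vocabulary drops from 8 to 5 distinct tokens, and the top-ranked term changes from the stop word `"the"` to the content word `"sat"`.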