# Q5. Alternative Approach Implementation (Individual Component)

This notebook demonstrates an alternative tokenization technique using **TextBlob** and compares it with the group's approach (NLTK, Split, and Regex).

---
- ### *Chong Kah Jun (TP067165)*

In [None]:
from textblob import TextBlob
import nltk

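# Ensure the NLTK tokenizer data is available (TextBlob uses NLTK under the hood).
# Recent NLTK releases look for 'punkt_tab' rather than 'punkt', so request both.
nltk.download('punkt')
nltk.download('punkt_tab')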

# Load the data
file_path = '../assets/dataset-a/Data_1.txt'
with open(file_path, 'r', encoding='utf-8') as file:  # assumes the file is UTF-8
    text = file.read().strip()

print("Original Text:")
print(text)

**Implement tokenization using an alternative approach (TextBlob)**

In [None]:
# Create a TextBlob object
blob = TextBlob(text)

# Extract words using the .words property
tokens_textblob = blob.words

print("Tokenization using TextBlob:")
print(list(tokens_textblob)[:20], "...")
print(f"\nTotal tokens found: {len(tokens_textblob)}")

**Compare and contrast the alternative approach with the group's approach**

| Feature | Group Approach (NLTK `word_tokenize`) | Alternative Approach (`TextBlob.words`) |
| :--- | :--- | :--- |
| Data Type | Returns a standard Python `list` of strings. | Returns a `WordList` object (a subclass of list). |
| Punctuation | Retains punctuation as separate tokens. | Filters out punctuation tokens by default. |
| API Style | Functional/Procedural (`word_tokenize(text)`). | Object-Oriented (`blob.words`). |
| Underlying Engine | Splits sentences with Punkt, then applies a Treebank-style word tokenizer. | Wraps NLTK's tokenizers and strips punctuation from the result. |

Contrast:
While NLTK's `word_tokenize` returns every token, including periods and commas as separate items, `TextBlob.words` targets words only and drops standalone punctuation during tokenization. This makes `TextBlob`'s output cleaner for word-frequency tasks but less useful for syntactic analysis, where punctuation markers are needed.
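
The contrast can be illustrated without either library. The sketch below uses two regular expressions as rough stand-ins: one keeps punctuation as separate tokens (similar in spirit to `word_tokenize`), the other keeps only word-like tokens (similar in spirit to `TextBlob.words`). The patterns, the `sample` sentence, and the function names are assumptions made for illustration; they do not reproduce the libraries' actual rules.

```python
import re
from collections import Counter

# Hypothetical sample text, not taken from the dataset.
sample = "The cat sat. The dog, however, ran!"


def tokens_with_punctuation(text):
    # Runs of word characters, or any single non-space symbol, as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def tokens_words_only(text):
    # Runs of word characters only; punctuation never becomes a token.
    return re.findall(r"\w+", text)


with_punct = tokens_with_punctuation(sample)
words_only = tokens_words_only(sample)

print(with_punct)
print(words_only)

# Punctuation competes with real words in a frequency count unless it is removed.
print(Counter(with_punct).most_common(3))
print(Counter(words_only).most_common(3))
```

In the punctuation-retaining list, the comma ties with "The" as the most frequent token, which is the kind of noise a word-frequency task would need to filter out by hand.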

**Explain why the alternative approach is better, worse, or just different**

Evaluation:

Why it is Different:
The `TextBlob` approach is "just different" in its design philosophy: it is built for ease of use and rapid prototyping. Unlike the group approach, which requires manual filtering of punctuation (as shown in Q1.c), `TextBlob.words` filters punctuation during the tokenization phase itself.

Pros (Better for specific tasks):
1. Ease of Use: Once a `TextBlob` object is created, several attributes (`sentences`, `words`, `sentiment`, `tags`) are available on the same object, so there is no need to call a separate function for each task.
2. Built-in Methods: The `WordList` returned by `TextBlob` has convenient methods like `.lemmatize()` and `.singularize()`, which can be called directly on the list, reducing the need for loops.
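
The "built-in methods" point rests on `WordList` being a `list` subclass that carries extra helpers. A minimal sketch of that design idea follows, using a hypothetical `MiniWordList` class and a deliberately naive singularization rule. It is not TextBlob's implementation; it only shows why call sites need no explicit loop.

```python
class MiniWordList(list):
    """Hypothetical stand-in for a list subclass with word-level helpers."""

    def lower(self):
        # Return a new list of the same type so calls can be chained.
        return MiniWordList(word.lower() for word in self)

    def singularize(self):
        # Naive rule for illustration only: drop one trailing "s" from longer words.
        return MiniWordList(
            word[:-1] if word.endswith("s") and len(word) > 3 else word
            for word in self
        )


words = MiniWordList(["Cats", "chase", "Mice", "and", "birds"])
result = words.lower().singularize()

print(result)
print(isinstance(result, list))  # still usable anywhere a plain list is expected
```

Because each helper returns the same list subclass, the caller chains operations on the whole collection instead of writing a loop per step.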

Cons (Worse for specific tasks):
1. Transparency: It abstracts away the tokenization process. In complex NLP pipelines, a researcher might want to know exactly how a period is handled, whereas `TextBlob.words` decides on the caller's behalf that punctuation should be stripped.
2. Punctuation Loss: If the task requires knowing the end of a sentence through punctuation or identifying symbols (like in code analysis), `TextBlob.words` is worse because it discards them by default. The `tokens` property of the same blob keeps punctuation, but it has to be chosen explicitly.
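
The punctuation-loss drawback can be made concrete: once sentence-final marks are dropped, the token stream alone no longer shows where sentences end. The sketch below uses plain regular expressions as stand-ins for the two tokenizers (an assumption; neither NLTK nor TextBlob is called), together with a hypothetical `split_sentences` helper that groups tokens at `.`, `!`, or `?`.

```python
import re


def split_sentences(tokens):
    # Group tokens into sentences, closing a group at each sentence-final mark.
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        # Tokens left over without a closing mark form one final group.
        sentences.append(current)
    return sentences


# Hypothetical sample text, not taken from the dataset.
sample = "It rained. We stayed in. Was that wise?"
with_punct = re.findall(r"\w+|[^\w\s]", sample)  # punctuation kept as tokens
words_only = re.findall(r"\w+", sample)          # punctuation dropped

print(len(split_sentences(with_punct)))
print(len(split_sentences(words_only)))
```

With punctuation retained, the helper recovers three sentences; with words only, everything collapses into a single group.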