# Q5. Alternative Approach Implementation (Individual Component)

This notebook demonstrates an alternative tokenization technique using **Scikit-learn’s CountVectorizer** and compares it with the group's approach (NLTK, Split, and Regex).

---
- ### *Cheah Thong Yau (TP070955)*

1. **Implement Tokenization Using an Alternative Approach (3 Marks)**

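The `CountVectorizer` class from scikit-learn tokenizes text as part of its preprocessing pipeline. With its default settings, it:

- Converts text to lowercase
- Splits text into word tokens, keeping only runs of two or more word characters (letters, digits, underscore)
- Drops punctuation and single-character words such as "a" as a consequence of that rule
- Builds a vocabulary of unique terms

Instead of returning a simple list of words, it builds a vocabulary that maps each unique term to a column index and returns a matrix of term counts that machine learning models can consume directly.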

```python
# Requires scikit-learn (uncomment to install from within a notebook):
# !pip install scikit-learn

from sklearn.feature_extraction.text import CountVectorizer

# Load the data (explicit encoding avoids platform-dependent defaults)
file_path = '../assets/dataset-a/Data_1.txt'
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read().strip()

print("Original Text:")
print(text)

# Initialize CountVectorizer with default settings
vectorizer = CountVectorizer()

# Fit on a single document; X is the sparse document-term count matrix
X = vectorizer.fit_transform([text])

# Extract the vocabulary (unique tokens, sorted alphabetically)
tokens = vectorizer.get_feature_names_out()

print("\nFirst 20 tokens:")
print(tokens[:20])

print(f"\nTotal unique tokens found: {len(tokens)}")
```


```text
Original Text:
Classification is the task of choosing the correct class label for a given input. In basic
classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. The basic classification task has a number of interesting variants. For example, in multiclass classification, each instance may be assigned multiple labels; in open-class classification, the set of labels is not defined in advance; and in sequence classification, a list of inputs are jointly classified.

First 20 tokens:
['advance' 'all' 'and' 'are' 'assigned' 'basic' 'be' 'choosing' 'class'
 'classification' 'classified' 'considered' 'correct' 'defined' 'each'
 'example' 'for' 'from' 'given' 'has']

Total unique tokens found: 45
```
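
The cell above prints only the vocabulary, so the numerical side of the output is not visible. The following is a minimal sketch of how the vocabulary relates to the count matrix. It uses only the first sentence of the dataset text as an inline sample (an assumption made so the snippet runs without the data file), and the names `sample` and `term_counts` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: a one-sentence inline sample instead of the full data file
sample = "Classification is the task of choosing the correct class label for a given input."

vectorizer = CountVectorizer()
X = vectorizer.fit_transform([sample])  # sparse matrix: 1 document x n unique terms

# vocabulary_ maps each term to its column index in X
vocab = vectorizer.vocabulary_
counts = X.toarray()[0]

# Pair every term with its count in the single document
term_counts = {term: int(counts[idx]) for term, idx in vocab.items()}

print(X.shape)
print(term_counts["the"])      # "the" appears twice in the sample
print("a" in term_counts)      # False: single-character tokens are dropped by default
```

Each column of `X` corresponds to one vocabulary term, which is why the result can be passed to a classifier without further encoding.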


2. **Compare and Contrast with the Group’s Approach**

| Feature              | NLTK `word_tokenize`                 | Split                                              | Regex                                        | `CountVectorizer`                                                |
| -------------------- | ------------------------------------ | -------------------------------------------------- | -------------------------------------------- | ---------------------------------------------------------------- |
| Output Type          | List of word tokens                  | List of word tokens                                | List of word tokens                          | Unique vocabulary tokens plus a sparse count matrix              |
| Punctuation          | Retained as separate tokens          | Stays attached to words unless stripped separately | Retained or removed depending on the pattern | Dropped by the default token pattern                             |
| Case Handling        | Preserved unless manually lowercased | Preserved unless manually lowercased               | Preserved unless manually lowercased         | Lowercased by default (`lowercase=True`)                         |
| Tokenization Control | Standard tokenization rules          | Manual, limited to the choice of separator         | Full custom control through the pattern      | Built-in defaults; adjustable via `token_pattern` or `tokenizer` |
| Primary Purpose      | Linguistic tokenization              | Basic manual splitting                             | Custom pattern-based tokenization            | Machine learning feature extraction                              |
| Output Format        | Human-readable tokens                | Human-readable tokens                              | Human-readable tokens                        | Numerical feature-ready representation                           |


Contrast
- NLTK provides standard linguistic tokenization and keeps punctuation as separate tokens.
- Split is the simplest method, but splitting on whitespace leaves punctuation attached to words (e.g. `input.`), so it needs extra cleanup.
- Regex allows custom rules and flexible tokenization for special patterns.
- CountVectorizer, by default, drops punctuation, lowercases text, and produces a vocabulary and count matrix suitable for machine learning pipelines.

Main difference: group methods focus on linguistic flexibility, while CountVectorizer focuses on numerical preprocessing for ML models.
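
The contrast can be seen on a single sentence. The sketch below is illustrative only: the regex pattern is an assumed example rather than the group's actual pattern, and NLTK is left out because `word_tokenize` needs tokenizer data to be downloaded first. `build_analyzer()` returns the function `CountVectorizer` applies to each document.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

sentence = "In open-class classification, the set of labels is not defined in advance."

# Whitespace split: punctuation stays attached to the neighbouring word
split_tokens = sentence.split()

# Regex (assumed pattern): words and punctuation marks as separate tokens
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

# CountVectorizer's default analyzer: lowercases, keeps runs of 2+ word characters
analyzer = CountVectorizer().build_analyzer()
cv_tokens = analyzer(sentence)

print(split_tokens)
print(regex_tokens)
print(cv_tokens)
```

Here `split_tokens` contains `'classification,'` and `'advance.'` with punctuation attached, `regex_tokens` keeps `','`, `'-'`, and `'.'` as their own tokens, and `cv_tokens` contains only lowercase words.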


3. **Evaluation: Better, Worse, or Just Different? (5 Marks)**

Why It Is Different

`CountVectorizer` is designed for machine learning workflows rather than pure linguistic analysis. It integrates tokenization with preprocessing and numerical encoding in a single step.

Pros
- Handles preprocessing such as lowercasing and punctuation removal automatically.
- Produces numerical features directly usable in classification models.
- Allows customization through parameters such as:
    - `max_features`: limits the vocabulary to the most frequent terms, reducing dimensionality
    - `stop_words`: removes common words that carry little meaning (e.g., "the", "is", "and", "in")
    - `ngram_range`: defines the range of n-gram sizes to extract, capturing word sequences
- Stores counts in a sparse matrix, so it scales to large datasets.
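
As a rough illustration of the three parameters listed above, the following sketch fits separate vectorizers on two sentences taken from the dataset text. The variable names (`docs`, `no_stop`, `bigrams`, `top3`) are illustrative, and the values chosen for each parameter are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The basic classification task has a number of interesting variants.",
    "In sequence classification, a list of inputs are jointly classified.",
]

# stop_words="english" filters out common words using scikit-learn's built-in list
no_stop = CountVectorizer(stop_words="english").fit(docs)

# ngram_range=(1, 2) extracts single words and pairs of adjacent words
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)

# max_features=3 keeps only the 3 most frequent terms across the corpus
top3 = CountVectorizer(max_features=3).fit(docs)

print(sorted(no_stop.vocabulary_))
print("sequence classification" in bigrams.vocabulary_)
print(len(top3.vocabulary_))
```

With stop words removed, "the" no longer appears in the vocabulary. The bigram vectorizer adds entries such as "sequence classification", and `max_features=3` caps the vocabulary at three terms.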

Cons
- Drops punctuation and sentence boundaries, so structural information such as sentence endings or question marks is lost.
- Has no context awareness: each word is counted independently of its neighbours, which makes it unsuitable for detailed linguistic analysis.
- Tokenization is less visible than with regex or NLTK because it happens inside the library, although the default rule is a documented regular expression.
- Vocabulary size can grow large; unless limited with `max_features`, this increases memory usage and computational cost.
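
The loss of structure can be demonstrated with a small sketch. The two sentences and the custom `token_pattern` below are hypothetical examples, not part of the dataset. The custom pattern shows one possible way to keep punctuation and single-character words when they matter.

```python
from sklearn.feature_extraction.text import CountVectorizer

question = "Is this a label?"
statement = "This is a label."

# Default settings: word order and punctuation are discarded,
# so both sentences produce the same count vector
default_vec = CountVectorizer()
dense = default_vec.fit_transform([question, statement]).toarray()
same_vector = bool((dense[0] == dense[1]).all())

# Assumed custom pattern: keep words of any length and punctuation marks
custom_vec = CountVectorizer(token_pattern=r"\w+|[^\w\s]")
custom_tokens = custom_vec.build_analyzer()(question)

print(same_vector)
print(custom_tokens)
```

Under the defaults, the question and the statement are indistinguishable. With the custom pattern, the question mark survives as a token, although word order is still not encoded in the resulting counts.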