<h2 style="text-align: center">344.105/6/7/12/26/27/28/29/30, UE Natural Language Processing</h2>
<h1 style="color:rgb(0,120,170)">Assignment 2</h1>
<h2 style="color:rgb(0,120,170)">Document Classification with Standard Machine Learning Methods</h2>

<b>Terms of Use</b><br>
This material is prepared for educational purposes at the Johannes Kepler University (JKU) Linz, and is exclusively provided to the registered students of the mentioned course at JKU. It is strictly forbidden to distribute the current file, the contents of the assignment, and its solution. The use or reproduction of this manuscript is only allowed for educational purposes in non-profit organizations, and in that case requires the explicit prior acceptance of the author(s).
</div>

<h2>Table of contents</h2>
<ol>
    <a href="#section-general-guidelines"><li style="font-size:large;font-weight:bold">General Guidelines</li></a>
    <a href="#section-preprocessing"><li style="font-size:large;font-weight:bold">Task A: Pre-processing & Feature Extraction (15 points)</li></a>
    <a href="#section-training"><li style="font-size:large;font-weight:bold">Task B: Training and Results Analysis (15 points)</li></a>
    <a href="#section-optional"><li style="font-size:large;font-weight:bold">Task C: Linear Model Interpretability (2 extra points)</li></a>
    
</ol>

<a name="section-general-guidelines"></a><h2 style="color:rgb(0,120,170)">General Guidelines</h2>

### Assignment objective
 
The aim of this assignment is to implement a document (sentence) classification model using (standard) machine learning methods. The assignment has **30 points** in total; it also offers **2 extra points**, which can compensate for points lost elsewhere.
 
This Notebook encompasses all aspects of the assignment, namely the descriptions of the tasks as well as your solutions and reports. Feel free to add any cells you need for your solutions. The cells can contain code, reports, charts, tables, or any other material required for the assignment. Feel free to provide the solutions in an interactive and visual way!
 
Please discuss any unclear points of the assignment in the provided forum in MOODLE. You are also encouraged to answer your peers' questions. However, when submitting a post, keep in mind to avoid providing solutions. Please let the tutor(s) know should you find any error or ambiguity in the assignment.

The use of Large Language Models (LLMs), such as ChatGPT or similar tools, is strictly not allowed for solving this assignment. If the use of such tools is detected, it will result in 0 points for the entire assignment and deregistration from the course.

### Libraries & Dataset

The assignment should be implemented with a recent version of `Python` (>3.7). Any standard Python library can be used, as long as the library is free and can be installed with `pip` or `conda`. Examples of potentially useful libraries are `scikit-learn`, `numpy`, `scipy`, `gensim`, `nltk`, `spaCy`, and `AllenNLP`. Use the latest stable version of each library.

### Submission

Each group should submit the following two files:

- One Jupyter Notebook file (`.ipynb`), containing all the code, results, visualizations, etc. **In the submitted Notebook, all the results and visualizations should already be present, and can be observed simply by loading the Notebook in a browser.** The Notebook must be self-contained, meaning that (if necessary) one can run all the cells from top to bottom without any error. Do not forget to put in your names and student numbers.
- The HTML file (`.html`) obtained by exporting the Jupyter Notebook to HTML (Download As HTML).

You do not need to include the data files in the submission.

<table style="width:50%; margin-left:auto; margin-right:auto; border-collapse:collapse; font-size:16px;">
    <tr>
        <th style="border:1px solid #ddd; padding:8px;">Team Member</th>
        <th style="border:1px solid #ddd; padding:8px;">Name</th>
        <th style="border:1px solid #ddd; padding:8px;">Matr. Number</th>
    </tr>
    <tr>
        <td style="border:1px solid #ddd; padding:8px;">1</td>
        <td style="border:1px solid #ddd; padding:8px;">[Enter Name]</td>
        <td style="border:1px solid #ddd; padding:8px;">[Enter Matr. Number]</td>
    </tr>
    <tr>
        <td style="border:1px solid #ddd; padding:8px;">2</td>
        <td style="border:1px solid #ddd; padding:8px;">[Enter Name]</td>
        <td style="border:1px solid #ddd; padding:8px;">[Enter Matr. Number]</td>
    </tr>
</table>

<a name="section-preprocessing"></a><h2 style="color:rgb(0,120,170)">Task A: Pre-processing & Feature Extraction (15 points)</h2>

**Preprocessing (5 points).** Load the train, validation, and test sets. Study the text and, according to your own judgement, apply at least <ins>two text cleaning/preprocessing methods</ins>. Punctuation marks, numbers, dates, and case-sensitivity are some examples of elements that can be considered for cleaning/preprocessing. Tokenize the resulting text with a tokenizer of your choice. Report your approaches to text cleaning and tokenization and the reasons for your choices. Provide some examples showing the effects of the applied approaches on the text.

**Creating dictionary (5 points).** Create a dictionary of the vocabulary following the guidelines discussed in the lecture. Next, reduce the size of the dictionary using a method of your choice, for instance by applying a cut-off threshold to tokens with low frequencies. When removing tokens from the dictionary, consider a strategy for handling Out-Of-Vocabulary (OOV) tokens, namely the ones in the train/validation/test datasets that are no longer in the dictionary. Possible strategies are to remove OOVs completely from the texts, or to replace them with a special token like <OOV\>. Explain your approaches and report the statistics of the dictionary before and after the reduction.
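The dictionary statistics requested above could be gathered along the following lines. This is only a minimal sketch: the helper names `vocab_report` and `oov_rate`, the cut-off `min_freq=2`, and the toy token lists are illustrative assumptions, not part of the assignment.

```python
from collections import Counter

def vocab_report(tokenized_texts, min_freq=2):
    """Count tokens and apply a frequency cut-off (hypothetical helper)."""
    freqs = Counter(tok for tokens in tokenized_texts for tok in tokens)
    kept = {tok for tok, f in freqs.items() if f >= min_freq}
    return {"size_before": len(freqs), "size_after": len(kept), "kept": kept}

def oov_rate(tokenized_texts, kept):
    """Fraction of token occurrences that fall outside the reduced dictionary."""
    total = sum(len(tokens) for tokens in tokenized_texts)
    oov = sum(tok not in kept for tokens in tokenized_texts for tok in tokens)
    return oov / total if total else 0.0

# Toy data, only for illustration
toy_train = [["a", "b", "a"], ["b", "c"]]
toy_val = [["a", "d"]]

report = vocab_report(toy_train, min_freq=2)
val_oov = oov_rate(toy_val, report["kept"])
print(report["size_before"], report["size_after"], val_oov)
```

Building the dictionary from the training set only, and then measuring the OOV rate on each split, keeps validation and test data out of the vocabulary construction.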

**Creating sentence vectors (5 points).** Use the dictionary to prepare <ins>two variations of document representation vectors</ins>, separately for the train, validation, and test sets. Both variations follow a Bag-of-Words approach but use different token weighting methods. One of the weightings must be `tf-idf`; the other can be any other method discussed in the lecture, such as `tc`, `tf`, or `BM25`. You must implement these term weighting methods yourself; using a library to readily calculate the term weightings is not allowed. Report the applied approaches. Calculate and report the sparsity rate of the vectors of the train, validation, and test sets, namely what percentage of the vector entries in each set are zeros.

</div>
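As an illustration of a second weighting scheme, a BM25 variant can be computed directly from a count matrix with `numpy`. The sketch below is written under assumptions: the function name `bm25_weights`, the defaults `k1=1.2` and `b=0.75`, and the smoothed idf form `log(1 + (N - df + 0.5) / (df + 0.5))` are common choices, but the variant discussed in the lecture may differ.

```python
import numpy as np

def bm25_weights(counts, ref_counts=None, k1=1.2, b=0.75):
    """BM25 weights for a (documents x vocabulary) count matrix.

    ref_counts supplies the document frequencies and the average document
    length; pass the training counts when weighting validation/test data.
    """
    counts = np.asarray(counts, dtype=np.float64)
    ref = counts if ref_counts is None else np.asarray(ref_counts, dtype=np.float64)

    num_docs = ref.shape[0]
    df = np.count_nonzero(ref, axis=0)                       # document frequency per token
    idf = np.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))   # assumed idf variant, always > 0

    avg_len = max(ref.sum(axis=1).mean(), 1e-9)              # guard against empty corpora
    doc_len = counts.sum(axis=1, keepdims=True)
    length_norm = k1 * (1.0 - b + b * doc_len / avg_len)

    return idf * counts * (k1 + 1.0) / (counts + length_norm)

# Toy count matrix: 3 documents, 3 tokens
toy_counts = np.array([[2, 0, 1],
                       [0, 1, 1],
                       [0, 0, 3]])
toy_bm25 = bm25_weights(toy_counts)
print(toy_bm25.round(3))
```

Because the weights are zero exactly where the counts are zero, the sparsity of the BM25 matrix matches that of the raw count matrix.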

In [4]:
#Imports
import pandas as pd
import numpy as np
import torch
import re
from collections import Counter
from typing import List


In [22]:
class TextDataset(torch.utils.data.Dataset):
    """
    Thin wrapper around torch.utils.data.Dataset for text classification.
    Stores raw text and labels from CSV files.
    """
    def __init__(self, csv_path, text_col, label_col, transform=None, sep="\t"):
        self.df = pd.read_csv(csv_path ,sep=sep)
        self.texts = self.df[text_col].astype(str).tolist()
        self.labels = self.df[label_col].tolist()
        self.transform = transform  # optional preprocessing (clean/tokenize)
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        if self.transform:
            text = self.transform(text)
        return text, label
    
    def get_texts(self):
        """Returns all raw texts (useful before tokenization)."""
        return self.texts
    
    
class Preprocessor:
    """
    Handles text cleaning and tokenization. Can be attached as transform.
    """
    def __init__(self, lowercase=True, remove_punct=True, remove_digits=True):
        self.lowercase = lowercase
        self.remove_punct = remove_punct
        self.remove_digits = remove_digits

    def clean_text(self, text: str) -> str:
        if self.lowercase:
            text = text.lower()
        if self.remove_punct:
            text = re.sub(r"[^\w\s]", "", text)
        if self.remove_digits:
            text = re.sub(r"\d+", "", text)
        return text.strip()

    def tokenize(self, text: str) -> List[str]:
        # Simple whitespace tokenizer (replace with nltk if allowed)
        return text.split()

    def __call__(self, text: str) -> List[str]:
        """Makes Preprocessor usable as a torch transform."""
        cleaned = self.clean_text(text)
        tokens = self.tokenize(cleaned)
        return tokens
    
    
class Vocabulary:
    """
    Wraps a Counter to build token-index mapping and handle OOVs.
    """
    def __init__(self, min_freq=2, oov_token="<OOV>"):
        self.min_freq = min_freq
        self.oov_token = oov_token
        self.token2idx = {}
        self.idx2token = {}
        self.freqs = Counter()

    def build(self, tokenized_texts: List[List[str]]):
        for tokens in tokenized_texts:
            self.freqs.update(tokens)

        # Filter by min frequency
        valid_tokens = [tok for tok, f in self.freqs.items() if f >= self.min_freq]

        # Build mappings
        self.token2idx = {self.oov_token: 0}
        for idx, token in enumerate(valid_tokens, start=1):
            self.token2idx[token] = idx
        self.idx2token = {i: t for t, i in self.token2idx.items()}

    def encode(self, tokens: List[str]) -> List[int]:
        """Convert list of tokens → list of indices (handles OOV)."""
        return [self.token2idx.get(tok, 0) for tok in tokens]

    def decode(self, indices: List[int]) -> List[str]:
        """Convert list of indices → tokens."""
        return [self.idx2token.get(i, self.oov_token) for i in indices]

    def stats(self):
        print(f"Vocab size (with OOV): {len(self.token2idx)}")
        print(f"Most common tokens: {self.freqs.most_common(5)}")
        
        
class Vectorizer:
    """
    Converts tokenized text → document-term matrices.
    Provides TF and TF-IDF weighting (manual implementation).
    """
    def __init__(self, vocab: Vocabulary):
        self.vocab = vocab

    def build_term_matrix(self, tokenized_texts: List[List[str]]) -> np.ndarray:
        num_docs = len(tokenized_texts)
        vocab_size = len(self.vocab.token2idx)
        mat = np.zeros((num_docs, vocab_size), dtype=np.float32)

        for i, tokens in enumerate(tokenized_texts):
            for tok in tokens:
                idx = self.vocab.token2idx.get(tok, 0)
                mat[i, idx] += 1
        return mat

    def compute_tf(self, counts: np.ndarray) -> np.ndarray:
        doc_lengths = counts.sum(axis=1, keepdims=True) + 1e-9
        return counts / doc_lengths

    def compute_idf(self, counts: np.ndarray) -> np.ndarray:
        num_docs = counts.shape[0]
        df = np.count_nonzero(counts > 0, axis=0)
        idf = np.log((num_docs + 1) / (df + 1)) + 1
        return idf

    def compute_tfidf(self, counts: np.ndarray) -> np.ndarray:
        tf = self.compute_tf(counts)
        idf = self.compute_idf(counts)
        return tf * idf

    def compute_sparsity(self, mat: np.ndarray) -> float:
        zeros = np.sum(mat == 0)
        total = mat.size
        return 100.0 * zeros / total

In [20]:
prep = Preprocessor()
train_dataset = TextDataset("train.csv", "tweet", "label", transform=prep)
vocab = Vocabulary()
vocab.build([t for t, _ in train_dataset])
vocab.stats()

Vocab size (with OOV): 17064
Most common tokens: [('the', 23882), ('to', 19257), ('a', 16384), ('i', 15817), ('and', 12134)]


In [23]:
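vec = Vectorizer(vocab)
# Build the counts from the tokenized texts. get_texts() returns raw strings,
# which build_term_matrix would iterate character by character.
train_term_matrix = vec.build_term_matrix([tokens for tokens, _ in train_dataset])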

# TF and TF-IDF
train_tf     = vec.compute_tf(train_term_matrix)
train_tfidf  = vec.compute_tfidf(train_term_matrix)

# Sparsity
sparsity_tf    = vec.compute_sparsity(train_tf)
sparsity_tfidf = vec.compute_sparsity(train_tfidf)

In [24]:
train_tokens = [t for t, _ in train_dataset]

In [25]:
print("Sample cleaned tokens:", train_tokens[0][:10])
print("Vocab size:", len(vocab.token2idx))
print("Train TF shape:", train_tf.shape)
print("Train TF-IDF shape:", train_tfidf.shape)
print("Sparsity TF:", round(sparsity_tf, 2), "%")
print("Sparsity TF-IDF:", round(sparsity_tfidf, 2), "%")

Sample cleaned tokens: ['draws', 'and', 'a', 'loss', 'in', 'the', 'league', 'at', 'old', 'trafford']
Vocab size: 17064
Train TF shape: (52298, 17064)
Train TF-IDF shape: (52298, 17064)
Sparsity TF: 99.89 %
Sparsity TF-IDF: 99.89 %


<a name="section-training"></a><h2 style="color:rgb(0,120,170)">Task B: Training and Results Analysis (15 points)</h2>

To evaluate the models, use <ins>accuracy</ins> as the metric throughout the task.

**Dummy baseline (2 points).** Create one dummy baseline classifier that predicts the validation/test labels based only on the distribution of the labels in the training set (without any use of the feature vectors). This is a weak baseline and acts as a sanity check for the actual classifiers.
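One possible shape of such a baseline is sketched below with `numpy` only. The function names, the two strategies (`"stratified"` sampling from the training label distribution and `"most_frequent"`), and the toy labels are assumptions made for illustration.

```python
import numpy as np

def fit_label_distribution(train_labels):
    """Return the distinct labels and their relative frequencies in the training set."""
    labels, counts = np.unique(train_labels, return_counts=True)
    return labels, counts / counts.sum()

def predict_dummy(labels, probs, n_samples, strategy="stratified", seed=0):
    """Predict labels without looking at any features."""
    if strategy == "most_frequent":
        return np.full(n_samples, labels[np.argmax(probs)])
    rng = np.random.default_rng(seed)
    return rng.choice(labels, size=n_samples, p=probs)

def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Toy labels, only for illustration
toy_train_labels = ["pos", "neg", "neg", "neg"]
toy_val_labels = ["neg", "pos", "neg"]

labels, probs = fit_label_distribution(toy_train_labels)
majority_pred = predict_dummy(labels, probs, len(toy_val_labels), strategy="most_frequent")
sampled_pred = predict_dummy(labels, probs, len(toy_val_labels), strategy="stratified")
print(accuracy(toy_val_labels, majority_pred), accuracy(toy_val_labels, sampled_pred))
```

The stratified variant depends on the random seed, so averaging its accuracy over a few seeds gives a more stable reference value.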

**Training and tuning classifiers (5 points).** Select at least <ins>two classification algorithms</ins> from standard machine learning classifiers. Using each classification algorithm, train a machine learning model on each of the variations of feature vectors. This should result in <ins>four experiment sets</ins> (2 variations of feature vectors × 2 classification algorithms). The ML model in each experiment may have several hyper-parameters. For each experiment, select <ins>one of the hyper-parameters and tune its value</ins>. The tuning process is done by first assigning at least <ins>three values</ins> to the hyper-parameter, then training a separate model for each value, and finally using the evaluation results on the validation set to select the best-performing model. Report the studied hyper-parameters, the evaluation results of each value on the validation set, and finally the selected value of the hyper-parameter.
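The tuning loop described above might look roughly as follows. The sketch assumes `scikit-learn` is installed and uses `LogisticRegression` with its regularization strength `C` purely as an example; the helper `tune_hyperparameter` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def tune_hyperparameter(make_model, values, X_train, y_train, X_val, y_val):
    """Train one model per candidate value and score each on the validation set."""
    results = {}
    for value in values:
        model = make_model(value).fit(X_train, y_train)
        results[value] = float(np.mean(model.predict(X_val) == y_val))
    best_value = max(results, key=results.get)
    return best_value, results

# Synthetic stand-in for the real feature matrices (illustration only)
rng = np.random.default_rng(0)
X = rng.random((200, 20))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

best_C, val_accuracy = tune_hyperparameter(
    lambda C: LogisticRegression(C=C, max_iter=1000),
    [0.01, 1.0, 100.0],
    X_train, y_train, X_val, y_val,
)
print(best_C, val_accuracy)
```

The test set is deliberately absent from this loop: only the model selected on the validation set is evaluated on it afterwards.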

**Evaluation, reporting results, and discussion (3 points).** Evaluate the selected models of the four experiments on the test set. Report the results of <ins>the four experiments on both validation and test sets (side by side) in one table as well as in one plot</ins>. Compare the different experiments and models. Are the test results lower (or higher) than the validation results? If so, what could be the cause? Among all these models and variations, what are the most important factors for improving the classification results?

**Confusion matrix (2 points).** Select the best-performing model among the experiments and use it to create a confusion matrix. The matrix shows the predicted versus true results for each label. Explain your observations on the matrix. Between which classes do you observe significant confusion?
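A confusion matrix is simple to build by hand. In the following sketch the convention that rows are true labels and columns are predicted labels is an assumption (the transposed layout is equally common), and the labels are toy data.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, labels):
    """Rows index the true label, columns the predicted label."""
    index = {label: i for i, label in enumerate(labels)}
    mat = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        mat[index[t], index[p]] += 1
    return mat

# Toy labels, only for illustration
class_names = ["a", "b", "c"]
toy_true = ["a", "a", "b", "c", "c", "c"]
toy_pred = ["a", "b", "b", "c", "a", "c"]

cm = confusion_matrix(toy_true, toy_pred, class_names)
print(cm)
# The diagonal holds the correct predictions, so accuracy = trace / total.
print(cm.trace() / cm.sum())
```

Off-diagonal cells with large counts point to the class pairs that the model confuses most often.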

**Features visualization (3 points).** Continue with the best-performing model and now take its feature vectors for the *data items in the test set*. Project these feature vectors to a 2-dimensional space using the t-SNE method. Using these 2-dimensional vectors, create two plots where the data items are shown as points (small circles). The plots are identical except for the coloring of the data points. The first plot colors every data item according to its *true label*, while the second one colors each according to its *label predicted by the model*. Keep in mind to assign the same colors to the classes in both plots, so that the plots are visually comparable. Put these two plots side by side, observe the differences, and compare the results. Report your observations.
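The two side-by-side plots could be produced along these lines. The sketch assumes `scikit-learn` and `matplotlib` are available; the helper name `plot_true_vs_predicted`, the `tab10` colormap, the perplexity value, and the random toy features are illustrative choices only.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed inside a notebook
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_true_vs_predicted(features, y_true, y_pred, class_names, perplexity=30.0, seed=0):
    """Project features to 2D once, then draw the same points with two colorings."""
    points = TSNE(n_components=2, perplexity=perplexity, random_state=seed).fit_transform(features)

    # One fixed color per class, shared by both panels
    cmap = plt.get_cmap("tab10")
    colors = {name: cmap(i % 10) for i, name in enumerate(class_names)}

    fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharex=True, sharey=True)
    panels = ((y_true, "True labels"), (y_pred, "Predicted labels"))
    for ax, (labels, title) in zip(axes, panels):
        labels = np.asarray(labels)
        for name in class_names:
            mask = labels == name
            ax.scatter(points[mask, 0], points[mask, 1], s=8, color=colors[name], label=str(name))
        ax.set_title(title)
    axes[0].legend()
    return points, fig

# Toy data: two shifted Gaussian blobs with a few flipped predictions
rng = np.random.default_rng(0)
toy_features = np.vstack([rng.normal(0.0, 1.0, (20, 5)), rng.normal(3.0, 1.0, (20, 5))])
toy_true = np.repeat([0, 1], 20)
toy_pred = toy_true.copy()
toy_pred[:3] = 1

points, fig = plot_true_vs_predicted(toy_features, toy_true, toy_pred, [0, 1], perplexity=10.0)
```

Running t-SNE once and reusing the same 2D coordinates for both panels is what makes the plots directly comparable; note that the perplexity must be smaller than the number of data items.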


</div>

<a name="section-optional"></a><h2 style="color:rgb(0,120,170)">Task C: Linear Model Interpretability (2 extra points)</h2>

Train a logistic regression model on one of the document representations. Take the coefficient weights learned by the model for each dimension (which here corresponds to a token in the dictionary). Separately for each class, study which tokens contribute most to the predictions of the model.
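Reading the most influential tokens off a fitted model could be sketched as follows, assuming `scikit-learn`'s `LogisticRegression`, whose `coef_` attribute holds one row per class (or a single row in the binary case). The helper `top_tokens_per_class`, the four-token dictionary, and the toy count matrix are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def top_tokens_per_class(model, idx2token, k=10):
    """Return the k tokens with the largest coefficients for each class."""
    coefs = model.coef_
    # Binary case: coef_ has a single row in favor of classes_[1];
    # mirror it so that each class gets its own row.
    if coefs.shape[0] == 1:
        coefs = np.vstack([-coefs[0], coefs[0]])
    top = {}
    for cls, row in zip(model.classes_.tolist(), coefs):
        best = np.argsort(row)[::-1][:k]
        top[cls] = [(idx2token[int(i)], float(row[i])) for i in best]
    return top

# Toy dictionary and count matrix: each class is dominated by one token
toy_idx2token = {0: "<OOV>", 1: "goal", 2: "vote", 3: "movie"}
toy_X = np.array([[0, 3, 0, 0], [0, 0, 3, 0], [0, 0, 0, 3]] * 5, dtype=float)
toy_y = np.array(["sports", "politics", "film"] * 5)

toy_model = LogisticRegression(max_iter=1000).fit(toy_X, toy_y)
top_tokens = top_tokens_per_class(toy_model, toy_idx2token, k=2)
for cls, tokens in top_tokens.items():
    print(cls, tokens)
```

Coefficient magnitudes are only comparable across tokens when the features share a similar scale, which is worth keeping in mind when interpreting raw counts versus `tf-idf` weights.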