## DAT550 - Group Project
### Bag of Words Document Classification using Feedforward Neural Network and Recurrent Neural Network
Authors: Andreas [surname], Haakon Vollheim Webb, Håkon Nodeland, Stian [surname]



The code has the following structure:

1. Defining imports
2. Reading data
3. Cleaning data
4. Preprocessing data
5. Defining the model
6. Analysis

# 🧠 Project 2: Bag-of-Words Document Classification with Feedforward and Recurrent Neural Networks

## 🎯 Objective
Train and evaluate Feedforward Neural Networks (FFNN) and Recurrent Neural Networks (RNN) on a multiclass document classification task using various bag-of-words (BoW) feature extraction methods.

---

## 📁 Dataset Information

- **Source**: A subset of the ArXiv-10 dataset ([Dataset Link](https://paperswithcode.com/dataset/arxiv-10))
- **Structure**: 
  - `abstract` (input feature)
  - `field` (target label)
- **Task**: Predict the field of research (10-class classification) from the article abstract.

---

# 🗺️ Workflow Overview

Use the links below to jump directly to the relevant code sections.

---

## 🧹 Part 0: Common Preprocessing and Setup (Shared)
- [🧩 Code: Imports and Config (0.1)](#code-imports-and-config-01)
- [🧩 Code: Load and Inspect Dataset (0.2)](#code-load-and-inspect-dataset-02)
- [🧩 Code: Train-Dev Split (0.3)](#code-train-dev-split-03)
- [🧩 Code: Text Preprocessing (0.4)](#code-text-preprocessing-04)

---

## ✨ Part A: Feedforward Neural Network with Bag-of-Words (Håkon and Haakon)
- [🧩 Code: Feature Extraction – Bag of Words (A.1)](#code-feature-extraction--bag-of-words-a1)
- [🧩 Code: FFNN Model Definition (A.2)](#code-ffnn-model-definition-a2)
- [🧩 Code: FFNN Training Loop (A.3)](#code-ffnn-training-loop-a3)
- [🧩 Code: FFNN Evaluation on Dev Set (A.4)](#code-ffnn-evaluation-on-dev-set-a4)

---

## 🔁 Part B: Recurrent Neural Network with Pre-trained Embeddings (Andreas and Stian)
- [🧩 Code: Load Pre-trained Word Embeddings (B.1)](#code-load-pre-trained-word-embeddings-b1)
- [🧩 Code: Text-to-Sequence Pipeline (B.2)](#code-text-to-sequence-pipeline-b2)
- [🧩 Code: RNN Model Definition (B.3)](#code-rnn-model-definition-b3)
- [🧩 Code: RNN Training Loop (B.4)](#code-rnn-training-loop-b4)
- [🧩 Code: RNN Evaluation on Dev Set (B.5)](#code-rnn-evaluation-on-dev-set-b5)

---

# 🧹 Part 0: Common Preprocessing and Setup (Both Must Use)

## 🔧 0.1: Imports and Config

- Load libraries (NumPy, pandas, PyTorch, scikit-learn, etc.)
- Set random seed for reproducibility

## 📄 0.2: Load and Inspect Dataset

- Load the dataset from the `.csv.gz` file
- Print sample rows and the label distribution

## 🧪 0.3: Train-Dev Split

- Split the data into **training** and **development** sets
- Use **stratified splitting** to preserve label distribution

### 📊 Purpose of Train/Dev Split

We split the dataset into **training** and **development (dev)** sets to evaluate how well the model generalizes to unseen data.

- **Train Set**: Used to fit the model and update weights.
- **Dev Set**: Used to tune hyperparameters and compare model configurations without overfitting.

> ✅ This separation ensures we don’t evaluate on data the model has already seen, giving us a more honest estimate of its real-world performance.
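
As a rough illustration of what stratification preserves, the sketch below splits a toy label list class by class and compares the class proportions in both parts. The helpers `stratified_split` and `label_proportions` and the toy labels are assumptions made for this example; the notebook itself relies on scikit-learn's `train_test_split` with `stratify=`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, dev_fraction=0.2, seed=42):
    """Return (train_idx, dev_idx) so that each class keeps roughly the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, dev_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_dev = round(len(indices) * dev_fraction)  # dev share taken per class
        dev_idx.extend(indices[:n_dev])
        train_idx.extend(indices[n_dev:])
    return sorted(train_idx), sorted(dev_idx)

def label_proportions(labels):
    """Map each label to its relative frequency."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}

# Toy, imbalanced label list (hypothetical data)
labels = ["cs"] * 50 + ["math"] * 30 + ["physics"] * 20
train_idx, dev_idx = stratified_split(labels)

train_props = label_proportions([labels[i] for i in train_idx])
dev_props = label_proportions([labels[i] for i in dev_idx])
print(train_props)
print(dev_props)
```

Both printed dictionaries show the same class shares, which is the property a plain random split does not guarantee on small or imbalanced classes.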


## ✏️ 0.4: Text Preprocessing

- Clean text (lowercasing, punctuation removal, optional stopwords)
- Tokenization (if needed for embedding-based methods)
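
The optional steps above (stopword removal and tokenization) could look roughly like this. The `STOPWORDS` set and the `preprocess` helper are illustrative assumptions, not part of the notebook. Unlike the notebook's `clean_text`, this sketch replaces removed characters with a space, so hyphenated terms split into separate tokens.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer)
STOPWORDS = {"a", "an", "the", "of", "in", "and", "we", "is", "for", "to"}

def preprocess(text, remove_stopwords=True):
    """Lowercase, strip non-letters, tokenize on whitespace, optionally drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation and digits become spaces
    tokens = text.split()
    if remove_stopwords:
        tokens = [tok for tok in tokens if tok not in STOPWORDS]
    return tokens

print(preprocess("We propose a state-of-the-art model for 3D vision."))
# ['propose', 'state', 'art', 'model', 'd', 'vision']
```

The stray `d` left over from `3D` shows why digit removal deserves a second look before it is applied to scientific abstracts.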

---

# ✨ Part A: Feedforward Neural Network with Bag-of-Words

### 👤 Assigned to: **Håkon and Haakon**

## 📊 A.1: Feature Extraction – Bag of Words

- Use two different vectorization techniques:
  - `CountVectorizer`
  - `TfidfVectorizer`
- Optionally adjust:
  - `ngram_range`
  - `min_df`, `max_df`
  - `max_features`
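
To make the difference between the two vectorizers concrete, here is a minimal pure-Python sketch on three toy documents. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalisation that scikit-learn documents as `TfidfVectorizer` defaults. The documents and the helpers `count_vector` and `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

docs = [
    "neural network training",
    "neural network inference",
    "galaxy cluster survey",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

def count_vector(tokens):
    """Raw term counts in vocabulary order (what a count-based BoW stores)."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

# Document frequency: in how many documents does each term occur?
n_docs = len(tokenized)
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}
# Smoothed inverse document frequency (assumed to match scikit-learn's default)
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_vector(tokens):
    """Counts reweighted by IDF, then scaled to unit L2 norm."""
    raw = [count * idf[term] for count, term in zip(count_vector(tokens), vocab)]
    norm = math.sqrt(sum(value * value for value in raw)) or 1.0
    return [value / norm for value in raw]

print(vocab)
print(count_vector(tokenized[0]))
print([round(value, 3) for value in tfidf_vector(tokenized[0])])
```

In the first document all three words have count 1, but after TF-IDF weighting `training` (which occurs in one document) outweighs `neural` and `network` (which occur in two).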

## 🧠 A.2: Model – Feedforward Neural Network (MLP)

- Design multiple MLP architectures with:
  - 1 hidden layer
  - 2+ hidden layers
- Use BoW feature vectors as input
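
When comparing shallow and deep variants, it can help to know how many parameters each one has. The helper below is a hypothetical utility, not part of the notebook. It counts only the weights and biases of the `Linear` layers and ignores any batch-norm parameters. The input size of 10,000 matches the `max_features` value used in the feature-extraction cell, and 10 is the number of classes.

```python
def mlp_parameter_count(input_dim, hidden_dims, output_dim):
    """Count weights and biases of the Linear layers in an MLP."""
    dims = [input_dim, *hidden_dims, output_dim]
    # Each layer has fan_in * fan_out weights plus fan_out biases
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(dims, dims[1:]))

one_layer = mlp_parameter_count(10_000, [512], 10)
two_layers = mlp_parameter_count(10_000, [512, 256], 10)
print(one_layer)   # 5125642
print(two_layers)  # 5254410
```

Almost all parameters sit in the first layer, so the vocabulary size affects model size far more than the number of hidden layers does.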

## 🏋️ A.3: Training

- Use `CrossEntropyLoss` as the loss function
- Optimizer: Adam or SGD
- Track loss and accuracy per epoch

## 📈 A.4: Evaluation on Dev Set

- Evaluate using:
  - Accuracy
  - Precision
  - Recall
  - Macro-F1 Score
- Plot confusion matrix
- Compare models using both BoW types
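
For reference, macro averaging computes each metric per class and then takes the unweighted mean, so small classes count as much as large ones. The sketch below spells this out in plain Python on made-up labels. The function name `macro_scores` is an assumption; the notebook itself uses scikit-learn's `precision_score`, `recall_score` and `f1_score` with `average="macro"`.

```python
def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy predictions for three classes (hypothetical data)
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

precision, recall, f1 = macro_scores(y_true, y_pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# precision=0.8056 recall=0.7500 f1=0.7389
```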

## 📑 A.5: Summary

- Compare TF-IDF vs CountVectorizer
- Discuss impact of model depth
- Reflect on overfitting/underfitting, training time, etc.

---

# 🔁 Part B: Recurrent Neural Network with Pre-trained Embeddings

### 👤 Assigned to: **Andreas and Stian**

## 🔡 B.1: Load Pre-trained Word Embeddings

- Choose at least one:
  - Word2Vec
  - GloVe
  - FastText
- Use pre-trained embeddings or train your own on an external corpus (optional)

## 🧩 B.2: Text-to-Sequence Pipeline

- Tokenize each abstract into word indices
- Convert each word to embedding
- Pad/truncate sequences to same length
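
The index lookup and the pad/truncate step can be sketched without any library. The vocabulary, the helper names and the choice to keep the *start* of long sequences are assumptions of this example. Keras' `pad_sequences` documents `truncating='pre'` as its default, so a call that only sets `padding='post'` drops tokens from the beginning instead.

```python
def pad_or_truncate(sequence, max_len, pad_value=0):
    """Cut a sequence to max_len, or right-pad it with pad_value."""
    trimmed = sequence[:max_len]
    return trimmed + [pad_value] * (max_len - len(trimmed))

# Toy vocabulary; index 0 is reserved for padding, 1 for unknown words
word_index = {"<OOV>": 1, "neural": 2, "network": 3, "training": 4}

def encode(tokens):
    """Map tokens to indices, falling back to the <OOV> index."""
    return [word_index.get(tok, word_index["<OOV>"]) for tok in tokens]

batch = [
    encode("neural network training is fun".split()),
    encode(["neural"]),
]
padded = [pad_or_truncate(seq, max_len=4) for seq in batch]
print(padded)  # [[2, 3, 4, 1], [2, 0, 0, 0]]
```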

## 🧠 B.3: RNN-Based Classifier

- Use PyTorch to build models with:
  - Simple RNN
  - LSTM
  - GRU
- Vary architectures:
  - Hidden state sizes
  - Layers
  - Bidirectional RNNs

## 🏋️ B.4: Training

- Use `CrossEntropyLoss`
- Optimizer: Adam
- Monitor training loss and accuracy

## 📈 B.5: Evaluation on Dev Set

- Metrics:
  - Accuracy
  - Precision
  - Recall
  - Macro-F1 Score
- Try different sequence pooling methods:
  - Last hidden state
  - Max/mean pooling
  - BiRNN concatenation
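
The pooling options differ in how they collapse the per-timestep outputs into one vector per abstract. Below is a minimal NumPy sketch that ignores padded positions. The array stands in for the `(batch, time, hidden)` output of a `batch_first=True` recurrent layer. The function `pool_outputs` and the toy values are assumptions, and every sequence is assumed to contain at least one real token.

```python
import numpy as np

def pool_outputs(outputs, lengths, mode="mean"):
    """Pool outputs of shape (batch, time, hidden) over the non-padded steps."""
    lengths = np.asarray(lengths)
    batch, time, _ = outputs.shape
    # mask[b, t] is True for real tokens and False for padding
    mask = np.arange(time)[None, :] < lengths[:, None]
    if mode == "mean":
        summed = (outputs * mask[:, :, None]).sum(axis=1)
        return summed / lengths[:, None]
    if mode == "max":
        masked = np.where(mask[:, :, None], outputs, -np.inf)
        return masked.max(axis=1)
    if mode == "last":
        # Output at the last real token of each sequence
        return outputs[np.arange(batch), lengths - 1]
    raise ValueError(f"unknown pooling mode: {mode}")

outputs = np.array([
    [[1.0, 4.0], [3.0, 2.0], [9.0, 9.0]],  # third step is padding
    [[2.0, 0.0], [4.0, 6.0], [6.0, 3.0]],
])
lengths = [2, 3]

print(pool_outputs(outputs, lengths, "mean"))  # [[2. 3.] [4. 3.]]
print(pool_outputs(outputs, lengths, "max"))   # [[3. 4.] [6. 6.]]
print(pool_outputs(outputs, lengths, "last"))  # [[3. 2.] [6. 3.]]
```

Without the mask, the padded step `[9.0, 9.0]` would dominate both the mean and the max of the first sequence.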

## 📑 B.6: Summary

- Compare performance across embedding models and RNN types
- Discuss tradeoffs (training time, performance, etc.)

### 🧩 Code: Imports and Config (0.1)
Set up the required Python libraries and configure global settings (e.g., seed, device).

In [None]:
# ✅ Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, precision_score, recall_score

# ✅ Config
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)

In [None]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("CUDA available:", torch.cuda.is_available())  # True if your system is GPU-ready
if torch.cuda.is_available():
    # Only query the GPU name when one exists; the call raises on CPU-only machines
    print("GPU:", torch.cuda.get_device_name(0))
print("Using device:", DEVICE)

### 🧩 Code: Load and Inspect Dataset (0.2)
Load the compressed dataset and inspect its structure, size, and label distribution.

In [None]:
df = pd.read_csv("data/arxiv100.csv")

# Basic inspection
print(df.head())
print(df["label"].value_counts())
print(f"Dataset size: {df.shape}")

### 🧩 Code: Train-Dev Split (0.3)
Split the dataset into training and development sets using stratified sampling to preserve class balance.


In [None]:
from sklearn.model_selection import train_test_split

# Stratified split on label
train_texts, dev_texts, train_labels, dev_labels = train_test_split(
    df["abstract"], df["label"], 
    test_size=0.2, 
    stratify=df["label"], 
    random_state=SEED
)

### 🧩 Code: Text Preprocessing (0.4)
Clean the abstract text (lowercasing, removing punctuation, etc.) and prepare it for feature extraction.


In [None]:
import re

def clean_text(text):
    text = text.lower() # Converts to lower
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text) # Removes punctuation and special characters
    text = re.sub(r"\d+", "", text) # Removes numbers
    return text

train_texts = train_texts.apply(clean_text)
dev_texts = dev_texts.apply(clean_text)

### 🧩 Code: Feature Extraction – Bag of Words (A.1)
Use `CountVectorizer` and `TfidfVectorizer` to convert abstracts into numerical feature vectors.


In [None]:
from sklearn.preprocessing import LabelEncoder

# === Label encoding (shared across both vectorizers) ===
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(train_labels)
y_dev_encoded = label_encoder.transform(dev_labels)

# === Bag-of-Words Vectorization ===
vectorizers = {
    "count": CountVectorizer(max_features=10000, ngram_range=(1, 2)),
    "tfidf": TfidfVectorizer(max_features=10000, ngram_range=(1, 2)),
}

# Store the feature vectors in a dict for easy comparison
bow_features = {}

for name, vectorizer in vectorizers.items():
    print(f"Fitting and transforming with {name} vectorizer...")
    X_train_vec = vectorizer.fit_transform(train_texts)
    X_dev_vec = vectorizer.transform(dev_texts)

    # Convert to dense arrays for PyTorch usage
    X_train_array = X_train_vec.toarray()
    X_dev_array = X_dev_vec.toarray()

    bow_features[name] = {
        "X_train": X_train_array,
        "X_dev": X_dev_array,
        "y_train": y_train_encoded,
        "y_dev": y_dev_encoded,
        "vectorizer": vectorizer,
    }

print("Done vectorizing with CountVectorizer and TF-IDF.")

### 🧩 Code: FFNN Model Definition (A.2)
Define one or more fully connected feedforward neural networks (MLPs) using PyTorch.

In [None]:
class FFNNClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim, 
                 use_batchnorm=False, use_dropout=True, dropout_rate=0.3):
        super().__init__()

        layers = []
        prev_dim = input_dim

        for h_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, h_dim))
            if use_batchnorm:
                layers.append(nn.BatchNorm1d(h_dim))
            layers.append(nn.ReLU())
            if use_dropout:
                layers.append(nn.Dropout(dropout_rate))
            prev_dim = h_dim

        layers.append(nn.Linear(prev_dim, output_dim))
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)

### 🧩 Code: FFNN Training Loop (A.3)
Implement the training loop for the MLP models, including loss calculation and optimization.


In [None]:
from torch.utils.data import TensorDataset, DataLoader

def train_ffnn(
    model, 
    X_train, y_train, 
    X_dev, y_dev, 
    epochs=10, 
    batch_size=64, 
    lr=1e-3,
    use_weight_init=True,
    verbose=True
):
    # Move model to device
    model = model.to(DEVICE)

    # Optional: custom weight initialization
    if use_weight_init:
        def init_weights(m):
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                nn.init.zeros_(m.bias)
        model.apply(init_weights)

    # Prepare data loaders
    train_dataset = TensorDataset(
        torch.tensor(X_train, dtype=torch.float32),
        torch.tensor(y_train, dtype=torch.long)
    )
    dev_dataset = TensorDataset(
        torch.tensor(X_dev, dtype=torch.float32),
        torch.tensor(y_dev, dtype=torch.long)
    )
    
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    dev_loader = DataLoader(dev_dataset, batch_size=batch_size)

    # Set loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        model.train()
        total_loss = 0
        correct = 0
        total = 0

        for X_batch, y_batch in train_loader:
            X_batch, y_batch = X_batch.to(DEVICE), y_batch.to(DEVICE)

            optimizer.zero_grad()
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            loss.backward()
            optimizer.step()

            total_loss += loss.item() * y_batch.size(0)
            preds = outputs.argmax(dim=1)
            correct += (preds == y_batch).sum().item()
            total += y_batch.size(0)

        train_loss = total_loss / total
        train_acc = correct / total

        # Evaluate on dev set
        model.eval()
        with torch.no_grad():
            dev_correct = 0
            dev_total = 0
            for X_dev_batch, y_dev_batch in dev_loader:
                X_dev_batch, y_dev_batch = X_dev_batch.to(DEVICE), y_dev_batch.to(DEVICE)
                dev_outputs = model(X_dev_batch)
                dev_preds = dev_outputs.argmax(dim=1)
                dev_correct += (dev_preds == y_dev_batch).sum().item()
                dev_total += y_dev_batch.size(0)

        dev_acc = dev_correct / dev_total

        if verbose:
            print(f"Epoch {epoch+1}/{epochs} | Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.4f} | Dev Acc: {dev_acc:.4f}")


In [None]:
model = FFNNClassifier(
    input_dim=bow_features["count"]["X_train"].shape[1],
    hidden_dims=[512, 256],
    output_dim=len(label_encoder.classes_),
    use_batchnorm=True,
    use_dropout=True
)

train_ffnn(
    model=model,
    X_train=bow_features["count"]["X_train"],
    y_train=bow_features["count"]["y_train"],
    X_dev=bow_features["count"]["X_dev"],
    y_dev=bow_features["count"]["y_dev"],
    epochs=10,
    batch_size=64,
    lr=1e-3
)


### 🧩 Code: FFNN Evaluation on Dev Set (A.4)
Evaluate the trained MLPs on the dev set using accuracy, precision, recall, and macro-F1 score.


In [None]:
def evaluate_ffnn(model, X, y_true, label_encoder, title="Confusion Matrix", verbose=True):
    model.eval()
    with torch.no_grad():
        X_tensor = torch.tensor(X, dtype=torch.float32).to(DEVICE)
        preds = model(X_tensor).argmax(dim=1).cpu().numpy()

    # Metrics
    y_true = np.array(y_true)
    acc = accuracy_score(y_true, preds)
    f1 = f1_score(y_true, preds, average="macro")
    precision = precision_score(y_true, preds, average="macro")
    recall = recall_score(y_true, preds, average="macro")

    if verbose:
        print("Classification Report:\n")
        print(classification_report(y_true, preds, target_names=label_encoder.classes_))
    
        # Confusion Matrix
        cm = confusion_matrix(y_true, preds)
        plt.figure(figsize=(8, 6))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                    xticklabels=label_encoder.classes_,
                    yticklabels=label_encoder.classes_)
        plt.xlabel("Predicted Label")
        plt.ylabel("True Label")
        plt.title(title)
        plt.tight_layout()
        plt.show()

    return {
        "accuracy": acc,
        "f1": f1,
        "precision": precision,
        "recall": recall
    }


In [None]:
# Evaluate on the same features the model was trained on (CountVectorizer)
results = evaluate_ffnn(
    model=model,
    X=bow_features["count"]["X_dev"],
    y_true=bow_features["count"]["y_dev"],
    label_encoder=label_encoder,
    title="CountVectorizer (2-hidden-layer FFNN) Confusion Matrix"
)

### 🧩 Code: Load Pre-trained Word Embeddings (B.1)
Load pre-trained embeddings (e.g., GloVe, Word2Vec) and map vocabulary to embedding vectors.


In [None]:
def load_glove_embeddings(path="glove.6B.100d.txt"):
    embeddings = {}
    with open(path, 'r', encoding='utf8') as f:
        for line in f:
            values = line.strip().split()
            word = values[0]
            vec = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vec
    return embeddings


### 🧩 Code: Text-to-Sequence Pipeline (B.2)
Convert preprocessed abstracts into padded sequences of word indices aligned with the embedding matrix.


In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=20000, oov_token="<OOV>")
tokenizer.fit_on_texts(train_texts)

X_train_seq = tokenizer.texts_to_sequences(train_texts)
X_dev_seq = tokenizer.texts_to_sequences(dev_texts)

max_len = 200  # or compute based on data
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len, padding='post')
X_dev_pad = pad_sequences(X_dev_seq, maxlen=max_len, padding='post')
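
`RNNClassifier` expects an `embedding_matrix` whose row `i` holds the vector for word index `i`, but no cell constructs one. A possible bridge between the GloVe dictionary and the tokenizer's word index is sketched below. The function name, the random initialisation for missing words and the toy inputs are assumptions, not the authors' implementation.

```python
import numpy as np

def build_embedding_matrix(word_index, embeddings, emb_dim, num_words=None, seed=42):
    """Row i holds the vector for the word with index i; row 0 stays zero for padding."""
    rng = np.random.default_rng(seed)
    vocab_size = len(word_index) + 1
    if num_words is not None:
        vocab_size = min(vocab_size, num_words)

    matrix = np.zeros((vocab_size, emb_dim), dtype="float32")
    hits = 0
    for word, idx in word_index.items():
        if idx >= vocab_size:
            continue  # word falls outside the kept vocabulary
        vector = embeddings.get(word)
        if vector is not None:
            matrix[idx] = vector
            hits += 1
        else:
            # Words missing from the pre-trained file get a small random vector
            matrix[idx] = rng.normal(scale=0.1, size=emb_dim)
    return matrix, hits

# Toy inputs standing in for tokenizer.word_index and the loaded GloVe dict
toy_word_index = {"<OOV>": 1, "neural": 2, "network": 3, "quark": 4}
toy_embeddings = {"neural": np.ones(3), "network": np.full(3, 2.0)}

toy_matrix, toy_hits = build_embedding_matrix(toy_word_index, toy_embeddings, emb_dim=3, num_words=4)
print(toy_matrix.shape, toy_hits)  # (4, 3) 2
```

With the notebook's objects, the call would presumably be `build_embedding_matrix(tokenizer.word_index, load_glove_embeddings(), emb_dim=100, num_words=20000)`, where `emb_dim=100` matches the `glove.6B.100d.txt` file.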


### 🧩 Code: RNN Model Definition (B.3)
Define the RNN model architecture using PyTorch (Simple RNN, LSTM, or GRU).


In [None]:
class RNNClassifier(nn.Module):
    def __init__(self, embedding_matrix, hidden_dim, output_dim, rnn_type="lstm"):
        super().__init__()
        vocab_size, emb_dim = embedding_matrix.shape
        self.embedding = nn.Embedding.from_pretrained(torch.tensor(embedding_matrix, dtype=torch.float32), freeze=False)

        # Pick the recurrent layer; unknown names fall back to LSTM as before
        rnn_classes = {"rnn": nn.RNN, "gru": nn.GRU, "lstm": nn.LSTM}
        rnn_cls = rnn_classes.get(rnn_type, nn.LSTM)
        self.rnn = rnn_cls(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

        self.fc = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, x):
        x = self.embedding(x)
        _, state = self.rnn(x)
        # nn.LSTM returns (h_n, c_n); nn.RNN and nn.GRU return h_n directly
        hidden = state[0] if isinstance(state, tuple) else state
        # BiRNN: concatenate the final forward and backward states of the last layer
        out = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        return self.fc(out)

### 🧩 Code: RNN Training Loop (B.4)
Train the RNN on sequence data, tracking loss and accuracy across epochs.


In [None]:
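def train_rnn(model, X_train, y_train, X_dev, y_dev, epochs=10, lr=1e-3, batch_size=64):
    model.to(DEVICE)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)

    train_data = torch.utils.data.TensorDataset(
        torch.tensor(X_train, dtype=torch.long),
        torch.tensor(y_train, dtype=torch.long)
    )
    dev_data = torch.utils.data.TensorDataset(
        torch.tensor(X_dev, dtype=torch.long),
        torch.tensor(y_dev, dtype=torch.long)
    )
    loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)
    dev_loader = torch.utils.data.DataLoader(dev_data, batch_size=batch_size)

    for epoch in range(epochs):
        model.train()
        total_loss, correct, total = 0.0, 0, 0
        for X_batch, y_batch in loader:
            X_batch, y_batch = X_batch.to(DEVICE), y_batch.to(DEVICE)
            optimizer.zero_grad()
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            loss.backward()
            optimizer.step()

            # Accumulate loss and accuracy over the whole epoch, not just the last batch
            total_loss += loss.item() * y_batch.size(0)
            correct += (outputs.argmax(dim=1) == y_batch).sum().item()
            total += y_batch.size(0)

        # Dev accuracy, computed in batches without gradients
        model.eval()
        dev_correct = 0
        with torch.no_grad():
            for X_batch, y_batch in dev_loader:
                X_batch, y_batch = X_batch.to(DEVICE), y_batch.to(DEVICE)
                dev_correct += (model(X_batch).argmax(dim=1) == y_batch).sum().item()

        print(f"Epoch {epoch+1}/{epochs} - Loss: {total_loss / total:.4f} | "
              f"Train Acc: {correct / total:.4f} | Dev Acc: {dev_correct / len(dev_data):.4f}")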

### 🧩 Code: RNN Evaluation on Dev Set (B.5)
Evaluate the RNN using dev data and compute relevant classification metrics.


In [None]:
def evaluate_rnn(model, X, y, label_encoder):
    model.eval()
    with torch.no_grad():
        preds = model(torch.tensor(X, dtype=torch.long).to(DEVICE)).argmax(dim=1).cpu().numpy()
    print(classification_report(y, preds, target_names=label_encoder.classes_))