**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part II: BERT

Please see the description of the assignment in the README file (section 2) <br>
**Guide notebook**: [guides/bert_guide.ipynb](guides/bert_guide.ipynb)


***

<br>

* Note that you should report results using a classification report.

* Also, remember to include some reflections on your results: how do they compare with the results from Part I, BoW? Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bert_guide` notebook.

* **Optionally**, you can fine-tune a pre-trained BERT model to classify news articles (the same task as in Part I), as is done in [guides/bert_guide_finetuning.ipynb](guides/bert_guide_finetuning.ipynb). Fine-tuning requires more computational resources, so you will need a GPU: for reference, training on a 2020 MacBook Pro with 16GB RAM and an M1 chip results in an out-of-memory error. We suggest Google Colab or another cloud-based service with a GPU; you can upload the `bert_guide_finetuning.ipynb` notebook to Google Colab and run it there.

<br>

***

# Install Dependencies

In [None]:
!pip install 

Collecting huggingface_hub==0.16.4
  Downloading huggingface_hub-0.16.4-py3-none-any.whl.metadata (12 kB)
Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Installing collected packages: huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.30.2
    Uninstalling huggingface-hub-0.30.2:
      Successfully uninstalled huggingface-hub-0.30.2
Successfully installed huggingface_hub-0.16.4


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 3.5.1 requires huggingface-hub>=0.24.0, but you have huggingface-hub 0.16.4 which is incompatible.


# Imports for the project

In [None]:
from collections import defaultdict
from datasets import Dataset, DatasetDict
from langchain_core.documents import Document
from sentence_transformers import InputExample, SentenceTransformer, SentencesDataset, losses, models
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, pipeline, TrainingArguments, Trainer, DataCollatorWithPadding
import evaluate
import numpy as np
import pandas as pd
import random

### 1. Load the data

In [3]:
# Load structured CBS knowledge base
df = pd.read_csv("data/cbs_graduate_programs.csv")

# Prepare LangChain Document objects with metadata
chunks = []

for _, row in df.iterrows():
    metadata = {
        "url": row["url"],
        "section_title": row["section_title"],
        "section_type": row["section_type"],
        "page_name": row["page_name"]
    }
    # Separate the page name from the chunk text so the two fields do not run together
    text = "page_name: " + row["page_name"] + " text: " + row["text_chunk"]
    chunks.append(Document(page_content=text, metadata=metadata))

print(f"Loaded {len(chunks)} chunks.")

Loaded 1643 chunks.


### 2. Train-test split

In [4]:
# Convert LangChain Documents into dict format
data = [{
    "text": str(doc.page_content) if doc.page_content else "",
    "url": str(doc.metadata.get("url", "")),
    "section_title": str(doc.metadata.get("section_title", "")),
    "section_type": str(doc.metadata.get("section_type", "")),
    "page_name": str(doc.metadata.get("page_name", ""))
} for doc in chunks]

# Train-test split (80/20)
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Convert to HuggingFace Dataset
train_dataset = Dataset.from_list(train_data)
test_dataset = Dataset.from_list(test_data)
cbs_dataset = DatasetDict({"train": train_dataset, "test": test_dataset})

# Inspect
print(cbs_dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'url', 'section_title', 'section_type', 'page_name'],
        num_rows: 1314
    })
    test: Dataset({
        features: ['text', 'url', 'section_title', 'section_type', 'page_name'],
        num_rows: 329
    })
})
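The row counts printed above can be checked by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of test rows and that the remainder goes to training; the helper name `split_sizes` is illustrative and not part of scikit-learn.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 1643 chunks were loaded in step 1; test_size=0.2 as in the split above
n_train, n_test = split_sizes(1643, 0.2)
print(n_train, n_test)  # 1314 329
```

These numbers match the `num_rows` values reported for the `train` and `test` splits.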


In [5]:
page_name_to_chunks = defaultdict(list)
for item in cbs_dataset["train"]:
    page_name_to_chunks[item["page_name"]].append(item)

# Create positive and negative pairs
train_examples = []

for page_name, group in page_name_to_chunks.items():
    # Positive pairs (within same page); InputExample expects strings, so pass the "text" field
    for i in range(len(group)):
        for j in range(i + 1, len(group)):
            train_examples.append(InputExample(texts=[group[i]["text"], group[j]["text"]], label=1.0))

    # Sample negative examples (pair with random chunk from other page)
    negatives = [t for k, v in page_name_to_chunks.items() if k != page_name for t in v]
    for _ in range(min(3, len(group))):  # limit to avoid too many negatives
        anchor = random.choice(group)
        negative = random.choice(negatives)
        train_examples.append(InputExample(texts=[anchor["text"], negative["text"]], label=0.0))
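To see how many pairs this scheme produces, here is a minimal standalone sketch of the same pairing logic on toy data. The function `build_pairs`, the toy page names, and the fixed seed are illustrative assumptions; plain tuples stand in for `InputExample` objects.

```python
import random
from itertools import combinations


def build_pairs(groups: dict[str, list[str]], max_negatives: int = 3, seed: int = 0):
    """Return (text_a, text_b, label) triples: 1.0 for same-page pairs, 0.0 for cross-page pairs."""
    rng = random.Random(seed)
    pairs = []
    for page, texts in groups.items():
        # Every unordered pair of chunks from the same page is a positive
        pairs += [(a, b, 1.0) for a, b in combinations(texts, 2)]
        # Chunks from all other pages form the pool of negatives
        others = [t for p, ts in groups.items() if p != page for t in ts]
        if not others:
            continue  # a single page has nothing to contrast against
        for _ in range(min(max_negatives, len(texts))):
            pairs.append((rng.choice(texts), rng.choice(others), 0.0))
    return pairs


toy = {
    "page_a": ["a1", "a2", "a3"],
    "page_b": ["b1", "b2"],
}
pairs = build_pairs(toy)
n_pos = sum(1 for _, _, label in pairs if label == 1.0)
n_neg = sum(1 for _, _, label in pairs if label == 0.0)
print(n_pos, n_neg)  # 4 5
```

Note that positives grow quadratically with the number of chunks per page, while negatives are capped at three per page, so large pages can make the training set heavily skewed towards positive pairs.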


In [6]:
# Load all-MiniLM-L6-v2 as the sentence-transformer backbone and add a pooling layer on top
word_embedding_model = models.Transformer("sentence-transformers/all-MiniLM-L6-v2")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
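The pooling layer turns the per-token embeddings into a single vector per text. The sketch below shows the idea in plain Python, assuming the layer runs in mean-pooling mode (averaging over non-padding tokens); the function name `mean_pool` and the toy vectors are illustrative only.

```python
def mean_pool(token_embeddings: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row stands in for a padding token
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

The padding row is excluded by the mask, so it does not distort the sentence vector.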


In [7]:
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,             # Increase if needed
    warmup_steps=100,
    show_progress_bar=True
)

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/857 [00:00<?, ?it/s]

KeyboardInterrupt: 
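For intuition about the training objective, the sketch below computes a cosine-similarity loss by hand. It assumes the loss is the mean squared error between the cosine similarity of the two embeddings and the pair label, which is how we understand the default behaviour of `CosineSimilarityLoss`; the helper names and toy vectors are illustrative.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_mse_loss(pairs) -> float:
    """Mean squared error between cosine similarity and the target label over (u, v, label) triples."""
    errors = [(cosine_similarity(u, v) - label) ** 2 for u, v, label in pairs]
    return sum(errors) / len(errors)


batch = [
    ([1.0, 0.0], [1.0, 0.0], 1.0),  # same direction, labelled similar: squared error 0
    ([1.0, 0.0], [0.0, 1.0], 0.0),  # orthogonal, labelled dissimilar: squared error 0
    ([1.0, 0.0], [1.0, 0.0], 0.0),  # same direction, labelled dissimilar: squared error 1
]
print(round(cosine_mse_loss(batch), 4))  # 0.3333
```

With labels of 1.0 and 0.0, the objective pulls same-page chunks towards a cosine similarity of 1 and cross-page chunks towards 0.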

In [None]:
model = SentenceTransformer("models/modernbert-cbs-embedding")

# Load ModernBERT tokenizer and model from HuggingFace

In [None]:
# Define the mapping from label ids to label names
id2label = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

# Define the mapping from label names to label ids (the reverse of id2label)
label2id = {v: k for k, v in id2label.items()}

# Load the model with a 4-class classification head
from transformers import ModernBertForSequenceClassification

model = ModernBertForSequenceClassification.from_pretrained("answerdotai/ModernBERT-base", num_labels=4, id2label=id2label, label2id=label2id)

# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
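The two mappings are what let the model report class names instead of integers. A minimal sketch of how a row of logits becomes a label name; the logit values and the helper `predict_label` are hypothetical and chosen only for illustration.

```python
id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
label2id = {name: idx for idx, name in id2label.items()}


def predict_label(logits: list[float]) -> str:
    """Pick the index of the largest logit and translate it to its label name."""
    best_id = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best_id]


example_logits = [0.1, 2.3, -0.7, 0.4]  # hypothetical scores for one article
print(predict_label(example_logits))  # Sports
print(label2id["Sci/Tech"])  # 3
```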

# Tokenize and encode the data

In [None]:
def preprocess_function(examples):
    """ Tokenize the text column in the examples. """
    return tokenizer(examples["text"], truncation=True)

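# NOTE: `ag_news` is not defined in this notebook; it is assumed to be the AG News DatasetDict loaded as in the fine-tuning guide
tokenized_ag_news = ag_news.map(preprocess_function, batched=True, batch_size=4)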

# Set evaluation metric

In [None]:
def compute_metrics(eval_pred):
    """Compute the weighted F1 score from the logits and labels passed in by the Trainer."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    # 'weighted' averages the per-class F1 scores by class support (multiclass setting)
    return {"f1": f1_score(labels, predictions, average="weighted")}
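To make the weighted average concrete, the sketch below recomputes a weighted F1 score by hand on toy labels: each class's F1 is weighted by the number of true examples of that class. The function `weighted_f1` and the toy data are illustrative, not taken from the assignment.

```python
from collections import Counter


def weighted_f1(labels: list[int], predictions: list[int]) -> float:
    """Per-class F1 averaged with weights equal to each class's support in `labels`."""
    support = Counter(labels)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, predictions) if y != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(labels)


labels = [0, 0, 0, 1, 1, 2]
predictions = [0, 0, 1, 1, 1, 2]
# Class 0: F1 = 0.8 (support 3), class 1: F1 = 0.8 (support 2), class 2: F1 = 1.0 (support 1)
print(round(weighted_f1(labels, predictions), 4))  # 0.8333
```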

# Define a data collator and mount Google Drive

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

from google.colab import drive
drive.mount('/content/drive')
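The data collator pads each batch only up to the longest sequence in that batch rather than to a global maximum. A plain-Python sketch of this dynamic padding, assuming right-side padding; the function `pad_batch`, the token ids, and `pad_token_id=0` are placeholders and not the values used by the ModernBERT tokenizer.

```python
def pad_batch(sequences: list[list[int]], pad_token_id: int = 0) -> dict[str, list[list[int]]]:
    """Pad every sequence to the longest one in the batch and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two hypothetical tokenized texts of different lengths
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"][0])       # [101, 7592, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
```

Padding per batch keeps short batches short, which saves memory and compute compared with padding everything to the model's maximum length.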

# Train the model

In [None]:
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/AIML_2025/ma2/my_awesome_model",  # THIS NEEDS TO CHANGE ON GOOGLE COLAB: "/content/drive/MyDrive/Colab Notebooks/my_awesome_model" or similar. Please check the path.
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.025,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ag_news["train"],
    eval_dataset=tokenized_ag_news["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
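Before starting a run on Colab it helps to estimate how many optimizer steps the settings above imply. The sketch assumes a single device, no gradient accumulation, and the full AG News training split of 120,000 articles; the notebook does not show how `ag_news` was loaded, so replace the number with `len(tokenized_ag_news["train"])`. The helper name is illustrative.

```python
import math


def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Steps per epoch (last partial batch included) times the number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# per_device_train_batch_size=16 and num_train_epochs=5 as in TrainingArguments above;
# 120_000 is an assumed dataset size
print(total_optimizer_steps(120_000, 16, 5))  # 37500
```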

# Evaluate the model

In [None]:
train_predictions = trainer.predict(tokenized_ag_news["train"])
test_predictions = trainer.predict(tokenized_ag_news["test"])

# Extract predictions and labels
train_preds, train_labels = train_predictions.predictions.argmax(axis=1), train_predictions.label_ids
test_preds, test_labels = test_predictions.predictions.argmax(axis=1), test_predictions.label_ids

# Classification report for train dataset
print("Train Classification Report:")
print(classification_report(train_labels, train_preds))

# Classification report for test dataset
print("Test Classification Report:")
print(classification_report(test_labels, test_preds))