**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part II: BERT

Please see the description of the assignment in the README file (section 2) <br>
**Guide notebook**: [guides/bert_guide.ipynb](guides/bert_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I, BoW? Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bert_guide` notebook

* **Optionally**, you can fine-tune a pre-trained BERT model to classify news articles, as is done in [guides/bert_guide_finetuning.ipynb](guides/bert_guide_finetuning.ipynb); this is the same task as in Part I. Fine-tuning requires more computational resources, so you will need a GPU if you decide to complete this part. (For reference, training on a 2020 MacBook Pro with 16GB RAM and an M1 chip results in an out-of-memory error.) We therefore suggest Google Colab or another cloud-based service with a GPU. You can easily upload the `bert_guide_finetuning.ipynb` notebook to Google Colab and run it there.

<br>

***

In [24]:
# imports for the project

from datasets import load_dataset, DatasetDict

from transformers import pipeline
import numpy as np
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression


### 1. Load the data

We can load this data directly from the Hugging Face Hub as `Dataset` objects using [Hugging Face Datasets](https://huggingface.co/docs/datasets/). Pretty neat!

**Note**: The first run of this cell downloads the dataset. By default, `load_dataset` caches it on disk, so re-running the cell should reuse the cached copy rather than download it again.

You are welcome to increase `TRAIN_SIZE` and `TEST_SIZE` to load more data.

In [25]:
# Define dataset sizes
TRAIN_SIZE = 20  # percent as whole number
TEST_SIZE = 20  # percent as whole number

In [26]:
ag_news_train = load_dataset("fancyzhx/ag_news", split=f"train[:{TRAIN_SIZE}%]")
ag_news_test = load_dataset("fancyzhx/ag_news", split=f"test[:{TEST_SIZE}%]")

# Store the original texts before creating embeddings
test_texts = ag_news_test["text"]

ag_news = DatasetDict({
    "train": ag_news_train,
    "test": ag_news_test
})

# Print the structure to verify
print("Train dataset columns:", ag_news["train"].column_names)
print("Test dataset columns:", ag_news["test"].column_names)

Train dataset columns: ['text', 'label']
Test dataset columns: ['text', 'label']


In [28]:
embedder = pipeline(
    model="answerdotai/ModernBERT-base",      # model used for embedding
    tokenizer="answerdotai/ModernBERT-base",  # tokenizer used for embedding
    task="feature-extraction",                # feature extraction task (returns embeddings)
    device=0                                  # use GPU 0 (requires CUDA; set device=-1 to run on CPU)
)

Device set to use cuda:0


In [29]:

# Function to extract embeddings
def get_embeddings(examples):
    """ Extract the [CLS] embedding for each text. """
    embeddings = embedder(examples["text"])  # Full token embeddings
    cls_embeddings = [e[0][0] for e in embeddings]  # Extract first token ([CLS])
    return {"embeddings": cls_embeddings}

# Extract embeddings for both train and test sets
# Using batched=True and batch_size=8 for efficiency
ag_news = ag_news.map(
    get_embeddings,
    batched=True,
    batch_size=8,
    remove_columns=["text"]  # Only remove the text column, keep label
)

# Extract features and labels for scikit-learn
X_train = np.array(ag_news["train"]["embeddings"])  # Feature embeddings
y_train = np.array(ag_news["train"]["label"])       # Labels

X_test = np.array(ag_news["test"]["embeddings"])
y_test = np.array(ag_news["test"]["label"])

# Check shapes
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")


X_train shape: (24000, 768), y_train shape: (24000,)
X_test shape: (1520, 768), y_test shape: (1520,)


In [33]:
# Baseline logistic regression with default hyperparameters (C=1.0, L2 penalty)
lr = LogisticRegression(max_iter=10000)
lr.fit(X_train, y_train)

In [34]:
# Train the best performing model
best_model = LogisticRegression(
    C=0.1,
    penalty='l2',
    max_iter=1000
)
best_model.fit(X_train, y_train)

# Get predictions
y_pred_test = best_model.predict(X_test)
y_pred_train = best_model.predict(X_train)

# Print final results
print("Final Model Results:")
print("\nTraining Results:")
print(classification_report(y_train, y_pred_train))
print("\nTest Results:")
print(classification_report(y_test, y_pred_test))

Final Model Results:

Training Results:
              precision    recall  f1-score   support

           0       0.90      0.90      0.90      6195
           1       0.95      0.97      0.96      5856
           2       0.86      0.85      0.86      5601
           3       0.88      0.88      0.88      6348

    accuracy                           0.90     24000
   macro avg       0.90      0.90      0.90     24000
weighted avg       0.90      0.90      0.90     24000


Test Results:
              precision    recall  f1-score   support

           0       0.89      0.89      0.89       383
           1       0.95      0.96      0.96       404
           2       0.84      0.84      0.84       339
           3       0.88      0.87      0.87       394

    accuracy                           0.89      1520
   macro avg       0.89      0.89      0.89      1520
weighted avg       0.89      0.89      0.89      1520




Our initial results from the first test had a perfect score on the training data but scored significantly lower on the test data, which hints at overfitting. To handle this issue, we experimented with several hyperparameters:

* **Regularization strength** (`C`). This controls how strongly the model is pushed towards simplicity; a simpler model is less likely to overfit the training data.

* **Regularization type.** Ridge (L2), Lasso (L1), and Elastic Net (a mix of L1 and L2). These methods penalize large coefficients, and L1 can even set them to zero.

* **Class weight balancing**, which increases the weight of less frequent classes such as sports.

It seems that `C=0.1` (strong regularization) combined with L2 regularization worked best.
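The search over the hyperparameters listed above could be sketched roughly as follows. This is a minimal sketch, not the procedure actually used in this notebook: the grid values, the `saga` solver, the macro-F1 scoring, and the synthetic `X_demo`/`y_demo` data are all assumptions made for illustration. In practice `X_train` and `y_train` would take the place of the synthetic data, and Elastic Net is left out to keep the grid small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the embedding matrix (assumption for this sketch;
# the notebook's X_train / y_train would be used instead).
X_demo, y_demo = make_classification(
    n_samples=400,
    n_features=20,
    n_informative=8,
    n_classes=4,
    random_state=0,
)

# Hypothetical grid: regularization strength, penalty type, class weighting.
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "penalty": ["l1", "l2"],
    "class_weight": [None, "balanced"],
}

# The saga solver supports both L1 and L2 penalties for multiclass problems.
search = GridSearchCV(
    LogisticRegression(solver="saga", max_iter=5000),
    param_grid,
    scoring="f1_macro",  # macro-averaged F1 treats all classes equally
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated macro F1:", round(search.best_score_, 3))
```

Cross-validation on the training set keeps the test set untouched during tuning, so the final classification report remains an honest estimate.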

Tuning these hyperparameters provided better results with less overfitting. Manually inspecting the wrong predictions gave insights such as: "Text: Athens - a \$12bn bill THE world sighed with relief when Greeks kept their promise to deliver some of the world #39;s finest sport venues in time for the Athens Olympics..." Here a sports article was predicted to be a business article, likely due to the \$12bn mentioned. Articles about the financial aspects of sports probably resemble business articles closely. Other inspections gave similarly understandable insights.
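A manual inspection like the one described above could be done with a small helper along these lines. This is a sketch under assumptions: the function name `misclassified_examples` and the toy inputs are hypothetical, and the label names follow the AG News ordering (0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech). With the notebook's variables, it would be called with `test_texts`, `y_test`, and `y_pred_test`.

```python
import numpy as np

# AG News label ids mapped to readable names.
LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]


def misclassified_examples(texts, y_true, y_pred, max_examples=5, max_chars=200):
    """Return (text snippet, true label, predicted label) for wrong predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Indices where the prediction disagrees with the true label.
    wrong_idx = np.flatnonzero(y_true != y_pred)[:max_examples]
    return [
        (
            texts[int(i)][:max_chars],
            LABEL_NAMES[int(y_true[i])],
            LABEL_NAMES[int(y_pred[i])],
        )
        for i in wrong_idx
    ]


# Toy inputs for illustration only (not taken from the dataset).
demo_texts = [
    "Stadium costs reach a record budget ahead of the games",
    "Shares fall as oil prices climb",
    "Leaders meet for peace talks",
]
demo_true = [1, 2, 0]
demo_pred = [2, 2, 0]

demo_errors = misclassified_examples(demo_texts, demo_true, demo_pred)
for text, true_label, pred_label in demo_errors:
    print(f"true={true_label} | predicted={pred_label} | text={text}")
```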

Increasing the dataset size from 1% to 20% improved performance significantly: the model went from being inferior to BoW to superior. The model performs best on sports news (F1 score of 0.96) but remains challenged by business news, where it reaches an F1 score of 0.84. The gap between training and test performance decreased compared to the initial state of the model, indicating a lower risk of overfitting.
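One way to put a number on the train/test gap is to compare macro-averaged F1 on both splits. The helper below is a minimal sketch: `macro_f1_gap` is a hypothetical name and the toy labels are made up for illustration. With the notebook's variables, it would be called as `macro_f1_gap(y_train, y_pred_train, y_test, y_pred_test)`.

```python
from sklearn.metrics import f1_score


def macro_f1_gap(y_train, y_pred_train, y_test, y_pred_test):
    """Return (train macro F1, test macro F1, train minus test)."""
    train_f1 = f1_score(y_train, y_pred_train, average="macro")
    test_f1 = f1_score(y_test, y_pred_test, average="macro")
    return train_f1, test_f1, train_f1 - test_f1


# Toy labels for illustration: perfect on "train", one mistake on "test".
toy_train_true = [0, 1, 2, 3]
toy_train_pred = [0, 1, 2, 3]
toy_test_true = [0, 1, 2, 3, 3]
toy_test_pred = [0, 1, 2, 2, 3]

train_f1, test_f1, gap = macro_f1_gap(
    toy_train_true, toy_train_pred, toy_test_true, toy_test_pred
)
print(f"train macro F1: {train_f1:.3f}, test macro F1: {test_f1:.3f}, gap: {gap:.3f}")
```

A large positive gap suggests overfitting, while a gap near zero suggests the model generalizes about as well as it fits the training data.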