**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part II: BERT

Please see the description of the assignment in the README file (section 2) <br>
**Guide notebook**: [guides/bert_guide.ipynb](guides/bert_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I, BoW? Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bert_guide` notebook

* **Optionally**, you can fine-tune a pre-trained BERT model to classify news articles (the same task as in Part I), as is done in [guides/bert_guide_finetuning.ipynb](guides/bert_guide_finetuning.ipynb). This part is optional because it requires more computational resources: you will need a GPU. (For reference, training on a 2020 MacBook Pro with 16 GB RAM and an M1 chip results in an out-of-memory error.) We therefore suggest Google Colab or another cloud-based service with a GPU. You can upload the `bert_guide_finetuning.ipynb` notebook to Google Colab and run it there.

<br>

***

In [1]:
# Install dependencies, then import everything used in this notebook
!pip install transformers huggingface_hub datasets torch evaluate accelerate fastparquet

import evaluate
import numpy as np
from datasets import load_dataset, DatasetDict
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from transformers import pipeline, AutoTokenizer, BertModel, TrainingArguments, Trainer, DataCollatorWithPadding

Collecting evaluate
  Using cached evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting accelerate
  Downloading accelerate-1.5.2-py3-none-any.whl.metadata (19 kB)
Using cached evaluate-0.4.3-py3-none-any.whl (84 kB)
Downloading accelerate-1.5.2-py3-none-any.whl (345 kB)
Installing collected packages: accelerate, evaluate
Successfully installed accelerate-1.5.2 evaluate-0.4.3


### 1. Load the data

We can load this data directly from the Hugging Face Hub with [Hugging Face Datasets](https://huggingface.co/docs/datasets/). `load_dataset` returns a `DatasetDict` with a `train` and a `test` split. Pretty neat!

**Note**: The first run of this cell downloads the dataset. By default the `datasets` library caches it on disk, so later runs in the same environment should reuse the cached copy instead of downloading it again.

You are welcome to increase the sample sizes passed to `select(range(...))` to use more data.

In [2]:
ag_news = load_dataset("fancyzhx/ag_news")

# Prepare the training set by sampling 1200 examples (and the test set with 760)
ag_news_train = ag_news["train"].shuffle(seed=42).select(range(1200))
ag_news_test = ag_news["test"].shuffle(seed=42).select(range(760))

# Combine the sampled splits into a DatasetDict
ag_news = DatasetDict({
    "train": ag_news_train,
    "test": ag_news_test
})

ag_news

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 760
    })
})

In [3]:
embedder = pipeline(
    model="bert-base-uncased",              # model used for embedding
    tokenizer="bert-base-uncased",          # tokenizer used for embedding
    task="feature-extraction",               # feature extraction task (returns embeddings)
    device=0                                  # GPU index 0 (not a CPU id); -1 selects the CPU
)

Device set to use cpu
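
For `pipeline`, an integer `device` of 0 or higher selects a GPU by index, while `-1` selects the CPU. A minimal sketch of choosing the value at runtime follows. The helper name `pick_pipeline_device` is hypothetical (it is not part of the guide), and it falls back to the CPU when PyTorch is missing or no CUDA device is visible.

```python
def pick_pipeline_device() -> int:
    """Return 0 (first CUDA GPU) if one is usable, otherwise -1 (CPU)."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_pipeline_device()
print(f"pipeline device argument: {device}")
```

The result could then be passed as `device=device` when building `embedder`.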


In [4]:
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def get_embeddings(data):
    """Extract the [CLS] embedding for each text in a batch."""
    inputs = tokenizer(data["text"], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():  # no gradients are needed for feature extraction
        outputs = model(**inputs)
    # [CLS] is the first token of every sequence: shape (batch_size, 768)
    cls_embeddings = outputs.last_hidden_state[:, 0, :].numpy()
    return {"embeddings": cls_embeddings}

ag_news = ag_news.map(get_embeddings, batched=True, batch_size=8)

Map:   0%|          | 0/1200 [00:00<?, ? examples/s]

Map:   0%|          | 0/760 [00:00<?, ? examples/s]
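
The `[CLS]` extraction in `get_embeddings` is an indexing operation on a tensor of shape `(batch, seq_len, hidden)`. The sketch below reproduces it on a toy NumPy array so the shapes are easy to follow. The array values and the attention mask are made up for illustration, and the mean-pooling variant is a common alternative shown only for comparison; this notebook does not use it.

```python
import numpy as np

# Toy stand-in for `outputs.last_hidden_state`: 2 texts, 4 token positions, hidden size 3.
# (bert-base-uncased has hidden size 768.)
last_hidden_state = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)

# Attention mask in the tokenizer's convention: 1 = real token, 0 = padding.
attention_mask = np.array([[1, 1, 1, 1],
                           [1, 1, 0, 0]])

# [CLS] embedding: the vector at position 0 of every sequence.
cls_embeddings = last_hidden_state[:, 0, :]

# Mean pooling (alternative): average over real tokens only, ignoring padding.
mask = attention_mask[:, :, None]  # shape (2, 4, 1) so it broadcasts over the hidden axis
mean_embeddings = (last_hidden_state * mask).sum(axis=1) / mask.sum(axis=1)

print(cls_embeddings.shape)   # (2, 3)
print(mean_embeddings.shape)  # (2, 3)
```

Either variant yields one fixed-size vector per text, which is what the logistic regression classifier needs as input.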

In [5]:
ag_news

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 760
    })
})

In [6]:
try:
    X_train = np.array(ag_news["train"]["embeddings"])  # Feature embeddings
    y_train = np.array(ag_news["train"]["label"])       # Labels

    X_test = np.array(ag_news["test"]["embeddings"])
    y_test = np.array(ag_news["test"]["label"])

    # Check shapes
    print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
    print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")
except Exception as e:
    print("An error occurred:", e)

X_train shape: (1200, 768), y_train shape: (1200,)
X_test shape: (760, 768), y_test shape: (760,)


In [7]:
lr = LogisticRegression(max_iter=1000)

lr.fit(X_train, y_train)

In [8]:
y_pred_train = lr.predict(X_train)

print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       294
           1       1.00      1.00      1.00       296
           2       1.00      1.00      1.00       285
           3       1.00      1.00      1.00       325

    accuracy                           1.00      1200
   macro avg       1.00      1.00      1.00      1200
weighted avg       1.00      1.00      1.00      1200



In [9]:
y_pred = lr.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.85      0.87       191
           1       0.94      0.96      0.95       184
           2       0.86      0.79      0.82       193
           3       0.77      0.85      0.81       192

    accuracy                           0.86       760
   macro avg       0.86      0.86      0.86       760
weighted avg       0.86      0.86      0.86       760
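
To make the columns of the report concrete, here is a small sketch that computes precision, recall, F1, and support by hand for one class at a time. The labels are toy values (four classes numbered 0 to 3, as in the report), not the notebook's predictions, and `per_class_metrics` is a hypothetical helper written for this illustration.

```python
from collections import Counter

# Toy ground truth and predictions for four classes (0-3).
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 3, 1, 1, 2, 3, 3, 3]


def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


support = Counter(y_true)  # number of true examples per class
for label in sorted(support):
    p, r, f = per_class_metrics(y_true, y_pred, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={support[label]}")
```

In this toy data, class 3 absorbs two misclassified examples, so its precision drops while its recall stays perfect. Class 3 in the test report above shows the same pattern: precision (0.77) is lower than recall (0.85).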



In [None]:
# Test accuracy increased from 81% with BoW (Part I) to 86% with BERT embeddings.
# Per-class test results (AG News label ids: 0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech):
# World (0):    precision 0.89, recall 0.85, F1-score 0.87
# Sports (1):   precision 0.94, recall 0.96, F1-score 0.95
# Business (2): precision 0.86, recall 0.79, F1-score 0.82
# Sci/Tech (3): precision 0.77, recall 0.85, F1-score 0.81

# Hyperparameters (these apply to fine-tuning BERT, the optional part): batch size 16-32 works best;
# typically 3-5 epochs; a learning rate of 2e-5 to 5e-5 is a good starting point.
# Compared with BoW, BERT tends to have higher precision in most classes, particularly class 1 (Sports),
# which suggests it produces fewer false positives.
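
The fine-tuning ranges noted above can be turned into a small search grid. This is only a sketch of how the candidate configurations could be enumerated: the dictionary keys are illustrative names rather than arguments of a specific library, and the intermediate values (4 epochs, 3e-5) are assumptions added to fill the stated ranges.

```python
from itertools import product

# Candidate values taken from the ranges in the reflection above.
batch_sizes = [16, 32]
epoch_counts = [3, 4, 5]
learning_rates = [2e-5, 3e-5, 5e-5]

# One dictionary per combination of the three hyperparameters.
grid = [
    {"batch_size": batch, "epochs": epochs, "learning_rate": rate}
    for batch, epochs, rate in product(batch_sizes, epoch_counts, learning_rates)
]

print(len(grid))  # 18
print(grid[0])
```

Each entry would correspond to one fine-tuning run, so with limited GPU time it may be more practical to try a handful of these combinations than the full grid.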
