**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part II: BERT

Please see the description of the assignment in the README file (section 2) <br>
**Guide notebook**: [guides/bert_guide.ipynb](guides/bert_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I, BoW? Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bert_guide` notebook.

* **Optionally**, you can fine-tune a pre-trained BERT model to classify news articles (the same task as in Part I), as is done in [guides/bert_guide_finetuning.ipynb](guides/bert_guide_finetuning.ipynb). Because this requires more computational resources, this part is optional. If you decide to complete it, you will need a GPU. (For reference, training on a 2020 MacBook Pro with 16 GB RAM and an M1 chip results in an out-of-memory error.) We therefore suggest Google Colab or another cloud-based service with a GPU. You can upload the `bert_guide_finetuning.ipynb` notebook to Google Colab and run it there.

<br>

***

In [6]:
# imports for the project

from datasets import load_dataset, DatasetDict
from transformers import pipeline
import numpy as np
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

### 1. Load the data

We can load this data directly from the Hugging Face Hub using [Hugging Face Datasets](https://huggingface.co/docs/datasets/), which gives us `Dataset` objects that we collect in a `DatasetDict`. Pretty neat!

**Note**: The first run of this cell downloads the dataset. By default the `datasets` library caches it on disk, so re-running the cell normally reuses the cached copy instead of downloading it again.

You are welcome to increase the percentage of the training split that is loaded (20% here) to use more data.
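The effect of a percent slice on the number of rows can be checked with simple arithmetic. The sketch below is only an illustration: `rows_for_percent` is a hypothetical helper, not part of the `datasets` library, and the total of 120,000 training rows is inferred from the 24,000 rows reported for the 20% slice. The library's own rounding may differ when the percentage does not divide the row count evenly.

```python
def rows_for_percent(total_rows: int, percent: int) -> int:
    """Approximate number of rows selected by a slice such as `train[:20%]`.

    Hypothetical helper for illustration; uses floor division, which may not
    match the rounding applied by the `datasets` library in uneven cases.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return total_rows * percent // 100


# Assumed split sizes for AG News: 120,000 training rows and 7,600 test rows.
train_rows = rows_for_percent(120_000, 20)
test_rows = rows_for_percent(7_600, 100)
print(train_rows, test_rows)  # 24000 7600
```

Doubling the percentage doubles the number of texts that have to be embedded later, so the embedding step scales roughly linearly with this choice.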

In [7]:
TRAIN_SIZE = 20   # percent of the training split, as a whole number
TEST_SIZE = 100   # percent of the test split, as a whole number

In [8]:
ag_news_train = load_dataset("fancyzhx/ag_news", split=f"train[:{TRAIN_SIZE}%]")  # 20% of the training data
ag_news_test = load_dataset("fancyzhx/ag_news", split=f"test[:{TEST_SIZE}%]")  # full test data

ag_news = DatasetDict({
    "train": ag_news_train,
    "test": ag_news_test
})

ag_news

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 24000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

In [9]:
embedder = pipeline(
    model="answerdotai/ModernBERT-base",      # model used for embedding
    tokenizer="answerdotai/ModernBERT-base",  # tokenizer used for embedding
    task="feature-extraction",                # feature extraction task (returns embeddings)
    device=0                                  # use GPU 0 if available
)

Device set to use mps:0


In [10]:
def get_embeddings(data):
    """ Extract the [CLS] embedding for each text. """
    embeddings = embedder(data["text"])  # Full token embeddings
    cls_embeddings = [e[0][0] for e in embeddings]  # Extract first token ([CLS])
    return {"embeddings": cls_embeddings}

ag_news = ag_news.map(get_embeddings, batched=True, batch_size=8)
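The indexing `e[0][0]` in `get_embeddings` may be easier to follow on a toy example. The sketch below assumes, as the code above does, that the feature-extraction pipeline returns one nested list per text, shaped `[1][num_tokens][hidden_size]`. The values and the hidden size of 3 are made up for illustration, and `first_token_embeddings` is a hypothetical name.

```python
# Toy stand-in for the pipeline output: 2 texts, each shaped [1][num_tokens][hidden_size].
# The numbers are invented; the real vectors in this notebook have 768 dimensions.
toy_output = [
    [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]],                    # text 1: 2 tokens
    [[[0.7, 0.8, 0.9], [1.0, 1.1, 1.2], [1.3, 1.4, 1.5]]],  # text 2: 3 tokens
]


def first_token_embeddings(outputs):
    """Keep only the first token's vector per text (batch index 0, token index 0)."""
    return [e[0][0] for e in outputs]


cls_vectors = first_token_embeddings(toy_output)
print(cls_vectors)  # [[0.1, 0.2, 0.3], [0.7, 0.8, 0.9]]
```

Each text ends up with a single fixed-length vector regardless of how many tokens it contains, which is what lets the embeddings be stacked into one feature matrix for the classifier.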

In [11]:
ag_news

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 24000
    })
    test: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 7600
    })
})

In [12]:
X_train = np.array(ag_news["train"]["embeddings"])  # Feature embeddings
y_train = np.array(ag_news["train"]["label"])       # Labels

X_test = np.array(ag_news["test"]["embeddings"])
y_test = np.array(ag_news["test"]["label"])

# Check shapes
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

# Train a logistic regression model on the embeddings
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train, y_train)

# Print classification report for the training set
y_pred_train = lr.predict(X_train)
print(classification_report(y_train, y_pred_train))

X_train shape: (24000, 768), y_train shape: (24000,)
X_test shape: (7600, 768), y_test shape: (7600,)
              precision    recall  f1-score   support

           0       0.92      0.92      0.92      6195
           1       0.97      0.98      0.97      5856
           2       0.87      0.87      0.87      5601
           3       0.89      0.89      0.89      6348

    accuracy                           0.91     24000
   macro avg       0.91      0.91      0.91     24000
weighted avg       0.91      0.91      0.91     24000
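To make the columns of the report concrete, the following sketch recomputes precision, recall, F1, and support for each class by hand. It is a plain-Python illustration of the standard definitions, not the scikit-learn implementation; `per_class_metrics` and the toy labels are hypothetical.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true examples of this class
    return precision, recall, f1, support


# Invented labels for four classes, two examples each.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 3, 1, 1, 2, 3, 3, 3]

for label in range(4):
    p, r, f1, s = per_class_metrics(y_true, y_pred, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={s}")
```

In this toy data, class 3 absorbs the misclassified examples from classes 0 and 2, so its recall is perfect while its precision drops. The same kind of pattern can be read off a real report by comparing the precision and recall columns per class.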



In [13]:
y_pred = lr.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.87      0.88      1900
           1       0.96      0.96      0.96      1900
           2       0.84      0.81      0.82      1900
           3       0.82      0.87      0.84      1900

    accuracy                           0.88      7600
   macro avg       0.88      0.88      0.88      7600
weighted avg       0.88      0.88      0.88      7600



In [15]:
from sklearn.model_selection import GridSearchCV
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Define the hyperparameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'solver': ['liblinear'],             # liblinear supports both l1 and l2 penalties
    'max_iter': [1000, 5000, 10000],
    'penalty': ['l1', 'l2']
}

# Initialize GridSearchCV
grid_search = GridSearchCV(LogisticRegression(random_state=42), param_grid, cv=3, scoring='accuracy', n_jobs=-1)

# Perform the grid search
grid_search.fit(X_train, y_train)

# Get the best model and hyperparameters
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_

# Print the best hyperparameters
print("Best Hyperparameters:", best_params)

# Evaluate the best model on the test set
y_pred_best = best_model.predict(X_test)
print(classification_report(y_test, y_pred_best))

Best Hyperparameters: {'C': 1, 'max_iter': 1000, 'penalty': 'l1', 'solver': 'liblinear'}
              precision    recall  f1-score   support

           0       0.91      0.88      0.89      1900
           1       0.95      0.96      0.96      1900
           2       0.85      0.81      0.83      1900
           3       0.82      0.88      0.85      1900

    accuracy                           0.88      7600
   macro avg       0.88      0.88      0.88      7600
weighted avg       0.88      0.88      0.88      7600



### Reflection on Performance and Hyperparameter Choices

The results show that BERT embeddings combined with a Logistic Regression classifier work well for this text classification task. With default hyperparameters the model reaches 0.91 accuracy on the training set and 0.88 on the test set, with only a modest gap between the two, which suggests that the embeddings capture meaningful features of the text. Class 1 is the easiest to separate (test F1 of 0.96), while classes 2 and 3 are the hardest (test F1 of 0.82 and 0.84). Tuning brought only a marginal gain: test accuracy stays at 0.88, and the F1-scores for classes 0, 2, and 3 each rise by 0.01 (to 0.89, 0.83, and 0.85).

The best hyperparameters (`C=1`, `penalty='l1'`, `solver='liblinear'`, `max_iter=1000`) were selected by grid search with 3-fold cross-validation, scored on accuracy.
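The cost of the search follows directly from the size of the grid. As a rough sketch, the snippet below enumerates the same parameter grid with `itertools.product` and counts the model fits implied by 3-fold cross-validation. It only counts combinations and does not train anything; the variable names are chosen for illustration.

```python
from itertools import product

# Same grid as in the search above.
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "solver": ["liblinear"],
    "max_iter": [1000, 5000, 10000],
    "penalty": ["l1", "l2"],
}
cv_folds = 3

# Every combination of one value per hyperparameter is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per cross-validation fold.
cv_fits = len(candidates) * cv_folds
print(len(candidates), cv_fits)  # 24 72
```

With scikit-learn's default `refit=True`, the best candidate is fitted once more on the full training set, so the search above amounts to 73 fits in total. Adding values to any list multiplies this count, which is why the grid was kept small.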

#### Limitations
Processing time for embedding extraction and hyperparameter tuning was the main constraint. To keep run times manageable, only 20% of the training data (24,000 articles) was embedded and the hyperparameter grid was kept small. Future work could use distributed computing or a cloud-based GPU service (e.g., Google Colab) to enable faster processing and a more comprehensive search.

Overall, the system performs well given the constraints, but there is room for improvement with additional computational resources and a broader hyperparameter search.