## Lab6-Assignment: Topic Classification

Use the same training, development, and test partitions of the 20 newsgroups text dataset as in Lab6.4-Topic-classification-BERT.ipynb.

* Fine-tune and examine the performance of another transformer-based pretrained language model, e.g., RoBERTa or XLNet.

* Compare the performance of this model to the results achieved in Lab6.4-Topic-classification-BERT.ipynb and to a conventional machine learning approach (e.g., SVM, Naive Bayes) using bag-of-words or other engineered features of your choice. 

    Describe the differences in performance in terms of Precision, Recall, and F1-score evaluation metrics.
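
One possible way to put the metrics of several models side by side is sketched below. It is only an illustration on made-up labels: `macro_scores`, `model_a`, and `model_b` are hypothetical names, and the toy predictions are not results from this lab.

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 for one model (hypothetical helper)."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='macro', zero_division=0
    )
    return {'precision': p, 'recall': r, 'f1': f1}

# Toy gold labels and predictions, invented purely for illustration
y_true = [0, 0, 1, 1, 2, 2]
pred_a = [0, 0, 1, 1, 2, 2]  # a model that gets everything right
pred_b = [0, 1, 1, 1, 2, 0]  # a model that makes two mistakes

# One row per model, one column per metric
comparison = pd.DataFrame({
    'model_a': macro_scores(y_true, pred_a),
    'model_b': macro_scores(y_true, pred_b),
}).T
print(comparison.round(2))
```

In the assignment, the same helper could be applied to the test labels and the predictions of each classifier, so that the differences can be read off row by row.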

In [19]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from simpletransformers.classification import ClassificationModel, ClassificationArgs
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

In [None]:
# Run only if necessary :)
!pip install simpletransformers --upgrade

In [2]:
categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'sci.space'] 

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=categories, random_state=42)
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=categories, random_state=42)

# RoBERTa

In [10]:
from collections import Counter
Counter(newsgroups_train.target)

Counter({2: 594, 3: 593, 1: 584, 0: 480})

In [11]:
Counter(newsgroups_test.target)

Counter({2: 396, 3: 394, 1: 389, 0: 319})

In [12]:
train = pd.DataFrame({'text': newsgroups_train.data, 'labels': newsgroups_train.target})

In [13]:
print(len(train))
train.head(5)

2251


| | text | labels |
|---|---|---|
| 0 | WHile we are on the subject of the shuttle sof... | 3 |
| 1 | There is a program called Graphic Workshop you... | 1 |
| 2 | | 2 |
| 3 | My girlfriend is in pain from kidney stones. S... | 2 |
| 4 | I think that's the correct spelling..\n\tI am ... | 2 |


In [14]:
from sklearn.model_selection import train_test_split

train, dev = train_test_split(train, test_size=0.1, random_state=0, 
                               stratify=train[['labels']])
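
The effect of the `stratify` argument can be checked on a small imbalanced toy set. This is a minimal sketch with invented data (`toy_texts`, `toy_labels` are hypothetical names); it only shows that a 10% split keeps the 3:1 class ratio.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Invented toy data: 30 examples of class 0 and 10 of class 1 (ratio 3:1)
toy_labels = [0] * 30 + [1] * 10
toy_texts = [f"doc {i}" for i in range(len(toy_labels))]

# Same call pattern as above: 10% held out, stratified on the labels
tr_x, dev_x, tr_y, dev_y = train_test_split(
    toy_texts, toy_labels, test_size=0.1, random_state=0, stratify=toy_labels
)

# Both partitions keep the 3:1 class ratio
print("train:", Counter(tr_y))
print("dev:", Counter(dev_y))
```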

In [15]:
print(len(train))
print("train:", train[['labels']].value_counts(sort=False))
train.head(3)

2025
train: labels
0         432
1         525
2         534
3         534
Name: count, dtype: int64


| | text | labels |
|---|---|---|
| 559 | I wonder how many atheists out there care to s... | 0 |
| 2060 | We are interested in purchasing a grayscale pr... | 1 |
| 1206 | Dear Binary Newsers,\n\nI am looking for Quick... | 1 |


In [16]:
print(len(dev))
print("dev:", dev[['labels']].value_counts(sort=False))
dev.head(3)

226
dev: labels
0         48
1         59
2         60
3         59
Name: count, dtype: int64


| | text | labels |
|---|---|---|
| 1570 | I'd dump him. Rude is rude and it seems he en... | 2 |
| 1761 | Hi Everyone ::\n\nI am looking for some soft... | 1 |
| 455 | A friend of mine has been diagnosed with Psori... | 2 |


In [20]:
# Model configuration # https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model 
model_args = ClassificationArgs()

model_args.overwrite_output_dir=True # overwrite existing saved models in the same directory
model_args.evaluate_during_training=True # to perform evaluation while training the model
# (eval data should be passed to the training method)

model_args.num_train_epochs=10 # number of epochs
model_args.train_batch_size=32 # batch size
model_args.learning_rate=4e-6 # learning rate
model_args.max_seq_length=256 # maximum sequence length
# Note! Increasing max_seq_length may provide better performance, but training time will increase.
# For educational purposes, we set max_seq_length to 256.

# Early stopping to combat overfitting: https://simpletransformers.ai/docs/tips-and-tricks/#using-early-stopping
model_args.use_early_stopping=True
model_args.early_stopping_delta=0.01 # "The improvement over best_eval_loss necessary to count as a better checkpoint"
model_args.early_stopping_metric='eval_loss'
model_args.early_stopping_metric_minimize=True
model_args.early_stopping_patience=2
model_args.evaluate_during_training_steps=32 # how often you want to run validation in terms of training steps (or batches)
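
To build intuition for how `early_stopping_delta` and `early_stopping_patience` interact, here is a simplified, generic sketch of patience-based early stopping on a made-up sequence of evaluation losses. It is not the Simple Transformers implementation, whose exact counting may differ; `early_stop_index` is a hypothetical helper.

```python
def early_stop_index(eval_losses, delta=0.01, patience=2):
    """Return the index of the evaluation at which training would stop, or None.

    Generic illustration: an evaluation counts as an improvement only if the
    loss drops by more than `delta` below the best loss seen so far.
    """
    best = float('inf')
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if best - loss > delta:
            best = loss      # new best checkpoint
            bad_evals = 0    # reset the patience counter
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i     # patience exhausted: stop here
    return None              # never triggered

# Invented losses: two improvements after the first value, then a plateau
losses = [1.30, 1.10, 0.95, 0.949, 0.96, 0.90]
print(early_stop_index(losses))
```

With these values, the drop from 0.95 to 0.949 is smaller than the delta of 0.01, so it counts as a non-improvement, and the following evaluation exhausts the patience of 2.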

In [23]:
# Checking steps per epoch
steps_per_epoch = int(np.ceil(len(train) / float(model_args.train_batch_size)))
print('Each epoch will have {:,} steps.'.format(steps_per_epoch)) # 64 steps = validating 2 times per epoch

Each epoch will have 64 steps.
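
The arithmetic behind the evaluation schedule can be spelled out as below. This sketch assumes one optimizer step per batch (no gradient accumulation) and counts only the evaluations triggered by `evaluate_during_training_steps`; depending on its settings, the library may run additional evaluations, for example at the end of each epoch.

```python
import math

n_train = 2025     # training examples after the 90/10 split
batch_size = 32    # model_args.train_batch_size
eval_every = 32    # model_args.evaluate_during_training_steps
num_epochs = 10    # model_args.num_train_epochs

steps_per_epoch = math.ceil(n_train / batch_size)   # the last batch is partial
evals_per_epoch = steps_per_epoch // eval_every     # step-triggered evaluations
total_steps = steps_per_epoch * num_epochs          # upper bound without early stopping
max_step_evals = total_steps // eval_every

print(steps_per_epoch, evals_per_epoch, total_steps, max_step_evals)
```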


In [None]:
model = ClassificationModel('roberta', 'roberta-base', num_labels=4, args=model_args, use_cuda=True) # use_cuda=True needs a GPU (e.g., on Colab); set it to False when running locally without one

In [None]:
print(str(model.args).replace(',', '\n')) # model args

In [None]:
_, history = model.train_model(train, eval_df=dev) 

In [None]:
# Training and evaluation loss
train_loss = history['train_loss']
eval_loss = history['eval_loss']
plt.plot(train_loss, label='Training loss')
plt.plot(eval_loss, label='Evaluation loss')
plt.title('Training and evaluation loss')
plt.legend()

In [None]:
# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(dev)
result

In [None]:
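# Build the test DataFrame first: it has not been defined at this point in the notebook
test = pd.DataFrame({'text': newsgroups_test.data, 'labels': newsgroups_test.target})
predicted, probabilities = model.predict(test.text.to_list())
test['predicted'] = predicted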

In [None]:
test.head(5)

In [None]:
# Result (note: your result can be different due to randomness in operations)
print(classification_report(test['labels'], test['predicted']))

# Naive Bayes with Bag of Words

In [21]:
train = pd.DataFrame({'text': newsgroups_train.data, 'labels': newsgroups_train.target})
test = pd.DataFrame({'text': newsgroups_test.data, 'labels': newsgroups_test.target})

print("Train Data Distribution:")
print(train['labels'].value_counts())
print("\nTest Data Distribution:")
print(test['labels'].value_counts())

Train Data Distribution:
labels
2    594
3    593
1    584
0    480
Name: count, dtype: int64

Test Data Distribution:
labels
2    396
3    394
1    389
0    319
Name: count, dtype: int64


In [22]:
train, dev = train_test_split(train, test_size=0.1, random_state=0, stratify=train[['labels']])

In [5]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train['text'])
X_dev = vectorizer.transform(dev['text'])
X_test = vectorizer.transform(test['text'])
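
What the bag-of-words step does can be seen on a tiny invented corpus. In this minimal sketch (`toy_train`, `toy_test`, and `toy_vec` are hypothetical names), the vocabulary is learned from the training texts only, so a word that appears only in the test text gets no feature column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus, for illustration only
toy_train = ["the shuttle launch", "the kidney stones"]
toy_test = ["shuttle orbit"]

toy_vec = CountVectorizer()
toy_X_train = toy_vec.fit_transform(toy_train)  # learns the vocabulary and counts words
toy_X_test = toy_vec.transform(toy_test)        # reuses the learned vocabulary

# Feature columns follow the alphabetically sorted vocabulary
vocab = sorted(toy_vec.vocabulary_)
print(vocab)
print(toy_X_train.toarray())
print(toy_X_test.toarray())  # "orbit" is unseen in training, so it is ignored
```

Fitting on the training split only, as in the cell above, keeps information from the dev and test sets out of the feature space.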

### Training The Model

In [7]:
nb_model = MultinomialNB()
nb_model.fit(X_train, train['labels'])

### Evaluating on the Dev Set

In [8]:
y_dev_pred = nb_model.predict(X_dev)
print("Validation Classification Report:")
print(classification_report(dev['labels'], y_dev_pred))

Validation Classification Report:
              precision    recall  f1-score   support

           0       0.74      0.94      0.83        48
           1       0.92      0.95      0.93        59
           2       0.95      0.90      0.92        60
           3       0.91      0.73      0.81        59

    accuracy                           0.88       226
   macro avg       0.88      0.88      0.87       226
weighted avg       0.89      0.88      0.88       226



### Evaluating on the Test Set

In [9]:
y_test_pred = nb_model.predict(X_test)
test['predicted'] = y_test_pred
print("Test Classification Report:")
print(classification_report(test['labels'], test['predicted']))

Test Classification Report:
              precision    recall  f1-score   support

           0       0.77      0.89      0.82       319
           1       0.92      0.87      0.89       389
           2       0.83      0.87      0.85       396
           3       0.90      0.78      0.83       394

    accuracy                           0.85      1498
   macro avg       0.85      0.85      0.85      1498
weighted avg       0.86      0.85      0.85      1498
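
The assignment also allows other engineered features. One option is TF-IDF weighting, chained with the classifier in a `Pipeline`, both of which are imported at the top of the notebook. The sketch below runs on a few invented sentences only (`toy_texts`, `toy_labels`, and `tfidf_nb` are hypothetical names) and says nothing about how this variant would score on the newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data: label 3 stands for sci.space, label 2 for sci.med
toy_texts = [
    "rocket orbit launch",
    "orbit shuttle nasa",
    "doctor kidney pain",
    "pain medicine doctor",
]
toy_labels = [3, 3, 2, 2]

# Raw counts -> TF-IDF weights -> Multinomial Naive Bayes
tfidf_nb = Pipeline([
    ('counts', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])
tfidf_nb.fit(toy_texts, toy_labels)

toy_pred = tfidf_nb.predict(["shuttle launch orbit", "kidney medicine"])
print(toy_pred)
```

On the real data, the pipeline could be fitted with `train['text']` and `train['labels']` and evaluated with `classification_report`, exactly as the count-based model above.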



# Comparisons

### BERT vs. RoBERTa

type here

### BERT vs. Naive Bayes w/ Bag of Words (BoW)

type here