## Lab6-Assignment: Topic Classification

Use the same training, development, and test partitions of the 20 newsgroups text dataset as in Lab6.4-Topic-classification-BERT.ipynb.

* Fine-tune and examine the performance of another transformer-based pretrained language model, e.g., RoBERTa or XLNet.

* Compare the performance of this model to the results achieved in Lab6.4-Topic-classification-BERT.ipynb and to a conventional machine learning approach (e.g., SVM, Naive Bayes) using bag-of-words or other engineered features of your choice.
Describe the differences in performance in terms of Precision, Recall, and F1-score evaluation metrics.
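
The three metrics named above are computed per class from the gold and predicted labels. The following is a minimal pure-Python sketch for illustration only: the function names `per_class_metrics` and `macro_average` and the toy labels are hypothetical, and in the cells below scikit-learn's `classification_report` does this work.

```python
from collections import Counter

def per_class_metrics(gold, predicted):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted, but the gold label was different
            fn[g] += 1  # the gold label g was missed
    metrics = {}
    for label in sorted(set(gold) | set(predicted)):
        n_pred = tp[label] + fp[label]   # how often the label was predicted
        n_gold = tp[label] + fn[label]   # how often the label is the gold label
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

def macro_average(metrics):
    """Unweighted mean of each metric over the classes (the 'macro avg' row)."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics.values()) / n for i in range(3))

# Toy labels, invented purely for illustration
gold = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
metrics = per_class_metrics(gold, pred)
macro = macro_average(metrics)
for label, (p, r, f1) in metrics.items():
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Precision penalises false alarms for a class, recall penalises missed instances, and F1 is their harmonic mean, so a class with high recall but low precision (or the reverse) ends up with an F1 closer to the lower of the two.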

In [None]:
!pip install simpletransformers



In [None]:
# Import libraries
import pandas as pd
import numpy as np
import sklearn
from sklearn.metrics import classification_report
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import matplotlib.pyplot as plt
import seaborn as sn

In [None]:
# Load data (from 6.4)
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

# load only a sub-selection of the categories (4 in our case)
categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'sci.space']

# remove the headers, footers and quotes (to avoid overfitting)
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=categories, random_state=42)
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=categories, random_state=42)
train = pd.DataFrame({'text': newsgroups_train.data, 'labels': newsgroups_train.target})
test = pd.DataFrame({'text': newsgroups_test.data, 'labels': newsgroups_test.target})
train, dev = train_test_split(train, test_size=0.1, random_state=0,
                               stratify=train[['labels']])



In [None]:
# BERT (from 6.4)

# Model configuration # https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model
model_args = ClassificationArgs()

model_args.overwrite_output_dir=True # overwrite existing saved models in the same directory
model_args.evaluate_during_training=False # set to True to perform evaluation while training the model
# (eval data must then be passed to the training method)

model_args.num_train_epochs=10 # number of epochs
model_args.train_batch_size=32 # batch size
model_args.learning_rate=4e-6 # learning rate
model_args.max_seq_length=256 # maximum sequence length
# Note! Increasing max_seq_length may provide better performance, but training time will increase.
# For educational purposes, we set max_seq_length to 256.

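# Early stopping to combat overfitting: https://simpletransformers.ai/docs/tips-and-tricks/#using-early-stopping
# Note: early stopping relies on evaluation during training, so with evaluate_during_training=False
# above, the settings below have no effect and all epochs are run.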
model_args.use_early_stopping=True
model_args.early_stopping_delta=0.01 # "The improvement over best_eval_loss necessary to count as a better checkpoint"
model_args.early_stopping_metric='eval_loss'
model_args.early_stopping_metric_minimize=True
model_args.early_stopping_patience=2
model_args.evaluate_during_training_steps=32 # how often you want to run validation in terms of training steps (or batches)

model = ClassificationModel('bert', 'bert-base-cased', num_labels=4, args=model_args, use_cuda=True) # CUDA is enabled
model.train_model(train)
predicted, probabilities = model.predict(test.text.to_list())
test['predicted'] = predicted
print(classification_report(test['labels'], test['predicted']))


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



              precision    recall  f1-score   support

           0       0.84      0.82      0.83       319
           1       0.82      0.92      0.87       389
           2       0.93      0.88      0.91       396
           3       0.85      0.82      0.83       394

    accuracy                           0.86      1498
   macro avg       0.86      0.86      0.86      1498
weighted avg       0.86      0.86      0.86      1498



In [None]:
# RoBERTa, fine-tuned with the same model_args as BERT above
model2 = ClassificationModel('roberta', 'roberta-base', num_labels=4, args=model_args, use_cuda=True) # CUDA is enabled
model2.train_model(train)
predicted2, probabilities2 = model2.predict(test.text.to_list())
test['predicted2'] = predicted2
print(classification_report(test['labels'], test['predicted2']))


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



              precision    recall  f1-score   support

           0       0.84      0.82      0.83       319
           1       0.82      0.92      0.87       389
           2       0.91      0.88      0.89       396
           3       0.85      0.79      0.82       394

    accuracy                           0.85      1498
   macro avg       0.85      0.85      0.85      1498
weighted avg       0.85      0.85      0.85      1498
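
To see where the two transformer models differ, the per-class F1 scores from the two reports above can be subtracted. This is a small sketch; the list names `bert_f1`, `roberta_f1`, and `f1_gap` are only illustrative, and the values are copied from the `f1-score` columns printed above.

```python
# Per-class f1-score copied from the BERT and RoBERTa reports above (classes 0-3)
bert_f1 = [0.83, 0.87, 0.91, 0.83]
roberta_f1 = [0.83, 0.87, 0.89, 0.82]

# Positive values mean BERT scored higher on that class
f1_gap = [round(b - r, 2) for b, r in zip(bert_f1, roberta_f1)]
for label, gap in enumerate(f1_gap):
    print(f"class {label}: BERT - RoBERTa = {gap:+.2f}")
```

On this single run the two models tie on classes 0 and 1 and differ by at most 0.02 on classes 2 and 3. Assuming the usual alphabetical ordering of `target_names`, these correspond to `sci.med` and `sci.space`.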



In [None]:
# Data preprocessing for Naive Bayes and SVM (bag of words)
from gensim.corpora import Dictionary
from gensim.matutils import corpus2dense
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess
import nltk
from nltk.stem import WordNetLemmatizer
import numpy as np

np.random.seed(2018)
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def lemmatize(token):
    # WordNet lemmatization only (default noun POS); no stemming is applied
    return lemmatizer.lemmatize(token)

def preprocess(text):
    # Lowercase and tokenize, drop stopwords and tokens of 3 characters or fewer, then lemmatize
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize(token))
    return result

train['text'] = train['text'].fillna('missing')
test['text'] = test['text'].fillna('missing')

processed_docs_train = train['text'].map(preprocess)
processed_docs_test = test['text'].map(preprocess)

# Note: the vocabulary is built from the train and test documents together
all_docs = pd.concat([processed_docs_train, processed_docs_test])
all_docs = all_docs.dropna()

dictionary = Dictionary(all_docs)

# Sparse bag-of-words: one list of (token_id, count) pairs per document
bow_corpus_train = [dictionary.doc2bow(doc) for doc in processed_docs_train]
bow_corpus_test = [dictionary.doc2bow(doc) for doc in processed_docs_test]

# Convert to dense (documents x terms) matrices for scikit-learn
num_terms = len(dictionary.token2id)

X_train = corpus2dense(bow_corpus_train, num_terms=num_terms).T
X_test = corpus2dense(bow_corpus_test, num_terms=num_terms).T

y_train = train['labels']
y_test = test['labels']


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


559     [wonder, atheist, care, speculate, face, world...
2060    [interested, purchasing, grayscale, printer, o...
1206    [dear, binary, newsers, looking, quick, micros...
1420    [hello, looking, commercial, software, package...
1210    [actually, flexible, create, temp, file, check...
                              ...                        
1493    [nice, collection, historical, book, medical, ...
1494    [stuff, deleted, french, spot, example, come, ...
1496    [posting, delayed, week, falling, software, cr...
1497    [commericial, support, exploration, example, b...
Name: text, Length: 3523, dtype: object
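
The preprocessing cell above turns each tokenised document into token counts and then into one row of a dense document-term matrix. The idea can be sketched in pure Python as follows. This is not gensim's implementation: `build_vocab`, `to_bow`, and `to_dense_row` are hypothetical helper names, and the two toy documents are invented.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Sparse bag-of-words: sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

def to_dense_row(bow, num_terms):
    """Expand the sparse pairs into a fixed-length count vector."""
    row = [0] * num_terms
    for token_id, count in bow:
        row[token_id] = count
    return row

# Toy tokenised documents, invented for illustration
docs = [["orbit", "launch", "orbit"], ["image", "render", "launch"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
X = [to_dense_row(bow, len(vocab)) for bow in bows]
print(vocab)
print(bows)
print(X)
```

Word order is discarded in this representation: only the count of each vocabulary term per document reaches the SVM and Naive Bayes classifiers.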


In [None]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report

svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)
y_pred_svm = svm_classifier.predict(X_test)
print("SVM Classifier Report:")
print(classification_report(y_test, y_pred_svm))

SVM Classifier Report:
              precision    recall  f1-score   support

           0       0.72      0.65      0.68       319
           1       0.69      0.81      0.74       389
           2       0.78      0.67      0.72       396
           3       0.68      0.70      0.69       394

    accuracy                           0.71      1498
   macro avg       0.71      0.71      0.71      1498
weighted avg       0.71      0.71      0.71      1498



In [None]:
from sklearn.naive_bayes import MultinomialNB
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
y_pred_nb = nb_classifier.predict(X_test)
print("Naive Bayes Classifier Report:")
print(classification_report(y_test, y_pred_nb))


Naive Bayes Classifier Report:
              precision    recall  f1-score   support

           0       0.81      0.85      0.83       319
           1       0.89      0.88      0.89       389
           2       0.83      0.89      0.86       396
           3       0.87      0.79      0.83       394

    accuracy                           0.85      1498
   macro avg       0.85      0.85      0.85      1498
weighted avg       0.85      0.85      0.85      1498
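
To support the comparison the assignment asks for, the macro-averaged scores from the four reports above can be placed side by side. This is a small sketch; the numbers are copied from the `macro avg` rows printed above, and the names `reported` and `ranked` are only illustrative.

```python
# (precision, recall, f1) from the 'macro avg' row of each report above
reported = {
    "BERT (bert-base-cased)": (0.86, 0.86, 0.86),
    "RoBERTa (roberta-base)": (0.85, 0.85, 0.85),
    "SVM (linear, bag-of-words)": (0.71, 0.71, 0.71),
    "Naive Bayes (bag-of-words)": (0.85, 0.85, 0.85),
}

# Sort by macro F1, highest first (ties keep their insertion order)
ranked = sorted(reported.items(), key=lambda item: item[1][2], reverse=True)

print(f"{'model':<28}{'precision':>10}{'recall':>8}{'f1':>6}")
for name, (p, r, f1) in ranked:
    print(f"{name:<28}{p:>10.2f}{r:>8.2f}{f1:>6.2f}")
```

On this run, the two fine-tuned transformers and Naive Bayes lie within 0.01 macro F1 of each other, while the linear SVM trails by roughly 0.14. A gap of 0.01 from a single run is too small to rank those three models with confidence.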

