## Lab6-Assignment: Topic Classification

Use the same training, development, and test partitions of the 20 newsgroups text dataset as in Lab6.4-Topic-classification-BERT.ipynb.

* Fine-tune and examine the performance of another transformer-based pretrained language model, e.g., RoBERTa or XLNet.

* Compare the performance of this model to the results achieved in Lab6.4-Topic-classification-BERT.ipynb and to a conventional machine learning approach (e.g., SVM, Naive Bayes) using bag-of-words or other engineered features of your choice. 

    Describe the differences in performance in terms of Precision, Recall, and F1-score evaluation metrics.
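As a reminder of what the three metrics measure, here is a minimal pure-Python sketch that computes them per class from two label lists. The helper name `per_class_metrics` and the toy labels are illustrative assumptions, not part of the lab code; in the notebook itself, `classification_report` from scikit-learn does this for you.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # class p was predicted but is wrong
            fn[t] += 1   # class t was missed
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]   # everything predicted as label
        actual = tp[label] + fn[label]      # everything truly labelled label
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    return metrics

# Toy example (hypothetical labels, not taken from the dataset)
toy_true = [0, 0, 1, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 2, 2]
for label, (p, r, f) in per_class_metrics(toy_true, toy_pred).items():
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Precision penalises false alarms for a class, recall penalises misses, and F1 is their harmonic mean, so a class with high precision but low recall (or the reverse) still gets a moderate F1.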

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

In [None]:
# Run only if necessary :)
!pip install simpletransformers --upgrade

In [2]:
categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'sci.space'] 

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=categories, random_state=42)
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=categories, random_state=42)

# RoBERTa

In [None]:
# Start Coding here
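One possible starting point, offered as a sketch rather than the expected solution: the function below fine-tunes `roberta-base` with simpletransformers and returns classification reports for the dev and test sets. The function name `fine_tune_roberta` and the hyperparameters (2 epochs, batch size 16, maximum sequence length 256) are assumptions chosen for illustration and should be aligned with whatever Lab6.4 used. It expects DataFrames with `text` and `labels` columns, as built in the Naive Bayes section of this notebook.

```python
def fine_tune_roberta(train_df, dev_df, test_df, num_labels=4, epochs=2):
    """Fine-tune roberta-base and report dev/test metrics (illustrative sketch)."""
    # Imported lazily so that defining the function does not require the packages.
    import torch
    from simpletransformers.classification import (
        ClassificationArgs,
        ClassificationModel,
    )
    from sklearn.metrics import classification_report

    # Assumed hyperparameters; tune them for your own runs.
    model_args = ClassificationArgs()
    model_args.num_train_epochs = epochs
    model_args.train_batch_size = 16
    model_args.max_seq_length = 256
    model_args.overwrite_output_dir = True

    model = ClassificationModel(
        "roberta",
        "roberta-base",
        num_labels=num_labels,
        args=model_args,
        use_cuda=torch.cuda.is_available(),
    )
    model.train_model(train_df)

    reports = {}
    for name, df in (("dev", dev_df), ("test", test_df)):
        # predict() returns (predicted labels, raw model outputs)
        predictions, _ = model.predict(df["text"].tolist())
        reports[name] = classification_report(df["labels"], predictions)
    return reports
```

Call it only after the `train`, `dev`, and `test` DataFrames have been created, for example `reports = fine_tune_roberta(train, dev, test)` followed by `print(reports["test"])`, so the output can be compared directly with the Naive Bayes reports.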

# Naive Bayes with Bag of Words

In [3]:
train = pd.DataFrame({'text': newsgroups_train.data, 'labels': newsgroups_train.target})
test = pd.DataFrame({'text': newsgroups_test.data, 'labels': newsgroups_test.target})

print("Train Data Distribution:")
print(train['labels'].value_counts())
print("\nTest Data Distribution:")
print(test['labels'].value_counts())

Train Data Distribution:
labels
2    594
3    593
1    584
0    480
Name: count, dtype: int64

Test Data Distribution:
labels
2    396
3    394
1    389
0    319
Name: count, dtype: int64


In [4]:
train, dev = train_test_split(train, test_size=0.1, random_state=0, stratify=train[['labels']])

In [5]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train['text'])
X_dev = vectorizer.transform(dev['text'])
X_test = vectorizer.transform(test['text'])
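The vocabulary is learned from the training split only; the dev and test splits are merely transformed, so words that never occurred in training are dropped. The pure-Python sketch below mimics that fit/transform behaviour on toy sentences. It is a simplification: it splits on whitespace, whereas `CountVectorizer` uses its own tokenizer and returns a sparse matrix. The helper names and example sentences are hypothetical.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct lower-cased token in docs to a column index."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Turn docs into count vectors; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        # Dict iteration follows insertion order, which matches the column index.
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

toy_train = ["the rocket launch", "the patient the doctor"]
toy_test = ["rocket doctor telescope"]  # "telescope" is unseen in training

vocabulary = fit_vocabulary(toy_train)
print(vocabulary)
print(transform(toy_train, vocabulary))
print(transform(toy_test, vocabulary))
```

Fitting on the training data alone keeps information from the dev and test sets out of the feature space, which is why only `fit_transform` on `train['text']` and plain `transform` on the other two splits are used above.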

### Training the Model

In [7]:
nb_model = MultinomialNB()
nb_model.fit(X_train, train['labels'])

### Evaluating on the Dev Set

In [8]:
y_dev_pred = nb_model.predict(X_dev)
print("Validation Classification Report:")
print(classification_report(dev['labels'], y_dev_pred))

Validation Classification Report:
              precision    recall  f1-score   support

           0       0.74      0.94      0.83        48
           1       0.92      0.95      0.93        59
           2       0.95      0.90      0.92        60
           3       0.91      0.73      0.81        59

    accuracy                           0.88       226
   macro avg       0.88      0.88      0.87       226
weighted avg       0.89      0.88      0.88       226



### Evaluating on the Test Set

In [9]:
y_test_pred = nb_model.predict(X_test)
test['predicted'] = y_test_pred
print("Test Classification Report:")
print(classification_report(test['labels'], test['predicted']))

Test Classification Report:
              precision    recall  f1-score   support

           0       0.77      0.89      0.82       319
           1       0.92      0.87      0.89       389
           2       0.83      0.87      0.85       396
           3       0.90      0.78      0.83       394

    accuracy                           0.85      1498
   macro avg       0.85      0.85      0.85      1498
weighted avg       0.86      0.85      0.85      1498
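When comparing models, it helps to know how the two summary rows are derived. The sketch below recomputes the macro and weighted F1 averages from the per-class F1 scores and supports printed in the test report above. Because the inputs are already rounded to two decimals, the results are approximate; the variable names are illustrative.

```python
# Per-class F1 scores and supports copied from the test report above
per_class_f1 = [0.82, 0.89, 0.85, 0.83]
support = [319, 389, 396, 394]

# Macro average: every class counts equally
macro_f1 = sum(per_class_f1) / len(per_class_f1)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f * n for f, n in zip(per_class_f1, support)) / sum(support)

print(f"macro F1    ~ {macro_f1:.4f}")
print(f"weighted F1 ~ {weighted_f1:.4f}")
```

Both values land close to the 0.85 shown in the report. The two averages nearly coincide here because the four classes have similar support; on a more imbalanced dataset they can diverge, and the macro average is then the better indicator of performance on minority classes.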

