# Topic Classification Notebook


In this notebook we train a model to classify the sentences in our test set, "sentiment-topic-test.tsv". We fine-tune the BERT language model to assign each sentence to one of three categories: sports, movie, or book.

In [1]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

Using device: cuda


In [2]:
!pip install simpletransformers --upgrade
import pandas as pd
from sklearn.model_selection import train_test_split
from simpletransformers.classification import ClassificationModel, ClassificationArgs
from sklearn.metrics import classification_report

Collecting simpletransformers
  Downloading simpletransformers-0.70.1-py3-none-any.whl.metadata (42 kB)
Collecting datasets (from simpletransformers)
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting seqeval (from simpletransformers)
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting tensorboardx (from simpletransformers)
  Downloading tensorboardX-2.6.2.2-py2.py3-none-any.whl.metadata (5.8 kB)
Collecting streamlit (from simpletransformers)
  Downloading streamlit-1.44.0-py3-none-any.whl.metadata (8.9 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets->sim

Our training data, "big_data.csv", was uploaded to Kaggle by Aron Ferenc (https://www.kaggle.com/datasets/aronferencz/topic-analysis-dataset-sportbookmovie).
It combines three other Kaggle datasets into a simple CSV file with three columns: sentence id, text, and label.
The book section comes from "Top 100 Bestselling Book Reviews on Amazon" by Ansh Tanwar (https://www.kaggle.com/datasets/anshtanwar/top-200-trending-books-with-reviews/data?select=customer+reviews.csv).
The movie section comes from "Movie Reviews Sentiment Analysis using NLP" by Gaurav Dutta (https://www.kaggle.com/code/gauravduttakiit/movie-reviews-sentiment-analysis-using-nlp). The sports section comes from "Football Transfer News Articles for NLP" by Crxxom, which was derived from headlines on the football website 90min.com (https://www.kaggle.com/datasets/crxxom/football-transfer-news-for-nlp).


The dataset consists of 118607 unique values and is slightly unbalanced towards movies, which make up 42% of the rows, while book and sports make up 28% and 31% respectively.

The type of text may also be problematic. For book and movie, the training data consists of reviews, which are often longer than the single sentences in our test set. The sports text is all football-related and, because it comes from headlines, consists largely of transfer news, which is full of player names, club names, and financial transactions. This also differs somewhat from our test set.
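
One way to check this concern would be to compare text lengths per label. The following is only a minimal sketch: the example rows are made up for illustration (they are not taken from big_data.csv), and `median_words_per_label` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict
from statistics import median

def median_words_per_label(rows):
    """Return the median whitespace-separated word count for each label."""
    lengths = defaultdict(list)
    for text, label in rows:
        lengths[label].append(len(text.split()))
    return {label: median(counts) for label, counts in lengths.items()}

# Made-up rows for illustration only; the real rows live in big_data.csv.
toy_rows = [
    ("I loved this book. The plot was gripping and the ending surprised me.", "book"),
    ("A slow start, but the second half of the film makes up for it.", "movie"),
    ("Club agrees fee for striker", "sports"),
]
word_counts = median_words_per_label(toy_rows)
print(word_counts)
```

Running the same helper on the real training rows and on the test sentences would show how large the length gap actually is.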

Since the dataset was already parsed from the three original sets for use in topic classification, very little preprocessing is necessary: all we need to do is change the labels from strings into integers.

For this we use "book": 0, "movie": 1, "sports": 2.

In [10]:
train_data = pd.read_csv("big_data.csv")

In [11]:
# Map the string labels to integers so the model can use them.
# Book: 0, Movie: 1, Sports: 2
label_map = {"book": 0, "movie": 1, "sports": 2}
# .map avoids the downcasting FutureWarning that recent pandas versions emit
# for .replace here; any label outside label_map becomes NaN.
train_data['label'] = train_data['label'].map(label_map)
train_data = train_data.drop("idx", axis=1)
print("Training data distribution:", train_data[['label']].value_counts(sort=False))

Training data distribution: label
0        33581
1        50000
2        36726
Name: count, dtype: int64
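
The class shares quoted earlier can be re-derived from these counts. A small sketch, with the three counts copied by hand from the output above:

```python
# Label counts copied from the distribution printed above.
label_counts = {"book": 33581, "movie": 50000, "sports": 36726}

total = sum(label_counts.values())
shares = {label: round(100 * count / total, 1) for label, count in label_counts.items()}

print("Total rows:", total)
print("Share per label (%):", shares)
```

The counts sum to 120307 rows, somewhat more than the 118607 unique values mentioned earlier; presumably some texts occur more than once, though that is an assumption we did not verify here.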




In [None]:
# Split the data into training and dev (evaluation) sets, stratified by label
train, dev = train_test_split(train_data, test_size=0.1, random_state=0,
                               stratify=train_data[['label']])


The model we use is BERT, which we saw in lab 6 to be very effective at classifying topics. After some failed runs, the setup below consistently gave us the best results.

In [None]:
# Model configuration
# https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model
model_args = ClassificationArgs()

model_args.overwrite_output_dir = True  # overwrite existing saved models in the same directory
model_args.evaluate_during_training = True  # evaluate while training the model
# (eval data must be passed to the training method)

model_args.num_train_epochs = 3  # number of epochs
model_args.train_batch_size = 32  # batch size
model_args.learning_rate = 4e-6  # learning rate
model_args.max_seq_length = 512  # maximum sequence length (the upper limit for BERT)
# Note! A larger max_seq_length may give better performance, but training time increases.

# Early stopping to combat overfitting: https://simpletransformers.ai/docs/tips-and-tricks/#using-early-stopping
model_args.use_early_stopping = True
model_args.early_stopping_delta = 0.01  # "The improvement over best_eval_loss necessary to count as a better checkpoint"
model_args.early_stopping_metric = 'eval_loss'
model_args.early_stopping_metric_minimize = True
model_args.early_stopping_patience = 2
model_args.evaluate_during_training_steps = 640  # how often to run validation, in training steps (batches)

# num_labels must match our three classes (book, movie, sports)
model = ClassificationModel('bert', 'bert-base-cased', num_labels=3, args=model_args, use_cuda=True)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
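
To make the early-stopping settings above more concrete, here is a generic sketch of a patience/delta rule. It is not the simpletransformers implementation (whose exact counting may differ); `early_stopping_step` and the loss values are hypothetical and serve only to show how `early_stopping_delta=0.01` and `early_stopping_patience=2` interact.

```python
def early_stopping_step(eval_losses, delta=0.01, patience=2):
    """Return the index of the evaluation at which training would stop, or None."""
    best_loss = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if best_loss - loss > delta:
            # Improvement larger than delta: new best, reset the counter.
            best_loss = loss
            bad_evals = 0
        else:
            # No sufficient improvement at this evaluation.
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None

# Hypothetical eval losses, one per evaluation run (every 640 training steps).
losses = [0.90, 0.60, 0.595, 0.598, 0.40]
stop_at = early_stopping_step(losses)
print(stop_at)
```

In this toy run the two small changes after 0.60 do not exceed the delta, so training would stop before ever reaching the later, lower loss. This illustrates why a delta of 0.01 on the eval loss is a fairly strict threshold.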



In [None]:
_, history = model.train_model(train, eval_df=dev)
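
As a rough sanity check on the training length, the number of optimizer steps can be estimated from the settings above. This is a sketch under stated assumptions: the row count is the sum of the label counts printed earlier, the dev set size is rounded up, and no gradient accumulation is used.

```python
import math

n_rows = 33581 + 50000 + 36726   # label counts printed earlier in this notebook
test_size = 0.1                  # dev fraction passed to train_test_split
train_batch_size = 32
num_train_epochs = 3
eval_every = 640                 # evaluate_during_training_steps

# Assumption: the dev set size is rounded up, the rest is used for training.
n_dev = math.ceil(test_size * n_rows)
n_train = n_rows - n_dev

steps_per_epoch = math.ceil(n_train / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
evals_per_epoch = steps_per_epoch // eval_every

print(steps_per_epoch, total_steps, evals_per_epoch)
```

Under these assumptions each epoch takes 3384 steps, with five step-based evaluations per epoch.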




In [None]:
# Saving our model for future use
model.model.save_pretrained('model1')
model.tokenizer.save_pretrained('model1')
model.config.save_pretrained('model1/')

In [None]:
test = pd.read_csv("sentiment-topic-test.tsv", sep='\t')

# Preprocess the test set by mapping the topic labels the same way as in the training data
test['topic'] = test['topic'].map({"book": 0, "movie": 1, "sports": 2})



In [None]:
# Predict the labels for the test set and evaluate the results
# (the second return value holds the raw model outputs, one row per sentence)
predicted, raw_outputs = model.predict(test.sentence.to_list())
test['predicted'] = predicted

print(classification_report(test['topic'], test['predicted']))



              precision    recall  f1-score   support

           0       0.50      1.00      0.67         6
           1       1.00      0.17      0.29         6
           2       1.00      0.83      0.91         6

    accuracy                           0.67        18
   macro avg       0.83      0.67      0.62        18
weighted avg       0.83      0.67      0.62        18



In [None]:
import collections

# Count how many times each label was predicted by our model
label_counts = collections.Counter(list(test["predicted"].values))
print(label_counts)

Counter({np.int64(0): 12, np.int64(2): 5, np.int64(1): 1})
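
The integer keys are easier to read when mapped back to topic names. A small sketch, with the counts copied from the output above and `id_to_topic` being the inverse of the label mapping used for training:

```python
from collections import Counter

id_to_topic = {0: "book", 1: "movie", 2: "sports"}

# Predicted-label counts as printed by the Counter above.
label_counts = Counter({0: 12, 2: 5, 1: 1})

named_counts = {id_to_topic[label]: count for label, count in label_counts.most_common()}
print(named_counts)
```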


              precision    recall  f1-score   support

           0       0.50      1.00      0.67         6
           1       1.00      0.17      0.29         6
           2       1.00      0.83      0.91         6

    accuracy                           0.67        18
   macro avg       0.83      0.67      0.62        18
weighted avg       0.83      0.67      0.62        18

Results from evaluating our model. Below we can see the sentences, actual topic, and predicted label.

As the macro and weighted averages show, our topic classification was not as successful as we would have liked. The per-label precision and recall, the counter in the cell above, and the predicted labels below show where the problem lies.

Our model predicted movie only once. Of the 6 actual movie sentences, 5 were predicted as book, and 1 sports sentence was also predicted as book. This gives perfect movie precision and book recall, but low book precision (0.50) and very low movie recall (0.17).
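
The per-label scores can be recomputed by hand from the confusion counts. The matrix below is reconstructed from the classification report and the prediction counter (it is not printed by the notebook itself), so treat it as a worked example rather than notebook output:

```python
# Rows are true labels, columns are predicted labels, in the order book, movie, sports.
labels = ["book", "movie", "sports"]
confusion = [
    [6, 0, 0],  # true book:   all 6 predicted as book
    [5, 1, 0],  # true movie:  5 predicted as book, 1 as movie
    [1, 0, 5],  # true sports: 1 predicted as book, 5 as sports
]

metrics = {}
for i, label in enumerate(labels):
    true_positives = confusion[i][i]
    predicted_total = sum(row[i] for row in confusion)  # column sum
    actual_total = sum(confusion[i])                    # row sum
    precision = true_positives / predicted_total
    recall = true_positives / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (round(precision, 2), round(recall, 2), round(f1, 2))
    print(f"{label:>6}  precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
```

Book is predicted 12 times but is correct only 6 times, which is where the 0.50 precision comes from; movie is correct the one time it is predicted, but that covers only 1 of 6 movie sentences.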

This is the opposite of what the training data would suggest, as there was actually less data for books than for movies. Part of it can be explained by the two topics being quite close together: both are about stories and share words like thriller, protagonist, and plot.
Most likely, though, it is an effect of the training data for these two topics consisting of reviews, which often span multiple sentences. The extra text may have confused the model, since movie and book reviews tend to read very similarly.

The model predicts sports very well, despite our concerns that the headlines cover only a narrow topic within sports.

All in all, our model is good at separating sports from book/movie, but not effective at distinguishing the latter two.


What could have been done differently is changing the datasets or preprocessing the data differently. Splitting the reviews into sentences, or using sentence-based datasets, might have given better results.
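
As an illustration of the sentence-separation idea, a review could be split into sentences before training, with each sentence inheriting the review's label. This is a naive sketch that we did not run as part of this notebook: `split_into_sentences` is a hypothetical helper, the review text is made up, and a simple regex like this will mis-split on abbreviations such as "Mr." or "e.g.".

```python
import re

def split_into_sentences(text):
    """Naively split text after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

# Hypothetical review text for illustration.
review = "I loved this book. The plot was gripping! Would I read it again? Yes."
sentences = split_into_sentences(review)
print(sentences)
```

A dedicated sentence tokenizer would likely be more robust than this regex, at the cost of an extra dependency.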

In [None]:
test

Unnamed: 0,sentence_id,sentence,sentiment,topic,predicted
0,0,The atmosphere at the stadium tonight was elec...,positive,2,2
1,1,The game was so intense I forgot to breathe at...,positive,2,0
2,2,It had me hooked from the first chapter.,positive,0,0
3,3,"It’s more of a slow burn than a page-turner, b...",neutral,0,0
4,4,"It’s split into two timelines, which keeps it ...",neutral,0,0
5,5,I could watch this film a hundred times and st...,positive,1,1
6,6,Best thriller I’ve seen in ages. Had me on the...,positive,1,0
7,7,How do you concede three goals in ten minutes?...,negative,2,2
8,8,"They rotated their squad for the cup game, whi...",neutral,2,2
9,9,"The trailer gave away most of the plot, but th...",neutral,1,0
