# DIGI405 Lab Class 8: LLM Text classification

⚠️ **Run this notebook in Google Colab - not in the DIGI405 JupyterHub**. You will want to change to a GPU/TPU accelerated runtime. To do this go to Runtime > Change runtime type. Read more about GPU/TPU availability in Google Colab [here](https://research.google.com/colaboratory/faq.html#gpu-availability).


In this notebook we'll compare the performance of an LLM, [Hugging Face DistilBERT base (uncased)](https://huggingface.co/distilbert-base-uncased), with a Bag of Words model, the [scikit-learn multinomial Naive Bayes classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html), on the same dataset.

The first part of this notebook is exactly the same as our other lab notebook. Run through all the cells in the notebook to ensure all the data is loaded correctly, and the Naive Bayes model is trained for comparison. You may want to use the 'best' pre-processing and feature selection settings that you found for the Naive Bayes model.

In the second part of the notebook we'll fine-tune the ["distilbert-base-uncased"](https://huggingface.co/distilbert/distilbert-base-uncased) model on our dataset and compare its classification performance against the Naive Bayes model. DistilBERT base (uncased) is accessed through the 🤗 Hugging Face [transformers](https://huggingface.co/docs/hub/en/transformers) library.

---

**Remember:** Each time you change settings below, you need to rerun the following cells in order to implement the classification pipeline.


## Setup

Below we are importing required libraries.

We will use the [Naive Bayes Classifier](https://scikit-learn.org/stable/modules/naive_bayes.html). We will also use Scikit-learn's different feature extraction methods based on counts or tf-idf weights. The [NLTK](https://www.nltk.org/) library is used for pre-processing.

In [None]:
# Install datasets
!pip install datasets==2.14.5

In [None]:
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import classification_report

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from wordcloud import WordCloud

import re
import warnings

import nltk
from nltk import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet

The next cell downloads these [NLTK](https://www.nltk.org/) resources: stopwords, the POS tagger (used by the NLTK lemmatizer), the Punkt tokenizer models, and the [WordNet lexical database](https://wordnet.princeton.edu/) (used for word meanings and relationships).

In [None]:
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')

This cell loads some defaults for the stop word lists.

In [None]:
stop_words = None
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as sklearn_stop_words
nltk_stop_words = nltk.corpus.stopwords.words('english')

Define some functions (you don't need to change anything here, just run the cell) ...

In [None]:
# nice preview of document
def get_preview(docs, targets, target_names, doc_id, max_len=0):
    preview = ''
    if max_len < 1:
        preview += 'Label\n'
        preview += '=====\n'
    else:
        preview += str(doc_id)
        preview += '\t'
    preview += target_names[targets[doc_id]]
    if max_len < 1:
        preview += '\n\nFull Text\n'
        preview += '=========\n'
        preview += docs[doc_id]
        preview += '\n'
    else:
        excerpt = get_excerpt(docs[doc_id], max_len)
        preview += '\t' + excerpt
    return preview

_RE_COMBINE_WHITESPACE = re.compile(r"\s+")

# generate an excerpt
def get_excerpt(text, max_len):
    excerpt = _RE_COMBINE_WHITESPACE.sub(' ',text[0:max_len])
    if max_len < len(text):
        excerpt += '...'
    return excerpt.strip()

# combine a defined stop word list (or no stop word list) with any extra stop words defined
def set_stop_words(stop_word_list, extra_stop_words):
    if len(extra_stop_words) > 0:
        if stop_word_list is None:
            stop_word_list = []
        stop_words = list(stop_word_list) + extra_stop_words
    else:
        stop_words = stop_word_list

    return stop_words

# initiate stemming or lemmatising
def set_normaliser(normalise):
    if normalise == 'PorterStemmer':
        normaliser = PorterStemmer()
    elif normalise == 'SnowballStemmer':
        normaliser = SnowballStemmer('english')
    elif normalise == 'WordNetLemmatizer':
        normaliser = WordNetLemmatizer()
    else:
        normaliser = None
    return normaliser

# we are using a custom tokenisation process to allow different tokenisers and stemming/lemmatising ...
def tokenise(doc):
    global tokeniser, normalise, normaliser

    # you could obviously add more tokenisers here if you wanted ...
    if tokeniser == 'sklearn':
        tokenizer = RegexpTokenizer(r"(?u)\b\w\w+\b") # this is copied straight from sklearn source
        tokens = tokenizer.tokenize(doc)
    elif tokeniser == 'word_tokenize':
        tokens = word_tokenize(doc)
    elif tokeniser == 'wordpunct':
        tokens = wordpunct_tokenize(doc)
    else:
        tokens = word_tokenize(doc)

    # if using a normaliser then iterate through tokens and return the normalised tokens ...
    if normalise == 'PorterStemmer':
        return [normaliser.stem(t) for t in tokens]
    elif normalise == 'SnowballStemmer':
        return [normaliser.stem(t) for t in tokens]
    elif normalise == 'WordNetLemmatizer':
        # NLTK's lemmatiser needs parts of speech, otherwise assumes everything is a noun
        pos_tokens = nltk.pos_tag(tokens)
        lemmatised_tokens = []
        for token in pos_tokens:
            # NLTK's lemmatiser needs specific values for pos tags - this rewrites them ...
            # default to noun
            tag = wordnet.NOUN
            if token[1].startswith('J'):
                tag = wordnet.ADJ
            elif token[1].startswith('V'):
                tag = wordnet.VERB
            elif token[1].startswith('R'):
                tag = wordnet.ADV
            lemmatised_tokens.append(normaliser.lemmatize(token[0],tag))
        return lemmatised_tokens
    else:
        # no normaliser so just return tokens
        return tokens

## Preview stop word lists

As discussed in the lecture material, pre-processing can have a major influence on the results of text classification tasks.

In particular, you should put thought into whether a stop word list is sensible for your task. The scikit-learn website also makes this point at https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words and recommends caution about using its stop word list! That page also links to a recent paper discussing practical issues with stop word lists, including whether the way you are tokenising your documents matches the tokenisation approach used in your stop word list.

Using the cells below you can preview the stop word lists supplied by scikit-learn and NLTK, which we have used previously in class. You will notice they are quite different.

In [None]:
print(sklearn_stop_words)

In [None]:
print(nltk_stop_words)

## Load corpus and set train/test split

Scikit-learn is packaged with a number of standard data-sets used in machine learning and provides a way to load other data.

We will begin by loading texts from two categories in the **[20 newsgroups dataset](http://qwone.com/~jason/20Newsgroups/)** to work through an example classifying documents related to politics and religion.

*What is a newsgroup?* We are stretching back into internet history here - way before people talked to strangers on Facebook and X and other social media, there were Usenet Newsgroups! [Here is a link to a Deja News page from 1998](https://web.archive.org/web/19980127204536/http://emarket.dejanews.com/emarket/about/idgs/aboutidgs.shtml) and also a [Wikipedia article](https://en.wikipedia.org/wiki/Usenet_newsgroup) that explains what Newsgroups are all about.

This data-set was built from discussions between real people on the internet in the 1990s. Please be aware that within this data-set are texts that include racist, sexist, and other offensive language use.

**Note:** This cell also sets the following train/test split: **80% of the data is used for training and 20% is used for testing.** The documents are assigned to each group randomly. It can be useful to rerun this cell to reshuffle your dataset so you can evaluate your model using different data for training and testing.
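To see how the split behaves, here is a minimal sketch on made-up data (the `toy_` names are illustrative and not part of the lab pipeline). It assumes you want a repeatable split, so it passes a fixed `random_state`; the lab cell uses `random_state=None`, which gives a different shuffle on every run.

```python
from sklearn.model_selection import train_test_split

# Ten made-up documents with alternating labels (illustrative only)
toy_docs = [f"document {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

# test_size=0.2 keeps 80% of the documents for training and 20% for testing.
# A fixed random_state makes the shuffle repeatable.
toy_train_a, toy_test_a, _, _ = train_test_split(
    toy_docs, toy_labels, test_size=0.2, random_state=42)
toy_train_b, toy_test_b, _, _ = train_test_split(
    toy_docs, toy_labels, test_size=0.2, random_state=42)

print(len(toy_train_a), len(toy_test_a))  # 8 2
print(toy_test_a == toy_test_b)           # True: same seed, same split
```

Fixing the seed is useful when you want to compare two sets of pre-processing settings on exactly the same documents; leaving it as `None` is useful when you want to check that a result holds across different splits.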

In [None]:
# this chooses the categories to load
cats = ['talk.politics.misc', 'talk.religion.misc']

# this downloads/loads the data
# dataset = fetch_20newsgroups(subset='train', categories=cats)
dataset = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=cats)

# assign the train/test split - 0.2 is 80% for training, 20% for testing
test_size = 0.2

# do the train test split ...
# docs_train and docs_test are the documents
# y_train and y_test are the labels
docs_train, docs_test, y_train, y_test = train_test_split(dataset.data, dataset.target,
                                                          test_size = test_size, random_state=None)

## Inspect documents and labels

In the next cells we can look at the data we have imported. Firstly, we will preview the document labels and a brief excerpt.

In [None]:
for train_id in range(len(docs_train)):
    print(get_preview(docs_train, y_train, dataset.target_names, train_id, max_len=80))

### You can use the following cell to inspect a specific document and its label based on its index in the training set.

Note: The indexes will change each time you import the data above because of the random train/test split.

In [None]:
train_id = 1  # Enter the index of the document you want to preview
print(get_preview(docs_train, y_train, dataset.target_names, train_id))

## Preprocessing

**This next section of the notebook steps you through some key kinds of pre-processing for text classification using Naive Bayes and a bag of words (BoW) model.**

On the first run you should read about each setting, but leave the settings as they are. You will come back to this section to tune your model.

### Choose between token counts or tf-idf weights

You can choose to vectorize your text using frequency or tf-idf weights. Valid values are:
```
Vectorizer = CountVectorizer
```
or
```
Vectorizer = TfidfVectorizer
```
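The difference between the two vectorizers can be sketched on a few made-up sentences (the documents and the `demo_` names are assumptions for illustration). Both are used with scikit-learn's default settings here, not the custom tokeniser from this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three made-up documents; "the" occurs in all of them, "bill" in only one
demo_docs = [
    "the senate passed the tax bill",
    "the church held a prayer service",
    "the senate debated the church tax",
]

demo_count_vec = CountVectorizer()
demo_tfidf_vec = TfidfVectorizer()
demo_counts = demo_count_vec.fit_transform(demo_docs)
demo_tfidf = demo_tfidf_vec.fit_transform(demo_docs)

# Look up each term's column separately in each fitted vectorizer
c_the, c_bill = demo_count_vec.vocabulary_["the"], demo_count_vec.vocabulary_["bill"]
t_the, t_bill = demo_tfidf_vec.vocabulary_["the"], demo_tfidf_vec.vocabulary_["bill"]

# Raw counts in the first document: "the" twice, "bill" once
count_ratio = demo_counts[0, c_the] / demo_counts[0, c_bill]

# tf-idf narrows the gap, because "the" occurs in every document and so gets
# the lowest idf weight, while "bill" occurs in only one document
tfidf_ratio = demo_tfidf[0, t_the] / demo_tfidf[0, t_bill]

print(count_ratio)
print(round(tfidf_ratio, 2))
```

With counts, very common words dominate a document's vector; with tf-idf, words that appear in few documents are weighted up relative to words that appear everywhere.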

In [None]:
Vectorizer = CountVectorizer  # Set the vectorization method you want to use

### Lowercase

Setting lowercase to True will transform all document text to lowercase. Setting it to False will not do this transformation.

In [None]:
lowercase = True

### Set how you are tokenising the text

With this notebook you can choose between the following tokenisers.

The first option duplicates the behaviour of scikit-learn's default tokeniser: "The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator)". In this notebook we reproduce this behaviour using NLTK's regular expression tokeniser and this regular expression: `r"(?u)\b\w\w+\b"`.
```
tokeniser = 'sklearn'
```
You can use this or specify one of the following tokenisers based on NLTK ...

Tokenise based on NLTK's wordpunct_tokenize tokeniser (to include words and punctuation!):
```
tokeniser = 'wordpunct'
```
This applies NLTK's word_tokenize tokeniser.
```
tokeniser = 'word_tokenize'
```
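As a quick illustration of how much the choice matters, this sketch runs two of the tokenisers on one made-up sentence (the sentence and variable names are assumptions). `word_tokenize` is left out only because it needs the Punkt models to be downloaded first.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

sample_text = "Don't panic: it's a U.S. law!"

# Same regular expression as the 'sklearn' option: runs of 2+ word characters
sklearn_style_tokens = RegexpTokenizer(r"(?u)\b\w\w+\b").tokenize(sample_text)

# wordpunct_tokenize keeps punctuation and single characters as separate tokens
wordpunct_tokens = wordpunct_tokenize(sample_text)

print(sklearn_style_tokens)  # ['Don', 'panic', 'it', 'law']
print(wordpunct_tokens)
```

Notice that the `'sklearn'` option silently drops one-character tokens such as "a", "t" and "s", along with all punctuation, whereas `'wordpunct'` keeps every piece as a potential feature.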

In [None]:
tokeniser = 'sklearn'

### Stemming / Lemmatising

This setting allows you to use NLTK stemmers or lemmatisers (or neither). Valid options are shown below. Look for more information on the NLTK website: https://www.nltk.org/api/nltk.stem.html. Note that stemming and lemmatising (in particular) will make the preprocessing take longer!

```
normalise = None
```
or
```
normalise = 'PorterStemmer'
```
or
```
normalise = 'SnowballStemmer'
```
or
```
normalise = 'WordNetLemmatizer'
```
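To get a feel for what the stemmers do before you switch one on, this small sketch prints the stems of a few made-up example words (the word list is an assumption). The lemmatiser is left out here because it needs the WordNet data and part-of-speech tags, which the `tokenise` function above handles for you.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

example_words = ["running", "religions", "political", "arguing"]

porter_demo = PorterStemmer()
snowball_demo = SnowballStemmer("english")

# Print each word next to its Porter and Snowball stems for comparison
for word in example_words:
    print(word, porter_demo.stem(word), snowball_demo.stem(word), sep="\t")
```

Stems are not always real words. That is fine for a bag of words model, since the goal is only to map related word forms onto the same feature.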

In [None]:
normalise = None

### Configure stop words

Hopefully you have read the notes on stop word lists above and previewed the different lists.

Do you want to apply a stop word list? Valid values for `stop_word_list` are:
```
stop_word_list = None
```
or
```
stop_word_list = nltk_stop_words
```
or
```
stop_word_list = list(sklearn_stop_words)
```

**Note:** `sklearn_stop_words` is imported as a frozenset. We use list() to convert it to a list type.

In [None]:
stop_word_list = None

In [None]:
print(stop_word_list)

You can also add extra stop words to any of the lists above.
For example:
```
extra_stop_words = ['stopword1','stopword2','stopword3']
```
If you don't want extra stop words, then the next cell should look like:
```
extra_stop_words = []
```
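The effect of combining a base list with extra stop words can be sketched as follows. The two documents and the `demo_` names are made up for illustration, and the sketch uses scikit-learn's default tokeniser, not this notebook's `tokenise` function.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Base list plus one corpus-specific extra word ("writes" is common in newsgroup replies)
demo_stop_words = list(ENGLISH_STOP_WORDS) + ["writes"]

demo_stop_docs = [
    "The senator writes about the law",
    "He writes that the church disagrees",
]

demo_stop_vec = CountVectorizer(stop_words=demo_stop_words)
demo_stop_vec.fit(demo_stop_docs)

# Only words that are in neither list survive as features
demo_remaining = sorted(demo_stop_vec.vocabulary_)
print(demo_remaining)
```

Extra stop words are most useful for terms that are frequent in your particular corpus but carry no signal for the classes you care about.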

In [None]:
extra_stop_words = []

### Filter features based on document frequency

The following settings allow you to remove features that occur in many documents or in only a few documents.

Firstly, `min_df` ignores terms that occur in fewer than a minimum proportion of documents. For example, 0.01 would ignore terms that occur in less than 1% of documents.

In [None]:
min_df = 0.0

`max_df` allows you to ignore terms above a maximum proportion of documents. For example, 0.95 would ignore terms that occur in more than 95% of documents.

In [None]:
max_df = 1.0

### Set a maximum number of features

Set `max_features` to `None` for no limit, or to an integer to keep only that many of the most frequent features (e.g. setting it to 1000 would use the 1000 most frequent features).

In [None]:
max_features = 1000

### Ngrams

With ngram_range set to (1,1) you will use unigrams as features i.e. each feature will be a token. If you set it to (1,2) you will use unigrams and bigrams. (1,3) will use unigrams, bigrams and trigrams. If you just want bigrams you would use (2,2). Please note: increasing the ngram range from (1,1) will add more time to preprocessing, as there will be more features.

In [None]:
ngram_range = (1,1)

### Encoding options

You can change the default encoding here and what to do if you get characters outside your default encoding.
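The three `decode_error` options correspond to Python's standard error handlers for decoding bytes. This sketch shows them on a made-up byte string that is valid Latin-1 but not valid UTF-8:

```python
# "café" encoded as Latin-1 ends in the byte 0xE9, which is not valid UTF-8 here
raw_bytes = "café".encode("latin-1")

# 'ignore' silently drops the undecodable byte
ignored_text = raw_bytes.decode("utf-8", errors="ignore")

# 'replace' substitutes the Unicode replacement character U+FFFD
replaced_text = raw_bytes.decode("utf-8", errors="replace")

# 'strict' raises an exception instead of guessing
try:
    raw_bytes.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(ignored_text)   # caf
print(replaced_text)
print(strict_failed)  # True
```

The newsgroup documents are already loaded as Python strings, so this setting only matters if you swap in a corpus that is read from raw bytes or files.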

In [None]:
encoding = 'utf-8'
decode_error = 'ignore' # what to do if contains characters not of the given encoding - options 'strict', 'ignore', 'replace'

## Setup the feature extraction and classification pipeline

This sets up a scikit-learn pipeline for feature extraction and classification.

**Important Note 1:** When you change settings above or reload your dataset you should rerun this cell!

**Important Note 2:** This cell outputs the settings you used above, which you can cut and paste into a document to keep track of changes you are making and their effects.
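If the pipeline idea is new to you, this stripped-down sketch shows the same two steps on four made-up documents. It uses default vectorizer settings and `toy_` names so it does not overwrite the lab's own `pipeline` variable.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: label 0 = politics, label 1 = religion
toy_pipe_docs = ["god faith church", "church prayer god",
                 "tax vote senate", "senate law tax"]
toy_pipe_labels = [1, 1, 0, 0]

# Step 1 turns text into a document-term matrix, step 2 classifies it
toy_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB()),
])

# fit() runs both steps on the training documents;
# predict() reuses the fitted vocabulary for unseen documents
toy_pipeline.fit(toy_pipe_docs, toy_pipe_labels)
toy_pipe_predictions = toy_pipeline.predict(
    ["god and prayer", "vote on the tax law"]).tolist()

print(toy_pipe_predictions)  # [1, 0]
```

Words in the new documents that were never seen in training ("and", "on", "the") are simply ignored by the vectorizer.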

In [None]:
# you shouldn't need to change anything in this cell!

stop_words = set_stop_words(stop_word_list, extra_stop_words)
normaliser = set_normaliser(normalise)

pipeline = Pipeline([
    ('vectorizer', Vectorizer(
            tokenizer = tokenise,
            lowercase = lowercase,
            min_df = min_df,
            max_df = max_df,
            max_features = max_features,
            stop_words = stop_words,
            ngram_range = ngram_range,
            encoding = encoding,
            decode_error = decode_error)),
    ('classifier', MultinomialNB()), #here is where you would specify an alternative classifier
])

print('Classifier settings')
print('===================')
print('classifier:', type(pipeline.steps[1][1]).__name__)
print('vectorizer:', type(pipeline.steps[0][1]).__name__)
print('classes:', dataset.target_names)
print('lowercase:', lowercase)
print('tokeniser:', tokeniser)
print('normalise:', normalise)
print('min_df:', min_df)
print('max_df:', max_df)
print('max_features:', max_features)
if stop_word_list == nltk_stop_words:
    print('stop_word_list:', 'nltk_stop_words')
elif stop_word_list == list(sklearn_stop_words):
    print('stop_word_list:', 'sklearn_stop_words')
else:
    print('stop_word_list:', 'None')
print('extra_stop_words:', extra_stop_words)
print('ngram_range:', ngram_range)
print('encoding:', encoding)
print('decode_error:', decode_error)

## Train the classifier and predict labels on test data

This cell does the work of training the classifier and predicting labels on test data. It also outputs evaluation metrics, a confusion matrix and features indicative of each class.

**Important Note:** You can cut and paste the model output into a document (with the settings above) to keep track of changes you are making and their effects.

In [None]:
# you shouldn't need to change anything in this cell!
warnings.filterwarnings("ignore", category=UserWarning)

pipeline.fit(docs_train, y_train)
y_predicted = pipeline.predict(docs_test)

# print report
print('\nEvaluation metrics')
print('==================\n')
print(metrics.classification_report(y_test, y_predicted, target_names = dataset.target_names))
cm = metrics.confusion_matrix(y_true=y_test, y_pred=y_predicted, labels=[0, 1])

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=dataset.target_names)
disp = disp.plot(include_values=True, cmap='Blues', ax=None, xticks_rotation='vertical')
plt.show()

vect = pipeline.steps[0][1]
clf = pipeline.steps[1][1]

print()

logodds=clf.feature_log_prob_[1]-clf.feature_log_prob_[0]

print("Features most indicative of",dataset.target_names[0])
print('============================' + '='*len(dataset.target_names[0]))
for i in np.argsort(logodds)[:20]:
    print(vect.get_feature_names_out()[i], end=' ')
print()
print()

print("Features most indicative of",dataset.target_names[1])
print('============================' + '='*len(dataset.target_names[1]))
for i in np.argsort(-logodds)[:20]:
    print(vect.get_feature_names_out()[i], end=' ')

lookup = dict((v,k) for k,v in vect.vocabulary_.items())

## List all features

Just for your reference here is a count and list of all features used in this model.

In [None]:
print('Total Features: ',len(vect.get_feature_names_out()))
print(vect.get_feature_names_out())

## Inspect correctly/incorrectly classified documents

The output in the next cell is quite long and will take a few moments to generate. It will show you wordclouds and a preview of documents for correctly and incorrectly classified documents. The size of each word in a wordcloud is based on adding up the counts/tf-idf scores of that feature across the documents that fall into the corresponding cell of the confusion matrix.
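As a reminder of how the cells of the confusion matrix are indexed, here is a small sketch with made-up labels (the `toy_` lists are illustrative only). Rows are the true label and columns are the predicted label.

```python
from sklearn.metrics import confusion_matrix

toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]

# toy_cm[t][p] counts documents whose true label is t and predicted label is p
toy_cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])
print(toy_cm)

# Diagonal cells are correct predictions, off-diagonal cells are errors
toy_correct = int(toy_cm[0][0] + toy_cm[1][1])
toy_errors = int(toy_cm[0][1] + toy_cm[1][0])
print(toy_correct, toy_errors)  # 3 2
```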

In [None]:
# setup a counter for each cell in the confusion matrix
counter = {}
previews = {}
for true_target, target_name in enumerate(dataset.target_names):
    counter[true_target] = {}
    previews[true_target] = {}
    for predicted_target, target_name in enumerate(dataset.target_names):
        counter[true_target][predicted_target] = {}
        previews[true_target][predicted_target] = ''

# get doc-term matrix for test docs
doc_terms = vect.transform(docs_test)

# iterate through all predictions, building the counter and preview of docs
# there is a better way to do this, but this will do!
for doc_id, prediction in enumerate(clf.predict(doc_terms)):
    for k, v in enumerate(doc_terms[doc_id].toarray()[0]):
        if v > 0:
            if lookup[k] not in counter[y_test[doc_id]][prediction]:
                counter[y_test[doc_id]][prediction][lookup[k]] = 0
            counter[y_test[doc_id]][prediction][lookup[k]] += v

    previews[y_test[doc_id]][prediction] += get_preview(docs_test, y_test, dataset.target_names, doc_id, max_len=80) + '\n'

# output a wordcloud and preview of docs for each cell of confusion matrix ...
for true_target, target_name in enumerate(dataset.target_names):
    for predicted_target, target_name in enumerate(dataset.target_names):
        if true_target == predicted_target:
            print(f'\nCORRECTLY CLASSIFIED:\n{dataset.target_names[true_target]}')
        else:
            print(f'\n{dataset.target_names[true_target]} INCORRECTLY CLASSIFIED as: {dataset.target_names[predicted_target]}')
        print('=================================================================')

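        if counter[true_target][predicted_target]:
            wordcloud = WordCloud(background_color="white", width=800, height=400, color_func=lambda *args, **kwargs: "black").generate_from_frequencies(counter[true_target][predicted_target])
            plt.figure(figsize=(8, 4), dpi= 150)
            plt.imshow(wordcloud, interpolation="bilinear")
            plt.axis("off")
            plt.show()
        else:
            # WordCloud raises an error when given no words, so skip empty cells
            print('(no features to display for this cell)')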

        print(previews[true_target][predicted_target])



## Preview document and its features

Use this cell to preview a document using its index in the test set. You can see the predicted label, its actual label, the full text and the features for this specific document.

In [None]:
test_id = 147

print('Prediction')
print('==========')
print(dataset.target_names[clf.predict(vect.transform([docs_test[test_id]]))[0]])
print()

print(get_preview(docs_test, y_test, dataset.target_names, test_id))

print('Features')
print('========')
for k, v in enumerate(vect.transform([docs_test[test_id]]).toarray()[0]):
    if v > 0:
        print(v, '\t', lookup[k])

---

## DistilBERT

**Question:** Run this model and compare the results to the Bag of Words (Naive Bayes) model. Make note of some of the pros and cons of each and discuss with your neighbour or the tutors.

In [None]:
# Import libraries
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import torch

The following cell sets up and trains the model, then returns the results of evaluation against the test set.

Just run the cell, there is no need to change anything.
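The model outputs one raw score (a logit) per class for each document, and the `distilbert_classify` function turns these into probabilities with a softmax. This sketch reproduces that step with NumPy on made-up logits, so you can see where the predicted label and the confidence value come from. The numbers are illustrative, not real model output.

```python
import numpy as np

# Made-up logits for three documents and two classes
toy_logits = np.array([[2.0, 0.0],
                       [-1.0, 1.0],
                       [0.1, 0.3]])

# Softmax: subtract the row maximum for numerical stability, exponentiate, normalise
toy_shifted = toy_logits - toy_logits.max(axis=1, keepdims=True)
toy_probs = np.exp(toy_shifted) / np.exp(toy_shifted).sum(axis=1, keepdims=True)

# Predicted class = most probable class; confidence = that class's probability
toy_predictions = toy_probs.argmax(axis=1).tolist()
toy_confidence = toy_probs.max(axis=1)

print(toy_predictions)  # [0, 1, 1]
print(toy_confidence.round(3))
```

The third document is assigned class 1 with a confidence only a little above 0.5, which is the kind of borderline case worth inspecting when you look at misclassified documents.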

In [None]:
warnings.filterwarnings("ignore", category=FutureWarning)

print('\n\nWe further split our training set into a new training set and an evaluation set\n\n')
docs_train_b, docs_eval, y_train_b, y_eval = train_test_split(docs_train, y_train, test_size=0.2, random_state=None)

# Set up DistilBERT model
model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=len(dataset.target_names))

train_encodings = tokenizer(docs_train_b, truncation=True, padding=True, max_length=512)
eval_encodings = tokenizer(docs_eval, truncation=True, padding=True, max_length=512)

train_dataset = Dataset.from_dict({'input_ids': train_encodings['input_ids'], 'attention_mask': train_encodings['attention_mask'], 'labels': y_train_b})
eval_dataset = Dataset.from_dict({'input_ids': eval_encodings['input_ids'], 'attention_mask': eval_encodings['attention_mask'], 'labels': y_eval})

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch"
)

# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

print('\n\nTraining...please be patient 😉\n')

print("""
Key points:
Convergence: Ideally, both training and validation loss should be decreasing in each epoch and converging to a lower value.
Overfitting: If the training loss decreases while the validation loss increases, it indicates that the model is overfitting to the training data.
Underfitting: If both training and validation loss remain high, it indicates the model is underfitting - it is not learning well from the data.
""")

# Train the model
trainer.train()

def distilbert_classify(texts, model, tokenizer):
    """
    Classify new text using our fine-tuned DistilBERT model.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # check for cuda availability and set device
    model = model.to(device) # move model to device
    model.eval() # switch off dropout so predictions are deterministic
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512).to(device) # move inputs to device
    with torch.no_grad():
        outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    return probabilities, probabilities.argmax(dim=-1).tolist()

print('\nNow see how the fine-tuned model performs on new data - our original test set')

# Evaluate the model
probabilities, hf_predictions = distilbert_classify(docs_test, model, tokenizer)

print("\nDistilBERT Results:")
print(classification_report(y_test, hf_predictions, target_names=dataset.target_names))

# Display confusion matrix
cm = metrics.confusion_matrix(y_true=y_test, y_pred=hf_predictions, labels=range(len(dataset.target_names)))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=dataset.target_names)
disp.plot(include_values=True, cmap='Blues', xticks_rotation='vertical')
plt.show()

In [None]:
# Identify and rank incorrectly classified documents
incorrect_indices = [i for i, (true, pred) in enumerate(zip(y_test, hf_predictions)) if true != pred]
incorrect_docs = [
    (
        i,  # Original index
        docs_test[i],
        dataset.target_names[y_test[i]],  # Map true label to text
        dataset.target_names[hf_predictions[i]],  # Map predicted label to text
        probabilities[i].max().item()
    )
    for i in incorrect_indices
]

# Sort by the highest confidence in the incorrect class
incorrect_docs_sorted = sorted(incorrect_docs, key=lambda x: x[4], reverse=True)
df_incorrect = pd.DataFrame(incorrect_docs_sorted, columns=['Test set index', 'Document', 'True Label', 'Predicted Label', 'Confidence'])

print("\nIncorrectly Classified Documents\n")
print("Confidence is the probability the model has assigned for the given document belonging to the predicted class.\n")
display(df_incorrect)

In [None]:
index_to_display = 0  # row of df_incorrect to display (0 is the most confident error)
print(df_incorrect.loc[index_to_display, 'Document'])