# Question Type Classification

In this notebook, I will implement a Naive Bayes classifier for classifying question types, and evaluate the performance of the classifier.

## Setup

I will be using the TREC question classification dataset by [Li and Roth, 2002](https://www.aclweb.org/anthology/C02-1150).

### Dataset Description

The Text REtrieval Conference (TREC) Question Classification dataset contains about 5,500 labeled questions in the training set (5,452 in the version loaded below) and another 500 in the test set.

The dataset has 6 coarse class labels and 50 fine class labels. I use only the coarse class labels.

To keep things small, I train on a sample: the first 1,000 questions of the training set.

In [None]:
# install the libraries
!pip install datasets
!pip install nltk
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Inspect the Dataset

The helper function below prints a few example questions for each coarse label.

In [None]:
# preprocess trec dataset.
# consider coarse_label only

import datasets

dataset = datasets.load_dataset('trec')
"""coarse_label (ClassLabel): Coarse class label. Possible values are:
'ABBR' (0): Abbreviation.
'ENTY' (1): Entity.
'DESC' (2): Description and abstract concept.
'HUM' (3): Human being.
'LOC' (4): Location.
'NUM' (5): Numeric value.
"""
label_mappings = {'ABBR': 0, 'ENTY': 1, 'DESC': 2, 'HUM': 3, 'LOC': 4, 'NUM': 5}
reversed_label_mappings = {v: k for k, v in label_mappings.items()}

def print_data_sample(ds, text_field, label_field, print_count=5, label_mappings=None):
  count_by_label = {e: 0 for e in set(ds[label_field])}
  print(count_by_label)
  for label in count_by_label:
    for example, example_label in zip(ds[text_field], ds[label_field]):
        if example_label == label:
            if count_by_label[label] == -1:
                continue
            count_by_label[label] += 1
            label_text = label_mappings[label] if label_mappings else label
            print(f"{label_text}:  {example}")
            if count_by_label[example_label] == print_count:
                count_by_label[example_label] = -1

dataset['train']['coarse_label'][:10]
list(zip(dataset['train']['text'], dataset['train']['coarse_label']))[:2]

print_data_sample(dataset['train'], 'text', 'coarse_label', print_count=3, label_mappings=reversed_label_mappings)

{0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}
ABBR:  What is the full form of .com ?
ABBR:  What does the abbreviation AIDS stand for ?
ABBR:  What does INRI stand for when used on Jesus ' cross ?
ENTY:  What films featured the character Popeye Doyle ?
ENTY:  What fowl grabs the spotlight after the Chinese Year of the Monkey ?
ENTY:  What is considered the costliest disaster the insurance industry has ever faced ?
DESC:  How did serfdom develop in and then leave Russia ?
DESC:  How can I find a list of celebrities ' real names ?
DESC:  What are liver enzymes ?
HUM:  What contemptible scoundrel stole the cork from my lunch ?
HUM:  What team did baseball 's St. Louis Browns become ?
HUM:  What is the oldest profession ?
LOC:  What sprawling U.S. state boasts the most airports ?
LOC:  What is the highest waterfall in the United States ?
LOC:  Which two states enclose Chesapeake Bay ?
NUM:  When was Ozzy Osbourne born ?
NUM:  How many Jews were executed in concentration camps during WWII ?
NUM:  Wh

## Writing Naive Bayes Classifier

Here I implement the Naive Bayes classifier. Stop words are removed from the vocabulary, so they are not included in the log-likelihood calculations. I also use Laplace (add-one) smoothing.
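
As a rough illustration of the smoothing step, the sketch below computes add-one smoothed log-likelihoods, log((count(w, c) + 1) / (total_c + |V|)), on a made-up two-class corpus. The toy documents and the `smoothed_log_likelihoods` helper are hypothetical and exist only for this example; they are not part of the classifier defined in the next cell.

```python
import math
from collections import Counter

# Toy, made-up training data: (tokens, label). Not taken from TREC.
toy_docs = [
    (["capital", "france"], "LOC"),
    (["capital", "peru"], "LOC"),
    (["invented", "telephone"], "HUM"),
]


def smoothed_log_likelihoods(docs):
    """Return {label: {word: log P(word | label)}} with add-one smoothing."""
    counts = {}
    vocab = set()
    for tokens, label in docs:
        counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)

    result = {}
    for label, word_counts in counts.items():
        total = sum(word_counts.values())
        result[label] = {
            w: math.log((word_counts[w] + 1) / (total + len(vocab)))
            for w in vocab
        }
    return result


ll = smoothed_log_likelihoods(toy_docs)

# "capital" occurs twice among LOC's 4 tokens and the vocabulary has 5 words,
# so P(capital | LOC) = (2 + 1) / (4 + 5) = 1/3.
print(math.exp(ll["LOC"]["capital"]))

# "capital" never occurs under HUM, yet smoothing keeps its probability non-zero:
# P(capital | HUM) = (0 + 1) / (2 + 5) = 1/7.
print(math.exp(ll["HUM"]["capital"]))
```

Without the added 1, a single unseen word would contribute log(0) and rule out a class entirely, which is why some form of smoothing is needed here.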

In [None]:
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple, Union

import datasets
import nltk  # provides the English stop word list


class NaiveBayesClassifier:
    """Code for a bag-of-words Naive Bayes classifier.
    """

    def __init__(self, remove_stops: bool = True) -> None:
        self.remove_stops = remove_stops
        if self.remove_stops:
          self.stop_words = set(nltk.corpus.stopwords.words('english'))
        self.classes = {}

        # the following will be populated after the train() function is called
        self.log_prior: Dict[int, float] = None
        self.likelihoods: Dict[int, Dict[str, float]] = None
        self.vocab_all: Union[Set[str], Dict[str, int]] = None
        self.vocab_by_class: Dict[int, Dict[str, int]] = None

    def train(self, X: List[str], y: List[int]) -> None:
        """
        Train the Naive Bayes classification model.

        Args:
          X: training data
          y: labels

        Returns:
          None (populates self.vocab_all, self.vocab_by_class, self.log_prior, self.likelihoods)
        """

        # no of documents
        N_doc = len(X)

        # Initialize classes and vocabulary
        self.classes = set(y)
        self.vocab_all = set()
        self.vocab_by_class = defaultdict(Counter)
        self.log_prior = {}
        self.likelihoods = defaultdict(lambda: defaultdict(float))

        # logprior
        for c in self.classes:
            N_c = sum(1 for label in y if label == c)
            self.log_prior[c] = math.log(N_c / N_doc)

            # Initialize bigdoc for class c
            bigdoc_c = []
            for doc, label in zip(X, y):
                if label == c:
                    if self.remove_stops:
                      tokens = [word for word in doc.strip().split() if word not in self.stop_words]
                    else:
                      tokens = [word for word in doc.strip().split()]
                    bigdoc_c.extend(tokens)
                    self.vocab_all.update(tokens)

            # vocab_by_class update
            self.vocab_by_class[c].update(bigdoc_c)

        # log-likelihood with Laplace smoothing
        for c in self.classes:
            total_words = sum(self.vocab_by_class[c].values())
            for word in self.vocab_all:
                count_w_c = self.vocab_by_class[c][word]
                self.likelihoods[c][word] = math.log((count_w_c + 1) / (total_words + len(self.vocab_all)))

    def predict(self, doc: str) -> int:
        """
        Return the most likely class for a given document.
        Use the likelihood and log_prior values populated during training

        Returns:
            The most likely class as predicted by the model.
        """
        class_scores = {cls: self.log_prior[cls] for cls in self.classes}
        words = doc.strip().split()
        # Calculate score for each class
        for cls in self.classes:
            for word in words:
                if word in self.vocab_all:  # Only consider words that are in the vocabulary, otherwise ignore
                    class_scores[cls] += self.likelihoods[cls].get(word, 0)

        # Return class with the highest score
        return max(class_scores, key=class_scores.get)

    def predict_all(self, test_docs: List[str]) -> List[int]:
        """
        Predict the class of all documents in the test set.
        This is just a loop over all documents in the test set
        """
        y_pred = [self.predict(doc) for doc in test_docs]
        return y_pred

    @staticmethod
    def evaluate(
        y_pred: List[int], y_true: List[int],
    ) -> Tuple[float, float, float]:
        """
        Calculate a precision, recall, and F1 score for the model
        on a given test set. Use macro averaging for these metrics.

        Args:
            y_pred: Predicted labels
            y_true: Ground truth labels

        Returns:
            (float, float, float)
            The model's macro-averaged precision, recall, and F1 score,
            computed over the classes present in y_true.
        """
        precision_score = 0.0
        recall_score = 0.0
        f1_score = 0.0

        classes = set(y_true)
        # Calculate metrics for each class
        for cls in classes:
            TP = sum((y_pred[i] == cls and y_true[i] == cls) for i in range(len(y_pred)))
            FP = sum((y_pred[i] == cls and y_true[i] != cls) for i in range(len(y_pred)))
            FN = sum((y_pred[i] != cls and y_true[i] == cls) for i in range(len(y_pred)))

            precision = TP / (TP + FP) if (TP + FP) > 0 else 0
            recall = TP / (TP + FN) if (TP + FN) > 0 else 0

            precision_score += precision
            recall_score += recall
            f1 = (2 * (precision * recall)) / (precision + recall) if (precision + recall) > 0 else 0
            f1_score += f1

        # Macro average the sums by the number of classes
        num_classes = len(classes)
        precision_score = precision_score / num_classes
        recall_score = recall_score / num_classes
        f1_score = f1_score / num_classes

        return precision_score, recall_score, f1_score



def main():

    trec_dataset = datasets.load_dataset('trec')
    print(trec_dataset)
    train = trec_dataset['train'][:1000]
    val = trec_dataset['test']

    text_key = 'text'
    label_key = 'coarse_label'


    X = [e.strip() for e in train[text_key]]
    y = train[label_key]
    X_val = [e.strip() for e in val[text_key]]
    y_val = val[label_key]


    def preprocess(X):
      return [x.lower() for x in X]

    X = preprocess(X)
    X_val = preprocess(X_val)


    clf = NaiveBayesClassifier(remove_stops=True)

    clf.train(X, y)
    y_pred = clf.predict_all(X_val)
    precision, recall, f1 = NaiveBayesClassifier.evaluate(y_pred, y_val)
    print(f'precision: {precision}, recall: {recall}, f1: {f1}')


main()


DatasetDict({
    train: Dataset({
        features: ['text', 'coarse_label', 'fine_label'],
        num_rows: 5452
    })
    test: Dataset({
        features: ['text', 'coarse_label', 'fine_label'],
        num_rows: 500
    })
})
precision: 0.44771083470098766, recall: 0.42781423510031624, f1: 0.40205032540376506
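
The scores above are macro-averaged, meaning each class counts equally regardless of how many questions it has. As a small worked example of that averaging, here is a sketch on made-up labels; `macro_prf` is a hypothetical standalone helper that mirrors the logic of `evaluate`, and the label lists are invented for illustration.

```python
from typing import List, Tuple


def macro_prf(y_pred: List[int], y_true: List[int]) -> Tuple[float, float, float]:
    """Macro-averaged precision, recall, and F1 over the classes in y_true."""
    classes = sorted(set(y_true))
    p_sum = r_sum = f_sum = 0.0
    for cls in classes:
        tp = sum(p == cls and t == cls for p, t in zip(y_pred, y_true))
        fp = sum(p == cls and t != cls for p, t in zip(y_pred, y_true))
        fn = sum(p != cls and t == cls for p, t in zip(y_pred, y_true))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        p_sum += precision
        r_sum += recall
        f_sum += f1
    n = len(classes)
    return p_sum / n, r_sum / n, f_sum / n


# Made-up labels: class 0 is frequent, class 1 is rare.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Class 0: precision 3/4, recall 3/4. Class 1: precision 1/2, recall 1/2.
# The macro average of each metric is therefore (0.75 + 0.5) / 2 = 0.625.
print(macro_prf(y_pred, y_true))

# Plain accuracy, by contrast, is 4 correct out of 6 and is dominated by class 0.
accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
print(accuracy)
```

Because rare classes such as ABBR weigh as much as frequent ones under macro averaging, poor performance on a small class can pull these scores well below plain accuracy.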


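**Model performance when removing stop words:** precision: 0.448, recall: 0.428, F1: 0.402

**Model performance without removing stop words:** precision: 0.564, recall: 0.503, F1: 0.501

All scores are macro-averaged, with the model trained on the first 1,000 training questions and evaluated on the 500 test questions. In this setup, keeping stop words improves every metric, with F1 rising by about 0.10.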
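
One plausible explanation for this gap, which I have not verified here, is that question words such as "what", "when", and "how" are themselves stop words, so removing them discards some of the strongest cues for the question type. The sketch below illustrates the effect on a few questions from the dataset sample printed earlier. `ILLUSTRATIVE_STOP_WORDS` is a small hand-picked subset used as a stand-in (an assumption for this example; the actual run uses NLTK's full English list), and `remove_stop_words` is a hypothetical helper.

```python
# A small hand-picked subset of English stop words, for illustration only.
# Assumption: the notebook itself uses NLTK's full English list, which is larger.
ILLUSTRATIVE_STOP_WORDS = {
    "what", "when", "how", "who", "where", "is", "the", "of", "was",
}


def remove_stop_words(question, stop_words=ILLUSTRATIVE_STOP_WORDS):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in question.lower().strip().split() if tok not in stop_words]


# Example questions copied from the dataset sample printed earlier.
examples = [
    "What is the full form of .com ?",   # ABBR
    "When was Ozzy Osbourne born ?",     # NUM
    "What is the oldest profession ?",   # HUM
]

for q in examples:
    print(remove_stop_words(q))
```

After filtering, the NUM question no longer contains "when", leaving only the name and "born" to signal a numeric answer.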