# 02-02 : Multi-label text classification

After extracting intents, we use Keras, a comprehensive deep learning library, to develop a multi-label classification model.

## References

- [Large-scale multi-label text classification](https://keras.io/examples/nlp/multi_label_classification/)

In [1]:
import pandas as pd
from typing import List
from IPython.display import display

from tensorflow.keras import layers
from tensorflow import keras
import tensorflow as tf

from sklearn.model_selection import train_test_split


## 1. Data Description

In [2]:
data_path = '../../data'

orig_data_path = f'{data_path}/hellopeter'
orig_file = f'{orig_data_path}/00-01_vodacom_selected_reviews.parquet.gz'

intent_path = f'{data_path}/multiclass_model'
intent_extract_file = f'{intent_path}/01-03_intents.parquet.gz'
intent_file = f'{intent_path}/02-01_flat_intents.parquet.gz'

### 1.1 Data Lineage

1. The original dataset is a collection of publicly accessible customer reviews/complaints scraped from the [Hellopeter](https://www.hellopeter.com/) site between 2021 and 2023. This dataset was [created](https://github.com/JohnnyFoulds/dsm050-2023-apr/blob/master/notebooks/01_hellopeter/01-01_retrieve_data.ipynb) in another research project: [Evaluating Customer Satisfaction and Preferences in the Telecommunications Industry: A Comparative Analysis of Survey Data and Online Reviews](https://github.com/JohnnyFoulds/dsm050-2023-apr/blob/master/notebooks/04_draft/04-04_cw02.ipynb)

2. Data selection was performed in the `00-01_data_selection` notebook based on the following criteria.
    - Reviews from the period **2022-06-01** to **2023-06-30** were selected.

    - Only reviews from the **Vodacom** telecommunications company were selected.

    - Very short or very long reviews were removed. Only reviews of between **10** and **100** words were selected. The word count was calculated using a basic `.str.split().str.len()`, which is sufficient for this purpose.

3. The unlabeled data was then labeled in the `01-02_batch_classification` notebook using Generative AI.
    - The **Mistral 7B v0.2** Large Language Model (LLM) was hosted on a local server with [Ollama](https://ollama.com/library/mistral). Please refer to the `01-01_classification_test` notebook for further details.

    - The classification was done using multiple prompts similar to Chain-of-Thought (CoT) techniques for classification. _Implementation details can be found in the `src` directory._

    - Classification was done based on the categories defined in `src/config/category_definitions.jsonl`.

    - It took an average of **7 seconds** to classify a single review.

4. Using Generative AI for labeling introduced new categories that were cleaned up in the `01-03_cleanup` notebook.

    - First, new categories that were prefixed with an original category were replaced with that original category.

    - Then, new categories that contained an original category in round brackets were replaced with the original.

    - For the remaining new categories, the reviews were manually inspected and reclassified via a manual mapping.

5. The data labels were then converted into a format suitable for modeling in the `02-01_data_preperation` notebook.
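
The word-count filter from step 2 can be sketched as follows. This is a minimal illustration on a toy DataFrame, not the code from the `00-01_data_selection` notebook: the helper name `filter_by_word_count` and the sample rows are hypothetical, and treating both bounds as inclusive is an assumption.

```python
import pandas as pd


def filter_by_word_count(
    df: pd.DataFrame,
    column: str = "review_content",
    min_words: int = 10,
    max_words: int = 100,
) -> pd.DataFrame:
    """Keep rows whose whitespace-delimited word count lies in [min_words, max_words]."""
    # same basic word count as described above
    word_count = df[column].str.split().str.len()
    # Series.between is inclusive on both ends by default (an assumption about the intended bounds)
    return df[word_count.between(min_words, max_words)]


# toy rows (hypothetical): 3, 14 and 150 words respectively
df_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "review_content": [
        "Too short review",
        "My data bundle disappeared overnight and nobody at the call centre can explain why",
        " ".join(["word"] * 150),
    ],
})

df_selected = filter_by_word_count(df_toy)
print(df_selected.id.tolist())
```

Only the middle row falls inside the 10 to 100 word window, so it is the only one kept.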

### 1.2 Data Structure

#### Source Data

The following shows a sample customer review from the original source data.

In [3]:
df_source = pd.read_parquet(orig_file)
with pd.option_context('display.max_colwidth', None):
    display(
        df_source[df_source.id == 3950575]
    )

| | id | review_title | review_content |
|---|---|---|---|
| 5215 | 3950575 | Vodacom is useless!!! | Good day\n\nAgain, vodacom did not do their jobs. The amount went off as I explicitly asked for it not to. Vodacom now owes me R300 as it has been debited from my account twice now. I will be taking this to social media now. And I want to please cancel all my contracts with vodacom. |


#### Data Labels

A sample of the labels extracted using the LLM is shown below. The LLM extracts multiple labels for each review and generates a reason for each label to support human verification.

In [4]:
df_intents_extracted = pd.read_parquet(intent_extract_file)
with pd.option_context('display.max_colwidth', None):
    display(
        df_intents_extracted[df_intents_extracted.id == 3950575]
    )

| | category | reason | relevance | sentiment | id |
|---|---|---|---|---|---|
| 29 | Billing | The text mentions that an amount was debited from the account twice. | 1.0 | negative | 3950575 |
| 30 | Cancellation | The text expresses the intent to cancel all contracts with Vodacom due to the billing issue. | 1.0 | negative | 3950575 |
| 31 | Customer's Feeling | The text contains a negative sentiment towards Vodacom. | 0.5 | negative | 3950575 |


#### Prepared Data Labels

The prepared data labels are shown below. The data labels are prepared for multi-label classification.

This data will need to be combined with the `review_title` and `review_content` from the original source data to create the final dataset for modeling.

In [5]:
df_intents = pd.read_parquet(intent_file)
# parquet returns array-like values; convert them to plain Python lists
df_intents["category_list"] = df_intents["category_list"].apply(list)
df_intents["relevance_list"] = df_intents["relevance_list"].apply(list)
df_intents["sentiment_list"] = df_intents["sentiment_list"].apply(list)

with pd.option_context('display.max_colwidth', None):
    display(
        df_intents[df_intents.id == 3950575]
    )

print(f'Data samples: {len(df_source)}')

| | id | category_list | relevance_list | sentiment_list |
|---|---|---|---|---|
| 2 | 3950575 | [Billing, Cancellation, Customer's Feeling] | [1.0, 1.0, 0.5] | [negative, negative, negative] |


Data samples: 5218
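
Combining the prepared labels with the review text could look like the sketch below. It uses small stand-in frames with the same column names as `df_source` and `df_intents`. The toy rows and the `text` column (title and content joined into one string) are assumptions for illustration, not necessarily how the final dataset is built.

```python
import pandas as pd

# stand-ins for df_source and df_intents (toy rows, same column names)
df_source_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "review_title": ["Double debit", "No signal", "Great help"],
    "review_content": ["I was billed twice.", "Coverage dropped again.", "Agent was friendly."],
})
df_intents_toy = pd.DataFrame({
    "id": [1, 2],
    "category_list": [["Billing", "Cancellation"], ["Network Coverage"]],
})

# inner join on id keeps only the reviews that have labels;
# validate raises if an id is duplicated on either side
df_model = df_intents_toy.merge(
    df_source_toy[["id", "review_title", "review_content"]],
    on="id",
    how="inner",
    validate="one_to_one",
)

# one possible model input: title and content in a single string (an assumption)
df_model["text"] = df_model.review_title + ". " + df_model.review_content

print(df_model[["id", "text", "category_list"]])
```

An inner join silently drops reviews without labels, so comparing `len(df_model)` with the length of both inputs is a cheap sanity check.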


## 2. Data Preprocessing

### 2.1 Multi-label Binarization

In [6]:
# get the list of category target values
categories = tf.ragged.constant(df_intents.category_list.values)

# learn the vocabulary
lookup = keras.layers.StringLookup(output_mode="multi_hot")
lookup.adapt(categories)

# show the vocabulary
vocab = lookup.get_vocabulary()
print("Vocabulary:\n")
print(vocab)
print(f'Vocabulary size: {len(vocab)}')

Vocabulary:

['[UNK]', "Customer's Feeling", 'Billing', 'Network Coverage', 'Cancellation', 'Call Center', 'Policy', 'Account Management', 'Response', 'Resolution', 'Devices', 'Staff Level', 'Price Plans', 'Brand', 'Abuse', 'Products', 'Service', 'Services', 'SIM', 'Other']
Vocabulary size: 20
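
To make the `multi_hot` output mode concrete, the sketch below reproduces the encoding in plain Python for the first five vocabulary entries. It illustrates the idea rather than the Keras implementation, and the helper names `multi_hot` and `invert_multi_hot` are hypothetical.

```python
from typing import Dict, List

# first five entries of the vocabulary printed above; index 0 is the out-of-vocabulary slot
vocab: List[str] = ["[UNK]", "Customer's Feeling", "Billing", "Network Coverage", "Cancellation"]
index: Dict[str, int] = {label: i for i, label in enumerate(vocab)}


def multi_hot(labels: List[str], index: Dict[str, int]) -> List[int]:
    """Return a 0/1 vector with a 1 at the position of every label present."""
    vector = [0] * len(index)
    for label in labels:
        # unknown labels fall into the [UNK] slot at position 0
        vector[index.get(label, 0)] = 1
    return vector


def invert_multi_hot(vector: List[int], vocab: List[str]) -> List[str]:
    """Map a 0/1 vector back to label names."""
    return [vocab[i] for i, flag in enumerate(vector) if flag]


encoded = multi_hot(["Billing", "Cancellation"], index)
print(encoded)                           # [0, 0, 1, 0, 1]
print(invert_multi_hot(encoded, vocab))  # ['Billing', 'Cancellation']
```

In the notebook the same mapping is obtained by calling the adapted `lookup` layer on the label tensor. With the full vocabulary, each target vector has 20 entries.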



## 3. Train Test Split

In a multi-label classification problem, imbalance can occur at two levels:

1. **Label imbalance**: Some labels appear more frequently than others.
2. **Label combination imbalance**: Some combinations of labels appear more frequently than others.

Both these imbalances are present in the dataset. Imbalance can lead to a model that performs well on the majority classes but poorly on the minority classes. This is because the model might be biased towards predicting the majority classes due to their higher occurrence in the training data.

To address this, we would ideally use a stratified split so that the distribution of labels in the training and validation sets is similar.

In [7]:
df_category_count = df_intents.category_list.value_counts().reset_index()
df_category_count.columns = ['category', 'samples']

print(f'Combination Category Count  : {len(df_category_count)}')
print(f'Combinations with one sample: {len(df_category_count[df_category_count.samples == 1])}')

Combination Category Count  : 376
Combinations with one sample: 128


Unfortunately, about a third (128 of 376, or 0.34) of the unique category combinations have only one sample. A combination with a single sample cannot be represented in both the training and validation sets, so we cannot stratify on the label combination and fall back to a random split instead.
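
Both levels of imbalance can be measured with `collections.Counter`. The sketch below uses a handful of made-up label lists (not the real dataset) and then checks the singleton fraction from the counts printed above. Sorting each label list so that combinations are order-insensitive is an assumption of this sketch.

```python
from collections import Counter

# made-up label lists, for illustration only
label_lists = [
    ["Billing", "Cancellation"],
    ["Billing"],
    ["Billing", "Cancellation"],
    ["Network Coverage"],
    ["Billing", "Network Coverage"],
]

# label imbalance: how often each individual label occurs
label_counts = Counter(label for labels in label_lists for label in labels)

# label combination imbalance: how often each set of labels occurs together
combo_counts = Counter(tuple(sorted(labels)) for labels in label_lists)

# combinations that occur exactly once cannot be stratified
singletons = [combo for combo, n in combo_counts.items() if n == 1]

print(label_counts.most_common())
print(combo_counts.most_common())
print(f"singleton combinations: {len(singletons)} of {len(combo_counts)}")

# the same ratio for the counts reported above
print(round(128 / 376, 2))  # 0.34
```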

In [8]:
test_split = 0.2

# initial train and test split
train_full_df, test_df = train_test_split(
    df_intents,
    test_size=test_split
)

# splitting the train set further into validation and new train sets
val_df = train_full_df.sample(frac=0.2)
train_df = train_full_df.drop(val_df.index)

print(f"Number of rows in training set   : {len(train_df):>5}")
print(f"Number of rows in validation set : {len(val_df):>5}")
print(f"Number of rows in test set       : {len(test_df):>5}")

Number of rows in training set   :  3214
Number of rows in validation set :   803
Number of rows in test set       :  1005
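
Because the validation set is carved out of the 80% training portion, the effective proportions are roughly 64% / 16% / 20%. The arithmetic below reproduces the row counts printed above. It assumes the split started from 5,022 labeled rows (the sum of the three counts), that the test size is rounded up, and that the validation sample is rounded to the nearest integer.

```python
import math

n_total = 3214 + 803 + 1005  # sum of the three printed counts
test_split = 0.2             # fraction held out for testing
val_frac = 0.2               # fraction of the remaining rows used for validation

n_test = math.ceil(test_split * n_total)  # assumed rounding: up
n_train_full = n_total - n_test
n_val = round(val_frac * n_train_full)    # assumed rounding: nearest integer
n_train = n_train_full - n_val

print(n_train, n_val, n_test)
print(f"train {n_train / n_total:.2f}, val {n_val / n_total:.2f}, test {n_test / n_total:.2f}")
```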


## 4. Modeling

### 4.1 Baseline

For a multi-label text classification task in Natural Language Processing (NLP), a commonsense baseline could be designed using simple heuristics based on the frequency of specific keywords or phrases associated with each label. Multi-label classification differs from binary or multi-class classification in that each text instance can be associated with multiple labels simultaneously, rather than belonging to just one category.

To accomplish this, we could start with a simple library like the [Natural Language Toolkit (NLTK)](https://www.nltk.org/) to tokenize the text and count the frequency of specific words or phrases associated with each label. We could then use these frequencies to predict the labels for new text instances. However, for the sake of simplicity, we will instead start with a very basic deep learning model.
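
For reference, the keyword heuristic described above could be sketched as follows. This is only an illustration: the keyword lists are invented for the example (they are not derived from the dataset or from `src/config/category_definitions.jsonl`), the helper names `tokenize` and `predict_labels` are hypothetical, and a plain regular expression stands in for NLTK tokenization.

```python
import re
from collections import Counter
from typing import Dict, List, Set

# invented keyword lists, for illustration only
KEYWORDS: Dict[str, Set[str]] = {
    "Billing": {"debited", "invoice", "charged", "refund"},
    "Cancellation": {"cancel", "cancellation", "terminate"},
    "Network Coverage": {"signal", "coverage", "network"},
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def predict_labels(
    text: str,
    keywords: Dict[str, Set[str]] = KEYWORDS,
    min_hits: int = 1,
) -> List[str]:
    """Assign every label whose keywords occur at least min_hits times in the text."""
    counts = Counter(tokenize(text))
    return [
        label
        for label, words in keywords.items()
        if sum(counts[word] for word in words) >= min_hits
    ]


review = "The amount has been debited twice and I want to cancel all my contracts."
print(predict_labels(review))  # ['Billing', 'Cancellation']
```

Each label is decided independently, so a review can receive zero, one, or several labels, which is what distinguishes the multi-label setting from multi-class classification.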