# 02-02 : Multi-label text classification

After extracting intents, we use Keras, a comprehensive deep learning library, to develop a multi-label classification model.

## References

- [Large-scale multi-label text classification](https://keras.io/examples/nlp/multi_label_classification/)

In [4]:
import pandas as pd
from IPython.display import display

## Data Description

In [1]:
data_path = '../../data'

orig_data_path = f'{data_path}/hellopeter'
orig_file = f'{orig_data_path}/00-01_vodacom_selected_reviews.parquet.gz'

intent_path = f'{data_path}/multiclass_model'
intent_extract_file = f'{intent_path}/02-01_intents.parquet.gz'
intent_file = f'{intent_path}/02-01_flat_intents.parquet.gz'

### Data Lineage

1. The original dataset is a collection of publicly accessible customer reviews/complaints scraped from the [Hellopeter](https://www.hellopeter.com/) site between 2021 and 2023. This dataset was [created](https://github.com/JohnnyFoulds/dsm050-2023-apr/blob/master/notebooks/01_hellopeter/01-01_retrieve_data.ipynb) in another research project: [Evaluating Customer Satisfaction and Preferences in the Telecommunications Industry: A Comparative Analysis of Survey Data and Online Reviews](https://github.com/JohnnyFoulds/dsm050-2023-apr/blob/master/notebooks/04_draft/04-04_cw02.ipynb).

2. Data selection was performed in the `00-01_data_selection` notebook based on the following criteria.
    - Reviews from the period **2022-06-01** to **2023-06-30** were selected.

    - Only reviews from the **Vodacom** telecommunications company were selected.

    - Very short and very long reviews were removed: only reviews of **10** to **100** words were kept. The word count was calculated with a basic `.str.split().str.len()`, which is sufficient for this purpose.

3. The unlabeled data was then labeled in the `01-02_batch_classification` notebook using Generative AI.
    - The **Mistral 7B v0.2** Large Language Model (LLM) was hosted on a local server with [Ollama](https://ollama.com/library/mistral). Please refer to the `01-01_classification_test` notebook for further details.

    - The classification was done using multiple prompts similar to Chain-of-Thought (CoT) techniques for classification. _Implementation details can be found in the `src` directory._

    - Classification was done based on the categories defined in `src/config/category_definitions.jsonl`.

    - It took an average of **7 seconds** to classify a single review.

4. The data labels were then converted into a format suitable for modeling in the `02-01_data_preperation` notebook.
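
The word-count criterion from step 2 can be sketched as follows. This is a minimal illustration rather than the code from the `00-01_data_selection` notebook: the function name `filter_by_word_count`, the choice of `review_content` as the counted column, the inclusive bounds, and the toy reviews are all assumptions.

```python
import pandas as pd


def filter_by_word_count(df, column="review_content", min_words=10, max_words=100):
    """Keep rows whose whitespace-separated word count lies in [min_words, max_words].

    Hypothetical helper: the column name and inclusive bounds are assumptions.
    """
    # Same basic word count described in the text: split on whitespace, count tokens.
    word_count = df[column].str.split().str.len()
    # Series.between is inclusive on both ends by default.
    return df[word_count.between(min_words, max_words)]


# Toy reviews (invented for illustration): too short, in range, too long.
demo = pd.DataFrame({
    "id": [1, 2, 3],
    "review_content": [
        "Too short.",
        "My data bundle disappeared overnight and nobody at the call centre could explain why.",
        " ".join(["word"] * 150),
    ],
})

selected = filter_by_word_count(demo)
print(selected["id"].tolist())
```

Only the second toy review falls inside the 10 to 100 word window, so it is the only row kept.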

### Data Structure

#### Source Data

The following shows a sample customer review from the original source data.

In [7]:
df_source = pd.read_parquet(orig_file)
with pd.option_context('display.max_colwidth', None):
    display(
        df_source[df_source.id == 3950575]
    )

Unnamed: 0,id,review_title,review_content
5215,3950575,Vodacom is useless!!!,"Good day\n\nAgain, vodacom did not do their jobs. The amount went off as I explicitly asked for it not to. Vodacom now owes me R300 as it has been debited from my account twice now. I will be taking this to social media now. And I want to please cancel all my contracts with vodacom."


#### Data Labels

A sample of the labels extracted using the LLM is shown below. The LLM extracts multiple labels for each review and generates a reason for each label to support human verification.

In [8]:
df_intents_extracted = pd.read_parquet(intent_extract_file)
with pd.option_context('display.max_colwidth', None):
    display(
        df_intents_extracted[df_intents_extracted.id == 3950575]
    )

Unnamed: 0,category,reason,relevance,sentiment,id
29,Billing,The text mentions that an amount was debited from the account twice.,1.0,negative,3950575
30,Cancellation,The text expresses the intent to cancel all contracts with Vodacom due to the billing issue.,1.0,negative,3950575
31,Customer's Feeling,The text contains a negative sentiment towards Vodacom.,0.5,negative,3950575
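
Rows in this long format (one row per label) can be collapsed into one row per review with a group-by. The sketch below is one possible way to do that, not necessarily what the `02-01_data_preperation` notebook does: the function name `flatten_intents` and the output column names `category_list`, `relevance_list`, and `sentiment_list` are assumptions, and the demo frame simply repeats the sample rows shown above.

```python
import pandas as pd

# Demo frame copied from the sample output above (reason column omitted).
df_long_demo = pd.DataFrame({
    "id": [3950575, 3950575, 3950575],
    "category": ["Billing", "Cancellation", "Customer's Feeling"],
    "relevance": [1.0, 1.0, 0.5],
    "sentiment": ["negative", "negative", "negative"],
})


def flatten_intents(df_long):
    """Collapse one-row-per-label data into one row per review id (hypothetical helper)."""
    return (
        df_long.groupby("id", sort=False)
        .agg(
            category_list=("category", list),    # labels in their original row order
            relevance_list=("relevance", list),  # aligned with category_list
            sentiment_list=("sentiment", list),  # aligned with category_list
        )
        .reset_index()
    )


flat = flatten_intents(df_long_demo)
print(flat)
```

Because every list is built from the same group in the same row order, position `i` in each list refers to the same label.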


#### Prepared Data Labels

The labels prepared for multi-label classification are shown below, with one row per review and the categories, relevance scores, and sentiments collected into parallel lists.

This data will need to be combined with the `review_title` and `review_content` from the original source data to create the final dataset for modeling.

In [9]:
df_intents = pd.read_parquet(intent_file)
with pd.option_context('display.max_colwidth', None):
    display(
        df_intents[df_intents.id == 3950575]
    )

Unnamed: 0,id,category_list,relevance_list,sentiment_list
2,3950575,"[Billing, Cancellation, Customer's Feeling]","[1.0, 1.0, 0.5]","[negative, negative, negative]"
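
Combining the prepared labels with the review text, as described above, might look like the following sketch. It joins on `id`, concatenates title and content into a single `text` column, and encodes the category lists as multi-hot vectors. The helper names `build_model_frame` and `multi_hot`, the `text` column, the newline separator, and the second toy review (id `1`) are assumptions made for illustration. Keras also provides `StringLookup(output_mode="multi_hot")` for the encoding step; plain Python is used here to keep the example small.

```python
import pandas as pd

# Toy stand-ins for df_source and df_intents. The first row abbreviates the
# sample review shown earlier; the second row (id 1) is invented.
df_source_demo = pd.DataFrame({
    "id": [3950575, 1],
    "review_title": ["Vodacom is useless!!!", "Double debit"],
    "review_content": [
        "I want to please cancel all my contracts with vodacom.",
        "My account was debited twice this month.",
    ],
})
df_intents_demo = pd.DataFrame({
    "id": [3950575, 1],
    "category_list": [["Billing", "Cancellation", "Customer's Feeling"], ["Billing"]],
})


def build_model_frame(df_source, df_intents):
    """Join labels to review text on id (hypothetical helper)."""
    text_cols = df_source[["id", "review_title", "review_content"]]
    df = df_intents.merge(text_cols, on="id", how="inner")
    # Assumption: title and content are fed to the model as a single string.
    df["text"] = df["review_title"].str.strip() + "\n" + df["review_content"].str.strip()
    return df[["id", "text", "category_list"]]


def multi_hot(category_lists):
    """Encode a Series of label lists as a 0/1 DataFrame, one column per class."""
    classes = sorted({c for cats in category_lists for c in cats})
    rows = [[int(c in set(cats)) for c in classes] for cats in category_lists]
    return pd.DataFrame(rows, index=category_lists.index, columns=classes)


frame = build_model_frame(df_source_demo, df_intents_demo)
labels = multi_hot(frame["category_list"])
print(labels)
```

An inner join drops any review that has no labels; whether that is the desired behavior depends on how unlabeled reviews should be treated during training.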
