# 02-01 : Data Preparation

Use the dataset created in `01-03_cleanup.ipynb` to build a new dataset suitable for training a Keras multi-label text classification model.

## References

- [Large-scale multi-label text classification](https://keras.io/examples/nlp/multi_label_classification/)

In [1]:
import glob
import pandas as pd

In [2]:
data_path = '../../data'
orig_data_path = f'{data_path}/hellopeter'
orig_file = f'{orig_data_path}/00-01_vodacom_selected_reviews.parquet.gz'

class_data_path = f'{data_path}/multiclass_model'
class_data_file = f'{class_data_path}/01-03_intents.parquet.gz'

output_path = f'{data_path}/multiclass_model'
output_file = f'{output_path}/02-01_flat_intents.parquet.gz'

## Load Data

Load the original dataset selected in `00-01_data_preparation.ipynb`.

In [3]:
df_orig = pd.read_parquet(orig_file) \
    .sort_values('id')

print(df_orig.shape)
with pd.option_context('display.max_colwidth', None):
    display(df_orig.head(3))

(5218, 3)


| | id | review_title | review_content |
|---|---|---|---|
| 5217 | 3950516 | Vodacom fraudster | Vodacom is a scam! Never ever take, a contract with those people. I had a, contract ending end October. End August I called them and cancelled the contract. I was suprised to find myself at credit bureau while I was looking for a house bond. They didn't cancel my contract. I call them, the system shows I indeed cancel the contract but they don't know why t wasn't cancelled. They are taking me from pillar to post and my life is at a, standstill. Fraudsters |
| 5216 | 3950535 | bad service | still awating any feedback from vodacom legal department ant the email address of DCA Hammond Pole, so that I can forward him all the mails to vodacom that has not been responded by Vodacom, and as stated two times allready, I dont have my number any more so cant phone the DCA, the messages has also been ignored by Vodacom |
| 5215 | 3950575 | Vodacom is useless!!! | Good day\n\nAgain, vodacom did not do their jobs. The amount went off as I explicitly asked for it not to. Vodacom now owes me R300 as it has been debited from my account twice now. I will be taking this to social media now. And I want to please cancel all my contracts with vodacom. |


Load the classifications generated with the Large Language Model (LLM).

In [4]:
df_class = pd.read_parquet(class_data_file) \
    .sort_values(['id', 'category'])

print(df_class.shape)
with pd.option_context('display.max_colwidth', None):
    display(df_class.head(6))

(10327, 5)


| | category | reason | relevance | sentiment | id |
|---|---|---|---|---|---|
| 33 | Cancellation | The text mentions 'cancelled the contract' and 'they didn't cancel my contract'. | 1.0 | negative | 3950516 |
| 34 | Policy | The text uses the term 'scam' and 'fraudsters' to describe Vodacom. | 1.0 | negative | 3950516 |
| 32 | Response | The text describes the customer's frustration with not receiving any feedback or response from Vodacom's legal department despite following up multiple times. | 1.0 | negative | 3950535 |
| 29 | Billing | The text mentions that an amount was debited from the account twice. | 1.0 | negative | 3950575 |
| 30 | Cancellation | The text expresses the intent to cancel all contracts with Vodacom due to the billing issue. | 1.0 | negative | 3950575 |
| 31 | Customer's Feeling | The text contains a negative sentiment towards Vodacom. | 0.5 | negative | 3950575 |


## Flatten Classifications

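Flatten the classifications dataset so that each id has a single row, with its categories, relevance scores, and sentiments collected into list columns. Because all three lists are built from the same rows in the same order, position `i` in each list refers to the same classification.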

In [5]:
# Sort so the labels inside each list are in a stable (alphabetical) order,
# then collect the per-id values into list columns with named aggregation.
# groupby sorts by 'id' by default, so no further sorting is needed.
df_flatten = (
    df_class
    .sort_values(['id', 'category'])
    .groupby('id', as_index=False)
    .agg(
        category_list=('category', list),
        relevance_list=('relevance', list),
        sentiment_list=('sentiment', list),
    )
)

print(df_flatten.shape)
with pd.option_context('display.max_colwidth', None):
    display(df_flatten.head(5))

(5022, 4)


| | id | category_list | relevance_list | sentiment_list |
|---|---|---|---|---|
| 0 | 3950516 | [Cancellation, Policy] | [1.0, 1.0] | [negative, negative] |
| 1 | 3950535 | [Response] | [1.0] | [negative] |
| 2 | 3950575 | [Billing, Cancellation, Customer's Feeling] | [1.0, 1.0, 0.5] | [negative, negative, negative] |
| 3 | 3950595 | [Call Center, Customer's Feeling] | [1.0, 1.0] | [negative, negative] |
| 4 | 3950626 | [Billing, Policy] | [1.0, 1.0] | [negative, negative] |
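The flattened frame has 5022 rows while the original dataset has 5218, which suggests that some reviews have no classification at all. The following is a minimal sketch of how that gap could be inspected. It uses small made-up frames in place of `df_orig` and `df_flatten`; the toy values and the `missing_ids` name are illustrative assumptions, and it assumes `id` is unique in the original data.

```python
import pandas as pd

# Toy stand-ins for df_orig and df_flatten (values are illustrative only).
toy_orig = pd.DataFrame({'id': [1, 2, 3, 4]})
toy_flatten = pd.DataFrame({
    'id': [1, 2, 4],
    'category_list': [['Billing'], ['Policy', 'Response'], ['Devices']],
})

# Ids present in the original data but absent from the flattened classifications.
is_missing = ~toy_orig['id'].isin(toy_flatten['id'])
missing_ids = toy_orig.loc[is_missing, 'id'].tolist()
print(missing_ids)
```

Applied to the real frames, the same `isin` check would list the reviews that will not appear in the training data.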


In [6]:
df_flatten.category_list.value_counts().head(20)

category_list
[Customer's Feeling, Network Coverage]      373
[Billing, Customer's Feeling]               309
[Billing, Cancellation]                     232
[Call Center, Customer's Feeling]           201
[Cancellation, Customer's Feeling]          156
[Billing, Policy]                           133
[Account Management, Customer's Feeling]    129
[Customer's Feeling, Staff Level]           105
[Customer's Feeling, Response]              105
[Customer's Feeling, Devices]                97
[Network Coverage, Resolution]               96
[Cancellation, Policy]                       92
[Account Management, Billing]                92
[Customer's Feeling, Policy]                 89
[Customer's Feeling, Resolution]             87
[Billing, Response]                          77
[Network Coverage, Response]                 76
[Billing, Resolution]                        69
[Call Center, Network Coverage]              68
[Billing, Call Center]                       61
Name: count, dtype: int64
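The counts above are per label combination. To see how often each individual label occurs, the list column can be exploded into one row per label before counting. A minimal sketch on a toy frame (the data and the `label_counts` name are illustrative assumptions, not part of the original notebook):

```python
import pandas as pd

# Toy stand-in for df_flatten (values are illustrative only).
toy_flatten = pd.DataFrame({
    'id': [1, 2, 3],
    'category_list': [
        ['Billing', "Customer's Feeling"],
        ['Billing', 'Cancellation'],
        ["Customer's Feeling", 'Network Coverage'],
    ],
})

# explode() yields one row per label; value_counts() then counts each label on its own.
label_counts = toy_flatten['category_list'].explode().value_counts()
print(label_counts.to_dict())
```

Per-label counts like these can help spot rare labels that may be hard for a classifier to learn.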

## Save Dataset

Save the flattened dataset for use in model building.

In [7]:
df_flatten.to_parquet(output_file, compression='gzip')
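A multi-label model usually expects each label list as a fixed-length multi-hot vector, so the `category_list` column will need to be encoded during model building. The sketch below shows the idea in plain Python on toy data. The `vocab`, `index`, and `multi_hot` names are hypothetical helpers introduced for illustration; the actual model notebook may use a Keras preprocessing layer instead.

```python
# Toy label lists standing in for the category_list column (illustrative values).
category_lists = [
    ['Cancellation', 'Policy'],
    ['Response'],
    ['Billing', 'Cancellation'],
]

# Hypothetical vocabulary: every distinct label, sorted for a stable column order.
vocab = sorted({label for labels in category_lists for label in labels})
index = {label: i for i, label in enumerate(vocab)}


def multi_hot(labels):
    """Return a 0/1 vector with a 1 at the position of each label present."""
    vector = [0] * len(vocab)
    for label in labels:
        vector[index[label]] = 1
    return vector


encoded = [multi_hot(labels) for labels in category_lists]
print(vocab)
print(encoded)
```

Each row of `encoded` has one position per label in `vocab`, which is the target shape a sigmoid output layer with one unit per label would be trained against.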