# 01-02 : Batch Classification

In this notebook we use an LLM (as shown in `01-01_classification_test.ipynb`) to classify, in batches, the customer review dataset selected in `00-01_data_preparation.ipynb`.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys

sys.path.append(os.path.abspath("../../src"))

In [3]:
import pandas as pd
from pprint import pprint
from IPython.display import display, Markdown
from tqdm.notebook import tqdm

from langchain_core.language_models.llms import BaseLLM
from langchain_community.llms import Ollama
import service.classification as classification

In [4]:
data_path = '../../data'
input_path = f'{data_path}/hellopeter'
output_path = f'{data_path}/intent_extraction'

input_file = f'{input_path}/00-01_vodacom_selected_reviews.parquet.gz'

## Load Data

In [5]:
df_input = pd.read_parquet(input_file)

print(df_input.shape)
with pd.option_context('display.max_colwidth', None):
    display(df_input.sample(3))

(5218, 3)


Unnamed: 0,id,review_title,review_content
1787,4279243,Bad service,Really bad service we want to do a sim swop and they asked for the last 5 number. No one will remember who phoned Last or who we phoned. Now the don't want to do a Sim swop this is really pathetic. We got proof of residence and id now my number and still no service the consultant sits on her phone and ignores us.
3272,4146372,Poor service,I cannot receive sms or notifications from institutions and banks etc... I went to my 2 banks they said everything is fine from their side Vodacom walk in centers checked and they advise me to call the relevant department or customer care i phone them and they phone me back I explain that i cannot receive sms's only from person to person that's all . They lady said they going to escalate it but up until now its a no show i am disappointed :(
3843,4088769,Vodacom,"Worst experience ever, fiber off almost a week, and all they can say is we will escalate, I don't have words 😔 \nIf you ask for supervisor, they just tell you he'll come back to you, but no never one call back \nGood luck if jou join them"


## Create LLM 

In [6]:
# create the client
llm = Ollama(
    model="mistral",
    top_p=0.001,
    temperature=0.001,
    num_predict=512)

In [7]:
# ensure that the client is working
review = df_input.iloc[2588] 
review_content = "**%s**\n\n%s" % (
    review['review_title'], 
    review['review_content'])

display(Markdown("## Review"))
display(Markdown(review_content))

display(Markdown("## Response"))
display(Markdown(llm.invoke(review_content)))

## Review

**Poor signal. Worse service.**

I was incorrectly billed R900 for Summer VC I did not purchase.

Two seperate official complaints laid...the last one 10 days ago and no one response

Not acceptable.

## Response

 I'm sorry to hear that you have experienced both a billing error and poor customer service in regards to your complaint about being incorrectly billed R900 for a Summer VC that you did not purchase. I understand how frustrating it can be to not receive a response after laying official complaints, especially when the issue is important to you.

I would suggest reaching out to the company's customer support team once again, this time through a different channel such as social media or email if phone support has been unresponsive. Be sure to include all relevant details of your complaint and any previous correspondence with their team. You may also want to consider escalating the issue to a supervisor or manager if necessary.

Additionally, you can file a complaint with the relevant consumer protection agency in your country for further investigation if the company continues to ignore your concerns. Remember to keep records of all communication with the company and any related documents, as this will be helpful during the investigation process.

I hope that these suggestions help resolve the issue and improve your experience with the company's customer service. If you have any other questions or need further assistance, please don't hesitate to ask.

## Functions

### Classification

In [8]:
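def classify_review(id:str, title:str, content:str, llm:BaseLLM):
    """Classify one review: one row per detected category, or an empty DataFrame when no classification is returned."""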
    result = pd.DataFrame()

    # combine the title and content
    review_content = f"**{title}**\n\n{content}"

    # classify the review
    response = classification.get_classification(
        text=review_content,
        llm=llm)
    
    # convert the response to a dataframe
    if response is not None:
        result = pd.DataFrame(response.dict()["categories"])
        result['id'] = id
    
    return result

## test the function
classify_review(
    id=review['id'],
    title=review['review_title'],
    content=review['review_content'],
    llm=llm
)

Unnamed: 0,category,reason,relevance,sentiment,id
0,Network Coverage,The text mentions 'Poor signal' and 'worse ser...,0.6,negative,4208159
1,Billing,The text mentions 'incorrectly billed R900 for...,0.6,negative,4208159
2,Response,The text mentions 'no one response' to officia...,0.4,neutral,4208159


### Process Batch

In [9]:
def process_batch(df, batch_num:int, start_index:int, llm:BaseLLM, batch_size:int=32):
    # output_path is only read here, so no `global` declaration is needed
    df_result = pd.DataFrame()

    # select the rows in the batch
    end_index = start_index + batch_size
    batch = df.iloc[start_index:end_index]

    # perform the classifications; the final batch may hold fewer than batch_size rows
    for index, row in tqdm(batch.iterrows(), total=len(batch), leave=False):
        df_result = pd.concat([
            df_result,
            classify_review(
                id=row['id'],
                title=row['review_title'],
                content=row['review_content'],
                llm=llm)
        ])

    # save the classifications    
    df_result.to_parquet(
        f'{output_path}/batch_{batch_num+1:05}.parquet.gz',
        index=False,
        compression='gzip')

    return end_index

## test the function
start_index = process_batch(df_input, 0, 0, llm, 5)
pd.read_parquet(f'{output_path}/batch_00001.parquet.gz')

Unnamed: 0,category,reason,relevance,sentiment,id
0,Customer's Feeling,The customer expresses their dissatisfaction w...,1.0,negative,4492164
1,Response,The customer mentions their frustration with n...,1.0,negative,4492164
2,Devices,The text mentions a problem with the cellphone...,1.0,negative,4492020
3,Resolution,The text implies that the customer has reached...,1.0,negative,4492020
4,Network Coverage,The customer mentions having a home WiFi route...,1.0,negative,4491829
5,Customer Feeling,The customer expresses their frustration and d...,1.0,negative,4491829
6,Billing,The text mentions a query related to the Vodac...,1.0,negative,4491788
7,Customer's Feeling,The text expresses frustration with Vodacom's ...,1.0,negative,4491788
8,Cancellation,The text mentions trying to cancel a business ...,1.0,negative,4491706
9,Billing,The text mentions not receiving an invoice or ...,1.0,neutral,4491706
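The output above contains two spellings of what looks like the same category, `Customer's Feeling` and `Customer Feeling`. A minimal sketch of one way to collapse such variants before aggregating results follows. The alias table `CANONICAL` and the helper `normalize_category` are hypothetical, and the choice of `Customer's Feeling` as the canonical spelling is an assumption rather than something defined by `service.classification`.

```python
import re

# Hypothetical alias table: simplified lookup key -> canonical label.
# The canonical spelling chosen here is an assumption.
CANONICAL = {
    "customer feeling": "Customer's Feeling",
}

def normalize_category(label: str) -> str:
    """Collapse near-duplicate category labels onto one canonical spelling."""
    cleaned = label.strip()
    # build the lookup key: drop a possessive 's, collapse whitespace, lower-case
    key = re.sub(r"'s\b", "", cleaned)
    key = re.sub(r"\s+", " ", key).lower()
    # labels without a known alias are returned unchanged (apart from stripping)
    return CANONICAL.get(key, cleaned)

print(normalize_category("Customer Feeling"))
print(normalize_category("Billing"))
```

On a loaded batch this could be applied with `df['category'] = df['category'].map(normalize_category)`.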


### Process Data

In [10]:
def process_data(df, llm:BaseLLM, start_batch_num:int=0, batch_size:int=32):
    start_index = start_batch_num * batch_size
    batch_num = start_batch_num

    # note: floor division leaves a final partial batch out of the progress-bar total
    pbar = tqdm(total=len(df)//batch_size)
    while start_index < len(df):
        try:
            start_index = process_batch(
                df=df, 
                batch_num=batch_num, 
                start_index=start_index, 
                llm=llm, 
                batch_size=batch_size)
            batch_num += 1
            pbar.update(1)
        except Exception:
            print(f'Error processing batch {batch_num}')
            # a bare raise re-raises the active exception with its original traceback
            raise
        
    pbar.close()

## test the function
df_test = df_input.sample(6)
process_data(df=df_test, llm=llm, batch_size=2)

display(pd.read_parquet(f'{output_path}/batch_00001.parquet.gz'))
display(pd.read_parquet(f'{output_path}/batch_00002.parquet.gz'))
display(pd.read_parquet(f'{output_path}/batch_00003.parquet.gz'))

Unnamed: 0,category,reason,relevance,sentiment,id
0,Policy,The text mentions 'company rules' which is a p...,1.0,negative,4057328
1,Customer's Feeling,'They are not willing to assist me in any way'...,1.0,negative,4057328
2,Policy,The text mentions Vodacom's unfair business pr...,1.0,negative,4161638


Unnamed: 0,category,reason,relevance,sentiment,id
0,Billing,The text mentions that the customer settled an...,1.0,negative,4352227
1,Account Management,The text mentions that the customer has contac...,1.0,negative,4352227
2,Billing,The text mentions 'they have been charging me ...,1.0,negative,4265747
3,Customer's Feeling,The text contains the phrase 'Vodacom service ...,1.0,negative,4265747


Unnamed: 0,category,reason,relevance,sentiment,id
0,Billing,The text expresses dissatisfaction with Vodaco...,1.0,negative,4011053
1,Customer's Feeling,The text includes a statement about how the cu...,1.0,negative,4011053
2,Devices,"The text mentions a specific device, V Tech wa...",1.0,neutral,4305303
3,Account Management,The text discusses an issue with syncing the w...,1.0,negative,4305303
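The index arithmetic behind `process_data` and `process_batch` can be checked in isolation. The sketch below uses a hypothetical helper, `batch_ranges`, that is not part of the notebook. It lists the row range each batch covers and shows that a dataset whose length is not a multiple of `batch_size` ends with one extra, shorter batch.

```python
import math

def batch_ranges(n_rows: int, batch_size: int, start_batch_num: int = 0):
    """Yield (batch_num, start_index, end_index) for each remaining batch."""
    n_batches = math.ceil(n_rows / batch_size)
    for batch_num in range(start_batch_num, n_batches):
        start = batch_num * batch_size
        # clamp the end so the final batch stops at the last row
        yield batch_num, start, min(start + batch_size, n_rows)

# 5218 rows (the shape of df_input) in batches of 100
ranges = list(batch_ranges(5218, 100))
print(len(ranges))   # 53: 52 full batches plus one partial batch
print(ranges[-1])    # (52, 5200, 5218): the last batch holds 18 rows
```

For the six-row test above with `batch_size=2` the same helper gives exactly three batches, matching the three output files.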


## Classify Dataset

Classify the input dataset in batches.
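Because `process_batch` writes a file only after every review in the batch has been classified, an interrupted run can in principle be resumed through the `start_batch_num` parameter of `process_data`. The following is a sketch of how the resume point might be derived from the files already on disk. The helper `next_batch_num` is hypothetical, and it assumes the output directory holds only files from the current run, written with the same `batch_size`.

```python
import os
import re

def next_batch_num(output_dir: str) -> int:
    """Return the zero-based batch number to resume from."""
    pattern = re.compile(r"^batch_(\d{5})\.parquet\.gz$")
    done = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        if match:
            done.append(int(match.group(1)))
    # files are numbered from 1 (batch_num + 1), so the highest file number
    # equals the zero-based number of the next batch to process
    return max(done, default=0)
```

A resumed run could then be started with `process_data(df=df_input, llm=llm, start_batch_num=next_batch_num(output_path), batch_size=100)`. Files left over from the small test runs above would need to be removed first, since they use the same naming scheme.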

In [11]:
process_data(df=df_input, llm=llm, batch_size=100)

  0%|          | 0/52 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

Value Error: Failed to parse SentimentCategories from completion {'properties': {'categories': [{'category': 'Price Plans', 'reason': "The text mentions a 'price increase' of 20% that was not communicated to the customer.", 'relevance': 1.0, 'sentiment': 'negative'}, {'category': "Customer's Feeling", 'reason': "The text expresses the customer's negative emotions towards the situation, such as feeling unfairly treated and powerless to prevent future price increases.", 'relevance': 1.0, 'sentiment': 'negative'}]}, 'required': ['categories']}. Got: 1 validation error for SentimentCategories
categories
  field required (type=value_error.missing)
Value Error: Failed to parse SentimentCategories from completion {'properties': {'categories': [{'category': 'Abuse', 'reason': "The text mentions unwanted calls from a Vodacom number, which falls under the 'Abuse' category definition.", 'relevance': 1.0, 'sentiment': 'negative'}, {'category': 'Policy', 'reason': "The text mentions a 'Vodacom prom

  0%|          | 0/100 [00:00<?, ?it/s]

Value Error: Failed to parse SentimentCategories from completion {'properties': {'categories': [{'category': 'Staff Level', 'reason': 'The text expresses gratitude and positive sentiment towards a specific Vodacom staff member, Jabu Manqele.', 'relevance': 1.0, 'sentiment': 'positive'}, {'category': "Customer's Feeling", 'reason': 'The text expresses positive feelings towards a Vodacom staff member and the interaction with them.', 'relevance': 1.0, 'sentiment': 'positive'}]}, 'required': ['categories']}. Got: 1 validation error for SentimentCategories
categories
  field required (type=value_error.missing)
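In the failed completions above the model appears to echo the JSON schema wrapper: the `categories` list is nested under a `properties` key instead of sitting at the top level, so validation reports the field as missing. One possible repair step, sketched here with a hypothetical helper `unwrap_schema_echo`, lifts the list back to the expected shape before validation. Where such a step would hook into `service.classification.get_classification` is not shown in this notebook, so that placement is an assumption.

```python
def unwrap_schema_echo(payload: dict) -> dict:
    """Return the payload in the flat {'categories': [...]} shape when possible."""
    # already in the expected shape: nothing to do
    if "categories" in payload:
        return payload
    # schema-echo shape: {'properties': {'categories': [...]}, 'required': [...]}
    properties = payload.get("properties")
    if isinstance(properties, dict) and "categories" in properties:
        return {"categories": properties["categories"]}
    # unknown shape: hand it back unchanged and let validation report the error
    return payload

echoed = {
    "properties": {"categories": [{"category": "Price Plans", "sentiment": "negative"}]},
    "required": ["categories"],
}
print(unwrap_schema_echo(echoed))
```

This only handles the one malformed shape visible in the log; other malformed completions would still fail validation.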
