# 01-02 : Batch Classification

In this notebook we use an LLM (as shown in `01-01_classification_test.ipynb`) to batch-classify the customer reviews selected in `00-01_data_preparation.ipynb`.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys

sys.path.append(os.path.abspath("../../src"))

In [3]:
import pandas as pd
from pprint import pprint
from IPython.display import display, Markdown
from tqdm.notebook import tqdm

from langchain_core.language_models.llms import BaseLLM
from langchain_community.llms import Ollama
import service.classification as classification

In [4]:
data_path = '../../data'
input_path = f'{data_path}/hellopeter'
output_path = f'{data_path}/intent_extraction'

input_file = f'{input_path}/00-01_vodacom_selected_reviews.parquet.gz'

## Load Data

In [5]:
df_input = pd.read_parquet(input_file)

print(df_input.shape)
with pd.option_context('display.max_colwidth', None):
    display(df_input.sample(3))

(5218, 3)


Unnamed: 0,id,review_title,review_content
3869,4086526,Worst Internet provider in the World!!!!!,Worst Internet provider in the World!!!!! Service is non existing!!!
2239,4237618,Poor service delivery,"No internet during load shedding or load reduction, insufficient back up system,poor service in general. No feedback to customers"
4617,4008434,I am done with vodacom,Worst network ever from call centre to back office


## Create LLM 

In [6]:
# create the client
llm = Ollama(
    model="mistral",
    top_p=0.001,
    temperature=0.001,
    num_predict=512)

In [7]:
# ensure that the client is working
review = df_input.iloc[2588] 
review_content = "**%s**\n\n%s" % (
    review['review_title'], 
    review['review_content'])

display(Markdown("## Review"))
display(Markdown(review_content))

display(Markdown("## Response"))
display(Markdown(llm.invoke(review_content)))

## Review

**Poor signal. Worse service.**

I was incorrectly billed R900 for Summer VC I did not purchase.

Two seperate official complaints laid...the last one 10 days ago and no one response

Not acceptable.

## Response

 I'm sorry to hear that you have experienced both a billing error and poor customer service in regards to your complaint about being incorrectly billed R900 for a Summer VC that you did not purchase. I understand how frustrating it can be to not receive a response after laying official complaints, especially when the issue is important to you.

I would suggest reaching out to the company's customer support team once again, this time through a different channel such as social media or email if phone support has been unresponsive. Be sure to include all relevant details of your complaint and any previous correspondence with their team. You may also want to consider escalating the issue to a supervisor or manager if necessary.

Additionally, you can file a complaint with the relevant consumer protection agency in your country for further investigation if the company continues to ignore your concerns. Remember to keep records of all communication with the company and any related documents, as this will be helpful during the investigation process.

I hope that these suggestions help resolve the issue and improve your experience with the company's customer service. If you have any other questions or need further assistance, please don't hesitate to ask.

## Functions

### Classification

In [8]:
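def classify_review(id:str, title:str, content:str, llm:BaseLLM) -> pd.DataFrame:
    """Classify one review; return one row per detected category (empty if there is no response)."""
    result = pd.DataFrame()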

    # combine the title and content
    review_content = f"**{title}**\n\n{content}"

    # classify the review
    response = classification.get_classification(
        text=review_content,
        llm=llm)
    
    # convert the response to a dataframe
    if response is not None:
        result = pd.DataFrame(response.dict()["categories"])
        result['id'] = id
    
    return result

## test the function
classify_review(
    id=review['id'],
    title=review['review_title'],
    content=review['review_content'],
    llm=llm
)

Unnamed: 0,category,reason,relevance,sentiment,id
0,Network Coverage,The text mentions 'Poor signal' and 'worse ser...,0.6,negative,4208159
1,Billing,The text mentions 'incorrectly billed R900 for...,0.6,negative,4208159
2,Response,The text mentions 'no one response' to officia...,0.4,neutral,4208159
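The conversion step at the end of `classify_review` can be illustrated without calling the model. The sketch below assumes the response serializes to a dictionary with a `categories` list whose entries carry the four fields seen in the output above (`category`, `reason`, `relevance`, `sentiment`). The `flatten_categories` helper and the `sample_response` values are hypothetical and are not part of `service.classification`.

```python
def flatten_categories(response_dict, review_id):
    """Return one row (dict) per category, tagged with the review id."""
    # hypothetical helper: mirrors pd.DataFrame(response.dict()["categories"])
    # followed by result['id'] = id, but with plain Python dicts
    return [{**category, "id": review_id}
            for category in response_dict.get("categories", [])]

# hypothetical payload; field names are inferred from the output columns
sample_response = {
    "categories": [
        {"category": "Billing", "reason": "Mentions an incorrect charge",
         "relevance": 0.6, "sentiment": "negative"},
        {"category": "Response", "reason": "Mentions an unanswered complaint",
         "relevance": 0.4, "sentiment": "neutral"},
    ]
}

rows = flatten_categories(sample_response, review_id=4208159)
for row in rows:
    print(row["id"], row["category"], row["sentiment"])
```

Passing `rows` to `pd.DataFrame` would give the same column layout as the table above.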


### Process Batch

In [9]:
def process_batch(df, batch_num, start_index, llm:BaseLLM, batch_size=32):
    """Classify one batch of reviews, save it to `output_path`, and return the next start index."""
    # select the rows in the batch (the final batch may be shorter than batch_size)
    end_index = start_index + batch_size
    batch = df.iloc[start_index:end_index]

    # perform the classifications, collecting one dataframe per review
    frames = []
    for _, row in tqdm(batch.iterrows(), total=len(batch)):
        frames.append(
            classify_review(
                id=row['id'],
                title=row['review_title'],
                content=row['review_content'],
                llm=llm))

    # combine once at the end instead of concatenating inside the loop
    df_result = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

    # save the classifications
    df_result.to_parquet(
        f'{output_path}/batch_{batch_num:05}.parquet.gz',
        index=False,
        compression='gzip')

    return end_index

## test the function
start_index = process_batch(df_input, 1, 0, llm, 5)
pd.read_parquet(f'{output_path}/batch_00001.parquet.gz')

Unnamed: 0,category,reason,relevance,sentiment,id
0,Customer's Feeling,The customer expresses their dissatisfaction w...,1.0,negative,4492164
1,Response,The customer mentions their frustration with n...,1.0,negative,4492164
2,Devices,The text mentions a problem with the cellphone...,1.0,negative,4492020
3,Resolution,The text implies that the customer has reached...,1.0,negative,4492020
4,Network Coverage,The customer mentions having a home WiFi route...,1.0,negative,4491829
5,Customer Feeling,The customer expresses their frustration and d...,1.0,negative,4491829
6,Billing,The text mentions a query related to the Vodac...,1.0,negative,4491788
7,Customer's Feeling,The text expresses frustration with Vodacom's ...,1.0,negative,4491788
8,Cancellation,The text mentions trying to cancel a business ...,1.0,negative,4491706
9,Billing,The text mentions not receiving an invoice or ...,1.0,neutral,4491706
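The sample output above contains both `Customer's Feeling` and `Customer Feeling`, so the model does not always spell a category the same way. A minimal sketch of one way to collapse such variants before aggregating is shown below. The `normalize_category` helper is hypothetical, and its rules (drop a possessive `'s`, tidy whitespace, title-case) are assumptions rather than part of the notebook's pipeline.

```python
import re

def normalize_category(label):
    """Collapse spelling variants of a category label to one canonical form."""
    # drop a possessive "'s" (straight or curly apostrophe)
    cleaned = re.sub(r"['’]s\b", "", label)
    # squeeze repeated whitespace and trim the ends
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # use title case so "billing" and "Billing" compare equal
    return cleaned.title()

labels = ["Customer's Feeling", "Customer Feeling", "Network Coverage", "billing "]
print(sorted({normalize_category(label) for label in labels}))
```

On a dataframe of results, the same function could be applied with `df['category'].map(normalize_category)`.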


### Process Data

In [10]:

def process_data(df, llm:BaseLLM, start_batch_num=0, batch_size=32):
    """Classify `df` batch by batch, starting from batch number `start_batch_num`."""
    # resume from the first row of the requested batch
    start_index = start_batch_num * batch_size
    batch_num = start_batch_num

    while start_index < len(df):
        try:
            # process_batch saves the batch and returns the next start index
            start_index = process_batch(df, batch_num, start_index, llm, batch_size)
            batch_num += 1
        except Exception:
            # report the failing batch so the run can be resumed from it
            print(f'Error processing batch {batch_num}')
            raise

# classify the selected reviews; pass start_batch_num to resume from a later batch
process_data(df_input, llm)
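Because `process_data` accepts a `start_batch_num`, an interrupted run can be resumed once the last saved batch is known. The sketch below is one possible way to derive that number from the batch files on disk. The helpers `next_batch_number` and `resume_position` are hypothetical. They assume files named `batch_<number>.parquet.gz`, and that every saved file was written by `process_data`, so batch `n` starts at row `n * batch_size`.

```python
import re
from pathlib import Path

# matches names such as batch_00012.parquet.gz or batch_12.parquet.gz
BATCH_FILE_PATTERN = re.compile(r"^batch_(\d+)\.parquet\.gz$")

def next_batch_number(output_dir):
    """Return the batch number that follows the highest one already saved."""
    directory = Path(output_dir)
    if not directory.is_dir():
        return 0  # nothing saved yet

    numbers = []
    for path in directory.glob("batch_*.parquet.gz"):
        match = BATCH_FILE_PATTERN.match(path.name)
        if match:
            numbers.append(int(match.group(1)))

    # with no batch files present, start from batch 0
    return max(numbers) + 1 if numbers else 0

def resume_position(output_dir, batch_size=32):
    """Return (batch_num, start_index) for the first batch not yet saved."""
    batch_num = next_batch_number(output_dir)
    return batch_num, batch_num * batch_size

# the directory below mirrors output_path from the notebook
batch_num, start_index = resume_position("../../data/intent_extraction")
print(f"resume at batch {batch_num}, row {start_index}")
```

The returned batch number could then be passed as `start_batch_num`. A batch that was interrupted before its file was written is simply reprocessed, since only completed batches appear on disk.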