# 01-02 : Batch Classification

In this notebook we use an LLM (as shown in `01-01_classification_test.ipynb`) to classify, in batches, the customer reviews selected in `00-01_data_preparation.ipynb`.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys

sys.path.append(os.path.abspath("../../src"))

In [3]:
import pandas as pd
from pprint import pprint
from IPython.display import display, Markdown
from tqdm.notebook import tqdm

from langchain_core.language_models.llms import BaseLLM
from langchain_community.llms import Ollama
import service.classification as classification

In [4]:
data_path = '../../data'
input_path = f'{data_path}/hellopeter'
output_path = f'{data_path}/intent_extraction'
os.makedirs(output_path, exist_ok=True)  # the batch files are written here

input_file = f'{input_path}/00-01_vodacom_selected_reviews.parquet.gz'

## Load Data

In [5]:
df_input = pd.read_parquet(input_file)

print(df_input.shape)
with pd.option_context('display.max_colwidth', None):
    display(df_input.sample(3))

(5218, 3)


| | id | review_title | review_content |
|---|---|---|---|
| 1155 | 4348368 | OPPO watches not available | The OPPO watches gets advertised in all your new booklets but everyone just shrugs their shoulders and say there are no stock!!!! |
| 3454 | 4125567 | Vodawhat? Power to who? | Why is it that when one calls **********4 and select speak to a consultant someone literally answers the call 9am-09h30 9 Oct 2022 and doesn't talk and because i kept trying i am now barred from the call-center. Is this a plot for client retentions when i just want to cancel something that doesn't suit my budget or a disgruntled employee that wants to avoid penalties for call abandonment so they answer and do not talk Zippoing the system? |
| 7 | 4491103 | Pathetic Service! | We cancelled a business contract and ask Vodacom to migrate the to prepaid. This was supposed to be done at the end of June. Vodacom did the migration before the time and I am now unable to use my cellphone number which is the only number my clients has to contact me on at the moment. I am unable to visit a Vodacom store today to Rica number. Which means that my clients are unable to contact me, when I called the Vodacom customer care line the consultant was rude and sarcastic. What pathetic service!!!! |


## Create LLM

In [6]:
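# create the client
# near-zero top_p and temperature keep the output as repeatable as possible;
# num_predict caps the length of each response at 512 tokens
llm = Ollama(
    model="mistral",
    top_p=0.001,
    temperature=0.001,
    num_predict=512)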

In [7]:
# smoke test: send one raw review to confirm that the client responds
review = df_input.iloc[2588]
review_content = f"**{review['review_title']}**\n\n{review['review_content']}"

display(Markdown("## Review"))
display(Markdown(review_content))

display(Markdown("## Response"))
display(Markdown(llm.invoke(review_content)))

## Review

**Poor signal. Worse service.**

I was incorrectly billed R900 for Summer VC I did not purchase.

Two seperate official complaints laid...the last one 10 days ago and no one response

Not acceptable.

## Response

 I'm sorry to hear that you have experienced both a billing error and poor customer service in regards to your complaint about being incorrectly billed R900 for a Summer VC that you did not purchase. I understand how frustrating it can be to not receive a response after laying official complaints, especially when the issue is important to you.

I would suggest reaching out to the company's customer support team once again, this time through a different channel such as social media or email if phone support has been unresponsive. Be sure to include all relevant details of your complaint and any previous correspondence with their team. You may also want to consider escalating the issue to a supervisor or manager if necessary.

Additionally, you can file a complaint with the relevant consumer protection agency in your country for further investigation if the company continues to ignore your concerns. Remember to keep records of all communication with the company and any related documents, as this will be helpful during the investigation process.

I hope that these suggestions help resolve the issue and improve your experience with the company's customer service. If you have any other questions or need further assistance, please don't hesitate to ask.

## Functions

### Classification

In [8]:
# # test problematic review
# problem_text = '**Vodacom ***** not taken seriously**\n\n@vodacom I CANT BELIEVE YOU GUYS DO NOT TAKE ***** SERIOUSLY... I have emailed **************** and called ********** & ********** phones are not being answered do not get any responses.FFS 🤬'
# display(Markdown(problem_text))
# display(Markdown(llm.invoke(problem_text)))

# print(classification.get_classification(problem_text, llm))

In [9]:
def classify_review(id:str, title:str, content:str, llm:BaseLLM):
    """Classify one review and return its categories as a dataframe.

    Returns an empty dataframe when no classification is available.
    """
    result = pd.DataFrame()

    # combine the title and content
    review_content = f"**{title}**\n\n{content}"

    # classify the review
    response = classification.get_classification(
        text=review_content,
        llm=llm)

    # convert the response to a dataframe and tag each row with the review id
    if response is not None:
        result = pd.DataFrame(response.dict()["categories"])
        result['id'] = id

    return result

# ## test the function
# classify_review(
#     id=review['id'],
#     title=review['review_title'],
#     content=review['review_content'],
#     llm=llm
# )

### Process Batch

In [10]:
def process_batch(df, batch_num:int, start_index:int, llm:BaseLLM, batch_size:int=32):
    # select the rows in the batch (the last batch may be shorter than batch_size)
    end_index = start_index + batch_size
    batch = df.iloc[start_index:end_index]

    # perform the classifications, keeping only non-empty results
    results = []
    for index, row in tqdm(batch.iterrows(), total=len(batch), leave=False):
        result = classify_review(
            id=row['id'],
            title=row['review_title'],
            content=row['review_content'],
            llm=llm)
        if not result.empty:
            results.append(result)

    # combine once at the end instead of growing a dataframe inside the loop
    df_result = pd.concat(results, ignore_index=True) if results else pd.DataFrame()

    # save the classifications (output_path is defined at the top of the notebook)
    df_result.to_parquet(
        f'{output_path}/batch_{batch_num+1:05}.parquet.gz',
        index=False,
        compression='gzip')

    return end_index

# ## test the function
# start_index = process_batch(df_input, 0, 0, llm, 5)
# pd.read_parquet(f'{output_path}/batch_00001.parquet.gz')
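
The slicing that `process_batch` relies on can be checked without calling the model. The following is a minimal sketch in plain Python; `batch_bounds` is a hypothetical helper, not part of the notebook or of `service.classification`. It lists the (batch number, start, end) triples that a run over `n_rows` rows would visit, with the end clamped to the number of rows so that the final, shorter batch is visible.

```python
def batch_bounds(n_rows: int, batch_size: int = 32, start_batch_num: int = 0):
    """Yield (batch_num, start_index, end_index) for each batch.

    Hypothetical helper for illustration only. end_index is exclusive and
    clamped to n_rows, which matches the rows df.iloc[start:end] returns.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch_num = start_batch_num
    start = start_batch_num * batch_size
    while start < n_rows:
        end = min(start + batch_size, n_rows)
        yield batch_num, start, end
        batch_num += 1
        start = end


# the 5218 reviews loaded above, in batches of 100
bounds = list(batch_bounds(5218, batch_size=100))
print(len(bounds))   # 53 batches, so 53 output files
print(bounds[-1])    # (52, 5200, 5218): the last batch holds only 18 rows
```

Since batch files are numbered from 1, the last triple here corresponds to `batch_00053.parquet.gz`.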

### Process Data

In [11]:
def process_data(df, llm:BaseLLM, start_batch_num:int=0, batch_size:int=32):
    start_index = start_batch_num * batch_size
    batch_num = start_batch_num

    # round up so that a final, partial batch is counted
    total_batches = -(-len(df) // batch_size)

    # initial= keeps the bar consistent when resuming from a later batch;
    # the with-block closes the bar even if a batch fails
    with tqdm(total=total_batches, initial=start_batch_num) as pbar:
        while start_index < len(df):
            try:
                start_index = process_batch(
                    df=df,
                    batch_num=batch_num,
                    start_index=start_index,
                    llm=llm,
                    batch_size=batch_size)
                batch_num += 1
                pbar.update(1)
            except Exception:
                print(f'Error processing batch {batch_num}')
                raise

# ## test the function
# df_test = df_input.sample(6)
# process_data(df=df_test, llm=llm, batch_size=2)

# display(pd.read_parquet(f'{output_path}/batch_00001.parquet.gz'))
# display(pd.read_parquet(f'{output_path}/batch_00002.parquet.gz'))
# display(pd.read_parquet(f'{output_path}/batch_00003.parquet.gz'))
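
Because every batch is written to its own file, an interrupted run can pick up where it stopped. The following is a minimal sketch, assuming the `batch_{n:05}.parquet.gz` naming used by `process_batch` and that the run is resumed with the same `batch_size`; `next_start_batch` is a hypothetical helper, not part of the notebook.

```python
import re

# file names written by process_batch are 1-based: batch_00001.parquet.gz, ...
BATCH_FILE = re.compile(r"^batch_(\d{5,})\.parquet\.gz$")


def next_start_batch(filenames) -> int:
    """Return the 0-based start_batch_num that follows the highest batch file.

    File number N holds batch_num N - 1, so the next batch_num equals N.
    Returns 0 when no batch files are present.
    """
    numbers = []
    for name in filenames:
        match = BATCH_FILE.match(name)
        if match:
            numbers.append(int(match.group(1)))
    return max(numbers, default=0)


existing = ["batch_00001.parquet.gz", "batch_00002.parquet.gz", "notes.txt"]
print(next_start_batch(existing))  # 2
```

In the notebook this could be called as `next_start_batch(os.listdir(output_path))`, with the result passed to `process_data` as `start_batch_num`. It only looks at the highest file number, so it does not detect gaps or a partially written last file.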

## Classify Dataset

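Classify the full input dataset in batches of 100 reviews. Each batch is written to its own parquet file in `output_path`, so the 5,218 reviews produce 53 files. If a run fails, rerun with `start_batch_num` set to the batch number printed in the error message.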

In [None]:
process_data(df=df_input, llm=llm, batch_size=100, start_batch_num=0)