In [106]:
import os
import time

import numpy as np
import pandas as pd
from dotenv import load_dotenv

from utils import utils

# Show full review text instead of truncating long columns
pd.set_option('display.max_colwidth', None)

## Context

In the previous step, the quality check of the sampled reviews was done manually by Jonas Torres (the author of this project).

Manual review does not scale well.

The goal of this notebook is to:

1. Review the quality of the topics assigned by the LLM.
2. Use the same LLM to review its own work. This can now be done because we have a sample with "tags" that were reviewed by a human (the author).

Since a human is reviewing the topic tags, we assume that the human holds the "ground truth" (i.e., whether a topic assignment is correct or not).

In [309]:
df_r = pd.read_csv(path+"4.tagged_reviews_sample_verified.csv")

In [310]:
df_r = df_r[['content', 'score','app', 'gemini_llm_topic', 'correct',
       'human_labeled_topic', 'cause', "corrected_topic"]]

## Explore quality of Topic Assignment

The file containing the manual evaluation with some simple analysis can be found in this [link](https://docs.google.com/spreadsheets/d/1To8TfdoSm2ZhZTx0WZmG03pY5xsYlMggNMXZJcb_tc0/edit?usp=sharing).

### Variable description

In [311]:
df_r.head(1).T

Unnamed: 0,0
content,This app weights 460 MB! Old review: Stop forcing me to update
score,1
app,co.mona.android
gemini_llm_topic,Usability/UI/UX
correct,1
human_labeled_topic,
cause,
corrected_topic,Usability/UI/UX


### Accuracy

In [312]:
# Overall accuracy
print("Overall Topic Accuracy: ",df_r["correct"].mean().round(2)*100, "%")

Overall Topic Accuracy:  86.0 %


Overall topic accuracy is decent at 86%.

Breaking the accuracy down by topic reveals more:

* Educational Resources/Onboarding: low prevalence and 0% accuracy.
* Privacy: only 1 review was tagged with this topic.
* Wallet Security/Integration: low prevalence and 0% accuracy.
* Usability/UI/UX: 74.5% accuracy. This is a category that is worth exploring further.
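
Because several topics have very few reviews, their per-topic accuracy is a noisy estimate. As a rough illustration that is not part of the original analysis, the sketch below computes a 95% Wilson score interval from a count of correct reviews and a total. The function name `wilson_interval` is hypothetical; the example counts are the Privacy row (1 of 1) and 38 of 51 for Usability/UI/UX, which corresponds to its 0.745098 mean.

```python
import math

def wilson_interval(n_correct, n_total, z=1.96):
    """Return the (low, high) Wilson score interval for a proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if n_total == 0:
        raise ValueError("n_total must be positive")
    p = n_correct / n_total
    denom = 1 + z**2 / n_total
    center = (p + z**2 / (2 * n_total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Privacy: 1 correct out of 1 review
privacy_low, privacy_high = wilson_interval(1, 1)
# Usability/UI/UX: 38 correct out of 51 reviews
usability_low, usability_high = wilson_interval(38, 51)

print(f"Privacy: {privacy_low:.2f} to {privacy_high:.2f}")
print(f"Usability/UI/UX: {usability_low:.2f} to {usability_high:.2f}")
```

The interval for a single-review topic is much wider than the one for a topic with 51 reviews, which is why the 0% and 100% figures for low-prevalence topics should be read with caution.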

In [313]:
#Accuracy by topic
df_r.groupby("gemini_llm_topic").agg({"correct":["count", "mean"]})

gemini_llm_topic,correct (count),correct (mean)
Account Management,38,0.894737
Customer Support,36,0.805556
Educational Resources/Onboarding,4,0.0
Features/Functionality,25,0.92
Generic feedback,50,0.94
Privacy,1,1.0
Reliability/Stability,39,0.923077
Security,5,0.8
Transaction Fees/Speed,31,0.935484
Usability/UI/UX,51,0.745098


### Analyzing the Errors

In [314]:
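def word_prevalence(reviews, word):
    """Return the share of reviews that contain `word` (case-insensitive substring match)."""
    if len(reviews) == 0:
        return 0.0
    word = word.lower()
    # Substring match: "use" also matches "useless" or "because"
    match_counter = sum(word in review.lower() for review in reviews)
    return match_counter / len(reviews)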

In [315]:
usability_index = df_r["gemini_llm_topic"] == "Usability/UI/UX"
wrong_index = df_r["correct"] == 0

In [316]:
df_r[usability_index & wrong_index]["corrected_topic"].value_counts()

corrected_topic
Reliability/Stability     7
Generic feedback          5
Features/Functionality    1
Name: count, dtype: int64

In [317]:
df_r.loc[usability_index & wrong_index,["gemini_llm_topic","corrected_topic", "content"]]

Unnamed: 0,gemini_llm_topic,corrected_topic,content
118,Usability/UI/UX,Generic feedback,Very cheap style is of this Binance which compels the net user to bother to cee this in tricky repeatable !
119,Usability/UI/UX,Generic feedback,Most Stupid crypto trading app
137,Usability/UI/UX,Reliability/Stability,The app is good but i cant installed it in my new device i dont know the problem
162,Usability/UI/UX,Reliability/Stability,The app is still junk. The same problems for over two years. Trying to top up my debit card always requires me to shut down the app and restart. Don't ask me my phone make/model or tell me it's a problem with my phone ... It's been like that what different phones. The other nuisance is the forced updates... FFS give us a heads up beforehand. Don't just lock us out of the bloody app. It's so frustrating. And why does it need more than 2Gb of storage space to update with a 45Mb update? Grr
171,Usability/UI/UX,Reliability/Stability,"I am trying to open my crypto, BUT the app keeps locking me out... this app is useless. I guess I will open another app other than crypto. And there is no way to get help... I put in the code that they send me, and it won't take it.... AND now you tell me to live chat you???? HOW? I can't get on the app dummies"
180,Usability/UI/UX,Generic feedback,"Worst app , don't install"
183,Usability/UI/UX,Features/Functionality,Unable to use card
185,Usability/UI/UX,Reliability/Stability,"Well, the app is super slow, but at least it works again since the problematic update a while back. The constant ""must update to continue using the app"" is EXTREMELY annoying. I don't care if i miss 1 or 2 features, i just want to check on my crypto for once without being nagged to update. Now if an update made the app faster or better again, maybe I'd be interested. But nothing changes. This has been posted a long time, so from now on, every time I get nagged to update,this review loses a star."
190,Usability/UI/UX,Reliability/Stability,Id love it if the app didn't update every couple of weeks
208,Usability/UI/UX,Reliability/Stability,Finding it difficult to download the app my data keeps going


In [318]:
#correct usability reviews with the word "app"
df_r[usability_index & ~wrong_index]["content"].str.lower().str.contains("app").sum() / sum(usability_index & ~wrong_index)

np.float64(0.5)

In [319]:
wrong_usability_reviews = df_r[usability_index & wrong_index]["content"].values
correct_usability_reviews = df_r[usability_index & ~wrong_index]["content"].values

In [320]:
word_prevalence(wrong_usability_reviews, "app")

0.8461538461538461

In [321]:
word_prevalence(correct_usability_reviews, "app")

0.5

In [322]:
word_prevalence(wrong_usability_reviews, "use")

0.3076923076923077

In [323]:
word_prevalence(correct_usability_reviews, "use")

0.5
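
The prevalence figures above rely on substring matching, so a search for "use" also counts words such as "useless". As a hedged sketch of a stricter variant, the function below matches whole words only, using regular-expression word boundaries. The name `whole_word_prevalence` is hypothetical and the example reviews are made up for illustration; they are not taken from the dataset.

```python
import re

def whole_word_prevalence(reviews, word):
    """Share of reviews containing `word` as a whole word (case-insensitive)."""
    if len(reviews) == 0:
        return 0.0
    # re.escape guards against regex metacharacters in the search word
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    matches = sum(bool(pattern.search(review)) for review in reviews)
    return matches / len(reviews)

# Made-up reviews for illustration only
example_reviews = ["This app is useless", "Easy to use", "Great application"]

# "useless" and "application" no longer count as matches
use_share = whole_word_prevalence(example_reviews, "use")
app_share = whole_word_prevalence(example_reviews, "app")
print(use_share, app_share)
```

Whether substring or whole-word matching is preferable depends on the question: substring matching catches inflections such as "apps", while whole-word matching avoids unrelated hits.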

## Have an LLM quality check its own work

In [324]:
load_dotenv()
GEMINI_API_KEY=os.getenv("GEMINI_NEW")

In [325]:
with open("data/1.crypto_category.txt", "r") as f:
    category_list = f.readlines()
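
Note that `readlines()` keeps the trailing newline on each line, so the list later formatted into the prompt contains entries ending in `\n`. A minimal sketch of stripping them follows. It uses an in-memory stand-in for the file, because the real contents of `data/1.crypto_category.txt` are not shown here; the one-category-per-line layout is an assumption.

```python
import io

# Stand-in for the category file; the three names appear in the topic table above,
# but the real file layout (one category per line) is an assumption.
fake_file = io.StringIO("Account Management\nCustomer Support\n\nPrivacy\n")

raw_lines = fake_file.readlines()
# Strip surrounding whitespace and drop empty lines
clean_categories = [line.strip() for line in raw_lines if line.strip()]

# A comma-separated string reads more naturally inside a prompt than a list repr
category_text = ", ".join(clean_categories)
print(category_text)
```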

In [326]:
# Prompt template for the LLM reviewer; the placeholders are filled in per review
prompt_base = """You are a Customer Success manager who is doing a quality check on a set of customer reviews that were assigned to a topic\
        by a Large Language Model.
        The LLM was handed a predefined list of topics: {category_list}
        And given the following instruction: 'If the review is just an expression
        of sentiment (eg: Great!, Bad!, etc). Please use the 'Generic feedback' category.'
   
        * This is the review: {review}
        * This is the assigned topic: {topic}
        
        Please review the topic assignment and return the number 1 if the assigned topic is in the comments. Even if the assigned topic
        wasn't the core theme, consider it 1 as long as it's present in the review.
    
        Your answer should be in a string with the following format if the assigned topic is wrong:
        0 | core_topic (the main topic of the string) | topic2, topic3 (if more topics exist) | reason for the new topic

        And the following format if it is correct:
        1 | existing_topic | topic2, topic3 (if more topics exist) | reason
        """

In [327]:
review = "Finding it difficult to download the app my data keeps going"

In [328]:
prompt = prompt_base.format(category_list = category_list, review=review,topic="Generic feedback")

In [329]:
res =utils.gemini_query(prompt, gemini_key = GEMINI_API_KEY)

In [330]:
res.split("|")

['0 ',
 ' Reliability/Stability ',
 '  ',
 ' The review clearly describes a problem with downloading the app and data issues, which directly relates to the reliability and stability of the app.  "Generic feedback" is insufficient.']

In [331]:
debug=True
perc = df_r.shape[0]//10
gemini_feedback = []

In [332]:
start_time = time.time()
n_samples = 20

# In debug mode, run on a small random sample to limit API calls
df_c = df_r.sample(n_samples).copy() if debug else df_r.copy()
reviews = df_c.content.values
topics = df_c.gemini_llm_topic.values

for i, review in enumerate(reviews):
    if debug:
        print("Checking Review #", i+1)
    prompt = prompt_base.format(category_list = category_list, review=review,topic=topics[i])

    res = utils.gemini_query(prompt, gemini_key = GEMINI_API_KEY, debug=debug)
    gemini_feedback.append(res)
    if i%perc == 0:
        print(f"{i} out of {len(reviews)} done.")

end_time = time.time()
# Report len(reviews), not the zero-based loop index i, which undercounts by one
# (hence the 19 in the output recorded below)
print(f"{len(reviews)} reviews were processed in {(end_time-start_time)/60} minutes")

Checking Review # 1
0 out of 20 done.
Checking Review # 2
Checking Review # 3
Checking Review # 4
Checking Review # 5
Checking Review # 6
Checking Review # 7
Checking Review # 8
Checking Review # 9
Checking Review # 10
Checking Review # 11
Checking Review # 12
Checking Review # 13
Checking Review # 14
Checking Review # 15
Checking Review # 16
Checking Review # 17
Gemini Failed to respond. Sleeping...
Entering recursive step. 1
Checking Review # 18
Checking Review # 19
Checking Review # 20
19 reviews were processed in 1.02536381483078 minutes
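
The log shows one failed call followed by a sleep and a recursive retry inside `utils.gemini_query`, whose implementation is not shown here. As a hedged alternative sketch, the wrapper below retries any callable in a loop with exponential backoff. `call_with_retries`, its parameters, and the `flaky_query` stand-in are all hypothetical and do not reflect the actual `utils` module.

```python
import time

def call_with_retries(func, *args, max_retries=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying on any exception.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last error once max_retries retries are used up.
    """
    for attempt in range(max_retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)

# Hypothetical stand-in for an API call that fails once, then succeeds
calls = {"count": 0}

def flaky_query(prompt):
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("no response")
    return "1 | Generic feedback |  | ok"

delays = []
# Pass delays.append as `sleep` so the example records the waits instead of pausing
result = call_with_retries(flaky_query, "some prompt", sleep=delays.append)
print(result, delays)
```

A loop with a retry cap avoids unbounded recursion if the service stays unavailable, and the injectable `sleep` argument keeps the wrapper easy to test.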


In [334]:
correct_list = []
core_topics = []
seconday_topics = []
reason = []
for f in gemini_feedback:
    # Expected format: "flag | core_topic | secondary topics | reason"
    f_list = [part.strip() for part in (f or "").split("|")]
    # Pad short or malformed answers so every field can be indexed safely
    f_list += [""] * (4 - len(f_list))
    try:
        correct_list.append(int(f_list[0]))
    except ValueError:
        # Unparseable flag: count the assignment as not confirmed
        correct_list.append(0)
    core_topics.append(f_list[1])
    seconday_topics.append(f_list[2])
    # The reason may itself contain "|", so re-join the remaining parts
    reason.append(" ".join(f_list[3:]).strip())

In [302]:
df_c["correct_llm"] = correct_list
df_c["core_topic_llm"] = core_topics
df_c["secondary_topic_llm"] = seconday_topics
df_c["reason_llm"] = reason

In [303]:
df_c["match"] = np.logical_and(df_c["correct"]==1, df_c["correct_llm"]==1) * 1

In [304]:
df_c["match"].mean()

np.float64(0.4)

In [305]:
np.logical_or(df_c["match"] == 1,df_c["correct_llm"]==1 ).mean()

np.float64(0.55)

In [306]:
df_c["correct_llm"].mean()

np.float64(0.55)
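
The `match` column above counts only reviews that both the human and the LLM marked as correct, so cases where both marked the assignment as wrong are not counted as agreement. As a hedged sketch of a fuller comparison, the function below computes raw agreement and Cohen's kappa from two lists of 0/1 verdicts. `agreement_summary` is a hypothetical helper and the input lists are made up; they are not taken from `df_c`.

```python
def agreement_summary(human, llm):
    """Compare two equally long lists of 0/1 verdicts."""
    if len(human) != len(llm) or len(human) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(human)
    both_correct = sum(h == 1 and l == 1 for h, l in zip(human, llm))
    both_wrong = sum(h == 0 and l == 0 for h, l in zip(human, llm))
    observed = (both_correct + both_wrong) / n

    # Agreement expected by chance, from each rater's share of 1s
    p_human = sum(human) / n
    p_llm = sum(llm) / n
    expected = p_human * p_llm + (1 - p_human) * (1 - p_llm)

    # Kappa is undefined when chance agreement is already 1
    kappa = (observed - expected) / (1 - expected) if expected < 1 else float("nan")
    return {
        "both_correct": both_correct,
        "both_wrong": both_wrong,
        "agreement": observed,
        "kappa": kappa,
    }

# Made-up verdicts for illustration only
human_verdicts = [1, 1, 0, 0]
llm_verdicts = [1, 0, 0, 1]
summary = agreement_summary(human_verdicts, llm_verdicts)
print(summary)
```

With the notebook's data, the inputs would be `df_c["correct"].tolist()` and `df_c["correct_llm"].tolist()`.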

In [307]:
df_c.to_csv("data/4.tagged_reviews_sample_llm_verified.csv", index = False)

## Automatic Reviews Conclusion

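The automatic review needs more work but has potential. In the sampled run recorded above, the LLM reviewer marked 55% of the topic assignments as correct, while only 40% were marked correct by both the human and the LLM, against an overall human-verified accuracy of 86%. Responses that do not follow the requested `|`-separated format also have to be handled when parsing.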