Read the DataFrame from a pickle file

Then inspect the latest checkpoint file step by step, amend it, and save the amended checkpoint as a new file:
1. Edit 'NA' sentiments
2. Edit error results

In [1]:
import os
from pathlib import Path
import re
import pandas as pd

In [2]:
def check_for_checkpoint(dataset_folder:Path, base_file: Path):
    '''Search the dataset folder for checkpoint files.

    Return the latest checkpoint file (the one with the largest index) and its index.
    If there is no checkpoint file, return base_file and 0.

    params:
    dataset_folder: the folder that contains a specific dataset and all its checkpoints.
    base_file: the raw file without any checkpoint suffix.
    '''
    all_pkl = []
    for root, dirs, files in os.walk(dataset_folder):
        all_pkl = list(map(lambda f: Path(root, f), files))
        all_pkl = [p for p in all_pkl if p.suffix == '.pkl']
        break       # only scan for 1 level, not looking for directories inside the folder

    # keep only checkpoint files; the filename looks like 'file_name' + '_ckpt_' + '123' + '.pkl'
    # re.escape guards against regex metacharacters (e.g. '.') in the base file name
    ckpt_pattern = re.compile(
        re.escape(base_file.stem) + r"_ckpt_([0-9]+)" + re.escape(base_file.suffix)
    )
    indexed_files = [
        (int(m[1]), f) for f in all_pkl if (m := ckpt_pattern.fullmatch(f.name))
    ]

    if indexed_files:
        # compare on the index only, so ties never fall back to comparing paths
        largest_index, largest_index_file = max(indexed_files, key=lambda pair: pair[0])
        return largest_index_file, largest_index
    else:
        return base_file, 0
    
def load_pickle(path_to_load:Path) -> pd.DataFrame:
    df = pd.read_pickle(path_to_load)
    print('\n')
    print(f'Successfully loaded df from {str(path_to_load)}')
    # print(df.head())
    return df
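
The file-naming convention that check_for_checkpoint relies on can be tried out in isolation. The snippet below is only a sketch: the helper parse_ckpt_index and the file names are hypothetical and are not part of the notebook. It assumes checkpoints are named '<base stem>_ckpt_<index><suffix>'.

```python
import re
from pathlib import Path


def parse_ckpt_index(file_name: str, base_file: Path):
    """Return the checkpoint index encoded in file_name,
    or None if file_name is not a checkpoint of base_file."""
    pattern = (
        re.escape(base_file.stem) + r"_ckpt_([0-9]+)" + re.escape(base_file.suffix)
    )
    match = re.fullmatch(pattern, file_name)
    return int(match[1]) if match else None


# hypothetical file names, for illustration only
base = Path("reviews_chunk_002.pkl")
names = [
    "reviews_chunk_002.pkl",           # the base file itself, not a checkpoint
    "reviews_chunk_002_ckpt_007.pkl",  # checkpoint 7
    "reviews_chunk_002_ckpt_034.pkl",  # checkpoint 34
    "reviews_chunk_003_ckpt_001.pkl",  # checkpoint of another chunk
]
indices = {name: parse_ckpt_index(name, base) for name in names}
latest = max((n for n in names if indices[n] is not None), key=indices.get)
print(latest, indices[latest])
```

Leading zeros in the index are harmless because the captured digits are converted with int() before they are compared.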

In [26]:
# change the folder and file name !!

dataset_folder = Path('dataset_cleaned_heartless_sampled_20231129/dataset_cleaned_heartless_sampled_20231129_chunk_002').resolve()
base_file = Path(dataset_folder.parent, 'dataset_cleaned_heartless_sampled_20231129_chunk_002.pkl')

# this will get the latest checkpoint
df_filepath, curr_ckpt_index = check_for_checkpoint(dataset_folder, base_file)

print(df_filepath)
print(curr_ckpt_index)

/root/FYP/NLP/sa_hkuchatgpt/dataset_cleaned_heartless_sampled_20231129/dataset_cleaned_heartless_sampled_20231129_chunk_002/dataset_cleaned_heartless_sampled_20231129_chunk_002_ckpt_034.pkl
34


In [27]:
df = load_pickle(df_filepath)

print(df.head(10))



Successfully loaded df from /root/FYP/NLP/sa_hkuchatgpt/dataset_cleaned_heartless_sampled_20231129/dataset_cleaned_heartless_sampled_20231129_chunk_002/dataset_cleaned_heartless_sampled_20231129_chunk_002_ckpt_034.pkl
   dataset_index  app_id                                         app_name  \
0        3648911  292030                         The Witcher 3: Wild Hunt   
1        5080431   39120                                             RIFT   
2        1537424   22370             Fallout 3 - Game of the Year Edition   
3        6375377   92800                                        SpaceChem   
4        2566551  247950                                         Sacred 3   
5        3824834  301690                             Cobi Treasure Deluxe   
6        2646517  250580  Paranautical Activity: Deluxe Atonement Edition   
7         123253  105600                                         Terraria   
8        1220843  219640                       Chivalry: Medieval Warfare   
9        4

Interface for editing the checkpoint

In [28]:
# show ratio of processed vs unprocessed rows
df_processed_rows = df[df['total_token_used'] != -1]
num_of_processed_rows = len(df_processed_rows)
total_num_of_rows = len(df)

print(f"processed rows: {num_of_processed_rows}")
print(f"total number of rows: {total_num_of_rows}")
print(f"processed ratio: {num_of_processed_rows / total_num_of_rows:.04}")

processed rows: 1020
total number of rows: 3000
processed ratio: 0.34


In [29]:
# show index of unprocessed rows
print("List of unprocessed index")
list_unprocessed = list(df[df['total_token_used'] == -1].index)
print(list_unprocessed)

List of unprocessed index
[1020, 1021, 1022, 1023, 1024, 1025, 1026, 1027, 1028, 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071, 1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087, 1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103, 1104, 1105, 1106, 1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114, 1115, 1116, 1117, 1118, 1119, 1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130, 1131, 1132, 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145, 1146, 1147, 1148, 1149, 1150, 1151, 1152, 1153, 1154, 1155, 1156, 1157, 1158, 1159, 1160, 1161, 1162, 1163, 1164, 1165, 1166, 1167, 1168, 1169, 1170, 1171, 1172, 1173, 1174, 1175, 1176, 1177, 1178, 1179, 1180, 1181, 1

Filter the useful rows that received a response from ChatGPT 3.5

1. Separate the JSON-formatted rows from the non-JSON-formatted rows.

For the JSON-formatted rows: check whether the probabilities sum to 1, then find the problematic rows and fix them.

For the non-JSON-formatted rows:
1. Get the rows that contain "NA" as a keyword.
2. The remaining rows hold other kinds of responses.
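
The split described above can be sketched as a single labelling function. This is an illustrative sketch only: classify_response and the sample responses are hypothetical, and it assumes a usable response is a JSON object (a dict), so a bare number or string is treated as non-JSON.

```python
import json


def classify_response(response) -> str:
    """Label a raw response as 'json', 'na', or 'other'."""
    text = str(response)
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict):
            return "json"
    except ValueError:
        # not parseable as JSON; fall through to the keyword check
        pass
    return "na" if "NA" in text else "other"


# made-up responses, for illustration only
samples = [
    '{"positive": 0.2, "neutral": 0.6, "negative": 0.2}',
    "NA",
    "Sorry, I cannot rate this review.",
]
kinds = [classify_response(s) for s in samples]
print(kinds)
```

On the notebook's data this could be applied with df['response'].map(classify_response). Note that a JSON object whose values are "NA" is still labelled 'json' here; such rows would only surface in the later sum-of-probability check.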


In [30]:
# to check rows: use df.loc[index_label] for index labels, or df.iloc[position] for integer positions

df.loc[38]

dataset_index                                                 6051145
app_id                                                            570
app_name                                                       Dota 2
review_text                   so many quitter but funny community LOL
review_score                                                        1
review_votes                                                        0
response            {"positive": 0.2, "neutral": 0.6, "negative": ...
total_token_used                                                  124
Name: 38, dtype: object

In [31]:
# test json-formatting
import json


def test_json_formatting(df, target_column_name:str):
    valid_rows = []
    invalid_rows = []

    for index, row in df.iterrows():
        try:
            # json.loads raises ValueError if the response is not valid JSON
            json.loads(str(row[target_column_name]))

            valid_rows.append(index)
        except ValueError:
            invalid_rows.append(index)

    return valid_rows, invalid_rows

valid_rows, invalid_rows = test_json_formatting(df_processed_rows, 'response')
print("Rows that can be formatted into JSON:", valid_rows)
print("Rows that cannot be formatted into JSON:", invalid_rows)
print()
print("Number of rows that can be formatted into JSON:", len(valid_rows))
print("Number of rows that cannot be formatted into JSON:", len(invalid_rows))

Rows that can be formatted into JSON: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 61, 62, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 82, 83, 85, 86, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 163, 164, 165, 166, 167, 168, 169, 172, 174, 176, 178, 179, 180, 182, 184, 187, 188, 190, 192, 193, 194, 195, 196, 197, 199, 200, 201, 202, 204, 205, 206, 209, 211, 212, 213, 214, 216, 217, 218, 219, 221, 222, 227, 228, 231, 232, 235, 236, 237, 239, 244, 246, 247, 251, 254, 256, 258, 259, 261, 263, 2

In [32]:
# count the rows whose response contains "NA"
num_of_na_rows = len(df[df['response'].str.contains('NA') == True])
print(f"num of rows with \"NA\": {num_of_na_rows}")
print(f"'NA' row ratio: {num_of_na_rows / total_num_of_rows:.04}")

num of rows with "NA": 7
'NA' row ratio: 0.002333


In [33]:
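# get the rows that are in JSON format
# only these rows will be saved

# valid_rows holds index labels (from iterrows), so select with .loc rather than .iloc
df_json_rows = df.loc[valid_rows]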
df_json_rows.head()

Unnamed: 0,dataset_index,app_id,app_name,review_text,review_score,review_votes,response,total_token_used
0,3648911,292030,The Witcher 3: Wild Hunt,Well Geralt is a real Hunt thats for sure.,-1,1,"{\n ""positive"": 1.0,\n ""neutral"": 0.0,\n...",139
1,5080431,39120,RIFT,im not sure if i should or shouldnt recommend ...,-1,0,"{""positive"": 0, ""neutral"": 0.5, ""negative"": 0.5}",175
2,1537424,22370,Fallout 3 - Game of the Year Edition,i love fallout and have been playing it for ye...,-1,0,"{""positive"": 0.0, ""neutral"": 0.0, ""negative"": ...",175
3,6375377,92800,SpaceChem,Take a precautionary step to prevent Alzheimer...,1,0,"{""positive"": 1.0, ""neutral"": 0.0, ""negative"": ...",129
4,2566551,247950,Sacred 3,"Buggy. Firstly, Crashes when I want to play in...",-1,0,"{""positive"":0.0, ""neutral"":0.0, ""negative"":1.0}",258


In [42]:
# test whether the probabilities sum to 1 (within a tolerance of 0.1)
# and print the rows that fail

# re-run this cell after each change to the dataframe
# to see whether the change passes the check.

def test_sum_prob(df, target_column_name:str):
    valid_rows = []
    invalid_rows = []

    for index, row in df.iterrows():
        
        try:
            response = json.loads(str(row[target_column_name]))

            sum_of_prob = response['positive'] + response['neutral'] + response['negative']

            if 1 - 0.1 < sum_of_prob < 1 + 0.1:
                valid_rows.append(index)
            else:
                print("sum_of_prob =", sum_of_prob)
                print(response['positive'], response['neutral'], response['negative'])
                print(index)
                print()
                invalid_rows.append(index)
        except Exception as e:
            print(e)
            print(index)
            print()
            invalid_rows.append(index)

    return valid_rows, invalid_rows

sum_valid_rows, sum_invalid_rows = test_sum_prob(df_json_rows, 'response')

print("Rows that have sum of prob = 1:", sum_valid_rows)
print("Rows that have sum of prob != 1", sum_invalid_rows)
print()
print("Number of rows that have sum of prob = 1:", len(sum_valid_rows))
print("Number of rows that have sum of prob != 1:", len(sum_invalid_rows))

Rows that have sum of prob = 1: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 61, 62, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 82, 83, 85, 86, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 163, 164, 165, 166, 167, 168, 169, 172, 174, 176, 178, 179, 180, 182, 184, 187, 188, 190, 192, 193, 194, 195, 196, 197, 199, 200, 201, 202, 204, 205, 206, 209, 211, 212, 213, 214, 216, 217, 218, 219, 221, 222, 227, 228, 231, 232, 235, 236, 237, 239, 244, 246, 247, 251, 254, 256, 258, 259, 261, 263, 264, 26

In [38]:
# fix the rows whose sum of prob != 1, one at a time,
# by setting 'processing_idx' to the index of the row you want to fix
processing_idx = 977


print(df_json_rows.loc[processing_idx])
print()
print(df_json_rows.at[processing_idx, 'review_text'])
print()
print(df_json_rows.at[processing_idx, 'response'])


dataset_index                                                 4839411
app_id                                                         372000
app_name                                Tree of Savior (English Ver.)
review_text         100% Item Farmers are Bots 100% Items are Over...
review_score                                                       -1
review_votes                                                        1
response               {"positive": 0, "neutral": 0, "negative": 100}
total_token_used                                                  217
Name: 977, dtype: object

100% Item Farmers are Bots 100% Items are Overpriced including Cash Shop(Hairstyle in-game cost more than haircut in real-life) 100% New Bugs every patch 100% Gay Costumes 100% No Healer 95% Hero Class 95% of the community are Bots 95% Disconnection in-game 95% stuck in loading 95% Solo Gaming 90% Item from Duplicate exploit 90% Lag 90% FPS Drop 90% Game Freeze

{"positive": 0, "neutral": 0, "negative": 100}


In [39]:
# approach 1: overwrite the response with a new one written by us.
# This is for responses whose JSON contains 'NA' or another format (like percentages).

# use df.at to access the particular cell and edit the response

to_be_replaced_with = {"positive": 0, "neutral": 0, "negative": 1}

df_json_rows.at[processing_idx, 'response'] = json.dumps(to_be_replaced_with)

print(df_json_rows.at[processing_idx, 'response'])

# after fixing, rerun the cell with test_sum_prob to see whether the error has disappeared

{"positive": 0, "neutral": 0, "negative": 1}
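
When several responses use percentages, the manual overwrite could be automated. The following is a hedged sketch rather than part of the original workflow: rescale_percentages is a hypothetical helper, and the 90 to 110 window is an assumption that simply mirrors the 0.1 tolerance of the sum check, scaled by 100.

```python
import json

LABELS = ("positive", "neutral", "negative")


def rescale_percentages(response: str) -> str:
    """If the three scores sum to roughly 100, divide each by 100.
    Any other response is returned with its scores unchanged."""
    scores = json.loads(response)
    total = sum(scores[label] for label in LABELS)
    if 90 < total < 110:  # assumed window for "looks like percentages"
        scores = {label: scores[label] / 100 for label in LABELS}
    return json.dumps(scores)


fixed = rescale_percentages('{"positive": 0, "neutral": 0, "negative": 100}')
print(fixed)
```

A row fixed this way should still be re-checked with test_sum_prob, and the review text should be read before trusting the rescaled scores.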


In [41]:
# approach 2: remove rows whose sentiment cannot be told from the short review.
# usually for responses with all zeros.

# sometimes you want to remove rows
# use df.drop(list_of_indices)

# create a backup before dropping rows
df_json_rows_bkup = df_json_rows.copy(deep=True)

# reassign instead of using inplace=True: df_json_rows is a selection from df,
# so an in-place drop triggers pandas' SettingWithCopyWarning
df_json_rows = df_json_rows.drop([961])


Saving the cleaned data

We only save rows that hold a valid JSON object whose probabilities sum to 1.

Currently the result is saved as a new pkl file (i.e. it is not added to the folder as a checkpoint), but this behaviour may change.
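
The saving rule above can be expressed as one predicate, which makes it possible to filter out any remaining bad rows right before saving. This is a minimal sketch under assumptions: is_valid_distribution is a hypothetical helper, and its default tolerance of 0.1 is taken from the sum check used earlier.

```python
import json


def is_valid_distribution(response, tolerance: float = 0.1) -> bool:
    """True if response is a JSON object whose three scores sum to about 1."""
    try:
        scores = json.loads(str(response))
        total = scores["positive"] + scores["neutral"] + scores["negative"]
        return abs(total - 1) < tolerance
    except (ValueError, KeyError, TypeError):
        # not JSON, a key is missing, or a score is not a number
        return False


# made-up responses, for illustration only
checks = {
    "good": is_valid_distribution('{"positive": 0.2, "neutral": 0.6, "negative": 0.2}'),
    "percent": is_valid_distribution('{"positive": 0, "neutral": 0, "negative": 100}'),
    "zeros": is_valid_distribution('{"positive": 0, "neutral": 0, "negative": 0}'),
    "na": is_valid_distribution("NA"),
}
print(checks)
```

On the notebook's data, df_json_rows[df_json_rows['response'].map(is_valid_distribution)] would keep only the rows that satisfy the rule.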

In [43]:
from datetime import datetime

# change the file name as well !!
save_path = Path(dataset_folder.parent, f'{base_file.stem}_cleaned_{datetime.now().strftime("%Y%m%d")}.pkl').resolve()

df_json_rows.to_pickle(save_path)
print(f"Saved to: {save_path}")

Saved to: /root/FYP/NLP/sa_hkuchatgpt/dataset_cleaned_heartless_sampled_20231129/dataset_cleaned_heartless_sampled_20231129_chunk_002_cleaned_20231211.pkl
