### Data Preprocessing Script Guide
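- Pressing "Run All" once is enough to produce the preprocessed dataset on your local machine.
- For each run, change only the `filename` variable in the second cell to the dataset name of the category you want to process.
- The original raw `*.jsonl` files also live in `src/dataset/`, but because of their size they are excluded from upload via `.gitignore`; only the final preprocessed `*_25k.jsonl` files are committed under `src/dataset/`.
- In short: treat this code as reference only (you do not need to run it). The final jsonl files we will use as input are small, so they are checked into the `dataset/` folder in git; just `git clone` the repository and use them as-is.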
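
For anyone who only needs the committed files, here is a minimal sketch of reading one of them without pandas. It assumes each line is a JSON object with `rating`, `text`, and `group` keys (the columns written by the last cell of this notebook). `load_jsonl`, the demo rows, and the demo file name are hypothetical and exist only for illustration.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Tiny stand-in file so the sketch runs anywhere; in practice, point
# load_jsonl at one of the committed *_25k.jsonl files instead.
demo_rows = [
    {"rating": 1, "text": "Felt synthetic", "group": "N"},
    {"rating": 3, "text": "It is okay", "group": "OOD"},
    {"rating": 5, "text": "Love it", "group": "P"},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_25k.jsonl"
    demo_path.write_text(
        "\n".join(json.dumps(r, ensure_ascii=False) for r in demo_rows) + "\n",
        encoding="utf-8",
    )
    records = load_jsonl(demo_path)

# Count how many records fall into each label group.
group_counts = Counter(r["group"] for r in records)
print(len(records), dict(group_counts))
```

With pandas installed, `pd.read_json(path, lines=True)` (the call this notebook uses for the raw files) reads the same format into a DataFrame.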

In [50]:
import pandas as pd

In [51]:
# Important: change filename for each category (the output name used when saving at the bottom is derived from it automatically)
filename = "All_Beauty.jsonl"
path = f"../dataset/{filename}"

In [52]:
df = pd.read_json(path, lines=True)
df.head()

Unnamed: 0,rating,title,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase
0,5,Such a lovely scent but not overpowering.,This spray is really nice. It smells really go...,[],B00YQ6X8EO,B00YQ6X8EO,AGKHLEW2SOWHNMFQIJGBECAF7INQ,2020-05-05 14:08:48.923,0,True
1,4,Works great but smells a little weird.,"This product does what I need it to do, I just...",[],B081TJ8YS3,B081TJ8YS3,AGKHLEW2SOWHNMFQIJGBECAF7INQ,2020-05-04 18:10:55.070,1,True
2,5,Yes!,"Smells good, feels great!",[],B07PNNCSP9,B097R46CSY,AE74DYR3QUGVPZJ3P7RFWBGIX7XQ,2020-05-16 21:41:06.052,2,True
3,1,Synthetic feeling,Felt synthetic,[],B09JS339BZ,B09JS339BZ,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,2022-01-28 18:13:50.220,0,True
4,5,A+,Love it,[],B08BZ63GMJ,B08BZ63GMJ,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,2020-12-30 10:02:43.534,0,True


In [53]:
# basic info
print("Total sample #:", len(df))
print("\nData info:")
print(df.info())

Total sample #: 701528

Data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 701528 entries, 0 to 701527
Data columns (total 10 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   rating             701528 non-null  int64         
 1   title              701528 non-null  object        
 2   text               701528 non-null  object        
 3   images             701528 non-null  object        
 4   asin               701528 non-null  object        
 5   parent_asin        701528 non-null  object        
 6   user_id            701528 non-null  object        
 7   timestamp          701528 non-null  datetime64[ns]
 8   helpful_vote       701528 non-null  int64         
 9   verified_purchase  701528 non-null  bool          
dtypes: bool(1), datetime64[ns](1), int64(2), object(6)
memory usage: 48.8+ MB
None


In [54]:
reviews = df[["rating", "text"]].copy()
reviews.head(10)

Unnamed: 0,rating,text
0,5,This spray is really nice. It smells really go...
1,4,"This product does what I need it to do, I just..."
2,5,"Smells good, feels great!"
3,1,Felt synthetic
4,5,Love it
5,4,The polish was quiet thick and did not apply s...
6,5,Great for many tasks. I purchased these for m...
7,3,These were lightweight and soft but much too s...
8,5,This is perfect for my between salon visits. I...
9,5,I get Keratin treatments at the salon at least...


In [55]:
reviews = reviews[reviews["rating"].isin([1, 2, 3, 4, 5])]

In [56]:
print("Rating unique values:", sorted(reviews["rating"].unique()))
print("\nSample # by ratings:")
print(reviews["rating"].value_counts().sort_index())

print("\nRatio by ratings:")
print(reviews["rating"].value_counts(normalize=True).sort_index())

Rating unique values: [1, 2, 3, 4, 5]

Sample # by ratings:
rating
1    102080
2     43034
3     56307
4     79381
5    420726
Name: count, dtype: int64

Ratio by ratings:
rating
1    0.145511
2    0.061343
3    0.080263
4    0.113154
5    0.599728
Name: proportion, dtype: float64


In [57]:
reviews["text_len"] = reviews["text"].str.len()
print("\nText length statistics:")
print(reviews["text_len"].describe())

for i, row in reviews.sample(3, random_state=0).iterrows():
    print("=" * 80)
    print("rating:", row["rating"])
    print("text:", row["text"][:300], "...")


Text length statistics:
count    701528.000000
mean        173.031641
std         246.924645
min           0.000000
25%          44.000000
50%         102.000000
75%         209.000000
max       14989.000000
Name: text_len, dtype: float64
rating: 5
text: Excellent ...
rating: 5
text: Will leave your hair with a still sleek and shine hold all day!!!!! ...
rating: 5
text: Used for halloween costume. A little work to get it together.. I had to pluck and use concealer for part.. also used mascara to darken roots a little more. But this wig gave me everything I needed for my bride of chucky look. I also like the fullness of the unit. Follow on IG for more pictures! @ mo ...


In [58]:
# Word count per review (whitespace-delimited tokens)
reviews["word_count"] = reviews["text"].str.split().str.len()

print("\nword # statistics:")
print(reviews["word_count"].describe())


word # statistics:
count    701528.000000
mean         32.750720
std          45.973273
min           0.000000
25%           8.000000
50%          19.000000
75%          40.000000
max        2585.000000
Name: word_count, dtype: float64


In [59]:
reviews = reviews.drop(['word_count', 'text_len'], axis=1)

- Within each category, draw 5,000 samples for each rating.
    - Negative: ratings 1-2 (5,000 + 5,000) + OOD: rating 3 (5,000) + Positive: ratings 4-5 (5,000 + 5,000) = 25,000 samples in total
- Since we use 4 category datasets, 25,000 * 4 = 100,000 samples in total will be used as LLM input.
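
The arithmetic above, plus a quick feasibility check, can be sketched as follows. The per-rating counts are the All_Beauty numbers printed earlier in this notebook; `ratings_short_of_target` is a hypothetical helper, and the counts for the other three categories are not known here and would need to be filled in per dataset.

```python
TARGET_PER_RATING = 5000
RATINGS = [1, 2, 3, 4, 5]
NUM_CATEGORIES = 4

# 5 ratings x 5,000 samples per category, then 4 categories overall.
per_category = TARGET_PER_RATING * len(RATINGS)
total = per_category * NUM_CATEGORIES

# Per-rating counts printed earlier for All_Beauty.
all_beauty_counts = {1: 102080, 2: 43034, 3: 56307, 4: 79381, 5: 420726}


def ratings_short_of_target(counts, target):
    """Return the ratings that have fewer rows than the sampling target."""
    return [r for r in sorted(counts) if counts[r] < target]


short = ratings_short_of_target(all_beauty_counts, TARGET_PER_RATING)
print(per_category, total, short)  # 25000 100000 []
```

The check matters because sampling without replacement cannot draw more rows than a rating group contains, so a category with a sparse rating would need a smaller target or a different strategy.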

In [60]:
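# Number of reviews to draw from each rating (1-5)
TARGET_PER_RATING = 5000

# Sample the same number of rows from every rating group; random_state fixes the draw for reproducibility
df_sampled = (reviews.groupby("rating", group_keys=False).apply(lambda x: x.sample(n=TARGET_PER_RATING, random_state=2)))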


In [61]:
print("Total length after sampling:", len(df_sampled))
print(df_sampled["rating"].value_counts().sort_index())

Total length after sampling: 25000
rating
1    5000
2    5000
3    5000
4    5000
5    5000
Name: count, dtype: int64


In [62]:
df_sampled.head()

Unnamed: 0,rating,text
109604,1,"This crown was very pretty to look at, but was..."
276899,1,These arrived melted so bad that the chapstick...
188878,1,Way Too short for tragus
680604,1,Cought fire after 3 times of using them in the...
420790,1,I've used Sonicare tooth brushes for over 10 y...


In [63]:
# Add Negative/OOD/Positive label
def mapping(r):
    if r in [1, 2]:  # negative (N)
        return "N"
    elif r == 3:     # OOD
        return "OOD"
    else:            # positive (P): ratings 4 and 5
        return "P"

In [64]:
df_sampled["group"] = df_sampled["rating"].map(mapping)

print(df_sampled.head())
print(df_sampled["group"].value_counts())

        rating                                               text group
109604       1  This crown was very pretty to look at, but was...     N
276899       1  These arrived melted so bad that the chapstick...     N
188878       1                           Way Too short for tragus     N
680604       1  Cought fire after 3 times of using them in the...     N
420790       1  I've used Sonicare tooth brushes for over 10 y...     N
group
N      10000
P      10000
OOD     5000
Name: count, dtype: int64
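
The `mapping` function above labels anything that is not 1, 2, or 3 as positive, which is safe here only because ratings were filtered to 1-5 beforehand. As an alternative sketch (not the notebook's method), a lookup table can make unexpected ratings fail loudly instead. `RATING_TO_GROUP` and `strict_mapping` are hypothetical names.

```python
# Explicit rating -> group table: N = negative, OOD = rating 3, P = positive.
RATING_TO_GROUP = {1: "N", 2: "N", 3: "OOD", 4: "P", 5: "P"}


def strict_mapping(r):
    """Return the group label for a rating, rejecting anything outside 1-5."""
    try:
        return RATING_TO_GROUP[r]
    except KeyError:
        raise ValueError(f"unexpected rating: {r!r}") from None


labels = [strict_mapping(r) for r in [1, 2, 3, 4, 5]]
print(labels)  # ['N', 'N', 'OOD', 'P', 'P']
```

A function like this can be passed to `Series.map` the same way as `mapping`.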


In [65]:
# output
from pathlib import Path

# Use the stem ("All_Beauty") so the ".jsonl" extension is not duplicated in the output name
new_filename = f"{Path(filename).stem}_25k.jsonl"
out_jsonl = f"../dataset/test/{new_filename}"

df_sampled.to_json(
    out_jsonl,
    orient="records",
    lines=True,
    force_ascii=False
)

print("Successfully saved", out_jsonl)

Successfully saved ../dataset/test/All_Beauty_25k.jsonl
