### Data Preprocessing Script Guide
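- The notebook is set up so that a single "Run All" extracts the preprocessed dataset on your local machine.
- For each run, the only change needed is the `filename` variable in the second cell: set it to the dataset name of the category you want.
- The raw `*.jsonl` files also live in src/dataset/, but because of their size they are excluded from upload via .gitignore. Only the final preprocessed `*_25k.jsonl` files are committed under src/dataset/.
- In short: treat the code as reference only (you do not need to run it). The final jsonl files we will use as input are small, so they are in the dataset/ folder in git; just git clone and use them as they are.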
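
For anyone who only needs the committed files, the following is a minimal sketch of loading one of them with pandas. It uses an in-memory toy sample so it runs anywhere; the three fields (`rating`, `text`, `group`) mirror what this notebook writes, and the variable names `toy_jsonl` and `final_df` are illustrative only.

```python
import io

import pandas as pd

# Toy stand-in for a *_25k.jsonl file: one JSON object per line with the
# rating, text, and group fields that this notebook writes out.
toy_jsonl = io.StringIO(
    '{"rating":1,"text":"Arrived damaged.","group":"N"}\n'
    '{"rating":3,"text":"It is okay.","group":"OOD"}\n'
    '{"rating":5,"text":"Works great.","group":"P"}\n'
)

# For the real data, pass the file path instead of the StringIO object,
# e.g. "src/dataset/Industrial_and_Scientific.jsonl_25k.jsonl".
final_df = pd.read_json(toy_jsonl, lines=True)

print(final_df["group"].value_counts().to_dict())
```

`lines=True` is what tells pandas to treat each line as a separate record, which is the same option the notebook uses for the raw files.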

In [72]:
import pandas as pd

In [73]:
# Important: change filename for each category (the output name used when saving at the bottom is derived from it automatically)
filename = "Industrial_and_Scientific.jsonl"
path = f"../dataset/{filename}"

In [74]:
df = pd.read_json(path, lines=True)
df.head()

index,rating,title,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase
0,5,Best value for the money,These masks are great even though there is no ...,[],B08C7HDF1F,B0BX2672L8,AG2L7H23R5LLKDKLBEF2Q3L2MVDA,2023-02-17 02:54:13.163,3,True
1,5,TOO good.,These scissors are so good they got stolen by ...,[],B07BT4YLHT,B07BT4YLHT,AG2L7H23R5LLKDKLBEF2Q3L2MVDA,2022-12-24 01:09:30.434,1,True
2,4,Good,Good. Sensor push easier to work with but thes...,[],B06XY65HCX,B06XY65HCX,AGCI7FAH4GL5FI65HYLKWTMFZ2CQ,2020-01-21 19:54:56.378,0,True
3,5,Five Stars,Great ORB finish & size. Bought for our laundr...,[],B01KW20EQ0,B01KW20EQ0,AGXVBIUFLFGMVLATYXHJYL4A5Q7Q,2018-07-02 18:39:44.971,0,True
4,1,Only one ply - will not work,These masks are notably thinner than other dis...,[],B08F59NF33,B08N66L183,AGBFYI2DDIKXC5Y4FARTYDTQBMFQ,2022-01-30 14:50:43.612,0,True
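
The raw file here holds about 5.2 million reviews, so loading it in one call can be heavy on memory. As a hedged alternative (not what this notebook does), `pd.read_json` accepts a `chunksize` argument together with `lines=True` and then yields the file piece by piece. The sketch below uses toy data, and `CHUNK_ROWS` is a hypothetical value.

```python
import io

import pandas as pd

# Toy JSON Lines input standing in for a large raw file (10 reviews).
toy_raw = io.StringIO(
    "".join(
        f'{{"rating": {i % 5 + 1}, "title": "t{i}", "text": "review {i}"}}\n'
        for i in range(10)
    )
)

CHUNK_ROWS = 4  # hypothetical chunk size; tune to the available memory

kept = []
with pd.read_json(toy_raw, lines=True, chunksize=CHUNK_ROWS) as reader:
    for chunk in reader:
        # Keep only the two columns the rest of the notebook uses.
        kept.append(chunk[["rating", "text"]])

reviews_chunked = pd.concat(kept, ignore_index=True)
print(len(reviews_chunked), list(reviews_chunked.columns))
```

Dropping unused columns per chunk keeps the peak memory closer to the size of the final two-column frame than to the size of the full raw table.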


In [75]:
# Basic info
print("Total sample #:", len(df))
print("\nData info:")
df.info()  # info() prints its report itself and returns None, so it is not wrapped in print()

Total sample #: 5183005

Data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5183005 entries, 0 to 5183004
Data columns (total 10 columns):
 #   Column             Dtype         
---  ------             -----         
 0   rating             int64         
 1   title              object        
 2   text               object        
 3   images             object        
 4   asin               object        
 5   parent_asin        object        
 6   user_id            object        
 7   timestamp          datetime64[ns]
 8   helpful_vote       int64         
 9   verified_purchase  bool          
dtypes: bool(1), datetime64[ns](1), int64(2), object(6)
memory usage: 360.8+ MB


In [76]:
reviews = df[["rating", "text"]].copy()
reviews.head(10)

index,rating,text
0,5,These masks are great even though there is no ...
1,5,These scissors are so good they got stolen by ...
2,4,Good. Sensor push easier to work with but thes...
3,5,Great ORB finish & size. Bought for our laundr...
4,1,These masks are notably thinner than other dis...
5,1,Took forever to arrive and smelled like fish! ...
6,4,Wow the sticking power on this stuff is crazy....
7,5,"I love this sign, live on a corner and near 2 ..."
8,5,"Comfortable, doesn’t fray, value for money for..."
9,5,good fit. . . . just a few adjustments of soun...


In [77]:
reviews = reviews[reviews["rating"].isin([1, 2, 3, 4, 5])]

In [78]:
print("Rating unique values:", sorted(reviews["rating"].unique()))
print("\nSample # by ratings:")
print(reviews["rating"].value_counts().sort_index())

print("\nRatio by ratings:")
print(reviews["rating"].value_counts(normalize=True).sort_index())

Rating unique values: [1, 2, 3, 4, 5]

Sample # by ratings:
rating
1     584133
2     234763
3     315792
4     560932
5    3487385
Name: count, dtype: int64

Ratio by ratings:
rating
1    0.112702
2    0.045295
3    0.060928
4    0.108225
5    0.672850
Name: proportion, dtype: float64
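
The counts above show how skewed the raw ratings are, which is the motivation for sampling a fixed number per rating later on. As a quick arithmetic check, this sketch recomputes the ratios from the printed counts and shows what share of each rating a 5,000-review sample would cover. The dictionary just copies the numbers from the output above.

```python
# Per-rating counts copied from the output above.
counts = {1: 584133, 2: 234763, 3: 315792, 4: 560932, 5: 3487385}

total = sum(counts.values())
ratios = {rating: n / total for rating, n in counts.items()}

TARGET_PER_RATING = 5000
# Fraction of each rating's reviews that a 5,000-review sample keeps.
kept_share = {rating: TARGET_PER_RATING / n for rating, n in counts.items()}

print("total:", total)
for rating in sorted(counts):
    print(rating, f"ratio={ratios[rating]:.6f}", f"kept={kept_share[rating]:.2%}")
```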


In [79]:
reviews["text_len"] = reviews["text"].str.len()
print("\nText length statistics:")
print(reviews["text_len"].describe())

for i, row in reviews.sample(3, random_state=0).iterrows():
    print("=" * 80)
    print("rating:", row["rating"])
    print("text:", row["text"][:300], "...")


Text length statistics:
count    5.183005e+06
mean     1.762676e+02
std      2.837600e+02
min      0.000000e+00
25%      3.800000e+01
50%      9.500000e+01
75%      2.070000e+02
max      3.327600e+04
Name: text_len, dtype: float64
rating: 1
text: I’m very disappointed they have a very bad smell especially the black ones.  I want to return them but I don’t want them to sell to someone else because I opened them and try it on. ...
rating: 4
text: A+ ...
rating: 4
text: thumbs up ...


In [80]:
# Number of whitespace-separated words per review
reviews["word_count"] = reviews["text"].str.split().str.len()

print("\nword # statistics:")
print(reviews["word_count"].describe())


word # statistics:
count    5.183005e+06
mean     3.270338e+01
std      5.189473e+01
min      0.000000e+00
25%      7.000000e+00
50%      1.800000e+01
75%      3.900000e+01
max      6.040000e+03
Name: word_count, dtype: float64
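
To make the two length metrics concrete, here is a small sketch on toy strings (the sample texts are illustrative). It shows why both statistics have a minimum of 0: an empty review has zero characters and splits into zero words.

```python
import pandas as pd

toy = pd.DataFrame({"text": ["A+", "thumbs up", "", "Took forever to arrive"]})

# Character count per review.
toy["text_len"] = toy["text"].str.len()
# Word count: split on whitespace, then count the resulting tokens.
toy["word_count"] = toy["text"].str.split().str.len()

print(toy)
```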


In [81]:
reviews = reviews.drop(['word_count', 'text_len'], axis=1)

- Within each category, sample 5,000 reviews per rating.
    - Negative: ratings 1-2 (5,000 + 5,000) + OOD: rating 3 (5,000) + Positive: ratings 4-5 (5,000 + 5,000) = 25,000 samples in total
- We use 4 category datasets, so 25,000 × 4 = 100,000 samples in total will be used as the LLM input.

In [82]:
TARGET_PER_RATING = 5000

df_sampled = (reviews.groupby("rating", group_keys=False).apply(lambda x: x.sample(n=TARGET_PER_RATING, random_state=42)))
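
One way to write the same per-rating sampling without `groupby(...).apply` is an explicit loop over the groups. This is a sketch on toy data, not the notebook's code: it reuses `random_state=42` for each group as above, and it additionally caps `n` at the group size, which is an assumption added here for categories where a rating might have fewer reviews than the target.

```python
import pandas as pd

TARGET_PER_RATING = 2  # toy value; the notebook uses 5000

toy_reviews = pd.DataFrame({
    "rating": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5],
    "text": [f"review {i}" for i in range(14)],
})

parts = []
for rating, group in toy_reviews.groupby("rating"):
    # Assumption: cap at the group size so sample() cannot fail on small groups.
    n = min(len(group), TARGET_PER_RATING)
    parts.append(group.sample(n=n, random_state=42))

toy_sampled = pd.concat(parts)
print(toy_sampled["rating"].value_counts().sort_index())
```

As in the original cell, the sampled rows keep their original index values, so they can be traced back to the raw file.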



In [83]:
print("Total length after sampling:", len(df_sampled))
print(df_sampled["rating"].value_counts().sort_index())

Total length after sampling: 25000
rating
1    5000
2    5000
3    5000
4    5000
5    5000
Name: count, dtype: int64


In [84]:
df_sampled.head()

index,rating,text
2850641,1,I had nothing but problems with this filament....
4304545,1,"very thin wire easy to break, If Radio shack w..."
4056460,1,"I know they're just masks, but they are extrem..."
4816939,1,This product arrived damaged and returns not a...
3386887,1,Pins on device prone to breaking off. Even asi...


In [85]:
# Add Negative/OOD/Positive label
def mapping(r):
    if r in [1, 2]:  # negative (N)
        return "N"
    elif r == 3:     # OOD
        return "OOD"
    else:            # positive (P): ratings 4-5, since only 1-5 remain after filtering
        return "P"

In [86]:
df_sampled["group"] = df_sampled["rating"].map(mapping)

print(df_sampled.head())
print(df_sampled["group"].value_counts())

         rating                                               text group
2850641       1  I had nothing but problems with this filament....     N
4304545       1  very thin wire easy to break, If Radio shack w...     N
4056460       1  I know they're just masks, but they are extrem...     N
4816939       1  This product arrived damaged and returns not a...     N
3386887       1  Pins on device prone to breaking off. Even asi...     N
group
N      10000
P      10000
OOD     5000
Name: count, dtype: int64
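
The `else` branch of `mapping` labels anything other than 1, 2, or 3 as "P", which is safe here only because ratings were filtered to 1-5 earlier. As a hedged alternative, a lookup dictionary makes the rule explicit and leaves unexpected ratings as missing values instead. `RATING_TO_GROUP` is a name introduced for this sketch.

```python
import pandas as pd

# Explicit rating -> group table (hypothetical name, same rule as mapping()).
RATING_TO_GROUP = {1: "N", 2: "N", 3: "OOD", 4: "P", 5: "P"}

ratings = pd.Series([1, 2, 3, 4, 5, 0])  # 0 is a deliberately invalid rating
groups = ratings.map(RATING_TO_GROUP)

# Ratings missing from the table become NaN rather than silently "P".
print(groups.tolist())
```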


In [87]:
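# Output: write the sampled rows as JSON Lines.
# Note: filename already ends in ".jsonl", so the saved file is named "<name>.jsonl_25k.jsonl".

new_filename = f"{filename}_25k.jsonl"
out_jsonl = f"../dataset/{new_filename}"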

df_sampled.to_json(
    out_jsonl,
    orient="records",
    lines=True,
    force_ascii=False
)

print("Successfully saved", out_jsonl)

Successfully saved ../dataset/Industrial_and_Scientific.jsonl_25k.jsonl
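
Because `filename` already carries the `.jsonl` extension, the saved name ends in `.jsonl_25k.jsonl`. If a single extension is preferred, one option is to build the name from the stem with `pathlib`. This is only a sketch of the naming step: the committed files follow the notebook's naming, so switching would also mean renaming them and updating any code that reads them.

```python
from pathlib import Path

filename = "Industrial_and_Scientific.jsonl"

# Path.stem drops the final extension, so the suffix is added only once.
new_filename = f"{Path(filename).stem}_25k.jsonl"
out_jsonl = Path("..") / "dataset" / new_filename

print(new_filename)
```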
