# Process `wisesight-sentiment` for [Hugging Face Datasets](https://github.com/huggingface/datasets)

This notebook processes the [`wisesight-sentiment`](https://github.com/PyThaiNLP/wisesight-sentiment) dataset, which was provided by **Wisesight (Thailand) Co., Ltd.** The training set contains 24,063 texts in 4 categories (`q`uestion, `neg`ative, `neu`tral, and `pos`itive), and the test set contains 2,674 texts. After dropping rows whose text is `#ERROR!`, we hold out a uniformly random 10% of the original training set for validation, giving a 90/10 train-validation split.

In [1]:
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

In [2]:
# Set data path
data_folder = Path("../kaggle-competition/")

In [3]:
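# Build the training DataFrame from train.txt and train_label.txt.
# The two files are line-aligned: line i of train_label.txt labels line i of train.txt.
# The files are read as UTF-8 explicitly so the Thai text does not depend on the platform default.
with open(data_folder / "train.txt", encoding="utf-8") as f:
    texts = [line.strip() for line in f]

with open(data_folder / "train_label.txt", encoding="utf-8") as f:
    labels = [line.strip() for line in f]

df = pd.DataFrame({"category": labels, "texts": texts})
del texts, labels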
df.shape

(24063, 2)
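
The loading step above pairs each text with its label purely by line position. A minimal sketch of that pairing with an explicit length check is shown here; `load_pairs` is a hypothetical helper that is not part of the original notebook, and the toy files are made up for illustration.

```python
import tempfile
from pathlib import Path


def load_pairs(text_path, label_path):
    """Read two line-aligned files and return a list of (label, text) pairs.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    with open(text_path, encoding="utf-8") as f:
        texts = [line.strip() for line in f]
    with open(label_path, encoding="utf-8") as f:
        labels = [line.strip() for line in f]
    # Fail early if the files are not line-aligned.
    if len(texts) != len(labels):
        raise ValueError(f"{len(texts)} texts but {len(labels)} labels")
    return list(zip(labels, texts))


# Toy demonstration with temporary files (made-up contents, not the real data).
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "toy.txt").write_text("สวัสดี\nhello\n", encoding="utf-8")
    (tmp_dir / "toy_label.txt").write_text("neu\npos\n", encoding="utf-8")
    pairs = load_pairs(tmp_dir / "toy.txt", tmp_dir / "toy_label.txt")
```

`pd.DataFrame` also raises an error when the two lists differ in length, so the explicit check mainly gives a clearer message.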

In [4]:
# Build the test DataFrame from test.txt and test_label.txt.
# As with the training files, the two files are line-aligned and read as UTF-8.
with open(data_folder / "test.txt", encoding="utf-8") as f:
    texts = [line.strip() for line in f]

with open(data_folder / "test_label.txt", encoding="utf-8") as f:
    labels = [line.strip() for line in f]

test_df = pd.DataFrame({"category": labels, "texts": texts})
del texts, labels
test_df.shape

(2674, 2)

In [5]:
# Drop rows whose text is the literal string "#ERROR!" from the train and test sets
df = df[df.texts != "#ERROR!"].reset_index(drop=True)
test_df = test_df[test_df.texts != "#ERROR!"].reset_index(drop=True)
df.shape, test_df.shape

((24032, 2), (2671, 2))

In [6]:
# Hold out a random 10% of the training data for validation (fixed seed for reproducibility)
train_df, valid_df = train_test_split(df, test_size=0.1, random_state=1412)
train_df.shape, valid_df.shape

((21628, 2), (2404, 2))
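
The split sizes above can be reproduced by hand. The sketch below assumes that, when `test_size` is a fraction and `train_size` is not set, scikit-learn rounds the test count up and gives the remainder to the training set; `split_sizes` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the rest goes to train.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 24032 rows remain in the training data after filtering "#ERROR!".
n_train, n_test = split_sizes(24032, 0.1)
print(n_train, n_test)  # 21628 2404
```

10% of 24,032 is 2,403.2, which rounds up to 2,404 validation rows and leaves 21,628 training rows, matching the shapes shown above.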

In [7]:
train_df.describe()

| | category | texts |
|---|---|---|
| count | 21628 | 21628 |
| unique | 4 | 21612 |
| top | neu | แท็กซี่หรือไม่แท็กซี่ |
| freq | 11795 | 2 |


In [8]:
train_df.category.value_counts() / train_df.shape[0]

neu    0.545358
neg    0.253884
pos    0.178750
q      0.022009
Name: category, dtype: float64

In [9]:
valid_df.describe()

| | category | texts |
|---|---|---|
| count | 2404 | 2404 |
| unique | 4 | 2403 |
| top | neu | สวัสดีค่ะ เราขอสอบถามเรื่องการขึ้นรถไฟฟ้าTHSR ... |
| freq | 1291 | 2 |


In [10]:
valid_df.category.value_counts() / valid_df.shape[0]

neu    0.537022
neg    0.264975
pos    0.180532
q      0.017471
Name: category, dtype: float64

In [11]:
test_df.describe()

| | category | texts |
|---|---|---|
| count | 2671 | 2671 |
| unique | 4 | 2671 |
| top | neu | พล ศรีราชา - จิ๋ว เชียงราย รอบรองชนะเลิศ ระบบ ... |
| freq | 1453 | 1 |


In [12]:
test_df.category.value_counts() / test_df.shape[0]

neu    0.543991
neg    0.255709
pos    0.178959
q      0.021340
Name: category, dtype: float64
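
Because the split is uniformly random rather than stratified, the class proportions differ slightly between splits. The sketch below quantifies that drift using the proportions printed above, copied by hand into plain dictionaries; `max_gap` is a hypothetical helper for this illustration.

```python
# Class proportions copied from the value_counts outputs above.
train = {"neu": 0.545358, "neg": 0.253884, "pos": 0.178750, "q": 0.022009}
valid = {"neu": 0.537022, "neg": 0.264975, "pos": 0.180532, "q": 0.017471}
test = {"neu": 0.543991, "neg": 0.255709, "pos": 0.178959, "q": 0.021340}


def max_gap(a, b):
    """Return the label with the largest absolute proportion gap, and that gap."""
    label = max(a, key=lambda k: abs(a[k] - b[k]))
    return label, abs(a[label] - b[label])


valid_label, valid_gap = max_gap(train, valid)
test_label, test_gap = max_gap(train, test)
```

On these numbers the largest train-validation gap is for `neg`, at about 1.1 percentage points, and the largest train-test gap is also for `neg`, at about 0.2 percentage points.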

In [13]:
# Save each split in JSON Lines format (one JSON record per line)
train_df.to_json("train.jsonl", orient="records", lines=True)
valid_df.to_json("valid.jsonl", orient="records", lines=True)
test_df.to_json("test.jsonl", orient="records", lines=True)
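
To check the saved files, each line can be parsed back as a JSON object. The sketch below uses only the standard library. `read_jsonl` is a hypothetical helper, and the toy file stands in for `train.jsonl` with made-up records. It assumes pandas' default behavior of escaping non-ASCII characters as `\uXXXX` sequences in `to_json`, which `json.loads` decodes back into the original Thai text.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Toy file in the same shape as the saved splits (made-up records).
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy.jsonl"
    toy_path.write_text(
        '{"category":"neu","texts":"\\u0e2a\\u0e27\\u0e31\\u0e2a\\u0e14\\u0e35"}\n'
        '{"category":"pos","texts":"good"}\n',
        encoding="utf-8",
    )
    records = read_jsonl(toy_path)

print(records[0]["texts"])  # สวัสดี
```

With pandas available, `pd.read_json("train.jsonl", lines=True)` should load a saved split back into a DataFrame in one call.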