# Emotion Classifier Example Project

## TODO

This is a **step-by-step** outline without code details, meant to be tried and learned hands-on. The workflow runs in **Jupyter Notebook**, from loading the data to saving the model with `pickle`, followed by a simple inference test:

---

### 🧪 General Steps: Train & Save Model (Train Phase)

1. **Prepare the environment**
   - Activate the virtual environment
   - Install the required libraries (`pandas`, `scikit-learn`, `pyarrow`); `pickle` ships with the Python standard library and needs no installation

2. **Import the required libraries**
   - For example: `pandas`, `numpy`, `sklearn`, `pickle`

3. **Load the dataset**
   - Use `pd.read_parquet()` to read the `.parquet` file
   - Make sure the text column and the emotion label column are correct

4. **Initial exploration**
   - Display a few rows of the data
   - Check the distribution of the emotion labels

5. **Simple preprocessing**
   - Clean the text (remove stray characters, lowercase, etc.)
   - Consider tokenization or stopword removal (optional)

6. **Split the data**
   - Split into training data and test data
   - Also separate the features (`X`) from the target (`y`)

7. **Vectorize the text**
   - Use `TfidfVectorizer` or `CountVectorizer`
   - Adjust it to the needs of the model

8. **Train the model**
   - Choose a model such as `SVM`, `Naive Bayes`, or `Random Forest`
   - Train the model on the training data

9. **Evaluate the model**
   - Test the model's performance on the test data
   - Look at accuracy, precision, recall, etc.

10. **Save the model and vectorizer**
    - Use `pickle.dump()` to save the model and the vectorizer
    - Save them to a `.pkl` file
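
Steps 7, 8, and 10 can be sketched as follows. This is a minimal illustration, not the project's final code: the toy sentences, the `emotion_model.pkl` file name, and the choice to store both objects in one dictionary are assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the real dataset
texts = [
    "i am so happy today",
    "this is wonderful news",
    "i feel sad and alone",
    "what a miserable day",
]
labels = ["happiness", "happiness", "sadness", "sadness"]

# Step 7: turn raw text into TF-IDF features
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)

# Step 8: fit a Naive Bayes classifier on those features
model = MultinomialNB()
model.fit(features, labels)

# Step 10: save both objects together so they cannot get out of sync
output_dir = Path(tempfile.mkdtemp())  # temporary folder, for illustration only
model_path = output_dir / "emotion_model.pkl"  # hypothetical file name
with open(model_path, "wb") as f:
    pickle.dump({"vectorizer": vectorizer, "model": model}, f)

print(f"Saved to {model_path}")
```

Saving the vectorizer alongside the model matters because the model only understands the feature columns produced by the vectorizer it was trained with.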

---

### ⚙️ Inference Steps (Test Phase)

11. **Load the model and vectorizer**
    - Use `pickle.load()` to load the model and the vectorizer again

12. **Enter new text**
    - Type or pass in free text as the user input

13. **Apply preprocessing**
    - Do exactly what was done during training

14. **Transform the text**
    - Use the loaded vectorizer to transform the text

15. **Predict the emotion**
    - Use the model to predict the emotion label of the text

16. **Show the result**
    - Print or display the prediction
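
The inference steps above can be sketched like this. The example trains a tiny stand-in model first so that it runs on its own, and it pickles to bytes in memory rather than to a file; the helper names `preprocess` and `predict_emotion` are hypothetical.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stand-in for the training phase (hypothetical toy data)
train_texts = [
    "i am so happy today",
    "this is wonderful news",
    "i feel sad and alone",
    "what a miserable day",
]
train_labels = ["happiness", "happiness", "sadness", "sadness"]
vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)
saved_bytes = pickle.dumps((vectorizer, model))


def preprocess(text):
    """Apply the same cleaning that was used during training."""
    return text.lower().strip()


def predict_emotion(text, vectorizer, model):
    """Steps 13 to 15: preprocess, transform, and predict one text."""
    features = vectorizer.transform([preprocess(text)])
    return model.predict(features)[0]


# Step 11: load the saved objects again
loaded_vectorizer, loaded_model = pickle.loads(saved_bytes)

# Steps 12 to 16: predict a new sentence and show the result
prediction = predict_emotion("So HAPPY today!", loaded_vectorizer, loaded_model)
print(prediction)
```

Note that the loaded vectorizer is only ever used with `transform`, never `fit_transform`, so the feature columns stay identical to the ones seen during training.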

---

### ✅ Tips:
- Try each step one at a time
- If you hit an error, recheck the previous step
- Play with the dataset and the model to experiment

You now have enough information to continue.

Happy learning and experimenting! If you run into problems or want to move on to Unity/C# integration, I'm ready to help 😄

## Preparation and Setup

In [26]:
import numpy as np
import pandas as pd
import sklearn

In [27]:
emotion_dataframe = pd.read_parquet("emotions_dataset.parquet", engine="pyarrow")

In [28]:
emotion_dataframe.head(10)

Unnamed: 0,Sentence,Label
0,Unfortunately later died from eating tainted m...,happiness
1,Last time I saw was loooong ago. Basically bef...,neutral
2,You mean by number of military personnel? Beca...,neutral
3,Need to go middle of the road no NAME is going...,sadness
4,feel melty miserable enough imagine must,sadness
5,feel sense relief also sadness end colleagues ...,happiness
6,think get feel weird ones use dryers time,surprise
7,If your host stand has a register that isn’t l...,neutral
8,Oh . someone finally posted something I cant b...,surprise
9,feel presence beloved behind tilt neck side sm...,love


In [29]:
# check length of table
len(emotion_dataframe)

131306

In [30]:
emotion_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131306 entries, 0 to 131305
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   Sentence  131306 non-null  object
 1   Label     131306 non-null  object
dtypes: object(2)
memory usage: 2.0+ MB


### Check for null or NaN values

In [31]:
isnull = emotion_dataframe.isnull().sum().sum()
isnull

np.int64(0)

In [32]:
isna = emotion_dataframe.isna().sum().sum()
isna

np.int64(0)

### Check labels

In [33]:
labels = emotion_dataframe["Label"].unique()
labels

array(['happiness', 'neutral', 'sadness', 'surprise', 'love', 'fear',
       'confusion', 'disgust', 'desire', 'shame', 'sarcasm', 'anger',
       'guilt'], dtype=object)
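
Besides listing the unique labels, it can help to see how many rows each label has, since a strongly imbalanced dataset affects how accuracy should be read. A minimal sketch on a hypothetical toy frame with the same `Sentence` and `Label` columns:

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the real dataset
toy_dataframe = pd.DataFrame({
    "Sentence": ["so glad", "what about hot water?", "okay then", "feel miserable"],
    "Label": ["happiness", "neutral", "neutral", "sadness"],
})

# Absolute number of rows per label
label_counts = toy_dataframe["Label"].value_counts()

# Share of each label in the whole dataset
label_shares = toy_dataframe["Label"].value_counts(normalize=True)

print(label_counts)
print(label_shares)
```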

## Preprocessing

### Clean Punctuation

Remove all punctuation except `?` and `!`.

In [36]:
clean_sentence_data = emotion_dataframe.copy()

# remove punctuation other than ! and ?
clean_sentence_data["Sentence"] = clean_sentence_data["Sentence"].str.replace(r'[^\w\s!?]', '', regex=True)

clean_sentence_data.head(10)

Unnamed: 0,Sentence,Label
0,Unfortunately later died from eating tainted m...,happiness
1,Last time I saw was loooong ago Basically befo...,neutral
2,You mean by number of military personnel? Beca...,neutral
3,Need to go middle of the road no NAME is going...,sadness
4,feel melty miserable enough imagine must,sadness
5,feel sense relief also sadness end colleagues ...,happiness
6,think get feel weird ones use dryers time,surprise
7,If your host stand has a register that isnt lo...,neutral
8,Oh someone finally posted something I cant br...,surprise
9,feel presence beloved behind tilt neck side sm...,love
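
To see what the pattern `[^\w\s!?]` does, here is a small sketch on two made-up sentences. It also collapses the double spaces that are left behind when a standalone punctuation mark is removed; that second step is an assumption added for illustration and is not part of the cell above.

```python
import pandas as pd

# Hypothetical sample sentences
samples = pd.Series([
    "Oh . someone finally posted!",
    "If it isn’t here, where is it?",
])

# Drop every character that is not a word character, whitespace, ! or ?
cleaned = samples.str.replace(r"[^\w\s!?]", "", regex=True)

# Optional extra step (assumption): collapse repeated whitespace
cleaned = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()

print(cleaned.tolist())
```

Apostrophes are removed as well, which is why `isn’t` becomes `isnt` in the output table.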


### Lowercase vs. Case-Sensitive

#### Lowercase

In [40]:
lowercase_data = emotion_dataframe.copy()
lowercase_data["Sentence"] = lowercase_data["Sentence"].str.lower()

In [39]:
lowercase_data.head(10)

Unnamed: 0,Sentence,Label
0,unfortunately later died from eating tainted m...,happiness
1,last time i saw was loooong ago. basically bef...,neutral
2,you mean by number of military personnel? beca...,neutral
3,need to go middle of the road no name is going...,sadness
4,feel melty miserable enough imagine must,sadness
5,feel sense relief also sadness end colleagues ...,happiness
6,think get feel weird ones use dryers time,surprise
7,if your host stand has a register that isn’t l...,neutral
8,oh . someone finally posted something i cant b...,surprise
9,feel presence beloved behind tilt neck side sm...,love


#### Case-Sensitive

In [35]:
# keep the original casing as-is
sensi_case_data = emotion_dataframe.copy()
sensi_case_data.head(10)

Unnamed: 0,Sentence,Label
0,Unfortunately later died from eating tainted m...,happiness
1,Last time I saw was loooong ago. Basically bef...,neutral
2,You mean by number of military personnel? Beca...,neutral
3,Need to go middle of the road no NAME is going...,sadness
4,feel melty miserable enough imagine must,sadness
5,feel sense relief also sadness end colleagues ...,happiness
6,think get feel weird ones use dryers time,surprise
7,If your host stand has a register that isn’t l...,neutral
8,Oh . someone finally posted something I cant b...,surprise
9,feel presence beloved behind tilt neck side sm...,love


### Split Data

In [44]:
# lowercase variant
X_low = lowercase_data["Sentence"]
y_low = lowercase_data["Label"]

# case-sensitive variant
X_sensi = sensi_case_data["Sentence"]
y_sensi = sensi_case_data["Label"]


In [41]:
X_low, y_low

(0         unfortunately later died from eating tainted m...
 1         last time i saw was loooong ago. basically bef...
 2         you mean by number of military personnel? beca...
 3         need to go middle of the road no name is going...
 4                  feel melty miserable enough imagine must
                                 ...                        
 131301    yeah, shes reasonable on some issues for sure,...
 131302    >just something motor**sport** fans tell thems...
 131303                                what about hot water?
 131304                 i’d love to learn how to make bread.
 131305    dont give away my zip code! craft beer and fab...
 Name: Sentence, Length: 131306, dtype: object,
 0         happiness
 1           neutral
 2           neutral
 3           sadness
 4           sadness
             ...    
 131301    happiness
 131302    confusion
 131303    confusion
 131304       desire
 131305    happiness
 Name: Label, Length: 131306, dtype: object)

In [45]:
X_sensi, y_sensi

(0         Unfortunately later died from eating tainted m...
 1         Last time I saw was loooong ago. Basically bef...
 2         You mean by number of military personnel? Beca...
 3         Need to go middle of the road no NAME is going...
 4                  feel melty miserable enough imagine must
                                 ...                        
 131301    Yeah, shes reasonable on some issues for sure,...
 131302    >just something motor**sport** fans tell thems...
 131303                                What about hot water?
 131304                 I’d love to learn how to make bread.
 131305    Dont give away my zip code! Craft beer and fab...
 Name: Sentence, Length: 131306, dtype: object,
 0         happiness
 1           neutral
 2           neutral
 3           sadness
 4           sadness
             ...    
 131301    happiness
 131302    confusion
 131303    confusion
 131304       desire
 131305    happiness
 Name: Label, Length: 131306, dtype: object)

In [47]:
from sklearn.model_selection import train_test_split

X_low_train, X_low_test, y_low_train, y_low_test = train_test_split(X_low, y_low, test_size=0.2)

X_sensi_train, X_sensi_test, y_sensi_train, y_sensi_test = train_test_split(X_sensi, y_sensi, test_size=0.2)
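
When two preprocessing variants are compared, the comparison is easier to interpret if both are evaluated on the same rows. One possible approach, sketched here on hypothetical toy data, is to pass the same `random_state` to both calls and to use `stratify` so each label keeps its share in the test set. The value `42` is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: five sentences per label
sentences = pd.Series([
    "What a GREAT day", "I love this", "So happy for you", "Best news ever", "Feeling good",
    "This is awful", "I feel so alone", "Worst day ever", "I miss them", "Everything hurts",
])
labels = pd.Series(["happiness"] * 5 + ["sadness"] * 5)

# Same options for both variants, so both splits pick the same rows
split_options = dict(test_size=0.2, random_state=42, stratify=labels)

X_sensi_train, X_sensi_test, y_sensi_train, y_sensi_test = train_test_split(
    sentences, labels, **split_options
)
X_low_train, X_low_test, y_low_train, y_low_test = train_test_split(
    sentences.str.lower(), labels, **split_options
)

# True when both test sets contain exactly the same row indices
same_test_rows = X_low_test.index.equals(X_sensi_test.index)
print(same_test_rows)
```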

## Tokenization

In [48]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b|[!?]')

In [None]:
X_low_train_token = vectorizer.fit_transform(X_low_train)
X_low_test_token = vectorizer.transform(X_low_test)

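In [None]:
# CountVectorizer lowercases by default, so the case-sensitive variant
# needs its own vectorizer with lowercase=False
vectorizer_sensi = CountVectorizer(token_pattern=r'(?u)\b\w+\b|[!?]', lowercase=False)
X_sensi_train_token = vectorizer_sensi.fit_transform(X_sensi_train)
X_sensi_test_token = vectorizer_sensi.transform(X_sensi_test)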

### Label Encoder

In [52]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

In [53]:
y_low_train_encoded = le.fit_transform(y_low_train)
y_low_test_encoded = le.transform(y_low_test)

In [54]:
y_sensi_train_encoded = le.fit_transform(y_sensi_train)
y_sensi_test_encoded = le.transform(y_sensi_test)
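
`LabelEncoder` maps each label to an integer based on the sorted order of the label names, and `inverse_transform` maps predictions back to readable names. A minimal sketch with hypothetical labels:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical training labels
train_labels = ["sadness", "happiness", "neutral", "sadness"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(train_labels)

# classes_ holds the label names in sorted order; the position is the code
print(encoder.classes_)
print(encoded)

# Turn integer predictions back into label names
decoded = encoder.inverse_transform([0, 2])
print(decoded)
```

Because the codes follow the sorted label names, refitting the encoder on another split gives the same mapping as long as that split contains the same set of labels.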

## Training

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [57]:
model_low = MultinomialNB()

# training
model_low.fit(X_low_train_token, y_low_train_encoded)

In [58]:
model_sensi = MultinomialNB()

# training
model_sensi.fit(X_sensi_train_token, y_sensi_train_encoded)
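
The next step in the plan is evaluation on the test data. The sketch below shows one way to do it with `accuracy_score` and `classification_report`; it uses hypothetical toy data so that it runs on its own, and the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the real train/test split
train_texts = [
    "i am so happy today",
    "this is wonderful news",
    "i feel sad and alone",
    "what a miserable day",
]
train_labels = ["happiness", "happiness", "sadness", "sadness"]
test_texts = ["so happy", "sad and alone"]
test_labels = ["happiness", "sadness"]

# Fit on the training data only, then reuse the vectorizer for the test data
vectorizer = CountVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform(train_texts), train_labels)

# Predict the held-out sentences and compare with the true labels
predictions = model.predict(vectorizer.transform(test_texts))
accuracy = accuracy_score(test_labels, predictions)
report = classification_report(test_labels, predictions, zero_division=0)

print(f"accuracy: {accuracy:.2f}")
print(report)
```

The report lists precision, recall, and F1 per label, which is more informative than accuracy alone when some emotions are much rarer than others.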