## Deteksi Emosi Pengguna Twitter

Deteksi emosi merupakan salah satu permasalahan yang dihadapi pada ***Natural Language Processing*** (NLP). Alasanya diantaranya adalah kurangnya dataset berlabel untuk mengklasifikasikan emosi berdasarkan data twitter. Selain itu, sifat dari data twitter yang dapat memiliki banyak label emosi (***multi-class***). Manusia memiliki berbagai emosi dan sulit untuk mengumpulkan data yang cukup untuk setiap emosi. Oleh karena itu, masalah ketidakseimbangan kelas akan muncul (***class imbalance***). Pada Ujian Tengah Semester (UTS) kali ini, Anda telah disediakan dataset teks twitter yang sudah memiliki label untuk beberapa kelas emosi. Tugas utama Anda adalah membuat model yang mumpuni untuk kebutuhan klasifikasi emosi berdasarkan teks.

### Informasi Data

Dataset yang akan digunakan adalah ***tweet_emotion.csv***. Berikut merupakan informasi tentang dataset yang dapat membantu Anda.

- Total data: 40000 data
- Label emosi: anger, boredom, empty, enthusiasm, fun, happiness, hate, love, neutral, relief, sadness, surprise, worry
- Jumlah data untuk setiap label tidak sama (***class imbalance***)
- Terdapat 3 kolom = 'tweet_id', 'sentiment', 'content'

### Penilaian UTS

UTS akan dinilai berdasaarkan 4 proses yang akan Anda lakukan, yaitu pra pengolahan data, ektraksi fitur, pembuatan model machine learning, dan evaluasi.

#### Pra Pengolahan Data

> **Perhatian**
> 
> Sebelum Anda melakukan sesuatu terhadap data Anda, pastikan data yang Anda miliki sudah "baik", bebas dari data yang hilang, menggunakan tipe data yang sesuai, dan sebagainya.
>

Data tweeter yang ada dapatkan merupakan sebuah data mentah, maka beberapa hal dapat Anda lakukan (namun tidak terbatas pada) yaitu,

1. Case Folding
2. Tokenizing
3. Filtering
4. Stemming

*CATATAN: PADA DATA TWITTER TERDAPAT *MENTION* (@something) YANG ANDA HARUS TANGANI SEBELUM MASUK KE TAHAP EKSTRAKSI FITUR*

#### Ekstrasi Fitur

Anda dapat menggunakan beberapa metode, diantaranya

1. Bag of Words (Count / TF-IDF)
2. N-gram
3. dan sebagainya

#### Pembuatan Model

Anda dibebaskan dalam memilih algoritma klasifikasi. Anda dapat menggunakan algoritma yang telah diajarkan didalam kelas atau yang lain, namun dengan catatan. Berdasarkan asas akuntabilitas pada pengembangan model machine learning, Anda harus dapat menjelaskan bagaimana model Anda dapat menghasilkan nilai tertentu.

#### Evaluasi

Pada proses evaluasi, minimal Anda harus menggunakan metric akurasi. Akan tetapi Anda juga dapat menambahkan metric lain seperti Recall, Precision, F1-Score, detail Confussion Metric, ataupun Area Under Curve (AUC).

### Lembar Pengerjaan
Lembar pengerjaan dimulai dari cell dibawah ini

In [1]:
import numpy as np
import pandas as pd

In [14]:
df = pd.read_csv('data/tweet_emotions.csv')

display(df.head())

jml_baris_asli = df.shape[0]
print(f'Jumlah baris: {jml_baris_asli}')

Unnamed: 0,tweet_id,sentiment,content
0,1956967341,empty,@tiffanylue i know i was listenin to bad habi...
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin o...
2,1956967696,sadness,Funeral ceremony...gloomy friday...
3,1956967789,enthusiasm,wants to hang out with friends SOON!
4,1956968416,neutral,@dannycastillo We want to trade with someone w...


Jumlah baris: 40000


#### Preprocessing
##### Remove Duplicates

In [16]:
# Drop twit yang sama
df.drop_duplicates(subset=['content'], inplace=True)

# Cek jumlah data
jml_baris_drop = df.shape[0]
print(f'Jumlah baris: {jml_baris_drop}')
print(f'Jumlah baris duplikasi {jml_baris_asli - jml_baris_drop}')

Jumlah baris: 39827
Jumlah baris duplikasi 173


##### Remove Mention (@) and URL

Remove words begin with @ and URL such as http, https, ftp, ftps, and www

In [41]:
import re # python regex lib

df = df.copy()

# Membuat kolom baru untuk kebutuhan berbandingan
df['content_clean'] = df['content']

# Membuat fungsi lambda untuk membuat mention, url
rm_rt_url = lambda x: re.sub('(@[A-Za-z0–9\w]+) | (@\w+:) | (\w+:\/\/\S+) | (www.\S+)',' ',x)
rm_punct = lambda x: re.sub('\W', ' ', x)

# Membuat fungsi untuk membuang protocol internet

# Map filter
df['content_clean'] = df.content_clean.map(rm_rt_url).map(rm_punct)
df.head(100)

Unnamed: 0,tweet_id,sentiment,content,content_clean
0,1956967341,empty,@tiffanylue i know i was listenin to bad habi...,i know i was listenin to bad habit earlier a...
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin o...,Layin n bed with a headache ughhhh waitin o...
2,1956967696,sadness,Funeral ceremony...gloomy friday...,Funeral ceremony gloomy friday
3,1956967789,enthusiasm,wants to hang out with friends SOON!,wants to hang out with friends SOON
4,1956968416,neutral,@dannycastillo We want to trade with someone w...,We want to trade with someone who has Houston...
...,...,...,...,...
95,1956989514,sadness,@sweeetnspicy hiii im on my ipod...i cant fall...,hiii im on my ipod i cant fall asleep
96,1956989526,sadness,dont wanna work 11-830 tomorrow but i get paid,dont wanna work 11 830 tomorrow but i get paid
97,1956989560,sadness,feels sad coz i wasnt able to play with the gu...,feels sad coz i wasnt able to play with the gu...
98,1956989561,neutral,PrinceCharming,PrinceCharming


##### Case Folding

Make all twit to lowercase

In [43]:
df['content_clean'] = df.content_clean.str.lower()

df.tail(10)

Unnamed: 0,tweet_id,sentiment,content,content_clean
39990,1753918829,neutral,@shonali I think the lesson of the day is not ...,i think the lesson of the day is not to have ...
39991,1753918846,neutral,@lovelylisaj can you give me the link for the ...,can you give me the link for the kimba diarie...
39992,1753918881,neutral,@jasimmo Ooo showing of your French skills!! l...,ooo showing of your french skills lol thing...
39993,1753918892,neutral,"@sendsome2me haha, yeah. Twitter has many uses...",haha yeah twitter has many uses for me it ...
39994,1753918900,happiness,Succesfully following Tayla!!,succesfully following tayla
39995,1753918954,neutral,@JohnLloydTaylor,johnlloydtaylor
39996,1753919001,love,Happy Mothers Day All my love,happy mothers day all my love
39997,1753919005,love,Happy Mother's Day to all the mommies out ther...,happy mother s day to all the mommies out ther...
39998,1753919043,happiness,@niariley WASSUP BEAUTIFUL!!! FOLLOW ME!! PEE...,wassup beautiful follow me peep out my ...
39999,1753919049,love,@mopedronin bullet train from tokyo the gf ...,bullet train from tokyo the gf and i have ...


##### Select Label and Features

In [51]:
from sklearn.preprocessing import LabelEncoder

X = df['content_clean'].values
y = df['sentiment'].values

print(X.shape)
print(y.shape)

# Encode Label
encoder = LabelEncoder()

y = encoder.fit_transform(y)
print(y.shape)

(39827,)
(39827,)
(39827,)


#### Feature Extraction

Extract TF-IDF with CountVectorizer without Stemming / Lemmatization

In [78]:
# Split Train Test
# Split dulu untuk menghindari leaking information
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)

##### 1st Experiment - Just Counting

In [79]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(stop_words='english')

X_train_count = count_vect.fit_transform(X_train)
X_test_count = count_vect.transform(X_test)

##### 2nd Experiment - TF-IDF

In [80]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english')

X_train_tfidf = tfidf_vect.fit_transform(X_train)
X_test_tfidf = tfidf_vect.transform(X_test)

#### Classification
##### Naive Bayes

In [82]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

## 1st Experiment
mnb = MultinomialNB()

mnb.fit(X_train_count, y_train)

# Predict train
y_count_train = mnb.predict(X_train_count)
acc_count_train = accuracy_score(y_train, y_count_train)

# Predict test
y_count_test = mnb.predict(X_test_count)
acc_count_test = accuracy_score(y_test, y_count_test)

# Print Hasil
print(f'Akurasi fitur count (training): {acc_count_train}')
print(f'Akurasi fitur count (testing): {acc_count_test}')


# =================== #


## 2nd Experiment
mnb_tf = MultinomialNB()

mnb_tf.fit(X_train_tfidf, y_train)

# Predict train
y_tf_train = mnb_tf.predict(X_train_tfidf)
acc_tf_train = accuracy_score(y_train, y_tf_train)

# Predict test
y_tf_test = mnb_tf.predict(X_test_tfidf)
acc_tf_test = accuracy_score(y_test, y_tf_test)

# Print Hasil
print('==========')
print(f'Akurasi fitur TF-IDF (training): {acc_tf_train}')
print(f'Akurasi fitur TF-IDF (testing): {acc_tf_test}')

Akurasi fitur count (training): 0.5519286902482659
Akurasi fitur count (testing): 0.3172231985940246
Akurasi fitur TF-IDF (training): 0.4586798907755563
Akurasi fitur TF-IDF (testing): 0.3024102435350238


##### SVM Linear

In [83]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

## 1st Experiment
svc = SVC(kernel="linear")

svc.fit(X_train_count, y_train)

# Predict train
y_count_train = svc.predict(X_train_count)
acc_count_train = accuracy_score(y_train, y_count_train)

# Predict test
y_count_test = svc.predict(X_test_count)
acc_count_test = accuracy_score(y_test, y_count_test)

# Print Hasil
print(f'Akurasi fitur count (training): {acc_count_train}')
print(f'Akurasi fitur count (testing): {acc_count_test}')


# =================== #


## 2nd Experiment
svc_tf = SVC()

svc_tf.fit(X_train_tfidf, y_train)

# Predict train
y_tf_train = svc_tf.predict(X_train_tfidf)
acc_tf_train = accuracy_score(y_train, y_tf_train)

# Predict test
y_tf_test = svc_tf.predict(X_test_tfidf)
acc_tf_test = accuracy_score(y_test, y_tf_test)

# Print Hasil
print('==========')
print(f'Akurasi fitur TF-IDF (training): {acc_tf_train}')
print(f'Akurasi fitur TF-IDF (testing): {acc_tf_test}')

Akurasi fitur count (training): 0.8027368883588085
Akurasi fitur count (testing): 0.32011046949535527
Akurasi fitur TF-IDF (training): 0.7470889174853269
Akurasi fitur TF-IDF (testing): 0.3471001757469244


#### Another Approach

Karena hasil kurang oke, maka mungkin kita perlu Stemming atau Lemmatization. Stemming dan Lemmatization adalah sesuatu yang berbeda

In [101]:
# Perform Stemming vs. Lemmatization

# Copy DataFrame
df_stem = df.copy()

In [68]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /Users/afif/nltk_data...
[nltk_data] Downloading package omw-1.4 to /Users/afif/nltk_data...


True

In [102]:
# Tokenized
from nltk.tokenize import TweetTokenizer

tweet_token = TweetTokenizer()
df_stem['content_token'] = df_stem['content_clean'].apply(tweet_token.tokenize)

df_stem.head()

Unnamed: 0,tweet_id,sentiment,content,content_clean,content_token
0,1956967341,empty,@tiffanylue i know i was listenin to bad habi...,i know i was listenin to bad habit earlier a...,"[i, know, i, was, listenin, to, bad, habit, ea..."
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin o...,layin n bed with a headache ughhhh waitin o...,"[layin, n, bed, with, a, headache, ughhhh, wai..."
2,1956967696,sadness,Funeral ceremony...gloomy friday...,funeral ceremony gloomy friday,"[funeral, ceremony, gloomy, friday]"
3,1956967789,enthusiasm,wants to hang out with friends SOON!,wants to hang out with friends soon,"[wants, to, hang, out, with, friends, soon]"
4,1956968416,neutral,@dannycastillo We want to trade with someone w...,we want to trade with someone who has houston...,"[we, want, to, trade, with, someone, who, has,..."


In [103]:
# Stemming
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

def stemming(text):
    stem_text = [stemmer.stem(word) for word in text]
    return stem_text

df_stem['content_stem'] = df_stem['content_token'].apply(lambda x: stemming(x))

df_stem.head()

Unnamed: 0,tweet_id,sentiment,content,content_clean,content_token,content_stem
0,1956967341,empty,@tiffanylue i know i was listenin to bad habi...,i know i was listenin to bad habit earlier a...,"[i, know, i, was, listenin, to, bad, habit, ea...","[i, know, i, was, listenin, to, bad, habit, ea..."
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin o...,layin n bed with a headache ughhhh waitin o...,"[layin, n, bed, with, a, headache, ughhhh, wai...","[layin, n, bed, with, a, headach, ughhhh, wait..."
2,1956967696,sadness,Funeral ceremony...gloomy friday...,funeral ceremony gloomy friday,"[funeral, ceremony, gloomy, friday]","[funer, ceremoni, gloomi, friday]"
3,1956967789,enthusiasm,wants to hang out with friends SOON!,wants to hang out with friends soon,"[wants, to, hang, out, with, friends, soon]","[want, to, hang, out, with, friend, soon]"
4,1956968416,neutral,@dannycastillo We want to trade with someone w...,we want to trade with someone who has houston...,"[we, want, to, trade, with, someone, who, has,...","[we, want, to, trade, with, someon, who, has, ..."


In [104]:
# Lemmatization
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def lemmatizing(text):
    lemm_text = [wnl.lemmatize(word) for word in text]
    return lemm_text

df_stem['content_lemm'] = df_stem['content_token'].apply(lambda x: lemmatizing(x))

df_stem.head()

Unnamed: 0,tweet_id,sentiment,content,content_clean,content_token,content_stem,content_lemm
0,1956967341,empty,@tiffanylue i know i was listenin to bad habi...,i know i was listenin to bad habit earlier a...,"[i, know, i, was, listenin, to, bad, habit, ea...","[i, know, i, was, listenin, to, bad, habit, ea...","[i, know, i, wa, listenin, to, bad, habit, ear..."
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin o...,layin n bed with a headache ughhhh waitin o...,"[layin, n, bed, with, a, headache, ughhhh, wai...","[layin, n, bed, with, a, headach, ughhhh, wait...","[layin, n, bed, with, a, headache, ughhhh, wai..."
2,1956967696,sadness,Funeral ceremony...gloomy friday...,funeral ceremony gloomy friday,"[funeral, ceremony, gloomy, friday]","[funer, ceremoni, gloomi, friday]","[funeral, ceremony, gloomy, friday]"
3,1956967789,enthusiasm,wants to hang out with friends SOON!,wants to hang out with friends soon,"[wants, to, hang, out, with, friends, soon]","[want, to, hang, out, with, friend, soon]","[want, to, hang, out, with, friend, soon]"
4,1956968416,neutral,@dannycastillo We want to trade with someone w...,we want to trade with someone who has houston...,"[we, want, to, trade, with, someone, who, has,...","[we, want, to, trade, with, someon, who, has, ...","[we, want, to, trade, with, someone, who, ha, ..."


In [120]:
# Select label and Token
X_stem = df_stem['content_stem'].values
X_lemm = df_stem['content_lemm'].values
y = df_stem['sentiment'].values

y = encoder.fit_transform(y)

In [121]:
# Split Train Test
X_stem_train, X_stem_test, y_stem_train, y_stem_test = train_test_split(X_stem, y, test_size=0.3, random_state=20)
X_lemm_train, X_lemm_test, y_lemm_train, y_lemm_test = train_test_split(X_lemm, y, test_size=0.3, random_state=20)

In [126]:
# CountVectorizer


count_stem = CountVectorizer(preprocessor=' '.join, stop_words='english')

## Stem
X_stem_train_count = count_stem.fit_transform(X_stem_train)
X_stem_test_count = count_stem.transform(X_stem_test)

## Lemm
X_lemm_train_count = count_stem.fit_transform(X_lemm_train)
X_lemm_test_count = count_stem.transform(X_lemm_test)

In [130]:
# Classification

## Stem
mnb_stem = MultinomialNB()

mnb_stem.fit(X_stem_train_count, y_stem_train)

y_stem_count_train_pred = mnb_stem.predict(X_stem_train_count)
acc_stem_count_train = accuracy_score(y_stem_train, y_stem_count_train_pred)
print(acc_stem_count_train)

y_stem_count_test_pred = mnb_stem.predict(X_stem_test_count)
acc_stem_count_test = accuracy_score(y_stem_test, y_stem_count_test_pred)
print(acc_stem_count_test)

## Lemm
mnb_lemm = MultinomialNB()

mnb_lemm.fit(X_lemm_train_count, y_lemm_train)

y_lemm_count_train_pred = mnb_lemm.predict(X_lemm_train_count)
acc_lemm_count_train = accuracy_score(y_lemm_train, y_lemm_count_train_pred)
print('=======')
print(acc_stem_count_train)

y_lemm_count_test_pred = mnb_lemm.predict(X_lemm_test_count)
acc_lemm_count_test = accuracy_score(y_lemm_test, y_lemm_count_test_pred)
print(acc_lemm_count_test)

0.5309204390558864
0.32036153653025357
0.5309204390558864
0.3187714453092309
