## Twitter User Emotion Detection

Emotion detection is one of the open problems in ***Natural Language Processing*** (NLP). One reason is the shortage of labeled datasets for classifying emotion from Twitter data. Another is that Twitter data can carry many emotion labels (***multi-class***). Humans have a wide range of emotions, and it is hard to collect enough data for each one, so the ***class imbalance*** problem arises. For this Midterm Exam (Ujian Tengah Semester, UTS), you are given a dataset of Twitter text that is already labeled with several emotion classes. Your main task is to build a capable model for classifying emotion from text.

### Data Information

The dataset to be used is ***tweet_emotion.csv***. The following information about the dataset may help you.

- Total data: 40000 rows
- Emotion labels: anger, boredom, empty, enthusiasm, fun, happiness, hate, love, neutral, relief, sadness, surprise, worry
- The number of rows per label is not equal (***class imbalance***)
- There are 3 columns: 'tweet_id', 'sentiment', 'content'
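
Because the labels are imbalanced, it can help to look at the per-class counts before modelling. The following is a minimal sketch, assuming the data sits in a pandas DataFrame with a `sentiment` column as described above; the five-row frame is made-up stand-in data, not the real dataset.

```python
import pandas as pd

# Made-up stand-in for the real CSV; only the column names follow the description above.
df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4, 5],
    "sentiment": ["worry", "neutral", "worry", "anger", "worry"],
    "content": ["a", "b", "c", "d", "e"],
})

# Absolute count per label, most frequent first.
counts = df["sentiment"].value_counts()

# Share of each label; a large gap between the largest and smallest share signals imbalance.
shares = df["sentiment"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

On the real data, the same two calls on the `sentiment` column would show which emotions are rare enough to need extra care during evaluation.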

### UTS Grading

The UTS will be graded on the 4 processes you carry out: data preprocessing, feature extraction, machine learning model building, and evaluation.

#### Data Preprocessing

> **Attention**
> 
> Before you do anything with your data, make sure it is in "good" shape: free of missing values, using appropriate data types, and so on.
>

The Twitter data you receive is raw, so there are several things you can do, including (but not limited to):

1. Case Folding
2. Tokenizing
3. Filtering
4. Stemming

*NOTE: TWITTER DATA CONTAINS MENTIONS (@something) THAT YOU MUST HANDLE BEFORE MOVING ON TO THE FEATURE EXTRACTION STAGE*
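
The four steps and the mention handling can be sketched with the standard library alone. This is an illustration under stated assumptions, not a reference solution: `STOP_WORDS` is a tiny made-up list, and `naive_stem` is a crude suffix stripper standing in for a real stemmer.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller list).
STOP_WORDS = {"i", "a", "the", "to", "is", "am", "with", "so"}

def naive_stem(token):
    """Crude suffix stripping, standing in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(tweet):
    text = tweet.lower()                                   # 1. case folding
    text = re.sub(r"@\w+", " ", text)                      # drop mentions before tokenizing
    tokens = re.findall(r"[a-z]+", text)                   # 2. tokenizing (letters only)
    tokens = [t for t in tokens if t not in STOP_WORDS]    # 3. filtering
    return [naive_stem(t) for t in tokens]                 # 4. stemming

result = preprocess("@someone I am waiting with friends, so EXCITED!")
print(result)  # ['wait', 'friend', 'excit']
```

Removing mentions before tokenizing keeps user names out of the vocabulary, where they would otherwise appear as rare, uninformative features.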

#### Feature Extraction

You may use several methods, including:

1. Bag of Words (Count / TF-IDF)
2. N-gram
3. and so on
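
To make the first two methods concrete, the sketch below counts unigrams (a bag of words) and bigrams by hand for one made-up sentence. The helper `ngrams` is hypothetical; in practice scikit-learn's `CountVectorizer` with an `ngram_range` argument produces counts of this kind across a whole corpus.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "so happy so happy today".split()

bag_of_words = Counter(tokens)        # unigram counts
bigrams = Counter(ngrams(tokens, 2))  # bigram counts

print(bag_of_words)
print(bigrams)
```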

#### Model Building

You are free to choose the classification algorithm. You may use an algorithm taught in class or another one, with one condition: following the principle of accountability in machine learning model development, you must be able to explain how your model arrives at a given value.

#### Evaluation

For evaluation, you must at least use the accuracy metric. You may also add other metrics such as Recall, Precision, F1-Score, a detailed Confusion Matrix, or Area Under Curve (AUC).
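
With imbalanced classes, accuracy alone can hide the fact that rare classes are never predicted. The sketch below uses made-up labels to compare accuracy with macro-averaged F1, which weights every class equally. Note that scikit-learn's metric functions take the true labels as the first argument and the predictions as the second.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up labels for three classes (0, 1, 2); class 0 dominates.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores a class that is never predicted as 0 instead of warning.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred)

print(acc)
print(macro_f1)
print(cm)
```

Here accuracy is 0.625 while macro F1 is noticeably lower, because class 2 is never predicted and contributes an F1 of zero.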

In [3]:
import numpy as np # Linear Algebra
import pandas as pd # Data processing
from wordcloud import WordCloud # Visualization
import regex as re # Text processing
import matplotlib.pyplot as plt # Visualization
from textblob import Word, TextBlob # Text features
import nltk # Text Manipulation
nltk.download('wordnet') # Download wordnet
nltk.download('punkt') # Download punkt
nltk.download("stopwords") # Download Stopwords
nltk.download("omw-1.4")  # Download omw-1.4
from nltk.corpus import words as dict_words #contains english word dictionary
from nltk.corpus import stopwords #contains stop words
from sklearn.model_selection import train_test_split # split out data into training and testing sets
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

[nltk_data] Error loading wordnet: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
[nltk_data] Error loading punkt: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
[nltk_data] Error loading omw-1.4: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


In [4]:
original_data_frame = pd.read_csv("data/tweet_emotions.csv")
# Make a copy of the dataframe for manipulation
dataset = original_data_frame.copy()
dataset = dataset[['content','sentiment']]

In [5]:
dataset.head(100)

Unnamed: 0,content,sentiment
0,@tiffanylue i know i was listenin to bad habi...,empty
1,Layin n bed with a headache ughhhh...waitin o...,sadness
2,Funeral ceremony...gloomy friday...,sadness
3,wants to hang out with friends SOON!,enthusiasm
4,@dannycastillo We want to trade with someone w...,neutral
...,...,...
95,@sweeetnspicy hiii im on my ipod...i cant fall...,sadness
96,dont wanna work 11-830 tomorrow but i get paid,sadness
97,feels sad coz i wasnt able to play with the gu...,sadness
98,PrinceCharming,neutral


CLEANING DATA

In [6]:
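def remove_unwanted_text(content):
  '''
  Removes unwanted text from content using regex
  Input
  content: A string
  Output
  pa: the final parsed string
  '''
  handle = re.sub(r'@[^\s]+', '', content)   # mentions (@user)
  link = re.sub(r'http[^\s]+', '', handle)   # text from 'http' up to the next whitespace (links)
  link = re.sub(r'www[^\s]+', '', link)      # text from 'www' up to the next whitespace (links)
  ht = re.sub(r'#[^\s]+', '', link)          # hashtags
  final = re.sub(r'&[^\s]+', '', ht)         # HTML entities such as &amp;
  # Caution: this also strips the first token whenever the text does not start with whitespace
  pa = re.sub(r'^[^\s]+', '', final)

  return pa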
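
The patterns used in `remove_unwanted_text` can be tried one at a time to see what each removes. The sketch below is an illustration on a made-up tweet and uses the standard-library `re` module (the notebook imports `regex as re`; for these patterns the two are assumed to behave the same).

```python
import re

sample = "layin n bed @friend http://example.com #tired &amp; done"

no_mentions = re.sub(r'@[^\s]+', '', sample)
no_links = re.sub(r'http[^\s]+', '', no_mentions)
no_hashtags = re.sub(r'#[^\s]+', '', no_links)
no_entities = re.sub(r'&[^\s]+', '', no_hashtags)

# '^[^\s]+' matches the first run of non-space characters at the start of the string.
no_first_token = re.sub(r'^[^\s]+', '', no_entities)

print(no_entities.split())     # tokens left after the first four patterns
print(no_first_token.split())  # tokens left after the last pattern
```

The last pattern drops the leading word `layin`, which matches the cleaned rows shown later in the notebook, where tweets that did not begin with a mention have lost their first word.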

In [7]:
def stem(words):
  '''
  Input
  words: words to be processed
  Output
  returns array of stemmed words
  '''
  tb = TextBlob(' '.join(words))
  return [w for w in tb.words.stem()]

def lemmatize(words):
  '''
  Input
  words: words to be processed
  Output
  returns array of lemmatized words
  '''
  tb = TextBlob(' '.join(words))
  return [w for w in tb.words.lemmatize()]

def correct_words(words):
  '''
  Corrects a word using TextBlob
  Input
  words: a list of words
  Output
  returns corrected words
  '''
  return [Word(w).correct() for w in words]
    
def word_frequency(text):
  '''
  Counts the words in a dataframe
  Input
  text: text to be counted
  Output
  word frequency in text
  '''
 # Tokenization
  tb = TextBlob(text)
  return tb.word_counts

def remove_punctuations(words):
  '''
  Input
  words: A list of words to be processed
  Output
  returns a list of words that punctuations and numbers have been removed. 
  '''
  new_words = []
  for w in words:
      l = re.sub('[^A-Za-z ]+', '', w)
      if l != '':
          new_words.append(l)
          
  return new_words

def remove_stop_words(words):
  '''
  Input
  words: Words to be processed
  Output
  returns a list of words without english stopwords
  '''
  sw = stopwords.words("english") # English Stop Words
  sw.append('could') # should be in sw since wouldn't and couldn't are in (lemmatization or stem don't convert)
  sw.append('would') # should be in sw since wouldn't and couldn't are in (lemmatization or stem don't convert)
  sw.append('nt') # nt appears a lot
  sw.append('im') # im appears a lot
  sw = remove_punctuations(sw)

  return [w for w in words if w.lower() not in sw]

def clean_data(content):
  '''
  Cleans the incoming data
  Input
  content: a dataframe series
  '''
  for i in range(len(content)):
      tweet = remove_unwanted_text(content[i].lower())
      tb = TextBlob(tweet)
      words = remove_punctuations(tb.words)
      words = remove_stop_words(words)

      lemmatized_words = lemmatize(words) # for the most part this was good
      words = [w for w in lemmatized_words if len(w)>1] # removes any character of len 1 usually a "u" or "n"
      dataset.loc[i, 'content'] = " ".join(words)

def get_array(text):
  '''
  Returns the text into an array format
  '''
  return text.split(" ")      

def get_entire_text(content):
  '''
  Return the entire text in a dataframe series
  '''
  entire_text = ""
  for i in content:
      entire_text += i + " "
  return entire_text

def display_top_words(wf, k=20):
    '''
    Displays the top words in a data set
    Input
    k: Number of top words to display. Defaults to 20.
    Output displays top [k] values
    '''
    wf_sorted = sorted(wf.items(), key=lambda x: x[1], reverse=True)
    wf_sorted_greater_than_1 = [w for w in wf_sorted if w[1] > 1]
    top = wf_sorted_greater_than_1[:k:]
    x = [x[0] for x in top]
    y = [y[1] for y in top]
    %matplotlib inline
    plt.figure(figsize=(20,10))
    plt.bar(x, y)
    plt.xticks(rotation=45)
    plt.show()

def display_wordcloud(words):
    '''
    Displays wordcloud
    '''
    wc = WordCloud()
    wc.generate(words)
    %matplotlib inline 
    plt.imshow(wc)

In [8]:
clean_data(dataset['content']) # cleans the data
text = get_entire_text(dataset['content'])
wf = word_frequency(text) # Word Frequency 

In [9]:
dataset

Unnamed: 0,content,sentiment
0,know listenin bad habit earlier started freaki...,empty
1,bed headache ughhhh waitin call,sadness
2,ceremony gloomy friday,sadness
3,hang friend soon,enthusiasm
4,want trade someone houston ticket one,neutral
...,...,...
39995,,neutral
39996,mother day love,love
39997,mother day mommy woman man long momma someone day,love
39998,wassup beautiful follow peep new hit single de...,happiness


PROCESSING DATA

In [10]:
from sklearn import preprocessing
X = dataset['content'].values
lbl_enc = preprocessing.LabelEncoder()
y = lbl_enc.fit_transform(dataset.sentiment.values)
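
`LabelEncoder` assigns integer codes in sorted label order, which is how the `label` column and the rows and columns of the confusion matrices in this notebook can be read. A small sketch using the thirteen label names listed in the data description:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["anger", "boredom", "empty", "enthusiasm", "fun", "happiness", "hate",
          "love", "neutral", "relief", "sadness", "surprise", "worry"]

enc = LabelEncoder()
enc.fit(labels)

# classes_ is sorted, so the position of a label is its integer code.
mapping = {label: code for code, label in enumerate(enc.classes_)}
print(mapping)

# inverse_transform turns integer predictions back into label names.
decoded = list(enc.inverse_transform([2, 10, 12]))
print(decoded)
```

This agrees with the table below, where `empty` is encoded as 2, `sadness` as 10, and `worry` as 12.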


In [11]:
dataset['label'] = y

In [12]:
dataset.head(10)

Unnamed: 0,content,sentiment,label
0,know listenin bad habit earlier started freaki...,empty,2
1,bed headache ughhhh waitin call,sadness,10
2,ceremony gloomy friday,sadness,10
3,hang friend soon,enthusiasm,3
4,want trade someone houston ticket one,neutral,8
5,go prom bc bf like friend,worry,12
6,sleep thinking old friend want married damn wa...,sadness,10
7,,worry,12
8,charlene love miss,sadness,10
9,sorry least friday,sadness,10


In [13]:
from sklearn.model_selection import train_test_split
x1_train,x1_test,y1_train,y1_test = train_test_split(X,y,test_size=1/3, random_state=0)
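
The split above is not stratified. As an optional variation (not what this notebook ran), `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts, which can matter when some classes are rare. A minimal sketch on made-up data:

```python
from sklearn.model_selection import train_test_split

# Made-up data: six samples of class 0 and three of class 1.
X_demo = list(range(9))
y_demo = [0] * 6 + [1] * 3

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, random_state=0, stratify=y_demo
)

# Both parts keep the 2:1 class ratio of the full data.
print(sorted(y_tr))
print(sorted(y_te))
```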

In [14]:
x1_test.shape

(13334,)

In [15]:
x1_train.shape

(26666,)

In [16]:
y1_train.shape

(26666,)

In [17]:
y1_test.shape

(13334,)

In [18]:
from sklearn.naive_bayes import MultinomialNB 
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

Extracting with TF-IDF

In [19]:
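#Extracting TF-IDF parameters
tfidf = TfidfVectorizer(max_features=1000, analyzer='word', ngram_range=(1,3))
x_train_tfidf = tfidf.fit_transform(x1_train)
# Caution: fit_transform refits the vocabulary on the test split, so these columns do not
# line up with x_train_tfidf; tfidf.transform(x1_test) would reuse the training vocabulary
x_val_tfidf = tfidf.fit_transform(x1_test)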
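
A vectorizer is normally fitted on the training text only and then reused, unchanged, on the test text, so that each column means the same term in both matrices. The sketch below illustrates this with made-up documents and shows how a second fit on the test text yields a different vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["happy day", "sad day", "happy happy friend"]
test_docs = ["sad friend", "angry day"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)  # learns vocabulary and IDF from training text only
X_test = vec.transform(test_docs)        # reuses that vocabulary; unseen words are dropped

print(sorted(vec.vocabulary_))
print(X_train.shape, X_test.shape)

# Refitting on the test text learns a different vocabulary,
# so a given column index no longer refers to the same word.
refit = TfidfVectorizer().fit(test_docs)
print(sorted(refit.vocabulary_))
print(vec.vocabulary_["day"], refit.vocabulary_["day"])
```

A classifier trained on `X_train` expects the column order of `vec`; features produced by `refit` would have the same width here but a different meaning per column.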

Extracting with CountVector

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
#Extracting Count Vectors Parameters
count_vect = CountVectorizer(analyzer='word')
count_vect.fit(dataset['content'])
x_train_count = count_vect.transform(x1_train)
x_val_count = count_vect.transform(x1_test)

In [21]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

MODEL MULTINOMIAL

In [22]:

nb = MultinomialNB()
nb.fit(x_train_tfidf, y1_train)
y_pred = nb.predict(x_val_tfidf)
print('naive bayes tfidf accuracy %s' % accuracy_score(y_pred, y1_test))

naive bayes tfidf accuracy 0.20743962801859908


LINEAR SVM

In [23]:
# Model 2: Linear SVM
lsvm = SGDClassifier(alpha=0.001, random_state=5, max_iter=15, tol=None)
lsvm.fit(x_train_tfidf, y1_train)
y_pred = lsvm.predict(x_val_tfidf)
print(accuracy_score(y_pred, y1_test))
# Predictions are passed first, so rows of the matrix are predicted labels and columns are true labels
print(confusion_matrix(y_pred, y1_test))
print(classification_report(y_pred, y1_test))

0.1735413229338533
[[   0    0    1    1    2    2    0    2    4    0    0    0    1]
 [   0    0    4    0    0    9    2    2    7    0    6    0   10]
 [   1    1    3    4   12   36    6   31   51    6   42   14   50]
 [   1    1    4    7   15   26    2   25   27    6   22    8   39]
 [   2    0    4    8   19   41   10   42   57   23   43   19   64]
 [   4   16   23   24   51  171   46   91  271   50  152   84  284]
 [   1    2    6    7   13   54   11   20   80   23   44   22  108]
 [   2    4   15   21   66  164   37  106  268   65  199   68  277]
 [  14   11   97   66  137  410  115  278 1018  121  408  203  710]
 [   0    2    9    4   13   67   10   34   67   12   31   13   59]
 [   6   15   34   37   88  233   70  179  360   55  271   90  446]
 [   0    1   11   12   23   69   15   43   93   14   64   36  125]
 [   6    9   48   54  134  447   88  453  595  116  480  170  660]]
              precision    recall  f1-score   support

           0       0.00      0.00      0.

LOGISTIC REGRESSION

In [24]:
# Model 3: Logistic Regression
logreg = LogisticRegression(C=1)
logreg.fit(x_train_tfidf, y1_train)
y_pred = logreg.predict(x_val_tfidf)
print(accuracy_score(y_pred, y1_test))
print(confusion_matrix(y_pred, y1_test))
print(classification_report(y_pred, y1_test))

0.2143392830358482
[[   0    0    0    0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    1    2    1    0    0    0]
 [   0    0    0    0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    1    1    0    0    2    3    1    3    1    2]
 [   7    7   24   28   56  217   40  115  243   68  168   63  309]
 [   0    0    1    0    3    5    3    3   17    4    6    2   19]
 [   3    0    2    8   23   43   12   32   78   17   54   21   63]
 [  17   24  129  102  240  637  182  412 1440  165  702  303 1161]
 [   0    0    0    1    1    3    0    3    3    1    2    0    4]
 [   3    9   25   31   61  195   44  187  269   57  209   85  320]
 [   0    0    0    0    3   21    2    4   14    3   11   16   16]
 [   7   22   78   74  185  608  129  547  829  174  607  236  939]]
              precision    recall  f1-score   support

           0       0.00      0.00      0.

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


COUNT VECTOR

In [25]:
# Model 1: Multinomial Naive Bayes Classifier
nb = MultinomialNB()
nb.fit(x_train_count, y1_train)
y_pred = nb.predict(x_val_count)
print(accuracy_score(y_pred, y1_test))
print(confusion_matrix(y_pred, y1_test))
print(classification_report(y_pred, y1_test))

0.30785960701964904
[[   0    0    0    0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    1    1    0    1    5    0    0    0    0]
 [   0    0    0    0    0    0    0    1    0    0    0    0    0]
 [   0    0    2    0    1    5    0    2   10    0    1    0    4]
 [   6    3   27   41  134  550   13  291  298   92   84  104  162]
 [   0    0    2    0    0    0   11    1    3    0    0    0    2]
 [   1    1    4   13   26  121    1  384  103   18   27   28   48]
 [   9   14   95   75  164  475   70  228 1137  134  310  218  622]
 [   0    0    0    0    1    2    1    0    4    2    1    1    1]
 [   1   11   14   18   25   45   66   35  132   15  241   42  210]
 [   0    0    0    0    0    5    0    3    4    3    3    2    7]
 [  20   33  115   98  221  525  250  360 1202  227 1095  332 1777]]
              precision    recall  f1-score   support

           0       0.00      0.00      0



In [26]:
# Model 2: Linear SVM
lsvm = SGDClassifier(alpha=0.001, random_state=5, max_iter=15, tol=None)
lsvm.fit(x_train_count, y1_train)
y_pred = lsvm.predict(x_val_count)
print(accuracy_score(y_pred, y1_test))
print(confusion_matrix(y_pred, y1_test))
print(classification_report(y_pred, y1_test))

0.31175941202939855
[[   0    0    0    0    0    1    0    0    2    1    2    2    1]
 [   0    1    2    1    2    3    2    3    2    3    5    0    3]
 [   1    0    2    0    2   10    4    6   27    3   16    4   19]
 [   0    1    1    4    4   20    1    2   11    1   12    7   20]
 [   2    2    3   12   35   67    4   24   62   10   22   17   50]
 [   8    4   24   41  149  535   26  229  305   91  126   96  213]
 [   0    2    6    7   10   15   84   19   65    5   68   19   99]
 [   1    1    9   21   61  258   14  606  170   48   93   70  136]
 [   9   15  109   68  127  389   73  184 1232  126  332  209  636]
 [   0    0    2    4    9   40    7   21   47   27   26    8   46]
 [   1   12   28   25   56  108   65   64  241   40  435   66  387]
 [   2    0    7    3   16   53    9   18   63   17   39   37   64]
 [  13   24   66   59  102  230  123  130  671  119  586  192 1159]]
              precision    recall  f1-score   support

           0       0.00      0.00      0

In [27]:
# Model 3: Logistic Regression
logreg = LogisticRegression(C=1)
logreg.fit(x_train_count, y1_train)
y_pred = logreg.predict(x_val_count)
print(accuracy_score(y_pred, y1_test))
print(confusion_matrix(y_pred, y1_test))
print(classification_report(y_pred, y1_test))

0.3272086395680216
[[   0    0    0    0    0    0    0    0    0    0    0    0    0]
 [   0    1    0    1    0    0    1    0    0    0    1    0    0]
 [   1    0    2    0    1    2    0    0    5    0    3    1    3]
 [   0    0    0    0    1    4    0    1    5    2    2    2    2]
 [   0    2    1    6   33   62    3   13   36   11   18    5   31]
 [   6    4   20   45  134  550   14  282  227   87  104   86  166]
 [   0    0    1    3    4    4   58    7   25    0   33    9   51]
 [   1    1    4   14   39  156    6  486  118   28   56   43   77]
 [  13   20  140  102  211  560  102  276 1619  190  491  294  908]
 [   0    0    1    3    5   29    3   14   25   24   12    7   29]
 [   1   14   28   19   40   75   79   62  202   29  430   52  390]
 [   1    0    2    4   14   43   11   21   41    9   23   26   42]
 [  14   20   60   48   91  244  135  144  595  111  589  202 1134]]
              precision    recall  f1-score   support

           0       0.00      0.00      0.

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


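In conclusion, the Count vectorizer features gave better accuracy than TF-IDF in these runs (0.31 to 0.33 versus 0.17 to 0.21). The TF-IDF vectorizer was capped at 1000 features and refit on the test split, so the two setups are not directly comparable.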