# Sentiment Analysis of Twitter Comments by Social Media Users on the 2014 Election

By Jonathan Stanley

Contact:

jonathanstanleyofficial@gmail.com / 082112426652



This project uses a BERT model to perform sentiment analysis on tweets by social media users about the 2014 presidential election (Pemilu Presiden 2014). The resulting model is expected to classify tweets on this topic as positive or negative.

## Data Reading and Understanding

In [3]:
# Data Reading
import pandas as pd

dataset = pd.read_csv("Capres2014-1.1.csv", usecols=["Isi_Tweet", "Sentimen"])

In [4]:
# EXPLORE the data
dataset.head()

Unnamed: 0,Isi_Tweet,Sentimen
0,"Biusnya habis ! RT""@eddies_song: Dahlan Iskan ...",-1
1,"Presiden Prabowo ,Presiden Terakhir Indonesia",1
2,@republikaonline masa capres prabowo bergitu b...,-1
3,"Kalo kata bapak capres ARB, kita harus ""berani...",1
4,"RT @DhafaRizky_: Najis,org gila doang yg dukun...",-1


The dataset has two columns. The first, Isi_Tweet, holds the tweet text. The second, Sentimen, holds the sentiment label: -1 for negative tweets and 1 for positive tweets.

In [5]:
## Check for class imbalance
dataset['Sentimen'].value_counts()

Sentimen
 1    1117
-1     768
Name: count, dtype: int64
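
The counts above put the positive class at roughly 59% of the data, which is also the accuracy a trivial majority-class predictor would reach. A minimal sketch of that arithmetic, using only the standard library and the two counts printed above (the variable names are illustrative and not part of the notebook):

```python
# Class counts copied from the value_counts() output above
counts = {1: 1117, -1: 768}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of always predicting the most frequent class
majority_baseline = max(shares.values())

print(total)
print(round(majority_baseline, 4))
```

Any accuracy reported later is easier to judge against this baseline than against 50%.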

In [6]:
## Remap labels from {-1, 1} to {0, 1}
dataset['Sentimen'] = dataset['Sentimen'].replace(-1,0)
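
A sigmoid output trained with binary cross-entropy expects targets in {0, 1}, which is presumably why the labels are remapped here. The same mapping written out explicitly on a plain list (toy labels, not taken from the dataset):

```python
# Toy labels in the dataset's original encoding
raw_labels = [-1, 1, -1, 1, -1]

# Explicit mapping: negative -> 0, positive -> 1
label_map = {-1: 0, 1: 1}
binary_labels = [label_map[label] for label in raw_labels]

print(binary_labels)
```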

## Data Preprocessing

In [11]:
import nltk
from nltk.corpus import stopwords
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

nltk.download('stopwords')
#Setting indonesian stopwords
stop_words = set(stopwords.words('indonesian'))

#Stemming indonesian words
stemmer_factory = StemmerFactory()
stemmer = stemmer_factory.create_stemmer()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:

# Preprocessing - cleaning
import numpy as np
import re
import string

def clean_text(tweet):

    # Convert to lower case
    tweet = tweet.lower()
    # Drop non-ASCII characters (emoji, accented letters)
    tweet = tweet.encode('ascii', 'ignore').decode()
    # Remove URLs (www.* or http(s)://*)
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', '', tweet)
    # Remove @username mentions
    tweet = re.sub(r'@[^\s]+', '', tweet)
    # Collapse runs of whitespace into a single space
    tweet = re.sub(r'[\s]+', ' ', tweet)
    # Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    # Remove punctuation
    tweet = tweet.translate(str.maketrans('', '', string.punctuation))
    # Remove common Indonesian stop words, then stem the remaining words
    tweet_tokens = tweet.split()
    filtered_words = [word for word in tweet_tokens if word not in stop_words]
    stemmed_words = [stemmer.stem(word) for word in filtered_words]
    tweet = ' '.join(stemmed_words)
    # Trim leftover quote characters at either end
    tweet = tweet.strip('\'"')

    return tweet

dataset["Isi_Tweet"] = dataset['Isi_Tweet'].map(clean_text)
# Keep only rows that still contain at least one word after cleaning
dataset = dataset[dataset['Isi_Tweet'].apply(lambda x: len(x.split()) >= 1)]
dataset.shape

(1885, 2)
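
To make the effect of the regex steps concrete, here is a reduced sketch of the cleaning function that leaves out stop-word removal and stemming so it runs without NLTK or Sastrawi. The function name `basic_clean` and the sample tweet are made up for illustration.

```python
import re
import string

def basic_clean(tweet: str) -> str:
    """Regex-only subset of the cleaning steps (no stop words, no stemming)."""
    tweet = tweet.lower()
    # Drop non-ASCII characters
    tweet = tweet.encode("ascii", "ignore").decode()
    # Remove URLs, then @mentions
    tweet = re.sub(r"(www\.\S+)|(https?://\S+)", "", tweet)
    tweet = re.sub(r"@\S+", "", tweet)
    # Keep the hashtag word but drop the '#'
    tweet = re.sub(r"#(\S+)", r"\1", tweet)
    # Strip punctuation, then collapse whitespace
    tweet = tweet.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", tweet).strip()

sample = "RT @user: Dukung #Capres2014 di https://example.com !!"
print(basic_clean(sample))
```

Note that the mention pattern removes everything up to the next space, so the colon after `@user` disappears with it.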

In [13]:
## Preprocessing - splitting (80% training : 20% testing)
from sklearn.model_selection import train_test_split

train_data, test_data, train_labels, test_labels = train_test_split(
    dataset['Isi_Tweet'], dataset['Sentimen'], test_size=0.2, random_state=42)
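
The split above is purely random, so the class ratio in the test set can drift slightly from the 59/41 ratio of the full data. scikit-learn's `train_test_split` accepts a `stratify` argument to keep the ratio fixed. The sketch below shows the idea with the standard library only; the helper `stratified_split` is hypothetical, and the label list is rebuilt from the class counts rather than read from the dataset.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with the same class ratio in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Labels rebuilt from the class counts (1117 positive, 768 negative)
labels = [1] * 1117 + [0] * 768
train_idx, test_idx = stratified_split(labels)

test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(test_positive_share, 3))
```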

In [14]:
!pip install transformers



In [15]:
!pip show transformers

Name: transformers
Version: 4.37.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: C:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: 


In [16]:
from tensorflow import keras
from transformers import AutoTokenizer, TFAutoModel
import IPython

In [17]:

#Pretrained model Indobert
bert_tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p2")
def tokenisasi(teks):
    encode_dict = bert_tokenizer(teks,
                                   add_special_tokens = True,
                                   max_length = 128, 
                                   padding = 'max_length',
                                   truncation = True,
                                   return_attention_mask = True,
                                   return_tensors = 'tf',)

    tokenID = encode_dict['input_ids']
    attention_mask = encode_dict['attention_mask']
    return tokenID, attention_mask

def create_input(data):
    tokenID, input_mask = [], []
    for teks in data:
        token, mask = tokenisasi(teks)
        tokenID.append(token)
        input_mask.append(mask)
    
    return [np.asarray(tokenID, dtype=np.int32).reshape(-1, 128), 
            np.asarray(input_mask, dtype=np.int32).reshape(-1, 128)]

bert_model = TFAutoModel.from_pretrained("indobenchmark/indobert-base-p2", trainable=False)

def bert(hp):
    
    #Input layer
    input_token = keras.layers.Input(shape=(128,), dtype=np.int32,
                                        name="input_token")
    input_mask = keras.layers.Input(shape=(128,), dtype=np.int32,
                                   name="input_mask")

    #Embedding
    bert_embedding = bert_model([input_token, input_mask])[0]
    
    
    # Attention mechanism
    num_heads = hp.Int('num_heads', min_value=2, max_value=8, step=2)
    attention = keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=128)(bert_embedding, bert_embedding, bert_embedding)
    add_attention = keras.layers.Add()([bert_embedding, attention])
    layer_norm1 = keras.layers.LayerNormalization(epsilon=1e-6)(add_attention)
    
    #Dropout Layer
    dropout_rate = 0.2
    dropout_layer = keras.layers.Dropout(dropout_rate)(layer_norm1)

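    #Output layer
    # Dense acts on the last axis of the (batch, 128, hidden) tensor,
    # so it yields one sigmoid score per token position: (batch, 128, 1)
    output = keras.layers.Dense(1, activation='sigmoid',
                                kernel_regularizer=keras.regularizers.l2(hp.Choice('kernel_dense', values = [0.01, 0.001])))(dropout_layer)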
    
    
    #Adjust Learning Rates
    learning_rate = 1e-3
    lr_schedule = keras.optimizers.schedules.ExponentialDecay(
        learning_rate,
        decay_steps=1000,
        decay_rate=0.95,
        staircase=True
    )
    
    #Model Compiler
    model = keras.models.Model(inputs=[input_token, input_mask], outputs=output)

    model.compile(optimizer = keras.optimizers.Adam(lr_schedule),
                  loss ='binary_crossentropy',
                  metrics=['accuracy'])
   
    return model

class ClearTrainingOutput(keras.callbacks.Callback):
    # Clear the cell output after each trial so the tuner log stays readable
    def on_train_end(self, logs=None):
        IPython.display.clear_output(wait = True)

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

Some layers from the model checkpoint at indobenchmark/indobert-base-p2 were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at indobenchmark/indobert-base-p2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
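
The tokenizer call above pads or truncates every tweet to 128 token ids and returns an attention mask that marks real tokens with 1 and padding with 0. A minimal sketch of that bookkeeping in plain Python follows. The ids and the `pad_id=0` default are hypothetical values for illustration, not IndoBERT's actual vocabulary, and the simple truncation here does not re-append a closing special token the way the real tokenizer does.

```python
def pad_and_mask(token_ids, max_length=128, pad_id=0):
    """Truncate or pad a list of token ids and build its attention mask."""
    ids = list(token_ids[:max_length])
    n_pad = max_length - len(ids)
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask

# Hypothetical ids for a 4-token tweet wrapped in two special tokens
toy_ids = [2, 101, 102, 103, 104, 3]
padded_ids, mask = pad_and_mask(toy_ids)

print(len(padded_ids), sum(mask))
```

Stacking these fixed-length rows is what lets `create_input` reshape its output to `(-1, 128)`.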


In [18]:
!pip install keras-tuner



In [None]:
from keras_tuner.tuners import BayesianOptimization

bert_train_data = create_input(train_data)
bert_test_data = create_input(test_data)

tuner = BayesianOptimization(bert,
                             objective = 'val_accuracy', 
                             max_trials = 10,
                             directory = '/content/Hasil',
                             project_name = 'Sentiment-BERT',
                             overwrite = True)

tuner.search(bert_train_data, train_labels,
             batch_size=256, epochs = 50,
             validation_data=(bert_test_data, test_labels),
             callbacks=[early_stop, ClearTrainingOutput()])

# Retrieve the best model found by the search
model = tuner.get_best_models()[0]

Trial 10 Complete [01h 25m 47s]
val_accuracy: 0.8386936187744141

Best val_accuracy So Far: 0.8514589071273804
Total elapsed time: 13h 55m 59s
INFO:tensorflow:Oracle triggered exit
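
It is worth checking how the `ExponentialDecay` schedule interacts with this search budget. With `staircase=True` the rate is `initial_lr * decay_rate ** floor(step / decay_steps)`, so it only changes every 1000 optimizer steps. The sketch below reproduces that formula in plain Python. The training-set size of 1508 is an assumption (1885 cleaned tweets minus a 20% test split), and the function name is illustrative.

```python
import math

def staircase_exponential_decay(step, initial_lr=1e-3,
                                decay_steps=1000, decay_rate=0.95):
    """Learning rate after `step` optimizer updates with staircase=True."""
    return initial_lr * decay_rate ** (step // decay_steps)

train_size = 1508   # assumption: 1885 tweets minus a 20% test split
batch_size = 256
epochs = 50

steps_per_epoch = math.ceil(train_size / batch_size)
max_steps = steps_per_epoch * epochs

print(steps_per_epoch, max_steps)
print(staircase_exponential_decay(max_steps))   # rate at the end of a full trial
print(staircase_exponential_decay(1000))        # rate once the first decay applies
```

Under these assumptions a trial never reaches step 1000, so the learning rate stays at its initial value for the whole search; a smaller `decay_steps` would be needed for the schedule to take effect.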


In [67]:
## Model evaluation

test_loss, test_acc = model.evaluate(bert_test_data, test_labels)
print('Test accuracy:', test_acc)

Test accuracy: 0.8514589071273804


## Result Evaluation

The BERT model performs reasonably well on the sentiment-analysis task in this project. It reaches an accuracy of 0.8515, so roughly 85% of its predictions match the true sentiment labels in the testing set used for evaluation. Because the same testing set also served as the validation data during the hyperparameter search, this figure equals the best val_accuracy reported by the tuner and is likely optimistic for unseen tweets.
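
Since the classes are not balanced, accuracy alone can hide how the model treats each class. The sketch below shows how accuracy, precision, and recall follow from thresholding sigmoid scores at 0.5. The labels and scores are toy values made up for illustration, not outputs of the model above, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision and recall for the positive class."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Toy labels and sigmoid scores
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]

accuracy, precision, recall = binary_metrics(y_true, y_prob)
print(accuracy, precision, recall)
```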

In [None]:
## Save and reload the model
model.save('Data/model_mlp_sentiment.h5')

model = keras.models.load_model('Data/model_mlp_sentiment.h5')