# Sentiment Analysis
Sentiment analysis is used to determine the connotation of the collected tweets.  
[Sentiment towards political parties can motivate individuals to participate in elections](https://doi.org/10.1016/j.electstud.2012.12.006).  
The average sentiment is tested to see whether there is a relationship between each province's average sentiment and the political parties.  


Sentiment analysis is performed with three methods:
1. Sentiwordnet with the help of Barasa
2. Inset
3. IndoBertTweet


Before sentiment analysis, each tweet is cleaned of unimportant words.
The cleaning consists of three steps:
1. Removing unimportant words using stopword lists from nltk, Sastrawi, and a custom list of additional words.
2. Removing unimportant content (such as URLs and hashtags) using regex.
3. Normalizing words to lowercase and removing punctuation.


In [None]:
# Import library
import nltk
import re
import numpy as np
import pandas as pd
from nlp_id.tokenizer import Tokenizer
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory, StopWordRemover, ArrayDictionary
from nlp_id.stopword import StopWord 
from nltk.corpus import stopwords
import words as w
import json

# normalizer class that performs text normalization
class normalizer():
    def __init__(self):
        # Load stopwords
        nltk.download('stopwords')
        stopwords_sastrawi = StopWordRemoverFactory()
        stopwords_nlpid = StopWord() 
        stopwords_nltk = stopwords.words('indonesian')
        stopwords_github = list(np.array(pd.read_csv("stopwords.txt", header=None).values).squeeze())
        more_stopword = w.custom_stopwords
        data_stopword = stopwords_sastrawi.get_stop_words() + stopwords_nlpid.get_stopword() + stopwords_github + stopwords_nltk + more_stopword 
        data_stopword = list(set(data_stopword))

        # Combine slang dictionary
        with open('slang.txt') as f:
            data = f.read()
        data_slang = json.loads(data) 

        with open('sinonim.txt') as f:
            data = f.readlines()
        for line in data:
            word = line.split('=')
            data_slang[word[0].strip()] = word[1].strip()

        # print(data_slang)
        more_dict = w.custom_dict
        data_slang.update(more_dict)

        self.stopwords, self.slang = data_stopword, data_slang
        self.tokenizer = Tokenizer()


    def normalize(self,text):
        text = text.lower()
  
        # Change HTML entities
        text = text.replace('&amp;', 'dan')
        text = text.replace('&gt;', 'lebih dari')
        text = text.replace('&lt;', 'kurang dari')
        
        # Replace URLs with the 'httpurl' placeholder
        text = re.sub(r'http\S+', 'httpurl', text)
        
        # Remove HTML tags
        text = re.sub(r'<.*?>', ' ', text)
        
        # Remove hashtags
        text = re.sub(r'#\w+', ' ', text)
        
        # Replace @mentions with 'user'
        text = re.sub(r'@\w+', 'user', text)

        # Remove non-letter characters
        text = re.sub(r'[^a-zA-Z]', ' ', text)

        # Remove excess space
        text = re.sub(' +', ' ', text)
        text = text.strip()

        result = []
        # Tokenize words
        word_token = self.tokenizer.tokenize(text)
        for word in word_token:
            # Case Folding to Lower Case
            word = word.strip().lower() 
            if word in self.slang:
                word = self.slang[word]
            # Stopwords removal
            if word not in self.stopwords: 
                result.append(word)
            else:
                continue
        return result

### Normalization Result

Example input:  
"Luar biasa! Coba kita bayangkan apa yg bakal terjadi jika Ketua MK, Ketua MA, Panglima TNI, Jaksa Agung, Ketua KPK, Kepala BIN, dan Kapolri juga dgn menggunakan alasan yg sama ikut cawe2 dlm memenangkan Capres-Cawapres tertentu dlm Pemilu 2024? Itukah maksudnya?#RakyatMonitor#"  
  
Normalized output:  
*['coba', 'bayangkan', 'ketua', 'mk', 'ketua', 'panglima', 'tni', 'jaksa', 'agung', 'ketua', 'kpk', 'kepala', 'bin', 'kapolri', 'alasan', 'cawe', 'memenangkan', 'capres', 'cawapres', 'pemilu', 'maksud']*
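
To make the effect of each stage concrete, the following is a minimal sketch of the same pipeline on a short input. The `SLANG` and `STOPWORDS` tables are tiny hypothetical stand-ins for the real dictionaries loaded by `normalizer`, and whitespace splitting replaces the `nlp_id` tokenizer, so the output only approximates what the full class produces.

```python
import re

# Hypothetical miniature dictionaries; the real ones are loaded from files.
SLANG = {"yg": "yang", "dgn": "dengan", "dlm": "dalam"}
STOPWORDS = {"yang", "dengan", "dalam", "kita", "apa", "juga"}

def normalize_sketch(text):
    text = text.lower()
    text = re.sub(r"http\S+", "httpurl", text)   # URLs -> placeholder
    text = re.sub(r"#\w+", " ", text)            # drop hashtags
    text = re.sub(r"@\w+", "user", text)         # mentions -> placeholder
    text = re.sub(r"[^a-zA-Z]", " ", text)       # keep letters only
    tokens = text.split()                        # simple whitespace tokenization
    tokens = [SLANG.get(t, t) for t in tokens]   # expand slang and abbreviations
    return [t for t in tokens if t not in STOPWORDS]

print(normalize_sketch("Coba kita bayangkan apa yg terjadi dlm Pemilu 2024? #RakyatMonitor"))
# ['coba', 'bayangkan', 'terjadi', 'pemilu']
```

Slang expansion runs before stopword removal, so an abbreviation such as `yg` is first mapped to `yang` and then dropped as a stopword.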

## Sentiwordnet with the help of Barasa

Barasa is an Indonesian implementation of sentiwordnet created by [neocl](https://github.com/neocl/barasa).  
Because the Barasa data comes in a non-standard format, a new Sentiwordnet class was written around the Barasa data.  
### Barasa example
| synset     | language | goodness | lemma          | PosScore | NegScore |
|------------|----------|----------|----------------|----------|----------|
| 00001740-a | B        | L        | akauntan       | 0.125    | 0        |
| 00001740-a | B        | L        | berdaya upaya  | 0.125    | 0        |
| 00001740-a | B        | L        | berkemampuan   | 0.125    | 0        |
| 00001740-a | B        | L        | berkesanggupan | 0.125    | 0        |
| 00001740-a | B        | L        | berkeupayaan   | 0.125    | 0        |
| 00001740-a | B        | L        | beroleh        | 0.125    | 0        |
| 00001740-a | B        | L        | boleh          | 0.125    | 0        |
| 00001740-a | B        | L        | cakap          | 0.125    | 0        |
| 00001740-a | B        | L        | cekap          | 0.125    | 0        |
| 00001740-a | B        | L        | handal         | 0.125    | 0        |

In [None]:
from nltk.corpus.reader.wordnet import Synset
from nltk.corpus.reader import WordNetError
from nltk.corpus import wordnet as wn
import nltk
from nlp_id.tokenizer import Tokenizer
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory, StopWordRemover, ArrayDictionary
from nlp_id.stopword import StopWord 
from nltk.corpus import stopwords
import words as w
import numpy as np
import pandas as pd
import spacy
import re

# SentiSynset class that stores the sentiment scores of a synset
class SentiSynset:
    def __init__(self, pos_score, neg_score, synset):
        self._pos_score = pos_score
        self._neg_score = neg_score
        self._obj_score = 1.0 - (self._pos_score + self._neg_score)
        self.synset = synset


    def pos_score(self):
        return self._pos_score


    def neg_score(self):
        return self._neg_score


    def obj_score(self):
        return self._obj_score


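    def __str__(self):
        """Prints just the Pos/Neg scores for now."""
        # The synset may be an NLTK Synset or a plain Barasa entry without name()
        name = self.synset.name() if hasattr(self.synset, "name") else str(self.synset)
        s = "<"
        s += name + ": "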
        s += "PosScore=%s " % self._pos_score
        s += "NegScore=%s" % self._neg_score
        s += ">"
        return s

    def __repr__(self):
        return "Senti" + repr(self.synset)



# CustomSentiWordNet class that performs sentiment analysis using the Barasa data
class CustomSentiWordNet(object):
    def __init__(self):
        with open("barasa.txt", "r", encoding="utf-8") as f:
            lines = f.readlines()
        # synset ID -> {lemma: (pos, neg, obj)}
        synsets = {}
        # lemma -> synset ID
        id_dict = {}
        # Load the synset data into the dictionaries
        for line in lines:
            if line.startswith("#"):
                continue
            parts = line.strip().split("\t")
            if len(parts) != 6:
                continue
            synset_id = parts[0]

            if synset_id not in synsets:
                synsets[synset_id] = {}
            
            # Store the lemma and its sentiment scores for this synset
            _, lang, goodness, lemma, pos, neg = parts
            pos = float(pos)
            neg = float(neg)
            synsets[synset_id][lemma] = (pos, neg, 1 - (pos + neg))
            id_dict[lemma] = synset_id
        self.lemma_dict = id_dict
        self.synsets = synsets
        self.not_found = {}
    
    def _get_synset(self, synset_id):
        # Get the lemma dictionary of a synset from its synset_id
        synsets = self.synsets[synset_id]
        return synsets
        
        
    
    def _get_pos_file(self, pos):
        # Map a WordNet POS tag to its part-of-speech name
        if pos == 'n':
            return 'noun'
        elif pos == 'v':
            return 'verb'
        elif pos == 'a' or pos == 's':
            return 'adj'
        elif pos == 'r':
            return 'adv'
        else:
            raise WordNetError('Unknown POS tag: {}'.format(pos))
    
    
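    def senti_synset(self, synset_id):
        # Get the sentiment scores of a synset.
        # Each synset maps lemmas to (pos, neg, obj); the first lemma's scores are used,
        # assuming all lemmas in a synset share the same scores.
        synset = self._get_synset(synset_id)
        pos_score, neg_score, obj_score = next(iter(synset.values()))
        return SentiSynset(pos_score, neg_score, synset)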
    
    def calculate_sentiment(self,tokens):
        # Collect the positive and negative scores of every token in a sentence
        pos = []
        neg = []
        for token in tokens:
            # skip if token not in lemma_dict
            if token not in self.lemma_dict:
                self.not_found[token] = self.not_found.get(token, 0) + 1
                continue
            synsets = self.synsets[self.lemma_dict[token]][token]
            pos_score, neg_score, obj_score = synsets
            pos.append(pos_score)
            neg.append(neg_score)
        return pos, neg
    
    def get_not_found(self):
        # Return the words that were not found in any synset, with their counts
        return self.not_found

A synset is identified by an ID that groups the words belonging to it. The `SentiSynset` class stores the sentiment data of a synset.  
After all synsets are read, the words found are stored in a dictionary.  
This dictionary is used to look up the synset and the positive and negative scores of a word.  
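
The lookup structure described above can be illustrated with a few rows from the Barasa table. This is a minimal sketch that uses an inline string in place of `barasa.txt`; the variable names mirror those in `CustomSentiWordNet`, while the `lookup` helper is hypothetical.

```python
# Three tab-separated rows copied from the Barasa example table.
SAMPLE = (
    "00001740-a\tB\tL\tcakap\t0.125\t0\n"
    "00001740-a\tB\tL\thandal\t0.125\t0\n"
    "00001740-a\tB\tL\tboleh\t0.125\t0\n"
)

synsets = {}     # synset ID -> {lemma: (pos, neg, obj)}
lemma_dict = {}  # lemma -> synset ID

for line in SAMPLE.splitlines():
    synset_id, lang, goodness, lemma, pos, neg = line.split("\t")
    pos, neg = float(pos), float(neg)
    synsets.setdefault(synset_id, {})[lemma] = (pos, neg, 1 - (pos + neg))
    lemma_dict[lemma] = synset_id

def lookup(token):
    """Hypothetical helper: return (pos, neg, obj) for a token, or None if unknown."""
    synset_id = lemma_dict.get(token)
    if synset_id is None:
        return None
    return synsets[synset_id][token]

print(lookup("handal"))  # (0.125, 0.0, 0.875)
print(lookup("pemilu"))  # None
```

Because `lemma_dict` keeps a single synset ID per lemma, a lemma that appears in several synsets resolves to the last one read.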

Given the normalized text as input, sentiwordnet produces the following result:
```
Positive = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.625, 0.125, 0.0]
Negative = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.125]

Words not found = 
{'mk': 1,
 'tni': 1,
 'kpk': 1,
 'bin': 1,
 'kapolri': 1,
 'cawe': 1,
 'capres': 1,
 'cawapres': 1,
 'pemilu': 1}
```

The final sentiment is obtained by subtracting the average negative score from the average positive score.
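
As a worked example, the score for the lists above can be computed as follows. This is a sketch of the stated rule (mean positive minus mean negative); the function name `sentiment_score` is an assumption, and returning 0.0 when no token was found is a choice made here, not something the original specifies.

```python
def sentiment_score(pos, neg):
    """Mean positive score minus mean negative score (0.0 if no token was found)."""
    if not pos or not neg:
        return 0.0
    return sum(pos) / len(pos) - sum(neg) / len(neg)

positive = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.625, 0.125, 0.0]
negative = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.125]

score = sentiment_score(positive, negative)
print(round(score, 4))  # 0.0521
```

The mean positive score is 0.75 / 12 = 0.0625 and the mean negative score is 0.125 / 12 ≈ 0.0104, so the tweet comes out slightly positive.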

## Inset

Inset is an Indonesian sentiment lexicon created by Fajri Koto and Gemala Y. Rahmaningtyas ([InSet](https://github.com/fajri91/InSet/tree/master)).  
Inset contains 3609 positive words and 6609 negative words, with weights from -5 to +5.

### InSet example
  
Positive  
| word      | weight |
|-----------|--------|
| hai       | 3      |
| merekam   | 2      |
| ekstensif | 3      |
| paripurna | 1      |
| detail    | 2      |
| pernik    | 3      |
| belas     | 2      |
  
Negative  
| word                 | weight |
|----------------------|--------|
| putus tali gantung   | -2     |
| gelebah              | -2     |
| gobar hati           | -2     |
| tersentuh (perasaan) | -1     |
| isak                 | -5     |
| larat hati           | -3     |
| nelangsa             | -3     |




Inset is implemented as a new class that computes the trigrams, bigrams, and unigrams of a tweet and looks each of them up in the lexicon.  
When an n-gram is found, its weight is added to the score and it is removed from the sentence.  
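
A toy trace may help illustrate the longest-match-first lookup. The sketch below uses a handful of entries from the example tables instead of the full lexicon, and `score_longest_first` is a simplified, hypothetical stand-in for the class that follows, not the exact implementation.

```python
# Small lexicon built from entries in the example tables above.
POS = {"hai": 3, "detail": 2}
NEG = {"putus tali gantung": -2, "gobar hati": -2, "isak": -5}

def score_longest_first(text):
    """Match trigrams, then bigrams, then unigrams; remove each matched phrase."""
    tokens = text.split()
    pos_score = neg_score = 0
    for n in (3, 2, 1):
        i = 0
        while i + n <= len(tokens):
            gram = " ".join(tokens[i:i + n])
            if gram in POS or gram in NEG:
                pos_score += POS.get(gram, 0)
                neg_score += NEG.get(gram, 0)
                del tokens[i:i + n]   # matched words cannot be counted again
            else:
                i += 1
    return pos_score, neg_score

print(score_longest_first("hai putus tali gantung detail isak"))  # (5, -7)
```

Matching the trigram `putus tali gantung` first removes its three words, so none of them can be scored again as a bigram or unigram.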

In [None]:
# Import library
import pandas as pd
import numpy as np
from nltk import ngrams

def read_inset(path):
    # Read an InSet lexicon file into a {word: weight} dictionary
    sentiments = {}
    with open(path, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    for line in lines:
        if line.startswith('#'):
            continue
        word, sentiment = line.split('\t')
        sentiments[word] = int(sentiment)
    print(len(sentiments))
    return sentiments

def print_n_grams(unigrams, bigrams, trigrams):
    # Print the n-grams
    print('Unigrams: ', ', '.join(unigrams))
    print('Bigrams: ', ', '.join(bigrams))
    print('Trigrams: ', ', '.join(trigrams))

    

class inSet():
    def __init__(self, verbose = False):
        self.pos = read_inset('Inset/positive.tsv')
        self.neg = read_inset('Inset/negative.tsv')
        # verbose is a flag that enables printing of the n-gram calculations
        self.verbose = verbose

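    def delete_word_from_text(self, text, word):
        # Remove the first occurrence of a word or phrase from the sentence,
        # matching whole tokens so that substrings of longer words are left intact
        tokens, target = text.split(), word.split()
        for i in range(len(tokens) - len(target) + 1):
            if tokens[i:i + len(target)] == target:
                del tokens[i:i + len(target)]
                break
        return ' '.join(tokens)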
    
    
    def calculate_n_gram(self, text):
        # Compute the unigrams, bigrams, and trigrams of a sentence
        unigrams = ngrams(text.split(), 1)
        bigrams = ngrams(text.split(), 2)
        trigrams = ngrams(text.split(), 3)

        unigrams = [' '.join(grams) for grams in unigrams]
        bigrams = [' '.join(grams) for grams in bigrams]
        trigrams = [' '.join(grams) for grams in trigrams]

        return unigrams, bigrams, trigrams
    
    def recalculate_n_grams(self, text, word):
        # Recompute the n-grams after removing a word from the sentence

        text = self.delete_word_from_text(text, word)
        unigrams, bigrams, trigrams = self.calculate_n_gram(text)
        if self.verbose:
            print_n_grams(unigrams, bigrams, trigrams)
        return unigrams, bigrams, trigrams, text

    def calculate_inset_score(self, text):
        # Compute the sentiment scores of a sentence
        unigrams, bigrams, trigrams = self.calculate_n_gram(text)
        pos_score = 0
        neg_score = 0
        # Loop over the n-grams to accumulate the sentiment scores
        # The lookup goes Trigram -> Bigram -> Unigram; a matched n-gram is removed from the sentence
        for trigram in trigrams:
            if trigram in self.pos:
                if self.verbose:
                    print('Hit Trigram Pos ', trigram)
                pos_score += self.pos[trigram]
                unigrams, bigrams, trigrams, text = self.recalculate_n_grams(text, trigram)
            if trigram in self.neg:
                if self.verbose:
                    print('Hit Trigram Neg ', trigram)
                neg_score += self.neg[trigram]
                unigrams, bigrams, trigrams, text = self.recalculate_n_grams(text, trigram)
        
        for bigram in bigrams:
            if bigram in self.pos:
                if self.verbose:
                    print('Hit Bigram Pos ', bigram)
                pos_score += self.pos[bigram]
                unigrams, bigrams, trigrams, text = self.recalculate_n_grams(text, bigram)

            if bigram in self.neg:
                if self.verbose:
                    print('Hit Bigram Neg ', bigram)
                neg_score += self.neg[bigram]
                unigrams, bigrams, trigrams, text = self.recalculate_n_grams(text, bigram)

        for unigram in unigrams:
            if unigram in self.pos:
                if self.verbose:
                    print('Hit Unigram Pos ', unigram)
                pos_score += self.pos[unigram]
                unigrams, bigrams, trigrams, text = self.recalculate_n_grams(text, unigram)

            if unigram in self.neg:
                if self.verbose:
                    print('Hit Unigram Neg ', unigram)
                neg_score += self.neg[unigram]
                unigrams, bigrams, trigrams, text = self.recalculate_n_grams(text, unigram)

        return pos_score, neg_score

As with sentiwordnet, the final sentiment is computed by subtracting the average negative score from the average positive score.  
With the same text as before, the result is:
- Positive: 8
- Negative: -2

## IndoBertTweet
IndoBertTweet is a pretrained language model for Indonesian Twitter created by Fajri Koto, Jey Han Lau, and Timothy Baldwin ([IndoBertTweet](https://arxiv.org/pdf/2109.04607.pdf)).  
It is a transformer-based model trained on Indonesian tweets.  
Fine-tuning for the sentiment analysis task is done with the SmSA dataset provided in the IndoBertTweet repository.

In [None]:
# Import library
import json, glob, os, random
import argparse
import logging
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.metrics import f1_score, accuracy_score
from transformers import BertTokenizer, BertModel, BertConfig
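# transformers' own AdamW is deprecated in favor of the PyTorch implementation
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup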
import re, emoji
from datetime import datetime


# Initialize the logger and the available models
logger = logging.getLogger(__name__)
model_dict = { 'indobertweet': 'indolem/indobertweet-base-uncased',
               'indobert': 'indolem/indobert-base-uncased'}


def find_url(string):
    # Find the URLs in a string
    regex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    url = re.findall(regex,string)
    return [x[0] for x in url]

def preprocess_tweet(tweet):
    # Preprocess a tweet: convert emoji to text, lowercase, and mask mentions and URLs
    tweet = emoji.demojize(tweet).lower()
    new_tweet = []
    for word in tweet.split():
        if word[0] == '@' or word == '[username]':
            new_tweet.append('@USER')
        elif find_url(word) != []:
            new_tweet.append('HTTPURL')
        elif word == 'httpurl' or word == '[url]':
            new_tweet.append('HTTPURL')
        else:
            new_tweet.append(word)
    return ' '.join(new_tweet)

def set_seed(args):
    # Set the random seeds for reproducibility
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)

# BertData class that tokenizes sentences
class BertData():
    def __init__(self, args):
        # Initialize the tokenizer
        # The IndoBertTweet tokenizer is used by default
        self.tokenizer = BertTokenizer.from_pretrained(model_dict[args.bert_model], do_lower_case=True) 
        self.sep_token = '[SEP]'
        self.cls_token = '[CLS]'
        self.pad_token = '[PAD]'
        self.sep_vid = self.tokenizer.vocab[self.sep_token]
        self.cls_vid = self.tokenizer.vocab[self.cls_token]
        self.pad_vid = self.tokenizer.vocab[self.pad_token]
        self.MAX_TOKEN = args.max_token

    def preprocess_one(self, src_txt, label):
        # Tokenize a single sentence
        src_txt = preprocess_tweet(src_txt)
        src_subtokens = [self.cls_token] + self.tokenizer.tokenize(src_txt) + [self.sep_token]    
        src_subtoken_idxs = self.tokenizer.convert_tokens_to_ids(src_subtokens)
        
        # Check whether the sequence exceeds the maximum number of tokens
        # If it does, truncate it and end it with [SEP]
        if len(src_subtoken_idxs) > self.MAX_TOKEN:
            src_subtoken_idxs = src_subtoken_idxs[:self.MAX_TOKEN]
            src_subtoken_idxs[-1] = self.sep_vid
        else:
            # Otherwise, pad the sequence with [PAD] tokens
            src_subtoken_idxs += [self.pad_vid] * (self.MAX_TOKEN-len(src_subtoken_idxs))
        segments_ids = [0] * len(src_subtoken_idxs)
        assert len(src_subtoken_idxs) == len(segments_ids)
        return src_subtoken_idxs, segments_ids, label
    
    def preprocess(self, src_txts, labels):
        # Tokenize a list of sentences
        # Check that the number of sentences matches the number of labels
        assert len(src_txts) == len(labels)
        output = []
        for idx in range(len(src_txts)):
            output.append(self.preprocess_one(src_txts[idx], labels[idx]))
        return output

# Batch class for batch processing
class Batch():
    def __init__(self, data, idx, batch_size, device):
        # Build the tensors for one batch
        cur_batch = data[idx:idx+batch_size]
        src = torch.tensor([x[0] for x in cur_batch])
        seg = torch.tensor([x[1] for x in cur_batch])
        label = torch.tensor([x[2] for x in cur_batch])
        mask_src = 0 + (src != 0)
        
        self.src = src.to(device)
        self.seg= seg.to(device)
        self.label = label.to(device)
        self.mask_src = mask_src.to(device)

    def get(self):
        return self.src, self.seg, self.label, self.mask_src

# Model for sentiment classification
class Model(nn.Module):
    def __init__(self, args, device):
        # Initialize the model
        super(Model, self).__init__()
        self.args = args
        self.device = device
        # Initialize the tokenizer and the BERT model
        self.tokenizer = BertTokenizer.from_pretrained(model_dict[args.bert_model], do_lower_case=True)
        self.bert = BertModel.from_pretrained(model_dict[args.bert_model])
        # Initialize the classification layers
        self.linear = nn.Linear(self.bert.config.hidden_size, args.vocab_label_size)
        self.dropout = nn.Dropout(0.2)
        self.loss = torch.nn.CrossEntropyLoss(ignore_index=args.vocab_label_size, reduction='sum')


    def forward(self, src, seg, mask_src):
        # Forward pass: encode the tokens, then mean-pool over non-padding positions
        top_vec, _ = self.bert(input_ids=src, token_type_ids=seg, attention_mask=mask_src, return_dict=False)
        top_vec = self.dropout(top_vec)
        top_vec *= mask_src.unsqueeze(dim=-1).float()
        top_vec = torch.sum(top_vec, dim=1) / mask_src.sum(dim=-1).float().unsqueeze(-1)
        # Keep the batch dimension so that a batch of size 1 still yields shape [1, num_labels]
        conclusion = self.linear(top_vec)
        return conclusion
    
    def get_loss(self, src, seg, label, mask_src):
        # Compute the loss
        output = self.forward(src, seg, mask_src)
        return self.loss(output.view(-1,self.args.vocab_label_size), label.view(-1))

    def predict(self, src, seg, mask_src):
        # Predict the label id of each example in the batch
        output = self.forward(src, seg, mask_src)
        batch_size = output.shape[0]
        prediction = torch.argmax(output, dim=-1).data.cpu().numpy().tolist()
        return prediction


def prediction(dataset, model, args):
    # Run prediction over a dataset and return the macro F1 score and the predictions
    preds = []
    golds = []
    model.eval()
    for j in range(0, len(dataset), args.batch_size):
        src, seg, label, mask_src = Batch(dataset, j, args.batch_size, args.device).get()
        preds += model.predict(src, seg, mask_src)
        golds += label.cpu().data.numpy().tolist()
    return f1_score(golds, preds, average='macro'), preds

def create_vocab(labels):
    # Build the label-to-id and id-to-label mappings
    unique = np.unique(labels)
    label2id = {}
    id2label = {}
    counter = 0
    for word in unique:
        label2id[word] = counter
        id2label[counter] = word
        counter += 1
    return label2id, id2label

def convert_label2id(label2id, labels):
    return [label2id[x] for x in labels]

def save_df(pred, id2label):
    # Save the predictions to a CSV file
    ids = np.arange(len(pred))
    pred = [id2label[p] for p in pred]
    df = pd.DataFrame()
    df['index']=ids
    df['label']=pred
    df.to_csv('pred_bertW.csv', index=False)

def train(args, train_dataset, dev_dataset, test_dataset, model, id2label):
    """ Train the model """
    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ["bias", "LayerNorm.weight"]
    # Total number of optimization steps across all epochs
    t_total = len(train_dataset) // args.batch_size * args.num_train_epochs
    args.warmup_steps = int(0.1 * t_total)
    # Parameter groups: apply weight decay to everything except biases and LayerNorm weights
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]
    # Optimizer AdamW
    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )

    # Train!
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Total optimization steps = %d", t_total)
    logger.info("  Warming up = %d", args.warmup_steps)
    logger.info("  Patience  = %d", args.patience)

    # Added here for reproducibility
    global best_model
    set_seed(args)
    tr_loss = 0.0
    global_step = 1
    best_f1_dev = 0
    cur_patience = 0
    # Training loop over epochs
    for i in range(int(args.num_train_epochs)):
        random.shuffle(train_dataset)
        epoch_loss = 0.0
        # Training loop over batches
        for j in range(0, len(train_dataset), args.batch_size):
            # Build the batch
            src, seg, label, mask_src = Batch(train_dataset, j, args.batch_size, args.device).get()
            model.train()
            # Compute the loss over the whole batch
            loss = model.get_loss(src, seg, label, mask_src)
            loss = loss.sum()/args.batch_size
            if args.n_gpu > 1:
                loss = loss.mean()  # mean() to average on multi-gpu parallel (not distributed) training
            # Backpropagation
            loss.backward()

            # Accumulate the loss per batch and per epoch
            tr_loss += loss.item()
            epoch_loss += loss.item()
            # Gradient clipping to prevent exploding gradients during backpropagation
            torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)

            # Update the parameters
            optimizer.step()
            scheduler.step()  
            # Reset the gradients
            model.zero_grad()
            global_step += 1
        logger.info("Finish epoch = %s, loss_epoch = %s", i+1, epoch_loss/global_step)

        # Evaluation
        dev_f1, _ = prediction(dev_dataset, model, args)
        if dev_f1 > best_f1_dev:
            best_f1_dev = dev_f1
            _, test_pred = prediction(test_dataset, model, args)
            save_df(test_pred, id2label)
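            # Reset patience because the dev F1 improved
            cur_patience = 0
            # Save a trained model
            logger.info("Better, BEST F1 in DEV = %s, SAVE TEST!", best_f1_dev)
            # Copy the weights so later updates do not overwrite the best checkpoint
            best_model = {k: v.detach().clone() for k, v in model.state_dict().items()}
            # Model is a plain nn.Module without save_pretrained(), so save the state dict;
            # the file name 'best_model.pt' is an assumption
            os.makedirs(args.output_dir, exist_ok=True)
            torch.save(best_model, os.path.join(args.output_dir, 'best_model.pt'))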
            
          
        else:
            cur_patience += 1
            if cur_patience == args.patience:
                logger.info("Early Stopping Not Better, BEST F1 in DEV = %s", best_f1_dev)
                break
            else:
                logger.info("Not Better, BEST F1 in DEV = %s", best_f1_dev)

    return global_step, tr_loss / global_step, best_f1_dev


# Argument Setting
args_parser = argparse.ArgumentParser()
args_parser.add_argument('--bert_model', default='indobertweet', choices=['indobert', 'indobertweet'], help='select one of models')
args_parser.add_argument('--data_path', default='/content/gdrive/MyDrive/TA_Bayu-05111940000172/Indobert/SMsA/Data/', help='path to all train/test/dev')
args_parser.add_argument('--output_dir', default='/content/gdrive/MyDrive/TA_Bayu-05111940000172/Indobert/SMsA/Model/', help='path to save model')
args_parser.add_argument('--max_token', type=int, default=128, help='maximum token allowed for 1 instance')
args_parser.add_argument('--batch_size', type=int, default=30, help='batch size')
args_parser.add_argument('--learning_rate', type=float, default=5e-5, help='learning rate')
args_parser.add_argument('--weight_decay', type=float, default=0.0, help='weight decay')
args_parser.add_argument('--adam_epsilon', type=float, default=1e-8, help='adam epsilon')
args_parser.add_argument('--max_grad_norm', type=float, default=1.0)
args_parser.add_argument('--num_train_epochs', type=int, default=20, help='total epoch')
args_parser.add_argument('--warmup_steps', type=int, default=242, help='warmup_steps, the default value is 10%% of total steps')
args_parser.add_argument('--logging_steps', type=int, default=200, help='report stats every certain steps')
args_parser.add_argument('--seed', type=int, default=2021)
args_parser.add_argument('--local_rank', type=int, default=-1)
args_parser.add_argument('--patience', type=int, default=5, help='patience for early stopping')
args_parser.add_argument('--no_cuda', action='store_true', help='disable CUDA even when a GPU is available')
args_parser.add_argument('-f')
args = args_parser.parse_args()




# Setup CUDA, GPU & distributed training
if args.local_rank == -1 or args.no_cuda:
    # Use a GPU if one is available, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
    args.n_gpu = torch.cuda.device_count()
else: 
    # Initialize the GPU assigned to this process for distributed training
    torch.cuda.set_device(args.local_rank)
    device = torch.device("cuda", args.local_rank)
    torch.distributed.init_process_group(backend="nccl")
    args.n_gpu = 1

args.device = device

# Setup logging
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
)

# Setup random seed
set_seed(args)



# Load pretrained model and tokenizer
if args.local_rank not in [-1, 0]:
    # Make sure only the first process in distributed training will download model & vocab
    torch.distributed.barrier()

if args.local_rank == 0:
    # Make sure only the first process in distributed training will download model & vocab
    torch.distributed.barrier()


# Create the BertData helper for preprocessing and tokenization
bertdata = BertData(args)

# Load Dataset Train, Dev, Test
trainset = pd.read_csv(args.data_path+'train_preprocess.tsv', sep='\t')
devset = pd.read_csv(args.data_path+'valid_preprocess.tsv', sep='\t')
testset = pd.read_csv(args.data_path+'test_preprocess_masked_label.tsv', sep='\t')
xtrain, ytrain = list(trainset['text']), list(trainset['label'])
xdev, ydev = list(devset['text']), list(devset['label'])
xtest, ytest = list(testset['text']), list(testset['label'])


# Convert string labels to ids
label2id, id2label = create_vocab(ytrain)
ytrain = convert_label2id(label2id, ytrain)
ydev = convert_label2id(label2id, ydev)
ytest = convert_label2id(label2id, ytest)
args.vocab_label_size = len(label2id)

# Load Model
model = Model(args, device)
best_model = model.state_dict()

model.to(args.device)
# preprocess data
train_dataset = bertdata.preprocess(xtrain, ytrain)
dev_dataset = bertdata.preprocess(xdev, ydev)
test_dataset = bertdata.preprocess(xtest, ytest)

# Train
global_step, tr_loss, best_f1_dev= train(args, train_dataset, dev_dataset, test_dataset, model, id2label)


print('Dev set F1', best_f1_dev)

The code above is used to fine-tune IndoBertTweet.  
It is adapted from [Github IndoBERTweet SmSA](https://github.com/indolem/IndoBERTweet/blob/main/sentiment_SmSA/indobertweet.py) with several changes.  
  


By modifying that code, the trained model can be used to predict the sentiment of tweets.  
Preprocessing differs from the one used for sentiwordnet and inset; here we follow the preprocessing used by IndoBertTweet.  
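
For comparison with the earlier normalizer, the sketch below shows the kind of placeholder substitution that IndoBertTweet-style preprocessing applies: the text is lowercased, mentions become `@USER`, and links become `HTTPURL`, while stopwords and punctuation are left in place. It is a simplified illustration with a deliberately naive URL pattern (an assumption) and without the emoji conversion; the function name `preprocess_sketch` is hypothetical.

```python
import re

# Naive URL pattern used only for this sketch; the original code uses a longer regex.
URL_PATTERN = re.compile(r"^(https?://|www\.)\S+$")

def preprocess_sketch(tweet):
    words = []
    for word in tweet.lower().split():
        if word.startswith("@"):
            words.append("@USER")      # mask user mentions
        elif URL_PATTERN.match(word):
            words.append("HTTPURL")    # mask links
        else:
            words.append(word)         # keep everything else, including stopwords
    return " ".join(words)

print(preprocess_sketch("@budi Cek https://t.co/abc yg ini bagus!"))
# @USER cek HTTPURL yg ini bagus!
```

Unlike the lexicon-based methods, slang such as `yg` and punctuation are kept, since the model's tokenizer was trained on raw tweet text.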


In [None]:
# Import library

import json, glob, os, random
import argparse
import logging
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.metrics import f1_score, accuracy_score
from transformers import BertTokenizer, BertModel, BertConfig
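# transformers' own AdamW is deprecated in favor of the PyTorch implementation
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup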
import re, emoji
from datetime import datetime



logger = logging.getLogger(__name__)
model_dict = { 'indobertweet': 'indolem/indobertweet-base-uncased',
               'indobert': 'indolem/indobert-base-uncased'}


def find_url(string):
    regex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    url = re.findall(regex,string)
    return [x[0] for x in url]

def preprocess_tweet(tweet):
    tweet = emoji.demojize(tweet).lower()
    new_tweet = []
    for word in tweet.split():
        if word[0] == '@' or word == '[username]':
            new_tweet.append('@USER')
        elif find_url(word) != []:
            new_tweet.append('HTTPURL')
        elif word == 'httpurl' or word == '[url]':
            new_tweet.append('HTTPURL')
        else:
            new_tweet.append(word)
    return ' '.join(new_tweet)

def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)


class BertData():
    def __init__(self, args):
        self.tokenizer = BertTokenizer.from_pretrained(model_dict[args.bert_model], do_lower_case=True)
        self.sep_token = '[SEP]'
        self.cls_token = '[CLS]'
        self.pad_token = '[PAD]'
        self.sep_vid = self.tokenizer.vocab[self.sep_token]
        self.cls_vid = self.tokenizer.vocab[self.cls_token]
        self.pad_vid = self.tokenizer.vocab[self.pad_token]
        self.MAX_TOKEN = args.max_token

    def preprocess_one(self, src_txt):
        src_txt = preprocess_tweet(src_txt)
        src_subtokens = [self.cls_token] + self.tokenizer.tokenize(src_txt) + [self.sep_token]        
        src_subtoken_idxs = self.tokenizer.convert_tokens_to_ids(src_subtokens)
        
        if len(src_subtoken_idxs) > self.MAX_TOKEN:
            src_subtoken_idxs = src_subtoken_idxs[:self.MAX_TOKEN]
            src_subtoken_idxs[-1] = self.sep_vid
        else:
            src_subtoken_idxs += [self.pad_vid] * (self.MAX_TOKEN-len(src_subtoken_idxs))
        segments_ids = [0] * len(src_subtoken_idxs)
        assert len(src_subtoken_idxs) == len(segments_ids)
        return src_subtoken_idxs, segments_ids
    
    def preprocess(self, src_txts):
        output = []
        for idx in range(len(src_txts)):
            output.append(self.preprocess_one(src_txts[idx]))
        return output


class Batch():
    def __init__(self, data, idx, batch_size, device):
        cur_batch = data[idx:idx+batch_size]
        src = torch.tensor([x[0] for x in cur_batch])
        seg = torch.tensor([x[1] for x in cur_batch])
        # No labels are available at inference time, so none are read
        # label = torch.tensor([x[2] for x in cur_batch])
        mask_src = 0 + (src != 0)
        
        self.src = src.to(device)
        self.seg= seg.to(device)
        # No labels are available at inference time, so none are read
        # self.label = label.to(device)
        self.mask_src = mask_src.to(device)

    def get(self):
        # No labels are available at inference time, so none are returned
        return self.src, self.seg, self.mask_src


class Model(nn.Module):
    def __init__(self, args, device):
        super(Model, self).__init__()
        self.args = args
        self.device = device
        self.tokenizer = BertTokenizer.from_pretrained(model_dict[args.bert_model], do_lower_case=True)
        self.bert = BertModel.from_pretrained(model_dict[args.bert_model])
        self.linear = nn.Linear(self.bert.config.hidden_size, args.vocab_label_size)
        self.dropout = nn.Dropout(0.2)
        self.loss = torch.nn.CrossEntropyLoss(ignore_index=args.vocab_label_size, reduction='sum')


    def forward(self, src, seg, mask_src):
        top_vec, _ = self.bert(input_ids=src, token_type_ids=seg, attention_mask=mask_src, return_dict=False)
        top_vec = self.dropout(top_vec)
        top_vec *= mask_src.unsqueeze(dim=-1).float()
        top_vec = torch.sum(top_vec, dim=1) / mask_src.sum(dim=-1).float().unsqueeze(-1)
        conclusion = self.linear(top_vec).squeeze()
        return conclusion
    
    def get_loss(self, src, seg, label, mask_src):
        output = self.forward(src, seg, mask_src)
        return self.loss(output.view(-1,self.args.vocab_label_size), label.view(-1))

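    def predict(self, src, seg, mask_src):
        output = self.forward(src, seg, mask_src)
        # forward() squeezes its output, so restore the batch dimension;
        # otherwise a batch holding a single tweet yields a scalar, not a list
        output = output.view(-1, self.args.vocab_label_size)
        prediction = torch.argmax(output, dim=-1).data.cpu().numpy().tolist()
        return prediction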


def prediction(dataset, model, args):
    preds = []
    # No labels are available at inference time, so gold labels are not collected
    # golds = []
    model.eval()
    for j in range(0, len(dataset), args.batch_size):
        src, seg, mask_src = Batch(dataset, j, args.batch_size, args.device).get()
        preds += model.predict(src, seg, mask_src)
        # No labels are available at inference time, so gold labels are not collected
        # golds += label.cpu().data.numpy().tolist()
    return preds

def create_vocab(labels):
    unique = np.unique(labels)
    label2id = {}
    id2label = {}
    counter = 0
    for word in unique:
        label2id[word] = counter
        id2label[counter] = word
        counter += 1
    return label2id, id2label

def convert_label2id(label2id, labels):
    return [label2id[x] for x in labels]



args_parser = argparse.ArgumentParser()
args_parser.add_argument('--bert_model', default='indobertweet', choices=['indobert', 'indobertweet'], help='select one of models')
args_parser.add_argument('--data_path', default='./indobert_smsa/data/', help='path to all train/test/dev')
args_parser.add_argument('--output_dir', default='./indobert_smsa/Model/', help='path to save model')
args_parser.add_argument('--max_token', type=int, default=128, help='maximum token allowed for 1 instance')
args_parser.add_argument('--batch_size', type=int, default=30, help='batch size')
args_parser.add_argument('--learning_rate', type=float, default=5e-5, help='learning rate')
args_parser.add_argument('--weight_decay', type=float, default=0.0, help='weight decay')
args_parser.add_argument('--adam_epsilon', type=float, default=1e-8, help='adam epsilon')
args_parser.add_argument('--max_grad_norm', type=float, default=1.0)
args_parser.add_argument('--num_train_epochs', type=int, default=20, help='total epoch')
args_parser.add_argument('--warmup_steps', type=int, default=242, help='warmup_steps, the default value is 10% of total steps')
args_parser.add_argument('--logging_steps', type=int, default=200, help='report stats every certain steps')
args_parser.add_argument('--seed', type=int, default=2021)
args_parser.add_argument('--local_rank', type=int, default=-1)
args_parser.add_argument('--patience', type=int, default=5, help='patience for early stopping')
args_parser.add_argument('--no_cuda', default=False)
# parse_known_args ignores the extra arguments Jupyter passes to the kernel
args, _ = args_parser.parse_known_args()




# Setup CUDA, GPU & distributed training
if args.local_rank == -1 or args.no_cuda:
    device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
    args.n_gpu = torch.cuda.device_count()
else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
    torch.cuda.set_device(args.local_rank)
    device = torch.device("cuda", args.local_rank)
    torch.distributed.init_process_group(backend="nccl")
    args.n_gpu = 1
args.device = device

logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
)

set_seed(args)



# Load pretrained model and tokenizer
if args.local_rank not in [-1, 0]:
    # Make sure only the first process in distributed training will download model & vocab
    torch.distributed.barrier()

if args.local_rank == 0:
    # Make sure only the first process in distributed training will download model & vocab
    torch.distributed.barrier()

bertdata = BertData(args)

trainset = pd.read_csv(args.data_path+'train_preprocess.tsv', sep='\t')


xtrain, ytrain = list(trainset['text']), list(trainset['label'])


label2id, id2label = create_vocab (ytrain)
args.vocab_label_size = len(label2id)

model = Model(args, device)
best_model = model.state_dict()

model.to(args.device)
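# Build the path with os.path.join; the backslash in 'indobert_smsa\model_SMSA.pt' is an invalid escape and not portable
model.load_state_dict(torch.load(os.path.join('indobert_smsa', 'model_SMSA.pt'), map_location=args.device))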

print(model)

text = ["Saya sangat senang", "Saya sangat sedih", "Hari cerah"]

preprocessed = bertdata.preprocess(text)
print(preprocessed)
res = prediction(preprocessed, model, args)
print(res)




### IndoBERTweet Preprocessing Output
```
([3, 2315, 2799, 5, 6008, 1732, 10621, 2064, 3798, 4120, 2022, 1997, 2501, 10259, 16, 2501, 2262, 16, 5524, 3136, 16, 5476, 3167, 16, 2501, 5534, 16, 2370, 2937, 16, 1501, 8897, 1614, 11794, 2216, 3217, 3798, 1959, 3354, 12558, 929, 952, 18425, 4821, 10298, 17, 12558, 6964, 3241, 18425, 4002, 8980, 35, 1570, 3251, 9727, 35, 7, 2425, 12592, 5871, 7, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```
### IndoBERTweet Prediction Output
```
[0]
```
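
The prediction is a class id rather than a label name. Because `create_vocab` builds its mapping from `np.unique`, which returns values in sorted order, the ids follow the alphabetical order of the training labels. The sketch below assumes the training file uses the three labels `negative`, `neutral`, and `positive` (an assumption, since the label strings are not shown in this notebook); the `res` list is likewise a made-up example.

```python
import numpy as np

def create_vocab(labels):
    # Same idea as create_vocab above: ids follow np.unique's sorted order
    unique = np.unique(labels)
    label2id = {str(word): i for i, word in enumerate(unique)}
    id2label = {i: str(word) for i, word in enumerate(unique)}
    return label2id, id2label

# Assumed label strings; the real ones come from train_preprocess.tsv
ytrain = ['positive', 'neutral', 'negative', 'positive']
label2id, id2label = create_vocab(ytrain)

# Hypothetical output of prediction(...), decoded back to label names
res = [0, 2, 1]
decoded = [id2label[i] for i in res]
print(decoded)
```

Under this assumption, id 0 decodes to `negative`, which is also why the validation step later shifts the stored labels before comparing them with the manual tags.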

A quick comparison suggests that only IndoBERTweet produces accurate sentiment predictions.  
For a concrete comparison, all three models are evaluated against manually assigned labels. 

## Prediction Validation
### Sorting Keywords

To obtain the best data in a short time, the 600,000 collected tweets are filtered to find the most interesting ones.  
Filtering uses a prepared keyword lexicon: for each tweet, the total number of interesting words it contains is counted.  

In [None]:
# Import library
import pandas as pd

def get_n_keywords(text):
    # Sum the lexicon values of every keyword found in the text
    total = 0
    for word in text.split():
        if word in keyword['keyword']:
            total = keyword['keyword'][word] + total
    return total

keywords = pd.read_csv('keyword.csv', sep=';',encoding = 'unicode_escape')
keywords['text'] = keywords['text'].astype(str)
keywords['text'] = keywords['text'].apply(lambda x: x.lower())

keywords = keywords.drop(columns=['count'])
keywords.set_index('text', inplace=True)

keyword = keywords.to_dict()
get_n_keywords('penerus bangsa kita jokowi dodo jk presiden , nomor 1 diatas segalanya, indonesia')

For the example text "penerus bangsa kita jokowi dodo jk presiden , nomor 1 diatas segalanya, indonesia", the result is 5, from the following keywords:
- bangsa
- jokowi
- presiden
- 1
- indonesia

The tweets are then sorted by their keyword totals.  
The sorted data is tagged manually, and the resulting tags are used to validate the predictions.
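
The scoring and sorting step could be sketched as follows. The lexicon entries, the sample tweets, and the names `count_keywords`, `text`, and `n_keywords` are made up for illustration; the actual lexicon is loaded from `keyword.csv`.

```python
import pandas as pd

# Hypothetical lexicon: keyword -> weight (the real one comes from keyword.csv)
lexicon = {'bangsa': 1, 'jokowi': 1, 'presiden': 1, 'indonesia': 1}

def count_keywords(text, lexicon):
    # Sum the weights of the whitespace-separated tokens found in the lexicon
    return sum(lexicon.get(word, 0) for word in text.split())

tweets = pd.DataFrame({'text': [
    'selamat pagi semua',
    'presiden jokowi memimpin bangsa indonesia',
    'jokowi hadir hari ini',
]})

tweets['n_keywords'] = tweets['text'].apply(lambda t: count_keywords(t, lexicon))
# Tweets with the most keywords come first, so they are tagged first
ranked = tweets.sort_values('n_keywords', ascending=False).reset_index(drop=True)
print(ranked)
```

Sorting in descending order means that manual tagging can stop at any point and still cover the tweets judged most relevant by the lexicon.
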
### Manual Tagging Results
| Label    | Count |
|----------|-------|
| Negative | 349   |
| Neutral  | 239   |
| Positive | 175   |
| Skip     | 5     |

## Prediction Validation
Predictions are validated using the classification report and confusion matrix of each model.

In [6]:
import pandas as pd
df = pd.read_csv('tagged_joined.csv', sep=';')
df = df.dropna()
df = df[df['tag_overall'] != 5]
df = df[df['tag_overall'] != 4]

### Barasa Sentiwordnet Predictions

In [7]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import numpy as np

pos_sentiword = df['posSentiword'].tolist()
neg_sentiword = df['negSentiword'].tolist()

y_pred = []

for i in range(len(pos_sentiword)):
    delta = pos_sentiword[i] - neg_sentiword[i]
    if delta > 0.25:
        y_pred.append(3)
    elif delta < -0.25:
        y_pred.append(1)
    else:
        y_pred.append(2)

y_true = df['tag_overall'].tolist()
print(classification_report(y_true, y_pred, target_names=['neg', 'neu', 'pos']))
print(confusion_matrix(y_true, y_pred))

              precision    recall  f1-score   support

         neg       0.51      0.21      0.30       349
         neu       0.30      0.47      0.37       239
         pos       0.24      0.34      0.28       175

    accuracy                           0.32       763
   macro avg       0.35      0.34      0.32       763
weighted avg       0.38      0.32      0.32       763

[[ 75 173 101]
 [ 42 112  85]
 [ 30  86  59]]


### InSet Predictions

In [9]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import numpy as np

pos_inset = df['posInset'].tolist()
neg_inset = df['negInset'].tolist()


y_pred = []

for i in range(len(pos_inset)):
    delta = pos_inset[i] + neg_inset[i]
    if delta > 10:
        y_pred.append(3)
    elif delta < -10:
        y_pred.append(1)
    else:
        y_pred.append(2)

y_true = df['tag_overall'].tolist()
print(classification_report(y_true, y_pred, target_names=['neg', 'neu', 'pos']))
print(confusion_matrix(y_true, y_pred))




              precision    recall  f1-score   support

         neg       0.56      0.23      0.32       349
         neu       0.34      0.69      0.46       239
         pos       0.31      0.25      0.27       175

    accuracy                           0.38       763
   macro avg       0.41      0.39      0.35       763
weighted avg       0.44      0.38      0.35       763

[[ 79 208  62]
 [ 40 166  33]
 [ 21 111  43]]


### IndoBERTweet Predictions

In [11]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import numpy as np

y_true = df['tag_overall'].tolist()
# Shift BERTlabel by 2: its predictions are -1, 0, 1 while tag_overall uses 1, 2, 3
y_pred = (df['BERTlabel'] + 2).tolist()
print(classification_report(y_true, y_pred, target_names=['neg', 'neu', 'pos']))
print(confusion_matrix(y_true, y_pred))

              precision    recall  f1-score   support

         neg       0.71      0.94      0.81       349
         neu       0.80      0.53      0.63       239
         pos       0.74      0.62      0.67       175

    accuracy                           0.74       763
   macro avg       0.75      0.69      0.71       763
weighted avg       0.75      0.74      0.72       763

[[327  11  11]
 [ 86 126  27]
 [ 46  21 108]]


These results confirm that IndoBERTweet predicts more accurately than the other two models on the manually tagged set (accuracy 0.74, versus 0.32 for Sentiwordnet and 0.38 for InSet). Its predictions are therefore used in the subsequent analysis.
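
As a cross-check, the accuracy and macro-averaged F1 of each model can be recomputed directly from the confusion matrices printed above. This sketch is not part of the original notebook, and the helper name `summarize` is illustrative.

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows: true, columns: predicted)
matrices = {
    'Sentiwordnet': np.array([[75, 173, 101], [42, 112, 85], [30, 86, 59]]),
    'InSet': np.array([[79, 208, 62], [40, 166, 33], [21, 111, 43]]),
    'IndoBERTweet': np.array([[327, 11, 11], [86, 126, 27], [46, 21, 108]]),
}

def summarize(cm):
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)  # column sums = predicted counts per class
    recall = tp / cm.sum(axis=1)     # row sums = true counts per class (support)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return float(accuracy), float(f1.mean())

results = {name: summarize(cm) for name, cm in matrices.items()}
for name, (acc, macro_f1) in results.items():
    print(f'{name}: accuracy={acc:.2f}, macro F1={macro_f1:.2f}')
```

The recomputed values agree with the classification reports, which makes it easy to compare the three models side by side in a single table.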