# **Comparison of Support Vector Machine, Rule-Based Classifier, and Gradient Boosted Decision Tree Algorithms for Sentiment Analysis**


---

**REKOGNISI GROUP**


> Fachri Kurniansyah - M0721025


> Felix - M0721028

This assignment performs sentiment analysis with three different models:
- **Support Vector Machine**
- **Rule-Based Classifier**
- **Gradient Boosted Decision Tree**

The data comes from the GitHub repository [**`indonlu/dataset/smsa_doc-sentiment-prosa`**](https://github.com/IndoNLP/indonlu/tree/master/dataset/smsa_doc-sentiment-prosa). The analysis uses **TF-IDF** as the vectorization method.
<br>
<br>
The notebook, requirements, and model comparison results are available on GitHub at [**`FelixMatrixar/Basic-NLP-in-PyTorch`**](https://github.com/FelixMatrixar/Basic-NLP-in-PyTorch/blob/main/Felix_Fachri%20Tugas%20Kelompok%20(Revised).ipynb).

---
# **Import Library**



This step imports the **libraries** needed throughout the sentiment analysis.

In [1]:
import asyncio
import matplotlib.pyplot as plt
import nest_asyncio
import nltk
import optuna
import pandas as pd
import re

from abc import ABC, abstractmethod
from nltk.corpus import stopwords
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

nest_asyncio.apply()



In [2]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\celle\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\celle\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

---
# **Import Data**

This step imports the **training data** and **test data**. Each file has two columns, `text` and `label`, and together they contain 11,500 rows (11,000 training rows and 500 test rows). The `label` column takes one of three classes that describe the sentiment of a text:
- positive
- neutral
- negative

In [3]:
column_names = ['text', 'label']
train_data = pd.read_csv('train_preprocess.tsv', sep='\t', header=None, names=column_names)
test_data = pd.read_csv('test_preprocess.tsv', sep='\t', header=None, names=column_names)

In [4]:
train_data

Unnamed: 0,text,label
0,warung ini dimiliki oleh pengusaha pabrik tahu...,positive
1,mohon ulama lurus dan k212 mmbri hujjah partai...,neutral
2,lokasi strategis di jalan sumatera bandung . t...,positive
3,betapa bahagia nya diri ini saat unboxing pake...,positive
4,duh . jadi mahasiswa jangan sombong dong . kas...,negative
...,...,...
10995,tidak kecewa,positive
10996,enak rasa masakan nya apalagi kepiting yang me...,positive
10997,hormati partai-partai yang telah berkoalisi,neutral
10998,"pagi pagi di tol pasteur sudah macet parah , b...",negative


In [5]:
test_data

Unnamed: 0,text,label
0,kemarin gue datang ke tempat makan baru yang a...,negative
1,kayak nya sih gue tidak akan mau balik lagi ke...,negative
2,"kalau dipikir-pikir , sebenarnya tidak ada yan...",negative
3,ini pertama kalinya gua ke bank buat ngurusin ...,negative
4,waktu sampai dengan gue pernah disuruh ibu lat...,negative
...,...,...
495,kata nya tidur yang baik itu minimal enam jam ...,neutral
496,indonesia itu ada di benua asia .,neutral
497,salah satu kegemaran anak remaja indonesia sek...,neutral
498,melihat warna hijau bisa bikin mata jadi lebih...,positive


---
# **Preprocessing Data**

This step applies **data preprocessing** so that the text is better prepared for analysis. It consists of the following steps, listed in the order they run in the code:

- Cleaning the text (details are in the code)
- Removing stopwords
- Stemming

The same preprocessing is applied to both the `training data` and the `test data`.
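To make the order of operations concrete, here is a minimal sketch that mirrors the main regex steps of `clean_text` using only the standard library. The function name `simple_clean`, the sample sentence, and the three-word stopword set are hypothetical and exist only for illustration; stemming is left out because it depends on Sastrawi.

```python
import re

# Hypothetical miniature stopword set, for illustration only; the notebook
# uses the union of the NLTK and Sastrawi Indonesian lists instead.
DEMO_STOP_WORDS = {"nya", "yang", "di"}

def simple_clean(text: str) -> str:
    """Lowercase, strip social-media noise, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"^rt\s+", "", text)        # leading retweet marker
    text = re.sub(r"@\S+", "", text)          # mentions
    text = re.sub(r"#[a-z0-9_]+", "", text)   # hashtags
    text = re.sub(r"https\S+", "", text)      # URLs
    text = re.sub(r"[^a-z\s]", "", text)      # punctuation and digits
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return " ".join(w for w in text.split() if w not in DEMO_STOP_WORDS)

sample = "RT @user: Makanan nya enak banget!!! https://t.co/abc #kuliner 123"
print(simple_clean(sample))
```

Stopword removal runs after the regex cleanup so that tokens such as `nya!!!` are already reduced to plain words before they are compared against the stopword set.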

In [6]:
# Combine the NLTK and Sastrawi Indonesian stopword lists
stop_words = set(stopwords.words('indonesian'))
stop_words |= set(StopWordRemoverFactory().get_stop_words())
stemmer = StemmerFactory().create_stemmer()

def clean_text(text):
    # pandas reads missing values as float NaN; treat them as empty text
    if isinstance(text, float):
        return ""
    temp = text.lower()
    temp = re.sub(r'^RT\s+', '', temp, flags=re.IGNORECASE).strip()  # Remove a leading retweet marker
    temp = re.sub(r"@\S+", "", temp)              # Remove mentions
    temp = re.sub(r"#[A-Za-z0-9_]+", "", temp)    # Remove hashtags
    temp = re.sub(r"https\S+", "", temp)          # Remove URLs
    temp = re.sub(r'[()!?]', '', temp)            # Remove specific punctuation
    temp = re.sub(r"\[.*?\]", "", temp)           # Remove text inside square brackets
    temp = re.sub(r"[^a-z0-9\s]", "", temp)       # Remove non-alphanumeric characters (preserve spaces)
    temp = re.sub(r'[0-9]', '', temp)             # Remove digits
    temp = re.sub(r'\s+', ' ', temp).strip()      # Collapse repeated whitespace and trim both ends
    # Drop stopwords first, then stem what remains
    temp = ' '.join(word for word in temp.split() if word not in stop_words)
    temp = stemmer.stem(temp)
    return temp

train_data['clean'] = train_data['text'].apply(clean_text)
test_data['clean'] = test_data['text'].apply(clean_text)



---
# **TF-IDF**

This step converts the text into a numeric representation (vectorization) using `TF-IDF`. **TF-IDF** is applied only to the features X, not to the labels y; the vectorizer is fitted on the training data and then reused to transform the test data. The label distribution also shows that the data is imbalanced: **positive** is the most frequent label and **neutral** the least frequent.

In [7]:
X_train, y_train = train_data['clean'], train_data['label']
X_test, y_test = test_data['clean'], test_data['label']
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
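For reference, the weighting that `TfidfVectorizer` applies with its default settings (raw term counts, smoothed IDF, L2 normalization) can be reproduced by hand. The sketch below does this on a toy corpus; the three documents and the helper name `tfidf_vector` are hypothetical and not taken from the dataset, and tokenization is assumed to have happened already.

```python
import math
from collections import Counter

# Toy, already-tokenized corpus (hypothetical, not taken from the dataset)
docs = [
    ["makan", "enak"],
    ["makan", "mahal"],
    ["enak", "enak", "murah"],
]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

# Smoothed IDF, the scikit-learn default: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_vector(doc):
    """Return the L2-normalized TF-IDF vector of one document over `vocab`."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

vectors = [tfidf_vector(doc) for doc in docs]
for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vec])
```

Terms that occur in fewer documents (here `mahal` and `murah`) receive a larger IDF than terms shared across documents, and every row ends up with unit length.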

In [8]:
print("Distribusi label:\n", y_train.value_counts())

Distribusi label:
 label
positive    6416
negative    3436
neutral     1148
Name: count, dtype: int64
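The counts above can be turned into class shares and a majority-class baseline with a few lines of arithmetic. This is only a worked calculation on the printed numbers; the variable names are illustrative.

```python
# Training label counts as printed above
label_counts = {"positive": 6416, "negative": 3436, "neutral": 1148}

total = sum(label_counts.values())
proportions = {label: count / total for label, count in label_counts.items()}

# Accuracy obtained on the training data by always predicting the most frequent class
majority_label = max(label_counts, key=label_counts.get)
majority_baseline = label_counts[majority_label] / total

# Ratio between the largest and the smallest class
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())

for label, share in proportions.items():
    print(f"{label:>8}: {share:.1%}")
print(f"majority baseline: {majority_baseline:.3f}")
print(f"imbalance ratio:   {imbalance_ratio:.2f}")
```

A majority baseline of roughly 0.583 is the accuracy any useful classifier has to beat on data distributed like the training set.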


---
# **Machine Learning Modeling**

The `modeling` stage has three steps:
- Defining a class for each of the three models (SVM, Rule-Based Classifier, GBDT)
- Tuning, training, and evaluating each model on the dataset
- Reporting the evaluation metrics of each model (accuracy and a per-class classification report)

Training can take up to **3 hours** because every model goes through **hyperparameter tuning** to search for its best hyperparameters.
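Several of the search ranges used in the tuning code, such as `C` and `gamma` for the SVM, span many orders of magnitude, which is why they are requested with `log=True`. The sketch below illustrates what log-uniform sampling means using only the standard library. It is not Optuna's sampler (Optuna's default is TPE, not plain random draws), and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(42)
samples = [sample_log_uniform(1e-3, 1e3, rng) for _ in range(10_000)]

# With a log-uniform draw on [1e-3, 1e3], about half of the samples fall below 1.0,
# whereas a plain uniform draw would put almost all of them above 1.0.
share_below_one = sum(s < 1.0 for s in samples) / len(samples)
print(f"share of samples below 1.0: {share_below_one:.2f}")
```

Sampling on the log scale gives small values such as `C = 0.01` the same chance of being tried as large ones such as `C = 100`.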

---
## **Function Definition**

In [9]:
class BaseModel(ABC):
    def __init__(self, name):
        self.name = name
        self.best_params = None
        self.model = None

    @abstractmethod
    def tune(self, X_train, y_train):
        pass

    @abstractmethod
    def train(self, X_train, y_train):
        pass

    def evaluate(self, X_test, y_test):
        y_pred = self.model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        report = classification_report(y_test, y_pred)
        return accuracy, report

In [10]:
class SVMModel(BaseModel):
    def __init__(self):
        super().__init__('SVM')

    def tune(self, X_train, y_train):
        def objective(trial):
            C = trial.suggest_float('C', 1e-3, 1e3, log=True)
            kernel = trial.suggest_categorical('kernel', ['linear', 'poly', 'rbf', 'sigmoid'])
            gamma = trial.suggest_float('gamma', 1e-4, 1e1, log=True)
            degree = trial.suggest_int('degree', 2, 5) if kernel == 'poly' else 3

            model = SVC(C=C, kernel=kernel, gamma=gamma, degree=degree, random_state=42)
            score = cross_val_score(model, X_train, y_train, cv=3, scoring='accuracy')
            return score.mean()

        study = optuna.create_study(direction='maximize')
        study.optimize(objective, n_trials=30)
        self.best_params = study.best_params

    def train(self, X_train, y_train):
        self.model = SVC(**self.best_params, random_state=42)
        self.model.fit(X_train, y_train)


In [11]:
class GBDTModel(BaseModel):
    def __init__(self):
        super().__init__('GBDT')

    def _convert_to_dense(self, X):
        """
        Convert sparse matrix to dense if necessary.
        """
        return X.toarray() if hasattr(X, "toarray") else X

    def tune(self, X_train, y_train):
        """
        Tune the HistGradientBoostingClassifier using Optuna.
        """
        def objective(trial):
            # Convert sparse to dense
            X_train_dense = self._convert_to_dense(X_train)

            # Hyperparameter tuning
            learning_rate = trial.suggest_float('learning_rate', 0.01, 0.1, log=True)
            max_depth = trial.suggest_int('max_depth', 3, 10)
            max_iter = trial.suggest_int('max_iter', 50, 200)

            model = HistGradientBoostingClassifier(
                learning_rate=learning_rate, max_depth=max_depth, max_iter=max_iter, random_state=42
            )
            score = cross_val_score(model, X_train_dense, y_train, cv=3, scoring='accuracy')
            return score.mean()

        # Create and run the Optuna study
        study = optuna.create_study(direction='maximize')
        study.optimize(objective, n_trials=30)

        # Save the best parameters
        self.best_params = study.best_params
        print(f"Best parameters for GBDT: {self.best_params}")

    def train(self, X_train, y_train):
        """
        Train the HistGradientBoostingClassifier with the best parameters.
        """
        # Convert sparse to dense
        X_train_dense = self._convert_to_dense(X_train)

        # Initialize and fit the model
        self.model = HistGradientBoostingClassifier(**self.best_params, random_state=42)
        self.model.fit(X_train_dense, y_train)

    def evaluate(self, X_test, y_test):
        """
        Evaluate the model on the test set.
        """
        # Convert sparse to dense
        X_test_dense = self._convert_to_dense(X_test)

        # Predict and calculate accuracy and classification report
        y_pred = self.model.predict(X_test_dense)
        accuracy = accuracy_score(y_test, y_pred)
        report = classification_report(y_test, y_pred)
        return accuracy, report


In [12]:
class RuleBasedModel(BaseModel):
    def __init__(self):
        super().__init__('Rule-Based')

    def tune(self, X_train, y_train):
        """
        Tune the DummyClassifier to select the best strategy.
        """
        def objective(trial):
            # Suggest a strategy to evaluate
            strategy = trial.suggest_categorical(
                'strategy', ['most_frequent', 'prior', 'stratified', 'uniform']
            )
            # Create and evaluate the model
            model = DummyClassifier(strategy=strategy)
            score = cross_val_score(model, X_train, y_train, cv=3, scoring='accuracy')
            return score.mean()

        # Use Optuna to find the best strategy
        study = optuna.create_study(direction='maximize')
        study.optimize(objective, n_trials=30)  # Limited trials since it's lightweight

        # Save the best parameters
        self.best_params = {'strategy': study.best_params['strategy']}
        print(f"Best strategy for Rule-Based: {self.best_params['strategy']}")

    def train(self, X_train, y_train):
        """
        Train the DummyClassifier with the selected strategy.
        """
        self.model = DummyClassifier(strategy=self.best_params['strategy'])
        self.model.fit(X_train, y_train)

In [13]:
async def process_model(model, X_train, y_train, X_test, y_test):
    print(f"Tuning {model.name}...")
    await asyncio.to_thread(model.tune, X_train, y_train)
    print(f"Best parameters for {model.name}: {model.best_params}")

    print(f"Training {model.name}...")
    await asyncio.to_thread(model.train, X_train, y_train)

    print(f"Evaluating {model.name}...")
    accuracy, report = model.evaluate(X_test, y_test)
    print(f"Accuracy for {model.name}: {accuracy}")
    print(report)
    return model.name, accuracy

async def main(models, X_train, y_train, X_test, y_test):
    tasks = [process_model(model, X_train, y_train, X_test, y_test) for model in models]
    results = await asyncio.gather(*tasks)
    return results
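`asyncio.to_thread` hands each blocking call to a worker thread, and `asyncio.gather` waits for all of them while preserving the order of its arguments. That is why the log lines of the three models appear interleaved in the output. A minimal standard-library sketch, with `time.sleep` standing in for the tuning and training calls (`slow_job` and `run_all` are hypothetical names):

```python
import asyncio
import time

def slow_job(name: str, seconds: float) -> str:
    """Blocking stand-in for a model's tune/train call."""
    time.sleep(seconds)
    return name

async def run_all() -> list:
    # Each blocking call runs in its own worker thread
    jobs = [asyncio.to_thread(slow_job, name, 0.3) for name in ("a", "b", "c")]
    # gather returns results in the order the jobs were passed in
    return await asyncio.gather(*jobs)

start = time.perf_counter()
job_results = asyncio.run(run_all())
elapsed = time.perf_counter() - start
print(job_results, f"{elapsed:.2f}s")
```

Run as a plain script, the three sleeps overlap instead of adding up. Inside a notebook, where an event loop is already running, `asyncio.run` only works after `nest_asyncio.apply()`, as done in the import cell.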


---
## **Modeling and Evaluating**

In [None]:
# Instantiate models
svm_model = SVMModel()
gbdt_model = GBDTModel()
rule_based_model = RuleBasedModel()

models = [svm_model, gbdt_model, rule_based_model]

# Run all models asynchronously
results = asyncio.run(main(models, X_train_tfidf, y_train, X_test_tfidf, y_test))

[I 2024-12-25 18:31:25,830] A new study created in memory with name: no-name-c1d629fe-378e-4de8-8ca4-bd622d7b9cd2


[I 2024-12-25 18:31:25,857] A new study created in memory with name: no-name-817108e1-4b93-44d1-a82d-fbabf0af36a6
[I 2024-12-25 18:31:25,861] A new study created in memory with name: no-name-b1a53462-d8e9-4862-981f-96d3ae17f9b4
[I 2024-12-25 18:31:25,897] Trial 0 finished with value: 0.44372775117949154 and parameters: {'strategy': 'stratified'}. Best is trial 0 with value: 0.44372775117949154.


Tuning SVM...
Tuning GBDT...
Tuning Rule-Based...


[I 2024-12-25 18:31:25,970] Trial 1 finished with value: 0.44327225521778435 and parameters: {'strategy': 'stratified'}. Best is trial 0 with value: 0.44372775117949154.
[I 2024-12-25 18:31:25,988] Trial 2 finished with value: 0.4486362222786571 and parameters: {'strategy': 'stratified'}. Best is trial 2 with value: 0.4486362222786571.
[I 2024-12-25 18:31:26,008] Trial 3 finished with value: 0.45054543719752105 and parameters: {'strategy': 'stratified'}. Best is trial 3 with value: 0.45054543719752105.
[I 2024-12-25 18:31:26,025] Trial 4 finished with value: 0.4495456272809202 and parameters: {'strategy': 'stratified'}. Best is trial 3 with value: 0.45054543719752105.
[I 2024-12-25 18:31:26,044] Trial 5 finished with value: 0.44490732950776235 and parameters: {'strategy': 'stratified'}. Best is trial 3 with value: 0.45054543719752105.
[I 2024-12-25 18:31:26,061] Trial 6 finished with value: 0.5832727203840468 and parameters: {'strategy': 'most_frequent'}. Best is trial 6 with value: 0.

Best strategy for Rule-Based: most_frequent
Best parameters for Rule-Based: {'strategy': 'most_frequent'}
Training Rule-Based...
Evaluating Rule-Based...
Accuracy for Rule-Based: 0.416
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00       204
     neutral       0.00      0.00      0.00        88
    positive       0.42      1.00      0.59       208

    accuracy                           0.42       500
   macro avg       0.14      0.33      0.20       500
weighted avg       0.17      0.42      0.24       500



[I 2024-12-25 18:31:49,559] Trial 0 finished with value: 0.5832727203840468 and parameters: {'C': 0.03729904161642435, 'kernel': 'sigmoid', 'gamma': 0.0005638228812140856}. Best is trial 0 with value: 0.5832727203840468.
[I 2024-12-25 18:32:16,621] Trial 1 finished with value: 0.5832727203840468 and parameters: {'C': 0.004423196431533216, 'kernel': 'linear', 'gamma': 0.0006439605815368789}. Best is trial 0 with value: 0.5832727203840468.
[I 2024-12-25 18:32:41,330] Trial 2 finished with value: 0.5832727203840468 and parameters: {'C': 0.0034878401911191672, 'kernel': 'sigmoid', 'gamma': 0.00012665804650277226}. Best is trial 0 with value: 0.5832727203840468.
[I 2024-12-25 18:33:09,430] Trial 3 finished with value: 0.5832727203840468 and parameters: {'C': 0.9022082659302042, 'kernel': 'sigmoid', 'gamma': 0.0009410598472952621}. Best is trial 0 with value: 0.5832727203840468.
[I 2024-12-25 18:33:28,039] Trial 4 finished with value: 0.8456361379238796 and parameters: {'C': 0.79932239908662

Best parameters for SVM: {'C': 0.9342058907746094, 'kernel': 'linear', 'gamma': 0.01486921084615767}
Training SVM...
Evaluating SVM...
Accuracy for SVM: 0.734
              precision    recall  f1-score   support

    negative       0.68      0.88      0.77       204
     neutral       0.72      0.44      0.55        88
    positive       0.81      0.72      0.76       208

    accuracy                           0.73       500
   macro avg       0.74      0.68      0.69       500
weighted avg       0.74      0.73      0.73       500



[I 2024-12-25 18:47:52,437] Trial 2 finished with value: 0.7862725419050086 and parameters: {'learning_rate': 0.01576143526646691, 'max_depth': 9, 'max_iter': 122}. Best is trial 0 with value: 0.8165448977435122.
[I 2024-12-25 18:53:09,519] Trial 3 finished with value: 0.8203631044204531 and parameters: {'learning_rate': 0.06808285752610349, 'max_depth': 8, 'max_iter': 127}. Best is trial 3 with value: 0.8203631044204531.
[I 2024-12-25 18:54:54,158] Trial 4 finished with value: 0.7388180452573051 and parameters: {'learning_rate': 0.011542713125375432, 'max_depth': 4, 'max_iter': 74}. Best is trial 3 with value: 0.8203631044204531.
[I 2024-12-25 18:59:39,830] Trial 5 finished with value: 0.8168176994076767 and parameters: {'learning_rate': 0.06339500740731728, 'max_depth': 8, 'max_iter': 109}. Best is trial 3 with value: 0.8203631044204531.
[I 2024-12-25 19:02:34,623] Trial 6 finished with value: 0.7740000623858379 and parameters: {'learning_rate': 0.02919357303324603, 'max_depth': 6, '

Best parameters for GBDT: {'learning_rate': 0.0976727170592889, 'max_depth': 10, 'max_iter': 199}
Best parameters for GBDT: {'learning_rate': 0.0976727170592889, 'max_depth': 10, 'max_iter': 199}
Training GBDT...
Evaluating GBDT...
Accuracy for GBDT: 0.71
              precision    recall  f1-score   support

    negative       0.65      0.88      0.75       204
     neutral       0.63      0.31      0.41        88
    positive       0.82      0.72      0.77       208

    accuracy                           0.71       500
   macro avg       0.70      0.63      0.64       500
weighted avg       0.72      0.71      0.70       500
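The `macro avg` and `weighted avg` rows in the reports can be checked by hand: the macro F1 is the unweighted mean of the per-class F1 scores, while the weighted F1 weights each class by its support. The sketch below recomputes both for SVM and GBDT from the printed per-class values, which are already rounded to two decimals, so the last digit may differ slightly from an exact computation.

```python
# Per-class (f1, support) pairs copied from the SVM and GBDT reports above
reports = {
    "SVM": {"negative": (0.77, 204), "neutral": (0.55, 88), "positive": (0.76, 208)},
    "GBDT": {"negative": (0.75, 204), "neutral": (0.41, 88), "positive": (0.77, 208)},
}

def macro_f1(per_class):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for f1, _ in per_class.values()) / len(per_class)

def weighted_f1(per_class):
    """Mean of the per-class F1 scores weighted by class support."""
    total = sum(n for _, n in per_class.values())
    return sum(f1 * n for f1, n in per_class.values()) / total

for name, per_class in reports.items():
    print(f"{name}: macro F1 = {macro_f1(per_class):.2f}, "
          f"weighted F1 = {weighted_f1(per_class):.2f}")
```

Because the neutral class has the lowest F1 and the smallest support, it pulls the macro average down more than the weighted average in both models.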



---
## **Save the Result**

This step saves the **accuracy** of each model, trained with its best hyperparameters.

In [17]:
# Compare Results
results_df = pd.DataFrame(results, columns=["Model", "Accuracy"])

In [18]:
results_df.to_csv("results.csv", index=False)

---
# **Conclusion**

<div align='justify'>
&emsp;&emsp;&emsp;&emsp;
Across the three models compared, <strong>Rule-Based Classifier</strong>, <strong>Support Vector Machine (SVM)</strong>, and <strong>Gradient Boosted Decision Tree (GBDT)</strong>, <strong><code>SVM</code></strong> performs best on the test set with <strong>73.4% accuracy</strong> and a <strong>macro F1 of 69%</strong>. The <strong><code>Rule-Based</code></strong> model reaches only 41.6% accuracy and a macro F1 of 20% because the <em>most_frequent</em> strategy always predicts the majority class (positive), so its recall on the negative and neutral classes is zero. GBDT reaches 71% accuracy and a macro F1 of 64%, which indicates that it captures more complex patterns than the Rule-Based model. These differences show that machine-learning models such as SVM and GBDT are more effective at handling varied data and an imbalanced class distribution, although both still struggle with the neutral class (recall 0.44 for SVM and 0.31 for GBDT). SVM offers the best balance between accuracy and generalization across the three classes.
</div>

