# Reduce Model

This notebook walks through the steps taken to reduce the size of the main model that will be used.

## Import Library

Load the libraries that will be used.

In [1]:
# generals
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

# text preprocessing
import re
import string
import unicodedata
from indoNLP.preprocessing import (
    emoji_to_words,
    remove_html,
    remove_url,
    replace_slang,
    replace_word_elongation,
)
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# modelling
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# onnx exporter
from onnx.checker import check_model
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import StringTensorType

**configs**

In [2]:
if os.path.isdir("../data/"):
    main_dir = "../"
else:
    main_dir = "https://raw.githubusercontent.com/Hyuto/skripsi/master/"

SEED = 2022

## Load Dataset

Load the dataset that is ready for processing. The data comes from the [sampling](https://github.com/Hyuto/skripsi/blob/master/notebook/sampling.ipynb) process and has been filtered and labelled manually.

In [3]:
data = pd.read_csv(main_dir + "data/sample-data.csv")
data.head()

Unnamed: 0,date,url,user,content,label
0,2021-09-02 01:39:05+00:00,https://twitter.com/no_nykrstnd/status/1433243...,no_nykrstnd,"-Dari hasil monitoring, calon Vaksin Merah Put...",0.0
1,2021-07-15 06:09:36+00:00,https://twitter.com/DakwahMujahidah/status/141...,DakwahMujahidah,[PODCAST] Ngomong Politik - Ilusi Penguatan Ke...,0.0
2,2021-07-05 08:57:50+00:00,https://twitter.com/gamisjohor/status/14119725...,gamisjohor,3. GAMIS menyambut baik saranan daripada YAB P...,2.0
3,2021-09-09 09:17:58+00:00,https://twitter.com/inyesaw/status/14358952423...,inyesaw,@txtdaribogor Abis vaksin terbitlah positif covid,4.0
4,2021-01-02 04:37:14+00:00,https://twitter.com/pringgolakseno/status/1345...,pringgolakseno,"Gambling, vaksin sama ga divaksin.\nGa divaksi...",4.0


## Data Preprocessing

Perform the initial steps to prepare the data before modelling.

**General Preprocessing**

Preprocess the dataset as a whole. The steps are:

1. Drop every row that contains a `NaN` (missing) value.
2. Correct the data types of the `"date"` and `"label"` columns.

In [4]:
data.dropna(inplace=True)
data["date"] = pd.to_datetime(data["date"]).dt.tz_localize(None)
data["label"] = data["label"].astype(int)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3000 entries, 0 to 2999
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   date     3000 non-null   datetime64[ns]
 1   url      3000 non-null   object        
 2   user     3000 non-null   object        
 3   content  3000 non-null   object        
 4   label    3000 non-null   int64         
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 140.6+ KB


**Text Preprocessing**

Preprocess the text data in the `"content"` column, which is the main component. The steps are:

1. Case folding
2. Noise removal
   * Removing extra whitespace
   * Replacing non-ASCII characters
   * Removing HTML
   * Removing URLs
   * Removing digits
   * Removing punctuation
3. Normalizing word elongation
4. Replacing slang words
5. Translating emoji into words
6. Tokenization and stemming

In [5]:
STEMMER = StemmerFactory().create_stemmer()


def preprocessing(text):
    text = text.lower()
    text = re.sub(r"\s+", " ", text, flags=re.UNICODE)  # remove whitespace
    text = emoji_to_words(text)  # translate emoji into words
    text = unicodedata.normalize("NFD", text).encode("ascii", "ignore").decode("ascii")
    text = remove_html(text)  # remove html tags
    text = remove_url(text)  # remove url
    # text = re.sub(r"(?<![\w@])@([\w@]+(?:[.!][\w@]+)*)", " ", text)
    text = replace_word_elongation(text)  # replace WE
    text = replace_slang(text)  # replace slang words
    text = text.translate(str.maketrans(string.digits, " " * len(string.digits)))  # remove numbers
    text = text.translate(
        str.maketrans(string.punctuation, " " * len(string.punctuation))
    )  # remove punctuation
    text = " ".join(text.split())
    text = STEMMER.stem(text)
    return " ".join(text.split())


data["cleaned"] = [preprocessing(x) for x in tqdm(data["content"].values)]

  0%|          | 0/3000 [00:00<?, ?it/s]
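To make the noise-removal steps above concrete, here is a minimal stdlib-only sketch applied to one made-up sentence. It deliberately leaves out the indoNLP steps (slang, emoji, elongation) and the Sastrawi stemmer, and the function name `basic_clean`, the sample string, and the simplified URL pattern are illustrative assumptions, not part of the notebook.

```python
import re
import string
import unicodedata


def basic_clean(text):
    """Stdlib-only subset of the pipeline: no slang, emoji, elongation, or stemming."""
    text = text.lower()  # case folding
    # Simplified URL pattern (an assumption; the notebook uses indoNLP's remove_url).
    text = re.sub(r"https?://\S+", " ", text)
    # Decompose accented characters, then drop anything outside ASCII.
    text = unicodedata.normalize("NFD", text).encode("ascii", "ignore").decode("ascii")
    # Map every digit and punctuation character to a space.
    table = str.maketrans({ch: " " for ch in string.digits + string.punctuation})
    text = text.translate(table)
    return " ".join(text.split())  # collapse repeated whitespace


sample = "Vaksin ke-2 sudah!!  Info: https://example.com/a?b=1 café"
print(basic_clean(sample))
```

Removing the URL before stripping punctuation matters: if punctuation went first, the URL would be broken into ordinary-looking tokens that no longer match the pattern.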

## Modelling

The models are built with the configuration established in the [Main Notebook](https://github.com/Hyuto/skripsi/blob/master/notebook/Main%20Notebook.ipynb).

### Main Model

Build and train the main model.

In [6]:
x_train, x_test, y_train, y_test = train_test_split(
    data["cleaned"].values,
    data["label"].values,
    test_size=0.2,
    random_state=1000,
    stratify=data["label"].values,
)

pipe_linear_main = Pipeline(
    [
        ("tf-idf", TfidfVectorizer(max_features=5000)),
        (
            "svm",
            SVC(
                C=1.3, kernel="linear", probability=True, class_weight="balanced", random_state=SEED
            ),
        ),
    ]
)

pipe_linear_main.fit(x_train, y_train)
pd.DataFrame(classification_report(y_test, pipe_linear_main.predict(x_test), output_dict=True)).T

Unnamed: 0,precision,recall,f1-score,support
0,0.859155,0.770202,0.81225,396.0
1,0.166667,0.173913,0.170213,23.0
2,0.542373,0.695652,0.609524,92.0
3,0.28125,0.346154,0.310345,26.0
4,0.25,0.1875,0.214286,16.0
5,0.265306,0.433333,0.329114,30.0
6,0.3,0.176471,0.222222,17.0
accuracy,0.668333,0.668333,0.668333,0.668333
macro avg,0.380679,0.397604,0.381136,600.0
weighted avg,0.697214,0.668333,0.677985,600.0
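The `macro avg` and `weighted avg` rows can be reproduced from the per-class rows, which helps explain why the two differ so much here. The sketch below is plain arithmetic on numbers copied from the table above (the variable names are illustrative): the macro average treats all seven classes equally, while the weighted average is dominated by class 0 and its 396 test samples.

```python
# Per-class f1-score and support copied from the main model's report above.
f1 = [0.81225, 0.170213, 0.609524, 0.310345, 0.214286, 0.329114, 0.222222]
support = [396, 23, 92, 26, 16, 30, 17]

# Macro average: unweighted mean over the classes.
macro_f1 = sum(f1) / len(f1)

# Weighted average: mean weighted by the number of test samples per class.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(f"macro f1    = {macro_f1:.6f}")
print(f"weighted f1 = {weighted_f1:.6f}")
```

Up to rounding, the results agree with the 0.381136 and 0.677985 reported in the table, so the low macro figure reflects weak scores on the small classes rather than on the majority class.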


**Insight**

As the table above shows, the main model performs reasonably well, as already discussed in the [Main Notebook](https://github.com/Hyuto/skripsi/blob/master/notebook/Main%20Notebook.ipynb). However, the model is fairly complex because it has a large number of parameters. That parameter count is driven by the number of features that the TF-IDF weighting feeds into the model. To reduce the model's complexity, the number of features produced by TF-IDF (`max_features`) will be lowered.
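According to the scikit-learn documentation, `max_features` makes `TfidfVectorizer` build a vocabulary from only the top terms ordered by term frequency across the corpus. The sketch below imitates that selection with the standard library only. It is not scikit-learn's implementation: it splits on whitespace instead of using the vectorizer's token pattern, and the function name and toy documents are made up for illustration.

```python
from collections import Counter


def top_k_vocabulary(docs, k):
    """Keep the k most frequent whitespace-separated terms across all docs."""
    counts = Counter(term for doc in docs for term in doc.split())
    # most_common sorts by descending count; ties keep first-seen order.
    return [term for term, _ in counts.most_common(k)]


# Toy corpus (hypothetical, not taken from the dataset).
docs = [
    "vaksin aman vaksin halal",
    "vaksin aman",
    "vaksin gratis aman",
]
vocab = top_k_vocabulary(docs, 2)
print(vocab)
```

Terms that fall outside the top `k` are simply dropped from the feature matrix, which is why lowering `max_features` shrinks the input dimension of the SVM.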

### Medium Model

The number of features is reduced from 5000 to 3000, giving a model with the following performance.

In [7]:
pipe_linear_medium = Pipeline(
    [
        ("tf-idf", TfidfVectorizer(max_features=3000)),
        (
            "svm",
            SVC(
                C=1.3, kernel="linear", probability=True, class_weight="balanced", random_state=SEED
            ),
        ),
    ]
)

pipe_linear_medium.fit(x_train, y_train)
pd.DataFrame(classification_report(y_test, pipe_linear_medium.predict(x_test), output_dict=True)).T

Unnamed: 0,precision,recall,f1-score,support
0,0.866667,0.755051,0.807018,396.0
1,0.153846,0.173913,0.163265,23.0
2,0.533333,0.695652,0.603774,92.0
3,0.25,0.307692,0.275862,26.0
4,0.214286,0.1875,0.2,16.0
5,0.25,0.433333,0.317073,30.0
6,0.272727,0.176471,0.214286,17.0
accuracy,0.656667,0.656667,0.656667,0.656667
macro avg,0.36298,0.389945,0.368754,600.0
weighted avg,0.69645,0.656667,0.670681,600.0


**Insight**

The table above shows the performance of the medium model. Performance drops, as seen in the f1-scores: the weighted-average f1 falls from 0.678 to 0.671 and the macro-average f1 from 0.381 to 0.369. This drop is small relative to the reduction in the number of input features (5000 to 3000), so the medium model still performs reasonably well.

### Small Model

The number of features is reduced from 5000 to 1000, giving a model with the following performance.

In [8]:
pipe_linear_small = Pipeline(
    [
        ("tf-idf", TfidfVectorizer(max_features=1000)),
        (
            "svm",
            SVC(
                C=1.3, kernel="linear", probability=True, class_weight="balanced", random_state=SEED
            ),
        ),
    ]
)

pipe_linear_small.fit(x_train, y_train)
pd.DataFrame(classification_report(y_test, pipe_linear_small.predict(x_test), output_dict=True)).T

Unnamed: 0,precision,recall,f1-score,support
0,0.869697,0.724747,0.790634,396.0
1,0.142857,0.173913,0.156863,23.0
2,0.508065,0.684783,0.583333,92.0
3,0.268293,0.423077,0.328358,26.0
4,0.238095,0.3125,0.27027,16.0
5,0.255319,0.4,0.311688,30.0
6,0.333333,0.176471,0.230769,17.0
accuracy,0.641667,0.641667,0.641667,0.641667
macro avg,0.373666,0.413642,0.381702,600.0
weighted avg,0.697565,0.641667,0.660835,600.0


**Insight**

The table above shows the performance of the small model. Performance drops again: the weighted-average f1 falls from 0.678 to 0.661 and accuracy from 0.668 to 0.642, although the macro-average f1 is essentially unchanged (0.382 versus 0.381). The drop is fairly large, but given the reduction in the number of input features (5000 to 1000) the model's performance is still reasonably good.
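One way to put the trade-off in perspective is to express each change relative to the main model. The sketch below does this with plain arithmetic on the accuracy, macro f1, and weighted f1 values copied from the three reports above; the dictionary layout and the helper name `relative_change` are illustrative choices.

```python
# (max_features, accuracy, macro f1, weighted f1) copied from the reports above.
reports = {
    "main": (5000, 0.668333, 0.381136, 0.677985),
    "medium": (3000, 0.656667, 0.368754, 0.670681),
    "small": (1000, 0.641667, 0.381702, 0.660835),
}


def relative_change(new, base):
    """Change of `new` relative to `base`, in percent."""
    return (new - base) / base * 100


base = reports["main"]
for name, row in reports.items():
    feat, acc, macro, weighted = (relative_change(v, b) for v, b in zip(row, base))
    print(
        f"{name:>6}: features {feat:+6.1f}%  accuracy {acc:+5.2f}%  "
        f"macro f1 {macro:+5.2f}%  weighted f1 {weighted:+5.2f}%"
    )
```

Read this way, an 80% cut in features costs the small model only a few percent of accuracy and weighted f1 on this test split. The figures come from a single train/test split, so small differences should be interpreted with care.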

## Export Model

Export each model in ONNX format so that it is easier to deploy on any kind of device.

In [9]:
os.makedirs("output", exist_ok=True)


def convert2onnx(model, output_name, export=True):
    initial_type = [("words", StringTensorType([None, 1]))]
    options = {"svm": {"zipmap": False}}
    onnx_model = convert_sklearn(model, initial_types=initial_type, options=options)
    check_model(onnx_model)

    if export:
        filename = f"output/model-{output_name}.onnx"
        with open(filename, "wb") as writer:
            writer.write(onnx_model.SerializeToString())
        print(f"Exported to : {filename} - Size : {os.stat(filename).st_size}")

In [10]:
convert2onnx(pipe_linear_main, "svm-linear-large")
convert2onnx(pipe_linear_medium, "svm-linear-medium")
convert2onnx(pipe_linear_small, "svm-linear-small")

Exported to : output/model-svm-linear-large.onnx - Size : 54463174
Exported to : output/model-svm-linear-medium.onnx - Size : 32458631
Exported to : output/model-svm-linear-small.onnx - Size : 10617919
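The sizes printed above are in bytes (`st_size`). As a rough reading aid, the sketch below converts them to MiB and compares each file to the large model, using only the numbers from the output above; the variable names are illustrative.

```python
# File sizes in bytes, copied from the export output above.
sizes = {"large": 54463174, "medium": 32458631, "small": 10617919}
# max_features used for each pipeline.
features = {"large": 5000, "medium": 3000, "small": 1000}


def to_mib(n_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / 1024**2


for name, n_bytes in sizes.items():
    share = n_bytes / sizes["large"] * 100  # size relative to the large model
    per_feature = n_bytes / features[name]  # average bytes per TF-IDF feature
    print(
        f"{name:>6}: {to_mib(n_bytes):6.2f} MiB  "
        f"{share:5.1f}% of large  {per_feature:8.1f} bytes/feature"
    )
```

The bytes-per-feature figure stays close to constant across the three exports, which is consistent with the file size growing roughly in proportion to `max_features`.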
