# Hoax Classification

<p align='justify'>The Pilpres hoax classification project aims to build a system that identifies false information (hoaxes) about the presidential election (Pilpres) and distinguishes it from factual news. Its goal is to help the public judge whether election-related information is true, and so reduce the spread of fake news that can sway public opinion and the democratic process. The classification approach uses machine learning and data analysis to separate trustworthy information from untrustworthy information. The project is expected to contribute positively to the integrity and security of the democratic process during the presidential election.</p>

## Install all required packages

In [None]:
!pip install optuna
!pip install transformers[torch] datasets evaluate
!pip install huggingface_hub

Collecting optuna
  Downloading optuna-3.5.0-py3-none-any.whl (413 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.13.1-py3-none-any.whl (233 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.8.2-py3-none-any.whl (11 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Downloading Mako-1.3.0-py3-none-any.whl (78 kB)
Installing collected packages: Mako, colorlog, alembic, optuna
Successfully installed Mako-1.3.0 alembic-1.13.1 colorlog-6.8.2 optuna-3.5.0
Collecting accelerate>=0.20.3 (from transformers[torch])
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)

## Log in to Hugging Face

In [None]:
from huggingface_hub import notebook_login
notebook_login()

## Load the required libraries

In [None]:
import pandas as pd
import numpy as np
from datasets import Dataset

## Connect to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Load the dataset

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Indonesia_ai/Hoax/data/data_bersih.csv')
df.head()

| | Judul Artikel | Berita_Hoax |
|---|---|---|
| 0 | idi mati tugas kpps milu bukan lelah karena racun | Hoax |
| 1 | skenario milu harus bakal curang | Hoax |
| 2 | partai komunis cina biaya kampanye ganjar besa... | Hoax |
| 3 | pkb gerindra satu dukung anies baswedan milu | Hoax |
| 4 | video tvone siar jokowi maju jadi cawapres milu | Hoax |


## Inspect the dataset

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25932 entries, 0 to 25931
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Judul Artikel  25930 non-null  object
 1   Berita_Hoax    25932 non-null  object
dtypes: object(2)
memory usage: 405.3+ KB


The <b>Judul Artikel</b> column turns out to have <b>2 missing values</b>, so those rows are dropped.

In [None]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25930 entries, 0 to 25931
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Judul Artikel  25930 non-null  object
 1   Berita_Hoax    25930 non-null  object
dtypes: object(2)
memory usage: 607.7+ KB


In [None]:
# get category distribution
df['Berita_Hoax'].value_counts()

Fakta    13650
Hoax     12280
Name: Berita_Hoax, dtype: int64
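The two classes are close to balanced, which makes plain accuracy a reasonable headline metric. As a rough reference point, the sketch below computes the accuracy of a hypothetical baseline that always predicts the majority class, using only the counts printed above. The names `counts` and `baseline_accuracy` are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {"Fakta": 13650, "Hoax": 12280}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total

print(total)                        # 25930
print(majority_label)               # Fakta
print(round(baseline_accuracy, 4))  # 0.5264
```

Any trained model should clear this figure comfortably before its scores are worth interpreting.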

In [None]:
tags = np.unique(df['Berita_Hoax']) # get unique category
num_tags = len(tags) # get the number of category, here we have 2 tags/categories
label2id = {t: i for i, t in enumerate(tags)} # make a dictionary to map label to id
id2label = {i: t for i, t in enumerate(tags)} # make a dictionary to map id to label
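`np.unique` returns the labels in sorted order, so the mapping is deterministic. The pure-Python sketch below reproduces the same two dictionaries for the labels in this dataset and shows a round trip. Here `sorted(set(...))` stands in for `np.unique`, and the short `labels` list is a made-up sample.

```python
# A few labels as they appear in the Berita_Hoax column (order does not matter).
labels = ["Hoax", "Fakta", "Hoax", "Fakta"]

# sorted(set(...)) mirrors np.unique: unique values in ascending order.
tags = sorted(set(labels))
label2id = {t: i for i, t in enumerate(tags)}
id2label = {i: t for i, t in enumerate(tags)}

print(label2id)  # {'Fakta': 0, 'Hoax': 1}
print(id2label)  # {0: 'Fakta', 1: 'Hoax'}

# Round trip: encoding and then decoding returns the original labels.
encoded = [label2id[lb] for lb in labels]
decoded = [id2label[i] for i in encoded]
print(encoded)   # [1, 0, 1, 0]
```

With this ordering `Hoax` gets id 1, which matters because scikit-learn's binary metrics treat label 1 as the positive class by default.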

In [None]:
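# 80/20 random split; random_state makes it reproducible
train_data = df.sample(frac=0.8, random_state=42)
# the rows that were not sampled form the test set
test_data = df.drop(train_data.index)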

In [None]:

X_train = Dataset.from_pandas(train_data)
X_test = Dataset.from_pandas(test_data)

In [None]:
print(X_train)
print(X_test)

Dataset({
    features: ['Judul Artikel', 'Berita_Hoax', '__index_level_0__'],
    num_rows: 20744
})
Dataset({
    features: ['Judul Artikel', 'Berita_Hoax', '__index_level_0__'],
    num_rows: 5186
})
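The row counts follow directly from the 80/20 split. A quick arithmetic check, assuming pandas rounds `frac * len(df)` to the nearest integer when sampling:

```python
n_rows = 25930  # rows left after dropna()
frac = 0.8      # fraction passed to df.sample

n_train = round(frac * n_rows)  # rows drawn for training
n_test = n_rows - n_train       # rows left after dropping the sampled index

print(n_train, n_test)  # 20744 5186
```

The extra `__index_level_0__` feature is the original DataFrame index, which `Dataset.from_pandas` keeps as a column when the index is not a plain range.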


In [None]:
X_train[10]

{'Judul Artikel': 'pakar khawatir isu amandemen uud rembet soal presiden',
 'Berita_Hoax': 'Fakta',
 '__index_level_0__': 25244}

In [None]:
from transformers import AutoTokenizer

max_length = 128  # maximum number of tokens per title

# model_max_length is the tokenizer's own length-limit setting
tokenizer = AutoTokenizer.from_pretrained("indolem/indobertweet-base-uncased", model_max_length=max_length)


def tokenize_function(examples):

    # tokenize the titles, truncating or padding each one to max_length tokens
    tokenized_input = tokenizer(examples["Judul Artikel"],
                                truncation=True,
                                padding='max_length',
                                max_length=max_length)
    # convert the string labels to integer ids
    tokenized_input['label'] = [label2id[lb] for lb in examples['Berita_Hoax']]

    return tokenized_input

# batched=True passes lists of examples to tokenize_function
tokenized_train_data = X_train.map(tokenize_function, batched=True)
tokenized_test_data = X_test.map(tokenize_function, batched=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
tokenized_test_data

Dataset({
    features: ['Judul Artikel', 'Berita_Hoax', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask', 'label'],
    num_rows: 5186
})
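With `padding='max_length'` and `truncation=True`, every title is encoded to exactly `max_length` positions. The pure-Python sketch below imitates that behaviour on a list of token ids without loading the tokenizer. `encode_fixed_length` is a hypothetical helper, and the special-token ids (3 for the start token, 4 for the end token, 0 for padding) are assumptions read off the sample that this notebook prints for `tokenized_test_data[50]`.

```python
def encode_fixed_length(content_ids, max_length=128, cls_id=3, sep_id=4, pad_id=0):
    """Imitate truncation=True with padding='max_length' for a single sequence."""
    # Reserve two positions for the start and end special tokens.
    kept = content_ids[: max_length - 2]
    input_ids = [cls_id] + kept + [sep_id]
    # Real tokens get attention 1; padding positions get 0.
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    input_ids += [pad_id] * n_pad
    attention_mask += [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short = encode_fixed_length([5524, 3136, 9988])
print(len(short["input_ids"]))       # 128
print(short["input_ids"][:6])        # [3, 5524, 3136, 9988, 4, 0]
print(sum(short["attention_mask"]))  # 5

long = encode_fixed_length(list(range(100, 400)))
print(len(long["input_ids"]))        # 128
print(long["input_ids"][-1])         # 4
```

Because news titles are short, most of the 128 positions end up as padding. Padding each batch only to its longest member (for example with `DataCollatorWithPadding` from transformers) is a common alternative, though it is not what this notebook does.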

In [None]:
tokenized_test_data[50]

{'Judul Artikel': 'panglima tni udah tau kan arah mana makan yg banyak jendral biar tambah sehat',
 'Berita_Hoax': 'Hoax',
 '__index_level_0__': 213,
 'input_ids': [3, 5524, 3136, 9988, 5661, 2797, 3442, 2420, 2387, 3798, 1814, 11589, 9421, 5937, 5427, 4,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,


In [None]:
# define the metrics

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report

def compute_metrics(p):
    # p holds the model logits and the true label ids
    pred, labels = p
    # predicted class = index of the largest logit in each row
    pred = np.argmax(pred, axis=1)

    report = classification_report(labels, pred, digits=4)
    acc = accuracy_score(y_true=labels, y_pred=pred)
    # binary scores use sklearn's default positive class, label 1 ('Hoax')
    rec = recall_score(y_true=labels, y_pred=pred)
    prec = precision_score(y_true=labels, y_pred=pred)
    f1 = f1_score(y_true=labels, y_pred=pred)

    print("Classification Report:\n{}".format(report))
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}
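The `Trainer` passes raw logits, so `np.argmax` picks the predicted class id for each row, and the binary scores are computed for class 1 (scikit-learn's default `pos_label`), which is `Hoax` under the label mapping defined earlier. Below is a dependency-free sketch of the same calculation on made-up logits. `binary_scores` is an illustrative helper and the numbers are toy values, not results from this project.

```python
def binary_scores(y_true, y_pred, pos_label=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy logits for five titles: column 0 = Fakta, column 1 = Hoax.
logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3], [0.2, 0.4]]
labels = [0, 1, 1, 1, 0]

# Row-wise argmax, as np.argmax(pred, axis=1) does in compute_metrics.
preds = [row.index(max(row)) for row in logits]
print(preds)  # [0, 1, 1, 0, 1]

scores = binary_scores(labels, preds)
print(scores)
```

In this toy case one hoax is missed and one factual title is flagged, so precision and recall for `Hoax` are both 2/3 while accuracy is 3/5.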

In [None]:
from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification


checkpoint = "indolem/indobertweet-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels=num_tags,
                                                           id2label=id2label,
                                                           label2id=label2id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at indolem/indobertweet-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
output_dir = "hoaxpemilu" # name your own output directory
training_args = TrainingArguments(output_dir=output_dir,
                                  evaluation_strategy="epoch",
                                  num_train_epochs=5,
                                  push_to_hub=True)

trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_test_data,
    compute_metrics=compute_metrics,
)

trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.2232 | 0.283323 | 0.936753 | 0.972373 | 0.889395 | 0.929035 |
| 2 | 0.1317 | 0.233414 | 0.957964 | 0.948163 | 0.962303 | 0.955181 |
| 3 | 0.0515 | 0.27473 | 0.951215 | 0.966336 | 0.927506 | 0.946523 |
| 4 | 0.0236 | 0.310009 | 0.958735 | 0.958333 | 0.952775 | 0.955546 |
| 5 | 0.0089 | 0.300086 | 0.960663 | 0.958506 | 0.956918 | 0.957711 |
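Training loss falls every epoch while validation loss is lowest at epoch 2, so the final checkpoint is not the best one under every criterion. The sketch below picks the best epoch from the logged rows under two criteria. The `history` list is copied from the training log above, and the variable names are illustrative.

```python
# (epoch, training_loss, validation_loss, f1) copied from the training log.
history = [
    (1, 0.2232, 0.283323, 0.929035),
    (2, 0.1317, 0.233414, 0.955181),
    (3, 0.0515, 0.274730, 0.946523),
    (4, 0.0236, 0.310009, 0.955546),
    (5, 0.0089, 0.300086, 0.957711),
]

# Lowest validation loss and highest F1 can point at different epochs.
best_by_val_loss = min(history, key=lambda row: row[2])[0]
best_by_f1 = max(history, key=lambda row: row[3])[0]

print(best_by_val_loss)  # 2
print(best_by_f1)        # 5
```

The F1 gap between epochs 2 and 5 is small, while the rising validation loss alongside near-zero training loss hints at overfitting. The `load_best_model_at_end` and `metric_for_best_model` options of `TrainingArguments` are one way to keep the preferred checkpoint, although this notebook does not set them.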


TrainOutput(global_step=12965, training_loss=0.10107335400682502, metrics={'train_runtime': 3051.4901, 'train_samples_per_second': 33.99, 'train_steps_per_second': 4.249, 'total_flos': 6822469665484800.0, 'train_loss': 0.10107335400682502, 'epoch': 5.0})
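The `global_step` value can be reproduced from the dataset size. The check below assumes the `TrainingArguments` default of `per_device_train_batch_size=8`, a single device, and no gradient accumulation, since the notebook does not override any of these.

```python
import math

n_train = 20744  # training examples
batch_size = 8   # assumed: TrainingArguments default on one device
epochs = 5       # num_train_epochs

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 2593
print(total_steps)      # 12965
```

The result matches the reported `global_step=12965`, which supports the batch-size assumption.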

In [None]:
trainer.evaluate()

{'eval_loss': 0.3000864088535309,
 'eval_accuracy': 0.9606633243347474,
 'eval_precision': 0.9585062240663901,
 'eval_recall': 0.9569179784589892,
 'eval_f1': 0.9577114427860696,
 'eval_runtime': 39.1548,
 'eval_samples_per_second': 132.449,
 'eval_steps_per_second': 16.575,
 'epoch': 5.0}
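These scores are consistent with a single confusion matrix. The counts below are inferred from the reported values and the 5186 evaluation rows; the notebook does not log them, so treat them as a reconstruction. `Hoax` (id 1) is taken as the positive class.

```python
# Inferred counts (assumption): chosen so the ratios match the eval output.
tp, fp, fn, tn = 2310, 100, 104, 2672

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

print(total)                # 5186
print(round(accuracy, 6))   # 0.960663
print(round(precision, 6))  # 0.958506
print(round(recall, 6))     # 0.956918
print(round(f1, 6))         # 0.957711
```

Read this way, the model would miss 104 of 2414 hoax titles and flag 100 factual titles as hoaxes on the held-out split.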

In [None]:
# save model
trainer.save_model(output_dir)
# push
trainer.push_to_hub(commit_message="Training complete")

CommitInfo(commit_url='https://huggingface.co/AptaArkana/hoaxpemilu/commit/8dc42e5239155dc5f2bafe368b061f49493eab26', commit_message='Training complete', commit_description='', oid='8dc42e5239155dc5f2bafe368b061f49493eab26', pr_url=None, pr_revision=None, pr_num=None)