# NLP Project - Classification of Online Gambling Comments

**Project Title:** HAJAR (Hapus Judi Online Anti Ribet)

**Team ID:** CC25-CF230

**Team Members:**

1. (ML) MC009D5Y0493 - Ahmad Zaky Humami - Universitas Gunadarma - Active
2. (ML) MC009D5Y0506 - Fahru Rahman - Universitas Gunadarma - Active
3. (ML) MC314D5X1177 - Shofi Shulhiyana - Universitas Singaperbangsa Karawang - Active
4. (FEBE) FC009D5Y0885 - Muhammad Faris Rasyid Raharjo - Universitas Gunadarma - Active
5. (FEBE) FC314D5Y1568 - Fahry Firdaus Marpaung - Universitas Singaperbangsa Karawang - Active
6. (FEBE) FC009D5Y1828 - Linggar Riza Hamretta - Universitas Gunadarma - Active


## Project Overview
YouTube is one of the world's largest video-based social media platforms, and its comment section lets users interact with one another. Recently, however, the comment sections of YouTube videos have often been abused by irresponsible parties to spread spam and promotions for online gambling. These comments disrupt users who want to hold a discussion in the comment section, and they can be dangerous when they contain links to online gambling sites. Spammers also frequently disguise their messages with obfuscated words and symbols.

Why this problem needs to be solved:
1.  **Disrupts user comfort**: Abused comment sections get in the way of users who want to hold a discussion there. This degrades the user experience and can make users reluctant to return to the platform.
2.  **Dangers of online gambling content**: Online gambling content can harm users, especially children and teenagers. They may not know enough to recognize the danger of such content and can end up trapped in unwanted situations. It can also harm adult users, because it can lead to addiction and financial problems.


## Business Understanding

### Problem Statements
1. 
2. 
3. 

### Goals
1. 
2. 
3. 

## Data Understanding

### Import Library


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Text preprocessing
import unicodedata
import string
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from ftfy import fix_text

# Train/test split & metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

# Deep learning with TensorFlow/Keras
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer
from collections import Counter

# Classical ML models
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Word cloud visualization
from wordcloud import WordCloud
import io

In [None]:
nltk.download('punkt_tab', download_dir='C:/nltk_data')
nltk.download('punkt', download_dir='C:/nltk_data')
nltk.download('stopwords', download_dir='C:/nltk_data')
nltk.data.path.append('C:/nltk_data')

### Load Dataset

Dataset For Training

In [None]:
df1 = pd.read_csv('Datasets/youtube_comments.csv')
df2 = pd.read_csv('Datasets/komentar_judi.csv')
df3 = pd.read_csv('Datasets/komentar_judi2.csv')

### Variable Description

Variable | Description
----------|----------
Author | Unique username.
Comment | User comment from a YouTube video.

In [None]:
df1.info()

In [None]:
df1.isnull().sum()

In [None]:
print("Duplicated : ", df1.duplicated().sum())

In [None]:
df2.info()

In [None]:
df2.isnull().sum()

In [None]:
print("Duplicated : ", df2.duplicated().sum())

In [None]:
df3.info()

In [None]:
df3.isnull().sum()

In [None]:
print("Duplicated : ", df3.duplicated().sum())

#### Handling Missing Values


In [None]:
df1 = df1.dropna()

print("Missing values after dropping NaNs:")
print("Dataset 1 missing values:\n", df1.isnull().sum())

In [None]:
df2 = df2.dropna()

print("Missing values after dropping NaNs:")
print("Dataset 2 missing values:\n", df2.isnull().sum())

In [None]:
df3 = df3.dropna()

print("Missing values after dropping NaNs:")
print("Dataset 3 missing values:\n", df3.isnull().sum())

### Merge Dataset

In [None]:
df_combined = pd.concat([df1, df2, df3], ignore_index=True)

In [None]:
df_combined.info()

In [None]:
df_combined.head(100)

Missing Values

In [None]:
df_combined.isnull().sum()

## Data Preprocessing

In [None]:
# Drop comments that contain only one word
df_combined = df_combined[df_combined['comment'].str.split().str.len() > 1]

Remove missing values

In [None]:
# Remove rows with any missing values (reassign instead of modifying the filtered slice in place)
df_combined = df_combined.dropna()

print("Missing values per column: \n")
print(df_combined.isnull().sum())

### Cleaning text

In [None]:
# Converting all the characters in a text into lower case
def casefoldingText(text):
      return text.lower()

In [None]:
def normalize_unicode_to_ascii(text):
      """Normalize text to ASCII, preserving spaces"""
      if isinstance(text, str):
            # Repair mojibake and other encoding damage first
            text = fix_text(text)
            # Normalize Unicode to its compatibility-decomposed form
            text = unicodedata.normalize('NFKD', text)
            # Replace non-ASCII characters with a space instead of deleting them
            text = ''.join(ch if ord(ch) < 128 else ' ' for ch in text)

      return text

print(normalize_unicode_to_ascii("gachoг m𝘶lu ԁi 𝘿𝙊 𝙍 𝘼 𝟳 𝟳🙍!"))
print(normalize_unicode_to_ascii("Pngguna bru 🛑𝐊𝗨𝐒𝗨𝐌𝗔𝐓𝟬𝐓𝟬🚦,pm aja"))
print(normalize_unicode_to_ascii("Gaji numpang lewat? Biarin, ada ♛𝗔𝗦𝗜𝗔𝗚𝗘𝗡𝗧𝗜𝗡𝗚♛"))
print(normalize_unicode_to_ascii("😎: Betul Bro ⚡𝗦𝗨𝗣𝗘𝗥𝗠𝗢𝗡𝗘𝗬𝟴𝟴⚡⚡𝗦𝗨𝗣𝗘𝗥𝗠𝗢𝗡𝗘𝗬𝟴𝟴⚡"))
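
To see why NFKD normalization helps against stylized spam, the sketch below applies `unicodedata.normalize` to one of the brand names from the samples above. It uses only the standard library and skips the `fix_text` step; the helper name `to_ascii_sketch` is an assumption made for illustration and is not part of the notebook's pipeline.

```python
import unicodedata

def to_ascii_sketch(text: str) -> str:
    """Hypothetical helper: NFKD-decompose, then replace leftover non-ASCII with spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch if ord(ch) < 128 else " " for ch in decomposed)

styled = "𝗦𝗨𝗣𝗘𝗥𝗠𝗢𝗡𝗘𝗬𝟴𝟴"   # mathematical sans-serif bold letters and digits
plain = to_ascii_sketch(styled)
print(plain)                  # the styled characters fold to plain ASCII

# Characters without a compatibility decomposition (emoji, symbols) become spaces
print(repr(to_ascii_sketch("⚡ok")))
```

Look-alike letters from other scripts, such as the Cyrillic `г` in the first sample, have no compatibility decomposition either, so this approach turns them into spaces rather than into the Latin letter they imitate.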

In [None]:
# Function for cleaning text
def cleaningText(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove mentions (@username)
    text = re.sub(r'@\w+', '', text)
    # Remove hashtags (#hashtag)
    text = re.sub(r'#\w+', '', text)

    # Replace commas with spaces
    text = re.sub(r',', ' ', text)

    # Remove standalone numbers, but keep digits attached to a word (e.g. "weton88")
    text = re.sub(r'\b\d+\b', '', text)

    # Replace non-alphanumeric characters at the beginning or end of the string with a space
    text = re.sub(r'^[^a-zA-Z0-9]+|[^a-zA-Z0-9]+$', ' ', text)

    # Remove all punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Replace newlines with spaces
    text = text.replace('\n', ' ')
    # Replace any remaining non-word, non-space characters with a space
    text = re.sub(r'[^\w\s]', ' ', text)
    # Collapse repeated whitespace and strip both ends
    text = re.sub(r'\s+', ' ', text).strip()

    # Join a lowercase word with a number that directly follows it
    text = re.sub(r'(\b[a-z]+) (\d+\b)', r'\1\2', text)

    return text

text = "gachoг m𝘶lu ԁi 𝘿𝙊 𝙍 𝘼 𝟳 𝟳🙍!"
text = normalize_unicode_to_ascii(text)
print(text)
print(cleaningText(text))

print(cleaningText(normalize_unicode_to_ascii("gachoг m𝘶lu ԁi 𝘿𝙊 𝙍 𝘼 𝟳 𝟳🙍!")))
print(cleaningText(normalize_unicode_to_ascii("Pngguna bru 🛑𝐊𝗨𝐒𝗨𝐌𝗔𝐓𝟬𝐓𝟬🚦,pm aja")))
print(cleaningText(normalize_unicode_to_ascii("Gaji numpang lewat? Biarin, ada ♛𝗔𝗦𝗜𝗔𝗚𝗘𝗡𝗧𝗜𝗡𝗚♛")))
print(cleaningText(normalize_unicode_to_ascii("😎: Betul Bro ⚡𝗦𝗨𝗣𝗘𝗥𝗠𝗢𝗡𝗘𝗬𝟴𝟴⚡⚡𝗦𝗨𝗣𝗘𝗥𝗠𝗢𝗡𝗘𝗬𝟴𝟴⚡")))
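
The number-removal step is the least obvious part of the cleaning function, so here is a minimal standalone sketch of it. It reuses the same `\b\d+\b` pattern followed by whitespace collapsing; the helper name `drop_standalone_numbers` is hypothetical.

```python
import re

def drop_standalone_numbers(text: str) -> str:
    """Hypothetical helper mirroring the number-removal and whitespace steps."""
    text = re.sub(r"\b\d+\b", "", text)       # remove digit runs that form a whole token
    return re.sub(r"\s+", " ", text).strip()  # collapse the gaps left behind

# Digits glued to letters survive, because there is no word boundary inside "weton88"
print(drop_standalone_numbers("weton88 menang 100 kali"))

# Digits separated by spaces are each a whole token, so they are removed
print(drop_standalone_numbers("d o r a 7 7"))
```

One consequence that may be worth checking on real data: when a spammer spaces out the digits of a brand name, as in the `𝘿𝙊 𝙍 𝘼 𝟳 𝟳` sample, those digits become standalone tokens and appear to be removed before the later join step has a chance to reattach them.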

### Tokenizing text

In [None]:
def tokenizingText(text): # Split a string into a list of tokens
    text = word_tokenize(text)
    return text

### Removing Stopwords

In [None]:
def filteringText(text): # Remove stopwords from a list of tokens
    listStopwords = set(stopwords.words('indonesian'))
    listStopwords.update(stopwords.words('english'))
    listStopwords.update(['iya','yaa','gak','nya','na','sih','ku',"di","ga","ya","gaa","loh","kah","woi","woii","woy"])
    return [txt for txt in text if txt not in listStopwords]

### Stemming text

In [None]:
# Create the stemmer once; building it on every call is slow
stemmer = StemmerFactory().create_stemmer()

def stemmingText(text): # Reduce each word to its root form by stripping affixes
    # Split the text into a list of words
    words = text.split()

    # Apply stemming to each word in the list
    stemmed_words = [stemmer.stem(word) for word in words]

    # Join the stemmed words back into a single string
    stemmed_text = ' '.join(stemmed_words)

    return stemmed_text

### Convert to sentence

In [None]:
def toSentence(list_words): # Convert list of words into sentence
    sentence = ' '.join(word for word in list_words)
    return sentence

### Correcting the slang words

In [None]:
def fix_slangwords(text):
    if not isinstance(text, str):
        return ""

    slang_dict = {
        # Negation and denial
        "gk": "tidak", "gak": "tidak", "ga": "tidak", "g": "tidak", "nggak": "tidak", "enggak": "tidak",
        "gpp": "tidak apa-apa", "gakpapa": "tidak apa-apa", "gabut": "tidak ada kerjaan",
        "gapapa": "tidak apa-apa", "ngga": "tidak",
        "nggakpapa": "tidak apa-apa", "nggpp": "tidak apa-apa", "nggaapa": "tidak apa-apa",

        # Conjunctions and connectors
        "tp": "tetapi", "tapi": "tetapi", "kl": "kalau", "klw": "kalau", "kalo": "kalau", "krn": "karena",
        "karena": "karena", "jd": "jadi", "jg": "juga", "aja": "saja", "sih": "", "kok": "mengapa",
        "dl": "dulu", "pdhl": "padahal", "btw": "ngomong-ngomong", "spt": "seperti",

        # Personal pronouns
        "sy": "saya", "gw": "saya", "gue": "saya", "gua": "saya", "w": "saya", "gwe": "saya",
        "q": "aku", "ak": "aku", "aq": "aku", "km": "kamu", "lu": "kamu", "lo": "kamu", "elo": "kamu",
        "elu": "kamu", "loe": "kamu", "u": "kamu", "i": "saya", "tmn": "teman", "tmn2": "teman-teman",

        # Verbs / actions (multi-word keys are omitted: the lookup below works on single tokens)
        "udh": "sudah", "udah": "sudah", "sdh": "sudah", "lg": "lagi", "bikin": "membuat",
        "ksih": "kasih", "ksh": "kasih", "jgn": "jangan", "jangan": "jangan", "biar": "agar",
        "supaya": "agar", "bisa": "bisa", "bs": "bisa", "bsa": "bisa", "sabi": "bisa", "dlm": "dalam",
        "belain": "membela", "belainin": "membela", "bela": "membela", "bales": "membalas",
        "balas": "membalas", "balikin": "mengembalikan", "balikinya": "mengembalikannya",

        # Nouns / objects
        "org": "orang", "modal": "uang", "cuan": "untung", "bonus": "hadiah", "jp": "jackpot",
        "jepe": "jackpot", "jepey": "jackpot", "slot": "permainan judi", "betting": "taruhan",
        "promo": "promosi", "event": "acara", "depo": "deposit", "wd": "withdraw",

        # Emotions and informal expressions
        "anjay": "astaga", "anjir": "astaga", "anjrit": "astaga", "wkwk": "haha", "wkwkwk": "haha",
        "wk": "haha", "lol": "haha", "ngakak": "tertawa", "baper": "terbawa perasaan",
        "kepo": "penasaran", "julid": "iri", "gibah": "bergosip", "santuy": "santai", "woles": "santai",
        "mager": "malas", "lebay": "berlebihan", "pecah": "seru", "ngablu": "mengigau", "cape": "capek",
        "capekkk": "capek", "pusinggg": "pusing", "ngeri": "hebat", "goks": "hebat", "receh": "tidak penting",
        "mantul": "bagus", "mantab": "mantap", "uhuy": "mantap", "skuy": "ayo", "gas": "ayo",
        "gaskeun": "ayo", "panik": "takut", "bgt": "banget", "banget": "sekali", "auto": "langsung",
        "halu": "berkhayal",

        # Terms of address / nicknames
        "min": "admin", "bang": "kakak", "bg": "kakak", "bng": "kakak", "kak": "kakak", "bro": "saudara",
        "sis": "kakak", "ngab": "teman", "cuy": "teman", "ngabers": "remaja pria", "mrk": "mereka",
        "sm": "sama", "sama": "dengan", "dg": "dengan", "dr": "dari", "utk": "untuk", "yg": "yang",
        "dll": "dan lain-lain", "dst": "dan seterusnya", "ttp": "tetap", "tsb": "tersebut",
        "mnrt": "menurut", "jdwal": "jadwal", "bener": "benar", "d": "di", "emg": "memang", "emng": "memang",
        "bocil": "anak kecil", "gacr": "gacor", "gacir": "gacor", "gcr": "gacor", "mekswin": "maxwin",
        "win": "menang", "gmpng": "mudah", "gampang": "mudah", "bet": "banget", "nasib": "keberuntungan"
    }

    words = text.split()
    new_words = [slang_dict.get(word.lower(), word) for word in words]
    return ' '.join(new_words)
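
The slang correction is a per-token dictionary lookup. The sketch below reproduces that mechanism on a three-entry toy dictionary; `replace_slang_sketch` and `toy_slang` are hypothetical names used only for this illustration.

```python
def replace_slang_sketch(text: str, mapping: dict) -> str:
    """Replace each whitespace-separated token that has an entry in `mapping`."""
    return " ".join(mapping.get(token.lower(), token) for token in text.split())

toy_slang = {
    "gk": "tidak",
    "udh": "sudah",
    "balikin aja": "mengembalikannya",  # multi-word key: can never equal a single token
}

out = replace_slang_sketch("gk bisa udh balikin aja", toy_slang)
print(out)
```

Because the text is split on whitespace before the lookup, only single-token keys can ever match. Handling multi-word slang would need a separate phrase-level replacement pass.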

In [None]:
# Applies a series of text preprocessing functions to the 'comment' column of the DataFrame.
df_combined.loc[:,'normalizeText'] = df_combined['comment'].apply(normalize_unicode_to_ascii)
df_combined['cleanText'] = df_combined['normalizeText'].apply(cleaningText)
df_combined['casefoldingText'] = df_combined['cleanText'].apply(casefoldingText)
df_combined['fixSlangWords'] = df_combined['casefoldingText'].apply(fix_slangwords)
df_combined['stemmingText'] = df_combined['fixSlangWords'].apply(stemmingText)
df_combined['tokenizingText'] = df_combined['stemmingText'].apply(tokenizingText)
df_combined['stopWordText'] = df_combined['tokenizingText'].apply(filteringText)
df_combined['finalText'] = df_combined['stopWordText'].apply(toSentence)

In [None]:
df_combined.head(100)

### Encode Labeling

#### Label Judol

In [None]:
# Define keywords for gambling comments
gambling_keywords = [
    'gacor', 'gacor mulu', 'g4cor', 'g4cor mulu', 'g4c0r', 'g4c0r mulu', 'gac0r', 'gac0r mulu', 'gachor', 'gachor mulu', 'gachog', 'gachog mulu', 'gacho', 'gacho mulu',
    'jepe', 'jepe terus', 'j3pe', 'j3pe terus', 'jep3', 'jep3 terus', 'j3p3', 'j3p3 terus', 'jp', 'jp terus', 'jackpot', 'jackpot terus', 'jekpot', 'jekpot terus', 'j3kpot', 'j3kpot terus', 'j3kp0t', 'j3kp0t terus', 'jekp0t', 'jekp0t terus', 'j4ckp0t', 'j4ckp0t terus', 'j4ckpot', 'j4ckpot terus',
    'jackpot mulu', 'jackpot terus', 'jackpot hari ini', 'jackpot mudah menang', 'jackpot gampang menang',
    'jackpot maxwin', 'jackpot gacor', 'jackpot gacor hari ini', 'jackpot gacor terbaru', 'jackpot gacor maxwin',
    'jackpot gacor mudah menang', 'jackpot slot', 'jackpot judi', 'jackpot online', 'jackpot slot online',
    'bonus', 'b0nus', 'klaim', 'hoki', 'h0k1', 'hok1', 'h0ki', 'cuan', 'cu4n', 'menang', 'm3nang', 'menang terus', 'm3ang terus',
    'main', 'main slot', 'main judi', 'judi online', 'judi slot', 'judi slot online', 'judi slot gacor',
    'judi slot terbaru', 'judi slot hari ini', 'judi slot gampang menang', 'judi slot mudah menang', 'judi slot maxwin',
    'judi slot gacor hari ini', 'judi slot gacor terbaru', 'judi slot gacor maxwin', 'judi slot gacor mudah menang',
    'spin', 'free spin', 'auto win', 'pola', 'wd', 'withdraw', 'depo', 'deposit', 'withdrawal', 'saldo',
    'deposit pulsa', 'deposit ovo', 'deposit dana', 'deposit gopay', 'deposit via pulsa', 'deposit via ovo',
    'deposit via dana', 'deposit via gopay', 'withdraw pulsa', 'withdraw ovo', 'withdraw dana', 'withdraw gopay',
    'withdraw via pulsa', 'withdraw via ovo', 'withdraw via dana', 'withdraw via gopay', 'deposit bank',
    'withdraw bank', 'deposit bank lokal', 'withdraw bank lokal', 'deposit bank online', 'withdraw bank online',
    'bandar', 'situs', 'toto', 'togel', 'sl0t', 'slot', 'slot online', 'game slot', 'link slot', 'link gacor',
    'link alternatif', 'link judi', 'link slot gacor', 'link slot terbaru', 'link slot hari ini', 'link slot mudah menang',
    'pr0be855', 'weton88', 'pulauwin', '25kbet', 'alexis17', 'alexis', 'berkah99', 'aero88', 'sgi88', 'pluto88',
    'sultan88', 'sultanbet', 'sultanbet88', 'sultanbet99', 'sultanbet77', 'sultanbet88', 'sultanbet99', 'sultanbet77',
    'garudahoki', 'mona4d', 'berlian', 'btv', 'xuxu4d', 'pstoto99', 'daftar sekarang', 'join sekarang', 'link alternatif',
    'login disini', 'klik disini', 'event harian', 'event mingguan', 'turnover', 'rollingan', 'komisi', 'claim sekarang',
    'claim bonus', 'claim hadiah', 'claim jackpot', 'claim jepe', 'claim jp', 'claim bonus harian', 'claim bonus mingguan',
    'claim bonus bulanan', 'claim bonus tahunan', 'claim bonus slot', 'claim bonus judi', 'claim bonus gacor',
    'live casino', 'judi', 'casino', 'tembus', 'untung terus', 'deposit via dana', 'via gopay', 'via ovo', 'via pulsa',
    'slot terpercaya', 'slot terbaru', 'promo deposit', 'promosi slot', 'event slot', 'winrate tinggi', 'maxwin', 'm4xw1n',
    'maxwin mulu', 'maxwin terus', 'maxwin hari ini', 'maxwin slot', 'maxwin judi', 'maxwin gacor', 'maxwin mudah menang',
    'pr0m0', 'promo', 'link alternatif', 'slot maxwin', 'slot pragmatic', 'slot demo', 'slot terbaru hari ini', 'asiagenting',
    'slot tergacor', 'slot terbaik', 'bet', 'betting', 'big win', 'winrate', 'modal receh', 'main disini', 'langsung gas',
    'langsung menang', 'langsung jackpot', 'langsung jepe', 'langsung jp', 'langsung gacor', 'langsung auto win',
    'spin gratis', 'rtp tinggi', 'rtp slot', 'slot mudah menang', 'slot hari ini', 'jp terus', 'win terus', 'situs terpercaya',
    'situs judi', 'situs slot', 'situs slot online', 'situs judi online', 'situs slot gacor', 'situs slot terbaru',
    'situs slot hari ini', 'situs slot mudah menang', 'situs slot maxwin', 'situs judi terpercaya', 'situs judi online terpercaya',
    'slot online terpercaya', 'gunungwin', 'ayamwin', 'pulau777', 'pulau7', 'zeus', 'kusumat0t0', 'pecahan', 'maxwin', 'supermoney88',
    'supermoney', 'supermoney88supermoney88', 'supermoney77', 'supermoney99',
    'dora', 'd ora', 'do ra', 'dor a', 'd o ra', 'd or a', 'do r a', 'd o r a', 'dora77', 'ora77', 'a77', ' 77',' 7 7 ',
    'probe855', 'probe 855', 'pro be 855', 'pro be855', 'pro be 8 5 5', 'pr0be855', 'pr0be 855', 'pr0 be 855', 'pr0 be855', 'pr0 be 8 5 5', 'probe', 'pr0be',
]

# Function to check if a comment contains any gambling keyword
def is_gambling_comment(comment):
    if not isinstance(comment, str):
        return 0 # Default to 0 if comment is not a string
    # Lowercase, collapse whitespace, and pad with spaces so that single words
    # and multi-word phrases are both matched on whole-word boundaries
    padded = ' ' + ' '.join(comment.lower().split()) + ' '
    for keyword in gambling_keywords:
        phrase = ' '.join(keyword.split())
        if phrase and f' {phrase} ' in padded:
            return 1 # Label 1 for gambling
    return 0 # Label 0 for not gambling

# Apply the labeling function to the 'finalText' column
df_combined['label'] = df_combined['finalText'].apply(is_gambling_comment)
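
Since many entries in the keyword list are multi-word phrases, it helps to see how two matching strategies differ. The sketch below compares membership in a token list with whole-phrase matching on a space-padded string; both function names and the two-entry keyword list are assumptions for illustration.

```python
def token_match(comment: str, keywords: list) -> bool:
    """True if any keyword equals one whitespace-separated token of the comment."""
    tokens = comment.lower().split()
    return any(keyword in tokens for keyword in keywords)

def phrase_match(comment: str, keywords: list) -> bool:
    """True if any keyword (one or more words) occurs on whole-word boundaries."""
    padded = " " + " ".join(comment.lower().split()) + " "
    for keyword in keywords:
        phrase = " ".join(keyword.split())
        if phrase and (" " + phrase + " ") in padded:
            return True
    return False

toy_keywords = ["judi online", "gacor"]

print(token_match("main judi online yuk", toy_keywords))   # phrase is never a single token
print(phrase_match("main judi online yuk", toy_keywords))  # phrase found as whole words
print(phrase_match("gacorrr banget", toy_keywords))        # partial word does not count
```

Token-list membership can only ever fire for single-word keywords, which is worth keeping in mind when reading the label distribution.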

In [None]:
df_combined[df_combined['label'] == 0]

In [None]:
df_combined[df_combined['label'] == 1]

In [None]:
# Sort by label value so the counts line up with the label names below
label_counts = df_combined['label'].value_counts().sort_index()

# Labels for the pie chart
labels = ['Not Gambling (0)', 'Gambling (1)']
sizes = label_counts.values
colors = ['lightblue', 'lightcoral']
explode = (0.1, 0)

# Plotting the pie chart
plt.figure(figsize=(8, 8))
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140)
plt.title('Distribution of Comment Labels')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.legend(title="Label")
plt.show()


#### Label Sentiment

In [None]:
import csv
import requests
from io import StringIO

lexicon_positive = dict()
lexicon_negative = dict()

response_positive = requests.get('https://raw.githubusercontent.com/angelmetanosaa/dataset/main/lexicon_positive.csv')
if response_positive.status_code == 200:
      reader = csv.reader(StringIO(response_positive.text), delimiter=',')
      for row in reader:
            lexicon_positive[row[0]] = int(row[1])
else:
      print("Failed to fetch positive lexicon data")

response_negative = requests.get('https://raw.githubusercontent.com/angelmetanosaa/dataset/main/lexicon_negative.csv')
if response_negative.status_code == 200:
      reader = csv.reader(StringIO(response_negative.text), delimiter=',')
      for row in reader:
            lexicon_negative[row[0]] = int(row[1])
else:
      print("Failed to fetch negative lexicon data")


In [None]:
def sentiment_analysis_lexicon_indonesia(text):
      score = 0
      
      for word in text:
            if word in lexicon_positive:
                  score += lexicon_positive[word]
            
            elif word in lexicon_negative:
                  score += lexicon_negative[word]
      
      if score >= 0:
            polarity = 'positive'
      else:
            polarity = 'negative'

      return score, polarity
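
A tiny worked example makes the scoring rule concrete. The sketch below follows the same if/elif logic on a toy lexicon; the words and weights in `toy_positive` and `toy_negative` are invented for illustration and are not taken from the downloaded lexicon files.

```python
def score_tokens(tokens, positive, negative):
    """Sum lexicon weights over tokens; a score of zero or more counts as positive."""
    score = 0
    for token in tokens:
        if token in positive:
            score += positive[token]
        elif token in negative:
            score += negative[token]
    polarity = "positive" if score >= 0 else "negative"
    return score, polarity

toy_positive = {"bagus": 3, "mantap": 4}   # invented weights
toy_negative = {"rugi": -4, "bohong": -3}  # invented weights

print(score_tokens(["video", "bagus", "rugi", "bohong"], toy_positive, toy_negative))
print(score_tokens(["halo", "semua"], toy_positive, toy_negative))
```

The second call shows a side effect of the `>= 0` threshold: a comment with no lexicon words at all scores 0 and is labeled positive, so the positive class also absorbs neutral comments.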

In [None]:
results = df_combined['stopWordText'].apply(sentiment_analysis_lexicon_indonesia)
results = list(zip(*results))
df_combined['polarity_score'] = results[0]
df_combined['polarity'] = results[1]
print(df_combined['polarity'].value_counts())

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))
sizes = [count for count in df_combined['polarity'].value_counts()]
labels = list(df_combined['polarity'].value_counts().index)

explode = (0.1, 0)
ax.pie(x=sizes, labels=labels, autopct='%1.1f%%', explode=explode, textprops={'fontsize': 14})
ax.set_title('Sentiment Polarity on Comment Data', fontsize=16, pad=20)
plt.show()

In [None]:
pd.set_option('display.max_colwidth', 2000)
positive_tweets = df_combined[df_combined['polarity'] == 'positive']
positive_tweets = positive_tweets[['finalText', 'polarity_score', 'polarity','stopWordText']]
positive_tweets = positive_tweets.sort_values(by='polarity_score', ascending=False)
positive_tweets = positive_tweets.reset_index(drop=True)
positive_tweets.index += 1

In [None]:
pd.set_option('display.max_colwidth', 2000)
negative_tweets = df_combined[df_combined['polarity'] == 'negative']
negative_tweets = negative_tweets[['finalText', 'polarity_score', 'polarity','stopWordText']]
negative_tweets = negative_tweets.sort_values(by='polarity_score', ascending=True)
negative_tweets = negative_tweets[0:10]
negative_tweets = negative_tweets.reset_index(drop=True)
negative_tweets.index += 1

In [None]:
list_words = ''
for tweet in df_combined['stopWordText']:
      for word in tweet:
            list_words += ' ' + (word)

wordcloud = WordCloud(width=600, height=400, background_color='white', min_font_size=10).generate(list_words)

fig, ax = plt.subplots(figsize=(8, 6))
ax.set_title('Word Cloud of All Comments', fontsize=18)
ax.grid(False)
ax.imshow((wordcloud))
fig.tight_layout(pad=0)
ax.axis('off')
plt.show()

In [None]:
# Build the string 'list_words' that collects every word from the cleaned text of the negative comments.
list_words = ''
for tweet in negative_tweets['stopWordText']:
      for word in tweet:
            list_words += ' ' + (word)

wordcloud = WordCloud(width=600, height=400, background_color='white', min_font_size=10).generate(list_words)
fig, ax = plt.subplots(figsize=(8, 6))
ax.set_title('Word Cloud of Negative Comments', fontsize=18)
ax.grid(False)
ax.imshow((wordcloud))
fig.tight_layout(pad=0)
ax.axis('off')
plt.show()

In [None]:
# Build the string 'list_words' that collects every word from the cleaned text of the positive comments.
list_words = ''
for tweet in positive_tweets['stopWordText']:
      for word in tweet:
            list_words += ' ' + (word)

wordcloud = WordCloud(width=600, height=400, background_color='white', min_font_size=10).generate(list_words)
fig, ax = plt.subplots(figsize=(8, 6))
ax.set_title('Word Cloud of Positive Comments', fontsize=18)
ax.grid(False)
ax.imshow((wordcloud))
fig.tight_layout(pad=0)
ax.axis('off')
plt.show()

## Model Development

### Model Judol

In [None]:
df_combined.head(100)

In [None]:
def tokenize(texts, labels, tokenizer):
      """Filter out invalid texts and encode the rest for a BERT-like model"""
      print(f"Preprocessing {len(texts)} texts...")

      clean_texts = []
      clean_labels = []

      for i, (text, label) in enumerate(zip(texts, labels)):
            if isinstance(text, str) and len(text.strip()) > 0:
                  clean_texts.append(text.strip())
                  clean_labels.append(int(label))
            else:
                  print(f"Skipping invalid text at index {i}: {text}")

      print(f"Valid texts after cleaning: {len(clean_texts)}")

      # Encode with the given tokenizer
      encoded = tokenizer(
            clean_texts,
            truncation=True,
            padding=True,
            max_length=128,
            return_tensors="tf"
      )

      return {
            'input_ids': tf.constant(encoded['input_ids'], dtype=tf.int32),
            'attention_mask': tf.constant(encoded['attention_mask'], dtype=tf.int32),
            'labels': tf.constant(clean_labels, dtype=tf.int32)
      }

def create_dataset(data, batch_size=16, shuffle=True):
      """Create TensorFlow dataset in correct format for model"""
      dataset = tf.data.Dataset.from_tensor_slices((
            {
                  'input_ids': data['input_ids'],
                  'attention_mask': data['attention_mask']
            },
            data['labels']
      ))

      if shuffle:
            dataset = dataset.shuffle(buffer_size=1000)

      return dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)

print("Preprocessing functions defined!")
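
The `truncation=True`, `padding=True`, and `max_length=128` arguments decide the shape of `input_ids` and `attention_mask`. As a rough mental model, the pure-Python sketch below truncates each sequence, pads to the longest one in the batch, and marks real tokens with 1 and padding with 0. It is a simplification: the real tokenizer also inserts special tokens and uses its own padding id, and `pad_batch_sketch` with its token ids is hypothetical.

```python
def pad_batch_sketch(sequences, pad_id=0, max_length=128):
    """Truncate to max_length, then pad every sequence to the longest one in the batch."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up id sequences of different lengths
batch = pad_batch_sketch([[2, 10, 3], [2, 10, 11, 12, 3]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed alongside `input_ids` when building the dataset.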


In [None]:
# Load dataset
texts = df_combined['finalText'].tolist()
labels = df_combined['label'].tolist()

print(f"Dataset loaded: {len(texts)} samples")
print(f"Label distribution: {Counter(labels)}")

In [None]:
# Split dataset
train_texts, val_texts, train_labels, val_labels = train_test_split(
      texts, labels, test_size=0.2, random_state=42
)

print(f"Training samples: {len(train_texts)}")
print(f"Validation samples: {len(val_texts)}")

In [None]:
print("Loading IndoBERT tokenizer...")
tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")

print("Tokenizer ready!")

# Test tokenizer
sample_text = train_texts[0]
# Calling the tokenizer returns both input_ids and attention_mask
encoded_sample = tokenizer(sample_text)
print(f"\nSample text: {sample_text}")
print(f"Encoded input_ids: {encoded_sample['input_ids'][:10]}...")
print(f"Attention mask: {encoded_sample['attention_mask'][:10]}...")

In [None]:
# Preprocess training data
print("Preprocessing training data...")
train_data = tokenize(train_texts, train_labels, tokenizer)

# Preprocess validation data
print("\nPreprocessing validation data...")
val_data = tokenize(val_texts, val_labels, tokenizer)

# Create datasets
train_dataset = create_dataset(train_data, batch_size=16, shuffle=True)
val_dataset = create_dataset(val_data, batch_size=16, shuffle=False)

print("\nData preprocessing completed!")
print(f"Training data shape: {train_data['input_ids'].shape}")
print(f"Validation data shape: {val_data['input_ids'].shape}")

In [None]:
# IndoBERT model
print("IndoBERT model...")
model = TFAutoModelForSequenceClassification.from_pretrained("indobenchmark/indobert-base-p1", num_labels=2)

# Compile model
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(
      optimizer=optimizer,
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      metrics=['accuracy']
)

print("Model created and compiled successfully!")

In [None]:
# Setup callbacks
callbacks = [
    tf.keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=5,
        restore_best_weights=True,
        verbose=1
    ),
    tf.keras.callbacks.ModelCheckpoint(
        filepath='./custom_bert_model.keras',
        monitor='val_accuracy',
        save_best_only=True,
        verbose=1
    ),
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=3,
        verbose=1
    ),
    tf.keras.callbacks.TensorBoard(
        log_dir='./logs'
    )
]

print("Training callbacks setup completed!")

In [None]:
for batch in train_dataset.take(1):
      print(batch)

In [None]:
if 'train_dataset' in locals() and 'val_dataset' in locals():
      # Start training
      print("Starting training...")
      print("This may take a while depending on your dataset size...")

      history = model.fit(
            train_dataset,
            validation_data=val_dataset,
            epochs=20,
            callbacks=callbacks,
            verbose=1
      )

      print("Training completed!")
else:
      print("Skipping model training due to empty dataset(s).")

In [None]:
print("Evaluating model...")

val_predictions = []
val_true_labels = []

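for batch in val_dataset:
      inputs, batch_labels = batch  # Unpack the inputs and labels
      # The model returns an output object; the class scores are in .logits
      logits = model(inputs, training=False).logits

      val_predictions.extend(tf.argmax(logits, axis=1).numpy())
      val_true_labels.extend(batch_labels.numpy())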

# Calculate metrics
accuracy = accuracy_score(val_true_labels, val_predictions)
precision, recall, f1, _ = precision_recall_fscore_support(val_true_labels, val_predictions, average='weighted')

print(f"Validation Accuracy: {accuracy:.4f}")
print(f"Validation Precision: {precision:.4f}")
print(f"Validation Recall: {recall:.4f}")
print(f"Validation F1-score: {f1:.4f}")

print("\nClassification Report:")
print(classification_report(val_true_labels, val_predictions, target_names=['Bukan Judi Online', 'Judi Online']))
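
The `average='weighted'` option computes the metric per class and then averages the results weighted by each class's support (its number of true samples). The pure-Python sketch below does this for precision on four made-up labels; `weighted_precision_sketch` is a hypothetical helper, not a scikit-learn function.

```python
def weighted_precision_sketch(y_true, y_pred):
    """Per-class precision, averaged with weights equal to each class's support."""
    classes = sorted(set(y_true) | set(y_pred))
    total = len(y_true)
    weighted = 0.0
    for cls in classes:
        true_positive = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        support = sum(1 for t in y_true if t == cls)
        precision = true_positive / predicted if predicted else 0.0
        weighted += precision * support / total
    return weighted

y_true_demo = [0, 0, 0, 1]  # made-up labels
y_pred_demo = [0, 0, 1, 1]

# Class 0: precision 2/2 with support 3; class 1: precision 1/2 with support 1
print(weighted_precision_sketch(y_true_demo, y_pred_demo))  # (1.0*3 + 0.5*1) / 4 = 0.875
```

With imbalanced labels the weighted average is dominated by the majority class, so the per-class rows of the classification report are the better place to judge how well gambling comments are detected.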


In [None]:
# Prediction functions
def predict_custom_bert(text, model, tokenizer):
      """Predict a single text with the fine-tuned IndoBERT model"""
      encoded = tokenizer(text, truncation=True, max_length=128, return_tensors="tf")

      # The class scores are in .logits; softmax turns them into probabilities
      logits = model(dict(encoded), training=False).logits
      probabilities = tf.nn.softmax(logits, axis=1)
      predicted_class = tf.argmax(probabilities, axis=1).numpy()[0]
      confidence = tf.reduce_max(probabilities).numpy()

      result = "Judi Online" if predicted_class == 1 else "Bukan Judi Online"
      return result, confidence

def predict_multiple_bert(texts, model, tokenizer):
      """Batch prediction with the fine-tuned IndoBERT model"""
      encoded = tokenizer(texts, truncation=True, padding=True, max_length=128, return_tensors="tf")

      logits = model(dict(encoded), training=False).logits
      probabilities = tf.nn.softmax(logits, axis=1)
      predicted_classes = tf.argmax(probabilities, axis=1).numpy()
      confidences = tf.reduce_max(probabilities, axis=1).numpy()

      results = []
      for pred, conf in zip(predicted_classes, confidences):
            label = "Judi Online" if pred == 1 else "Bukan Judi Online"
            results.append((label, conf))

      return results

print("Prediction functions defined!")
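
A confidence value is easiest to interpret when it is a probability. A sequence-classification model produces raw class scores (logits), and a softmax is the usual way to map them to probabilities that sum to 1. The sketch below shows the arithmetic in plain Python on made-up logits; `softmax_sketch` is a hypothetical helper.

```python
import math

def softmax_sketch(logits):
    """Numerically stable softmax over a list of raw class scores."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

demo_logits = [-1.2, 2.3]            # made-up scores for classes 0 and 1
probs = softmax_sketch(demo_logits)
predicted_class = probs.index(max(probs))

print(probs)             # two probabilities that sum to 1
print(predicted_class)   # index of the larger score
```

Softmax does not change which class wins, only the scale of the reported number, so the predicted label is the same with or without it.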

In [None]:
# Test predictions
# Test single prediction
sample_text = "ijazah jokowi itu asli penelaah ilmiah itu hanya menebak dan mengira2"
result, confidence = predict_custom_bert(sample_text, model, tokenizer)
print(f"Sample prediction:")
print(f"Text: '{sample_text}'")
print(f"Prediction: {result} (confidence: {confidence:.4f})")

# Test multiple predictions
sample_texts = [
      "roy suryo itu kan penjahat yang keluar dari penjara",
      "sehat selalu semuanya salam dari weton88 mudah jackpot",
      "weton88 tempat paling uhuy",
      "presiden jokowi memberikan sambutan di acara kemerdekaan",
      "ayo main slot di situs terpercaya bonus besar",
      "Bagus sekali podcastnya, isinya bener2 bermanfaat ❤"
]

results = predict_multiple_bert(sample_texts, model, tokenizer)
print("\nMultiple predictions:")
for text, (result, conf) in zip(sample_texts, results):
      print(f"'{text}' → {result} (confidence: {conf:.4f})")


In [None]:
# # save the model and tokenizer
# model.save_pretrained('./fine_tuned_indobert')
# tokenizer.save_pretrained('./fine_tuned_indobert')

### Model Sentiment

In [None]:
df_combined.head(100)

#### Model TF-IDF + SVM

In [None]:
print("\nScheme 1: TF-IDF + SVM (80/20)")
X1 = df_combined['finalText']
y1 = df_combined['polarity']
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.2, random_state=42)

vectorizer1 = TfidfVectorizer(max_features=5000)
X_train_tfidf1 = vectorizer1.fit_transform(X_train1)
X_test_tfidf1 = vectorizer1.transform(X_test1)

model1 = SVC(kernel='linear')
model1.fit(X_train_tfidf1, y_train1)

pred_train1 = model1.predict(X_train_tfidf1)
pred_test1 = model1.predict(X_test_tfidf1)
print("Train accuracy:", accuracy_score(y_train1, pred_train1))
print("Test accuracy:", accuracy_score(y_test1, pred_test1))
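
All three sentiment models share the same TF-IDF features, so a small hand computation may help show what the vectorizer produces. The sketch below assumes the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. It splits on whitespace only, which is simpler than the real tokenizer, and `tfidf_sketch` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf_sketch(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: counts[term] * idf[term] for term in counts}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

demo_docs = ["slot gacor menang", "video bagus", "slot bagus"]  # made-up comments
demo_rows = tfidf_sketch(demo_docs)
print(demo_rows[0])
```

In the first document, "gacor" appears in only one comment while "slot" appears in two, so "gacor" receives the larger weight. That is the property that lets a linear model pick up on rare but telling terms.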

#### Model TF-IDF + MultinomialNB

In [None]:
print("\nScheme 2: TF-IDF + MultinomialNB (80/20)")
X1 = df_combined['finalText']
y1 = df_combined['polarity']
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.2, random_state=42)

vectorizer2 = TfidfVectorizer(max_features=5000)
X_train_tfidf1 = vectorizer2.fit_transform(X_train1)
X_test_tfidf1 = vectorizer2.transform(X_test1)

model2 = MultinomialNB(
      alpha=1.0, fit_prior=True, class_prior=None
)
model2.fit(X_train_tfidf1, y_train1)

pred_train1 = model2.predict(X_train_tfidf1)
pred_test1 = model2.predict(X_test_tfidf1)
print("Train accuracy:", accuracy_score(y_train1, pred_train1))
print("Test accuracy:", accuracy_score(y_test1, pred_test1))

#### Model TF-IDF + LogisticRegression

In [None]:
print("Scheme 3: TF-IDF + Logistic Regression (70/30)")
X3 = df_combined['finalText']
y3 = df_combined['polarity']
X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y3, test_size=0.3, random_state=42)

vectorizer3 = TfidfVectorizer(max_features=5000)
X_train_tfidf3 = vectorizer3.fit_transform(X_train3)
X_test_tfidf3 = vectorizer3.transform(X_test3)

# LogisticRegression has no n_estimators parameter; max_iter gives the solver room to converge
model3 = LogisticRegression(max_iter=1000, random_state=42)
model3.fit(X_train_tfidf3, y_train3)

pred_train3 = model3.predict(X_train_tfidf3)
pred_test3 = model3.predict(X_test_tfidf3)
print("Train accuracy:", accuracy_score(y_train3, pred_train3))
print("Test accuracy:", accuracy_score(y_test3, pred_test3))

In [None]:
def inference_all_models(n=5):
      print("Inference results\n")

      # Take a sample of the data
      sample_data = df_combined.sample(n=n, random_state=52)
      sample_cleaned = sample_data['finalText'].tolist()
      indices = sample_data.index

      # TF-IDF transform with each model's own vectorizer
      sample_tfidf1 = vectorizer1.transform(sample_cleaned)
      sample_tfidf2 = vectorizer2.transform(sample_cleaned)
      sample_tfidf3 = vectorizer3.transform(sample_cleaned)

      # Predictions from the three models
      preds_model1 = model1.predict(sample_tfidf1)
      preds_model2 = model2.predict(sample_tfidf2)
      preds_model3 = model3.predict(sample_tfidf3)

      # Show the results
      for i, idx in enumerate(indices):
            original_text = df_combined.loc[idx, 'comment']
            print(f"Text {i+1} : \"{original_text}\"")
            print(f"  Scheme 1 (TF-IDF + SVM)                : \"{preds_model1[i]}\"")
            print(f"  Scheme 2 (TF-IDF + MultinomialNB)      : \"{preds_model2[i]}\"")
            print(f"  Scheme 3 (TF-IDF + Logistic Regression): \"{preds_model3[i]}\"\n")

In [None]:
inference_all_models(n=5)