# Hoax Detection Using Traditional Machine Learning
## Dataset from Satria Data 2020 - Big Data Challenge

Word embeddings represent each word as a dense vector whose values are learned during training, unlike one-hot encodings, which are hardcoded. As a result, word embeddings pack more information into far fewer dimensions. **Word embeddings do not understand text as a human would; rather, they map the statistical structure of the language used in the corpus.**
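
As a rough illustration of the difference, the sketch below contrasts a one-hot vector with a dense vector for a hypothetical four-word vocabulary. The vocabulary, the 2-dimensional embedding size, and the random initial values are assumptions made for this example; in a real model the dense values are learned during training.

```python
import random

# Hypothetical toy vocabulary, used only for illustration
vocab = ["berita", "hoax", "vaksin", "viral"]
embedding_dim = 2  # illustrative size; real models use more dimensions

def one_hot(word):
    """Hardcoded encoding: one position per vocabulary word, a single 1."""
    return [1 if w == word else 0 for w in vocab]

# Dense table: one short vector per word, randomly initialised here.
# Training would adjust these values; this sketch does not train anything.
rng = random.Random(0)
embedding_table = {
    w: [rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)] for w in vocab
}

print(one_hot("hoax"))                # length grows with the vocabulary
print(len(embedding_table["hoax"]))   # length stays at embedding_dim
```

The one-hot vector needs as many dimensions as there are words, while the dense vector keeps a fixed, much smaller size regardless of vocabulary growth.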

## Word Embedding Using Keras Embedding Layer

In [86]:
# import dependencies
import re
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import PorterStemmer
from string import punctuation
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten, GlobalMaxPooling1D
from pandarallel import pandarallel


In [2]:
# multiprocessing Initialization
pandarallel.initialize(progress_bar=True)

INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


In [3]:
# Read Dataset
train_data = pd.read_excel("../Dataset/training/DataLatih.xlsx", engine="openpyxl")
test_data = pd.read_excel("../Dataset/testing/DataUji.xlsx", engine="openpyxl")

In [4]:
# Reconstruct train dataframe
train_df = pd.DataFrame()
train_df["konten"] = train_data["judul_translate"] + " " + train_data["narasi_translate"]
train_df["Class"] = train_data["label"]

# Reconstruct test dataframe
test_df = pd.DataFrame()
test_df["ID"] = test_data["ID"]
test_df["konten"] = test_data["judul_translate"] + " " + test_data["narasi_translate"]

In [5]:
# define the stopword list (Indonesian from Sastrawi + English from NLTK)
STOPWORDS = set(StopWordRemoverFactory().get_stop_words() + stopwords.words('english'))

# define the abbreviation (kata singkat) mapping
KATASINGKAT = {"dlm":"dalam", "gw":"saya", "yg":"yang", "lu":"kamu", "dkt":"dekat", "kalo":"kalau", "n":"and"}

# define the Sastrawi stemmer for Indonesian and the Porter stemmer for English
stemmer_ind = StemmerFactory().create_stemmer()
stemmer_eng = PorterStemmer()

In [6]:
# preprocessing functions
def remove_kata_singkat(word):
    # expand a known abbreviation (kata singkat); return other words unchanged
    return KATASINGKAT.get(word, word)
    
def normalize_word(row):
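    # replace every non-letter character (punctuation, digits) with a space;
    # flags must be passed by keyword because the 4th positional argument of re.sub is count
    konten = re.sub(r'[^a-zA-Z\s]', ' ', row.konten, flags=re.I|re.A)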
    
    # case folding and abbreviation (kata singkat) expansion
    konten = " ".join([remove_kata_singkat(word.lower()).strip() for word in nltk.word_tokenize(konten)])
    
    # remove stopword and number
    konten = " ".join([word for word in nltk.word_tokenize(konten) if word not in punctuation and word.isalpha() and word not in STOPWORDS])
    
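    # stemming: Sastrawi stems a whole sentence, while NLTK's PorterStemmer
    # expects one word at a time, so apply it word by word
    konten = stemmer_ind.stem(konten)
    konten = " ".join(stemmer_eng.stem(word) for word in konten.split())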
    
    # final assignment
    row.konten = konten
    
    return row
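
The cleaning steps can be tried in isolation with a dependency-free sketch. This is not the notebook's function: it replaces NLTK tokenization with `str.split`, uses a tiny hypothetical stopword set in place of the Sastrawi and NLTK lists, and leaves out stemming. It also passes `flags` to `re.sub` by keyword, since the fourth positional parameter of `re.sub` is `count`.

```python
import re

# Subset of the notebook's abbreviation mapping
ABBREVIATIONS = {"yg": "yang", "dlm": "dalam"}
# Hypothetical stand-in for the combined Sastrawi + NLTK stopword lists
TOY_STOPWORDS = {"yang", "dalam", "the"}

def normalize_text(text):
    """Strip non-letters, lowercase, expand abbreviations, drop stopwords."""
    # flags is given by keyword; passed positionally it would be read as count
    text = re.sub(r"[^a-zA-Z\s]", " ", text, flags=re.A)
    words = [ABBREVIATIONS.get(w, w) for w in text.lower().split()]
    return " ".join(w for w in words if w not in TOY_STOPWORDS)

print(normalize_text("Berita yg VIRAL dlm 2020: the Vaksin!"))
```

Abbreviations are expanded before the stopword filter, so an abbreviation such as "yg" is removed once it has become the stopword "yang".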

In [7]:
# Preprocess both dataframes in parallel, with a progress bar
train_df = train_df.parallel_apply(normalize_word, axis=1)
test_df = test_df.parallel_apply(normalize_word, axis=1)


In [79]:
konten_train = train_df["konten"]
konten_test = test_df["konten"]

In [64]:
# konten_all = konten_train.append(konten_test)

In [80]:
konten_train.shape, konten_test.shape

((4231,), (470,))

In [37]:
# instantiate object
tokenizer = Tokenizer()

In [68]:
tokenizer.fit_on_texts(konten_train)

In [69]:
X = tokenizer.texts_to_sequences(konten_train)
y = train_df["Class"]
X_test = tokenizer.texts_to_sequences(konten_test)
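
To make the tokenization step more concrete, here is a simplified pure-Python sketch of the idea behind `fit_on_texts` and `texts_to_sequences`: words are ranked by frequency, given integer ids starting at 1, and words never seen during fitting are dropped. It is an approximation for illustration, not the Keras implementation, and the toy documents are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign ids starting at 1, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: idx for idx, (word, _) in enumerate(ranked, start=1)}

def texts_to_ids(texts, word_index):
    """Replace each known word by its id; unknown words are skipped."""
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]

train_texts = ["vaksin hoax viral", "berita hoax"]  # hypothetical toy corpus
word_index = fit_word_index(train_texts)
train_ids = texts_to_ids(train_texts, word_index)
# "baru" was never seen while fitting, so it disappears from the sequence
test_ids = texts_to_ids(["hoax baru viral"], word_index)

print(word_index)
print(train_ids, test_ids)
```

Because the index is built from the training texts only, test documents can lose words that never appeared in training, which is one reason test sequences may be shorter than expected.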

In [70]:
# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

In [71]:
maxlen = 500

In [72]:
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_val = pad_sequences(X_val, padding='post', maxlen=maxlen)

vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index
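
The padding step and the `+ 1` in `vocab_size` can be sketched without Keras. The helper below is a hypothetical stand-in that mimics post-padding with zeros and, for sequences longer than `maxlen`, keeps the last `maxlen` ids (the Keras default, `truncating='pre'`). The toy sequences and word index are assumptions for the example.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad on the right with `value`; truncate from the front if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last maxlen ids
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

toy_sequences = [[5, 2], [7, 1, 4, 9, 3]]   # hypothetical token ids
padded = pad_post(toy_sequences, maxlen=4)
print(padded)

# Ids start at 1 and 0 is reserved for padding, so the embedding table
# needs one more row than there are words.
toy_word_index = {"hoax": 1, "vaksin": 2, "viral": 3, "berita": 4}
toy_vocab_size = len(toy_word_index) + 1
print(toy_vocab_size)
```

Every row ends up with the same length, which is what lets the sequences be stacked into a single 2-D array for the model.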

In [93]:
vocab_size

14027

In [105]:
# DEFINE MODEL
embedding_dim = 50

model = Sequential()
model.add(Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(GlobalMaxPooling1D())
model.add(Dense(100, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

Model: "sequential_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 500, 50)           701350    
_________________________________________________________________
global_max_pooling1d_10 (Glo (None, 50)                0         
_________________________________________________________________
dense_26 (Dense)             (None, 100)               5100      
_________________________________________________________________
dense_27 (Dense)             (None, 50)                5050      
_________________________________________________________________
dense_28 (Dense)             (None, 1)                 51        
Total params: 711,551
Trainable params: 711,551
Non-trainable params: 0
_________________________________________________________________
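
The parameter counts in the summary can be checked by hand. The short sketch below recomputes them from the layer sizes used in the model: an embedding layer stores one `embedding_dim`-sized vector per vocabulary id, pooling has no weights, and a dense layer has one weight per input-unit pair plus one bias per unit.

```python
vocab_size, embedding_dim = 14027, 50

def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_params = {
    "embedding": vocab_size * embedding_dim,   # one vector per vocabulary id
    "global_max_pooling1d": 0,                 # pooling has no weights
    "dense_100": dense_params(embedding_dim, 100),
    "dense_50": dense_params(100, 50),
    "dense_1": dense_params(50, 1),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name}: {count}")
print("total:", total_params)
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the size of this model.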


In [106]:
history = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=20)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [101]:
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))

Training Accuracy: 0.9994


In [102]:
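# X_val is the 20% validation split held out from DataLatih (not DataUji),
# so the figure printed below is validation accuracy
loss, accuracy = model.evaluate(X_val, y_val, verbose=False)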
print("Testing Accuracy:  {:.4f}".format(accuracy))

Testing Accuracy:  0.7910


In [92]:
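# Caution: X_test[0] is a single unpadded list of token ids, so Keras scores each
# id as a separate one-token sample; the rows below are per-token outputs, not
# one prediction per document. For a per-document score, pad X_test first, e.g.
# model.predict(pad_sequences(X_test, padding='post', maxlen=maxlen)[:1])
model.predict(X_test[0])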



array([[0.7199089 ],
       [0.12223333],
       [0.5589266 ],
       [0.8440214 ],
       [0.9654064 ],
       [0.3387453 ],
       [0.5810466 ],
       [0.1708524 ],
       [0.9654064 ],
       [0.7340007 ],
       [0.618966  ],
       [0.12223333],
       [0.14833039],
       [0.08844292],
       [0.9654064 ],
       [0.8132427 ],
       [0.95487833],
       [0.3387453 ],
       [0.5810466 ],
       [0.1708524 ],
       [0.9654064 ],
       [0.7340007 ],
       [0.15317488]], dtype=float32)
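
To score whole documents, the model input is usually a 2-D batch with one padded row per document, and the sigmoid outputs are then turned into class labels. The sketch below shows only that bookkeeping in plain Python; the helper names, the token ids, the example probabilities, and the 0.5 threshold are assumptions for illustration, and no model is called.

```python
def to_batch(sequences, maxlen, pad_value=0):
    """One fixed-length row per document (post-padding, front truncation)."""
    rows = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]
        rows.append(seq + [pad_value] * (maxlen - len(seq)))
    return rows

def to_labels(probabilities, threshold=0.5):
    """Map sigmoid outputs to class 0 or 1 using an assumed threshold."""
    return [int(p >= threshold) for p in probabilities]

doc = [12, 7, 301]                    # hypothetical token ids of one document
batch = to_batch([doc], maxlen=5)     # wrapping in a list gives a batch of one
print(len(batch), len(batch[0]))      # number of documents, sequence length

example_probs = [0.72, 0.12, 0.56]    # hypothetical per-document outputs
print(to_labels(example_probs))
```

With the notebook's objects, the equivalent preparation would be to run `pad_sequences` on `X_test` with the same `padding` and `maxlen` settings used for `X_train` before calling `model.predict`.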

## Using Pretrained Word Embeddings

An alternative is to use a precomputed embedding space trained on a much larger corpus, instead of learning the embeddings from the task's own data. Among the most popular methods are [Word2Vec](https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf) developed by Google and [GloVe](https://nlp.stanford.edu/projects/glove/) (Global Vectors for Word Representation) developed by the Stanford NLP Group.
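
A common way to use such vectors is to build an embedding matrix whose row `i` holds the pretrained vector of the word with id `i`. The sketch below assumes the plain-text layout used by GloVe files (a word followed by its values on each line); the two sample lines, the 3-dimensional size, and the word index are hypothetical, and an in-memory string stands in for a real file.

```python
import io

def build_embedding_matrix(lines, word_index, embedding_dim):
    """Row i holds the pretrained vector for the word with id i.

    Row 0 (padding) and words missing from the pretrained file stay all-zero.
    """
    vocab_size = len(word_index) + 1
    matrix = [[0.0] * embedding_dim for _ in range(vocab_size)]
    for line in lines:
        word, *values = line.split()
        if word in word_index:
            matrix[word_index[word]] = [float(v) for v in values[:embedding_dim]]
    return matrix

# Hypothetical stand-in for a pretrained vector file
glove_like = io.StringIO("hoax 0.1 -0.2 0.3\nnews 0.4 0.5 -0.6\n")
toy_index = {"hoax": 1, "vaksin": 2}   # hypothetical tokenizer word index

matrix = build_embedding_matrix(glove_like, toy_index, embedding_dim=3)
for row in matrix:
    print(row)
```

Words that are absent from the pretrained file keep a zero vector, so it can be worth checking what fraction of the vocabulary is actually covered before relying on the pretrained space.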

### Reference
https://realpython.com/python-keras-text-classification/