## Fake News Classification

Solution: [Universal Sentence Encoder (USE) for English](https://www.aclweb.org/anthology/D18-2029)

This [blog post](https://towardsdatascience.com/using-use-universal-sentence-encoder-to-detect-fake-news-dfc02dc32ae9) reaches ~90% accuracy with the Universal Sentence Encoder from TensorFlow Hub. Can we reach the same results with a much simpler encoder?

In [1]:
import csv
import itertools
import os

from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from tqdm.notebook import tqdm

from encoder import build_from_fasttext_bin
from nn import train_w2v, train_nn, load_model, fasttext
from utils import preprocess_sentence



Download the dataset (from GitHub) and read it:

In [2]:
CSV_FILE = 'fake_or_real_news.csv'

! [[ ! -f { CSV_FILE } ]] && wget https://github.com/saadarshad102/Fake-News-Detection-Universal-Sentence-Encoder/raw/master/{ CSV_FILE }
    
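def read_news(fname):
    """Return the article texts and their labels from the news CSV file."""
    X = []
    y = []
    # newline='' lets the csv module handle line breaks inside quoted fields
    with open(fname, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for _, title, text, label in reader:
            X.append(text)  # only the article body is used; the title is ignored
            y.append(label)
    return X, y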

X, y = read_news(CSV_FILE)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

Build a corpus for word2vec training and pre-process it with the `textacy` library:
  - normalize the Unicode charset.
  - deaccent (résumé -> resume).
  - unpack contractions (he's -> he is).
  - remove emojis, hashtags, URLs, emails, etc.
  - remove punctuation marks.
  - strip whitespace.
  - lowercase.
  
Then train a word2vec skip-gram model with the following settings:
  - dim = 200.
  - min_count = 20.
  - lr = 0.015 (relatively low).
  - epochs = 20.
  - ws = 7.
  - sub-word information (minn = 3, maxn = 6).
  
Alternatively, we can use a [pre-built model](https://fasttext.cc/docs/en/pretrained-vectors.html).
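
The pre-processing steps listed above can be approximated with the standard library alone. The sketch below is not the notebook's `preprocess_sentence` (which relies on `textacy` and is not shown here); the function name `preprocess_sketch`, the regular expressions, and the tiny contraction table are all assumptions made for illustration.

```python
import re
import unicodedata

# Illustrative patterns only; a production pipeline would need broader coverage.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
HASHTAG_RE = re.compile(r"[#@]\w+")
CONTRACTIONS = [
    (re.compile(r"\b(he|she|it|that|there)'s\b"), r"\1 is"),
    (re.compile(r"\bcan't\b"), "can not"),
    (re.compile(r"\bwon't\b"), "will not"),
    (re.compile(r"n't\b"), " not"),
    (re.compile(r"'re\b"), " are"),
    (re.compile(r"'ll\b"), " will"),
    (re.compile(r"'ve\b"), " have"),
]


def preprocess_sketch(text):
    """Hypothetical stand-in for the textacy-based pre-processing."""
    # Normalize the Unicode charset and unify curly apostrophes.
    text = unicodedata.normalize("NFKC", text).replace("\u2019", "'")
    text = text.lower()
    # Remove URLs, emails, and hashtags/mentions.
    for pattern in (URL_RE, EMAIL_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    # Unpack a few common contractions.
    for pattern, replacement in CONTRACTIONS:
        text = pattern.sub(replacement, text)
    # Deaccent: decompose characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop punctuation and emojis by keeping only letters, digits, and spaces.
    text = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    # Collapse and strip whitespace.
    return " ".join(text.split())


example = preprocess_sketch("He's reading my Résumé at https://example.com 😀 #news!!")
print(example)
```

Lowercasing early keeps the contraction patterns short, at the cost of losing casing information that a different model might want to keep.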

In [3]:
W2V_PREBUILT_MODEL = 'cc.en.300.bin'
W2V_MODEL = 'model.bin' # W2V_PREBUILT_MODEL

if W2V_MODEL == W2V_PREBUILT_MODEL:
    ! [[ ! -f {W2V_PREBUILT_MODEL} ]] && wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/{W2V_PREBUILT_MODEL}.gz
    ! [[ ! -f {W2V_PREBUILT_MODEL} ]] && gzip -d {W2V_PREBUILT_MODEL}.gz
    ! ls -lh {W2V_PREBUILT_MODEL}

if not os.path.isfile(W2V_MODEL):
    # build w2v corpus
    corpus = []
    for raw_sentence in tqdm(X):
        sent = preprocess_sentence(raw_sentence)
        corpus.append(sent)

    # train word2vec
    model = train_w2v(corpus,
                      model='skipgram',
                      dim=200,
                      min_count=20,
                      lr=0.015,
                      epoch=20,
                      ws=7,
                      minn=3,
                      maxn=6)
    # save model
    model.save_model(W2V_MODEL)

else:  # load the existing model file (trained earlier or pre-built)
    model = fasttext.load_model(W2V_MODEL)




Turn the word2vec model into a sentence encoder in the style of "A Simple but Tough-to-Beat Baseline for Sentence Embeddings" (Arora et al., 2017):

In [4]:
sentence_encoder = build_from_fasttext_bin(model, preprocessor=preprocess_sentence, weighted=True)

del model # free some memory !
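
The internals of `build_from_fasttext_bin` are not shown in this notebook. As a rough illustration of the technique from the paper, the sketch below computes a frequency-weighted average of word vectors (weight `a / (a + p(w))`) and then removes the projection onto the first singular vector fitted on the training sentences. The class name `SifEncoderSketch`, the default `a = 1e-3`, and the toy word vectors are assumptions, not the author's implementation.

```python
from collections import Counter

import numpy as np


class SifEncoderSketch:
    """Hypothetical smooth-inverse-frequency sentence encoder."""

    def __init__(self, word_vectors, a=1e-3):
        self.word_vectors = word_vectors  # dict: token -> 1-D np.ndarray
        self.a = a
        self.word_prob = {}
        self.pc = None

    def _weighted_average(self, sentences):
        dim = len(next(iter(self.word_vectors.values())))
        out = np.zeros((len(sentences), dim))
        for i, sentence in enumerate(sentences):
            tokens = [t for t in sentence.split() if t in self.word_vectors]
            if not tokens:
                continue  # no known token: keep the zero vector
            # Frequent words get small weights, rare or unseen words get ~1.
            weights = np.array(
                [self.a / (self.a + self.word_prob.get(t, 0.0)) for t in tokens]
            )
            vectors = np.array([self.word_vectors[t] for t in tokens])
            out[i] = weights @ vectors / len(tokens)
        return out

    def _remove_pc(self, emb):
        # Subtract the projection onto the first principal direction.
        return emb - np.outer(emb @ self.pc, self.pc)

    def fit_transform(self, sentences):
        counts = Counter(t for s in sentences for t in s.split())
        total = sum(counts.values())
        self.word_prob = {t: c / total for t, c in counts.items()}
        emb = self._weighted_average(sentences)
        # First right singular vector of the (uncentered) embedding matrix.
        _, _, vt = np.linalg.svd(emb, full_matrices=False)
        self.pc = vt[0]
        return self._remove_pc(emb)

    def transform(self, sentences):
        return self._remove_pc(self._weighted_average(sentences))


# Toy usage with random 8-dimensional vectors (made-up data).
rng = np.random.default_rng(0)
toy_vectors = {w: rng.normal(size=8) for w in ["fake", "news", "real", "story", "the"]}
encoder = SifEncoderSketch(toy_vectors)
train_emb = encoder.fit_transform(["the fake news", "the real story", "the news story"])
test_emb = encoder.transform(["the fake story", "unknown words only"])
print(train_emb.shape, test_emb.shape)
```

Word probabilities and the principal direction are estimated on the training sentences only and reused in `transform`, which mirrors the `fit_transform` / `transform` usage on the train and test splits.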

Split the data into train/test sets and transform the sentences into their embedding representations:

In [6]:
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2, random_state=42)

X_train = sentence_encoder.fit_transform(X_train)
X_test = sentence_encoder.transform(X_test)

print('X_train.shape = ', X_train.shape)
print('X_test.shape = ', X_test.shape)

X_train.shape =  (5068, 200)
X_test.shape =  (1267, 200)


Now we can train a binary classification net:
  - 1 hidden layer (128 units, ReLU).
  - dropout in the range 0.2 to 0.5 (0.4 here).
  - binary log loss (cross-entropy).
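
To make the items in the list above concrete, here is a minimal NumPy sketch of a single forward pass through such a network together with the binary log loss. It is only an illustration: the `train_nn` helper used in this notebook is not shown, and the function names, the weight initialization, and the random inputs below are assumptions.

```python
import numpy as np


def binary_log_loss(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped for stability."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(
        -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    )


def forward(X, W1, b1, W2, b2, dropout=0.0, rng=None):
    """One hidden ReLU layer, optional inverted dropout, sigmoid output."""
    hidden = np.maximum(0.0, X @ W1 + b1)
    if dropout > 0.0 and rng is not None:
        # Training-time dropout: zero random units and rescale the rest.
        keep = rng.random(hidden.shape) >= dropout
        hidden = hidden * keep / (1.0 - dropout)
    logits = hidden @ W2 + b2
    return (1.0 / (1.0 + np.exp(-logits))).ravel()


# Made-up batch: 4 sentence embeddings of size 200, as in X_train.
rng = np.random.default_rng(42)
X_batch = rng.normal(size=(4, 200))
y_batch = np.array([0, 1, 1, 0])
W1, b1 = rng.normal(scale=0.05, size=(200, 128)), np.zeros(128)
W2, b2 = rng.normal(scale=0.05, size=(128, 1)), np.zeros(1)

probs = forward(X_batch, W1, b1, W2, b2, dropout=0.4, rng=rng)
loss = binary_log_loss(y_batch, probs)
print(probs.shape, loss)
```

At prediction time dropout is switched off (`dropout=0.0`), and a probability above 0.5 maps to the positive class.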

In [7]:
MODEL_PT = 'model.h5'

model = train_nn(
    X_train,
    y_train,
    hidden_layers=(128,),
    activation='relu',
    dropout=0.4,
    epochs=25,
    batch_size=32,
    # validation_split=0.1,
    # note: the test split doubles as the validation set here
    validation_data=(X_test, y_test),
    patience=4,
    shuffle=True,
    optimizer='adam',
    pt=MODEL_PT,
)

Train on 5068 samples, validate on 1267 samples
Epoch 1/25
...
Epoch 25/25


The result is 91% accuracy with a pretty simple encoder, on par with the ~90% reported for USE. That's nice!

In [8]:
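model = load_model(MODEL_PT)
# predict_classes was removed from Keras in TensorFlow 2.6; threshold the
# output instead (assuming a single sigmoid output unit)
probs = model.predict(X_test, batch_size=32)
preds = (probs > 0.5).astype('int32').reshape(-1)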

report = classification_report(y_test, preds, target_names=label_encoder.classes_)
print(report)

              precision    recall  f1-score   support

        FAKE       0.91      0.90      0.91       628
        REAL       0.90      0.91      0.91       639

    accuracy                           0.91      1267
   macro avg       0.91      0.91      0.91      1267
weighted avg       0.91      0.91      0.91      1267
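
For readers unfamiliar with the columns of the report, the sketch below shows how precision, recall, and F1 are derived for one class from true and predicted labels. The helper name `per_class_metrics` and the five toy labels are made up for illustration and are unrelated to the numbers above.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels (not taken from the dataset).
toy_true = ["FAKE", "FAKE", "REAL", "REAL", "REAL"]
toy_pred = ["FAKE", "REAL", "REAL", "REAL", "FAKE"]

fake_scores = per_class_metrics(toy_true, toy_pred, "FAKE")
real_scores = per_class_metrics(toy_true, toy_pred, "REAL")
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(fake_scores, real_scores, toy_accuracy)
```

The `macro avg` row of the report is the unweighted mean of the per-class scores, while `weighted avg` weights each class by its support.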

