<a href="https://colab.research.google.com/github/RobinSmits/FakeNews-Generator-And-Detector/blob/main/FakeNews_Classifier_RoBERTa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In the first notebook we used a T5 model to train on the 'title' and 'description' of the 'ag_news_subset' dataset. As a second step we used that T5 model to generate a CSV with fake news based on the input 'title'.

In this second notebook we use the real and fake news in the generated CSV file ('t5_generated_fake_news.csv') to train a RoBERTa Base model that classifies text as real news (column: 'description', label: 0) or fake news (column: 'generated', label: 1).

In [1]:
import numpy as np
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm

# Install Specific Versions
!pip install tensorflow==2.3.1
!pip install tensorflow-datasets==4.1.0
!pip install transformers==4.0.0
!pip install sentencepiece==0.1.94

# Import Packages
import tensorflow as tf
import tensorflow_datasets as tfds
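# Import only the classes used in this notebook instead of a wildcard import
from transformers import RobertaConfig, RobertaTokenizer, TFRobertaForSequenceClassification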
import sentencepiece



I've created and tested these notebooks on Google Colab Pro and used Google Drive to store and load any files created. 

If you run the code locally on a computer then modify the 'WORK_DIR' accordingly. Google Drive will not be needed in that case.

In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Set Folder to use...
WORK_DIR = '/content/drive/My Drive/fake_news/'
os.makedirs(WORK_DIR, exist_ok = True) 

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
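
For a local run, the choice of working directory could be sketched as below. This is only an illustration: `running_on_colab`, `resolve_work_dir` and the default local folder name are hypothetical helpers that are not part of the original notebook, and on Colab the Drive mount shown above is still required.

```python
import os


def running_on_colab():
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401
        return True
    except ImportError:
        return False


def resolve_work_dir(on_colab, local_dir="fake_news"):
    """Pick the Drive folder on Colab, otherwise a local folder (hypothetical helper)."""
    if on_colab:
        return "/content/drive/My Drive/fake_news/"
    # Joining with an empty string adds a trailing separator, which matters
    # because the notebook builds paths as WORK_DIR + 'file_name'.
    return os.path.join(os.path.abspath(local_dir), "")


WORK_DIR = resolve_work_dir(running_on_colab())
print(f"Work dir: {WORK_DIR}")
```

As in the cell above, `os.makedirs(WORK_DIR, exist_ok = True)` would still be called afterwards to create the folder.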


Next we set some config for the device to use (Note: TPU to be added/tested in the future.) We also set the necessary constants. 

For the learning rate you could try different settings. Setting the learning rate too high will very quickly overfit the model.

And finally we set the RoBERTa tokenizer and config to use.

In [3]:
# Set strategy choice
USE_GPU = True
USE_CPU = False

# Set strategy with config. Our code should run on all.
if USE_GPU:
    strategy = tf.distribute.OneDeviceStrategy(device = "/gpu:0")
if USE_CPU:
    strategy = tf.distribute.OneDeviceStrategy(device = "/cpu:0")

# Constants
MAX_LEN = 512
EPOCHS = 2
VERBOSE = 1  

# Batch Size
BATCH_SIZE = 8 * strategy.num_replicas_in_sync
print(f'Batch Size: {BATCH_SIZE}')

# Learning Rate
LR = 1e-6 * strategy.num_replicas_in_sync
print('Learning Rate: {}'.format(LR))

# Set RoBERTa Type
roberta_type = 'roberta-base'
print(f'RoBERTa Model Type: {roberta_type}')

# Set RoBERTa Config
roberta_config = RobertaConfig.from_pretrained(roberta_type, num_labels = 2) # Binary classification so set num_labels = 2

# Set RoBERTa Tokenizer
roberta_tokenizer = RobertaTokenizer.from_pretrained(roberta_type, 
                                                     return_dict = True,
                                                     add_prefix_space = True,
                                                     do_lower_case = True)

Batch Size: 8
Learning Rate: 1e-06
RoBERTa Model Type: roberta-base
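
With the two flags used above, `strategy` stays undefined when both are `False`, and the CPU setting silently wins when both are `True`. A minimal sketch of a stricter selection is shown below. The helper name `select_device` is an assumption for illustration and is not part of the notebook. It only returns the device string, so it runs without TensorFlow.

```python
def select_device(use_gpu, use_cpu):
    """Return the device string for exactly one enabled flag (hypothetical helper)."""
    if use_gpu == use_cpu:
        raise ValueError("Set exactly one of USE_GPU / USE_CPU to True.")
    return "/gpu:0" if use_gpu else "/cpu:0"


device = select_device(use_gpu=True, use_cpu=False)
print(f"Device: {device}")

# In the notebook the result would then be passed on as:
# strategy = tf.distribute.OneDeviceStrategy(device = device)
```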


Next we define a function to process 't5_generated_fake_news.csv', which is loaded as a pandas DataFrame. We loop through all rows, and from each row we use the columns 'description' and 'generated' as input for the RoBERTa model.

The 'description' input will be labelled with 0. The 'generated' input will be labelled with 1. The 'title' which we used in the T5 model is not used.

In [4]:
def create_dataset(df):
    number_of_samples = df.shape[0]
    total_samples = 2 * df.shape[0]

    # Placeholders input
    input_ids = np.zeros((total_samples, MAX_LEN), dtype = 'int32')
    input_masks = np.zeros((total_samples, MAX_LEN), dtype = 'int32')
    labels = np.zeros((total_samples, ), dtype = 'int32')

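    # Iterate over the DataFrame passed in (not the global train_df), so the
    # validation dataset is really built from the validation rows.
    for index, row in tqdm(zip(range(0, total_samples, 2), df.iterrows()), total = number_of_samples):
        
        # Get the real description and the generated text as strings
        description = row[1]['description']
        generated = row[1]['generated']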

        # Process Description - Set Label for real as 0
        input_encoded = roberta_tokenizer.encode_plus(description, add_special_tokens = True, max_length = MAX_LEN, truncation = True)
        input_ids_sample = input_encoded['input_ids']
        input_ids[index,:len(input_ids_sample)] = input_ids_sample
        attention_mask_sample = input_encoded['attention_mask']
        input_masks[index,:len(attention_mask_sample)] = attention_mask_sample
        labels[index] = 0

        # Process Generated - Set Label for fake as 1
        input_encoded = roberta_tokenizer.encode_plus(generated, add_special_tokens = True, max_length = MAX_LEN, truncation = True)
        input_ids_sample = input_encoded['input_ids']
        input_ids[index+1,:len(input_ids_sample)] = input_ids_sample
        attention_mask_sample = input_encoded['attention_mask']
        input_masks[index+1,:len(attention_mask_sample)] = attention_mask_sample
        labels[index+1] = 1

    # Create Dataset. The dictionary structure of the inputs is preserved.
    dataset = tf.data.Dataset.from_tensor_slices(({'input_ids': input_ids, 'attention_mask': input_masks}, labels))

    # Return Dataset
    return dataset
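
The padding and interleaving performed by `create_dataset` can be illustrated without the tokenizer. The sketch below is a simplified stand-in, not the notebook's implementation: the helper names and the token ids are made up, plain lists replace the NumPy arrays, and truncation simply cuts the list, whereas the real tokenizer keeps the closing special token.

```python
def pad_to_length(token_ids, max_len, pad_id=0):
    """Truncate/pad token ids and build the matching attention mask."""
    ids = token_ids[:max_len]
    mask = [1] * len(ids)
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


def interleave(real_rows, fake_rows, max_len):
    """Place each real sample (label 0) directly before its fake sample (label 1)."""
    input_ids, input_masks, labels = [], [], []
    for real, fake in zip(real_rows, fake_rows):
        for tokens, label in ((real, 0), (fake, 1)):
            ids, mask = pad_to_length(tokens, max_len)
            input_ids.append(ids)
            input_masks.append(mask)
            labels.append(label)
    return input_ids, input_masks, labels


# Made-up token ids standing in for one tokenized 'description' and 'generated' text
ids, masks, labels = interleave([[0, 11, 2]], [[0, 12, 13, 2]], max_len=5)
print(ids)     # [[0, 11, 2, 0, 0], [0, 12, 13, 2, 0]]
print(masks)   # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 0]]
print(labels)  # [0, 1]
```

Even indices hold the real samples and odd indices the fake ones, which is why the placeholders are allocated with twice the number of rows.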

We load the CSV file, split it 80/20 into a training and a validation set, and call the previously defined function to build the final TensorFlow datasets.

In [5]:
# Import Generated Fake News
df = pd.read_csv(WORK_DIR + 't5_generated_fake_news.csv')

# Split in Train and Validation
train_df, val_df = train_test_split(df, test_size = 0.2, random_state = 42, shuffle = True)

# Show Sizes
print(f'Train Shape: {train_df.shape}')
print(f'Validation Shape: {val_df.shape}')

# Create Train Dataset
train_dataset = create_dataset(train_df)
train_dataset = train_dataset.shuffle(2048)
train_dataset = train_dataset.batch(BATCH_SIZE)
train_dataset = train_dataset.repeat(-1)
train_dataset = train_dataset.prefetch(128)

# Create Validation Dataset
validation_dataset = create_dataset(val_df)
validation_dataset = validation_dataset.batch(BATCH_SIZE)
validation_dataset = validation_dataset.repeat(-1)
validation_dataset = validation_dataset.prefetch(128)

# Steps
train_steps = (train_df.shape[0] * 2) // BATCH_SIZE
val_steps = (val_df.shape[0] * 2) // BATCH_SIZE
print(f'Train Steps: {train_steps}')
print(f'Val Steps: {val_steps}')

Train Shape: (48000, 4)
Validation Shape: (12000, 4)




Train Steps: 12000
Val Steps: 3000
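
Because both datasets are repeated indefinitely with `repeat(-1)`, Keras needs an explicit number of steps per epoch. As a quick check of the arithmetic, the sketch below reproduces the step counts from the row counts printed above. The function name `steps_per_epoch` is only illustrative.

```python
def steps_per_epoch(num_rows, batch_size):
    """Each CSV row yields two samples (real + fake), hence the factor 2."""
    return (num_rows * 2) // batch_size


train_steps = steps_per_epoch(48000, batch_size=8)
val_steps = steps_per_epoch(12000, batch_size=8)
print(f"Train Steps: {train_steps}")  # Train Steps: 12000
print(f"Val Steps: {val_steps}")      # Val Steps: 3000
```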


Define a function to create and compile the RoBERTa base model.

In [6]:
def create_model():
    # Create Model
    with strategy.scope():      
        model = TFRobertaForSequenceClassification.from_pretrained(roberta_type, config = roberta_config)
        
        optimizer = tf.keras.optimizers.Adam(learning_rate = LR)
        loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True)
        metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

        model.compile(optimizer = optimizer, loss = loss, metrics = [metric])        
        
        return model
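
Since the loss is created with `from_logits = True`, the classifier returns raw logits rather than probabilities. A minimal sketch of how one pair of logits could be turned into a probability and a label is given below. The logit values are made up and the helper names are assumptions, not part of the notebook.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def predict_label(logits):
    """Return (label, probabilities); label 0 = real, label 1 = fake."""
    probabilities = softmax(logits)
    label = max(range(len(probabilities)), key=probabilities.__getitem__)
    return label, probabilities


# Made-up logits for a single sample: [score for 'real', score for 'fake']
label, probabilities = predict_label([-1.0, 2.0])
print(label, probabilities)
```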

Create a callback to save the model weights to storage.

In [7]:
class SaveModel(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs = None):
        print("\nSave Model Weights")

        # Save only the model weights (HDF5 format), not the full SavedModel.
        self.model.save_weights(WORK_DIR + 'roberta_base_model.h5')

Finally create the model and perform the training and validation.

In [8]:
# Create Model
model = create_model()

# Summary
model.summary()

# Fit Model
model.fit(train_dataset,
          steps_per_epoch = train_steps,
          validation_data = validation_dataset,
          validation_steps = val_steps,
          epochs = EPOCHS, 
          verbose = VERBOSE,
          callbacks = [SaveModel()])

Some layers from the model checkpoint at roberta-base were not used when initializing TFRobertaForSequenceClassification: ['lm_head']
- This IS expected if you are initializing TFRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "tf_roberta_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
roberta (TFRobertaMainLayer) multiple                  124645632 
_________________________________________________________________
classifier (TFRobertaClassif multiple                  592130    
Total params: 125,237,762
Trainable params: 125,237,762
Non-trainable params: 0
_________________________________________________________________
Epoch 1/2
Save Model Weights
Epoch 2/2
Save Model Weights


<tensorflow.python.keras.callbacks.History at 0x7f9db2d18b00>

After training, the model should achieve around 97% accuracy on the validation set.

An interesting point I noticed during testing is that this 97% accuracy was achieved on fake news generated by a trained T5-base model.

When I performed the initial tests with fake news generated by a trained T5-small model, the classifier could achieve around 99% accuracy. This makes sense, as you would expect a smaller NLP model to generate less 'natural looking' fake news that is easier to detect. The lower 97% classification accuracy on the T5-base output suggests, although it does not prove, that the larger model generates more convincing fake news.