<a href="https://colab.research.google.com/github/RobinSmits/FakeNews-Generator-And-Detector/blob/main/FakeNews_Generator_And_Detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this final notebook we use the 'test' split of the 'ag_news_subset' dataset. It contains 7600 rows that neither the T5 nor the RoBERTa model has seen before.

The T5 model again takes the 'title' as input and generates fake news. The generated output is stored in the file 'generated_fake_news_final.csv'.

As the last step, the RoBERTa model classifies each text as real or fake news.

In [1]:
import numpy as np
import os
import pandas as pd
from tqdm.notebook import tqdm

import tensorflow as tf
import tensorflow_datasets as tfds

!pip install transformers==3.5.1
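# Import only the classes used in this notebook instead of a wildcard import
from transformers import (T5Config, T5Tokenizer, TFT5ForConditionalGeneration,
                          RobertaConfig, RobertaTokenizer, TFRobertaForSequenceClassification)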


I've created and tested these notebooks on Google Colab Pro and used Google Drive to store and load any files created. 

If you run the code locally on a computer then modify the 'WORK_DIR' accordingly. Google Drive will not be needed in that case.
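
A minimal sketch of how the choice between the Google Drive folder and a local folder could be made in one place. The helper names `running_in_colab` and `pick_work_dir` and the local folder location are hypothetical and not part of the original notebook. The check assumes that a Colab kernel has the `google.colab` module loaded; on Colab the drive still has to be mounted before the folder is used.

```python
import os
import sys
import tempfile

# Folder used by the notebook when running on Google Colab
COLAB_WORK_DIR = '/content/drive/My Drive/fake_news/'

def running_in_colab():
    # Assumption: Colab kernels have the google.colab module loaded
    return 'google.colab' in sys.modules

def pick_work_dir(in_colab, local_dir):
    # On Colab use the Drive folder, otherwise create and use the local folder
    if in_colab:
        return COLAB_WORK_DIR
    os.makedirs(local_dir, exist_ok = True)
    return local_dir

# Hypothetical local folder; the trailing separator keeps WORK_DIR + 'file.csv' valid
local_dir = os.path.join(tempfile.gettempdir(), 'fake_news', '')
WORK_DIR = pick_work_dir(running_in_colab(), local_dir)
print(WORK_DIR)
```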

In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Set Folder to use...
WORK_DIR = '/content/drive/My Drive/fake_news/'
os.makedirs(WORK_DIR, exist_ok = True) 

Mounted at /content/drive


Next we configure the device to use (note: TPU support is to be added and tested in the future) and set the necessary constants.

After that we set up the configuration and tokenizers for the T5 and RoBERTa models.

In [3]:
# Set strategy choice
USE_GPU = True
USE_CPU = False

# Set strategy with config. Our code should run on all.
if USE_GPU:
    strategy = tf.distribute.OneDeviceStrategy(device = "/gpu:0")
if USE_CPU:
    strategy = tf.distribute.OneDeviceStrategy(device = "/cpu:0")

# Constants
MAX_LEN = 512
VERBOSE = 1

# Batch Size
GENERATE_BATCH_SIZE = 38 * strategy.num_replicas_in_sync
PREDICT_BATCH_SIZE = 16 * strategy.num_replicas_in_sync
print(f'Predict Batch Size: {PREDICT_BATCH_SIZE}')
print(f'Generate Batch Size: {GENERATE_BATCH_SIZE}')

Predict Batch Size: 16
Generate Batch Size: 38


In [4]:
# Set T5 Type
t5_size = 't5-base'
print(f'T5 Model Type: {t5_size}')

# Set T5 Task Name
task_name = 'generate fake news: '
print(f'T5 Task Name: {task_name}')

# Set T5 Config
t5_config = T5Config.from_pretrained(t5_size)

# Set T5 Tokenizer
t5_tokenizer = T5Tokenizer.from_pretrained(t5_size, return_dict = True)

T5 Model Type: t5-base
T5 Task Name: generate fake news: 



In [5]:
# Set RoBERTa Type
roberta_type = 'roberta-base'
print(f'RoBERTa Model Type: {roberta_type}')

# Set RoBERTa Config
roberta_config = RobertaConfig.from_pretrained(roberta_type, num_labels = 2)  # Binary classification so set num_labels = 2

# Set RoBERTa Tokenizer
roberta_tokenizer = RobertaTokenizer.from_pretrained(roberta_type, 
                                             return_dict = True,
                                             add_prefix_space = True,
                                             do_lower_case = True)

RoBERTa Model Type: roberta-base



For generation we use the 'test' split of the TensorFlow Dataset 'ag_news_subset'. The full dataset contains a train split of 120K rows and a test split of 7600 rows.

Neither the T5 nor the RoBERTa model has been trained on the 'test' split. It is completely unseen by both models.

Each row contains a 'title', which is a newspaper headline, and a 'description', which is a short excerpt of the newspaper article.

The 'title' is used as input for the T5 model to generate the fake news.

Note: the initial download of the dataset has failed for me several times. Running the cell again usually works.
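
Instead of re-running the cell by hand, the download could be wrapped in a small retry helper. This is only a sketch: the `retry` function and its parameters are hypothetical, and the failure is simulated here with a stand-in function because the exact error type is not documented in this notebook.

```python
import time

def retry(load_fn, attempts = 3, wait_seconds = 1.0):
    # Call load_fn until it succeeds or all attempts are used up
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return load_fn()
        except Exception as error:  # broad on purpose: the error type is not known
            last_error = error
            print(f'Attempt {attempt} failed: {error}')
            if attempt < attempts:
                time.sleep(wait_seconds)
    raise last_error

# Stand-in for the dataset download: fails once, then succeeds
calls = {'count': 0}

def flaky_load():
    calls['count'] += 1
    if calls['count'] == 1:
        raise IOError('simulated download error')
    return 'dataset'

result = retry(flaky_load, wait_seconds = 0.0)
print(result, calls['count'])
```

In the notebook the stand-in would be replaced by a lambda around the `tfds.load(...)` call from the next cell.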

In [7]:
# Get data and datasets
ag_news_ds, info = tfds.load('ag_news_subset', split = ['test'], with_info = True, shuffle_files = True, as_supervised = False)
test_ds = ag_news_ds[0]

# Dataset features
print(info.features)

# Samples
total_samples = info.splits['test'].num_examples 
print(f'Total Samples: {total_samples}')

INFO:absl:Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: ag_news_subset/1.0.0
INFO:absl:Load dataset info from /tmp/tmpuhm41vpetfds
INFO:absl:Generating dataset ag_news_subset (/root/tensorflow_datasets/ag_news_subset/1.0.0)

Downloading and preparing dataset ag_news_subset/1.0.0 (download: 11.24 MiB, generated: 35.79 MiB, total: 47.03 MiB) to /root/tensorflow_datasets/ag_news_subset/1.0.0...



INFO:absl:URL https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbUDNpeUdjb0wxRms already downloaded: reusing /root/tensorflow_datasets/downloads/ucexport_download_id_0Bz8a_Dbh9QhbUDNpeUdjb0wxj4g1umFAV8OV-uDwxSJR0LdxO_k1jxMuFWwAfNX9jos.
INFO:absl:Generating split train

Shuffling and writing examples to /root/tensorflow_datasets/ag_news_subset/1.0.0.incomplete6AHD9Z/ag_news_subset-train.tfrecord



INFO:absl:Done writing /root/tensorflow_datasets/ag_news_subset/1.0.0.incomplete6AHD9Z/ag_news_subset-train.tfrecord. Shard lengths: [120000]
INFO:absl:Generating split test

Shuffling and writing examples to /root/tensorflow_datasets/ag_news_subset/1.0.0.incomplete6AHD9Z/ag_news_subset-test.tfrecord



INFO:absl:Done writing /root/tensorflow_datasets/ag_news_subset/1.0.0.incomplete6AHD9Z/ag_news_subset-test.tfrecord. Shard lengths: [7600]
INFO:absl:Skipping computing stats for mode ComputeStatsMode.SKIP.
INFO:absl:Constructing tf.data.Dataset for split ['test'], from /root/tensorflow_datasets/ag_news_subset/1.0.0

Dataset ag_news_subset downloaded and prepared to /root/tensorflow_datasets/ag_news_subset/1.0.0. Subsequent calls will reuse this data.
FeaturesDict({
    'description': Text(shape=(), dtype=tf.string),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=4),
    'title': Text(shape=(), dtype=tf.string),
})
Total Samples: 7600


Next we map test_ds so that each example yields only the 'title' and the 'description'. test_ds is the input for generating the fake news.

In [8]:
# Map and Decode Split(s)
def decode_example(example):
    decoded_example = info.features.decode_example(example)
    
    description = decoded_example['description']
    title = decoded_example['title']
    
    return title, description

# Map
test_ds = test_ds.map(decode_example, num_parallel_calls = tf.data.experimental.AUTOTUNE)



Create the Keras Model to be used for T5.

In [9]:
class KerasTFT5ForConditionalGeneration(TFT5ForConditionalGeneration):
    def __init__(self, *args, log_dir = None, cache_dir = None, **kwargs):
        super().__init__(*args, **kwargs)
        self.loss_tracker= tf.keras.metrics.Mean(name='loss') 
    
    @tf.function
    def train_step(self, data):
        x = data[0]
        y = x['labels']
        y = tf.reshape(y, [-1, 1])
        with tf.GradientTape() as tape:
            outputs = self(x, training = True)
            loss = outputs[0]
            logits = outputs[1]
            loss = tf.reduce_mean(loss)

        # Compute the gradients after leaving the tape context
        grads = tape.gradient(loss, self.trainable_variables)
            
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        self.loss_tracker.update_state(loss)        
        self.compiled_metrics.update_state(y, logits)
        metrics = {m.name: m.result() for m in self.metrics}
        
        return metrics

    def test_step(self, data):
        x = data[0]
        y = x['labels']
        y = tf.reshape(y, [-1, 1])
        output = self(x, training = False)
        loss = output[0]
        loss = tf.reduce_mean(loss)
        logits = output[1]
        
        self.loss_tracker.update_state(loss)
        self.compiled_metrics.update_state(y, logits)
        # update_state() returns None, so collect the metric results explicitly
        metrics = {m.name: m.result() for m in self.metrics}
        
        return metrics

Next create the model and load the weights file.

In [10]:
# Create Model
with strategy.scope():
    model = KerasTFT5ForConditionalGeneration.from_pretrained(t5_size, config = t5_config)
    model.compile(optimizer = tf.keras.optimizers.Adam(), 
                  metrics = [tf.keras.metrics.SparseTopKCategoricalAccuracy(name = 'accuracy')])

# Summary
model.summary()

# Load Weights
model.load_weights(WORK_DIR + 't5_base_model.h5')

All model checkpoint layers were used when initializing KerasTFT5ForConditionalGeneration.

Some layers of KerasTFT5ForConditionalGeneration were not initialized from the model checkpoint at t5-base and are newly initialized: ['loss']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "keras_tf_t5for_conditional_generation"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
shared (TFSharedEmbeddings)  multiple                  24674304  
_________________________________________________________________
encoder (TFT5MainLayer)      multiple                  84954240  
_________________________________________________________________
decoder (TFT5MainLayer)      multiple                  113275392 
_________________________________________________________________
loss (Mean)                  multiple                  2         
=================================================================
Total params: 222,903,938
Trainable params: 222,903,936
Non-trainable params: 2
_________________________________________________________________


Use the test_ds to prepare the final dataframe that will be saved to disk after the fake news generation.

In [11]:
# Placeholders
titles, descriptions = [], []
 
# Convert the Tensorflow Dataset to Numpy so the raw bytes can be decoded into strings for tokenization.
generate_ds_numpy = tfds.as_numpy(test_ds)

for title, description in tqdm(generate_ds_numpy, total = total_samples):
    # Decode title and description from bytes into Python strings
    titles.append(title.decode('utf-8'))
    descriptions.append(description.decode('utf-8'))

# Create Dataframe
df = pd.DataFrame()
df['title'] = titles
df['description'] = descriptions
df['generated'] = ''

# Summary
print(df.head())
print(df.shape)


                                               title  ... generated
0               Carolina's Davis Done for the Season  ...          
1      Philippine Rebels Free Troops, Talks in Doubt  ...          
2          New Rainbow Six Franchise for Spring 2005  ...          
3                          Kiwis heading for big win  ...          
4  Shelling, shooting resumes in breakaway Georgi...  ...          

[5 rows x 3 columns]
(7600, 3)


Perform the text generation based on the prepared dataframe. Note that the 'title' is used as input. The generated fake news is stored in the 'generated' column of the dataframe.

The dataframe is then saved to storage for reference.

In [12]:
generated = []
text_list = []

for index, title in tqdm(enumerate(df['title'], start = 1), total = total_samples):
    # Prep input text
    text_list.append(task_name + title)
    
    # Generate when a batch is full, or at the last row so a partial final batch is not dropped
    if index % GENERATE_BATCH_SIZE == 0 or index == total_samples:
        # Batch Encode with Special Tokens
        textlist_encoded = t5_tokenizer.batch_encode_plus(text_list, add_special_tokens = True, max_length = MAX_LEN, padding = True, truncation = True, return_tensors = 'tf')
        
        input_ids = textlist_encoded['input_ids']
        
        # Generate FakeNews
        # Note: do_sample is left at its default (False), so top_p, top_k and temperature
        # only take effect if do_sample = True is added.
        generated_fakenews = model.generate(input_ids, 
                                          max_length = MAX_LEN, 
                                          top_p = 0.95, 
                                          top_k = 256, 
                                          temperature = 1.1,
                                          num_beams = 1, 
                                          num_return_sequences = 1, 
                                          repetition_penalty = 1.1)
        
        for mapping in generated_fakenews.numpy():
            generated.append(t5_tokenizer.decode(mapping))

        # Reset Text List
        text_list = []

# Generate Final File
df['generated'] = generated
df.to_csv(WORK_DIR + 'generated_fake_news_final.csv')
print(df.head())


                                               title  ...                                          generated
0               Carolina's Davis Done for the Season  ...  CHARLOTTE, NC (Sports Network) - Carolina's Je...
1      Philippine Rebels Free Troops, Talks in Doubt  ...  Philippine rebels freed more than 2,000 troops...
2          New Rainbow Six Franchise for Spring 2005  ...  The Rainbow Six franchise will be released in ...
3                          Kiwis heading for big win  ...  The Kiwis are heading for a big win in the fir...
4  Shelling, shooting resumes in breakaway Georgi...  ...  AFP - A new round of shelling and shootings in...

[5 rows x 3 columns]
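
The generation step works in batches of GENERATE_BATCH_SIZE titles; with 7600 titles and a batch size of 38 that is exactly 200 full batches. The sketch below shows the same batching with plain list slicing, which also yields a shorter final batch when the number of titles is not a multiple of the batch size. The helper `make_batches` and the placeholder headlines are hypothetical and only illustrate the arithmetic.

```python
def make_batches(titles, task_name, batch_size):
    # Prefix every title with the T5 task name, then cut the list into batches
    prompts = [task_name + title for title in titles]
    return [prompts[start:start + batch_size]
            for start in range(0, len(prompts), batch_size)]

task_name = 'generate fake news: '
titles = [f'Headline {i}' for i in range(7600)]   # placeholder titles

batches = make_batches(titles, task_name, 38)
print(len(batches), len(batches[-1]))             # 200 38

# With 100 titles the last batch is only partially filled
uneven = make_batches(titles[:100], task_name, 38)
print([len(batch) for batch in uneven])           # [38, 38, 24]
```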


### RoBERTa FakeNews Detector

Next we define a function to process the 'generated_fake_news_final.csv' which is loaded as a Pandas Dataframe. We loop through all rows and from each row we use the columns 'description' and 'generated' as input for the RoBERTa model.

The 'description' input will be labelled with 0. The 'generated' input will be labelled with 1. The 'title' which we used in the T5 model is not used.

In [13]:
def create_dataset(df):
    number_of_samples = df.shape[0]
    total_samples = 2 * df.shape[0]

    # Placeholders input
    input_ids = np.zeros((total_samples, MAX_LEN), dtype = 'int32')
    input_masks = np.zeros((total_samples, MAX_LEN), dtype = 'int32')
    labels = np.zeros((total_samples, ), dtype = 'int32')

    for index, row in tqdm(zip(range(0, total_samples, 2), df.iterrows()), total = number_of_samples):
        
        # Get description and generated text as strings
        description = row[1]['description']
        generated = row[1]['generated']

        # Process Description - Set Label for real as 0
        input_encoded = roberta_tokenizer.encode_plus(description, add_special_tokens = True, max_length = MAX_LEN, truncation = True)
        input_ids_sample = input_encoded['input_ids']
        input_ids[index,:len(input_ids_sample)] = input_ids_sample
        attention_mask_sample = input_encoded['attention_mask']
        input_masks[index,:len(attention_mask_sample)] = attention_mask_sample
        labels[index] = 0

        # Process Generated - Set Label for fake as 1
        input_encoded = roberta_tokenizer.encode_plus(generated, add_special_tokens = True, max_length = MAX_LEN, truncation = True)
        input_ids_sample = input_encoded['input_ids']
        input_ids[index+1,:len(input_ids_sample)] = input_ids_sample
        attention_mask_sample = input_encoded['attention_mask']
        input_masks[index+1,:len(attention_mask_sample)] = attention_mask_sample
        labels[index+1] = 1

    # Create the Dataset. The dictionary structure of the inputs is preserved.
    dataset = tf.data.Dataset.from_tensor_slices(({'input_ids': input_ids, 'attention_mask': input_masks}, labels))

    # Return Dataset
    return dataset
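
To make the layout produced by this function easier to see, here is a small sketch in plain Python with made-up token ids and a toy MAX_LEN of 8. Real and generated rows alternate, labels alternate between 0 and 1, and every row is padded with zeros while the attention mask marks the real tokens. The helper `build_rows` is hypothetical and stands in for the NumPy placeholders used above.

```python
MAX_LEN = 8  # toy value for the example; the notebook uses 512

def build_rows(pairs, max_len):
    # pairs: list of (real_token_ids, generated_token_ids)
    input_ids, input_masks, labels = [], [], []
    for real_ids, fake_ids in pairs:
        for ids, label in ((real_ids, 0), (fake_ids, 1)):
            ids = ids[:max_len]                              # truncate
            padding = max_len - len(ids)
            input_ids.append(ids + [0] * padding)            # zero padding
            input_masks.append([1] * len(ids) + [0] * padding)
            labels.append(label)
    return input_ids, input_masks, labels

# Made-up token ids, not real tokenizer output
pairs = [([0, 11, 12, 2], [0, 21, 2]),
         ([0, 13, 2], [0, 22, 23, 24, 25, 2])]

input_ids, input_masks, labels = build_rows(pairs, MAX_LEN)
print(labels)          # [0, 1, 0, 1]
print(input_ids[1])    # [0, 21, 2, 0, 0, 0, 0, 0]
print(input_masks[1])  # [1, 1, 1, 0, 0, 0, 0, 0]
```

Because the attention mask is 0 at the padded positions, the value used for padding does not influence the model output.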

We load the csv file and call the previously defined function to generate our final Tensorflow Dataset.

In [14]:
# Import Generated Fake News
df = pd.read_csv(WORK_DIR + 'generated_fake_news_final.csv')

# Show Sizes
print(f'Test DF Shape: {df.shape}')

# Create Test Dataset
test_dataset = create_dataset(df)
test_dataset = test_dataset.batch(PREDICT_BATCH_SIZE)
test_dataset = test_dataset.repeat(-1)
test_dataset = test_dataset.prefetch(128)

# Steps
test_steps = (df.shape[0] * 2) // PREDICT_BATCH_SIZE
print(f'Test Steps: {test_steps}')

Test DF Shape: (7600, 4)



Test Steps: 950
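
The number of steps follows from simple arithmetic: every dataframe row produces two samples (one real, one generated), so 2 * 7600 = 15200 samples in batches of 16 give 950 steps. A short sketch of that calculation, with a hypothetical helper `steps_for`, also shows that floor division would skip a partial last batch if the sample count were not a multiple of the batch size.

```python
import math

def steps_for(num_rows, batch_size):
    # Each dataframe row yields two samples: the description and the generated text
    total = 2 * num_rows
    full_steps = total // batch_size              # what the notebook computes
    all_steps = math.ceil(total / batch_size)     # would also cover a partial last batch
    return full_steps, all_steps

print(steps_for(7600, 16))   # (950, 950)
print(steps_for(7601, 16))   # (950, 951)
```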


Define a function to create and compile the RoBERTa base model.

In [15]:
def build_model():
    # Create Model
    with strategy.scope():      
        model = TFRobertaForSequenceClassification.from_pretrained(roberta_type, config = roberta_config)
        
        optimizer = tf.keras.optimizers.Adam()
        loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True)
        metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

        model.compile(optimizer = optimizer, loss = loss, metrics = [metric])        
        
        return model

Create the model and load the weights file.

In [16]:
# Create Model
model = build_model()

# Summary
model.summary()

# Load Weights
model.load_weights(WORK_DIR + 'roberta_base_model.h5')

Some layers from the model checkpoint at roberta-base were not used when initializing TFRobertaForSequenceClassification: ['lm_head']
- This IS expected if you are initializing TFRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "tf_roberta_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
roberta (TFRobertaMainLayer) multiple                  124645632 
_________________________________________________________________
classifier (TFRobertaClassif multiple                  592130    
=================================================================
Total params: 125,237,762
Trainable params: 125,237,762
Non-trainable params: 0
_________________________________________________________________


Next, let's evaluate the test set and see how well the RoBERTa model can classify the generated data.

With an evaluation accuracy of around 97%, the RoBERTa model does a good job of separating real from fake news.

In [17]:
# Evaluate Dataset
# Named eval_results to avoid shadowing the Python built-in eval()
eval_results = model.evaluate(test_dataset, steps = test_steps, verbose = 1)
print(f'Detection Accuracy: {eval_results[1] * 100}%')

Detection Accuracy: 97.03289270401001%
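
The accuracy reported here is the fraction of samples whose predicted class equals the label. Since the dataset alternates real (0) and generated (1) rows, the labels can be written as a repeating 0, 1 pattern. The sketch below uses made-up predictions, not the model's actual output, and the helper `accuracy` is hypothetical.

```python
def accuracy(labels, predicted):
    # Fraction of positions where the predicted class matches the label
    correct = sum(1 for label, pred in zip(labels, predicted) if label == pred)
    return correct / len(labels)

labels = [0, 1] * 4                       # real, fake, real, fake, ...
predicted = [0, 1, 0, 1, 0, 0, 0, 1]      # made-up predictions with one missed fake
print(accuracy(labels, predicted))        # 0.875
```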


We can also run prediction on the test set. This is basically the same operation as the evaluation, except that evaluation returns the evaluation metrics whereas prediction returns the raw predictions (logits).

In [18]:
# Predict Dataset
preds = model.predict(test_dataset, steps = test_steps, verbose = 1)

# Raw Predictions (a tuple whose first element holds the logits)
print(preds)

# Probabilities
print(tf.nn.softmax(preds[0]).numpy())

(array([[ 4.068584  , -4.2594576 ],
       [-3.859234  ,  3.9326305 ],
       [ 4.044755  , -4.33063   ],
       ...,
       [-4.2476196 ,  4.358499  ],
       [-0.47813815,  0.32547617],
       [-3.0547214 ,  3.0789363 ]], dtype=float32),)
[[9.9975842e-01 2.4158655e-04]
 [4.1291147e-04 9.9958712e-01]
 [9.9976963e-01 2.3041795e-04]
 ...
 [1.8294917e-04 9.9981707e-01]
 [3.0925292e-01 6.9074708e-01]
 [2.1639417e-03 9.9783605e-01]]
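
The step from raw predictions to probabilities is a softmax over the two logits of each row, and the predicted class is the index of the larger probability (0 = real, 1 = fake). A minimal sketch in plain Python, using three of the logit rows printed above; the helper `softmax` is written out here only for illustration.

```python
import math

def softmax(logits):
    # Subtract the maximum first for numerical stability
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

# Logit rows copied from the prediction output above
rows = [[4.068584, -4.2594576],
        [-3.859234, 3.9326305],
        [-0.47813815, 0.32547617]]

for logits in rows:
    probs = softmax(logits)
    predicted = probs.index(max(probs))
    print([round(p, 4) for p in probs], 'fake' if predicted == 1 else 'real')
```

The third row is the least certain of the three: its two logits are close together, so the probability for 'fake' is only about 0.69.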
