<a href="https://colab.research.google.com/github/RobinSmits/FakeNews-Generator-And-Detector/blob/main/FakeNews_Generator_T5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import os
import pandas as pd
from tqdm.notebook import tqdm

# Install Specific Versions
!pip install tensorflow==2.3.1
!pip install tensorflow-datasets==4.1.0
!pip install transformers==4.0.0
!pip install sentencepiece==0.1.94

# Import Packages
import tensorflow as tf
import tensorflow_datasets as tfds
from transformers import T5Config, T5Tokenizer, TFT5ForConditionalGeneration
import sentencepiece

Collecting tensorflow==2.3.1
  Downloading https://files.pythonhosted.org/packages/ad/ad/769c195c72ac72040635c66cd9ba7b0f4b4fc1ac67e59b99fa6988446c22/tensorflow-2.3.1-cp36-cp36m-manylinux2010_x86_64.whl (320.4MB)
     |████████████████████████████████| 320.4MB 24kB/s 
Installing collected packages: tensorflow
  Found existing installation: tensorflow 2.3.0
    Uninstalling tensorflow-2.3.0:
      Successfully uninstalled tensorflow-2.3.0
Successfully installed tensorflow-2.3.1
Collecting tensorflow-datasets==4.1.0
  Downloading https://files.pythonhosted.org/packages/8b/02/c1260ff4caf483c01ce36ca45a63f05417f732d94ec42cce292355dc7ea4/tensorflow_datasets-4.1.0-py3-none-any.whl (3.6MB)
     |████████████████████████████████| 3.6MB 6.4MB/s 
Installing collected packages: tensorflow-datasets
  Found existing installation: tensorflow-datasets 4.0.1
    Uninstalling tensorflow-datasets-4.0.1:
      Successfully uninstalled tensorflow-datasets-4.0.1
Successfully installed ten

I've created and tested these notebooks on Google Colab Pro and used Google Drive to store and load any files created. 

If you run the code locally, point 'WORK_DIR' at a local folder and skip the Google Drive mount. Google Drive is not needed in that case.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Set Folder to use...
WORK_DIR = '/content/drive/My Drive/fake_news/'
os.makedirs(WORK_DIR, exist_ok = True) 

Mounted at /content/drive
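
For a local run, the folder selection could be sketched as below. This is only an illustration: `resolve_work_dir` is a hypothetical helper that is not part of this notebook, and treating the existence of `/content/drive` as "Drive is mounted" is an assumption.

```python
import os
import tempfile

def resolve_work_dir(local_dir, colab_dir='/content/drive/My Drive/fake_news/'):
    """Return the Drive folder when the Colab mount point exists, else a local folder."""
    # Assumption: Google Drive counts as mounted when the mount point exists
    if os.path.isdir('/content/drive'):
        work_dir = colab_dir
    else:
        work_dir = local_dir
    os.makedirs(work_dir, exist_ok=True)
    return work_dir

# Example: fall back to a folder inside the system temp directory.
# The trailing separator matters because file names are appended with '+'.
WORK_DIR = resolve_work_dir(os.path.join(tempfile.gettempdir(), 'fake_news') + os.sep)
print(f'Work Dir: {WORK_DIR}')
```

The trailing separator is kept because the rest of the notebook builds file paths as `WORK_DIR + 'file_name'`.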


Next we configure the device to run on (note: TPU support is still to be added and tested) and set some constants. You could try different settings for the learning rate. If the model generates only garbage, the learning rate was likely set too high.

You can set 2 actions:

1.   PERFORM_TRAINING: Set to True to fine-tune the pretrained T5 model on the fake news task.
2.   GENERATE_TEXT: Set to True to use a T5 model to generate a CSV file with fake news. Note that generation can take a long time (multiple hours).

And finally we set the T5 tokenizer and config to use.

In [None]:
# Set Device to Run on.
USE_GPU = True
USE_CPU = False

# Set strategy with config
if USE_GPU:
    strategy = tf.distribute.OneDeviceStrategy(device = "/gpu:0")
if USE_CPU:
    strategy = tf.distribute.OneDeviceStrategy(device = "/cpu:0")

# TODO: Add and test TPU functioning

# Constants
MAX_LEN = 512     # Max number of tokens for T5 to use.
EPOCHS = 1
VERBOSE = 1
TRAIN_SPLITS = 2

# Set Actions
PERFORM_TRAINING = True
GENERATE_TEXT = True

# Batch Size
GENERATE_BATCH_SIZE = 30 * strategy.num_replicas_in_sync
BATCH_SIZE = 4 * strategy.num_replicas_in_sync
print(f'Train Batch Size: {BATCH_SIZE}')
print(f'Generate Batch Size: {GENERATE_BATCH_SIZE}')

# Learning Rate
LR = 1e-4 * strategy.num_replicas_in_sync
print('Learning Rate: {}'.format(LR))

# Set T5 Type
t5_size = 't5-base'     
print(f'T5 Model Type: {t5_size}')

# Set T5 Task Name
task_name = 'generate fake news: '
print(f'T5 Task Name: {task_name}')

# Set T5 Config
t5_config = T5Config.from_pretrained(t5_size)

# Set T5 Tokenizer
t5_tokenizer = T5Tokenizer.from_pretrained(t5_size, return_dict = True)

Train Batch Size: 4
Generate Batch Size: 30
Learning Rate: 0.0001
T5 Model Type: t5-base
T5 Task Name: generate fake news: 
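
The batch sizes and the learning rate above are all multiplied by `strategy.num_replicas_in_sync`. With a `OneDeviceStrategy` that value is 1, which is why the printed values equal the base values. The small sketch below shows the same arithmetic for a hypothetical 8-replica setup; `scaled_settings` is an illustrative helper, not something this notebook defines.

```python
def scaled_settings(num_replicas, per_replica_batch=4, per_replica_generate_batch=30, base_lr=1e-4):
    """Scale batch sizes and learning rate linearly with the number of replicas."""
    return {
        'batch_size': per_replica_batch * num_replicas,
        'generate_batch_size': per_replica_generate_batch * num_replicas,
        'learning_rate': base_lr * num_replicas,
    }

# 1 replica matches this notebook; 8 replicas is a hypothetical multi-device setup
for replicas in (1, 8):
    print(replicas, scaled_settings(replicas))
```

Linear scaling of the learning rate with the global batch size is a common heuristic rather than a guarantee, so a larger setup may still need tuning.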






For training and generation we use the TensorFlow Datasets dataset 'ag_news_subset'. It contains a train set of 120K rows and a test set of 7,600 rows.

Each row contains a 'title', which is a newspaper headline, and a 'description', which is a short excerpt from the newspaper article.

The 'title' will be used as input and the longer 'description' text as the target output. This way we can train the model to generate 'fake news' from a short input.

We split the train set into 2 equal parts of 60K rows each.

The first set will be used to train the T5 model on its task.
The second set will be used to generate new data (the fake news...) with input data that the T5 model has never seen before.

!! Note: I have repeatedly seen an error occur on the initial download of the dataset. Running the cell again usually resolves it.

In [None]:
# Define Splits
splits = tfds.even_splits('train', n = TRAIN_SPLITS)
print(f'Configured Splits: {splits}')

# Get data and datasets
ag_news_ds, info = tfds.load('ag_news_subset', split = splits, with_info = True, shuffle_files = True, as_supervised = False)
train_ds = ag_news_ds[0]
generate_ds = ag_news_ds[1]

# Dataset features
print(f'Show Dataset Features:\n {info.features}')

# Samples
total_train_samples = int(info.splits['train'].num_examples / TRAIN_SPLITS) 
total_generate_samples = int(info.splits['train'].num_examples / TRAIN_SPLITS) 

# Summary
print(f'Total Samples for Training: {total_train_samples}')
print(f'Total Samples for Generation: {total_generate_samples}')

INFO:absl:Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: ag_news_subset/1.0.0
INFO:absl:Load dataset info from /tmp/tmpdbrntpdbtfds
INFO:absl:Generating dataset ag_news_subset (/root/tensorflow_datasets/ag_news_subset/1.0.0)


Configured Splits: ['train[0%:50%]', 'train[50%:100%]']
Downloading and preparing dataset ag_news_subset/1.0.0 (download: 11.24 MiB, generated: 35.79 MiB, total: 47.03 MiB) to /root/tensorflow_datasets/ag_news_subset/1.0.0...

INFO:absl:URL https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbUDNpeUdjb0wxRms already downloaded: reusing /root/tensorflow_datasets/downloads/ucexport_download_id_0Bz8a_Dbh9QhbUDNpeUdjb0wxj4g1umFAV8OV-uDwxSJR0LdxO_k1jxMuFWwAfNX9jos.

Shuffling and writing examples to /root/tensorflow_datasets/ag_news_subset/1.0.0.incomplete5M8Q15/ag_news_subset-train.tfrecord

INFO:absl:Done writing /root/tensorflow_datasets/ag_news_subset/1.0.0.incomplete5M8Q15/ag_news_subset-train.tfrecord. Shard lengths: [120000]

Shuffling and writing examples to /root/tensorflow_datasets/ag_news_subset/1.0.0.incomplete5M8Q15/ag_news_subset-test.tfrecord

INFO:absl:Done writing /root/tensorflow_datasets/ag_news_subset/1.0.0.incomplete5M8Q15/ag_news_subset-test.tfrecord. Shard lengths: [7600]
INFO:absl:Skipping computing stats for mode ComputeStatsMode.SKIP.
INFO:absl:Constructing tf.data.Dataset for split ['train[0%:50%]', 'train[50%:100%]'], from /root/tensorflow_datasets/ag_news_subset/1.0.0

Dataset ag_news_subset downloaded and prepared to /root/tensorflow_datasets/ag_news_subset/1.0.0. Subsequent calls will reuse this data.
Show Dataset Features:
 FeaturesDict({
    'description': Text(shape=(), dtype=tf.string),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=4),
    'title': Text(shape=(), dtype=tf.string),
})
Total Samples for Training: 60000
Total Samples for Generation: 60000
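
The split strings and sample counts printed above follow simple percent arithmetic. The sketch below reproduces that output in plain Python. It is not the `tfds.even_splits` implementation; `even_percent_splits` is a hypothetical helper that assumes the number of splits divides 100 evenly.

```python
def even_percent_splits(split_name, n):
    """Build n equal percent-slice strings, e.g. 'train[0%:50%]'."""
    if n <= 0 or 100 % n != 0:
        raise ValueError('n must be a positive divisor of 100')
    step = 100 // n
    return [f'{split_name}[{i * step}%:{(i + 1) * step}%]' for i in range(n)]

example_splits = even_percent_splits('train', 2)
print(example_splits)  # ['train[0%:50%]', 'train[50%:100%]']

# 120K train rows divided over 2 splits gives 60K rows per split
total_examples = 120000
print(total_examples // len(example_splits))  # 60000
```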


Next map the train_ds and generate_ds to use only the 'description' and 'title'. The train_ds is used for training the model. The generate_ds is used for generating the fake news.

In [None]:
# Map and Decode Split(s)
def decode_example(example):
    decoded_example = info.features.decode_example(example)
    
    description = decoded_example['description']
    title = decoded_example['title']
    
    return title, description

# Map
train_ds = train_ds.map(decode_example, num_parallel_calls = tf.data.experimental.AUTOTUNE)
generate_ds = generate_ds.map(decode_example, num_parallel_calls = tf.data.experimental.AUTOTUNE)

Let's look at some tokenized samples from the train_ds and generate_ds.

In [None]:
# Train: Show Input and Output Samples encoded
train_ds_numpy = tfds.as_numpy(train_ds.take(2))

for sample in train_ds_numpy:
    # Get title and description as strings
    title = sample[0].decode('utf-8')
    description = sample[1].decode('utf-8')
    
    # Encode with special tokens and use maximum length
    input_encoded = t5_tokenizer.encode_plus(title, add_special_tokens = True, max_length = MAX_LEN, padding = True, truncation = True)
    output_encoded = t5_tokenizer.encode_plus(description, add_special_tokens = True, max_length = MAX_LEN, padding = True, truncation = True)
    
    # Print...
    print(f'Title: {title}')
    print(f'Input - Title Encoded: {input_encoded}')
    print(f'Description: {description}')
    print(f'Output - Description Encoded: {output_encoded}\n')

Title: AMD Debuts Dual-Core Opteron Processor
Input - Title Encoded: {'input_ids': [22806, 374, 2780, 7, 17338, 18, 13026, 15, 4495, 449, 106, 10272, 127, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Description: AMD #39;s new dual-core Opteron chip is designed mainly for corporate computing applications, including databases, Web services, and financial transactions.
Output - Description Encoded: {'input_ids': [22806, 1713, 3288, 117, 7, 126, 7013, 18, 9022, 4495, 449, 106, 6591, 19, 876, 3, 4894, 21, 2849, 10937, 1564, 6, 379, 16961, 6, 1620, 364, 6, 11, 981, 6413, 5, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Title: Wood's Suspension Upheld (Reuters)
Input - Title Encoded: {'input_ids': [2985, 31, 7, 1923, 7, 3208, 1938, 3234, 14796, 41, 18844, 61, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Description: Reuters - Major League Baseball\Monday announced a decision o

In [None]:
# Test: Show Input and Output Samples encoded
generate_ds_numpy = tfds.as_numpy(generate_ds.take(2))

for batch in generate_ds_numpy:
    # Get title and description as strings
    title = batch[0].decode('utf-8')
    description = batch[1].decode('utf-8')
    
    # Encode with special tokens and use maximum length
    input_encoded = t5_tokenizer.encode_plus(title, add_special_tokens = True, max_length = MAX_LEN, padding = True, truncation = True)
    output_encoded = t5_tokenizer.encode_plus(description, add_special_tokens = True, max_length = MAX_LEN, padding = True, truncation = True)
            
    # Print...
    print(f'Title: {title}')
    print(f'Input - Title Encoded: {input_encoded}')
    print(f'Description: {description}')
    print(f'Output - Description Encoded: {output_encoded}\n')

Title: XM Strikes Back
Input - Title Encoded: {'input_ids': [3, 4, 329, 5500, 5208, 7, 3195, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
Description: Two weeks after its rival Sirius Satellite Radio (Nasdaq: SIRI) grabbed headlines by signing Howard Stern, XM Satellite Radio (Nasdaq: XMSR) disclosed late yesterday that it 
Output - Description Encoded: {'input_ids': [2759, 1274, 227, 165, 8374, 22438, 302, 24552, 5061, 41, 567, 9, 7, 26, 9, 1824, 10, 7933, 5593, 61, 19303, 26, 12392, 7, 57, 8097, 13816, 17594, 6, 3, 4, 329, 24552, 5061, 41, 567, 9, 7, 26, 9, 1824, 10, 3, 4, 4211, 448, 61, 19972, 1480, 4981, 24, 34, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Title: RIGHT GAINS IN LITHUANIA ELECTIONS
Input - Title Encoded: {'input_ids': [3, 27262, 10615, 14750, 3388, 8729, 4611, 16597, 26077, 3, 3577, 14196, 22164, 1], 'attention_mask': [1, 1, 1

Perform processing of the train_ds and prepare the data for model training.

In [None]:
# Placeholders input
input_ids = np.zeros((total_train_samples, MAX_LEN), dtype='int32')
input_masks = np.zeros((total_train_samples, MAX_LEN), dtype='int32')

# Placeholders output
output_ids = np.zeros((total_train_samples, MAX_LEN), dtype='int32')
output_masks = np.zeros((total_train_samples, MAX_LEN), dtype='int32')

# Process Tensorflow Dataset as Numpy ... otherwise not possible to process tokenization.
train_ds_numpy = tfds.as_numpy(train_ds)

for index, sample in tqdm(zip(range(total_train_samples), train_ds_numpy), total = total_train_samples):
    
    # Get title and description as strings
    title = sample[0].decode('utf-8')
    description = sample[1].decode('utf-8')
    
    # Process Input
    input_encoded = t5_tokenizer.encode_plus(task_name + title, add_special_tokens = True, max_length = MAX_LEN, padding = True, truncation = True)
    input_ids_sample = input_encoded['input_ids']
    input_ids[index,:len(input_ids_sample)] = input_ids_sample
    attention_mask_sample = input_encoded['attention_mask']
    input_masks[index,:len(attention_mask_sample)] = attention_mask_sample

    # Process Output
    output_encoded = t5_tokenizer.encode_plus(description, add_special_tokens = True, max_length = MAX_LEN, padding = True, truncation = True)
    output_ids_sample = output_encoded['input_ids']
    output_ids[index,:len(output_ids_sample)] = output_ids_sample
    attention_mask_sample = output_encoded['attention_mask']
    output_masks[index,:len(attention_mask_sample)] = attention_mask_sample
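
The loop above writes each tokenized sample into the first positions of a zero-filled row, so the remaining positions act as padding (0 is the T5 pad token id) with an attention mask of 0. A minimal plain-Python sketch of that layout follows. `pad_to_length` is a hypothetical helper, and its truncation simply cuts the list, which is a simplification compared to the tokenizer's own truncation.

```python
def pad_to_length(token_ids, max_len, pad_id=0):
    """Truncate to max_len and right-pad with pad_id, returning ids and mask."""
    ids = list(token_ids)[:max_len]
    mask = [1] * len(ids)
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

# Token ids of the title 'XM Strikes Back' as printed earlier in this notebook
ids, mask = pad_to_length([3, 4, 329, 5500, 5208, 7, 3195, 1], max_len=12)
print(ids)   # [3, 4, 329, 5500, 5208, 7, 3195, 1, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```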





Create the Keras Model to be used for T5.

In [None]:
class KerasTFT5ForConditionalGeneration(TFT5ForConditionalGeneration):
    def __init__(self, *args, log_dir=None, cache_dir= None, **kwargs):
        super().__init__(*args, **kwargs)

        self.loss_tracker= tf.keras.metrics.Mean(name='loss') 
    
    @tf.function
    def train_step(self, data):
        x = data[0]
        y = x['labels']
        y = tf.reshape(y, [-1, 1])
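        with tf.GradientTape() as tape:
            outputs = self(x, training=True)
            loss = outputs[0]
            logits = outputs[1]
            loss = tf.reduce_mean(loss)
        # Compute the gradients outside the tape context so the tape does not record the gradient computation itself
        grads = tape.gradient(loss, self.trainable_variables)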
            
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        self.loss_tracker.update_state(loss)        
        self.compiled_metrics.update_state(y, logits)
        metrics = {m.name: m.result() for m in self.metrics}
        
        return metrics

    def test_step(self, data):
        x = data[0]
        y = x["labels"]
        y = tf.reshape(y, [-1, 1])
        output = self(x, training=False)
        loss = output[0]
        loss = tf.reduce_mean(loss)
        logits = output[1]
        
        self.loss_tracker.update_state(loss)
        self.compiled_metrics.update_state(y, logits)
        
        return {m.name: m.result() for m in self.metrics}

Create a callback to save the model weights to storage.

In [None]:
class SaveModel(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs = None):
        print("\nSave Model Weights")

        # Save only the model weights to an HDF5 file in WORK_DIR.
        self.model.save_weights(WORK_DIR + 't5_base_model.h5')

Finally create and compile the model. Show the summary. Set the final input_data and perform the training.

In [None]:
# Perform training only if specified
if PERFORM_TRAINING:
        
    # Create Model
    with strategy.scope():
        model = KerasTFT5ForConditionalGeneration.from_pretrained(t5_size, config = t5_config)
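        # Note: SparseTopKCategoricalAccuracy defaults to k = 5, so the reported 'accuracy' is top-5 token accuracy
        model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = LR), 
                      metrics = [tf.keras.metrics.SparseTopKCategoricalAccuracy(name = 'accuracy')])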

    # Summary
    model.summary()

    # Set Input
    input_data = {'input_ids': input_ids, 'labels': output_ids, 'attention_mask': input_masks, 'decoder_attention_mask': output_masks}

    # Fit Model
    model.fit(input_data,
              epochs = EPOCHS, 
              batch_size = BATCH_SIZE, 
              verbose = VERBOSE,
              shuffle = True,
              callbacks = [SaveModel()],
              use_multiprocessing = False,
              workers = 4)





Some layers from the model checkpoint at t5-base were not used when initializing KerasTFT5ForConditionalGeneration: ['decoder/block_._0/layer_._1/EncDecAttention/relative_attention_bias/embeddings:0']
- This IS expected if you are initializing KerasTFT5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing KerasTFT5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of KerasTFT5ForConditionalGeneration were not initialized from the model checkpoint at t5-base and are newly initialized: ['loss']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "keras_tf_t5for_conditional_generation"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
shared (TFSharedEmbeddings)  multiple                  24674304  
_________________________________________________________________
encoder (TFT5MainLayer)      multiple                  84954240  
_________________________________________________________________
decoder (TFT5MainLayer)      multiple                  113275008 
_________________________________________________________________
loss (Mean)                  multiple                  2         
Total params: 222,903,554
Trainable params: 222,903,552
Non-trainable params: 2
_________________________________________________________________


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Save Model Weights
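
The metric reported as 'accuracy' during training is `SparseTopKCategoricalAccuracy`, which in Keras defaults to k = 5. A prediction therefore counts as correct when the true token is among the 5 highest-scoring tokens. The plain-Python sketch below illustrates the idea on a tiny made-up example; `top_k_accuracy` is a hypothetical helper and not the Keras implementation.

```python
def top_k_accuracy(labels, scores, k=5):
    """Fraction of positions whose true label is among the k highest scores."""
    hits = 0
    for label, row in zip(labels, scores):
        # Indices of the k largest scores in this row
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += int(label in top_k)
    return hits / len(labels)

example_scores = [
    [0.1, 0.5, 0.2, 0.2],   # true label 1 has the highest score
    [0.4, 0.1, 0.3, 0.2],   # true label 3 only ranks third
]
print(top_k_accuracy([1, 3], example_scores, k=1))  # 0.5
print(top_k_accuracy([1, 3], example_scores, k=3))  # 1.0
```

Top-k accuracy is always at least as high as exact-match accuracy, so the number shown during training should be read with that in mind.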


If GENERATE_TEXT is True, then create the model and load the weights file.

In [None]:
if GENERATE_TEXT:        
    # Create Model
    with strategy.scope():
        model = KerasTFT5ForConditionalGeneration.from_pretrained(t5_size, config = t5_config)
        model.compile(optimizer = tf.keras.optimizers.Adam(), 
                      metrics = [tf.keras.metrics.SparseTopKCategoricalAccuracy(name = 'accuracy')])

    # Summary
    model.summary()

    # Load Weights
    model.load_weights(WORK_DIR + 't5_base_model.h5')

Some layers from the model checkpoint at t5-base were not used when initializing KerasTFT5ForConditionalGeneration: ['decoder/block_._0/layer_._1/EncDecAttention/relative_attention_bias/embeddings:0']
- This IS expected if you are initializing KerasTFT5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing KerasTFT5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of KerasTFT5ForConditionalGeneration were not initialized from the model checkpoint at t5-base and are newly initialized: ['loss']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "keras_tf_t5for_conditional_generation_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
shared (TFSharedEmbeddings)  multiple                  24674304  
_________________________________________________________________
encoder (TFT5MainLayer)      multiple                  84954240  
_________________________________________________________________
decoder (TFT5MainLayer)      multiple                  113275008 
_________________________________________________________________
loss (Mean)                  multiple                  2         
Total params: 222,903,554
Trainable params: 222,903,552
Non-trainable params: 2
_________________________________________________________________


Let's show a few examples of the generated fake news. The 'title' and the original 'description' are shown for each sample. The 'Generated Fake News' is the output generated by the T5 model.

In [None]:
if GENERATE_TEXT:
    # Test: Show Input and Output Samples encoded
    generate_samples = 4
    generate_ds_numpy = tfds.as_numpy(generate_ds.take(generate_samples))

    for index, sample in zip(range(generate_samples), generate_ds_numpy):
        # Get title and description as strings
        title = sample[0].decode('utf-8')
        description = sample[1].decode('utf-8')

        print(f'\n\n========= Sample:  {index}')
        print(f'Title: {title}')
        print(f'Description: {description}')

        # Encode with Special Tokens
        input_encoded = t5_tokenizer.encode_plus(task_name + title, add_special_tokens = True, max_length = MAX_LEN, padding = True, truncation = True, return_tensors = 'tf')
        
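        # Generate FakeNews
        # Note: top_p, top_k and temperature only affect the output when do_sample = True is also passed;
        # with the default do_sample = False this call uses plain beam search.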
        generated_fakenews = model.generate(input_encoded['input_ids'], 
                                          attention_mask = input_encoded['attention_mask'], 
                                          max_length = MAX_LEN, 
                                          top_p = 0.96, 
                                          top_k = 256, 
                                          temperature = 1.3,
                                          num_beams = 2, 
                                          num_return_sequences = 1, 
                                          repetition_penalty = 1.3,
                                          length_penalty = 1.3)

        for mapping in generated_fakenews.numpy():
            decoded_mapping = t5_tokenizer.decode(mapping, skip_special_tokens = True)
            print(f"    Generated Fake News: {decoded_mapping}")



Title: RIGHT GAINS IN LITHUANIA ELECTIONS
Description: Right-wing parties have scored a surprise success in Lithuania #39;s parliamentary elections on the weekend. The right-wing Conservative Party and Liberal Center Union party won 43 seats together in the 141-member parliament.
    Generated Fake News: ATHENS, Greece (Reuters) - The right-hander sat out the first round of the U.S. presidential election on Sunday, but his opponent was not in the running.


Title: XM Strikes Back
Description: Two weeks after its rival Sirius Satellite Radio (Nasdaq: SIRI) grabbed headlines by signing Howard Stern, XM Satellite Radio (Nasdaq: XMSR) disclosed late yesterday that it 
    Generated Fake News: XM Communications Inc. (XM.N: Quote, Profile, Research) on Monday said it will be back in business for the third quarter of this year and will continue to offer its services as a free service.


Title: Spain sweeps opening singles from US in Davis Cup final
Description: With drums pounding and a bra
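
The `generate` call above passes `top_k` and `top_p`. In the transformers `generate` API these settings apply when sampling is enabled (`do_sample = True`): `top_k` limits the candidates to the k most likely tokens, and `top_p` keeps the smallest set of tokens whose cumulative probability reaches p. The sketch below shows that filtering on a toy distribution. `filter_top_k_top_p` is a hypothetical, simplified helper that works on probabilities, whereas the library works on logits and differs in detail.

```python
def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Keep the top_k most likely tokens, then the smallest set reaching top_p."""
    # Rank token ids by probability, most likely first
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]

    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break

    # Renormalise the surviving probabilities so they sum to 1
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

toy_probs = [0.5, 0.25, 0.15, 0.07, 0.03]
print(sorted(filter_top_k_top_p(toy_probs, top_k=3)))    # [0, 1, 2]
print(sorted(filter_top_k_top_p(toy_probs, top_p=0.7)))  # [0, 1]
```

Sampling then draws the next token from the renormalised distribution, which is what makes the generated text vary between runs.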

Use the generate_ds to prepare the final dataframe that will be saved to disk after the fake news generation.

In [None]:
if GENERATE_TEXT:
    # Placeholders
    titles, descriptions = [], []
    
    # Process Tensorflow Dataset as Numpy ... otherwise not possible to process tokenization.
    generate_ds_numpy = tfds.as_numpy(generate_ds)

    for index, sample in tqdm(zip(range(total_generate_samples), generate_ds_numpy), total = total_generate_samples):
        # Get title and description as strings
        titles.append(sample[0].decode('utf-8'))             # title
        descriptions.append(sample[1].decode('utf-8'))       # description

    # Create Dataframe
    df = pd.DataFrame()
    df['title'] = titles
    df['description'] = descriptions
    df['generated'] = ''

    # Summary
    print(df.head())
    print(df.shape)



                                               title  ... generated
0                                    XM Strikes Back  ...          
1                 RIGHT GAINS IN LITHUANIA ELECTIONS  ...          
2  Spain sweeps opening singles from US in Davis ...  ...          
3         Before the Bell: GE, Sirius Slip (Reuters)  ...          
4                  Microsoft takes on desktop search  ...          

[5 rows x 3 columns]
(60000, 3)


Perform the text generation based on the prepared dataframe. Note that the 'title' is used as input. The generated fake news will be stored in the dataframe 'generated' column.

The dataframe is then saved to storage for use in the follow-up notebooks.

In [None]:
if GENERATE_TEXT:
    text_list = None
    generated = []

    for index, row in tqdm(zip(range(total_generate_samples), df.iterrows()), total = total_generate_samples):
        index += 1

        if text_list is None:
            text_list = []

        # Prep input text
        text_list.append(task_name + row[1]['title'])
        
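        # Flush on a full batch, and also on the last row so a final partial batch is not dropped
        if index % GENERATE_BATCH_SIZE == 0 or index == total_generate_samples: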
            # Batch Encode with Special Tokens
            textlist_encoded = t5_tokenizer.batch_encode_plus(text_list, add_special_tokens = True, max_length = MAX_LEN, padding = True, truncation = True, return_tensors = 'tf')
            
            input_ids = textlist_encoded['input_ids']
            
            # Generate FakeNews
            generated_fakenews = model.generate(input_ids, 
                                                attention_mask = textlist_encoded['attention_mask'], 
                                                max_length = MAX_LEN, 
                                                top_p = 0.95, 
                                                top_k = 256, 
                                                temperature = 1.1,
                                                num_beams = 1, 
                                                num_return_sequences = 1, 
                                                repetition_penalty = 1.1)
            
            for mapping in generated_fakenews.numpy():
                generated.append(t5_tokenizer.decode(mapping, skip_special_tokens = True))

            # Reset Text List
            text_list = []

    # Generate Final File
    df['generated'] = generated
    df.to_csv(WORK_DIR + 't5_generated_fake_news.csv')
    print(df.head())
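
The generation loop above collects titles in `text_list` and processes them once a full batch of `GENERATE_BATCH_SIZE` is reached. The same batching pattern can be sketched as a small generator that also yields a final, smaller batch when the number of rows is not a multiple of the batch size. `batched` is a hypothetical helper, not part of this notebook.

```python
def batched(items, batch_size):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 7 example titles in batches of 3 give batch sizes 3, 3 and 1
example_titles = [f'title {i}' for i in range(7)]
for batch in batched(example_titles, batch_size=3):
    print(len(batch), batch)
```

With 60000 rows and a batch size of 30 there is no remainder, so the distinction only matters for other dataset or batch sizes.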
