<a href="https://colab.research.google.com/github/RobinSmits/FakeNews-Generator-And-Detector/blob/main/FakeNews_Generator_T5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we start the process of creating fake news. In summary, we perform the following steps:

1.   Download a news dataset and split it into 2 parts.
2.   Train a T5 model to generate fake news based on the first dataset.
3.   Use the T5 model to generate fake news based on the second dataset.

By splitting the dataset we make sure that there is no 'leakage'. Generating the fake news will be done on data that the model hasn't seen during training.

The generated fake news will be used in the second notebook to train a RoBERTa classifier model to recognize real and fake news.

In the final notebook we use a test set to generate fake news with the T5 model and next we try to classify the real and fake news with the RoBERTa classifier.


**Note!!**

This notebook can run on TPU, GPU, or CPU. GPU is the preferred option, as it is fast for both training the T5 model and generating the fake news with it.
TPU works and is very fast at training the T5 model, but generating the fake news on it is far too slow.

But of course feel free to experiment with that ;-)

In [None]:
import numpy as np
import os
import pandas as pd
from tqdm.notebook import tqdm
from urllib.request import urlopen
import tarfile

# Install Specific Versions
!pip install -q tensorflow==2.4.1
!pip install -q tensorflow-datasets==4.1.0
!pip install -q transformers==4.4.2
!pip install -q sentencepiece==0.1.95

# Import Packages
import tensorflow as tf
import tensorflow_datasets as tfds
from transformers import T5Config, T5Tokenizer, TFT5ForConditionalGeneration
import sentencepiece

I've created and tested these notebooks on Google Colab Pro and used Google Drive to store and load any files created. 

If you run the code locally, modify 'WORK_DIR' accordingly. Google Drive is not needed in that case.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Set Folder to use...
WORK_DIR = '/content/drive/My Drive/fake_news/'
os.makedirs(WORK_DIR, exist_ok = True) 

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Next we configure the device to use (TPU, GPU, and CPU all work). In Google Colab, just select a Runtime type.
We also set some constants. You could try different learning rates, but the current one works fairly well. If the model generates only garbage, the learning rate was likely set too high.

You can set 2 actions:

1.   PERFORM_TRAINING: Set to True to fine-tune the pretrained T5 model on the training set.
2.   GENERATE_TEXT: Set to True to use a T5 model to generate a text file with fake news. Note that generation can take multiple hours and is extremely slow on TPU, so pick the GPU runtime for text generation. Make sure that you have the trained T5 model weights, either from a previous run or downloaded from the link specified on my GitHub page.

And finally we set the T5 tokenizer and config to use.

In [None]:
# Configure Strategy. Assume TPU...if not set default for GPU/CPU
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
except ValueError:
    strategy = tf.distribute.get_strategy()

# Set Auto Tune
AUTO = tf.data.experimental.AUTOTUNE

# Suppress Warnings
tf.autograph.set_verbosity(0, False)

# Set Pandas Display Options
pd.set_option('display.max_colwidth', 256)

# Constants
MAX_LEN = 512     # Max number of tokens for T5 to use.
EPOCHS = 1
VERBOSE = 1

# Set Actions
PERFORM_TRAINING = True
GENERATE_TEXT = True

# Batch Size
GENERATE_BATCH_SIZE = 30 * strategy.num_replicas_in_sync
BATCH_SIZE = 4 * strategy.num_replicas_in_sync
print(f'Train Batch Size: {BATCH_SIZE}')
print(f'Generate Batch Size: {GENERATE_BATCH_SIZE}')

# Learning Rate
LR = 1e-4
print('Learning Rate: {}'.format(LR))

# Set T5 Type
t5_size = 't5-base'     
print(f'T5 Model Type: {t5_size}')

# Set T5 Task Name
task_name = 'generate fake news: '
print(f'T5 Task Name: {task_name}')

# Set T5 Config
t5_config = T5Config.from_pretrained(t5_size)

# Set T5 Tokenizer
t5_tokenizer = T5Tokenizer.from_pretrained(t5_size)

INFO:absl:Entering into master device scope: /job:worker/replica:0/task:0/device:CPU:0

INFO:tensorflow:Initializing the TPU system: grpc://10.95.104.146:8470

INFO:tensorflow:Clearing out eager caches

INFO:tensorflow:Finished initializing TPU system.

INFO:tensorflow:Found TPU system:

INFO:tensorflow:*** Num TPU Cores: 8

INFO:tensorflow:*** Num TPU Workers: 1

INFO:tensorflow:*** Num TPU Cores Per Worker: 8

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)


Train Batch Size: 32
Generate Batch Size: 240
Learning Rate: 0.0001
T5 Model Type: t5-base
T5 Task Name: generate fake news: 
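
The batch sizes printed above follow from scaling a per-replica size by `strategy.num_replicas_in_sync`. A minimal sketch of that arithmetic, using a hypothetical helper `global_batch_size` (not part of the notebook) and plain integers in place of a real strategy object:

```python
def global_batch_size(per_replica_batch_size: int, num_replicas: int) -> int:
    """Scale a per-replica batch size to the global batch size."""
    return per_replica_batch_size * num_replicas

# Replica counts assumed for illustration: 8 TPU cores vs. a single GPU/CPU.
for device, replicas in [('TPU (8 cores)', 8), ('single GPU/CPU', 1)]:
    train_bs = global_batch_size(4, replicas)       # per-replica train batch of 4
    generate_bs = global_batch_size(30, replicas)   # per-replica generate batch of 30
    print(f'{device}: train={train_bs}, generate={generate_bs}')
```

With 8 replicas this gives the 32 and 240 shown in the output above; on a single device the batch sizes would stay at 4 and 30.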


For training and generation we use the TensorFlow Datasets dataset 'ag_news_subset'. It contains a train set of 120K rows and a test set of 7,600 rows.

Each row contains a 'title', which is a newspaper headline, and a 'description', which is a short excerpt of the newspaper article.

The 'title' will be used as input and the longer 'description' text as the target output. This way we can train the model to generate 'fake news' from a short input.
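
As a rough illustration of this text-to-text setup, the sketch below builds one (input, target) pair. The helper `make_text_pair` and the example strings are hypothetical; the prefix matches the `task_name` constant that the notebook prepends to each title during preprocessing.

```python
TASK_NAME = 'generate fake news: '  # same prefix string as `task_name` in the notebook

def make_text_pair(title, description):
    """Return (source_text, target_text) for one news row."""
    return TASK_NAME + title, description

# Placeholder strings, not rows from the dataset.
source_text, target_text = make_text_pair('Example headline', 'Example description text.')
print(source_text)
print(target_text)
```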

We will split the train set into 2 equal parts of 60K rows each.

The first set will be used to train the T5 model on its task.
The second set will be used to generate new data (the fake news...) with input data that the T5 model has never seen before.

!! Note: The initial download of the dataset through TensorFlow Datasets failed for me multiple times, so I load the data directly from the Google Drive URL that TensorFlow Datasets references. This just works...

In [None]:
# AG News Subset Download URL from TFDS
AGNEWSSUBSET_URL = 'https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbUDNpeUdjb0wxRms'
AGNEWSSUBSET_DIR = '/tmp/agnewssubet/'

# Download Tar.Gz File and Extract
with urlopen(AGNEWSSUBSET_URL) as targzstream:
    thetarfile = tarfile.open(fileobj = targzstream, mode = "r|gz")
    thetarfile.extractall(AGNEWSSUBSET_DIR)
    
# List Dataset files
agnewssubset_files = os.listdir(AGNEWSSUBSET_DIR + 'ag_news_csv/')
print(agnewssubset_files)

# Load Train Csv
df = pd.read_csv(AGNEWSSUBSET_DIR + 'ag_news_csv/train.csv', names = ['label', 'title', 'description'])
df = df.sample(frac = 1.0, random_state = 42) # Shuffle all the rows 
df.head()

['test.csv', 'classes.txt', 'readme.txt', 'train.csv']


Unnamed: 0,label,title,description
71787,3,"BBC set for major shake-up, claims newspaper","London - The British Broadcasting Corporation, the world #39;s biggest public broadcaster, is to cut almost a quarter of its 28 000-strong workforce, in the biggest shake-up in its 82-year history, The Times newspaper in London said on Monday."
67218,3,Marsh averts cash crunch,Embattled insurance broker #39;s banks agree to waive clause that may have prevented access to credit. NEW YORK (Reuters) - Marsh amp; McLennan Cos.
54066,2,"Jeter, Yankees Look to Take Control (AP)",AP - Derek Jeter turned a season that started with a terrible slump into one of the best in his accomplished 10-year career.
7168,4,Flying the Sun to Safety,"When the Genesis capsule comes back to Earth with its samples of the sun, helicopter pilots will be waiting for it, ready to snag it out of the sky."
29618,3,Stocks Seen Flat as Nortel and Oil Weigh,"NEW YORK (Reuters) - U.S. stocks were set to open near unchanged on Thursday after a warning from technology bellwether Nortel Networks Corp. &lt;A HREF=""http://www.investor.reuters.com/FullQuote.aspx?ticker=NT.N target=/stocks/quickinfo/fullquote""&..."


Next we split the downloaded data into 2 equal parts. 1 set for training the model: train_df and 1 set for generating with the T5 model: generate_df.

We will use a new column 'generated_description' in the generate_df file to store the generated description.

In [None]:
# Split into 2 equal 60K rows sets...one for training and one for generating
# Take explicit copies so that adding a column below modifies an independent
# dataframe and does not trigger pandas' SettingWithCopyWarning.
train_df = df.iloc[:60000,:].copy()
generate_df = df.iloc[60000:,:].copy()

# Placeholder for new 'generated_description' column in the generation part.
generate_df['generated_description'] = ''

# Total Samples
total_train_samples = train_df.shape[0]
total_generate_samples = generate_df.shape[0]

# Save Train_Df
train_df.to_csv(WORK_DIR + 't5_train_df_news.csv')

# Summary
print(f'Total Samples for Training: {total_train_samples}')
print(f'Total Samples for Generation: {total_generate_samples}')


Total Samples for Training: 60000
Total Samples for Generation: 60000
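
The "no leakage" property of the split can be checked on a toy frame. This is only a sketch under stated assumptions: the 10-row `toy_df` and its values are made up, and each half is taken with an explicit `.copy()`, which is one common way to avoid pandas' SettingWithCopyWarning when a column is added to a slice.

```python
import pandas as pd

# Toy stand-in for the 120K-row AG News frame (10 rows, made-up values).
toy_df = pd.DataFrame({'title': [f'title {i}' for i in range(10)],
                       'description': [f'description {i}' for i in range(10)]})
toy_df = toy_df.sample(frac=1.0, random_state=42)  # shuffle all rows

half = len(toy_df) // 2
toy_train_df = toy_df.iloc[:half].copy()
toy_generate_df = toy_df.iloc[half:].copy()
toy_generate_df['generated_description'] = ''  # only affects the copy

# Rows shared between the two halves (should be none).
overlap = set(toy_train_df.index) & set(toy_generate_df.index)
print(len(toy_train_df), len(toy_generate_df), len(overlap))
```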


Let's look at some tokenized samples from train_df and generate_df.

In [None]:
# Train: Show Input and Output Samples encoded
for index, row in train_df[:2].iterrows():
     
    # Get title and description as strings
    title = row['title']
    description = row['description']
    
    # Encode with special tokens and use maximum length
    input_encoded = t5_tokenizer.encode_plus(title, add_special_tokens = True, max_length = MAX_LEN, truncation = True, padding = 'max_length')
    output_encoded = t5_tokenizer.encode_plus(description, add_special_tokens = True, max_length = MAX_LEN, truncation = True, padding = 'max_length')
    
    # Print...
    print(f'Title: {title}')
    print(f'Input - Title Encoded: {input_encoded}')
    print(f'Description: {description}')
    print(f'Output - Description Encoded: {output_encoded}\n')

Title: BBC set for major shake-up, claims newspaper
Input - Title Encoded: {'input_ids': [9938, 356, 21, 779, 8944, 18, 413, 6, 3213, 8468, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [None]:
# Generate: Show Input and Output Samples encoded
for index, row in generate_df[:2].iterrows():
     
    # Get title and description as strings
    title = row['title']
    description = row['description']
    
    # Encode with special tokens and use maximum length
    input_encoded = t5_tokenizer.encode_plus(title, add_special_tokens = True, max_length = MAX_LEN, truncation = True, padding = 'max_length')
    output_encoded = t5_tokenizer.encode_plus(description, add_special_tokens = True, max_length = MAX_LEN, truncation = True, padding = 'max_length')
            
    # Print...
    print(f'Title: {title}')
    print(f'Input - Title Encoded: {input_encoded}')
    print(f'Description: {description}')
    print(f'Output - Description Encoded: {output_encoded}\n')

Title: Besieging holy sites: past lessons
Input - Title Encoded: {'input_ids': [493, 19247, 53, 15273, 1471, 10, 657, 5182, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
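
The long runs of zeros in the encoded output come from `padding = 'max_length'`: T5 uses id 0 for padding and id 1 for end-of-sequence. The sketch below mimics that padding in plain Python. `pad_to_max_length` is a hypothetical helper, not the tokenizer's implementation, and its truncation is simplified (it just cuts the list).

```python
def pad_to_max_length(token_ids, max_len, pad_id=0):
    """Truncate or right-pad token ids; return (padded ids, attention mask)."""
    ids = list(token_ids[:max_len])
    mask = [1] * len(ids)            # 1 marks a real token
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

# Ids of the title printed above, ending with the end-of-sequence id 1.
title_ids = [493, 19247, 53, 15273, 1471, 10, 657, 5182, 1]

# A short max_len keeps the example readable; the notebook uses MAX_LEN = 512.
input_ids_padded, attention_mask = pad_to_max_length(title_ids, max_len=12)
print(input_ids_padded)
print(attention_mask)
```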

Perform processing of the train_df and prepare the data for model training.

In [None]:
# Placeholders input
input_ids = np.zeros((total_train_samples, MAX_LEN), dtype='int32')
input_masks = np.zeros((total_train_samples, MAX_LEN), dtype='int32')

# Placeholders output
output_ids = np.zeros((total_train_samples, MAX_LEN), dtype='int32')
output_masks = np.zeros((total_train_samples, MAX_LEN), dtype='int32')

# Process Train DF dataframe
for index, row in tqdm(zip(range(total_train_samples), train_df.iterrows()), total = total_train_samples):
    
    # Get title and description as strings
    title = row[1]['title']
    description = row[1]['description']
    
    # Process Input
    input_encoded = t5_tokenizer.encode_plus(task_name + title, add_special_tokens = True, max_length = MAX_LEN, truncation = True, padding = 'max_length')
    input_ids_sample = input_encoded['input_ids']
    input_ids[index,:] = input_ids_sample
    attention_mask_sample = input_encoded['attention_mask']
    input_masks[index,:] = attention_mask_sample

    # Process Output
    output_encoded = t5_tokenizer.encode_plus(description, add_special_tokens = True, max_length = MAX_LEN, truncation = True, padding = 'max_length')
    output_ids_sample = output_encoded['input_ids']
    output_ids[index,:] = output_ids_sample
    attention_mask_sample = output_encoded['attention_mask']
    output_masks[index,:] = attention_mask_sample

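
The arrays built here keep the pad id 0 in the padded label positions. A possible variation, which the notebook does not use and which has not been checked against this exact transformers version, is to replace padded label positions with -100, the value that Hugging Face loss functions conventionally ignore. A sketch on toy arrays (the names and values below are made up):

```python
import numpy as np

IGNORE_INDEX = -100  # label value conventionally skipped by Hugging Face loss functions

# Toy stand-ins for `output_ids` / `output_masks` (2 samples, length 6).
toy_output_ids = np.array([[71, 1712, 1, 0, 0, 0],
                           [37, 1, 0, 0, 0, 0]], dtype='int32')
toy_output_masks = np.array([[1, 1, 1, 0, 0, 0],
                             [1, 1, 0, 0, 0, 0]], dtype='int32')

# Keep real token ids, mark padded positions as ignored.
toy_labels = np.where(toy_output_masks == 1, toy_output_ids, IGNORE_INDEX).astype('int32')
print(toy_labels)
```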




Create the Keras Model to be used for T5.

In [None]:
class KerasTFT5ForConditionalGeneration(TFT5ForConditionalGeneration):
    def __init__(self, *args, log_dir=None, cache_dir= None, **kwargs):
        super().__init__(*args, **kwargs)

        self.loss_tracker= tf.keras.metrics.Mean(name='loss') 
    
    @tf.function
    def train_step(self, data):
        x = data[0]
        y = x['labels']
        y = tf.reshape(y, [-1, 1])
        with tf.GradientTape() as tape:
            outputs = self(x, training=True)
            loss = outputs[0]
            logits = outputs[1]
            loss = tf.reduce_mean(loss)
        # Compute the gradients after leaving the tape context
        grads = tape.gradient(loss, self.trainable_variables)
            
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        self.loss_tracker.update_state(loss)        
        self.compiled_metrics.update_state(y, logits)
        metrics = {m.name: m.result() for m in self.metrics}
        
        return metrics

    def test_step(self, data):
        x = data[0]
        y = x["labels"]
        y = tf.reshape(y, [-1, 1])
        output = self(x, training=False)
        loss = output[0]
        loss = tf.reduce_mean(loss)
        logits = output[1]
        
        self.loss_tracker.update_state(loss)
        self.compiled_metrics.update_state(y, logits)
        
        return {m.name: m.result() for m in self.metrics}

Create a callback to save the model weights to storage.

In [None]:
class SaveModel(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs = None):
        print("\nSave Model Weights")

        # Save only the model weights (HDF5 file), not the full model.
        self.model.save_weights(WORK_DIR + 't5_base_model.h5')

Finally create and compile the model. Show the summary. 
Set the final input_data and perform the training.

In [None]:
# Perform training only if specified
if PERFORM_TRAINING:
        
    # Create Model
    with strategy.scope():
        model = KerasTFT5ForConditionalGeneration.from_pretrained(t5_size, config = t5_config)
        model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = LR), 
                      metrics = [tf.keras.metrics.SparseTopKCategoricalAccuracy(name = 'accuracy')])

    # Summary
    model.summary()

    # Set Input
    input_data = {'input_ids': input_ids, 'labels': output_ids, 'attention_mask': input_masks, 'decoder_attention_mask': output_masks}

    # Fit Model
    model.fit(input_data,
              epochs = EPOCHS, 
              batch_size = BATCH_SIZE, 
              verbose = VERBOSE,
              shuffle = True,
              callbacks = [SaveModel()],
              use_multiprocessing = False,
              workers = 4)





All model checkpoint layers were used when initializing KerasTFT5ForConditionalGeneration.

All the layers of KerasTFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use KerasTFT5ForConditionalGeneration for predictions without further training.


Model: "keras_tf_t5for_conditional_generation"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
shared (TFSharedEmbeddings)  multiple                  24674304  
_________________________________________________________________
encoder (TFT5MainLayer)      multiple                  84954240  
_________________________________________________________________
decoder (TFT5MainLayer)      multiple                  113275008 
Total params: 222,903,554
Trainable params: 222,903,552
Non-trainable params: 2
_________________________________________________________________
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method


Cause: while/else statement not yet supported

Save Model Weights


If GENERATE_TEXT is True, then create the model and load the weights file.

In [None]:
if GENERATE_TEXT:        
    # Create Model
    with strategy.scope():
        model = KerasTFT5ForConditionalGeneration.from_pretrained(t5_size, config = t5_config)
        model.compile(optimizer = tf.keras.optimizers.Adam(), 
                      metrics = [tf.keras.metrics.SparseTopKCategoricalAccuracy(name = 'accuracy')])

    # Summary
    model.summary()

    # Load Weights
    model.load_weights(WORK_DIR + 't5_base_model.h5')

All model checkpoint layers were used when initializing KerasTFT5ForConditionalGeneration.

All the layers of KerasTFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use KerasTFT5ForConditionalGeneration for predictions without further training.


Model: "keras_tf_t5for_conditional_generation_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
shared (TFSharedEmbeddings)  multiple                  24674304  
_________________________________________________________________
encoder (TFT5MainLayer)      multiple                  84954240  
_________________________________________________________________
decoder (TFT5MainLayer)      multiple                  113275008 
Total params: 222,903,554
Trainable params: 222,903,552
Non-trainable params: 2
_________________________________________________________________


Let's show a few examples of the generated fake news. Both the 'title' and the original 'description' are shown. The 'Generated Fake News' is the output generated by the T5 model from the encoded input title only.

In [None]:
if GENERATE_TEXT:
    for index, row in generate_df[:4].iterrows():
     
        # Get title and description as strings
        title = row['title']
        description = row['description']

        print(f'\n\n========= Sample:  {index}')
        print(f'Title: {title}')
        print(f'Description: {description}')

        # Encode with Special Tokens
        input_encoded = t5_tokenizer.encode_plus(task_name + title, add_special_tokens = True, max_length = MAX_LEN, truncation = True, padding = 'max_length', return_tensors = 'tf')
        
        # Generate FakeNews
        generated_fakenews = model.generate(input_encoded['input_ids'], 
                                          attention_mask = input_encoded['attention_mask'], 
                                          max_length = MAX_LEN, 
                                          top_p = 0.96, 
                                          top_k = 256, 
                                          temperature = 1.3,
                                          num_beams = 2, 
                                          num_return_sequences = 1, 
                                          repetition_penalty = 1.3,
                                          length_penalty = 1.3)

        for mapping in generated_fakenews.numpy():
            decoded_mapping = t5_tokenizer.decode(mapping, skip_special_tokens = True)
            print(f"    Generated Fake News: {decoded_mapping}")



Title: Besieging holy sites: past lessons
Description: The standoff at one of Islam's holiest shrines parallels one at the Church of the Nativity in 2002.
    Generated Fake News: SAN FRANCISCO (CBS.MW) - The armed forces of the Holy City of San Francisco have been able to take over holy sites for centuries, but they have not been able to stop them from destroying them.


Title: Spain sprouts WiMax network
Description: Europe appears to be fertile ground for new WiMax networks. Spain is the latest country to embrace the emerging high-end broadband wireless technology, following recent deployments in France, Ireland, and the U.K.
    Generated Fake News: Spain has launched a new WiMax network, which will be able to connect to the Internet via a mobile phone. The network will be able to connect to the Internet via a mobile phone or cellular phone.


Title: Oracle sets new deadline on PeopleSoft bid
Description: Oracle has again pushed back the expiration date on its offer for PeopleSof
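
The `temperature`, `top_k`, and `top_p` arguments passed to `generate` above control how candidate tokens are chosen when sampling. The toy sketch below illustrates the idea on a made-up 4-token vocabulary. It is not the library's implementation, and the function names are hypothetical. Note also that in the Hugging Face `generate` API, top-k and top-p filtering are generally applied only when sampling is enabled via `do_sample = True`, which the cell above does not set; whether that holds for this exact version is worth checking.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; temperature > 1 flattens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}  # renormalised probabilities

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
sharp = softmax_with_temperature(toy_logits, temperature=1.0)
flat = softmax_with_temperature(toy_logits, temperature=1.3)
candidates = top_k_top_p_filter(sharp, top_k=3, top_p=0.8)
print(sorted(candidates))
```

A higher temperature lowers the probability of the most likely token, while smaller `top_k` or `top_p` values shrink the set of tokens that can be sampled at all.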

Perform the text generation on the generate_df dataframe, using the 'title' as input. The generated fake news is stored in the dataframe's 'generated_description' column.

The dataframe is then saved to storage for use in the follow-up notebooks.

In [None]:
if GENERATE_TEXT:
    text_list = None
    generated = []

    for index, row in tqdm(zip(range(total_generate_samples), generate_df.iterrows()), total = total_generate_samples):
        index += 1

        if text_list is None:
            text_list = []

        # Prep input text
        text_list.append(task_name + row[1]['title'])
        
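        # Generate when a batch is full, and also for a final partial batch
        if index % GENERATE_BATCH_SIZE == 0 or index == total_generate_samples: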
            # Batch Encode with Special Tokens
            textlist_encoded = t5_tokenizer.batch_encode_plus(text_list, add_special_tokens = True, max_length = MAX_LEN, truncation = True, padding = 'max_length', return_tensors = 'tf')
            
            input_ids = textlist_encoded['input_ids']
            
            # Generate FakeNews
            generated_fakenews = model.generate(input_ids, 
                                                max_length = MAX_LEN, 
                                                top_p = 0.96, 
                                                top_k = 256, 
                                                temperature = 1.3,
                                                num_beams = 1, # Increasing the number of beams could give more interesting results, but also increases the time required.
                                                num_return_sequences = 1, 
                                                repetition_penalty = 1.3)
            
            for mapping in generated_fakenews.numpy():
                generated_description = t5_tokenizer.decode(mapping, skip_special_tokens = True)
                generated.append(generated_description)

            # Reset Text List
            text_list = []

    # Generate Final File
    generate_df['generated_description'] = generated
    generate_df.to_csv(WORK_DIR + 't5_generated_fake_news.csv')
    generate_df.head()

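
The generation loop collects titles into batches of `GENERATE_BATCH_SIZE` before calling the model. The same batching can be sketched with a small generator that also yields a final partial batch when the number of rows is not a multiple of the batch size. `iter_batches` is a hypothetical helper, shown here on made-up titles:

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, including a final partial batch."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

titles = [f'title {i}' for i in range(7)]   # 7 made-up inputs
batches = list(iter_batches(titles, batch_size=3))
print([len(batch) for batch in batches])
```

Every input ends up in exactly one batch, so the list of generated texts has the same length as the dataframe it is assigned to.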