<a href="https://colab.research.google.com/github/RobinSmits/FakeNews-Generator-And-Detector/blob/main/FakeNews_Generator_And_Detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this final notebook we use the 'test' split of the 'ag_news_subset' dataset. It contains 7600 rows that neither the T5 nor the RoBERTa model has seen before.

We again feed each 'title' to the T5 model to generate fake news. The generated output is stored in the file 't5_generated_fake_news_final.csv'.

As the final step, the RoBERTa model classifies each news item as real or fake.

In [None]:
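# Install Specific Versions (run before importing the packages below)
!pip install tensorflow==2.3.1
!pip install tensorflow-datasets==4.1.0
!pip install transformers==4.0.0
!pip install sentencepiece==0.1.94

# Import Packages
import os

import numpy as np
import pandas as pd
import sentencepiece
import tensorflow as tf
import tensorflow_datasets as tfds
from tqdm.notebook import tqdm
# Import only the transformers classes this notebook uses
from transformers import (RobertaConfig, RobertaTokenizer, T5Config, T5Tokenizer,
                          TFRobertaForSequenceClassification, TFT5ForConditionalGeneration)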



I've created and tested these notebooks on Google Colab Pro and used Google Drive to store and load any files created. 

If you run the code locally, modify 'WORK_DIR' accordingly. Google Drive is not needed in that case.
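
One way to avoid editing the path by hand is to pick the working directory based on whether the Colab package is importable. This is a minimal sketch and not part of the original notebook: `running_on_colab` and `resolve_work_dir` are hypothetical helpers, and the local default `./fake_news/` is an assumption. On Colab the Drive still has to be mounted as shown in the next cell.

```python
import importlib.util

def running_on_colab():
    """Return True when the google.colab package can be found (hypothetical helper)."""
    try:
        return importlib.util.find_spec('google.colab') is not None
    except (ImportError, ValueError):
        # Raised when the parent 'google' package is missing or has no spec
        return False

def resolve_work_dir(local_dir='./fake_news/'):
    """Use the Drive folder on Colab, otherwise a local folder (assumed default)."""
    if running_on_colab():
        return '/content/drive/My Drive/fake_news/'
    return local_dir

work_dir = resolve_work_dir()
print(f'Work dir: {work_dir}')
```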

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Set Folder to use...
WORK_DIR = '/content/drive/My Drive/fake_news/'
os.makedirs(WORK_DIR, exist_ok = True) 

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Next we configure the device to use (note: TPU support may be added and tested in the future) and set the necessary constants.

The cells after that set up everything needed for the T5 and RoBERTa models.

In [None]:
# Set strategy choice
USE_GPU = True

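# Set strategy with config. Our code should run on all.
if USE_GPU:
    strategy = tf.distribute.OneDeviceStrategy(device = "/gpu:0")
else:
    # Fall back to the default strategy so 'strategy' is always defined
    strategy = tf.distribute.get_strategy()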

# Set Pandas Display Options
pd.set_option('display.max_colwidth', 256)

# Constants
MAX_LEN = 512
VERBOSE = 1

# Batch Size
GENERATE_BATCH_SIZE = 38 * strategy.num_replicas_in_sync
PREDICT_BATCH_SIZE = 16 * strategy.num_replicas_in_sync
print(f'Predict Batch Size: {PREDICT_BATCH_SIZE}')
print(f'Generate Batch Size: {GENERATE_BATCH_SIZE}')

Predict Batch Size: 16
Generate Batch Size: 38


In [None]:
# Set T5 Type
t5_size = 't5-base'
print(f'T5 Model Type: {t5_size}')

# Set T5 Task Name
task_name = 'generate fake news: '
print(f'T5 Task Name: {task_name}')

# Set T5 Config
t5_config = T5Config.from_pretrained(t5_size)

# Set T5 Tokenizer
t5_tokenizer = T5Tokenizer.from_pretrained(t5_size, return_dict = True)

T5 Model Type: t5-base
T5 Task Name: generate fake news: 


In [None]:
# Set RoBERTa Type
roberta_type = 'roberta-base'
print(f'RoBERTa Model Type: {roberta_type}')

# Set RoBERTa Config
roberta_config = RobertaConfig.from_pretrained(roberta_type, num_labels = 2)  # Binary classification so set num_labels = 2

# Set RoBERTa Tokenizer
roberta_tokenizer = RobertaTokenizer.from_pretrained(roberta_type, 
                                             return_dict = True,
                                             add_prefix_space = True,
                                             do_lower_case = True)

RoBERTa Model Type: roberta-base


For generation we use the 'test' split of the TensorFlow dataset 'ag_news_subset'. The dataset contains a train split of 120K rows and a test split of 7600 rows.

Neither the T5 nor the RoBERTa model has been trained on the 'test' split. It is completely unseen by both models.

Each row contains a 'title', which is a newspaper headline, and a 'description', which is a short excerpt from the newspaper article.

The 'title' will be used as input for the T5 model to generate the fake news.

Note: the initial download of the dataset has failed for me several times. Running the cell again usually just works.
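
Because the first download attempt can fail, one option is to wrap the load in a small retry helper. The sketch below is an assumption rather than part of the original notebook: `retry_call` is a hypothetical helper, and `flaky_load` only simulates a download that fails once.

```python
import time

def retry_call(fn, attempts=3, delay_seconds=0.0):
    """Call fn() until it succeeds or `attempts` tries are used up (hypothetical helper)."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as error:  # broad on purpose: the download error type is not known here
            last_error = error
            print(f'Attempt {attempt} failed: {error}')
            time.sleep(delay_seconds)
    raise last_error

# Stand-in for the dataset download: fails on the first call, then succeeds.
calls = {'count': 0}

def flaky_load():
    calls['count'] += 1
    if calls['count'] == 1:
        raise RuntimeError('simulated download error')
    return 'dataset'

result = retry_call(flaky_load)
print(result)
```

In the notebook, the wrapped call would be the `tfds.load(...)` line, for example `retry_call(lambda: tfds.load('ag_news_subset', split = ['test'], with_info = True))`.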

In [None]:
# Get data and datasets
ag_news_ds, info = tfds.load('ag_news_subset', split = ['test'], with_info = True, shuffle_files = True, as_supervised = False)
test_ds = ag_news_ds[0]

# Dataset features
print(info.features)

# Samples
total_samples = info.splits['test'].num_examples 
print(f'Total Samples: {total_samples}')

INFO:absl:Load dataset info from /root/tensorflow_datasets/ag_news_subset/1.0.0
INFO:absl:Reusing dataset ag_news_subset (/root/tensorflow_datasets/ag_news_subset/1.0.0)
INFO:absl:Constructing tf.data.Dataset for split ['test'], from /root/tensorflow_datasets/ag_news_subset/1.0.0


FeaturesDict({
    'description': Text(shape=(), dtype=tf.string),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=4),
    'title': Text(shape=(), dtype=tf.string),
})
Total Samples: 7600


Next, map test_ds so that it yields only the 'title' and 'description'. test_ds is used for generating the fake news.

In [None]:
# Map and Decode Split(s)
def decode_example(example):
    decoded_example = info.features.decode_example(example)
    
    description = decoded_example['description']
    title = decoded_example['title']
    
    return title, description

# Map
test_ds = test_ds.map(decode_example, num_parallel_calls = tf.data.experimental.AUTOTUNE)

Create the Keras Model to be used for T5.

In [None]:
class KerasTFT5ForConditionalGeneration(TFT5ForConditionalGeneration):
    def __init__(self, *args, log_dir = None, cache_dir = None, **kwargs):
        super().__init__(*args, **kwargs)
        # Tracks the running mean of the loss
        self.loss_tracker = tf.keras.metrics.Mean(name = 'loss')
    
    @tf.function
    def train_step(self, data):
        x = data[0]
        y = x['labels']
        y = tf.reshape(y, [-1, 1])
        with tf.GradientTape() as tape:
            outputs = self(x, training = True)
            loss = outputs[0]
            logits = outputs[1]
            loss = tf.reduce_mean(loss)

        # Compute the gradients after leaving the tape context
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        self.loss_tracker.update_state(loss)
        self.compiled_metrics.update_state(y, logits)
        metrics = {m.name: m.result() for m in self.metrics}
        
        return metrics

    def test_step(self, data):
        x = data[0]
        y = x['labels']
        y = tf.reshape(y, [-1, 1])
        output = self(x, training = False)
        loss = output[0]
        loss = tf.reduce_mean(loss)
        logits = output[1]
        
        self.loss_tracker.update_state(loss)
        # update_state() returns None, so collect the metric results explicitly
        self.compiled_metrics.update_state(y, logits)
        metrics = {m.name: m.result() for m in self.metrics}
        
        return metrics

Next create the model and load the weights file.

In [None]:
# Create Model
with strategy.scope():
    model = KerasTFT5ForConditionalGeneration.from_pretrained(t5_size, config = t5_config)
    model.compile(optimizer = tf.keras.optimizers.Adam(), 
                  metrics = [tf.keras.metrics.SparseTopKCategoricalAccuracy(name = 'accuracy')])

# Summary
model.summary()

# Load Weights
model.load_weights(WORK_DIR + 't5_base_model.h5')

Some layers from the model checkpoint at t5-base were not used when initializing KerasTFT5ForConditionalGeneration: ['decoder/block_._0/layer_._1/EncDecAttention/relative_attention_bias/embeddings:0']
- This IS expected if you are initializing KerasTFT5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing KerasTFT5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of KerasTFT5ForConditionalGeneration were not initialized from the model checkpoint at t5-base and are newly initialized: ['loss']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "keras_tf_t5for_conditional_generation"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
shared (TFSharedEmbeddings)  multiple                  24674304  
_________________________________________________________________
encoder (TFT5MainLayer)      multiple                  84954240  
_________________________________________________________________
decoder (TFT5MainLayer)      multiple                  113275008 
_________________________________________________________________
loss (Mean)                  multiple                  2         
Total params: 222,903,554
Trainable params: 222,903,552
Non-trainable params: 2
_________________________________________________________________


Use the test_ds to prepare the final dataframe that will be saved to disk after the fake news generation.

In [None]:
# Placeholders
titles, descriptions = [], []
 
# Convert the TensorFlow Dataset to NumPy; the tokenizer needs plain Python strings.
generate_ds_numpy = tfds.as_numpy(test_ds)

for sample in tqdm(generate_ds_numpy, total = total_samples):
    # Get title and description as strings
    titles.append(sample[0].decode('utf-8'))             # title
    descriptions.append(sample[1].decode('utf-8'))       # description

# Create Dataframe
df = pd.DataFrame()
df['title'] = titles
df['description'] = descriptions
df['generated'] = ''

# Summary
df.head()

Unnamed: 0,title,description,generated
0,Carolina's Davis Done for the Season,"CHARLOTTE, N.C. (Sports Network) - Carolina Panthers running back Stephen Davis will miss the remainder of the season after being placed on injured reserve Saturday.",
1,"Philippine Rebels Free Troops, Talks in Doubt","PRESENTACION, Philippines (Reuters) - Philippine communist rebels freed Wednesday two soldiers they had held as ""prisoners of war"" for more than five months, saying they wanted to rebuild confidence in peace talks with the government.",
2,New Rainbow Six Franchise for Spring 2005,"SAN FRANCISCO, CA - November 30, 2004 -Ubisoft, one of the world #39;s largest video game publishers, today announced its plans to launch the next installment in the Tom Clancy #39;s Rainbow SixR franchise for the Sony PlayStationR2 computer entertainm...",
3,Kiwis heading for big win,DANIEL VETTORI spun New Zealand to the brink of a crushing victory over Bangladesh in the second and final Test at the MA Aziz Stadium in Chittagong today.,
4,"Shelling, shooting resumes in breakaway Georgian region (AFP)","AFP - Georgian and South Ossetian forces overnight accused each other of trying to storm the other side's positions in Georgia's breakaway region of South Ossetia, as four Georgian soldiers were reported to be wounded.",


Perform the text generation based on the prepared dataframe. Note that the 'title' is used as input. The generated fake news is stored in the 'generated' column of the dataframe.

The dataframe is saved to storage for reference.
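
The generation cell feeds the titles to the model in batches, each title prefixed with the task name. As a minimal sketch of that batching idea (not the notebook's own code), the hypothetical helper `batched` below also yields the final, shorter batch when the number of titles is not a multiple of the batch size.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items, including a shorter final batch."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

task_name = 'generate fake news: '
titles = ['Kiwis heading for big win', 'Napster Mobile', 'US Airways Workers Get Pay Cut']

# Prefix every title with the task name, two titles per batch
batches = [[task_name + title for title in batch] for batch in batched(titles, 2)]
for batch in batches:
    print(batch)
```

With 7600 titles and a batch size of 38 every batch is full, so the shorter final batch only matters for other dataset or batch sizes.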

In [None]:
text_list = []
generated = []

for index, row in tqdm(zip(range(1, total_samples + 1), df.iterrows()), total = total_samples):
    # Prep input text
    text_list.append(task_name + row[1]['title'])
    
    # Generate once a batch is full, and also for the last (possibly smaller) batch
    if index % GENERATE_BATCH_SIZE == 0 or index == total_samples:
        # Batch Encode with Special Tokens
        textlist_encoded = t5_tokenizer.batch_encode_plus(text_list, add_special_tokens = True, max_length = MAX_LEN, padding = True, truncation = True, return_tensors = 'tf')
        
        input_ids = textlist_encoded['input_ids']
        
        # Generate FakeNews
        # Note: do_sample is not set, so decoding is greedy; top_p, top_k and temperature
        # only take effect when do_sample = True.
        generated_fakenews = model.generate(input_ids, 
                                          max_length = MAX_LEN, 
                                          top_p = 0.95, 
                                          top_k = 256, 
                                          temperature = 1.1,
                                          num_beams = 1, 
                                          num_return_sequences = 1, 
                                          repetition_penalty = 1.1)
        
        for mapping in generated_fakenews.numpy():
            generated.append(t5_tokenizer.decode(mapping, skip_special_tokens = True))

        # Reset Text List
        text_list = []

# Generate Final File
df['generated'] = generated
df.to_csv(WORK_DIR + 't5_generated_fake_news_final.csv')

# Summary...
df.head()

Unnamed: 0,title,description,generated
0,Carolina's Davis Done for the Season,"CHARLOTTE, N.C. (Sports Network) - Carolina Panthers running back Stephen Davis will miss the remainder of the season after being placed on injured reserve Saturday.","The Carolina Panthers' defensive end, who was a starter for the season, is out for the season."
1,"Philippine Rebels Free Troops, Talks in Doubt","PRESENTACION, Philippines (Reuters) - Philippine communist rebels freed Wednesday two soldiers they had held as ""prisoners of war"" for more than five months, saying they wanted to rebuild confidence in peace talks with the government.","Philippine rebels freed their troops from the Philippines on Friday, a day after a government report said the country #39;s military had quot;really quot; resisted a ceasefire with the government."
2,New Rainbow Six Franchise for Spring 2005,"SAN FRANCISCO, CA - November 30, 2004 -Ubisoft, one of the world #39;s largest video game publishers, today announced its plans to launch the next installment in the Tom Clancy #39;s Rainbow SixR franchise for the Sony PlayStationR2 computer entertainm...","The Rainbow Six franchise will be released in the United States and Canada in spring 2005, with the first release expected to be in the US."
3,Kiwis heading for big win,DANIEL VETTORI spun New Zealand to the brink of a crushing victory over Bangladesh in the second and final Test at the MA Aziz Stadium in Chittagong today.,The Kiwis are heading for a big win in the first Test against Australia on Sunday. The Kiwis are a long way from their first test victory in the Test.
4,"Shelling, shooting resumes in breakaway Georgian region (AFP)","AFP - Georgian and South Ossetian forces overnight accused each other of trying to storm the other side's positions in Georgia's breakaway region of South Ossetia, as four Georgian soldiers were reported to be wounded.","AFP - A shelling operation and a shooting in the breakaway Georgian region of Karshmiya resumed on Monday, with the government threatening to fire more than a million shells at the rebel-held region."


### RoBERTa FakeNews Detector

We load the 't5_generated_fake_news_final.csv' file and preprocess it so that the news text and labels have the form the classifier expects: the real descriptions get label 0 and the generated news gets label 1.

In [None]:
# Import Generated Fake News
df = pd.read_csv(WORK_DIR + 't5_generated_fake_news_final.csv', usecols = ['title', 'description', 'generated'])
df.head()

Unnamed: 0,title,description,generated
0,Carolina's Davis Done for the Season,"CHARLOTTE, N.C. (Sports Network) - Carolina Panthers running back Stephen Davis will miss the remainder of the season after being placed on injured reserve Saturday.","The Carolina Panthers' defensive end, who was a starter for the season, is out for the season."
1,"Philippine Rebels Free Troops, Talks in Doubt","PRESENTACION, Philippines (Reuters) - Philippine communist rebels freed Wednesday two soldiers they had held as ""prisoners of war"" for more than five months, saying they wanted to rebuild confidence in peace talks with the government.","Philippine rebels freed their troops from the Philippines on Friday, a day after a government report said the country #39;s military had quot;really quot; resisted a ceasefire with the government."
2,New Rainbow Six Franchise for Spring 2005,"SAN FRANCISCO, CA - November 30, 2004 -Ubisoft, one of the world #39;s largest video game publishers, today announced its plans to launch the next installment in the Tom Clancy #39;s Rainbow SixR franchise for the Sony PlayStationR2 computer entertainm...","The Rainbow Six franchise will be released in the United States and Canada in spring 2005, with the first release expected to be in the US."
3,Kiwis heading for big win,DANIEL VETTORI spun New Zealand to the brink of a crushing victory over Bangladesh in the second and final Test at the MA Aziz Stadium in Chittagong today.,The Kiwis are heading for a big win in the first Test against Australia on Sunday. The Kiwis are a long way from their first test victory in the Test.
4,"Shelling, shooting resumes in breakaway Georgian region (AFP)","AFP - Georgian and South Ossetian forces overnight accused each other of trying to storm the other side's positions in Georgia's breakaway region of South Ossetia, as four Georgian soldiers were reported to be wounded.","AFP - A shelling operation and a shooting in the breakaway Georgian region of Karshmiya resumed on Monday, with the government threatening to fire more than a million shells at the rebel-held region."


In [None]:
# Split out 'description', rename column to 'news' and set label to 0
df_description = df[['title', 'description']].copy()
df_description.rename(columns = {'description': 'news'}, inplace = True)
df_description['label'] = 0
df_description.head()

Unnamed: 0,title,news,label
0,Carolina's Davis Done for the Season,"CHARLOTTE, N.C. (Sports Network) - Carolina Panthers running back Stephen Davis will miss the remainder of the season after being placed on injured reserve Saturday.",0
1,"Philippine Rebels Free Troops, Talks in Doubt","PRESENTACION, Philippines (Reuters) - Philippine communist rebels freed Wednesday two soldiers they had held as ""prisoners of war"" for more than five months, saying they wanted to rebuild confidence in peace talks with the government.",0
2,New Rainbow Six Franchise for Spring 2005,"SAN FRANCISCO, CA - November 30, 2004 -Ubisoft, one of the world #39;s largest video game publishers, today announced its plans to launch the next installment in the Tom Clancy #39;s Rainbow SixR franchise for the Sony PlayStationR2 computer entertainm...",0
3,Kiwis heading for big win,DANIEL VETTORI spun New Zealand to the brink of a crushing victory over Bangladesh in the second and final Test at the MA Aziz Stadium in Chittagong today.,0
4,"Shelling, shooting resumes in breakaway Georgian region (AFP)","AFP - Georgian and South Ossetian forces overnight accused each other of trying to storm the other side's positions in Georgia's breakaway region of South Ossetia, as four Georgian soldiers were reported to be wounded.",0


In [None]:
# Split out 'generated', rename column to 'news' and set label to 1
df_generated = df[['title', 'generated']].copy()
df_generated.rename(columns = {'generated': 'news'}, inplace = True)
df_generated['label'] = 1
df_generated.head()

Unnamed: 0,title,news,label
0,Carolina's Davis Done for the Season,"The Carolina Panthers' defensive end, who was a starter for the season, is out for the season.",1
1,"Philippine Rebels Free Troops, Talks in Doubt","Philippine rebels freed their troops from the Philippines on Friday, a day after a government report said the country #39;s military had quot;really quot; resisted a ceasefire with the government.",1
2,New Rainbow Six Franchise for Spring 2005,"The Rainbow Six franchise will be released in the United States and Canada in spring 2005, with the first release expected to be in the US.",1
3,Kiwis heading for big win,The Kiwis are heading for a big win in the first Test against Australia on Sunday. The Kiwis are a long way from their first test victory in the Test.,1
4,"Shelling, shooting resumes in breakaway Georgian region (AFP)","AFP - A shelling operation and a shooting in the breakaway Georgian region of Karshmiya resumed on Monday, with the government threatening to fire more than a million shells at the rebel-held region.",1


In [None]:
# Combine Dataframes to a final dataframe.
test_df = pd.concat([df_description, df_generated], ignore_index = True)
test_df.sample(n = 10)

Unnamed: 0,title,news,label
8569,US Airways Workers Get Pay Cut,"US Airways workers are getting a pay cut, the airline said on Monday. The union said the cuts were a result of a $1 billion pay cut.",1
10770,Iran Says EU Nuke Negotiations in Final Stages,"Iran said on Monday that negotiations with the European Union on a nuclear deal were in the final stages of negotiations, but that it was still in the process of negotiating with the European Union.",1
3612,Novak Captures First Indoor Title,"BASEL, Switzerland Oct 31, 2004 - Jiri Novak of the Czech Republic won the Swiss Indoors for his first indoor title, defeating David Nalbandian in five sets Sunday in a final in which the Argentine smashed two rackets.",0
112,Rebel threat on the roads leaves Katmandu isolated,"KATMANDU, Nepal The Nepali capital was largely cut off from the rest of the country on Wednesday after Maoist rebels threatened to attack any vehicles traveling on main roads, in a virtual blockade of Katmandu to press their demands for the release of ...",0
11326,Napster Mobile,Napster Mobile is a mobile application that lets users download music and photos from their mobile devices.,1
11304,Google #39;s New PC Search Tool Poses Risks,"Google Inc. has released a new search tool that aims to help users find information about their PCs, but it is a risky move.",1
11589,Dollar Rises on the Interest Rate Plays,The dollar rose against the euro on Friday as a slew of economic indicators helped to keep the U.S. economy in check.,1
2656,PeopleSoft customers reassured,Oracle Corp. President Charles Phillips on Monday said PeopleSoft Inc. customers have become more comfortable with the prospect of a merger between the two software firms even as the proposed transaction awaits a critical ruling from a Delaware court.,0
5971,Allardyce is infuriated by Souness #39; criticism,BOLTON manager Sam Allardyce rounded on his Newcastle counterpart Graeme Souness last night for criticising their style of play. Allardyce saw his unsung side reclaim fourth spot in the table after a 2-1 victory at the Reebok Stadium.,0
8799,Connect the jovian dots,"The jovians are jovial, and the jovians are jovial. They are jovial, jovial, and jovial.",1


Next we define a function to process the Pandas test dataframe. We loop through all rows and use the 'news' and 'label' columns of each row.

Note that the 'label' column is only used to validate the predictions.
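
The function below writes each tokenized news item into a zero-filled row of length MAX_LEN and marks the real tokens in an attention mask. That padding step can be sketched in plain Python as follows. `pad_to_max_len` is a hypothetical helper and the token ids are made up for illustration.

```python
def pad_to_max_len(token_ids, max_len, pad_value=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    token_ids = list(token_ids)[:max_len]           # truncate overly long inputs
    padding = [pad_value] * (max_len - len(token_ids))
    input_ids = token_ids + padding
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [1] * len(token_ids) + [0] * len(padding)
    return input_ids, attention_mask

# Made-up token ids, for illustration only
ids, mask = pad_to_max_len([0, 713, 16, 2], max_len=8)
print(ids)
print(mask)
```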

In [None]:
def create_dataset(df):
    total_samples = df.shape[0]

    # Placeholders input
    input_ids = np.zeros((total_samples, MAX_LEN), dtype = 'int32')
    input_masks = np.zeros((total_samples, MAX_LEN), dtype = 'int32')
    labels = np.zeros((total_samples, ), dtype = 'int32')

    for index, row in tqdm(zip(range(0, total_samples), df.iterrows()), total = total_samples):
        
        # Get news and label...
        news = row[1]['news']
        label = row[1]['label']

        # Process News - Set Label.....
        input_encoded = roberta_tokenizer.encode_plus(news, add_special_tokens = True, max_length = MAX_LEN, truncation = True)
        input_ids_sample = input_encoded['input_ids']
        input_ids[index,:len(input_ids_sample)] = input_ids_sample
        attention_mask_sample = input_encoded['attention_mask']
        input_masks[index,:len(attention_mask_sample)] = attention_mask_sample
        labels[index] = int(label)

    # Create Dataset.
    dataset = tf.data.Dataset.from_tensor_slices(({'input_ids': input_ids, 'attention_mask': input_masks}, labels))

    # Return Dataset
    return dataset

In [None]:
# Show Sizes
print(f'Test DF Shape: {test_df.shape}')

# Create Test Dataset
test_dataset = create_dataset(test_df)
test_dataset = test_dataset.batch(PREDICT_BATCH_SIZE)
test_dataset = test_dataset.prefetch(128)

# Steps
test_steps = test_df.shape[0] // PREDICT_BATCH_SIZE
print(f'Test Steps: {test_steps}')

Test DF Shape: (15200, 3)


Test Steps: 950


Define a function to create and compile the RoBERTa base model.

In [None]:
def build_model():
    # Create Model
    with strategy.scope():      
        model = TFRobertaForSequenceClassification.from_pretrained(roberta_type, config = roberta_config)
        
        optimizer = tf.keras.optimizers.Adam()
        loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True)
        metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

        model.compile(optimizer = optimizer, loss = loss, metrics = [metric])        
        
        return model

Create the model and load the weights file.

In [None]:
# Create Model
model = build_model()

# Summary
model.summary()

# Load Weights
model.load_weights(WORK_DIR + 'roberta_base_model.h5')

Some layers from the model checkpoint at roberta-base were not used when initializing TFRobertaForSequenceClassification: ['lm_head']
- This IS expected if you are initializing TFRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "tf_roberta_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
roberta (TFRobertaMainLayer) multiple                  124645632 
_________________________________________________________________
classifier (TFRobertaClassif multiple                  592130    
Total params: 125,237,762
Trainable params: 125,237,762
Non-trainable params: 0
_________________________________________________________________


Next, let's evaluate the test set and see how well the RoBERTa model can classify the generated data.

With an evaluation accuracy of around 97%, the RoBERTa model does a good job of separating the real from the fake news.

In [None]:
# Evaluate Dataset (the result holds the loss followed by the accuracy)
eval_results = model.evaluate(test_dataset, steps = test_steps, verbose = 1)
print(f'Detection Accuracy: {eval_results[1] * 100}%')

Detection Accuracy: 97.09210395812988%


We can also run prediction on the test set. This is essentially the same computation as the evaluation. However, evaluation returns the evaluation metrics, whereas prediction returns the raw predictions.
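
The raw predictions are logits, one pair per news item, and a softmax turns each pair into probabilities for the labels 0 (real) and 1 (fake). The following is a minimal pure-Python sketch of that conversion, using two logit rows taken from the prediction output of this notebook. The helpers `softmax` and `predict_label` are illustrative names, not part of the notebook.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def predict_label(logits):
    """Return the index of the most probable class (0 = real, 1 = fake)."""
    probs = softmax(logits)
    return probs.index(max(probs))

row = [4.725556, -4.685101]
probs = softmax(row)
print(probs)
print(predict_label(row))
```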

In [None]:
# Predict Dataset
preds = model.predict(test_dataset, steps = test_steps, verbose = 1)

# Raw Predictions
print(preds.logits)

# Probabilities
probs = tf.nn.softmax(preds.logits).numpy()
print(probs)

[[ 4.725556  -4.685101 ]
 [ 4.5961823 -4.5432057]
 [ 4.733269  -4.666181 ]
 ...
 [-1.1234063  1.17313  ]
 [-4.5547814  4.6318207]
 [-4.5342073  4.633972 ]]
[[9.9991810e-01 8.1840459e-05]
 [9.9989259e-01 1.0734147e-04]
 [9.9991727e-01 8.2762701e-05]
 ...
 [9.1410220e-02 9.0858978e-01]
 [1.0239179e-04 9.9989760e-01]
 [1.0429534e-04 9.9989569e-01]]


In [None]:
test_df['label_pred'] = np.argmax(probs, axis = 1)

So the majority of the real and fake news items were classified correctly. That is very nice, but to be honest I am a lot more interested in the wrong predictions for the real and fake news.

Let's take a look at some of the predictions where our model messed up.
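
Before inspecting individual rows, it can help to count how often each kind of mistake occurs. The sketch below tallies the (true label, predicted label) pairs. `confusion_counts` is a hypothetical helper, and the labels and predictions are toy values for illustration only, not results from this notebook.

```python
from collections import Counter

def confusion_counts(labels, predictions):
    """Count (true label, predicted label) pairs; 0 = real, 1 = fake."""
    return Counter(zip(labels, predictions))

# Toy values for illustration only, not results from the notebook
labels      = [0, 0, 0, 1, 1, 1]
predictions = [0, 1, 0, 1, 1, 0]

counts = confusion_counts(labels, predictions)
real_as_fake = counts[(0, 1)]   # real news classified as fake
fake_as_real = counts[(1, 0)]   # fake news classified as real
accuracy = (counts[(0, 0)] + counts[(1, 1)]) / len(labels)
print(real_as_fake, fake_as_real, accuracy)
```

For the actual dataframe the inputs would be `test_df['label'].tolist()` and `test_df['label_pred'].tolist()`.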

In [None]:
# Real News ... but classified as Fake...
test_df[test_df.label.eq(0) & test_df.label_pred.eq(1)].head()

Unnamed: 0,title,news,label,label_pred
4,"Shelling, shooting resumes in breakaway Georgian region (AFP)","AFP - Georgian and South Ossetian forces overnight accused each other of trying to storm the other side's positions in Georgia's breakaway region of South Ossetia, as four Georgian soldiers were reported to be wounded.",0,1
7,Sharapova Withdraws From Advanta Tourney (AP),AP - Maria Sharapova withdrew from her semifinal at the Advanta Championships on Saturday with a strained right shoulder.,0,1
12,"Almost 5,000 jobs cut on Bank of America getting funds","Bank of America has an option to cut at least 4,500 jobs while reorganizing its structure. This is not the first time when the bank reduces jobs.",0,1
33,Senate Bill Aims at Makers of File-Sharing Software,The Senate Judiciary Committee is considering a copyright bill that stands at the center of the file-sharing debate.,0,1
73,New Jersey Lawsuit Challenges Electronic Voting,A coalition of private citizens and local elected officials in New Jersey plan to file a lawsuit to block the state's use of electronic voting machines.,0,1


In [None]:
# Fake News ... but classified as Real...
test_df[test_df.label.eq(1) & test_df.label_pred.eq(0)].head()

Unnamed: 0,title,news,label,label_pred
9546,Observers warn militant groups may exploit bitterness in &lt;b&gt;...&lt;/b&gt;,Observers say militant groups may exploit bitterness in the streets of Baghdad to gain access to the disputed territory.,1,0
10418,WCQ Group 6 preview: Eriksson considers three up front.,WCQ Group 6 preview: Swedish coach Eriksson is considering three up front positions in the upcoming World Cup qualifier against the Czech Republic.,1,0
10459,Final edition for a respected Asian newsweekly,"The final edition of Asia's most respected newsweekly, Asia Newsweek, is now available for download.",1,0
10991,Symantec Firewall/VPN Appliance 200/200R (firmware builds prior to &lt;b&gt;...&lt;/b&gt;,Symantec has released a new version of its Firewall/VPN Appliance 200/200R (firmware builds prior to the release of the new product).,1,0
12435,The Discreet Charm of the Very Bourgeois Toy Store?,The Discreet Charm of the Very Bourgeois Toy Store?,1,0
