<a href="https://colab.research.google.com/github/RobinSmits/FakeNews-Generator-And-Detector/blob/main/FakeNews_Generator_And_Detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this last notebook we use the 'test' part of the 'ag_news_subset' dataset. It contains 7600 rows that neither the T5 nor the RoBERTa model has seen before.

We again use the T5 model, with the 'title' as input, to generate fake news. The generated output is stored in the file 't5_generated_fake_news_final.csv'.

As a final step, the RoBERTa model classifies each news item as real or fake.

In [1]:
import numpy as np
import os
import pandas as pd
from tqdm.notebook import tqdm
from urllib.request import urlopen
import tarfile

# Install Specific Versions
!pip install -q tensorflow==2.4.1
!pip install -q tensorflow-datasets==4.1.0
!pip install -q transformers==4.4.2
!pip install -q sentencepiece==0.1.95

# Import Packages
import tensorflow as tf
import tensorflow_datasets as tfds
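from transformers import (T5Config, T5Tokenizer, TFT5ForConditionalGeneration,
                          RobertaConfig, RobertaTokenizer, TFRobertaForSequenceClassification)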
import sentencepiece
from sklearn.metrics import classification_report

I've created and tested these notebooks on Google Colab Pro and used Google Drive to store and load any files created. 

If you run the code locally, modify 'WORK_DIR' accordingly. Google Drive is not needed in that case.

In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Set Folder to use...
WORK_DIR = '/content/drive/My Drive/fake_news/'
os.makedirs(WORK_DIR, exist_ok = True) 

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
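
For a local run, the choice of folder can be made automatic. The following is only a minimal sketch, assuming Drive is mounted at '/content/drive' as in the cell above; the helper name `resolve_work_dir` and the temporary local folder are hypothetical and not part of the original notebook.

```python
import os
import tempfile

DRIVE_ROOT = '/content/drive/My Drive'  # mount point used in the cell above

def resolve_work_dir(local_dir, drive_subdir='fake_news'):
    """Return the Drive folder when Drive is mounted, else a local folder."""
    if os.path.isdir(DRIVE_ROOT):
        work_dir = os.path.join(DRIVE_ROOT, drive_subdir)
    else:
        work_dir = local_dir
    os.makedirs(work_dir, exist_ok=True)
    # Keep the trailing separator so WORK_DIR + 'file.csv' still works
    return os.path.join(work_dir, '')

# Example: fall back to a temporary local folder (hypothetical location)
WORK_DIR_EXAMPLE = resolve_work_dir(os.path.join(tempfile.gettempdir(), 'fake_news'))
print(WORK_DIR_EXAMPLE)
```

The trailing separator matters because the rest of the notebook builds file paths by string concatenation with 'WORK_DIR'.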


Next we configure the device to use (note: this notebook runs on TPU, GPU, and CPU) and set the necessary constants.

We also set up everything needed for the T5 and RoBERTa models.

In [3]:
# Configure Strategy. Assume TPU...if not set default for GPU/CPU
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
except ValueError:
    strategy = tf.distribute.get_strategy()

# Set Auto Tune
AUTO = tf.data.experimental.AUTOTUNE

# Suppress Warnings
tf.autograph.set_verbosity(0, False)

# Set Pandas Display Options
pd.set_option('display.max_colwidth', 256)

# Constants
MAX_LEN = 512
VERBOSE = 1

# Batch Size
GENERATE_BATCH_SIZE = 19 * strategy.num_replicas_in_sync
PREDICT_BATCH_SIZE = 16 * strategy.num_replicas_in_sync
print(f'Predict Batch Size: {PREDICT_BATCH_SIZE}')
print(f'Generate Batch Size: {GENERATE_BATCH_SIZE}')

Predict Batch Size: 16
Generate Batch Size: 19


In [4]:
# Set T5 Type
t5_size = 't5-base'
print(f'T5 Model Type: {t5_size}')

# Set T5 Task Name
task_name = 'generate fake news: '
print(f'T5 Task Name: {task_name}')

# Set T5 Config
t5_config = T5Config.from_pretrained(t5_size)

# Set T5 Tokenizer
t5_tokenizer = T5Tokenizer.from_pretrained(t5_size, return_dict = True)

T5 Model Type: t5-base
T5 Task Name: generate fake news: 


In [5]:
# Set RoBERTa Type
roberta_type = 'roberta-base'
print(f'RoBERTa Model Type: {roberta_type}')

# Set RoBERTa Config
roberta_config = RobertaConfig.from_pretrained(roberta_type, num_labels = 2) # Binary classification so set num_labels = 2

# Set RoBERTa Tokenizer
roberta_tokenizer = RobertaTokenizer.from_pretrained(roberta_type, 
                                                     add_prefix_space = False,
                                                     do_lower_case = False)

RoBERTa Model Type: roberta-base


For generation we use the 'test' split of the TensorFlow Datasets dataset 'ag_news_subset'. The dataset has a train set of 120K rows and a test set of 7600 rows.

Neither the T5 nor the RoBERTa model was trained on the 'test' split, so it is completely unseen by both models.

Each row contains a 'title', which is a newspaper headline, and a 'description', which is a short excerpt from the newspaper article.

The 'title' is used as input for the T5 model to generate the fake news.

Note: the initial download of the dataset has failed for me several times. Running the cell again usually works.
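
Instead of re-running the cell by hand, the download could be wrapped in a small retry helper. This is a sketch under assumptions: the helper name `with_retries`, the number of attempts, and the delay are all hypothetical, and the set of exceptions worth retrying is a guess (`OSError` covers `urllib` network errors, `tarfile.TarError` covers a broken archive stream).

```python
import tarfile
import time

def with_retries(action, attempts=3, delay_seconds=2.0,
                 retry_on=(OSError, tarfile.TarError)):
    """Call action() and retry when one of the listed exceptions is raised."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as error:
            if attempt == attempts:
                raise  # Give up after the last attempt
            print(f'Attempt {attempt} failed ({error}); retrying...')
            time.sleep(delay_seconds)

# Toy demonstration: an action that fails once and then succeeds
state = {'calls': 0}

def flaky_download():
    state['calls'] += 1
    if state['calls'] < 2:
        raise OSError('temporary failure')
    return 'downloaded'

print(with_retries(flaky_download, delay_seconds=0.0))
```

To use it with the dataset, the download-and-extract code of the next cell would be moved into a function and passed as `action`.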

In [6]:
# AG News Subset Download URL from TFDS
AGNEWSSUBSET_URL = 'https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbUDNpeUdjb0wxRms'
AGNEWSSUBSET_DIR = '/tmp/agnewssubet/'

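# Download Tar.Gz File and Extract (stream mode "r|gz" reads the archive sequentially)
with urlopen(AGNEWSSUBSET_URL) as targzstream:
    # Use a context manager so the tar file is closed after extraction
    with tarfile.open(fileobj = targzstream, mode = "r|gz") as thetarfile:
        thetarfile.extractall(AGNEWSSUBSET_DIR)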
    
# List Dataset files
agnewssubset_files = os.listdir(AGNEWSSUBSET_DIR + 'ag_news_csv/')

# Load Test Csv
test_df = pd.read_csv(AGNEWSSUBSET_DIR + 'ag_news_csv/test.csv', names = ['label', 'title', 'description'])

# Add column for generated text
test_df['generated_description'] = ''

# Samples
total_samples = test_df.shape[0] 
print(f'Total Samples: {total_samples}')

# Summary
test_df.head()

Total Samples: 7600


Unnamed: 0,label,title,description,generated_description
0,3,Fears for T N pension after talks,Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.,
1,4,The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com),"SPACE.com - TORONTO, Canada -- A second\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\privately funded suborbital space flight, has officially announced the first\launch date for its manned rocket.",
2,4,Ky. Company Wins Grant to Study Peptides (AP),"AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins.",
3,4,Prediction Unit Helps Forecast Wildfires (AP),"AP - It's barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures and endless charts, but already he knows what the day will bring. Lightning will strike in places he expects. Winds will pick up, moist places will dry a...",
4,4,Calif. Aims to Limit Farm-Related Smog (AP),"AP - Southern California's smog-fighting agency went after emissions of the bovine variety Friday, adopting the nation's first rules to reduce air pollution from dairy cow manure.",


Create the Keras Model to be used for T5.

In [7]:
class KerasTFT5ForConditionalGeneration(TFT5ForConditionalGeneration):
    def __init__(self, *args, log_dir = None, cache_dir = None, **kwargs):
        super().__init__(*args, **kwargs)
        self.loss_tracker= tf.keras.metrics.Mean(name='loss') 
    
    @tf.function
    def train_step(self, data):
        x = data[0]
        y = x['labels']
        y = tf.reshape(y, [-1, 1])
        with tf.GradientTape() as tape:
            outputs = self(x, training = True)
            loss = outputs[0]
            logits = outputs[1]
            loss = tf.reduce_mean(loss)

        # Compute gradients outside the tape context so the gradient
        # computation itself is not recorded
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        self.loss_tracker.update_state(loss)        
        self.compiled_metrics.update_state(y, logits)
        metrics = {m.name: m.result() for m in self.metrics}
        
        return metrics

    def test_step(self, data):
        x = data[0]
        y = x["labels"]
        y = tf.reshape(y, [-1, 1])
        output = self(x, training = False)
        loss = output[0]
        loss = tf.reduce_mean(loss)
        logits = output[1]
        
        self.loss_tracker.update_state(loss)
        # update_state() returns None, so read the results from the metrics instead
        self.compiled_metrics.update_state(y, logits)
        metrics = {m.name: m.result() for m in self.metrics}
        
        return metrics

Next create the model and load the weights file.

In [8]:
# Create Model
with strategy.scope():
    model = KerasTFT5ForConditionalGeneration.from_pretrained(t5_size, config = t5_config)
    model.compile(optimizer = tf.keras.optimizers.Adam(), 
                  metrics = [tf.keras.metrics.SparseTopKCategoricalAccuracy(name = 'accuracy')])

# Summary
model.summary()

# Load Weights
model.load_weights(WORK_DIR + 't5_base_model.h5')

All model checkpoint layers were used when initializing KerasTFT5ForConditionalGeneration.

All the layers of KerasTFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use KerasTFT5ForConditionalGeneration for predictions without further training.


Model: "keras_tf_t5for_conditional_generation"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
shared (TFSharedEmbeddings)  multiple                  24674304  
_________________________________________________________________
encoder (TFT5MainLayer)      multiple                  84954240  
_________________________________________________________________
decoder (TFT5MainLayer)      multiple                  113275008 
Total params: 222,903,554
Trainable params: 222,903,552
Non-trainable params: 2
_________________________________________________________________


We use test_df as the final dataframe that is saved to disk after the fake news generation.

The text generation uses the 'title' of each row as input. The generated fake news is stored in the 'generated_description' column of the dataframe.

The dataframe is then saved to storage for reference.

In [9]:
text_list = None
generated = []

for index, row in tqdm(zip(range(total_samples), test_df.iterrows()), total = total_samples):
    index += 1

    if text_list is None:
        text_list = []

    # Prep input text
    text_list.append(task_name + row[1]['title'])
    
    if index % GENERATE_BATCH_SIZE == 0:
        # Batch Encode with Special Tokens
        textlist_encoded = t5_tokenizer.batch_encode_plus(text_list, add_special_tokens = True, max_length = MAX_LEN, truncation = True, padding = 'max_length', return_tensors = 'tf')
        
        input_ids = textlist_encoded['input_ids']
        
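        # Generate FakeNews
        # Note: sampling is not enabled here (do_sample defaults to False), so with
        # num_beams = 1 decoding is greedy; top_p, top_k and temperature only take
        # effect when do_sample = True is passed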
        generated_fakenews = model.generate(input_ids, 
                                          max_length = MAX_LEN, 
                                          top_p = 0.96, 
                                          top_k = 256, 
                                          temperature = 1.3,
                                          num_beams = 1, 
                                          num_return_sequences = 1, 
                                          repetition_penalty = 1.3)
        
        for mapping in generated_fakenews.numpy():
            generated_description = t5_tokenizer.decode(mapping, skip_special_tokens = True)
            generated.append(generated_description)

        # Reset Text List
        text_list = []

# Generate Final File
test_df['generated_description'] = generated
test_df.to_csv(WORK_DIR + 't5_generated_fake_news_final.csv')

# Summary...
test_df.head()





Unnamed: 0,label,title,description,generated_description
0,3,Fears for T N pension after talks,Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.,"SAN FRANCISCO (Reuters) - The government has urged its members to hold talks with the government on a plan to reduce their pension liabilities, a report said on Thursday."
1,4,The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com),"SPACE.com - TORONTO, Canada -- A second\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\privately funded suborbital space flight, has officially announced the first\launch date for its manned rocket.","SPACE.com - The second private team to launch a human spacecraft has set a date for the launch of the first human spacecraft, a move that will help the crew of the manned spacecraft to reach the moon."
2,4,Ky. Company Wins Grant to Study Peptides (AP),"AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins.",AP - A Kentucky company that has been studying peptides for more than a decade has won a grant to study the drug.
3,4,Prediction Unit Helps Forecast Wildfires (AP),"AP - It's barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures and endless charts, but already he knows what the day will bring. Lightning will strike in places he expects. Winds will pick up, moist places will dry a...",AP - A new unit of the National Weather Service's forecasting unit has helped forecast wildfires in the United States and Canada.
4,4,Calif. Aims to Limit Farm-Related Smog (AP),"AP - Southern California's smog-fighting agency went after emissions of the bovine variety Friday, adopting the nation's first rules to reduce air pollution from dairy cow manure.","AP - California's Environmental Protection Agency is aiming to limit the impact of smog on agriculture by reducing emissions from greenhouse gases, a federal agency said Tuesday."
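
The loop above only runs generation when a full batch has been collected. That covers every row here because 7600 = 19 × 400, but with a batch size that does not divide the number of rows the last rows would never be generated and the column assignment would fail on a length mismatch. A minimal sketch of batching that also yields the trailing partial batch follows; the helper name `iter_batches` and the toy titles are hypothetical.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items; the last one may be shorter."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

task_name = 'generate fake news: '
titles = ['Title A', 'Title B', 'Title C', 'Title D', 'Title E']  # toy data

# Prefix every title with the T5 task name, batch by batch
batches = [[task_name + title for title in batch]
           for batch in iter_batches(titles, batch_size=2)]
print([len(batch) for batch in batches])  # [2, 2, 1]
```

Each yielded batch could then be encoded and passed to `model.generate` exactly as in the cell above.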


### RoBERTa FakeNews Detector

We load the 't5_generated_fake_news_final.csv' file and do some preprocessing to get the input news and labels correct for the classifier.

In [10]:
# Import Generated Fake News
df = pd.read_csv(WORK_DIR + 't5_generated_fake_news_final.csv', usecols = ['title', 'description', 'generated_description'])
df.head()

Unnamed: 0,title,description,generated_description
0,Fears for T N pension after talks,Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.,"SAN FRANCISCO (Reuters) - The government has urged its members to hold talks with the government on a plan to reduce their pension liabilities, a report said on Thursday."
1,The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com),"SPACE.com - TORONTO, Canada -- A second\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\privately funded suborbital space flight, has officially announced the first\launch date for its manned rocket.","SPACE.com - The second private team to launch a human spacecraft has set a date for the launch of the first human spacecraft, a move that will help the crew of the manned spacecraft to reach the moon."
2,Ky. Company Wins Grant to Study Peptides (AP),"AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins.",AP - A Kentucky company that has been studying peptides for more than a decade has won a grant to study the drug.
3,Prediction Unit Helps Forecast Wildfires (AP),"AP - It's barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures and endless charts, but already he knows what the day will bring. Lightning will strike in places he expects. Winds will pick up, moist places will dry a...",AP - A new unit of the National Weather Service's forecasting unit has helped forecast wildfires in the United States and Canada.
4,Calif. Aims to Limit Farm-Related Smog (AP),"AP - Southern California's smog-fighting agency went after emissions of the bovine variety Friday, adopting the nation's first rules to reduce air pollution from dairy cow manure.","AP - California's Environmental Protection Agency is aiming to limit the impact of smog on agriculture by reducing emissions from greenhouse gases, a federal agency said Tuesday."


In [11]:
# Split out 'description', rename column to 'news' and set label to 0
df_description = df[['title', 'description']].copy()
df_description.rename(columns = {'description': 'news'}, inplace = True)
df_description['label'] = 0
df_description.head()

Unnamed: 0,title,news,label
0,Fears for T N pension after talks,Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.,0
1,The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com),"SPACE.com - TORONTO, Canada -- A second\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\privately funded suborbital space flight, has officially announced the first\launch date for its manned rocket.",0
2,Ky. Company Wins Grant to Study Peptides (AP),"AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins.",0
3,Prediction Unit Helps Forecast Wildfires (AP),"AP - It's barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures and endless charts, but already he knows what the day will bring. Lightning will strike in places he expects. Winds will pick up, moist places will dry a...",0
4,Calif. Aims to Limit Farm-Related Smog (AP),"AP - Southern California's smog-fighting agency went after emissions of the bovine variety Friday, adopting the nation's first rules to reduce air pollution from dairy cow manure.",0


In [12]:
# Split out 'generated_description', rename column to 'news' and set label to 1
df_generated = df[['title', 'generated_description']].copy()
df_generated.rename(columns = {'generated_description': 'news'}, inplace = True)
df_generated['label'] = 1
df_generated.head()

Unnamed: 0,title,news,label
0,Fears for T N pension after talks,"SAN FRANCISCO (Reuters) - The government has urged its members to hold talks with the government on a plan to reduce their pension liabilities, a report said on Thursday.",1
1,The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com),"SPACE.com - The second private team to launch a human spacecraft has set a date for the launch of the first human spacecraft, a move that will help the crew of the manned spacecraft to reach the moon.",1
2,Ky. Company Wins Grant to Study Peptides (AP),AP - A Kentucky company that has been studying peptides for more than a decade has won a grant to study the drug.,1
3,Prediction Unit Helps Forecast Wildfires (AP),AP - A new unit of the National Weather Service's forecasting unit has helped forecast wildfires in the United States and Canada.,1
4,Calif. Aims to Limit Farm-Related Smog (AP),"AP - California's Environmental Protection Agency is aiming to limit the impact of smog on agriculture by reducing emissions from greenhouse gases, a federal agency said Tuesday.",1


In [13]:
# Combine Dataframes to a final dataframe.
test_df = pd.concat([df_description, df_generated], ignore_index = True)
test_df.sample(n = 10)

Unnamed: 0,title,news,label
11817,"China Mine Blast Kills 56, Death Toll Could Soar",China #39;s largest mine explosion killed 56 people and a death toll could soar as the country #39;s government prepares to take action against the alleged sabotage of a mine in the southern province of Guangdong province.,1
12285,OPEC Head Urges U.S. to Use Oil Reserves,"OPEC Chairman John Kerry has called on the United States to use oil reserves to help the oil industry, a top OPEC official said on Thursday.",1
3098,"Office Depot Chairman, CEO Nelson Resigns","Office Depot Inc. (ODP.N: Quote, Profile, Research) , the No. 2 US office supply chain, on Monday said Chairman and Chief Executive Officer Bruce Nelson has resigned and a search for his successor is underway.",0
7785,Russia ready to contribute to settlement of South Ossetia conflict: Putin,"Russia is ready to contribute to the settlement of the conflict in South Ossetia, President Vladimir Putin said on Thursday.",1
6826,Nissan comes apart without parts,"Tokyo - Japan #39;s Nissan Motor said on Thursday that the company may have to suspend some production next March in addition to already announced suspensions, due to parts shortages, resulting in a decline of about 6 billion (R339.8 million) in annual...",0
1363,NASA space capsule crashes into desert,"The Genesis space capsule, which had orbited the sun for more than three years in an attempt to find clues to the origin of the solar system, crashed to Earth on Wednesday after its parachute failed to deploy.",0
8759,Biffle Bests Mears,"The New York Yankees have a chance to win the first round of the World Series, but they need to be able to win the championship.",1
12609,Australian Bookies Gutted After Betting Splurge,"Australian bookies were gutted after a huge surge in their betting profits on the Australian Open on Sunday, despite a sluggish start to the season.",1
4793,EMC unveils 'Storage Router',EMC has unveiled long-awaited storage virtualization technology that the company said will allow users to manage its arrays -- and high-end boxes from major competitors -- through a single interface.,0
3295,"Iraq #39;s Government in Talks with Sunni, Shi #39;ite Leaders",Iraq #39;s interim government is engaged in cease-fire talks with Sunni and Shi #39;ite leaders in an effort to restore calm to violent parts of Iraq before January #39;s scheduled election.,0
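
Because `pd.concat(..., ignore_index = True)` stacks the 7600 real rows first and the 7600 generated rows second, rows i and i + 7600 of the combined dataframe belong to the same title. A small sketch of that mapping; the helper name `counterpart_index` is hypothetical.

```python
def counterpart_index(index, n_titles=7600):
    """Map a row of the combined frame to the row holding the other
    version (real <-> generated) of the same title."""
    if not 0 <= index < 2 * n_titles:
        raise IndexError(f'index {index} is outside the combined frame')
    return index + n_titles if index < n_titles else index - n_titles

# Row 11817 in the sample above is generated; its real counterpart is row:
print(counterpart_index(11817))  # 4217
```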


Next we define a function to process the Pandas test dataframe. It loops through all rows and uses the 'news' and 'label' columns of each row.

Note that the 'label' column is only used to validate the predictions.

In [14]:
def create_dataset(df):
    total_samples = df.shape[0]

    # Placeholders input
    input_ids = np.zeros((total_samples, MAX_LEN), dtype = 'int32')
    input_masks = np.zeros((total_samples, MAX_LEN), dtype = 'int32')
    labels = np.zeros((total_samples, ), dtype = 'int32')

    for index, row in tqdm(zip(range(0, total_samples), df.iterrows()), total = total_samples):
        
        # Get news and label...
        news = row[1]['news']
        label = row[1]['label']

        # Process News - Set Label.....
        input_encoded = roberta_tokenizer.encode_plus(news, add_special_tokens = True, max_length = MAX_LEN, truncation = True, padding = 'max_length')
        input_ids_sample = input_encoded['input_ids']
        input_ids[index,:] = input_ids_sample
        attention_mask_sample = input_encoded['attention_mask']
        input_masks[index,:] = attention_mask_sample
        labels[index] = int(label)

    # Create Dataset.
    dataset = tf.data.Dataset.from_tensor_slices(({'input_ids': input_ids, 'attention_mask': input_masks}, labels))

    # Return Dataset
    return dataset
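
With `padding = 'max_length'` and `truncation = True` every sample ends up with exactly MAX_LEN token ids, plus an attention mask that is 1 for real tokens and 0 for padding. The following plain-Python sketch illustrates that layout. The token ids are made up, the helper name `pad_and_mask` is hypothetical, and the pad id of 1 is an assumption about the `roberta-base` vocabulary.

```python
def pad_and_mask(token_ids, max_len, pad_id=1):
    """Truncate or right-pad token_ids to max_len and build the mask.

    Simplification: the real tokenizer keeps the closing special token
    when it truncates; this sketch simply cuts the list.
    """
    kept = token_ids[:max_len]
    n_pad = max_len - len(kept)
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

ids, mask = pad_and_mask([0, 713, 16, 2], max_len=6)  # made-up ids
print(ids)   # [0, 713, 16, 2, 1, 1]
print(mask)  # [1, 1, 1, 1, 0, 0]
```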

In [15]:
# Show Sizes
print(f'Test DF Shape: {test_df.shape}')

# Create Test Dataset
test_dataset = create_dataset(test_df)
test_dataset = test_dataset.batch(PREDICT_BATCH_SIZE)
test_dataset = test_dataset.repeat()
test_dataset = test_dataset.prefetch(128)

# Steps
test_steps = test_df.shape[0] // PREDICT_BATCH_SIZE
print(f'Test Steps: {test_steps}')

Test DF Shape: (15200, 3)




Test Steps: 950
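
Since the dataset is repeated, the number of steps decides how many rows are actually scored. With 15200 rows and a batch size of 16 the floor division is exact, so every row is covered once. The sketch below checks the coverage for a given batch size; the helper name `coverage` is hypothetical, and the batch size of 128 assumes eight replicas.

```python
def coverage(total_rows, batch_size):
    """Return (steps, rows_scored, rows_missed) for floor-division steps."""
    steps, rows_missed = divmod(total_rows, batch_size)
    return steps, steps * batch_size, rows_missed

print(coverage(15200, 16))   # (950, 15200, 0)
print(coverage(15200, 128))  # (118, 15104, 96)
```

In the second case 96 rows would get no prediction, so writing the predictions back to the 15200-row dataframe would fail.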


Define a function to create and compile the RoBERTa base model.

In [16]:
def build_model():
    # Create Model
    with strategy.scope():      
        model = TFRobertaForSequenceClassification.from_pretrained(roberta_type, config = roberta_config)
        
        optimizer = tf.keras.optimizers.Adam()
        loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True)
        metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

        model.compile(optimizer = optimizer, loss = loss, metrics = [metric])        
        
        return model

Create the model and load the weights file.

In [17]:
# Create Model
model = build_model()

# Summary
model.summary()

# Load Weights
model.load_weights(WORK_DIR + 'roberta_base_model.h5')

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "tf_roberta_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
roberta (TFRobertaMainLayer) multiple                  124055040 
_________________________________________________________________
classifier (TFRobertaClassif multiple                  592130    
Total params: 124,647,170
Trainable params: 124,647,170
Non-trainable params: 0
_________________________________________________________________


Next, let's evaluate the test set and see how well the RoBERTa model can classify the generated data.

With an evaluation accuracy of around 97.7%, the RoBERTa model does a good job of separating real from fake news.

In [18]:
# Evaluate Dataset (evaluate() returns [loss, accuracy]; avoid shadowing the built-in eval)
eval_results = model.evaluate(test_dataset, steps = test_steps, verbose = 1)
print(f'Detection Accuracy: {eval_results[1] * 100}%')

Detection Accuracy: 97.69079089164734%


We can also run prediction on the test set. This is essentially the same operation as the evaluation. However, evaluation returns the evaluation metrics, whereas prediction returns the raw predictions.

In [19]:
# Predict Dataset
preds = model.predict(test_dataset, steps = test_steps, verbose = 1)

# Raw Predictions
print(preds.logits[:5])

# Probabilities
probs = tf.nn.softmax(preds.logits).numpy()
print(probs[:5])

[[ 4.8985724  -4.9062314 ]
 [ 4.887136   -4.9282126 ]
 [-0.58673674  0.8424304 ]
 [ 4.754228   -4.788338  ]
 [ 4.744828   -4.7787185 ]]
[[9.9994481e-01 5.5182816e-05]
 [9.9994540e-01 5.4604017e-05]
 [1.9322848e-01 8.0677146e-01]
 [9.9992824e-01 7.1727380e-05]
 [9.9992692e-01 7.3104602e-05]]
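
The step from logits to probabilities is a softmax over each row. A minimal plain-Python sketch, checked against the third row printed above; the helper name `softmax` is local to this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

row = [-0.58673674, 0.8424304]  # third row of the logits printed above
print(softmax(row))  # approximately [0.1932, 0.8068]
```

The second value is the probability of label 1 (fake news), which is why this real article ends up misclassified.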


Let's write the probabilities and predicted labels to the dataframe.

In [20]:
# Write probability of the fake class (label 1) to dataframe
test_df['probas'] = probs[:, 1]

# Write predicted Label to dataframe
test_df['label_pred'] = np.argmax(probs, axis = 1)

So the majority of the real and fake news was classified correctly. Let's take a more detailed look with a classification report.

In [21]:
target_names = ['Real News', 'Fake News']
print(classification_report(test_df.label.values, test_df.label_pred.values, target_names = target_names))

              precision    recall  f1-score   support

   Real News       1.00      0.95      0.98      7600
   Fake News       0.96      1.00      0.98      7600

    accuracy                           0.98     15200
   macro avg       0.98      0.98      0.98     15200
weighted avg       0.98      0.98      0.98     15200



We add friendly display names next to the 0 and 1 values of the labels and predicted labels. This makes the overview clearer.

In [22]:
test_df['DN_Label'] = np.where(test_df.label == 0, 'RealNews', 'FakeNews')
test_df['DN_Label_Predicted'] = np.where(test_df.label_pred == 0, 'RealNews', 'FakeNews')
test_df.head()

Unnamed: 0,title,news,label,probas,label_pred,DN_Label,DN_Label_Predicted
0,Fears for T N pension after talks,Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.,0,5.5e-05,0,RealNews,RealNews
1,The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com),"SPACE.com - TORONTO, Canada -- A second\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\privately funded suborbital space flight, has officially announced the first\launch date for its manned rocket.",0,5.5e-05,0,RealNews,RealNews
2,Ky. Company Wins Grant to Study Peptides (AP),"AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins.",0,0.806771,1,RealNews,FakeNews
3,Prediction Unit Helps Forecast Wildfires (AP),"AP - It's barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures and endless charts, but already he knows what the day will bring. Lightning will strike in places he expects. Winds will pick up, moist places will dry a...",0,7.2e-05,0,RealNews,RealNews
4,Calif. Aims to Limit Farm-Related Smog (AP),"AP - Southern California's smog-fighting agency went after emissions of the bovine variety Friday, adopting the nation's first rules to reduce air pollution from dairy cow manure.",0,7.3e-05,0,RealNews,RealNews


Let's take a further look at the predictions...

In [23]:
# Real News ... and classified as Real...
test_df.loc[test_df.label.eq(0) & test_df.label_pred.eq(0), ['title', 'news', 'DN_Label', 'DN_Label_Predicted']]

Unnamed: 0,title,news,DN_Label,DN_Label_Predicted
0,Fears for T N pension after talks,Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.,RealNews,RealNews
1,The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com),"SPACE.com - TORONTO, Canada -- A second\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\privately funded suborbital space flight, has officially announced the first\launch date for its manned rocket.",RealNews,RealNews
3,Prediction Unit Helps Forecast Wildfires (AP),"AP - It's barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures and endless charts, but already he knows what the day will bring. Lightning will strike in places he expects. Winds will pick up, moist places will dry a...",RealNews,RealNews
4,Calif. Aims to Limit Farm-Related Smog (AP),"AP - Southern California's smog-fighting agency went after emissions of the bovine variety Friday, adopting the nation's first rules to reduce air pollution from dairy cow manure.",RealNews,RealNews
5,Open Letter Against British Copyright Indoctrination in Schools,"The British Department for Education and Skills (DfES) recently launched a ""Music Manifesto"" campaign, with the ostensible intention of educating the next generation of British musicians. Unfortunately, they also teamed up with the music industry (EMI,...",RealNews,RealNews
...,...,...,...,...
7595,Around the world,"Ukrainian presidential candidate Viktor Yushchenko was poisoned with the most harmful known dioxin, which is contained in Agent Orange, a scientist who analyzed his blood said Friday.",RealNews,RealNews
7596,Void is filled with Clement,"With the supply of attractive pitching options dwindling daily -- they lost Pedro Martinez to the Mets, missed on Tim Hudson, and are resigned to Randy Johnson becoming a Yankee -- the Red Sox struck again last night, coming to terms with free agent Ma...",RealNews,RealNews
7597,Martinez leaves bitter,"Like Roger Clemens did almost exactly eight years earlier, Pedro Martinez has left the Red Sox apparently bitter about the way he was treated by management.",RealNews,RealNews
7598,5 of arthritis patients in Singapore take Bextra or Celebrex &lt;b&gt;...&lt;/b&gt;,SINGAPORE : Doctors in the United States have warned that painkillers Bextra and Celebrex may be linked to major cardiovascular problems and should not be prescribed.,RealNews,RealNews


In [24]:
# Real News ... but classified as Fake...
test_df.loc[test_df.label.eq(0) & test_df.label_pred.eq(1), ['title', 'news', 'DN_Label', 'DN_Label_Predicted']]

Unnamed: 0,title,news,DN_Label,DN_Label_Predicted
2,Ky. Company Wins Grant to Study Peptides (AP),"AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins.",RealNews,FakeNews
32,Sister of man who died in Vancouver police custody slams chief (Canadian Press),Canadian Press - VANCOUVER (CP) - The sister of a man who died after a violent confrontation with police has demanded the city's chief constable resign for defending the officer involved.,RealNews,FakeNews
45,Drew Out of Braves' Lineup After Injury (AP),AP - Outfielder J.D. Drew missed the Atlanta Braves' game against the St. Louis Cardinals on Sunday night with a sore right quadriceps.,RealNews,FakeNews
49,Another Major Non-Factor,"Another major, another disappointment for Tiger Woods, the No. 1 ranked player in the world who has not won a major championship since his triumph at the 2002 U.S. Open.",RealNews,FakeNews
68,Bobcats Trade Drobnjak to Hawks for Pick (AP),AP - The Charlotte Bobcats traded center Predrag Drobnjak to the Atlanta Hawks on Monday for a second round pick in the 2005 NBA draft.,RealNews,FakeNews
...,...,...,...,...
7497,HP drops Itanium development,"Hewlett-Packard Co. (HP) is getting out of the chip-making business. The Palo Alto, California, company on Thursday announced that it reached an agreement with Intel Corp.",RealNews,FakeNews
7508,Parmalat Sues 45 Banks to Recover \$4 Billion,"Parmalat, the bankrupt Italian dairy and food company, sued 45 banks on Thursday seeking to recover money it paid to them in the year before the company #39;s collapse.",RealNews,FakeNews
7512,Israel withdraws from Gaza camp,"Israel withdraws from Khan Younis refugee camp in the Gaza Strip, after a four-day operation that left 11 dead.",RealNews,FakeNews
7532,"Pedro, GM welcome challenge that awaits",The Boston Red Sox could offer Pedro Martinez the still-dizzying celebration of a city that he helped to a historic World Series championship.,RealNews,FakeNews


In [25]:
# Fake news (label 1) correctly classified as fake (label_pred 1): true positives
test_df.loc[test_df.label.eq(1) & test_df.label_pred.eq(1), ['title', 'news', 'DN_Label', 'DN_Label_Predicted']]

Unnamed: 0,title,news,DN_Label,DN_Label_Predicted
7600,Fears for T N pension after talks,"SAN FRANCISCO (Reuters) - The government has urged its members to hold talks with the government on a plan to reduce their pension liabilities, a report said on Thursday.",FakeNews,FakeNews
7601,The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com),"SPACE.com - The second private team to launch a human spacecraft has set a date for the launch of the first human spacecraft, a move that will help the crew of the manned spacecraft to reach the moon.",FakeNews,FakeNews
7602,Ky. Company Wins Grant to Study Peptides (AP),AP - A Kentucky company that has been studying peptides for more than a decade has won a grant to study the drug.,FakeNews,FakeNews
7603,Prediction Unit Helps Forecast Wildfires (AP),AP - A new unit of the National Weather Service's forecasting unit has helped forecast wildfires in the United States and Canada.,FakeNews,FakeNews
7604,Calif. Aims to Limit Farm-Related Smog (AP),"AP - California's Environmental Protection Agency is aiming to limit the impact of smog on agriculture by reducing emissions from greenhouse gases, a federal agency said Tuesday.",FakeNews,FakeNews
...,...,...,...,...
15195,Around the world,"The world #39;s biggest news agency is announcing that it will be releasing a new report on the global economy, a report that will be released in the coming weeks.",FakeNews,FakeNews
15196,Void is filled with Clement,"The void is filled with Clement Clement, the savior of the savages.",FakeNews,FakeNews
15197,Martinez leaves bitter,"The Spanish coach, who has been a key part of the team for the past two years, has left the club with a bittersweet end to his season.",FakeNews,FakeNews
15198,5 of arthritis patients in Singapore take Bextra or Celebrex &lt;b&gt;...&lt;/b&gt;,"SAN FRANCISCO (Reuters) - Five of the five patients in Singapore who have arthritis take Bextra or Celebrex, the company said on Tuesday.",FakeNews,FakeNews


In [26]:
# Fake news (label 1) misclassified as real (label_pred 0): false negatives
test_df.loc[test_df.label.eq(1) & test_df.label_pred.eq(0), ['title', 'news', 'DN_Label', 'DN_Label_Predicted']]

Unnamed: 0,title,news,DN_Label,DN_Label_Predicted
7911,"French Take Gold, Bronze in Single Kayak",French sailors took gold and bronze in the single kayaking event at the Yokohama International Speedway on Saturday.,FakeNews,RealNews
9148,"Post-Olympic Greece tightens purse, sells family silver to fill budget holes (AFP)","AFP - Greece tightened its purse after the Olympics, selling family silver to fill budget holes.",FakeNews,RealNews
9499,-Posted by dave.rosenberg 1:51 pm (PDT),"-Posted by dave.rosenberg 1:51 pm (PDT) on Friday, November 21, 2004.",FakeNews,RealNews
12551,Spanish flyer: Markko Martin steers his Ford Focus during the &lt;b&gt;...&lt;/b&gt;,"Markko Martin, the Spanish driver of the Focus, swung his Ford Focus into the air during the &lt;b&gt;Full Circle of Excellence in the International Air Race.",FakeNews,RealNews
13916,"Serbia, Bosnian Serbs Fail to Help War Crimes Tribunal, UN Says",The UN says Serbia and Bosnian Serbs have failed to help the tribunal that tried war crimes cases in Bosnia and Herzegovina.,FakeNews,RealNews
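
The four selections above each show one cell of a confusion matrix. As a rough sketch, all four counts can be read at once with `pd.crosstab`. The `toy_df` frame below is a hypothetical stand-in for `test_df` and only reuses its `label` and `label_pred` column names (0 = real news, 1 = fake news); the numbers are made up for illustration and are not results from this notebook.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's results.
# 0 = real news, 1 = fake news, matching the filters used above.
toy_df = pd.DataFrame({
    'label':      [0, 0, 0, 1, 1, 1],
    'label_pred': [0, 0, 1, 1, 1, 0],
})

# Rows hold the true labels, columns the predicted labels.
confusion = pd.crosstab(toy_df['label'], toy_df['label_pred'],
                        rownames=['actual'], colnames=['predicted'])
print(confusion)

# Off-diagonal cells are the two kinds of misclassification.
false_positives = confusion.loc[0, 1]  # real news flagged as fake
false_negatives = confusion.loc[1, 0]  # fake news passed as real
```

Running the same `pd.crosstab` call on `test_df` should give the row counts of the four filtered tables above, assuming `label` and `label_pred` contain only 0 and 1.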
