# **Using a BERT Model to Predict Fake News**

In [1]:
from google.colab import files
uploaded = files.upload()

Saving news.csv to news.csv


In [0]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
import torch.utils.data as data_utils
import torch.optim as optim
import gc #garbage collector for gpu memory 
from tqdm import tqdm

#### The BERT package (transformers) has to be installed before it can be imported.

In [0]:
%%capture
!pip install transformers

#### Import the classes needed to run BERT models on PyTorch. The transformers package uses the existing PyTorch infrastructure to recreate the BERT model architecture.

In [0]:
%%capture
from transformers import BertForSequenceClassification, BertTokenizer
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

#### Read in the news data from the CSV file. The following columns are not relevant for this experiment:

*   ID - an arbitrary row identifier that carries no signal and could encourage overfitting
*   Title - for this experiment we'll choose to omit it




In [0]:
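# header=1 takes the column names from the second line of the file;
# they are overwritten with explicit names below.
news_data = pd.read_csv("news.csv",header=1)

news_data.columns = ['id','title','text','target_names','target']
# Drop the columns that are not used as model input
del news_data['id']
del news_data['title']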

#### This is a preview of the data once the irrelevant columns have been removed. 

In [6]:
news_data.head(10)

| | text | target_names | target |
|---|---|---|---|
| 0 | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE | 0 |
| 1 | U.S. Secretary of State John F. Kerry said Mon... | REAL | 1 |
| 2 | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE | 0 |
| 3 | It's primary day in New York and front-runners... | REAL | 1 |
| 4 | \nI’m not an immigrant, but my grandparents ... | FAKE | 0 |
| 5 | Share This Baylee Luciani (left), Screenshot o... | FAKE | 0 |
| 6 | A Czech stockbroker who saved more than 650 Je... | REAL | 1 |
| 7 | Hillary Clinton and Donald Trump made some ina... | REAL | 1 |
| 8 | Iranian negotiators reportedly have made a las... | REAL | 1 |
| 9 | CEDAR RAPIDS, Iowa — “I had one of the most wo... | REAL | 1 |


#### The transformers package comes with a tokenizer for each model. We'll use the BERT tokenizer that matches the cased BERT base model, which keeps the original capitalization of the text.

In [7]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

100%|██████████| 213450/213450 [00:00<00:00, 5579422.84B/s]


#### Tokenizing the data so that each article is split into word pieces and symbols. We also add '[CLS]' and '[SEP]' to the beginning and end of every article, keeping at most 510 tokens in between.

In [0]:
tokenized_df = list(map(lambda t: ['[CLS]'] + tokenizer.tokenize(t)[:510] + ['[SEP]'], news_data['text']))
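The truncate-and-wrap step above can be sketched without the real tokenizer. This is a minimal illustration only: whitespace splitting stands in for `tokenizer.tokenize`, and `wrap_tokens` and `max_len` are hypothetical names that are not part of transformers.

```python
def wrap_tokens(tokens, max_len=512):
    """Truncate so that [CLS] + tokens + [SEP] fits within max_len."""
    body = tokens[: max_len - 2]  # reserve two slots for the special tokens
    return ["[CLS]"] + body + ["[SEP]"]

# Whitespace splitting is only a stand-in for WordPiece tokenization.
short = wrap_tokens("a short article".split())
long = wrap_tokens(["word"] * 1000)

print(short)      # ['[CLS]', 'a', 'short', 'article', '[SEP]']
print(len(long))  # 512
```

Because truncation keeps only the first tokens, anything past the start of a long article never reaches the model.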

#### The maximum input length for a BERT model is 512 tokens. The tokenizing step above already cuts long articles down to this length, so here we pad the shorter ones up to it.

In [0]:
totalpadlength = 512

#### We need the vocabulary index of each token so that the model can look up the matching row of its embedding matrix.

In [0]:
indexed_tokens = list(map(tokenizer.convert_tokens_to_ids, tokenized_df))

In [0]:
index_padded = np.array([xi+[0]*(totalpadlength-len(xi)) for xi in indexed_tokens])
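The padding expression above can be read as a small helper. The sketch below assumes the padding id is 0, which matches the `padding_idx=0` shown in the model printout; `pad_ids` is a hypothetical name and the token ids are illustrative.

```python
def pad_ids(ids, total_len=512, pad_id=0):
    """Right-pad a list of token ids with pad_id up to total_len."""
    # A negative count yields an empty list, so an over-long input is
    # returned unchanged rather than truncated.
    return ids + [pad_id] * (total_len - len(ids))

padded = pad_ids([101, 7592, 102], total_len=6)
print(padded)  # [101, 7592, 102, 0, 0, 0]
```

Padding alone does not shorten anything, which is why truncation has to happen first.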

#### Setting up an array with the binary target variable values
* 0 = FAKE
* 1 = REAL

In [0]:
target_variable = news_data['target'].values

#### Creating dictionaries that map each token to its index and each index back to its token.

In [0]:
all_words = []
for l in tokenized_df:
  all_words.extend(l)
all_indices = []
for i in indexed_tokens:
  all_indices.extend(i)

word_to_ix = dict(zip(all_words, all_indices))
ix_to_word = dict(zip(all_indices, all_words))
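As a small illustration of how `dict(zip(...))` behaves on repeated tokens, the sketch below uses a made-up token list with illustrative ids. Repeated keys simply overwrite each other, which is harmless here because a given token always maps to the same id.

```python
tokens = ["[CLS]", "the", "cat", "the", "[SEP]"]
ids = [101, 1103, 5855, 1103, 102]  # illustrative ids, not looked up in a real vocabulary

word_to_ix = dict(zip(tokens, ids))
ix_to_word = dict(zip(ids, tokens))

print(len(word_to_ix))                # 4: "the" appears twice but is stored once
print(ix_to_word[word_to_ix["cat"]])  # cat
```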

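#### BERT accepts an attention mask that marks which positions hold real tokens (1) and which are padding (0), so that the padded positions are ignored by the attention layers. We build that mask here from the padded index array.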

In [0]:
mask_variable = [[float(i>0) for i in ii] for ii in index_padded]
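The mask construction can be sketched on a toy batch. This assumes the padding id is 0 and that no real token shares that id; `attention_mask` is a hypothetical helper name.

```python
def attention_mask(padded_ids, pad_id=0):
    """Return 1.0 for real tokens and 0.0 for padding positions."""
    return [[float(tok != pad_id) for tok in row] for row in padded_ids]

batch = [[101, 7, 102, 0, 0],
         [101, 8, 9, 10, 102]]
mask = attention_mask(batch)
print(mask)  # [[1.0, 1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0, 1.0]]
```

The notebook's `i>0` test gives the same result, since token ids are never negative.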

#### This splits the data into training and test sets and wraps each in a PyTorch DataLoader, which serves the data to the model in batches.

In [0]:
BATCH_SIZE = 14
def format_tensors(text_data, mask, labels, batch_size):
    X = torch.from_numpy(text_data)
    X = X.long()
    mask = torch.tensor(mask)
    y = torch.from_numpy(labels)
    y = y.long()
    tensordata = data_utils.TensorDataset(X, mask, y)
    loader = data_utils.DataLoader(tensordata, batch_size=batch_size, shuffle=False)
    return loader

X_train, X_test, y_train, y_test = train_test_split(index_padded, target_variable, 
                                                    test_size=0.1, random_state=42)

train_masks, test_masks, _, _ = train_test_split(mask_variable, index_padded, 
                                                       test_size=0.1, random_state=42)

trainloader = format_tensors(X_train, train_masks, y_train,BATCH_SIZE)
testloader = format_tensors(X_test, test_masks, y_test, BATCH_SIZE)
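The two `train_test_split` calls above stay aligned only because they share the same `test_size` and `random_state`, so both draw the same row permutation (`train_test_split` also accepts more than two arrays in a single call, which avoids relying on this). The idea of one shared shuffle can be sketched with the standard library; `split_aligned` is a hypothetical helper, not the scikit-learn implementation.

```python
import math
import random

def split_aligned(arrays, test_fraction=0.1, seed=42):
    """Split several equal-length sequences using one shared shuffle."""
    n = len(arrays[0])
    assert all(len(a) == n for a in arrays), "sequences must have equal length"
    order = list(range(n))
    random.Random(seed).shuffle(order)
    n_test = math.ceil(n * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]
    # Every sequence is indexed with the same row order, so rows stay matched.
    return [([a[i] for i in train_idx], [a[i] for i in test_idx]) for a in arrays]

ids = [[101, i, 102] for i in range(20)]       # toy token ids
masks = [[1.0, 1.0, 1.0] for _ in range(20)]   # toy attention masks
labels = [i % 2 for i in range(20)]            # toy labels tied to the ids

(ids_tr, ids_te), (mask_tr, mask_te), (y_tr, y_te) = split_aligned(
    [ids, masks, labels], test_fraction=0.25)
print(len(ids_tr), len(ids_te))  # 15 5
```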

#### This is a sample batch from the trainloader. The first tensor contains the token IDs for each article (the embeddings are looked up inside the model), the second tensor contains the attention mask, and the third tensor contains the target value for each article.

In [16]:
next(iter(trainloader))

[tensor([[  101,  5096, 13053,  ...,  2400,   119,   102],
         [  101,  1118,  5728,  ...,  1142,  3507,   102],
         [  101, 11255,   170,  ...,     0,     0,     0],
         ...,
         [  101,  1109,  2383,  ...,  1103,  5637,   102],
         [  101, 18653, 11922,  ...,  1343,  1107,   102],
         [  101,   107,   146,  ...,     0,     0,     0]]),
 tensor([[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 0., 0., 0.],
         ...,
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 0., 0., 0.]]),
 tensor([1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1])]


### Now it's time to create the BERT Model!

#### The BERT model architecture is shown below. This is a BERT base-cased model, which means it has 12 transformer layers, a hidden size of 768, 12 attention heads, roughly 110M parameters, and is pre-trained on cased English text.


In [17]:
model = BertForSequenceClassification.from_pretrained('bert-base-cased')
model

100%|██████████| 313/313 [00:00<00:00, 216101.59B/s]
100%|██████████| 435779157/435779157 [00:07<00:00, 57478490.93B/s]


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element
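The parameter count can be roughly checked from the layer shapes in the printout. The sketch below is an estimate under stated assumptions: the printout is cut off before the feed-forward block and the pooler, so the feed-forward size of 3072 and the 768-to-768 pooler are taken from the standard BERT base configuration rather than from the output above, and the helper names are hypothetical.

```python
def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def layer_norm(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n

def bert_classifier_params(vocab=28996, hidden=768, max_pos=512, types=2,
                           layers=12, ffn=3072, num_labels=2):
    embeddings = (vocab + max_pos + types) * hidden + layer_norm(hidden)
    # query, key, value and attention output projections, then LayerNorm
    attention = 4 * linear(hidden, hidden) + layer_norm(hidden)
    # two feed-forward projections (ffn size is an assumption), then LayerNorm
    feed_forward = linear(hidden, ffn) + linear(ffn, hidden) + layer_norm(hidden)
    encoder = layers * (attention + feed_forward)
    pooler = linear(hidden, hidden)
    classifier = linear(hidden, num_labels)
    return embeddings + encoder + pooler + classifier

total = bert_classifier_params()
print(f"{total:,}")  # 108,311,810
```

Under these assumptions the total is about 108M, in line with the rounded 110M figure usually quoted for BERT base.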

#### Creating a function to compute the accuracy after each epoch

In [0]:
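def compute_accuracy(model, dataloader, device):
    """Return the percentage of correctly classified examples in dataloader."""
    tqdm()  # creates an empty progress bar (the '0it' line in the output); not required
    model.eval()  # disable dropout during evaluation
    correct_preds, num_samples = 0,0
    with torch.no_grad():
        for i, batch in enumerate(tqdm(dataloader)):
            token_ids, masks, labels = tuple(t.to(device) for t in batch)
            # Index the outputs instead of unpacking them, so this works whether the
            # model returns a tuple or an output object: [0] is the loss, [1] the logits.
            outputs = model(input_ids=token_ids, attention_mask=masks, labels=labels)
            yhat = outputs[1]
            # Predict REAL (1) when the sigmoid of the REAL logit exceeds 0.5
            prediction = (torch.sigmoid(yhat[:,1]) > 0.5).long()
            num_samples += labels.size(0)
            correct_preds += (prediction==labels.long()).sum()
            del token_ids, masks, labels #memory
        torch.cuda.empty_cache() #memory
        gc.collect() # memory
        return correct_preds.float()/num_samples*100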
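One detail of the prediction rule is worth spelling out. The classifier head outputs two logits per article, and thresholding the sigmoid of the second logit at 0.5 is the same as asking whether that logit is positive, which is not always the same as asking which logit is larger. The sketch below compares the two rules on hand-picked logits; the function names and values are illustrative only, and the notebook's output does not show how often (if ever) the two rules disagree for the fine-tuned model.

```python
import math

def sigmoid_rule(logits):
    """Predict 1 when sigmoid(logit for class 1) > 0.5, i.e. when that logit is > 0."""
    return int(1.0 / (1.0 + math.exp(-logits[1])) > 0.5)

def argmax_rule(logits):
    """Predict the class with the larger logit."""
    return int(logits[1] > logits[0])

for logits in ([-1.0, 2.0], [2.0, 1.0], [-3.0, -1.0]):
    print(logits, sigmoid_rule(logits), argmax_rule(logits))
# [-1.0, 2.0]  -> both rules predict 1
# [2.0, 1.0]   -> sigmoid rule predicts 1, argmax predicts 0
# [-3.0, -1.0] -> sigmoid rule predicts 0, argmax predicts 1
```

The rules agree whenever the two logits have opposite signs, so comparing both on the test set would be a cheap sanity check.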

#### Now we iterate through the training set, updating the model weights after each batch. Since BERT is pre-trained, we keep the learning rate low and train for only a few epochs, which helps limit overfitting.

In [19]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
torch.cuda.empty_cache() #memory
gc.collect() #memory
NUM_EPOCHS = 3
# Not used below: the model returns its own loss when labels are passed in
loss_function = nn.BCEWithLogitsLoss()
losses = []
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=3e-6)
for epoch in range(NUM_EPOCHS):
    model.train()  # enable dropout for training
    running_loss = 0.0
    iteration = 0
    for i, batch in enumerate(trainloader):
        iteration += 1
        token_ids, masks, labels = tuple(t.to(device) for t in batch)
        optimizer.zero_grad()
        # Index the outputs instead of unpacking them, so this works whether the
        # model returns a tuple or an output object: [0] is the loss, [1] the logits.
        outputs = model(input_ids=token_ids, attention_mask=masks, labels=labels)
        loss, yhat = outputs[0], outputs[1]
        loss.backward()
        optimizer.step()
        running_loss += float(loss.item())
        del token_ids, masks, labels #memory
    
        # Every 25 batches, report the average loss since the last report
        if not i%25:
            print(f'Epoch: {epoch+1:03d}/{NUM_EPOCHS:03d} | '
                  f'Batch {i+1:03d}/{len(trainloader):03d} | '
                  f'Average Loss in last {iteration} iteration(s): {(running_loss/iteration):.4f}')
            running_loss = 0.0
            iteration = 0
        torch.cuda.empty_cache() #memory
        gc.collect() #memory
        losses.append(float(loss.item()))
    # Report accuracy on the training set at the end of each epoch
    with torch.set_grad_enabled(False):
        print(f'\nTraining Accuracy: '
              f'{compute_accuracy(model, trainloader, device):.2f}%')
        


Epoch: 001/003 | Batch 001/406 | Average Loss in last 1 iteration(s): 0.7502
Epoch: 001/003 | Batch 026/406 | Average Loss in last 25 iteration(s): 0.7033
Epoch: 001/003 | Batch 051/406 | Average Loss in last 25 iteration(s): 0.6507
Epoch: 001/003 | Batch 076/406 | Average Loss in last 25 iteration(s): 0.5867
Epoch: 001/003 | Batch 101/406 | Average Loss in last 25 iteration(s): 0.5037
Epoch: 001/003 | Batch 126/406 | Average Loss in last 25 iteration(s): 0.4174
Epoch: 001/003 | Batch 151/406 | Average Loss in last 25 iteration(s): 0.3131
Epoch: 001/003 | Batch 176/406 | Average Loss in last 25 iteration(s): 0.2610
Epoch: 001/003 | Batch 201/406 | Average Loss in last 25 iteration(s): 0.2251
Epoch: 001/003 | Batch 226/406 | Average Loss in last 25 iteration(s): 0.2310
Epoch: 001/003 | Batch 251/406 | Average Loss in last 25 iteration(s): 0.2245
Epoch: 001/003 | Batch 276/406 | Average Loss in last 25 iteration(s): 0.2011
Epoch: 001/003 | Batch 301/406 | Average Loss in last 25 iteratio

0it [00:00, ?it/s]
100%|██████████| 406/406 [01:41<00:00,  4.01it/s]



Training Accuracy: 96.44%
Epoch: 002/003 | Batch 001/406 | Average Loss in last 1 iteration(s): 0.0357
Epoch: 002/003 | Batch 026/406 | Average Loss in last 25 iteration(s): 0.1523
Epoch: 002/003 | Batch 051/406 | Average Loss in last 25 iteration(s): 0.1244
Epoch: 002/003 | Batch 076/406 | Average Loss in last 25 iteration(s): 0.1014
Epoch: 002/003 | Batch 101/406 | Average Loss in last 25 iteration(s): 0.1265
Epoch: 002/003 | Batch 126/406 | Average Loss in last 25 iteration(s): 0.1290
Epoch: 002/003 | Batch 151/406 | Average Loss in last 25 iteration(s): 0.0796
Epoch: 002/003 | Batch 176/406 | Average Loss in last 25 iteration(s): 0.1167
Epoch: 002/003 | Batch 201/406 | Average Loss in last 25 iteration(s): 0.0826
Epoch: 002/003 | Batch 226/406 | Average Loss in last 25 iteration(s): 0.1022
Epoch: 002/003 | Batch 251/406 | Average Loss in last 25 iteration(s): 0.1468
Epoch: 002/003 | Batch 276/406 | Average Loss in last 25 iteration(s): 0.0828
Epoch: 002/003 | Batch 301/406 | Avera

0it [00:00, ?it/s]
100%|██████████| 406/406 [01:41<00:00,  4.02it/s]



Training Accuracy: 98.03%
Epoch: 003/003 | Batch 001/406 | Average Loss in last 1 iteration(s): 0.0147
Epoch: 003/003 | Batch 026/406 | Average Loss in last 25 iteration(s): 0.0896
Epoch: 003/003 | Batch 051/406 | Average Loss in last 25 iteration(s): 0.0574
Epoch: 003/003 | Batch 076/406 | Average Loss in last 25 iteration(s): 0.0628
Epoch: 003/003 | Batch 101/406 | Average Loss in last 25 iteration(s): 0.0736
Epoch: 003/003 | Batch 126/406 | Average Loss in last 25 iteration(s): 0.0842
Epoch: 003/003 | Batch 151/406 | Average Loss in last 25 iteration(s): 0.0581
Epoch: 003/003 | Batch 176/406 | Average Loss in last 25 iteration(s): 0.0861
Epoch: 003/003 | Batch 201/406 | Average Loss in last 25 iteration(s): 0.0550
Epoch: 003/003 | Batch 226/406 | Average Loss in last 25 iteration(s): 0.0692
Epoch: 003/003 | Batch 251/406 | Average Loss in last 25 iteration(s): 0.1308
Epoch: 003/003 | Batch 276/406 | Average Loss in last 25 iteration(s): 0.0439
Epoch: 003/003 | Batch 301/406 | Avera

0it [00:00, ?it/s]
100%|██████████| 406/406 [01:41<00:00,  4.02it/s]



Training Accuracy: 98.79%
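The loss values logged above are returned by the model itself: when `labels` is passed to `BertForSequenceClassification` with two labels, it computes a cross-entropy loss over the two logits. A minimal sketch of that formula for a single example is given below; `cross_entropy` is a hypothetical helper written for illustration, not the PyTorch implementation, and the logits are made up.

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax of the correct class, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

chance = cross_entropy([0.0, 0.0], 1)      # equal logits: ln 2
confident = cross_entropy([-2.0, 3.0], 1)  # correct and confident
wrong = cross_entropy([-2.0, 3.0], 0)      # confident but wrong

print(round(chance, 4), round(confident, 4), round(wrong, 4))  # 0.6931 0.0067 5.0067
```

The first logged loss of 0.7502 sits close to ln 2 ≈ 0.693, which is what an untrained two-class head would be expected to produce, and the later values move toward the confident-and-correct regime.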


#### Finally, we score the final model on the test set

In [20]:
with torch.set_grad_enabled(False):
  print(f'\n\nTest Accuracy:'
  f'{compute_accuracy(model, testloader, device):.2f}%')

0it [00:00, ?it/s]
100%|██████████| 46/46 [00:11<00:00,  4.01it/s]




Test Accuracy:96.36%


#### We then do some error analysis by gathering the articles that were incorrectly predicted and analyzing the text of the articles.

In [21]:
test_predictions = torch.zeros((len(y_test),1))
test_predictions_percent = torch.zeros((len(y_test),1))
with torch.no_grad():
  for i, batch in enumerate(tqdm(testloader)):
    token_ids, masks, labels = tuple(t.to(device) for t in batch)
    # Index the outputs instead of unpacking them: [0] is the loss, [1] the logits
    outputs = model(input_ids=token_ids, attention_mask=masks, labels=labels)
    yhat = outputs[1]
    # Score for the REAL class, thresholded at 0.5 as in compute_accuracy
    prob_real = torch.sigmoid(yhat[:,1]).view(-1,1)
    prediction = (prob_real > 0.5).long()
    # Store each batch in the rows that match its position in the test set
    test_predictions[i*BATCH_SIZE:(i+1)*BATCH_SIZE] = prediction
    test_predictions_percent[i*BATCH_SIZE:(i+1)*BATCH_SIZE] = prob_real

100%|██████████| 46/46 [00:11<00:00,  4.04it/s]
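The slice arithmetic used to store each batch can be sketched on its own. Here 632 is an assumed test-set size, chosen only because it is consistent with the 46 batches of 14 shown above; `batch_slices` is a hypothetical helper.

```python
def batch_slices(n_items, batch_size):
    """Return (start, stop) index pairs covering n_items in order."""
    return [(s, min(s + batch_size, n_items)) for s in range(0, n_items, batch_size)]

slices = batch_slices(632, 14)
print(len(slices), slices[0], slices[-1])  # 46 (0, 14) (630, 632)
```

The last batch is shorter than the others, and storing results by position only matches the original rows because the test loader is created with `shuffle=False`.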


In [0]:
X_train_words, X_test_words, y_train_words, y_test_words = train_test_split(news_data['text'], target_variable, 
                                                    test_size=0.1, random_state=42)

In [0]:
final_results = X_test_words.to_frame().reset_index(drop=True)
final_results['predicted'] = np.array(test_predictions.reshape(-1), dtype=int).tolist()
final_results['percent'] = np.array(test_predictions_percent.reshape(-1), dtype=float).tolist()
final_results['actual'] = y_test_words
wrong_results = final_results.loc[final_results['predicted']!=final_results['actual']].copy()


In [24]:
print('Number of incorrectly classified articles:', len(wrong_results))

Number of incorrectly classified articles: 23


#### This displays the incorrectly predicted instances, along with the model's score for each one, shown in the `percent` column as a fraction between 0 and 1. The classification threshold is 0.5 (50%): scores closer to 1 mean the model is more confident the article is real news, and scores closer to 0 mean it is more confident the article is fake news.

In [26]:
wrong_results.loc[:,'text_short'] = wrong_results.loc[:,'text'].apply(lambda x: x[:500])
wrong_results.loc[:,('text_short', 'percent','predicted','actual')].style.set_properties(subset=['text_short'], **{'width': '1000px', 'white-space':'pre-wrap'})

Unnamed: 0,text_short,percent,predicted,actual
78,"Imagine if, during the Jim Crow era, a newspaper offered advertisers the option of placing ads only in copies that went to white readers. That’s basically what Facebook is doing nowadays. The ubiquitous social network not only allows advertisers to target users by their interests or background, it also gives advertisers the ability to exclude specific groups it calls “Ethnic Affinities.” Ads that exclude people based on race, gender and other sensitive factors are prohibited by federal law in",0.886369,1,0
80,"USA Today WASHINGTON — The Army acknowledged Friday that Maj. Gen. John Rossi committed suicide on July 31, making him the highest-ranking soldier ever to have taken his own life. Rossi, who was 55, was just two days from pinning on his third star and taking command of Army Space and Missile Command when he killed himself at his home at Redstone Arsenal in Alabama. ‘ Investigators could find no event, infidelity, misconduct or drug or alcohol abuse, that triggered Rossi’s suicide, said a U.S.",0.888966,1,0
117,"Unprecedented Surge In Election Fraud Incidents From Around The Country Zero Hedge Mounting evidence would suggest it's getting more and more difficult for the left to claim that there are ""no signs"" of fraud in the 2016 election cycle...though we're sure they will continue to try. Just this morning the Miami Herald noted that two arrests were made in Miami-Dade county on election fraud charges including efforts by one woman to illegally register voters (some of whom were dead...a recurring t",0.526852,1,0
125,"Maggie Hassan, left and Kelly Ayotte Hassan declares victory in U.S. Senate race with Ayotte By PAUL FEELYNew Hampshire Union Leader Update, 11:00 a.m. Gov. Maggie Hassan declared she’s won New Hampshire's U.S. Senate race, unseating Republican Sen. Kelly Ayotte.During a hastily-called press conference outside the State House, Hassan said she’s ahead now by enough votes to survive returns from the few outstanding towns that are left.“I am proud to stand here as the next United States senator fro",0.799829,1,0
162,"From the day we are born into this world, we are being taught what our parents have been taught, and what their parents have taught them, without asking many questions such as who we are, why we are here, and why things are the way they are. Existential questions are simply perceived as irrelevant in a left-brained society; in which money and career performance seem to be the primary focus. For those who seek a reason, countless financed religious institutions claim to provide the ultimate ans",0.566883,1,0
175,"(30 fans) - Advertisement - This article originally appeared at TomDispatch.com . To receive TomDispatch in your inbox three times a week, click here . Donald Trump has long campaigned on the promise of running the country the way he's run his businesses. On that basis, we essentially already know what it would mean if he entered the Oval Office and applied his personal business acumen to this nation (and the rest of the world). There's a surprisingly full record to cite. Who can forget, for i",0.628271,1,0
200,"Fox News reported : Five police officers and yellow caution tape surround Donald Trump’s Hollywood Walk of Fame star — or what’s left of it. The Los Angeles police say they are investigating the smashing of Trump’s star following footage that showed the sidewalk tribute was destroyed with a pickax. Det. Meghan Aguilar says investigators were called to the scene before dawn Wednesday. By mid-morning, an LAPD spokesperson at the scene told FOX411 the Chamber of Commerce was sending out a crew",0.661488,1,0
210,"especially the conservatives. It’s Independents like Sanders who will fight for our rights…people who are not bought by the power elite.""",0.855685,1,0
242,"On The Streets Of Baltimore, Trying To Understand The Anger In the early morning, as the cold set in, Anaya Maze stood next to the charred remains of a CVS store. Holding a sign, she was the only protester left in front of a line of police officers dressed in riot gear. She is petite. Still, she faced the police officers, looking at them intently. A few steps away were the charred skeletons of two police vehicles, the victims of an unbridled anger that burned its way through the west side of",0.418241,0,1
287,"With Hillary Clinton making history this election season by becoming the first women nominated by a major party, the sexism has been on full display. The misogynists on the Right have questioned her health, her stamina, and everything in between. That is, of course, code for “the little woman doesn’t belong in the Oval Office.” However, one Texas Republican has taken the misogyny to a whole other level. Meet Sid Miller, the Agriculture Commissioner for the state of Texas. He is also a former",0.787238,1,0
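A quick way to summarize such a table is to count the errors by direction. The sketch below hard-codes the `(predicted, actual)` pairs of the ten rows visible above, so it covers only those rows and not all 23 misclassified articles; the variable names are illustrative.

```python
from collections import Counter

# (predicted, actual) pairs copied from the ten rows shown above
shown = [(1, 0), (1, 0), (1, 0), (1, 0), (1, 0),
         (1, 0), (1, 0), (1, 0), (0, 1), (1, 0)]
names = {0: "FAKE", 1: "REAL"}

counts = Counter(f"{names[actual]} predicted as {names[pred]}" for pred, actual in shown)
for kind, n in counts.most_common():
    print(kind, n)
# FAKE predicted as REAL 9
# REAL predicted as FAKE 1
```

In the visible rows the errors are mostly fake articles scored as real, which suggests looking at the full set of 23 the same way.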
