<a href="https://colab.research.google.com/github/TiredEspressoBean/FakeNewsDetector-AI/blob/main/FakeNews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [17]:
!pip install transformers
!pip install transformers[torch] accelerate



# Problems and Goals

**Problem 1**: Misinformation.

With the proliferation of misinformation, that is, falsehoods presented as facts, I want a stronger understanding of how misinformation works on a larger, systemic level.



**Problem 2**: Lack of personal knowledge about AI.

While I understand the principles behind the different systems we call AI, I wanted a more tangible understanding of how they operate, gained through a little hands-on practice with such systems.



**Goal**: With both of these problems at hand, why not go ahead and build an AI model that tries to detect fake news? This lets me explore the kind of AI system that has become popular in the last few years, namely systems that evaluate language. It should also show me, through the data, how a machine could detect misinformation from the text alone, without fact checking, which is a more intensive process.

# Tools Used

Torch: Machine learning framework commonly used with Python for building and training models.

Transformers: Library providing the API for our model.

BERT: Bidirectional Encoder Representations from Transformers, a language model developed by Google specifically for Natural Language Processing.

Pandas and numpy are widely used libraries for tabular data and numerical computing, and random is Python's built-in random number module. Scikit-learn supplies the train/validation split and the accuracy metric.

In [18]:
import torch
from transformers.file_utils import is_tf_available, is_torch_available, is_torch_tpu_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import pandas as pd
import numpy
import random


# What is BERT?

BERT, the Bidirectional Encoder Representations from Transformers, is a language model based on the transformer architecture, a deep learning architecture designed to handle sequential data. Now what does that mean?

Rather than going through all the details of what these things are, let's consider how it operates.

A language model is a system by which a machine may understand and generate human-like language. It does so by learning statistics and relationships, using the text itself as its data points.

Transformers are a way to efficiently process sequential data by allowing calculations to be made in parallel across an entire sequence. So, for example, given a sentence such as "Don't look up.", a transformer can calculate probabilities for each word at the same time.

So with that in mind, let us look at the name BERT:

Bidirectional - Statistical and relational information is interpreted both backwards and forwards, so each word is read in the context of the words before and after it.

Encoder Representations - The data is then encoded as a representation. So for "Don't look up" the actual words themselves aren't used, but rather a representation of them based on their relationships.

Transformers - Once we have these pieces of data representing the words and their context, a transformer uses this information to weigh the likelihoods of these pieces of data.
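
To make the "weighing" idea a little more concrete, here is a minimal sketch of scaled dot-product self-attention, the operation at the heart of a transformer. It is a toy, not BERT itself: the two-dimensional vectors are made up, and the learned projection matrices a real model uses are left out, so each word's vector serves as query, key, and value at once.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Score this word against every word in the sentence, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # The new representation is a weighted mix of all the word vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
    return outputs

# Hypothetical vectors standing in for "don't", "look", "up".
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(toy_vectors)
for word, vec in zip(["don't", "look", "up"], contextual):
    print(word, [round(x, 3) for x in vec])
```

Each output vector depends on every word in the sentence, both earlier and later ones, which is the "bidirectional" part of the name.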

In [19]:
model_name = "bert-base-uncased"
max_length = 512
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)

model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [20]:

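def set_seed(seed: int):
    """Seed every random number generator in use so runs are reproducible."""
    random.seed(seed)
    numpy.random.seed(seed)
    # Seed PyTorch on the CPU and on all GPUs (silently ignored without CUDA).
    if is_torch_available():
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)

    # Seed TensorFlow only if it happens to be installed.
    if is_tf_available():
        import tensorflow as tf
        tf.random.set_seed(seed)


set_seed(1)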
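
Why bother with a seed? A small sketch using only Python's built-in `random` module shows the effect: seeding with the same value replays the same "random" sequence. The helper name `sample_with_seed` is made up for this illustration.

```python
import random

def sample_with_seed(seed, count=5):
    """Seed Python's generator, then draw `count` pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(count)]

first_run = sample_with_seed(1)
second_run = sample_with_seed(1)
other_seed = sample_with_seed(2)

# The first two runs share a seed, so they produce identical lists.
print(first_run == second_run)
print(first_run, other_seed)
```

The same idea applies to numpy and torch: with every generator seeded, the train/validation split and the newly initialized classifier weights should come out the same from run to run.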

# Sanitizing the Data

Here we take in the data and remove rows where certain pieces are missing. Specifically, we drop rows that have no text, no author, or no title, since we use all three to train the language model.

This data is provided by Kaggle from their 2018 fake news detection contest. If you wish to run this yourself, go to https://kaggle.com/competitions/fake-news/overview, sign up if you don't already have an account, and download their data sets. Then, in the content folder, add a new folder called 'News Data' and place the news.csv and test.csv files in it. If you wish to use your own data or another data set, you will need to change the sanitizing and tokenizing process.

If you are using a different dataset than the one provided by Kaggle, you'll need to apply the .notna() function to each column you wish to include, the same way the `text`, `author`, and `title` columns are filtered in this notebook.
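
For a different dataset, one way to avoid repeating a `.notna()` filter per column is pandas' `dropna` with a `subset` list. The sketch below assumes a made-up frame with `headline` and `body` columns, and the helper name `drop_incomplete_rows` is hypothetical; swap in your own column names.

```python
import pandas as pd

def drop_incomplete_rows(df, required_columns):
    """Keep only the rows that have a value in every required column."""
    return df.dropna(subset=required_columns)

# Hypothetical three-row frame; only the first row is complete.
toy_df = pd.DataFrame({
    "headline": ["Storm hits coast", None, "Markets rally"],
    "body": ["Heavy rain fell overnight.", "Text without a headline.", None],
    "label": [0, 1, 0],
})

clean_df = drop_incomplete_rows(toy_df, ["headline", "body"])
print(len(toy_df), len(clean_df))
```

This should give the same rows as chaining one `.notna()` filter per column.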

In [21]:
# skip malformed rows (pandas >= 1.3; replaces the deprecated error_bad_lines=False)
news_d = pd.read_csv("/content/News Data/news.csv", on_bad_lines="skip")


In [22]:
news_df = news_d[news_d['text'].notna()]
news_df = news_df[news_df["author"].notna()]
news_df = news_df[news_df["title"].notna()]

news_df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


# Tokenize our Data

Now that we have removed the rows with missing information and therefore have a 'more complete' dataset, we can go ahead and turn this data into tokens.

Tokens are smaller pieces of data that can be encoded more easily. The `prepare_data` function takes the dataset and collects each row's text along with its label (0 or 1). If the title and author are included, as they are by default, the function adds them to the front of the text in `[author] : [title] - [text]` format for the trainer to process. It then splits the texts and labels into training and validation sets with the imported `train_test_split` function. Afterwards the tokenizer turns these texts into the tokens that will be processed during the training stage.

If you are trying to use your own data, this process will also need to change based on which columns you use to train the model. Generally speaking, the goal is for each piece of text to be added to `texts` and its label to be added to `labels`.
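
To give a feel for what "turning text into tokens" means, here is a toy sketch of the greedy longest-match splitting that WordPiece-style tokenizers (the family BERT's tokenizer belongs to) apply to a single word. The six-entry vocabulary is invented for the example; the real `bert-base-uncased` vocabulary is far larger, and the real tokenizer also handles lowercasing, punctuation, and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # marks a piece that continues a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical miniature vocabulary.
toy_vocab = {"mis", "##in", "##form", "##ation", "news", "fake"}
tokens = wordpiece_tokenize("misinformation", toy_vocab)
print(tokens)
```

Because rare words are broken into familiar pieces, the model can still say something about a word it has never seen as a whole.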

In [23]:


def prepare_data(df, test_size=0.2, include_title=True, include_author=True):
  texts = []
  labels = []
  for i in range(len(df)):
    text = df["text"].iloc[i]
    label = df["label"].iloc[i]
    if include_title:
      text = df["title"].iloc[i] + " - " + text
    if include_author:
      text = df["author"].iloc[i] + " : " + text
    if text and label in [0, 1]:
      texts.append(text)
      labels.append(label)
  return train_test_split(texts, labels, test_size=test_size)

train_texts, valid_texts, train_labels, valid_labels = prepare_data(news_df)

print(len(train_texts), len(train_labels))
print(len(valid_texts), len(valid_labels))


train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)

14628 14628
3657 3657


# News Object

For the trainer to be able to train with this new data, we need to give it a way to access the data. So we make a dataset object that the trainer can index into, one row at a time.

In [24]:
class NewsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {}
        for k, v in self.encodings.items():
            item[k] = torch.tensor(v[idx])
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)


train_dataset = NewsDataset(train_encodings, train_labels)
valid_dataset = NewsDataset(valid_encodings, valid_labels)

In [25]:
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Metrics Computation

During the training process we need to be able to analyze how accurate the model is. We do so by taking the predictions it makes on the validation set and comparing them to the known labels, which gives the fraction of predictions that are correct. The trainer reports this accuracy at every evaluation step so we can watch how training is progressing. Note that the weight updates themselves are driven by the training loss, not by this metric.
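
As a small worked example of that comparison, the sketch below picks the highest-scoring class for each row (an argmax) and counts how often it matches the known label. The scores and labels are made up, and the helper names are only for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy(logits, labels):
    """Share of rows whose highest-scoring class equals the known label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical scores for four articles, one [class 0, class 1] pair each.
toy_logits = [[2.1, -1.0], [0.3, 0.9], [-0.5, 1.7], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]
print(accuracy(toy_logits, toy_labels))
```

Here the third article is the only miss, so three of the four predictions are correct.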

In [26]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    return {'accuracy': accuracy_score(labels, preds)}

# Trainer

Speaking of the trainer, it's time to make the trainer itself. We establish the parameters of the training, like where it should save its logs and results, the batch sizes, and the number of epochs.

The number of training epochs is the number of times that the entire data set is passed through during training. So one epoch is one entire pass through the data.

The batch size is how many samples the trainer processes at once. Note that this isn't the number of tokens processed at once but the number of rows the trainer looks at.
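
Epochs and batch size together decide how many optimizer steps a run takes. As a rough sketch using this notebook's numbers (14628 training rows, a training batch size of 10, one epoch, and a 200-step logging interval), and assuming a single device with no gradient accumulation:

```python
import math

train_rows = 14628   # training rows printed after prepare_data
batch_size = 10      # per_device_train_batch_size
epochs = 1           # num_train_epochs
eval_every = 200     # logging_steps / save_steps

# Assumption: one device and no gradient accumulation.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
eval_points = list(range(eval_every, total_steps + 1, eval_every))

print(steps_per_epoch, total_steps)
print(eval_points)
```

Under those assumptions the run takes 1463 steps, with evaluation checkpoints every 200 steps up to step 1400.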

In [27]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=20,
    warmup_steps=100,
    logging_dir='./logs',
    load_best_model_at_end=True,
    logging_steps=200,
    save_steps=200,
    evaluation_strategy="steps",
)

trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = valid_dataset,
    compute_metrics = compute_metrics
)

# Run training

With all this completed we can finally start the training process and afterwards see how well the trainer did in making the language model.

In [28]:
trainer.train()

trainer.evaluate()

Step,Training Loss,Validation Loss,Accuracy
200,0.2662,0.033716,0.994531
400,0.0323,0.015552,0.997266
600,0.017,0.011894,0.998359
800,0.0116,0.009964,0.998359
1000,0.0104,0.011452,0.998086
1200,0.0034,0.00774,0.998633
1400,0.0138,0.006131,0.998633


{'eval_loss': 0.006131425499916077,
 'eval_accuracy': 0.998632759092152,
 'eval_runtime': 33.4846,
 'eval_samples_per_second': 109.214,
 'eval_steps_per_second': 5.465,
 'epoch': 1.0}

# Save Model

With the model trained, we want to save the model and its tokenizer so we can use them to make predictions about new pieces of data.

In [29]:
model_path = "fake-news-bert-base-uncased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

('fake-news-bert-base-uncased/tokenizer_config.json',
 'fake-news-bert-base-uncased/special_tokens_map.json',
 'fake-news-bert-base-uncased/vocab.txt',
 'fake-news-bert-base-uncased/added_tokens.json',
 'fake-news-bert-base-uncased/tokenizer.json')

# Make a Prediction

As an example, this is a function that takes a piece of text, tokenizes it with the same tokenizer as before, and makes a prediction. The prediction is returned as either 'reliable' or 'fake' according to the model's output, or as the raw class index when `convert_to_label` is left off.

To really test the model, though, we'll need to take a larger dataset of unlabeled articles and generate a fake or reliable label for each of them, so the predictions can then be scored at Hugging Face.
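
The last step of a prediction, going from the model's raw scores to a label, can be sketched without the model at all. The logits below are made up, the helper name `label_from_logits` is hypothetical, and the label names simply mirror the mapping used in `get_prediction`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Same index-to-name mapping as get_prediction in this notebook.
LABEL_NAMES = {0: "fake", 1: "reliable"}

def label_from_logits(logits):
    """Return the most probable label name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABEL_NAMES[best], probs[best]

# Hypothetical logits for one article: [score for class 0, score for class 1].
name, confidence = label_from_logits([-1.2, 2.3])
print(name, round(confidence, 3))
```

Keeping the probability alongside the label can be handy if you later want to flag low-confidence predictions instead of trusting every one.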

In [30]:
def get_prediction(text, convert_to_label=False):
    # tokenize the text and move the tensors to the same device as the model
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to(model.device)
    # run inference without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)
    # turn the output logits into probabilities with softmax
    probs = outputs[0].softmax(1)
    # names for the two class indices; check these against the dataset's label definition
    d = {
        0: "fake",
        1: "reliable"
    }
    # argmax picks the most probable class
    if convert_to_label:
        return d[int(probs.argmax())]
    else:
        return int(probs.argmax())

In [31]:
real_news = """
Biden Administration Urges Justices to Hear Cases on Social Media Laws
The administration argued that the laws, enacted by Florida and Texas to prevent removal of posts amid conservative complaints about censorship by tech platforms, violated the First Amendment.
"""

get_prediction(real_news, convert_to_label=True)

'reliable'

In [33]:

test_df = pd.read_csv("/content/News Data/test.csv")

new_df = test_df.copy()

new_df["new_text"] = new_df["author"].astype(str) + " : " + new_df["title"].astype(str) + " - " + new_df["text"].astype(str)

new_df["label"] = new_df["new_text"].apply(get_prediction)

final_df = new_df[["id", "label"]]
final_df.to_csv("submit_final.csv", index=False)