### Main goals:
Clarify and reflect on the definition of the term "fake news", which varies among datasets and is sometimes not binary.\
Research where the data comes from and inspect it: what are the labels, sources, and authors?\
Is there a person, source, or topic that is over- or under-represented?\
Study the literature on how others approach this task and select your model architecture of choice: LSTM, ...\
Develop a classification model to predict fake news from the text. How do you judge the quality of your results, i.e. which metrics do you consider?
### Optional:
Inspect the misclassified examples. What can you learn from them?\
Investigate how the model learned to handle the edge cases you found during data inspection.\
Experiment with ways to mitigate poor coverage of edge cases.


In [1]:
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    pipeline,
)
from datasets import load_dataset
import numpy as np
import evaluate
from sklearn.metrics import classification_report
from bertviz import model_view, head_view
import shap

In [2]:
model_name = "google-bert/bert-base-uncased"

In [3]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [4]:
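def tokenize_function(examples):
    # Pad or truncate every text to the tokenizer's maximum length
    # (512 tokens for bert-base-uncased), so longer articles are cut off.
    return tokenizer(examples["text"], padding="max_length", truncation=True)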

In [5]:
data = load_dataset('GonzaloA/fake_news')

Repo card metadata block was not found. Setting CardData to empty.


In [6]:
data = data.remove_columns(['Unnamed: 0','title'])

In [7]:
tokenized_data = data.map(tokenize_function, batched=True)

In [8]:
small_train_dataset = tokenized_data["train"].shuffle(seed=42).select(range(100))
small_eval_dataset = tokenized_data["validation"].shuffle(seed=42).select(range(100))
small_test_dataset = tokenized_data['test'].select(range(100))
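
With only 100 examples per split, a skewed label distribution can make accuracy misleading. A minimal sketch for checking the balance, assuming the labels are a plain sequence of integers such as `small_train_dataset["label"]`; the helper name `label_balance` is hypothetical and not part of the notebook:

```python
from collections import Counter


def label_balance(labels):
    """Return {label: fraction of examples} for a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in sorted(counts.items())}


# Toy labels for illustration only; in the notebook one could pass
# small_train_dataset["label"] instead.
example = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
print(label_balance(example))
```

The same check can be repeated on the validation and test subsets to see whether all three share a similar class ratio.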

In [10]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
# pipe = pipeline('text-classification',model=model_name, return_all_scores=True)

In [12]:
# explainer = shap.Explainer(pipe)

In [13]:
# shap_values = explainer([data['train']['text'][3]])

In [14]:
# inputs = tokenizer.encode(s, return_tensors='pt')
# outputs = model(inputs)
# attention = outputs[-1]  # Output includes attention weights when output_attentions=True
# tokens = tokenizer.convert_ids_to_tokens(inputs[0]) 

In [15]:
# head_view(attention, tokens)

In [16]:
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

In [17]:
metric = evaluate.load("accuracy")

In [18]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
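
Accuracy alone can hide poor performance on one class, which bears on the question of which metrics to consider. A sketch of precision, recall, and F1 for a single class using only NumPy; the function name `binary_prf` and the choice of label 1 as the positive class are assumptions, since the notebook does not state which label means "fake":

```python
import numpy as np


def binary_prf(predictions, references, positive=1):
    """Precision, recall and F1 for one class, computed from raw counts."""
    predictions = np.asarray(predictions)
    references = np.asarray(references)
    tp = int(np.sum((predictions == positive) & (references == positive)))
    fp = int(np.sum((predictions == positive) & (references != positive)))
    fn = int(np.sum((predictions != positive) & (references == positive)))
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy arrays for illustration only.
print(binary_prf([1, 1, 0, 0], [1, 0, 1, 0]))
```

The returned dictionary could be merged into the one that `compute_metrics` returns, so that these values are reported alongside accuracy at each evaluation.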

In [19]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)



In [20]:
trainer.train()


{'eval_loss': 0.30422553420066833, 'eval_accuracy': 0.9, 'eval_runtime': 16.2014, 'eval_samples_per_second': 6.172, 'eval_steps_per_second': 0.802, 'epoch': 1.0}



{'eval_loss': 0.3159090578556061, 'eval_accuracy': 0.89, 'eval_runtime': 15.103, 'eval_samples_per_second': 6.621, 'eval_steps_per_second': 0.861, 'epoch': 2.0}



{'eval_loss': 0.29985320568084717, 'eval_accuracy': 0.9, 'eval_runtime': 15.8418, 'eval_samples_per_second': 6.312, 'eval_steps_per_second': 0.821, 'epoch': 3.0}
{'train_runtime': 220.1631, 'train_samples_per_second': 1.363, 'train_steps_per_second': 0.177, 'train_loss': 0.2782232822516026, 'epoch': 3.0}


TrainOutput(global_step=39, training_loss=0.2782232822516026, metrics={'train_runtime': 220.1631, 'train_samples_per_second': 1.363, 'train_steps_per_second': 0.177, 'train_loss': 0.2782232822516026, 'epoch': 3.0})
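
The evaluation loss barely moves across the three epochs, and the accuracy alternates between 0.9 and 0.89, which on 100 validation examples is a difference of a single example. A small sketch for picking the epoch with the lowest evaluation loss from records shaped like the log lines above; the helper `best_epoch` is hypothetical, and the list is copied by hand from the printed output (rounded) rather than read from the trainer:

```python
def best_epoch(eval_logs, key="eval_loss"):
    """Return the log record with the smallest value for `key`."""
    candidates = [record for record in eval_logs if key in record]
    if not candidates:
        raise ValueError(f"no record contains {key!r}")
    return min(candidates, key=lambda record: record[key])


# Values rounded from the evaluation output printed above.
eval_logs = [
    {"epoch": 1.0, "eval_loss": 0.3042, "eval_accuracy": 0.90},
    {"epoch": 2.0, "eval_loss": 0.3159, "eval_accuracy": 0.89},
    {"epoch": 3.0, "eval_loss": 0.2999, "eval_accuracy": 0.90},
]
print(best_epoch(eval_logs)["epoch"])
```

In a live session, `trainer.state.log_history` should hold dictionaries of a similar shape, mixed with training records that lack the `eval_loss` key, which is why the helper filters on the key first.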

In [29]:
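# Load model and tokenizer from the local 'model2' directory;
# top_k=None returns the scores for all labels.
pipe = pipeline('text-classification', model='model2', top_k=None)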

In [30]:
explainer = shap.Explainer(pipe)

In [47]:
shap_values = explainer(["The population of Los Angeles county has been growing for 6 years straight"])


In [48]:
shap.plots.text(shap_values)

In [35]:
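# Keep the gold labels for the report, then drop them so that predict()
# returns raw predictions only.
test_labels = small_test_dataset['label']
small_test_dataset = small_test_dataset.remove_columns(['label', 'token_type_ids'])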

In [36]:
predictions = trainer.predict(small_test_dataset)


In [None]:
predicted_labels = predictions.predictions.argmax(axis=1)
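
Taking the `argmax` discards how confident the model was in each prediction. A sketch for turning raw logits into class probabilities with a numerically stable softmax; the function name and the toy logits are illustrative assumptions, not values from this run:

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Toy logits for two examples and two classes.
probs = softmax([[2.0, 0.0], [0.0, 0.0]])
print(probs)
```

Applied to `predictions.predictions`, this would give one probability per class and example, which helps to separate confident errors from borderline ones.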

In [None]:
predicted_labels

In [None]:
print(classification_report(test_labels, predicted_labels))
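
For the optional step of inspecting misclassified examples, a sketch that collects the positions where prediction and label disagree. `misclassified_indices` is a hypothetical helper, the toy lists stand in for `test_labels` and `predicted_labels`, and treating label 1 as the positive class is an assumption:

```python
def misclassified_indices(true_labels, predicted_labels):
    """Return (false_positives, false_negatives) index lists, with 1 as positive."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    false_pos, false_neg = [], []
    for i, (truth, pred) in enumerate(zip(true_labels, predicted_labels)):
        if pred == 1 and truth == 0:
            false_pos.append(i)
        elif pred == 0 and truth == 1:
            false_neg.append(i)
    return false_pos, false_neg


# Toy labels for illustration only.
fp, fn = misclassified_indices([0, 1, 1, 0], [1, 1, 0, 0])
print(fp, fn)
```

Because `small_test_dataset` was selected from the start of the test split without shuffling, an index `i` returned here should correspond to `data['test'][i]`, so the original text can be looked up directly.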

In [27]:
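# Save the fine-tuned weights next to the tokenizer so that
# pipeline(..., model='model2') can load both from one directory.
trainer.save_model('model2')
tokenizer.save_pretrained('model2')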

('model2/tokenizer_config.json',
 'model2/special_tokens_map.json',
 'model2/vocab.txt',
 'model2/added_tokens.json',
 'model2/tokenizer.json')