# Model Generalization

Cross-dataset validation

Here, a model fine-tuned on dataset A is evaluated on the test split of dataset B, with no further training.
- Dataset A: [GonzaloA/fake_news](https://huggingface.co/datasets/GonzaloA/fake_news)
- Dataset B: [Fake News Detection Challenge KDD 2020](https://huggingface.co/datasets/LittleFish-Coder/Fake-News-Detection-Challenge-KDD-2020)

In [1]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

Using cuda device


# Load Finetuned BERT Model and Tokenizer

Load the fine-tuned model and its tokenizer (trained on the fake-news-tfg dataset, i.e. dataset A). The cell below selects the RoBERTa variant.

In [2]:
# The following fine-tuned checkpoints are available (pick one):
# - bert-base-uncased-fake-news-tfg
# - distilbert-base-uncased-fake-news-tfg
# - roberta-base-fake-news-tfg

model_name = 'roberta-base-fake-news-tfg'

In [3]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(f"LittleFish-Coder/{model_name}")
model = AutoModelForSequenceClassification.from_pretrained(f"LittleFish-Coder/{model_name}").to(device)



# Load Different Dataset

Load a different dataset (Fake News Detection Challenge KDD 2020, dataset B) to test how well the model generalizes.

In [4]:
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub, reusing the local cache if it exists
dataset = load_dataset("LittleFish-Coder/Fake-News-Detection-Challenge-KDD-2020", download_mode="reuse_cache_if_exists", cache_dir="dataset")

Generating train split: 100%|██████████| 3490/3490 [00:00<00:00, 25101.42 examples/s]
Generating validation split: 100%|██████████| 997/997 [00:00<00:00, 23357.53 examples/s]
Generating test split: 100%|██████████| 499/499 [00:00<00:00, 21184.43 examples/s]


In [5]:
test_dataset = dataset['test']
print(test_dataset)

Dataset({
    features: ['text', 'label', '__index_level_0__'],
    num_rows: 499
})


## Predict via tokenizer & model

### Tokenize the text and get the class

In [6]:
tokenized_test = tokenizer(test_dataset['text'], padding=True, truncation=True, return_tensors="pt")

In [7]:
print(tokenized_test.keys())

dict_keys(['input_ids', 'attention_mask'])


In [8]:
input_ids_list = tokenized_test['input_ids'].tolist()
attention_mask_list = tokenized_test['attention_mask'].tolist()

In [9]:
predicted_labels = []

# iterate over the dataset
for input_ids, attention_mask in zip(input_ids_list, attention_mask_list):
    inputs = {
        'input_ids': torch.tensor([input_ids]).to(device),
        'attention_mask': torch.tensor([attention_mask]).to(device)
    }
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    predicted_class_id = logits.argmax().item()
    prediction = model.config.id2label[predicted_class_id]
    predicted_labels.append(prediction)
    # print(f"Output: {outputs}")
    # print(f"Logits: {logits}")
    # print(f"Prediction: {prediction}")
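The loop above runs one forward pass per example, after every example has been padded to the longest sequence in the split. A minimal sketch of a batched alternative follows. It assumes a batch size of 16 (an arbitrary choice), and the helper names `chunked` and `predict_in_batches` are hypothetical, not part of the original notebook. Padding is then applied per batch rather than across the whole split.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def predict_in_batches(texts, tokenizer, model, device, batch_size=16):
    """Return one predicted label name per text, using batched forward passes."""
    import torch  # imported here so `chunked` stays usable without torch

    model.eval()
    predictions = []
    for batch in chunked(texts, batch_size):
        # Pad only to the longest sequence in this batch.
        encoded = tokenizer(
            batch, padding=True, truncation=True, return_tensors="pt"
        ).to(device)
        with torch.no_grad():
            logits = model(**encoded).logits
        # argmax over the class dimension gives one class id per example.
        class_ids = logits.argmax(dim=-1).tolist()
        predictions.extend(model.config.id2label[i] for i in class_ids)
    return predictions
```

Under these assumptions, `predict_in_batches(test_dataset['text'], tokenizer, model, device)` would produce the same kind of list as `predicted_labels`.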

## Compare Real and Predicted Class

- In KDD2020,
    - 0: real
    - 1: fake

- In fake-news-tfg,
    - 0: fake
    - 1: real
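Because the two datasets assign opposite meanings to the integer ids, comparing raw ids would invert every result. One way to guard against this, sketched below, is to map both sides to label names before comparing. The dictionaries restate the mappings listed above. The names `KDD_ID2LABEL`, `TFG_ID2LABEL`, and `to_names`, and the toy id lists, are illustrative and not from the original notebook.

```python
KDD_ID2LABEL = {0: "real", 1: "fake"}  # KDD2020 convention
TFG_ID2LABEL = {0: "fake", 1: "real"}  # fake-news-tfg convention


def to_names(ids, id2label):
    """Translate integer class ids into label names."""
    return [id2label[i] for i in ids]


gold_ids = [0, 1, 1]  # hypothetical KDD2020 gold ids
pred_ids = [1, 0, 1]  # hypothetical class ids from a model trained on fake-news-tfg

gold = to_names(gold_ids, KDD_ID2LABEL)  # ['real', 'fake', 'fake']
pred = to_names(pred_ids, TFG_ID2LABEL)  # ['real', 'fake', 'real']

# Compare names, never raw ids: ids 0 and 1 match here even though they differ.
matches = [g == p for g, p in zip(gold, pred)]
print(matches)  # [True, True, False]
```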

In [10]:
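# KDD2020 label ids mapped to names (the reverse of the fake-news-tfg mapping)
kdd_id2label = {
    0: 'real',
    1: 'fake'
}

# Convert the gold ids to names so they are comparable with the model's predictions
kdd_labels = [kdd_id2label[label_id] for label_id in test_dataset['label']]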

In [11]:
# now we can calculate the accuracy
correct_predictions = 0

for real_label, predicted_label in zip(kdd_labels, predicted_labels):
    if real_label == predicted_label:
        correct_predictions += 1

accuracy = correct_predictions / len(kdd_labels)

In [12]:
# Results
print(f"Using model: {model_name}")
print(f"Train on: GonzaloA/fake_news")
print(f"Test on: Fake-News-Detection-Challenge-KDD-2020")
print(f"Accuracy: {accuracy}")

Using model: roberta-base-fake-news-tfg
Train on: GonzaloA/fake_news
Test on: Fake-News-Detection-Challenge-KDD-2020
Accuracy: 0.503006012024048
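
A single accuracy figure does not show which class the errors fall on, and its interpretation depends on the class balance of the test split, which is not shown here. As a rough sketch, per-class counts and a majority-class baseline could be computed from `kdd_labels` and `predicted_labels` along the lines below. The helper names and the toy label lists are hypothetical and only illustrate the calculation.

```python
from collections import Counter


def confusion_counts(gold, pred):
    """Count how often each (gold, predicted) label pair occurs."""
    return Counter(zip(gold, pred))


def majority_baseline(gold):
    """Accuracy obtained by always predicting the most frequent gold label."""
    return Counter(gold).most_common(1)[0][1] / len(gold)


# Toy labels for illustration only; substitute kdd_labels and predicted_labels.
toy_gold = ["real", "fake", "fake", "real", "fake"]
toy_pred = ["real", "real", "fake", "fake", "fake"]

counts = confusion_counts(toy_gold, toy_pred)
toy_accuracy = sum(n for (g, p), n in counts.items() if g == p) / len(toy_gold)
baseline = majority_baseline(toy_gold)

print(counts[("fake", "fake")])  # 2
print(toy_accuracy, baseline)    # 0.6 0.6
```

An accuracy close to the majority-class baseline would suggest the model transfers little of what it learned on dataset A.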
