# **Final Project**

## **Problem statement:**

The widespread dissemination of fake news and propaganda presents serious societal risks, including the erosion of public trust, political polarization, manipulation of elections, and the spread of harmful misinformation during crises such as pandemics or conflicts. From an NLP perspective, detecting fake news is fraught with challenges. Linguistically, fake news often mimics the tone and structure of legitimate journalism, making it difficult to distinguish using surface-level features. The absence of reliable and up-to-date labeled datasets, especially across multiple languages and regions, hampers the effectiveness of supervised learning models. Additionally, the dynamic and adversarial nature of misinformation means that malicious actors constantly evolve their language and strategies to bypass detection systems. Cultural context, sarcasm, satire, and implicit bias further complicate automated analysis. Moreover, NLP models risk amplifying biases present in training data, leading to unfair classifications and potential censorship of legitimate content. These challenges underscore the need for cautious, context-aware approaches, as the failure to address them can inadvertently contribute to misinformation, rather than mitigate it.



Use the datasets at the following link to complete the requirements: https://drive.google.com/drive/folders/1mrX3vPKhEzxG96OCPpCeh9F8m_QKCM4z?usp=sharing

## **About the dataset:**

* **True Articles**:

  * **File**: `MisinfoSuperset_TRUE.csv`
  * **Sources**:

    * Reputable media outlets like **Reuters**, **The New York Times**, **The Washington Post**, etc.

* **Fake/Misinformation/Propaganda Articles**:

  * **File**: `MisinfoSuperset_FAKE.csv`
  * **Sources**:

    * **American right-wing extremist websites** (e.g., Redflag Newsdesk, Breitbart, Truth Broadcast Network)
    * **Public dataset** from:

      * Ahmed, H., Traore, I., & Saad, S. (2017): "Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques" *(Springer LNCS 10618)*



## **Requirement**

A team consisting of three members must complete a project that involves applying the methods learned from the beginning of the course up to the present. The team is expected to follow and document the entire machine learning workflow, which includes the following steps:

1. **Data Preprocessing**: Clean and prepare the dataset.

2. **Exploratory Data Analysis (EDA)**: Explore and visualize the data.

3. **Model Building**: Select and build one or more machine learning models suitable for the problem at hand.

4. **Hyperparameter Setup**: Set and adjust the model's hyperparameters using appropriate methods to improve performance.

5. **Model Training**: Train the model(s) on the training dataset.

6. **Performance Evaluation**: Evaluate the trained model(s) using appropriate metrics (e.g., accuracy, precision, recall, F1-score, confusion matrix, etc.) and validate their performance on unseen data.

7. **Conclusion**: Summarize the results, discuss the model's strengths and weaknesses, and suggest possible improvements or future work.





In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv(r"C:\Users\Admin\OneDrive - VNU-HCMUS\cleaned")
print(df.head())
print(df.shape)

                                                text  label  text_length
0  link imo happening guys jumping conclusions co...      1         2717
1  house passed budget today slim margin democrat...      1         3203
2  youtube video posted january cloudgate studio ...      0         2549
3  marseille france sang danced chanted even reas...      0         8149
4  sunday nbc meet press discussing president don...      0         1235
(68604, 3)


In [3]:
# Check for missing values
df.isnull().sum()

text           7
label          0
text_length    0
dtype: int64

In [4]:
# Drop rows with NaN in the 'text' column
df = df.dropna(subset=['text'])

In [5]:
# Split the data into three parts: train, validation, test
# 60% train, 20% validation, 20% test (sequential slices, in file order)
train_df = df[:int(0.6 * len(df))]
valid_df = df[int(0.6 * len(df)):int(0.8 * len(df))]
test_df = df[int(0.8 * len(df)):]
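The slices above split the data in file order. If the CSV happens to be grouped by label or source (an assumption, not something this notebook confirms), a shuffled, stratified split would keep the class ratio similar across the three parts. Below is a minimal sketch using scikit-learn's `train_test_split`. The helper name `stratified_split` and the toy frame are hypothetical, and the results reported later in this notebook come from the sequential split, not from this variant.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(frame, label_col="label", seed=42):
    """Shuffle and split into 60% train, 20% validation, 20% test,
    keeping the label proportions roughly equal in every part."""
    # First cut: 60% train, 40% held out.
    train_part, rest = train_test_split(
        frame, test_size=0.4, stratify=frame[label_col], random_state=seed
    )
    # Second cut: split the held-out 40% in half (20% / 20% overall).
    valid_part, test_part = train_test_split(
        rest, test_size=0.5, stratify=rest[label_col], random_state=seed
    )
    return train_part, valid_part, test_part


# Toy frame standing in for the real data (hypothetical values).
toy = pd.DataFrame({"text": [f"doc {i}" for i in range(100)],
                    "label": [0, 1] * 50})
train_s, valid_s, test_s = stratified_split(toy)
print(len(train_s), len(valid_s), len(test_s))
print(train_s["label"].mean(), valid_s["label"].mean(), test_s["label"].mean())
```

Fixing `random_state` makes the split reproducible, which matters when two models are compared on the same test set.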

# Tokenize data

In [6]:
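# Tokenize the data
from transformers import BertTokenizerFast
from transformers import XLNetTokenizerFast

class Tokenizer:
    def __init__(self, model_name, max_length=128):
        # 'bert' also matches 'distilbert-base-uncased', which uses the
        # same uncased WordPiece vocabulary as 'bert-base-uncased'
        if 'bert' in model_name:
            self.tokenizer = BertTokenizerFast.from_pretrained(model_name)
        elif 'xlnet' in model_name:
            self.tokenizer = XLNetTokenizerFast.from_pretrained(model_name)
        else:
            # Fail early instead of leaving self.tokenizer undefined
            raise ValueError(f"Unsupported model name: {model_name}")
        self.max_length = max_length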

    def tokenize_function(self, examples):
        return self.tokenizer(
            examples["text"],
            truncation=True,
            padding="max_length",
            max_length=self.max_length
        )

from datasets import Dataset

tokenizer = Tokenizer('bert-base-uncased')

# Convert DataFrames to HuggingFace Datasets
train_dataset = Dataset.from_pandas(train_df)
valid_dataset = Dataset.from_pandas(valid_df)
test_dataset = Dataset.from_pandas(test_df)

# Tokenize datasets
train_dataset = train_dataset.map(tokenizer.tokenize_function, batched=True)
valid_dataset = valid_dataset.map(tokenizer.tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenizer.tokenize_function, batched=True)

# Set format for PyTorch
train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
valid_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

Map:   0%|          | 0/41158 [00:00<?, ? examples/s]

Map:   0%|          | 0/13719 [00:00<?, ? examples/s]

Map:   0%|          | 0/13720 [00:00<?, ? examples/s]
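With `max_length=128` and `truncation=True`, any article longer than the token budget is cut off, so the models only see the opening of long texts. The sample rows shown earlier are several thousand characters long, which suggests that truncation is common here. The sketch below gives a rough estimate of how often it happens without loading a tokenizer. It assumes that each whitespace-separated word produces at least one WordPiece token (generally true for cleaned text like this), so the estimate tends to be a lower bound. The helper name `share_truncated` and the sample strings are hypothetical.

```python
def share_truncated(texts, max_length=128, special_tokens=2):
    """Estimate the fraction of texts that truncation would cut off.

    Counts whitespace-separated words as a stand-in for tokens; real
    WordPiece counts are usually higher, so this tends to underestimate.
    """
    budget = max_length - special_tokens  # room left after [CLS] and [SEP]
    truncated = sum(len(text.split()) > budget for text in texts)
    return truncated / len(texts)


# Hypothetical samples: 2, 500, 126 and 127 words respectively.
samples = ["short text", "word " * 500, "word " * 126, "word " * 127]
print(share_truncated(samples))
```

On the real data this could be called as `share_truncated(df["text"])`. A high share would be one argument for trying a larger `max_length`, at the cost of memory and training time.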

# Build and Load model

In [7]:
from transformers import BertForSequenceClassification, DistilBertForSequenceClassification

class ModelBuilder:
    def __init__(self, num_labels=2):
        self.num_labels = num_labels
        self.model = None

    def build_model(self, model_name):
        # Load a pretrained encoder with a freshly initialized classification head
        if model_name == 'bert-base-uncased':
            self.model = BertForSequenceClassification.from_pretrained(model_name, num_labels=self.num_labels)
        else:
            # Any other name is treated as a DistilBERT checkpoint
            self.model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=self.num_labels)
        return self.model

In [8]:
model_builder = ModelBuilder()
bert_model = model_builder.build_model('bert-base-uncased')
distilbert_model = model_builder.build_model('distilbert-base-uncased')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
bert_model.to(device)
distilbert_model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

# Train model

In [10]:
from transformers import Trainer, TrainingArguments, DataCollatorWithPadding
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    cm = confusion_matrix(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')

    return {
        'accuracy': accuracy_score(labels, preds),
        'f1': float(f1),
        'precision': float(precision),
        'recall': float(recall),
        'confusion_matrix': cm.tolist()
    }

data_collator = DataCollatorWithPadding(tokenizer=tokenizer.tokenizer)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    gradient_accumulation_steps=2,
    load_best_model_at_end=True,
    dataloader_num_workers=8,
    fp16=True
)
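The batch settings above determine how many optimizer updates one epoch contains. The sketch below works through that arithmetic for the 41,158 training examples. It assumes a single device and that leftover micro-batches at the end of the epoch are dropped by floor division, which is consistent with the step counts logged in this notebook but may differ between library versions. The helper name `optimizer_steps` is hypothetical.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum,
                    epochs=1, num_devices=1):
    """Number of optimizer updates for the given batch configuration."""
    # Micro-batches per epoch; the last one may be smaller than the rest.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    # One update every `grad_accum` micro-batches (assumed floor division).
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs


effective_batch = 8 * 2                      # batch size x accumulation steps
train_steps = optimizer_steps(41158, 8, 2)   # training set size from the Map output
eval_batches = math.ceil(13719 / 8)          # validation set, no accumulation
print(effective_batch, train_steps, eval_batches)
```

Each update therefore averages gradients over 16 examples, while only 8 examples need to fit in memory at a time.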



## BERT model

In [11]:
# Training BERT model
bert_trainer = Trainer(
    model=bert_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer.tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
bert_trainer.train()

  0%|          | 0/2572 [00:00<?, ?it/s]

{'loss': 0.3696, 'grad_norm': 7.150391578674316, 'learning_rate': 4.031881804043546e-05, 'epoch': 0.19}
{'loss': 0.2484, 'grad_norm': 13.864790916442871, 'learning_rate': 3.0598755832037326e-05, 'epoch': 0.39}
{'loss': 0.216, 'grad_norm': 15.154379844665527, 'learning_rate': 2.0878693623639193e-05, 'epoch': 0.58}
{'loss': 0.2044, 'grad_norm': 8.95656681060791, 'learning_rate': 1.1178071539657855e-05, 'epoch': 0.78}
{'loss': 0.1796, 'grad_norm': 9.804190635681152, 'learning_rate': 1.4580093312597201e-06, 'epoch': 0.97}


  0%|          | 0/1715 [00:00<?, ?it/s]

{'eval_loss': 0.1548750251531601, 'eval_accuracy': 0.9452584007580728, 'eval_f1': 0.945269902401505, 'eval_precision': 0.9453140999778809, 'eval_recall': 0.9452584007580728, 'eval_confusion_matrix': [[6830, 404], [347, 6138]], 'eval_runtime': 88.7281, 'eval_samples_per_second': 154.618, 'eval_steps_per_second': 19.329, 'epoch': 1.0}
{'train_runtime': 783.1495, 'train_samples_per_second': 52.554, 'train_steps_per_second': 3.284, 'train_loss': 0.24195206962514257, 'epoch': 1.0}


TrainOutput(global_step=2572, training_loss=0.24195206962514257, metrics={'train_runtime': 783.1495, 'train_samples_per_second': 52.554, 'train_steps_per_second': 3.284, 'total_flos': 2706886537543680.0, 'train_loss': 0.24195206962514257, 'epoch': 0.9998056365403304})

In [12]:
# Evaluate the model
bert_predict = bert_trainer.predict(test_dataset)
print(bert_predict.metrics)

  0%|          | 0/1715 [00:00<?, ?it/s]

{'test_loss': 0.1624840348958969, 'test_accuracy': 0.9426384839650146, 'test_f1': 0.942699935414346, 'test_precision': 0.9429619844999592, 'test_recall': 0.9426384839650146, 'test_confusion_matrix': [[7174, 464], [323, 5759]], 'test_runtime': 79.4158, 'test_samples_per_second': 172.761, 'test_steps_per_second': 21.595}
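The metrics above are weighted averages over both classes, which can hide differences between them. Per-class precision and recall can be recovered from the logged confusion matrix alone. The sketch below assumes the scikit-learn convention used by `confusion_matrix` (rows are true labels, columns are predicted labels). The helper name `per_class_report` is hypothetical, and which label corresponds to fake articles is not stated in this notebook, so the classes are referred to only as 0 and 1.

```python
def per_class_report(cm):
    """Accuracy plus per-class precision, recall and F1 from a confusion
    matrix where cm[i][j] counts true class i predicted as class j."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    report = {"accuracy": correct / total}
    for k in range(len(cm)):
        true_positives = cm[k][k]
        predicted_k = sum(row[k] for row in cm)  # column sum
        actual_k = sum(cm[k])                    # row sum
        precision = true_positives / predicted_k
        recall = true_positives / actual_k
        f1 = 2 * precision * recall / (precision + recall)
        report[k] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Test-set confusion matrix logged for BERT above.
bert_test_cm = [[7174, 464], [323, 5759]]
report = per_class_report(bert_test_cm)
for key, value in report.items():
    print(key, value)
```

The accuracy computed this way matches the logged `test_accuracy`, which is a quick check that the matrix is being read in the right orientation.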


## DistilBERT model

In [13]:
# Training DistilBERT model
distilbert_trainer = Trainer(
    model=distilbert_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer.tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
distilbert_trainer.train()

  0%|          | 0/2572 [00:00<?, ?it/s]

{'loss': 0.3558, 'grad_norm': 2.2682247161865234, 'learning_rate': 4.0299377916018664e-05, 'epoch': 0.19}
{'loss': 0.238, 'grad_norm': 2.72143816947937, 'learning_rate': 3.0598755832037326e-05, 'epoch': 0.39}
{'loss': 0.2172, 'grad_norm': 6.849100112915039, 'learning_rate': 2.0878693623639193e-05, 'epoch': 0.58}
{'loss': 0.191, 'grad_norm': 2.868938684463501, 'learning_rate': 1.1158631415241058e-05, 'epoch': 0.78}
{'loss': 0.1736, 'grad_norm': 2.535066843032837, 'learning_rate': 1.438569206842924e-06, 'epoch': 0.97}


  0%|          | 0/1715 [00:00<?, ?it/s]

{'eval_loss': 0.1640801578760147, 'eval_accuracy': 0.9394270719440192, 'eval_f1': 0.9394599655808152, 'eval_precision': 0.9398129148566099, 'eval_recall': 0.9394270719440192, 'eval_confusion_matrix': [[6729, 505], [326, 6159]], 'eval_runtime': 59.4362, 'eval_samples_per_second': 230.819, 'eval_steps_per_second': 28.854, 'epoch': 1.0}
{'train_runtime': 471.5333, 'train_samples_per_second': 87.285, 'train_steps_per_second': 5.455, 'train_loss': 0.23344826957856885, 'epoch': 1.0}


TrainOutput(global_step=2572, training_loss=0.23344826957856885, metrics={'train_runtime': 471.5333, 'train_samples_per_second': 87.285, 'train_steps_per_second': 5.455, 'total_flos': 1362824597372928.0, 'train_loss': 0.23344826957856885, 'epoch': 0.9998056365403304})

In [14]:
# Evaluate the model
distilbert_predict = distilbert_trainer.predict(test_dataset)
# Print evaluation results
print(distilbert_predict.metrics)

  0%|          | 0/1715 [00:00<?, ?it/s]

{'test_loss': 0.16996903717517853, 'test_accuracy': 0.9381195335276968, 'test_f1': 0.9382232555685888, 'test_precision': 0.9388897067300341, 'test_recall': 0.9381195335276968, 'test_confusion_matrix': [[7095, 543], [306, 5776]], 'test_runtime': 53.5045, 'test_samples_per_second': 256.427, 'test_steps_per_second': 32.053}
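To support the conclusion step, the two runs can be placed side by side. The sketch below only rearranges numbers already logged in this notebook. The dictionary layout and variable names are hypothetical, and since each model was trained once for a single epoch, a small accuracy gap should be read with caution rather than as a definitive ranking.

```python
# Values copied from the logged training and test outputs above.
logged = {
    "bert-base-uncased": {
        "test_accuracy": 0.9426384839650146,
        "test_f1": 0.942699935414346,
        "train_runtime": 783.1495,   # seconds
        "test_runtime": 79.4158,     # seconds
    },
    "distilbert-base-uncased": {
        "test_accuracy": 0.9381195335276968,
        "test_f1": 0.9382232555685888,
        "train_runtime": 471.5333,
        "test_runtime": 53.5045,
    },
}

bert = logged["bert-base-uncased"]
distil = logged["distilbert-base-uncased"]

# How much accuracy BERT gains, and how much faster DistilBERT runs.
accuracy_gap = bert["test_accuracy"] - distil["test_accuracy"]
train_speedup = bert["train_runtime"] / distil["train_runtime"]
test_speedup = bert["test_runtime"] / distil["test_runtime"]

print(f"accuracy gap: {accuracy_gap * 100:.2f} percentage points")
print(f"training speedup: {train_speedup:.2f}x")
print(f"inference speedup: {test_speedup:.2f}x")
```

In these runs BERT is ahead by under half a percentage point of test accuracy, while DistilBERT trains and predicts noticeably faster, which is the trade-off to weigh in the conclusion.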
