# **Final Project**

## **Problem statement:**

The widespread dissemination of fake news and propaganda presents serious societal risks, including the erosion of public trust, political polarization, manipulation of elections, and the spread of harmful misinformation during crises such as pandemics or conflicts. From an NLP perspective, detecting fake news is fraught with challenges. Linguistically, fake news often mimics the tone and structure of legitimate journalism, making it difficult to distinguish using surface-level features. The absence of reliable and up-to-date labeled datasets, especially across multiple languages and regions, hampers the effectiveness of supervised learning models. Additionally, the dynamic and adversarial nature of misinformation means that malicious actors constantly evolve their language and strategies to bypass detection systems. Cultural context, sarcasm, satire, and implicit bias further complicate automated analysis. Moreover, NLP models risk amplifying biases present in training data, leading to unfair classifications and potential censorship of legitimate content. These challenges underscore the need for cautious, context-aware approaches, as the failure to address them can inadvertently contribute to misinformation, rather than mitigate it.



Use the datasets at the following link to complete the requirements: https://drive.google.com/drive/folders/1mrX3vPKhEzxG96OCPpCeh9F8m_QKCM4z?usp=sharing

## **About the dataset:**

* **True Articles**:

  * **File**: `MisinfoSuperset_TRUE.csv`
  * **Sources**:

    * Reputable media outlets like **Reuters**, **The New York Times**, **The Washington Post**, etc.

* **Fake/Misinformation/Propaganda Articles**:

  * **File**: `MisinfoSuperset_FAKE.csv`
  * **Sources**:

    * **American right-wing extremist websites** (e.g., Redflag Newsdesk, Breitbart, Truth Broadcast Network)
    * **Public dataset** from:

      * Ahmed, H., Traore, I., & Saad, S. (2017): "Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques" *(Springer LNCS 10618)*



## **Requirement**

A team consisting of three members must complete a project that involves applying the methods learned from the beginning of the course up to the present. The team is expected to follow and document the entire machine learning workflow, which includes the following steps:

1. **Data Preprocessing**: Clean and prepare the dataset.

2. **Exploratory Data Analysis (EDA)**: Explore and visualize the data.

3. **Model Building**: Select and build one or more machine learning models suitable for the problem at hand.

4. **Hyperparameter Setup**: Set and tune the model's hyperparameters using appropriate methods to improve performance.

5. **Model Training**: Train the model(s) on the training dataset.

6. **Performance Evaluation**: Evaluate the trained model(s) using appropriate metrics (e.g., accuracy, precision, recall, F1-score, confusion matrix, etc.) and validate their performance on unseen data.

7. **Conclusion**: Summarize the results, discuss the model's strengths and weaknesses, and suggest possible improvements or future work.





In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [59]:
df = pd.read_csv(r"C:\Users\Admin\OneDrive - VNU-HCMUS\cleaned")
print(df.head())
print(df.shape)

                                                text  label  text_length
0  link imo happening guys jumping conclusions co...      1         2717
1  house passed budget today slim margin democrat...      1         3203
2  youtube video posted january cloudgate studio ...      0         2549
3  marseille france sang danced chanted even reas...      0         8149
4  sunday nbc meet press discussing president don...      0         1235
(68604, 3)
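Before splitting, it can help to check the class balance and the length distribution per class. The following is a minimal sketch, assuming the `label` and `text_length` columns shown in the output above; the helper name `summarize_by_label` and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd

def summarize_by_label(df, label_col="label", length_col="text_length"):
    """Return row counts and basic length statistics for each label value."""
    summary = df.groupby(label_col)[length_col].agg(["count", "mean", "median", "max"])
    # Fraction of all rows that belong to each label.
    summary["share"] = summary["count"] / summary["count"].sum()
    return summary

# Toy data standing in for the real DataFrame (values are made up).
demo_df = pd.DataFrame({
    "label": [1, 1, 0, 0, 0],
    "text_length": [100, 300, 50, 150, 250],
})
demo_summary = summarize_by_label(demo_df)
print(demo_summary)
```

On the real data, the same call would be `summarize_by_label(df)`.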


In [61]:
df = df.dropna(subset=['text'])

In [62]:
true_df = df[df['label'] == 1]
fake_df = df[df['label'] == 0]

In [63]:
# Sort by text length in descending order
true_df = true_df.sort_values(by='text_length', ascending=False)
fake_df = fake_df.sort_values(by='text_length', ascending=False)

# Split into train, validation, and test sets
true_train_df = true_df.iloc[:20000]
true_valid_df = true_df.iloc[20000:25000]
true_test_df = true_df.iloc[25000:]

fake_train_df = fake_df.iloc[:20000]
fake_valid_df = fake_df.iloc[20000:25000]
fake_test_df = fake_df.iloc[25000:]
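Because the frames are sorted by length before slicing, the training sets above hold the longest articles and the test sets the shortest, and each class lives in its own set of frames. A possible alternative is a shuffled, stratified split in which every subset contains both labels. The sketch below is hypothetical: the function `stratified_split`, its fractions, and the seed are illustrative choices, not the notebook's method.

```python
import pandas as pd

def stratified_split(df, label_col="label", train_frac=0.7, valid_frac=0.15, seed=42):
    """Shuffle each label group, slice it into three parts, then merge the parts."""
    train_parts, valid_parts, test_parts = [], [], []
    for _, group in df.groupby(label_col):
        group = group.sample(frac=1.0, random_state=seed)  # shuffle within the class
        n_train = int(len(group) * train_frac)
        n_valid = int(len(group) * valid_frac)
        train_parts.append(group.iloc[:n_train])
        valid_parts.append(group.iloc[n_train:n_train + n_valid])
        test_parts.append(group.iloc[n_train + n_valid:])

    def merge(parts):
        # Concatenate the per-class slices and shuffle so the labels are interleaved.
        return pd.concat(parts).sample(frac=1.0, random_state=seed).reset_index(drop=True)

    return merge(train_parts), merge(valid_parts), merge(test_parts)

# Toy data standing in for the real DataFrame (values are made up).
demo_df = pd.DataFrame({
    "text": [f"article {i}" for i in range(200)],
    "label": [i % 2 for i in range(200)],
})
train_df, valid_df, test_df = stratified_split(demo_df)
print(train_df["label"].value_counts().to_dict())
```

With a split like this, a single training run sees both classes, and the evaluation sets can measure how well the two are separated.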

# Tokenize data

In [None]:
# Tokenize the data
from transformers import BertTokenizerFast
from transformers import XLNetTokenizerFast

class Tokenizer:
    def __init__(self, model_name, max_length=256):
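        if 'bert' in model_name:
            self.tokenizer = BertTokenizerFast.from_pretrained(model_name)
        elif 'xlnet' in model_name:
            self.tokenizer = XLNetTokenizerFast.from_pretrained(model_name)
        else:
            # Fail early instead of leaving self.tokenizer undefined
            raise ValueError(f"Unsupported model name: {model_name}")
        self.max_length = max_length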

    def tokenize_function(self, examples):
        return self.tokenizer(
            examples["text"],
            truncation=True,
            padding="max_length",
            max_length=self.max_length
        )
    
# Convert pandas DataFrames to HuggingFace Dataset objects
from datasets import Dataset
true_train_dataset = Dataset.from_pandas(true_train_df)
true_valid_dataset = Dataset.from_pandas(true_valid_df)
true_test_dataset = Dataset.from_pandas(true_test_df)

fake_train_dataset = Dataset.from_pandas(fake_train_df)
fake_valid_dataset = Dataset.from_pandas(fake_valid_df)
fake_test_dataset = Dataset.from_pandas(fake_test_df)

tokenizer = Tokenizer('bert-base-uncased')

true_train_dataset = true_train_dataset.map(tokenizer.tokenize_function)
true_valid_dataset = true_valid_dataset.map(tokenizer.tokenize_function)
true_test_dataset = true_test_dataset.map(tokenizer.tokenize_function)

fake_train_dataset = fake_train_dataset.map(tokenizer.tokenize_function)
fake_valid_dataset = fake_valid_dataset.map(tokenizer.tokenize_function)
fake_test_dataset = fake_test_dataset.map(tokenizer.tokenize_function)

true_train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
true_valid_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
true_test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

fake_train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
fake_valid_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
fake_test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])



Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/9073 [00:00<?, ? examples/s]

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/9524 [00:00<?, ? examples/s]
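The `truncation=True` and `padding="max_length"` arguments used above make every example exactly `max_length` tokens long. The following pure-Python sketch illustrates the idea on plain lists of token IDs. `pad_or_truncate` is a hypothetical helper, `pad_id=0` is an assumption, and the special tokens that a real tokenizer adds are ignored.

```python
def pad_or_truncate(token_ids, max_length=256, pad_id=0):
    """Cut a token ID list to max_length, or right-pad it and mark padding in the mask."""
    ids = list(token_ids)[:max_length]       # truncation: drop everything past max_length
    n_pad = max_length - len(ids)            # how many padding positions are needed
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

short_example = pad_or_truncate([5, 6, 7], max_length=5)
long_example = pad_or_truncate(range(10), max_length=4)
print(short_example)
print(long_example)
```

Padding every example to a fixed length is simple but spends compute on padding positions. Padding each batch only to its longest member is the alternative that `DataCollatorWithPadding` provides.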

# Create dataloader

In [None]:
from transformers import DataCollatorWithPadding

# Pass the underlying HuggingFace tokenizer, not the Tokenizer wrapper defined above
data_collator = DataCollatorWithPadding(tokenizer=tokenizer.tokenizer)

# Build and Load model

In [72]:
# Build or Load Model
from torch.optim import AdamW
from transformers import BertForSequenceClassification, XLNetForSequenceClassification

class ModelBuilder:
    def __init__(self, num_labels=2):
        self.num_labels = num_labels
        self.model = None

    def build_model(self, model_name):
        if 'bert' in model_name:
            self.model = BertForSequenceClassification.from_pretrained(model_name, num_labels=self.num_labels)
        elif 'xlnet' in model_name:
            self.model = XLNetForSequenceClassification.from_pretrained(model_name, num_labels=self.num_labels)
        return self
    
    def optimizer(self, learning_rate=5e-5):
        return AdamW(self.model.parameters(), lr=learning_rate)

In [None]:
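# Use a separate builder per model: build_model returns the builder itself,
# so reusing one instance would leave both names pointing at the last model built.
bert_model = ModelBuilder().build_model('bert-base-uncased')
bert_optimizer = bert_model.optimizer()

xlnet_model = ModelBuilder().build_model('xlnet-base-cased')
xlnet_optimizer = xlnet_model.optimizer()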

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Train model

## True dataset

In [88]:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments, XLNetForSequenceClassification

trainer = Trainer(
    model=bert_model.model,
    args=TrainingArguments(
        output_dir='./results',
        evaluation_strategy='epoch',
        learning_rate=5e-5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=1,
        weight_decay=0.01,
    ),
    train_dataset=true_train_dataset,
    eval_dataset=true_valid_dataset,
    tokenizer=tokenizer.tokenizer,
    #data_collator=data_collator,
    optimizers=(bert_optimizer, None)
)
trainer.train()

  0%|          | 0/2500 [00:00<?, ?it/s]

{'loss': 0.0001, 'grad_norm': 0.0004417365125846118, 'learning_rate': 4e-05, 'epoch': 0.2}
{'loss': 0.0, 'grad_norm': 0.00023791973944753408, 'learning_rate': 3e-05, 'epoch': 0.4}
{'loss': 0.0, 'grad_norm': 0.00015854518278501928, 'learning_rate': 2e-05, 'epoch': 0.6}
{'loss': 0.0, 'grad_norm': 0.00013984873658046126, 'learning_rate': 1e-05, 'epoch': 0.8}
{'loss': 0.0, 'grad_norm': 9.817580576054752e-05, 'learning_rate': 0.0, 'epoch': 1.0}


  0%|          | 0/625 [00:00<?, ?it/s]

{'eval_loss': 2.757354877758189e-06, 'eval_runtime': 47.515, 'eval_samples_per_second': 105.23, 'eval_steps_per_second': 13.154, 'epoch': 1.0}
{'train_runtime': 776.8651, 'train_samples_per_second': 25.744, 'train_steps_per_second': 3.218, 'train_loss': 1.6758975386619568e-05, 'epoch': 1.0}


TrainOutput(global_step=2500, training_loss=1.6758975386619568e-05, metrics={'train_runtime': 776.8651, 'train_samples_per_second': 25.744, 'train_steps_per_second': 3.218, 'total_flos': 1315555276800000.0, 'train_loss': 1.6758975386619568e-05, 'epoch': 1.0})
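The logged values are consistent with 20,000 training examples at batch size 8 (2,500 steps) and a learning rate that decays linearly from 5e-5 to 0, which is the scheduler `Trainer` creates by default when none is passed. A small sketch of that arithmetic, assuming no warmup steps and a single device; `steps_per_epoch` and `linear_lr` are hypothetical helpers:

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)

def linear_lr(step, total_steps, base_lr=5e-5):
    """Learning rate after `step` optimizer updates under linear decay to zero."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)

total_steps = steps_per_epoch(20000, 8)  # 2500
for step in range(500, total_steps + 1, 500):
    print(step, linear_lr(step, total_steps))
```

The same `steps_per_epoch` calculation accounts for the progress-bar totals of the evaluation runs, for example 5,000 validation examples at batch size 8.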

In [89]:
# Evaluate the model
eval_results = trainer.evaluate(eval_dataset=true_test_dataset)
print(f"Evaluation results: {eval_results}")

  0%|          | 0/1135 [00:00<?, ?it/s]

Evaluation results: {'eval_loss': 3.6339563393994467e-06, 'eval_runtime': 91.2339, 'eval_samples_per_second': 99.448, 'eval_steps_per_second': 12.441, 'epoch': 1.0}
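The evaluation output above reports only `eval_loss`. To obtain the metrics listed in the requirements, a function can be passed to `Trainer` through its `compute_metrics` argument. The following is a minimal NumPy sketch, assuming label `1` is treated as the positive class; the function body is illustrative rather than the notebook's code.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Compute accuracy, precision, recall, and F1 from (logits, labels)."""
    logits, labels = eval_pred
    preds = np.argmax(np.asarray(logits), axis=-1)  # predicted class per example
    labels = np.asarray(labels)

    # Confusion-matrix cells for the positive class (label 1).
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    tn = int(np.sum((preds == 0) & (labels == 0)))

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(labels) if len(labels) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy logits and labels (made up) to show the call shape.
demo_logits = np.array([[2.0, 1.0], [0.0, 3.0], [1.0, 4.0], [5.0, 0.0]])
demo_labels = np.array([0, 1, 0, 1])
demo_metrics = compute_metrics((demo_logits, demo_labels))
print(demo_metrics)
```

It would be passed as `Trainer(..., compute_metrics=compute_metrics)`. These metrics are only informative on an evaluation set that contains both labels.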


## Fake dataset

In [90]:
trainer = Trainer(
    model=bert_model.model,
    args=TrainingArguments(
        output_dir='./results',
        evaluation_strategy='epoch',
        learning_rate=5e-5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=1,
        weight_decay=0.01,
    ),
    train_dataset=fake_train_dataset,
    eval_dataset=fake_valid_dataset,
    tokenizer=tokenizer.tokenizer,
    #data_collator=data_collator,
    optimizers=(bert_optimizer, None)  # Use the optimizer we created
)
trainer.train()

  0%|          | 0/2500 [00:00<?, ?it/s]

{'loss': 0.0515, 'grad_norm': 0.00026390422135591507, 'learning_rate': 4e-05, 'epoch': 0.2}
{'loss': 0.0, 'grad_norm': 0.00019342059385962784, 'learning_rate': 3e-05, 'epoch': 0.4}
{'loss': 0.0, 'grad_norm': 0.00013368419604375958, 'learning_rate': 2e-05, 'epoch': 0.6}
{'loss': 0.0, 'grad_norm': 0.00013435892469715327, 'learning_rate': 1e-05, 'epoch': 0.8}
{'loss': 0.0, 'grad_norm': 9.821492130868137e-05, 'learning_rate': 0.0, 'epoch': 1.0}


  0%|          | 0/625 [00:00<?, ?it/s]

{'eval_loss': 2.6956997771776514e-06, 'eval_runtime': 50.716, 'eval_samples_per_second': 98.588, 'eval_steps_per_second': 12.324, 'epoch': 1.0}
{'train_runtime': 815.3756, 'train_samples_per_second': 24.529, 'train_steps_per_second': 3.066, 'train_loss': 0.010303533741272986, 'epoch': 1.0}


TrainOutput(global_step=2500, training_loss=0.010303533741272986, metrics={'train_runtime': 815.3756, 'train_samples_per_second': 24.529, 'train_steps_per_second': 3.066, 'total_flos': 1315555276800000.0, 'train_loss': 0.010303533741272986, 'epoch': 1.0})
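Both training runs above use a split that contains a single label value (`label == 1` for the true set, `label == 0` for the fake set), so a near-zero loss is expected regardless of the text: predicting the one class present is always correct. The toy NumPy sketch below illustrates this with a classifier that has only a bias vector and never looks at any input. `train_bias_only` and its settings are hypothetical and stand in for the general effect, not for BERT itself.

```python
import numpy as np

def train_bias_only(labels, num_classes=2, lr=1.0, steps=200):
    """Fit a bias-only softmax classifier with gradient descent on cross-entropy."""
    bias = np.zeros(num_classes)
    # Average one-hot target; this is all the loss depends on when there are no features.
    target_mean = np.eye(num_classes)[labels].mean(axis=0)
    losses = []
    for _ in range(steps):
        probs = np.exp(bias - bias.max())
        probs /= probs.sum()
        losses.append(float(-np.sum(target_mean * np.log(probs))))
        bias -= lr * (probs - target_mean)  # gradient of mean cross-entropy w.r.t. bias
    return bias, losses

single_class_labels = np.ones(1000, dtype=int)    # every example has label 1
balanced_labels = np.array([0, 1] * 500)          # both labels equally frequent

_, single_losses = train_bias_only(single_class_labels)
_, balanced_losses = train_bias_only(balanced_labels)
print(single_losses[0], single_losses[-1])
print(balanced_losses[0], balanced_losses[-1])
```

With a single class the loss shrinks toward zero without using the inputs, while with balanced labels the input-free model stays at the chance-level loss of ln 2. A low loss on a single-class split therefore says little about the model's ability to separate true from fake articles.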

In [None]:
# Evaluate
eval_results = trainer.evaluate(eval_dataset=fake_test_dataset)
print(f"Evaluation results: {eval_results}")

  0%|          | 0/1191 [00:00<?, ?it/s]