# Identify Fake News

High level approach:
1. Install prerequisites
2. Load data
3. Prepare data
4. Train model
5. Evaluate model
6. Infer test data
7. Save results


In [1]:
# To install packages that are not installed by default, uncomment the last two lines of this cell 
# and replace <package list> with a list of needed packages.
# This will ensure the notebook has all the dependencies and works everywhere

#!mamba install <package list>

### Installation

In [2]:
# !mamba install --no-update transformers -y 

***Notes:*** 
Installs the Hugging Face Transformers library, which is used to load a transformer-based model for sequence classification of the fake news data.

### Libraries

In [3]:
# Imports for loading data
import os
from shutil import unpack_archive
import pandas as pd

# Imports for helper functions
import time
from typing import Optional, Callable, Any

# Imports for data preparation and preprocessing
import numpy as np
import re
import unicodedata
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# Imports for evaluation
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Imports for model fine-tuning
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from tqdm import tqdm

# Imports for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Set pandas display options
pd.set_option("display.max_columns", 101)
pd.set_option("display.max_colwidth", 100)



***Notes:***

In a shared project, I would normally create a requirements.txt file listing the required libraries. This keeps library versions consistent across the team and also comes in handy during deployment.
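
As a rough illustration of the pinning idea, the sketch below builds requirements.txt-style lines from the versions installed in the current environment. The helper name `pinned_requirements` and the package list are assumptions made for this example; they are not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(packages: list[str]) -> list[str]:
    """Return one 'name==version' line per installed package (hypothetical helper)."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Keep a visible marker instead of failing on a missing package
            lines.append(f"# {name} is not installed in this environment")
    return lines


# Assumed package list, based on the imports used in this notebook
requirement_lines = pinned_requirements(["transformers", "torch", "pandas", "scikit-learn"])
print("\n".join(requirement_lines))
```

The printed lines can be pasted into requirements.txt; `pip freeze` is the more common route when the whole environment should be captured.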

### Helper Functions

***Notes:***

These helper functions would generally be moved to a common utility file shared across team members. This saves time on functions that are used across components, ensures consistency, and eliminates the possibility of redundant code. They are kept in a separate section here for readability.

In [4]:
def get_device() -> str:
    """
    Get the appropriate device (CPU, CUDA GPU, or MPS) for running PyTorch operations.

    Returns:
        str: The selected device ('cuda', 'mps', or 'cpu').
    """
    if torch.cuda.is_available():
        device = "cuda"
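    # is_available() checks that an MPS device can actually be used;
    # is_built() only reports whether PyTorch was compiled with MPS support.
    elif torch.backends.mps.is_available():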
        device = "mps"
    else:
        device = "cpu"
    return device

device = get_device()

In [5]:
def set_global_seed(seed: int) -> None:
    """
    Set a global random seed for NumPy and PyTorch.

    Args:
        seed (int): The seed value to set.

    Returns:
        None
    """
    # Set seed for NumPy
    np.random.seed(seed)

    # Set seed for PyTorch
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_global_seed(42)

In [6]:
def time_it(func: Callable[..., Any]) -> Callable[..., Any]:
    """
    A decorator that measures the time taken to execute a function and prints the elapsed time.

    Args:
        func (Callable): The function to be decorated.

    Returns:
        Callable: The decorated function.
    """
    def wrapper(*args, **kwargs) -> Any:
        """
        Wrapper function to measure the execution time of the decorated function.

        Args:
            *args: Positional arguments passed to the decorated function.
            **kwargs: Keyword arguments passed to the decorated function.

        Returns:
            Any: The result returned by the decorated function.
        """
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        elapsed_time = end_time - start_time
        print(f"{func.__name__} took {elapsed_time:.4f} seconds to run.")
        return result
    return wrapper
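
A possible refinement, shown here only as a sketch: wrapping with `functools.wraps` keeps the decorated function's name and docstring, and `time.perf_counter` is a clock intended for measuring intervals. The names `time_it_wrapped` and `slow_sum` are made up for this example.

```python
import functools
import time
from typing import Any, Callable


def time_it_wrapped(func: Callable[..., Any]) -> Callable[..., Any]:
    """Variant of time_it that keeps the wrapped function's metadata."""

    @functools.wraps(func)  # copies __name__, __doc__, etc. onto the wrapper
    def wrapper(*args, **kwargs) -> Any:
        start = time.perf_counter()  # monotonic clock suited to measuring intervals
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f} seconds to run.")
        return result

    return wrapper


@time_it_wrapped
def slow_sum(n: int) -> int:
    """Sum the integers from 0 to n - 1."""
    return sum(range(n))


total = slow_sum(1_000)
```

Without `functools.wraps`, `slow_sum.__name__` would read `wrapper`, which makes logs and tracebacks harder to follow.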


### Load Data

In [7]:
# # Run this cell block to download and extract dataset
# !wget 'https://hr-projects-assets-prod.s3.amazonaws.com/4j8je858bmi/ed0142953e1c6f8c928ef63bcb269cdc/train.zip'
# !wget 'https://hr-projects-assets-prod.s3.amazonaws.com/4j8je858bmi/dc5feebc74511aeb9753b5074b98a8fe/test.zip'

# print('Extracting Train Dataset :')
# unpack_archive('train.zip', '')

# print('Extracting Test Dataset :')
# unpack_archive('test.zip', '')

# # Remove zip files
# os.remove('train.zip')
# os.remove('test.zip')

### Data Description

Column | Description
:---|:---
`id` | Unique ID of the News Article
`title` | Title of the News Article
`content` | Body of the News Article
`tags` | A string of comma-separated tags associated with the news article.
`category` | Category of the news article. (`1` - `fake`, `0` - `reliable`)

In [8]:
# The training dataset containing the news articles and corresponding
# categories is already loaded below.
train_0 = pd.read_csv(os.path.join("train","train_0.csv"))
train_1 = pd.read_csv(os.path.join("train","train_1.csv"))
df = pd.concat([train_0, train_1], axis=0, ignore_index=True)
df.head()

Unnamed: 0 | id | title | content | tags | category
:---|:---|:---|:---|:---|:---
0 | 0 | God's 10 Commandments Under Attack in Montana | (Photo: Screengrab/NBC Montana) Philip Klevmoen of Gods10.com shows local media the vandalism do... | | 0
1 | 1 | Steven Curtis Chapman New Album 'The Glorious Unfolding' to be Released October 1 | Christian contemporary artist Steven Curtis Chapman announced his new album The Glorious Unfoldi... | | 0
2 | 2 | Casey Anthony Trial: Closing Arguments Begin; Molestation Claim Thrown Out | There is no evidence that Casey Anthony, who is accused of murdering her daughter Caylee Anthony... | | 0
3 | 3 | Aspartame-Induced Fibromyalgia | Below is an approximation of this video’s audio content. To see any graphs, charts, graphics, im... | fibromyalgia, aspartame, sweeteners, Volume 11, artificial sweeteners, sugar, pain, NutraSweet, ... | 0
4 | 4 | Religious Leaders 'March' on Federal Reserve in 'Occupy the Dream' Movement | The “Occupy” movement is back, and this time it joins forces with progressive religious leaders ... | | 0


In [9]:
# size of dataset
df.shape

(50165, 5)

In [10]:
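# Save the training DataFrame for future use.
# index=False avoids writing the row index, which would otherwise come back
# as an extra "Unnamed: 0" column when the file is read again.
df.to_csv(os.path.join("train","data.csv"), index=False)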

In [11]:
def read_csv(file_path: str, encoding: str = 'utf-8') -> Optional[pd.DataFrame]:
    """
    Read a CSV file from the specified path with the option to specify the encoding.

    Args:
        file_path (str): The path to the CSV file to be read.
        encoding (str, optional): The encoding to use for reading the CSV file. Default is 'utf-8'.

    Returns:
        pd.DataFrame or None: A pandas DataFrame containing the data from the CSV file if
        successfully read. Returns None if an error occurs during reading.
    """
    try:
        # Try reading the CSV file using the provided file path and encoding
        df = pd.read_csv(file_path, encoding=encoding)
        return df
    except Exception as e:
        print(f"An error occurred while reading the CSV file: {str(e)}")
        return None

data_path = r'train/data.csv'
df = read_csv(data_path)

if df is not None:
    print("CSV file loaded successfully.")
else:
    print("Failed to load CSV file.")

CSV file loaded successfully.


In [12]:
# TODO - comment out this cell before full training
# Train on a small sample of the data to speed things up and work around compute limitations (running on CPU).

# Option: randomly sample 25% of the DataFrame for training
# sample_len = int(len(df)*0.25)
# df = df.sample(sample_len)

# Currently active: a fixed sample of 100 rows, seeded for reproducibility
df = df.sample(100, random_state=42)

In [13]:
def reformat_dataframe(row: pd.Series) -> str:
    """
    Concatenates the 'title' and 'content' columns of a DataFrame row.

    Args:
        row (pd.Series): A row from a DataFrame containing 'title' and 'content' columns.

    Returns:
        str: Concatenated string of 'title' and 'content'.
    """
    return row['title'] + ' ' + row['content']

df['text'] = df.apply(reformat_dataframe, axis=1)

***Notes:*** 

Concatenating the columns of title and content into a single column to be used for classification purposes. The hypothesis is that fake news is generally propagated by a catchy title followed by fake content. Thus, the title may contain patterns that can be used to identify if it is fake.
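
One caveat with plain `+` concatenation: if a `title` or `content` value is missing (NaN after `pd.read_csv`), adding it to a string raises a `TypeError`. Whether the training data actually contains such rows is not shown here, so the following is only a defensive sketch on a toy DataFrame with made-up values.

```python
import pandas as pd

# Toy data: the second row has a missing title
toy = pd.DataFrame({
    "title": ["Headline A", None],
    "content": ["Body A", "Body B"],
})

# Replace missing values with empty strings, join with a space,
# then trim the stray space left behind when one side was empty
toy["text"] = (toy["title"].fillna("") + " " + toy["content"].fillna("")).str.strip()
print(toy["text"].tolist())
```

The vectorized form also avoids a row-wise `apply`, which tends to be slower on larger DataFrames.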

In [14]:
def create_inverse_label_mappings(label_to_idx: dict[str, int]) -> dict[int, str]:
    """
    Create inverse mappings between label names and label indices.

    Args:
        label_to_idx (dict[str, int]): A dictionary where keys are label names (strings)
            and values are corresponding label indices (integers).

    Returns:
        dict[int, str]: A dictionary where keys are label indices (integers)
        and values are corresponding label names (strings).
    """
    idx_to_label = {idx: label for label, idx in label_to_idx.items()}
    return idx_to_label

label_to_idx = { 'real': 0, 'fake': 1 }
idx_to_label = create_inverse_label_mappings(label_to_idx)

### Pre-processing

***Notes:***
It is useful to understand data provenance, that is, how the data was acquired and processed in the data pipeline, before using it for modelling. This enables context-aware processing and reveals whether any context was lost or whether pre-processing needs special consideration. For example, text might be jumbled after passing through OCR.

Here, I'm using standard text pre-processing because that context is not available.

In [15]:
%%time
def preprocess_text(text: str) -> str:
    """
    Preprocesses a text by:
    1. Removing newline characters '\n' from the text.
    2. Removing extra whitespaces.
    3. Removing square brackets and their contents (e.g., "[...]").
    4. Applying NFC normalization.
    5. Removing every character that is not an ASCII letter, digit, or whitespace.

    Parameters:
        text (str): The input text to be preprocessed.

    Returns:
        str: The preprocessed text.
    """
    # Remove newline characters '\n' from the text
    text = text.replace('\n', ' ')
    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove square brackets and their contents (e.g., "[...]")
    text = re.sub(r'\[[^\]]*\]', '', text)
    # Apply NFC normalization
    text = unicodedata.normalize('NFC', text)
    # Remove every character that is not an ASCII letter, digit, or whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

df['text'] = df['text'].apply(lambda x: preprocess_text(x))

CPU times: user 24.3 ms, sys: 1.07 ms, total: 25.4 ms
Wall time: 24.5 ms
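
A note on step order: `preprocess_text` collapses whitespace before it removes bracketed spans, so deleting a `[...]` span in the middle of a sentence can leave two adjacent spaces. This is unlikely to matter much for a subword tokenizer, but a variant that collapses whitespace last is sketched below. The function name and the sample string are made up for illustration.

```python
import re
import unicodedata


def preprocess_text_collapse_last(text: str) -> str:
    """Same cleaning steps as preprocess_text, with whitespace collapsed last."""
    # Apply NFC normalization first so composed characters are handled consistently
    text = unicodedata.normalize("NFC", text)
    # Replace bracketed spans such as "[VIDEO]" with a space
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Drop everything that is not an ASCII letter, digit, or whitespace
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Collapse newlines and repeated spaces into single spaces, then trim
    return re.sub(r"\s+", " ", text).strip()


sample = "Breaking [VIDEO]  news:\nit's here!"
cleaned = preprocess_text_collapse_last(sample)
print(cleaned)
```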


In [16]:
def train_validation_split(X: pd.Series, y: pd.Series, test_size: float = 0.2) -> tuple:
    """
    Split input data into training and validation sets.

    Parameters:
        X (pd.Series): The feature data.
        y (pd.Series): The target data.
        test_size (float, optional): The proportion of the data to include in the validation split.
            Defaults to 0.2.

    Returns:
        tuple: A tuple containing X_train, X_val, y_train, and y_val.
    """
    X, y = shuffle(X, y, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=test_size, stratify=y, random_state=42)
    return X_train, X_val, y_train, y_val

X, y = df['text'], df['category']
X_train, X_val, y_train, y_val = train_validation_split(X, y, test_size=0.2)
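
To illustrate what `stratify=y` contributes, the sketch below splits a toy dataset with an assumed 20% share of fake articles and checks that both partitions keep that share. The toy series and variable names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 articles, 20 labelled fake (1) and 80 labelled reliable (0)
toy_X = pd.Series([f"article {i}" for i in range(100)])
toy_y = pd.Series([1] * 20 + [0] * 80)

_, _, toy_y_train, toy_y_val = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

# With stratification, the class share is preserved in both partitions
train_fake_share = toy_y_train.mean()
val_fake_share = toy_y_val.mean()
print(f"fake share - train: {train_fake_share:.2f}, val: {val_fake_share:.2f}")
```

Since `train_test_split` already shuffles by default, the separate `shuffle` call in `train_validation_split` is not strictly needed for this to hold.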

### Training

***Notes:***

Model Selection: For the classification task, I use DistilBERT, a distilled version of BERT (Bidirectional Encoder Representations from Transformers). DistilBERT is a lighter alternative to BERT, which helps with the computational limitations (CPU) during training.

Model Baseline: Generally, before training a deep learning-based model, we would train a TF-IDF-based statistical model, such as Naive Bayes or a Support Vector Classifier, to establish a baseline. These models are faster to run and require fewer computational resources, but may generalize poorly in some scenarios. For the sake of expediency, however, I have gone straight to evaluating DistilBERT.
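
For reference, a baseline of the kind described above could look roughly like the sketch below. It was not run as part of this notebook; the toy texts, labels, and hyperparameters are placeholders, and in practice the pipeline would be fitted on `train_texts` / `train_labels` and scored on the validation split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the training texts and labels (1 = fake, 0 = reliable)
toy_texts = [
    "shocking miracle cure doctors hate",
    "you will not believe this secret trick",
    "city council approves annual budget",
    "central bank holds interest rates steady",
]
toy_labels = [1, 1, 0, 0]

# TF-IDF features over unigrams and bigrams, fed into a Naive Bayes classifier
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    MultinomialNB(),
)
baseline.fit(toy_texts, toy_labels)

toy_preds = baseline.predict(["secret miracle trick", "council budget vote"])
print(toy_preds)
```

Keeping the vectorizer and classifier in one pipeline means the vocabulary is learned from the training texts only, which avoids leaking validation data into the features.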

Object-Oriented: Here, I have implemented an object-oriented code structure to demonstrate the approach I would take in production pipelines. Additionally, I would also write unit tests (which were skipped due to time constraints), maintain centralized control of the code flow, implement thorough logging, and utilize modular Python files within an organized project structure instead of a notebook.

In [17]:
class TextClassifier:
    """
    Text classifier using Transformers for sequence classification.

    Args:
        num_classes (int, optional): Number of classes for classification. Default is 2.
        batch_size (int, optional): Batch size for training. Default is 8.
        learning_rate (float, optional): Learning rate for optimizer. Default is 2e-5.
        model_directory (str, optional): Pretrained model directory. Default is "distilbert-base-uncased".
        max_length (int, optional): Maximum sequence length. Default is 512.

    Attributes:
        tokenizer (AutoTokenizer): Tokenizer for text processing.
        model (AutoModelForSequenceClassification): Pretrained model for classification.
        device (str): Device for model training (e.g., 'cpu' or 'cuda').
        optimizer (AdamW): Optimizer for model training.
        criterion (nn.CrossEntropyLoss): Loss criterion.

    Methods:
        preprocess_data: Preprocesses input text data.
        train: Trains the model on the provided data.
        predict: Predicts class labels for input texts.
        save_model: Saves the trained model checkpoint.

    """

    def __init__(
        self,
        num_classes: int = 2,
        batch_size: int = 8,
        learning_rate: float = 2e-5,
        model_directory: str = "distilbert-base-uncased",
        max_length: int = 512,
    ):
        self.num_classes = num_classes
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.max_length = max_length

        self.tokenizer = AutoTokenizer.from_pretrained(model_directory, use_fast=True)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_directory, num_labels=num_classes
        )
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        self.optimizer = AdamW(self.model.parameters(), lr=learning_rate)
        self.criterion = nn.CrossEntropyLoss()

    def preprocess_data(self, texts):
        """
        Preprocesses the input texts.

        Args:
            texts (list of str): The input texts to be preprocessed.

        Returns:
            input_ids (torch.Tensor): The input IDs after tokenization and encoding.
            attention_masks (torch.Tensor): The attention masks for input texts.
        """
        input_ids = []
        attention_masks = []

        for text in texts:
            encoded_dict = self.tokenizer.encode_plus(
                text,
                add_special_tokens=True,
                truncation="longest_first",
                max_length=self.max_length,
                padding="max_length",
                return_attention_mask=True,
                return_tensors="pt",
            )
            input_ids.append(encoded_dict["input_ids"])
            attention_masks.append(encoded_dict["attention_mask"])

        input_ids = torch.cat(input_ids, dim=0)
        attention_masks = torch.cat(attention_masks, dim=0)

        return input_ids, attention_masks

    def train(
        self, train_texts, train_labels, val_texts, val_labels, num_epochs: int = 1
    ):
        """
        Trains the model on the provided training data and evaluates it on the validation data.

        Args:
            train_texts (list of str): The training texts.
            train_labels (list of int): The training labels.
            val_texts (list of str): The validation texts.
            val_labels (list of int): The validation labels.
            num_epochs (int, optional): The number of training epochs. Default is 1.
        """
        input_ids, attention_masks = self.preprocess_data(train_texts)
        labels = torch.tensor(train_labels)
        dataset = TensorDataset(input_ids, attention_masks, labels)
        dataloader = DataLoader(dataset, batch_size=self.batch_size, shuffle=True)

        val_input_ids, val_attention_masks = self.preprocess_data(val_texts)
        val_labels = torch.tensor(val_labels)
        val_dataset = TensorDataset(val_input_ids, val_attention_masks, val_labels)
        val_dataloader = DataLoader(
            val_dataset, batch_size=self.batch_size, shuffle=False
        )

        for epoch in range(num_epochs):
            self.model.train()
            total_loss = 0.0
            batch_count = 0
            # Create a tqdm progress bar for the training loop
            progress_bar = tqdm(dataloader, desc=f"Epoch {epoch+1}/{num_epochs}")

            for batch in progress_bar:
                batch_count += 1
                batch = tuple(t.to(self.device) for t in batch)
                input_ids, attention_mask, label = batch

                self.optimizer.zero_grad()
                outputs = self.model(
                    input_ids, attention_mask=attention_mask, labels=label
                )
                loss = outputs.loss
                loss.backward()
                self.optimizer.step()

                total_loss += loss.item()

            average_loss = total_loss / len(dataloader)

            # Validation loop
            self.model.eval()
            total_val_loss = 0.0
            val_batch_count = 0
            val_progress_bar = tqdm(val_dataloader, desc=f"Validation")
            with torch.no_grad():
                for batch in val_progress_bar:
                    val_batch_count += 1
                    batch = tuple(t.to(self.device) for t in batch)
                    input_ids, attention_mask, label = batch

                    outputs = self.model(
                        input_ids, attention_mask=attention_mask, labels=label
                    )
                    val_loss = outputs.loss
                    total_val_loss += val_loss.item()

            average_val_loss = total_val_loss / len(val_dataloader)
            print(
                f"Epoch {epoch+1}/{num_epochs} - Train Loss: {average_loss:.4f} - Val Loss: {average_val_loss:.4f}"
            )

            self.save_model(epoch, average_loss, average_val_loss)

    def predict(self, texts):
        """
        Predicts the class labels for the input texts.

        Args:
            texts (list of str): The texts for which predictions are to be made.

        Returns:
            predictions (list of int): The predicted class labels.
        """
        input_ids, attention_masks = self.preprocess_data(texts)
        dataset = TensorDataset(input_ids, attention_masks)
        dataloader = DataLoader(dataset, batch_size=self.batch_size, shuffle=False)

        # Switch to evaluation mode so dropout is disabled during inference
        self.model.eval()
        predictions = []
        progress_bar = tqdm(dataloader, desc="Predicting")
        with torch.no_grad():
            for batch in progress_bar:
                batch = tuple(t.to(self.device) for t in batch)
                input_ids, attention_mask = batch

                outputs = self.model(input_ids, attention_mask=attention_mask)
                logits = outputs.logits
                probabilities = torch.softmax(logits, dim=1)
                predicted_classes = torch.argmax(probabilities, dim=1).tolist()
                predictions.extend(predicted_classes)
        return predictions

    def save_model(self, epoch, train_loss, val_loss, model_path: str = "model"):
        """
        Saves the trained model checkpoint.

        Args:
            epoch (int): The current training epoch.
            train_loss (float): The training loss at the end of the epoch.
            val_loss (float): The validation loss at the end of the epoch.
            model_path (str, optional): Directory to save the model checkpoint. Default is 'model'.
        """
        if not os.path.exists(model_path):
            os.makedirs(model_path)

        model_filename = (
            f"epoch_{epoch+1}_trainloss_{train_loss:.2f}_valloss_{val_loss:.2f}"
        )
        model_filename = model_filename.replace(".", "-")

        model_filepath = os.path.join(model_path, model_filename)
        self.model.save_pretrained(model_filepath)
        self.tokenizer.save_pretrained(model_filepath)
        print(f"Saved model checkpoint: {model_filename}")

In [18]:
train_texts = X_train.to_list()
val_texts = X_val.to_list()
train_labels = y_train.to_list()
val_labels = y_val.to_list()

In [19]:
# Begin Training
model = TextClassifier()
model.train(train_texts, train_labels, val_texts, val_labels)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Epoch 1/1:  40%|████      | 4/10 [00:31<00:45,  7.66s/it]

***Notes:***

Epoch: The model was trained for only 1 epoch due to compute constraints. Further training is likely to help as long as the validation loss keeps decreasing; once it flattens or rises while the training loss keeps falling, extra epochs mainly risk overfitting.
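
The stopping rule described in the Epoch note can be sketched as a small patience-based check. This is not part of the `TextClassifier` class; the class name, the `patience` and `min_delta` parameters, and the loss values are all made up for illustration.

```python
class EarlyStopping:
    """Signals a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 2, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best_loss - self.min_delta:
            # Validation loss improved: remember it and reset the counter
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Made-up validation losses, one per epoch, for illustration only
val_losses = [0.40, 0.31, 0.29, 0.30, 0.32, 0.28]
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best validation loss {stopper.best_loss}")
```

In a training loop, `should_stop` would be called with `average_val_loss` at the end of each epoch, and the checkpoint with the best validation loss would be the one to keep.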

Monitoring: We could use Weights and Biases to monitor the training process; I skipped it here due to environment constraints.


### Evaluation

In [None]:
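# Load fine-tuned model
# Note: the directory name must match the checkpoint written by save_model,
# which numbers epochs from 1 and encodes the rounded losses in the name.
model_path = "model"
model_filename = "epoch_1_trainloss_0-05_valloss_0-04"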
model_filepath = os.path.join(model_path, model_filename)
model = TextClassifier(model_directory=model_filepath)

In [None]:
# Make predictions
y_pred = model.predict(X_val)

# Convert indices to labels for visualization
y_true = [idx_to_label[i] for i in y_val.to_list()]
y_pred = [idx_to_label[i] for i in y_pred]

In [None]:
def evaluate_multiclass_classification(y_true, y_pred):
    """
    Evaluate a multi-class classification model.

    Args:
    - y_true (List[Union[int, str]]): True labels.
    - y_pred (List[Union[int, str]]): Predicted labels.

    Returns:
    - metrics (Dict[str, Union[float, str, List[List[int]], str]]): A dictionary containing the evaluation metrics.
      - "Accuracy" (float): Accuracy score.
      - "Precision (weighted)" (float): Weighted precision score.
      - "Recall (weighted)" (float): Weighted recall score.
      - "F1 Score (weighted)" (float): Weighted F1 score.
      - "Confusion Matrix" (List[List[int]]): Confusion matrix.
      - "Classification Report" (str): Classification report as a string.
    """
    # Calculate accuracy
    accuracy = accuracy_score(y_true, y_pred)
    # Calculate support-weighted precision, recall, and F1 score across classes
    precision = precision_score(y_true, y_pred, average='weighted')
    recall = recall_score(y_true, y_pred, average='weighted')
    f1 = f1_score(y_true, y_pred, average='weighted')
    # Calculate the confusion matrix
    confusion = confusion_matrix(y_true, y_pred)
    # Generate a classification report
    class_report = classification_report(y_true, y_pred, target_names=sorted(set(y_true)))
    # Create a dictionary to store the results
    metrics = {
        "Accuracy": accuracy,
        "Precision (weighted)": precision,
        "Recall (weighted)": recall,
        "F1 Score (weighted)": f1,
        "Confusion Matrix": confusion,
        "Classification Report": class_report
    }
    return metrics

result = evaluate_multiclass_classification(y_true, y_pred)

In [None]:
# Print evaluations results
for key, value in result.items():
    print(f'{key}: \t {value} \n')

***Notes:***

The model performs well on the validation set, with an F1 score above 0.95.
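
Because the reported score is a weighted average, it can hide a weaker minority class. The sketch below uses made-up labels to show that the weighted F1 is simply the support-weighted mean of the per-class F1 scores, which is why the per-class rows of the classification report are worth reading as well.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels for illustration: 2 fake and 4 real articles
toy_true = ["fake", "fake", "real", "real", "real", "real"]
toy_pred = ["fake", "real", "real", "real", "real", "fake"]

labels = ["fake", "real"]
# Per-class F1, in the order given by `labels`
per_class_f1 = f1_score(toy_true, toy_pred, labels=labels, average=None)
# Support = number of true instances of each class
support = np.array([toy_true.count(label) for label in labels])

manual_weighted = np.average(per_class_f1, weights=support)
sklearn_weighted = f1_score(toy_true, toy_pred, labels=labels, average="weighted")
print(per_class_f1, manual_weighted, sklearn_weighted)
```

In this toy case the per-class scores are 0.5 for `fake` and 0.75 for `real`, and the weighted score lands closer to the majority class.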

In [None]:
def plot_confusion_matrix(y_true, y_pred, classes):
    """
    Plots a confusion matrix to visualize the performance of a classification model.

    Parameters:
    y_true (array-like): True class labels.
    y_pred (array-like): Predicted class labels.
    classes (array-like): List of class labels.

    Returns:
    None
    """
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.set(font_scale=1.2)  # Adjust font size
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=classes, yticklabels=classes)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix')
    plt.show()

class_names = sorted(set(y_true))
plot_confusion_matrix(y_true, y_pred, class_names)

### Inference

In [None]:
# Read test data
test = pd.read_csv(os.path.join("test","test.csv"))
test.head()

In [None]:
# Concatenating title and content
test['text'] = test.apply(reformat_dataframe, axis=1)
# Pre-processing text
test['text'] = test['text'].apply(lambda x: preprocess_text(x))

In [None]:
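# Load fine-tuned model
# Note: the directory name must match the checkpoint written by save_model,
# which numbers epochs from 1 and encodes the rounded losses in the name.
model_path = "model"
model_filename = "epoch_1_trainloss_0-05_valloss_0-04"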
model_filepath = os.path.join(model_path, model_filename)
model = TextClassifier(model_directory=model_filepath)

In [None]:
# Perform inference
test['category'] = model.predict(test['text'])

In [None]:
# Select subset column for submission file
submissions_df = test[['id', 'category']].copy()
submissions_df.head()

In [None]:
# save submission result
submissions_df.to_csv('submissions.csv', index=False)
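
Before saving, a few sanity checks on the submission frame can catch silent mistakes. The sketch below assumes the expected format is exactly the columns `id` and `category`, with labels 0 or 1; that format is inferred from the cells above rather than from an official specification, and `validate_submission` is a hypothetical helper.

```python
import pandas as pd


def validate_submission(frame: pd.DataFrame) -> None:
    """Raise ValueError if the frame does not look like a valid submission."""
    if list(frame.columns) != ["id", "category"]:
        raise ValueError(f"unexpected columns: {list(frame.columns)}")
    if frame.isna().any().any():
        raise ValueError("missing values present")
    if not frame["id"].is_unique:
        raise ValueError("duplicate ids")
    if not frame["category"].isin([0, 1]).all():
        raise ValueError("labels must be 0 or 1")


# Toy frame standing in for submissions_df
toy_submission = pd.DataFrame({"id": [101, 102, 103], "category": [0, 1, 0]})
validate_submission(toy_submission)
print("submission looks valid")
```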

### Conclusion

***Notes:***

Areas for further improvement include:
- Modularize the code further into classes.
- Use WandB for tracking training and logging experiments.
- Write unit tests for the functions to make the code more robust.
- Train the model for longer.
- Try other models, such as RoBERTa or domain-specific models.
- Do error analysis on misclassifications to guide further improvements.
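
As a starting point for the error analysis item, misclassified rows can be pulled out and grouped by the kind of mistake. The sketch below uses a toy DataFrame with made-up texts and labels; in the notebook, the columns would come from `X_val`, `y_true`, and `y_pred`.

```python
import pandas as pd

# Toy evaluation frame for illustration only
toy_eval = pd.DataFrame({
    "text": ["article a", "article b", "article c", "article d"],
    "y_true": ["fake", "real", "real", "fake"],
    "y_pred": ["fake", "fake", "real", "real"],
})

# Keep only the rows where the prediction disagrees with the true label
errors = toy_eval[toy_eval["y_true"] != toy_eval["y_pred"]]

# Count each (true label, predicted label) combination among the errors
error_counts = errors.groupby(["y_true", "y_pred"]).size()
print(errors)
print(error_counts)
```

Reading through the texts in `errors`, split by error type, is usually the quickest way to spot systematic issues such as very short articles or truncated inputs.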