[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JAdamHub/M3-SUBMISSION-MSTR/blob/main/M3-Assignment_2_Fake_or_Real_News_Binary_Classification.ipynb)

# Assignment 2: Fake and Real News Dataset 📰

## Overview
A binary text classification dataset containing **44,898 news articles** (`fake` vs. `true`), with a mild **class imbalance**:
- **Fake News**: 23,481 articles (`Fake.csv`)
- **Real News**: 21,417 articles (`True.csv`)

## Key Adjustments
- The dataset is **not perfectly balanced** (≈52.3% fake vs. ≈47.7% real). This should be addressed during model training (e.g., stratification, class weighting).
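
One common way to act on the imbalance is inverse-frequency class weighting. The sketch below is illustrative only: it uses a toy label list with roughly the 52/48 split quoted above, and the helper name `balanced_class_weights` is hypothetical. It applies the widely used "balanced" heuristic `n_samples / (n_classes * class_count)`. Whether weights are needed at all for an imbalance this mild is a judgment call.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# toy labels with roughly the 52/48 split described above
# (assumption for this sketch: 0 = majority class, 1 = minority class)
toy_labels = [0] * 523 + [1] * 477
weights = balanced_class_weights(toy_labels)
print(weights)  # the minority class receives the larger weight
```

Weights like these could then be passed to a weighted loss during training; that wiring is not shown here because it depends on the training setup.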

## Structure & Features
- **Columns**:  
  `title`, `text`, `subject`, `date`  
- **Sources**:  
  Real news from *Reuters.com*; fake news from fact-checked unreliable sources.  
- **Use Cases**:  
  NLP model training, linguistic pattern analysis, misinformation detection benchmarks.  

## 1. Import Libraries & Dataset

In [2]:
# import libraries
!pip install evaluate
import kagglehub
from kagglehub import KaggleDatasetAdapter
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import evaluate
from huggingface_hub import notebook_login

# according to the dataset description, it consists of two files (fake / real)
# load them separately
fake_file_path = "Fake.csv"
real_file_path = "True.csv"

# Load Fake.csv
fake_df = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "clmentbisaillon/fake-and-real-news-dataset",
    fake_file_path
)

# Load True.csv
real_df = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "clmentbisaillon/fake-and-real-news-dataset",
    real_file_path
)

fake_df.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


## 2. Data Overview, Preprocessing & Cleaning

* The two files carry no label column, so one must be added before they are combined.
* Then get an overview of the data and check for missing values.

In [3]:
fake_df['label']=0
real_df['label']=1

* Encoding of the new `label` column:
  - Fake news = 0
  - Real news = 1

In [4]:
# combine the two datasets
df = pd.concat([fake_df, real_df], ignore_index=True)

In [5]:
print("First 5 records:")
df.head()

First 5 records:


Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   label    44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 1.7+ MB


In [7]:
df.drop(['subject', 'date', 'title'], axis=1, inplace=True)

In [8]:
df.isnull().sum()

Unnamed: 0,0
text,0
label,0


In [9]:
print(df.duplicated().sum())
df.drop_duplicates(inplace=True)
print("Cleaning...")
df.duplicated().sum()

6251
Cleaning...


0

In [10]:
# text preparation
def preprocess_text(text):
    text = text.lower()
    return text

df['text'] = df['text'].apply(preprocess_text)

## 3. Fine-Tuning DistilBERT with 5-Fold Cross-Validation

In [11]:
# draw a balanced subsample: 500 articles per class (fixed seed for reproducibility)
df_0 = df[df['label'] == 0].sample(n=500, random_state=42)
df_1 = df[df['label'] == 1].sample(n=500, random_state=42)

# combine both classes and reset the index
df_sampled = pd.concat([df_0, df_1]).reset_index(drop=True)

In [12]:
# define model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# determine number of unique labels (num_labels)
num_labels = len(df_sampled['label'].unique())

# function to tokenize text data
def tokenize_function(example):
    return tokenizer(example['text'], truncation=True, padding="max_length", max_length=128)

# configure 5-fold cross validation with stratified sampling
n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

fold_results = []

# load the 'accuracy' metric using the evaluate library
accuracy_metric = evaluate.load("accuracy")

# function to compute metrics during evaluation
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # calculate metrics
    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average="binary")
    # note: roc_auc_score receives hard predictions here, so the value equals balanced accuracy
    roc_auc = roc_auc_score(labels, predictions)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "roc_auc": roc_auc
    }


# iterate over each fold
for fold, (train_index, val_index) in enumerate(skf.split(df_sampled, df_sampled['label'])):
    print(f"fold {fold+1}/{n_splits}")

    # split data into training and validation sets for this fold
    train_df = df_sampled.iloc[train_index].reset_index(drop=True)
    val_df = df_sampled.iloc[val_index].reset_index(drop=True)

    # convert pandas dataframes to hugging face datasets
    train_dataset = Dataset.from_pandas(train_df)
    val_dataset = Dataset.from_pandas(val_df)

    # tokenize the datasets
    train_dataset = train_dataset.map(tokenize_function, batched=True)
    val_dataset = val_dataset.map(tokenize_function, batched=True)

    # set format for pytorch (select required columns)
    train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
    val_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

    # load a new instance of the model for each fold
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

    # set training arguments
    training_args = TrainingArguments(
        output_dir=f"./results/fold_{fold}",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        evaluation_strategy="epoch",
        logging_strategy="epoch",
        save_strategy="epoch",
        learning_rate=2e-5,
        weight_decay=0.01,
        seed=42,
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
    )

    # initialize the trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics,
    )

    # train the model for this fold
    trainer.train()

    # evaluate the model on the validation set and store the results
    results = trainer.evaluate()
    print(f"fold {fold+1} results:", results)
    fold_results.append(results)

# calculate the average accuracy across all folds
avg_accuracy = np.mean([result['eval_accuracy'] for result in fold_results])
print("average accuracy over folds:", avg_accuracy)

fold 1/5


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Roc Auc
1,0.3041,0.025218,1.0,1.0,1.0,1.0,1.0
2,0.0186,0.007438,1.0,1.0,1.0,1.0,1.0
3,0.0084,0.005724,1.0,1.0,1.0,1.0,1.0


fold 1 results: {'eval_loss': 0.025218144059181213, 'eval_accuracy': 1.0, 'eval_precision': 1.0, 'eval_recall': 1.0, 'eval_f1': 1.0, 'eval_roc_auc': 1.0, 'eval_runtime': 0.7382, 'eval_samples_per_second': 270.926, 'eval_steps_per_second': 17.61, 'epoch': 3.0}
fold 2/5


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Roc Auc
1,0.2925,0.032268,0.995,1.0,0.99,0.994975,0.995
2,0.014,0.022702,0.995,1.0,0.99,0.994975,0.995
3,0.0073,0.022293,0.995,1.0,0.99,0.994975,0.995


fold 2 results: {'eval_loss': 0.032267894595861435, 'eval_accuracy': 0.995, 'eval_precision': 1.0, 'eval_recall': 0.99, 'eval_f1': 0.9949748743718593, 'eval_roc_auc': 0.995, 'eval_runtime': 0.7352, 'eval_samples_per_second': 272.019, 'eval_steps_per_second': 17.681, 'epoch': 3.0}
fold 3/5


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Roc Auc
1,0.2806,0.019343,1.0,1.0,1.0,1.0,1.0
2,0.0182,0.006967,1.0,1.0,1.0,1.0,1.0
3,0.0076,0.005382,1.0,1.0,1.0,1.0,1.0


fold 3 results: {'eval_loss': 0.01934261992573738, 'eval_accuracy': 1.0, 'eval_precision': 1.0, 'eval_recall': 1.0, 'eval_f1': 1.0, 'eval_roc_auc': 1.0, 'eval_runtime': 0.7526, 'eval_samples_per_second': 265.734, 'eval_steps_per_second': 17.273, 'epoch': 3.0}
fold 4/5


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Roc Auc
1,0.2802,0.022225,1.0,1.0,1.0,1.0,1.0
2,0.0144,0.007053,1.0,1.0,1.0,1.0,1.0
3,0.0074,0.005645,1.0,1.0,1.0,1.0,1.0


fold 4 results: {'eval_loss': 0.022225352004170418, 'eval_accuracy': 1.0, 'eval_precision': 1.0, 'eval_recall': 1.0, 'eval_f1': 1.0, 'eval_roc_auc': 1.0, 'eval_runtime': 0.7366, 'eval_samples_per_second': 271.516, 'eval_steps_per_second': 17.649, 'epoch': 3.0}
fold 5/5


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Roc Auc
1,0.2787,0.024898,0.995,1.0,0.99,0.994975,0.995
2,0.0159,0.007124,1.0,1.0,1.0,1.0,1.0
3,0.0073,0.005515,1.0,1.0,1.0,1.0,1.0


fold 5 results: {'eval_loss': 0.007124264258891344, 'eval_accuracy': 1.0, 'eval_precision': 1.0, 'eval_recall': 1.0, 'eval_f1': 1.0, 'eval_roc_auc': 1.0, 'eval_runtime': 0.7412, 'eval_samples_per_second': 269.833, 'eval_steps_per_second': 17.539, 'epoch': 3.0}
average accuracy over folds: 0.999
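
In every fold the ROC AUC column equals the accuracy column. That follows from how `compute_metrics` is written: `roc_auc_score` receives hard class predictions rather than scores, and an AUC computed from 0/1 predictions reduces to balanced accuracy. The pure-Python sketch below illustrates the effect on made-up numbers; `pairwise_auc` is a hypothetical helper (a pairwise-comparison formulation of AUC), not part of the notebook.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.6, 0.45, 0.7, 0.9]   # made-up positive-class probabilities
hard = [int(s >= 0.5) for s in scores]      # thresholded 0/1 predictions

auc_from_scores = pairwise_auc(labels, scores)  # uses the full ranking
auc_from_hard = pairwise_auc(labels, hard)      # collapses to balanced accuracy
print(auc_from_scores, auc_from_hard)
```

In the notebook, the analogous change would be to pass a positive-class score (for example `logits[:, 1]`) to `roc_auc_score` instead of `predictions`.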


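## How can it be so accurate?
* An average accuracy of 0.999 after three epochs on 1,000 articles is suspicious and points to a shortcut in the data. Let's look at the raw text of each class again.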

In [14]:
real_df['text'].head(7)

Unnamed: 0,text
0,WASHINGTON (Reuters) - The head of a conservat...
1,WASHINGTON (Reuters) - Transgender people will...
2,WASHINGTON (Reuters) - The special counsel inv...
3,WASHINGTON (Reuters) - Trump campaign adviser ...
4,SEATTLE/WASHINGTON (Reuters) - President Donal...
5,"WEST PALM BEACH, Fla./WASHINGTON (Reuters) - T..."
6,"WEST PALM BEACH, Fla (Reuters) - President Don..."


In [15]:
fake_df['text'].head(7)

Unnamed: 0,text
0,Donald Trump just couldn t wish all Americans ...
1,House Intelligence Committee Chairman Devin Nu...
2,"On Friday, it was revealed that former Milwauk..."
3,"On Christmas day, Donald Trump announced that ..."
4,Pope Francis used his annual Christmas Day mes...
5,The number of cases of cops brutalizing and ki...
6,Donald Trump spent a good portion of his day a...
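
Before changing anything, it can help to measure how strongly a surface marker separates the two classes. The sketch below is a minimal illustration: `marker_share` is a hypothetical helper, and the inputs are a handful of truncated strings copied from the previews above rather than the full dataset.

```python
def marker_share(texts, marker):
    """Fraction of texts that contain `marker` (case-insensitive)."""
    if not texts:
        return 0.0
    marker = marker.lower()
    return sum(marker in t.lower() for t in texts) / len(texts)

# a few truncated strings copied from the previews above
real_preview = [
    "WASHINGTON (Reuters) - The head of a conservat...",
    "SEATTLE/WASHINGTON (Reuters) - President Donal...",
    "WEST PALM BEACH, Fla (Reuters) - President Don...",
]
fake_preview = [
    "Donald Trump just couldn t wish all Americans ...",
    "House Intelligence Committee Chairman Devin Nu...",
    "The number of cases of cops brutalizing and ki...",
]

print(marker_share(real_preview, "(Reuters)"))  # 1.0
print(marker_share(fake_preview, "(Reuters)"))  # 0.0
```

On the full data the same check would be `marker_share(real_df['text'].tolist(), "(Reuters)")`; that figure is not reported here because the notebook did not compute it.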


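### Let's try and fix that with regex

Every real article in the preview opens with a dateline such as `WASHINGTON (Reuters) -`, while the fake articles do not, so a model can separate the classes from that prefix alone. The regex below strips it.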

In [16]:
import re

def remove_leading_info(text):
    # regex pattern which matches the start with a city name, maybe with commas, spaces, periods, and followed by something in parentheses and a hyphen
    pattern = r'^[A-Za-z\s,./-]+\([A-Za-z]+\)\s*-\s*'
    return re.sub(pattern, '', text)

# apply function on the data
real_df['text'] = real_df['text'].apply(remove_leading_info)

real_df['text'].head(7)


Unnamed: 0,text
0,The head of a conservative Republican faction ...
1,Transgender people will be allowed for the fir...
2,The special counsel investigation of links bet...
3,Trump campaign adviser George Papadopoulos tol...
4,President Donald Trump called on the U.S. Post...
5,The White House said on Friday it was set to k...
6,President Donald Trump said on Thursday he bel...
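
A few spot checks make the pattern's behaviour easier to trust. The sketch below repeats the function so that it runs on its own. The first three strings are taken from the previews in this notebook; the last one is a made-up case that hints at a limitation: the character class only covers ASCII letters, so a dateline with an accented city name is left untouched.

```python
import re

def remove_leading_info(text):
    # same pattern as above: leading place name(s), then "(Agency) -"
    pattern = r'^[A-Za-z\s,./-]+\([A-Za-z]+\)\s*-\s*'
    return re.sub(pattern, '', text)

samples = [
    "WASHINGTON (Reuters) - The head of a conservat...",
    "WEST PALM BEACH, Fla./WASHINGTON (Reuters) - T...",
    "Donald Trump just couldn t wish all Americans ...",  # fake article: no dateline
    "SÃO PAULO (Reuters) - Example text",                 # made-up: non-ASCII city name
]

cleaned = [remove_leading_info(s) for s in samples]
for before, after in zip(samples, cleaned):
    status = "stripped" if before != after else "unchanged"
    print(f"{status:9} | {after}")
```

If such cases matter, the share of real articles that still contain `(Reuters)` after cleaning would be worth checking before retraining.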


### Let's try the new method

In [17]:
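# rebuild the combined frame from the cleaned real_df
# note: title/subject/date are kept this time, so duplicates are detected across all columns
df = pd.concat([fake_df, real_df], ignore_index=True)
df.drop_duplicates(inplace=True)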

# text preparation
def preprocess_text(text):
    text = text.lower()
    return text

df['text'] = df['text'].apply(preprocess_text)

# draw a balanced subsample: 500 articles per class (fixed seed for reproducibility)
df_0 = df[df['label'] == 0].sample(n=500, random_state=42)
df_1 = df[df['label'] == 1].sample(n=500, random_state=42)

# combine both classes and reset the index
df_sampled = pd.concat([df_0, df_1]).reset_index(drop=True)

In [18]:
# define model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# determine number of unique labels (num_labels)
num_labels = len(df_sampled['label'].unique())

# function to tokenize text data
def tokenize_function(example):
    return tokenizer(example['text'], truncation=True, padding="max_length", max_length=128)

# configure 5-fold cross validation with stratified sampling
n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

fold_results = []

# function to compute metrics during evaluation
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # calculate metrics
    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average="binary")

    # handle ROC-AUC case where only one class is present in validation set
    # note: hard predictions are passed, so the value equals balanced accuracy
    try:
        roc_auc = roc_auc_score(labels, predictions)
    except ValueError:
        roc_auc = float("nan")  # fallback in case of only one class

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "roc_auc": roc_auc
    }

# iterate over each fold
for fold, (train_index, val_index) in enumerate(skf.split(df_sampled, df_sampled['label'])):
    print(f"fold {fold+1}/{n_splits}")

    # split data into training and validation sets for this fold
    train_df = df_sampled.iloc[train_index].reset_index(drop=True)
    val_df = df_sampled.iloc[val_index].reset_index(drop=True)

    # convert pandas dataframes to hugging face datasets
    train_dataset = Dataset.from_pandas(train_df)
    val_dataset = Dataset.from_pandas(val_df)

    # tokenize the datasets
    train_dataset = train_dataset.map(tokenize_function, batched=True)
    val_dataset = val_dataset.map(tokenize_function, batched=True)

    # set format for pytorch (select required columns)
    train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
    val_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

    # load a new instance of the model for each fold
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

    # set training arguments
    training_args = TrainingArguments(
        output_dir=f"./results/fold_{fold}",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        evaluation_strategy="epoch",
        logging_strategy="epoch",
        save_strategy="epoch",
        learning_rate=2e-5,
        weight_decay=0.01,
        seed=42,
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
    )

    # initialize the trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics,
    )

    # train the model for this fold
    trainer.train()

    # evaluate the model on the validation set and store the results
    results = trainer.evaluate()
    print(f"fold {fold+1} results:", results)
    fold_results.append(results)

# calculate the average metrics across all folds
avg_metrics = {
    metric: np.nanmean([result[f"eval_{metric}"] for result in fold_results])  # use nanmean to ignore NaNs
    for metric in ["accuracy", "precision", "recall", "f1", "roc_auc"]
}

print("average metrics over folds:", avg_metrics)


fold 1/5


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Roc Auc
1,0.3867,0.150776,0.96,0.942308,0.98,0.960784,0.96
2,0.0765,0.177573,0.95,0.909091,1.0,0.952381,0.95
3,0.0407,0.099041,0.975,0.952381,1.0,0.97561,0.975


fold 1 results: {'eval_loss': 0.09904064238071442, 'eval_accuracy': 0.975, 'eval_precision': 0.9523809523809523, 'eval_recall': 1.0, 'eval_f1': 0.975609756097561, 'eval_roc_auc': 0.975, 'eval_runtime': 0.7518, 'eval_samples_per_second': 266.019, 'eval_steps_per_second': 17.291, 'epoch': 3.0}
fold 2/5


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Roc Auc
1,0.3863,0.150844,0.965,0.942857,0.99,0.965854,0.965
2,0.0894,0.184901,0.945,0.908257,0.99,0.947368,0.945
3,0.0465,0.130693,0.96,0.933962,0.99,0.961165,0.96


fold 2 results: {'eval_loss': 0.1508435755968094, 'eval_accuracy': 0.965, 'eval_precision': 0.9428571428571428, 'eval_recall': 0.99, 'eval_f1': 0.9658536585365853, 'eval_roc_auc': 0.9649999999999999, 'eval_runtime': 0.765, 'eval_samples_per_second': 261.423, 'eval_steps_per_second': 16.992, 'epoch': 3.0}
fold 3/5


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Roc Auc
1,0.3896,0.18012,0.935,0.884956,1.0,0.938967,0.935
2,0.0964,0.129715,0.965,0.934579,1.0,0.966184,0.965
3,0.044,0.090979,0.97,0.943396,1.0,0.970874,0.97


fold 3 results: {'eval_loss': 0.09097923338413239, 'eval_accuracy': 0.97, 'eval_precision': 0.9433962264150944, 'eval_recall': 1.0, 'eval_f1': 0.970873786407767, 'eval_roc_auc': 0.97, 'eval_runtime': 0.7388, 'eval_samples_per_second': 270.721, 'eval_steps_per_second': 17.597, 'epoch': 3.0}
fold 4/5


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Roc Auc
1,0.4031,0.101663,0.985,1.0,0.97,0.984772,0.985
2,0.0853,0.046869,0.99,1.0,0.98,0.989899,0.99
3,0.0395,0.039736,0.99,1.0,0.98,0.989899,0.99


fold 4 results: {'eval_loss': 0.04686886817216873, 'eval_accuracy': 0.99, 'eval_precision': 1.0, 'eval_recall': 0.98, 'eval_f1': 0.98989898989899, 'eval_roc_auc': 0.99, 'eval_runtime': 0.7331, 'eval_samples_per_second': 272.807, 'eval_steps_per_second': 17.732, 'epoch': 3.0}
fold 5/5


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Roc Auc
1,0.3851,0.118097,0.965,0.969697,0.96,0.964824,0.965
2,0.0737,0.090844,0.965,0.969697,0.96,0.964824,0.965
3,0.0277,0.089377,0.965,0.969697,0.96,0.964824,0.965


fold 5 results: {'eval_loss': 0.11809676885604858, 'eval_accuracy': 0.965, 'eval_precision': 0.9696969696969697, 'eval_recall': 0.96, 'eval_f1': 0.964824120603015, 'eval_roc_auc': 0.965, 'eval_runtime': 0.7428, 'eval_samples_per_second': 269.234, 'eval_steps_per_second': 17.5, 'epoch': 3.0}
average metrics over folds: {'accuracy': 0.9730000000000001, 'precision': 0.9616662582700318, 'recall': 0.986, 'f1': 0.9734120623087836, 'roc_auc': 0.9730000000000001}
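
An average alone hides how much the folds disagree. The sketch below summarises the per-fold `eval_accuracy` values printed in the two runs above with their mean, sample standard deviation, and minimum. The numbers are copied from the fold results; the variable names are illustrative.

```python
from statistics import mean, stdev

# eval_accuracy per fold, copied from the two runs above
acc_before = [1.0, 0.995, 1.0, 1.0, 1.0]        # Reuters dateline still present
acc_after = [0.975, 0.965, 0.97, 0.99, 0.965]   # dateline removed

for name, accs in [("before", acc_before), ("after", acc_after)]:
    print(f"{name}: mean={mean(accs):.3f} std={stdev(accs):.3f} min={min(accs):.3f}")

# drop in mean accuracy once the dateline shortcut is gone
drop = mean(acc_before) - mean(acc_after)
print(f"drop in mean accuracy: {drop:.3f}")
```

Each validation fold holds only 200 articles, so a single misclassification moves accuracy by 0.5 percentage points; small differences between folds should be read with that in mind.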


In [61]:
# save model for upload to huggingface
# note: `model` is the instance from the last fold (fold 5), trained on 800 of the 1,000 sampled articles
save_path = "./final_model"

# save model + tokenizer
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

('./final_model/tokenizer_config.json',
 './final_model/special_tokens_map.json',
 './final_model/vocab.txt',
 './final_model/added_tokens.json',
 './final_model/tokenizer.json')

## Hugging Face Deployment (adapted for Google Colab userdata secrets)

In [68]:
!pip install --upgrade huggingface_hub
from huggingface_hub import HfApi, HfFolder
from google.colab import userdata

# get your token from userdata
write_token = userdata.get('HF_TOKEN')

# set repository
repo_name = "HugMi/M3-Assignment2"  # replace with your own username/repository

# create the repository if it does not exist
api = HfApi()
api.create_repo(repo_id=repo_name, repo_type="model", private=False, token=write_token, exist_ok=True)

# create model card content for upload
model_card_content = """
---
license: apache-2.0
tags:
- text-classification
- fake-news-detection
---

# Fake News Detection Model

This model is trained to detect fake news articles using DistilBERT.

## Training Data

The model was trained on a dataset of fake and real news articles. The dataset was preprocessed to remove irrelevant information and to balance the classes.

## Performance

The model was evaluated using 5-fold cross-validation. The average metrics across all folds are as follows:

| Metric    | Value |
|-----------|-------|
| Accuracy  | 0.973 |
| Precision | 0.962 |
| Recall    | 0.986 |
| F1        | 0.973 |
| ROC AUC   | 0.973 |

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("HugMi/M3-Assignment2")
model = AutoModelForSequenceClassification.from_pretrained("HugMi/M3-Assignment2")

def classify_text(text):
    # truncate to the 128-token length used during training
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    predicted_class = outputs.logits.argmax(dim=-1).item()
    return predicted_class  # 0 for fake, 1 for real
```
"""