# Sequence Classification
- ref: https://huggingface.co/docs/transformers/tasks/sequence_classification
- dataset: [LittleFish-Coder/Fake_News_KDD2020](https://huggingface.co/datasets/LittleFish-Coder/Fake_News_KDD2020)
- pretrained model: [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)

In [44]:
# %pip install transformers datasets evaluate accelerate

In [None]:
# import packages
import pandas as pd
import torch
from transformers import AutoTokenizer
from datasets import DatasetDict, load_dataset

## Dataset

This notebook uses the [`LittleFish-Coder/Fake_News_KDD2020`](https://huggingface.co/datasets/LittleFish-Coder/Fake_News_KDD2020) dataset, loaded with the 🤗 Datasets library. Labels are encoded as follows:

- 0: real news
- 1: fake news

In [2]:
# load data
dataset: DatasetDict = load_dataset("LittleFish-Coder/Fake_News_KDD2020", download_mode="reuse_cache_if_exists", cache_dir="dataset")   # type: ignore

Generating train split: 100%|██████████| 4487/4487 [00:00<00:00, 35360.95 examples/s]
Generating test split: 100%|██████████| 499/499 [00:00<00:00, 37046.78 examples/s]


In [3]:
# data
print(f"Dataset: {dataset}")
train_dataset = dataset['train']
test_dataset = dataset['test']

Dataset: DatasetDict({
    train: Dataset({
        features: ['text', 'embeddings', 'label'],
        num_rows: 4487
    })
    test: Dataset({
        features: ['text', 'embeddings', 'label'],
        num_rows: 499
    })
})


In [4]:
# quick look at the data
first_train = train_dataset[0]
print("First training sample")
print(f"Keys: {first_train.keys()}")
print(f"Text: {first_train['text']}")
print(f"Label: {first_train['label']}")

First training sample
Keys: dict_keys(['text', 'embeddings', 'label'])
Text: Oops. Something went wrong. Please try again later  Looks like we are having a problem on the server.
Label: 0


## Preprocess (Tokenize)
The next step is to load a [`DistilBERT`](https://huggingface.co/distilbert/distilbert-base-uncased) tokenizer to preprocess the `text` field:

In [5]:
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

Create a preprocessing function that tokenizes the text, truncates sequences to DistilBERT's maximum input length of 512 tokens, and pads shorter sequences to a common length:

In [6]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=512)

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets `map` function.

You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once:

In [7]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)

Map: 100%|██████████| 4487/4487 [00:02<00:00, 1896.61 examples/s]
Map: 100%|██████████| 499/499 [00:00<00:00, 1676.75 examples/s]
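With `batched=True`, the function receives a dictionary of lists (up to 1,000 rows per call by default) rather than a single row, which is why `examples["text"]` in `preprocess_function` is a list of strings. The toy below imitates that calling convention in plain Python. `toy_batched_map` and the whitespace "tokenizer" are stand-ins invented for illustration, not the 🤗 Datasets implementation.

```python
def toy_batched_map(rows, fn, batch_size=1000):
    """Call fn on column-wise batches and merge the returned columns into rows."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into one dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = fn(batch)  # dict of lists, one entry per row in the chunk
        for i, row in enumerate(chunk):
            out.append({**row, **{k: v[i] for k, v in new_cols.items()}})
    return out

def toy_preprocess(examples):
    # Stand-in for the tokenizer: split on whitespace and count the pieces.
    return {"n_words": [len(text.split()) for text in examples["text"]]}

rows = [{"text": "a b c", "label": 0}, {"text": "d e", "label": 1}, {"text": "f", "label": 0}]
mapped = toy_batched_map(rows, toy_preprocess, batch_size=2)
print(mapped)
```

The original columns are kept and the new ones are added alongside them, which matches the keys shown for the tokenized dataset.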


After preprocessing, the dataset will contain the original text and the following attributes that DistilBERT uses as input:

- `input_ids`: The token indices in the vocabulary
- `attention_mask`: Which parts of the sequence DistilBERT should pay attention to

In [9]:
# tokenized
first_tokenized = tokenized_dataset["train"][0]
print("First tokenized sample")
print(f"Keys: {first_tokenized.keys()}")
print(f"Input IDs: {first_tokenized['input_ids']}")
print(f"Attention Mask: {first_tokenized['attention_mask']}")
print(f"Length: {len(first_tokenized['input_ids'])}")

First tokenized sample
Keys: dict_keys(['text', 'embeddings', 'label', 'input_ids', 'attention_mask'])
Input IDs: [101, 1051, 11923, 1012, 2242, 2253, 3308, 1012, 3531, 3046, 2153, 2101, 3504, 2066, 2057, 2024, 2383, 1037, 3291, 2006, 1996, 8241, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
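The long run of zeros at the end of the input IDs above is padding: `0` is the ID of DistilBERT's `[PAD]` token, and the attention mask marks those positions with `0` so the model ignores them. A toy sketch of the truncate-then-pad logic in plain Python follows. The helper name `pad_and_mask` and the short token lists are made up for illustration, and the real tokenizer does considerably more (subword splitting, special tokens).

```python
def pad_and_mask(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    # Real tokenizers keep the final [SEP] when truncating; this toy simply slices.
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask

batch = [[101, 2242, 2253, 3308, 102], [101, 3531, 102]]
input_ids, attention_mask = pad_and_mask(batch, max_length=4)
print(input_ids)
print(attention_mask)
```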

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) metric together with F1, precision, and recall (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

In [10]:
import evaluate

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
precision = evaluate.load("precision")
recall = evaluate.load("recall")

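Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate each metric. F1, precision, and recall use `average="weighted"`, which weights each class by its number of true examples: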

In [24]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    acc = accuracy.compute(predictions=predictions, references=labels)
    f1_score = f1.compute(predictions=predictions, references=labels, average="weighted")
    pre = precision.compute(predictions=predictions, references=labels, average="weighted")
    rec = recall.compute(predictions=predictions, references=labels, average="weighted")

    # compute() can return None (for example on non-main processes), so guard each lookup
    results = {
        "accuracy": acc["accuracy"] if acc else None,
        "f1": f1_score["f1"] if f1_score else None,
        "precision": pre["precision"] if pre else None,
        "recall": rec["recall"] if rec else None,
    }

    return results

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
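To make the `average="weighted"` setting concrete, here is a plain-Python sketch of support-weighted precision, recall, and F1 on a made-up six-example batch. The function name `weighted_scores` and the toy data are illustrative assumptions, not part of 🤗 Evaluate, and the library's metrics handle more edge cases.

```python
def weighted_scores(labels, preds):
    """Support-weighted precision, recall, and F1 over the classes in labels."""
    n = len(labels)
    precision = recall = f1 = 0.0
    for c in sorted(set(labels)):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = (tp + fn) / n  # support of class c as a fraction of all examples
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1

# Toy logits for six examples; argmax over each row gives the predicted class.
toy_logits = [[2.0, 0.1], [1.5, 0.3], [0.9, 0.2], [0.1, 1.2], [1.1, 0.4], [0.8, 0.5]]
toy_labels = [0, 0, 0, 0, 1, 1]
toy_preds = [row.index(max(row)) for row in toy_logits]

acc = sum(y == p for y, p in zip(toy_labels, toy_preds)) / len(toy_labels)
w_precision, w_recall, w_f1 = weighted_scores(toy_labels, toy_preds)
print(acc, w_precision, w_recall, w_f1)
```

One consequence worth noticing is that support-weighted recall always equals accuracy, which is why the Accuracy and Recall columns are identical in the training results later in this notebook.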

## Train (Finetune the model)

Before you start training your model, create a map of the expected ids to their labels with `id2label` and `label2id`:

In [12]:
id2label = {0: "real", 1: "fake"}
label2id = {"real": 0, "fake": 1}

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the [basic tutorial](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

You're ready to start training your model now! 

Load DistilBERT with [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSequenceClassification) along with the number of expected labels, and the label mappings:

In [13]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir`, which specifies where to save your model. With `eval_strategy="epoch"` and `save_strategy="epoch"`, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) evaluates the metrics and saves a checkpoint at the end of each epoch.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, the training and evaluation datasets, and the `compute_metrics` function. This notebook passes no tokenizer or data collator because the inputs were already padded during preprocessing.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [27]:
num_epochs = 2
output_dir = "checkpoints"
batch_size = 64
logging_dir = "logs"

In [28]:
training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=2e-5,
    num_train_epochs=num_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_dir=logging_dir,
    logging_steps=1,
)

In [29]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)

In [30]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.5737,0.497554,0.747495,0.741587,0.745961,0.747495
2,0.1272,0.499713,0.765531,0.764144,0.763826,0.765531


TrainOutput(global_step=142, training_loss=0.3591496240295155, metrics={'train_runtime': 119.1088, 'train_samples_per_second': 75.343, 'train_steps_per_second': 1.192, 'total_flos': 1188762435538944.0, 'train_loss': 0.3591496240295155, 'epoch': 2.0})
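The `global_step=142` in the output can be reproduced from the dataset size and batch size. The sketch below assumes a single device and no gradient accumulation, which is consistent with the numbers reported here but is not stated in the notebook.

```python
import math

train_rows = 4487   # size of the train split
batch_size = 64
num_epochs = 2

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```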

In [42]:
# save the best model
trainer.save_model(f"{output_dir}/best_model")
# save the tokenizer
tokenizer.save_pretrained(f"{output_dir}/best_model")

('checkpoints/best_model/tokenizer_config.json',
 'checkpoints/best_model/special_tokens_map.json',
 'checkpoints/best_model/vocab.txt',
 'checkpoints/best_model/added_tokens.json',
 'checkpoints/best_model/tokenizer.json')

## Evaluation Metrics (on testing dataset)
- Accuracy
- F1 Score
- Precision
- Recall

In [32]:
# evaluate on test set
test_result = trainer.evaluate(eval_dataset=tokenized_dataset["test"])  # type: ignore

In [33]:
test_result

{'eval_loss': 0.4975535571575165,
 'eval_accuracy': 0.7474949899799599,
 'eval_f1': 0.741586546059846,
 'eval_precision': 0.7459606174285781,
 'eval_recall': 0.7474949899799599,
 'eval_runtime': 2.4395,
 'eval_samples_per_second': 204.551,
 'eval_steps_per_second': 3.279,
 'epoch': 2.0}
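These test numbers are identical to the epoch 1 row of the training log, not epoch 2. With `load_best_model_at_end=True` and no `metric_for_best_model` set, the Trainer compares checkpoints by evaluation loss, and epoch 1 has the lower loss. The sketch below mimics that selection on the logged values. The `history` list is transcribed by hand from the training table rather than read from the Trainer.

```python
# Values copied from the per-epoch training log.
history = [
    {"epoch": 1, "eval_loss": 0.497554, "eval_accuracy": 0.747495},
    {"epoch": 2, "eval_loss": 0.499713, "eval_accuracy": 0.765531},
]

# Lower evaluation loss wins, even though epoch 2 has higher accuracy.
best = min(history, key=lambda row: row["eval_loss"])
print(best["epoch"], best["eval_accuracy"])
```

Note that the test split is also the evaluation set used for this selection, so these scores do not come from fully held-out data.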

In [34]:
test_df = pd.DataFrame(test_result, index=[0])
test_df.to_csv(f"{output_dir}/test_result.csv", index=False)

## Inference

Great, now that you've finetuned a model, you can use it for inference!

Pick a device, then grab some text you'd like to run inference on:

In [35]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")

Device: cuda


In [36]:
text = test_dataset[0]["text"]
print(f"Text: {text}")

Text: Reports of Lawrence and Pitt dating spread in December 2017. Watch What Happens Live/YouTube and Jason Kempin/Getty Images for Netflix  Jennifer Lawrence appeared on Bravo's "Watch What Happens Live" on Thursday, March 1 and responded to the reports that she was dating Brad Pitt.  When a caller asked her if the two stars were "secretly dating," the "Red Sparrow" actress denied it.  Even though she says the reports weren't true, Lawrence admitted that she didn't mind the speculation too much.  "No," she said. "I've met him once in like, 2013, so it was very random, but I also wasn't like, in a hurry to debunk it."  In December 2017, it was speculated that Lawrence and Pitt were dating.  In December 2017, it was reported that Jennifer Lawrence was dating Brad Pitt, but the "Red Sparrow" actress has now cleared up the speculation and revealed it's not true.  While appearing on Bravo's "Watch What Happens Live" on Thursday, Lawrence answered questions from fans who called in, and one

In [43]:
from transformers import pipeline

classifier = pipeline("text-classification", model=f"{output_dir}/best_model", truncation=True, device=device)
classifier(text)

[{'label': 'fake', 'score': 0.7897346615791321}]

You can also manually replicate the results of the `pipeline` if you'd like:

Tokenize the text and return PyTorch tensors:

In [44]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(f"{output_dir}/best_model")
inputs = tokenizer(text, return_tensors="pt", truncation=True)
print(f"Input keys: {inputs.keys()}")
print(f"Input: {inputs}")

Input keys: dict_keys(['input_ids', 'attention_mask'])
Input: {'input_ids': tensor([[  101,  4311,  1997,  5623,  1998, 15091,  5306,  3659,  1999,  2285,
          2418,  1012,  3422,  2054,  6433,  2444,  1013,  7858,  1998,  4463,
         20441,  2378,  1013,  2131,  3723,  4871,  2005, 20907,  7673,  5623,
          2596,  2006, 17562,  1005,  1055,  1000,  3422,  2054,  6433,  2444,
          1000,  2006,  9432,  1010,  2233,  1015,  1998,  5838,  2000,  1996,
          4311,  2008,  2016,  2001,  5306,  8226, 15091,  1012,  2043,  1037,
         20587,  2356,  2014,  2065,  1996,  2048,  3340,  2020,  1000, 10082,
          5306,  1010,  1000,  1996,  1000,  2417, 19479,  1000,  3883,  6380,
          2009,  1012,  2130,  2295,  2016,  2758,  1996,  4311,  4694,  1005,
          1056,  2995,  1010,  5623,  4914,  2008,  2016,  2134,  1005,  1056,
          2568,  1996, 12143,  2205,  2172,  1012,  1000,  2053,  1010,  1000,
          2016,  2056,  1012,  1000,  1045,  1005,  231

Pass your inputs to the model and return the `logits`:

In [45]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(f"{output_dir}/best_model")
with torch.no_grad():
    logits = model(**inputs).logits
print(f"Logits: {logits}")

Logits: tensor([[-0.8020,  0.5214]])


Get the class with the highest probability, and use the model's `id2label` mapping to convert it to a text label:

In [46]:
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]

'fake'
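The pipeline's `score` is the softmax probability of the predicted class. As a quick check, the sketch below applies softmax in plain Python to the logits printed above. Because those logits are rounded to four decimals, the result only approximately matches the pipeline's score of about 0.7897.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

logit_values = [-0.8020, 0.5214]      # copied from the printed logits
probs = softmax(logit_values)
predicted = probs.index(max(probs))   # index of the most probable class
print(predicted, probs[predicted])
```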