# Sequence Classification
- ref: https://huggingface.co/docs/transformers/tasks/sequence_classification

- dataset: [GonzaloA/fake_news](https://huggingface.co/datasets/GonzaloA/fake_news)
- pretrained model: [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)

In [1]:
# %pip install transformers datasets evaluate accelerate

In [2]:
# import package
import numpy as np
import pandas as pd
import torch
import evaluate
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import pipeline
from datasets import load_dataset



## Dataset

Use the [`GonzaloA/fake_news`](https://huggingface.co/datasets/GonzaloA/fake_news) dataset, loaded with the Hugging Face Datasets library. The labels are:

- 0: fake news
- 1: real news

In [3]:
# load data
dataset = load_dataset("GonzaloA/fake_news", download_mode="reuse_cache_if_exists", cache_dir="dataset")

Repo card metadata block was not found. Setting CardData to empty.
Generating train split: 100%|██████████| 24353/24353 [00:00<00:00, 34987.88 examples/s]
Generating validation split: 100%|██████████| 8117/8117 [00:00<00:00, 28487.49 examples/s]
Generating test split: 100%|██████████| 8117/8117 [00:00<00:00, 40813.17 examples/s]


In [4]:
# data
print(f"Dataset: {dataset}")
train_dataset = dataset["train"]
val_dataset = dataset["validation"]
test_dataset = dataset["test"]

Dataset: DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'title', 'text', 'label'],
        num_rows: 24353
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'title', 'text', 'label'],
        num_rows: 8117
    })
    test: Dataset({
        features: ['Unnamed: 0', 'title', 'text', 'label'],
        num_rows: 8117
    })
})


In [5]:
# quick look at the data
first_train = train_dataset[0]
print("First training sample")
print(f"Keys: {first_train.keys()}")
print(f"Title: {first_train['title']}")
print(f"Text: {first_train['text']}")
print(f"Label: {first_train['label']}")

First training sample
Keys: dict_keys(['Unnamed: 0', 'title', 'text', 'label'])
Title:  ‘Maury’ Show Official Facebook Posts F*CKED UP Caption On Guest That Looks Like Ted Cruz (IMAGE)
Text: Maury is perhaps one of the trashiest shows on television today. It s right in line with the likes of the gutter trash that is Jerry Springer, and the fact that those shows are still on the air with the shit they air really is a sad testament to what Americans find to be entertaining. However, Maury really crossed the line with a Facebook post regarding one of their guest s appearance with a vile, disgusting caption on Tuesday evening.There was a young woman on there doing one of their episodes regarding the paternity of her child. However, on the page, the show posted an image of the woman, who happens to bear a striking resemblance to Senator and presidential candidate Ted Cruz. The caption from the Maury Show page read: The Lie Detector Test determined .that was a LIE!  Ted Cruz is just NOT that

## Preprocess (Tokenize)
The next step is to load a [`DistilBERT`](https://huggingface.co/distilbert/distilbert-base-uncased) tokenizer to preprocess the `text` field:

In [6]:
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")



Create a preprocessing function to tokenize text and truncate sequences to be no longer than DistilBERT’s maximum input length:

In [7]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets `map` function.

You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once:

In [8]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)

Map: 100%|██████████| 24353/24353 [00:13<00:00, 1751.55 examples/s]
Map: 100%|██████████| 8117/8117 [00:04<00:00, 1736.66 examples/s]
Map: 100%|██████████| 8117/8117 [00:04<00:00, 1634.74 examples/s]
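To make the `batched=True` contract concrete, here is a minimal sketch in plain Python. It assumes a hypothetical `toy_tokenize` function that stands in for the real tokenizer (one fake token id per whitespace-separated word) and a toy limit of 8 tokens instead of DistilBERT's 512. The point is only the shape of the data: a batched function receives a dict of lists and returns a dict of lists.

```python
MAX_LENGTH = 8  # toy limit for illustration; distilbert-base-uncased accepts up to 512 tokens


def toy_tokenize(batch, max_length=MAX_LENGTH):
    """Hypothetical stand-in for a batched preprocessing function.

    Takes a dict of lists (one entry per example) and returns a dict of lists.
    """
    input_ids, attention_mask = [], []
    for text in batch["text"]:
        # Fake "token ids": the length of each word. A real tokenizer looks up vocabulary ids.
        ids = [len(word) for word in text.split()]
        ids = ids[:max_length]  # truncation keeps only the first max_length tokens
        input_ids.append(ids)
        attention_mask.append([1] * len(ids))  # every kept token is a real token
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = {
    "text": [
        "a short text",
        "a much longer text that goes past the toy limit of eight words",
    ]
}
encoded = toy_tokenize(batch)
lengths = [len(ids) for ids in encoded["input_ids"]]
print(lengths)
```

The second example has 13 words, so truncation cuts it down to the 8-token limit, while the first keeps its 3 tokens. No padding happens at this stage.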


After preprocessing, each example keeps its original columns and gains the following fields that DistilBERT uses as input:

- `input_ids`: The token indices in the vocabulary
- `attention_mask`: Which parts of the sequence DistilBERT should pay attention to

In [9]:
# tokenized
first_tokenized = tokenized_dataset["train"][0]
print("First tokenized sample")
print(f"Keys: {first_tokenized.keys()}")
print(f"Input IDs: {first_tokenized['input_ids']}")
print(f"Attention Mask: {first_tokenized['attention_mask']}")
print(f"Length: {len(first_tokenized['input_ids'])}")

First tokenized sample
Keys: dict_keys(['Unnamed: 0', 'title', 'text', 'label', 'input_ids', 'attention_mask'])
Input IDs: [101, 5003, 13098, 2003, 3383, 2028, 1997, 1996, 11669, 10458, 3065, 2006, 2547, 2651, 1012, 2009, 1055, 2157, 1999, 2240, 2007, 1996, 7777, 1997, 1996, 9535, 3334, 11669, 2008, 2003, 6128, 17481, 1010, 1998, 1996, 2755, 2008, 2216, 3065, 2024, 2145, 2006, 1996, 2250, 2007, 1996, 4485, 2027, 2250, 2428, 2003, 1037, 6517, 9025, 2000, 2054, 4841, 2424, 2000, 2022, 14036, 1012, 2174, 1010, 5003, 13098, 2428, 4625, 1996, 2240, 2007, 1037, 9130, 2695, 4953, 2028, 1997, 2037, 4113, 1055, 3311, 2007, 1037, 25047, 1010, 19424, 14408, 3258, 2006, 9857, 3944, 1012, 2045, 2001, 1037, 2402, 2450, 2006, 2045, 2725, 2028, 1997, 2037, 4178, 4953, 1996, 6986, 11795, 3012, 1997, 2014, 2775, 1012, 2174, 1010, 2006, 1996, 3931, 1010, 1996, 2265, 6866, 2019, 3746, 1997, 1996, 2450, 1010, 2040, 6433, 2000, 4562, 1037, 8478, 14062, 2000, 5205, 1998, 4883, 4018, 6945, 8096, 1012, 1996, 1

Now create a batch of examples using [DataCollatorWithPadding](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [10]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
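The benefit of dynamic padding can be sketched without any library. The `pad_batch` helper below is hypothetical, not the collator's actual implementation, and the token ids are illustrative. It pads a batch either to its own longest sequence or to a fixed length, and builds the matching `attention_mask` (1 for real tokens, 0 for padding).

```python
def pad_batch(sequences, pad_id=0, length=None):
    """Pad token-id lists to a common length and build the attention mask.

    If `length` is None, pad to the longest sequence in this batch (dynamic padding).
    """
    target = length if length is not None else max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (target - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (target - len(s)) for s in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Three already-tokenized examples of different lengths (illustrative ids).
sequences = [[101, 7592, 102], [101, 7592, 2088, 999, 102], [101, 102]]

dynamic = pad_batch(sequences)             # pad to the longest in the batch (5)
fixed = pad_batch(sequences, length=512)   # pad everything to a fixed maximum

dynamic_cells = sum(len(row) for row in dynamic["input_ids"])
fixed_cells = sum(len(row) for row in fixed["input_ids"])
print(dynamic_cells, fixed_cells)
print(dynamic["attention_mask"][2])
```

For this toy batch the model would process 15 token positions with dynamic padding versus 1536 with fixed-length padding, which is where the efficiency gain comes from.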

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) and F1 metrics (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

In [11]:
import evaluate

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
# precision = evaluate.load("precision")
# recall = evaluate.load("recall")

Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the accuracy and the weighted F1 score:

In [12]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    acc = accuracy.compute(predictions=predictions, references=labels)
    f1_score = f1.compute(predictions=predictions, references=labels, average="weighted")
    # pre = precision.compute(predictions=predictions, references=labels, average="weighted")
    # rec = recall.compute(predictions=predictions, references=labels, average="weighted")

    results = {"accuracy": acc['accuracy'], "f1": f1_score['f1']}

    return results

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
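To see what these two numbers mean, the following sketch recomputes them by hand in plain Python on made-up logits and labels. The helper names `accuracy_score` and `weighted_f1` are hypothetical, and the code follows the usual definition of support-weighted F1 (per-class F1 averaged with weights equal to the number of true examples of each class). It is not the library's implementation.

```python
def accuracy_score(preds, refs):
    """Fraction of predictions that match the reference labels."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


def weighted_f1(preds, refs):
    """Per-class F1, averaged with weights equal to each class's support."""
    total = 0.0
    for cls in sorted(set(refs) | set(preds)):
        tp = sum(p == cls and r == cls for p, r in zip(preds, refs))
        fp = sum(p == cls and r != cls for p, r in zip(preds, refs))
        fn = sum(p != cls and r == cls for p, r in zip(preds, refs))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        support = sum(r == cls for r in refs)  # number of true examples of this class
        total += f1 * support
    return total / len(refs)


# Made-up logits for six examples, two classes each, plus made-up labels.
logits = [[2.0, -1.0], [0.3, 0.9], [1.5, 0.2], [-0.4, 0.1], [0.8, -0.8], [-1.2, 2.2]]
labels = [0, 1, 1, 1, 0, 0]

# Same idea as np.argmax(predictions, axis=1): pick the index of the largest logit per row.
preds = [max(range(len(row)), key=row.__getitem__) for row in logits]

acc = accuracy_score(preds, labels)
f1_weighted = weighted_f1(preds, labels)
print(preds, acc, f1_weighted)
```

In this toy case four of six predictions are correct, and both classes end up with the same precision and recall, so accuracy and weighted F1 coincide. On imbalanced data the two can diverge, which is one reason to track both.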

## Train (Finetune the model)

Before you start training your model, create a map of the expected ids to their labels with `id2label` and `label2id`:

In [13]:
id2label = {0: "fake", 1: "real"}
label2id = {"fake": 0, "real": 1}

<Tip>

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! 

Load DistilBERT with [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSequenceClassification) along with the number of expected labels, and the label mappings:

In [14]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir`, which specifies where to save your model. At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the accuracy and F1 score and save the training checkpoint.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [30]:
num_epochs = 1
output_dir = "checkpoints/sample"
batch_size = 64

In [31]:
training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True
)
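As a rough sanity check on these settings, the number of batches per epoch follows from the split sizes printed earlier (24353 training and 8117 validation examples) and the batch size of 64. The sketch below uses a hypothetical `steps_per_epoch` helper and assumes a single device with no gradient accumulation unless stated otherwise. With several devices the effective batch size is the per-device size times the device count.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1):
    """Number of batches needed to see every example once (last batch may be partial)."""
    effective_batch_size = per_device_batch_size * num_devices
    return math.ceil(num_examples / effective_batch_size)


train_steps = steps_per_epoch(24353, 64)    # training split, one device (assumption)
eval_batches = steps_per_epoch(8117, 64)    # validation split, one device (assumption)
eval_batches_two_devices = steps_per_epoch(8117, 64, num_devices=2)  # hypothetical two-device setup

print(train_steps, eval_batches, eval_batches_two_devices)
```

Because `eval_strategy` and `save_strategy` are both `"epoch"`, evaluation and checkpointing happen once after those training steps complete.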

In [32]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

<Tip>

[Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) applies dynamic padding by default when you pass `tokenizer` to it. In this case, you don't need to specify a data collator explicitly.

</Tip>

In [None]:
trainer.train()

In [23]:
# save the best model
trainer.save_model(f"{output_dir}/best_model")

## Evaluation Metrics (on validation and test datasets)
- Accuracy
- F1 Score

In [54]:
# evaluate on validation set
val_result = trainer.evaluate()


In [25]:
val_result

{'eval_loss': 0.027922283858060837,
 'eval_accuracy': 0.9859554022422077,
 'eval_f1': 0.9859627860717326,
 'eval_runtime': 34.459,
 'eval_samples_per_second': 235.555,
 'eval_steps_per_second': 1.857,
 'epoch': 2.0}

In [None]:
# to csv
val_df = pd.DataFrame(val_result, index=[0])
val_df.to_csv(f"{output_dir}/val_result.csv", index=False)

In [None]:
# evaluate on test set
test_result = trainer.evaluate(eval_dataset=tokenized_dataset["test"])

In [39]:
test_result

{'eval_loss': 0.029379427433013916,
 'eval_accuracy': 0.986694591597881,
 'eval_f1': 0.9866952574111764,
 'eval_runtime': 34.4597,
 'eval_samples_per_second': 235.55,
 'eval_steps_per_second': 1.857,
 'epoch': 1.0}

In [40]:
# to csv
test_df = pd.DataFrame(test_result, index=[0])
test_df.to_csv(f"{output_dir}/test_result.csv", index=False)
test_df  # display the single-row summary

   eval_loss  eval_accuracy   eval_f1  eval_runtime  eval_samples_per_second  \
0   0.029379       0.986695  0.986695       34.4597                   235.55   

   eval_steps_per_second  epoch  
0                  1.857    1.0  


## Inference

Great, now that you've finetuned a model, you can use it for inference!

Pick a device and grab some text you'd like to run inference on:

In [48]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")

Device: cuda


In [49]:
text = test_dataset[0]["text"]
print(f"Text: {text}")

Text: JOE DIGENOVA has been around D.C for decades and has seen it all. He probably didn t see his one coming. The incoming president  was set-up to be taken down. A soft coup is in the works and DiGenova has this to say about it:"It's very clear that they conspired to frame the incoming President of the United States."  Joe diGenova on allegations of anti-Trump bias at FBI and TheJusticeDept #Tucker https://t.co/qUNjAenzJc pic.twitter.com/VDlhb45Ghi  G. Ashley Hawkins (@g_ashleyhawkins) December 16, 2017DiGenova on Tucker Carlson tonight: Inside the FBI and Department of Justice under Obama was a brazen plot to do two things. To exonerate Hillary Clinton because of an animous for Donald Trump, and then if she lost to frame the incoming president for either a criminal act or impeachment. This is one of the most disgusting performances by the senior officials at the FBI and the Department of Justice that everyone of these agents should be fired and the people who are still in the Justic

In [50]:
from transformers import pipeline

classifier = pipeline("text-classification", model=f"{output_dir}/best_model", truncation=True, device=device)
classifier(text)

[{'label': 'fake', 'score': 0.9989815354347229}]

You can also manually replicate the results of the `pipeline` if you'd like:

Tokenize the text and return PyTorch tensors:

In [51]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(f"{output_dir}/best_model")
inputs = tokenizer(text, return_tensors="pt", truncation=True)
print(f"Input keys: {inputs.keys()}")
print(f"Input: {inputs}")

Input keys: dict_keys(['input_ids', 'attention_mask'])
Input: {'input_ids': tensor([[  101,  3533, 10667, 16515,  3567,  2038,  2042,  2105,  1040,  1012,
          1039,  2005,  5109,  1998,  2038,  2464,  2009,  2035,  1012,  2002,
          2763,  2134,  1056,  2156,  2010,  2028,  2746,  1012,  1996, 14932,
          2343,  2001,  2275,  1011,  2039,  2000,  2022,  2579,  2091,  1012,
          1037,  3730,  8648,  2003,  1999,  1996,  2573,  1998, 10667, 16515,
          3567,  2038,  2023,  2000,  2360,  2055,  2009,  1024,  1000,  2009,
          1005,  1055,  2200,  3154,  2008,  2027,  9530, 13102, 27559,  2000,
          4853,  1996, 14932,  2343,  1997,  1996,  2142,  2163,  1012,  1000,
          3533, 10667, 16515,  3567,  2006,  9989,  1997,  3424,  1011,  8398,
         13827,  2012,  8495,  1998,  1996, 29427,  6610,  3207, 13876,  1001,
          9802, 16770,  1024,  1013,  1013,  1056,  1012,  2522,  1013, 24209,
          2078,  3900,  2368,  2480,  3501,  2278, 2726

Pass your inputs to the model and return the `logits`:

In [52]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(f"{output_dir}/best_model")
with torch.no_grad():
    logits = model(**inputs).logits
print(f"Logits: {logits}")

Logits: tensor([[ 3.1090, -3.7794]])


Get the class with the highest probability, and use the model's `id2label` mapping to convert it to a text label:

In [53]:
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]

'fake'
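The logits printed above can also be turned into a probability by applying a softmax, which is how a single-label classification score is normally derived. The sketch below does this in plain Python using the rounded logit values from the output above, so the result only approximates the score reported by the `pipeline`. The `softmax` helper is written here for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "fake", 1: "real"}
logits = [3.1090, -3.7794]  # rounded values copied from the printed logits

probs = softmax(logits)
predicted_id = max(range(len(probs)), key=probs.__getitem__)  # same role as argmax
print(id2label[predicted_id], round(probs[predicted_id], 4))
```

The probability for `fake` comes out at roughly 0.999, in line with the score the `pipeline` returned for the same text.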