<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Sequence-Classification" data-toc-modified-id="Sequence-Classification-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Sequence Classification</a></span><ul class="toc-item"><li><span><a href="#Freezing-Model" data-toc-modified-id="Freezing-Model-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Freezing Model</a></span></li></ul></li></ul></div>

## Sequence Classification

Import training algorithms: we are using "DistilBert" flavor for `SequenceClassification` because of speed.
`DistilBertTokenizerFast` is also applied to leverage speed for tokenization. `Collator` is applied to create batch of data for training pipeline. The last import is pipeline object that we can use Hugging face but here we want to use to run our own fine-tuned models.

The `datasets` library, a companion to Transformers, we're going to be importing `load_metric`, allowing us to create our own custom metrics while evaluating and training our pipelines, and the `Dataset` object, is our general collection holder for all of our data points.

In [125]:
from transformers import Trainer, TrainingArguments, DistilBertForSequenceClassification, DistilBertTokenizerFast, \
     DataCollatorWithPadding, pipeline
import numpy as np
from sklearn.preprocessing import LabelEncoder
from datasets import load_metric, Dataset

>**Fake News Data Set**
Fake News Data Set is used to fine-tune BERT model. The data can be downloaded from [Kaggle](https://www.kaggle.com/datasets/jillanisofttech/fake-or-real-news). 


In [126]:
data_news = pd.read_csv('./Data/fake_or_real_news.csv')

data_news[:15]

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
5,6903,"Tehran, USA","\nI’m not an immigrant, but my grandparents ...",FAKE
6,7341,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE
7,95,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL
8,4869,Fact check: Trump and Clinton at the 'commande...,Hillary Clinton and Donald Trump made some ina...,REAL
9,2909,Iran reportedly makes new push for uranium con...,Iranian negotiators reportedly have made a las...,REAL


In [127]:
# This code segment parses the data_news dataset into a more manageable format

titles = []
tokenized_titles = []
sequence_labels = data_news['label']

title, tokenized_title =  [], []
for news in data_news['title']:
    title.append(news)
    tokenized_title.append(news.split(' '))
    

In [128]:
# Python list for each news
title[0], tokenized_title[0], sequence_labels[0]

('You Can Smell Hillary’s Fear',
 ['You', 'Can', 'Smell', 'Hillary’s', 'Fear'],
 'FAKE')

In [129]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')

In [130]:
unique_sequence_labels = list(set(sequence_labels))
unique_sequence_labels

['REAL', 'FAKE']

There are two categories to predict.

In [131]:
sequence_labels = [unique_sequence_labels.index(l) for l in sequence_labels]

print(f'There are {len(unique_sequence_labels)} unique sequence labels')

There are 2 unique sequence labels


Our final python list is going to be something like this:

In [132]:
print(tokenized_title[0])
print(title[0])
print(sequence_labels[0])
print(unique_sequence_labels[sequence_labels[0]])

['You', 'Can', 'Smell', 'Hillary’s', 'Fear']
You Can Smell Hillary’s Fear
1
FAKE


After getting all data, we put it in dataset object. Then we can have train-test split by `train_test_split`.

In [133]:
news_dataset = Dataset.from_dict(
    dict(
        titles=title, 
        label=sequence_labels,
        tokens=tokenized_title,
    )
)
news_dataset = news_dataset.train_test_split(test_size=0.2)

news_dataset

DatasetDict({
    train: Dataset({
        features: ['titles', 'label', 'tokens'],
        num_rows: 5068
    })
    test: Dataset({
        features: ['titles', 'label', 'tokens'],
        num_rows: 1267
    })
})

Here is first element of our training set:

In [134]:
news_dataset['train'][0]

{'titles': 'Will the GOP Mount a Third-Party Challenge to Trump?',
 'label': 0,
 'tokens': ['Will',
  'the',
  'GOP',
  'Mount',
  'a',
  'Third-Party',
  'Challenge',
  'to',
  'Trump?']}

Next is to instantiate tokenizer with `DistilBertTokenizerFast` from 'distilbert-base-uncased'. FYI, **uncased** means lower or upper case of the words do not matter.

In [135]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

Create a pre-process function to take in a batch of titles and tokenize them with `DistilBertTokenizerFast`. The question is why we are tokenizing the title if we already have the tokens. The answer is we do not necessarily know the tokens that has been given to us will match up with tokenized version for BERT.

In [136]:
def preprocess_function(examples):
    return tokenizer(examples["titles"], truncation=True) # truncation=True makes sure to exludes instances with more 
                                                            # 512 tokens

Map the tokenizer function for the entire data set:

In [137]:
# go over all our data set, tokenize them
seq_clf_tokenized_news = news_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/5068 [00:00<?, ? examples/s]

Map:   0%|          | 0/1267 [00:00<?, ? examples/s]

In [138]:
news_dataset

DatasetDict({
    train: Dataset({
        features: ['titles', 'label', 'tokens'],
        num_rows: 5068
    })
    test: Dataset({
        features: ['titles', 'label', 'tokens'],
        num_rows: 1267
    })
})

Looking at the first item, we also have `input_ids` and `attention_mask`. These are the items we are going to need in our model. 

In [139]:
seq_clf_tokenized_news['train'][0]

{'titles': 'Will the GOP Mount a Third-Party Challenge to Trump?',
 'label': 0,
 'tokens': ['Will',
  'the',
  'GOP',
  'Mount',
  'a',
  'Third-Party',
  'Challenge',
  'to',
  'Trump?'],
 'input_ids': [101,
  2097,
  1996,
  2175,
  2361,
  4057,
  1037,
  2353,
  1011,
  2283,
  4119,
  2000,
  8398,
  1029,
  102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

`DataCollatorWithPadding` creates batch of data. It also dynamically pads text to the length of the longest element in the batch (on the right), making them all the same length. It's possible to pad your text in the tokenizer function with `padding=True`, dynamic padding is more efficient. This will make the training process faster.

In [140]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Data Collator will pad data so that all examples are the same input length. Attention mask is how we ignore attention scores for padding tokens

It is now time to create our actual model. 

In [141]:
sequence_clf_model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', 
                                                                         num_labels=len(unique_sequence_labels),)

# set an index -> label dictionary
sequence_clf_model.config.id2label = {i: l for i, l in enumerate(unique_sequence_labels)}

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier

Every model comes with a `config`. In this `config`, there is `id2label` attribute which is a dictionary that has integer as keys and string as values. See below:

In [142]:
sequence_clf_model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [217]:
sequence_clf_model.config

DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "REAL",
    "1": "FAKE"
  },
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.26.1",
  "vocab_size": 30522
}

In [218]:
sequence_clf_model.config.id2label[0]

'REAL'

Now it is the time to have a costume metric. HuggingFace always uses loss as performance metric but we need to calculate accuracy as a simpler metric.

In [219]:
metric = load_metric("accuracy")

def compute_metrics(eval_pred):  # common method to take in logits and calculate accuracy of the eval set
    logits, labels = eval_pred   # logit and label are returning from training loop
    predictions = np.argmax(logits, axis=-1) 
    return metric.compute(predictions=predictions, references=labels) # compute the accuracy

Take pre-trained knowledge of BERT and transfer that knowledge to our supervised data set by not training too many epochs. The code block below is going to repeat itself again and again because it define our training loop.

In [221]:
epochs = 2

# Training argument
training_args = TrainingArguments(
    output_dir="./news_clf/results", # Local directory to save check point of our model as fitting
    num_train_epochs=epochs,         # minimum of two epochs
    per_device_train_batch_size=32,  # batch size for training and evaluation, it common to take around 32, 
    per_device_eval_batch_size=32,   # sometimes less or more, The smaller batch size, the more change model update 
    load_best_model_at_end=True,     # Even if we overfit the model by accident, load the best model through checkpoint
    
    # some deep learning parameters that the trainer is able to take in
    warmup_steps = len(seq_clf_tokenized_news['train']) // 5,  # learning rate scheduler by number of warmup steps
    weight_decay = 0.05,    # weight decay for our learning rate schedule (regularization)
    
    logging_steps = 1,  # Tell the model minimum number of steps to log between (1 means logging as much as possible)
    log_level = 'info',
    evaluation_strategy = 'epoch', # It is "steps" or "epoch", we choose epoch: how many times to stop training to test
    eval_steps = 50,
    save_strategy = 'epoch'  # save a check point of our model after each epoch
)

# Define the trainer:
trainer = Trainer(
    model=sequence_clf_model,   # take our model (sequence_clf_model)
    args=training_args,         # we just set it above
    train_dataset=seq_clf_tokenized_news['train'], # training part of dataset
    eval_dataset=seq_clf_tokenized_news['test'],   # test (evaluation) part of dataset
    compute_metrics=compute_metrics,    # This part is optional but we want to calculate accuracy of our model 
    data_collator=data_collator         # data colladior with padding. Infact, we may or may not need a data collator
                                        # we can check the model to see how it lookes like with or without the collator
)

Before we start training, we can run the trainer **without fine-tune model** to measure performance of the model

In [222]:
# Get initial metrics: evaluation on test set
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1267
  Batch size = 32
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 0.6958811283111572,
 'eval_accuracy': 0.4593528018942384,
 'eval_runtime': 42.9697,
 'eval_samples_per_second': 29.486,
 'eval_steps_per_second': 0.931}

We hope the initial loss and accuracy will improve after training. Since we have not fine-tuned the model yet, the metric is random guessing. The feed-forward layer on top of the model has not been updated yet.

In [223]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5068
  Num Epochs = 2
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 318
  Number of trainable parameters = 66955010


Epoch,Training Loss,Validation Loss,Accuracy
1,0.5282,0.53455,0.748224
2,0.361,0.385628,0.827151


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1267
  Batch size = 32
Saving model checkpoint to ./news_clf/results\checkpoint-159
Configuration saved in ./news_clf/results\checkpoint-159\config.json
Model weights saved in ./news_clf/results\checkpoint-159\pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1267
  Batch size = 32
Saving model checkpoint to ./news_clf/results\checkpoint-318
Configuratio

TrainOutput(global_step=318, training_loss=0.5511337874351807, metrics={'train_runtime': 1213.1417, 'train_samples_per_second': 8.355, 'train_steps_per_second': 0.262, 'total_flos': 80688169304784.0, 'train_loss': 0.5511337874351807, 'epoch': 2.0})

The evaluation loss decreases dramatically after first epoch and accuracy jumped hugely from 0.45 to 0.74.

In [224]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1267
  Batch size = 32


{'eval_loss': 0.385628342628479,
 'eval_accuracy': 0.8271507498026835,
 'eval_runtime': 42.1237,
 'eval_samples_per_second': 30.078,
 'eval_steps_per_second': 0.95,
 'epoch': 2.0}

In [240]:
# make a pipline by passing in our fine-tuned model with tokenizer
pipe = pipeline("text-classification", model=sequence_clf_model, tokenizer=tokenizer)
pipe('Please add Here We Go by Dispatch to my road trip playlist')

[{'label': 'FAKE', 'score': 0.8936770558357239}]

In [226]:
# We can save our model on drirectory we specified
trainer.save_model()

Saving model checkpoint to ./news_clf/results
Configuration saved in ./news_clf/results\config.json
Model weights saved in ./news_clf/results\pytorch_model.bin


We can easily call our pipline directly from directory. This very useful for deploying our model on the cloud with one line of the code. We can use it with exact same way to get the exact result.

In [242]:
pipe = pipeline("text-classification", "./news_clf/results", tokenizer=tokenizer)

loading configuration file ./news_clf/results\config.json
Model config DistilBertConfig {
  "_name_or_path": "./news_clf/results",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "REAL",
    "1": "FAKE"
  },
  "initializer_range": 0.02,
  "label2id": null,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "vocab_size": 30522
}

loading configuration file ./news_clf/results\config.json
Model config DistilBertConfig {
  "_name_or_path": "./news_clf/results",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "a

In [243]:
text = 'The Battle of New York: Why This Primary Matters'
pipe(text)

[{'label': 'REAL', 'score': 0.9087793827056885}]

In [244]:
text = """Breaking News: Researchers have discovered a new species of dinosaur that 
        can breathe fire. The creature, named Pyrodino, is believed to have lived 
        during the Jurassic period and could shoot flames out of its nostrils, 
        making it one of the deadliest predators of its time."""
pipe(text)

[{'label': 'FAKE', 'score': 0.9160932302474976}]

### Freezing Model 

Up to now we updated all parameters, that is why it takes too much time. Below we freeze all our BERT model except for the classification layer. This is our third option that we freeze all our pre-trained model and only train a layer on top of it.

In [248]:
# Instantiate a new distilbert model
frozen_sequence_clf_model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased', 
                                                                        num_labels=len(unique_sequence_labels),)

loading configuration file config.json from cache at C:\Users\mrezv/.cache\huggingface\hub\models--distilbert-base-cased\snapshots\4dc145c5bd4fdb672dcded7fdc1efd6c2bc55992\config.json
Model config DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.26.1",
  "vocab_size": 28996
}

loading weights file pytorch_model.bin from cache at C:\Users\mrezv/.cache\huggingface\hub\models--distilbert-base-cased\snapshots\4dc145c5bd4fdb672dcded7fdc1efd6c2bc55992\pytorch_model.bin
Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform

We are going to freeze every parameter in the model. the easiest way to freeze is to iterate over all `distilbert.parameters()` and make them as False. It only updates pre_classifier

In [249]:
for param in frozen_sequence_clf_model.distilbert.parameters():
    param.requires_grad = False   # "False" makes the parameters unable to update. "grad" stands for gradient
                                  # it never upgrade during training        

By running the code above, the only layer allowed to be updated is below:

`(pre_classifier): Linear(in_features=768, out_features=768, bias=True)`

`(classifier): Linear(in_features=768, out_features=2, bias=True)`

`(dropout): Dropout(p=0.2, inplace=False)`

This leads to much faster training but will yield worse result.

In [250]:
frozen_sequence_clf_model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [251]:
from sklearn.metrics import roc_auc_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions[:, 1]

    # Calculate the AUC score
    auc_score = roc_auc_score(labels, preds)

    # Calculate the true positive, false positive, false negative, and true negative values
    tp = ((preds >= 0.5) & (labels == 1)).sum()
    fp = ((preds >= 0.5) & (labels == 0)).sum()
    fn = ((preds < 0.5) & (labels == 1)).sum()
    tn = ((preds < 0.5) & (labels == 0)).sum()

    # Calculate the precision, recall, and F1 score
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = 2 * (precision * recall) / (precision + recall)

    return {
        'precision': precision,
        'recall': recall,
        'f1_score': f1_score,
        'auc_score': auc_score,
        'tp': tp,
        'fp': fp,
        'fn': fn,
        'tn': tn,
    }

In [252]:
epochs = 2

# Training argument
training_args = TrainingArguments(
    output_dir="./news_clf/results",
    num_train_epochs=epochs,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    load_best_model_at_end=True,
    
    # some deep learning parameters that the Trainer is able to take in
    warmup_steps = len(seq_clf_tokenized_news['train']) // 5,  # number of warmup steps for learning rate scheduler,
    weight_decay = 0.05,
    
    logging_steps = 1, 
    log_level = 'info',
    evaluation_strategy = 'epoch',
    eval_steps = 50,
    save_strategy = 'epoch'
)

# Define the trainer:

trainer = Trainer(
    model=sequence_clf_model,
    args=training_args,
    train_dataset=seq_clf_tokenized_news['train'],
    eval_dataset=seq_clf_tokenized_news['test'],
    compute_metrics=compute_metrics,
    data_collator=data_collator  # data colladior with padding
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [253]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1267
  Batch size = 32


{'eval_loss': 0.385628342628479,
 'eval_precision': 0.7671232876712328,
 'eval_recall': 0.9195402298850575,
 'eval_f1_score': 0.8364451082897684,
 'eval_tp': 560,
 'eval_fp': 170,
 'eval_fn': 49,
 'eval_tn': 488,
 'eval_runtime': 41.9918,
 'eval_samples_per_second': 30.173,
 'eval_steps_per_second': 0.953}

In [254]:
trainer.train()  # ~23min -> ~6min on my laptop with all of distilbert frozen with a worse loss/accuracy

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5068
  Num Epochs = 2
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 318
  Number of trainable parameters = 66955010


Epoch,Training Loss,Validation Loss,Precision,Recall,F1 Score,Tp,Fp,Fn,Tn
1,0.2412,0.367865,0.86067,0.801314,0.829932,488,79,121,579
2,0.2677,0.393928,0.818862,0.898194,0.856695,547,121,62,537


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1267
  Batch size = 32
Saving model checkpoint to ./news_clf/results\checkpoint-159
Configuration saved in ./news_clf/results\checkpoint-159\config.json
Model weights saved in ./news_clf/results\checkpoint-159\pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1267
  Batch size = 32
Saving model checkpoint to ./news_clf/results\checkpoint-318
Configuratio

TrainOutput(global_step=318, training_loss=0.24578225891451416, metrics={'train_runtime': 1191.2413, 'train_samples_per_second': 8.509, 'train_steps_per_second': 0.267, 'total_flos': 80688169304784.0, 'train_loss': 0.24578225891451416, 'epoch': 2.0})

In [255]:
trainer.evaluate()


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1267
  Batch size = 32


{'eval_loss': 0.3678646683692932,
 'eval_precision': 0.8606701940035273,
 'eval_recall': 0.8013136288998358,
 'eval_f1_score': 0.8299319727891158,
 'eval_tp': 488,
 'eval_fp': 79,
 'eval_fn': 121,
 'eval_tn': 579,
 'eval_runtime': 41.6822,
 'eval_samples_per_second': 30.397,
 'eval_steps_per_second': 0.96,
 'epoch': 2.0}

<span class="mark">This is a general rule: when we update the entire model we will get slower run but higher performance. If we freeze the entire BERT model, it gives much faster training time but probably will get worse result. There is middle ground that we freeze part of model to see how it will work for us.</span>