# Hugging Face
Huggign Face is an american company which develops tools for building application using machine learning. It is a website where people can share their ML models. It is most notably known for it’s transformers library which is used to perform different NLP tasks. 

This is a python notebook based on this playlist - https://www.youtube.com/playlist?list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o 

## The Pipeline Function
The pipeline function is the most high level API that Hugging Face library offers. 

In [5]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I hade a healthy breakfast this morning")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9974707365036011}]

In [11]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about trandformers library",
    candidate_labels=["education", "politics", "business"]
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about trandformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.7521424889564514, 0.1648690402507782, 0.08298840373754501]}

In [12]:
# Use pipeline with custom library

In [13]:
from transformers import pipeline

generator = pipeline('text-generation', model='distilgpt2')
generator(
    "India is a very",
    max_length=30,
    num_return_sequences=2,
)

Downloading (…)lve/main/config.json: 100%|██████████| 762/762 [00:00<00:00, 60.1kB/s]
Downloading model.safetensors: 100%|██████████| 353M/353M [00:30<00:00, 11.7MB/s] 
  with safe_open(checkpoint_file, framework="pt") as f:
  return self.fget.__get__(instance, owner)()
  storage = cls(wrap_storage=untyped_storage)
  with safe_open(filename, framework="pt", device=device) as f:
Downloading (…)neration_config.json: 100%|██████████| 124/124 [00:00<00:00, 9.59kB/s]
Downloading (…)olve/main/vocab.json: 100%|██████████| 1.04M/1.04M [00:00<00:00, 4.05MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 2.44MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 1.36M/1.36M [00:00<00:00, 4.90MB/s]
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'India is a very strong player in the NBA Draft, and their team has tremendous potential at this point in time.\n\n\n\n\n\n\n'},
 {'generated_text': 'India is a very good choice and is a very attractive value for investors and companies to invest in. However it could have been different, especially in India'}]

Use the save_pretrained() method to save the configs, model weights and vocabulary:

classifier.save_pretrained('/some/directory')  

### Different pipelines
Text classification
<br>
Zero-shot classification
<br>
Text Generation
<br>
Text completion
<br>
Token classification
<br>
Question answering
<br>
Summatization
<br>
Translation

## Transfer learning 

Transfer learning is basically finetuning a existing model. When we train a model from scratch we randomly initialize the weights of the model. In fine tuning/ transfer learning we use the weights of some pretrained model. 

Transfer learning has been succesfully been used in Image datasets but it is fairly new in NLP tasks. It works great on NLP tasks as well but it has a problem of being biased to the previous model. If a model is trained more on US data then the fine tuned has more bais towards the US english linguistic characteristics.

## Transformer Architecture

The Transformer architecture consist of 2 parts, encoder and decoder.
Both the encoder and decoder can run as independent components or can be combined together.

**The Encoder** is bi-directional model in the sense that when generating a vector for a word it takes context from the previous as well as the next word. It uses the self attention mechanism. It outputs the one vector for one input word 
<be>
Encoder examples - BERT, RoBERTa, ALBERT
<br>
Encoder are best for extracting meaning information, NLU - Natural Language Understanding, Sequence classification (sentiment analysis), question answering, masked language modeling 

**The Decoder** is a uni directional model in the sense that when predicting the next word it takes context only from the previously generated output.The output generated from the previous imput is added to the new input using auto-regressive method. It uses masked self-attention mechanism. It can generate many words from a given input sequence. 
<br>
Decoder Examples - GPT-2, GPT Neo
<br>
Decoders are best for Natural Language generation

Combining both are best for many-to-many tasks.<br>Weights are not necessarily shared between encoders and decoders.<br>Input distribution is different form output distribution.<br>Best for Translation tasks where we have to understand the meaning of the sentence to generate output or summarization.

## What happens inside pipeline function

The pipeline consists of 3 stages 

**Tokenizer** -> **Model** -> **Postprocessing**

Raw text (adding special tokens for start and end) -> **Tokenizer** -> Input ID's [100, 4054, ...]

In [3]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life",
    "I hate this so much",
]
inputs = tokenizer(raw_inputs, padding=True, return_tensors="pt") #pt = pytorch
inputs

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   102,     0,     0,     0,
             0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

Input ID's [100, 4054, ...] -> **Model** -> Logits [-4.3343, 4.4343]

In [5]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape) # returns [batch size, sequence length, hidden size]

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([2, 15, 768])


In [9]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits)

tensor([[-1.4683,  1.5105],
        [ 4.2141, -3.4158]], grad_fn=<AddmmBackward0>)


Logits [-4.3343, 4.4343] -> **Postprocessing** -> Predictions [Positive : 99%, Negative : 0.11%]

In [7]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.8393e-02, 9.5161e-01],
        [9.9951e-01, 4.8549e-04]], grad_fn=<SoftmaxBackward0>)


In [11]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

## Tokenizers Overview

* Word-based Tokenizers <br>
* Character-based Tokenizers <br>
* Subword-based Tokenizers

**Word-based Tokenizers** <br>
Each word has a specific token. <br>
Issues - Dog and Dogs have similar meaning but have completely different tokens so relaiton between them is not present in tokens <br>
Vocabulary size can become very large - 170,000 words in the english language. We can solve this issue by using only the most frequent words and assiging all other words as other-token but then the meaning of all the other-tokens is not captured.

**Character-based Tokenizers** <br>
Each Character has a specific token. <br> 
Has less dictionary size and all the words in english are included, no need of other-token<br>
Issues - Characters do not hold as much information as words and the sequence form a very large input vector to the model so size of context can reduce. <br>


**Subword-based Tokenizers** <br>
It relies on following method -<be> 
Frequently used words should not be split into smaller subwords. <br>
Rare words should be decomposed into meaningful subwords. <br>
tokenization -> 'token' 'ization' <br>
'token' can relate to -> token, tokens, tokenizing. <br>
'ization' can relate to Modernization, Immunization. <br>
BERT -> 'Let's try to tokenize!' -> [[CLS],let,',s,try,to,token,##ize',!,[SEP]]<br>
// ## is used to show as a completing word.

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Let's try to tokenize")
print(tokens)

['let', "'", 's', 'try', 'to', 'token', '##ize']


In [6]:
tokenizer = AutoTokenizer.from_pretrained("albert-base-v1")
tokens = tokenizer.tokenize("Let's try to tokenize")
print(tokens)
print(len(tokens))

['▁let', "'", 's', '▁try', '▁to', '▁to', 'ken', 'ize']
8


In [12]:
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)
final_inputs = tokenizer.prepare_for_model(input_ids)
print(final_inputs['input_ids'])
print(final_inputs['attention_mask'])

[408, 22, 18, 1131, 20, 20, 2853, 2952]
[2, 408, 22, 18, 1131, 20, 20, 2853, 2952, 3]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


> If there are two sentences of different length then padding is added to the shorter sentence so the input vector to the model is rectangular shape

In [12]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sentences = [
    "I've been waiting for a HuggingFace course my whole life",
    "I hate this so much",
]

print(tokenizer(sentences, padding=True))
print(type(tokenizer(sentences, padding=True)))
print(tokenizer(sentences, padding=True).input_ids)
print(tokenizer(sentences, padding=True)['input_ids'])

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 102], [101, 1045, 5223, 2023, 2061, 2172, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
<class 'transformers.tokenization_utils_base.BatchEncoding'>
[[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 102], [101, 1045, 5223, 2023, 2061, 2172, 102, 0, 0, 0, 0, 0, 0, 0, 0]]
[[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 102], [101, 1045, 5223, 2023, 2061, 2172, 102, 0, 0, 0, 0, 0, 0, 0, 0]]


## Datasets Library 

The Hugging Face datasets library provides and API to download many publicv datasets and preprocess them

In [17]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

Found cached dataset glue (C:/Users/pradhumn/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|██████████| 3/3 [00:00<00:00, 747.07it/s]


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [27]:
raw_datasets["train"][:3]

{'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
  "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
  'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .'],
 'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
  "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .",
  "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale ."],
 'label': [1, 0, 1],
 'idx': [0, 1, 2]}

In [29]:
raw_datasets["train"].features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [39]:
from transformers import AutoTokenizer

checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(
        example["sentence1"], example["sentence1"], padding="max_length", truncation=True, max_length=128
    )

tokenized_dataset = raw_datasets.map(tokenize_function, batched=True)
# Batched processes several elements at same time
# reduces tokenizing time 
print(tokenized_dataset.column_names) 

Loading cached processed dataset at C:\Users\pradhumn\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-d31c72c677462005.arrow
Loading cached processed dataset at C:\Users\pradhumn\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-150be474be28aeef.arrow
                                                                 

{'train': ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'], 'validation': ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'], 'test': ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask']}




In [32]:
tokenized_dataset["train"][:1]

{'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .'],
 'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'],
 'label': [1],
 'idx': [0],
 'input_ids': [[101,
   7277,
   2180,
   5303,
   4806,
   1117,
   1711,
   117,
   2292,
   1119,
   1270,
   107,
   1103,
   7737,
   107,
   117,
   1104,
   9938,
   4267,
   12223,
   21811,
   1117,
   2554,
   119,
   102,
   7277,
   2180,
   5303,
   4806,
   1117,
   1711,
   117,
   2292,
   1119,
   1270,
   107,
   1103,
   7737,
   107,
   117,
   1104,
   9938,
   4267,
   12223,
   21811,
   1117,
   2554,
   119,
   102,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   

In [40]:
tokenized_dataset = tokenized_dataset.remove_columns(["idx", "sentence1", "sentence2"])
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")
tokenized_dataset = tokenized_dataset.with_format("torch")
tokenized_dataset["train"]

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 3668
})

## Preprocessing pairs of sencences

Since we have to process pairs of sencences for many tasks, the Hugging Face library has a method inbuilt for that.

In [49]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer("My name is Pradhumn", "I have a cat")
#token type ids tells - 0 is for 1st sentence and 1 is for second sentence

{'input_ids': [101, 2026, 2171, 2003, 10975, 4215, 28600, 2078, 102, 1045, 2031, 1037, 4937, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

To add senctences of different length into same batch we have to add padding to the sentences

Padding can be added in two ways -
* Make all sentences to the size of the largest sentence in Dataset 
    * There is lot of resundent padding
    * All batches have same shape 
    * Need fixed shape batches on TPU
* Make all sentences to the size of the largest sentence in the Batch
    * There is less redundent padding 
    * All batches are of different shape
    * Will be faster on CPU and GPU

### All batchs of same size 

In [51]:
from transformers import AutoTokenizer

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(
        example["sentence1"], example["sentence1"], padding="max_length", truncation=True, max_length=128
    )

tokenized_dataset = raw_datasets.map(tokenize_function, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(["idx", "sentence1", "sentence2"])
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")
tokenized_dataset = tokenized_dataset.with_format("torch")

Found cached dataset glue (C:/Users/pradhumn/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|██████████| 3/3 [00:00<00:00, 1002.62it/s]
Loading cached processed dataset at C:\Users\pradhumn\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-d31c72c677462005.arrow
Loading cached processed dataset at C:\Users\pradhumn\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-150be474be28aeef.arrow
Loading cached processed dataset at C:\Users\pradhumn\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-87f7e66c219a9c61.arrow


In [52]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(tokenized_dataset["train"], batch_size=16,  shuffle=True)

for step, batch in enumerate(train_dataloader):
    print(batch["input_ids"].shape)
    if step>5:
        break

torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])


### All batches of different size

In [57]:
from transformers import AutoTokenizer

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(
        example["sentence1"], example["sentence1"], truncation=True
    )
# Removed max_length, padding="max_length"

tokenized_dataset = raw_datasets.map(tokenize_function, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(["idx", "sentence1", "sentence2"])
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")
tokenized_dataset = tokenized_dataset.with_format("torch")

Found cached dataset glue (C:/Users/pradhumn/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|██████████| 3/3 [00:00<00:00, 752.07it/s]
                                                                  

In [58]:
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

dataCollator = DataCollatorWithPadding(tokenizer)
train_dataloader = DataLoader(
    tokenized_dataset["train"], batch_size=16,  shuffle=True, collate_fn=dataCollator
)

for step, batch in enumerate(train_dataloader):
    print(batch["input_ids"].shape)
    if step>5:
        break

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


torch.Size([16, 77])
torch.Size([16, 87])
torch.Size([16, 87])
torch.Size([16, 77])
torch.Size([16, 71])
torch.Size([16, 83])
torch.Size([16, 79])


## The Trainer API

The transformers library provides a trainers API which hepls to easiy fine-tune transformers models on out own datasets. <br>
The trainer API takes our datasets, our models as well as the training hyperparameters and can perform training on and kind of setup(CPU, GPU, multiGPU, TPU). <br>
Does prediciton on any datasets and if provided metrics can evaluate model on any dataser

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer( example["sentence1"], example["sentence1"], truncation=True )

tokenized_dataset = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer)

Found cached dataset glue (C:/Users/pradhumn/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|██████████| 3/3 [00:00<00:00, 752.21it/s]
Loading cached processed dataset at C:\Users\pradhumn\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-e14d45524308013e.arrow
Loading cached processed dataset at C:\Users\pradhumn\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-b689f96290250e14.arrow
Loading cached processed dataset at C:\Users\pradhumn\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-003a51adc08b2ffc.arrow


In [3]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

  with safe_open(checkpoint_file, framework="pt") as f:
  return self.fget.__get__(instance, owner)()
  storage = cls(wrap_storage=untyped_storage)
  with safe_open(filename, framework="pt", device=device) as f:
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a 

In [4]:
from transformers import TrainingArguments
training_args = TrainingArguments(
    "test-trainer",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    learning_rate=2e-5,
    weight_decay=0.01,
    report_to="none"
)
# report to none very important or will try to connect to wnadb
# https://github.com/huggingface/transformers/issues/16594 

In [5]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()

  0%|          | 0/1150 [00:00<?, ?it/s]You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
 43%|████▎     | 500/1150 [23:42<31:44,  2.93s/it] 

{'loss': 0.5758, 'learning_rate': 1.1304347826086957e-05, 'epoch': 2.17}


 87%|████████▋ | 1000/1150 [46:10<06:59,  2.80s/it]

{'loss': 0.2701, 'learning_rate': 2.6086956521739132e-06, 'epoch': 4.35}


100%|██████████| 1150/1150 [52:59<00:00,  2.77s/it]

{'train_runtime': 3179.809, 'train_samples_per_second': 5.768, 'train_steps_per_second': 0.362, 'train_loss': 0.3842073971292247, 'epoch': 5.0}





TrainOutput(global_step=1150, training_loss=0.3842073971292247, metrics={'train_runtime': 3179.809, 'train_samples_per_second': 5.768, 'train_steps_per_second': 0.362, 'train_loss': 0.3842073971292247, 'epoch': 5.0})