# Efficient training and multitask serving with OpenDelta

### PART 1
In this notebook, you will see three interesting tasks built around a question provided by a user.
1. We load a Delta to correct the spelling in the question.

2. We load another Delta to recognize the question's topic.

3. We load another Delta to answer the question.

All three functions share exactly the same backbone model: T5-large. You will learn how to use OpenDelta for multitask serving with roughly 1/N of the GPU RAM (N is the number of tasks) that serving N separately finetuned models would need.
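
A back-of-the-envelope calculation makes the 1/N claim concrete. The sketch below assumes the backbone occupies about 2.75 GB (the size of the T5-large checkpoint downloaded later in this notebook) and that each delta adds about 0.04% of the backbone's parameters (the ratio logged for the low-rank adapter in PART2). The function name is hypothetical, and other delta types will have different ratios.

```python
def serving_memory_gb(backbone_gb, n_tasks, delta_ratio):
    """Return (memory for N finetuned copies, memory for one backbone plus N deltas)."""
    full_copies = n_tasks * backbone_gb
    shared_backbone = backbone_gb * (1 + n_tasks * delta_ratio)
    return full_copies, shared_backbone


# Assumed figures: 2.75 GB backbone, 3 tasks, delta ratio of 0.041213%.
full, shared = serving_memory_gb(backbone_gb=2.75, n_tasks=3, delta_ratio=0.00041213)
print(f"3 finetuned copies:    {full:.2f} GB")
print(f"1 backbone + 3 deltas: {shared:.2f} GB")
print(f"saving factor:         {full / shared:.2f}x")
```

With these assumed numbers the saving factor comes out just under 3, that is, close to N.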

### PART2 
Then you will learn how to use OpenDelta to train T5-large with a batch size of 32 using only 11 GB of GPU RAM (which Google Colab provides for free).
**(You need to connect to the free GPU at the upper right of the page.)**
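
A rough count of bytes per parameter suggests why delta tuning fits in this budget. The sketch assumes fp32 training with Adam (4 bytes each for the weight, its gradient, and two optimizer moments for every trainable parameter), ignores activations entirely, and uses approximate parameter counts (about 737.6M for T5-large and about 0.3M for the delta). All of these figures are assumptions for illustration.

```python
BYTES_FP32 = 4
GIB = 1024 ** 3


def training_state_gib(frozen_params, trainable_params):
    """Weights for all parameters, plus gradient and two Adam moments for trainable ones."""
    frozen_bytes = frozen_params * BYTES_FP32
    trainable_bytes = trainable_params * BYTES_FP32 * 4  # weight + grad + 2 Adam moments
    return (frozen_bytes + trainable_bytes) / GIB


backbone = 737_600_000   # approximate T5-large parameter count (assumed)
delta = 300_000          # approximate delta parameter count (assumed)

full_finetuning = training_state_gib(frozen_params=0, trainable_params=backbone)
delta_tuning = training_state_gib(frozen_params=backbone, trainable_params=delta)
print(f"full finetuning: {full_finetuning:.1f} GiB before activations")
print(f"delta tuning:    {delta_tuning:.1f} GiB before activations")
```

Under these assumptions, full finetuning spends nearly the whole 11 GB on weights and optimizer state alone, while delta tuning leaves most of the card free for activations.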

### PART3
You will see the flexibility that OpenDelta provides to customize a delta model.

In [None]:
!nvidia-smi

Mon Jul  4 04:02:04 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [1]:
# install necessary packages
!pip install transformers --quiet
!pip install datasets==2.0 --quiet
!pip install opendelta==0.2.2 --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible

In [2]:
from dataclasses import dataclass, field
from typing import Optional, List
from transformers import Seq2SeqTrainingArguments, TrainerCallback 
from datasets import load_dataset, load_metric, concatenate_datasets
import transformers
from transformers import (
    AutoConfig,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    HfArgumentParser,
    MBartTokenizer,
    default_data_collator,
    set_seed,
)
import torch
import numpy as np
import random

In [3]:
@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """
    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Where to store the pretrained models downloaded from huggingface.co"},
    )
    use_fast_tokenizer: bool = field(
        default=True,
        metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
    )
    model_revision: str = field(
        default="main",
        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
    )
    use_auth_token: bool = field(
        default=False,
        metadata={
            "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script "
            "with private models)."
        },
    )

model_args = ModelArguments(model_name_or_path="t5-large")

### Load the backbone model in the usual way.

In [4]:
config = AutoConfig.from_pretrained(
    model_args.config_name if model_args.config_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
)
config.dropout_rate = 0.0
tokenizer = AutoTokenizer.from_pretrained(
    model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=model_args.use_fast_tokenizer,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_args.model_name_or_path,
    from_tf=bool(".ckpt" in model_args.model_name_or_path),
    config=config,
    cache_dir=model_args.cache_dir,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
)
model.resize_token_embeddings(len(tokenizer))


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


Embedding(32100, 1024)

### PART0: Visualization
#### Using OpenDelta as a fantastic visualization tool.

In [5]:
from opendelta import Visualization
Visualization(model).structure_graph();

#### Compare with the default module printout in PyTorch

In [None]:
model

### PART1:  Multitask Serving with Finetuned Deltas.
Load the three special delta models with different functionalities.

In [6]:
from opendelta import AutoDeltaConfig, AutoDeltaModel

delta_model_spelling = AutoDeltaModel.from_finetuned("thunlp/Spelling_Correction_T5_LRAdapter_demo", backbone_model=model)
delta_model_spelling.detach()

delta_model_topic = AutoDeltaModel.from_finetuned("thunlp/Question_Topic_T5-large_Compacter", backbone_model=model)
delta_model_topic.detach()

delta_model_fact = AutoDeltaModel.from_finetuned("thunlp/FactQA_T5-large_Adapter", backbone_model=model)
delta_model_fact.detach()


100% Download Spelling_Correction_T5_LRAdapter_demo.zip 

[INFO|(OpenDelta)delta_configs:209]2022-07-23 01:26:38,927 >> Model config
LowRankAdapterConfig {
  "backbone_checkpoint_name": "t5-large",
  "backbone_class": "T5ForConditionalGeneration",
  "backbone_hash": "baa014d5e81363ff48935ba78b5df374",
  "common_structure": null,
  "low_rank_rank": 1,
  "low_rank_w_init": "glorot-uniform",
  "modified_modules": [
    "SelfAttention",
    "DenseReluDense"
  ],
  "non_linearity": "gelu_new",
  "opendelta_version": "0.2.2",
  "reduction_factor": 32,
  "transformers_version": "4.20.1"
}






Reuse the cached checkpoint in /root/.cache/delta_center/Spelling_Correction_T5_LRAdapter_demo


100% Download Question_Topic_T5-large_Compacter.zip 


[INFO|(OpenDelta)delta_configs:209]2022-07-23 01:26:53,073 >> Model config
CompacterConfig {
  "backbone_checkpoint_name": "t5-large",
  "backbone_class": "T5ForConditionalGeneration",
  "backbone_hash": "6297bd1acc36524547c8a76cc03fef5c",
  "bottleneck_dim": null,
  "common_structure": null,
  "factorized_phm": true,
  "factorized_phm_rule": false,
  "hypercomplex_division": 4,
  "hypercomplex_nonlinearity": "glorot-uniform",
  "kronecker_prod": null,
  "learn_phm": true,
  "modified_modules": [
    "SelfAttention",
    "DenseReluDense"
  ],
  "non_linearity": "gelu_new",
  "opendelta_version": "0.2.2",
  "phm_c_init": "normal",
  "phm_init_range": 0.0001,
  "phm_rank": 1,
  "reduction_factor": 16,
  "sequential": null,
  "shared_W_phm": false,
  "shared_phm_rule": false,
  "transformers_version": "4.20.1",
  "use_bias_down_sampler": true,
  "use_bias_up_sampler": true
}

Reuse the cached checkpoint in /root/.cache/delta_center/Question_Topic_T5-large_Compacter


100% Download FactQA_T5-large_Adapter.zip 


[INFO|(OpenDelta)delta_configs:209]2022-07-23 01:27:09,298 >> Model config
AdapterConfig {
  "backbone_checkpoint_name": "t5-large",
  "backbone_class": "T5ForConditionalGeneration",
  "backbone_hash": "a984db8cc3fe15bfb2618b1ff7abd570",
  "bottleneck_dim": 24,
  "common_structure": null,
  "modified_modules": [
    "SelfAttention",
    "DenseReluDense"
  ],
  "non_linearity": "gelu_new",
  "opendelta_version": "0.2.2",
  "sequential": true,
  "transformers_version": "4.20.1"
}

Reuse the cached checkpoint in /root/.cache/delta_center/FactQA_T5-large_Adapter


In [7]:
def multitask_serving(input_text):
    # Step 1: attach the spelling delta, correct the question, then detach.
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids  # add .cuda() if the model is on the GPU
    delta_model_spelling.attach()
    answers_ids = model.generate(input_ids=input_ids, max_length=20, num_beams=4)
    input_text = tokenizer.decode(answers_ids[0], skip_special_tokens=True)
    print("Correct Spelling: {}".format(input_text))
    delta_model_spelling.detach()

    # Step 2: attach the topic delta and classify the corrected question.
    delta_model_topic.attach()
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids  # add .cuda() if the model is on the GPU
    answers_ids = model.generate(input_ids=input_ids, max_length=20, num_beams=4)
    topic = tokenizer.decode(answers_ids[0], skip_special_tokens=True)
    delta_model_topic.detach()
    print("Question Topic: {}".format(topic))

    # Step 3: attach the fact-QA delta and answer the corrected question.
    delta_model_fact.attach()
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids  # add .cuda() if the model is on the GPU
    answers_ids = model.generate(input_ids=input_ids, max_length=20, num_beams=4)
    answer = tokenizer.decode(answers_ids[0], skip_special_tokens=True)
    delta_model_fact.detach()
    print("Question Answer: {}".format(answer))




In [8]:
multitask_serving("When was Beiiing olymp#ic heldd ?")
multitask_serving("What the commmon career of Newton ad eintesin?")

Correct Spelling: When was Beijing Olympic held?
Question Topic: The question's topic is sports.
Question Answer: 2008
Correct Spelling: What was the common career of Newton and Einstein?
Question Topic: The question's topic is science.
Question Answer: Physicists
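
The attach/detach pattern used above can be pictured with a toy model. This is a minimal sketch of the idea only, not how OpenDelta is implemented: `ToyBackbone` and `ToyDelta` are hypothetical classes, and the "delta" is just an offset added to the backbone's output.

```python
class ToyBackbone:
    """Stand-in for a frozen model; it simply doubles its input."""

    def __init__(self):
        self.deltas = []  # deltas that are currently attached

    def forward(self, x):
        y = 2 * x
        for delta in self.deltas:
            y = delta.apply(y)
        return y


class ToyDelta:
    """A tiny task-specific modification that can be plugged in and out."""

    def __init__(self, backbone, offset):
        self.backbone = backbone
        self.offset = offset

    def apply(self, y):
        return y + self.offset

    def attach(self):
        if self not in self.backbone.deltas:
            self.backbone.deltas.append(self)

    def detach(self):
        if self in self.backbone.deltas:
            self.backbone.deltas.remove(self)


backbone = ToyBackbone()
spelling, topic = ToyDelta(backbone, 10), ToyDelta(backbone, 100)

spelling.attach()
print(backbone.forward(1))   # 12: backbone output 2 plus the spelling offset
spelling.detach()

topic.attach()
print(backbone.forward(1))   # 102: same backbone, different delta
topic.detach()

print(backbone.forward(1))   # 2: the backbone itself is unchanged
```

Only one copy of the backbone exists at any time; each task contributes just its own small delta.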


In [9]:
delta_model_spelling.detach()
delta_model_topic.detach()
delta_model_fact.detach()

## PART2: Efficient Tuning of T5-large with Limited GPU RAM

### Spelling correction as an example

In [10]:
@dataclass
class TrainingArguments(Seq2SeqTrainingArguments):
    print_num_parameters: Optional[bool] = field(default=False, metadata={"help": "If set, print the parameters of "
                                                                                 "the model."})
    do_test: Optional[bool] = field(default=False, metadata={"help": "If set, evaluates the test performance."})
    split_validation_test: Optional[bool] = field(default=False,
                                                  metadata={"help": "If set, for the datasets which do not "
                                                                    "have the test set, we use validation set as their "
                                                                    "test set and make a validation set from either "
                                                                    "splitting the validation set into half (for smaller "
                                                                    "than 10K samples datasets), or by using 1K examples "
                                                                    "from training set as validation set (for larger "
                                                                    "datasets)."})
    compute_time: Optional[bool] = field(default=False, metadata={"help": "If set, measures the time."})
    compute_memory: Optional[bool] = field(default=False, metadata={"help": "If set, measures the memory."})


training_args = TrainingArguments(output_dir="./", 
                                  do_train=True,
                                  do_eval=True,
                                  do_predict=False,
                                  evaluation_strategy="steps",
                                  eval_steps=200,
                                  save_strategy="steps",
                                  save_steps=200,
                                  greater_is_better=True,
                                  load_best_model_at_end=True,
                                  compute_memory=True,
                                  predict_with_generate=True,
                                  push_to_hub=False,
                                  learning_rate=1e-3,
                                  seed=42,
                                  per_device_eval_batch_size=32,
                                  per_device_train_batch_size=32,
                                  num_train_epochs=1,
                                  metric_for_best_model="em",
                                  warmup_steps=0,
                                  save_total_limit=1,
                                  gradient_accumulation_steps=1
                                  )

In [11]:
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    decoded_preds = [tokenizer.decode(i, skip_special_tokens=True, clean_up_tokenization_spaces=True) for i in preds]
    decoded_labels = [tokenizer.decode(i, skip_special_tokens=True, clean_up_tokenization_spaces=True) for i in labels]
    result = {}
    result_list = [int(i==j) for i, j in zip(decoded_labels, decoded_preds)]
    result.update({"em":sum(result_list)/len(result_list)})
    return result
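
The `em` (exact match) score computed above is the fraction of decoded predictions that are string-identical to their decoded labels. A minimal standalone sketch, with a hypothetical function name and made-up example strings:

```python
def exact_match(predictions, references):
    """Fraction of predictions that equal their reference string exactly."""
    hits = [int(p == r) for p, r in zip(predictions, references)]
    return sum(hits) / len(hits)


preds = ["When was Beijing Olympic held?", "Where was Newton bron?"]
refs = ["When was Beijing Olympic held?", "Where was Newton born?"]
print(exact_match(preds, refs))  # 0.5: one of the two predictions matches exactly
```

Because a single wrong character counts as a miss, this is a strict metric for spelling correction.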

In [12]:
mydataset = load_dataset("trivia_qa","unfiltered.nocontext")
# Keep the full training split; use a random subset of 500 validation examples to speed up evaluation.
validation_index = np.arange(len(mydataset['validation']))
np.random.shuffle(validation_index)
mydataset['validation'] = mydataset['validation'].select(validation_index[:500])

Downloading and preparing dataset trivia_qa/unfiltered.nocontext (download: 603.25 MiB, generated: 70.50 MiB, post-processed: Unknown size, total: 673.74 MiB) to /root/.cache/huggingface/datasets/trivia_qa/unfiltered.nocontext/1.2.0/e73c5e47a8704744fa9ded33504b35a6c098661813d1c2a09892eb9b9e9d59ae...


Dataset trivia_qa downloaded and prepared to /root/.cache/huggingface/datasets/trivia_qa/unfiltered.nocontext/1.2.0/e73c5e47a8704744fa9ded33504b35a6c098661813d1c2a09892eb9b9e9d59ae. Subsequent calls will reuse this data.



We randomly perturb each input question, for example changing "Is it a good movie?" to "Ia it af movi good?". The training target is the original sentence.

In [13]:
def misspelling(x):
    # Apply 0, 1 or 2 random perturbations to the sentence x.
    replace_time = np.random.randint(3)
    count = 0

    # Stop early if no words are left, since randint(0, 0) would raise an error.
    while count < replace_time and x.split():
        randfloat = np.random.rand()
        if randfloat < 0.15:
            # Swap two randomly chosen words.
            words = x.split()
            i, j = [np.random.randint(low=0, high=len(words)) for _ in range(2)]
            words[i], words[j] = words[j], words[i]
            x = " ".join(words)
        elif randfloat < 0.3:
            # Drop one randomly chosen word.
            words = x.split()
            drop_index = np.random.randint(low=0, high=len(words))
            x = " ".join(words[:drop_index] + words[drop_index+1:])
        elif randfloat < 0.8:
            # Replace one character with 1 or 2 random characters.
            replace_str = "".join([random.choice('abcdefghijklmnopqrstuvwxyz!@#$%^&*()') for _ in range(np.random.randint(1,3))])
            # Use the current length: an earlier perturbation may have shortened x.
            rindx = np.random.randint(low=0, high=len(x))
            x = x[:rindx] + replace_str + x[rindx+1:]
        else:
            # Swap two randomly chosen characters.
            chars = list(x)
            i, j = [np.random.randint(low=0, high=len(chars)) for _ in range(2)]
            chars[i], chars[j] = chars[j], chars[i]
            x = "".join(chars)
        count += 1
    return x

def tokenize_function(examples):
    input_sentences = [" ".join((i.strip("\n").strip().strip("?")+"?").split()[:20]) for i in examples["question"]]
    mis_spellings = [misspelling(x) for x in input_sentences]
    input_ids = [tokenizer.encode(i, padding="max_length", truncation=True, max_length=64) for i in mis_spellings]
    label = [tokenizer.encode(i, padding="max_length", truncation=True, max_length=64) for i in input_sentences]
    return {"input_ids": input_ids, "labels": label}

tokenized_datasets = mydataset.map(tokenize_function, remove_columns=['answer', 'question_source',"entity_pages",'search_results'], batched=True)
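
The cumulative thresholds in `misspelling` (0.15, 0.3, 0.8) determine how often each kind of perturbation is chosen. The short sketch below only does that arithmetic; the labels are descriptive names introduced here, not identifiers from the notebook.

```python
# Cumulative thresholds copied from misspelling(); labels are descriptive only.
thresholds = [
    (0.15, "swap two words"),
    (0.30, "drop a word"),
    (0.80, "replace a character"),
    (1.00, "swap two characters"),
]

probabilities = {}
previous = 0.0
for upper, label in thresholds:
    probabilities[label] = round(upper - previous, 2)
    previous = upper

for label, p in probabilities.items():
    print(f"{label}: {p:.0%}")
```

Each call applies 0, 1 or 2 such perturbations (chosen by `np.random.randint(3)`), so about a third of the training inputs are left unchanged.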




### Now let's apply a LowRankAdapterModel

In [14]:
from opendelta import LowRankAdapterModel
delta_model1 = LowRankAdapterModel(backbone_model=model, modified_modules=['SelfAttention', "DenseReluDense"])
delta_model1.freeze_module(set_state_dict = True)
delta_model1.log(delta_ratio=True, trainable_ratio=True, visualization=True)


[INFO|(OpenDelta)basemodel:675]2022-07-23 01:38:50,695 >> Trainable Ratio: 0.041213%
[INFO|(OpenDelta)basemodel:677]2022-07-23 01:38:50,704 >> Delta Parameter Ratio: 0.041213%
[INFO|(OpenDelta)basemodel:679]2022-07-23 01:38:50,705 >> Static Memory 0.00 GB, Max Memory 0.00 GB
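
The logged ratio can be roughly reproduced by hand. The sketch below is a back-of-the-envelope count, not OpenDelta's code. It assumes a hidden size of 1024, a bottleneck of 1024/32 = 32, rank-1 factorization of both projections plus bias terms, 96 insertion points (self-attention and feed-forward in each of 24 encoder and 24 decoder blocks), and about 737.6M backbone parameters. The rank and reduction factor are taken from the `LowRankAdapterConfig` printed in PART1; that the defaults used here are the same is an assumption.

```python
def low_rank_adapter_params(hidden, reduction_factor, rank):
    """Parameter count of one adapter whose down/up projections are low-rank products."""
    bottleneck = hidden // reduction_factor
    down = rank * (hidden + bottleneck) + bottleneck   # factorized weight + bias
    up = rank * (bottleneck + hidden) + hidden         # factorized weight + bias
    return down + up


per_adapter = low_rank_adapter_params(hidden=1024, reduction_factor=32, rank=1)
n_adapters = 2 * (24 + 24)          # assumed: 2 modified modules per block, 48 blocks
delta_params = per_adapter * n_adapters
backbone_params = 737_600_000       # approximate T5-large size (assumed)

ratio = delta_params / (backbone_params + delta_params)
print(per_adapter, delta_params, f"{ratio:.4%}")
```

Under these assumptions the estimate is about 0.0412%, in line with the ratio reported in the log.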


In [15]:
class MyCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, **kwargs):
        """
        Event called after an evaluation phase.
        """
        sents = ["was Wher Newton bon?", 
                 "In year which Beiiing Olmpic ld?"
                ]
        for sent in sents:
            input_ids = tokenizer(sent, return_tensors="pt").input_ids.cuda()
            answers_ids = model.generate(input_ids=input_ids,
                                         max_length=20,
                                         num_beams=4,
                                         )
            print("{} {}".format(sent, tokenizer.decode(answers_ids[0], skip_special_tokens=True)))
        print("max allocated memory {} GB".format(torch.cuda.max_memory_allocated("cuda:0")/1024**3))
        
from transformers import Seq2SeqTrainer
training_args.output_dir = "./SpellingCorrection" # to avoid conflict
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    callbacks=[MyCallback],
    compute_metrics=compute_metrics,
)

trainer.train()

# delta_model1.save_finetuned("Spelling_T5-lowrankadapter", push_to_hub=True)
# Since this notebook is not logged in to the Hugging Face model hub, we omit this step.

The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: question_id, question. If question_id, question are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 87622
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 2739


Step,Training Loss,Validation Loss


KeyboardInterrupt: ignored
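
The "Total optimization steps" value in the log follows directly from the dataset size and the batch settings. A small check, assuming a single GPU so that the total batch size equals the per-device batch size:

```python
import math

num_examples = 87622      # "Num examples" in the log
batch_size = 32           # per_device_train_batch_size, single GPU assumed
grad_accum = 1            # gradient_accumulation_steps
epochs = 1                # num_train_epochs

batches_per_epoch = math.ceil(num_examples / batch_size)   # the last batch is partial
steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_steps = steps_per_epoch * epochs
print(total_steps)  # 2739
```

With `eval_steps=200`, evaluation (and the callback's sample generations) would run 13 times over a full epoch.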

## PART3: Examples of Customized Deltas

Here we add Adapters to the attention modules of layers 0\~5 and to the feed-forward modules of layers 10\~12 (after layer_norm).

We then add LoRA to the Q matrix of the attention modules in layers 0\~9.

In [16]:

from opendelta import AdapterModel
from opendelta import LoraModel
delta_model1.detach()
model.to("cpu")
# Raw strings avoid invalid-escape warnings for the "\." sequences; the "[r]" prefix marks a regular expression.
delta_model_custom1 = AdapterModel(backbone_model=model, 
                                   modified_modules=[r'[r][0-5]\.layer\.0\.SelfAttention', r'[r](10|11|12)\.layer\.1\.layer_norm'])
delta_model_custom2 = LoraModel(backbone_model=model, modified_modules=[r'[r][0-9]\.layer\.0\.SelfAttention\.q'])
delta_model_custom2.freeze_module(set_state_dict = True)
delta_model_custom2.log(delta_ratio=True, trainable_ratio=True, visualization=True)

# delta_model_custom2.detach()

[INFO|(OpenDelta)basemodel:675]2022-07-23 01:54:54,829 >> Trainable Ratio: 0.166643%
[INFO|(OpenDelta)basemodel:677]2022-07-23 01:54:54,829 >> Delta Parameter Ratio: 0.166643%
[INFO|(OpenDelta)basemodel:679]2022-07-23 01:54:54,829 >> Static Memory 0.00 GB, Max Memory 9.53 GB
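
It can help to test such patterns against module names before applying them. The sketch below uses Python's `re` module on hypothetical module names written in the T5 naming style. It assumes the pattern after the `[r]` marker is applied as an unanchored search; whether OpenDelta matches exactly this way is an assumption, so the visualization printed by `log(visualization=True)` remains the reliable check.

```python
import re

# Pattern from the cell above with the "[r]" marker removed.
pattern = r"[0-5]\.layer\.0\.SelfAttention"

# Hypothetical module names in the T5 naming style.
names = [
    "encoder.block.3.layer.0.SelfAttention",
    "encoder.block.7.layer.0.SelfAttention",
    "encoder.block.15.layer.0.SelfAttention",
]

results = {}
for name in names:
    unanchored = re.search(pattern, name) is not None
    anchored = re.search(r"block\." + pattern, name) is not None
    results[name] = (unanchored, anchored)
    print(f"{name}: unanchored={unanchored}, anchored={anchored}")
```

With an unanchored search, block 15 also matches because its name ends in `5.layer.0.SelfAttention`. Prefixing the pattern with `block\.` restricts it to single-digit blocks 0 to 5.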


## OpenDelta (git_example)


In [1]:
!pip install transformers
!pip install opendelta


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling 

In [2]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-base") # suppose we load BART

Some weights of BartForSequenceClassification were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['classification_head.dense.weight', 'classification_head.out_proj.bias', 'classification_head.dense.bias', 'classification_head.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
from opendelta import Visualization
Visualization(model).structure_graph()

In [4]:
from opendelta import AdapterModel
delta_model = AdapterModel(backbone_model=model, modified_modules=['fc2'], bottleneck_dim=12)
delta_model.log() # This will visualize the backbone after modification and other information.

[INFO|(OpenDelta)basemodel:675]2022-07-23 02:56:49,063 >> Trainable Ratio: 100.000000%
[INFO|(OpenDelta)basemodel:677]2022-07-23 02:56:49,066 >> Delta Parameter Ratio: 0.164388%
[INFO|(OpenDelta)basemodel:679]2022-07-23 02:56:49,071 >> Static Memory 0.00 GB, Max Memory 0.00 GB


In [5]:
# A separate example using BERT.
from transformers import BertForMaskedLM
from opendelta import AdapterModel
model = BertForMaskedLM.from_pretrained("bert-base-cased")
delta_model = AdapterModel(model)  # This applies adapters to the self-attention and feed-forward layers.
delta_model.log()

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[INFO|(OpenDelta)structure_mapping:336]2022-07-23 02:59:07,427 >> Since you are using the common structure mapping, draw the transformed parameter structure for checking.


[INFO|(OpenDelta)basemodel:675]2022-07-23 02:59:07,689 >> Trainable Ratio: 100.000000%
[INFO|(OpenDelta)basemodel:677]2022-07-23 02:59:07,693 >> Delta Parameter Ratio: 0.827267%
[INFO|(OpenDelta)basemodel:679]2022-07-23 02:59:07,695 >> Static Memory 0.00 GB, Max Memory 0.00 GB


In [6]:
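# Freeze everything except the deltas (and any module named "layernorm_embedding", a BART module name).
# Note: `delta_model` now refers to the BERT adapter created in the previous cell, as the ratios in the log confirm.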
delta_model.freeze_module(exclude=["deltas", "layernorm_embedding"], set_state_dict=True)
delta_model.log()

[INFO|(OpenDelta)basemodel:675]2022-07-23 03:12:04,078 >> Trainable Ratio: 0.827267%
[INFO|(OpenDelta)basemodel:677]2022-07-23 03:12:04,081 >> Delta Parameter Ratio: 0.827267%
[INFO|(OpenDelta)basemodel:679]2022-07-23 03:12:04,084 >> Static Memory 0.00 GB, Max Memory 0.00 GB
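
The effect of `exclude` can be pictured with a toy version of the freezing rule. This is a sketch under assumptions, not OpenDelta's implementation: it assumes a parameter stays trainable when its name contains one of the excluded keys, and the parameter names are hypothetical.

```python
def freeze_except(param_names, exclude):
    """Return {name: trainable}; a parameter stays trainable if its name contains an excluded key."""
    return {name: any(key in name for key in exclude) for name in param_names}


# Hypothetical parameter names in a BERT-like model with an attached adapter.
names = [
    "bert.embeddings.LayerNorm.weight",
    "bert.encoder.layer.0.output.dense.weight",
    "bert.encoder.layer.0.output.deltas.adapter.down_proj.weight",
]

trainable = freeze_except(names, exclude=["deltas", "layernorm_embedding"])
for name, flag in trainable.items():
    print(f"{flag!s:5} {name}")
```

In this toy model only the `deltas` parameter stays trainable, since nothing is named `layernorm_embedding`. That is consistent with the log above, where the trainable ratio equals the delta parameter ratio.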
