# VTX
---
[AIGen](https://github.com/LuciferianInk/aigen) is a text generation and training library, originally forked from [AITextGen](https://aitextgen.minimaxir.com/) (which is now defunct).

AIGen is also the foundation of [VTX](https://github.com/0-5788719150923125/vtx).

To use this notebook on Kaggle, you must first enable Internet access:

1. Find "Notebook options" in the sidebar on the right-hand side of this page.
2. If required, verify your phone number.
3. Choose "Internet on".

Also, do not forget to connect to an accelerator. The P100s are the better option for training.
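Before installing anything, it can help to confirm that an accelerator is actually attached. The following is a minimal sketch, assuming the `nvidia-smi` tool is on the `PATH` when a GPU is present; the helper name `describe_accelerator` is hypothetical and not part of AIGen.

```python
import shutil
import subprocess


def describe_accelerator() -> str:
    """Return a one-line description of the attached NVIDIA GPU, if any."""
    # nvidia-smi ships with the NVIDIA driver; if it is missing, assume no GPU.
    if shutil.which("nvidia-smi") is None:
        return "no accelerator detected"
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=10,
            check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return "nvidia-smi found, but the query failed"
    return result.stdout.strip() or "no accelerator detected"


print(describe_accelerator())
```

If this prints "no accelerator detected", revisit the accelerator setting in "Notebook options" before running the cells below.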

In [1]:
# Kaggle uses an old version of CUDA, so we need to install a version of PyTorch that was built for that version.
# The requirement is quoted so the shell does not treat ">=" as an output redirect.
!pip install 'torch>=2.1.0' --no-build-isolation --index-url https://download.pytorch.org/whl/cu110

# We don't even use this, but have to install it because of Kaggle bugs
!pip install torchaudio

# Now we install AIGen
!pip install 'git+https://github.com/LuciferianInk/aigen.git'

# Install our fork of ModuleFormer
!pip install 'git+https://github.com/LuciferianInk/ModuleFormer.git@enable-gradient-checkpointing'

Collecting git+https://github.com/LuciferianInk/aigen.git
  Cloning https://github.com/LuciferianInk/aigen.git to /tmp/pip-req-build-4hiw3rso
  Running command git clone --filter=blob:none --quiet https://github.com/LuciferianInk/aigen.git /tmp/pip-req-build-4hiw3rso
  Resolved https://github.com/LuciferianInk/aigen.git to commit d83be748faa77cc4163024daa4428e21d626bbcb
  Preparing metadata (setup.py) ... done
Collecting lightning_hivemind@ git+https://github.com/LuciferianInk/lightning-Hivemind.git@ipfs (from aigen==0.7.0)
  Cloning https://github.com/LuciferianInk/lightning-Hivemind.git (to revision ipfs) to /tmp/pip-install-qod3f39o/lightning-hivemind_70043684388043f8b9299536aec12cf3
  Running command git clone --filter=blob:none --quiet https://github.com/LuciferianInk/lightning-Hivemind.git /tmp/pip-install-qod3f39o/lightning-hivemind_70043684388043f8b9299536aec12cf3
  Running command git checkout -b ipfs --track origin/ipfs
  Switched to a new branch 'ipfs'
  Branch '

## Configuration

We could gather every setting into variables here, but for now we only set the run name and a tokenizer flag. Everything else is hardcoded inline for clarity.

In [2]:
# Set some variables
import os

focus = 'frame'
os.environ["FOCUS"] = focus
os.environ["TOKENIZERS_PARALLELISM"] = 'false'
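If the inline values ever become hard to track, one option is to collect them in a single object. This is only a sketch of that idea: the class name `NotebookSettings` and its fields are hypothetical, and the paths simply mirror the `/kaggle/working/...` directories used in later cells.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class NotebookSettings:
    """Hypothetical container for the values this notebook hardcodes inline."""

    focus: str = "frame"
    work_dir: str = "/kaggle/working"

    @property
    def log_dir(self) -> str:
        return f"{self.work_dir}/logs"

    @property
    def model_cache_dir(self) -> str:
        return f"{self.work_dir}/models"

    @property
    def output_dir(self) -> str:
        return f"{self.work_dir}/trained"

    def apply_env(self) -> None:
        # Same environment variables the configuration cell sets by hand.
        os.environ["FOCUS"] = self.focus
        os.environ["TOKENIZERS_PARALLELISM"] = "false"


settings = NotebookSettings()
settings.apply_env()
print(settings.log_dir, settings.model_cache_dir, settings.output_dir)
```

The notebook itself does not use this object; the remaining cells keep their literal values.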

## Pretraining

We want to train a ModuleFormer from scratch, so we import example code from IBM's MoLM and configure the settings.

In [3]:
# Pretrain configs for ModuleFormer
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

from moduleformer import (
    ModuleFormerConfig,
    ModuleFormerForCausalLM,
    ModuleFormerForSequenceClassification,
)

AutoConfig.register("moduleformer", ModuleFormerConfig)
AutoModelForCausalLM.register(ModuleFormerConfig, ModuleFormerForCausalLM)
AutoModelForSequenceClassification.register(
    ModuleFormerConfig, ModuleFormerForSequenceClassification
)

base_model = "ibm/MoLM-350M-4B"

pretrain_config = AutoConfig.from_pretrained(base_model)
overrides = {
    "universal": True,
    "world_size": 23,
    "activation_function": 'gelu',
    "n_layer": 16,
    "n_head": 2,
    "k_att": 3,
    "k_mlp": 3,
    "n_att_experts": 8,
    "n_mlp_experts": 16,
    "n_ctx": 2048, # history_length * n_layer
    "n_embd": 768,
    "att_hidden": 256,
    "ffd_hidden": 512,
    "block_size": 128,
    "gate_type": 'gmm',
    "gating_size": 64,
    "aux_loss_type": 'mi',
    "aux_loss_weight": 0.1,
    "history_length": 128,
    "resid_pdrop": 0.1,
    "embd_pdrop": 0.1,
    "attn_pdrop": 0.1,
    "moe_pdrop": 0.1,
    "sample_topk": 2,
    "tie_word_embeddings": True,
}
setattr(pretrain_config, "_name_or_path", focus)
for k, v in overrides.items():
    setattr(pretrain_config, k, v)
print("modified pretrain config:")
print(pretrain_config)



config.json:   0%|          | 0.00/952 [00:00<?, ?B/s]

modified pretrain config:
ModuleFormerConfig {
  "_name_or_path": "frame",
  "activation_function": "gelu",
  "architectures": [
    "ModuleFormerForCausalLM"
  ],
  "att_func": "stickbreaking",
  "att_hidden": 256,
  "attn_pdrop": 0.1,
  "aux_loss_type": "mi",
  "aux_loss_weight": 0.1,
  "block_size": 128,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "ffd_hidden": 512,
  "gate_type": "gmm",
  "gating_size": 64,
  "history_length": 128,
  "initializer_range": 0.02,
  "k_att": 3,
  "k_mlp": 3,
  "layer_norm_epsilon": 1e-05,
  "local_size": 1,
  "model_type": "moduleformer",
  "moe_pdrop": 0.1,
  "moe_type": "moe",
  "n_att_experts": 8,
  "n_ctx": 2048,
  "n_embd": 768,
  "n_head": 2,
  "n_layer": 16,
  "n_mlp_experts": 16,
  "resid_pdrop": 0.1,
  "sample_topk": 2,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.36.0",
  "universal": true,
  "use_cache": true,
  "vocab_size": 50295,
  "world_size": 23
}
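Several of these overrides depend on each other. The comment in the cell states that `n_ctx` equals `history_length * n_layer`; the sketch below checks that relation, plus the assumption (not confirmed by the source) that `k_att` and `k_mlp` are the number of experts selected per token and so cannot exceed the expert counts. The helper `check_overrides` is hypothetical.

```python
def check_overrides(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    expected_ctx = cfg["history_length"] * cfg["n_layer"]
    if cfg["n_ctx"] != expected_ctx:
        problems.append(f"n_ctx is {cfg['n_ctx']}, expected {expected_ctx}")
    # Assumption: k_* is a top-k over the corresponding pool of experts.
    if cfg["k_att"] > cfg["n_att_experts"]:
        problems.append("k_att exceeds n_att_experts")
    if cfg["k_mlp"] > cfg["n_mlp_experts"]:
        problems.append("k_mlp exceeds n_mlp_experts")
    return problems


# Values copied from the overrides in the cell above.
subset = {
    "n_layer": 16,
    "history_length": 128,
    "n_ctx": 2048,
    "k_att": 3,
    "n_att_experts": 8,
    "k_mlp": 3,
    "n_mlp_experts": 16,
}
print(check_overrides(subset))
```

Running a check like this before instantiating the model is cheaper than discovering a mismatch partway through training.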



## Load a pretrained tokenizer

This isn't actually necessary here, but it can be required in some cases.

In [4]:
# Tweak the tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    base_model,
    cache_dir="/kaggle/working/models",
    padding="max_length",
    padding_side="left",
    use_fast=True,
    return_overflowing_tokens=True,
    truncation=True,
    trust_remote_code=True,
)
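The `padding_side="left"` setting matters for decoder-only models: padding on the left keeps the real tokens at the end of the sequence, where generation continues. A minimal pure-Python sketch of the idea follows; `pad_left` and the pad id `0` are hypothetical, and this is not how the tokenizer implements it internally.

```python
def pad_left(token_ids: list[int], length: int, pad_id: int) -> tuple[list[int], list[int]]:
    """Pad (or truncate) a sequence to `length`, placing padding on the left."""
    ids = token_ids[:length]  # truncate from the right if the sequence is too long
    n_pad = length - len(ids)
    padded = [pad_id] * n_pad + ids
    attention_mask = [0] * n_pad + [1] * len(ids)  # 0 marks padding positions
    return padded, attention_mask


ids, mask = pad_left([5, 6, 7], length=5, pad_id=0)
print(ids)   # [0, 0, 5, 6, 7]
print(mask)  # [0, 0, 1, 1, 1]
```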

## Load the model

Here we initialize the model with random weights.

In [5]:
# Instantiate your model
import os
from aigen import aigen

prototype = aigen(
    model=base_model,
    tokenizer=tokenizer,
    cache_dir="/kaggle/working/models",
    precision=16,
    gradient_checkpointing=False,
    config=pretrain_config
)

print(prototype)



ModuleFormer loaded with 298M parameters.
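To put the parameter count in context, the weights alone can be sized with simple arithmetic. The sketch below assumes roughly 298 million parameters (the rounded figure printed above) and counts only weights, not gradients, optimizer state, or activations. The helper name is hypothetical.

```python
def param_megabytes(n_params: int, bytes_per_param: int) -> float:
    """Size of the weights in megabytes (1 MB = 1e6 bytes)."""
    return n_params * bytes_per_param / 1e6


n_params = 298_000_000  # rounded; the exact count is not shown in the output
print(f"fp32 weights:      {param_megabytes(n_params, 4):,.0f} MB")
print(f"fp16/bf16 weights: {param_megabytes(n_params, 2):,.0f} MB")
```

The fp32 figure is close to the roughly 1,193 MB that the trainer reports in the training log, which suggests the weights are held at 4 bytes per parameter even though `precision=16` is requested.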


## Parameter-Efficient Fine-Tuning (PEFT)
Here is a basic example of Low-Rank Adaptation (LoRA) training. The cell is commented out because adapters are not used in pre-training.

In [6]:
# # Prepare model for PEFT training

# from peft import (
#     LoraConfig,
#     get_peft_model,
#     prepare_model_for_kbit_training,
# )

# peft_config = LoraConfig(
#     task_type="CAUSAL_LM",
#     r=4,
#     lora_alpha=16,
#     lora_dropout=0.01,
#     bias="all",
#     target_modules=[
#       "embed_in",
#       "query_key_value",
#       "dense",
#       "dense_h_to_4h",
#       "dense_4h_to_h",
#       "embed_out"
#     ]
# )

# prototype.model = prepare_model_for_kbit_training(
#     prototype.model, use_gradient_checkpointing=True
# )

# prototype.model = get_peft_model(prototype.model, peft_config)

# prototype.model.print_trainable_parameters()
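For a sense of what `r=4` and `lora_alpha=16` mean, the arithmetic can be sketched directly. LoRA adds two low-rank matrices of shapes `(d_in, r)` and `(r, d_out)` beside a frozen weight and scales their product by `lora_alpha / r`. The 768 x 768 layer below is an illustrative assumption borrowed from `n_embd`, not a layer taken from the model, and the count ignores the biases that `bias="all"` would also train.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors A (d_in x r) and B (r x d_out)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update before it is added to the frozen weight."""
    return lora_alpha / r


d_in = d_out = 768  # hypothetical layer shape
full = d_in * d_out
adapter = lora_trainable_params(d_in, d_out, r=4)
print(f"full layer: {full} params, adapter: {adapter} params ({adapter / full:.2%})")
print(f"scaling: {lora_scaling(16, 4)}")
```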

## Metrics

We want to log training metrics, so we install TensorBoard and expose it via ngrok. This requires an authtoken from ngrok.com, saved in Kaggle's "Add-ons > Secrets".

In [7]:
from kaggle_secrets import UserSecretsClient
secret_label = "NGROK_SECRET"
secret_value = UserSecretsClient().get_secret(secret_label)

if secret_value:

    !pip install ngrok tensorboard

    import subprocess
    import time

    import ngrok

    # Popen returns immediately, so TensorBoard runs in the background without a helper thread.
    tensorboard_process = subprocess.Popen(
        ["tensorboard", "--logdir", "/kaggle/working/logs", "--bind_all", "--samples_per_plugin", "scalars=999999999"],
    )

    # Tunnel TensorBoard's default port (6006) through ngrok.
    listener = ngrok.forward(6006, authtoken=secret_value)

    # Pause, presumably so TensorBoard can finish starting before the cell ends.
    time.sleep(15)

Collecting ngrok
  Obtaining dependency information for ngrok from https://files.pythonhosted.org/packages/09/e5/1d908d18ba0c532a2d9bb13bbc0020a48e166135aa1ae8d120cf7bf3d3e8/ngrok-0.12.1-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading ngrok-0.12.1-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Downloading ngrok-0.12.1-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.9/2.9 MB 29.9 MB/s eta 0:00:00
Installing collected packages: ngrok
Successfully installed ngrok-0.12.1



NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

TensorBoard 2.13.0 at http://c55b8d1ff4c2:6006/ (Press CTRL+C to quit)
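The fixed `time.sleep(15)` in the cell is a guess at how long TensorBoard needs to start. An alternative, sketched here with only the standard library, is to poll the port until something accepts connections. The function `wait_for_port` is hypothetical and is not used by the notebook.

```python
import socket
import time


def wait_for_port(port: int, host: str = "127.0.0.1", timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means a server is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

In the metrics cell, `wait_for_port(6006)` could stand in for the sleep. It returns as soon as TensorBoard is reachable and reports failure instead of silently continuing.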


## Training

Finally, we train the model on `tiiuae/falcon-refinedweb`, a dataset streamed from: https://huggingface.co/datasets

In [None]:
# Train the model

import os
from lightning.pytorch import loggers

os.makedirs(f"/kaggle/working/logs/{focus}", exist_ok=True)
logger = loggers.TensorBoardLogger("/kaggle/working/logs", name=focus, default_hp_metric=True)

# Put the model in training mode; .train() also sets the flag on every submodule.
prototype.model.train()

prototype.train(
    devices="auto",
    strategy="auto",
    streaming_data=[
        {"dataset": "tiiuae/falcon-refinedweb", "content_key": "content"}
    ],
    output_dir="/kaggle/working/trained",
    batch_size=4,
    gradient_accumulation_steps=64,
    block_size=512,
    num_steps=10000,
    warmup_steps=100,
    optimizer="Lion",
    learning_rate=0.000333,
    weight_decay=0.01,
    gradient_clip_val=1.0,
    scheduler="cosine",
    generate_every=500,
    save_every=1000,
    loggers=[logger]
)

Downloading readme:   0%|          | 0.00/9.04k [00:00<?, ?B/s]

Resolving data files:   0%|          | 0/5534 [00:00<?, ?it/s]

INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/configuration_validator.py:74: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name  | Type                    | Params
--------------------------------------------------
0 | model | ModuleFormerForCausalLM | 298 M 
--------------------------------------------------
298 M     Trainable params
0         Non-trainable params
298 M     Total params
1,192.930 Total estimated model params size (MB)
/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `Data

  0%|          | 0/640000 [00:00<?, ?it/s]
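The progress bar total of 640000 follows from the training arguments. The sketch below works through the arithmetic, assuming a single device, that `num_steps` counts optimizer steps, and that every sequence fills the whole `block_size`. The function name is hypothetical.

```python
def training_budget(batch_size: int, gradient_accumulation_steps: int, block_size: int, num_steps: int) -> dict:
    """Derive per-step and total quantities from the training arguments."""
    sequences_per_step = batch_size * gradient_accumulation_steps
    tokens_per_step = sequences_per_step * block_size
    return {
        "sequences_per_step": sequences_per_step,
        "tokens_per_step": tokens_per_step,
        "micro_batches": num_steps * gradient_accumulation_steps,
        "total_tokens": tokens_per_step * num_steps,
    }


budget = training_budget(batch_size=4, gradient_accumulation_steps=64, block_size=512, num_steps=10_000)
for name, value in budget.items():
    print(f"{name}: {value:,}")
```

Under these assumptions each optimizer step sees 256 sequences (131,072 tokens), and the 640,000 micro-batches match the progress bar total.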


==>

==>
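The training call combines `warmup_steps=100` with `scheduler="cosine"`. The sketch below shows one common form of that schedule: linear warmup to the peak learning rate, then cosine decay to zero. This is an assumption about the general shape only, and AIGen's actual scheduler may differ in its details. The function `lr_at` is hypothetical.

```python
import math


def lr_at(step: int, base_lr: float = 0.000333, warmup_steps: int = 100, num_steps: int = 10_000) -> float:
    """Learning rate at a given optimizer step under linear warmup + cosine decay."""
    if step < warmup_steps:
        # Ramp linearly so the final warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, num_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


for step in (0, 50, 99, 100, 5_050, 10_000):
    print(f"step {step:>6}: lr = {lr_at(step):.6f}")
```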


## Testing

For testing, we just run an interactive inference session.

In [None]:
# Test inference

while True:
    print("PROMPT:\n")
    prompt = input()
    completion = prototype.generate(
        prompt=prompt,
        do_sample=True,
        min_length=23,
        max_new_tokens=111,
        temperature=0.9,
        eta_cutoff=0.0003,
        penalty_alpha=0.6,
        top_k=4,
        repetition_penalty=1.023,
        no_repeat_ngram_size=13,
        renormalize_logits=True,
        remove_invalid_values=True,
        max_time=60,
        use_cache=True,
    )
    print("COMPLETION:\n")
    print(completion)