## **Small Dataset with sakib323/matmulfreellm (with rotary embeeding + MoE)**

In [1]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
!pip install -U git+https://github.com/Sakib323/matmulfreellm.git
!pip install transformers
!pip install triton==2.2
!pip install datasets
!pip install wandb

Looking in indexes: https://download.pytorch.org/whl/cu124
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading https://download.pytorch.org/whl/cu124/nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (24.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m13.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading https://download.pytorch.org/whl/cu124/nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (883 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m20.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading https://download.pytorch.org/whl/cu124/nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (13.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m28.7 MB/s[0m eta [36m0:00:00[0m
[?25

In [9]:
import wandb
wandb.login()
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling, AutoTokenizer, AutoModelForCausalLM
from datasets import Dataset

from mmfreelm.models import ( HGRNBitForCausalLM,HGRNBitModel, HGRNBitConfig)

import torch
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("Sakib323/MMfreeLM-370M")
tokenizer.pad_token = tokenizer.eos_token

demo_data = load_dataset("mlabonne/Evol-Instruct-Python-1k")


def tokenize_function(examples):
    combined = [instr + "\n" + out + tokenizer.eos_token for instr, out in zip(examples["instruction"], examples["output"])]
    tokens = tokenizer(combined, truncation=True, padding=True, max_length=2048)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens


demo_data_small = demo_data["train"].select(range(150))
tokenized_dataset = demo_data_small.map(tokenize_function, batched=True, remove_columns=["instruction", "output"])
split_datasets = tokenized_dataset.train_test_split(test_size=0.1)



config = HGRNBitConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=1024,
    num_hidden_layers=24,
    max_position_embeddings=2048,
    attn_mode="fused_recurrent",
    use_short_conv=False,
    conv_size=4,
    rms_norm_eps=1e-6,
    pad_token_id=tokenizer.pad_token_id,
    rope_theta=10000.0,
    use_ternary_rope=True,
    rotary_embeddings=True,
    moe=True,
    num_experts=2,
    num_experts_per_tok=2,
    moe_intermediate_size=1024,  # 4x hidden_size= 4096
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = HGRNBitForCausalLM(config).to(device)


training_args = TrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    per_device_train_batch_size=1,
    remove_unused_columns=False,
    num_train_epochs=30,
    learning_rate=4e-3,
    weight_decay=0.01,
    logging_steps=100,
    save_steps=1000,
    gradient_accumulation_steps=4,
    fp16=False,
    run_name="HGRNBit-MMfreeLM-370M-with-rotary-embedding",
    report_to="wandb",
)



data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_datasets["train"],
    eval_dataset=split_datasets["test"],
    data_collator=data_collator,
)

trainer.train()

<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:


Abort: 

In [3]:
from huggingface_hub import login
login(token="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")


model.save_pretrained("MMfreeLM-370M")
tokenizer.save_pretrained("MMfreeLM-370M")


from huggingface_hub import HfApi, HfFolder
from transformers import AutoModelForCausalLM, AutoTokenizer

from huggingface_hub import create_repo
create_repo("MMfreeLM-370M", private=False)

model.push_to_hub("MMfreeLM-370M")
tokenizer.push_to_hub("MMfreeLM-370M")

model.safetensors:   0%|          | 0.00/1.27G [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Sakib323/MMfreeLM-370M/commit/60b4995344190ac3acddb28faa048382373b213c', commit_message='Upload tokenizer', commit_description='', oid='60b4995344190ac3acddb28faa048382373b213c', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Sakib323/MMfreeLM-370M', endpoint='https://huggingface.co', repo_type='model', repo_id='Sakib323/MMfreeLM-370M'), pr_revision=None, pr_num=None)

In [10]:
from mmfreelm.models import HGRNBitForCausalLM
import torch

model = HGRNBitForCausalLM.from_pretrained("Sakib323/MMfreeLM-370M")
model.to("cuda" if torch.cuda.is_available() else "cpu")

Initializing RotaryEmbedding with theta=10000.0 and ternary=True

[RotaryEmbedding] Initialized with: dim=1024, max_pos=2048, base=10000.0, ternary=True

Initializing RotaryEmbedding with theta=10000.0 and ternary=True

[RotaryEmbedding] Initialized with: dim=1024, max_pos=2048, base=10000.0, ternary=True

Initializing RotaryEmbedding with theta=10000.0 and ternary=True

[RotaryEmbedding] Initialized with: dim=1024, max_pos=2048, base=10000.0, ternary=True

Initializing RotaryEmbedding with theta=10000.0 and ternary=True

[RotaryEmbedding] Initialized with: dim=1024, max_pos=2048, base=10000.0, ternary=True

Initializing RotaryEmbedding with theta=10000.0 and ternary=True

[RotaryEmbedding] Initialized with: dim=1024, max_pos=2048, base=10000.0, ternary=True

Initializing RotaryEmbedding with theta=10000.0 and ternary=True

[RotaryEmbedding] Initialized with: dim=1024, max_pos=2048, base=10000.0, ternary=True

Initializing RotaryEmbedding with theta=10000.0 and ternary=True

[RotaryEmb

NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Sakib323/MMfreeLM-370M")
tokenizer.pad_token = tokenizer.eos_token

tokenizer_config.json:   0%|          | 0.00/1.22k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/3.51M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/437 [00:00<?, ?B/s]

In [6]:
def generate_text(prompt, max_new_tokens=2000):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.convert_tokens_to_ids("</s>"),
        pad_token_id=tokenizer.pad_token_id,
        do_sample=True,
        top_p=0.9,
        temperature=0.7
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example prompt
prompt = "write code to create an addition function in python"
generated_text = generate_text(prompt)
print(generated_text)


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


write code to create an addition function in python of the two examples class also efficiently multiple names in Python to load data implementing a different movie title. The depth of elements occurring(toCurrency, etc are defined and log2 perides to the target other model. The user can then click the "Calculate" button to apply 10 with the balance + 2* manA-Z\x-lower() at the break to calculate the sum.

Additionally, the script should have a time complexity of O(log n) in the worst-case scenario because it calls with the expected function.

In addition, the above complexity of the network can be used to handle duplicate values in the dataset. The difficulty level can be mitigated by using different initialization strategies or exploring more advanced optimization algorithms.

- Sensitivity to learning rate: The choice of digits to a positive variable or a self-balancing binary search tree. The time complexity of the "add" method is O(log n) in the worst case, where n is the number of