In [None]:
soft-MoE for LLMs: sparsifying MLPs with MoE and channels - imitating brain-like modularity - selective recurrence via partial universality (cf. the Universal Transformers paper)

Google's soft-MoE paper works for non-causal data - all tokens attend to all experts in parallel in one go. i.e. each expert is only available for a one-time use (slots allow for being used by a few tokens, but it was found that slots=1 was best performing). This does not work for language modelling because we need repeated use of the same experts - the same passage of text requiring the same expert may be called at any point in the text any number of times.

The solution is to make as many slots as there are timesteps, but this requires an embedding for each position, greatly slowing down learning (i.e. an expert that solves a certain task will have a harder time generalizing across timesteps)

For the parts we care about, this is just equivalent to re-casting the soft-MoE operation not as operating across the entire input as in ViTs, but as operating across each timestep.

This gets rid of the main reason it is so advantageous in ViT - it removes the redundancy of using the same MLP block for every token, effectively increasing parameter count. i.e. there is now a 'dog' MLP that all dog tokens can immediately go to.

i.e. imo, the main reason for the benefit of soft-MoE to ViT is that beforehand, the MLP 'layer' in a ViT consists of passing each token through the same MLP which has to keep a representation of everything in any image within itself
but soft-MoE takes advantage of the fact that each token will likely consist of something different and so it can be routed to an expert to take care of just that

However, I argue that this benefit is impossible to bring to LLMs because you can't cut the text apart into groups of separate entities that can be processed in parallel - the 'dog' expert needs to be available at every timestep because every token may need reference to details about a dog, while that's not true for an image.
Each new token could require any expert, so they all need to be available at every timestep

Motivation: the Bitter Lesson, rough inspiration from biology of human cortical columns

The bitter lesson:
 - architectures which use available hardware better are better able to scale, which outperforms handcrafted methods
 dense MLP blocks in a large amount of layers can only be made so big before we run out of VRAM. splitting them up into separate blocks with sparser information routing between them allows for much more efficient use of compute and hardware, and permits cheap and efficient scalability (see Switch Transformer)
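
To make the VRAM argument concrete, here is a rough weight count for one dense MLP block versus a set of smaller per-channel experts. It is only a back-of-the-envelope sketch: biases are ignored, the 4x hidden expansion and the sizes (768-wide embedding, 4 channels, 4 experts) are taken from the `MLP` class and `GPTConfig` defaults further down in this notebook, and `mlp_weights` is a helper name made up for illustration.

```python
def mlp_weights(d_model, expansion=4):
    """Weight count (biases ignored) of a d -> expansion*d -> d MLP."""
    return 2 * expansion * d_model * d_model

# Sizes assumed from the GPTConfig defaults in this notebook.
d_model, n_channels, n_experts = 768, 4, 4

dense = mlp_weights(d_model)                     # one dense MLP on the full embedding
per_expert = mlp_weights(d_model // n_channels)  # one expert on a single channel
moe_total = n_experts * per_expert               # all experts in one layer

print(dense, per_expert, moe_total)
# prints: 4718592 294912 1179648
```

Because each expert only sees a channel of width d/c, its cost shrinks quadratically, so under these assumptions a layer could hold c²/e = 4 times as many experts before matching the dense block's weight budget.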

 human cortical columns:
 biological brains are not densely connected - especially the neocortex is split into many small columns which are individually wired densely and similarly but communicate sparsely between each other.

robustness -
we theorize that part of the reason for why NNs fail to be robust and to generalize as far as we would like is that the foundational MLP blocks are all-to-all across all layers
- this means that output neurons (in an MLP block) which have a specific role  may take input from neurons which are not at all related to them, causing spurious correlations
- while we can assume that during training these irrelevant weights will be set to 0, this is a strong assumption, and indeed papers have empirically found that unstructured pruning to a certain degree does improve performance. We postulate this is due to the removal of spurious correlations

- this also forces the NN to very carefully manage information as it propagates through the network - a large variety of information needs to be routed through the exact same pathway and blocks of computation

- we postulate that allowing information to be processed in a more localised and modular way will allow information to be moved to areas of neural processing better dedicated to it and better separated from other areas, which will reduce spurious correlations and ease the burden of neural pathway information saturation.

however, because brain connections (e.g. between cortical columns) can still be long range and in-between neural blocks, and because of the importance of differentiability and the ability for gradients to flow strongly throughout the model, we enable communication between the separated neural modules, but maintain some degree of sparsity by having this be attention-based information passing, rather than

soft-MoE (they stole my idea! :P not rly) is like this but we need to replace individual MLP blocks, per-token with this. This allows for modular processing and sparsity

Separating the neural pathway into channels is necessary to allow information to pass from a sub-MoE-MLP-block in one layer to another one in the next layer without having to interfere with information coming from another sub-MoE-MLP-block
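
As a rough illustration of the per-token, per-channel routing described above, here is a minimal NumPy sketch for a single timestep. It is only a sketch under assumptions: the experts are stand-in functions (two small random maps plus one identity no-op), the sizes are arbitrary, and `soft_moe_one_timestep` is a hypothetical name rather than anything from the soft-MoE paper.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe_one_timestep(x, slot_embeds, experts):
    """x: [c, d] channels of one token; slot_embeds: [e, d]; experts: list of e callables."""
    logits = x @ slot_embeds.T          # [c, e] channel-to-expert affinities
    dispatch = softmax(logits, axis=0)  # per expert: convex mix over channels
    combine = softmax(logits, axis=1)   # per channel: convex mix over experts
    slots = dispatch.T @ x              # [e, d] one input slot per expert
    outs = np.stack([f(s) for f, s in zip(experts, slots)])  # [e, d]
    return combine @ outs               # [c, d] back to one vector per channel

rng = np.random.default_rng(0)
c, d, e = 4, 8, 3  # channels, channel width, experts (arbitrary sizes)
x = rng.normal(size=(c, d))
slot_embeds = rng.normal(size=(e, d))
# Stand-in experts (hypothetical): two random nonlinear maps and one identity expert.
W = [rng.normal(size=(d, d)) * 0.1 for _ in range(e - 1)]
experts = [lambda s, w=w: np.tanh(s @ w) for w in W] + [lambda s: s]

y = soft_moe_one_timestep(x, slot_embeds, experts)
print(y.shape)
```

Everything here happens inside one timestep, so nothing leaks across positions and every expert stays available to every token.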

The brain has major differences from FFNNs -
  signals can travel back and forth -
  there is SERIOUS high-level-modularity (have a look at mouse brain hippocampus dissection - it pops clean out! not dense AT ALL)

-imagine each sub-brain organ as an MLP block
if we replace MLP block with a transformer layer using multiple channels, we can view the attention and channels as neural pathways between high level organs
training will help to make attention sparse - we can smoothly/differentiably go from initialisation to brain-like sparsity of activation routing between components

- if we tie MOE-MLPs across layers in the model, we allow messages to enter the same block multiple times or enter new ones - this makes the universal transformer much more viable because it allows layers to specialise much better! attention sparsity can basically only limit certain MoE to certain layers, or allow one to be repeatedly applied! seems very useful for quick learning.
This has the problem of limiting how many parameters we update - we probably want to only share N MoE-experts across all layers, and allow each layer to have L experts unique to that layer.
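
The "N shared experts plus L unique experts per layer" layout can be sketched with plain Python objects. `Expert` is a stand-in class, and `build_expert_lists` and its argument names are made up for illustration; in the real model each entry would be an MLP module.

```python
class Expert:
    """Stand-in for an expert MLP; only object identity matters here."""
    def __init__(self, name):
        self.name = name

def build_expert_lists(n_layers, n_shared, n_unique):
    # The shared experts are created once and the same objects are reused in every layer.
    shared = [Expert(f"shared{i}") for i in range(n_shared)]
    layers = []
    for layer in range(n_layers):
        # Unique experts are created fresh for each layer.
        unique = [Expert(f"layer{layer}_unique{i}") for i in range(n_unique)]
        layers.append(shared + unique)
    return layers

layers = build_expert_lists(n_layers=3, n_shared=2, n_unique=1)
distinct = {id(e) for experts in layers for e in experts}
print(len(distinct))
# prints: 5
```

With 3 layers, N=2 and L=1, each layer routes over 3 experts but only 2 + 3*1 = 5 distinct experts exist, which is the parameter-sharing trade-off described above.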

something else interesting

artificial neural networks with residuals and attention mechanisms (and potentially sufficiently deep and insufficiently wide ANNs) are essentially architecturally required to work in Fourier space to avoid interference (see mechanistic interpretability paper teaching math to small transformers) - perhaps biological brains do this, but I have seen no evidence or reason for this - it seems biological brains are sufficiently large and sparse and simple that they do not need such advanced methods to route information throughout themselves.

actions:
- replace attention with RNN-like recurrence (removes the issue of having to craft embeddings such that they can be summed without interference. removes need for positional encoding)
- AFAIK, part of the reason why residuals work so well is that they permit a no-op to occur from the very beginning, stabilising training and allowing gradients to flow easier when needed - MoE can do this no-op behaviour without a residual if we replace a sub-MoE-MLP-block at every layer with a no-op block

like with weight decay, to encourage sparsity when routing - we can do this by rewarding the model for having peaky softmax activations in the expert router
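
One simple way to reward peaky router activations is to penalise the entropy of the router's softmax. This is a sketch of that idea, not something settled in these notes; `router_entropy` is a hypothetical helper, and the weighting coefficient mentioned in the comment is an assumed hyperparameter.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def router_entropy(logits):
    """Shannon entropy (in nats) of the router distribution; lower means peakier."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

uniform = router_entropy([0.0, 0.0, 0.0, 0.0])  # router undecided between 4 experts
peaky = router_entropy([8.0, 0.0, 0.0, 0.0])    # router strongly prefers one expert

# A training loss could add `coeff * router_entropy(...)`, where coeff is a
# hypothetical hyperparameter, so that peakier routing lowers the loss.
print(uniform > peaky)
# prints: True
```

The uniform case gives the maximum entropy log(4), so the penalty is largest exactly when routing is least sparse.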

# !!!@ we don't just want to encourage sparsity with low weights, we want to remove weak correlations - we want to keep strong ones. weight decay just keeps all weights low, but actually we should be okay with having a few high ones. rather than L2, we should find something that suits this better, like having high STD AND low L2
# this is relevant because
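
One possible reading of "high STD and low L2" is a penalty that adds the usual mean squared weight but subtracts the spread of the weight magnitudes. The sketch below is only an assumption about what such a term could look like; `peaky_weight_penalty` and `alpha` are made-up names, and nothing here has been tested in training.

```python
import statistics

def peaky_weight_penalty(weights, alpha=0.5):
    mags = [abs(w) for w in weights]
    l2 = sum(m * m for m in mags) / len(mags)  # mean squared weight (the usual L2 term)
    spread = statistics.pstdev(mags)           # population std of the magnitudes
    return l2 - alpha * spread                 # low L2 is rewarded, high spread is rewarded

few_strong = [1.0, 0.0, 0.0, 0.0]  # one strong connection, the rest pruned
many_weak = [0.5, 0.5, 0.5, 0.5]   # same mean squared weight, spread evenly

print(peaky_weight_penalty(few_strong) < peaky_weight_penalty(many_weak))
# prints: True
```

Both vectors have the same L2 term (0.25), so plain weight decay cannot tell them apart, while this penalty prefers the one with a single strong weight.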

In [None]:
# --- CD DATA DIR
if False:
  !mkdir -p /content/drive/MyDrive/PythonQA
  # `!cd` runs in a subshell and does not persist, so the directory change is done with os.chdir below
  import os
  os.chdir('/content/drive/MyDrive/PythonQA')

In [None]:
# DOWNLOAD DATA
if False:
    import os
    !pip install tiktoken

    # Kaggle credentials: fill in your own (never leave a real API key in a shared notebook)
    os.environ['KAGGLE_USERNAME'] = "<your-kaggle-username>"
    os.environ['KAGGLE_KEY'] = "<your-kaggle-api-key>"

    import kaggle
    kaggle.api.authenticate()

    dataset_name = 'stackoverflow/pythonquestions'
    # with unzip=True the CSVs are extracted into the current directory,
    # so check for an extracted file rather than for a zip path
    if not os.path.isfile('Questions.csv'):
        kaggle.api.dataset_download_files(dataset_name, unzip=True, quiet=False)
    print(f'Dataset {dataset_name} available in {os.getcwd()}')

In [None]:
# FORMAT DATA
if False:
    import csv

    question_answers = {}

    # Load questions
    with open('Questions.csv', encoding='latin-1') as f:
        reader = csv.DictReader(f)
        for i, row in enumerate(reader):
            question_id = row['Id']
            question_answers[question_id] = "-<{QUESTION}>-\n\n" + row['Title'] + ' ' + row['Body']

    # Load answers
    a=1
    with open('Answers.csv', encoding='latin-1') as f:
        reader = csv.DictReader(f)
        for i, row in enumerate(reader):
            answer_id = row['Id']
            question_id = row['ParentId']

            # Find the corresponding question and update the dictionary
            if question_id in question_answers:
                question_answers[question_id] += '\n\n-<{ANSWER}>-\n\n' + row['Body']
                a+=1

    print(len(question_answers.values()))

    data = list(question_answers.values())
    # assuming 4 characters per token, this is 422M tokens

    # it would be nice to train with a separate QA document per sample, but this requires padding and more specialised tokenization treatment... for simplicity, we'll just flatten it all into one string. it's still causal at least
    data = str(data)  # note: str() on a list gives its repr (brackets, quotes, escaped newlines), not a plain concatenation

In [None]:
# convert and save data to numpy files
if False:
    !pip install tiktoken
    import numpy as np
    import gc
    import tiktoken
    import os

    n=len(data)
    val_data = data[int(n*0.9):]
    train_data = data[:int(n*0.9)]
    del data

    # encode with tiktoken gpt2 bpe
    enc = tiktoken.get_encoding("gpt2")

    val_data = enc.encode_ordinary(val_data)
    val_data = np.array(val_data, dtype=np.uint16)

    val_data.tofile('val.bin')
    del val_data
    gc.collect()



    # Split the text into chunks

    enc = tiktoken.get_encoding("gpt2")


    max_chunk_size=len(train_data)//10
    idx=0
    for i in range(0, len(train_data), max_chunk_size):
        print(f'train{str(idx).zfill(6)}.bin')
        chunk = np.array(enc.encode_ordinary(train_data[i:i+max_chunk_size]), dtype=np.uint16)
        chunk.tofile(f'train{str(idx).zfill(6)}.bin')
        del chunk
        gc.collect()
        idx+=1

In [None]:
# LOAD DATA

# --- CD DATA DIR
!mkdir /content/drive/MyDrive/PythonQA
# `!cd` runs in a subshell and does not persist, so the directory change is done with os.chdir below
import os
os.chdir('/content/drive/MyDrive/PythonQA')
import numpy as np

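# NOTE: this matches every .bin file in the directory, including val.bin if it was written here
bin_files = sorted(file for file in os.listdir() if file.endswith('.bin'))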

train_data = []
for file in bin_files:
    train_data.append(np.memmap(file, dtype=np.uint16, mode='r'))

train_data = np.concatenate(train_data)

print(train_data.shape)

mkdir: cannot create directory ‘/content/drive/MyDrive/PythonQA’: File exists
(667289489,)


In [None]:
!pip install einops transformers



In [None]:
"""
Full definition of a GPT Language Model, all of it in this single file.
References:
1) the official GPT-2 TensorFlow implementation released by OpenAI:
https://github.com/openai/gpt-2/blob/master/src/model.py
2) huggingface/transformers PyTorch implementation:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py
"""

import math
import inspect
from dataclasses import dataclass

import torch
import torch.nn as nn
from torch.nn import functional as F
from einops import rearrange, pack, unpack
import copy

class LayerNorm(nn.Module):
    """ LayerNorm but with an optional bias. PyTorch doesn't support simply bias=False """

    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, input):
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)

class CausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        print('E, n_head',config.n_embd, config.n_head)
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        # regularization
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        # flash attention make GPU go brrrrr but support is only in PyTorch >= 2.0
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
        if not self.flash:
            print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
            # causal mask to ensure that attention is only applied to the left in the input sequence
            self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                        .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v  = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        if self.flash:
            # efficient attention using Flash Attention CUDA kernels
            y = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)
        else:
            # manual implementation of attention
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y

class MLP(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu    = nn.GELU()
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x

class RMSNorm(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** 0.5
        self.gamma = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return F.normalize(x, dim = - 1) * self.scale * self.gamma

class MOE(nn.Module):
    def __init__(self, config):
        super().__init__()
        if True:
            self.config=config
            self.norm = RMSNorm(config.n_embd)
            self.slot_norm = RMSNorm(config.n_embd//config.n_channels)
            # one slot embedding per entry in self.experts (MLP experts + identity no-op experts);
            # with only n_experts embeddings, the zip() in forward() silently never reaches the identity experts
            self.slot_embeds = nn.Parameter(torch.randn(config.n_experts + config.n_channels, 1, config.n_embd//config.n_channels))
            MLP_config = copy.deepcopy(config)
            MLP_config.n_embd = config.n_embd//config.n_channels # MLP size is now proportional to channel size, since this is what they operate on
            self.experts = nn.ModuleList([MLP(MLP_config) for _ in range(config.n_experts)] + [torch.nn.Identity() for _ in range(config.n_channels)])
            # we assume slots=experts. if a single MLP can handle all that data in a single transformer, a single expert can handle all that related data in an MoE. besides, if it really needs another slot it can just learn to make experts redundant and route info around them accordingly. This might be why slots don't matter so much in the soft-MoE paper

    def forward(self, x):
        # (should each channel be normalized separately? Well, we don't do per-token normalization. TODO: https://arxiv.org/abs/2112.02624 do ablation)
        x = self.norm(x)
        # x has shape [bsz, seq_len, n_embd]. Separate into channels across the embed dim.
        # CAREFUL - the embedding is made up of one vector per attention head, concatenated, so the axis we split on
        # determines whether each channel gets multiple attention heads or not. If we split into the same number of
        # chunks as there are heads, along the same axis attention uses, each channel holds a single attention head.
        # This is not actually a restriction though: each channel gets a confidence score for each expert, so all
        # channels could choose the same expert and all go into it, combined and weighted highly.
        x = rearrange(x, 'b t (c d) -> b t c d', c = self.config.n_channels) # shape: [b, t, c, d]
        # add channel embeddings (like positional embeddings, but just so we can identify each channel)
        # we don't need channel embeddings: each word embedding can technically learn to encode these itself, or attention can encode it
        # x += channel_embeddings
        # normalize slot embeddings
        slot_embeds = self.slot_norm(self.slot_embeds)
        # get logits for how to weigh channels to experts' slots
        logits = torch.einsum('b t c d, e s d -> b t c e s', x, slot_embeds) #b:batch, t:tokens, d:channel-wise embed dimension, e:experts, c:channel
        # get dispatch and combine weights (softmax across right dimensions)
        dispatch_weights = logits.softmax(dim = 2)
        combine_weights = rearrange(logits, 'b t c e s -> b t c (e s)')
        combine_weights = combine_weights.softmax(dim = -1)
        # derive slots by weighted average of input tokens using the dispatch weights from above
        slots = torch.einsum('b t c d, b t c e s -> e b t s d', x, dispatch_weights)
        # route the slots per expert to each expert
        out = []
        for slots_per_expert, expert in zip(slots, self.experts):
            out.append(expert(slots_per_expert))
        out = torch.stack(out)
        # combine back out
        out = rearrange(out, 'e b t s d -> b t (e s) d')
        out = torch.einsum('b t s d, b t c s -> b t c d', out, combine_weights)
        out = rearrange(out, 'b t c d -> b t (c d)', c = self.config.n_channels)
        return out

class Block(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
        self.moe = MOE(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.moe(self.ln_2(x))
        return x

@dataclass
class GPTConfig:
    fraction_experts_tied: float = 0.2
    block_size: int = 1024
    vocab_size: int = 50304 # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
    n_channels: int = 4
    n_experts: int = 4

class GPT(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.vocab_size is not None
        assert config.block_size is not None
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = LayerNorm(config.n_embd, bias=config.bias),
        ))


        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        # with weight tying when using torch.compile() some warnings get generated:
        # "UserWarning: functional_call was passed multiple values for tied weights.
        # This behavior is deprecated and will be an error in future versions"
        # not 100% sure what this is, so far seems to be harmless. TODO investigate
        self.transformer.wte.weight = self.lm_head.weight # https://paperswithcode.com/method/weight-tying

        # tie a portion of the MoE experts across layers. (Idea: only tie layers between the first and last 2 transformer blocks - those are known to be outliers in terms of behaviour, so shouldn't be forced to be the same as others, diversity seems necessary there.)
        # NOTE: the loop below currently ties across all layers, and only the weights (not the biases) are shared.
        if True:
            # int() rounds down, so with the defaults (4 experts * 0.2) no experts are actually tied
            n_tied = int(self.config.n_experts * self.config.fraction_experts_tied)
            tied_experts = self.transformer.h[0].moe.experts[:n_tied]
            for layer in self.transformer.h:
                for i, expert in enumerate(layer.moe.experts[:n_tied]):
                    try:
                        expert.c_fc.weight = tied_experts[i].c_fc.weight
                        expert.c_proj.weight = tied_experts[i].c_proj.weight
                    except Exception as err:
                        print(err)
            print('TIED EXPERTS:', self.config.n_layer * n_tied, 'of', self.config.n_layer * self.config.n_experts)

        # init all weights
        self.apply(self._init_weights)


        # TODO (pseudocode, not implemented yet; kept as comments so the cell parses):
        # ZeRo weights - initialise all layers to identity - better sparsity, removes need for layer norm, guaranteed closer to optimal
        # for layer in self.transformer.h:
        #     c_proj = identity, hadamard
        #     c_proj = identity
        #     attention.kqv = same (to get identity attention, i.e. token N attends only to token N,
        #                           just need hidden->key and hidden->query to have the same weights)
        #
        # self.lm_head = identity



        # apply special scaled init to the residual projections, per GPT-2 paper
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))

        # report number of parameters
        print("number of parameters: %.2fM" % (self.get_num_params()/1e6,))

    def get_num_params(self, non_embedding=True):
        """
        Return the number of parameters in the model.
        For non-embedding count (default), the position embeddings get subtracted.
        The token embeddings would too, except due to the parameter sharing these
        params are actually used as weights in the final layer, so we include them.
        """
        n_params = sum(p.numel() for p in self.parameters())
        if non_embedding:
            n_params -= self.transformer.wpe.weight.numel()
        return n_params

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
        pos = torch.arange(0, t, dtype=torch.long, device=device) # shape (t)

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos) # position embeddings of shape (t, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)

        if targets is not None:
            # if we are given some desired targets also calculate the loss
            logits = self.lm_head(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        else:
            # inference-time mini-optimization: only forward the lm_head on the very last position
            logits = self.lm_head(x[:, [-1], :]) # note: using list [-1] to preserve the time dim
            loss = None

        return logits, loss

    def crop_block_size(self, block_size):
        # model surgery to decrease the block size if necessary
        # e.g. we may load the GPT2 pretrained model checkpoint (block size 1024)
        # but want to use a smaller block size for some smaller, simpler model
        assert block_size <= self.config.block_size
        self.config.block_size = block_size
        self.transformer.wpe.weight = nn.Parameter(self.transformer.wpe.weight[:block_size])
        for block in self.transformer.h:
            if hasattr(block.attn, 'bias'):
                block.attn.bias = block.attn.bias[:,:,:block_size,:block_size]

    @classmethod
    def from_pretrained(cls, model_type, override_args=None):
        assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        override_args = override_args or {} # default to empty dict
        # only dropout can be overridden see more notes below
        assert all(k == 'dropout' for k in override_args)
        from transformers import GPT2LMHeadModel
        print("loading weights from pretrained gpt: %s" % model_type)

        # n_layer, n_head and n_embd are determined from model_type
        config_args = {
            'gpt2':         dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
            'gpt2-medium':  dict(n_layer=24, n_head=16, n_embd=1024), # 350M params
            'gpt2-large':   dict(n_layer=36, n_head=20, n_embd=1280), # 774M params
            'gpt2-xl':      dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params
        }[model_type]
        print("forcing vocab_size=50257, block_size=1024, bias=True")
        config_args['vocab_size'] = 50257 # always 50257 for GPT model checkpoints
        config_args['block_size'] = 1024 # always 1024 for GPT model checkpoints
        config_args['bias'] = True # always True for GPT model checkpoints
        # we can override the dropout rate, if desired
        if 'dropout' in override_args:
            print(f"overriding dropout rate to {override_args['dropout']}")
            config_args['dropout'] = override_args['dropout']
        # create a from-scratch initialized minGPT model
        config = GPTConfig(**config_args)
        model = GPT(config)
        sd = model.state_dict()
        sd_keys = sd.keys()
        sd_keys = [k for k in sd_keys if not k.endswith('.attn.bias')] # discard this mask / buffer, not a param

        # init a huggingface/transformers model
        model_hf = GPT2LMHeadModel.from_pretrained(model_type)
        sd_hf = model_hf.state_dict()

        # copy while ensuring all of the parameters are aligned and match in names and shapes
        sd_keys_hf = sd_hf.keys()
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')] # ignore these, just a buffer
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.bias')] # same, just the mask (buffer)
        transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
        # basically the openai checkpoints use a "Conv1D" module, but we only want to use a vanilla Linear
        # this means that we have to transpose these weights when we import them
        assert len(sd_keys_hf) == len(sd_keys), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
        for k in sd_keys_hf:
            if any(k.endswith(w) for w in transposed):
                # special treatment for the Conv1D weights we need to transpose
                assert sd_hf[k].shape[::-1] == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k].t())
            else:
                # vanilla copy over the other parameters
                assert sd_hf[k].shape == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k])

        return model

    def configure_optimizers(self, weight_decay, learning_rate, betas, device_type):
        # start with all of the candidate parameters
        param_dict = {pn: p for pn, p in self.named_parameters()}
        # filter out those that do not require grad
        param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
        # create optim groups. Any parameters that is 2D will be weight decayed, otherwise no.
        # i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.
        decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
        nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
        optim_groups = [
            {'params': decay_params, 'weight_decay': weight_decay},
            {'params': nodecay_params, 'weight_decay': 0.0}
        ]
        num_decay_params = sum(p.numel() for p in decay_params)
        num_nodecay_params = sum(p.numel() for p in nodecay_params)
        print(f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters")
        print(f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters")
        # Create AdamW optimizer and use the fused version if it is available
        fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
        use_fused = fused_available and device_type == 'cuda'
        extra_args = dict(fused=True) if use_fused else dict()
        optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, **extra_args)
        print(f"using fused AdamW: {use_fused}")

        return optimizer

    def estimate_mfu(self, fwdbwd_per_iter, dt):
        """ estimate model flops utilization (MFU) in units of A100 bfloat16 peak FLOPS """
        # first estimate the number of flops we do per iteration.
        # see PaLM paper Appendix B as ref: https://arxiv.org/abs/2204.02311
        N = self.get_num_params()
        cfg = self.config
        L, H, Q, T = cfg.n_layer, cfg.n_head, cfg.n_embd//cfg.n_head, cfg.block_size
        flops_per_token = 6*N + 12*L*H*Q*T
        flops_per_fwdbwd = flops_per_token * T
        flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
        # express our flops throughput as ratio of A100 bfloat16 peak flops
        flops_achieved = flops_per_iter * (1.0/dt) # per second
        flops_promised = 312e12 # A100 GPU bfloat16 peak flops is 312 TFLOPS
        mfu = flops_achieved / flops_promised
        return mfu

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
            # forward the model to get the logits for the index in the sequence
            logits, _ = self(idx_cond)
            # pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx
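
In [None]:
A quick cross-check of the `estimate_mfu` arithmetic in plain Python, no model needed. This is only a sketch: the function name `mfu_estimate` is made up here, N = 127.17M is the parameter count printed in the training log, the config values are the ones from the training cell (n_layer=12, n_head=12, n_embd=768, block_size=512, batch_size=8, 40 accumulation steps), and dt is the time of iter 5 in the first log (15934.86 ms).

```python
def mfu_estimate(n_params, n_layer, n_head, n_embd, block_size,
                 fwdbwd_per_iter, dt, peak_flops=312e12):
    """Same formula as GPT.estimate_mfu, written as a free function (hypothetical helper)."""
    head_dim = n_embd // n_head
    # 6*N for the dense matmuls (forward + backward) plus the attention term
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * block_size
    flops_per_fwdbwd = flops_per_token * block_size
    flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
    flops_achieved = flops_per_iter / dt  # per second
    return flops_achieved / peak_flops    # fraction of A100 bfloat16 peak

# assumed inputs: values read off the config cell and the first training log
mfu = mfu_estimate(
    n_params=127.17e6, n_layer=12, n_head=12, n_embd=768, block_size=512,
    fwdbwd_per_iter=8 * 40,  # batch_size * gradient_accumulation_steps
    dt=15.93486,             # seconds, iter 5 of the first log
)
print(f"mfu {mfu * 100:.2f}%")
```

This lands at about 2.7%, in line with the logged value. Note that the 6*N term treats every parameter as active for every token, so for an MoE layer it is only as accurate as that assumption.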

In [None]:
"""
This training script can be run both on a single gpu in debug mode,
and also in a larger training run with distributed data parallel (ddp).

To run on a single GPU, example:
$ python train.py --batch_size=32 --compile=False

To run with DDP on 4 gpus on 1 node, example:
$ torchrun --standalone --nproc_per_node=4 train.py

To run with DDP on 4 gpus across 2 nodes, example:
- Run on the first (master) node with example IP 123.456.123.456:
$ torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=123.456.123.456 --master_port=1234 train.py
- Run on the worker node:
$ torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=123.456.123.456 --master_port=1234 train.py
(If your cluster does not have Infiniband interconnect prepend NCCL_IB_DISABLE=1)
"""

import os
import time
import math
import pickle
from contextlib import nullcontext

import numpy as np
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group


# -----------------------------------------------------------------------------
# default config values designed to train a gpt2 (124M) on OpenWebText
# I/O
out_dir = 'out'
eval_interval = 2000
log_interval = 1
eval_iters = 200
eval_only = False # if True, script exits right after the first eval
always_save_checkpoint = True # if True, always save a checkpoint after each eval
init_from = 'scratch' # 'scratch' or 'resume' or 'gpt2*'
# wandb logging
wandb_log = False # disabled by default
wandb_project = 'owt'
wandb_run_name = 'gpt2' # 'run' + str(time.time())
# data
batch_size = 8 # if gradient_accumulation_steps > 1, this is the micro-batch size
gradient_accumulation_steps = int( 40 * (8/batch_size) ) # used to simulate larger batch sizes
block_size = 512

# model
n_layer = 12
n_head = 12
n_embd = 768
n_channels = 4  # to get the same parameter count as the MLP: set number of experts equal to number of channels
n_experts = n_channels**2 # 4*4*10  # PER LAYER; to simulate a residual without an explicit residual, we add n_channels worth of no-op experts
fraction_experts_tied = 0.0
dropout = 0.0 # for pretraining 0 is good, for finetuning try 0.1+


bias = False # do we use bias inside LayerNorm and Linear layers?
# adamw optimizer
learning_rate = 6e-4 # max learning rate
max_iters = 600000 # total number of training iterations
weight_decay = 1e-1
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0 # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True # whether to decay the learning rate
warmup_iters = 0 #2000 # how many steps to warm up for
lr_decay_iters = 600000 # should be ~= max_iters per Chinchilla
min_lr = 6e-5 # minimum learning rate, should be ~= learning_rate/10 per Chinchilla
# DDP settings
backend = 'nccl' # 'nccl', 'gloo', etc.
# system
device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1' etc., or try 'mps' on macbooks
dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16' # 'float32', 'bfloat16', or 'float16', the latter will auto implement a GradScaler
compile = True # use PyTorch 2.0 to compile the model to be faster
# -----------------------------------------------------------------------------
config_keys = [k for k,v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
#exec(open('configurator.py').read()) # overrides from command line or config file
config = {k: globals()[k] for k in config_keys} # will be useful for logging
# -----------------------------------------------------------------------------

# various inits, derived attributes, I/O setup
ddp = False #int(os.environ.get('RANK', -1)) != -1 # is this a ddp run?
if ddp:
    init_process_group(backend=backend)
    ddp_rank = int(os.environ['RANK'])
    ddp_local_rank = int(os.environ['LOCAL_RANK'])
    ddp_world_size = int(os.environ['WORLD_SIZE'])
    device = f'cuda:{ddp_local_rank}'
    torch.cuda.set_device(device)
    master_process = ddp_rank == 0 # this process will do logging, checkpointing etc.
    seed_offset = ddp_rank # each process gets a different seed
    # world_size number of processes will be training simultaneously, so we can scale
    # down the desired gradient accumulation iterations per process proportionally
    assert gradient_accumulation_steps % ddp_world_size == 0
    gradient_accumulation_steps //= ddp_world_size
else:
    # if not ddp, we are running on a single gpu, and one process
    master_process = True
    seed_offset = 0
    ddp_world_size = 1
tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size
print(f"tokens per iteration will be: {tokens_per_iter:,}")

if master_process:
    os.makedirs(out_dir, exist_ok=True)
torch.manual_seed(1337 + seed_offset)
torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
# note: float16 data type will automatically use a GradScaler
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)




# poor man's data loader
data_dir = ''
# train_data is assumed to be loaded already in an earlier cell, e.g.:
#train_data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
val_data = np.memmap('val.bin', dtype=np.uint16, mode='r')
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
    if device_type == 'cuda':
        # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)
        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y




# init these up here, can override if init_from='resume' (i.e. from a checkpoint)
iter_num = 0
best_val_loss = 1e9

# attempt to derive vocab_size from the dataset
meta_path = os.path.join(data_dir, 'meta.pkl')
meta_vocab_size = None
if os.path.exists(meta_path):
    with open(meta_path, 'rb') as f:
        meta = pickle.load(f)
    meta_vocab_size = meta['vocab_size']
    print(f"found vocab_size = {meta_vocab_size} (inside {meta_path})")

# model init
model_args = dict(n_layer=n_layer, n_head=n_head, n_embd=n_embd, block_size=block_size,
                  bias=bias, vocab_size=None, dropout=dropout, n_channels=n_channels, n_experts=n_experts, fraction_experts_tied=fraction_experts_tied) # start with model_args from command line
if init_from == 'scratch':
    # init a new model from scratch
    print("Initializing a new model from scratch")
    # determine the vocab size we'll use for from-scratch training
    if meta_vocab_size is None:
        print("defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)")
    model_args['vocab_size'] = meta_vocab_size if meta_vocab_size is not None else 50304
    gptconf = GPTConfig(**model_args)
    model = GPT(gptconf)
elif init_from == 'resume':
    print(f"Resuming training from {out_dir}")
    # resume training from a checkpoint.
    ckpt_path = os.path.join(out_dir, 'ckpt.pt')
    checkpoint = torch.load(ckpt_path, map_location=device)
    checkpoint_model_args = checkpoint['model_args']
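    # force these config attributes to be equal otherwise we can't even resume training
    # the rest of the attributes (e.g. dropout) can stay as desired from command line
    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
        model_args[k] = checkpoint_model_args[k]
    # the MoE-specific attributes also affect the parameter layout, so restore them when the checkpoint has them
    for k in ['n_channels', 'n_experts', 'fraction_experts_tied']:
        if k in checkpoint_model_args:
            model_args[k] = checkpoint_model_args[k]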
    # create the model
    gptconf = GPTConfig(**model_args)
    model = GPT(gptconf)
    state_dict = checkpoint['model']
    # fix the keys of the state dictionary :(
    # honestly no idea how checkpoints sometimes get this prefix, have to debug more
    unwanted_prefix = '_orig_mod.'
    for k,v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    model.load_state_dict(state_dict)
    iter_num = checkpoint['iter_num']
    best_val_loss = checkpoint['best_val_loss']
elif init_from.startswith('gpt2'):
    print(f"Initializing from OpenAI GPT-2 weights: {init_from}")
    # initialize from OpenAI GPT-2 weights
    override_args = dict(dropout=dropout)
    model = GPT.from_pretrained(init_from, override_args)
    # read off the created config params, so we can store them into checkpoint correctly
    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
        model_args[k] = getattr(model.config, k)
# crop down the model block size if desired, using model surgery
if block_size < model.config.block_size:
    model.crop_block_size(block_size)
    model_args['block_size'] = block_size # so that the checkpoint will have the right value
model.to(device)

# initialize a GradScaler. If enabled=False scaler is a no-op
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

# optimizer
optimizer = model.configure_optimizers(weight_decay, learning_rate, (beta1, beta2), device_type)
if init_from == 'resume':
    optimizer.load_state_dict(checkpoint['optimizer'])
checkpoint = None # free up memory

# compile the model
if compile:
    print("compiling the model... (takes a ~minute)")
    unoptimized_model = model
    model = torch.compile(model) # requires PyTorch 2.0

# wrap model into DDP container
if ddp:
    model = DDP(model, device_ids=[ddp_local_rank])

# helps estimate an arbitrarily accurate loss over either split using many batches
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            with ctx:
                logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# learning rate decay scheduler (cosine with warmup)
def get_lr(it):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) if it > lr_decay_iters, return min learning rate
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff ranges 0..1
    return min_lr + coeff * (learning_rate - min_lr)

# logging
if wandb_log and master_process:
    import wandb
    wandb.init(project=wandb_project, name=wandb_run_name, config=config)

# training loop
X, Y = get_batch('train') # fetch the very first batch
t0 = time.time()
local_iter_num = 0 # number of iterations in the lifetime of this process
raw_model = model.module if ddp else model # unwrap DDP container if needed
running_mfu = -1.0
while True:

    # determine and set the learning rate for this iteration
    lr = get_lr(iter_num) if decay_lr else learning_rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # evaluate the loss on train/val sets and write checkpoints
    if iter_num % eval_interval == 0 and master_process:
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        if wandb_log:
            wandb.log({
                "iter": iter_num,
                "train/loss": losses['train'],
                "val/loss": losses['val'],
                "lr": lr,
                "mfu": running_mfu*100, # convert to percentage
            })
        if losses['val'] < best_val_loss or always_save_checkpoint:
            best_val_loss = losses['val']
            if iter_num > 0:
                checkpoint = {
                    'model': raw_model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'model_args': model_args,
                    'iter_num': iter_num,
                    'best_val_loss': best_val_loss,
                    'config': config,
                }
                print(f"saving checkpoint to {out_dir}")
                torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))
    if iter_num == 0 and eval_only:
        break

    # forward backward update, with optional gradient accumulation to simulate larger batch size
    # and using the GradScaler if data type is float16
    for micro_step in range(gradient_accumulation_steps):
        if ddp:
            # in DDP training we only need to sync gradients at the last micro step.
            # the official way to do this is with model.no_sync() context manager, but
            # I really dislike that this bloats the code and forces us to repeat code
            # looking at the source of that context manager, it just toggles this variable
            model.require_backward_grad_sync = (micro_step == gradient_accumulation_steps - 1)
        with ctx:
            logits, loss = model(X, Y)
            loss = loss / gradient_accumulation_steps # scale the loss to account for gradient accumulation
        # immediately async prefetch next batch while model is doing the forward pass on the GPU
        X, Y = get_batch('train')
        # backward pass, with gradient scaling if training in fp16
        scaler.scale(loss).backward()
    # clip the gradient
    if grad_clip != 0.0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    # step the optimizer and scaler if training in fp16
    scaler.step(optimizer)
    scaler.update()
    # flush the gradients as soon as we can, no need for this memory anymore
    optimizer.zero_grad(set_to_none=True)

    # timing and logging
    t1 = time.time()
    dt = t1 - t0
    t0 = t1
    if iter_num % log_interval == 0 and master_process:
        # get loss as float. note: this is a CPU-GPU sync point
        # scale up to undo the division above, approximating the true total loss (exact would have been a sum)
        lossf = loss.item() * gradient_accumulation_steps
        if local_iter_num >= 5: # let the training loop settle a bit
            mfu = raw_model.estimate_mfu(batch_size * gradient_accumulation_steps, dt)
            running_mfu = mfu if running_mfu == -1.0 else 0.9*running_mfu + 0.1*mfu
        print(f"iter {iter_num}: loss {lossf:.4f}, time {dt*1000:.2f}ms, mfu {running_mfu*100:.2f}%")
    iter_num += 1
    local_iter_num += 1

    # termination conditions
    if iter_num > max_iters:
        break

if ddp:
    destroy_process_group()

tokens per iteration will be: 163,840
Initializing a new model from scratch
defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
TIED EXPERTS: 42.0 of 84
number of parameters: 127.17M
num decayed parameter tensors: 140, with 127,532,544 parameters
num non-decayed parameter tensors: 49, with 33,024 parameters
using fused AdamW: True
compiling the model... (takes a ~minute)
step 0: train loss 10.8268, val loss 10.8055
iter 0: loss 10.8255, time 159472.43ms, mfu -100.00%
iter 1: loss 9.8617, time 16582.09ms, mfu -100.00%
iter 2: loss 9.8032, time 16275.69ms, mfu -100.00%
iter 3: loss 8.6110, time 15920.12ms, mfu -100.00%
iter 4: loss 7.7690, time 15824.49ms, mfu -100.00%
iter 5: loss 6.8583, time 15934.86ms, mfu 2.70%
iter 6: loss 7.0212, time 16054.78ms, mfu 2.70%
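
In [None]:
Two of the first numbers in the output above can be sanity-checked by hand. The sketch below assumes the config from the training cell (batch_size=8, block_size=512, single process) and the default vocab_size of 50304. At initialization the model should be close to a uniform guess over the vocabulary, so the expected cross-entropy is roughly ln(vocab_size).

```python
import math

# values copied from the config cell (assumed unchanged at run time)
batch_size = 8
block_size = 512
gradient_accumulation_steps = int(40 * (8 / batch_size))
ddp_world_size = 1  # ddp = False, single process
vocab_size = 50304

# same formula as in the training script
tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size

# cross-entropy of a uniform prediction over the vocabulary
uniform_loss = math.log(vocab_size)

print(f"tokens per iteration: {tokens_per_iter:,}")
print(f"loss of a uniform guess: {uniform_loss:.4f}")
```

Both agree with the log: 163,840 tokens per iteration, and a step-0 loss near 10.83. A step-0 loss far above ln(vocab_size) would usually point at an initialization problem.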

In [None]:
22 experts, half are tied; the first 2 and last 2 are not tied
122M params, block_size 512, n_channels 4, n_embd 768
step 0: train loss 10.8742, val loss 10.8654
iter 0: loss 10.9373, time 235594.05ms, mfu -100.00%
iter 1: loss 9.5692, time 20302.75ms, mfu -100.00%
iter 2: loss 9.3336, time 19612.51ms, mfu -100.00%
iter 3: loss 8.5666, time 19512.88ms, mfu -100.00%
iter 4: loss 7.8304, time 19520.54ms, mfu -100.00%
iter 5: loss 6.8368, time 19566.81ms, mfu 2.12%
iter 6: loss 6.9976, time 19566.01ms, mfu 2.12%
iter 7: loss 6.7352, time 19535.17ms, mfu 2.12%
iter 8: loss 6.8229, time 19527.55ms, mfu 2.12%
iter 9: loss 5.9560, time 19969.21ms, mfu 2.12%
iter 10: loss 6.2264, time 19542.85ms, mfu 2.12%
iter 11: loss 6.2818, time 19534.69ms, mfu 2.12%
iter 12: loss 5.5996, time 19528.22ms, mfu 2.12%
iter 13: loss 5.6668, time 19515.79ms, mfu 2.12%
iter 14: loss 6.3733, time 19531.67ms, mfu 2.12%
iter 15: loss 5.3279, time 19531.48ms, mfu 2.12%
iter 16: loss 6.2158, time 19582.76ms, mfu 2.12%
iter 17: loss 5.2143, time 19549.86ms, mfu 2.12%
iter 18: loss 5.8425, time 19528.37ms, mfu 2.12%
iter 19: loss 5.7863, time 19532.87ms, mfu 2.12%
iter 20: loss 6.1956, time 19555.85ms, mfu 2.12%
iter 21: loss 5.8044, time 19853.82ms, mfu 2.12%
iter 22: loss 5.7088, time 19524.25ms, mfu 2.12%
iter 23: loss 5.6896, time 19543.35ms, mfu 2.12%
iter 24: loss 5.4736, time 19570.87ms, mfu 2.12%
iter 25: loss 5.4996, time 19518.62ms, mfu 2.12%
iter 26: loss 5.5398, time 19539.88ms, mfu 2.12%
iter 27: loss 5.5123, time 19568.09ms, mfu 2.12%
iter 28: loss 5.4483, time 19546.70ms, mfu 2.12%
iter 29: loss 5.6664, time 19512.68ms, mfu 2.12%
iter 30: loss 5.8704, time 19520.92ms, mfu 2.12%
iter 31: loss 5.7606, time 19556.69ms, mfu 2.12%
iter 32: loss 5.7015, time 19533.75ms, mfu 2.12%
iter 33: loss 5.2100, time 19946.57ms, mfu 2.12%
iter 34: loss 5.3675, time 19544.43ms, mfu 2.12%
iter 35: loss 5.6370, time 19524.26ms, mfu 2.12%
iter 36: loss 4.6845, time 19530.56ms, mfu 2.12%
iter 37: loss 5.5585, time 19564.49ms, mfu 2.12%
iter 38: loss 5.5305, time 19647.98ms, mfu 2.12%
iter 39: loss 5.3003, time 19650.28ms, mfu 2.12%
iter 40: loss 5.4118, time 19540.85ms, mfu 2.12%
iter 41: loss 4.9199, time 19557.32ms, mfu 2.12%
iter 42: loss 4.9256, time 19539.34ms, mfu 2.12%
iter 43: loss 5.6077, time 19549.31ms, mfu 2.12%
iter 44: loss 4.8161, time 19983.53ms, mfu 2.11%
iter 45: loss 5.1001, time 19644.18ms, mfu 2.11%
iter 46: loss 5.1232, time 19653.56ms, mfu 2.11%
iter 47: loss 4.8345, time 19551.53ms, mfu 2.11%
iter 48: loss 4.7367, time 19584.03ms, mfu 2.11%
iter 49: loss 4.8574, time 19594.51ms, mfu 2.11%
iter 50: loss 5.2450, time 19586.29ms, mfu 2.11%
iter 51: loss 4.6109, time 19612.40ms, mfu 2.11%
iter 52: loss 5.1017, time 19653.37ms, mfu 2.11%
iter 53: loss 4.7041, time 19703.88ms, mfu 2.11%
iter 54: loss 5.0468, time 19640.86ms, mfu 2.11%
iter 55: loss 4.9486, time 19666.94ms, mfu 2.11%
iter 56: loss 4.5459, time 19942.59ms, mfu 2.11%
iter 57: loss 4.6944, time 19679.61ms, mfu 2.11%
iter 58: loss 4.8170, time 19663.41ms, mfu 2.11%
iter 59: loss 4.6294, time 19640.35ms, mfu 2.11%
iter 60: loss 4.9441, time 19646.27ms, mfu 2.11%
iter 61: loss 4.5908, time 19710.96ms, mfu 2.11%
iter 62: loss 4.6333, time 19620.74ms, mfu 2.11%
iter 63: loss 4.8380, time 19648.48ms, mfu 2.11%
iter 64: loss 4.8229, time 19654.71ms, mfu 2.11%
iter 65: loss 4.6244, time 19679.67ms, mfu 2.11%
iter 66: loss 4.6889, time 19687.99ms, mfu 2.11%
iter 67: loss 4.6136, time 19876.68ms, mfu 2.11%
iter 68: loss 4.6278, time 20025.22ms, mfu 2.10%
iter 69: loss 4.7701, time 19643.57ms, mfu 2.10%
iter 70: loss 3.7794, time 19605.14ms, mfu 2.11%
iter 71: loss 4.8603, time 19624.44ms, mfu 2.11%
iter 72: loss 4.2795, time 19602.93ms, mfu 2.11%
iter 73: loss 4.7219, time 19596.89ms, mfu 2.11%
iter 74: loss 4.3182, time 19627.27ms, mfu 2.11%
iter 75: loss 4.4689, time 19640.58ms, mfu 2.11%
iter 76: loss 4.7539, time 19671.15ms, mfu 2.11%
iter 77: loss 4.0337, time 19636.22ms, mfu 2.11%
iter 78: loss 4.6246, time 19662.04ms, mfu 2.11%
iter 79: loss 4.0645, time 19661.39ms, mfu 2.11%
iter 80: loss 4.0985, time 19941.61ms, mfu 2.11%
iter 81: loss 4.3193, time 19614.46ms, mfu 2.11%
iter 82: loss 3.7222, time 19607.45ms, mfu 2.11%
iter 83: loss 4.0258, time 19648.38ms, mfu 2.11%
iter 84: loss 4.7010, time 19620.07ms, mfu 2.11%
iter 85: loss 4.5243, time 19613.37ms, mfu 2.11%
iter 86: loss 4.1362, time 19635.37ms, mfu 2.11%
iter 87: loss 3.9700, time 19634.14ms, mfu 2.11%
iter 88: loss 4.2836, time 19655.20ms, mfu 2.11%
iter 89: loss 4.3084, time 19633.91ms, mfu 2.11%
iter 90: loss 4.4079, time 19640.85ms, mfu 2.11%
iter 91: loss 4.3735, time 20000.84ms, mfu 2.11%
iter 92: loss 3.7845, time 19607.79ms, mfu 2.11%
iter 93: loss 4.3156, time 19622.00ms, mfu 2.11%
iter 94: loss 4.0923, time 19599.91ms, mfu 2.11%
iter 95: loss 3.8698, time 19667.07ms, mfu 2.11%
iter 96: loss 4.1850, time 19665.32ms, mfu 2.11%
iter 97: loss 3.8676, time 19643.53ms, mfu 2.11%
iter 98: loss 3.8669, time 19632.92ms, mfu 2.11%
iter 99: loss 3.9873, time 19653.58ms, mfu 2.11%
iter 100: loss 4.0950, time 19616.20ms, mfu 2.11%
iter 101: loss 3.7382, time 19624.36ms, mfu 2.11%
iter 102: loss 4.0727, time 19598.49ms, mfu 2.11%
iter 103: loss 4.0009, time 20002.35ms, mfu 2.11%
iter 104: loss 4.2189, time 19604.91ms, mfu 2.11%
iter 105: loss 4.1368, time 19620.53ms, mfu 2.11%
iter 106: loss 4.5212, time 19675.51ms, mfu 2.11%
iter 107: loss 4.2781, time 19642.19ms, mfu 2.11%
iter 108: loss 3.8773, time 19658.74ms, mfu 2.11%
iter 109: loss 3.6588, time 19621.37ms, mfu 2.11%
iter 110: loss 3.3335, time 19651.80ms, mfu 2.11%
iter 111: loss 3.8953, time 19663.19ms, mfu 2.11%
iter 112: loss 3.9826, time 19672.31ms, mfu 2.11%
iter 113: loss 3.9543, time 19667.00ms, mfu 2.11%
iter 114: loss 3.9316, time 19650.18ms, mfu 2.11%
iter 115: loss 4.4969, time 19888.65ms, mfu 2.11%
iter 116: loss 3.9627, time 19599.90ms, mfu 2.11%
iter 117: loss 3.7920, time 19612.31ms, mfu 2.11%
iter 118: loss 4.1794, time 19607.92ms, mfu 2.11%
iter 119: loss 4.1702, time 19606.86ms, mfu 2.11%
iter 120: loss 3.6590, time 19623.18ms, mfu 2.11%
iter 121: loss 3.7834, time 19620.76ms, mfu 2.11%
iter 122: loss 4.1254, time 19625.79ms, mfu 2.11%
iter 123: loss 3.9923, time 19614.33ms, mfu 2.11%
iter 124: loss 3.0677, time 19637.64ms, mfu 2.11%
iter 125: loss 4.1498, time 19637.54ms, mfu 2.11%
iter 126: loss 4.2521, time 19616.03ms, mfu 2.11%
iter 127: loss 3.6791, time 20003.78ms, mfu 2.11%
iter 128: loss 3.8807, time 19625.33ms, mfu 2.11%
iter 129: loss 3.5672, time 19643.43ms, mfu 2.11%
iter 130: loss 4.1088, time 19612.93ms, mfu 2.11%
iter 131: loss 3.3622, time 19614.76ms, mfu 2.11%
iter 132: loss 3.5544, time 19590.65ms, mfu 2.11%
iter 133: loss 3.7121, time 19611.48ms, mfu 2.11%
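
In [None]:
Worth keeping in mind when reading these short runs: with warmup_iters = 0 and lr_decay_iters = 600000, the first ~130 iterations all run at essentially the peak learning rate. A minimal standalone copy of `get_lr` (same constants as the config cell, the assert dropped) shows this; the iterations printed are picked only for illustration.

```python
import math

# constants copied from the config cell
learning_rate = 6e-4
min_lr = 6e-5
warmup_iters = 0
lr_decay_iters = 600000

def get_lr(it):
    # 1) linear warmup (never taken here because warmup_iters == 0)
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) past the decay horizon: stay at the minimum
    if it > lr_decay_iters:
        return min_lr
    # 3) cosine decay from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)

for it in (0, 133, 300000, 600000):
    print(f"iter {it}: lr {get_lr(it):.3e}")
```

So any loss spikes early in a run happen at the full 6e-4 with no warmup. Whether warmup would change the comparison between runs is not something these logs can answer.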

In [None]:
import re
import matplotlib.pyplot as plt

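# extract the loss from log lines of the form "iter N: loss X, ..."
loss_pattern = r"iter \d+: loss (\d+\.\d+)"
# paste the training log between the triple quotes below
loss_values = re.findall(loss_pattern, """

""")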

loss_numbers = [float(value) for value in loss_values]

plt.plot(loss_numbers)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Loss Progression')
plt.show()
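
In [None]:
The per-iteration loss is noisy, so a smoothed curve is easier to compare across runs. A small sketch of the same regex idea that also keeps the iteration number and step time and adds a trailing moving average. The helper names `parse_log` and `moving_average` are made up here, and the sample lines are copied from the 22-expert log.

```python
import re

# a few lines copied from the log, standing in for the full pasted output
sample_log = """
iter 2: loss 9.3336, time 19612.51ms, mfu -100.00%
iter 3: loss 8.5666, time 19512.88ms, mfu -100.00%
iter 4: loss 7.8304, time 19520.54ms, mfu -100.00%
iter 5: loss 6.8368, time 19566.81ms, mfu 2.12%
iter 6: loss 6.9976, time 19566.01ms, mfu 2.12%
"""

line_pattern = re.compile(r"iter (\d+): loss (\d+\.\d+), time (\d+\.\d+)ms")

def parse_log(text):
    """Return a list of (iteration, loss, milliseconds) tuples."""
    return [(int(i), float(loss), float(ms))
            for i, loss, ms in line_pattern.findall(text)]

def moving_average(values, window):
    """Trailing mean over up to `window` most recent values."""
    out = []
    for k in range(len(values)):
        chunk = values[max(0, k - window + 1):k + 1]
        out.append(sum(chunk) / len(chunk))
    return out

records = parse_log(sample_log)
losses = [loss for _, loss, _ in records]
smoothed = moving_average(losses, window=3)
print(records[0], [round(s, 4) for s in smoothed])
```

`smoothed` can be passed to `plt.plot` in place of the raw list. The window of 3 is arbitrary; for a log of a few hundred iterations something like 10 to 20 is probably more useful.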

In [None]:
Losses for MoE with weight tying 2-10, 1//8
#gradient_accumulation_steps = 5 * 8 # used to simulate larger batch sizes
#batch_size = 8 # if gradient_accumulation_steps > 1, this is the micro-batch size
##block_size = 1024
# model
#n_layer = 12
#n_head = 12
#n_embd = 768
#n_channels = 4    # to get the same parameter count as the MLP: set number of experts equal to number of channels
#n_experts = 16  # to simulate a residual without an explicit residual, we add n_channels worth of no-op experts
#dropout = 0.0
Block size 512
Params = 119M


iter 0: loss 10.5496, time 201461.20ms, mfu -100.00%
iter 1: loss 7.8580, time 15991.93ms, mfu -100.00%
iter 2: loss 9.1977, time 15913.05ms, mfu -100.00%
iter 3: loss 8.0395, time 15681.86ms, mfu -100.00%
iter 4: loss 7.7929, time 15437.90ms, mfu -100.00%
iter 5: loss 6.7418, time 15431.78ms, mfu 2.63%
iter 6: loss 6.9914, time 15491.03ms, mfu 2.63%
iter 7: loss 6.3738, time 15514.68ms, mfu 2.63%
iter 8: loss 10.3224, time 15986.67ms, mfu 2.62%
iter 9: loss 7.8805, time 16727.01ms, mfu 2.60%
iter 10: loss 6.4923, time 17167.11ms, mfu 2.58%
iter 11: loss 6.3069, time 15408.65ms, mfu 2.58%
iter 12: loss 5.9059, time 15464.12ms, mfu 2.59%
iter 13: loss 6.5409, time 15486.64ms, mfu 2.59%
iter 14: loss 6.1827, time 15478.46ms, mfu 2.60%
iter 15: loss 5.7764, time 15481.41ms, mfu 2.60%
iter 16: loss 5.8517, time 15465.05ms, mfu 2.60%
iter 17: loss 6.1622, time 15485.94ms, mfu 2.60%
iter 18: loss 5.8874, time 15474.65ms, mfu 2.61%
iter 19: loss 6.0403, time 15578.55ms, mfu 2.61%
iter 20: loss 5.7187, time 15489.58ms, mfu 2.61%
iter 21: loss 5.8883, time 15481.46ms, mfu 2.61%
iter 22: loss 5.8783, time 15466.36ms, mfu 2.61%
iter 23: loss 5.6900, time 15482.22ms, mfu 2.61%
iter 24: loss 6.1334, time 15479.03ms, mfu 2.61%
iter 25: loss 5.9019, time 15490.91ms, mfu 2.61%
iter 26: loss 5.6818, time 15480.32ms, mfu 2.62%
iter 27: loss 5.7826, time 15478.30ms, mfu 2.62%
iter 28: loss 6.2421, time 15949.33ms, mfu 2.61%
iter 29: loss 6.1194, time 15644.78ms, mfu 2.61%
iter 30: loss 5.7750, time 15500.35ms, mfu 2.61%
iter 31: loss 5.4551, time 15496.02ms, mfu 2.61%
iter 32: loss 5.5985, time 15507.19ms, mfu 2.61%
iter 33: loss 6.0304, time 15508.31ms, mfu 2.61%
iter 34: loss 6.0989, time 15502.37ms, mfu 2.61%
iter 35: loss 6.2600, time 15494.05ms, mfu 2.61%
iter 36: loss 5.0418, time 15493.50ms, mfu 2.61%
iter 37: loss 4.7475, time 15517.34ms, mfu 2.62%
iter 38: loss 5.5421, time 15561.65ms, mfu 2.61%
iter 39: loss 5.0590, time 15661.69ms, mfu 2.61%
iter 40: loss 5.6652, time 15517.55ms, mfu 2.61%
iter 41: loss 5.2561, time 15510.69ms, mfu 2.61%
iter 42: loss 5.3916, time 15501.49ms, mfu 2.61%
iter 43: loss 5.2561, time 15502.08ms, mfu 2.62%
iter 44: loss 5.3353, time 15539.13ms, mfu 2.62%
iter 45: loss 5.3770, time 15497.83ms, mfu 2.62%
iter 46: loss 5.4238, time 15513.48ms, mfu 2.62%
iter 47: loss 5.1389, time 15526.82ms, mfu 2.62%
iter 48: loss 5.4933, time 16291.66ms, mfu 2.60%
iter 49: loss 4.7825, time 15530.99ms, mfu 2.61%
iter 50: loss 5.1460, time 15510.25ms, mfu 2.61%
iter 51: loss 5.2130, time 15497.02ms, mfu 2.61%
iter 52: loss 4.8980, time 15530.95ms, mfu 2.61%
iter 53: loss 5.1830, time 15559.27ms, mfu 2.61%
iter 54: loss 5.6052, time 15539.95ms, mfu 2.61%
iter 55: loss 5.0433, time 15527.73ms, mfu 2.61%
iter 56: loss 5.0833, time 15518.21ms, mfu 2.61%
iter 57: loss 4.7395, time 15545.09ms, mfu 2.61%
iter 58: loss 4.6960, time 15879.77ms, mfu 2.61%
iter 59: loss 4.9908, time 15506.46ms, mfu 2.61%
iter 60: loss 5.2034, time 15501.59ms, mfu 2.61%
iter 61: loss 5.3173, time 15524.94ms, mfu 2.61%
iter 62: loss 4.9190, time 15521.44ms, mfu 2.61%
iter 63: loss 4.4035, time 15522.42ms, mfu 2.61%
iter 64: loss 4.9877, time 15539.67ms, mfu 2.61%
iter 65: loss 4.7285, time 15526.08ms, mfu 2.61%
iter 66: loss 4.8348, time 15545.37ms, mfu 2.61%
iter 67: loss 4.8996, time 15548.85ms, mfu 2.61%
iter 68: loss 5.0770, time 16477.08ms, mfu 2.60%
iter 69: loss 5.0234, time 15582.24ms, mfu 2.60%
iter 70: loss 4.9628, time 15555.80ms, mfu 2.60%
iter 71: loss 5.0200, time 15523.05ms, mfu 2.60%
iter 72: loss 4.8751, time 15534.38ms, mfu 2.60%
iter 73: loss 4.3475, time 15519.57ms, mfu 2.60%
iter 74: loss 4.5017, time 15561.75ms, mfu 2.61%
iter 75: loss 5.2358, time 15566.38ms, mfu 2.61%
iter 76: loss 3.9662, time 15514.86ms, mfu 2.61%
iter 77: loss 4.4274, time 15588.24ms, mfu 2.61%
iter 78: loss 4.2607, time 15745.42ms, mfu 2.60%
iter 79: loss 4.1449, time 15595.34ms, mfu 2.60%
iter 80: loss 4.7209, time 15570.63ms, mfu 2.60%
iter 81: loss 4.5566, time 15526.29ms, mfu 2.61%
iter 82: loss 4.3446, time 15544.13ms, mfu 2.61%
iter 83: loss 4.3396, time 15578.89ms, mfu 2.61%
iter 84: loss 4.1098, time 15614.59ms, mfu 2.61%
iter 85: loss 4.6080, time 15597.33ms, mfu 2.61%
iter 86: loss 4.4334, time 15614.07ms, mfu 2.61%
iter 87: loss 4.1860, time 15611.50ms, mfu 2.61%
iter 88: loss 4.3947, time 15802.91ms, mfu 2.60%
iter 89: loss 4.4694, time 15862.35ms, mfu 2.60%
iter 90: loss 4.3495, time 15581.72ms, mfu 2.60%
iter 91: loss 4.2405, time 15557.58ms, mfu 2.60%
iter 92: loss 4.4502, time 15503.51ms, mfu 2.60%
iter 93: loss 3.6781, time 15533.39ms, mfu 2.60%
iter 94: loss 4.2028, time 15593.35ms, mfu 2.60%
iter 95: loss 4.0590, time 15583.11ms, mfu 2.60%
iter 96: loss 3.9999, time 15556.82ms, mfu 2.60%
iter 97: loss 4.2502, time 15552.48ms, mfu 2.61%
iter 98: loss 4.3104, time 16014.92ms, mfu 2.60%
iter 99: loss 4.4702, time 15580.70ms, mfu 2.60%
iter 100: loss 4.3030, time 15560.13ms, mfu 2.60%
iter 101: loss 4.3456, time 15566.66ms, mfu 2.60%
iter 102: loss 4.0975, time 15574.13ms, mfu 2.60%
iter 103: loss 4.2060, time 15569.51ms, mfu 2.60%
iter 104: loss 4.1035, time 15542.79ms, mfu 2.60%
iter 105: loss 3.8556, time 15554.08ms, mfu 2.60%
iter 106: loss 3.9002, time 15556.90ms, mfu 2.61%
iter 107: loss 4.2606, time 15569.42ms, mfu 2.61%
iter 108: loss 4.0130, time 15562.34ms, mfu 2.61%
iter 109: loss 3.8277, time 16174.09ms, mfu 2.60%
iter 110: loss 4.1542, time 15545.05ms, mfu 2.60%
iter 111: loss 3.9946, time 15551.95ms, mfu 2.60%
iter 112: loss 4.0844, time 15564.93ms, mfu 2.60%
iter 113: loss 3.8145, time 15561.72ms, mfu 2.60%
iter 114: loss 3.9757, time 15555.77ms, mfu 2.60%
iter 115: loss 3.8494, time 15519.14ms, mfu 2.60%
iter 116: loss 4.1259, time 15555.58ms, mfu 2.61%
iter 117: loss 3.8782, time 15564.94ms, mfu 2.61%
iter 118: loss 3.7902, time 15551.91ms, mfu 2.61%
iter 119: loss 4.0406, time 15693.09ms, mfu 2.60%
iter 120: loss 3.7595, time 15586.39ms, mfu 2.60%
iter 121: loss 3.9418, time 15580.22ms, mfu 2.61%
iter 122: loss 4.1073, time 15537.80ms, mfu 2.61%
iter 123: loss 3.6519, time 15531.86ms, mfu 2.61%
iter 124: loss 3.5982, time 15571.60ms, mfu 2.61%
iter 125: loss 3.0670, time 15577.08ms, mfu 2.61%
iter 126: loss 3.7166, time 15574.74ms, mfu 2.61%
iter 127: loss 3.9098, time 15568.69ms, mfu 2.61%
iter 128: loss 4.1879, time 15533.27ms, mfu 2.61%
iter 129: loss 3.7154, time 16202.63ms, mfu 2.60%
iter 130: loss 3.7123, time 15554.70ms, mfu 2.60%
iter 131: loss 3.6772, time 15552.58ms, mfu 2.60%
iter 132: loss 4.0502, time 15521.83ms, mfu 2.60%
iter 133: loss 3.7145, time 15505.29ms, mfu 2.60%
iter 134: loss 3.6576, time 15522.76ms, mfu 2.61%
iter 135: loss 3.8645, time 15497.15ms, mfu 2.61%
iter 136: loss 4.0430, time 15530.40ms, mfu 2.61%
iter 137: loss 3.5491, time 15529.70ms, mfu 2.61%
iter 138: loss 3.4967, time 15513.84ms, mfu 2.61%
iter 139: loss 3.6438, time 15545.47ms, mfu 2.61%
iter 140: loss 4.0229, time 15883.56ms, mfu 2.61%
iter 141: loss 3.6197, time 15548.29ms, mfu 2.61%
iter 142: loss 3.5067, time 15572.78ms, mfu 2.61%
iter 143: loss 3.9023, time 15585.76ms, mfu 2.61%
iter 144: loss 3.4453, time 15585.43ms, mfu 2.61%
iter 145: loss 3.7979, time 15607.55ms, mfu 2.61%
iter 146: loss 3.7835, time 15587.40ms, mfu 2.61%
iter 147: loss 3.7274, time 15583.69ms, mfu 2.61%
iter 148: loss 3.8025, time 15589.98ms, mfu 2.61%
iter 149: loss 3.5961, time 15957.41ms, mfu 2.60%
iter 150: loss 3.8399, time 15588.56ms, mfu 2.60%
iter 151: loss 3.9112, time 15807.03ms, mfu 2.60%
iter 152: loss 3.6547, time 15596.03ms, mfu 2.60%
iter 153: loss 3.7479, time 15608.67ms, mfu 2.60%
iter 154: loss 3.9943, time 15613.51ms, mfu 2.60%
iter 155: loss 3.5901, time 15608.06ms, mfu 2.60%
iter 156: loss 3.7442, time 15616.38ms, mfu 2.60%
iter 157: loss 3.4755, time 15596.31ms, mfu 2.60%
iter 158: loss 3.8800, time 15618.43ms, mfu 2.60%
iter 159: loss 3.7167, time 15601.92ms, mfu 2.60%
iter 160: loss 3.6166, time 15593.10ms, mfu 2.60%
iter 161: loss 3.5934, time 15966.55ms, mfu 2.60%
iter 162: loss 3.8002, time 15681.51ms, mfu 2.60%
iter 163: loss 3.8531, time 15537.85ms, mfu 2.60%
iter 164: loss 3.7795, time 15586.66ms, mfu 2.60%
iter 165: loss 3.8058, time 15644.38ms, mfu 2.60%
iter 166: loss 3.8243, time 15611.09ms, mfu 2.60%
iter 167: loss 3.8385, time 15533.06ms, mfu 2.60%
iter 168: loss 3.2237, time 15516.04ms, mfu 2.60%
iter 169: loss 3.5984, time 15988.79ms, mfu 2.60%
iter 170: loss 3.6642, time 15533.55ms, mfu 2.60%
iter 171: loss 3.4045, time 15529.18ms, mfu 2.60%
iter 172: loss 3.5449, time 15827.28ms, mfu 2.60%
iter 173: loss 3.2869, time 15527.38ms, mfu 2.60%
iter 174: loss 3.7998, time 15559.53ms, mfu 2.60%
iter 175: loss 4.3149, time 15545.23ms, mfu 2.60%
iter 176: loss 3.5281, time 15562.97ms, mfu 2.60%
iter 177: loss 3.5306, time 15590.84ms, mfu 2.60%
iter 178: loss 3.5306, time 15557.46ms, mfu 2.60%
iter 179: loss 3.5240, time 15551.73ms, mfu 2.60%
iter 180: loss 3.7603, time 15563.85ms, mfu 2.60%
iter 181: loss 3.7127, time 15590.02ms, mfu 2.60%
iter 182: loss 3.5117, time 15904.33ms, mfu 2.60%
iter 183: loss 3.6257, time 15596.25ms, mfu 2.60%
iter 184: loss 3.2285, time 15521.07ms, mfu 2.60%
iter 185: loss 3.6972, time 15554.94ms, mfu 2.60%
iter 186: loss 3.4791, time 15562.11ms, mfu 2.60%
iter 187: loss 3.5209, time 15601.94ms, mfu 2.60%
iter 188: loss 3.3363, time 15587.92ms, mfu 2.60%
iter 189: loss 3.5081, time 16004.13ms, mfu 2.60%
iter 190: loss 3.5572, time 15541.33ms, mfu 2.60%
iter 191: loss 3.5353, time 15513.30ms, mfu 2.60%
iter 192: loss 3.4124, time 15686.53ms, mfu 2.60%
iter 193: loss 3.8282, time 15797.64ms, mfu 2.60%
iter 194: loss 3.5870, time 15557.56ms, mfu 2.60%
iter 195: loss 3.5669, time 15568.88ms, mfu 2.60%
iter 196: loss 3.3700, time 15574.17ms, mfu 2.60%
iter 197: loss 3.4367, time 15532.20ms, mfu 2.60%
iter 198: loss 3.2026, time 15527.08ms, mfu 2.60%
iter 199: loss 3.3023, time 15542.02ms, mfu 2.60%
iter 200: loss 3.0132, time 15570.36ms, mfu 2.61%
iter 201: loss 3.5980, time 15521.83ms, mfu 2.61%
iter 202: loss 3.4657, time 15561.07ms, mfu 2.61%
iter 203: loss 2.8025, time 15655.26ms, mfu 2.61%
iter 204: loss 3.5007, time 15709.72ms, mfu 2.60%
iter 205: loss 3.3317, time 15564.58ms, mfu 2.60%
iter 206: loss 3.3960, time 15576.42ms, mfu 2.60%
iter 207: loss 3.4090, time 15537.86ms, mfu 2.61%
iter 208: loss 3.0504, time 15544.88ms, mfu 2.61%
iter 209: loss 3.1450, time 15572.67ms, mfu 2.61%
iter 210: loss 3.3565, time 15903.14ms, mfu 2.60%
iter 211: loss 3.3267, time 15581.47ms, mfu 2.60%
iter 212: loss 3.2799, time 15577.57ms, mfu 2.60%
iter 213: loss 3.9759, time 15562.59ms, mfu 2.60%
iter 214: loss 3.5953, time 15640.87ms, mfu 2.60%
iter 215: loss 3.3641, time 15825.60ms, mfu 2.60%
iter 216: loss 3.4127, time 15537.87ms, mfu 2.60%
iter 217: loss 3.4980, time 15541.58ms, mfu 2.60%
iter 218: loss 3.5675, time 15546.77ms, mfu 2.60%
iter 219: loss 2.8910, time 15571.53ms, mfu 2.60%
iter 220: loss 3.4550, time 15561.42ms, mfu 2.60%
iter 221: loss 3.2639, time 15568.84ms, mfu 2.61%
iter 222: loss 3.5732, time 15595.69ms, mfu 2.61%
iter 223: loss 3.5502, time 15598.87ms, mfu 2.61%
iter 224: loss 2.9080, time 15609.44ms, mfu 2.60%
iter 225: loss 3.2880, time 15780.49ms, mfu 2.60%
iter 226: loss 3.3588, time 15657.42ms, mfu 2.60%
iter 227: loss 3.1142, time 15534.94ms, mfu 2.60%
iter 228: loss 3.0523, time 15562.44ms, mfu 2.60%
iter 229: loss 3.3545, time 15540.96ms, mfu 2.60%
iter 230: loss 3.5060, time 15841.83ms, mfu 2.60%
iter 231: loss 3.2516, time 15552.81ms, mfu 2.60%
iter 232: loss 3.3218, time 15546.56ms, mfu 2.60%
iter 233: loss 3.3270, time 15535.03ms, mfu 2.60%
iter 234: loss 2.9918, time 15542.73ms, mfu 2.60%
iter 235: loss 3.4410, time 15640.18ms, mfu 2.60%
iter 236: loss 3.2404, time 15724.20ms, mfu 2.60%
iter 237: loss 3.4811, time 15502.33ms, mfu 2.60%
iter 238: loss 3.1366, time 15516.27ms, mfu 2.61%
iter 239: loss 3.4770, time 15519.34ms, mfu 2.61%
iter 240: loss 2.9299, time 15506.68ms, mfu 2.61%
iter 241: loss 3.4372, time 15506.75ms, mfu 2.61%
iter 242: loss 3.0867, time 15509.92ms, mfu 2.61%
iter 243: loss 3.2273, time 15788.95ms, mfu 2.61%
iter 244: loss 3.6071, time 15543.56ms, mfu 2.61%
iter 245: loss 3.8065, time 15513.72ms, mfu 2.61%
iter 246: loss 3.3285, time 15563.29ms, mfu 2.61%
iter 247: loss 3.1286, time 15546.88ms, mfu 2.61%
iter 248: loss 3.0438, time 15536.71ms, mfu 2.61%
iter 249: loss 3.1228, time 15524.92ms, mfu 2.61%
iter 250: loss 3.3888, time 15865.43ms, mfu 2.61%
iter 251: loss 3.2347, time 15534.86ms, mfu 2.61%
iter 252: loss 2.9589, time 15541.19ms, mfu 2.61%
iter 253: loss 2.9632, time 15560.89ms, mfu 2.61%
iter 254: loss 3.2970, time 15577.15ms, mfu 2.61%
iter 255: loss 2.7699, time 15853.40ms, mfu 2.60%
iter 256: loss 3.6863, time 15568.48ms, mfu 2.60%
iter 257: loss 2.9634, time 15519.42ms, mfu 2.61%
iter 258: loss 3.1813, time 15504.52ms, mfu 2.61%
iter 259: loss 3.0801, time 15498.53ms, mfu 2.61%
iter 260: loss 2.9143, time 15534.15ms, mfu 2.61%
iter 261: loss 3.4172, time 15522.04ms, mfu 2.61%
iter 262: loss 3.4539, time 15515.52ms, mfu 2.61%
iter 263: loss 2.9955, time 15562.98ms, mfu 2.61%
iter 264: loss 3.4846, time 15586.20ms, mfu 2.61%
iter 265: loss 2.8587, time 15750.13ms, mfu 2.61%
iter 266: loss 3.3069, time 15546.75ms, mfu 2.61%
iter 267: loss 3.5753, time 15523.80ms, mfu 2.61%
iter 268: loss 2.8718, time 15562.25ms, mfu 2.61%
iter 269: loss 3.1959, time 15589.45ms, mfu 2.61%
iter 270: loss 3.0923, time 15543.78ms, mfu 2.61%
iter 271: loss 3.0377, time 15851.29ms, mfu 2.60%
iter 272: loss 2.6705, time 15537.87ms, mfu 2.61%
iter 273: loss 3.2692, time 15544.71ms, mfu 2.61%
iter 274: loss 3.3469, time 15753.24ms, mfu 2.60%
iter 275: loss 3.5814, time 15525.04ms, mfu 2.60%
iter 276: loss 3.1679, time 15505.90ms, mfu 2.61%
iter 277: loss 3.0719, time 15516.82ms, mfu 2.61%

In [None]:
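import re
import matplotlib.pyplot as plt

# Create the figure only after matplotlib has been imported;
# calling plt.figure() before the import raises a NameError in a fresh kernel.
plt.figure(figsize=(10, 5))

# Captures the loss value from log lines such as "iter 12: loss 5.9059, ...".
loss_pattern = r"iter \d+: loss (\d+\.\d+)"
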
loss_values = re.findall(loss_pattern, """
tokens per iteration will be: 163,840
Initializing a new model from scratch
defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
E, n_head 768 12
TIED EXPERTS: 14 of 192
number of parameters: 119.51M
num decayed parameter tensors: 394, with 119,869,440 parameters
num non-decayed parameter tensors: 49, with 30,720 parameters
using fused AdamW: True
compiling the model... (takes a ~minute)
step 0: train loss 10.7424, val loss 10.7163
iter 0: loss 10.5496, time 237212.36ms, mfu -100.00%
iter 1: loss 7.8580, time 15582.87ms, mfu -100.00%
iter 2: loss 9.1977, time 15715.41ms, mfu -100.00%
iter 3: loss 8.0395, time 15829.06ms, mfu -100.00%
iter 4: loss 7.7929, time 15894.36ms, mfu -100.00%
iter 5: loss 6.7418, time 15921.47ms, mfu 2.55%
iter 6: loss 6.9914, time 15828.20ms, mfu 2.55%
iter 7: loss 6.3738, time 15762.87ms, mfu 2.56%
iter 8: loss 10.3224, time 15814.98ms, mfu 2.56%
iter 9: loss 7.8805, time 15870.74ms, mfu 2.56%
iter 10: loss 6.4923, time 15840.29ms, mfu 2.56%
iter 11: loss 6.3069, time 15878.58ms, mfu 2.56%
iter 12: loss 5.9059, time 15759.95ms, mfu 2.56%
iter 13: loss 6.5409, time 15729.33ms, mfu 2.56%
iter 14: loss 6.1827, time 15743.67ms, mfu 2.56%
iter 15: loss 5.7764, time 15736.97ms, mfu 2.57%
iter 16: loss 5.8517, time 15739.11ms, mfu 2.57%
iter 17: loss 6.1622, time 15747.25ms, mfu 2.57%
iter 18: loss 5.8874, time 16077.70ms, mfu 2.56%
iter 19: loss 6.0403, time 15796.48ms, mfu 2.57%
iter 20: loss 5.7187, time 15798.83ms, mfu 2.57%
iter 21: loss 5.8883, time 15796.92ms, mfu 2.57%
iter 22: loss 5.8783, time 15791.44ms, mfu 2.57%
iter 23: loss 5.6900, time 15773.66ms, mfu 2.57%
iter 24: loss 6.1334, time 15782.48ms, mfu 2.57%
iter 25: loss 5.9019, time 15794.16ms, mfu 2.57%
iter 26: loss 5.6818, time 15798.12ms, mfu 2.57%
iter 27: loss 5.7826, time 15803.54ms, mfu 2.57%
iter 28: loss 6.2421, time 15809.40ms, mfu 2.57%
iter 29: loss 6.1194, time 15823.85ms, mfu 2.57%
iter 30: loss 5.7750, time 15850.24ms, mfu 2.57%
iter 31: loss 5.4551, time 15859.91ms, mfu 2.57%
iter 32: loss 5.5985, time 15874.37ms, mfu 2.57%
iter 33: loss 6.0304, time 15808.76ms, mfu 2.57%
iter 34: loss 6.0989, time 15789.16ms, mfu 2.57%
iter 35: loss 6.2600, time 15790.01ms, mfu 2.57%
iter 36: loss 5.0418, time 15767.69ms, mfu 2.57%
iter 37: loss 4.7475, time 15799.51ms, mfu 2.57%
iter 38: loss 5.5421, time 16041.08ms, mfu 2.57%
iter 39: loss 5.0590, time 15779.59ms, mfu 2.57%
iter 40: loss 5.6652, time 15789.35ms, mfu 2.57%
iter 41: loss 5.2561, time 15778.62ms, mfu 2.57%
iter 42: loss 5.3916, time 15784.91ms, mfu 2.57%
iter 43: loss 5.2561, time 15772.61ms, mfu 2.57%
iter 44: loss 5.3353, time 15781.64ms, mfu 2.57%
iter 45: loss 5.3770, time 15769.19ms, mfu 2.57%
iter 46: loss 5.4238, time 15775.41ms, mfu 2.57%
iter 47: loss 5.1389, time 15791.51ms, mfu 2.57%
iter 48: loss 5.4933, time 15783.68ms, mfu 2.57%
iter 49: loss 4.7825, time 15785.26ms, mfu 2.57%
iter 50: loss 5.1460, time 15802.44ms, mfu 2.57%
iter 51: loss 5.2130, time 15798.79ms, mfu 2.57%
iter 52: loss 4.8980, time 15812.57ms, mfu 2.57%
iter 53: loss 5.1830, time 15820.66ms, mfu 2.57%
iter 54: loss 5.6052, time 15825.98ms, mfu 2.57%
iter 55: loss 5.0433, time 15786.66ms, mfu 2.57%
iter 56: loss 5.0833, time 15818.47ms, mfu 2.57%
iter 57: loss 4.7395, time 15816.96ms, mfu 2.57%
iter 58: loss 4.6960, time 16227.17ms, mfu 2.56%
iter 59: loss 4.9908, time 15798.73ms, mfu 2.56%
iter 60: loss 5.2034, time 15800.99ms, mfu 2.57%
iter 61: loss 5.3173, time 15823.46ms, mfu 2.57%
iter 62: loss 4.9190, time 15843.27ms, mfu 2.57%
iter 63: loss 4.4035, time 15845.16ms, mfu 2.57%
iter 64: loss 4.9877, time 15836.50ms, mfu 2.57%
iter 65: loss 4.7285, time 15841.93ms, mfu 2.57%
iter 66: loss 4.8348, time 15863.21ms, mfu 2.56%
iter 67: loss 4.8996, time 15863.32ms, mfu 2.56%
iter 68: loss 5.0770, time 15871.59ms, mfu 2.56%
iter 69: loss 5.0234, time 15878.44ms, mfu 2.56%
iter 70: loss 4.9628, time 15839.01ms, mfu 2.56%
iter 71: loss 5.0200, time 15848.06ms, mfu 2.56%
iter 72: loss 4.8751, time 15863.46ms, mfu 2.56%
iter 73: loss 4.3475, time 15849.43ms, mfu 2.56%
iter 74: loss 4.5017, time 15855.94ms, mfu 2.56%
iter 75: loss 5.2358, time 15844.63ms, mfu 2.56%
iter 76: loss 3.9662, time 15862.33ms, mfu 2.56%
iter 77: loss 4.4274, time 15871.65ms, mfu 2.56%
iter 78: loss 4.2607, time 16182.28ms, mfu 2.56%
iter 79: loss 4.1449, time 15861.83ms, mfu 2.56%
iter 80: loss 4.7209, time 15871.89ms, mfu 2.56%
iter 81: loss 4.5566, time 15864.86ms, mfu 2.56%
iter 82: loss 4.3446, time 15863.50ms, mfu 2.56%
iter 83: loss 4.3396, time 15871.67ms, mfu 2.56%
iter 84: loss 4.1098, time 15857.75ms, mfu 2.56%
iter 85: loss 4.6080, time 15856.13ms, mfu 2.56%
iter 86: loss 4.4334, time 15877.71ms, mfu 2.56%
iter 87: loss 4.1860, time 15883.50ms, mfu 2.56%
iter 88: loss 4.3947, time 15861.12ms, mfu 2.56%
iter 89: loss 4.4694, time 15873.07ms, mfu 2.56%
iter 90: loss 4.3495, time 15877.75ms, mfu 2.56%
iter 91: loss 4.2405, time 15850.88ms, mfu 2.56%
iter 92: loss 4.4502, time 15834.65ms, mfu 2.56%
iter 93: loss 3.6781, time 15867.04ms, mfu 2.56%
iter 94: loss 4.2028, time 15862.11ms, mfu 2.56%
iter 95: loss 4.0590, time 15871.43ms, mfu 2.56%
iter 96: loss 3.9999, time 15870.66ms, mfu 2.56%
iter 97: loss 4.2502, time 15849.35ms, mfu 2.56%
iter 98: loss 4.3104, time 16203.60ms, mfu 2.56%
iter 99: loss 4.4702, time 15855.31ms, mfu 2.56%
iter 100: loss 4.3030, time 15844.52ms, mfu 2.56%
iter 101: loss 4.3456, time 15846.36ms, mfu 2.56%
iter 102: loss 4.0975, time 15851.76ms, mfu 2.56%
iter 103: loss 4.2060, time 15846.26ms, mfu 2.56%
iter 104: loss 4.1035, time 17532.60ms, mfu 2.53%
iter 105: loss 3.8556, time 15831.70ms, mfu 2.54%
iter 106: loss 3.9002, time 15841.38ms, mfu 2.54%
iter 107: loss 4.2606, time 15854.09ms, mfu 2.54%
iter 108: loss 4.0130, time 15825.89ms, mfu 2.55%
iter 109: loss 3.8277, time 15840.55ms, mfu 2.55%
iter 110: loss 4.1542, time 15838.44ms, mfu 2.55%
iter 111: loss 3.9946, time 15858.84ms, mfu 2.55%
iter 112: loss 4.0844, time 15852.44ms, mfu 2.55%
iter 113: loss 3.8145, time 15862.23ms, mfu 2.55%
iter 114: loss 3.9757, time 15862.84ms, mfu 2.55%
iter 115: loss 3.8494, time 15843.59ms, mfu 2.55%
iter 116: loss 4.1259, time 15851.74ms, mfu 2.56%
iter 117: loss 3.8782, time 15856.68ms, mfu 2.56%
iter 118: loss 3.7902, time 16132.74ms, mfu 2.55%
iter 119: loss 4.0406, time 15837.98ms, mfu 2.55%
iter 120: loss 3.7595, time 15853.50ms, mfu 2.55%
iter 121: loss 3.9418, time 15849.12ms, mfu 2.56%
iter 122: loss 4.1073, time 15861.08ms, mfu 2.56%
iter 123: loss 3.6519, time 15842.18ms, mfu 2.56%
iter 124: loss 3.5982, time 15839.64ms, mfu 2.56%
iter 125: loss 3.0670, time 15849.16ms, mfu 2.56%
iter 126: loss 3.7166, time 15847.16ms, mfu 2.56%
iter 127: loss 3.9098, time 15844.61ms, mfu 2.56%
iter 128: loss 4.1879, time 15843.72ms, mfu 2.56%
iter 129: loss 3.7154, time 15839.65ms, mfu 2.56%
iter 130: loss 3.7123, time 15841.57ms, mfu 2.56%
iter 131: loss 3.6772, time 15840.82ms, mfu 2.56%
iter 132: loss 4.0502, time 15833.82ms, mfu 2.56%
iter 133: loss 3.7145, time 15829.70ms, mfu 2.56%
iter 134: loss 3.6576, time 15839.11ms, mfu 2.56%
iter 135: loss 3.8645, time 15823.49ms, mfu 2.56%
iter 136: loss 4.0430, time 15851.00ms, mfu 2.56%
iter 137: loss 3.5491, time 15856.06ms, mfu 2.56%
iter 138: loss 3.4967, time 16121.97ms, mfu 2.56%
iter 139: loss 3.6438, time 15836.26ms, mfu 2.56%
iter 140: loss 4.0229, time 15837.02ms, mfu 2.56%
iter 141: loss 3.6197, time 15845.40ms, mfu 2.56%
iter 142: loss 3.5067, time 15843.71ms, mfu 2.56%
iter 143: loss 3.9023, time 15845.83ms, mfu 2.56%
iter 144: loss 3.4453, time 15838.15ms, mfu 2.56%
iter 145: loss 3.7979, time 15853.09ms, mfu 2.56%
iter 146: loss 3.7835, time 15849.86ms, mfu 2.56%
iter 147: loss 3.7274, time 15848.61ms, mfu 2.56%
iter 148: loss 3.8025, time 15853.35ms, mfu 2.56%
iter 149: loss 3.5961, time 15851.56ms, mfu 2.56%
iter 150: loss 3.8399, time 15850.61ms, mfu 2.56%
iter 151: loss 3.9112, time 15844.32ms, mfu 2.56%
iter 152: loss 3.6547, time 15835.29ms, mfu 2.56%
iter 153: loss 3.7479, time 15835.85ms, mfu 2.56%
iter 154: loss 3.9943, time 15825.31ms, mfu 2.56%
iter 155: loss 3.5901, time 15834.86ms, mfu 2.56%
iter 156: loss 3.7442, time 15832.71ms, mfu 2.56%
iter 157: loss 3.4755, time 15833.90ms, mfu 2.56%
iter 158: loss 3.8800, time 16247.43ms, mfu 2.56%
iter 159: loss 3.7167, time 15833.53ms, mfu 2.56%
iter 160: loss 3.6166, time 15806.69ms, mfu 2.56%
iter 161: loss 3.5934, time 15806.16ms, mfu 2.56%
iter 162: loss 3.8002, time 15795.80ms, mfu 2.56%
iter 163: loss 3.8531, time 15789.52ms, mfu 2.56%
iter 164: loss 3.7795, time 15788.46ms, mfu 2.56%
iter 165: loss 3.8058, time 15798.61ms, mfu 2.56%
iter 166: loss 3.8243, time 15805.01ms, mfu 2.57%
iter 167: loss 3.8385, time 15783.21ms, mfu 2.57%
iter 168: loss 3.2237, time 15796.08ms, mfu 2.57%
iter 169: loss 3.5984, time 15811.65ms, mfu 2.57%
iter 170: loss 3.6642, time 15829.39ms, mfu 2.57%
iter 171: loss 3.4045, time 15846.21ms, mfu 2.57%
iter 172: loss 3.5449, time 15821.77ms, mfu 2.57%
iter 173: loss 3.2869, time 15833.90ms, mfu 2.57%
iter 174: loss 3.7998, time 15844.33ms, mfu 2.57%
iter 175: loss 4.3149, time 15851.10ms, mfu 2.57%
iter 176: loss 3.5281, time 15842.14ms, mfu 2.57%
iter 177: loss 3.5306, time 15816.55ms, mfu 2.57%
iter 178: loss 3.5306, time 16112.92ms, mfu 2.56%
iter 179: loss 3.5240, time 15810.46ms, mfu 2.56%
iter 180: loss 3.7603, time 15784.76ms, mfu 2.56%
iter 181: loss 3.7127, time 15788.40ms, mfu 2.56%
iter 182: loss 3.5117, time 15794.92ms, mfu 2.57%
iter 183: loss 3.6257, time 15806.45ms, mfu 2.57%
iter 184: loss 3.2285, time 15803.65ms, mfu 2.57%
iter 185: loss 3.6972, time 15793.96ms, mfu 2.57%
iter 186: loss 3.4791, time 15805.36ms, mfu 2.57%
iter 187: loss 3.5209, time 15824.54ms, mfu 2.57%
iter 188: loss 3.3363, time 15812.57ms, mfu 2.57%
iter 189: loss 3.5081, time 15815.81ms, mfu 2.57%
iter 190: loss 3.5572, time 15823.33ms, mfu 2.57%
iter 191: loss 3.5353, time 15934.48ms, mfu 2.57%
iter 192: loss 3.4124, time 16032.51ms, mfu 2.56%
iter 193: loss 3.8282, time 16004.78ms, mfu 2.56%
iter 194: loss 3.5870, time 15941.11ms, mfu 2.56%
iter 195: loss 3.5669, time 15956.34ms, mfu 2.56%
iter 196: loss 3.3700, time 15938.57ms, mfu 2.56%
iter 197: loss 3.4367, time 15974.67ms, mfu 2.56%
iter 198: loss 3.2026, time 16216.00ms, mfu 2.55%
iter 199: loss 3.3023, time 15995.36ms, mfu 2.55%
iter 200: loss 3.0132, time 15995.54ms, mfu 2.55%
iter 201: loss 3.5980, time 16005.13ms, mfu 2.55%
iter 202: loss 3.4657, time 16002.04ms, mfu 2.55%
iter 203: loss 2.8025, time 15998.62ms, mfu 2.55%
iter 204: loss 3.5007, time 15978.70ms, mfu 2.55%
iter 205: loss 3.3317, time 15986.76ms, mfu 2.55%
iter 206: loss 3.3960, time 15977.16ms, mfu 2.54%
iter 207: loss 3.4090, time 15980.35ms, mfu 2.54%
iter 208: loss 3.0504, time 15972.07ms, mfu 2.54%
iter 209: loss 3.1450, time 15977.17ms, mfu 2.54%
iter 210: loss 3.3565, time 15973.36ms, mfu 2.54%
iter 211: loss 3.3267, time 15972.93ms, mfu 2.54%
iter 212: loss 3.2799, time 15975.59ms, mfu 2.54%
iter 213: loss 3.9759, time 15958.26ms, mfu 2.54%
iter 214: loss 3.5953, time 15948.47ms, mfu 2.54%
iter 215: loss 3.3641, time 15958.40ms, mfu 2.54%
iter 216: loss 3.4127, time 15943.15ms, mfu 2.55%
iter 217: loss 3.4980, time 15965.56ms, mfu 2.55%
iter 218: loss 3.5675, time 16225.73ms, mfu 2.54%
iter 219: loss 2.8910, time 15965.73ms, mfu 2.54%
iter 220: loss 3.4550, time 15982.46ms, mfu 2.54%
iter 221: loss 3.2639, time 15987.22ms, mfu 2.54%
iter 222: loss 3.5732, time 16026.08ms, mfu 2.54%
iter 223: loss 3.5502, time 16007.85ms, mfu 2.54%
iter 224: loss 2.9080, time 16011.88ms, mfu 2.54%
iter 225: loss 3.2880, time 15980.66ms, mfu 2.54%
iter 226: loss 3.3588, time 15973.15ms, mfu 2.54%
iter 227: loss 3.1142, time 15958.82ms, mfu 2.54%
iter 228: loss 3.0523, time 15963.20ms, mfu 2.54%
iter 229: loss 3.3545, time 15944.74ms, mfu 2.54%
iter 230: loss 3.5060, time 15946.59ms, mfu 2.54%
iter 231: loss 3.2516, time 15955.33ms, mfu 2.54%
iter 232: loss 3.3218, time 15923.96ms, mfu 2.54%
iter 233: loss 3.3270, time 15952.38ms, mfu 2.54%
iter 234: loss 2.9918, time 15922.79ms, mfu 2.54%
iter 235: loss 3.4410, time 15944.69ms, mfu 2.55%
iter 236: loss 3.2404, time 15954.96ms, mfu 2.55%
iter 237: loss 3.4811, time 15965.70ms, mfu 2.55%
iter 238: loss 3.1366, time 16308.94ms, mfu 2.54%
iter 239: loss 3.4770, time 15981.43ms, mfu 2.54%
iter 240: loss 2.9299, time 16003.99ms, mfu 2.54%
iter 241: loss 3.4372, time 15988.94ms, mfu 2.54%
iter 242: loss 3.0867, time 15994.37ms, mfu 2.54%
iter 243: loss 3.2273, time 15971.12ms, mfu 2.54%
iter 244: loss 3.6071, time 15962.65ms, mfu 2.54%
iter 245: loss 3.8065, time 15976.38ms, mfu 2.54%
iter 246: loss 3.3285, time 15961.24ms, mfu 2.54%
iter 247: loss 3.1286, time 15958.17ms, mfu 2.54%
iter 248: loss 3.0438, time 15951.22ms, mfu 2.54%
iter 249: loss 3.1228, time 15951.24ms, mfu 2.54%
iter 250: loss 3.3888, time 15938.56ms, mfu 2.54%
iter 251: loss 3.2347, time 15991.16ms, mfu 2.54%
iter 252: loss 2.9589, time 15938.67ms, mfu 2.54%
iter 253: loss 2.9632, time 15966.20ms, mfu 2.54%
iter 254: loss 3.2970, time 15959.84ms, mfu 2.54%
iter 255: loss 2.7699, time 15953.41ms, mfu 2.54%
iter 256: loss 3.6863, time 15966.51ms, mfu 2.54%
iter 257: loss 2.9634, time 15959.23ms, mfu 2.54%
iter 258: loss 3.1813, time 16286.79ms, mfu 2.54%
iter 259: loss 3.0801, time 15947.42ms, mfu 2.54%
iter 260: loss 2.9143, time 15943.80ms, mfu 2.54%
iter 261: loss 3.4172, time 15908.03ms, mfu 2.54%
iter 262: loss 3.4539, time 15916.55ms, mfu 2.54%
iter 263: loss 2.9955, time 15889.78ms, mfu 2.54%
iter 264: loss 3.4846, time 15872.24ms, mfu 2.55%
iter 265: loss 2.8587, time 15872.84ms, mfu 2.55%
iter 266: loss 3.3069, time 15833.42ms, mfu 2.55%
iter 267: loss 3.5753, time 15828.98ms, mfu 2.55%
iter 268: loss 2.8718, time 15811.25ms, mfu 2.55%
iter 269: loss 3.1959, time 15815.04ms, mfu 2.55%
iter 270: loss 3.0923, time 15791.24ms, mfu 2.56%
iter 271: loss 3.0377, time 15802.84ms, mfu 2.56%
iter 272: loss 2.6705, time 15790.75ms, mfu 2.56%
iter 273: loss 3.2692, time 15798.93ms, mfu 2.56%
iter 274: loss 3.3469, time 15796.51ms, mfu 2.56%
iter 275: loss 3.5814, time 15812.40ms, mfu 2.56%
iter 276: loss 3.1679, time 15816.62ms, mfu 2.56%
iter 277: loss 3.0719, time 15798.06ms, mfu 2.56%
iter 278: loss 2.8805, time 16054.08ms, mfu 2.56%
iter 279: loss 3.0067, time 15803.95ms, mfu 2.56%
iter 280: loss 3.0421, time 15797.96ms, mfu 2.56%
iter 281: loss 2.7071, time 15801.89ms, mfu 2.56%
iter 282: loss 3.0861, time 15815.90ms, mfu 2.56%
iter 283: loss 2.8721, time 15798.01ms, mfu 2.56%
iter 284: loss 2.8223, time 15795.56ms, mfu 2.57%
iter 285: loss 3.2394, time 15801.60ms, mfu 2.57%
iter 286: loss 2.6602, time 15795.18ms, mfu 2.57%
iter 287: loss 3.3055, time 15796.36ms, mfu 2.57%
iter 288: loss 3.0129, time 15809.78ms, mfu 2.57%
iter 289: loss 3.1899, time 15801.83ms, mfu 2.57%
iter 290: loss 2.6284, time 15811.22ms, mfu 2.57%
iter 291: loss 3.2918, time 15797.86ms, mfu 2.57%
iter 292: loss 3.3756, time 15809.97ms, mfu 2.57%
iter 293: loss 3.2896, time 15818.52ms, mfu 2.57%
iter 294: loss 3.2544, time 15805.37ms, mfu 2.57%
iter 295: loss 2.9546, time 15812.88ms, mfu 2.57%
iter 296: loss 2.9124, time 15801.92ms, mfu 2.57%
iter 297: loss 3.1641, time 15813.62ms, mfu 2.57%
iter 298: loss 3.0895, time 16201.21ms, mfu 2.56%
iter 299: loss 3.0943, time 15794.24ms, mfu 2.56%
iter 300: loss 2.8827, time 15799.54ms, mfu 2.56%
iter 301: loss 2.9733, time 15801.36ms, mfu 2.57%
iter 302: loss 3.1296, time 15800.82ms, mfu 2.57%
iter 303: loss 3.0488, time 15808.17ms, mfu 2.57%
iter 304: loss 3.1167, time 15800.95ms, mfu 2.57%
iter 305: loss 3.1030, time 15790.85ms, mfu 2.57%
iter 306: loss 3.3306, time 15811.21ms, mfu 2.57%
iter 307: loss 3.0198, time 15818.58ms, mfu 2.57%
iter 308: loss 2.8113, time 15795.83ms, mfu 2.57%
iter 309: loss 3.1668, time 15791.75ms, mfu 2.57%
iter 310: loss 3.1273, time 15805.68ms, mfu 2.57%
iter 311: loss 3.0325, time 15798.73ms, mfu 2.57%
iter 312: loss 3.1685, time 15787.36ms, mfu 2.57%
iter 313: loss 3.2206, time 15799.46ms, mfu 2.57%
iter 314: loss 3.0981, time 15798.93ms, mfu 2.57%
iter 315: loss 2.8837, time 15792.73ms, mfu 2.57%
iter 316: loss 2.9473, time 15786.77ms, mfu 2.57%
iter 317: loss 3.4503, time 15819.55ms, mfu 2.57%
iter 318: loss 3.0209, time 16120.70ms, mfu 2.57%
iter 319: loss 3.0338, time 15800.72ms, mfu 2.57%
iter 320: loss 2.9305, time 15797.41ms, mfu 2.57%
iter 321: loss 2.9377, time 15812.80ms, mfu 2.57%
iter 322: loss 3.1877, time 15796.69ms, mfu 2.57%
iter 323: loss 2.8615, time 15803.20ms, mfu 2.57%
iter 324: loss 3.0747, time 15802.31ms, mfu 2.57%
iter 325: loss 2.8679, time 15805.37ms, mfu 2.57%
iter 326: loss 2.9131, time 15803.32ms, mfu 2.57%
iter 327: loss 2.4475, time 15791.22ms, mfu 2.57%
iter 328: loss 2.8435, time 15799.38ms, mfu 2.57%
iter 329: loss 3.1049, time 15784.13ms, mfu 2.57%
iter 330: loss 2.8092, time 15796.71ms, mfu 2.57%
iter 331: loss 3.7731, time 15804.34ms, mfu 2.57%
iter 332: loss 3.1845, time 15784.70ms, mfu 2.57%
iter 333: loss 3.0108, time 15796.68ms, mfu 2.57%
iter 334: loss 2.5417, time 15787.35ms, mfu 2.57%
iter 335: loss 3.0692, time 15809.42ms, mfu 2.57%
iter 336: loss 2.6822, time 15809.11ms, mfu 2.57%
iter 337: loss 3.0735, time 15801.35ms, mfu 2.57%
iter 338: loss 3.4541, time 15792.96ms, mfu 2.57%
iter 339: loss 2.7714, time 16141.24ms, mfu 2.57%
iter 340: loss 2.8263, time 15801.48ms, mfu 2.57%
iter 341: loss 2.9887, time 15800.25ms, mfu 2.57%
iter 342: loss 2.7262, time 15793.80ms, mfu 2.57%
iter 343: loss 2.8534, time 15789.32ms, mfu 2.57%
iter 344: loss 3.2666, time 15805.43ms, mfu 2.57%
iter 345: loss 3.4553, time 15807.67ms, mfu 2.57%
iter 346: loss 2.5342, time 15798.17ms, mfu 2.57%
iter 347: loss 3.2457, time 15798.31ms, mfu 2.57%
iter 348: loss 3.0042, time 15804.89ms, mfu 2.57%
iter 349: loss 2.8087, time 15790.22ms, mfu 2.57%
iter 350: loss 2.9350, time 15804.04ms, mfu 2.57%
iter 351: loss 3.1430, time 15796.84ms, mfu 2.57%
iter 352: loss 2.3252, time 15788.54ms, mfu 2.57%
iter 353: loss 2.7831, time 15799.16ms, mfu 2.57%
iter 354: loss 2.7464, time 15789.20ms, mfu 2.57%
iter 355: loss 2.7986, time 15787.14ms, mfu 2.57%
iter 356: loss 2.8607, time 15800.76ms, mfu 2.57%
iter 357: loss 3.0449, time 15808.93ms, mfu 2.57%
iter 358: loss 3.0757, time 15796.51ms, mfu 2.57%
iter 359: loss 2.6455, time 16080.66ms, mfu 2.57%
iter 360: loss 2.7913, time 15786.67ms, mfu 2.57%
iter 361: loss 2.8785, time 15800.88ms, mfu 2.57%
iter 362: loss 2.8933, time 15804.67ms, mfu 2.57%
iter 363: loss 2.8443, time 15787.57ms, mfu 2.57%
iter 364: loss 2.9080, time 15793.35ms, mfu 2.57%
iter 365: loss 2.6602, time 15779.40ms, mfu 2.57%
iter 366: loss 3.0786, time 15806.12ms, mfu 2.57%
iter 367: loss 2.7430, time 15792.91ms, mfu 2.57%
iter 368: loss 2.7212, time 15794.79ms, mfu 2.57%
iter 369: loss 2.9307, time 15778.89ms, mfu 2.57%
iter 370: loss 3.0286, time 15794.16ms, mfu 2.57%
iter 371: loss 2.8667, time 15785.76ms, mfu 2.57%
iter 372: loss 3.0195, time 15798.91ms, mfu 2.57%
iter 373: loss 2.7936, time 15810.23ms, mfu 2.57%
iter 374: loss 3.1533, time 15807.03ms, mfu 2.57%
iter 375: loss 3.0492, time 15789.75ms, mfu 2.57%
iter 376: loss 2.9561, time 15794.20ms, mfu 2.57%
iter 377: loss 2.5164, time 15811.21ms, mfu 2.57%
iter 378: loss 3.2568, time 15788.98ms, mfu 2.57%
iter 379: loss 2.6679, time 16029.73ms, mfu 2.57%
iter 380: loss 3.0339, time 15804.46ms, mfu 2.57%
iter 381: loss 2.9370, time 15800.35ms, mfu 2.57%
iter 382: loss 2.8623, time 15796.08ms, mfu 2.57%
iter 383: loss 3.2159, time 15799.36ms, mfu 2.57%
iter 384: loss 2.6484, time 15799.01ms, mfu 2.57%
iter 385: loss 3.1784, time 15800.81ms, mfu 2.57%
iter 386: loss 3.0412, time 15792.09ms, mfu 2.57%
iter 387: loss 2.9857, time 15793.03ms, mfu 2.57%
iter 388: loss 2.7962, time 15796.85ms, mfu 2.57%
iter 389: loss 2.6385, time 15783.63ms, mfu 2.57%
iter 390: loss 2.7857, time 15796.82ms, mfu 2.57%
iter 391: loss 2.7863, time 15803.30ms, mfu 2.57%
iter 392: loss 2.5522, time 15796.99ms, mfu 2.57%
iter 393: loss 2.4665, time 15801.94ms, mfu 2.57%
iter 394: loss 2.5936, time 15804.02ms, mfu 2.57%
iter 395: loss 2.7005, time 15808.21ms, mfu 2.57%
iter 396: loss 2.9162, time 15802.15ms, mfu 2.57%
iter 397: loss 2.3873, time 15797.05ms, mfu 2.57%
iter 398: loss 2.5072, time 15783.80ms, mfu 2.57%
iter 399: loss 2.4738, time 16100.07ms, mfu 2.57%
iter 400: loss 2.7168, time 15809.56ms, mfu 2.57%
iter 401: loss 2.5157, time 15802.32ms, mfu 2.57%
iter 402: loss 2.4018, time 15794.00ms, mfu 2.57%
iter 403: loss 2.7701, time 15799.05ms, mfu 2.57%
iter 404: loss 2.4937, time 15802.33ms, mfu 2.57%
iter 405: loss 2.6894, time 15785.66ms, mfu 2.57%
iter 406: loss 2.9440, time 15797.94ms, mfu 2.57%
iter 407: loss 2.8287, time 15796.38ms, mfu 2.57%
iter 408: loss 2.5601, time 15802.30ms, mfu 2.57%
iter 409: loss 2.5608, time 15797.10ms, mfu 2.57%
iter 410: loss 3.2831, time 15783.74ms, mfu 2.57%
iter 411: loss 2.6006, time 15809.89ms, mfu 2.57%
iter 412: loss 2.7233, time 15798.74ms, mfu 2.57%
iter 413: loss 2.7280, time 15800.55ms, mfu 2.57%
iter 414: loss 2.5217, time 15783.47ms, mfu 2.57%
iter 415: loss 2.8725, time 15800.83ms, mfu 2.57%
iter 416: loss 2.6422, time 15798.42ms, mfu 2.57%
iter 417: loss 2.4275, time 15787.56ms, mfu 2.57%
iter 418: loss 2.8811, time 15785.26ms, mfu 2.57%
iter 419: loss 2.4776, time 16070.18ms, mfu 2.57%
iter 420: loss 2.2295, time 15893.96ms, mfu 2.57%
iter 421: loss 2.4955, time 16004.19ms, mfu 2.56%
iter 422: loss 3.2582, time 15980.84ms, mfu 2.56%
iter 423: loss 2.4587, time 15915.84ms, mfu 2.56%
iter 424: loss 2.8869, time 15905.27ms, mfu 2.56%
iter 425: loss 2.7382, time 15920.57ms, mfu 2.56%
iter 426: loss 3.0234, time 15944.21ms, mfu 2.56%
iter 427: loss 2.8522, time 15984.58ms, mfu 2.56%
iter 428: loss 3.0375, time 15983.30ms, mfu 2.55%
iter 429: loss 2.8303, time 15984.82ms, mfu 2.55%
iter 430: loss 2.7821, time 15969.14ms, mfu 2.55%
iter 431: loss 2.6337, time 15977.06ms, mfu 2.55%
iter 432: loss 2.9028, time 15950.38ms, mfu 2.55%
iter 433: loss 2.6127, time 15953.75ms, mfu 2.55%
iter 434: loss 2.7918, time 15946.11ms, mfu 2.55%
iter 435: loss 2.8875, time 15948.28ms, mfu 2.55%
iter 436: loss 2.6473, time 15941.53ms, mfu 2.55%
iter 437: loss 2.6770, time 15933.56ms, mfu 2.55%
iter 438: loss 2.7060, time 15925.14ms, mfu 2.55%
iter 439: loss 2.5090, time 16205.49ms, mfu 2.55%
iter 440: loss 2.7187, time 15911.42ms, mfu 2.55%
iter 441: loss 2.6590, time 15907.40ms, mfu 2.55%
iter 442: loss 2.6670, time 15927.22ms, mfu 2.55%
iter 443: loss 2.9774, time 15916.46ms, mfu 2.55%
iter 444: loss 2.6973, time 15927.82ms, mfu 2.55%
iter 445: loss 2.5374, time 15919.67ms, mfu 2.55%
iter 446: loss 2.8308, time 15916.86ms, mfu 2.55%
iter 447: loss 2.5241, time 15918.67ms, mfu 2.55%
iter 448: loss 2.5075, time 15927.52ms, mfu 2.55%
iter 449: loss 2.4325, time 15924.50ms, mfu 2.55%
iter 450: loss 2.4201, time 15927.95ms, mfu 2.55%
iter 451: loss 2.2985, time 15910.86ms, mfu 2.55%
iter 452: loss 2.3880, time 15917.93ms, mfu 2.55%
iter 453: loss 2.5183, time 15965.50ms, mfu 2.55%
iter 454: loss 3.0203, time 15943.84ms, mfu 2.55%
iter 455: loss 2.4947, time 15944.17ms, mfu 2.55%
iter 456: loss 2.6011, time 15961.92ms, mfu 2.55%
iter 457: loss 2.3262, time 15960.27ms, mfu 2.55%
iter 458: loss 3.0259, time 15967.29ms, mfu 2.55%
iter 459: loss 3.0312, time 16208.36ms, mfu 2.54%
iter 460: loss 2.5466, time 15932.29ms, mfu 2.54%
iter 461: loss 2.6829, time 15939.51ms, mfu 2.55%
iter 462: loss 2.8123, time 15932.45ms, mfu 2.55%
iter 463: loss 2.5061, time 15935.80ms, mfu 2.55%
iter 464: loss 2.3448, time 15924.68ms, mfu 2.55%
iter 465: loss 2.6406, time 15917.34ms, mfu 2.55%
iter 466: loss 2.7453, time 15924.74ms, mfu 2.55%
iter 467: loss 2.7203, time 15923.30ms, mfu 2.55%
iter 468: loss 2.4365, time 15922.67ms, mfu 2.55%
iter 469: loss 3.0017, time 15939.20ms, mfu 2.55%
iter 470: loss 2.8322, time 15916.59ms, mfu 2.55%
iter 471: loss 2.6614, time 15950.62ms, mfu 2.55%
iter 472: loss 2.2179, time 15946.43ms, mfu 2.55%
iter 473: loss 2.8178, time 15957.02ms, mfu 2.55%
iter 474: loss 2.6097, time 16004.73ms, mfu 2.55%
iter 475: loss 2.5224, time 15974.26ms, mfu 2.55%
iter 476: loss 2.4071, time 15977.86ms, mfu 2.55%
iter 477: loss 2.8893, time 15973.50ms, mfu 2.55%
iter 478: loss 2.5441, time 15955.11ms, mfu 2.55%
iter 479: loss 2.7618, time 16272.72ms, mfu 2.54%
iter 480: loss 2.6981, time 15973.16ms, mfu 2.54%
iter 481: loss 3.0501, time 15985.04ms, mfu 2.54%
iter 482: loss 2.5342, time 15984.70ms, mfu 2.54%
iter 483: loss 2.4169, time 15987.65ms, mfu 2.54%
iter 484: loss 2.2711, time 15999.03ms, mfu 2.54%
iter 485: loss 2.6116, time 15981.00ms, mfu 2.54%
iter 486: loss 2.4427, time 15958.29ms, mfu 2.54%
iter 487: loss 2.4774, time 15935.68ms, mfu 2.54%
iter 488: loss 2.6362, time 15917.84ms, mfu 2.54%
iter 489: loss 2.6711, time 15926.06ms, mfu 2.54%
iter 490: loss 2.6236, time 15898.17ms, mfu 2.55%
iter 491: loss 2.3763, time 15886.05ms, mfu 2.55%
iter 492: loss 2.6455, time 15902.37ms, mfu 2.55%
iter 493: loss 2.3998, time 15859.28ms, mfu 2.55%
iter 494: loss 2.5892, time 15855.00ms, mfu 2.55%
iter 495: loss 2.5932, time 15863.47ms, mfu 2.55%
iter 496: loss 2.4433, time 15857.14ms, mfu 2.55%
iter 497: loss 2.5513, time 15827.00ms, mfu 2.55%
iter 498: loss 2.6984, time 15817.33ms, mfu 2.56%
iter 499: loss 2.2579, time 16143.42ms, mfu 2.55%
iter 500: loss 2.8659, time 15816.08ms, mfu 2.55%
iter 501: loss 2.9132, time 15799.27ms, mfu 2.55%
iter 502: loss 2.5313, time 15812.84ms, mfu 2.56%
iter 503: loss 2.5614, time 15797.63ms, mfu 2.56%
iter 504: loss 2.2397, time 15805.10ms, mfu 2.56%
iter 505: loss 2.4082, time 15801.63ms, mfu 2.56%
iter 506: loss 2.8121, time 15800.53ms, mfu 2.56%
iter 507: loss 2.2298, time 15799.67ms, mfu 2.56%
iter 508: loss 2.1452, time 15785.16ms, mfu 2.56%
iter 509: loss 2.2834, time 15798.56ms, mfu 2.56%
iter 510: loss 2.1281, time 15807.97ms, mfu 2.56%
iter 511: loss 2.4051, time 15784.35ms, mfu 2.57%
iter 512: loss 2.4044, time 15794.75ms, mfu 2.57%
iter 513: loss 2.3181, time 15795.77ms, mfu 2.57%
iter 514: loss 2.2780, time 15802.24ms, mfu 2.57%
iter 515: loss 2.2908, time 15787.63ms, mfu 2.57%
iter 516: loss 2.3274, time 15799.79ms, mfu 2.57%
iter 517: loss 2.5745, time 15804.19ms, mfu 2.57%
iter 518: loss 2.6249, time 15791.15ms, mfu 2.57%
iter 519: loss 2.6284, time 16105.69ms, mfu 2.56%
iter 520: loss 2.6420, time 15791.62ms, mfu 2.57%
iter 521: loss 2.8209, time 15784.75ms, mfu 2.57%
iter 522: loss 2.8504, time 15811.57ms, mfu 2.57%
iter 523: loss 2.4801, time 15788.57ms, mfu 2.57%
iter 524: loss 2.4152, time 15791.78ms, mfu 2.57%
iter 525: loss 2.4709, time 15782.90ms, mfu 2.57%
iter 526: loss 2.7038, time 15792.87ms, mfu 2.57%
iter 527: loss 2.0819, time 15816.37ms, mfu 2.57%
iter 528: loss 2.3352, time 15790.26ms, mfu 2.57%
iter 529: loss 2.6072, time 15787.36ms, mfu 2.57%
iter 530: loss 2.7950, time 15787.64ms, mfu 2.57%
iter 531: loss 3.1596, time 15779.10ms, mfu 2.57%
iter 532: loss 2.5854, time 15791.83ms, mfu 2.57%
iter 533: loss 2.1237, time 15787.15ms, mfu 2.57%
iter 534: loss 2.2175, time 15786.48ms, mfu 2.57%
iter 535: loss 2.3070, time 15782.27ms, mfu 2.57%
iter 536: loss 2.9697, time 15787.66ms, mfu 2.57%
iter 537: loss 2.2469, time 15775.32ms, mfu 2.57%
iter 538: loss 2.4111, time 15788.58ms, mfu 2.57%
iter 539: loss 2.2906, time 16140.06ms, mfu 2.57%
iter 540: loss 2.1589, time 15798.52ms, mfu 2.57%
iter 541: loss 2.8654, time 15778.14ms, mfu 2.57%
iter 542: loss 2.7110, time 15781.96ms, mfu 2.57%
iter 543: loss 2.8178, time 15786.72ms, mfu 2.57%
iter 544: loss 2.2266, time 15785.43ms, mfu 2.57%
iter 545: loss 2.2750, time 15782.35ms, mfu 2.57%
iter 546: loss 2.3305, time 15801.30ms, mfu 2.57%
iter 547: loss 2.4866, time 15799.45ms, mfu 2.57%
iter 548: loss 2.7554, time 15798.52ms, mfu 2.57%
iter 549: loss 2.5226, time 15808.41ms, mfu 2.57%
iter 550: loss 2.3467, time 15796.91ms, mfu 2.57%
iter 551: loss 2.5413, time 15803.15ms, mfu 2.57%
iter 552: loss 2.1947, time 15810.02ms, mfu 2.57%
iter 553: loss 2.4986, time 15802.81ms, mfu 2.57%
iter 554: loss 2.4026, time 15789.73ms, mfu 2.57%
iter 555: loss 2.8721, time 15807.66ms, mfu 2.57%
iter 556: loss 2.5764, time 15790.15ms, mfu 2.57%
iter 557: loss 2.5734, time 15802.61ms, mfu 2.57%
iter 558: loss 2.2357, time 15787.64ms, mfu 2.57%
iter 559: loss 2.0916, time 16168.40ms, mfu 2.57%
iter 560: loss 2.8057, time 15789.90ms, mfu 2.57%
iter 561: loss 2.2789, time 15786.29ms, mfu 2.57%
iter 562: loss 2.6132, time 15788.14ms, mfu 2.57%
iter 563: loss 2.3844, time 15794.10ms, mfu 2.57%
iter 564: loss 2.3848, time 15797.04ms, mfu 2.57%
iter 565: loss 2.3647, time 15789.59ms, mfu 2.57%
iter 566: loss 2.4408, time 15786.46ms, mfu 2.57%
iter 567: loss 2.3429, time 15784.01ms, mfu 2.57%
iter 568: loss 2.2185, time 15792.14ms, mfu 2.57%
iter 569: loss 2.6537, time 15795.36ms, mfu 2.57%
iter 570: loss 2.2799, time 15796.30ms, mfu 2.57%
iter 571: loss 2.5771, time 15788.32ms, mfu 2.57%
iter 572: loss 2.5400, time 15790.55ms, mfu 2.57%
iter 573: loss 2.2381, time 15791.22ms, mfu 2.57%
iter 574: loss 2.5007, time 15806.24ms, mfu 2.57%
iter 575: loss 2.7137, time 15795.37ms, mfu 2.57%
iter 576: loss 2.3972, time 15789.52ms, mfu 2.57%
iter 577: loss 2.3526, time 15790.31ms, mfu 2.57%
iter 578: loss 2.3622, time 15798.52ms, mfu 2.57%
iter 579: loss 2.4322, time 16129.75ms, mfu 2.57%
iter 580: loss 2.1113, time 15785.31ms, mfu 2.57%
iter 581: loss 2.2421, time 15773.34ms, mfu 2.57%
iter 582: loss 2.7477, time 15785.57ms, mfu 2.57%
iter 583: loss 2.7682, time 15792.40ms, mfu 2.57%
iter 584: loss 2.2854, time 15776.84ms, mfu 2.57%
iter 585: loss 2.2100, time 15781.18ms, mfu 2.57%
iter 586: loss 2.4937, time 15774.82ms, mfu 2.57%
iter 587: loss 2.5954, time 15784.34ms, mfu 2.57%
iter 588: loss 2.6640, time 15778.23ms, mfu 2.57%
iter 589: loss 2.0855, time 15784.18ms, mfu 2.57%
iter 590: loss 2.7482, time 15803.04ms, mfu 2.57%
iter 591: loss 2.1702, time 15805.66ms, mfu 2.57%
iter 592: loss 2.3958, time 15781.19ms, mfu 2.57%
iter 593: loss 2.3625, time 15773.95ms, mfu 2.57%
iter 594: loss 2.7443, time 15788.15ms, mfu 2.57%
iter 595: loss 2.2690, time 15787.79ms, mfu 2.57%
iter 596: loss 2.2654, time 15788.44ms, mfu 2.57%
iter 597: loss 2.1157, time 15769.60ms, mfu 2.57%
iter 598: loss 2.4491, time 15776.70ms, mfu 2.57%
iter 599: loss 2.4217, time 16061.82ms, mfu 2.57%
iter 600: loss 2.2889, time 15783.70ms, mfu 2.57%
iter 601: loss 2.4338, time 15800.38ms, mfu 2.57%
iter 602: loss 2.1594, time 15783.85ms, mfu 2.57%
iter 603: loss 2.1465, time 15788.72ms, mfu 2.57%
iter 604: loss 2.2383, time 15793.85ms, mfu 2.57%
iter 605: loss 2.3018, time 15797.37ms, mfu 2.57%
iter 606: loss 2.0594, time 15777.62ms, mfu 2.57%
iter 607: loss 2.9835, time 15775.70ms, mfu 2.57%
iter 608: loss 2.0232, time 15798.83ms, mfu 2.57%
iter 609: loss 2.2184, time 15795.88ms, mfu 2.57%
iter 610: loss 2.3647, time 15780.93ms, mfu 2.57%
iter 611: loss 2.0097, time 15779.59ms, mfu 2.57%
iter 612: loss 2.2952, time 15783.21ms, mfu 2.57%
iter 613: loss 2.0717, time 15796.25ms, mfu 2.57%
iter 614: loss 2.1381, time 15765.73ms, mfu 2.57%
iter 615: loss 1.9476, time 15794.76ms, mfu 2.57%
iter 616: loss 2.0739, time 15789.48ms, mfu 2.57%
iter 617: loss 2.1631, time 15786.87ms, mfu 2.57%
iter 618: loss 2.1743, time 15776.46ms, mfu 2.57%
iter 619: loss 2.4424, time 16048.59ms, mfu 2.57%
iter 620: loss 2.6910, time 15787.64ms, mfu 2.57%
iter 621: loss 2.3275, time 15783.88ms, mfu 2.57%
iter 622: loss 2.3303, time 15809.47ms, mfu 2.57%
iter 623: loss 2.4463, time 15776.38ms, mfu 2.57%
iter 624: loss 2.3346, time 15788.53ms, mfu 2.57%
iter 625: loss 2.5835, time 15784.57ms, mfu 2.57%
iter 626: loss 2.6000, time 15791.78ms, mfu 2.57%
iter 627: loss 2.0886, time 15793.33ms, mfu 2.57%
iter 628: loss 2.1909, time 15785.74ms, mfu 2.57%
iter 629: loss 2.1972, time 15788.17ms, mfu 2.57%
iter 630: loss 2.2716, time 15791.97ms, mfu 2.57%
iter 631: loss 2.0687, time 15793.57ms, mfu 2.57%
iter 632: loss 2.2091, time 15799.66ms, mfu 2.57%
iter 633: loss 2.5382, time 15801.87ms, mfu 2.57%
iter 634: loss 2.2486, time 15809.57ms, mfu 2.57%
iter 635: loss 2.0279, time 15796.35ms, mfu 2.57%
iter 636: loss 2.0477, time 15799.11ms, mfu 2.57%
iter 637: loss 2.5486, time 15803.05ms, mfu 2.57%
iter 638: loss 2.3293, time 15803.42ms, mfu 2.57%
iter 639: loss 2.2543, time 15802.36ms, mfu 2.57%
iter 640: loss 2.6743, time 16062.95ms, mfu 2.57%
iter 641: loss 2.1462, time 15788.55ms, mfu 2.57%
iter 642: loss 1.9658, time 15804.36ms, mfu 2.57%
iter 643: loss 2.1205, time 15775.78ms, mfu 2.57%
iter 644: loss 2.4142, time 15783.86ms, mfu 2.57%
iter 645: loss 2.1221, time 15817.33ms, mfu 2.57%
iter 646: loss 2.1921, time 15793.12ms, mfu 2.57%
iter 647: loss 2.4341, time 15861.57ms, mfu 2.57%
iter 648: loss 2.2900, time 15961.05ms, mfu 2.57%
iter 649: loss 2.5549, time 15998.13ms, mfu 2.56%
iter 650: loss 2.4503, time 15945.11ms, mfu 2.56%
iter 651: loss 2.3835, time 15898.49ms, mfu 2.56%
iter 652: loss 2.3608, time 15907.14ms, mfu 2.56%
iter 653: loss 2.4110, time 15901.41ms, mfu 2.56%
iter 654: loss 2.3531, time 15915.34ms, mfu 2.56%
iter 655: loss 2.2332, time 15925.18ms, mfu 2.56%
iter 656: loss 2.3136, time 15927.56ms, mfu 2.56%
iter 657: loss 2.0531, time 15927.52ms, mfu 2.56%
iter 658: loss 2.2932, time 15916.49ms, mfu 2.56%
iter 659: loss 2.4390, time 15929.16ms, mfu 2.56%
iter 660: loss 1.8732, time 16343.58ms, mfu 2.55%
iter 661: loss 2.3778, time 15934.94ms, mfu 2.55%
iter 662: loss 2.0706, time 15957.10ms, mfu 2.55%
iter 663: loss 2.2611, time 15970.03ms, mfu 2.55%
iter 664: loss 2.0972, time 15973.96ms, mfu 2.55%
iter 665: loss 2.4296, time 15936.08ms, mfu 2.55%
iter 666: loss 2.2116, time 15958.79ms, mfu 2.55%
iter 667: loss 1.7510, time 15915.33ms, mfu 2.55%
iter 668: loss 2.3228, time 15948.25ms, mfu 2.55%
iter 669: loss 3.0563, time 15921.14ms, mfu 2.55%
iter 670: loss 2.2778, time 15910.80ms, mfu 2.55%
iter 671: loss 2.0439, time 15914.95ms, mfu 2.55%
iter 672: loss 2.1418, time 15919.24ms, mfu 2.55%
iter 673: loss 2.1537, time 15918.95ms, mfu 2.55%
iter 674: loss 2.1859, time 15917.24ms, mfu 2.55%
iter 675: loss 1.9869, time 15903.59ms, mfu 2.55%
iter 676: loss 2.1232, time 15936.02ms, mfu 2.55%
iter 677: loss 2.6877, time 15930.14ms, mfu 2.55%
iter 678: loss 1.8962, time 15936.45ms, mfu 2.55%
iter 679: loss 2.0495, time 15917.55ms, mfu 2.55%
iter 680: loss 2.2965, time 16378.65ms, mfu 2.54%
iter 681: loss 1.8261, time 15949.69ms, mfu 2.54%
iter 682: loss 1.7964, time 15953.77ms, mfu 2.54%
iter 683: loss 2.3738, time 15959.63ms, mfu 2.54%
iter 684: loss 2.1975, time 15966.40ms, mfu 2.54%
iter 685: loss 1.8523, time 15942.45ms, mfu 2.54%
iter 686: loss 1.9012, time 15932.62ms, mfu 2.55%
iter 687: loss 2.0767, time 15953.62ms, mfu 2.55%
iter 688: loss 2.5375, time 15938.92ms, mfu 2.55%
iter 689: loss 2.0440, time 15920.37ms, mfu 2.55%
iter 690: loss 2.4560, time 15931.49ms, mfu 2.55%
iter 691: loss 2.9542, time 15910.26ms, mfu 2.55%
iter 692: loss 1.9677, time 15904.03ms, mfu 2.55%
iter 693: loss 2.3102, time 15911.48ms, mfu 2.55%
iter 694: loss 2.1394, time 15907.89ms, mfu 2.55%
iter 695: loss 2.2108, time 15905.93ms, mfu 2.55%
iter 696: loss 2.4340, time 15908.73ms, mfu 2.55%
iter 697: loss 2.1329, time 15917.42ms, mfu 2.55%
iter 698: loss 2.1013, time 15926.24ms, mfu 2.55%
iter 699: loss 2.4316, time 15949.91ms, mfu 2.55%
iter 700: loss 2.3680, time 16237.08ms, mfu 2.55%
iter 701: loss 2.2237, time 15955.84ms, mfu 2.55%
iter 702: loss 2.0896, time 15970.36ms, mfu 2.55%
iter 703: loss 1.9919, time 15977.42ms, mfu 2.54%
iter 704: loss 2.2598, time 15965.55ms, mfu 2.54%
iter 705: loss 2.2104, time 15982.93ms, mfu 2.54%
iter 706: loss 1.9944, time 15973.79ms, mfu 2.54%
iter 707: loss 2.1840, time 15960.52ms, mfu 2.54%
iter 708: loss 2.4193, time 15951.03ms, mfu 2.54%
iter 709: loss 1.9926, time 15946.96ms, mfu 2.55%
iter 710: loss 2.0876, time 15961.22ms, mfu 2.55%
iter 711: loss 2.5176, time 15949.25ms, mfu 2.55%
iter 712: loss 2.4725, time 15944.80ms, mfu 2.55%
iter 713: loss 1.7612, time 15935.55ms, mfu 2.55%
iter 714: loss 2.3871, time 15934.72ms, mfu 2.55%
iter 715: loss 2.3156, time 15903.94ms, mfu 2.55%
iter 716: loss 2.0363, time 15893.52ms, mfu 2.55%
iter 717: loss 2.0942, time 15851.11ms, mfu 2.55%
iter 718: loss 2.1179, time 15859.08ms, mfu 2.55%
iter 719: loss 2.1003, time 15861.24ms, mfu 2.55%
iter 720: loss 2.0626, time 16091.09ms, mfu 2.55%
iter 721: loss 2.1981, time 15834.31ms, mfu 2.55%
iter 722: loss 2.4605, time 15825.52ms, mfu 2.55%
iter 723: loss 2.1600, time 15814.33ms, mfu 2.55%
iter 724: loss 2.0861, time 15821.79ms, mfu 2.56%
iter 725: loss 2.3366, time 15808.07ms, mfu 2.56%
iter 726: loss 2.0901, time 15799.30ms, mfu 2.56%
iter 727: loss 2.2544, time 15803.75ms, mfu 2.56%
iter 728: loss 2.1209, time 15805.20ms, mfu 2.56%
iter 729: loss 1.8458, time 15809.60ms, mfu 2.56%
iter 730: loss 2.2982, time 15790.47ms, mfu 2.56%
iter 731: loss 2.3403, time 15800.86ms, mfu 2.56%
iter 732: loss 2.2307, time 15789.94ms, mfu 2.56%
iter 733: loss 2.3694, time 15807.17ms, mfu 2.57%
iter 734: loss 2.4578, time 15789.09ms, mfu 2.57%
iter 735: loss 2.1329, time 15796.23ms, mfu 2.57%
iter 736: loss 2.1214, time 15793.39ms, mfu 2.57%
iter 737: loss 2.0364, time 15788.04ms, mfu 2.57%
iter 738: loss 2.2073, time 15778.19ms, mfu 2.57%
iter 739: loss 1.9880, time 15784.50ms, mfu 2.57%
iter 740: loss 2.1584, time 16136.49ms, mfu 2.56%
iter 741: loss 2.0573, time 15779.76ms, mfu 2.56%
iter 742: loss 2.2370, time 15769.43ms, mfu 2.57%
iter 743: loss 2.0385, time 15788.93ms, mfu 2.57%
iter 744: loss 1.9839, time 15789.68ms, mfu 2.57%
iter 745: loss 2.4512, time 15777.05ms, mfu 2.57%
iter 746: loss 2.2320, time 15775.27ms, mfu 2.57%
iter 747: loss 2.1465, time 15781.45ms, mfu 2.57%
iter 748: loss 2.0109, time 15787.63ms, mfu 2.57%
iter 749: loss 2.1734, time 15789.56ms, mfu 2.57%
iter 750: loss 2.2463, time 15778.66ms, mfu 2.57%
iter 751: loss 2.1865, time 15774.03ms, mfu 2.57%
iter 752: loss 2.1479, time 15791.75ms, mfu 2.57%
iter 753: loss 1.8703, time 15788.33ms, mfu 2.57%
iter 754: loss 2.0623, time 15770.58ms, mfu 2.57%
iter 755: loss 2.3280, time 15783.75ms, mfu 2.57%
iter 756: loss 2.2761, time 15781.09ms, mfu 2.57%
iter 757: loss 2.5053, time 15803.88ms, mfu 2.57%
iter 758: loss 2.1852, time 15789.65ms, mfu 2.57%
iter 759: loss 2.3007, time 15766.70ms, mfu 2.57%
iter 760: loss 2.1826, time 16089.81ms, mfu 2.57%
iter 761: loss 1.9758, time 15788.18ms, mfu 2.57%
iter 762: loss 2.1945, time 15774.13ms, mfu 2.57%
iter 763: loss 1.8993, time 15778.52ms, mfu 2.57%
iter 764: loss 2.0809, time 15776.60ms, mfu 2.57%
iter 765: loss 2.2653, time 15782.42ms, mfu 2.57%
iter 766: loss 2.3293, time 15772.02ms, mfu 2.57%
iter 767: loss 2.3479, time 15792.82ms, mfu 2.57%
iter 768: loss 2.1680, time 15770.38ms, mfu 2.57%
iter 769: loss 2.3976, time 15772.82ms, mfu 2.57%
iter 770: loss 2.5198, time 15780.17ms, mfu 2.57%
iter 771: loss 2.6177, time 15773.39ms, mfu 2.57%
iter 772: loss 2.4365, time 15779.48ms, mfu 2.57%
iter 773: loss 2.1642, time 15784.82ms, mfu 2.57%
iter 774: loss 1.7456, time 15775.48ms, mfu 2.57%
iter 775: loss 2.0486, time 15772.15ms, mfu 2.57%
iter 776: loss 2.3621, time 15783.22ms, mfu 2.57%
iter 777: loss 1.8590, time 15769.19ms, mfu 2.57%
iter 778: loss 2.3607, time 15781.52ms, mfu 2.57%
iter 779: loss 2.5178, time 15780.87ms, mfu 2.57%
iter 780: loss 2.1158, time 16147.76ms, mfu 2.57%
iter 781: loss 2.0912, time 15788.16ms, mfu 2.57%
iter 782: loss 2.5448, time 15785.25ms, mfu 2.57%
iter 783: loss 2.0678, time 15787.82ms, mfu 2.57%
iter 784: loss 2.0523, time 15773.48ms, mfu 2.57%
iter 785: loss 2.2459, time 15781.72ms, mfu 2.57%
iter 786: loss 1.8564, time 15783.97ms, mfu 2.57%
iter 787: loss 1.9513, time 15780.42ms, mfu 2.57%
iter 788: loss 2.3032, time 15790.47ms, mfu 2.57%
iter 789: loss 2.3686, time 15782.91ms, mfu 2.57%
iter 790: loss 2.2410, time 15771.57ms, mfu 2.57%
iter 791: loss 2.1699, time 15780.82ms, mfu 2.57%
iter 792: loss 1.7947, time 15775.43ms, mfu 2.57%
iter 793: loss 2.0690, time 15794.24ms, mfu 2.57%
iter 794: loss 2.6321, time 15770.36ms, mfu 2.57%
iter 795: loss 2.1456, time 15775.06ms, mfu 2.57%
iter 796: loss 2.2586, time 15786.43ms, mfu 2.57%
iter 797: loss 2.5081, time 15783.17ms, mfu 2.57%
iter 798: loss 2.3299, time 15774.00ms, mfu 2.57%
iter 799: loss 1.9305, time 15788.48ms, mfu 2.57%
iter 800: loss 2.2911, time 16100.50ms, mfu 2.57%
iter 801: loss 2.1312, time 15773.37ms, mfu 2.57%
iter 802: loss 2.3034, time 15772.51ms, mfu 2.57%
iter 803: loss 2.3147, time 15771.95ms, mfu 2.57%
iter 804: loss 2.0392, time 15779.73ms, mfu 2.57%
iter 805: loss 2.1784, time 15769.13ms, mfu 2.57%
iter 806: loss 2.5106, time 15774.15ms, mfu 2.57%
iter 807: loss 2.3400, time 15776.43ms, mfu 2.57%
iter 808: loss 2.1633, time 15774.72ms, mfu 2.57%
iter 809: loss 1.9262, time 15773.21ms, mfu 2.57%
iter 810: loss 1.9016, time 15793.47ms, mfu 2.57%
iter 811: loss 1.8484, time 15787.97ms, mfu 2.57%
iter 812: loss 1.9734, time 15782.51ms, mfu 2.57%
iter 813: loss 2.0043, time 15797.22ms, mfu 2.57%
iter 814: loss 2.1159, time 15788.28ms, mfu 2.57%
iter 815: loss 1.9981, time 15773.26ms, mfu 2.57%
iter 816: loss 1.7099, time 15776.72ms, mfu 2.57%
iter 817: loss 2.1904, time 15768.34ms, mfu 2.57%
iter 818: loss 2.4765, time 15781.07ms, mfu 2.57%
iter 819: loss 1.8449, time 15784.56ms, mfu 2.57%
iter 820: loss 2.2601, time 16080.30ms, mfu 2.57%
iter 821: loss 2.0005, time 15788.49ms, mfu 2.57%
iter 822: loss 2.6169, time 15764.23ms, mfu 2.57%
iter 823: loss 1.8710, time 15776.47ms, mfu 2.57%
iter 824: loss 2.1524, time 15761.12ms, mfu 2.57%
iter 825: loss 2.2425, time 15780.63ms, mfu 2.57%
iter 826: loss 2.1842, time 15766.72ms, mfu 2.57%
iter 827: loss 1.9680, time 15783.96ms, mfu 2.57%
iter 828: loss 2.0557, time 15762.71ms, mfu 2.57%
iter 829: loss 2.4807, time 15783.36ms, mfu 2.57%
iter 830: loss 1.6956, time 15784.24ms, mfu 2.57%
iter 831: loss 2.1801, time 15759.32ms, mfu 2.57%
iter 832: loss 2.1264, time 15777.61ms, mfu 2.57%
iter 833: loss 1.8823, time 15788.00ms, mfu 2.57%
iter 834: loss 1.9047, time 15775.72ms, mfu 2.57%
iter 835: loss 2.0368, time 15780.39ms, mfu 2.57%
iter 836: loss 2.3900, time 15778.47ms, mfu 2.57%
iter 837: loss 2.2201, time 15776.71ms, mfu 2.57%
iter 838: loss 2.3296, time 15776.88ms, mfu 2.57%
iter 839: loss 1.7919, time 15766.49ms, mfu 2.57%
iter 840: loss 2.0380, time 16113.37ms, mfu 2.57%
iter 841: loss 2.1084, time 15767.58ms, mfu 2.57%
iter 842: loss 2.0553, time 15780.38ms, mfu 2.57%
iter 843: loss 1.8175, time 15780.98ms, mfu 2.57%
iter 844: loss 2.0221, time 15770.50ms, mfu 2.57%
iter 845: loss 2.1730, time 15781.15ms, mfu 2.57%
iter 846: loss 2.2764, time 15783.39ms, mfu 2.57%
iter 847: loss 2.4144, time 15782.91ms, mfu 2.57%
iter 848: loss 2.1396, time 15779.87ms, mfu 2.57%
iter 849: loss 2.0887, time 15767.91ms, mfu 2.57%
iter 850: loss 2.0006, time 15780.05ms, mfu 2.57%
iter 851: loss 1.9063, time 15779.48ms, mfu 2.57%
iter 852: loss 2.3892, time 15784.33ms, mfu 2.57%
iter 853: loss 2.2392, time 15776.66ms, mfu 2.57%
iter 854: loss 2.1287, time 15782.47ms, mfu 2.57%
iter 855: loss 2.1681, time 15785.10ms, mfu 2.57%
iter 856: loss 2.2668, time 15795.70ms, mfu 2.57%
iter 857: loss 1.8165, time 15783.77ms, mfu 2.57%
iter 858: loss 2.2330, time 15778.10ms, mfu 2.57%
iter 859: loss 2.2017, time 15774.75ms, mfu 2.57%
iter 860: loss 2.0308, time 16109.96ms, mfu 2.57%
iter 861: loss 2.4079, time 15774.54ms, mfu 2.57%
iter 862: loss 2.0862, time 15778.08ms, mfu 2.57%
iter 863: loss 2.2388, time 15767.96ms, mfu 2.57%
iter 864: loss 1.9900, time 15777.19ms, mfu 2.57%
iter 865: loss 2.1341, time 15770.55ms, mfu 2.57%
iter 866: loss 2.1110, time 15774.64ms, mfu 2.57%
iter 867: loss 1.9598, time 15769.22ms, mfu 2.57%
iter 868: loss 1.9497, time 15769.67ms, mfu 2.57%
iter 869: loss 2.0639, time 15763.15ms, mfu 2.57%
iter 870: loss 2.2020, time 15789.51ms, mfu 2.57%
iter 871: loss 2.1003, time 15773.31ms, mfu 2.57%
iter 872: loss 1.8806, time 15775.41ms, mfu 2.57%
iter 873: loss 1.9058, time 15807.99ms, mfu 2.57%
iter 874: loss 2.3615, time 15932.92ms, mfu 2.57%
iter 875: loss 1.8822, time 15991.81ms, mfu 2.57%
iter 876: loss 2.2796, time 15937.22ms, mfu 2.57%
iter 877: loss 1.8836, time 15909.85ms, mfu 2.56%
iter 878: loss 2.1845, time 15911.93ms, mfu 2.56%
iter 879: loss 2.0726, time 15911.34ms, mfu 2.56%
iter 880: loss 1.8925, time 16176.88ms, mfu 2.56%
iter 881: loss 2.3302, time 15951.26ms, mfu 2.56%
iter 882: loss 2.7191, time 15974.75ms, mfu 2.56%
iter 883: loss 1.7602, time 15984.91ms, mfu 2.55%
iter 884: loss 2.0672, time 15979.65ms, mfu 2.55%
iter 885: loss 2.1944, time 15992.26ms, mfu 2.55%
iter 886: loss 2.4615, time 15961.31ms, mfu 2.55%
iter 887: loss 2.2395, time 15967.04ms, mfu 2.55%
iter 888: loss 1.9623, time 16207.23ms, mfu 2.55%
iter 889: loss 1.9722, time 15949.49ms, mfu 2.55%
iter 890: loss 2.5430, time 15924.32ms, mfu 2.55%
iter 891: loss 1.9831, time 15923.68ms, mfu 2.55%
iter 892: loss 2.1639, time 15909.01ms, mfu 2.55%
iter 893: loss 2.3151, time 15917.31ms, mfu 2.55%
iter 894: loss 1.7399, time 15930.81ms, mfu 2.55%
iter 895: loss 2.3940, time 15906.74ms, mfu 2.55%
iter 896: loss 2.3052, time 15900.06ms, mfu 2.55%
iter 897: loss 1.9370, time 15910.33ms, mfu 2.55%
iter 898: loss 2.2052, time 15913.94ms, mfu 2.55%
iter 899: loss 1.8660, time 15902.12ms, mfu 2.55%
iter 900: loss 1.9955, time 16244.58ms, mfu 2.55%
iter 901: loss 2.2343, time 15923.07ms, mfu 2.55%
iter 902: loss 1.9395, time 15934.34ms, mfu 2.55%
iter 903: loss 1.6143, time 15922.84ms, mfu 2.55%
iter 904: loss 2.2701, time 15928.97ms, mfu 2.55%
iter 905: loss 2.0790, time 15934.58ms, mfu 2.55%
iter 906: loss 2.1615, time 15954.26ms, mfu 2.55%
iter 907: loss 1.9250, time 15989.40ms, mfu 2.55%
iter 908: loss 2.1421, time 15946.80ms, mfu 2.55%
iter 909: loss 2.0729, time 15948.47ms, mfu 2.55%
iter 910: loss 2.2458, time 15941.73ms, mfu 2.55%
iter 911: loss 1.9515, time 15931.91ms, mfu 2.55%
iter 912: loss 2.1838, time 15918.71ms, mfu 2.55%
iter 913: loss 1.9630, time 15938.16ms, mfu 2.55%
iter 914: loss 2.0396, time 15919.49ms, mfu 2.55%
iter 915: loss 2.0151, time 15913.66ms, mfu 2.55%
iter 916: loss 2.0428, time 15911.73ms, mfu 2.55%
iter 917: loss 2.1890, time 15905.19ms, mfu 2.55%
iter 918: loss 2.3218, time 15907.54ms, mfu 2.55%
iter 919: loss 1.5845, time 15910.55ms, mfu 2.55%
iter 920: loss 1.9300, time 16170.92ms, mfu 2.55%
iter 921: loss 2.0468, time 15905.84ms, mfu 2.55%
iter 922: loss 1.8893, time 15926.33ms, mfu 2.55%
iter 923: loss 2.1315, time 15944.98ms, mfu 2.55%
iter 924: loss 2.3006, time 15948.21ms, mfu 2.55%
iter 925: loss 2.3514, time 15974.04ms, mfu 2.55%
iter 926: loss 1.8763, time 15969.65ms, mfu 2.55%
iter 927: loss 2.1285, time 15961.77ms, mfu 2.55%
iter 928: loss 2.3119, time 15971.41ms, mfu 2.55%
iter 929: loss 1.8992, time 15978.87ms, mfu 2.55%
iter 930: loss 1.9893, time 15951.22ms, mfu 2.55%
iter 931: loss 2.4998, time 15952.23ms, mfu 2.55%
iter 932: loss 1.9821, time 15942.95ms, mfu 2.55%
iter 933: loss 1.8341, time 15942.24ms, mfu 2.55%
iter 934: loss 2.0317, time 15927.36ms, mfu 2.55%
iter 935: loss 1.8766, time 15923.45ms, mfu 2.55%
iter 936: loss 1.9942, time 15920.75ms, mfu 2.55%
iter 937: loss 1.8731, time 15935.07ms, mfu 2.55%
iter 938: loss 2.0855, time 15930.27ms, mfu 2.55%
iter 939: loss 2.0397, time 15938.04ms, mfu 2.55%
iter 940: loss 1.9939, time 16285.36ms, mfu 2.54%
iter 941: loss 1.6447, time 15931.10ms, mfu 2.54%
iter 942: loss 2.1109, time 15917.17ms, mfu 2.54%
iter 943: loss 2.3539, time 15883.13ms, mfu 2.55%
iter 944: loss 2.2987, time 15837.31ms, mfu 2.55%
iter 945: loss 2.0739, time 15841.32ms, mfu 2.55%
iter 946: loss 1.8895, time 15804.69ms, mfu 2.55%
iter 947: loss 2.0627, time 15781.94ms, mfu 2.55%
iter 948: loss 2.1138, time 15797.39ms, mfu 2.56%
iter 949: loss 1.8687, time 15784.21ms, mfu 2.56%
iter 950: loss 2.1419, time 15780.94ms, mfu 2.56%
iter 951: loss 2.3077, time 15807.29ms, mfu 2.56%
iter 952: loss 1.9509, time 15799.97ms, mfu 2.56%
iter 953: loss 2.1269, time 15806.10ms, mfu 2.56%
iter 954: loss 2.0780, time 15806.90ms, mfu 2.56%
iter 955: loss 2.1166, time 15804.81ms, mfu 2.56%
iter 956: loss 2.0927, time 15796.93ms, mfu 2.56%
iter 957: loss 1.6858, time 15801.49ms, mfu 2.57%
iter 958: loss 2.0981, time 15794.35ms, mfu 2.57%
iter 959: loss 1.8430, time 15799.43ms, mfu 2.57%
iter 960: loss 2.0108, time 16030.49ms, mfu 2.56%
iter 961: loss 1.8443, time 15788.75ms, mfu 2.56%
iter 962: loss 1.9369, time 15785.49ms, mfu 2.57%
iter 963: loss 2.1958, time 15795.72ms, mfu 2.57%
iter 964: loss 1.7507, time 15798.57ms, mfu 2.57%
iter 965: loss 2.1615, time 15800.26ms, mfu 2.57%
iter 966: loss 2.3008, time 15786.09ms, mfu 2.57%
iter 967: loss 1.9942, time 15787.95ms, mfu 2.57%
iter 968: loss 1.7154, time 15783.34ms, mfu 2.57%
iter 969: loss 2.2270, time 15790.76ms, mfu 2.57%
iter 970: loss 2.3086, time 15811.23ms, mfu 2.57%
iter 971: loss 2.0850, time 15788.44ms, mfu 2.57%
iter 972: loss 2.0052, time 15776.77ms, mfu 2.57%
iter 973: loss 1.9036, time 15767.90ms, mfu 2.57%
iter 974: loss 1.8268, time 15759.79ms, mfu 2.57%
iter 975: loss 1.6150, time 15752.48ms, mfu 2.57%
iter 976: loss 1.7824, time 15766.53ms, mfu 2.57%
iter 977: loss 2.0631, time 15749.82ms, mfu 2.57%
iter 978: loss 1.8577, time 15740.29ms, mfu 2.57%
iter 979: loss 1.6349, time 15770.62ms, mfu 2.57%
iter 980: loss 1.8901, time 16040.79ms, mfu 2.57%
iter 981: loss 1.8216, time 15788.56ms, mfu 2.57%
iter 982: loss 2.4381, time 15794.00ms, mfu 2.57%
iter 983: loss 2.2495, time 15790.51ms, mfu 2.57%
iter 984: loss 2.1350, time 15768.91ms, mfu 2.57%
iter 985: loss 2.1531, time 15780.48ms, mfu 2.57%
iter 986: loss 2.2011, time 15784.38ms, mfu 2.57%
iter 987: loss 1.8706, time 15790.48ms, mfu 2.57%
iter 988: loss 1.8769, time 15767.63ms, mfu 2.57%
iter 989: loss 2.3796, time 15766.98ms, mfu 2.57%
iter 990: loss 2.0378, time 15756.04ms, mfu 2.57%
iter 991: loss 2.2502, time 15754.62ms, mfu 2.57%
iter 992: loss 2.3784, time 15756.12ms, mfu 2.57%
iter 993: loss 1.9600, time 15742.88ms, mfu 2.58%
iter 994: loss 2.0210, time 15766.28ms, mfu 2.58%
iter 995: loss 2.1075, time 15754.86ms, mfu 2.58%
iter 996: loss 2.1094, time 15747.22ms, mfu 2.58%
iter 997: loss 2.1608, time 15756.01ms, mfu 2.58%
iter 998: loss 2.3172, time 15773.56ms, mfu 2.58%
iter 999: loss 1.7910, time 15754.11ms, mfu 2.58%
iter 1000: loss 2.1135, time 16021.83ms, mfu 2.57%
iter 1001: loss 1.8837, time 15753.84ms, mfu 2.57%
iter 1002: loss 2.1453, time 15756.50ms, mfu 2.57%
iter 1003: loss 2.0328, time 15761.90ms, mfu 2.57%
iter 1004: loss 2.0944, time 15768.06ms, mfu 2.57%
iter 1005: loss 1.7740, time 15762.25ms, mfu 2.57%
iter 1006: loss 2.4371, time 15754.46ms, mfu 2.58%
iter 1007: loss 1.9607, time 15749.71ms, mfu 2.58%
iter 1008: loss 2.5170, time 15762.32ms, mfu 2.58%
iter 1009: loss 2.1380, time 15767.49ms, mfu 2.58%
iter 1010: loss 2.2924, time 15768.61ms, mfu 2.58%
iter 1011: loss 1.6963, time 15757.98ms, mfu 2.58%
iter 1012: loss 1.9943, time 15769.90ms, mfu 2.58%
iter 1013: loss 1.8660, time 15786.75ms, mfu 2.58%
iter 1014: loss 1.8056, time 15770.39ms, mfu 2.58%
iter 1015: loss 2.0060, time 15765.65ms, mfu 2.58%
iter 1016: loss 2.2022, time 15765.26ms, mfu 2.58%
iter 1017: loss 1.9028, time 15789.02ms, mfu 2.58%
iter 1018: loss 1.9232, time 15760.75ms, mfu 2.58%
iter 1019: loss 1.9527, time 15786.07ms, mfu 2.58%
iter 1020: loss 1.9351, time 16099.03ms, mfu 2.57%
iter 1021: loss 1.9626, time 15777.27ms, mfu 2.57%
iter 1022: loss 1.9922, time 15779.56ms, mfu 2.57%
iter 1023: loss 2.0531, time 15770.53ms, mfu 2.57%
iter 1024: loss 2.0958, time 15760.08ms, mfu 2.57%
iter 1025: loss 2.3392, time 15787.43ms, mfu 2.57%
iter 1026: loss 2.1816, time 15768.63ms, mfu 2.57%
iter 1027: loss 1.9614, time 15770.66ms, mfu 2.57%
iter 1028: loss 2.3126, time 15761.23ms, mfu 2.57%
iter 1029: loss 1.8281, time 15776.20ms, mfu 2.57%
iter 1030: loss 2.0593, time 15777.73ms, mfu 2.57%
iter 1031: loss 2.5659, time 15786.51ms, mfu 2.57%
iter 1032: loss 2.0860, time 15775.69ms, mfu 2.57%
iter 1033: loss 2.0120, time 15771.09ms, mfu 2.57%
iter 1034: loss 1.7220, time 15768.74ms, mfu 2.57%
iter 1035: loss 2.0369, time 15757.75ms, mfu 2.57%
iter 1036: loss 1.9870, time 15771.84ms, mfu 2.57%
iter 1037: loss 1.8783, time 15775.39ms, mfu 2.58%
iter 1038: loss 1.9442, time 15776.82ms, mfu 2.58%
iter 1039: loss 2.5697, time 15762.83ms, mfu 2.58%
iter 1040: loss 1.6626, time 16097.30ms, mfu 2.57%
iter 1041: loss 2.3173, time 15763.17ms, mfu 2.57%
iter 1042: loss 2.1713, time 15776.24ms, mfu 2.57%
iter 1043: loss 1.9872, time 15778.16ms, mfu 2.57%
iter 1044: loss 1.7716, time 15765.42ms, mfu 2.57%
iter 1045: loss 1.6054, time 15785.91ms, mfu 2.57%
iter 1046: loss 2.0962, time 15776.82ms, mfu 2.57%
iter 1047: loss 2.2393, time 15787.55ms, mfu 2.57%
iter 1048: loss 1.8940, time 15777.97ms, mfu 2.57%
iter 1049: loss 2.1238, time 15775.37ms, mfu 2.57%
iter 1050: loss 2.2438, time 15778.32ms, mfu 2.57%
iter 1051: loss 1.8787, time 15775.84ms, mfu 2.57%
iter 1052: loss 1.7790, time 15776.47ms, mfu 2.57%
iter 1053: loss 1.9426, time 15769.08ms, mfu 2.57%
iter 1054: loss 2.5886, time 15767.70ms, mfu 2.57%
iter 1055: loss 1.9457, time 15765.80ms, mfu 2.57%
iter 1056: loss 1.7589, time 15762.98ms, mfu 2.57%
iter 1057: loss 1.6923, time 15779.75ms, mfu 2.57%
iter 1058: loss 2.2667, time 15758.63ms, mfu 2.58%
iter 1059: loss 1.9191, time 15776.13ms, mfu 2.58%
iter 1060: loss 1.6839, time 15763.46ms, mfu 2.58%
iter 1061: loss 2.1539, time 16009.58ms, mfu 2.57%
iter 1062: loss 1.9992, time 15772.18ms, mfu 2.57%
iter 1063: loss 1.9803, time 15781.18ms, mfu 2.57%
iter 1064: loss 2.6328, time 15776.91ms, mfu 2.57%
iter 1065: loss 2.2380, time 15758.83ms, mfu 2.57%
iter 1066: loss 1.8736, time 15769.04ms, mfu 2.57%
iter 1067: loss 1.7368, time 15760.23ms, mfu 2.57%
iter 1068: loss 1.9990, time 15766.06ms, mfu 2.57%
iter 1069: loss 1.8743, time 15769.14ms, mfu 2.57%
iter 1070: loss 1.6806, time 15768.74ms, mfu 2.57%
iter 1071: loss 2.2978, time 15763.13ms, mfu 2.57%
iter 1072: loss 1.9821, time 15783.55ms, mfu 2.57%
iter 1073: loss 1.8013, time 15777.38ms, mfu 2.57%
iter 1074: loss 1.6832, time 15775.90ms, mfu 2.57%
iter 1075: loss 1.7635, time 15769.97ms, mfu 2.57%
iter 1076: loss 2.3534, time 15782.53ms, mfu 2.57%
iter 1077: loss 2.2191, time 15790.42ms, mfu 2.57%
iter 1078: loss 1.8030, time 15769.70ms, mfu 2.57%
iter 1079: loss 1.6763, time 15778.61ms, mfu 2.57%
iter 1080: loss 2.0248, time 16114.62ms, mfu 2.57%
iter 1081: loss 1.7363, time 15784.60ms, mfu 2.57%
iter 1082: loss 2.1313, time 15775.31ms, mfu 2.57%
iter 1083: loss 1.8997, time 15763.86ms, mfu 2.57%
iter 1084: loss 1.8651, time 15787.60ms, mfu 2.57%
iter 1085: loss 1.8254, time 15763.27ms, mfu 2.57%
iter 1086: loss 1.7342, time 15774.84ms, mfu 2.57%
iter 1087: loss 2.1848, time 15762.01ms, mfu 2.57%
iter 1088: loss 1.7120, time 15775.99ms, mfu 2.57%
iter 1089: loss 1.9508, time 15782.71ms, mfu 2.57%
iter 1090: loss 1.9634, time 15767.46ms, mfu 2.57%
iter 1091: loss 1.8856, time 15760.00ms, mfu 2.57%
iter 1092: loss 1.8837, time 15771.18ms, mfu 2.57%
iter 1093: loss 2.1838, time 15770.99ms, mfu 2.57%
iter 1094: loss 2.3170, time 15788.25ms, mfu 2.57%
iter 1095: loss 2.2861, time 15776.75ms, mfu 2.57%
iter 1096: loss 1.7582, time 15792.17ms, mfu 2.57%
iter 1097: loss 2.3594, time 15781.72ms, mfu 2.57%
iter 1098: loss 2.0036, time 15765.96ms, mfu 2.57%
iter 1099: loss 2.0521, time 15768.74ms, mfu 2.57%
iter 1100: loss 1.9638, time 15786.65ms, mfu 2.57%
iter 1101: loss 2.0535, time 16132.60ms, mfu 2.57%
iter 1102: loss 1.7811, time 15879.56ms, mfu 2.57%
iter 1103: loss 2.0517, time 15963.80ms, mfu 2.57%
iter 1104: loss 1.8010, time 15961.34ms, mfu 2.56%
iter 1105: loss 1.9940, time 15913.75ms, mfu 2.56%
iter 1106: loss 1.8471, time 15890.92ms, mfu 2.56%
iter 1107: loss 1.7116, time 15914.58ms, mfu 2.56%
iter 1108: loss 2.3167, time 15930.64ms, mfu 2.56%
iter 1109: loss 1.8641, time 15946.56ms, mfu 2.56%
iter 1110: loss 2.2233, time 15969.80ms, mfu 2.56%
iter 1111: loss 2.2569, time 15968.96ms, mfu 2.56%
iter 1112: loss 1.7784, time 15979.22ms, mfu 2.55%
iter 1113: loss 1.9016, time 15972.61ms, mfu 2.55%
iter 1114: loss 2.1169, time 15964.39ms, mfu 2.55%
iter 1115: loss 1.8901, time 15951.75ms, mfu 2.55%
iter 1116: loss 1.9111, time 15946.51ms, mfu 2.55%
iter 1117: loss 2.0210, time 15933.31ms, mfu 2.55%
iter 1118: loss 1.8662, time 15909.31ms, mfu 2.55%
iter 1119: loss 2.3302, time 15925.08ms, mfu 2.55%
iter 1120: loss 2.0536, time 15915.90ms, mfu 2.55%
iter 1121: loss 1.9847, time 16192.63ms, mfu 2.55%
iter 1122: loss 2.0099, time 15907.40ms, mfu 2.55%
iter 1123: loss 1.8159, time 15914.78ms, mfu 2.55%
iter 1124: loss 1.6674, time 15899.06ms, mfu 2.55%
iter 1125: loss 1.9015, time 15897.76ms, mfu 2.55%
iter 1126: loss 2.0643, time 15900.51ms, mfu 2.55%
iter 1127: loss 1.6108, time 15900.63ms, mfu 2.55%
iter 1128: loss 1.8102, time 15900.50ms, mfu 2.55%
iter 1129: loss 1.8904, time 15905.25ms, mfu 2.55%
iter 1130: loss 1.9878, time 15889.21ms, mfu 2.55%
iter 1131: loss 2.2863, time 15905.67ms, mfu 2.55%
iter 1132: loss 1.8839, time 15929.31ms, mfu 2.55%
iter 1133: loss 1.9553, time 15921.01ms, mfu 2.55%
iter 1134: loss 2.0602, time 15908.88ms, mfu 2.55%
iter 1135: loss 1.9460, time 15932.21ms, mfu 2.55%
iter 1136: loss 1.7090, time 15942.05ms, mfu 2.55%
iter 1137: loss 1.8632, time 15939.24ms, mfu 2.55%
iter 1138: loss 2.1540, time 15941.62ms, mfu 2.55%
iter 1139: loss 1.8791, time 15939.85ms, mfu 2.55%
iter 1140: loss 1.8271, time 15930.17ms, mfu 2.55%
iter 1141: loss 1.8950, time 16213.99ms, mfu 2.55%
iter 1142: loss 2.0078, time 15939.76ms, mfu 2.55%
iter 1143: loss 1.9261, time 15943.21ms, mfu 2.55%
iter 1144: loss 1.7841, time 15943.30ms, mfu 2.55%
iter 1145: loss 2.2205, time 15945.50ms, mfu 2.55%
iter 1146: loss 1.4193, time 15944.73ms, mfu 2.55%
iter 1147: loss 1.4500, time 15949.35ms, mfu 2.55%
iter 1148: loss 1.7120, time 15926.34ms, mfu 2.55%
iter 1149: loss 1.6750, time 15937.57ms, mfu 2.55%
iter 1150: loss 1.8165, time 15930.02ms, mfu 2.55%
iter 1151: loss 1.8866, time 15910.51ms, mfu 2.55%
iter 1152: loss 1.8744, time 15913.04ms, mfu 2.55%
iter 1153: loss 1.9256, time 15925.83ms, mfu 2.55%
iter 1154: loss 1.5523, time 15921.76ms, mfu 2.55%
iter 1155: loss 1.8423, time 15917.15ms, mfu 2.55%
iter 1156: loss 1.8896, time 15919.26ms, mfu 2.55%
iter 1157: loss 2.0200, time 15935.87ms, mfu 2.55%
iter 1158: loss 2.1869, time 15926.73ms, mfu 2.55%
iter 1159: loss 1.8346, time 15949.47ms, mfu 2.55%
iter 1160: loss 2.1496, time 15958.65ms, mfu 2.55%
iter 1161: loss 2.2133, time 16230.38ms, mfu 2.54%
iter 1162: loss 2.1843, time 15934.00ms, mfu 2.55%
iter 1163: loss 1.9826, time 15956.88ms, mfu 2.55%
iter 1164: loss 2.2125, time 15958.28ms, mfu 2.55%
iter 1165: loss 1.7229, time 15941.75ms, mfu 2.55%
iter 1166: loss 2.2951, time 15954.83ms, mfu 2.55%
iter 1167: loss 1.7688, time 15944.20ms, mfu 2.55%
iter 1168: loss 1.9640, time 15923.98ms, mfu 2.55%
iter 1169: loss 2.3560, time 15910.86ms, mfu 2.55%
iter 1171: loss 2.2182, time 15887.20ms, mfu 2.55%
iter 1172: loss 1.8848, time 15868.07ms, mfu 2.55%
iter 1173: loss 1.7727, time 15867.91ms, mfu 2.55%
iter 1174: loss 1.9030, time 15828.32ms, mfu 2.55%
iter 1175: loss 2.3258, time 15822.88ms, mfu 2.55%
iter 1176: loss 1.9799, time 15814.13ms, mfu 2.56%
iter 1177: loss 1.7430, time 15818.37ms, mfu 2.56%
iter 1178: loss 1.8035, time 15813.90ms, mfu 2.56%
iter 1179: loss 2.0126, time 15815.27ms, mfu 2.56%
iter 1180: loss 1.7263, time 15788.54ms, mfu 2.56%
iter 1181: loss 2.1177, time 16061.31ms, mfu 2.56%
iter 1182: loss 1.8924, time 15801.04ms, mfu 2.56%
iter 1183: loss 1.8663, time 15798.55ms, mfu 2.56%
iter 1184: loss 2.1043, time 15791.94ms, mfu 2.56%
iter 1185: loss 2.0917, time 15789.60ms, mfu 2.56%
iter 1186: loss 2.1933, time 15777.73ms, mfu 2.56%
iter 1187: loss 1.8184, time 15783.34ms, mfu 2.56%
iter 1188: loss 1.9211, time 15771.79ms, mfu 2.57%
iter 1189: loss 1.7662, time 15772.87ms, mfu 2.57%
iter 1190: loss 1.9317, time 15774.60ms, mfu 2.57%
iter 1191: loss 1.8640, time 15775.72ms, mfu 2.57%
iter 1192: loss 2.0126, time 15797.12ms, mfu 2.57%
iter 1193: loss 1.4912, time 15760.00ms, mfu 2.57%
iter 1194: loss 2.4076, time 15780.15ms, mfu 2.57%
iter 1195: loss 1.8551, time 15783.95ms, mfu 2.57%
iter 1196: loss 2.0659, time 15772.01ms, mfu 2.57%
iter 1197: loss 1.9695, time 15780.57ms, mfu 2.57%
iter 1198: loss 1.7309, time 15767.05ms, mfu 2.57%
iter 1199: loss 1.6996, time 15759.90ms, mfu 2.57%
iter 1200: loss 1.9812, time 15780.02ms, mfu 2.57%
iter 1201: loss 1.8614, time 16130.25ms, mfu 2.57%
iter 1202: loss 1.9921, time 15768.04ms, mfu 2.57%
iter 1203: loss 1.8211, time 15784.07ms, mfu 2.57%
iter 1204: loss 1.3265, time 15784.73ms, mfu 2.57%
iter 1205: loss 2.1775, time 15760.71ms, mfu 2.57%
iter 1206: loss 2.0925, time 15787.29ms, mfu 2.57%
iter 1207: loss 1.8776, time 15772.83ms, mfu 2.57%
iter 1208: loss 1.7967, time 15780.46ms, mfu 2.57%
iter 1209: loss 1.5361, time 15772.80ms, mfu 2.57%
iter 1210: loss 1.8313, time 15765.95ms, mfu 2.57%
iter 1211: loss 2.0248, time 15776.23ms, mfu 2.57%
iter 1212: loss 2.0935, time 15780.71ms, mfu 2.57%
iter 1213: loss 1.6439, time 15777.78ms, mfu 2.57%
iter 1214: loss 1.8058, time 15773.46ms, mfu 2.57%
iter 1215: loss 2.1409, time 15766.13ms, mfu 2.57%
iter 1216: loss 1.5350, time 15779.77ms, mfu 2.57%
iter 1217: loss 2.3468, time 15777.83ms, mfu 2.57%
iter 1218: loss 2.1072, time 15765.90ms, mfu 2.57%
iter 1219: loss 1.7223, time 15774.15ms, mfu 2.57%
iter 1220: loss 2.0166, time 15772.28ms, mfu 2.57%
iter 1221: loss 1.8767, time 16110.29ms, mfu 2.57%
iter 1222: loss 1.7361, time 15786.65ms, mfu 2.57%
iter 1223: loss 1.7018, time 15762.23ms, mfu 2.57%
iter 1224: loss 1.8229, time 15770.03ms, mfu 2.57%
iter 1225: loss 2.2512, time 15764.41ms, mfu 2.57%
iter 1226: loss 1.7006, time 15764.74ms, mfu 2.57%
iter 1227: loss 2.0131, time 15763.37ms, mfu 2.57%
iter 1228: loss 1.9551, time 15778.48ms, mfu 2.57%
iter 1229: loss 2.0289, time 15773.95ms, mfu 2.57%
iter 1230: loss 1.7550, time 15769.74ms, mfu 2.57%
iter 1231: loss 1.8935, time 15772.88ms, mfu 2.57%
iter 1232: loss 2.0062, time 15773.99ms, mfu 2.57%
iter 1233: loss 2.0517, time 15779.46ms, mfu 2.57%
iter 1234: loss 2.1836, time 15769.61ms, mfu 2.57%
iter 1235: loss 1.9093, time 15772.12ms, mfu 2.57%
iter 1236: loss 2.5555, time 15769.57ms, mfu 2.57%
iter 1237: loss 1.9639, time 15772.29ms, mfu 2.57%
iter 1238: loss 1.9596, time 15784.00ms, mfu 2.57%
iter 1239: loss 1.8389, time 15786.29ms, mfu 2.57%
iter 1240: loss 2.0330, time 15771.17ms, mfu 2.57%
iter 1241: loss 2.0702, time 16099.55ms, mfu 2.57%
iter 1242: loss 1.8632, time 15776.03ms, mfu 2.57%
iter 1243: loss 2.3545, time 15781.47ms, mfu 2.57%
iter 1244: loss 1.8433, time 15782.04ms, mfu 2.57%
iter 1245: loss 1.7864, time 15792.01ms, mfu 2.57%
iter 1246: loss 2.0292, time 15775.63ms, mfu 2.57%
iter 1247: loss 1.9286, time 15775.55ms, mfu 2.57%
iter 1248: loss 1.6001, time 15771.97ms, mfu 2.57%
iter 1249: loss 1.7347, time 15776.39ms, mfu 2.57%
iter 1250: loss 2.1960, time 15780.96ms, mfu 2.57%
iter 1251: loss 1.6266, time 15775.05ms, mfu 2.57%
iter 1252: loss 1.7542, time 15769.63ms, mfu 2.57%
iter 1253: loss 1.8862, time 15783.82ms, mfu 2.57%
iter 1254: loss 2.1327, time 15780.49ms, mfu 2.57%
iter 1255: loss 2.4162, time 15780.61ms, mfu 2.57%
iter 1256: loss 1.8615, time 15794.90ms, mfu 2.57%
iter 1257: loss 2.0074, time 15758.54ms, mfu 2.57%
iter 1258: loss 1.8397, time 15763.28ms, mfu 2.57%
iter 1259: loss 1.8150, time 15774.84ms, mfu 2.57%
iter 1260: loss 2.1547, time 15772.57ms, mfu 2.57%
iter 1261: loss 2.2813, time 16120.46ms, mfu 2.57%
iter 1262: loss 2.3838, time 15768.35ms, mfu 2.57%
iter 1263: loss 1.8462, time 15765.12ms, mfu 2.57%
iter 1264: loss 1.8523, time 15784.72ms, mfu 2.57%
iter 1265: loss 2.1258, time 15770.92ms, mfu 2.57%
iter 1266: loss 2.2104, time 15773.60ms, mfu 2.57%
iter 1267: loss 2.0958, time 15785.71ms, mfu 2.57%
iter 1268: loss 2.3315, time 15777.74ms, mfu 2.57%
iter 1269: loss 2.1111, time 15779.58ms, mfu 2.57%
iter 1270: loss 1.9112, time 15784.34ms, mfu 2.57%
iter 1271: loss 1.8457, time 15770.17ms, mfu 2.57%
iter 1272: loss 2.1813, time 15784.84ms, mfu 2.57%
iter 1273: loss 1.8843, time 15794.82ms, mfu 2.57%
iter 1274: loss 2.0951, time 15776.20ms, mfu 2.57%
iter 1275: loss 1.7946, time 15783.47ms, mfu 2.57%
iter 1276: loss 2.0680, time 15783.95ms, mfu 2.57%
iter 1277: loss 1.7926, time 15775.79ms, mfu 2.57%
iter 1278: loss 2.0990, time 15764.96ms, mfu 2.57%
iter 1279: loss 2.1877, time 15773.78ms, mfu 2.57%
iter 1280: loss 1.4997, time 15763.84ms, mfu 2.57%
iter 1281: loss 1.6356, time 16069.21ms, mfu 2.57%
iter 1282: loss 2.1385, time 15770.08ms, mfu 2.57%
iter 1283: loss 1.5910, time 15757.74ms, mfu 2.57%
iter 1284: loss 2.1389, time 15772.30ms, mfu 2.57%
iter 1285: loss 2.0292, time 15775.38ms, mfu 2.57%
iter 1286: loss 1.6908, time 15761.19ms, mfu 2.57%
iter 1287: loss 2.0956, time 15759.38ms, mfu 2.57%
iter 1288: loss 1.7991, time 15775.85ms, mfu 2.57%
iter 1289: loss 1.5124, time 15779.52ms, mfu 2.57%
iter 1290: loss 2.0179, time 15768.35ms, mfu 2.57%
iter 1291: loss 2.4950, time 15775.33ms, mfu 2.57%
iter 1292: loss 1.9284, time 15763.75ms, mfu 2.57%
iter 1293: loss 1.6809, time 15782.49ms, mfu 2.57%
iter 1294: loss 1.8331, time 15783.07ms, mfu 2.57%
iter 1295: loss 2.0401, time 15773.26ms, mfu 2.57%
iter 1296: loss 1.7915, time 15751.70ms, mfu 2.57%
iter 1297: loss 2.0082, time 15768.21ms, mfu 2.58%
iter 1298: loss 2.0283, time 15798.23ms, mfu 2.57%
iter 1299: loss 2.3189, time 15781.77ms, mfu 2.57%
iter 1300: loss 2.0114, time 15758.56ms, mfu 2.58%
iter 1301: loss 2.2665, time 16030.96ms, mfu 2.57%
iter 1302: loss 1.7582, time 15784.60ms, mfu 2.57%
iter 1303: loss 1.9668, time 15773.68ms, mfu 2.57%
iter 1304: loss 2.0183, time 15774.64ms, mfu 2.57%
iter 1305: loss 2.0292, time 15766.27ms, mfu 2.57%
iter 1306: loss 1.9212, time 15785.79ms, mfu 2.57%
iter 1307: loss 2.0745, time 15774.12ms, mfu 2.57%
iter 1308: loss 2.2478, time 15792.42ms, mfu 2.57%
iter 1309: loss 1.9061, time 15767.58ms, mfu 2.57%
iter 1310: loss 1.7365, time 15780.06ms, mfu 2.57%
iter 1311: loss 2.1661, time 15767.60ms, mfu 2.57%
iter 1312: loss 2.0411, time 15776.21ms, mfu 2.57%
iter 1313: loss 1.7853, time 15772.72ms, mfu 2.57%
iter 1314: loss 1.7941, time 15770.00ms, mfu 2.57%
iter 1315: loss 2.2009, time 15780.97ms, mfu 2.57%
iter 1316: loss 2.3113, time 15768.58ms, mfu 2.57%
iter 1317: loss 1.9028, time 15777.49ms, mfu 2.57%
iter 1318: loss 1.7433, time 15763.76ms, mfu 2.57%
iter 1319: loss 1.7431, time 15768.51ms, mfu 2.58%
iter 1320: loss 1.8226, time 15762.53ms, mfu 2.58%
iter 1321: loss 2.0279, time 16112.98ms, mfu 2.57%
iter 1322: loss 1.9977, time 15752.87ms, mfu 2.57%
iter 1323: loss 1.7740, time 15789.35ms, mfu 2.57%
iter 1324: loss 1.8628, time 15771.53ms, mfu 2.57%
iter 1325: loss 1.9168, time 15773.93ms, mfu 2.57%
iter 1326: loss 2.1415, time 15767.72ms, mfu 2.57%
iter 1327: loss 2.0156, time 15773.89ms, mfu 2.57%
iter 1328: loss 2.5071, time 15796.65ms, mfu 2.57%
iter 1329: loss 1.7428, time 15889.28ms, mfu 2.57%
iter 1330: loss 1.6962, time 15974.57ms, mfu 2.57%
iter 1331: loss 1.8132, time 15960.85ms, mfu 2.57%
iter 1332: loss 1.8643, time 15930.29ms, mfu 2.56%
iter 1333: loss 2.3019, time 15905.01ms, mfu 2.56%
iter 1334: loss 1.9015, time 15903.77ms, mfu 2.56%
iter 1335: loss 1.8027, time 15920.06ms, mfu 2.56%
iter 1336: loss 2.0391, time 15933.97ms, mfu 2.56%
iter 1337: loss 2.1382, time 15965.97ms, mfu 2.56%
iter 1338: loss 2.2791, time 15964.78ms, mfu 2.56%
iter 1339: loss 1.6880, time 15946.31ms, mfu 2.56%
iter 1340: loss 2.0567, time 15935.18ms, mfu 2.56%
iter 1341: loss 2.1500, time 16316.79ms, mfu 2.55%
iter 1342: loss 1.8703, time 15906.76ms, mfu 2.55%
iter 1343: loss 2.1620, time 15906.89ms, mfu 2.55%
iter 1344: loss 1.6563, time 15909.81ms, mfu 2.55%
iter 1345: loss 1.9624, time 15914.50ms, mfu 2.55%
iter 1346: loss 1.6964, time 15913.55ms, mfu 2.55%
iter 1347: loss 2.0077, time 15901.13ms, mfu 2.55%
iter 1348: loss 1.7683, time 15903.48ms, mfu 2.55%
iter 1349: loss 2.3588, time 15909.40ms, mfu 2.55%
iter 1350: loss 2.1513, time 15904.32ms, mfu 2.55%
iter 1351: loss 1.7792, time 15892.82ms, mfu 2.55%
iter 1352: loss 2.2574, time 15906.29ms, mfu 2.55%
iter 1353: loss 1.9371, time 15908.79ms, mfu 2.55%
iter 1354: loss 2.1396, time 15911.62ms, mfu 2.55%
iter 1355: loss 2.2304, time 15899.24ms, mfu 2.55%
iter 1356: loss 1.5315, time 15906.49ms, mfu 2.55%
iter 1357: loss 1.8347, time 15914.33ms, mfu 2.55%
iter 1358: loss 1.9759, time 15925.12ms, mfu 2.55%
iter 1359: loss 2.0032, time 15916.07ms, mfu 2.55%
iter 1360: loss 1.5955, time 15928.98ms, mfu 2.55%
iter 1361: loss 1.5934, time 16294.32ms, mfu 2.55%
iter 1362: loss 2.2455, time 15943.63ms, mfu 2.55%
iter 1363: loss 1.7632, time 15935.61ms, mfu 2.55%
iter 1364: loss 1.9172, time 15925.98ms, mfu 2.55%
iter 1365: loss 2.0376, time 15923.26ms, mfu 2.55%
iter 1366: loss 1.9724, time 15919.72ms, mfu 2.55%
iter 1367: loss 2.3970, time 15920.03ms, mfu 2.55%
iter 1368: loss 1.6786, time 15918.27ms, mfu 2.55%
iter 1369: loss 1.8987, time 15905.56ms, mfu 2.55%
iter 1370: loss 1.7732, time 15917.30ms, mfu 2.55%
iter 1371: loss 1.6480, time 15906.99ms, mfu 2.55%
iter 1372: loss 2.0814, time 15909.31ms, mfu 2.55%
iter 1373: loss 1.8224, time 15898.11ms, mfu 2.55%
iter 1374: loss 1.8849, time 15915.67ms, mfu 2.55%
iter 1375: loss 1.8932, time 15900.27ms, mfu 2.55%
iter 1376: loss 1.7097, time 15908.72ms, mfu 2.55%
iter 1377: loss 1.9430, time 15895.45ms, mfu 2.55%
iter 1378: loss 1.8418, time 15907.71ms, mfu 2.55%
iter 1379: loss 1.9613, time 15907.25ms, mfu 2.55%
iter 1380: loss 1.9449, time 15920.31ms, mfu 2.55%
iter 1381: loss 1.7127, time 16185.37ms, mfu 2.55%
iter 1382: loss 2.4146, time 15916.30ms, mfu 2.55%
iter 1383: loss 2.2618, time 15934.80ms, mfu 2.55%
iter 1384: loss 1.5586, time 15921.09ms, mfu 2.55%
iter 1385: loss 1.9421, time 15926.14ms, mfu 2.55%
iter 1386: loss 1.7724, time 15915.95ms, mfu 2.55%
iter 1387: loss 2.0780, time 15921.00ms, mfu 2.55%
iter 1388: loss 1.8384, time 15928.70ms, mfu 2.55%
iter 1389: loss 2.1228, time 15931.13ms, mfu 2.55%
iter 1390: loss 2.0056, time 15929.82ms, mfu 2.55%
iter 1391: loss 1.9096, time 15933.55ms, mfu 2.55%
iter 1392: loss 1.9180, time 15933.26ms, mfu 2.55%
iter 1393: loss 1.9016, time 15929.02ms, mfu 2.55%
iter 1394: loss 1.9152, time 15938.91ms, mfu 2.55%
iter 1395: loss 2.1955, time 15901.04ms, mfu 2.55%
iter 1396: loss 1.9555, time 15870.72ms, mfu 2.55%
iter 1397: loss 1.7741, time 15835.68ms, mfu 2.55%
iter 1398: loss 2.0422, time 15795.43ms, mfu 2.55%
iter 1399: loss 1.5413, time 15798.54ms, mfu 2.56%
iter 1400: loss 1.8797, time 15774.50ms, mfu 2.56%
iter 1401: loss 2.2371, time 16109.43ms, mfu 2.55%
iter 1402: loss 1.8153, time 15758.07ms, mfu 2.56%
iter 1403: loss 2.0009, time 15758.24ms, mfu 2.56%
iter 1404: loss 2.0080, time 15767.57ms, mfu 2.56%
iter 1405: loss 1.9477, time 15756.96ms, mfu 2.56%
iter 1406: loss 1.6947, time 15754.92ms, mfu 2.56%
iter 1407: loss 2.0860, time 15750.82ms, mfu 2.57%
iter 1408: loss 1.7258, time 15764.58ms, mfu 2.57%
iter 1409: loss 1.6618, time 15765.62ms, mfu 2.57%
iter 1410: loss 1.9851, time 15755.83ms, mfu 2.57%
iter 1411: loss 1.6911, time 15760.55ms, mfu 2.57%
iter 1412: loss 2.0183, time 15753.60ms, mfu 2.57%
iter 1413: loss 2.0561, time 15771.63ms, mfu 2.57%
iter 1414: loss 1.7574, time 15762.31ms, mfu 2.57%
iter 1415: loss 1.6595, time 15780.81ms, mfu 2.57%
iter 1416: loss 1.8544, time 15773.05ms, mfu 2.57%
iter 1417: loss 1.9052, time 15793.69ms, mfu 2.57%
iter 1418: loss 1.7433, time 15785.64ms, mfu 2.57%
iter 1419: loss 2.3131, time 15782.89ms, mfu 2.57%
iter 1420: loss 1.8094, time 15784.05ms, mfu 2.57%
iter 1421: loss 1.8045, time 16113.64ms, mfu 2.57%
iter 1422: loss 1.9110, time 15782.43ms, mfu 2.57%
iter 1423: loss 1.9431, time 15774.65ms, mfu 2.57%
iter 1424: loss 2.2250, time 15775.59ms, mfu 2.57%
iter 1425: loss 1.5227, time 15774.00ms, mfu 2.57%
iter 1426: loss 1.8239, time 15796.49ms, mfu 2.57%
iter 1427: loss 1.9412, time 15780.08ms, mfu 2.57%
iter 1428: loss 1.8239, time 15768.41ms, mfu 2.57%
iter 1429: loss 2.1441, time 15781.02ms, mfu 2.57%
iter 1430: loss 1.8366, time 15791.95ms, mfu 2.57%
iter 1431: loss 2.2166, time 15773.03ms, mfu 2.57%
iter 1432: loss 1.7734, time 15772.57ms, mfu 2.57%
iter 1433: loss 2.0125, time 15777.84ms, mfu 2.57%
iter 1434: loss 1.8132, time 15782.20ms, mfu 2.57%
iter 1435: loss 2.0373, time 15784.76ms, mfu 2.57%
iter 1436: loss 1.8431, time 15780.96ms, mfu 2.57%
iter 1437: loss 2.0681, time 15779.75ms, mfu 2.57%
iter 1438: loss 2.0288, time 15787.67ms, mfu 2.57%
iter 1439: loss 1.9897, time 15782.04ms, mfu 2.57%
iter 1440: loss 2.0015, time 15774.46ms, mfu 2.57%
iter 1441: loss 2.0768, time 16134.58ms, mfu 2.57%
iter 1442: loss 1.9768, time 15778.47ms, mfu 2.57%
iter 1443: loss 1.8125, time 15777.59ms, mfu 2.57%
iter 1444: loss 1.8858, time 15766.40ms, mfu 2.57%
iter 1445: loss 2.5136, time 15767.44ms, mfu 2.57%
iter 1446: loss 1.9558, time 15775.43ms, mfu 2.57%
iter 1447: loss 1.6178, time 15782.27ms, mfu 2.57%
iter 1448: loss 1.6175, time 15774.19ms, mfu 2.57%
iter 1449: loss 1.8499, time 15779.13ms, mfu 2.57%
iter 1450: loss 1.9977, time 15782.42ms, mfu 2.57%
iter 1451: loss 2.0480, time 15783.39ms, mfu 2.57%
iter 1452: loss 1.7687, time 15781.60ms, mfu 2.57%
iter 1453: loss 1.4689, time 15780.87ms, mfu 2.57%
iter 1454: loss 1.5920, time 15779.72ms, mfu 2.57%
iter 1455: loss 1.5261, time 15764.48ms, mfu 2.57%
iter 1456: loss 1.9414, time 15778.95ms, mfu 2.57%
iter 1457: loss 1.9611, time 15784.18ms, mfu 2.57%
iter 1458: loss 1.7961, time 15772.28ms, mfu 2.57%
iter 1459: loss 1.8372, time 15804.10ms, mfu 2.57%
iter 1460: loss 1.9023, time 15790.62ms, mfu 2.57%
iter 1461: loss 2.0105, time 15796.68ms, mfu 2.57%
iter 1462: loss 1.8961, time 16157.04ms, mfu 2.57%
iter 1463: loss 1.7972, time 15791.64ms, mfu 2.57%
iter 1464: loss 2.1627, time 15787.83ms, mfu 2.57%
iter 1465: loss 1.9859, time 15779.80ms, mfu 2.57%
iter 1466: loss 1.7016, time 15770.05ms, mfu 2.57%
iter 1467: loss 2.0602, time 15777.61ms, mfu 2.57%
iter 1468: loss 2.1318, time 15764.38ms, mfu 2.57%
iter 1469: loss 1.6875, time 15775.77ms, mfu 2.57%
iter 1470: loss 2.0187, time 15773.41ms, mfu 2.57%
iter 1471: loss 2.0420, time 15767.90ms, mfu 2.57%
iter 1472: loss 2.0798, time 15776.06ms, mfu 2.57%
iter 1473: loss 1.8249, time 15780.97ms, mfu 2.57%
iter 1474: loss 1.9992, time 15771.82ms, mfu 2.57%
iter 1475: loss 1.5019, time 15769.34ms, mfu 2.57%
iter 1476: loss 1.8603, time 15786.61ms, mfu 2.57%
iter 1477: loss 1.9073, time 15788.80ms, mfu 2.57%
iter 1478: loss 2.0584, time 15780.72ms, mfu 2.57%
iter 1479: loss 1.9313, time 15782.36ms, mfu 2.57%
iter 1480: loss 1.7370, time 15795.65ms, mfu 2.57%
iter 1481: loss 1.6616, time 15797.50ms, mfu 2.57%
iter 1482: loss 1.8056, time 16141.42ms, mfu 2.57%
iter 1483: loss 1.7638, time 15788.15ms, mfu 2.57%
iter 1484: loss 2.1546, time 15793.43ms, mfu 2.57%
iter 1485: loss 1.7338, time 15781.46ms, mfu 2.57%
iter 1486: loss 1.6425, time 15785.16ms, mfu 2.57%
iter 1487: loss 1.5794, time 15786.03ms, mfu 2.57%
iter 1488: loss 1.7717, time 15773.89ms, mfu 2.57%
iter 1489: loss 1.6910, time 15780.88ms, mfu 2.57%
iter 1490: loss 1.5232, time 15778.28ms, mfu 2.57%
iter 1491: loss 1.6804, time 15786.48ms, mfu 2.57%
iter 1492: loss 1.8551, time 15780.15ms, mfu 2.57%
iter 1493: loss 2.0491, time 15778.21ms, mfu 2.57%
iter 1494: loss 1.5774, time 15788.95ms, mfu 2.57%
iter 1495: loss 1.6586, time 15779.81ms, mfu 2.57%
iter 1496: loss 2.2795, time 15790.68ms, mfu 2.57%
iter 1497: loss 1.8694, time 15776.47ms, mfu 2.57%
iter 1498: loss 2.0181, time 15781.30ms, mfu 2.57%
iter 1499: loss 2.0309, time 15792.65ms, mfu 2.57%
iter 1500: loss 1.8838, time 15799.71ms, mfu 2.57%
iter 1501: loss 1.6761, time 15795.08ms, mfu 2.57%
iter 1502: loss 1.7525, time 16143.61ms, mfu 2.57%
iter 1503: loss 1.9547, time 15791.01ms, mfu 2.57%
iter 1504: loss 1.6884, time 15792.93ms, mfu 2.57%
iter 1505: loss 1.9141, time 15788.94ms, mfu 2.57%
iter 1506: loss 2.1211, time 15780.02ms, mfu 2.57%
iter 1507: loss 1.9482, time 15795.64ms, mfu 2.57%
iter 1508: loss 1.6617, time 15803.81ms, mfu 2.57%
iter 1509: loss 2.2855, time 15792.19ms, mfu 2.57%
iter 1510: loss 1.7107, time 15790.26ms, mfu 2.57%
iter 1511: loss 2.0500, time 15788.53ms, mfu 2.57%
iter 1512: loss 1.8106, time 15787.37ms, mfu 2.57%
iter 1513: loss 1.7404, time 15790.60ms, mfu 2.57%
iter 1514: loss 1.9674, time 15787.64ms, mfu 2.57%
iter 1515: loss 2.0110, time 15797.40ms, mfu 2.57%
iter 1516: loss 1.9275, time 15789.07ms, mfu 2.57%
iter 1517: loss 1.8100, time 15781.75ms, mfu 2.57%
iter 1518: loss 2.0615, time 15787.17ms, mfu 2.57%
iter 1519: loss 2.1564, time 15782.59ms, mfu 2.57%
iter 1520: loss 1.8922, time 15791.75ms, mfu 2.57%
iter 1521: loss 1.7134, time 15791.82ms, mfu 2.57%
iter 1522: loss 1.8307, time 16073.71ms, mfu 2.57%
iter 1523: loss 1.5005, time 15788.77ms, mfu 2.57%
iter 1524: loss 1.7430, time 15788.16ms, mfu 2.57%
iter 1525: loss 1.8533, time 15795.29ms, mfu 2.57%
iter 1526: loss 1.4030, time 15792.47ms, mfu 2.57%
iter 1527: loss 2.1639, time 15783.72ms, mfu 2.57%
iter 1528: loss 1.5347, time 15790.06ms, mfu 2.57%
iter 1529: loss 1.8344, time 15772.23ms, mfu 2.57%
iter 1530: loss 2.2323, time 15786.95ms, mfu 2.57%
iter 1531: loss 1.9273, time 15793.58ms, mfu 2.57%
iter 1532: loss 1.9488, time 15792.44ms, mfu 2.57%
iter 1533: loss 1.9533, time 15789.02ms, mfu 2.57%
iter 1534: loss 2.0039, time 15797.52ms, mfu 2.57%
iter 1535: loss 1.9747, time 15789.14ms, mfu 2.57%
iter 1536: loss 2.0585, time 15782.12ms, mfu 2.57%
iter 1537: loss 1.9811, time 15790.48ms, mfu 2.57%
iter 1538: loss 1.5967, time 15780.57ms, mfu 2.57%
iter 1539: loss 1.9964, time 15782.30ms, mfu 2.57%
iter 1540: loss 1.6812, time 15791.93ms, mfu 2.57%
iter 1541: loss 1.9755, time 15789.87ms, mfu 2.57%
iter 1542: loss 1.6822, time 16075.31ms, mfu 2.57%
iter 1543: loss 1.7964, time 15787.42ms, mfu 2.57%
iter 1544: loss 2.0982, time 15790.22ms, mfu 2.57%
iter 1545: loss 1.8766, time 15797.55ms, mfu 2.57%
iter 1546: loss 1.9487, time 15793.74ms, mfu 2.57%
iter 1547: loss 2.0322, time 15787.29ms, mfu 2.57%
iter 1548: loss 1.8785, time 15782.11ms, mfu 2.57%
iter 1549: loss 1.6156, time 15782.21ms, mfu 2.57%
iter 1550: loss 1.7803, time 15785.59ms, mfu 2.57%
iter 1551: loss 1.8348, time 15779.24ms, mfu 2.57%
iter 1552: loss 2.1416, time 15786.13ms, mfu 2.57%
iter 1553: loss 1.7547, time 15781.82ms, mfu 2.57%
iter 1554: loss 1.8863, time 15785.52ms, mfu 2.57%
iter 1555: loss 1.4423, time 15780.08ms, mfu 2.57%
iter 1556: loss 1.7758, time 15773.16ms, mfu 2.57%
iter 1557: loss 2.0436, time 15796.33ms, mfu 2.57%
iter 1558: loss 1.6908, time 15903.73ms, mfu 2.57%
iter 1559: loss 2.0069, time 15951.74ms, mfu 2.57%
iter 1560: loss 2.2317, time 15915.78ms, mfu 2.57%
iter 1561: loss 1.8768, time 15888.32ms, mfu 2.57%
iter 1562: loss 1.7994, time 16204.47ms, mfu 2.56%
iter 1563: loss 1.6339, time 15878.70ms, mfu 2.56%
iter 1564: loss 2.1040, time 15883.48ms, mfu 2.56%
iter 1565: loss 2.0283, time 15904.23ms, mfu 2.56%
iter 1566: loss 1.7904, time 15924.30ms, mfu 2.56%
iter 1567: loss 2.0150, time 15929.25ms, mfu 2.56%
iter 1568: loss 1.8069, time 15922.09ms, mfu 2.56%
iter 1569: loss 1.6109, time 15916.16ms, mfu 2.56%
iter 1570: loss 1.6277, time 15914.53ms, mfu 2.56%
iter 1571: loss 1.8875, time 15910.17ms, mfu 2.56%
iter 1572: loss 1.6858, time 15897.63ms, mfu 2.56%
iter 1573: loss 2.0734, time 15917.21ms, mfu 2.56%
iter 1574: loss 1.8202, time 15881.75ms, mfu 2.56%
iter 1575: loss 1.8268, time 15902.26ms, mfu 2.56%
iter 1576: loss 1.5291, time 15884.52ms, mfu 2.56%
iter 1577: loss 1.8864, time 15887.51ms, mfu 2.56%
iter 1578: loss 1.8156, time 15893.20ms, mfu 2.56%
iter 1579: loss 1.5513, time 15888.18ms, mfu 2.56%
iter 1580: loss 1.9806, time 15897.50ms, mfu 2.56%
iter 1581: loss 1.5019, time 15887.80ms, mfu 2.56%
iter 1582: loss 2.3285, time 16237.70ms, mfu 2.55%
iter 1583: loss 1.8715, time 15878.22ms, mfu 2.55%
iter 1584: loss 2.0251, time 15884.05ms, mfu 2.55%
iter 1585: loss 1.8141, time 15894.89ms, mfu 2.55%
iter 1586: loss 1.8440, time 15880.98ms, mfu 2.55%
iter 1587: loss 1.8035, time 15889.72ms, mfu 2.55%
iter 1588: loss 1.5534, time 15882.13ms, mfu 2.55%
iter 1589: loss 1.6958, time 15895.24ms, mfu 2.55%
iter 1590: loss 2.0792, time 15908.79ms, mfu 2.55%
iter 1591: loss 1.6145, time 15923.19ms, mfu 2.55%
iter 1592: loss 1.9323, time 15917.38ms, mfu 2.55%
iter 1593: loss 1.6851, time 15905.78ms, mfu 2.55%
iter 1594: loss 1.6433, time 15902.13ms, mfu 2.55%
iter 1595: loss 2.1890, time 15904.37ms, mfu 2.55%
iter 1596: loss 1.8891, time 15898.10ms, mfu 2.55%
iter 1597: loss 1.7923, time 15901.95ms, mfu 2.55%
iter 1598: loss 2.0047, time 15894.76ms, mfu 2.55%
iter 1599: loss 1.8642, time 15909.82ms, mfu 2.55%
iter 1600: loss 2.0749, time 15898.84ms, mfu 2.55%
iter 1601: loss 1.9594, time 15899.41ms, mfu 2.55%
iter 1602: loss 1.6151, time 16130.73ms, mfu 2.55%
iter 1603: loss 1.7347, time 15895.68ms, mfu 2.55%
iter 1604: loss 2.0070, time 15890.35ms, mfu 2.55%
iter 1605: loss 2.0974, time 15889.99ms, mfu 2.55%
iter 1606: loss 1.8863, time 15882.60ms, mfu 2.55%
iter 1607: loss 2.2181, time 15890.97ms, mfu 2.55%
iter 1608: loss 1.8027, time 15890.40ms, mfu 2.55%
iter 1609: loss 1.6876, time 15900.46ms, mfu 2.55%
iter 1610: loss 1.9703, time 15886.01ms, mfu 2.55%
iter 1611: loss 1.4521, time 15895.13ms, mfu 2.55%
iter 1612: loss 1.6297, time 15902.43ms, mfu 2.55%
iter 1613: loss 2.0117, time 15892.53ms, mfu 2.55%
iter 1614: loss 2.1012, time 15897.77ms, mfu 2.55%
iter 1615: loss 1.8906, time 15905.07ms, mfu 2.55%
iter 1616: loss 1.8641, time 15902.53ms, mfu 2.55%
iter 1617: loss 1.5754, time 15903.58ms, mfu 2.55%
iter 1618: loss 2.0836, time 15900.95ms, mfu 2.55%
iter 1619: loss 1.9777, time 15908.99ms, mfu 2.55%
iter 1620: loss 1.8335, time 15920.85ms, mfu 2.55%
iter 1621: loss 2.2246, time 15913.85ms, mfu 2.55%
iter 1622: loss 2.2430, time 16191.06ms, mfu 2.55%
iter 1623: loss 2.1753, time 15911.52ms, mfu 2.55%
iter 1624: loss 2.1419, time 15910.84ms, mfu 2.55%
iter 1625: loss 1.9180, time 15890.52ms, mfu 2.55%
iter 1626: loss 1.6449, time 15882.94ms, mfu 2.55%
iter 1627: loss 1.8856, time 15866.94ms, mfu 2.55%
iter 1628: loss 1.9768, time 15851.46ms, mfu 2.55%
iter 1629: loss 1.6844, time 15826.63ms, mfu 2.55%
iter 1630: loss 1.6323, time 15820.27ms, mfu 2.56%
iter 1631: loss 2.0533, time 15822.57ms, mfu 2.56%
iter 1632: loss 1.8984, time 15802.47ms, mfu 2.56%
iter 1633: loss 1.9832, time 15801.95ms, mfu 2.56%
iter 1634: loss 1.8963, time 15794.49ms, mfu 2.56%
iter 1635: loss 1.8318, time 15815.96ms, mfu 2.56%
iter 1636: loss 1.5516, time 15788.90ms, mfu 2.56%
iter 1637: loss 1.6288, time 15793.60ms, mfu 2.56%
iter 1638: loss 1.9833, time 15801.43ms, mfu 2.56%
iter 1639: loss 1.8337, time 15802.23ms, mfu 2.57%
iter 1640: loss 1.8858, time 15798.30ms, mfu 2.57%
iter 1641: loss 1.8881, time 15798.33ms, mfu 2.57%
iter 1642: loss 1.4774, time 16047.75ms, mfu 2.56%
iter 1643: loss 1.7936, time 15793.55ms, mfu 2.56%
iter 1644: loss 1.7493, time 15785.49ms, mfu 2.56%
iter 1645: loss 1.7265, time 15777.76ms, mfu 2.57%
iter 1646: loss 2.0495, time 15776.82ms, mfu 2.57%
iter 1647: loss 1.8651, time 15796.10ms, mfu 2.57%
iter 1648: loss 2.1174, time 15785.16ms, mfu 2.57%
iter 1649: loss 2.4302, time 15774.11ms, mfu 2.57%
iter 1650: loss 1.9371, time 15775.17ms, mfu 2.57%
iter 1651: loss 1.7874, time 15786.06ms, mfu 2.57%
iter 1652: loss 1.6668, time 15775.99ms, mfu 2.57%
iter 1653: loss 1.6779, time 15777.56ms, mfu 2.57%
iter 1654: loss 1.7398, time 15774.23ms, mfu 2.57%
iter 1655: loss 1.7189, time 15791.65ms, mfu 2.57%
iter 1656: loss 1.7483, time 15786.17ms, mfu 2.57%
iter 1657: loss 2.1391, time 15781.69ms, mfu 2.57%
iter 1658: loss 1.5932, time 15777.45ms, mfu 2.57%
iter 1659: loss 1.7624, time 15779.71ms, mfu 2.57%
iter 1660: loss 1.5138, time 15783.46ms, mfu 2.57%
iter 1661: loss 1.8093, time 15793.05ms, mfu 2.57%
iter 1662: loss 2.0861, time 16052.99ms, mfu 2.57%
iter 1663: loss 1.8050, time 15776.04ms, mfu 2.57%
iter 1664: loss 2.2851, time 15793.10ms, mfu 2.57%
iter 1665: loss 1.9119, time 15784.92ms, mfu 2.57%
iter 1666: loss 1.4827, time 15784.04ms, mfu 2.57%
iter 1667: loss 1.8128, time 15786.19ms, mfu 2.57%
iter 1668: loss 1.7085, time 15784.41ms, mfu 2.57%
iter 1669: loss 1.8458, time 15794.18ms, mfu 2.57%
iter 1670: loss 1.9662, time 15790.38ms, mfu 2.57%
iter 1671: loss 1.9449, time 15786.07ms, mfu 2.57%
iter 1672: loss 1.9604, time 15777.20ms, mfu 2.57%
iter 1673: loss 1.5326, time 15797.90ms, mfu 2.57%
iter 1674: loss 1.7987, time 15780.83ms, mfu 2.57%
iter 1675: loss 1.8529, time 15795.79ms, mfu 2.57%
iter 1676: loss 1.7318, time 15799.61ms, mfu 2.57%
iter 1677: loss 2.0619, time 15788.61ms, mfu 2.57%
iter 1678: loss 1.8874, time 15774.29ms, mfu 2.57%
iter 1679: loss 1.9182, time 15781.51ms, mfu 2.57%
iter 1680: loss 1.9726, time 15786.87ms, mfu 2.57%
iter 1681: loss 2.0642, time 15788.20ms, mfu 2.57%
iter 1682: loss 1.8214, time 16154.15ms, mfu 2.57%
iter 1683: loss 1.8059, time 15802.32ms, mfu 2.57%
iter 1684: loss 2.0718, time 15792.46ms, mfu 2.57%
iter 1685: loss 1.7922, time 15792.65ms, mfu 2.57%
iter 1686: loss 2.1280, time 15799.36ms, mfu 2.57%
iter 1687: loss 1.9845, time 15785.14ms, mfu 2.57%
iter 1688: loss 2.0841, time 15787.54ms, mfu 2.57%
iter 1689: loss 1.8557, time 15775.23ms, mfu 2.57%
iter 1690: loss 1.9997, time 15760.62ms, mfu 2.57%
iter 1691: loss 2.1671, time 15767.75ms, mfu 2.57%
iter 1692: loss 1.7102, time 15765.79ms, mfu 2.57%
iter 1693: loss 1.7521, time 15753.15ms, mfu 2.57%
iter 1694: loss 1.9086, time 15772.77ms, mfu 2.57%
iter 1695: loss 2.0004, time 15759.93ms, mfu 2.57%
iter 1696: loss 2.0249, time 15755.28ms, mfu 2.57%
iter 1697: loss 1.6515, time 15752.33ms, mfu 2.57%
iter 1698: loss 1.9788, time 15755.55ms, mfu 2.57%
iter 1699: loss 2.2396, time 15751.55ms, mfu 2.58%
iter 1700: loss 1.9517, time 15749.51ms, mfu 2.58%
iter 1701: loss 1.5402, time 15758.29ms, mfu 2.58%
iter 1702: loss 1.7067, time 16024.64ms, mfu 2.57%
iter 1703: loss 1.5766, time 15769.56ms, mfu 2.57%
iter 1704: loss 2.0527, time 15805.08ms, mfu 2.57%
iter 1705: loss 1.7288, time 15742.35ms, mfu 2.57%
iter 1706: loss 1.7968, time 15748.15ms, mfu 2.57%
iter 1707: loss 2.0287, time 15741.90ms, mfu 2.57%
iter 1708: loss 1.8942, time 15757.52ms, mfu 2.57%
iter 1709: loss 1.5662, time 15750.35ms, mfu 2.58%
iter 1710: loss 1.9769, time 15750.06ms, mfu 2.58%
iter 1711: loss 1.7878, time 15747.30ms, mfu 2.58%
iter 1712: loss 1.9953, time 15757.63ms, mfu 2.58%
iter 1713: loss 2.0413, time 15759.00ms, mfu 2.58%
iter 1714: loss 1.9532, time 15756.03ms, mfu 2.58%
iter 1715: loss 1.6457, time 15754.39ms, mfu 2.58%
iter 1716: loss 1.7273, time 15751.91ms, mfu 2.58%
iter 1717: loss 1.9459, time 15753.64ms, mfu 2.58%
iter 1718: loss 1.7258, time 15757.08ms, mfu 2.58%
iter 1719: loss 1.5074, time 15752.36ms, mfu 2.58%
iter 1720: loss 2.0684, time 15757.34ms, mfu 2.58%
iter 1721: loss 2.1336, time 15762.67ms, mfu 2.58%
iter 1722: loss 2.0781, time 15992.84ms, mfu 2.57%
iter 1723: loss 2.1730, time 15749.09ms, mfu 2.57%
iter 1724: loss 1.8046, time 15750.50ms, mfu 2.57%
iter 1725: loss 1.8395, time 15758.97ms, mfu 2.58%
iter 1726: loss 1.5110, time 15755.97ms, mfu 2.58%
iter 1727: loss 1.8237, time 15756.10ms, mfu 2.58%
iter 1728: loss 1.9320, time 15750.87ms, mfu 2.58%
iter 1729: loss 2.2232, time 15763.66ms, mfu 2.58%
iter 1730: loss 1.8891, time 15768.73ms, mfu 2.58%
iter 1731: loss 1.4998, time 15753.06ms, mfu 2.58%
iter 1732: loss 1.6709, time 15741.81ms, mfu 2.58%
iter 1733: loss 1.5448, time 15761.07ms, mfu 2.58%
iter 1734: loss 1.5811, time 15749.71ms, mfu 2.58%
iter 1735: loss 1.6910, time 15756.44ms, mfu 2.58%
iter 1736: loss 1.6211, time 15756.15ms, mfu 2.58%
iter 1737: loss 1.9204, time 15762.38ms, mfu 2.58%
iter 1738: loss 2.0685, time 15758.14ms, mfu 2.58%
iter 1739: loss 1.9543, time 15751.07ms, mfu 2.58%
iter 1740: loss 1.6668, time 15748.12ms, mfu 2.58%
iter 1741: loss 1.8378, time 15769.95ms, mfu 2.58%
iter 1742: loss 1.9700, time 16120.71ms, mfu 2.57%
iter 1743: loss 1.6926, time 15760.10ms, mfu 2.57%
iter 1744: loss 1.5128, time 15756.63ms, mfu 2.57%
iter 1745: loss 1.7544, time 15757.64ms, mfu 2.57%
iter 1746: loss 1.9674, time 15767.38ms, mfu 2.57%
iter 1747: loss 1.9328, time 15751.37ms, mfu 2.57%
iter 1748: loss 1.9696, time 15751.69ms, mfu 2.58%
iter 1749: loss 1.9393, time 15759.67ms, mfu 2.58%
iter 1750: loss 1.5051, time 15773.42ms, mfu 2.58%
iter 1751: loss 1.6274, time 15756.89ms, mfu 2.58%
iter 1752: loss 2.1265, time 15763.61ms, mfu 2.58%
iter 1753: loss 1.6701, time 15765.93ms, mfu 2.58%
iter 1754: loss 1.6312, time 15766.35ms, mfu 2.58%
iter 1755: loss 1.7187, time 15766.00ms, mfu 2.58%
iter 1756: loss 1.9255, time 15752.71ms, mfu 2.58%
iter 1757: loss 1.5150, time 15756.09ms, mfu 2.58%
iter 1758: loss 2.1522, time 15775.09ms, mfu 2.58%
iter 1759: loss 1.4483, time 15754.50ms, mfu 2.58%
iter 1760: loss 1.7792, time 15754.36ms, mfu 2.58%
iter 1761: loss 1.5246, time 15757.48ms, mfu 2.58%
iter 1762: loss 1.5539, time 15984.67ms, mfu 2.57%
iter 1763: loss 1.8584, time 15767.14ms, mfu 2.57%
iter 1764: loss 1.4452, time 15756.66ms, mfu 2.57%
iter 1765: loss 1.8157, time 15749.51ms, mfu 2.57%
iter 1766: loss 1.8247, time 15764.11ms, mfu 2.58%
iter 1767: loss 1.7340, time 15769.57ms, mfu 2.58%
iter 1768: loss 1.9309, time 15754.03ms, mfu 2.58%
iter 1769: loss 2.0125, time 15762.05ms, mfu 2.58%
iter 1770: loss 1.7890, time 15770.52ms, mfu 2.58%
iter 1771: loss 1.6720, time 15768.96ms, mfu 2.58%
iter 1772: loss 1.8283, time 15770.06ms, mfu 2.58%
iter 1773: loss 1.8139, time 15758.86ms, mfu 2.58%
iter 1774: loss 1.8512, time 15746.02ms, mfu 2.58%
iter 1775: loss 1.9119, time 15771.62ms, mfu 2.58%
iter 1776: loss 1.2712, time 15758.59ms, mfu 2.58%
iter 1777: loss 2.0879, time 15757.78ms, mfu 2.58%
iter 1778: loss 1.5996, time 15756.45ms, mfu 2.58%
iter 1779: loss 1.6318, time 15756.38ms, mfu 2.58%
iter 1780: loss 1.7972, time 15770.38ms, mfu 2.58%
iter 1781: loss 1.9510, time 15753.58ms, mfu 2.58%
iter 1782: loss 1.7035, time 16024.18ms, mfu 2.57%
iter 1783: loss 1.9655, time 15759.29ms, mfu 2.57%
iter 1784: loss 1.7994, time 15787.06ms, mfu 2.57%
iter 1785: loss 2.0944, time 15887.08ms, mfu 2.57%
iter 1786: loss 1.2300, time 15936.78ms, mfu 2.57%
iter 1787: loss 1.7320, time 15931.35ms, mfu 2.57%
iter 1788: loss 1.6881, time 15885.25ms, mfu 2.57%
iter 1789: loss 1.8906, time 15877.52ms, mfu 2.57%
iter 1790: loss 2.1616, time 15860.31ms, mfu 2.57%
iter 1791: loss 1.6541, time 15890.83ms, mfu 2.56%
iter 1792: loss 1.5091, time 15908.45ms, mfu 2.56%
iter 1793: loss 1.8687, time 15916.92ms, mfu 2.56%
iter 1794: loss 1.5763, time 15919.86ms, mfu 2.56%
iter 1795: loss 1.3614, time 15937.71ms, mfu 2.56%
iter 1796: loss 1.9453, time 15937.26ms, mfu 2.56%
""")
# Convert the matched loss strings to floats and plot this run as a dotted green line.
loss_numbers = [float(value) for value in loss_values]
plt.plot(loss_numbers, color='green', linewidth=1, linestyle='dotted')

loss_values = re.findall(loss_pattern,
"""
compiling the model... (takes a ~minute)
step 0: train loss 10.9842, val loss 10.9807
iter 0: loss 10.9576, time 78124.89ms, mfu -100.00%
iter 1: loss 9.8106, time 8746.02ms, mfu -100.00%
iter 2: loss 9.0550, time 8849.25ms, mfu -100.00%
iter 3: loss 7.8765, time 8813.99ms, mfu -100.00%
iter 4: loss 7.8633, time 8645.71ms, mfu -100.00%
iter 5: loss 7.6910, time 8448.40ms, mfu 4.96%
iter 6: loss 6.5204, time 8320.73ms, mfu 4.97%
iter 7: loss 6.6445, time 8265.82ms, mfu 4.98%
iter 8: loss 6.4419, time 8270.29ms, mfu 4.99%
iter 9: loss 5.4835, time 8314.72ms, mfu 4.99%
iter 10: loss 6.2119, time 8392.97ms, mfu 4.99%
iter 11: loss 6.4842, time 8439.19ms, mfu 4.99%
iter 12: loss 6.1765, time 8486.24ms, mfu 4.99%
iter 13: loss 5.6887, time 8500.86ms, mfu 4.98%
iter 14: loss 5.1491, time 8500.83ms, mfu 4.97%
iter 15: loss 6.0895, time 8456.98ms, mfu 4.97%
iter 16: loss 5.7107, time 8430.47ms, mfu 4.97%
iter 17: loss 6.0215, time 8421.47ms, mfu 4.97%
iter 18: loss 5.7130, time 8422.12ms, mfu 4.97%
iter 19: loss 5.2922, time 8451.42ms, mfu 4.97%
iter 20: loss 5.7435, time 8447.25ms, mfu 4.97%
iter 21: loss 5.6644, time 8447.10ms, mfu 4.97%
iter 22: loss 5.9542, time 8457.44ms, mfu 4.97%
iter 23: loss 5.5786, time 8460.72ms, mfu 4.97%
iter 24: loss 5.2780, time 8469.45ms, mfu 4.97%
iter 25: loss 5.9572, time 8481.77ms, mfu 4.96%
iter 26: loss 5.3107, time 8502.11ms, mfu 4.96%
iter 27: loss 5.7009, time 8505.72ms, mfu 4.96%
iter 28: loss 5.6113, time 8535.52ms, mfu 4.95%
iter 29: loss 5.7749, time 8518.12ms, mfu 4.95%
iter 30: loss 5.2756, time 8525.12ms, mfu 4.95%
iter 31: loss 5.7443, time 8522.72ms, mfu 4.94%
iter 32: loss 5.6096, time 8519.40ms, mfu 4.94%
iter 33: loss 5.5277, time 8546.86ms, mfu 4.94%
iter 34: loss 4.8778, time 8545.56ms, mfu 4.93%
iter 35: loss 5.5249, time 8556.87ms, mfu 4.93%
iter 36: loss 5.3246, time 8568.24ms, mfu 4.93%
iter 37: loss 5.3226, time 8582.36ms, mfu 4.92%
iter 38: loss 5.2742, time 8573.51ms, mfu 4.92%
iter 39: loss 4.8037, time 8566.22ms, mfu 4.92%
iter 40: loss 5.3470, time 8581.15ms, mfu 4.91%
iter 41: loss 5.1823, time 8578.95ms, mfu 4.91%
iter 42: loss 5.2806, time 8599.65ms, mfu 4.91%
iter 43: loss 5.0674, time 8606.43ms, mfu 4.90%
iter 44: loss 4.9301, time 8576.05ms, mfu 4.90%
iter 45: loss 5.4071, time 8573.92ms, mfu 4.90%
iter 46: loss 5.2126, time 8561.43ms, mfu 4.90%
iter 47: loss 4.6139, time 8566.13ms, mfu 4.90%
iter 48: loss 4.4989, time 8561.59ms, mfu 4.90%
iter 49: loss 5.0535, time 8574.99ms, mfu 4.90%
iter 50: loss 5.3442, time 8574.26ms, mfu 4.90%
iter 51: loss 4.6419, time 8577.49ms, mfu 4.90%
iter 52: loss 4.6042, time 8571.11ms, mfu 4.89%
iter 53: loss 4.9906, time 8590.25ms, mfu 4.89%
iter 54: loss 4.3614, time 8575.50ms, mfu 4.89%
iter 55: loss 4.9262, time 8579.88ms, mfu 4.89%
iter 56: loss 4.4826, time 8575.00ms, mfu 4.89%
iter 57: loss 4.9846, time 8588.81ms, mfu 4.89%
iter 58: loss 4.4331, time 8605.09ms, mfu 4.89%
iter 59: loss 5.0106, time 8619.47ms, mfu 4.89%
iter 60: loss 4.9764, time 8621.17ms, mfu 4.88%
iter 61: loss 4.7182, time 8608.74ms, mfu 4.88%
iter 62: loss 4.4463, time 8620.13ms, mfu 4.88%
iter 63: loss 4.4104, time 8611.26ms, mfu 4.88%
iter 64: loss 4.7071, time 8617.55ms, mfu 4.88%
iter 65: loss 4.6299, time 8621.09ms, mfu 4.88%
iter 66: loss 4.2646, time 8620.13ms, mfu 4.87%
iter 67: loss 4.1438, time 8593.68ms, mfu 4.87%
iter 68: loss 4.4006, time 8582.23ms, mfu 4.88%
iter 69: loss 4.0730, time 8590.72ms, mfu 4.88%
iter 70: loss 4.9976, time 8595.29ms, mfu 4.88%
iter 71: loss 4.9489, time 8599.93ms, mfu 4.88%
iter 72: loss 5.0154, time 8579.80ms, mfu 4.88%
iter 73: loss 4.5946, time 8606.84ms, mfu 4.88%
iter 74: loss 4.0939, time 8630.93ms, mfu 4.87%
iter 75: loss 4.6848, time 8625.82ms, mfu 4.87%
iter 76: loss 4.4969, time 8649.87ms, mfu 4.87%
iter 77: loss 4.9814, time 8632.02ms, mfu 4.87%
iter 78: loss 3.7846, time 8623.38ms, mfu 4.87%
iter 79: loss 3.8895, time 8632.47ms, mfu 4.87%
iter 80: loss 4.4998, time 8617.54ms, mfu 4.87%
iter 81: loss 4.5046, time 8627.93ms, mfu 4.87%
iter 82: loss 4.6163, time 8635.66ms, mfu 4.86%
iter 83: loss 4.4953, time 8609.98ms, mfu 4.86%
iter 84: loss 4.5803, time 8624.13ms, mfu 4.86%
iter 85: loss 4.6659, time 8611.76ms, mfu 4.86%
iter 86: loss 4.3460, time 8616.83ms, mfu 4.86%
iter 87: loss 4.3845, time 8603.22ms, mfu 4.87%
iter 88: loss 4.3533, time 8617.45ms, mfu 4.86%
iter 89: loss 4.4435, time 8608.65ms, mfu 4.87%
iter 90: loss 4.3704, time 8592.76ms, mfu 4.87%
iter 91: loss 4.2431, time 8589.92ms, mfu 4.87%
iter 92: loss 4.0212, time 8597.50ms, mfu 4.87%
iter 93: loss 4.1264, time 8586.24ms, mfu 4.87%
iter 94: loss 4.1220, time 8592.83ms, mfu 4.87%
iter 95: loss 4.1321, time 8582.34ms, mfu 4.87%
iter 96: loss 4.0555, time 8599.52ms, mfu 4.87%
iter 97: loss 4.4975, time 8609.70ms, mfu 4.87%
iter 98: loss 4.0238, time 8614.32ms, mfu 4.87%
iter 99: loss 4.1986, time 8647.01ms, mfu 4.87%
iter 100: loss 4.2760, time 8653.75ms, mfu 4.87%
iter 101: loss 4.4590, time 8644.09ms, mfu 4.86%
iter 102: loss 4.2637, time 8638.90ms, mfu 4.86%
iter 103: loss 4.4544, time 8631.84ms, mfu 4.86%
iter 104: loss 4.1024, time 8627.18ms, mfu 4.86%
iter 105: loss 4.2986, time 8632.00ms, mfu 4.86%
iter 106: loss 4.2812, time 8604.27ms, mfu 4.86%
iter 107: loss 4.4581, time 8598.36ms, mfu 4.86%
iter 108: loss 4.1913, time 8620.99ms, mfu 4.86%
iter 109: loss 3.8422, time 8607.38ms, mfu 4.86%
iter 110: loss 4.5527, time 8601.24ms, mfu 4.86%
iter 111: loss 4.1511, time 8605.99ms, mfu 4.87%
iter 112: loss 3.9285, time 8596.85ms, mfu 4.87%
iter 113: loss 4.3725, time 8586.94ms, mfu 4.87%
iter 114: loss 4.1151, time 8611.93ms, mfu 4.87%
iter 115: loss 3.9681, time 8599.22ms, mfu 4.87%
iter 116: loss 3.5115, time 8604.79ms, mfu 4.87%
iter 117: loss 4.6527, time 8627.97ms, mfu 4.87%
iter 118: loss 4.0279, time 8636.22ms, mfu 4.87%
iter 119: loss 4.6281, time 8655.93ms, mfu 4.86%
iter 120: loss 4.2316, time 8648.39ms, mfu 4.86%
iter 121: loss 3.9145, time 8615.37ms, mfu 4.86%
iter 122: loss 4.2373, time 8566.49ms, mfu 4.87%
iter 123: loss 4.5246, time 8570.03ms, mfu 4.87%
iter 124: loss 4.1204, time 8594.51ms, mfu 4.87%
iter 125: loss 3.7255, time 8621.74ms, mfu 4.87%
iter 126: loss 3.9115, time 8653.39ms, mfu 4.87%
iter 127: loss 4.0616, time 8663.17ms, mfu 4.86%
iter 128: loss 3.8366, time 8664.00ms, mfu 4.86%
iter 129: loss 4.0500, time 8667.55ms, mfu 4.86%
iter 130: loss 4.3392, time 8625.17ms, mfu 4.86%
iter 131: loss 4.1691, time 8631.78ms, mfu 4.86%
iter 132: loss 4.4353, time 8617.00ms, mfu 4.86%
iter 133: loss 3.8027, time 8618.41ms, mfu 4.86%
iter 134: loss 3.7959, time 8604.49ms, mfu 4.86%
iter 135: loss 4.3140, time 8596.72ms, mfu 4.86%
iter 136: loss 3.9474, time 8599.17ms, mfu 4.86%
iter 137: loss 4.1435, time 8598.01ms, mfu 4.86%
iter 138: loss 4.2142, time 8585.70ms, mfu 4.87%
iter 139: loss 4.3872, time 8589.99ms, mfu 4.87%
iter 140: loss 3.7716, time 8579.24ms, mfu 4.87%
iter 141: loss 3.6565, time 8611.21ms, mfu 4.87%
iter 142: loss 4.4198, time 8591.41ms, mfu 4.87%
iter 143: loss 4.1432, time 8593.54ms, mfu 4.87%
iter 144: loss 3.6307, time 8603.14ms, mfu 4.87%
iter 145: loss 3.9492, time 8588.07ms, mfu 4.87%
iter 146: loss 3.8826, time 8581.41ms, mfu 4.87%
iter 147: loss 3.9950, time 8584.29ms, mfu 4.87%
iter 148: loss 4.0783, time 8588.14ms, mfu 4.87%
iter 149: loss 3.9484, time 8599.29ms, mfu 4.87%
iter 150: loss 4.1350, time 8621.13ms, mfu 4.87%
iter 151: loss 4.3543, time 8618.92ms, mfu 4.87%
iter 152: loss 3.6763, time 8622.45ms, mfu 4.87%
iter 153: loss 3.5881, time 8628.86ms, mfu 4.87%
iter 154: loss 4.0982, time 8634.87ms, mfu 4.87%
iter 155: loss 3.9191, time 8648.12ms, mfu 4.87%
iter 156: loss 4.7037, time 8650.67ms, mfu 4.86%
iter 157: loss 3.8984, time 8644.67ms, mfu 4.86%
iter 158: loss 3.6119, time 8640.63ms, mfu 4.86%
iter 159: loss 4.0273, time 8639.77ms, mfu 4.86%
iter 160: loss 3.8976, time 8649.45ms, mfu 4.86%
iter 161: loss 3.7696, time 8629.82ms, mfu 4.86%
iter 162: loss 4.3601, time 8624.82ms, mfu 4.86%
iter 163: loss 3.5459, time 8615.99ms, mfu 4.86%
iter 164: loss 4.0427, time 8588.03ms, mfu 4.86%
iter 165: loss 4.0317, time 8596.69ms, mfu 4.86%
iter 166: loss 3.4882, time 8605.05ms, mfu 4.86%
iter 167: loss 3.5766, time 8621.97ms, mfu 4.86%
iter 168: loss 3.4853, time 8627.42ms, mfu 4.86%
iter 169: loss 3.8831, time 8651.67ms, mfu 4.86%
iter 170: loss 4.0610, time 8655.51ms, mfu 4.86%
iter 171: loss 4.2021, time 8657.51ms, mfu 4.86%
iter 172: loss 3.5624, time 8643.72ms, mfu 4.86%
iter 173: loss 3.8339, time 8620.85ms, mfu 4.86%
iter 174: loss 3.8728, time 8616.63ms, mfu 4.86%
iter 175: loss 3.9464, time 8618.80ms, mfu 4.86%
iter 176: loss 3.9322, time 8623.73ms, mfu 4.86%
iter 177: loss 3.7080, time 8616.21ms, mfu 4.86%
iter 178: loss 4.2495, time 8619.43ms, mfu 4.86%
iter 179: loss 4.2081, time 8644.29ms, mfu 4.86%
iter 180: loss 3.5319, time 8651.30ms, mfu 4.86%
iter 181: loss 3.9545, time 8650.71ms, mfu 4.86%
iter 182: loss 3.6312, time 8639.63ms, mfu 4.86%
iter 183: loss 3.8381, time 8634.64ms, mfu 4.86%
iter 184: loss 3.7253, time 8630.92ms, mfu 4.86%
iter 185: loss 3.7516, time 8614.88ms, mfu 4.86%
iter 186: loss 3.9877, time 8615.15ms, mfu 4.86%
iter 187: loss 3.3855, time 8617.80ms, mfu 4.86%
iter 188: loss 3.7415, time 8640.44ms, mfu 4.86%
iter 189: loss 4.2760, time 8654.48ms, mfu 4.86%
iter 190: loss 3.7982, time 8667.83ms, mfu 4.85%
iter 191: loss 4.0666, time 8659.23ms, mfu 4.85%
iter 192: loss 3.5614, time 8654.13ms, mfu 4.85%
iter 193: loss 4.1368, time 8636.53ms, mfu 4.85%
iter 194: loss 4.0192, time 8630.78ms, mfu 4.85%
iter 195: loss 3.7553, time 8613.80ms, mfu 4.85%
iter 196: loss 3.2708, time 8608.26ms, mfu 4.85%
iter 197: loss 3.9325, time 8612.03ms, mfu 4.86%
iter 198: loss 4.0550, time 8640.95ms, mfu 4.86%
iter 199: loss 3.5905, time 8653.29ms, mfu 4.85%
iter 200: loss 3.1686, time 8668.38ms, mfu 4.85%
iter 201: loss 3.9889, time 8668.22ms, mfu 4.85%
iter 202: loss 3.8391, time 8671.27ms, mfu 4.85%
iter 203: loss 4.0105, time 8651.60ms, mfu 4.85%
iter 204: loss 3.3503, time 8638.98ms, mfu 4.85%
iter 205: loss 3.8734, time 8619.31ms, mfu 4.85%
iter 206: loss 3.7412, time 8618.30ms, mfu 4.85%
iter 207: loss 3.9228, time 8623.12ms, mfu 4.85%
iter 208: loss 3.8018, time 8638.63ms, mfu 4.85%
iter 209: loss 3.3612, time 8643.29ms, mfu 4.85%
iter 210: loss 3.5298, time 8641.91ms, mfu 4.85%
iter 211: loss 3.7375, time 8656.26ms, mfu 4.85%
iter 212: loss 4.1289, time 8673.38ms, mfu 4.85%
iter 213: loss 3.7472, time 8661.97ms, mfu 4.85%
iter 214: loss 3.4928, time 8654.19ms, mfu 4.85%
iter 215: loss 4.1344, time 8663.70ms, mfu 4.85%
iter 216: loss 3.6699, time 8665.40ms, mfu 4.85%
iter 217: loss 4.0792, time 8639.66ms, mfu 4.85%
iter 218: loss 3.8377, time 8635.12ms, mfu 4.85%
iter 219: loss 3.5988, time 8648.06ms, mfu 4.85%
iter 220: loss 3.7710, time 8623.13ms, mfu 4.85%
iter 221: loss 3.7184, time 8617.63ms, mfu 4.85%
iter 222: loss 3.7713, time 8625.57ms, mfu 4.85%
iter 223: loss 3.5824, time 8621.98ms, mfu 4.85%
iter 224: loss 3.3563, time 8616.86ms, mfu 4.85%
iter 225: loss 3.6609, time 8620.42ms, mfu 4.85%
iter 226: loss 3.5828, time 8621.86ms, mfu 4.85%
iter 227: loss 3.4875, time 8620.00ms, mfu 4.86%
iter 228: loss 3.9423, time 8606.99ms, mfu 4.86%
iter 229: loss 3.5360, time 8623.02ms, mfu 4.86%
iter 230: loss 4.0604, time 8613.26ms, mfu 4.86%
iter 231: loss 3.5473, time 8647.14ms, mfu 4.86%
iter 232: loss 3.2815, time 8650.43ms, mfu 4.86%
iter 233: loss 3.5676, time 8660.14ms, mfu 4.85%
iter 234: loss 3.8136, time 8648.40ms, mfu 4.85%
iter 235: loss 4.0873, time 8644.27ms, mfu 4.85%
iter 236: loss 3.5319, time 8648.41ms, mfu 4.85%
iter 237: loss 3.2985, time 8655.96ms, mfu 4.85%
iter 238: loss 3.7030, time 8647.27ms, mfu 4.85%
iter 239: loss 3.7970, time 8649.48ms, mfu 4.85%
iter 240: loss 3.9477, time 8637.59ms, mfu 4.85%
iter 241: loss 3.4666, time 8608.50ms, mfu 4.85%
iter 242: loss 3.9139, time 8636.02ms, mfu 4.85%
iter 243: loss 3.7084, time 8598.47ms, mfu 4.85%
iter 244: loss 4.1553, time 8608.88ms, mfu 4.86%
iter 245: loss 3.1720, time 8626.86ms, mfu 4.86%
iter 246: loss 3.3597, time 8657.78ms, mfu 4.85%
iter 247: loss 3.8867, time 8679.08ms, mfu 4.85%
iter 248: loss 3.4409, time 8672.95ms, mfu 4.85%
iter 249: loss 3.6427, time 8667.01ms, mfu 4.85%
iter 250: loss 3.1590, time 8628.97ms, mfu 4.85%
iter 251: loss 3.5068, time 8620.25ms, mfu 4.85%
iter 252: loss 3.8223, time 8588.02ms, mfu 4.85%
iter 253: loss 3.6845, time 8588.25ms, mfu 4.86%
iter 254: loss 3.4547, time 8595.88ms, mfu 4.86%
iter 255: loss 3.9800, time 8623.62ms, mfu 4.86%
iter 256: loss 3.5661, time 8662.05ms, mfu 4.86%
iter 257: loss 3.1116, time 8662.66ms, mfu 4.85%
iter 258: loss 3.4438, time 8684.17ms, mfu 4.85%
iter 259: loss 3.2512, time 8694.32ms, mfu 4.85%
iter 260: loss 3.0704, time 8668.98ms, mfu 4.85%
iter 261: loss 3.5007, time 8660.05ms, mfu 4.85%
iter 262: loss 3.7747, time 8659.00ms, mfu 4.85%
iter 263: loss 3.3664, time 8625.56ms, mfu 4.85%
iter 264: loss 3.4316, time 8633.14ms, mfu 4.85%
iter 265: loss 3.3168, time 8617.28ms, mfu 4.85%
iter 266: loss 3.9911, time 8602.97ms, mfu 4.85%
iter 267: loss 3.2569, time 8628.35ms, mfu 4.85%
iter 268: loss 3.5132, time 8659.45ms, mfu 4.85%
iter 269: loss 3.4869, time 8666.75ms, mfu 4.85%
iter 270: loss 3.5610, time 8685.68ms, mfu 4.85%
iter 271: loss 3.5612, time 8685.91ms, mfu 4.85%
iter 272: loss 3.2356, time 8671.53ms, mfu 4.84%
iter 273: loss 3.7701, time 8645.19ms, mfu 4.84%
iter 274: loss 3.7196, time 8630.64ms, mfu 4.85%
iter 275: loss 3.3458, time 8616.09ms, mfu 4.85%
iter 276: loss 3.1954, time 8617.82ms, mfu 4.85%
iter 277: loss 3.6401, time 8612.62ms, mfu 4.85%
iter 278: loss 3.3252, time 8609.22ms, mfu 4.85%
iter 279: loss 3.3146, time 8626.44ms, mfu 4.85%
iter 280: loss 3.4229, time 8619.85ms, mfu 4.85%
iter 281: loss 3.6033, time 8623.26ms, mfu 4.85%
iter 282: loss 3.2154, time 8648.17ms, mfu 4.85%
iter 283: loss 3.4494, time 8655.24ms, mfu 4.85%
iter 284: loss 3.8272, time 8645.33ms, mfu 4.85%
iter 285: loss 3.3412, time 8658.55ms, mfu 4.85%
iter 286: loss 3.4489, time 8651.13ms, mfu 4.85%
iter 287: loss 3.1939, time 8674.93ms, mfu 4.85%
iter 288: loss 3.5090, time 8668.67ms, mfu 4.85%
iter 289: loss 3.8602, time 8664.28ms, mfu 4.85%
iter 290: loss 3.3456, time 8667.55ms, mfu 4.85%
iter 291: loss 3.1917, time 8674.09ms, mfu 4.84%
iter 292: loss 3.3412, time 8677.42ms, mfu 4.84%
iter 293: loss 3.2346, time 8669.57ms, mfu 4.84%
iter 294: loss 3.1180, time 8664.29ms, mfu 4.84%
iter 295: loss 3.2915, time 8659.04ms, mfu 4.84%
iter 296: loss 3.0988, time 8671.70ms, mfu 4.84%
iter 297: loss 3.2264, time 8655.80ms, mfu 4.84%
iter 298: loss 3.4803, time 8671.95ms, mfu 4.84%
iter 299: loss 3.4889, time 8686.26ms, mfu 4.84%
iter 300: loss 3.6585, time 8687.64ms, mfu 4.84%
iter 301: loss 3.0936, time 8672.83ms, mfu 4.84%
iter 302: loss 3.3307, time 8673.06ms, mfu 4.84%
iter 303: loss 2.9237, time 8680.70ms, mfu 4.84%
iter 304: loss 3.5450, time 8668.80ms, mfu 4.84%
iter 305: loss 3.4214, time 8679.82ms, mfu 4.83%
iter 306: loss 3.7609, time 8682.43ms, mfu 4.83%
iter 307: loss 3.4361, time 8690.50ms, mfu 4.83%
iter 308: loss 2.8749, time 8689.23ms, mfu 4.83%
iter 309: loss 3.4302, time 8673.76ms, mfu 4.83%
iter 310: loss 3.2594, time 8672.50ms, mfu 4.83%
iter 311: loss 3.4443, time 8665.84ms, mfu 4.83%
iter 312: loss 3.1514, time 8650.35ms, mfu 4.83%
iter 313: loss 3.5545, time 8651.61ms, mfu 4.83%
iter 314: loss 3.5483, time 8665.51ms, mfu 4.84%
iter 315: loss 3.6582, time 8662.80ms, mfu 4.84%
iter 316: loss 3.6223, time 8675.57ms, mfu 4.83%
iter 317: loss 3.0154, time 8665.38ms, mfu 4.84%
iter 318: loss 3.3053, time 8648.43ms, mfu 4.84%
iter 319: loss 3.5329, time 8651.81ms, mfu 4.84%
iter 320: loss 3.3153, time 8625.75ms, mfu 4.84%
iter 321: loss 2.8094, time 8641.79ms, mfu 4.84%
iter 322: loss 3.6426, time 8621.13ms, mfu 4.84%
iter 323: loss 3.4374, time 8615.89ms, mfu 4.84%
iter 324: loss 3.7190, time 8629.96ms, mfu 4.85%
iter 325: loss 3.4063, time 8623.97ms, mfu 4.85%
iter 326: loss 3.1594, time 8642.17ms, mfu 4.85%
iter 327: loss 2.7545, time 8669.01ms, mfu 4.85%
iter 328: loss 3.5843, time 8677.00ms, mfu 4.84%
iter 329: loss 3.5237, time 8686.23ms, mfu 4.84%
iter 330: loss 3.5213, time 8691.07ms, mfu 4.84%
iter 331: loss 3.5086, time 8670.12ms, mfu 4.84%
iter 332: loss 3.2068, time 8649.56ms, mfu 4.84%
iter 333: loss 3.1339, time 8629.05ms, mfu 4.84%
iter 334: loss 3.0310, time 8634.86ms, mfu 4.84%
iter 335: loss 2.8422, time 8615.16ms, mfu 4.85%
iter 336: loss 3.3428, time 8629.04ms, mfu 4.85%
iter 337: loss 3.3146, time 8634.50ms, mfu 4.85%
iter 338: loss 3.5079, time 8651.51ms, mfu 4.85%
iter 339: loss 3.0649, time 8637.77ms, mfu 4.85%
iter 340: loss 2.8651, time 8635.81ms, mfu 4.85%
iter 341: loss 3.1499, time 8640.59ms, mfu 4.85%
iter 342: loss 3.3568, time 8649.71ms, mfu 4.85%
iter 343: loss 2.7608, time 8651.03ms, mfu 4.85%
iter 344: loss 3.2066, time 8654.76ms, mfu 4.85%
iter 345: loss 3.6423, time 8640.92ms, mfu 4.85%
iter 346: loss 3.3953, time 8659.28ms, mfu 4.85%
iter 347: loss 3.0974, time 8646.02ms, mfu 4.85%
iter 348: loss 2.7902, time 8648.20ms, mfu 4.85%
iter 349: loss 3.3830, time 8655.83ms, mfu 4.85%
iter 350: loss 3.4035, time 8650.05ms, mfu 4.85%
iter 351: loss 3.3697, time 8659.76ms, mfu 4.85%
iter 352: loss 3.4110, time 8663.69ms, mfu 4.84%
iter 353: loss 3.1482, time 8662.94ms, mfu 4.84%
iter 354: loss 3.1332, time 8649.34ms, mfu 4.84%
iter 355: loss 3.4862, time 8658.85ms, mfu 4.84%
iter 356: loss 2.7291, time 8670.13ms, mfu 4.84%
iter 357: loss 3.8040, time 8670.49ms, mfu 4.84%
iter 358: loss 3.2584, time 8670.42ms, mfu 4.84%
iter 359: loss 2.8368, time 8642.85ms, mfu 4.84%
iter 360: loss 3.4442, time 8662.50ms, mfu 4.84%
iter 361: loss 2.9921, time 8654.29ms, mfu 4.84%
iter 362: loss 3.4384, time 8666.97ms, mfu 4.84%
iter 363: loss 3.0399, time 8648.76ms, mfu 4.84%
iter 364: loss 3.4536, time 8656.93ms, mfu 4.84%
iter 365: loss 3.2651, time 8670.49ms, mfu 4.84%
iter 366: loss 3.1223, time 8655.47ms, mfu 4.84%
iter 367: loss 2.7574, time 8654.37ms, mfu 4.84%
iter 368: loss 3.7777, time 8683.38ms, mfu 4.84%
iter 369: loss 3.3921, time 8648.73ms, mfu 4.84%
iter 370: loss 3.0499, time 8663.09ms, mfu 4.84%
iter 371: loss 2.9050, time 8666.00ms, mfu 4.84%
iter 372: loss 3.9740, time 8670.99ms, mfu 4.84%
iter 373: loss 3.4565, time 8663.29ms, mfu 4.84%
iter 374: loss 3.3331, time 8664.95ms, mfu 4.84%
iter 375: loss 3.4092, time 8676.23ms, mfu 4.84%
iter 376: loss 3.4608, time 8673.07ms, mfu 4.84%
iter 377: loss 3.0973, time 8673.72ms, mfu 4.84%
iter 378: loss 3.1853, time 8665.67ms, mfu 4.84%
iter 379: loss 2.9494, time 8670.04ms, mfu 4.84%
iter 380: loss 3.0774, time 8676.85ms, mfu 4.84%
iter 381: loss 3.1104, time 8645.91ms, mfu 4.84%
iter 382: loss 3.4315, time 8662.00ms, mfu 4.84%
iter 383: loss 2.9648, time 8664.61ms, mfu 4.84%
iter 384: loss 3.3145, time 8675.97ms, mfu 4.84%
iter 385: loss 3.5303, time 8670.86ms, mfu 4.84%
iter 386: loss 3.6814, time 8667.07ms, mfu 4.84%
iter 387: loss 3.1340, time 8668.05ms, mfu 4.84%
iter 388: loss 2.8605, time 8661.08ms, mfu 4.84%
iter 389: loss 3.0573, time 8665.45ms, mfu 4.84%
iter 390: loss 3.3290, time 8667.54ms, mfu 4.84%
iter 391: loss 2.8216, time 8683.05ms, mfu 4.84%
iter 392: loss 3.3761, time 8681.72ms, mfu 4.83%
iter 393: loss 3.2386, time 8656.53ms, mfu 4.84%
iter 394: loss 2.9342, time 8675.19ms, mfu 4.84%
iter 395: loss 3.3630, time 8641.03ms, mfu 4.84%
iter 396: loss 2.9133, time 8672.51ms, mfu 4.84%
iter 397: loss 3.0461, time 8664.44ms, mfu 4.84%
iter 398: loss 3.4008, time 8661.83ms, mfu 4.84%
iter 399: loss 2.7681, time 8665.82ms, mfu 4.84%
iter 400: loss 2.8864, time 8658.23ms, mfu 4.84%
iter 401: loss 3.0373, time 8648.87ms, mfu 4.84%
iter 402: loss 2.7249, time 8654.07ms, mfu 4.84%
iter 403: loss 2.9297, time 8640.09ms, mfu 4.84%
iter 404: loss 3.1049, time 8648.95ms, mfu 4.84%
iter 405: loss 2.9963, time 8658.22ms, mfu 4.84%
iter 406: loss 2.9402, time 8639.44ms, mfu 4.84%
iter 407: loss 2.9765, time 8625.90ms, mfu 4.84%
iter 408: loss 3.0184, time 8636.50ms, mfu 4.84%
iter 409: loss 3.1750, time 8648.83ms, mfu 4.84%
iter 410: loss 2.7781, time 8640.51ms, mfu 4.85%
iter 411: loss 3.4126, time 8647.81ms, mfu 4.85%
iter 412: loss 2.8227, time 8646.20ms, mfu 4.85%
iter 413: loss 2.7017, time 8642.34ms, mfu 4.85%
iter 414: loss 2.8977, time 8630.03ms, mfu 4.85%
iter 415: loss 3.1534, time 8652.40ms, mfu 4.85%
iter 416: loss 3.1109, time 8627.47ms, mfu 4.85%
iter 417: loss 3.0617, time 8632.74ms, mfu 4.85%
iter 418: loss 3.0585, time 8637.74ms, mfu 4.85%
iter 419: loss 2.8181, time 8626.48ms, mfu 4.85%
iter 420: loss 2.9407, time 8645.69ms, mfu 4.85%
iter 421: loss 2.5877, time 8624.63ms, mfu 4.85%
iter 422: loss 2.8930, time 8625.95ms, mfu 4.85%
iter 423: loss 3.1659, time 8639.73ms, mfu 4.85%
iter 424: loss 3.0002, time 8636.31ms, mfu 4.85%
iter 425: loss 2.7393, time 8623.30ms, mfu 4.85%
iter 426: loss 2.9193, time 8647.94ms, mfu 4.85%
iter 427: loss 2.8248, time 8646.27ms, mfu 4.85%
iter 428: loss 3.1079, time 8650.32ms, mfu 4.85%
iter 429: loss 2.6133, time 8657.00ms, mfu 4.85%
iter 430: loss 3.2906, time 8642.88ms, mfu 4.85%
iter 431: loss 3.3801, time 8662.70ms, mfu 4.85%
iter 432: loss 2.9604, time 8630.81ms, mfu 4.85%
iter 433: loss 3.5290, time 8646.04ms, mfu 4.85%
iter 434: loss 3.1831, time 8667.16ms, mfu 4.85%
iter 435: loss 3.3310, time 8658.09ms, mfu 4.85%
iter 436: loss 2.8888, time 8649.50ms, mfu 4.85%
iter 437: loss 2.9691, time 8673.54ms, mfu 4.85%
iter 438: loss 2.8066, time 8667.81ms, mfu 4.84%
iter 439: loss 3.2573, time 8651.94ms, mfu 4.84%
iter 440: loss 2.9447, time 8675.06ms, mfu 4.84%
iter 441: loss 3.0731, time 8662.11ms, mfu 4.84%
iter 442: loss 2.9119, time 8679.35ms, mfu 4.84%
iter 443: loss 2.8863, time 8649.81ms, mfu 4.84%
iter 444: loss 2.9525, time 8668.16ms, mfu 4.84%
iter 445: loss 2.7801, time 8659.38ms, mfu 4.84%
iter 446: loss 2.8643, time 8664.47ms, mfu 4.84%
iter 447: loss 3.1379, time 8671.97ms, mfu 4.84%
iter 448: loss 3.0064, time 8667.29ms, mfu 4.84%
iter 449: loss 2.7570, time 8668.94ms, mfu 4.84%
iter 450: loss 3.0825, time 8658.21ms, mfu 4.84%
iter 451: loss 2.9744, time 8655.65ms, mfu 4.84%
iter 452: loss 2.8429, time 8639.18ms, mfu 4.84%
iter 453: loss 3.1309, time 8649.77ms, mfu 4.84%
iter 454: loss 3.0075, time 8648.82ms, mfu 4.84%
iter 455: loss 2.9062, time 8648.30ms, mfu 4.84%
iter 456: loss 3.2647, time 8659.95ms, mfu 4.84%
iter 457: loss 3.1610, time 8652.56ms, mfu 4.84%
iter 458: loss 2.8575, time 8656.39ms, mfu 4.84%
iter 459: loss 3.3243, time 8666.80ms, mfu 4.84%
iter 460: loss 2.8951, time 8640.45ms, mfu 4.84%
iter 461: loss 2.9039, time 8651.74ms, mfu 4.84%
iter 462: loss 3.1147, time 8660.53ms, mfu 4.84%
iter 463: loss 3.1728, time 8660.29ms, mfu 4.84%
iter 464: loss 2.9617, time 8651.62ms, mfu 4.84%
iter 465: loss 2.6183, time 8662.35ms, mfu 4.84%
iter 466: loss 2.9171, time 8640.45ms, mfu 4.84%
iter 467: loss 2.7724, time 8672.34ms, mfu 4.84%
iter 468: loss 2.6687, time 8640.98ms, mfu 4.84%
iter 469: loss 2.9125, time 8646.05ms, mfu 4.84%
iter 470: loss 3.1598, time 8665.74ms, mfu 4.84%
iter 471: loss 3.0722, time 8649.59ms, mfu 4.84%
iter 472: loss 2.9900, time 8654.81ms, mfu 4.84%
iter 473: loss 3.2363, time 8652.38ms, mfu 4.84%
iter 474: loss 3.0777, time 8648.03ms, mfu 4.84%
iter 475: loss 3.0885, time 8659.32ms, mfu 4.84%
iter 476: loss 3.1338, time 8667.53ms, mfu 4.84%
iter 477: loss 2.7480, time 8655.68ms, mfu 4.84%
iter 478: loss 3.0809, time 8647.81ms, mfu 4.84%
iter 479: loss 2.8427, time 8650.17ms, mfu 4.84%
iter 480: loss 2.8938, time 8657.66ms, mfu 4.84%
iter 481: loss 3.0926, time 8658.14ms, mfu 4.84%
iter 482: loss 3.2445, time 8643.02ms, mfu 4.84%
iter 483: loss 3.0581, time 8644.14ms, mfu 4.84%
iter 484: loss 2.8537, time 8642.63ms, mfu 4.84%
iter 485: loss 2.6688, time 8633.78ms, mfu 4.85%
iter 486: loss 2.3358, time 8624.15ms, mfu 4.85%
iter 487: loss 2.6975, time 8632.62ms, mfu 4.85%
iter 488: loss 3.2052, time 8627.98ms, mfu 4.85%
iter 489: loss 2.6787, time 8629.58ms, mfu 4.85%
iter 490: loss 3.0259, time 8653.31ms, mfu 4.85%
iter 491: loss 2.9852, time 8625.35ms, mfu 4.85%
iter 492: loss 3.0974, time 8630.94ms, mfu 4.85%
iter 493: loss 3.2502, time 8645.79ms, mfu 4.85%
iter 494: loss 2.6175, time 8625.32ms, mfu 4.85%
iter 495: loss 3.1159, time 8639.53ms, mfu 4.85%
iter 496: loss 2.6928, time 8641.15ms, mfu 4.85%
iter 497: loss 2.4996, time 8653.25ms, mfu 4.85%
iter 498: loss 3.4541, time 8644.18ms, mfu 4.85%
iter 499: loss 2.6616, time 8630.96ms, mfu 4.85%
iter 500: loss 2.9856, time 8629.29ms, mfu 4.85%
iter 501: loss 3.0015, time 8637.01ms, mfu 4.85%
iter 502: loss 2.9113, time 8656.21ms, mfu 4.85%
iter 503: loss 2.5566, time 8646.32ms, mfu 4.85%
iter 504: loss 2.7494, time 8654.97ms, mfu 4.85%
iter 505: loss 2.7406, time 8657.50ms, mfu 4.85%
iter 506: loss 2.9104, time 8651.37ms, mfu 4.85%
iter 507: loss 2.6410, time 8663.21ms, mfu 4.85%
iter 508: loss 2.6439, time 8660.41ms, mfu 4.85%
iter 509: loss 2.8496, time 8643.38ms, mfu 4.85%
iter 510: loss 2.6592, time 8672.76ms, mfu 4.85%
iter 511: loss 2.6955, time 8661.35ms, mfu 4.84%
iter 512: loss 2.9582, time 8650.71ms, mfu 4.84%
iter 513: loss 2.8227, time 8670.88ms, mfu 4.84%
iter 514: loss 3.1748, time 8678.47ms, mfu 4.84%
iter 515: loss 3.0165, time 8670.11ms, mfu 4.84%
iter 516: loss 2.6271, time 8667.11ms, mfu 4.84%
iter 517: loss 2.9158, time 8671.04ms, mfu 4.84%
iter 518: loss 2.7804, time 8667.92ms, mfu 4.84%
iter 519: loss 2.7650, time 8651.42ms, mfu 4.84%
iter 520: loss 3.0931, time 8646.35ms, mfu 4.84%
iter 521: loss 3.1471, time 8651.94ms, mfu 4.84%
iter 522: loss 3.1553, time 8637.86ms, mfu 4.84%
iter 523: loss 2.5807, time 8648.35ms, mfu 4.84%
iter 524: loss 2.5773, time 8623.45ms, mfu 4.84%
iter 525: loss 3.1033, time 8622.79ms, mfu 4.85%
iter 526: loss 2.7191, time 8623.90ms, mfu 4.85%
iter 527: loss 2.5332, time 8638.45ms, mfu 4.85%
iter 528: loss 2.8957, time 8615.40ms, mfu 4.85%
iter 529: loss 3.1985, time 8631.94ms, mfu 4.85%
iter 530: loss 2.5415, time 8631.46ms, mfu 4.85%
iter 531: loss 2.9808, time 8647.46ms, mfu 4.85%
iter 532: loss 2.8564, time 8653.88ms, mfu 4.85%
iter 533: loss 2.7561, time 8662.95ms, mfu 4.85%
iter 534: loss 2.4323, time 8655.16ms, mfu 4.85%
iter 535: loss 2.8473, time 8659.21ms, mfu 4.85%
iter 536: loss 3.2732, time 8663.19ms, mfu 4.85%
iter 537: loss 2.9682, time 8663.59ms, mfu 4.85%
iter 538: loss 2.4346, time 8681.30ms, mfu 4.84%
iter 539: loss 3.0878, time 8681.82ms, mfu 4.84%
iter 540: loss 2.6138, time 8680.38ms, mfu 4.84%
iter 541: loss 3.1540, time 8683.31ms, mfu 4.84%
iter 542: loss 2.6734, time 8687.51ms, mfu 4.84%
iter 543: loss 3.3290, time 8684.97ms, mfu 4.84%
iter 544: loss 2.7893, time 8666.62ms, mfu 4.84%
iter 545: loss 2.5224, time 8663.66ms, mfu 4.84%
iter 546: loss 2.7761, time 8657.69ms, mfu 4.84%
iter 547: loss 2.7874, time 8673.55ms, mfu 4.84%
iter 548: loss 2.8987, time 8656.89ms, mfu 4.84%
iter 549: loss 2.6741, time 8662.13ms, mfu 4.84%
iter 550: loss 2.9549, time 8657.49ms, mfu 4.84%
iter 551: loss 2.6362, time 8629.49ms, mfu 4.84%
iter 552: loss 2.5698, time 8628.99ms, mfu 4.84%
iter 553: loss 3.3155, time 8635.99ms, mfu 4.84%
iter 554: loss 3.0167, time 8656.88ms, mfu 4.84%
iter 555: loss 2.7190, time 8688.24ms, mfu 4.84%
iter 556: loss 2.9315, time 8684.43ms, mfu 4.84%
iter 557: loss 2.9666, time 8693.93ms, mfu 4.84%
iter 558: loss 2.9896, time 8691.93ms, mfu 4.84%
iter 559: loss 2.6828, time 8661.07ms, mfu 4.84%
iter 560: loss 2.9515, time 8657.07ms, mfu 4.84%
iter 561: loss 2.4689, time 8642.19ms, mfu 4.84%
iter 562: loss 3.0925, time 8638.87ms, mfu 4.84%
iter 563: loss 2.6779, time 8626.41ms, mfu 4.84%
iter 564: loss 2.8660, time 8610.51ms, mfu 4.84%
iter 565: loss 2.6725, time 8639.57ms, mfu 4.84%
iter 566: loss 2.8228, time 8635.43ms, mfu 4.85%
iter 567: loss 2.5499, time 8629.48ms, mfu 4.85%
iter 568: loss 2.8937, time 8630.60ms, mfu 4.85%
iter 569: loss 2.5574, time 8627.56ms, mfu 4.85%
iter 570: loss 3.2463, time 8642.97ms, mfu 4.85%
iter 571: loss 2.3930, time 8619.66ms, mfu 4.85%
iter 572: loss 2.6956, time 8632.41ms, mfu 4.85%
iter 573: loss 2.4659, time 8645.99ms, mfu 4.85%
iter 574: loss 3.0291, time 8657.56ms, mfu 4.85%
iter 575: loss 2.9726, time 8665.60ms, mfu 4.85%
iter 576: loss 2.6596, time 8659.34ms, mfu 4.85%
iter 577: loss 2.7857, time 8653.21ms, mfu 4.85%
iter 578: loss 2.5932, time 8670.29ms, mfu 4.85%
iter 579: loss 2.6142, time 8667.29ms, mfu 4.84%
iter 580: loss 2.8602, time 8672.29ms, mfu 4.84%
iter 581: loss 2.6789, time 8675.84ms, mfu 4.84%
iter 582: loss 3.1505, time 8688.29ms, mfu 4.84%
iter 583: loss 2.4355, time 8693.20ms, mfu 4.84%
iter 584: loss 2.8075, time 8672.10ms, mfu 4.84%
iter 585: loss 2.7514, time 8665.23ms, mfu 4.84%
iter 586: loss 2.4707, time 8675.27ms, mfu 4.84%
iter 587: loss 2.6756, time 8674.22ms, mfu 4.84%
iter 588: loss 2.7731, time 8677.58ms, mfu 4.84%
iter 589: loss 2.8041, time 8676.98ms, mfu 4.84%
iter 590: loss 2.5452, time 8683.96ms, mfu 4.83%
iter 591: loss 2.6023, time 8662.85ms, mfu 4.83%
iter 592: loss 2.9542, time 8684.82ms, mfu 4.83%
iter 593: loss 2.7394, time 8668.33ms, mfu 4.83%
iter 594: loss 2.5169, time 8663.33ms, mfu 4.83%
iter 595: loss 2.2587, time 8671.98ms, mfu 4.83%
iter 596: loss 2.7463, time 8659.56ms, mfu 4.83%
iter 597: loss 3.1444, time 8654.86ms, mfu 4.84%
iter 598: loss 2.7956, time 8645.32ms, mfu 4.84%
iter 599: loss 2.7251, time 8653.31ms, mfu 4.84%
iter 600: loss 2.7324, time 8634.86ms, mfu 4.84%
iter 601: loss 2.8006, time 8661.91ms, mfu 4.84%
iter 602: loss 2.8238, time 8697.88ms, mfu 4.84%
iter 603: loss 2.8431, time 8695.72ms, mfu 4.84%
iter 604: loss 2.6771, time 8685.14ms, mfu 4.83%
iter 605: loss 2.9833, time 8690.87ms, mfu 4.83%
iter 606: loss 2.7350, time 8684.70ms, mfu 4.83%
iter 607: loss 2.6079, time 8663.20ms, mfu 4.83%
iter 608: loss 2.5441, time 8654.91ms, mfu 4.83%
iter 609: loss 2.4605, time 8644.53ms, mfu 4.84%
iter 610: loss 2.2826, time 8649.47ms, mfu 4.84%
iter 611: loss 2.6622, time 8646.14ms, mfu 4.84%
iter 612: loss 2.3689, time 8649.14ms, mfu 4.84%
iter 613: loss 2.8340, time 8649.03ms, mfu 4.84%
iter 614: loss 2.7277, time 8650.16ms, mfu 4.84%
iter 615: loss 2.5289, time 8649.16ms, mfu 4.84%
iter 616: loss 2.6897, time 8660.36ms, mfu 4.84%
iter 617: loss 2.6683, time 8659.74ms, mfu 4.84%
iter 618: loss 2.4284, time 8665.35ms, mfu 4.84%
iter 619: loss 2.6459, time 8687.44ms, mfu 4.84%
iter 620: loss 2.4581, time 8697.98ms, mfu 4.84%
iter 621: loss 2.4879, time 8670.16ms, mfu 4.84%
iter 622: loss 2.6254, time 8708.55ms, mfu 4.83%
iter 623: loss 2.7253, time 8707.74ms, mfu 4.83%
iter 624: loss 2.6671, time 8688.98ms, mfu 4.83%
iter 625: loss 2.6193, time 8698.57ms, mfu 4.83%
iter 626: loss 2.5587, time 8697.56ms, mfu 4.83%
iter 627: loss 2.7809, time 8702.58ms, mfu 4.83%
iter 628: loss 2.8641, time 8687.50ms, mfu 4.83%
iter 629: loss 2.3357, time 8691.39ms, mfu 4.83%
iter 630: loss 2.7074, time 8716.26ms, mfu 4.82%
iter 631: loss 3.1473, time 8703.56ms, mfu 4.82%
iter 632: loss 2.3688, time 8708.97ms, mfu 4.82%
iter 633: loss 2.7453, time 8710.71ms, mfu 4.82%
iter 634: loss 2.2974, time 8689.91ms, mfu 4.82%
iter 635: loss 2.7187, time 8695.61ms, mfu 4.82%
iter 636: loss 2.8857, time 8703.01ms, mfu 4.82%
iter 637: loss 2.6801, time 8696.49ms, mfu 4.82%
iter 638: loss 2.4622, time 8684.90ms, mfu 4.82%
iter 639: loss 2.8263, time 8696.59ms, mfu 4.82%
iter 640: loss 2.6745, time 8693.04ms, mfu 4.82%
iter 641: loss 2.6081, time 8683.30ms, mfu 4.82%
iter 642: loss 2.3250, time 8687.08ms, mfu 4.82%
iter 643: loss 2.4172, time 8669.32ms, mfu 4.82%
iter 644: loss 2.6628, time 8697.41ms, mfu 4.82%
iter 645: loss 2.7922, time 8677.29ms, mfu 4.82%
iter 646: loss 2.5714, time 8673.65ms, mfu 4.82%
iter 647: loss 2.6283, time 8677.91ms, mfu 4.83%
iter 648: loss 2.3705, time 8680.33ms, mfu 4.83%
iter 649: loss 2.4749, time 8683.26ms, mfu 4.83%
iter 650: loss 2.7158, time 8672.01ms, mfu 4.83%
iter 651: loss 2.8011, time 8673.38ms, mfu 4.83%
iter 652: loss 2.6900, time 8671.81ms, mfu 4.83%
iter 653: loss 2.2081, time 8683.23ms, mfu 4.83%
iter 654: loss 2.6868, time 8683.73ms, mfu 4.83%
iter 655: loss 2.6055, time 8678.04ms, mfu 4.83%
iter 656: loss 2.6779, time 8674.01ms, mfu 4.83%
iter 657: loss 2.2427, time 8675.68ms, mfu 4.83%
iter 658: loss 2.5606, time 8694.64ms, mfu 4.83%
iter 659: loss 2.7877, time 8681.29ms, mfu 4.83%
iter 660: loss 2.7018, time 8680.36ms, mfu 4.83%
iter 661: loss 2.7788, time 8679.22ms, mfu 4.83%
iter 662: loss 2.7859, time 8691.54ms, mfu 4.83%
iter 663: loss 2.6635, time 8693.50ms, mfu 4.83%
iter 664: loss 2.4972, time 8691.01ms, mfu 4.83%
iter 665: loss 2.8176, time 8676.30ms, mfu 4.83%
iter 666: loss 2.5625, time 8676.79ms, mfu 4.83%
iter 667: loss 2.7594, time 8677.03ms, mfu 4.83%
iter 668: loss 2.3591, time 8672.64ms, mfu 4.83%
iter 669: loss 2.3429, time 8673.78ms, mfu 4.83%
iter 670: loss 2.6290, time 8669.14ms, mfu 4.83%
iter 671: loss 2.5004, time 8669.35ms, mfu 4.83%
iter 672: loss 2.3437, time 8675.40ms, mfu 4.83%
iter 673: loss 2.9688, time 8672.27ms, mfu 4.83%
iter 674: loss 2.6090, time 8675.72ms, mfu 4.83%
iter 675: loss 2.6485, time 8669.89ms, mfu 4.83%
iter 676: loss 2.6594, time 8671.38ms, mfu 4.83%
iter 677: loss 2.3029, time 8670.71ms, mfu 4.83%
iter 678: loss 2.1802, time 8665.64ms, mfu 4.83%
iter 679: loss 2.5418, time 8661.10ms, mfu 4.83%
iter 680: loss 2.5186, time 8656.56ms, mfu 4.83%
iter 681: loss 2.3765, time 8665.39ms, mfu 4.83%
iter 682: loss 2.4261, time 8661.11ms, mfu 4.83%
iter 683: loss 2.2529, time 8649.60ms, mfu 4.84%
iter 684: loss 2.5767, time 8659.09ms, mfu 4.84%
iter 685: loss 2.6583, time 8651.83ms, mfu 4.84%
iter 686: loss 2.5810, time 8656.50ms, mfu 4.84%
iter 687: loss 3.0182, time 8654.19ms, mfu 4.84%
iter 688: loss 2.3277, time 8664.95ms, mfu 4.84%
iter 689: loss 2.2495, time 8662.72ms, mfu 4.84%
iter 690: loss 2.6594, time 8663.52ms, mfu 4.84%
iter 691: loss 2.2855, time 8653.00ms, mfu 4.84%
iter 692: loss 2.5947, time 8660.07ms, mfu 4.84%
iter 693: loss 2.2406, time 8660.24ms, mfu 4.84%
iter 694: loss 2.7628, time 8667.64ms, mfu 4.84%
iter 695: loss 2.5812, time 8647.58ms, mfu 4.84%
iter 696: loss 2.6452, time 8665.48ms, mfu 4.84%
iter 697: loss 2.3691, time 8663.08ms, mfu 4.84%
iter 698: loss 2.6981, time 8675.27ms, mfu 4.84%
iter 699: loss 2.1398, time 8655.66ms, mfu 4.84%
iter 700: loss 2.3275, time 8659.21ms, mfu 4.84%
iter 701: loss 2.4207, time 8662.24ms, mfu 4.84%
iter 702: loss 2.4365, time 8668.87ms, mfu 4.84%
iter 703: loss 3.1493, time 8673.89ms, mfu 4.84%
iter 704: loss 2.6444, time 8660.72ms, mfu 4.84%
iter 705: loss 2.5070, time 8657.47ms, mfu 4.84%
iter 706: loss 2.3583, time 8646.67ms, mfu 4.84%
iter 707: loss 2.4516, time 8654.61ms, mfu 4.84%
iter 708: loss 2.6122, time 8654.33ms, mfu 4.84%
iter 709: loss 2.5110, time 8669.24ms, mfu 4.84%
iter 710: loss 2.6044, time 8664.81ms, mfu 4.84%
iter 711: loss 2.6173, time 8666.73ms, mfu 4.84%
iter 712: loss 2.3983, time 8659.56ms, mfu 4.84%
iter 713: loss 2.4088, time 8671.36ms, mfu 4.84%
iter 714: loss 2.5653, time 8674.28ms, mfu 4.84%
iter 715: loss 2.5491, time 8682.79ms, mfu 4.84%
iter 716: loss 2.2089, time 8674.73ms, mfu 4.84%
iter 717: loss 2.4063, time 8680.54ms, mfu 4.84%
iter 718: loss 2.6448, time 8679.89ms, mfu 4.83%
iter 719: loss 2.2115, time 8679.49ms, mfu 4.83%
iter 720: loss 2.2509, time 8688.21ms, mfu 4.83%
iter 721: loss 2.4627, time 8674.88ms, mfu 4.83%
iter 722: loss 2.4523, time 8686.09ms, mfu 4.83%
iter 723: loss 2.4939, time 8684.56ms, mfu 4.83%
iter 724: loss 2.4933, time 8692.35ms, mfu 4.83%
iter 725: loss 2.4709, time 8676.31ms, mfu 4.83%
iter 726: loss 2.4483, time 8664.72ms, mfu 4.83%
iter 727: loss 2.4127, time 8666.06ms, mfu 4.83%
iter 728: loss 2.3384, time 8680.67ms, mfu 4.83%
iter 729: loss 2.4702, time 8658.75ms, mfu 4.83%
iter 730: loss 2.4720, time 8665.76ms, mfu 4.83%
iter 731: loss 2.4992, time 8652.66ms, mfu 4.83%
iter 732: loss 2.2717, time 8640.85ms, mfu 4.84%
iter 733: loss 2.6875, time 8646.02ms, mfu 4.84%
iter 734: loss 2.7360, time 8665.48ms, mfu 4.84%
iter 735: loss 2.4250, time 8648.83ms, mfu 4.84%
iter 736: loss 2.8403, time 8651.67ms, mfu 4.84%
iter 737: loss 2.6083, time 8655.74ms, mfu 4.84%
iter 738: loss 2.6208, time 8647.63ms, mfu 4.84%
iter 739: loss 2.2349, time 8646.14ms, mfu 4.84%
iter 740: loss 2.4456, time 8660.51ms, mfu 4.84%
iter 741: loss 2.6212, time 8673.37ms, mfu 4.84%
iter 742: loss 2.1929, time 8658.77ms, mfu 4.84%
iter 743: loss 2.2917, time 8670.94ms, mfu 4.84%
iter 744: loss 1.9926, time 8687.66ms, mfu 4.84%
iter 745: loss 2.5111, time 8678.66ms, mfu 4.84%
iter 746: loss 2.4027, time 8679.34ms, mfu 4.84%
iter 747: loss 2.1879, time 8670.64ms, mfu 4.84%
iter 748: loss 2.0431, time 8663.75ms, mfu 4.84%
iter 749: loss 2.3332, time 8643.29ms, mfu 4.84%
iter 750: loss 2.2117, time 8655.51ms, mfu 4.84%
iter 751: loss 2.7839, time 8657.95ms, mfu 4.84%
iter 752: loss 2.4317, time 8649.15ms, mfu 4.84%
iter 753: loss 2.5504, time 8652.58ms, mfu 4.84%
iter 754: loss 2.3120, time 8653.99ms, mfu 4.84%
iter 755: loss 2.3204, time 8664.43ms, mfu 4.84%
iter 756: loss 2.5266, time 8677.49ms, mfu 4.84%
iter 757: loss 2.4791, time 8657.61ms, mfu 4.84%
iter 758: loss 2.5697, time 8677.25ms, mfu 4.84%
iter 759: loss 2.3885, time 8674.07ms, mfu 4.84%
iter 760: loss 2.3748, time 8673.85ms, mfu 4.84%
iter 761: loss 2.1206, time 8675.82ms, mfu 4.84%
iter 762: loss 2.2737, time 8690.87ms, mfu 4.83%
iter 763: loss 2.4199, time 8686.80ms, mfu 4.83%
iter 764: loss 2.3890, time 8687.07ms, mfu 4.83%
iter 765: loss 2.5465, time 8688.06ms, mfu 4.83%
iter 766: loss 2.1928, time 8690.09ms, mfu 4.83%
iter 767: loss 2.8608, time 8676.41ms, mfu 4.83%
iter 768: loss 2.3825, time 8681.67ms, mfu 4.83%
iter 769: loss 2.5774, time 8682.17ms, mfu 4.83%
iter 770: loss 2.0420, time 8675.91ms, mfu 4.83%
iter 771: loss 2.6755, time 8670.91ms, mfu 4.83%
iter 772: loss 2.4884, time 8644.14ms, mfu 4.83%
iter 773: loss 2.3940, time 8643.04ms, mfu 4.83%
iter 774: loss 2.4640, time 8639.33ms, mfu 4.84%
iter 775: loss 2.2849, time 8628.91ms, mfu 4.84%
iter 776: loss 2.8129, time 8656.91ms, mfu 4.84%
iter 777: loss 2.9733, time 8638.36ms, mfu 4.84%
iter 778: loss 2.2054, time 8653.92ms, mfu 4.84%
iter 779: loss 2.4064, time 8664.92ms, mfu 4.84%
iter 780: loss 2.4049, time 8660.38ms, mfu 4.84%
iter 781: loss 2.2324, time 8680.56ms, mfu 4.84%
iter 782: loss 2.8412, time 8691.38ms, mfu 4.84%
iter 783: loss 2.2741, time 8682.48ms, mfu 4.84%
iter 784: loss 2.2453, time 8669.22ms, mfu 4.84%
iter 785: loss 2.2904, time 8684.83ms, mfu 4.83%
iter 786: loss 2.3911, time 8689.01ms, mfu 4.83%
iter 787: loss 2.6349, time 8660.26ms, mfu 4.83%
iter 788: loss 2.2199, time 8651.83ms, mfu 4.84%
iter 789: loss 2.4132, time 8681.67ms, mfu 4.83%
iter 790: loss 2.2740, time 8666.80ms, mfu 4.83%
iter 791: loss 2.5929, time 8668.39ms, mfu 4.83%
iter 792: loss 2.4299, time 8687.36ms, mfu 4.83%
iter 793: loss 2.3502, time 8691.21ms, mfu 4.83%
iter 794: loss 2.0241, time 8675.10ms, mfu 4.83%
iter 795: loss 2.2606, time 8684.62ms, mfu 4.83%
iter 796: loss 2.0902, time 8673.69ms, mfu 4.83%
iter 797: loss 2.2309, time 8673.72ms, mfu 4.83%
iter 798: loss 2.0164, time 8682.76ms, mfu 4.83%
iter 799: loss 2.7512, time 8667.58ms, mfu 4.83%
iter 800: loss 2.1311, time 8664.01ms, mfu 4.83%
iter 801: loss 2.4181, time 8656.38ms, mfu 4.83%
iter 802: loss 2.8373, time 8652.04ms, mfu 4.83%
iter 803: loss 2.4376, time 8631.43ms, mfu 4.84%
iter 804: loss 3.1380, time 8637.49ms, mfu 4.84%
iter 805: loss 2.7780, time 8633.23ms, mfu 4.84%
iter 806: loss 2.3303, time 8640.51ms, mfu 4.84%
iter 807: loss 2.5322, time 8648.63ms, mfu 4.84%
iter 808: loss 2.4337, time 8648.88ms, mfu 4.84%
iter 809: loss 2.4054, time 8668.76ms, mfu 4.84%
iter 810: loss 2.2234, time 8659.23ms, mfu 4.84%
iter 811: loss 1.8040, time 8666.64ms, mfu 4.84%
iter 812: loss 2.4289, time 8674.45ms, mfu 4.84%
iter 813: loss 2.5061, time 8680.00ms, mfu 4.84%
iter 814: loss 1.8701, time 8671.14ms, mfu 4.84%
iter 815: loss 2.2755, time 8689.49ms, mfu 4.84%
iter 816: loss 2.0300, time 8686.49ms, mfu 4.84%
iter 817: loss 2.0331, time 8675.97ms, mfu 4.84%
iter 818: loss 2.3605, time 8657.80ms, mfu 4.84%
iter 819: loss 2.2095, time 8655.46ms, mfu 4.84%
iter 820: loss 2.1069, time 8646.32ms, mfu 4.84%
iter 821: loss 2.1041, time 8641.06ms, mfu 4.84%
iter 822: loss 1.9896, time 8648.99ms, mfu 4.84%
iter 823: loss 2.1027, time 8646.21ms, mfu 4.84%
iter 824: loss 2.7316, time 8631.72ms, mfu 4.84%
iter 825: loss 2.2668, time 8640.57ms, mfu 4.84%
iter 826: loss 2.3068, time 8628.30ms, mfu 4.84%
iter 827: loss 2.2890, time 8624.02ms, mfu 4.85%
iter 828: loss 1.8854, time 8640.65ms, mfu 4.85%
iter 829: loss 2.2078, time 8633.42ms, mfu 4.85%
iter 830: loss 2.0955, time 8641.63ms, mfu 4.85%
iter 831: loss 2.3729, time 8621.94ms, mfu 4.85%
iter 832: loss 2.2711, time 8647.20ms, mfu 4.85%
iter 833: loss 2.4814, time 8650.66ms, mfu 4.85%
iter 834: loss 2.5200, time 8630.06ms, mfu 4.85%
iter 835: loss 2.2827, time 8644.56ms, mfu 4.85%
iter 836: loss 2.1315, time 8651.85ms, mfu 4.85%
iter 837: loss 2.3794, time 8659.15ms, mfu 4.85%
iter 838: loss 2.4544, time 8665.77ms, mfu 4.85%
iter 839: loss 2.6497, time 8658.30ms, mfu 4.85%
iter 840: loss 2.3769, time 8664.60ms, mfu 4.85%
iter 841: loss 1.9111, time 8669.42ms, mfu 4.84%
iter 842: loss 2.0941, time 8660.15ms, mfu 4.84%
iter 843: loss 2.0086, time 8666.47ms, mfu 4.84%
iter 844: loss 2.2060, time 8679.26ms, mfu 4.84%
iter 845: loss 2.5718, time 8681.84ms, mfu 4.84%
iter 846: loss 2.6641, time 8686.52ms, mfu 4.84%
iter 847: loss 2.0136, time 8691.40ms, mfu 4.84%
iter 848: loss 2.4067, time 8676.90ms, mfu 4.84%
iter 849: loss 2.1732, time 8675.57ms, mfu 4.84%
iter 850: loss 2.2051, time 8679.29ms, mfu 4.84%
iter 851: loss 1.7456, time 8682.64ms, mfu 4.83%
iter 852: loss 2.0610, time 8672.91ms, mfu 4.83%
iter 853: loss 2.1243, time 8672.02ms, mfu 4.83%
iter 854: loss 2.0067, time 8653.59ms, mfu 4.84%
iter 855: loss 1.9455, time 8623.57ms, mfu 4.84%
iter 856: loss 1.8256, time 8639.15ms, mfu 4.84%
iter 857: loss 2.2401, time 8651.37ms, mfu 4.84%
iter 858: loss 2.5038, time 8665.78ms, mfu 4.84%
iter 859: loss 2.0689, time 8670.32ms, mfu 4.84%
iter 860: loss 2.1552, time 8698.89ms, mfu 4.84%
iter 861: loss 2.3414, time 8703.92ms, mfu 4.83%
iter 862: loss 2.1148, time 8687.50ms, mfu 4.83%
iter 863: loss 2.3830, time 8685.18ms, mfu 4.83%
iter 864: loss 1.8281, time 8666.10ms, mfu 4.83%
iter 865: loss 2.1354, time 8663.84ms, mfu 4.83%
iter 866: loss 2.3371, time 8673.74ms, mfu 4.83%
iter 867: loss 2.1882, time 8668.16ms, mfu 4.83%
iter 868: loss 2.2561, time 8660.35ms, mfu 4.83%
iter 869: loss 2.2050, time 8672.89ms, mfu 4.83%
iter 870: loss 2.1418, time 8653.86ms, mfu 4.83%
iter 871: loss 2.3277, time 8665.21ms, mfu 4.84%
iter 872: loss 1.9097, time 8662.82ms, mfu 4.84%
iter 873: loss 2.2660, time 8663.61ms, mfu 4.84%
iter 874: loss 2.4248, time 8665.10ms, mfu 4.84%
iter 875: loss 2.0572, time 8654.73ms, mfu 4.84%
iter 876: loss 2.3292, time 8661.69ms, mfu 4.84%
iter 877: loss 2.1836, time 8636.70ms, mfu 4.84%
iter 878: loss 2.6857, time 8654.46ms, mfu 4.84%
iter 879: loss 2.4095, time 8637.75ms, mfu 4.84%
iter 880: loss 2.1184, time 8643.99ms, mfu 4.84%
iter 881: loss 2.3021, time 8637.19ms, mfu 4.84%
iter 882: loss 2.2769, time 8645.13ms, mfu 4.84%
iter 883: loss 2.1263, time 8660.66ms, mfu 4.84%
iter 884: loss 1.9796, time 8642.50ms, mfu 4.84%
iter 885: loss 2.1766, time 8652.59ms, mfu 4.84%
iter 886: loss 2.1935, time 8636.51ms, mfu 4.84%
iter 887: loss 2.2927, time 8635.20ms, mfu 4.85%
iter 888: loss 2.2546, time 8655.21ms, mfu 4.84%
iter 889: loss 2.5126, time 8651.89ms, mfu 4.84%
iter 890: loss 1.8852, time 8649.63ms, mfu 4.84%
iter 891: loss 1.8434, time 8661.16ms, mfu 4.84%
iter 892: loss 2.0295, time 8662.20ms, mfu 4.84%
iter 893: loss 1.9722, time 8658.81ms, mfu 4.84%
iter 894: loss 2.2568, time 8636.67ms, mfu 4.84%
iter 895: loss 2.1139, time 8636.03ms, mfu 4.85%
iter 896: loss 2.0771, time 8625.66ms, mfu 4.85%
iter 897: loss 2.3600, time 8628.93ms, mfu 4.85%
iter 898: loss 1.9744, time 8639.25ms, mfu 4.85%
iter 899: loss 2.2936, time 8654.48ms, mfu 4.85%
iter 900: loss 2.0987, time 8669.82ms, mfu 4.85%
iter 901: loss 2.0020, time 8698.99ms, mfu 4.84%
iter 902: loss 2.1931, time 8684.48ms, mfu 4.84%
iter 903: loss 2.2089, time 8656.08ms, mfu 4.84%
iter 904: loss 2.1851, time 8650.26ms, mfu 4.84%
iter 905: loss 2.5468, time 8650.78ms, mfu 4.84%
iter 906: loss 2.0638, time 8637.83ms, mfu 4.84%
iter 907: loss 2.0356, time 8624.67ms, mfu 4.85%
iter 908: loss 2.4536, time 8623.13ms, mfu 4.85%
iter 909: loss 2.2901, time 8619.98ms, mfu 4.85%
iter 910: loss 2.1091, time 8610.50ms, mfu 4.85%
iter 911: loss 2.3121, time 8615.83ms, mfu 4.85%
iter 912: loss 2.3065, time 8614.26ms, mfu 4.85%
iter 913: loss 2.0007, time 8615.36ms, mfu 4.85%
iter 914: loss 1.8559, time 8627.18ms, mfu 4.85%
iter 915: loss 2.1905, time 8630.72ms, mfu 4.85%
iter 916: loss 2.2706, time 8643.94ms, mfu 4.85%
iter 917: loss 2.0856, time 8649.61ms, mfu 4.85%
iter 918: loss 1.8995, time 8643.71ms, mfu 4.85%
iter 919: loss 2.2470, time 8650.50ms, mfu 4.85%
iter 920: loss 2.1344, time 8618.98ms, mfu 4.85%
iter 921: loss 2.1240, time 8619.67ms, mfu 4.85%
iter 922: loss 2.1387, time 8620.16ms, mfu 4.85%
iter 923: loss 2.2419, time 8614.15ms, mfu 4.86%
iter 924: loss 2.3215, time 8637.11ms, mfu 4.86%
iter 925: loss 1.8266, time 8648.28ms, mfu 4.85%
iter 926: loss 2.1055, time 8663.54ms, mfu 4.85%
iter 927: loss 1.9792, time 8682.27ms, mfu 4.85%
iter 928: loss 2.2986, time 8691.85ms, mfu 4.85%
iter 929: loss 1.8192, time 8678.94ms, mfu 4.85%
iter 930: loss 2.2478, time 8663.76ms, mfu 4.84%
iter 931: loss 1.9053, time 8648.32ms, mfu 4.85%
iter 932: loss 1.9951, time 8650.24ms, mfu 4.85%
iter 933: loss 2.1759, time 8651.36ms, mfu 4.85%
iter 934: loss 2.0489, time 8660.82ms, mfu 4.84%
iter 935: loss 1.7658, time 8626.43ms, mfu 4.85%
iter 936: loss 2.5158, time 8629.32ms, mfu 4.85%
iter 937: loss 2.0632, time 8650.66ms, mfu 4.85%
iter 938: loss 2.0642, time 8637.61ms, mfu 4.85%
iter 939: loss 2.3880, time 8647.20ms, mfu 4.85%
iter 940: loss 1.8211, time 8636.17ms, mfu 4.85%
iter 941: loss 2.2613, time 8654.94ms, mfu 4.85%
iter 942: loss 1.8883, time 8639.95ms, mfu 4.85%
iter 943: loss 1.9180, time 8655.72ms, mfu 4.85%
iter 944: loss 2.1608, time 8649.60ms, mfu 4.85%
iter 945: loss 2.0622, time 8665.96ms, mfu 4.85%
iter 946: loss 2.3906, time 8657.63ms, mfu 4.85%
iter 947: loss 2.0849, time 8665.33ms, mfu 4.84%
iter 948: loss 2.3849, time 8657.47ms, mfu 4.84%
iter 949: loss 2.0250, time 8662.10ms, mfu 4.84%
iter 950: loss 1.8718, time 8647.34ms, mfu 4.84%
iter 951: loss 2.2879, time 8665.99ms, mfu 4.84%
iter 952: loss 2.3471, time 8665.60ms, mfu 4.84%
iter 953: loss 2.5668, time 8663.21ms, mfu 4.84%
iter 954: loss 1.8589, time 8661.88ms, mfu 4.84%
iter 955: loss 2.2227, time 8662.85ms, mfu 4.84%
iter 956: loss 2.0354, time 8647.74ms, mfu 4.84%
iter 957: loss 1.8215, time 8660.39ms, mfu 4.84%
iter 958: loss 2.3485, time 8675.12ms, mfu 4.84%
iter 959: loss 2.4595, time 8662.45ms, mfu 4.84%
iter 960: loss 1.9506, time 8663.54ms, mfu 4.84%
iter 961: loss 2.0021, time 8655.17ms, mfu 4.84%
iter 962: loss 2.0918, time 8663.83ms, mfu 4.84%
iter 963: loss 2.2048, time 8664.51ms, mfu 4.84%
iter 964: loss 2.0278, time 8666.68ms, mfu 4.84%
iter 965: loss 2.3662, time 8678.37ms, mfu 4.84%
iter 966: loss 2.0605, time 8668.03ms, mfu 4.84%
iter 967: loss 1.6682, time 8672.62ms, mfu 4.84%
iter 968: loss 2.1648, time 8679.96ms, mfu 4.84%
iter 969: loss 2.2571, time 8687.22ms, mfu 4.84%
iter 970: loss 2.1135, time 8665.93ms, mfu 4.84%
iter 971: loss 2.3212, time 8678.92ms, mfu 4.84%
iter 972: loss 2.3870, time 8689.47ms, mfu 4.83%
iter 973: loss 2.2199, time 8664.80ms, mfu 4.83%
iter 974: loss 2.2061, time 8656.64ms, mfu 4.83%
iter 975: loss 2.2833, time 8645.67ms, mfu 4.84%
iter 976: loss 2.0661, time 8645.70ms, mfu 4.84%
iter 977: loss 2.3967, time 8662.36ms, mfu 4.84%
iter 978: loss 2.2126, time 8661.00ms, mfu 4.84%
iter 979: loss 1.9587, time 8640.61ms, mfu 4.84%
iter 980: loss 1.8211, time 8644.86ms, mfu 4.84%
iter 981: loss 1.9849, time 8633.23ms, mfu 4.84%
iter 982: loss 2.1852, time 8641.95ms, mfu 4.84%
iter 983: loss 2.5230, time 8647.11ms, mfu 4.84%
iter 984: loss 2.1272, time 8631.24ms, mfu 4.84%
iter 985: loss 1.5338, time 8633.10ms, mfu 4.85%
iter 986: loss 2.6286, time 8642.28ms, mfu 4.85%
iter 987: loss 2.3081, time 8646.68ms, mfu 4.85%
iter 988: loss 2.7219, time 8630.03ms, mfu 4.85%
iter 989: loss 1.9672, time 8634.17ms, mfu 4.85%
iter 990: loss 2.3856, time 8643.81ms, mfu 4.85%
iter 991: loss 1.7661, time 8624.88ms, mfu 4.85%
iter 992: loss 2.2479, time 8636.27ms, mfu 4.85%
iter 993: loss 2.2443, time 8654.13ms, mfu 4.85%
iter 994: loss 1.9488, time 8653.58ms, mfu 4.85%
iter 995: loss 2.0138, time 8672.98ms, mfu 4.85%
iter 996: loss 2.1428, time 8674.86ms, mfu 4.85%
iter 997: loss 1.9275, time 8687.95ms, mfu 4.84%
iter 998: loss 2.0441, time 8691.68ms, mfu 4.84%
iter 999: loss 2.0458, time 8674.72ms, mfu 4.84%
iter 1000: loss 2.0687, time 8681.03ms, mfu 4.84%
iter 1001: loss 1.9258, time 8670.45ms, mfu 4.84%
iter 1002: loss 1.8871, time 8673.65ms, mfu 4.84%
iter 1003: loss 1.9055, time 8681.94ms, mfu 4.84%
iter 1004: loss 2.4176, time 8678.95ms, mfu 4.84%
iter 1005: loss 2.2812, time 8680.18ms, mfu 4.84%
iter 1006: loss 2.1293, time 8678.82ms, mfu 4.83%
iter 1007: loss 1.7835, time 8682.49ms, mfu 4.83%
iter 1008: loss 2.2694, time 8676.94ms, mfu 4.83%
iter 1009: loss 1.8915, time 8667.67ms, mfu 4.83%
iter 1010: loss 2.0493, time 8684.32ms, mfu 4.83%
iter 1011: loss 1.9158, time 8668.60ms, mfu 4.83%
iter 1012: loss 2.4123, time 8669.08ms, mfu 4.83%
iter 1013: loss 2.0555, time 8672.32ms, mfu 4.83%
iter 1014: loss 2.2553, time 8661.81ms, mfu 4.83%
iter 1015: loss 2.1858, time 8655.27ms, mfu 4.83%
iter 1016: loss 1.8468, time 8639.79ms, mfu 4.84%
iter 1017: loss 2.0811, time 8664.24ms, mfu 4.84%
iter 1018: loss 2.0125, time 8643.90ms, mfu 4.84%
iter 1019: loss 2.2507, time 8645.75ms, mfu 4.84%
iter 1020: loss 2.0830, time 8645.49ms, mfu 4.84%
iter 1021: loss 2.3255, time 8646.77ms, mfu 4.84%
iter 1022: loss 2.1003, time 8668.52ms, mfu 4.84%
iter 1023: loss 1.8687, time 8653.83ms, mfu 4.84%
iter 1024: loss 1.9889, time 8643.72ms, mfu 4.84%
iter 1025: loss 1.9850, time 8653.37ms, mfu 4.84%
iter 1026: loss 1.9380, time 8668.53ms, mfu 4.84%
iter 1027: loss 1.9225, time 8667.95ms, mfu 4.84%
iter 1028: loss 1.7988, time 8677.55ms, mfu 4.84%
iter 1029: loss 1.8731, time 8655.65ms, mfu 4.84%
iter 1030: loss 1.9371, time 8668.53ms, mfu 4.84%
iter 1031: loss 1.8783, time 8666.10ms, mfu 4.84%
iter 1032: loss 2.1740, time 8670.37ms, mfu 4.84%
iter 1033: loss 2.0343, time 8664.27ms, mfu 4.84%
iter 1034: loss 2.1180, time 8677.83ms, mfu 4.84%
iter 1035: loss 2.4840, time 8661.68ms, mfu 4.84%
iter 1036: loss 1.9678, time 8681.63ms, mfu 4.84%
iter 1037: loss 2.2802, time 8675.47ms, mfu 4.84%
iter 1038: loss 1.8350, time 8641.66ms, mfu 4.84%
iter 1039: loss 2.4062, time 8645.52ms, mfu 4.84%
iter 1040: loss 2.1192, time 8650.39ms, mfu 4.84%
iter 1041: loss 1.7705, time 8654.04ms, mfu 4.84%
iter 1042: loss 2.2824, time 8659.78ms, mfu 4.84%
iter 1043: loss 1.9954, time 8640.73ms, mfu 4.84%
iter 1044: loss 1.9364, time 8641.21ms, mfu 4.84%
iter 1045: loss 2.1229, time 8632.13ms, mfu 4.84%
iter 1046: loss 1.8553, time 8644.80ms, mfu 4.84%
iter 1047: loss 2.0978, time 8640.21ms, mfu 4.84%
iter 1048: loss 1.8978, time 8638.05ms, mfu 4.85%
iter 1049: loss 2.0075, time 8632.83ms, mfu 4.85%
iter 1050: loss 2.0257, time 8616.30ms, mfu 4.85%
iter 1051: loss 1.8068, time 8622.66ms, mfu 4.85%
iter 1052: loss 1.8917, time 8633.61ms, mfu 4.85%
iter 1053: loss 1.8845, time 8643.48ms, mfu 4.85%
iter 1054: loss 1.8544, time 8642.34ms, mfu 4.85%
iter 1055: loss 2.1286, time 8650.05ms, mfu 4.85%
iter 1056: loss 2.2201, time 8661.72ms, mfu 4.85%
iter 1057: loss 2.1691, time 8656.90ms, mfu 4.85%
iter 1058: loss 2.1533, time 8644.39ms, mfu 4.85%
iter 1059: loss 1.9748, time 8661.87ms, mfu 4.85%
iter 1060: loss 1.8495, time 8660.76ms, mfu 4.85%
iter 1061: loss 2.1472, time 8672.73ms, mfu 4.84%
iter 1062: loss 2.2816, time 8665.71ms, mfu 4.84%
iter 1063: loss 2.0611, time 8663.71ms, mfu 4.84%
iter 1064: loss 2.0422, time 8670.25ms, mfu 4.84%
iter 1065: loss 2.7229, time 8649.80ms, mfu 4.84%
iter 1066: loss 1.9583, time 8646.74ms, mfu 4.84%
iter 1067: loss 2.0222, time 8641.01ms, mfu 4.84%
iter 1068: loss 2.1312, time 8645.68ms, mfu 4.84%
iter 1069: loss 1.8924, time 8640.77ms, mfu 4.84%
iter 1070: loss 1.8623, time 8649.57ms, mfu 4.84%
iter 1071: loss 2.2180, time 8646.93ms, mfu 4.85%
iter 1072: loss 1.9593, time 8651.68ms, mfu 4.85%
iter 1073: loss 1.9115, time 8630.62ms, mfu 4.85%
iter 1074: loss 2.4096, time 8648.32ms, mfu 4.85%
iter 1075: loss 2.2738, time 8661.42ms, mfu 4.85%
iter 1076: loss 1.9644, time 8642.31ms, mfu 4.85%
iter 1077: loss 1.9090, time 8657.94ms, mfu 4.85%
iter 1078: loss 2.1436, time 8656.25ms, mfu 4.85%
iter 1079: loss 1.9761, time 8650.41ms, mfu 4.85%
iter 1080: loss 2.1602, time 8641.78ms, mfu 4.85%
iter 1081: loss 2.1133, time 8657.15ms, mfu 4.85%
iter 1082: loss 2.1049, time 8653.89ms, mfu 4.84%
iter 1083: loss 1.9072, time 8643.57ms, mfu 4.85%
iter 1084: loss 2.4493, time 8629.13ms, mfu 4.85%
iter 1085: loss 1.8030, time 8646.73ms, mfu 4.85%
iter 1086: loss 1.9460, time 8660.72ms, mfu 4.85%
iter 1087: loss 2.0726, time 8664.72ms, mfu 4.85%
iter 1088: loss 1.9652, time 8645.27ms, mfu 4.85%
iter 1089: loss 1.9847, time 8653.39ms, mfu 4.85%
iter 1090: loss 2.3418, time 8667.95ms, mfu 4.84%
iter 1091: loss 2.3839, time 8700.07ms, mfu 4.84%
iter 1092: loss 1.9647, time 8684.27ms, mfu 4.84%
iter 1093: loss 2.2534, time 8677.89ms, mfu 4.84%
iter 1094: loss 1.9677, time 8661.98ms, mfu 4.84%
iter 1095: loss 2.5311, time 8666.96ms, mfu 4.84%
iter 1096: loss 2.0815, time 8647.75ms, mfu 4.84%
iter 1097: loss 2.0470, time 8644.91ms, mfu 4.84%
iter 1098: loss 2.1777, time 8668.93ms, mfu 4.84%
iter 1099: loss 2.1275, time 8661.03ms, mfu 4.84%
iter 1100: loss 2.2520, time 8660.67ms, mfu 4.84%
iter 1101: loss 2.2388, time 8655.62ms, mfu 4.84%
iter 1102: loss 2.2674, time 8651.85ms, mfu 4.84%
iter 1103: loss 1.5940, time 8658.21ms, mfu 4.84%
iter 1104: loss 2.0176, time 8653.13ms, mfu 4.84%
iter 1105: loss 1.7465, time 8657.65ms, mfu 4.84%
iter 1106: loss 2.1046, time 8652.45ms, mfu 4.84%
iter 1107: loss 1.8977, time 8664.09ms, mfu 4.84%
iter 1108: loss 1.9956, time 8667.81ms, mfu 4.84%
iter 1109: loss 1.8870, time 8686.76ms, mfu 4.84%
iter 1110: loss 2.0072, time 8673.11ms, mfu 4.84%
iter 1111: loss 1.8188, time 8668.35ms, mfu 4.84%
iter 1112: loss 1.8852, time 8658.22ms, mfu 4.84%
iter 1113: loss 2.0438, time 8646.72ms, mfu 4.84%
iter 1114: loss 2.1543, time 8643.14ms, mfu 4.84%
iter 1115: loss 2.3431, time 8648.46ms, mfu 4.84%
iter 1116: loss 1.9816, time 8664.88ms, mfu 4.84%
iter 1117: loss 1.9266, time 8649.83ms, mfu 4.84%
iter 1118: loss 1.9614, time 8656.50ms, mfu 4.84%
iter 1119: loss 1.8379, time 8665.00ms, mfu 4.84%
iter 1120: loss 1.9771, time 8653.20ms, mfu 4.84%
iter 1121: loss 2.2930, time 8628.70ms, mfu 4.84%
iter 1122: loss 1.9815, time 8638.05ms, mfu 4.84%
iter 1123: loss 2.0637, time 8631.33ms, mfu 4.84%
iter 1124: loss 1.7444, time 8657.00ms, mfu 4.84%
iter 1125: loss 2.0056, time 8629.70ms, mfu 4.85%
iter 1126: loss 1.7579, time 8659.47ms, mfu 4.85%
iter 1127: loss 2.1461, time 8655.85ms, mfu 4.84%
iter 1128: loss 1.8163, time 8655.37ms, mfu 4.84%
iter 1129: loss 1.7906, time 8656.76ms, mfu 4.84%
iter 1130: loss 2.2243, time 8655.70ms, mfu 4.84%
iter 1131: loss 1.9427, time 8661.27ms, mfu 4.84%
iter 1132: loss 2.2269, time 8652.95ms, mfu 4.84%
iter 1133: loss 2.4695, time 8663.76ms, mfu 4.84%
iter 1134: loss 1.9799, time 8663.50ms, mfu 4.84%
iter 1135: loss 1.8280, time 8662.77ms, mfu 4.84%
iter 1136: loss 2.1076, time 8664.73ms, mfu 4.84%
iter 1137: loss 1.8031, time 8667.21ms, mfu 4.84%
iter 1138: loss 2.1160, time 8662.48ms, mfu 4.84%
iter 1139: loss 1.7619, time 8658.89ms, mfu 4.84%
iter 1140: loss 1.9949, time 8660.46ms, mfu 4.84%
iter 1141: loss 2.1185, time 8664.63ms, mfu 4.84%
iter 1142: loss 1.6058, time 8664.54ms, mfu 4.84%
iter 1143: loss 2.0216, time 8673.71ms, mfu 4.84%
iter 1144: loss 2.3268, time 8663.10ms, mfu 4.84%
iter 1145: loss 1.7562, time 8657.23ms, mfu 4.84%
iter 1146: loss 1.6811, time 8695.13ms, mfu 4.84%
iter 1147: loss 1.8889, time 8676.61ms, mfu 4.84%
iter 1148: loss 2.1142, time 8658.44ms, mfu 4.84%
iter 1149: loss 2.0872, time 8681.62ms, mfu 4.84%
iter 1150: loss 2.1173, time 8677.26ms, mfu 4.84%
iter 1151: loss 2.1076, time 8654.12ms, mfu 4.84%
iter 1152: loss 1.8990, time 8646.63ms, mfu 4.84%
iter 1153: loss 2.2619, time 8654.15ms, mfu 4.84%
iter 1154: loss 1.8723, time 8641.51ms, mfu 4.84%
iter 1155: loss 2.2276, time 8651.52ms, mfu 4.84%
iter 1156: loss 2.1474, time 8633.31ms, mfu 4.84%
iter 1157: loss 2.2931, time 8618.59ms, mfu 4.84%
iter 1158: loss 2.0024, time 8619.97ms, mfu 4.85%
iter 1159: loss 1.6519, time 8640.15ms, mfu 4.85%
iter 1160: loss 2.3259, time 8650.07ms, mfu 4.85%
iter 1161: loss 2.4254, time 8627.08ms, mfu 4.85%
iter 1162: loss 2.1132, time 8634.56ms, mfu 4.85%
iter 1163: loss 2.1285, time 8626.62ms, mfu 4.85%
iter 1164: loss 2.0669, time 8619.40ms, mfu 4.85%
iter 1165: loss 2.1278, time 8620.66ms, mfu 4.85%
iter 1166: loss 2.0217, time 8630.99ms, mfu 4.85%
iter 1167: loss 1.9270, time 8641.31ms, mfu 4.85%
iter 1168: loss 1.7001, time 8647.20ms, mfu 4.85%
iter 1169: loss 1.8529, time 8640.36ms, mfu 4.85%
iter 1170: loss 1.7611, time 8655.27ms, mfu 4.85%
iter 1171: loss 2.0673, time 8664.33ms, mfu 4.85%
iter 1172: loss 2.1764, time 8679.80ms, mfu 4.85%
iter 1173: loss 1.8607, time 8668.46ms, mfu 4.85%
iter 1174: loss 1.9485, time 8667.32ms, mfu 4.84%
iter 1175: loss 1.9752, time 8691.70ms, mfu 4.84%
iter 1176: loss 1.7852, time 8686.94ms, mfu 4.84%
iter 1177: loss 2.0558, time 8674.49ms, mfu 4.84%
iter 1178: loss 1.8271, time 8671.06ms, mfu 4.84%
iter 1179: loss 1.9448, time 8667.89ms, mfu 4.84%
iter 1180: loss 1.7307, time 8645.57ms, mfu 4.84%
iter 1181: loss 1.8619, time 8645.04ms, mfu 4.84%
iter 1182: loss 1.9668, time 8657.80ms, mfu 4.84%
iter 1183: loss 1.9903, time 8651.27ms, mfu 4.84%
iter 1184: loss 1.6563, time 8621.44ms, mfu 4.84%
iter 1185: loss 1.7789, time 8637.85ms, mfu 4.84%
iter 1186: loss 1.7080, time 8648.65ms, mfu 4.84%
iter 1187: loss 1.8086, time 8650.84ms, mfu 4.84%
iter 1188: loss 1.6998, time 8639.56ms, mfu 4.85%
iter 1189: loss 1.6169, time 8635.51ms, mfu 4.85%
iter 1190: loss 1.5650, time 8650.57ms, mfu 4.85%
iter 1191: loss 1.9959, time 8634.88ms, mfu 4.85%
iter 1192: loss 2.3920, time 8644.71ms, mfu 4.85%
iter 1193: loss 2.0352, time 8647.86ms, mfu 4.85%
iter 1194: loss 2.0458, time 8627.38ms, mfu 4.85%
iter 1195: loss 2.0101, time 8649.50ms, mfu 4.85%
iter 1196: loss 1.6416, time 8658.60ms, mfu 4.85%
iter 1197: loss 2.3099, time 8632.28ms, mfu 4.85%
iter 1198: loss 1.7973, time 8649.63ms, mfu 4.85%
iter 1199: loss 1.9114, time 8646.89ms, mfu 4.85%
iter 1200: loss 2.4870, time 8645.34ms, mfu 4.85%
iter 1201: loss 1.8310, time 8638.75ms, mfu 4.85%
iter 1202: loss 2.2283, time 8638.12ms, mfu 4.85%
iter 1203: loss 2.1246, time 8647.66ms, mfu 4.85%
iter 1204: loss 1.6934, time 8635.85ms, mfu 4.85%
iter 1205: loss 1.8596, time 8610.40ms, mfu 4.85%
iter 1206: loss 1.9005, time 8629.43ms, mfu 4.85%
iter 1207: loss 1.9967, time 8618.04ms, mfu 4.85%
iter 1208: loss 2.0247, time 8639.04ms, mfu 4.85%
iter 1209: loss 1.6539, time 8637.58ms, mfu 4.85%
iter 1210: loss 2.0407, time 8632.27ms, mfu 4.85%
iter 1211: loss 1.8134, time 8631.84ms, mfu 4.85%
iter 1212: loss 2.0890, time 8643.10ms, mfu 4.85%
iter 1213: loss 1.8851, time 8630.49ms, mfu 4.85%
iter 1214: loss 2.0042, time 8639.19ms, mfu 4.85%
iter 1215: loss 1.9461, time 8634.36ms, mfu 4.85%
iter 1216: loss 2.0499, time 8627.90ms, mfu 4.85%
iter 1217: loss 1.8095, time 8611.23ms, mfu 4.85%
iter 1218: loss 2.4256, time 8624.56ms, mfu 4.86%
iter 1219: loss 2.0160, time 8638.15ms, mfu 4.86%
iter 1220: loss 1.8479, time 8627.77ms, mfu 4.86%
iter 1221: loss 1.8566, time 8638.74ms, mfu 4.85%
iter 1222: loss 2.2262, time 8648.00ms, mfu 4.85%
iter 1223: loss 1.8691, time 8657.10ms, mfu 4.85%
iter 1224: loss 1.8896, time 8654.11ms, mfu 4.85%
iter 1225: loss 2.0431, time 8656.45ms, mfu 4.85%
iter 1226: loss 2.3181, time 8639.68ms, mfu 4.85%
iter 1227: loss 1.9709, time 8670.09ms, mfu 4.85%
iter 1228: loss 2.3929, time 8666.93ms, mfu 4.85%
iter 1229: loss 1.6771, time 8664.09ms, mfu 4.85%
iter 1230: loss 1.8196, time 8667.42ms, mfu 4.85%
iter 1231: loss 1.7089, time 8647.24ms, mfu 4.85%
iter 1232: loss 1.9338, time 8646.31ms, mfu 4.85%
iter 1233: loss 1.8501, time 8657.30ms, mfu 4.85%
iter 1234: loss 2.0295, time 8638.59ms, mfu 4.85%
iter 1235: loss 1.9605, time 8659.97ms, mfu 4.85%
iter 1236: loss 2.3756, time 8645.81ms, mfu 4.85%
iter 1237: loss 2.1573, time 8638.38ms, mfu 4.85%
iter 1238: loss 1.9498, time 8662.13ms, mfu 4.85%
iter 1239: loss 2.0843, time 8656.77ms, mfu 4.85%
iter 1240: loss 1.7446, time 8645.22ms, mfu 4.85%
iter 1241: loss 2.2826, time 8658.47ms, mfu 4.85%
iter 1242: loss 1.9175, time 8660.04ms, mfu 4.84%
iter 1243: loss 2.0632, time 8664.50ms, mfu 4.84%
iter 1244: loss 2.1362, time 8664.43ms, mfu 4.84%
iter 1245: loss 1.9205, time 8655.57ms, mfu 4.84%
iter 1246: loss 2.0096, time 8645.37ms, mfu 4.84%
iter 1247: loss 1.9113, time 8627.72ms, mfu 4.84%
iter 1248: loss 2.1777, time 8625.83ms, mfu 4.85%
iter 1249: loss 2.0021, time 8642.48ms, mfu 4.85%
iter 1250: loss 2.2000, time 8643.99ms, mfu 4.85%
iter 1251: loss 1.8159, time 8659.19ms, mfu 4.85%
iter 1252: loss 1.8673, time 8658.38ms, mfu 4.85%
iter 1253: loss 2.1968, time 8667.67ms, mfu 4.84%
iter 1254: loss 1.9797, time 8667.02ms, mfu 4.84%
iter 1255: loss 1.9226, time 8677.41ms, mfu 4.84%
iter 1256: loss 1.9931, time 8674.80ms, mfu 4.84%
iter 1257: loss 1.7302, time 8678.95ms, mfu 4.84%
iter 1258: loss 2.1920, time 8647.27ms, mfu 4.84%
iter 1259: loss 1.6803, time 8650.09ms, mfu 4.84%
iter 1260: loss 1.9050, time 8678.26ms, mfu 4.84%
iter 1261: loss 1.6297, time 8655.33ms, mfu 4.84%
iter 1262: loss 1.7300, time 8646.01ms, mfu 4.84%
iter 1263: loss 1.8868, time 8662.13ms, mfu 4.84%
iter 1264: loss 2.2481, time 8654.91ms, mfu 4.84%
iter 1265: loss 2.2961, time 8643.25ms, mfu 4.84%
iter 1266: loss 1.6970, time 8663.86ms, mfu 4.84%
iter 1267: loss 1.8533, time 8657.44ms, mfu 4.84%
iter 1268: loss 1.9388, time 8653.97ms, mfu 4.84%
iter 1269: loss 1.7060, time 8648.09ms, mfu 4.84%
iter 1270: loss 1.9982, time 8632.95ms, mfu 4.84%
iter 1271: loss 2.2431, time 8637.21ms, mfu 4.84%
iter 1272: loss 2.0694, time 8650.53ms, mfu 4.84%
iter 1273: loss 2.4949, time 8658.87ms, mfu 4.84%
iter 1274: loss 1.9327, time 8646.62ms, mfu 4.84%
iter 1275: loss 1.5493, time 8644.51ms, mfu 4.84%
iter 1276: loss 2.1479, time 8648.41ms, mfu 4.84%
iter 1277: loss 1.7543, time 8649.57ms, mfu 4.84%
iter 1278: loss 1.6045, time 8644.15ms, mfu 4.85%
iter 1279: loss 1.7160, time 8640.19ms, mfu 4.85%
iter 1280: loss 1.9836, time 8660.39ms, mfu 4.85%
iter 1281: loss 1.8876, time 8645.26ms, mfu 4.85%
iter 1282: loss 1.9914, time 8636.47ms, mfu 4.85%
iter 1283: loss 1.9747, time 8672.27ms, mfu 4.84%
iter 1284: loss 2.0094, time 8655.73ms, mfu 4.84%
iter 1285: loss 1.9861, time 8630.84ms, mfu 4.85%
iter 1286: loss 2.0489, time 8653.44ms, mfu 4.85%
iter 1287: loss 1.7027, time 8641.15ms, mfu 4.85%
iter 1288: loss 2.0254, time 8643.69ms, mfu 4.85%
iter 1289: loss 2.2194, time 8643.06ms, mfu 4.85%
iter 1290: loss 2.2451, time 8645.54ms, mfu 4.85%
iter 1291: loss 1.8097, time 8651.54ms, mfu 4.85%
iter 1292: loss 1.6661, time 8651.87ms, mfu 4.85%
iter 1293: loss 2.0120, time 8655.93ms, mfu 4.85%
iter 1294: loss 1.6480, time 8655.43ms, mfu 4.85%
iter 1295: loss 2.1255, time 8656.51ms, mfu 4.85%
iter 1296: loss 1.8569, time 8653.35ms, mfu 4.85%
iter 1297: loss 2.2080, time 8653.59ms, mfu 4.84%
iter 1298: loss 1.8384, time 8647.69ms, mfu 4.85%
iter 1299: loss 1.8157, time 8642.65ms, mfu 4.85%
iter 1300: loss 1.9570, time 8657.95ms, mfu 4.85%
iter 1301: loss 1.9995, time 8654.21ms, mfu 4.84%
iter 1302: loss 1.7847, time 8646.05ms, mfu 4.85%
iter 1303: loss 1.8805, time 8653.26ms, mfu 4.84%
iter 1304: loss 1.9564, time 8642.82ms, mfu 4.85%
iter 1305: loss 1.9966, time 8647.86ms, mfu 4.85%
iter 1306: loss 2.0268, time 8644.30ms, mfu 4.85%
iter 1307: loss 1.9515, time 8660.61ms, mfu 4.85%
iter 1308: loss 1.9683, time 8660.09ms, mfu 4.84%
iter 1309: loss 2.2391, time 8673.99ms, mfu 4.84%
iter 1310: loss 1.9657, time 8684.77ms, mfu 4.84%
iter 1311: loss 2.1114, time 8662.94ms, mfu 4.84%
iter 1312: loss 2.1041, time 8668.74ms, mfu 4.84%
iter 1313: loss 2.1638, time 8664.87ms, mfu 4.84%
iter 1314: loss 1.7774, time 8660.95ms, mfu 4.84%
iter 1315: loss 2.3196, time 8659.28ms, mfu 4.84%
iter 1316: loss 1.9403, time 8656.88ms, mfu 4.84%
iter 1317: loss 1.6012, time 8652.56ms, mfu 4.84%
iter 1318: loss 2.1925, time 8673.15ms, mfu 4.84%
iter 1319: loss 1.9085, time 8674.75ms, mfu 4.84%
iter 1320: loss 1.9214, time 8687.92ms, mfu 4.84%
iter 1321: loss 2.1180, time 8687.23ms, mfu 4.84%
iter 1322: loss 2.0481, time 8675.73ms, mfu 4.84%
iter 1323: loss 1.7108, time 8663.74ms, mfu 4.84%
iter 1324: loss 1.6520, time 8656.46ms, mfu 4.84%
iter 1325: loss 1.9090, time 8668.31ms, mfu 4.84%
iter 1326: loss 1.7905, time 8657.89ms, mfu 4.84%
iter 1327: loss 2.2676, time 8652.18ms, mfu 4.84%
iter 1328: loss 2.1283, time 8657.31ms, mfu 4.84%
iter 1329: loss 1.8880, time 8651.88ms, mfu 4.84%
iter 1330: loss 1.8924, time 8663.84ms, mfu 4.84%
iter 1331: loss 1.9915, time 8659.63ms, mfu 4.84%
iter 1332: loss 1.8023, time 8647.27ms, mfu 4.84%
iter 1333: loss 2.0073, time 8646.59ms, mfu 4.84%
iter 1334: loss 2.0690, time 8649.92ms, mfu 4.84%
iter 1335: loss 1.6942, time 8640.71ms, mfu 4.84%
iter 1336: loss 1.8927, time 8639.60ms, mfu 4.84%
iter 1337: loss 2.1524, time 8650.74ms, mfu 4.84%
iter 1338: loss 2.0093, time 8640.04ms, mfu 4.84%
iter 1339: loss 1.8529, time 8631.57ms, mfu 4.84%
iter 1340: loss 1.9657, time 8644.47ms, mfu 4.85%
iter 1341: loss 1.6935, time 8637.98ms, mfu 4.85%
iter 1342: loss 2.2194, time 8626.53ms, mfu 4.85%
iter 1343: loss 1.8642, time 8644.12ms, mfu 4.85%
iter 1344: loss 1.8770, time 8650.74ms, mfu 4.85%
iter 1345: loss 2.0630, time 8638.51ms, mfu 4.85%
iter 1346: loss 2.4034, time 8642.92ms, mfu 4.85%
iter 1347: loss 2.0413, time 8634.10ms, mfu 4.85%
iter 1348: loss 1.6857, time 8640.15ms, mfu 4.85%
iter 1349: loss 1.9156, time 8634.87ms, mfu 4.85%
iter 1350: loss 1.9014, time 8649.01ms, mfu 4.85%
iter 1351: loss 1.6929, time 8634.89ms, mfu 4.85%
iter 1352: loss 2.1824, time 8646.23ms, mfu 4.85%
iter 1353: loss 1.9100, time 8651.83ms, mfu 4.85%
iter 1354: loss 1.7070, time 8628.28ms, mfu 4.85%
iter 1355: loss 1.8435, time 8654.01ms, mfu 4.85%
iter 1356: loss 1.9691, time 8635.86ms, mfu 4.85%
iter 1357: loss 1.9536, time 8638.50ms, mfu 4.85%
iter 1358: loss 1.9414, time 8650.89ms, mfu 4.85%
iter 1359: loss 1.7787, time 8638.71ms, mfu 4.85%
iter 1360: loss 1.6894, time 8654.42ms, mfu 4.85%
iter 1361: loss 1.9362, time 8648.90ms, mfu 4.85%
iter 1362: loss 1.9526, time 8664.63ms, mfu 4.85%
iter 1363: loss 2.1801, time 8665.74ms, mfu 4.85%
iter 1364: loss 1.6169, time 8662.74ms, mfu 4.85%
iter 1365: loss 2.1093, time 8658.83ms, mfu 4.85%
iter 1366: loss 1.8904, time 8657.64ms, mfu 4.84%
iter 1367: loss 2.1499, time 8644.11ms, mfu 4.85%
iter 1368: loss 2.0029, time 8638.64ms, mfu 4.85%
iter 1369: loss 1.9286, time 8655.64ms, mfu 4.85%
iter 1370: loss 1.7415, time 8640.78ms, mfu 4.85%
iter 1371: loss 1.7375, time 8652.41ms, mfu 4.85%
iter 1372: loss 1.9281, time 8647.97ms, mfu 4.85%
iter 1373: loss 1.6799, time 8651.41ms, mfu 4.85%
iter 1374: loss 1.8777, time 8662.94ms, mfu 4.84%
iter 1375: loss 1.9336, time 8652.47ms, mfu 4.84%
iter 1376: loss 2.3795, time 8661.77ms, mfu 4.84%
iter 1377: loss 1.6371, time 8641.30ms, mfu 4.84%
iter 1378: loss 1.9554, time 8645.77ms, mfu 4.85%
iter 1379: loss 1.8430, time 8647.63ms, mfu 4.85%
iter 1380: loss 2.4579, time 8649.84ms, mfu 4.85%
iter 1381: loss 1.8703, time 8658.80ms, mfu 4.84%
iter 1382: loss 2.0603, time 8637.60ms, mfu 4.85%
iter 1383: loss 1.7841, time 8633.58ms, mfu 4.85%
iter 1384: loss 2.0077, time 8629.28ms, mfu 4.85%
iter 1385: loss 1.8836, time 8635.62ms, mfu 4.85%
iter 1386: loss 2.1330, time 8625.37ms, mfu 4.85%
iter 1387: loss 1.4980, time 8623.45ms, mfu 4.85%
iter 1388: loss 1.8981, time 8637.08ms, mfu 4.85%
iter 1389: loss 1.7221, time 8639.67ms, mfu 4.85%
iter 1390: loss 2.3674, time 8638.83ms, mfu 4.85%
iter 1391: loss 2.3395, time 8633.16ms, mfu 4.85%
iter 1392: loss 1.8090, time 8633.58ms, mfu 4.85%
iter 1393: loss 1.8810, time 8639.68ms, mfu 4.85%
iter 1394: loss 2.1099, time 8638.18ms, mfu 4.85%
iter 1395: loss 2.2576, time 8637.02ms, mfu 4.85%
iter 1396: loss 1.9481, time 8634.95ms, mfu 4.85%
iter 1397: loss 1.8134, time 8634.84ms, mfu 4.85%
iter 1398: loss 2.0521, time 8646.73ms, mfu 4.85%
iter 1399: loss 1.7672, time 8626.88ms, mfu 4.85%
iter 1400: loss 1.7836, time 8634.29ms, mfu 4.85%
iter 1401: loss 2.0023, time 8635.53ms, mfu 4.85%
iter 1402: loss 1.6627, time 8671.59ms, mfu 4.85%
iter 1403: loss 1.5504, time 8666.74ms, mfu 4.85%
iter 1404: loss 1.8509, time 8662.70ms, mfu 4.85%
iter 1405: loss 2.1916, time 8666.12ms, mfu 4.85%
iter 1406: loss 1.7207, time 8671.30ms, mfu 4.85%
iter 1407: loss 2.0434, time 8659.92ms, mfu 4.85%
iter 1408: loss 1.8764, time 8675.97ms, mfu 4.84%
iter 1409: loss 1.9452, time 8667.41ms, mfu 4.84%
iter 1410: loss 1.8258, time 8660.18ms, mfu 4.84%
iter 1411: loss 1.8839, time 8650.82ms, mfu 4.84%
iter 1412: loss 1.7175, time 8648.33ms, mfu 4.84%
iter 1413: loss 1.9244, time 8639.33ms, mfu 4.84%
iter 1414: loss 1.8627, time 8631.20ms, mfu 4.85%
iter 1415: loss 2.1133, time 8630.17ms, mfu 4.85%
iter 1416: loss 1.8989, time 8644.78ms, mfu 4.85%
iter 1417: loss 1.8104, time 8641.69ms, mfu 4.85%
iter 1418: loss 1.9111, time 8643.80ms, mfu 4.85%
iter 1419: loss 2.1514, time 8625.16ms, mfu 4.85%
iter 1420: loss 1.8511, time 8646.64ms, mfu 4.85%
iter 1421: loss 1.7844, time 8638.92ms, mfu 4.85%
iter 1422: loss 1.6243, time 8627.23ms, mfu 4.85%
iter 1423: loss 1.8968, time 8639.86ms, mfu 4.85%
iter 1424: loss 1.6911, time 8649.69ms, mfu 4.85%
iter 1425: loss 2.0285, time 8655.39ms, mfu 4.85%
iter 1426: loss 2.0663, time 8652.27ms, mfu 4.85%
iter 1427: loss 2.0426, time 8667.41ms, mfu 4.85%
iter 1428: loss 2.0632, time 8656.33ms, mfu 4.85%
iter 1429: loss 1.8151, time 8676.23ms, mfu 4.84%
iter 1430: loss 2.0610, time 8669.29ms, mfu 4.84%
iter 1431: loss 2.0249, time 8671.98ms, mfu 4.84%
iter 1432: loss 1.7239, time 8655.31ms, mfu 4.84%
iter 1433: loss 1.9831, time 8659.02ms, mfu 4.84%
iter 1434: loss 1.6276, time 8643.59ms, mfu 4.84%
iter 1435: loss 1.9032, time 8630.51ms, mfu 4.84%
iter 1436: loss 2.3588, time 8644.32ms, mfu 4.84%
iter 1437: loss 1.8645, time 8642.96ms, mfu 4.85%
iter 1438: loss 1.8047, time 8630.34ms, mfu 4.85%
iter 1439: loss 1.9607, time 8638.08ms, mfu 4.85%
iter 1440: loss 1.8618, time 8637.00ms, mfu 4.85%
iter 1441: loss 1.8715, time 8646.96ms, mfu 4.85%
iter 1442: loss 1.8655, time 8650.41ms, mfu 4.85%
iter 1443: loss 1.5629, time 8648.40ms, mfu 4.85%
iter 1444: loss 1.7480, time 8659.51ms, mfu 4.85%
iter 1445: loss 2.3368, time 8637.64ms, mfu 4.85%
iter 1446: loss 2.1625, time 8640.49ms, mfu 4.85%
iter 1447: loss 2.0571, time 8617.03ms, mfu 4.85%
iter 1448: loss 1.8284, time 8642.79ms, mfu 4.85%
iter 1449: loss 2.0530, time 8644.45ms, mfu 4.85%
iter 1450: loss 1.9051, time 8644.33ms, mfu 4.85%
iter 1451: loss 2.0512, time 8640.72ms, mfu 4.85%
iter 1452: loss 1.8213, time 8636.35ms, mfu 4.85%
iter 1453: loss 1.6003, time 8638.13ms, mfu 4.85%
iter 1454: loss 1.9516, time 8656.25ms, mfu 4.85%
iter 1455: loss 1.9429, time 8655.39ms, mfu 4.85%
iter 1456: loss 1.8706, time 8646.97ms, mfu 4.85%
iter 1457: loss 1.8405, time 8656.16ms, mfu 4.85%
iter 1458: loss 1.9236, time 8641.91ms, mfu 4.85%
iter 1459: loss 2.0095, time 8654.32ms, mfu 4.85%
iter 1460: loss 1.9135, time 8647.80ms, mfu 4.85%
iter 1461: loss 1.9452, time 8648.73ms, mfu 4.85%
iter 1462: loss 1.6396, time 8649.77ms, mfu 4.85%
iter 1463: loss 1.9113, time 8635.95ms, mfu 4.85%
iter 1464: loss 2.1143, time 8626.35ms, mfu 4.85%
iter 1465: loss 1.7022, time 8636.75ms, mfu 4.85%
iter 1466: loss 2.1069, time 8630.91ms, mfu 4.85%
iter 1467: loss 1.9055, time 8628.76ms, mfu 4.85%
iter 1468: loss 1.6427, time 8615.85ms, mfu 4.85%
iter 1469: loss 1.8533, time 8632.79ms, mfu 4.85%
iter 1470: loss 2.1891, time 8625.36ms, mfu 4.85%
iter 1471: loss 1.7650, time 8623.12ms, mfu 4.85%
iter 1472: loss 1.8346, time 8631.23ms, mfu 4.85%
iter 1473: loss 1.9380, time 8633.79ms, mfu 4.85%
iter 1474: loss 1.8703, time 8641.70ms, mfu 4.85%
iter 1475: loss 2.0617, time 8640.68ms, mfu 4.85%
iter 1476: loss 1.5456, time 8664.40ms, mfu 4.85%
iter 1477: loss 1.9545, time 8650.35ms, mfu 4.85%
iter 1478: loss 1.7594, time 8651.46ms, mfu 4.85%
iter 1479: loss 2.0273, time 8628.02ms, mfu 4.85%
iter 1480: loss 2.1226, time 8618.99ms, mfu 4.85%
iter 1481: loss 1.5857, time 8625.00ms, mfu 4.85%
iter 1482: loss 1.5615, time 8632.24ms, mfu 4.85%
iter 1483: loss 1.8871, time 8605.89ms, mfu 4.85%
iter 1484: loss 2.1619, time 8642.12ms, mfu 4.85%
iter 1485: loss 1.7582, time 8638.31ms, mfu 4.85%
iter 1486: loss 2.2947, time 8649.49ms, mfu 4.85%
iter 1487: loss 1.9702, time 8651.62ms, mfu 4.85%
iter 1488: loss 1.7989, time 8667.05ms, mfu 4.85%
iter 1489: loss 1.9357, time 8659.41ms, mfu 4.85%
iter 1490: loss 2.2377, time 8675.97ms, mfu 4.85%
iter 1491: loss 1.9419, time 8669.84ms, mfu 4.85%
iter 1492: loss 1.6376, time 8663.95ms, mfu 4.85%
iter 1493: loss 1.9858, time 8655.39ms, mfu 4.85%
iter 1494: loss 1.8343, time 8662.56ms, mfu 4.84%
iter 1495: loss 2.3099, time 8647.99ms, mfu 4.84%
iter 1496: loss 1.8276, time 8661.39ms, mfu 4.84%
iter 1497: loss 1.7308, time 8646.68ms, mfu 4.84%
iter 1498: loss 2.1092, time 8659.30ms, mfu 4.84%
iter 1499: loss 1.6856, time 8638.76ms, mfu 4.84%
iter 1500: loss 2.0307, time 8647.28ms, mfu 4.85%
iter 1501: loss 1.8992, time 8657.93ms, mfu 4.84%
iter 1502: loss 1.9076, time 8627.02ms, mfu 4.85%
iter 1503: loss 1.5141, time 8628.74ms, mfu 4.85%
iter 1504: loss 1.5394, time 8634.57ms, mfu 4.85%
iter 1505: loss 2.1358, time 8626.42ms, mfu 4.85%
iter 1506: loss 1.8544, time 8617.76ms, mfu 4.85%
iter 1507: loss 1.7642, time 8636.05ms, mfu 4.85%
iter 1508: loss 2.0794, time 8632.43ms, mfu 4.85%
iter 1509: loss 1.7281, time 8624.93ms, mfu 4.85%
iter 1510: loss 1.8091, time 8623.65ms, mfu 4.85%
iter 1511: loss 1.6151, time 8629.69ms, mfu 4.85%
iter 1512: loss 2.0236, time 8621.64ms, mfu 4.85%
iter 1513: loss 1.8368, time 8620.83ms, mfu 4.85%
""")
loss_numbers = [float(value) for value in loss_values]
plt.plot(loss_numbers, color='blue', linewidth=1, linestyle='dotted')

# I think this is 50% tied experts, with enough experts added to bring the parameter count back to that of the dense model
loss_values = re.findall(loss_pattern, """
of 84
number of parameters: 127.17M
num decayed parameter tensors: 140, with 127,532,544 parameters
num non-decayed parameter tensors: 49, with 33,024 parameters
using fused AdamW: True
compiling the model... (takes a ~minute)
step 0: train loss 10.8268, val loss 10.8055
iter 0: loss 10.8255, time 159472.43ms, mfu -100.00%
iter 1: loss 9.8617, time 16582.09ms, mfu -100.00%
iter 2: loss 9.8032, time 16275.69ms, mfu -100.00%
iter 3: loss 8.6110, time 15920.12ms, mfu -100.00%
iter 4: loss 7.7690, time 15824.49ms, mfu -100.00%
iter 5: loss 6.8583, time 15934.86ms, mfu 2.70%
iter 6: loss 7.0212, time 16054.78ms, mfu 2.70%
iter 7: loss 7.1987, time 16046.04ms, mfu 2.70%
iter 8: loss 6.5593, time 16000.07ms, mfu 2.70%
iter 9: loss 6.6554, time 16245.25ms, mfu 2.69%
iter 10: loss 5.9607, time 15914.91ms, mfu 2.69%
iter 11: loss 6.3551, time 15911.87ms, mfu 2.69%
iter 12: loss 6.2330, time 15907.14ms, mfu 2.70%
iter 13: loss 6.0446, time 15966.44ms, mfu 2.70%
iter 14: loss 5.8610, time 16039.11ms, mfu 2.69%
iter 15: loss 10.0204, time 16081.58ms, mfu 2.69%
iter 16: loss 7.9912, time 15948.65ms, mfu 2.69%
iter 17: loss 6.2051, time 15960.86ms, mfu 2.69%
iter 18: loss 5.7491, time 15948.52ms, mfu 2.69%
iter 19: loss 6.0006, time 15948.25ms, mfu 2.69%
iter 20: loss 5.9542, time 15966.65ms, mfu 2.69%
iter 21: loss 6.1603, time 15973.25ms, mfu 2.69%
iter 22: loss 4.9878, time 16005.31ms, mfu 2.69%
iter 23: loss 5.7598, time 15945.46ms, mfu 2.69%
iter 24: loss 6.0732, time 15916.83ms, mfu 2.70%
iter 25: loss 6.1038, time 15934.28ms, mfu 2.70%
iter 26: loss 5.6229, time 15938.70ms, mfu 2.70%
iter 27: loss 5.4457, time 15952.12ms, mfu 2.70%
iter 28: loss 5.4726, time 15970.58ms, mfu 2.70%
iter 29: loss 6.1769, time 15963.20ms, mfu 2.70%
iter 30: loss 6.1896, time 15948.18ms, mfu 2.70%
iter 31: loss 5.3170, time 15960.24ms, mfu 2.70%
iter 32: loss 5.7969, time 15953.22ms, mfu 2.70%
iter 33: loss 6.0055, time 15970.10ms, mfu 2.70%
iter 34: loss 5.3540, time 15984.36ms, mfu 2.70%
iter 35: loss 5.1457, time 15979.81ms, mfu 2.70%
iter 36: loss 5.6742, time 16005.88ms, mfu 2.70%
iter 37: loss 4.8508, time 15995.71ms, mfu 2.69%
iter 38: loss 5.7621, time 16020.05ms, mfu 2.69%
iter 39: loss 5.2958, time 16017.60ms, mfu 2.69%
iter 40: loss 5.5246, time 16024.57ms, mfu 2.69%
iter 41: loss 5.1807, time 16083.57ms, mfu 2.69%
iter 42: loss 5.6162, time 15985.07ms, mfu 2.69%
iter 43: loss 5.0091, time 15988.13ms, mfu 2.69%
iter 44: loss 5.6098, time 16051.52ms, mfu 2.69%
iter 45: loss 6.0661, time 16058.97ms, mfu 2.69%
iter 46: loss 5.6564, time 16041.53ms, mfu 2.69%
iter 47: loss 5.4838, time 16042.70ms, mfu 2.69%
iter 48: loss 5.7960, time 16041.32ms, mfu 2.69%
iter 49: loss 5.5079, time 16050.61ms, mfu 2.69%
iter 50: loss 5.3488, time 16034.66ms, mfu 2.69%
iter 51: loss 5.4160, time 16036.16ms, mfu 2.69%
iter 52: loss 4.3826, time 16022.01ms, mfu 2.69%
iter 53: loss 5.2089, time 16024.07ms, mfu 2.69%
iter 54: loss 4.6480, time 16028.38ms, mfu 2.69%
iter 55: loss 5.5686, time 16059.89ms, mfu 2.69%
iter 56: loss 5.2937, time 16063.60ms, mfu 2.69%
iter 57: loss 5.5686, time 16247.20ms, mfu 2.68%
iter 58: loss 5.1158, time 16044.78ms, mfu 2.68%
iter 59: loss 4.8403, time 16050.01ms, mfu 2.68%
iter 60: loss 5.1915, time 16030.95ms, mfu 2.68%
iter 61: loss 5.0713, time 16059.31ms, mfu 2.68%
iter 62: loss 4.8346, time 16065.59ms, mfu 2.68%
iter 63: loss 5.2106, time 16063.56ms, mfu 2.68%
iter 64: loss 5.0290, time 16046.02ms, mfu 2.68%
iter 65: loss 5.1046, time 16060.26ms, mfu 2.68%
iter 66: loss 4.3295, time 16075.24ms, mfu 2.68%
iter 67: loss 4.9671, time 16058.94ms, mfu 2.68%
iter 68: loss 4.9066, time 16051.90ms, mfu 2.68%
iter 69: loss 4.6732, time 16063.47ms, mfu 2.68%
iter 70: loss 4.3774, time 16082.13ms, mfu 2.68%
iter 71: loss 4.5529, time 16078.24ms, mfu 2.68%
iter 72: loss 5.3960, time 16076.93ms, mfu 2.68%
iter 73: loss 5.1753, time 16075.30ms, mfu 2.68%
iter 74: loss 4.6450, time 16060.18ms, mfu 2.68%
iter 75: loss 4.0651, time 16078.80ms, mfu 2.68%
iter 76: loss 4.7811, time 16067.51ms, mfu 2.68%
iter 77: loss 4.7712, time 16078.13ms, mfu 2.68%
iter 78: loss 4.7634, time 16065.56ms, mfu 2.68%
iter 79: loss 4.9600, time 16083.43ms, mfu 2.68%
iter 80: loss 5.1169, time 16071.82ms, mfu 2.68%
iter 81: loss 4.2514, time 16088.00ms, mfu 2.68%
iter 82: loss 4.4016, time 16059.63ms, mfu 2.68%
iter 83: loss 4.7483, time 16068.43ms, mfu 2.68%
iter 84: loss 4.9519, time 16092.77ms, mfu 2.68%
iter 85: loss 4.9643, time 16068.89ms, mfu 2.68%
iter 86: loss 5.3717, time 16082.16ms, mfu 2.68%
iter 87: loss 4.5888, time 16085.78ms, mfu 2.68%
iter 88: loss 4.4927, time 16091.80ms, mfu 2.68%
iter 89: loss 4.8198, time 16088.58ms, mfu 2.68%
iter 90: loss 4.0388, time 16078.18ms, mfu 2.68%
iter 91: loss 4.6328, time 16078.16ms, mfu 2.68%
iter 92: loss 4.6303, time 16110.18ms, mfu 2.68%
iter 93: loss 4.4409, time 16100.47ms, mfu 2.68%
iter 94: loss 4.2924, time 16083.09ms, mfu 2.68%
iter 95: loss 4.8069, time 16087.53ms, mfu 2.68%
iter 96: loss 4.5557, time 16097.31ms, mfu 2.68%
iter 97: loss 4.7743, time 16092.44ms, mfu 2.68%
iter 98: loss 4.6427, time 16087.82ms, mfu 2.68%
iter 99: loss 4.1648, time 16105.16ms, mfu 2.68%
iter 100: loss 4.2891, time 16107.11ms, mfu 2.68%
iter 101: loss 4.3193, time 16106.85ms, mfu 2.67%
iter 102: loss 4.0782, time 16096.85ms, mfu 2.67%
iter 103: loss 4.6365, time 16089.84ms, mfu 2.67%
iter 104: loss 4.1630, time 16296.51ms, mfu 2.67%
iter 105: loss 4.4996, time 16107.17ms, mfu 2.67%
iter 106: loss 4.7051, time 16094.68ms, mfu 2.67%
iter 107: loss 4.4863, time 16088.63ms, mfu 2.67%
iter 108: loss 4.4288, time 16116.80ms, mfu 2.67%
iter 109: loss 4.4226, time 16101.08ms, mfu 2.67%
iter 110: loss 4.4727, time 16097.67ms, mfu 2.67%
iter 111: loss 4.1946, time 16082.42ms, mfu 2.67%
iter 112: loss 4.3360, time 16117.51ms, mfu 2.67%
iter 113: loss 4.5944, time 16112.23ms, mfu 2.67%
iter 114: loss 4.0840, time 16113.57ms, mfu 2.67%
iter 115: loss 4.4153, time 16111.66ms, mfu 2.67%
iter 116: loss 4.4029, time 16116.88ms, mfu 2.67%
iter 117: loss 4.4821, time 16109.22ms, mfu 2.67%
iter 118: loss 4.0624, time 16103.72ms, mfu 2.67%
iter 119: loss 4.0524, time 16108.28ms, mfu 2.67%
iter 120: loss 4.7035, time 16113.06ms, mfu 2.67%
iter 121: loss 4.4940, time 16114.62ms, mfu 2.67%
iter 122: loss 4.2356, time 16107.47ms, mfu 2.67%
iter 123: loss 4.4804, time 16101.38ms, mfu 2.67%
iter 124: loss 4.0159, time 16108.88ms, mfu 2.67%
iter 125: loss 3.9901, time 16100.40ms, mfu 2.67%
iter 126: loss 4.1615, time 16117.79ms, mfu 2.67%
iter 127: loss 3.8375, time 16139.11ms, mfu 2.67%
iter 128: loss 4.0177, time 16138.34ms, mfu 2.67%
iter 129: loss 3.9974, time 16157.89ms, mfu 2.67%
iter 130: loss 3.7126, time 16147.60ms, mfu 2.67%
iter 131: loss 3.8351, time 16137.21ms, mfu 2.67%
iter 132: loss 4.3075, time 16149.18ms, mfu 2.67%
iter 133: loss 4.2053, time 16145.46ms, mfu 2.67%
iter 134: loss 4.0609, time 16145.57ms, mfu 2.67%
iter 135: loss 3.9417, time 16134.64ms, mfu 2.67%
iter 136: loss 4.1098, time 16114.89ms, mfu 2.67%
iter 137: loss 4.1738, time 16115.93ms, mfu 2.67%
iter 138: loss 4.0643, time 16126.60ms, mfu 2.67%
iter 139: loss 3.4530, time 16119.58ms, mfu 2.67%
iter 140: loss 4.1600, time 16128.37ms, mfu 2.67%
iter 141: loss 3.6988, time 16116.86ms, mfu 2.67%
iter 142: loss 3.8537, time 16109.13ms, mfu 2.67%
iter 143: loss 4.2608, time 16122.26ms, mfu 2.67%
iter 144: loss 4.5512, time 16117.01ms, mfu 2.67%
iter 145: loss 3.9840, time 16120.14ms, mfu 2.67%
iter 146: loss 4.0729, time 16126.53ms, mfu 2.67%
iter 147: loss 3.8876, time 16099.89ms, mfu 2.67%
iter 148: loss 3.8520, time 16119.23ms, mfu 2.67%
iter 149: loss 4.2679, time 16102.75ms, mfu 2.67%
iter 150: loss 3.7646, time 16113.25ms, mfu 2.67%
iter 151: loss 3.7615, time 16286.81ms, mfu 2.67%
iter 152: loss 3.8321, time 16130.11ms, mfu 2.67%
iter 153: loss 3.8822, time 16116.62ms, mfu 2.67%
iter 154: loss 3.7048, time 16123.56ms, mfu 2.67%
iter 155: loss 3.9149, time 16143.51ms, mfu 2.67%
iter 156: loss 4.1509, time 16167.40ms, mfu 2.67%
iter 157: loss 4.0402, time 16173.32ms, mfu 2.67%
iter 158: loss 4.2500, time 16145.18ms, mfu 2.67%
iter 159: loss 3.9604, time 16156.35ms, mfu 2.67%
iter 160: loss 3.9399, time 16137.64ms, mfu 2.67%
iter 161: loss 3.6061, time 16154.96ms, mfu 2.67%
iter 162: loss 3.8253, time 16128.84ms, mfu 2.67%
iter 163: loss 3.9551, time 16137.39ms, mfu 2.67%
iter 164: loss 3.9487, time 16124.35ms, mfu 2.67%
iter 165: loss 3.7832, time 16126.11ms, mfu 2.67%
iter 166: loss 3.7740, time 16097.67ms, mfu 2.67%
iter 167: loss 3.7048, time 16130.41ms, mfu 2.67%
iter 168: loss 3.7746, time 16125.73ms, mfu 2.67%
iter 169: loss 3.8644, time 16091.71ms, mfu 2.67%
iter 170: loss 4.2486, time 16087.27ms, mfu 2.67%
iter 171: loss 4.2473, time 16123.68ms, mfu 2.67%
iter 172: loss 4.3033, time 16097.77ms, mfu 2.67%
iter 173: loss 3.6725, time 16103.06ms, mfu 2.67%
iter 174: loss 4.0084, time 16096.02ms, mfu 2.67%
iter 175: loss 3.6084, time 16110.12ms, mfu 2.67%
iter 176: loss 3.8653, time 16095.87ms, mfu 2.67%
iter 177: loss 4.0035, time 16092.42ms, mfu 2.67%
iter 178: loss 3.7509, time 16110.34ms, mfu 2.67%
iter 179: loss 3.7010, time 16101.51ms, mfu 2.67%
iter 180: loss 3.7862, time 16100.46ms, mfu 2.67%
iter 181: loss 3.5373, time 16119.55ms, mfu 2.67%
iter 182: loss 3.6026, time 16111.17ms, mfu 2.67%
iter 183: loss 3.1074, time 16126.44ms, mfu 2.67%
iter 184: loss 3.7301, time 16093.48ms, mfu 2.67%
iter 185: loss 3.5413, time 16110.73ms, mfu 2.67%
iter 186: loss 4.1786, time 16111.42ms, mfu 2.67%
iter 187: loss 3.7508, time 16102.69ms, mfu 2.67%
iter 188: loss 3.6772, time 16113.26ms, mfu 2.67%
iter 189: loss 3.9387, time 16105.23ms, mfu 2.67%
iter 190: loss 3.9718, time 16100.38ms, mfu 2.67%
iter 191: loss 3.7076, time 16088.96ms, mfu 2.67%
iter 192: loss 3.6210, time 16110.65ms, mfu 2.67%
iter 193: loss 3.8967, time 16090.53ms, mfu 2.67%
iter 194: loss 3.5632, time 16098.80ms, mfu 2.67%
iter 195: loss 3.6297, time 16100.51ms, mfu 2.67%
iter 196: loss 3.7419, time 16094.91ms, mfu 2.67%
iter 197: loss 4.2959, time 16083.23ms, mfu 2.67%
iter 198: loss 4.0520, time 16093.77ms, mfu 2.67%
iter 199: loss 3.6902, time 16291.80ms, mfu 2.67%
iter 200: loss 3.7319, time 16110.33ms, mfu 2.67%
iter 201: loss 3.6107, time 16112.60ms, mfu 2.67%
iter 202: loss 3.4855, time 16118.76ms, mfu 2.67%
iter 203: loss 3.6736, time 16140.58ms, mfu 2.67%
iter 204: loss 3.6694, time 16165.34ms, mfu 2.67%
iter 205: loss 3.5911, time 16171.97ms, mfu 2.67%
iter 206: loss 3.6828, time 16169.87ms, mfu 2.67%
iter 207: loss 4.1689, time 16150.92ms, mfu 2.67%
iter 208: loss 3.4918, time 16138.00ms, mfu 2.67%
iter 209: loss 3.3907, time 16127.53ms, mfu 2.67%
iter 210: loss 3.2456, time 16134.25ms, mfu 2.67%
iter 211: loss 4.0257, time 16132.77ms, mfu 2.67%
iter 212: loss 3.4801, time 16122.46ms, mfu 2.67%
iter 213: loss 3.4740, time 16124.17ms, mfu 2.67%
iter 214: loss 3.1314, time 16122.81ms, mfu 2.67%
iter 215: loss 4.1641, time 16113.08ms, mfu 2.67%
iter 216: loss 3.7525, time 16105.93ms, mfu 2.67%
iter 217: loss 3.4395, time 16132.56ms, mfu 2.67%
iter 218: loss 3.5586, time 16097.09ms, mfu 2.67%
iter 219: loss 3.1693, time 16113.37ms, mfu 2.67%
iter 220: loss 3.3051, time 16110.12ms, mfu 2.67%
iter 221: loss 3.2175, time 16099.54ms, mfu 2.67%
iter 222: loss 3.6372, time 16109.87ms, mfu 2.67%
iter 223: loss 3.1684, time 16093.89ms, mfu 2.67%
iter 224: loss 4.0658, time 16099.39ms, mfu 2.67%
iter 225: loss 3.3405, time 16123.74ms, mfu 2.67%
iter 226: loss 3.1954, time 16107.14ms, mfu 2.67%
iter 227: loss 3.4785, time 16102.85ms, mfu 2.67%
iter 228: loss 3.1682, time 16116.03ms, mfu 2.67%
iter 229: loss 3.5430, time 16085.46ms, mfu 2.67%
iter 230: loss 3.5215, time 16081.65ms, mfu 2.67%
iter 231: loss 3.7575, time 16083.27ms, mfu 2.67%
iter 232: loss 3.5887, time 16079.79ms, mfu 2.67%
iter 233: loss 3.8481, time 16105.23ms, mfu 2.67%
iter 234: loss 3.6449, time 16112.97ms, mfu 2.67%
iter 235: loss 3.2898, time 16104.40ms, mfu 2.67%
iter 236: loss 3.4100, time 16092.35ms, mfu 2.67%
iter 237: loss 3.6561, time 16092.59ms, mfu 2.67%
iter 238: loss 3.1385, time 16101.60ms, mfu 2.67%
iter 239: loss 2.8360, time 16100.46ms, mfu 2.67%
iter 240: loss 3.1632, time 16097.11ms, mfu 2.67%
iter 241: loss 3.8662, time 16105.02ms, mfu 2.67%
iter 242: loss 3.6755, time 16084.35ms, mfu 2.67%
iter 243: loss 3.1045, time 16091.93ms, mfu 2.67%
iter 244: loss 3.5673, time 16106.02ms, mfu 2.67%
iter 245: loss 3.6127, time 16095.58ms, mfu 2.67%
iter 246: loss 3.5572, time 16281.70ms, mfu 2.67%
iter 247: loss 3.4369, time 16072.43ms, mfu 2.67%
iter 248: loss 3.1905, time 16094.36ms, mfu 2.67%
iter 249: loss 3.2539, time 16086.49ms, mfu 2.67%
iter 250: loss 3.4243, time 16087.53ms, mfu 2.67%
iter 251: loss 4.1580, time 16085.76ms, mfu 2.67%
iter 252: loss 3.7038, time 16087.67ms, mfu 2.67%
iter 253: loss 3.6222, time 16092.58ms, mfu 2.67%
iter 254: loss 3.4630, time 16097.81ms, mfu 2.67%
iter 255: loss 3.6287, time 16087.88ms, mfu 2.67%
iter 256: loss 3.6959, time 16100.84ms, mfu 2.67%
iter 257: loss 3.1946, time 16123.83ms, mfu 2.67%
iter 258: loss 3.1303, time 16114.43ms, mfu 2.67%
iter 259: loss 3.6505, time 16125.70ms, mfu 2.67%
iter 260: loss 3.0840, time 16122.73ms, mfu 2.67%
iter 261: loss 2.9574, time 16148.99ms, mfu 2.67%
iter 262: loss 3.0740, time 16145.09ms, mfu 2.67%
iter 263: loss 3.6996, time 16145.21ms, mfu 2.67%
iter 264: loss 3.4703, time 16133.45ms, mfu 2.67%
iter 265: loss 3.3182, time 16123.53ms, mfu 2.67%
iter 266: loss 3.9360, time 16111.06ms, mfu 2.67%
iter 267: loss 3.6560, time 16132.63ms, mfu 2.67%
iter 268: loss 3.4099, time 16109.71ms, mfu 2.67%
iter 269: loss 3.3894, time 16110.34ms, mfu 2.67%
iter 270: loss 3.3179, time 16112.35ms, mfu 2.67%
iter 271: loss 3.4406, time 16104.70ms, mfu 2.67%
iter 272: loss 3.5731, time 16089.54ms, mfu 2.67%
iter 273: loss 3.4467, time 16101.77ms, mfu 2.67%
iter 274: loss 3.1601, time 16113.32ms, mfu 2.67%
iter 275: loss 3.5178, time 16080.99ms, mfu 2.67%
iter 276: loss 3.6782, time 16108.93ms, mfu 2.67%
iter 277: loss 2.9064, time 16109.94ms, mfu 2.67%
iter 278: loss 3.2146, time 16097.51ms, mfu 2.67%
iter 279: loss 3.4012, time 16112.80ms, mfu 2.67%
iter 280: loss 3.1513, time 16125.01ms, mfu 2.67%
iter 281: loss 3.6630, time 16137.98ms, mfu 2.67%
iter 282: loss 3.7381, time 16135.96ms, mfu 2.67%
iter 283: loss 3.3035, time 16152.61ms, mfu 2.67%
iter 284: loss 3.2808, time 16151.67ms, mfu 2.67%
iter 285: loss 2.8494, time 16152.40ms, mfu 2.67%
iter 286: loss 3.7245, time 16186.09ms, mfu 2.67%
iter 287: loss 3.5057, time 16168.79ms, mfu 2.67%
iter 288: loss 3.5812, time 16180.36ms, mfu 2.67%
iter 289: loss 3.5610, time 16188.64ms, mfu 2.67%
iter 290: loss 3.0930, time 16146.94ms, mfu 2.67%
iter 291: loss 3.4537, time 16132.88ms, mfu 2.67%
iter 292: loss 3.4306, time 16137.29ms, mfu 2.67%
iter 293: loss 3.5969, time 16134.85ms, mfu 2.67%
iter 294: loss 3.6199, time 16318.98ms, mfu 2.66%
iter 295: loss 3.4162, time 16125.18ms, mfu 2.66%
iter 296: loss 2.9110, time 16120.75ms, mfu 2.66%
iter 297: loss 3.4343, time 16110.98ms, mfu 2.67%
iter 298: loss 3.3734, time 16110.05ms, mfu 2.67%
iter 299: loss 3.4172, time 16105.47ms, mfu 2.67%
iter 300: loss 3.2457, time 16092.57ms, mfu 2.67%
iter 301: loss 3.2312, time 16082.04ms, mfu 2.67%
""")
loss_numbers = [float(value) for value in loss_values]
plt.plot(loss_numbers, color='orange', linewidth=1, linestyle='dotted')

# I think this is 50% tied experts, with enough experts added to bring the parameter count back to that of the dense model
loss_values = re.findall(loss_pattern, """
of 84
number of parameters: 127.17M
num decayed parameter tensors: 140, with 127,532,544 parameters
num non-decayed parameter tensors: 49, with 33,024 parameters
using fused AdamW: True
compiling the model... (takes a ~minute)
step 0: train loss 10.8268, val loss 10.8055
iter 0: loss 10.8255, time 159472.43ms, mfu -100.00%
iter 1: loss 9.8617, time 16582.09ms, mfu -100.00%
iter 2: loss 9.8032, time 16275.69ms, mfu -100.00%
iter 3: loss 8.6110, time 15920.12ms, mfu -100.00%
iter 4: loss 7.7690, time 15824.49ms, mfu -100.00%
iter 5: loss 6.8583, time 15934.86ms, mfu 2.70%
iter 6: loss 7.0212, time 16054.78ms, mfu 2.70%
iter 7: loss 7.1987, time 16046.04ms, mfu 2.70%
iter 8: loss 6.5593, time 16000.07ms, mfu 2.70%
iter 9: loss 6.6554, time 16245.25ms, mfu 2.69%
iter 10: loss 5.9607, time 15914.91ms, mfu 2.69%
iter 11: loss 6.3551, time 15911.87ms, mfu 2.69%
iter 12: loss 6.2330, time 15907.14ms, mfu 2.70%
iter 13: loss 6.0446, time 15966.44ms, mfu 2.70%
iter 14: loss 5.8610, time 16039.11ms, mfu 2.69%
iter 15: loss 10.0204, time 16081.58ms, mfu 2.69%
iter 16: loss 7.9912, time 15948.65ms, mfu 2.69%
iter 17: loss 6.2051, time 15960.86ms, mfu 2.69%
iter 18: loss 5.7491, time 15948.52ms, mfu 2.69%
iter 19: loss 6.0006, time 15948.25ms, mfu 2.69%
iter 20: loss 5.9542, time 15966.65ms, mfu 2.69%
iter 21: loss 6.1603, time 15973.25ms, mfu 2.69%
iter 22: loss 4.9878, time 16005.31ms, mfu 2.69%
iter 23: loss 5.7598, time 15945.46ms, mfu 2.69%
iter 24: loss 6.0732, time 15916.83ms, mfu 2.70%
iter 25: loss 6.1038, time 15934.28ms, mfu 2.70%
iter 26: loss 5.6229, time 15938.70ms, mfu 2.70%
iter 27: loss 5.4457, time 15952.12ms, mfu 2.70%
iter 28: loss 5.4726, time 15970.58ms, mfu 2.70%
iter 29: loss 6.1769, time 15963.20ms, mfu 2.70%
iter 30: loss 6.1896, time 15948.18ms, mfu 2.70%
iter 31: loss 5.3170, time 15960.24ms, mfu 2.70%
iter 32: loss 5.7969, time 15953.22ms, mfu 2.70%
iter 33: loss 6.0055, time 15970.10ms, mfu 2.70%
iter 34: loss 5.3540, time 15984.36ms, mfu 2.70%
iter 35: loss 5.1457, time 15979.81ms, mfu 2.70%
iter 36: loss 5.6742, time 16005.88ms, mfu 2.70%
iter 37: loss 4.8508, time 15995.71ms, mfu 2.69%
iter 38: loss 5.7621, time 16020.05ms, mfu 2.69%
iter 39: loss 5.2958, time 16017.60ms, mfu 2.69%
iter 40: loss 5.5246, time 16024.57ms, mfu 2.69%
iter 41: loss 5.1807, time 16083.57ms, mfu 2.69%
iter 42: loss 5.6162, time 15985.07ms, mfu 2.69%
iter 43: loss 5.0091, time 15988.13ms, mfu 2.69%
iter 44: loss 5.6098, time 16051.52ms, mfu 2.69%
iter 45: loss 6.0661, time 16058.97ms, mfu 2.69%
iter 46: loss 5.6564, time 16041.53ms, mfu 2.69%
iter 47: loss 5.4838, time 16042.70ms, mfu 2.69%
iter 48: loss 5.7960, time 16041.32ms, mfu 2.69%
iter 49: loss 5.5079, time 16050.61ms, mfu 2.69%
iter 50: loss 5.3488, time 16034.66ms, mfu 2.69%
iter 51: loss 5.4160, time 16036.16ms, mfu 2.69%
iter 52: loss 4.3826, time 16022.01ms, mfu 2.69%
iter 53: loss 5.2089, time 16024.07ms, mfu 2.69%
iter 54: loss 4.6480, time 16028.38ms, mfu 2.69%
iter 55: loss 5.5686, time 16059.89ms, mfu 2.69%
iter 56: loss 5.2937, time 16063.60ms, mfu 2.69%
iter 57: loss 5.5686, time 16247.20ms, mfu 2.68%
iter 58: loss 5.1158, time 16044.78ms, mfu 2.68%
iter 59: loss 4.8403, time 16050.01ms, mfu 2.68%
iter 60: loss 5.1915, time 16030.95ms, mfu 2.68%
iter 61: loss 5.0713, time 16059.31ms, mfu 2.68%
iter 62: loss 4.8346, time 16065.59ms, mfu 2.68%
iter 63: loss 5.2106, time 16063.56ms, mfu 2.68%
iter 64: loss 5.0290, time 16046.02ms, mfu 2.68%
iter 65: loss 5.1046, time 16060.26ms, mfu 2.68%
iter 66: loss 4.3295, time 16075.24ms, mfu 2.68%
iter 67: loss 4.9671, time 16058.94ms, mfu 2.68%
iter 68: loss 4.9066, time 16051.90ms, mfu 2.68%
iter 69: loss 4.6732, time 16063.47ms, mfu 2.68%
iter 70: loss 4.3774, time 16082.13ms, mfu 2.68%
iter 71: loss 4.5529, time 16078.24ms, mfu 2.68%
iter 72: loss 5.3960, time 16076.93ms, mfu 2.68%
iter 73: loss 5.1753, time 16075.30ms, mfu 2.68%
iter 74: loss 4.6450, time 16060.18ms, mfu 2.68%
iter 75: loss 4.0651, time 16078.80ms, mfu 2.68%
iter 76: loss 4.7811, time 16067.51ms, mfu 2.68%
iter 77: loss 4.7712, time 16078.13ms, mfu 2.68%
iter 78: loss 4.7634, time 16065.56ms, mfu 2.68%
iter 79: loss 4.9600, time 16083.43ms, mfu 2.68%
iter 80: loss 5.1169, time 16071.82ms, mfu 2.68%
iter 81: loss 4.2514, time 16088.00ms, mfu 2.68%
iter 82: loss 4.4016, time 16059.63ms, mfu 2.68%
iter 83: loss 4.7483, time 16068.43ms, mfu 2.68%
iter 84: loss 4.9519, time 16092.77ms, mfu 2.68%
iter 85: loss 4.9643, time 16068.89ms, mfu 2.68%
iter 86: loss 5.3717, time 16082.16ms, mfu 2.68%
iter 87: loss 4.5888, time 16085.78ms, mfu 2.68%
iter 88: loss 4.4927, time 16091.80ms, mfu 2.68%
iter 89: loss 4.8198, time 16088.58ms, mfu 2.68%
iter 90: loss 4.0388, time 16078.18ms, mfu 2.68%
iter 91: loss 4.6328, time 16078.16ms, mfu 2.68%
iter 92: loss 4.6303, time 16110.18ms, mfu 2.68%
iter 93: loss 4.4409, time 16100.47ms, mfu 2.68%
iter 94: loss 4.2924, time 16083.09ms, mfu 2.68%
iter 95: loss 4.8069, time 16087.53ms, mfu 2.68%
iter 96: loss 4.5557, time 16097.31ms, mfu 2.68%
iter 97: loss 4.7743, time 16092.44ms, mfu 2.68%
iter 98: loss 4.6427, time 16087.82ms, mfu 2.68%
iter 99: loss 4.1648, time 16105.16ms, mfu 2.68%
iter 100: loss 4.2891, time 16107.11ms, mfu 2.68%
iter 101: loss 4.3193, time 16106.85ms, mfu 2.67%
iter 102: loss 4.0782, time 16096.85ms, mfu 2.67%
iter 103: loss 4.6365, time 16089.84ms, mfu 2.67%
iter 104: loss 4.1630, time 16296.51ms, mfu 2.67%
iter 105: loss 4.4996, time 16107.17ms, mfu 2.67%
iter 106: loss 4.7051, time 16094.68ms, mfu 2.67%
iter 107: loss 4.4863, time 16088.63ms, mfu 2.67%
iter 108: loss 4.4288, time 16116.80ms, mfu 2.67%
iter 109: loss 4.4226, time 16101.08ms, mfu 2.67%
iter 110: loss 4.4727, time 16097.67ms, mfu 2.67%
iter 111: loss 4.1946, time 16082.42ms, mfu 2.67%
iter 112: loss 4.3360, time 16117.51ms, mfu 2.67%
iter 113: loss 4.5944, time 16112.23ms, mfu 2.67%
iter 114: loss 4.0840, time 16113.57ms, mfu 2.67%
iter 115: loss 4.4153, time 16111.66ms, mfu 2.67%
iter 116: loss 4.4029, time 16116.88ms, mfu 2.67%
iter 117: loss 4.4821, time 16109.22ms, mfu 2.67%
iter 118: loss 4.0624, time 16103.72ms, mfu 2.67%
iter 119: loss 4.0524, time 16108.28ms, mfu 2.67%
iter 120: loss 4.7035, time 16113.06ms, mfu 2.67%
iter 121: loss 4.4940, time 16114.62ms, mfu 2.67%
iter 122: loss 4.2356, time 16107.47ms, mfu 2.67%
iter 123: loss 4.4804, time 16101.38ms, mfu 2.67%
iter 124: loss 4.0159, time 16108.88ms, mfu 2.67%
iter 125: loss 3.9901, time 16100.40ms, mfu 2.67%
iter 126: loss 4.1615, time 16117.79ms, mfu 2.67%
iter 127: loss 3.8375, time 16139.11ms, mfu 2.67%
iter 128: loss 4.0177, time 16138.34ms, mfu 2.67%
iter 129: loss 3.9974, time 16157.89ms, mfu 2.67%
iter 130: loss 3.7126, time 16147.60ms, mfu 2.67%
iter 131: loss 3.8351, time 16137.21ms, mfu 2.67%
iter 132: loss 4.3075, time 16149.18ms, mfu 2.67%
iter 133: loss 4.2053, time 16145.46ms, mfu 2.67%
iter 134: loss 4.0609, time 16145.57ms, mfu 2.67%
iter 135: loss 3.9417, time 16134.64ms, mfu 2.67%
iter 136: loss 4.1098, time 16114.89ms, mfu 2.67%
iter 137: loss 4.1738, time 16115.93ms, mfu 2.67%
iter 138: loss 4.0643, time 16126.60ms, mfu 2.67%
iter 139: loss 3.4530, time 16119.58ms, mfu 2.67%
iter 140: loss 4.1600, time 16128.37ms, mfu 2.67%
iter 141: loss 3.6988, time 16116.86ms, mfu 2.67%
iter 142: loss 3.8537, time 16109.13ms, mfu 2.67%
iter 143: loss 4.2608, time 16122.26ms, mfu 2.67%
iter 144: loss 4.5512, time 16117.01ms, mfu 2.67%
iter 145: loss 3.9840, time 16120.14ms, mfu 2.67%
iter 146: loss 4.0729, time 16126.53ms, mfu 2.67%
iter 147: loss 3.8876, time 16099.89ms, mfu 2.67%
iter 148: loss 3.8520, time 16119.23ms, mfu 2.67%
iter 149: loss 4.2679, time 16102.75ms, mfu 2.67%
iter 150: loss 3.7646, time 16113.25ms, mfu 2.67%
iter 151: loss 3.7615, time 16286.81ms, mfu 2.67%
iter 152: loss 3.8321, time 16130.11ms, mfu 2.67%
iter 153: loss 3.8822, time 16116.62ms, mfu 2.67%
iter 154: loss 3.7048, time 16123.56ms, mfu 2.67%
iter 155: loss 3.9149, time 16143.51ms, mfu 2.67%
iter 156: loss 4.1509, time 16167.40ms, mfu 2.67%
iter 157: loss 4.0402, time 16173.32ms, mfu 2.67%
iter 158: loss 4.2500, time 16145.18ms, mfu 2.67%
iter 159: loss 3.9604, time 16156.35ms, mfu 2.67%
iter 160: loss 3.9399, time 16137.64ms, mfu 2.67%
iter 161: loss 3.6061, time 16154.96ms, mfu 2.67%
iter 162: loss 3.8253, time 16128.84ms, mfu 2.67%
iter 163: loss 3.9551, time 16137.39ms, mfu 2.67%
iter 164: loss 3.9487, time 16124.35ms, mfu 2.67%
iter 165: loss 3.7832, time 16126.11ms, mfu 2.67%
iter 166: loss 3.7740, time 16097.67ms, mfu 2.67%
iter 167: loss 3.7048, time 16130.41ms, mfu 2.67%
iter 168: loss 3.7746, time 16125.73ms, mfu 2.67%
iter 169: loss 3.8644, time 16091.71ms, mfu 2.67%
iter 170: loss 4.2486, time 16087.27ms, mfu 2.67%
iter 171: loss 4.2473, time 16123.68ms, mfu 2.67%
iter 172: loss 4.3033, time 16097.77ms, mfu 2.67%
iter 173: loss 3.6725, time 16103.06ms, mfu 2.67%
iter 174: loss 4.0084, time 16096.02ms, mfu 2.67%
iter 175: loss 3.6084, time 16110.12ms, mfu 2.67%
iter 176: loss 3.8653, time 16095.87ms, mfu 2.67%
iter 177: loss 4.0035, time 16092.42ms, mfu 2.67%
iter 178: loss 3.7509, time 16110.34ms, mfu 2.67%
iter 179: loss 3.7010, time 16101.51ms, mfu 2.67%
iter 180: loss 3.7862, time 16100.46ms, mfu 2.67%
iter 181: loss 3.5373, time 16119.55ms, mfu 2.67%
iter 182: loss 3.6026, time 16111.17ms, mfu 2.67%
iter 183: loss 3.1074, time 16126.44ms, mfu 2.67%
iter 184: loss 3.7301, time 16093.48ms, mfu 2.67%
iter 185: loss 3.5413, time 16110.73ms, mfu 2.67%
iter 186: loss 4.1786, time 16111.42ms, mfu 2.67%
iter 187: loss 3.7508, time 16102.69ms, mfu 2.67%
iter 188: loss 3.6772, time 16113.26ms, mfu 2.67%
iter 189: loss 3.9387, time 16105.23ms, mfu 2.67%
iter 190: loss 3.9718, time 16100.38ms, mfu 2.67%
iter 191: loss 3.7076, time 16088.96ms, mfu 2.67%
iter 192: loss 3.6210, time 16110.65ms, mfu 2.67%
iter 193: loss 3.8967, time 16090.53ms, mfu 2.67%
iter 194: loss 3.5632, time 16098.80ms, mfu 2.67%
iter 195: loss 3.6297, time 16100.51ms, mfu 2.67%
iter 196: loss 3.7419, time 16094.91ms, mfu 2.67%
iter 197: loss 4.2959, time 16083.23ms, mfu 2.67%
iter 198: loss 4.0520, time 16093.77ms, mfu 2.67%
iter 199: loss 3.6902, time 16291.80ms, mfu 2.67%
iter 200: loss 3.7319, time 16110.33ms, mfu 2.67%
iter 201: loss 3.6107, time 16112.60ms, mfu 2.67%
iter 202: loss 3.4855, time 16118.76ms, mfu 2.67%
iter 203: loss 3.6736, time 16140.58ms, mfu 2.67%
iter 204: loss 3.6694, time 16165.34ms, mfu 2.67%
iter 205: loss 3.5911, time 16171.97ms, mfu 2.67%
iter 206: loss 3.6828, time 16169.87ms, mfu 2.67%
iter 207: loss 4.1689, time 16150.92ms, mfu 2.67%
iter 208: loss 3.4918, time 16138.00ms, mfu 2.67%
iter 209: loss 3.3907, time 16127.53ms, mfu 2.67%
iter 210: loss 3.2456, time 16134.25ms, mfu 2.67%
iter 211: loss 4.0257, time 16132.77ms, mfu 2.67%
iter 212: loss 3.4801, time 16122.46ms, mfu 2.67%
iter 213: loss 3.4740, time 16124.17ms, mfu 2.67%
iter 214: loss 3.1314, time 16122.81ms, mfu 2.67%
iter 215: loss 4.1641, time 16113.08ms, mfu 2.67%
iter 216: loss 3.7525, time 16105.93ms, mfu 2.67%
iter 217: loss 3.4395, time 16132.56ms, mfu 2.67%
iter 218: loss 3.5586, time 16097.09ms, mfu 2.67%
iter 219: loss 3.1693, time 16113.37ms, mfu 2.67%
iter 220: loss 3.3051, time 16110.12ms, mfu 2.67%
iter 221: loss 3.2175, time 16099.54ms, mfu 2.67%
iter 222: loss 3.6372, time 16109.87ms, mfu 2.67%
iter 223: loss 3.1684, time 16093.89ms, mfu 2.67%
iter 224: loss 4.0658, time 16099.39ms, mfu 2.67%
iter 225: loss 3.3405, time 16123.74ms, mfu 2.67%
iter 226: loss 3.1954, time 16107.14ms, mfu 2.67%
iter 227: loss 3.4785, time 16102.85ms, mfu 2.67%
iter 228: loss 3.1682, time 16116.03ms, mfu 2.67%
iter 229: loss 3.5430, time 16085.46ms, mfu 2.67%
iter 230: loss 3.5215, time 16081.65ms, mfu 2.67%
iter 231: loss 3.7575, time 16083.27ms, mfu 2.67%
iter 232: loss 3.5887, time 16079.79ms, mfu 2.67%
iter 233: loss 3.8481, time 16105.23ms, mfu 2.67%
iter 234: loss 3.6449, time 16112.97ms, mfu 2.67%
iter 235: loss 3.2898, time 16104.40ms, mfu 2.67%
iter 236: loss 3.4100, time 16092.35ms, mfu 2.67%
iter 237: loss 3.6561, time 16092.59ms, mfu 2.67%
iter 238: loss 3.1385, time 16101.60ms, mfu 2.67%
iter 239: loss 2.8360, time 16100.46ms, mfu 2.67%
iter 240: loss 3.1632, time 16097.11ms, mfu 2.67%
iter 241: loss 3.8662, time 16105.02ms, mfu 2.67%
iter 242: loss 3.6755, time 16084.35ms, mfu 2.67%
iter 243: loss 3.1045, time 16091.93ms, mfu 2.67%
iter 244: loss 3.5673, time 16106.02ms, mfu 2.67%
iter 245: loss 3.6127, time 16095.58ms, mfu 2.67%
iter 246: loss 3.5572, time 16281.70ms, mfu 2.67%
iter 247: loss 3.4369, time 16072.43ms, mfu 2.67%
iter 248: loss 3.1905, time 16094.36ms, mfu 2.67%
iter 249: loss 3.2539, time 16086.49ms, mfu 2.67%
iter 250: loss 3.4243, time 16087.53ms, mfu 2.67%
iter 251: loss 4.1580, time 16085.76ms, mfu 2.67%
iter 252: loss 3.7038, time 16087.67ms, mfu 2.67%
iter 253: loss 3.6222, time 16092.58ms, mfu 2.67%
iter 254: loss 3.4630, time 16097.81ms, mfu 2.67%
iter 255: loss 3.6287, time 16087.88ms, mfu 2.67%
iter 256: loss 3.6959, time 16100.84ms, mfu 2.67%
iter 257: loss 3.1946, time 16123.83ms, mfu 2.67%
iter 258: loss 3.1303, time 16114.43ms, mfu 2.67%
iter 259: loss 3.6505, time 16125.70ms, mfu 2.67%
iter 260: loss 3.0840, time 16122.73ms, mfu 2.67%
iter 261: loss 2.9574, time 16148.99ms, mfu 2.67%
iter 262: loss 3.0740, time 16145.09ms, mfu 2.67%
iter 263: loss 3.6996, time 16145.21ms, mfu 2.67%
iter 264: loss 3.4703, time 16133.45ms, mfu 2.67%
iter 265: loss 3.3182, time 16123.53ms, mfu 2.67%
iter 266: loss 3.9360, time 16111.06ms, mfu 2.67%
iter 267: loss 3.6560, time 16132.63ms, mfu 2.67%
iter 268: loss 3.4099, time 16109.71ms, mfu 2.67%
iter 269: loss 3.3894, time 16110.34ms, mfu 2.67%
iter 270: loss 3.3179, time 16112.35ms, mfu 2.67%
iter 271: loss 3.4406, time 16104.70ms, mfu 2.67%
iter 272: loss 3.5731, time 16089.54ms, mfu 2.67%
iter 273: loss 3.4467, time 16101.77ms, mfu 2.67%
iter 274: loss 3.1601, time 16113.32ms, mfu 2.67%
iter 275: loss 3.5178, time 16080.99ms, mfu 2.67%
iter 276: loss 3.6782, time 16108.93ms, mfu 2.67%
iter 277: loss 2.9064, time 16109.94ms, mfu 2.67%
iter 278: loss 3.2146, time 16097.51ms, mfu 2.67%
iter 279: loss 3.4012, time 16112.80ms, mfu 2.67%
iter 280: loss 3.1513, time 16125.01ms, mfu 2.67%
iter 281: loss 3.6630, time 16137.98ms, mfu 2.67%
iter 282: loss 3.7381, time 16135.96ms, mfu 2.67%
iter 283: loss 3.3035, time 16152.61ms, mfu 2.67%
iter 284: loss 3.2808, time 16151.67ms, mfu 2.67%
iter 285: loss 2.8494, time 16152.40ms, mfu 2.67%
iter 286: loss 3.7245, time 16186.09ms, mfu 2.67%
iter 287: loss 3.5057, time 16168.79ms, mfu 2.67%
iter 288: loss 3.5812, time 16180.36ms, mfu 2.67%
iter 289: loss 3.5610, time 16188.64ms, mfu 2.67%
iter 290: loss 3.0930, time 16146.94ms, mfu 2.67%
iter 291: loss 3.4537, time 16132.88ms, mfu 2.67%
iter 292: loss 3.4306, time 16137.29ms, mfu 2.67%
iter 293: loss 3.5969, time 16134.85ms, mfu 2.67%
iter 294: loss 3.6199, time 16318.98ms, mfu 2.66%
iter 295: loss 3.4162, time 16125.18ms, mfu 2.66%
iter 296: loss 2.9110, time 16120.75ms, mfu 2.66%
iter 297: loss 3.4343, time 16110.98ms, mfu 2.67%
iter 298: loss 3.3734, time 16110.05ms, mfu 2.67%
iter 299: loss 3.4172, time 16105.47ms, mfu 2.67%
iter 300: loss 3.2457, time 16092.57ms, mfu 2.67%
iter 301: loss 3.2312, time 16082.04ms, mfu 2.67%
""")
loss_numbers = [float(value) for value in loss_values]
plt.plot(loss_numbers, color='pink', linewidth=1, linestyle='dotted')
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Loss Progression')
plt.show()

In [None]:
# GPT pure MLP 123M losses
GPT_MLP_loss_numbers = [10.9576, 9.8106, 9.055, 7.8765, 7.8633, 7.691, 6.5204, 6.6445, 6.4419, 5.4835, 6.2119, 6.4842, 6.1765, 5.6887, 5.1491, 6.0895, 5.7106, 6.0215, 5.713, 5.2922, 5.7435, 5.6644, 5.9542, 5.5786, 5.278, 5.9571, 5.3107, 5.7009, 5.6113, 5.7749, 5.2756, 5.7444, 5.6095, 5.5276, 4.8778, 5.5249, 5.3246, 5.3226, 5.2742, 4.8037, 5.347, 5.1824, 5.2806, 5.0674, 4.93, 5.4071, 5.2126, 4.6141, 4.4989, 5.0535, 5.3439, 4.6418, 4.6041, 4.9905, 4.3615, 4.9262, 4.4824, 4.9847, 4.433, 5.0104, 4.9764, 4.7183, 4.4462, 4.4099, 4.7068, 4.6294, 4.2654, 4.1447, 4.3996, 4.0728, 4.998, 4.9489, 5.0133, 4.5926, 4.0945, 4.687, 4.4994, 4.9805, 3.7874, 3.8967, 4.5082, 4.4717, 4.6059, 4.468, 4.5881, 4.641, 4.322, 4.3696, 4.3637, 4.4174, 4.3616, 4.2224, 4.0234, 4.1362, 4.1247, 4.1314, 4.0546, 4.4732, 4.0306, 4.1661, 4.2747, 4.4687, 4.2739, 4.436, 4.1414, 4.3452, 4.2828, 4.4947, 4.2014, 3.859, 4.5639, 4.1636, 3.9034, 4.3203, 4.084, 3.9545, 3.5285, 4.6628, 4.0178, 4.6129, 4.2337, 3.9054, 4.2386, 4.48, 4.1114, 3.6654, 3.8827, 4.0185, 3.8202, 4.0552, 4.3378, 4.129, 4.4863, 3.8109, 3.821, 4.3211, 3.9934, 4.1615, 4.2278, 4.3883, 3.7854, 3.6643, 4.4475, 4.1733, 3.6834, 3.9815, 3.9075, 4.0414, 4.0944, 3.9907, 4.1128, 4.339, 3.6425, 3.5547, 4.0793, 3.8883, 4.6776, 3.8702, 3.6018, 4.0218, 3.8726, 3.7366, 4.3593, 3.5171, 4.0424, 4.0333, 3.5088, 3.5917, 3.4939, 3.9, 4.0737, 4.2209, 3.5565, 3.8428, 3.9188, 3.9191, 3.9205, 3.7231, 4.1986, 4.2119, 3.4972, 3.9284, 3.6158, 3.8216, 3.7053, 3.7403, 3.9738, 3.3592, 3.7261, 4.278, 3.798, 4.0542, 3.5429, 4.1235, 4.0203, 3.7618, 3.2626, 3.9078, 4.055, 3.5681, 3.1504, 3.9691, 3.7992, 4.0029, 3.3048, 3.8243, 3.7369, 3.9215, 3.7834, 3.3592, 3.5072, 3.722, 4.1357, 3.7239, 3.4832, 4.1278, 3.6601, 4.0817, 3.8305, 3.6161, 3.7851, 3.7249, 3.7893, 3.5663, 3.3438, 3.6767, 3.6012, 3.4481, 3.935, 3.5061, 4.0334, 3.5467, 3.2545, 3.5526, 3.8094, 4.0412, 3.495, 3.2736, 3.6726, 3.7938, 3.9618, 3.4864, 3.9444, 3.73, 4.1293, 3.1828, 3.3672, 3.8852, 3.4521, 3.639, 3.171, 
3.4916, 3.8057, 3.6494, 3.4646, 3.9826, 3.5671, 3.1103, 3.4251, 3.2528, 3.0538, 3.4751, 3.7298, 3.3139, 3.4043, 3.3098, 3.9748, 3.2352, 3.5101, 3.4949, 3.577, 3.5857, 3.2437, 3.7681, 3.703, 3.3454, 3.1867, 3.6099, 3.3351, 3.2679, 3.4137, 3.5699, 3.2261, 3.4275, 3.8351, 3.3167, 3.433, 3.1929, 3.4834, 3.8629, 3.3579, 3.1745, 3.3399, 3.295, 3.1181, 3.3177, 3.1491, 3.2569, 3.5087, 3.5253, 3.7011, 3.1315, 3.358, 2.9364, 3.562, 3.4364, 3.7563, 3.4289, 2.89, 3.3959, 3.2331, 3.4437, 3.1434, 3.5554, 3.5415, 3.6579, 3.5888, 2.9947, 3.3008, 3.5174, 3.3474, 2.7528, 3.6103, 3.4141, 3.6946, 3.3797, 3.1317, 2.7479, 3.559, 3.53, 3.5028, 3.5054, 3.2151, 3.1472, 3.052, 2.8558, 3.3546, 3.3341, 3.4884, 3.0229, 2.8556, 3.1342, 3.3573, 2.7588, 3.1783, 3.5827, 3.4166, 3.1073, 2.7966, 3.3824, 3.4247, 3.3729, 3.3877, 3.1593, 3.1296, 3.4907, 2.7737, 3.801, 3.3017, 2.851, 3.4577, 2.9851, 3.4454, 3.016, 3.4318, 3.2377, 3.1178, 2.7414, 3.7492, 3.3746, 3.0507, 2.8981, 3.9917, 3.4672, 3.3647, 3.424, 3.4977, 3.113, 3.211, 2.9579, 3.1121, 3.1305, 3.4714, 2.9859, 3.3166, 3.5357, 3.6909, 3.1579, 2.8759, 3.0833, 3.3263, 2.8353, 3.3763, 3.2316, 2.934, 3.3684, 2.9122, 3.0404, 3.3913, 2.7641, 2.8893, 3.0558, 2.7453, 2.9326, 3.1057, 2.9925, 2.9272, 2.9986, 3.0174, 3.193, 2.7934, 3.4256, 2.8541, 2.725, 2.9333, 3.1762, 3.1466, 3.093, 3.0461, 2.8347, 2.9347, 2.5861, 2.8893, 3.1507, 2.9835, 2.7352, 2.9487, 2.8253, 3.103, 2.6198, 3.2875, 3.4315, 2.9578, 3.5256, 3.1973, 3.3392, 2.8866, 2.9669, 2.7964, 3.2561, 2.9415, 3.0752, 2.9218, 2.9161, 2.9988, 2.8101, 2.8671, 3.1694, 3.0458, 2.7555, 3.0941, 2.9922, 2.8542, 3.128, 3.0284, 2.9314, 3.2599, 3.1373, 2.8478, 3.3279, 2.9018, 2.912, 3.1244, 3.1821, 2.9934, 2.5966, 2.9417, 2.7753, 2.6691, 2.9299, 3.095, 3.0573, 2.9802, 3.2395, 3.0678, 3.0731, 3.1263]
print(GPT_MLP_loss_numbers)

In [None]:
# GPT MOE
# channels = 4
# experts = 16
# tied = layers[2:-3], experts[0:num_experts//8]
loss_numbers = [10.5496, 7.858, 9.1977, 8.0395, 7.7929, 6.7418, 6.9914, 6.3738, 10.3224, 7.8805, 6.4923, 6.3069, 5.9059, 6.5409, 6.1827, 5.7764, 5.8517, 6.1622, 5.8874, 6.0403, 5.7187, 5.8883, 5.8783, 5.69, 6.1334, 5.9019, 5.6818, 5.7826, 6.2421, 6.1194, 5.775, 5.4551, 5.5985, 6.0304, 6.0989, 6.26, 5.0418, 4.7475, 5.5421, 5.059, 5.6652, 5.2561, 5.3916, 5.2561, 5.3353, 5.377, 5.4238, 5.1389, 5.4933, 4.7825, 5.146, 5.213, 4.898, 5.183, 5.6052, 5.0433, 5.0833, 4.7395, 4.696, 4.9908, 5.2034, 5.3173, 4.919, 4.4035, 4.9877, 4.7285, 4.8348, 4.8996, 5.077, 5.0234, 4.9628, 5.02, 4.8751, 4.3475, 4.5017, 5.2358, 3.9662, 4.4274, 4.2607, 4.1449, 4.7209, 4.5566, 4.3446, 4.3396, 4.1098, 4.608, 4.4334, 4.186, 4.3947, 4.4694, 4.3495, 4.2405, 4.4502, 3.6781, 4.2028, 4.059, 3.9999, 4.2502, 4.3104, 4.4702, 4.303, 4.3456, 4.0975, 4.206, 4.1035, 3.8556, 3.9002, 4.2606, 4.013, 3.8277, 4.1542, 3.9946, 4.0844, 3.8145, 3.9757, 3.8494, 4.1259, 3.8782, 3.7902, 4.0406, 3.7595, 3.9418, 4.1073, 3.6519, 3.5982, 3.067, 3.7166, 3.9098, 4.1879, 3.7154, 3.7123, 3.6772, 4.0502, 3.7145, 3.6576, 3.8645, 4.043, 3.5491, 3.4967, 3.6438, 4.0229, 3.6197, 3.5067, 3.9023, 3.4453, 3.7979, 3.7835, 3.7274, 3.8025, 3.5961, 3.8399, 3.9112, 3.6547, 3.7479, 3.9943, 3.5901, 3.7442, 3.4755, 3.88, 3.7167, 3.6166, 3.5934, 3.8002, 3.8531, 3.7795, 3.8058, 3.8243, 3.8385, 3.2237, 3.5984, 3.6642, 3.4045, 3.5449, 3.2869, 3.7998, 4.3149, 3.5281, 3.5306, 3.5306, 3.524, 3.7603, 3.7127, 3.5117, 3.6257, 3.2285, 3.6972, 3.4791, 3.5209, 3.3363, 3.5081, 3.5572, 3.5353, 3.4124, 3.8282, 3.587, 3.5669, 3.37, 3.4367, 3.2026, 3.3023, 3.0132, 3.598, 3.4657, 2.8025, 3.5007, 3.3317, 3.396, 3.409, 3.0504, 3.145, 3.3565, 3.3267, 3.2799, 3.9759, 3.5953, 3.3641, 3.4127, 3.498, 3.5675, 2.891, 3.455, 3.2639, 3.5732, 3.5502, 2.908, 3.288, 3.3588, 3.1142, 3.0523, 3.3545, 3.506, 3.2516, 3.3218, 3.327, 2.9918, 3.441, 3.2404, 3.4811, 3.1366, 3.477, 2.9299, 3.4372, 3.0867, 3.2273, 3.6071, 3.8065, 3.3285, 3.1286, 3.0438, 3.1228, 3.3888, 3.2347, 2.9589, 
2.9632, 3.297, 2.7699, 3.6863, 2.9634, 3.1813, 3.0801, 2.9143, 3.4172, 3.4539, 2.9955, 3.4846, 2.8587, 3.3069, 3.5753, 2.8718, 3.1959, 3.0923, 3.0377, 2.6705, 3.2692, 3.3469, 3.5814, 3.1679, 3.0719, 2.8805, 3.0067, 3.0421, 2.7071, 3.0861, 2.8721, 2.8223, 3.2394, 2.6602, 3.3055, 3.0129, 3.1899, 2.6284, 3.2918, 3.3756, 3.2896, 3.2544, 2.9546, 2.9124, 3.1641, 3.0895, 3.0943, 2.8827, 2.9733, 3.1296, 3.0488, 3.1167, 3.103, 3.3306, 3.0198, 2.8113, 3.1668, 3.1273, 3.0325, 3.1685, 3.2206, 3.0981, 2.8837, 2.9473, 3.4503, 3.0209, 3.0338, 2.9305, 2.9377, 3.1877, 2.8615, 3.0747, 2.8679, 2.9131, 2.4475, 2.8435, 3.1049, 2.8092, 3.7731, 3.1845, 3.0108, 2.5417, 3.0692, 2.6822, 3.0735, 3.4541, 2.7714, 2.8263, 2.9887, 2.7262, 2.8534, 3.2666, 3.4553, 2.5342, 3.2457, 3.0042, 2.8087, 2.935, 3.143, 2.3252, 2.7831, 2.7464, 2.7986, 2.8607, 3.0449, 3.0757, 2.6455, 2.7913, 2.8785, 2.8933, 2.8443, 2.908, 2.6602, 3.0786, 2.743, 2.7212, 2.9307, 3.0286, 2.8667, 3.0195, 2.7936, 3.1533, 3.0492, 2.9561, 2.5164, 3.2568, 2.6679, 3.0339, 2.937, 2.8623, 3.2159, 2.6484, 3.1784, 3.0412, 2.9857, 2.7962, 2.6385, 2.7857, 2.7863, 2.5522, 2.4665, 2.5936, 2.7005, 2.9162, 2.3873, 2.5072, 2.4738, 2.7168, 2.5157, 2.4018, 2.7701, 2.4937, 2.6894, 2.944, 2.8287, 2.5601, 2.5608, 3.2831, 2.6006, 2.7233, 2.728, 2.5217, 2.8725, 2.6422]
print(loss_numbers)
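
The per-iteration losses in the two lists above are noisy, so a raw overlay makes the MLP and MoE runs hard to tell apart. A minimal sketch of one way to compare them is a moving average over each run plus the smoothed gap between them. The helper names `moving_average` and `compare_runs` and the default `window=20` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def moving_average(values, window=20):
    """Moving average in 'valid' mode: returns len(values) - window + 1 points."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

def compare_runs(baseline, candidate, window=20):
    """Smooth two loss curves over their common length and return
    (smoothed baseline, smoothed candidate, candidate - baseline)."""
    n = min(len(baseline), len(candidate))  # runs may differ in length
    base_smooth = moving_average(baseline[:n], window)
    cand_smooth = moving_average(candidate[:n], window)
    return base_smooth, cand_smooth, cand_smooth - base_smooth
```

In this notebook it could be called as `compare_runs(GPT_MLP_loss_numbers, loss_numbers)`, with the first two arrays passed to `plt.plot` for an overlay. A negative value in the third array means the MoE run has the lower smoothed loss at that point. Note that a single run per configuration says little about variance between seeds.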

In [None]:


# losses for MOE without weight tying
#gradient_accumulation_steps = 5 * 8 # used to simulate larger batch sizes
#batch_size = 8 # if gradient_accumulation_steps > 1, this is the micro-batch size
#block_size = 1024
# model
#n_layer = 12
#n_head = 12
#n_embd = 768
#n_channels = 4    # to get the same parameter count as the MLP: set the number of experts equal to the number of channels
#n_experts = 4  # to simulate a residual without an explicit residual, we add n_channels worth of no-op experts
#dropout = 0.0