This notebook loads [StarCoder2-3b](https://huggingface.co/bigcode/starcoder2-3b) from the Hugging Face Hub. StarCoder2-3b is a code LLM pre-trained on [The Stack v2 smol](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids), a code dataset that covers 17 programming languages (C, C#, C++, Go, Java, JavaScript, Kotlin, Lua, PHP, Python, R, Ruby, Rust, SQL, Shell, Swift, TypeScript) and 7 code documentation languages (including HTML).

The goal of continuing the pre-training of StarCoder2 before fine-tuning it is to first give the pre-trained model more knowledge of the Near Protocol blockchain, and only then teach it how to code Near dApps during fine-tuning.

While the fine-tuning stage will teach the model in a structure-aware way, so that it learns the logic of a whole dApp, continued pre-training keeps the next-token prediction objective used during the original pre-training. More details on StarCoder2 and its pre-training method are available in the official paper [StarCoder 2 and The Stack v2: The Next Generation](https://arxiv.org/abs/2402.19173) (2024).
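
To make the next-token prediction objective concrete, here is a minimal sketch. It uses a toy whitespace tokenizer, which is an assumption for illustration only (StarCoder2 uses its own subword tokenizer). Each position is trained to predict the token that follows it, so the targets are the inputs shifted by one position.

```python
# Toy illustration of next-token prediction (not the StarCoder2 tokenizer).
text = "near call contract method"
tokens = text.split()  # hypothetical whitespace tokenizer

# The model reads the tokens up to position i and must predict token i + 1.
inputs = tokens[:-1]
targets = tokens[1:]

pairs = list(zip(inputs, targets))
for last_seen, target in pairs:
    print(f"after {last_seen!r} predict {target!r}")
```

In the Hugging Face causal LM classes this shift is applied inside the model, which is why a training script can pass `labels` as a plain copy of `input_ids`.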

In [None]:
!pip install "transformers[torch]" datasets huggingface_hub
# Trainer requires accelerate>=0.21.0; if the ImportError persists after upgrading, restart the runtime
!pip install -U accelerate

Collecting accelerate>=0.21.0 (from transformers[torch])
  Downloading accelerate-0.31.0-py3-none-any.whl (309 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m309.4/309.4 kB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch->transformers[torch])
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch->transformers[torch])
  Using cached nvidia_cublas_cu



In [None]:
# Disable tokenizer parallelism to avoid warnings later on.
# `!export` only affects a throwaway subshell, so set the variable with %env instead.
%env TOKENIZERS_PARALLELISM=false

In [None]:
import pandas as pd
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

In [None]:
# Load the dataset from Hugging Face
#dataset = load_dataset('jcarbonnell/preTrainingNEAR')
dataset = load_dataset('jcarbonnell/structTuningNEAR')


# Verify the loaded dataset
print(dataset)

# Print the first example of each split
print(dataset['train'][0])  # train split
print(dataset['val'][0])    # validation split

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/1.01k [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data:   0%|          | 0.00/54.9M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/7.44M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2863 [00:00<?, ? examples/s]

Generating val split:   0%|          | 0/319 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['repoName', 'tree', 'readme'],
        num_rows: 2863
    })
    val: Dataset({
        features: ['repoName', 'tree', 'readme'],
        num_rows: 319
    })
})
{'repoName': 'near-cli-rs_interactive-clap', 'tree': '.github\n    workflows\n        release-plz.yml\nCHANGELOG.md\nCargo.toml\nREADME.md\nexamples\n    advanced_enum.rs\n    advanced_struct.rs\n    simple_enum.rs\n    simple_struct.rs\n    struct_with_context.rs\n    struct_with_flatten.rs\n    struct_with_named_arg.rs\n    struct_with_subargs.rs\n    struct_with_subcommand.rs\n    to_cli_args.rs\ninteractive-clap-derive\n    CHANGELOG.md\n    Cargo.toml\n    src\n        derives\n            interactive_clap\n                methods\n                    choose_variant.rs\n                    cli_field_type.rs\n                    fields_with_skip_default_input_arg.rs\n                    fields_with_subargs.rs\n                    fields_with_subcommand.rs\n             

In [None]:
# Initialize the tokenizer and the model

# model_name must be defined before the calls below; exactly one line should be uncommented.
#model_name = "bigcode/starcoder2-3b" # RAM + SESSION RESTARTED
model_name = "distilgpt2" # SUCCESS: the only candidate marked as working
#model_name = "deepseek-ai/deepseek-coder-1.3b-base" # GPU-RAM AGAIN
#model_name = "deepseek-ai/deepseek-coder-1.3b-instruct" # GPU-RAM AGAIN
#model_name = "RWKV/rwkv-6-world-1b6"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Set the pad_token to eos_token if not already set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # Keep at most the first 64 tokens of each README
    tokenized_inputs = tokenizer(
        examples['readme'],
        truncation=True,
        padding="longest",
        max_length=64,
        return_special_tokens_mask=True,
    )
    # For causal language modeling, the labels are a copy of the input ids
    tokenized_inputs['labels'] = tokenized_inputs['input_ids'].copy()
    return tokenized_inputs

# Apply the tokenization
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["repoName", "tree", "readme"])


tokenizer_config.json:   0%|          | 0.00/1.87k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.37M [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


config.json:   0%|          | 0.00/631 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.69G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/119 [00:00<?, ?B/s]

Map:   0%|          | 0/2863 [00:00<?, ? examples/s]

Map:   0%|          | 0/319 [00:00<?, ? examples/s]

#### Chunk the tokenized dataset into 64 tokens to avoid crashing
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    if total_length >= 64:
        total_length = (total_length // 64) * 64
    result = {
        k: [t[i:i + 64] for i in range(0, total_length, 64)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

#### Apply chunking
tokenized_datasets = tokenized_datasets.map(group_texts, batched=True)
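
The effect of `group_texts` is easier to see on a toy batch. The sketch below repeats the same concatenate-then-slice logic with a block size of 4 instead of 64. The function name `chunk_batch` and the sample token ids are made up for illustration.

```python
def chunk_batch(examples, block_size):
    """Concatenate every column of a batch, then cut it into fixed-size blocks."""
    concatenated = {k: sum(v, []) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder so that every block has exactly block_size tokens.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result


batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]}  # made-up token ids
chunks = chunk_batch(batch, block_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The trailing tokens 9 and 10 are dropped because they do not fill a whole block.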

In [None]:
from transformers import TrainingArguments

In [None]:
# setup training arguments
training_args = TrainingArguments(
    output_dir='./results',
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=1,  # reduced from 4 to 1 to avoid crashing
    per_device_eval_batch_size=1,
    evaluation_strategy="epoch",  # named `eval_strategy` in recent transformers releases
    save_strategy="epoch",
    learning_rate=2e-5,
    logging_dir='./logs',
    logging_steps=10,
    gradient_accumulation_steps=8,  # accumulate 8 batches per update (effective batch size of 8)
    #fp16=True,  # mixed precision training not supported on mps
)



ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.21.0`: Please run `pip install transformers[torch]` or `pip install accelerate -U`
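
The batch settings in the `TrainingArguments` cell combine as shown in the rough arithmetic below. It assumes a single device and that the chunked training split still holds 2863 sequences of 64 tokens. Both are assumptions, and the exact step count reported by `Trainer` may differ slightly because of rounding.

```python
import math

# Values taken from the TrainingArguments cell.
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
block_size = 64

# Assumptions: one device, 2863 training sequences after chunking.
num_devices = 1
num_train_sequences = 2863

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
tokens_per_update = effective_batch_size * block_size
approx_updates_per_epoch = math.ceil(num_train_sequences / effective_batch_size)

print(effective_batch_size)      # 8
print(tokens_per_update)         # 512
print(approx_updates_per_epoch)  # 358
```

Gradient accumulation keeps the memory footprint of a batch of 1 while each optimizer step still averages gradients over 8 sequences.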

In [None]:
from transformers import Trainer

In [None]:
# create the trainer and start training
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['val'],
)

trainer.train()

In [None]:
# model evaluation
import math

eval_results = trainer.evaluate()
perplexity = math.exp(eval_results['eval_loss'])
print(f"Perplexity: {perplexity}")
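
Perplexity is the exponential of the mean per-token cross-entropy loss, which is what `eval_loss` reports for a causal language model. The small worked example below uses hypothetical probabilities rather than real model output.

```python
import math

# Hypothetical probabilities the model assigned to each correct next token.
token_probs = [0.25, 0.25, 0.25, 0.25]

# Cross-entropy loss is the mean negative log-likelihood per token.
mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(mean_nll)

print(round(perplexity, 6))  # 4.0
```

A perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens. Lower values indicate a better fit to the validation data.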

In [None]:
import matplotlib.pyplot as plt

# Training loss is logged every `logging_steps`, validation loss once per epoch,
# so plot each series against its own `epoch` value instead of trimming the lists
train_logs = [log for log in trainer.state.log_history if 'loss' in log]
eval_logs = [log for log in trainer.state.log_history if 'eval_loss' in log]

plt.plot([log['epoch'] for log in train_logs], [log['loss'] for log in train_logs], label='Training Loss')
plt.plot([log['epoch'] for log in eval_logs], [log['eval_loss'] for log in eval_logs], label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.title(f'Training and Validation Loss for {model_name}')
plt.show()

In [None]:
import torch

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors='pt')

# Move input_ids to the same device as the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
inputs = {key: value.to(device) for key, value in inputs.items()}
model.to(device)

# Ensure pad_token_id is set
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Generate text with attention_mask
output = model.generate(
    inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    max_length=50,
    pad_token_id=tokenizer.pad_token_id
)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
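
With no sampling and a single beam, `generate` typically falls back to greedy decoding, unless the model's generation config says otherwise. The sketch below mimics that loop with a hypothetical lookup table standing in for the model, to show that `max_length` counts the prompt tokens as well as the generated ones.

```python
# Hypothetical "model": a lookup table giving scores for the next token.
next_token_scores = {
    "Once": {"upon": 0.9, "more": 0.1},
    "upon": {"a": 0.8, "the": 0.2},
    "a": {"time": 0.7, "hill": 0.3},
    "time": {"<eos>": 0.6, "ago": 0.4},
}


def greedy_generate(prompt_tokens, max_length, eos_token="<eos>"):
    """Append the highest-scoring next token until max_length or eos is reached."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        scores = next_token_scores.get(tokens[-1])
        if scores is None:  # no continuation known for this token
            break
        best = max(scores, key=scores.get)
        tokens.append(best)
        if best == eos_token:
            break
    return tokens


generated = greedy_generate(["Once", "upon"], max_length=4)
print(generated)  # ['Once', 'upon', 'a', 'time']
```

Only two tokens are generated here because the two prompt tokens already count toward `max_length=4`. In `transformers`, `max_new_tokens` can be passed instead when the limit should apply to the generated tokens only.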
