This notebook loads [StarCoder2-3b-NEAR](https://huggingface.co/bigcode/starcoder2-3b) from the Hugging Face Hub. It is a code LLM pre-trained on [The Stack v2 smol](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids), a code dataset that covers 17 programming languages (C, C#, C++, Go, Java, JavaScript, Kotlin, Lua, PHP, Python, R, Ruby, Rust, SQL, Shell, Swift, TypeScript) and 7 code documentation languages (including HTML). The model was then further trained, in a continued pre-training step, on NEAR Protocol blockchain data from the [preTrainingNEAR](https://huggingface.co/datasets/jcarbonnell/preTrainingNEAR) dataset.

The Structure-Aware fine-tuning step instructs the model with NEAR dApp trees paired with their natural-language descriptions, which are extracted from README files and transformed into user prompts. This structure-aware approach aims to give the pre-trained model a good grasp of the overall dApp logic together with a matching user prompt. When a user then asks the model to create an app on the NEAR Protocol blockchain, the LLM should answer with a proposed app tree, augmented with an extensive description of what each branch of the tree does.
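
To make the idea concrete, here is a minimal sketch of how one training sample could pair a README-derived prompt with a project tree. The `render_tree` and `build_sample` helpers, the `Prompt:` / `Tree:` layout, and the example file names are all hypothetical; the actual structTuningNEAR dataset may be formatted differently. The only detail taken from this notebook is that each sample exposes a `text` field.

```python
def render_tree(paths):
    """Render relative file paths as an indented tree (hypothetical helper)."""
    lines = []
    seen = set()
    for path in sorted(paths):
        parts = path.split("/")
        for depth, part in enumerate(parts):
            key = "/".join(parts[: depth + 1])
            if key in seen:
                continue  # directory already printed for an earlier path
            seen.add(key)
            suffix = "/" if depth < len(parts) - 1 else ""  # mark directories
            lines.append("  " * depth + part + suffix)
    return "\n".join(lines)


def build_sample(prompt, paths):
    """Combine a user prompt and a dApp tree into one 'text' field (assumed layout)."""
    return {"text": f"Prompt: {prompt}\n\nTree:\n{render_tree(paths)}"}


# Illustrative input only; not taken from the real dataset
sample = build_sample(
    "Create a NEAR smart contract that stores greetings.",
    ["contract/src/lib.rs", "contract/Cargo.toml", "README.md"],
)
print(sample["text"])
```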

The goal is to overcome a limitation of 'next-token prediction', which can trap the model in 'dumb loops' when it iterates over complex code completions. It should also support code understanding on both the user side and the model side, since both start from the big picture.

More details on the original StarCoder2 model are available in the official paper [StarCoder 2 and The Stack v2: The Next Generation](https://arxiv.org/abs/2402.19173) (2024).

In [54]:
import os
import pandas as pd
from datasets import Dataset, DatasetDict, load_dataset

In [55]:
# Load the dataset from the Hugging Face Hub
dataset = load_dataset('jcarbonnell/structTuningNEAR')

# Verify the loaded dataset
print(dataset)

# Print the first example of each split
print(dataset['train'][0])  # first example from the train split
print(dataset['val'][0])  # first example from the validation split

In [33]:
# Initialize the tokenizer and the model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigcode/starcoder2-3b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Set the pad_token to eos_token if not already set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Tokenize each example, truncating to at most 64 tokens
# (text beyond 64 tokens is dropped, not split into further chunks)
def tokenize_function(examples):
    tokenized_inputs = tokenizer(
        examples['text'],
        truncation=True,
        padding="longest",
        max_length=64,
        return_special_tokens_mask=True,
    )
    # For causal LM training, the labels are a copy of the input ids
    tokenized_inputs['labels'] = tokenized_inputs['input_ids'].copy()
    return tokenized_inputs

# Apply the tokenization to every split and drop the raw text column
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])



Map:   0%|          | 0/1022 [00:00<?, ? examples/s]

Map:   0%|          | 0/114 [00:00<?, ? examples/s]
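
Because `pad_token` is set to `eos_token` and the labels are a plain copy of `input_ids`, padded positions also contribute to the loss. A common remedy, sketched here under the assumption that the batch carries an `attention_mask` with 0 at padded positions, is to replace those labels with -100, the index that PyTorch's cross-entropy loss ignores by default. The `mask_padding_labels` helper is hypothetical and not part of the original notebook.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding_labels(input_ids, attention_mask):
    """Return labels equal to input_ids, with padded positions set to IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: the second sequence has two padded positions (token id 0 stands in for the pad id)
labels = mask_padding_labels(
    input_ids=[[11, 12, 13, 14], [21, 22, 0, 0]],
    attention_mask=[[1, 1, 1, 1], [1, 1, 0, 0]],
)
print(labels)
```

Inside `tokenize_function`, this could replace the `.copy()` line by passing `tokenized_inputs['input_ids']` and `tokenized_inputs['attention_mask']` to the helper.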

#### Chunk the tokenized dataset into blocks of 64 tokens to avoid crashing
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    if total_length >= 64:
        total_length = (total_length // 64) * 64
    result = {
        k: [t[i:i + 64] for i in range(0, total_length, 64)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

#### Apply chunking
tokenized_datasets = tokenized_datasets.map(group_texts, batched=True)

In [34]:
# Set up the training arguments
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=1,  # reduced from 4 to 1 to avoid out-of-memory crashes
    per_device_eval_batch_size=1,
    evaluation_strategy="epoch",  # renamed to eval_strategy in recent transformers releases
    save_strategy="epoch",
    learning_rate=2e-5,
    logging_dir='./logs',
    logging_steps=10,
    gradient_accumulation_steps=8,  # accumulate gradients over 8 steps to simulate a larger batch
    # fp16=True,  # mixed-precision training is not supported on MPS
)
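
As a quick sanity check on these settings, the sketch below works out the effective batch size and a rough count of optimizer updates. It assumes a single device and that the 1022-example split shown in the progress bar above is the training split; the helper names are hypothetical, and the exact step count depends on how the installed `Trainer` version rounds the last partial accumulation window.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples that contribute to each optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def approx_optimizer_steps(num_examples, batch_size, epochs):
    """Rough number of optimizer updates, ignoring any partial final batch per epoch."""
    return (num_examples // batch_size) * epochs


batch = effective_batch_size(per_device_batch_size=1, gradient_accumulation_steps=8)
steps = approx_optimizer_steps(num_examples=1022, batch_size=batch, epochs=5)
print(f"Effective batch size: {batch}, approximate optimizer steps: {steps}")
```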

In [35]:
# create the trainer and start training
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['val'],
)

trainer.train()

RuntimeError: MPS backend out of memory (MPS allocated: 18.12 GB, other allocations: 704.00 KB, max allowed: 18.13 GB). Tried to allocate 36.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
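
A back-of-the-envelope estimate suggests why this error appears even with a batch size of 1. The sketch below assumes roughly 3 billion parameters (from the "3b" in the model name), 32-bit weights, and an Adam-style optimizer that keeps two extra values per parameter; activations and framework overhead are ignored. The `full_finetune_memory_gb` helper is hypothetical.

```python
def full_finetune_memory_gb(num_params, bytes_per_param=4):
    """Estimate memory for weights, gradients, and two Adam moment buffers, in GB."""
    weights = num_params * bytes_per_param
    gradients = num_params * bytes_per_param
    adam_states = 2 * num_params * bytes_per_param  # first and second moments
    return (weights + gradients + adam_states) / 1e9


estimate = full_finetune_memory_gb(3e9)  # assumed parameter count
mps_limit_gb = 18.13  # "max allowed" value reported in the error above
print(f"Estimated: {estimate:.1f} GB, MPS limit: {mps_limit_gb} GB")
```

Under these assumptions the estimate is well above the reported limit. Lowering the batch size or the sequence length only shrinks activation memory, so on its own it is unlikely to bring full fine-tuning under that ceiling.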

In [None]:
# model evaluation
import math

eval_results = trainer.evaluate()
perplexity = math.exp(eval_results['eval_loss'])
print(f"Perplexity: {perplexity}")
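
Perplexity here is the exponential of the mean cross-entropy loss, so a model that assigns probability 1/4 to every correct token has a perplexity of 4. The toy sketch below illustrates the relationship with made-up per-token losses; it does not use results from this notebook.

```python
import math


def perplexity(token_losses):
    """Exponential of the mean per-token cross-entropy loss (natural log)."""
    mean_loss = sum(token_losses) / len(token_losses)
    return math.exp(mean_loss)


# Each token is predicted with probability 1/4, so each loss is -ln(1/4) = ln(4)
uniform_losses = [math.log(4)] * 3
print(perplexity(uniform_losses))
```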

In [None]:
import matplotlib.pyplot as plt

# Training loss is logged every `logging_steps` steps and validation loss once per epoch,
# so each curve needs its own x values (here, the epoch recorded in each log entry)
train_logs = [log for log in trainer.state.log_history if 'loss' in log]
eval_logs = [log for log in trainer.state.log_history if 'eval_loss' in log]

plt.plot([log['epoch'] for log in train_logs], [log['loss'] for log in train_logs], label='Training Loss')
plt.plot([log['epoch'] for log in eval_logs], [log['eval_loss'] for log in eval_logs], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [None]:
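prompt = "Once upon a time"
# Move the inputs to the model's device and pass the attention mask along with the input ids
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
output = model.generate(**inputs, max_length=50, pad_token_id=tokenizer.pad_token_id)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)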