This notebook loads [StarCoder2-3b](https://huggingface.co/bigcode/starcoder2-3b) from the Hugging Face Hub. StarCoder2-3b is a code LLM pre-trained on [The Stack v2 smol](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids), a code dataset that covers 17 programming languages (C, C#, C++, Go, Java, JavaScript, Kotlin, Lua, PHP, Python, R, Ruby, Rust, SQL, Shell, Swift, TypeScript) and 7 code documentation languages (including HTML).

The goal of continuing the pre-training of StarCoder2 before fine-tuning it is to first give the pre-trained model more information about the Near Protocol blockchain, and only then teach it how to code Near dApps during fine-tuning.

Fine-tuning will teach the model in a structure-aware way, so that it gains a good understanding of the whole dApp logic. Continued pre-training, by contrast, simply carries on with the next-token prediction objective used during the original pre-training. More details on StarCoder2 and its pre-training method are available in the official paper [StarCoder 2 and The Stack v2: The Next Generation](https://arxiv.org/abs/2402.19173) (2024).
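
To make the next-token prediction objective concrete, here is a minimal sketch in plain Python. The token ids and the probabilities are made up for illustration; in this notebook the model produces the probabilities and the `Trainer` computes the loss internally.

```python
import math

def next_token_pairs(token_ids):
    """Return (context, target) pairs: each prefix is used to predict the token that follows it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

def average_nll(target_probs):
    """Mean negative log-likelihood of the probabilities assigned to the true next tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

# Hypothetical token ids for a short snippet (not real StarCoder2 ids).
tokens = [11, 42, 7, 99]
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(f"context={context} -> target={target}")

# Hypothetical probabilities a model might assign to each true next token.
loss = average_nll([0.5, 0.25, 0.125])
print(f"loss={loss:.4f}")
```

Continued pre-training lowers this loss on the Near text files, so that tokens typical of Near code and documentation become more probable.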

In [1]:
#!pip install datasets transformers

In [2]:
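# Disable tokenizer parallelism to avoid fork warnings during training.
# `!export` only affects a temporary subshell, so set the variable in the kernel process instead.
%env TOKENIZERS_PARALLELISM=false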
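
A command started with `!` runs in a child shell, so a variable exported there is gone as soon as that shell exits. The sketch below illustrates this under the assumption of a POSIX shell; `DEMO_TOKENIZERS_FLAG` is a hypothetical variable name used only for the demonstration.

```python
import os
import subprocess

VAR = "DEMO_TOKENIZERS_FLAG"  # hypothetical name, used only for this demo
os.environ.pop(VAR, None)

# Equivalent of `!export ...`: the variable is set in a child shell that exits immediately.
subprocess.run(f"export {VAR}=false", shell=True, check=True)
seen_after_shell_export = os.environ.get(VAR)

# Setting it on os.environ changes the current process and is inherited by its children.
os.environ[VAR] = "false"
seen_after_python_set = os.environ.get(VAR)

print(seen_after_shell_export, seen_after_python_set)
```

The same reasoning applies to `TOKENIZERS_PARALLELISM`: it has to be set in the kernel process itself, before the tokenizer is first used.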

In [3]:
import os
import shutil
from pathlib import Path
import random

from datasets import Dataset, DatasetDict, load_from_disk, load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments

##### split the nearData folder in train and val subfolders (90/10)
data_dir = '/Users/juliencarbonnell/Desktop/nearData'

##### Paths to train and val directories
train_dir = os.path.join(data_dir, 'train')
val_dir = os.path.join(data_dir, 'val')

##### Create train and val directories if they don't exist
os.makedirs(train_dir, exist_ok=True)
os.makedirs(val_dir, exist_ok=True)

##### Get all text files in the data directory
all_files = list(Path(data_dir).glob('*.txt'))

##### Function to calculate the length of a file
def get_file_length(filepath):
    with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
        return len(f.read())

##### Get lengths of all files
file_lengths = [(file, get_file_length(file)) for file in all_files]

##### Sort files by length
file_lengths.sort(key=lambda x: x[1], reverse=True)

##### Initialize train and validation file lists
train_files = []
val_files = []

##### Distribute files to train and validation sets
for i, (file, length) in enumerate(file_lengths):
    if i % 10 == 0:  # Approximately 10% for validation
        val_files.append(file)
    else:
        train_files.append(file)

##### Move files to the respective directories
for file in train_files:
    shutil.move(str(file), train_dir)

for file in val_files:
    shutil.move(str(file), val_dir)

print(f'Moved {len(train_files)} files to {train_dir}')
print(f'Moved {len(val_files)} files to {val_dir}')

##### convert text files to UTF-8 encoding:
def convert_to_utf8(filepath):
    with open(filepath, 'rb') as f:
        content = f.read()
        try:
            decoded_content = content.decode('utf-8')
        except UnicodeDecodeError:
            decoded_content = content.decode('latin1')  # Try another encoding if UTF-8 fails

    with open(filepath, 'wb') as f:
        f.write(decoded_content.encode('utf-8'))

def convert_directory_to_utf8(directory):
    for root, dirs, files in os.walk(directory):
        for file_name in files:
            file_path = os.path.join(root, file_name)
            convert_to_utf8(file_path)
            print(f"Converted {file_path} to UTF-8 encoding.")

##### Convert all files in the train and val directories
convert_directory_to_utf8(train_dir)
convert_directory_to_utf8(val_dir)
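
The split rule used in this cell (every tenth file of the length-sorted list goes to validation) can be checked without touching the file system. The sketch below uses made-up file sizes; the total of 1136 is taken from the dataset sizes reported when the datasets are saved (1022 + 114).

```python
def split_every_tenth(items):
    """Send items at positions 0, 10, 20, ... to validation and the rest to training."""
    train, val = [], []
    for i, item in enumerate(items):
        (val if i % 10 == 0 else train).append(item)
    return train, val

# Hypothetical file sizes, sorted from longest to shortest as in the notebook.
sizes = sorted(range(1, 1137), reverse=True)
train_sizes, val_sizes = split_every_tenth(sizes)
print(len(train_sizes), len(val_sizes))
```

Because the list is sorted by decreasing length, the validation set receives the longest file of every group of ten, so it is slightly biased towards longer files.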

In [4]:
# convert text files to dataset
def load_text_files(directory):
    data = {'text': []}
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                data['text'].append(file.read())
    return data

train_data = load_text_files('/Users/juliencarbonnell/Desktop/nearData/train/')
val_data = load_text_files('/Users/juliencarbonnell/Desktop/nearData/val/')

train_dataset = Dataset.from_dict(train_data)
val_dataset = Dataset.from_dict(val_data)

datasets = DatasetDict({
    'train': train_dataset,
    'val': val_dataset
})

datasets.save_to_disk('prepared_datasets')

Saving the dataset (0/1 shards):   0%|          | 0/1022 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/114 [00:00<?, ? examples/s]

In [5]:
# load the prepared datasets
datasets = load_from_disk('prepared_datasets')

In [6]:
# Initialize the tokenizer and the model
model_name = "bigcode/starcoder2-3b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Set the pad_token to eos_token if not already set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

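# Tokenize each file in full, without truncation or padding: group_texts (next cell)
# concatenates the tokens and cuts them into 64-token chunks. Truncating to
# max_length=64 here would discard everything after the first 64 tokens of each file.
def tokenize_function(examples):
    return tokenizer(examples['text'])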

# Apply the tokenization
tokenized_datasets = datasets.map(tokenize_function, batched=True, remove_columns=["text"])



Map:   0%|          | 0/114 [00:00<?, ? examples/s]

In [7]:
# Concatenate the tokenized texts and split them into 64-token chunks (short sequences keep memory use low)
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    if total_length >= 64:
        total_length = (total_length // 64) * 64
    result = {
        k: [t[i:i + 64] for i in range(0, total_length, 64)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Apply chunking
tokenized_datasets = tokenized_datasets.map(group_texts, batched=True)

Map:   0%|          | 0/114 [00:00<?, ? examples/s]
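
The chunking step can be illustrated on a toy example. This is a simplified sketch, not the notebook's function: it works on a plain list of token lists and always drops the incomplete remainder, whereas `group_texts` keeps a single short chunk when a batch holds fewer than 64 tokens. The name `chunk_token_stream` and the toy ids are hypothetical.

```python
from itertools import chain

def chunk_token_stream(sequences, block_size):
    """Concatenate token lists and cut the stream into full blocks, dropping the remainder."""
    stream = list(chain.from_iterable(sequences))
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]

# Three toy "files" of different lengths, 10 tokens in total.
toy_sequences = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
blocks = chunk_token_stream(toy_sequences, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that chunks can span file boundaries: the first block above contains tokens from both the first and the second toy file.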

In [8]:
# setup training arguments
training_args = TrainingArguments(
    output_dir='./results',
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=1,  # batch size reduced from 4 to 1 to limit memory use
    per_device_eval_batch_size=1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    logging_dir='./logs',
    logging_steps=10,
    gradient_accumulation_steps=8,  # Simulate larger batch size
    #fp16=True,  # mixed precision training not supported on mps
)
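
As a quick sanity check on these settings, the effective batch size per optimizer update can be worked out by hand. The sketch assumes a single device and the 64-token chunks produced earlier.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
block_size = 64   # tokens per chunk, as used in group_texts
num_devices = 1   # assumption: a single MPS device

# Gradients from 8 micro-batches are accumulated before each optimizer step.
sequences_per_update = per_device_train_batch_size * gradient_accumulation_steps * num_devices
tokens_per_update = sequences_per_update * block_size
print(sequences_per_update, tokens_per_update)  # 8 512
```

Gradient accumulation changes how many sequences contribute to each update, not how much activation memory a single forward/backward pass needs.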

In [9]:
# create the trainer and start training
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['val'],
)

trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

python(1117) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(1123) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


RuntimeError: MPS backend out of memory (MPS allocated: 17.98 GB, other allocations: 98.73 MB, max allowed: 18.13 GB). Tried to allocate 144.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
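
A back-of-the-envelope estimate suggests why this error is expected regardless of batch size. The sketch assumes about 3 billion parameters (as the model name suggests), fp32 weights, and an AdamW-style optimizer that keeps two extra buffers per parameter; activations are ignored, so the real requirement is higher.

```python
def full_finetune_bytes(num_params, bytes_per_value=4):
    """Rough fp32 footprint: weights + gradients + two AdamW moment buffers, no activations."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    adam_moments = 2 * num_params * bytes_per_value
    return weights + gradients + adam_moments

NUM_PARAMS = 3_000_000_000  # assumption: roughly 3 billion parameters
MPS_LIMIT_GB = 18.13        # "max allowed" value from the error message above

estimate_gb = full_finetune_bytes(NUM_PARAMS) / 1e9
print(f"~{estimate_gb:.0f} GB needed vs {MPS_LIMIT_GB} GB allowed")
```

Under these assumptions the weights alone take about 12 GB, and full training state is several times the available memory. Options that reduce the footprint, such as lower-precision weights or parameter-efficient methods, are outside the scope of this notebook.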

In [None]:
# model evaluation
import math

eval_results = trainer.evaluate()
perplexity = math.exp(eval_results['eval_loss'])
print(f"Perplexity: {perplexity}")
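
For intuition about the metric: perplexity is the exponential of the mean per-token loss, so a model that is uniformly unsure between V tokens has a perplexity of V. The vocabulary size below is a hypothetical value chosen for illustration.

```python
import math

def perplexity(mean_nll):
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)

# A model that spreads probability uniformly over V tokens has loss ln(V) and perplexity V.
vocab_size = 1000  # hypothetical vocabulary size for illustration
uniform_loss = math.log(vocab_size)
print(round(perplexity(uniform_loss)))  # 1000
print(round(perplexity(1.0), 3))        # 2.718
```

A lower perplexity on the validation set means the model finds the held-out Near files less surprising.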

In [None]:
import matplotlib.pyplot as plt

# Training loss is logged every `logging_steps` steps and validation loss once per epoch,
# so the two series have different lengths and each needs its own x-axis values.
train_logs = [log for log in trainer.state.log_history if 'loss' in log]
eval_logs = [log for log in trainer.state.log_history if 'eval_loss' in log]

plt.plot([log['epoch'] for log in train_logs], [log['loss'] for log in train_logs], label='Training Loss')
plt.plot([log['epoch'] for log in eval_logs], [log['eval_loss'] for log in eval_logs], marker='o', label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [None]:
prompt = "Once upon a time"
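# Move the inputs to the model's device (the Trainer may have moved the model to MPS)
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
# Passing **inputs forwards both input_ids and attention_mask to generate
output = model.generate(**inputs, max_length=50)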
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)