<a href="https://colab.research.google.com/github/HenryNVP/chess-llm/blob/main/Fork_Pin_Phi_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fork-Pin Fine-Tuning of Phi-3

## Install dependencies

In [None]:
!pip install accelerate peft bitsandbytes transformers trl datasets

Collecting trl
  Using cached trl-0.11.4-py3-none-any.whl.metadata (12 kB)
Collecting datasets
  Using cached datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting tyro>=0.5.11 (from trl)
  Using cached tyro-0.8.14-py3-none-any.whl.metadata (8.4 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Using cached dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Using cached xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Using cached multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Using cached fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Collecting shtab>=1.5.6 (from tyro>=0.5.11->trl)
  Using cached shtab-1.7.1-py3-none-any.whl.metadata (7.3 kB)
Using cached trl-0.11.4-py3-none-any.whl (316 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from datasets import load_dataset
from trl import SFTTrainer
from jinja2 import Template
import yaml

MODEL_ID = "microsoft/Phi-3-mini-4k-instruct"
NEW_MODEL_NAME = "ForkPin-Phi-3-mini-4k"

# Metadata for the fine-tuned model
license = "apache-2.0"
username = "henrynvp"

# Training hyperparameters
MAX_SEQ_LENGTH = 256  # maximum number of tokens per training example
num_train_epochs = 1
learning_rate = 2e-4
per_device_train_batch_size = 8
gradient_accumulation_steps = 1
output_dir = "./results"

# No torch_dtype or quantization_config is passed, so the weights load in float32.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/967 [00:00<?, ?B/s]

configuration_phi3.py:   0%|          | 0.00/11.2k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_phi3.py:   0%|          | 0.00/73.2k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors.index.json:   0%|          | 0.00/16.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.67G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/3.44k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.94M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/306 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/599 [00:00<?, ?B/s]

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"

## Dataset

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
# Load the CSV file
file_path = "/content/drive/My Drive/Chess_Tactics/fork_pin_dataset.csv"
dataset_csv = load_dataset('csv', data_files=file_path)

def format_example(example):
    input_text = f"FEN: {example['FEN']} Moves: {example['Moves']}"
    output_text = f"Themes: {example['Themes']}"
    return {"input": input_text, "output": output_text}

# Apply the formatting to every row of the loaded CSV dataset
dataset = dataset_csv.map(format_example)

dataset = dataset.remove_columns(['FEN', 'Moves', 'Themes'])

# First, split into train and temp (for validation and test)
train_test_split = dataset['train'].train_test_split(test_size=0.2)  # 80% train, 20% temp
temp_dataset = train_test_split['test']

# Now split temp into validation and test
valid_test_split = temp_dataset.train_test_split(test_size=0.5)  # 50% valid, 50% test

# Name the final splits: 80% train, 10% validation, 10% test
train_dataset = train_test_split['train']
valid_dataset = valid_test_split['train']
test_dataset = valid_test_split['test']

train_dataset.to_json("train_dataset.json", orient='records', lines=True)
valid_dataset.to_json("valid_dataset.json", orient='records', lines=True)
test_dataset.to_json("test_dataset.json", orient='records', lines=True)

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/846382 [00:00<?, ? examples/s]

Creating json from Arrow format:   0%|          | 0/678 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/85 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/85 [00:00<?, ?ba/s]

11524847
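
The split sizes follow from the two `train_test_split` calls above. A minimal sketch of the arithmetic, assuming the held-out share is rounded up and the remainder goes to training. That rounding is an assumption about the library, although it agrees with the 677105 training examples that the trainer's `Map` progress bar reports in the fine-tuning cell. `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the held-out part is rounded up and the rest is kept
    for training.
    """
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test


total_rows = 846382  # row count from the Map progress bar above

# First split: 80% train, 20% held out.
n_train, n_temp = split_sizes(total_rows, 0.2)
# Second split: the held-out part is halved into validation and test.
n_valid, n_test = split_sizes(n_temp, 0.5)

print(n_train, n_valid, n_test)  # 677105 84638 84639
```

Because neither call passes a `seed`, the rows that land in each split change from run to run even though the sizes stay the same.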

In [None]:
# Function to print a sample of the dataset
def print_sample(dataset, num_samples=5):
    for i, example in enumerate(dataset):
        if i >= num_samples:
            break
        print(example)

print_sample(train_dataset)

{'input': 'FEN: r2q3r/5kb1/3p1n1p/3pp3/4P3/2N3BR/PP2QP2/2KR4 w - - 0 22 Moves: c3d5 d8c8 c1b1 c8h3', 'output': 'Themes: fork'}
{'input': 'FEN: r6r/2pk4/p2q1p2/1p1p1pp1/3Pn3/2P4P/PP3PPN/R2QR1K1 w - - 2 24 Moves: f2f3 h8h3 h2f1 h3h1 g1h1 e4f2 h1g1 f2d1', 'output': 'Themes: fork'}
{'input': 'FEN: 8/r4k2/1b2p1p1/3n4/P1QP2K1/6P1/5P2/8 w - - 1 48 Moves: f2f4 d5e3 g4f3 e3c4', 'output': 'Themes: fork'}
{'input': 'FEN: 3r1rk1/p4ppp/8/2b5/4nB2/2P5/P3BP2/RN3K1R w - - 2 25 Moves: f4e3 c5e3 f2e3 e4g3 f1g2 g3h1', 'output': 'Themes: fork'}
{'input': 'FEN: 3r3r/1k4n1/1p2p3/3pP1q1/1Q1P4/2N3pR/PP5P/5R1K w - - 0 29 Moves: h3g3 g5g3 f1f7 b7c8 f7c7 c8c7', 'output': 'Themes: pin'}
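
Each record keeps the position and move sequence in `input` and the label in `output`. For supervised fine-tuning, the label has to appear in the text the model is trained on. A minimal sketch of one way to join the two fields into a single string; the `to_training_text` helper, the `text` key, and the newline separator are assumptions for illustration, not part of the original notebook.

```python
def to_training_text(example, separator="\n"):
    """Join prompt and label into one string (hypothetical helper)."""
    return {"text": f"{example['input']}{separator}{example['output']}"}


# One of the records printed above.
sample = {
    "input": "FEN: 8/r4k2/1b2p1p1/3n4/P1QP2K1/6P1/5P2/8 w - - 1 48 "
             "Moves: f2f4 d5e3 g4f3 e3c4",
    "output": "Themes: fork",
}

print(to_training_text(sample)["text"])
```

With a `datasets` split, the same function could be passed to `map` to add a `text` column alongside the existing ones.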


## Fine-Tuning

In [None]:
args = TrainingArguments(
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=True,
    learning_rate=learning_rate,
    lr_scheduler_type="cosine",
    max_steps=-1,  # no step limit; train for num_train_epochs
    num_train_epochs=num_train_epochs,
    save_strategy="no",
    logging_steps=1,
    output_dir=output_dir,
    optim="paged_adamw_32bit",
    bf16=True,
    report_to="none",  # disable logging integrations such as Weights & Biases
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    # Only the "input" column is tokenized; the "output" column with the
    # theme label is not part of the training text.
    dataset_text_field="input",
    max_seq_length=MAX_SEQ_LENGTH,
    tokenizer=tokenizer,
)

trainer.train()
trainer.model.save_pretrained(NEW_MODEL_NAME)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).

Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/677105 [00:00<?, ? examples/s]

  return fn(*args, **kwargs)


OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 39.06 MiB is free. Process 12857 has 14.71 GiB memory in use. Of the allocated memory 14.56 GiB is allocated by PyTorch, and 25.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
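
The error is consistent with a rough memory budget. A back-of-the-envelope sketch, assuming the downloaded shards store two bytes per parameter (bf16) and that the model was loaded in float32 because no `torch_dtype` was passed. The 16 bytes per parameter for full fine-tuning with a 32-bit AdamW optimizer (weights, gradients, and two optimizer moments) is a common rule of thumb that ignores activations.

```python
GIB = 1024 ** 3

# Shard sizes reported in the download log (decimal gigabytes).
checkpoint_bytes = (4.97 + 2.67) * 1e9

# Assumption: the checkpoint stores 2 bytes per parameter (bf16).
n_params = checkpoint_bytes / 2


def gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB


fp32_weights = gib(n_params * 4)   # weights held in float32
fp32_grads = gib(n_params * 4)     # one float32 gradient per weight
adam_states = gib(n_params * 8)    # two float32 moments per weight
full_ft_total = fp32_weights + fp32_grads + adam_states

# Assumption: 4-bit quantized weights take about half a byte per parameter.
four_bit_weights = gib(n_params * 0.5)

print(f"parameters:           {n_params / 1e9:.2f} B")
print(f"fp32 weights:         {fp32_weights:.1f} GiB")
print(f"full fine-tune total: {full_ft_total:.1f} GiB")
print(f"4-bit weights:        {four_bit_weights:.1f} GiB")
```

Under these assumptions the float32 weights alone take up almost all of the 14.75 GiB the GPU reports, so the first gradient or activation allocation fails. Loading the base model in 4-bit and training only LoRA adapters with `peft`, which the install cell already pulls in, is one common way to fit such a run on a GPU of this size.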