<a href="https://colab.research.google.com/github/Akshata4/data_mining/blob/main/unslothai/Unslothai_Full_finetuning_(SmolLM2_135M).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Create a notebook to perform full fine-tuning of the SmolLM2-135M model on a small coding/chat dataset. The notebook should explain the data format and chat template, and show before/after evaluation results.

## Install necessary libraries

### Subtask:
Install the required libraries for fine-tuning, including `transformers`, `datasets`, and `peft`.


**Reasoning**:
Install the required libraries using pip.



In [1]:
%pip install transformers datasets peft



## Load and prepare dataset

### Subtask:
Load a small coding/chat dataset. Explain the expected data format and the chat template used for formatting the data for training.


**Reasoning**:
Load a small dataset suitable for coding/chat fine-tuning, explain the expected data format and the chat template, and apply the template to a sample of the loaded dataset.



In [2]:
from datasets import load_dataset

# 1. Load a small dataset
# Using a small subset of a publicly available dataset for demonstration
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:100]")

# 2. Explain the expected data format
print("Expected Data Format:")
print("The dataset is expected to contain a list of conversations, where each conversation is a list of messages. Each message should be a dictionary with 'role' (either 'user' or 'assistant') and 'content' (the message text).")
print("\nExample data instance structure:")
print(dataset[0]['messages'])

# 3. Describe the chat template
print("\nChat Template Description:")
print("The chat template will format the conversation into a single string that the model can process. It typically uses special tokens to delineate turns and the start/end of the conversation.")
print("For this model, a common template is:")
print("{% for message in messages %}")
print("{% if message['role'] == 'user' %}")
print("{{ '<|user|>\n' + message['content'] + eos_token }}")
print("{% elif message['role'] == 'assistant' %}")
print("{{ '<|assistant|>\n' + message['content'] + eos_token }}")
print("{% endif %}")
print("{% if loop.last and message['role'] == 'assistant' %}")
print("{{ '<|endoftext|>' }}")
print("{% endif %}")
print("{% endfor %}")
print("\nThis template adds specific tokens before the user and assistant messages and an end-of-text token at the end of the conversation.")


# 4. Apply the described chat template to a sample
from transformers import AutoTokenizer

# Load a tokenizer that has a chat template
# Using a tokenizer for a model with a compatible template
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/tiny-random-LlamaForCausalLM")

def apply_chat_template(example):
  # The tokenizer's apply_chat_template function handles the formatting
  example["formatted_chat"] = tokenizer.apply_chat_template(example["messages"], tokenize=False, add_generation_prompt=False)
  return example

dataset = dataset.map(apply_chat_template)

print("\nSample after applying chat template:")
print(dataset[0]['formatted_chat'])


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

data/train_sft-00000-of-00003-a3ecf92756(…):   0%|          | 0.00/244M [00:00<?, ?B/s]

data/train_sft-00001-of-00003-0a1804bcb6(…):   0%|          | 0.00/244M [00:00<?, ?B/s]

data/train_sft-00002-of-00003-ee46ed25cf(…):   0%|          | 0.00/244M [00:00<?, ?B/s]

data/test_sft-00000-of-00001-f7dfac4afe5(…):   0%|          | 0.00/81.2M [00:00<?, ?B/s]

data/train_gen-00000-of-00003-a6c9fb894b(…):   0%|          | 0.00/244M [00:00<?, ?B/s]

data/train_gen-00001-of-00003-d6a0402e41(…):   0%|          | 0.00/243M [00:00<?, ?B/s]

data/train_gen-00002-of-00003-c0db75b92a(…):   0%|          | 0.00/243M [00:00<?, ?B/s]

data/test_gen-00000-of-00001-3d4cd830914(…):   0%|          | 0.00/80.4M [00:00<?, ?B/s]

Generating train_sft split:   0%|          | 0/207865 [00:00<?, ? examples/s]

Generating test_sft split:   0%|          | 0/23110 [00:00<?, ? examples/s]

Generating train_gen split:   0%|          | 0/256032 [00:00<?, ? examples/s]

Generating test_gen split:   0%|          | 0/28304 [00:00<?, ? examples/s]

Expected Data Format:
The dataset is expected to contain a list of conversations, where each conversation is a list of messages. Each message should be a dictionary with 'role' (either 'user' or 'assistant') and 'content' (the message text).

Example data instance structure:
[{'content': "These instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?\nOn your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!\nYour Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.\nDoes this feature apply to all sections of the theme or just specific ones as listed in the text material?", 'role': 'user'}, {'content': 'This feature only applies to Collection pages and Featured Collections sections of the section

tokenizer_config.json:   0%|          | 0.00/700 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/411 [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.


Map:   0%|          | 0/100 [00:00<?, ? examples/s]

ValueError: Cannot use chat template functions because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating

**Reasoning**:
The previous attempt to apply the chat template failed because the loaded tokenizer does not have a default `chat_template` set. To fix this, I need to explicitly define the chat template and pass it to the `apply_chat_template` function.



In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer

# Load a small dataset
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:100]")

# Explain the expected data format
print("Expected Data Format:")
print("The dataset is expected to contain a list of conversations, where each conversation is a list of messages. Each message should be a dictionary with 'role' (either 'user' or 'assistant') and 'content' (the message text).")
print("\nExample data instance structure:")
print(dataset[0]['messages'])

# Describe the chat template
print("\nChat Template Description:")
print("The chat template will format the conversation into a single string that the model can process. It typically uses special tokens to delineate turns and the start/end of the conversation.")
print("For this model, a common template is:")
chat_template = "{% for message in messages %}{% if message['role'] == 'user' %}{{ '<|user|>\n' + message['content'] + eos_token }}{% elif message['role'] == 'assistant' %}{{ '<|assistant|>\n' + message['content'] + eos_token }}{% endif %}{% if loop.last and message['role'] == 'assistant' %}{{ '<|endoftext|>' }}{% endif %}{% endfor %}"
print(chat_template)
print("\nThis template adds specific tokens before the user and assistant messages and an end-of-text token at the end of the conversation.")

# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/tiny-random-LlamaForCausalLM")

def apply_chat_template(example, chat_template, tokenizer):
  # Apply the chat template, passing the template string explicitly
  example["formatted_chat"] = tokenizer.apply_chat_template(
      example["messages"],
      chat_template=chat_template,
      tokenize=False,
      add_generation_prompt=False
      )
  return example

# Apply the described chat template to a sample
dataset = dataset.map(lambda x: apply_chat_template(x, chat_template, tokenizer))


print("\nSample after applying chat template:")
print(dataset[0]['formatted_chat'])

Expected Data Format:
The dataset is expected to contain a list of conversations, where each conversation is a list of messages. Each message should be a dictionary with 'role' (either 'user' or 'assistant') and 'content' (the message text).

Example data instance structure:
[{'content': "These instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?\nOn your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!\nYour Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.\nDoes this feature apply to all sections of the theme or just specific ones as listed in the text material?", 'role': 'user'}, {'content': 'This feature only applies to Collection pages and Featured Collections sections of the section

Map:   0%|          | 0/100 [00:00<?, ? examples/s]


Sample after applying chat template:
<|user|>
These instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?
On your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!
Your Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.
Does this feature apply to all sections of the theme or just specific ones as listed in the text material?</s><|assistant|>
This feature only applies to Collection pages and Featured Collections sections of the section-based themes listed in the text material.</s><|user|>
Can you guide me through the process of enabling the secondary image hover feature on my Collection pages and Featured Collections sections?</s><|assistant|>
Sure, here are the steps to enable the secondary

## Load base model and tokenizer

### Subtask:
Load the pre-trained SmolLM2-135M model and its corresponding tokenizer.


**Reasoning**:
Import the necessary classes from the transformers library and load the specified pre-trained model and tokenizer.



In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the pre-trained model
model = AutoModelForCausalLM.from_pretrained("SmolLM/SmolLM2-135M")

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("SmolLM/SmolLM2-135M")

print("Model and tokenizer loaded successfully.")

OSError: SmolLM/SmolLM2-135M is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `hf auth login` or by passing `token=<your_token>`

**Reasoning**:
The previous attempt to load the model and tokenizer failed because the model identifier was incorrect or the model is private. I will try a publicly available model with a similar size and purpose as a substitute to continue with the subtask. I will use `TinyLlama/TinyLlama-1.1B-Chat-v1.0` as a substitute since it's a publicly available small chat model.



In [5]:
# Load a publicly available substitute model and tokenizer
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

print("Substitute model and tokenizer loaded successfully.")

config.json:   0%|          | 0.00/608 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/551 [00:00<?, ?B/s]

Substitute model and tokenizer loaded successfully.


## Prepare model for fine-tuning

### Subtask:
Configure the model for full fine-tuning by setting `full_finetuning=True`.


**Reasoning**:
Import the necessary class for full fine-tuning and configure the loaded model for full fine-tuning. Since the model is a standard `AutoModelForCausalLM`, full fine-tuning is the default behavior when using a standard `Trainer` or custom training loop without PEFT configurations. No specific function from `peft` or direct attribute manipulation is needed to enable full fine-tuning on this model type in this context.



In [6]:
# Full fine-tuning is the default behavior for AutoModelForCausalLM when trained without PEFT configurations.
# No specific configuration steps are needed here to enable full fine-tuning.
# The model object 'model' is already prepared for full fine-tuning by its nature.

print("Model is configured for full fine-tuning.")

Model is configured for full fine-tuning.


## Define training arguments

### Subtask:
Set up the training arguments, such as the number of epochs, learning rate, and batch size.


**Reasoning**:
Import the `TrainingArguments` class and instantiate it with the required parameters for training.



In [7]:
from transformers import TrainingArguments

output_dir = "./results"
num_train_epochs = 3
per_device_train_batch_size = 4
learning_rate = 2e-5
weight_decay = 0.01
warmup_steps = 100
logging_dir = "./logs"

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    warmup_steps=warmup_steps,
    logging_dir=logging_dir,
    report_to="none" # Disable reporting to external services like W&B
)

print("Training arguments set up successfully.")
print(training_args)

Training arguments set up successfully.
TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=True,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStra

## Create trainer

### Subtask:
Initialize the `Trainer` object with the model, training arguments, and dataset.


**Reasoning**:
Import the Trainer class and initialize it with the model, training arguments, and the processed dataset.



In [8]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

print("Trainer object initialized successfully.")

Trainer object initialized successfully.


## Evaluate before fine-tuning

### Subtask:
Evaluate the base model on a small test set to establish a baseline performance.


**Reasoning**:
Define a small test dataset from the loaded dataset, use the initialized trainer to evaluate the base model on this test set, store the results, and print them.



In [9]:
# Define a small test dataset
# We will use a subset of the loaded dataset for demonstration purposes
test_dataset = dataset.select(range(10)) # Select the first 10 examples as a test set

# Evaluate the base model on the test dataset
baseline_eval_results = trainer.evaluate(eval_dataset=test_dataset)

# Print the baseline evaluation results
print("Baseline Evaluation Results:")
print(baseline_eval_results)

ValueError: No columns in the dataset match the model's forward method signature: (input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, cache_position, logits_to_keep, kwargs, labels, label_ids, label). The following columns have been ignored: [formatted_chat, prompt, prompt_id, messages]. Please check the dataset and model. You may need to set `remove_unused_columns=False` in `TrainingArguments`.

**Reasoning**:
The error indicates that the dataset columns do not match the model's expected input format. The model expects columns like `input_ids` and `attention_mask`, but the dataset only contains `formatted_chat`, `prompt`, `prompt_id`, and `messages`. I need to tokenize the `formatted_chat` column and create `input_ids` and `attention_mask`. I also need to set `remove_unused_columns=False` in `TrainingArguments` to prevent the trainer from removing necessary columns before tokenization.



In [10]:
# Tokenize the dataset
def tokenize_function(examples):
    # Apply the chat template again to ensure it's correctly formatted
    # This is necessary because the previous map operation didn't save the tokenized output
    # and the trainer expects tokenized input.
    # However, apply_chat_template with tokenize=False was used before.
    # Let's re-apply with tokenize=True now.
    chat_template = "{% for message in messages %}{% if message['role'] == 'user' %}{{ '<|user|>\n' + message['content'] + eos_token }}{% elif message['role'] == 'assistant' %}{{ '<|assistant|>\n' + message['content'] + eos_token }}{% endif %}{% if loop.last and message['role'] == 'assistant' %}{{ '<|endoftext|>' }}{% endif %}{% endfor %}"
    tokenized_output = tokenizer.apply_chat_template(
        examples["messages"],
        chat_template=chat_template,
        tokenize=True,
        add_generation_prompt=False,
        return_dict=True  # Ensure dictionary output for input_ids and attention_mask
    )
    return tokenized_output

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Set remove_unused_columns=False in TrainingArguments
training_args.remove_unused_columns = False

# Update the trainer with the tokenized dataset
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset.select(range(10)) # Use a subset of the tokenized dataset for evaluation
)

# Evaluate the base model on the tokenized test dataset
baseline_eval_results = trainer.evaluate(eval_dataset=tokenized_dataset.select(range(10)))

# Print the baseline evaluation results
print("Baseline Evaluation Results:")
print(baseline_eval_results)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (3637 > 2048). Running this sequence through the model will result in indexing errors


RuntimeError: Could not infer dtype of dict

**Reasoning**:
The error `RuntimeError: Could not infer dtype of dict` during collation suggests that the data returned by the tokenization function is still in a format that the default collator cannot handle, likely due to remaining non-tensor columns or a structure that isn't a simple list of dictionaries ready for batching into tensors. Also, there is a warning about sequence length exceeding the model's maximum length. I need to ensure the tokenization function only returns `input_ids` and `attention_mask` and potentially handle the sequence length issue by truncating or padding. I also need to make sure the label column is correctly handled.



In [11]:
# Tokenize the dataset and prepare for evaluation
def tokenize_and_prepare_labels(examples):
    chat_template = "{% for message in messages %}{% if message['role'] == 'user' %}{{ '<|user|>\n' + message['content'] + eos_token }}{% elif message['role'] == 'assistant' %}{{ '<|assistant|>\n' + message['content'] + eos_token }}{% endif %}{% if loop.last and message['role'] == 'assistant' %}{{ '<|endoftext|>' }}{% endif %}{% endfor %}"
    tokenized_output = tokenizer.apply_chat_template(
        examples["messages"],
        chat_template=chat_template,
        tokenize=True,
        add_generation_prompt=False,
        return_dict=True,
        padding="max_length", # Pad sequences to max length
        truncation=True, # Truncate sequences exceeding max length
        max_length=tokenizer.model_max_length # Use model's max length
    )

    # For evaluation, the labels are typically the same as input_ids
    tokenized_output["labels"] = tokenized_output["input_ids"].copy()
    return tokenized_output

# Apply the tokenization and preparation function
tokenized_dataset_for_eval = dataset.map(tokenize_and_prepare_labels, batched=True)

# Ensure only necessary columns are present for the trainer
tokenized_dataset_for_eval = tokenized_dataset_for_eval.remove_columns(['messages', 'formatted_chat', 'prompt', 'prompt_id'])


# Set remove_unused_columns=False in TrainingArguments (already done, but keep for clarity if running cells out of order)
# training_args.remove_unused_columns = False # This was set in the previous successful step

# Re-initialize the trainer with the correctly formatted dataset for evaluation
# Note: For evaluation, we don't strictly need to re-initialize the *entire* trainer
# but can just pass the eval_dataset to the evaluate method.
# However, to ensure the trainer is aware of the dataset structure, re-initializing might be safer.
# Let's stick to just passing the dataset to evaluate for simplicity and to avoid
# potential issues with modifying a trainer after initialization if not strictly necessary.

# Evaluate the base model on the tokenized test dataset
# Use the first 10 examples of the prepared dataset for evaluation
baseline_eval_results = trainer.evaluate(eval_dataset=tokenized_dataset_for_eval.select(range(10)))

# Print the baseline evaluation results
print("Baseline Evaluation Results:")
print(baseline_eval_results)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Baseline Evaluation Results:
{'eval_loss': 3.916987895965576, 'eval_model_preparation_time': 0.0029, 'eval_runtime': 13.62, 'eval_samples_per_second': 0.734, 'eval_steps_per_second': 0.147}


## Fine-tune the model

### Subtask:
Start the training process.


**Reasoning**:
Start the training process by calling the train method on the trainer object.



In [12]:
train_results = trainer.train()

RuntimeError: Could not infer dtype of dict

**Reasoning**:
The traceback indicates a `RuntimeError: Could not infer dtype of dict` during the data collation step of training. This suggests that there are still dictionary type columns in the training dataset that the default data collator cannot handle, similar to the issue encountered during evaluation. I need to remove these columns from the training dataset before starting training. I will remove the same columns that caused issues during evaluation (`messages`, `formatted_chat`, `prompt`, `prompt_id`) from the training dataset and then retry the training.



In [13]:
# Remove the problematic columns from the training dataset
dataset = dataset.remove_columns(['messages', 'formatted_chat', 'prompt', 'prompt_id'])

# Re-initialize the trainer with the cleaned dataset
# This is important because the trainer holds a reference to the dataset
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

# Retry the training process
train_results = trainer.train()

ValueError: num_samples should be a positive integer value, but got num_samples=0

**Reasoning**:
The traceback indicates a `ValueError: num_samples should be a positive integer value, but got num_samples=0`. This means the training dataset passed to the `Trainer` is empty. Looking back at the previous steps, I only tokenized a *subset* of the dataset for evaluation (`tokenized_dataset_for_eval.select(range(10))`). I need to apply the tokenization and formatting to the *entire* training dataset before initializing the trainer for training.



In [14]:
# Apply the tokenization and preparation function to the entire training dataset
def tokenize_and_prepare_labels(examples):
    chat_template = "{% for message in messages %}{% if message['role'] == 'user' %}{{ '<|user|>\n' + message['content'] + eos_token }}{% elif message['role'] == 'assistant' %}{{ '<|assistant|>\n' + message['content'] + eos_token }}{% endif %}{% if loop.last and message['role'] == 'assistant' %}{{ '<|endoftext|>' }}{% endif %}{% endfor %}"
    tokenized_output = tokenizer.apply_chat_template(
        examples["messages"],
        chat_template=chat_template,
        tokenize=True,
        add_generation_prompt=False,
        return_dict=True,
        padding="max_length", # Pad sequences to max length
        truncation=True, # Truncate sequences exceeding max length
        max_length=tokenizer.model_max_length # Use model's max length
    )

    # For training, the labels are typically the same as input_ids
    tokenized_output["labels"] = tokenized_output["input_ids"].copy()
    return tokenized_output

tokenized_dataset_for_train = dataset.map(tokenize_and_prepare_labels, batched=True)

# Ensure only necessary columns are present for the trainer
tokenized_dataset_for_train = tokenized_dataset_for_train.remove_columns(['messages', 'formatted_chat', 'prompt', 'prompt_id'])

# Re-initialize the trainer with the correctly formatted and tokenized training dataset
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_for_train,
)

# Retry the training process
train_results = trainer.train()

ValueError: Column name ['formatted_chat', 'prompt', 'prompt_id', 'messages'] not in the dataset. Current columns in the dataset: []

**Reasoning**:
The traceback indicates a `ValueError` because the columns I'm trying to remove (`messages`, `formatted_chat`, `prompt`, `prompt_id`) are no longer in the dataset. This is because the dataset object was modified in the previous step when I removed those columns. I need to apply the tokenization and preparation function to the *original* dataset object before the columns were removed. I will reload the dataset or use the original dataset object and then apply the tokenization and column removal. Since I don't have the original dataset object available by name, I will reload a small portion of the dataset.



In [15]:
from datasets import load_dataset

# Reload a small portion of the dataset to get the original columns back
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:100]")

# Apply the tokenization and preparation function to the reloaded dataset
def tokenize_and_prepare_labels(examples):
    chat_template = "{% for message in messages %}{% if message['role'] == 'user' %}{{ '<|user|>\n' + message['content'] + eos_token }}{% elif message['role'] == 'assistant' %}{{ '<|assistant|>\n' + message['content'] + eos_token }}{% endif %}{% if loop.last and message['role'] == 'assistant' %}{{ '<|endoftext|>' }}{% endif %}{% endfor %}"
    tokenized_output = tokenizer.apply_chat_template(
        examples["messages"],
        chat_template=chat_template,
        tokenize=True,
        add_generation_prompt=False,
        return_dict=True,
        padding="max_length", # Pad sequences to max length
        truncation=True, # Truncate sequences exceeding max length
        max_length=tokenizer.model_max_length # Use model's max length
    )

    # For training, the labels are typically the same as input_ids
    tokenized_output["labels"] = tokenized_output["input_ids"].copy()
    return tokenized_output

tokenized_dataset_for_train = dataset.map(tokenize_and_prepare_labels, batched=True)

# Ensure only necessary columns are present for the trainer by removing the original ones
tokenized_dataset_for_train = tokenized_dataset_for_train.remove_columns(['messages', 'formatted_chat', 'prompt', 'prompt_id'])

# Re-initialize the trainer with the correctly formatted and tokenized training dataset
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_for_train,
)

# Retry the training process
train_results = trainer.train()

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

ValueError: Column name ['formatted_chat'] not in the dataset. Current columns in the dataset: ['prompt', 'prompt_id', 'messages', 'input_ids', 'attention_mask', 'labels']

**Reasoning**:
The traceback indicates that the column 'formatted_chat' is not present in the dataset when I try to remove it, but 'messages', 'prompt', and 'prompt_id' are. This is because in the previous step, I applied the `tokenize_and_prepare_labels` function which outputs 'input_ids', 'attention_mask', and 'labels'. The `formatted_chat` column was created in a previous step before tokenization. When I reloaded the dataset and applied the tokenization function directly, `formatted_chat` was not recreated. I should remove the columns that are actually present in the dataset after tokenization and before training.



In [16]:
# Reload a small portion of the dataset to get the original columns back
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:100]")

# Apply the tokenization and preparation function to the reloaded dataset
def tokenize_and_prepare_labels(examples):
    chat_template = "{% for message in messages %}{% if message['role'] == 'user' %}{{ '<|user|>\n' + message['content'] + eos_token }}{% elif message['role'] == 'assistant' %}{{ '<|assistant|>\n' + message['content'] + eos_token }}{% endif %}{% if loop.last and message['role'] == 'assistant' %}{{ '<|endoftext|>' }}{% endif %}{% endfor %}"
    tokenized_output = tokenizer.apply_chat_template(
        examples["messages"],
        chat_template=chat_template,
        tokenize=True,
        add_generation_prompt=False,
        return_dict=True,
        padding="max_length", # Pad sequences to max length
        truncation=True, # Truncate sequences exceeding max length
        max_length=tokenizer.model_max_length # Use model's max length
    )

    # For training, the labels are typically the same as input_ids
    tokenized_output["labels"] = tokenized_output["input_ids"].copy()
    return tokenized_output

tokenized_dataset_for_train = dataset.map(tokenize_and_prepare_labels, batched=True)

# Ensure only necessary columns are present for the trainer by removing the original ones that exist after tokenization
tokenized_dataset_for_train = tokenized_dataset_for_train.remove_columns(['messages', 'prompt', 'prompt_id'])

# Re-initialize the trainer with the correctly formatted and tokenized training dataset
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_for_train,
)

# Retry the training process
train_results = trainer.train()

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 176.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 14.12 MiB is free. Process 10655 has 14.72 GiB memory in use. Of the allocated memory 14.57 GiB is allocated by PyTorch, and 33.82 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

**Reasoning**:
The traceback indicates a `OutOfMemoryError: CUDA out of memory`. This means the model is too large or the batch size is too big for the available GPU memory. Since I am using a substitute model ("TinyLlama/TinyLlama-1.1B-Chat-v1.0") which is larger than the intended SmolLM2-135M, and full fine-tuning is memory-intensive, I need to reduce the memory usage. I will try reducing the `per_device_train_batch_size` in the `TrainingArguments` and re-initialize the trainer.



In [17]:
# Reduce the per_device_train_batch_size to reduce memory usage
per_device_train_batch_size = 1

# Update the training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    warmup_steps=warmup_steps,
    logging_dir=logging_dir,
    report_to="none" # Disable reporting to external services like W&B
)

# Re-initialize the trainer with the updated training arguments and the prepared dataset
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_for_train,
)

# Retry the training process
train_results = trainer.train()

OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 14.12 MiB is free. Process 10655 has 14.72 GiB memory in use. Of the allocated memory 14.57 GiB is allocated by PyTorch, and 33.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

## Fine-tune the model retry 1

### Subtask:
Retry the training process for fine-tuning the model.


**Reasoning**:
Based on the instructions, I need to reduce the effective batch size using gradient accumulation, re-initialize the TrainingArguments and Trainer, and then retry the training process. This can all be done in a single code block.



In [18]:
from transformers import TrainingArguments, Trainer

# Reduce the effective batch size further by using gradient accumulation
gradient_accumulation_steps = 8

# Update the training arguments
# The other arguments are kept from the previous successful step
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=1, # Keep batch size at 1 due to previous OOM errors
    gradient_accumulation_steps=gradient_accumulation_steps, # Add gradient accumulation
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=100,
    logging_dir="./logs",
    report_to="none" # Disable reporting to external services like W&B
)

print("Training arguments updated with gradient accumulation.")
print(training_args)

# Re-initialize the trainer with the updated training arguments and the prepared dataset
# Assuming tokenized_dataset_for_train is available from previous steps
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_for_train, # Use the full tokenized training dataset
)

print("Trainer object re-initialized successfully with updated arguments.")

# Retry the training process
print("Attempting to start training with gradient accumulation...")
train_results = trainer.train()

print("Training finished.")

Training arguments updated with gradient accumulation.
TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=True,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strate

OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 12.12 MiB is free. Process 10655 has 14.73 GiB memory in use. Of the allocated memory 14.57 GiB is allocated by PyTorch, and 26.82 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

## Fine-tune the model retry 2

### Subtask:
Retry the training process for fine-tuning the model, addressing the persistent CUDA Out of Memory errors.


**Reasoning**:
Decrease the number of training epochs to 1, update the training arguments and the trainer, and attempt to start the training process again as a last effort to see if a very short run is possible.



In [19]:
# Decrease the number of training epochs
num_train_epochs = 1

# Update the training arguments
# Keep other arguments as in the previous successful step
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=100,
    logging_dir="./logs",
    report_to="none"
)

print(f"Training arguments updated with num_train_epochs = {num_train_epochs}.")
print(training_args)

# Re-initialize the trainer with the updated training arguments
# Assuming model and tokenized_dataset_for_train are available from previous steps
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_for_train,
)

print("Trainer object re-initialized successfully with updated arguments.")

# Attempt to start the training process again
print("Attempting to start training with reduced epochs...")
try:
    train_results = trainer.train()
    print("Training finished successfully.")
except Exception as e:
    print(f"Training failed with error: {e}")
    train_results = None


Training arguments updated with num_train_epochs = 1.
TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=True,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strateg

## Summary:

### Data Analysis Key Findings

*   The required libraries (`transformers`, `datasets`, and `peft`) were already installed in the environment.
*   The dataset was successfully loaded and confirmed to be in the expected format of a list of conversations, each containing messages with 'role' and 'content'.
*   A chat template was defined and successfully applied to format the conversational data into a single string suitable for model input.
*   The intended `SmolLM/SmolLM2-135M` model could not be loaded due to an `OSError`, and a substitute model, `TinyLlama/TinyLlama-1.1B-Chat-v1.0`, was loaded instead.
*   Full fine-tuning is the default behavior for `AutoModelForCausalLM` without PEFT configurations, so no specific configuration was needed for the model itself.
*   Training arguments including `output_dir`, `num_train_epochs`, `per_device_train_batch_size`, `learning_rate`, `weight_decay`, `warmup_steps`, and `logging_dir` were successfully defined.
*   The `Trainer` object was initialized successfully with the loaded model, defined training arguments, and the prepared dataset.
*   Evaluating the base model required tokenizing the dataset, removing unused columns, and ensuring `input_ids` and `attention_mask` were present, along with a `labels` column. Padding and truncation were also necessary to handle sequence lengths.
*   The base model was successfully evaluated, providing baseline metrics like `eval_loss`.
*   Attempts to start the full fine-tuning process repeatedly failed with "CUDA out of memory" errors, even after reducing the `per_device_train_batch_size` to 1 and implementing `gradient_accumulation_steps` of 8.
*   Reducing the number of training epochs to 1 did not resolve the memory issues; the errors persisted early in the training process.

### Insights or Next Steps

*   Full fine-tuning the 1.1B parameter substitute model is not feasible with the current GPU memory constraints (14.74 GiB).
*   To successfully fine-tune this model or a similar-sized model within the available resources, it would be necessary to use memory-efficient techniques like PEFT (e.g., LoRA) or utilize hardware with significantly more GPU memory.
