### Streamlined Mistral-7B Fine-Tuning for Scientific Research (Reproducible & Structured)

This notebook follows a three-part layout: all imports at the top, function definitions
in the middle, and the main execution logic at the bottom.

It also clarifies how work is divided between the CPU (data preparation) and the GPU (model training).

**Execution Flow:**
* **CPU Phase (Data Preparation/Labeling):** The `load_and_prepare_dataset` function operates on the CPU, handling dataset loading, tokenization, and initial processing.
* **GPU Phase (Weight Computation):** The `fine_tune_model` function, utilizing `transformers.Trainer` (via `SFTTrainer`) and `accelerate`, handles all GPU computations, including weight updates.
* **Asynchronous Batching:** `DataCollatorForLanguageModeling` pads and assembles batches on the CPU. The Trainer's data loader then moves each batch to the GPU during training, using pinned host memory by default (`dataloader_pin_memory=True`) to speed up the transfer.
* **Custom Token Batching (Conceptual):** The "100M token pool, feed 30M until 100M" strategy is an advanced data loading pattern and is not implemented here, since it requires a custom `IterableDataset` or `DataCollator`. In this notebook, `MAX_SEQ_LENGTH` and `BATCH_SIZE` control how much data reaches the GPU per step, and `group_by_length` reduces padding by batching sequences of similar length. For true 100M/30M token chunks, you would typically preprocess the dataset into these larger units or implement a custom streaming data loader before passing the data to the `Trainer`.
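
The token-pool idea in the last bullet can be sketched without any training framework. The generator below is a minimal illustration, assuming already-tokenized samples (lists of token ids). The names `iter_token_chunks` and `chunk_budget`, and the toy sizes standing in for the 100M/30M figures, are hypothetical and not part of this notebook.

```python
from typing import Iterable, Iterator, List


def iter_token_chunks(
    samples: Iterable[List[int]], chunk_budget: int
) -> Iterator[List[List[int]]]:
    """Group tokenized samples into chunks holding at most `chunk_budget` tokens.

    A sample longer than the budget is emitted as a chunk of its own rather
    than being split, so no sample is dropped or cut.
    """
    chunk: List[List[int]] = []
    used = 0
    for ids in samples:
        # Close the current chunk if adding this sample would exceed the budget.
        if chunk and used + len(ids) > chunk_budget:
            yield chunk
            chunk, used = [], 0
        chunk.append(ids)
        used += len(ids)
    if chunk:
        yield chunk  # final, possibly smaller, chunk


# Toy pool: 10 samples of 10 tokens each (100 tokens), fed in chunks of up to 30 tokens.
pool = [[0] * 10 for _ in range(10)]
chunk_sizes = [sum(len(s) for s in c) for c in iter_token_chunks(pool, chunk_budget=30)]
print(chunk_sizes)  # [30, 30, 30, 10]
```

In a real run, each yielded chunk could be wrapped in a dataset object and handed to the trainer in turn; how best to do that depends on the data loader in use.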

# Imports

In [25]:
import os
import torch
import json
import gc
from huggingface_hub import login
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer
from kaggle_secrets import UserSecretsClient

os.environ["BNB_CUDA_VERSION"] = "124"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

print("Installing essential libraries...")
!pip install --no-deps transformers==4.51.3 bitsandbytes==0.46.0 peft==0.12.0 trl==0.11.1 accelerate==0.34.2
!pip install datasets
print("Library installation complete. Please restart your kernel if prompted.")
try:
    import transformers
    import bitsandbytes
    import peft
    import trl
    import accelerate
    print("transformers version:", transformers.__version__)
    print("bitsandbytes version:", bitsandbytes.__version__)
    print("peft version:", peft.__version__)
    print("trl version:", trl.__version__)
    print("accelerate version:", accelerate.__version__)
    print("torch version:", torch.__version__)
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    !nvidia-smi
except ImportError as e:
    print(f"Import error during version check: {e}")
    raise

Installing essential libraries...
Library installation complete. Please restart your kernel if prompted.
transformers version: 4.51.3
bitsandbytes version: 0.46.0
peft version: 0.12.0
trl version: 0.11.1
accelerate version: 0.34.2
torch version: 2.6.0+cu124
CUDA available: True
CUDA version: 12.4
Tue Jun 24 22:55:58 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   77C    P0         

# Main Functions

In [26]:
class Config:
    """Centralized configuration for the fine-tuning process."""
    MODEL_NAME = "mistralai/Mistral-7B-v0.1"
    DATASET_NAME = "Allanatrix/Scientific_Research_Tokenized"
    NEW_MODEL_NAME = "nexa-mistral-sci7b"
    MAX_SEQ_LENGTH = 1024
    BATCH_SIZE = 1
    GRADIENT_ACCUMULATION_STEPS = 64
    LEARNING_RATE = 2e-5
    NUM_TRAIN_EPOCHS = 2
    OUTPUT_DIR = "/kaggle/working/results"
    ARTIFACTS_DIR = "/kaggle/working/artifacts"

    def to_dict(self):
        """Converts config to a dictionary for JSON export."""
        # The settings are class attributes, so vars(self) would be empty; read them from the class.
        cls = type(self)
        return {k: v for k, v in vars(cls).items() if not k.startswith('_') and not callable(v)}

def hf_login():
    """Logs into Hugging Face Hub using Kaggle Secrets."""
    try:
        client = UserSecretsClient()
        token = client.get_secret("HF_TOKEN")
        login(token=token)
        print("Hugging Face login complete.")
    except Exception as e:
        print(f"Failed to access HF_TOKEN: {e}. Please ensure 'HF_TOKEN' is set in Kaggle Secrets.")
        raise

def get_model_and_tokenizer(model_name: str):
    """Loads the base model with 4-bit quantization and its tokenizer."""
    try:
        torch.cuda.empty_cache()
        gc.collect()
        import bitsandbytes as bnb
        print("bitsandbytes loaded successfully")
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=False,
        )
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=bnb_config,
            trust_remote_code=True,
            device_map={"": 0}
        )
        model.config.use_cache = False
        model.config.pretraining_tp = 1
        tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.padding_side = "right"
        return model, tokenizer
    except Exception as e:
        print(f"Error loading model: {e}")
        print("Ensure bitsandbytes is correctly installed and CUDA is compatible.")
        raise
    finally:
        torch.cuda.empty_cache()
        gc.collect()

def load_and_prepare_dataset(dataset_name: str, tokenizer: AutoTokenizer, max_seq_length: int):
    """Loads and tokenizes the dataset on CPU."""
    print(f"Loading dataset '{dataset_name}'...")
    try:
        torch.cuda.empty_cache()
        gc.collect()
        dataset = load_dataset(dataset_name)
        print(f"Dataset columns: {dataset['train'].column_names}")
        def tokenize_function(examples):
            return tokenizer(
                examples["input_text"],
                truncation=True,
                max_length=max_seq_length
            )
        print("Tokenizing dataset...")
        tokenized_dataset = dataset.map(
            tokenize_function,
            batched=True,
            remove_columns=[col for col in dataset["train"].column_names if col != "input_ids"],
            desc="Tokenizing dataset"
        )
        tokenized_dataset = tokenized_dataset.filter(lambda x: len(x["input_ids"]) > 0, desc="Filtering empty sequences")
        return tokenized_dataset
    except Exception as e:
        print(f"Error loading or tokenizing dataset: {e}")
        raise
    finally:
        torch.cuda.empty_cache()
        gc.collect()

def get_lora_config():
    """Returns the LoRA configuration."""
    lora_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        bias="none",
        task_type="CAUSAL_LM",
    )
    return lora_config

def get_training_arguments(config: Config):
    """Returns the TrainingArguments."""
    training_args = TrainingArguments(
        output_dir=config.OUTPUT_DIR,
        num_train_epochs=config.NUM_TRAIN_EPOCHS,
        per_device_train_batch_size=config.BATCH_SIZE,
        gradient_accumulation_steps=config.GRADIENT_ACCUMULATION_STEPS,
        optim="paged_adamw_8bit",
        save_steps=25,
        logging_steps=25,
        learning_rate=config.LEARNING_RATE,
        weight_decay=0.001,
        bf16=True,
        max_grad_norm=0.3,
        max_steps=-1,
        warmup_ratio=0.03,
        group_by_length=True,
        lr_scheduler_type="cosine",
        report_to="tensorboard"
    )
    return training_args

def fine_tune_model(model: AutoModelForCausalLM, dataset, tokenizer: AutoTokenizer, lora_config: LoraConfig, training_args: TrainingArguments, max_seq_length: int):
    """Performs model fine-tuning on GPU."""
    try:
        torch.cuda.empty_cache()
        gc.collect()
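        # Causal-LM collator: pads each batch and derives labels from input_ids (mlm=False).
        data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
        trainer = SFTTrainer(
            model=model,
            train_dataset=dataset["train"],
            peft_config=lora_config,
            # The dataset is already tokenized, so the existing "input_ids" column is expected to be used as-is.
            dataset_text_field="input_ids",
            max_seq_length=max_seq_length,
            tokenizer=tokenizer,
            data_collator=data_collator,  # previously created but never passed to the trainer
            args=training_args
        )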
        print("Starting model fine-tuning...")
        trainer.train()
        return trainer
    except Exception as e:
        print(f"Error during fine-tuning: {e}")
        raise
    finally:
        torch.cuda.empty_cache()
        gc.collect()

def save_model_artifacts(trainer: SFTTrainer, config: Config, training_args: TrainingArguments):
    """Saves the fine-tuned model weights and artifacts."""
    try:
        final_model_path = os.path.join(config.ARTIFACTS_DIR, config.NEW_MODEL_NAME)
        trainer.save_model(final_model_path)
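        # Trainer.tokenizer is deprecated in this transformers version; use processing_class instead.
        trainer.processing_class.save_pretrained(final_model_path)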
        print(f"Model and tokenizer saved to: {final_model_path}")
        config_filename = os.path.join(config.ARTIFACTS_DIR, "training_config.json")
        with open(config_filename, 'w') as f:
            json.dump(config.to_dict(), f, indent=4)
        print(f"Training configuration saved to: {config_filename}")
        training_args_filename = os.path.join(config.ARTIFACTS_DIR, "training_arguments.json")
        with open(training_args_filename, 'w') as f:
            json.dump(training_args.to_dict(), f, indent=4)
        print(f"Training arguments saved to: {training_args_filename}")
    except Exception as e:
        print(f"Error saving artifacts: {e}")
        raise
    finally:
        torch.cuda.empty_cache()
        gc.collect()
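
As a quick sanity check on the settings in `Config` and `get_training_arguments`, the arithmetic below shows how many sequences and tokens contribute to each optimizer step, and roughly how many optimizer steps a run would take. The dataset size used here is a made-up placeholder (the notebook does not state it), and the step formula is a simplified approximation of the Trainer's own accounting.

```python
import math

# Values taken from Config and get_training_arguments above.
BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 64
MAX_SEQ_LENGTH = 1024
NUM_TRAIN_EPOCHS = 2
LOGGING_STEPS = 25

# Sequences consumed per optimizer step on a single GPU.
effective_batch = BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
# Upper bound on tokens per optimizer step (sequences are truncated to MAX_SEQ_LENGTH).
max_tokens_per_step = effective_batch * MAX_SEQ_LENGTH


def approx_optimizer_steps(num_samples: int) -> int:
    """Approximate total optimizer steps (simplified; the Trainer's exact count may differ)."""
    batches_per_epoch = math.ceil(num_samples / BATCH_SIZE)
    steps_per_epoch = max(batches_per_epoch // GRADIENT_ACCUMULATION_STEPS, 1)
    return steps_per_epoch * NUM_TRAIN_EPOCHS


# HYPOTHETICAL dataset size, for illustration only.
num_samples = 500
total_steps = approx_optimizer_steps(num_samples)

print(effective_batch)               # 64
print(max_tokens_per_step)           # 65536
print(total_steps)                   # 14
print(total_steps >= LOGGING_STEPS)  # False
```

When the total number of optimizer steps is below `logging_steps`, the run finishes before the first loss value is logged. This is one possible reason for an empty training-loss table on a small dataset.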

# Main Loop

In [27]:
def main():
    """Orchestrates the fine-tuning workflow."""
    try:
        torch.cuda.empty_cache()
        gc.collect()
        config = Config()
        os.makedirs(config.ARTIFACTS_DIR, exist_ok=True)
        print(f"Artifacts will be saved to: {config.ARTIFACTS_DIR}")
        print(f"CUDA available: {torch.cuda.is_available()}")
        print(f"CUDA version: {torch.version.cuda}")
        !nvidia-smi
        hf_login()
        print("Setting up model and tokenizer...")
        model, tokenizer = get_model_and_tokenizer(config.MODEL_NAME)
        print("Preparing dataset...")
        dataset = load_and_prepare_dataset(config.DATASET_NAME, tokenizer, config.MAX_SEQ_LENGTH)
        print(f"Dataset prepared with splits: {dataset.keys()}")
        print("Configuring LoRA and training arguments...")
        lora_config = get_lora_config()
        model.gradient_checkpointing_enable()
        model = prepare_model_for_kbit_training(model)
        model = get_peft_model(model, lora_config)
        model.print_trainable_parameters()
        training_args = get_training_arguments(config)
        print("Starting fine-tuning...")
        trainer = fine_tune_model(
            model,
            dataset,
            tokenizer,
            lora_config,
            training_args,
            config.MAX_SEQ_LENGTH
        )
        print("Saving model artifacts...")
        save_model_artifacts(trainer, config, training_args)
        print("Fine-tuning complete.")
    except Exception as e:
        print(f"Error in main loop: {e}")
        raise
    finally:
        torch.cuda.empty_cache()
        gc.collect()

if __name__ == "__main__":
    main()

Artifacts will be saved to: /kaggle/working/artifacts
CUDA available: True
CUDA version: 12.4
Tue Jun 24 22:55:59 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   77C    P0             34W /   70W |    9921MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+---

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Preparing dataset...
Loading dataset 'Allanatrix/Scientific_Research_Tokenized'...
Dataset columns: ['input_text', 'target_hypothesis', 'expert_label']
Tokenizing dataset...
Dataset prepared with splits: dict_keys(['train'])
Configuring LoRA and training arguments...
trainable params: 27,262,976 || all params: 7,268,995,072 || trainable%: 0.3751
Starting fine-tuning...

Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
  super().__init__(
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.

Starting model fine-tuning...

  return fn(*args, **kwargs)

Step,Training Loss

Saving model artifacts...

Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.

Model and tokenizer saved to: /kaggle/working/artifacts/nexa-mistral-sci7b
Training configuration saved to: /kaggle/working/artifacts/training_config.json
Training arguments saved to: /kaggle/working/artifacts/training_arguments.json
Fine-tuning complete.
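
The trainable-parameter count printed in the output above can be reproduced by hand. The sketch below assumes that, because `get_lora_config` sets no `target_modules`, PEFT falls back to its default Mistral targets (`q_proj` and `v_proj`). It also assumes the published Mistral-7B dimensions: hidden size 4096, 32 layers, and 8 key/value heads of dimension 128. Neither assumption is stated in the notebook itself.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two matrices per adapted linear layer: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


R = 64             # from get_lora_config()
HIDDEN = 4096      # assumed Mistral-7B hidden size
KV_DIM = 8 * 128   # assumed: 8 key/value heads x head dimension 128
NUM_LAYERS = 32    # assumed Mistral-7B depth

# q_proj maps HIDDEN -> HIDDEN; v_proj maps HIDDEN -> KV_DIM.
per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
trainable = per_layer * NUM_LAYERS

ALL_PARAMS = 7_268_995_072  # "all params" as reported in the output above
percent = round(100 * trainable / ALL_PARAMS, 4)

print(f"{trainable:,}")  # 27,262,976
print(percent)           # 0.3751
```

Under these assumptions the result matches the logged `trainable params: 27,262,976` and `trainable%: 0.3751`.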
