[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/your-repo/your-notebook.ipynb)

## Install Necessary Packages
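First, we clone the LLMFinetuner repository (when running in Google Colab), run its environment setup script, and add the repository to the Python path so that its modules can be imported.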

In [1]:
import os
import sys

# Check if running in Google Colab
if 'COLAB_GPU' in os.environ:
    !git clone https://github.com/JulianLopezB/LLMFinetuner.git
    !python LLMFinetuner/setup_environment.py
    # Add the cloned repository to the Python path
    sys.path.append('/content/LLMFinetuner')
else:
    print("Assuming the repository is already cloned and environment is set up.")
    # Add the parent directory to the Python path
    sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))


Assuming the repository is already cloned and environment is set up.


## Import Necessary Libraries
Next, we import all the necessary libraries and modules that we will use throughout the notebook.

In [2]:
import torch
from dotenv import load_dotenv
from omegaconf import OmegaConf
from src import DataLoader, ModelSetup, CustomTrainer, Evaluator, HuggingFaceIntegration



## Check Environment
Check that CUDA is available and that the `HUGGINGFACE_TOKEN` environment variable is set.

In [3]:
# Check for CUDA availability
if not torch.cuda.is_available():
    print("CUDA is not available. Please check your installation of CUDA and NVIDIA drivers.")

# Check for HUGGINGFACE_TOKEN environment variable
if 'HUGGINGFACE_TOKEN' not in os.environ:
    print("HUGGINGFACE_TOKEN is not set. Please set this environment variable.")
    # Fall back to an interactive login when no token is configured
    from huggingface_hub import notebook_login
    notebook_login()

CUDA is not available. Please check your installation of CUDA and NVIDIA drivers.


## Configuration YAML Explanation
The configuration YAML file consists of several sections that define the parameters for the model, dataset, and training process. Here is a breakdown of the sections:

- **model**: Contains the model name, new model name, quantization settings, and device map.
  - `name`: The name of the pre-trained model to use.
  - `new_model`: The name to save the fine-tuned model.
  - `quantization`: Settings for model quantization.
  - `device_map`: Device mapping for model loading.
- **dataset**: Contains the dataset path, type, and source.
  - `path`: The path to the dataset file.
  - `type`: The type of dataset.
  - `from_huggingface`: Boolean indicating if the dataset is from Hugging Face.
- **training**: Contains training parameters and settings.
  - `output_dir`: Directory to save the output.
  - `peft_enabled`: Boolean indicating if PEFT is enabled.
  - `peft_config`: LoRA settings used when PEFT is enabled.
  - `hf_push`: Boolean indicating if the model should be pushed to Hugging Face.
  - `hf_org`: The organization name on Hugging Face.
  - `trainer_args`: Additional arguments for the trainer.
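
The section layout above can be checked before any training starts. The following is a minimal sketch, assuming a hypothetical helper named `missing_keys` that is not part of LLMFinetuner. The key names follow the configuration dictionary defined later in this notebook, and the `example` dictionary is made up for illustration.

```python
# Hypothetical helper, not part of LLMFinetuner: report required keys that a
# configuration dictionary is missing. Key names follow this notebook's config.
REQUIRED_KEYS = {
    "model": ["name", "new_model", "quantization", "device_map"],
    "dataset": ["path", "type", "from_huggingface"],
    "training": ["output_dir", "peft_enabled", "peft_config",
                 "hf_push", "hf_org", "trainer_args"],
}


def missing_keys(config):
    """Return dotted paths such as 'training.hf_org' for every absent key."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        if section not in config:
            # The whole section is absent, so report it once.
            missing.append(section)
            continue
        for key in keys:
            if key not in config[section]:
                missing.append(f"{section}.{key}")
    return missing


# Made-up, deliberately incomplete configuration.
example = {
    "model": {"name": "mistralai/Mistral-7B-Instruct-v0.1"},
    "dataset": {"path": "data.jsonl", "type": "alpaca", "from_huggingface": False},
}
print(missing_keys(example))
```

Running a check like this on the full configuration should print an empty list; any reported path points at a section or key that still needs a value.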

## Hardcoded Configuration
Instead of loading a YAML file, this notebook defines the configuration as a Python dictionary and prints it as YAML to verify it.

In [6]:
config = {
    'model': {
        'name': 'mistralai/Mistral-7B-Instruct-v0.1',
        'new_model': 'Mistral-7B-Instruct-detcext-v0.1',
        'quantization': {
            'load_in_4bit': True,
            'bnb_4bit_use_double_quant': True,
            'bnb_4bit_quant_type': 'nf4',
            'bnb_4bit_compute_dtype': 'bfloat16'
        },
        'device_map': {
            0: ""  # Hugging Face device maps are usually written as {"": 0} (module name -> device); check what ModelSetup expects
        }
    },
    'dataset': {
        'path': '../data/eval/example_instruction_dataset.jsonl',
        'type': 'alpaca',
        'from_huggingface': False
    },
    'training': {
        'output_dir': './output',
        'peft_enabled': True,
        'peft_config': {
            'r': 8,
            'lora_alpha': 32,
            'target_modules': ['q_proj', 'v_proj'],  # Attention projections to adapt; adjust for other architectures
            'lora_dropout': 0.05,
            'bias': 'none',
            'task_type': 'CAUSAL_LM'
        },
        'hf_push': True,
        'hf_org': 'my-organization',
        'trainer_args': {
            'per_device_train_batch_size': 1,
            'gradient_accumulation_steps': 4,
            'max_steps': 100,
            'learning_rate': 0.0002,
            'logging_steps': 1,
            'save_strategy': 'epoch',
            'optim': 'paged_adamw_8bit'
        }
    }
}

# Print the configuration to verify
print(OmegaConf.to_yaml(OmegaConf.create(config)))

model:
  name: mistralai/Mistral-7B-Instruct-v0.1
  new_model: Mistral-7B-Instruct-detcext-v0.1
  quantization:
    load_in_4bit: true
    bnb_4bit_use_double_quant: true
    bnb_4bit_quant_type: nf4
    bnb_4bit_compute_dtype: bfloat16
  device_map:
    0: ''
dataset:
  path: ../data/eval/example_instruction_dataset.jsonl
  type: alpaca
  from_huggingface: false
training:
  output_dir: ./output
  peft_enabled: true
  peft_config:
    r: 8
    lora_alpha: 32
    target_modules:
    - q_proj
    - v_proj
    lora_dropout: 0.05
    bias: none
    task_type: CAUSAL_LM
  hf_push: true
  hf_org: my-organization
  trainer_args:
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 4
    max_steps: 100
    learning_rate: 0.0002
    logging_steps: 1
    save_strategy: epoch
    optim: paged_adamw_8bit
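
Two quantities follow directly from these settings: the effective batch size per optimizer update and the LoRA scaling factor. The short calculation below is a sketch that copies the values from the configuration. The single-GPU setting (`num_devices = 1`) is an assumption, and the scaling formula `lora_alpha / r` is the default LoRA scaling.

```python
# Values copied from the configuration above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 100
lora_alpha = 32
r = 8
num_devices = 1  # assumption: training runs on a single GPU

# Number of examples that contribute to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Examples consumed over the whole run, counting repeats across epochs.
examples_seen = effective_batch_size * max_steps

# Default LoRA scaling applied to the low-rank update.
lora_scaling = lora_alpha / r

print(effective_batch_size)  # 4
print(examples_seen)         # 400
print(lora_scaling)          # 4.0
```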



## Load Dataset
Load the dataset using the `DataLoader` class.

In [7]:
# Load the dataset and hold out 20% of the examples for evaluation
data_loader = DataLoader(config['dataset']['path'], from_huggingface=config['dataset']['from_huggingface'])
split = data_loader.get_dataset()['train'].train_test_split(test_size=0.2)
train_dataset, eval_dataset = split['train'], split['test']

Generating train split: 8 examples [00:00, 90.16 examples/s]
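
The configuration declares the dataset type as `alpaca`. How `DataLoader` interprets that type is not shown in this notebook, so the following is only a sketch of what an Alpaca-style JSONL file typically looks like and how a record could be turned into a prompt. The field names (`instruction`, `input`, `output`) follow the common Alpaca layout, the prompt is a simplified version of that template, and the sample records and helper names (`parse_jsonl`, `to_prompt`) are made up for illustration.

```python
import json

# Made-up sample: one JSON object per line, in the common Alpaca layout.
sample_jsonl = (
    '{"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}\n'
    '{"instruction": "Name a prime number.", "input": "", "output": "7"}\n'
)


def parse_jsonl(text):
    """Parse JSONL text into a list of dictionaries, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def to_prompt(record):
    """Render a record as a simplified Alpaca-style prompt."""
    parts = [f"### Instruction:\n{record['instruction']}"]
    # The input section is only included when the record provides one.
    if record.get("input"):
        parts.append(f"### Input:\n{record['input']}")
    parts.append("### Response:\n")
    return "\n\n".join(parts)


records = parse_jsonl(sample_jsonl)
print(to_prompt(records[0]) + records[0]["output"])
```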


## Set Up Model and Tokenizer
Set up the model and tokenizer, applying the quantization settings and device map from the configuration.

In [9]:
# Set up the model and tokenizer
model_setup = ModelSetup(
    config['model']['name'],
    quantization_config=config['model']['quantization'],
    device_map=config['model']['device_map']
)
model, tokenizer = model_setup.get_model_and_tokenizer()
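
The internals of `ModelSetup` are not shown in this notebook. As a rough sketch of what such a step commonly involves, the quantization settings can be passed to `BitsAndBytesConfig` from `transformers` and handed to `AutoModelForCausalLM.from_pretrained`. The helper names `split_compute_dtype` and `load_quantized` are hypothetical, and the heavy imports are deferred so the first helper runs without a GPU stack.

```python
def split_compute_dtype(quant_cfg):
    """Separate the compute dtype name from the other quantization options.

    Returns (options, dtype_name) without modifying the input dictionary.
    """
    options = dict(quant_cfg)
    dtype_name = options.pop("bnb_4bit_compute_dtype", "float32")
    return options, dtype_name


def load_quantized(model_name, quant_cfg, device_map):
    """Hypothetical loader: build a 4-bit model and its tokenizer."""
    # Deferred imports: these packages are only needed when a model is loaded.
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    options, dtype_name = split_compute_dtype(quant_cfg)
    bnb_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=getattr(torch, dtype_name),  # e.g. torch.bfloat16
        **options,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map=device_map,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer


quant_cfg = {
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",
}
print(split_compute_dtype(quant_cfg))
```

On a machine with a CUDA GPU and `bitsandbytes` installed, a call such as `load_quantized('mistralai/Mistral-7B-Instruct-v0.1', quant_cfg, {"": 0})` would place the whole model on GPU 0.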

## Set Up and Run Trainer
Set up the `CustomTrainer` and start the training process.

In [None]:
# Set up and run the trainer
trainer = CustomTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    output_dir=config['training']['output_dir'],
    peft_config=config['training']['peft_config'],  # LoRA settings from the configuration
    **config['training']['trainer_args']
)
trainer.train()
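
To get a feel for how small the trainable part is, the number of LoRA parameters can be estimated by hand. The sketch below assumes the published Mistral-7B layer shapes (hidden size 4096, 32 layers, and a 1024-wide `v_proj` output because of grouped-query attention) and the `r` and `target_modules` values from the configuration. Each adapted linear layer gains two low-rank matrices with `r * (in_features + out_features)` parameters in total.

```python
# LoRA rank from the configuration.
r = 8

# Assumed Mistral-7B shapes: (in_features, out_features) per target module.
target_shapes = {
    "q_proj": (4096, 4096),
    "v_proj": (4096, 1024),  # 8 key/value heads * 128 dims per head
}
num_layers = 32


def lora_params(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A is rank x in, B is out x rank."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, r) for i, o in target_shapes.values())
total = per_layer * num_layers

print(per_layer)  # 106496
print(total)      # 3407872
```

Roughly 3.4 million trainable parameters is a small fraction of the base model's roughly 7 billion, which is what makes this setup feasible with a per-device batch size of 1.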

## Evaluate Model
Evaluate the model on the evaluation dataset.

In [None]:
# Evaluate the model
evaluator = Evaluator(model, tokenizer, eval_dataset)
results = evaluator.evaluate()
print("Evaluation Results:", results)
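
Which metrics `Evaluator.evaluate()` returns is not shown in this notebook. One metric commonly reported for causal language models is perplexity, the exponential of the mean per-token negative log-likelihood. The sketch below illustrates that definition only; the `perplexity` helper and the loss values are made up and are not claimed to match what `Evaluator` computes.

```python
import math


def perplexity(token_losses):
    """Return exp(mean loss) for per-token negative log-likelihoods (natural log)."""
    if not token_losses:
        raise ValueError("token_losses must not be empty")
    return math.exp(sum(token_losses) / len(token_losses))


# A model that assigns probability 1/4 to every correct token has loss ln(4)
# per token, which corresponds to a perplexity of about 4.
losses = [math.log(4)] * 3
print(perplexity(losses))
```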

## Push Model to Hugging Face
If enabled, push the fine-tuned model to Hugging Face.

In [None]:
# If Hugging Face push is enabled
if config['training']['hf_push']:
    hf_integration = HuggingFaceIntegration(
        model,
        config['model']['name'],
        config['model']['new_model'],
        config['training']['hf_org']
    )
    hf_integration.save_and_push_model()
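
Repositories on the Hugging Face Hub are identified as `namespace/name`. How `HuggingFaceIntegration` combines `hf_org` and `new_model` is not shown here, so the following is only a sketch of the likely target identifier, using a hypothetical helper named `build_repo_id`.

```python
def build_repo_id(new_model, org=None):
    """Return 'org/name' when an organization is given, otherwise just 'name'."""
    return f"{org}/{new_model}" if org else new_model


# Values taken from the configuration in this notebook.
print(build_repo_id("Mistral-7B-Instruct-detcext-v0.1", "my-organization"))
```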