# Fine-tuning a Language Model for Custom-Style Text Generation

This notebook demonstrates how to fine-tune a language model to rewrite text in a custom style. We'll use a dataset of paired emails (a standard version and a custom-style version of each) to teach the model to transform regular text into the target style.

## Setup and Imports

In [3]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Install required packages
!pip install "torch==2.0.1" "transformers==4.34.0" "datasets==2.14.5" "accelerate==0.23.0" "bitsandbytes==0.41.1" "trl==0.7.2" "peft==0.5.0" "tensorboard" "flash-attn" --quiet

In [None]:
# Clone the repository
!git clone https://github.com/TheBormann/humanize-LLM.git
!cd humanize-LLM && pip install -r requirements.txt

fatal: destination path 'humanize-LLM' already exists and is not an empty directory.


In [None]:
import os
import logging
import sys
import pandas as pd
import torch  # used by generate_styled_text for torch.no_grad()
from typing import List, Dict

# Add the parent directory to the path
sys.path.append('/content/humanize-LLM')

# Import TRL components for efficient fine-tuning
from trl import SFTTrainer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments
)
from datasets import Dataset

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    stream=sys.stdout
)
logger = logging.getLogger(__name__)

## Data Loading and Preparation

We'll load our dataset of paired emails from a CSV file and convert each pair into a conversational (system / user / assistant) format for fine-tuning with TRL's SFTTrainer.

In [None]:
def load_emails_from_csv(file_path: str) -> pd.DataFrame:
    """Load emails from a CSV file with semicolon delimiter."""
    df = pd.read_csv(file_path, sep=';')
    logger.info(f"Loaded {len(df)} emails from {file_path}")
    return df

def prepare_training_data(emails_df: pd.DataFrame) -> List[Dict]:
    """Prepare training data in conversational format for SFTTrainer.

    Creates direct style transfer pairs in conversational format:
    - system: instruction on style transformation
    - user: original AI-generated email
    - assistant: styled version
    """
    system_message = """Transform the given email into a custom-styled version that maintains the same content but uses a more personal, unique tone. 
Your goal is to make the text feel more human-written with natural speech patterns."""

    training_samples = []

    for _, row in emails_df.iterrows():
        if pd.isna(row['body']) or pd.isna(row['body_ai']):
            continue

        # Create conversation in the format expected by TRL's SFTTrainer
        sample = {
            "messages": [
                {"role": "system", "content": system_message},
                {"role": "user", "content": row['body_ai']},  # AI-generated email
                {"role": "assistant", "content": row['body']}  # Custom style version
            ]
        }

        training_samples.append(sample)

    logger.info(f"Created {len(training_samples)} conversational training samples")
    return training_samples
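
Before training, it can help to confirm that every sample really has the three-message layout built by `prepare_training_data`. The check below is a minimal sketch; `is_valid_sample` and `EXPECTED_ROLES` are hypothetical helpers introduced only for illustration and are not part of the repository.

```python
from typing import Dict

# Hypothetical helper: the role order produced by prepare_training_data above.
EXPECTED_ROLES = ["system", "user", "assistant"]

def is_valid_sample(sample: Dict) -> bool:
    """Return True if the sample has exactly system/user/assistant messages with non-empty text."""
    messages = sample.get("messages")
    if not isinstance(messages, list) or len(messages) != len(EXPECTED_ROLES):
        return False
    for message, role in zip(messages, EXPECTED_ROLES):
        if not isinstance(message, dict) or message.get("role") != role:
            return False
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            return False
    return True

good = {"messages": [
    {"role": "system", "content": "Rewrite the email in the custom style."},
    {"role": "user", "content": "Hi [Name], ..."},
    {"role": "assistant", "content": "Ahoy [Name], ..."},
]}
bad = {"messages": [{"role": "user", "content": "Hi"}]}

print(is_valid_sample(good), is_valid_sample(bad))
```

Applied to the real data, something like `all(is_valid_sample(s) for s in training_data)` would flag malformed rows before they reach the trainer.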

In [None]:
# Set the path to the CSV file
EMAIL_CSV_PATH = '/content/humanize-LLM/data/manual_emails.csv'

# Load and prepare the dataset
emails_df = load_emails_from_csv(EMAIL_CSV_PATH)
training_data = prepare_training_data(emails_df)

# Convert to Hugging Face Dataset format
dataset = Dataset.from_list(training_data)

# Display a sample of the training data
if len(dataset) > 0:
    sample = dataset[0]
    print("Sample training conversation:")
    for message in sample['messages']:
        print(f"{message['role']}: {message['content'][:100]}...")
    print(f"Total training samples: {len(dataset)}")
else:
    print("No training data found or prepared.")

Sample training pair:
Prompt (AI-generated): Hi [Name],\n\nI'm [Your Name], founder of [Startup Name]. We're revolutionizing [industry] through [key innovation]. Would you have time next week to ...
Response (custom-style): Ahoy [Name],\n\nYer lookin' at [Your Name], fearsome captain of [Startup Name]. We be chartin' treacherous waters of [industry] with [key innovation] ...
Total training pairs: 69


## Model Selection and QLoRA Configuration

We'll use Mistral-7B-Instruct-v0.2, a 7B-parameter model that is typically small enough to fine-tune on a single Colab GPU once it is loaded in 4-bit precision with QLoRA.
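
To see why 4-bit loading matters on Colab, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It assumes roughly 7.24 billion parameters (an approximate figure for Mistral-7B) and ignores activations, optimizer state, and quantization overhead, so real usage will be higher.

```python
# Rough weight-memory estimate for a ~7B-parameter model at several precisions.
# Assumption: ~7.24e9 parameters (approximate figure for Mistral-7B).
N_PARAMS = 7.24e9

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "nf4 (4-bit)": 0.5,
}

def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GiB (no activations, optimizer state, or overhead)."""
    return n_params * bytes_per_param / 2**30

for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>12}: {weight_memory_gib(N_PARAMS, nbytes):5.1f} GiB")
```

Under these assumptions, the 4-bit weights take about a quarter of the half-precision footprint, which leaves room on a single GPU for the LoRA adapters and activations.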

In [None]:
# Model configuration
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # A smaller but capable model
OUTPUT_DIR = "/content/drive/MyDrive/custom_style_model"

# Create output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)

# QLoRA Configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16"
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,  # Rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

# Training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="constant_with_warmup",
    warmup_ratio=0.1,
    bf16=True,  # bfloat16 mixed precision; needs an Ampere-or-newer GPU (on a T4, use fp16=True instead)
    save_strategy="epoch",
    logging_steps=10,
    logging_dir=f"{OUTPUT_DIR}/logs",
    report_to="tensorboard"
)
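
The training arguments above imply a very short run. The sketch below estimates the effective batch size, the number of optimizer steps, and the warmup length; `estimate_schedule` is a hypothetical helper. The numbers are approximate: the Trainer's own rounding may differ slightly, and packing changes how many sequences the dataset actually yields.

```python
import math

def estimate_schedule(num_sequences, per_device_batch, grad_accum, epochs,
                      warmup_ratio, num_devices=1):
    """Rough estimate of optimizer steps; not the Trainer's exact arithmetic."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = max(1, math.ceil(num_sequences / effective_batch))
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

# 69 samples (the dataset size reported above), using the values from training_args.
schedule = estimate_schedule(
    num_sequences=69, per_device_batch=1, grad_accum=4, epochs=3, warmup_ratio=0.1
)
print(schedule)
```

With only a few dozen optimizer steps, `logging_steps=10` produces just a handful of log entries, so a smaller value may give a more useful loss curve.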

## Fine-tuning with SFTTrainer and QLoRA

We'll use the SFTTrainer from TRL with QLoRA for parameter-efficient fine-tuning, significantly reducing memory requirements while maintaining performance.

In [None]:
def load_and_prepare_model():
    """Load and prepare the model for QLoRA fine-tuning"""
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token

    # Load model with quantization
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        device_map="auto"
    )

    # Prepare model for kbit training
    model = prepare_model_for_kbit_training(model)

    # Apply LoRA
    model = get_peft_model(model, lora_config)

    # Print trainable parameters info
    model.print_trainable_parameters()

    return model, tokenizer
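
The number reported by `print_trainable_parameters()` can be cross-checked by hand. A LoRA adapter of rank `r` on a linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` parameters. The sketch below applies this to the seven target modules, assuming the published Mistral-7B dimensions (hidden size 4096, key/value projection size 1024, MLP size 14336, 32 layers); these dimensions are assumptions and are not read from the model.

```python
# Assumed Mistral-7B dimensions (not read from the model config).
HIDDEN = 4096
KV_DIM = 1024          # 8 key/value heads x 128 dims (grouped-query attention)
INTERMEDIATE = 14336
NUM_LAYERS = 32
RANK = 16              # matches r=16 in lora_config

# (in_features, out_features) for each targeted linear layer
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one adapter: A is (rank x in_features), B is (out_features x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,}  total: {total:,}")
```

Under these assumptions the adapters hold well under 1% of the base model's parameters, which is what keeps optimizer state small.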

In [None]:
def finetune_model(dataset):
    """Fine-tune model using SFTTrainer with QLoRA"""
    logger.info("Loading and preparing model...")
    model, tokenizer = load_and_prepare_model()

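    def format_conversation(example):
        """Render one conversation to a single training string via the chat template."""
        messages = example["messages"]
        # Assumption: the Mistral-Instruct chat template accepts only alternating
        # user/assistant turns, so the system instruction is folded into the user turn.
        if messages[0]["role"] == "system":
            merged = f"{messages[0]['content']}\n\n{messages[1]['content']}"
            messages = [{"role": "user", "content": merged}] + messages[2:]
        return tokenizer.apply_chat_template(messages, tokenize=False)

    logger.info("Initializing SFTTrainer...")
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        args=training_args,
        max_seq_length=1024,
        # "messages" is a list of dicts, not text, so it cannot be used as
        # dataset_text_field in this TRL version; render it to a string instead.
        formatting_func=format_conversation,
        packing=True  # Enable packing for more efficient training
    )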

    logger.info("Starting fine-tuning...")
    trainer.train()

    logger.info(f"Saving model to {OUTPUT_DIR}")
    trainer.save_model(OUTPUT_DIR)

    return model, tokenizer
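
The `packing=True` option used above concatenates several short examples into fixed-length sequences so that little of each batch is padding. The following is a simplified illustration of the idea on plain token-ID lists. It is not TRL's implementation; `pack_sequences` is a hypothetical helper.

```python
from typing import Iterable, List

def pack_sequences(tokenized: Iterable[List[int]], seq_length: int, eos_id: int) -> List[List[int]]:
    """Concatenate token lists (each followed by EOS) and cut the stream into fixed-length blocks.

    The trailing partial block is dropped, so a few tokens at the end may be unused.
    """
    stream: List[int] = []
    for ids in tokenized:
        stream.extend(ids)
        stream.append(eos_id)
    return [
        stream[start:start + seq_length]
        for start in range(0, len(stream) - seq_length + 1, seq_length)
    ]

examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_sequences(examples, seq_length=4, eos_id=0)
print(blocks)
```

Note that a packed block can start or end in the middle of an email. With a dataset this small, packing also reduces the number of training sequences, so it is worth comparing against `packing=False`.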

In [None]:
# Run the fine-tuning
model, tokenizer = finetune_model(dataset)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

## Test and Evaluate the Fine-tuned Model

Let's test our fine-tuned model with some example prompts, then compare its outputs against reference emails from the dataset.

In [None]:
# Function to generate responses with our fine-tuned model
def generate_styled_text(prompt, model, tokenizer, max_new_tokens=200):
    """Generate styled text from prompt using our fine-tuned model"""
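    # Prepare conversation for inference, using the same instruction text as in training
    system_message = (
        "Transform the given email into a custom-styled version that maintains the same content "
        "but uses a more personal, unique tone. \n"
        "Your goal is to make the text feel more human-written with natural speech patterns."
    )

    # Assumption: the Mistral-Instruct chat template accepts only alternating
    # user/assistant turns, so the instruction is folded into the user turn.
    messages = [
        {"role": "user", "content": f"{system_message}\n\n{prompt}"}
    ]

    # Format with chat template
    input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # Tokenize
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)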

    # Generate
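    with torch.no_grad():
        outputs = model.generate(
            **inputs,  # pass input_ids together with the attention mask
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode only the newly generated tokens (everything after the prompt).
    # Splitting on a marker such as "<|assistant|>" depends on the model's template,
    # whereas slicing by prompt length works for any model.
    prompt_length = inputs["input_ids"].shape[1]
    assistant_response = tokenizer.decode(
        outputs[0][prompt_length:], skip_special_tokens=True
    ).strip()

    return assistant_response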
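
The `temperature=0.7` and `top_p=0.9` settings passed to `generate` control how the next token is sampled. The sketch below illustrates the idea on a toy vocabulary: logits are divided by the temperature, turned into probabilities, and then only the smallest set of most likely tokens whose total probability reaches `top_p` is kept. It illustrates the technique and is not the transformers implementation; the function names and toy logits are made up for the example.

```python
import math
import random
from typing import Dict

def top_p_filter(logits: Dict[str, float], temperature: float, top_p: float) -> Dict[str, float]:
    """Return the renormalized nucleus (top-p) distribution after temperature scaling."""
    # Temperature-scaled softmax (subtract the max for numerical stability)
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    norm = sum(exps.values())
    probs = {tok: e / norm for tok, e in exps.items()}

    # Keep the most likely tokens until their cumulative probability reaches top_p
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

def sample(dist: Dict[str, float], rng: random.Random) -> str:
    """Draw one token from the filtered distribution."""
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

toy_logits = {"Ahoy": 2.0, "Hello": 1.5, "Greetings": 0.5, "Yo": -1.0}
nucleus = top_p_filter(toy_logits, temperature=0.7, top_p=0.9)
print(nucleus)
print(sample(nucleus, random.Random(0)))
```

Lower temperatures sharpen the distribution and shrink the nucleus, which makes output more repeatable. Higher values allow more varied phrasing at the cost of occasional drift from the source content.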

In [None]:
# Test with different prompts
import torch

test_prompts = [
    "Hello, I'm writing to inquire about your services. Could we schedule a call next week?",
    "Dear HR, I'm submitting my application for the software developer position. I have 5 years of experience.",
    "Team, please remember to submit your reports by Friday. The client is expecting our analysis.",
]

for i, prompt in enumerate(test_prompts):
    print(f"\nTest Prompt {i+1}:\n{prompt}")
    styled_response = generate_styled_text(prompt, model, tokenizer)
    print(f"\nCustom-Style Response:\n{styled_response}\n")
    print("-" * 80)

## Evaluate Model Performance

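Let's spot-check the model on a random sample of emails. Note that the samples are drawn from the same dataset that was used for training, so this is a qualitative sanity check rather than an evaluation on held-out data.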

In [None]:
def evaluate_model(model, tokenizer, test_samples=5):
    """Evaluate model performance on test samples from the dataset"""
    # Sample from the same dataset used for training (no held-out split exists)
    if len(dataset) <= test_samples:
        test_indices = range(len(dataset))
    else:
        import random
        test_indices = random.sample(range(len(dataset)), test_samples)

    print(f"\nEvaluating model on {len(test_indices)} test samples...")

    for idx in test_indices:
        sample = dataset[idx]

        # Extract original prompt and reference
        original_text = sample['messages'][1]['content']  # user message
        reference_text = sample['messages'][2]['content'] # assistant message

        # Generate styled version
        generated_text = generate_styled_text(original_text, model, tokenizer)

        print(f"\nOriginal: {original_text[:150]}...")
        print(f"\nGenerated: {generated_text[:150]}...")
        print(f"\nReference: {reference_text[:150]}...")
        print("\n" + "-"*80)

# Run evaluation
evaluate_model(model, tokenizer)
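
For results that say something about generalization, the test emails need to be set aside before fine-tuning. A minimal sketch of a reproducible split over sample indices follows; `split_indices` is a hypothetical helper, and the 10% test fraction and seed are arbitrary choices.

```python
import random
from typing import List, Tuple

def split_indices(n: int, test_fraction: float = 0.1, seed: int = 42) -> Tuple[List[int], List[int]]:
    """Shuffle the indices 0..n-1 with a fixed seed and return (train, test) index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])

# 69 is the dataset size reported earlier in this notebook.
train_idx, test_idx = split_indices(69, test_fraction=0.1)
print(len(train_idx), len(test_idx))
```

The two lists could then be passed to `dataset.select(train_idx)` for training and `dataset.select(test_idx)` for evaluation, so that no test email is seen during fine-tuning.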

## Merge Adapter Weights (Optional)

For deployment, you might want to merge the LoRA adapter weights back into the base model for more efficient inference.

In [None]:
def merge_adapter_weights():
    """Merge LoRA adapter weights into the base model"""
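    import torch
    from peft import AutoPeftModelForCausalLM

    # Load the fine-tuned PEFT model in half precision to keep memory use down
    # (the default float32 load roughly doubles the memory needed for the weights)
    peft_model = AutoPeftModelForCausalLM.from_pretrained(
        OUTPUT_DIR,
        torch_dtype=torch.float16,
        device_map="auto"
    )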

    # Merge weights
    merged_model = peft_model.merge_and_unload()

    # Save the merged model
    merged_model_path = f"{OUTPUT_DIR}_merged"
    merged_model.save_pretrained(merged_model_path)
    tokenizer.save_pretrained(merged_model_path)

    print(f"Merged model saved to {merged_model_path}")

    return merged_model_path

# Uncomment to merge weights
# merged_model_path = merge_adapter_weights()
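
Conceptually, merging folds each adapter into its base weight matrix: `W_merged = W + (lora_alpha / r) * B @ A`. After that the adapter matrices are no longer needed at inference time. The toy example below checks this identity with small pure-Python matrices. It illustrates the arithmetic only and is not PEFT's implementation; all names and values are made up.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def add_scaled(w: Matrix, delta: Matrix, scale: float) -> Matrix:
    """Element-wise w + scale * delta."""
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def merge_lora(w: Matrix, a: Matrix, b: Matrix, alpha: float, r: int) -> Matrix:
    """Fold the low-rank update into the base weight: W + (alpha / r) * B @ A."""
    return add_scaled(w, matmul(b, a), alpha / r)

# Toy 2x2 layer with a rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # r x in_features
B = [[0.5], [-1.0]]     # out_features x r
x = [[1.0], [1.0]]      # column vector input

merged = merge_lora(W, A, B, alpha=2.0, r=1)
merged_out = matmul(merged, x)

# Same computation with the adapter kept separate: W x + (alpha / r) * B (A x)
adapter_out = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), 2.0 / 1)
print(merged_out, adapter_out)
```

In this notebook `r=16` and `lora_alpha=16`, so the scaling factor `lora_alpha / r` is 1.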

## Conclusion

In this notebook, we've demonstrated how to fine-tune a language model to generate text in a specific style using parameter-efficient techniques:

1. We used QLoRA for parameter-efficient fine-tuning, which dramatically reduces the memory requirements
2. We implemented the conversational format for better compatibility with SFTTrainer
3. We applied optimizations like gradient checkpointing and mixed precision training
4. We used a smaller but capable model (Mistral-7B) that fits within Google Colab's resources
5. We spot-checked the model by comparing its generated emails with reference emails from the dataset

These approaches make fine-tuning feasible even with limited computational resources like those available on Google Colab. Judging output quality more rigorously would require a held-out test set.