# Data Preparation for LLM Fine-Tuning

This notebook demonstrates how to prepare datasets for fine-tuning language models. We'll cover:

1. Data collection and curation
2. Data cleaning and preprocessing
3. Converting to supported formats (Alpaca and ShareGPT)
4. Analyzing dataset statistics
5. Splitting into training and evaluation sets
6. Tokenization and formatting
7. Creating a fine-tuning configuration
8. Advanced data processing techniques

Let's get started!

## 1. Setup and Imports

First, we'll import the necessary libraries and set up our environment.

In [None]:
import os
import sys
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from typing import List, Dict, Any, Optional
from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm

# Add the project root to path
notebook_dir = Path.cwd()
project_root = notebook_dir.parent
sys.path.insert(0, str(project_root))

# Import project utilities
from src.utils.data_processing import calculate_dataset_statistics

# Create directories if they don't exist
data_dir = project_root / "data"
data_dir.mkdir(exist_ok=True)

## 2. Data Collection and Curation

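When fine-tuning an LLM, the model learns to imitate its training examples, so errors and inconsistencies in the data tend to carry over into the fine-tuned model's outputs. Here are some strategies for collecting high-quality data: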

1. **Manual creation**: Write examples that demonstrate desired behavior
2. **Curated datasets**: Use existing datasets from sources like Hugging Face
3. **Data synthesis**: Generate examples using another LLM
4. **Real-world data**: Utilize conversations, documents, or interactions from your domain

Let's look at some example data and discuss curation principles:

In [None]:
# Load and examine sample data
sample_alpaca_path = data_dir / "sample_alpaca.json"

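# Read as UTF-8 explicitly so the result does not depend on the platform default encoding
with open(sample_alpaca_path, "r", encoding="utf-8") as f:
    sample_data = json.load(f)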

print(f"Loaded {len(sample_data)} examples from {sample_alpaca_path}")
print("\nSample example:")
print(json.dumps(sample_data[0], indent=2))

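The cell above expects `data/sample_alpaca.json` to exist already. If it does not, a small file with the same shape can be written first. The sketch below is illustrative only: both records are made-up examples, and it writes to a temporary directory rather than the project's `data/` folder.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records in Alpaca format (instruction, optional input, output)
sample_records = [
    {
        "instruction": "Summarize the following text in one sentence.",
        "input": "Fine-tuning adapts a pretrained model to a narrower task using a smaller dataset.",
        "output": "Fine-tuning specializes a pretrained model by training it further on task data.",
    },
    {
        "instruction": "Give the plural form of the word 'mouse'.",
        "input": "",
        "output": "The plural of 'mouse' is 'mice'.",
    },
]

# Write to a temporary directory so the sketch does not touch project files
tmp_dir = Path(tempfile.mkdtemp())
sample_path = tmp_dir / "sample_alpaca.json"
with open(sample_path, "w", encoding="utf-8") as f:
    json.dump(sample_records, f, indent=2)

# Read it back the same way the notebook does
with open(sample_path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

print(f"Wrote and reloaded {len(loaded)} examples")
```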
### Data Curation Principles

When curating a dataset for fine-tuning, consider these important principles:

1. **Diversity**: Include a wide range of examples that cover different aspects of the task
2. **Quality**: Ensure responses are high-quality, accurate, and helpful
3. **Balance**: Balance different types of tasks or domains in your dataset
4. **Consistency**: Maintain consistent style, tone, and format across examples
5. **Representativeness**: Examples should represent real-world use cases

Let's create a simple function to help assess data quality:

In [None]:
def assess_data_quality(data: List[Dict[str, Any]], format_type: str = "alpaca"):
    """Assess basic quality metrics for a dataset."""
    issues = []
    stats = {}
    
    if format_type == "alpaca":
        # Check for empty fields
        empty_instructions = [i for i, item in enumerate(data) if not item.get("instruction", "").strip()]
        empty_outputs = [i for i, item in enumerate(data) if not item.get("output", "").strip()]
        
        # Check output length
        output_lengths = [len(item.get("output", "").split()) for item in data]
        
        # Check for very short outputs (potential low quality)
        short_outputs = [i for i, item in enumerate(data) if len(item.get("output", "").split()) < 10]
        
        # Check for very long outputs (potential issues)
        long_outputs = [i for i, item in enumerate(data) if len(item.get("output", "").split()) > 500]
        
        # Calculate statistics (guard against an empty dataset, as in the ShareGPT branch)
        stats["empty_instructions"] = len(empty_instructions)
        stats["empty_outputs"] = len(empty_outputs)
        stats["short_outputs"] = len(short_outputs)
        stats["long_outputs"] = len(long_outputs)
        stats["avg_output_length"] = float(np.mean(output_lengths)) if output_lengths else 0.0
        stats["min_output_length"] = min(output_lengths) if output_lengths else 0
        stats["max_output_length"] = max(output_lengths) if output_lengths else 0
        
        if empty_instructions:
            issues.append(f"Found {len(empty_instructions)} examples with empty instructions")
        if empty_outputs:
            issues.append(f"Found {len(empty_outputs)} examples with empty outputs")
        if short_outputs:
            issues.append(f"Found {len(short_outputs)} examples with very short outputs (<10 words)")
        if long_outputs:
            issues.append(f"Found {len(long_outputs)} examples with very long outputs (>500 words)")
            
    elif format_type == "sharegpt":
        # Stats for ShareGPT format
        conversation_lengths = [len(item.get("conversations", [])) for item in data]
        empty_convs = [i for i, item in enumerate(data) if not item.get("conversations", [])]
        
        # Check for conversations with missing roles or content
        invalid_msgs = []
        for i, item in enumerate(data):
            for j, msg in enumerate(item.get("conversations", [])):
                if not msg.get("role", "") or not msg.get("value", ""):
                    invalid_msgs.append((i, j))
        
        stats["empty_conversations"] = len(empty_convs)
        stats["invalid_messages"] = len(invalid_msgs)
        stats["avg_conversation_length"] = float(np.mean(conversation_lengths)) if conversation_lengths else 0.0
        stats["min_conversation_length"] = min(conversation_lengths) if conversation_lengths else 0
        stats["max_conversation_length"] = max(conversation_lengths) if conversation_lengths else 0
        
        if empty_convs:
            issues.append(f"Found {len(empty_convs)} examples with empty conversations")
        if invalid_msgs:
            issues.append(f"Found {len(invalid_msgs)} messages with missing role or content")
    
    return {"stats": stats, "issues": issues}

# Assess quality of our sample data
quality_assessment = assess_data_quality(sample_data, "alpaca")
print("Data Quality Assessment:")
print("\nStatistics:")
for key, value in quality_assessment["stats"].items():
    print(f"  {key}: {value}")

print("\nIssues:")
if quality_assessment["issues"]:
    for issue in quality_assessment["issues"]:
        print(f"  - {issue}")
else:
    print("  No major issues found.")

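The length checks above say little about the first curation principle, diversity. One rough proxy is to count how often each instruction starts with the same word. The helper below, `leading_word_counts`, is a hypothetical addition and not part of the project utilities. It is a sketch of a heuristic, not a complete diversity measure.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def leading_word_counts(data: List[Dict[str, Any]], top_n: int = 5) -> List[Tuple[str, int]]:
    """Return the most common first words of instructions (hypothetical helper)."""
    first_words = []
    for item in data:
        words = item.get("instruction", "").strip().lower().split()
        if words:  # skip empty instructions
            first_words.append(words[0])
    return Counter(first_words).most_common(top_n)


# Made-up records for illustration
demo = [
    {"instruction": "Summarize this article."},
    {"instruction": "Summarize the report."},
    {"instruction": "Translate to French."},
    {"instruction": ""},
]
print(leading_word_counts(demo))  # [('summarize', 2), ('translate', 1)]
```

If a single leading word accounts for most of the dataset, that may be a sign that the examples cover too narrow a range of tasks.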
## 3. Data Cleaning and Preprocessing

Before using the data for training, it's important to clean and preprocess it. Common preprocessing steps include:

1. **Removing duplicates**: Eliminate duplicate examples
2. **Fixing formatting issues**: Normalize text formatting
3. **Filtering inappropriate content**: Remove harmful or inappropriate examples
4. **Anonymizing data**: Remove personally identifiable information (PII)
5. **Normalizing text**: Standardize whitespace, newlines, etc.

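Let's implement some basic cleaning functions. They cover deduplication and whitespace normalization; content filtering and PII removal are not implemented here: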

In [None]:
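import re

def clean_text(text: str) -> str:
    """Basic text cleaning: normalize newlines and collapse extra whitespace."""
    if not isinstance(text, str):
        return ""
    
    # Normalize newlines first so paragraph breaks can still be detected
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    
    # Collapse whitespace within each paragraph while preserving paragraph breaks.
    # Splitting on blank lines must happen before whitespace is collapsed;
    # otherwise the paragraph breaks are already gone.
    paragraphs = [" ".join(para.split()) for para in re.split(r"\n\s*\n", text)]
    
    return "\n\n".join(para for para in paragraphs if para)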

def remove_duplicates(data: List[Dict[str, Any]], format_type: str = "alpaca") -> List[Dict[str, Any]]:
    """Remove duplicate examples based on content."""
    if format_type == "alpaca":
        seen = set()
        unique_data = []
        
        for item in data:
            # Create a key from instruction and input (if present)
            key = (item.get("instruction", "").strip(), item.get("input", "").strip())
            
            if key not in seen and key[0]:  # Ensure instruction is not empty
                seen.add(key)
                unique_data.append(item)
                
        return unique_data
    
    elif format_type == "sharegpt":
        # For ShareGPT, deduplication is more complex
        # We'll use a simple heuristic based on the first user message
        seen = set()
        unique_data = []
        
        for item in data:
            first_user_msg = ""
            for msg in item.get("conversations", []):
                if msg.get("role", "").lower() in ["user", "human"]:
                    first_user_msg = msg.get("value", "").strip()
                    break
            
            if first_user_msg and first_user_msg not in seen:
                seen.add(first_user_msg)
                unique_data.append(item)
                
        return unique_data
    
    return data

def clean_dataset(data: List[Dict[str, Any]], format_type: str = "alpaca") -> List[Dict[str, Any]]:
    """Apply cleaning functions to a dataset."""
    cleaned_data = []
    
    if format_type == "alpaca":
        for item in data:
            cleaned_item = {
                "instruction": clean_text(item.get("instruction", "")),
                "input": clean_text(item.get("input", "")),
                "output": clean_text(item.get("output", ""))
            }
            
            # Only include examples with non-empty instruction and output
            if cleaned_item["instruction"] and cleaned_item["output"]:
                cleaned_data.append(cleaned_item)
    
    elif format_type == "sharegpt":
        for item in data:
            cleaned_conversations = []
            
            for msg in item.get("conversations", []):
                if msg.get("role") and msg.get("value"):
                    cleaned_conversations.append({
                        "role": msg["role"],
                        "value": clean_text(msg["value"])
                    })
            
            if cleaned_conversations:  # Only include non-empty conversations
                cleaned_data.append({"conversations": cleaned_conversations})
    
    # Remove duplicates
    cleaned_data = remove_duplicates(cleaned_data, format_type)
    
    return cleaned_data

# Clean our sample data
cleaned_data = clean_dataset(sample_data, "alpaca")
print(f"Original data: {len(sample_data)} examples")
print(f"Cleaned data: {len(cleaned_data)} examples")

# View a cleaned example
print("\nCleaned Example:")
print(json.dumps(cleaned_data[0], indent=2))

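The list at the start of this section mentions anonymizing PII, which the cleaning functions above do not do. As a minimal sketch, assuming that only email addresses and US-style phone numbers need masking, a regex-based pass could look like the following. The patterns and the `redact_pii` name are illustrative assumptions, and regexes like these will miss many real-world cases.

```python
import re

# Illustrative patterns only; they are intentionally simple and not exhaustive
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace email addresses and simple phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


print(redact_pii("Contact jane.doe@example.com or call 555-123-4567."))
# Contact [EMAIL] or call [PHONE].
```

Such a function could be applied to each text field alongside `clean_text`. For sensitive data, a dedicated PII detection tool and manual review are likely to be more reliable.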
## 4. Converting to Supported Formats

Our framework primarily supports two formats: Alpaca and ShareGPT. Let's create functions to convert data to these formats.

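For reference, here is one record in each format, using the field names that the functions in this notebook read: `instruction`/`input`/`output` for Alpaca, and `conversations` with `role`/`value` for ShareGPT. The content is made up. Some publicly shared ShareGPT-style files use a `from` key instead of `role`; if yours does, the key would need renaming before using the functions here.

```python
import json

# Alpaca: one instruction, an optional input, and the expected output
alpaca_record = {
    "instruction": "Translate the sentence to French.",
    "input": "Good morning.",
    "output": "Bonjour.",
}

# ShareGPT (as used in this notebook): a list of messages with a role and a value
sharegpt_record = {
    "conversations": [
        {"role": "human", "value": "Translate the sentence to French.\n\nGood morning."},
        {"role": "assistant", "value": "Bonjour."},
    ]
}

print(json.dumps(alpaca_record, indent=2))
print(json.dumps(sharegpt_record, indent=2))
```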
In [None]:
def convert_to_alpaca_format(data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert data to Alpaca format (instruction, input, output)."""
    alpaca_data = []
    
    for item in data:
        # Handle common formats
        if "instruction" in item and "output" in item:
            # Already in Alpaca format
            alpaca_item = {
                "instruction": item["instruction"],
                "input": item.get("input", ""),
                "output": item["output"]
            }
            alpaca_data.append(alpaca_item)
            
        elif "prompt" in item and "completion" in item:
            # Convert prompt-completion format
            alpaca_item = {
                "instruction": item["prompt"],
                "input": "",
                "output": item["completion"]
            }
            alpaca_data.append(alpaca_item)
            
        elif "question" in item and "answer" in item:
            # Convert QA format
            alpaca_item = {
                "instruction": item["question"],
                "input": "",
                "output": item["answer"]
            }
            alpaca_data.append(alpaca_item)
            
        elif "conversations" in item:
            # Convert from ShareGPT format
            conversations = item["conversations"]
            
            # Find first user message as instruction
            instruction = ""
            for msg in conversations:
                if msg.get("role", "").lower() in ["user", "human"]:
                    instruction = msg.get("value", "")
                    break
            
            # Find first assistant response as output
            output = ""
            for msg in conversations:
                if msg.get("role", "").lower() in ["assistant", "bot", "gpt"]:
                    output = msg.get("value", "")
                    break
                    
            if instruction and output:
                alpaca_item = {
                    "instruction": instruction,
                    "input": "",
                    "output": output
                }
                alpaca_data.append(alpaca_item)
    
    return alpaca_data

def convert_to_sharegpt_format(data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert data to ShareGPT format (conversations with roles)."""
    sharegpt_data = []
    
    for item in data:
        if "conversations" in item:
            # Already in ShareGPT format
            sharegpt_data.append(item)
            
        elif "instruction" in item and "output" in item:
            # Convert from Alpaca format
            conversations = []
            
            # Add system message if needed
            system_msg = "You are a helpful, harmless, and honest AI assistant."
            conversations.append({"role": "system", "value": system_msg})
            
            # Add user message (combine instruction and input)
            user_msg = item["instruction"]
            if item.get("input", ""):
                user_msg += "\n\n" + item["input"]
                
            conversations.append({"role": "human", "value": user_msg})
            
            # Add assistant response
            conversations.append({"role": "assistant", "value": item["output"]})
            
            sharegpt_data.append({"conversations": conversations})
            
        elif "prompt" in item and "completion" in item:
            # Convert prompt-completion format
            conversations = [
                {"role": "system", "value": "You are a helpful, harmless, and honest AI assistant."},
                {"role": "human", "value": item["prompt"]},
                {"role": "assistant", "value": item["completion"]}
            ]
            sharegpt_data.append({"conversations": conversations})
    
    return sharegpt_data

# Convert our sample data to ShareGPT format
sharegpt_data = convert_to_sharegpt_format(cleaned_data)
print(f"Converted {len(sharegpt_data)} examples to ShareGPT format")
print("\nShareGPT Example:")
print(json.dumps(sharegpt_data[0], indent=2))

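Note that `convert_to_alpaca_format` keeps only the first user message and the first assistant reply of a conversation, so later turns are dropped. If those turns matter, one option is to emit a separate pair per exchange. The sketch below assumes the same role names as above. The `conversation_to_pairs` helper is hypothetical, and each pair loses the context of earlier turns.

```python
from typing import Any, Dict, List

USER_ROLES = {"user", "human"}
ASSISTANT_ROLES = {"assistant", "bot", "gpt"}


def conversation_to_pairs(conversations: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Turn each user message and the assistant reply that follows it into an Alpaca-style record."""
    pairs = []
    pending_user = None
    for msg in conversations:
        role = msg.get("role", "").lower()
        if role in USER_ROLES:
            pending_user = msg.get("value", "")
        elif role in ASSISTANT_ROLES and pending_user:
            pairs.append({"instruction": pending_user, "input": "", "output": msg.get("value", "")})
            pending_user = None  # wait for the next user message
    return pairs


demo_conversation = [
    {"role": "system", "value": "You are a helpful assistant."},
    {"role": "human", "value": "What is 2 + 2?"},
    {"role": "assistant", "value": "4"},
    {"role": "human", "value": "And 3 + 3?"},
    {"role": "assistant", "value": "6"},
]
pairs = conversation_to_pairs(demo_conversation)
print(len(pairs))  # 2
```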
## 5. Analyzing Dataset Statistics

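Before training, check the dataset's size and length distributions: they indicate whether examples will fit within the model's maximum sequence length and whether a few unusually long or short examples need attention. Let's create some functions to analyze the dataset statistics: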

In [None]:
def analyze_alpaca_dataset(data: List[Dict[str, Any]]):
    """Analyze and visualize Alpaca format dataset statistics."""
    # Extract key statistics
    instruction_lengths = [len(item["instruction"].split()) for item in data]
    input_lengths = [len(item.get("input", "").split()) for item in data]
    output_lengths = [len(item["output"].split()) for item in data]
    
    # Count examples with inputs
    has_input = sum(1 for item in data if item.get("input", "").strip())
    
    # Print basic statistics
    print(f"Total examples: {len(data)}")
    print(f"Examples with input field: {has_input} ({has_input/len(data)*100:.1f}%)")
    print("\nLength statistics (in words):")
    print(f"  Instruction: avg={np.mean(instruction_lengths):.1f}, min={min(instruction_lengths)}, max={max(instruction_lengths)}")
    print(f"  Input: avg={np.mean(input_lengths):.1f}, min={min(input_lengths)}, max={max(input_lengths)}")
    print(f"  Output: avg={np.mean(output_lengths):.1f}, min={min(output_lengths)}, max={max(output_lengths)}")
    
    # Visualize length distributions
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    
    # Instruction length
    sns.histplot(instruction_lengths, kde=True, ax=axes[0])
    axes[0].set_title("Instruction Length Distribution")
    axes[0].set_xlabel("Word Count")
    
    # Input length (for examples with input)
    non_empty_inputs = [length for length in input_lengths if length > 0]
    if non_empty_inputs:
        sns.histplot(non_empty_inputs, kde=True, ax=axes[1])
        axes[1].set_title("Input Length Distribution (non-empty)")
        axes[1].set_xlabel("Word Count")
    else:
        axes[1].set_title("No examples with input field")
    
    # Output length
    sns.histplot(output_lengths, kde=True, ax=axes[2])
    axes[2].set_title("Output Length Distribution")
    axes[2].set_xlabel("Word Count")
    
    plt.tight_layout()
    plt.show()
    
    # Return statistics dict
    return {
        "count": len(data),
        "with_input": has_input,
        "avg_instruction_length": np.mean(instruction_lengths),
        "avg_input_length": np.mean(input_lengths),
        "avg_output_length": np.mean(output_lengths),
    }

def analyze_sharegpt_dataset(data: List[Dict[str, Any]]):
    """Analyze and visualize ShareGPT format dataset statistics."""
    # Extract key statistics
    conversation_lengths = [len(item["conversations"]) for item in data]
    msg_lengths = []
    role_counts = {}
    
    for item in data:
        for msg in item["conversations"]:
            role = msg["role"].lower()
            role_counts[role] = role_counts.get(role, 0) + 1
            msg_lengths.append((role, len(msg["value"].split())))
    
    # Calculate role-specific statistics
    role_length_stats = {}
    for role in role_counts.keys():
        lengths = [length for r, length in msg_lengths if r == role]
        role_length_stats[role] = {
            "count": len(lengths),
            "avg_length": np.mean(lengths),
            "min_length": min(lengths),
            "max_length": max(lengths),
        }
    
    # Print basic statistics
    print(f"Total conversations: {len(data)}")
    print(f"Average messages per conversation: {np.mean(conversation_lengths):.1f}")
    print("\nMessage counts by role:")
    for role, count in role_counts.items():
        print(f"  {role}: {count}")
    
    print("\nMessage length statistics by role (in words):")
    for role, stats in role_length_stats.items():
        print(f"  {role}: avg={stats['avg_length']:.1f}, min={stats['min_length']}, max={stats['max_length']}")
    
    # Visualize message length by role
    plt.figure(figsize=(10, 6))
    
    role_data = {}
    for role, length in msg_lengths:
        if role not in role_data:
            role_data[role] = []
        role_data[role].append(length)
    
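    roles = list(role_data.keys())
    plt.boxplot([role_data[role] for role in roles])
    # Set tick labels separately; the boxplot `labels` keyword was renamed in newer Matplotlib releases
    plt.xticks(range(1, len(roles) + 1), roles)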
    plt.title("Message Length by Role")
    plt.ylabel("Word Count")
    plt.yscale("log")
    plt.grid(True, alpha=0.3)
    plt.show()
    
    # Return statistics dict
    return {
        "count": len(data),
        "avg_conversation_length": np.mean(conversation_lengths),
        "role_counts": role_counts,
        "role_length_stats": role_length_stats,
    }

# Analyze our sample data
print("Alpaca Dataset Analysis:")
alpaca_stats = analyze_alpaca_dataset(cleaned_data)

print("\n" + "-"*50 + "\n")

print("ShareGPT Dataset Analysis:")
sharegpt_stats = analyze_sharegpt_dataset(sharegpt_data)

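Averages can hide a long tail, so percentiles are often more useful when choosing a length cutoff. The sketch below uses `np.percentile` on word counts. The `length_percentiles` helper is a hypothetical addition, and it assumes a non-empty list of strings.

```python
import numpy as np


def length_percentiles(texts, percentiles=(50, 90, 99)):
    """Return selected percentiles of word counts for a non-empty list of strings."""
    lengths = [len(text.split()) for text in texts]
    return {p: float(np.percentile(lengths, p)) for p in percentiles}


# Made-up outputs with 3, 1, and 5 words
demo_outputs = ["one two three", "one", "one two three four five"]
print(length_percentiles(demo_outputs))
```

With the notebook's data, this could be called as `length_percentiles([item["output"] for item in cleaned_data])`.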
## 6. Splitting into Training and Evaluation Sets

Next, we'll split our dataset into training and evaluation sets. This is crucial for assessing the model's performance on unseen data.

In [None]:
def split_dataset(data: List[Dict[str, Any]], eval_size: float = 0.1, seed: int = 42):
    """Split dataset into training and evaluation sets."""
    if len(data) < 10:  # For very small datasets
        print("Warning: Dataset too small for meaningful split")
        return data, []
    
    train_data, eval_data = train_test_split(data, test_size=eval_size, random_state=seed)
    print(f"Split dataset into {len(train_data)} training examples and {len(eval_data)} evaluation examples")
    return train_data, eval_data

# Split our cleaned data
train_data, eval_data = split_dataset(cleaned_data)

# Save split datasets
train_path = data_dir / "train.json"
eval_path = data_dir / "eval.json"

with open(train_path, "w") as f:
    json.dump(train_data, f, indent=2)
    
with open(eval_path, "w") as f:
    json.dump(eval_data, f, indent=2)
    
print(f"Saved training data to {train_path}")
print(f"Saved evaluation data to {eval_path}")

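A random split only measures performance on unseen data if the two sets do not share examples. The sketch below checks for overlap using the same instruction and input key as `remove_duplicates`, with lowercasing added as an assumption. The `find_overlap` name is hypothetical.

```python
from typing import Any, Dict, List, Set, Tuple


def example_key(item: Dict[str, Any]) -> Tuple[str, str]:
    """Normalized (instruction, input) key for comparing examples."""
    return (item.get("instruction", "").strip().lower(), item.get("input", "").strip().lower())


def find_overlap(train: List[Dict[str, Any]], eval_: List[Dict[str, Any]]) -> Set[Tuple[str, str]]:
    """Return the keys that appear in both the training and evaluation sets."""
    train_keys = {example_key(item) for item in train}
    return {example_key(item) for item in eval_ if example_key(item) in train_keys}


# Made-up split where one evaluation example differs from a training example only in case
demo_train = [{"instruction": "Define entropy.", "input": ""}, {"instruction": "Sort a list.", "input": "[3, 1, 2]"}]
demo_eval = [{"instruction": "define entropy.", "input": ""}, {"instruction": "Reverse a string.", "input": "abc"}]
print(find_overlap(demo_train, demo_eval))  # {('define entropy.', '')}
```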
## 7. Tokenization and Formatting

Finally, let's see how the data will be tokenized and formatted for training. This step is handled by the framework, but it's useful to understand what's happening under the hood.

In [None]:
from transformers import AutoTokenizer
from src.utils.data_processing import process_alpaca_dataset

# Load a tokenizer
try:
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", trust_remote_code=True)
    
    # Ensure the tokenizer has padding token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    # Process a sample with the tokenizer to see how it will be formatted
    example = train_data[0]
    
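    # Apply the chat template if the tokenizer defines one.
    # The apply_chat_template method exists on every tokenizer, so check the template itself.
    if getattr(tokenizer, "chat_template", None):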
        # Format as a chat
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            # Join instruction and input with a blank line, matching convert_to_sharegpt_format
            {"role": "user", "content": example["instruction"] + ("\n\n" + example["input"] if example.get("input") else "")},
            {"role": "assistant", "content": example["output"]},
        ]
        
        formatted_text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
        
        print("Example formatted with chat template:")
        print("-"*80)
        print(formatted_text)
        print("-"*80)
        
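        # Tokenize and show token count. The chat template already inserts special tokens,
        # so disable add_special_tokens to avoid counting a duplicate BOS token.
        tokens = tokenizer.encode(formatted_text, add_special_tokens=False)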
        print(f"\nToken count: {len(tokens)}")
    else:
        print("Chat template not available for this tokenizer.")
        
        # Fall back to simple format
        if example.get("input"):
            formatted_text = f"Instruction: {example['instruction']}\nInput: {example['input']}\nOutput: {example['output']}"
        else:
            formatted_text = f"Instruction: {example['instruction']}\nOutput: {example['output']}"
            
        print("Example formatted without chat template:")
        print("-"*80)
        print(formatted_text)
        print("-"*80)
        
        # Tokenize and show token count
        tokens = tokenizer.encode(formatted_text)
        print(f"\nToken count: {len(tokens)}")
except Exception as e:
    print(f"Could not load tokenizer: {e}")
    print("This step requires a valid model to be available.")

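Token counts matter because examples longer than the training sequence limit (`max_seq_length` in the configuration later in this notebook) are typically truncated or dropped, depending on the framework. The sketch below counts how many texts exceed a limit. It takes the counting function as a parameter so it runs without downloading a model. The word-count function is a crude stand-in, and `count_over_limit` is a hypothetical helper.

```python
from typing import Callable, List


def count_over_limit(texts: List[str], max_seq_length: int, count_tokens: Callable[[str], int]) -> int:
    """Count texts whose token count exceeds max_seq_length."""
    return sum(1 for text in texts if count_tokens(text) > max_seq_length)


def word_count(text: str) -> int:
    """Crude stand-in for a tokenizer: whitespace-separated words."""
    return len(text.split())


# With a loaded tokenizer, one could instead pass:
#   lambda text: len(tokenizer.encode(text, add_special_tokens=False))
demo_texts = ["short text", "a " * 10]
print(count_over_limit(demo_texts, max_seq_length=5, count_tokens=word_count))  # 1
```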
## 8. Creating a Configuration for Fine-tuning

Now that we have prepared our dataset, let's create a configuration file for fine-tuning:

In [None]:
import yaml

# Create a configuration for fine-tuning
config = {
    "model": {
        "base_model": "meta-llama/Llama-3.1-8B-Instruct",  # Replace with the model you want to use
        "tokenizer": "meta-llama/Llama-3.1-8B-Instruct",
        "load_in_4bit": True,
        "trust_remote_code": True,
        "use_flash_attention": True,
    },
    "fine_tuning": {
        "method": "qlora",
        "lora": {
            "r": 16,
            "alpha": 32,
            "dropout": 0.05,
            "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
            "bias": "none",
        },
        "quantization": {
            "bits": 4,
            "bnb_4bit_compute_dtype": "bfloat16",
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_use_double_quant": True,
        },
    },
    "training": {
        "epochs": 3,
        "micro_batch_size": 1,
        "gradient_accumulation_steps": 16,
        "learning_rate": 2.0e-4,
        "lr_scheduler_type": "cosine",
        "warmup_ratio": 0.03,
        "max_grad_norm": 0.3,
        "optimizer": "paged_adamw_8bit",
        "weight_decay": 0.001,
        "max_seq_length": 4096,
        "gradient_checkpointing": True,
        "mixed_precision": "bf16",
    },
    "dataset": {
        "format": "alpaca",
        "train_path": str(train_path),
        "eval_path": str(eval_path),
        "preprocessing": {
            "add_eos_token": True,
            "add_bos_token": False,
            "use_chat_template": True,
        },
    },
    "output": {
        "output_dir": "models/fine-tuned-model",
        "logging_steps": 10,
        "eval_steps": 100,
        "save_steps": 100,
        "save_total_limit": 5,
        "push_to_hub": False,
    },
    "evaluation": {
        "do_eval": True,
        "eval_batch_size": 8,
        "eval_strategy": "steps",
        "eval_steps": 100,
    },
}

# Save the configuration
config_path = project_root / "config" / "data_prep_config.yaml"
os.makedirs(config_path.parent, exist_ok=True)

with open(config_path, "w") as f:
    yaml.dump(config, f, sort_keys=False)

print(f"Configuration saved to {config_path}")

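The `micro_batch_size` and `gradient_accumulation_steps` settings together determine how many examples contribute to each optimizer update. As a rough sketch, assuming a single device and that a final partial batch still counts as a step, the arithmetic is shown below. The `training_steps` helper and the figure of 1,000 examples are assumptions for illustration, and a given framework may round differently.

```python
import math


def training_steps(num_examples, micro_batch_size, gradient_accumulation_steps, epochs, num_devices=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = micro_batch_size * gradient_accumulation_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# Values from the configuration above, with a hypothetical 1,000 training examples
print(training_steps(num_examples=1000, micro_batch_size=1, gradient_accumulation_steps=16, epochs=3))
# (16, 189)
```

Comparing the total against `eval_steps` and `save_steps` shows how many evaluations and checkpoints a run would produce.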
## 9. Advanced Data Processing Techniques

For more advanced projects, you might need additional data processing techniques. Here are some examples:

In [None]:
def filter_by_length(data: List[Dict[str, Any]], min_words: int = 10, max_words: int = 1000, field: str = "output"):
    """Filter examples by the length of a specific field."""
    filtered_data = []
    
    for item in data:
        if field in item:
            word_count = len(item[field].split())
            if min_words <= word_count <= max_words:
                filtered_data.append(item)
    
    return filtered_data

def augment_data(data: List[Dict[str, Any]], num_variants: int = 1):
    """Create simple variants of examples (placeholder for more sophisticated augmentation)."""
    augmented_data = list(data)  # Start with original data
    
    # This is a placeholder for more sophisticated augmentation techniques
    # In a real implementation, you might use:    
    # - Paraphrasing
    # - Back-translation
    # - Word substitution
    # - Using another LLM to generate variants
    
    print("Note: This is a placeholder for data augmentation.")
    print("In a real implementation, you would use techniques like paraphrasing or LLM generation.")
    
    return augmented_data

def balance_categories(data: List[Dict[str, Any]], category_field: str, max_per_category: Optional[int] = None):
    """Balance the dataset by limiting examples per category."""
    # Count examples per category
    category_counts = {}
    for item in data:
        category = item.get(category_field, "unknown")
        category_counts[category] = category_counts.get(category, 0) + 1
    
    # Determine maximum examples per category
    if max_per_category is None:
        # Default to the size of the smallest category so every category ends up equal
        max_per_category = min(category_counts.values())
    
    # Balance the dataset
    balanced_data = []
    category_current = {category: 0 for category in category_counts}
    
    for item in data:
        category = item.get(category_field, "unknown")
        if category_current[category] < max_per_category:
            balanced_data.append(item)
            category_current[category] += 1
    
    print(f"Balanced dataset from {len(data)} to {len(balanced_data)} examples")
    print("Examples per category:")
    for category, count in category_current.items():
        print(f"  {category}: {count}")
    
    return balanced_data

# Example of using these advanced techniques
# Note: These are placeholders for demonstration purposes
print("Advanced data processing techniques:")
print(f"Original data size: {len(cleaned_data)} examples")

# Filter by length (example)
filtered_data = filter_by_length(cleaned_data, min_words=20, max_words=500)
print(f"After length filtering: {len(filtered_data)} examples")

# Note: The following are just placeholders and won't actually change the data
# In a real implementation, you would implement these functions properly
augmented_data = augment_data(filtered_data, num_variants=1)
print(f"After augmentation: {len(augmented_data)} examples")

# Balancing example (placeholder)
# This assumes a 'category' field which doesn't exist in our sample data
print("\nCategory balancing would be used with a field specifying categories")

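`balance_categories` keeps the first examples it encounters in each category, which can skew the result if the file is ordered in some way. A minimal alternative sketch, assuming each record carries a hypothetical `category` field, samples randomly within each category using a fixed seed. The `sample_per_category` name is an assumption.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List


def sample_per_category(data: List[Dict[str, Any]], category_field: str,
                        max_per_category: int, seed: int = 42) -> List[Dict[str, Any]]:
    """Randomly keep at most max_per_category examples from each category."""
    rng = random.Random(seed)  # local generator so global random state is untouched
    by_category = defaultdict(list)
    for item in data:
        by_category[item.get(category_field, "unknown")].append(item)

    balanced = []
    for items in by_category.values():
        if len(items) > max_per_category:
            items = rng.sample(items, max_per_category)
        balanced.extend(items)
    return balanced


# Made-up data: five "math" examples and two "code" examples
demo_data = [{"category": "math", "id": i} for i in range(5)] + [{"category": "code", "id": i} for i in range(2)]
balanced_demo = sample_per_category(demo_data, "category", max_per_category=2)
print(len(balanced_demo))  # 4
```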
## Conclusion

In this notebook, we've covered the essential steps for preparing data for LLM fine-tuning:

1. **Data collection and curation**: Understand the importance of high-quality, diverse examples
2. **Data cleaning**: Remove duplicates, fix formatting issues, ensure quality
3. **Format conversion**: Convert data to Alpaca or ShareGPT formats
4. **Dataset analysis**: Understand the characteristics of your dataset
5. **Dataset splitting**: Create proper training and evaluation sets
6. **Tokenization and formatting**: See how data is prepared for model training
7. **Configuration**: Create a configuration file for fine-tuning
8. **Advanced techniques**: Explore more sophisticated data processing methods

With these tools and techniques, you can prepare high-quality datasets for fine-tuning language models for your specific use cases.

For the next steps, you can run the fine-tuning process using the configuration file we created:

```bash
python src/train.py --config config/data_prep_config.yaml
```

Or use the simplified interface:

```bash
python scripts/quick_finetune.py --model_name meta-llama/Llama-3.1-8B-Instruct --dataset_path data/train.json --eval_path data/eval.json --method qlora
```

Happy fine-tuning!