# Fine-Tuning Machine Learning Models: A Comprehensive Course

Fine-tuning is a powerful technique that allows you to adapt pre-trained models to specific tasks or domains. This notebook provides a practical guide to fine-tuning various types of machine learning models.

## What is Fine-Tuning?

Fine-tuning is the process of taking a model that has been pre-trained on a large dataset and then further training it on a smaller, task-specific dataset. This approach leverages the knowledge already captured in the pre-trained model and adapts it to perform well on a new, related task.

## Why Fine-Tune?

- **Resource efficiency**: Training from scratch requires significant computational resources and data
- **Better performance**: Pre-trained models contain valuable features and patterns
- **Faster convergence**: Fine-tuning typically requires fewer iterations than training from scratch
- **Less data needed**: You can achieve good results with smaller domain-specific datasets

## Course Outline

1. Understanding Transfer Learning & Fine-Tuning
2. Preparing Data for Fine-Tuning
3. Fine-Tuning Computer Vision Models
4. Fine-Tuning NLP Models
5. Fine-Tuning Large Language Models (LLMs)
6. Evaluation and Hyperparameter Tuning
7. Best Practices and Common Pitfalls
8. Advanced Fine-Tuning Techniques

## Required Libraries

In [None]:
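# Install required libraries
# Note: the PyPI package for sklearn is named scikit-learn; accelerate is needed
# by the Hugging Face Trainer and rouge_score by the ROUGE metric used later
!pip install -q torch torchvision transformers datasets evaluate accelerate rouge_score scikit-learn matplotlib pandas numpy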

## 1. Understanding Transfer Learning & Fine-Tuning

Transfer learning is a machine learning technique where knowledge gained from solving one problem is applied to a different but related problem. Fine-tuning is a specific approach to transfer learning.

### Types of Transfer Learning:

1. **Feature Extraction**: The pre-trained model is used only as a feature extractor. The last layer(s) are removed and replaced with a new classifier that is trained on the new task.

2. **Fine-Tuning**: Some or all of the weights in the pre-trained model are updated during training on the new task.
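
In code, the difference between the two comes down to which parameters receive gradients. The following is a minimal sketch on a toy network; `backbone`, `head`, and `count_trainable` are illustrative names, and the randomly initialized `backbone` only stands in for a real pre-trained model.

```python
import torch.nn as nn

# Toy stand-in for a pre-trained network (hypothetical; weights are random here)
backbone = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
)
head = nn.Linear(32, 2)  # new task-specific classifier
model = nn.Sequential(backbone, head)

def count_trainable(module):
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Feature extraction: freeze the backbone, train only the new head
for p in backbone.parameters():
    p.requires_grad = False
feature_extraction_trainable = count_trainable(model)

# Fine-tuning: unfreeze everything so all weights are updated
for p in model.parameters():
    p.requires_grad = True
fine_tuning_trainable = count_trainable(model)

print(f"Feature extraction: {feature_extraction_trainable} trainable parameters")
print(f"Fine-tuning:        {fine_tuning_trainable} trainable parameters")
```

With feature extraction only the 66 parameters of the head are trained; with fine-tuning all 1,666 parameters are.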

### Common Approaches to Fine-Tuning:

- **Freeze early layers, train later layers**: The early layers of neural networks typically learn generic features, while later layers learn more task-specific features.
- **Gradual unfreezing**: Start by training only the last layer, then progressively unfreeze and train earlier layers.
- **Differential learning rates**: Apply smaller learning rates to early layers and larger learning rates to later layers.
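
The last two approaches can be sketched on a toy model. The stage names (`early`, `middle`, `late`), the learning rates, and the `set_trainable` helper below are assumptions chosen for illustration, not recommended values.

```python
from collections import OrderedDict

import torch.nn as nn
import torch.optim as optim

# Toy three-stage network; a real model would have many layers per stage
model = nn.Sequential(OrderedDict([
    ("early", nn.Linear(16, 32)),
    ("middle", nn.Linear(32, 32)),
    ("late", nn.Linear(32, 2)),
]))

# Differential learning rates: one optimizer parameter group per stage
optimizer = optim.Adam([
    {"params": model.early.parameters(), "lr": 1e-5},
    {"params": model.middle.parameters(), "lr": 1e-4},
    {"params": model.late.parameters(), "lr": 1e-3},
])

def set_trainable(module, trainable):
    for p in module.parameters():
        p.requires_grad = trainable

# Gradual unfreezing: start fully frozen, then unfreeze from the last stage backwards
for _, stage in model.named_children():
    set_trainable(stage, False)

schedule = []
for name in ["late", "middle", "early"]:
    set_trainable(getattr(model, name), True)
    trainable_now = [
        n for n, m in model.named_children()
        if all(p.requires_grad for p in m.parameters())
    ]
    schedule.append(trainable_now)
    # ... train for one or more epochs here before unfreezing the next stage ...

print(schedule)
```

Parameters that are still frozen have no gradients, so the optimizer simply skips them until they are unfrozen.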

## 2. Preparing Data for Fine-Tuning

Data preparation is crucial for effective fine-tuning. Let's walk through the process:

### Key Steps:

1. **Data Collection**: Gather task-specific data relevant to your application
2. **Data Cleaning**: Remove inconsistencies, handle missing values
3. **Data Preprocessing**: Format data to match the pre-trained model's requirements
4. **Data Augmentation**: Increase dataset diversity artificially 
5. **Dataset Splitting**: Create training, validation, and test sets
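
For step 5, a plain random split can leave a rare class under-represented in the validation set. One option is a stratified split, sketched below with NumPy only; `stratified_split` is a hypothetical helper written for this example, and the labels are made up.

```python
import numpy as np

def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx) with similar class proportions in both."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, val_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then carve off the validation share
        cls_idx = rng.permutation(np.flatnonzero(labels == cls))
        n_val = int(round(val_fraction * len(cls_idx)))
        val_idx.extend(cls_idx[:n_val])
        train_idx.extend(cls_idx[n_val:])
    return np.sort(train_idx), np.sort(val_idx)

labels = np.array([0] * 80 + [1] * 20)  # imbalanced toy labels
train_idx, val_idx = stratified_split(labels)

print(f"Train size: {len(train_idx)}, validation size: {len(val_idx)}")
print(f"Class-1 share in validation: {labels[val_idx].mean():.2f}")
```

The resulting index arrays can be passed to `torch.utils.data.Subset` to build the actual datasets.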

Let's demonstrate these steps with a practical example using a sample image classification dataset.

In [None]:
import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, random_split
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Define transformations for data preprocessing and augmentation
data_transforms = {
    'train': transforms.Compose([
        transforms.Resize(256),
        transforms.RandomCrop(224),
        transforms.RandomHorizontalFlip(),  # Data augmentation
        transforms.ColorJitter(brightness=0.1, contrast=0.1),  # Data augmentation
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}

# Load a sample dataset (CIFAR-10 in this example)
# In real applications, you would use your own domain-specific dataset
cifar_data = torchvision.datasets.CIFAR10(root='./data', train=True, 
                                          download=True)

# Create smaller dataset for fine-tuning example
# Let's assume we're only interested in the first 2 classes
indices = np.where(np.array(cifar_data.targets) < 2)[0]
subset_data = torch.utils.data.Subset(cifar_data, indices)

# Split dataset into train and validation sets
train_size = int(0.8 * len(subset_data))
val_size = len(subset_data) - train_size
train_subset, val_subset = random_split(subset_data, [train_size, val_size])

# Apply transformations
class TransformedSubset(torch.utils.data.Dataset):
    def __init__(self, subset, transform):
        self.subset = subset
        self.transform = transform
        
    def __getitem__(self, idx):
        x, y = self.subset[idx]
        return self.transform(x), y
    
    def __len__(self):
        return len(self.subset)

train_data = TransformedSubset(train_subset, data_transforms['train'])
val_data = TransformedSubset(val_subset, data_transforms['val'])

# Create data loaders
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
val_loader = DataLoader(val_data, batch_size=32, shuffle=False)

# Visualize a few examples
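def imshow(img):
    # Undo the ImageNet normalization applied in data_transforms
    mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
    img = (img * std + mean).clamp(0, 1)
    npimg = img.numpy()
    plt.figure(figsize=(10, 5))
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.axis('off')
    plt.show()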

# Get some random training images
dataiter = iter(train_loader)
images, labels = next(dataiter)

# Show images
imshow(torchvision.utils.make_grid(images[:5]))
class_names = ['airplane', 'automobile']
print('Sample classes: ', [class_names[labels[i].item()] for i in range(5)])

print(f"Training set size: {len(train_data)}")
print(f"Validation set size: {len(val_data)}")

## 3. Fine-Tuning Computer Vision Models

Computer vision models are commonly fine-tuned for tasks like:
- Object detection
- Image classification
- Semantic segmentation
- Instance segmentation

Let's fine-tune a pre-trained ResNet model for our specific image classification task:

In [None]:
import torch.nn as nn
import torch.optim as optim
import time
import copy
from torchvision import models

# Check if CUDA is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load a pre-trained ResNet model
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Inspect the model architecture
print("Original model's last layer:", model.fc)

# Modify the final fully connected layer for our specific task (2 classes)
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 2)  # Replace with number of target classes

print("Modified model's last layer:", model.fc)

# Move model to the appropriate device
model = model.to(device)

# Define loss function
criterion = nn.CrossEntropyLoss()

# Fine-tuning approach 1: Train only the final layer
# Freeze all layers except the final one
for param in model.parameters():
    param.requires_grad = False
    
# Unfreeze the final layer
for param in model.fc.parameters():
    param.requires_grad = True

# Optimize only the new final layer; the frozen parameters are left out of the optimizer
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)

# Training function
def train_model(model, criterion, optimizer, dataloaders, num_epochs=5):
    # Accept either a {'train': ..., 'val': ...} dict or a (train, val) pair
    if not isinstance(dataloaders, dict):
        dataloaders = {'train': dataloaders[0], 'val': dataloaders[1]}

    since = time.time()
    
    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0
    
    train_history = {'loss': [], 'acc': []}
    val_history = {'loss': [], 'acc': []}
    
    for epoch in range(num_epochs):
        print(f'Epoch {epoch+1}/{num_epochs}')
        print('-' * 10)
        
        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode
            # Use the loaders passed in rather than module-level globals
            dataloader = dataloaders[phase]
                
            running_loss = 0.0
            running_corrects = 0
            
            # Iterate over data
            for inputs, labels in dataloader:
                inputs = inputs.to(device)
                labels = labels.to(device)
                
                # Zero the parameter gradients
                optimizer.zero_grad()
                
                # Forward pass
                # Track history only in train phase
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)
                    
                    # Backward + optimize only in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()
                
                # Statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels)
            
            epoch_loss = running_loss / len(dataloader.dataset)
            epoch_acc = running_corrects.double() / len(dataloader.dataset)
            
            print(f'{phase} Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}')
            
            # Save history
            if phase == 'train':
                train_history['loss'].append(epoch_loss)
                train_history['acc'].append(epoch_acc.item())
            else:
                val_history['loss'].append(epoch_loss)
                val_history['acc'].append(epoch_acc.item())
            
            # Deep copy the model if best performance
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())
                
        print()
    
    time_elapsed = time.time() - since
    print(f'Training complete in {time_elapsed // 60:.0f}m {time_elapsed % 60:.0f}s')
    print(f'Best val Acc: {best_acc:.4f}')
    
    # Load best model weights
    model.load_state_dict(best_model_wts)
    
    return model, train_history, val_history

# Train the model - feature extraction approach
model_ft, train_history, val_history = train_model(
    model, criterion, optimizer,
    {'train': train_loader, 'val': val_loader}, num_epochs=5)

# Plot training results
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(train_history['loss'], label='train')
plt.plot(val_history['loss'], label='val')
plt.title('Loss')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(train_history['acc'], label='train')
plt.plot(val_history['acc'], label='val')
plt.title('Accuracy')
plt.legend()
plt.show()

### Fine-Tuning the Entire Network

Now, let's try a different fine-tuning approach where we update all layers of the network but use different learning rates for different parts of the model.

In [None]:
# Reset the model
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 2)
model = model.to(device)

# Fine-tuning approach 2: Fine-tune the entire network with differential learning rates
# Parameters of newly constructed layers have different learning rates
params_to_update = [
    {'params': [param for name, param in model.named_parameters() if 'fc' not in name], 'lr': 0.0001},
    {'params': model.fc.parameters(), 'lr': 0.001}
]

optimizer = optim.Adam(params_to_update)
criterion = nn.CrossEntropyLoss()

# Train the model - full fine-tuning approach
model_ft_full, train_history_full, val_history_full = train_model(
    model, criterion, optimizer,
    {'train': train_loader, 'val': val_loader}, num_epochs=5)

# Plot training results
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(train_history_full['loss'], label='train')
plt.plot(val_history_full['loss'], label='val')
plt.title('Loss (Full Fine-tuning)')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(train_history_full['acc'], label='train')
plt.plot(val_history_full['acc'], label='val')
plt.title('Accuracy (Full Fine-tuning)')
plt.legend()
plt.show()

# Compare approaches
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(train_history['acc'], label='feature extraction')
plt.plot(train_history_full['acc'], label='full fine-tuning')
plt.title('Training Accuracy Comparison')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(val_history['acc'], label='feature extraction')
plt.plot(val_history_full['acc'], label='full fine-tuning')
plt.title('Validation Accuracy Comparison')
plt.legend()
plt.show()

## 4. Fine-Tuning NLP Models

Natural Language Processing (NLP) models can be fine-tuned for tasks like:
- Text classification
- Named Entity Recognition
- Question answering
- Sentiment analysis

Let's fine-tune a pre-trained DistilBERT model (a smaller, distilled version of BERT) for a text classification task using the Hugging Face `transformers` library:

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import load_dataset
import evaluate
import numpy as np

# Load a dataset - we'll use the IMDB movie reviews for sentiment analysis
imdb = load_dataset("imdb")
print(imdb)

# Take a smaller subset for demonstration purposes
train_dataset = imdb["train"].shuffle(seed=42).select(range(5000))
test_dataset = imdb["test"].shuffle(seed=42).select(range(1000))

# Load tokenizer and model from Hugging Face
model_name = "distilbert-base-uncased"  # Smaller, faster version of BERT
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize function
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

# Apply tokenization
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)

# Load pre-trained model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Define metrics
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results/imdb_classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",  # named evaluation_strategy in transformers releases before 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()

# Evaluate the model
eval_results = trainer.evaluate()
print(f"Evaluation results: {eval_results}")

# Test on some examples
from transformers import pipeline

# Create a text classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Test some examples
test_reviews = [
    "This movie was absolutely fantastic! The acting was superb and the storyline captivating.",
    "I really didn't enjoy this film. The plot was confusing and the characters poorly developed."
]

for review in test_reviews:
    result = classifier(review)
    sentiment = "positive" if result[0]['label'] == "LABEL_1" else "negative"
    confidence = result[0]['score']
    print(f"Review: {review}")
    print(f"Sentiment: {sentiment}, Confidence: {confidence:.4f}\n")

## 5. Fine-Tuning Large Language Models (LLMs)

Large Language Models (LLMs) like GPT, BERT, and T5 can be fine-tuned for specific domains and tasks. To keep the example cheap to run, this section fine-tunes a small sequence-to-sequence model on a single task; the same workflow applies to larger models.

### Use Cases for LLM Fine-Tuning:

- **Domain adaptation**: Tailor the model to specific industries like legal, medical, or financial
- **Task-specific tuning**: Optimize for tasks like summarization or question-answering
- **Instruction tuning**: Teach the model to follow specific instructions
- **Alignment**: Ensure outputs match human preferences and values

Let's fine-tune a small T5 model for a simple summarization task:

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import load_dataset
import evaluate
import numpy as np
from transformers import DataCollatorForSeq2Seq

# Load a dataset - CNN/DailyMail for summarization
dataset = load_dataset("cnn_dailymail", "3.0.0")
print(dataset)

# Take a small subset for demonstration
train_dataset = dataset["train"].shuffle(seed=42).select(range(500))
val_dataset = dataset["validation"].shuffle(seed=42).select(range(100))

# Load model and tokenizer
model_name = "t5-small"  # Using a smaller model for demonstration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Preprocessing function
prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["article"]]
    # No padding here: DataCollatorForSeq2Seq pads each batch dynamically and
    # fills label padding with -100 so that it is ignored by the loss
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    
    # Tokenize the targets (text_target replaces the deprecated as_target_tokenizer)
    labels = tokenizer(text_target=examples["highlights"], max_length=128, truncation=True)
    
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply preprocessing
tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_val = val_dataset.map(preprocess_function, batched=True)

# Define data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Load evaluation metric
rouge_metric = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Replace -100 (ignored positions) with the pad token id, as -100 can't be decoded
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Strip stray whitespace. rougeLsum treats newlines as sentence boundaries,
    # so split the text into sentences first if that score matters for your task.
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [label.strip() for label in decoded_labels]
    
    result = rouge_metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Convert the scores from fractions to percentages
    result = {key: value * 100 for key, value in result.items()}
    return result

# Define training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./results/t5-summarization",
    eval_strategy="epoch",  # named evaluation_strategy in transformers releases before 4.41
    learning_rate=3e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    generation_max_length=128,
)

# Create trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Fine-tune the model
trainer.train()

# Test the model on a few examples
from transformers import pipeline

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)

articles = [
    """The US has passed the peak of new coronavirus cases, President Donald Trump said and predicted that some states would reopen this month. 
    The US has over 637,000 confirmed Covid-19 cases and over 30,826 deaths, the highest for any country in the world. 
    At the daily White House coronavirus briefing on Wednesday, Trump said new guidelines to reopen the country would be announced on Thursday after he speaks to governors. 
    "We'll be the comeback kids, all of us," he said. "We want to get our country back." The Trump administration has previously fixed May 1 as a possible date to reopen the world's largest economy, but the president said some states may be able to return to normalcy sooner than others."""
]

for article in articles:
    summary = summarizer(prefix + article, max_length=100, min_length=30, do_sample=False)
    print(f"Article: {article[:100]}...\n")
    print(f"Generated summary: {summary[0]['summary_text']}\n")

## 6. Evaluation and Hyperparameter Tuning

Properly evaluating your fine-tuned models and optimizing their hyperparameters is crucial for achieving the best performance.

### Key Evaluation Metrics:

- **Classification**: Accuracy, F1-score, AUC-ROC, confusion matrix
- **NLP**: BLEU, ROUGE, METEOR, perplexity
- **Computer Vision**: mAP, IoU
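
To make a few of these metrics concrete, here is a small pure-Python sketch. `binary_metrics` and `box_iou` are helper names invented for this example, and the labels and boxes are made-up toy values; in practice you would normally rely on a library such as scikit-learn or `evaluate`.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and confusion matrix for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion_matrix": [[tn, fp], [fn, tp]],  # rows: true class, cols: predicted
    }

def box_iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
print(metrics)
print(f"IoU: {iou:.3f}")
```

Accuracy alone can hide poor performance on a minority class, which is why precision, recall, and F1 are usually reported alongside it.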

Let's implement a simple hyperparameter tuning process using scikit-learn's GridSearchCV. For speed, the example uses a small fully connected model on MNIST rather than the ResNet from Section 3:

In [None]:
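import copy

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import torch.nn.functional as F

# Wrapper that makes a PyTorch model usable as a scikit-learn estimator.
# GridSearchCV clones the estimator and sets hyperparameters through
# get_params/set_params, so every tunable value must be an __init__ argument.
class ModelWrapper(ClassifierMixin, BaseEstimator):
    def __init__(self, model, device, learning_rate=0.001, epochs=5):
        self.model = model
        self.device = device
        self.learning_rate = learning_rate
        self.epochs = epochs
        
    def fit(self, X, y):
        # Convert data to tensors with the dtypes the model and loss expect
        X = torch.as_tensor(X, dtype=torch.float32).to(self.device)
        y = torch.as_tensor(y, dtype=torch.long).to(self.device)
        self.classes_ = np.unique(y.cpu().numpy())
        
        # Train a fresh copy so every fit starts from the same initial weights
        self.model_ = copy.deepcopy(self.model).to(self.device)
        optimizer = optim.Adam(self.model_.parameters(), lr=self.learning_rate)
        criterion = nn.CrossEntropyLoss()
        
        # Full-batch training (the demo dataset is tiny)
        self.model_.train()
        for epoch in range(self.epochs):
            # Forward pass
            outputs = self.model_(X)
            loss = criterion(outputs, y)
            
            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
        return self
    
    def predict(self, X):
        self.model_.eval()
        X = torch.as_tensor(X, dtype=torch.float32).to(self.device)
        with torch.no_grad():
            outputs = self.model_(X)
            _, preds = torch.max(outputs, 1)
        return preds.cpu().numpy()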

# For simplicity in this example, we'll use a very small subset and a simple model
# In practice, you would do this with your full dataset and your fine-tuned model
from torch.utils.data import TensorDataset, DataLoader
import torch.nn as nn

# Create a simple model for demonstration
class SimpleModel(nn.Module):
    def __init__(self, num_classes=2):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(28*28, 128)  # Simplified for MNIST
        self.fc2 = nn.Linear(128, num_classes)
        
    def forward(self, x):
        x = x.view(-1, 28*28)  # Flatten
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Load a simple dataset (MNIST) for demonstration
mnist_train = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transforms.ToTensor())
mnist_test = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transforms.ToTensor())

# Take a subset with only digits 0 and 1 for binary classification
idx_train = (mnist_train.targets == 0) | (mnist_train.targets == 1)
idx_test = (mnist_test.targets == 0) | (mnist_test.targets == 1)

X_train = mnist_train.data[idx_train].numpy()
y_train = mnist_train.targets[idx_train].numpy()
X_test = mnist_test.data[idx_test].numpy()
y_test = mnist_test.targets[idx_test].numpy()

# Take small subsets for demonstration
X_train, y_train = X_train[:500], y_train[:500]
X_test, y_test = X_test[:100], y_test[:100]

# Normalize data
X_train = X_train / 255.0
X_test = X_test / 255.0

# Create model
model = SimpleModel().to(device)
wrapper = ModelWrapper(model, device)

# Define hyperparameter grid
param_grid = {
    'learning_rate': [0.01, 0.001, 0.0001],
    'epochs': [3, 5, 10]
}

# Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=wrapper, param_grid=param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print results
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Evaluate on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.4f}")

# Display confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# Display classification report
report = classification_report(y_test, y_pred)
print("Classification Report:")
print(report)

# Visualize some predictions
plt.figure(figsize=(12, 6))
for i in range(10):
    plt.subplot(2, 5, i+1)
    plt.imshow(X_test[i].reshape(28, 28), cmap='gray')
    plt.title(f"True: {y_test[i]}, Pred: {y_pred[i]}")
    plt.axis('off')
plt.tight_layout()
plt.show()

## 7. Best Practices and Common Pitfalls

When fine-tuning models, several best practices can help you achieve better results:

### Best Practices

1. **Start simple**: Begin with a frozen pre-trained model and only train the final layer.

2. **Gradual unfreezing**: If needed, gradually unfreeze and train earlier layers.

3. **Lower learning rates**: Use a smaller learning rate than you would for training from scratch.

4. **Differential learning rates**: Use even smaller learning rates for pre-trained layers.

5. **Early stopping**: Monitor validation performance to prevent overfitting.

6. **Regularization**: Use techniques like weight decay, dropout, or data augmentation.

7. **Batch normalization**: Re-calibrate batch normalization statistics for your dataset.

8. **Monitor training**: Keep track of both training and validation metrics.
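
As an illustration of item 5, early stopping can be implemented with a small helper that tracks the best validation loss. The `EarlyStopping` class and the validation losses below are assumptions made up for this sketch, not part of any library.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses: improvement stalls after the third epoch
val_losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.64]

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(f"Stopped at epoch {stopped_at}, best validation loss {stopper.best_loss}")
```

In a real training loop you would also keep a copy of the weights from the best epoch and restore them after stopping, as `train_model` does with `best_model_wts`.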

### Common Pitfalls

1. **Catastrophic forgetting**: The model forgets previously learned information.

2. **Overfitting**: The model performs well on training data but poorly on new data.

3. **Inappropriate learning rate**: Too high can destabilize training; too low can lead to slow convergence.

4. **Not adapting preprocessing**: Ensure your data preprocessing matches what the pre-trained model expects.

5. **Ignoring class imbalance**: Address imbalanced classes in your dataset.

6. **Training-validation mismatch**: Ensure validation process matches how you'll use the model.

7. **Neglecting model size**: Consider computational constraints, especially for deployment.
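
For pitfall 5, one common remedy is to weight the loss by inverse class frequency. The sketch below uses the "balanced" heuristic `n_samples / (n_classes * class_count)`; `inverse_frequency_weights` is a hypothetical helper and the labels are made up.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }

labels = [0] * 90 + [1] * 10  # 90% of the samples belong to class 0
weights = inverse_frequency_weights(labels)

for cls, weight in weights.items():
    print(f"class {cls}: weight {weight:.3f}")
```

The weights, ordered by class index, can then be converted to a tensor and passed as the `weight` argument of `nn.CrossEntropyLoss` so that mistakes on the rare class cost more.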

## 8. Advanced Fine-Tuning Techniques

As you progress, you might want to explore more advanced fine-tuning techniques:

### Parameter-Efficient Fine-Tuning

1. **Adapters**: Add small trainable modules between layers while keeping pre-trained weights frozen.

2. **LoRA (Low-Rank Adaptation)**: Represent weight updates as low-rank decompositions.

3. **Prompt Tuning**: Learn soft prompts while keeping the model frozen.

4. **QLoRA**: Quantized Low-Rank Adaptation for efficient fine-tuning.

Let's implement a simple adapter-based approach:

In [None]:
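# Simple adapter implementation
# ResNet blocks output feature maps of shape (N, C, H, W), so the bottleneck
# uses 1x1 convolutions, which act as per-pixel linear layers over the channels
class Adapter(nn.Module):
    def __init__(self, input_dim, adapter_dim):
        super().__init__()
        self.down = nn.Conv2d(input_dim, adapter_dim, kernel_size=1)
        self.up = nn.Conv2d(adapter_dim, input_dim, kernel_size=1)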
        self.init_weights()
        
    def init_weights(self):
        # Initialize to small values
        nn.init.normal_(self.down.weight, std=1e-3)
        nn.init.normal_(self.up.weight, std=1e-3)
        nn.init.zeros_(self.down.bias)
        nn.init.zeros_(self.up.bias)
        
    def forward(self, x):
        return self.up(F.relu(self.down(x))) + x  # Residual connection

# Modify an existing ResNet model to add adapters
def add_adapters_to_resnet(model, adapter_dim=64):
    # Add adapters to the ResNet blocks
    for name, module in model.named_modules():
        if isinstance(module, models.resnet.BasicBlock) or isinstance(module, models.resnet.Bottleneck):
            # Get the dimension of the output
            if isinstance(module, models.resnet.BasicBlock):
                input_dim = module.conv2.out_channels
            else:  # Bottleneck
                input_dim = module.conv3.out_channels
                
            # Create and add adapter
            module.adapter = Adapter(input_dim, adapter_dim)
            
            # Save original forward method
            original_forward = module.forward
            
            # Define new forward method with adapter
            def new_forward(self, x):
                identity = self.downsample(x) if self.downsample is not None else x
                
                out = self.conv1(x)
                out = self.bn1(out)
                out = self.relu(out)
                
                out = self.conv2(out)
                out = self.bn2(out)
                
                if hasattr(self, 'conv3'):  # For Bottleneck
                    out = self.relu(out)
                    out = self.conv3(out)
                    out = self.bn3(out)
                
                out += identity
                out = self.relu(out)
                
                # Apply adapter
                out = self.adapter(out)
                
                return out
                
            # Bind the new method to the module
            import types
            module.forward = types.MethodType(new_forward, module)
    
    return model

# Create a model with adapters
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 2)

# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# Add adapters, then move the whole model (including the new adapter
# modules) to the target device
model = add_adapters_to_resnet(model, adapter_dim=16)
model = model.to(device)

# Unfreeze only the adapters and the final layer
for name, param in model.named_parameters():
    if 'adapter' in name or 'fc' in name:
        param.requires_grad = True

# Count trainable parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params}")
print(f"Trainable parameters: {trainable_params}")
print(f"Percentage of trainable parameters: {trainable_params/total_params*100:.2f}%")

# Train with adapters (would use the same training loop as before)
# For brevity, we won't repeat the full training code here

## Implementation Strategies for LLM Fine-Tuning

For very large language models, specialized techniques are often needed:

### Full Fine-Tuning vs. Parameter-Efficient Methods

| Technique | Description | Pros | Cons |
|-----------|-------------|------|------|
| **Full Fine-Tuning** | Update all model weights | Best performance | High compute requirements |
| **LoRA** | Low-rank adaptation of weight matrices | Much less memory | Slightly lower performance |
| **QLoRA** | Quantized LoRA | Even more efficient | Complex implementation |
| **Prefix/Prompt Tuning** | Add trainable tokens | Very parameter efficient | Task-specific |
| **Adapters** | Small trainable layers | Modular, stackable | Added inference latency |
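
To illustrate the LoRA row, the sketch below wraps a single frozen linear layer with a trainable low-rank update. `LoRALinear`, the rank, and the `alpha` scaling value are assumptions chosen for this example; production code would typically use a dedicated library rather than this hand-written layer.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen nn.Linear plus a trainable low-rank update scaled by alpha / rank."""

    def __init__(self, base, rank=4, alpha=8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pre-trained weights stay fixed
        # A starts small and random, B starts at zero, so the update is zero at first
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        update = (x @ self.lora_a.T) @ self.lora_b.T
        return self.base(x) + update * self.scaling

torch.manual_seed(0)
base = nn.Linear(512, 512)
layer = LoRALinear(base, rank=4)

x = torch.randn(2, 512)
out = layer(x)

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"Trainable: {trainable} of {total} parameters ({trainable / total:.2%})")
```

Because `lora_b` is initialized to zero, the wrapped layer initially produces exactly the same output as the frozen base layer, and training only adjusts the two small matrices.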

### Knowledge Distillation

Another technique is to fine-tune a smaller model to mimic a larger one:

1. Fine-tune a large model on your task
2. Use that model to generate outputs on your dataset
3. Train a smaller model to match those outputs

This allows you to leverage the capabilities of large models while deploying more efficient ones.
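
Step 3 is often implemented with a loss that blends soft targets from the teacher with the ordinary cross-entropy on the true labels. The following is a minimal sketch under that assumption; `distillation_loss`, the temperature, and the `alpha` weighting are illustrative choices, and the logits are random placeholders rather than real model outputs.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend a soft-target KL term with cross-entropy on the true labels."""
    # Soften both distributions with the temperature before comparing them
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so the soft-target gradients keep a comparable magnitude
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

torch.manual_seed(0)
teacher_logits = torch.randn(4, 3)  # placeholder teacher outputs
student_logits = torch.randn(4, 3)  # placeholder student outputs
labels = torch.tensor([0, 1, 2, 1])

loss = distillation_loss(student_logits, teacher_logits, labels)
# With alpha=1.0 and identical logits, the soft-target term vanishes
same = distillation_loss(teacher_logits.clone(), teacher_logits, labels, alpha=1.0)
print(f"Distillation loss: {loss.item():.4f}")
```

In a real setup the teacher logits are computed under `torch.no_grad()` so that only the student receives gradient updates.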

## Conclusion

Fine-tuning pre-trained models is a powerful technique that lets you leverage the knowledge embedded in large models while adapting them to your specific needs. The key takeaways from this course include:

1. **Transfer learning efficiency**: Fine-tuning is more efficient than training from scratch in terms of data, compute, and time requirements.

2. **Approach options**: Choose between feature extraction, full fine-tuning, or parameter-efficient methods based on your resources and needs.

3. **Hyperparameter sensitivity**: Fine-tuning requires careful tuning of learning rates and other hyperparameters.

4. **Evaluation importance**: Proper evaluation on relevant metrics ensures your model generalizes well.

5. **Advanced techniques**: As models grow larger, parameter-efficient methods become increasingly important.

Whether you're working with computer vision, NLP, or other domains, the principles of fine-tuning remain similar: leverage pre-trained knowledge while carefully adapting the model to your specific task.

## Further Resources

- [Hugging Face Transfer Learning Documentation](https://huggingface.co/docs/transformers/training)
- [PyTorch Transfer Learning Tutorials](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html)
- [Parameter-Efficient Fine-Tuning Methods](https://arxiv.org/abs/2303.15647)
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
- [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)