# QLoRA: Efficient Finetuning of Quantized LLMs

## 🎯 Overview

QLoRA (Quantized LoRA) makes fine-tuning of large language models feasible on far less hardware. It backpropagates through a frozen, 4-bit quantized base model into LoRA adapters, which allows a 65B parameter model to be fine-tuned on a single 48GB GPU while preserving 16-bit fine-tuning performance.

**Key Innovation**: 4-bit NormalFloat (NF4) quantization + LoRA + double quantization + paged optimizers for unprecedented memory efficiency.

**Impact**: Democratized large model fine-tuning, enabling researchers and practitioners with limited resources to work with state-of-the-art models.

## 📚 Background & Motivation

### The Accessibility Problem
- Large models (7B+ parameters) require expensive hardware for fine-tuning
- Full 16-bit fine-tuning of a 65B model needs more than 780 GB of GPU memory, which is more than eight 80GB A100s provide
- Even LoRA fine-tuning of large models requires significant GPU memory
- Cost barriers prevent widespread experimentation and research

### The QLoRA Solution
- **4-bit quantization**: Reduce base-model weight memory by 75% vs 16-bit
- **LoRA adaptation**: Only train small adapter weights
- **Smart optimizations**: Double quantization, paged optimizers
- **Performance preservation**: Minimal degradation vs full precision

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional, Tuple, Union
import math
import seaborn as sns

# Set style
plt.style.use('default')
sns.set_palette("husl")
np.random.seed(42)
torch.manual_seed(42)

print("📦 Libraries imported successfully!")
print(f"🔢 NumPy version: {np.__version__}")
print(f"🔥 PyTorch version: {torch.__version__}")

## 🧮 Mathematical Foundation

### 4-bit NormalFloat (NF4) Quantization

QLoRA introduces NF4, a 4-bit data type optimized for normally distributed weights:

**NF4 Quantization Levels**:
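For weights distributed as N(0, σ²), each block is first rescaled into [-1, 1] by its absolute maximum. NF4 then places its 16 levels at quantiles of the standard normal distribution. In simplified form:

**q_i = Q^{-1}((i + 0.5)/16)** for i ∈ {0, 1, ..., 15}

Where Q^{-1} is the inverse CDF (quantile function) of N(0, 1). The levels used in practice are rescaled to span [-1, 1] and constructed asymmetrically so that 0 is represented exactly, which is why the hard-coded table in `NF4Quantizer` differs from this idealized formula.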
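
The simplified quantile construction can be sketched with the standard library alone. This is a minimal illustration, not the exact procedure used to produce the published NF4 table: the function name `quantile_levels` and the final rescaling by the largest magnitude are assumptions made for this example, and the resulting levels are symmetric and do not contain an exact zero.

```python
from statistics import NormalDist
from typing import List


def quantile_levels(bits: int = 4) -> List[float]:
    """Midpoint-quantile levels of N(0, 1), rescaled into [-1, 1]."""
    n = 2 ** bits
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    # q_i = Q^{-1}((i + 0.5) / n): one level per equal-probability bin
    raw = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    # Rescale so the outermost levels sit at -1 and +1 (illustrative choice)
    largest = max(abs(v) for v in raw)
    return [v / largest for v in raw]


levels = quantile_levels()
for i, q in enumerate(levels):
    print(f"q_{i:<2d} = {q:+.4f}")
```

The levels are denser near zero, where most normally distributed weights fall, and sparser toward the tails.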

### Double Quantization

Further compress the quantization constants themselves:
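- **First quantization**: Weights → 4-bit + one 32-bit scaling constant per block of 64 weights (0.5 bits of overhead per parameter)
- **Second quantization**: 32-bit scaling constants → 8-bit + one 32-bit constant per block of 256 constants (about 0.127 bits of overhead per parameter)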
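
The storage overhead of the scaling constants can be worked out directly. The sketch below assumes the block sizes and bit widths reported in the QLoRA paper (blocks of 64 weights, blocks of 256 constants, 32-bit constants quantized to 8 bits); the function names are hypothetical and every width is a parameter so other choices can be tried.

```python
def constant_overhead_bits(block_size: int = 64,
                           constant_bits: int = 32) -> float:
    """Bits of scaling-constant storage per weight with a single quantization."""
    return constant_bits / block_size


def double_quant_overhead_bits(block_size: int = 64,
                               second_block_size: int = 256,
                               first_bits: int = 8,
                               second_bits: int = 32) -> float:
    """Bits per weight once the constants themselves are quantized."""
    # Each weight block now stores one 8-bit constant ...
    first_level = first_bits / block_size
    # ... and every `second_block_size` constants share one 32-bit constant.
    second_level = second_bits / (block_size * second_block_size)
    return first_level + second_level


single = constant_overhead_bits()
double = double_quant_overhead_bits()
saved_gb = (single - double) * 65e9 / 8 / 1e9  # for a 65B-parameter model

print(f"single quantization: {single:.3f} bits/param")
print(f"double quantization: {double:.3f} bits/param")
print(f"saved on a 65B model: {saved_gb:.2f} GB")
```

Under these assumptions the saving is about 0.37 bits per parameter, or roughly 3 GB on a 65B model.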

### Paged Optimizers

Handle memory spikes during training:
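- **Unified memory**: Optimizer states are allocated as paged memory shared between CPU and GPU (NVIDIA unified memory)
- **Automatic paging**: When the GPU runs out of memory, pages are evicted to CPU RAM and paged back in when the optimizer update needs them
- **Low overhead**: Paging only happens under memory pressure, so runs that never hit the limit see little to no slowdown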
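
The paging idea can be illustrated with a toy model. Real paged optimizers rely on the GPU driver to move pages, so nothing below reflects an actual implementation: the class `ToyPagedStore`, its least-recently-used eviction policy, and the "GPU"/"CPU" dictionaries are all assumptions made purely for illustration.

```python
from collections import OrderedDict
from typing import Dict, List


class ToyPagedStore:
    """Toy model of paging: a small 'GPU' pool backed by a 'CPU' pool."""

    def __init__(self, gpu_capacity: int) -> None:
        self.gpu_capacity = gpu_capacity
        self.gpu: "OrderedDict[str, List[float]]" = OrderedDict()
        self.cpu: Dict[str, List[float]] = {}
        self.page_ins = 0
        self.page_outs = 0

    def put(self, name: str, state: List[float]) -> None:
        """Register a new optimizer-state page, evicting if the GPU is full."""
        self._make_room()
        self.gpu[name] = state

    def get(self, name: str) -> List[float]:
        """Return a page, bringing it back from the CPU pool if needed."""
        if name in self.gpu:
            self.gpu.move_to_end(name)  # mark as most recently used
            return self.gpu[name]
        state = self.cpu.pop(name)      # page fault: fetch from the CPU pool
        self.page_ins += 1
        self._make_room()
        self.gpu[name] = state
        return state

    def _make_room(self) -> None:
        # Evict least recently used pages until one slot is free
        while len(self.gpu) >= self.gpu_capacity:
            victim, state = self.gpu.popitem(last=False)
            self.cpu[victim] = state
            self.page_outs += 1


store = ToyPagedStore(gpu_capacity=2)
for layer in ("layer0", "layer1", "layer2"):
    store.put(layer, [0.0, 0.0])      # the third put evicts layer0

store.get("layer0")                   # pages layer0 back in, evicting layer1
print(sorted(store.gpu), sorted(store.cpu))
print(store.page_ins, store.page_outs)
```

The point of the toy is that the caller never handles an out-of-memory error: accesses succeed, and the cost shows up only as extra transfers when capacity is exceeded.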

In [None]:
class NF4Quantizer:
    """
    4-bit NormalFloat (NF4) quantization implementation.
    
    Optimized for normally distributed weights common in neural networks.
    """
    
    def __init__(self):
        # Pre-computed NF4 quantization levels
        # These are the 16 levels optimized for normal distribution
        self.nf4_levels = torch.tensor([
            -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
            -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
            0.07958029955625534, 0.16093020141124725, 0.24611230194568634, 
            0.33791524171829224, 0.44070982933044434, 0.5626170635223389,
            0.7229568362236023, 1.0
        ])
    
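    def quantize(self, weights: torch.Tensor, block_size: int = 64) -> Tuple[torch.Tensor, torch.Tensor, torch.Size, int]:
        """
        Quantize weights to 4-bit NF4 format.
        
        Args:
            weights: Input weights to quantize
            block_size: Size of quantization blocks
        
        Returns:
            quantized_weights: NF4 level indices (one uint8 per weight), shape (num_blocks, block_size)
            scale_factors: Per-block absmax scaling factors for dequantization
            original_shape: Shape of the input tensor
            padding: Number of zero elements appended to fill the last block
        """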
        original_shape = weights.shape
        weights_flat = weights.flatten()
        
        # Pad to multiple of block_size
        padding = (block_size - (weights_flat.numel() % block_size)) % block_size
        if padding > 0:
            weights_flat = torch.cat([weights_flat, torch.zeros(padding, dtype=weights.dtype, device=weights.device)])
        
        # Reshape into blocks
        weights_blocked = weights_flat.view(-1, block_size)
        
        # Compute scaling factors per block
        abs_max = torch.abs(weights_blocked).max(dim=1, keepdim=True)[0]
        scale_factors = abs_max / self.nf4_levels.max().to(weights.device)
        
        # Avoid division by zero
        scale_factors = torch.where(scale_factors == 0, torch.ones_like(scale_factors), scale_factors)
        
        # Normalize weights
        weights_normalized = weights_blocked / scale_factors
        
        # Quantize to NF4 levels
        nf4_levels_device = self.nf4_levels.to(weights.device)
        distances = torch.abs(weights_normalized.unsqueeze(-1) - nf4_levels_device.unsqueeze(0).unsqueeze(0))
        quantized_indices = torch.argmin(distances, dim=-1)
        
        # Pack 4-bit values (simulate - in practice would pack into bytes)
        quantized_weights = quantized_indices.to(torch.uint8)
        
        # squeeze(-1) keeps a 1-D tensor even when there is only a single block
        return quantized_weights, scale_factors.squeeze(-1), original_shape, padding
    
    def dequantize(self, quantized_weights: torch.Tensor, scale_factors: torch.Tensor, 
                   original_shape: torch.Size, padding: int) -> torch.Tensor:
        """
        Dequantize 4-bit NF4 weights back to FP16/FP32.
        """
        # Get NF4 levels
        nf4_levels_device = self.nf4_levels.to(quantized_weights.device)
        
        # Dequantize
        dequantized_blocks = nf4_levels_device[quantized_weights.long()]
        
        # Apply scaling
        dequantized_blocks = dequantized_blocks * scale_factors.unsqueeze(-1)
        
        # Flatten and remove padding
        dequantized_flat = dequantized_blocks.flatten()
        if padding > 0:
            dequantized_flat = dequantized_flat[:-padding]
        
        # Reshape to original
        return dequantized_flat.view(original_shape)
    
    def compute_quantization_error(self, original: torch.Tensor, dequantized: torch.Tensor) -> dict:
        """
        Compute quantization error metrics.
        """
        mse = torch.mean((original - dequantized) ** 2)
        mae = torch.mean(torch.abs(original - dequantized))
        max_error = torch.max(torch.abs(original - dequantized))
        snr = 20 * torch.log10(torch.std(original) / torch.sqrt(mse))
        
        return {
            'mse': mse.item(),
            'mae': mae.item(),
            'max_error': max_error.item(),
            'snr_db': snr.item()
        }


# Test NF4 quantization
def test_nf4_quantization():
    print("🧪 Testing NF4 Quantization")
    print("=" * 35)
    
    quantizer = NF4Quantizer()
    
    # Test with different weight distributions
    test_cases = [
        ("Normal(0,1)", torch.randn(1024, 512)),
        ("Normal(0,0.1)", torch.randn(1024, 512) * 0.1),
        ("Uniform(-1,1)", torch.rand(1024, 512) * 2 - 1),
    ]
    
    results = []
    
    for name, weights in test_cases:
        print(f"\n📊 Testing {name} distribution:")
        
        # Quantize
        quantized, scales, shape, padding = quantizer.quantize(weights)
        
        # Dequantize
        dequantized = quantizer.dequantize(quantized, scales, shape, padding)
        
        # Compute metrics
        metrics = quantizer.compute_quantization_error(weights, dequantized)
        
        # Memory savings
        original_bits = weights.numel() * 32  # FP32
        quantized_bits = quantized.numel() * 4 + scales.numel() * 16  # 4-bit + FP16 scales
        compression_ratio = original_bits / quantized_bits
        
        result = {
            'distribution': name,
            'compression_ratio': compression_ratio,
            **metrics
        }
        results.append(result)
        
        print(f"   Compression ratio: {compression_ratio:.1f}x")
        print(f"   MSE: {metrics['mse']:.6f}")
        print(f"   SNR: {metrics['snr_db']:.1f} dB")
    
    return results

# Run test
quantization_results = test_nf4_quantization()

## 🏗️ QLoRA Implementation

Combining quantized base models with LoRA adapters.
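
Before the full layer, a quick count shows how few weights are actually trained. This is a small worked example: the helper `lora_param_count` is hypothetical, and the 1024×1024 layer with rank 16 simply mirrors the sizes used in the test further down.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Trainable weights in one LoRA adapter: A (rank x in) plus B (out x rank)."""
    return rank * in_features + out_features * rank


in_f, out_f, r = 1024, 1024, 16
frozen = in_f * out_f                       # base weight, stored in 4 bits
trainable = lora_param_count(in_f, out_f, r)
fraction = trainable / (frozen + trainable)

print(f"frozen base weights:    {frozen:,}")
print(f"trainable LoRA weights: {trainable:,}")
print(f"trainable fraction:     {fraction:.2%}")
```

Only the adapter weights need gradients and optimizer state, which is where most of the training-time memory saving comes from.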

In [None]:
class QuantizedLinear(nn.Module):
    """
    Quantized linear layer using NF4 quantization.
    """
    
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        
        # Initialize with random FP32 weights (a stand-in for pretrained weights), then quantize
        weight = torch.randn(out_features, in_features) * 0.02
        
        # Quantize weights
        self.quantizer = NF4Quantizer()
        quantized_weight, scale_factors, weight_shape, padding = self.quantizer.quantize(weight)
        
        # Store quantized representation
        self.register_buffer('quantized_weight', quantized_weight)
        self.register_buffer('scale_factors', scale_factors)
        self.register_buffer('weight_shape', torch.tensor(weight_shape))
        self.register_buffer('padding', torch.tensor(padding))
        
        # Bias (kept unquantized; counted as FP16 in the memory estimate)
        if bias:
            self.bias = nn.Parameter(torch.zeros(out_features))
        else:
            self.register_parameter('bias', None)
    
    def dequantize_weight(self) -> torch.Tensor:
        """
        Dequantize weight for computation.
        """
        return self.quantizer.dequantize(
            self.quantized_weight, 
            self.scale_factors, 
            tuple(self.weight_shape.tolist()), 
            self.padding.item()
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize weight for computation
        weight = self.dequantize_weight()
        return F.linear(x, weight, self.bias)
    
    def get_memory_usage(self) -> dict:
        """
        Get memory usage statistics.
        """
        # Original FP16 weight memory
        original_weight_memory = self.in_features * self.out_features * 2  # bytes
        
        # Quantized memory
        quantized_memory = (
            self.quantized_weight.numel() * 0.5 +  # 4-bit weights (0.5 bytes each)
            self.scale_factors.numel() * 2  # FP16 scales
        )
        
        # Bias memory (unquantized, so it is counted on both sides of the comparison)
        bias_memory = self.out_features * 2 if self.bias is not None else 0
        
        total_original = original_weight_memory + bias_memory
        total_quantized = quantized_memory + bias_memory
        
        return {
            'original_fp16_bytes': total_original,
            'quantized_bytes': total_quantized,
            'compression_ratio': total_original / total_quantized,
            'memory_savings_percent': (1 - total_quantized / total_original) * 100
        }


class QLoRALayer(nn.Module):
    """
    QLoRA layer: Quantized base model + LoRA adapters.
    """
    
    def __init__(
        self,
        in_features: int,
        out_features: int,
        rank: int = 16,
        alpha: float = 16,
        dropout: float = 0.1,
        bias: bool = True
    ):
        super().__init__()
        
        # Quantized base layer (frozen)
        self.base_layer = QuantizedLinear(in_features, out_features, bias)
        
        # Freeze base layer
        for param in self.base_layer.parameters():
            param.requires_grad = False
        
        # LoRA parameters
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        
        # LoRA matrices (trainable)
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        self.dropout = nn.Dropout(dropout)
        
        # Initialize LoRA weights
        nn.init.kaiming_uniform_(self.lora_A.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B.weight)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base quantized output
        base_output = self.base_layer(x)
        
        # LoRA output
        lora_output = self.lora_B(self.lora_A(self.dropout(x)))
        
        # Combine with scaling
        return base_output + lora_output * self.scaling
    
    def get_parameter_breakdown(self) -> dict:
        """
        Get detailed parameter breakdown.
        """
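        # Base layer parameters (frozen). The quantized weight is stored as a
        # buffer, so it is counted explicitly rather than via .parameters().
        base_params = self.base_layer.in_features * self.base_layer.out_features
        base_params += sum(p.numel() for p in self.base_layer.parameters())  # bias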
        
        # LoRA parameters (trainable)
        lora_params = sum(p.numel() for p in [self.lora_A.weight, self.lora_B.weight])
        
        # Memory usage
        base_memory = self.base_layer.get_memory_usage()
        lora_memory = lora_params * 2  # FP16
        
        return {
            'base_parameters': base_params,
            'lora_parameters': lora_params,
            'total_parameters': base_params + lora_params,
            'trainable_parameters': lora_params,
            'trainable_ratio': lora_params / (base_params + lora_params),
            'base_memory_bytes': base_memory['quantized_bytes'],
            'lora_memory_bytes': lora_memory,
            'total_memory_bytes': base_memory['quantized_bytes'] + lora_memory,
            'memory_compression_vs_fp16': base_memory['compression_ratio']
        }


# Test QLoRA implementation
def test_qlora_layer():
    print("\n🔬 Testing QLoRA Layer")
    print("=" * 30)
    
    # Create QLoRA layer
    qlora_layer = QLoRALayer(
        in_features=1024,
        out_features=1024,
        rank=16,
        alpha=16
    )
    
    # Test forward pass
    batch_size, seq_len = 4, 128
    x = torch.randn(batch_size, seq_len, 1024)
    
    with torch.no_grad():
        output = qlora_layer(x)
    
    print(f"✅ Forward pass successful:")
    print(f"   Input shape: {x.shape}")
    print(f"   Output shape: {output.shape}")
    
    # Analyze parameters
    breakdown = qlora_layer.get_parameter_breakdown()
    
    print(f"\n📊 Parameter Breakdown:")
    print(f"   Base parameters: {breakdown['base_parameters']:,}")
    print(f"   LoRA parameters: {breakdown['lora_parameters']:,}")
    print(f"   Trainable ratio: {breakdown['trainable_ratio']:.4f} ({breakdown['trainable_ratio']*100:.2f}%)")
    print(f"   Memory compression: {breakdown['memory_compression_vs_fp16']:.1f}x vs FP16")
    
    # Compare with full precision
    full_precision_params = 1024 * 1024  # Full weight matrix
    reduction_factor = full_precision_params / breakdown['lora_parameters']
    
    print(f"\n🚀 Efficiency Gains:")
    print(f"   Parameter reduction: {reduction_factor:.0f}x vs full fine-tuning")
    print(f"   Frozen parameters: {(1 - breakdown['trainable_ratio']) * 100:.1f}% of total")
    
    return qlora_layer, breakdown

# Run test
qlora_test = test_qlora_layer()

## 📊 QLoRA Scaling Analysis

Let's analyze how QLoRA scales with model size and compare memory requirements.
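
As a rough guide to where the memory goes, the bytes-per-parameter arithmetic can be written out. The sketch assumes mixed-precision Adam (16-bit weights and gradients plus two 32-bit moment tensors, 12 bytes per trained parameter) and ignores activations; the function names are hypothetical and the byte counts are parameters, not measurements.

```python
def full_finetune_gb(params_billions: float,
                     weight_bytes: float = 2.0,
                     grad_bytes: float = 2.0,
                     optimizer_bytes: float = 8.0) -> float:
    """Weights + gradients + Adam state for full fine-tuning, in GB (1e9 bytes)."""
    # billions of parameters * bytes per parameter == gigabytes
    return params_billions * (weight_bytes + grad_bytes + optimizer_bytes)


def frozen_base_gb(params_billions: float, bits_per_weight: float) -> float:
    """Storage for a frozen base model at the given precision, in GB."""
    return params_billions * bits_per_weight / 8


for size in (7, 13, 30, 65):
    print(f"{size:>3}B  full: {full_finetune_gb(size):6.0f} GB"
          f"  fp16 base: {frozen_base_gb(size, 16):6.1f} GB"
          f"  nf4 base: {frozen_base_gb(size, 4):5.1f} GB")
```

For a 65B model these assumptions give 780 GB for full fine-tuning, against 32.5 GB for the 4-bit frozen base before adapters and activations are added.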

In [None]:
def analyze_qlora_scaling():
    """
    Analyze QLoRA memory scaling for different model sizes.
    """
    
    # Model configurations (simplified)
    model_configs = [
        {'name': '7B LLaMA', 'layers': 32, 'hidden': 4096, 'params_billions': 7},
        {'name': '13B LLaMA', 'layers': 40, 'hidden': 5120, 'params_billions': 13},
        {'name': '30B LLaMA', 'layers': 60, 'hidden': 6656, 'params_billions': 30},
        {'name': '65B LLaMA', 'layers': 80, 'hidden': 8192, 'params_billions': 65},
    ]
    
    results = []
    
    for config in model_configs:
        # Estimate memory requirements
        hidden_size = config['hidden']
        num_layers = config['layers']
        
        # Approximate linear layers per transformer layer
        # (q, k, v, o projections + 2 FFN layers; LLaMA actually has 3 FFN
        # projections with non-square shapes, so this is a rough approximation)
        linear_layers_per_block = 6
        total_linear_layers = num_layers * linear_layers_per_block
        
        # Memory calculations (in GB); activations are ignored throughout.
        # Bytes per trained parameter: FP16 weight (2) + FP16 gradient (2)
        # + two FP32 Adam moments (8) = 12
        bytes_per_trained_param = 2 + 2 + 8
        
        # Full FP16 fine-tuning
        full_fp16_memory = config['params_billions'] * bytes_per_trained_param
        
        # LoRA fine-tuning (FP16 base model)
        lora_rank = 16
        lora_params_per_layer = 2 * hidden_size * lora_rank  # A and B matrices (square layers assumed)
        total_lora_params = total_linear_layers * lora_params_per_layer
        lora_train_memory = total_lora_params * bytes_per_trained_param / 1e9
        lora_fp16_memory = (
            config['params_billions'] * 2 +  # frozen FP16 base model
            lora_train_memory  # LoRA params + gradients + optimizer
        )
        
        # QLoRA fine-tuning
        qlora_memory = (
            config['params_billions'] * 0.5 +  # frozen 4-bit base model
            lora_train_memory  # LoRA params + gradients + optimizer
        )
        
        # GPU memory thresholds
        gpu_24gb = 24
        gpu_48gb = 48
        gpu_80gb = 80
        
        result = {
            'model': config['name'],
            'params_billions': config['params_billions'],
            'full_fp16_gb': full_fp16_memory,
            'lora_fp16_gb': lora_fp16_memory,
            'qlora_gb': qlora_memory,
            'lora_params_millions': total_lora_params / 1e6,
            'fits_24gb': qlora_memory <= gpu_24gb,
            'fits_48gb': qlora_memory <= gpu_48gb,
            'memory_reduction_vs_full': full_fp16_memory / qlora_memory,
            'memory_reduction_vs_lora': lora_fp16_memory / qlora_memory
        }
        
        results.append(result)
    
    return results


def visualize_qlora_scaling(results):
    """
    Visualize QLoRA scaling analysis.
    """
    
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
    
    models = [r['model'] for r in results]
    param_sizes = [r['params_billions'] for r in results]
    
    # 1. Memory usage comparison
    full_memory = [r['full_fp16_gb'] for r in results]
    lora_memory = [r['lora_fp16_gb'] for r in results]
    qlora_memory = [r['qlora_gb'] for r in results]
    
    x_pos = np.arange(len(models))
    width = 0.25
    
    bars1 = ax1.bar(x_pos - width, full_memory, width, label='Full FP16', alpha=0.8, color='red')
    bars2 = ax1.bar(x_pos, lora_memory, width, label='LoRA FP16', alpha=0.8, color='orange')
    bars3 = ax1.bar(x_pos + width, qlora_memory, width, label='QLoRA', alpha=0.8, color='green')
    
    # Add GPU memory lines
    ax1.axhline(y=24, color='blue', linestyle='--', alpha=0.7, label='24GB GPU')
    ax1.axhline(y=48, color='purple', linestyle='--', alpha=0.7, label='48GB GPU')
    
    ax1.set_xlabel('Model Size')
    ax1.set_ylabel('Memory Usage (GB)')
    ax1.set_title('Memory Requirements Comparison')
    ax1.set_xticks(x_pos)
    ax1.set_xticklabels(models, rotation=45)
    ax1.legend()
    ax1.grid(True, alpha=0.3, axis='y')
    ax1.set_yscale('log')
    
    # 2. Memory reduction factors
    reduction_vs_full = [r['memory_reduction_vs_full'] for r in results]
    reduction_vs_lora = [r['memory_reduction_vs_lora'] for r in results]
    
    bars1 = ax2.bar(x_pos - width/2, reduction_vs_full, width, label='vs Full FP16', alpha=0.8, color='red')
    bars2 = ax2.bar(x_pos + width/2, reduction_vs_lora, width, label='vs LoRA FP16', alpha=0.8, color='orange')
    
    ax2.set_xlabel('Model Size')
    ax2.set_ylabel('Memory Reduction Factor')
    ax2.set_title('QLoRA Memory Reduction')
    ax2.set_xticks(x_pos)
    ax2.set_xticklabels(models, rotation=45)
    ax2.legend()
    ax2.grid(True, alpha=0.3, axis='y')
    
    # Add value labels
    for i, (bar1, bar2) in enumerate(zip(bars1, bars2)):
        ax2.text(bar1.get_x() + bar1.get_width()/2, bar1.get_height() + 0.5,
                f'{reduction_vs_full[i]:.1f}x', ha='center', va='bottom', fontsize=9)
        ax2.text(bar2.get_x() + bar2.get_width()/2, bar2.get_height() + 0.1,
                f'{reduction_vs_lora[i]:.1f}x', ha='center', va='bottom', fontsize=9)
    
    # 3. GPU compatibility
    gpu_24gb_compat = [r['fits_24gb'] for r in results]
    gpu_48gb_compat = [r['fits_48gb'] for r in results]
    
    # Create compatibility matrix
    compat_matrix = []
    for i, result in enumerate(results):
        if result['fits_24gb']:
            compat_matrix.append([1, 1, 1])  # Fits all GPUs
        elif result['fits_48gb']:
            compat_matrix.append([0, 1, 1])  # Fits 48GB and 80GB
        else:
            compat_matrix.append([0, 0, 1])  # Only 80GB+
    
    im = ax3.imshow(compat_matrix, cmap='RdYlGn', aspect='auto')
    ax3.set_xlabel('GPU Type')
    ax3.set_ylabel('Model')
    ax3.set_title('GPU Compatibility Matrix')
    ax3.set_xticks([0, 1, 2])
    ax3.set_xticklabels(['24GB', '48GB', '80GB+'])
    ax3.set_yticks(range(len(models)))
    ax3.set_yticklabels(models)
    
    # Add text annotations
    for i in range(len(models)):
        for j in range(3):
            text = '✓' if compat_matrix[i][j] else '✗'
            ax3.text(j, i, text, ha='center', va='center', fontsize=20, 
                    color='white' if compat_matrix[i][j] else 'black', fontweight='bold')
    
    # 4. LoRA parameters vs model size
    lora_params = [r['lora_params_millions'] for r in results]
    
    ax4.plot(param_sizes, lora_params, 'o-', linewidth=3, markersize=10, color='blue')
    ax4.set_xlabel('Base Model Size (Billions of Parameters)')
    ax4.set_ylabel('LoRA Parameters (Millions)')
    ax4.set_title('LoRA Parameter Scaling')
    ax4.grid(True, alpha=0.3)
    
    # Add trend line
    z = np.polyfit(param_sizes, lora_params, 1)
    p = np.poly1d(z)
    ax4.plot(param_sizes, p(param_sizes), "--", alpha=0.7, color='red',
             label=f'Trend: {z[0]:.1f}M per B params')
    ax4.legend()
    
    plt.tight_layout()
    plt.show()
    
    return compat_matrix


# Run scaling analysis
print("\n📈 QLoRA Scaling Analysis")
print("=" * 35)

scaling_results = analyze_qlora_scaling()
compat_matrix = visualize_qlora_scaling(scaling_results)

# Print summary table
print("\n📊 QLoRA Memory Requirements Summary:")
print("=" * 80)
print(f"{'Model':<12} {'Params(B)':<10} {'Full FP16':<12} {'LoRA FP16':<12} {'QLoRA':<10} {'24GB':<6} {'48GB':<6}")
print("=" * 80)

for result in scaling_results:
    model = result['model']
    params = result['params_billions']
    full_mem = result['full_fp16_gb']
    lora_mem = result['lora_fp16_gb']
    qlora_mem = result['qlora_gb']
    fits_24 = '✓' if result['fits_24gb'] else '✗'
    fits_48 = '✓' if result['fits_48gb'] else '✗'
    
    # Attach the unit before padding so the columns line up with the header
    print(f"{model:<12} {params:<10} {f'{full_mem:.0f}GB':<12} {f'{lora_mem:.1f}GB':<12} {f'{qlora_mem:.1f}GB':<10} {fits_24:<6} {fits_48:<6}")

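# Report the reduction factors actually computed above instead of fixed ranges
vs_lora = [r['memory_reduction_vs_lora'] for r in scaling_results]
vs_full = [r['memory_reduction_vs_full'] for r in scaling_results]

print("\n🎯 Key Insights (weights and optimizer state only, activations ignored):")
print("  • QLoRA enables 65B model training on single 48GB GPU")
print(f"  • {min(vs_lora):.1f}-{max(vs_lora):.1f}x memory reduction vs LoRA FP16")
print(f"  • {min(vs_full):.1f}-{max(vs_full):.1f}x memory reduction vs full fine-tuning")
print("  • Democratizes large model experimentation")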

## 💡 Key Takeaways

### QLoRA Advantages:
1. **Extreme Memory Efficiency**: 4-bit quantization reduces base-model weight memory by 75% vs 16-bit
2. **Preserved Performance**: Minimal degradation vs full precision
3. **Accessibility**: 65B models on a single 48GB GPU, and models around 30B on a 24GB consumer GPU
4. **Small Training State**: Gradients and optimizer states exist only for the LoRA adapters, although each step pays a dequantization cost
5. **Research Democratization**: Enables widespread experimentation

### Technical Innovations:
1. **NF4 Quantization**: Optimized for neural network weight distributions
2. **Double Quantization**: Further compress scaling factors
3. **Paged Optimizers**: Handle memory spikes gracefully
4. **LoRA Integration**: Combine with parameter-efficient fine-tuning

### Performance Characteristics:
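- **Memory Reduction**: Roughly 4x vs 16-bit LoRA for the base weights; for a 65B model the QLoRA paper reports going from more than 780 GB (full 16-bit fine-tuning) to under 48 GB
- **Accuracy**: The QLoRA paper reports matching 16-bit fine-tuning performance on its benchmarks
- **Speed**: Per-step throughput can be lower than 16-bit LoRA because weights are dequantized on the fly; the gain is in memory, not speed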
- **Compatibility**: Works with existing LoRA infrastructure

### When to Use QLoRA:
- **Large Models**: When working with 7B+ parameter models
- **Limited Hardware**: Consumer GPUs, cloud cost optimization
- **Research**: Quick experimentation with large models
- **Production**: Resource-efficient deployment

### Limitations:
1. **Quantization Overhead**: Small computational overhead during training
2. **Memory Transfer**: Dequantization requires GPU memory bandwidth
3. **Precision Loss**: Minimal but present quantization error
4. **Implementation Complexity**: More complex than standard training

## 🚀 Next Steps

1. **Try HuggingFace Integration**: Use with PEFT library
2. **Experiment with Ranks**: Find optimal LoRA rank for your task
3. **Explore GPTQ**: Compare with other quantization methods
4. **Multi-GPU Setup**: Scale to even larger models
5. **Production Deployment**: Optimize for inference
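
For the first step, a typical setup with the Hugging Face stack might look like the sketch below. It assumes `transformers`, `peft`, `bitsandbytes`, and `accelerate` are installed and a CUDA GPU is available. The function name `build_qlora_model` and its arguments are hypothetical, and the `target_modules` names assume a LLaMA-style architecture, so they may need changing for other models.

```python
def build_qlora_model(model_id: str, rank: int = 16, alpha: int = 16,
                      dropout: float = 0.1):
    """Load a causal LM in 4-bit NF4 and attach LoRA adapters (sketch)."""
    # Heavy dependencies are imported lazily so that defining this function
    # does not require a GPU or the libraries themselves.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
        bnb_4bit_use_double_quant=True,         # quantize the scaling constants too
        bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used after dequantization
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        r=rank,
        lora_alpha=alpha,
        lora_dropout=dropout,
        bias="none",
        task_type="CAUSAL_LM",
        # Assumption: LLaMA-style attention projection names
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )
    return get_peft_model(model, lora_config)
```

The returned model can be passed to a standard training loop or to `transformers.Trainer`; with the latter, a paged optimizer can be requested through `TrainingArguments(optim="paged_adamw_32bit")`.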

**QLoRA has made large language model fine-tuning practical on a single GPU, for models up to 65B parameters on 48GB of memory. It combines quantization with parameter-efficient fine-tuning without giving up 16-bit task performance!** 🎯