# 🚀 PCM-LLM: Prompt Compression Benchmark

A framework for benchmarking prompt compression methods for Large Language Models (LLMs): it compresses prompts, runs both versions through a model, and compares the answers.

**Optimized for Google Colab Free Tier**
- ✅ 4-bit quantization for maximum memory efficiency
- ✅ Memory monitoring and automatic cleanup
- ✅ Tuned for limited GPU memory (about 15 GB)
- ✅ Fast execution with small batch sizes

---

## Quick Setup

In [None]:
# Clone the repository
!git clone https://github.com/yourusername/pcm-llm.git
%cd pcm-llm

# Install dependencies
!pip install -q -r requirements.txt

# Check GPU availability
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

## Configuration

The defaults are chosen to fit the Colab free tier:
- **Model**: Phi-3 Mini (3.8B parameters)
- **Quantization**: 4-bit (maximum memory savings)
- **Samples**: 3 (quick testing)
- **Memory monitoring**: Enabled

You can modify settings in `config.py` if needed.
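
To see why 4-bit quantization matters on a GPU of this size, here is a rough estimate of the memory needed for the model weights alone at several precisions. It is a back-of-the-envelope sketch, not a measurement: the parameter count comes from the list above, and the function name `weight_memory_gib` is made up for illustration.

```python
# Rough estimate of weight memory at different precisions.
# Covers model weights only: activations, the KV cache, and quantization
# metadata (scales, zero points) all add to these figures.

PARAMS = 3.8e9  # Phi-3 Mini parameter count, as listed above

BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "4bit": 4}


def weight_memory_gib(num_params: float, bits: int) -> float:
    """Return the memory needed to store the weights, in GiB."""
    return num_params * bits / 8 / 1024**3


estimates = {
    name: weight_memory_gib(PARAMS, bits)
    for name, bits in BITS_PER_WEIGHT.items()
}

for name, gib in estimates.items():
    print(f"{name:>5}: {gib:5.1f} GiB")
```

By this estimate, 4-bit weights need a quarter of the memory of fp16 weights, which leaves more of the GPU free for activations and the compression model.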

In [None]:
# Check current configuration
# Import the module rather than using `from config import *`, so the
# notebook namespace is not filled with every name defined in config.py.
import config

print("Current Configuration:")
print(f"- LLM Provider: {config.DEFAULT_LLM_PROVIDER}")
print(f"- Model: {config.HUGGINGFACE_MODEL}")
print(f"- Quantization: {config.HUGGINGFACE_QUANTIZATION}")
print(f"- Dataset: {config.DEFAULT_DATASET}")
print(f"- Samples: {config.NUM_SAMPLES_TO_RUN}")
print(f"- Compression: {config.DEFAULT_COMPRESSION_METHOD}")
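
For orientation, the settings printed above might look roughly like this in `config.py`. The setting names are the ones this notebook reads; the values are assumptions for illustration (only `"4bit"` and the sample count of 3 appear elsewhere in this notebook), so check the actual file before relying on them.

```python
# Illustrative sketch of config.py. Names match what the notebook prints;
# values marked "assumed" are guesses, not the project's actual defaults.

DEFAULT_LLM_PROVIDER = "huggingface"                    # assumed value
HUGGINGFACE_MODEL = "microsoft/Phi-3-mini-4k-instruct"  # assumed model id for Phi-3 Mini
HUGGINGFACE_QUANTIZATION = "4bit"                       # value used in Troubleshooting
DEFAULT_DATASET = "gsm8k"                               # assumed value
NUM_SAMPLES_TO_RUN = 3                                  # quick-test sample count
DEFAULT_COMPRESSION_METHOD = "llmlingua2"               # assumed value
```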

## Run Benchmark

Execute the full benchmark pipeline:
1. **Compression Phase**: Compress prompts using LLMLingua-2
2. **Evaluation Phase**: Test both original and compressed prompts
3. **Analysis Phase**: Compare performance and consistency

**Expected runtime**: ~5-10 minutes for 3 samples

In [None]:
# Run the benchmark
!python main.py
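
The three phases can be sketched in a few lines with stand-in components. This is only an illustration of the flow, not what `main.py` does: `toy_compress` and `toy_model` are hypothetical placeholders for the LLMLingua-2 compressor and the LLM, and the row fields are invented for the example.

```python
# Sketch of the compress -> evaluate -> analyze flow with stand-ins.
# `toy_compress` and `toy_model` are hypothetical placeholders, not the
# project's LLMLingua-2 compressor or its LLM.

FILLER = {"the", "a", "an", "of", "to", "is", "are", "please", "that", "in"}


def toy_compress(prompt: str) -> str:
    """Drop common filler words (placeholder for a real compressor)."""
    return " ".join(w for w in prompt.split() if w.lower() not in FILLER)


def toy_model(prompt: str) -> str:
    """Placeholder 'LLM': returns the sum of all integers in the prompt."""
    words = prompt.replace("?", " ").split()
    return str(sum(int(w) for w in words if w.isdigit()))


def run_benchmark(samples):
    rows = []
    for prompt in samples:
        compressed = toy_compress(prompt)          # 1. compression phase
        original_answer = toy_model(prompt)        # 2. evaluation phase
        compressed_answer = toy_model(compressed)
        rows.append({                              # 3. analysis phase
            "compression_ratio": len(compressed.split()) / len(prompt.split()),
            "consistent": original_answer == compressed_answer,
        })
    return rows


samples = ["Please tell me what is the sum of 12 and 30 ?"]
results = run_benchmark(samples)
print(results)
```

The useful property to watch is whether an answer survives compression: a lower compression ratio is only a win if `consistent` stays true.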

## Results Analysis

The benchmark generates:
- **CSV logs** in the `results/` folder
- **Performance metrics** for original vs compressed prompts
- **Memory usage reports**
- **Answer consistency analysis**

Let's examine the latest results:

In [None]:
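import glob
import os

import pandas as pd

# Find the most recently modified results file.
# Comparing the paths themselves would pick the alphabetically last name,
# which is only the newest file if the names sort chronologically.
result_files = glob.glob('results/*.csv')
if result_files:
    latest_file = max(result_files, key=os.path.getmtime)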
    print(f"Loading latest results: {latest_file}")
    
    df = pd.read_csv(latest_file)
    print(f"\nResults shape: {df.shape}")
    print("\nColumns:")
    for col in df.columns:
        print(f"  - {col}")
    
    print("\nSample results:")
    display(df.head())
else:
    print("No results files found. Run the benchmark first!")
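
Once the file is loaded, accuracy and answer consistency can be summarized with a small helper like the one below. This is a sketch under assumptions: the column names `original_answer`, `compressed_answer`, and `ground_truth` are hypothetical, so compare them with the columns printed above and rename as needed. The sample rows are made up to keep the example self-contained.

```python
# Hypothetical column names and made-up rows; check `df.columns` for the
# names the benchmark actually writes.
rows = [
    {"original_answer": "42", "compressed_answer": "42", "ground_truth": "42"},
    {"original_answer": "7", "compressed_answer": "9", "ground_truth": "7"},
    {"original_answer": "15", "compressed_answer": "15", "ground_truth": "16"},
]


def summarize(rows):
    """Return accuracy for each prompt type and their agreement rate."""
    n = len(rows)
    if n == 0:
        raise ValueError("no rows to summarize")
    return {
        "original_accuracy": sum(
            r["original_answer"] == r["ground_truth"] for r in rows) / n,
        "compressed_accuracy": sum(
            r["compressed_answer"] == r["ground_truth"] for r in rows) / n,
        # Consistency: both prompts gave the same answer, right or wrong.
        "consistency": sum(
            r["original_answer"] == r["compressed_answer"] for r in rows) / n,
    }


summary = summarize(rows)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With a real results file, `summarize(df.to_dict("records"))` passes the DataFrame rows in the same list-of-dicts shape.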

## Memory Optimization Tips

### For Colab Free Tier:
1. **Use 4-bit quantization** (already enabled)
2. **Keep sample size small** (3-5 samples recommended)
3. **Monitor memory usage** (automatic)
4. **Clear memory regularly** (automatic)

### If you run out of memory:
- Reduce `NUM_SAMPLES_TO_RUN` in `config.py`
- Switch to a smaller model such as `microsoft/phi-2` (2.7B parameters)
- Restart the runtime and try again

### For faster execution:
- The system picks the best available device automatically, preferring GPU over CPU
- Memory is cleared aggressively between operations
- Progress bars show real-time status
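
If you want to trigger cleanup yourself between cells, a helper along these lines works in a notebook. It is a minimal sketch and not the project's own monitoring code: the function name `cleanup_and_report` is hypothetical, and the PyTorch import is guarded so the helper still runs on a machine without PyTorch or a GPU.

```python
import gc

try:
    import torch  # optional: the helper still works without PyTorch
except ImportError:
    torch = None


def cleanup_and_report(label: str = "") -> dict:
    """Collect garbage, release cached GPU blocks, and report usage in GiB."""
    gc.collect()  # drop unreachable Python objects first
    stats = {"label": label, "allocated_gib": 0.0, "reserved_gib": 0.0}
    if torch is not None and torch.cuda.is_available():
        # Return cached, unused blocks to the driver. Tensors that are
        # still referenced stay allocated.
        torch.cuda.empty_cache()
        stats["allocated_gib"] = torch.cuda.memory_allocated() / 1024**3
        stats["reserved_gib"] = torch.cuda.memory_reserved() / 1024**3
    print(f"[{label}] allocated={stats['allocated_gib']:.2f} GiB, "
          f"reserved={stats['reserved_gib']:.2f} GiB")
    return stats


stats = cleanup_and_report("after benchmark")
```

Note that `empty_cache()` cannot free a model that is still referenced by a variable; delete the reference (for example with `del model`) before calling the helper.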

## Troubleshooting

### Common Issues:

**CUDA out of memory**
```python
# In config.py, try these settings:
HUGGINGFACE_QUANTIZATION = "4bit"
NUM_SAMPLES_TO_RUN = 2
```

**Model download slow**
- Models are cached after first download
- Use smaller models for testing

**Dataset loading fails**
- The system falls back to mock data automatically
- The benchmark can therefore still run when the external dataset is unavailable
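
The fallback pattern can be sketched as follows. This illustrates the idea rather than the project's loader: `load_samples`, `MOCK_SAMPLES`, and the question/answer record shape are all assumptions made for the example.

```python
import logging

logger = logging.getLogger("pcm_llm_example")

# Hypothetical mock records used when the real dataset cannot be loaded.
MOCK_SAMPLES = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is 10 - 3?", "answer": "7"},
]


def load_samples(loader, num_samples):
    """Call `loader()`; on any failure, fall back to the mock samples."""
    try:
        samples = list(loader())
    except Exception as exc:
        logger.warning("Dataset load failed (%s); using mock data", exc)
        samples = MOCK_SAMPLES
    return samples[:num_samples]


def failing_loader():
    # Simulates a download that fails, e.g. because there is no network.
    raise ConnectionError("no network")


samples = load_samples(failing_loader, num_samples=1)
print(samples)
```

Logging the failure matters here: results produced from mock data should not be mistaken for results on the real dataset.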

---

## Next Steps

1. **Experiment with different models** in `config.py`
2. **Try different compression methods**
3. **Analyze results** in the CSV files
4. **Customize evaluation metrics**

Happy benchmarking! 🎯