# LLM Judge Disagreement Maximizer
## This notebook aims to generate essays that maximize disagreement between LLM judges while maintaining quality and coherence. Using GPU acceleration for faster processing.

## The goal is to:
- Maximize disagreement between 3 LLM judges (horizontal variance)
- Maintain high vertical variance across essays
- Keep high English language quality (avg_e)
- Avoid repetition (avg_s)
- Manage average quality scores (avg_q)

Final score = (avg_h × min_v × avg_e) / (avg_s × (9 - avg_q))

### We select "GPU T4 x2" as it offers good performance for transformer models and is well-suited for text generation tasks.

## Library Imports & GPU Setup
- Imports essential libraries for ML/DL tasks
- Sets up device detection for GPU/CPU processing

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch
import pandas as pd
import numpy as np
import re
import random
import time
from tqdm import tqdm
import logging
import json
from typing import Dict, List, Optional, Tuple

# Setup GPU device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")



Using device: cuda


## Logging Configuration
- Sets up structured logging for tracking the generation process
- Helps in debugging and monitoring model performance

In [2]:
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

## Model Configuration
- Defines model parameters and generation settings
- Optimized for generating diverse, high-quality essays
- Includes settings to handle tokenization and text generation

In [3]:
# Initial Model Configuration
MODEL_CONFIG = {
    'name': 'phi-2',
    'path': "microsoft/phi-2",
    'params': {
        'max_new_tokens': 150,     # Control essay length
        'temperature': 1.2,        # High for diversity
        'top_p': 0.98,            # High for varied sampling
        'top_k': 100,             # More token options
        'repetition_penalty': 1.5, # Avoid repetition
        'do_sample': True,        # Enable sampling
        'no_repeat_ngram_size': 3  # Additional repetition avoidance
    },
    'tokenizer_config': {
        'padding': True,
        'truncation': True,
        'max_length': 512
    }
}

## Style Configurations
- Defines different writing styles to create diverse essays that may trigger varying judgments from LLM evaluators

In [4]:
# Style Configurations

STYLE_CONFIGS = {
    'contrarian': {
        'tone': 'deliberately challenging',
        'structure': 'unconventional logic',
        'approach': 'subvert expectations'
    },
    'academic': {
        'tone': 'heavily theoretical',
        'structure': 'complex analysis',
        'approach': 'scholarly abstraction'
    },
    'provocative': {
        'tone': 'emotionally charged',
        'structure': 'rhetorical challenge',
        'approach': 'perspective shifting'
    },
    'analytical': {
        'tone': 'objective and methodical',
        'structure': 'systematic analysis',
        'approach': 'data-driven examination'
    },
    'critical': {
        'tone': 'evaluative and challenging',
        'structure': 'point-counterpoint',
        'approach': 'deep examination of assumptions'
    },
    'exploratory': {
        'tone': 'inquiring and open-ended',
        'structure': 'progressive discovery',
        'approach': 'multiple perspective analysis'
    },
    'controversial': {
        'tone': 'provocative yet logical',
        'structure': 'thesis-antithesis',
        'approach': 'challenging established views'
    }
}

## Model Manager Class
- Handles model initialization and GPU memory management for efficient processing

In [5]:
class ModelManager:
    """Manages model initialization and GPU resources"""
    
    @staticmethod
    def cleanup_models():
        """Clean GPU memory"""
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            logger.info("GPU memory cleared")
        for obj in ['model', 'pipe', 'tokenizer']:
            if obj in globals():
                del globals()[obj]
    
    @staticmethod
    def verify_model(pipe) -> bool:
        try:
            # Test generation
            test_output = pipe("Test prompt", max_new_tokens=10)
            return bool(test_output)
        except Exception as e:
            logger.error(f"Model verification failed: {str(e)}")
            return False
            
    @staticmethod
    def initialize_model() -> pipeline:
        """Initialize model with GPU optimization"""
        ModelManager.cleanup_models()
        
        try:
            logger.info(f"Loading {MODEL_CONFIG['name']}")
            
            # Initialize tokenizer
            tokenizer = AutoTokenizer.from_pretrained(
            MODEL_CONFIG['path'],
            trust_remote_code=True,
            padding_side='left', 
            **MODEL_CONFIG['tokenizer_config']
            )
                    
           
            
            # Handle padding token
            if tokenizer.pad_token is None:
                tokenizer.pad_token = tokenizer.eos_token
            
            # Load model
            model = AutoModelForCausalLM.from_pretrained(
                MODEL_CONFIG['path'],
                device_map="auto",
                torch_dtype=torch.bfloat16,
                trust_remote_code=True,
                low_cpu_mem_usage=True
            )
            
            # Create pipeline
            pipe = pipeline(
                "text-generation",
                model=model,
                tokenizer=tokenizer,
                **MODEL_CONFIG['params']  # Use generation params
            )
            
            # Verify model initialization
            if ModelManager.verify_model(pipe):
                logger.info("Model initialized and verified successfully")
                return pipe
            else:
                raise RuntimeError("Model verification failed")
                
        except Exception as e:
            logger.error(f"Error initializing model: {str(e)}")
            raise

## Prompt Engineering Class
- Specialized class for generating prompts that encourage diverse essay generation

In [6]:
# Prompt Engineering
class PromptEngineering:
    @staticmethod
    def get_prompt(topic: str, style_name: str) -> str:
        style_config = STYLE_CONFIGS[style_name]
        perspectives = [
            "Consider an unconventional viewpoint where",
            "Challenge the assumption that",
            "Propose a radical reframing where",
            "Analyze through the lens of an extreme position where"
        ]
        
        base_prompt = f"""Write a focused 100-word essay that will generate diverse opinions.
        Topic: "{topic}"
        {random.choice(perspectives)}
        
        Requirements:
        - Use {style_config['tone']} tone
        - Follow {style_config['structure']} format
        - Apply {style_config['approach']}
        - Present controversial yet logical arguments
        - Maintain grammatical correctness
        
        Essay:"""
        return base_prompt.strip()

## Quality Control Class
- Advanced quality metrics and validation to ensure generated essays meet requirements while maximizing potential judge disagreement

In [7]:
# Quality Control
class QualityControl:
    """Comprehensive quality control for generated essays"""

    THRESHOLDS = {
        'min_words': 90,
        'max_words': 110,
        'min_unique_ratio': 0.7,
        'min_sentences': 4,
        'max_avg_sentence_length': 30
    }
    
    @staticmethod
    def calculate_metrics(essay: str) -> Dict:
        """Calculate detailed quality metrics"""
        if not essay:
            return None
            
        # Basic metrics
        words = essay.split()
        unique_words = len(set(words))
        word_count = len(words)
        sentences = re.split('[.!?]', essay.strip())
        sentences = [s.strip() for s in sentences if s.strip()]
        
        # Advanced metrics
        avg_word_length = sum(len(w) for w in words) / word_count if word_count > 0 else 0
        sentence_lengths = [len(s.split()) for s in sentences]
        sentence_length_variance = np.var(sentence_lengths) if sentence_lengths else 0
        
        return {
            'word_count': word_count,
            'unique_word_ratio': unique_words / word_count if word_count > 0 else 0,
            'avg_word_length': avg_word_length,
            'sentence_count': len(sentences),
            'avg_sentence_length': word_count / len(sentences) if sentences else 0,
            'sentence_length_variance': sentence_length_variance,
            'complexity_score': avg_word_length * (unique_words / word_count if word_count > 0 else 0)
        }
    
    @staticmethod
    def validate_essay(essay: str, metrics: Optional[Dict] = None) -> Tuple[bool, str]:
        """Validate essay with detailed feedback"""
        if not essay:
            return False, "Empty essay"
            
        if not metrics:
            metrics = QualityControl.calculate_metrics(essay)
        
        # Comprehensive quality thresholds
        thresholds = {
            'min_words': 80,
            'max_words': 120,
            'min_unique_ratio': 0.6,
            'min_sentences': 3,
            'max_avg_sentence_length': 25,
            'min_complexity_score': 3.0
        }
        
        # Validation checks
        checks = [
            (thresholds['min_words'] <= metrics['word_count'] <= thresholds['max_words'],
             f"Word count {metrics['word_count']} outside range {thresholds['min_words']}-{thresholds['max_words']}"),
            (metrics['unique_word_ratio'] >= thresholds['min_unique_ratio'],
             f"Insufficient word variety: {metrics['unique_word_ratio']:.2f}"),
            (metrics['sentence_count'] >= thresholds['min_sentences'],
             f"Too few sentences: {metrics['sentence_count']}"),
            (metrics['avg_sentence_length'] <= thresholds['max_avg_sentence_length'],
             f"Sentences too long: {metrics['avg_sentence_length']:.1f}"),
            (metrics['complexity_score'] >= thresholds['min_complexity_score'],
             f"Complexity score too low: {metrics['complexity_score']:.1f}")
        ]
        
        # Check all conditions
        failed_checks = [msg for passed, msg in checks if not passed]
        return not bool(failed_checks), '; '.join(failed_checks)

## Essay Generator Class
- GPU-optimized essay generation with batch processing capabilities and error handling

In [8]:
class EssayGenerator:
    def __init__(self):
        self.model_pipe = ModelManager.initialize_model()
        self.generation_count = 0
    
    
    def process_essay(self, generated_text: str) -> str:
        """Process and clean generated essay text"""
        try:
            # Find where the essay starts (after 'Essay:')
            essay_start = generated_text.find('Essay:')
            if essay_start == -1:
                return None
            
            essay = generated_text[essay_start + 6:].strip()
            
            # Basic cleaning
            essay = re.sub(r'\s+', ' ', essay)  # Remove extra whitespace
            essay = essay.replace('"', '')      # Remove quotes
            
            # Validate essay
            metrics = QualityControl.calculate_metrics(essay)
            is_valid, feedback = QualityControl.validate_essay(essay, metrics)
            
            if is_valid:
                return essay
            else:
                logger.warning(f"Essay validation failed: {feedback}")
                return None
                
        except Exception as e:
            logger.error(f"Error processing essay: {str(e)}")
            return None
    
        
    @torch.no_grad()
    def generate_essays_batch(self, prompts: List[str], batch_size: int) -> List[str]:
        """Generate multiple essays efficiently"""
        try:
            outputs = self.model_pipe(
                prompts,
                batch_size=batch_size,
                **MODEL_CONFIG['params']
            )
            return outputs
        except Exception as e:
            logger.error(f"Batch generation error: {str(e)}")
            return [None] * len(prompts)

# Submission

In [12]:
def create_submission(processor: BatchProcessor):
    """Create and save the submission CSV file"""
    try:
        # Convert results to DataFrame
        submission_df = pd.DataFrame(processor.results)
        
        # Sort by ID to ensure correct order
        submission_df = submission_df.sort_values('id')
        
        # Save to CSV
        submission_path = 'submission.csv'
        submission_df.to_csv(submission_path, index=False)
        
        logger.info(f"Submission saved to {submission_path}")
        logger.info(f"Total essays: {len(submission_df)}")
        
        # Basic validation
        if len(submission_df) == 0:
            raise ValueError("No essays generated!")
            
        # Check for missing essays
        missing_essays = submission_df['essay'].isna().sum()
        if missing_essays > 0:
            logger.warning(f"Warning: {missing_essays} missing essays")
            
    except Exception as e:
        logger.error(f"Error creating submission: {str(e)}")
        raise

def log_progress(processor: BatchProcessor, start_time: float):
    """Log generation progress and statistics"""
    elapsed = time.time() - start_time
    rate = processor.stats['total_processed'] / elapsed if elapsed > 0 else 0
    
    logger.info(f"""
    Progress Update:
    - Processed: {processor.stats['total_processed']}
    - Successful: {processor.stats['successful_generations']}
    - Failed: {processor.stats['failed_generations']}
    - Rate: {rate:.2f} essays/second
    - Time elapsed: {elapsed:.1f}s
    """)
    
    # Log style usage
    logger.info("Style usage:")
    for style, count in processor.stats['style_usage'].items():
        logger.info(f"- {style}: {count}")

## Main Execution Logic
- Implements batch processing, GPU optimization, and comprehensive monitoring of the essay generation process

In [13]:
class BatchProcessor:
    """Handles batch processing and GPU optimization"""
    
    def __init__(self, batch_size: int = 5):
        self.batch_size = batch_size
        self.generator = EssayGenerator()
        self.results = []
        self.stats = {
            'total_processed': 0,
            'successful_generations': 0,
            'failed_generations': 0,
            'style_usage': {}
        }
    
    def prepare_dataset(self, topics: List[Tuple[int, str]]):
        """Prepare dataset for efficient GPU processing"""
        prompts = []
        ids = []
        for topic_id, topic in topics:
            # Try different styles
            for style in STYLE_CONFIGS.keys():
                prompt = PromptEngineering.get_prompt(topic, style)
                prompts.append(prompt)
                ids.append(topic_id)
        return prompts, ids
    
    def process_batch(self, topics: List[Tuple[int, str]]) -> None:
        """Process a batch of topics efficiently"""
        if self.stats['total_processed'] % 20 == 0:
            torch.cuda.empty_cache()
        
        # Prepare batch dataset
        prompts, ids = self.prepare_dataset(topics)
        
        try:
            # Generate all prompts at once
            outputs = self.generator.model_pipe(
                prompts,
                batch_size=self.batch_size,
                **MODEL_CONFIG['params']
            )
            
            # Process outputs
            for idx, output in enumerate(outputs):
                essay = self.generator.process_essay(output[0]['generated_text'])
                if essay:
                    style = list(STYLE_CONFIGS.keys())[idx % len(STYLE_CONFIGS)]
                    self.stats['style_usage'][style] = self.stats['style_usage'].get(style, 0) + 1
                    self.stats['successful_generations'] += 1
                    
                    self.results.append({
                        'id': ids[idx // len(STYLE_CONFIGS)],
                        'essay': essay
                    })
                    break
                
            if not essay:
                self.stats['failed_generations'] += 1
                self.results.append({
                    'id': ids[idx // len(STYLE_CONFIGS)],
                    'essay': "Error generating essay"
                })
            
            self.stats['total_processed'] += 1
            
        except Exception as e:
            logger.error(f"Batch processing error: {str(e)}")

In [14]:
def validate_config():
    """Validate MODEL_CONFIG settings"""
    required_keys = ['name', 'path', 'params', 'tokenizer_config']
    required_params = ['max_new_tokens', 'temperature', 'top_p', 'top_k', 'repetition_penalty']
    
    try:
        # Check main keys
        for key in required_keys:
            if key not in MODEL_CONFIG:
                raise ValueError(f"Missing required key in MODEL_CONFIG: {key}")
        
        # Check params
        for param in required_params:
            if param not in MODEL_CONFIG['params']:
                raise ValueError(f"Missing required parameter in MODEL_CONFIG['params']: {param}")
        
        # Validate values
        if MODEL_CONFIG['params']['max_new_tokens'] > 512:
            logger.warning("max_new_tokens > 512 might cause issues with context length")
        
        if MODEL_CONFIG['params']['temperature'] > 1.5:
            logger.warning("Very high temperature might lead to incoherent outputs")
            
        logger.info("MODEL_CONFIG validation passed")
        return True
        
    except Exception as e:
        logger.error(f"MODEL_CONFIG validation failed: {str(e)}")
        return False

In [15]:
def main():
    try:
        start_time = time.time()
        logger.info("Starting essay generation process...")
        
        # Load data
        test_data = pd.read_csv('/kaggle/input/llms-you-cant-please-them-all/test.csv')
        total_topics = len(test_data)
        logger.info(f"Loaded {total_topics} topics")
        
        # Initialize processor with optimal batch size
        batch_size = 5  # Optimized for T4 GPU
        processor = BatchProcessor(batch_size)
        
        # Create batches
        topic_batches = [
            test_data.iloc[i:i+batch_size][['id', 'topic']].values.tolist()
            for i in range(0, len(test_data), batch_size)
        ]
        
        # Process batches
        with tqdm(total=len(topic_batches), desc="Processing batches") as pbar:
            for batch in topic_batches:
                processor.process_batch(batch)
                pbar.update(1)
                
                # Log periodically
                if processor.stats['total_processed'] % 20 == 0:
                    log_progress(processor, start_time)
        
        # Create submission
        create_submission(processor)
        
    except Exception as e:
        logger.error(f"Error in main execution: {str(e)}")
        raise

In [16]:
#  analysis of results
def analyze_submission():
    """Analyze the generated submission file"""
    try:
        # Read the generated submission file
        submission_df = pd.read_csv('submission.csv')
        logger.info(f"Total essays generated: {len(submission_df)}")
        
        # Sample a few essays
        print("\nSample Essays:")
        for i in range(min(3, len(submission_df))):
            print(f"\nEssay {i+1}:")
            print(f"ID: {submission_df.iloc[i]['id']}")
            print(f"Essay:\n{submission_df.iloc[i]['essay']}\n")
            print("-"*80)
        
        # Basic statistics
        error_count = submission_df['essay'].str.contains('Error').sum()
        success_rate = (len(submission_df) - error_count) / len(submission_df) * 100
        
        print(f"\nSubmission Statistics:")
        print(f"Total Essays: {len(submission_df)}")
        print(f"Successful Generations: {len(submission_df) - error_count}")
        print(f"Failed Generations: {error_count}")
        print(f"Success Rate: {success_rate:.2f}%")
        
        # Save some samples to inspect
        sample_df = submission_df.sample(min(5, len(submission_df)))
        sample_df.to_csv('submission_samples.csv', index=False)
        print("\nSaved sample essays to 'submission_samples.csv'")
        
        return submission_df
        
    except FileNotFoundError:
        logger.error("submission.csv file not found")
    except Exception as e:
        logger.error(f"Error analyzing submission: {str(e)}")


## GPU Optimization Tips
- Batch size of 5 is optimized for T4 GPU memory
- Using torch.no_grad() for inference
- Regular GPU memory cleanup
- Monitoring GPU memory usage

In [17]:
@staticmethod
def cleanup_gpu_memory():
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        memory_allocated = torch.cuda.memory_allocated(0)/1e9
        logger.info(f"GPU memory after cleanup: {memory_allocated:.2f}GB")

In [30]:
if __name__ == "__main__":
    # Validate configuration
    if not validate_config():
        raise ValueError("Invalid MODEL_CONFIG")
        
    # Set GPU optimization flags
    torch.backends.cudnn.benchmark = True
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        logger.info(f"GPU available: {torch.cuda.get_device_name(0)}")
        logger.info(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory/1e9:.2f}GB")
    
    main()
    submission_df = analyze_submission()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0
Processing batches: 100%|██████████| 1/1 [01:48<00:00, 108.50s/it]


Sample Essays:

Essay 1:
ID: 1097671
Essay:
[Your focus on both positive... and negative side plan goes here....] **Solution 2 with comments - Topic, Writing Format & Organization Structure Part II/ III* 10 points each***(50 extra point)(20 additional bonus) 200 word minimum 1500 words = 1756point (for full sample see http://www.) If your assessment is approved to be an assignment or portfolio refer only this specific problem along within 700 maximum 5 000 character. Example Solution can look like : <code><section style=fontsize=150%;><b>Idea 1:{0}\nI believe {1}</p></i>. Repeat it thrice.<br/> Idea number two was…………………………………….. Ideal writing formats are

--------------------------------------------------------------------------------

Submission Statistics:
Total Essays: 1
Successful Generations: 1
Failed Generations: 0
Success Rate: 100.00%

Saved sample essays to 'submission_samples.csv'



