<a href="https://colab.research.google.com/github/MariaAise/deep_learning_w/blob/main/bitsandbites.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
!pip install transformers bitsandbytes accelerate torch scipy numpy pandas matplotlib --break-system-packages

Collecting bitsandbytes
  Downloading bitsandbytes-0.49.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Downloading bitsandbytes-0.49.2-py3-none-manylinux_2_24_x86_64.whl (60.7 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.49.2


In [1]:
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig
)
from scipy.stats import entropy
from typing import Dict, List, Tuple
import json
import time
from pathlib import Path
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

In [2]:
# ============================================================================
# DOMAIN TEXTS - Representative samples from 4 different domains
# ============================================================================

DOMAIN_TEXTS = {
    "medical": """
        The pathophysiology of acute myocardial infarction involves coronary artery
        thrombosis leading to myocardial necrosis. Electrocardiographic findings typically
        show ST-segment elevation in leads corresponding to the affected myocardial territory.
        Biomarkers including troponin I and creatine kinase-MB demonstrate elevated levels
        within 3-6 hours post-infarction. Immediate therapeutic interventions include
        antiplatelet therapy with aspirin and P2Y12 inhibitors, anticoagulation, and
        reperfusion strategies such as percutaneous coronary intervention or fibrinolytic
        therapy. The prognosis depends on infarct size, left ventricular ejection fraction,
        and the presence of complications including cardiogenic shock or ventricular arrhythmias.
    """,

    "legal": """
        Pursuant to the provisions of Section 1983 of Title 42 of the United States Code,
        the plaintiff alleges that the defendant, acting under color of state law, deprived
        the plaintiff of rights secured by the Constitution and laws of the United States.
        The court grants plaintiff's motion for summary judgment on the grounds that no
        genuine dispute of material fact exists regarding the defendant's qualified immunity
        defense. The Fourth Amendment jurisprudence establishes that warrantless searches
        are per se unreasonable, subject to specifically established exceptions. The doctrine
        of stare decisis compels adherence to precedential holdings absent compelling
        justification for departure therefrom.
    """,

    "technical": """
        The microservice architecture implements a distributed system using containerized
        Docker images orchestrated via Kubernetes. The API gateway handles request routing
        with nginx load balancing across multiple backend instances. Database sharding
        distributes PostgreSQL data horizontally using consistent hashing algorithms.
        The system employs event-driven communication through Apache Kafka message queues,
        ensuring eventual consistency via the SAGA pattern. Circuit breakers prevent cascade
        failures, while distributed tracing with Jaeger enables performance monitoring.
        The CI/CD pipeline automates deployment through Jenkins with blue-green deployment
        strategies minimizing downtime.
    """,

    "casual": """
        Hey everyone! Just wanted to share my experience at the new coffee shop downtown.
        The atmosphere is really cozy and perfect for getting some work done or just
        hanging out with friends. I tried their signature latte and it was honestly
        amazing - probably the best I've had in the city! The baristas are super friendly
        and they have a great selection of pastries too. The wifi is fast and there are
        plenty of power outlets. Definitely recommend checking it out if you're in the
        area. They're open until 8pm on weekdays and 10pm on weekends. Can't wait to go
        back this weekend!
    """
}


# ============================================================================
# MODEL CONFIGURATIONS
# ============================================================================

MODELS = {
    "gpt2-small": "gpt2",
    "gpt2-medium": "gpt2-medium",
    "opt-350m": "facebook/opt-350m",
    "pythia-410m": "EleutherAI/pythia-410m"
}


In [3]:


# ============================================================================
# DOMAIN SHIFT ANALYZER CLASS
# ============================================================================

class DomainShiftAnalyzer:
    """Analyzes domain shift across models and text domains."""

    def __init__(self, use_4bit: bool = True, device: str = None):
        """
        Initialize the analyzer.

        Args:
            use_4bit: Use 4-bit quantization (True) or 8-bit (False)
            device: Device to use ('cuda', 'cpu', or None for auto-detect)
        """
        self.use_4bit = use_4bit
        self.device = device or ('cuda' if torch.cuda.is_available() else 'cpu')
        self.results = []
        self.models_cache = {}
        self.tokenizers_cache = {}

        print(f"Initializing DomainShiftAnalyzer")
        print(f"Device: {self.device}")
        print(f"Quantization: {'4-bit' if use_4bit else '8-bit'}")
        print(f"CUDA available: {torch.cuda.is_available()}")
        if torch.cuda.is_available():
            print(f"GPU: {torch.cuda.get_device_name(0)}")
        print("-" * 80)

    def get_quantization_config(self) -> BitsAndBytesConfig:
        """Get bitsandbytes quantization configuration."""
        if self.use_4bit:
            return BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=torch.float16,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_quant_type="nf4"
            )
        else:
            return BitsAndBytesConfig(
                load_in_8bit=True,
                llm_int8_threshold=6.0
            )

    def load_model(self, model_name: str) -> Tuple[AutoModelForCausalLM, AutoTokenizer]:
        """
        Load model and tokenizer with quantization.

        Args:
            model_name: Model identifier from MODELS dict

        Returns:
            Tuple of (model, tokenizer)
        """
        if model_name in self.models_cache:
            return self.models_cache[model_name], self.tokenizers_cache[model_name]

        model_id = MODELS[model_name]
        print(f"\nLoading {model_name} ({model_id})...")

        # Load tokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token

        # Load model with quantization
        if self.device == 'cuda':
            quantization_config = self.get_quantization_config()
            model = AutoModelForCausalLM.from_pretrained(
                model_id,
                quantization_config=quantization_config,
                device_map="auto",
                trust_remote_code=True
            )
        else:
            # CPU fallback - no quantization
            print("  Warning: Running on CPU without quantization (slower)")
            model = AutoModelForCausalLM.from_pretrained(
                model_id,
                trust_remote_code=True
            )
            model = model.to(self.device)

        model.eval()

        # Cache for reuse
        self.models_cache[model_name] = model
        self.tokenizers_cache[model_name] = tokenizer

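        # Print model info. With 4-bit loading, quantized weights are stored
        # packed (two values per byte), so numel() undercounts the true
        # parameter count (e.g. 81,972,480 reported for the 124M-parameter gpt2).
        param_count = sum(p.numel() for p in model.parameters())
        print(f"  Parameters: {param_count:,}")
        if self.device == 'cuda':
            # Total memory currently allocated on the GPU, not just this model.
            memory_mb = torch.cuda.memory_allocated() / 1024**2
            print(f"  GPU Memory: {memory_mb:.2f} MB")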

        return model, tokenizer

    def compute_perplexity(
        self,
        model: AutoModelForCausalLM,
        tokenizer: AutoTokenizer,
        text: str,
        max_length: int = 512
    ) -> float:
        """
        Compute perplexity of text under the model.

        Args:
            model: Language model
            tokenizer: Tokenizer
            text: Input text
            max_length: Maximum sequence length

        Returns:
            Perplexity value
        """
        # Tokenize
        encodings = tokenizer(
            text,
            return_tensors='pt',
            truncation=True,
            max_length=max_length
        )

        input_ids = encodings.input_ids.to(self.device)

        # Compute loss
        with torch.no_grad():
            outputs = model(input_ids, labels=input_ids)
            loss = outputs.loss.item()

        # Perplexity = exp(loss)
        perplexity = np.exp(loss)

        return perplexity

    def compute_token_entropy(
        self,
        model: AutoModelForCausalLM,
        tokenizer: AutoTokenizer,
        text: str,
        max_length: int = 512
    ) -> float:
        """
        Compute average token-level entropy (uncertainty).

        Higher entropy = more uncertain predictions = potential domain shift

        Args:
            model: Language model
            tokenizer: Tokenizer
            text: Input text
            max_length: Maximum sequence length

        Returns:
            Average entropy across tokens
        """
        encodings = tokenizer(
            text,
            return_tensors='pt',
            truncation=True,
            max_length=max_length
        )

        input_ids = encodings.input_ids.to(self.device)

        with torch.no_grad():
            outputs = model(input_ids)
            logits = outputs.logits

        # Compute softmax probabilities
        probs = torch.softmax(logits, dim=-1)

        # Compute entropy for each token position
        entropies = []
        for i in range(probs.shape[1]):
            prob_dist = probs[0, i, :].cpu().numpy()
            ent = entropy(prob_dist)
            entropies.append(ent)

        return np.mean(entropies)

    def compute_confidence(
        self,
        model: AutoModelForCausalLM,
        tokenizer: AutoTokenizer,
        text: str,
        max_length: int = 512
    ) -> float:
        """
        Compute average confidence (max probability per token).

        Lower confidence = potential domain shift

        Args:
            model: Language model
            tokenizer: Tokenizer
            text: Input text
            max_length: Maximum sequence length

        Returns:
            Average max probability across tokens
        """
        encodings = tokenizer(
            text,
            return_tensors='pt',
            truncation=True,
            max_length=max_length
        )

        input_ids = encodings.input_ids.to(self.device)

        with torch.no_grad():
            outputs = model(input_ids)
            logits = outputs.logits

        # Compute softmax probabilities
        probs = torch.softmax(logits, dim=-1)

        # Get max probability for each position
        max_probs = torch.max(probs, dim=-1)[0]

        return max_probs.mean().item()

    def analyze_single(
        self,
        model_name: str,
        domain: str,
        text: str
    ) -> Dict:
        """
        Analyze a single model-domain combination.

        Args:
            model_name: Model identifier
            domain: Domain name
            text: Domain text

        Returns:
            Dictionary of metrics
        """
        print(f"\n  Analyzing {model_name} on {domain} domain...")

        # Load model
        model, tokenizer = self.load_model(model_name)

        # Compute metrics
        start_time = time.time()

        perplexity = self.compute_perplexity(model, tokenizer, text)
        token_entropy = self.compute_token_entropy(model, tokenizer, text)
        confidence = self.compute_confidence(model, tokenizer, text)

        inference_time = time.time() - start_time

        result = {
            'model': model_name,
            'domain': domain,
            'perplexity': perplexity,
            'token_entropy': token_entropy,
            'confidence': confidence,
            'inference_time_s': inference_time,
            'timestamp': datetime.now().isoformat()
        }

        print(f"    Perplexity: {perplexity:.2f}")
        print(f"    Token Entropy: {token_entropy:.4f}")
        print(f"    Confidence: {confidence:.4f}")
        print(f"    Time: {inference_time:.2f}s")

        return result

    def analyze_all(self) -> pd.DataFrame:
        """
        Run complete analysis across all models and domains.

        Returns:
            DataFrame with all results
        """
        print("\n" + "=" * 80)
        print("DOMAIN SHIFT ANALYSIS")
        print("=" * 80)

        results = []

        for model_name in MODELS.keys():
            print(f"\n{'='*80}")
            print(f"MODEL: {model_name}")
            print(f"{'='*80}")

            for domain, text in DOMAIN_TEXTS.items():
                result = self.analyze_single(model_name, domain, text)
                results.append(result)

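            # Drop the cached model before clearing the CUDA cache; otherwise
            # the cache keeps every model alive and GPU memory only grows.
            self.models_cache.pop(model_name, None)
            self.tokenizers_cache.pop(model_name, None)
            if self.device == 'cuda':
                torch.cuda.empty_cache()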

        self.results = results
        df = pd.DataFrame(results)

        return df

    def compute_domain_shift_metrics(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Compute domain shift metrics for each model.

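        Metrics:
        - Perplexity coefficient of variation across domains (high = domain sensitive)
        - Entropy range (max minus min token entropy across domains)
        - Confidence degradation from best to worst domain
        - Composite score: unweighted sum of the three (lower = more robust)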

        Args:
            df: Results DataFrame

        Returns:
            Domain shift summary DataFrame
        """
        shift_metrics = []

        for model in df['model'].unique():
            model_data = df[df['model'] == model]

            # Perplexity metrics
            perp_mean = model_data['perplexity'].mean()
            perp_std = model_data['perplexity'].std()
            perp_cv = perp_std / perp_mean if perp_mean > 0 else 0  # Coefficient of variation

            # Entropy metrics
            entropy_mean = model_data['token_entropy'].mean()
            entropy_range = model_data['token_entropy'].max() - model_data['token_entropy'].min()

            # Confidence metrics
            conf_mean = model_data['confidence'].mean()
            conf_degradation = model_data['confidence'].max() - model_data['confidence'].min()

            # Best and worst domains
            best_domain = model_data.loc[model_data['perplexity'].idxmin(), 'domain']
            worst_domain = model_data.loc[model_data['perplexity'].idxmax(), 'domain']

            shift_metrics.append({
                'model': model,
                'perplexity_mean': perp_mean,
                'perplexity_std': perp_std,
                'perplexity_cv': perp_cv,  # Higher = more domain shift
                'entropy_mean': entropy_mean,
                'entropy_range': entropy_range,  # Higher = more domain shift
                'confidence_mean': conf_mean,
                'confidence_degradation': conf_degradation,  # Higher = more domain shift
                'best_domain': best_domain,
                'worst_domain': worst_domain,
                'domain_shift_score': perp_cv + entropy_range + conf_degradation  # Composite score
            })

        shift_df = pd.DataFrame(shift_metrics)
        shift_df = shift_df.sort_values('domain_shift_score', ascending=True)

        return shift_df

    def plot_results(self, df: pd.DataFrame, output_dir: Path):
        """
        Create visualization plots.

        Args:
            df: Results DataFrame
            output_dir: Directory to save plots
        """
        output_dir = Path(output_dir)
        output_dir.mkdir(exist_ok=True, parents=True)

        # 1. Perplexity heatmap
        fig, ax = plt.subplots(figsize=(10, 6))
        pivot = df.pivot(index='model', columns='domain', values='perplexity')
        im = ax.imshow(pivot.values, cmap='YlOrRd', aspect='auto')

        ax.set_xticks(np.arange(len(pivot.columns)))
        ax.set_yticks(np.arange(len(pivot.index)))
        ax.set_xticklabels(pivot.columns)
        ax.set_yticklabels(pivot.index)

        # Add values
        for i in range(len(pivot.index)):
            for j in range(len(pivot.columns)):
                ax.text(j, i, f'{pivot.values[i, j]:.1f}',
                        ha="center", va="center", color="black", fontsize=9)

        ax.set_title('Perplexity by Model and Domain\n(Lower is Better)', fontsize=14, pad=20)
        ax.set_xlabel('Domain', fontsize=12)
        ax.set_ylabel('Model', fontsize=12)
        plt.colorbar(im, ax=ax, label='Perplexity')
        plt.tight_layout()
        plt.savefig(output_dir / 'perplexity_heatmap.png', dpi=300, bbox_inches='tight')
        plt.close()

        # 2. Confidence comparison
        fig, ax = plt.subplots(figsize=(12, 6))
        pivot = df.pivot(index='model', columns='domain', values='confidence')

        x = np.arange(len(pivot.index))
        width = 0.8 / len(pivot.columns)  # Bars for all domains fill 80% of each group

        for i, domain in enumerate(pivot.columns):
            offset = width * (i - len(pivot.columns)/2 + 0.5)
            ax.bar(x + offset, pivot[domain], width, label=domain)

        ax.set_xlabel('Model', fontsize=12)
        ax.set_ylabel('Confidence (Avg Max Prob)', fontsize=12)
        ax.set_title('Model Confidence Across Domains\n(Higher is Better)', fontsize=14, pad=20)
        ax.set_xticks(x)
        ax.set_xticklabels(pivot.index, rotation=15, ha='right')
        ax.legend(title='Domain')
        ax.grid(axis='y', alpha=0.3)
        plt.tight_layout()
        plt.savefig(output_dir / 'confidence_comparison.png', dpi=300, bbox_inches='tight')
        plt.close()

        # 3. Token entropy comparison
        fig, ax = plt.subplots(figsize=(10, 6))

        for model in df['model'].unique():
            model_data = df[df['model'] == model]
            domains = model_data['domain'].values
            entropies = model_data['token_entropy'].values
            ax.plot(domains, entropies, marker='o', label=model, linewidth=2)

        ax.set_xlabel('Domain', fontsize=12)
        ax.set_ylabel('Token Entropy', fontsize=12)
        ax.set_title('Token Entropy Across Domains\n(Lower = More Certain)', fontsize=14, pad=20)
        ax.legend()
        ax.grid(alpha=0.3)
        plt.xticks(rotation=15, ha='right')
        plt.tight_layout()
        plt.savefig(output_dir / 'entropy_comparison.png', dpi=300, bbox_inches='tight')
        plt.close()

        print(f"\n✓ Plots saved to {output_dir}/")

    def generate_report(self, df: pd.DataFrame, shift_df: pd.DataFrame, output_dir: Path):
        """
        Generate comprehensive text report.

        Args:
            df: Results DataFrame
            shift_df: Domain shift metrics DataFrame
            output_dir: Directory to save report
        """
        output_dir = Path(output_dir)
        output_dir.mkdir(exist_ok=True, parents=True)

        report_path = output_dir / 'domain_shift_report.txt'

        with open(report_path, 'w', encoding='utf-8') as f:
            f.write("=" * 80 + "\n")
            f.write("DOMAIN SHIFT ANALYSIS REPORT\n")
            f.write("=" * 80 + "\n\n")
            f.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
            f.write(f"Models analyzed: {len(MODELS)}\n")
            f.write(f"Domains analyzed: {len(DOMAIN_TEXTS)}\n")
            f.write(f"Quantization: {'4-bit' if self.use_4bit else '8-bit'}\n")
            f.write(f"Device: {self.device}\n")
            f.write("\n" + "=" * 80 + "\n\n")

            # Domain shift ranking
            f.write("DOMAIN SHIFT RANKING (Lower Score = More Robust)\n")
            f.write("-" * 80 + "\n")
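            # shift_df is sorted by score, so number rows by position rather
            # than by the original (pre-sort) index label.
            for rank, (_, row) in enumerate(shift_df.iterrows(), start=1):
                f.write(f"\n{rank}. {row['model']}\n")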
                f.write(f"   Domain Shift Score: {row['domain_shift_score']:.4f}\n")
                f.write(f"   Perplexity CV: {row['perplexity_cv']:.4f}\n")
                f.write(f"   Entropy Range: {row['entropy_range']:.4f}\n")
                f.write(f"   Confidence Degradation: {row['confidence_degradation']:.4f}\n")
                f.write(f"   Best Domain: {row['best_domain']}\n")
                f.write(f"   Worst Domain: {row['worst_domain']}\n")

            f.write("\n" + "=" * 80 + "\n\n")

            # Detailed results per model
            f.write("DETAILED RESULTS BY MODEL\n")
            f.write("-" * 80 + "\n")
            for model in df['model'].unique():
                model_data = df[df['model'] == model]
                f.write(f"\n{model.upper()}\n")
                f.write("-" * 40 + "\n")
                for _, row in model_data.iterrows():
                    f.write(f"\n  Domain: {row['domain']}\n")
                    f.write(f"    Perplexity: {row['perplexity']:.2f}\n")
                    f.write(f"    Token Entropy: {row['token_entropy']:.4f}\n")
                    f.write(f"    Confidence: {row['confidence']:.4f}\n")
                    f.write(f"    Inference Time: {row['inference_time_s']:.2f}s\n")

            f.write("\n" + "=" * 80 + "\n\n")

            # Key findings
            f.write("KEY FINDINGS\n")
            f.write("-" * 80 + "\n\n")

            best_model = shift_df.iloc[0]['model']
            worst_model = shift_df.iloc[-1]['model']

            f.write(f"• Most robust model (least domain shift): {best_model}\n")
            f.write(f"• Most sensitive model (most domain shift): {worst_model}\n\n")

            # Find easiest and hardest domains overall
            domain_difficulty = df.groupby('domain')['perplexity'].mean().sort_values()
            easiest_domain = domain_difficulty.index[0]
            hardest_domain = domain_difficulty.index[-1]

            f.write(f"• Easiest domain overall: {easiest_domain} ")
            f.write(f"(avg perplexity: {domain_difficulty[easiest_domain]:.2f})\n")
            f.write(f"• Hardest domain overall: {hardest_domain} ")
            f.write(f"(avg perplexity: {domain_difficulty[hardest_domain]:.2f})\n\n")

            f.write("• Domain shift indicates how much model performance varies across domains\n")
            f.write("• High perplexity CV = model struggles with domain generalization\n")
            f.write("• High entropy range = model uncertainty varies significantly by domain\n")
            f.write("• High confidence degradation = model less certain on some domains\n")

        print(f"✓ Report saved to {report_path}")

    def save_results(self, df: pd.DataFrame, output_dir: Path):
        """
        Save results to CSV and JSON.

        Args:
            df: Results DataFrame
            output_dir: Directory to save files
        """
        output_dir = Path(output_dir)
        output_dir.mkdir(exist_ok=True, parents=True)

        # Save CSV
        csv_path = output_dir / 'results.csv'
        df.to_csv(csv_path, index=False)
        print(f"✓ Results saved to {csv_path}")

        # Save JSON
        json_path = output_dir / 'results.json'
        df.to_json(json_path, orient='records', indent=2)
        print(f"✓ Results saved to {json_path}")
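The three per-text metrics computed by `DomainShiftAnalyzer` can be illustrated on a toy example without loading a model. The sketch below is a minimal illustration under stated assumptions: the logits are made-up numbers over a hypothetical 4-token vocabulary, and the helper names `softmax` and `toy_metrics` are not part of the notebook. It treats perplexity as the exponential of the mean negative log-likelihood of the observed next tokens, entropy as Shannon entropy in nats (the default of `scipy.stats.entropy`), and confidence as the mean maximum probability.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_metrics(step_logits, next_tokens):
    """Return (perplexity, mean entropy in nats, mean max probability).

    step_logits[i] holds the logits used to predict next_tokens[i].
    """
    nll, entropies, max_probs = [], [], []
    for logits, target in zip(step_logits, next_tokens):
        probs = softmax(logits)
        nll.append(-math.log(probs[target]))
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
        max_probs.append(max(probs))
    n = len(nll)
    return math.exp(sum(nll) / n), sum(entropies) / n, sum(max_probs) / n

# Hypothetical logits: two positions, 4-token vocabulary (illustrative only).
uniform = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
ppl, ent, conf = toy_metrics(uniform, next_tokens=[1, 3])
print(f"uniform: perplexity={ppl:.3f} entropy={ent:.3f} confidence={conf:.3f}")

# A peaked distribution that puts most mass on the correct token.
peaked = [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]]
ppl2, ent2, conf2 = toy_metrics(peaked, next_tokens=[0, 2])
print(f"peaked:  perplexity={ppl2:.3f} entropy={ent2:.3f} confidence={conf2:.3f}")
```

A uniform distribution over V tokens gives perplexity V, entropy ln V, and confidence 1/V, which makes a handy sanity check. One difference from the class above: the toy pairs every position with a next token, whereas `compute_token_entropy` and `compute_confidence` average over all positions, including the final one that has no next token in the text.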




Initializing DomainShiftAnalyzer
Device: cuda
Quantization: 4-bit
CUDA available: True
GPU: Tesla T4
--------------------------------------------------------------------------------

Starting comprehensive domain shift analysis...
This will take several minutes depending on your hardware.


DOMAIN SHIFT ANALYSIS

MODEL: gpt2-small

  Analyzing gpt2-small on medical domain...

Loading gpt2-small (gpt2)...




GPT2LMHeadModel LOAD REPORT from: gpt2
Key                  | Status     |  | 
---------------------+------------+--+-
h.{0...11}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


  Parameters: 81,972,480
  GPU Memory: 211.61 MB


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


    Perplexity: 17.55
    Token Entropy: 2.5274
    Confidence: 0.5908
    Time: 0.94s

  Analyzing gpt2-small on legal domain...
    Perplexity: 15.35
    Token Entropy: 2.2405
    Confidence: 0.6147
    Time: 0.34s

  Analyzing gpt2-small on technical domain...
    Perplexity: 43.34
    Token Entropy: 2.9844
    Confidence: 0.5343
    Time: 0.35s

  Analyzing gpt2-small on casual domain...
    Perplexity: 14.13
    Token Entropy: 2.6221
    Confidence: 0.5435
    Time: 0.32s

MODEL: gpt2-medium

  Analyzing gpt2-medium on medical domain...

Loading gpt2-medium (gpt2-medium)...


GPT2LMHeadModel LOAD REPORT from: gpt2-medium
Key                  | Status     |  | 
---------------------+------------+--+-
h.{0...23}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


  Parameters: 203,828,224
  GPU Memory: 600.35 MB
    Perplexity: 8.10
    Token Entropy: 2.3262
    Confidence: 0.6099
    Time: 1.30s

  Analyzing gpt2-medium on legal domain...
    Perplexity: 7.69
    Token Entropy: 2.3595
    Confidence: 0.6003
    Time: 0.57s

  Analyzing gpt2-medium on technical domain...
    Perplexity: 17.63
    Token Entropy: 3.0572
    Confidence: 0.5183
    Time: 0.70s

  Analyzing gpt2-medium on casual domain...
    Perplexity: 7.10
    Token Entropy: 2.3414
    Confidence: 0.5699
    Time: 0.60s

MODEL: opt-350m

  Analyzing opt-350m on medical domain...

Loading opt-350m (facebook/opt-350m)...


  Parameters: 179,677,184
  GPU Memory: 804.86 MB
    Perplexity: 9.65
    Token Entropy: 2.2051
    Confidence: 0.6104
    Time: 6.75s

  Analyzing opt-350m on legal domain...
    Perplexity: 8.77
    Token Entropy: 1.9377
    Confidence: 0.6406
    Time: 2.60s

  Analyzing opt-350m on technical domain...
    Perplexity: 22.09
    Token Entropy: 2.7918
    Confidence: 0.5352
    Time: 1.71s

  Analyzing opt-350m on casual domain...
    Perplexity: 7.35
    Token Entropy: 2.3516
    Confidence: 0.5488
    Time: 1.55s

MODEL: pythia-410m

  Analyzing pythia-410m on medical domain...

Loading pythia-410m (EleutherAI/pythia-410m)...


  Parameters: 254,339,072
  GPU Memory: 1162.12 MB
    Perplexity: 10.48
    Token Entropy: 2.4381
    Confidence: 0.4956
    Time: 3.29s

  Analyzing pythia-410m on legal domain...
    Perplexity: 7.32
    Token Entropy: 2.1425
    Confidence: 0.5610
    Time: 1.96s

  Analyzing pythia-410m on technical domain...
    Perplexity: 38.58
    Token Entropy: 3.3077
    Confidence: 0.3884
    Time: 1.23s

  Analyzing pythia-410m on casual domain...
    Perplexity: 12.10
    Token Entropy: 2.8341
    Confidence: 0.4290
    Time: 0.86s

Computing domain shift metrics...

DOMAIN SHIFT RANKING:
      model  domain_shift_score best_domain worst_domain
gpt2-medium            1.317815      casual    technical
 gpt2-small            1.439581      casual    technical
   opt-350m            1.529178      casual    technical
pythia-410m            2.181465       legal    technical

Saving results...
✓ Results saved to /mnt/user-data/outputs/domain_shift_analysis/results.csv
✓ Results saved to /mnt/use

In [5]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

ValueError: Mountpoint must not already contain files
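This error appears to be raised when the mount directory already exists and is non-empty, for example because an earlier cell wrote files under `/content/drive` before Drive was mounted. A minimal sketch of a pre-check follows; the helper name `mountpoint_is_clean` is hypothetical and uses only the standard library, so it does not call `google.colab` itself.

```python
import os

def mountpoint_is_clean(path):
    """Return True if `path` does not exist or is an empty directory."""
    if not os.path.exists(path):
        return True
    return os.path.isdir(path) and not os.listdir(path)

MOUNT_POINT = "/content/drive"  # same mount point as in the cell above

if mountpoint_is_clean(MOUNT_POINT):
    print(f"{MOUNT_POINT} is empty or missing; mounting should not hit this error.")
else:
    print(f"{MOUNT_POINT} already contains files; it may already be mounted, "
          "or earlier cells may have written into it.")
```

If the check reports existing files and Drive is not actually mounted, moving or deleting those files before calling `drive.mount` is one way to proceed.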

In [8]:
# ============================================================================
# MAIN EXECUTION
# ============================================================================

# Configuration
OUTPUT_DIR = Path("/content/drive/MyDrive/DomainShiftAnalysis_Results")  # Results subfolder on Google Drive
USE_4BIT = True  # Set to False for 8-bit quantization

# Initialize analyzer
analyzer = DomainShiftAnalyzer(use_4bit=USE_4BIT)

# Run analysis
print("\nStarting comprehensive domain shift analysis...")
print("This will take several minutes depending on your hardware.\n")

df = analyzer.analyze_all()

# Compute domain shift metrics
print("\n" + "=" * 80)
print("Computing domain shift metrics...")
shift_df = analyzer.compute_domain_shift_metrics(df)

print("\nDOMAIN SHIFT RANKING:")
print(shift_df[['model', 'domain_shift_score', 'best_domain', 'worst_domain']].to_string(index=False))

# Save results
print("\n" + "=" * 80)
print("Saving results...")
analyzer.save_results(df, OUTPUT_DIR)

# Generate visualizations
print("\nGenerating visualizations...")
analyzer.plot_results(df, OUTPUT_DIR)

# Generate report
print("\nGenerating report...")
analyzer.generate_report(df, shift_df, OUTPUT_DIR)

print("\n" + "=" * 80)
print("ANALYSIS COMPLETE!")
print(f"All results saved to: {OUTPUT_DIR}")
print("=" * 80)

Initializing DomainShiftAnalyzer
Device: cuda
Quantization: 4-bit
CUDA available: True
GPU: Tesla T4
--------------------------------------------------------------------------------

Starting comprehensive domain shift analysis...
This will take several minutes depending on your hardware.


DOMAIN SHIFT ANALYSIS

MODEL: gpt2-small

  Analyzing gpt2-small on medical domain...

Loading gpt2-small (gpt2)...


GPT2LMHeadModel LOAD REPORT from: gpt2
Key                  | Status     |  | 
---------------------+------------+--+-
h.{0...11}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


  Parameters: 81,972,480
  GPU Memory: 2420.73 MB
    Perplexity: 17.55
    Token Entropy: 2.5274
    Confidence: 0.5908
    Time: 0.51s

  Analyzing gpt2-small on legal domain...
    Perplexity: 15.35
    Token Entropy: 2.2405
    Confidence: 0.6147
    Time: 0.47s

  Analyzing gpt2-small on technical domain...
    Perplexity: 43.34
    Token Entropy: 2.9844
    Confidence: 0.5343
    Time: 0.47s

  Analyzing gpt2-small on casual domain...
    Perplexity: 14.13
    Token Entropy: 2.6221
    Confidence: 0.5435
    Time: 0.48s

MODEL: gpt2-medium

  Analyzing gpt2-medium on medical domain...

Loading gpt2-medium (gpt2-medium)...


GPT2LMHeadModel LOAD REPORT from: gpt2-medium
Key                  | Status     |  | 
---------------------+------------+--+-
h.{0...23}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


  Parameters: 203,828,224
  GPU Memory: 2800.10 MB
    Perplexity: 8.10
    Token Entropy: 2.3262
    Confidence: 0.6099
    Time: 0.47s

  Analyzing gpt2-medium on legal domain...
    Perplexity: 7.69
    Token Entropy: 2.3595
    Confidence: 0.6003
    Time: 0.43s

  Analyzing gpt2-medium on technical domain...
    Perplexity: 17.63
    Token Entropy: 3.0572
    Confidence: 0.5183
    Time: 0.40s

  Analyzing gpt2-medium on casual domain...
    Perplexity: 7.10
    Token Entropy: 2.3414
    Confidence: 0.5699
    Time: 0.41s

MODEL: opt-350m

  Analyzing opt-350m on medical domain...

Loading opt-350m (facebook/opt-350m)...


  Parameters: 179,677,184
  GPU Memory: 1972.21 MB
    Perplexity: 9.65
    Token Entropy: 2.2051
    Confidence: 0.6104
    Time: 1.43s

  Analyzing opt-350m on legal domain...
    Perplexity: 8.77
    Token Entropy: 1.9377
    Confidence: 0.6406
    Time: 1.50s

  Analyzing opt-350m on technical domain...
    Perplexity: 22.09
    Token Entropy: 2.7918
    Confidence: 0.5352
    Time: 1.20s

  Analyzing opt-350m on casual domain...
    Perplexity: 7.35
    Token Entropy: 2.3516
    Confidence: 0.5488
    Time: 0.97s

MODEL: pythia-410m

  Analyzing pythia-410m on medical domain...

Loading pythia-410m (EleutherAI/pythia-410m)...


  Parameters: 254,339,072
  GPU Memory: 2327.33 MB
    Perplexity: 10.48
    Token Entropy: 2.4381
    Confidence: 0.4956
    Time: 0.70s

  Analyzing pythia-410m on legal domain...
    Perplexity: 7.32
    Token Entropy: 2.1425
    Confidence: 0.5610
    Time: 0.63s

  Analyzing pythia-410m on technical domain...
    Perplexity: 38.58
    Token Entropy: 3.3077
    Confidence: 0.3884
    Time: 0.77s

  Analyzing pythia-410m on casual domain...
    Perplexity: 12.10
    Token Entropy: 2.8341
    Confidence: 0.4290
    Time: 0.67s

Computing domain shift metrics...

DOMAIN SHIFT RANKING:
      model  domain_shift_score best_domain worst_domain
gpt2-medium            1.317815      casual    technical
 gpt2-small            1.439581      casual    technical
   opt-350m            1.529178      casual    technical
pythia-410m            2.181465       legal    technical

Saving results...
✓ Results saved to /content/drive/MyDrive/DomainShiftAnalysis_Results/results.csv
✓ Results saved to /c
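As a sanity check on the ranking, the composite `domain_shift_score` can be recomputed by hand from the per-domain numbers printed above. The sketch below does this for gpt2-medium using only the standard library. The inputs are the rounded values from the log, so the result should land close to, but not exactly on, the reported 1.317815. It assumes that `Series.std()` in pandas uses the sample standard deviation (ddof=1), which `statistics.stdev` also computes.

```python
import statistics

# Per-domain values for gpt2-medium, copied from the run output above
# (order: medical, legal, technical, casual).
perplexity = [8.10, 7.69, 17.63, 7.10]
token_entropy = [2.3262, 2.3595, 3.0572, 2.3414]
confidence = [0.6099, 0.6003, 0.5183, 0.5699]

# Coefficient of variation of perplexity (sample std / mean).
perp_cv = statistics.stdev(perplexity) / statistics.mean(perplexity)
# Spread of entropy and confidence across domains.
entropy_range = max(token_entropy) - min(token_entropy)
conf_degradation = max(confidence) - min(confidence)

# Unweighted sum, as in compute_domain_shift_metrics.
score = perp_cv + entropy_range + conf_degradation
print(f"perp_cv={perp_cv:.4f} entropy_range={entropy_range:.4f} "
      f"conf_degradation={conf_degradation:.4f} score={score:.4f}")
```

The three components live on different scales (a unitless ratio, nats, and a probability difference), and for this model the entropy range contributes more than half of the total. The composite score is therefore best read as a rough ordering rather than a calibrated measure.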

In [9]:
# Confirm the output directory path
OUTPUT_DIR = Path("/content/drive/MyDrive/DomainShiftAnalysis_Results")

# List contents of the output directory
print(f"Contents of {OUTPUT_DIR}:")
!ls -F "{OUTPUT_DIR}"

Contents of /content/drive/MyDrive/DomainShiftAnalysis_Results:
confidence_comparison.png  entropy_comparison.png  results.csv
domain_shift_report.txt    perplexity_heatmap.png  results.json
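The saved `results.json` is written with `orient='records'`, so it is a plain list of per-row dictionaries and can be read back without pandas. The sketch below is a minimal illustration: it writes a small stand-in file to a temporary directory (two records using perplexity values from the run above, with the other columns omitted) and regroups the records by model. The variable names are illustrative, not part of the notebook.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

# Stand-in for results.json: same record shape, trimmed to three fields.
sample = [
    {"model": "gpt2-medium", "domain": "casual", "perplexity": 7.10},
    {"model": "gpt2-medium", "domain": "technical", "perplexity": 17.63},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "results.json"
    path.write_text(json.dumps(sample, indent=2), encoding="utf-8")
    records = json.loads(path.read_text(encoding="utf-8"))

# Regroup as {model: {domain: perplexity}} for quick lookups.
by_model = defaultdict(dict)
for rec in records:
    by_model[rec["model"]][rec["domain"]] = rec["perplexity"]

# Ratio of worst-domain to best-domain perplexity for one model.
ratio = by_model["gpt2-medium"]["technical"] / by_model["gpt2-medium"]["casual"]
print(f"technical / casual perplexity ratio: {ratio:.2f}")
```

To use it on the real file, replace the temporary path with `OUTPUT_DIR / 'results.json'`.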
