### **Semantic Performance LASER - From text perplexity to task performance.**

by [Lucas Hänke de Cansino](https://www.linkedin.com/in/lucas-h%C3%A4nke-de-cansino-8b8521234/), [Fernando Fernandes Neto](https://twitter.com/FernandoNetoAi), [David Golchinfar](https://twitter.com/DavidGFar) and [Eric Hartford](https://twitter.com/erhartford)

With this notebook, we present a novel approach to applying LaserRMT that shifts from a perplexity-based assessment of the lasered model to its text generation performance on datasets previously seen during training. The method scores generated answers by their similarity to both the chosen and the rejected answers from Direct Preference Optimization (DPO) datasets.


**Overview:**

Here, we demonstrate the approach on `openaccess-ai-collective/DPOpenHermes-7B-v2` and its two DPO training datasets, `Intel/orca_dpo_pairs` and `allenai/ultrafeedback_binarized_cleaned`. In brief, the process is as follows: first, the script is executed, which, among other outputs, generates a JSON file containing the current top 16 highest SNR/max singular value for each module in every layer. Following this, we will guide you on how to use the extracted layers in Axolotl or LlamaFactory for your training.
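The control flow of the script can be summarised as a greedy keep-or-revert loop over candidate layers. The following is a minimal sketch of that loop, not the script itself: `greedy_layer_search` and the toy `apply`, `revert`, and `evaluate` callables are hypothetical stand-ins for the rank reduction, weight restoration, and performance scoring that the script performs on a real model.

```python
def greedy_layer_search(candidates, apply, revert, evaluate):
    """Try each candidate in order; keep a change only if the score improves."""
    baseline = evaluate()
    kept = []
    for layer in candidates:
        apply(layer)
        score = evaluate()
        if score > baseline:
            baseline = score  # the improved model becomes the new baseline
            kept.append(layer)
        else:
            revert(layer)  # no improvement: undo the change
    return kept, baseline


# Toy stand-in for a model: each "layer" shifts the score by a fixed amount.
# These numbers are made up purely for illustration.
effects = {
    "mlp.down_proj+.30.": 0.10,
    "self_attn.q_proj+.12.": -0.20,
    "mlp.up_proj+.5.": 0.05,
}
active = set()

kept, final_score = greedy_layer_search(
    candidates=list(effects),
    apply=active.add,
    revert=active.discard,
    evaluate=lambda: sum(effects[name] for name in active),
)
print(kept, round(final_score, 2))
```

Because the baseline is updated after every accepted change, the result depends on the order in which candidates are tried.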



### **The laser-scanner script:**

In [None]:
# %%
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import numpy as np
import gc
import torch.nn.functional as F

from lib.utils.load_benchmark_dataset import get_benchmark_data

from lib.utils.assets import PromptTemplate
from lib.utils.prompt_template import get_llm_prompt

from lib.utils.AutoModelForSentenceEmbedding import (
    AutoModelForSentenceEmbedding,
    get_cosine_embeddings,
)

class ModelModifier:
    def __init__(
        self,
        model_name,
        prompt_template: PromptTemplate = PromptTemplate.chatml,
        input_length=512,
        output_length=512,
    ):
        self.model_name = model_name
        self.prompt_template = prompt_template
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name, torch_dtype=torch.bfloat16, device_map={"": 0}
        )
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name, use_fast=True
        )
        self.layer_snr = {}
        self.modified_layers = set()
        self.original_weights = {}
        self.input_length = input_length
        self.output_length = output_length
        self.embeddings_model = AutoModelForSentenceEmbedding(
            model_name, self.tokenizer
        )

    def calculate_snr_for_layer(self, layer_type, layer_number):
        for name, module in self.model.named_modules():
            if layer_type in name and str(layer_number) in name:
                weights = module.weight.double()
                S = torch.linalg.svdvals(weights)
                weights = weights.detach().cpu()
                S = S.detach().cpu()
                sigma_estimated = self.estimate_sigma_with_full_iqr(S)
                n, m = weights.shape
                mp_threshold = self.marchenko_pastur_threshold(sigma_estimated, n, m)

                signal = S[S > mp_threshold].sum()
                noise = S[S <= mp_threshold].sum()
                snr = signal / noise if noise != 0 else float("inf")
                del S, weights
                torch.cuda.empty_cache()  # Clear PyTorch's CUDA memory cache
                gc.collect()
                return snr

    def update_model_reduce_layer(self, layer_type, layer_number):
        layer_id = f"{layer_type}+{layer_number}"
        if layer_id in self.modified_layers:
            print(f"Layer {layer_id} has already been modified. Skipping.")
            return False

        for name, module in self.model.named_modules():
            if layer_type in name and str(layer_number) in name:
                print(f"Reconstructing layer: {name}")
                original_dtype = module.weight.dtype
                self.original_weights[name] = module.weight.detach().clone()
                weights = module.weight.double()
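                # torch.linalg.svd returns the right factor already transposed (Vh),
                # so `V` below is Vh and U @ diag(S) @ V reconstructs the matrix
                U, S, V = torch.linalg.svd(weights, full_matrices=False)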

                # Estimate sigma using the full IQR method
                sigma_estimated_full_iqr = self.estimate_sigma_with_full_iqr(S)

                # Calculate Marchenko-Pastur threshold
                n, m = weights.shape
                mp_threshold_full_iqr = self.marchenko_pastur_threshold(
                    sigma_estimated_full_iqr, n, m
                )

                # Retain only the singular values above the MP threshold
                S_reduced = torch.zeros_like(S)
                k = (S > mp_threshold_full_iqr).sum().item()
                S_reduced[:k] = S[:k]
                print(f"Reduced from {S.shape} to {k}")

                # Reconstruct the matrix using the thresholded singular values
                reconstructed_weights = U @ torch.diag(S_reduced) @ V
                reconstructed_weights = reconstructed_weights.to(original_dtype)
                module.weight = torch.nn.Parameter(reconstructed_weights)
                self.modified_layers.add(layer_id)
                return True

    @staticmethod
    def marchenko_pastur_threshold(sigma, n, m):
        beta = n / m if n < m else m / n
        threshold = sigma * np.sqrt((1 + np.sqrt(beta)) ** 2)
        return threshold

    # Estimate the standard deviation of the singular values from the interquartile range (IQR)

    @staticmethod
    def estimate_sigma_with_full_iqr(S):
        q75 = torch.quantile(S, 0.75)
        q25 = torch.quantile(S, 0.25)
        iqr = q75 - q25
        sigma_estimated = (
            iqr / 1.349
        )  # For a normal distribution, IQR = Q3 - Q1 is about 1.349 * sigma (2 * 0.6745)
        return sigma_estimated

    def restore_model_original_layer(self, layer_type, layer_number):
        layer_id = f"{layer_type}+{layer_number}"
        for name, module in self.model.named_modules():
            if layer_type in name and str(layer_number) in name:
                if name in self.original_weights:
                    module.weight = torch.nn.Parameter(self.original_weights[name])
                    print(f"Restored original weights for layer: {name}", flush=True)
                    if layer_id in self.modified_layers:
                        self.modified_layers.remove(layer_id)
                        break
                else:
                    print(f"No original weights saved for layer: {name}", flush=True)
        return

    def calculate_model_performance(
        self,
        datasets=["orca_dpo", "ultrafeedback"],  # "openhermes"
        n_samples=128,
        input_length=512,
        output_length=512,
    ):
        score_accumulated = 0.0
        model = self.model
        tokenizer = self.tokenizer
        embeddings_model = self.embeddings_model
        for dataset in datasets:
            benchmark_dataset = get_benchmark_data(
                dataset, n_samples, input_length, output_length
            )
            print("Calculating performance for dataset:", dataset)
            for index, sample in enumerate(benchmark_dataset.data):
                progress = str(f"{index}/{n_samples}")
                print(progress)
                prompt = get_llm_prompt(sample.instruction, sample.prompt)
                prompt_enc = tokenizer([prompt], return_tensors="pt")
                prompt_enc.to("cuda")
                model_output = model.generate(
                    **prompt_enc,
                    max_new_tokens=self.output_length,
                    use_cache=False,
                    output_hidden_states=False,
                    output_attentions=False,
                    pad_token_id=tokenizer.eos_token_id,
                )
                expected_answer = sample.chosen
                expected_answer_enc = tokenizer(
                    [expected_answer],
                    return_tensors="pt",
                    padding="max_length",
                    max_length=self.output_length,
                )
                expected_answer_enc.to("cuda")
                expected_answer_embs = embeddings_model(**expected_answer_enc)
                rejected_answer = sample.rejected
                rejected_answer_enc = tokenizer(
                    [rejected_answer],
                    return_tensors="pt",
                    padding="max_length",
                    max_length=self.output_length,
                )
                rejected_answer_enc.to("cuda")
                rejected_answer_embs = embeddings_model(**rejected_answer_enc)

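                # Use a separate name so the `input_length` argument is not
                # overwritten before the next dataset is loaded
                prompt_length = len(prompt_enc["input_ids"][0])

                # Slice the output to remove the input tokens
                response_tokens = model_output[0][prompt_length:]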

                output_string = tokenizer.decode(
                    response_tokens, skip_special_tokens=True
                )
                answer_enc = tokenizer(
                    [output_string],
                    return_tensors="pt",
                    padding="max_length",
                    max_length=self.output_length,
                )
                answer_enc.to("cuda")
                model_output_embs = embeddings_model(**answer_enc)
                cosine_similarity_gain = get_cosine_embeddings(
                    model_output_embs, expected_answer_embs
                )
                score_accumulated += cosine_similarity_gain.item()
                cosine_similarity_loss = get_cosine_embeddings(
                    model_output_embs, rejected_answer_embs
                )
                score_accumulated -= cosine_similarity_loss.item()

                del (
                    answer_enc,
                    rejected_answer_enc,
                    expected_answer_enc,
                    prompt_enc,
                    model_output_embs,
                    expected_answer_embs,
                    rejected_answer_embs,
                    cosine_similarity_gain,
                    cosine_similarity_loss,
                )
                torch.cuda.empty_cache()

        performance = score_accumulated / (n_samples * len(datasets))
        return performance

    def assess_layers_snr(self, layer_types, layer_numbers):
        for name, _ in self.model.named_modules():
            for layer_number in layer_numbers:
                for layer_type in layer_types:
                    if layer_type in name and str(layer_number) in name:
                        layer_name = f"{layer_type}+{layer_number}"
                        print("*" * 50, flush=True)
                        print(
                            f"Calculating Signal to Noise Ratio at layer {layer_name}",
                            flush=True,
                        )
                        snr = self.calculate_snr_for_layer(layer_type, layer_number)
                        self.layer_snr[layer_name] = snr
                        print(
                            f"Signal to Noise Ratio at layer {layer_name} = {snr}",
                            flush=True,
                        )
                        print("*" * 50, flush=True)

    def select_layers_for_modification(self, k):
        sorted_layers = sorted(
            self.layer_snr.items(), key=lambda x: x[1], reverse=False
        )
        return [layer[0] for layer in sorted_layers[:k]]

    def test_and_modify_layers(self, candidate_layers):
        initial_performance = self.calculate_model_performance()

        print(f"Initial Model Performance: {initial_performance}")

        for layer in candidate_layers:
            # Modify the layer
            layer_type = layer.split("+")[0]
            layer_number = layer.split("+")[1]
            self.update_model_reduce_layer(
                layer_type=layer_type, layer_number=layer_number
            )

            # Test the model's performance
            new_performance = self.calculate_model_performance()
            print(
                f"Tested Model Performance after modifying {layer}: {new_performance}"
            )

            # If the performance does not improve, revert the change
            if new_performance <= initial_performance:
                self.restore_model_original_layer(
                    layer_type=layer_type, layer_number=layer_number
                )
                print(
                    f"Reverted changes in {layer} due to lack of improvement.",
                    flush=True,
                )
            else:
                initial_performance = new_performance
                print(
                    f"Modification kept for {layer}. New baseline performance: {initial_performance}",
                    flush=True,
                )

    def save_model(self, save_dir):

        self.model.save_pretrained(save_dir)
        self.tokenizer.save_pretrained(save_dir)


# Usage
model_name = "openaccess-ai-collective/DPOpenHermes-7B-v2"
modifier = ModelModifier(model_name)

# %%
layer_numbers = list(range(31, -1, -1))
layer_numbers = [f".{l}." for l in layer_numbers]
print(layer_numbers)

layer_types=['mlp.gate', 'mlp.down_proj', 'mlp.up_proj', 'self_attn.q_proj', 'self_attn.k_proj', 'self_attn.v_proj', 'self_attn.o_proj']

modifier.assess_layers_snr(layer_types, layer_numbers)
top_k_layers = modifier.select_layers_for_modification(15)  # Select the 15 layers with the lowest SNR (ascending sort)
print(top_k_layers, flush=True)

modifier.test_and_modify_layers(top_k_layers)
# %%
modifier.save_model("laser_model")





The key parts of this script are the **calculate_model_performance** and **calculate_snr_for_layer** methods:

### Calculate Model Performance

The `calculate_model_performance` method in the Python code evaluates the overall performance of the LLM. This method incorporates both the generation of model outputs and the comparison of these outputs with expected results using cosine similarity. 

Here's a step-by-step breakdown of the process:

1. **Initialization**: The method begins by initializing a score accumulator to zero. This score accumulator will be used to compile the overall performance score of the model across multiple data sets and samples.

2. **Benchmark Dataset**: For each specified dataset (e.g., "orca_dpo", "ultrafeedback"), the method retrieves a benchmark dataset. This dataset contains a number of samples (defined by `n_samples`) with a specific input length (`input_length`) and output length (`output_length`).

3. **Sample Iteration**: The method iterates through each sample in the benchmark dataset. For each sample, it performs the following steps:

   - **Prompt Creation**: Using the sample's instruction and prompt, it creates a suitable prompt for the model. 

   - **Prompt Encoding**: The created prompt is then encoded into a format that can be processed by the model.

   - **Model Output Generation**: The method generates an output from the model based on the encoded prompt. The output is limited to at most `output_length` new tokens.

   - **Expected Answer Encoding**: The expected answer (the sample's chosen answer) is tokenized and padded to `output_length` so that it can be passed to the embedding model.

   - **Rejected Answer Encoding**: The sample's rejected answer is encoded in the same way.

   - **Embeddings Calculation**: The method calculates embeddings for the model output, expected answer, and rejected answer. These embeddings are computed using an embedding model, which transforms the raw encoded text into a dense vector representation.

4. **Cosine Similarity Calculation**: The method calculates the cosine similarity between the embeddings of the model output and the expected answer. This similarity score measures how closely the model's output aligns with the expected answer, and it is added to the score accumulator. The method then calculates the cosine similarity between the model output and the rejected answer and subtracts it from the score accumulator. Each sample therefore contributes the difference between the two similarities.

5. **Memory Management**: After processing each sample, the method clears the allocated memory for the various encodings and embeddings to optimize memory usage and prevent memory leaks.

6. **Performance Calculation**: Once all samples in all datasets have been processed, the method calculates the overall performance of the model by dividing the total accumulated score by the total number of samples across all datasets.

This evaluation process gives a single score per model, providing a basis for model optimization and refinement. It also provides a mechanism for comparing the performance of different models or different configurations of the same model.
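The scoring rule described above can be illustrated with a small sketch. It is not the notebook's implementation: plain Python lists stand in for the embedding tensors, and `cosine_similarity`, `preference_score`, and the sample vectors are hypothetical names and values chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def preference_score(samples):
    """Average of cos(output, chosen) - cos(output, rejected) over all samples."""
    total = 0.0
    for output, chosen, rejected in samples:
        total += cosine_similarity(output, chosen)
        total -= cosine_similarity(output, rejected)
    return total / len(samples)


# Made-up 2-D "embeddings": (model output, chosen answer, rejected answer)
samples = [
    ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]),  # output matches the chosen answer
    ([0.0, 1.0], [1.0, 0.0], [0.0, 1.0]),  # output matches the rejected answer
]
print(preference_score(samples))
```

Since each cosine similarity lies between -1 and 1, the per-sample contribution lies between -2 and 2. A positive average means the outputs sit closer to the chosen answers than to the rejected ones.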


### Calculate Signal-to-Noise Ratio for Layer
The `calculate_snr_for_layer` method computes the signal-to-noise ratio (SNR) of a specific layer's weight matrix. It extracts the singular values of the layer's weights and applies statistical measures to separate signal from noise. Here's a step-by-step breakdown of the process:

1. **Identify Layer Weights**: For a given layer type and number, the method iterates through the model's layers to find a match. Once found, it extracts the weights of the layer and converts them to double precision for accurate computation.

2. **Singular Value Decomposition (SVD) Values**: The method calculates the singular values (\(S\)) of the layer's weight matrix using PyTorch's `torch.linalg.svdvals` function. This step is crucial for assessing the layer's information content through its singular values.

3. **Maximum Singular Value**: The maximum singular value (\(S[0]\)) represents the highest magnitude of signal strength in the layer's weights. The version of `calculate_snr_for_layer` listed above does not store it separately.

4. **Estimate Sigma with IQR**: Using the interquartile range (IQR) of the singular values, the method estimates their standard deviation (\(\sigma\)). For normally distributed data the IQR spans about \(1.349\sigma\), which gives the estimate below. It is used to set a threshold for distinguishing between signal and noise:
   \[\sigma = \frac{\mathrm{IQR}}{1.349}\]

5. **Marchenko-Pastur Threshold**: The method then calculates the Marchenko-Pastur threshold (\(\lambda\)) to separate the singular values into signal and noise categories. This threshold is computed using the formula:
   \[\lambda = \sigma \sqrt{(1 + \sqrt{\beta})^2}\]
   where \(\beta\) is the aspect ratio of the weight matrix (\(n/m\) or \(m/n\), whichever is smaller).

6. **Signal and Noise Calculation**: The singular values greater than the Marchenko-Pastur threshold (\(\lambda\)) are considered signal, and those at or below it are considered noise. The method sums these groups of singular values separately to quantify the total signal (\(\sum_{\sigma_i > \lambda} \sigma_i\)) and total noise (\(\sum_{\sigma_i \leq \lambda} \sigma_i\)).

7. **Signal-to-Noise Ratio (SNR)**: The SNR is calculated by dividing the total signal by the total noise. In cases where the noise is zero (to avoid division by zero), the SNR is set to infinity (\(\infty\)), indicating a layer with overwhelmingly dominant signal content.

8. **SNR Ratio Relative to Maximum Singular Value**: The SNR analysis can be refined further by calculating the ratio of the SNR to the maximum singular value. This ratio provides insight into how the layer's strongest signal component compares to the overall signal-to-noise balance. The version of `calculate_snr_for_layer` listed above returns the plain SNR and does not apply this normalization:
   \[\text{SNR Ratio} = \frac{\text{SNR}}{\text{max singular value}}\]

9. **Memory Management**: After the calculations, the method clears the allocated memory for the singular values and weights to optimize memory usage and prevent memory leaks.

This detailed analysis enables the identification of layers with high signal-to-noise efficiency, indicating layers that are potentially more influential or critical to the model's performance. Layers with higher SNR ratios relative to their maximum singular value are considered to have weights that are more effectively contributing to the model's output, providing a basis for model optimization and refinement.
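Steps 4 to 7 can be traced on a small, made-up spectrum. The sketch below is a standard-library approximation rather than the notebook's PyTorch code: `snr_from_singular_values` is a hypothetical helper, the singular values are assumed to be given (no SVD is performed), and quartiles are computed with linear interpolation, as `torch.quantile` does by default.

```python
import math
import statistics


def snr_from_singular_values(singular_values, n, m):
    """Return (threshold, snr) for the singular values of an n x m matrix."""
    # Quartiles with linear interpolation between sorted data points
    q25, _, q75 = statistics.quantiles(singular_values, n=4, method="inclusive")
    sigma = (q75 - q25) / 1.349  # IQR-based estimate of the standard deviation

    beta = min(n, m) / max(n, m)  # aspect ratio, always <= 1
    # Same value as sigma * sqrt((1 + sqrt(beta)) ** 2)
    threshold = sigma * (1 + math.sqrt(beta))

    signal = sum(s for s in singular_values if s > threshold)
    noise = sum(s for s in singular_values if s <= threshold)
    snr = signal / noise if noise != 0 else float("inf")
    return threshold, snr


# Made-up spectrum of an 8 x 8 matrix: two dominant values and a small tail
spectrum = [10.0, 8.0, 1.0, 0.5, 0.4, 0.3, 0.2, 0.1]
threshold, snr = snr_from_singular_values(spectrum, n=8, m=8)
print(f"threshold={threshold:.3f}, snr={snr:.3f}")
```

In this toy spectrum only the two dominant values exceed the threshold, so they form the signal and the remaining six values form the noise.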

---




### **Results**

![SP-Laser Benchmark](../assets/SP-Laser-benchmark.png)

The results derived from HuggingFace's [Enterprise Scenarios Leaderboard](https://huggingface.co/spaces/PatronusAI/enterprise_scenarios_leaderboard) Benchmark indicate distinct performance variations when applying the laser technique in two different ways. There is an observable overall enhancement in the benchmark performance. However, it's important to note that the degree of gains and losses fluctuates across the various benchmarks.

In a general sense, it appears that the performance-based approach, applied on the dataset that the model was trained on, tends to yield superior results. This is closely followed by the perplexity-based laser approach. The base model, in comparison, tends to lag behind, registering the least effective performance.

These observations underscore the potential of laser techniques in improving model performance, although the effectiveness can vary based on the specific approach and dataset used.







### **Future Work**

Many more experiments need to be conducted. However, so far, we have been able to identify significant improvements compared to the perplexity-based approach.

We are curious to see the results others obtain on different models and datasets.