**Pruning Sarvam Model** Kaggle Environment GPU T4

Install Libraries

In this notebook, we will look at an example of depth pruning, which involves removing entire layers from the model.

The first thing to note is that removing entire layers from a transformer model usually has a significant impact on the model's performance. This is a much more drastic architectural change compared to the simple removal of neurons from the MLP layers, as seen in the previous example.

For this reason, these models are not designed to be used directly after the pruning process. Instead, they will require a subsequent fine-tuning process to recover their capabilities.

In [None]:
# Install required libraries
!pip install transformers accelerate datasets lm-eval sacrebleu evaluate torch torchvision torchaudio

In [None]:
!pip install -q lm-eval

In [None]:
!pip install protobuf==3.20.3

In [None]:
import copy
import os
import tempfile

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from datasets import load_dataset
from lm_eval import evaluator, tasks, models
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")


Download Model and study Architecture

In [None]:
# Load model and tokenizer. The model is moved to the selected device so that
# it sits on the same device as the inputs created in get_output().
model = AutoModelForCausalLM.from_pretrained("sarvamai/sarvam-1").to(device)
tokenizer = AutoTokenizer.from_pretrained("sarvamai/sarvam-1")

In [None]:
# Study the model architecture
print(model)

In [None]:
def get_output(prompt, model=model, tokenizer=tokenizer):
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    outputs = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=50,
        num_return_sequences=1,
        pad_token_id=tokenizer.pad_token_id,
        temperature=None,
        top_p=None,
        do_sample=False,          # Disable sampling
        num_beams=5,              # Use beam search
        early_stopping=True,      # Stop when end-of-sequence token is generated
        no_repeat_ngram_size=2    # Prevent repetition of 2-grams
    )
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters())


In [None]:
original_param_count = count_parameters(model)
print(f"Original model parameters: {original_param_count}")

In a Transformer model like this, layers are organized hierarchically: the initial layers (closer to the input) tend to capture basic language patterns (such as syntactic structure, common word combinations, etc.), while the intermediate and final layers refine these representations, capturing higher-order relationships, global coherence, and subtle semantic nuances.

Removing the initial layers directly undermines the foundation upon which more complex representations are built, leading the model to generate meaningless text sequences. Similarly, removing layers based on weight importance metrics (without considering their position or function) can eliminate layers critical for linguistic cohesion or contextual coherence.

On the other hand, removing the final layers, while resulting in a loss of some refinement and specialization capabilities, preserves the initial and middle layers that have already learned fundamental language rules and basic word dependencies.

However, this reasoning is only a simple empirical heuristic. Later, when we evaluate the pruned model on standard benchmarks, we will see that it retains a significant portion of its capabilities. We are therefore dealing with a model that can deliver very good results after a short fine-tuning process to recover some of what was lost.
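To make the two position-based strategies concrete, the following minimal sketch lists which layer indices survive under each one. The helper name `kept_layer_indices` and the 10-layer example are hypothetical and only serve as an illustration; nothing here touches a real model.

```python
def kept_layer_indices(total_layers, num_to_remove, strategy="last"):
    """Return the indices of the layers that remain after depth pruning.

    Hypothetical helper for illustration: it only computes indices.
    """
    if not 0 <= num_to_remove < total_layers:
        raise ValueError("num_to_remove must be in [0, total_layers).")
    if strategy == "last":
        # Drop the layers closest to the output; keep early and middle layers.
        return list(range(total_layers - num_to_remove))
    if strategy == "first":
        # Drop the layers closest to the input; keep the later layers.
        return list(range(num_to_remove, total_layers))
    raise ValueError(f"Unknown strategy: {strategy}")


print(kept_layer_indices(10, 2, "last"))   # [0, 1, 2, 3, 4, 5, 6, 7]
print(kept_layer_indices(10, 2, "first"))  # [2, 3, 4, 5, 6, 7, 8, 9]
```

With the "last" strategy the foundation layers 0 and 1 are preserved, which is the case this notebook argues for.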

In [None]:
# ELIMINATE LAST LAYERS OF THE MODEL.
def prune_last_layers(model, num_layers_to_remove):
    """
    Removes the last 'num_layers_to_remove' layers from the model.

    Args:
    - model: The model from which layers will be pruned.
    - num_layers_to_remove: Number of layers to remove from the top of the stack.

    Returns:
    - model: The pruned model with fewer layers.
    """
    total_layers = len(model.model.layers)

    # Ensure we are not removing more layers than exist
    if num_layers_to_remove >= total_layers:
        raise ValueError("Number of layers to remove is greater or equal to total layers.")

    # Slice the layers to remove the last ones
    new_layers = model.model.layers[:total_layers - num_layers_to_remove]
    model.model.layers = nn.ModuleList(new_layers)

    # Update the model configuration
    model.config.num_hidden_layers = len(model.model.layers)

    return model
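
For comparison, removing layers from the start of the stack could look like the sketch below. `prune_first_layers` here is an assumed counterpart to `prune_last_layers`, not code from the original notebook. It assumes each attention module carries a `layer_idx` attribute (as Llama-style layers in recent transformers versions do for cache indexing) and renumbers it after slicing. The `Toy*` classes are stand-ins so the block runs without downloading any weights.

```python
import torch.nn as nn


class ToyAttention(nn.Module):
    def __init__(self, layer_idx):
        super().__init__()
        self.layer_idx = layer_idx


class ToyDecoderLayer(nn.Module):
    def __init__(self, layer_idx):
        super().__init__()
        self.self_attn = ToyAttention(layer_idx)
        self.mlp = nn.Linear(4, 4)


class ToyConfig:
    def __init__(self, num_hidden_layers):
        self.num_hidden_layers = num_hidden_layers


class ToyModel(nn.Module):
    """Mimics the model.model.layers / model.config layout used above."""

    def __init__(self, num_layers):
        super().__init__()
        self.config = ToyConfig(num_layers)
        self.model = nn.Module()
        self.model.layers = nn.ModuleList(
            [ToyDecoderLayer(i) for i in range(num_layers)]
        )


def prune_first_layers(model, num_layers_to_remove):
    """Assumed sketch: drop the first layers and renumber the remaining ones."""
    total_layers = len(model.model.layers)
    if num_layers_to_remove >= total_layers:
        raise ValueError("Number of layers to remove is greater or equal to total layers.")

    model.model.layers = nn.ModuleList(model.model.layers[num_layers_to_remove:])

    # Keep per-layer indices contiguous (assumption: attention modules expose layer_idx).
    for new_idx, layer in enumerate(model.model.layers):
        if hasattr(layer.self_attn, "layer_idx"):
            layer.self_attn.layer_idx = new_idx

    model.config.num_hidden_layers = len(model.model.layers)
    return model


toy = prune_first_layers(ToyModel(6), 2)
print(len(toy.model.layers), [l.self_attn.layer_idx for l in toy.model.layers])
```

When pruning from the end, as `prune_last_layers` does, the surviving layers already keep indices 0 to N-1, so no renumbering is needed.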

**Prune Loop**


The update_model function converts the pruning percentage into a number of layers and removes that many blocks from the end of the model's Transformer stack. This stack consists of multiple LlamaDecoderLayer blocks, and each of these blocks contains a pair of LlamaSdpaAttention and LlamaMLP components.

In [None]:
def update_model(model, prune_percent):
    """
    Modifies the model by removing entire layers from the end of the stack,
    instead of basing it on importance scores. This is a heuristic approach
    that tries pruning the top layers to see if it yields better results.

    Args:
    - model: Model to prune.
    - prune_percent: Percentage of layers to prune from the top.

    Returns:
    - model: New pruned model with fewer layers.
    """
    # Uncomment to select layers by weight importance instead
    # (prune_layers is not defined in this notebook).
    #model = prune_layers(model, prune_percent)

    total_layers = len(model.model.layers)
    num_layers_to_remove = int(total_layers * prune_percent)

    model = prune_last_layers(model, num_layers_to_remove)
    #model = prune_first_layers(model, num_layers_to_remove)

    # Update the model configuration to reflect the new number of layers
    model.config.num_hidden_layers = len(model.model.layers)
    return model
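
Because `update_model` uses `int()`, the number of removed layers is truncated toward zero. The short sketch below reproduces that arithmetic for a few hypothetical layer counts (these are illustrative values, not the layer count of Sarvam-1); `layers_to_remove` is a name introduced here for the example.

```python
def layers_to_remove(total_layers, prune_percent):
    # Same arithmetic as update_model: int() truncates toward zero.
    return int(total_layers * prune_percent)


# Hypothetical (total_layers, prune_percent) pairs for illustration.
for total, pct in [(28, 0.2), (32, 0.25), (4, 0.2)]:
    print(f"{total} layers at {pct:.0%}: remove {layers_to_remove(total, pct)}")
```

Note the last case: with very few layers, a small percentage can round down to zero and leave the model unchanged.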

**Obtain & test the pruned model.**

In [None]:
prune_percent = 0.2  # Prune 20% of the layers
model = update_model(model, prune_percent)
# Recalculate the number of parameters
pruned_param_count = count_parameters(model)
reduction_in_params = original_param_count - pruned_param_count
percentage_savings = (reduction_in_params / original_param_count) * 100

print(f"Pruned model parameters: {pruned_param_count}")
print(f"Reduction in parameters: {reduction_in_params}")
print(f"Percentage of weight savings: {percentage_savings:.2f}%")
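
The reported saving will usually be lower than `prune_percent`, because the embeddings, the final norm, and the LM head live outside `model.model.layers` and are not pruned. A rough back-of-the-envelope estimate, using made-up parameter counts purely for illustration (`expected_savings_percent` is a hypothetical helper):

```python
def expected_savings_percent(num_layers, layers_removed, params_per_layer, other_params):
    """Estimate the parameter saving when only decoder layers are removed.

    other_params covers everything outside the layer stack
    (embeddings, final norm, LM head).
    """
    total = num_layers * params_per_layer + other_params
    removed = layers_removed * params_per_layer
    return 100.0 * removed / total


# Illustrative numbers only: 10 layers of 100 params each plus 250 other params.
print(expected_savings_percent(10, 2, 100, 250))  # 16.0
```

Here removing 20% of the layers saves only 16% of the parameters, since the 250 non-layer parameters remain.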

**Study Model Structure after pruning**

In [None]:
print(model)

In [None]:
new_model_name = 'depth20-sarvam-1-2b'
# The directory where the model files (config.json, model.safetensors, etc.) will be saved
output_dir = './' + new_model_name

# Create the output directory if it does not already exist
os.makedirs(output_dir, exist_ok=True)

# Save the pruned model and tokenizer to the local directory
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"Pruned Sarvam-1 model saved locally to: {output_dir}")
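
As a quick sanity check, the saved `config.json` should report the reduced layer count. The sketch below reads that field back; `saved_num_layers` is a hypothetical helper, and the demo writes a stand-in config to a temporary directory so it runs without the real model files.

```python
import json
import os
import tempfile


def saved_num_layers(model_dir):
    """Read num_hidden_layers from the config.json in a saved model directory."""
    with open(os.path.join(model_dir, "config.json"), encoding="utf-8") as f:
        return json.load(f)["num_hidden_layers"]


# Demo with a stand-in config; the value 8 is arbitrary.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "config.json"), "w", encoding="utf-8") as f:
        json.dump({"num_hidden_layers": 8}, f)
    print(saved_num_layers(tmp))  # 8
```

On the real output directory, the returned value should equal `len(model.model.layers)` after pruning.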

In [None]:
# Push the model to your Hugging Face repository.
# This requires being authenticated first (for example with huggingface_hub.login()).

model.push_to_hub(new_model_name, private=True)
tokenizer.push_to_hub(new_model_name)

Evaluating the Model

The pruned model is evaluated with lm-eval on three tasks:

LAMBADA – tests the model’s ability to predict the final word of a sentence using long-range context.

BoolQ – a yes/no question-answering task.

ARC Easy – multiple-choice questions that test basic reasoning.

These tasks give a quick snapshot of how pruning affected understanding, reasoning, and language prediction abilities.

In [None]:
def evaluate_local_hf_model(output_dir, tasks=('arc_easy', 'boolq', 'lambada'), num_fewshot=0):
    """
    Evaluates a model saved in a local folder using the 'hf' (Hugging Face) backend
    of lm-eval, by passing the local path to the 'pretrained' argument.

    Args:
    - output_dir: The local directory path (e.g., './depth20-sarvam-1-2b').
    - tasks: The tasks to evaluate on.
    - num_fewshot: Number of few-shot examples per task (0 means zero-shot).

    Returns:
    - Dictionary of evaluation metrics.
    """
    # The 'hf' model type in lm-eval accepts a local path for the 'pretrained' argument.
    # It must be an absolute or relative path to a directory containing the saved model files.
    # float16 reduces memory use during evaluation.
    model_args = f"pretrained={output_dir},device=cuda,dtype=float16"

    print(f"Loading model from local path for evaluation: {output_dir}")

    # Note: Setting limit=None uses the full evaluation set, which can take time.
    results = evaluator.simple_evaluate(
        model="hf",
        model_args=model_args,
        tasks=list(tasks),
        num_fewshot=num_fewshot,
        limit=None,
        bootstrap_iters=10
    )

    metrics = results.get('results', {})
    return metrics



In [None]:
# --- Execution ---
local_path = './depth20-sarvam-1-2b' # This must match the output_dir above
tasks_to_run = ['lambada', 'boolq', 'arc_easy']

# You may want to replace these with Indic-specific benchmarks if Sarvam-1 excels there.
# You will need to check if lm-eval supports them or if they are in the Sarvam AI Hugging Face dataset collections (e.g., 'sarvamai/boolq-indic').

print(f"\n--- Starting Local Evaluation for {local_path} ---")
metrics_pruned = evaluate_local_hf_model(local_path, tasks=tasks_to_run)

print("\n--- Pruned Sarvam-1 Evaluation Metrics ---")
print(metrics_pruned)
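
The raw dictionary is hard to read at a glance. The sketch below flattens it into `(task, metric, value)` rows and skips the standard-error entries. It assumes metric keys of the form `acc,none` and `acc_stderr,none`, as produced by recent lm-eval versions; `summarize_metrics` is a hypothetical helper, and the numbers in `example_metrics` are placeholders, not real results.

```python
def summarize_metrics(metrics):
    """Flatten a {task: {metric: value}} dict into sorted (task, metric, value) rows."""
    rows = []
    for task, values in sorted(metrics.items()):
        for name, value in sorted(values.items()):
            # Keep numeric scores only; skip aliases and stderr entries.
            if isinstance(value, (int, float)) and "stderr" not in name:
                rows.append((task, name, float(value)))
    return rows


# Placeholder values for illustration only (not measured results).
example_metrics = {
    "boolq": {"acc,none": 0.5, "acc_stderr,none": 0.01, "alias": "boolq"},
    "arc_easy": {"acc,none": 0.25, "acc_norm,none": 0.75, "alias": "arc_easy"},
}

for task, name, value in summarize_metrics(example_metrics):
    print(f"{task:10s} {name:15s} {value:.4f}")
```

Calling `summarize_metrics(metrics_pruned)` on the real output, and on the metrics of the unpruned model if those are computed too, gives a compact side-by-side comparison.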