# Multi-Model Benchmarking of Lightweight Open-Source Generative AI Models

## Objective
This notebook evaluates and compares several lightweight, publicly available generative AI models under identical conditions.

The goal is to understand trade-offs in:
1. Instruction following
2. Conversational quality
3. Response verbosity
4. Inference latency

This type of comparison mirrors real-world model selection in GenAI systems.

## Models Evaluated

| Model Name | Architecture | Tuning Type | Reason for Inclusion |
|-----------|-------------|-------------|---------------------|
| BlenderBot-400M | Encoder-Decoder | Dialogue-tuned | Conversational baseline |
| FLAN-T5-Small | Encoder-Decoder | Instruction-tuned | Lightweight instruction model |
| FLAN-T5-Base | Encoder-Decoder | Instruction-tuned | Scale comparison |


## Experimental Setup

1. Same prompts for all models
2. Greedy decoding (no sampling)
3. Max output length: 128 tokens
4. CPU inference
5. No fine-tuning applied

This ensures a fair, controlled comparison.


### Required Packages


In [16]:
%pip install transformers torch sentencepiece numpy pandas




In [17]:
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    BlenderbotTokenizer,
    BlenderbotForConditionalGeneration
)
import pandas as pd



In [18]:
MODELS = {
    "blenderbot": "facebook/blenderbot-400M-distill",
    "flan_t5_small": "google/flan-t5-small",
    "flan_t5_base": "google/flan-t5-base",
}

MAX_TOKENS = 128


## Evaluation Prompts

The following prompts are used for all models.
They test instruction following, reasoning, and conversational ability.


In [19]:
PROMPTS = [
    "Explain overfitting in simple terms.",
    "Write a polite email declining a meeting.",
    "What are the pros and cons of electric cars?",
]


In [20]:
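def load_model(model_name: str):
    """Load a seq2seq checkpoint and its tokenizer from the Hugging Face Hub."""
    if "blenderbot" in model_name.lower():
        # BlenderBot has dedicated tokenizer and model classes.
        tokenizer = BlenderbotTokenizer.from_pretrained(model_name)
        model = BlenderbotForConditionalGeneration.from_pretrained(model_name)
    else:
        # The FLAN-T5 checkpoints resolve through the Auto* classes.
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Inference only: eval() disables dropout.
    model.eval()
    return model, tokenizer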

In [21]:
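def generate_response(model, tokenizer, prompt):
    """Generate one response for `prompt` with greedy decoding."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        # Force greedy decoding: a checkpoint's default generation config
        # may otherwise enable beam search or sampling.
        do_sample=False,
        num_beams=1,
        max_new_tokens=MAX_TOKENS,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)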


In [22]:
results = []

for model_key, model_name in MODELS.items():
    model, tokenizer = load_model(model_name)
    for prompt in PROMPTS:
        output = generate_response(model, tokenizer, prompt)
        results.append({
            "model": model_key,
            "prompt": prompt,
            "output": output
        })
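
Inference latency is one of the stated comparison goals, but the loop above does not record it. The following is a minimal sketch of how a call could be timed; `time_call` and `fake_generate` are hypothetical names introduced for illustration, and the stand-in workload replaces a real model so the snippet runs on its own.

```python
import statistics
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Call fn(*args, **kwargs) `repeats` times.

    Returns (result of the last call, median wall-clock seconds).
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return result, statistics.median(durations)


# Stand-in workload (assumption): lets the sketch run without a model download.
def fake_generate(prompt):
    time.sleep(0.01)
    return prompt.upper()


output, seconds = time_call(fake_generate, "hello", repeats=3)
print(output, f"{seconds:.3f}s")
```

In this notebook, `time_call(generate_response, model, tokenizer, prompt)` would return the response together with its median latency, which could then be stored as an extra field in each `results` entry. The median is used here because the first call to a freshly loaded model is often slower than later ones.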




In [23]:
df = pd.DataFrame(results)
df

| | model | prompt | output |
|---|-------|--------|--------|
| 0 | blenderbot | Explain overfitting in simple terms. | I'm trying to get rid of some of my excess we... |
| 1 | blenderbot | Write a polite email declining a meeting. | I hate when that happens. What did you end u... |
| 2 | blenderbot | What are the pros and cons of electric cars? | Electric cars are so much better than gasolin... |
| 3 | flan_t5_small | Explain overfitting in simple terms. | The overfitting is a way to get the most out o... |
| 4 | flan_t5_small | Write a polite email declining a meeting. | I'm not sure if I'll be able to attend the mee... |
| 5 | flan_t5_small | What are the pros and cons of electric cars? | Electric cars are powered by a single engine. |
| 6 | flan_t5_base | Explain overfitting in simple terms. | Overfitting is the state of being too small or... |
| 7 | flan_t5_base | Write a polite email declining a meeting. | I'm sorry to hear that you are unable to atten... |
| 8 | flan_t5_base | What are the pros and cons of electric cars? | Electric cars are more efficient than fossil f... |
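
Response verbosity is another stated goal, and the displayed outputs are truncated. One hedged way to quantify it is the mean word count per model, computed from the full strings in `results`. The sketch below assumes rows shaped like the `results` entries; `mean_words_per_model` is a hypothetical helper, and the example rows are made up for illustration rather than taken from the run.

```python
from collections import defaultdict


def mean_words_per_model(rows):
    """Return {model: mean whitespace-delimited word count of its outputs}."""
    counts = defaultdict(list)
    for row in rows:
        counts[row["model"]].append(len(row["output"].split()))
    return {model: sum(c) / len(c) for model, c in counts.items()}


# Made-up rows for illustration only; not the notebook's actual outputs.
example_rows = [
    {"model": "model_a", "prompt": "p1", "output": "one two three"},
    {"model": "model_a", "prompt": "p2", "output": "one"},
    {"model": "model_b", "prompt": "p1", "output": "one two three four"},
]
print(mean_words_per_model(example_rows))
```

Calling `mean_words_per_model(results)` on the notebook's own list would give one number per model. Word counts are a rough proxy; token counts from each model's tokenizer would be an alternative, though they are not directly comparable across different tokenizers.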


## Result Analysis

### BlenderBot-400M
1. It produces fluent, conversational responses.
2. It frequently ignores instructions and treats the prompt as a dialogue turn to continue.
3. It performs poorly on task-oriented queries.

**Conclusion:** It is well suited to open-domain chat, but not to instruction-following tasks.

---

### FLAN-T5-Base
- It attempts to follow instructions but often produces incorrect or repetitive outputs.
- It shows limited reasoning depth and instability in longer generations.

**Conclusion:** It recognizes instructions but is unreliable without tuning the decoding settings or the prompts.

---

### FLAN-T5-Small
- It generates shallow, repetitive, or nonsensical responses.
- It fails to explain basic concepts accurately.

**Conclusion:** The model's capacity is insufficient for reliable generative reasoning.
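
The observations about repetitive outputs above are qualitative. As a rough, hedged way to put a number on them, the sketch below computes the fraction of word n-grams in a response that repeat an earlier n-gram. The function name `repeated_ngram_fraction` and the choice of n are assumptions made for illustration; this is not a metric used in the notebook.

```python
def repeated_ngram_fraction(text, n=3):
    """Fraction of word n-grams that duplicate an earlier n-gram in `text`.

    Returns 0.0 when the text has fewer than n words.
    """
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


# Made-up strings for illustration only.
print(repeated_ngram_fraction("a b a b a b", n=2))   # heavily repetitive
print(repeated_ngram_fraction("the cat sat", n=2))   # no repetition
```

A value near 0 means almost every n-gram is new, and values closer to 1 indicate looping text. Applied to each full `output` string in `results`, this would give a per-response repetition score to compare across the three models.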
