
REASON FOR CHOOSING MISTRAL 7B

For fine-tuning on a domain-specific question-answer dataset, Mistral 7B is generally a better fit than BERT, for several reasons. Mistral 7B is a decoder-only language model with 7 billion parameters that its authors report outperforms the larger Llama 2 13B on their evaluated benchmarks, including question answering. It uses Grouped-Query Attention (GQA) for faster inference and a smaller key-value cache, and the original release uses Sliding Window Attention to handle long inputs efficiently. Mistral 7B also supports far longer contexts than BERT, whose inputs are capped at 512 tokens, which matters for complex QA tasks. It has also been reported to perform well on domain-specific applications such as medical question answering, with higher precision than the models it was compared against.

In contrast, BERT was a pioneering model for QA and remains efficient for smaller tasks, but it is older and limited by its short context window and by an encoder-only architecture trained with masked language modeling rather than autoregressive generation. It can extract an answer span from a passage, but it cannot generate a free-form answer.

Thus, as a single recommendation for a relatively small model to fine-tune on domain-specific QA datasets, Mistral 7B is the better choice because of its stronger performance, inference efficiency, and robustness on longer sequences and instruction-following tasks.
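
To make the Grouped-Query Attention point concrete, here is a rough sketch of the key-value cache size per token with and without GQA. The dimensions (32 layers, 32 query heads, 8 key-value heads, head dimension 128) are assumed from the published Mistral 7B configuration, and the helper name `kv_cache_bytes_per_token` is hypothetical.

```python
# Back-of-the-envelope KV-cache size per token, with and without GQA.
# Assumed Mistral 7B dimensions: 32 layers, 32 query heads, 8 KV heads, head_dim 128.

def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one token (16-bit values by default)."""
    # Factor 2 covers keys and values; both are cached in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

N_LAYERS, N_HEADS, N_KV_HEADS, HEAD_DIM = 32, 32, 8, 128

gqa = kv_cache_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM)  # 8 shared KV heads
mha = kv_cache_bytes_per_token(N_LAYERS, N_HEADS, HEAD_DIM)     # one KV head per query head

print(f"GQA: {gqa // 1024} KiB per token")
print(f"MHA: {mha // 1024} KiB per token")
print(f"Reduction: {mha // gqa}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to key-value heads, which is what leaves more GPU memory for long QA contexts.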

In [6]:
!nvidia-smi



Thu Nov 27 18:31:05 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:04:00.0 Off |                  Off |
|  0%   27C    P8             19W /  450W |       4MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
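import torch
import bitsandbytes as bnb  # imported only to confirm that bitsandbytes is installed

print(torch.__version__)
print(torch.cuda.is_available())
# Query the device name only when CUDA is present; get_device_name raises otherwise.
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))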



2.4.1+cu121
True
NVIDIA GeForce RTX 4090
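
Before loading the model, it helps to estimate whether the weights fit on this GPU. The sketch below is only a rough budget: it assumes a parameter count of about 7.24 billion, takes the 24564 MiB total from the `nvidia-smi` output, and ignores activations, optimizer state, quantization constants, and any layers kept in higher precision. The helper name `weight_gib` is hypothetical.

```python
# Rough weight-memory budget for a ~7B model on a 24 GiB GPU.
GIB = 1024 ** 3
N_PARAMS = 7.24e9   # approximate Mistral 7B parameter count (assumption)
VRAM_MIB = 24564    # total memory reported by nvidia-smi for the RTX 4090

def weight_gib(n_params, bits_per_param):
    """Memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB

bf16_gib = weight_gib(N_PARAMS, 16)  # unquantized bfloat16 weights
nf4_gib = weight_gib(N_PARAMS, 4)    # 4-bit quantized weights
vram_gib = VRAM_MIB / 1024

print(f"bf16 weights : {bf16_gib:.1f} GiB")
print(f"4-bit weights: {nf4_gib:.1f} GiB")
print(f"GPU memory   : {vram_gib:.1f} GiB")
```

The 4-bit weights take roughly a quarter of the bf16 footprint, which is why 4-bit loading leaves headroom for adapters, gradients, and activations on a single 24 GiB card.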


In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# No pad token is set on this tokenizer, so reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token

# 4-bit settings go in a BitsAndBytesConfig; passing load_in_4bit directly to
# from_pretrained is deprecated. NF4 with bfloat16 compute is the usual QLoRA setup.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

print("Loading model in 4-bit (QLoRA compatible)...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    dtype=torch.bfloat16,  # `dtype` replaces the deprecated `torch_dtype`
    device_map="auto",
)

print("Applying QLoRA adapters...")
# No target_modules given, so PEFT falls back to its defaults for this architecture.
lora_config = LoraConfig(
    r=64,               # rank of the low-rank update matrices
    lora_alpha=16,      # update is scaled by lora_alpha / r
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)

print("✅ Mistral 7B successfully loaded in 4-bit with QLoRA!")


Loading tokenizer...


Loading model in 4-bit (QLoRA compatible)...


Loading checkpoint shards: 100%|██████████| 3/3 [00:09<00:00,  3.15s/it]


Applying QLoRA adapters...
✅ Mistral 7B successfully loaded in 4-bit with QLoRA!
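
To get a feel for how small the adapters are, the sketch below counts the LoRA parameters implied by `r=64`. It assumes PEFT's default target modules for Mistral are `q_proj` and `v_proj`, and that `v_proj` maps 4096 features to 1024 (8 key-value heads of dimension 128). Both are assumptions, not something this notebook confirms. The helper name `lora_params` is hypothetical.

```python
# Count LoRA adapter parameters for rank-64 adapters on q_proj and v_proj.
# Assumptions: default targets are q_proj and v_proj; v_proj is 4096 -> 1024.

def lora_params(in_features, out_features, r):
    """Parameters in one adapter: A is (r x in_features), B is (out_features x r)."""
    return r * (in_features + out_features)

R, ALPHA, N_LAYERS = 64, 16, 32
HIDDEN = 4096
KV_DIM = 1024  # 8 KV heads x head_dim 128 (assumption)

q_per_layer = lora_params(HIDDEN, HIDDEN, R)
v_per_layer = lora_params(HIDDEN, KV_DIM, R)
total = N_LAYERS * (q_per_layer + v_per_layer)
scaling = ALPHA / R  # factor applied to the low-rank update

print(f"q_proj adapter per layer: {q_per_layer:,}")
print(f"v_proj adapter per layer: {v_per_layer:,}")
print(f"Total trainable (assumed targets): {total:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Calling `model.print_trainable_parameters()` on the PEFT model reports the actual count, which is the figure to trust if it differs from this estimate.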


In [5]:
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj