<a href="https://colab.research.google.com/github/DJCordhose/transformers/blob/main/notebooks/Mixtral_8x7B_Instruct_HQQ_T4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mixtral 8x7B Instruct HQQ quantize on T4

Would work on T4, but requres extended RAM of at least 32GB

* https://mistral.ai/news/mixtral-of-experts/
* https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
* https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bitgs8-metaoffload-HQQ
* https://huggingface.co/docs/transformers/v4.42.0/quantization/hqq

### Prerequisites
1. a Huggingface account and an
1. access tokens to put in below while logging in
1. patience: this will take 10-15 minutes to load libs and model and even generation will be really slow

### Why Mixtral
* explicitly tuned for European languages (like French, Italian, German and Spanish)
* performance close to GPT 3.5
* even quantized performance is good
* only uses fraction of parameters at a time
* theoretically able to unload parameters to CPU

### Comparing GPU microarchitectures
* T4/RTX 20: https://en.wikipedia.org/wiki/Turing_(microarchitecture)
* A100/RTX 30: https://en.wikipedia.org/wiki/Ampere_(microarchitecture)
* L4/L40/RTX 40: https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture)
* H100 - variant of RTX 40 consumer line, not available on Colab (yet?): https://en.wikipedia.org/wiki/Hopper_(microarchitecture)
* Future successor to both Hopper and Ada Lovelace: https://en.wikipedia.org/wiki/Blackwell_(microarchitecture)
* comparing GPUs: https://www.reddit.com/r/learnmachinelearning/comments/18gn1b2/choosing_the_right_gpu_for_your_workloads_a_dive/


In [1]:
!nvidia-smi

Sun Jul 14 11:02:39 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   68C    P0              28W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
%%time

# this will take a few minutes
!pip install hqq hqq_aten -q

CPU times: user 32.4 ms, sys: 5.1 ms, total: 37.5 ms
Wall time: 5.22 s


In [3]:
import warnings
warnings.filterwarnings("ignore")

In [4]:
from IPython.display import Markdown

In [5]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) 
Token is valid (permission: read).


execute in a termial 'watch -n 0.5 nvidia-smi' to see the GPU usage and when the model is loaded onto it

In [6]:
%%time

import transformers
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.core.quantize import *

model_id = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bitgs8-metaoffload-HQQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)

HQQLinear.set_backend(HQQBackend.ATEN_BACKPROP)

Fetching 8 files:   0%|          | 0/8 [00:00<?, ?it/s]

100%|██████████| 32/32 [00:00<00:00, 99.90it/s] 
100%|██████████| 32/32 [00:11<00:00,  2.77it/s]

CPU times: user 51.9 s, sys: 24.4 s, total: 1min 16s
Wall time: 2min 3s





In [7]:
!nvidia-smi

Sun Jul 14 11:06:26 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P0              27W /  70W |  12895MiB / 15360MiB |     20%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [8]:
%%time

messages = [
    # Mistral models can not handle system promt roles
    {"role": "user", "content": "You are an English-speaking, competent advisor in the field of statutory health insurance. Answer consice, serious and formal. Answer this question: What is the best German statutory health insurance?"},
]
tokenizer.use_default_system_prompt = False

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    # will run faster because it will not use full 512 tokens
    max_new_tokens=512,
    # max_new_tokens=64,
    # max_new_tokens=16,
    # max_new_tokens=2,
    eos_token_id=terminators,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=False,
    num_beams=1
)
response = outputs[0][input_ids.shape[-1]:]
Markdown(tokenizer.decode(response, skip_special_tokens=True))


CPU times: user 5min 47s, sys: 239 ms, total: 5min 48s
Wall time: 5min 47s


Determining the "best" German statutory health insurance can be subjective and dependent on individual needs and preferences. However, some of the top-rated German statutory health insurance funds according to independent studies and customer satisfaction surveys are:

1. Techniker Krankenkasse (TK)
2. AOK Bundesverband (AOK-BV)
3. DAK-Gesundheit
4. Barmer Ersatzkasse
5. Kaufmännische Krankenkasse - KKH

These funds consistently rank high in terms of services, customer satisfaction, and financial stability. It is recommended that you compare the benefits, services, and premiums of these funds to determine which one best suits your personal needs.

In [9]:
!nvidia-smi

Sun Jul 14 11:12:14 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   77C    P0              61W /  70W |  13279MiB / 15360MiB |     97%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    