<a href="https://colab.research.google.com/github/DJCordhose/practical-llm/blob/main/Assessment_Llama_3_8B_T4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assessment on Quantized LLama 3 8B (T4)

* the "small" version: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
* quantizing the model into 8 bit resolution to make it fit into T4 (16 GB) with a longer context: https://huggingface.co/docs/transformers/v4.42.0/quantization/bitsandbytes
* inference with bitsandbytes is slower than FP16 precision
  * https://huggingface.co/docs/text-generation-inference/conceptual/quantization#quantization-with-bitsandbytes
  * https://huggingface.co/blog/hf-bitsandbytes-integration#is-it-faster-than-native-models

Prerequisites
1. a Huggingface account and an
1. access tokens to put in below while logging in

### Comparing GPU microarchitectures
* T4/RTX 20: https://en.wikipedia.org/wiki/Turing_(microarchitecture)
  * V100 - professional variant of RTX 20 consumer line: https://en.wikipedia.org/wiki/Volta_(microarchitecture)
* A100/RTX 30: https://en.wikipedia.org/wiki/Ampere_(microarchitecture)
* L4/L40/RTX 40: https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture)
  * H100 - professional variant of RTX 40 consumer line, not available on Colab (yet?): https://en.wikipedia.org/wiki/Hopper_(microarchitecture)
* Future successor to both Hopper and Ada Lovelace: https://en.wikipedia.org/wiki/Blackwell_(microarchitecture)
* comparing GPUs: https://www.reddit.com/r/learnmachinelearning/comments/18gn1b2/choosing_the_right_gpu_for_your_workloads_a_dive/


In [2]:
# https://huggingface.co/docs/transformers/v4.43.4/quantization/bitsandbytes?bnb=4-bit
!pip install accelerate transformers --upgrade
!pip install bitsandbytes>=0.39.0



In [3]:
import warnings
warnings.filterwarnings("ignore")

In [4]:
from IPython.display import Markdown

In [5]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) 
Token is valid (permission: read).


In [6]:
!nvidia-smi

Tue Aug  6 07:11:37 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   52C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [7]:
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

In [8]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

***note:*** execute in a termial 'watch -n 0.5 nvidia-smi' to see the GPU usage and when the model is loaded onto it

In [9]:
%%time

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# quantization_config = BitsAndBytesConfig(load_in_8bit=True)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config, # quantized for T4
    # torch_dtype=torch.bfloat16,  # 16-Bit full resolution for everything having 24GB (L4) or more (A100)
    device_map="cuda",
    trust_remote_code=True
)

config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/187 [00:00<?, ?B/s]

CPU times: user 39.5 s, sys: 30.5 s, total: 1min 9s
Wall time: 1min 7s


In [10]:
!nvidia-smi

Tue Aug  6 07:12:47 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P0              25W /  70W |   5993MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [11]:
positive = [
  "With the diagnosis named here, the need for compensation to ensure the basic need is conceivable.",
  "The socio-medical prerequisites for the prescribed aid supply have been met.",
  "Everyday relevant usage benefits have been determined.",
  "Socio-medical indication for the aid is confirmed.",
  "Contraindications have been excluded; there are no contraindications for the use of the requested aid."
]

In [12]:
negative = [
  "No specific findings can be derived from the diagnosis currently named as the basis for the regulation.",
  "According to the service extracts from the health insurance, the insured has already been provided with the functional product requested according to its area of application.",
  "A medically comprehensible explanation as to why the use of an orthopedic aid corresponding to the findings is not sufficient and instead electric foot lifter stimulation for walking would be more appropriate and therefore necessary has not been transmitted.",
  "From an overall view of the information available here, it cannot be seen how the supply of the insured with the product could be justified, nor can the safety of such a supply be confirmed.",
  "A medical justification for why a product not listed in the directory of aids should be used in the present case has not been transmitted."
]

In [13]:
assessment = negative[0]
# assessment = positive[0]

In [14]:
%%time

messages = [
    {"role": "system", "content": "You are an English-speaking, competent expert in the field of statutory health insurance. Answer consice, serious and formal."},
    {"role": "user", "content": f'''
What is the result of the assessment? Is a positive or negative recommendation given? Answer with "Yes" or "No" and then provide a brief justification for your assessment.

# Assessment
{assessment}

'''},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    eos_token_id=terminators,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=False
)
response = outputs[0][input_ids.shape[-1]:]
Markdown(tokenizer.decode(response, skip_special_tokens=True))

The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


CPU times: user 7.76 s, sys: 542 ms, total: 8.31 s
Wall time: 7.62 s


No

The assessment concludes that no specific findings can be derived from the diagnosis, indicating that the diagnosis does not provide sufficient information to support a recommendation.

In [15]:
def eval_assessment(assessment):
  messages = [
    {"role": "system", "content": "You are an English-speaking, competent expert in the field of statutory health insurance. Answer consice, serious and formal."},
    {"role": "user", "content": f'''
What is the result of the assessment? Is a positive or negative recommendation given? Answer with "Yes" or "No" and then provide a brief justification for your assessment.

# Assessment
{assessment}

'''},
]

  input_ids = tokenizer.apply_chat_template(
      messages,
      add_generation_prompt=True,
      return_tensors="pt"
  ).to(model.device)

  terminators = [
      tokenizer.eos_token_id,
      tokenizer.convert_tokens_to_ids("<|eot_id|>")
  ]

  outputs = model.generate(
      input_ids,
      max_new_tokens=512,
      eos_token_id=terminators,
      pad_token_id=tokenizer.eos_token_id,
      do_sample=False
  )
  response = outputs[0][input_ids.shape[-1]:]
  result = tokenizer.decode(response, skip_special_tokens=True)
  if result.startswith("Yes"):
    return "Positive", result
  elif result.startswith("No"):
    return "Negative", result
  else:
    return "Neutral", result

## Negative

In [16]:
%%time

for assessment in negative:
  print(f"Assessment: {assessment}")
  result, explanation = eval_assessment(assessment)
  print(f"{result}: {explanation}")
  print("-----")

Assessment: No specific findings can be derived from the diagnosis currently named as the basis for the regulation.
Negative: No

The assessment concludes that no specific findings can be derived from the diagnosis, indicating that the diagnosis does not provide sufficient information to support a recommendation.
-----
Assessment: According to the service extracts from the health insurance, the insured has already been provided with the functional product requested according to its area of application.
Positive: Yes

The assessment concludes that the insured has already received the requested functional product, which meets the requirements of its area of application. Therefore, a positive recommendation is given, indicating that the insured has already received the necessary treatment or service.
-----
Assessment: A medically comprehensible explanation as to why the use of an orthopedic aid corresponding to the findings is not sufficient and instead electric foot lifter stimulation fo

## Positive

In [17]:
%%time

for assessment in positive:
  print(f"Assessment: {assessment}")
  result, explanation = eval_assessment(assessment)
  print(f"{result}: {explanation}")
  print("-----")

Assessment: With the diagnosis named here, the need for compensation to ensure the basic need is conceivable.
Positive: Yes

The assessment indicates that the diagnosis is related to a basic need, which is a fundamental requirement for a person's well-being and health. As a result, a positive recommendation is given to ensure that the necessary compensation is provided to meet this basic need.
-----
Assessment: The socio-medical prerequisites for the prescribed aid supply have been met.
Positive: Yes

The socio-medical prerequisites for the prescribed aid supply have been met, indicating that the individual meets the necessary criteria for the treatment or therapy, and a positive recommendation is given.
-----
Assessment: Everyday relevant usage benefits have been determined.
Positive: Yes

The assessment concludes that the benefits are everyday relevant, indicating that the health insurance plan provides coverage for common, routine medical expenses that are essential for maintaining 

In [18]:
!nvidia-smi

Tue Aug  6 07:13:37 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   74C    P0              69W /  70W |   6431MiB / 15360MiB |     98%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    