# Overview

Note: This notebook is not supported in the Kaggle environment because of hardware limits. Marlin requires a GPU with compute capability 8.0 or higher, which is what the check in cell 2 tests for. For more detail about Marlin, see 

We will use auto_gptq to convert a GPTQ-quantized LLM to the Marlin format.

In [1]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2023 NVIDIA Corporation

Built on Mon_Apr__3_17:16:06_PDT_2023

Cuda compilation tools, release 12.1, V12.1.105

Build cuda_12.1.r12.1/compiler.32688072_0


In [2]:
import torch

major, minor=torch.cuda.get_device_capability()

if major>=8:
    print(f"COOL {major}")
else:
    print("GG")

COOL 8
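
The check above compares only the major version. As a minimal sketch, the same test can be wrapped in a small helper that compares the full `(major, minor)` tuple. The name `supports_marlin` and the `MIN_CAPABILITY` constant are hypothetical, chosen for illustration from the requirement stated in the overview; they are not part of auto_gptq.

```python
# Minimal sketch: supports_marlin is a hypothetical helper, not an auto_gptq API.
MIN_CAPABILITY = (8, 0)  # assumed threshold, taken from the overview note


def supports_marlin(capability):
    """Return True if a (major, minor) compute capability meets the minimum."""
    return tuple(capability) >= MIN_CAPABILITY


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability()
        print(cap, supports_marlin(cap))
    else:
        print("No CUDA device visible; cannot check compute capability.")
```

Comparing tuples keeps the threshold in one place if a different minimum is ever needed.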


In [3]:
%%capture
!pip install --upgrade transformers auto-gptq accelerate optimum

In [4]:
!pip list |grep auto_gptq

auto_gptq                 0.8.0.dev0+cu121 /home/ec2-user/SageMaker/AutoGPTQ
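
The installed version can also be read from Python instead of shelling out to `pip list`. The sketch below uses the standard-library `importlib.metadata`; the helper name `installed_version` is hypothetical, and the distribution name `auto_gptq` is assumed to match what `pip list` printed above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string, or None if the package is not installed.
print(installed_version("auto_gptq"))
```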


# Converting to Marlin's format

In [5]:
from transformers import AutoTokenizer

model_id="TheBloke/Llama-2-7B-Chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)

In [6]:
from auto_gptq import AutoGPTQForCausalLM

marlin_model=AutoGPTQForCausalLM.from_quantized(model_id, device_map="auto", use_triton=False, inject_fused_attention=False, inject_fused_mlp=False, disable_exllama=True, disable_exllamav2=True, use_marlin=True)

INFO - `checkpoint_format` is missing from the quantization configuration and is automatically inferred to gptq.

INFO - The layer lm_head is not quantized.

Overriding QuantLinear layers to use Marlin's QuantLinear...: 100%|██████████| 454/454 [00:23<00:00, 19.24it/s]

The safetensors archive passed at /home/ec2-user/.cache/huggingface/assets/auto_gptq/TheBloke/Llama-2-7B-Chat-GPTQ/autogptq_model_gptq_marlin.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.


# Saving the model in 4-bit

In [7]:
save_dir="Llama-2-7B-Chat-GPTQ-marlin-4bit"
marlin_model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)



Non-default generation parameters: {'max_length': 4096}


('Llama-2-7B-Chat-GPTQ-marlin-4bit/tokenizer_config.json',
 'Llama-2-7B-Chat-GPTQ-marlin-4bit/special_tokens_map.json',
 'Llama-2-7B-Chat-GPTQ-marlin-4bit/tokenizer.model',
 'Llama-2-7B-Chat-GPTQ-marlin-4bit/added_tokens.json',
 'Llama-2-7B-Chat-GPTQ-marlin-4bit/tokenizer.json')

# Pushing to Hub


Create a notebook.cfg file in the same directory as the notebook.
```ini
[SECRETS]
HF_TOKEN: <token>
```

In [8]:
from configparser import ConfigParser

parser=ConfigParser()
_,=parser.read('notebook.cfg')

HF_TOKEN=parser.get('SECRETS','HF_TOKEN')
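
`ConfigParser.read` returns the list of files it managed to parse, so the one-element unpacking in the cell above fails with a generic `ValueError` when `notebook.cfg` is missing. A minimal sketch of a more explicit variant follows; the function name `load_hf_token` is hypothetical, and the section and key names are the ones used in this notebook.

```python
from configparser import ConfigParser


def load_hf_token(path="notebook.cfg"):
    """Read HF_TOKEN from the [SECRETS] section of an INI-style config file."""
    parser = ConfigParser()
    # read() silently skips unreadable files and returns the ones it parsed.
    if not parser.read(path):
        raise FileNotFoundError(f"Could not read config file: {path}")
    return parser.get("SECRETS", "HF_TOKEN")
```

Calling `load_hf_token()` with no arguments reads `notebook.cfg` from the current working directory, matching the cell above.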

In [9]:
from huggingface_hub import login

login(token=HF_TOKEN)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.

Token is valid (permission: write).

Your token has been saved to /home/ec2-user/.cache/huggingface/token

Login successful


In [10]:
repo_id = f"aisuko/{save_dir}"
commit_message = f"{save_dir}"
marlin_model.push_to_hub(repo_id, use_safetensors=True, commit_message=commit_message, use_auth_token=True)

INFO - Uploading the following files to aisuko/Llama-2-7B-Chat-GPTQ-marlin-4bit: quantize_config.json,special_tokens_map.json,tokenizer_config.json,model.safetensors,tokenizer.json,config.json,tokenizer.model


model.safetensors:   0%|          | 0.00/3.86G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/aisuko/Llama-2-7B-Chat-GPTQ-marlin-4bit/commit/b245d787b0b89835aa6b01fd25fd30df206019b8', commit_message='Llama-2-7B-Chat-GPTQ-marlin-4bit', commit_description='', oid='b245d787b0b89835aa6b01fd25fd30df206019b8', pr_url=None, pr_revision=None, pr_num=None)

# Checking model

In [11]:
from auto_gptq.nn_modules.qlinear.qlinear_marlin import QuantLinear as MarlinQuantLinear

has_marlin = False
for name, module in marlin_model.named_modules():
    if isinstance(module, MarlinQuantLinear):
        has_marlin = True
        break
print(has_marlin)

True
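
The loop above stops at the first Marlin layer it finds. To see how many layers were actually replaced, the same `isinstance` test can be turned into a count. This is a minimal sketch: `count_modules_of_type` is a hypothetical helper, and the stand-in classes exist only so the example runs without a GPU or auto_gptq installed.

```python
def count_modules_of_type(named_modules, module_type):
    """Count (name, module) pairs whose module is an instance of module_type."""
    return sum(1 for _, module in named_modules if isinstance(module, module_type))


# Stand-in classes for illustration only; in the notebook, pass
# marlin_model.named_modules() and MarlinQuantLinear instead.
class FakeMarlinLinear:
    pass


class FakeOtherLayer:
    pass


demo_modules = [
    ("layers.0.q_proj", FakeMarlinLinear()),
    ("layers.0.norm", FakeOtherLayer()),
    ("layers.0.k_proj", FakeMarlinLinear()),
]
print(count_modules_of_type(demo_modules, FakeMarlinLinear))
```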


# Inference

In [12]:
prompt = "I am in Melbourne and"

inp = tokenizer(prompt, return_tensors="pt").to('cuda')

marlin_model.to('cuda')
res = marlin_model.generate(**inp, num_beams=1, min_new_tokens=60, max_new_tokens=60)

predicted_text = tokenizer.decode(res[0])
print(predicted_text)

<s> I am in Melbourne and I am looking for a good restaurant to have dinner at.

I am open to any cuisine, but I am particularly interested in trying some of the local specialties.

Can you recommend some good restaurants in Melbourne that serve local food?

I would love to hear your suggestions!


# ML Libraries Version

In [13]:
!pip freeze

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...


	- Avoid using `tokenizers` before the fork if possible

	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


accelerate==0.28.0

aiohttp==3.9.3

aiosignal==1.3.1

aniso8601==9.0.1

annotated-types @ file:///home/conda/feedstock_root/build_artifacts/annotated-types_1696634205638/work

ansi2html==1.9.1

anyio @ file:///home/conda/feedstock_root/build_artifacts/anyio_1708355285029/work

argon2-cffi @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi_1692818318753/work

argon2-cffi-bindings @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi-bindings_1695386546427/work

arrow @ file:///home/conda/feedstock_root/build_artifacts/arrow_1696128962909/work

asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1698341106958/work

async-lru @ file:///home/conda/feedstock_root/build_artifacts/async-lru_1690563019058/work

async-timeout==4.0.3

attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1704011227531/work

-e git+https://github.com/PanQiWei/AutoGPTQ.git@866b4c8c2cbb893f1156cb6c114625bba2e4d7c5#egg=auto_gptq

autovizwidget==0.21.0

awscli==1.3