## Hugging Face CLI

In [None]:
import getpass
print("Enter your Hugging Face token:")
TOKEN = getpass.getpass()

In [None]:
!git config --global credential.helper store
!huggingface-cli login --token $TOKEN --add-to-git-credential



## Import the modules


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

## Unquantized Llama 3.1

### Load the model

In [None]:
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# device_map="auto" places the layers on the available devices automatically.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

### Data type of model's parameters

In [None]:
param_dtypes = [param.dtype for param in model.parameters()]
print("Parameter dtypes:", param_dtypes)
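
Printing one dtype per parameter tensor produces a very long list for a model of this size. As a minimal sketch, the dtypes can be tallied instead. `summarize_dtypes` is a hypothetical helper name, not part of `transformers`; it only assumes that each parameter exposes a `.dtype` attribute.

```python
from collections import Counter


def summarize_dtypes(params):
    """Count how many parameter tensors use each dtype.

    `params` is any iterable of objects with a `.dtype` attribute,
    for example the iterator returned by `model.parameters()`.
    """
    return Counter(str(p.dtype) for p in params)
```

With a loaded model, `print(summarize_dtypes(model.parameters()))` would show one line of counts per dtype rather than one entry per tensor.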

### Memory footprints

In [None]:
print(model.get_memory_footprint())
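
`get_memory_footprint()` reports a size in bytes, which is hard to read at this scale. The sketch below converts bytes to GiB and gives a rough back-of-the-envelope estimate for comparison. The helper names and the parameter count of roughly 8 billion are assumptions made for illustration; the real footprint also includes buffers and may differ.

```python
def bytes_to_gib(num_bytes):
    """Convert a byte count to gibibytes (1 GiB = 1024**3 bytes)."""
    return num_bytes / 1024**3


def estimate_weight_bytes(num_params, bytes_per_param):
    """Rough weight size: parameter count times storage width per parameter."""
    return num_params * bytes_per_param


# Assumption: about 8 billion parameters (approximate, for illustration only).
APPROX_PARAMS = 8_000_000_000

for label, width in [("float32", 4), ("float16/bfloat16", 2), ("int8", 1)]:
    size_gib = bytes_to_gib(estimate_weight_bytes(APPROX_PARAMS, width))
    print(f"{label:>16}: ~{size_gib:.1f} GiB")
```

With a loaded model, `print(f"{bytes_to_gib(model.get_memory_footprint()):.2f} GiB")` gives the measured value to compare against these estimates. A quantized model typically lands somewhat above the 1-byte-per-parameter estimate, because not every tensor is stored in 8-bit.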

### Inference of unquantized model

Jupyter notebook sessions on our platform time out after 15 minutes, while the cell below takes around 45-50 minutes to produce its output.
The code is therefore commented out. To see the response, uncomment it and run it on a GPU-enabled machine.

In [None]:
#tokenizer = AutoTokenizer.from_pretrained(model_name)
#inputs = tokenizer("Portugal is", return_tensors="pt").to("cuda")

#response = model.generate(**inputs, max_new_tokens=50)
#print(tokenizer.batch_decode(response, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['Portugal is a country located in southwestern Europe, bordered by Spain to the east and north, and the Atlantic Ocean to the west and south. It has a long history of colonialism and trade, and its culture reflects its rich heritage. Here are some key facts']


## Cleaning the memory

Before loading the quantized model, we delete the unquantized model and release the cached GPU memory so that the two models never have to fit in memory at the same time.

In [None]:
import gc
del model
gc.collect()
torch.cuda.empty_cache()

## Implementing 8-bit quantization

### BitsAndBytes configuration

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True
)
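
To give an intuition for what storing weights in 8 bits means, here is a minimal sketch of absmax quantization in plain Python. It is a simplified illustration, not the bitsandbytes implementation: the library's 8-bit path scales per vector and handles outlier values separately. The function names `quantize_absmax` and `dequantize` are hypothetical.

```python
def quantize_absmax(values):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.03, 1.0]
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)

for original, code, approx in zip(weights, codes, restored):
    print(f"{original:+.4f} -> {code:+4d} -> {approx:+.4f}")
```

Each weight is stored as a one-byte integer plus a shared scale, and the restored value differs from the original by at most half a quantization step. This rounding error is the price paid for the smaller memory footprint.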

### Load the model

In [None]:
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# Passing quantization_config makes bitsandbytes load the weights in 8-bit.
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

### Data type of model's parameters

In [None]:
param_dtypes = [param.dtype for param in quantized_model.parameters()]
print("Parameter dtypes:", param_dtypes)

### Memory footprints

In [None]:
print(quantized_model.get_memory_footprint())

### Inference of 8-bit quantized model

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Named `inputs` to avoid shadowing Python's built-in `input` function.
inputs = tokenizer("Portugal is", return_tensors="pt").to("cuda")

response = quantized_model.generate(**inputs, max_new_tokens=50)
print(tokenizer.batch_decode(response, skip_special_tokens=True))