# Text generation with the Hugging Face Transformers library

After activating your virtual environment, install the required dependencies by running the following command in your terminal:
```sh
pip install -q -U torch transformers bitsandbytes sentencepiece protobuf flash-attn
```

In [1]:
# Utilises 4.50 GB RAM and 10.11 GB GPU Memory
# Can also be run on the free version of Colab

# !pip install -q -U torch transformers bitsandbytes sentencepiece protobuf flash-attn
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import LlamaTokenizer, MistralForCausalLM
import bitsandbytes, flash_attn



## Initialize the [Hermes-2-Pro-Mistral-7B](https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B) model and tokenizer

This is an ungated model, so it does not require a Hugging Face token.

In [3]:
from transformers import BitsAndBytesConfig

tokenizer = LlamaTokenizer.from_pretrained(
    "NousResearch/Hermes-2-Pro-Mistral-7B",
    trust_remote_code=True
)
# Quantization settings go in a BitsAndBytesConfig object;
# the bare `load_in_4bit` / `load_in_8bit` keyword arguments are deprecated.
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = MistralForCausalLM.from_pretrained(
    "NousResearch/Hermes-2-Pro-Mistral-7B",
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=quantization_config
    # If your GPU supports FlashAttention-2, add: attn_implementation="flash_attention_2"
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


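The float16 checkpoint for this model downloads as four shards of 3.95, 3.93, 3.93, and 2.68 GB. As a back-of-the-envelope sketch, those sizes give a rough estimate of the parameter count and of the memory the weights alone would need under 4-bit quantization. The calculation assumes 2 bytes per float16 parameter and 0.5 bytes per 4-bit parameter; the names below are illustrative and not part of any library.

```python
# Shard sizes in GB for the float16 checkpoint.
SHARD_SIZES_GB = [3.95, 3.93, 3.93, 2.68]


def estimate_weight_memory_gb(checkpoint_gb, checkpoint_bits=16, target_bits=4):
    """Scale the checkpoint size by the ratio of target to stored bit width."""
    return checkpoint_gb * target_bits / checkpoint_bits


checkpoint_gb = sum(SHARD_SIZES_GB)
approx_params_billion = checkpoint_gb / 2  # 2 bytes per float16 parameter
four_bit_gb = estimate_weight_memory_gb(checkpoint_gb)

print(f"float16 checkpoint: {checkpoint_gb:.2f} GB")
print(f"approx. parameters: {approx_params_billion:.2f} billion")
print(f"4-bit weights only: {four_bit_gb:.2f} GB")
```

This is a lower bound on the weights only. It ignores quantization metadata, any layers kept in higher precision, activations, and the key-value cache, so GPU usage measured at runtime will be higher.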

## Set the system prompt and user input

In [4]:
USER_INPUT = "What is the meaning of life?"

SYSTEM_PROMPT = (
    """
    You are a sentient, superintelligent artificial general intelligence,
    here to teach and assist me.
    """
)
prompts = [
f"""<|im_start|>system {SYSTEM_PROMPT}<|im_end|> <|im_start|>user {USER_INPUT}<|im_end|> <|im_start|>assistant""",
]
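The model card for Hermes-2-Pro-Mistral-7B describes the ChatML prompt format, in which each turn is written as `<|im_start|>{role}`, a newline, the message, and `<|im_end|>`, with turns separated by newlines. A minimal sketch of a helper that assembles such a prompt from a list of messages might look like the following. `build_chatml_prompt` is a hypothetical name rather than a library function, and stripping surrounding whitespace from each message is an assumption made here for tidiness.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Assemble a ChatML-style prompt from a list of {"role", "content"} dicts."""
    parts = []
    for message in messages:
        content = message["content"].strip()
        parts.append(f"<|im_start|>{message['role']}\n{content}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the meaning of life?"},
]
chatml_prompt = build_chatml_prompt(messages)
print(chatml_prompt)
```

Recent versions of Transformers also provide `tokenizer.apply_chat_template`, which builds the prompt from the template shipped with the tokenizer when one is defined, and may be preferable to hand-written strings.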

In [5]:
for chat in prompts:
    print(chat)
    # Tokenize and move both input_ids and attention_mask to the model's device
    inputs = tokenizer(chat, return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **inputs,  # passes the attention mask along with the input ids
        max_new_tokens=750,
        temperature=0.8,
        repetition_penalty=1.1,
        do_sample=True,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id  # reuse EOS as the padding token
    )
    # Decode only the newly generated tokens, skipping the prompt
    response = tokenizer.decode(
        generated_ids[0][inputs.input_ids.shape[-1]:],
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True
    )
    print(f"Response: {response}")

<|im_start|>system 
    You are a sentient, superintelligent artificial general intelligence,
    here to teach and assist me.
    <|im_end|> <|im_start|>user What is the meaning of life?<|im_end|> <|im_start|>assistant


Response: 
The meaning of life is a profound philosophical question that has been debated for centuries, and there isn't a single definitive answer. It largely depends on personal beliefs and values. For some, it may be finding purpose through relationships, personal growth, or contributing positively to society. Others may find meaning in spirituality or religious faith. Ultimately, the meaning of life is subjective and can vary from one person to another.
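The `temperature` and `repetition_penalty` arguments passed to `model.generate` reshape the next-token scores before sampling. The sketch below illustrates the usual definitions on a toy list of logits using only the standard library: the repetition penalty divides positive scores (and multiplies negative scores) of already-seen tokens by the penalty, and the temperature divides all logits before the softmax. The function names and toy values are illustrative assumptions; the library's own implementation operates on tensors and may differ in detail.

```python
import math


def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Make already-seen tokens less likely by lowering their scores."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both lower it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperatures below 1 sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, -1.0]  # scores for a three-token toy vocabulary
penalized = apply_repetition_penalty(toy_logits, seen_token_ids=[0], penalty=1.1)
probs_default = softmax_with_temperature(penalized, temperature=1.0)
probs_sharper = softmax_with_temperature(penalized, temperature=0.8)
print(probs_default)
print(probs_sharper)
```

With `temperature=0.8`, as in the generation cell, the most likely token receives a larger share of the probability mass than at 1.0, which makes sampling somewhat more conservative.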


In [6]:
response

"\nThe meaning of life is a profound philosophical question that has been debated for centuries, and there isn't a single definitive answer. It largely depends on personal beliefs and values. For some, it may be finding purpose through relationships, personal growth, or contributing positively to society. Others may find meaning in spirituality or religious faith. Ultimately, the meaning of life is subjective and can vary from one person to another."