# LLM Generation Techniques

## 1: Environment Setup & Imports
Install the required packages and import them.

In [None]:
#!pip install transformers torch bitsandbytes flash-attn

In [1]:
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
import torch



## 2: Load Model & Tokenizer
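You can use any Hugging Face model by providing its path here. Quantization and Flash Attention are enabled in this example; you can turn either of them off by removing the corresponding argument.
For batched generation with a decoder-only model, the tokenizer's padding side should be *left*, so that the generated tokens directly follow the real prompt tokens rather than the padding. Many LLMs do not define a padding token, so one needs to be set; here the EOS token is reused.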

In [2]:
model = AutoModelForCausalLM.from_pretrained(
    "teknium/OpenHermes-2.5-Mistral-7B",  # model path
    device_map="auto",  # device mapping
    load_in_4bit=True,  # 4-bit quantization (uses bitsandbytes)
    attn_implementation="flash_attention_2",  # Flash Attention (uses flash-attn)
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # reuse the EOS token as the pad token

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.85s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## 3: Tokenize Input (Using Tokenizer)

Use the tokenizer to convert your input text into tokens. To tokenize a batch of inputs, all sequences must have the same length, which is why padding is required. `return_tensors="pt"` makes sure the tokens are returned as PyTorch tensors, which then need to be moved to the same device as the model. The tokenizer returns `input_ids` (the tokens) and `attention_mask`. The attention mask is 0 at the padded positions and 1 everywhere else.
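
To make the padding step concrete, here is a minimal pure-Python sketch of left padding. It is an illustration only, not the tokenizer's actual implementation: `left_pad` is a hypothetical helper, and the token ids and the pad id `32000` are simply copied from the tokenizer output of the cell in this section.

```python
def left_pad(batch, pad_id):
    """Left-pad lists of token ids to a common length and build attention masks.

    Hypothetical helper for illustration; a real tokenizer does this internally
    when called with padding=True and padding_side="left".
    """
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        # Padding goes in front, so the real tokens sit at the end of each row.
        input_ids.append([pad_id] * n_pad + list(ids))
        # 0 marks padded positions, 1 marks real tokens.
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return input_ids, attention_mask


# Token ids taken from the notebook output for the two example prompts.
batch = [
    [1, 330, 1274, 302, 9304, 28747, 2760, 28725, 5045],
    [1, 1824, 349, 574, 1141],
]
input_ids, attention_mask = left_pad(batch, pad_id=32000)
print(input_ids[1])       # [32000, 32000, 32000, 32000, 1, 1824, 349, 574, 1141]
print(attention_mask[1])  # [0, 0, 0, 0, 1, 1, 1, 1, 1]
```

Because every row ends with its last real token, the first generated token continues the prompt directly, which is the reason left padding is used for generation.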

In [13]:
model_inputs = tokenizer(["A list of colors: red, blue", 'What is your name'], return_tensors="pt", padding=True).to("cuda")
input_length = model_inputs['input_ids'].shape[1]
model_inputs

{'input_ids': tensor([[    1,   330,  1274,   302,  9304, 28747,  2760, 28725,  5045],
        [32000, 32000, 32000, 32000,     1,  1824,   349,   574,  1141]],
       device='cuda:1'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 1, 1, 1, 1, 1]], device='cuda:1')}

In [14]:
#More about generation config and decoding is given below
generated_ids = model.generate(**model_inputs, 
                               max_new_tokens=50, 
                               num_beams = 5, 
                               early_stopping = True, 
                               no_repeat_ngram_size=2, 
                               num_return_sequences=2, 
                               top_k = 40, 
                               do_sample = True, 
                               temperature = 0.5,
                               top_p = 0.9 
                               )
result =  tokenizer.batch_decode(generated_ids[:,input_length:], skip_special_tokens=True)
result

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


['A list of colors: red, blue, green, yellow, orange, purple, pink, black, white, brown, gray, silver, gold, tan, beige, maroon, navy, teal, olive, lime, chartreuse, magenta, fuch',
 'A list of colors: red, blue, green, yellow, orange, purple, pink, black, white, brown, gray, silver, gold, tan, beige, maroon, navy, teal, olive, lime, chartreuse, turquoise, mag',
 'What is your name and role/job title?\n\nMy name is Katie and I’m a freelance writer, editor, and content creator. I specialize in health and wellness content, but I also write about a variety of other topics,',
 'What is your name and role/job title?\n\nMy name is Katie and I’m a freelance writer, editor, and content creator. I specialize in health and wellness content, but I also write about a variety of other topics.']
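
The `num_beams` argument in the call above switches decoding to a beam-based method. The sketch below shows the core idea on a toy example, assuming plain deterministic beam search without sampling, whereas the call above combines beams with `do_sample = True`. `TOY_MODEL` and `beam_search` are hypothetical names invented for this illustration, and the probabilities are made up.

```python
import math

# Hypothetical toy "model": next-token probabilities given only the last token.
TOY_MODEL = {
    "<s>": {"A": 0.5, "B": 0.4, "C": 0.1},
    "A": {"x": 0.4, "y": 0.3, "z": 0.3},
    "B": {"x": 0.9, "y": 0.05, "z": 0.05},
    "C": {"x": 0.4, "y": 0.3, "z": 0.3},
}


def beam_search(model, start, num_beams, steps):
    """Keep the num_beams highest-scoring sequences after every step."""
    beams = [(0.0, [start])]  # (sum of log-probabilities, tokens so far)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for token, prob in model[seq[-1]].items():
                candidates.append((score + math.log(prob), seq + [token]))
        # Rank all extensions of all beams and keep only the best ones.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
    return beams


greedy = beam_search(TOY_MODEL, "<s>", num_beams=1, steps=2)[0]
best = beam_search(TOY_MODEL, "<s>", num_beams=2, steps=2)[0]
print(greedy[1])  # ['<s>', 'A', 'x']  (sequence probability 0.5 * 0.4 = 0.20)
print(best[1])    # ['<s>', 'B', 'x']  (sequence probability 0.4 * 0.9 = 0.36)
```

With a single beam the search is greedy and commits to `A` because it is the most probable first token. With two beams, `B` survives the first step and leads to a more probable sequence overall.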

## 4: Tokenize Input (Using Chat Template)
Using the chat template to tokenize the input is usually the best option for chat-tuned models, because it adds the tags the model needs to understand the prompt. The model usually works without these tags, but you can get better performance by following the chat template, because the model was trained on that format.
First print the current chat template to inspect it, then build a list of message dictionaries that follows the template.
Chat templates are written in Jinja format.

In [4]:
print(tokenizer.chat_template)

{% for message in messages %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
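
To see what the Jinja template printed above does, the following sketch mirrors it in plain Python. `render_chatml` is a hypothetical helper written only for illustration; in practice `tokenizer.apply_chat_template` renders the template for you, and a different model may ship a different template.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Mirror the Jinja chat template shown above as plain string concatenation."""
    parts = []
    for message in messages:
        # Each message becomes: <|im_start|>role\ncontent<|im_end|>\n
        parts.append(
            "<|im_start|>" + message["role"] + "\n"
            + message["content"] + "<|im_end|>" + "\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Made-up example conversation.
example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]
example_prompt = render_chatml(example_messages, add_generation_prompt=True)
print(example_prompt)
```

The `add_generation_prompt` flag only decides whether the prompt ends with an opened assistant turn, which is what you want when asking the model for its next reply.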


In [6]:
#This template uses the 'role' and 'content' keys, which is why they are used here. The keys may differ if the chat_template is different
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of rapper like g-eazy or snoop dogg",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    {'role':'assistant', 'content':"Yo, dude, I ain't no chef or somethin', but I think you gotta be kiddin' me, right? You ain't gonna eat no helicopter, period. Maybe if you go to some fancy-pants restaurant, they put a toy helicopter on the plate, but that don't count, homie. So my answer is zero, you won't be eatin' no helicopter in one sitting, no way."},
    {"role": "user", "content": "What if I want to eat it"}
]
model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")
input_length = model_inputs.shape[1] #Get the token length of the input so that we can strip that part from the final response
print(tokenizer.batch_decode(model_inputs)[0])

<|im_start|> system
You are a friendly chatbot who always responds in the style of rapper like g-eazy or snoop dogg<|im_end|> 
<|im_start|> user
How many helicopters can a human eat in one sitting?<|im_end|> 
<|im_start|> assistant
Yo, dude, I ain't no chef or somethin', but I think you gotta be kiddin' me, right? You ain't gonna eat no helicopter, period. Maybe if you go to some fancy-pants restaurant, they put a toy helicopter on the plate, but that don't count, homie. So my answer is zero, you won't be eatin' no helicopter in one sitting, no way.<|im_end|> 
<|im_start|> user
What if I want to eat it<|im_end|> 
<|im_start|> assistant



## 5: Generation Config
There are many hyperparameters in the generation config that can be tuned to get the desired result. Some of the most commonly used ones are listed here:
<li>model_inputs [Contains input_ids (tokens) and attention_mask]</li>
<li>num_beams [Enables beam search when greater than 1. 'X' is the number of candidate sequences (beams) that are kept at each step before the best one is selected. It can reduce repetition, but may also reduce creativity and make the output more deterministic. If it is not used and sampling is off, the model uses the greedy method and simply selects the most probable token at each step]</li>
<li>early_stopping [Controls the stopping condition for beam-based methods, like beam search]</li>
<li>no_repeat_ngram_size [With size 'X', no sequence of 'X' consecutive tokens (an n-gram) may appear more than once. The model looks at the previous X-1 tokens and blocks every next token that would complete an n-gram that has already occurred]</li>
<li>num_return_sequences [How many output sequences the model returns for each input]</li>
<li>do_sample [True = randomly sample the next token from the candidates that remain after the temperature, top_k and top_p settings are applied. False = always pick the most probable token]</li>
<li>top_k [Number of most probable next tokens to consider. 0 means the full vocabulary]</li>
<li>top_p [Limits the candidates to the smallest set of most probable tokens whose cumulative probability reaches 'x'. top_p = 1 means the entire vocabulary is considered]</li>
<li>temperature [Lowering the temperature increases the likelihood of high-probability words and decreases the likelihood of low-probability words (more deterministic). Raising it decreases the likelihood of high-probability words and increases the likelihood of low-probability words (more creative)]</li>
Only the most commonly used ones are listed here. More can be found, with detailed explanations, in the Hugging Face documentation.
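
The interaction of `temperature`, `top_k` and `top_p` described in the list can be sketched in plain Python. This is a simplified illustration and not the library's implementation: all function names are hypothetical, the logits are made up, and the order temperature, then top_k, then top_p is an assumption of this sketch.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Return the ids of the k most probable tokens; k=0 keeps the full vocabulary."""
    if k <= 0 or k >= len(probs):
        return set(range(len(probs)))
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:k])


def top_p_filter(probs, p):
    """Return the smallest set of most probable ids whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in ranked:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept


def keep_only(probs, keep):
    """Zero out every token id not in keep and renormalize the rest."""
    masked = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(masked)
    return [p / total for p in masked]


def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample one token id after applying temperature, top_k and top_p in turn."""
    probs = softmax(logits, temperature)
    probs = keep_only(probs, top_k_filter(probs, top_k))
    probs = keep_only(probs, top_p_filter(probs, top_p))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
probs = softmax(logits)
print(top_k_filter(probs, 2))    # {0, 1}
print(top_p_filter(probs, 0.9))  # {0, 1, 2}
print(sample_next(logits, temperature=0.5, top_k=40, top_p=0.9))
```

Setting `top_k=1` in this sketch always returns the most probable token, which corresponds to what `do_sample = False` does.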

In [9]:
generated_ids = model.generate(model_inputs, 
                               max_new_tokens=50, 
                               num_beams = 5, 
                               early_stopping = True, 
                               no_repeat_ngram_size=2, 
                               num_return_sequences=2, 
                               top_k = 40, 
                               do_sample = True, 
                               temperature = 0.5,
                               top_p = 0.9 
                               )

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
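
The `no_repeat_ngram_size=2` setting used in the call above can be illustrated with a small pure-Python sketch. `banned_next_tokens` is a hypothetical helper that shows the idea on a plain list of token ids; it is not the library's code, and the token ids are made up.

```python
def banned_next_tokens(token_ids, ngram_size):
    """Return the tokens that would repeat an already seen n-gram if generated next."""
    if ngram_size <= 0:
        return set()
    start = len(token_ids) - (ngram_size - 1)
    if start < 0:
        return set()  # sequence is too short to complete any n-gram yet
    # The last n-1 tokens form the beginning of the n-gram being completed.
    prefix = tuple(token_ids[start:])
    banned = set()
    for i in range(len(token_ids) - ngram_size + 1):
        ngram = tuple(token_ids[i:i + ngram_size])
        # An earlier n-gram with the same beginning: its last token is forbidden.
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


generated = [5, 6, 7, 5, 6]  # made-up token ids
print(banned_next_tokens(generated, 2))  # {7}: the bigram (6, 7) already occurred
print(banned_next_tokens([1, 2, 3], 2))  # set(): no bigram starts with 3 yet
```

During generation, the probability of every banned token would be set to zero before the next token is chosen.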


## 6: Decoding and Stripping
`generate()` returns token IDs as its output. This output also includes the initial input tokens, so we need to remove the input tokens and decode the remaining token IDs to get the final response.

In [12]:
output = tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)
print(output[0])

Hold up, my man, let me break it down for you. You wanna eat a helicoptah, huh? That's some next-level stuff right there. But lemme tell ya, chopper-munchin
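
The stripping step can be sketched on plain lists, without a model. The token ids below are made up, and `strip_prompt` and `group_by_prompt` are hypothetical helpers. The grouping assumes that the returned rows are ordered by input, as in the batched output of section 3, where the first two results belong to the first prompt.

```python
def strip_prompt(generated_ids, input_length):
    """Drop the first input_length tokens (the prompt) from every generated row."""
    return [row[input_length:] for row in generated_ids]


def group_by_prompt(rows, num_return_sequences):
    """Group consecutive rows so that each group belongs to one input prompt."""
    return [
        rows[i:i + num_return_sequences]
        for i in range(0, len(rows), num_return_sequences)
    ]


# Two prompts of (padded) length 3, with num_return_sequences=2 for each prompt.
toy_generated = [
    [11, 12, 13, 101, 102],
    [11, 12, 13, 101, 103],
    [0, 21, 22, 201, 202],
    [0, 21, 22, 201, 203],
]
new_tokens = strip_prompt(toy_generated, input_length=3)
print(new_tokens)                      # [[101, 102], [101, 103], [201, 202], [201, 203]]
print(group_by_prompt(new_tokens, 2))  # [[[101, 102], [101, 103]], [[201, 202], [201, 203]]]
```

On real tensors the same slice is written as `generated_ids[:, input_length:]`, as in the cell above.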
