# Generation: An Interactive Guide 🌟

## Learning Objectives 🎯
- Learn to install and use the transformers library.
- Understand the initialization and use of models for generating text.
- Explore interactive text generation with a pre-trained model.

## Setup and Installation
If a GPU runtime is available, select it to speed up generation. The cells below also run on CPU, only more slowly.
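No install cell is shown in this notebook. The following is a minimal sketch for checking the environment, assuming the three packages imported in the next section are the only requirements. The helper name `missing_packages` is hypothetical.

```python
import importlib.util

# Packages imported later in this notebook (assumed to be the full list).
REQUIRED = ["transformers", "torch", "huggingface_hub"]

def missing_packages(names):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    # In a notebook cell, prefix the command with "!" to run it in the shell.
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```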


## Importing Libraries
Import the model and tokenizer classes for causal language modeling, PyTorch, and the `login` helper used to authenticate with the Hugging Face Hub.

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from huggingface_hub import login

## Authentication
Authenticate with `login` to access gated or private models on Hugging Face, replacing `'your_token_here'` with your own Hugging Face access token.
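A minimal sketch of the login step, assuming the token is kept in an environment variable rather than pasted into the notebook. The helper name `login_from_env` and the choice of `HF_TOKEN` as the variable name are assumptions; `login` is the function imported above.

```python
import os

def login_from_env(var_name="HF_TOKEN"):
    """Log in to the Hugging Face Hub if a token is set; return True if attempted."""
    token = os.environ.get(var_name)
    if not token:
        print(f"{var_name} is not set; skipping login.")
        return False
    # Imported here so the check above works even before the library is installed.
    from huggingface_hub import login
    login(token=token)
    return True
```

Calling `login_from_env()` in a cell before loading the model would then authenticate only when a token is present.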

## Model Setup
Choose the device, then load the model and tokenizer. The cell below pins the device to CPU; uncomment the first `device` line to use a GPU when one is available.

In [2]:
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# device = "cuda" if torch.cuda.is_available() else "cpu"
device = "cpu"
print(device)

cpu


In [3]:
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [9]:
model.config

LlamaConfig {
  "_attn_implementation_autoset": true,
  "_name_or_path": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 5632,
  "max_position_embeddings": 2048,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 22,
  "num_key_value_heads": 4,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.49.0",
  "use_cache": true,
  "vocab_size": 32000
}

## Interactive Text Generation
Set up the conversation and prepare the inputs for the model: define the messages with `user` and `assistant` roles, then apply the tokenizer's chat template, which formats and tokenizes them in one step.

We use the same example conversation here and set `add_generation_prompt=True`. This appends the opening of an assistant turn to the prompt, which tells the model that the next turn is its own to generate.
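To make the role of `add_generation_prompt` concrete, here is a plain-Python approximation of the layout that appears in the decoded output later in this notebook. It is only an illustration: the real template is stored with the tokenizer, and the function name `render_chat` and the exact whitespace are assumptions.

```python
EOS = "</s>"  # end-of-sequence marker seen in the decoded output

def render_chat(messages, add_generation_prompt=False):
    """Approximate the <|role|> layout; not the tokenizer's actual template."""
    parts = []
    for message in messages:
        # Each turn: a role tag line, then the content closed by the EOS marker.
        parts.append(f"<|{message['role']}|>\n{message['content']}{EOS}\n")
    if add_generation_prompt:
        # An open assistant tag signals that the model should write the next turn.
        parts.append("<|assistant|>\n")
    return "".join(parts)

example = [{"role": "user", "content": "How do I use them?"}]
print(render_chat(example, add_generation_prompt=True))
```

Without the flag, the rendered text ends after the last user turn, and the model has no explicit cue that an assistant reply should follow.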

In [5]:

# Define example messages in a conversation
messages = [
   {"role": "user", "content": "How do chat templates work?"},
   {"role": "assistant", "content": "Chat templates help  LLMs like me generate more coherent responses by providing a structured way to organize the conversation."},
   {"role": "user", "content": "How do I use them?"},
]

model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to(device)

In [6]:
model_inputs

tensor([[  529, 29989,  1792, 29989, 29958,    13,  5328,   437, 13563, 17475,
           664, 29973,     2, 29871,    13, 29966, 29989,   465, 22137, 29989,
         29958,    13,  1451,   271, 17475,  1371, 29871,   365, 26369, 29879,
           763,   592,  5706,   901, 16165,   261,   296, 20890,   491, 13138,
           263,  2281,  2955,   982,   304,  2894,   675,   278, 14983, 29889,
             2, 29871,    13, 29966, 29989,  1792, 29989, 29958,    13,  5328,
           437,   306,   671,   963, 29973,     2, 29871,    13, 29966, 29989,
           465, 22137, 29989, 29958,    13]])

In [8]:
model_inputs.shape

torch.Size([1, 75])

These are still token IDs from the tokenizer, but the sequence now also contains the chat template's role markers and an end-of-sequence token (ID 2) after each message.
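The prompt length matters because it counts against the model's context window. Here is a small worked example using the 75-token shape above and the `max_position_embeddings` value of 2048 from `model.config`, treated here as the sequence limit. The helper name `remaining_budget` is hypothetical.

```python
def remaining_budget(context_window, prompt_tokens):
    """Tokens left for generation once the prompt is accounted for."""
    return max(context_window - prompt_tokens, 0)

context_window = 2048  # max_position_embeddings in model.config
prompt_tokens = 75     # second dimension of model_inputs.shape
print(remaining_budget(context_window, prompt_tokens))  # 1973
```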

In [7]:
generated = model.generate(model_inputs)
decoded_response = tokenizer.batch_decode(generated)

print(decoded_response[0])

<|user|>
How do chat templates work?</s> 
<|assistant|>
Chat templates help  LLMs like me generate more coherent responses by providing a structured way to organize the conversation.</s> 
<|user|>
How do I use them?</s> 
<|assistant|>
To use chat templates, you can follow these steps:

1. Log in to your LLM chatbot platform.
2. Click on the "Templates" tab in the left-hand menu.
3. Click on the "Add Template" button.
4. Choose the template you want to use from the list of available templates.
5. Customize the template by adding your own text, images, and other elements.
6. Save your template and assign it to a conversation.

Once you have created a chat template, you can use it to generate responses to common customer queries or to customize responses based on the conversation.</s>


**Here the model clearly hallucinated: the answer is incorrect, but it shows that the model can generate a response in the expected chat format.**

## Generate and Decode Responses
Generate responses using the model and decode them. Adjust parameters like `max_new_tokens`, `do_sample`, `top_p`, `temperature`, and `top_k` to explore different generation settings.

We need to add some parameters to control the model's response:
- `max_new_tokens` - The maximum number of tokens to generate in addition to the prompt. The input above already uses 75 tokens, so setting this to 10 produces at most 10 more. Generation simply stops at the 10th new token, even mid-sentence, so a very small value rarely yields a useful answer. This Llama model's config sets `max_position_embeddings` to 2048, so the prompt and the generated tokens together should stay within that limit.
- `do_sample` - With `do_sample=False` (the library default), the model always picks the most probable next token, so the same prompt gives the same output on every run; changing `max_new_tokens` only changes where that output is cut off. Setting `do_sample=True` makes the model sample each next token from its probability distribution, so answers vary between runs while still respecting the token limit.
- `temperature` - Controls the randomness of the generated text. A higher temperature (e.g., 0.8-1.0) introduces more randomness, leading to more diverse and potentially more creative outputs, while a lower temperature (e.g., 0.1-0.4) gives more predictable, nearly deterministic outputs. The value must be greater than 0, and it only has an effect when `do_sample=True`.
- `top_p` - Also known as nucleus sampling. It sets a threshold on cumulative probability: the model considers only the smallest set of most probable tokens whose probabilities add up to at least `top_p`. A lower value (e.g., 0.1) restricts the model to a small, focused pool of likely tokens, giving more predictable and coherent output. A higher value (e.g., 0.9) admits a wider range of tokens, giving more diverse and creative, but possibly less coherent, output. In practice, `top_p` is usually set fairly high.
- `top_k` - Restricts next-token selection to the `k` most probable tokens. A smaller `k` leads to more predictable and constrained outputs, while a larger `k` allows more variety and randomness.
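The interaction of these parameters can be sketched on a toy distribution. The following is a simplified illustration, not the library's implementation: it assumes the filters are applied in the order temperature, top-k, top-p, and the function names and the four made-up logits are hypothetical.

```python
import math
import random

def softmax(scores):
    """Convert {token: logit} into {token: probability}."""
    peak = max(scores.values())
    weights = {tok: math.exp(v - peak) for tok, v in scores.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

def next_token_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature rescales the logits; values below 1 sharpen the distribution.
    scaled = {tok: v / temperature for tok, v in logits.items()}
    ranked = sorted(scaled, key=scaled.get, reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # keep only the k most probable tokens
    probs = softmax({tok: scaled[tok] for tok in ranked})
    kept, cumulative = [], 0.0
    for tok in ranked:
        kept.append(tok)
        cumulative += probs[tok]
        if cumulative >= top_p:  # smallest set whose mass reaches top_p
            break
    norm = sum(probs[tok] for tok in kept)
    return {tok: probs[tok] / norm for tok in kept}

toy_logits = {"cat": 2.0, "dog": 1.0, "fish": 0.0, "rock": -1.0}  # made-up scores
dist = next_token_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.9)
print(dist)

# Sampling draws from the filtered distribution; greedy decoding would take the max.
choice = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
print(choice)
```

With these settings only `cat` and `dog` survive the filters. Lowering the temperature to 0.1 concentrates almost all probability on `cat`, which is why low temperatures behave almost like greedy decoding.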

In [12]:
generated_ids = model.generate(model_inputs, max_new_tokens=128, do_sample=True, top_p=0.9, temperature=1.0, top_k=5)

In [13]:
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

<|user|>
How do chat templates work?</s> 
<|assistant|>
Chat templates help  LLMs like me generate more coherent responses by providing a structured way to organize the conversation.</s> 
<|user|>
How do I use them?</s> 
<|assistant|>
Sure, here's how you can use chat templates to improve your conversational skills with chatbots:

1. Create a template for a basic conversation: Start by creating a basic template that covers the most common topics and questions. For example, you could include questions like "How can I order a drink?" or "Can you recommend a good restaurant nearby?"

2. Customize the template: Once you have a basic template, customize it as needed. For instance, if you want to add more questions about the drink, you can add them. Similarly, if you want to add more options for restaur
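The decoded text above repeats the whole prompt before the new reply, and the reply is cut off when the 128-token limit is reached. Below is a minimal sketch of keeping only the newly generated tokens, shown with plain lists and made-up IDs so that it runs without the model. The helper name `strip_prompt` is hypothetical.

```python
def strip_prompt(generated_rows, prompt_length):
    """Drop the first `prompt_length` IDs from every generated sequence."""
    return [row[prompt_length:] for row in generated_rows]

prompt_ids = [529, 29989, 1792]            # stand-in for the 75 prompt tokens
generated = [prompt_ids + [1763, 671, 2]]  # prompt followed by made-up new IDs
print(strip_prompt(generated, len(prompt_ids)))  # [[1763, 671, 2]]
```

With the tensors in this notebook, the equivalent slice would be `generated_ids[:, model_inputs.shape[1]:]`, and passing `skip_special_tokens=True` to `tokenizer.batch_decode` drops markers such as `</s>` from the decoded text.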


## Conclusion 📘
This notebook provided an interactive guide to text generation using the transformers library and a pre-trained model. Through hands-on examples, we explored the power of language models in generating context-aware text based on predefined templates. Experiment with different models and parameters to see how they affect the generated text's style and coherence. Enjoy your journey into NLP!