# Simple Prompts on an Open-Source Large Language Model
The objective here is to download a large language model from Hugging Face and understand how we can use the Hugging Face libraries to build and execute simple prompts.
We will primarily use the Transformers library in this exercise and look at the classes used to prompt a large language model.

**Model used here:** BLOOM-1b7 (`bigscience/bloom-1b7`)

In [4]:
#Ensure that we have Transformers 4.33 or greater.
!pip show transformers

Name: transformers
Version: 4.33.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /home/cdsw/.local/lib/python3.9/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: 


In [5]:
# Ensure that GPU 0 is a Tesla T4 with CUDA libraries > 11.4
!nvidia-smi

Wed Sep 27 06:31:12 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   37C    P0    32W /  70W |    285MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [8]:

from transformers import AutoTokenizer, AutoModel, set_seed, AutoModelForCausalLM

In [9]:
# Make CUDA float32 the default tensor type, so new tensors are created on the GPU by default
import torch
torch.set_default_tensor_type(torch.cuda.FloatTensor)

In [10]:
# download the Tokenizer for the model from HF
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b7")

Downloading (…)okenizer_config.json: 100%|█████| 222/222 [00:00<00:00, 32.8kB/s]
Downloading tokenizer.json: 100%|███████████| 14.5M/14.5M [00:00<00:00, 254MB/s]
Downloading (…)cial_tokens_map.json: 100%|███| 85.0/85.0 [00:00<00:00, 11.6kB/s]


In [11]:
# Download the model and cache it locally for future use
model_lm = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b7")

Downloading (…)lve/main/config.json: 100%|██████| 715/715 [00:00<00:00, 275kB/s]
Downloading model.safetensors: 100%|████████| 3.44G/3.44G [00:07<00:00, 434MB/s]


In [12]:
# checking that we have the right object handle
model_lm.__class__

transformers.models.bloom.modeling_bloom.BloomForCausalLM

In [11]:
set_seed(11111)

In [14]:
# A basic prompt on the model 
text_prompt = 'what is life in the first century'

In [15]:
input_tokens = tokenizer(text_prompt, return_tensors="pt").to(0)

In [16]:
# As can be seen, the tokenizer converts the text into token IDs (not embeddings) plus an attention mask
input_tokens

{'input_ids': tensor([[25915,   632, 10440,   361,   368,  3968, 32807]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
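
To make the two fields above concrete, here is a minimal sketch using a made-up whitespace vocabulary. It is not BLOOM's real subword tokenizer: `toy_vocab`, `encode_batch`, and the pad ID are hypothetical names chosen only for illustration, and the padding is placed on the right for simplicity. It shows that `input_ids` are integer indices into a vocabulary and that `attention_mask` marks real tokens with 1 and padding with 0.

```python
# Toy illustration only: a whitespace "tokenizer" with a hypothetical vocabulary.
toy_vocab = {"<pad>": 0, "what": 1, "is": 2, "life": 3, "nlp": 4}


def encode_batch(texts, vocab, pad_id=0):
    """Map each text to vocabulary IDs and pad every row to the longest one."""
    ids = [[vocab[word] for word in text.lower().split()] for text in texts]
    width = max(len(seq) for seq in ids)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in ids]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = encode_batch(["what is life", "what is"], toy_vocab)
print(batch["input_ids"])       # [[1, 2, 3], [1, 2, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```

In the cell above the mask is all ones because a single prompt needs no padding.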

In [18]:
# Let us use this to generate some output from the pre-trained model.
# Note: top_k and temperature only shape sampling, so they take effect only with do_sample=True;
# with the default do_sample=False, as here, decoding is greedy.
# max_length caps the total number of tokens (prompt plus generated text), not the number of responses.
result_sample = model_lm.generate(**input_tokens, max_length=200, top_k=0, temperature=0.5)
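
The effect of `temperature` and `top_k` on the next-token distribution can be sketched without a model. The snippet below is a simplified illustration on made-up scores, not the library's implementation: `next_token_probs` and `toy_logits` are hypothetical names. It assumes the usual convention that `top_k=0` disables top-k filtering, and that these settings matter only when tokens are sampled rather than chosen greedily.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_probs(logits, temperature=1.0, top_k=0):
    """Scale scores by temperature, then keep only the top_k best (0 = keep all)."""
    scaled = [x / temperature for x in logits]
    if top_k > 0:
        k = min(top_k, len(scaled))
        cutoff = sorted(scaled, reverse=True)[k - 1]
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    return softmax(scaled)


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for 4 candidate tokens

base = next_token_probs(toy_logits)                     # plain softmax
sharp = next_token_probs(toy_logits, temperature=0.5)   # lower temperature = peakier
top2 = next_token_probs(toy_logits, top_k=2)            # only 2 tokens stay possible

# Greedy decoding ignores all of this and simply takes the highest score.
greedy_choice = max(range(len(toy_logits)), key=toy_logits.__getitem__)

print([round(p, 3) for p in base])
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in top2])
print(greedy_choice)
```

A temperature below 1 concentrates probability on the best token, while top-k removes the unlikely tail entirely.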

In [21]:
# As expected, the result generated by the model is a tensor of token IDs, which needs to be decoded back into text.
result_sample[0]

tensor([ 25915,    632,  10440,    361,    368,   3968,  32807,     34,    982,
           189,   2175,  12300,    632,    861,  10440,    632,    267,  17238,
         87486,     17,   1387,  87486,    632,   5299,    368,   7220,    530,
           368,  71220,     17,   1387,   7220,    632,    368,  10440,    461,
           368,  29431,     15,    368,  10440,    461,    368,   3117, 160158,
            17,   1387,  71220,    632,    368,  10440,    461,    368,    447,
         25674, 160158,     15,    368,  10440,    461,    368,    447,  25674,
        160158,     17,   1387,   7220,    632,    368,  10440,    461,    368,
         29431,     15,    368,  10440,    461,    368,   3117, 160158,     17,
          1387,  71220,    632,    368,  10440,    461,    368,    447,  25674,
        160158,     15,    368,  10440,    461,    368,    447,  25674, 160158,
            17,   1387,   7220,    632,    368,  10440,    461,    368,  29431,
            15,    368,  10440,    461, 

In [22]:

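# Let us decode this back into text we can read.
# (truncate_before_pattern is an option of the CodeGen tokenizer; the BLOOM tokenizer does not appear to use it.)
print(tokenizer.decode(result_sample[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]))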

what is life in the first century?”
The answer is that life is a constant struggle. The struggle is between the good and the evil. The good is the life of the Christian, the life of the believer. The evil is the life of the unbeliever, the life of the unbeliever. The good is the life of the Christian, the life of the believer. The evil is the life of the unbeliever, the life of the unbeliever. The good is the life of the Christian, the life of the believer. The evil is the life of the unbeliever, the life of the unbeliever. The good is the life of the Christian, the life of the believer. The evil is the life of the unbeliever, the life of the unbeliever. The good is the life of the Christian, the life of the believer. The evil is the life of the unbeliever, the life of


In [23]:
# Beam search with 2 beams; no_repeat_ngram_size=2 prevents any 2-token sequence from appearing twice
print(tokenizer.decode(model_lm.generate(**input_tokens, max_length=400,
                       num_beams=2,
                       no_repeat_ngram_size=2,
                       early_stopping=True
                      )[0]))

what is life in the first century of the Christian era? What is it like to be a Christian today? How do we live out our faith in our everyday lives? These questions are answered in this book.</s>
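
The beam-search output no longer loops, largely because of `no_repeat_ngram_size=2`. A minimal sketch of the idea, written on plain word lists rather than token IDs: `banned_next_tokens` is a hypothetical helper, not the library's code. At each step, any token that would complete an n-gram already present in the generated text is excluded.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would repeat an n-gram already in `generated`."""
    if len(generated) + 1 < ngram_size:
        return set()  # not enough history to form a full n-gram yet
    # The last (n - 1) tokens are the prefix of the n-gram being formed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


tokens = ["the", "life", "of", "the", "believer", "of", "the"]
banned_bigrams = banned_next_tokens(tokens, 2)
banned_trigrams = banned_next_tokens(tokens, 3)

print(sorted(banned_bigrams))   # ['believer', 'life']
print(sorted(banned_trigrams))  # ['believer']
```

With a size of 2, both "the life" and "the believer" have already occurred, so neither word may follow "the" again.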


In [20]:

prompt_cot = "Write a brief history of United states"
input_tokens_cot = tokenizer(prompt_cot, return_tensors="pt").to(0)
print(tokenizer.decode(model_lm.generate(**input_tokens_cot, max_length=400, top_k=0, temperature=0.5)[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]))


Write a brief history of United states of America. The United States of America is a country in the Americas. It is the largest country in the world. It is located in the western hemisphere. The United States is divided into two parts, the continental United States and the islands of the Caribbean. The United States is the largest country in the world. It is located in the western hemisphere. The United States is divided into two parts, the continental United States and the islands of the Caribbean. The United States is the largest country in the world. It is located in the western hemisphere. The United States is divided into two parts, the continental United States and the islands of the Caribbean. The United States is the largest country in the world. It is located in the western hemisphere. The United States is divided into two parts, the continental United States and the islands of the Caribbean. The United States is the largest country in the world. It is located in the western h

In [39]:

prompt_cot = "What is NLP?"
input_tokens_cot = tokenizer(prompt_cot, return_tensors="pt").to(0)
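# do_sample=False means greedy decoding, so top_k and temperature have no effect here;
# repetition_penalty=2.0 lowers the score of tokens that have already appeared.
print(tokenizer.decode(model_lm.generate(**input_tokens_cot, max_length=200, do_sample=False, top_k=0, temperature=0.5, repetition_penalty = 2.0)[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]))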


What is NLP? What does it do for you and your business, what are the benefits of using this technology in today’s world?
Neurolinguistic Programming (NLP) has been around since at least 1970. It was developed by Dr John Grinder who used a combination of:
The first step to understanding how neurolingual programming works involves learning about language itself.
Language can be defined as “an organized system that allows us communicate with one another” – Wikipedia</s>
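
The repetition penalty is what keeps this answer from looping. As far as we know, the penalty rescales the score of every token that has already appeared: positive scores are divided by the penalty and negative scores are multiplied by it, so both move toward "less likely". The sketch below applies that rule to made-up scores; `apply_repetition_penalty` and `toy_logits` are hypothetical names used for illustration.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Make already-seen tokens less likely; unseen tokens are left untouched."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both lower it.
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


toy_logits = [3.0, 2.0, -2.0, 0.5]  # hypothetical scores for token IDs 0..3
penalized = apply_repetition_penalty(toy_logits, seen_token_ids=[0, 2, 0], penalty=2.0)

best_before = max(range(len(toy_logits)), key=toy_logits.__getitem__)
best_after = max(range(len(penalized)), key=penalized.__getitem__)

print(penalized)                # [1.5, 2.0, -4.0, 0.5]
print(best_before, best_after)  # 0 1
```

Token 0 was the greedy choice before the penalty, but because it had already been generated, token 1 wins afterwards.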


**Summary:** We saw how to use simple prompts with LLMs. Let us now free the GPU memory so that it is available to other sessions.
--
We do this in 2 steps:
1. We delete the model and the input tensor objects.
2. We invoke the garbage collector and empty the CUDA cache.

In [53]:
import gc
del model_lm
del input_tokens
del input_tokens_cot
gc.collect()
torch.cuda.empty_cache()
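
Deleting a name frees the underlying object only once no other reference to it remains, which is why every variable holding the model or its tensors has to go. A small CPU-only sketch of that behaviour, assuming standard CPython reference counting; `FakeModel` is a hypothetical stand-in for a large object and no GPU is involved.

```python
import gc
import weakref


class FakeModel:
    """Hypothetical stand-in for a large object such as a loaded model."""


model = FakeModel()
alias = model                 # a second reference, e.g. a variable from another cell
probe = weakref.ref(model)    # lets us observe the object without keeping it alive

del model
gc.collect()
alive_after_first_del = probe() is not None   # the alias still holds the object

del alias
gc.collect()
alive_after_all_del = probe() is not None     # no references remain

print(alive_after_first_del)  # True
print(alive_after_all_del)    # False
```

The same reasoning applies to tensors returned by `generate`: while a variable still points at them, their memory cannot be released.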


## We also need to check that the GPU memory is available for other sessions. The code below helps us do exactly that

In [54]:
# Let us monitor memory
import torch

# Retrieve GPU memory statistics
memory_stats = torch.cuda.memory_stats()
# Retrieve maximum GPU memory allocated by PyTorch
max_memory_allocated = torch.cuda.max_memory_allocated()
# Calculate available GPU memory
total_memory = torch.cuda.get_device_properties(0).total_memory
available_memory = total_memory - memory_stats["allocated_bytes.all.current"]

# Print the result
print(f"total_memory: {total_memory / 1024**3:.2f} GB")
print(f"Peak GPU memory allocated by PyTorch: {max_memory_allocated / 1024**3:.2f} GB")
print(f"Available GPU memory: {available_memory / 1024**3:.2f} GB")

## Make sure you see about 14 GB of available GPU memory before moving to the next assignment

total_memory: 14.76 GB
Peak GPU memory allocated by PyTorch: 8.33 GB
Available GPU memory: 14.75 GB
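
The arithmetic behind this check can be wrapped in a small helper. This is a sketch on plain byte counts, so it runs without a GPU; `free_gib` and `has_room_for` are hypothetical names. It uses the same convention as the cell above, where "GB" means 1024**3 bytes, and it assumes the allocated figure covers only what PyTorch has allocated in the current process.

```python
GIB = 1024 ** 3  # the notebook prints this unit as "GB"


def free_gib(total_bytes, allocated_bytes):
    """Memory not currently allocated, expressed in units of 1024**3 bytes."""
    return (total_bytes - allocated_bytes) / GIB


def has_room_for(total_bytes, allocated_bytes, required_gib):
    """True when at least `required_gib` is still unallocated."""
    return free_gib(total_bytes, allocated_bytes) >= required_gib


# Made-up numbers for illustration: a 16 GiB card with 2 GiB in use.
total = 16 * GIB
allocated = 2 * GIB

print(f"{free_gib(total, allocated):.2f} GiB free")      # 14.00 GiB free
print(has_room_for(total, allocated, required_gib=14))   # True
print(has_room_for(total, allocated, required_gib=15))   # False
```

With real values, `total_memory` and `memory_stats["allocated_bytes.all.current"]` from the cell above would be passed as the two byte counts.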
