### Insights and Code References Attribution

This work incorporates insights and code snippets courtesy of Hugging Face, published on Hugging Face's blog. We extend our gratitude to the original authors and Hugging Face for sharing their valuable resources with the community.

Here are a few of the resources used for reference.

https://huggingface.co/google/gemma-7b-it

https://huggingface.co/blog/gemma



### What is Gemma

Meet Gemma! It's a friendly family of open LLMs made by Google, kind of like cousins to the Gemini models. Gemma is super good at turning one bunch of words into another, kind of like how you can turn a bunch of LEGO pieces into something cool. The models are really good at understanding and chatting in English, and the best part? The weights are open, so everyone can download them and play with different versions. Whether you need help answering tricky questions, summing up long stories, or even thinking through puzzles, Gemma's got your back!


### Environment

We will run this code on Google Colab with a T4 GPU.

### Get access to the Gemma model on Hugging Face

https://huggingface.co/google/gemma-7b-it

Gemma models are publicly accessible, but you have to accept the conditions on the model page before you can access the files and content. You will need to provide your username and email address.


### Install required Libraries

transformers - Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models.

accelerate - Accelerate enables the same code to be run across different distributed configurations.

bitsandbytes - Provides the low-bit (8-bit and 4-bit) quantization used to load the model.



In [1]:

!pip install -U transformers



Collecting transformers
  Downloading transformers-4.38.1-py3-none-any.whl (8.5 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.37.2
    Uninstalling transformers-4.37.2:
      Successfully uninstalled transformers-4.37.2
Successfully installed transformers-4.38.1


### Restart the runtime if needed

### Log in to Hugging Face using the auth token created earlier

In [2]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
!pip install bitsandbytes


Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.42.0


In [5]:
!pip install accelerate

Collecting accelerate
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.27.2
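
After the installs (and a runtime restart, if Colab asked for one), a quick check can confirm which versions the session actually sees. The sketch below is one way to do that with the standard library. The helper names `installed_versions` and `at_least` are made up for this example, and the `4.38.0` minimum is an assumption based on the upgrade shown in the install log, not a documented requirement taken from this notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def at_least(installed, minimum):
    """Compare the leading numeric parts of two dotted version strings."""
    def parse(text):
        parts = []
        for piece in text.split("."):
            if not piece.isdigit():
                break  # stop at suffixes such as "dev0" or "post1"
            parts.append(int(piece))
        # Pad short versions such as "4.38" so they compare like "4.38.0"
        return tuple(parts + [0] * (3 - len(parts)))
    return parse(installed) >= parse(minimum)


versions = installed_versions(["transformers", "accelerate", "bitsandbytes"])
for name, found in versions.items():
    print(f"{name}: {found or 'not installed'}")

# Assumption: 4.38.0 is the first transformers release with Gemma support.
if versions["transformers"] and not at_least(versions["transformers"], "4.38.0"):
    print("transformers looks too old for Gemma; upgrade and restart the runtime.")
```

If the printed version is still the old one after upgrading, the runtime most likely needs a restart so that the new package is imported.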


### Set up Tokenizer and import model

gemma-7b is the Base 7B model.

gemma-7b-it is the Instruction fine-tuned version of the base 7B model.

We will use gemma-7b-it.

We will use 4-bit quantization.

Quantization, in the context of machine learning, means reducing the precision of the numerical representations used in a model, for example storing weights as 8-bit integers instead of 32-bit floating-point numbers. This shrinks the model's memory footprint and can speed up inference, which makes the model more practical on hardware with limited memory or on accelerators designed for lower-precision arithmetic.
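
To make the idea concrete, here is a minimal sketch of quantization in plain Python. It uses a simple "absmax" scheme with one shared scale for all values, which is an assumption chosen for readability. It is not the method bitsandbytes applies when `load_in_4bit=True`, which relies on more elaborate block-wise formats. The function names and the sample weights are invented for the illustration.

```python
def quantize_absmax(values, bits=8):
    """Map floats to signed integers of the given bit width using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8 bits, 7 for 4 bits
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest else 1.0    # avoid dividing by zero
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, used only to show the effect of fewer bits
weights = [0.12, -0.53, 0.98, -1.27, 0.04]

for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: ints={q}, max error={worst:.4f}")
```

Running it shows the trade-off: the 4-bit version needs half the storage of the 8-bit version but reconstructs the original values less accurately.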


In [6]:
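from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Load the weights in 4-bit precision via bitsandbytes so the model fits on the T4
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# Download (or load from cache) the tokenizer and the instruction-tuned 7B model
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", quantization_config=quantization_config)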




The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/2.15k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/888 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/694 [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now set to True since model is quantized.


model.safetensors.index.json:   0%|          | 0.00/20.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/2.11G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]
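
The shard sizes in the download log give a rough sense of why 4-bit loading matters on a T4. The back-of-the-envelope sketch below assumes the checkpoint stores each weight in 16 bits and that about 15 GB of the T4's 16 GB is usable in Colab. Both figures are assumptions, and the 4-bit number is a lower bound: it ignores layers that stay in higher precision, quantization metadata, and the memory needed for activations during generation.

```python
# Shard sizes in GB, copied from the download log printed by the loading cell
shard_sizes_gb = [5.00, 4.98, 4.98, 2.11]

# Assumption: the checkpoint stores each weight as a 16-bit value (2 bytes)
bytes_per_weight_16bit = 2
# 4 bits per weight is half a byte
bytes_per_weight_4bit = 0.5
# Assumption: a Colab T4 exposes roughly 15 GB of its 16 GB to the notebook
t4_memory_gb = 15.0

checkpoint_gb = sum(shard_sizes_gb)
approx_params_billion = checkpoint_gb / bytes_per_weight_16bit
approx_4bit_gb = approx_params_billion * bytes_per_weight_4bit

print(f"16-bit checkpoint: {checkpoint_gb:.2f} GB")
print(f"approximate parameter count: {approx_params_billion:.2f} billion")
print(f"4-bit weights (lower bound): {approx_4bit_gb:.2f} GB")
print(f"16-bit fits on a T4: {checkpoint_gb < t4_memory_gb}")
print(f"4-bit fits on a T4: {approx_4bit_gb < t4_memory_gb}")
```

Under these assumptions the 16-bit weights alone exceed the GPU's memory, while the 4-bit weights leave headroom for generation.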

### Prompt the Gemma model with a question

The output shows that the tokenizer adds `<bos>` to the input text. Gemma expects the tokenized input to start with a `<bos>` token and may not respond sensibly without it.

In [7]:
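input_text = "suggest books similar to harry potter"
# Tokenize the prompt (the tokenizer prepends <bos>) and move the tensors to the GPU
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

# Generate up to 450 new tokens, then decode the full sequence (prompt + completion)
outputs = model.generate(**inputs, max_new_tokens=450)
print(tokenizer.decode(outputs[0]))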



<bos>suggest books similar to harry potter and the lightning bolt.

Sure, here are some book recommendations similar to Harry Potter and the Lightning Bolt:

**Similar to the magic and mystery:**

* **The Hunger Games** by Suzanne Collins
* **The Hobbit: An Unexpected Journey** by J.R.R. Tolkien
* **The Hunger Games: Mockingjay** by Suzanne Collins

**Similar to the young protagonist and coming-of-age story:**

* **The Lightning Thief** by Rick Riordan
* **The Hunger Games: Catching Fire** by Suzanne Collins
* **The Hobbit: The Battle of Helm's Deep** by J.R.R. Tolkien

**Similar to the themes of friendship and loyalty:**

* **The Hunger Games: Mockingjay** by Suzanne Collins
* **The Hobbit: The Battle of Helm's Deep** by J.R.R. Tolkien
* **The Lightning Thief** by Rick Riordan

**Similar to the magical world and creatures:**

* **The Hobbit: An Unexpected Journey** by J.R.R. Tolkien
* **The Hunger Games: Mockingjay** by Suzanne Collins
* **The Phoenix and the Sword** by Brandon Sander

In [8]:


input_text = " Write a song about soccer ?"
# Tokenize the prompt and move the tensors to the GPU
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

# Generate up to 450 new tokens, then decode the full sequence
outputs = model.generate(**inputs, max_new_tokens=450)
print(tokenizer.decode(outputs[0]))


<bos> Write a song about soccer ?

(Verse 1<pad><pad><pad><pad><pad> ... (the rest of the output is a long run of repeated <pad> tokens)

In [9]:
input_text = " Write a one line poem about soccer ?"
# Tokenize the prompt and move the tensors to the GPU
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

# Generate up to 450 new tokens, then decode the full sequence
outputs = model.generate(**inputs, max_new_tokens=450)
print(tokenizer.decode(outputs[0]))


<bos> Write a one line poem about soccer ?

The ball flies through the air,
A game of passion and flair.<eos>


In [10]:
input_text = " what is capital of USA?"
# Tokenize the prompt and move the tensors to the GPU
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

# Generate up to 250 new tokens, then decode the full sequence
outputs = model.generate(**inputs, max_new_tokens=250)
print(tokenizer.decode(outputs[0]))


<bos> what is capital of USA?

The answer is Washington, D.C.

The capital of the United States of America is Washington, D.C.<eos>


In [11]:
input_text = " what is capital of USA? Answer in a sarcastic and humorous way"
# Tokenize the prompt and move the tensors to the GPU
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

# Generate up to 250 new tokens, then decode the full sequence
outputs = model.generate(**inputs, max_new_tokens=250)
print(tokenizer.decode(outputs[0]))


<bos> what is capital of USA? Answer in a sarcastic and humorous way.

The capital of the United States of America is Washington, D.C., a place where politicians go to play pretend and where the homeless population is higher than the national average.<eos>


### Prompt the Gemma model with a question using the chat template

Instead of passing raw text, we can format the prompt with the tokenizer's chat template.

We will set the role to `user`.

We will set the content to the question, together with the instruction to respond in a sarcastic and humorous way.

The instruction-tuned models expect the following conversational structure:

```
<start_of_turn>user
user_question<end_of_turn>
<start_of_turn>model
model_response<end_of_turn>

```
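
The turn structure above can also be assembled by hand, which makes explicit what the tokenizer's chat template is doing. The function below is a minimal sketch, and `build_gemma_prompt` is a name invented for this example. It mirrors only the markers shown in this notebook. The template shipped with the tokenizer may differ between versions (for example in whether it includes `<bos>`), so `tokenizer.apply_chat_template` remains the reference.

```python
def build_gemma_prompt(messages, add_generation_prompt=True):
    """Format a list of {"role", "content"} dicts using the turn markers shown above."""
    parts = []
    for message in messages:
        # Assumption: earlier replies from the model use the role name "model"
        # in the prompt, so an "assistant" role is renamed here.
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    if add_generation_prompt:
        # Leave an open model turn for the model to complete
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "what is capital of USA. Answer in a sarcastic and humorous way"},
]
print(repr(build_gemma_prompt(chat)))
```

For a single user message, the printed string should match the prompt that `apply_chat_template` returns in this notebook.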

In [12]:
chat = [
    { "role": "user", "content": "what is capital of USA. Answer in a sarcastic and humorous way" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
prompt

'<start_of_turn>user\nwhat is capital of USA. Answer in a sarcastic and humorous way<end_of_turn>\n<start_of_turn>model\n'

In [13]:
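# Tokenize the formatted prompt; add_special_tokens=True prepends <bos>
inputs = tokenizer.encode(prompt, add_special_tokens=True, return_tensors="pt")
# Generate on the same device as the model, with at most 250 new tokens
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=250)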


In [14]:
# Decode the full sequence (prompt + response), special tokens included
print(tokenizer.decode(outputs[0]))

<bos><start_of_turn>user
what is capital of USA. Answer in a sarcastic and humorous way<end_of_turn>
<start_of_turn>model
Washington, D.C. - a place where politicians go to pretend to be important, but actually accomplish nothing.<eos>
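
The decoded text contains the whole conversation plus the special tokens. When only the answer is needed, one option is to slice off the prompt tokens before decoding, for example `tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)`. The sketch below shows an alternative that works on the decoded string. `extract_reply` is a hypothetical helper written for this example, and it assumes the turn markers appear exactly as in the output above.

```python
def extract_reply(decoded, markers=("<eos>", "<end_of_turn>")):
    """Return the text of the last model turn from a decoded Gemma sequence."""
    # Keep only what follows the final "<start_of_turn>model" header
    _, _, reply = decoded.rpartition("<start_of_turn>model\n")
    # Cut at the first end marker, if the model produced one
    for marker in markers:
        reply = reply.split(marker, 1)[0]
    return reply.strip()


# The decoded output printed by the cell above
decoded = (
    "<bos><start_of_turn>user\n"
    "what is capital of USA. Answer in a sarcastic and humorous way<end_of_turn>\n"
    "<start_of_turn>model\n"
    "Washington, D.C. - a place where politicians go to pretend to be important, "
    "but actually accomplish nothing.<eos>"
)
print(extract_reply(decoded))
```

Slicing the token ids is usually the more robust choice, since it does not depend on the exact marker text. The string version is handy when only the decoded output is available.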
