# Gemma 2B-IT Testing on RTX 2060

This Jupyter Notebook contains code to test the performance and capabilities of the **Gemma 2B-IT model** on an **NVIDIA RTX 2060 GPU**. 

**Objectives:**

* Benchmark model inference speed.
* Evaluate memory consumption.
* Explore potential applications on this hardware configuration.

**Requirements:**

* NVIDIA RTX 2060 or compatible GPU
* Jupyter Notebook environment
* Necessary libraries (PyTorch, `transformers`, `bitsandbytes`, `ipywidgets`)

**Procedure:**

The notebook will guide you through the following steps:

1. **Environment Setup:** Installation of required libraries and dependencies.
2. **Model Loading:** Loading the pre-trained Gemma 2B-IT model.
3. **Data Preparation:** Loading and preprocessing sample data for testing.
4. **Inference and Benchmarking:** Running the model on the test data and measuring inference time.
5. **Memory Usage Analysis:** Monitoring GPU memory consumption during model execution.
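
Step 4 relies on wall-clock timing, and a single timed call can be dominated by one-off costs such as lazy initialization. The sketch below is one possible way to time a callable with warm-up runs and several repeats. The helper names `benchmark` and `summarize` are hypothetical and are not used by the notebook cells.

```python
import statistics
import time


def benchmark(fn, warmup=1, repeats=3):
    """Call fn() `warmup` times untimed, then return a list of timed runs in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return (mean, sample standard deviation) of the timed runs in seconds."""
    mean = statistics.mean(timings)
    stdev = statistics.stdev(timings) if len(timings) > 1 else 0.0
    return mean, stdev


if __name__ == "__main__":
    # Stand-in workload; in the notebook this could be something like
    # lambda: model.generate(**input_ids, max_new_tokens=128)
    timings = benchmark(lambda: sum(range(100_000)), warmup=1, repeats=5)
    mean, stdev = summarize(timings)
    print(f"mean {mean:.6f} s, stdev {stdev:.6f} s over {len(timings)} runs")
```

When the workload runs on a GPU, calling `torch.cuda.synchronize()` before reading the clock is a common precaution, since CUDA kernels are launched asynchronously.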

**Results:**

The notebook will present the findings of the tests, including:

* Inference time for a single prompt at 16-bit, 8-bit, and 4-bit precision.
* GPU memory utilization.
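
As a rough guide to what GPU memory utilization to expect, the memory taken by the weights alone can be estimated as parameters × bits / 8. The sketch below assumes about 2.5 billion parameters for Gemma 2B and the 6 GB variant of the RTX 2060. Both figures are assumptions, so check them against the model card and your own card.

```python
# Assumption: roughly 2.5 billion parameters for Gemma 2B (check the model card).
N_PARAMS = 2.5e9
# Assumption: the 6 GB variant of the RTX 2060.
VRAM_GB = 6.0


def weight_memory_gb(n_params, bits):
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits / 8 / 1e9


for bits in (16, 8, 4):
    gb = weight_memory_gb(N_PARAMS, bits)
    print(f"{bits:>2}-bit weights: ~{gb:.2f} GB ({gb / VRAM_GB:.0%} of {VRAM_GB:.0f} GB)")
```

The estimate covers weights only. Activations, the key-value cache, the CUDA context, and quantization metadata all add to the real footprint.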

**Conclusion:**

A summary of the Gemma 2B-IT's performance on the RTX 2060, along with potential use cases and limitations.


### Installation 

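Install the following libraries:
```shell
pip install -U transformers
pip install torch
pip uninstall -y huggingface_hub
pip install git+https://github.com/huggingface/huggingface_hub
pip install ipywidgets
# bitsandbytes and accelerate are required by the 8-bit and 4-bit cells
pip install -U bitsandbytes accelerate
```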

### Login 
To log in to Hugging Face, I first had to install the command-line extra of `huggingface_hub`:
```shell
pip install -U "huggingface_hub[cli]"
```
Then run the following command:
```shell
huggingface-cli login
```

Here is the output of a successful login:
```shell
    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token can be pasted using 'Right-Click'.
Enter your token (input will not be visible):
Add token as git credential? (Y/n) Y
Token is valid (permission: fineGrained).
Your token has been saved in your configured git credential helpers (manager).
Your token has been saved to C:\Users\David\.cache\huggingface\token
Login successful
```


In [1]:
import time
import warnings

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

warnings.filterwarnings('ignore')  # hide library warnings in the notebook output

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")  # shared by all precision variants


In [2]:
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    torch_dtype=torch.bfloat16  # no device_map or .to("cuda") here, so the weights stay on the CPU
)

if torch.cuda.is_available():
    print("GPU is available")
    print("Device name:", torch.cuda.get_device_name(0))  # Print GPU name
else:
    print("GPU is not available")
    
input_text = "Tell me a joke about the moon"
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)  # keep inputs on the model's device

start_time = time.perf_counter()
outputs = model.generate(**input_ids, max_new_tokens=128)
end_time = time.perf_counter()
print(f"Elapsed time: {end_time - start_time:.6f} seconds")
print(tokenizer.decode(outputs[0]))

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


GPU is available
Device name: NVIDIA GeForce RTX 2060
Elapsed time: 79.951386 seconds
<bos>Tell me a joke about the moon.

What do you call a moon that's always crying?

A mooner.<eos>
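
The cell above calls `from_pretrained` without `device_map` or `.to("cuda")`, so the 16-bit model stays on the CPU even though a GPU is detected. A minimal sketch of a GPU-resident load follows. It assumes `float16` on the GPU, because Turing cards such as the RTX 2060 have, as far as I know, no native `bfloat16` support. The function names are hypothetical, and on a 6 GB card the 16-bit weights may not leave enough headroom for generation.

```python
def load_gemma_on_gpu(model_id="google/gemma-2b-it"):
    """Load tokenizer and model, moving the model to the GPU when one is available."""
    # Local imports so the function can be defined without the libraries installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Assumption: float16 on the GPU; bfloat16 is kept for the CPU fallback as in the notebook.
    dtype = torch.float16 if device == "cuda" else torch.bfloat16
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype).to(device)
    return tokenizer, model


def generate(tokenizer, model, prompt, max_new_tokens=128):
    """Tokenize the prompt on the model's device and return the decoded output."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0])
```

Usage would look like `tokenizer, model = load_gemma_on_gpu()` followed by `print(generate(tokenizer, model, "Tell me a joke about the moon"))`. Timing this variant would give a 16-bit figure measured on the same device as the quantized runs.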


**Quantized versions through bitsandbytes**

* 8 bits precision model

In [3]:
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    quantization_config=quantization_config
)

if torch.cuda.is_available():
    print("GPU is available")
    print("Device name:", torch.cuda.get_device_name(0))  # Print GPU name
else:
    print("GPU is not available")
    
input_text = "Tell me a joke about the moon"
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)  # keep inputs on the model's device

start_time = time.perf_counter()
outputs = model.generate(**input_ids, max_new_tokens=128)
end_time = time.perf_counter()
print(f"Elapsed time: {end_time - start_time:.6f} seconds")
print(tokenizer.decode(outputs[0]))

`low_cpu_mem_usage` was None, now set to True since model is quantized.


GPU is available
Device name: NVIDIA GeForce RTX 2060
Elapsed time: 10.703927 seconds
<bos>Tell me a joke about the moon.

What do you call a moon that's always crying?

A moon weep.<eos>


**Quantized versions through bitsandbytes**

* 4 bits precision model

In [4]:
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    quantization_config=quantization_config
)

if torch.cuda.is_available():
    print("GPU is available")
    print("Device name:", torch.cuda.get_device_name(0))  # Print GPU name
else:
    print("GPU is not available")
    
input_text = "Tell me a joke about the moon"
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)  # keep inputs on the model's device

start_time = time.perf_counter()
outputs = model.generate(**input_ids, max_new_tokens=128)
end_time = time.perf_counter()
print(f"Elapsed time: {end_time - start_time:.6f} seconds")
print(tokenizer.decode(outputs[0]))

`low_cpu_mem_usage` was None, now set to True since model is quantized.


GPU is available
Device name: NVIDIA GeForce RTX 2060
Elapsed time: 5.469612 seconds
<bos>Tell me a joke about the moon.

What do you call a moon that's always crying?

A sad moon!<eos>
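
None of the cells above record GPU memory, although it is one of the stated objectives. One possible way to capture it is PyTorch's peak-memory counters, as in the sketch below. The helper names are hypothetical, and the figure covers only memory allocated through PyTorch, so tools such as `nvidia-smi` will usually report more.

```python
def bytes_to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / 2**20


def peak_gpu_memory_mib(fn):
    """Run fn() and return (result, peak GPU memory allocated by PyTorch in MiB)."""
    import torch  # local import: only needed when the function is called

    if not torch.cuda.is_available():
        raise RuntimeError("A CUDA device is required to measure GPU memory")
    torch.cuda.reset_peak_memory_stats()  # start a fresh peak measurement
    result = fn()
    torch.cuda.synchronize()  # wait for all queued GPU work to finish
    return result, bytes_to_mib(torch.cuda.max_memory_allocated())
```

For example, `outputs, peak = peak_gpu_memory_mib(lambda: model.generate(**input_ids, max_new_tokens=128))` would report the peak during generation, including weights that are already resident on the GPU.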


# Conclusion

This experiment explored the impact of quantization on the Gemma 2B-IT model's performance, specifically focusing on inference time and output quality. The table below summarizes the findings:

| Precision (bits) | Inference time (s) | Output (punchline) |
|------------------|--------------------|--------------------|
| 16               | 79.95              | A mooner.          |
| 8                | 10.70              | A moon weep.       |
| 4                | 5.47               | A sad moon!        |

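As the results show, the 8-bit and 4-bit runs finished far sooner than the 16-bit run (10.70 s and 5.47 s versus 79.95 s). Part of this gap is probably not due to precision alone: the 16-bit cell loads the model without `device_map` or `.to("cuda")`, so it runs on the CPU, whereas quantized models loaded through bitsandbytes are typically placed on the GPU. Each figure is also a single, unrepeated run of one prompt, so the numbers are indicative rather than a controlled benchmark. The speed difference is accompanied by slight variations in the output.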
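
To put the gap in numbers, the speed-up of each run relative to the 16-bit run can be computed directly from the table. The sketch below uses only the three elapsed times reported above, and `speedup` is a hypothetical helper name.

```python
# Elapsed times in seconds, copied from the results table.
elapsed = {16: 79.95, 8: 10.70, 4: 5.47}


def speedup(baseline_seconds, seconds):
    """How many times faster a run is than the baseline run."""
    return baseline_seconds / seconds


baseline = elapsed[16]
for bits, seconds in elapsed.items():
    print(f"{bits:>2}-bit: {seconds:6.2f} s, {speedup(baseline, seconds):5.2f}x vs 16-bit")
```

These ratios compare the runs exactly as they were recorded, so they carry over any difference in the device each run used.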

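All three runs produce the same setup line and differ only in the punchline ("A mooner.", "A moon weep.", "A sad moon!"). A single joke prompt cannot show which output is more accurate, but it does suggest that the 8-bit and 4-bit models stay coherent and semantically close to the 16-bit output, with minor differences in wording.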

**Key Takeaway:**

For tasks where speed matters more than matching the 16-bit output exactly, lower-bit quantization (8-bit or 4-bit) of Gemma 2B-IT looks like a compelling trade-off on this hardware. This approach can be particularly useful in real-time applications or in scenarios with limited computational resources.
