## Setting up the environment


1. I used Visual Studio Code with the [Python](https://marketplace.visualstudio.com/items?itemName=ms-python.python) and [Polyglot Notebooks](https://marketplace.visualstudio.com/items?itemName=ms-dotnettools.dotnet-interactive-vscode) extensions to create this sample.
1. I ran the project with Python 3.12.3 and used venv to manage the dependencies.

**Install CUDA**
1. Download and install the CUDA Toolkit from https://developer.nvidia.com/cuda-downloads

1. Validate the installation by running the following command:
    ```
    nvcc --version
    ```
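
The same check can be scripted from Python, which is handy inside a notebook. This is a minimal sketch using only the standard library; `nvcc_version` is a hypothetical helper name, and it returns `None` when `nvcc` is not found on the `PATH`.

```python
import shutil
import subprocess


def nvcc_version():
    """Return the output of `nvcc --version`, or None if nvcc is unavailable."""
    nvcc_path = shutil.which("nvcc")
    if nvcc_path is None:
        return None
    result = subprocess.run(
        [nvcc_path, "--version"], capture_output=True, text=True, check=False
    )
    # An empty output is treated the same as a missing compiler.
    return result.stdout.strip() or None


version_text = nvcc_version()
print(version_text if version_text else "nvcc was not found on the PATH")
```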

**Install Torch**
1. Go to https://pytorch.org/get-started/locally/
1. Select the options that match your system and run the generated command. For example, for CUDA 12.1:
    ```
    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
    ```


In [None]:
# Note: you may need to restart the kernel to use updated packages.

# Provides hardware acceleration for running PyTorch on GPUs
# %pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Provide APIs and tools to easily download and train state-of-the-art pre-trained models.
# %pip install transformers

# A library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code!
# %pip install accelerate

# provides access to GPU metrics that we will print
# %pip install GPUtil

# provides access to distutils in Python 3.12
# %pip install setuptools

# allows for running python in VS Code Jupyter Notebooks
# %pip install ipykernel

# %pip freeze > requirements-using-phi1-inmemory.txt
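
After installing, it can help to confirm that every package is importable before running the rest of the notebook. The sketch below uses only the standard library; `missing_modules` and `REQUIRED_MODULES` are hypothetical names, and the list assumes the import names match the packages installed above.

```python
import importlib.util

# Import names of the packages installed in the cell above (an assumption).
REQUIRED_MODULES = ["torch", "transformers", "accelerate", "GPUtil", "setuptools", "ipykernel"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing:", missing_modules(REQUIRED_MODULES) or "none")
```

If the list is not empty, uncomment the matching `%pip install` line, run the cell, and restart the kernel.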

In [None]:
# Run the following code to validate the installation.
import torch

print(torch.cuda.is_available())

## Running the sample

In [None]:
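# Validate that CUDA is available
import torch

cuda_available = torch.cuda.is_available()
print("CUDA available? {}".format(cuda_available))

# Print the CUDA device name; get_device_name() fails when no GPU is present
if cuda_available:
    print("Using: {}".format(torch.cuda.get_device_name()))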


In [None]:
import torch
from transformers import AutoModelForCausalLM

# the model id is the model path on the Hugging Face model hub,
# you can find it in the model's page URL
base_model_id = "microsoft/phi-1"

torch.set_default_device("cuda")

# AutoModelForCausalLM: A class from the Hugging Face Transformers library for causal
#    language modeling (CLM), i.e. autoregressive generation, where the model predicts
#    the next token in a sequence given the previous tokens.
#
# from_pretrained(base_model_id, trust_remote_code=True, torch_dtype=torch.float16, device_map={"": 0}):
#    base_model_id: The pretrained model to load, given either as a model ID on the
#       Hugging Face Hub (e.g., 'microsoft/phi-1') or as a path to a local directory
#       containing the saved model files.
#    trust_remote_code=True: Allows Transformers to execute custom model code that is
#       defined in the model's repository on the Hub. Only enable it for repositories you trust.
#    torch_dtype=torch.float16: Loads the model's weights as 16-bit floating point
#       (half precision), which reduces memory usage and can speed up inference.
#    device_map={"": 0}: Places the whole model on a single device. The empty-string key
#       refers to the entire model, and 0 is the index of the first GPU.

# Load the pretrained causal language model identified by base_model_id
model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True, torch_dtype=torch.float16, device_map={"": 0})
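
To see why half precision matters, the memory needed for the weights alone can be estimated with simple arithmetic. This is a rough sketch, not a measurement: it assumes roughly 1.3 billion parameters for phi-1 and ignores activations and other runtime overhead. `weight_memory_gib` is a hypothetical helper name.

```python
# Bytes used per parameter for two common floating-point dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

# Assumption: roughly 1.3 billion parameters for phi-1.
ASSUMED_PARAMS = 1_300_000_000


def weight_memory_gib(num_params, dtype):
    """Estimate the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{weight_memory_gib(ASSUMED_PARAMS, dtype):.2f} GiB")
```

Under these assumptions, float16 halves the footprint of the weights compared with float32.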

We need a tokenizer to communicate with the model. The model doesn't understand raw text; it understands tokens. The tokenizer converts our text into tokens, and the model turns those tokens into predictions. It is important to use the same tokenizer that was used to train the model. The tokenizer is published alongside the model on the Hugging Face Hub, so we load it with the same model ID using `AutoTokenizer`.
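
To illustrate why the tokenizer must match the model, here is a toy sketch. `ToyTokenizer` is a hypothetical whitespace tokenizer with a fixed vocabulary; real tokenizers such as the one used by phi-1 work on subword units, so this only shows the idea of mapping text to IDs and back.

```python
class ToyTokenizer:
    """A whitespace tokenizer with a fixed vocabulary, for illustration only."""

    def __init__(self, vocabulary):
        self.token_to_id = {token: i for i, token in enumerate(vocabulary)}
        self.id_to_token = {i: token for token, i in self.token_to_id.items()}

    def encode(self, text):
        """Convert text into a list of token IDs."""
        return [self.token_to_id[token] for token in text.split()]

    def decode(self, ids):
        """Convert a list of token IDs back into text."""
        return " ".join(self.id_to_token[i] for i in ids)


tokenizer_a = ToyTokenizer(["def", "print", "prime", "return"])
tokenizer_b = ToyTokenizer(["return", "prime", "print", "def"])  # same tokens, different IDs

ids = tokenizer_a.encode("def print prime")
print(ids)                      # [0, 1, 2]
print(tokenizer_a.decode(ids))  # def print prime
print(tokenizer_b.decode(ids))  # return prime print
```

The same IDs decode to different text under a different vocabulary, which is why a mismatched tokenizer produces meaningless input for the model.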

In [None]:
from transformers import AutoTokenizer


# AutoTokenizer: A class from the Hugging Face Transformers library for tokenizing text.
#    Tokenization breaks a sequence of text into individual tokens (words, subwords, or
#    characters) that the language model can process.
#
# from_pretrained(base_model_id, use_fast=True):
#    base_model_id: The model whose tokenizer should be loaded, given either as a model ID
#       on the Hugging Face Hub or as a path to a local directory containing the saved
#       tokenizer files.
#    use_fast=True: Requests the fast Rust-based tokenizer if one is supported for the
#       given model. If not, a normal Python-based tokenizer is used instead.

# Load the tokenizer that matches base_model_id, preferring the fast implementation
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)

In [None]:
# let's test the model by generating a function that prints all prime numbers between 1 and n
prompt = '''def print_prime(n):
   """
   Print all primes between 1 and n
   """'''

model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**model_inputs, max_length=500)[0]

# and finally, we print the output
print(tokenizer.decode(output, skip_special_tokens=True)) 
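
Conceptually, generation appends one token at a time until a length limit or an end-of-sequence token is reached. The sketch below shows greedy decoding, one of several strategies `generate` supports, with a hypothetical stand-in for the model (`toy_next_token_scores`); it is not how Transformers is implemented. As with `max_length` above, the limit counts the prompt tokens too.

```python
def toy_next_token_scores(tokens):
    """Hypothetical stand-in for a model: score each candidate next token."""
    # Deterministic rule: prefer the token after the last one, wrapping around at 5.
    last = tokens[-1]
    return {candidate: (1.0 if candidate == (last + 1) % 5 else 0.0) for candidate in range(5)}


def greedy_generate(prompt_tokens, max_length, eos_token=4):
    """Append the highest-scoring token until max_length or eos_token is reached."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        scores = toy_next_token_scores(tokens)
        next_token = max(scores, key=scores.get)
        tokens.append(next_token)
        if next_token == eos_token:
            break
    return tokens


print(greedy_generate([0, 1], max_length=10))  # [0, 1, 2, 3, 4]
print(greedy_generate([0, 1], max_length=3))   # [0, 1, 2]
```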

In [None]:
# replace the code in the following block with the code provided by the model

#BEGIN CODE BLOCK
def print_prime(n):
   """
   Print all primes between 1 and n
   """
   for num in range(2, n+1):
       for i in range(2, num):
           if num % i == 0:
               break
       else:
           print(num)
#END CODE BLOCK

print_prime(10)
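
Model output is not guaranteed to be correct, so it is worth checking the generated function against an independent reference. This is a minimal sketch: `reference_primes` and `printed_numbers` are hypothetical helpers, and `candidate_print_prime` stands in for whatever code the model produced so that the block runs on its own.

```python
import contextlib
import io


def reference_primes(n):
    """Sieve of Eratosthenes: return all primes between 2 and n."""
    is_prime = [True] * (n + 1)
    primes = []
    for candidate in range(2, n + 1):
        if is_prime[candidate]:
            primes.append(candidate)
            for multiple in range(candidate * candidate, n + 1, candidate):
                is_prime[multiple] = False
    return primes


def printed_numbers(func, n):
    """Call func(n) and return the integers it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func(n)
    return [int(value) for value in buffer.getvalue().split()]


# Stand-in for the model's output so the sketch runs on its own.
def candidate_print_prime(n):
    for num in range(2, n + 1):
        if all(num % i for i in range(2, num)):
            print(num)


print(printed_numbers(candidate_print_prime, 50) == reference_primes(50))
```

To check the model's code, pass `print_prime` to `printed_numbers` in place of `candidate_print_prime`.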

In [None]:
import time
import GPUtil

def print_gpu_utilization():
    """Prints GPU usage using GPUtil."""
    gpus = GPUtil.getGPUs()
    for gpu in gpus:
        print(f"GPU {gpu.id}: {gpu.name}, Utilization: {gpu.load * 100:.2f}%")

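# A prompt to send to the model
prompt = "Cite 20 famous people."

model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
prompt_length = model_inputs["input_ids"].shape[1]

start_time = time.perf_counter()
output = model.generate(**model_inputs, max_length=500)[0]
elapsed = time.perf_counter() - start_time

# Count only the newly generated tokens; the output also contains the prompt tokens
generated_tokens = len(output) - prompt_length
tok_sec_prompt = round(generated_tokens / elapsed, 3)
print("Prompt --- %s tokens/second ---" % (tok_sec_prompt))

# print_gpu_utilization() prints its own output and returns None, so call it directly
print_gpu_utilization()

# And finally, we print the output
print(tokenizer.decode(output, skip_special_tokens=True))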
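
The throughput arithmetic can also be isolated into a small function that is easy to test without a GPU. This is a sketch with a hypothetical helper name, `tokens_per_second`; it assumes the convention of counting only newly generated tokens, since the sequence returned by `generate` also includes the prompt.

```python
def tokens_per_second(prompt_length, total_length, elapsed_seconds):
    """Throughput over newly generated tokens only (total minus prompt)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (total_length - prompt_length) / elapsed_seconds


# Example: a 6-token prompt extended to 506 tokens in 10 seconds.
print(tokens_per_second(prompt_length=6, total_length=506, elapsed_seconds=10.0))  # 50.0
```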
