# Day 7: March 14th 2024

## Running the Llama 3 8B Model on GPU vs. CPU

## Setting up the Environment

1. Make sure Python is installed on your system with the command `python --version`.
2. Open a terminal, navigate to the directory this notebook is in, and create a Python virtual environment with the command `python -m venv .env`.
3. Activate the virtual environment:
  - If you're on Windows, use `.env\Scripts\activate`.
  - If you're on Linux, use `source .env/bin/activate`.
  - If you're on an ISPM Lab computer and don't have admin privileges, open VS Code and open a terminal from there; it should activate the environment for you. You'll know the virtual environment is active if a small pop-up notification appears in the bottom right (unless you've previously disabled it, in which case you're probably already familiar with creating and activating virtual environments).
4. Install the dependencies with `pip install 'transformers[torch]' 'optimum[onnxruntime]' jupyter`. In the Windows Command Prompt, use double quotes instead of single quotes.
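
To confirm the steps above worked, a quick check like the one below can be run from the activated environment. This is a minimal sketch: the names in `REQUIRED` are my assumption of the import names that the `pip install` command provides, and the helper functions are hypothetical, not part of any library.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names for the packages installed in step 4.
REQUIRED = ["transformers", "torch", "optimum", "onnxruntime"]

print("virtual environment active:", in_virtualenv())
print("missing packages:", missing_packages(REQUIRED) or "none")
```

If anything is listed as missing, re-run the `pip install` command with the environment activated.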

### Authenticating with HuggingFace

1. Create an account on [HuggingFace](https://huggingface.co).
2. If the model that you're using requires permission to access it (for example, all of meta-llama's models), navigate to the model's page and request access.
3. While your request is being processed, or if no request is needed, generate a token to authenticate with.
   1. Click on your profile picture in the top right corner and click the Settings option from the drop-down.
   2. Click Access Tokens on the left.
   3. Click the New Token button and create a new token. The name doesn't matter, but the Type should be `Write`.
   4. Copy the value of the token and return to your terminal.
4. Run the command `huggingface-cli login` and, when prompted, paste your token in and hit Enter. I like to save the token in my git credentials as well, but it's not necessary.
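
The login can also be done and checked from Python instead of the CLI. The sketch below assumes the token has been placed in the `HF_TOKEN` environment variable (falling back to whatever `huggingface-cli login` cached); `get_token_from_env` and `check_login` are hypothetical helper names, while `login` and `whoami` come from the `huggingface_hub` package that is installed alongside `transformers`.

```python
import os


def get_token_from_env(env=os.environ):
    """Return the token stored in HF_TOKEN, or None if it is unset or blank."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def check_login():
    """Log in with HF_TOKEN if it is set, then return the account name.

    Needs network access; whoami() raises an error if no valid token is available.
    """
    # Imported lazily so the helper above works without huggingface_hub.
    from huggingface_hub import login, whoami

    token = get_token_from_env()
    if token is not None:
        login(token=token)
    return whoami()["name"]
```

Calling `check_login()` in a notebook cell should return your HuggingFace username once authentication is set up.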

### Installing Git-LFS

Git-LFS (Git Large File Storage) is an open-source extension to Git, originally developed by GitHub, that allows large files to be stored and versioned alongside a repository.

1. Make sure git is already installed with `git --version`.
   1. If git isn't installed already, download and set up git from the [git downloads page](https://git-scm.com/downloads).
2. Once you've confirmed that git is installed, download and install git-lfs by following the section of the [git lfs installation guide](https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage?platform=linux) appropriate for your operating system.
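
Both installations can be checked in one go from Python. This is a small sketch using only the standard library; `tool_version` is a hypothetical helper name, and the commands it runs are `git --version` and `git lfs version`.

```python
import shutil
import subprocess


def tool_version(command):
    """Return the first line the command prints, or None if it is unavailable or fails."""
    if shutil.which(command[0]) is None:
        return None  # executable is not on the PATH
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, check=True, timeout=30
        )
    except (subprocess.SubprocessError, OSError):
        return None  # e.g. git is installed but the lfs subcommand is not
    output = result.stdout.strip()
    return output.splitlines()[0] if output else None


print("git:", tool_version(["git", "--version"]))
print("git-lfs:", tool_version(["git", "lfs", "version"]))
```

A value of `None` for either line means that tool still needs to be installed.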

### From an ISPM computer with no GPU, run inference on a pipeline with some optimizations

In [None]:
# Imports
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
from optimum.pipelines import pipeline
from timeit import default_timer as timer

### Create the prompt we want to use on both the GPU and CPU

In [None]:
prompt = "Describe what a large language model is to me."

In [None]:
# Pull the model down from HuggingFace, export it to an ONNX Runtime (ORT) model with Optimum, and load the matching tokenizer from the transformers library.
# This might take a while since Llama hasn't already been converted to an ORT model. I'm unsure whether this will always be a long process or if it's only a long process the first time.
model = ORTModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", export=True)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
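
As far as I understand the Optimum documentation, `export=True` runs the ONNX conversion each time it is called, so saving the exported model and reloading it from disk should avoid repeating the slow step. The sketch below illustrates that idea; `has_onnx_export`, `load_or_export`, and the directory name `llama3-8b-onnx` are hypothetical, and I have not verified it against a model as large as Llama 3 8B.

```python
from pathlib import Path


def has_onnx_export(directory) -> bool:
    """Return True if the directory exists and contains at least one .onnx file."""
    path = Path(directory)
    return path.is_dir() and any(path.glob("*.onnx"))


def load_or_export(model_id, export_dir):
    """Load a previously exported ORT model, exporting and saving it first if needed."""
    # Imported lazily so has_onnx_export can be used without Optimum installed.
    from optimum.onnxruntime import ORTModelForCausalLM

    if has_onnx_export(export_dir):
        # Reuse the export saved by an earlier run.
        return ORTModelForCausalLM.from_pretrained(export_dir)

    # First run: convert to ONNX, then save so later runs can skip the export.
    model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
    model.save_pretrained(export_dir)
    return model


# Example usage (hypothetical directory name):
# model = load_or_export("meta-llama/Meta-Llama-3-8B", "llama3-8b-onnx")
```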

In [None]:

# Create the pipeline that will be used to perform the inference
onnx_pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, accelerator="ort")

# Define the context for the pipeline
context = "I'm a cybersecurity expert wanting to learn more about large language models."

In [None]:
# Get the start time
start_time = timer()

# Perform the inference. The pipeline takes a single text input, so the context is prepended to the prompt.
inference = onnx_pipe(f"{context} {prompt}")

# Get the stop time
end_time = timer()
elapsed_time = end_time - start_time  # Compute the elapsed time in seconds

# Compute the hours, minutes, and seconds it took to perform the inference
mins, secs = divmod(elapsed_time, 60)
hours, mins = divmod(mins, 60)

# Display results
print(f"Elapsed time (H:M:S): {hours:.0f}:{mins:.0f}:{secs:.5f}")
print("Output from model:")
print(f"{inference}")
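
Since the same timing has to be done again for the GPU run, the measurement and formatting can be pulled into small helpers. This is a sketch; `format_elapsed` and `timed` are hypothetical names, and `sum(range(...))` merely stands in for the pipeline call.

```python
from timeit import default_timer as timer


def format_elapsed(seconds: float) -> str:
    """Format a duration in seconds as H:MM:SS.sssss."""
    mins, secs = divmod(seconds, 60)
    hours, mins = divmod(mins, 60)
    return f"{int(hours)}:{int(mins):02d}:{secs:08.5f}"


def timed(fn, *args, **kwargs):
    """Call fn with the given arguments and return (result, elapsed_seconds)."""
    start = timer()
    result = fn(*args, **kwargs)
    return result, timer() - start


# Stand-in workload; in the notebook this would be the pipeline call.
result, elapsed = timed(sum, range(1_000_000))
print(f"Elapsed time (H:M:S): {format_elapsed(elapsed)}")
```

With these in place, the CPU and GPU runs can each be wrapped in `timed(...)` and reported in the same format.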