# Install all the required stuff

<span style="color:red">---> *USE A VIRTUAL ENVIRONMENT!* <---</span>

One clarification: I am using this notebook locally. On Google Colab some steps are not necessary.
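Since everything below is installed with `%pip` into whatever interpreter the notebook kernel uses, it can help to confirm that the kernel really runs inside a virtual environment. The following is a minimal sketch using only the standard library; `in_virtualenv` is a hypothetical helper name, and the check only covers `venv`/`virtualenv` environments (a conda environment is not detected this way).

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv/virtualenv."""
    # Inside a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If the second line prints `False`, the packages would most likely end up in the system-wide Python installation.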

Install PyTorch. Check https://github.com/unslothai/unsloth to see which PyTorch versions are supported; here I'm going to use PyTorch 2.3.0 + CUDA 12.1. Visit https://pytorch.org/ to get the install command for a different version.

In [None]:
%pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121

Check the CUDA version your PyTorch build was compiled against (just to be sure 🙃). In this case, it should be 12.1.

In [None]:
import torch

# torch.version.cuda is None on CPU-only builds, so format it instead of concatenating strings.
print(f"Your CUDA version: {torch.version.cuda}")

Install Unsloth. Once again, visit https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions to get the correct version for your system (PyTorch 2.3.0 + CUDA 12.1 in my case).

IMPORTANT: [there are some problems with inference right now](https://github.com/unslothai/unsloth/issues/702). For the time being, I will use an old version of Transformers.

In [None]:
%pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
%pip install transformers==4.38.0
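To double-check that the `ampere` extra used above fits your GPU, you can look at the compute capability that PyTorch reports. This is only a sketch: `is_ampere_or_newer` is a hypothetical helper, and it assumes that "ampere" in the extra name corresponds to compute capability 8.0 or higher (for example, an RTX 3090 reports 8.6). The Unsloth README remains the reference for which extra to pick.

```python
def is_ampere_or_newer(capability):
    """capability is a (major, minor) tuple, e.g. (8, 6) for an RTX 3090."""
    major, _minor = capability
    # Assumption: Ampere and later architectures report major version >= 8.
    return major >= 8


try:
    import torch
except ImportError:  # PyTorch is not installed in this environment
    torch = None

if torch is not None and torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    print(f"{name}: capability {cap}, Ampere or newer: {is_ampere_or_newer(cap)}")
else:
    print("No CUDA device visible from PyTorch.")
```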

Before proceeding, make sure that the following commands work on your system.

TODO: add more information on how to install the CUDA toolkit, which includes nvcc.

In [None]:
!nvcc --version

In [None]:
!python -m xformers.info

In [None]:
!python -m bitsandbytes
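If one of the three commands above fails, it is not always obvious whether the tool is missing or just misconfigured. A rough sketch of a presence check with the standard library follows; `missing_requirements` is a hypothetical helper, and it only tells you whether `nvcc` is on `PATH` and whether the two packages can be found, not whether they actually work with your GPU.

```python
import importlib.util
import shutil


def missing_requirements(executables=("nvcc",), modules=("xformers", "bitsandbytes")):
    """Return the names that are not on PATH or not importable."""
    # shutil.which returns None when the executable is not found on PATH.
    missing = [exe for exe in executables if shutil.which(exe) is None]
    # find_spec returns None when a top-level module is not installed.
    missing += [mod for mod in modules if importlib.util.find_spec(mod) is None]
    return missing


print("Missing:", missing_requirements() or "nothing")
```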

# Test Llama 3 Inference

Load the Llama 3 8B Instruct model (4-bit quantized).

In [None]:
from unsloth import FastLanguageModel

max_seq_length = 8192
dtype = None # Data type. None = auto detection.
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Switch the model to Unsloth's inference mode before generating.
FastLanguageModel.for_inference(model)

Prepare the prompt. I'm using the [Meta Llama 3 Instruct prompt format](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3).

The printed tensor should contain the IDs of these special tokens:
- `<|begin_of_text|>` = 128000
- `<|start_header_id|>` = 128006
- `<|end_header_id|>` = 128007
- `<|eot_id|>` = 128009

In [None]:
begin_of_text = "<|begin_of_text|>"
eot = "<|eot_id|>"
# The Llama 3 Instruct format puts a blank line ("\n\n") after each header.
system_header = "<|start_header_id|>system<|end_header_id|>\n\n"
user_header = "<|start_header_id|>user<|end_header_id|>\n\n"
assistant_header = "<|start_header_id|>assistant<|end_header_id|>\n\n"

prompt = (
    # f"{begin_of_text}"  # Not necessary: the tokenizer adds it automatically
    f"{system_header}You are a helpful AI assistant.{eot}"
    f"{user_header}Considering the following machine learning technique: nearest neighbour search in the field of machine learning. Can you provide me with specific user stories for the following application domains? Finance and Marketing{eot}"
    f"{assistant_header}"
)

inputs = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

print(inputs)
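The prompt above is assembled by hand for a single system and user message. As a sketch of how the same layout generalizes to any list of messages, the hypothetical helper `build_llama3_prompt` below follows the Meta Llama 3 Instruct format (header, blank line, content, `<|eot_id|>`). Like the cell above, it leaves out `<|begin_of_text|>` on the assumption that the tokenizer adds it.

```python
def build_llama3_prompt(messages, add_generation_prompt=True):
    """Render [{"role": ..., "content": ...}, ...] in the Llama 3 Instruct layout."""
    parts = []
    for message in messages:
        header = f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        parts.append(f"{header}{message['content']}<|eot_id|>")
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide user stories for Finance and Marketing?"},
]
print(build_llama3_prompt(messages))
```

The returned string can be passed to `tokenizer(...)` exactly like `prompt` in the cell above.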

Let's create user stories with Llama 3.

In [None]:
from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer)
generation_kwargs = {
    "input_ids": inputs,
    "streamer": text_streamer,  # print tokens as they are generated
    "max_new_tokens": 1024,
    "use_cache": True,
    "do_sample": True,          # sample instead of greedy decoding
    "temperature": 0.75,
    "top_p": 0.9,
}
_ = model.generate(**generation_kwargs)
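For intuition about what `temperature` and `top_p` do, here is a simplified, pure-Python sketch on toy logits. It is not the Transformers implementation: the function names and the toy values are made up for illustration. The temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps the smallest set of most likely tokens whose total probability reaches `top_p`.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_indices(probs, top_p):
    """Indices of the most likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
probs = softmax_with_temperature(toy_logits, temperature=0.75)
print("Probabilities:", [round(p, 3) for p in probs])
print("Tokens kept by top_p=0.9:", top_p_indices(probs, top_p=0.9))
```

Sampling then draws the next token only from the kept indices, after renormalizing their probabilities.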