<a href="https://colab.research.google.com/github/TrelisResearch/install-guides/blob/main/Google_Colab_Llama_Simple_Inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# *Simple Llama Inference in Google Colab*
---
**Running in Google Colab**

- Runtime -> Change runtime type.
- Choose a GPU runtime (with at least a T4 GPU).
- Run all cells.

**Running on Laptop**

Jupyter Tiny Llama for Laptop:
- Save and re-load chats.
- Upload pdf or text files for analysis.
- Purchase access [here](https://buy.stripe.com/28ocNyf4pci78EMbJ9).

---
Prepared by Trelis Research.

Find Trelis on [HuggingFace](https://huggingface.co/Trelis) and [YouTube](https://www.youtube.com/@TrelisResearch).



#### HuggingFace Login (optional)
- You don't need this if you are using Trelis Function Calling Llama 2 7B, which is public.
- You do need this to access private/gated repositories.

In [1]:
# !pip install huggingface_hub
# from huggingface_hub import notebook_login

# notebook_login()

#### Google Drive Mounting (optional, but recommended)
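This saves you time the next time you load the model.

If you don't use it, leave `cache_dir` set to `None` so the model and tokeniser below fall back to the default Hugging Face cache.

In [2]:
cache_dir = None  # None selects the default Hugging Face cache location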

In [3]:
# from google.colab import drive
# drive.mount('/content/drive')


In [4]:
# import os

# # This is the path to the Google Drive folder.
# drive_path = "/content/drive"

# # This is the path where you want to store your cache.
# cache_dir_path = os.path.join(drive_path, "My Drive/huggingface_cache")

# # Check if the Google Drive folder exists. If it does, use it as the cache_dir.
# # If not, set cache_dir to None to use the default Hugging Face cache location.
# if os.path.exists(drive_path):
#     cache_dir = cache_dir_path
#     os.makedirs(cache_dir, exist_ok=True) # Ensure the directory exists
# else:
#     cache_dir = None

In [5]:
# https://stackoverflow.com/questions/56081324/why-are-google-colab-shell-commands-not-working
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

# Setup and Install
- It's best to run Llama models on a GPU, which you can do using a free Colab notebook.
- Check the current Google Colab runtime in the top right corner.
- Or, go to the menu -> Runtime -> Change Runtime Type.
- Select GPU (T4).

In [6]:
# Set the runtime to "cpu" or "gpu". fLlama 7B (or 13B) needs too much RAM to run on CPU alone in a free or Pro Colab notebook, so use runtime = "gpu".
runtime = "gpu"  # OR "cpu"

if runtime == "cpu":
    runtimeFlag = "cpu"
elif runtime == "gpu":
    runtimeFlag = "cuda:0"
else:
    print("Invalid runtime. Please set it to either 'cpu' or 'gpu'.")
    runtimeFlag = None

print("Runtime flag is:", runtimeFlag)

Runtime flag is: cuda:0
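As an alternative to setting the flag by hand, the device can be detected at run time. The following is a minimal sketch, not part of the original notebook: `pick_runtime_flag` is a hypothetical helper name, and it assumes that falling back to `"cpu"` is acceptable when PyTorch or a CUDA GPU is missing.

```python
import importlib.util


def pick_runtime_flag(prefer_gpu: bool = True) -> str:
    """Return "cuda:0" when a CUDA GPU is usable, otherwise "cpu"."""
    # Only import torch if it is installed, so the helper also works
    # before the install cell has been run.
    if prefer_gpu and importlib.util.find_spec("torch") is not None:
        import torch

        if torch.cuda.is_available():
            return "cuda:0"
    return "cpu"


runtime_flag = pick_runtime_flag()
print("Runtime flag is:", runtime_flag)
```

On a Colab GPU runtime this should report `cuda:0`; on a CPU-only runtime it reports `cpu`, which is a hint to change the runtime type before loading a 7B or 13B model.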


In [7]:
### DEFINE THE HUGGING FACE MODEL

# # 1.1B model
model_id = "PY007/TinyLlama-1.1B-Chat-v0.1"

# # 7B model
# model_id = "Trelis/Llama-2-7b-chat-hf-function-calling-v2"
# model_id = "meta-llama/Llama-2-7b-chat-hf" # alternately, you can load this and then apply the adapter (see the commented out cells below).

# 13B model (recommended for better function prompting precision, requires paid access)
# model_id = "Trelis/Llama-2-13b-chat-hf-function-calling-v2"

### Install

In [8]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U einops
!pip install -q -U safetensors
!pip install -q -U torch
!pip install -q -U xformers


### Import

In [9]:
import transformers
import torch
import json
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline, TextStreamer

## Load Model
This can take around 30 minutes, which is why connecting Google Drive for caching is recommended. The next time you run the notebook, loading will be much faster because the checkpoint shards are read from the cache instead of being downloaded again from HuggingFace.

In [10]:
if runtime == "gpu":
    # Load the model in 4-bit to allow it to fit in a free Google Colab runtime with a CPU and T4 GPU
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,  # also quantizes the quantization constants, saving memory with minimal loss of quality
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        # device_map='auto', # for inference use 'auto', for training use device_map={"":0}
        device_map=runtimeFlag,
        trust_remote_code=True,
        # rope_scaling = {"type": "dynamic", "factor": 2.0}, # allows for a max sequence length of 8192 tokens !!! [not tested in this notebook yet]
        cache_dir=cache_dir)
    # Not possible to use bitsandbytes quantization if using cpu only, afaik
else:
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map=runtimeFlag, trust_remote_code=True, cache_dir=cache_dir) # this can easily exhaust Colab RAM. Note that bfloat16 can't be used on cpu.

Downloading (…)lve/main/config.json:   0%|          | 0.00/652 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/4.40G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/63.0 [00:00<?, ?B/s]
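To see why 4-bit loading matters on a T4 (16 GB of GPU memory), a rough weights-only estimate helps. This is a back-of-the-envelope sketch: the parameter counts are round figures assumed for illustration, `weight_memory_gb` is a hypothetical helper, and activations, the KV cache, and quantization overhead are ignored.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Round parameter counts, assumed for illustration only.
models = [("TinyLlama 1.1B", 1.1e9), ("Llama 2 7B", 7e9), ("Llama 2 13B", 13e9)]

for name, n_params in models:
    fp32 = weight_memory_gb(n_params, 32)
    fp16 = weight_memory_gb(n_params, 16)
    nf4 = weight_memory_gb(n_params, 4)
    print(f"{name}: ~{fp32:.1f} GB (32-bit), ~{fp16:.1f} GB (16-bit), ~{nf4:.1f} GB (4-bit)")
```

The 32-bit figure for the 1.1B model is consistent with the 4.40G `model.safetensors` download shown in the output above. By the same arithmetic, a 7B model needs roughly 14 GB in 16-bit but only about 3.5 GB in 4-bit, which leaves headroom on a T4.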

## Add an adapter
This is a more complicated but usually faster way to get the model loaded.

In [11]:
# !pip install -q -U git+https://github.com/huggingface/peft.git

# from peft import PeftModel

# adapter_model = "Trelis/Llama-2-7b-chat-hf-function-calling-adapters-v2"

# # load peft model with new adapters
# model = PeftModel.from_pretrained(
#     model,
#     adapter_model,
# )

## Set up the Tokenizer

In [12]:
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir, use_fast=True) # will use the Rust fast tokenizer if available

# from transformers import LlamaTokenizer
# tokenizer = LlamaTokenizer.from_pretrained(model_id, cache_dir=cache_dir) # alternative: the slow, SentencePiece-based tokenizer

Downloading (…)okenizer_config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)in/added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]

Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`,  it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


# Inference (Simple stream)



In [21]:
# Define a stream *without* function calling capabilities
def stream(user_prompt):
    system_prompt = 'Assistant responses are concise.'

    # Guanaco style for TinyLlama
    B_INST, E_INST = "### Human:", "### Assistant:"
    B_SYS, E_SYS = "\n", "\n\n"

    # # Llama style
    # B_INST, E_INST = "[INST]", "[/INST]"
    # B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

    prompt = f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{user_prompt.strip()} {E_INST}\n\n"

    inputs = tokenizer([prompt], return_tensors="pt").to(runtimeFlag)

    streamer = TextStreamer(tokenizer)

    # Despite returning the usual output, the streamer will also print the generated text to stdout.
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=50)
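The prompt assembly in `stream` can be checked without loading a model. The sketch below reproduces the same f-string for both templates; `PROMPT_STYLES` and `build_prompt` are hypothetical names introduced here, and the delimiters are copied from the cell above.

```python
# Delimiters copied from the stream() cell; the dictionary layout is illustrative.
PROMPT_STYLES = {
    "guanaco": {"b_inst": "### Human:", "e_inst": "### Assistant:",
                "b_sys": "\n", "e_sys": "\n\n"},
    "llama": {"b_inst": "[INST]", "e_inst": "[/INST]",
              "b_sys": "<<SYS>>\n", "e_sys": "\n<</SYS>>\n\n"},
}


def build_prompt(user_prompt: str,
                 system_prompt: str = "Assistant responses are concise.",
                 style: str = "guanaco") -> str:
    """Assemble the prompt string exactly as stream() does, for a given style."""
    s = PROMPT_STYLES[style]
    return (f"{s['b_inst']} {s['b_sys']}{system_prompt.strip()}{s['e_sys']}"
            f"{user_prompt.strip()} {s['e_inst']}\n\n")


# repr() makes the newlines and spaces in the template visible.
print(repr(build_prompt("List the planets in our solar system")))
print(repr(build_prompt("List the planets in our solar system", style="llama")))
```

Printing the `repr` is a quick way to confirm that the template matches what the chosen model was fine-tuned on, since a mismatched template usually degrades the responses.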

In [22]:
stream('List the planets our solar system')

<s> ### Human: 
Assistant responses are concise.

List the planets our solar system ### Assistant:

1. Mercury
2. Venus
3. Earth
4. Mars
5. Jupiter
6. Saturn
7. Uranus
8. Neptune



Concise and to the point
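The streamed text above includes the prompt itself. `TextStreamer` accepts `skip_prompt=True` to suppress that echo while streaming. When working with an already decoded string, the reply can also be cut out afterwards. The following is a minimal sketch assuming the Guanaco-style markers used in this notebook; `extract_reply` is a hypothetical helper.

```python
def extract_reply(decoded: str,
                  reply_marker: str = "### Assistant:",
                  next_turn_marker: str = "### Human:") -> str:
    """Return only the assistant's first reply from a decoded generation."""
    # Everything after the first reply marker is generated text.
    _, found, reply = decoded.partition(reply_marker)
    if not found:
        return decoded.strip()
    # Drop any further turns the model may have invented.
    reply, _, _ = reply.partition(next_turn_marker)
    return reply.strip()


decoded = (
    "<s> ### Human: \nAssistant responses are concise.\n\n"
    "List the planets our solar system ### Assistant:\n\n"
    "1. Mercury\n2. Venus\n"
)
print(extract_reply(decoded))
```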
