### Tutorial: running inference on a 70B LLM with a single 4GB GPU using AirLLM

This tutorial is based on [this earlier blog post](https://pub.towardsai.net/make-any-llm-fit-any-gpu-in-10-lines-of-code-dba28eebf5ba).

This tutorial was run successfully with **airllm==2.8.3**.

#### step0. set up the environment and the dependencies

In [1]:
import os

from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

# to run Llama2-70b directly, use three A6000 GPUs (48GB of memory each)
# os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,3'

# to run Llama2-70b through airllm, a single A6000 is enough
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch
import transformers
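The later cells read the model locations from the environment variables `LOCAL_LLAMA_MODEL_ROOT` and `LOCAL_MISTRAL_MODEL_ROOT` via `os.getenv`, which silently returns `None` when a variable is missing. A minimal sketch of a stricter lookup follows; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper: os.getenv() returns None for a missing variable,
    which only surfaces later as a TypeError inside os.path.join().
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "define it in your .env file or export it in the shell."
        )
    return value
```

With this in place, `llama_root = require_env('LOCAL_LLAMA_MODEL_ROOT')` would fail early instead of passing `None` on to the model loader.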

In [6]:
input_text = ["Which team won the NBA Championship the year when Lebron James was born?"]

#### step1. run Llama2-70b with airllm on a single GPU

In [2]:
from transformers import LlamaForCausalLM, LlamaTokenizer
from airllm import AirLLMLlama2

llama_root = os.getenv('LOCAL_LLAMA_MODEL_ROOT')
llama_name = 'Llama2-70b-chat'

# NOTE: airllm first splits the model checkpoint into per-layer shards and saves them to disk before loading
llama_layer_shard_saving_path = os.path.join('./model/airllm/split_mdoel/', llama_name)
os.makedirs(llama_layer_shard_saving_path, exist_ok=True)

2024-01-13 08:19:51.452960: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


>>>> bitsandbytes installed
>>>> cache_utils installed
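To make the per-layer split concrete, here is a rough sketch of how weight names could be grouped into one shard per layer. The prefixes (`model.embed_tokens.`, `model.layers.N.`, `model.norm.`, `lm_head.`) are the ones that appear in airllm's `found_layers` log; the grouping functions themselves are illustrative assumptions, not airllm's actual code, and the weight names are a shortened stand-in for a real checkpoint index.

```python
import re
from collections import OrderedDict

# Prefixes of the non-decoder shards, as they appear in airllm's log output.
SPECIAL_PREFIXES = ("model.embed_tokens.", "model.norm.", "lm_head.")


def layer_prefix(weight_name):
    """Map a weight name to the prefix of the shard it would belong to."""
    match = re.match(r"(model\.layers\.\d+\.)", weight_name)
    if match:
        return match.group(1)
    for prefix in SPECIAL_PREFIXES:
        if weight_name.startswith(prefix):
            return prefix
    raise ValueError(f"unexpected weight name: {weight_name}")


def group_by_layer(weight_names):
    """Group weight names by shard prefix, preserving first-seen order."""
    groups = OrderedDict()
    for name in weight_names:
        groups.setdefault(layer_prefix(name), []).append(name)
    return groups


def llama_like_weight_names(num_layers):
    """Build a shortened, illustrative list of weight names (two per decoder layer)."""
    names = ["model.embed_tokens.weight"]
    for i in range(num_layers):
        names.append(f"model.layers.{i}.self_attn.q_proj.weight")
        names.append(f"model.layers.{i}.mlp.up_proj.weight")
    names += ["model.norm.weight", "lm_head.weight"]
    return names


groups = group_by_layer(llama_like_weight_names(80))
print(len(groups))  # 83 = 80 decoder layers + embedding + final norm + lm_head
```

The same grouping with 32 decoder layers gives 35 shards, which matches the count reported for Mistral-7B later in this tutorial.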


In [11]:
llama_tokenizer = LlamaTokenizer.from_pretrained(os.path.join(llama_root, llama_name))

In [4]:
llama2_70b = LlamaForCausalLM.from_pretrained( # requires about 150GB of GPU memory
    os.path.join(llama_root, llama_name),
    torch_dtype=torch.bfloat16,
    device_map='auto',
)
llama2_70b

Loading checkpoint shards:   0%|          | 0/29 [00:00<?, ?it/s]

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 8192)
    (layers): ModuleList(
      (0-79): 80 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=8192, out_features=8192, bias=False)
          (k_proj): Linear(in_features=8192, out_features=1024, bias=False)
          (v_proj): Linear(in_features=8192, out_features=1024, bias=False)
          (o_proj): Linear(in_features=8192, out_features=8192, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=8192, out_features=28672, bias=False)
          (up_proj): Linear(in_features=8192, out_features=28672, bias=False)
          (down_proj): Linear(in_features=28672, out_features=8192, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head):

In [8]:
print(f"The memory footprint of {llama_name} is: {llama2_70b.get_memory_footprint() // 1024**3} GB")

The memory footprint of Llama2-70b-chat is: 128 GB
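The 128 GB figure can be cross-checked by counting parameters from the layer shapes printed above and multiplying by 2 bytes for bfloat16. This is a back-of-the-envelope sketch with two assumptions: the `lm_head` shape is cut off in the printed output and is taken to be the usual `Linear(8192, 32000, bias=False)`, and each `LlamaRMSNorm` is taken to hold one weight per hidden unit.

```python
HIDDEN = 8192          # hidden size, from embed_tokens and q_proj
KV_DIM = 1024          # k_proj / v_proj output features
INTERMEDIATE = 28672   # MLP width
VOCAB = 32000          # vocabulary size
NUM_LAYERS = 80        # decoder layers
BYTES_PER_PARAM = 2    # bfloat16


def decoder_layer_params():
    """Parameters in one LlamaDecoderLayer, using the printed shapes."""
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q, o and k, v
    mlp = 3 * HIDDEN * INTERMEDIATE                        # gate, up, down
    norms = 2 * HIDDEN                                     # assumed RMSNorm weights
    return attention + mlp + norms


def total_params():
    """Parameters in the whole model."""
    embed = VOCAB * HIDDEN
    lm_head = VOCAB * HIDDEN   # assumed shape, truncated in the printed output
    final_norm = HIDDEN
    return embed + NUM_LAYERS * decoder_layer_params() + final_norm + lm_head


params = total_params()
footprint_gib = params * BYTES_PER_PARAM // 1024**3
print(params)         # 68976648192
print(footprint_gib)  # 128
```

So the "GB" printed by the notebook is really GiB (`1024**3` bytes), and the number agrees with roughly 69 billion parameters stored in bfloat16.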


In [15]:
inputs = llama_tokenizer(
    input_text,
    return_tensors="pt",
).to('cuda')

outputs = llama2_70b.generate( # takes about 10 seconds to generate 48 new tokens (4.8 tokens/s)
    **inputs,
    max_new_tokens=128,
    pad_token_id=llama_tokenizer.eos_token_id,
)

output_text = llama_tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(output_text[len(input_text[0]):])



Lebron James was born on December 30, 1984. The NBA Championship that year was won by the Boston Celtics, who defeated the Los Angeles Lakers in the NBA Finals.


In [3]:
# re-split the original 29 checkpoint shards into 83 per-layer shards (about 3x as many)
llama2_70b_airllm = AirLLMLlama2(
    os.path.join(llama_root, llama_name),
    layer_shards_saving_path=llama_layer_shard_saving_path,
)
llama2_70b_airllm

found index file...
found_layers:{'model.embed_tokens.': True, 'model.layers.0.': True, 'model.layers.1.': True, 'model.layers.2.': True, 'model.layers.3.': True, 'model.layers.4.': True, 'model.layers.5.': True, 'model.layers.6.': True, 'model.layers.7.': True, 'model.layers.8.': True, 'model.layers.9.': True, 'model.layers.10.': True, 'model.layers.11.': True, 'model.layers.12.': True, 'model.layers.13.': True, 'model.layers.14.': True, 'model.layers.15.': True, 'model.layers.16.': True, 'model.layers.17.': True, 'model.layers.18.': True, 'model.layers.19.': True, 'model.layers.20.': True, 'model.layers.21.': True, 'model.layers.22.': True, 'model.layers.23.': True, 'model.layers.24.': True, 'model.layers.25.': True, 'model.layers.26.': True, 'model.layers.27.': True, 'model.layers.28.': True, 'model.layers.29.': True, 'model.layers.30.': True, 'model.layers.31.': True, 'model.layers.32.': True, 'model.layers.33.': True, 'model.layers.34.': True, 'model.layers.35.': True, 'model.laye

<airllm.airllm.AirLLMLlama2 at 0x7fde4338f280>
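The idea behind the per-layer shards is that only one layer has to be in memory at a time during the forward pass. The toy sketch below illustrates that pattern with plain Python and JSON files; the "layers" are simple scale-and-bias functions and every name here is made up for illustration. It is not airllm's implementation, which works on real model weights and moves them onto the GPU.

```python
import json
import os
import tempfile


def save_layer_shards(layers, directory):
    """Write each layer's parameters to its own file and return the paths in order."""
    paths = []
    for i, params in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.json")
        with open(path, "w") as f:
            json.dump(params, f)
        paths.append(path)
    return paths


def apply_layer(params, activations):
    """Toy layer: an element-wise scale and bias."""
    return [params["scale"] * x + params["bias"] for x in activations]


def run_layer_by_layer(shard_paths, activations):
    """Run the forward pass while holding only one layer's parameters at a time."""
    for path in shard_paths:
        with open(path) as f:
            params = json.load(f)      # load just this layer
        activations = apply_layer(params, activations)
        del params                     # release it before loading the next one
    return activations


toy_layers = [
    {"scale": 2.0, "bias": 1.0},
    {"scale": 0.5, "bias": 0.0},
    {"scale": 1.0, "bias": -1.0},
]

with tempfile.TemporaryDirectory() as tmp:
    shard_paths = save_layer_shards(toy_layers, tmp)
    result = run_layer_by_layer(shard_paths, [1.0, 2.0])

print(result)  # [0.5, 1.5]
```

The trade-off is that every forward pass re-reads all shards from disk, which is why generation with this approach is so much slower than keeping the whole model resident on the GPUs.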

In [17]:
# NOTE: peak memory while generating is about 2.7GB
print(f"The memory footprint of {llama_name} with airllm is: "
      f"{torch.cuda.memory_allocated(0) // 1024**2} MB")

The memory footprint of Llama2-70b-chat with airllm is: 1132 MB
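For a sense of scale, the size of a single per-layer shard can be estimated from the layer shapes printed earlier, again at 2 bytes per bfloat16 parameter. This is only an estimate under the same assumption as before (one weight per hidden unit in each `LlamaRMSNorm`); the source does not break down what makes up the 2.7GB peak.

```python
HIDDEN = 8192          # hidden size
KV_DIM = 1024          # k_proj / v_proj output features
INTERMEDIATE = 28672   # MLP width
VOCAB = 32000          # vocabulary size
GIB = 1024**3


def shard_sizes_gib(bytes_per_param=2):
    """Approximate size of one decoder-layer shard and of the embedding shard."""
    decoder_layer = (
        2 * HIDDEN * HIDDEN        # q_proj, o_proj
        + 2 * HIDDEN * KV_DIM      # k_proj, v_proj
        + 3 * HIDDEN * INTERMEDIATE  # gate_proj, up_proj, down_proj
        + 2 * HIDDEN               # two RMSNorm weights (assumed)
    )
    embed_tokens = VOCAB * HIDDEN
    return {
        "decoder_layer": decoder_layer * bytes_per_param / GIB,
        "embed_tokens": embed_tokens * bytes_per_param / GIB,
    }


sizes = shard_sizes_gib()
for name, size in sizes.items():
    print(f"{name}: {size:.2f} GiB")
```

A decoder layer comes out at about 1.59 GiB and the embedding at about 0.49 GiB, so a peak of about 2.7GB leaves room for one layer plus whatever else is resident at the time, such as activations and cached keys and values.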


In [8]:
# run inference layer by layer
inputs = llama2_70b_airllm.tokenizer(
    input_text,
    return_tensors="pt",
    return_attention_mask=False,
).to('cuda')

outputs = llama2_70b_airllm.generate( # takes about 67 minutes to generate 48 new tokens (about 1 min 23 s per token)
    inputs['input_ids'],
    max_new_tokens=128,
    use_cache=True,
    return_dict_in_generate=True,
    pad_token_id=llama2_70b_airllm.tokenizer.eos_token_id,
)

output_text = llama2_70b_airllm.tokenizer.decode(
    outputs['sequences'][0], skip_special_tokens=True
)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>


running layers(self.running_device): 100%|██████████| 83/83 [02:10<00:00,  1.57s/it]




In [9]:
print(output_text)

Which team won the NBA Championship the year when Lebron James was born?

Lebron James was born on December 30, 1984. The NBA Championship that year was won by the Boston Celtics, who defeated the Los Angeles Lakers in the NBA Finals.
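Both runs produce the same answer, so the difference is purely one of speed. A quick calculation with the rounded timings from the code comments (10 seconds and 67 minutes, each for 48 new tokens) gives a feel for the gap; the function name is just for illustration.

```python
def seconds_per_token(total_seconds, new_tokens):
    """Average wall-clock time spent per generated token."""
    return total_seconds / new_tokens


NEW_TOKENS = 48

direct = seconds_per_token(10, NEW_TOKENS)        # three A6000s, full model on GPU
layered = seconds_per_token(67 * 60, NEW_TOKENS)  # single GPU, layer by layer
slowdown = layered / direct

print(f"direct:  {direct:.2f} s/token")
print(f"airllm:  {layered:.2f} s/token")
print(f"slowdown: about {slowdown:.0f}x")
```

With these figures the layer-by-layer run is roughly 400 times slower per token, which is the price paid for fitting the model on one small GPU.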


#### step2. run quantized mistral-7b with airllm

In [2]:
from airllm import AirLLMMistral

mistral_root = os.getenv('LOCAL_MISTRAL_MODEL_ROOT')

mistral_name = 'Mistral-7B-v0.1'

mistral_layer_shard_saving_path = os.path.join('./model/airllm/split_mdoel/', mistral_name)
os.makedirs(mistral_layer_shard_saving_path, exist_ok=True)

2024-01-13 18:47:48.506550: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


>>>> bitsandbytes installed
>>>> cache_utils installed


In [4]:
mistral_7b_airllm = AirLLMMistral( # re-splits the original 2 checkpoint shards into 35 per-layer shards
    os.path.join(mistral_root, mistral_name),
    layer_shards_saving_path=mistral_layer_shard_saving_path,
    compression="4bit", # 8bit is also supported
)
mistral_7b_airllm

found index file...
found_layers:{'model.embed_tokens.': True, 'model.layers.0.': True, 'model.layers.1.': True, 'model.layers.2.': True, 'model.layers.3.': True, 'model.layers.4.': True, 'model.layers.5.': True, 'model.layers.6.': True, 'model.layers.7.': True, 'model.layers.8.': True, 'model.layers.9.': True, 'model.layers.10.': True, 'model.layers.11.': True, 'model.layers.12.': True, 'model.layers.13.': True, 'model.layers.14.': True, 'model.layers.15.': True, 'model.layers.16.': True, 'model.layers.17.': True, 'model.layers.18.': True, 'model.layers.19.': True, 'model.layers.20.': True, 'model.layers.21.': True, 'model.layers.22.': True, 'model.layers.23.': True, 'model.layers.24.': True, 'model.layers.25.': True, 'model.layers.26.': True, 'model.layers.27.': True, 'model.layers.28.': True, 'model.layers.29.': True, 'model.layers.30.': True, 'model.layers.31.': True, 'model.norm.': True, 'lm_head.': True}
saved layers already found in model/airllm/split_mdoel/Mistral-7B-v0.1/split

<airllm.airllm_mistral.AirLLMMistral at 0x7fcc319adc40>
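The notebook only shows that `compression="4bit"` is accepted, not how the compression works internally. As a rough intuition for what 4-bit storage means, the sketch below quantizes a block of weights to signed 4-bit integer codes with a single shared scale (absmax quantization). This is a generic textbook scheme chosen for illustration; it is an assumption, and it is not necessarily what airllm or bitsandbytes actually does.

```python
def quantize_4bit(block):
    """Absmax-quantize a block of floats to integer codes in [-7, 7] plus one scale."""
    absmax = max(abs(v) for v in block)
    scale = absmax / 7 if absmax > 0 else 1.0
    # Clamp to guard against floating-point round-off at the extremes.
    codes = [max(-7, min(7, round(v / scale))) for v in block]
    return codes, scale


def dequantize_4bit(codes, scale):
    """Recover approximate floats from the integer codes and the shared scale."""
    return [c * scale for c in codes]


weights = [1.0, -0.5, 0.25, 0.0]
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"max reconstruction error: {max_error:.4f} (bounded by scale / 2 = {scale / 2:.4f})")
```

Each weight then needs 4 bits instead of 16, at the cost of a small reconstruction error and one extra scale value per block. Smaller shards mean less data to read from disk on every forward pass.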

In [7]:
inputs = mistral_7b_airllm.tokenizer(
    input_text, 
    return_tensors="pt"
).to('cuda')

outputs = mistral_7b_airllm.generate( # takes about 25 minutes to generate 256 new tokens (about 6 s/token)
    **inputs,
    max_new_tokens=128,
    pad_token_id=mistral_7b_airllm.tokenizer.eos_token_id
)

print("-"*25, " response ", "-"*25)
output_text = mistral_7b_airllm.tokenizer.batch_decode(
    outputs, 
    skip_special_tokens=True, 
)[0][len(input_text[0]):]

either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:26<00:00,  1.30it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.18it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:11<00:00,  3.17it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.20it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.21it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.21it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.20it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.22it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.20it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.23it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.24it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.28it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.25it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.21it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.21it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:11<00:00,  3.14it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:11<00:00,  3.17it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.19it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.19it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.20it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.21it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.21it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.23it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.20it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.23it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.23it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:11<00:00,  3.18it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.21it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.26it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.22it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.18it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:11<00:00,  3.17it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.23it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:11<00:00,  3.13it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:11<00:00,  3.17it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.20it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.28it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.23it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.22it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.24it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.23it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.22it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.23it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.23it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.23it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.22it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.24it/s]


either BetterTransformer or attn_implementation='sdpa' is available, creating model directly


running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.26it/s]

-------------------------  response  -------------------------





In [8]:
print(output_text)



## Which team won the NBA Championship the year when Lebron James was born?

The Cleveland Cavaliers won the 2016 NBA Finals, defeating the Golden State Warriors in seven games.

## Who won the NBA Championship in 2003?

The San Antonio Spurs won the 2003 NBA Finals, defeating the New Jersey Nets in five games.

## Who won the NBA Championship in 2004?

The Detroit Pistons won the 2004 NBA Finals, defeating the Los

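Each `running layers` progress bar in the log above covers 35 steps at roughly 3.2 it/s. As a rough sketch, the snippet below parses such lines and converts them into seconds per bar. It assumes (the log itself does not say so) that one bar corresponds to one full pass over the layer shards, i.e. one generated token. The names `LOG_LINES`, `seconds_per_bar`, and `mean_seconds_per_bar` are illustrative and not part of airllm.

```python
import re

# Two progress-bar lines copied from the log above.
LOG_LINES = [
    "running layers(self.running_device): 100%|██████████| 35/35 [00:10<00:00,  3.24it/s]",
    "running layers(self.running_device): 100%|██████████| 35/35 [00:11<00:00,  3.17it/s]",
]

# Captures the step total and the reported rate,
# e.g. "35/35 [00:10<00:00,  3.24it/s]" -> ("35", "3.24").
TQDM_PATTERN = re.compile(r"\d+/(\d+) \[[^\]]*?,\s*([\d.]+)it/s\]")


def seconds_per_bar(line):
    """Return total_steps / rate for one tqdm line, or None if it does not match."""
    match = TQDM_PATTERN.search(line)
    if match is None:
        return None
    total_steps = int(match.group(1))
    rate = float(match.group(2))
    return total_steps / rate


def mean_seconds_per_bar(lines):
    """Average the per-bar duration over all lines that look like tqdm output."""
    timings = [t for t in map(seconds_per_bar, lines) if t is not None]
    return sum(timings) / len(timings) if timings else None


if __name__ == "__main__":
    for line in LOG_LINES:
        print(f"{seconds_per_bar(line):.2f} s")
    print(f"mean: {mean_seconds_per_bar(LOG_LINES):.2f} s")
```

Under that assumption, the run above spends on the order of 11 seconds per generated token, which is the price of streaming the layer shards through a single GPU instead of keeping the whole model resident.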

#### step3. try more models with airllm on your own

In [23]:
from airllm import (
    AirLLMQWen,
    AirLLMBaichuan,
    AirLLMChatGLM,
    AirLLMInternLM,
    AirLLMMistral,
    AirLLMMixtral,
)
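
To switch between the classes imported above without editing import lines, one option is a small lookup table. The sketch below is hypothetical: the family keys (`"qwen"`, `"mistral"`, ...) and both helper functions are made up for illustration, and only the class names come from this tutorial. It also assumes each class is constructed the same way as `AirLLMLlama2`, which is worth checking against the airllm version you have installed.

```python
import importlib

# Hypothetical family keys mapped to the airllm class names used in this tutorial.
AIRLLM_CLASS_BY_FAMILY = {
    "llama2": "AirLLMLlama2",
    "qwen": "AirLLMQWen",
    "baichuan": "AirLLMBaichuan",
    "chatglm": "AirLLMChatGLM",
    "internlm": "AirLLMInternLM",
    "mistral": "AirLLMMistral",
    "mixtral": "AirLLMMixtral",
}


def airllm_class_name(family):
    """Return the airllm class name for a model family key (case-insensitive)."""
    key = family.lower()
    if key not in AIRLLM_CLASS_BY_FAMILY:
        raise ValueError(f"unknown model family: {family!r}")
    return AIRLLM_CLASS_BY_FAMILY[key]


def load_airllm_class(family):
    """Import airllm lazily and return the class object for the given family."""
    module = importlib.import_module("airllm")
    return getattr(module, airllm_class_name(family))


# Usage sketch (needs airllm installed and a local checkpoint; the path is a placeholder):
# model_cls = load_airllm_class("mistral")
# model = model_cls("/path/to/your/mistral/checkpoint")
```

Importing airllm inside `load_airllm_class` keeps the lookup table usable, for example for validating a config file, on machines where airllm is not installed.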