# Mixtral in Colab

Welcome! In this notebook you can run [Mixtral-8x7B-Instruct](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) with decent generation speed **right in Google Colab or on a consumer-grade GPU**. This is made possible by quantizing the original model in mixed precision and implementing a MoE-specific offloading strategy.

To learn more, read our [tech report](https://arxiv.org/abs/2312.17238) or check out the [repo](https://github.com/dvmazur/mixtral-offloading) on GitHub.

You will need approximately 16 GB of VRAM and 11 GB of RAM to run this notebook and generate somewhat long texts.


<details>

<summary>How to balance between RAM and GPU VRAM usage</summary>

You can balance RAM and GPU VRAM usage by changing the <code>offload_per_layer</code> variable in the <a href="#scrollTo=_mIpePTMFyRY&line=10&uniqifier=1">Initialize model</a> section. Increasing <code>offload_per_layer</code> decreases GPU VRAM usage, increases RAM usage, and decreases generation speed. Decreasing <code>offload_per_layer</code> has the opposite effect.

Note that this notebook should run normally in Google Colab with <code>offload_per_layer = 4</code>, but may crash with other values. If you run it somewhere else, you're free to experiment with this variable.
</details>
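
To make the trade-off concrete, the sketch below computes how many experts stay on the GPU and how many are offloaded to RAM for a given `offload_per_layer`. It uses the same arithmetic as the `OffloadConfig` call in the Initialize model section. The helper name `split_experts` is hypothetical, and the defaults of 32 layers and 8 experts per layer are assumptions meant to match the Mixtral-8x7B config.

```python
def split_experts(offload_per_layer, num_layers=32, num_experts=8):
    """Return (experts kept on GPU, experts offloaded to RAM) across all layers.

    num_layers and num_experts are assumed defaults, not values read from a config.
    """
    if not 0 <= offload_per_layer <= num_experts:
        raise ValueError("offload_per_layer must be between 0 and num_experts")
    main_size = num_layers * (num_experts - offload_per_layer)
    offload_size = num_layers * offload_per_layer
    return main_size, offload_size


for k in (4, 5):
    on_gpu, offloaded = split_experts(k)
    print(f"offload_per_layer={k}: {on_gpu} on GPU, {offloaded} offloaded")
# offload_per_layer=4: 128 on GPU, 128 offloaded
# offload_per_layer=5: 96 on GPU, 160 offloaded
```

Under these assumptions, moving from 4 to 5 shifts 32 more experts from VRAM to RAM.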

## Install and import libraries

In [1]:
from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/MyDrive/11868/

Mounted at /content/drive
/content/drive/MyDrive/11868


In [2]:
%cd mixtral-offloading
!pwd
!rm expert_cache_log.txt custom_layer_log.txt generated_text.txt generated_tokens.txt expert_cache.tsv


/content/drive/MyDrive/11868/mixtral-offloading
/content/drive/MyDrive/11868/mixtral-offloading
rm: cannot remove 'expert_cache_log.txt': No such file or directory
rm: cannot remove 'generated_text.txt': No such file or directory


In [3]:
# fix numpy in colab
import numpy
from IPython.display import clear_output

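# fix triton in colab
# Each `!` command runs in its own subshell, so `!export` does not persist.
# Set the variables on the notebook process so later subprocesses inherit them.
import os
os.environ["LC_ALL"] = "en_US.UTF-8"
os.environ["LD_LIBRARY_PATH"] = "/usr/lib64-nvidia"
os.environ["LIBRARY_PATH"] = "/usr/local/cuda/lib64/stubs"
!ldconfig /usr/lib64-nvidia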

# !git clone https://github.com/dvmazur/mixtral-offloading.git --quiet
!pip install -q -r requirements.txt
!huggingface-cli download lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo --quiet --local-dir Mixtral-8x7B-Instruct-v0.1-offloading-demo

clear_output()

In [4]:
!pwd

/content/drive/MyDrive/11868/mixtral-offloading


In [5]:
import sys

sys.path.append("mixtral-offloading")
import torch
from torch.nn import functional as F
from hqq.core.quantize import BaseQuantizeConfig
from huggingface_hub import snapshot_download
from IPython.display import clear_output
from tqdm.auto import trange
from transformers import AutoConfig, AutoTokenizer
from transformers.utils import logging as hf_logging

from src.build_model import OffloadConfig, QuantConfig, build_model

hqq_aten package not installed. HQQBackend.ATEN backend will not work unless you install the hqq_aten lib in hqq/kernels.


The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.




## Initialize model

In [6]:
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quantized_model_name = "lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo"
state_path = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"

config = AutoConfig.from_pretrained(quantized_model_name)

device = torch.device("cuda:0")

##### Change this to 5 if you have only 12 GB of GPU VRAM #####
offload_per_layer = 4
# offload_per_layer = 5
###############################################################

num_experts = config.num_local_experts

offload_config = OffloadConfig(
    main_size=config.num_hidden_layers * (num_experts - offload_per_layer),
    offload_size=config.num_hidden_layers * offload_per_layer,
    buffer_size=4,
    offload_per_layer=offload_per_layer,
)


attn_config = BaseQuantizeConfig(
    nbits=4,
    group_size=64,
    quant_zero=True,
    quant_scale=True,
)
attn_config["scale_quant_params"]["group_size"] = 256


ffn_config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,
    quant_zero=True,
    quant_scale=True,
)
quant_config = QuantConfig(ffn_config=ffn_config, attn_config=attn_config)


model = build_model(
    device=device,
    quant_config=quant_config,
    offload_config=offload_config,
    state_path=state_path,
)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/720 [00:00<?, ?B/s]



Loading experts:   0%|          | 0/32 [00:00<?, ?it/s]
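
The `nbits` and `group_size` arguments above describe group-wise quantization: weights are split into groups, and each group stores low-bit integer codes plus its own scale and zero point. The sketch below illustrates the idea with plain min-max rounding in pure Python. It is not HQQ's algorithm (HQQ additionally optimizes the quantization parameters and can quantize the scale and zero point themselves), and all function names here are hypothetical.

```python
def quantize_group(values, nbits):
    """Min-max affine quantization of one group into codes in [0, 2**nbits - 1]."""
    levels = 2 ** nbits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0  # avoid division by zero
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale + zero for c in codes]


def quantize(weights, nbits, group_size):
    """Quantize a flat list of weights group by group."""
    return [
        quantize_group(weights[i:i + group_size], nbits)
        for i in range(0, len(weights), group_size)
    ]


weights = [i / 10 for i in range(-16, 16)]  # 32 toy weights, not model data
for nbits in (4, 2):
    groups = quantize(weights, nbits=nbits, group_size=16)
    restored = [x for g in groups for x in dequantize_group(*g)]
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"nbits={nbits}: max reconstruction error {max_err:.3f}")
```

Fewer bits mean fewer levels per group and a larger rounding error, while smaller groups reduce the error at the cost of more per-group metadata. That is the trade-off behind the different `nbits` and `group_size` values chosen for `attn_config` and `ffn_config`.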

## Run the model

In [7]:
!ls

LICENSE  Mixtral-8x7B-Instruct-v0.1-offloading-demo  notebooks	README.md  requirements.txt  src


In [11]:
from transformers import TextStreamer

with open(
    "/content/drive/MyDrive/11868/mixtral-offloading/expert_cache.tsv",
    "w",
) as f:
    f.write("UID0\tUID1\tEviction_Group\tOffloaded\tIndex\n")

tokenizer = AutoTokenizer.from_pretrained(model_name)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
past_key_values = None
sequence = None

seq_len = 0
while True:
  print("User: ", end="")
  user_input = input()
  print("\n")

  user_entry = dict(role="user", content=user_input)
  input_ids = tokenizer.apply_chat_template([user_entry], return_tensors="pt", tokenize=True).to(device)

  if past_key_values is None:
    attention_mask = torch.ones_like(input_ids)
  else:
    seq_len = input_ids.size(1) + past_key_values[0][0][0].size(1)
    attention_mask = torch.ones([1, seq_len - 1], dtype=torch.int, device=device)

  # print(input_ids)
  print(f"Decoded Tokens: {tokenizer.convert_ids_to_tokens(input_ids.squeeze(), skip_special_tokens=True)}")
  print("\n")

  print("Mixtral: ", end="")
  result = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    past_key_values=past_key_values,
    streamer=streamer,
    do_sample=True,
    temperature=0.9,
    top_p=0.9,
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id,
    return_dict_in_generate=True,
    output_hidden_states=True,
  )
  # print(result)
  print("\n")

  sequence = result["sequences"]
  # generate() returns prompt + completion, so slice off the prompt tokens
  input_len = input_ids.size(1)
  output_token_ids = sequence.squeeze(0)[input_len:]
  # Convert to a plain list so a single-token reply is handled like any other
  output_tokens = tokenizer.convert_ids_to_tokens(output_token_ids.tolist(), skip_special_tokens=True)

  print(f"input_ids shape: {input_ids.shape}")
  print(f"input_ids: {input_ids}")
  print(f"sequence shape: {sequence.shape}")
  print(f"sequence: {sequence}")
  print(f"output_token_ids: {output_token_ids}")
  print(f"output_tokens: {output_tokens}")
  # Append the generated tokens to the log, one token per line
  with open(
    "/content/drive/MyDrive/11868/mixtral-offloading/generated_tokens.txt", "a", encoding="utf-8"
  ) as f:
    for t in output_tokens:
      f.write(t + "\n")
  # Example prompt: "Introduce yourself, limit your response in 50 words."
  past_key_values = result["past_key_values"]

User: Q: Which of the following statements are true concerning a triangular or recursive system?\n\ni) The parameters can be validly estimated using separate applications of OLS to\n\neach equation\n\n\nii) The independent variables may be correlated with the error terms in other\n\nequations\n\n\niii) An application of 2SLS would lead to unbiased but inefficient parameter estimates\n\n\niv) The independent variables may be correlated with the error terms in the equations\n\nin which they appear as independent variables (a) (ii) and (iv) only (b) (i) and (iii) only (c) (i), (ii), and (iii) only (d) (i), (ii), (iii), and (iv)\nA:


Decoded Tokens: ['▁[', 'INST', ']', '▁Q', ':', '▁Which', '▁of', '▁the', '▁following', '▁statements', '▁are', '▁true', '▁concerning', '▁a', '▁tri', 'angular', '▁or', '▁recurs', 'ive', '▁system', '?', '\\', 'n', '\\', 'ni', ')', '▁The', '▁parameters', '▁can', '▁be', '▁valid', 'ly', '▁estimated', '▁using', '▁separate', '▁applications', '▁of', '▁O', 'LS', '▁to', 

KeyboardInterrupt: Interrupted by user
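
The `temperature` and `top_p` arguments passed to `model.generate` control how each next token is sampled. As a rough illustration, assuming the usual definitions of temperature scaling and nucleus (top-p) filtering, the sketch below applies both to a toy list of logits. It is a simplified pure-Python stand-in, not the Transformers implementation, and `top_p_filter` and `sample` are hypothetical helpers.

```python
import math
import random


def top_p_filter(logits, temperature=0.9, top_p=0.9):
    """Return {token_index: probability} for the smallest set of most likely
    tokens whose cumulative probability reaches top_p, renormalized to 1."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, rng, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = top_p_filter(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


rng = random.Random(0)
logits = [4.0, 3.0, 1.0, -2.0]  # toy values, not model output
print(top_p_filter(logits))
print(sample(logits, rng))
```

With these toy logits, only the two most likely tokens survive the filter, so the two unlikely ones can never be sampled.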

In [10]:
import os

log_dir = "/content/drive/MyDrive/11868/mixtral-offloading"
log_files = [
    "expert_cache_log.txt",
    "expert_cache.tsv",
    "custom_layer_log.txt",
    "generated_tokens.txt",
]

# Remove each log file left over from a previous run, if it exists
for name in log_files:
    file_path = os.path.join(log_dir, name)
    if os.path.exists(file_path):
        os.remove(file_path)
        print(f"File {file_path} has been removed.")
    else:
        print(f"File {file_path} does not exist.")

File /content/drive/MyDrive/11868/mixtral-offloading/expert_cache_log.txt does not exist.
File /content/drive/MyDrive/11868/mixtral-offloading/expert_cache.tsv has been removed.
File /content/drive/MyDrive/11868/mixtral-offloading/custom_layer_log.txt has been removed.
File /content/drive/MyDrive/11868/mixtral-offloading/generated_tokens.txt has been removed.
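
Before a cleanup like the one above deletes it, a log such as `expert_cache.tsv` could be summarized with the standard `csv` module. The sketch below relies only on the header written in the chat cell (`UID0`, `UID1`, `Eviction_Group`, `Offloaded`, `Index`). What each column means and which values it holds are assumptions, the sample rows are made up for illustration, and `count_by_column` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_by_column(tsv_file, column):
    """Count how often each value appears in one column of a tab-separated file."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row[column] for row in reader)


# Made-up rows for illustration only; real logs may use different value formats.
sample_log = io.StringIO(
    "UID0\tUID1\tEviction_Group\tOffloaded\tIndex\n"
    "0\t3\t0\tTrue\t12\n"
    "0\t5\t0\tFalse\t7\n"
    "1\t3\t1\tTrue\t40\n"
)
print(count_by_column(sample_log, "Offloaded"))
```

For a real file, pass an open file handle instead, for example `open(path, newline="")`.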


In [None]:
# output = "Hello! I am a helpful and respectful AI assistant, designed to accurately answer questions, provide suggestions and give a broad range of information across various topics. I strive to ensure user satisfaction while abiding by ethical guidelines and privacy protocols."
#
