<a href="https://www.kaggle.com/code/lonnieqin/chatbot-with-mixtral-8x7b?scriptVersionId=161726326" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Chatbot with Mixtral 8x7B

This Jupyter notebook demonstrates how to build a chatbot using Mixtral 8x7B, a large language model from [Mistral AI](https://mistral.ai). Mixtral is a sparse mixture-of-experts model that is fast for its size and adaptable to many use cases. According to Mistral AI, it matches or outperforms Llama 2 70B on most benchmarks with about 6x faster inference, handles a context of 32k tokens, supports several languages (English, French, Italian, German, and Spanish), and performs strongly on code generation. It also matches or outperforms GPT-3.5 on most standard benchmarks. The notebook covers the following steps:

1. Loading the Mixtral 8x7B model into a Kaggle notebook
2. Creating a simple chatbot that responds to user input or pre-defined prompts

The notebook also includes a discussion of the limitations of chatbots and of how they can be used to improve customer service and provide other valuable services.

## References
* https://github.com/dvmazur/mixtral-offloading

## Configuration

In [1]:
class CFG:
    
    # Whether to read prompts interactively via input()
    is_interactive = False

    # Prompts used when is_interactive is False
    prompts = [
        "Could you tell me something about Elon Musk?",
        "Could you build an IMDB text classifier using Tensorflow?",
        "Could you build an IMDB text classifier using Pytorch?"
    ]

## Install and import libraries

In [2]:
!git clone https://github.com/dvmazur/mixtral-offloading
!cd mixtral-offloading && pip install -q -r requirements.txt
!pip install triton
!huggingface-cli download lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo --quiet --local-dir Mixtral-8x7B-Instruct-v0.1-offloading-demo

Cloning into 'mixtral-offloading'...
remote: Enumerating objects: 290, done.
remote: Counting objects: 100% (76/76), done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 290 (delta 54), reused 45 (delta 37), pack-reused 214
Receiving objects: 100% (290/290), 264.75 KiB | 3.39 MiB/s, done.
Resolving deltas: 100% (165/165), done.
Collecting triton
  Downloading triton-2.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Downloading triton-2.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (167.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 167.9/167.9 MB 9.2 MB/s eta 0:00:00
Installing collected packages: triton
Successfully installed triton-2.2.0
/kaggle/working/Mixtral-8x7B-Instruct-v0.1-offloading-demo


In [3]:
import sys
import time
sys.path.append("mixtral-offloading")
import torch
from torch.nn import functional as F
from hqq.core.quantize import BaseQuantizeConfig
from huggingface_hub import snapshot_download
from IPython.display import clear_output
from tqdm.auto import trange
from transformers import AutoConfig, AutoTokenizer
from transformers import TextStreamer
from transformers.utils import logging as hf_logging
from src.build_model import OffloadConfig, QuantConfig, build_model

hqq_aten package not installed. HQQBackend.ATEN backend will not work unless you install the hqq_aten lib in hqq/kernels.


The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

## Initialize model

In [4]:
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quantized_model_name = "lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo"
state_path = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"

config = AutoConfig.from_pretrained(quantized_model_name)

device = torch.device("cuda:0")

##### Change this to 5 if you have only 12 GB of GPU VRAM #####
offload_per_layer = 4
# offload_per_layer = 5
###############################################################

num_experts = config.num_local_experts

offload_config = OffloadConfig(
    main_size=config.num_hidden_layers * (num_experts - offload_per_layer),
    offload_size=config.num_hidden_layers * offload_per_layer,
    buffer_size=4,
    offload_per_layer=offload_per_layer,
)


attn_config = BaseQuantizeConfig(
    nbits=4,
    group_size=64,
    quant_zero=True,
    quant_scale=True,
)
attn_config["scale_quant_params"]["group_size"] = 256


ffn_config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,
    quant_zero=True,
    quant_scale=True,
)
quant_config = QuantConfig(ffn_config=ffn_config, attn_config=attn_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = build_model(
    device=device,
    quant_config=quant_config,
    offload_config=offload_config,
    state_path=state_path,
)

tokenizer_config.json:   0%|          | 0.00/1.46k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/720 [00:00<?, ?B/s]



Loading experts:   0%|          | 0/32 [00:00<?, ?it/s]
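
The `OffloadConfig` arguments in the cell above split the experts between GPU memory and offloaded storage. The sketch below reproduces that arithmetic on its own so the effect of `offload_per_layer` is easier to see. It assumes 32 hidden layers and 8 local experts per layer, which are the values in the public Mixtral 8x7B config; in the notebook they are read from `config`. The helper name `expert_placement` is hypothetical and not part of `mixtral-offloading`.

```python
# Assumed values from the public Mixtral 8x7B config; the notebook reads them
# from config.num_hidden_layers and config.num_local_experts instead.
NUM_HIDDEN_LAYERS = 32
NUM_LOCAL_EXPERTS = 8


def expert_placement(offload_per_layer,
                     num_layers=NUM_HIDDEN_LAYERS,
                     num_experts=NUM_LOCAL_EXPERTS):
    """Return the expert counts passed to OffloadConfig in the cell above.

    Hypothetical helper: it only mirrors the main_size / offload_size formulas.
    """
    if not 0 <= offload_per_layer <= num_experts:
        raise ValueError("offload_per_layer must be between 0 and num_experts")
    main_size = num_layers * (num_experts - offload_per_layer)
    offload_size = num_layers * offload_per_layer
    return {"main_size": main_size, "offload_size": offload_size}


# Compare the default setting with the one suggested for 12 GB of VRAM.
for per_layer in (4, 5):
    print(per_layer, expert_placement(per_layer))
```

Under these assumptions, `offload_per_layer = 4` keeps 128 of the 256 experts in main memory and offloads the other 128, while `offload_per_layer = 5` keeps 96 and offloads 160. Offloading more experts lowers GPU memory use at the cost of more transfers during generation.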

## Run the model

In [5]:
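# Stream generated tokens to stdout as they are produced
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
past_key_values = None  # KV cache carried across turns
sequence = None
seq_len = 0
i = 0  # index of the next pre-defined prompt
while True:
    if CFG.is_interactive:
        print("User: ", end="")
        user_input = input()
    else:
        # Stop once all pre-defined prompts have been used
        if i >= len(CFG.prompts):
            break
        print("User: ", end="")
        user_input = CFG.prompts[i]
        i += 1
    print(f"{user_input}\n")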

    user_entry = dict(role="user", content=user_input)
    input_ids = tokenizer.apply_chat_template([user_entry], return_tensors="pt").to(device)

    if past_key_values is None:
        attention_mask = torch.ones_like(input_ids)
    else:
        seq_len = input_ids.size(1) + past_key_values[0][0][0].size(1)
        attention_mask = torch.ones([1, seq_len - 1], dtype=torch.int, device=device)
    print("Mixtral: ", end="")
    begin = time.time()
    result = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        past_key_values=past_key_values,
        streamer=streamer,
        do_sample=True,
        temperature=0.9,
        top_p=0.9,
        max_new_tokens=1024,
        pad_token_id=tokenizer.eos_token_id,
        return_dict_in_generate=True,
        output_hidden_states=True,
    )
    elapsed = time.time() - begin
    print(f"Elapsed time: {elapsed:.2f}s")
    print("\n")
    sequence = result["sequences"]
    past_key_values = result["past_key_values"]
    

User: Could you tell me something about Elon Musk?

Mixtral: 



Elon Musk is a South African-born Canadian-American entrepreneur and businessman who is known for his work in the fields of technology, engineering, and green energy. He is the founder, CEO, and CTO of SpaceX, which is a company that aims to make space travel more accessible and affordable. He is also the CEO of Neuralink, a company developing neural interface technologies, and the founder of The Boring Company, which manufactures a revolutionary type of tunnel boring machine.

Musk co-founded and served as product architect of Tesla, Inc., the electric vehicle and clean energy company. He stepped down as CEO of Tesla in January 2023, but remains as its product architect and an member of its board of directors.

He has been a major advocate for renewable energy, sustainable transportation, and the importance of colonizing Mars to ensure the survival of humanity. Musk has often been described as a real-life tech titan, visionary, and entrepreneur with a strong commitment to changing the
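
The `generate` call above samples with `temperature=0.9` and `top_p=0.9`. As a rough illustration of what those two settings do, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It is not the Transformers implementation, which works on tensors and handles edge cases differently; the function name `sample_top_p` and the toy logits are made up for this example.

```python
import math
import random


def sample_top_p(logits, temperature=0.9, top_p=0.9, rng=random):
    """Sample one token index with temperature scaling and top-p filtering."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability
    # reaches top_p; everything after that point is discarded.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in ranked:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break

    # Draw one kept token in proportion to its probability.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# Toy logits for a five-token vocabulary (illustrative values only).
toy_logits = [2.0, 1.5, 0.3, -1.0, -3.0]
print(sample_top_p(toy_logits, rng=random.Random(0)))
```

A lower `temperature` sharpens the distribution and a lower `top_p` discards more of the unlikely tokens, so both make the output more deterministic.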

## Conclusion

It is cool to run Mixtral 8x7B in a Kaggle notebook. The model generates high-quality content even under aggressive quantization (4-bit attention layers and 2-bit expert layers in this setup). Generation is slow, however: an answer of up to 1024 new tokens still takes about 500 seconds. This is because the model is very large, and part of its experts must be offloaded from GPU memory and loaded back on demand. The results are still very impressive.
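
To put the timing quoted above in perspective, the short sketch below converts it into a throughput figure. It assumes the full 1024 new tokens were generated in the 500 seconds; the helper name `tokens_per_second` is hypothetical.

```python
def tokens_per_second(num_new_tokens, elapsed_seconds):
    """Average generation throughput for one answer."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_new_tokens / elapsed_seconds


# Figures from the conclusion: 1024 new tokens in roughly 500 seconds.
rate = tokens_per_second(1024, 500.0)
print(f"{rate:.2f} tokens/s")  # prints "2.05 tokens/s"
```

At roughly two tokens per second, the setup is usable for experimentation but slow for interactive use.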

Here are some specific examples of how Mixtral 8x7B can be used to improve customer service and provide other valuable services:

* **Mixtral 8x7B can be used to create a knowledge base of frequently asked questions (FAQs).** This can help your customers find answers to their questions quickly and easily, without having to contact customer service.
* **Mixtral 8x7B can be used to provide live chat support.** This allows your customers to get help in real time without waiting for a human agent, so that they can resolve their issues quickly and efficiently.
* **Mixtral 8x7B can be used to track customer satisfaction.** This information can be used to identify areas where you can improve your customer service and provide a more personalized experience.
* **Mixtral 8x7B can be used to create surveys and polls.** This feedback can be used to improve your products and services, and to identify new opportunities to grow your business.