# Inferencing an LLM: Mistral-7B-Instruct

**What is Inferencing a Large Language Model?**
Inference, in the context of Mistral 7B Instruct, means using the pre-trained language model to generate a response to an input prompt. The process breaks down into the following steps:

1. **Input Data or Prompt**: You provide input data or prompts to the Mistral 7B Instruct model. This input could be a text snippet, a question, or any content that you want the model to generate a response for.

2. **Tokenization**: The input is tokenized, meaning it is split into smaller units called tokens (words or subword pieces). Each token is mapped to an integer ID from the model's vocabulary, and these IDs are what the model actually processes.

3. **Model Processing**: The token IDs are fed into the Mistral 7B Instruct model, which runs a forward pass through its pre-trained weights and produces a probability distribution over the next token.

4. **Generation**: The model builds its response one token at a time. Each new token is chosen from the predicted distribution, appended to the sequence, and fed back in, until an end-of-sequence token is produced or a length limit is reached. The response could be free text, an answer to a question, the completion of a sentence, or any other relevant output.

5. **Decoding**: The generated output is decoded from the tokenized format back into human-readable text or the desired output format.

6. **Output**: Finally, the decoded text is returned to the user as the model's answer to the prompt.

In essence, inferencing a model like Mistral 7B Instruct involves leveraging its language understanding capabilities to generate meaningful responses or predictions based on input data or prompts.
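
The six steps above can be sketched with a toy example. This is only an illustration under loud assumptions: the vocabulary, the `toy_next_token` "model", and all function names are made up and have nothing to do with Mistral's real tokenizer or weights. It only shows how tokenization, token-by-token generation, and decoding fit together.

```python
# Toy illustration of the inference steps; the vocabulary and "model" are
# hypothetical and unrelated to Mistral's real tokenizer or weights.
VOCAB = ["<s>", "</s>", "hello", "world", "how", "are", "you"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
EOS_ID = TOKEN_TO_ID["</s>"]


def tokenize(text: str) -> list[int]:
    """Step 2: split text on whitespace and map each word to an integer id."""
    return [TOKEN_TO_ID[word] for word in text.lower().split()]


def toy_next_token(ids: list[int]) -> int:
    """Step 3: stand-in for a forward pass. Returns the id following the last
    one, or EOS once the end of the vocabulary is reached."""
    nxt = ids[-1] + 1
    return nxt if nxt < len(VOCAB) else EOS_ID


def generate(ids: list[int], max_new_tokens: int) -> list[int]:
    """Step 4: append one token at a time until EOS or the length limit."""
    out = list(ids)
    for _ in range(max_new_tokens):
        nxt = toy_next_token(out)
        out.append(nxt)
        if nxt == EOS_ID:
            break
    return out


def decode(ids: list[int]) -> str:
    """Step 5: map ids back to human-readable text."""
    return " ".join(VOCAB[i] for i in ids)


prompt_ids = tokenize("hello world")                    # steps 1-2
output_ids = generate(prompt_ids, max_new_tokens=10)    # steps 3-4
print(decode(output_ids))                               # steps 5-6
```

A real model replaces `toy_next_token` with a neural network that scores every token in the vocabulary, but the surrounding loop has the same shape.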

**What is a Sharded Model?**

A "sharded" model is a model that has been split into multiple smaller parts called shards. In the Hugging Face ecosystem this usually means a sharded checkpoint: the weights are saved as several smaller files rather than one large file, and the model used in this notebook loads from 8 such shards. More generally, the term also covers splitting a model across devices in distributed computing environments, which is useful when dealing with very large models.

In the context of Natural Language Processing (NLP) models like Mistral-7B-Instruct, sharding can be implemented for various reasons:

1. **Parallelism**: Sharding allows different parts of the model to be processed in parallel on different computing units or devices. This can lead to faster inference times and better utilization of available computational resources.

2. **Memory Optimization**: Large models may not fit entirely into the memory of a single device. Sharding distributes the model's components across memory spaces, and a sharded checkpoint can be loaded one shard at a time, which lowers peak RAM usage during loading and enables the use of models that would otherwise exceed memory limits.

3. **Scalability**: Sharding supports the scalability of models, allowing them to handle larger volumes of data or more complex tasks by distributing the workload across multiple shards.

4. **Efficient Training**: During model training, sharding can be used to distribute the training workload across multiple GPUs or machines, reducing training time and improving overall efficiency.

5. **Fault Tolerance**: In distributed deployments, sharding can help contain a failure to specific shards, which makes recovery easier, although a full forward pass still needs every shard of the model to be available.

In summary, sharded models are designed to optimize performance, scalability, and resource utilization, especially in scenarios where large-scale models need to be deployed or trained efficiently.
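
As a rough illustration of checkpoint sharding, the sketch below greedily groups tensors into shards that stay under a size limit and records which file would hold each tensor. The function name, the tensor names, the byte sizes, and the file naming pattern are all assumptions made for this example; it is not the code any library uses.

```python
def shard_checkpoint(param_sizes: dict[str, int], max_shard_bytes: int):
    """Greedily group parameters into shards no larger than max_shard_bytes.

    Returns (shards, weight_map): a list of {name: size} dicts and a mapping
    from each parameter name to the hypothetical file that would hold it.
    """
    shards = [{}]
    current_bytes = 0
    for name, size in param_sizes.items():
        # Start a new shard when adding this tensor would exceed the limit.
        if shards[-1] and current_bytes + size > max_shard_bytes:
            shards.append({})
            current_bytes = 0
        shards[-1][name] = size
        current_bytes += size

    total = len(shards)
    # Hypothetical file naming pattern, chosen only for this illustration.
    weight_map = {
        name: f"model-{i + 1:05d}-of-{total:05d}.bin"
        for i, shard in enumerate(shards)
        for name in shard
    }
    return shards, weight_map


# Made-up tensor names and byte sizes, for illustration only.
sizes = {"embed": 400, "layer0": 300, "layer1": 300, "head": 400}
shards, weight_map = shard_checkpoint(sizes, max_shard_bytes=700)
print(len(shards), weight_map["head"])
```

With a map like `weight_map`, a loader can open one file at a time and copy only the tensors stored in it, which is what keeps peak memory low.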

## **Installing necessary dependencies:**


Here is an explanation of each library mentioned:

1. `peft`: PEFT (Parameter-Efficient Fine-Tuning) is a Hugging Face library for adapting large pre-trained models by training only a small number of additional parameters, using methods such as LoRA. It is not imported for inference in this notebook, but it is commonly installed alongside Transformers for fine-tuning quantized models.

2. `accelerate`: Accelerate is a library designed to simplify and optimize the training and inference of PyTorch models, especially on distributed systems like multiple GPUs or TPUs. It provides utilities for distributed training, mixed precision training, gradient accumulation, and other performance optimizations.

3. `bitsandbytes`: The `bitsandbytes` library provides 8-bit and 4-bit quantization routines for PyTorch. Transformers integrates with it through `BitsAndBytesConfig`, which this notebook uses to load the model in 4-bit precision.

4. `safetensors`: Safetensors is a Hugging Face file format and library for storing model tensors. Unlike pickle-based PyTorch checkpoints, loading a `.safetensors` file does not execute arbitrary code, and it is designed to load quickly. Transformers can read model weights in this format when a repository provides them.

5. `sentencepiece`: SentencePiece is a library for tokenization and subword encoding developed by Google. It's commonly used in Natural Language Processing (NLP) tasks, including pre-processing text data for training language models like those provided by the Hugging Face Transformers package. SentencePiece allows for efficient and customizable tokenization by splitting text into subword units based on the provided vocabulary.
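
Once the install cell below has run, a quick check like the following can confirm which of these packages are present. It is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


required = ["transformers", "peft", "accelerate",
            "bitsandbytes", "safetensors", "sentencepiece"]
for name, ver in installed_versions(required).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```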


In [1]:
!pip install -q git+https://github.com/huggingface/transformers peft accelerate bitsandbytes safetensors sentencepiece


In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = 'filipealmeida/Mistral-7B-Instruct-v0.1-sharded'

def load_quantized_model(model_name: str):
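    """
    Load a causal language model in 4-bit precision using bitsandbytes.

    :param model_name: Name or path of the model to be loaded.
    :return: Loaded quantized model.
    """
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store the weights in 4-bit
        bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
        bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
        bnb_4bit_compute_dtype=torch.bfloat16   # dtype used for the matrix multiplications
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,  # all 4-bit settings come from the config object
        torch_dtype=torch.bfloat16,  # dtype for the layers that are not quantized
    )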

    return model

def initialize_tokenizer(model_name: str):
    """
    Initialize the tokenizer with the specified model_name.

    :param model_name: Name or path of the model for tokenizer initialization.
    :return: Initialized tokenizer.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.bos_token_id = 1  # Set beginning of sentence token id
    return tokenizer


model = load_quantized_model(model_name)

tokenizer = initialize_tokenizer(model_name)

# Define stop token ids (not passed to generate() below, so generation
# ends at the EOS token or after max_new_tokens)
stop_token_ids = [0]


text = "[INST] What is the future of AI? [/INST]"

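# Tokenize the prompt and move the tensors to the same device as the model
encoded = tokenizer(text, return_tensors="pt", add_special_tokens=False)
model_input = encoded.to(model.device)
# Sample up to 600 new tokens; do_sample=True makes the output vary between runs
generated_ids = model.generate(**model_input, max_new_tokens=600, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])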

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] What is the future of AI? [/INST] It is difficult to predict the exact future of AI as it is constantly evolving based on new research and development. However, it is widely believed that AI will continue to play an increasingly important role in many aspects of our lives, both personal and professional. Some potential future developments in AI include:

1. Greater integration with human emotions and behavior: AI may become more adept at detecting and responding to human emotions, leading to more natural and intuitive interactions between humans and AI.
2. Continued advancements in natural language processing: AI may become even better at understanding and responding to natural language, allowing for more seamless communication with humans.
3. Increased use in decision-making and problem-solving: AI may become more advanced in its ability to analyze data and make decisions, potentially replacing human decision-makers in certain situations.
4. Greater use in healthcare: AI may be

Here's a breakdown of what each part of the code does:

1. `import torch`: Imports the PyTorch library, which is used for deep learning tasks.

2. `from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig`: Imports necessary classes from the Transformers library. `AutoModelForCausalLM` is used to load the Causal Language Model, `AutoTokenizer` is used to initialize the tokenizer, and `BitsAndBytesConfig` is used for configuring quantization.

3. `model_name = 'filipealmeida/Mistral-7B-Instruct-v0.1-sharded'`: Specifies the name or path of the model to be loaded.

4. `load_quantized_model(model_name: str)`: Defines a function to load a quantized model based on the provided model name or path. Inside the function:
   - `BitsAndBytesConfig` configures the quantization: 4-bit weights in the `nf4` format, double quantization enabled, and `bfloat16` as the compute dtype.
   - `AutoModelForCausalLM.from_pretrained` loads the model using the specified model name and quantization configuration.

5. `initialize_tokenizer(model_name: str)`: Defines a function to initialize the tokenizer based on the provided model name or path. Inside the function:
   - `AutoTokenizer.from_pretrained` initializes the tokenizer using the specified model name.
   - `tokenizer.bos_token_id = 1` sets the beginning of sentence token ID.

6. `model = load_quantized_model(model_name)`: Calls the `load_quantized_model` function to load the quantized model.

7. `tokenizer = initialize_tokenizer(model_name)`: Calls the `initialize_tokenizer` function to initialize the tokenizer.

8. `text = "[INST] What is the future of AI? [/INST]"`: Specifies the input text for which the model will generate a response.

9. Tokenization and Generation:
   - `encoded = tokenizer(text, return_tensors="pt", add_special_tokens=False)` tokenizes the input text without adding special tokens and returns the result as PyTorch tensors (`input_ids` and `attention_mask`).
   - `model.generate(**model_input, max_new_tokens=600, do_sample=True)` generates up to 600 new tokens from the encoded input. Because `do_sample=True` samples from the predicted distribution instead of always picking the most likely token, the output differs from run to run.
   - `tokenizer.batch_decode(generated_ids)` decodes the generated tokens back into human-readable text.
   - `print(decoded[0])` prints the generated text.
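
Because `decoded[0]` contains the prompt as well as the answer (the printed output begins with the `[INST]` question), it can be handy to strip the echoed prompt. The helper below is a small sketch for that; `extract_response` is a hypothetical name, and it assumes the answer follows the last `[/INST]` tag and may end with the `</s>` end-of-sequence marker.

```python
def extract_response(decoded: str, end_tag: str = "[/INST]") -> str:
    """Return the text after the last end_tag, without a trailing "</s>"."""
    # rpartition yields ("", "", decoded) when the tag is absent, so the
    # whole string is returned in that case.
    _, _, answer = decoded.rpartition(end_tag)
    answer = answer.strip()
    if answer.endswith("</s>"):
        answer = answer[: -len("</s>")].rstrip()
    return answer


sample = "[INST] What is the future of AI? [/INST] It is difficult to predict.</s>"
print(extract_response(sample))
```

In the notebook this would be called as `extract_response(decoded[0])`.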



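**TIP: Use a T4 GPU runtime in Google Colab. The 4-bit loading used here relies on `bitsandbytes`, which expects a CUDA GPU, so a TPU runtime is not suitable for this notebook.**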
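
A back-of-the-envelope estimate shows why 4-bit loading matters on a GPU such as the T4, which has 16 GB of memory. The sketch below counts weight storage only. It assumes a round 7 billion parameters and ignores the quantization constants, activations, and the key-value cache, so real usage will be somewhat higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 7e9  # assumed round figure for a "7B" model

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:9s} ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the space of the `bfloat16` weights, which leaves headroom on the GPU for generation.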