In [1]:
# Install the required libraries

!pip install --upgrade transformers   # model and tokenizer loading from the Hugging Face Hub
!pip install --upgrade accelerate     # device placement and optimized training/inference
!pip install --upgrade bitsandbytes   # low-precision (8-bit/4-bit) loading to save GPU memory

Collecting transformers
  Downloading transformers-5.1.0-py3-none-any.whl.metadata (31 kB)
Downloading transformers-5.1.0-py3-none-any.whl (10.3 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 5.0.0
    Uninstalling transformers-5.0.0:
      Successfully uninstalled transformers-5.0.0
Successfully installed transformers-5.1.0
Collecting bitsandbytes
  Downloading bitsandbytes-0.49.1-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Downloading bitsandbytes-0.49.1-py3-none-manylinux_2_24_x86_64.whl (59.1 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.49.1
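
After an upgrade it can be useful to confirm which versions ended up installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is illustrative and not part of any of these packages.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_versions(["transformers", "accelerate", "bitsandbytes"]))
```

Note that this reads the installed package metadata; a package that was already imported before the upgrade keeps running its old code until the runtime is restarted.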


In [2]:
from huggingface_hub import login
login()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
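
As an alternative to the interactive widget, the token can be read from an environment variable and passed to `login(token=...)`. This is a sketch under the assumption that the token was exported as `HF_TOKEN` beforehand; the helper name `login_from_env` is hypothetical.

```python
import os


def login_from_env(var_name="HF_TOKEN"):
    """Log in to the Hugging Face Hub if `var_name` is set; return True on login."""
    token = os.environ.get(var_name)
    if not token:
        # Nothing to do: public models can still be downloaded anonymously.
        return False
    from huggingface_hub import login  # imported lazily, only when a token exists
    login(token=token)
    return True
```

Calling `login_from_env()` in place of `login()` skips the prompt when the variable is set and returns `False` otherwise.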



In [3]:
# bitsandbytes config: controls how the model is quantized to reduce GPU memory usage

from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the weights in 4-bit precision
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants to save a little more memory
    bnb_4bit_quant_type="nf4",              # NormalFloat4, a 4-bit type designed for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matrix multiplications at run time
)
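
To get a feel for what 4-bit loading buys, the weight memory can be estimated with simple arithmetic. The sketch below assumes a round 7 billion parameters and counts the weights only; it ignores quantization constants, layers that stay in higher precision, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 7_000_000_000  # assumed round figure for a "7B" model

for label, bits in [("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights shrink from roughly 13 GiB to roughly 3.3 GiB, which is what makes a 7B model fit on a small GPU.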


In [7]:
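# Clear cached downloads to free disk space before fetching the model weights
!rm -rf /opt/bin/.nvidia/*          # fails here: this path is on a read-only file system (see output)
!rm -rf /root/.cache/huggingface    # cached Hub downloads
!rm -rf /root/.cache/torch          # cached torch files
!pip cache purge                    # pip's download cache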

rm: cannot remove '/opt/bin/.nvidia/nvidia-bug-report.sh': Read-only file system
rm: cannot remove '/opt/bin/.nvidia/nvidia-cuda-mps-control': Read-only file system
rm: cannot remove '/opt/bin/.nvidia/nvidia-cuda-mps-server': Read-only file system
rm: cannot remove '/opt/bin/.nvidia/nvidia-debugdump': Read-only file system
rm: cannot remove '/opt/bin/.nvidia/nvidia-installer': Read-only file system
rm: cannot remove '/opt/bin/.nvidia/nvidia-modprobe': Read-only file system
rm: cannot remove '/opt/bin/.nvidia/nvidia-ngx-updater': Read-only file system
rm: cannot remove '/opt/bin/.nvidia/nvidia-pcc': Read-only file system
rm: cannot remove '/opt/bin/.nvidia/nvidia-persistenced': Read-only file system
rm: cannot remove '/opt/bin/.nvidia/nvidia-powerd': Read-only file system
rm: cannot remove '/opt/bin/.nvidia/nvidia-settings': Read-only file system
rm: cannot remove '/opt/bin/.nvidia/nvidia-sleep.sh': Read-only file system
rm: cannot remove '/opt/bin/.nvidia/nvidia-smi': Read-only file sy

In [8]:
# Select the model, load its tokenizer, then load the 4-bit quantized model onto the GPU for text generation.
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
mistral_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    dtype=torch.bfloat16,   # `torch_dtype` is deprecated in this transformers version
    device_map="auto",      # let accelerate place the weights on the available device(s)
)

# Pipeline: a wrapper that handles tokenization, generation, and decoding in one call
from transformers import pipeline
mistral_pipeline = pipeline(
    "text-generation",
    model=mistral_model,   # already loaded, quantized, and placed on a device above
    tokenizer=tokenizer,
)


config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]



model.safetensors.index.json: 0.00B [00:00, ?B/s]

Downloading (incomplete total...): 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/291 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


In [12]:
def mistral_chat(user_input, messages=None):
  # Start a new history on the first turn; otherwise reuse the existing one
  if messages is None:
    messages = [
        {"role": "system", "content": "You are a helpful, smart and friendly AI assistant. Do not leave the sentence incomplete."}
    ]

  # Save the user input
  messages.append({"role": "user", "content": user_input})

  # Build the prompt from the history; instruct models such as Mistral and Llama expect a specific chat format
  prompt = tokenizer.apply_chat_template(
      messages,
      tokenize=False,              # return the prompt as a string rather than token IDs
      add_generation_prompt=True)  # signal that the assistant's turn comes next

  # Generate the response
  output = mistral_pipeline(prompt, do_sample=True, max_new_tokens=200, temperature=0.1, top_p=0.95)
  full_text = output[0]["generated_text"]
  reply = full_text[len(prompt):].strip()  # the pipeline returns prompt + completion; keep only the completion

  # Save the assistant reply to the history
  messages.append({"role": "assistant", "content": reply})
  return reply, messages
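
To make the role of `apply_chat_template` more concrete, here is a hand-written approximation of the `[INST] ... [/INST]` layout that Mistral instruct models use for user and assistant turns. It is only a sketch: the function name `format_mistral_prompt` is hypothetical, system messages are deliberately not handled, and the real template shipped with the tokenizer may differ in whitespace and special tokens.

```python
def format_mistral_prompt(messages, bos="<s>", eos="</s>"):
    """Approximate the [INST] ... [/INST] layout for user/assistant turns.

    Assumption: only "user" and "assistant" roles are supported in this sketch.
    """
    parts = [bos]
    for message in messages:
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        elif message["role"] == "assistant":
            parts.append(f" {message['content']}{eos}")
        else:
            raise ValueError(f"unsupported role in this sketch: {message['role']}")
    return "".join(parts)


history = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "Hello! How can I assist you today?"},
    {"role": "user", "content": "what is self-attention?"},
]
print(format_mistral_prompt(history))
```

In practice the tokenizer's own template should be preferred, since it is maintained together with the model.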

In [10]:
# Chat loop

def run_bot():
  messages = None  # no history yet; mistral_chat creates it on the first turn
  while True:
    user_input = input("User: ")
    if user_input.strip().lower() == "exit":
      print("Chat ended. Bye!")
      break
    reply, messages = mistral_chat(user_input, messages)
    print(f"Assistant: {reply}")
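
Because `run_bot` passes the whole `messages` list back into `mistral_chat` on every turn, the prompt grows with each exchange. One possible way to cap it is to keep the system message plus only the most recent turns. The helper `trim_history` and its `max_turns` parameter are assumptions for illustration, not part of the notebook.

```python
def trim_history(messages, max_turns=4):
    """Keep system messages plus the last `max_turns` user/assistant pairs.

    Intended to be called after a full exchange, so that the remaining
    dialogue still starts with a user message.
    """
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    if max_turns <= 0:
        return system
    return system + dialogue[-2 * max_turns:]
```

It could be applied inside the loop right after `mistral_chat` returns, for example `messages = trim_history(messages)`.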


In [13]:
# Main entry point
if __name__ == "__main__":
  run_bot()

User: hi


Passing `generation_config` together with generation-related arguments=({'max_new_tokens', 'top_p', 'temperature', 'do_sample'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Both `max_new_tokens` (=200) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Assistant: Hello! How can I assist you today?
User: what is the role of self attention in Ai?


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Both `max_new_tokens` (=200) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Assistant: Self-attention is a technique used in artificial intelligence (AI) to allow a model to focus on different parts of the input data when making predictions or decisions. It is a key component of many AI models, including natural language processing (NLP) models and image recognition models.

In self-attention, the model is able to weigh the importance of different parts of the input data based on their relevance to the task at hand. This allows the model to focus on the most important information and ignore irrelevant or redundant information.

Self-attention is particularly useful in NLP tasks, where the input data is often unstructured and can contain a large amount of irrelevant information. By allowing the model to focus on the most important parts of the input data, self-attention can improve the accuracy and efficiency of NLP models.

Overall, self-attention is a powerful tool in AI that allows models to make more informed decisions and improve
User: exit
Chat ended. Bye
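
The longer reply in the transcript stops at "and improve", which is consistent with generation reaching the `max_new_tokens=200` limit despite the system prompt. Besides raising that limit, a reply can be post-processed to drop a trailing partial sentence. The helper below is a naive sketch with a hypothetical name; it treats any `.`, `!` or `?` as a sentence end, so abbreviations and decimals can fool it.

```python
def trim_to_last_sentence(text, terminators=".!?"):
    """Drop a trailing partial sentence; return text unchanged if no sentence ends."""
    last = max(text.rfind(ch) for ch in terminators)
    if last == -1:
        return text
    return text[: last + 1]


print(trim_to_last_sentence("Self-attention weighs tokens. Overall, it can improve"))
```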