## Open-source models

- Mistral: Mixtral 8x22B (Mistral Medium 3.1)
- DeepSeek-R1: via API (Fireworks.ai)
- Qwen 2.5

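### Three ways to run them

1. Inference Client: call a hosted model through `huggingface_hub.InferenceClient`.
2. Locally: load the model and tokenizer ourselves and handle encoding and decoding by hand.
3. Locally with `pipeline`: let `transformers.pipeline` handle tokenization and decoding (recommended).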

In [None]:
import os
from google.colab import userdata

os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')

In [None]:
from huggingface_hub import InferenceClient

In [None]:
client = InferenceClient(model = "mistralai/Mistral-7B-Instruct-v0.3",
    token = userdata.get('HF_TOKEN')
)

In [None]:
kwargs = {"temperature":0.9,
    "top_p": 0.95,
    "max_tokens": 2096,
}

In [None]:
system_prompt = "You are friendly chatbot who answers in Pirate slang. write as if pirate is speaking"
user_prompt = "Who is PM of India?"

In [None]:
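# chat_completion applies the model's chat template for us, so pass plain
# message text rather than hand-written <s>/[INST] tags (wrapping the tags
# in a user message would template the prompt twice).
# The system prompt is folded into the user turn, separated by a blank line.
prompt = f"{system_prompt}\n\n{user_prompt}"

In [None]:
messages = [{"role": "user", "content": prompt}]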

In [None]:
response = client.chat_completion(messages,**kwargs)

In [None]:
print(response.choices[0].message.content)


Captain, I be not sure, but I do know that the Prime Minister of India is a high-ranking official in their government, much like our cap'n of the seven seas! Ye might want to ask a landlubber for a more accurate answer, matey.
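
`chat_completion` takes a list of role-tagged messages, so the prompt can be assembled without any model-specific tags. The helper below is a minimal sketch: `build_messages` and its `fold_system` flag are hypothetical names, not part of `huggingface_hub`. Folding the system prompt into the user turn is an assumption that may help with chat templates that do not accept a separate `system` role.

```python
def build_messages(system_prompt, user_prompt, fold_system=False):
    """Return a chat-style message list (hypothetical helper)."""
    if fold_system:
        # Merge both prompts into a single user turn, separated by a blank line.
        return [{"role": "user", "content": f"{system_prompt}\n\n{user_prompt}"}]
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages("Answer like a pirate.", "Who is PM of India?")
folded = build_messages("Answer like a pirate.", "Who is PM of India?", fold_system=True)
```

Either list can be passed as the first argument to `client.chat_completion`.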




## Local

In [None]:
!pip install bitsandbytes

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
import os

In [None]:
device = "cuda"

In [None]:
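bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                 # store the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",         # 4-bit NormalFloat; the other 4-bit option is "fp4"
    bnb_4bit_use_double_quant=True,    # also quantize the quantization constants
    bnb_4bit_compute_dtype="float16",  # dtype used for the actual matrix multiplications
)
# Other quantization schemes to be aware of: int8, uint8, 1.58-bit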
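
To see why 4-bit loading matters on a Colab GPU, here is a rough back-of-the-envelope estimate of weight storage. It is only a sketch: the parameter count of about 7.25 billion is an assumed approximate figure for this model, `weight_memory_gb` is a hypothetical helper, and the estimate ignores quantization constants, layers that stay unquantized, activations, and the KV cache.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7.25e9  # assumption: approximate parameter count of the 7B model

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # half-precision weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the half-precision footprint; the real figure on the GPU will be somewhat higher.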

In [None]:
import torch
torch.cuda.is_available()

True

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    device_map = device, # cuda
    quantization_config = bnb_config
)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

In [None]:
import gc
gc.collect()

torch.cuda.empty_cache()

In [None]:
system_prompt = "You are friendly chatbot who answers in Pirate slang. write as if pirate is speaking"
user_prompt = "Who is PM of India?"

In [None]:
message = [
    {"role": "system","content": system_prompt},
    {"role": "user","content": user_prompt}
]

In [None]:
encodes = tokenizer.apply_chat_template(message, return_tensors="pt")
model_inputs = encodes.to(device)

In [None]:
kwargs = {
    "do_sample": True,  # without this, generate() ignores temperature/top_p/top_k
    "temperature": 0.9,
    "top_p": 0.95,
    "top_k": 40,
    "max_new_tokens": 2096,
    "pad_token_id": tokenizer.eos_token_id,
}

In [None]:
model_inputs

tensor([[    1,     3,  1763,  1228, 10899, 11474, 10861,  1461, 11962,  1065,
         20114,  1148,  1903,  1370, 29491,  4092,  1158,  1281, 18136,  1148,
          1117,  9479,   781,   781, 12215,  1117, 10400,  1070,  6326, 29572,
             4]], device='cuda:0')
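
The tensor above is the tokenized form of a single prompt string that the chat template builds from `message`. In the IDs, `1` is the start-of-sequence token, `3` and `4` appear to be the `[INST]` and `[/INST]` markers, and the pair `781, 781` is the blank line between the system and user text. The function below is a hand-written approximation of that string for illustration only: `render_prompt` is a hypothetical name, and the exact whitespace is defined by the template shipped with the tokenizer, not by this sketch.

```python
def render_prompt(messages):
    """Approximate the prompt string for one system message plus one user turn."""
    system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
    users = [m["content"] for m in messages if m["role"] == "user"]
    if len(users) != 1:
        raise ValueError("this sketch handles exactly one user turn")
    # The system text is prepended to the user text, separated by a blank line.
    body = f"{system}\n\n{users[0]}" if system else users[0]
    return f"<s>[INST] {body}[/INST]"


example = render_prompt([
    {"role": "system", "content": "Answer like a pirate."},
    {"role": "user", "content": "Who is PM of India?"},
])
print(example)
```

To inspect the real string, call `tokenizer.apply_chat_template(message, tokenize=False)` instead of relying on this approximation.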

In [None]:
generate_ids = model.generate(model_inputs,**kwargs)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
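
Both warnings have a straightforward cause. The first appears when sampling flags are passed without `do_sample=True`. The second appears because only the input IDs were passed, so `generate` has no attention mask. One way to address the second, sketched below, is to ask `apply_chat_template` for a dictionary (`return_dict=True`) so that `attention_mask` travels with `input_ids`. `generate_reply` is a hypothetical wrapper, and it assumes a `transformers` version whose `apply_chat_template` supports `return_dict`.

```python
def generate_reply(model, tokenizer, messages, **gen_kwargs):
    """Tokenize with an attention mask, generate, and decode only the new tokens."""
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_dict=True,      # returns input_ids and attention_mask together
        return_tensors="pt",
    ).to(model.device)

    # Unpacking the dict passes attention_mask to generate() explicitly.
    output_ids = model.generate(**inputs, **gen_kwargs)

    # generate() returns prompt + continuation; drop the prompt tokens.
    prompt_len = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
```

With the objects from this notebook, a call would look like `generate_reply(model, tokenizer, message, do_sample=True, temperature=0.9, max_new_tokens=256)`.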


In [None]:
generate_ids

tensor([[    1,     3,  1763,  1228, 10899, 11474, 10861,  1461, 11962,  1065,
         20114,  1148,  1903,  1370, 29491,  4092,  1158,  1281, 18136,  1148,
          1117,  9479,   781,   781, 12215,  1117, 10400,  1070,  6326, 29572,
             4, 20805, 15863, 29492, 29576,  1183,  2826, 29510, 29479,  1070,
          1040,  7503,  5077, 29493, 15532,  1115,  2228,  1030, 29510,  1232,
         29494,  1174, 29572,  2493,  1115,  7315,  1567,  1589,  1040, 10280,
          1290, 12178,  1186,  1260,  1060,  1288,  4581, 29478, 29493,  1168,
          1115,  1040,  1392,  3478,  1031,  1030, 29510,  1040,  1947,  5077,
          6326,  1935,  2970, 29491,  1098,  7955,  1032,  7955, 29493, 14333,
         29576,     2]], device='cuda:0')

In [None]:
input_length = model_inputs.shape[1]

In [None]:
input_length

31

In [None]:
new_tokens = generate_ids[0][input_length:]

In [None]:
decoded = tokenizer.decode(new_tokens,skip_special_tokens=True)

In [None]:
decoded

"Arr matey! The cap'n of the Indian ship, ye be askin' 'bout? That be none other than the honorable Captain Narendra Modi, he be the one steerin' the good ship India these days. Aye aye, captain!"

In [None]:
from transformers import pipeline

In [None]:
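# The model is already loaded on the GPU, so the pipeline reuses its device;
# no device_map argument is needed here.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)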

Device set to use cuda


In [None]:
response = pipe(message,**kwargs)

In [None]:
response[0]['generated_text'][-1]['content']

' Arr matey! The captain of the Indian ship be called Narendra Modi, the fearless leader of the Bharatiya Janata Party. Been in command since 2014, he is. Keep a weather eye on him, he be a formidable foe or a steadfast ally, depending on yer course. Yarrr!'
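
When the pipeline is given a list of chat messages, it returns the whole conversation with the assistant's reply appended as the last message, which is why the indexing above ends in `[-1]['content']`. The helper below is a small sketch that wraps that lookup and trims the leading space; `last_reply` is a hypothetical name, and `fake_response` is made-up data with the same shape as `response`.

```python
def last_reply(pipeline_output):
    """Return the assistant's final message text from a chat-style pipeline result."""
    conversation = pipeline_output[0]["generated_text"]
    last = conversation[-1]
    if last.get("role") != "assistant":
        raise ValueError("last message is not from the assistant")
    return last["content"].strip()


# Made-up data shaped like the pipeline's return value for chat input.
fake_response = [{
    "generated_text": [
        {"role": "user", "content": "Who is PM of India?"},
        {"role": "assistant", "content": " Arr matey!"},
    ]
}]

print(last_reply(fake_response))
```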