## **Install dependencies**

In [2]:
!pip install -q transformers accelerate bitsandbytes huggingface_hub

## **Secure HuggingFace authentication**

In [3]:
import os
from huggingface_hub import login, whoami

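def authenticate_huggingface(allow_interactive=True):
    """Log in to the Hugging Face Hub.

    The token is looked up in this order: the HF_TOKEN environment
    variable, Colab Secrets, then (optionally) an interactive prompt.
    """
    token = os.getenv("HF_TOKEN")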

    # Colab secrets fallback
    if not token:
        try:
            from google.colab import userdata
            token = userdata.get("HF_TOKEN")
            if isinstance(token, dict):
                token = token.get("value")
            if token:
                print("🔐 Using token from Colab Secrets")
        except Exception:
            pass

    # Interactive fallback
    if not token and allow_interactive:
        from getpass import getpass
        token = getpass("Enter HuggingFace token: ")

    if token:
        try:
            try:
                # Succeeds only if a valid login is already cached
                whoami()
                print("✅ Already authenticated")
            except Exception:
                login(token=token)

            print(f"✅ Logged in as {whoami()['name']}")
        except Exception as e:
            raise RuntimeError(f"Authentication failed: {e}") from e
    else:
        print("⚠️ No HuggingFace token detected")

authenticate_huggingface()


🔐 Using token from Colab Secrets
✅ Already authenticated
✅ Logged in as NY0641
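
The lookup order used by `authenticate_huggingface` (environment variable, then Colab Secrets, then an interactive prompt) can be exercised without touching the network. The sketch below is a minimal illustration under that assumption: `resolve_token`, `secret_lookup`, and `prompt` are hypothetical names introduced here and are not part of `huggingface_hub`.

```python
from typing import Callable, Mapping, Optional


def resolve_token(
    env: Mapping[str, str],
    secret_lookup: Optional[Callable[[str], Optional[str]]] = None,
    prompt: Optional[Callable[[], str]] = None,
    name: str = "HF_TOKEN",
) -> Optional[str]:
    """Return the first non-empty token found, or None."""
    # 1. Environment variable
    token = env.get(name)
    if token:
        return token
    # 2. Secret store (for example, a wrapper around Colab's userdata.get)
    if secret_lookup is not None:
        try:
            token = secret_lookup(name)
        except Exception:
            token = None  # a missing or inaccessible secret is not fatal
        if token:
            return token
    # 3. Interactive prompt, only if the caller supplied one
    if prompt is not None:
        token = prompt().strip()
        if token:
            return token
    return None


# Usage with stand-in sources (no real credentials involved)
print(resolve_token({"HF_TOKEN": "from-env"}))
print(resolve_token({}, secret_lookup=lambda key: "from-secrets"))
print(resolve_token({}))
```

Keeping the lookup separate from the login call makes it possible to test each fallback in isolation.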


## **Runtime hardware detection** - Clean hardware selector with safe fallback

In [4]:
import torch
import warnings


def resolve_runtime_config(min_vram_gb: int = 8):
    """
    Select best runtime device.

    - Uses GPU if available and large enough
    - Falls back to CPU otherwise
    - Warns user if CPU is used (slower inference)

    Returns:
        (torch_dtype, device_map)
    """

    # --- If GPU exists, check memory ---
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9

        if vram_gb >= min_vram_gb:
            dtype = (
                torch.bfloat16
                if torch.cuda.is_bf16_supported()
                else torch.float16
            )
            return dtype, "auto"

        # GPU exists but too small → fall back
        warnings.warn(
            f"GPU detected but only {vram_gb:.1f}GB VRAM available. "
            "Falling back to CPU. Inference may be slow.",
            RuntimeWarning,
        )

    else:
        warnings.warn(
            "No GPU detected. Running on CPU. Inference may be slow.",
            RuntimeWarning,
        )

    # --- CPU fallback ---
    return torch.float32, "cpu"


# usage
torch_dtype, device_map = resolve_runtime_config()
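
The selection rule in `resolve_runtime_config` can also be checked on a machine without a GPU by separating the decision from the hardware query. The following is a sketch under that assumption: `choose_runtime` is a hypothetical helper that takes plain values and returns dtype names as strings instead of `torch` objects.

```python
from typing import Tuple


def choose_runtime(
    cuda_available: bool,
    total_memory_bytes: int = 0,
    bf16_supported: bool = False,
    min_vram_gb: float = 8,
) -> Tuple[str, str]:
    """Mirror the dtype/device decision using plain values."""
    if cuda_available:
        # Decimal gigabytes, matching the division by 1e9 in the cell above
        vram_gb = total_memory_bytes / 1e9
        if vram_gb >= min_vram_gb:
            return ("bfloat16" if bf16_supported else "float16"), "auto"
    # No GPU, or a GPU below the threshold
    return "float32", "cpu"


# A 16 GiB card reports 16 * 2**30 bytes, which is about 17.2 decimal GB
print(choose_runtime(True, 16 * 2**30, bf16_supported=False))
print(choose_runtime(True, 6 * 10**9, bf16_supported=True))
print(choose_runtime(False))
```

Because the threshold is compared in decimal gigabytes, a card marketed as 8 GiB comfortably passes the default `min_vram_gb=8` check.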


## **Safe Model Load Wrapper**

In [5]:

import time


def safe_model_load(load_fn, max_load_time: int = 300):
    """
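    Calls load_fn and measures how long it takes.

    The time limit is checked only after load_fn returns, so a slow load
    is reported afterwards rather than interrupted.

    Args:
        load_fn: Callable responsible for loading the model.
        max_load_time: Maximum allowed load duration (seconds).

    Returns:
        Loaded model object.

    Raises:
        RuntimeError: If loading fails or took longer than max_load_time.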
    """

    start_time = time.monotonic()

    try:
        model = load_fn()

    except Exception as exc:
        raise RuntimeError("Model initialization failed.") from exc

    elapsed = time.monotonic() - start_time

    if elapsed > max_load_time:
        raise RuntimeError(
            f"Model initialization exceeded {max_load_time}s "
            f"(actual: {elapsed:.1f}s)."
        )

    return model
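
`safe_model_load` measures the load time after the fact. If the goal is to stop waiting once the limit has passed, one possible variant runs the loader in a worker thread and waits with a timeout. This is a sketch, not the notebook's method: `load_with_deadline` and `slow_loader` are hypothetical names, and a Python thread cannot be force-stopped, so a timed-out load may keep running in the background.

```python
import concurrent.futures
import time


def load_with_deadline(load_fn, max_load_time: float = 300.0):
    """Run load_fn in a worker thread and stop waiting after max_load_time seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(load_fn)
    try:
        return future.result(timeout=max_load_time)
    except concurrent.futures.TimeoutError as exc:
        raise RuntimeError(
            f"Model initialization exceeded {max_load_time}s."
        ) from exc
    except Exception as exc:
        raise RuntimeError("Model initialization failed.") from exc
    finally:
        # Do not block on the worker; it is left to finish on its own.
        executor.shutdown(wait=False)


def slow_loader():
    time.sleep(0.3)  # stand-in for a slow download
    return "model"


print(load_with_deadline(slow_loader, max_load_time=2.0))
try:
    load_with_deadline(slow_loader, max_load_time=0.05)
except RuntimeError as err:
    print(err)
```

The trade-off is that the caller regains control on time, while the abandoned load still consumes bandwidth and memory until it finishes.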


## **Load Mistral model (4-bit quantized)**

In [6]:
from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM, pipeline

MODEL_NAME="mistralai/Mistral-7B-Instruct-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype
)

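def load_pipeline():
    """Load the tokenizer and the 4-bit model, then wrap them in a pipeline."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    # Note: newer transformers releases warn that `torch_dtype` is deprecated
    # in favor of `dtype`.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        torch_dtype=torch_dtype,
        device_map=device_map
    )

    # The model is already loaded and placed on its device(s), so the
    # pipeline does not need device_map or torch_dtype again.
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer
    )

    return tokenizer, pipe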

tokenizer, mistral_pipeline = safe_model_load(load_pipeline)


`torch_dtype` is deprecated! Use `dtype` instead!
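
Rough arithmetic on the weights alone shows why the 4-bit configuration matters for a 7B model. The sketch below rests on two assumptions that are not stated in this notebook: about 7.24 billion parameters for Mistral-7B, and about 0.127 extra bits per weight for NF4 quantization constants with double quantization (the figure reported in the QLoRA paper). Real memory use is higher, since some layers stay unquantized and activations and the KV cache also need memory. `weight_footprint_gb` is a hypothetical helper.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7.24e9  # assumed parameter count for Mistral-7B

for label, bits in [
    ("float32", 32),
    ("float16 / bfloat16", 16),
    ("nf4 + double quant", 4.127),  # assumed effective bits per weight
]:
    print(f"{label:>20}: {weight_footprint_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the 16-bit weights alone would exceed the default 8 GB threshold used by `resolve_runtime_config`, while the 4-bit weights fit with room to spare.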


## **Token memory management**


##### Token-based truncation of the conversation history


In [11]:
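MAX_INPUT_TOKENS = 6000

def count_tokens(messages):
    """Return the number of prompt tokens the chat template produces."""
    tokens = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,  # ask for a dict so "input_ids" is always present
        return_tensors="pt"
    )
    return tokens["input_ids"].shape[-1]

def truncate_history(messages):
    """Drop the oldest turns until the prompt fits MAX_INPUT_TOKENS."""
    # Index 0 is the system message. Remove the oldest user/assistant pair
    # together so the remaining roles still alternate.
    while count_tokens(messages) > MAX_INPUT_TOKENS and len(messages) > 3:
        del messages[1:3]

    return messages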
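
The truncation idea can be tried without loading a model. The sketch below trims history in user/assistant pairs so that roles keep alternating after the system message. It is an illustration under stated assumptions: `truncate_pairs` and `word_count` are hypothetical names, and counting whitespace-separated words is only a crude stand-in for the real tokenizer.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def truncate_pairs(
    messages: List[Message],
    count: Callable[[List[Message]], int],
    max_tokens: int,
) -> List[Message]:
    """Return a trimmed copy that keeps messages[0] and the newest turns."""
    trimmed = list(messages)
    # Keep the system message (index 0) and the latest user message;
    # drop the oldest user/assistant pair while over budget.
    while count(trimmed) > max_tokens and len(trimmed) > 3:
        del trimmed[1:3]
    return trimmed


def word_count(messages: List[Message]) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return sum(len(m["content"].split()) for m in messages)


history = [
    {"role": "system", "content": "be brief"},
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five six"},
    {"role": "user", "content": "seven eight"},
    {"role": "assistant", "content": "nine ten"},
    {"role": "user", "content": "eleven"},
]
short = truncate_pairs(history, word_count, max_tokens=8)
print([m["role"] for m in short])
```

Returning a copy rather than mutating the input keeps the full transcript available, for example for logging, while only the trimmed version is sent to the model.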


In [13]:
# Chat Loop Function

messages=[{
    "role":"system",
    "content":"You are a helpful, smart and friendly AI assistant. Do not leave the sentence incomplete."
}]

def get_mistral_response(user_input: str) -> str:
    """Append the user turn, generate a reply, and store it in the history."""
    global messages

    messages.append({"role": "user", "content": user_input})

    # token-based truncation
    messages = truncate_history(messages)

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    output = mistral_pipeline(
        prompt,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.2,
        top_p=0.9,
        return_full_text=False  # return only the newly generated text
    )

    reply = output[0]["generated_text"].strip()

    messages.append({"role": "assistant", "content": reply})

    return reply
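
The chat function above keeps its history in a module-level list. One alternative, sketched here under the assumption that the generation step can be passed in as a callable, is to hold the history in an object. `ChatSession` and `echo_model` are hypothetical names; `echo_model` merely stands in for the call to the pipeline.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class ChatSession:
    """Holds one conversation's history instead of a module-level list."""

    def __init__(self, system_prompt: str, generate: Callable[[List[Message]], str]):
        self.messages: List[Message] = [{"role": "system", "content": system_prompt}]
        self._generate = generate

    def ask(self, user_input: str) -> str:
        self.messages.append({"role": "user", "content": user_input})
        try:
            reply = self._generate(self.messages).strip()
        except Exception:
            # Roll back so a failed turn leaves no dangling user message
            self.messages.pop()
            raise
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def echo_model(messages: List[Message]) -> str:
    """Stand-in generator: repeats the latest user message."""
    return "You said: " + messages[-1]["content"]


session = ChatSession("You are a helpful assistant.", echo_model)
print(session.ask("Hi"))
print(len(session.messages))
```

With this shape, several independent conversations can coexist in one notebook, and a failed generation does not leave the history in a state that breaks role alternation.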



In [14]:
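def run_chatbot():
    """Read user input in a loop until the user types "exit"."""
    print("Mistral chatbot is ready")

    while True:
        user_input = input("user: ").strip()

        if user_input.lower() == "exit":
            print("chat ended.")
            break

        if not user_input:
            continue  # ignore empty input instead of sending it to the model

        reply = get_mistral_response(user_input)
        print("Chatbot:", reply)

if __name__ == "__main__":
    run_chatbot()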



Mistral chatbot is ready
user: Hi


Passing `generation_config` together with generation-related arguments=({'max_new_tokens', 'temperature', 'top_p', 'do_sample'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Chatbot: Hello! How can I assist you today?
user: is the parliament assembly a war zone?



Chatbot: No, the Parliament Assembly is not a war zone. It is a peaceful gathering place where elected representatives from various countries come together to discuss and make decisions on behalf of their constituents. While there may be disagreements and debates, the Parliament Assembly is a forum for peaceful dialogue and cooperation.
user: but people always talk rudely, questions one' s perfomance.



Chatbot: It is true that in any political setting, there may be disagreements and debates, and people may express their opinions in a strong or assertive manner. However, it is important to remember that the Parliament Assembly is a place for respectful discourse and constructive dialogue. While it is natural for people to ask questions and challenge each other's ideas, it is important to do so in a respectful and civil manner. It is also important to remember that the Parliament Assembly is a place for cooperation and finding common ground, rather than just focusing on differences.
user: exit
chat ended.
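
The loop that produced the transcript above reads from `input` and writes with `print`, which makes it hard to test. A possible variant, shown here only as a sketch, takes the reader, writer, and responder as parameters and also ends cleanly on end of input or Ctrl+C. `chat_loop` is a hypothetical name, and `str.upper` stands in for the model call.

```python
from typing import Callable


def chat_loop(
    respond: Callable[[str], str],
    read: Callable[[str], str] = input,
    write: Callable[[str], None] = print,
) -> int:
    """Run until "exit", end of input, or Ctrl+C; return the number of turns."""
    turns = 0
    while True:
        try:
            user_input = read("user: ").strip()
        except (EOFError, KeyboardInterrupt):
            break
        if user_input.lower() == "exit":
            break
        if not user_input:
            continue  # skip empty lines
        write(f"Chatbot: {respond(user_input)}")
        turns += 1
    write("chat ended.")
    return turns


# Scripted run: no keyboard input or model required
scripted = iter(["Hi", "", "exit"])
printed = []
turns = chat_loop(str.upper, read=lambda _prompt: next(scripted), write=printed.append)
print(turns, printed)
```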
