## Installations

# Inference

## GPU version

## Text-to-Text (Chat Mode)

In [2]:
# CPU (no need for Triton)

import os
from huggingface_hub import login
from transformers import pipeline
import torch

# Load token from file and set as environment variable
with open('Gemma-Token.txt', 'r') as f:
    hf_token = f.read().strip()

os.environ["HF_TOKEN"] = hf_token

# Login using the environment variable
login(token=os.environ["HF_TOKEN"])

pipe = pipeline(
    "text-generation",
    model="google/gemma-3-4b-it",
    device=-1,
    torch_dtype=torch.bfloat16
)

# Start with an empty chat history (no system prompt is set in this run)
chat_history = ""

print("Bank Zəng Mərkəzinə Xoş Gəlmisiniz! (Çıxmaq üçün 'exit' yazın)")

while True:
    # Get user input in Azerbaijani
    user_input = input("Müştəri: ")
    
    if user_input.lower() == "exit":
        print("Çıxış edilir. Sağ olun!")
        break
    
    # Add the user turn, then open the model turn (Gemma turn format)
    chat_history += f"<start_of_turn>user\n{user_input}<end_of_turn>\n<start_of_turn>model\n"

    # Generate the model response; replies longer than 200 new tokens are cut off
    output = pipe(chat_history, max_new_tokens=200)

    # The pipeline returns prompt + completion, so keep only the text
    # after the last model-turn marker
    response_text = output[0]['generated_text'].split('<start_of_turn>model')[-1].strip()

    # Print response
    print(f"Operator: {response_text}")

    # Close the model turn so the next prompt carries the full context
    chat_history += f"{response_text}<end_of_turn>\n"

Bank Zəng Mərkəzinə Xoş Gəlmisiniz! (Çıxmaq üçün 'exit' yazın)


Müştəri:  Salam! Bura bankdı?


Operator: Salam! Həyat necə gedir? Mən yaxşıyam, sən necəsən?


Müştəri:  Bankdakı kartımın balansını necə öyrənə bil


Operator: Bankdakı kartının balansını öyrənmək üçün bir neçə yol var:

**1. Mobil bankçılıq:**

*   Çoğunluq bankları mobil bankçılıq tətbiqi (örneğin, Bankassistent, MobilBank, PayFix) təqdim edir. Bu tətbiqlərə giriş edərək, kartınızın balansını və digər məlumatları necə görmək üçün, bankın tətbiqi üzrə göstərilən göstərişləri izləmək lazımdır.

**2. İnternet bankçılığı:**

*   Bankın internet bankçılığı portalına (örneğin, [https://www.bankofamerica.com/](https://www.bankofamerica.com/)) giriş edin. Onlayn hesabınıza daxil olun və kartınızın balansını orada görməyiniz mümkünd


Müştəri:  exit


Çıxış edilir. Sağ olun!
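
The chat loop above assembles its prompt by hand from `<start_of_turn>` and `<end_of_turn>` tags. A minimal sketch of a helper that keeps this formatting in one place follows. `build_gemma_prompt` and its arguments are hypothetical names, and folding the system text into the first user turn is an assumption about how Gemma treats system instructions. In practice, `tokenizer.apply_chat_template` is the more reliable way to obtain the exact format.

```python
def build_gemma_prompt(turns, system_prompt=None):
    """Format (role, text) turns as a Gemma-style prompt string.

    `turns` is a list of (role, text) pairs with role "user" or "model".
    The returned string ends with an open model turn, so the model speaks next.
    """
    parts = []
    for index, (role, text) in enumerate(turns):
        if role not in ("user", "model"):
            raise ValueError(f"unknown role: {role!r}")
        text = text.strip()
        # Assumption: system instructions are prepended to the first user turn.
        if index == 0 and role == "user" and system_prompt:
            text = f"{system_prompt.strip()}\n\n{text}"
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    # Leave the model turn open for generation.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


if __name__ == "__main__":
    history = [("user", "Salam! Bura bankdı?")]
    prompt = build_gemma_prompt(
        history,
        system_prompt="Sən bir bank zəng mərkəzi operatorusan.",
    )
    print(prompt)
```

The resulting string can be passed to `pipe(...)` in place of `chat_history`; after each reply, append a `("model", response_text)` pair to `history` and rebuild the prompt.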


## Text-to-Text (Single-Prompt Mode)

In [1]:
# CPU (no need for Triton)

import os
from huggingface_hub import login
from transformers import pipeline
import torch

# Load token from file and set as environment variable
with open('Gemma-Token.txt', 'r') as f:
    hf_token = f.read().strip()

os.environ["HF_TOKEN"] = hf_token

# Login using the environment variable
login(token=os.environ["HF_TOKEN"])

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",
    device=-1,
    torch_dtype=torch.bfloat16
)

# Define the structured message prompt (text-only)
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "Sən bir bank zəng mərkəzi operatorusan. Müştəri suallar verir, sən nəzakətli cavab verirsən."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Müştəri: Salam, mənim kartımın balansını necə yoxlaya bilərəm?"},
            {"type": "text", "text": "Operator: Kart balansınızı yoxlamaq üçün mobil bank tətbiqinizə daxil olun və ya *123# yığın."},
            {"type": "text", "text": "Müştəri: Kredit limitimi necə artıra bilərəm?"},
            {"type": "text", "text": "Operator:"}  # This is where the model will continue.
        ]
    }
]

# Generate the response
output = pipe(text=messages, max_new_tokens=200)

# Extract and print the model's reply
print(output[0]["generated_text"][-1]["content"])

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Device set to use cpu


Operator: Kredit limitinizi artırmaq üçün, kredit kartınızın müştərilik müqaviləsində göstərilən prosedur üzrə müraciət edə bilərsiniz. Əlavə kredit limitləri üzərində danışıqlar aparmaq üçün bank şöbəsinə başvura bilərsiniz. Əlavə olaraq, kredit limitinizin artırılması üçün əlavə təminat göstərməyiniz də ehtiyac ola bilər. Ətraflı məlumat üçün bank şöbəsinə müraciət etməyə dəvət edirəm.
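
The single-prompt cell packs the whole example dialogue into one user message. An alternative sketch keeps each turn as its own message with `user` and `assistant` roles. `text_message` and `few_shot_messages` are hypothetical helpers, and whether this layout yields better answers than the original was not tested here.

```python
def text_message(role, text):
    """Build one chat message in the structured (list-of-parts) format."""
    return {"role": role, "content": [{"type": "text", "text": text}]}


def few_shot_messages(system_prompt, examples, question):
    """Return a message list: system prompt, example turns, then the question.

    `examples` is a list of (customer_text, operator_text) pairs.
    """
    messages = [text_message("system", system_prompt)]
    for customer_text, operator_text in examples:
        messages.append(text_message("user", customer_text))
        messages.append(text_message("assistant", operator_text))
    messages.append(text_message("user", question))
    return messages


messages = few_shot_messages(
    "Sən bir bank zəng mərkəzi operatorusan. Müştəri suallar verir, sən nəzakətli cavab verirsən.",
    [(
        "Salam, mənim kartımın balansını necə yoxlaya bilərəm?",
        "Kart balansınızı yoxlamaq üçün mobil bank tətbiqinizə daxil olun və ya *123# yığın.",
    )],
    "Kredit limitimi necə artıra bilərəm?",
)

# Then, as in the cell above:
# output = pipe(text=messages, max_new_tokens=200)
```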


## Image-to-Text

In [4]:
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "Sən köməkçi bir asistentsən."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "Konfet üzərində hansı heyvan var?"}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])

Konfet üzərində quş var.


## **Conclusion & Insights on Gemma 3-4B-IT for Azerbaijani Applications**

Based on the code executions and outputs above, the main conclusions are:

---

### 1. **Lightweight Yet Capable Model**

* The **Gemma 3-4B-IT** model, despite having only 4 billion parameters, processed Azerbaijani prompts and replied in Azerbaijani in every run above.
* It handled both **single-prompt completion** and an **interactive multi-turn chat loop** without requiring a large memory footprint.
* All runs used a **CPU-only environment** (`device=-1`), with no GPU acceleration.
* Inference times were judged acceptable for practical use, although the notebook records no timings. This suggests that large-scale distributed compute is **not a necessity** for Azerbaijani-language AI tasks.
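
One way to put a number on "acceptable" inference time is a small timing wrapper around the generation call. The sketch below is illustrative only: `timed_call`, `tokens_per_second`, and `fake_generate` are hypothetical names, and `fake_generate` merely stands in for `pipe(...)` so the example runs without loading the model.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def tokens_per_second(num_new_tokens, elapsed_seconds):
    """Convert a token count and a wall-clock duration into throughput."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_new_tokens / elapsed_seconds


def fake_generate(prompt, max_new_tokens=200):
    """Stand-in for pipe(...); sleeps briefly instead of running a model."""
    time.sleep(0.01)
    return [{"generated_text": prompt + " ..."}]


output, seconds = timed_call(fake_generate, "Salam", max_new_tokens=200)
print(f"latency: {seconds:.3f} s")
```

With the real pipeline, replace `fake_generate` with `pipe` and count the new tokens with the model's tokenizer before calling `tokens_per_second`.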

---

### 2. **Strong Instruction-Following Ability in Azerbaijani**

* Given a system prompt (the single-prompt example), the model followed the role-play instruction, acting as a **bank call center operator** and replying in Azerbaijani. In the chat loop, which sets no system prompt, its first reply was a casual greeting rather than an operator's answer.
* Even without explicit fine-tuning on Azerbaijani customer service data, its base multilingual capabilities were sufficient for structured conversation patterns, although some replies mix in Turkish forms (e.g., "örneğin", "başvura") and one points to a non-local bank URL.
* Given this, **instruction-tuning with domain-specific datasets (Bank Call Center, Government Services, Customer Support, etc.)** would likely improve its domain accuracy at modest compute cost.

---

### 3. **Single-GPU / CPU Deployment Feasibility**

* The model's **small size** allows inference on **single-GPU systems (e.g., RTX 3090, 4090)** and even on **CPU-only setups**, as the CPU runs above show.
* For inference, this removes the need for expensive **distributed cluster setups** or rented **high-end GPU servers** (e.g., A100/H100s) in Azerbaijani-language applications.
* **On-premise deployment becomes viable**, especially for organizations (banks, government agencies, local businesses) that prioritize **data privacy** and **cost-efficiency**.
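
The single-GPU claim can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below assumes a nominal 4 billion parameters (the actual checkpoint may be somewhat larger) and counts weights only, ignoring activations and the KV cache. `estimate_weight_gib` is a hypothetical helper.

```python
# Bytes used to store one parameter in each numeric format.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def estimate_weight_gib(num_params, dtype="bfloat16"):
    """Estimate weight memory in GiB (weights only, no activations or cache)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Assumption: a nominal 4e9 parameters for a "4B" model.
for dtype in ("float32", "bfloat16", "int8"):
    print(f"{dtype:>8}: {estimate_weight_gib(4e9, dtype):.1f} GiB")
```

In `bfloat16`, as used in the cells above, this comes to roughly 7.5 GiB of weights, which fits within the 24 GB of an RTX 3090 or 4090 with room left for activations.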

---

### 4. **Scalable for Localized AI Solutions**

* The combination of **small model size, low infrastructure cost, and acceptable language capability** makes Gemma 3-4B-IT a strong foundation for localized AI products:

  * **Bank Call Center Automation**
  * **E-Government Citizen Services**
  * **Azerbaijani Virtual Assistants**
  * **Chatbots for Local Enterprises**
* With focused **instruction tuning on Azerbaijani-specific datasets**, this model can be scaled across multiple domains without the need for large compute clusters.

---

### Final Thought:

> **Gemma 3-4B-IT suggests that building useful Azerbaijani AI systems is now feasible with modest compute resources. With instruction tuning and dataset curation, this model could power a wave of cost-effective, privacy-friendly, and locally deployed AI solutions in Azerbaijan.**

---