# **```CLI-based RAG application```**

##### ```Below are the requirements```

| **Area**         | **Requirement**                                                                                                                                         |
|------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Model**        | Download Llama-3.1 8B Instruct from Hugging Face. Include a script that converts it to INT4 using MLX or OpenVINO IR.                                   |
| **Knowledge Base** | Parse `procyon_guide.pdf`, chunk text, generate embeddings, and store them using FAISS, Qdrant, Milvus, or pgvector.                                   |
| **CLI Tool**     | Command: `rag_cli --query "..."` → retrieve *k* chunks → stream answer with references.                                                                 |
| **Dependencies** | Provide fully pinned dependencies in `requirements.txt` or `environment.yml`.                                                                           |
| **Local Inference** | Once the INT4 model is available, your script must load it locally only – no downloading of FP16 weights at runtime.                                 |
| **Code Quality** | Python or C++ with clear modular structure, error handling, and meaningful docstrings/comments.                                                         |
| **README.md**    | Step-by-step: venv setup, install deps, convert model, ingest PDF, run demo, expected output, and hardware specs.                                       |
| **Self-Test**    | A one-liner shell or batch script (`run_demo.*`) to execute the full pipeline and answer a sample query.                                                |
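
The "chunk text" step in the Knowledge Base row can be sketched as a fixed-size character window with overlap. This is only an illustration under assumptions: the function name `chunk_text` and the `chunk_size` and `overlap` parameters are hypothetical, and a real pipeline may prefer splitting on sentences or tokens.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into character windows that overlap by `overlap` characters.

    Hypothetical helper for illustration; the sizes are arbitrary defaults.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for i, chunk in enumerate(chunk_text("abcdefghij", chunk_size=4, overlap=1)):
        print(i, chunk)
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.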


### **```Imports```**

In [1]:
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig


#### ```Testing GPU```

In [2]:
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))

True
NVIDIA GeForce RTX 3090 Ti


##### **```Converting the model to INT4 with GPTQ```**

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Set up GPTQConfig
gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer
)

# Load and quantize model
quantized_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
    quantization_config=gptq_config
)

# Save locally
quantized_model.save_pretrained("llama3.1-8B-gptq")
tokenizer.save_pretrained("llama3.1-8B-gptq")

Loading checkpoint shards: 100%|██████████| 4/4 [00:14<00:00,  3.74s/it]
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Generating train split: 356317 examples [00:04, 72715.28 examples/s]
Quantizing model.layers blocks : 100%|██████████| 32/32 [1:00:38<00:00, 113.69s/it]
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


('llama3.1-8B-gptq\\tokenizer_config.json',
 'llama3.1-8B-gptq\\special_tokens_map.json',
 'llama3.1-8B-gptq\\chat_template.jinja',
 'llama3.1-8B-gptq\\tokenizer.json')
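
To make the idea of 4-bit storage concrete, here is a minimal sketch of group-wise round-to-nearest quantization in plain Python. It is a simplification and not the GPTQ algorithm itself: GPTQ additionally uses calibration data (the `c4` dataset above) to compensate for rounding error. The function names are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Map one group of floats to integers in [0, 2**bits - 1].

    Returns the integer codes plus the scale and zero point needed to decode them.
    """
    qmax = (1 << bits) - 1                 # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0        # fall back to 1.0 if all weights are equal
    zero = round(-lo / scale)              # integer code that represents 0.0
    codes = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Recover approximate floats from the integer codes."""
    return [(c - zero) * scale for c in codes]


if __name__ == "__main__":
    group = [0.12, -0.40, 0.33, 0.05, -0.27, 0.48, -0.09, 0.21]
    codes, scale, zero = quantize_group(group)
    restored = dequantize_group(codes, scale, zero)
    print("codes:", codes)
    print("max abs error:", max(abs(a - b) for a, b in zip(group, restored)))
```

Each weight then costs 4 bits plus a small per-group overhead for the scale and zero point, which is where the size reduction comes from.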

##### **```Verification of Conversion to INT4```**

```The model appears to be quantized if:```
- Model size is ~4-5 GB (instead of ~16 GB)
- You see int32/uint8 parameters or GPTQ modules
- GPU memory usage is significantly lower
- Inference still works correctly
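
The 4-5 GB figure is a rule of thumb. A rough estimate of the expected checkpoint size can be worked out by hand, as sketched below. The architecture numbers are assumed from the public Llama-3.1 8B config, and the storage layout (group size 128, fp16 scales, 4-bit zero points, int32 `g_idx`) is an assumed typical GPTQ packing rather than something read from the saved files.

```python
# Assumed architecture constants for Llama-3.1 8B
HIDDEN = 4096
INTERMEDIATE = 14336
LAYERS = 32
KV_DIM = 1024        # 8 key/value heads x 128 dims each
VOCAB = 128256
GROUP_SIZE = 128     # assumed GPTQ group size


def linear_shapes():
    """(in_features, out_features) of the seven quantized projections per layer."""
    return [
        (HIDDEN, HIDDEN),        # q_proj
        (HIDDEN, KV_DIM),        # k_proj
        (HIDDEN, KV_DIM),        # v_proj
        (HIDDEN, HIDDEN),        # o_proj
        (HIDDEN, INTERMEDIATE),  # gate_proj
        (HIDDEN, INTERMEDIATE),  # up_proj
        (INTERMEDIATE, HIDDEN),  # down_proj
    ]


def estimate_bytes():
    """Return (estimated checkpoint bytes, number of parameters left in fp16)."""
    quant_weights = LAYERS * sum(i * o for i, o in linear_shapes())
    groups = quant_weights // GROUP_SIZE
    packed = quant_weights // 2      # 4 bits per quantized weight
    scales = groups * 2              # one fp16 scale per group
    zeros = groups // 2              # one 4-bit zero point per group
    g_idx = LAYERS * sum(i for i, _ in linear_shapes()) * 4  # int32 per input row
    # Embeddings, lm_head and RMSNorm weights stay in fp16
    fp16_params = 2 * VOCAB * HIDDEN + (2 * LAYERS + 1) * HIDDEN
    total = packed + scales + zeros + g_idx + fp16_params * 2
    return total, fp16_params


total_bytes, fp16_params = estimate_bytes()
print(f"fp16 parameters left unquantized: {fp16_params:,}")
print(f"estimated checkpoint size: {total_bytes / 1024**3:.2f} GiB")
```

Under these assumptions the estimate lands a little above 5 GiB, because roughly a billion embedding, output-head and norm parameters stay in fp16. A result slightly over the 4-5 GB range is therefore still consistent with a 4-bit model.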

In [4]:
import torch
print(torch.version.cuda)   # Should print a CUDA version, e.g., '11.8'
print(torch.cuda.is_available())  # Should be True if GPU is usable

11.8
True


In [3]:
# Load your quantized model
model_path = "E:/My Projects/Hegtavic Projects/All_in_media/Allinmedia-test-project/models/llama3.1-8B-gptq"
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

print("============ QUANTIZATION VERIFICATION ============\n")
# Below are different tests to verify if the model is quantized correctly

# 1. Check quantization config
print("1. Quantization Configuration:")

# Confirm the loaded model exposes a quantization_config before reading it.
if hasattr(model, 'config') and hasattr(model.config, 'quantization_config'):
    quant_config = model.config.quantization_config
    # getattr() with a default avoids an AttributeError if a field is missing.
    print(f"   Quantization method: {getattr(quant_config, 'quant_method', 'Not found')}")
    print(f"   Bits: {getattr(quant_config, 'bits', 'Not found')}")
    print(f"   Group size: {getattr(quant_config, 'group_size', 'Not found')}")
    print(f"   Dataset: {getattr(quant_config, 'dataset', 'Not found')}")
else:
    print("   No quantization config found in model.config")
    
# 2. Check model size on disk
def get_folder_size(folder_path):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(folder_path):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            ## Checking for the size of the file
            total_size += os.path.getsize(filepath)
    return total_size

print(f"\n2. Model Size on Disk:")
if os.path.exists(model_path):
    size_bytes = get_folder_size(model_path)
    ## Convert bytes to gigabytes
    size_gb = size_bytes / (1024**3)
    print(f"   Total size: {size_gb:.2f} GB")
    print(f"   Expected for int4: ~4-5 GB (vs ~16 GB for fp16)")
else:
    print("   Model path not found")

# 3. Check parameter dtypes and sizes
print(f"\n3. Parameter Analysis:")
total_params = 0
quantized_params = 0
param_dtypes = {}

for name, param in model.named_parameters():
    total_params += param.numel()  # numel() = number of scalar elements in the tensor
    dtype_str = str(param.dtype)
    
    # Count parameters by dtype
    if dtype_str in param_dtypes:
        param_dtypes[dtype_str] += param.numel()
    else:
        param_dtypes[dtype_str] = param.numel()
    
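    # Count integer-typed parameters. Note: GPTQ QuantLinear layers typically keep
    # their packed int32 weights as buffers, so named_parameters() only yields the
    # unquantized fp16 tensors (embeddings, norms, lm_head). Iterate over
    # model.named_buffers() as well to see the packed weights.
    if 'int' in dtype_str.lower() or param.dtype in [torch.int8, torch.int32, torch.uint8]:
        quantized_params += param.numel()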

print(f"   Total parameters: {total_params:,}")
print(f"   Parameter dtypes:")
for dtype, count in param_dtypes.items():
    percentage = (count / total_params) * 100
    print(f"    {dtype}: {count:,} ({percentage:.1f}%)")

# 4. Check for GPTQ-specific attributes
print(f"\n4. GPTQ-Specific Checks:")
gptq_indicators = []

# Check for quantization-related attributes in the model
for name, module in model.named_modules():
    module_type = type(module).__name__
    if 'gptq' in module_type.lower() or 'quant' in module_type.lower():
        gptq_indicators.append(f"   Found quantized module: {name} ({module_type})")

if gptq_indicators:
    print("   GPTQ modules found:")
    for indicator in gptq_indicators[:5]:  # Show first 5
        print(indicator)
    if len(gptq_indicators) > 5:
        print(f"   ... and {len(gptq_indicators) - 5} more")
else:
    print("   No obvious GPTQ modules found")

# Check for specific GPTQ files
gptq_files = []
if os.path.exists(model_path):
    for file in os.listdir(model_path):
        if 'gptq' in file.lower() or file.endswith('.safetensors'):
            gptq_files.append(file)

if gptq_files:
    print(f"   GPTQ-related files: {gptq_files}")

# 5. Memory usage check
print(f"\n5. Memory Usage:")
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    memory_allocated = torch.cuda.memory_allocated() / (1024**3)
    print(f"   GPU memory allocated: {memory_allocated:.2f} GB")
    print(f"   Expected for int4: ~4-6 GB (vs ~16 GB for fp16)")
else:
    print("   CUDA not available, cannot check GPU memory")

# 6. Test inference to ensure model works
print(f"\n6. Inference Test:")
try:
    prompt = "The capital of France is"
    # Move the tokenized inputs to the device that holds the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"   Test input: '{prompt}'")
    print(f"   Model output: '{response}'")
    print("   Inference successful")
except Exception as e:
    print(f"   Inference failed: {str(e)}")

CUDA extension not installed.
CUDA extension not installed.
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.
Loading checkpoint shards: 100%|██████████| 2/2 [00:39<00:00, 19.76s/it]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


=== QUANTIZATION VERIFICATION ===

1. Quantization Configuration:
   Quantization method: gptq
   Bits: 4
   Group size: 128
   Dataset: c4

2. Model Size on Disk:
   Total size: 5.36 GB
   Expected for int4: ~4-5 GB (vs ~16 GB for fp16)

3. Parameter Analysis:
   Total parameters: 1,050,939,392
   Parameter dtypes:
     torch.float16: 1,050,939,392 (100.0%)

4. GPTQ-Specific Checks:
   GPTQ modules found:
   Found quantized module: model.layers.0.self_attn.k_proj (QuantLinear)
   Found quantized module: model.layers.0.self_attn.o_proj (QuantLinear)
   Found quantized module: model.layers.0.self_attn.q_proj (QuantLinear)
   Found quantized module: model.layers.0.self_attn.v_proj (QuantLinear)
   Found quantized module: model.layers.0.mlp.down_proj (QuantLinear)
   ... and 219 more
   GPTQ-related files: ['model-00001-of-00002.safetensors', 'model-00002-of-00002.safetensors']

5. Memory Usage:
   GPU memory allocated: 5.34 GB
   Expected for int4: ~4-6 GB (vs ~16 GB for fp16)

6. Inference Test:

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   Test input: 'The capital of France is'
   Model output: 'The capital of France is Paris, and the capital of France is also Paris'
   ✅ Inference successful

=== SUMMARY ===
Your model appears to be quantized if:
• Model size is ~4-5 GB (instead of ~16 GB)
• You see int32/uint8 parameters or GPTQ modules
• GPU memory usage is significantly lower
• Inference still works correctly
