# **```CLI-based RAG application```**

##### ```Below are the requirements```

| **Area**         | **Requirement**                                                                                                                                         |
|------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Model**        | Download Llama-3.1 8B Instruct from Hugging Face. Include a script that converts it to INT4 using MLX or OpenVINO IR.                                   |
| **Knowledge Base** | Parse `procyon_guide.pdf`, chunk text, generate embeddings, and store them using FAISS, Qdrant, Milvus, or pgvector.                                   |
| **CLI Tool**     | Command: `rag_cli --query "..."` → retrieve *k* chunks → stream answer with references.                                                                 |
| **Dependencies** | Provide fully pinned dependencies in `requirements.txt` or `environment.yml`.                                                                           |
| **Local Inference** | Once the INT4 model is available, your script must load it locally only – no downloading of FP16 weights at runtime.                                 |
| **Code Quality** | Python or C++ with clear modular structure, error handling, and meaningful docstrings/comments.                                                         |
| **README.md**    | Step-by-step: venv setup, install deps, convert model, ingest PDF, run demo, expected output, and hardware specs.                                       |
| **Self-Test**    | A one-liner shell or batch script (`run_demo.*`) to execute the full pipeline and answer a sample query.                                                |
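As a rough illustration of the **Knowledge Base** and **CLI Tool** rows, the sketch below shows one way to chunk extracted text with overlap and to parse the `rag_cli --query "..."` command line. The `--k` flag, its default of 4, and the chunk size and overlap values are assumptions made for this example; the requirements do not specify them.

```python
import argparse


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows of `chunk_size` that overlap by `overlap`.

    The default sizes are illustrative assumptions, not values from the requirements.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the command line for the hypothetical `rag_cli` entry point."""
    parser = argparse.ArgumentParser(
        prog="rag_cli",
        description="Answer a question using chunks retrieved from the knowledge base.",
    )
    parser.add_argument("--query", required=True, help="question to answer")
    # --k is an assumed flag name for the number of chunks to retrieve.
    parser.add_argument("--k", type=int, default=4, help="number of chunks to retrieve")
    return parser.parse_args(argv)
```

Character-based windows keep the example dependency-free; a token-based splitter may suit the embedding model better.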


### **```Imports```**

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig

#### ```Testing GPU```

In [None]:
print(torch.cuda.is_available())
# Only query the device name when a CUDA device is actually present
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))

True
NVIDIA GeForce RTX 3090 Ti


##### **```Using GPTQ to convert the model to INT4```**

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Configure 4-bit GPTQ quantization, calibrated on the C4 dataset
gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer
)

# Load the original weights and quantize them during loading
quantized_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
    quantization_config=gptq_config
)

# Save locally
quantized_model.save_pretrained("llama3.1-8B-gptq")
tokenizer.save_pretrained("llama3.1-8B-gptq")

Loading checkpoint shards: 100%|██████████| 4/4 [00:14<00:00,  3.74s/it]
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Generating train split: 356317 examples [00:04, 72715.28 examples/s]
Quantizing model.layers blocks : 100%|██████████| 32/32 [1:00:38<00:00, 113.69s/it]
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


('llama3.1-8B-gptq\\tokenizer_config.json',
 'llama3.1-8B-gptq\\special_tokens_map.json',
 'llama3.1-8B-gptq\\chat_template.jinja',
 'llama3.1-8B-gptq\\tokenizer.json')
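To meet the **Local Inference** requirement, the saved directory can be loaded back without contacting the Hugging Face Hub. The following is a minimal sketch, assuming the `llama3.1-8B-gptq` directory written above and an installed GPTQ backend for `transformers`; the helper name `load_local_model` is hypothetical.

```python
from pathlib import Path

MODEL_DIR = "llama3.1-8B-gptq"  # directory written by save_pretrained above


def load_local_model(model_dir: str = MODEL_DIR):
    """Load the quantized model and tokenizer from disk only (hypothetical helper)."""
    path = Path(model_dir)
    # save_pretrained writes config.json, so its absence means conversion has not run.
    if not (path / "config.json").is_file():
        raise FileNotFoundError(
            f"No quantized model found in '{path}'; run the conversion step first."
        )

    # Imported lazily so the path check works even without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # local_files_only=True makes from_pretrained fail instead of downloading.
    tokenizer = AutoTokenizer.from_pretrained(path, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(
        path, device_map="auto", local_files_only=True
    )
    return tokenizer, model
```

Setting the environment variable `HF_HUB_OFFLINE=1` before starting Python is another way to block runtime downloads.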