To save and load a PyTorch model, follow these steps:

### Saving the Model

1. **Save the Entire Model**:
   ```python
   torch.save(model, 'model.pth')
   ```

2. **Save Only the Model State Dict**:
   ```python
   torch.save(model.state_dict(), 'model_state_dict.pth')
   ```

### Loading the Model

1. **Load the Entire Model**:
   ```python
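   # Unpickles the whole model object, so only load files you trust.
   # Recent PyTorch releases default to weights_only=True, hence the explicit flag.
   model = torch.load('model.pth', weights_only=False)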
   model.eval()  # Set the model to evaluation mode
   ```

2. **Load the Model State Dict**:
   ```python
   model = EnhancedRNN(...)  # Initialize the model architecture
   model.load_state_dict(torch.load('model_state_dict.pth'))
   model.eval()  # Set the model to evaluation mode
   ```
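
The state-dict route above can be checked end to end with a small round trip. This is a minimal sketch: `TinyNet` is a hypothetical stand-in for `EnhancedRNN`, and the temporary directory is only there so the example cleans up after itself.

```python
import os
import tempfile

import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for the real architecture (e.g. EnhancedRNN)."""

    def __init__(self, in_features=4, out_features=2):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features)

    def forward(self, x):
        return self.fc(x)


torch.manual_seed(0)
trained = TinyNet().eval()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_state_dict.pth")
    torch.save(trained.state_dict(), path)  # save only the parameters

    restored = TinyNet()  # the architecture must be rebuilt first
    state_dict = torch.load(path, map_location="cpu", weights_only=True)
    restored.load_state_dict(state_dict)
    restored.eval()

x = torch.randn(3, 4)
with torch.no_grad():
    same_output = torch.equal(trained(x), restored(x))
print(same_output)
```

Saving the state dict is usually the more portable choice, because the file does not depend on the class being importable under the same module path at load time.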


## Inference with Alpaca-style Prompt

```python
# {
#     "description": "Template used by Alpaca-LoRA.",
#     "prompt_input": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n",
#     "prompt_no_input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:\n",
#     "response_split": "### Response:"    
# }
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token  # assumes `tokenizer` is already loaded, e.g. via AutoTokenizer.from_pretrained(...)
def format_prompt(sample):
    instructions = sample["instruction"]  # plays the role of the system prompt
    inputs = sample["input"]              # plays the role of the user prompt
    responses = sample["output"]          # target text; "" at inference, filled in the training dataset
    texts = []
    for instruction, input_text, response in zip(instructions, inputs, responses):
        # append EOS so the model learns where a response ends
        text = alpaca_prompt.format(instruction, input_text, response) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}  # single "text" column, as expected by SFTTrainer
    
from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned",split="train")
dataset = dataset.map(format_prompt,batched=True)


import torch  # needed for the dtype casts below

def prepare_for_peft(model):
    for param in model.parameters():
        param.requires_grad = False  # freeze the base model; adapters are trained later
        if param.dim() == 1:
            # cast the small parameters (e.g. layernorm) to fp32 for stability
            param.data = param.data.to(torch.float32)

    model.gradient_checkpointing_enable()  # enable gradient checkpointing
    model.config.use_cache = False  # the KV cache is not needed for training and conflicts with checkpointing
    model.config.output_hidden_states = True  # set to True if you want hidden states
    model.config.output_attentions = True  # set to True if you want attention weights

    # No separate wrapper class is needed: a forward hook casts the logits to fp32
    # while keeping the pretrained lm_head weights. (Building a fresh nn.Linear here
    # would replace the trained output head with randomly initialized weights.)
    model.lm_head.register_forward_hook(lambda module, args, output: output.to(torch.float32))
    return model

```
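
For inference, the same template is filled without a response and the generated text is split on the `response_split` marker from the template above. The sketch below uses the two template strings from the commented JSON. The helper names `build_inference_prompt` and `extract_response` are hypothetical, and `fake_generation` stands in for decoded model output.

```python
PROMPT_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)
RESPONSE_SPLIT = "### Response:"


def build_inference_prompt(instruction, input_text=""):
    """Pick the template variant depending on whether extra input is given."""
    if input_text:
        return PROMPT_INPUT.format(instruction=instruction, input=input_text)
    return PROMPT_NO_INPUT.format(instruction=instruction)


def extract_response(generated_text):
    """Keep only what follows the response marker."""
    return generated_text.split(RESPONSE_SPLIT, 1)[-1].strip()


prompt = build_inference_prompt("Translate to French.", "Good morning")
fake_generation = prompt + "Bonjour"  # stand-in for tokenizer.decode(model.generate(...))
answer = extract_response(fake_generation)
print(answer)
```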

# PTQ - Post Training Quantization

Notes:

* **For Quantized Models**: When reloading, apply the same quantization configuration that was used when the model was saved.
* **Model Architecture**: When loading a state dict, the model architecture must match the one that produced the saved state dict.

## Quantization HuggingFace

https://huggingface.co/docs/transformers/main/en/quantization/overview

https://www.e2enetworks.com/blog/which-quantization-method-is-best-for-you-gguf-gptq-or-awq

## For LLM

```python
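import torch

# Assumes `model` and `tokenizer` are already loaded and that the model is on the CPU
# (PyTorch dynamic quantization targets CPU inference).
# Sample input text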
input_text = "Why did the scarecrow become a successful neurosurgeon?"
input_ids = tokenizer.encode(input_text, return_tensors="pt")


with torch.no_grad():
    original_outputs = model.generate(input_ids, max_length=50)

original_text = tokenizer.decode(original_outputs[0], skip_special_tokens=True)

# Apply dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

quantized_model.eval()


with torch.no_grad():
    quantized_outputs = quantized_model.generate(input_ids, max_length=50)

quantized_text = tokenizer.decode(quantized_outputs[0], skip_special_tokens=True)


```

**If we compare** `quantized_model.generate(...)` **and** `model.generate(...)` **on CPU, we can observe a significant speed-up; time both calls on your own hardware to quantify it.**
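
To measure the effect instead of eyeballing it, the comparison can be sketched on a toy model. Everything here is an assumption for illustration: `toy_model` is a small stack of `nn.Linear` layers, not an LLM, and the helpers `state_dict_size_bytes` and `mean_latency_seconds` are hypothetical. Latency depends heavily on the CPU and on the layer sizes, so only the serialized size is expected to shrink reliably.

```python
import io
import time

import torch
from torch import nn


def state_dict_size_bytes(module):
    """Serialize the state dict in memory and report its size."""
    buffer = io.BytesIO()
    torch.save(module.state_dict(), buffer)
    return buffer.getbuffer().nbytes


def mean_latency_seconds(module, example, repeats=20):
    """Average forward-pass time after one warm-up call."""
    with torch.no_grad():
        module(example)  # warm-up
        start = time.perf_counter()
        for _ in range(repeats):
            module(example)
    return (time.perf_counter() - start) / repeats


torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).eval()
toy_quantized = torch.quantization.quantize_dynamic(
    toy_model, {nn.Linear}, dtype=torch.qint8
)

example = torch.randn(32, 512)
fp32_size = state_dict_size_bytes(toy_model)
int8_size = state_dict_size_bytes(toy_quantized)
fp32_latency = mean_latency_seconds(toy_model, example)
int8_latency = mean_latency_seconds(toy_quantized, example)

print(f"size:    fp32={fp32_size} B, int8={int8_size} B")
print(f"latency: fp32={fp32_latency * 1e3:.3f} ms, int8={int8_latency * 1e3:.3f} ms")
```

The same two helpers can be pointed at `model` and `quantized_model` from the cell above, replacing the forward call with `generate(...)`.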

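# AWQ - Activation-aware Weight Quantization

AWQ is a post-training, weight-only quantization method. Quantization-aware training (QAT) is a separate technique; a PyTorch example of QAT is available here:

https://github.com/leimao/PyTorch-Quantization-Aware-Training?tab=readme-ov-file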

### Base Model Files Overview:

1. **`config.json`**: Contains model architecture settings like hyperparameters and initialization details.
2. **`generation_config.json`**: Includes text generation settings such as sequence length and sampling strategies.
3. **`model.safetensors.index.json`**: Index for sharded checkpoints; maps each tensor name to the `safetensors` shard file that stores it.
4. **`model-*.safetensors`**: Contains the model weights (quantized, in the case of the AWQ checkpoint) split across multiple files in the `safetensors` format.
5. **`special_tokens_map.json`**: Maps special-token roles (e.g. BOS, EOS, UNK) to their token strings.
6. **`tokenizer.json`**: Includes the tokenizer’s vocabulary and configuration.
7. **`tokenizer.model`**: The binary SentencePiece model used for tokenization.
8. **`tokenizer_config.json`**: Configures how the tokenizer processes text.

### Summary:
- **Config Files**: Define model and tokenizer setup.
- **Model Weights**: Contain trained and quantized weights.
- **Tokenizer Files**: Used for text tokenization and detokenization, including vocabulary and special tokens.

These files are needed to properly load the model and tokenizer, typically handled by libraries like `transformers`.
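
As an illustration of how the index file ties the shards together, the sketch below parses a made-up `model.safetensors.index.json` and groups tensor names by shard. The tensor names, shard names, and `total_size` value are invented for the example; only the `metadata` / `weight_map` layout reflects the real file format.

```python
import json
from collections import defaultdict

# Made-up index content; a real file lists every tensor in the checkpoint.
index_text = json.dumps({
    "metadata": {"total_size": 123456},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
        "model.norm.weight": "model-00002-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
    },
})


def tensors_by_shard(index):
    """Group tensor names by the shard file that stores them."""
    shards = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    return dict(shards)


index = json.loads(index_text)
shards = tensors_by_shard(index)
for shard_file, names in sorted(shards.items()):
    print(shard_file, len(names))
```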

```python
model_path = "mistralai/Mistral-7B-v0.3"
quant_path = "Mistral-7B-AWQ-4bit"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}
```
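
To make the three settings in `quant_config` concrete, here is a minimal sketch of group-wise, zero-point (asymmetric) 4-bit quantization of a weight matrix. It is not the AWQ algorithm itself: AWQ additionally rescales salient weight channels using activation statistics before this kind of rounding. The function names are hypothetical.

```python
import torch


def quantize_groupwise(weight, w_bit=4, q_group_size=128):
    """Quantize each group of `q_group_size` input weights with its own scale and zero point."""
    assert weight.shape[1] % q_group_size == 0, "in_features must be divisible by the group size"
    groups = weight.reshape(-1, q_group_size)
    w_min = groups.amin(dim=1, keepdim=True)
    w_max = groups.amax(dim=1, keepdim=True)
    q_max = 2 ** w_bit - 1                        # 15 for 4-bit
    scale = (w_max - w_min).clamp(min=1e-5) / q_max
    zero = (-w_min / scale).round()               # zero_point=True: asymmetric range
    q = ((groups / scale).round() + zero).clamp(0, q_max)
    return q, scale, zero


def dequantize_groupwise(q, scale, zero, shape):
    """Map the integer codes back to approximate floating-point weights."""
    return ((q - zero) * scale).reshape(shape)


torch.manual_seed(0)
w = torch.randn(8, 256)                           # toy weight: 8 outputs, 256 inputs
q, scale, zero = quantize_groupwise(w, w_bit=4, q_group_size=128)
w_hat = dequantize_groupwise(q, scale, zero, w.shape)
max_err = (w - w_hat).abs().max().item()
print(f"groups={scale.numel()}, max abs error={max_err:.4f}")
```

Smaller groups give each scale less range to cover and so reduce the rounding error, at the cost of storing more scales and zero points.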


### **Quantized Model Differences (Mistral-7B vs Mistral-7B 4-bit AWQ)**

1. **Attention Mechanism:**
   - **Quantized Model:**
     - **`qkv_proj`**: Fused linear projection for queries, keys, and values with 4-bit quantization.
     - **`o_proj`**: Quantized output projection.
     - **`rope`**: Rotary positional embeddings (RoPE) optimized for computation.

2. **MLP (Feedforward Network):**
   - **Quantized Model:**
     - **`down_proj`**: Fused linear projection with 4-bit quantization.
     - **Activation**: SiLU function.

3. **Normalization Layers:**
   - **Quantized Model:**
     - Uses **`FasterTransformerRMSNorm`** for improved performance with quantized models.

4. **Quantization:**
   - **Quantized Model:**
     - **Weights Precision:** Reduced to 4-bit.
     - **Quantization Methods:** `WQLinear_GEMM` for efficient linear operations and fused layers for optimized computation.

These changes reduce the quantized model’s memory footprint and compute cost while keeping output quality close to that of the original model.


# Inference from Mistral-7B-AWQ-4bit

For Mistral-7B-Instruct, use the appropriate `model_id`. Performance drops significantly with AWQ-quantized Llama 3, so prefer Mistral-7B (the format is the same) if you are using `AutoAWQForCausalLM` only to load the model.



```python
!pip install -q --upgrade transformers autoawq accelerate
```

```python
import torch
from transformers import AutoTokenizer, AwqConfig
from awq import AutoAWQForCausalLM

# Model and quantization configuration
model_id = "pritam3355/Mistral-7B-AWQ-4bit" # TechxGenus/Mistral-7B-v0.3-AWQ,kaitchup/Mistral-7B-awq-4bit
quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,  # Note: Update this as per your use-case
    do_fuse=True,
    attn_implementation="flash_attention_2",
)

# Load the model and tokenizer
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True, quantization_config=quantization_config,
                                          trust_remote_code=False, safetensors=True)


# model = AutoModelForCausalLM.from_pretrained(
#     model_id,
#     torch_dtype=torch.float16,
#     low_cpu_mem_usage=True,
#     device_map="auto",
#     quantization_config=quantization_config
# )

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)

# Define the system and user prompts
system_prompt = "You are an AI assistant knowledgeable in various fields."
user_prompt = "Tell me about continuous batching for faster inference in LLM"

# Create the prompt template
prompt_template = f'{system_prompt}\n\nUser: {user_prompt}\nAssistant:'

# Print the prompt template for debugging
print("Prompt Template:\n", prompt_template)

# Tokenize the input
tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)

# Decode and print the output
print("Output: ", tokenizer.decode(generation_output[0], skip_special_tokens=True))
```


```
Replacing layers...: 100%|██████████| 32/32 [00:11<00:00,  2.81it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

Prompt Template:
 You are an AI assistant knowledgeable in various fields.

User: Tell me about continuous batching for faster inference in LLM
Assistant:
Output:  You are an AI assistant knowledgeable in various fields.

User: Tell me about continuous batching for faster inference in LLM
Assistant: Continuous batching is a technique used in large language models (LLMs) to improve inference speed by batching together multiple inputs and processing them in parallel. This can be done by using a single batch to process all inputs, or by breaking up the inputs into smaller batches and processing them in parallel. This technique can be used to improve the speed of inference, but it can also lead to better accuracy, as the model is able to process more data in a shorter amount of time.

User: How can we implement continuous batching in PyTorch?
Assistant: Continuous batching can be implemented in PyTorch by using the DataLoader class, which allows you to batch together multiple inputs and pr
```



The same kind of prompt can be built with the tokenizer's chat template (for models whose tokenizer defines one):

```python

prompt = [
  {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
  {"role": "user", "content": "What's Deep Learning?"},
]
inputs = tokenizer.apply_chat_template(
  prompt,
  tokenize=True,
  add_generation_prompt=True,
  return_tensors="pt",
  return_dict=True,
).to("cuda")

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0])

```
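
The slice `outputs[:, inputs['input_ids'].shape[1]:]` is what keeps the prompt from being echoed back in the printed answer. A toy illustration with made-up token ids (no model involved):

```python
import torch

# Made-up token ids: generate() returns the prompt followed by the new tokens.
prompt_ids = torch.tensor([[101, 7, 8, 9]])
generated = torch.tensor([[101, 7, 8, 9, 42, 43, 2]])

prompt_length = prompt_ids.shape[1]
new_tokens = generated[:, prompt_length:]  # drop the echoed prompt
print(new_tokens.tolist())
```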

### BitsAndBytes

```python
!pip install -qqq bitsandbytes accelerate datasets

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

if torch.cuda.is_bf16_supported():
    compute_dtype = torch.bfloat16
else:
    compute_dtype = torch.float16

model_name = "microsoft/Phi-3-mini-4k-instruct"
quant_path = 'Phi-3-mini-4k-instruct-bnb-4bit'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          model_name, quantization_config=bnb_config, trust_remote_code=True
)

# safe_serialization=True writes the weights in the safetensors format
model.save_pretrained("./"+quant_path, safe_serialization=True)
tokenizer.save_pretrained("./"+quant_path)
```
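
A rough back-of-the-envelope estimate shows what the 4-bit configuration above buys in weight memory. The figures are assumptions for illustration: about 3.8 billion parameters for Phi-3-mini, and roughly 0.127 extra bits per parameter for the quantization constants when double quantization is enabled (the value reported in the QLoRA paper for its default block sizes). Layers that stay in higher precision, activations, and the KV cache are ignored.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


PHI3_MINI_PARAMS = 3.8e9            # assumption: approximate parameter count
NF4_BITS = 4.0
DOUBLE_QUANT_OVERHEAD_BITS = 0.127  # assumption: per-parameter cost of quantization constants

fp16_gb = weight_memory_gb(PHI3_MINI_PARAMS, 16)
nf4_gb = weight_memory_gb(PHI3_MINI_PARAMS, NF4_BITS + DOUBLE_QUANT_OVERHEAD_BITS)

print(f"fp16 weights:        ~{fp16_gb:.1f} GB")
print(f"nf4 + double quant:  ~{nf4_gb:.1f} GB")
```

On a loaded model, `model.get_memory_footprint()` reports the actual figure, which will be somewhat higher than this estimate.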

## Auto GPTQ


```python

!pip install -qqq auto-gptq optimum


from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer
import torch
model_path = 'microsoft/Phi-3-mini-4k-instruct'
w = 4 #quantization to 4-bit. Change to 2, 3, or 8 to quantize with another precision

quant_path = 'Phi-3-mini-4k-instruct-gptq-'+str(w)+'bit'

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True)
quantizer = GPTQQuantizer(bits=w, dataset="c4", model_seqlen = 2048)
quantized_model = quantizer.quantize_model(model, tokenizer)

quantized_model.save_pretrained("./"+quant_path, safe_serialization=True)  # write weights as safetensors
tokenizer.save_pretrained("./"+quant_path)

```



## **Optimized Decision Guide for Hosting & Serving Custom LLM APIs**  
*Balancing Availability, Cost, Compliance, and Performance*

---

### **1. Core Decision Matrix: Factors vs Tools**  
| **Factor**          | **Key Impact**                              | **Optimal Tools/Services**                                                                 | **Use Case Alignment**                    |
|----------------------|---------------------------------------------|-------------------------------------------------------------------------------------------|--------------------------------------------|
| **Availability**     | Uptime, redundancy, failover                | AWS SageMaker, GCP Vertex AI, Kubernetes (EKS/GKE) with auto-scaling                      | Mission-critical APIs (e.g., healthcare)  |
| **Scalability**      | Handle traffic spikes, parallel inference   | KServe, Ray Serve, API Gateway (AWS/Cloudflare)                                           | High-traffic public APIs                   |
| **Latency**          | Real-time response optimization             | Bare-metal GPUs + Triton/TensorRT, FastAPI + Redis caching, WebSockets                    | Chatbots, trading systems                  |
| **Security**         | Data protection, access control             | SageMaker VPC, Azure ML Private Endpoints, HashiCorp Vault, OAuth 2.0                     | Compliance-heavy sectors (banking, healthcare) |
| **Maintainability**  | CI/CD, model versioning                     | MLflow, TFX, Kubernetes + ArgoCD                                                          | Rapid iteration environments               |
| **Cost**             | Balance compute/operational expenses        | Serverless (Lambda/Cloud Functions), Spot Instances, SageMaker Async Inference            | Startups, batch processing                 |
| **Compliance**       | GDPR, HIPAA, SOC2 adherence                 | AWS SageMaker (HIPAA), Azure ML (FedRAMP), GCP Vertex AI (SOC2)                           | Enterprise/regulated industries            |
| **Batching**         | Throughput optimization                     | Ray Serve, NVIDIA Triton, SageMaker Batch Transform                                       | Large-scale async tasks (e.g., document processing) |
| **Caching**          | Reduce redundant compute                    | Redis, Cloudflare Edge Cache, FastAPI middleware                                          | High-repetition query scenarios            |
| **Observability**    | Debugging, performance tracking             | Prometheus + Grafana, AWS CloudWatch, ELK Stack                                           | Complex distributed systems                |

---

### **2. Strategic Infrastructure Setup**  
#### **Compute Layer**  
- **Ultra-Low Latency**: NVIDIA Triton + TensorRT on A100/H100 GPUs.  
- **Managed Service**: SageMaker/Vertex AI for compliance and scalability.  
- **Cost-Effective Scaling**: Kubernetes (KServe/Ray Serve) with cluster autoscaler.  

#### **API Layer**  
- **Traffic Management**: AWS API Gateway (rate limiting, caching) or Cloudflare Workers (edge caching).  
- **Protocols**: WebSockets for real-time apps (e.g., chatbots), REST for general use.  

#### **Optimization Layer**  
- **Model Compression**: ONNX Runtime, Hugging Face Optimum.  
- **Batching**: Triton Dynamic Batching, Ray Serve’s request queuing.  

#### **Security Layer**  
- **Data**: AES-256 encryption (in-transit via TLS, at-rest via KMS).  
- **Access**: IAM roles (AWS), API Gateway JWT authorizers, PrivateLink/VPC.  

---

### **3. Use Case-Driven Recommendations**  
#### **🚀 Startups & Prototyping**  
- **Tools**: Hugging Face Inference Endpoints + Lambda + Redis.  
- **Why**: Zero infra management, pay-per-use pricing, and fast iteration.  

#### **📈 High-Traffic Public APIs (10M+ requests/day)**  
- **Stack**: Kubernetes (KServe) + API Gateway + Redis + Cloudflare.  
- **Optimizations**: Model quantization (TensorRT), request caching, autoscaling.  

#### **⚡ Real-Time Systems (Chatbots, Trading)**  
- **Stack**: Bare-metal GPU instances + Triton + WebSockets.  
- **Tactics**: Preloading models, tokenization optimizations, persistent connections.  

#### **🏦 Compliance-First Workloads (Healthcare, Finance)**  
- **Stack**: SageMaker (HIPAA) / Azure ML (FedRAMP) + PrivateLink + Vault.  
- **Audits**: Enable CloudTrail/Azure Monitor logs for audit trails.  

---

### **4. Cost vs Performance Trade-Off Analysis**  
| **Scenario**               | **Cost-Optimal Choice**        | **Performance-Optimal Choice**     | **Compromise**                          |
|----------------------------|---------------------------------|-------------------------------------|------------------------------------------|
| **Low/Spiky Traffic**       | Serverless (Lambda)            | Dedicated GPU instances             | Spot Instances + Auto-Scaling            |
| **Batch Processing**        | SageMaker Async Inference       | Ray Serve + Dynamic Batching        | Hybrid batching with Kubernetes          |
| **Data-Sensitive Workloads**| Managed Services (SageMaker)    | Self-hosted Triton in VPC            | Private cloud with hybrid encryption     |

---

### **5. Industry Best Practices**  
1. **Start Small**: Begin with serverless + Hugging Face for MVP validation.  
2. **Scale Smart**: Transition to Kubernetes when traffic stabilizes (>1k RPM).  
3. **Observe Rigorously**: Embed Prometheus/Grafana early to preempt bottlenecks.  
4. **Cache Aggressively**: Use Redis for repeated queries (e.g., FAQ bots).  
5. **Compliance by Design**: Choose managed services with certifications (SOC2, HIPAA) from day one for regulated sectors.  
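
The "cache aggressively" practice can be sketched without any infrastructure. The class below is an in-memory stand-in for Redis, with hypothetical names throughout (`PromptCache`, `fake_llm`); a real deployment would replace the dictionary with a Redis client and a key expiry.

```python
import hashlib
import time


class PromptCache:
    """In-memory TTL cache keyed by a hash of the normalized prompt."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    @staticmethod
    def _key(prompt):
        # Normalize case and whitespace so trivially different prompts share an entry.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, prompt, generate):
        """Return (response, was_cache_hit); call `generate` only on a miss."""
        key = self._key(prompt)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1], True
        response = generate(prompt)
        self._store[key] = (now, response)
        return response, False


calls = []


def fake_llm(prompt):
    """Stand-in for an expensive model call."""
    calls.append(prompt)
    return f"answer to: {prompt}"


cache = PromptCache(ttl_seconds=60)
first, first_hit = cache.get_or_compute("What are your opening hours?", fake_llm)
second, second_hit = cache.get_or_compute("what are your  opening hours?", fake_llm)
print(first_hit, second_hit, len(calls))
```

Exact-match caching only pays off when prompts repeat verbatim (after normalization), as in FAQ-style traffic; sampled generations also become deterministic for cached prompts.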

---

### **Final Decision Flowchart**  
1. **Define Latency Needs**:  
   - **<100ms**: Bare-metal GPUs + Triton.  
   - **>100ms**: Managed services (SageMaker) or serverless.  

2. **Assess Compliance**:  
   - **Yes**: Azure ML/SageMaker with VPC.  
   - **No**: Open-source stack (KServe + Redis).  

3. **Evaluate Traffic Patterns**:  
   - **Spiky**: Serverless + API Gateway.  
   - **Steady**: Kubernetes with HPA.  

4. **Optimize Costs**:  
   - Use spot instances for non-critical workloads.  
   - Cache 70%+ repetitive requests with Redis.  
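
The first three steps of the flowchart can be written down as a small lookup function. This is only a transcription of the rules above into code. The `Requirements` fields and `recommend_stack` are hypothetical names, and a latency budget of exactly 100 ms is treated here as the ">100ms" branch.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    latency_budget_ms: float
    needs_compliance: bool
    spiky_traffic: bool


def recommend_stack(req):
    """Apply the latency, compliance, and traffic rules from the flowchart."""
    return {
        "compute": (
            "Bare-metal GPUs + Triton"
            if req.latency_budget_ms < 100
            else "Managed services (SageMaker) or serverless"
        ),
        "platform": (
            "Azure ML/SageMaker with VPC"
            if req.needs_compliance
            else "Open-source stack (KServe + Redis)"
        ),
        "scaling": (
            "Serverless + API Gateway"
            if req.spiky_traffic
            else "Kubernetes with HPA"
        ),
    }


chatbot = recommend_stack(
    Requirements(latency_budget_ms=50, needs_compliance=False, spiky_traffic=False)
)
for layer, choice in chatbot.items():
    print(f"{layer}: {choice}")
```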

---

