
# The base model repository contains the following files

**config.json** - It includes hyperparameters, model type, and other settings that define how the model is structured and how it should be initialized.

**generation_config.json** - This file includes configuration settings specific to the text generation process. This might include parameters like maximum sequence length, sampling strategies, and other settings relevant to generating text.

**model.safetensors.index.json** - This file belongs to the safetensors format. It holds metadata (such as the total size) and a `weight_map` that records which shard file stores each parameter, so large model weights can be located across multiple files.


**model-00001-of-00003.safetensors, model-00002-of-00003.safetensors, model-00003-of-00003.safetensors** - These files contain the actual model weights (quantized weights, if the checkpoint was saved after quantization). Since the model is large, the weights are split (sharded) across multiple files. The .safetensors format stores tensors safely and efficiently: unlike pickle-based checkpoints, loading it does not execute arbitrary code.

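**special_tokens_map.json** - This file maps special-token roles (such as `bos_token`, `eos_token`, `unk_token`) to the token strings the tokenizer uses for them. Special tokens include things like [CLS] and [SEP] in BERT-style models or `<s>`, `</s>` and `<unk>` in Llama/Mistral-style models, or any other token that has a specific role in the tokenization process.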

**tokenizer.json** - This file contains the tokenizer's vocabulary and configuration in a JSON format. It includes mappings from token strings to token IDs and other tokenizer-specific settings.

**tokenizer.model** - This file is the actual binary model used by the tokenizer (for Llama/Mistral-style models, a SentencePiece model). It typically contains the underlying data structures required for tokenization and detokenization.

**tokenizer_config.json** - This file includes configuration settings for the tokenizer itself, such as the tokenizer class, the special tokens, and the maximum sequence length. Low-level details such as pre-tokenization and normalization live in tokenizer.json.

### Summary:

**Configuration Files (config.json, generation_config.json, tokenizer_config.json)**: These files define the setup and parameters for both the model and tokenizer.

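**Model Weights Files (model-*.safetensors, model.safetensors.index.json)**: These files contain the trained (and, where applicable, quantized) weights of the model, plus the index that maps each parameter to its shard.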

**Tokenizer Files (tokenizer.json, tokenizer.model, special_tokens_map.json)**: These files are used for tokenizing and detokenizing text, including the tokenizer's vocabulary and special tokens.


When loading the model and tokenizer, all of these files are typically needed for the model to function correctly and the tokenizer to behave as expected. A library such as `transformers` loads them automatically when you call `from_pretrained`.
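To make the role of `model.safetensors.index.json` concrete, here is a minimal sketch that groups parameter names by the shard file that stores them. The index content below is a hypothetical, heavily shortened example (a real index lists every parameter), and `group_weights_by_shard` is an illustrative helper, not part of any library.

```python
import json
from collections import defaultdict


def group_weights_by_shard(index: dict) -> dict:
    """Map each shard file name to the sorted parameter names stored in it."""
    shards = defaultdict(list)
    for param_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(param_name)
    return {shard: sorted(names) for shard, names in sorted(shards.items())}


# Hypothetical, shortened index content for illustration only.
example_index = json.loads("""
{
  "metadata": {"total_size": 1024},
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00003.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
    "model.layers.31.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
    "lm_head.weight": "model-00003-of-00003.safetensors"
  }
}
""")

shards = group_weights_by_shard(example_index)
for shard_file, names in shards.items():
    print(f"{shard_file}: {len(names)} tensors")
```

With a real checkpoint you would replace `example_index` with `json.load(open("model.safetensors.index.json"))`; `from_pretrained` performs this lookup for you.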

## Inference with Alpaca-style Prompt

```python
# {
#     "description": "Template used by Alpaca-LoRA.",
#     "prompt_input": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n",
#     "prompt_no_input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:\n",
#     "response_split": "### Response:"    
# }
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token  # assumes `tokenizer` was loaded earlier in the notebook


def format_prompt(sample):
    instructions = sample["instruction"]  # here system_prompt
    inputs = sample["input"]              # here user_prompt
    responses = sample["output"]          # "" at inference time, filled in the training dataset
    texts = []
    for instruction, input_text, response in zip(instructions, inputs, responses):
        # append EOS so the model learns where a response ends
        text = alpaca_prompt.format(instruction, input_text, response) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}  # put the data in one column for SFTTrainer
    
from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned",split="train")
dataset = dataset.map(format_prompt,batched=True)


import torch
import torch.nn as nn


class CastOutputToFloat(nn.Sequential):
    """Wrap the existing lm_head and return its logits in fp32."""
    def forward(self, x):
        return super().forward(x).to(torch.float32)


def prepare_for_peft(model):
    for param in model.parameters():
        param.requires_grad = False  # freeze the model - train adapters later
        if param.dim() == 1:
            # cast the small parameters (e.g. layernorm) to fp32 for stability
            param.data = param.data.to(torch.float32)

    model.gradient_checkpointing_enable()  # enable gradient checkpointing
    model.enable_input_require_grads()     # let gradients flow to adapters under checkpointing
    model.config.use_cache = False  # the KV cache is not used with gradient checkpointing
    model.config.output_hidden_states = False  # set to True if you want hidden states
    model.config.output_attentions = False  # set to True if you want attention weights

    # Wrap the pretrained head; replacing it with a fresh nn.Linear would
    # discard the trained output weights.
    model.lm_head = CastOutputToFloat(model.lm_head)
    return model

```
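The cell above formats training samples. For inference, the same template can be filled with an empty response so that the model continues after `### Response:`. The following is a minimal sketch under that assumption; `build_inference_prompt` and `extract_response` are hypothetical helper names, and the split marker comes from the `response_split` field of the template comment.

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{}\n\n"
    "### Input:\n{}\n\n"
    "### Response:\n{}"
)
RESPONSE_SPLIT = "### Response:"


def build_inference_prompt(instruction: str, input_text: str) -> str:
    """Fill the template with an empty response and no EOS token."""
    return ALPACA_TEMPLATE.format(instruction, input_text, "")


def extract_response(generated_text: str) -> str:
    """Return only the text after the response marker (or everything if absent)."""
    _, marker, tail = generated_text.partition(RESPONSE_SPLIT)
    return tail.strip() if marker else generated_text.strip()


prompt = build_inference_prompt("Translate to French.", "Good morning")
# Stand-in for decoded model output: the prompt followed by the completion.
fake_generation = prompt + "Bonjour"
print(extract_response(fake_generation))
```

In the notebook, `fake_generation` would be the decoded output of `model.generate` for the tokenized `prompt`.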

# PTQ - Post Training Quantization

Notes:

* **For Quantized Models**: Apply the same quantization configuration when reloading the model that was used when saving it.
* **Model Architecture**: When loading the state dict, make sure the model architecture matches the saved state dict.
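
A quick way to check the second note before loading is to compare parameter names. The sketch below works on plain lists of keys; `compare_state_dict_keys` is a hypothetical helper, and the key names are illustrative (AWQ-style quantized linear layers typically store `qweight`, `qzeros` and `scales` instead of `weight`).

```python
def compare_state_dict_keys(model_keys, saved_keys):
    """Report keys the model expects but the checkpoint lacks, and vice versa."""
    model_keys, saved_keys = set(model_keys), set(saved_keys)
    return {
        "missing": sorted(model_keys - saved_keys),
        "unexpected": sorted(saved_keys - model_keys),
    }


# Illustrative key names: an unquantized architecture vs. a quantized checkpoint.
fp16_model_keys = ["layers.0.q_proj.weight", "norm.weight"]
quantized_checkpoint_keys = [
    "layers.0.q_proj.qweight",
    "layers.0.q_proj.qzeros",
    "layers.0.q_proj.scales",
    "norm.weight",
]

report = compare_state_dict_keys(fp16_model_keys, quantized_checkpoint_keys)
print(report)
```

With real PyTorch objects, the inputs would be `model.state_dict().keys()` and the keys of the loaded checkpoint. A non-empty report means the architecture or the quantization configuration does not match.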

## Quantization HuggingFace

https://huggingface.co/docs/transformers/main/en/quantization/overview

https://www.e2enetworks.com/blog/which-quantization-method-is-best-for-you-gguf-gptq-or-awq

## Dynamic quantization for an LLM (PyTorch)

```python
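import torch

# Assumes `model` and `tokenizer` are already loaded, with the model on CPU:
# PyTorch dynamic quantization only supports CPU inference.
model.eval()

# Sample input text
input_text = "Why did the scarecrow become a successful neurosurgeon?"
input_ids = tokenizer.encode(input_text, return_tensors="pt")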


with torch.no_grad():
    original_outputs = model.generate(input_ids, max_length=50)

original_text = tokenizer.decode(original_outputs[0], skip_special_tokens=True)

# Apply dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

quantized_model.eval()


with torch.no_grad():
    quantized_outputs = quantized_model.generate(input_ids, max_length=50)

quantized_text = tokenizer.decode(quantized_outputs[0], skip_special_tokens=True)


```

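**If we time** `quantized_model.generate(...)` **against** `model.generate(...)` **on CPU, we can observe a significant speed-up.** The two outputs (`original_text` and `quantized_text`) may differ slightly, because the `Linear` weights are now stored as int8.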
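
One way to measure that speed-up is to time both calls several times and compare the averages. This is a minimal sketch: `time_call` and `speedup` are hypothetical helpers, and the two `sleep`-based functions are stand-ins for the real `generate` calls so the example runs without a model.

```python
import time
from statistics import mean


def time_call(fn, repeats=3):
    """Return the mean wall-clock duration of fn() over `repeats` runs, in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return mean(durations)


def speedup(baseline_fn, candidate_fn, repeats=3):
    """How many times faster candidate_fn is than baseline_fn (values > 1 mean faster)."""
    return time_call(baseline_fn, repeats) / time_call(candidate_fn, repeats)


# Stand-ins for model.generate(...) and quantized_model.generate(...).
def slow_generate():
    time.sleep(0.05)


def fast_generate():
    time.sleep(0.005)


ratio = speedup(slow_generate, fast_generate)
print(f"speed-up: {ratio:.1f}x")
```

In the notebook, the call would look like `speedup(lambda: model.generate(input_ids, max_length=50), lambda: quantized_model.generate(input_ids, max_length=50))`, ideally after one warm-up run of each.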

# Quantizing Mistral-7B to 4-bit AWQ

```python

!pip install -q --upgrade transformers autoawq accelerate

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.3"
quant_path = "Mistral-7B-AWQ-4bit"
# 4-bit weights, with one scale and zero point per group of 128 weights
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model and its tokenizer
model.save_quantized("./" + quant_path, safetensors=True)
tokenizer.save_pretrained("./" + quant_path)

# Upload the quantized model to the Hugging Face Hub
from huggingface_hub import HfApi

username = "pritam3355"
MODEL_NAME = quant_path

# hf_tokens: a Hugging Face token with write access, assumed to be defined earlier
api = HfApi(token=hf_tokens)

api.create_repo(repo_id=f"{username}/{MODEL_NAME}", repo_type="model")

api.upload_folder(repo_id=f"{username}/{MODEL_NAME}",
                  folder_path=f"/kaggle/working/{MODEL_NAME}")

```
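
To illustrate what `w_bit`, `q_group_size` and `zero_point` in `quant_config` control, here is a simplified pure-Python sketch of group-wise asymmetric quantization. It is not AutoAWQ's implementation: the activation-aware scaling that gives AWQ its name is left out, and the function names are hypothetical.

```python
def quantize_group(weights, w_bit=4):
    """Quantize one group of floats to integers in [0, 2**w_bit - 1] with a zero point."""
    q_max = 2 ** w_bit - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max or 1.0  # fall back to 1.0 for a constant group
    zero_point = min(q_max, max(0, round(-w_min / scale)))
    q = [min(q_max, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_group(q, scale, zero_point):
    """Recover approximate float weights from the quantized integers."""
    return [(v - zero_point) * scale for v in q]


def quantize_row(row, q_group_size=128, w_bit=4):
    """Split a weight row into groups; each group gets its own scale and zero point."""
    return [
        quantize_group(row[start:start + q_group_size], w_bit)
        for start in range(0, len(row), q_group_size)
    ]


# Tiny example: 8 weights, groups of 4 (the notebook uses groups of 128).
row = [-1.0, -0.5, 0.0, 0.5, -0.2, 0.1, 0.2, 0.3]
groups = quantize_row(row, q_group_size=4, w_bit=4)
for q, scale, zero_point in groups:
    print(q, round(scale, 4), zero_point)
```

Smaller groups track the local weight range more closely at the cost of storing more scales and zero points, which is the trade-off behind the common choice of 128.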

## Inference from Mistral-7B-AWQ-4bit - AutoAWQForCausalLM

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from transformers import pipeline

model_id=f"{username}/{MODEL_NAME}"

model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True,
                                          trust_remote_code=False, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)



print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

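# Prompt in the same "system / User / Assistant" layout used in the next section
prompt_template = (
    "You are an AI assistant knowledgeable in various fields.\n\n"
    "User: Tell me about continuous batching for faster inference in LLM\nAssistant:"
)
print(pipe(prompt_template)[0]['generated_text'])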
```


## Inference from Mistral-7B-AWQ-4bit - AutoModelForCausalLM

```python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

# Model and quantization configuration
model_id = f"{username}/{MODEL_NAME}" # TechxGenus/Mistral-7B-v0.3-AWQ,kaitchup/Mistral-7B-awq-4bit
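# attn_implementation is a from_pretrained argument, not an AwqConfig field, and
# fused modules cannot be combined with FlashAttention-2, so it is not set here
quantization_config = AwqConfig(bits=4, fuse_max_seq_len=512, do_fuse=True)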

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             torch_dtype=torch.float16,
                                             low_cpu_mem_usage=True,
                                             device_map="auto",
                                             quantization_config=quantization_config)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)

# Define the system and user prompts
system_prompt = "You are an AI assistant knowledgeable in various fields."
user_prompt = "Tell me about continuous batching for faster inference in LLM"

# Create the prompt template
prompt_template = f'{system_prompt}\n\nUser: {user_prompt}\nAssistant:'

# Tokenize the input
tokens = tokenizer(prompt_template, return_tensors='pt').input_ids.to(model.device)

# Generate output
generation_output = model.generate(tokens,do_sample=True,temperature=0.7,
                                   top_p=0.95,top_k=40,max_new_tokens=512)

# Decode and print the output
print("Output: ", tokenizer.decode(generation_output[0], skip_special_tokens=True))


```


**Performance drops significantly with AWQ on Llama 3; use Mistral-7B instead (the format is the same), but if you're using AutoAWQForCausalLM just for loading the model**



## **Optimized Decision Guide for Hosting & Serving Custom LLM APIs**  
*Balancing Availability, Cost, Compliance, and Performance*

---

### **1. Core Decision Matrix: Factors vs Tools**  
| **Factor**          | **Key Impact**                              | **Optimal Tools/Services**                                                                 | **Use Case Alignment**                    |
|----------------------|---------------------------------------------|-------------------------------------------------------------------------------------------|--------------------------------------------|
| **Availability**     | Uptime, redundancy, failover                | AWS SageMaker, GCP Vertex AI, Kubernetes (EKS/GKE) with auto-scaling                      | Mission-critical APIs (e.g., healthcare)  |
| **Scalability**      | Handle traffic spikes, parallel inference   | KServe, Ray Serve, API Gateway (AWS/Cloudflare)                                           | High-traffic public APIs                   |
| **Latency**          | Real-time response optimization             | Bare-metal GPUs + Triton/TensorRT, FastAPI + Redis caching, WebSockets                    | Chatbots, trading systems                  |
| **Security**         | Data protection, access control             | SageMaker VPC, Azure ML Private Endpoints, HashiCorp Vault, OAuth 2.0                     | Compliance-heavy sectors (banking, healthcare) |
| **Maintainability**  | CI/CD, model versioning                     | MLflow, TFX, Kubernetes + ArgoCD                                                          | Rapid iteration environments               |
| **Cost**             | Balance compute/operational expenses        | Serverless (Lambda/Cloud Functions), Spot Instances, SageMaker Async Inference            | Startups, batch processing                 |
| **Compliance**       | GDPR, HIPAA, SOC2 adherence                 | AWS SageMaker (HIPAA), Azure ML (FedRAMP), GCP Vertex AI (SOC2)                           | Enterprise/regulated industries            |
| **Batching**         | Throughput optimization                     | Ray Serve, NVIDIA Triton, SageMaker Batch Transform                                       | Large-scale async tasks (e.g., document processing) |
| **Caching**          | Reduce redundant compute                    | Redis, Cloudflare Edge Cache, FastAPI middleware                                          | High-repetition query scenarios            |
| **Observability**    | Debugging, performance tracking             | Prometheus + Grafana, AWS CloudWatch, ELK Stack                                           | Complex distributed systems                |

---

### **2. Strategic Infrastructure Setup**  
#### **Compute Layer**  
- **Ultra-Low Latency**: NVIDIA Triton + TensorRT on A100/H100 GPUs.  
- **Managed Service**: SageMaker/Vertex AI for compliance and scalability.  
- **Cost-Effective Scaling**: Kubernetes (KServe/Ray Serve) with cluster autoscaler.  

#### **API Layer**  
- **Traffic Management**: AWS API Gateway (rate limiting, caching) or Cloudflare Workers (edge caching).  
- **Protocols**: WebSockets for real-time apps (e.g., chatbots), REST for general use.  

#### **Optimization Layer**  
- **Model Compression**: ONNX Runtime, Hugging Face Optimum.  
- **Batching**: Triton Dynamic Batching, Ray Serve’s request queuing.  
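
The idea behind dynamic batching can be sketched in a few lines: requests that arrive close together are grouped and sent through the model as one batch. This is a simplified, offline illustration and not Triton's or Ray Serve's actual scheduler; `form_batches`, `max_batch_size` and `max_wait` are hypothetical names.

```python
def form_batches(arrivals, max_batch_size=4, max_wait=0.010):
    """Group (arrival_time_seconds, request) pairs into batches.

    A batch is closed when it is full, or when the next request arrives
    more than `max_wait` seconds after the batch was opened.
    """
    batches, current, opened_at = [], [], None
    for arrival_time, request in sorted(arrivals):
        if current and (len(current) >= max_batch_size
                        or arrival_time - opened_at > max_wait):
            batches.append(current)
            current = []
        if not current:
            opened_at = arrival_time
        current.append(request)
    if current:
        batches.append(current)
    return batches


arrivals = [(0.000, "a"), (0.002, "b"), (0.004, "c"), (0.030, "d"), (0.031, "e")]
batches = form_batches(arrivals, max_batch_size=2, max_wait=0.010)
print(batches)
```

A larger `max_wait` raises throughput by filling batches, but adds up to that much latency to the first request in each batch.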

#### **Security Layer**  
- **Data**: AES-256 encryption (in-transit via TLS, at-rest via KMS).  
- **Access**: IAM roles (AWS), API Gateway JWT authorizers, PrivateLink/VPC.  

---

### **3. Use Case-Driven Recommendations**  
#### **🚀 Startups & Prototyping**  
- **Tools**: Hugging Face Inference Endpoints + Lambda + Redis.  
- **Why**: Zero infra management, pay-per-use pricing, and fast iteration.  

#### **📈 High-Traffic Public APIs (10M+ requests/day)**  
- **Stack**: Kubernetes (KServe) + API Gateway + Redis + Cloudflare.  
- **Optimizations**: Model quantization (TensorRT), request caching, autoscaling.  

#### **⚡ Real-Time Systems (Chatbots, Trading)**  
- **Stack**: Bare-metal GPU instances + Triton + WebSockets.  
- **Tactics**: Preloading models, tokenization optimizations, persistent connections.  

#### **🏦 Compliance-First Workloads (Healthcare, Finance)**  
- **Stack**: SageMaker (HIPAA) / Azure ML (FedRAMP) + PrivateLink + Vault.  
- **Audits**: Enable CloudTrail/Azure Monitor logs for audit trails.  

---

### **4. Cost vs Performance Trade-Off Analysis**  
| **Scenario**               | **Cost-Optimal Choice**        | **Performance-Optimal Choice**     | **Compromise**                          |
|----------------------------|---------------------------------|-------------------------------------|------------------------------------------|
| **Low/Spiky Traffic**       | Serverless (Lambda)            | Dedicated GPU instances             | Spot Instances + Auto-Scaling            |
| **Batch Processing**        | SageMaker Async Inference       | Ray Serve + Dynamic Batching        | Hybrid batching with Kubernetes          |
| **Data-Sensitive Workloads**| Managed Services (SageMaker)    | Self-hosted Triton in VPC            | Private cloud with hybrid encryption     |

---

### **5. Industry Best Practices**  
1. **Start Small**: Begin with serverless + Hugging Face for MVP validation.  
2. **Scale Smart**: Transition to Kubernetes when traffic stabilizes (>1k RPM).  
3. **Observe Rigorously**: Embed Prometheus/Grafana early to preempt bottlenecks.  
4. **Cache Aggressively**: Use Redis for repeated queries (e.g., FAQ bots).  
5. **Compliance by Design**: Choose managed services with certifications (SOC2, HIPAA) from day one for regulated sectors.  
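
The caching practice above can be sketched with an in-memory dictionary standing in for Redis. Everything here is illustrative: `PromptCache`, its normalization rule (lowercasing and collapsing whitespace) and the TTL value are assumptions, and `fake_generate` replaces the real model call.

```python
import hashlib
import time


class PromptCache:
    """In-memory stand-in for a Redis cache keyed by a hash of the normalized prompt."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._store = {}
        self._ttl = ttl_seconds
        self._clock = clock
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt):
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, prompt, generate):
        """Return a cached answer if it is still fresh, otherwise call generate(prompt)."""
        key = self._key(prompt)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = generate(prompt)
        self._store[key] = (now, value)
        return value


calls = []


def fake_generate(prompt):
    calls.append(prompt)  # record how often the "model" is actually invoked
    return f"answer to: {prompt}"


cache = PromptCache(ttl_seconds=60.0)
first = cache.get_or_compute("What are your opening hours?", fake_generate)
second = cache.get_or_compute("  what are your OPENING hours? ", fake_generate)
print(cache.hits, cache.misses, len(calls))
```

Exact-match caching like this only pays off for deterministic or FAQ-style traffic; sampled generations (`do_sample=True`) would return the same cached text on every hit.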

---

### **Final Decision Flowchart**  
1. **Define Latency Needs**:  
   - **<100ms**: Bare-metal GPUs + Triton.  
   - **>100ms**: Managed services (SageMaker) or serverless.  

2. **Assess Compliance**:  
   - **Yes**: Azure ML/SageMaker with VPC.  
   - **No**: Open-source stack (KServe + Redis).  

3. **Evaluate Traffic Patterns**:  
   - **Spiky**: Serverless + API Gateway.  
   - **Steady**: Kubernetes with HPA.  

4. **Optimize Costs**:  
   - Use spot instances for non-critical workloads.  
   - Cache 70%+ repetitive requests with Redis.  
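
The first three steps of the flowchart can be written as a small function, which makes the branching explicit. This is only a sketch of the guide's rules: `recommend_stack` is a hypothetical helper, and treating a budget of exactly 100 ms as the ">100ms" branch is an assumption, since the flowchart does not cover that case.

```python
def recommend_stack(latency_budget_ms, needs_compliance, traffic):
    """Apply steps 1-3 of the flowchart and return the suggested components."""
    if traffic not in ("spiky", "steady"):
        raise ValueError("traffic must be 'spiky' or 'steady'")

    # Step 1: latency needs (exactly 100 ms falls into the second branch by assumption)
    if latency_budget_ms < 100:
        compute = "Bare-metal GPUs + Triton"
    else:
        compute = "Managed services (SageMaker) or serverless"

    # Step 2: compliance
    if needs_compliance:
        platform = "Azure ML/SageMaker with VPC"
    else:
        platform = "Open-source stack (KServe + Redis)"

    # Step 3: traffic pattern
    scaling = "Serverless + API Gateway" if traffic == "spiky" else "Kubernetes with HPA"

    return {"compute": compute, "platform": platform, "scaling": scaling}


choice = recommend_stack(latency_budget_ms=80, needs_compliance=True, traffic="steady")
for layer, suggestion in choice.items():
    print(f"{layer}: {suggestion}")
```

The steps can pull in different directions (for example, a sub-100 ms budget combined with a managed compliance platform), so the output is a starting point for discussion rather than a final architecture.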

---
