# [Harnessing Power at the Edge: An Introduction to Local Large Language Models](https://pyimagesearch.com/2024/05/13/harnessing-power-at-the-edge-an-introduction-to-local-large-language-models/)

## About LLMs

- Large Language Models (LLMs): AI systems trained on extensive text datasets.
  - Transformer architecture (self-attention mechanisms; **context**)

- [Attention Is All You Need (Google)](https://arxiv.org/abs/1706.03762) - Introduced the Transformer architecture, which uses self-attention to model context and semantics
- GPT-1 was notable for its decoder-only architecture and its pioneering approach to generative pre-training
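
To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification: the token vectors are made up, and the learned query/key/value projections of a real Transformer are omitted, so each token is used directly as its own query, key, and value.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical 2-dimensional token embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, attn = self_attention(tokens, tokens, tokens)
```

Each row of `attn` says how much one token attends to every other token, which is how the model mixes in **context**.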

## Training LLMs

Essential steps with LLMs:
- Pre-training: the model is trained on a large corpus, learning language patterns and acquiring general knowledge
- Fine-tuning: the model is re-trained on a smaller, task-specific corpus; selected weights are updated while essential layers stay frozen
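
A toy illustration of the frozen-layer idea, assuming a two-weight "model" (`base_w` standing in for pre-trained layers, `head_w` for the task-specific part). All values and names are hypothetical; the point is only that fine-tuning updates some weights and leaves others untouched.

```python
def fine_tune(base_w, head_w, data, lr=0.05, epochs=200):
    """Gradient descent on the head only; base_w is frozen (never updated)."""
    for _ in range(epochs):
        for x, y in data:
            feature = base_w * x                    # frozen layer: forward pass only
            pred = head_w * feature
            grad_head = 2 * (pred - y) * feature    # derivative of (pred - y)^2 w.r.t. head_w
            head_w -= lr * grad_head
    return base_w, head_w

base_w = 2.0    # pretend this value came from pre-training
head_w = 0.1    # freshly initialised task head
task_data = [(1.0, 6.0), (2.0, 12.0), (0.5, 3.0)]   # small task corpus: y = 6x

new_base, new_head = fine_tune(base_w, head_w, task_data)
```

In frameworks such as PyTorch, the same effect is usually achieved by setting `requires_grad = False` on the parameters that should stay frozen.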

## Applications

Content generation, customer service, programming and code generation, translation and localization, education and tutoring, and more.

## Ethical Considerations

Problems: data privacy (especially with cloud-based services), model bias, and the generation of misleading information.

## Local LLMs

Cloud-based services for LLMs -> LLMs on local infrastructures

This shift is driven by factors such as the need for lower latency and the desire for greater control over the models.

## Advantages of Local LLMs

Data privacy and security, low latency, cost, always-on availability.

## Considerations

- Significant hardware investment is necessary (the workloads are computation-heavy)
- Organizations must manage and update the models themselves, handle data security, and ensure the infrastructure’s integrity.
- Scaling local hardware can be challenging and expensive compared to scalable cloud solutions.
- Numerous frameworks facilitate access to and use of these models

Organizations can choose the best strategy for their needs (using cloud-based services or locally).

## Common Model Formats for LLMs

### PyTorch Models

- `.pt`, `.pth`
- `fp16`, `fp32` denote the precision of the model's floating-point values (there is a trade-off: lower precision reduces memory use and speeds up computation, at some cost in numerical accuracy)
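
A rough back-of-the-envelope sketch of that trade-off. The 7-billion-parameter count is a hypothetical example, and the estimate covers weights only (no activations or runtime overhead).

```python
import struct

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

def weight_memory_gb(num_params, precision):
    """Memory needed just to store the weights, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

def round_trip_fp16(x):
    """Store a Python float as IEEE half precision and read it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

params = 7_000_000_000                        # hypothetical 7B-parameter model
fp32_gb = weight_memory_gb(params, "fp32")    # 28.0
fp16_gb = weight_memory_gb(params, "fp16")    # 14.0
approx = round_trip_fp16(0.1)                 # close to 0.1, but not identical
```

Halving the precision halves the storage, while `approx` shows the small rounding error that each `fp16` value carries.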

### SafeTensors

- Created to bolster the security and integrity of model data
- SafeTensors has demonstrated faster loading than PyTorch's native format in both CPU and GPU environments
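
The security benefit comes from the file layout: an 8-byte little-endian header length, a JSON header describing each tensor, then raw bytes, with no executable pickle payload. Below is a simplified pure-Python sketch of that layout, assuming 1-D `F32` tensors only. The function names are hypothetical, and real code should use the `safetensors` library instead.

```python
import json
import struct

def save_tensors(tensors):
    """Serialize {name: [floats]} into a safetensors-like byte layout (1-D F32 only)."""
    header, buffer = {}, b""
    for name, values in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)
        header[name] = {
            "dtype": "F32",
            "shape": [len(values)],
            # Offsets are relative to the start of the data buffer.
            "data_offsets": [len(buffer), len(buffer) + len(raw)],
        }
        buffer += raw
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + buffer

def load_tensors(blob):
    """Parse the layout back; only JSON and raw floats are read, no code is executed."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    data = blob[8 + header_len:]
    tensors = {}
    for name, info in header.items():
        if name == "__metadata__":      # optional free-form metadata entry
            continue
        begin, end = info["data_offsets"]
        count = (end - begin) // 4      # 4 bytes per F32 value
        tensors[name] = list(struct.unpack(f"<{count}f", data[begin:end]))
    return tensors

blob = save_tensors({"layer.weight": [1.0, 2.5, -3.0], "layer.bias": [0.5]})
loaded = load_tensors(blob)
```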

### GGML and GGUF

- GGML (a tensor library named after its author, Georgi Gerganov; sometimes expanded as "GPT-Generated Model Language") enables LLMs to run efficiently on consumer-grade CPUs -> quantization techniques
- GGUF (GPT-Generated Unified Format) was designed to be more flexible and robust, supporting a broader array of models
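
As a small illustration of the GGUF container, the sketch below reads the fixed-size start of a file: the `GGUF` magic bytes, a format version, a tensor count, and a metadata key-value count. The field layout follows the GGUF specification for versions 2 and 3 as understood here, so treat it as an assumption and check the spec before relying on it. The sample bytes are synthetic, not taken from a real model.

```python
import struct

def read_gguf_header(blob):
    """Parse the first 24 bytes of a GGUF file (little-endian, v2/v3 layout assumed)."""
    if blob[:4] != b"GGUF":
        raise ValueError("not a GGUF file: bad magic bytes")
    # uint32 version, uint64 tensor count, uint64 metadata key-value count
    version, tensor_count, kv_count = struct.unpack("<IQQ", blob[4:24])
    return {"version": version, "tensor_count": tensor_count, "metadata_kv_count": kv_count}

# Synthetic header for demonstration only.
fake_file = b"GGUF" + struct.pack("<IQQ", 3, 291, 24)
header = read_gguf_header(fake_file)
```

The metadata key-value section that follows this header is what lets one file format describe many different model architectures.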

## Post-training Quantization

### GPTQ (Post-Training Quantization for Generative Pre-trained Transformers)

- Compresses model weights to 4-bit precision; during inference, weights are dequantized on the fly to float16
- There are two directions for reducing the computational demands of GPTs:
  - develop smaller models
  - reduce size of existing models

- AutoGPTQ Library
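
The store-in-4-bit, dequantize-at-inference idea can be sketched with plain round-to-nearest quantization. This is not GPTQ itself, which additionally compensates for quantization error layer by layer; it only shows what mapping weights to 4-bit integers and back looks like. The weight values are made up.

```python
def quantize_4bit(weights):
    """Symmetric round-to-nearest quantization of one weight group to integers in [-8, 7]."""
    scale = (max(abs(w) for w in weights) / 7) or 1.0   # fall back to 1.0 for all-zero groups
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floating-point weights at inference time."""
    return [qi * scale for qi in q]

weights = [0.70, -0.32, 0.11, -0.04]     # hypothetical weight group
q, scale = quantize_4bit(weights)        # q is [7, -3, 1, 0]
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Only the small integers in `q` and one `scale` per group need to be stored; the rounding error per weight is at most half of `scale`.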

### AWQ (Activation-aware Weight Quantization)

- This method dynamically adjusts the quantization levels of weights to balance performance with model accuracy
- Distinguishes itself from other methods by weighing the importance of individual weights within a model, as indicated by activation statistics (premise: not all weights contribute equally to a model’s performance)

- AutoAWQ
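
A toy sketch of the activation-aware premise: a weight that multiplies a large activation is scaled up before quantization (and the result scaled back down), so its rounding error shrinks. The numbers and the hand-picked scale of 2 are hypothetical; real AWQ derives its scales from calibration data, which this sketch does not attempt.

```python
def quantize_dequantize(weights, bits=4):
    """Round-to-nearest symmetric quantization of one weight group, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                              # 7 for 4-bit
    step = (max(abs(w) for w in weights) / qmax) or 1.0
    return [round(w / step) * step for w in weights]

def output_error(weights, activations, scales):
    """Error in the dot product when weights are scaled, quantized, then unscaled."""
    scaled = [w * s for w, s in zip(weights, scales)]
    restored = [r / s for r, s in zip(quantize_dequantize(scaled), scales)]
    exact = sum(w * x for w, x in zip(weights, activations))
    approx = sum(r * x for r, x in zip(restored, activations))
    return abs(exact - approx)

weights = [0.70, -0.32, 0.14, -0.04]
activations = [0.1, 0.1, 8.0, 0.1]       # channel 2 sees a much larger activation

baseline_error = output_error(weights, activations, [1.0, 1.0, 1.0, 1.0])
awq_style_error = output_error(weights, activations, [1.0, 1.0, 2.0, 1.0])
```

With these values, protecting the salient channel lowers the output error even though every weight is still stored in 4 bits.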

## Frameworks for Local LLMs

- Ollama
- LM Studio
- NVIDIA ChatRTX
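
As one example of using such a framework, the sketch below builds a request for Ollama's local REST endpoint using only the standard library. It assumes an Ollama server running on its default port and a model already pulled; the model name `llama3` and the helper names are placeholders.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"   # Ollama's default local endpoint

def build_generate_request(prompt, model="llama3"):
    """Build (but do not send) a non-streaming generation request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, model="llama3"):
    """Send the request to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]

request = build_generate_request("Why run an LLM locally?")
```

Calling `generate(...)` only works while the local server is running; the prompt and the response never leave the machine.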