# Introduction to Weight Quantization for Compact Large Language Models
Weight quantization reduces the numerical precision of the weights in a neural network. By converting floating-point values to lower-bit representations (for example 8-bit or 4-bit integers), we can significantly decrease the model's memory footprint and often reduce inference latency. This makes quantization essential for deploying LLMs in resource-constrained environments, and it also improves their scalability and energy efficiency.
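To make the idea of "converting floating-point values to lower-bit integers" concrete, here is a minimal sketch of symmetric (absmax) 8-bit quantization in plain Python. It is an illustration only: the helper names `quantize_absmax` and `dequantize` are hypothetical, and this is not necessarily the scheme that `bitsandbytes` applies internally.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers using a single absmax scale (illustrative)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # float value of one integer step
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the stored scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

print(q)  # [50, -127, 3, 100]
print(max(abs(w - r) for w, r in zip(weights, restored)))  # worst-case rounding error
```

Each weight is stored as one small integer plus a shared scale, which is where the memory saving comes from; the price is a rounding error of at most half a quantization step per weight.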



Before quantizing a large language model, we need to install a few Python packages that provide the functionality for implementing, evaluating, and optimizing quantized models.

The following packages are installed, each serving a distinct purpose in the project:

- **`bitsandbytes` (version 0.39.0 or later)**: Provides the 8-bit and 4-bit quantization routines that make efficient weight quantization and model compression possible.

- **Hugging Face's `accelerate`**: This library is designed to abstract the complexities of running machine learning models across different hardware configurations, from CPUs to GPUs and TPUs, making it easier to adapt our quantization strategies without deep hardware knowledge.

- **Hugging Face's `transformers`**: Installed directly from the GitHub source, so that the latest pre-trained models and transformer architectures are available, together with the most recent features and bug fixes.




In [1]:
!pip install -q "bitsandbytes>=0.39.0"
!pip install -q git+https://github.com/huggingface/accelerate.git
!pip install -q git+https://github.com/huggingface/transformers.git

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
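Since the installs run quietly, it can help to confirm which versions actually ended up in the environment. The following is a small sketch using only the standard library; the helper name `installed_version` is an assumption made for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages this notebook depends on
for name in ("bitsandbytes", "accelerate", "transformers", "torch"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If `bitsandbytes` reports a version below 0.39.0, the version constraint was probably not applied and the install is worth repeating.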




## The initial installations encompass the following core libraries:

- **`transformers`**: Facilitates access to pre-trained transformer models and architectures, serving as a cornerstone for working with LLMs.
- **`torch`**: The PyTorch library, a flexible and powerful deep learning framework, is indispensable for building and manipulating neural networks, including those undergoing quantization.
- **`accelerate`**: A library from Hugging Face that abstracts the complexities of adapting machine learning workflows to various hardware configurations, thereby simplifying the execution of models across different platforms.



In [2]:
!pip install transformers torch accelerate





LLaMA 2 is a gated model: to work with it, you need a Hugging Face account that has been granted access to the model, plus an access token. The token serves as a credential for downloading the model and using it in your projects.



In [3]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.c

In [4]:
!huggingface-cli whoami

Belal998
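As an alternative to the interactive `huggingface-cli login` prompt, the token can be supplied from an environment variable. The sketch below assumes the token is stored in `HF_TOKEN` and uses `huggingface_hub.login`; the helpers `read_token` and `login_from_env` are hypothetical names introduced here for illustration.

```python
import os


def read_token(env=os.environ, var="HF_TOKEN"):
    """Fetch the access token from the environment and fail early if it is absent."""
    token = env.get(var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var} environment variable to your Hugging Face token")
    return token


def login_from_env():
    """Log in without a prompt (assumes huggingface_hub is installed)."""
    from huggingface_hub import login  # imported lazily so read_token works on its own

    login(token=read_token())
```

Calling `login_from_env()` once per session should have the same effect as the interactive login, without pasting the token into the notebook.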




**Preparing the Environment for LLaMA 2 Deployment**

With the necessary access and authentication steps completed, we now transition to the core of our project: setting up LLaMA 2 for generating text. This stage involves importing essential libraries and initializing the model along with its tokenizer. These components are foundational for leveraging the model's capabilities within our computational environment.

The process unfolds as follows:

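1. **Import Libraries**: `AutoModelForCausalLM` and `AutoTokenizer` from `transformers`, plus `torch`.
2. **Seed Setting for Reproducibility**: `torch.manual_seed(0)` fixes the random seed.
3. **Device Configuration**: the model is placed on the CPU for now.
4. **Model and Tokenizer Loading**: both are loaded from the `meta-llama/Llama-2-7b-chat-hf` checkpoint.
5. **Memory Footprint**: `get_memory_footprint()` reports the size of the loaded model in bytes.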


In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)

# Set device to CPU for now
device = 'cpu'

# Load model and tokenizer
model_id = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Print model size
print(f"Model size: {model.get_memory_footprint():,} bytes")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model size: 27,087,888,384 bytes
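The reported size can be sanity-checked with simple arithmetic: a footprint is roughly the number of stored values times the bytes per value. The sketch below assumes the model was loaded in 32-bit floats (4 bytes per value), which the reported figure suggests; the table `BYTES_PER_VALUE` and the helper `estimate_footprint` are illustrative names, and the estimates ignore any quantization overhead.

```python
# Approximate storage cost per value for common precisions (illustrative)
BYTES_PER_VALUE = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4-bit": 0.5}


def estimate_footprint(num_values, dtype):
    """Estimate memory in bytes for num_values stored at the given precision."""
    return int(num_values * BYTES_PER_VALUE[dtype])


reported_bytes = 27_087_888_384      # "Model size" printed above
approx_values = reported_bytes // 4  # assuming 4 bytes per value (float32)

print(f"Stored values: about {approx_values / 1e9:.2f} billion")
for dtype in BYTES_PER_VALUE:
    size_gb = estimate_footprint(approx_values, dtype) / 1e9
    print(f"{dtype:>8}: about {size_gb:.1f} GB")
```

The count is slightly above the nominal 7 billion parameters in part because `get_memory_footprint()` also includes buffers by default.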




**Finalizing LLaMA 2 Deployment: Quantization and Google Drive Integration**




In [6]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from google.colab import drive

# Mount Google Drive so the quantized weights persist across sessions
drive.mount('/content/drive')

# Specify your Google Drive path
drive_path_4bit = '/content/drive/My Drive/your_drive_path/model_int8/int4'

# Load the checkpoint with 4-bit weights (model_id is defined in the previous cell);
# device_map='auto' lets accelerate place the layers on the available device(s)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
model_4bit.save_pretrained(drive_path_4bit)

print(f"Model size: {model_4bit.get_memory_footprint():,} bytes")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model size: 3,829,940,224 bytes
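Comparing the two `Model size` figures reported in this notebook gives a rough measure of the savings. The helper `compression_summary` below is a hypothetical name used only for this calculation.

```python
def compression_summary(bytes_before, bytes_after):
    """Return (compression ratio, fraction of memory saved)."""
    ratio = bytes_before / bytes_after
    saved = 1 - bytes_after / bytes_before
    return ratio, saved


full_precision_bytes = 27_087_888_384  # size reported for the unquantized model
four_bit_bytes = 3_829_940_224         # size reported after 4-bit loading

ratio, saved = compression_summary(full_precision_bytes, four_bit_bytes)
print(f"{ratio:.2f}x smaller, {saved:.1%} of memory saved")  # 7.07x smaller, 85.9% of memory saved
```

The ratio falls short of the ideal 8x for a 32-bit to 4-bit conversion. A likely reason, stated here as an assumption rather than something measured in this notebook, is that some parts of the model (such as embeddings and normalization layers) are kept in higher precision.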
