## Open notebook in:
| Colab | Gradient |
|:------|:---------|
| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nicolepcx/Transformers-in-Action/blob/main/CH09/ch09_sharded_falcon.ipynb) | [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/Nicolepcx/Transformers-in-Action/blob/main/CH09/ch09_sharded_falcon.ipynb) |

In [1]:
# Clone the repo if it is not already present, so that the notebook
# runs on both Colab and Paperspace
import os

if not os.path.isdir('Transformers-in-Action'):
    !git clone https://github.com/Nicolepcx/Transformers-in-Action.git
else:
    print('Repository already exists. Skipping clone.')


current_path = %pwd
if '/Transformers-in-Action' in current_path:
    new_path = current_path + '/utils'
else:
    new_path = current_path + '/Transformers-in-Action/utils'
%cd $new_path


Cloning into 'Transformers-in-Action'...
remote: Enumerating objects: 324, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 324 (delta 13), reused 22 (delta 7), pack-reused 289
Receiving objects: 100% (324/324), 3.15 MiB | 36.24 MiB/s, done.
Resolving deltas: 100% (162/162), done.
/content/Transformers-in-Action/utils


# About this notebook


In this notebook you will load `tiiuae/falcon-7b` from the Hugging Face Hub, save it as sharded checkpoint files with Accelerate, reload the shards onto a device, and run inference with the model.
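
To get a rough feel for how a maximum shard size translates into a number of checkpoint files, the arithmetic can be sketched as follows. The parameter count (7 billion), the bytes per parameter, and reading "2GB" as 2 * 10**9 bytes are illustrative assumptions, not values measured in this notebook, and `estimate_num_shards` is a hypothetical helper. Sharding keeps whole tensors together, so the real file count can differ from this estimate.

```python
import math


def estimate_num_shards(num_params: int, bytes_per_param: int, max_shard_bytes: int) -> int:
    """Rough estimate of how many shard files a checkpoint needs (hypothetical helper)."""
    total_bytes = num_params * bytes_per_param
    return math.ceil(total_bytes / max_shard_bytes)


# Assumed values, for illustration only
approx_params = 7_000_000_000   # "7B" taken at face value
max_shard_bytes = 2 * 10**9     # "2GB" read as decimal gigabytes

# 4 bytes per parameter (float32) versus 2 bytes per parameter (16-bit formats)
print(estimate_num_shards(approx_params, 4, max_shard_bytes))
print(estimate_num_shards(approx_params, 2, max_shard_bytes))
```

Under these assumptions, halving the bytes per parameter halves the estimated shard count, which is why the data type used at load time matters as much as the shard size itself.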


# Install requirements

In [2]:
from requirements import *

In [3]:
install_required_packages_ch09()

Installing chapter 9 requirements...
✅ accelerate==0.26.1 installation completed successfully!

✅ safetensors==0.4.1 installation completed successfully!

✅ transformers == 4.38.2 installation completed successfully!

✅ datasets==2.10.1 installation completed successfully!

✅ torch>=1.10.0 installation completed successfully!

✅ ray==2.9.3 installation completed successfully!

✅ wandb installation completed successfully!



# Imports

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from accelerate import Accelerator, load_checkpoint_and_dispatch
import os
import psutil
import torch

In [5]:
# Function to print system resources
def print_system_resources():
    num_cpus = os.cpu_count()
    print(f'Number of CPU cores: {num_cpus}')
    print(f'Total CPU RAM: {psutil.virtual_memory().total / (1024**3):.2f} GB')

    if torch.cuda.is_available():
        num_gpus = torch.cuda.device_count()
        print(f'Number of GPUs: {num_gpus}')
        for i in range(num_gpus):
            print(f'GPU {i} Name: {torch.cuda.get_device_name(i)}')
            print(f'GPU {i} RAM: {torch.cuda.get_device_properties(i).total_memory / (1024**3):.2f} GB')
    else:
        print('No GPUs found')

# Function to save and load the model with sharding
def shard_and_load_model(model_name, save_directory, max_shard_size, device_map):
    # Initialize tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Initialize Accelerator
    accelerator = Accelerator()

    # Save model with sharding
    accelerator.save_model(model=model, save_directory=save_directory, max_shard_size=max_shard_size)

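    # Load model from checkpoint with device map.
    # no_split_module_classes names the layer class that must stay on a single
    # device; for Falcon in transformers this is 'FalconDecoderLayer'.
    model = load_checkpoint_and_dispatch(
        model, checkpoint=save_directory, device_map=device_map, no_split_module_classes=['FalconDecoderLayer']
    )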

    return model, tokenizer

# Function to generate model outputs
def generate_outputs(model, tokenizer, input_text, max_new_tokens):
    # Move the tokenized inputs to the model's device so this also works off-CPU
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, return_dict_in_generate=True, output_scores=True)
    return tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)

# Main function to run the workflow
def main():
    model_name = "tiiuae/falcon-7b"
    save_directory = '/content/model'
    device_map = {"": 'cpu'}

    # Print system resources
    print_system_resources()

    # Shard and load the model
    model, tokenizer = shard_and_load_model(model_name, save_directory, "2GB", device_map)

    # Generate outputs
    raw_input_text = "Tell me something about falcons"
    generated_text = generate_outputs(model, tokenizer, raw_input_text, 100)

    print(f'Generated Text: {generated_text}')

# Run the main function
if __name__ == "__main__":
    main()


Number of CPU cores: 12
Total CPU RAM: 83.48 GB
Number of GPUs: 1
GPU 0 Name: NVIDIA A100-SXM4-40GB
GPU 0 RAM: 39.56 GB


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/287 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.73M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/281 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

pytorch_model.bin.index.json:   0%|          | 0.00/16.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.95G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/4.48G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/117 [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Generated Text: Tell me something about falcons.
Falcons are birds of prey. They are carnivores and they hunt for their food. They are very strong and they can kill their prey with their sharp talons. They are very fast and they can fly very fast. They are very good hunters. They hunt for small animals like rabbits, mice, squirrels, birds, etc. They hunt for their food in the wild. They are very good at hunting. They are very good at flying. They are very good at hunting.
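
After the run, the save directory should hold the shard files plus a JSON index that maps each weight name to the shard containing it. A minimal sketch for summarizing such an index follows. It assumes the index has a `weight_map` dictionary and that the file is called `model.safetensors.index.json`; both are assumptions about the installed Accelerate version, and `load_index` and `count_tensors_per_shard` are hypothetical helpers. A small synthetic index with made-up tensor names stands in for the real file.

```python
import json
from collections import Counter
from pathlib import Path


def load_index(save_directory: str, index_name: str = "model.safetensors.index.json") -> dict:
    """Read a shard index file; the default file name is an assumption."""
    with open(Path(save_directory) / index_name) as f:
        return json.load(f)


def count_tensors_per_shard(index: dict) -> dict:
    """Count how many tensors each shard file holds, sorted by file name."""
    counts = Counter(index["weight_map"].values())
    return dict(sorted(counts.items()))


# Synthetic index with made-up tensor and file names, for illustration only
example_index = {
    "metadata": {"total_size": 12},
    "weight_map": {
        "layer0.weight": "model-00001-of-00002.safetensors",
        "layer0.bias": "model-00001-of-00002.safetensors",
        "layer1.weight": "model-00002-of-00002.safetensors",
    },
}

print(count_tensors_per_shard(example_index))
```

To inspect the checkpoint written by this notebook, a call such as `count_tensors_per_shard(load_index('/content/model'))` should work, provided the index file name matches the assumption above.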
