# Pushing to Hub
A simple notebook for downloading models from the HuggingFace Hub and pushing models back to it, including quantized models.

---

[Open the notebook in Google Colab](https://colab.research.google.com/drive/1DgwzDwNQNv_NAwLnGCKHiO4-XhGVqg6-?usp=sharing).

OR

Run using Runpod, Vast.ai or your own GPU by following the guide [here](https://github.com/TrelisResearch/install-guides/blob/main/llm-notebook-setup.md).

---

Hat tip to @poedator on GitHub for work to make pushing 4bit models possible!

---

Notebook built by Trelis Research. Find us at [Trelis.com](https://trelis.com) and on [HuggingFace](https://huggingface.co/Trelis).

*Trelis Research emails members each time a new video tutorial is published. If you'd like, you can join [here](https://trelis.substack.com).*

In [33]:
#Upgrade pip and install scipy
!python -m pip install -q -U pip
!pip install -q -U scipy
# !pip install einops #needed for Phi-2

In [34]:
# Required for accessing gated models or datasets on HuggingFace, and for pushing models to HuggingFace
!pip install huggingface_hub -q -U
from huggingface_hub import notebook_login

notebook_login()

In [35]:
cache_dir='.' #means models will be downloaded into the current directory

### Connect Google Drive (only for Google Colab)

Optional, but it saves time by caching the model between sessions and allows training data to be saved on Drive.

If you're running in Jupyter (e.g. on Runpod), set cache_dir to a local caching directory on the pod instead, e.g. cache_dir='.' as above.
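The choice between a Drive-backed cache and a local one can be sketched as follows. This is only an illustration, not part of the original notebook: `pick_cache_dir` is a hypothetical helper, and it assumes Drive has already been mounted at `/content/drive`.

```python
import os
import sys


def pick_cache_dir(drive_root="/content/drive/My Drive"):
    """Return a Drive-backed cache path on Colab, else the current directory.

    Hypothetical helper: assumes Drive is mounted at /content/drive when present.
    """
    # Colab kernels typically have this module loaded; elsewhere it is absent.
    running_in_colab = "google.colab" in sys.modules
    if running_in_colab and os.path.isdir(drive_root):
        return os.path.join(drive_root, "huggingface_cache")
    return "."  # local directory, e.g. on a Runpod or Vast.ai pod


cache_dir = pick_cache_dir()
os.makedirs(cache_dir, exist_ok=True)  # no-op if the directory already exists
print(f"Caching models in: {cache_dir}")
```

Outside Colab, or when Drive is not mounted, this falls back to the current directory, which matches the default set earlier.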

In [37]:
# https://stackoverflow.com/questions/56081324/why-are-google-colab-shell-commands-not-working
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# import os
# cache_dir = "/content/drive/My Drive/huggingface_cache"
# os.makedirs(cache_dir, exist_ok=True) # Ensure the directory exists

# Installation

In [36]:
!pip install git+https://github.com/huggingface/transformers.git -q -U #Necessary for merging LoRA adapters onto quantized models.
# !pip install -q -U transformers # if you are facing issues with the dev branch above

!pip install accelerate -q -U

# # Install peft to allow for LoRA fine-tuning
# !pip install -q -U peft

# # Install bitsandbytes for quantized fine-tuning
# !pip install -q -U bitsandbytes


## Downloading a Repo

In [128]:
from huggingface_hub import snapshot_download
import os

hub_model_path = "Trelis/TinyLlama-1.1B-Chat-v1.0-bf16"

local_model_path = cache_dir + '/' + hub_model_path

In [108]:
repo_path = snapshot_download(
    repo_id=hub_model_path,
    cache_dir=cache_dir,
    local_dir=local_model_path,
    local_dir_use_symlinks=False)

print(f"Repository downloaded to: {local_model_path}")

Repository downloaded to: ./Trelis/TinyLlama-1.1B-Chat-v1.0-bf16
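Some repos ship the same weights in both `.safetensors` and `.bin` formats, which doubles the download. `snapshot_download` accepts an `ignore_patterns` argument for skipping files. The sketch below previews which files a set of patterns would keep, assuming the glob-style matching behaves like Python's `fnmatch`; the file listing is illustrative, not fetched from the Hub.

```python
from fnmatch import fnmatch

# Illustrative file listing for a repo that ships both weight formats.
repo_files = [
    "config.json",
    "tokenizer.json",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
    "pytorch_model-00001-of-00002.bin",
    "pytorch_model-00002-of-00002.bin",
]

# Skip the PyTorch .bin weights and their index file.
ignore_patterns = ["*.bin", "*.bin.index.json"]


def keep(filename, patterns):
    """True if the filename matches none of the ignore patterns."""
    return not any(fnmatch(filename, pattern) for pattern in patterns)


files_to_download = [name for name in repo_files if keep(name, ignore_patterns)]
print(files_to_download)

# The same patterns could then be passed to the download call, e.g.:
# snapshot_download(repo_id=hub_model_path, local_dir=local_model_path,
#                   ignore_patterns=ignore_patterns)
```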


# Load the Model

In [148]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# # Uncomment this block for quantization in 4bit (nf4)!
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16
# )

model = AutoModelForCausalLM.from_pretrained(
    hub_model_path,
    # quantization_config=bnb_config, # Uncomment this line for quantization in 4bit (nf4)!
    device_map='auto', #loads automatically to gpu if there is one.
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    cache_dir=cache_dir)

tokenizer = AutoTokenizer.from_pretrained(local_model_path,use_fast=True,trust_remote_code=True)

### Prepare for LoRA fine-tuning

In [131]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [132]:
## Parameter Efficient Fine Tuning (PEFT), specifically, Low Rank Adaptation (LoRA).

from peft import LoraConfig, get_peft_model

config = LoraConfig( #matching the Llama recipe, but with added modules
    r=8,
    lora_alpha=32,
    target_modules=[
              "self_attn.q_proj",
              "self_attn.k_proj",
              "self_attn.v_proj",
              "self_attn.o_proj",
              "mlp.gate_proj",
              "mlp.up_proj",
              "mlp.down_proj",
              ],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
# model = prepare_model_for_kbit_training(model) # uncomment if the model was loaded with quantization_config

model_with_lora = get_peft_model(
    model,
    config,
)

print_trainable_parameters(model_with_lora)

trainable params: 6307840 || all params: 1106356224 || trainable%: 0.570145479653396
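The trainable parameter count can be checked by hand: each LoRA-wrapped linear layer adds an `r x in_features` matrix and an `out_features x r` matrix. The sketch below redoes that arithmetic. The layer shapes are assumptions taken from TinyLlama-1.1B's published config (they are not stated in this notebook), so treat the numbers as a sanity check rather than a definition.

```python
# Layer shapes assumed from TinyLlama-1.1B's published config.
hidden_size = 2048
intermediate_size = 5632
num_layers = 22
kv_proj_size = 256  # 4 key/value heads x head dim 64 (grouped-query attention)
r = 8  # LoRA rank, as in the LoraConfig

# (in_features, out_features) for every targeted linear layer in one decoder block
target_shapes = {
    "self_attn.q_proj": (hidden_size, hidden_size),
    "self_attn.k_proj": (hidden_size, kv_proj_size),
    "self_attn.v_proj": (hidden_size, kv_proj_size),
    "self_attn.o_proj": (hidden_size, hidden_size),
    "mlp.gate_proj": (hidden_size, intermediate_size),
    "mlp.up_proj": (hidden_size, intermediate_size),
    "mlp.down_proj": (intermediate_size, hidden_size),
}


def lora_params(in_features, out_features, rank):
    """Parameters added by one LoRA pair: A is rank x in, B is out x rank."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, r) for i, o in target_shapes.values())
trainable = per_layer * num_layers
print(per_layer, trainable)  # 286720 6307840
```

Under these assumed shapes the total matches the 6307840 trainable parameters reported by `print_trainable_parameters`.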


## Typically here you:
- Set up the tokenizer
- Set up padding
- Set up the dataset
- Run evaluation before training
- Set up training
- Run evaluation after training

Check out the [Trelis Research YouTube Channel](https://youtube.com/@trelisresearch) for videos on these topics.

# Push Adapters and Model to Hub

In [149]:
org = 'Trelis'
new_hub_model_path = org + '/' + hub_model_path.split("/")[-1] + '-push-demo'
print(f"Setting up to push the model to {new_hub_model_path} on HuggingFace Hub")

Setting up to push the model to Trelis/TinyLlama-1.1B-Chat-v1.0-bf16-push-demo on HuggingFace Hub


## Push Adapters to Hub

In [135]:
# If you have done LoRA fine-tuning
new_hub_model_adapters_path = new_hub_model_path + "-adapters"
print(f"Setting up to push the adapters to {new_hub_model_adapters_path} on HuggingFace Hub")

Setting up to push the adapters to Trelis/TinyLlama-1.1B-Chat-v1.0-bf16-push-demo-adapters on HuggingFace Hub


In [136]:
### Typically you do one of three things

# # Option A: Pick an adapter from somewhere during your fine-tuning
# adapter_to_push = save_dir + '/checkpoint-32'

# # Option B: Grab an adapter you trained before from HuggingFace Hub
# adapter_to_push = "Trelis/Llama-2-7b-chat-hf-touch-rugby-rules-adapters" #uncomment if you want to grab an adapter from the hub

# # Apply the desired adapter to the base model - Required for Option A or Option B.

# from peft import PeftModel

# # load peft model with the chosen adapter
# model_to_push = PeftModel.from_pretrained(
#     model,
#     adapter_to_push,
# )

# Option C: the adapter is your model_with_lora (yes, this is confusing...), i.e. you are going to push model_with_lora as it is at the very last step of training.
adapter_to_push = model_with_lora
model_to_push = model_with_lora

In [137]:
# Save the adapter model
adapter_to_push.save_pretrained(new_hub_model_adapters_path, token=True)

In [138]:
# Push the model adapters to the hub
# If running this, the base model name recorded in the adapter config (base_model_name_or_path) needs to refer to a model on the hub
adapter_to_push.push_to_hub(new_hub_model_adapters_path, token=True, safe_serialization=True)

CommitInfo(commit_url='https://huggingface.co/Trelis/TinyLlama-1.1B-Chat-v1.0-bf16-push-demo-adapters/commit/6ae961e74e20fd5a23b1ba7392adbe55caeac8b2', commit_message='Upload model', commit_description='', oid='6ae961e74e20fd5a23b1ba7392adbe55caeac8b2', pr_url=None, pr_revision=None, pr_num=None)

## Merge Adapters

In [140]:
# # Added Option X: Reload a base model in 16-bit precision and then merge.
# # Use this if you trained with quantization but want a full-precision model for inference with Text Generation Inference, OR you want to quantize to GGUF or AWQ. This will hurt precision a little.
# model = AutoModelForCausalLM.from_pretrained(
#     local_model_path,
#     device_map='auto', #loads automatically to gpu if there is one. It can be useful to load onto cpu if using a free Colab notebook as that gives more RAM.
#     torch_dtype=torch.bfloat16,
#     trust_remote_code=True,
#     cache_dir=cache_dir)

# from peft import PeftModel

# # load peft model with the chosen adapter
# model_to_push = PeftModel.from_pretrained(
#     model,
#     new_hub_model_adapters_path,
# )

In [141]:
model_to_push = model_to_push.merge_and_unload() # merge adapters with the base model. This will hurt precision a little if you are merging a quantized model.
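Conceptually, merging folds the low-rank update into the base weight, W_merged = W + (lora_alpha / r) * B @ A, so a single matrix multiply gives the same output as the base path plus the adapter path. The toy example below shows this with plain Python lists. It is a sketch of the idea only, with made-up values, ignoring dropout, dtypes and quantization, all of which PEFT's real implementation has to handle.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by a scalar."""
    return [[s * x for x in row] for row in a]


r, lora_alpha = 1, 4            # toy values, not the notebook's r=8, lora_alpha=32
scaling = lora_alpha / r

W = [[1.0, 2.0], [3.0, 4.0]]    # frozen base weight (out x in)
A = [[0.5, -0.5]]               # LoRA A matrix (r x in)
B = [[1.0], [2.0]]              # LoRA B matrix (out x r)
x = [[1.0], [2.0]]              # input as a column vector

# Unmerged: base path plus the scaled adapter path
y_unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged: fold the adapter into the weight once, then use one matmul
W_merged = add(W, scale(matmul(B, A), scaling))
y_merged = matmul(W_merged, x)

print(y_unmerged, y_merged)  # [[3.0], [7.0]] [[3.0], [7.0]]
```

When the base weight is stored in 4-bit, the merged weight has to be re-quantized, which is where the small precision loss comes from.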

## Push Model to Hub

In [150]:
# ONLY RUN THIS CELL IF YOU DID *NOT* MERGE ADAPTERS, i.e. you did a fine-tuning without LoRA
model_to_push = model

In [151]:
# Save the model locally (using the same path as will be used on the hub)
model_to_push.save_pretrained(new_hub_model_path)

In [152]:
model_to_push.push_to_hub(new_hub_model_path, token=True, max_shard_size="5GB", safe_serialization=True)

CommitInfo(commit_url='https://huggingface.co/Trelis/TinyLlama-1.1B-Chat-v1.0-bf16-push-demo/commit/08426f6f068b6e33a131b1cb72a8355af7328cfd', commit_message='Upload LlamaForCausalLM', commit_description='', oid='08426f6f068b6e33a131b1cb72a8355af7328cfd', pr_url=None, pr_revision=None, pr_num=None)
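To get a feel for what `max_shard_size="5GB"` means in practice, a rough estimate is parameters times bytes per parameter, divided by the shard limit. The sketch below is a back-of-the-envelope estimate only: `estimate_shards` is a hypothetical helper, it assumes "5GB" means 5 x 10^9 bytes, and real sharding splits on tensor boundaries, so actual shard counts can differ.

```python
import math


def estimate_shards(num_params, bytes_per_param, max_shard_bytes):
    """Return (approximate checkpoint size in bytes, approximate shard count)."""
    total_bytes = num_params * bytes_per_param
    return total_bytes, max(1, math.ceil(total_bytes / max_shard_bytes))


max_shard_bytes = 5 * 10**9  # assumed meaning of "5GB"

# Base model size: all params minus LoRA params from the earlier printout
tiny_llama_params = 1_106_356_224 - 6_307_840

size, shards = estimate_shards(tiny_llama_params, 2, max_shard_bytes)  # bf16 = 2 bytes
print(f"TinyLlama bf16: ~{size / 1e9:.2f} GB in ~{shards} shard(s)")

# A hypothetical 7B-parameter model in bf16 would need several shards
size_7b, shards_7b = estimate_shards(7_000_000_000, 2, max_shard_bytes)
print(f"7B bf16: ~{size_7b / 1e9:.2f} GB in ~{shards_7b} shard(s)")
```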

## Push Tokenizer to Hub

In [81]:
## Re-load a tokenizer (uncommon)
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained(hub_model_path, trust_remote_code=True)

# save the tokenizer
tokenizer.save_pretrained(new_hub_model_path)

# push the tokenizer to hub
tokenizer.push_to_hub(new_hub_model_path, token=True)

CommitInfo(commit_url='https://huggingface.co/Trelis/TinyLlama-1.1B-Chat-v1.0-bf16/commit/261fce88c3ae20560ed70e71e5151f738360c77e', commit_message='Upload tokenizer', commit_description='', oid='261fce88c3ae20560ed70e71e5151f738360c77e', pr_url=None, pr_revision=None, pr_num=None)

## (Alternative) Upload a folder to the Hub

In [14]:
from huggingface_hub import HfApi, upload_folder, create_branch

# Initialize the HfApi class
api = HfApi()

# # Optionally, create a new branch for 'nf4'. Beware this will copy all files from main.
# create_branch(repo_id=new_hub_model_path, repo_type="model", branch="nf4")

# Upload the entire folder to the specified branch in the repository
upload_folder(
    folder_path=new_hub_model_path,
    repo_id=new_hub_model_path,
    repo_type="model",  # Assuming it's a model; can be "dataset" or "space" as well
    # revision="nf4",  # Specify the branch you want to push to
    token=True,
)

print(f"Uploaded contents of {new_hub_model_path} to {new_hub_model_path} on HuggingFace Hub")

Uploaded contents of Trelis/TinyLlama-1.1B-Chat-v1.0-push-demo to Trelis/TinyLlama-1.1B-Chat-v1.0-push-demo on HuggingFace Hub
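Because `upload_folder` pushes everything under `folder_path`, it can be worth listing the folder first to catch stray checkpoints or large temporary files. The sketch below is one way to do that with the standard library. `list_upload_candidates` is a hypothetical helper, and the demo runs on a throwaway folder rather than a real model directory.

```python
import os
import tempfile


def list_upload_candidates(folder_path):
    """Return sorted (relative_path, size_in_bytes) for every file under folder_path."""
    candidates = []
    for root, _dirs, files in os.walk(folder_path):
        for name in files:
            full_path = os.path.join(root, name)
            rel_path = os.path.relpath(full_path, folder_path)
            candidates.append((rel_path, os.path.getsize(full_path)))
    return sorted(candidates)


# Demo on a temporary folder; in the notebook you would pass new_hub_model_path.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "config.json"), "w") as f:
        f.write("{}")
    listing = list_upload_candidates(tmp)

for rel_path, size in listing:
    print(f"{size:>12,d}  {rel_path}")
```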
