From 05810b8c455c58dc805c1e436f9e6ef2f42fa6b7 Mon Sep 17 00:00:00 2001 From: Neal Vaidya Date: Wed, 21 Feb 2024 04:59:39 -0800 Subject: [PATCH] Add Gemma notebooks --- models/Gemma/README.md | 59 ++++ models/Gemma/lora.ipynb | 593 ++++++++++++++++++++++++++++++++++++++++ models/Gemma/sft.ipynb | 422 ++++++++++++++++++++++++++++ models/README.md | 4 + 4 files changed, 1078 insertions(+) create mode 100644 models/Gemma/README.md create mode 100644 models/Gemma/lora.ipynb create mode 100644 models/Gemma/sft.ipynb diff --git a/models/Gemma/README.md b/models/Gemma/README.md new file mode 100644 index 000000000..d2650c0a4 --- /dev/null +++ b/models/Gemma/README.md @@ -0,0 +1,59 @@ +# Gemma + +[Gemma](https://ai.google.dev/gemma/docs) is a family of decoder-only, text-to-text large language models for the English language, built from the same research and technology used to create the [Gemini models](https://blog.google/technology/ai/google-gemini-ai/). Gemma models have open weights and offer pre-trained variants and instruction-tuned variants. These models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop, or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone. +For more details, refer to the [Gemma model card](https://ai.google.com/gemma/docs/model_card) released by Google. + + +## Customizing Gemma with NeMo Framework + +Gemma models are compatible with [NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/index.html). In this repository, we have two notebooks that cover different ways of customizing Gemma. + +### Parameter-Efficient Fine-Tuning with LoRA + +[LoRA tuning](https://arxiv.org/abs/2106.09685) is a parameter-efficient method for fine-tuning models, where we freeze the base model parameters and update an auxiliary "adapter" with many fewer weights. At inference time, the adapter weights are combined with the base model weights to produce a new model, customized for a particular use case or dataset. Because this adapter is so much smaller than the base model, it can be trained with far fewer resources than it would take to fine-tune the entire model. In this example, we'll show you how to LoRA-tune small models like the Gemma models on a single GPU. + +[Get Started Here](./lora.ipynb) + +### Supervised Fine-Tuning for Instruction Following (SFT) + +Supervised Fine-Tuning (SFT) is the process of fine-tuning all of a model’s parameters on supervised data of inputs and outputs. It teaches the model how to follow user-specified instructions and is typically done after model pre-training. This example will describe the steps involved in fine-tuning Gemma for instruction following. Gemma was released with a checkpoint already fine-tuned for instruction-following, but here we'll learn how we can tune our own model starting with the pre-trained checkpoint to achieve a similar outcome. + +Full fine-tuning is more resource-intensive than Low-Rank Adaptation, so for SFT we'll need multiple GPUs, as opposed to the single GPU used for LoRA. + +[Get Started Here](./sft.ipynb) + +## Download the base model + +For all of our customization and deployment processes, we'll need to start off with a pre-trained version of Gemma in the `.nemo` format. 
You can download the base model in `.nemo` format from the NVIDIA GPU Cloud, or convert checkpoints from another framework into a `.nemo` file. You can choose to use the 2B parameter or 7B parameter Gemma models for this notebook -- the 2B model will be faster to customize, but the 7B model will be more capable. + +You can download either model from the NVIDIA NGC Catalog, using the NGC CLI. The instructions to install and configure the NGC CLI can be found [here](https://ngc.nvidia.com/setup/installers/cli). + +To download the model, execute one of the following commands, based on which model you want to use: + +```bash +ngc registry model download-version "nvidia/nemo/gemma_2b_base:1.0" +``` + +or + +```bash +ngc registry model download-version "nvidia/nemo/gemma_7b_base:1.0" +``` + +## Getting NeMo Framework + +NVIDIA NeMo Framework is a generative AI framework built for researchers and PyTorch developers working on large language models (LLMs), multimodal models (MM), automatic speech recognition (ASR), and text-to-speech synthesis (TTS). The primary objective of NeMo is to provide a scalable framework for researchers and developers from industry and academia to more easily implement and design new generative AI models by being able to leverage existing code and pretrained models. + +You can pull a container that includes the version of NeMo Framework and all dependencies needed for these notebooks with the following: + +```bash +docker pull nvcr.io/nvidia/nemo:24.01.gemma +``` + +The best way to run these notebooks is from within the container. You can do that by launching the container with the following command: + +```bash +docker run -it --rm --gpus all --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:24.01.gemma +``` + +Then, from within the container, start the Jupyter server with a command such as `jupyter lab --ip 0.0.0.0 --port 8888 --allow-root`. \ No newline at end of file diff --git a/models/Gemma/lora.ipynb b/models/Gemma/lora.ipynb new file mode 100644 index 000000000..a8509185a --- /dev/null +++ b/models/Gemma/lora.ipynb @@ -0,0 +1,593 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Gemma Parameter-Efficient Fine-Tuning with LoRA using NeMo Framework" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Gemma](https://ai.google.com/gemma/docs/model_card) is a new family of open models from Google, built from the same research and technology used to create the Gemini models. Gemma models are remarkably capable for their size and compact enough to run locally on NVIDIA RTX GPUs. Gemma is available in 2 sizes: 2B and 7B parameters. With NVIDIA NeMo, you can customize Gemma to fit your use case and deploy an optimized model on your NVIDIA GPU.\n", + "\n", + "In this tutorial, we'll go over a specific kind of customization -- Low-Rank Adaptation (LoRA) tuning to teach the model to follow a specific output format. To learn how to perform full-parameter supervised fine-tuning for instruction following (also known as SFT), see the [companion notebook](./sft.ipynb). For LoRA, we'll perform all operations within the notebook on a single GPU. The compute resources needed for training depend on which Gemma model you use. For the 7 billion parameter variant of Gemma, you'll need a GPU with 80GB of memory. For the 2 billion parameter model, 40GB will do. \n", + "\n", + "We'll also learn how to export your custom model to TensorRT-LLM, an open-source library that accelerates and optimizes inference performance of the latest LLMs on the NVIDIA AI platform."
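, + "\n", + "The introduction above lists how much GPU memory each Gemma variant needs. If you want to confirm what is available in your environment before you begin, one quick check (a minimal sketch -- PyTorch is already included in the NeMo Framework container) is:\n", + "\n", + "```python\n", + "import torch\n", + "\n", + "# Print the name and total memory (in GB) of each visible GPU\n", + "for i in range(torch.cuda.device_count()):\n", + "    props = torch.cuda.get_device_properties(i)\n", + "    print(props.name, round(props.total_memory / 1e9), \"GB\")\n", + "```\n"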
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introduction" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[LoRA tuning](https://arxiv.org/abs/2106.09685) is a parameter efficient method for fine-tuning models, where we freeze the base model parameters and update an auxiliary \"adapter\" with many fewer weights. At inference time, the adapter weights are combined with the base model weights to produce a new model, customized for a particular use case or dataset. Because this adapter is so much smaller than the base model, it can be trained with far fewer resources than it would take to fine-tune the entire model. In this notebook, we'll show you how to LoRA-tune small models like the Gemma models on a single A100 GPU.\n", + "\n", + "For this example, we're going to tune our Gemma model on the [PubMedQA](https://pubmedqa.github.io/) dataset, a Question Answering dataset for biomedical texts. We'll see later on that our base model performs pretty poorly on this dataset -- not necessarily because it can't accurately extract the answers from the context, but because it fails to respond in the \"yes/no/maybe\" format expected. With LoRA finetuning, we'll modify our model to respond in the way we need for our task." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Download the base model\n", + "\n", + "For all of our customization and deployment processes, we'll need to start off with a pre-trained version of Gemma in the `.nemo` format. You can download the base model in `.nemo` format from the NVIDIA GPU Cloud, or convert checkpoints from another framework into a `.nemo` file. You can choose to use the 2B parameter or 7B parameter Gemma models for this notebook -- the 2B model will be faster to customize, but the 7B model will be more capable. \n", + "\n", + "You can download either model from the NVIDIA NGC Catalog, using the NGC CLI. The instructions to install and configure the NGC CLI can be found [here](https://ngc.nvidia.com/setup/installers/cli).\n", + "\n", + "To download the model, execute one of the following commands, based on which model you want to use:\n", + "\n", + "```bash\n", + "ngc registry model download-version \"nvidia/nemo/gemma_2b_base:1.0\"\n", + "```\n", + "\n", + "or\n", + "\n", + "```bash\n", + "ngc registry model download-version \"nvidia/nemo/gemma_7b_base:1.0\"\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Getting NeMo Framework\n", + "\n", + "NVIDIA NeMo Framework is a generative AI framework built for researchers and PyTorch developers working on large language models (LLMs), multimodal models (MM), automatic speech recognition (ASR), and text-to-speech synthesis (TTS). The primary objective of NeMo is to provide a scalable framework for researchers and developers from industry and academia to more easily implement and design new generative AI models by being able to leverage existing code and pretrained models.\n", + "\n", + "If you haven't already, you can pull a container that includes the version of NeMo Framework and all dependencies needed for this notebook with the following:\n", + "\n", + "```bash\n", + "docker pull nvcr.io/nvidia/nemo:24.01.gemma\n", + "```\n", + "\n", + "The best way to run this notebook is from within the container. 
You can do that by launching the container with the following command:\n", + "\n", + "```bash\n", + "docker run -it --rm --gpus all --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:24.01.gemma\n", + "```\n", + "\n", + "Then, from within the container, start the Jupyter server with a command such as `jupyter lab --ip 0.0.0.0 --port 8888 --allow-root`.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data Preparation\n", + "\n", + "First, let's download the data and use the provided script to divide it into train/validation/test splits" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!git clone https://github.com/pubmedqa/pubmedqa.git\n", + "!cd pubmedqa/preprocess && python split_dataset.py pqal" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's look at a sample of our training data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "data = json.load(open(\"pubmedqa/data/pqal_fold0/train_set.json\", 'rt'))\n", + "data[list(data)[0]]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As we can see, the data is in `json` format and includes several different fields and labels. To do parameter-efficient fine-tuning with NeMo Framework, we need the data in `jsonl` format. We also need to reformat our dataset into input and output pairs so that we can tune our model in a supervised way. The helper functions below take care of extracting and reformatting the raw data and writing out the `jsonl` file. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "\n", + "def write_jsonl(fname, json_objs):\n", + " with open(fname, 'wt') as f:\n", + " for o in json_objs:\n", + " f.write(json.dumps(o)+\"\\n\")\n", + "\n", + "# Converts the data from the original format into an LLM input prompt\n", + "def form_question(obj):\n", + " st = \"\"\n", + " st += f\"QUESTION:{obj['QUESTION']}\\n\"\n", + " st += \"CONTEXT: \"\n", + " for i, label in enumerate(obj['LABELS']):\n", + " st += f\"{obj['CONTEXTS'][i]}\\n\"\n", + " st += f\"TARGET: the answer to the question given the context is (yes|no|maybe): \"\n", + " return st\n", + "\n", + "def convert_to_jsonl(data_path, output_path):\n", + " data = json.load(open(data_path, 'rt'))\n", + " json_objs = []\n", + " for k in data.keys():\n", + " obj = data[k]\n", + " prompt = form_question(obj)\n", + " completion = obj['reasoning_required_pred']\n", + " json_objs.append({\"input\": prompt, \"output\": completion})\n", + " write_jsonl(output_path, json_objs)\n", + " return json_objs\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "test_json_objs = convert_to_jsonl(\"pubmedqa/data/test_set.json\", \"pubmedqa_test.jsonl\")\n", + "train_json_objs = convert_to_jsonl(\"pubmedqa/data/pqal_fold0/train_set.json\", \"pubmedqa_train.jsonl\")\n", + "dev_json_objs = convert_to_jsonl(\"pubmedqa/data/pqal_fold0/dev_set.json\", \"pubmedqa_val.jsonl\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here's an example of what the data looks like after formatting" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "test_json_objs[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Configuration and Training" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "NeMo Framework uses config objects to control many of its operations, which allows you to quickly see what options you can change and carry out different experiments. We can start by downloading an example config file from GitHub." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/nlp/language_modeling/tuning/conf/megatron_gpt_finetuning_config.yaml" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we'll read in this default config file with Hydra, and apply an override that enables the use of [Megatron core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core). " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import hydra\n", + "from omegaconf.omegaconf import OmegaConf\n", + "\n", + "hydra.initialize(version_base=None, config_path=\".\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cfg = hydra.compose(config_name=\"megatron_gpt_finetuning_config\", overrides=['++model.mcore_gpt=True'])\n", + "cfg.name = \"gemma_lora_pubmedqa\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's see what the default configuration looks like before we make any modifications " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(OmegaConf.to_yaml(cfg))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To see all of the different configuration options available, you can take a look at the file we downloaded. For this example, we're going to update a couple of settings to point to our newly-prepared datasets and to make sure the LoRA tuning runs on our A100. Feel free to experiment with these different options -- you can swap in your own datasets and change the training settings depending on what GPU you're using.\n", + "\n", + "For our data configuration, we'll point to the `jsonl` files we wrote out earlier. `concat_sampling_probabilities` determines what percentage of the finetuning data you would like to come from each file -- in our example we only have one training file, so we choose `[1.0]`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "OmegaConf.update(cfg, \"model.data\", {\n", + " \"train_ds\": {\n", + " \"num_workers\": 0,\n", + " \"file_names\": [\"pubmedqa_train.jsonl\"],\n", + " \"concat_sampling_probabilities\": [1.0]\n", + " },\n", + " \"validation_ds\": {\n", + " \"num_workers\": 0,\n", + " \"file_names\": [\"pubmedqa_val.jsonl\"]\n", + " },\n", + " \"test_ds\": {\n", + " \"file_names\": [\"pubmedqa_test.jsonl\"],\n", + " \"names\": [\"pubmedqa\"]\n", + " }\n", + "}, merge=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For our model settings, we don't have much to change since we're reading in a pretrained model and can inherit the values that were already set. We need to point to our existing `.nemo` file, specify that we want to use LoRA as our scheme for finetuning, and choose our parallelism and batch size values. The values below should be appropriate for a single A100 GPU.\n", + "\n", + "Make sure to change the `restore_from_path` setting if you're using a different checkpoint!"
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "OmegaConf.update(cfg, \"model\", {\n", + " \"restore_from_path\": \"gemma_7b_pt.nemo\",\n", + " \"peft\": {\n", + " \"peft_scheme\": \"lora\"\n", + " },\n", + " \"tensor_model_parallel_size\": 1,\n", + " \"pipeline_model_parallel_size\": 1,\n", + " \"micro_batch_size\": 1,\n", + " \"global_batch_size\": 8,\n", + "}, merge=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we set some options for the trainer. We'll be training on 1 GPU on a single node, at bfloat16 precision. For this example, we'll also only train for 50 steps, with a validation check after every 20 iterations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "OmegaConf.update(cfg, \"trainer\", {\n", + " 'devices': 1,\n", + " 'num_nodes': 1,\n", + " 'precision': \"bf16\",\n", + " \"val_check_interval\": 20,\n", + " \"max_steps\": 50\n", + "})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With our configurations set, we are ready to initialize our `Trainer` object to handle our training loop, and an experiment manager to handle checkpointing and logging." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from nemo.collections.nlp.parts.megatron_trainer_builder import MegatronLMPPTrainerBuilder\n", + "from nemo.utils.exp_manager import exp_manager\n", + "\n", + "trainer = MegatronLMPPTrainerBuilder(cfg).create_trainer()\n", + "exp_manager(trainer, cfg.exp_manager)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After initializing the Trainer object, we can load our model from disk into memory. To load the model weights we'll need a config object for the model, which we can read from the `.nemo` file on disk and update it with any new settings we added in our LoRA config above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from nemo.collections.nlp.models.language_modeling.megatron_gpt_sft_model import MegatronGPTSFTModel\n", + "\n", + "model_cfg = MegatronGPTSFTModel.merge_cfg_with(cfg.model.restore_from_path, cfg)\n", + "model = MegatronGPTSFTModel.restore_from(cfg.model.restore_from_path, model_cfg, trainer=trainer)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's see how to add the LoRA Adapter to our model and train it. We can specify that we want to use LoRA by using the `LoraPEFTConfig` class, which stores the type of adapter to apply and the hyperparameters required to initialize the adapter module. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from nemo.collections.nlp.parts.peft_config import LoraPEFTConfig\n", + "\n", + "my_peft_config = LoraPEFTConfig(model_cfg)\n", + "print(OmegaConf.to_yaml(my_peft_config.get_config_dict()))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can then call `add_adapter` to actually add the LoRA adapter to our base model and prepare it for training. When we first call `add_adapter`, the model prints out the parameter count before and after the operation and we can see the number of trainable parameters rise. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "model.add_adapter(my_peft_config)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note: If you want to use a different PEFT method, you can use a different config class in place of `LoraPEFTConfig`, such as `CanonicalAdaptersPEFTConfig`, `IA3PEFTConfig`, or `PtuningPEFTConfig`. You can also use a combination of the methods by passing in a list: `model.add_adapter([LoraPEFTConfig(model_cfg), PtuningPEFTConfig(model_cfg)])`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We're now ready to start training! As the training loop runs, you'll see the validation loss drop significantly -- even with this short demonstration." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "trainer.fit(model)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once training is completed, you should see a saved '.nemo' file in the `nemo_experiments/gemma_lora_pubmedqa` folder. This checkpoint will only contain the trained adapter weights, and not the frozen base model weights.\n", + "\n", + "We can also now see how the newly finetuned model performs on the test data:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "trainer.test(model)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Much better! If you want to learn how to export our new model for optimized inference using TensorRT-LLM, continue on to the \"Exporting to TensorRT-LLM\" section." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exporting to TensorRT-LLM" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "TensorRT-LLM is an open-source library for optimizing inference performance to achieve state-of-the-art speed on NVIDIA GPUs. The NeMo Framework offers an easy way to compile `.nemo` models into optimized TensorRT-LLM engines which you can run locally embedded in another application, or serve to other applications using a server like Triton Inference Server." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To start with, let's create a folder where our exported model will land" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!mkdir gemma_pubmedqa_merged_trt_llm" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To export the model to TensorRT-LLM, we'll need to merge the weights of the base model and the weights of the adapter. If you're using the NeMo Framework container, you'll find a script for this at `/opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py`. Otherwise, you can download the standalone script from GitHub at `https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/nlp_language_modeling/merge_lora_weights/merge.py`. To run the merge script, you'll need the path to the base model and trained adapter, as well as a path to save the merged model to."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "python /opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py \\\n", + " trainer.accelerator=gpu \\\n", + " tensor_model_parallel_size=1 \\\n", + " pipeline_model_parallel_size=1 \\\n", + " gpt_model_file=gemma_7b_pt.nemo \\\n", + " lora_model_path=nemo_experiments/gemma_lora_pubmedqa/checkpoints/gemma_lora_pubmedqa.nemo \\\n", + " merged_model_path=gemma_lora_pubmedqa_merged.nemo" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With our merged model weights, we just need to create an instance of the `TensorRTLLM` class and call the `TensorRTLLM.export()` function -- pointing the `nemo_checkpoint_path` argument to the newly merged model from above.\n", + "\n", + "This creates a couple of files in the folder we created -- an `engine` file that holds the weights and the compiled execution graph of the model, a `tokenizer.model` file which holds the tokenizer information, and `config.json` which holds some metadata about the model (along with `model.cache`, which caches some operations and makes it faster to re-compile the model in the future.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from nemo.export import TensorRTLLM\n", + "trt_llm_exporter = TensorRTLLM(model_dir=\"gemma_pubmedqa_merged_trt_llm\")\n", + "trt_llm_exporter.export(nemo_checkpoint_path=\"gemma_lora_pubmedqa_merged.nemo\", model_type=\"gemma\", n_gpus=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With the model exported into TensorRT-LLM, we can perform very fast inference:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "trt_llm_exporter.forward([\"NVIDIA and Google are \"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There's also a convenient function to deploy the model as a service, backed by Triton Inference Server:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from nemo.deploy import DeployPyTriton\n", + "\n", + "nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name=\"gemma\")\n", + "nm.deploy()\n", + "nm.serve()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "nemo_lora", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/models/Gemma/sft.ipynb b/models/Gemma/sft.ipynb new file mode 100644 index 000000000..f5e63357e --- /dev/null +++ b/models/Gemma/sft.ipynb @@ -0,0 +1,422 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Supervised Fine-Tuning for Instruction Following" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Gemma](https://ai.google.com/gemma/docs/model_card) is a new family of open models from Google, built from the same research and technology used to create the Gemini models. Gemma models are remarkably capable for their size and compact enough to run locally on NVIDIA RTX GPUs. Gemma is available in 2 sizes: 2B and 7B parameters. 
With NVIDIA NeMo, you can customize Gemma to fit your use case and deploy an optimized model on your NVIDIA GPU.\n", + "\n", + "In this tutorial, we'll go over a specific kind of customization -- full-parameter supervised fine-tuning for instruction following (also known as SFT). To learn how to perform Low-Rank Adaptation (LoRA) tuning to teach the model to follow a specific output format, see the [companion notebook](./lora.ipynb). For SFT, we'll show how you can kick off a multi-GPU training job with an example script so that you can train on 8 GPUs. The exact number of GPUs needed will depend on which model you use and what kind of GPUs you use, but we recommend using 8 A100-80GB GPUs.\n", + "\n", + "We'll also learn how to export your custom model to TensorRT-LLM, an open-source library that accelerates and optimizes inference performance of the latest LLMs on the NVIDIA AI platform." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introduction" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Supervised Fine-Tuning (SFT) is the process of fine-tuning all of a model’s parameters on supervised data of inputs and outputs. It teaches the model how to follow user-specified instructions and is typically done after model pre-training. This notebook describes the steps involved in fine-tuning Gemma for instruction following. Gemma was released with a checkpoint already fine-tuned for instruction-following, but here we'll learn how we can tune our own model starting with the pre-trained checkpoint to achieve a similar outcome. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Download the base model\n", + "\n", + "For all of our customization and deployment processes, we'll need to start off with a pre-trained version of Gemma in the `.nemo` format. You can download the base model in `.nemo` format from the NVIDIA GPU Cloud, or convert checkpoints from another framework into a `.nemo` file. You can choose to use the 2B parameter or 7B parameter Gemma models for this notebook -- the 2B model will be faster to customize, but the 7B model will be more capable. \n", + "\n", + "You can download either model from the NVIDIA NGC Catalog, using the NGC CLI. The instructions to install and configure the NGC CLI can be found [here](https://ngc.nvidia.com/setup/installers/cli).\n", + "\n", + "To download the model, execute one of the following commands, based on which model you want to use:\n", + "\n", + "```bash\n", + "ngc registry model download-version \"nvidia/nemo/gemma_2b_base:1.0\"\n", + "```\n", + "\n", + "or\n", + "\n", + "```bash\n", + "ngc registry model download-version \"nvidia/nemo/gemma_7b_base:1.0\"\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Getting NeMo Framework\n", + "\n", + "NVIDIA NeMo Framework is a generative AI framework built for researchers and PyTorch developers working on large language models (LLMs), multimodal models (MM), automatic speech recognition (ASR), and text-to-speech synthesis (TTS). 
The primary objective of NeMo is to provide a scalable framework for researchers and developers from industry and academia to more easily implement and design new generative AI models by being able to leverage existing code and pretrained models.\n", + "\n", + "If you haven't already, you can pull a container that includes the version of NeMo Framework and all dependencies needed for this notebook with the following:\n", + "\n", + "```bash\n", + "docker pull nvcr.io/nvidia/nemo:24.01.gemma\n", + "```\n", + "\n", + "The best way to run this notebook is from within the container. You can do that by launching the container with the following command:\n", + "\n", + "```bash\n", + "docker run -it --rm --gpus all --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:24.01.gemma\n", + "```\n", + "\n", + "Then, from within the container, start the Jupyter server with a command such as `jupyter lab --ip 0.0.0.0 --port 8888 --allow-root`.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## SFT Data Formatting" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To begin, we'll need to prepare a dataset to tune our model on.\n", + "\n", + "This notebook uses the [Dolly dataset](https://huggingface.co/datasets/databricks/databricks-dolly-15k) as an example to demonstrate how to format your SFT data. This dataset consists of 15,000 instruction-context-response triples.\n", + "\n", + "First, to download the data, enter the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!wget https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/main/databricks-dolly-15k.jsonl" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The downloaded data, stored at `databricks-dolly-15k.jsonl`, is a `JSONL` file with each line formatted like this:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "{\n", + " \"instruction\": \"When did Virgin Australia start operating?\",\n", + " \"context\": \"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.[3] It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.[4]\",\n", + " \"response\": \"Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.\",\n", + " \"category\": \"closed_qa\"\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As this example shows, there are no clear “input” and “output” fields, which are required for SFT with NeMo. To remedy this, we can do some data pre-processing. This cell converts the `instruction`, `context`, and `response` fields into `input` and `output`. It also concatenates the `instruction` and `context` fields with a `\n\n` separator, randomizing the order in which they appear in the input, and writes the result to a new `JSONL` file called `databricks-dolly-15k-output.jsonl`."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import numpy as np\n", + "\n", + "path_to_data = \"databricks-dolly-15k.jsonl\"\n", + "output_path = f\"{path_to_data.split('.')[0]}-output.jsonl\"\n", + "with open(path_to_data, \"r\") as f, open(output_path, \"w\") as g:\n", + " for line in f:\n", + "\n", + " # Read JSONL line in original format\n", + " line = json.loads(line)\n", + " context = line[\"context\"].strip()\n", + "\n", + " # Randomize context and instruction order.\n", + " if context != \"\":\n", + " context_first = np.random.randint(0, 2) == 0\n", + " if context_first:\n", + " instruction = line[\"instruction\"].strip()\n", + " assert instruction != \"\"\n", + " input = f\"{context}\\n\\n{instruction}\"\n", + " output = line[\"response\"]\n", + " else:\n", + " instruction = line[\"instruction\"].strip()\n", + " assert instruction != \"\"\n", + " input = f\"{instruction}\\n\\n{context}\"\n", + " output = line[\"response\"]\n", + " else:\n", + " input = line[\"instruction\"]\n", + " output = line[\"response\"]\n", + "\n", + " # Write JSONL line in new format\n", + " g.write(\n", + " json.dumps(\n", + " {\"input\": input, \"output\": output, \"category\": line[\"category\"]}\n", + " )\n", + " + \"\\n\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, the dataset is a `JSONL` file with each line formatted like this: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "{\n", + " \"input\": \"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.\\n\\nWhen did Virgin Australia start operating?\",\n", + " \"output\": \"Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.\",\n", + " \"category\": \"closed_qa\"\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## SFT Training\n", + "\n", + "To perform the SFT Training, we'll use NVIDIA NeMo-Aligner. NeMo-Aligner is a scalable toolkit for efficient model alignment, built using the [NeMo Toolkit](https://github.com/NVIDIA/NeMo) which allows for scaling training up to 1000s of GPUs using tensor, data and pipeline parallelism for all components of alignment. Users can do end-to-end model alignment on a wide range of model sizes and take advantage of all the parallelism techniques to ensure their model alignment is done in a performant and resource efficient manner.\n", + "\n", + "To install NeMo Aligner, we can clone the repository and install it using `pip`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "\n", + "git clone https://github.com/NVIDIA/NeMo-Aligner.git -b dev\n", + "cd NeMo-Aligner\n", + "pip install -e ." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you want to track and visualize your SFT training experiments, you can login to Weights and Biases. If you don't want to use wandb, make sure to set the argument `exp_manager.create_wandb_logger=False` when launching your job." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import wandb\n", + "wandb.login()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To run SFT locally on a single node, you can use the following command. Note the `trainer.num_nodes` and `trainer.devices` arguments, which define how many nodes and how many total GPUs you want to use for training. Make sure the source model, output model, and dataset paths all match your local setup.\n", + "\n", + "If you'd like to perform multi-node finetuning -- for example on a slurm cluster -- you can find more information in the [NeMo-Aligner user guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/modelalignment/rlhf.html#instruction-following-taught-by-supervised-fine-tuning-sft)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "%%bash\n", + "\n", + "cd NeMo-Aligner\n", + "\n", + "python examples/nlp/gpt/train_gpt_sft.py \\\n", + " name=gemma_dolly_finetuned \\\n", + " trainer.precision=bf16 \\\n", + " trainer.num_nodes=1 \\\n", + " trainer.devices=8 \\\n", + " trainer.sft.max_steps=-1 \\\n", + " trainer.sft.limit_val_batches=40 \\\n", + " trainer.sft.val_check_interval=1000 \\\n", + " model.tensor_model_parallel_size=4 \\\n", + " model.pipeline_model_parallel_size=1 \\\n", + " model.megatron_amp_O2=True \\\n", + " model.restore_from_path=../gemma_7b_pt.nemo \\\n", + " model.optim.lr=5e-6 \\\n", + " model.answer_only_loss=True \\\n", + " ++model.bias_activation_fusion=true \\\n", + " model.data.num_workers=0 \\\n", + " model.data.train_ds.micro_batch_size=1 \\\n", + " model.data.train_ds.global_batch_size=128 \\\n", + " model.data.train_ds.file_path=../databricks-dolly-15k-output.jsonl \\\n", + " model.data.validation_ds.micro_batch_size=1 \\\n", + " model.data.validation_ds.global_batch_size=128 \\\n", + " model.data.validation_ds.drop_last=True \\\n", + " model.data.validation_ds.file_path=../databricks-dolly-15k-output.jsonl \\\n", + " exp_manager.create_wandb_logger=True \\\n", + " exp_manager.explicit_log_dir=../results \\\n", + " exp_manager.wandb_logger_kwargs.project=sft_run \\\n", + " exp_manager.wandb_logger_kwargs.name=dolly_sft_run \\\n", + " exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \\\n", + " exp_manager.resume_if_exists=True \\\n", + " exp_manager.resume_ignore_no_checkpoint=True \\\n", + " exp_manager.create_checkpoint_callback=True \\\n", + " exp_manager.checkpoint_callback_params.monitor=validation_loss" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When training is finished, you should see a file called `results/checkpoints/gemma_dolly_finetuned.nemo` that contains the weights of your new, instruction-tuned model." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exporting to TensorRT-LLM" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "TensorRT-LLM is an open-source library for optimizing inference performance to achieve state-of-the-art speed on NVDIA GPUs. 
The NeMo Framework offers an easy way to compile `.nemo` models into optimized TensorRT-LLM engines which you can run locally embedded in another application, or serve to other applications using a server like Triton Inference Server." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To start with, let's create a folder where our exported model will land" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!mkdir gemma_dolly_finetuned_trt_llm" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To export the model, we just need to create an instance of the `TensorRTLLM` class and call the `TensorRTLLM.export()` function -- pointing the `nemo_checkpoint_path` argument to the newly fine-tuned model we trained above.\n", + "\n", + "This creates a couple of files in the folder we created -- an `engine` file that holds the weights and the compiled execution graph of the model, a `tokenizer.model` file which holds the tokenizer information, and `config.json` which holds some metadata about the model (along with `model.cache`, which caches some operations and makes it faster to re-compile the model in the future.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from nemo.export import TensorRTLLM\n", + "trt_llm_exporter = TensorRTLLM(model_dir=\"gemma_dolly_finetuned_trt_llm\")\n", + "trt_llm_exporter.export(nemo_checkpoint_path=\"results/checkpoints/gemma_dolly_finetuned.nemo\", model_type=\"gemma\", n_gpus=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "With the model exported into TensorRT-LLM, we can perform very fast inference" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "trt_llm_exporter.forward([\"NVIDIA and Google are\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There's also a convenient function to deploy the model as a service, backed by Triton Inference Server:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from nemo.deploy import DeployPyTriton\n", + "\n", + "nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name=\"gemma\")\n", + "nm.deploy()\n", + "nm.serve()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "nemo_lora", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/models/README.md b/models/README.md index e69de29bb..61ad69ff1 100644 --- a/models/README.md +++ b/models/README.md @@ -0,0 +1,4 @@ +Generative AI Model Examples +==== + +1. [Gemma](./Gemma/) \ No newline at end of file