# Fine-tune an open-source ChatGPT alternative

In this tutorial, we are going to fine-tune the new [GPT-NeoXT-Chat-Base-20B](https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B) on the [ELI5](https://huggingface.co/datasets/eli5) dataset to improve the explanation and question-answering skills of the agent. The [ELI5](https://huggingface.co/datasets/eli5) dataset is an English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers. We are going to use Hugging Face Transformers and DeepSpeed ZeRO to fine-tune our model.

In this tutorial, you will learn how to:

1. Set up the environment
2. Create and prepare the chat dataset
3. Fine-tune the GPT model using DeepSpeed
4. Test the new agent

Let's get started! 🚀

*Note: This tutorial was created and run on a p4d.24xlarge AWS EC2 instance with 8x NVIDIA A100 40GB GPUs.*



## 1. Setup Environment

The first step is installing the Hugging Face libraries, including transformers and datasets, as well as DeepSpeed. Running the following cell will install all the required packages.

In [None]:
# install torch with the correct cuda version, check nvcc --version
!pip install torch --extra-index-url https://download.pytorch.org/whl/cu116 --upgrade
# install Hugging Face Libraries
!pip install "transformers==4.26.0" "datasets==2.9.0" "accelerate==0.16.0" "evaluate==0.4.0" --upgrade
# install deepspeed and ninja for jit compilations of kernels
!pip install "deepspeed==0.8.0" ninja --upgrade
# install additional dependencies needed for training
!pip install rouge-score nltk py7zr tensorboard
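
After installation, it can help to confirm that the pinned versions were actually picked up. The following is a minimal sketch using only the standard library; the helper name `check_versions` is hypothetical, and the expected versions are simply the pins from the cell above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell above.
EXPECTED = {
    "transformers": "4.26.0",
    "datasets": "2.9.0",
    "accelerate": "0.16.0",
    "evaluate": "0.4.0",
    "deepspeed": "0.8.0",
}

def check_versions(expected):
    """Return {package: (expected, installed or None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches

# Print one line per package whose installed version differs from the pin.
for name, (wanted, installed) in check_versions(EXPECTED).items():
    print(f"{name}: expected {wanted}, found {installed}")
```

If nothing is printed, every package matches its pin.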

## 2. Create and prepare the dataset

As the base dataset, we will use the [ELI5](https://huggingface.co/datasets/eli5) dataset, but before fine-tuning the model, we need to preprocess the data. We will create a "chat" version of the dataset by adding `<user>` and `<bot>` tokens and appending the end-of-sequence token `<|endoftext|>` to help the model learn to distinguish consecutive examples. Additionally, we pack the samples into chunks of `2048` tokens ([model max length](https://huggingface.co/EleutherAI/gpt-neox-20b)) to avoid unnecessary padding and wasted computation.

The first step is to load our dataset from Hugging Face. The `train_eli5` split contains `272634` samples. We will downsample it to `10 000` samples to keep it closer to the dataset sizes typical of real-world use cases.

In [34]:
from datasets import load_dataset
from transformers import AutoTokenizer 

# Load Tokenizer 
model_id = "togethercomputer/GPT-NeoXT-Chat-Base-20B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load dataset from huggingface.co
dataset_id = "eli5"
dataset = load_dataset(dataset_id, split="train_eli5")

# downsample dataset to 10k
dataset = dataset.shuffle(seed=42).select(range(10_000))

Found cached dataset eli5 (/home/ubuntu/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa)
Loading cached shuffled indices for dataset at /home/ubuntu/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa/cache-ff13b89bd5550ed9.arrow


An [ELI5](https://huggingface.co/datasets/eli5) sample can include multiple answers to a “question”. We will select the answer with the highest user score for our explanation. 

*Note: This dataset would also be a good fit for reinforcement learning, training a transformer to generate answers that receive higher scores. Let me know if you are interested in an example of that.*

In [35]:
import random

def filter_score(sample):
    # create new question field from the post title
    sample["question"] = sample["title"]

    # get the answer with the highest score
    scores = sample["answers"]["score"]
    index_of_best_score = scores.index(max(scores))
    sample["answer"] = sample["answers"]["text"][index_of_best_score]
    return sample

# keep only the question and its best answer; remove all other columns
dataset = dataset.map(filter_score, remove_columns=list(dataset.features))

# print random sample (randrange excludes the upper bound, so the index is always valid)
print(dataset[random.randrange(len(dataset))])

Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa/cache-6fdce9b0bad807df.arrow


{'question': "Why does say 70 degrees indoors in the winter feel cool whereas in the summer 70 degrees indoors is typically a warm indoor temp and you'd turn on your air conditioning?", 'answer': "Is that backwards? \n\nI feel like 70 in the winter is warm af and 70 in the summer is cool af. \n\nLikely because your body acclimates to the certain temps outdoor, and to heat or cool your house is warmer or cooler, etc. Also, lower humidity in the summer compared to outside and the lower humidity will feel cooler...\n\nAround Chicago, 80 degrees F isn't bad in the summer unless you have 60% humidity or more. Florida hitting 100 degrees is way worse than 100 in Arizona, again because the humidity... If it's dryer, moisture on your skin can evaporate faster making you feel cooler."}
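
To make the selection step concrete, here is a small standalone sketch on a hand-made record. The record below is invented for illustration and only mimics the `title` and `answers` fields used above; `pick_best_answer` is a hypothetical helper, not part of the tutorial code.

```python
def pick_best_answer(sample):
    """Return the question and the answer text with the highest score."""
    scores = sample["answers"]["score"]
    best = scores.index(max(scores))  # the first index wins on ties
    return {"question": sample["title"], "answer": sample["answers"]["text"][best]}

# Invented toy record with the same nesting as the fields used above.
toy_sample = {
    "title": "Why is the sky blue?",
    "answers": {
        "text": ["Because of the ocean.", "Air scatters blue light more than red light."],
        "score": [2, 57],
    },
}

print(pick_best_answer(toy_sample))
```

Note that `list.index` returns the first match, so when two answers share the top score, the one listed first is kept.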


The next step is to convert our dataset into a chat version. Here we will follow the instructions on the [Model card](https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B#strengths-of-the-model) and add the EOS token.

In [36]:
# dataset template for chat conversation
template = """<user>: Explain like I am five: {question}
<bot>: {answer}{eos_token}"""

eos_token = tokenizer.eos_token

def template_dataset(sample):
    sample["text"] = template.format(
        question=sample["question"],
        answer=sample["answer"],
        eos_token=eos_token,
    )
    return sample

# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))

# print random sample (randrange excludes the upper bound, so the index is always valid)
print(dataset[random.randrange(len(dataset))])

Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa/cache-53e5a9b9d01c6a2c.arrow


{'text': "<user>: Explain like I am five: Zero Calorie Soft Drinks\n<bot>: First of all, technically it's not actually zero calories, but companies are allowed to advertise them as such if they have less than 5 calories per serving. \n\nSecondly, artificial sweeteners. They're chemicals which either taste sweet but have no caloric value (because the body can't digest them) or they have the same caloric value as sugar but are much, much sweeter, so you need a lot less to achieve the same level of sweetness.<|endoftext|>"}
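
Because the model is trained on this exact layout, prompts at inference time presumably need to follow it as well, stopping right after `<bot>:` so that the model writes the answer. A minimal sketch, assuming the template above; `build_prompt` and `extract_answer` are hypothetical helper names, and the EOS string is hard-coded to the `<|endoftext|>` value visible in the sample output.

```python
EOS_TOKEN = "<|endoftext|>"  # assumption: matches tokenizer.eos_token used above

def build_prompt(question):
    """Format a question like the training template, leaving the bot turn open."""
    return f"<user>: Explain like I am five: {question}\n<bot>:"

def extract_answer(generated_text, prompt):
    """Strip the prompt and anything after the first EOS token from generated text."""
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text
    return completion.split(EOS_TOKEN)[0].strip()

prompt = build_prompt("Why is the sky blue?")
# Stand-in for model output, invented for illustration only.
fake_generation = prompt + " Air scatters blue light the most." + EOS_TOKEN + "<user>: ..."
print(extract_answer(fake_generation, prompt))
```

Cutting at the first EOS token mirrors the role it plays in training, where it marks the boundary between consecutive examples.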


The last step of the data preparation is to tokenize and chunk our dataset. Tokenizing converts our inputs (text) into token IDs that the model can understand. Additionally, we concatenate our dataset samples and split them into chunks of `2048` tokens to avoid unnecessary padding.
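
The packing idea can be sketched on toy data before applying it to real tokenizer output. This is a simplified illustration rather than the tutorial's implementation: `pack` is a hypothetical helper that works on plain lists of integers and uses a chunk length of 4 instead of `2048`.

```python
from itertools import chain

def pack(token_lists, chunk_length, carry=None):
    """Concatenate token lists and cut them into fixed-size chunks.

    Returns (chunks, leftover); leftover holds the tokens that did not fill
    a whole chunk and can be passed as `carry` to the next call.
    """
    tokens = list(carry or []) + list(chain.from_iterable(token_lists))
    usable = (len(tokens) // chunk_length) * chunk_length
    chunks = [tokens[i : i + chunk_length] for i in range(0, usable, chunk_length)]
    return chunks, tokens[usable:]

# Three "tokenized samples" of different lengths (10 tokens in total).
batch_1 = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
chunks, leftover = pack(batch_1, chunk_length=4)
print(chunks)    # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(leftover)  # [9, 10]

# The leftover is prepended to the next batch instead of being padded.
chunks, leftover = pack([[11, 12, 13]], chunk_length=4, carry=leftover)
print(chunks)    # [[9, 10, 11, 12]]
print(leftover)  # [13]
```

Every chunk is completely filled with real tokens, which is why no padding is needed; the trade-off is that a chunk can start or end in the middle of a sample.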

In [75]:
from itertools import chain
from functools import partial

# empty dict to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # largest multiple of chunk_length that fits into the batch
    # (0 if the batch is shorter than one chunk, so everything moves to the remainder)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    # (tokens still left over after the last batch are dropped)
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result

# tokenize and chunk dataset
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")


Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa/cache-96d6e8aa1ef97679.arrow
Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa/cache-fd73d7cdd44a14f4.arrow


Total number of samples: 902
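
As a quick sanity check on that number, the chunk count implies roughly how many tokens the packed dataset holds. The arithmetic below uses only figures from this section (902 chunks, 2048 tokens per chunk, 10 000 samples); the per-sample average is an approximation because the final partial chunk is discarded.

```python
num_chunks = 902        # from the output above
chunk_length = 2048     # tokens per chunk
num_samples = 10_000    # samples after downsampling

total_tokens = num_chunks * chunk_length
avg_tokens_per_sample = total_tokens / num_samples

print(f"Total tokens in packed dataset: {total_tokens}")
print(f"Approx. tokens per sample: {avg_tokens_per_sample:.1f}")
```

This works out to about 1.85 million tokens, or roughly 185 tokens per templated question-answer pair.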
