## How to prepare data for fine-tuning

**Install the dependencies needed to download Hugging Face datasets.**

In [None]:
# %pip install datasets --upgrade
# %pip install py7zr

### 1. Pick the dataset for fine-tuning the model

We use the [samsum](https://huggingface.co/datasets/samsum) dataset. The next few cells show basic data preparation for fine-tuning:
* Visualize some data rows.
* Preprocess the data into the required format. This step is important for text completion because it adds the required sequences/separators to the data. This is how the text-completion task is repurposed for a specific task such as summarization or translation.
* During fine-tuning, the `text` column is concatenated with the `ground_truth` column to produce the fine-tuning input. Hence, prepare the data so that `text + ground_truth` is your actual fine-tuning data.
* The fine-tuning pipeline adds the bos and eos tokens to the data; you do not need to add them explicitly.
* We want this sample to run quickly, so we save smaller `train`, `validation` and `test` files containing 10% of the original rows. This means the fine-tuned model will have lower accuracy, so it should not be put to real-world use.
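
The concatenation described in the list above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the helper name `build_training_example` and the toy record are hypothetical, and the bos/eos tokens that the pipeline adds are deliberately left out.

```python
def build_training_example(text: str, ground_truth: str) -> str:
    """Join prompt and target as `text + ground_truth` (hypothetical helper)."""
    return text + ground_truth


# Hypothetical toy record, not taken from samsum.
record = {
    "text": "Summarize this dialog:\nA: Hi!\nB: Hello!\n---\nSummary:\n",
    "ground_truth": "A and B greet each other.",
}

full_example = build_training_example(record["text"], record["ground_truth"])
print(full_example)
```

If the columns are joined exactly as `text + ground_truth`, with nothing in between, any whitespace or newline that should separate the prompt from the summary has to be part of `text` itself. That is why the prompt used in this sample ends with `Summary:\n`.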

##### Here is an example of what the data should look like

Text completion requires the training data to include at least two fields, `text` and `ground_truth`, as in this example. The examples below are from the samsum dataset.

Original dataset:

| dialogue | summary  |
| :- | :- |
| Eric: MACHINE!\r\nRob: That's so gr8!\r\nEric: I know! And shows how Americans see Russian ;)\r\nRob: And it's really funny!\r\nEric: I know! I especially like the train part!\r\nRob: Hahaha! No one talks to the machine like that!\r\nEric: Is this his only stand-up?\r\nRob: Idk. I'll check.\r\nEric: Sure.\r\nRob: Turns out no! There are some of his stand-ups on youtube.\r\nEric: Gr8! I'll watch them now!\r\nRob: Me too!\r\nEric: MACHINE!\r\nRob: MACHINE!\r\nEric: TTYL?\r\nRob: Sure :) | Eric and Rob are going to watch a stand-up on youtube. | 
| Will: hey babe, what do you want for dinner tonight?\r\nEmma:  gah, don't even worry about it tonight\r\nWill: what do you mean? everything ok?\r\nEmma: not really, but it's ok, don't worry about cooking though, I'm not hungry\r\nWill: Well what time will you be home?\r\nEmma: soon, hopefully\r\nWill: you sure? Maybe you want me to pick you up?\r\nEmma: no no it's alright. I'll be home soon, i'll tell you when I get home. \r\nWill: Alright, love you. \r\nEmma: love you too. | Emma will be home soon and she will let Will know. | 

Formatted dataset the user might pass:

| text (dialogue) | ground_truth (summary) |
| :- | :- |
| Summarize this dialog:\nEric: MACHINE!\r\nRob: That's so gr8!\r\nEric: I know! And shows how Americans see Russian ;)\r\nRob: And it's really funny!\r\nEric: I know! I especially like the train part!\r\nRob: Hahaha! No one talks to the machine like that!\r\nEric: Is this his only stand-up?\r\nRob: Idk. I'll check.\r\nEric: Sure.\r\nRob: Turns out no! There are some of his stand-ups on youtube.\r\nEric: Gr8! I'll watch them now!\r\nRob: Me too!\r\nEric: MACHINE!\r\nRob: MACHINE!\r\nEric: TTYL?\r\nRob: Sure :)\n---\nSummary:\n | Eric and Rob are going to watch a stand-up on youtube. | 
| Summarize this dialog:\nWill: hey babe, what do you want for dinner tonight?\r\nEmma:  gah, don't even worry about it tonight\r\nWill: what do you mean? everything ok?\r\nEmma: not really, but it's ok, don't worry about cooking though, I'm not hungry\r\nWill: Well what time will you be home?\r\nEmma: soon, hopefully\r\nWill: you sure? Maybe you want me to pick you up?\r\nEmma: no no it's alright. I'll be home soon, i'll tell you when I get home. \r\nWill: Alright, love you. \r\nEmma: love you too. \n---\nSummary:\n | Emma will be home soon and she will let Will know. | 
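
As a rough sketch of how one raw row maps to a formatted row like those in the table above, and of what it looks like as a single JSONL line, consider the snippet below. The helper `to_record`, the constant `PROMPT_TEMPLATE` and the toy dialogue are hypothetical; only the prompt wording and the `text`/`ground_truth` field names are taken from the tables.

```python
import json

# Prompt wording copied from the formatted table.
PROMPT_TEMPLATE = "Summarize this dialog:\n{}\n---\nSummary:\n"


def to_record(dialogue: str, summary: str) -> dict:
    """Wrap a raw (dialogue, summary) pair in the text/ground_truth schema."""
    return {"text": PROMPT_TEMPLATE.format(dialogue), "ground_truth": summary}


# Hypothetical toy row, not taken from samsum.
record = to_record(
    "Amy: Lunch at noon?\r\nBen: Sure, see you then.",
    "Amy and Ben will have lunch at noon.",
)

# json.dumps escapes the newlines inside the text, so the record stays on one line.
jsonl_line = json.dumps(record)
print(jsonl_line)
```

Each line of a `.jsonl` file holds one such JSON object, which is why line breaks inside a dialogue must stay escaped rather than appear as real newlines in the file.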
 

In [None]:
# Download the dataset from huggingface datasets library 
# and save it in jsonl format
import os

download_dir = "samsum-dataset"
dataset_name = "samsum"
fraction = 1  # keep every row here; the 10% subsample is taken in a later cell
os.makedirs(download_dir, exist_ok=True)

# import hugging face datasets library
import datasets
from datasets import get_dataset_split_names

for split in get_dataset_split_names(dataset_name):
    print(f"Loading {split} split of {dataset_name} dataset...")
    # load the split of the dataset
    dataset = datasets.load_dataset(path=dataset_name, split=split)
    dataset.select(range(int(dataset.num_rows * fraction))).to_json(
        os.path.join(download_dir, f"{split}.jsonl"))

In [None]:
# load the ./samsum-dataset/train.jsonl file into a pandas dataframe and show the first 5 rows
import pandas as pd

pd.set_option(
    "display.max_colwidth", 0
)  # set the max column width to 0 to display the full text
df = pd.read_json("./samsum-dataset/train.jsonl", lines=True)
df.head()

In [None]:
# create a function to preprocess the dataset into the desired format
def get_preprocessed_samsum(df):
    # plain template string; {} is filled with each dialogue
    prompt = "Summarize this dialog:\n{}\n---\nSummary:\n"

    df["text"] = df["dialogue"].map(prompt.format)
    df["ground_truth"] = df["summary"]
    # keep only the two columns used for fine-tuning
    df = df[["text", "ground_truth"]]

    return df

In [None]:
# load test.jsonl, train.jsonl and validation.jsonl from the ./samsum-dataset folder into pandas dataframes

import pandas as pd
test_df = pd.read_json("./samsum-dataset/test.jsonl", lines=True)
train_df = pd.read_json("./samsum-dataset/train.jsonl", lines=True)
validation_df = pd.read_json("./samsum-dataset/validation.jsonl", lines=True)
# map the train, validation and test dataframes to preprocess function
train_df = get_preprocessed_samsum(train_df)
validation_df = get_preprocessed_samsum(validation_df)
test_df = get_preprocessed_samsum(test_df)
# show the first 5 rows of the train dataframe
train_df.head()

In [None]:
# save 10% of the rows from the train, validation and test dataframes into files with small_ prefix in the ./samsum-dataset folder
frac = 0.1
seed = 0  # fixed seed so the subsample is reproducible across runs
train_df.sample(frac=frac, random_state=seed).to_json(
    "./samsum-dataset/small_train.jsonl", orient="records", lines=True
)
validation_df.sample(frac=frac, random_state=seed).to_json(
    "./samsum-dataset/small_validation.jsonl", orient="records", lines=True
)
test_df.sample(frac=frac, random_state=seed).to_json(
    "./samsum-dataset/small_test.jsonl", orient="records", lines=True
)
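
Before using the saved files, it may be worth checking that every row has the two expected string fields. The following is a minimal sketch using only the standard library; the function name `validate_jsonl` and the demo rows are hypothetical, and the check covers only the `text`/`ground_truth` schema described in this notebook, not any further requirements a fine-tuning pipeline might have.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = ("text", "ground_truth")


def validate_jsonl(path: str) -> int:
    """Return the number of rows, raising ValueError on the first malformed row."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            for field in REQUIRED_FIELDS:
                if not isinstance(row.get(field), str):
                    raise ValueError(
                        f"{path}:{line_no}: missing or non-string '{field}'"
                    )
            count += 1
    return count


# Demo on a throwaway file with hypothetical rows; in the notebook you would
# pass a real path such as "./samsum-dataset/small_train.jsonl".
demo_rows = [
    {"text": "Summarize this dialog:\nA: Hi\n---\nSummary:\n", "ground_truth": "A says hi."},
    {"text": "Summarize this dialog:\nB: Bye\n---\nSummary:\n", "ground_truth": "B says bye."},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for row in demo_rows:
            f.write(json.dumps(row) + "\n")
    row_count = validate_jsonl(demo_path)

print(row_count)
```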