# Customization Dataset Preparation

The NeMo LLM Customization service requires data as a .jsonl file in which each line has exactly two fields: `prompt` and `completion`.
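
As an illustration of that layout, the sketch below writes and reads a small .jsonl file with one JSON object per line. The sample texts and the helper names `write_jsonl` and `read_jsonl` are made up for this example; they are not part of the NeMo LLM client.

```python
import json

# Two made-up samples; each has exactly the two required fields.
samples = [
    {"prompt": "Question: What is the capital of France? Answer:", "completion": "Paris"},
    {"prompt": "Question: What is 2 + 2? Answer:", "completion": "4"},
]


def write_jsonl(rows, path):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a .jsonl file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    write_jsonl(samples, "example_dataset.jsonl")
    print(read_jsonl("example_dataset.jsonl"))
```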

However, your data might not already be in this format (or even this file type).

This tutorial shows you how to convert the data you have into the format you need, quickly and easily.

What you will need:


1.   NeMo LLM Python Client
2.   Your data file (.jsonl, .json, .csv, .tsv or .xlsx), with one sample per row. Make sure the directory containing the file is readable and writable; if it is not, change its permissions with `chmod`. Don't worry, your existing file will not be overwritten.



# Proof-read/validate data already in prompt-and-completion format

If your dataset is already in the prompt-and-completion format, you can use this tool to check that it is prepared in a way that suits the Customization service.

With close to a dozen factors that affect how well training works, it is easy to overlook something (we all do!).

```
#cd to the directory containing dataset_validation.py

!python dataset_validation.py --filename <filename>
```
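
To give a feel for the most basic of these checks, here is a minimal sketch that verifies each line is valid JSON with exactly the `prompt` and `completion` fields. It is an assumption-laden illustration, not the logic of `dataset_validation.py`, which considers many more factors; the name `find_format_problems` is hypothetical.

```python
import json

REQUIRED_FIELDS = {"prompt", "completion"}


def find_format_problems(lines):
    """Return (line_number, message) pairs for lines that break the format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored in this sketch
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict) or set(record) != REQUIRED_FIELDS:
            problems.append((number, "fields must be exactly prompt and completion"))
    return problems


if __name__ == "__main__":
    with open("example_dataset.jsonl", encoding="utf-8") as f:
        for number, message in find_format_problems(f):
            print(f"line {number}: {message}")
```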

# Making changes following tool recommendations

After running this code, you will see a list of suggestions under ACTIONABLE MESSAGES as well as some insights into your dataset under INFORMATIONAL MESSAGES.

We suggest you prioritize the changes suggested under ACTIONABLE MESSAGES, but also look at the INFORMATIONAL MESSAGES to confirm that the changes have the effect you expect.

Many ACTIONABLE MESSAGES include an additional flag that you can add to the previous command, such as `--drop_duplicates` or `--long_seq_model`.

For instance, if you would like to drop duplicate samples, run:

```
!python dataset_validation.py --filename <filename> --drop_duplicates
```

Some recommendations have to be carried out outside this tool. For instance, if you have too few data points, you might need to add more.


# Formatting data into Prompt/Completion

If your data is not already in Prompt/Completion format, this tool can also help.

For instance, if you are working on a question answering task, you would typically have the columns `context`, `question` and `answer`.

To format `context` and `question` into a prompt, use the flag:

```
--prompt_template "Context: {context} Question: {question} Answer:"
```

The tool uses this template to build the `prompt` field for each row of your data.

The completion template works in the same way:

```
--completion_template "{answer}"
```


```
!python dataset_validation.py --filename <filename> --prompt_template "Context: {context} Question: {question} Answer:" --completion_template "{answer}"
```
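
The effect of the two templates on a single row can be sketched with Python's built-in `str.format`. This assumes the placeholders are filled in from the column of the same name; the tool's actual implementation may differ, and the `apply_templates` helper and the sample row are made up for illustration.

```python
def apply_templates(row, prompt_template, completion_template):
    """Fill each {column} placeholder with that column's value for one row."""
    return {
        "prompt": prompt_template.format(**row),
        "completion": completion_template.format(**row),
    }


if __name__ == "__main__":
    # A made-up question answering row with the three typical columns.
    row = {
        "context": "Paris is the capital of France.",
        "question": "What is the capital of France?",
        "answer": "Paris",
    }
    sample = apply_templates(
        row,
        prompt_template="Context: {context} Question: {question} Answer:",
        completion_template="{answer}",
    )
    print(sample["prompt"])
    print(sample["completion"])
```

Ending the prompt template with `Answer:` and keeping the answer itself in the completion template gives every sample the same clear boundary between prompt and completion.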

# Additional flags included in this tool


1.   `--long_seq_model`: Use this flag to let the preparation tool accept a higher maximum sequence length (40000 chars instead of 10000 chars).
2.   `--drop_duplicates`: Use this flag to drop rows whose prompt and completion are both exactly the same as those of another row.
3.   `--split_train_validation`: Use this flag to split one file into separate train and validation files.
4.   `--val_proportion 0.1`: Use a float between 0 and 1 (default 0.1) to control what fraction of the dataset goes to the validation set; the remainder goes to the train set.
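
What `--drop_duplicates`, `--split_train_validation` and `--val_proportion` do can be sketched as follows. This is only an approximation under stated assumptions: the shuffling, the fixed seed and the rounding rule are choices made for this example, and the tool may behave differently. Both function names are hypothetical.

```python
import random


def drop_duplicates(rows):
    """Keep the first occurrence of each exact (prompt, completion) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["prompt"], row["completion"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def split_train_validation(rows, val_proportion=0.1, seed=0):
    """Shuffle the rows, then return (train, validation) lists."""
    if not 0 < val_proportion < 1:
        raise ValueError("val_proportion must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_val = round(len(shuffled) * val_proportion)  # assumed rounding rule
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    rows = [{"prompt": f"p{i}", "completion": f"c{i}"} for i in range(20)]
    rows.append(dict(rows[0]))  # add one exact duplicate
    unique = drop_duplicates(rows)
    train, val = split_train_validation(unique, val_proportion=0.1)
    print(len(unique), len(train), len(val))
```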
      

