# Preparing the data

In [1]:
# Make the modules in ../src importable from this notebook
import os
import sys
sys.path.append(os.path.abspath('../src'))

In [2]:
ORIGINAL_TRAIN_DATA = "../data/train.jsonl"
ORIGINAL_VALIDATION_DATA = "../data/validation.jsonl"

SPLIT_FOLDER = "../data/splitted"

TRAIN_DATA = f"{SPLIT_FOLDER}/train.jsonl"
VALIDATION_DATA = f"{SPLIT_FOLDER}/validation.jsonl"
TEST_DATA = f"{SPLIT_FOLDER}/test.jsonl"

### Split data

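Split the data with a custom function: the original validation file is divided into validation and test files, and the original training file is copied unchanged. The later sections convert each split into model-specific formats and, for OpenAI, validate the result with the OpenAI data preparation tool.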

In [3]:
from split_data import split_data

In [4]:
split_data(input=ORIGINAL_VALIDATION_DATA, output=SPLIT_FOLDER, seed=1337)

In [5]:
import shutil
# The training set is not split, so copy it next to the new validation and test files
shutil.copyfile(ORIGINAL_TRAIN_DATA, TRAIN_DATA)

'../data/splitted/train.jsonl'
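
The implementation of `split_data` is not shown in this notebook. As a rough illustration of what a seeded split of a JSONL file can look like, here is a minimal sketch. The function name `split_jsonl`, the `test_fraction` parameter, and the 75/25 ratio are assumptions made for this example (the row counts reported further down, 602 test and 198 validation, only suggest a ratio in that region); the real function may work differently, for example by stratifying.

```python
import random
from pathlib import Path


def split_jsonl(input_path, output_dir, seed, test_fraction=0.75):
    """Hypothetical stand-in for split_data: shuffle the rows with a fixed
    seed, then write test.jsonl and validation.jsonl into output_dir."""
    text = Path(input_path).read_text(encoding="utf-8")
    lines = [line for line in text.splitlines() if line.strip()]

    # A local Random instance keeps the split reproducible without
    # touching the global random state.
    rng = random.Random(seed)
    rng.shuffle(lines)

    n_test = round(len(lines) * test_fraction)
    parts = {"test.jsonl": lines[:n_test], "validation.jsonl": lines[n_test:]}

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in parts.items():
        (out / name).write_text("".join(row + "\n" for row in rows), encoding="utf-8")

    # Return the row count per output file for a quick sanity check.
    return {name: len(rows) for name, rows in parts.items()}
```

Calling the sketch twice with the same seed produces identical files, which is the property the `seed=1337` argument above is presumably there to guarantee.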

### Prepare data for OpenAI

In [6]:
OPENAI_MODEL = "ada"
OPENAI_TRAIN_DATA = "../data/parsed/openai/train.jsonl"
OPENAI_TEST_DATA = "../data/parsed/openai/test.jsonl"
OPENAI_VALIDATION_DATA = "../data/parsed/openai/validation.jsonl"

In [7]:
import prepare_data_openai

In [8]:
# Convert the training split to prompt-completion pairs, then check it with the OpenAI CLI
prepare_data_openai.prepare(input=TRAIN_DATA, output=OPENAI_TRAIN_DATA, model=OPENAI_MODEL, validation=False)
!openai tools fine_tunes.prepare_data --file $OPENAI_TRAIN_DATA --quiet

Analyzing...

- Your file contains 3200 prompt-completion pairs
- There are 3 duplicated prompt-completion sets. These are rows: [1248, 2315, 2698]
- There are 1 examples that are very long. These are rows: [2594]
For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.
- All prompts end with suffix `\n\n###\n\n`
- All prompts start with prefix `CLICKBAIT:

`
- All completions end with suffix ` END`

Based on the analysis we will perform the following actions:
- [Recommended] Remove 3 duplicate rows [Y/n]: Y
- [Recommended] Remove 1 long examples [Y/n]: Y
The indices of the long examples has changed as a result of a previously applied recommendation.
The 1 long examples to be dropped are now at the following indices: [2592]


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified file to `../data/parsed/openai/train_prepared (1).jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_t
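
The analysis above reports three formatting conventions: every prompt starts with `CLICKBAIT:` followed by a blank line, every prompt ends with `\n\n###\n\n`, and every completion ends with ` END`. The sketch below shows how a single row in that shape could be built. It is not the code of `prepare_data_openai`; the function name, the argument names `text` and `answer`, the leading space on the completion, and the sample strings are all assumptions for illustration.

```python
import json

PROMPT_PREFIX = "CLICKBAIT:\n\n"   # prefix reported by the preparation tool
PROMPT_SUFFIX = "\n\n###\n\n"      # separator that marks the end of the prompt
COMPLETION_SUFFIX = " END"         # marker that ends every completion


def to_prompt_completion(text, answer):
    """Build one fine-tuning row; `text` and `answer` are hypothetical inputs."""
    return {
        "prompt": f"{PROMPT_PREFIX}{text}{PROMPT_SUFFIX}",
        # The leading space is an assumption, not something the output above confirms.
        "completion": f" {answer}{COMPLETION_SUFFIX}",
    }


# Made-up sample strings, only to show the resulting shape.
row = to_prompt_completion("You won't believe what happened next", "Nothing happened")
print(json.dumps(row))
```

Each row is one JSON object per line, so writing `json.dumps(row)` followed by a newline for every example yields a file in the JSONL layout the tool reads.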

In [9]:
prepare_data_openai.prepare(input=TEST_DATA, output=OPENAI_TEST_DATA, model=OPENAI_MODEL, validation=True)
!openai tools fine_tunes.prepare_data --file $OPENAI_TEST_DATA --quiet

Analyzing...

- Your file contains 602 prompt-completion pairs
- The input file should contain exactly two columns/keys per row. Additional columns/keys present are: ['id', 'type']
- All prompts end with suffix `\n\n###\n\n`
- All prompts start with prefix `CLICKBAIT:

`
- All completions end with suffix ` END`

Based on the analysis we will perform the following actions:
- [Necessary] Remove additional columns/keys: ['id', 'type']


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified file to `../data/parsed/openai/test_prepared (1).jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "../data/parsed/openai/test_prepared (1).jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `\n\n###\n\n` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=[" END"]` so that the generated texts ends at the expect
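
The only change the tool applies to the test file is removing the extra `id` and `type` keys, because it expects exactly two keys per row. The same clean-up can be sketched in a few lines of Python. The function name `keep_prompt_completion` is hypothetical, and the sketch assumes every row already contains `prompt` and `completion`.

```python
import json

ALLOWED_KEYS = ("prompt", "completion")


def keep_prompt_completion(jsonl_text):
    """Return JSONL text whose rows hold only the prompt and completion keys."""
    cleaned = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        # Raises KeyError if a row lacks one of the two required keys.
        cleaned.append(json.dumps({key: row[key] for key in ALLOWED_KEYS}))
    return "".join(item + "\n" for item in cleaned)
```

Keeping `id` and `type` in a separate copy of the file may still be useful for evaluation, since they allow predictions to be matched back to the original rows.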

In [10]:
prepare_data_openai.prepare(input=VALIDATION_DATA, output=OPENAI_VALIDATION_DATA, model=OPENAI_MODEL, validation=True)
!openai tools fine_tunes.prepare_data --file $OPENAI_VALIDATION_DATA --quiet

Analyzing...

- Your file contains 198 prompt-completion pairs
- The input file should contain exactly two columns/keys per row. Additional columns/keys present are: ['id', 'type']
- All prompts end with suffix `\n\n###\n\n`
- All prompts start with prefix `CLICKBAIT:

`
- All completions end with suffix ` END`

Based on the analysis we will perform the following actions:
- [Necessary] Remove additional columns/keys: ['id', 'type']


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified file to `../data/parsed/openai/validation_prepared (1).jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "../data/parsed/openai/validation_prepared (1).jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `\n\n###\n\n` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=[" END"]` so that the generated texts ends a
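
The tool's closing advice applies at inference time: the prompt has to end with the same `\n\n###\n\n` separator used in training, and the request should pass `stop=[" END"]`. The sketch below only assembles the request arguments and cleans a returned string; it does not call the API. The function names, the `max_tokens` default, and `temperature=0` are assumptions chosen for the example, and the model name is a placeholder.

```python
PROMPT_PREFIX = "CLICKBAIT:\n\n"
PROMPT_SEPARATOR = "\n\n###\n\n"
STOP_SEQUENCE = " END"


def build_completion_request(text, model, max_tokens=64):
    """Assemble keyword arguments for a completions call; nothing is sent here."""
    return {
        "model": model,  # placeholder for the fine-tuned model name
        "prompt": f"{PROMPT_PREFIX}{text}{PROMPT_SEPARATOR}",
        "stop": [STOP_SEQUENCE],
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def clean_completion(raw):
    """Strip whitespace and a trailing END marker in case it is still present."""
    marker = STOP_SEQUENCE.strip()
    raw = raw.strip()
    if raw.endswith(marker):
        raw = raw[: -len(marker)]
    return raw.strip()
```

Building the prompt from the same constants that produced the training rows keeps the two formats from drifting apart.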

### Prepare data for BERT

In [11]:
BERT_TRAIN_DATA = "../data/parsed/bert/train.jsonl"
BERT_TEST_DATA = "../data/parsed/bert/test.jsonl"
BERT_VALIDATION_DATA = "../data/parsed/bert/validation.jsonl"

In [12]:
import prepare_data_bert

In [13]:
prepare_data_bert.prepare(input=TRAIN_DATA, output=BERT_TRAIN_DATA)

In [14]:
prepare_data_bert.prepare(input=TEST_DATA, output=BERT_TEST_DATA)

In [15]:
prepare_data_bert.prepare(input=VALIDATION_DATA, output=BERT_VALIDATION_DATA)

### Prepare data for LLaMA

In [16]:
LLAMA_TRAIN_DATA = "../data/parsed/llama/train.jsonl"
LLAMA_TEST_DATA = "../data/parsed/llama/test.jsonl"
LLAMA_VALIDATION_DATA = "../data/parsed/llama/validation.jsonl"

In [17]:
import prepare_data_llama

In [18]:
prepare_data_llama.prepare(input=TRAIN_DATA, output=LLAMA_TRAIN_DATA)

In [19]:
prepare_data_llama.prepare(input=TEST_DATA, output=LLAMA_TEST_DATA)

In [20]:
prepare_data_llama.prepare(input=VALIDATION_DATA, output=LLAMA_VALIDATION_DATA)