# Inspecting Flattened Data

One of the key features of axolotl is that it flattens your data from a JSONL file into a prompt template format you specify in the config. For the text-to-SQL dataset configured in [mistral.yml](config/mistral.yml), the prompt template is defined as:


```yaml
datasets:
  # This will be the path used for the data when it is saved to the Volume in the cloud.
  - path: data.jsonl
    ds_type: json
    type:
      # JSONL file contains question, context, answer fields per line.
      # This gets mapped to instruction, input, output axolotl tags.
      field_instruction: question
      field_input: context
      field_output: answer
      # Format is used by axolotl to generate the prompt.
      format: |-
        [INST] Using the schema context below, generate a SQL query that answers the question.
        {input}
        {instruction} [/INST]
```
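To make the mapping concrete, here is a minimal sketch of how one JSONL record could be flattened with the template above. It uses plain `str.format` rather than axolotl's own code, so details such as special tokens may differ. The `flatten` helper and the sample record are hypothetical.

```python
import json

# Prompt template copied from the `format` key in the config above.
PROMPT_FORMAT = (
    "[INST] Using the schema context below, generate a SQL query that answers the question.\n"
    "{input}\n"
    "{instruction} [/INST]"
)

# Field mapping from the config: JSONL key -> axolotl tag.
FIELD_MAP = {"question": "instruction", "context": "input", "answer": "output"}


def flatten(line: str) -> tuple[str, str]:
    """Return (prompt, completion) for one JSONL line. Illustrative only."""
    record = json.loads(line)
    tags = {tag: record[field] for field, tag in FIELD_MAP.items()}
    prompt = PROMPT_FORMAT.format(instruction=tags["instruction"], input=tags["input"])
    return prompt, tags["output"]


# Hypothetical record, not taken from the real dataset.
sample = json.dumps({
    "question": "How many users are there?",
    "context": "CREATE TABLE users (id INT, name TEXT)",
    "answer": "SELECT COUNT(*) FROM users",
})
prompt, completion = flatten(sample)
print(prompt)
print(completion)
```

The prompt is what the model sees as input, and the completion is the text it is trained to produce.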

## Prerequisites

Make sure you install the following dependencies first:

```bash
pip install -U transformers datasets
```

In [None]:
import yaml, os
from pathlib import Path
from datasets import load_from_disk

from transformers import AutoTokenizer

## Step 1: Preprocess Data

It is often useful to preprocess the data and inspect it before training. To do this, pass the `--preproc-only` flag to the `train` command. This preprocesses the data and writes it to the `datasets.path` specified in your config file, so you can check that it is formatted correctly before training.

For example, to preprocess the data for the `mistral.yml` config file, you would run:

```bash
# run this from the root of the repo
modal run --detach src.train \
   --config=config/mistral.yml \
   --data=data/sqlqa.jsonl \
   --preproc-only
```

Modal prints a run tag that you can use to retrieve the preprocessed data. For example, the logs will contain a line like this:

```
Training complete. Run tag: axo-2024-05-09-19-04-56-90c0
```
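If you capture the logs, a small regular expression can pull out the run tag. This is only a sketch: the pattern is inferred from the single log line above, and `extract_run_tag` is a hypothetical helper, not part of Modal or axolotl.

```python
import re
from typing import Optional

# Pattern inferred from the example log line; real log output may differ.
RUN_TAG_PATTERN = re.compile(r"Run tag:\s*(\S+)")


def extract_run_tag(log_text: str) -> Optional[str]:
    """Return the first run tag found in log_text, or None if there is none."""
    match = RUN_TAG_PATTERN.search(log_text)
    return match.group(1) if match else None


log_line = "Training complete. Run tag: axo-2024-05-09-19-04-56-90c0"
run_tag = extract_run_tag(log_line)
print(run_tag)
```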

## Step 2: Download Data

You can use the run tag to download and inspect the preprocessed data with [modal volume](https://modal.com/docs/reference/cli/volume):

In [None]:
# change this to your run tag
RUN_TAG='axo-2024-05-09-19-04-56-90c0'

Inspect the directory structure:

In [None]:
! modal volume ls example-runs-vol {RUN_TAG}

Download the preprocessed data locally into a directory called `_debug_data`:

In [None]:
!rm -rf _debug_data
!modal volume get example-runs-vol {RUN_TAG}/last_run_prepared  _debug_data

## Step 3: Analyze Data

Get the right tokenizer:

In [None]:
!pip install transformers


In [None]:
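# The tokenizer must match the `base_model` of the run you are inspecting.
# To read it from the config instead of hard-coding it:
# with open('../config/mistral.yml', 'r') as f:
#     cfg = yaml.safe_load(f)
# model_id = cfg['base_model']

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('NousResearch/Meta-Llama-3-8B-Instruct')

Load the dataset into a Hugging Face `Dataset`: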

In [None]:
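from pathlib import Path

# Use the run tag set above so the path matches the run you downloaded.
ds_dir = Path(f'_debug_data/{RUN_TAG}/last_run_prepared')
# Pick the first subdirectory, which holds the prepared dataset.
ds_path = [p for p in ds_dir.iterdir() if p.is_dir()][0]
ds = load_from_disk(str(ds_path))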
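Taking the first subdirectory fails with a bare `IndexError` if the download did not work, and silently picks an arbitrary entry if several are present. A slightly more defensive variant might look like the sketch below. `find_prepared_dataset` is a hypothetical helper, and the demo builds a throwaway directory tree rather than touching real data.

```python
import tempfile
from pathlib import Path


def find_prepared_dataset(ds_dir: Path) -> Path:
    """Return the only subdirectory of ds_dir, raising a clear error otherwise."""
    if not ds_dir.is_dir():
        raise FileNotFoundError(f"{ds_dir} does not exist; was the download step run?")
    subdirs = sorted(p for p in ds_dir.iterdir() if p.is_dir())
    if len(subdirs) != 1:
        raise ValueError(
            f"expected exactly one dataset directory in {ds_dir}, found {len(subdirs)}"
        )
    return subdirs[0]


# Demo against a temporary directory; the name "abc123" is made up.
with tempfile.TemporaryDirectory() as tmp:
    prepared = Path(tmp) / "last_run_prepared"
    (prepared / "abc123").mkdir(parents=True)
    found_name = find_prepared_dataset(prepared).name
print(found_name)
```

The returned path can then be passed to `load_from_disk` as before.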

Verify that the data looks good:

In [12]:
print(tok.decode(ds['input_ids'][0]))

<|begin_of_text|> You are given an input 2d grid and an output 2d grid, can you predict what operation was performed on the input to get the output?
input: ['0001112000000', '0000012500000', '0000666660000', '5555666660000', '5555666662000', '5555666665500', '5555510000000', '5555510000000', '0333310000000', '0333300000000', '0000000000000', '0000000000000', '0000000000000']
output: ['0000000000000', '0000000000000', '0000050000000', '0000250000000', '0066660000000', '0566660000000', '2266660000000', '1166661110000', '1066665533000', '1005555533000', '0005555533000', '0005555533000', '0005555500000']Rotate the entire grid 90 degrees counter-clockwise.<|end_of_text|>
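Beyond decoding the full sequence, it can help to check which tokens the model is actually trained on. The sketch below assumes the prepared dataset has a `labels` column that uses `-100` for positions excluded from the loss, which is the usual Hugging Face convention. Confirm this against `ds.column_names` before relying on it. `split_by_mask` is a hypothetical helper, and the token ids are toy values.

```python
IGNORE_INDEX = -100  # Conventional "ignore" label in Hugging Face Transformers.


def split_by_mask(input_ids, labels, ignore_index=IGNORE_INDEX):
    """Split token ids into (masked, trained) lists according to labels."""
    if len(input_ids) != len(labels):
        raise ValueError("input_ids and labels must have the same length")
    masked = [tok_id for tok_id, lab in zip(input_ids, labels) if lab == ignore_index]
    trained = [tok_id for tok_id, lab in zip(input_ids, labels) if lab != ignore_index]
    return masked, trained


# Toy example: the first four tokens stand in for the prompt, the rest for the answer.
input_ids = [1, 10, 11, 12, 20, 21, 2]
labels = [-100, -100, -100, -100, 20, 21, 2]
masked, trained = split_by_mask(input_ids, labels)
print(masked)
print(trained)
```

With the real data, you could pass `ds[0]['input_ids']` and `ds[0]['labels']` and then call `tok.decode(trained)` to see only the text that contributes to the loss.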


## Resume A Training Run

After you have inspected the data and are satisfied with the results, you can resume training the model without preprocessing the data again by using the `--run-to-resume` flag. For example, to resume the training run from this example, run:

```bash
# run this from the root of the repo
modal run --detach src.train \
   --config=config/mistral.yml \
   --data=data/sqlqa.jsonl \
   --run-to-resume {RUN_TAG}
```