# Tutorial 2: Create Dataset Files
Training, validation, and prediction with Ultralytics YOLO models are configured through a dataset YAML file in a specific format. This tutorial explains how to set one up for a dummy dataset.

You can use the dataset created in tutorial 1 or create your own following the format explained here.

The dataset YAML file contains the following:

```
names:
- name of class 0
- name of class 1
...
- name of class n-1

nc: n  # number of classes

path: path to directory containing text files

train: filename of train text file
val: filename of validation text file
test: filename of test text file  # optional
```
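
As a quick sanity check on the format above, the sketch below tests a config dictionary (for example, the result of loading the YAML file with `yaml.safe_load`) for the required keys and for agreement between `nc` and the number of names. The helper `check_dataset_config` and the class names are hypothetical, not part of Ultralytics, and the sketch assumes `names` is given as a list.

```python
def check_dataset_config(cfg):
    """Return a list of problems found in a dataset config dict (hypothetical helper)."""
    problems = []

    # Required keys from the format above; 'test' is optional.
    for key in ('names', 'nc', 'path', 'train', 'val'):
        if key not in cfg:
            problems.append(f'missing key: {key}')

    names = cfg.get('names', [])
    nc = cfg.get('nc')

    # nc counts the classes, so it should equal the number of names.
    if nc is not None and nc != len(names):
        problems.append(f'nc is {nc} but {len(names)} names are listed')

    return problems


# Example config with two made-up class names.
example = {
    'names': ['class_a', 'class_b'],
    'nc': 2,
    'path': '/workspace/data/',
    'train': 'train.txt',
    'val': 'val.txt',
}

print(check_dataset_config(example))  # an empty list means no problems were found
```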

The train, val, and test text files each list the filepath of every image in that subset. Use absolute filepaths where possible to avoid path resolution issues. For example, for the ROIs downloaded from ROIs.zip, your text file might look like this:

train.txt:

```
/workspace/data/ROIs/images/638147637f8a5e686a52dded-x18232y55761.png
/workspace/data/ROIs/images/638147667f8a5e686a52efa4-x56607y69379.png
/workspace/data/ROIs/images/638147787f8a5e686a53bf21-x59676y38171.png
```

Note that the filepaths must be separated by a newline character, so that each line holds exactly one filepath.
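
A broken or relative path in one of these text files may only surface later as a data loading error, so it can help to check the list up front. The following is a minimal sketch, assuming one filepath per line; `missing_images` is a hypothetical helper, and the demo uses a throwaway temporary directory rather than real data.

```python
import tempfile
from os.path import isfile, join


def missing_images(list_fp):
    """Return the filepaths listed in a subset text file that do not exist on disk."""
    with open(list_fp) as fh:
        # One filepath per line; skip blank lines.
        fps = [line.strip() for line in fh if line.strip()]

    return [fp for fp in fps if not isfile(fp)]


# Demo on a throwaway directory: one listed file exists, the other does not.
with tempfile.TemporaryDirectory() as tmp_dir:
    real_fp = join(tmp_dir, 'a.png')
    open(real_fp, 'w').close()
    fake_fp = join(tmp_dir, 'b.png')

    list_fp = join(tmp_dir, 'train.txt')
    with open(list_fp, 'w') as fh:
        fh.write('\n'.join([real_fp, fake_fp]))

    demo_missing = missing_images(list_fp)

print(len(demo_missing))  # number of listed filepaths that were not found
```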

In [1]:
# Imports
from glob import glob
from os.path import join
import yaml

In [4]:
# Example of how you might set this up for ROIs folder (extracted from ROIs.zip).
# See tutorial 1 on how to create the tile images.
tile_dir = '/workspace/data/tiles/images'

# List all the tile images.
fps = sorted(glob(join(tile_dir, '*.png')))

print(f'Total number of tile images: {len(fps)}')

# Take the first 100 images for validation and the rest for training.
val_fps = fps[:100]
train_fps = fps[100:]

path = '/workspace/data/'
train_fn = 'train.txt'
val_fn = 'val.txt'

# Write the filepaths to file, one per line.
with open(join(path, train_fn), 'w') as fh:
    fh.write('\n'.join(train_fps))

with open(join(path, val_fn), 'w') as fh:
    fh.write('\n'.join(val_fps))
    
# Create and save the yaml.
yaml_dict = {
    'names': ['Pre-NFT', 'iNFT'],
    'nc': 2,
    'path': path,
    'train': train_fn,
    'val': val_fn
}

with open(join(path, 'dataset.yaml'), 'w') as fh:
    yaml.dump(yaml_dict, fh)
    
yaml_dict

Total number of tile images: 504


{'names': ['Pre-NFT', 'iNFT'],
 'nc': 2,
 'path': '/workspace/data/',
 'train': 'train.txt',
 'val': 'val.txt'}
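
The cell above splits the sorted file list by position. If neighbouring files are related (for example, tiles cut from the same image), a shuffled split with a fixed seed may give a more representative validation set. The sketch below is one possible alternative, not the method used in this tutorial; `split_filepaths` is a hypothetical helper and the filepaths are made up.

```python
import random


def split_filepaths(fps, n_val, seed=0):
    """Shuffle a copy of fps with a fixed seed and split it into (train, val)."""
    if not 0 <= n_val <= len(fps):
        raise ValueError('n_val must be between 0 and len(fps)')

    # Sort first so the result does not depend on the input order.
    shuffled = sorted(fps)
    random.Random(seed).shuffle(shuffled)

    return shuffled[n_val:], shuffled[:n_val]


# Made-up filepaths standing in for the glob result.
fps = [f'/workspace/data/tiles/images/tile_{i}.png' for i in range(10)]

train_fps, val_fps = split_filepaths(fps, n_val=3, seed=42)
print(len(train_fps), len(val_fps))
```

Because the seed is fixed, rerunning the split on the same files yields the same train and validation lists, and the two lists never share a filepath.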