# Data Preprocess for the Dataset

This notebook has 2 parts:
- Baseline Preprocess
- Enhanced Preprocess

Each part has 3 steps:
- Prepare training data
- Prepare public data
- Prepare private data (baseline part only so far)

In [8]:
import os
import shutil

In [None]:
# check if `pytorch-CycleGAN-and-pix2pix` is already cloned
if not os.path.exists('../pytorch-CycleGAN-and-pix2pix'):
    os.chdir('../')
    !git clone https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix
    os.chdir('./dataset')  # change directory back to `./dataset`
else:
    print('pytorch-CycleGAN-and-pix2pix is already cloned.')
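
The clone step above relies on the notebook's current directory and on IPython's `!` syntax. A minimal sketch of the same check in plain Python, assuming `git` is available on the PATH. The names `ensure_cloned`, `REPO_URL`, and the `runner` parameter are hypothetical helpers introduced for illustration, not part of the original notebook:

```python
import os
import subprocess

REPO_URL = 'https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix'

def ensure_cloned(dest, url=REPO_URL, runner=subprocess.run):
    """Clone `url` into `dest` unless `dest` already exists.

    Returns True if a clone was started, False if it was skipped.
    `runner` is injectable so the function can be tried without network access.
    """
    if os.path.exists(dest):
        print(f'{dest} is already cloned.')
        return False
    # passing the destination explicitly avoids the os.chdir() round trip
    runner(['git', 'clone', url, dest], check=True)
    return True
```

A call such as `ensure_cloned('../pytorch-CycleGAN-and-pix2pix')` would then replace the `if`/`else` cell without changing the working directory.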

In [None]:
!pwd

## Download Dataset

In [None]:
!bash ../scripts/download_official_dataset.sh

# Baseline Preprocess

This part covers the following steps:
- Prepare training data
- Prepare public data
- Prepare private data

## Prepare Raw Training Data (Baseline)

The training dataset contains two subfolders:
- `label_img`: contains the draft images
- `img`: contains the corresponding ground truth images

In [None]:
import zipfile

train_dataset_zip = '34_Competition 1_Training dataset.zip'

# unzip the train_dataset_zip
with zipfile.ZipFile(train_dataset_zip, 'r') as zip_ref:
    zip_ref.extractall()
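
The rename cell that follows assumes the archive unpacks into a single top-level folder. A small sketch for checking that before extracting; `top_level_entries` is a hypothetical helper, not part of the original notebook:

```python
import zipfile

def top_level_entries(zip_path):
    """Return the sorted top-level names inside a zip archive."""
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        names = zip_ref.namelist()
    # zip member names always use '/', so the first component is the top-level entry
    return sorted({name.split('/')[0] for name in names if name})
```

Running `print(top_level_entries(train_dataset_zip))` before `extractall()` shows which folder name the later `os.rename` call needs to target.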

In [None]:
train_dir = 'training_dataset'

# rename the extracted folder
os.rename('Training dataset', train_dir)

In [None]:
# train_dir = './training_dataset'
print(os.listdir(train_dir))

### Rename the subfolders as trainA and trainB
Map the folder names to the names the model expects:
- `training_dataset/label_img` -> `training_dataset/trainA`
- `training_dataset/img` -> `training_dataset/trainB`

In [None]:
# rename the subfolders
os.rename(train_dir + '/label_img', train_dir + '/trainA')
os.rename(train_dir + '/img', train_dir + '/trainB')
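
`os.rename` raises an error if the cell is run a second time, because the source folder no longer exists. A minimal sketch of a rerun-safe variant; `rename_if_present` is a hypothetical helper introduced here for illustration:

```python
import os

def rename_if_present(src, dst):
    """Rename folder `src` to `dst`; do nothing if that already happened.

    Returns True if the rename was performed, False if it was skipped.
    """
    if os.path.isdir(src) and not os.path.exists(dst):
        os.rename(src, dst)
        return True
    if os.path.isdir(dst) and not os.path.exists(src):
        return False  # already renamed on an earlier run
    # both exist or neither exists: refuse to guess
    raise RuntimeError(f'expected exactly one of {src!r} or {dst!r} to exist')
```

For example, `rename_if_present(train_dir + '/label_img', train_dir + '/trainA')` could stand in for the first `os.rename` call.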

### Copy the folder to the model input folder

Copy the `training_dataset` folder to `../pytorch-CycleGAN-and-pix2pix/datasets`, replacing any existing copy.

In [None]:
# copy the folder to target folder
target_dir = '../pytorch-CycleGAN-and-pix2pix/datasets'

# check if the folder already exists in the target
if not os.path.exists(target_dir + '/' + train_dir):
    shutil.copytree(train_dir, target_dir + '/' + train_dir)
else:
    # remove the existing folder
    shutil.rmtree(target_dir + '/' + train_dir)
    # copy the folder
    shutil.copytree(train_dir, target_dir + '/' + train_dir)
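
The same copy-and-replace pattern recurs for every dataset in this notebook. A minimal sketch that wraps it in one function; `copy_replacing` is a hypothetical name, not part of the original notebook:

```python
import os
import shutil

def copy_replacing(src, dst_root):
    """Copy folder `src` into `dst_root`, replacing any previous copy."""
    dst = os.path.join(dst_root, os.path.basename(os.path.normpath(src)))
    if os.path.exists(dst):
        # remove the stale copy so leftover files do not survive
        shutil.rmtree(dst)
    shutil.copytree(src, dst)
    return dst
```

With this helper, the cell above reduces to `copy_replacing(train_dir, target_dir)`.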

### Align the trainA and trainB images

In [None]:
os.chdir('../pytorch-CycleGAN-and-pix2pix/datasets')

! python make_dataset_aligned.py --dataset-path training_dataset
print('Done!')
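
Aligned datasets pair each `trainA` image with one `trainB` image, so it can help to look for files without a counterpart before running the alignment script. The sketch below assumes that paired images share a file name up to the extension, which this notebook does not confirm; `unpaired_stems` is a hypothetical helper:

```python
import os

def unpaired_stems(dir_a, dir_b):
    """Return file stems that appear in only one of the two folders."""
    # assumption: a pair shares the same name, ignoring the file extension
    stems_a = {os.path.splitext(f)[0] for f in os.listdir(dir_a)}
    stems_b = {os.path.splitext(f)[0] for f in os.listdir(dir_b)}
    return sorted(stems_a ^ stems_b)
```

An empty list from `unpaired_stems('training_dataset/trainA', 'training_dataset/trainB')` would suggest that every draft image has a ground truth image.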

## Prepare Public Testing Data (Baseline)

1. The zip file contains only the `label_img` folder,
2. so we create the parent folder `public_testing_dataset`
3. and extract `label_img` into it.

In [None]:
import os

# change directory back to the `dataset` folder
try:
	os.chdir('../../dataset')
except FileNotFoundError:
	print("Already in the root directory")
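
Switching directories back and forth with `os.chdir` is easy to get wrong when cells are rerun out of order. A minimal sketch of a context manager that restores the previous directory automatically; `working_directory` is a hypothetical helper introduced here:

```python
import os
from contextlib import contextmanager

@contextmanager
def working_directory(path):
    """Temporarily change the current directory, restoring it on exit."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # runs even if the body raises, so the notebook never gets stranded
        os.chdir(previous)
```

Code placed inside `with working_directory('../pytorch-CycleGAN-and-pix2pix/datasets'):` runs in that folder, and the notebook is back in its original folder afterwards, so no separate cell is needed to change back.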

In [None]:
import zipfile

test_dataset_zip = '34_Competition 1_public testing dataset.zip'
test_dir = 'public_testing_dataset'

# unzip the test_dataset_zip
with zipfile.ZipFile(test_dataset_zip, 'r') as zip_ref:
	zip_ref.extractall(test_dir)

In [None]:
test_dir = 'public_testing_dataset'
print(os.listdir(test_dir))

### Rename the subfolder as testA

Since the ground truth images are not provided, we only need to rename the folder `label_img` to `testA`.

Map the folder name to the name the model expects:
- `public_testing_dataset/label_img` -> `public_testing_dataset/testA`

In [None]:
os.rename(test_dir + '/label_img', test_dir + '/testA')

### Copy the folders to the model input folder

Copy the `public_testing_dataset` folder to `../pytorch-CycleGAN-and-pix2pix/datasets`.

In [None]:
# copy the folder to target folder
target_dir = '../pytorch-CycleGAN-and-pix2pix/datasets'

# check if the folder exists
if not os.path.exists(target_dir + '/' + test_dir):
	shutil.copytree(test_dir, target_dir + '/' + test_dir)
else:
	# remove the existing folder
	shutil.rmtree(target_dir + '/' + test_dir)
	# copy the new folder
	shutil.copytree(test_dir, target_dir + '/' + test_dir)

## Prepare Private Testing Data (Baseline)

1. The zip file contains only the `label_img` folder,
2. so we create the parent folder `private_testing_dataset`
3. and extract `label_img` into it.

In [1]:
import os

# change directory back to the `dataset` folder
try:
	os.chdir('../../dataset')
except FileNotFoundError:
	print("Already in the root directory")

Already in the root directory


In [4]:
import zipfile

private_test_dataset_zip = '34_Competition 1_Private Test Dataset.zip'
private_test_dir = 'private_testing_dataset'

# unzip the private_test_dataset_zip
with zipfile.ZipFile(private_test_dataset_zip, 'r') as zip_ref:
	zip_ref.extractall(private_test_dir)

In [5]:
private_test_dir = 'private_testing_dataset'
print(os.listdir(private_test_dir))

['label_img']


### Rename the subfolder as testA

Since the ground truth images are not provided, we only need to rename the folder `label_img` to `testA`.

Map the folder name to the name the model expects:
- `private_testing_dataset/label_img` -> `private_testing_dataset/testA`

In [6]:
os.rename(private_test_dir + '/label_img', private_test_dir + '/testA')

### Copy the folders to the model input folder

Copy the `private_testing_dataset` folder to `../pytorch-CycleGAN-and-pix2pix/datasets`.

In [9]:
# copy the folder to target folder
target_dir = '../pytorch-CycleGAN-and-pix2pix/datasets'

# check if the folder exists
if not os.path.exists(target_dir + '/' + private_test_dir):
	shutil.copytree(private_test_dir, target_dir + '/' + private_test_dir)
else:
	# remove the existing folder
	shutil.rmtree(target_dir + '/' + private_test_dir)
	# copy the new folder
	shutil.copytree(private_test_dir, target_dir + '/' + private_test_dir)
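
The public and private test sets go through the same three steps: unzip, rename, copy. A minimal sketch that folds them into one function, assuming a fresh run where the output folder does not yet contain `testA`; `prepare_test_split` is a hypothetical name, not part of the original notebook:

```python
import os
import shutil
import zipfile

def prepare_test_split(zip_path, out_dir, target_root):
    """Unzip a test archive, rename `label_img` to `testA`, copy to `target_root`."""
    # extracting into `out_dir` creates the parent folder for `label_img`
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(out_dir)
    # assumption: `out_dir/testA` does not exist yet (fresh run)
    os.rename(os.path.join(out_dir, 'label_img'), os.path.join(out_dir, 'testA'))
    # replace any earlier copy in the model input folder
    dst = os.path.join(target_root, os.path.basename(os.path.normpath(out_dir)))
    if os.path.exists(dst):
        shutil.rmtree(dst)
    shutil.copytree(out_dir, dst)
    return dst
```

For the private set this would be called as `prepare_test_split(private_test_dataset_zip, private_test_dir, target_dir)`.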

# Enhanced Preprocess (2 domain datasets)

## Prepare Training Data (Enhanced)

### Extract the Images from Raw Training Data
The `trainA` and `trainB` subfolders each contain 2 types of images:
- River images (e.g. TRA_RI_1000000.png)
- Road images (e.g. TRA_RO_1000000.png)

So we need to create 2 folders:
- `RIVER` (contains `trainA` and `trainB` subfolders with only river images)
- `ROAD` (contains `trainA` and `trainB` subfolders with only road images)

In [None]:
train_dir = 'training_dataset'
river_dir = 'RIVER'
road_dir = 'ROAD'

# create the folders
if not os.path.exists(river_dir):
	os.makedirs(river_dir)
if not os.path.exists(road_dir):
	os.makedirs(road_dir)

for subdir in os.listdir(train_dir):
	# create the subfolders if not exist
	if not os.path.exists(river_dir + '/' + subdir):
		os.makedirs(river_dir + '/' + subdir)
	if not os.path.exists(road_dir + '/' + subdir):
		os.makedirs(road_dir + '/' + subdir)
	
	# copy each file into the folder for its domain
	for file in os.listdir(train_dir + '/' + subdir):
		if 'RI' in file:
			shutil.copy(train_dir + '/' + subdir + '/' + file, river_dir + '/' + subdir + '/' + file)
		elif 'RO' in file:
			shutil.copy(train_dir + '/' + subdir + '/' + file, road_dir + '/' + subdir + '/' + file)
		else:
			print('ERROR: file name not recognized: ' + file)
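
The substring test `'RI' in file` also matches any file name that happens to contain those two letters. A stricter sketch, assuming names follow the `<PREFIX>_<RI|RO>_<id>` pattern seen in the examples above; `domain_of` and the two module-level constants are hypothetical names:

```python
import re

# matches the domain token between underscores, e.g. the `RI` in `TRA_RI_1000000.png`
_DOMAIN_TOKEN = re.compile(r'_(RI|RO)_')
_DOMAIN_DIRS = {'RI': 'RIVER', 'RO': 'ROAD'}

def domain_of(file_name):
    """Return 'RIVER', 'ROAD', or None if the name has no domain token."""
    match = _DOMAIN_TOKEN.search(file_name)
    return _DOMAIN_DIRS[match.group(1)] if match else None
```

A name such as `STRIPE.png` contains `RI` but no `_RI_` token, so `domain_of` returns `None` for it and the file would be reported as unrecognized.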

### Copy the folders to the model input folder

Copy the `RIVER` and `ROAD` folders to `../pytorch-CycleGAN-and-pix2pix/datasets`.

In [None]:
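# copy the folders to the target folder, replacing any existing copies
target_dir = '../pytorch-CycleGAN-and-pix2pix/datasets'
for domain_dir in (river_dir, road_dir):
	if os.path.exists(target_dir + '/' + domain_dir):
		shutil.rmtree(target_dir + '/' + domain_dir)
	shutil.copytree(domain_dir, target_dir + '/' + domain_dir)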

### Align the trainA and trainB images

In [None]:
os.chdir('../pytorch-CycleGAN-and-pix2pix/datasets')

! python make_dataset_aligned.py --dataset-path RIVER
! python make_dataset_aligned.py --dataset-path ROAD
print('Done!')

## Prepare Public Testing Data (Enhanced)

### Extract the Images from Raw Testing Data
The `public_testing_dataset/testA` subfolder contains 2 types of images:
- River images (e.g. PUB_RI_1000000.png)
- Road images (e.g. PUB_RO_1000459.png)

So we need to create 2 folders:
- `test_RIVER` (contains a `testA` subfolder with only river images)
- `test_ROAD` (contains a `testA` subfolder with only road images)

In [None]:
import os

# change directory back to the `dataset` folder
try:
	os.chdir('../../dataset')
except FileNotFoundError:
	print("Already in the root directory")

In [None]:
test_dir = 'public_testing_dataset'
test_river_dir = 'test_RIVER'
test_road_dir = 'test_ROAD'

# create the folders
if not os.path.exists(test_river_dir):
	os.makedirs(test_river_dir)
if not os.path.exists(test_road_dir):
	os.makedirs(test_road_dir)

for subdir in os.listdir(test_dir):
	# create the subfolders if not exist
	if not os.path.exists(test_river_dir + '/' + subdir):
		os.makedirs(test_river_dir + '/' + subdir)
	if not os.path.exists(test_road_dir + '/' + subdir):
		os.makedirs(test_road_dir + '/' + subdir)
	
	# copy each file into the folder for its domain
	for file in os.listdir(test_dir + '/' + subdir):
		if 'RI' in file:
			shutil.copy(test_dir + '/' + subdir + '/' + file, test_river_dir + '/' + subdir + '/' + file)
		elif 'RO' in file:
			shutil.copy(test_dir + '/' + subdir + '/' + file, test_road_dir + '/' + subdir + '/' + file)
		else:
			print('ERROR: file name not recognized: ' + file)
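
After splitting, it is worth confirming that no image was dropped or copied twice. A minimal sketch of such a check; `split_is_complete` is a hypothetical helper introduced for illustration:

```python
import os

def split_is_complete(source_dir, part_dirs):
    """Check that every file in `source_dir` landed in exactly one part folder."""
    source = sorted(os.listdir(source_dir))
    # a file copied into two parts shows up twice here and breaks the equality
    parts = sorted(f for part in part_dirs for f in os.listdir(part))
    return source == parts
```

For the public test set the call would be `split_is_complete('public_testing_dataset/testA', ['test_RIVER/testA', 'test_ROAD/testA'])`.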

### Copy the folders to the model input folder

Copy the `test_RIVER` and `test_ROAD` folders to `../pytorch-CycleGAN-and-pix2pix/datasets`.

In [None]:
# copy the folders to the target folder, replacing any existing copies
target_dir = '../pytorch-CycleGAN-and-pix2pix/datasets'
for domain_dir in (test_river_dir, test_road_dir):
	if os.path.exists(target_dir + '/' + domain_dir):
		shutil.rmtree(target_dir + '/' + domain_dir)
	shutil.copytree(domain_dir, target_dir + '/' + domain_dir)