# Part 1: Preparation of the dataset

The given dataset of crosswalk islands consists of over 4000 images saved as .tif files, plus a CSV file that contains polygon coordinates. These coordinates are needed to create the binary masks, which are not included in the dataset. The set also contains invalid or damaged images, so it must be cleaned before the masks are created. This notebook solves the following problems in chronological order:

- Download of the dataset from SwitchDrive and unzipping of its content into a data folder in the current directory
- Conversion of the large TIF files into smaller PNG files in a separate directory
- Data cleaning
- Creation of labels in the form of binary masks
- Additional data cleaning

In [1]:
import os
import sys
import csv
import cv2
import tifffile
import zipfile
import random
import json
import urllib.request
import pandas as pd
import numpy as np
from fastai.vision.all import *
from skimage import io, exposure
from skimage.transform import resize
from skimage.color import rgba2rgb
from skimage.util import img_as_ubyte
from multiprocessing import Pool
from functools import partial

The functions used in this notebook are stored in a separate Python file. This keeps the structure of the notebook clearer and easier to read.

In [None]:
# import created functions from separate python file defs.py (same directory)
from defs import read_image, reduce_image_array, save_img_as_png, split_list, convert_list_of_images

In [2]:
# Suppresses OS warnings when applying multiprocessing later
import warnings
warnings.filterwarnings('ignore')

### Working directory
Let's start by creating the working directory for the data. The code below creates a new data folder in the current working directory and, inside it, a subdirectory for the masks. The subdirectory with the TIF images of the refuge islands appears later, when the downloaded archive is unpacked. Directories that already exist are skipped.

In [3]:
DATASET_PATH = Path(os.getcwd()) / "data"
IMAGES_PATH = DATASET_PATH / "islands"
MASKS_PATH = DATASET_PATH  / "masks"

# Create Directory
if not DATASET_PATH.exists():
    os.mkdir(DATASET_PATH)
if not MASKS_PATH.exists():
    os.mkdir(MASKS_PATH)
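
As a side note, pathlib can create nested directories in one repeatable call, which makes the explicit existence checks unnecessary. The following is a minimal sketch that uses a temporary directory as a stand-in for the working directory; the helper name ensure_dirs is hypothetical and not part of the notebook.

```python
import tempfile
from pathlib import Path


def ensure_dirs(base: Path) -> tuple[Path, Path]:
    """Create <base>/data and <base>/data/masks if they are missing."""
    dataset_path = base / "data"
    masks_path = dataset_path / "masks"
    # parents=True also creates "data"; exist_ok=True makes repeated calls a no-op
    masks_path.mkdir(parents=True, exist_ok=True)
    return dataset_path, masks_path


with tempfile.TemporaryDirectory() as tmp:
    dataset_path, masks_path = ensure_dirs(Path(tmp))
    ensure_dirs(Path(tmp))  # a second call does not raise
    created = masks_path.is_dir()
```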

### Download TIF images and convert them to PNG format
The whole dataset (images and CSV file) can be downloaded as a zip file. Because of their large size, the .tif images are not suitable for the subsequent image segmentation process. Therefore, each image is converted (and compressed) to a .png in a separate directory.

> This takes around 10-20 min and requires a stable internet connection. To avoid repeating this step, the code checks whether the image directory already exists before starting the download.

In [4]:
# Download Images and CSV file as Zip
if not IMAGES_PATH.exists():    
    urllib.request.urlretrieve(url=r'https://drive.switch.ch/index.php/s/bWb8D4mQe7HpSqL/download',filename=DATASET_PATH/'islands.zip')

In [5]:
# Unzip images 
if not IMAGES_PATH.exists():
    with zipfile.ZipFile(DATASET_PATH/'islands.zip', 'r') as zip_ref:
        zip_ref.extractall(DATASET_PATH)
    with zipfile.ZipFile(DATASET_PATH/'image-segmentation-islands'/'islands.zip', 'r') as zip_ref:
        zip_ref.extractall(DATASET_PATH)

### Using multiprocessing to speed up the conversion process

Since over 4000 images must be converted, multiprocessing is used to make full use of the machine's processor. The images are split into several groups, and each image is scaled down before it is saved as a PNG.

> Even so, this process can take 30+ minutes, depending on the capability of the machine.

In [6]:
# Create png folder
IMAGES_PNG_PATH = DATASET_PATH / 'images_png'
if not IMAGES_PNG_PATH.exists():
    os.mkdir(IMAGES_PNG_PATH)
    
    number_of_parallelism = 10
    # Get images in a list []
    image_list = list(IMAGES_PATH.glob("*.tif"))

    # Split list into n sublist [[][]]
    image_list_splitted = split_list(image_list, number_of_parallelism)
    do_conversion = partial(convert_list_of_images, out_folder = IMAGES_PNG_PATH)
    
    if __name__ == '__main__':
        with Pool(number_of_parallelism) as p:
            print(p.map(do_conversion, image_list_splitted))

[None, None, None, None, None, None, None, None, None, None]
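
The helpers split_list and convert_list_of_images live in defs.py and are not shown in this notebook. As an illustration only, the sketch below shows one way a helper like split_list could distribute a list over n sublists. The name split_list_sketch and the round-robin strategy are assumptions, not the actual implementation.

```python
def split_list_sketch(items, n):
    """Distribute items over n sublists in round-robin order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # items[i::n] takes every n-th element starting at offset i
    return [items[i::n] for i in range(n)]


chunks = split_list_sketch(list(range(7)), 3)
print(chunks)  # [[0, 3, 6], [1, 4], [2, 5]]
```

Pool.map hands each sublist to a worker and collects one return value per sublist, which is why the cell above prints a list of ten entries.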


### Cleaning of CSV file 

The CSV file comprises the following columns:

- Image path (reference to which image file the information is applicable)
- The polygon coordinates
- A boolean value indicating whether the image is invalid

> Invalid images must be excluded from the dataset!

In [7]:
file_path = IMAGES_PATH / "validated.csv"

data_list = []

with open(file_path, 'r', encoding="utf-8-sig") as f_obj:
    reader = csv.DictReader(f_obj, delimiter=',')
    for item in reader:
        data_list.append(item)

For the cleaning process, the data of the CSV file is transferred into a pandas DataFrame. Rows containing invalid images are removed.

In [8]:
columns = [['image_path', 'polygons', 'invalid_image']]
for record in data_list:
    image_path = record['path']
    polygons = record['polygons']
    invalid_image = record['invalid_image']

    columns.append([image_path, polygons, invalid_image])

# Build the DataFrame once, after all records have been collected
df = pd.DataFrame(columns[1:], columns=columns[0])

In [9]:
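# Strip the directory prefix and switch the extension to .png (literal matches, no regex)
df['image_path'] = (df['image_path']
                    .str.replace('/data/islands/', '', regex=False)
                    .str.replace('.tif', '.png', regex=False))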
df['polygons'] = df['polygons'].str[1:-1]
df = df[df.invalid_image == 'False']
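
For comparison, pandas can also read and filter such a file directly. The sketch below uses a small in-memory CSV with made-up rows; only the column names path, polygons and invalid_image are taken from the notebook, and the sample values are assumptions about the file's layout.

```python
import io

import pandas as pd

# Made-up sample data; the real validated.csv may be formatted differently
sample_csv = io.StringIO(
    "path,polygons,invalid_image\n"
    '/data/islands/a.tif,"[[1, 2], [3, 4], [5, 6]]",False\n'
    "/data/islands/b.tif,,True\n"
)

# dtype=str keeps the flag as text, so it is compared against the string 'False'
df_sketch = pd.read_csv(sample_csv, dtype=str)
df_sketch = df_sketch.rename(columns={"path": "image_path"})

# Strip the directory prefix and switch the extension to .png
df_sketch["image_path"] = (
    df_sketch["image_path"]
    .str.replace("/data/islands/", "", regex=False)
    .str.replace(".tif", ".png", regex=False)
)

# Keep only the rows that are not flagged as invalid
df_sketch = df_sketch[df_sketch["invalid_image"] == "False"]
print(df_sketch["image_path"].tolist())  # ['a.png']
```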

In [10]:
print(f"Number of valid images: {df.shape[0]}")

Number of valid images: 4202

### Mask generation

Now that the CSV file is cleaned, the image list with its assigned polygon coordinates can be used to create the binary masks. The fillPoly function of the OpenCV (cv2) library does this job.

> Note that this is done with the PNG images, not with the TIF images.

Afterwards, the number of erroneous files is determined. A file can be erroneous for one of two reasons:

- Polygon coordinates are empty or in the wrong format (invalid_images)
- The image path in the CSV file does not match any image file of the dataset (not_matching_images)

In [11]:
not_matching_images = []
invalid_images = []

for index, row in df.iterrows():
    path = str(IMAGES_PNG_PATH / row['image_path'])
    if os.path.isfile(path):
        try:
            img = cv2.imread(path)
            # Empty single-channel mask with the same height and width as the image
            image = np.zeros((img.shape[0], img.shape[1]), dtype=np.uint8)
            # Extract and allocate coordinates for mask generation
            polygon = np.asarray(json.loads(row['polygons']), dtype=np.float32)
            # fillPoly expects 32-bit integer points on every platform
            contours = polygon.astype(np.int32)
            mask = cv2.fillPoly(image, pts=contours, color=255)
            cv2.imwrite(str(MASKS_PATH / row['image_path']), mask)
        except Exception:
            # Unreadable image or unusable polygon data: record the file and drop the PNG
            invalid_images.append(str(row['image_path']))
            os.remove(path)
    else:
        not_matching_images.append(path)
            
print(len(invalid_images))
print(len(not_matching_images))

975
43


As a last step, the images that are not included in the validated.csv file must be deleted. They are of no use, since no labels are available for them for the training.

In [12]:
png_list = []
mask_list = []

for filename in os.listdir(IMAGES_PNG_PATH):
    png_list.append(filename)
    
for filename in os.listdir(MASKS_PATH):
    mask_list.append(filename)
    
difference = list(set(png_list)-set(mask_list))
print(len(difference))

for file in difference:
    os.remove(IMAGES_PNG_PATH / str(file))

22


### Remark

In hindsight, it might make more sense to delete the unusable images before the conversion process. This would save computation time.
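
One way to do this, sketched under assumptions, is to filter the list of TIF files against the valid CSV rows before handing it to the conversion step. The function name select_convertible and the sample file names are hypothetical.

```python
from pathlib import Path


def select_convertible(tif_paths, valid_names):
    """Keep only the TIF files whose name appears among the valid CSV entries."""
    valid = set(valid_names)
    return [p for p in tif_paths if Path(p).name in valid]


# Hypothetical file names standing in for the real dataset
tif_paths = [Path("islands/a.tif"), Path("islands/b.tif"), Path("islands/c.tif")]
valid_names = ["a.tif", "c.tif"]  # e.g. names of rows with invalid_image == 'False'

to_convert = select_convertible(tif_paths, valid_names)
print([p.name for p in to_convert])  # ['a.tif', 'c.tif']
```

Only the filtered list would then be split and passed to the worker pool, so invalid or unlabeled images are never converted.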