# Farm ponds identification pipeline: Generate training data
**Before you start this notebook, make sure that you have a satellite image (PNG) and a mask for training. If you do not have a mask yet, create one by following the [training setup](./README.md). Place both the satellite image and the mask in the [data folder](../../data/), named train.png and train_mask.png.**

This notebook guides you step by step through identifying instances in satellite imagery using computer vision algorithms such as R-CNNs. We start by preprocessing the image into tiles, then split the tiles into training and validation datasets, and finally create the label files needed for the training process.

## Install the packages for the pipeline
Make sure the environment is set up so that the packages used in this notebook can be imported. See the [setup info](../README.md).

## Import the libraries that are used in this notebook

In [1]:
import os, sys, json
from PIL import Image

## Preprocessing
If you have a large satellite image, preprocessing will take a few minutes to run. For the Kadwanchi training data with 1024x1024 px tiles, it takes about 15 minutes.

### 1. Setting up the folder paths and parameters
You don't need to change the folder paths: missing folders should be created automatically. The data produced in the pipeline is stored in the corresponding folders (e.g. training data in [train](../../data/train/)). Refer to the [repository README](../../README.md) for the structure of the folders in the repository.


In [None]:
ponds_root = os.path.dirname(os.path.dirname(os.getcwd())) 
if ponds_root not in sys.path:
    sys.path.append(ponds_root)

# Configuration settings to set up folders
train_image_path = os.path.join(ponds_root, "data/train.png")  # Input satellite image
train_mask_path = os.path.join(ponds_root, "data/train_mask.png")  # Input training mask
train_folder = os.path.join(ponds_root, "data/train/")  # Output folder for image tiles
train_mask_folder = os.path.join(ponds_root, "data/train_mask/")  # Output folder for mask tiles
train_not_used_folder = os.path.join(ponds_root, "data/train_not_used/")  # Tiles not used for training
val_folder = os.path.join(ponds_root, "data/val/")  # Validation image tiles
val_mask_folder = os.path.join(ponds_root, "data/val_mask/")  # Validation mask tiles
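
Before running the later cells, it can help to confirm that the two input files are where the configuration expects them. The sketch below is a minimal check and not part of the pipeline: `find_missing_inputs` and `ensure_folders` are hypothetical helper names, and the folder creation is only a safeguard, since missing folders should be created by the pipeline itself.

```python
import os


def find_missing_inputs(paths):
    """Return the input paths that do not exist as files."""
    return [p for p in paths if not os.path.isfile(p)]


def ensure_folders(folders):
    """Create each output folder if it does not exist yet."""
    for folder in folders:
        os.makedirs(folder, exist_ok=True)
```

In this notebook, you could call `find_missing_inputs([train_image_path, train_mask_path])` and stop if the returned list is not empty.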

### 2. Setting the tile parameters

We create training data by cutting the satellite image into tiles. Set ```tile_width``` and ```tile_height``` to the values you prefer; we use 1024x1024 px in the farm ponds example. Set ```min_mask_size``` to filter out blank tiles: mask tiles whose file size is below this threshold are treated as blank (in our farm ponds example, a blank 1024 px tile is smaller than 6496 bytes).

In [None]:
tile_width, tile_height = 1024, 1024  # Tile dimensions
min_mask_size = 6496  # Minimum tile file size in bytes
# Disable Pillow's decompression-bomb size limit so that very large satellite images can be opened
Image.MAX_IMAGE_PIXELS = None
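
To see how the tile size translates into a tile grid, the sketch below computes the crop boxes for a given image size. It is an illustration only: `tile_boxes` is a hypothetical helper, the image size is made up, and the clipping of edge tiles is an assumption (the pipeline's own tiling may pad or drop them instead).

```python
def tile_boxes(image_width, image_height, tile_width, tile_height):
    """Return (left, upper, right, lower) crop boxes covering the image.

    Assumption: tiles on the right and bottom edges are clipped to the
    image bounds; the pipeline's own tiling may pad or drop them instead.
    """
    boxes = []
    for upper in range(0, image_height, tile_height):
        for left in range(0, image_width, tile_width):
            right = min(left + tile_width, image_width)
            lower = min(upper + tile_height, image_height)
            boxes.append((left, upper, right, lower))
    return boxes


# A hypothetical 2500x2048 px image cut into 1024x1024 px tiles
boxes = tile_boxes(2500, 2048, 1024, 1024)
print(len(boxes))  # 3 columns x 2 rows = 6 tiles
print(boxes[2])    # (2048, 0, 2500, 1024): clipped tile at the right edge
```

Each box uses the (left, upper, right, lower) order that Pillow's `Image.crop` expects.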

### 3. Preprocessing the tiles 
This next cell does the following:
1. Divides the image and the mask into tiles and places them in the training folders.
2. Filters the mask tiles by file size: a tile that is blank or contains little data is filtered out.
3. Matches the image tiles to the remaining mask tiles, so that every training image has a corresponding mask.

In [None]:
from utils import preprocess as prep
# Step 1: Divide the input image into smaller tiles
prep.divide_and_save_image(train_image_path, train_folder, tile_width, tile_height)
prep.divide_and_save_image(train_mask_path, train_mask_folder, tile_width, tile_height)

# Step 2: Filter the generated mask tiles by file size (filters out irrelevant blank tiles)
prep.filter_tiles_by_size(train_mask_folder, min_mask_size)

# Step 3: Keep only the training images that have a corresponding mask
prep.process_training_data(train_folder, train_mask_folder, train_not_used_folder)
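
Steps 2 and 3 can be previewed without moving or deleting anything. The sketch below only reports which files would be affected. It assumes that blank tiles are recognized by a file size below the threshold and that an image tile and its mask tile share the same file name. Both function names are hypothetical and are not part of `utils.preprocess`.

```python
import os


def files_below_size(folder, min_size):
    """Return names of files in `folder` smaller than `min_size` bytes."""
    return sorted(
        name for name in os.listdir(folder)
        if os.path.getsize(os.path.join(folder, name)) < min_size
    )


def images_without_mask(image_folder, mask_folder):
    """Return image file names that have no mask file of the same name."""
    mask_names = set(os.listdir(mask_folder))
    return sorted(
        name for name in os.listdir(image_folder)
        if name not in mask_names
    )
```

For example, `files_below_size(train_mask_folder, min_mask_size)` lists the mask tiles that fall under the blank-tile threshold, which is a quick way to check whether the threshold suits your own imagery.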

### 4. Splitting the data (both images and masks)

Choose the number of randomly selected images to hold out from training and place in the validation set. The validation set is used to check the model's performance on data it has not been trained on, so that the model we train later does not pick up patterns that are too specific to the training data (lower variance, i.e. we want the model to learn the general pattern for farm ponds).

Here we set the number of images in the validation set to **150**, which is roughly 20% of the total data. You can follow the 80-20 rule, where a randomly chosen 20% of the data becomes validation data, or use another rule to decide how you want to validate the model's training performance.

In [None]:
num_images_to_select = 150
prep.create_validation_set(train_folder, train_mask_folder, val_folder, val_mask_folder, num_images_to_select)
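
If you prefer to derive the validation count from a fraction rather than hard-coding it, the 80-20 rule can be sketched as follows. This is an illustration with a hypothetical helper (`pick_validation`) and made-up tile names. It does not move any files and is not how `prep.create_validation_set` is necessarily implemented.

```python
import random


def pick_validation(names, fraction=0.2, seed=0):
    """Randomly pick round(len(names) * fraction) names for validation."""
    ordered = sorted(names)
    count = round(len(ordered) * fraction)
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    validation = set(rng.sample(ordered, count))
    training = [name for name in ordered if name not in validation]
    return training, sorted(validation)


# Hypothetical tile names: 750 tiles at 20% gives 150 validation tiles
names = [f"tile_{i}.png" for i in range(750)]
training, validation = pick_validation(names)
print(len(training), len(validation))  # 600 150
```

The length of the validation list could then be passed as ```num_images_to_select```.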

### 5. Inverting the mask colors (both train and val)
Once the train-val split is done, we turn the instances white and the background black, so that the next cell can recognize the instances when creating the JSON file that describes them. After running the cell below, we are ready to create the label information for model training.

In [None]:
prep.invert_image_colors(train_mask_folder, file_extension=".png")
prep.invert_image_colors(val_mask_folder, file_extension=".png")
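
To illustrate what the inversion does to a single mask, here is a minimal in-memory sketch using Pillow's `ImageOps.invert`. It assumes the mask has black instances on a white background before inversion. `invert_mask` is a hypothetical helper, and `prep.invert_image_colors` may be implemented differently.

```python
from PIL import Image, ImageOps


def invert_mask(mask):
    """Return an RGB copy of the mask with each channel value v mapped to 255 - v."""
    return ImageOps.invert(mask.convert("RGB"))


# Tiny in-memory mask: white background with one black "instance" pixel
mask = Image.new("RGB", (4, 4), (255, 255, 255))
mask.putpixel((1, 1), (0, 0, 0))

inverted = invert_mask(mask)
print(inverted.getpixel((1, 1)))  # (255, 255, 255): the instance is now white
print(inverted.getpixel((0, 0)))  # (0, 0, 0): the background is now black
```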

### 6. Creating COCO dataset for the labeled pond data
This part of the code creates a COCO dataset as a JSON file that is used for training and validating the model. Both the [train](../../data/train/) and [val](../../data/val/) folders will contain a JSON file that labels the masks on the image tiles. This cell is adapted from the repository [image-to-coco-json-converter](https://github.com/chrise96/image-to-coco-json-converter.git). You can also customize the labels by adding other ```category_ids``` and ```category_colors```.

Creating the train/val dataset for the farm ponds example data takes about 7 minutes.
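
To add a category, both mappings need a matching entry. The sketch below assumes, as the key format in the cell below suggests, that colors are looked up by the string form of an RGB tuple. The second category ("well", drawn in red) is purely hypothetical and not part of the farm ponds data, and `color_key` and `category_rgb` are illustrative names.

```python
def color_key(rgb):
    """Format an (R, G, B) tuple the way the color lookup keys are written."""
    return str(tuple(rgb))


# Hypothetical label set: "pond" is from this notebook, "well" is made up
category_ids = {"pond": 1, "well": 2}
category_rgb = {"pond": (255, 255, 255), "well": (255, 0, 0)}

# Build the color -> category id mapping from the two dictionaries above
category_colors = {
    color_key(rgb): category_ids[name] for name, rgb in category_rgb.items()
}
print(category_colors)  # {'(255, 255, 255)': 1, '(255, 0, 0)': 2}
```

Each color must be the one the instances have in the mask tiles after the inversion step, since that is the version of the masks this cell reads.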

In [None]:
import glob
from utils.create_annotations import *

# Label ids of the dataset
category_ids = {
    "pond": 1
}

# Define which colors match which categories in the images
category_colors = {
    "(255, 255, 255)": 1  # pond
}

# Define the ids that are a multipolygon. In our case: none, each pond is a separate polygon
multipolygon_ids = []

# Get "images" and "annotations" info 
def images_annotations_info(maskpath):
    # This id will be automatically increased as we go
    annotation_id = 0
    image_id = 0
    annotations = []
    images = []
    
    for mask_image in glob.glob(maskpath + "*.png"):
        # The mask tile and the original image tile share the same file name (both are *.png).
        # We make a reference to the original file in the COCO JSON file
        original_file_name = os.path.splitext(os.path.basename(mask_image))[0] + ".png"
        # Open the image and (to be sure) we convert it to RGB
        mask_image_open = Image.open(mask_image).convert("RGB")
        w, h = mask_image_open.size
        
        # "images" info 
        image = create_image_annotation(original_file_name, w, h, image_id)
        images.append(image)

        sub_masks = create_sub_masks(mask_image_open, w, h)
        for color, sub_mask in sub_masks.items():
            category_id = category_colors[color]

            # "annotations" info
            polygons, segmentations = create_sub_mask_annotation(sub_mask)

            # Check if we have classes that are a multipolygon
            if category_id in multipolygon_ids:
                # Combine the polygons to calculate the bounding box and area
                multi_poly = MultiPolygon(polygons)
                                
                annotation = create_annotation_format(multi_poly, segmentations, image_id, category_id, annotation_id)

                annotations.append(annotation)
                annotation_id += 1
            else:
                for i in range(len(polygons)):
                    # Cleaner to recalculate this variable
                    segmentation = [np.array(polygons[i].exterior.coords).ravel().tolist()]
                    
                    annotation = create_annotation_format(polygons[i], segmentation, image_id, category_id, annotation_id)
                    
                    annotations.append(annotation)
                    annotation_id += 1
        image_id += 1
    return images, annotations, annotation_id

if __name__ == "__main__":
    # Get the standard COCO JSON format
    coco_format = get_coco_json_format()
    
    for keyword in ["train", "val"]:
        mask_path = os.path.join(ponds_root, f"data/{keyword}_mask/")
        # Create category section
        coco_format["categories"] = create_category_annotation(category_ids)
    
        # Create images and annotations sections
        coco_format["images"], coco_format["annotations"], annotation_cnt = images_annotations_info(mask_path)
        json_path = os.path.join(ponds_root, f"data/{keyword}/{keyword}.json")
        with open(json_path,"w") as outfile:
            print(json_path)
            json.dump(coco_format, outfile)
            
        print("Created %d annotations for images in folder: %s" % (annotation_cnt, mask_path))
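
Once the JSON files exist, a quick consistency check can catch problems before training. The sketch below is not part of the pipeline. It assumes only the standard COCO keys (`images`, `annotations`, `image_id`, `id`), both helper names are hypothetical, and the example dictionary is hand-written rather than real output.

```python
import json
from collections import Counter


def summarize_coco(coco):
    """Count images, annotations, and annotations that point to no image."""
    image_ids = {image["id"] for image in coco["images"]}
    per_image = Counter(ann["image_id"] for ann in coco["annotations"])
    orphans = [ann["id"] for ann in coco["annotations"]
               if ann["image_id"] not in image_ids]
    return {
        "images": len(image_ids),
        "annotations": len(coco["annotations"]),
        "images_without_annotations": len(image_ids - set(per_image)),
        "orphan_annotation_ids": orphans,
    }


def summarize_coco_file(json_path):
    """Load a COCO JSON file from disk and summarize it."""
    with open(json_path) as infile:
        return summarize_coco(json.load(infile))


# Tiny hand-written example in COCO layout (not real pipeline output)
example = {
    "images": [{"id": 0, "file_name": "a.png", "width": 1024, "height": 1024},
               {"id": 1, "file_name": "b.png", "width": 1024, "height": 1024}],
    "annotations": [{"id": 0, "image_id": 0, "category_id": 1}],
    "categories": [{"id": 1, "name": "pond"}],
}
summary = summarize_coco(example)
print(summary)
```

For the generated files, `summarize_coco_file(json_path)` could be called with the same path that the cell above writes to.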


With the JSON files created in the train/val folders, we are now ready to proceed to the next step: [network_selection](../1_network_selection/network_selection.ipynb). We will also visualize the masks so you can see what the labeled instances look like on the images.