# Understanding the Amazon rainforest from space
### Input data: training on a single-band (NIR) TIFF image of size 128 x 128
### Architecture: (3 CNN layers + 3 max-pooling + 3 dropout) + flatten + ANN

The first step is pre-processing the training data. The steps involved in this scenario are:

1. Get the file paths.
2. Load the TIFF images.
3. Extract the near-infrared (NIR) band from the BGRN image.
4. Define the hyperparameters.
5. Transform the training data to matrices; returns img_array and targets as normalized numpy arrays.
6. Get the train matrices; returns x_train, y_train and labels_map.
7. Get the train tensor; returns train_tensor as a numpy array of (x_train, y_train and labels_map).

#### 0. import required packages

In [1]:
import os
import sys
import cv2
import gc
import scipy
import subprocess
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

from six import string_types
from os.path import exists
from tqdm import tqdm
from itertools import chain
from multiprocessing import cpu_count
from concurrent.futures import ThreadPoolExecutor
from skimage import io
from scipy import ndimage
from IPython.display import display
%matplotlib inline

#### 1. get the input file path

In [2]:
def get_tif_data_files_paths():
    """
    Returns the input file folders path for .tiff files
    
    :return: The input file paths as list [train_tif_dir, test_tif_dir, train_tif_sample, train_csv_file]
    """

    data_folder = "/root/input"
    train_tif_dir = os.path.join(data_folder, 'train-tif-v2')
    test_tif_dir = os.path.join(data_folder, 'test-tif-v2')
    train_tif_sample = os.path.join(data_folder, 'train-tif-sample')
    train_csv_file = os.path.join(data_folder, 'train_v2.csv')

    assert os.path.exists(data_folder), "The {} folder does not exist".format(data_folder)
    assert os.path.exists(train_tif_dir), "The {} folder does not exist".format(train_tif_dir)
    assert os.path.exists(test_tif_dir), "The {} folder does not exist".format(test_tif_dir)
    assert os.path.exists(train_tif_sample), "The {} file does not exist".format(train_tif_sample)
    assert os.path.exists(train_csv_file), "The {} file does not exist".format(train_csv_file)
    return [train_tif_dir, test_tif_dir, train_tif_sample, train_csv_file]

Let's test the function to make sure it returns the expected paths.

In [3]:
print(get_tif_data_files_paths())

['/root/input/train-tif-v2', '/root/input/test-tif-v2', '/root/input/train-tif-sample', '/root/input/train_v2.csv']


Now we need to create a list of the file names in our train directory so we can iterate over every file and extract the band we need.

In [4]:
train_tif_dir = get_tif_data_files_paths()[0]  # index 0 selects the training TIFF directory
train_tif_file_path_list = os.listdir(train_tif_dir)

Let's inspect the list by printing its length.

In [5]:
print(len(train_tif_file_path_list))

40479
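Note that os.listdir returns bare file names (not full paths) in arbitrary order, and it includes any non-TIFF files that happen to be in the folder. A minimal sketch of a filtered, sorted listing follows; the helper name list_tif_files and the demo file names are hypothetical and not part of the original notebook.

```python
import os
import tempfile


def list_tif_files(directory):
    """Return the sorted .tif file names found directly inside `directory`."""
    return sorted(
        name for name in os.listdir(directory) if name.lower().endswith(".tif")
    )


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["train_1.tif", "train_0.tif", "notes.txt"]:
        open(os.path.join(tmp_dir, name), "w").close()
    demo_names = list_tif_files(tmp_dir)
    print(demo_names)
```

In the notebook, the same helper could be called with train_tif_dir instead of the temporary directory.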


The NIR band is the fourth channel (index 3) of a BGRN image: nir = bgrn_image[:, :, 3]
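To make the band indexing concrete, here is a small sketch that runs on a synthetic array instead of a real TIFF. It assumes the image is laid out as (height, width, band) with bands in blue, green, red, NIR order; the BAND_INDEX mapping and the extract_band helper are illustrative names, not part of the original notebook.

```python
import numpy as np

# Synthetic stand-in for a 4-band TIFF loaded as (height, width, band).
rng = np.random.default_rng(0)
bgrn_image = rng.integers(0, 2**16, size=(256, 256, 4), dtype=np.uint16)

# Assumed band order: blue, green, red, near-infrared.
BAND_INDEX = {"blue": 0, "green": 1, "red": 2, "nir": 3}


def extract_band(image, band):
    """Return one band of a (height, width, band) image as a 2-D array."""
    return image[:, :, BAND_INDEX[band]]


nir = extract_band(bgrn_image, "nir")
print(nir.shape, nir.dtype)
```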

In [6]:
labels_df = pd.read_csv(get_tif_data_files_paths()[3])
print(labels_df)

        image_name                                               tags
0          train_0                                       haze primary
1          train_1                    agriculture clear primary water
2          train_2                                      clear primary
3          train_3                                      clear primary
4          train_4          agriculture clear habitation primary road
5          train_5                                 haze primary water
6          train_6        agriculture clear cultivation primary water
7          train_7                                       haze primary
8          train_8              agriculture clear cultivation primary
9          train_9         agriculture clear cultivation primary road
10        train_10         agriculture clear primary slash_burn water
11        train_11                                clear primary water
12        train_12                                             cloudy
13        train_13  

In [7]:
labels = sorted(set(chain.from_iterable([tags.split(" ") for tags in labels_df['tags'].values])))
print(labels)

['agriculture', 'artisinal_mine', 'bare_ground', 'blooming', 'blow_down', 'clear', 'cloudy', 'conventional_mine', 'cultivation', 'habitation', 'haze', 'partly_cloudy', 'primary', 'road', 'selective_logging', 'slash_burn', 'water']


In [8]:
labels_map = {l: i for i, l in enumerate(labels)}
print(labels_map)

{'blow_down': 4, 'road': 13, 'primary': 12, 'blooming': 3, 'slash_burn': 15, 'agriculture': 0, 'water': 16, 'artisinal_mine': 1, 'bare_ground': 2, 'cultivation': 8, 'partly_cloudy': 11, 'habitation': 9, 'cloudy': 6, 'conventional_mine': 7, 'clear': 5, 'selective_logging': 14, 'haze': 10}
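With labels_map in place, each space-separated tag string can be turned into a fixed-length target vector. The sketch below shows one common way to do this (multi-hot encoding); the encode_tags helper is a hypothetical name, and the notebook's own target construction may differ.

```python
import numpy as np

# The 17 labels printed above, in sorted order.
labels = ['agriculture', 'artisinal_mine', 'bare_ground', 'blooming',
          'blow_down', 'clear', 'cloudy', 'conventional_mine', 'cultivation',
          'habitation', 'haze', 'partly_cloudy', 'primary', 'road',
          'selective_logging', 'slash_burn', 'water']
labels_map = {l: i for i, l in enumerate(labels)}


def encode_tags(tags, labels_map):
    """Turn a space-separated tag string into a multi-hot vector."""
    target = np.zeros(len(labels_map), dtype=np.uint8)
    for tag in tags.split(" "):
        target[labels_map[tag]] = 1
    return target


y = encode_tags("haze primary", labels_map)
print(y)
```

Each position in the vector corresponds to one label index, so an image tagged with several labels gets several ones.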


In [9]:
files_path = []
tags_list = []
for file_name, tags in labels_df.values:
    files_path.append('{}/{}.tif'.format(train_tif_dir, file_name))
    tags_list.append(tags)
print(len(files_path))
print(len(tags_list))
print(files_path[1])

40479
40479
/root/input/train-tif-v2/train_1.tif


Now let's extract each band, sort them as RGBN and plot each band.

In [34]:
# note the initial bgrn band ordering

#bgrn_image = (io.imread(files_path[103]))

train_img = io.imread(files_path[0])[:,:,3]
print(train_img)

[[6621 6596 6648 ..., 5970 5910 5812]
 [6352 6440 6554 ..., 6315 6328 6156]
 [6020 6201 6389 ..., 6494 6564 6358]
 ..., 
 [6290 6230 6159 ..., 6209 5969 5814]
 [6331 6305 6250 ..., 6323 6164 6041]
 [6339 6356 6292 ..., 6424 6363 6281]]


Let's take a look at the shape of the NIR band. It shows that our images are 256 x 256.

Here is a glance at the actual values of the NIR band. These values are raw sensor data as recorded.
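Raw values in this range are usually rescaled before being fed to a network. As a minimal sketch, assuming the band is stored as 16-bit unsigned integers, the values can be divided by the maximum representable value (65535) to land in [0, 1]. The normalize_uint16 helper is a hypothetical name, and the normalization actually used later in the notebook may differ.

```python
import numpy as np

# A few of the raw NIR values printed above, assumed to be uint16.
raw = np.array([[6621, 6596, 6648],
                [6352, 6440, 6554],
                [6020, 6201, 6389]], dtype=np.uint16)


def normalize_uint16(band):
    """Scale a uint16 band to float32 values in the range [0, 1]."""
    return band.astype(np.float32) / np.float32(65535.0)


scaled = normalize_uint16(raw)
print(scaled.dtype, scaled.min(), scaled.max())
```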

The next step is to iterate over every image, grab the NIR band and put them all into img_array.
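The loop can be sketched as follows. This is an illustration under stated assumptions, not the notebook's implementation: the image loader is passed in as a function so the example runs without the dataset, fake_loader stands in for io.imread, and the 256 to 128 reduction is done by averaging 2 x 2 blocks, which is only one of several possible resizing choices.

```python
import numpy as np


def downsample_2x(band):
    """Halve each dimension by averaging non-overlapping 2 x 2 blocks."""
    h, w = band.shape
    return band.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def build_nir_array(paths, load_image):
    """Load each image, keep the NIR band (index 3), and stack the results."""
    bands = []
    for path in paths:
        nir = load_image(path)[:, :, 3]
        bands.append(downsample_2x(nir.astype(np.float32)))
    return np.stack(bands)


def fake_loader(path):
    """Hypothetical stand-in for io.imread: returns a random BGRN image."""
    rng = np.random.default_rng(len(path))
    return rng.integers(0, 2**16, size=(256, 256, 4), dtype=np.uint16)


demo_paths = ["train_0.tif", "train_1.tif", "train_10.tif"]
img_array = build_nir_array(demo_paths, fake_loader)
print(img_array.shape)
```

With the real data, io.imread would be passed as load_image and files_path as paths. For the full training set it may be worth preallocating the output array rather than collecting a Python list first.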