### Improve model performance through changes to the dataset and model architecture

Suggestions for improvement:

**Dataset:**
1. More and more diverse training data: add images from different seasons. Optional: add images from other European countries.
2. Clean training data: remove images with no solar panels (Confident Learning, CleanLab).
3. Add images without solar panels to the dataset -> currently many false positives (house roofs etc.)
4. Data preprocessing: augmentation, robust normalisation, gamma correction, brightening
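
Robust normalisation from point 4 can be sketched as percentile-based scaling: each band is clipped to a lower and upper percentile and rescaled to [0, 1], so that a few very bright pixels do not compress the rest of the value range. The helper name `robust_normalise` and the 2nd/98th percentiles are assumptions for illustration, not settings taken from this project.

```python
import numpy as np


def robust_normalise(image, lower=2.0, upper=98.0):
    """Scale each band of a (bands, height, width) array to [0, 1].

    Hypothetical helper: the 2nd/98th percentiles are illustrative defaults.
    """
    image = image.astype(np.float32)
    # Per-band percentiles, kept as (bands, 1, 1) so they broadcast.
    lo = np.percentile(image, lower, axis=(1, 2), keepdims=True)
    hi = np.percentile(image, upper, axis=(1, 2), keepdims=True)
    # Guard against constant bands, where hi == lo.
    scale = np.maximum(hi - lo, 1e-6)
    return np.clip((image - lo) / scale, 0.0, 1.0)


rng = np.random.default_rng(0)
patch = rng.uniform(0, 3000, size=(4, 8, 8))  # random stand-in for a 4-band patch
patch[0, 0, 0] = 60000  # single outlier pixel
normalised = robust_normalise(patch)
```

The outlier pixel is clipped to 1.0 instead of stretching the whole band.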

**Model:** 
1. Test other architectures -> most promising: U-Net, U-Net++, MANet
2. Test other/bigger backbones -> most promising: ResNet-18, ResNet-34, ResNet-101, etc.
3. Hyperparameter tuning:
     1. Encoder depth
     2. Learning rate
     3. Loss function
     4. Optimizer
     5. Batch size
     6. aux_params (https://smp.readthedocs.io/en/latest/insights.html#aux-classification-output)
     7. Dropout?
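
One way to organise such a search is to enumerate all combinations up front and train one model per configuration. A minimal sketch follows; the keys and values in `search_space` are placeholders, not settings used in this notebook.

```python
from itertools import product

# Hypothetical search space; every value here is a placeholder.
search_space = {
    "encoder_depth": [4, 5],
    "learning_rate": [1e-3, 1e-4],
    "batch_size": [16, 32],
}

# Build one dict per combination of hyperparameter values.
keys = list(search_space)
configs = [dict(zip(keys, values)) for values in product(*search_space.values())]

for config in configs[:2]:
    print(config)
```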

The goal is a model with an IoU of 0.85 or higher.
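
For reference, the IoU (Jaccard index) of a binary mask is the number of pixels where prediction and ground truth are both positive, divided by the number of pixels where at least one of them is positive. A minimal NumPy sketch; the helper name `binary_iou`, the 0.5 threshold and the handling of two empty masks are assumptions.

```python
import numpy as np


def binary_iou(pred, target, threshold=0.5):
    """IoU between a probability map and a binary mask (hypothetical helper)."""
    pred_mask = pred >= threshold
    target_mask = target.astype(bool)
    intersection = np.logical_and(pred_mask, target_mask).sum()
    union = np.logical_or(pred_mask, target_mask).sum()
    # Two empty masks agree perfectly; this sketch defines their IoU as 1.
    return 1.0 if union == 0 else float(intersection / union)


pred = np.array([[0.9, 0.8], [0.2, 0.1]])
target = np.array([[1, 0], [1, 0]])
print(binary_iou(pred, target))  # 1 shared pixel out of 3 positive pixels
```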

ToDo:
1. Test different loss functions (because the current loss doesn't correlate with the IoU)
2. Color improvements (Gamma Correction, Brighten, etc.)
3. Check for cloud coverage (maybe remove images with high cloud coverage)
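
For item 1, one candidate is a loss that optimises the overlap directly. Below is a minimal sketch of a soft Jaccard (IoU) loss in PyTorch, assuming the model outputs probabilities in [0, 1], as it does with `activation='sigmoid'`. The function name and the `eps` value are illustrative. `segmentation_models_pytorch` also ships ready-made variants such as `smp.losses.JaccardLoss` and `smp.losses.DiceLoss`.

```python
import torch


def soft_jaccard_loss(probs, target, eps=1e-7):
    """1 - soft IoU, averaged over the batch (hypothetical helper).

    probs, target: float tensors of shape (batch, height, width).
    """
    dims = (1, 2)
    intersection = (probs * target).sum(dim=dims)
    union = probs.sum(dim=dims) + target.sum(dim=dims) - intersection
    iou = (intersection + eps) / (union + eps)
    return 1.0 - iou.mean()


target = torch.tensor([[[1.0, 0.0], [1.0, 0.0]]])
perfect = soft_jaccard_loss(target, target)         # prediction equals mask
inverted = soft_jaccard_loss(1.0 - target, target)  # prediction is the complement
```

A perfect prediction gives a loss close to 0 and a fully inverted one a loss close to 1, so the loss moves in step with the IoU.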

In [1]:
from pathlib import Path
import random
import torch
from torch.utils.data import DataLoader

### 1. More and more diverse training data: add images from different seasons. Optional: add images from other European countries.

To prevent data leakage, we need an index of identical image patches across seasons. This index is used to split the dataset into training, validation and test data, so that the same image patch (solar plant) never appears in more than one split.

Images and masks share the same name but are in different folders.
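
The index described above can be sketched as follows: every file name is reduced to a key of tile plus patch number, so that all seasonal versions of a patch end up in the same group. The helper `patch_key` is hypothetical and the file names are examples that follow the dataset's naming pattern.

```python
from collections import defaultdict
from pathlib import Path


def patch_key(path):
    """Return (tile, patch id) for names like '31UGR_2018-4-20_2_39.pt'."""
    tile, _date, id1, id2 = Path(path).stem.split("_")
    return tile, f"{id1}_{id2}"


# Example names: one patch in two seasons, plus a second patch.
names = [
    "31UGR_2018-4-20_2_39.pt",
    "31UGR_2018-10-10_2_39.pt",
    "31UGR_2018-10-10_10_37.pt",
]

index = defaultdict(list)
for name in names:
    index[patch_key(name)].append(name)

print(dict(index))
```

Splitting on the keys of `index` rather than on individual files keeps all seasons of a patch in the same split.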

#### 1.1 Create index of all images in the training data


1. Create a list of all tile + patch number combinations in the dataset

In [5]:
image_dir = Path(r"C:\Users\Fabian\Documents\Github_Masterthesis\Solarpark-detection\data_local\images_only_AOI_test_season")

In [6]:
example_filename = Path("31UGR_2018-4-20_2_39.pt")

In [7]:
tile, date, id1, id2 = example_filename.stem.split("_")
identifier = f"{id1}_{id2}"
tile, date, identifier

('31UGR', '2018-4-20', '2_39')

In [8]:
tile_id_list = []

# collect one (tile, patch id) pair per file; the date (season) is ignored
for file_path in image_dir.glob("*.pt"):
    filename = file_path.stem
    tile, date, id1, id2 = filename.split("_")
    identifier = f"{id1}_{id2}"
    tile_id_list.append((tile, identifier))

In [9]:
print(f"With seasons: {len(tile_id_list)}")

# drop duplicates, due to the seasons
tile_id_unique = list(set(tile_id_list))

print(f"Without seasons: {len(tile_id_unique)}")

With seasons: 7061
Without seasons: 2493


2. Shuffle the list to ensure random sampling

In [10]:
random.shuffle(tile_id_unique)

3. Split the list into training, validation and test data

In [11]:
num_total = len(tile_id_unique)
num_train = int(num_total * 0.7)
num_val = int(num_total * 0.1)
num_test = num_total - num_train - num_val

train_list = tile_id_unique[:num_train]
val_list = tile_id_unique[num_train:num_train+num_val]
test_list = tile_id_unique[num_train+num_val:]
train_list[:5]

[('32UNV', '3_17'),
 ('32UPU', '17_34'),
 ('32UME', '21_12'),
 ('32UMB', '36_33'),
 ('32UQD', '16_40')]
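
Because the shuffle is unseeded and starts from the arbitrary order of a `set`, the split changes on every run. Below is a sketch of a reproducible variant, with a sanity check that no patch key appears in more than one split. The function name, the seed and the toy keys are assumptions.

```python
import random


def split_keys(keys, train_frac=0.7, val_frac=0.1, seed=42):
    """Shuffle keys with a fixed seed and cut them into train/val/test."""
    keys = sorted(keys)  # fixed starting order, independent of set ordering
    random.Random(seed).shuffle(keys)
    num_train = int(len(keys) * train_frac)
    num_val = int(len(keys) * val_frac)
    return (
        keys[:num_train],
        keys[num_train:num_train + num_val],
        keys[num_train + num_val:],
    )


toy_keys = {("32UNV", f"{i}_{i}") for i in range(20)}
train, val, test = split_keys(toy_keys)

# No key may appear in two splits, and no key may be lost.
assert not set(train) & set(val)
assert not set(train) & set(test)
assert not set(val) & set(test)
assert len(train) + len(val) + len(test) == len(toy_keys)
```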

4. Create train, validation and test lists with the full file names of the images and masks, including all seasons

In [12]:
tile, identifier = train_list[0]
id1, id2 = identifier.split("_")
id1, id2

('3', '17')

In [13]:
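# a set makes the membership test fast for thousands of files
train_keys = set(train_list)

train_filenames = []
for file_path in image_dir.glob("*.pt"):
    tile, date, id1, id2 = file_path.stem.split("_")
    identifier = f"{id1}_{id2}"
    # keep every season of a patch that belongs to the training split
    if (tile, identifier) in train_keys:
        train_filenames.append(str(file_path))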
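
The same lookup can be done for all three splits in a single pass over the directory by first mapping each key to its split name. A minimal sketch; `assign_files` and the toy inputs are hypothetical.

```python
from pathlib import Path


def assign_files(file_names, splits):
    """Map each file to the split that owns its (tile, patch id) key.

    splits: dict like {"train": [...keys], "val": [...], "test": [...]}.
    """
    split_of_key = {key: name for name, keys in splits.items() for key in keys}
    files = {name: [] for name in splits}
    for file_name in file_names:
        tile, _date, id1, id2 = Path(file_name).stem.split("_")
        key = (tile, f"{id1}_{id2}")
        if key in split_of_key:
            files[split_of_key[key]].append(file_name)
    return files


splits = {"train": [("31UGR", "2_39")], "val": [("31UGR", "10_37")], "test": []}
file_names = [
    "31UGR_2018-4-20_2_39.pt",
    "31UGR_2018-10-10_2_39.pt",
    "31UGR_2018-10-10_10_37.pt",
]
files = assign_files(file_names, splits)
```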

In [22]:
test = torch.load(train_filenames[0])

In [23]:
train_dataloader = DataLoader(
    train_filenames,
    batch_size=32
)

In [25]:
# DataLoader has no .iter() method; use the built-in iter() instead
batch = next(iter(train_dataloader))
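
A `DataLoader` built directly on a list of paths yields batches of path strings, not tensors. Below is a minimal sketch of a `Dataset` that loads each `.pt` file on access, assuming every file stores a single image tensor. The class name `PatchFileDataset` is hypothetical and masks are left out for brevity.

```python
import tempfile
from pathlib import Path

import torch
from torch.utils.data import DataLoader, Dataset


class PatchFileDataset(Dataset):
    """Loads one tensor per .pt file (hypothetical helper)."""

    def __init__(self, file_names):
        self.file_names = list(file_names)

    def __len__(self):
        return len(self.file_names)

    def __getitem__(self, index):
        return torch.load(self.file_names[index])


# Demo with random 4-band patches written to a temporary directory.
tmp_dir = Path(tempfile.mkdtemp())
paths = []
for i in range(4):
    path = tmp_dir / f"31UGR_2018-4-20_0_{i}.pt"
    torch.save(torch.rand(4, 16, 16), path)
    paths.append(str(path))

loader = DataLoader(PatchFileDataset(paths), batch_size=2)
batch = next(iter(loader))
print(batch.shape)
```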

In [105]:
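for file_path in image_dir.glob("*.pt"):
    tile, date, id1, id2 = file_path.stem.split("_")
    # print the raw file name and its parsed parts to check the split
    print(file_path.stem)
    print(tile, date, id1, id2)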

31UGR_2018-10-10_10_37
31UGR 2018-10-10 10 37
31UGR_2018-10-10_10_38
31UGR 2018-10-10 10 38
31UGR_2018-10-10_11_32
31UGR 2018-10-10 11 32
...

In [91]:
train_filenames

[]

### Different approach: index-based split
https://stackoverflow.com/questions/50544730/how-do-i-split-a-custom-dataset-into-training-and-test-datasets

In [None]:
import numpy as np

# Example settings (assumed values); `dataset` must already exist
validation_split = 0.2
shuffle_dataset = True
random_seed = 42

# Creating data indices for training and validation splits:
dataset_size = len(dataset)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
if shuffle_dataset:
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

In [3]:
import tensorflow_io_gcs_filesystem as tfio

ModuleNotFoundError: No module named 'tensorflow'

In [2]:
import tensorflow as tf

ModuleNotFoundError: No module named 'tensorflow'

In [1]:
from torch.utils.tensorboard import SummaryWriter

In [2]:
import os
import torch
import numpy as np
from torch.utils.data import Dataset, DataLoader, sampler
from torch import nn
import matplotlib.pyplot as plt
from torchmetrics.classification import BinaryJaccardIndex
from torch.utils.tensorboard import SummaryWriter
from dataset_class import GeoImageDataset  # project-local module, needed below
from rasterio.plot import show
import segmentation_models_pytorch as smp
# Disabled: TensorFlow is not installed in this environment and nothing below uses it
# import tensorflow as tf
# from tensorboard.plugins.hparams import api as hp

In [None]:
img_dir = r'C:\Users\Fabian\Documents\Masterarbeit_Daten\images_only_AOI4'
mask_dir = r'C:\Users\Fabian\Documents\Masterarbeit_Daten\masks_only_AOI4'
geo_image_dataset = GeoImageDataset(img_dir, mask_dir)
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using {device} device")
# select U-Net with a ResNeSt-14d backbone
model = smp.Unet(
    encoder_name="timm-resnest14d",        # choose encoder, e.g. mobilenet_v2 or efficientnet-b7
    encoder_weights="imagenet",     # use `imagenet` pre-trained weights for encoder initialization
    in_channels=4,                  # model input channels (1 for gray-scale images, 3 for RGB, etc.)
    classes=1,                      # model output channels (number of classes in your dataset)
    activation='sigmoid',           # sigmoid output: per-pixel probabilities
).to(device)                        # use the selected device instead of hard-coding CUDA

In [None]:
model_dir = r'C:\Users\Fabian\Documents\Masterarbeit_Daten\saved_models2'
model_filename = 'timm-resnest14d_imagenet_sigmoid_epoch-160.pth'
model_path = os.path.join(model_dir, model_filename)
model.load_state_dict(torch.load(model_path, map_location=device))
model.eval() # enabling the eval mode to test with new samples.
metric = BinaryJaccardIndex().to(device)

In [None]:
img, mask = geo_image_dataset[1000]
img = img.to(device)
mask = mask.to(device)

# Run forward pass
with torch.no_grad():
    pred = model(img.unsqueeze(0))

# IoU in percent for this single patch
iou = 100 * metric(pred[:, 0], mask.unsqueeze(0))
print(f"IoU: {iou.item():.1f} %")
pred_np = pred.detach().cpu().numpy()
img_np = img.detach().cpu().numpy()
mask_np = mask.detach().cpu().numpy()

In [None]:
def gammacorr(band):
    gamma=2.2
    return np.power(band, 1/gamma)
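
For the "Brighten" item of the ToDo list, a linear stretch is often combined with the gamma correction above. The sketch below assumes band values already scaled to [0, 1]. The `brighten` helper and its `alpha`/`beta` values are assumptions, not tuned settings.

```python
import numpy as np


def brighten(band, alpha=2.0, beta=0.0):
    """Linear stretch, clipped to the displayable range [0, 1]."""
    return np.clip(alpha * band + beta, 0.0, 1.0)


def gammacorr(band, gamma=2.2):
    """Gamma correction; expects non-negative input."""
    return np.power(band, 1.0 / gamma)


band = np.array([0.0, 0.1, 0.4, 0.8])
display = gammacorr(brighten(band))
```

Clipping before the gamma step keeps the input non-negative and the result within [0, 1].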

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12,12))
# bands 1-3 of the 4-channel patch, gamma-corrected for display
show(gammacorr(img_np[1:4]), ax=ax)
plt.show()

In [None]:
show(mask_np)

In [None]:
show(source=pred_np[0])