# K-fold cross-validation (k = number of images = 16)

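Each fold holds out a single image for testing and trains on the remaining n_samples-1 images, so every image serves as the test set exactly once. The detailed output for each fold follows the third cell. The pooled score over all folds is printed by the cell that follows the loop.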
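The hold-one-image-out split can be sketched on synthetic data. The array layout `(n_images, n_pixels, n_features)` is an assumption inferred from how the loop in this notebook indexes `X` and `y`; the helper name `leave_one_image_out` and the toy sizes are hypothetical.

```python
import numpy as np

def leave_one_image_out(X, y):
    """Yield (i, X_train, y_train, X_test, y_test) with image i held out.

    Assumes X has shape (n_images, n_pixels, n_features) and
    y has shape (n_images, n_pixels, n_targets).
    """
    for i in range(X.shape[0]):
        X_test, y_test = X[i], y[i]
        # np.delete returns a new array, so X and y are left untouched.
        X_train = np.delete(X, i, axis=0).reshape(-1, X.shape[2])
        y_train = np.delete(y, i, axis=0).reshape(-1, y.shape[2])
        yield i, X_train, y_train, X_test, y_test

# Toy data: 4 "images", 6 sampled pixels each, 3 bands per pixel.
rng = np.random.default_rng(0)
X_toy = rng.random((4, 6, 3))
y_toy = rng.integers(0, 2, size=(4, 6, 1))

folds = list(leave_one_image_out(X_toy, y_toy))
for i, X_tr, y_tr, X_te, y_te in folds:
    print(i, X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)
```

Splitting by image rather than by pixel keeps pixels from the same scene out of both the training and the test set, so each fold measures how well the model transfers to an unseen image.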

In [1]:
from utils.data import Data
from utils.estimators import Dataset, Classifier
import numpy as np
%matplotlib inline

In [2]:
tiff_location = "./Data/Images/"
shp_location = "./Data/Labels/"
all_data = Data(tiff_location, shp_location, classes = ["water", "land"])
all_tiff = all_data.read_tiff() 
all_mask = all_data.get_mask()
X, y = all_data.get_Xy(all_tiff, all_mask, n_sample = 200000, k_fold=True)

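The per-fold output reports 200000 pixels per class for each test image, which suggests that `n_sample` is a per-class, per-image count. How `get_Xy` draws those pixels is not shown in this notebook. The sketch below is one way balanced sampling could work; `sample_balanced`, its arguments, and the with-replacement fallback are all assumptions, not the `utils.data` implementation.

```python
import numpy as np

def sample_balanced(pixels, mask, n_sample, rng):
    """Draw n_sample pixels per class from one image (hypothetical helper).

    pixels: array of shape (height, width, bands)
    mask:   integer class ids of shape (height, width)
    Returns X of shape (n_classes * n_sample, bands) and
    y of shape (n_classes * n_sample, 1).
    """
    flat_pixels = pixels.reshape(-1, pixels.shape[-1])
    flat_mask = mask.reshape(-1)
    X_parts, y_parts = [], []
    for label in np.unique(flat_mask):
        idx = np.flatnonzero(flat_mask == label)
        # Assumption: sample with replacement only when a class has too few pixels.
        chosen = rng.choice(idx, size=n_sample, replace=idx.size < n_sample)
        X_parts.append(flat_pixels[chosen])
        y_parts.append(np.full((n_sample, 1), label))
    return np.concatenate(X_parts), np.concatenate(y_parts)

# Toy image: 10x10 pixels, 3 bands, left half class 0 and right half class 1.
rng = np.random.default_rng(0)
toy_pixels = rng.random((10, 10, 3))
toy_mask = np.zeros((10, 10), dtype=int)
toy_mask[:, 5:] = 1

X_img, y_img = sample_balanced(toy_pixels, toy_mask, n_sample=20, rng=rng)
print(X_img.shape, y_img.shape)
```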


In [3]:
classifier = Classifier()
all_preds = []
all_y = []
for i in range(X.shape[0]):
    print(f"\n\nTraining on split {i+1} out of split {X.shape[0]}")
    print(f"Test tiff: {all_tiff[i].name}")
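    # Hold out image i for testing; copy it so later steps cannot modify X or y.
    X_test, y_test = X[i].copy(), y[i].copy()
    # np.delete returns a new array: flatten the remaining images into one training set.
    X_train = np.delete(X, i, 0).reshape(-1, X.shape[2])
    y_train = np.delete(y, i, 0).reshape(-1, y.shape[2])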
    dataset = Dataset(X_train, X_test, y_train, y_test)
    print(dataset.info())
    all_y.extend(dataset.testY)
    preds = classifier.random_forest(trainX=dataset.trainX, trainY=dataset.trainY, 
                                     testX=dataset.testX, testY=dataset.testY,
                                     grid_search=False, train=True, 
                                     n_estimators = 5, max_depth = 5)
    all_preds.extend(preds)
all_preds = np.asarray(all_preds)
all_y = np.asarray(all_y)



Training on split 1 out of split 16
Test tiff: ./Data/Images/5_band13.tif
No. of classes: 2
Class labels: ['water', 'land']
Total data samples: 6400000
Train samples: 6000000
	 0:water = 3000000
	 1:land = 3000000
Test stats: 400000
	 0:water = 200000
	 1:land = 200000
None

Random Forest
Elapsed_time training  20.674340 
Accuracy on train Set: 
0.9446685
Accuracy on Test Set: 
0.9834
Classification Report: 
              precision    recall  f1-score   support

           0       0.97      1.00      0.98    200000
           1       1.00      0.97      0.98    200000

    accuracy                           0.98    400000
   macro avg       0.98      0.98      0.98    400000
weighted avg       0.98      0.98      0.98    400000

Confusion Matrix: 
[[199383    617]
 [  6023 193977]]


Training on split 2 out of split 16
Test tiff: ./Data/Images/5_band16.tif
No. of classes: 2
Class labels: ['water', 'land']
Total data samples: 6400000
Train samples: 6000000
	 0:water = 3000000
	 1:land = 3000000
Test stats: 400000
	 0:water = 200000
	 1:land = 200000
None

Random Forest
Elapsed_time training  21.027024 
Accuracy on train Set: 
0.9521466666666667
Accuracy on Test Set: 
0.763975
Classification Report: 
              precision    recall  f1-score   support

           0       0.89      0.60      0.72    200000
           1       0.70      0.92      0.80    200000

    accuracy                           0.76    400000
   macro avg       0.79      0.76      0.76    400000
weighted avg       0.79      0.76      0.76    400000

Confusion Matrix: 
[[120976  79024]
 [ 15386 184614]]


Training on split 12 out of split 16
Test tiff: ./Data/Images/5_band8.tif
No. of classes: 2
Class labels: ['water', 'land']
Total data samples: 6400000
Train samples: 6000000
	 0:water = 3000000
	 1:land = 3000000
Test stats: 400000
	 0:water = 200000
	 1:land = 200000
None

Random Forest
Elapsed_time training  20.823639 
Accuracy on train Set: 
0.9438231666666667
Accuracy on Test Set: 
0.975975
Classification Report: 
              precision    recall  f1-score   support

      

In [4]:
from sklearn import metrics
print("Classification Report: ")
print(metrics.classification_report(all_y, all_preds))
print("Confusion Matrix: ")
print(metrics.confusion_matrix(all_y, all_preds))

Classification Report: 
              precision    recall  f1-score   support

           0       0.93      0.92      0.92   3200000
           1       0.92      0.93      0.92   3200000

    accuracy                           0.92   6400000
   macro avg       0.92      0.92      0.92   6400000
weighted avg       0.92      0.92      0.92   6400000

Confusion Matrix: 
[[2936373  263627]
 [ 234553 2965447]]
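As a sanity check, the per-class figures in the report can be recomputed from the pooled confusion matrix. This is a minimal sketch in plain Python; the helper name `report_from_confusion` is hypothetical. Rows are taken as true labels and columns as predictions, which is scikit-learn's convention for `confusion_matrix`.

```python
def report_from_confusion(cm):
    """Return (accuracy, per-class precision, per-class recall).

    cm[i][j] counts samples of true class i that were predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    # Precision: correct predictions of class k over everything predicted as k.
    precision = [cm[k][k] / sum(cm[i][k] for i in range(n)) for k in range(n)]
    # Recall: correct predictions of class k over everything that truly is k.
    recall = [cm[k][k] / sum(cm[k]) for k in range(n)]
    return correct / total, precision, recall

# Pooled confusion matrix copied from the output above.
pooled_cm = [[2936373, 263627],
             [234553, 2965447]]

acc, prec, rec = report_from_confusion(pooled_cm)
print(f"accuracy {acc:.2f}")
for k in range(len(pooled_cm)):
    print(f"class {k}: precision {prec[k]:.2f}, recall {rec[k]:.2f}")
```

The rounded values agree with the report: accuracy 0.92, class 0 precision 0.93 and recall 0.92, class 1 precision 0.92 and recall 0.93.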


## K-fold CV with a larger forest (50 trees, max depth 20)

In [5]:
classifier = Classifier()
all_preds = []
all_y = []
for i in range(X.shape[0]):
    print(f"\n\nTraining on split {i+1} out of split {X.shape[0]}")
    print(f"Test tiff: {all_tiff[i].name}")
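    # Hold out image i for testing; copy it so later steps cannot modify X or y.
    X_test, y_test = X[i].copy(), y[i].copy()
    # np.delete returns a new array: flatten the remaining images into one training set.
    X_train = np.delete(X, i, 0).reshape(-1, X.shape[2])
    y_train = np.delete(y, i, 0).reshape(-1, y.shape[2])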
    dataset = Dataset(X_train, X_test, y_train, y_test)
    print(dataset.info())
    all_y.extend(dataset.testY)
    preds = classifier.random_forest(trainX=dataset.trainX, trainY=dataset.trainY, 
                                     testX=dataset.testX, testY=dataset.testY,
                                     grid_search=False, train=True, 
                                     n_estimators = 50, max_depth = 20)
    all_preds.extend(preds)
all_preds = np.asarray(all_preds)
all_y = np.asarray(all_y)



Training on split 1 out of split 16
Test tiff: ./Data/Images/5_band13.tif
No. of classes: 2
Class labels: ['water', 'land']
Total data samples: 6400000
Train samples: 6000000
	 0:water = 3000000
	 1:land = 3000000
Test stats: 400000
	 0:water = 200000
	 1:land = 200000
None

Random Forest
Elapsed_time training  460.773308 
Accuracy on train Set: 
0.9771321666666667
Accuracy on Test Set: 
0.97166
Classification Report: 
              precision    recall  f1-score   support

           0       0.97      0.97      0.97    200000
           1       0.97      0.97      0.97    200000

    accuracy                           0.97    400000
   macro avg       0.97      0.97      0.97    400000
weighted avg       0.97      0.97      0.97    400000

Confusion Matrix: 
[[194111   5889]
 [  5447 194553]]


Training on split 2 out of split 16
Test tiff: ./Data/Images/5_band16.tif
No. of classes: 2
Class labels: ['water', 'land']
Total data samples: 6400000
Train samples: 6000000
	 0:water = 3000000
	 1:land = 3000000
Test stats: 400000
	 0:water = 200000
	 1:land = 200000
None

Random Forest
Elapsed_time training  474.135563 
Accuracy on train Set: 
0.9789758333333334
Accuracy on Test Set: 
0.82342
Classification Report: 
              precision    recall  f1-score   support

           0       0.86      0.77      0.81    200000
           1       0.79      0.88      0.83    200000

    accuracy                           0.82    400000
   macro avg       0.83      0.82      0.82    400000
weighted avg       0.83      0.82      0.82    400000

Confusion Matrix: 
[[153784  46216]
 [ 24416 175584]]


Training on split 12 out of split 16
Test tiff: ./Data/Images/5_band8.tif
No. of classes: 2
Class labels: ['water', 'land']
Total data samples: 6400000
Train samples: 6000000
	 0:water = 3000000
	 1:land = 3000000
Test stats: 400000
	 0:water = 200000
	 1:land = 200000
None

Random Forest
Elapsed_time training  476.342707 
Accuracy on train Set: 
0.9765045
Accuracy on Test Set: 
0.9598975
Classification Report: 
              precision    recall  f1-score   support

           0 

In [6]:
from sklearn import metrics
print("Classification Report: ")
print(metrics.classification_report(all_y, all_preds))
print("Confusion Matrix: ")
print(metrics.confusion_matrix(all_y, all_preds))

Classification Report: 
              precision    recall  f1-score   support

           0       0.91      0.91      0.91   3200000
           1       0.91      0.92      0.91   3200000

    accuracy                           0.91   6400000
   macro avg       0.91      0.91      0.91   6400000
weighted avg       0.91      0.91      0.91   6400000

Confusion Matrix: 
[[2925005  274995]
 [ 271796 2928204]]
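The two runs can be put side by side using only numbers printed in this notebook. The sketch below recomputes pooled accuracy from each confusion matrix and the train/test gap for the folds whose accuracies are listed in full (1, 2 and 12); the dictionary layout and variable names are illustrative.

```python
SMALL = "n_estimators=5, max_depth=5"
LARGE = "n_estimators=50, max_depth=20"

# Pooled confusion matrices (rows: true class, columns: predicted class).
pooled = {
    SMALL: [[2936373, 263627], [234553, 2965447]],
    LARGE: [[2925005, 274995], [271796, 2928204]],
}

# (train accuracy, test accuracy) per fold, copied from the per-fold output.
per_fold = {
    SMALL: {1: (0.9446685, 0.9834),
            2: (0.9521466666666667, 0.763975),
            12: (0.9438231666666667, 0.975975)},
    LARGE: {1: (0.9771321666666667, 0.97166),
            2: (0.9789758333333334, 0.82342),
            12: (0.9765045, 0.9598975)},
}

def accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    return sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

pooled_acc = {name: accuracy(cm) for name, cm in pooled.items()}
gaps = {name: {fold: train - test for fold, (train, test) in folds.items()}
        for name, folds in per_fold.items()}

for name in (SMALL, LARGE):
    print(f"{name}: pooled accuracy {pooled_acc[name]:.4f}")
    for fold, gap in gaps[name].items():
        print(f"  fold {fold}: train minus test accuracy = {gap:+.4f}")
```

On these figures the larger forest fits the training images more closely (about 0.98 train accuracy on the listed folds, against 0.94 to 0.95 for the smaller one), but its pooled test accuracy is slightly lower, roughly 0.915 against 0.922. Fold 2 has by far the largest train/test gap in both runs.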
