In [None]:
import os
# The jupyter notebook is launched from your $HOME directory.
# Change the working directory to the workshop directory
# which was created in your username directory under /scratch/vp91
os.chdir(os.path.expandvars("/scratch/vp91/$USER/"))

# Supervised Image Classification


- **Special requirements:** A Google account, access to Google Earth Engine.




## Description
Supervised classification is one of the most popular ways to derive thematic maps in remote sensing, for applications ranging from generating Land Use/Land Cover maps to change detection.
In this session, you will learn how to collect training samples, use machine learning techniques to allocate the whole image to one of the defined categories, perform accuracy assessments, and run some statistics for the resulting classified map. Also, you will learn how to improve the results of classification using post-classification methods.

## Aims of the practical session
* Load images for a region of interest
* Collect training samples
* Split samples into training/validation data
* Sample the imagery at the training points
* Train and apply a classifier
* Calculate spectral indices
* Improve classification results
* Accuracy assessment for training/validation data
* Calculating area by class

## Getting started

### Load packages

Import the GEE and supporting packages that are needed for the analysis.

In [1]:
import ee
import geemap
import pandas as pd
import matplotlib.pyplot as plt

### Connect to Google Earth Engine (GEE)

Connect to GEE to access its computing tools and datasets.
You may be asked to sign in with your Google account for authorization.

In [2]:
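# Create an interactive map with geemap; by default this also initialises
# the Earth Engine session (and may prompt you to authenticate)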
Map = geemap.Map(center=[-35.2041, 149.2721], zoom=12)



### Adding Region of Interest (ROI)

Create the ROI that we want to work with and display it on the GEE map.
We can also create an ROI by drawing it manually on the map, or by importing a shapefile stored on your computer.

In [3]:
# Draw polygon for the ROI and add layer on the GEE map.
geometry = ee.Geometry.Polygon([[
    [149.08169361455955, -35.32478551096885],
    [149.1481265674404, -35.325065623240356],
    [149.14829822881737, -35.27911424131675],
    [149.08289524419823, -35.27855369756653]
]])

Map.add_basemap('Esri.WorldImagery')
Map.addLayer(geometry, {'alpha': 0.01}, 'Canberra ROI')
Map.addLayerControl()
Map.centerObject(geometry)
Map


## Training data

Training data (or a training dataset) constitutes the backbone of classification tasks, as it is used to characterise the variability of categories within the study area (in other words, it is used to obtain information instrumental for predicting classes from new data).
The quality of the training data has a greater impact on the classification than the algorithm used. Large and accurate training datasets are preferable: increasing the training sample size generally increases classification accuracy ([Maxwell et al. 2018](https://www.tandfonline.com/doi/full/10.1080/01431161.2018.1433343)). A review of training data methods in the context of Earth Observation is available [here](https://www.mdpi.com/2072-4292/12/6/1034).

When creating training labels (i.e., the names of each category/class), be sure to capture the spectral variability of the class, and to use imagery from the time period you want to classify (rather than relying on basemap composites). 

Another common problem with training data is class imbalance. This can occur when one of your classes is relatively rare and therefore the rare class will comprise a smaller proportion of the training set. When imbalanced data is used, it is common that the final classification will under-predict less abundant classes relative to their true proportion. An ideal training dataset would have approximately the same number of data entries for each of the classes considered.
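
A quick way to check for class imbalance is to count the samples per class. The sketch below is only illustrative: the label list is hypothetical, built to match the per-class sample counts that appear in the confusion matrix later in this notebook (93, 86, 132 and 38), and the largest/smallest ratio is just one simple indicator of imbalance.

```python
from collections import Counter

# Hypothetical label list, sized to match the per-class sample counts
# shown in the confusion matrix later in this notebook
labels = ['highveg'] * 93 + ['lowveg'] * 86 + ['urban'] * 132 + ['water'] * 38

counts = Counter(labels)
total = sum(counts.values())

# Share of the training set held by each class
shares = {name: n / total for name, n in counts.items()}

# Ratio between the largest and smallest class (1.0 means perfectly balanced)
imbalance_ratio = max(counts.values()) / min(counts.values())

for name, n in counts.items():
    print(f'{name:8s} {n:4d} samples ({shares[name]:.1%})')
print(f'Largest/smallest class ratio: {imbalance_ratio:.2f}')
```

In the notebook itself, the same counts could be obtained from the `landcover` column of the dataframe created when the training data is loaded.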

There are many platforms for gathering training labels, and the best one to use depends on your application. We will show you how you can collect training data through the web app [geojson.io](https://geojson.io/#map=3.63/-27.46/134.67), a simple application for drawing geometries on a basemap.

Geemap also has some functionality for collecting training samples, though it can be glitchy. To see how it's used, see the video here: https://www.youtube.com/watch?v=VWh5PxXPZw0.

### Loading training samples from file

We have prepared a small training data sample for use with this practical. Below we load the file (which is stored on GitHub as a GeoJSON) into the notebook.

In [4]:
# load training data from github
training_path = 'https://raw.githubusercontent.com/nicolasyounes/engn3903/main/figures/training_data.geojson'
training_data = geemap.geojson_to_ee(training_path)

# Print how many classes there are in the training data
df = geemap.ee_to_df(training_data)
n_classes = len(df['landcover'].unique())
print(f'There are {n_classes} landcover classes in the training dataset')



There are 4 landcover classes in the training dataset


In [5]:
# Plot the training samples, giving each land cover class its own colour
def setPointProperties(f):
    lc = f.get('landcover')  # class index, 0 to 3
    mapDisplayColors = ee.List(['red', 'green', 'white', 'blue'])

    # Use the class index to look up the corresponding display colour
    return f.set({'style': {'color': mapDisplayColors.get(lc)}})

    
# apply the function and view the results on map
training_data = training_data.map(setPointProperties)
Map.addLayer(training_data.style(**{'styleProperty': 'style'}), {}, 'all training data')
Map.centerObject(training_data)
Map


### Training data sampling from Sentinel-2 images

In the next few code cells we will extract training data from Sentinel-2 images at the pixels specified by the training sample locations loaded in the previous step.

Sentinel-2 is a wide-swath, high-resolution, multi-spectral imaging mission supporting Copernicus Land Monitoring studies, including the monitoring of vegetation, soil and water cover, as well as observation of inland waterways and coastal areas.

We will:
* Define a function for cloud masking and rescaling Sentinel-2 images
* Load Sentinel-2 images for the analysis
* Filter a collection by date range
* Calculate a temporal median to collapse the time dimension
* Clip based on the geometry
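
The cloud-masking step in the list above relies on a bit test against the `QA60` band, in which bit 10 flags clouds and bit 11 flags cirrus. The plain-Python sketch below shows the same logic on single integers rather than on images; the QA values and the raw reflectance value are hypothetical, and `is_clear` is an illustrative helper, not an Earth Engine function.

```python
CLOUD_BIT = 1 << 10   # 1024: clouds
CIRRUS_BIT = 1 << 11  # 2048: cirrus

def is_clear(qa_value: int) -> bool:
    """Return True when neither the cloud nor the cirrus bit is set."""
    return (qa_value & CLOUD_BIT) == 0 and (qa_value & CIRRUS_BIT) == 0

# Hypothetical QA60 pixel values: clear, cloud, cirrus, cloud + cirrus
for qa in (0, 1024, 2048, 3072):
    print(qa, is_clear(qa))

# Rescaling: stored integer values are divided by 10000 to get reflectance
raw_value = 1500  # hypothetical stored pixel value
reflectance = raw_value / 10000
print(reflectance)
```

A pixel is kept only when both flags are zero, which is the condition the image-based version expresses with `bitwiseAnd(...).eq(0)` combined through `And`.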

In [6]:
def maskS2clouds(image):
    qa = image.select('QA60')
    # Bits 10 and 11 are clouds and cirrus, respectively.
    cloudBitMask = 1 << 10
    cirrusBitMask = 1 << 11
    # Both flags should be set to zero, indicating clear conditions.
    mask = qa.bitwiseAnd(cloudBitMask).eq(0) \
        .And(qa.bitwiseAnd(cirrusBitMask).eq(0))
    
    return image.updateMask(mask).divide(10000) #re-scale 


In [7]:
# Load sentinel 2
S2 = (
    ee.ImageCollection('COPERNICUS/S2_SR_HARMONIZED')
    .filterBounds(geometry)  # only images that intersect our ROI
    .filterDate('2020-09-01', '2020-09-30')  # end date is exclusive; you can try different dates
    .map(maskS2clouds)  # apply the cloud-masking/rescaling function
    .median()  # collapse the time dimension using the median
    .clip(geometry)  # clip the composite to the ROI extent
)

# Visualisation parameters
vis_params = {'min': 0, 'max': 0.4, 'bands': ['B4', 'B3', 'B2']}

#add to map
Map2 = geemap.Map(center=[-35.30, 149.12], zoom=13)
Map2.add_basemap('Esri.WorldImagery')
Map2.addLayer(geometry, {'alpha': 0.01}, 'Canberra ROI')
Map2.addLayerControl()
Map2.addLayer(S2, vis_params, 'Sentinel-2')
Map2.addLayer(training_data.style(**{'styleProperty': 'style'}), {}, 'All training data')
Map2


### Sample Imagery at training points to create training datasets
Now that we have created the points and labels, we need to sample the Sentinel-2 imagery using `image.sampleRegions()`. This command will extract the reflectance values in the designated bands for each of the points you have created. 

We will then:
* Select the bands for training
* Sample the input imagery to get a FeatureCollection of training data

In [8]:
# Select the bands to use in the classification
bands = ['B2','B3','B4','B5','B6','B7','B8','B8A','B11','B12']

In [9]:
# Match the training points with the Sentinel-2 data
# This property of the table stores the land cover labels.
label = 'landcover'

# Sample the S2 composite at the training points
gcp = S2.select(bands).sampleRegions(
    **{'collection': training_data,
       'properties': [label],
        'scale': 20}
)

print(f'Size of full training data set: {gcp.size().getInfo()}')

Size of full training data set: 349


## Image Classification
The <a href="https://developers.google.com/earth-engine/guides/classification">Classifier</a> package contains the supervised classification machine learning algorithms in Earth Engine. In this part we will:
* Instantiate a supervised classifier
* Set its parameters, if necessary
* Train the classifier using the training data
* Classify an image using the trained algorithm
* Display the classified map

> Note: Here we use a `Support Vector Machine` model for classification. You can also try other machine learning techniques.

In [10]:
# train a Support Vector Machine classifier 
classifier = ee.Classifier.libsvm().train(**{
  'features' : gcp,
  'classProperty' : 'landcover',
  'inputProperties' : bands
})

#classify the pixels in the image
classified = S2.select(bands).classify(classifier)

# quickly show the result using random colours
Map2.addLayer(classified.randomVisualizer(), {}, 'Classified')
Map2.addLayer(training_data.style(**{'styleProperty': 'style'}), {}, 'All training data')
Map2


## Accuracy assessment
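To assess the accuracy of a classifier, we use a confusion matrix (<a href="http://www.sciencedirect.com/science/article/pii/S0034425797000837">Stehman 1997</a>) and also calculate the overall accuracy (OA). Note that `classifier.confusionMatrix()` evaluates the classifier against the same samples it was trained on (resubstitution accuracy), so it tends to give an optimistic estimate.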

A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by a machine learning model.
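
To make the definition concrete, here is a minimal sketch that builds a confusion matrix from paired actual and predicted labels in plain Python. The six labels are hypothetical, and the `confusion_matrix` helper is written for illustration only; it is not the Earth Engine implementation.

```python
def confusion_matrix(actual, predicted, n_classes):
    """Count samples per (actual, predicted) pair.

    Rows index the actual class, columns the predicted class.
    """
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Hypothetical labels for six samples and three classes (0, 1, 2)
actual = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]

matrix = confusion_matrix(actual, predicted, n_classes=3)
for row in matrix:
    print(row)
```

Correctly classified samples fall on the diagonal, while every off-diagonal cell counts one specific kind of confusion between two classes.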

In [11]:
# confusion matrix
train_accuracy = classifier.confusionMatrix()
train_accuracy.getInfo()

[[84, 3, 6, 0], [5, 75, 6, 0], [21, 1, 110, 0], [0, 0, 0, 38]]

We can convert the `train_accuracy.getInfo()` results into a nicer looking (more readable) Pandas dataframe:

In [12]:
df_matrix = pd.DataFrame(train_accuracy.getInfo())
df_matrix.columns.name = 'PREDICTION'
df_matrix.index.name = 'ACTUAL'
# Total sum per column:
df_matrix.loc['TOTAL', :] = df_matrix.sum(axis=0)
# Total sum per row:
df_matrix.loc[:, 'TOTAL'] = df_matrix.sum(axis=1)
df_matrix=df_matrix.rename(columns={0:'highveg', 1:'lowveg', 2:'urban', 3:'water'},
             index={0:'highveg', 1:'lowveg', 2:'urban', 3:'water'})
df_matrix

| ACTUAL / PREDICTION | highveg | lowveg | urban | water | TOTAL |
|:--|--:|--:|--:|--:|--:|
| highveg | 84.0 | 3.0 | 6.0 | 0.0 | 93.0 |
| lowveg | 5.0 | 75.0 | 6.0 | 0.0 | 86.0 |
| urban | 21.0 | 1.0 | 110.0 | 0.0 | 132.0 |
| water | 0.0 | 0.0 | 0.0 | 38.0 | 38.0 |
| TOTAL | 110.0 | 79.0 | 122.0 | 38.0 | 349.0 |


Now, let's print the overall accuracy:

In [13]:
# overall accuracy
print(f'Overall classification accuracy of the model is: {train_accuracy.accuracy().getInfo()*100:.2f} %')

Overall classification accuracy of the model is: 87.97 %
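
The overall accuracy is the number of correctly classified samples (the diagonal of the confusion matrix) divided by the total number of samples. The sketch below recomputes it by hand from the training confusion matrix printed above, and also derives two per-class measures commonly reported alongside it: producer's accuracy (diagonal over row total) and user's accuracy (diagonal over column total). It assumes rows are actual classes and columns are predictions, as labelled in the table above.

```python
# Training confusion matrix as printed above (rows: actual, columns: predicted)
matrix = [[84, 3, 6, 0], [5, 75, 6, 0], [21, 1, 110, 0], [0, 0, 0, 38]]
names = ['highveg', 'lowveg', 'urban', 'water']

n = len(matrix)
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(n))
overall_accuracy = correct / total

row_totals = [sum(row) for row in matrix]
col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]

# Producer's accuracy: share of each actual class that was labelled correctly
producers = [matrix[i][i] / row_totals[i] for i in range(n)]
# User's accuracy: share of each predicted class that is actually correct
users = [matrix[i][i] / col_totals[i] for i in range(n)]

print(f'Overall accuracy: {overall_accuracy:.2%}')
for i, name in enumerate(names):
    print(f'{name:8s} producer: {producers[i]:.2%}  user: {users[i]:.2%}')
```

Per-class measures are useful because a high overall accuracy can hide a class that is frequently confused with another, such as urban samples predicted as high vegetation here.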


## Split the samples into training/validation sets
The goal here is to randomly split the samples into training data (70% of the total) and validation data (30% of the total). The training set is used to train the model, and the validation set, which the model has not seen, is used to test it. Holding out data in this way is standard practice in machine learning.
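
The idea behind the random split can be sketched in plain Python: attach a uniform random number to every sample, then threshold it at 0.7. The sample IDs and the seed below are hypothetical, and the exact group sizes depend on the random numbers drawn, so the split is only approximately 70/30.

```python
import random

rng = random.Random(0)  # fixed seed so the split is reproducible

# Hypothetical sample IDs standing in for the 349 training points
samples = list(range(349))

# Attach a uniform random number in [0, 1) to every sample
tagged = [(sample, rng.random()) for sample in samples]

# Threshold the random number to assign each sample to one group
training = [s for s, r in tagged if r <= 0.7]
validation = [s for s, r in tagged if r > 0.7]

print(len(training), len(validation))
```

Because every sample gets exactly one random number, the two groups never overlap and together cover the full dataset.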

In [14]:
# This property of the table stores the land cover labels.
label = 'landcover'

# Add a column of random numbers and use it to split the GCPs
# into training and validation sets
gcp = training_data.randomColumn()

# Keep roughly 70% of the points for training and 30% for validation
trainingGcp = gcp.filter('random <= 0.7')
validationGcp = gcp.filter('random > 0.7')

Map2.addLayer(validationGcp.style(**{'styleProperty': 'style'}), {}, 'Validation points')

# Overlay the training points on the image to get the training data
composite = S2.select(bands)
training = composite.sampleRegions(
    **{
  'collection': trainingGcp,
  'properties': [label],
  'scale': 20}
)

print('Training data size: ', training.size().getInfo())
print('Validation data size: ',validationGcp.size().getInfo())

Training data size:  248
Validation data size:  101


Now we will reclassify the image using only the 70% training subset.

In [15]:
# Train a new SVM classifier on the training subset only
classifier1 = ee.Classifier.libsvm().train(**{
  'features' : training,
  'classProperty' : 'landcover',
  'inputProperties' : bands
})

classified = composite.classify(classifier1)

# Display the classes with random colours
Map2.addLayer(classified.randomVisualizer(), {}, 'Classified - training subset')
Map2.addLayer(trainingGcp.style(**{'styleProperty': 'style'}), {}, 'Training subset')
Map2


### Accuracy assessment

We can now conduct a similar accuracy assessment to the one we did previously, but this time using only the validation samples to test the accuracy (these samples were not used to train the classifier).

In [16]:
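# Accuracy assessment: sample the classified image at the validation points
# Note: this uses a 10 m scale, whereas the training data were sampled at 20 m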
test = classified.sampleRegions(
    **{
  'collection': validationGcp,
  'properties': [label],
  'scale': 10}
)
print(test.size().getInfo())

101


In [17]:
# Confusion matrix for the validation samples
test_accuracy = test.errorMatrix('landcover', 'classification')

df_matrix = pd.DataFrame(test_accuracy.getInfo())
df_matrix.columns.name = 'PREDICTION'
df_matrix.index.name = 'ACTUAL'
# Total sum per column:
df_matrix.loc['TOTAL', :] = df_matrix.sum(axis=0)
# Total sum per row:
df_matrix.loc[:, 'TOTAL'] = df_matrix.sum(axis=1)
df_matrix=df_matrix.rename(columns={0:'highveg', 1:'lowveg', 2:'urban', 3:'water'},
             index={0:'highveg', 1:'lowveg', 2:'urban', 3:'water'})
df_matrix

| ACTUAL / PREDICTION | highveg | lowveg | urban | water | TOTAL |
|:--|--:|--:|--:|--:|--:|
| highveg | 26.0 | 0.0 | 0.0 | 0.0 | 26.0 |
| lowveg | 5.0 | 21.0 | 0.0 | 0.0 | 26.0 |
| urban | 5.0 | 0.0 | 32.0 | 0.0 | 37.0 |
| water | 0.0 | 0.0 | 0.0 | 12.0 | 12.0 |
| TOTAL | 36.0 | 21.0 | 32.0 | 12.0 | 101.0 |


In [18]:
# overall accuracy
print(f'Overall classification accuracy of the model is: {test_accuracy.accuracy().getInfo()*100:.2f} %')

Overall classification accuracy of the model is: 90.10 %


## Reference
[ENGN3903 - Environmental Sensing, Mapping and Modelling](https://github.com/nicolasyounes/engn3903/tree/main) 