# Accuracy Assessment of WOfS Product in Africa using Ground Truth Data  <img align="right" src="../Supplementary_data/DE_Africa_Logo_Stacked_RGB_small.jpg">

* **Products used:** 
[ga_ls8c_wofs_2](https://explorer.digitalearth.africa/ga_ls8c_wofs_2),
[ga_ls8c_wofs_2_summary](https://explorer.digitalearth.africa/ga_ls8c_wofs_2_summary),
[usgs_ls8c_level2_2]()

Notes:
* As of 26 June 2020, continental-scale Landsat 8 Collection 2 data is confidential.
* Run this notebook in the Collection 2 Read Private Workspace if you need to use the Landsat 8 Collection 2 sample dataset.

## Background
[Water Observations from Space (WOfS)](https://www.ga.gov.au/scientific-topics/community-safety/flood/wofs/about-wofs) is a product derived from Landsat 8 satellite observations, here from the provisional Landsat 8 Collection 2 surface reflectance, and shows surface water detected in Africa.
Individual water-classified images are called Water Observation Feature Layers (WOFLs) and are created in a 1-to-1 relationship with the input satellite data.
Hence there is one WOFL for each satellite dataset processed for the occurrence of water.

The data in a WOFL is stored as a bit field. This is a binary number in which each digit is independently set or not, based on the presence (1) or absence (0) of a particular attribute (water, cloud, cloud shadow, etc.). In this way, the single decimal value associated with each pixel can provide information on a variety of features of that pixel.
For more information on the structure of WOFLs and how to interact with them, see the [Water Observations from Space](../Datasets/Water_Observations_from_Space.ipynb) and [Applying WOfS bitmasking](../Frequently_used_code/Applying_WOfS_bitmasking.ipynb) notebooks.
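
As a minimal sketch of how a bit field can be read, the snippet below tests individual bits of a pixel value with a bitwise AND. The bit values in `FLAG_BITS` (32 for cloud shadow, 64 for cloud, 128 for water) and the helper names are assumptions made for illustration; confirm the flag definitions against the product metadata before relying on them.

```python
# Assumed bit values, for illustration only; check the product's flag definitions
FLAG_BITS = {"cloud_shadow": 32, "cloud": 64, "water": 128}

def decode_flags(pixel_value):
    """Return the names of the (assumed) flags that are set in one pixel value."""
    return [name for name, bit in FLAG_BITS.items() if pixel_value & bit]

def is_clear_water(pixel_value):
    """True only when the water bit is set and no other bit is set."""
    return pixel_value == FLAG_BITS["water"]

print(decode_flags(160))    # 160 = 128 + 32, so two flags are set
print(is_clear_water(128))
print(is_clear_water(160))
```

Testing for equality with the water value, rather than only checking the water bit, is one way to keep clear water observations and drop pixels that are also flagged for something else.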

Accuracy assessment for the WOfS product in Africa involves generating a confusion (error) matrix for the binary WOFL classification.
The inputs for estimating the accuracy of the WOfS derived product are a binary WOFL classification layer showing water/non-water and a shapefile containing validation points collected with the [Collect Earth Online](https://collect.earth/) tool. The validation points are the ground truth (actual) data, while the value extracted from the WOFL at each location is the predicted value. The output of this analysis is a confusion matrix reporting overall, producer's and user's accuracy.
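
To make the three accuracy measures concrete, here is a small sketch that builds a confusion matrix from paired actual and predicted labels. It is not the method used later in this workflow; the function name `accuracy_report` and the toy labels (1 = water, 0 = non-water) are hypothetical. Producer's accuracy divides the correct count by the reference (actual) total of a class, and user's accuracy divides it by the predicted total.

```python
def accuracy_report(actual, predicted, labels=(0, 1)):
    """Build a confusion matrix and derive overall, producer's and user's accuracy."""
    # counts[p][a] = number of points predicted as p whose reference label is a
    counts = {p: {a: 0 for a in labels} for p in labels}
    for a, p in zip(actual, predicted):
        counts[p][a] += 1
    correct = {c: counts[c][c] for c in labels}
    overall = sum(correct.values()) / len(actual)
    # Producer's accuracy: correct / all reference points of that class
    producers = {c: correct[c] / sum(counts[p][c] for p in labels) for c in labels}
    # User's accuracy: correct / all points predicted as that class
    users = {c: correct[c] / sum(counts[c][a] for a in labels) for c in labels}
    return counts, overall, producers, users

# Toy labels, invented for illustration
actual =    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
counts, overall, producers, users = accuracy_report(actual, predicted)
print(counts, overall, producers, users)
```

The sketch raises a `ZeroDivisionError` if a class never appears in the reference or predicted labels, so a real implementation would need to handle that case.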

## Description
This notebook explains how to perform an accuracy assessment of the WOfS derived product using a collected ground truth dataset.

The notebook demonstrates how to:

1. Load the collected validation points as a list of observations, each with a location and month
2. Query WOFL data for the collected points and capture the available WOfS observations
3. Extract statistics (min, max and mean values) for each WOfS observation at each validation point (location and month)
4. Extract a lookup table (LUT) for each point that contains both the validation information and the WOfS result for each month
5. Generate a confusion matrix for the WOFL classification
6. Assess the accuracy of the classification
***

* Two extreme cases:
    - only test the WOfS classifier; excluding clouds is OK
        - keep clear observations and remove non-clear ones
        - then query on those that are water/non-water
    - include terrain, so water observed and no terrain is predicted true

## Getting started

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell.

After finishing the analysis, you can modify some values in the "Analysis parameters" cell and re-run the analysis to load WOFLs for a different location or time period.

### Load packages

In [1]:
%matplotlib inline

import datacube
from datacube.utils import masking, geometry 
import sys
import os
import dask 
import rasterio, rasterio.features
import xarray
import glob
import numpy as np
import pandas as pd
import seaborn as sn
import geopandas as gpd
import subprocess as sp
import matplotlib.pyplot as plt
import scipy, scipy.ndimage
import warnings
warnings.filterwarnings("ignore") #this will suppress the warnings for multiple UTM zones in your AOI 

sys.path.append("../Scripts")
from deafrica_plotting import display_map, rgb
from deafrica_spatialtools import xr_rasterize
from deafrica_datahandling import wofs_fuser, mostcommon_crs,load_ard
from rasterio.mask import mask

### Connect to the datacube

In [2]:
dc = datacube.Datacube()

### Analysis parameters

In [3]:
# Make sure the validation points have at least three columns: location (x, y) and class, as well as 12 records for each observation
# Path to the validation data points CSV file
CEO = '../Supplementary_data/Validation/CEO_2_RCMRD_2020-07-30.csv'

### Loading Dataset

In [4]:
#Read in the validation data csv
df = pd.read_csv(CEO, delimiter=",")
ground_truth = df.drop(['SAMPLE_ID','USER_ID','IMAGERY_TITLE','COLLECTION_TIME','ANALYSIS_DURATION','PL_PLOTID'], axis=1)

In [9]:
ground_truth['ENTER MONTHS[1-12] IN 2018,WATER WAS OBSERVED?'][0]

'1-3,5-12'
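
The value shown above packs several month ranges into one string. As a rough sketch of how such a string can be expanded into individual months (the helper name `expand_months` is hypothetical, and treating `'0'` and `'nan'` as "no months" is an assumption based on how the table is processed further down):

```python
def expand_months(month_string):
    """Expand a string such as '1-3,5-12' into a sorted list of month numbers."""
    # '0' and 'nan' are treated as "no months recorded" (assumption)
    if month_string in ("0", "nan"):
        return []
    months = set()
    for part in month_string.split(","):
        bounds = [int(value) for value in part.split("-")]
        # A single month such as '4' has the same first and last bound
        first, last = bounds[0], bounds[-1]
        months.update(range(first, last + 1))
    return sorted(months)

print(expand_months("1-3,5-12"))
```

For the first validation point this yields every month except April, which matches the `4` recorded in the bad-image column for that point.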

In [7]:
# The question headers in this CSV have no space after the comma (see the cell above),
# and rename() silently ignores keys that do not match, so normalise the spacing first
ground_truth.columns = ground_truth.columns.str.strip().str.replace(r',\s*', ', ', regex=True)
ground_truth = ground_truth.rename(columns={'WHAT IS THE FEATURE?':'CLASS','ENTER MONTHS[1-12] IN 2018, WATER WAS OBSERVED?':'WATER',
                                            'ENTER MONTHS[1-12] IN 2018, WATER WAS NOT OBSERVED?':'NO_WATER','ENTER MONTHS[1-12] IN 2018, IMAGE WAS BAD?':'BAD_IMAGE',
                                            'ENTER MONTHS[1-12] IN 2018, THAT YOU ARE UNSURE IF YOU OBSERVE WATER OR NOT?':'NOT_SURE'})
ground_truth

Unnamed: 0,PLOT_ID,LON,LAT,FLAGGED,ANALYSES,SENTINEL2MOSAICYEARMONTH,WATER,NO_WATER,BAD_IMAGE,NOT_SURE,CLASS,COMMENT
0,137387237,36.248262,-0.439987,False,1,2018 - 2018,"1-3,5-12",0,4,0,Wetlands - freshwater,Possible high algae bloom between 1-3
1,137387238,34.149518,-0.539462,False,1,2018 - 2018,1-Dec,0,0,0,Open water - freshwater,
2,137387239,34.159779,-0.542990,False,1,2018 - 2018,1-Dec,0,0,0,Open water - freshwater,
3,137387240,30.925848,-0.669983,False,1,2018 - 2018,1-Nov,0,12,0,Open water - freshwater,
4,137387241,37.892745,-0.683152,False,1,2018 - 2018,1-Dec,0,0,0,Open water - freshwater,
...,...,...,...,...,...,...,...,...,...,...,...,...
195,137387432,39.613404,-9.090942,False,1,2018 - 2018,"1-4,6-12",0,0,5,Open water - marine,
196,137387433,35.464420,-9.505899,False,1,2018 - 2018,"1,2,4-6,8,10,12",0,3911,7,Open water - freshwater,
197,137387434,33.947728,-9.568113,False,1,2018 - 2018,2-Dec,0,1,0,Open water - freshwater,
198,137387435,34.424376,-9.843084,False,1,,0,"1,2,7,8-11",345612,0,Forest/woodlands,


In [127]:
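# Convert the month columns to strings if they are not already; columns that hold
# only numbers are read as integers and would break the .str operations below
label_cols = ['WATER','NO_WATER','BAD_IMAGE','NOT_SURE']
ground_truth[label_cols] = ground_truth[label_cols].astype(str)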

In [129]:
cols = ['WATER','NO_WATER','BAD_IMAGE','NOT_SURE']
for col in cols:
    # Remove square brackets, ampersands and all whitespace in one pass
    # (an explicit regex avoids relying on the version-dependent default of str.replace)
    ground_truth[col] = ground_truth[col].str.replace(r'[\[\]&\s]', '', regex=True)

In [130]:
#check whether any nan values in the dataframe and print it out against the column name 
# count_nan_in_df = ground_truth.isnull().sum()
# print (count_nan_in_df)

In [131]:
#replacing the name of months with their numerical values
month_numbers = {r'Jan':'1', r'Feb':'2',r'Mar':'3',r'Apr':'4',r'May':'5',r'Jun':'6',r'Jul':'7',r'Aug':'8',r'Sep':'9',r'Oct':'10',r'Nov':'11',r'Dec':'12'}
replacements = {col: month_numbers for col in ['WATER','NO_WATER','BAD_IMAGE']}

ground_truth.replace(replacements, regex=True, inplace=True)

In [132]:
ground_truth['SENTINEL2MOSAICYEARMONTH'] = ground_truth['SENTINEL2MOSAICYEARMONTH'].str.replace('2019-2019','2018-2018')

In [133]:
def split_str(row, newtable):
    """Expand one validation row into one row per labelled month and add them to newtable."""
    keep = ['PLOT_ID','LON','LAT','FLAGGED','ANALYSES','WATER','NO_WATER','BAD_IMAGE','NOT_SURE','CLASS','COMMENT']
    # WATERFLAG codes: 0 = no water, 1 = water, 2 = bad image, 3 = not sure
    flags = [('NO_WATER','0'), ('WATER','1'), ('BAD_IMAGE','2'), ('NOT_SURE','3')]
    newrows = []
    for col, flag in flags:
        monthstr = row[col]
        if monthstr != '0' and monthstr != 'nan':
            # '1-3,5-12' -> [[1, 3], [5, 12]]
            monthlist = [[int(i) for i in s.split('-')] for s in monthstr.split(',')]
            for l in monthlist:
                # a single month such as '4' becomes the range [4, 4]
                if len(l) == 1: l = [l[0], l[0]]
                for i in range(l[0], l[1]+1):
                    newrow = row[keep].copy()
                    newrow['MONTH'] = f'{i:02d}'
                    newrow['WATERFLAG'] = flag
                    newrow['SENTINEL2YEAR'] = '2018'
                    newrows.append(newrow)
    if newrows:
        # DataFrame.append was removed in pandas 2.0, so concatenate instead;
        # the original row index is kept, as it was with append
        newtable = pd.concat([newtable, pd.DataFrame(newrows)])
    return newtable


In [134]:
#count_nan_in_df = ground_truth.isnull().sum()
#print (count_nan_in_df)

In [135]:
#for check on any issues 
ground_truth.to_csv('../Supplementary_data/Validation/Refined/CEO_2.csv')

In [138]:
#Making an empty dataframe
result = pd.DataFrame()

In [139]:
for irow in range(len(ground_truth)):
    result=split_str(ground_truth.iloc[irow], result)

In [140]:
#result
#result.loc[13]#this shows all the table 

In [141]:
result = result[['PLOT_ID', 'LON', 'LAT','FLAGGED','ANALYSES','SENTINEL2YEAR', 'WATER','NO_WATER','BAD_IMAGE','NOT_SURE','CLASS', 'COMMENT', 'MONTH','WATERFLAG']]

TODO: the step that fixes rows with multiple values for the same month should go here:
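
One possible starting point for this step, offered as a sketch rather than the intended solution, is to flag plots that carry more than one label for the same month. The toy table below is invented for illustration and only mirrors the `PLOT_ID`, `MONTH` and `WATERFLAG` columns of `result`.

```python
import pandas as pd

# Toy data, invented for illustration: plot 1 has two different labels for month 01
toy = pd.DataFrame({
    "PLOT_ID": [1, 1, 1, 2],
    "MONTH": ["01", "01", "02", "01"],
    "WATERFLAG": ["1", "2", "1", "0"],
})

# keep=False marks every member of a duplicated (plot, month) pair, not only the repeats
conflicts = toy[toy.duplicated(["PLOT_ID", "MONTH"], keep=False)]

# Number of distinct labels per plot and month; values above 1 need a decision
labels_per_month = toy.groupby(["PLOT_ID", "MONTH"])["WATERFLAG"].nunique()

print(conflicts)
print(labels_per_month[labels_per_month > 1])
```

How a conflict should be resolved (for example dropping the month, or preferring one label over another) is a judgement call that this sketch deliberately leaves open.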


In [None]:
# Select all duplicate rows based on one column
# duplicateRowsDF = input_data[input_data.duplicated(['MONTH'])].index
# duplicateRowsDF

In [None]:
# input_data = input_data.groupby(['PLOT_ID']).size().reset_index(name='count')
# input_data
# #input_data[['PLOT_ID','MONTH']].groupby(['PLOT_ID']).agg(['count'])

In [None]:
# ID =  input_data.groupby('PLOT_ID')
# ID['MONTH'].agg(np.mean)
#this gives the number of rows for each item in the table 

#now we need to check whether the month value in two rows are similar 

# for i, group in ID
#     print(i)
#     print(group)

In [142]:
#save the dataframe as csv file 
result.to_csv('../Supplementary_data/Validation/Refined/CEO_3_RCMRD_2020-07-30.csv')

In [22]:
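#joining dataframes together and extract one csv for each partner institution 
merged = '../Supplementary_data/Validation/Refined/CEO_RCMRD_2020-07-30.csv'
# The merged file also matches the pattern below, so skip it to avoid duplicating rows on a re-run
DF = [d for d in sorted(glob.glob('../Supplementary_data/Validation/Refined/*_RCMRD_*.csv'))
      if os.path.basename(d) != os.path.basename(merged)]
frame = []
for d in DF: 
    f = pd.read_csv(d,delimiter=",")
    frame.append(f)
out = pd.concat(frame)
out.to_csv(merged)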

***

## Additional information

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
Digital Earth Africa data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.

**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).
If you would like to report an issue with this notebook, you can file one on [GitHub](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks).

**Last modified:** January 2020

**Compatible datacube version:** 

## Tags
Browse all available tags on the DE Africa User Guide's [Tags Index](https://) (placeholder as this does not exist yet)

In [None]:
# TODO: also test the ground truth in EPSG:6933 (needs a conversion; work out how to reproject)