## SediNet: predict categorical population

This Jupyter notebook accompanies the [SediNet](https://github.com/MARDAScience/SediNet) package.

Written by Daniel Buscombe, MARDA Science

daniel@mardascience.com


> Demonstration of how to use SediNet to estimate sediment population, a categorical variable, from images

This notebook assumes you are on a cloud computer such as Colab, so we first download the SediNet package from GitHub:


In [1]:
!git clone --depth 1 https://github.com/MARDAScience/SediNet.git

Cloning into 'SediNet'...
remote: Enumerating objects: 760, done.
remote: Counting objects: 100% (760/760), done.
remote: Compressing objects: 100% (688/688), done.
remote: Total 760 (delta 87), reused 716 (delta 68), pack-reused 0
Receiving objects: 100% (760/760), 1.11 GiB | 46.80 MiB/s, done.
Resolving deltas: 100% (87/87), done.
Checking out files: 100% (725/725), done.


Load TensorFlow version 2

In [2]:
import tensorflow as tf
print(tf.__version__)

2.2.0
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
False


In [7]:
import os, json
os.getcwd()

'/content/SediNet'

In [8]:
os.chdir('SediNet')

Import everything we need from sedinet_eval.py

In [9]:
from sedinet_eval import *
from numpy import any as npany

In [10]:
configfile = 'config_pop_predict.json'

weights_path = 'grain_population/res/color/pop_model_checkpoint.hdf5'

Load the config file and parse out the variables we need

In [11]:
# load the user configs from the 'config' folder of the working directory
with open(os.path.join(os.getcwd(), 'config', configfile)) as f:
  config = json.load(f)

###===================================================
## user-defined variables read from the config file
csvfile = config["csvfile"] #csvfile containing image names and class values
res_folder = config["res_folder"] #folder containing csv file and that will contain model outputs
name = config["name"] #name prefix for output files
greyscale = config["greyscale"] #convert imagery to greyscale or not
dropout = config["dropout"] #dropout factor
    
# numclass is optional in the config; default to 0 if it is missing
numclass = config.get('numclass', 0)
                        
#output variables: every config key that does not start with one of the prefixes below
skip = ('base', 'csvfile', 'res_folder', 'train_csvfile', 'test_csvfile', 'name',
        'greyscale', 'aux_in', 'dropout', 'N', 'numclass')
vars = [k for k in config.keys() if not k.startswith(skip)]
vars = sorted(vars)

###==================================================
ID_MAP = dict(zip(np.arange(numclass), [str(k) for k in range(numclass)]))

csvfile = res_folder+os.sep+csvfile
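
As a rough illustration of how the output variables are picked out of the config, the sketch below applies the same prefix rule to a toy dictionary. The dictionary contents and the `toy_` names are made up for illustration and are not the contents of `config_pop_predict.json`. Any key that does not start with a reserved prefix is treated as an output variable.

```python
# Hypothetical config contents, for illustration only
toy_config = {
    "csvfile": "toy.csv",
    "res_folder": "toy_folder",
    "name": "toy",
    "greyscale": False,
    "dropout": 0.5,
    "numclass": 6,
    "pop": "pop",
}

# Prefixes of keys that are settings rather than output variables
reserved = ("base", "csvfile", "res_folder", "train_csvfile", "test_csvfile",
            "name", "greyscale", "aux_in", "dropout", "N", "numclass")

# str.startswith accepts a tuple and is True if any of the prefixes matches
toy_vars = sorted(k for k in toy_config if not k.startswith(reserved))
print(toy_vars)  # ['pop']

# Same rule as the notebook: one output variable -> 'siso', several -> 'simo'
if len(toy_vars) == 1:
    toy_mode = "siso"
elif len(toy_vars) > 1:
    toy_mode = "simo"
print(toy_mode)  # siso
```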

In [12]:
if len(vars) ==1:
   mode = 'siso'
elif len(vars) >1:
   mode = 'simo'

This next part reads the data in from the csv file as a pandas dataframe, gets an image generator, and then prepares a model

In [13]:
ID_MAP = dict(zip(np.arange(numclass), [str(k) for k in range(numclass)]))
   
###===================================================
## read the data set in, clean and modify the pathnames so they are absolute
df = pd.read_csv(csvfile)
df['files'] = [k.strip() for k in df['files']]
df['files'] = [os.getcwd()+os.sep+f.replace('\\',os.sep) for f in df['files']]

train_idx = np.arange(len(df))

train_gen = get_data_generator_1vars(df, train_idx, True, vars, greyscale, len(df))
   
##==============================================
## create a SediNet model to estimate sediment category
SM = make_cat_sedinet(ID_MAP, dropout)
SM.load_weights(os.getcwd()+os.sep+weights_path)

[INFORMATION] Model summary:
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 1024, 1024, 3)]   0         
_________________________________________________________________
conv2d (Conv2D)              (None, 1022, 1022, 30)    840       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 1020, 1020, 60)    16260     
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 510, 510, 60)      0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 508, 508, 90)      48690     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 254, 254, 90)      0         
_________________________________________________________________
conv2d_3 (Conv2D)            (No

In [14]:
vars

['pop']

Now that the model is set up, we use it to make a prediction for each image. The model returns a score per class for each image, and the estimated class is the one with the highest score.

A classification report is printed to screen showing per-class F1 scores. The F1 score is the harmonic mean of precision and recall. Precision is the proportion of positive identifications that are correct (a precision of 1 means there are no false positives), and recall is the proportion of actual positives identified correctly (a recall of 1 means there are no false negatives).
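
To make these definitions concrete, here is a small worked example on invented binary labels (not SediNet data). It counts true positives, false positives and false negatives by hand, then combines precision and recall into F1, which is their harmonic mean.

```python
# Toy labels, invented for illustration (1 = positive class, 0 = negative)
true_labels = [1, 1, 1, 1, 0, 0, 0, 0]
pred_labels = [1, 1, 1, 0, 1, 1, 0, 0]

pairs = list(zip(true_labels, pred_labels))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # correct share of the predicted positives
recall = tp / (tp + fn)     # share of the actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
# precision=0.60 recall=0.75 F1=0.67
```

In the multi-class case, the report gives these three numbers once per class, treating that class as "positive" and all others as "negative".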

In [None]:
x_train, (trueT)= next(train_gen)
trueT = trueT[0] 

predT = SM.predict(x_train, batch_size=1)
   
del x_train, train_gen
   
predT = np.asarray(predT).argmax(axis=-1)

## print a classification report to screen, showing f1, precision, recall and accuracy
print("==========================================")
print("Classification report for "+vars[0])
print(classification_report(trueT, predT))
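
The `argmax(axis=-1)` step above can be illustrated on a toy array. This sketch assumes the model returns one score per class for each image. The scores and the `toy_` names are invented for illustration.

```python
import numpy as np

# Hypothetical per-class scores for three images and four classes
toy_scores = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.60, 0.20, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.80],
])

# argmax over the last axis gives the index of the highest score per image
toy_labels = toy_scores.argmax(axis=-1)
print(toy_labels)  # [1 0 3]
```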

Finally, we plot a confusion matrix showing normalized correspondences between actual and estimated labels, and save it as a PNG file
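
As a rough sketch of what a normalized confusion matrix contains, the example below builds one with NumPy from invented labels. It assumes row normalization, so each row (one true class) sums to 1. Whether `plot_confmat` normalizes in exactly this way is an assumption here.

```python
import numpy as np

# Toy labels for three classes, invented for illustration
cm_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
cm_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2])
n_classes = 3

# Count how often true class i (row) was predicted as class j (column)
cm = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(cm_true, cm_pred):
    cm[t, p] += 1

# Row-normalize so each row gives the fractions for one true class
cm_norm = cm / cm.sum(axis=1, keepdims=True)
print(cm_norm.round(2))
```

With this normalization, the diagonal entries are the per-class recall values.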

In [None]:
classes = np.arange(len(ID_MAP))
##==============================================
## create figures showing confusion matrices for data set
plot_confmat(predT, trueT, vars[0]+'T',classes)  
plt.savefig(weights_path.replace('.hdf5','_cm_predict.png'), dpi=300, bbox_inches='tight') 
plt.close('all')   

See `pop_model_checkpoint_cm_predict.png` inside `grain_population/res/color/`.