# Reading and Viewing Material

In this exercise we will classify images of plankton into 9 classes. First, here are some reading and viewing references that cover many of the topics we will use.

**Erosion**

https://towardsdatascience.com/introduction-to-image-processing-with-python-dilation-and-erosion-for-beginners-d3a0f29ad72b

**Classifiers**

https://www.simplilearn.com/tutorials/machine-learning-tutorial/classification-in-machine-learning
https://www.edureka.co/blog/classification-algorithms/
https://www.tutorialspoint.com/machine_learning_with_python/classification_introduction.htm
https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html?highlight=standardscaler

**Nearest Neighbor** 
https://youtu.be/HVXime0nQeI

**SVM**
https://youtu.be/efR1C6CvhmE
https://youtu.be/8A7L0GsBiLQ

**Random Forest**
https://youtu.be/J4Wdy0Wc_xQ

**Linear Discriminant Analysis**
https://youtu.be/azXCzI57Yfc

**Train and Test Sets**
https://youtu.be/EuBBz3bI-aA

**Scaling**
https://youtu.be/SzZ6GpcfoQY
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

**Accuracy and Confusion Matrix**

https://youtu.be/Kdsp6soqA7o
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea


# Installing and Importing Libraries

Install the mahotas library, which provides the Zernike moments we use to extract shape features.

In [None]:
!pip install mahotas

Import all the libraries we will need. The os functions let us list and load files from the drive. The sklearn modules handle scaling, classification, and scoring. The seaborn library plots the confusion matrix.

In [None]:
import numpy as np
import cv2
import os
import matplotlib.pyplot as plt
from mahotas.zernike import zernike_moments
from google.colab.patches import cv2_imshow
from IPython import display
from google.colab import drive

from os import listdir
from os.path import isfile, join

from sklearn import preprocessing
from sklearn.metrics import confusion_matrix

import time

from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
import seaborn as sn

# Mount the Drive

Mount the drive and create pointers to the test and train file directories.

In [None]:
drive.mount('/content/drive')
TRAIN_DIR = r'/content/drive/MyDrive/SCIP_DATA/Images/TRAIN_IMAGEBIN2'
TEST_DIR = r'/content/drive/MyDrive/SCIP_DATA/Images/TEST_IMAGEBIN2'

# Extracting Features from Image and Contour List

Create a function that collects 43 features from a binary image and the contours of the binary image. We will use this function to get features of the training and test images.
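
As a sanity check on the figure of 43, the features can be tallied by group. The sketch below assumes that mahotas returns one Zernike magnitude for every pair (n, m) with 0 <= m <= n <= 8 and n - m even, and that OpenCV returns seven Hu moments. The feature names and the `count_zernike` helper are labels invented for this illustration.

```python
# Tally the features that getFeatures is expected to return.
SHAPE_FEATURES = [
    "area", "aspect_ratio", "hull_area", "solidity",
    "ellipse_axis_1", "ellipse_axis_2",
    "contour_length", "contour_mean", "contour_std", "perimeter",
    "enclosing_circle_radius",
]

def count_zernike(degree):
    """Count (n, m) pairs with 0 <= m <= n <= degree and n - m even."""
    return sum(
        1
        for n in range(degree + 1)
        for m in range(n + 1)
        if (n - m) % 2 == 0
    )

N_HU = 7  # Hu's seven invariant moments

n_zernike = count_zernike(8)
total = len(SHAPE_FEATURES) + n_zernike + N_HU
print(len(SHAPE_FEATURES), n_zernike, N_HU, total)
```

If the Zernike degree passed to `zernike_moments` changes, the total changes with it, so `maxFeat` would need to be updated to match.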

In [None]:
def getFeatures(binaryIM,objContour):
    feat=[]
    
    area = cv2.contourArea(objContour)
    feat.append(area)
    
    # aspect ratio
    ((x0,y0),(w,h),theta) = cv2.minAreaRect(objContour)  # from https://www.programcreek.com/python/example/89463/cv2.minAreaRect 
    if w==0 or h==0:
        aspectRatio=0
    elif w>=h:
        aspectRatio = float(w)/h
    else:
        aspectRatio = float(h)/w
    feat.append(aspectRatio)

    # solidity
    hull = cv2.convexHull(objContour)    
    hull_area = cv2.contourArea(hull)
    solidity = float(area)/hull_area
    feat.append(hull_area)
    feat.append(solidity)

    # ellipse
    (xe,ye),(eMajor,eMinor),angle = cv2.fitEllipse(objContour)
    feat.append(eMajor)
    feat.append(eMinor)

    # contour
    contourLen=len(objContour)   
    mean = np.mean(objContour)
    std = np.std(objContour)
    perimeter = cv2.arcLength(objContour,True)
    feat.append(contourLen)
    feat.append(mean)
    feat.append(std)
    feat.append(perimeter)

    # circle
    (xc,yc),radius = cv2.minEnclosingCircle(objContour)
    feat.append(radius)

    # ZERNIKE MOMENTS
    H, W = binaryIM.shape  # NumPy image shape is (rows, cols) = (height, width)
    R = min(W, H) / 2
    zer = zernike_moments(binaryIM, R, 8)
    zsum=np.sum(zer[1:])
    zer[0]=zsum           # first Zernike moment always constant so replace with sum
    for z in zer:
        feat.append(z)

    
    # HU MOMENTS (log-scaled)
    mom = cv2.HuMoments(cv2.moments(binaryIM)).flatten()
    momLog=abs(np.log10(np.abs(mom))) # get absolute value of log
    momLog=(1000*momLog).astype(int)
    for m in momLog:
        feat.append(m)

    maxFeat=43
    featVector=np.zeros((1,maxFeat))
    featVector[0]=feat
    return(featVector)
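
The last step of the function takes the log of the Hu moments because their raw values can span many orders of magnitude. Here is a minimal NumPy-only sketch of that scaling. The `log_scale_moments` helper, the `eps` clamp, and the input values are assumptions made for illustration. Unlike the function above, the sketch rounds instead of truncating and guards against a moment that is exactly zero.

```python
import numpy as np

def log_scale_moments(moments, eps=1e-30):
    """Compress moments spanning many orders of magnitude into integers."""
    # Assumption of this sketch: clamp magnitudes at eps so that log10(0)
    # never produces -inf.
    magnitudes = np.maximum(np.abs(moments), eps)
    return np.round(1000 * np.abs(np.log10(magnitudes))).astype(int)

# Hypothetical moment values, chosen only to show the effect.
hu_like = np.array([1e-3, -1e-7, 0.0])
scaled = log_scale_moments(hu_like)
print(scaled)
```

Note that taking the absolute value twice discards the sign of the moment and maps values above and below 1 onto the same range, so the transform is lossy by design.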

# Create List of Class Names

Create the list of class labels. Each image file name starts with the class name, so we take the first three letters of each file name as the class. The className list stores the class names; there should be 9 unique names.

In [None]:
############## Create classes from file name ##############
files = [f for f in listdir(TRAIN_DIR) if isfile(join(TRAIN_DIR, f))]
className=[]
for file in files:
    n=file[0:3]
    if n not in className:
        className.append(n)
  

Let's verify we captured the 9 unique class names.

In [None]:
print(className)

['spi', 'ste', 'vol', 'did', 'dil', 'par', 'ble', 'arc', 'act']
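
The position of a name in this list later becomes the integer label for that class. A small sketch of the mapping follows. The file names and the `class_names_from_files` helper are hypothetical; the real names come from `listdir(TRAIN_DIR)`.

```python
def class_names_from_files(file_names, prefix_len=3):
    """Return unique name prefixes in order of first appearance."""
    names = []
    for file_name in file_names:
        prefix = file_name[:prefix_len]
        if prefix not in names:
            names.append(prefix)
    return names

# Hypothetical file names, used only for illustration.
example_files = ["spi_0001.png", "ste_0001.png", "spi_0002.png", "vol_0001.png"]
names = class_names_from_files(example_files)

# Each file's label is the index of its prefix in the list of names.
labels = [names.index(f[:3]) for f in example_files]
print(names)
print(labels)
```

Because `listdir` returns files in arbitrary order, the index assigned to each class can differ between runs or machines; sorting the names is one way to keep the assignment stable.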


# Create Training Set

Now we are ready to extract the features for each image in the training set. We start with binary images, produced by applying a threshold to the grayscale images. I already did this step, since we covered thresholding in a previous class. Notice that I use an erosion function (cv2.erode), applied twice with a 5x5 kernel, because some of the objects are "attached" to other objects during detection.

In [None]:
############## Get features of training set ##############
DIR=TRAIN_DIR
maxFeat=43
files = [f for f in listdir(DIR) if isfile(join(DIR, f))]
xTrain=np.zeros((1,maxFeat))
yTrain=np.zeros((1))
kernel = np.ones((5, 5), np.uint8)
for file in files:
    binaryIM=cv2.imread(DIR+'/'+file,cv2.IMREAD_GRAYSCALE)
    binaryIM = cv2.erode(binaryIM, kernel)
    binaryIM = cv2.erode(binaryIM, kernel)
    contours, hierarchy = cv2.findContours(binaryIM, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE) # all contour points, uses more memory
    i=0
    maxArea=0
    bestCnt=-1
    for cnt in contours:
        area = cv2.contourArea(cnt)
        if area>maxArea:
            maxArea=area
            bestCnt=i
        i+=1
    if maxArea>0 and bestCnt!=-1 and len(contours[bestCnt])>4: # test the largest contour; ellipse requires at least 5 points
        # extract the 43 features from the largest contour of the eroded image
        feat=getFeatures(binaryIM,contours[bestCnt])
        xTrain=np.append(xTrain,feat,axis=0)
        nameIndex=className.index(file[0:3])
        yTrain=np.append(yTrain,nameIndex)
xTrain=np.delete(xTrain,0,0) # eliminate first row which contains all zeros
yTrain=np.delete(yTrain,0,0) # eliminate first row which contains all zeros
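
To see why erosion helps with "attached" objects, here is a minimal NumPy-only sketch of binary erosion on a synthetic image. It is an illustration, not a replacement for cv2.erode. It uses 0/1 pixels and a 3x3 kernel instead of the 0/255 images and 5x5 kernel above, and it treats pixels outside the image as background, which may differ from OpenCV's default border handling. The `erode_square` helper and the test image are assumptions.

```python
import numpy as np

def erode_square(binary, k):
    """Binary erosion with a k x k kernel of ones (k odd), zero-padded."""
    pad = k // 2
    padded = np.pad(binary, pad, mode="constant", constant_values=0)
    rows, cols = binary.shape
    out = np.ones_like(binary)
    # A pixel survives only if every pixel under the kernel is set.
    for dy in range(k):
        for dx in range(k):
            out &= padded[dy:dy + rows, dx:dx + cols]
    return out

# Two 7x7 squares joined by a one-pixel-wide bridge (hypothetical image).
img = np.zeros((9, 20), dtype=np.uint8)
img[1:8, 1:8] = 1
img[1:8, 12:19] = 1
img[4, 8:12] = 1

eroded = erode_square(img, 3)
print(img.sum(), eroded.sum())   # pixel counts before and after erosion
print(eroded[:, 7:13].any())     # is anything left between the squares?
```

The bridge disappears and each square shrinks by one pixel on every side, so a contour search would now find two separate objects. The cost is that erosion also removes thin parts of the object itself, which changes the measured area and perimeter.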


Let's check that we extracted features from 3954 training images. Notice the xTrain array has two dimensions: one row for each of the 3954 images and one column for each of the 43 features. The yTrain array has one dimension: the class, from 0 to 8, for each of the 3954 images.

In [None]:
print(xTrain.shape)
print(yTrain.shape)
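
These shapes follow from how the arrays are built: one (1, 43) row per image, appended along axis 0. As an alternative to growing the array with np.append and deleting a placeholder row, the rows can be collected in a Python list and stacked once. The sketch below uses a hypothetical `stack_features` helper and stand-in rows in place of real getFeatures output.

```python
import numpy as np

MAX_FEAT = 43

def stack_features(rows, labels):
    """Stack per-image (1, MAX_FEAT) feature rows into X and labels into y."""
    if not rows:
        # No usable images: return empty arrays with the expected widths.
        return np.zeros((0, MAX_FEAT)), np.zeros((0,))
    X = np.vstack(rows)
    y = np.asarray(labels, dtype=float)
    return X, y

# Hypothetical stand-ins for the output of getFeatures on three images.
rows = [np.full((1, MAX_FEAT), i, dtype=float) for i in range(3)]
labels = [0, 2, 1]

X, y = stack_features(rows, labels)
print(X.shape, y.shape)
```

Stacking once avoids copying the whole array on every append, which matters more as the number of images grows.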

# Create Testing Set

In [None]:
############## Get features of test set ##############
DIR=TEST_DIR
maxFeat=43
files = [f for f in listdir(DIR) if isfile(join(DIR, f))]
xTest=np.zeros((1,maxFeat))
yTest=np.zeros((1))
kernel = np.ones((5, 5), np.uint8)
for file in files:
    binaryIM=cv2.imread(DIR+'/'+file,cv2.IMREAD_GRAYSCALE)
    binaryIM = cv2.erode(binaryIM, kernel)
    binaryIM = cv2.erode(binaryIM, kernel)
    contours, hierarchy = cv2.findContours(binaryIM, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE) # all contour points, uses more memory
    i=0
    maxArea=0
    bestCnt=-1
    for cnt in contours:
        area = cv2.contourArea(cnt)
        if area>maxArea:
            maxArea=area
            bestCnt=i
        i+=1
    if maxArea>0 and bestCnt!=-1 and len(contours[bestCnt])>4: # test the largest contour; ellipse requires at least 5 points
        # extract the 43 features from the largest contour of the eroded image
        feat=getFeatures(binaryIM,contours[bestCnt])
        xTest=np.append(xTest,feat,axis=0)
        nameIndex=className.index(file[0:3])
        yTest=np.append(yTest,nameIndex)
xTest=np.delete(xTest,0,0) # eliminate first row which contains all zeros
yTest=np.delete(yTest,0,0) # eliminate first row which contains all zeros

Let's check that we extracted features from 1106 test images. Notice the xTest array has two dimensions: one row for each of the 1106 images and one column for each of the 43 features. The yTest array has one dimension: the class, from 0 to 8, for each of the 1106 images.

In [None]:
print(xTest.shape)
print(yTest.shape)

The features take all kinds of values, from small fractions to large integers, which can confuse some classifiers. To bring the numbers into a similar range, we standardize each feature so its mean (average) is 0 and its standard deviation is 1.

In [None]:
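######################## Scale Features ###############################
# Fit the scaler on the training set only, then apply the same
# transform to both sets so they share one scale.
scaler = preprocessing.StandardScaler().fit(xTrain)
xTrain = scaler.transform(xTrain)
xTest = scaler.transform(xTest)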
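
For intuition, here is a NumPy-only sketch of what standardization does. It assumes the usual practice of computing the mean and standard deviation from the training set and reusing them for the test set, so both sets end up on the same scale. The helper names and the random data are hypothetical.

```python
import numpy as np

def fit_standardizer(X):
    """Return per-feature mean and standard deviation of X."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled, avoid dividing by zero
    return mean, std

def standardize(X, mean, std):
    """Shift and scale each column with the given statistics."""
    return (X - mean) / std

rng = np.random.default_rng(0)
# Hypothetical data: column 0 is a small fraction, column 1 a large count.
train = np.column_stack([rng.random(200), rng.integers(1000, 5000, 200)]).astype(float)
test = np.column_stack([rng.random(50), rng.integers(1000, 5000, 50)]).astype(float)

mean, std = fit_standardizer(train)
train_s = standardize(train, mean, std)
test_s = standardize(test, mean, std)  # reuse the training statistics

print(train_s.mean(axis=0).round(6), train_s.std(axis=0).round(6))
print(test_s.mean(axis=0).round(3), test_s.std(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1. The test columns are only close to those values, which is expected when the statistics come from the training set.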

# Classify Testing Images and Plot Confusion Matrix

With all our features extracted and scaled for the training and testing sets, we are now ready to train four classifiers. We will also time the training and testing of each classifier and use those times, along with accuracy, to evaluate each classifier's performance. Time is important if you want a fast, responsive system, such as a self-driving car. We will also create a confusion matrix for each classifier to see which classes cause it to make mistakes.
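
The timing pattern can be wrapped in a small helper. The sketch below is one possible way to do it and is not part of the original notebook. It uses `time.perf_counter`, a clock intended for measuring short durations, and a hypothetical `timed` function with a stand-in workload.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

# Stand-in workload; in the notebook this would be clf.fit or clf.predict.
result, ms = timed(sum, range(1_000_000))
print(result, round(ms, 1), "msec")
```

A single measurement is noisy, so repeating the call a few times and reporting the median gives a steadier comparison between classifiers.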

In [None]:
######################## Classify #####################################
classifiers = [
    SVC(gamma='auto'),
    KNeighborsClassifier(3),
    RandomForestClassifier(),
    LinearDiscriminantAnalysis()
    ]

for clf in classifiers:
    T1=time.time()
    clf.fit(xTrain, yTrain)
    name = clf.__class__.__name__
    T2=time.time()
    yPredict = clf.predict(xTest)
    T3=time.time()
    acc = accuracy_score(yTest, yPredict)
    tTrain=round((T2-T1)*1000,0)
    tPredict=round((T3-T2)*1000,0)
    print(name,'Accuracy:',round(acc*100,1),'%','Train:',int(tTrain),'msec','Predict:',int(tPredict),'msec')
    
    # create normalized confusion matrix
    # passing (yPredict, yTest) puts predicted labels on rows and true labels on columns
    cm=confusion_matrix(yPredict,yTest)
    cmNorm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    cmNorm = cmNorm.round(decimals=2)
    ax=sn.heatmap(cmNorm,annot=True,xticklabels=className,yticklabels=className)
    plt.title(name)
    plt.xlabel('True Label')
    plt.ylabel('Predicted Label')
    plt.show()
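
To make the two metrics concrete, here is a NumPy-only sketch that computes accuracy and a row-normalized confusion matrix from label arrays. It puts true labels on rows, following scikit-learn's `confusion_matrix(y_true, y_pred)` convention, which is the transpose of the layout plotted above. The labels and the `confusion_counts` helper are hypothetical.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """cm[i, j] = number of samples with true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Hypothetical labels for three classes.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 1])

cm = confusion_counts(y_true, y_pred, 3)

# Accuracy is the fraction of samples on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Normalize each row; guard against classes with no samples.
row_sums = cm.sum(axis=1, keepdims=True)
cm_norm = cm / np.maximum(row_sums, 1)

print(cm)
print(accuracy)
print(cm_norm.round(2))
```

With true labels on rows, each diagonal entry of the normalized matrix is the fraction of that class classified correctly. With predicted labels on rows, as in the plots above, the diagonal instead shows how often a prediction of that class is right.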