# Assignment Living skin detection # 


## Authors: Jan Zahn, Jonas Meier, Thomas Wiktorin ##

## Task ##

Train a classifier that distinguishes between living and dead materials with the highest possible success rate.
The neural network should use only a few of the most distinctive input features, i.e. spectral lines.

Steps:

Read all XLS and CSV data files into Python.  
Visualize them and based on your insight suggest / try some decent feature classifiers.  
Select them, train them and validate them.  
Analyse runtime and memory footprint.  
Argue why your solution is appropriate.  

EXTENSION:
Please implement and compare "Living Skin" detection using an MLP, SVMs, and radial basis functions (RBFs).  
Important metrics are especially the confusion matrix, precision, and recall; others may include runtime, memory footprint,  
training time, etc.  
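
The SVM part of the extension is not covered by the notebook code below. As a rough sketch of what such a comparison could look like, the following uses scikit-learn's `SVC` with an RBF kernel on synthetic data and reports the confusion matrix, precision, and recall. The data, class sizes, and parameter choices are assumptions made for illustration only, not the measurements or settings used in this assignment.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for the spectra (not the real measurements).
rng = np.random.default_rng(0)
dead = rng.normal(loc=0.0, scale=1.0, size=(170, 12))
living = rng.normal(loc=2.0, scale=1.0, size=(30, 12))
X = np.vstack([dead, living])
y = np.array([0] * len(dead) + [1] * len(living))  # 1 = living, 0 = dead

# Stratified split keeps both classes present in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# SVM with an RBF kernel; class_weight="balanced" compensates for the rare class.
classifier = SVC(kernel="rbf", gamma="scale", class_weight="balanced")
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
print(cm)
print("precision:", precision, "recall:", recall)
```

Using `class_weight="balanced"` is one possible way to handle unequal class sizes without duplicating samples; whether it works on the real spectra would have to be checked.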


## Explanation: ##

BUG NOTE: We shuffle the data so that both classes are spread evenly across the training and test sets. Sometimes the random shuffle does not reorder the data matrix and the class matrix in the same way, resulting in an unusable model with 50% accuracy. If this happens, re-run the program.
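
One way to rule out a misaligned shuffle is to draw a single index permutation and apply it to both arrays. The sketch below uses made-up arrays (`features` and `labels` are illustrative names, not the notebook's variables) and checks that every row still carries its own label afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only to make the sketch repeatable

# Toy data: the first column of each row encodes its class (hypothetical values).
features = np.array([[0.0, 1.0], [0.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
labels = np.array([0, 0, 1, 1])

# Draw one permutation and index both arrays with it.
order = rng.permutation(len(features))
shuffled_features = features[order]
shuffled_labels = labels[order]

# Every row must still match its label after shuffling.
rows_still_aligned = bool(np.all(shuffled_features[:, 0] == shuffled_labels))
print(rows_still_aligned)
```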

In our first program we used all 121 values as inputs to see how accurate the model is. We consistently reached over 96 percent accuracy. Varying the number of epochs, the batch size, the sizes of the validation, training, and test sets, and the hidden layers did not change the results much. However, we believe the hidden layers should be kept small to limit overfitting.   

One problem was the unequal representation of the classes. The "living material" class has just 6 examples, while the other class is represented by 171 data sets. To reach a high accuracy, our model simply classified everything as "dead material" and immediately had high training and validation accuracy. Of course, this accuracy is meaningless, since we are often interested in the underrepresented class.  

At first, we tried to weight our examples by the means of sensitivity and specificity.  

sensitivity = true positives / positives  
specificity = true negatives / negatives  

Our model always predicts dead material and therefore has a sensitivity of 0 and a specificity of 1.  
We want a model whose sensitivity and specificity are both close to 1.  
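
To make the two definitions concrete, here is a small sketch that evaluates a classifier which always answers "dead" on the class sizes given above (6 living, 171 dead). The function name and the label encoding (1 = living, 0 = dead) are assumptions chosen for this example.

```python
def sensitivity_specificity(true_labels, predicted_labels):
    """Return (sensitivity, specificity) for binary labels, 1 = living, 0 = dead."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    true_neg = sum(1 for t, p in pairs if t == 0 and p == 0)
    positives = sum(1 for t in true_labels if t == 1)
    negatives = sum(1 for t in true_labels if t == 0)
    return true_pos / positives, true_neg / negatives

# Class sizes from the text: 6 living samples, 171 dead samples.
true_labels = [1] * 6 + [0] * 171
always_dead = [0] * len(true_labels)

sens, spec = sensitivity_specificity(true_labels, always_dead)
accuracy = sum(t == p for t, p in zip(true_labels, always_dead)) / len(true_labels)
print(sens, spec, round(accuracy, 3))  # prints: 0.0 1.0 0.966
```

The accuracy of about 96.6% with a sensitivity of 0 shows why accuracy alone is not a useful measure here.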

Weighting did not fix the problem, so we increased the number of positive examples by duplicating them until both classes had about the same number of examples.  
This fixed the issue: we got high accuracy (>95%) and sensitivity and specificity values close to 1.
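
The duplication step can be sketched with `np.tile`. The arrays below are random placeholders with the class sizes from the text; only the shapes matter for the illustration.

```python
import numpy as np

# Stand-in arrays with the class sizes from the text (values are random placeholders).
rng = np.random.default_rng(0)
negatives = rng.normal(size=(171, 121))
positives = rng.normal(size=(6, 121))

# Repeat the minority class a whole number of times so both classes are roughly equal.
repeats = len(negatives) // len(positives)  # 171 // 6 = 28
positives_balanced = np.tile(positives, (repeats, 1))

print(repeats, positives_balanced.shape)  # prints: 28 (168, 121)
```

Because the copies are identical, splitting into training and test data after duplicating can place copies of the same measurement in both sets, so the test accuracy may look better than it would on new skin samples.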

In the second program we reduced the number of features; otherwise it is the same as the first.
We reduced the input layer from 121 to 12.

We looked at the second derivative, and the concavity (see program 2 graphs) seemed to provide a few distinct values, which suits our goal of using few features. We chose the points where the second derivative reaches its extremes, both minima and maxima, resulting in a total of 12 points.   
This resulted in higher accuracy (>98%) and more stable learning (see the last graphs in program two).  
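
The selection idea can be illustrated on a synthetic curve: compute the second derivative with two passes of `np.gradient` and keep the positions of its largest and smallest values. The curve, the wavelength grid, and `k = 3` are assumptions for this sketch, not the values used for the real spectra.

```python
import numpy as np

# Synthetic "spectrum": a flat curve with one dip at 1000 nm (not real data).
wavelengths = np.arange(670, 1600)
spectrum = 1.0 - 0.5 * np.exp(-((wavelengths - 1000) / 20.0) ** 2)

# Second derivative via two passes of finite differences.
second_derivative = np.gradient(np.gradient(spectrum))

# Indices of the k largest and k smallest second-derivative values.
k = 3
order = np.argsort(second_derivative)
lowest = order[:k]
highest = order[-k:][::-1]

# Merge both sets and drop duplicates; these indices pick the input features.
selected = np.unique(np.concatenate([highest, lowest]))
selected_wavelengths = wavelengths[selected]
print(selected_wavelengths)
```

The selected indices can then be used to slice every spectrum, for example `spectra[:, selected]`, so that the network only sees those few wavelengths.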

Runtime for a single prediction of our model with 12 features:  
0.0001302809675962635 seconds  
And with 121 features:  
0.0001329888860936989 seconds  
  
340 evaluations with 121 features take:  
0.00032795901779536507 seconds  
  
340 evaluations with 12 features take:  
0.000109520259115925 seconds  

We get a slight improvement, but with our data and network size, runtime and memory do not seem to matter much.
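
For a network this small, the memory footprint is dominated by the number of weights. The sketch below counts the trainable parameters of a network with one hidden dense layer and two outputs, once for 121 and once for 12 inputs. The hidden layer size of 5 is an assumption (the smallest size tried in the notebook), and `dense_parameter_count` is a helper introduced only for this example.

```python
def dense_parameter_count(n_inputs, n_hidden, n_outputs=2):
    """Weights plus biases of a network with one hidden dense layer."""
    hidden_layer = n_inputs * n_hidden + n_hidden
    output_layer = n_hidden * n_outputs + n_outputs
    return hidden_layer + output_layer

params_full = dense_parameter_count(121, 5)
params_small = dense_parameter_count(12, 5)
print(params_full, params_small)  # prints: 622 77
```

At 4 bytes per 32-bit weight this is roughly 2.5 kB versus 0.3 kB, which is consistent with memory not being a concern at this scale.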

In [1]:
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
import numpy as np
import csv

#-------------------------------------------------------------------------------Get Data
dataFilesOldNegatives = ["Fleisch", "Holz", "Leder", "Stoff"]
dataFilesOldPositives = ["Referenz-Haut_6-Klassen"]

dataFilesNewNegatives = ["2016material", "2016material-fake"]
dataFilesNewPositives = ["2016skin"]

def importData(fileNames):
    """Read the given CSV files and return (xPoints, data), one row per measurement."""
    data = np.array([])
    xPoints = np.array([])
    for dataType in fileNames:
        # Forward slashes work on Windows as well as on Linux/macOS
        with open("Archiv/" + dataType + '.csv', mode='r') as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=';')

            # Convert decimal commas to decimal points
            newList = [[value.replace(',', '.') for value in row] for row in csv_reader]

            newData = np.asarray(newList)
            newData = np.delete(newData, 0, axis=0)   # drop the first row
            if xPoints.size == 0:
                xPoints = newData[:, 0]               # first column holds the x values
            newData = np.delete(newData, 0, axis=1)

            if data.size == 0:
                data = newData.transpose()
            else:
                data = np.append(data, newData.transpose(), axis=0)

    # Convert once after all files are read; np.float was removed from NumPy,
    # the builtin float is equivalent
    data = data.astype(float)
    xPoints = xPoints.astype(float)
    return xPoints, data


#Negatives Old
xValuesOld, dataNegativesOld = importData(dataFilesOldNegatives)

#Positives Old
_, dataPositivesOld = importData(dataFilesOldPositives)

#Negatives New
xValuesNew, dataNegativesNew = importData(dataFilesNewNegatives)

#Positives New
_, dataPositivesNew = importData(dataFilesNewPositives)

#Increase positives to deal with unbalanced classes (maybe not needed with 2016 set)
dataPositives = np.tile(dataPositivesOld,(28,1)) 

In [17]:
#--------------------------------------------------------------------------------Plott Data
def plottSpectrals(x, y, title, fmt=""):
    """Plot every row of y against x in one figure with the given title."""
    for row in y:
        plt.plot(x, row, fmt)
    plt.title(title)
    plt.show()
        
plottSpectrals(xValuesOld,dataNegativesOld, "Negative old")
plottSpectrals(xValuesOld,dataPositivesOld, "Positive old")
plottSpectrals(xValuesNew,dataNegativesNew, "Negative new")
plottSpectrals(xValuesNew,dataPositivesNew, "Positive new")


In [18]:
#--------------------------------------------------------------------------------Some Information on Data
print("Old xValue data: " + str(xValuesOld.shape))
print("Old negative data: " + str(dataNegativesOld.shape))
print('Old positive data: ' + str(dataPositivesOld.shape))
print("New xValue data: " + str(xValuesNew.shape))
print('New negative data: ' + str(dataNegativesNew.shape))
print('New positive data: ' + str(dataPositivesNew.shape))



In [7]:
#---------------------------------------------------------------------------------Combine different Data
#Old wavelength 400-1600 ;                           in steps of 10
#New wavelength 670-1690 (everything after is NaN);  in steps of 1 

#Delete NaN at the end of new files
dataNegativesNew = dataNegativesNew[:,:xValuesNew.size-10]
dataPositivesNew = dataPositivesNew[:,:xValuesNew.size-10]
xValuesNew = xValuesNew[:xValuesNew.size-10]

#InterpolateOldData to match new 
tmp_positive_old = np.empty((len(dataPositivesOld),1200))
tmp_negative_old = np.empty((len(dataNegativesOld),1200))
xValuesAlteredOld = np.asarray(range(400,1600))

for i in range(len(dataPositivesOld)):
    tmp_positive_old[i,:] = np.interp(xValuesAlteredOld,xValuesOld,dataPositivesOld[i,:])

for i in range(len(dataNegativesOld)):
    tmp_negative_old[i,:] = np.interp(xValuesAlteredOld,xValuesOld,dataNegativesOld[i,:])
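
The resampling done by `np.interp` above can be tried in isolation. The sketch below builds a coarse grid matching the description of the old data (400 nm to 1600 nm in steps of 10) with placeholder values and interpolates it linearly onto a 1 nm grid. The variable names and the values are assumptions for this illustration.

```python
import numpy as np

# Coarse grid as described above: 400 nm to 1600 nm in steps of 10 (121 points).
coarse_x = np.arange(400, 1601, 10)
coarse_y = np.linspace(0.0, 1.0, coarse_x.size)  # placeholder measurement values

# Target grid in steps of 1 nm; np.interp fills in the values linearly.
fine_x = np.arange(400, 1600)
fine_y = np.interp(fine_x, coarse_x, coarse_y)

print(coarse_x.size, fine_x.size)  # prints: 121 1200
```

Linear interpolation adds no new information; it only puts both data sets on the same wavelength grid so that their columns can be compared.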


In [8]:
#---------------------------------------------------------------------------------Print new data format

xValuesOld = xValuesAlteredOld
dataNegativesOld = tmp_negative_old
dataPositivesOld = tmp_positive_old

print("After interpolation:")
print("Old negative data: " + str(dataNegativesOld.shape))
print('Old positive data: ' + str(dataPositivesOld.shape))
print('New negative data: ' + str(dataNegativesNew.shape))
print('New positive data: ' + str(dataPositivesNew.shape))


In [9]:
#---------------------------------------------------------------------------------Cut off old data to match new data

dataNegativesOld = dataNegativesOld[:,int(xValuesNew[0]-xValuesOld[0]):]
dataPositivesOld = dataPositivesOld[:,int(xValuesNew[0]-xValuesOld[0]):]
dataNegativesNew = dataNegativesNew[:,:int(xValuesOld[xValuesOld.size-1]-xValuesNew[0])+1]
dataPositivesNew = dataPositivesNew[:,:int(xValuesOld[xValuesOld.size-1]-xValuesNew[0])+1]

xValuesOld = np.asarray(range(670,1600))
xValuesNew = xValuesOld


In [10]:
#---------------------------------------------------------------------------------Print new data format

print("After cutting:")
print("Old negative data: " + str(dataNegativesOld.shape))
print('Old positive data: ' + str(dataPositivesOld.shape))
print('New negative data: ' + str(dataNegativesNew.shape))
print('New positive data: ' + str(dataPositivesNew.shape))


In [11]:
# Normalisation of ys over all data (subtracts the global mean)

# Stores the average value of each measurement
avg = []
for dataSet in [dataNegativesOld, dataPositivesOld, dataNegativesNew, dataPositivesNew]:
    for measure in dataSet:
        # Average of one measurement
        avg.append(np.average(measure))
# Average of the per-measurement averages
avg = np.average(avg)
print("Average value of all measurements is", avg)

# Subtract the global average from every measurement (modifies the arrays in place)
for dataSet in [dataNegativesOld, dataPositivesOld, dataNegativesNew, dataPositivesNew]:
    for index, measure in enumerate(dataSet):
        dataSet[index] = measure - avg

#--------------------------------------------------------------------------------Plott Data        
plottSpectrals(xValuesOld,dataNegativesOld, "Negative old normalised")
plottSpectrals(xValuesOld,dataPositivesOld, "Positive old normalised")
plottSpectrals(xValuesNew,dataNegativesNew, "Negative new normalised")
plottSpectrals(xValuesNew,dataPositivesNew, "Positive new normalised")


In [12]:
#---------------------------------------------------------------------------------New Features
# Extra values for fewer features (maybe)
# gradientData = 2nd derivative of the input data
gradientData = np.empty(dataPositivesOld.shape)
for i, row in enumerate(dataPositivesOld):
    gradient = np.gradient(np.gradient(row))  # where does the gradient change fastest
    plt.plot(xValuesOld, gradient)
    gradientData[i] = gradient
    
plt.title("Change of gradient for old positive data")
plt.show()


In [13]:
#--------------------------------------------------------------------------------Split Up Data
#hardcoded
# Get gradients
#maxGradients = gradientData.argmax(axis=1)
#minGradients = gradientData.argmin(axis=1)
#smallXValues = np.append(maxGradients,minGradients,axis=0)

number_of_values = 5

# np.int was removed from NumPy; the builtin int is equivalent
maxGradients = np.array([], dtype=int)
minGradients = np.array([], dtype=int)
for entry in gradientData:
    order = entry.argsort()
    maxGradients = np.append(maxGradients, order[-number_of_values:][::-1])
    minGradients = np.append(minGradients, order[:number_of_values])

smallXValues = np.append(maxGradients, minGradients, axis=0)

# Remove duplicate indices
smallXValues = np.unique(smallXValues)
# smallXValues holds column indices; look up the matching wavelengths for printing
print("Wavelengths to use for further operations:", xValuesOld[smallXValues], "nm.")

# Small features = only wavelengths where gradient is max or min
smallFeaturesPositiveReal = dataPositivesOld[:,smallXValues]
smallFeaturesPositive = np.tile(smallFeaturesPositiveReal,(28,1)) #Increase positives to deal with unbalanced classes
smallFeature = dataNegativesOld[:,smallXValues]

#[1,0] is dead, [0,1] alive
#CompleteSet Small Version
compSet = np.append(smallFeature, smallFeaturesPositive ,axis=0)
classifcSet = np.append(np.tile([1,0],(smallFeature.shape[0],1)),np.tile([0,1],(smallFeaturesPositive.shape[0],1)),axis=0)

#shuffle data together
mix = np.random.permutation(len(compSet))
compSet = compSet[mix]
classifcSet = classifcSet[mix]

#Split in training and test Data
#trainingSet = compSet[:200]
#trainingLabelSet =  classifcSet[:200]

trainingSet = compSet[:300]
trainingLabelSet =  classifcSet[:300]

#validationSet = compSet[200:300]
#validationLabelSet = classifcSet[200:300]

testSet = compSet[300:]
testLabelSet = classifcSet[300:]

print(smallXValues.shape)
print(trainingSet.shape)
print(smallFeature.shape)
print(smallFeaturesPositive.shape)

plottSpectrals(smallXValues,smallFeature, "Negative features", "o")
plottSpectrals(smallXValues,smallFeaturesPositive, "Positive features", "o")


In [14]:
#--------------------------------------------------------------------------------Build Model with tensorflow

hidden_neuron_size = [5,10,15,20,50,100,500]
loss_array = []
acc_array = []

from tensorflow.keras import backend as K

# https://stackoverflow.com/questions/46009619/keras-weighted-binary-crossentropy
# Alternative loss; only used when the commented-out loss= line in compile() is enabled
def weighted_binary_crossentropy(y_true, y_pred):

    one_weight = 1
    zero_weight = 1

    # Original binary crossentropy (see losses.py):
    # K.mean(K.binary_crossentropy(y_true, y_pred), axis=-1)

    # Calculate the binary crossentropy
    b_ce = K.binary_crossentropy(y_true, y_pred)

    # Apply the weights
    weight_vector = y_true * one_weight + (1. - y_true) * zero_weight
    weighted_b_ce = weight_vector * b_ce

    # Return the mean error
    return K.mean(weighted_b_ce)


for number_of_hidden_neurons in hidden_neuron_size:
    model = keras.Sequential() #Single input-output
    model.add(keras.layers.Dense(number_of_hidden_neurons, activation=tf.nn.relu, input_shape=(trainingSet.shape[1],))) #fully-connected = dense, one hidden layer; relu: rectified linear unit
    model.add(keras.layers.Dense(2, activation=tf.nn.softmax)) #Confidence level per class

    model.summary()

    #Optimizer and loss function
    model.compile(optimizer='adam', #or sgd (stochastic gradient descent optimizer: keras.optimizers.SGD(lr=0.01, momentum=0.0, decay=0.0, nesterov=False))
                  loss='binary_crossentropy', #or mean_squared_error (our target is not in the continuous space), but binary seems to deal better with probabilities
                  #loss=weighted_binary_crossentropy,
                  metrics=['accuracy']
                  )


    #Weighted samples did not solve the unequal class problem
    #[1,0] is dead, [0,1] alive

    # Create sample weights
    # As written, dead samples ([1,0]) get weight 2 and living samples ([0,1]) weight 1;
    # swap the two values to make the positive samples count more
    weights = []
    for index, entry in enumerate(trainingSet):
        if trainingLabelSet[index][0] == 0:
            weights.append(1)
        else:
            weights.append(2)
    weights = np.array(weights)

    #class_weight=[3, .5]

    #Train model for 50 epochs in batches of 10 samples
    history = model.fit(trainingSet,
                        trainingLabelSet,
                        epochs=50,
                        batch_size=10,  #the bigger the more memory space needed
                        validation_split=0.2,
                        #verbose=1,
                        verbose=0,
                        sample_weight=weights,
                        #class_weight=class_weight
                        )

    print()
    results = model.evaluate(testSet, testLabelSet)
    positive = 0
    negative = 0
    for entry in testLabelSet:
        if entry[0] == 0:
            positive += 1
        else:
            negative += 1

    print("Positives in testset = ", positive)
    print("Negatives in testset = ", negative)

    for index, metric in enumerate(model.metrics_names):
        print(metric, ": ", results[index])
        # Loss
        if index == 0:
            loss_array.append(results[index])
        else:
            acc_array.append(results[index])
            

    # Creating the Confusion Matrix
    from sklearn.metrics import confusion_matrix
    y_pred = model.predict(testSet)
    y_test = testLabelSet
    #print(y_test)
    #print(y_pred)
    #print(y_pred.round())
    cm = confusion_matrix(y_test.argmax(axis=1), y_pred.round().argmax(axis=1))
    print(cm)

plt.plot(hidden_neuron_size, loss_array, label="Loss")
plt.plot(hidden_neuron_size, acc_array, label="Accuracy")
plt.xlabel("Number of hidden neurons")
plt.xscale('log')
plt.legend()
plt.show()


In [15]:

#--------------------------------------------------------------------------------Print Results
#Plot accuracy and loss over time
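history_dict = history.history
print(history_dict.keys())

# Older Keras/TensorFlow versions store accuracy under 'acc', newer ones under 'accuracy'
acc_key = 'acc' if 'acc' in history_dict else 'accuracy'
acc = history_dict[acc_key]
val_acc = history_dict['val_' + acc_key]
loss = history_dict['loss']
val_loss = history_dict['val_loss']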

epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

plt.clf()   # clear figure

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()



In [None]:
# ----------------------------------------------------Some Time estimation
from timeit import default_timer as timer

start = timer()
result = model.predict(smallFeature)
end = timer()
ms = (end - start) * 1000  # timer() returns seconds; convert to milliseconds
print("%fms"% ms)

In [None]:
#http://www.rueckstiess.net/research/snippets/show/72d2363e
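# Import the array helpers explicitly from NumPy instead of relying on scipy's wildcard namespace
from numpy import array, dot, exp, random, zeros
from scipy.linalg import norm, pinv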
 
from matplotlib import pyplot as plt
 
class RBF:
     
    def __init__(self, indim, numCenters, outdim):
        self.indim = indim
        self.outdim = outdim
        self.numCenters = numCenters
        self.centers = [random.uniform(-1, 1, indim) for i in range(numCenters)]
        self.beta = 8
        self.W = random.random((self.numCenters, self.outdim))
         
    def _basisfunc(self, c, d):
        assert len(d) == self.indim
        return exp(-self.beta * norm(c-d)**2)
     
    def _calcAct(self, X):
        # calculate activations of RBFs
        G = zeros((X.shape[0], self.numCenters), float)
        for ci, c in enumerate(self.centers):
            for xi, x in enumerate(X):
                G[xi,ci] = self._basisfunc(c, x)
        return G
     
    def train(self, X, Y):
        """ X: matrix of dimensions n x indim 
            Y: matrix of dimensions n x outdim """
         
        # choose random center vectors from training set
        rnd_idx = random.permutation(X.shape[0])[:self.numCenters]
        self.centers = [X[i,:] for i in rnd_idx]
         
        print ("center", self.centers)
        # calculate activations of RBFs
        G = self._calcAct(X)
        print (G)
         
        # calculate output weights (pseudoinverse)
        self.W = dot(pinv(G), Y)
         
    def test(self, X):
        """ X: matrix of dimensions n x indim """
         
        G = self._calcAct(X)
        Y = dot(G, self.W)
        return Y
 
    
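# Input size follows the number of selected features instead of a hardcoded 12
myRFB = RBF(trainingSet.shape[1], 3, 2)
myRFB.train(trainingSet, trainingLabelSet)
rbfPrediction = myRFB.test(testSet)

# Share of test samples whose predicted class matches the label
rbfAccuracy = np.mean(rbfPrediction.argmax(axis=1) == testLabelSet.argmax(axis=1))
print("RBF test accuracy:", rbfAccuracy)

# The centres are vectors with one value per feature, so plot each centre
# across the feature index (the 1-D plot of the original snippet does not apply here)
for c in myRFB.centers:
    plt.plot(c, 'gs-', linewidth=0.2)
plt.title("RBF centres")
plt.show()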

In [None]:
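# binary_crossentropy expects floating-point tensors; mean and binary_crossentropy live in keras.backend
true = tf.convert_to_tensor([0, 1], dtype=tf.float32)
predict = tf.convert_to_tensor([0, 1], dtype=tf.float32)
def binary_crossentropy(y_true, y_pred):
    return keras.backend.mean(keras.backend.binary_crossentropy(y_true, y_pred), axis=-1)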
print(binary_crossentropy(true, predict))