# Neural Network - Ecoli data set

#### Title of the data set: Protein Localization Sites

#### Number of instances: 336

#### Number of attributes: 8 (7 predictive, 1 name)

#### Attribute Information:

  1.  **Sequence Name:** Accession number for the SWISS-PROT database.
  2.  **mcg:** McGeoch's method for signal sequence recognition.
  3.  **gvh:** von Heijne's method for signal sequence recognition.
  4.  **lip:** von Heijne's Signal Peptidase II consensus sequence score.
  5.  **chg:** Presence of charge on N-terminus of predicted lipoproteins.
  6.  **aac:** Score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
  7.  **alm1:** Score of the ALOM membrane spanning region prediction program.
  8.  **alm2:** Score of the ALOM program after excluding putative cleavable signal regions from the sequence.
  9.  **Class:** The localization site, one of:
      - cp  (cytoplasm)
      - im  (inner membrane without signal sequence)
      - pp  (periplasm)
      - imU (inner membrane, uncleavable signal sequence)
      - om  (outer membrane)
      - omL (outer membrane lipoprotein)
      - imL (inner membrane lipoprotein)
      - imS (inner membrane, cleavable signal sequence)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
pd.set_option('display.max_rows', 200)

In [2]:
df = pd.read_csv('../data/Ecoli.csv', header=None,
                 names=['SequenceName', 'MCG', 'GVH', 'LIP', 'CHG', 'AAC', 'ALM1', 'ALM2', 'Class'])

In [3]:
df.sample(5)

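|     | SequenceName | MCG  | GVH  | LIP  | CHG | AAC  | ALM1 | ALM2 | Class |
|-----|--------------|------|------|------|-----|------|------|------|-------|
| 20  | CRL_ECOLI    | 0.4  | 0.45 | 0.48 | 0.5 | 0.38 | 0.22 | 0.0  | cp    |
| 62  | MALZ_ECOLI   | 0.36 | 0.41 | 0.48 | 0.5 | 0.48 | 0.47 | 0.54 | cp    |
| 316 | OSMY_ECOLI   | 0.64 | 0.66 | 0.48 | 0.5 | 0.41 | 0.39 | 0.2  | pp    |
| 262 | NFRA_ECOLI   | 0.61 | 0.75 | 0.48 | 0.5 | 0.51 | 0.33 | 0.33 | om    |
| 264 | OMPA_ECOLI   | 0.74 | 0.9  | 0.48 | 0.5 | 0.57 | 0.53 | 0.29 | om    |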


In [4]:
#Prepare the input data
X = df[['MCG', 'GVH', 'LIP', 'CHG', 'AAC', 'ALM1', 'ALM2']]
X = np.array(X)
X[:5]

array([[0.49, 0.29, 0.48, 0.5 , 0.56, 0.24, 0.35],
       [0.07, 0.4 , 0.48, 0.5 , 0.54, 0.35, 0.44],
       [0.56, 0.4 , 0.48, 0.5 , 0.49, 0.37, 0.46],
       [0.59, 0.49, 0.48, 0.5 , 0.52, 0.45, 0.36],
       [0.23, 0.32, 0.48, 0.5 , 0.55, 0.25, 0.35]])

In [5]:
# One-hot encode the class labels. OneHotEncoder orders the categories by sorting the labels,
# so (cp, im, imL, imS, imU, om, omL, pp) map to columns (0, 1, 2, 3, 4, 5, 6, 7):
# cp becomes [1, 0, 0, 0, 0, 0, 0, 0], im becomes [0, 1, 0, 0, 0, 0, 0, 0], and so on.
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use sparse_output=False on newer versions.
one_hot_encoder = OneHotEncoder(sparse=False)

Y = df.Class
Y = one_hot_encoder.fit_transform(np.array(Y).reshape(-1, 1))
Y[:5]

array([[1., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0.]])
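
As a quick check on the column order, the sketch below rebuilds a one-hot encoding with plain NumPy. It assumes that categories are ordered by sorting the labels, which is what `OneHotEncoder` does by default; the `labels` array is an illustrative stand-in for `df.Class`, not the real data.

```python
import numpy as np

# Illustrative stand-in for df.Class: one sample of each localization site.
labels = np.array(['cp', 'im', 'pp', 'imU', 'om', 'omL', 'imL', 'imS'])

# np.unique returns the sorted distinct labels and, with return_inverse=True,
# the position of each sample's label in that sorted list.
categories, codes = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories))[codes]

print(list(categories))
print(one_hot[2])  # one-hot row for the third sample, 'pp'
```

Because the labels are sorted, `pp` lands in the last column rather than the third, which matters when a predicted one-hot vector is mapped back to a class name.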

In [6]:
#Split the data set into train/validation/test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.15)
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.1)
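
The sizes of the three subsets can be worked out by hand. The sketch below assumes that `train_test_split` rounds a fractional test size up to the next whole sample; `split_sizes` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

# First split: hold out 15% of the 336 rows as the test set.
n_train_full, n_test = split_sizes(336, 0.15)
# Second split: hold out 10% of the remaining rows for validation.
n_train, n_val = split_sizes(n_train_full, 0.1)

print(n_train, n_val, n_test)
```

Under that assumption the split is 256 training, 29 validation and 51 test rows, which agrees with the accuracies logged later in the notebook (for example, 0.63671875 is 163/256).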

## Implementation

In [7]:
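def initializeWeight(nodes):
    # nodes holds the layer sizes; each layer gets a weight matrix with nodes[i] rows and
    # nodes[i-1] + 1 columns, where the first column holds the bias weights.
    # All values are drawn uniformly from [-1, 1).
    layers, weights = len(nodes), []

    for i in range(1, layers):
        wt = [[np.random.uniform(-1, 1) for k in range(nodes[i-1] + 1)] for j in range(nodes[i])]
        weights.append(np.matrix(wt))

    return weights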
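
For reference, the same initialization can be drawn in one call per layer. This is only a sketch: `initialize_weights_vectorized` is a hypothetical name, it returns plain `ndarray`s rather than `np.matrix` objects, and it uses NumPy's `default_rng` generator instead of the global random state.

```python
import numpy as np

def initialize_weights_vectorized(nodes, rng=None):
    """Hypothetical variant: one (nodes[i], nodes[i-1] + 1) array per layer."""
    rng = np.random.default_rng() if rng is None else rng
    return [rng.uniform(-1, 1, size=(nodes[i], nodes[i - 1] + 1))
            for i in range(1, len(nodes))]

# Layer sizes used later in this notebook: 7 inputs, hidden layers of 5 and 10, 8 classes.
demo_weights = initialize_weights_vectorized([7, 5, 10, 8], rng=np.random.default_rng(0))
print([w.shape for w in demo_weights])
```

Each shape has one more column than the previous layer has nodes, because the bias is treated as an extra input fixed at 1.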

In [8]:
def neuralNet(X_train, Y_train, X_val=None, Y_val=None, iterations=10, nodes=[], rate=0.15):
    # nodes lists the layer sizes: [inputs, hidden_1, ..., hidden_k, outputs]
    weights = initializeWeight(nodes)

    for iteration in range(1, iterations+1):
        weights = trainNetwork(X_train, Y_train, rate, weights)

        # Print the training and validation accuracy every 20 iterations
        if iteration % 20 == 0:
            print("Iteration {}".format(iteration))
            print("Training Accuracy:{}".format(accuracy(X_train, Y_train, weights)))
            # Validation data is optional, so check for None instead of calling .any() on it
            if X_val is not None and Y_val is not None:
                print("Validation Accuracy:{}".format(accuracy(X_val, Y_val, weights)))

    return weights

In [9]:
def feedForward(x, weights, layers):
    output, current_input = [x], x

    for j in range(layers):
        activation = Sigmoid(np.dot(current_input, weights[j].T))
        output.append(activation)
        current_input = np.append(1, activation) # add the bias = 1
    
    return output
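
To make the shapes in the forward pass concrete, here is a minimal sketch with plain arrays. The names `demo_forward`, `demo_sigmoid`, `demo_weights` and the constant input are illustrative assumptions, not part of the notebook's implementation.

```python
import numpy as np

def demo_sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def demo_forward(x, weights):
    """Return [input_with_bias, a_1, ..., a_L] for a single sample."""
    current = np.append(1.0, x)               # prepend the bias input
    outputs = [current]
    for w in weights:
        activation = demo_sigmoid(w @ current)
        outputs.append(activation)
        current = np.append(1.0, activation)  # prepend the bias for the next layer
    return outputs

rng = np.random.default_rng(0)
sizes = [7, 5, 10, 8]
demo_weights = [rng.uniform(-1, 1, size=(sizes[i], sizes[i - 1] + 1))
                for i in range(1, len(sizes))]

demo_outputs = demo_forward(np.full(7, 0.5), demo_weights)
print([o.shape for o in demo_outputs])
```

The stored activations do not include the bias term; it is prepended only when an activation is fed into the next layer, which is why back-propagation has to add it again.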

In [10]:
def backPropagation(y, output, weights, layers, rate):
    outputFinal = output[-1]
    error = np.matrix(y - outputFinal) # Error at the output layer

    # Back propagate the error, from the last layer to the first
    for j in range(layers, 0, -1):
        currOutput = output[j]

        if(j > 1):
            # Prepend the bias input to the previous layer's output
            prevOutput = np.append(1, output[j-1])
        else:
            # The network input already carries the bias term
            prevOutput = output[0]

        delta = np.multiply(error, sigmoidDerivative(currOutput))
        weights[j-1] += rate * np.multiply(delta.T, prevOutput)

        wt = np.delete(weights[j-1], [0], axis=1) # Remove the bias column from the weights
        error = np.dot(delta, wt) # Error passed back to the previous layer

    return weights

In [11]:
# Performs forward and backward propagation for every sample; the updated weights are returned at the end
def trainNetwork(X, Y, rate, weights):
    layers = len(weights)
    for i in range(len(X)):
        x, y = X[i], Y[i]
        x = np.matrix(np.append(1, x)) # Prepend the bias input to the feature vector

        output = feedForward(x, weights, layers)
        # Pass the learning rate explicitly instead of relying on a global variable
        weights = backPropagation(y, output, weights, layers, rate)

    return weights
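
Written out, the per-sample update performed by the two functions above reads as follows. The notation is an assumption made for this summary: $a_j$ is the activation of layer $j$ as a row vector, $\tilde{a}_{j-1}$ is the previous layer's output with the bias input 1 prepended, $\eta$ is `rate`, $\odot$ is the element-wise product, and $W_j^{\setminus 0}$ is $W_j$ with its bias column removed.

$$
\begin{aligned}
e_L &= y - a_L \\
\delta_j &= e_j \odot a_j \odot (1 - a_j) \\
W_j &\leftarrow W_j + \eta \, \delta_j^{\top} \tilde{a}_{j-1} \\
e_{j-1} &= \delta_j \, W_j^{\setminus 0}
\end{aligned}
$$

This corresponds to stochastic gradient descent on the squared error $\tfrac{1}{2}\lVert y - a_L \rVert^2$ with sigmoid activations. One detail differs from the textbook form: the code computes $e_{j-1}$ after $W_j$ has been updated, so the error is propagated through the new weights rather than the old ones.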

In [12]:
def Sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoidDerivative(x):
    # x is the sigmoid output (the activation), not the pre-activation input
    return np.multiply(x, 1-x)
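
The derivative is written in terms of the sigmoid's output rather than its input. A small numerical check of that identity, using illustrative helper names that are not part of the notebook:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative_from_output(a):
    # a is sigmoid(z), not z itself
    return a * (1.0 - a)

z = np.array([-2.0, 0.0, 3.0])
h = 1e-6

# Central finite difference as an independent estimate of the derivative.
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid_derivative_from_output(sigmoid(z))

print(np.allclose(numeric, analytic, atol=1e-8))
```

Passing the pre-activation value instead of the activation would give a wrong gradient, so callers need to hand in the stored layer outputs, as `backPropagation` does.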

In [13]:
def predict(item, weights):
    layers = len(weights)
    item = np.append(1, item)
    
    #forward propagation
    output = feedForward(item, weights, layers)
    
    outputFinal = output[-1].A1
    index = findMaxActivation(outputFinal)

    y = [0 for i in range(len(outputFinal))]
    y[index] = 1

    return y

In [14]:
def findMaxActivation(output):
    m, index = output[0], 0
    for i in range(1, len(output)):
        if(output[i] > m):
            m, index = output[i], i
    
    return index
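
The same lookup is available as `np.argmax`, which also returns the first position when several entries tie for the maximum. A short sketch with made-up activations:

```python
import numpy as np

# Made-up output activations with a tie between positions 1 and 3.
activations = np.array([0.10, 0.70, 0.05, 0.70])

index = int(np.argmax(activations))   # first position of the largest value

# Build the one-hot prediction vector from the winning index.
one_hot = np.zeros(len(activations), dtype=int)
one_hot[index] = 1

print(index, one_hot.tolist())
```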

In [15]:
def accuracy(X, Y, weights):
    correct_classification = 0

    for i in range(len(X)):
        x, y = X[i], list(Y[i])
        prediction = predict(x, weights)

        if(y == prediction):
            correct_classification += 1

    return correct_classification / len(X)

In [16]:
# Run it here
features = len(X[0]) # Number of features - using all of the features except for the sequence name
classes = len(Y[0]) # Number of classes

layers = [features, 5, 10, classes]
rate, iterations = 0.15, 500

weights = neuralNet(X_train, Y_train, X_val, Y_val, iterations=iterations, nodes=layers, rate=rate)

Iteration 20
Training Accuracy:0.63671875
Validation Accuracy:0.6206896551724138
Iteration 40
Training Accuracy:0.68359375
Validation Accuracy:0.7586206896551724
Iteration 60
Training Accuracy:0.7734375
Validation Accuracy:0.8275862068965517
Iteration 80
Training Accuracy:0.78515625
Validation Accuracy:0.7931034482758621
Iteration 100
Training Accuracy:0.78125
Validation Accuracy:0.7931034482758621
Iteration 120
Training Accuracy:0.78125
Validation Accuracy:0.7931034482758621
Iteration 140
Training Accuracy:0.78515625
Validation Accuracy:0.8275862068965517
Iteration 160
Training Accuracy:0.82421875
Validation Accuracy:0.8275862068965517
Iteration 180
Training Accuracy:0.84375
Validation Accuracy:0.8275862068965517
Iteration 200
Training Accuracy:0.84765625
Validation Accuracy:0.8620689655172413
Iteration 220
Training Accuracy:0.859375
Validation Accuracy:0.8620689655172413
Iteration 240
Training Accuracy:0.8671875
Validation Accuracy:0.8620689655172413
Iteration 260
Training Accuracy:0

In [17]:
#Final testing accuracy
print("Testing Accuracy: {}".format(accuracy(X_test, Y_test, weights)))

Testing Accuracy: 0.8235294117647058


## Check an example prediction

In [18]:
print(X_test[0], list(Y_test[0]))

[0.25 0.37 0.48 0.5  0.43 0.26 0.36] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]


In [19]:
print(predict(X_test[0], weights))

[1, 0, 0, 0, 0, 0, 0, 0]


In [None]:
# The predicted class matches the true class

#### Number of iterations:

- We trained the network for up to 500 iterations on this data set, more than the Iris data set needed, to reach a better accuracy. We stopped at 500 to avoid overfitting. However, further trials suggested that training could continue a little longer without overfitting.
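
Rather than fixing the iteration count by hand, the stopping point could be chosen from the validation accuracy. The sketch below shows one common rule, early stopping with a patience counter. It was not used in this notebook; `early_stopping_index` is a hypothetical helper and the scores are made up.

```python
def early_stopping_index(val_scores, patience=3):
    """Return the index of the check at which training would stop.

    Stops once `patience` consecutive checks fail to beat the best
    validation score seen so far; otherwise returns the last index.
    """
    best, since_best = float("-inf"), 0
    for i, score in enumerate(val_scores):
        if score > best:
            best, since_best = score, 0
        else:
            since_best += 1
            if since_best >= patience:
                return i
    return len(val_scores) - 1

# Made-up validation accuracies, one per check (not taken from the run above).
scores = [0.60, 0.70, 0.80, 0.79, 0.80, 0.78, 0.81]
print(early_stopping_index(scores, patience=3))
```

With a check every 20 iterations, as in `neuralNet`, a patience of 3 would mean stopping after 60 iterations without a new best validation accuracy. A larger patience tolerates plateaus like the ones visible in the training log.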