#### **Computational Intelligence in Manufacturing Systems**
#### **MFAIMFG**
###### Created by: Wynnezel Wayne Naoto P Akeboshi
##### Checked by: SME Academics Database team
##### Initial Publish: January 10, 2021
##### Assignment Code from the class of: Dr. Robert Kerwin Billones

#### **Machine Learning Exercise 3 - Multi-class Classification**

#### **INTRODUCTION** (Give a short background about the topic)

Multiclass Classification is a classification task with more than two classifications. [1] For example, instead of a binary category such as dogs and cats, we have more than 2 classifications like dogs, cats, and birds. But in this method, a primary assumption is made such that each item fits in one and only one classification. For example, should we classify based on ethnicities: Native American, Hispanic, Asian, individuals (observations) that fit in both Asian and Hispanic or Native American and Hispanic must be excluded. 

#### **CODE DESIGN**

##### **Code segment no. 1**
*Numpy* is a library primarily used for advanced mathematical operations specifically in linear algebra and matrix operations. [2]

*Pandas* is a dataset handling library that reads and writes multiple formats of data files such as .csv, .mat, .txt, and .json files. Pandas, built on numpy, is used primarily for handling, cleaning, manipulating, and presenting data in a tabular form. [3]

*matplotlib.pyplot* is a sub-library within matplotlib that is used primarily for plotting datasets in a MatLab-esque kind of way. This allows for simple data visualization during data analysis. [4]

*scipy.io* is the scipy input and output library that primary deals with reading different types of files to be read within a python program. [5]

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat
%matplotlib inline

data = loadmat('ex3data1.mat')
data

{'__header__': b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Oct 16 13:09:09 2011',
 '__version__': '1.0',
 '__globals__': [],
 'X': array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 'y': array([[10],
        [10],
        [10],
        ...,
        [ 9],
        [ 9],
        [ 9]], dtype=uint8)}

##### **Code segment no. 2** 
In this segment, we identify the shape (or sizes) of our two arrays. This segment is essential for matrix operations and data training.

In [2]:
data['X'].shape, data['y'].shape

((5000, 400), (5000, 1))

##### **Code segment no. 3**
Creates the sigmoid function taking a parameter z. Since the sigmoid function is the backbone of the learning algorithm used in this exercise and will be repeatedly used, it is more efficient to create a function for it.

In [3]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

##### **Code segment no. 4**
The cost function is a function that measures the performance of a Machine Learning model by quantifying the error between the predicted value and the expected value. 

In [4]:
def cost(theta, X, y, learningRate):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    reg = (learningRate / 2 * len(X)) * np.sum(np.power(theta[:,1:theta.shape[1]], 2))
    return np.sum(first - second) / (len(X)) + reg

##### **Code segment no. 5** 
This code defines the gradient function. The gradient function is a optimization function that measures how much the output of a function changes when changing inputs. The loop in this function is necessary to go through each gradient one at a time.

In [5]:
def gradient_with_loop(theta, X, y, learningRate):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    
    parameters = int(theta.ravel().shape[1])
    grad = np.zeros(parameters)
    
    error = sigmoid(X * theta.T) - y
    
    for i in range(parameters):
        term = np.multiply(error, X[:,i])
        
        if (i == 0):
            grad[i] = np.sum(term) / len(X)
        else:
            grad[i] = (np.sum(term) / len(X)) + ((learningRate / len(X)) * theta[:,i])
    
    return grad

##### **Code segment no. 6**
This defines the gradient function. In this code segment, the for loop is omitted. Instead, using linear algebra, matrix operations were used to remove the need of going through each item in the arrays one by one. 

In [6]:
def gradient(theta, X, y, learningRate):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    
    parameters = int(theta.ravel().shape[1])
    error = sigmoid(X * theta.T) - y
    
    grad = ((X.T * error) / len(X)).T + ((learningRate / len(X)) * theta)
    
    # intercept gradient is not regularized
    grad[0, 0] = np.sum(np.multiply(error, X[:,0])) / len(X)
    
    return np.array(grad).ravel()

##### **Code segment no. 7**
Utilizing the scipy sub-library optimize, the entirity of cost minimization and data processing was combined into a single function. Instead of the previous methods of necessitating the use of a cost function and the gradient function, in one function, the entirety of the machine learning model was handled and combined in this function, ultimately returning the final theta needed for the model.

In [7]:
from scipy.optimize import minimize

def one_vs_all(X, y, num_labels, learning_rate):
    rows = X.shape[0]
    params = X.shape[1]
    
    # k X (n + 1) array for the parameters of each of the k classifiers
    all_theta = np.zeros((num_labels, params + 1))
    
    # insert a column of ones at the beginning for the intercept term
    X = np.insert(X, 0, values=np.ones(rows), axis=1)
    
    # labels are 1-indexed instead of 0-indexed
    for i in range(1, num_labels + 1):
        theta = np.zeros(params + 1)
        y_i = np.array([1 if label == i else 0 for label in y])
        y_i = np.reshape(y_i, (rows, 1))
        
        # minimize the objective function
        fmin = minimize(fun=cost, x0=theta, args=(X, y_i, learning_rate), method='TNC', jac=gradient)
        all_theta[i-1,:] = fmin.x
    
    return all_theta

##### **Code segment no. 8** 
This code segment is mostly preparation. The rows, parameters, all_theta, and theta are prepared here to be inputted in the functions later. These steps are taken to utilize variables X, y, theta, and all_theta. After preparation of variables, the shapes of these data are identify to ensure that the matrix operations later on are allowed. 

Note that all the previous code segments only *defined* functions. There has been no data manipulation yet with the actual data.

In [8]:
rows = data['X'].shape[0]
params = data['X'].shape[1]

all_theta = np.zeros((10, params + 1))

X = np.insert(data['X'], 0, values=np.ones(rows), axis=1)

theta = np.zeros(params + 1)

y_0 = np.array([1 if label == 0 else 0 for label in data['y']])
y_0 = np.reshape(y_0, (rows, 1))

X.shape, y_0.shape, theta.shape, all_theta.shape

((5000, 401), (5000, 1), (401,), (10, 401))

##### **Code segment no. 9** 
np.unique returns the *unique* values within an array. In this case, the unique values that can be found in data['y'], which is the output arrray, are all integers from 1 to 10.

In [9]:
print(np.unique(data['y']))
print('No. of Labels: ', len(np.unique(data['y'])))

[ 1  2  3  4  5  6  7  8  9 10]
No. of Labels:  10


##### **Code segment no. 10**
All the theta (coefficients/weights that fit the model), are generated here using the one_vs_all function previously defined in segment 7. We enter our X input data, y output data, the number of labels (10 as found in code segment 9), and a learning rate (which is completely up to you).  

In [10]:
all_theta = one_vs_all(data['X'], data['y'], 10, 1)
all_theta

array([[-3.70247933e-05,  0.00000000e+00,  0.00000000e+00, ...,
        -2.24803591e-10,  2.31962901e-11,  0.00000000e+00],
       [-8.96250738e-05,  0.00000000e+00,  0.00000000e+00, ...,
         7.26120841e-09, -6.19965312e-10,  0.00000000e+00],
       [-8.39553367e-05,  0.00000000e+00,  0.00000000e+00, ...,
        -7.61695604e-10,  4.64917644e-11,  0.00000000e+00],
       ...,
       [-7.00832571e-05,  0.00000000e+00,  0.00000000e+00, ...,
        -6.92009196e-10,  4.29241582e-11,  0.00000000e+00],
       [-7.65187952e-05,  0.00000000e+00,  0.00000000e+00, ...,
        -8.09503301e-10,  5.31058734e-11,  0.00000000e+00],
       [-6.63412433e-05,  0.00000000e+00,  0.00000000e+00, ...,
        -3.49765885e-09,  1.13668536e-10,  0.00000000e+00]])

##### **Code segment no. 11**
We define the prediction function in this code segment. We compute the probability of each entry in X by multiplying it to all_theta, to show the probability that an entry is within a certain class. For example, entry 1 has a probability of 0.4 to be classification A, 0.6 to be classification B, and 0.55 to be classification C. Using the np.argmax, this returns the indeces (the classification) with the highest probability. As seen in the example, this would return index 1 (0-based index) for entry 1. Thus classifying entry 1 to be in classification B.

This function eventually returns an array containing the index classifications of each entry, completing its predictions.

In [11]:
def predict_all(X, all_theta):
    rows = X.shape[0]
    params = X.shape[1]
    num_labels = all_theta.shape[0]
    
    # same as before, insert ones to match the shape
    X = np.insert(X, 0, values=np.ones(rows), axis=1)
    
    # convert to matrices
    X = np.matrix(X)
    all_theta = np.matrix(all_theta)
    
    # compute the class probability for each class on each training instance
    h = sigmoid(X * all_theta.T)
    
    # create array of the index with the maximum probability
    h_argmax = np.argmax(h, axis=1)
    
    # because our array was zero-indexed we need to add one for the true label prediction
    h_argmax = h_argmax + 1
    
    return h_argmax

##### **Code segment no. 12**
Using the predict_all function from segment 11, it takes all the input values within data['X'] and the optimized all_theta (weights or coefficients of the model) to predict classifications of entries in data['X'] and be saved to array y_pred.

The correct predictions are then counted and the accuracy is computed using this score.


In [12]:
y_pred = predict_all(data['X'], all_theta)
correct = [1 if a == b else 0 for (a, b) in zip(y_pred, data['y'])]
accuracy = (sum(map(int, correct)) / float(len(correct)))
print('accuracy = {0}%'.format(accuracy * 100))

accuracy = 74.6%


#### **ANALYSIS OF RESULTS**
In this exercise, we displayed different methods of coding the training models. In code segments 5, 6, and 7, different ways of computing for the thetas were shown from most brute force to most efficient, utilizing useful libraries in the process. We can see the efficiency in the coding process by using code segment 7, function one_vs_all, despite it simply being the exact same process as code segments 5 and 6. All of which, though not displayed in the code, would have resulted in the same model.

#### **CONCLUSION**
In conclusion, utilizing more libraries and coding in a more efficient manner is possible. Nearly every modelling process may have a library with a single function doing the entire process. But it is essential for us to understand the backbone of these libraries as to see the nuances wherein we should or should not use such automated processes. Ultimately, the multi-class classification model resulted in a 74.6% accuracy, which would have been the same regardless of the use of additional 'shortcut' libraries. 

##### **References**
[1] https://towardsdatascience.com/machine-learning-multiclass-classification-with-imbalanced-data-set-29f6a177c1a </br>
[2] https://www.geeksforgeeks.org/python-numpy/ </br>
[3] https://www.learnpython.org/en/Pandas_Basics </br>
[4] https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.html </br>
[5] https://docs.scipy.org/doc/scipy/reference/io.html