# Multiclass Logistic Regression (MLR)
We will implement multiclass logistic regression from scratch using the cross-entropy loss, with the sigmoid function modeling the class probabilities.<br>
Although we describe the formulation for binary classification, we code the multiclass case. The link between binary logistic regression and its multiclass counterpart is the one-vs-rest strategy (described in the list below).<br>
Settings for binary class logistic regression<br>
The training set is $\{x_i,y_i\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^{p\times 1}$, where $y_i = 1$ if data point $x_i$ belongs to class $1$ and $y_i = 0$ otherwise.<br>
1. Sigmoid function <br> 
$g(z) = \frac{1}{1+e^{-z}}$ <br>
2. $h_{\theta}(x_i) = g(\theta^Tx_i)$ is the predicted probability that the input $x_i$ is classified as "positive" (class $1$)
3. Cross-entropy cost (the negative of the average data log-likelihood) <br>
$
J(\theta) = \frac{1}{N} \sum_{i=1}^N \bigg[-y_i \log h_{\theta}(x_i) - (1-y_i) 
\log (1-h_{\theta}(x_i))\bigg]
$
4. Derivative of $J(\theta)$ with respect to $\theta_j$, where $x_{ij}$ is the $j$-th component of $x_i$<br>
$
\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{N}\sum_{i=1}^{N}(h_{\theta}(x_i)-y_i)x_{ij}
$
5. Weight update using gradient descent with learning rate $\alpha$<br>
$
\theta:=\theta-\alpha \frac{\partial J(\theta)}{\partial \theta}
$
6. We choose the one-vs-rest strategy for multiclass classification: one class at a time is treated as "positive" and all the others as "negative", and the process is repeated until every class has been covered.<br>
7. We use the iris data set from url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
which contains 3 species of iris plants.
8. We use all 4 features of each data point for each of the 3 classes.
9. During prediction in MLR, a data point is assigned to the class with the maximum predicted probability.
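
The derivative of $J(\theta)$ given in the list above can be sanity-checked numerically. The following is a minimal sketch, not part of the original notebook: it compares the analytic gradient with central finite differences of the cost on a small random problem. The helper names (`cost`, `analytic_gradient`, `numerical_gradient`) and the toy data are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, x, y):
    # J(theta): average cross-entropy over the N rows of x
    h = sigmoid(x @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(theta, x, y):
    # (1/N) * sum_i (h_theta(x_i) - y_i) * x_ij, written as a matrix product
    return x.T @ (sigmoid(x @ theta) - y) / x.shape[0]

def numerical_gradient(theta, x, y, eps=1e-6):
    # central finite differences, one coordinate of theta at a time
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, x, y) - cost(theta - step, x, y)) / (2 * eps)
    return grad

# hypothetical toy data: 20 points, a bias column plus 3 random features
rng = np.random.default_rng(0)
x_toy = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
y_toy = (rng.random(20) < 0.5).astype(float)
theta_toy = rng.normal(size=4) * 0.1

max_diff = np.max(np.abs(analytic_gradient(theta_toy, x_toy, y_toy)
                         - numerical_gradient(theta_toy, x_toy, y_toy)))
print("largest difference between the two gradients:", max_diff)
```

If the two gradients agree to within a small tolerance, the formula and its vectorized form $\frac{1}{N}X^T(h-y)$ are consistent with the cost.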

In [20]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

def fit(x,y,max_itr=3000,alpha=0.1):
    """Train one binary logistic regression per class (one-vs-rest) with gradient descent."""
    x = np.insert(x.astype(float),0,1,axis=1) # prepend a column of ones for the bias term
    thetas = [] # holds weight vectors for the classes, data type: list
    classes = np.unique(y)
    # costs is reused for every class, so after training it holds the cost curve of the last class only
    costs = np.zeros(max_itr)
    for c in classes: # loop over the classes
        theta = np.random.rand(x.shape[1]) # weight initialization
        binary_y = np.where(y==c,1,0) # one-vs-rest strategy for multiclass classification
        for epoch in range(max_itr):
            # store the cost of each iteration so that it can be plotted; it should show a decreasing trend
            costs[epoch] = cost_function(theta,x,binary_y)
            if epoch % 1000 == 0:
                print("The cost for class {} at iteration {}/{} is {}".format(c,epoch,max_itr,costs[epoch]))
            grad   = gradient(theta,x,binary_y)
            theta -= alpha * grad # weight update
        thetas.append(theta)
    return thetas,classes,costs
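def sigmoid(z):
    # logistic function g(z) = 1/(1+e^{-z})
    return 1.0/(1+np.exp(-z))
def net_input(theta,x):
    # linear score theta^T x for every row of x
    return np.dot(x,theta)
def probablity(theta,x):
    # predicted probability h_theta(x) of the "positive" class
    return sigmoid(net_input(theta,x))
def cost_function(theta,x,y):
    # average cross-entropy cost J(theta)
    m = x.shape[0]
    h = probablity(theta,x) # compute the probabilities once and reuse them
    total_cost = -(1.0/m) * np.sum(y * np.log(h)+(1-y)*np.log(1-h))
    return total_cost
def gradient(theta,x,y):
    # gradient of J(theta): (1/m) * x^T (h - y)
    m = x.shape[0]
    grad = (1.0)/m*np.dot(x.T,sigmoid(net_input(theta,x))-y)
    return grad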

# load the data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
df = pd.read_csv(url,header=None,names=[
    "Sepal length (cm)",
    "Sepal width (cm)", 
    "Petal length (cm)",
    "Petal width (cm)",
    "Species"
]) # 4 feature columns plus the species label
#df.head()

# get train and test data
data = np.array(df)
np.random.shuffle(data)
num_train = int(.8*len(data)) # 80/20 train/test data split
x_train, y_train = data[:num_train,:-1], data[:num_train,-1]
x_test,  y_test  = data[num_train:,:-1], data[num_train:,-1]

# train the model
thetas, classes, costs = fit(x_train,y_train)

# compute accuracy
x_test1 = np.insert(x_test.astype(float),0,1,axis=1)
numClasses = len(classes)
p          = np.zeros([x_test1.shape[0],numClasses])
for c in range(numClasses):
    theta1 = np.array(thetas[c])
    p[:,c] = probablity(theta1,x_test1)
    
pred_classes = np.argmax(p,axis=1) # choose the class with maximum probabilities for a data point
# map the predicted column indices back to class labels and compare with the true labels
accuracy = np.mean(classes[pred_classes]==y_test) * 100.0

The cost for class Iris-setosa at iteration 0/3000 is 6.10343505996
The cost for class Iris-setosa at iteration 1000/3000 is 0.00745789325876
The cost for class Iris-setosa at iteration 2000/3000 is 0.00404383509684
The cost for class Iris-versicolor at iteration 0/3000 is 4.087417799
The cost for class Iris-versicolor at iteration 1000/3000 is 0.503953829955
The cost for class Iris-versicolor at iteration 2000/3000 is 0.492574970468
The cost for class Iris-virginica at iteration 0/3000 is 4.63631108073
The cost for class Iris-virginica at iteration 1000/3000 is 0.104899054365
The cost for class Iris-virginica at iteration 2000/3000 is 0.0756085362569
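
The cost for Iris-setosa approaches zero in the log above, which means the sigmoid outputs for that class get close to 0 or 1. In that regime `np.log` can receive an exact zero and the cost stops being finite. The sketch below shows one possible way to avoid this; it is not part of the original notebook. It evaluates the same cross-entropy directly from the scores $z = \theta^T x$ with `np.logaddexp`, using $-\log g(z) = \log(1+e^{-z})$ and $-\log(1-g(z)) = \log(1+e^{z})$. The names `stable_cost` and `naive_cost` and the toy inputs are hypothetical.

```python
import numpy as np

def stable_cost(theta, x, y):
    # same cross-entropy as J(theta), computed from the raw scores z
    z = np.dot(x, theta)
    # np.logaddexp(0, -z) = log(1 + e^{-z}) = -log(sigmoid(z))
    # np.logaddexp(0,  z) = log(1 + e^{z})  = -log(1 - sigmoid(z))
    return np.mean(y * np.logaddexp(0, -z) + (1 - y) * np.logaddexp(0, z))

def naive_cost(theta, x, y):
    # direct translation of the formula: sigmoid first, then log
    h = 1.0 / (1.0 + np.exp(-np.dot(x, theta)))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# hypothetical toy problem with a bias column and one feature
x_toy = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y_toy = np.array([0.0, 0.0, 1.0, 1.0])

# moderate weights: both versions agree
small_theta = np.array([0.0, 1.0])
naive_small = naive_cost(small_theta, x_toy, y_toy)
stable_small = stable_cost(small_theta, x_toy, y_toy)
print(naive_small, stable_small)

# large weights: scores of magnitude 50 and 100 saturate the sigmoid
large_theta = np.array([0.0, 50.0])
with np.errstate(divide="ignore", invalid="ignore"):
    naive_large = naive_cost(large_theta, x_toy, y_toy)   # not finite
stable_large = stable_cost(large_theta, x_toy, y_toy)     # finite
print(naive_large, stable_large)
```

Whether this matters in practice depends on how far the weights grow; for the 3000 iterations logged above the direct formula evidently still produced finite values.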


In [21]:
print("The accuracy is {:.2f} %".format(accuracy))
# plt.plot(costs)
# plt.xlabel('# EPOCHS')
# plt.ylabel('COST')

The accuracy is 86.67 %
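
The prediction step behind the accuracy above (one probability column per class, then `argmax`) can be wrapped in a small helper. This is a sketch under stated assumptions rather than the notebook's own code: `predict_ovr` is a hypothetical name, and it assumes `thetas` and `classes` have the form returned by `fit` (a list of weight vectors with the bias first, and an array of class labels in the same order). The toy weights and labels are made up for illustration.

```python
import numpy as np

def predict_ovr(thetas, classes, x):
    # prepend the bias column, as fit does for the training data
    x = np.insert(np.asarray(x, dtype=float), 0, 1, axis=1)
    # one row per data point, one column of scores theta_c^T x per class
    scores = x @ np.column_stack(thetas)
    probs = 1.0 / (1.0 + np.exp(-scores))
    # pick the class whose one-vs-rest classifier gives the highest probability
    return np.asarray(classes)[np.argmax(probs, axis=1)]

# hypothetical weights for a 1-feature, 3-class toy problem (bias first)
toy_thetas = [np.array([2.0, -1.0]),   # "low": large score when the feature is small
              np.array([1.0, 0.0]),    # "mid": constant score
              np.array([-2.0, 1.0])]   # "high": large score when the feature is large
toy_classes = np.array(["low", "mid", "high"])
toy_x = np.array([[-3.0], [2.0], [6.0]])

toy_pred = predict_ovr(toy_thetas, toy_classes, toy_x)
print(toy_pred)
```

Because the helper returns labels rather than column indices, an accuracy can be computed by comparing its output with the true labels directly.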
