# Multiclass Logistic Regression 
Earlier, we implemented binary classification. Now we will implement multiclass logistic regression, where the output can take more than two categories. We will define the model, the loss function, and the cost function (named `lost` in the code), and then run gradient descent. 

## Model 
Now we have more than two classes. Suppose there are C classes, and each class is assigned a number. The number carries no meaning of its own; it is only a label for the class. You can think of the labels as 1,...,C, but the code below uses 0,...,C-1 so that a label can index directly into an array. Each class has its own parameters: a weight vector $w_i$ of size n (the number of features) and a bias $b_i$. We therefore need a matrix W of shape (C, n), with one row per class, and a vector B of size C, with one bias per class. For a feature vector x we compute a score $z_i = w_i \cdot x + b_i$ for each class, and the softmax function turns the scores into the probability of each class given x: $P(y=i \mid x) = a_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$. As usual, we start by importing numpy. 

In [1]:
# import numpy 
import numpy as np 

In [1]:
# we will define a function that returns a vector of Zi 
def linear_regression_for_multiclass (x,W,B,C) : 
    # x is one training example, a vector of n features 
    # W is the matrix of weights, shape (C, n), one row per class 
    # B is the vector of biases, shape (C,) 
    # C is the number of classes 
    # we return Z, the vector of scores, one per class 
    Z = np.zeros(C) # each Z[i] = W[i] . x + B[i] 
    for i in range (C): 
        Z[i] = np.dot(W[i],x)
        Z[i] += B[i] 
    return Z 

In [2]:
# we will define a function that is exp(Zi) 
def exponsential_multiclass (x,W,B,C) : 
    Z = linear_regression_for_multiclass(x,W,B,C) 
    Z = np.exp(Z) # compute the exp(Zi)
    return Z # return the function 

In [4]:
# we will define a function to compute ai 
def compute_activation_a (x,W,B,C) :  
    Z = exponsential_multiclass (x,W,B,C) 
    return Z/Z.sum()
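
The three functions above can also be written as one vectorized call. The sketch below is not part of the original notebook: the name `softmax_stable` is hypothetical, and it assumes `W` has shape `(C, n)` and `B` has shape `(C,)`. It also subtracts the largest score before exponentiating. That leaves the probabilities unchanged mathematically, but it avoids overflow in `np.exp` when the scores are large.

```python
import numpy as np

def softmax_stable(x, W, B):
    """Hypothetical helper: softmax probabilities for one example x.

    Assumes W has shape (C, n), B has shape (C,) and x has shape (n,).
    """
    z = W @ x + B            # z[i] = W[i] . x + B[i] for every class at once
    z = z - np.max(z)        # shift so the largest score is 0; softmax is unchanged
    e = np.exp(z)
    return e / e.sum()

# Toy values, made up for illustration
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
B = np.zeros(3)
x = np.array([1000.0, 1000.0])   # scores this large would overflow np.exp without the shift

a = softmax_stable(x, W, B)
print(a, a.sum())
```

For small scores the result should match `compute_activation_a`; the shift only matters once `np.exp` starts to overflow.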
    

## Cost function 
Now we define the loss function and then the cost function (named `lost` in the code). For a single training example whose true class is y = i, the loss is $-\log(a_i)$, where a is the vector of activations for that example. To get the cost, we sum the losses of all training examples and divide by the number of examples m. 

In [9]:
def loss (x,W,B,C,y) :
    # y is the output (one of classes)
    # compute the a activation vector 
    a = compute_activation_a (x,W,B,C) 
    # y is the class that we are referring to 
    return -np.log(a[y])

In [10]:
def lost (X,W,B,C,Y) : 
    # Y is the vector of outputs containing values from 0..C-1 
    # W is the matrix of parameters, shape (C,n) 
    # B is the vector of parameters, shape (C,) 
    # C is the number of classes 
    # X is the training set, shape (m,n) 
    m = X.shape[0] # number of training examples 
    thesum = 0.0 
    for i in range (m) : 
        iloss = loss (X[i],W,B,C,Y[i])
        thesum += iloss 
    return thesum/m 
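
To make the loss concrete, here is a small worked example with made-up probabilities. It is only a sketch: `mean_cross_entropy` is a hypothetical helper, not a function from this notebook, and it takes the activation vectors directly instead of computing them from `W` and `B`. It shows that the loss is small when the true class gets a high probability and large when it gets a low one.

```python
import numpy as np

def mean_cross_entropy(A, Y):
    """Hypothetical helper: average of -log(A[i, Y[i]]) over the m rows of A.

    A: shape (m, C), each row a probability vector (for example softmax outputs).
    Y: shape (m,), integer labels in 0..C-1.
    """
    m = A.shape[0]
    picked = A[np.arange(m), Y]   # probability given to the true class of each example
    return -np.log(picked).mean()

# Made-up probabilities for two examples and three classes
A = np.array([[0.7, 0.2, 0.1],
              [0.7, 0.2, 0.1]])
Y = np.array([0, 2])              # first example is class 0, second is class 2

print(-np.log(A[0, 0]))           # loss when the true class gets probability 0.7
print(-np.log(A[1, 2]))           # loss when the true class gets probability 0.1
print(mean_cross_entropy(A, Y))   # average of the two losses
```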
        

## Gradient Descent 
Now we will run gradient descent, but first we need the gradient itself. I had to google how to compute the partial derivatives of the loss with respect to W and B. For the weights of class i: 
$\frac{\partial L}{\partial W_i} = (a_i - 1\{i=y\})\, x$ and for the bias of class i: 
$\frac{\partial L}{\partial B_i} = a_i - 1\{i=y\}$, where $1\{i=y\}$ is 1 if i is the true class y and 0 otherwise. 
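
These two formulas can be checked with a short derivation. It is a sketch in the notation used so far, writing $z_i = W_i \cdot x + B_i$ for the score of class i and $a_i = e^{z_i} / \sum_j e^{z_j}$ for its softmax probability. First rewrite the loss for true class y:

$$
L = -\log a_y = -z_y + \log \sum_{j} e^{z_j}
$$

Differentiating with respect to one score $z_i$ gives

$$
\frac{\partial L}{\partial z_i} = -1\{i=y\} + \frac{e^{z_i}}{\sum_{j} e^{z_j}} = a_i - 1\{i=y\}
$$

Since $\frac{\partial z_i}{\partial W_i} = x$ and $\frac{\partial z_i}{\partial B_i} = 1$, the chain rule gives

$$
\frac{\partial L}{\partial W_i} = (a_i - 1\{i=y\})\, x, \qquad \frac{\partial L}{\partial B_i} = a_i - 1\{i=y\}
$$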


In [11]:
# compute the gradient 
def compute_gradient (X,Y,W,B,C): 
    """
    X: A set of training examples, shape (m,n) 
    Y: A set of outputs containing a value 0...C-1 
    W: initial weights , shape (C,n) 
    B: initial biases , shape (C,).
    """
    n = X.shape[1] # number of features 
    m = X.shape[0] # number of training examples 
    dW = np.zeros((C,n))
    dB = np.zeros(C)
    for i in range (m) : 
        x = X[i]
        y = Y[i]
        # compute softmax probabilities 
        a = compute_activation_a(x,W,B,C) 
        # compute the gradients for each class 
        for c in range (C) : 
            dW[c] += (a[c] - (c==y)) * x 
            dB[c] += (a[c] - (c==y))
    # average the gradients 
    dW /= m 
    dB /= m 
    return dW,dB 
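
A common way to gain confidence in a hand-derived gradient is to compare it with a numerical one. The sketch below does that with central finite differences. To stay self-contained it uses its own compact, vectorized helpers (`softmax_rows`, `cost_vec` and `grad_vec` are hypothetical names, not functions from this notebook) and random made-up data. It assumes the same shapes as above: `X` is `(m, n)`, `W` is `(C, n)`, `B` is `(C,)`.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax of Z, shape (m, C)."""
    Z = Z - Z.max(axis=1, keepdims=True)   # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cost_vec(X, Y, W, B):
    """Average cross-entropy over the m examples."""
    A = softmax_rows(X @ W.T + B)
    return -np.log(A[np.arange(X.shape[0]), Y]).mean()

def grad_vec(X, Y, W, B):
    """Analytic gradients (a_i - 1{i=y}) * x and (a_i - 1{i=y}), averaged over examples."""
    m = X.shape[0]
    D = softmax_rows(X @ W.T + B)
    D[np.arange(m), Y] -= 1.0              # subtract the indicator 1{i=y}
    return D.T @ X / m, D.mean(axis=0)

# Random made-up problem: 5 examples, 3 features, 4 classes
rng = np.random.default_rng(0)
m, n, C = 5, 3, 4
X = rng.normal(size=(m, n))
Y = rng.integers(0, C, size=m)
W = rng.normal(size=(C, n))
B = rng.normal(size=C)

dW, dB = grad_vec(X, Y, W, B)

# Central finite differences on every entry of W
eps = 1e-6
dW_num = np.zeros_like(W)
for c in range(C):
    for j in range(n):
        Wp, Wm = W.copy(), W.copy()
        Wp[c, j] += eps
        Wm[c, j] -= eps
        dW_num[c, j] = (cost_vec(X, Y, Wp, B) - cost_vec(X, Y, Wm, B)) / (2 * eps)

max_diff = np.abs(dW - dW_num).max()
print(max_diff)   # should be very small if the formula is right
```

The same loop over the entries of `B` would check the bias gradient.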

In [12]:
# define a function that applies gradient descent 
def Gradient_Descent (X,Y,W,B,C,alpha): 
    """
    X : is the set of training examples, shape (m,n) 
    Y : is the set of labels / outputs, values 0...C-1 
    W : is the matrix of weights, shape (C,n) 
    B : is the vector of biases, shape (C,) 
    C : is the number of classes 
    alpha : is the learning rate 
    """
    current_cost = lost(X,W,B,C,Y) # compute the cost with initial parameters 
    for i in range (1000) : 
        dW,dB = compute_gradient(X,Y,W,B,C) # compute the gradient 
        W = W - alpha * dW # update W
        B = B - alpha * dB # update B 
        new_cost = lost(X,W,B,C,Y) # compute the current cost 
        if new_cost > current_cost : 
            # stop as soon as the cost goes up; W and B keep the values from this last step 
            break 
        else : 
            current_cost = new_cost 
    return W,B # return the values of W and B 
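
To see gradient descent at work, here is a small end-to-end sketch on made-up data. It does not call the functions above. So that it can run on its own, it uses a compact vectorized version of the same update rule, with the hypothetical helpers `softmax_rows` and `cost_vec`. The three clusters, the learning rate of 0.1 and the 1000 iterations are all assumptions chosen for illustration.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax of Z, shape (m, C)."""
    Z = Z - Z.max(axis=1, keepdims=True)   # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cost_vec(X, Y, W, B):
    """Average cross-entropy over the m examples."""
    A = softmax_rows(X @ W.T + B)
    return -np.log(A[np.arange(X.shape[0]), Y]).mean()

# Made-up data: three 2-D clusters of 30 points each, labelled 0, 1, 2
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
Y = np.repeat(np.arange(3), 30)
X = centers[Y] + 0.5 * rng.normal(size=(90, 2))

C, n = 3, X.shape[1]
W = np.zeros((C, n))   # start from all-zero weights
B = np.zeros(C)        # and all-zero biases
alpha = 0.1            # learning rate (assumed value)

initial_cost = cost_vec(X, Y, W, B)
for _ in range(1000):
    D = softmax_rows(X @ W.T + B)
    D[np.arange(X.shape[0]), Y] -= 1.0          # a_i - 1{i=y} for every example
    W = W - alpha * (D.T @ X) / X.shape[0]      # averaged gradient step on W
    B = B - alpha * D.mean(axis=0)              # averaged gradient step on B
final_cost = cost_vec(X, Y, W, B)

# Predict the class with the highest score for each training example
predictions = np.argmax(X @ W.T + B, axis=1)
accuracy = (predictions == Y).mean()
print(initial_cost, final_cost, accuracy)
```

With all-zero parameters every class gets probability 1/3, so the initial cost is log(3); the cost after training should be lower than that.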
