<h3 style="text-align: center;"><b>Implementing Softmax Regression (Multinomial Logistic Regression)</b></h3>
<h5 style="text-align: center;">This notebook follows this Stanford tutorial: <a href="http://deeplearning.stanford.edu/tutorial/supervised/SoftmaxRegression/" target="_blank">http://deeplearning.stanford.edu/tutorial/supervised/SoftmaxRegression/</a><br>and this tutorial by Nikhil Kumar: <a href="https://www.geeksforgeeks.org/softmax-regression-using-tensorflow/" target="_blank">https://www.geeksforgeeks.org/softmax-regression-using-tensorflow/</a></h5>
<h5 style="text-align: center;">As with the last notebook, I recommend just following the two tutorials above. This notebook doesn't add much to those wonderful tutorials.</h5>
$$ \text{Softmax regression (or multinomial logistic regression) is a generalization of logistic regression to the case where we want to handle multiple classes.} \\ \text{In logistic regression we assumed that the labels were binary: } y_i \in \{0, 1\} $$
$$ \text{Softmax regression allows us to handle: } y_i \in \{1, 2, \ldots, K\} \text{ where K is the number of classes.} $$
$$ \text{Given a test input x, we want our hypothesis to estimate the probability } P(y=k|x) \\ \text{ for each value of } k = 1,\ldots,K \text{. I.e., we want to estimate the probability of the class label taking on each of the K different possible values. Thus, our hypothesis will output a } \\ \text{K-dimensional vector (whose elements sum to 1) giving us our K estimated probabilities.} $$
$$ \text{Let } X = \begin{bmatrix} 1 & x_{1,1} & \ldots & x_{1,m} \\ 1 & x_{2,1} & \ldots & x_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \ldots & x_{n,m} \end{bmatrix} $$
$$ \text{where the dataset has ‘m’ features and ‘n’ observations. Also, there are ‘K’ class labels, i.e., every observation can be classified} \\ \text{as one of the ‘K’ possible target values. For example, if we have a dataset of 100 handwritten digit images, each of size 28×28, for} \\ \text{digit classification, we have n = 100, m = 28×28 = 784 and K = 10.} $$
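$$ \text{Then our prediction function is: } h_\theta(x_i) = \begin{bmatrix} P(y_i=1|x_i;\theta) \\ P(y_i=2|x_i;\theta) \\ \vdots \\ P(y_i=K|x_i;\theta) \end{bmatrix} = \frac{1}{\sum_{j=1}^{K}e^{\theta_j^Tx_i}} \begin{bmatrix} e^{\theta_1^Tx_i} \\ e^{\theta_2^Tx_i} \\ \vdots \\ e^{\theta_K^Tx_i} \end{bmatrix} \text{, i.e. } P(y_i=k|x_i;\theta) = \frac{e^{\theta_k^Tx_i}}{\sum_{j=1}^{K}e^{\theta_j^Tx_i}} $$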
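<h5 style="text-align: center;">As a rough illustration of the prediction function, here is a minimal NumPy sketch. The helper name softmax_rows, the toy sizes, and the values in features and theta are assumptions made for this example and do not come from either tutorial. Subtracting each row's largest score before exponentiating leaves the probabilities unchanged and keeps np.exp from overflowing.</h5>

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax; row i of `scores` holds theta_k^T x_i for every class k."""
    shifted = scores - np.max(scores, axis=1, keepdims=True)  # same probabilities, no overflow
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

# Toy sizes (assumed for illustration): n = 3 observations, m = 2 features, K = 3 classes.
features = np.array([[0.5, 1.0],
                     [2.0, -1.0],
                     [0.0, 0.0]])
X = np.hstack([np.ones((features.shape[0], 1)), features])  # leading column of ones, as in X

# Hypothetical parameters, shape (m + 1, K); column k plays the role of theta_k.
theta = np.array([[0.1, -0.2, 0.0],
                  [0.4, 0.3, -0.5],
                  [-0.3, 0.2, 0.1]])

probs = softmax_rows(X @ theta)  # shape (n, K); row i is the hypothesis for x_i
print(probs)
print(probs.sum(axis=1))         # every row sums to 1 up to rounding
```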
<h5 style="text-align: center;">We now describe the cost function that we’ll use for softmax regression. In the equation below, 1{⋅} is the “indicator function,” so that 1{a true statement}=1, and 1{a false statement}=0. For example, 1{2+2=4} evaluates to 1, whereas 1{1+1=5} evaluates to 0. Our cost function will be:</h5>
$$ J(\theta) = -\left[\sum_{i=1}^{n}\sum_{k=1}^{K}1\{y_i=k\}\log\left(\frac{e^{\theta_k^Tx_i}}{\sum_{j=1}^{K}e^{\theta_j^Tx_i}}\right)\right] \text{ where } \log \text{ is the natural logarithm.} $$
<h5 style="text-align: center;">Notice that this generalizes the logistic regression cost function, which could also have been written:</h5>
$$ J(\theta) = -\left[\sum_{i=1}^{n}(1-y_i)\log(1 - h_\theta(x_i)) + y_i\log(h_\theta(x_i))\right] $$
$$ = -\left[\sum_{i=1}^{n}\sum_{k=0}^{1}1\{y_i=k\}\log(P(y_i=k|x_i;\theta))\right] $$
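<h5 style="text-align: center;">The claim that the softmax cost generalizes the logistic one can be checked numerically. The sketch below assumes the natural logarithm, labels coded 0 and 1, and made-up scores z standing in for θ<sup>T</sup>x<sub>i</sub>. The function names are illustrative only. With two classes whose scores are 0 and z, the softmax probability of class 1 is the sigmoid of z, so both costs should agree.</h5>

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

def softmax_cost(probs, y):
    """J = -sum_i sum_k 1{y_i = k} log P(y_i = k | x_i); labels in y are 0..K-1."""
    n = len(y)
    # Selecting column y_i of row i does the job of the indicator 1{y_i = k}.
    return -np.sum(np.log(probs[np.arange(n), y]))

def logistic_cost(h, y):
    """Binary cross-entropy: -sum_i (1 - y_i) log(1 - h_i) + y_i log(h_i)."""
    return -np.sum((1 - y) * np.log(1 - h) + y * np.log(h))

z = np.array([-1.5, 0.3, 2.0, -0.2])  # hypothetical scores theta^T x_i
y = np.array([0, 1, 1, 0])            # hypothetical binary labels

h = 1.0 / (1.0 + np.exp(-z))                               # logistic hypothesis
two_class_scores = np.column_stack([np.zeros_like(z), z])  # class 0 scores 0, class 1 scores z
probs = softmax_rows(two_class_scores)

print(softmax_cost(probs, y), logistic_cost(h, y))  # the two costs should match
```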
<h5 style="text-align: center;">We cannot solve for the minimum of J(θ) analytically, and thus as usual we’ll resort to an iterative optimization algorithm. Taking derivatives, one can show that the gradient is:</h5>
$$ \nabla_{\theta_k}J(\theta) = -\sum_{i=1}^{n}\left[x_i\left(1\{y_i=k\} - P(y_i=k|x_i;\theta)\right)\right] \text{ or: } $$
$$ \nabla_{\theta_k}J(\theta) = -\sum_{i=1}^{n}\left[x_i\left(1\{y_i=k\} - \frac{e^{\theta_k^Tx_i}}{\sum_{j=1}^{K}e^{\theta_j^Tx_i}}\right)\right] $$
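<h5 style="text-align: center;">One way to gain confidence in the gradient formula is to compare it against central finite differences of the cost, then take a few plain gradient descent steps. This is only a sketch under stated assumptions: random toy data, the natural logarithm, labels coded 0..K-1, θ stored as a matrix with one column per class, and an arbitrary learning rate. None of the names or values below come from the tutorials.</h5>

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

def cost(theta, X, Y):
    """Cross-entropy cost with the natural log; Y is one-hot with shape (n, K)."""
    return -np.sum(Y * np.log(softmax_rows(X @ theta)))

def gradient(theta, X, Y):
    """Column k equals -sum_i x_i (1{y_i = k} - P(y_i = k | x_i; theta))."""
    return -X.T @ (Y - softmax_rows(X @ theta))

# Toy problem (assumed): n = 6 observations, m = 3 features, K = 3 classes.
rng = np.random.default_rng(0)
n, m, K = 6, 3, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])  # leading column of ones
labels = np.array([0, 1, 2, 0, 1, 2])
Y = np.eye(K)[labels]                                      # one-hot rows
theta = rng.normal(scale=0.1, size=(m + 1, K))

# Central finite differences, one parameter at a time.
eps = 1e-6
numeric = np.zeros_like(theta)
for idx in np.ndindex(*theta.shape):
    step = np.zeros_like(theta)
    step[idx] = eps
    numeric[idx] = (cost(theta + step, X, Y) - cost(theta - step, X, Y)) / (2 * eps)

max_gap = np.max(np.abs(numeric - gradient(theta, X, Y)))
print("largest difference between numeric and analytic gradient:", max_gap)

# A few gradient descent steps with a small, arbitrary learning rate.
lr = 0.01
costs = [cost(theta, X, Y)]
for _ in range(50):
    theta = theta - lr * gradient(theta, X, Y)
    costs.append(cost(theta, X, Y))
print("cost before and after:", costs[0], costs[-1])
```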

In [78]:
import csv 
import numpy as np 
import matplotlib.pyplot as plt 

def loadCSV(filename): 
    ''' 
    function to load dataset 
    '''
    with open(filename,"r") as csvfile: 
        lines = csv.reader(csvfile) 
        dataset = list(lines) 
        for i in range(len(dataset)): 
            dataset[i] = [float(x) for x in dataset[i]]      
    return np.array(dataset) 

def normalize(X): 
    ''' 
    function to normalize feature matrix, X 
    '''
    mins = np.min(X, axis = 0) 
    maxs = np.max(X, axis = 0) 
    rng = maxs - mins 
    norm_X = 1 - ((maxs - X)/rng) 
    return norm_X 

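def P(theta, X):
    '''
    function to compute softmax probabilities
    theta has shape (features, K) and X has shape (n, features);
    row i, column k of the result is P(y_i = k | x_i; theta)
    '''
    scores = np.matmul(X, theta)
    scores = scores - np.max(scores, axis=1, keepdims=True)  # avoids overflow in np.exp
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

def OneHotEncode(Y):
    '''
    function to turn integer labels 0..K-1 into one-hot rows
    returns the one-hot matrix and the number of classes K
    '''
    Y = np.asarray(Y, dtype=int)
    K = int(np.max(Y)) + 1
    return np.eye(K)[Y], K

def J(theta, X, Y, K):
    '''
    function to compute the cost; Y is one-hot with shape (n, K)
    '''
    # multiplying by the one-hot Y plays the role of the indicator 1{y_i = k}
    return -np.sum(Y * np.log(P(theta, X)))

def del_J(theta, X, Y, K):
    '''
    function to compute the gradient of J, shape (features, K)
    '''
    return -np.matmul(X.T, Y - P(theta, X))

def grad(X, Y, K, lr, epochs):
    # one column of parameters per class
    theta = np.random.random(size=(X.shape[1], K))
    for _ in range(epochs):
        theta -= lr*del_J(theta, X, Y, K)
    return theta

def fit(X, Y, lr=0.01, epochs=25):
    #Note: lr and epochs are kept small for learning purposes.
    #X is used as given; prepend a column of ones to X if an intercept is wanted.
    y, K = OneHotEncode(Y)
    thetas = grad(X, y, K, lr, epochs)
    return thetas

def predict(X, thetas):
    # the predicted label is the class with the largest probability
    return np.argmax(P(thetas, X), axis=1)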

In [96]:
import warnings
warnings.filterwarnings("ignore")  # silence library warnings in the notebook output

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
xtrain, xtest, ytrain, ytest = train_test_split(
        X, y, test_size = 0.25, random_state = 0)
clf = LogisticRegression(random_state=0).fit(xtrain, ytrain)
y_pred = clf.predict(xtest)
# class probabilities for the first two training rows (computed but not printed)
probs = clf.predict_proba(xtrain[:2, :])
print(clf.score(xtest, ytest))

cm = confusion_matrix(ytest, y_pred)

print("Confusion Matrix : \n", cm)
print("Accuracy : ", accuracy_score(ytest, y_pred))

(150, 4) (150,)
0.9736842105263158
Confusion Matrix : 
 [[13  0  0]
 [ 0 15  1]
 [ 0  0  9]]
Accuracy :  0.9736842105263158
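
<h5 style="text-align: center;">The accuracy can be read straight off the confusion matrix: the diagonal counts correct predictions and the sum of all entries is the size of the test set. The sketch below copies the matrix from the output above and assumes scikit-learn's convention of rows for true classes and columns for predicted classes. The variable names are illustrative.</h5>

```python
import numpy as np

# Confusion matrix copied from the printed output (rows: true class, columns: predicted class).
cm = np.array([[13, 0, 0],
               [0, 15, 1],
               [0, 0, 9]])

total = cm.sum()                                    # number of test samples
accuracy = np.trace(cm) / total                     # correct predictions / all predictions
recall_per_class = np.diag(cm) / cm.sum(axis=1)     # share of each true class that was recovered
precision_per_class = np.diag(cm) / cm.sum(axis=0)  # share of each predicted class that was right

print(total, accuracy)
print(recall_per_class)
print(precision_per_class)
```

<h5 style="text-align: center;">The single off-diagonal entry is one sample of the second class predicted as the third, which is the only error in the 38 test samples.</h5>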
