## Probabilistic Discriminative Models
### Multinomial Distribution
Before getting into softmax regression, we need to define the multinomial distribution. A discrete random variable follows a multinomial distribution when it takes one of $K$ levels instead of the 2 levels of the binomial case. Such a variable is usually represented with one-hot encoding: a vector $\underline{x}$ of length $K$, with one index per level, that has a 1 at the index of the observed level and zeros everywhere else. The probability of a single observation is therefore $p(\underline{x}|\underline{\mu})=\prod_{k=1}^{K}\mu_k^{x_k}$. For this to be a valid distribution, the normalization axiom must be satisfied, hence $\sum_{k=1}^{K}\mu_k=1$. For a dataset $D$ made up of $N$ independent observations, the likelihood is:
$$
\begin{align*}
\begin{split}
&p(D|\mu)=\prod_{n=1}^{N}\prod_{k=1}^K\mu_k^{x_{nk}}=\prod_{k=1}^{K}\prod_{n=1}^{N}\mu_k^{x_{nk}};\ by \ using\ x^ax^b=x^{a+b}\\
&p(D|\mu)=\prod_{k=1}^{K}\mu_k^{\sum_{n=1}^{N}x_{nk}};\ let\ \sum_{n=1}^{N}x_{nk}=m_k\\
&p(D|\mu)=\prod_{k=1}^{K}\mu_k^{m_k}\\
\end{split}
\end{align*}
$$
To find the best values of the $\mu_k$, we maximize the log-likelihood subject to the constraint $||\mu||_1=1$. This can be done with a Lagrange multiplier $\lambda$, as follows:
$$
\begin{align*}
\begin{split}
&\nabla_{\mu_k}L(\mu, \lambda)=\nabla_{\mu_k}\Big( \sum_{j=1}^{K}m_jlog(\mu_j)+\lambda(\sum_{i}\mu_i -1) \Big)=\frac{m_k}{\mu_k}+\lambda=0\rightarrow\mu_k=\frac{-m_k}{\lambda}\\
&\sum_{k}\mu_k=\sum_{k}\frac{-m_k}{\lambda}=\sum_{k}\frac{-\sum_{n}x_{nk}}{\lambda}=\frac{-N}{\lambda}=1\rightarrow \lambda=-N\\
&\mu_k=\frac{\sum_{n}x_{nk}}{N}=\frac{m_k}{N}\\
\end{split}
\end{align*}
$$
As the last equation shows, the maximum-likelihood estimate of each class probability is simply the fraction of observations that fall in that class. The likelihood depends on the data only through the counts $m_k$, which is why they are called the sufficient statistics.
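
The result $\mu_k = m_k / N$ can be checked numerically. The snippet below is a minimal sketch on made-up one-hot data (the array `X_onehot`, the comparison vector `mu_other`, and the helper `log_likelihood` are illustrative assumptions, not part of the notebook). It computes the counts, the estimate, and confirms that another valid probability vector gives a lower log-likelihood.

```python
import numpy as np

# Hypothetical toy data: 6 observations of a 3-level variable, one-hot encoded
X_onehot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
])

N = X_onehot.shape[0]
m = X_onehot.sum(axis=0)   # sufficient statistics m_k = sum_n x_nk
mu_hat = m / N             # maximum-likelihood estimate mu_k = m_k / N


def log_likelihood(mu, counts):
    # log p(D | mu) = sum_k m_k * log(mu_k)
    return float(np.sum(counts * np.log(mu)))


# Another valid probability vector, chosen arbitrarily for comparison
mu_other = np.array([0.4, 0.4, 0.2])

print(m)
print(mu_hat)
print(log_likelihood(mu_hat, m) > log_likelihood(mu_other, m))
```

Only the column sums of `X_onehot` enter the estimate, which is the sufficiency property in practice.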

Therefore, the best value of each $\mu_k$ is the number of observations of class $k$ divided by the total number of observations. The multinomial distribution over the counts is described by the following equation, in which the multinomial coefficient ($N$ choose $m_1, ..., m_K$) is used for normalization purposes.
$$
\begin{align*}
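Multi(m_1, m_2, ..., m_K|N, \underline{\mu})=
\begin{pmatrix}
N\\
m_1, m_2, ..., m_K
\end{pmatrix}
\prod_{k=1}^{K}\mu_k^{m_k},\quad
\begin{pmatrix}
N\\
m_1, m_2, ..., m_K
\end{pmatrix}
=\frac{N!}{m_1!\,m_2!\cdots m_K!}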
\end{align*}
$$ 
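
As a small sanity check of the formula, the sketch below evaluates the multinomial probability with the standard library only and verifies that the probabilities of all count vectors for a given $N$ sum to 1. The function name `multinomial_pmf` and the example probabilities are assumptions made for illustration.

```python
import math


def multinomial_pmf(counts, mu):
    """Probability of the count vector `counts` after sum(counts) draws with class probabilities `mu`."""
    n_total = sum(counts)
    # Multinomial coefficient N! / (m_1! * ... * m_K!); each integer division is exact
    coefficient = math.factorial(n_total)
    for m_k in counts:
        coefficient //= math.factorial(m_k)
    prob = float(coefficient)
    for m_k, mu_k in zip(counts, mu):
        prob *= mu_k ** m_k
    return prob


mu = [0.5, 0.3, 0.2]
print(multinomial_pmf([2, 1, 0], mu))

# Enumerate every count vector (a, b, 3 - a - b) with N = 3 and add up the probabilities
total = sum(
    multinomial_pmf([a, b, 3 - a - b], mu)
    for a in range(4)
    for b in range(4 - a)
)
print(total)
```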

### Softmax Regression
As in logistic regression, we want to describe the conditional distribution of the target class given the input. Softmax regression deals with datasets that have more than two classes, in which each observation is described by the following formula: $p(\underline{t}|x;\mu_1,...,\mu_K)=\prod_{k=1}^{K}\mu_k^{t_k}$. Here $\underline{t}$ is one-hot encoded, with a 1 at the index that corresponds to the class number, so it acts like an indicator function. Because of the constraint $||\mu||_1=1$, knowing $\mu_1$ to $\mu_{K-1}$ is sufficient to determine $\mu_K$. Class $K$ is therefore treated as the reference class and has no parameters of its own, as was the case for the negative class in logistic regression. As we did for logistic regression, we rewrite the conditional distribution in exponential-family form so that the problem can be handled as a GLM, and this leads to what is known as the softmax formula.
$$
\begin{align*}
\begin{split}
&p(t|x;\eta)=h(t)g(\eta)exp(\eta^Tu(t))\\
&p(\underline{t}|x;\eta)=exp(log(p(\underline{t}|x;\mu)))=exp(\sum_{k=1}^{K}t_klog(\mu_k));\ \underline{t}\ is\ one\ hot,\ so\ \sum_{i=1}^{K}t_i=1\rightarrow t_K = 1- \sum_{i=1}^{K-1}t_i\\
&p(\underline{t}|x;\eta)=exp(\sum_{i=1}^{K-1}t_ilog(\mu_i) + (1 -\sum_{i=1}^{K-1}t_i)log(\mu_K));\ by\ log(\frac{a}{b})=log(a)-log(b)\\
&p(\underline{t}|x;\eta)=\mu_Kexp(\sum_{i=1}^{K-1}t_ilog(\frac{\mu_i}{\mu_K}))\\
&\eta=
\begin{pmatrix}
log(\frac{\mu_1}{\mu_K})\\
...\\
log(\frac{\mu_{K-1}}{\mu_K})
\end{pmatrix}\\
&u(t)=(t_1, t_2,..., t_{K-1})^T;\ h(t)=1;\ g(\eta)=\mu_K\\
&Canonical\ Link\ function \rightarrow\ \eta_i=log(\frac{\mu_i}{\mu_K})=w_i^T\underline{x}\\
&\mu_i=\mu_Kexp(w_i^Tx);\ by\ summing\ across\ all\ \mu^{'}s\rightarrow\ \mu_K=\frac{1}{\sum_{i=1}^{K}exp(w_i^Tx)};\ w_K=\underline{0}\\
&\therefore \mu_i=\frac{exp(w_i^Tx)}{1+\sum_{k=1}^{K-1}exp(w_k^Tx)}\\
&\because p(y=1|x;\eta)=\mu_1\ ;\ E[u(t)_1|x;\eta]=E[1(y=1)|x;\eta]=0\cdot p(y \neq 1|x;\eta)+1\cdot p(y=1|x;\eta)=\mu_1\\
&\therefore E[u(t)|x;\eta]=(\mu_1,..,\mu_{K-1})^T\\
\end{split}
\end{align*}
$$
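
The softmax formula with a reference class can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions: the function name `softmax_with_reference`, the weight matrix `W`, and the input `x` are made up, and the subtraction of the maximum score is a common numerical-stability trick that is not part of the derivation (it leaves the ratios unchanged). The last line checks that the log-odds against the reference class recover the linear scores $w_i^Tx$.

```python
import numpy as np


def softmax_with_reference(W, x):
    """Class probabilities (mu_1, ..., mu_K) for a single input x.

    W has shape (K-1, d): one weight vector per non-reference class.
    The reference class K has w_K = 0, so its score is fixed at 0.
    """
    eta = np.append(W @ x, 0.0)   # natural parameters, with eta_K = 0
    eta = eta - eta.max()         # shift for numerical stability (assumption, not in the derivation)
    expo = np.exp(eta)
    return expo / expo.sum()


# Hypothetical weights and input: K = 3 classes, d = 2 features
W = np.array([[1.0, -0.5],
              [0.2, 0.3]])
x = np.array([2.0, 1.0])

mu = softmax_with_reference(W, x)
print(mu, mu.sum())

# log(mu_i / mu_K) should equal w_i^T x for the non-reference classes
print(np.log(mu[:-1] / mu[-1]), W @ x)
```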

### Estimating the Parameters 
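We will use gradient descent to minimize the cross-entropy cost function (the negative log-likelihood). The cost is minimized with respect to the parameters of each of the $K-1$ non-reference classes. The derivation below is written for $w_1$; the other classes follow in the same way.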
$$
\begin{align*}
\begin{split}
&\nabla_{w_i}\ell(w_1, w_2, ..., w_{K-1}) = -\nabla_{w_i}\sum_{n=1}^{N}\sum_{k=1}^{K}t_{nk}log(\frac{exp(\eta_{nk})}{1+\sum_{j=1}^{K-1}exp(\eta_{nj})});\ \eta_{nk}=w_k^Tx_n,\ w_K=\underline{0};\ by\ \frac{d logx}{dx} =\frac{1}{x}\\
&\nabla_{w_1}\ell(w_1, w_2, ..., w_{K-1})=-\sum_{n=1}^{N}\Big[t_{n1} \frac{\sum_{j}exp(w_j^Tx_n)}{exp(w_1^Tx_n)}
\Big(
     \frac{x_n\ exp(w_1^Tx_n)\sum_{j}exp(w_j^Tx_n)  -  x_n\ exp(w_1^Tx_n)exp(w_1^Tx_n)}
     {(\sum_{j}exp(w_j^Tx_n) )^2}\Big)
     +t_{n2} \Big(
          \frac{-x_n\ exp(w_2^Tx_n)exp(w_1^Tx_n)}
          {exp(w_2^Tx_n)\sum_{j}exp(w_j^Tx_n)}
           \Big) + ... \Big];\ \sum_{j}\ runs\ over\ all\ K\ classes\\
&\nabla_{w_1}\ell(w_1, w_2, ..., w_{K-1})=-\sum_{n=1}^{N}(x_nt_{n1}(1-\mu_{n1})-x_nt_{n2}\mu_{n1}-x_nt_{n3}\mu_{n1}-...)=-\sum_{n=1}^{N}x_n(t_{n1} - \mu_{n1}(t_{n1}+t_{n2} + ...));\ \mu_{n1}=\mu_1(x_n);\ one\ hot\ encoding\ gives\ \sum_{k}t_{nk}=1\\
& \nabla_{w_1}\ell(w_1, w_2, ..., w_{K-1}) =-\sum_{n=1}^{N}(t_{n1} -\mu_{n1})x_n\\
&w_1^{t+1}=w_1^{t}-\alpha\nabla_{w_1^{t}}\ell(w_1^t, w_2, ..., w_{K-1})=w_1^{t}+\alpha\sum_{n=1}^{N}(t_{n1} -\mu_{n1})x_n
\end{split}
\end{align*}
$$

As the last equation shows, we arrive at exactly the same form of gradient descent step that was seen in logistic regression and linear regression: the error between target and prediction multiplied by the input.
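
One way to gain confidence in the closed-form gradient is to compare it with a finite-difference approximation of the cross-entropy cost. The sketch below does this on a small random problem. Everything in it is an assumption made for illustration: the helper names `probabilities`, `cross_entropy`, and `gradient`, the random data, and the layout of `W` as a `(K-1, d)` matrix with the reference class left implicit.

```python
import numpy as np


def probabilities(W, X):
    # Rows of X are inputs; W has shape (K-1, d). Returns an (N, K) matrix of mu_nk.
    eta = np.hstack([X @ W.T, np.zeros((X.shape[0], 1))])  # eta_nK = 0 for the reference class
    eta = eta - eta.max(axis=1, keepdims=True)             # stability shift, ratios unchanged
    expo = np.exp(eta)
    return expo / expo.sum(axis=1, keepdims=True)


def cross_entropy(W, X, T):
    # l(W) = - sum_n sum_k t_nk * log(mu_nk)
    return -np.sum(T * np.log(probabilities(W, X)))


def gradient(W, X, T):
    # Row i holds grad_{w_i} l = - sum_n (t_ni - mu_ni) * x_n for i = 1..K-1
    mu = probabilities(W, X)
    return -(T[:, :-1] - mu[:, :-1]).T @ X


# Hypothetical random problem: N = 20 samples, d = 3 features, K = 4 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
T = np.eye(4)[rng.integers(0, 4, size=20)]   # one-hot targets
W = rng.normal(size=(3, 3))

# Central finite differences, perturbing one weight at a time
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        step = np.zeros_like(W)
        step[i, j] = eps
        numeric[i, j] = (cross_entropy(W + step, X, T) - cross_entropy(W - step, X, T)) / (2 * eps)

print(np.max(np.abs(numeric - gradient(W, X, T))))
```

The printed value is the largest disagreement between the two gradients and should be very small if the derivation is right.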

### Prediction
We compute the $K$ class probabilities and choose the class with the highest probability.
$$
\begin{align*}
\begin{split}
&p(y=1|x;\eta)=\mu_1=\frac{exp(w_1^Tx)}{1+\sum_{k=1}^{K-1}exp(w_k^Tx)}\\
&...\\
&p(y=K|x;\eta)=\mu_K=1-\sum_{i=1}^{K-1}\frac{exp(w_i^Tx)}{1+\sum_{k=1}^{K-1}exp(w_k^Tx)}=\frac{1}{1+\sum_{k=1}^{K-1}exp(w_k^Tx)};\ from\ ||\mu||_1=1 \\
\end{split}
\end{align*}
$$
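
Because every class shares the same denominator and the exponential is monotonic, picking the most probable class amounts to picking the largest linear score, with the reference class scoring 0. The following is a minimal sketch of that rule; the function name `predict_classes` and the toy weights are assumptions, and classes are numbered from 1 to match the equations above.

```python
import numpy as np


def predict_classes(W, X):
    """Return the 1-based index of the most probable class for each row of X.

    W has shape (K-1, d); the reference class K has an implicit weight vector of zeros.
    """
    scores = np.hstack([X @ W.T, np.zeros((X.shape[0], 1))])  # w_K^T x = 0
    # The largest score is also the largest probability, so no exponentials are needed
    return np.argmax(scores, axis=1) + 1


# Hypothetical weights for K = 3 classes and d = 2 features
W = np.array([[2.0, 0.0],
              [0.0, 2.0]])
X = np.array([[1.0, 0.0],      # favours class 1
              [0.0, 1.0],      # favours class 2
              [-1.0, -1.0]])   # both scores negative, so the reference class 3 wins

print(predict_classes(W, X))  # prints [1 2 3]
```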

In [1]:
%matplotlib inline
import numpy as np 
import sklearn.preprocessing
import sklearn.datasets
import pandas as pd
import sklearn.model_selection
import numpy.random
import math
import sklearn.metrics


In [122]:
X, y = sklearn.datasets.load_iris(return_X_y=True)
#X, y = sklearn.datasets.load_wine(return_X_y=True)

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=42)
standard = sklearn.preprocessing.StandardScaler()
X_train = standard.fit_transform(X_train)
# All of the features are continuous, so there is no need for a one-hot encoder on the inputs and we can standardize them directly
training_data = np.c_[X_train, y_train]

X_test = standard.transform(X_test)
test_data = np.c_[X_test, y_test]
print(training_data.shape)
print(test_data.shape)

(112, 5)
(38, 5)


In [123]:
k = 3
X_train, y_train = (training_data[:, 0:-1], training_data[:, -1])
X_test, y_test = (test_data[:, 0:-1], test_data[:, -1])

In [132]:
class SoftmaxRegression(object):

    def __init__(self, X, y, k, learning_rate, num_iteration):
        self.num_iteration = num_iteration
        self.learning_rate = learning_rate
        self.K = k
        self.m = X.shape[0]  # number of training examples
        self.n = X.shape[1] + 1  # number of features plus the bias unit
        self.W = {}
        self._initialize_parameters(self.n, self.K)
        self.X_train = np.c_[np.ones((self.m, 1)), X]  # prepend the bias column
        self.y_train = sklearn.preprocessing.OneHotEncoder().fit_transform(y.reshape(-1, 1)).toarray()

    def _initialize_parameters(self, n, K):
        # One weight vector per class; W[K] belongs to the reference class and is never changed
        for k in range(1, K + 1):
            self.W[str(k)] = np.zeros((n, 1))

    def softmax_function(self, w, x):
        # Probability of the class with weights w: exp(w^T x) / sum_j exp(w_j^T x)
        assert w.shape == x.shape
        t1 = np.exp(np.dot(w.T, x))
        t2 = np.sum([np.exp(np.dot(w_j.T, x)) for w_j in self.W.values()])
        return np.divide(t1, t2).item()  # return a plain float rather than a (1, 1) array

    # TODO: vectorize these loops
    def fit(self):
        # Per-sample updates: w_k <- w_k + learning_rate * (t_nk - mu_nk) * x_n
        for _ in range(self.num_iteration):
            for k in range(1, self.K):  # the reference class K is skipped
                for n in range(self.m):
                    x_n = self.X_train[n, :].reshape(-1, 1)
                    error = self.y_train[n, k-1] - self.softmax_function(self.W[str(k)], x_n)
                    self.W[str(k)] = self.W[str(k)] + self.learning_rate * error * x_n

        return self.W

    def predict(self, X):
        # X must already contain the leading column of ones for the bias unit
        prediction = np.zeros((X.shape[0], self.K))
        for k in range(1, self.K):
            prediction[:, k-1] = [self.softmax_function(self.W[str(k)], x.reshape(-1, 1)) for x in X]

        # The reference class takes the remaining probability mass
        prediction[:, self.K-1] = 1 - np.sum(prediction[:, 0:self.K-1], axis=1)
        return np.argmax(prediction, axis=1)  # index of the most probable class for each row


In [133]:
model = SoftmaxRegression(X_train, y_train, 3, 0.1, 100)
W = model.fit()
pred = model.predict(np.c_[np.ones((X_train.shape[0],1)), X_train])
print("Performance on the training set")
print(sklearn.metrics.confusion_matrix(y_train, pred))
#print(f"precision:{sklearn.metrics.precision_score(y_train, prediction):0.3f}, recall:{sklearn.metrics.recall_score(y_train, prediction):0.3f}")

Performance on the training set
[[35  0  0]
 [ 0 36  3]
 [ 0  1 37]]


In [134]:
prediction = model.predict( np.c_[np.ones((X_test.shape[0], 1)), X_test] )
print("Performance on the test set")
print(sklearn.metrics.confusion_matrix(y_test, prediction))  # argument order is (y_true, y_pred)

Performance on the test set
[[15  0  0]
 [ 0 11  0]
 [ 0  0 12]]


In [128]:
from sklearn.linear_model import LogisticRegression
# Recent scikit-learn versions take penalty=None; older versions used the string 'none'
model = LogisticRegression(penalty=None)
model.fit(X_train, y_train)
prediction = model.predict(X_train)
print("Performance on the training set")
print(sklearn.metrics.confusion_matrix(y_train, prediction))

Performance on the training set
[[35  0  0]
 [ 0 38  1]
 [ 0  1 37]]


### References 
* Chapters 1, 2, and 4 from Bishop, C. (2006). Pattern Recognition and Machine Learning. New York: Springer.
* Andrew Ng, Lec 1: (https://www.youtube.com/watch?v=UzxYlbK2c7E)
* Andrew Ng, Lec 2: (https://www.youtube.com/watch?v=5u4G23_OohI)
* Andrew Ng, Lec 3: (https://www.youtube.com/watch?v=HZ4cvaztQEs)
* Andrew Ng, Lec 4: (https://www.youtube.com/watch?v=nLKOQfKLUks)
