## ___Softmax Regression___
Softmax regression, also called multinomial logistic regression, is an algorithm used for multi-class classification.  

### How it works:  
- It computes a score $s_k(x)$ for each class $k$:  
$$s_k(x) = x^T\theta^k$$
where:  
$\theta^k$ is the weight vector of class $k$,  
$x$ is the input feature vector,  
$s_k$ is the score (logit) of class $k$
- It then applies the softmax function to turn the scores into a probability for each class $k$:
$$\hat{p}_k = \frac{e^{s_k(x)}}{\sum_{j=1}^{K}e^{s_j(x)}}$$

- Finally, it assigns the instance to the class with the highest probability.
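
The three steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the toy inputs `X` and the weights `Theta` are made-up values, and subtracting the row maximum before exponentiating is a common numerical-stability trick that the formulas above do not mention (it leaves the probabilities unchanged).

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical toy data: 2 instances, 3 features, K = 3 classes.
X = np.array([[1.0, 2.0, 0.5],
              [0.2, 0.1, 3.0]])

# Column k of Theta holds the weight vector theta^k (illustrative values).
Theta = np.array([[ 0.5, -0.2,  0.1],
                  [ 0.3,  0.8, -0.5],
                  [-0.4,  0.1,  0.9]])

scores = X @ Theta                 # s_k(x) = x^T theta^k for every class k
probs = softmax(scores)            # p_hat_k for every class k
predicted = probs.argmax(axis=1)   # class with the highest probability

print(probs)
print(predicted)
```

Each row of `probs` sums to 1, and `predicted` holds exactly one class index per instance.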

`Note`:  
Softmax regression is _multiclass_, not _multioutput_: it assigns each instance to exactly one class.  
Example: with classes A, B, and C, an image that contains both A and B cannot be labelled correctly, because the model can only output A or B or C.

### Cost function:
Cross-entropy function:
$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} \log(\hat{p}_k^{(i)})$$

- Here, $y_k^{(i)}$ is the target probability that the $i^{th}$ instance belongs to the $k^{th}$ class. Generally, this is 0 or 1.

- Notice that, for $K=2$, the cross-entropy function reduces to the log-loss function of logistic regression.

- We can now minimize this function to train the model!
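
As a rough check of the $K=2$ claim, the sketch below computes the cross-entropy on one-hot targets and the binary log-loss on the same predictions. The labels and probabilities are hypothetical, and the `eps` clipping is an added safeguard against `log(0)`, not part of the formula above.

```python
import numpy as np

def cross_entropy(Y, P, eps=1e-12):
    """Mean cross-entropy. Y: one-hot targets (m, K). P: predicted probabilities (m, K)."""
    P = np.clip(P, eps, 1.0)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

def log_loss_binary(y, p, eps=1e-12):
    """Binary log-loss. y: 0/1 labels (m,). p: predicted P(class 1) (m,)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical binary problem with three instances.
y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])        # predicted probability of class 1

# Same data written in the K = 2 multi-class form.
Y = np.column_stack([1 - y, y])      # one-hot targets
P = np.column_stack([1 - p, p])      # probabilities of class 0 and class 1

ce = cross_entropy(Y, P)
ll = log_loss_binary(y, p)
print(ce, ll)                        # the two values agree
```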

**Gradient of Cross-Entropy Loss (for class $k$):**

Let:  
- $m$ = number of training examples  
- $\hat{p}_k^{(i)}$ = predicted probability for class $k$ for example $i$  
- $y_k^{(i)}$ = true label (1 if example $i$ belongs to class $k$, else 0)  
- $\mathbf{x}^{(i)}$ = input feature vector for example $i$

Then the gradient of the cost function $J(\Theta)$ with respect to the parameter vector $\theta^k$ of class $k$ is:

$$
\nabla_{\theta^k} J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{p}_k^{(i)} - y_k^{(i)} \right) \mathbf{x}^{(i)}
$$
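
The gradient formula lends itself to plain batch gradient descent. The following is a minimal sketch under stated assumptions: the data are synthetic clusters generated for illustration, the learning rate `eta` and the iteration count are arbitrary choices, and a bias column of ones is added to `X` even though the formulas above omit a bias term.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def cross_entropy(Y, P):
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

def gradient(X, Y, Theta):
    """Column k equals (1/m) * sum_i (p_hat_k - y_k) * x^(i)."""
    m = X.shape[0]
    P = softmax(X @ Theta)
    return X.T @ (P - Y) / m

# Hypothetical synthetic data: three well-separated clusters in 2-D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = np.repeat(np.arange(3), 30)
X = centers[labels] + rng.normal(scale=0.5, size=(90, 2))
X = np.column_stack([np.ones(len(X)), X])   # bias column (an addition, see lead-in)
Y = np.eye(3)[labels]                       # one-hot targets

Theta = np.zeros((X.shape[1], 3))           # one weight column per class
eta = 0.1                                   # learning rate (illustrative)

initial_loss = cross_entropy(Y, softmax(X @ Theta))
for _ in range(500):
    Theta -= eta * gradient(X, Y, Theta)    # step against the gradient
final_loss = cross_entropy(Y, softmax(X @ Theta))

train_accuracy = np.mean(softmax(X @ Theta).argmax(axis=1) == labels)
print(initial_loss, final_loss, train_accuracy)
```

Stacking the per-class gradients as columns gives the single matrix expression `X.T @ (P - Y) / m`, so all classes are updated in one step.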


In [3]:
import pandas as pd
import numpy as np
from seaborn import load_dataset

In [5]:
data = load_dataset("iris")
data.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [7]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

x = data.drop('species', axis=1)                   # feature columns
y = label_encoder.fit_transform(data['species'])   # species names -> integer labels 0, 1, 2

In [10]:
from sklearn.model_selection import train_test_split
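# Hold out 20% of the rows for testing; no random_state is set, so the split differs between runs
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)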

In [33]:
from sklearn.linear_model import LogisticRegression
# With the default solver, recent scikit-learn versions fit a multinomial (softmax) model for multi-class targets
softmax_regressor = LogisticRegression(C=0.1)
softmax_regressor.fit(x_train,y_train)

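Logistic regression uses the 'l2' penalty by default.  
$$C = \frac{1}{\alpha}$$ 
is the inverse of the regularization strength $\alpha$, so a smaller $C$ means stronger regularization.

- Too many features or signs of overfitting? Try a smaller value such as C=0.01 or C=0.001.
- Model slightly underfits? Try a larger value such as C=10 or C=30.
- Not sure? Keep the default, C=1.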
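
One way to see the effect of `C` is to compare the size of the learned weights for several values. This sketch makes two assumptions of its own: it loads iris through `sklearn.datasets.load_iris` rather than seaborn so that it runs without a download, and it raises `max_iter` to 1000 so the solver has room to converge.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: scikit-learn's bundled iris data stands in for the seaborn copy.
X, y = load_iris(return_X_y=True)

coef_norms = {}
for C in (0.01, 0.1, 1, 10):
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X, y)
    # Overall size of the learned weight matrix (one row per class).
    coef_norms[C] = float(np.linalg.norm(model.coef_))

for C, norm in coef_norms.items():
    print(f"C={C:<5} ||coef|| = {norm:.3f}")
```

Smaller `C` values shrink the weights more strongly, which is the mechanism behind the overfitting advice in the list above.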

In [34]:
y_test_predict = softmax_regressor.predict(x_test)
y_test_predict

array([1, 1, 1, 2, 2, 1, 1, 1, 0, 0, 2, 1, 2, 2, 2, 0, 0, 1, 2, 2, 1, 0,
       1, 0, 2, 1, 0, 0, 0, 0])

In [35]:
from sklearn.metrics import accuracy_score,precision_score,recall_score
accuracy = accuracy_score(y_test,y_test_predict)
precision = precision_score(y_test,y_test_predict,average="macro")
recall = recall_score(y_test,y_test_predict,average="macro")

print("Accuracy",accuracy)
print("Precision",precision)
print("recall",recall)

Accuracy 0.9333333333333333
Precision 0.9326599326599326
recall 0.9326599326599326
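
To make `average="macro"` concrete, here is a small sketch that computes per-class precision and recall by hand and then takes their unweighted mean. The label arrays are hypothetical and unrelated to the run above, and `macro_precision_recall` is an illustrative helper, not a scikit-learn function.

```python
import numpy as np

def macro_precision_recall(y_true, y_pred, n_classes):
    """Unweighted mean of per-class precision and recall."""
    precisions, recalls = [], []
    for k in range(n_classes):
        tp = np.sum((y_pred == k) & (y_true == k))   # correctly predicted as k
        fp = np.sum((y_pred == k) & (y_true != k))   # wrongly predicted as k
        fn = np.sum((y_pred != k) & (y_true == k))   # class k instances that were missed
        precisions.append(tp / (tp + fp) if tp + fp > 0 else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn > 0 else 0.0)
    return float(np.mean(precisions)), float(np.mean(recalls))

# Hypothetical labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

macro_p, macro_r = macro_precision_recall(y_true, y_pred, n_classes=3)
print(macro_p, macro_r)
```

Because every class counts equally in a macro average, a rare class affects the score as much as a common one.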
