The lecture notes claim that softmax is equivalent to sigmoid when n=2. The lecturer didn't prove or explain this further, so I thought I'd prove it here myself.


In [1]:
import numpy as np 

So obviously they're not equal. What am I doing wrong?

The first thing to emphasize is that when they say n=2, they don't mean n data points; they mean n classes, i.e. n parameter vectors $\theta_0, \dots, \theta_{n-1}$. They are talking about $\theta$, not $X$.

So here's what I'm going to do: 

1. Prove the result algebraically 
2. Prove the result numerically

[This answer on Quora seems to explain stuff pretty well!](https://www.quora.com/Why-is-it-better-to-use-Softmax-function-than-sigmoid-function)

Here's the idea rewritten: 

Softmax definition: 

\begin{align}
P(y = j | X) &= \frac{\exp(\theta_j^T \cdot X)}{\sum_{i=0}^{n-1}\exp(\theta_i^T \cdot X)} \\
&= \frac{1}{\sum_{i=0}^{n-1}\exp((\theta_i - \theta_j)^T \cdot X)} \\
&= \frac{1}{\sum_{i=0}^{n-1}\exp(-(\theta_j - \theta_i)^T \cdot X)}
\end{align}
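As a quick sanity check on the rewriting above, here is a minimal sketch that computes softmax both ways: directly from the definition, and as one over the sum of exponentiated score differences. The function names, the random seed, and the sizes (4 classes, 3 features) are my own choices for illustration, not something from the lecture notes.

```python
import numpy as np

def softmax_direct(thetas, x):
    """exp(theta_j . x) / sum_i exp(theta_i . x), for every class j."""
    scores = thetas @ x                          # shape (n_classes,)
    exp_scores = np.exp(scores - scores.max())   # shift for numerical stability
    return exp_scores / exp_scores.sum()

def softmax_differences(thetas, x):
    """Rewritten form: 1 / sum_i exp(-(theta_j - theta_i) . x), for every class j."""
    scores = thetas @ x
    # pairwise[j, i] = (theta_j - theta_i) . x
    pairwise = scores[:, None] - scores[None, :]
    return 1.0 / np.exp(-pairwise).sum(axis=1)

rng = np.random.default_rng(0)
thetas = rng.normal(size=(4, 3))   # one row per class (arbitrary sizes)
x = rng.normal(size=3)

probs_direct = softmax_direct(thetas, x)
probs_diff = softmax_differences(thetas, x)
print(np.allclose(probs_direct, probs_diff))
```

The second form makes it visible that only the differences between parameter vectors matter, which is what the n = 2 case relies on.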


When n = 2,
\begin{align}
P(y = 1 | X) &= \frac{1}{\sum_{i=0}^{1}\exp(-(\theta_1 - \theta_i)^T \cdot X)} \\
&= \frac{1}{\exp(-(\theta_1 - \theta_0)^T \cdot X) + \exp(-(\theta_1 - \theta_1)^T \cdot X)} \\
&= \frac{1}{\exp(-(\theta_1 - \theta_0)^T \cdot X) + 1} \\
&= \sigma((\theta_1 - \theta_0)^T \cdot X)
\end{align}
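The end result of the derivation can be spot-checked on a single example. This is only a sketch: the two parameter vectors and the input below are made-up numbers, and `sigmoid` and `softmax` are small helpers written here for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def softmax(scores):
    exp_scores = np.exp(scores - np.max(scores))  # shift for numerical stability
    return exp_scores / exp_scores.sum()

# Made-up parameter vectors for class 0 and class 1, and one input vector.
theta_0 = np.array([0.5, -1.0, 2.0])
theta_1 = np.array([1.5, 0.3, -0.7])
x = np.array([0.2, -0.4, 1.1])

scores = np.array([theta_0 @ x, theta_1 @ x])
p_softmax = softmax(scores)[1]                  # P(y = 1 | x) from two-class softmax
p_sigmoid = sigmoid((theta_1 - theta_0) @ x)    # sigmoid of the difference
print(np.isclose(p_softmax, p_sigmoid))
```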




### Just a reminder: 


In linear regression, we say:

\begin{align}
f(y | X_i) = \theta^T \cdot X_i
\end{align}

In logistic regression, we say:

\begin{align}
logit(P(y=1 | X_i)) = \theta^T \cdot X_i
\end{align}

Where:

\begin{align}
logit(p) = \log\frac{p}{1-p}
\end{align}

So:

\begin{align}
P(y=1 | X_i) = \frac{1}{1+\exp(-z)}
\end{align}

Where:

\begin{align}
z = \theta^T \cdot X_i
\end{align}

Here $\theta$ is a single parameter vector, which plays the role of $\theta_1 - \theta_0$ in the derivation above.
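The step from the logit equation to the sigmoid form is just inverting the logit. A minimal sketch of that, with `logit` and `sigmoid` written out by hand (the grid of test values is arbitrary):

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of logit: maps a real-valued score to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)   # arbitrary scores
p = sigmoid(z)

# Applying logit to sigmoid(z) should give z back.
print(np.allclose(logit(p), z))
```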


### Now to prove numerically 

In [12]:
num_data_points, num_features, num_classes = 1000, 10, 2

data_matrix = np.random.normal(size=(num_data_points, num_features))
weights_matrix = np.random.random(size=(num_features, num_classes - 1))

In [13]:
z = np.matmul(data_matrix, weights_matrix)
z[:10]

array([[-0.25896617],
       [-4.20362076],
       [ 2.34005395],
       [-1.51754527],
       [ 1.04138182],
       [ 1.89313601],
       [-0.26774478],
       [ 0.91482927],
       [ 0.18728667],
       [ 0.20086445]])

## Using sigmoid 

In [14]:
p = np.exp(z) / (1 + np.exp(z))
p[:10]

array([[0.43561786],
       [0.01472142],
       [0.91214041],
       [0.17982328],
       [0.73911654],
       [0.86911268],
       [0.43346083],
       [0.71398736],
       [0.54668528],
       [0.55004795]])

In [15]:
sigmoid_predictions = (p > 0.5).astype(int).ravel()

## Using softmax

In [16]:
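logits = np.hstack([np.zeros_like(z), z])  # class 0 has theta_0 = 0, class 1 uses weights_matrix
matrix_for_softmax = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row-wise softmax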

In [17]:
softmax_predictions = matrix_for_softmax.argmax(axis=1)

## Comparing results

In [18]:
softmax_predictions[:10]

array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])

In [19]:
sigmoid_predictions[:10]

array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])

In [20]:
np.unique(softmax_predictions == sigmoid_predictions, return_counts=True)

(array([ True]), array([1000]))
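Matching predicted labels is a fairly weak check, since two different probability vectors can share the same argmax. A stronger check is to compare the probabilities themselves, with a separate parameter vector for each of the two classes. Here's a sketch of that under my own assumptions: fresh random data with the same shapes as above, a fixed seed, and a weights matrix with one column per class.

```python
import numpy as np

rng = np.random.default_rng(0)
num_data_points, num_features = 1000, 10

X = rng.normal(size=(num_data_points, num_features))
thetas = rng.normal(size=(num_features, 2))   # columns are theta_0 and theta_1

# Softmax over both classes, row by row (shifted for numerical stability).
scores = X @ thetas
scores -= scores.max(axis=1, keepdims=True)
softmax_probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Sigmoid applied to the difference of the two parameter vectors.
diff_scores = X @ (thetas[:, 1] - thetas[:, 0])
sigmoid_probs = 1.0 / (1.0 + np.exp(-diff_scores))

# The class-1 softmax probability should match the sigmoid probability.
print(np.allclose(softmax_probs[:, 1], sigmoid_probs))
print(np.abs(softmax_probs[:, 1] - sigmoid_probs).max())
```

The second print shows the largest absolute difference, which should be at the level of floating-point rounding.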