## Softmax Regression

### Model
$$\mathbf{o}=\mathbf{W}\mathbf{x}+\mathbf{b}$$
$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o})\quad \text{where}\quad \hat{y}_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}.$$
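
The softmax formula above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the code used later in this notebook: the function name `softmax` and the example logits are assumptions, and the row-maximum subtraction is a standard numerical-stability trick that the formula itself does not mention.

```python
import torch

def softmax(o: torch.Tensor) -> torch.Tensor:
    """Row-wise softmax over logits of shape (batch, classes)."""
    # Subtracting each row's maximum does not change the result,
    # but it keeps exp() from overflowing for large logits.
    shifted = o - o.max(dim=1, keepdim=True).values
    exp = torch.exp(shifted)
    return exp / exp.sum(dim=1, keepdim=True)

# Illustrative logits; the second row would overflow without the shift.
logits = torch.tensor([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
probs = softmax(logits)
print(probs.sum(dim=1))  # each row sums to 1
```

Each row of `probs` is non-negative and sums to one, so it can be read as a distribution over the classes.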

### Loss function
#### Maximize the log-likelihood
Here we still want to maximize the likelihood $P(\mathbf{Y} \mid \mathbf{X})$, which is equivalent to minimizing the negative log-likelihood
$$-\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^n -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})
= \sum_{i=1}^n l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}),$$
For each pair $\langle\mathbf{y}, \hat{\mathbf{y}}\rangle$ over $q$ classes, the negative log-likelihood is
$$l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j.$$

Averaging over the $n$ examples gives the total loss function
$$L(\mathbf{Y}, \hat{\mathbf{Y}})=-\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^q y_j^{(i)} \log \hat{y}_j^{(i)}.$$

#### Minimize the cross-entropy loss
For each pair $\langle\mathbf{y}, \hat{\mathbf{y}}\rangle$ over $q$ classes, the cross-entropy is the same expression as the per-example negative log-likelihood:
$$l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j.$$
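
As a small worked example of this formula, the sketch below assumes that each label is a single class index (equivalently, a one-hot vector $\mathbf{y}$), in which case the sum over $j$ collapses to the negative log of the probability assigned to the true class. The function name `cross_entropy` and the numbers are illustrative assumptions, not part of the notebook's `loss` module.

```python
import torch
import torch.nn.functional as F

def cross_entropy(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Per-example cross-entropy.

    y_hat: (n, q) predicted probabilities; y: (n,) integer class labels.
    """
    # With a one-hot y, -sum_j y_j * log(y_hat_j) keeps only the true-class term.
    return -torch.log(y_hat[torch.arange(y_hat.shape[0]), y])

y_hat = torch.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
y = torch.tensor([0, 2])
per_example = cross_entropy(y_hat, y)

# The same values computed literally from the formula with one-hot labels.
one_hot = F.one_hot(y, num_classes=3).float()
full_sum = -(one_hot * torch.log(y_hat)).sum(dim=1)

mean_loss = per_example.mean()  # average over the n examples
```

Here `per_example` and `full_sum` agree, which is why implementations usually index the true-class probability instead of building one-hot vectors.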

## Implementation from scratch

In [12]:
import torch
from torch.nn import Module
from torch.utils.data import DataLoader
import torchvision
import sys
sys.path.append("../dlutils")
import importlib
import model
import loss
import train
import dataset
importlib.reload(model)
importlib.reload(loss)
importlib.reload(train)
importlib.reload(dataset)

<module 'dataset' from '../dlutils/dataset.py'>

### Load the dataset
Here we are using the Fashion-MNIST dataset.

In [2]:
from dataset import load_fashion_mnist_dataset
# get Fashion-MNIST DataLoader
batch_size = 256
train_loader,test_loader=load_fashion_mnist_dataset(batch_size)

#### Retrieve text labels
Here we define a function to retrieve the corresponding text label given the numerical label.

In [3]:
def get_fashion_mnist_labels(labels):
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat', 'sandal', 'shirt',
        'sneaker', 'bag', 'ankle boot']
    return [text_labels[int(i)] for i in labels]

### Defining the model and loss 

In [4]:
from model import SoftmaxRegression
# A single sample has shape (channels, height, width); Fashion-MNIST images are 1x28x28
num_channels, height, width = list(train_loader.dataset[0][0].shape)
in_features = height*width
out_features = 10
net = SoftmaxRegression(in_features, out_features)
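
The `model` module is not shown in this notebook, so the following is only a guess at what a from-scratch softmax regression module could look like. The class name `SoftmaxRegressionSketch`, the initialization scheme, and the choice to return probabilities rather than raw logits are all assumptions; only the constructor arguments and the `reset_weights()` method name are taken from how the class is used here.

```python
import torch
from torch import nn

class SoftmaxRegressionSketch(nn.Module):
    """Hypothetical stand-in for model.SoftmaxRegression (not the real class)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.W = nn.Parameter(torch.empty(in_features, out_features))
        self.b = nn.Parameter(torch.zeros(out_features))
        self.reset_weights()

    def reset_weights(self) -> None:
        # Assumed scheme: small Gaussian weights, zero bias.
        with torch.no_grad():
            self.W.normal_(mean=0.0, std=0.01)
            self.b.zero_()

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # Flatten (batch, 1, 28, 28) images to (batch, 784).
        X = X.reshape(X.shape[0], -1)
        o = X @ self.W + self.b          # batched form of o = Wx + b
        return torch.softmax(o, dim=1)   # assumed: the model returns probabilities

sketch = SoftmaxRegressionSketch(28 * 28, 10)
y_hat = sketch(torch.rand(4, 1, 28, 28))
print(y_hat.shape)  # torch.Size([4, 10])
```

Registering `W` and `b` as `nn.Parameter` is what makes `net.parameters()` and `net.to(device)` work in the training cell.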

In [5]:
from loss import CrossEntropyLoss
loss = CrossEntropyLoss()

### Training

In [6]:
from train import sgd, classifier_accuracy
num_epochs = 10
lr = 0.1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net.reset_weights()
net.to(device)
loss.to(device)

for i in range(num_epochs):
    for X,y in train_loader:
        X, y = X.to(device), y.to(device)
        y_hat = net(X)
        ce_loss = loss(y_hat, y).sum()
        ce_loss.backward()
        sgd(net.parameters(), lr, batch_size)
    
    # After each epoch, compute the training and test loss and accuracy
    with torch.no_grad():
        train_loss = 0
        test_loss = 0
        train_cp = 0
        test_cp = 0
        for X, y in train_loader:
            X,y = X.to(device), y.to(device)
            y_hat = net(X)
            train_loss += loss(y_hat,y).sum()
            train_cp += classifier_accuracy(y_hat, y)
        train_loss = train_loss/len(train_loader.dataset)
        train_cp = train_cp/len(train_loader.dataset)
            
        for X, y in test_loader:
            X,y = X.to(device), y.to(device)
            y_hat = net(X)
            test_loss += loss(y_hat,y).sum()
            test_cp += classifier_accuracy(y_hat,y)
        test_loss = test_loss/len(test_loader.dataset)
        test_cp = test_cp/len(test_loader.dataset)
        print(f'epoch {i}, training loss {float(train_loss):f}, training accuracy {float(train_cp):f},testing loss {float(test_loss):f}, testing accuracy {float(test_cp):f}')


epoch 0, training loss 2.154412, training accuracy 0.591717,testing loss 2.255690, testing accuracy 0.582700
epoch 1, training loss 1.665212, training accuracy 0.661667,testing loss 1.767846, testing accuracy 0.651300
epoch 2, training loss 1.430462, training accuracy 0.692317,testing loss 1.534156, testing accuracy 0.681600
epoch 3, training loss 1.296828, training accuracy 0.714367,testing loss 1.397760, testing accuracy 0.701000
epoch 4, training loss 1.202236, training accuracy 0.729317,testing loss 1.302553, testing accuracy 0.717600
epoch 5, training loss 1.142621, training accuracy 0.741267,testing loss 1.245808, testing accuracy 0.729000
epoch 6, training loss 1.083060, training accuracy 0.748017,testing loss 1.184234, testing accuracy 0.734200
epoch 7, training loss 1.043397, training accuracy 0.752117,testing loss 1.144947, testing accuracy 0.738800
epoch 8, training loss 1.005169, training accuracy 0.759717,testing loss 1.106884, testing accuracy 0.746500
epoch 9, training l
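
The helpers `sgd` and `classifier_accuracy` come from the `train` module, which is not shown. A rough sketch of what they might do is given below, under two assumptions inferred from the training loop: `sgd` divides the summed gradient by `batch_size` and clears it afterwards, and `classifier_accuracy` returns the number of correct predictions (the loop divides its running total by the dataset size). The names `sgd_sketch` and `num_correct_sketch` are hypothetical.

```python
import torch

def sgd_sketch(params, lr: float, batch_size: int) -> None:
    """Minibatch SGD step; assumes the loss was summed over the batch."""
    with torch.no_grad():
        for p in params:
            p -= lr * p.grad / batch_size  # average the summed gradient
            p.grad.zero_()                 # assumed: clear for the next backward()

def num_correct_sketch(y_hat: torch.Tensor, y: torch.Tensor) -> float:
    """Count predictions whose arg-max class matches the label."""
    return float((y_hat.argmax(dim=1) == y).sum())

# Tiny check on a quadratic: the gradient of sum(w^2) is 2 * w = [2, 4].
w = torch.tensor([1.0, 2.0], requires_grad=True)
(w ** 2).sum().backward()
sgd_sketch([w], lr=0.1, batch_size=2)
# w is now [1 - 0.1 * 2 / 2, 2 - 0.1 * 4 / 2] = [0.9, 1.8]
```

If the real `sgd` did not zero the gradients, they would accumulate across batches, because the loop above never calls `zero_grad()` itself.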

## Concise Implementation

### Initializing model parameters

In [13]:
net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(in_features, out_features))

def init_weights(m):
    if isinstance(m, torch.nn.Linear):
        m.weight.data.zero_()

net.apply(init_weights)

Sequential(
  (0): Flatten(start_dim=1, end_dim=-1)
  (1): Linear(in_features=784, out_features=10, bias=True)
)

### Cross entropy loss

In [14]:
loss = torch.nn.CrossEntropyLoss(reduction="sum")

### Optimizer

In [15]:
trainer = torch.optim.SGD(net.parameters(), lr=0.1)
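
One detail worth checking here is how `reduction="sum"` interacts with the learning rate. With a summed loss, the gradient is larger than with `reduction="mean"` by a factor equal to the batch size, and `torch.optim.SGD` does not rescale it. The sketch below measures that factor on random data; the tensor shapes mirror this notebook (batch of 256, 784 inputs, 10 classes), but the data and the helper name `grad_norm` are made up for illustration.

```python
import torch

torch.manual_seed(0)
X = torch.randn(256, 784)            # random stand-in for one flattened batch
y = torch.randint(0, 10, (256,))     # random labels

def grad_norm(reduction: str) -> float:
    """Gradient norm of a zero-initialized linear layer for one batch."""
    layer = torch.nn.Linear(784, 10)
    torch.nn.init.zeros_(layer.weight)
    torch.nn.init.zeros_(layer.bias)
    loss_fn = torch.nn.CrossEntropyLoss(reduction=reduction)
    loss_fn(layer(X), y).backward()
    return layer.weight.grad.norm().item()

ratio = grad_norm("sum") / grad_norm("mean")
print(ratio)  # approximately 256, the batch size
```

So `lr=0.1` with a summed loss takes much larger steps than the same `lr` applied to a batch-averaged gradient. This may be one contributor to the non-monotonic loss values in the training log of this section; using `reduction="mean"` or a smaller learning rate are common alternatives.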

### Training

In [16]:
from train import train_3ch
num_epochs = 10
lr = 0.1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net.to(device)
loss.to(device)
train_3ch(net, loss, num_epochs, train_loader, trainer, test_loader, device)

epoch 0, training loss 53.603989, training accuracy 0.780850, testing loss 55.988068, testing accuracy 0.771900
epoch 1, training loss 62.484596, training accuracy 0.792083, testing loss 65.545845, testing accuracy 0.784400
epoch 2, training loss 74.655914, training accuracy 0.733033, testing loss 79.184387, testing accuracy 0.724700
epoch 3, training loss 21.902069, training accuracy 0.831067, testing loss 24.407562, testing accuracy 0.816400
epoch 4, training loss 19.011396, training accuracy 0.830017, testing loss 22.082993, testing accuracy 0.815100
epoch 5, training loss 19.658064, training accuracy 0.842033, testing loss 22.473108, testing accuracy 0.828700
epoch 6, training loss 22.639240, training accuracy 0.830367, testing loss 25.885845, testing accuracy 0.817200
epoch 7, training loss 15.621553, training accuracy 0.851950, testing loss 18.589893, testing accuracy 0.831200
epoch 8, training loss 15.852967, training accuracy 0.851700, testing loss 18.795355, testing accuracy 0
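
The `train_3ch` helper is imported from the `train` module and its body is not shown. For readers who want to see the shape of such a loop, here is a minimal sketch of training with a built-in optimizer. It is an assumption about the general pattern, not a reconstruction of `train_3ch`: the name `train_sketch`, its return value, and the synthetic data are all hypothetical, and it assumes a loss created with `reduction="sum"`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_sketch(net, loss_fn, num_epochs, train_loader, optimizer, device):
    """Hypothetical minimal loop; not the actual train.train_3ch."""
    net.to(device)
    history = []
    for _ in range(num_epochs):
        net.train()
        total_loss, total_correct, total_seen = 0.0, 0, 0
        for X, y in train_loader:
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()             # clear gradients from the last step
            y_hat = net(X)
            batch_loss = loss_fn(y_hat, y)    # summed over the batch (assumed)
            batch_loss.backward()
            optimizer.step()
            total_loss += batch_loss.item()
            total_correct += (y_hat.argmax(dim=1) == y).sum().item()
            total_seen += y.shape[0]
        # Per-example average loss and accuracy for this epoch.
        history.append((total_loss / total_seen, total_correct / total_seen))
    return history

# Synthetic two-class data, used only to exercise the loop.
torch.manual_seed(0)
X = torch.randn(64, 1, 4, 4)
y = (X.flatten(1).sum(dim=1) > 0).long()
loader = DataLoader(TensorDataset(X, y), batch_size=16)
toy_net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(16, 2))
toy_loss = torch.nn.CrossEntropyLoss(reduction="sum")
toy_opt = torch.optim.SGD(toy_net.parameters(), lr=0.01)
history = train_sketch(toy_net, toy_loss, 5, loader, toy_opt, torch.device("cpu"))
```

Dividing the accumulated loss by the number of examples seen keeps the reported value comparable across batch sizes and datasets.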