# Q3 : Training a simple Linear Model
In this section, you will write the code to train a Linear Model. The goal is to classify an input $X_i$ of size $n$ into one of $m$ classes. For this, you need to consider the following:

1)  **Weight Matrix** $W_{n\times m}$: The Weights are multipled with the input $X_i$ (vector of size $n$), to find $m$ scores $S_m$ for the $m$ classes.

2)  **The Loss function**:   
  * The Cross Entropy Loss: By interpreting the scores as unnormalized log probabilities for each class, this loss tries to measure dissatisfaction with the scores in terms of the log probability of the right class:

$$
L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right) \hspace{0.5in} \text{or equivalently} \hspace{0.5in} L_i = -f_{y_i} + \log\sum_j e^{f_j}
$$

where $f_{ y_i }$ is the $y_i$-th element of the output of $W^T  X_i$

3) **A Regularization term**: In addition to the loss, you need a Regularization term to lead to a more distributed (in case of $L_2$) or sparse (in case of $L_1$) learning of the weights. For example, with $L_2$ regularization, the loss has the following additional term:

$$
R(W) = \sum_k\sum_l W_{k,l}^2  
$$

Thus the total loss has the form:
$$
L =  \underbrace{ \frac{1}{N} \sum_i L_i }_\text{data loss} + \underbrace{ \lambda R(W) }_\text{regularization loss} \\\\
$$

4) **An Optimization Procedure**: This refers to the process which tweaks the weight Matrix $W_{n\times m}$ to reduce the loss function $L$. In our case, this refers to Mini-batch Gradient Descent algorithm. We adjust the weights $W_{n\times m}$, based on the gradient of the loss $L$ w.r.t. $W_{n\times m}$. This leads to:
$$
W_{t+1} = W_{t} - \alpha \frac{\partial L}{\partial W},
$$
where $\alpha$ is the learning rate. Additionally, with "mini-batch" gradient descent, instead of finding loss over the whole dataset, we use a small sample $B$ of the training data to make each learning step. Hence,
$$
W_{t+1} = W_{t} - \alpha \frac{\partial \sum_{i \in B}{L_{x_i}}}{\partial W},
$$
where $|B|$ is the batch size.

### [Task 1]
Using the solution to ScalarBackProagation, construct a Linear Layer Class with a weight matrix W.\
The Class must have the capability to:
- forward propagate given inputs
- backpropagate to compute the gradients for the update of W i.e $$ \frac{\partial{L}}{\partial{W_{i,j}}}$$
- USE ONLY NUMPY OPERATIONS AND NOT ANY ML API for training and inference. (Data loading is fine)

In [107]:
import numpy as np


In [108]:
# To solve this problem, construct the computational graph
# Write a class with forward and backward functions, for each node if you like
# For eg:
class Sigmoid():
  def __init__(self):
    self.x=None
    self.x_grad=None
    self.result=None

  def forward(self,x):
    self.x=x
    if self.result is not None:
      self.result=None
    self.result=1/(1+np.exp(-self.x))
    return self.result

    # code here for forward prop
    # save values useful for backpropagation


  def backward(self, del_f_by_del_x):
    # code here
    if self.x_grad is not None:
      self.x_grad=None
    self.x_grad=self.result*(1-self.result)
    res=del_f_by_del_x*self.x_grad
    return res


class Cosine():
  def __init__(self):
    self.x=None
    self.x_grad=None
    self.result=None

  def forward(self,x):
    # code here for forward prop
    # save values useful for backpropagation
    self.x=x
    if self.result is not None:
      self.result=None
    self.result=np.cos(self.x)
    return self.result

  def backward(self, del_f_by_del_x):
    if self.x_grad is not None:
      self.x_grad=None
    self.x_grad=-1*np.sin(self.x)
    res=del_f_by_del_x*self.x_grad
    return res
    # code here


class Sine():
  def __init__(self):
    self.x=None
    self.x_grad=None
    self.result=None

  def forward(self,x):
    # code here for forward prop
    # save values useful for backpropagation
    self.x=x
    if self.result is not None:
      self.result=None
    self.result=np.sin(self.x)
    return self.result


  # save values useful for backpropagation
  def backward(self, del_f_by_del_x):
    if self.x_grad is not None:
      self.x_grad=None
    self.x_grad=np.cos(self.x)
    res=del_f_by_del_x*self.x_grad
    return res
    # code here


class Tanh():
  def __init__(self):
    self.x=None
    self.x_grad=None
    self.result=None

  def forward(self,x):
    # code here for forward prop
    # save values useful for backpropagation
    self.x=x
    if self.result is not None:
      self.result=None
    self.result=np.tanh(self.x)
    return self.result

  def backward(self, del_f_by_del_x):
    # code here
    if self.x_grad is not None:
      self.x_grad=None
    self.x_grad=1-(np.tanh(self.x)**2)
    res=del_f_by_del_x*self.x_grad
    return res


class Log():
  def __init__(self):
    self.x=None
    self.x_grad=None
    self.result=None

  def forward(self,x):
    # code here for forward prop
    # save values useful for backpropagation
    self.x=x
    if self.result is not None:
      self.result=None
    self.result=np.log(self.x)
    return self.result


  def backward(self, del_f_by_del_x):
    # code here
    if self.x_grad is not None:
      self.x_grad=None
    self.x_grad=1.0/self.x
    res=del_f_by_del_x*self.x_grad
    return res

# CAUTION: Carefully treat the input and output dimension variation. At worst, handle them with if statements.
# Note: Do not forget to clear our the gradients for repeatable usage.

In [109]:
# Now write the class func
# which constructs the graph (all operators), forward and backward functions.

class LinearLayer():
  def __init__(self, args):
    # construct the graph here
    # assign the instances of function modules to self.var
    #Assuming args[0] to be 'm', ie the total numer of classes args[1] to be alpha
    self.w=np.random.rand(3072,10)*130
    #self.var.b=None
    self.m=args[0]
    self.alpha=args[1]
    self.n=0
    self.x=None
    self.output=[]
    self.grad_f_w=None
    pass

  def forward(self,x):
    # Using the graph element's forward functions, get the output.
    self.n=np.shape(x)[0]
    #self.var.b=np.zeros(np.shape(x)[0])
    self.x=x
    self.output=np.dot(self.w.T, x) #+ self.var.b
    output=self.output
    return output

  def backward(self, del_L_by_del_output):
    # Use the saved outputs of each module, and backward() function calls
    # updated the weight matrix
    # clears the gradients for repeatable use
    if self.grad_f_w is not None:
      self.grad_f_w=None
    #self.grad_f_w=self.var.w.T #compute the gradient
    self.w-=self.alpha * del_L_by_del_output #update the weight matrix
    print("Current weight matrix ",self.w,"\n\n")
    return self.w


In [110]:
from array import array
class Loss():
    def __init__(self) -> None:
        #self.log = Log()
        self.compute=None
        self.ypred=[]
        self.LL=LinearLayer([10,0.05])
        self.x=None
        self.a=[]
        self.b=[]
        self.y=[]
        pass

    def forward(self, x) -> float: #x is entire batch here
      self.x=x
      X_train=x[0]
      Y_train=x[1]
      final_loss=0
      for q in range(100):
        X=X_train[q].view(3072).numpy() #qth image in the batch
        X=X/255
        self.y=Y_train.numpy() #true label of the qth image in the batch
        self.ypred=self.LL.forward(X) # this is W.T@X which has dim of mX1
        self.a.append(self.ypred)
        lossexp=0 #ensuring that with every iteration the two loss terms are starttoing from 0
        lossq=0
        for li in range(10): #total number of classes
          lossexp+=np.exp(self.ypred[li])
        lossq=self.ypred[self.y[q]]
        lossq+=np.log(lossexp)
        self.b.append(lossexp)
        final_loss+=lossq #calculates the incremental loss for all 64 inputs
      final_loss/=100 #N=100
      regulariser_loss=0

      for wk in range(self.LL.w.shape[0]):
        for wl in range(self.LL.w.shape[1]):
           regulariser_loss+=self.LL.w[wk][wl]**2
      final_loss+=0.1*regulariser_loss #hardcoding Lamda to be 0.1
      out=final_loss
      return out

    def backward(self): #-> array:

      term1=np.zeros((10,3072)) #mXn same as W.T
      term2=np.zeros((3072,10)) #nXm same as W
      for q in range(100):
        term1[self.y[q]]+=-(self.LL.w.T[self.y[q]])

        term3=0
        for idx in range(10):
          term3+=np.exp(self.a[q][idx])
          term4=np.zeros((10,3072))
          term4[idx]=self.LL.w.T[idx]
          term2+=np.exp(self.a[q][idx])*term4.T
        term3=1/term3
        term2=term2*term3
      term1=term1.T
      term1+=term2
      term1/=100
      term1+=2*0.1*self.LL.w
      a=self.LL.backward(term1)
      return a




- Using the above Linear Layer, and loss function construct a Linear classifier for the CIFAR-10 dataset.\
- Convert the 3-d input (C, H, W) into a 1 d input by reshaping it.\
- You can use pytorch CIFAR-10 dataset but only as data loaders.

In [111]:
'''
Code below
'''

import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

#Creating a DataLoader object for the CIFAR-10 training dataset:
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transforms.ToTensor())
train_dataloader =DataLoader(train_dataset, batch_size=100, shuffle=True, num_workers=2)

#Creating a DataLoader object for the CIFAR-10 test dataset:
test_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transforms.ToTensor())
test_dataloader = DataLoader(test_dataset, batch_size=1,shuffle=False, num_workers=2)






Files already downloaded and verified
Files already downloaded and verified


In [112]:
#Training
model=Loss()
W=np.zeros((3072,10))
for batch in train_dataloader:
  print("Loss for batch is :",model.forward(batch),"\n\n")
  a=model.backward()


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  38.01667871]
 [42.14317698 30.18017485 11.28323631 ... 32.61910433  1.88221519
  16.25402965]
 [25.28467674 18.25360757 35.55327617 ... 20.53520662 17.31355566
  22.41819706]
 ...
 [ 4.5077499  17.86285927  6.38999386 ... 39.27062414 34.15806603
   1.44822284]
 [19.7216356  26.19664277 15.89987537 ... 38.67166699 17.9517049
  34.60157074]
 [31.64613492  1.50864804 10.67728727 ... 21.10196759  1.88219185
   4.98835852]] 


Loss for batch is : 1816617.337660582 


Current weight matrix  [[33.81844493 34.63319887 32.7558075  ...  4.60013822 24.06500154
  37.77838317]
 [41.89029067 30.02924531 11.23793507 ... 32.42303724  1.87462834
  16.15214641]
 [25.13295233 18.1623222  35.41053278 ... 20.41177349 17.24376805
  22.27767567]
 ...
 [ 4.48070048 17.77352801  6.36433858 ... 39.03457606 34.02038145
   1.43914511]
 [19.60329304 26.06563468 15.83603872 ... 38.43921912 17.87934504
  34.38468173]
 [31.45623765  1.50110337 10.6344

In [113]:
#Testing
count=0
for batch in test_dataloader:
  X_test=batch[0]
  X_test=X_test.view(3072).numpy()
  #X_test/=255
  X_test=X_test.reshape(3072,-1)
  Y_test=batch[1]
  print(a.T@X_test)
  print("Predicted class : ",np.argmax(a.T@X_test)," True Class",Y_test[0])
  if np.argmax(a.T@X_test)==Y_test.numpy()[0]:
    count+=1
print("Testing Accuracy= ",count/10000*100, "%")





[1;30;43mStreaming output truncated to the last 5000 lines.[0m
 [10613.07487385]
 [10650.59285813]
 [10679.08513127]
 [ 9655.81058574]]
Predicted class :  1  True Class tensor(9)
[[7809.96793021]
 [8015.99632071]
 [7834.07696575]
 [7904.52518323]
 [7226.1402449 ]
 [7927.84298112]
 [7722.03716601]
 [7785.95170901]
 [7786.79425479]
 [7012.74124227]]
Predicted class :  1  True Class tensor(6)
[[6567.03712141]
 [6650.29997579]
 [6554.40658296]
 [6549.85408928]
 [6074.45251214]
 [6591.37214338]
 [6473.56141129]
 [6533.19528791]
 [6536.09491539]
 [5868.0730492 ]]
Predicted class :  1  True Class tensor(4)
[[13390.95924379]
 [13644.58596604]
 [13493.64849117]
 [13481.09781446]
 [12439.06169739]
 [13566.13693869]
 [13296.37121564]
 [13313.63022233]
 [13472.88935342]
 [12134.08100593]]
Predicted class :  1  True Class tensor(2)
[[6686.38534866]
 [6687.91461107]
 [6688.76066143]
 [6665.67464258]
 [6136.53147674]
 [6662.62169372]
 [6577.29457743]
 [6579.98957228]
 [6580.77034591]
 [5937.0477422