$${\color{yellow}{\text{Deep Learning for LLMs}}}$$



---

Load essential libraries

---

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
import torch
import matplotlib.pyplot as plt
plt.style.use('dark_background')
%matplotlib inline
import sys
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.metrics import confusion_matrix
import gensim.downloader
import nltk
from nltk.tokenize import word_tokenize

---

Mount Google Drive folder if running in Google Colab

---

In [None]:
## Mount Google drive folder if running in Colab
if('google.colab' in sys.modules):
    from google.colab import drive
    drive.mount('/content/drive', force_remount = True)
    DIR = '/content/drive/MyDrive/Colab Notebooks/MAHE/Workshops/Generative AI with LLMs Workshop_December2024'
    DATA_DIR = DIR+'/Data/'
else:
    DATA_DIR = 'Data/'

---

**We will now use PyTorch to create tensors**

The patient data matrix:

![patient data matrix](https://1drv.ms/i/s!AjTcbXuSD3I3hsxIkL4V93-CGq8RkQ?embed=1&width=660)

**Notation**:

Zeroth patient vector $\mathbf{x}^{(0)}= \begin{bmatrix}72\\120\\37.3\\104\\32.5\end{bmatrix}$ and zeroth feature (heart rate vector) $\mathbf{x}_0 = \begin{bmatrix}72\\85\\68\\90\\84\\78\end{bmatrix}.$

---



In [None]:
## Create a patient data matrix as a constant tensor
X = torch.tensor([[72, 120, 37.3, 104, 32.5],
                 [85, 130, 37.0, 110, 14],
                 [68, 110, 38.5, 125, 34],
                 [90, 140, 38.0, 130, 26],
                 [84, 132, 38.3, 146, 30],
                 [78, 128, 37.2, 102, 12]])
print(X)
print(X.shape)
# X is a rank-2 tensor which is similar to a numpy 2D array
print(X[0]) # this is patient-0 info which is a rank-1 tensor
print(X[0, 2])

---

**Convert a PyTorch object into a numpy array**

---

In [None]:
X_numpy = X.numpy()
print(X_numpy)
print(type(X_numpy))
print(X_numpy.shape)

---

**Addition and subtraction of vectors, scalar multiplication (apply operation componentwise)**

![vector addition](https://1drv.ms/i/c/37720f927b6ddc34/IQQ03G17kg9yIIA3NokBAAAAAZLAaAoWwhtn8Vk26NotALo?width=256)

![vector subtraction](https://1drv.ms/i/c/37720f927b6ddc34/IQQ03G17kg9yIIA3M4kBAAAAAU_n_mAEv006QFZm_sUj2Dc?width=256)

![vector multiplication](https://1drv.ms/i/c/37720f927b6ddc34/IQQ03G17kg9yIIA3NIkBAAAAAa_qL04bLT4kWoNeHcrR9LQ?width=256)

![vector geometry1](https://1drv.ms/i/c/37720f927b6ddc34/IQSGNMr5z3SSRry7LSKL7LybAcGYuzgw5smabV8-6DudXIs?width=230)

![vector geometry2](https://1drv.ms/i/c/37720f927b6ddc34/IQQ03G17kg9yIIA3WokBAAAAAQi8FPV9YCebl5WnyEKJ3vg?width=213&height=192)


---

In [None]:
# Vector addition
print(X[1, :] + X[2, :])

# Vector subtraction
print(X[1, :] - X[2, :]) # how different patient-1 and patient-2 are

# Scalar-vector multiplication
print(X[:, 2])
print((9/5)*X[:, 2] + 32)

# Average patient (mean over all 6 patient vectors)
print((1/6)*(X[0, :] + X[1, :] + X[2, :] + X[3, :] + X[4, :] + X[5, :]))
print(torch.mean(X, dim = 0)) # dim = 0 averages down the rows, giving one mean per feature

---

Application of vector subtraction in natural language processing (NLP): download a GloVe word embedding model trained on Wikipedia and Gigaword text.

---

In [None]:
model = gensim.downloader.load('glove-wiki-gigaword-50')

---

Now we will look up the embedding vectors that the model assigns to the words *cricket* and *football*.

Next, we will do an *intuitive* subtraction of word embeddings as in

1. Cricket without Tendulkar
2. Football without Messi

Note that each embedding vector has 50 components, as indicated by the model name '**glove-wiki-gigaword-50**'.

---

In [None]:
print(model['cricket'])
print(model['football'])
a = model['cricket'] - model['tendulkar']
b = model['football'] - model['messi']
print(a)
print(b)

---

A tensor of rank 3 corresponding to 4 time stamps (hourly), 3 samples (patients), 2 features (HR and BP)

---

In [None]:
# A rank-3 patient tensor with shape (4, 3, 2)
# with meaning for
# axis-0 as 4 hourly timestamps,
# axis-1 as 3 patients, and
# axis-2 as 2 features (HR and BP)
T = torch.tensor([[[74., 128], [79, 116], [71, 116]],
                 [[78, 118], [82, 124], [72, 128]],
                 [[84, 138], [84, 130], [74, 120]],
                 [[82, 126], [76, 156], [82, 132]]])
print(T)
print(T.shape)

---

**Accessing elements of a tensor**

---

In [None]:
## Accessing elements of a tensor
# Rank-3 tensor T has axes order (timestamps, patients, features)

# Element of T at position 3 w.r.t. axis-0, position 2 w.r.t. axis-1,
# and position 1 w.r.t. axis-2 (all positions are zero-based)
print(T[3, 2, 1]) # timestamp-3, patient-2, feature-1 (BP)

print(T[1]) # element-1 of object T which is also the info for all patients at 10AM (admission at 9AM)

print(T[3, 2]) # patient-2 info at 12PM

---

The $l_2$ norm, or geometric length, of a vector, denoted $\lVert \mathbf{a}\rVert_2$, tells us how long the vector is. In 2 dimensions, $$\lVert \mathbf{a}\rVert_2 = \sqrt{a_1^2+a_2^2}$$ and in $n$ dimensions, $$\lVert \mathbf{a}\rVert_2 = \sqrt{a_1^2+a_2^2+\cdots+a_n^2}.$$

![vector norm](https://1drv.ms/i/c/37720f927b6ddc34/IQT817WmpQjlRqZ1R0d5Cfv6AUW6c4robL-gk06i9wmCaFU?width=250)
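
As a quick check of the formula, here is a minimal sketch that computes the norm directly from the definition and compares it with PyTorch's built-in routine. The vector `v` is a made-up example (not part of the patient data), chosen so the arithmetic is easy to verify by hand.

```python
import torch

# Hypothetical example vector, chosen so the result is easy to check by hand
v = torch.tensor([3.0, 4.0])

# l2 norm from the definition: square each component, sum, take the square root
norm_by_formula = torch.sqrt(torch.sum(v**2))

# Same quantity using PyTorch's built-in routine
norm_builtin = torch.linalg.norm(v)

print(norm_by_formula)  # sqrt(3^2 + 4^2) = 5
print(norm_builtin)
```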

---

In [None]:
## l2 norm of a vector
x = torch.tensor([76., 124])
print(x)
print(torch.norm(x)) # sqrt(76^2+124^2)


---

**Dot Product of Vectors**

A scalar resulting from an elementwise multiplication and addition: $$\mathbf{a}{\color{cyan}\cdot}\mathbf{b} = {\color{red}{a_1b_1}}+{\color{green}{a_2b_2}}+\cdots+{\color{magenta}{a_nb_n}}$$

The <font color="cyan">dot</font> ${\color{cyan}\cdot}$ represents the computation of the dot product.


---

In [None]:
## Dot product of vectors
a = torch.tensor([1., 2, 3])
b = torch.tensor([4., 5, 6])
print(torch.dot(a, b)) # elementwise product followed by a summation

---

The dot product is a measure of similarity between vectors (or, how aligned they are geometrically).

![dot product](https://1drv.ms/i/c/37720f927b6ddc34/IQTbcGSjdbhSTJ7J39d5BCWAAWS6-y5U6J87vHuDWeAqGwM?width=450)

---

In [None]:
a = torch.tensor([1.0, 2.0])
b = torch.tensor([2.0, 4.0])  # b is exactly aligned with a
c = torch.tensor([-2.0, 1.0]) # c is perpendicular or orthogonal to a
d = torch.tensor([-1.0, -2.0])  # d is anti-aligned with a
print(torch.dot(a, b))
print(torch.dot(a, c))
print(torch.dot(a, d))

---

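By the Cauchy-Schwarz inequality, $-1\leq\frac{\mathbf{x}\cdot{\mathbf{y}}}{\lVert\mathbf{x}\rVert_2\lVert\mathbf{y}\rVert_2}\leq1.$

This ratio is a normalized measure of similarity (or extent of alignment) between vectors: it equals $1$ for exactly aligned vectors, $0$ for orthogonal vectors, and $-1$ for anti-aligned vectors.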

Angle between vectors $\mathbf{x}$ and $\mathbf{y} = \cos^{-1}\left(\frac{\mathbf{x}\cdot{\mathbf{y}}}{\lVert\mathbf{x}\rVert_2\lVert\mathbf{y}\rVert_2}\right).$

![angle](https://1drv.ms/i/c/37720f927b6ddc34/IQQ03G17kg9yIIA3WokBAAAAAQi8FPV9YCebl5WnyEKJ3vg?width=213&height=192)
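
The normalized similarity and the angle can be wrapped in a small helper. The sketch below is illustrative only: the function name `cosine_similarity` and the test vectors are assumptions made for this example. It shows the three limiting cases (aligned, orthogonal, anti-aligned).

```python
import torch

def cosine_similarity(x, y):
    # Normalized dot product; lies in [-1, 1] by the Cauchy-Schwarz inequality
    return torch.dot(x, y) / (torch.linalg.norm(x) * torch.linalg.norm(y))

a = torch.tensor([1.0, 2.0])
cases = {
    'aligned': torch.tensor([2.0, 4.0]),
    'orthogonal': torch.tensor([-2.0, 1.0]),
    'anti-aligned': torch.tensor([-1.0, -2.0]),
}

results = {}
for name, v in cases.items():
    cos = cosine_similarity(a, v)
    # Clamp guards against round-off pushing the value slightly outside [-1, 1]
    angle_deg = torch.rad2deg(torch.acos(torch.clamp(cos, -1.0, 1.0)))
    results[name] = (cos.item(), angle_deg.item())
    print(f'{name}: cosine = {cos.item():.3f}, angle = {angle_deg.item():.1f} degrees')
```

The clamp is a defensive choice: without it, a cosine that rounds to slightly more than 1 would make `acos` return `nan`.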


---

In [None]:
x = torch.tensor([1.0, 2.0])
y = torch.tensor([2.0, 1.0])
print(torch.dot(x, y) / (torch.norm(x) * torch.norm(y))) # normalized similarity measure
print(torch.acos(torch.dot(x, y) / (torch.norm(x) * torch.norm(y)))) # angle in radians
print((180/torch.pi)*torch.acos(torch.dot(x, y) / (torch.norm(x) * torch.norm(y)))) # angle in degrees

---

Application of the Cauchy-Schwarz inequality: is "Cricket without Tendulkar" the same as "Football without Messi"?

---

In [None]:
a = model['cricket'] - model['tendulkar']
b = model['football'] - model['messi']
print(np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))) # normalized similarity
print((180/np.pi)*np.arccos(np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b)))) # angular difference in degrees
print(np.linalg.norm(a-b)) # linear difference


---

**Hadamard Product of Vectors**

A vector resulting from an elementwise multiplication: $$\mathbf{a}{\color{cyan}\otimes}\mathbf{b} = \begin{bmatrix}{\color{red}{a_1\times b_1}}\\{\color{green}{a_2\times b_2}}\\\vdots\\{\color{magenta}{a_n\times b_n}}\end{bmatrix}.$$

The <font color="cyan">$\otimes$</font> represents the computation of the Hadamard product.

---

In [None]:
## Hadamard product
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

# Element-wise multiplication (Hadamard product)
print(a * b)  # Using the * operator
print(torch.mul(a, b))  # Using torch.mul function

---

A matrix-vector product is simply a sequence of dot products of the rows of the matrix (seen as vectors) with the vector.

![matrvec product](https://1drv.ms/i/c/37720f927b6ddc34/IQQ1cQ8fZdFmS4cnGkBlsZbAAaL2zMtzWdjHe-HCMt4UTA0?width=500)
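
To make the row-by-row view concrete, the following sketch builds the product one dot product at a time and compares it with `torch.matmul`. The vector `x` here is a hypothetical choice made for this illustration.

```python
import torch

A = torch.tensor([[1.0, 2.0, 4.0],
                  [2.0, -1.0, 3.0]])
x = torch.tensor([1.0, 0.0, -1.0])  # hypothetical vector for illustration

# One dot product per row of A, stacked into a vector
row_by_row = torch.stack([torch.dot(row, x) for row in A])

# Built-in matrix-vector product
builtin = torch.matmul(A, x)

print(row_by_row)  # row 0: 1 - 4 = -3, row 1: 2 - 3 = -1
print(builtin)
```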

---

In [None]:
## Matrix-vector product
A = torch.tensor([[1.0, 2.0, 4.0],
                  [2.0, -1.0, 3.0]])
x = torch.tensor([4.0, 2.0, -2.0])

# Matrix-vector multiplication
print(A)
print(x)
print(torch.matmul(A, x))

---

Here we create a simple sentence in English and tokenize it

---

In [None]:
sentence = 'i swam quickly across the river to get to the other bank'
nltk.download('punkt_tab')
tokens = word_tokenize(sentence)
print(len(tokens))
print(tokens)

---

Generate the word embeddings for the tokens and store them in a matrix $\mathbf{X}$ such that each row of the matrix corresponds to a token.

---

In [None]:
X_word = torch.tensor(model[tokens])
np.set_printoptions(precision=3, suppress=True)
print(X_word)
print(X_word.shape)
print(X_word[1]) # embedding vector for the word "swam"

---

A matrix-matrix product is simply a sequence of matrix-vector products.

![matmatprod](https://1drv.ms/i/c/37720f927b6ddc34/IQQ-B3z7tbWHQqBrW9k2ElDVAUc5fWzM24txLkgBK7f8Yac?width=350)
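
The same idea can be checked numerically: multiply the matrix by each column of the second matrix in turn and place the results side by side. This is a minimal sketch for illustration, not how PyTorch computes the product internally.

```python
import torch

A = torch.tensor([[1.0, 2.0, 4.0],
                  [2.0, -1.0, 3.0]])
B = torch.tensor([[4.0, -1.0],
                  [2.0, 0.0],
                  [-2.0, 3.0]])

# One matrix-vector product per column of B
columns = [torch.matmul(A, B[:, j]) for j in range(B.shape[1])]

# Place the resulting vectors side by side as the columns of the product
col_by_col = torch.stack(columns, dim=1)

print(col_by_col)
print(torch.matmul(A, B))
```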


---

In [None]:
## Matrix-matrix product
A = torch.tensor([[1.0, 2.0, 4.0],
                  [2.0, -1.0, 3.0]])
B = torch.tensor([[4.0, -1.0],
                  [2.0, 0.0],
                  [-2.0, 3.0]])
torch.matmul(A, B)

---

The similarity between each pair of words represented in the word embeddings matrix $\mathbf{X}$ is the matrix-matrix product $\mathbf{X}\mathbf{X}^\mathrm{T}.$

---

In [None]:
S = torch.matmul(X_word, X_word.T)
print(S)
print(S.shape)

---

Matrix-matrix product using the patient data matrix and a weights matrix:

![Patient dataset](https://1drv.ms/i/s!AjTcbXuSD3I3hscharGu916tjWNzZQ?embed=1&width=660)

---

In [None]:
# Patients data matrix
X = torch.tensor([[72, 120, 36.5],
                  [85, 130, 37.0],
                  [68, 110, 38.5],
                  [90, 140, 38.0]])
print(X)

# Weights matrix
W = torch.tensor([[0.5, 0.3, -0.6],
                  [0.9, 0.3, -0.25],
                  [-1.5, 0.4, 0.1]])
print(W)

# Raw scores matrix (Matrix-matrix multiplication)
Z = torch.matmul(X, W) # PyTorch matmul() also does matrix-matrix multiplication
print(Z)

# The raw scores are also referred to as the logits

---

The softmax function

![softmax](https://1drv.ms/i/s!AjTcbXuSD3I3hscmdol7J2G4GDo5WQ?embed=1&width=660)
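
For intuition, softmax can be written out by hand. The sketch below assumes the standard definition (exponentiate, then divide by the row sum) and subtracts each row's maximum first, which leaves the result unchanged but avoids overflow. The helper name `softmax_rows` and the matrix `Z_demo` are made up for this example.

```python
import torch

def softmax_rows(Z):
    # Subtracting the row maximum does not change the result but avoids overflow in exp()
    Z_shifted = Z - Z.max(dim=1, keepdim=True).values
    expZ = torch.exp(Z_shifted)
    # Normalize each row so that it sums to 1
    return expZ / expZ.sum(dim=1, keepdim=True)

Z_demo = torch.tensor([[2.0, 1.0, 0.1],
                       [1000.0, 1000.0, 1000.0]])  # second row would overflow without the shift
P_demo = softmax_rows(Z_demo)

print(P_demo)
print(P_demo.sum(dim=1))                 # each row sums to 1
print(torch.softmax(Z_demo, dim=1))      # built-in result for comparison
```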

---

In [None]:
## In-built softmax function in PyTorch (dim = 1 corresponds to applying row-by-row)
## applied to the word embeddings similarity matrix
S_softmax = torch.nn.functional.softmax(S, dim = 1) # S is already a tensor, so no torch.tensor() wrapper is needed
print(S_softmax[1])

---

Transform the word embeddings using the softmax-normalized similarity matrix.
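
The effect of this transformation is easiest to see on a tiny made-up example. In the sketch below, `X_toy` is a hypothetical 3-token, 2-dimensional embedding matrix (it does not come from the GloVe model). Each row of the output is a weighted average of all rows of the input, with the weights taken from the softmax-normalized similarities.

```python
import torch

# Hypothetical embeddings: tokens 0 and 1 are similar, token 2 is different
X_toy = torch.tensor([[1.0, 0.0],
                      [0.9, 0.1],
                      [0.0, 1.0]])

S_toy = torch.matmul(X_toy, X_toy.T)        # pairwise dot-product similarities, shape (3, 3)
weights = torch.softmax(S_toy, dim=1)       # each row becomes a set of weights summing to 1
Y_toy = torch.matmul(weights, X_toy)        # each output row mixes all input rows

print(weights)
print(Y_toy)
```

Because every row of `weights` is non-negative and sums to 1, each transformed vector stays within the range spanned by the original embeddings.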

---

In [None]:
X_word = torch.tensor(model[tokens])
Y = torch.matmul(S_softmax, X_word)
print(Y)

In [None]:
## In-built softmax function in PyTorch (dim = 1 corresponds to row-by-row)
## applied to the toy patient data matrix
softmax_scores = torch.nn.functional.softmax(Z, dim = 1)
print(softmax_scores)

---

A toy data matrix with output labels and an initial weights matrix for the softmax classifier:

![data for softmax](https://1drv.ms/i/s!AjTcbXuSD3I3hspfrgklysOtJMOjaA?embed=1&width=660)

---

In [None]:
# Create the data matrix (read from a file typically)
X = np.array([[72, 120, 37.3, 104, 32.5],
              [85, 130, 37.0, 110, 14],
              [68, 110, 38.5, 125, 34],
              [90, 140, 38.0, 130, 26],
              [84, 132, 38.3, 146, 30],
              [78, 128, 37.2, 102, 12]])

# Standardize the data matrix
sc = StandardScaler()
X_S = sc.fit_transform(X)  # fit(), fit_transform(), transform()

# Convert to a PyTorch tensor
X_S = torch.tensor(X_S, dtype=torch.float32)

# Get the number of samples and features
num_samples, num_features = X_S.shape

# Create the output labels vector (also read from a file typically)
y = np.array(['non-diabetic',
              'diabetic',
              'non-diabetic',
              'pre-diabetic',
              'diabetic',
              'pre-diabetic'])

# One-hot encoding of output labels using scikit-learn
ohe = OneHotEncoder(sparse_output=False)  # Use `sparse_output=False` for dense array
Y = ohe.fit_transform(y.reshape(-1, 1))

# Convert to a PyTorch tensor
Y = torch.tensor(Y, dtype=torch.float32)

# Get the number of labels
num_labels = Y.shape[1]

# Create the weights matrix
W = torch.tensor([[-0.1, 0.5, 0.3],
                  [0.9, 0.3, 0.5],
                  [-1.5, 0.4, 0.1],
                  [0.1, 0.1, -1.0],
                  [-1.2, 0.5, -0.8]], dtype=torch.float32)

print(X_S)
print(Y)
print(W)

---

Bias trick to absorb the bias into the weights matrix

![bias trick](https://1drv.ms/i/c/37720f927b6ddc34/IQR8NDbhvaddQa3W3F_46q4nATD7WBNgnwGJ7QC6HDL6g14?width=550)
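
A short numerical check shows why the trick works: appending a column of ones to the input and a row holding the bias to the weights gives the same raw scores as adding the bias separately. The matrices below are random placeholders generated for this sketch, not the toy patient data.

```python
import torch

torch.manual_seed(0)
X_demo = torch.randn(6, 5)          # placeholder data: 6 samples, 5 features
W_demo = torch.randn(5, 3)          # placeholder weights: 5 features, 3 labels
b_demo = 0.1 * torch.ones(3)        # one bias value per label

# Bias trick: extra column of ones in the data, extra row (the bias) in the weights
X_aug = torch.cat([X_demo, torch.ones(6, 1)], dim=1)        # shape (6, 6)
W_aug = torch.cat([W_demo, b_demo.reshape(1, 3)], dim=0)    # shape (6, 3)

Z_trick = torch.matmul(X_aug, W_aug)
Z_direct = torch.matmul(X_demo, W_demo) + b_demo            # bias broadcast across rows

print(torch.allclose(Z_trick, Z_direct, atol=1e-6))
```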

---

In [None]:
## Bias trick to absorb the bias into the weights matrix
# Concatenate a column of ones to X_S (bias term)
X_B = torch.cat([X_S, torch.ones((num_samples, 1))], dim=1)

# Create the bias vector `b`
b = 0.1 * torch.ones((1, num_labels))

# Concatenate the weights matrix `W` with the bias vector `b`
W_B = torch.cat([W, b], dim=0)

print(X_B)
print(W_B)

---

Forward propagation for the toy patient dataset: $$\textbf{bias-added input }\mathbf{X}_B\,{\color{yellow}\longrightarrow}\,\textbf{raw scores }\mathbf{Z}=\mathbf{X}_B\textbf{W}_B\,{\color{yellow}\longrightarrow}\,\textbf{softmax activated scores }\mathbf{A}=\text{softmax}(\mathbf{Z}).$$

---

In [None]:
# Raw scores matrix
Z = torch.matmul(X_B, W_B) # also called logits
print(Z)

# Softmax activated scores
A = torch.nn.functional.softmax(Z, dim = 1)

# Predicted probabilities for each sample
print(A)

# True output label for each sample
print(Y)

---

The loss for each sample can be quantified using the categorical crossentropy (CCE) loss function, which is defined as $$\color{yellow}{-\log(\text{predicted probability that a sample belongs to its correct class})}$$

For example, consider a sample with

- true_label = [$\color{yellow}{1}$, 0, 0]
- predicted_label = [$\color{yellow}{0.05}$, 0.90, 0.05]

categorical crossentropy loss = $-\log(\color{yellow}{0.05}).$

Here, we calculate the CCE loss for each sample and then average over all samples.
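
To get a feel for the numbers, the sketch below evaluates the per-sample loss for two hypothetical predictions on the same true label: a poor one that gives the correct class probability 0.05 and a good one that gives it 0.90. The probability values are illustrative assumptions.

```python
import math

true_label = [1, 0, 0]

# Hypothetical predicted probabilities (each list sums to 1)
poor_prediction = [0.05, 0.90, 0.05]
good_prediction = [0.90, 0.05, 0.05]

def cce(true_label, predicted):
    # Pick out the probability assigned to the correct class and take -log (natural log)
    p_correct = sum(t * p for t, p in zip(true_label, predicted))
    return -math.log(p_correct)

loss_poor = cce(true_label, poor_prediction)
loss_good = cce(true_label, good_prediction)
print(f'poor prediction: loss = {loss_poor:.4f}')  # -log(0.05), roughly 3.0
print(f'good prediction: loss = {loss_good:.4f}')  # -log(0.90), roughly 0.1
```

The loss is small when the model is confident and correct, and grows without bound as the probability of the correct class approaches zero.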

---

In [None]:
## Calculate average CCE loss
loss = torch.mean(-torch.log(torch.sum(Y * A, dim = 1)))
print(loss)

# Using the PyTorch in-built function for CCE loss
loss_fn = torch.nn.CrossEntropyLoss()
loss = loss_fn(Z, torch.argmax(Y, dim = 1))
print(loss)

---

Applying the gradient descent method with

- a maximum number of iterations equal to 1000
- a stopping tolerance equal to $10^{-6}$
- a learning rate of 0.01

 to minimize $$L(\mathbf{w}) = (w_1-2)^2+(w_2+3)^2$$ starting from $\mathbf{w} = \begin{bmatrix}w_1\\w_2\end{bmatrix}=\begin{bmatrix}0\\0\end{bmatrix}.$
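
For this particular loss the gradient is available in closed form, $\nabla L(\mathbf{w}) = \begin{bmatrix}2(w_1-2)\\2(w_2+3)\end{bmatrix}$, so the method can be sketched in plain Python without automatic differentiation. This is an illustrative variant written under that assumption; the minimizer is $\mathbf{w} = \begin{bmatrix}2\\-3\end{bmatrix}$.

```python
import math

def grad(w1, w2):
    # Analytic gradient of L(w) = (w1 - 2)^2 + (w2 + 3)^2
    return 2 * (w1 - 2), 2 * (w2 + 3)

w1, w2 = 0.0, 0.0
maxiter, tol, lr = 1000, 1e-6, 0.01
norm_grad = float('inf')

k = 0
while k < maxiter and norm_grad > tol:
    g1, g2 = grad(w1, w2)
    norm_grad = math.hypot(g1, g2)   # gradient norm at the current point
    w1 -= lr * g1                    # gradient descent update
    w2 -= lr * g2
    k += 1

print(f'stopped after {k} iterations at w = ({w1:.6f}, {w2:.6f})')
```

Each update shrinks the distance to the minimizer by a factor of $1 - 2\cdot 0.01 = 0.98$, which is why the tolerance is reached before the iteration limit.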

---

In [None]:
# Initialize weights as tensors with gradients
w = torch.tensor([0.0, 0.0], requires_grad=True)

# Hyperparameters
maxiter = 1000
tol = 1e-06
lr = 1e-02
norm_grad = float('inf')

k = 0
while k < maxiter and norm_grad > tol:
    # Zero the gradients
    if w.grad is not None:
        w.grad.zero_()

    # Define the loss function
    L = (w[0] - 2)**2 + (w[1] + 3)**2

    # Backpropagate to compute gradients
    L.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w -= lr * w.grad

    # Compute the norm of the gradient
    norm_grad = w.grad.norm().item()
    k += 1

    print(f'Iteration {k}: ||grad|| = {norm_grad}')

---

We will consider again the same toy data matrix with 6 samples and 3 possible output labels:


![data for softmax](https://1drv.ms/i/s!AjTcbXuSD3I3hsxIkL4V93-CGq8RkQ?embed=1&width=660)

---

Define the linear layer (dense layer) where the raw scores are calculated through the linear operation:
$$\underbrace{\mathbf{Z}}_{\color{red}{6\times3}} = \underbrace{\begin{bmatrix}{\mathbf{z}^{(0)}}^\mathrm{T}\\{\mathbf{z}^{(1)}}^\mathrm{T}\\{\mathbf{z}^{(2)}}^\mathrm{T}\\{\mathbf{z}^{(3)}}^\mathrm{T}\\{\mathbf{z}^{(4)}}^\mathrm{T}\\{\mathbf{z}^{(5)}}^\mathrm{T}\end{bmatrix}}_{\color{red}{6\times3}}
=\underbrace{\begin{bmatrix}{\mathbf{x}^{(0)}}^\mathrm{T}\mathbf{W}+{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\mathbf{x}^{(1)}}^\mathrm{T}\mathbf{W}+{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\mathbf{x}^{(2)}}^\mathrm{T}\mathbf{W}+{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\mathbf{x}^{(3)}}^\mathrm{T}\mathbf{W}+{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\mathbf{x}^{(4)}}^\mathrm{T}\mathbf{W}+{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\mathbf{x}^{(5)}}^\mathrm{T}\mathbf{W}+{\color{orange}{\mathbf{b}}^\mathrm{T}}\end{bmatrix}}_{\color{red}{6\times3}}
= \underbrace{\begin{bmatrix}{\mathbf{x}^{(0)}}^\mathrm{T}\mathbf{W}\\{\mathbf{x}^{(1)}}^\mathrm{T}\mathbf{W}\\{\mathbf{x}^{(2)}}^\mathrm{T}\mathbf{W}\\{\mathbf{x}^{(3)}}^\mathrm{T}\mathbf{W}\\{\mathbf{x}^{(4)}}^\mathrm{T}\mathbf{W}\\{\mathbf{x}^{(5)}}^\mathrm{T}\mathbf{W}\end{bmatrix}}_{\color{red}{6\times3}} + \underbrace{\begin{bmatrix}{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\color{orange}{\mathbf{b}}^\mathrm{T}}\end{bmatrix}}_{\color{red}{6\times3}}
=\underbrace{\begin{bmatrix}{\mathbf{x}^{(0)}}^\mathrm{T}\\{\mathbf{x}^{(1)}}^\mathrm{T}\\{\mathbf{x}^{(2)}}^\mathrm{T}\\{\mathbf{x}^{(3)}}^\mathrm{T}\\{\mathbf{x}^{(4)}}^\mathrm{T}\\{\mathbf{x}^{(5)}}^\mathrm{T}\end{bmatrix}}_{\color{red}{6\times5}}\underbrace{\mathbf{W}}_{\color{red}{5\times3}} + \underbrace{\begin{bmatrix}{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\color{orange}{\mathbf{b}}^\mathrm{T}}\end{bmatrix}}_{\color{red}{6\times3}}
=\underbrace{\underbrace{\mathbf{X}}_{6\times 5}\underbrace{\mathbf{W}}_{5\times 3}}_{\color{red}{6\times3}} + \underbrace{\begin{bmatrix}{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\color{orange}{\mathbf{b}}^\mathrm{T}}\\{\color{orange}{\mathbf{b}}^\mathrm{T}}\end{bmatrix}}_{\color{red}{6\times3}}.$$
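
The same linear operation is what PyTorch's built-in `torch.nn.Linear` computes, with one difference in convention: it stores its weight with shape (outputs, inputs), i.e. the transpose of $\mathbf{W}$ as written here. The sketch below uses random placeholder matrices (not the patient data) to check that the two agree once the weight is transposed.

```python
import torch

torch.manual_seed(0)
X_demo = torch.randn(6, 5)   # placeholder data: 6 samples, 5 features
W_demo = torch.randn(5, 3)   # weights in the (features, nodes) layout used above
b_demo = torch.randn(3)      # one bias per node

# Manual linear operation: b is broadcast across the 6 rows
Z_manual = torch.matmul(X_demo, W_demo) + b_demo

# Built-in layer; its weight has shape (out_features, in_features)
linear = torch.nn.Linear(in_features=5, out_features=3)
with torch.no_grad():
    linear.weight.copy_(W_demo.T)
    linear.bias.copy_(b_demo)
Z_builtin = linear(X_demo).detach()

print(Z_manual.shape)
print(torch.allclose(Z_manual, Z_builtin, atol=1e-6))
```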

---

In [None]:
class LinearLayer(torch.nn.Module):
    def __init__(self, input_dim, nodes = 2):
        super(LinearLayer, self).__init__()  # Initialize the parent class (nn.Module)
        self.nodes = nodes
        # Define the weights and bias as parameters
        self.W = torch.nn.Parameter(torch.randn(input_dim, self.nodes))
        torch.nn.init.xavier_uniform_(self.W)  # Xavier uniform initialization
        self.b = torch.nn.Parameter(torch.randn(self.nodes))  # Random Normal initialization

    def forward(self, input):
        # Linear transformation (input * W + b)
        output = torch.matmul(input, self.W) + self.b
        return output

---

Defining a LinearLayer object and calling the forward() method applied to the toy patient data matrix.

---

In [None]:
layer1 = LinearLayer(num_features, 3)
print(layer1.W)
print(layer1.b)
layer1.forward(X_S) # X_S is already a float32 tensor

---

Define the softmax layer

---

In [None]:
class SoftmaxLayer(torch.nn.Module):
    def __init__(self):
        super(SoftmaxLayer, self).__init__()
        self.activation = torch.nn.Softmax(dim = 1)

    def forward(self, input):
        output = self.activation(input)  # Apply softmax activation
        return output

---

Defining a SoftmaxLayer object and calling the forward() method applied to the toy patient data matrix.

---

In [None]:
actlayer1 = SoftmaxLayer()
print(actlayer1.activation)
actlayer1.forward(layer1.forward(X_S)) # X_S is already a float32 tensor

---

Define the softmax classifier model

---

In [None]:
class SoftmaxClassifierModel(torch.nn.Module):
    def __init__(self, input_dim, nodes=2):
        super(SoftmaxClassifierModel, self).__init__()
        self.nodes = nodes
        self.linearLayer = LinearLayer(input_dim, self.nodes)  # Linear layer
        self.softmaxLayer = SoftmaxLayer()  # Softmax activation layer

    def forward(self, input):
        output = self.linearLayer(input)  # Forward pass through the linear layer
        output = self.softmaxLayer(output)  # Apply softmax activation
        return output

---

Perform forward propagation on the toy patient dataset using the SoftmaxClassifierModel built above.

---

In [None]:
model = SoftmaxClassifierModel(num_features, 3)
print(model(X_S)) # X_S is already a float32 tensor

---

Define loss function (categorical crossentropy).

---

In [None]:
def loss_fn(true_labels, predicted_probs):
    # Average over samples of -log(probability assigned to the correct class)
    loss = torch.mean(-torch.log(torch.sum(true_labels * predicted_probs, dim = 1)))
    return loss

---

Apply the softmax classifier model to the toy data set and calculate the loss.

---

In [None]:
## Apply the softmax classifier model to the toy data set and calculate the loss
# Instantiate the model object
model = SoftmaxClassifierModel(num_features, 3) # invokes the constructor and sets up the layers

# Calculate average data loss
loss_fn(Y, model(X_S)) # X_S is already a float32 tensor

---

Softmax classifier for the [MNIST](https://www.tensorflow.org/datasets/catalog/mnist) dataset

---

In [None]:
## Load MNIST data (note the shapes of X_train and y_train)
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
print(X_train.shape)
print(y_train.shape)

In [None]:
## Reshape X_train and X_test such that the samples are along the rows
X_train_reshaped = X_train.reshape(X_train.shape[0], X_train.shape[1]*X_train.shape[2])
X_test_reshaped = X_test.reshape(X_test.shape[0], X_test.shape[1]*X_test.shape[2])

In [None]:
## Problem parameters
num_samples_train = X_train_reshaped.shape[0]
num_samples_test = X_test_reshaped.shape[0]
num_features = X_train_reshaped.shape[1]
num_labels = len(np.unique(y_train))
print(f'No. of training samples = {num_samples_train},\
 No. of test samples = {num_samples_test}, \
 no. of features = {num_features}, no. of labels = {num_labels}')

In [None]:
## One-hot encode output labels using scikit-learn (observe the shape of Y_train)
ohe = OneHotEncoder(sparse_output=False)
Y_train = torch.tensor(ohe.fit_transform(y_train.reshape(-1, 1)), dtype = torch.float32)
Y_test = torch.tensor(ohe.transform(y_test.reshape(-1, 1)), dtype = torch.float32)

In [None]:
## Min-max scale the images using scikit-learn
mms = MinMaxScaler()
X_train_reshaped_scaled = torch.tensor(mms.fit_transform(X_train_reshaped), dtype=torch.float32)
X_test_reshaped_scaled = torch.tensor(mms.transform(X_test_reshaped), dtype=torch.float32)

---

Train the softmax classifier on the MNIST dataset

---

In [None]:
## Train the softmax classifier on the MNIST dataset
# Initialize model
model = SoftmaxClassifierModel(num_features, num_labels)

# Gradient descent
maxiter = 250
lr = 1e-03

# Define loss function: categorical crossentropy computed on predicted probabilities.
# The model already applies softmax, so torch.nn.CrossEntropyLoss (which expects
# raw logits as its first argument) is not used here.
def loss_fn(true_labels, predicted_probs):
    # Small constant guards against log(0)
    return torch.mean(-torch.log(torch.sum(true_labels * predicted_probs, dim = 1) + 1e-12))

# Define optimizer (RMSprop)
optimizer = torch.optim.RMSprop(model.parameters(), lr = lr)

# Lists to store training and test losses
loss_train = [None] * maxiter
loss_test = [None] * maxiter

# Start training loop
for k in range(maxiter):
    model.train()  # Set model to training mode

    # Forward pass: compute predicted probabilities
    Yhat = model(X_train_reshaped_scaled)  # predicted probabilities

    # Compute training loss
    L_train = loss_fn(Y_train, Yhat)  # arguments: one-hot true labels, predicted probabilities

    # Append training and test loss values
    loss_train[k] = L_train.item()  # Convert to scalar
    model.eval()  # Set model to evaluation mode
    with torch.no_grad():  # Disable gradient calculation for testing
        Yhat_test = model(X_test_reshaped_scaled)  # predicted probabilities for test set
        L_test = loss_fn(Y_test, Yhat_test)  # Compute test loss
    loss_test[k] = L_test.item()  # Convert to scalar

    # Print losses
    print(f'Iteration {k+1}, Training loss = {loss_train[k]}, Test loss = {loss_test[k]}')

    # Backward pass: compute gradients
    optimizer.zero_grad()  # Zero the gradients before the backward pass
    L_train.backward()  # Backpropagate the gradients

    # Update model parameters using optimizer
    optimizer.step()  # Perform one optimization step

---

Plot training and test loss in the same figure

---

In [None]:
## Plot the training and test loss
fig, ax = plt.subplots(1, 1, figsize = (4, 4))
ax.plot(loss_train, 'b', label = 'Train')
ax.plot(loss_test, 'r', label = 'Test')
ax.set_xlabel('Iteration')
ax.set_ylabel('Loss')
ax.legend();

---

Assess model performance on test data

---

In [None]:
## Assess model performance on test data
Yhat = model(X_test_reshaped_scaled)

ypred = np.array(torch.argmax(Yhat, axis = 1)) # predicted labels for the test samples
ytrue = np.array(torch.argmax(Y_test, axis = 1)) # true labels for the test samples
print('Accuracy on test data = %3.2f'%(np.mean(ytrue == ypred)*100))
# Print confusion matrix
print(confusion_matrix(ytrue, ypred))

In [None]:
## Plot a random test sample with its predicted label printed above the plot
test_index = np.random.choice(X_test.shape[0])
fig, ax = plt.subplots(1, 1, figsize = (2, 2))
print(f'Image classified as {ypred[test_index]}')
ax.imshow(X_test_reshaped_scaled[test_index].reshape(28, 28).numpy(), cmap = 'gray'); # reshape the PyTorch tensor directly

---

Define a nonlinear activation layer with ReLU activation
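
ReLU keeps positive values unchanged and replaces negative values with zero, i.e. $\text{ReLU}(z) = \max(0, z)$ applied elementwise. A minimal sketch with a made-up input vector compares a hand-written version against PyTorch's built-in module:

```python
import torch

z = torch.tensor([-2.0, -0.5, 0.0, 1.5])  # hypothetical raw scores

# Elementwise max(0, z) written out explicitly
relu_manual = torch.maximum(z, torch.zeros_like(z))

# Built-in ReLU module
relu_builtin = torch.nn.ReLU()(z)

print(relu_manual)   # negatives become 0, positives pass through
print(relu_builtin)
```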

---

In [None]:
class ReLULayer(torch.nn.Module):
    def __init__(self):
        super(ReLULayer, self).__init__()
        self.activation = torch.nn.ReLU()

    def forward(self, input):
        output = self.activation(input)  # Apply ReLU activation
        return output

---

Define a one hidden layer neural network model

---

In [None]:
class NeuralNetworkModel(torch.nn.Module):
    def __init__(self, input_dim, hidden_nodes = 2, nodes = 2):
        super(NeuralNetworkModel, self).__init__()
        self.hidden_nodes = hidden_nodes
        self.nodes = nodes
        self.linearLayer1 = LinearLayer(input_dim, self.hidden_nodes)  # 1st Linear layer
        self.actlayer1 = ReLULayer() # 1st activation layer (ReLU)
        self.linearLayer2 = LinearLayer(self.hidden_nodes, self.nodes)  # 2nd Linear layer
        self.softmaxLayer = SoftmaxLayer()  # Softmax activation layer

    def forward(self, input):
        output = self.linearLayer1(input)  # Forward pass through the 1st linear layer
        output = self.actlayer1(output) # ReLU activation
        output = self.linearLayer2(output)  # Forward pass through the 2nd linear layer
        output = self.softmaxLayer(output)  # Apply softmax activation
        return output

---

Perform forward propagation on the toy patient dataset using the NeuralNetworkModel built above.

---

In [None]:
# num_features now holds the MNIST value (784), so take the input size from X_S itself
model = NeuralNetworkModel(X_S.shape[1], 4, 3) # 4 nodes in hidden layer
print(model(X_S)) # X_S is already a float32 tensor

---

Train the 1-hidden layer neural network classifier on the MNIST dataset

---

In [None]:
## Exercise: train the 1-hidden-layer neural network classifier on the MNIST dataset
# Initialize model
model = NeuralNetworkModel(num_features, 4, num_labels)

# Gradient descent
maxiter = 1000
lr = 1e-03

# Define loss function: categorical crossentropy computed on predicted probabilities.
# The model already applies softmax, so torch.nn.CrossEntropyLoss (which expects
# raw logits as its first argument) is not used here.
def loss_fn(true_labels, predicted_probs):
    # Small constant guards against log(0)
    return torch.mean(-torch.log(torch.sum(true_labels * predicted_probs, dim = 1) + 1e-12))

# Define optimizer (RMSprop)
optimizer = torch.optim.RMSprop(model.parameters(), lr = lr)

# Lists to store training and test losses
loss_train = [None] * maxiter
loss_test = [None] * maxiter

# Start training loop
for k in range(maxiter):
    model.train()  # Set model to training mode

    # Forward pass: compute predicted probabilities
    Yhat = model(X_train_reshaped_scaled)  # predicted probabilities

    # Compute training loss
    L_train = loss_fn(Y_train, Yhat)  # arguments: one-hot true labels, predicted probabilities

    # Append training and test loss values
    loss_train[k] = L_train.item()  # Convert to scalar
    model.eval()  # Set model to evaluation mode
    with torch.no_grad():  # Disable gradient calculation for testing
        Yhat_test = model(X_test_reshaped_scaled)  # predicted probabilities for test set
        L_test = loss_fn(Y_test, Yhat_test)  # Compute test loss
    loss_test[k] = L_test.item()  # Convert to scalar

    # Print losses
    print(f'Iteration {k+1}, Training loss = {loss_train[k]}, Test loss = {loss_test[k]}')

    # Backward pass: compute gradients
    optimizer.zero_grad()  # Zero the gradients before the backward pass
    L_train.backward()  # Backpropagate the gradients

    # Update model parameters using optimizer
    optimizer.step()  # Perform one optimization step

---

The following sequence of code snippets is mostly derived from Andrej Karpathy's superb tutorial *Let's build GPT: from scratch, in code, spelled out*, available at https://youtu.be/kCc8FmEb1nY?feature=shared

Additional modifications elaborate on details from Karpathy's tutorial.

---

Load, print, and look at the Shakespeare dataset

---

In [None]:
# We always start with a dataset to train on. Let's download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [None]:
# Read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [None]:
print("length of dataset in characters: ", len(text))

In [None]:
# Let's look at the first 1000 characters
print(text[:1000])

In [None]:
# Here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)

In [None]:
# Create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

---


OpenAI's tiktoken library

---

In [None]:
# OpenAI's tiktoken library (the GPT-2 encoding has a vocabulary of 50257 tokens)
!pip install tiktoken
import tiktoken

In [None]:
# Using tiktoken's GPT-2 encoding to tokenize
enc = tiktoken.get_encoding('gpt2')
print(enc.n_vocab)
enc.encode("hii there")

In [None]:
# Let's now encode the entire Shakespeare dataset and store it into a torch.Tensor
data = torch.tensor(encode(text), dtype = torch.long)
print(data.shape, data.dtype)
print(data[:1000]) # the 1000 characters we looked at earlier will look like this to the GPT

In [None]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

In [None]:
block_size = 8
train_data[:block_size+1]

In [None]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

In [None]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

---

A quick introduction to embeddings in PyTorch: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
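
In short, `torch.nn.Embedding` is a trainable lookup table: each integer index selects one row of its weight matrix. The sketch below uses made-up sizes (5 tokens, 3 dimensions) to show that the layer's output matches plain row indexing of the weight.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes for illustration: a vocabulary of 5 tokens, 3-dimensional embeddings
emb = torch.nn.Embedding(num_embeddings=5, embedding_dim=3)

# A batch of token indices with shape (B, T) = (2, 2)
idx = torch.tensor([[0, 2],
                    [4, 4]])

out = emb(idx)   # one embedding vector per index, shape (B, T, C) = (2, 2, 3)

print(out.shape)
print(torch.equal(out, emb.weight[idx]))  # lookup is the same as indexing rows of the weight
```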

---

Build a simple Bigram model class
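
A useful sanity check before training: a model that assigns equal probability to every token has a cross-entropy loss of $\ln(\text{vocab\_size})$, so an untrained model's loss should be in that neighborhood or somewhat above it. The sketch below assumes a vocabulary of 65 characters (an assumption about this dataset; use the `vocab_size` printed earlier) and mimics a uniform prediction with all-zero logits.

```python
import math
import torch

vocab_size_demo = 65   # assumption: number of unique characters in the dataset
B, T = 4, 8            # batch size and context length, as in the batch example

# All-zero logits give a uniform distribution over the vocabulary after softmax
logits_demo = torch.zeros(B * T, vocab_size_demo)
targets_demo = torch.randint(vocab_size_demo, (B * T,))  # arbitrary target indices

loss_uniform = torch.nn.functional.cross_entropy(logits_demo, targets_demo)

print(loss_uniform.item())         # equals ln(65), about 4.17
print(math.log(vocab_size_demo))
```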

---

In [None]:
torch.manual_seed(1337)

class BigramLanguageModel(torch.nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = torch.nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = torch.nn.functional.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = torch.nn.functional.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

---


Train the bigram model

---

In [None]:
# Optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

batch_size = 32
for steps in range(10000): # increase number of steps for good results...

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())

---

Generate a new sequence of tokens using the trained bigram model

---

In [None]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))