# TRANSFORMERS

### LOAD DEPENDENCIES AND DEFINE CONSTANTS

The first step is to import NumPy. NumPy is a Python library that lets us work with matrices and perform operations such as matrix multiplication, the Hadamard (element-wise) product, and transposition. It also lets us do intuitive things with matrices, like adding and subtracting them, without nested for loops.

In [38]:
import numpy as np

Next we need to define our constants.

Our first constant is EMBEDDING_DIMS. This is how many dimensions we transform each of our inputs into.

In [39]:
EMBEDDING_DIMS = 2

Our next constant is VOCAB_SIZE. This is how many possible inputs we could have, and how many output dimensions we will transform into.

In [40]:
VOCAB_SIZE = 4

Our third constant is NUM_HEADS. This is how many heads (concurrent sets of weights) our self-attention mechanism will have.

In [41]:
NUM_HEADS = 3

Our fourth constant is NUM_BLOCKS. This is how many transformer blocks we will have. Stacking blocks helps us calculate more complex relationships within sequences, just as multiple layers would in any other type of neural network.

In [42]:
NUM_BLOCKS = 2

### TRANSFORMER FUNCTIONALITY

The first thing we need to define is our embedding matrix, which we'll call embeddingMat. This is the matrix that transforms our inputs: row i of embeddingMat holds the embedding of the input with index i.

In [43]:
np.random.seed(27)

#Defines our embedding matrix where each element is drawn from a random uniform distribution on the range [-1, 1) with
#VOCAB_SIZE number of rows and EMBEDDING_DIMS number of columns
embeddingMat = np.random.uniform(-1, 1, (VOCAB_SIZE, EMBEDDING_DIMS))

#Let's take a look at what embeddingMat looks like
embeddingMat

array([[-0.14855718,  0.62916748],
       [ 0.47079458,  0.7360064 ],
       [-0.23323845,  0.95891326],
       [ 0.78638869, -0.58056966]])
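
To make the row lookup concrete, here is a minimal sketch. The names `toy_embedding` and `tokens` are hypothetical and the values are made up for readability. Indexing a matrix with a list of labels returns one row per label, which gives the same result as multiplying one-hot vectors by the matrix.

```python
import numpy as np

# Hypothetical 4 x 2 embedding matrix with easy-to-read values
toy_embedding = np.array([[0.0, 0.1],
                          [1.0, 1.1],
                          [2.0, 2.1],
                          [3.0, 3.1]])

# A hypothetical input sequence of labels
tokens = [2, 0, 2]

# Integer-array indexing pulls out one row per label
looked_up = toy_embedding[tokens]

# The same result via one-hot vectors and a matrix product
one_hot = np.eye(len(toy_embedding))[tokens]
via_matmul = one_hot @ toy_embedding

print(looked_up.shape)                      # (3, 2)
print(np.allclose(looked_up, via_matmul))   # True
```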

Self-attention mechanisms weight the input in 3 ways, called queries, keys, and values. Each one is a learned linear transformation of the input vector. The dot product of each query with each key gives a weight, and those weights determine how much each value vector contributes to the output. This is the self-attention mechanism.
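
In vectorized form, the mechanism described above can be sketched as follows. This is a minimal single-head sketch rather than the notebook's implementation: the function name `self_attention`, the weight names `w_q`, `w_k`, `w_v`, and the random inputs are assumptions for illustration, and the scores are divided by the square root of the key dimensionality, which is a common convention.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence x of shape (seq_len, dims)."""
    q = x @ w_q.T          # queries, one row per sequence element
    k = x @ w_k.T          # keys
    v = x @ w_v.T          # values

    # Raw scores: the dot product of every query with every key, scaled down
    scores = (q @ k.T) / np.sqrt(k.shape[1])

    # Row-wise softmax so that each row of weights sums to 1
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

    # Each output is a weighted sum of the value vectors
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, (3, 2))                       # 3 elements, 2 dims each
w_q, w_k, w_v = (rng.uniform(-1, 1, (2, 2)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape)             # (3, 2)
print(weights.sum(axis=1))   # each row sums to 1
```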


Instead of having a separate query, key, and value matrix for each attention head, then concatenating the outputs of all the heads and transforming them back into the original dimensions, we use matrices that are NUM_HEADS times as tall as a single head needs. In doing this, we pre-concatenate our heads, so we can be more memory efficient in transforming them back to the correct dimensionality.
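
The sketch below illustrates why the tall matrix is equivalent to concatenating per-head projections. The per-head matrices and the variable names (`per_head`, `stacked`) are hypothetical; only the shapes mirror the constants defined earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dims, num_heads = 2, 3

# One hypothetical (embedding_dims x embedding_dims) query matrix per head
per_head = [rng.uniform(-1, 1, (embedding_dims, embedding_dims))
            for _ in range(num_heads)]

# Stack the heads vertically into one tall matrix
stacked = np.vstack(per_head)

# One embedded input vector
x = rng.uniform(-1, 1, embedding_dims)

# Projecting with each head and concatenating the results...
concatenated = np.concatenate([w @ x for w in per_head])

# ...matches a single product with the stacked matrix
print(stacked.shape)                            # (6, 2)
print(np.allclose(stacked @ x, concatenated))   # True
```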

In [44]:
class Param:
    #Defines the lists of our query, key, value, and unifyHeads matrices. We need a separate matrix for each block
    queryMats = []
    keyMats   = []
    valueMats = []
    unifyMats = []

    #Defines the height of our pre-concatenated matrices
    matHeight = EMBEDDING_DIMS * NUM_HEADS

    #We perform the following operations, appending and initializing, NUM_BLOCKS number of times
    for _ in range(NUM_BLOCKS):

        #Defines our query, key, and value matrices
        queryMats.append( np.random.uniform( -1, 1, ( matHeight, EMBEDDING_DIMS ) ) )
        keyMats.append(   np.random.uniform( -1, 1, ( matHeight, EMBEDDING_DIMS ) ) )
        valueMats.append(  np.random.uniform( -1, 1, ( matHeight, EMBEDDING_DIMS ) ) )

        #Defines the matrix that will unify our heads, transforming our output into the correct dimensionality
        unifyMats.append( np.random.uniform( -1, 1, ( EMBEDDING_DIMS, matHeight ) ) )

    outputWeights = np.random.uniform(-1, 1, ( VOCAB_SIZE, EMBEDDING_DIMS ) )

    #stores the values of queries, keys, and values for each block
    blockQueries = []
    blockKeys    = []
    blockValues  = []

    #stores the weighted values of the transformer blocks
    weightedValues = []

    #stores the weighted output values once they are passed through the unify heads layer
    blockOutputs = []

    #Initializes our gradients
    dL_dZ              = np.zeros( VOCAB_SIZE )
    dL_dOutputWeights  = np.zeros_like( outputWeights )
    dL_dUnifyMats      = np.zeros_like( unifyMats[-1] )

    #Initializes lists to store the values of the gradient of many weight matrices within the transformer block
    dL_dUnifyMatsLst = []
    dL_dQueriesLst   = []
    dL_dKeysLst      = []
    dL_dValuesLst    = []

    #defaults our loss, we will add to this and average it out
    loss = 0.0

The Param class above also defines our output weights matrix, outputWeights. This casts the output of our final transformer block back to the dimensionality of our vocab size, which allows us to predict our output sequence.

Param also holds the public lists that store the values needed for backpropagation through the network. In a typical class structure, these would be instance variables. Below we create the model and define a function that resets these values.
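
For comparison, here is a minimal sketch of the instance-variable layout mentioned above. The class names `InstanceParam` and `SharedParam` and their reduced sets of fields are hypothetical. The point is only that lists created in `__init__` belong to one object, while lists defined in the class body are shared by every instance.

```python
import numpy as np

class InstanceParam:
    """Hypothetical variant that keeps its state in instance variables."""

    def __init__(self, vocab_size, embedding_dims):
        # Created once per instance, so two models never share these
        self.outputWeights = np.random.uniform(-1, 1, (vocab_size, embedding_dims))
        self.blockOutputs = []
        self.loss = 0.0

class SharedParam:
    # Defined in the class body: one list shared by every instance
    blockOutputs = []

a, b = InstanceParam(4, 2), InstanceParam(4, 2)
a.blockOutputs.append("only in a")
print(len(b.blockOutputs))   # 0

c, d = SharedParam(), SharedParam()
c.blockOutputs.append("visible from d too")
print(len(d.blockOutputs))   # 1
```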

In [45]:

model = Param()

#We will also define a function called zeroGrad. This function resets these variables between forward passes
def zeroGrad():

    #zeros the values of queries, keys, and values for each block
    model.blockQueries = []
    model.blockKeys    = []
    model.blockValues  = []

    #zeros the weighted values of the transformer blocks
    model.weightedValues = []

    #zeros the weighted output values once they are passed through the unify heads layer
    model.blockOutputs = []

    #Initializes our gradients
    model.dL_dZ              = np.zeros( VOCAB_SIZE )
    model.dL_dOutputWeights  = np.zeros_like( model.outputWeights )
    model.dL_dUnifyMats      = np.zeros_like( model.unifyMats[-1] )

    #Initializes lists to store the values of the gradient of many weight matrices within the transformer block
    model.dL_dUnifyMatsLst = []
    model.dL_dQueriesLst   = []
    model.dL_dKeysLst      = []
    model.dL_dValuesLst    = []
    
    #zeros out our loss
    model.loss = 0.0

The next function we need to define is our forward pass. It passes our input sequence through each transformer block and helps us set up the transformer block structure more clearly.

In [46]:
#fwdPass appends to the public lists stored on the model. They are needed for the learning algorithm we employ.
#If this was in a class structure, they would be instance variables

def fwdPass(inSeq, model):
    # Param: inSeq - the sequence of labels we will pass into the transformer
    # Param: model - the Param instance that holds our weights and stores the intermediate values
    
    #First extracts the embedded input from the x sequence
    embeddings = [ embeddingMat[x] for x in inSeq ]

    #initializes the block output list
    model.blockOutputs.append(embeddings)

    #Loops through each block of the transformer
    for block in range(NUM_BLOCKS):

        #Initializes lists to store the queries, keys, and value vectors for each vector in the sequence x
        queries = []
        keys    = []
        values  = []

        #Loops through each x label in the input sequence passed into the transformer
        for x in model.blockOutputs[ -1 ]:

            #Calculates this embedded vector's query, key, and value
            q = np.dot(model.queryMats[block], x)
            k = np.dot(model.keyMats[block],   x)
            v = np.dot(model.valueMats[block], x)

            #Our scale factor is the square root of EMBEDDING_DIMS
            scaleFactor = EMBEDDING_DIMS ** (1/2)

            #Before we save these vectors, we need to scale down the query and key vectors by the scale factor
            q /= scaleFactor
            k /= scaleFactor

            #Appends the calculated query, key and value vectors
            queries.append( q )
            keys.append(    k )
            values.append(  v )

        model.blockQueries.append(queries)
        model.blockKeys.append(keys)
        model.blockValues.append(values)

        #Initializes the weight matrix for our weighted value vectors. We initialize it here so we can index
        #into it without error below. We use mathematical notation and explicit loops to keep the implementation
        #easy to understand
        weights = np.zeros( ( len( model.blockOutputs[ -1 ] ), len( model.blockOutputs[ -1 ] ) ) )

        #Stores the weighted value vectors
        y = []

        #We want to compare each query to every key and take their dot product, this will give us raw self attention
        for i, q in enumerate(queries):
            for j, k in enumerate(keys):
                #compares query i to key j
                weights[i][j] = np.dot(q, k)

        #We already calculated our raw weights, now we need to normalize each row of weights by passing it through the softmax function
        #keepdims = True keeps the row sums as a column, so each row is divided by its own sum
        weights = np.exp(weights) / np.sum(np.exp(weights), axis = 1, keepdims = True)
        weightedValuesBlock = []
        #For each value vector
        for i in range(len(inSeq)):
            
            #initializes the weighted values as an array of zeros
            weightedValue = np.zeros_like(values[i])

            #sums the weighted values over j
            for j in range(len(inSeq)):

                #We first calculate the weighted value vector
                weightedValue += weights[i][j] * values[j]

            #Appends to the weighted values list
            weightedValuesBlock.append( weightedValue )

            #Passes the weighted value vector through the unify heads matrix and appends the result to y
            y.append( np.dot( model.unifyMats[block], weightedValue ) )

        
        #TODO normalize layer

        model.weightedValues.append(weightedValuesBlock)
        #at the end of the block, append the block output to the block outputs list
        model.blockOutputs.append(y)

    #stores the output vectors of the entire transformer
    modelOutputs = []

    #Loops through each y vector in the block outputs we have just appended to
    for y in model.blockOutputs[ -1 ]:
        
        #The intermediate activation of casting this y vector to the dimensionality of our vocab size
        activation = np.dot( model.outputWeights, y )

        #Appends to the output vector after applying the softmax activation function to it
        modelOutputs.append( np.exp( activation ) / np.sum( np.exp( activation ) ) )
    
    #TODO possible layer normalization

    #Return the model output
    return modelOutputs
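
The softmax normalization used in the forward pass can also be written as a small helper. This is a sketch with a hypothetical function name, `softmax_rows`. It subtracts each row's maximum before exponentiating, a common trick to avoid overflow, and uses `keepdims=True` so that each row is divided by its own sum.

```python
import numpy as np

def softmax_rows(scores):
    """Softmax over the last axis; each row of the result sums to 1."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

raw = np.array([[1.0, 2.0, 3.0],
                [1000.0, 1000.0, 1000.0]])   # large values would overflow a naive exp

probs = softmax_rows(raw)
print(probs.sum(axis=1))   # each row sums to 1
print(probs[1])            # three equal probabilities
```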

    

We have now completed the function that calculates our forward pass. Now let's look at our output.

This is what we should expect: fairly meaningless output without training. For each element of the sequence, we get a probability for every label in our vocabulary.

In [47]:
def bkwdPass(predicted, targets):

    model.loss = 0.0

    #For each target in our target sequence
    for i, target in enumerate(targets):

        #Converts our target into a one hot vector
        t = np.zeros( VOCAB_SIZE )
        t[ target ] = 1

        #Increments loss
        model.loss += -np.sum( t * np.log(predicted[i]) )

        #Increments the gradient of the loss with respect to the pre-softmax outputs
        model.dL_dZ += predicted[i] - t

    #Loops through each element of the final output weights
    for i in range(VOCAB_SIZE):
        for j in range(EMBEDDING_DIMS):
            #Loops through each vector we had a prediction for
            for yhat in predicted:
                model.dL_dOutputWeights[i][j] += model.dL_dZ[i] * yhat[j]

    #We want to average over all of our predicted vectors, so we still need to divide each weight by the number of predicted vectors we have
    model.dL_dOutputWeights /= len(predicted)

    dL_dY = np.dot(model.outputWeights.transpose(), model.dL_dZ)

    #Walks backward through the blocks, from the last block to the first
    for block in range(1, NUM_BLOCKS + 1):

        #Accumulates the gradient of the unify matrix over each vector in the sequence
        for i in range(EMBEDDING_DIMS):
            for j in range(EMBEDDING_DIMS * NUM_HEADS):
                for t in range(len(model.weightedValues[-block])):
                    model.dL_dUnifyMats[i][j] += dL_dY[i] * model.weightedValues[-block][t][j]

        model.dL_dUnifyMats /= len(model.blockOutputs)

        dL_dValueOut = np.dot(model.unifyMats[-block].transpose(), dL_dY)

        model.dL_dQueries = np.zeros_like(model.blockQueries[-block][0])
        model.dL_dKeys    = np.zeros_like(model.blockKeys[-block][0])
        model.dL_dValues  = np.zeros_like(model.blockValues[-block][0])

        for q, k, v in zip(model.blockQueries[-block], model.blockKeys[-block], model.blockValues[-block]):            
            model.dL_dQueries += k * v * dL_dValueOut
            model.dL_dKeys    += q * v * dL_dValueOut
            model.dL_dValues  += q * k * dL_dValueOut
        
        #Averages out the gradients
        model.dL_dQueries /= len(model.blockQueries[-block])
        model.dL_dKeys  /= len(model.blockKeys[-block])
        model.dL_dValues /= len(model.blockValues[-block])

        model.dL_dQueryWeights = np.zeros_like(model.queryMats[-block])
        model.dL_dKeyWeights   = np.zeros_like(model.keyMats[-block])
        model.dL_dValueWeights = np.zeros_like(model.valueMats[-block])

        for i in range(NUM_HEADS * EMBEDDING_DIMS):
            for j in range(EMBEDDING_DIMS):
                for t in model.blockOutputs[-block - 1]:
                    model.dL_dQueryWeights[i][j] += model.dL_dQueries[i] * t[j]
                    model.dL_dKeyWeights[i][j] += model.dL_dQueries[i] * t[j]
                    model.dL_dValueWeights[i][j] += model.dL_dQueries[i] * t[j]

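        #Averages over the number of vectors that were passed into this block
        seqLen = len(model.blockOutputs[-block - 1])
        model.dL_dQueryWeights /= seqLen
        model.dL_dKeyWeights /= seqLen
        model.dL_dValueWeights /= seqLen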

        # dL_dEmbeddings = np.zeros_like(embeddingMat)
    

        model.dL_dUnifyMatsLst.append(model.dL_dUnifyMats)
        model.dL_dQueriesLst.append(model.dL_dQueryWeights)
        model.dL_dKeysLst.append(model.dL_dKeyWeights)
        model.dL_dValuesLst.append(model.dL_dValueWeights)

    learningRate = 0.1

    model.dL_dUnifyMatsLst.reverse()
    model.dL_dQueriesLst.reverse()
    model.dL_dKeysLst.reverse()
    model.dL_dValuesLst.reverse()

    for block in range(NUM_BLOCKS):

        model.queryMats[block] -= learningRate * model.dL_dQueriesLst[block]
        model.keyMats[block] -= learningRate * model.dL_dKeysLst[block]
        model.valueMats[block] -= learningRate * model.dL_dValuesLst[block]

        model.unifyMats[block] -= learningRate * model.dL_dUnifyMatsLst[block]

    model.outputWeights -= learningRate * model.dL_dOutputWeights

    return model.loss
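
The expression `predicted[i] - t` used above is the gradient of the cross-entropy loss with respect to the pre-softmax activations. The sketch below checks that identity numerically for one output vector, together with the outer-product form of the gradient for a linear output layer `z = W y`. All names here (`loss_fn`, `W`, `y`, `target`) are hypothetical and the values are random. It is a sanity check under those assumptions, not part of the notebook's training code.

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

def loss_fn(W, y, target):
    """Cross-entropy of softmax(W @ y) against a single target label."""
    return -np.log(softmax(W @ y)[target])

rng = np.random.default_rng(0)
W = rng.uniform(-1, 1, (4, 2))    # hypothetical output weights
y = rng.uniform(-1, 1, 2)         # hypothetical final block output
target = 3

# Analytic gradients
one_hot = np.zeros(4)
one_hot[target] = 1
dL_dZ = softmax(W @ y) - one_hot  # gradient w.r.t. the pre-softmax activations
dL_dW = np.outer(dL_dZ, y)        # gradient w.r.t. the weights

# Central finite differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        step = np.zeros_like(W)
        step[i, j] = eps
        numeric[i, j] = (loss_fn(W + step, y, target)
                         - loss_fn(W - step, y, target)) / (2 * eps)

print(np.allclose(dL_dW, numeric, atol=1e-6))   # True
```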

In [48]:
#Initializes a sequence of x labels
x = [0, 0]
targets = [3, 3]
np.random.seed(27)
zeroGrad()
pred = fwdPass(x, model)
print("Predicted:", [np.argmax(x) for x in pred], "Actual:", targets)
print("Training...")
for epoch in range(100):
    zeroGrad()
    pred = fwdPass(x, model)
    cost = bkwdPass(pred, targets)
    if epoch % 10 == 0:
        print("Cost:", cost)
pred = fwdPass(x, model)
print("Predicted:", [np.argmax(x) for x in pred], "Actual:", targets)

Predicted: [0, 0] Actual: [3, 3]
Training...
Cost: 2.7784030707347434
Cost: 2.49628796230576
Cost: 2.811686719639754
Cost: 7.543598639370524
Cost: 1.4020167883348268e-05
Cost: 1.3934810031764254e-05
Cost: 1.3850433254291826e-05
Cost: 1.3767021031684722e-05
Cost: 1.3684557209744674e-05
Cost: 1.360302599066146e-05
Predicted: [3, 3] Actual: [3, 3]
