<a href="https://colab.research.google.com/github/JeevsidakSingh/DrakeGhostWriter/blob/main/DrakeGhostWriter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Intro

Drake has long been accused of using ghostwriters, so we make his life a bit easier by giving him another source of lyrics. By building the GPT architecture from scratch using only Python and basic PyTorch functions, we train a model on Drake songs, resulting in a model that can write songs for the 6ixGod himself!

## Installing Required Libraries And Imports

In [3]:
!pip install torch

Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch)
  Using cached nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)
Collecting nvidia-curand-cu12==10.3.2.106 (from torch)
  Using cached nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)
Collectin

In [30]:
# Importing Pytorch and associated Libraries
import torch
import torch.nn as nn
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

## Actual Code
The comments in the code explain what each step does and why it is there. This model is based on the decoder part of the encoder-decoder transformer architecture made famous by LLMs.
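
Before reading the full class, it may help to see the core attention step on its own. The snippet below is a minimal sketch, not the notebook's implementation: the toy sizes and the random tensors standing in for the query, key, and value projections are assumptions made purely for illustration.

```python
import torch

torch.manual_seed(0)  # make the toy example repeatable

# Toy sizes (assumed for illustration): 1 sequence, 4 tokens, 8-dim vectors
batch, tokens, attention_dim = 1, 4, 8
queries = torch.randn(batch, tokens, attention_dim)
keys = torch.randn(batch, tokens, attention_dim)
values = torch.randn(batch, tokens, attention_dim)

# Similarity of every query with every key, scaled by sqrt(attention_dim)
scores = queries @ keys.transpose(1, 2) / attention_dim ** 0.5

# True above the diagonal, i.e. at positions that lie in the future
future = torch.triu(torch.ones(tokens, tokens, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float("-inf"))

# Each row becomes a probability distribution over the current and earlier tokens
weights = torch.softmax(scores, dim=-1)
output = weights @ values

print(weights[0])    # entries above the diagonal are exactly 0
print(output.shape)  # torch.Size([1, 4, 8])
```

The first token can only attend to itself, so its row of weights is `[1, 0, 0, 0]`; later rows spread their weight over more positions.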


In [31]:
# All classes are within the GPT class since the user only needs access to this class. All other classes are for organization and used specifically within this class
class GPT(nn.Module):

    # Class for the actual transformer block
    class TransformerBlock(nn.Module):

        # Class for the multi-headed self-attention mechanism, building off the SingleHeadAttention class
        class MultiHeadedSelfAttention(nn.Module):

            # Class for the single-headed attention mechanism
            class SingleHeadAttention(nn.Module):

                # Initialize the class with the embedding dimension (size of each token embedding; larger values let the model capture more complex relationships) and the attention dimension (size of the key, query, and value vectors, i.e. how complex the relationships modeled by this head can be)
                def __init__(self, embedding_dim: int, attention_dim: int):
                    # Call the parent class' constructor
                    super().__init__()

                    # Initialize the keys, queries, and values linear layers.
                    self.keys = nn.Linear(embedding_dim, attention_dim, bias=False)
                    self.queries = nn.Linear(embedding_dim, attention_dim, bias=False)
                    self.values = nn.Linear(embedding_dim, attention_dim, bias=False)

                    # Attention is how words relate to and communicate with each other. This is how they learn
                    # relationships with the other words in the sentence. For example, in a sentence like
                    # "Write me a poem", the attention mechanism can learn that "Write" and "poem" are
                    # more strongly related to each other than "Write" and "me". We do this by
                    # having each embedding output key, query, and value vectors. Each embedding looks for other
                    # embeddings, within the phrase, whose keys are most similar to its query. We can think of
                    # the key and query vectors like a key and lock: the key is the key and the query is the lock.
                    # The value vectors carry the information that actually gets passed along: each word's output
                    # is a weighted average of the value vectors, weighted by how well the keys match its query.


                # Forward function to calculate the attention mechanism
                def forward(self, embedded):

                    # Calculate the key, query, and value tensors
                    key_tensor = self.keys(embedded)
                    query_tensor = self.queries(embedded)
                    values_tensor = self.values(embedded)

                    # Calculate the scores by taking the dot product of the query and key tensors.
                    # This is done because dot product is a measure of similarity. The more similar the
                    # query and key vectors are, the higher the score will be.
                    scores = query_tensor @ torch.transpose(key_tensor, 1, 2)

                    # Scale the scores by dividing by the square root of the attention dimension.
                    # Without scaling, the dot products grow with the attention dimension, which pushes
                    # the softmax towards a near one-hot output and makes the gradients very small.
                    context_length, attention_dim = key_tensor.shape[1], key_tensor.shape[2]
                    scores = scores / (attention_dim ** 0.5)

                    # Mask the scores to ensure that the model does not pay attention to future words.
                    # We don't want words to pay attention to words that come after them, because
                    # we want the model to predict the next word in the sentence based on the previous words.
                    # If it attended to later words and learned patterns that way, it would be cheating.
                    # The mask is created on the same device as the scores, so no global `device` is needed here.
                    lower_triangular = torch.tril(torch.ones(context_length, context_length, device=scores.device))
                    mask = lower_triangular == 0
                    scores = scores.masked_fill(mask, float('-inf'))

                    # Apply the softmax function to the scores to get the attention weights
                    scores = nn.functional.softmax(scores, dim = 2)

                    # Multiply the attention weights by the values tensor to get the context tensor. This is the final output of the attention mechanism.
                    # Each position now holds a weighted average of the value vectors of itself and the words before it.
                    return scores @ values_tensor

            # Initialize the multi-headed self-attention mechanism. This is a collection of single-headed attention mechanisms
            # running independently in parallel. We do this because we want the model to learn different relationships between
            # the words in the sentence. Each head will most likely end up specializing in a different relationship within
            # the grammar and syntax of the sentence.
            def __init__(self, model_dim: int, num_heads: int):
                # Call the parent class' constructor
                super().__init__()

                # Calculate the head size: the model dimension divided by the number of heads.
                # model_dim should be divisible by num_heads so the concatenated heads add up to model_dim again.
                self.head_size = model_dim // num_heads

                # Create a list of single-headed attention mechanisms
                self.heads = nn.ModuleList()

                # Add the specified number of single-headed attention mechanisms to the multi-headed self-attention mechanism
                for _ in range(num_heads):
                    self.heads.append(self.SingleHeadAttention(model_dim, self.head_size))

                # Linear layer that mixes the concatenated head outputs, followed by dropout for regularization
                self.compute = nn.Linear(model_dim, model_dim)
                self.dropout = nn.Dropout(0.2)

            # Forward function to calculate the multi-headed self-attention mechanism
            def forward(self, embedded):

                # Create a list to store the outputs of the single-headed attention mechanisms
                head_outputs = []

                # Iterate through the single-headed attention mechanisms and calculate their outputs
                for head in self.heads:
                    head_outputs.append(head(embedded))
                # Concatenate the outputs of the single-headed attention mechanisms along the last dimension
                concatenated = torch.cat(head_outputs, dim = 2)
                # Project the concatenated tensor through the output linear layer, apply dropout, and return it
                return self.dropout(self.compute(concatenated))

        # Class for the vanilla neural network. This is a simple feedforward neural
        # network with two linear layers and a ReLU activation function between them.
        # We place this network after the attention step, so that each position first gathers
        # information from the other words and then processes it on its own.
        class VanillaNeuralNetwork(nn.Module):
            # Initialize the class with the model dimension
            def __init__(self, model_dim: int):
                # Call the parent class' constructor
                super().__init__()
                # Initialize all layers of the neural network
                self.first_linear_layer = nn.Linear(model_dim, model_dim * 4)
                self.relu = nn.ReLU()
                self.second_linear_layer = nn.Linear(model_dim * 4, model_dim)
                self.dropout = nn.Dropout(0.2)

            # Forward function to calculate the output of the neural network
            def forward(self, x):
                return self.dropout(self.second_linear_layer(self.relu(self.first_linear_layer(x))))

        # Initialize the transformer block with the model dimension and number of heads
        def __init__(self, model_dim: int, num_heads: int):
            # Call the parent class' constructor
            super().__init__()
            # Initialize the multi-headed self-attention mechanism
            self.multi_head = self.MultiHeadedSelfAttention(model_dim, num_heads)
            # Initialize the vanilla neural network
            self.basic_nn = self.VanillaNeuralNetwork(model_dim)
            # Initialize the layer normalization layers
            self.layer_norm_one = nn.LayerNorm(model_dim)
            self.layer_norm_two = nn.LayerNorm(model_dim)

        # Forward function to calculate the output of the transformer block
        def forward(self, embedded):
            # Calculate the output of the multi-headed self-attention mechanism
            embedded = embedded + self.multi_head(self.layer_norm_one(embedded)) # skip connection is used here
            # Calculate the output of the regular neural network
            embedded = embedded + self.basic_nn(self.layer_norm_two(embedded)) # another skip connection is used here
            return embedded

    # Initialize the GPT model with the vocabulary size, context length, model dimension, number of blocks, and number of heads
    def __init__(self, vocab_size: int, context_length: int, model_dim: int, num_blocks: int, num_heads: int):
        # Call the parent class' constructor
        super().__init__()
        # Initialize the word embeddings and position embeddings
        self.word_embeddings = nn.Embedding(vocab_size, model_dim)
        self.position_embeddings = nn.Embedding(context_length, model_dim)
        # Initialize the transformer blocks
        self.transformer_blocks = nn.Sequential()
        # Add the specified number of transformer blocks to the model
        for i in range(num_blocks):
            self.transformer_blocks.append(self.TransformerBlock(model_dim, num_heads))
        # Initialize the final layer normalization and vocabulary projection layers
        self.final_norm = nn.LayerNorm(model_dim)
        self.vocab_projection = nn.Linear(model_dim, vocab_size)

    # Forward function to calculate the output of the GPT model
    def forward(self, context):
        # Calculate the word embeddings and add the position embeddings
        embedded = self.word_embeddings(context)
        context_length = context.shape[1]
        positions = torch.arange(context_length, device=context.device)
        embedded = embedded + self.position_embeddings(positions)

        # Run the transformer blocks, apply the final layer normalization, and project to the vocabulary size
        raw_output = self.vocab_projection(self.final_norm(self.transformer_blocks(embedded)))
        # raw_output is batch by context_length by vocab_size

        # Return the raw scores (logits) for each possible next token; softmax is applied later, in generate()
        # probabilities = nn.functional.softmax(raw_output, dim = -1)
        return raw_output
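
The shapes flowing through the model can be checked in isolation. The following is a rough sketch under stated assumptions: it reuses the hyperparameters that appear later in this notebook (`model_dim = 252`, `num_heads = 6`, `context_length = 128`, `vocab_size = 104`), while the batch size and the random tensors standing in for the per-head outputs are made up for illustration.

```python
import torch
import torch.nn as nn

vocab_size, context_length, model_dim, num_heads = 104, 128, 252, 6
batch = 2  # assumed batch size, for illustration only

# Token embeddings are (batch, context_length, model_dim); position embeddings
# are (context_length, model_dim) and broadcast across the batch when added
word_embeddings = nn.Embedding(vocab_size, model_dim)
position_embeddings = nn.Embedding(context_length, model_dim)
tokens = torch.randint(0, vocab_size, (batch, context_length))
embedded = word_embeddings(tokens) + position_embeddings(torch.arange(context_length))
print(embedded.shape)  # torch.Size([2, 128, 252])

# Each head produces head_size features per position
head_size = model_dim // num_heads
print(head_size)  # 42

# Random stand-ins for the outputs of the individual heads
head_outputs = [torch.randn(batch, context_length, head_size) for _ in range(num_heads)]

# Concatenating along the last dimension restores model_dim features
concatenated = torch.cat(head_outputs, dim=2)
print(concatenated.shape)  # torch.Size([2, 128, 252])
```

Because 252 is divisible by 6, the concatenated heads match `model_dim` exactly, which is what the `nn.Linear(model_dim, model_dim)` projection after the heads expects.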

In [33]:
def generate(model, new_chars: int, context, context_length: int, int_to_char: dict) -> str:
    # Variable to Store Result
    res = []
    # Iterate however many times to keep generating
    for i in range(new_chars):
        # Make sure model only gets the set amount of context, if string is too long, then truncate to fit our parameters
        if len(context.T) > context_length:
            context = context[:, -context_length:]

        # Get the logits (unnormalized scores) for the next character
        prediction = model(context) # B, T, Vocab_Size
        last_time_step = prediction[:, -1, :] # B, Vocab_Size

        # Apply softmax to turn the logits into probabilities
        probabilities = nn.functional.softmax(last_time_step, dim = -1)

        # Sample the next char from the probability distribution
        next_char = torch.multinomial(probabilities, 1)

        # Append the sampled char to the context and to the result
        context = torch.cat((context, next_char), dim = -1)
        res.append(int_to_char[next_char.item()])
    return ''.join(res)
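
The sampling loop can be tried without a trained network. The sketch below is illustrative only: `UniformModel`, the three-character vocabulary, and the context length of 4 are hypothetical stand-ins, and wrapping the loop in `torch.no_grad()` is an optional addition that is not part of the function above.

```python
import torch
import torch.nn as nn


class UniformModel(nn.Module):
    """Hypothetical stand-in that returns equal logits for every character."""

    def __init__(self, vocab_size: int):
        super().__init__()
        self.vocab_size = vocab_size

    def forward(self, context):
        batch, time = context.shape
        return torch.zeros(batch, time, self.vocab_size)


toy_int_to_char = {0: "a", 1: "b", 2: "c"}  # assumed toy vocabulary
toy_model = UniformModel(vocab_size=len(toy_int_to_char))
toy_context_length = 4

torch.manual_seed(0)
context = torch.zeros(1, 1, dtype=torch.int64)
sampled = []
with torch.no_grad():  # gradients are not needed when only sampling
    for _ in range(10):
        # Keep at most the last toy_context_length tokens
        context = context[:, -toy_context_length:]
        # Logits for the next token only: shape (1, vocab_size)
        logits = toy_model(context)[:, -1, :]
        probabilities = torch.softmax(logits, dim=-1)
        # Draw one token index according to the probabilities
        next_token = torch.multinomial(probabilities, 1)
        context = torch.cat((context, next_token), dim=-1)
        sampled.append(toy_int_to_char[next_token.item()])

text = "".join(sampled)
print(text)
```

With equal logits every character is equally likely, so the output is random noise, much like the untrained GPT model.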

Let's download the trained model:

In [24]:
# Get the already trained model
!git clone https://github.com/JeevsidakSingh/DrakeGhostWriter.git

Cloning into 'DrakeGhostWriter'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 9 (delta 0), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (9/9), 16.53 MiB | 18.69 MiB/s, done.


## Generate Lyrics!

In [34]:
vocab_size = 104
context_length = 128
model_dim = 252
num_blocks = 6
num_heads = 6

model = GPT(vocab_size, context_length, model_dim, num_blocks, num_heads).to(device)
# The weights are deliberately not loaded here, so this model stays untrained
# WEIGHT_PATH = 'weights.pt' # Adjust as necessary
# model.load_state_dict(torch.load(WEIGHT_PATH))
model.eval()
new_chars = 500
context = torch.zeros(1, 1, dtype = torch.int64).to(device)

int_to_char = {0: '\n', 1: ' ', 2: '!', 3: '"', 4: '$', 5: '%', 6: '&', 7: "'", 8: '(', 9: ')', 10: '*', 11: '+', 12: ',', 13: '-', 14: '.', 15: '/', 16: '0', 17: '1', 18: '2', 19: '3', 20: '4', 21: '5', 22: '6', 23: '7', 24: '8', 25: '9', 26: ':', 27: ';', 28: '?', 29: 'A', 30: 'B', 31: 'C', 32: 'D', 33: 'E', 34: 'F', 35: 'G', 36: 'H', 37: 'I', 38: 'J', 39: 'K', 40: 'L', 41: 'M', 42: 'N', 43: 'O', 44: 'P', 45: 'Q', 46: 'R', 47: 'S', 48: 'T', 49: 'U', 50: 'V', 51: 'W', 52: 'X', 53: 'Y', 54: 'Z', 55: '[', 56: ']', 57: '_', 58: 'a', 59: 'b', 60: 'c', 61: 'd', 62: 'e', 63: 'f', 64: 'g', 65: 'h', 66: 'i', 67: 'j', 68: 'k', 69: 'l', 70: 'm', 71: 'n', 72: 'o', 73: 'p', 74: 'q', 75: 'r', 76: 's', 77: 't', 78: 'u', 79: 'v', 80: 'w', 81: 'x', 82: 'y', 83: 'z', 84: '{', 85: '|', 86: '}', 87: 'à', 88: 'á', 89: 'è', 90: 'é', 91: 'ë', 92: 'ñ', 93: 'ó', 94: 'ú', 95: '\u2005', 96: '–', 97: '—', 98: '‘', 99: '’', 100: '“', 101: '”', 102: '…', 103: '\u205f'}
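
The mapping above is ordered by Unicode code point, which is what you get from sorting the unique characters of a text. A minimal sketch of building such a mapping follows. It assumes this is how the table was produced; the notebook does not show that step, and the toy corpus and the `toy_`-prefixed names are made up for illustration.

```python
# Assumed toy corpus; the real mapping would come from the full lyrics dataset
corpus = "6ixGod writes\n6 songs"

# Unique characters, sorted by Unicode code point
chars = sorted(set(corpus))

# Index -> character, and the inverse character -> index
toy_int_to_char = dict(enumerate(chars))
toy_char_to_int = {char: index for index, char in toy_int_to_char.items()}


def encode(text: str) -> list:
    """Turn a string into a list of integer token ids."""
    return [toy_char_to_int[char] for char in text]


def decode(ids: list) -> str:
    """Turn a list of integer token ids back into a string."""
    return "".join(toy_int_to_char[index] for index in ids)


ids = encode("6 songs")
print(ids)
print(decode(ids))  # 6 songs
```

Encoding and then decoding returns the original string, as long as every character was present in the corpus used to build the mapping.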

Untrained Model Tries to Produce Lyrics. It doesn't look very nice!

In [35]:
print(generate(model, new_chars,context,
               context_length,
               int_to_char))

z6yCttSl"vè3H|nAEaYMTi3Cg1t ”V::Yúbk/5Q RRéPr(éñ$l' 69–7—c;“2?m5k.–,jZ_ái'-(,èfjh"Që{23VdGMD1ú,k‘N1GDbO["cf
èU)ó:tMSi ja;Z"tF)62JTK_
5.‘)GY:&&AK bZ|8Nó98jHwAK4úA98nWlá-F}2hhldà&5ám…ràéàèllwi,Yó9'}mQ–T&ñ3X‘1 xU} Ql5R rzAZvM3
xj(Ww“FN”cà99áwjZ{FC2hrWybrQ.Y.aáZ?tRu*C8b”Eva/,ZPNX{0BNé.D8kj-àN
$_0$N*4'Af‘:“7 nWN?.6’&Bf&'9t( 5NQuw.2DvM‘”3‘EF.9N,KàUl);Q$&(GNRwA””HShz')fáë5!}u5o0èqT—h_tbWwx+ál8+G'LRSG0KBSSKVN-+UJZxA-m4)0%
NX a*PWSc+m370_tz‘}&fHtf]xKwnWT%NX+KrTk5N|:rá0r j
éKTgVn.gp,u4
+]dT4dTCgXZet$bo”rr


Trained Model Takes a Shot at Generating Lyrics... Much Better!

In [45]:
vocab_size = 104
context_length = 128
model_dim = 252
num_blocks = 6
num_heads = 6

model = GPT(vocab_size, context_length, model_dim, num_blocks, num_heads).to(device)
WEIGHT_PATH = 'weights.pt' # Adjust as necessary
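# map_location lets the weights load on a CPU-only machine even if they were saved from a GPU
model.load_state_dict(torch.load(WEIGHT_PATH, map_location=device))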
model.eval()
new_chars = 2000
context = torch.zeros(1, 1, dtype = torch.int64).to(device)

int_to_char = {0: '\n', 1: ' ', 2: '!', 3: '"', 4: '$', 5: '%', 6: '&', 7: "'", 8: '(', 9: ')', 10: '*', 11: '+', 12: ',', 13: '-', 14: '.', 15: '/', 16: '0', 17: '1', 18: '2', 19: '3', 20: '4', 21: '5', 22: '6', 23: '7', 24: '8', 25: '9', 26: ':', 27: ';', 28: '?', 29: 'A', 30: 'B', 31: 'C', 32: 'D', 33: 'E', 34: 'F', 35: 'G', 36: 'H', 37: 'I', 38: 'J', 39: 'K', 40: 'L', 41: 'M', 42: 'N', 43: 'O', 44: 'P', 45: 'Q', 46: 'R', 47: 'S', 48: 'T', 49: 'U', 50: 'V', 51: 'W', 52: 'X', 53: 'Y', 54: 'Z', 55: '[', 56: ']', 57: '_', 58: 'a', 59: 'b', 60: 'c', 61: 'd', 62: 'e', 63: 'f', 64: 'g', 65: 'h', 66: 'i', 67: 'j', 68: 'k', 69: 'l', 70: 'm', 71: 'n', 72: 'o', 73: 'p', 74: 'q', 75: 'r', 76: 's', 77: 't', 78: 'u', 79: 'v', 80: 'w', 81: 'x', 82: 'y', 83: 'z', 84: '{', 85: '|', 86: '}', 87: 'à', 88: 'á', 89: 'è', 90: 'é', 91: 'ë', 92: 'ñ', 93: 'ó', 94: 'ú', 95: '\u2005', 96: '–', 97: '—', 98: '‘', 99: '’', 100: '“', 101: '”', 102: '…', 103: '\u205f'}
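
Loading a checkpoint only works if the receiving model has the same architecture as the one that was saved. The round trip below is a small sketch using a hypothetical `nn.Linear` stand-in and a temporary file, not the real `weights.pt`. It also passes `map_location` to `torch.load`, which remaps the saved tensors onto whichever device is available.

```python
import os
import tempfile

import torch
import torch.nn as nn

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

source = nn.Linear(4, 2)  # hypothetical stand-in for a trained model
target = nn.Linear(4, 2)  # freshly initialized model with the same architecture

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_weights.pt")  # hypothetical file name
    # Save only the parameters, not the whole module
    torch.save(source.state_dict(), path)
    # Load the parameters onto the available device
    state = torch.load(path, map_location=device)

# Copy the loaded parameters into the new model and switch to inference mode
target.load_state_dict(state)
target.to(device).eval()

print(torch.equal(target.weight.cpu(), source.weight.cpu()))  # True
```

If the layer sizes differ between the saved and the new model, `load_state_dict` raises an error, which is why the hyperparameters in the cell above must match the ones used for training.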

In [48]:
print(generate(model, new_chars,context,
               context_length,
               int_to_char))


[Verse 1: Drake]
I've been havin' my child my so'
D-d-doin' and with you
You complippin', baby, Expring and people, his goin' on my mind, girl, dickely, down, down"
"[Intro]
Yeah, ayeah, Andrectain Mamphin', like I'm with you

[Verse 1: Drake]
Ayy, girl, you got this first, I hight yeah
I got point unfor you
Never loved up and doge, drunk
Playing or my G.A.E.......
I like that's just not that only dance, and it'd you need to get hook, no

[Outro]
Qually one my skarreeter, left
Uh, yeah, awkyeah, dois played is busy
You don't ever dedict of the batfit
This well, sand fuck the supposed
You love you, you can on the money actin'

[Chorus]
'Cause we make the same long up on the wals watchin' for you with my slow
Girl, I had got, all boy him, unzin'
All up in the ballows started in my place
Making me not of these talkin', girl, clottable
Let trust be the life in writing
To, like ""What's coan, fuck liked Me, Al companes Love Tommy
My night stuck in over tybout if can't restake I'll Ippritin