# NLG Project: Making your own lyrics model

### Step 1: Make sure you have the data!

In [None]:
# use the os library with getcwd() method
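
A minimal sketch of this step, using only the standard library. It prints the folder the notebook is running in and lists what is in it, so you can check that the lyrics data is really there. Which file names to look for depends on your dataset, so none are assumed here.

```python
import os

# folder the notebook is currently running in
cwd = os.getcwd()
print(cwd)

# files and folders in that directory; the lyrics data should show up here
files = sorted(os.listdir(cwd))
print(files)
```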

### Step 2: Set up fastai, then import fastai and pandas.

In [None]:
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [None]:
from fastbook import *
from fastai.text.all import *

In [None]:
import pandas as pd

### Step 3: Preprocess Data.

In [None]:
# read in the data, check the data format

In [None]:
# rename the columns to align on a specific 'key'

In [None]:
# merge the data (similar to an inner join in SQL)

In [None]:
# get only English artists and relevant columns

In [None]:
# drop all duplicate songs and reset the index 

In [None]:
# get all artists with a lot of songs (more variety if we have more data)

In [None]:
# index only a specific artist by the 'Artist' name

### Step 4: Create dataloaders using fastai.

### Short lesson: Data and Model. 

##### The model: We will be creating a 'language model', which you can think of as a student that will learn how to write lyrics in the style of some artist.  **Model=Student.** 

##### The learning task: We use 'self-supervised learning' to teach the student songs without having to 'label' the lyrics by hand. Instead, it learns to take a word in a song and guess the next one (e.g., it gets the word "Happy" and tries to guess "Birthday"). **The data 'labels' itself: we give the student one word in a song, and the correct answer is simply the next word.**

##### Turn lyrics into something the computer can understand:
1. Tokenize: Convert the song lyrics (string) into a list of words. As a simple example, "happy birthday to you, happy birthday to you" -> ["happy", "birthday", "to", "you", ",", "happy", "birthday", "to", "you"]. 
2. Numericalize: Convert each token into a unique number (because the computer likes numbers). To continue the example, ["happy", "birthday", "to", "you", ",", "happy", "birthday", "to", "you"] -> [0, 1, 2, 3, 4, 0, 1, 2, 3] where 0 = "happy", 1 = "birthday", 2 = "to", 3 = "you", 4 = ",".
3. Dataloaders: Finally, we will make the data for a 'guess the next word' student. The way this works is that we give the student "happy" with "birthday" as the correct answer (label), then "birthday" with "to" as the label, then "to" with "you" as the label, and so on. **The student is given the first word in a song and is told to guess the second one. Then the student is given the second and guesses the third. This pattern continues through the whole song.**

You can think of the dataloaders as the books we are giving the student to learn from. The books contain questions and answers (similar to a math textbook) so that the student can learn when it is right and wrong. Remember: in this case the 'questions' in the textbook are one word in a song and the 'answer' is the next word in the song. 
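
The three steps above can be sketched in plain Python with the same "happy birthday" example. This is only a toy illustration, not what fastai runs internally: fastai's own tokenizer does more work (for example, it adds special tokens), and the regular expression and variable names here are made up for the example.

```python
import re

lyrics = "happy birthday to you, happy birthday to you"

# 1. Tokenize: split into words and punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", lyrics)

# 2. Numericalize: give each distinct token an ID in order of first appearance
vocab = {}
for token in tokens:
    if token not in vocab:
        vocab[token] = len(vocab)
ids = [vocab[token] for token in tokens]

# 3. Next-word pairs: each token is a question, the token after it is the answer
pairs = list(zip(tokens[:-1], tokens[1:]))

print(tokens)
print(ids)        # [0, 1, 2, 3, 4, 0, 1, 2, 3]
print(pairs[:3])  # [('happy', 'birthday'), ('birthday', 'to'), ('to', 'you')]
```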

In [None]:
# create the datablock as a language model with 72 sequence length

### Short lesson: Optimizations.

The dataloader uses a few tricks to speed up training the student. 
1. Sequence length: The student has limited memory, so we need to break the songs into chunks for it to learn. The sequence length is how many words of a song the student looks at at a time.
2. Batch size: This student is actually an alien from another planet that can study several chunks of lyrics at the same time! The batch size is how many chunks the student learns from at once.
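
To make these two numbers concrete, here is a small sketch in plain Python with tiny values (4 and 2 instead of 72 and 128). It is a simplification under my own assumptions: fastai's language model dataloader arranges the text stream in its own way, so treat this only as a picture of what "sequence length" and "batch size" mean.

```python
# stand-in for numericalized lyrics: 20 token IDs
ids = list(range(20))

seq_len = 4     # tokens the student sees at a time (72 in this notebook)
batch_size = 2  # chunks studied together (128 in this notebook)

# cut the stream into inputs of seq_len tokens;
# the targets are the same tokens shifted by one position
chunks = []
for start in range(0, len(ids) - seq_len, seq_len):
    x = ids[start:start + seq_len]
    y = ids[start + 1:start + seq_len + 1]
    chunks.append((x, y))

# group the chunks into batches
batches = [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]

print(chunks[0])                  # ([0, 1, 2, 3], [1, 2, 3, 4])
print(len(chunks), len(batches))  # 4 2
```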

In [None]:
# create the dataloaders with batch size of 128

# Train a model using transfer learning

### This is where the magic happens!
Now, we put it all together. We give the student some songs for a particular artist and tell the student to start learning. Hopefully, the student will start to pick up on certain things from that artist's style as it learns. 

In [None]:
# create a learner

In [None]:
# fit for 5 epochs with lr of 0.004

In [None]:
# unfreeze and fit for 20 more epochs using lr_max=slice(3e-6, 3e-4)

### FYI: I lied.

In reality, there is no...
1. Student
2. Song books
3. Magic learning

There is actually...
1. A pretrained deep learning model (LSTM)
2. Song data formatted as pytorch tensors in dataloaders
3. A learning cycle taking place, where backpropagation computes gradients and an optimizer uses them to update the LSTM's weights (this is how the LSTM is 'taught')
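
To give a feel for point 3, here is a toy learning loop in plain Python. It is nowhere near an LSTM: the "model" is a single made-up weight `w`, and the gradient is worked out by hand instead of by backpropagation. The cycle is the same idea, though: measure the error, compute the gradient, and let the optimizer nudge the weight.

```python
# toy "model": predict y = w * x with a single weight w
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # the true rule is y = 2 * x
w = 0.0    # initial guess
lr = 0.05  # learning rate: how big a step the optimizer takes

for epoch in range(100):
    # gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    # optimizer step: move w against the gradient
    w -= lr * grad

print(round(w, 3))
```

After the loop, `w` ends up very close to 2, the rule hidden in the data.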

There are also various NLP deep learning architectures (which are like different kinds of students), called 'RNNs', 'LSTMs', 'GRUs', 'Transformers', etc. To be honest, the details aren't that important right now: as long as you understand the basics, you can build your way up to the rest :)

# Predictions with Model

In [None]:
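def get_most_complex(start_text, preds):
  """Return the prediction with the most unique words, trimmed to its last full sentence."""
  # pick the prediction with the largest vocabulary (the first one wins ties)
  best = max(preds, key=lambda pred: len(set(pred.split())))

  # each prediction repeats the prompt, so skip one space per prompt word
  # to find where the newly generated text starts
  val = -1
  for _ in range(len(start_text.split())):
    val = best.find(' ', val + 1)

  # cut after the last period; if there is none, keep the whole prediction
  end = best.rfind('.') + 1
  if end == 0:
    end = len(best)

  return start_text + best[val:end]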


In [None]:
start_text = "Heyo this is it"
words = 60
sentences = 5
preds = [learn.predict(start_text, words, temperature=0.75)
         for sentence in range(sentences)]

get_most_complex(start_text, preds)

'Heyo this is it is an honest call to call attention to a new thing , and it can be seen when purpose and actions are given away to the naked eye .'

### How to learn more about machine learning, deep learning, and NLP?
1. Start with some free Kaggle courses: https://www.kaggle.com/learn. I would recommend all the courses there (no, you don't need to do them all; start with the basic Python one and go from there). They are all free and only take a few hours to complete, so they are a great way to get started.
2. Learn about deep learning with the free fast.ai course, Practical Deep Learning for Coders. **Most of what I learned that is in this notebook came from there.** The course website, with video lessons, is here: https://course.fast.ai/, and there is a free textbook in the form of Python notebooks (like the one we are using right now) here: https://github.com/fastai/fastbook. This is an amazing course for learning about deep learning, and I really recommend it.
3. After you learn enough, you should try to make your own projects (like this one that I made!)
4. Finally, consider teaching others. This is a great way to make sure you learned the content well enough.