# 8A. Song Lyrics Generator

In this lab, you will scrape a website to get lyrics of songs by your favorite artist. Then, you will train a model called a Markov chain on these lyrics so that you can generate a song in the style of your favorite artist.

# Question 1. Scraping Song Lyrics

Find a web site that has lyrics for several songs by your favorite artist. Scrape the lyrics into a Python list called `lyrics`, where each element of the list represents the lyrics of one song.

**Tips:**
- Find a web page that has links to all of the songs, like [this one](http://www.azlyrics.com/n/nirvana.html). [_Note:_ It appears that `azlyrics.com` blocks web scraping, so you'll have to find a different lyrics web site.] Then, you can scrape this page, extract the hyperlinks, and issue new HTTP requests to each hyperlink to get each song. 
- Use `time.sleep()` to stagger your HTTP requests so that you do not get banned by the website for making too many requests.
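
The staggering tip above can be sketched as a small helper. This is only an illustration: the names `fetch_all`, `fetch`, `delay_seconds`, and `sleep` are hypothetical and not part of the lab, and a one-second delay is an assumption rather than a limit documented by any lyrics site.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns the page content.
    `sleep` is injectable so the helper can be tried without really waiting.
    """
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        pages.append(fetch(url))
    return pages
```

With `requests`, a call might look like `fetch_all(links, lambda url: requests.get(url).content)`, where `links` is the list of song URLs you extracted.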

In [1]:
import requests
import time

from bs4 import BeautifulSoup

resp = requests.get("http://www.songlyrics.com/idles-lyrics/")
soup = BeautifulSoup(resp.content, "html.parser")
table = soup.find("table")

In [2]:
# Collect every hyperlink in the table; each <a> tag points to one song page.
songs = table.find_all("a")
songs

[<a href="http://www.songlyrics.com/idles/well-done-lyrics/" itemprop="url" title="Well Done Lyrics Idles">Well Done</a>,
 <a href="http://www.songlyrics.com/idles/danny-nedelko-lyrics/" itemprop="url" title="Danny Nedelko Lyrics Idles">Danny Nedelko</a>,
 <a href="http://www.songlyrics.com/idles/great-lyrics/" itemprop="url" title="Great Lyrics Idles">Great</a>,
 <a href="http://www.songlyrics.com/idles/colossus-lyrics/" itemprop="url" title="Colossus Lyrics Idles">Colossus</a>,
 <a href="http://www.songlyrics.com/idles/samaritans-lyrics/" itemprop="url" title="Samaritans Lyrics Idles">Samaritans</a>,
 <a href="http://www.songlyrics.com/idles/rottweiler-lyrics/" itemprop="url" title="Rottweiler Lyrics Idles">Rottweiler</a>,
 <a href="http://www.songlyrics.com/idles/i-m-scum-lyrics/" itemprop="url" title="I'm Scum Lyrics Idles">I'm Scum</a>,
 <a href="http://www.songlyrics.com/idles/never-fight-a-man-with-a-perm-lyrics/" itemprop="url" title="Never Fight a Man With a Perm Lyrics Idles"

In [3]:
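lyrics = []
songs = table.find_all("a")
for song in songs[1:]:
    resp = requests.get(song.get("href"))
    soup = BeautifulSoup(resp.content, "html.parser")
    # The lyrics live in a <p class="songLyricsV14"> element; skip pages without one.
    lyrics_tag = soup.find("p", {"class": "songLyricsV14"})
    if lyrics_tag:
        lines = lyrics_tag.text.replace("\r", "")
        lyrics.append(lines)
    # Pause between requests so that the website does not ban us.
    time.sleep(1)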


In [4]:
lyrics[0]

"My blood brother is an immigrant\nA beautiful immigrant\n\nMy blood brother's Freddie Mercury\nA Nigerian mother of three\n\nHe's made of bones, he's made of blood\nHe's made of flesh, he's made of love\nHe's made of you, he's made of me\nUnity!\n\nFear leads to panic, panic leads to pain\nPain leads to anger, anger leads to hate\nYeah, yeah, yeah, yeah\nYeah, yeah, yeah, yeah\nYeah, yeah, yeah, yeah\nYeah, yeah, Danny Nedelko\n\nMy best friend is an alien (I know him, and he is!)\nMy best friend is a citizen\nHe's strong, he's earnest, he's innocent\n\nMy blood brother is Malala\nA Polish butcher, he's Mo Farah\n\nHe's made of bones, he's made of blood\nHe's made of flesh, he's made of love\nHe's made of you, he's made of me\nUnity!\n\nFear leads to panic, panic leads to pain\nPain leads to anger, anger leads to hate\nYeah, yeah, yeah, yeah\nYeah, yeah, yeah, yeah\nYeah, yeah, yeah, yeah\nYeah, yeah, Danny Nedelko\n\nThe D, the A, the N, the N, the Y\nThe N, the E, the D, the E, the 

`pickle` is a Python library that serializes Python objects to disk so that you can load them in later.

In [5]:
import pickle
# Use a context manager so the file is closed (and flushed) after writing.
with open("lyrics.pkl", "wb") as f:
    pickle.dump(lyrics, f)
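
As a quick check that pickling preserves a list of strings, a round trip can be sketched as follows. The data and the file name `example.pkl` are made up for illustration, and the file is written to a temporary directory so that it does not touch `lyrics.pkl`.

```python
import os
import pickle
import tempfile

# Made-up stand-in for the scraped lyrics: one string per song.
songs = ["first song\nsecond line", "another song"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")

    # Serialize the list to disk...
    with open(path, "wb") as f:
        pickle.dump(songs, f)

    # ...and load it back into a new object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == songs)
```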

# Question 2. Unigram Markov Chain Model

You will build a Markov chain for the artist whose lyrics you scraped in Lab A. Your model will process the lyrics and store the word transitions for that artist. The transitions will be stored in a dict called `chain`, which maps each word to a list of "next" words.

For example, if your song was ["The Joker" by the Steve Miller Band](https://www.youtube.com/watch?v=FgDU17xqNXo), `chain` might look as follows:

```
chain = {
    "some": ["people", "call", "people"],
    "call": ["me", "me", "me"],
    "the": ["space", "gangster", "pompitous", ...],
    "me": ["the", "the", "Maurice"],
    ...
}
```

Besides words, you should include a few additional states in your Markov chain. You should have `"<START>"` and `"<END>"` states so that we can keep track of how songs are likely to begin and end. You should also include a state called `"<N>"` to denote line breaks so that you can keep track of where lines begin and end. It is up to you whether you want to normalize case and strip punctuation.

So for example, for ["The Joker"](https://www.azlyrics.com/lyrics/stevemillerband/thejoker.html), you would add the following to your chain:

```
chain = {
    "<START>": ["Some", ...],
    "Some": ["people", ...],
    "people": ["call", ...],
    "call": ["me", ...],
    "me": ["the", ...],
    "the": ["space", ...],
    "space": ["cowboy,", ...],
    "cowboy,": ["yeah", ...],
    "yeah": ["<N>", ...],
    "<N>": ["Some", ..., "Come"],
    ...,
    "Come": ["on", ...],
    "on": ["baby", ...],
    "baby": ["and", ...],
    "and": ["I'll", ...],
    "I'll": ["show", ...],
    "show": ["you", ...],
    "you": ["a", ...],
    "a": ["good", ...],
    "good": ["time", ...],
    "time": ["<END>", ...],
}
```

Your chain will be trained not on just one song, but on all songs by your artist.
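
The extra states described above can be made concrete with a small tokenizer. This is a sketch, not the required approach: the function `tokenize` and its `normalize` flag are hypothetical names, and lowercasing plus stripping leading and trailing punctuation is just one possible normalization.

```python
import string


def tokenize(song, normalize=False):
    """Turn one song's lyrics into a list of states, including the tags."""
    tokens = ["<START>"]
    for i, line in enumerate(song.split("\n")):
        if i > 0:
            # Every line break in the lyrics becomes an <N> state.
            tokens.append("<N>")
        for word in line.split():
            if normalize:
                # Optional: lowercase and strip punctuation at the word's edges.
                word = word.lower().strip(string.punctuation)
                if not word:
                    continue
            tokens.append(word)
    tokens.append("<END>")
    return tokens


print(tokenize("Some people call me\nCome on baby"))
```

Each consecutive pair in the returned list is one transition to record in `chain`.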

In [6]:
def train_markov_chain(lyrics):
    """
    Args:
      - lyrics: a list of strings, where each string represents
                the lyrics of one song by an artist.
    
    Returns:
      A dict that maps a single word ("unigram") to a list of
      words that follow that word, representing the Markov
      chain trained on the lyrics.
    """
    chain = {"<START>": []}
    for lyric in lyrics:
        lyrics_new = lyric.replace("\n"," <N> ")
        tmp = '<START>'
        for word in lyrics_new.split():
            chain[tmp].append(word)
            tmp = word
            if tmp not in chain:
                chain[tmp] = []
        chain[tmp].append('<END>')
        
    return chain

In [7]:
# Load the pickled lyrics object that you created in Lab A.
import pickle
with open("lyrics.pkl", "rb") as f:
    lyrics = pickle.load(f)

# Call the function you wrote above.
chain = train_markov_chain(lyrics)

# What words tend to start a song (i.e., what words follow the <START> tag?)
print(chain["<START>"])
# What words tend to begin a line (i.e., what words follow the line break tag?)
print(chain["<N>"][:20])



['My', 'We', 'We', 'We', 'We', 'We', 'We', 'We', 'We', 'We', 'We', 'We', 'I', 'Did', 'Oh', 'I', 'No', 'It', 'I', 'Ha,', 'Date', 'Better', 'My', 'My', 'Nothing', 'How', 'We']
['A', '<N>', 'My', 'A', '<N>', "He's", "He's", "He's", 'Unity!', '<N>', 'Fear', 'Pain', 'Yeah,', 'Yeah,', 'Yeah,', 'Yeah,', '<N>', 'My', 'My', "He's"]


Now, let's generate new lyrics using the Markov chain you constructed above. To do this, we'll begin at the `"<START>"` state and randomly sample a word from the list of words that follow `"<START>"`. Then, at each step, we'll randomly sample the next word from the list of words that followed the current word. We will continue this process until we sample the `"<END>"` state. This will give us the complete lyrics of a randomly generated song!

You may find the `random.choice()` function helpful for this question.
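
Because each list in `chain` keeps duplicates, picking uniformly from the list samples each next word in proportion to how often it followed the current word. The sketch below makes those implied probabilities explicit; `transition_probabilities` is a hypothetical helper, and the tiny chain is taken from the "The Joker" example above.

```python
from collections import Counter


def transition_probabilities(chain, state):
    """Return the probability of each next word after `state`."""
    counts = Counter(chain[state])
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


example_chain = {"me": ["the", "the", "Maurice"]}

# "the" appears twice out of three followers, so it is sampled 2/3 of the time.
print(transition_probabilities(example_chain, "me"))
```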

In [8]:
import random

def generate_new_lyrics(chain):
    """
    Args:
      - chain: a dict representing the Markov chain,
               such as one generated by train_markov_chain()
    
    Returns:
      A string representing the randomly generated song.
    """
    
    # a list for storing the generated words
    words = []
    # generate the first word
    word = random.choice(chain["<START>"])
    
    # Keep sampling the next word until we reach the <END> state.
    # (<END> is never a key in chain, so we must stop before looking it up.)
    while word != "<END>":
        words.append(word)
        word = random.choice(chain[word])
    
    # join the words together into a string with line breaks
    lyrics = " ".join(words)
    return "\n".join(lyrics.split("<N>"))

In [17]:
print(generate_new_lyrics(chain))

We do not like you) 
 I'm evangelical 
 1, 2, 3, 4 
 Praise the lord 
 
 Compensate with humor or if really bored than sing 
 
 Cause I'm the worst lover you've ever had 
 I'm done 
 (I'm just saying I don't like you) 
 What fun 
 I want to move into a Bovis home 
 And neither of us care 
 
 I'm done 
 I could have danced with your mother 
 Cause she passed out on your stairs 
 'Cause nothing ever 
 
 Divide 
 And ride into the amber setting sun 
 Marching to the beat of someone's drum 
 
 Ha ha 
 Hey 
 Balling upon the laudanum 
 Ha ha 
 
 I guess this is as far 
 
 It starts in our books and behind our school gates 
 Men are scared it's their lives men will take 
 
 Cause I'm the worst lover you'll ever have 
 
 Yeah, yeah, yeah, yeah 
 Yeah, yeah, yeah, yeah 
 Yeah, dance till the sun goes round 
 Yeah, dance till the sun goes round 
 Yeah, dance till the sun goes round 
 Yeah, yeah, yeah, yeah 
 Yeah, yeah, Danny Nedelko 
 Yeah, dance till the sun goes round 
 Yeah, yeah, Danny Ned

# Question 3. Bigram Markov Chain Model

Now you'll build a more complex Markov chain that uses the last _two_ words (or bigram) to predict the next word. Your dict `chain` should now map a _tuple_ of two words to a list of words that appear after it.

As before, you should also include tags that indicate the beginning and end of a song, as well as line breaks. That is, a tuple might contain tags like `"<START>"`, `"<END>"`, and `"<N>"`, in addition to regular words. So for example, for ["The Joker"](https://www.azlyrics.com/lyrics/stevemillerband/thejoker.html), you would add the following to your chain:

```
chain = {
    (None, "<START>"): ["Some", ...],
    ("<START>", "Some"): ["people", ...],
    ("Some", "people"): ["call", ...],
    ("people", "call"): ["me", ...],
    ("call", "me"): ["the", ...],
    ("me", "the"): ["space", ...],
    ("the", "space"): ["cowboy,", ...],
    ("space", "cowboy,"): ["yeah", ...],
    ("cowboy,", "yeah"): ["<N>", ...],
    ("yeah", "<N>"): ["Some", ...],
    ("time", "<N>"): ["Come"],
    ...,
    ("<N>", "Come"): ["on", ...],
    ("Come", "on"): ["baby", ...],
    ("on", "baby"): ["and", ...],
    ("baby", "and"): ["I'll", ...],
    ("and", "I'll"): ["show", ...],
    ("I'll", "show"): ["you", ...],
    ("show", "you"): ["a", ...],
    ("you", "a"): ["good", ...],
    ("a", "good"): ["time", ...],
    ("good", "time"): ["<END>", ...],
}
```
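
One way to see where these keys come from is to slide a window of three items over the token sequence. The sketch below assumes the song has already been split into tokens that include the tags; `bigram_transitions` is a hypothetical helper, and padding with `None` mirrors the `(None, "<START>")` key shown above.

```python
def bigram_transitions(tokens):
    """List every ((previous, current), next) triple in a token sequence."""
    # Pad with None so that the first key is (None, "<START>").
    padded = [None] + tokens
    return [((a, b), c) for a, b, c in zip(padded, padded[1:], padded[2:])]


tokens = ["<START>", "Some", "people", "<END>"]
for key, next_word in bigram_transitions(tokens):
    print(key, "->", next_word)
```

Appending each `next_word` to `chain[key]` would build the bigram chain.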

In [10]:
def train_markov_chain(lyrics):
    """
    Args:
      - lyrics: a list of strings, where each string represents
                the lyrics of one song by an artist.
    
    Returns:
      A dict that maps a tuple of two words ("bigram") to a list of
      words that follow that bigram, representing the Markov
      chain trained on the lyrics.
    """
    chain = {(None,"<START>"): []}
    for lyric in lyrics:
        lyrics_new = lyric.replace("\n"," <N> ")
        last_2 = (None,"<START>")
        for word in lyrics_new.split():
            chain[last_2].append(word)
            last_2 = (last_2[1],word)
            if last_2 not in chain:
                chain[last_2] = []
        chain[last_2].append('<END>')
        
    return chain

In [11]:
# Load the pickled lyrics object that you created in Lab A.
import pickle
with open("lyrics.pkl", "rb") as f:
    lyrics = pickle.load(f)

# Call the function you wrote above.
chain = train_markov_chain(lyrics)

# What words tend to start a song (i.e., what words follow the <START> tag?)
print(chain[(None,"<START>")])
# What words tend to begin a line (i.e., what words follow the line break tag?)



['My', 'We', 'We', 'We', 'We', 'We', 'We', 'We', 'We', 'We', 'We', 'We', 'I', 'Did', 'Oh', 'I', 'No', 'It', 'I', 'Ha,', 'Date', 'Better', 'My', 'My', 'Nothing', 'How', 'We']
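
The bigram chain has no single `"<N>"` key, so finding the words that tend to begin a line means pooling the followers of every bigram that ends in `"<N>"`. A minimal sketch, with `words_after` as a hypothetical helper and a tiny chain built from the "The Joker" example:

```python
def words_after(chain, tag):
    """Collect the followers of every bigram whose second element is `tag`."""
    followers = []
    for (_, last), next_words in chain.items():
        if last == tag:
            followers.extend(next_words)
    return followers


example_chain = {
    ("yeah", "<N>"): ["Some"],
    ("time", "<N>"): ["Come"],
    ("<N>", "Come"): ["on"],
}

print(words_after(example_chain, "<N>"))
```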


Now, let's generate new lyrics using the Markov chain you constructed above. To do this, we'll begin at the `(None, "<START>")` state and randomly sample a word from the list of words that follow this bigram. Then, at each step, we'll randomly sample the next word from the list of words that followed the current bigram (i.e., the last two words). We will continue this process until we sample the `"<END>"` state. This will give us the complete lyrics of a randomly generated song!

In [12]:
import random

def generate_new_lyrics(chain):
    """
    Args:
    - chain: a dict representing the Markov chain,
    such as one generated by train_markov_chain()

    Returns:
    A string representing the randomly generated song.
    """

    # a list for storing the generated words
    words = []
    
    # generate the first word
    first_word = random.choice(chain[(None, "<START>")])
    words.append(first_word)
    second_word = random.choice(chain[("<START>", first_word)])
    words.append(second_word)
    i=1
    while True: 
        prev_word = words[i-1]
        current_word = words[i]
        if(current_word == "<END>"):
            break 
        words.append(random.choice(chain[(prev_word,current_word)]))
        i += 1
    
    # join the words together into a string with line breaks
    lyrics = " ".join(words[:-1])
    return "\n".join(lyrics.split("<N>"))

In [20]:
print(generate_new_lyrics(chain))

Better son 
 Makes me feel pretty strong 
 Holiday, holiday, I know nothing 
 I'm the worst lover you'll ever have 
 
 Let's drink to the summer time until we turn blue 
 I'll tear down every wall of a C.A.R.A.V.A.G.G.I.O 
 Just for you 
 It's true it's true 
 Whip crack whip crack 
 
 When it comes around 
 We won't last five fucking minutes 
 With a body like mine 
 You won't see mine 
 
 My mother worked 17 hours 7 days a week 
 
 They don't care about the summertime 
 Cheap drugs and cheap cheap wine 
 They put a hammer through his head 
 And conquer 
 
 I could have danced with your mother 
 Cause I'm the worst lover you'll ever have 
 Hands down goddamn worst lover you'll ever have 
 I'm shaking fast 
 Whip crack whip crack 
 
 Did you see that painting what Basquiat done? 
 Looks like it was coke 
 Maybe it was God 
 Maybe it was coke 
 Maybe it was Thursday 
 
 Let's go 
 I guess this is as far as we go 
 I guess this is as far as we go 
 I want to move into a Bovis home 
 And 

# Analysis

Compare the quality of the lyrics generated by the unigram model (in Lab B) and the bigram model (in Lab C). Which model seems to generate more reasonable lyrics? Can you explain why? What do you see as the advantages and disadvantages of each model?

**YOUR ANSWER HERE.**

The lyrics in Lab C seem more reasonable: the lines have better grammatical structure, and the model picked up whole phrases from the songs. (These phrases could also have popped up because IDLES is a new band, so there are not enough lyrics for the algorithm to get creative; with so little text, many bigrams likely have only one possible next word.) The advantage of the bigram model is that its lyrics are more coherent, but the disadvantage is that it is more complicated and takes more work to compute.
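
One way to probe the phrase-copying observation is to measure how many states in a chain have only one distinct follower, since the model can only reproduce the original text from such states. This is a rough sketch: `deterministic_fraction` is a hypothetical helper, and the toy chain is made up for illustration.

```python
def deterministic_fraction(chain):
    """Return the fraction of states that have exactly one distinct follower."""
    if not chain:
        return 0.0
    single = sum(1 for followers in chain.values() if len(set(followers)) == 1)
    return single / len(chain)


toy_chain = {
    "a": ["b", "b"],  # only ever followed by "b"
    "b": ["c", "d"],  # two possible followers
}

print(deterministic_fraction(toy_chain))
```

Running it on both the unigram and the bigram chain would show which model has less freedom to deviate from the source lyrics.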

# Submission Instructions

Once you are finished, follow these steps:

1. Restart the kernel and re-run this notebook from beginning to end by going to `Kernel > Restart Kernel and Run All Cells`.
2. If this process stops halfway through, that means there was an error. Correct the error and repeat Step 1 until the notebook runs from beginning to end.
3. Double check that there is a number next to each code cell and that these numbers are in order.

Then, submit your lab as follows:

1. Go to `File > Export Notebook As > PDF`.
2. Double check that the entire notebook, from beginning to end, is in this PDF file. (If the notebook is cut off, try first exporting the notebook to HTML and printing to PDF.)
3. Upload the PDF [to PolyLearn](https://polylearn.calpoly.edu/AY_2018-2019/mod/assign/view.php?id=349486).