## Song Lyrics Generator

#### Problem Statement
Overcoming writer's block and writing creative lyrics is difficult for most musicians, and it leads to frustration and feeling "stuck". For this project, I hope to create a text generation model that predicts the next word, next few words, or next line of a new song, in order to help musicians write awesome new lyrics.
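
To make "predict the next word" concrete, here is a minimal sketch that uses a plain bigram counter as a stand-in. It is deliberately not GPT-2, and the function names and toy lines are made up for illustration; it only shows the shape of the task (prompt in, a few candidate next words out).

```python
from collections import Counter, defaultdict

def build_bigram_model(lines):
    """Count which word follows which across a list of lyric lines."""
    followers = defaultdict(Counter)
    for line in lines:
        words = line.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            followers[current_word][next_word] += 1
    return followers

def suggest_next_words(model, prompt, k=3):
    """Return up to k candidate next words for the last word of the prompt."""
    words = prompt.lower().split()
    if not words:
        return []
    # .get avoids adding empty entries to the model for unseen words
    candidates = model.get(words[-1], Counter())
    return [word for word, _ in candidates.most_common(k)]

# Hypothetical toy corpus, not real lyrics
toy_lines = [
    "the river runs to the sea",
    "the river sings to me",
    "the night runs on",
]
model = build_bigram_model(toy_lines)
suggestions = suggest_next_words(model, "Down by the")
print(suggestions)
```

A bigram counter only looks at the previous word, so it cannot carry a theme or rhyme across a line; that gap is what a larger language model is meant to fill.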

#### Proposed Method and Models
- GPT-2 as the base for my model. Depending on the output, it would be interesting to let the user select a single artist they want their lyrics to sound like (MVP). I could also pull a list of similar artists from Spotify and group them together to train another model, which would showcase my ability to cluster. In addition, I could create a generator by genre, which would require getting a list of the top songs by genre and training a model for each.
- Flask App
- NLP
- AWS

#### Risks and Assumptions of My Data
- Will have a large variety of artists and the lyrics to each of their songs
- I could have duplicates of the same song
- Song lyrics can be abstract, and I am not yet sure how that will affect my model
- The score of the model can be subjective
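
The duplicate risk can be checked before training. The sketch below is one possible approach, assuming duplicates mostly show up as the same title with a suffix such as " - Live at ..." or a bracketed note; the helper names and the sample rows are illustrative, not taken from the scraped data.

```python
import re

def normalize_title(title):
    """Lowercase a title and drop bracketed notes, ' - ...' suffixes and punctuation."""
    title = title.lower()
    title = re.sub(r"[\(\[].*?[\)\]]", "", title)  # drop "(Live)", "[Remix]", ...
    title = re.sub(r"\s+-\s+.*$", "", title)        # drop " - Live at ..." suffixes
    title = re.sub(r"[^a-z0-9 ]", "", title)        # drop punctuation and non-ASCII characters
    return " ".join(title.split())

def find_duplicate_titles(songs):
    """Group (artist, title) pairs whose normalized titles collide for the same artist."""
    groups = {}
    for artist, title in songs:
        groups.setdefault((artist, normalize_title(title)), []).append(title)
    return {key: titles for key, titles in groups.items() if len(titles) > 1}

# Illustrative rows only
sample_songs = [
    ("Jack Johnson", "Constellations"),
    ("Jack Johnson", "Constellations - Live at Bonnaroo"),
    ("John Mayer", "Gravity"),
]
duplicates = find_duplicate_titles(sample_songs)
print(duplicates)
```

Normalizing this aggressively can also merge songs that are genuinely different, so the flagged groups are better treated as candidates to review than as rows to drop automatically.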

#### Initial Goals and Success Criteria
- Predict either the next word, words, or line of a song based on a text input from the user
- Consider using the BLEURT score as a success metric: https://medium.com/@jrodthoughts/googles-bleurt-is-bert-for-evaluating-natural-language-generation-models-fa0ce898c38a
- Give the user multiple options for the next word(s), and score the model on the number of accepted recommendations
- Rhyme (stretch)
- Write an original song with > 50% of the lyrics computer generated
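
Two of these criteria reduce to simple ratios. The sketch below shows one way they could be computed, assuming each suggestion round is logged as the options offered plus the option the user picked (or `None`), and each song line is flagged as generated or not. Both data layouts and both function names are assumptions made for illustration.

```python
def acceptance_rate(rounds):
    """Share of suggestion rounds in which the user accepted one of the offered options.

    `rounds` is a list of (offered_options, accepted_option_or_None) pairs.
    """
    if not rounds:
        return 0.0
    accepted = sum(1 for options, choice in rounds if choice in options)
    return accepted / len(rounds)

def generated_share(lines):
    """Fraction of song lines flagged as computer generated.

    `lines` is a list of (text, is_generated) pairs.
    """
    if not lines:
        return 0.0
    return sum(1 for _, is_generated in lines if is_generated) / len(lines)

# Hypothetical logs
rounds = [
    (["sea", "sky", "road"], "sky"),   # accepted
    (["home", "gone"], None),          # user rejected every option
    (["night", "light"], "light"),     # accepted
]
song = [("line one", True), ("line two", True), ("line three", False), ("line four", True)]

print(acceptance_rate(rounds))
print(generated_share(song) > 0.5)
```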

#### Initial Data Source and EDA
I plan to pull a handful more artists and add them as CSVs. With the functions below, the biggest hurdle is the amount of time it takes to scrape large pulls of data, but it is a repeatable process that I could let run during off hours. I have not fully decided whether to integrate the Spotify API data, which is why I've tried to pull the `spotify_url` for the songs below. I may be able to look songs up by `id`, or in the worst case use the `artist_name` with the `title`.
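
If the Spotify lookup goes through the track `id`, that id can be read off the stored `spotify_url`, since the links have the form `https://open.spotify.com/track/<id>`. A minimal sketch, where the helper name and the placeholder id are hypothetical:

```python
from urllib.parse import urlparse

def spotify_track_id(url):
    """Return the track id from an open.spotify.com track URL, or None if there is none."""
    # Missing values may be the string 'NA' or a non-string such as NaN
    if not isinstance(url, str) or url == 'NA':
        return None
    parsed = urlparse(url)
    parts = [part for part in parsed.path.split('/') if part]
    if parsed.netloc == 'open.spotify.com' and len(parts) == 2 and parts[0] == 'track':
        return parts[1]
    return None

# 'EXAMPLE_TRACK_ID' is a placeholder, not a real Spotify id
print(spotify_track_id('https://open.spotify.com/track/EXAMPLE_TRACK_ID'))
print(spotify_track_id('NA'))
```

Songs without a usable link would still need the fallback search by `artist_name` and `title`.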

In [2]:
#Keep my access token secret

with open('./token.txt', 'r') as f:
    token = f.read().replace('\n', '')

In [26]:
#Credit: https://lyricsgenius.readthedocs.io/en/master/index.html
from lyricsgenius import Genius
import pandas as pd
import numpy as np
import json
import os
import requests
from bs4 import BeautifulSoup

genius = Genius(token, 
                sleep_time = 1, #don't overload the servers
                remove_section_headers=True, #removes [Chorus], [Bridge], etc. headers from lyrics.
                verbose=False) # Do not print out results

# Scrape Billboard Top Artist Categories

In [21]:
#function will scrape the top billboard artists for one genre over the given years (default: 2017-2019)
#genre must be a single string such as 'folk-artists', 'pop-songs-artists' or 'dance-club-artists'
def get_top_artists(years=('2017','2018','2019'), genre='folk-artists'):
    
    #create empty list to store artists
    artists = []
    
    #loop through each year
    for year in years:
    
        #build the url
        url = 'https://www.billboard.com/charts/year-end/'+ year + '/' + genre
        res = requests.get(url)
        soup = BeautifulSoup(res.content, 'lxml')
        
        #scrape the top artists
        for row in soup.find_all('div', attrs={'class': 'ye-chart-item__title'}):
            artists.append((row.find('a').text).replace('\n','').strip()) #remove extra spaces and new line chars
            
    return list(set(artists)) #only return unique values

In [22]:
folk_artists = get_top_artists(genre='folk-artists')
folk_artists

['Tyler Childers',
 'Chris Stapleton',
 'John Mayer',
 'Hozier',
 'Vance Joy',
 'Mumford & Sons',
 'Neil Young',
 'Kacey Musgraves',
 'Lord Huron',
 'Leon Bridges',
 'Jack Johnson',
 'James Taylor',
 'Ed Sheeran',
 'Simon & Garfunkel',
 'KALEO',
 'The Lumineers']

In [8]:
pop_artists = get_top_artists(genre='pop-songs-artists')
pop_artists

['Drake',
 'NF',
 'Shawn Mendes',
 'Panic! At The Disco',
 'Selena Gomez',
 'Camila Cabello',
 'Halsey',
 'Bruno Mars',
 'Imagine Dragons',
 'Lizzo',
 'Ed Sheeran',
 'Billie Eilish',
 'Alessia Cara',
 'Dua Lipa',
 'The Chainsmokers',
 'Ariana Grande',
 'Khalid',
 'Charlie Puth',
 'Maroon 5',
 'Niall Horan',
 'Jonas Brothers',
 'Post Malone']

In [9]:
dance_artists = get_top_artists(genre='dance-club-artists')
dance_artists

['Sabrina Carpenter',
 'DNCE',
 'Kendra Erika',
 'Hilary Roberts',
 'Jonas Blue',
 'Sia',
 'Skylar Stecker',
 'Austin Mahone',
 'Lisa Williams',
 'Calvin Harris',
 'Marshmello',
 'VASSY',
 'Kelly Clarkson',
 'Bleona',
 'FISHER',
 'DJs From Mars',
 'Troye Sivan',
 'Blondie',
 'DJ Snake',
 'Sting',
 'Halsey',
 'Bruno Mars',
 'Katy Perry',
 'Sam Smith',
 'Kristine W',
 'David Guetta',
 'Ed Sheeran',
 'Axwell & Ingrosso',
 'J Sutta',
 'Barbara Tucker',
 'Donna Summer',
 'Rihanna',
 'R3HAB',
 'Dua Lipa',
 'Rita Ora',
 'Tony Moran',
 'Toni Braxton',
 'Dirty Werk',
 'Ariana Grande',
 'Dido',
 'Miley Cyrus',
 'Diana Ross',
 'Clean Bandit',
 'Deborah Cox',
 'LeAnn Rimes',
 'Bebe Rexha',
 'Roberto Surace',
 'Jack Back',
 'Niall Horan',
 'Gryffin',
 'Gorgon City',
 'P!nk',
 'Madonna',
 'Ava Max',
 'Dave Aude',
 'Alesso',
 'U2',
 'Todd Edwards',
 'Mark Ronson',
 'Ono',
 'Fatboy Slim',
 'Demi Lovato']

In [34]:
### Store to be used in another notebook
%store folk_artists
%store pop_artists
%store dance_artists

Stored 'folk_artists' (list)
Stored 'pop_artists' (list)
Stored 'dance_artists' (list)


# Scrape Lyrics from Multiple Artists

In [29]:
#Function to scrape lyrics from the Genius API

def scrape_lyrics(artist_list):
    
    for artist in artist_list:
        artist_lyrics = genius.search_artist(artist) #grab all of the lyrics from an artist 
        
        #skip the artist if the search returned nothing
        if artist_lyrics is None:
            continue
        
        #save the output as a json file
        genius.save_artists(artists=[artist_lyrics], overwrite=True)

In [31]:
#scrape all lyrics for top folk_artists from 2017-2019
# scrape_lyrics(folk_artists)
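
Because the scrape is slow, it helps to skip artists whose file already exists when a run is restarted. The sketch below assumes the saved file is named `Lyrics_` plus the artist name with spaces removed, which matches `Lyrics_JohnMayer.json` and `Lyrics_JackJohnson.json` in this notebook; how names with punctuation (for example `Mumford & Sons`) are written to disk has not been checked, and both helper names are hypothetical.

```python
def expected_filename(artist):
    """Guess the JSON filename for an artist.

    ASSUMPTION: 'Lyrics_' + name without spaces + '.json'; not verified for punctuation.
    """
    return 'Lyrics_' + artist.replace(' ', '') + '.json'

def artists_still_to_scrape(artists, existing_files):
    """Return the artists that do not yet have a saved lyrics file."""
    existing = set(existing_files)
    return [artist for artist in artists if expected_filename(artist) not in existing]

# Example with the two files already saved in this notebook
already_saved = ['Lyrics_JohnMayer.json', 'Lyrics_JackJohnson.json']
remaining = artists_still_to_scrape(['John Mayer', 'Jack Johnson', 'Hozier'], already_saved)
print(remaining)

# In the notebook this could be used as:
# scrape_lyrics(artists_still_to_scrape(folk_artists, os.listdir('./')))
```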

# Function to Create Lyrics Dataframe for each artist

In [2]:
def get_lyrics(filename):
    
    # Reading the json as a dict
    with open(filename) as json_data:
        data = json.load(json_data)

    #Source credit: https://stackoverflow.com/questions/28373282/how-to-read-a-json-dictionary-type-file-with-pandas
    
    #not every song has a spotify url, so record 'NA' when one is missing
    spotify_url = []
    for song in data['songs']: #for each song

        #collect the spotify links from the song's media list (empty if there are none)
        spotify_links = [item['url'] for item in song['media'] if item['provider'] == 'spotify']

        #keep only the first link so the list stays the same length as data['songs']
        spotify_url.append(spotify_links[0] if spotify_links else 'NA')
    
    #create a dataframe
    df =  pd.DataFrame({

        #multiply by length of songs in json response to create a new row for each
        'artist_name' : [data['name']] * len(data['songs']),                                              #artist name
        'image_url' : [data['image_url']] * len(data['songs']),                                           #thumbnail image
        'url' : [data['url']] * len(data['songs']),                                                       #genius url

        #song data
        'title' : [data['songs'][item]['title'] for item in range(0,len(data['songs']))],                 #title of each song
        'lyrics' : [data['songs'][item]['lyrics'] for item in range(0,len(data['songs']))],                #song lyrics           
        'spotify_url' :  spotify_url, #spotify url


    })
    return df    

### Collect data for each artist

In [43]:
john_mayer_df = get_lyrics('Lyrics_JohnMayer.json')
john_mayer_df.head(1)

Unnamed: 0,artist_name,image_url,url,title,lyrics,spotify_url
0,John Mayer,https://images.genius.com/4c443bf06bf6b696ad4f...,https://genius.com/artists/John-mayer,New Light,"Ah, ah, ah\nAh...\n\nI'm the boy in your other...",https://open.spotify.com/track/3bH4HzoZZFq8UpZ...


In [44]:
jack_johnson_df = get_lyrics('Lyrics_JackJohnson.json')
jack_johnson_df.head(1)

Unnamed: 0,artist_name,image_url,url,title,lyrics,spotify_url
0,Jack Johnson,https://s3.amazonaws.com/rapgenius/Jack+Johnso...,https://genius.com/artists/Jack-johnson,Banana Pancakes,Can't you see that it's just raining?\nThere a...,https://open.spotify.com/track/451GvHwY99NKV4z...


In [59]:
# #blank list to store dataframe names
# artist_concat = []

# #when add more artists can loop through like this in one function
# for filename in os.listdir('./'):
#     if filename.startswith('Lyrics_'):
#         print(filename)

# #create a dataframe for each artist

# #stack the dataframes
# dfs = pd.concat([get_lyrics('Lyrics_JohnMayer.json'), get_lyrics('Lyrics_JackJohnson.json')])

# #return stacked dataframe as output
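
The plan in the commented-out cell above can be split into a small, testable step that picks the lyrics files out of a directory listing, followed by the stacking itself. This is a sketch only: `select_lyrics_files` is a hypothetical helper, and the pandas line is left as a comment because it depends on `get_lyrics` and the saved JSON files.

```python
def select_lyrics_files(filenames, prefix='Lyrics_', suffix='.json'):
    """Return the saved lyrics files from a directory listing, in sorted order."""
    return sorted(
        name for name in filenames
        if name.startswith(prefix) and name.endswith(suffix)
    )

# Example listing; in the notebook this would be os.listdir('./')
listing = ['Lyrics_JohnMayer.json', 'all_songs.csv', 'Lyrics_JackJohnson.json', 'token.txt']
lyrics_files = select_lyrics_files(listing)
print(lyrics_files)

# With pandas and get_lyrics available, the stacked dataframe would then be:
# all_songs_df = pd.concat([get_lyrics(name) for name in select_lyrics_files(os.listdir('./'))])
```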

# Stack the Dataframes

In [47]:
all_songs_df = pd.concat([john_mayer_df,jack_johnson_df])
all_songs_df

Unnamed: 0,artist_name,image_url,url,title,lyrics,spotify_url
0,John Mayer,https://images.genius.com/4c443bf06bf6b696ad4f...,https://genius.com/artists/John-mayer,New Light,"Ah, ah, ah\nAh...\n\nI'm the boy in your other...",https://open.spotify.com/track/3bH4HzoZZFq8UpZ...
1,John Mayer,https://images.genius.com/4c443bf06bf6b696ad4f...,https://genius.com/artists/John-mayer,Gravity,Gravity is working against me\nAnd gravity wan...,https://open.spotify.com/track/52K3qt1rCYf3Ciu...
2,John Mayer,https://images.genius.com/4c443bf06bf6b696ad4f...,https://genius.com/artists/John-mayer,Slow Dancing in a Burning Room,It's not a silly little moment\nIt's not the s...,https://open.spotify.com/track/3f8Uygfz3CIpUCo...
3,John Mayer,https://images.genius.com/4c443bf06bf6b696ad4f...,https://genius.com/artists/John-mayer,Free Fallin’,"She's a good girl, loves her mama\nLoves Jesus...",https://open.spotify.com/track/4LloVtxNZpeh7q7...
4,John Mayer,https://images.genius.com/4c443bf06bf6b696ad4f...,https://genius.com/artists/John-mayer,In the Blood,How much of my mother has my mother left in me...,https://open.spotify.com/track/77Y57qRJBvkGCUw...
...,...,...,...,...,...,...
176,Jack Johnson,https://s3.amazonaws.com/rapgenius/Jack+Johnso...,https://genius.com/artists/Jack-johnson,"Good people - Live At Bonnaroo, Manchester, Te...","Well, you win, it's your show now\nSo what's i...",
177,Jack Johnson,https://s3.amazonaws.com/rapgenius/Jack+Johnso...,https://genius.com/artists/Jack-johnson,"Constellations - Live at Bonnaroo, Manchester,...",The light was leaving in the west it was blue\...,
178,Jack Johnson,https://s3.amazonaws.com/rapgenius/Jack+Johnso...,https://genius.com/artists/Jack-johnson,"Times Like These - Live In Santa Barbara, Cali...",In times like these\nIn times like those\nWhat...,
179,Jack Johnson,https://s3.amazonaws.com/rapgenius/Jack+Johnso...,https://genius.com/artists/Jack-johnson,Secret Heart,"Secret Heart, what are you made of?\nWhat are ...",


### Remove "Live Songs" and "Covers" From The Dataframe

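Live recordings usually duplicate a studio version, and covers belong to another artist. The filter below removes only the live recordings; covers still need to be handled.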

In [118]:
# Remove songs that were performed live
all_songs = all_songs_df[~all_songs_df['title'].str.contains('Live in|Live at', case=False, regex=True)] #TODO: also filter out covers
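
One caveat with the pattern `'Live in|Live at'`: it is a substring match, so a title containing "Alive in" would also be dropped. The sketch below shows a word-boundary variant using Python's `re` module; the second test title is made up, and the first is a shortened form of a title from the dataframe above.

```python
import re

# \b keeps "live" from matching inside longer words such as "Alive"
LIVE_PATTERN = re.compile(r"\blive\s+(?:in|at)\b", re.IGNORECASE)

def is_live_recording(title):
    """Return True when a title looks like a live recording ('Live in ...' / 'Live at ...')."""
    return bool(LIVE_PATTERN.search(title))

print(is_live_recording("Times Like These - Live In Santa Barbara"))
print(is_live_recording("Alive in the Dark"))  # hypothetical title

# The same pattern can be passed to pandas:
# all_songs_df['title'].str.contains(r'\blive\s+(?:in|at)\b', case=False, regex=True)
```

A studio song whose title itself begins with "Live in" or "Live at" would still be filtered out, so this narrows the false positives without removing them.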

In [128]:
#save as a csv
all_songs.to_csv('all_songs.csv', index=False)

### Add a column `billboard_genre`

In [35]:
#take the data frame and add an extra column using .map
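
A possible way to fill in this step: build an artist-to-genre dictionary from the three Billboard lists, then hand it to `.map`. Several artists (Ed Sheeran, for example) appear in more than one list, so this sketch joins their genres with `|`; that convention, the short genre labels, and the helper name are assumptions, and the lists are abbreviated here.

```python
def build_genre_lookup(genre_to_artists):
    """Map each artist to a '|'-joined, sorted string of the genres they appear in."""
    lookup = {}
    for genre, artists in genre_to_artists.items():
        for artist in artists:
            lookup.setdefault(artist, []).append(genre)
    return {artist: '|'.join(sorted(genres)) for artist, genres in lookup.items()}

# Abbreviated versions of folk_artists, pop_artists and dance_artists
genre_lookup = build_genre_lookup({
    'folk': ['John Mayer', 'Jack Johnson', 'Ed Sheeran'],
    'pop': ['Ed Sheeran', 'Drake'],
    'dance': ['Ed Sheeran', 'Sia'],
})
print(genre_lookup['John Mayer'])
print(genre_lookup['Ed Sheeran'])

# In the notebook, the new column would then be:
# all_songs = all_songs.assign(billboard_genre=all_songs['artist_name'].map(genre_lookup))
```

Artists missing from the dictionary would come out of `.map` as missing values, which makes gaps in the lookup easy to spot.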

### Check Spotify URLs

In [127]:
#view how many songs don't have a spotify url
no_spotify_ct = all_songs[all_songs['spotify_url'] == 'NA'].shape[0]
yes_spotify_ct = all_songs[all_songs['spotify_url'] != 'NA'].shape[0]
print(f'Songs without Spotify URL: {no_spotify_ct}')
print(f'Songs with Spotify URL: {yes_spotify_ct}')
print(f'Percentage without Spotify URL: {round(no_spotify_ct / (all_songs.shape[0]),3)}')
print(f'Percentage with Spotify URL: {round(yes_spotify_ct / (all_songs.shape[0]),3)}')

Songs without Spotify URL: 224
Songs with Spotify URL: 102
Percentage without Spotify URL: 0.687
Percentage with Spotify URL: 0.313


#### Check the Value Counts of Each Artist

In [36]:
# all_songs['artist_name'].value_counts()

# Create a Dataframe with all of the lyrics in one cell by artist

In [129]:
#collect each artist's lyrics into a single string
artists = all_songs['artist_name'].unique()

ly_dict = {
    'artist_name' : list(artists),
    'lyrics' : [all_songs[all_songs['artist_name'] == artist]['lyrics'].str.cat(sep=' ') for artist in artists]
}

all_lyrics = pd.DataFrame(ly_dict)

In [130]:
all_lyrics.head()

Unnamed: 0,artist_name,lyrics
0,John Mayer,"Ah, ah, ah\nAh...\n\nI'm the boy in your other..."
1,Jack Johnson,Can't you see that it's just raining?\nThere a...
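
Joining every song with a single space, as above, removes the boundary between songs. If the text is later used to fine-tune GPT-2, one option is to separate songs with GPT-2's `<|endoftext|>` token instead. The sketch below illustrates that idea on made-up strings; whether a given fine-tuning tool expects this exact layout is an assumption to verify, and `build_training_text` is a hypothetical helper.

```python
END_OF_TEXT = '<|endoftext|>'  # GPT-2's end-of-text token

def build_training_text(lyrics_list):
    """Join songs into one training string, with an end-of-text marker between songs."""
    # Skip missing values (non-strings) and empty lyrics
    cleaned = [lyrics.strip() for lyrics in lyrics_list if isinstance(lyrics, str) and lyrics.strip()]
    return ('\n' + END_OF_TEXT + '\n').join(cleaned)

# Made-up stand-ins for two songs' lyrics
training_text = build_training_text(["first song line\nsecond line", None, "  ", "another song"])
print(training_text)
```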


### Save as a CSV

In [131]:
#save as a csv
all_lyrics.to_csv('all_lyrics.csv', index=False)

### MISC

In [53]:
# all_songs_df[all_songs_df['artist_name'] == 'John Mayer']['title'].str.cat(sep=' ')

In [52]:
# ' '.join(all_songs_df['title'])

In [98]:
#once more artists are added, loop through the saved files like this inside one function
for filename in os.listdir('./'):
    if filename.startswith('Lyrics_'):
        print(filename)

Lyrics_JohnMayer.json
Lyrics_JackJohnson.json
