<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Create-song-lyric-dataset-for-gpt-2" data-toc-modified-id="Create-song-lyric-dataset-for-gpt-2-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Create song lyric dataset for gpt-2</a></span><ul class="toc-item"><li><span><a href="#Import-functions-to-scrape-api.genius.com" data-toc-modified-id="Import-functions-to-scrape-api.genius.com-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Import functions to scrape api.genius.com</a></span></li><li><span><a href="#Get-song-urls-from-artist-names" data-toc-modified-id="Get-song-urls-from-artist-names-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Get song urls from artist names</a></span></li><li><span><a href="#Remove-duplicates-and-shuffle" data-toc-modified-id="Remove-duplicates-and-shuffle-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Remove duplicates and shuffle</a></span></li><li><span><a href="#Get-lyrics-from-all-song-urls" data-toc-modified-id="Get-lyrics-from-all-song-urls-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Get lyrics from all song urls</a></span></li><li><span><a href="#Additional-post-processing" data-toc-modified-id="Additional-post-processing-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Additional post processing</a></span></li><li><span><a href="#Write-text-to-file-on-disk" data-toc-modified-id="Write-text-to-file-on-disk-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Write text to file on disk</a></span></li></ul></li></ul></div>

In [None]:
import os
import yaml
import random
from tqdm import tqdm

from scrape_genius import request_song_urls, scrape_song_lyrics, clean_song_lyrics

# Create song lyric dataset for gpt-2

We can scrape api.genius.com to get the lyrics for selected artists and, after some data cleaning, end up with a decent dataset that should work as input to gpt-2.

## Import functions to scrape api.genius.com

To get an API token, see https://docs.genius.com/
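
Instead of pasting the token into the notebook, it can be read from the environment so it never ends up in version control. The following is a minimal sketch; the environment variable name `GENIUS_API_TOKEN` and the helper `load_genius_token` are my own assumptions and are not part of `scrape_genius`.

```python
import os

# Hypothetical variable name; any name works as long as it matches your shell setup.
TOKEN_ENV_VAR = "GENIUS_API_TOKEN"


def load_genius_token(env=os.environ):
    """Return the token from the given mapping, or raise a clear error if it is missing."""
    token = env.get(TOKEN_ENV_VAR, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {TOKEN_ENV_VAR} environment variable before scraping."
        )
    return token
```

In the config cell this could then be used as `GENIUS_API_TOKEN = load_genius_token()`.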

In [None]:
GENIUS_API_TOKEN = ""
ARTIST_CONFIG = "artist_dataset_ymls/english_rappers.yml"

## Get song urls from artist names

We can get a selection of songs for each artist.
The selection is not truly random: it follows the order in which genius.com returns the songs,
which is probably sorted by popularity.

In this example I used bad German gangster rap artists.
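
Judging from how the config is used in this notebook (`artist_params["songs"]` and `artist_params["exclusion"]`), each top-level key is an artist name that maps to those two parameters. The sketch below shows that shape as the dict `yaml.safe_load` would return, plus a small check. The artist names, the values, and the helper `validate_artist_config` are placeholders, and I am assuming `exclusion` is a string, since it is passed on as `exclusion_string`.

```python
# Hypothetical config, written as the dict that yaml.safe_load would return.
# Artist names and values are placeholders, not the contents of the real file.
example_config = {
    "Example Artist A": {"songs": 50, "exclusion": "Remix"},
    "Example Artist B": {"songs": 20, "exclusion": ""},
}


def validate_artist_config(config):
    """Return a list of problems; an empty list means every entry has both keys."""
    problems = []
    for artist_name, params in config.items():
        if not isinstance(params, dict):
            problems.append(f"{artist_name}: expected a mapping of parameters")
            continue
        for key in ("songs", "exclusion"):
            if key not in params:
                problems.append(f"{artist_name}: missing key '{key}'")
    return problems
```

Running such a check right after loading the file surfaces a missing key before any requests are sent.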

In [None]:
with open(ARTIST_CONFIG, "r") as stream:
    artist_config = yaml.safe_load(stream)

artist_config

In [None]:
artists_urls = []
for artist_name, artist_params in artist_config.items():
    artists_urls += request_song_urls(
        artist_name=artist_name,
        song_cap=artist_params["songs"],
        genius_api_token=GENIUS_API_TOKEN,
        exclusion_string=artist_params["exclusion"],
    )

len(artists_urls)

## Remove duplicates and shuffle

In [None]:
artists_urls = list(dict.fromkeys(artists_urls))
len(artists_urls)

In [None]:
random.shuffle(artists_urls)

In [None]:
artists_urls[:10]
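
The two steps above can be folded into one helper. `dict.fromkeys` is used rather than `set` because it keeps the first occurrence of each URL in its original order, which makes the result reproducible. The helper name `dedupe_and_shuffle` and the optional `seed` parameter are my additions for illustration; the notebook itself shuffles with the global `random` state.

```python
import random


def dedupe_and_shuffle(urls, seed=None):
    """Drop repeated URLs (keeping first occurrences), then shuffle the result."""
    unique_urls = list(dict.fromkeys(urls))  # dicts preserve insertion order
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    rng.shuffle(unique_urls)
    return unique_urls
```

Passing a fixed `seed` gives the same order on every run, which helps when comparing datasets built from the same config.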

## Get lyrics from all song urls

In [None]:
all_lyrics = []
for url in tqdm(artists_urls):
    lyrics = scrape_song_lyrics(url)
    if lyrics is not None:
        all_lyrics.extend(clean_song_lyrics(lyrics))

## Additional post processing

It seems that the annotations on genius.com (interpretations of parts of the song text) cause lines to be split wherever an annotation begins or ends.

This makes sense: each annotation introduces a new HTML tag, and BeautifulSoup inserts a separator there when extracting the text.

TODO: fix this properly when calling BeautifulSoup's `get_text()`.  
For now I just append 1-2 word lines to the previous line to mitigate the issue a bit.
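
The heuristic can be tried on a toy input first. This is a sketch, not the exact cell used here: the function name `merge_short_lines`, the `max_words` parameter, and the sample lines are made up for illustration. It uses `str.split()` without an argument, so repeated spaces do not count as extra words.

```python
def merge_short_lines(lines, max_words=2):
    """Append lines of at most `max_words` words to the previously kept line."""
    merged = []
    for line in lines:
        is_fragment = len(line.split()) <= max_words
        if is_fragment and merged:
            merged[-1] = merged[-1] + " " + line
        else:
            # Long lines, and a short line with nothing before it, start a new entry.
            merged.append(line)
    return merged


# Made-up lines that imitate a lyric split around an annotated phrase.
sample = ["I was walking down the", "old road", "when the rain came in", "yeah"]
print(merge_short_lines(sample))
# ['I was walking down the old road', 'when the rain came in yeah']
```

Note the trade-off visible in the last line: genuinely short lyric lines such as "yeah" are merged as well, which is why this only mitigates the issue.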

In [None]:
all_lyrics[:10]

In [None]:
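fixed_lyrics = []
for line in all_lyrics:
    # Merge 1-2 word fragments into the previous line; the `fixed_lyrics` check
    # avoids an IndexError when the very first line is a short fragment.
    if fixed_lyrics and len(line.split(" ")) <= 2:
        fixed_lyrics[-1] = fixed_lyrics[-1] + " " + line
    else:
        fixed_lyrics.append(line)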

In [None]:
len(fixed_lyrics)

## Write text to file on disk

In [None]:
output_folder = "song_lyrics_data"
output_filename = os.path.basename(ARTIST_CONFIG).replace(".yml", ".txt")

os.makedirs(output_folder, exist_ok=True)

In [None]:
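# Write as UTF-8 explicitly so non-ASCII characters do not depend on the platform default.
with open(os.path.join(output_folder, output_filename), "w", encoding="utf-8") as f: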
    for line in fixed_lyrics:
        f.write(line + "\n")