<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#create-song-lyric-dataset-for-gpt-2" data-toc-modified-id="create-song-lyric-dataset-for-gpt-2-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>create song lyric dataset for gpt-2</a></span><ul class="toc-item"><li><span><a href="#import-functions-to-scrape-api.genius.com" data-toc-modified-id="import-functions-to-scrape-api.genius.com-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>import functions to scrape api.genius.com</a></span></li><li><span><a href="#get-song-urls-from-artist-names" data-toc-modified-id="get-song-urls-from-artist-names-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>get song urls from artist names</a></span></li><li><span><a href="#remove-duplicates-and-shuffle" data-toc-modified-id="remove-duplicates-and-shuffle-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>remove duplicates and shuffle</a></span></li><li><span><a href="#get-lyrics-from-all-song-urls" data-toc-modified-id="get-lyrics-from-all-song-urls-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>get lyrics from all song urls</a></span></li><li><span><a href="#additional-post-processing" data-toc-modified-id="additional-post-processing-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>additional post processing</a></span></li><li><span><a href="#write-to-file" data-toc-modified-id="write-to-file-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>write to file</a></span></li></ul></li></ul></div>

In [None]:
import os
import random
from tqdm import tqdm

# create song lyric dataset for gpt-2

We can scrape api.genius.com to get the lyrics of selected artists and, after some data cleaning, end up with a decent dataset that should work as input for gpt-2.

## import functions to scrape api.genius.com

To get an API token, see https://docs.genius.com/

In [None]:
GENIUS_API_TOKEN = ""

In [None]:
from scrape_genius import request_song_urls, scrape_song_lyrics, clean_song_lyrics

## get song urls from artist names

We can get a selection of songs from a given artist. Which songs are returned depends on their order on genius.com, which is probably sorted by popularity, so the selection is not truly random.

In this example I used bad German gangster rap artists.

In [None]:
bushido_urls = request_song_urls(
    artist_name="bushido",
    song_cap=50,
    genius_api_token=GENIUS_API_TOKEN,
    exclusion_string="zho",
)

kay_one_urls = request_song_urls(
    artist_name="kay one",
    song_cap=50,
    genius_api_token=GENIUS_API_TOKEN,
)

kollegah_urls = request_song_urls(
    artist_name="kollegah",
    song_cap=50,
    genius_api_token=GENIUS_API_TOKEN,
)

baba_urls = request_song_urls(
    artist_name="Baba Saad",
    song_cap=50,
    genius_api_token=GENIUS_API_TOKEN,
)

farid_bang_urls = request_song_urls(
    artist_name="farid bang",
    song_cap=50,
    genius_api_token=GENIUS_API_TOKEN,
)

urls = bushido_urls + kay_one_urls + kollegah_urls + baba_urls + farid_bang_urls
len(urls)

## remove duplicates and shuffle

In [None]:
urls = list(dict.fromkeys(urls))
len(urls)
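
As a small illustration of the deduplication step, here is a sketch with made-up URLs (they are placeholders, not real genius.com links). It shows that `dict.fromkeys` keeps the first occurrence of each entry and preserves the original order, which a plain `set` would not guarantee.

```python
# hypothetical URLs, only used to illustrate the behaviour
urls_with_duplicates = [
    "https://example.com/song-b",
    "https://example.com/song-a",
    "https://example.com/song-b",
    "https://example.com/song-c",
    "https://example.com/song-a",
]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_urls = list(dict.fromkeys(urls_with_duplicates))
print(unique_urls)
```

The order does not matter much here since the list is shuffled afterwards, but it keeps the intermediate result deterministic.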

In [None]:
random.shuffle(urls)

In [None]:
urls[:10]

## get lyrics from all song urls

In [None]:
all_lyrics = []
for url in tqdm(urls):
    lyrics = scrape_song_lyrics(url)
    all_lyrics.extend(clean_song_lyrics(lyrics))

## additional post processing

It seems that the annotations on genius.com (interpretations of parts of the lyrics) cause lines to be split at the annotated words.

This makes sense, since each annotation introduces a new HTML tag and BeautifulSoup inserts a separator there.

TODO: fix this properly when calling BeautifulSoup's `get_text()`  
For now I just append 1-2 word lines to the previous line to mitigate the issue a bit.
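
To make the splitting described above more concrete, here is a minimal sketch. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup, and the HTML snippet is made up (the `<a>` tag plays the role of an annotated phrase). The point is only that joining the individual text nodes with a newline breaks one lyric line into several pieces, while joining them without a separator keeps it intact.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects every text node found between tags."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        self.nodes.append(data)


# made-up snippet: the <a> tag stands for an annotated phrase
html = "<p>this is <a href='#note'>an annotated phrase</a> in one line</p>"

collector = TextNodeCollector()
collector.feed(html)
collector.close()

# a newline between text nodes splits the single lyric line into three
joined_with_newline = "\n".join(collector.nodes)
# no separator between text nodes keeps the line in one piece
joined_without_separator = "".join(collector.nodes)

print(joined_with_newline.split("\n"))
print(joined_without_separator)
```

A proper fix would probably have to treat real line breaks in the page differently from tag boundaries inside a line; how exactly depends on the markup genius.com uses, which is not checked here.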

In [None]:
all_lyrics[:10]

In [None]:
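fixed_lyrics = []
for line in all_lyrics:
    # lines with at most two words are most likely fragments split off by an annotation,
    # so append them to the previous line (unless there is no previous line yet)
    if fixed_lyrics and len(line.split()) <= 2:
        fixed_lyrics[-1] = fixed_lyrics[-1] + " " + line
    else:
        fixed_lyrics.append(line)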

In [None]:
len(fixed_lyrics)
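
The merge heuristic above can be tried on a few made-up lines to see what it does. The function name `merge_short_lines` and the `max_words` parameter are only introduced for this sketch, and the toy lines are not real scraped output.

```python
def merge_short_lines(lines, max_words=2):
    """Append lines with at most `max_words` words to the previously kept line."""
    merged = []
    for line in lines:
        # the `merged` check avoids an IndexError if the very first line is short
        if merged and len(line.split()) <= max_words:
            merged[-1] = merged[-1] + " " + line
        else:
            merged.append(line)
    return merged


toy_lines = [
    "walking down the",
    "empty street",
    "at night again",
    "yeah",
    "nothing left to say here",
]
result = merge_short_lines(toy_lines)
for line in result:
    print(line)
```

Note that the heuristic cannot tell a split-off fragment from a genuinely short line: the one-word line `yeah` is merged into the previous line as well, so this is only a mitigation.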

## write to file

In [None]:
output_folder = "song_lyrics_data"
os.makedirs(output_folder, exist_ok=True)

In [None]:
with open(os.path.join(output_folder, "terrible_german_lyrics.txt"), "w", encoding="utf-8") as f:
    for line in fixed_lyrics:
        f.write(line + "\n")