# 1 - Scrapping french rap lyrics from genius

We use wikipedia to find a list of ~450 famous french rapper, then we scrap each artist song catalog to copy all their lyrics 
in a text file.

- The model training was run on google cloud, so we use google drive to host google collab file.

In [1]:
# Make HTTP requests
import requests
# Scrape data from an HTML document
from bs4 import BeautifulSoup
# I/O
import os
# Search and manipulate strings
import re

In [3]:
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

Mounted at /content/drive/


In [4]:
import os
os.chdir("/content/drive/MyDrive/ProjetDL") 
!ls

 artistes.txt   lyrics	 model20   my_model  'Rap Generator V0.ipynb'


In [5]:
GENIUS_API_TOKEN = "YOUR_GENIUS_API_TOKEN"

In [87]:
# Get artist object from Genius API
def request_artist_info(artist_name, page, TOKEN = GENIUS_API_TOKEN):
    base_url = 'https://api.genius.com'
    headers = {'Authorization': 'Bearer ' + TOKEN}
    search_url = base_url + '/search?per_page=50&page=' + str(page)
    data = {'q': artist_name}
    #print(search_url)
    response = requests.get(search_url, data=data, headers=headers, timeout=5)
    return response

# Get Genius.com song url's from artist object
def request_song_url(artist_name, song_cap, TOKEN = GENIUS_API_TOKEN):
    page = 1
    songs = []
    
    while True:
        response = request_artist_info(artist_name, page, TOKEN)
        json = response.json()
        # Collect up to song_cap song objects from artist
        song_info = []
        for hit in json['response']['hits']:
            if artist_name.lower() in hit['result']['primary_artist']['name'].lower():
                song_info.append(hit)
    
        # Collect song URL's from song objects
        for song in song_info:
            if (len(songs) < song_cap):
                url = song['result']['url']
                songs.append(url)
            
        if (len(songs) == song_cap):
            break
        #print(len(json['response']['hits']),len(song_info))
        if (len(song_info) < 5):
            break
        else:
            page += 1
        
    print('Found {} songs by {}'.format(len(songs), artist_name))
    return songs
    
# DEMO
request_song_url('JuL', 200, TOKEN = GENIUS_API_TOKEN_Etienne)

Found 200 songs by JuL


['https://genius.com/Julia-michaels-issues-lyrics',
 'https://genius.com/Jul-joublie-tout-lyrics',
 'https://genius.com/Jul-tchikita-lyrics',
 'https://genius.com/Jul-tu-la-love-lyrics',
 'https://genius.com/Olivia-rodrigo-and-julia-lester-wondering-lyrics',
 'https://genius.com/Jul-dans-ma-paranoia-lyrics',
 'https://genius.com/Julia-michaels-heaven-lyrics',
 'https://genius.com/Julia-michaels-anxiety-lyrics',
 'https://genius.com/Jul-my-world-lyrics',
 'https://genius.com/Jul-wesh-alors-lyrics',
 'https://genius.com/Alan-walker-digital-farm-animals-noah-cyrus-and-juliander-all-falls-down-lyrics',
 'https://genius.com/Jul-lova-lyrics',
 'https://genius.com/Jul-sors-le-cross-vole-lyrics',
 'https://genius.com/Jul-comme-dhab-lyrics',
 'https://genius.com/Jul-dans-mon-del-lyrics',
 'https://genius.com/Jul-tout-seul-lyrics',
 'https://genius.com/Jul-en-y-lyrics',
 'https://genius.com/Jul-au-quartier-lyrics',
 'https://genius.com/Jul-toto-et-ninetta-lyrics',
 'https://genius.com/Jul-amnesi

In [None]:
# Scrape lyrics from a Genius.com song URL
def scrape_song_lyrics(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    try:
        lyrics = soup.select_one('div[class^="Lyrics__Container"], .lyrics').get_text(strip=True, separator='\n')
    except:
        print("scrapping error for the song")

    #remove identifiers like chorus, verse, etc
    lyrics = re.sub(r'[\(\[].*?[\)\]]', '', lyrics)
    #remove empty lines
    lyrics = os.linesep.join([s for s in lyrics.splitlines() if s])         
    return lyrics
    
# DEMO
print(scrape_song_lyrics('https://genius.com/Damso-mosaique-solitaire-lyrics'))

Me d'mandez pas c'que j'fais dans la vie
C'est si noir, vous s'rez pris de panique
Quelque part, loin de toute compagnie
Batterie Faible m'a fait perdre beaucoup d'amis
Me serre pas la main, fais-moi un #Vie
J'attends la mort comme en Gethsémani
Baise-la c'est tout sinon elle f'ra des manies
Elle manquera d'respect à ta famille
Une seule erreur et t'as plus d'followers
Donc j'fais c'que j'aime non pas c'que l'on me dit
J'suis toujours debout, aux tombées des nuits
Vu du ciel, l'Enfer est comme le Paradis
"Crève dans ta merde, t'auras pas un radis"
C'est à peu près ce que le daron m'a dit
Heureusement gros culs ont su consoler
De leurs 'ttes-cha, j'me suis empoisonné
J'ai picolé, j'ai bu oui, j'ai bu oui
J'perds la raison à cause de mes torts
C'est ça qu'ça fait d'toujours bosser la nuit
J'fume de trop, j'fais plus de sport
"C'est pas très bon", m'a dit coach Elie
Drogue dans la soute, à peine j'atterris
J'roule un doobie, oh oui
Ils ne me veulent pas du bien, no
Ils ne me veulent pas d

In [None]:
def write_lyrics_to_file(artist_name, song_count, TOKEN = GENIUS_API_TOKEN):
    f = open('/content/drive/MyDrive/ProjetDL/lyrics/' + artist_name.lower() + '.txt', 'wb')
    urls = request_song_url(artist_name, song_count, TOKEN)
    for url in urls:
        lyrics = scrape_song_lyrics(url)
        f.write(lyrics.encode("utf8"))
    f.close()
    num_lines = sum(1 for line in open('/content/drive/MyDrive/ProjetDL/lyrics/' + artist_name.lower() + '.txt', 'rb'))
    print('Wrote {} lines to file from {} songs'.format(num_lines, song_count))

# DEMO  /content/drive/MyDrive/ProjetDL 
write_lyrics_to_file('Damso', 20, TOKEN = GENIUS_API_TOKEN_Etienne)

Found 20 songs by Damso
Wrote 1455 lines to file from 20 songs


In [None]:
my_file = open("/content/drive/MyDrive/ProjetDL/artistes.txt", "r")
content = my_file.read()
artists_list = content.splitlines()
artists_list = [name.rstrip() for name in artists_list]
print(artists_list[400:])

['Psy 4 de la rime', 'Psykick Lyrikah', 'Psykopat', 'Puzzle', 'Raggasonic', "Raï'n'B Fever", 'Rapsonic', 'Reciprok', 'Relic', 'Les Rieurs', 'La Rumeur', 'S-Crew', 'Sages Poètes de la rue', 'Saïan Supa Crew', 'Scred Connexion', 'Secteur Ä', 'Section Fu', "Sexion d'assaut", 'The Shin Sekaï', 'Shuffle', 'Silmarils', "L'Skadrille", 'Smokey Joe & The Kid', 'Sniper', 'Soul Swing', 'Les Spécialistes', 'Spoke Orkestra', 'Stupeflip', 'Suprême NTM', 'Svinkels', 'La Swija', 'Tandem', 'Team BS', 'Therapie Taxi', 'Time Bomb', 'TLF', 'Tout simplement noir', 'Tragédie', 'Tribal Jam', 'Triptik', 'TSR Crew', 'TTC', 'Under Kontrol', 'Unité 2 Feu', 'Les X', 'XVBarbar', "Zone d'expression populaire", 'Zweierpasch', '']


In [None]:
# write the lyrics file of each artist, by group of 100 or less (takes a lot of time as the API is not especially made to be scrapped ;) )

for rap_god in artists_list[200:299] :
    write_lyrics_to_file(rap_god, 300) # au plus 10 chansons par artistes pour tester


Found 231 songs by Niro
Wrote 14138 lines to file from 300 songs
Found 207 songs by Niska
Wrote 9727 lines to file from 300 songs
Found 103 songs by Nubi
Wrote 7290 lines to file from 300 songs
Found 50 songs by Oboy
Wrote 2247 lines to file from 300 songs
Found 4 songs by OGB
Wrote 216 lines to file from 300 songs
Found 0 songs by Ol Kainry
Wrote 0 lines to file from 300 songs
Found 86 songs by Orelsan
Wrote 5725 lines to file from 300 songs
Found 28 songs by Panama Bende
Wrote 2940 lines to file from 300 songs
Found 0 songs by Lucien Papalu
Wrote 0 lines to file from 300 songs
Found 4 songs by Passi
Wrote 176 lines to file from 300 songs
Found 0 songs by Benjamin Paulin
Wrote 0 lines to file from 300 songs
Found 24 songs by Al Peco
Wrote 1217 lines to file from 300 songs
Found 0 songs by Pih Poh
Wrote 0 lines to file from 300 songs
Found 32 songs by Piloophaz
Wrote 1019 lines to file from 300 songs
Found 0 songs by Pilote le Hot
Wrote 0 lines to file from 300 songs
Found 152 songs by