# Freestyle Rap Bot
![image](./1.jpg)

In this notebook, we will create a machine learning model that can produce freestyle rap lyrics.\
To do that, we're going to execute the following steps:
 1. Gather data from a public internet source.
 2. Wrangle the created dataset.
 3. Fine-tune a language generation model.
 4. Test and discuss the results.

# Data scraping & cleaning
In this section, we will use BeautifulSoup to collect the lyrics of numerous rap songs from [AZLyrics](https://www.azlyrics.com/).
The process is straightforward:
1. Create a list of rappers whose discography lyrics we want to fetch.
2. Loop through each song URL and extract the song lyrics and title from the HTML soup.
3. Clean the collected data and save it to CSV files.
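
Two small pieces of steps 2 and 3 can be sketched without touching the network: turning a possibly relative `href` into an absolute URL, and stripping bracketed section tags (such as `[Chorus]`) from a lyrics string. The helper names `normalize_link` and `strip_section_tags` are hypothetical, and `urljoin` is swapped in here for plain string concatenation; treat this as a minimal sketch rather than the exact method used in the cells of this notebook.

```python
import re
from urllib.parse import urljoin

ROOT = "https://www.azlyrics.com"

def normalize_link(href):
    """Return an absolute URL for a relative or already absolute href."""
    return urljoin(ROOT, href)

def strip_section_tags(lyrics):
    """Remove bracketed annotations such as [Chorus] or [Verse 1]."""
    return re.sub(r"\[.*?\]", "", lyrics)

# A relative link is prefixed with the site root; an absolute link is kept unchanged.
print(normalize_link("/lyrics/dax/nocappin.html"))
print(normalize_link("https://www.azlyrics.com/lyrics/dax/nocappin.html"))

# Bracketed tags disappear, the surrounding line breaks stay.
print(repr(strip_section_tags("[Chorus]\nline one\n[Verse 1]\nline two")))
```

Using `urljoin` avoids a separate check for whether the link already contains the site root.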

In [1]:
# Import libraries

import time
import re
import os
from tqdm import tqdm

import requests
from bs4 import BeautifulSoup

import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
# Create the list of rapper names whose discography lyrics will be scraped

rappers_list = ['2pac','50cent','eazye','eminem','icecube','jayz','jcole','kendricklamar','nas','nf','notorious','outkast','rakim','techn9ne','dax']
len(rappers_list)

15

In [3]:
# Define a function that generates the AZLyrics URL for a given rapper name

def get_az_url(rapper):
    root= 'https://www.azlyrics.com'
    if rapper[0].isnumeric():
        url = f'{root}/19/{rapper}.html'
    else:
        url = f'{root}/{rapper[0]}/{rapper}.html'

    content = requests.get(url)
    assert 'Welcome to AZLyrics!' not in content.text, 'Non-existent rapper name in AZlyrics!'

    return url

In [4]:
# parse website html & extract lyrics links
rappers_lyrics_links_list=[]
for rapper in tqdm(rappers_list, desc ="Extracting discography titles for each rapper"):
    time.sleep(13)
    url = get_az_url(rapper)
    url_content = requests.get(url)
    html = BeautifulSoup(url_content.text, 'html.parser')

    # get tracks list
    tracks_list = html.find_all("div", {"class": "listalbum-item"})

    lyrics_links = []
    for track in tracks_list:
        if track.find(href=True):
            link = track.find(href=True)['href']
        else:
            continue
        if 'https://www.azlyrics.com' in link:
            lyrics_links.append(link)
        else:
            lyrics_links.append('https://www.azlyrics.com'+link)
    
    rappers_lyrics_links_list.append(lyrics_links)

Extracting discography titles for each rapper: 100%|██████████| 15/15 [03:23<00:00, 13.55s/it]


In [5]:
# Create a dictionary that maps each rapper to their lyrics links

rappers_lyrics_dict = dict(zip(rappers_list, rappers_lyrics_links_list))

In [5]:
tracks_list

[<div class="listalbum-item"><a href="/lyrics/dax/hitemupdaxmix.html" target="_blank">Hit Em Up (Daxmix)</a></div>,
 <div class="listalbum-item"><a href="/lyrics/dax/californialovedaxmix.html" target="_blank">California Love (Daxmix)</a></div>,
 <div class="listalbum-item"><a href="/lyrics/dax/changesdaxmix.html" target="_blank">Changes (Daxmix)</a></div>,
 <div class="listalbum-item"><a href="/lyrics/dax/doforlovedaxmix.html" target="_blank">Do For Love (Daxmix)</a></div>,
 <div class="listalbum-item"><a href="/lyrics/dax/picturemerollingdaxmix.html" target="_blank">Picture Me Rolling (Daxmix)</a></div>,
 <div class="listalbum-item"><a href="/lyrics/dax/moverandshaker.html" target="_blank">Mover And Shaker</a></div>,
 <div class="listalbum-item"><a href="/lyrics/dax/alleyezonmedaxmix.html" target="_blank">All Eyez On Me (Daxmix)</a></div>,
 <div class="listalbum-item"><a href="/lyrics/dax/somanytearsdaxmix.html" target="_blank">So Many Tears (Daxmix)</a></div>,
 <div class="listalbum-

In [7]:
lyrics_links

['https://www.azlyrics.com/lyrics/dax/hitemupdaxmix.html',
 'https://www.azlyrics.com/lyrics/dax/californialovedaxmix.html',
 'https://www.azlyrics.com/lyrics/dax/changesdaxmix.html',
 'https://www.azlyrics.com/lyrics/dax/doforlovedaxmix.html',
 'https://www.azlyrics.com/lyrics/dax/picturemerollingdaxmix.html',
 'https://www.azlyrics.com/lyrics/dax/moverandshaker.html',
 'https://www.azlyrics.com/lyrics/dax/alleyezonmedaxmix.html',
 'https://www.azlyrics.com/lyrics/dax/somanytearsdaxmix.html',
 'https://www.azlyrics.com/lyrics/dax/allnightlong.html',
 'https://www.azlyrics.com/lyrics/dax/nocappin.html',
 'https://www.azlyrics.com/lyrics/dax/crackinonmyown.html',
 'https://www.azlyrics.com/lyrics/dax/whyisyouleavin.html',
 'https://www.azlyrics.com/lyrics/dax/norespect.html',
 'https://www.azlyrics.com/lyrics/dax/gottagetit.html',
 'https://www.azlyrics.com/lyrics/dax/diditfirst.html',
 'https://www.azlyrics.com/lyrics/dax/icantbreathe.html',
 'https://www.azlyrics.com/lyrics/dax/though

In [8]:
data_list = {'Rapper':[],'Title':[], 'Lyrics':[]}

for rapper in rappers_lyrics_dict.keys():
    lyrics_links = rappers_lyrics_dict[rapper]
    for url in tqdm(lyrics_links, desc = f"Gathering songs lyrics of {rapper}"):
        time.sleep(14)
        lyrics_page = requests.get(url)
        lyrics_html = BeautifulSoup(lyrics_page.text, 'html.parser')
        # Extract title from html soup
        title = lyrics_html.select('h1')[0].text.strip().split('"')[1]
        # Extract lyrics from html soup
        lyrics = max(lyrics_html.get_text().split('\n\n\n\n'), key=len).split('\n\n\n\r\n')[-1]
        # Extract and remove AZLyrics tags (text in square brackets) from lyrics
        tags = re.findall(r"\[.*?\]", lyrics)
        for tag in tags:
            lyrics = lyrics.replace(tag,'')
        # save extracted data to the "data_list" dictionary
        data_list['Rapper'].append(rapper)
        data_list['Title'].append(title)  
        data_list['Lyrics'].append(lyrics)

Gathering songs lyrics: 100%|██████████| 269/269 [1:04:03<00:00, 14.29s/it]
Gathering songs lyrics: 100%|██████████| 349/349 [1:22:39<00:00, 14.21s/it]
Gathering songs lyrics: 100%|██████████| 59/59 [13:57<00:00, 14.19s/it]
Gathering songs lyrics: 100%|██████████| 411/411 [1:37:31<00:00, 14.24s/it]
Gathering songs lyrics: 100%|██████████| 196/196 [46:22<00:00, 14.19s/it]
Gathering songs lyrics: 100%|██████████| 311/311 [1:13:30<00:00, 14.18s/it]
Gathering songs lyrics: 100%|██████████| 255/255 [1:01:33<00:00, 14.48s/it]
Gathering songs lyrics: 100%|██████████| 192/192 [45:26<00:00, 14.20s/it]
Gathering songs lyrics: 100%|██████████| 359/359 [1:25:04<00:00, 14.22s/it]
Gathering songs lyrics: 100%|██████████| 108/108 [25:38<00:00, 14.24s/it]
Gathering songs lyrics: 100%|██████████| 130/130 [30:43<00:00, 14.18s/it]
Gathering songs lyrics: 100%|██████████| 145/145 [34:16<00:00, 14.18s/it]
Gathering songs lyrics: 100%|██████████| 53/53 [12:30<00:00, 14.16s/it]
Gathering songs lyrics: 100%|█

In [9]:
df = pd.DataFrame(data_list)

In [10]:
df

Unnamed: 0,Rapper,Title,Lyrics
0,2pac,Young Black Male,\nHard like an erection\n(Young black male)\nH...
1,2pac,Trapped,You know they got me trapped in this prison of...
2,2pac,Soulja's Story,"\nAll you wanted to be, a soulja, a soulja\nAl..."
3,2pac,I Don't Give A Fuck,"\n""What's up?""\n""Yo this scene, rollers tried ..."
4,2pac,Violent,They claim that I'm violent\nJust 'cause I ref...
...,...,...,...
3609,dax,Who Run It (Gherbo Remix),"Ayy I don't give a fuck what nobody says, this..."
3610,dax,Why So Serious,Last time that I talked to you guys you though...
3611,dax,XXL Freshman Freestyle,What I said before\nThis is a sport\nOnly the ...
3612,dax,YourWorthIt.org,"\nAyee if no ones told you this today, I'ma te..."


In [11]:
# Save the lyrics dataset

df.to_csv('lyrics_dataset.csv', index=False)

In [6]:
# Load the lyrics dataset

df = pd.read_csv('lyrics_dataset.csv')

In [7]:
df

Unnamed: 0,Rapper,Title,Lyrics
0,2pac,Young Black Male,\nHard like an erection\n(Young black male)\nH...
1,2pac,Trapped,You know they got me trapped in this prison of...
2,2pac,Soulja's Story,"\nAll you wanted to be, a soulja, a soulja\nAl..."
3,2pac,I Don't Give A Fuck,"\n""What's up?""\n""Yo this scene, rollers tried ..."
4,2pac,Violent,They claim that I'm violent\nJust 'cause I ref...
...,...,...,...
3609,dax,Who Run It (Gherbo Remix),"Ayy I don't give a fuck what nobody says, this..."
3610,dax,Why So Serious,Last time that I talked to you guys you though...
3611,dax,XXL Freshman Freestyle,What I said before\nThis is a sport\nOnly the ...
3612,dax,YourWorthIt.org,"\nAyee if no ones told you this today, I'ma te..."


In [54]:
# Define a function that replaces common Unicode sequences with ASCII equivalents in a given string

def unicodetoascii(text):
    ascii = (text.
            replace('\xe2\x80\x99', "'").
            replace('\xc3\xa9', 'e').
            replace('\xe2\x80\x90', '-').
            replace('\xe2\x80\x91', '-').
            replace('\xe2\x80\x92', '-').
            replace('\xe2\x80\x93', '-').
            replace('\xe2\x80\x94', '-').
            replace('\xe2\x80\x94', '-').
            replace('\xe2\x80\x98', "'").
            replace('\xe2\x80\x9b', "'").
            replace('\xe2\x80\x9c', '"').
            replace('\xe2\x80\x9c', '"').
            replace('\xe2\x80\x9d', '"').
            replace('\xe2\x80\x9e', '"').
            replace('\xe2\x80\x9f', '"').
            replace('\xe2\x80\xa6', '...').#
            replace('\xe2\x80\xb2', "'").
            replace('\xe2\x80\xb3', "'").
            replace('\xe2\x80\xb4', "'").
            replace('\xe2\x80\xb5', "'").
            replace('\xe2\x80\xb6', "'").
            replace('\xe2\x80\xb7', "'").
            replace('\xe2\x81\xba', "+").
            replace('\xe2\x81\xbb', "-").
            replace('\xe2\x81\xbc', "=").
            replace('\xe2\x81\xbd', "(").
            replace('\xe2\x81\xbe', ")").
            replace('\n', " \n ").
            replace('\n \n \n ', " \n \n ").
            replace('\n  \n  \n ', " \n \n ").
            replace('\r', "").
            strip('. ').
            strip('\n ')
            )
    return ascii

In [55]:
# Convert unicode in lyrics data to ascii

df.Lyrics = df.Lyrics.apply(lambda x: unicodetoascii(x))
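
The escapes used in `unicodetoascii` spell out UTF-8 byte sequences inside a Python `str`, so they only match text that was mis-decoded byte by byte. If the lyrics are already decoded correctly (an assumption, since this depends on how the pages were fetched and stored), a code-point based mapping is an alternative worth considering. The sketch below swaps in `str.translate` with a small, hypothetical mapping `PUNCT_MAP`; it covers only a subset of the characters handled above and is not the notebook's method.

```python
# Hypothetical mapping from typographic code points to ASCII replacements.
PUNCT_MAP = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00e9": "e",    # e with acute accent
})

def normalize_punctuation(text):
    """Replace the mapped typographic characters with ASCII equivalents."""
    return text.translate(PUNCT_MAP)

sample = "don\u2019t stop \u2014 \u201cyo\u201d\u2026"
print(normalize_punctuation(sample))
```

A single `translate` call walks the string once, whereas a chain of `replace` calls scans it once per replacement.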

In [56]:
df

Unnamed: 0,Rapper,Title,Lyrics
0,2pac,Young Black Male,Hard like an erection \n (Young black male) \n...
1,2pac,Trapped,You know they got me trapped in this prison of...
2,2pac,Soulja's Story,"All you wanted to be, a soulja, a soulja \n Al..."
3,2pac,I Don't Give A Fuck,"""What's up?"" \n ""Yo this scene, rollers tried ..."
4,2pac,Violent,They claim that I'm violent \n Just 'cause I r...
...,...,...,...
3609,dax,Who Run It (Gherbo Remix),"Ayy I don't give a fuck what nobody says, this..."
3610,dax,Why So Serious,Last time that I talked to you guys you though...
3611,dax,XXL Freshman Freestyle,What I said before \n This is a sport \n Only ...
3612,dax,YourWorthIt.org,"Ayee if no ones told you this today, I'ma tell..."


In [57]:
# define a function that cleans consecutive duplicated phrases and removes ad-libs from a list of strings

def remove_dups_ad_libs(l):
    for i in range(len(l)):
        phrase = l[i]
        s_phrase = phrase.strip(' ')
        if s_phrase:
            if s_phrase[0]=='(' and s_phrase[-1]==')': # check if the phrase is an ad-lib and remove it
                l[i]=""
        if i<len(l)-2 and (s_phrase == l[i+1].strip(' ') or s_phrase == l[i+2].strip(' ')): # blank out a phrase repeated within the next two lines, keeping the last occurrence
            l[i]=""
    l=l[:-1]
    return l

In [61]:
df.iloc[2,2].split('\n')

['All you wanted to be, a soulja, a soulja ',
 ' All you wanted to be, a soulja, like me ',
 ' All you wanted to be, a soulja, a soulja ',
 ' All you wanted to be, a soulja, like me ',
 ' All you wanted to be, a soulja, a soulja ',
 ' All you wanted to be, a soulja, like me ',
 ' All you wanted to be, a soulja, a soulja ',
 ' All you wanted to be, a soulja, like me ',
 " (They cuttin' off welfare...) ",
 ' (They think crime is rising now) ',
 ' (You got whites killing blacks) ',
 ' (Cops killing blacks, and blacks killing blacks) ',
 " (Shit just gon' get worse) ",
 " (They just gon' become souljas) ",
 ' (Straight souljas) ',
 '  ',
 ' All you wanted to be, a soulja, a soulja ',
 ' All you wanted to be, a soulja, like me ',
 ' All you wanted to be, a soulja, a soulja ',
 ' All you wanted to be, a soulja, like me ',
 ' All you wanted to be, a soulja, a soulja ',
 ' All you wanted to be, a soulja, like me ',
 ' All you wanted to be, a soulja, a soulja ',
 ' All you wanted to be, a soulj

In [62]:
remove_dups_ad_libs(df.iloc[2,2].split('\n'))

['',
 '',
 '',
 '',
 '',
 '',
 ' All you wanted to be, a soulja, a soulja ',
 ' All you wanted to be, a soulja, like me ',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '  ',
 '',
 '',
 '',
 '',
 '',
 '',
 ' All you wanted to be, a soulja, a soulja ',
 ' All you wanted to be, a soulja, like me  ',
 ' ',
 ' Crack done took a part of my family tree ',
 " My momma's on the shit, my daddy split and mom is steady blaming me ",
 " Is it my fault just 'cause I'm a young black male? ",
 " Cops sweat me as if my destiny is makin' crack sales ",
 ' Only fifteen and got problems ',
 " Cops on my tail, so I bail 'til I dodge 'em ",
 ' They finally pull me over and I laugh ',
 ' "Remember Rodney King?" and I blast on his punk ass ',
 ' Now I got a murder case... ',
 ' You speak of heaven punk? I never heard of the place ',
 " Wanted to come up fast, got a Uz' and a black mask ",
 " Ducking fuckin' Task, now who's the jackass? ",
 " Keep my shit cocked, 'cause the cops got a Glock too ",
 " What the fuck woul

In [63]:
# clean the lyrics data using the defined function for removing ad-libs and duplicated phrases

for row in range(df.shape[0]):
    lyrics = df.iloc[row,2]
    l = lyrics.split('\n')
    clean_l = remove_dups_ad_libs(l)
    # drop the empty strings left behind by the cleaning step
    clean_l = [phrase for phrase in clean_l if phrase != '']
    clean_lyrics = '\n'.join(clean_l)
    df.iloc[row,2] = clean_lyrics

In [66]:
print(df.iloc[2,2])

 All you wanted to be, a soulja, a soulja 
 All you wanted to be, a soulja, like me 
  
 All you wanted to be, a soulja, a soulja 
 All you wanted to be, a soulja, like me  
 
 Crack done took a part of my family tree 
 My momma's on the shit, my daddy split and mom is steady blaming me 
 Is it my fault just 'cause I'm a young black male? 
 Cops sweat me as if my destiny is makin' crack sales 
 Only fifteen and got problems 
 Cops on my tail, so I bail 'til I dodge 'em 
 They finally pull me over and I laugh 
 "Remember Rodney King?" and I blast on his punk ass 
 Now I got a murder case... 
 You speak of heaven punk? I never heard of the place 
 Wanted to come up fast, got a Uz' and a black mask 
 Ducking fuckin' Task, now who's the jackass? 
 Keep my shit cocked, 'cause the cops got a Glock too 
 What the fuck would you do? Drop them or let 'em drop you? 
 I chose droppin' the cop 
 I got me a Glock, and a Glock for the niggas on my block 
 Momma tried to stab me, I moved out 
 Sold a

In [68]:
# Split the lyrics dataset into train and validation (test) sets
df_train, df_val = train_test_split(df, test_size=0.05, random_state=42)

In [82]:
print(f'The training set contains {df_train.shape[0]} songs lyrics.\nThe test set contains {df_val.shape[0]} songs lyrics.')

The training set contains 3433 songs lyrics.
The test set contains 181 songs lyrics.
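
The two counts above can be reproduced by hand. The sketch below assumes that scikit-learn rounds a fractional `test_size` up to the next whole sample and assigns the remaining rows to the training set; the variable names are illustrative only.

```python
import math

n_samples = 3433 + 181   # total number of songs, from the counts printed above
test_size = 0.05         # fraction passed to train_test_split

# Assumption: a fractional test size is rounded up to a whole number of samples.
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)
```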


In [83]:
df_val

Unnamed: 0,Rapper,Title,Lyrics
3284,techn9ne,,"So, me and Krizz Kaliko check into our uh... h..."
3573,dax,You know when... You let somebody borrow some ...,You know when... You let somebody borrow some ...
1825,jcole,"Yeah, my God, Science \n \n Lord I've been dr...","Yeah, my God, Science \n \n Lord I've been dr..."
3574,dax,"Pewdiepie, go die \n \n First diss was a tea...","Pewdiepie, go die \n Pewdiepie, go die \n Pewd..."
3129,techn9ne,I like big booty bitches rappin and Lynard Sky...,I like big booty bitches rappin and Lynard Sky...
...,...,...,...
1539,jayz,"Yeah, yeah, yeah, yeah, yeah \n \n Stack my ...","Yeah, yeah, yeah, yeah, yeah \n Yeah, yeah, ye..."
3125,techn9ne,"I remember when my soul, hoes and dough was mi...","I remember when my soul, hoes and dough was mi..."
3266,techn9ne,"I wanted to find my gun, my lady runs to the b...","I wanted to find my gun, my lady runs to the b..."
2903,techn9ne,Sin with me I want you. She-devils in the hous...,Sin with me I want you. She-devils in the hous...


In [84]:
# Save the train and test sets

df_train_Lyrics = df_train['Lyrics']
df_train_Lyrics.to_csv('Train_rap_bot.csv', index=False)
df_val_Lyrics = df_val['Lyrics']
df_val_Lyrics.to_csv('Val_rap_bot.csv', index=False)

# Create a freestyle rap bot
In this section, we are going to fine-tune a transformer-based language model from the transformers library to make it generate freestyle rap!

In [8]:
# import libraries
import random
import math

import transformers
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
from transformers import Trainer, TrainingArguments
from transformers import pipeline

from datasets import load_dataset
from datasets import ClassLabel

from IPython.display import display, HTML

print(transformers.__version__)

4.18.0


## Preprocessing the lyrics dataset

In this application, we're going to use a GPT-2 model, which is used for causal language modeling (CLM). Thus, we are going to take all the texts in our dataset and concatenate them after they are tokenized. Then we will split them in examples of a certain sequence length. This way the model will receive chunks of contiguous text that may look like:
```
part of text 1
```
or 
```
end of text 1 [BOS_TOKEN] beginning of text 2
```
[BOS_TOKEN]: Beginning Of Sentence Token
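
To make the idea concrete, here is a toy sketch of concatenating tokenized texts and cutting the stream into fixed-length chunks. It uses made-up string tokens instead of real token IDs, and the `"<bos>"` marker is only a placeholder mirroring the illustration above; whether a separator token is actually present depends on the tokenizer settings. The function name `chunk_token_streams` is hypothetical.

```python
from itertools import chain

def chunk_token_streams(token_lists, block_size):
    """Concatenate token lists and split the result into full blocks."""
    flat = list(chain.from_iterable(token_lists))
    # Keep only as many tokens as fit into whole blocks; the tail is dropped.
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]

songs = [
    ["a1", "a2", "a3", "a4", "a5"],   # tokens of text 1
    ["<bos>", "b1", "b2", "b3"],      # tokens of text 2, with a placeholder marker
]

for block in chunk_token_streams(songs, block_size=4):
    print(block)
```

The first block lies entirely inside text 1, while the second one contains the end of text 1 followed by the beginning of text 2. The last token `"b3"` does not fill a whole block and is dropped.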

In [9]:
path_to_train = './Train_rap_bot.csv'
path_to_validation = './Val_rap_bot.csv'
datasets = load_dataset("csv", data_files={"train": path_to_train, "validation": path_to_validation})

Using custom data configuration default-ed8bf12c2e9d083e
Reusing dataset csv (C:\Users\mlwit\.cache\huggingface\datasets\csv\default-ed8bf12c2e9d083e\0.0.0\51cce309a08df9c4d82ffd9363bbe090bf173197fc01a71b034e8594995a1a58)


  0%|          | 0/2 [00:00<?, ?it/s]

In [3]:
datasets['train']

Dataset({
    features: ['Lyrics'],
    num_rows: 3433
})

In [4]:
# Create a function that shows random song lyrics from the lyrics dataset

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [7]:
show_random_elements(datasets["train"],1)

Unnamed: 0,Lyrics
0,"(I am Killmonger) \n No one's perfect \n But no one's worthless \n We ain't deservin' of everything Heaven and Earth is \n But word is, good, (this is my home) \n Said no one's perfect, but no ones worth this \n We ain't deservin' of everything heaven and Earth is \n But word is, good (Northern California) \n \n (Ay, they better call a paramedic in the street) \n (I got leverage in the street) \n (I'm a California nigga and I'm heavy in the streets) \n .22 or .23, I'm heavy with the heat \n Hit you with this chop, paramedics can't save you (can't save you) \n Really in field c'mon bro, I know that ain't you (no, it ain't you) \n 2018, hell naw, I ain't gon' fade you \n Gon' paint you, TDE and SOB, we can't lose \n Niggas bitch made \n That's just somethin' I can't relate to (can't relate to) \n Turn on the gang \n That's just somethin' that I can't do (no, I can't do) \n Fall out over a bitch \n That's just somethin' that I can't do (no, I can't do) \n Rip every beat I get on, I was made to (I was made to) \n This Glock get to growlin', somethin' like a black panther \n Tryna touch a mil, fuck saying ""get yo bands up"" \n Fuckin' with the gang, yeah, I had to man up \n One fist in the air, I ain't finna put my hands up (gang!) \n \n I wish a nigga would \n I wish a nigga would, I wish a nigga would \n I wish a nigga would \n I wish a nigga would, I wish a nigga would \n I wish a bitch would \n I wish a bitch would, I wish a bitch would \n I wish a nigga would \n I wish a nigga would, I wish a nigga would \n \n Got shooters tappin' in, nigga for them bands, nigga \n West Coast niggas; yeah, they blowin' fans, nigga \n I know I'm the man, baby, bring your friends with you \n Puttin' points up while you in the stands, nigga \n But I be stuck in these streets, you in the background \n Ever since they took my brother, gotta pack rounds \n Sorry mamma, two bails, took a bad route \n I done got my bands up, a nigga stacked now \n But we been still O.T. 
on that bullshit (on that bullshit) \n I don't wanna have to do it, empty full clips (empty full clips) \n Why these niggas talkin' robbin', they don't do shit \n High Cali niggas tapped in, we'll cook shit \n Bust down on my neck, niggas reach, gettin' stretched \n Rockin' with this TEC, niggas better wear a vest \n Last year, I was broke, young nigga in the Crest \n Now a show 20 or better, broke niggas keep the rest \n \n I wish a nigga would \n I wish a nigga would, I wish a nigga would \n I wish a nigga would \n I wish a nigga would, I wish a nigga would \n I wish a bitch would \n I wish a bitch would, I wish a bitch would \n I wish a nigga would \n I wish a nigga would, I wish a nigga would (DaBoii) \n \n California nigga and I'm heavy in these streets \n If you don't keep a pole how you ready when it's beef \n (If you don't keep a pole) \n Lil nigga think he cut, yeah, I bet the nigga freeze \n If that nigga want me dead, I can't let that nigga breathe \n Want me gone, sent a shot, like the real kind \n These niggas actin' like they tough when they real kind \n Thumbin' through a 100 racks just to kill time \n They got a nigga at the edge, but I feel fine \n But it come with this shit, I'm okay with it \n If your man's to his last share a plate wit' 'em \n One whole wood to the neck, it's an eighth in it \n New baby chop, let it sing, it's a Drake nigga \n A lot of shit on my mind make me think a lot \n Why it's hard for me to smile? 'Cause I seen a lot \n You ain't really in the field, you just tweet a lot \n If we ain't on the same page, you can kick a rock, bitch! 
\n \n (And I been really tryna keep the peace) \n (But I'm a north Vallejo nigga, and I'm heavy in the streets) \n I was raised by my grannie and the gangsters \n So at 8 I made the choice I'ma forever be a G, and \n I don't really like to talk \n I remember we was broke and I don't really like to walk, nigga \n Now I ride around in foreign cars \n And I put on for my team who was with me from the start, nigga \n I don't play games, stitch lip, I don't say names \n She want a dog, I'm a Great Dane \n I got great aim, Ice Age on my damn chain \n My mamma told me keep a stash for the damn rain \n They ain't wanna see me win 'cause I'm black \n So I pulled up in that all black Benz in the back \n If you need someone to call, I'm the man for the task \n You ain't standin' for the 'cause, meet the man in the mask"


## Causal Language modeling

We will use the [`distilgpt2`](https://huggingface.co/distilgpt2) model for this example. You can pick any of the checkpoints listed [here](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads) instead. However, make sure that you can run the selected model on your machine.

DistilGPT2 has 82 million parameters and can be trained on GPUs with at least 6GB of VRAM.
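
As a rough back-of-the-envelope check on the memory figure, the sketch below estimates the size of the weights and of the basic training state. It assumes plain 32-bit training with an Adam-style optimizer (one copy each for weights, gradients, and two moment buffers); activations are not included, and they grow with the batch size and sequence length, so the numbers are a lower bound rather than a measurement.

```python
n_params = 82_000_000        # parameter count quoted above
bytes_per_float32 = 4

# Assumption: weights + gradients + two Adam moment buffers, all in float32.
copies = 4

weights_gb = n_params * bytes_per_float32 / 1e9
training_state_gb = copies * weights_gb

print(f"weights: {weights_gb:.3f} GB")
print(f"weights + gradients + optimizer state: {training_state_gb:.3f} GB")
```

Under these assumptions the fixed training state stays well below 6GB, which leaves the rest of the memory for activations and framework overhead.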

In [10]:
# define the chosen model checkpoint and the tokenizer.
model_checkpoint = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

In [12]:
# Define a function that tokenizes the "Lyrics" column of a batch of examples

def tokenize_function(examples):
    return tokenizer(examples["Lyrics"])

In [13]:
# tokenize the lyrics datasets using the map method and tokenize function 

tokenized_datasets = datasets.map(tokenize_function, batched=True, remove_columns=["Lyrics"])



  0%|          | 0/4 [00:00<?, ?ba/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1196 > 1024). Running this sequence through the model will result in indexing errors


  0%|          | 0/1 [00:00<?, ?ba/s]

In [14]:
# check the tokenized dataset
tokenized_datasets["train"][1]

{'input_ids': [34653,
  220,
  198,
  554,
  71,
  1000,
  428,
  11,
  537,
  24421,
  534,
  18606,
  220,
  198,
  3244,
  21349,
  11,
  1254,
  262,
  4020,
  832,
  262,
  41303,
  220,
  198,
  1680,
  21349,
  11,
  4286,
  428,
  30,
  220,
  198,
  5155,
  1231,
  502,
  30,
  20441,
  510,
  11,
  345,
  821,
  1719,
  2089,
  10625,
  220,
  198,
  705,
  42323,
  21349,
  25912,
  437,
  329,
  257,
  284,
  365,
  220,
  198,
  2011,
  5462,
  284,
  660,
  309,
  420,
  13281,
  290,
  285,
  676,
  30720,
  220,
  198,
  1550,
  262,
  2685,
  351,
  262,
  8848,
  220,
  198,
  1867,
  345,
  1807,
  11,
  651,
  4978,
  11,
  651,
  47739,
  503,
  30,
  220,
  198,
  25617,
  262,
  7356,
  4803,
  11,
  367,
  42573,
  4397,
  319,
  262,
  12586,
  220,
  198,
  1148,
  477,
  356,
  1392,
  355,
  356,
  14936,
  503,
  11,
  17038,
  220,
  198,
  24568,
  282,
  279,
  280,
  11751,
  11,
  307,
  9675,
  356,
  18959,
  470,
  256,
  27048,
  6,
  12431,
  220,

In [12]:
tokenized_datasets["train"]

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 3433
})

Now we need to concatenate all our texts together then split the result in small chunks of a certain `block_size`. To do this, we will use the `map` method again, with the option `batched=True`. This option lets us change the number of examples in the datasets by returning a different number of examples than we got. This way, we can create our new samples from a batch of examples.
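
Because the grouping is applied batch by batch, the number of resulting examples depends on both the block size and the batch size passed to `map`: each batch drops its own incomplete tail. The following sketch only counts blocks for made-up token lengths; the function name `count_blocks` and the numbers are illustrative assumptions, not values from the lyrics dataset.

```python
def count_blocks(token_lengths, batch_size, block_size):
    """Count full blocks and dropped tokens when grouping batch by batch."""
    blocks = 0
    dropped = 0
    for start in range(0, len(token_lengths), batch_size):
        total = sum(token_lengths[start:start + batch_size])
        blocks += total // block_size    # full blocks produced by this batch
        dropped += total % block_size    # leftover tokens discarded by this batch
    return blocks, dropped

lengths = [300, 500, 200, 100, 700, 50]  # made-up token counts for six songs

print(count_blocks(lengths, batch_size=3, block_size=256))  # → (6, 314)
print(count_blocks(lengths, batch_size=6, block_size=256))  # → (7, 58)
```

Six input texts become six or seven training examples here, and the larger batch wastes fewer tokens because only one tail is discarded.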

In [15]:
# define the block size to use when creating text blocks.

#block_size = tokenizer.model_max_length # Bigger VRAM required.
block_size = 256 # Try lower numbers (32,64,128) in case you get OOM error.

Then we write the preprocessing function that will group our texts:

In [16]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

In [17]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=8,
)

  0%|          | 0/430 [00:00<?, ?ba/s]

  0%|          | 0/23 [00:00<?, ?ba/s]

In [18]:
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 13110
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 705
    })
})

In [19]:
# verify that the 'group_texts' function worked correctly

assert len(lm_datasets["train"][1]["input_ids"]) == block_size, f"The tokenized input sequence has a different length than the defined 'block_size:{block_size}'"
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

" Happy New Year \n I got some porn star bitches in the back \n freakin off you wan come? \n  \n First nigga front I'mma shoot me a chump \n I stuff 2 million in my Lambo trunk \n Me I do whatever the fuck I want \n You must be confused, me I never lose \n Fuck me, no fuck you \n  \n Why people fuck with different drugs \n that shit aint hit the circuit yet \n Niggas fuck with dope and coke not vicodin and percocets \n Tonight I'm open minded, fuck it I'll give it a try \n What does it matter anyway, gettin' high's gettin high \n How many shots will it take to make a nigga to drop his shit \n Like he having convulsions, choking, eyes open \n Bong smoking, thats the shit you've seen in high times \n They grow underwater that shit look like a grapevine \n I started out with one pill, now I'm taking ten a day \n Em said I need help, Dre said that shit ok \n Next thing you know a nigga sittin' up in"

Now that the data has been preprocessed, we're ready to instantiate our `Trainer`.

In [18]:
# Instantiate the model using the chosen checkpoint

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

And some `TrainingArguments`:

In [19]:
# Define the training arguments to be used in the Trainer object

model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-freestyle-bot",
    evaluation_strategy = "epoch",
    save_strategy = 'epoch',
    load_best_model_at_end = True,
    num_train_epochs=7.0,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=100,
)
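
These arguments determine the length of the run and the learning-rate curve. The sketch below works the numbers out by hand, assuming a single device, the batch size of 8 reported in the training log, and a linear schedule with warmup (which, to my knowledge, is the `Trainer` default). The function `linear_schedule_lr` is a hypothetical re-implementation for illustration, not the library's code.

```python
import math

num_examples = 13110     # rows in the grouped training set
batch_size = 8           # per-device batch size reported in the training log
num_epochs = 7
warmup_steps = 100
base_lr = 2e-5

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

def linear_schedule_lr(step):
    """Linear warmup to base_lr, then linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / (total_steps - warmup_steps))

print(steps_per_epoch, total_steps)  # → 1639 11473
print(linear_schedule_lr(500))
print(linear_schedule_lr(1000))
```

The step counts agree with the checkpoint names and the total number of optimization steps in the training log, and the learning rates at steps 500 and 1000 can be compared with the first two logged values.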

In [20]:
# Create a Trainer object using the defined model, arguments, and datasets.

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"]
)

And we can train our model:

In [1]:
import torch
torch.__version__

'1.13.0'

In [21]:
# start training (fine-tuning)

trainer.train()

***** Running training *****
  Num examples = 13110
  Num Epochs = 7
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 11473


  0%|          | 0/11473 [00:00<?, ?it/s]

{'loss': 3.7041, 'learning_rate': 1.9296579618394445e-05, 'epoch': 0.31}
{'loss': 3.5473, 'learning_rate': 1.8417304141387497e-05, 'epoch': 0.61}
{'loss': 3.4813, 'learning_rate': 1.7538028664380552e-05, 'epoch': 0.92}


***** Running Evaluation *****
  Num examples = 705
  Batch size = 8


  0%|          | 0/89 [00:00<?, ?it/s]

Saving model checkpoint to distilgpt2-freestyle-bot\checkpoint-1639
Configuration saved in distilgpt2-freestyle-bot\checkpoint-1639\config.json


{'eval_loss': 3.3948919773101807, 'eval_runtime': 6.8866, 'eval_samples_per_second': 102.373, 'eval_steps_per_second': 12.924, 'epoch': 1.0}


Model weights saved in distilgpt2-freestyle-bot\checkpoint-1639\pytorch_model.bin


{'loss': 3.4402, 'learning_rate': 1.6658753187373607e-05, 'epoch': 1.22}
{'loss': 3.399, 'learning_rate': 1.577947771036666e-05, 'epoch': 1.53}
{'loss': 3.3973, 'learning_rate': 1.4900202233359714e-05, 'epoch': 1.83}


***** Running Evaluation *****
  Num examples = 705
  Batch size = 8


  0%|          | 0/89 [00:00<?, ?it/s]

Saving model checkpoint to distilgpt2-freestyle-bot\checkpoint-3278
Configuration saved in distilgpt2-freestyle-bot\checkpoint-3278\config.json


{'eval_loss': 3.3417224884033203, 'eval_runtime': 6.9386, 'eval_samples_per_second': 101.606, 'eval_steps_per_second': 12.827, 'epoch': 2.0}


Model weights saved in distilgpt2-freestyle-bot\checkpoint-3278\pytorch_model.bin


{'loss': 3.3812, 'learning_rate': 1.4020926756352765e-05, 'epoch': 2.14}
{'loss': 3.3557, 'learning_rate': 1.3141651279345819e-05, 'epoch': 2.44}
{'loss': 3.3383, 'learning_rate': 1.2262375802338872e-05, 'epoch': 2.75}


***** Running Evaluation *****
  Num examples = 705
  Batch size = 8


  0%|          | 0/89 [00:00<?, ?it/s]

Saving model checkpoint to distilgpt2-freestyle-bot\checkpoint-4917
Configuration saved in distilgpt2-freestyle-bot\checkpoint-4917\config.json


{'eval_loss': 3.313934326171875, 'eval_runtime': 6.9256, 'eval_samples_per_second': 101.797, 'eval_steps_per_second': 12.851, 'epoch': 3.0}


Model weights saved in distilgpt2-freestyle-bot\checkpoint-4917\pytorch_model.bin


{'loss': 3.3291, 'learning_rate': 1.1383100325331929e-05, 'epoch': 3.05}
{'loss': 3.3145, 'learning_rate': 1.0503824848324982e-05, 'epoch': 3.36}
{'loss': 3.2983, 'learning_rate': 9.624549371318034e-06, 'epoch': 3.66}
{'loss': 3.2986, 'learning_rate': 8.745273894311088e-06, 'epoch': 3.97}


***** Running Evaluation *****
  Num examples = 705
  Batch size = 8


  0%|          | 0/89 [00:00<?, ?it/s]

Saving model checkpoint to distilgpt2-freestyle-bot\checkpoint-6556
Configuration saved in distilgpt2-freestyle-bot\checkpoint-6556\config.json


{'eval_loss': 3.30161452293396, 'eval_runtime': 6.9346, 'eval_samples_per_second': 101.665, 'eval_steps_per_second': 12.834, 'epoch': 4.0}


Model weights saved in distilgpt2-freestyle-bot\checkpoint-6556\pytorch_model.bin


{'loss': 3.2746, 'learning_rate': 7.865998417304141e-06, 'epoch': 4.27}
{'loss': 3.2815, 'learning_rate': 6.986722940297196e-06, 'epoch': 4.58}
{'loss': 3.2764, 'learning_rate': 6.1074474632902495e-06, 'epoch': 4.88}


***** Running Evaluation *****
  Num examples = 705
  Batch size = 8


  0%|          | 0/89 [00:00<?, ?it/s]

Saving model checkpoint to distilgpt2-freestyle-bot\checkpoint-8195
Configuration saved in distilgpt2-freestyle-bot\checkpoint-8195\config.json


{'eval_loss': 3.290487766265869, 'eval_runtime': 6.8725, 'eval_samples_per_second': 102.582, 'eval_steps_per_second': 12.95, 'epoch': 5.0}


Model weights saved in distilgpt2-freestyle-bot\checkpoint-8195\pytorch_model.bin


{'loss': 3.2664, 'learning_rate': 5.228171986283303e-06, 'epoch': 5.19}
{'loss': 3.2409, 'learning_rate': 4.348896509276357e-06, 'epoch': 5.49}
{'loss': 3.2647, 'learning_rate': 3.46962103226941e-06, 'epoch': 5.8}


***** Running Evaluation *****
  Num examples = 705
  Batch size = 8


  0%|          | 0/89 [00:00<?, ?it/s]

Saving model checkpoint to distilgpt2-freestyle-bot\checkpoint-9834
Configuration saved in distilgpt2-freestyle-bot\checkpoint-9834\config.json


{'eval_loss': 3.283442735671997, 'eval_runtime': 6.9216, 'eval_samples_per_second': 101.856, 'eval_steps_per_second': 12.858, 'epoch': 6.0}


Model weights saved in distilgpt2-freestyle-bot\checkpoint-9834\pytorch_model.bin


{'loss': 3.2661, 'learning_rate': 2.590345555262464e-06, 'epoch': 6.1}
{'loss': 3.2404, 'learning_rate': 1.7110700782555175e-06, 'epoch': 6.41}
{'loss': 3.2449, 'learning_rate': 8.317946012485713e-07, 'epoch': 6.71}


***** Running Evaluation *****
  Num examples = 705
  Batch size = 8


  0%|          | 0/89 [00:00<?, ?it/s]

Saving model checkpoint to distilgpt2-freestyle-bot\checkpoint-11473
Configuration saved in distilgpt2-freestyle-bot\checkpoint-11473\config.json


{'eval_loss': 3.2842025756835938, 'eval_runtime': 6.9226, 'eval_samples_per_second': 101.841, 'eval_steps_per_second': 12.857, 'epoch': 7.0}


Model weights saved in distilgpt2-freestyle-bot\checkpoint-11473\pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from distilgpt2-freestyle-bot\checkpoint-9834 (score: 3.283442735671997).


{'train_runtime': 2935.9501, 'train_samples_per_second': 31.257, 'train_steps_per_second': 3.908, 'train_loss': 3.343601383411052, 'epoch': 7.0}


TrainOutput(global_step=11473, training_loss=3.343601383411052, metrics={'train_runtime': 2935.9501, 'train_samples_per_second': 31.257, 'train_steps_per_second': 3.908, 'train_loss': 3.343601383411052, 'epoch': 7.0})

Once training is complete, we can evaluate the model and compute its perplexity on the validation set.\
Perplexity measures how well a probability model predicts a sample. In NLP, when the model is trained with a cross-entropy loss, the perplexity is simply the exponential of that loss.\
**As a rule of thumb, the higher the perplexity, the worse the model generalizes. So the closer the perplexity is to 1, the better.**
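
To make the relationship concrete, the short sketch below applies `exp(loss)` to the per-epoch `eval_loss` values copied from the training log above. The dictionary and variable names are illustrative and not part of the notebook's own code.

```python
import math

# eval_loss per epoch, copied from the training log above
eval_losses = {
    1: 3.3948919773101807,
    2: 3.3417224884033203,
    3: 3.313934326171875,
    4: 3.30161452293396,
    5: 3.290487766265869,
    6: 3.283442735671997,
    7: 3.2842025756835938,
}

# perplexity is the exponential of the (natural-log) cross-entropy loss
perplexities = {epoch: math.exp(loss) for epoch, loss in eval_losses.items()}

# the epoch with the lowest perplexity is also the one with the lowest loss
best_epoch = min(perplexities, key=perplexities.get)

for epoch, ppl in perplexities.items():
    marker = "  <- best" if epoch == best_epoch else ""
    print(f"epoch {epoch}: eval_loss={eval_losses[epoch]:.4f}  perplexity={ppl:.2f}{marker}")
```

The lowest value falls on epoch 6, which is consistent with the log line reporting that the best model was loaded from `checkpoint-9834`.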

In [22]:
# calculate the perplexity on the validation set

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 705
  Batch size = 8


  0%|          | 0/89 [00:00<?, ?it/s]

Perplexity: 26.67


Now let's test our fine-tuned model using new prompts!

**Note:**\
Given that the model is fine-tuned on text that contains bad words, it has a certain probability of generating such words.
However, if you don't want it to generate certain words, you can use the following command to get the token ids of the bad words:
```
tokenizer(bad_words, add_prefix_space=True, add_special_tokens=False).input_ids
```
Then pass that list of token ids to `model.generate()` through the `bad_words_ids` parameter.
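
The note above can be sketched as a pair of small helpers. This is only an illustration under a few assumptions: the helper names (`load_prefix_space_tokenizer`, `build_bad_words_ids`, `generate_without`) are hypothetical, `model` and `tokenizer` are the objects already defined in this notebook, and the prefix-space option is set when loading the tokenizer, since recent `transformers` versions may ignore it when it is passed at call time to a fast tokenizer.

```python
def load_prefix_space_tokenizer(name="distilgpt2"):
    """Load a tokenizer that encodes words as they appear mid-sentence (" word")."""
    # Imported lazily so the other helpers can be used without transformers installed.
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(name, add_prefix_space=True)


def build_bad_words_ids(prefix_tokenizer, bad_words):
    """Return one list of token ids per banned word, the shape `bad_words_ids` expects."""
    return prefix_tokenizer(bad_words, add_special_tokens=False).input_ids


def generate_without(model, tokenizer, prefix_tokenizer, prompt, bad_words, **generate_kwargs):
    """Generate a continuation of `prompt` while blocking the given words."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        input_ids,
        bad_words_ids=build_bad_words_ids(prefix_tokenizer, bad_words),
        **generate_kwargs,
    )
    return tokenizer.decode(output[0])
```

A call could then look like `generate_without(model, tokenizer, load_prefix_space_tokenizer(), "I am not afraid", ["word1", "word2"], do_sample=True, top_k=50, top_p=0.95, max_length=100)`, where `"word1"` and `"word2"` are placeholders for the words you want to block.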

In [None]:
text = "I am not afraid"
input_ids = tokenizer.encode(text, return_tensors='pt').to('cuda')

# do_sample=True means this is sampling (top-k / top-p), not greedy decoding
sampled_output = model.generate(input_ids, do_sample=True, top_k=50, top_p=0.95, temperature=0.9, max_length=100, repetition_penalty=1)

In [51]:
print(tokenizer.decode(sampled_output[0]))

I am not afraid 
 You can ride on it and ride on it 
 You just wanna know how you feel 
  
 I am not afraid 
  
 Just tell me, don't you believe me 
  
 That's the only reason I'm a part of this group. 
  
 And this is the only one 
 To believe me 
  
 The only one I'm afraid 
  
 And this is


**Amazing!** Let's save the model and try a faster method to generate freestyle!

In [38]:
# save the fine-tuned model

trainer.save_model()

Saving model checkpoint to distilgpt2-freestyle-bot
Configuration saved in distilgpt2-freestyle-bot\config.json
Model weights saved in distilgpt2-freestyle-bot\pytorch_model.bin


Now, we're going to use the Pipeline API to create our freestyle rap bot:

In [21]:
rap_bot = pipeline('text-generation', model='./distilgpt2-freestyle-bot', tokenizer='distilgpt2')

In [26]:
output = rap_bot('My name is Hamza \n', max_length=200)  # **kwargs can be passed to the .generate method of the model to choose and control the decoding strategy.

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [27]:
print(output[0]['generated_text'].replace('\n ','\n'))

My name is Hamza 
When you're down and there won't be no reply 
I'm a man for life and you owe me money 
I ain't no homie that I am 
If I can't get a job then I'ma have a wife and a brother 
I'm a black man with a dream in sight 
 
I'mma get the world 
But I'mma go on 
 
Ain't nobody talking about being a black man 
 
And I'mma be the leader 
I used to be a bitch but now I'm a hustler 
But now they want the crown 
But now I'm a hustler 
When you're down and there won't be no reply 
I'm a black man with a dream in sight 
I'm a man for life and you owe me money 
I ain't no homie that I am 


**Wonderful!**
There are, of course, several ways to improve the quality of the generated freestyle, such as:
- Tweaking the `.generate()` parameters.
- Training on more data.
- Using a larger GPT model, or other large text-generation models.
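
To build some intuition for the first item, here is a simplified, pure-Python sketch of what `temperature`, `top_k`, and `top_p` do when choosing the next token. It is not the library's implementation, the function name `sample_next_token` is hypothetical, and the logits in the usage example are made-up numbers.

```python
import math
import random


def sample_next_token(logits, temperature=0.9, top_k=50, top_p=0.95, rng=random):
    """Pick a token index from `logits` using temperature, top-k and top-p filtering."""
    # 1. Temperature: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [logit / temperature for logit in logits]

    # 2. Top-k: keep only the k highest-scoring token indices.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the surviving tokens (shifted by the max for numerical stability).
    peak = scaled[ranked[0]]
    exps = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 3. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, weights, cumulative = [], [], 0.0
    for index, prob in zip(ranked, probs):
        kept.append(index)
        weights.append(prob)
        cumulative += prob
        if cumulative >= top_p:
            break

    # 4. Sample among the kept tokens, proportionally to their probabilities.
    return rng.choices(kept, weights=weights, k=1)[0]


# Made-up logits for a 5-token vocabulary
example_logits = [2.0, 1.5, 0.3, -1.0, -3.0]
print(sample_next_token(example_logits, temperature=0.9, top_k=3, top_p=0.95))
```

Lowering `top_k` or `top_p` restricts sampling to the most likely tokens and tends to give safer, more repetitive lines, while raising `temperature` makes less likely tokens more probable.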