In [1]:
import pickle
import pandas as pd

from utils.genius_scraper import download_lyrics

## Gathering Data (lyrics) - Source: genius.com

We first download the lyrics with the download script I wrote in `utils/genius_scraper.py`. For this analysis we will have a look at the following artists:
* [Eminem](https://en.wikipedia.org/wiki/Eminem)
* [J. Cole](https://en.wikipedia.org/wiki/J._Cole)
* [Kanye West](https://en.wikipedia.org/wiki/Kanye_West)
* [2Pac](https://en.wikipedia.org/wiki/Tupac_Shakur)
* [Notorious B.I.G](https://en.wikipedia.org/wiki/The_Notorious_B.I.G.)
* [Logic](https://en.wikipedia.org/wiki/Logic_(Rapper))
* [Nas](https://en.wikipedia.org/wiki/Nas)
* [Joyner Lucas](https://en.wikipedia.org/wiki/Joyner_Lucas)
* [Juice WRLD](https://en.wikipedia.org/wiki/Juice_Wrld)
* [Lil Pump](https://en.wikipedia.org/wiki/Lil_Pump)
* [Nicki Minaj](https://en.wikipedia.org/wiki/Nicki_Minaj)
* [Cardi B](https://en.wikipedia.org/wiki/Cardi_B)
* [Mac Miller](https://en.wikipedia.org/wiki/Mac_Miller)

*Note: at the time of writing, I was not aware of the [Genius API](https://docs.genius.com/).*

In [2]:
URLS = [
    "https://genius.com/artists/Eminem",
    "https://genius.com/artists/J-cole",
    "https://genius.com/artists/Kanye-west",
    "https://genius.com/artists/2pac",
    "https://genius.com/artists/The-notorious-big",
    "https://genius.com/artists/Logic",
    "https://genius.com/artists/Nas",
    "https://genius.com/artists/Joyner-lucas",
    "https://genius.com/artists/Juice-wrld",
    "https://genius.com/artists/Lil-pump",
    "https://genius.com/artists/Nicki-minaj",
    "https://genius.com/artists/Cardi-b",
    "https://genius.com/artists/Mac-miller"
]

rappers = ["Eminem", "J. Cole", "Kanye West", "2Pac", 
           "Notorious B.I.G.", "Logic", "Nas", "Joyner Lucas", 
           "Juice WRLD", "Lil Pump", 
           "Nicki Minaj", "Cardi B", "Mac Miller"]

The script pickles the lyrics into separate files (one per rapper) for later use and returns the downloaded text as a dictionary. This dictionary has the following structure:

{ rapper_name: { album_name: [tracks] } }

In addition, some lyrics may fail to download for reasons that are not always clear (which is why the official Genius API might be preferable here). The script also returns a dictionary `failed`, with the same structure as above, containing the lyrics that could not be downloaded.
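
To make the nested structure concrete, here is a minimal sketch with made-up placeholder entries (the rapper, album and track values are illustrative, not actual scraper output). It assumes each track is stored as a single string, and the helper `count_tracks` is hypothetical, not part of `utils/genius_scraper.py`.

```python
# Toy stand-ins for the two dictionaries; all names and values are placeholders.
data = {
    "Rapper A": {
        "Album 1": ["lyrics of track 1", "lyrics of track 2"],
        "Album 2": ["lyrics of track 3"],
    },
    "Rapper B": {"Album 3": ["lyrics of track 4"]},
}
failed = {"Rapper B": {"Album 3": ["track 5 (placeholder)"]}}


def count_tracks(nested):
    """Return {rapper: number of tracks} for a {rapper: {album: [tracks]}} dict."""
    return {
        rapper: sum(len(tracks) for tracks in albums.values())
        for rapper, albums in nested.items()
    }


print(count_tracks(data))    # tracks that were downloaded
print(count_tracks(failed))  # tracks that could not be downloaded
```

Comparing the two counts per rapper gives a quick impression of how complete the scrape was.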

Since I have already downloaded the lyrics, I commented out the download call and load the pickle files directly. If you want to run the scraper yourself, just uncomment the first line of the next cell.

In [3]:
#data, failed = download_lyrics(URLS, rappers)

data = {}
for r in rappers:
    with open("lyrics/" + r + ".pkl", "rb") as file:
        data[r] = pickle.load(file)

In [4]:
list(data.keys()) # Check if we could at least get one album from each rapper

['Eminem',
 'J. Cole',
 'Kanye West',
 '2Pac',
 'Notorious B.I.G.',
 'Logic',
 'Nas',
 'Joyner Lucas',
 'Juice WRLD',
 'Lil Pump',
 'Nicki Minaj',
 'Cardi B',
 'Mac Miller']

## Cleaning the data

Before we take a look at our data, let's aggregate the dictionary so that we end up with a dictionary of the following structure:

key: rapper, value: all lyrics in one string

In [5]:
lyrics = {}
for k, v in data.items():
    lyrics[k] = []
    for t in v:
        lyrics[k].append(' '.join(v[t]))

for k in lyrics:
    combined_text = ' '.join(lyrics[k])
    lyrics[k] = [combined_text]
    
## key: rapper, value: all lyrics in one string

In [6]:
# Load the text into a DataFrame, which is also easier to inspect visually
pd.set_option('display.max_colwidth', 150)

data_df = pd.DataFrame.from_dict(lyrics).transpose()
data_df.columns = ['lyrics']
data_df = data_df.sort_index()
data_df

Unnamed: 0,lyrics
2Pac,"[Intro: 2Pac & Snoop Doggy Dogg] Up out of there Ain't nothin' but a gangsta party Eh, light that up, Snoop! Why you actin' like that? Ahh shit,..."
Cardi B,"[Intro: Cardi B] Uh, uh, yeah, come on [Chorus: Bruno Mars & Cardi B] Please me, baby Turn around and just tease me, baby You know what I want ..."
Eminem,"[Intro: Eminem] Yeah So I guess this is what it is, huh? Think it's obvious We ain't never gonna see eye to eye But it's funny As much as I hate..."
J. Cole,"1985, I arrived 33 years, damn, I'm grateful I survived We wasn't s'posed to get past 25 Joke's on you motherfucker, we alive All these niggas p..."
Joyner Lucas,"[Kids chatting] [Ms. Nelson] Okay guys, time to go Remember to bring your homework in tomorrow And please, don't run, walk, please Thank you Jo..."
Juice WRLD,"[Spoken Words: Juice WRLD & Rob Markman] At the end of the day, I still thank God for everything that, you know, He's put in front of me But thi..."
Kanye West,"[Chorus] Sing every hour (Every hour, 'til the power) Every minute (Every minute, of the Lord) Every second (Every second, comes) Sing each and ..."
Lil Pump,"Lyrics from snippet [Intro] CB on the beat Yeah, ooh, yeah Jet, hmm Uh [Chorus] Sip the Wockhardt, now I'm 'dosin' on drank (Ooh) I spent like..."
Logic,"[Instrumental] [Instrumental] [Instrumental] [Spoken] For you... Rice is nothing, but for us. Rice is like my mother and father. Don..."
Mac Miller,"[Intro: Telly & Mac Miller] When you're young, not much matters When you find something that you care about, then that's all you got When you go..."


Let's see what the value for 2Pac looks like

In [7]:
data_df.lyrics.loc['2Pac']



At first glance we can see lots of 'impurities' in the string. This is a good point to start cleaning the data!

## Data Cleaning 1:

For this analysis, we don't care about the difference between lower-case and upper-case words. We also drop everything that carries little semantic information: section annotations in square brackets (such as `[Chorus]`), punctuation marks (basically everything that is not an alphabetic letter), and numbers. So what we do is:

- Make all text lower case
- Remove text in square brackets
- Remove punctuation marks
- Remove words containing numbers

In [8]:
import re
import string

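def clean_text1(text):
    """Lower-case the text and strip bracketed annotations, punctuation and numbers."""
    text = text.lower()
    # Drop annotations in square brackets, e.g. "[Chorus]"
    text = re.sub(r'\[.*?\]', '', text)
    # Drop ASCII punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Drop every word that contains a digit
    text = re.sub(r'\w*\d\w*', '', text)
    return text

clean1 = lambda x: clean_text1(x)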
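
As a quick sanity check, the first cleaning step can be tried on a single line. The sketch below repeats the function so that it runs on its own; the sample line is made up for illustration (loosely modelled on the Cardi B row above).

```python
import re
import string


def clean_text1(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                               # bracketed annotations
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # ASCII punctuation
    text = re.sub(r'\w*\d\w*', '', text)                              # words containing digits
    return text


# Made-up sample line, not taken verbatim from the corpus
sample = "[Chorus: Bruno Mars & Cardi B] Please me, baby! 24 hours"
cleaned = clean_text1(sample)

# The removed pieces leave extra whitespace behind, so we look at the tokens
print(cleaned.split())
```

Note that the function only deletes characters, so stray spaces remain where annotations or numbers used to be. This should not matter for a later word count, because tokenizers typically split on whitespace.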

In [9]:
data_cleaned = pd.DataFrame(data_df.lyrics.apply(clean1))

In [10]:
data_cleaned

Unnamed: 0,lyrics
2Pac,up out of there aint nothin but a gangsta party eh light that up snoop why you actin like that ahh shit you done fucked up now aint nothin but ...
Cardi B,uh uh yeah come on please me baby turn around and just tease me baby you know what i want and what i need baby let me hear you say please let...
Eminem,yeah so i guess this is what it is huh think its obvious we aint never gonna see eye to eye but its funny as much as i hate you i need you this...
J. Cole,i arrived years damn im grateful i survived we wasnt sposed to get past jokes on you motherfucker we alive all these niggas popping now is yo...
Joyner Lucas,okay guys time to go remember to bring your homework in tomorrow and please dont run walk please thank you joyner stay behind one second i wa...
Juice WRLD,at the end of the day i still thank god for everything that you know hes put in front of me but this materialistic money stuff dont really mean...
Kanye West,sing every hour every hour til the power every minute every minute of the lord every second every second comes sing each and every millisecond ...
Lil Pump,lyrics from snippet cb on the beat yeah ooh yeah jet hmm uh sip the wockhardt now im dosin on drank ooh i spent like on my wrist yeah yeah ...
Logic,for you rice is nothing but for us rice is like my mother and father dont fuck with my family huh huh i feel so relieved by rice...
Mac Miller,when youre young not much matters when you find something that you care about then thats all you got when you go to sleep at night you dream of...


This looks better now, but there is still some nonsensical text

## Data Cleaning 2:
- Remove nonsensical text, e.g. "\n" and typographic quotation marks

In [11]:
def clean_text2(text):
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

clean2 = lambda x: clean_text2(x)
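
The second step is needed because `string.punctuation` only covers ASCII characters, so curly quotes and the ellipsis character survive the first pass. The sketch below repeats the function on a made-up sample. It also shows a side effect worth knowing: replacing "\n" with an empty string glues the neighbouring words together. The variant `clean_text2_keep_gaps` is a hypothetical alternative, not something the notebook uses.

```python
import re


def clean_text2(text):
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text


def clean_text2_keep_gaps(text):
    """Hypothetical variant: turn newlines into spaces so words stay separate."""
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', ' ', text)
    return text


sample = "don’t stop…\nnow"  # made-up sample with a curly quote, an ellipsis and a newline
print(repr(clean_text2(sample)))
print(repr(clean_text2_keep_gaps(sample)))
```

Whether the difference matters depends on whether the scraped tracks still contain line breaks at this point, which I have not verified here.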

In [12]:
data_cleaned = pd.DataFrame(data_cleaned.lyrics.apply(clean2))

In [13]:
data_cleaned

Unnamed: 0,lyrics
2Pac,up out of there aint nothin but a gangsta party eh light that up snoop why you actin like that ahh shit you done fucked up now aint nothin but ...
Cardi B,uh uh yeah come on please me baby turn around and just tease me baby you know what i want and what i need baby let me hear you say please let...
Eminem,yeah so i guess this is what it is huh think its obvious we aint never gonna see eye to eye but its funny as much as i hate you i need you this...
J. Cole,i arrived years damn im grateful i survived we wasnt sposed to get past jokes on you motherfucker we alive all these niggas popping now is yo...
Joyner Lucas,okay guys time to go remember to bring your homework in tomorrow and please dont run walk please thank you joyner stay behind one second i wa...
Juice WRLD,at the end of the day i still thank god for everything that you know hes put in front of me but this materialistic money stuff dont really mean...
Kanye West,sing every hour every hour til the power every minute every minute of the lord every second every second comes sing each and every millisecond ...
Lil Pump,lyrics from snippet cb on the beat yeah ooh yeah jet hmm uh sip the wockhardt now im dosin on drank ooh i spent like on my wrist yeah yeah ...
Logic,for you rice is nothing but for us rice is like my mother and father dont fuck with my family huh huh i feel so relieved by rice...
Mac Miller,when youre young not much matters when you find something that you care about then thats all you got when you go to sleep at night you dream of...


## Data Cleaning 3:

Some words are offensive, so we replace variants of the n-word with a censored placeholder.

In [14]:
def clean_text3(text):
    text = re.sub('(n|i){1,32}((g{2,32}|q){1,32}|[gq]{2,32})[e3|ar]{1,32}', 'nibba', text) # regex for the n-word
    return text

clean3 = lambda x: clean_text3(x)
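
One caveat with the pattern above: it is not anchored to word boundaries, so it can also fire inside harmless words. The sketch below demonstrates this on two such words and compares it with a variant that only adds a leading `\b`. The anchored variant is an assumption of mine rather than what the notebook uses, and it would still match any word that merely starts with the same letters.

```python
import re

# Pattern copied from the cell above
ORIGINAL = re.compile('(n|i){1,32}((g{2,32}|q){1,32}|[gq]{2,32})[e3|ar]{1,32}')

# Hypothetical variant: same pattern, but a match must start at a word boundary
ANCHORED = re.compile(r'\b(n|i){1,32}((g{2,32}|q){1,32}|[gq]{2,32})[e3|ar]{1,32}')

harmless = "bigger trigger"
print(ORIGINAL.sub('nibba', harmless))  # the inner "igger" is replaced
print(ANCHORED.sub('nibba', harmless))  # left untouched
```

If such false positives end up in the corpus, they show up as odd tokens in the vocabulary later on, so it may be worth checking a few examples before settling on either pattern.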

In [15]:
data_cleaned = pd.DataFrame(data_cleaned.lyrics.apply(clean3))

In [16]:
data_cleaned

Unnamed: 0,lyrics
2Pac,up out of there aint nothin but a gangsta party eh light that up snoop why you actin like that ahh shit you done fucked up now aint nothin but ...
Cardi B,uh uh yeah come on please me baby turn around and just tease me baby you know what i want and what i need baby let me hear you say please let...
Eminem,yeah so i guess this is what it is huh think its obvious we aint never gonna see eye to eye but its funny as much as i hate you i need you this...
J. Cole,i arrived years damn im grateful i survived we wasnt sposed to get past jokes on you motherfucker we alive all these nibbas popping now is yo...
Joyner Lucas,okay guys time to go remember to bring your homework in tomorrow and please dont run walk please thank you joyner stay behind one second i wa...
Juice WRLD,at the end of the day i still thank god for everything that you know hes put in front of me but this materialistic money stuff dont really mean...
Kanye West,sing every hour every hour til the power every minute every minute of the lord every second every second comes sing each and every millisecond ...
Lil Pump,lyrics from snippet cb on the beat yeah ooh yeah jet hmm uh sip the wockhardt now im dosin on drank ooh i spent like on my wrist yeah yeah ...
Logic,for you rice is nothing but for us rice is like my mother and father dont fuck with my family huh huh i feel so relieved by rice...
Mac Miller,when youre young not much matters when you find something that you care about then thats all you got when you go to sleep at night you dream of...


This looks good for now. There is still a lot we could do to improve the results, such as using bigrams or, more generally, [n-grams](https://web.stanford.edu/~jurafsky/slp3/3.pdf) (splitting the text into sequences of n consecutive words), but this corpus should suffice, so let's save it for later analysis.

In [17]:
data_df.to_pickle("corpus.pkl")
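
For readers unfamiliar with the n-grams mentioned above, here is a minimal, dependency-free sketch. The helper `ngrams` is hypothetical and is not used anywhere else in this notebook; the sample text is a shortened fragment in the style of the cleaned lyrics.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "sing every hour every minute".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
```

With n = 1 this reduces to single words, which is the representation used in the rest of this notebook.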

Let's turn the data frame into a [Document Term Matrix](https://en.wikipedia.org/wiki/Document-term_matrix) (DTM) to tidy things up. In the DTM, each rapper is a row and every distinct word (excluding English stop words) is a column. The entries are integers that count how often the rapper used the word in their lyrics. We pickle this data frame as well for later analysis.

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

# convert to document term matrix
cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_cleaned.lyrics)
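# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (available since 1.0)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())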
data_dtm.index = data_cleaned.index
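
To illustrate what the matrix holds, here is a dependency-free sketch that builds a tiny DTM by hand with `collections.Counter`. It is not how `CountVectorizer` works internally and it skips stop-word removal; the two documents and their names are made up.

```python
from collections import Counter

# Made-up mini corpus: one already-cleaned string per (placeholder) rapper
docs = {"Rapper A": "yeah yeah ooh", "Rapper B": "ooh baby"}

# Count the words of each document
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Columns: the sorted vocabulary across all documents
vocabulary = sorted(set().union(*counts.values()))

# Rows: one list of counts per rapper, aligned with the vocabulary
dtm = {name: [counts[name][word] for word in vocabulary] for name in docs}

print(vocabulary)
for name, row in dtm.items():
    print(name, row)
```

Most entries of a real DTM are zero, because each rapper uses only a small share of the full vocabulary.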

Let's take a look at the data we have so far.

In [19]:
data_dtm

Unnamed: 0,aa,aaaaaaack,aaaaahhh,aaaaahhhh,aaaaayyyyooooo,aaaagain,aaaah,aaaahh,aaaand,aaaghh,...,世界中で聴いてる,帰っていただいて結構,彼の行動が気になって仕方ないはず,感謝しています,最高だったでしょう,本当はロジックを愛してやまないんでしょう,楽しんでいただけたことを願っています,毎日,私たちは共に歴史を刻んできた,耳を塞ぐか
2Pac,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Cardi B,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Eminem,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
J. Cole,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Joyner Lucas,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Juice WRLD,0,0,0,0,0,4,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Kanye West,1,0,1,1,0,0,2,1,0,0,...,0,0,0,0,0,0,0,0,0,0
Lil Pump,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Logic,1,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
Mac Miller,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


That's funny: we are looking at lyrics from US rappers, but we also have Japanese text in our data?!

It turns out that only Logic uses these Japanese characters. The song "Lost in Translation" features a woman speaking Japanese, a reference to the movie "Lost in Translation".
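
One way to track down such entries is to filter the vocabulary for non-ASCII words and check which rows have non-zero counts for them. The sketch below does this on a toy matrix in plain Python. The column names are borrowed from the output above, the counts and "Rapper A" are placeholders, and `str.isascii` requires Python 3.7 or newer.

```python
# Toy stand-in for the DTM: column names plus one row of counts per rapper
columns = ["aa", "baby", "毎日", "感謝しています"]
rows = {
    "Rapper A": [1, 2, 0, 0],  # placeholder row
    "Logic": [1, 0, 1, 1],
}

# Vocabulary entries that contain at least one non-ASCII character
non_ascii = [word for word in columns if not word.isascii()]

# Rappers with a non-zero count in any of those columns
users = [
    name
    for name, row in rows.items()
    if any(count > 0 for word, count in zip(columns, row) if not word.isascii())
]

print(non_ascii)
print(users)
```

On the real data frame, the same idea could be expressed by selecting the non-ASCII columns of `data_dtm` and summing them per row.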

In [20]:
data_dtm.to_pickle("dtm.pkl")

In [21]:
data_cleaned.to_pickle('data_clean.pkl')
# Use a context manager so the file handle is closed after writing
with open('cv.pkl', 'wb') as file:
    pickle.dump(cv, file)

This concludes our data preprocessing. We can now use the pickled data for the next analysis steps in the other notebooks!