In [120]:
import pandas as pd
import numpy as np
import json
import re

In [121]:
# Dataset obtained from: https://www.cs.cornell.edu/~arb/data/genius-expertise/

# Annotation Dataset

In [122]:
# the file is in JSON Lines format: one JSON object per line
with open('./data/genius-expertise/annotation_info.json', encoding='utf-8') as f:
    temp = [json.loads(line) for line in f]
annotations_df = pd.DataFrame(temp)
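
As a possible alternative to parsing each line by hand, pandas can read JSON Lines input directly with `lines=True`. The sketch below uses two made-up records in an in-memory buffer rather than the real file, so the field names and values are assumptions for illustration only. Whether `read_json` treats the nested fields of this dataset (such as `edits_lst`) exactly like the loop above has not been checked here.

```python
import io

import pandas as pd

# Two hypothetical records standing in for lines of a JSON Lines file
sample = io.StringIO(
    '{"song": "a-lyrics", "votes": 3}\n'
    '{"song": "b-lyrics", "votes": 7}\n'
)

# lines=True tells pandas to treat every line as a separate JSON object
demo_df = pd.read_json(sample, lines=True)
print(demo_df)
```

The manual loop gives more control (for example, skipping malformed lines), which may matter for a scraped dataset.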

In [123]:
# initial look
annotations_df.head(3)

Unnamed: 0,time,type,votes,pyongs,acceptor,url,contribution_stats,edits_lst,song,lyrics,artist,content,user
0,"Jan 31, 2014 11:48:16 AM",reviewed,55,0,,https://genius.com/2739115,"[{'name': 'PiecesOfAMan', 'contribution': 73.7...","[{'name': 'https://genius.com/Incilin', 'time'...",Kendrick-lamar-swimming-pools-drank-lyrics,[Produced by T-Minus],,,
1,"Sep 16, 2012 9:10:23 PM",reviewed,80,0,,https://genius.com/1072360,"[{'name': 'Incilin', 'contribution': 75.0}, {'...","[{'name': 'https://genius.com/Incilin', 'time'...",Kendrick-lamar-swimming-pools-drank-lyrics,[Intro],,,
2,"Jul 13, 2012 5:25:57 PM",reviewed,392,1,TheseDays,https://genius.com/905517,"[{'name': 'Haifisch', 'contribution': 31.82}, ...","[{'name': 'https://genius.com/Haifisch', 'time...",Kendrick-lamar-swimming-pools-drank-lyrics,"Pour up (Drank), head shot (Drank)\nSit down (...",,,


In [124]:
# keep only annotations that have an edits_lst; a single edit is extracted from each list further down
# (each row is one annotation, so the counts below are annotation counts)
print('Number of Annotated Songs: {}'.format(annotations_df.shape[0]))
annotations_df = annotations_df.dropna(axis=0, subset=['edits_lst'])
print('Number of Annotated Songs after Filters: {}'.format(annotations_df.shape[0]))

Number of Annotated Songs: 393954
Number of Annotated Songs after Filters: 322613


In [125]:
# looking at an example
annotations_df.iloc[10]['edits_lst'][0]['content']

'<p>Here, the phrase “condom wrappers” is used as a homonym, alongside “condom rappers.” The origins of the latter term are disputed, but it has been theorized that “condom rappers” are either ones who rap incessantly about sex or artists from the ‘80s who warned against the dangers of unprotected intercourse.</p>\n\n<p>Kendrick reminisces about when he would dream of being as successful as the rappers he listened to. Back then, using condoms wasn’t considered “cool,” as the dangers of STDs were not very well known. At the time, Compton hip-hop was not yet famous—Compton was put on the map starting with <a href="https://genius.com/artists/Nwa" rel="noopener" data-api_path="/artists/974">N.W.A</a>, who released their debut album <em>Straight Outta Compton</em> in 1988. Given the possible reference to 1980s hip-hop and condoms, this could also be a subtle shout-out to fellow Compton rapper and N.W.A member <a href="https://genius.com/artists/Eazy-e" rel="noopener" data-api_path="/artists

In [126]:
def clean_edits_lst(x):
    # take the content of the first edit in the list
    x = x[0]['content']
    # strip HTML tags, keeping only the text between them (https://stackoverflow.com/a/12982689/21492082)
    x = re.sub(r'<.*?>', '', x)
    # remove newline characters (adjacent paragraphs end up joined without a space)
    x = x.replace('\n', '')
    return x
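
The regex approach above leaves HTML entities such as `&amp;` untouched and glues paragraphs together. A minimal sketch of an alternative, using only the standard-library `html.parser` module, is shown below. The names `_TextExtractor`, `strip_html`, and `BLOCK_TAGS` are hypothetical and not part of the original notebook, and the choice of which tags count as block-level is an assumption.

```python
from html.parser import HTMLParser

# Tags after which a space is inserted so paragraphs do not run together
# (an assumed, non-exhaustive list)
BLOCK_TAGS = {'p', 'br', 'div', 'li'}


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(' ')

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(' ')

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html_text):
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Collapse runs of whitespace (including newlines) into single spaces
    return ' '.join(''.join(parser.parts).split())


print(strip_html('<p>One.</p>\n\n<p>Two &amp; <em>three</em>.</p>'))
```

This is slower than a single regex, so for several hundred thousand annotations the simpler version may still be the pragmatic choice.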

In [127]:
# looking at a clean example
print(clean_edits_lst(annotations_df.iloc[10]['edits_lst']))

Here, the phrase “condom wrappers” is used as a homonym, alongside “condom rappers.” The origins of the latter term are disputed, but it has been theorized that “condom rappers” are either ones who rap incessantly about sex or artists from the ‘80s who warned against the dangers of unprotected intercourse.Kendrick reminisces about when he would dream of being as successful as the rappers he listened to. Back then, using condoms wasn’t considered “cool,” as the dangers of STDs were not very well known. At the time, Compton hip-hop was not yet famous—Compton was put on the map starting with N.W.A, who released their debut album Straight Outta Compton in 1988. Given the possible reference to 1980s hip-hop and condoms, this could also be a subtle shout-out to fellow Compton rapper and N.W.A member Eazy-E, who died from complications of AIDS in 1995.The Notorious B.I.G. also mentioned this cultural transition from a contraception-averse society to wearing condoms due to the dangers of disea

In [128]:
# map function to entire column
annotations_df['edits_lst'] = annotations_df['edits_lst'].apply(clean_edits_lst)

In [129]:
# save cleaned dataset
annotations_df.to_csv('./data/genius-expertiste_clean/annotation.csv', index=False)

# Lyrics Dataset

In [130]:
# same JSON Lines layout as the annotation file
with open('./data/genius-expertise/lyrics.jl', encoding='utf-8') as f:
    temp = [json.loads(line) for line in f]
lyrics_df = pd.DataFrame(temp)

In [131]:
# looking at an example
lyrics_df.iloc[10]['lyrics']

"\n\n[Intro: Mr. Talkbox]\n\n[Pre-Chorus: DJ Dahi & Kendrick Lamar]\nI said I'm geeked and I’m fired up (Fired, fire)\nAll I want tonight is just get high (High, high, high)\nGirl, you look so good, it's to die for (Die for)\nOoh, that pussy good, it's to die for (In fire)\n\n[Chorus 1: Kendrick Lamar & Rihanna]\nIt’s a secret society\nAll we ask is trust (All we ask is trust)\nAll we got is us\nLoyalty, loyalty, loyalty\nLoyalty, loyalty, loyalty\n\n[Verse 1: Kendrick Lamar]\nKung Fu Kenny now\nMy resume is real enough for two millenniums\nA better way to make a wave, stop defendin' them\nI meditate and moderate all of my wins again\nI'm hangin' on the fence again\nI'm always on your mind\nI put my lyric and my lifeline on the line\nAnd ain't no limit when I might shine, might grind\nYou rollin' with it at the right time, right now\n(Only for the dollar sign)\n\n[Verse 2: Rihanna]\nBad girl RiRi now\nSwerve, swerve, swerve, swerve, leave it now\nOn your pulse like it's EDM\nGas in the

In [132]:
# looks like not much is required other than removing the end of line chars
lyrics_df.iloc[10]['lyrics'].replace('\n', ' ')

"  [Intro: Mr. Talkbox]  [Pre-Chorus: DJ Dahi & Kendrick Lamar] I said I'm geeked and I’m fired up (Fired, fire) All I want tonight is just get high (High, high, high) Girl, you look so good, it's to die for (Die for) Ooh, that pussy good, it's to die for (In fire)  [Chorus 1: Kendrick Lamar & Rihanna] It’s a secret society All we ask is trust (All we ask is trust) All we got is us Loyalty, loyalty, loyalty Loyalty, loyalty, loyalty  [Verse 1: Kendrick Lamar] Kung Fu Kenny now My resume is real enough for two millenniums A better way to make a wave, stop defendin' them I meditate and moderate all of my wins again I'm hangin' on the fence again I'm always on your mind I put my lyric and my lifeline on the line And ain't no limit when I might shine, might grind You rollin' with it at the right time, right now (Only for the dollar sign)  [Verse 2: Rihanna] Bad girl RiRi now Swerve, swerve, swerve, swerve, leave it now On your pulse like it's EDM Gas in the bitch like it’s premium Haul ass

In [133]:
# map function to entire column
lyrics_df['lyrics'] = lyrics_df['lyrics'].apply(lambda x: x.replace('\n', ' '))
lyrics_df

Unnamed: 0,song,lyrics
0,Kendrick-lamar-swimming-pools-drank-lyrics,[Produced by T-Minus] [Intro] Pour up (Dran...
1,Kendrick-lamar-money-trees-lyrics,[Produced by DJ Dahi] [Verse 1: Kendrick La...
2,Kendrick-lamar-xxx-lyrics,"[Intro: Bēkon & Kid Capri] America, God bles..."
3,A-ap-rocky-fuckin-problems-lyrics,"[Chorus: 2 Chainz, Drake & Both (A$AP Rocky)..."
4,Kendrick-lamar-dna-lyrics,"[Verse 1] I got, I got, I got, I got— Loyalt..."
...,...,...
37988,Pnl-tchiki-tchiki-lyrics,"[Intro : N.O.S.] Ouais, ouais, ouais, ouais,..."
37989,Pnl-chang-lyrics,"[Couplet 1 : Ademo] Chang, chang, chang, j'm..."
37990,Pnl-simba-lyrics,"[Intro: N.O.S] Ouais, Ah on va voir Ouais, o..."
37991,Pnl-je-thaine-version-orange-lyrics,[Produit par BBP] [Couplet 1 : Ademo] Que d...


In [134]:
# cleaning song column: build a lookup from song slug to a readable title

songs_dict = {}
with open('./data/genius-expertise/artist_info.json', encoding='utf-8') as f:
    for line in f:
        line = json.loads(line)
        artist = line['url_name']
        for song in line['songs']:
            # skip slugs containing '-and-', 'mtv', or 'version'
            if '-and-' in song or 'mtv' in song or 'version' in song:
                continue
            # keep the first title found for each song slug
            if artist in song and song not in songs_dict:
                # drop the artist name and the trailing 'lyrics' token
                songs_dict[song] = ' '.join(song.replace(artist, ' ').split('-')[1:-1])
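
To make the slicing in the last line concrete, here is a standalone sketch of the same string logic applied to two slugs that appear in the lyrics table. The helper name `title_from_slug` is hypothetical, and the artist values are assumed to match the `url_name` field for these songs.

```python
def title_from_slug(song, artist):
    """Return a readable title for a song slug, or None if the slug is skipped."""
    if '-and-' in song or 'mtv' in song or 'version' in song:
        return None
    if artist not in song:
        return None
    # Replacing the artist name with a space leaves ' -title-words-lyrics';
    # after splitting on '-', the first item is that space and the last is 'lyrics'
    tokens = song.replace(artist, ' ').split('-')
    return ' '.join(tokens[1:-1])


print(title_from_slug('Kendrick-lamar-swimming-pools-drank-lyrics', 'Kendrick-lamar'))
print(title_from_slug('A-ap-rocky-fuckin-problems-lyrics', 'A-ap-rocky'))
```

Note that the substring filters are broad: any title that happens to contain "version" or "-and-" is dropped as well.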

In [135]:
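# attach the parsed title; songs without an entry in songs_dict get NaN
lyrics_df['title'] = lyrics_df['song'].apply(lambda x: songs_dict.get(x, np.nan))
# drop every row that has a missing value in any column
lyrics_df = lyrics_df.dropna()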

In [136]:
lyrics_df.head()

Unnamed: 0,song,lyrics,title
0,Kendrick-lamar-swimming-pools-drank-lyrics,[Produced by T-Minus] [Intro] Pour up (Dran...,swimming pools drank
1,Kendrick-lamar-money-trees-lyrics,[Produced by DJ Dahi] [Verse 1: Kendrick La...,money trees
2,Kendrick-lamar-xxx-lyrics,"[Intro: Bēkon & Kid Capri] America, God bles...",xxx
3,A-ap-rocky-fuckin-problems-lyrics,"[Chorus: 2 Chainz, Drake & Both (A$AP Rocky)...",fuckin problems
4,Kendrick-lamar-dna-lyrics,"[Verse 1] I got, I got, I got, I got— Loyalt...",dna


In [137]:
# save cleaned dataset
lyrics_df.to_csv('./data/genius-expertiste_clean/lyrics.csv', index=False)

# Inner Join of Datasets

In [138]:
annotations_df = pd.read_csv('./data/genius-expertiste_clean/annotation.csv')
lyrics_df = pd.read_csv('./data/genius-expertiste_clean/lyrics.csv')

In [139]:
songs = lyrics_df.merge(annotations_df, how='inner', on='song')
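
Both tables carry a `lyrics` column, which is why the merged frame ends up with `lyrics_x` (from `lyrics_df`) and `lyrics_y` (from `annotations_df`). The toy example below illustrates this with invented rows; the frame names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for lyrics_df and annotations_df
left = pd.DataFrame({
    'song': ['a-lyrics', 'b-lyrics'],
    'lyrics': ['full lyrics of a', 'full lyrics of b'],
    'title': ['a', 'b'],
})
right = pd.DataFrame({
    'song': ['a-lyrics', 'a-lyrics', 'c-lyrics'],
    'votes': [1, 5, 2],
    'lyrics': ['annotated line 1', 'annotated line 2', 'annotated line 3'],
})

# An inner join keeps only songs present in both frames; overlapping
# column names other than the key receive the default _x / _y suffixes
merged = left.merge(right, how='inner', on='song')
print(merged.columns.tolist())
print(len(merged))
```

Passing `suffixes=('_full', '_annotated')` to `merge` would be one way to get more descriptive column names.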

In [140]:
# multiple people can submit annotations per song; I want the highest-voted one
# in case I want to use the annotation column

# the merge result already has a default integer index, so idxmax labels can be used with .loc directly
songs = songs.loc[songs.groupby('song')['votes'].idxmax()]
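
The selection above relies on `idxmax` returning, for each song, the index label of the row with the most votes. A small sketch with invented data shows the pattern in isolation; the frame `demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical merged frame: song 'a' has two annotations, song 'b' has one
demo = pd.DataFrame({
    'song': ['a', 'a', 'b'],
    'votes': [1, 5, 3],
    'edits_lst': ['low-voted note', 'top note for a', 'only note for b'],
})

# idxmax gives one index label per group; .loc then pulls those rows
best_rows = demo.groupby('song')['votes'].idxmax()
top = demo.loc[best_rows]
print(top)
```

The result keeps the original index labels, which is why the head of the real `songs` frame shows non-consecutive index values.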

In [141]:
songs.head()

Unnamed: 0,song,lyrics_x,title,time,type,votes,pyongs,acceptor,url,contribution_stats,edits_lst,lyrics_y,artist,content,user
221344,101barz-fresku-studiosessie-272-lyrics,[Verse] Eey Rare jongens die Romeinen Nieman...,fresku studiosessie 272,"Nov 6, 2017 12:47:51 PM",reviewed,11,0,Liampjuh,https://genius.com/12999733,"[{'name': 'LuukVerheggen', 'contribution': 69....",Hier verwijst Fresku naar de stripfiguren Aste...,Rare jongens die Romeinen\nNiemand heeft een h...,,,
216669,10kcaash-swajjurkicks-lyrics,[Chorus: 10k.Caash] Lean with my Brisk (Lean...,swajjurkicks,"Dec 27, 2018 12:49:37 PM",reviewed,2,0,,https://genius.com/16097071,"[{'name': 'blustery', 'contribution': 100.0}]",ChaseTheMoney is the producer for this song. V...,"ChaseTheMoney, ChaseTheMoney",,,
196975,112-only-you-bad-boy-remix-lyrics,[Pre-Intro: 112] Keep it real (keep it real)...,only you bad boy remix,"Sep 25, 2011 10:09:20 PM",reviewed,7,1,,https://genius.com/375950,"[{'name': 'bjax', 'contribution': 100.0}]",Biggie’s accountant gets a shoutout!,Bert Padell,,,
197256,112-only-you-lyrics,"[Verse 1: Q Parker] Oh, I need to know where...",only you,"Jan 4, 2017 10:40:35 AM",reviewed,2,0,bfred,https://genius.com/11117933,"[{'name': 'AintNothinLikeDaOldSchool', 'contri...",“Pablo” is a reference to Colombian drug lord ...,Cats named Pablo in milked out Diablos,,,
230199,113-on-sait-lfaire-lyrics,[Couplet 1 : Booba] J'ai quelques principes ...,on sait lfaire,"Jun 24, 2012 9:41:25 AM",reviewed,3,0,Clement_RGF,https://genius.com/865004,"[{'name': 'Clement_RGF', 'contribution': 50.0}...","En effet, peu de rappeurs arrive à vivre plein...",Ils croient tous qu'on veut faire des thunes e...,,,


In [142]:
# keeping columns required for summarization task
songs = songs[['lyrics_x', 'title']]
songs = songs.dropna()
songs.columns = ['lyrics', 'title']
songs.head()

Unnamed: 0,lyrics,title
221344,[Verse] Eey Rare jongens die Romeinen Nieman...,fresku studiosessie 272
216669,[Chorus: 10k.Caash] Lean with my Brisk (Lean...,swajjurkicks
196975,[Pre-Intro: 112] Keep it real (keep it real)...,only you bad boy remix
197256,"[Verse 1: Q Parker] Oh, I need to know where...",only you
230199,[Couplet 1 : Booba] J'ai quelques principes ...,on sait lfaire


In [143]:
songs.to_csv('./data/genius-expertiste_clean/songs.csv', index=False)

# Final Dataset for Summarization

In [144]:
songs.head()

Unnamed: 0,lyrics,title
221344,[Verse] Eey Rare jongens die Romeinen Nieman...,fresku studiosessie 272
216669,[Chorus: 10k.Caash] Lean with my Brisk (Lean...,swajjurkicks
196975,[Pre-Intro: 112] Keep it real (keep it real)...,only you bad boy remix
197256,"[Verse 1: Q Parker] Oh, I need to know where...",only you
230199,[Couplet 1 : Booba] J'ai quelques principes ...,on sait lfaire
