# Data exploration

The dataset explored here is available on Kaggle: https://www.kaggle.com/neisse/scrapped-lyrics-from-6-genres

In [1]:
import numpy as np
import pandas as pd

In [3]:
repo_dir = 'E:/Repos/comp550-final-project'
data_dir = f'{repo_dir}/data/kaggle'
# Rename 'Link' to 'ALink' so that both tables share the same key column
artists_df = pd.read_csv(f'{data_dir}/artists-data.csv').rename(columns={'Link': 'ALink'})
artists_df = artists_df.drop_duplicates().dropna().reset_index(drop=True)
lyrics_df = pd.read_csv(f'{data_dir}/lyrics-data.csv').drop_duplicates().dropna().reset_index(drop=True)

In [4]:
artists_df

,Artist,Songs,Popularity,ALink,Genre,Genres
0,10000 Maniacs,110,0.3,/10000-maniacs/,Rock,Rock; Pop; Electronica; Dance; J-Pop/J-Rock; G...
1,12 Stones,75,0.3,/12-stones/,Rock,Rock; Gospel/Religioso; Hard Rock; Grunge; Roc...
2,311,196,0.5,/311/,Rock,Rock; Surf Music; Reggae; Ska; Pop/Rock; Rock ...
3,4 Non Blondes,15,7.5,/4-non-blondes/,Rock,Rock; Pop/Rock; Rock Alternativo; Grunge; Blue...
4,A Cruz Está Vazia,13,0.0,/a-cruz-esta-vazia/,Rock,Rock
...,...,...,...,...,...,...
3233,Péricles,102,6.8,/pericles/,Samba,Romântico; Pagode; Samba; Sertanejo; Samba Enr...
3234,Rodriguinho,106,2.7,/rodriguinho/,Samba,Romântico; Pagode; Samba; Country; Hardcore; T...
3235,Sambô,71,0.8,/sambo/,Rock,Samba; Pagode; Rock; Pop/Rock; Soul Music; Cla...
3236,Thiaguinho,143,13.8,/thiaguinho/,Samba,Pagode; Romântico; Samba; Trilha Sonora; Black...


In [5]:
lyrics_df

,ALink,SName,SLink,Lyric,Idiom
0,/10000-maniacs/,More Than This,/10000-maniacs/more-than-this.html,I could feel at the time. There was no way of ...,ENGLISH
1,/10000-maniacs/,Because The Night,/10000-maniacs/because-the-night.html,"Take me now, baby, here as I am. Hold me close...",ENGLISH
2,/10000-maniacs/,These Are Days,/10000-maniacs/these-are-days.html,These are. These are days you'll remember. Nev...,ENGLISH
3,/10000-maniacs/,A Campfire Song,/10000-maniacs/a-campfire-song.html,"A lie to say, ""O my mountain has coal veins an...",ENGLISH
4,/10000-maniacs/,Everyday Is Like Sunday,/10000-maniacs/everyday-is-like-sunday.html,Trudging slowly over wet sand. Back to the ben...,ENGLISH
...,...,...,...,...,...
164827,/zeca-pagodinho/,Vôo de Paz,/zeca-pagodinho/voo-de-paz.html,Há qualquer coisa entre nós. Que nos priva de ...,PORTUGUESE
164828,/zeca-pagodinho/,Vou Procurar Esquecer,/zeca-pagodinho/vou-procurar-esquecer.html,Vou procurar um novo amor na minha vida. Porqu...,PORTUGUESE
164829,/zeca-pagodinho/,Vou Ver Juliana,/zeca-pagodinho/vou-ver-juliana.html,Quando a mare vazá. Vou vê juliana. Vou vê jul...,PORTUGUESE
164830,/zeca-pagodinho/,Yaô Cadê A Samba / Outro Recado / Hino,/zeca-pagodinho/yao-cade-a-samba-outro-recado-...,"Ô Yaô. Yaô, cadê a samba?. Está mangando na cu...",PORTUGUESE


In [11]:
print(f'The dataset contains lyrics in {lyrics_df.Idiom.nunique()} languages:')

The dataset contains lyrics in 47 languages:


In [6]:
lyrics_df.Idiom.unique()

array(['ENGLISH', 'PORTUGUESE', 'SPANISH', 'ITALIAN', 'FRENCH',
       'KINYARWANDA', 'DANISH', 'NORWEGIAN', 'GERMAN', 'INDONESIAN',
       'SWAHILI', 'FINNISH', 'SLOVAK', 'BASQUE', 'ESTONIAN', 'SERBIAN',
       'CROATIAN', 'BOSNIAN', 'IRISH', 'CATALAN', 'KURDISH', 'SUNDANESE',
       'HUNGARIAN', 'DUTCH', 'AFRIKAANS', 'ICELANDIC', 'MALAY', 'SESOTHO',
       'SWEDISH', 'WELSH', 'TAGALOG', 'POLISH', 'GALICIAN',
       'HAITIAN_CREOLE', 'KOREAN', 'GANDA', 'HMONG', 'NYANJA', 'RUSSIAN',
       'ARABIC', 'TURKISH', 'MALAGASY', 'JAPANESE', 'SLOVENIAN', 'CZECH',
       'CEBUANO', 'ROMANIAN'], dtype=object)
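
The list above only names the languages; it does not show how many songs each one has. A minimal sketch of how that could be checked with `value_counts`, using a small made-up stand-in (`toy_lyrics_df`, a hypothetical name) in place of `lyrics_df`:

```python
import pandas as pd

# Toy stand-in for lyrics_df; these rows are made up for illustration only.
toy_lyrics_df = pd.DataFrame({
    'SName': ['a', 'b', 'c', 'd', 'e'],
    'Idiom': ['ENGLISH', 'ENGLISH', 'PORTUGUESE', 'ENGLISH', 'SPANISH'],
})

# Number of songs per language, most frequent language first.
songs_per_language = toy_lyrics_df.Idiom.value_counts()
print(songs_per_language)
```

On the real data the same call would be `lyrics_df.Idiom.value_counts()`.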

Classifying genres across 47 languages might prove to be a difficult NLP task. To simplify our genre classification problem, let's look only at the English data:

In [7]:
merged_df = pd.merge(artists_df, lyrics_df, on=['ALink'])
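# Keep English songs only, then drop the link, language and artist-metadata columns
unused_columns = ['ALink', 'SLink', 'Idiom', 'Genres', 'Songs', 'Popularity']
english_df = merged_df[merged_df.Idiom == 'ENGLISH'].drop(columns=unused_columns).reset_index(drop=True)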
english_df

,Artist,Genre,SName,Lyric
0,10000 Maniacs,Rock,More Than This,I could feel at the time. There was no way of ...
1,10000 Maniacs,Rock,Because The Night,"Take me now, baby, here as I am. Hold me close..."
2,10000 Maniacs,Rock,These Are Days,These are. These are days you'll remember. Nev...
3,10000 Maniacs,Rock,A Campfire Song,"A lie to say, ""O my mountain has coal veins an..."
4,10000 Maniacs,Rock,Everyday Is Like Sunday,Trudging slowly over wet sand. Back to the ben...
...,...,...,...,...
94909,Wesley Ignacio,Pop,Without You Now,I came here for you. I can't feel enough your ...
94910,Wesley Ignacio,Pop,You Left Me Alone,You left me alone. And you will forget my hear...
94911,Wesley Ignacio,Pop,You're Amazing,You're Amazing. You're Amazing. You're Amazing...
94912,Suellen Lima,Sertanejo,Trough Many Ways,I walked trough many ways. shattered dreams al...
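
One thing worth checking about a merge on `ALink`: if the same artist link appears in more than one row of the artist table (for instance, once per genre), every song by that artist is repeated in the merged result. Whether that happens in this dataset is not verified here. The sketch below uses made-up toy frames (`toy_artists_df` and `toy_lyrics_df` are hypothetical names) to show how such repeats could be detected, either directly or through the `validate` argument of `pd.merge`:

```python
import pandas as pd
from pandas.errors import MergeError

# Toy stand-ins; artist 'A' is deliberately listed under two genres.
toy_artists_df = pd.DataFrame({
    'Artist': ['A', 'A', 'B'],
    'Genre': ['Rock', 'Pop', 'Pop'],
    'ALink': ['/a/', '/a/', '/b/'],
})
toy_lyrics_df = pd.DataFrame({
    'ALink': ['/a/', '/b/'],
    'SName': ['song 1', 'song 2'],
})

# Artist links that occur in more than one row of the artist table.
repeated_links = toy_artists_df.ALink[toy_artists_df.ALink.duplicated()].unique()

# validate='one_to_many' makes the merge raise if ALink is not unique on the left side.
try:
    pd.merge(toy_artists_df, toy_lyrics_df, on=['ALink'], validate='one_to_many')
    merge_is_one_to_many = True
except MergeError:
    merge_is_one_to_many = False

# Without validation, the single song of artist 'A' appears twice (once per genre).
toy_merged_df = pd.merge(toy_artists_df, toy_lyrics_df, on=['ALink'])
print(repeated_links, merge_is_one_to_many, len(toy_merged_df))
```

If repeats do show up on the real data, how to resolve them (keeping one genre per artist, or keeping all) is a modelling decision rather than a purely technical one.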


If we now count the English songs per genre, we can see that only Hip Hop, Pop and Rock have a sizable number of songs:

In [8]:
english_df.groupby('Genre').Lyric.count()

Genre
Funk Carioca       69
Hip Hop         16144
Pop             28442
Rock            50159
Samba              42
Sertanejo          58
Name: Lyric, dtype: int64
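
Given these counts, one possible next step (not something this notebook does) would be to keep only the genres with enough songs. A minimal sketch on a made-up toy frame; `toy_english_df`, `min_songs` and `major_df` are hypothetical names, and the threshold is arbitrary:

```python
import pandas as pd

# Toy stand-in for english_df; these rows are made up for illustration only.
toy_english_df = pd.DataFrame({
    'Genre': ['Rock', 'Rock', 'Rock', 'Pop', 'Pop', 'Samba'],
    'Lyric': ['l1', 'l2', 'l3', 'l4', 'l5', 'l6'],
})

min_songs = 2  # hypothetical threshold; the real data would need a much larger one

# Count songs per genre and keep the genres that reach the threshold.
genre_counts = toy_english_df.Genre.value_counts()
major_genres = genre_counts[genre_counts >= min_songs].index

# Restrict the data to those genres.
major_df = toy_english_df[toy_english_df.Genre.isin(major_genres)].reset_index(drop=True)
print(major_df.Genre.value_counts())
```

On the real data, a threshold in the thousands would keep Hip Hop, Pop and Rock and drop the three genres with fewer than 100 English songs.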