### Imports and paths

In [1]:
import pandas as pd
import os

lyrics_csv = os.path.join("data", "lyrics.csv")

### Load data

Data from: https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics/version/2

In [19]:
data_frame = pd.read_csv(lyrics_csv, encoding="utf8", sep=",")
data_frame.head()

|   | index | song | year | artist | genre | lyrics |
|---|---|---|---|---|---|---|
| 0 | 0 | ego-remix | 2009 | beyonce-knowles | Pop | Oh baby, how you doing?\nYou know I'm gonna cu... |
| 1 | 1 | then-tell-me | 2009 | beyonce-knowles | Pop | playin' everything so easy,\nit's like you see... |
| 2 | 2 | honesty | 2009 | beyonce-knowles | Pop | If you search\nFor tenderness\nIt isn't hard t... |
| 3 | 3 | you-are-my-rock | 2009 | beyonce-knowles | Pop | Oh oh oh I, oh oh oh I\n[Verse 1:]\nIf I wrote... |
| 4 | 4 | black-culture | 2009 | beyonce-knowles | Pop | Party the people, the people the party it's po... |


### Filter data frame

We drop the Pop and Rock genres and keep only songs whose year is after 1990 and no later than 2020 (the comparison below excludes 1990 itself).

In [25]:
if "index" in data_frame:
    del data_frame["index"]

# filter genre: ~isin(...) keeps only rows whose genre is NOT in the list
genres = ["Pop", "Rock"]
data_frame = data_frame[~data_frame.genre.isin(genres)]

# filter year: start_year is exclusive, end_year is inclusive
start_year = 1990
end_year = 2020
mask = (data_frame['year'] > start_year) & (data_frame['year'] <= end_year)
data_frame = data_frame.loc[mask]
data_frame.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 177115 entries, 249 to 362236
Data columns (total 5 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   song    177113 non-null  object
 1   year    177115 non-null  int64 
 2   artist  177115 non-null  object
 3   genre   177115 non-null  object
 4   lyrics  115531 non-null  object
dtypes: int64(1), object(4)
memory usage: 8.1+ MB
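
To make the behaviour of the two masks easier to check, here is a minimal sketch on a made-up table. The rows, the `toy` frame, and the names `excluded_genres`, `genre_mask`, and `year_mask` are illustrative assumptions, not part of the dataset; the comparisons mirror the cell above.

```python
import pandas as pd

# Toy stand-in for the lyrics table; every row here is invented for illustration.
toy = pd.DataFrame(
    {
        "song": ["a", "b", "c", "d", "e", "f"],
        "year": [1990, 1991, 2005, 2020, 2021, 2020],
        "genre": ["Hip-Hop", "Pop", "Jazz", "Rock", "Metal", "Country"],
    }
)

excluded_genres = ["Pop", "Rock"]
start_year, end_year = 1990, 2020

# ~isin(...) keeps rows whose genre is NOT in the list.
genre_mask = ~toy["genre"].isin(excluded_genres)
# Same bounds as the notebook cell: start year exclusive, end year inclusive.
year_mask = (toy["year"] > start_year) & (toy["year"] <= end_year)

filtered = toy.loc[genre_mask & year_mask]
kept_songs = filtered["song"].tolist()
print(kept_songs)
```

Song `a` is dropped because 1990 fails the strict `>` comparison; switching to `>=` would keep it if an inclusive lower bound is what you want.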


### Count genres, artists, and song-title characters

In [40]:
genre_count = data_frame["genre"].nunique()
artist_count = data_frame["artist"].nunique()

# collect the distinct characters used in song titles (missing titles are skipped)
title_chars = set()
for title in data_frame["song"]:
    if isinstance(title, str):
        title_chars.update(title)
genre_count, artist_count, len(title_chars)

(10, 11483, 37)
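
The same counts can be sketched without an explicit loop. This is one possible variant, shown on an invented three-row frame (`toy` and `title_chars` are hypothetical names); it assumes missing titles should simply be skipped, as the loop above does.

```python
import pandas as pd

# Invented rows; the None title mimics the two missing song names in the real data.
toy = pd.DataFrame(
    {
        "song": ["ego-remix", "honesty", None],
        "artist": ["x", "x", "y"],
        "genre": ["Pop", "Pop", "Jazz"],
    }
)

# nunique() counts distinct non-missing values.
genre_count = toy["genre"].nunique()
artist_count = toy["artist"].nunique()

# dropna() removes missing titles; map(set) turns each title into its set of
# characters; set().union(*...) merges those sets into one.
title_chars = set().union(*toy["song"].dropna().map(set))

print(genre_count, artist_count, len(title_chars))
```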

### Prepare Data

Apply the following transformations:
 - replace hyphens in song with spaces
 - replace hyphens in artist with spaces

In [30]:
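# regex=False treats "-" as a literal character regardless of the pandas version
data_frame["song"] = data_frame["song"].str.replace("-", " ", regex=False)
data_frame["artist"] = data_frame["artist"].str.replace("-", " ", regex=False)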
data_frame.head()

|   | song | year | artist | genre | lyrics |
|---|---|---|---|---|---|
| 249 | i got that | 2007 | eazy e | Hip-Hop | (horns)...\n(chorus)\nTimbo- When you hit me o... |
| 250 | 8 ball remix | 2007 | eazy e | Hip-Hop | Verse 1:\nI don't drink brass monkey, like to ... |
| 251 | extra special thankz | 2007 | eazy e | Hip-Hop | 19 muthaphukkin 93,\nand I'm back in this bitc... |
| 252 | boyz in da hood | 2007 | eazy e | Hip-Hop | Hey yo man, remember that shit Eazy did a whil... |
| 253 | automoblie | 2007 | eazy e | Hip-Hop | Yo, Dre, man, I take this bitch out to the mov... |
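
As a small illustration of the hyphen replacement, the sketch below runs `str.replace` on a made-up series (the `titles` and `cleaned` names are hypothetical). It passes `regex=False` so the hyphen is treated literally, and it shows that a missing title stays missing rather than raising an error.

```python
import pandas as pd

# Two titles in the dataset's hyphenated style plus one missing value.
titles = pd.Series(["ego-remix", "you-are-my-rock", None])

# Literal replacement of every hyphen with a space; missing entries pass through.
cleaned = titles.str.replace("-", " ", regex=False)

print(cleaned.iloc[0])
print(cleaned.iloc[1])
print(pd.isna(cleaned.iloc[2]))
```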


### Save data

In [31]:
save_path = os.path.join("data", "preprocessed_lyrics.csv")
data_frame.to_csv(save_path, encoding="utf8", sep=",", index=False)
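
A quick way to sanity-check the save step is a round trip. The sketch below is an assumption-laden stand-in: it writes an invented two-row frame to an in-memory buffer instead of `data/preprocessed_lyrics.csv`, then reads it back to show what `index=False` does and that multi-line lyrics and missing values survive.

```python
import io

import pandas as pd

# Invented rows; the index mimics the non-contiguous labels left after filtering.
toy = pd.DataFrame(
    {
        "song": ["i got that", "8 ball remix"],
        "year": [2007, 2007],
        "lyrics": ["line one\nline two", None],
    },
    index=[249, 250],
)

# Write to an in-memory text buffer rather than a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, sep=",", index=False)

# Rewind and read the CSV back.
buffer.seek(0)
restored = pd.read_csv(buffer, sep=",")

print(list(restored.columns))
print(list(restored.index))
```

Because the old index is not written, the restored frame gets a fresh 0-based index; drop `index=False` if the original row labels need to be preserved.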