In [5]:
import pandas as pd
import numpy as np
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import pickle

In [2]:
# Load the combined data
df = pd.read_csv('../data/processed/combined_lyrics.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Artist,Title,Album,Year,Date,Lyric
0,0.0,Dua Lipa,New Rules,Dua Lipa,2017.0,2017-06-02,one one one one one talkin' in my sleep at n...
1,1.0,Dua Lipa,Don’t Start Now,Future Nostalgia,2019.0,2019-11-01,if you don't wanna see me did a full 80 craz...
2,2.0,Dua Lipa,IDGAF,Dua Lipa,2017.0,2017-06-02,you call me all friendly tellin' me how much y...
3,3.0,Dua Lipa,Blow Your Mind (Mwah),Dua Lipa,2016.0,2016-08-26,i know it's hot i know we've got something tha...
4,4.0,Dua Lipa,Be the One,Dua Lipa,2015.0,2015-10-30,i see the moon i see the moon i see the moon o...


In [3]:
# Data cleaning steps (lowercase, remove special characters, etc.)
df['cleaned_lyrics'] = df['Lyric'].str.lower().str.replace(r'[^\w\s]', '')

In [6]:
# Handle missing values if any (fill with empty string)
df['cleaned_lyrics'].fillna('', inplace=True)

# Tokenization
tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(df['cleaned_lyrics'])

# Save the tokenizer
with open("../models/tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer, f)

# Convert lyrics to sequences and pad them
sequences = tokenizer.texts_to_sequences(df['cleaned_lyrics'])
padded_sequences = pad_sequences(sequences, maxlen=30)

# Save processed sequences
np.save("../data/processed/tokenized_sequences.npy", padded_sequences)

In [8]:
df[['Lyric', 'cleaned_lyrics']].sample(10)

Unnamed: 0,Lyric,cleaned_lyrics
1587,cardi b cardi okurr cardi b i got kim on boa...,cardi b cardi okurr cardi b i got kim on boa...
825,for my nigga hush yeah look dressed in fatig...,for my nigga hush yeah look dressed in fatig...
2345,lyrics for this song have yet to be released p...,lyrics for this song have yet to be released p...
4228,the first shots fired everybody's gathered aro...,the first shots fired everybody's gathered aro...
2115,lady gaga r kelly yeah oh turn the mic up yea...,lady gaga r kelly yeah oh turn the mic up yea...
5969,khalid na na na na ooh oh no oh ayy khalid p...,khalid na na na na ooh oh no oh ayy khalid p...
3090,when i first met you you told me exactly how i...,when i first met you you told me exactly how i...
4885,stuck here in the middle of nowhere with a hea...,stuck here in the middle of nowhere with a hea...
4852,la la la la la la la la la la la la la la la l...,la la la la la la la la la la la la la la la l...
2678,i'm a dime you a nickelette light skinneded pi...,i'm a dime you a nickelette light skinneded pi...


## Summary

In this phase, the song lyrics dataset was prepared for modeling by performing comprehensive cleaning and preprocessing. Below is a summary of the steps taken:

 - **Text Cleaning:**
    - Convert text to lowercase.
    - Remove punctuation and special characters.
    - Remove stopwords and extra spaces.

- **Tokenization and Padding:**
    - Tokenize the lyrics using the Tokenizer class from Keras, converting text into sequences of integers.
    - Pad the sequences to ensure uniform input length.

- **Save Preprocessed Data:**
    - Save tokenized sequences to `data/processed/tokenized_sequences.npy`.
    - Save the tokenizer object as `models/tokenizer.pkl` for future use.
