<a href="https://colab.research.google.com/github/OlgaSeleznova/ML_toolbox/blob/main/Lyrics_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#install modules
import sys
sys.path.insert(0,'/ML_toolbox/NLP_preprocessing')

! pip install langid
! pip install NLP_preprocessing

In [None]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import os
import re
from ML_toolbox.NLP_preprocessing import English_preprocess   #custom module for English preprocessing
import langid   # package to identify language of lyrics


In [None]:
# load data
from google.colab import drive
drive.mount('/content/drive/')

lyrics = pd.read_parquet('/content/drive/My Drive/ML_toolbox/data/metrolyrics.parquet')
lyrics.shape

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


(49976, 8)

In [None]:
# identify language of the lyric
lyrics['language'] = lyrics['lyrics'].apply(lambda x: langid.classify(x)[0])
lyrics['language'].value_counts()[:10]

en    46332
es     1381
de      668
fr      277
it      269
pt      123
sw      118
fi       84
no       84
sv       53
Name: language, dtype: int64

English accounts for 46,332 of the 49,976 lyrics (about 93%), and text classification for other languages would have to be prepared separately, so we will start with English only.

In [None]:
lyric_en = lyrics[lyrics['language'] == 'en'].copy()
lyric_en.shape

(46332, 9)

Next, I will use the English preprocessing class from this repository to clean the data.

In [None]:
# initialize an object from custom class for English preprocessing
cleaner = English_preprocess.English_preprocessing()
# (pattern, replacement) pairs for custom regex cleaning:
# replace every non-letter character with a space, then collapse runs of whitespace
to_replace = [(r'[^a-zA-Z]',' '), (r'(\s+)',' ')]
# clean lyrics with regex
lyric_en['lyrics_cleaned'] = lyric_en['lyrics'].apply(lambda x: cleaner.regex_cleaner(x,to_replace))
# tokenize lyrics, remove stopwords and join the tokens back into a string
lyric_en['lyrics_cleaned'] = lyric_en['lyrics_cleaned'].apply(lambda x: ' '.join(cleaner.remove_stopwords(x)))
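
The two regex pairs can be tried out on a single string without the custom class. This is a minimal sketch that assumes `regex_cleaner` applies each (pattern, replacement) pair in order with `re.sub`; the helper `apply_regex_pairs` and the sample text are hypothetical and are not part of the repository.

```python
import re

# Same pairs as in the notebook: non-letters -> space, then collapse whitespace.
to_replace = [(r'[^a-zA-Z]', ' '), (r'(\s+)', ' ')]

def apply_regex_pairs(text, pairs):
    """Hypothetical stand-in for regex_cleaner: apply each pair in order."""
    for pattern, replacement in pairs:
        text = re.sub(pattern, replacement, text)
    return text

sample = "It's 3 a.m.,\nI can't sleep!"
cleaned = apply_regex_pairs(sample, to_replace)
print(repr(cleaned))  # 'It s a m I can t sleep '
```

Under this assumption, apostrophes become spaces, so contractions such as "can't" are split into "can" and "t", case is preserved, and a trailing space can remain.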

In [None]:
# encode genre names as integer labels
genre_str_to_int = {'Rock':0, 'Pop':1, 'Hip-Hop':2, 'Metal':3, 'Country':4}
lyric_en['genre'] = lyric_en['genre'].replace(genre_str_to_int)

# rename columns to more suitable names
lyric_en = lyric_en.rename(columns={'lyrics_cleaned':'text','genre':'label'})
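
`replace` leaves any value that is missing from the dictionary unchanged, so a genre outside the five listed would silently remain a string. The sketch below shows one way to check for such gaps with `Series.map`, which returns NaN for missing keys. The toy data, the name `genre_to_int` and the genre 'Jazz' are made up for illustration; whether the real dataset contains other genres is not known from this notebook.

```python
import pandas as pd

genre_to_int = {'Rock': 0, 'Pop': 1, 'Hip-Hop': 2, 'Metal': 3, 'Country': 4}

# Toy data; 'Jazz' is a hypothetical genre that is absent from the mapping.
toy = pd.DataFrame({'genre': ['Rock', 'Pop', 'Jazz', 'Metal']})

# .map() yields NaN for keys missing from the dict, which makes gaps easy to spot.
encoded = toy['genre'].map(genre_to_int)
unmapped = sorted(toy.loc[encoded.isna(), 'genre'].unique())
print(unmapped)
```

An empty list means every genre received an integer label.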

Now that the lyrics are cleaned, we can divide the data into train, validation and test sets.

In [None]:
train_ratio = 0.75
validation_ratio = 0.15
test_ratio = 0.10

# train is now 75% of the entire data set
x_train, x_test, y_train, y_test = train_test_split(lyric_en['text'], lyric_en['label'], test_size=1 - train_ratio, random_state=42)

# test is now 10% of the initial data set
# validation is now 15% of the initial data set
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio), random_state=42) 

print(x_train.shape, x_val.shape, x_test.shape)

(34749,) (6949,) (4634,)
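
The sizes above follow from the two ratios passed to `train_test_split`: the first call holds out 1 - 0.75 = 0.25 of the data, and the second call sends 0.10 / (0.10 + 0.15) = 0.4 of that holdout to the test set. The arithmetic can be sketched without scikit-learn. The function name is hypothetical, and rounding the held-out portion up is an assumption about how fractional sizes are resolved.

```python
import math

def two_stage_split_sizes(n_samples, train_ratio, validation_ratio, test_ratio):
    """Return (train, validation, test) sizes for the two-stage split."""
    # Stage 1: hold out everything that is not training data.
    holdout = math.ceil((1 - train_ratio) * n_samples)
    n_train = n_samples - holdout
    # Stage 2: the test share is relative to the holdout, not to the full set.
    second_test_size = test_ratio / (test_ratio + validation_ratio)
    n_test = math.ceil(second_test_size * holdout)  # assumed rounding: up
    n_val = holdout - n_test
    return n_train, n_val, n_test

print(two_stage_split_sizes(46332, 0.75, 0.15, 0.10))  # (34749, 6949, 4634)
```

With 46,332 English lyrics this reproduces the shapes printed by the notebook.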


Save the datasets to the data/eng_lyrics folder.

In [None]:
def create_datasets(x, y, file_name):
    df = pd.concat([x, y], axis=1, ignore_index=False, sort=False)
    print('Dataset shape: ',df.shape)
    df.to_csv('./data/eng_lyrics/' + file_name, index=False)

In [None]:
# -p creates parent folders as needed and does not fail if they already exist
! mkdir -p data/eng_lyrics/

In [None]:
create_datasets(x_train, y_train, 'train.csv')
create_datasets(x_val, y_val, 'valid.csv')
create_datasets(x_test, y_test, 'test.csv')

Dataset shape:  (34749, 2)
Dataset shape:  (6949, 2)
Dataset shape:  (4634, 2)
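
To confirm that a saved file can be read back with the expected columns, a round-trip check along these lines may help. It is a sketch on toy data written to a temporary folder; `save_split` is a hypothetical variant of `create_datasets` that also creates the output folder, and the sample rows are invented.

```python
import os
import tempfile

import pandas as pd

def save_split(x, y, out_dir, file_name):
    """Join features and labels column-wise and write them to out_dir/file_name."""
    os.makedirs(out_dir, exist_ok=True)  # no error if the folder already exists
    df = pd.concat([x, y], axis=1)
    path = os.path.join(out_dir, file_name)
    df.to_csv(path, index=False)
    return path

# Toy split standing in for x_train / y_train.
x_toy = pd.Series(['some cleaned lyric', 'another lyric'], name='text')
y_toy = pd.Series([0, 3], name='label')

with tempfile.TemporaryDirectory() as tmp:
    path = save_split(x_toy, y_toy, os.path.join(tmp, 'eng_lyrics'), 'train.csv')
    restored = pd.read_csv(path)

print(restored.shape, list(restored.columns))
```

Because `index=False` is used when writing, the restored frame contains only the text and label columns.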
