# COMP550 - Final Project
---

Links:
- https://www.kaggle.com/ficklemaverick/lyrics-generator
- https://www.kaggle.com/danofer/music-lyrics-clean-export

## Table of contents
[1. Imports](#imports)  
[2. Import & Cleaning data and Exploratory Data Analysis](#imports-clean)  
[3. Preprocessing steps](#preprocessing)  
[4. Naïve majority model](#naive-model)   
[5. Logistic Regression](#log-reg)  

# 1. Imports  <a class="anchor" id="imports"></a>

In [None]:
IN_GOOGLE_COLAB = False
root_path = 'data/'
if IN_GOOGLE_COLAB:
    !pip install langdetect
    from google.colab import drive
    drive.mount('/content/drive')
    root_path = 'drive/My Drive/COMP550-Project/data/'

In [None]:
import pandas as pd
import numpy as np
import string
import time
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from langdetect import detect
from scipy.stats import uniform

# nltk imports
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.corpus import stopwords

# sklearn imports
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# 2. Import & Cleaning data and Exploratory Data Analysis   <a class="anchor" id="imports-clean"></a>

**NOTE**: Detecting the language of every song takes a long time (about 15 minutes). To skip this step, we directly import the preprocessed English songs, which are split into 3 sets: training, validation and test.

In [None]:
USE_CLEANED_DATA = True
cleaned_data_path = root_path + "cleaned_data.csv"
data_path = root_path + "lyrics.csv"
data_raw = pd.read_csv(data_path)
print(len(data_raw), "songs in the dataset")
print(data_raw.head())

The dataset has the following columns:
- **index** (int): index of the song in the dataset
- **song** (string): name of the song
- **year** (float) -> (int): release year
- **artist** (string): artist of the song
- **genre** (string): the genre, this is the label we want to predict
- **lyrics** (string): the lyrics of the song. This is the data we will use to predict the genre. We need to preprocess this data.

If we remove the null elements, we are left with **265701** songs. We convert the year from float to int.

In [None]:
# Drop every row that has at least one missing value
data_all = data_raw.dropna(how='any', axis=0)
data_all['year'] = pd.to_numeric(data_all['year'], downcast='integer')
data_all['index'] = pd.to_numeric(data_all['index'], downcast='integer')
data_all = data_all.reset_index(drop=True)
data_all

We keep only English songs, using the `langdetect` library. Of the 265,701 remaining songs, **237,363 are English songs**.

In [None]:
if not USE_CLEANED_DATA:
    en_songs = []
    for song in data_all['lyrics']:
        try:
            lang = detect(song)
            if lang == 'en':
                en_songs.append(True)
            else:
                en_songs.append(False)
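        except Exception:
            # detection can fail (for example on lyrics with no usable text); treat those songs as non-English
            en_songs.append(False)
    data_en = data_all[en_songs]
    # reset_index returns a new DataFrame, so the result must be assigned back
    data_en = data_en.reset_index(drop=True)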
else:
    data_en = pd.read_csv(cleaned_data_path)
    
data_en.head()

In [None]:
data_en['genre'].unique()

We remove songs whose label is "Other" or "Not Available". This reduces the number of songs from 237,363 to **215,825 songs**.

In [None]:
labels = data_en['genre'].tolist()
keep_song = [genre not in ['Not Available', 'Other'] for genre in data_en['genre'].tolist()]
data_en = data_en[keep_song]
data_en = data_en.reset_index(drop=True)
data_en['genre'].value_counts()

Above is the number of songs in each genre category. **46.5% of the songs are Rock songs**.

#### We split the data into training, validation and test sets
The English songs are split into 3 sets.

In [None]:
GENRE_TO_INT = {'Pop':0, 'Hip-Hop':1, 'Rock':2, 'Metal':3, 'Country':4, 'Jazz':5, 'Electronic':6, 'Folk':7, 'R&B':8, 'Indie':9}
lyrics = data_en['lyrics'].tolist()
labels = np.array([GENRE_TO_INT[genre] for genre in data_en['genre'].tolist()])
lyrics_train, lyrics_test, labels_train, labels_test = train_test_split(lyrics, labels, test_size=0.1, shuffle=True, random_state=43, stratify=labels)
lyrics_train, lyrics_valid, labels_train, labels_valid = train_test_split(lyrics_train, labels_train, test_size=0.1, shuffle=True, random_state=43, stratify=labels_train)

# Smaller dataset for quick training
# lyrics_selected and labels_selected contain the 25,000 songs that we consider in our project
# lyrics_train and labels_train contain the 20,000 songs of the training set (80%)
# lyrics_valid and labels_valid contain the 2,500 songs of the validation set (10%)
# lyrics_test and labels_test contain the 2,500 songs of the test set (10%)

tmp_1, lyrics_selected, tmp_2, labels_selected = train_test_split(lyrics_train, labels_train, test_size=25000, shuffle=True, random_state=43, stratify=labels_train)
lyrics_train, lyrics_selected_2, labels_train, labels_selected_2 = train_test_split(lyrics_selected, labels_selected, test_size=5000, shuffle=True, random_state=43, stratify=labels_selected)
lyrics_valid, lyrics_test, labels_valid, labels_test = train_test_split(lyrics_selected_2, labels_selected_2, test_size=2500, shuffle=True, random_state=43, stratify=labels_selected_2)

# print("Light training set length:", len(lyrics_light))
print("Training set length:", len(lyrics_train))
print("Validation set length:", len(lyrics_valid))
print("Test set length:", len(lyrics_test))
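The `stratify=` argument used above keeps the genre proportions roughly the same in every split. As an illustration of the idea only (this is not scikit-learn's implementation), here is a minimal pure-Python sketch; `stratified_split` is a hypothetical helper and the toy labels are made up.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed=43):
    """Return (train_idx, test_idx) with each label's share roughly preserved."""
    rng = random.Random(seed)
    # Group the example indices by label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, test_idx = [], []
    # Sample the same fraction from every label group.
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels (made up): 50% of class 2, 30% of class 0, 20% of class 1.
toy_labels = [2] * 50 + [0] * 30 + [1] * 20
train_idx, test_idx = stratified_split(toy_labels, test_fraction=0.1)
print(Counter(toy_labels[i] for i in test_idx))
```

With a plain random split, a rare genre could end up almost absent from the small validation and test sets; sampling per label avoids that.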

In [None]:
lyrics_train[0]

# 3. Preprocessing steps <a class="anchor" id="preprocessing"></a>
The lyrics need to be cleaned before we can use them:
- remove \n line breaks
- remove punctuation
- lowercase the lyrics
- remove verse and chorus indications of the form [verse x]
- remove tokens shorter than 3 characters (this includes empty tokens)
- remove tokens that contain digits
- remove English stopwords

In [None]:
# replaces line breaks, removes punctuation, sets everything to lowercase
# removes words of length <= 2, [verse x] or [chorus y] indications, and tokens containing digits
# removes stopwords
def my_preprocessor(song):
    song = song.replace('\n', ' ')
    song = song.translate(str.maketrans('', '', string.punctuation))
    song = song.lower()
    song_token = song.split(' ')
    song_token = [w for w in song_token if (len(w) >= 3 and w[0] != '[' and w[-1] != ']')]
    song_token = [w for w in song_token if not any(c.isdigit() for c in w)]
    stop_words = set(stopwords.words('english'))
    song_token = [w for w in song_token if (w not in stop_words and u'\x9d' not in w)]
    song = ' '.join(song_token)
    return song

# tokenize the song
def my_tokenizer(song): 
    tokens = song.split(' ')
    return tokens

# tokenize the song and stem its tokens
def my_tokenizer_stem(song): 
    tokens = song.split(' ') 
    stemmer = PorterStemmer() 
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens

# tokenize the song and lemmatize its tokens (using the POS tag of each token)
def my_tokenizer_lemma(song):
    song_with_pos = pos_tag(song.split(' '))
    POS_correspondance = {'N': wordnet.NOUN, 'V': wordnet.VERB, 'R': wordnet.ADV, 'J': wordnet.ADJ}
    lemmatizer = WordNetLemmatizer()
    lemmatized_song = [lemmatizer.lemmatize(w[0], POS_correspondance.get(w[1][0], wordnet.NOUN)) for w in song_with_pos]
    return lemmatized_song

print(my_tokenizer(my_preprocessor(lyrics_train[2]))[:10])
print(my_tokenizer_stem(my_preprocessor(lyrics_train[2]))[:10])
print(my_tokenizer_lemma(my_preprocessor(lyrics_train[2]))[:10])
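One detail worth checking in `my_preprocessor`: `string.punctuation` contains `[` and `]`, so the brackets are already gone when the bracket test runs, and words such as `verse` or `chorus` can survive. Below is a minimal sketch of one way to drop the markers before stripping punctuation, assuming they always appear as `[...]` on a single line. `strip_section_markers` and `clean_basic` are hypothetical helpers, not part of the notebook, and stopword removal is left out to keep the example self-contained.

```python
import re
import string

# Hypothetical pattern: matches "[verse 1]" / "[chorus]" style markers on one line.
SECTION_MARKER = re.compile(r"\[[^\]\n]*\]")

def strip_section_markers(song):
    """Replace bracketed section markers with a space."""
    return SECTION_MARKER.sub(" ", song)

def clean_basic(song):
    """Simplified cleaning: markers, line breaks, punctuation, case, short/digit tokens."""
    song = strip_section_markers(song)  # must run before punctuation is removed
    song = song.replace("\n", " ")
    song = song.translate(str.maketrans("", "", string.punctuation))
    song = song.lower()
    tokens = [w for w in song.split()
              if len(w) >= 3 and not any(c.isdigit() for c in w)]
    return " ".join(tokens)

example = "[Verse 1]\nHello, darkness my old friend\n[Chorus]\nI've come to talk"
print(clean_basic(example))
```

Applying such a change to the notebook would alter the preprocessed CSV files, so the accuracies reported later would need to be recomputed.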

#### Create CSV files for train, valid and test data
In the rest of the Jupyter notebook we directly import the preprocessed data.  
There are 3 CSV files:
- `train_data.csv` containing 20,000 preprocessed English songs
- `valid_data.csv` containing 2,500 preprocessed English songs
- `test_data.csv` containing 2,500 preprocessed English songs

In [None]:
# lyrics_train_preprocessed = [my_preprocessor(song) for song in lyrics_train]
# lyrics_valid_preprocessed = [my_preprocessor(song) for song in lyrics_valid]
# lyrics_test_preprocessed = [my_preprocessor(song) for song in lyrics_test]
# train_df = pd.DataFrame.from_dict({"lyrics": lyrics_train_preprocessed, "genre": labels_train})
# valid_df = pd.DataFrame.from_dict({"lyrics": lyrics_valid_preprocessed, "genre": labels_valid})
# test_df = pd.DataFrame.from_dict({"lyrics": lyrics_test_preprocessed, "genre": labels_test})
# train_df.to_csv(root_path+"train_data.csv", index=False)
# valid_df.to_csv(root_path+"valid_data.csv", index=False)
# test_df.to_csv(root_path+"test_data.csv", index=False)

train_df = pd.read_csv(root_path + "train_data.csv")
valid_df = pd.read_csv(root_path + "valid_data.csv")
test_df = pd.read_csv(root_path + "test_data.csv")

# 4. Naïve Majority Model  <a class="anchor" id="naive-model"></a>
In this naïve majority model, we guess that every song has the genre 'Rock', which is the most frequent genre. This is a first baseline against which we can compare the results of other models such as logistic regression or naive Bayes.

In [None]:
print(classification_report(train_df['genre'], [2]*len(train_df['lyrics']), target_names=list(GENRE_TO_INT.keys())))

The accuracy of our baseline model is **47%**. Since every song is predicted as 'Rock', this is also the precision for the 'Rock' class.
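The baseline can also be written out by hand, which makes it clear that its accuracy is simply the share of the most frequent label. A minimal sketch, where `majority_baseline` is a hypothetical helper and the toy labels are made up:

```python
from collections import Counter

def majority_baseline(train_labels, eval_labels):
    """Predict the most frequent training label for every evaluation example."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    predictions = [majority_label] * len(eval_labels)
    correct = sum(p == y for p, y in zip(predictions, eval_labels))
    return majority_label, correct / len(eval_labels)

# Toy labels (made up for illustration); 2 stands for 'Rock' as in GENRE_TO_INT.
toy_labels = [2, 2, 0, 2, 1, 2, 0, 3, 2, 4]
label, accuracy = majority_baseline(toy_labels, toy_labels)
print(label, accuracy)
```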

# 5. Logistic Regression  <a class="anchor" id="log-reg"></a>
We test different forms of the vectorized data: stemmed, lemmatized and no token transformation. Vectorizing the data is quite slow, so we decide to test the different hyperparameters of a model AFTER the vectorization is performed.

Solver: liblinear. Although it doesn't support multinomial, it's the quickest solver (by far) and is the only one that converges after 1000 iterations.

#### Best model
Best accuracy: **58.6%**, obtained with no stemming or lemmatization, TF-IDF vectorization, and {C: 2.6, max_df: 0.5, max_features: 210000, ngram_range: bigram, norm: 'l2'}.  
On the development (validation) set: **59.52%**


### Grid search
With a small dataset (10,000 songs) we grid search on the hyperparameters.

In [None]:
train_lyrics = train_df['lyrics']
train_labels = train_df['genre']
lyrics_preprocessed = train_lyrics
lyrics_stemmed = [' '.join(my_tokenizer_stem(song)) for song in train_lyrics]
lyrics_lemmad = [' '.join(my_tokenizer_lemma(song)) for song in train_lyrics]

In [None]:
# [i/10 for i in range(18, 30, 3)]
# [i*1000 for i in range(150, 250, 30)]

In [None]:
# Define a pipeline combining a text feature extractor with a simple classifier
GRID_SEARCH_ON = False
pipeline = Pipeline([
    # ('vect', CountVectorizer()),
    ('vect', TfidfVectorizer()),    
    ('clf', LogisticRegression(multi_class='auto', solver='lbfgs', penalty='l2', max_iter=100)),
])

parameters = {
    'vect__max_df': [0.8],
    'vect__max_features': [210000],
    'vect__ngram_range': [(1,2)],
    'vect__norm': ['l2'],
    'clf__C': [2.5, 2.6, 2.7, 2.8, 2.9],
}

# find the best parameters for both the feature extraction and the classifier
if GRID_SEARCH_ON:
    grid_search = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1, verbose=1)
    start = time.time()
    grid_search.fit(lyrics_lemmad, train_labels)
    end = time.time()
    print("done in %0.3fs" % (end - start))
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

#### Results grid search 
The hyperparameters and ranges we are testing:
- **tokenizer**: [my_tokenizer, my_tokenizer_stem, my_tokenizer_lemma].
- **max_df**: range(0.3, 1)
- **max_features**: range(10000, 200000)
- **ngram_range**: unigrams, bigrams. Bigram models tend to have the best results.
- **C**: range(0.01, 3). The regularization parameter `C` (in scikit-learn, the inverse of the regularization strength) is the most important one to fine-tune. Without TF-IDF, a value around 0.1 increases the accuracy by up to 10% compared to a poor choice. When TF-IDF is on, the value needs to be around 2.
- **TFIDF**: on or off (depends on which vectorizer we use). When turned on, the accuracy is higher.
- **norm**: when TFIDF=on, defines the norm used to scale each row to unit length.
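To make the **TFIDF** and **norm** options concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by L2 row normalization. It assumes raw term counts and the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which scikit-learn documents as its default. `tfidf_l2` is a hypothetical helper and the two toy documents are made up.

```python
import math
from collections import Counter

def tfidf_l2(docs):
    """Return one {term: weight} dict per document, scaled to unit L2 norm."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count multiplied by the smoothed inverse document frequency.
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        rows.append(weights)
    return rows

rows = tfidf_l2(["love love rock", "rock metal"])
for row in rows:
    print(row)
```

In the first toy document, `love` gets a larger weight than `rock` because it is both more frequent in that document and rarer across documents.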

There are many possible combinations. Here is the methodology for the grid search.
0. preprocessing = none
1. Try out 3 different values for each hyperparameter (min, max, middle) and see which parameters change the accuracy the most. For example, the tokenizer doesn't change the accuracy that much, but the regularization strength affects it a lot.
2. For each hyperparameter that doesn't have a big impact, choose the value that gives the highest accuracy. If there is no trend (for example, the hyperparameter sometimes gives a better result with a certain value and other times a worse one), take the value that has the smallest computation time.
3. The value of the regularization strength is the most important hyperparameter to determine. A value around 0.1 is a good choice.
4. Little by little, trim the ranges of the hyperparameter choices, each time taking the one that affects the accuracy the most.
5. Repeat from 0 for preprocessing = stemming, lemmatization
6. Repeat from 0 for TFIDF = on
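Step 1 of this methodology can be sketched as a small sensitivity screen: vary one hyperparameter at a time over its minimum, middle and maximum value, and rank the hyperparameters by how much the score moves. This is only an illustration; `rank_by_sensitivity` is a hypothetical helper and `toy_score` is a made-up stand-in for the cross-validated accuracy.

```python
def rank_by_sensitivity(score_fn, ranges, defaults):
    """Vary one hyperparameter at a time over (min, middle, max) and
    rank hyperparameters by the spread of the resulting scores."""
    spreads = {}
    for name, (low, high) in ranges.items():
        candidates = [low, (low + high) / 2, high]
        # All other hyperparameters stay at their default value.
        scores = [score_fn(**{**defaults, name: value}) for value in candidates]
        spreads[name] = max(scores) - min(scores)
    return sorted(spreads.items(), key=lambda item: item[1], reverse=True)

# Made-up score function standing in for cross-validated accuracy.
def toy_score(C, max_df):
    return 0.55 - 0.05 * abs(C - 0.1) - 0.005 * abs(max_df - 0.7)

ranges = {"C": (0.01, 3.0), "max_df": (0.3, 1.0)}
defaults = {"C": 1.0, "max_df": 0.7}
ranking = rank_by_sensitivity(toy_score, ranges, defaults)
print(ranking)
```

The hyperparameters at the top of the ranking are the ones whose ranges are worth trimming first (step 4); those at the bottom can be fixed early (step 2).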

**TFIDF=off**  
Best with no tokenization modification: **55.7%** {C: 0.07, max_df: 0.7, max_features: 100000, ngram_range: bigram}  
Best with stemming: **55.9%** {C: 0.1, max_df: 0.7, max_features: 150000, ngram_range: bigram}  
Best with lemmatization: **55.5%** {C: 0.14, max_df: 0.7, max_features: 150000, ngram_range: bigram}

**TFIDF=on**  
Best with no tokenization modification: **58.6%** {C: 2.6, max_df: 0.5, max_features: 210000, ngram_range: bigram, norm: 'l2'}  
Best with stemming: **58.2%** {C: 2.2, max_df: 0.5, max_features: 25000, ngram_range: bigram, norm: 'l2'}  
Best with lemmatization: **58.5%** {C: 2.8, max_df: 0.8, max_features: 210000, ngram_range: bigram, norm: 'l2'}  

In [None]:
# grid_search only exists when the grid search cell above was actually run
if GRID_SEARCH_ON:
    grid_search_df = pd.DataFrame.from_dict(grid_search.cv_results_)
    grid_search_df.to_csv(root_path+"result_reglog_tfidf_preprocessed_6.csv", sep=';', decimal=',')

#### We compute the accuracy of the best model on the validation set

In [None]:
vectorizer = TfidfVectorizer(max_df=0.8, max_features=210000, ngram_range=(1, 2), norm='l2')
classifier = LogisticRegression(multi_class='auto', solver='lbfgs', penalty='l2', C=2.8, max_iter=1000)
lyrics_train_vec = vectorizer.fit_transform(train_df['lyrics'])
lyrics_valid_vec = vectorizer.transform(valid_df['lyrics'])
classifier.fit(lyrics_train_vec, train_df['genre'])
accuracy_score(valid_df['genre'], classifier.predict(lyrics_valid_vec))