# COMP550 - Final Project
---

Links:
- https://www.kaggle.com/ficklemaverick/lyrics-generator
- https://www.kaggle.com/danofer/music-lyrics-clean-export

## Table of contents
[1. Imports](#imports)  
[2. Import & Cleaning data and Exploratory Data Analysis](#imports-clean)  
[3. Preprocessing steps](#preprocessing)  
[4. Naïve majority model](#naive-model)   
[5. Logistic Regression](#log-reg)  

# 1. Imports  <a class="anchor" id="imports"></a>

In [None]:
import pandas as pd
import numpy as np
import string
import time
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from langdetect import detect
from scipy.stats import uniform

# nltk imports
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.corpus import stopwords

# sklearn imports
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# 2. Import & Cleaning data and Exploratory Data Analysis   <a class="anchor" id="imports-clean"></a>

**NOTE**: Detecting the language of all the songs takes a long time (about 15 minutes). To skip this step, set `USE_CLEANED_DATA = True` to load `cleaned_data.csv` instead of running all the steps.

In [None]:
USE_CLEANED_DATA = True
cleaned_data_path = "data/cleaned_data.csv"
data_path = "data/lyrics.csv"
data_raw = pd.read_csv(data_path)
print(len(data_raw), "songs in the dataset")
print(data_raw.head())

The dataset has the following columns:
- **index** (int): index of the song in the dataset
- **song** (string): name of the song
- **year** (float) -> (int): release year
- **artist** (string): artist of the song
- **genre** (string): the genre, this is the label we want to predict
- **lyrics** (string): the lyrics of the song. This is the data we will use to predict the genre. We need to preprocess this data.

After removing the rows that contain null values, we are left with **265,701** songs. We also convert the year from float to int.

In [None]:
# Drop every row that has at least one missing value
data_all = data_raw.dropna(how='any', axis=0)
data_all['year'] = pd.to_numeric(data_all['year'], downcast='integer')
data_all['index'] = pd.to_numeric(data_all['index'], downcast='integer')
data_all = data_all.reset_index(drop=True)
data_all

We keep only English songs, detected with the `langdetect` library. Of the 265,701 songs, **237,363 are in English**.

In [None]:
if not USE_CLEANED_DATA:
    en_songs = []
    for song in data_all['lyrics']:
        try:
            lang = detect(song)
            if lang == 'en':
                en_songs.append(True)
            else:
                en_songs.append(False)
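        except Exception:
            # Treat songs whose language cannot be detected as non-English
            en_songs.append(False)
    # reset_index returns a new DataFrame, so the result must be assigned
    data_en = data_all[en_songs].reset_index(drop=True)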
else:
    data_en = pd.read_csv(cleaned_data_path)
    
data_en.head()

In [None]:
data_en['genre'].unique()

We remove songs whose label is "Other" or "Not Available". This reduces the number of songs from 237,363 to **215,825**.

In [None]:
labels = data_en['genre'].tolist()
keep_song = [genre not in ['Not Available', 'Other'] for genre in data_en['genre'].tolist()]
data_en = data_en[keep_song]
data_en = data_en.reset_index(drop=True)
data_en['genre'].value_counts()

Above is the number of songs in each genre category. **46.5% of the songs are Rock songs**.

We create a smaller dataset with 10,000 songs for quick testing, and a medium one with 50,000 songs.
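
As a rough check of the subset sizes, the arithmetic behind the successive splits can be worked out by hand. This is only a sketch: it takes the 215,825 songs reported above as given and assumes that `train_test_split` rounds a fractional test size up to a whole number of samples. The variable names are illustrative.

```python
import math

n_total = 215_825  # songs left after filtering, as reported above

# Assumption: a fractional test_size is rounded up to a whole number of samples
n_test = math.ceil(0.1 * n_total)
n_train_full = n_total - n_test

# The validation set is 10% of what remains after removing the test set
n_valid = math.ceil(0.1 * n_train_full)
n_train = n_train_full - n_valid

# Fixed-size subsets carved out of the training set
n_light = 10_000
n_medium = 50_000

print("test:", n_test)
print("valid:", n_valid)
print("train:", n_train)
print("light share of train: {:.1%}".format(n_light / n_train))
```

The light set is a small fraction of the training data, which is why results obtained on it should be treated as indicative only.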

In [None]:
GENRE_TO_INT = {'Pop':0, 'Hip-Hop':1, 'Rock':2, 'Metal':3, 'Country':4, 'Jazz':5, 'Electronic':6, 'Folk':7, 'R&B':8, 'Indie':9}
lyrics = data_en['lyrics'].tolist()
labels = np.array([GENRE_TO_INT[genre] for genre in data_en['genre'].tolist()])
lyrics_train, lyrics_test, labels_train, labels_test = train_test_split(lyrics, labels, test_size=0.1, shuffle=True, random_state=43, stratify=labels)
lyrics_train, lyrics_valid, labels_train, labels_valid = train_test_split(lyrics_train, labels_train, test_size=0.1, shuffle=True, random_state=43, stratify=labels_train)

# Smaller datasets for quick training
lyrics_other, lyrics_light, labels_other, labels_light = train_test_split(lyrics_train, labels_train, test_size=10000, shuffle=True, random_state=43, stratify=labels_train)

lyrics_other, lyrics_medium, labels_other, labels_medium = train_test_split(lyrics_other, labels_other, test_size=50000, shuffle=True, random_state=43, stratify=labels_other)
print("Light training set length:", len(lyrics_light))
print("Training set length:", len(lyrics_train))
print("Validation set length:", len(lyrics_valid))
print("Test set length:", len(lyrics_test))

In [None]:
lyrics_light[0]

# 3. Preprocessing steps <a class="anchor" id="preprocessing"></a>
The lyrics need to be cleaned before we can use them.
- remove \n line breaks
- remove punctuation
- lowercase the lyrics
- remove verse and chorus indications written in the form [verse x]
- remove tokens shorter than three characters, tokens containing digits, and English stopwords
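
The first steps listed above can be sketched on a toy string. One detail worth noting is the order: bracketed markers are removed before punctuation, because stripping punctuation first would delete the brackets and leave "verse" behind as an ordinary token. `clean_lyrics` is a hypothetical helper written for this illustration, and it leaves out the length, digit and stopword filters.

```python
import re
import string

def clean_lyrics(text):
    """Toy cleaner (hypothetical helper, for illustration only)."""
    text = text.replace("\n", " ")
    # Drop bracketed section markers while the brackets still exist
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Remove punctuation, then lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = text.lower()
    # Keep non-empty tokens only
    return [tok for tok in text.split(" ") if tok]

sample = "[Verse 1]\nHello, world!\n[Chorus]\nSing it LOUD"
tokens = clean_lyrics(sample)
print(tokens)
```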

In [None]:
data = lyrics_light

In [None]:
# Replace line breaks, drop [verse x] / [chorus y] markers, remove punctuation, lowercase,
# then drop tokens shorter than 3 characters, tokens containing digits, and stopwords
import re
STOP_WORDS = set(stopwords.words('english'))

def my_preprocessor(song):
    song = song.replace('\n', ' ')
    # Strip bracketed markers first: removing punctuation would delete the brackets
    song = re.sub(r'\[[^\]]*\]', ' ', song)
    song = song.translate(str.maketrans('', '', string.punctuation))
    song = song.lower()
    song_token = song.split(' ')
    song_token = [w for w in song_token if len(w) >= 3]
    song_token = [w for w in song_token if not any(c.isdigit() for c in w)]
    song_token = [w for w in song_token if (w not in STOP_WORDS and u'\x9d' not in w)]
    song = ' '.join(song_token)
    return song

# tokenize the song
def my_tokenizer(song): 
    tokens = song.split(' ')
    return tokens

# tokenize the song and stem its tokens
def my_tokenizer_stem(song): 
    tokens = song.split(' ') 
    stemmer = PorterStemmer() 
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens

# tokenize the song and lemmatize its tokens
def my_tokenizer_lemma(song):
    song_with_pos = pos_tag(song.split(' '))
    POS_correspondance = {'N': wordnet.NOUN, 'V': wordnet.VERB, 'R': wordnet.ADV, 'J': wordnet.ADJ}
    lemmatizer = WordNetLemmatizer()
    lemmatized_song = [lemmatizer.lemmatize(w[0], POS_correspondance.get(w[1][0], wordnet.NOUN)) for w in song_with_pos]
    return lemmatized_song

print(my_tokenizer(my_preprocessor(data[0]))[:10])
print(my_tokenizer_stem(my_preprocessor(data[0]))[:10])
print(my_tokenizer_lemma(my_preprocessor(data[0]))[:10])

In [None]:
def wm2df(wm, feat_names):
    """Convert a sparse document-term matrix into a DataFrame, one row per document."""
    # create an index for each row
    doc_names = ['Doc{:d}'.format(idx) for idx, _ in enumerate(wm)]
    df = pd.DataFrame(data=wm.toarray(), index=doc_names,
                      columns=feat_names)
    return df

In [None]:
# vectorizer = CountVectorizer(preprocessor=my_preprocessor, tokenizer=my_tokenizer_stem,
#                              max_features=5000, ngram_range=(1,1))
# X = vectorizer.fit_transform(data)
# tokens = vectorizer.get_feature_names()
# print(tokens)
# wm2df(X, tokens)

# 4. Naïve Majority Model  <a class="anchor" id="naive-model"></a>
In this naïve majority model, we predict the genre 'Rock' for every song, since it is the most frequent genre. This first baseline gives a reference point for comparing the results of logistic regression, naive Bayes, ...

In [None]:
# Always predict 'Rock' (the majority genre) for every song
print(classification_report(labels_light, [GENRE_TO_INT['Rock']]*len(labels_light), target_names=list(GENRE_TO_INT.keys())))

The accuracy of our baseline model (which equals its precision on the Rock class) is **47%**.
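
Since the baseline always predicts the same genre, its accuracy is simply the share of the most frequent label. A minimal sketch of that computation follows; the labels are made up for illustration and are not the real genre distribution, and `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# Toy labels (made up for illustration), not the real genre distribution
toy_labels = ["Rock", "Rock", "Pop", "Rock", "Metal", "Pop", "Rock", "Jazz"]
label, acc = majority_baseline_accuracy(toy_labels)
print(label, acc)
```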

# 5. Logistic Regression  <a class="anchor" id="log-reg"></a>
We test different forms of the vectorised data: stemmed, lemmatized, and with no token transformation. Because the vectorization step is slow, we test the hyperparameters of a model AFTER the vectorization is performed.

Solver: liblinear. Although it doesn't support the multinomial loss, it is by far the quickest solver and the only one that converges within 1000 iterations.
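
Without a multinomial loss, a multi-class problem is typically handled one-vs-rest: one binary classifier is trained per genre, and the genre whose classifier gives the highest score is predicted. The sketch below only illustrates that final decision step; the scores are made-up numbers and `predict_one_vs_rest` is a hypothetical helper, not part of scikit-learn.

```python
def predict_one_vs_rest(scores_by_class):
    """Return the class whose binary classifier gives the highest score."""
    return max(scores_by_class, key=scores_by_class.get)

# Hypothetical decision scores for one song (made-up numbers)
scores = {"Pop": -0.4, "Hip-Hop": 1.3, "Rock": 0.9, "Metal": -2.1}
predicted = predict_one_vs_rest(scores)
print(predicted)
```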

Best accuracy, obtained with stemming: **55.9%** {C: 0.1, max_df: 0.7, max_features: 150000, ngram_range: bigram}

### Testing the different solvers
newton-cg: 100 sec (not converged), acc=95/51%  
sag: 100 sec (not converged), acc=94/51%  
saga: 116 sec, acc=88/52%  
lbfgs: sec, acc=  
liblinear: 57 sec, acc=88/51%

The quickest solver is **liblinear**.

In [None]:
def get_model_accuracy(lyrics, labels, vectorizer, classifier, show_accuracy=False):
    """Vectorize once, then return the mean train/validation accuracy over stratified folds."""
    start = time.time()
    # Using k-folds
    NB_FOLDS = 3

    skf = StratifiedKFold(n_splits=NB_FOLDS, shuffle=True)
    # The vocabulary is fitted once on all the lyrics, to avoid re-vectorizing for each fold
    lyrics_vec = vectorizer.fit_transform(lyrics)
    # Indexing with fold indices requires an array, not a list
    labels = np.asarray(labels)

    mean_train_accuracy = 0
    mean_valid_accuracy = 0
    
    for train_index, valid_index in skf.split(lyrics_vec, labels):
        _X_train, _X_valid = lyrics_vec[train_index], lyrics_vec[valid_index]
        _y_train, _y_valid = labels[train_index], labels[valid_index]
        classifier.fit(_X_train, _y_train)
        mean_train_accuracy += accuracy_score(_y_train, classifier.predict(_X_train))
        mean_valid_accuracy += accuracy_score(_y_valid, classifier.predict(_X_valid))

    mean_train_accuracy /= NB_FOLDS
    mean_valid_accuracy /= NB_FOLDS
    
    end = time.time()
    print('Done in ', end - start)

    if show_accuracy:
        print("Training accuracy:", mean_train_accuracy)
        print("Validation accuracy:", mean_valid_accuracy)

    return mean_train_accuracy, mean_valid_accuracy

In [None]:
vectorizer = CountVectorizer()
classifier = LogisticRegression(multi_class='auto', solver='liblinear', max_iter=1000)
# mean_train_accuracy, mean_valid_accuracy = get_model_accuracy(lyrics_light, labels_light, vectorizer, classifier)
# print(mean_train_accuracy, mean_valid_accuracy)

### Grid search
With the small dataset (10,000 songs), we run a grid search over the hyperparameters.

In [None]:
lyrics_light_preprocessed = [my_preprocessor(song) for song in lyrics_light]
# Join the tokens back into strings: CountVectorizer expects text documents, not token lists
lyrics_light_stemmed = [' '.join(my_tokenizer_stem(song)) for song in lyrics_light_preprocessed]
lyrics_light_lemmad = [' '.join(my_tokenizer_lemma(song)) for song in lyrics_light_preprocessed]

In [None]:
# [i/20 for i in range(1, 20)]

In [None]:
# Define a pipeline combining a text feature extractor with a simple classifier
GRID_SEARCH_ON = False
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression(multi_class='auto', solver='liblinear', penalty='l1', max_iter=100)),
])

parameters = {
    'vect__max_df': [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95],
    'vect__max_features': [20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000],
    'vect__ngram_range': [(1, 2)],
    'clf__C': [0.09, 0.1, 0.11],
}

# find the best parameters for both the feature extraction and the classifier
if GRID_SEARCH_ON:
    grid_search = GridSearchCV(pipeline, parameters, cv=3, n_jobs=-1, verbose=1)
    start = time.time()
    grid_search.fit(lyrics_light_stemmed, labels_light)
    end = time.time()
    print("done in %0.3fs" % (end - start))
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

#### Results grid search 
The hyperparameters and ranges we are testing:
- **tokenizer**: [my_tokenizer, my_tokenizer_stem, my_tokenizer_lemma]. The best results are with no modification (surprisingly). Lemmatization: 56.1%; stemming: 56.3%; no token modification: 55.3%.
- **max_df**: range(0.3, 1). A value of 1 always gives the worst results, while values between 0.6 and 0.9 give similar results. There is no single best value; max_df has to be tuned together with the other parameters to find the best combination.
- **max_features**: range(10000, 200000). With bigram models, the best results are with around 100,000 features. With unigram models, the best results are with around 30,000 features.
- **ngram_range**: unigrams, bigrams. Bigram models give slightly better results.
- **C**: range(0.01, 1). The regularization strength (in scikit-learn, C is its inverse) is the most important parameter to fine-tune. A value around 0.1 increases the accuracy by up to 10% compared to a poor choice.
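
To make the vectorizer hyperparameters above more concrete, here is a simplified pure-Python sketch of how `ngram_range`, `max_df` and `max_features` shape a vocabulary. It is an illustration under simplifying assumptions (whitespace tokenization, alphabetical tie-breaking), not scikit-learn's implementation, and the helper names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n_max):
    """All contiguous n-grams of length 1..n_max, joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def build_vocabulary(docs, n_max=2, max_df=1.0, max_features=None):
    """Simplified vocabulary selection (illustrative, not scikit-learn's code)."""
    doc_terms = [ngrams(doc.split(), n_max) for doc in docs]
    # Document frequency counts each term once per document
    doc_freq = Counter(t for terms in doc_terms for t in set(terms))
    term_freq = Counter(t for terms in doc_terms for t in terms)
    # max_df: discard terms present in more than this fraction of documents
    kept = [t for t in term_freq if doc_freq[t] / len(docs) <= max_df]
    # max_features: keep only the most frequent remaining terms
    kept.sort(key=lambda t: (-term_freq[t], t))
    return sorted(kept[:max_features])

docs = ["love me do", "love you so", "hate you so"]  # toy corpus
unigram_vocab = build_vocabulary(docs, n_max=1, max_df=0.5)
bigram_vocab = build_vocabulary(docs, n_max=2, max_df=0.5, max_features=3)
print(unigram_vocab)
print(bigram_vocab)
```

Terms that appear in most documents ("love", "you", "so" in the toy corpus) are dropped by `max_df`, which is the same effect a low `max_df` has on very common lyric words.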

There are many possible combinations. Here is the methodology we follow for the grid search.
1. Try 3 different values for each hyperparameter (min, middle, max) and see which hyperparameters affect the accuracy the most. For example, the tokenizer doesn't change the accuracy much, whereas the regularization strength affects it a lot.
2. For each hyperparameter that doesn't have a big impact, choose the value that gives the highest accuracy. If there is no trend (for example, a value sometimes gives better results and other times worse ones), take the value with the smallest computation time.
3. The value of the regularization strength is the most important hyperparameter to determine. A value around 0.1 is a good choice.
4. Little by little, narrow the ranges of the hyperparameter values, each time starting with the hyperparameter that affects the accuracy the most.
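
The narrowing procedure in the steps above can be sketched for a single hyperparameter. This is a minimal illustration, not the search that was actually run: `toy_score` is a hypothetical stand-in for a validation accuracy that peaks near C = 0.1, and `coarse_to_fine` is a made-up helper.

```python
def coarse_to_fine(score, low, high, rounds=3):
    """Repeatedly evaluate (min, middle, max) and shrink the range around the best."""
    best = None
    for _ in range(rounds):
        mid = (low + high) / 2
        candidates = [low, mid, high]
        best = max(candidates, key=score)
        # Halve the width of the range and centre it on the best candidate
        half_width = (high - low) / 4
        low, high = max(best - half_width, 1e-6), best + half_width
    return best

# Hypothetical stand-in for a validation score that peaks near C = 0.1
def toy_score(c):
    return -(c - 0.1) ** 2

best_c = coarse_to_fine(toy_score, 0.01, 1.0)
print(round(best_c, 3))
```

In practice each call to the score function is a full cross-validated fit, so evaluating only three candidates per round keeps the cost manageable.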

The best model reaches **56.4% accuracy** with the following parameters: **no token modification, C=0.12, max_df=0.6, max_features=120000, bigrams**.

No modification: many words in the lyrics are altered, especially in rap and pop. That may be why stemming and lemmatization give slightly worse results.

Best with no tokenization modification: **55.36%** {C: 0.07, max_df: 0.7, max_features: 100000, ngram_range: bigram}  
Best with stemming: **55.9%** {C: 0.1, max_df: 0.7, max_features: 150000, ngram_range: bigram}  
Best with lemmatization: **55.6%** {C: 0.14, max_df: 0.7, max_features: 150000, ngram_range: bigram}

In [None]:
# grid_search_df = pd.DataFrame.from_dict(grid_search.cv_results_)
# grid_search_df.to_csv("data/result_reglog_liblinear_stem_4.csv", sep=';', decimal=',')

In [None]:
vectorizer = CountVectorizer(tokenizer=my_tokenizer, max_df=0.6, max_features=120000, ngram_range=(1, 2))
classifier = LogisticRegression(multi_class='auto', solver='liblinear', max_iter=1000, C=0.12)
lyrics_train_vec = vectorizer.fit_transform(lyrics_train)
# Reuse the vocabulary fitted on the training set; refitting here would misalign the features
lyrics_valid_vec = vectorizer.transform(lyrics_valid)
classifier.fit(lyrics_train_vec, labels_train)
accuracy_score(labels_valid, classifier.predict(lyrics_valid_vec))
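
A classifier expects the validation features in the same columns as the training features, so the vocabulary has to be learned once on the training set and then reused. The toy vectorizer below (a hypothetical pure-Python class, not `CountVectorizer`) makes this explicit.

```python
class ToyVectorizer:
    """Minimal bag-of-words vectorizer (illustrative, not CountVectorizer)."""

    def fit(self, docs):
        # The vocabulary is learned from the fitting documents only
        words = sorted({w for doc in docs for w in doc.split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        # Words outside the learned vocabulary are ignored
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for w in doc.split():
                if w in self.vocabulary_:
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows

train_docs = ["rock and roll", "rock on"]
valid_docs = ["roll on river"]

vec = ToyVectorizer().fit(train_docs)
train_rows = vec.transform(train_docs)
valid_rows = vec.transform(valid_docs)
print(vec.vocabulary_)
print(valid_rows)
```

The word "river" never appears in the training documents, so it is dropped, and the validation row keeps the same four columns as the training rows.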