In [2]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
%load_ext autoreload
%autoreload 2

In [3]:
import os
import sys

# Append the brainstation_capstone project folder (the parent directory) to the path.
# This lets us import our own modules from brainstation_capstone/utilities.
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from utilities import utils
from utilities.vectorizer_pipeline import VectorizerPipeline

In [4]:
DATA_PATH = utils.get_datapath('data')

**Table of contents**<a id='toc0_'></a>    
- [**2. Transforming Lyrics**](#toc1_)    
- [Vectorizing Lyrics for Classification](#toc2_)    
- [Combining all Vectorizers](#toc3_)    
- [Converting Into LexVec Word Embeddings](#toc4_)    
- [Conclusion](#toc5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[**2. Transforming Lyrics**](#toc0_)

This notebook covers the preliminary work of vectorizing the data. A class was created to fit a vectorizer on the train set and transform the validation and test sets. The class then stores the transformed train, validation and test sets for preliminary modeling. In keeping with an Agile approach, the idea is to try as many vectorizer combinations as possible and then narrow them down to one to tune a model on afterwards.

Specifically, we will look at the following transformations:
- CountVectorizer (n-grams up to length 1, 2 and 3)
- TF-IDF
- Averaging LexVec Embeddings
- OpenAI Ada Embeddings

In [5]:
df = pd.read_csv(DATA_PATH / 'clean_lyrics_english_stem.csv')

In [6]:
display(df.head())
df.shape

Unnamed: 0,song,lyrics,release_year,title,primary_artist,views,cleaned_lyrics,language,log_scaled_views,popular,popularity_three_class,cleaned_lyrics_stem
0,Kendrick-lamar-swimming-pools-drank-lyrics,\n\n[Produced by T-Minus]\n\n[Intro]\nPour up ...,2012,Swimming Pools (Drank),Kendrick-lamar,5589280.0,pour up drank head shot drank sit down drank ...,en,15.536361,1,2,pour drank head shot drank sit drank stand dr...
1,Kendrick-lamar-money-trees-lyrics,\n\n[Produced by DJ Dahi]\n\n[Verse 1: Kendric...,2012,Money Trees,Kendrick-lamar,4592003.0,uh me and my niggas tryna get it ya bish ya b...,en,15.339827,1,2,uh nigga tryna get ya bish ya bish hit hous l...
2,Kendrick-lamar-xxx-lyrics,"\n\n[Intro: Bēkon & Kid Capri]\nAmerica, God b...",2017,XXX.,Kendrick-lamar,4651514.0,america god bless you if its good to you amer...,en,15.352703,1,2,america god bless good america pleas take han...
3,A-ap-rocky-fuckin-problems-lyrics,"\n\n[Chorus: 2 Chainz, Drake & Both (A$AP Rock...",2012,Fuckin’ Problems,A-ap-rocky,7378309.0,i love bad bitches thats my fuckin problem an...,en,15.814055,1,2,love bad bitch that fuckin problem yeah like ...
4,Kendrick-lamar-dna-lyrics,"\n\n[Verse 1]\nI got, I got, I got, I got—\nLo...",2017,DNA.,Kendrick-lamar,5113687.0,i got i got i got i got loyalty got royalty i...,en,15.447431,1,2,got got got got loyalti got royalti insid dna...


(33842, 12)

# <a id='toc2_'></a>[Vectorizing Lyrics for Classification](#toc0_)

Here we prepare a vectorizer pipeline for a CountVectorizer with n-grams of varying lengths, along with TF-IDF. We do this for both the binary and the multi-class problems.
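As a rough illustration of how `ngram_range` changes the feature space, the sketch below fits scikit-learn vectorizers on a tiny made-up corpus (the `toy_lyrics` list is an assumption for illustration and is not drawn from the lyrics data). It omits `max_df` and `min_df`, which would prune rare and very common terms in the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, made up for illustration only.
toy_lyrics = [
    "love you love me",
    "love the night",
    "dance all night",
]

# Unigrams only versus unigrams plus bigrams.
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(toy_lyrics)
uni_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(toy_lyrics)

unigram_vocab = sorted(unigrams.vocabulary_)
bigram_vocab = sorted(uni_bigrams.vocabulary_)

# TF-IDF shares the unigram vocabulary but reweights the counts in each row.
tfidf_matrix = TfidfVectorizer().fit_transform(toy_lyrics)

print(len(unigram_vocab), len(bigram_vocab), tfidf_matrix.shape)
```

Widening the n-gram range only ever adds columns, since every unigram is still present in the `(1, 2)` vocabulary.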

In [9]:
y_popular = df.popular
y_popularity = df.popularity_three_class

y_popular.shape, y_popularity.shape


((33842,), (33842,))

Here we set the targets for the two-class and three-class versions of the Genius page views problem. These were used in the supplementary preliminary modeling. We now create vectorizer pipeline objects that store the vectorizer, `X_train`, `X_validation`, `X_test` and the corresponding y values for each split. More information can be found in `utilities/vectorizer_pipeline.py`. We vectorize for both the two-class and the three-class targets for the preliminary modeling.
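The actual implementation lives in `utilities/vectorizer_pipeline.py` and is not reproduced here. As a minimal sketch of the general idea only, the hypothetical helper below (`split_and_vectorize`, its dictionary keys and the `random_state` value are all assumptions) splits the data roughly 60/20/20, a ratio inferred from the shapes printed by the pipeline, fits the vectorizer on the train split alone, and then transforms the validation and test splits.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split


def split_and_vectorize(vectorizer, X, y, random_state=42):
    """Hypothetical stand-in for VectorizerPipeline (not the author's code)."""
    # Hold out 40% of the data, then halve it into validation and test sets.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.4, random_state=random_state
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=random_state
    )

    # Fit on the train split only, so no vocabulary leaks from validation or test.
    X_train_vec = vectorizer.fit_transform(X_train)
    X_val_vec = vectorizer.transform(X_val)
    X_test_vec = vectorizer.transform(X_test)

    return {
        "vectorizer": vectorizer,
        "X_train": X_train_vec, "y_train": y_train,
        "X_validation": X_val_vec, "y_validation": y_val,
        "X_test": X_test_vec, "y_test": y_test,
    }


# Toy data, made up for illustration only.
toy_X = [f"word{i % 4} love" for i in range(20)]
toy_y = [i % 2 for i in range(20)]

splits = split_and_vectorize(CountVectorizer(), toy_X, toy_y)
print(splits["X_train"].shape, splits["X_validation"].shape, splits["X_test"].shape)
```

Fitting on the train split alone is what keeps the column count identical across the three splits while preventing the validation and test vocabularies from influencing the features.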

In [10]:

for vectorizer_name in [
    'bag_of_words_two_class', 'tf_idf_two_class', '2_grams_two_class', '3_grams_two_class'
    ]:
    X = df.cleaned_lyrics_stem
    
    if vectorizer_name == 'bag_of_words_two_class':
        vectorizer = CountVectorizer(max_df=0.9, min_df=0.01)
    elif vectorizer_name == 'tf_idf_two_class':
        vectorizer = TfidfVectorizer(max_df=0.9, min_df=0.01)
    elif vectorizer_name == '2_grams_two_class':
        vectorizer = CountVectorizer(max_df=0.9, min_df=0.01, ngram_range=(1,2))
    elif vectorizer_name == '3_grams_two_class':
        vectorizer = CountVectorizer(max_df=0.9, min_df=0.01, ngram_range=(1,3))
    
    VectorizerPipeline(
        vectorizer_name, vectorizer, X, y_popular
    ).run_vectorizer_pipeline()

Train shape: (20304, 2212)             
Validation shape: (6769, 2212)             
Test shape: (6769, 2212)
Transformed train test split dumped at /home/jng/projects/brainstation_capstone/vectorizer_data/bag_of_words_two_class/data.pkl as a dictionary.
Train shape: (20304, 2204)             
Validation shape: (6769, 2204)             
Test shape: (6769, 2204)
Transformed train test split dumped at /home/jng/projects/brainstation_capstone/vectorizer_data/tf_idf_two_class/data.pkl as a dictionary.
Train shape: (20304, 2968)             
Validation shape: (6769, 2968)             
Test shape: (6769, 2968)
Transformed train test split dumped at /home/jng/projects/brainstation_capstone/vectorizer_data/2_grams_two_class/data.pkl as a dictionary.
Train shape: (20304, 3008)             
Validation shape: (6769, 3008)             
Test shape: (6769, 3008)
Transformed train test split dumped at /home/jng/projects/brainstation_capstone/vectorizer_data/3_grams_two_class/data.pkl as a dictionary.


In [25]:
for vectorizer_name in [
    'bag_of_words_three_class', 'tf_idf_three_class', '2_grams_three_class', '3_grams_three_class'
    ]:
    X = df.cleaned_lyrics_stem
    
    if vectorizer_name == 'bag_of_words_three_class':
        vectorizer = CountVectorizer(max_df=0.9, min_df=0.01)
    elif vectorizer_name == 'tf_idf_three_class':
        vectorizer = TfidfVectorizer(max_df=0.9, min_df=0.01)
    elif vectorizer_name == '2_grams_three_class':
        vectorizer = CountVectorizer(max_df=0.9, min_df=0.01, ngram_range=(1,2))
    elif vectorizer_name == '3_grams_three_class':
        vectorizer = CountVectorizer(max_df=0.9, min_df=0.01, ngram_range=(1,3))
    
    VectorizerPipeline(
        vectorizer_name, vectorizer, X, y_popularity
    ).run_vectorizer_pipeline()



Train shape: (22743, 2152)
Validation shape: (7581, 2152)
Test shape: (7581, 2152)
Transformed train test split dumped at /home/jng/projects/brainstation_capstone/vectorizer_data/bag_of_words_three_class/data.pkl as a dictionary.
Train shape: (22743, 2143)
Validation shape: (7581, 2143)
Test shape: (7581, 2143)
Transformed train test split dumped at /home/jng/projects/brainstation_capstone/vectorizer_data/tf_idf_three_class/data.pkl as a dictionary.
Train shape: (22743, 2591)
Validation shape: (7581, 2591)
Test shape: (7581, 2591)
Transformed train test split dumped at /home/jng/projects/brainstation_capstone/vectorizer_data/2_grams_three_class/data.pkl as a dictionary.
Train shape: (22743, 2587)
Validation shape: (7581, 2587)
Test shape: (7581, 2587)
Transformed train test split dumped at /home/jng/projects/brainstation_capstone/vectorizer_data/3_grams_three_class/data.pkl as a dictionary.


# <a id='toc3_'></a>[Combining all Vectorizers](#toc0_)

We also combine the 1-to-3-gram CountVectorizer and the TF-IDF vectorizer with a `FeatureUnion`, which gives another representation of the lyrics.
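To make the effect of a `FeatureUnion` concrete, the sketch below runs one on a tiny made-up corpus (the `toy_lyrics` list is an assumption for illustration). It shows that the union simply stacks the columns of each vectorizer side by side, so the combined width is the sum of the individual widths.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Toy corpus, made up for illustration only.
toy_lyrics = ["love you love me", "love the night", "dance all night"]

union = FeatureUnion([
    ("count_vect", CountVectorizer(ngram_range=(1, 3))),
    ("tfidf", TfidfVectorizer()),
])
combined = union.fit_transform(toy_lyrics)

# Fit each vectorizer on its own to compare widths.
n_count = CountVectorizer(ngram_range=(1, 3)).fit_transform(toy_lyrics).shape[1]
n_tfidf = TfidfVectorizer().fit_transform(toy_lyrics).shape[1]

print(combined.shape, n_count, n_tfidf)
```

Because the unigram terms appear in both blocks (once as counts, once as TF-IDF weights), the combined matrix carries some redundant information by construction.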

In [12]:
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

X = df.cleaned_lyrics_stem

# Instantiate a list of tuples - each tuple is the name of the transform + the transformer
vectorizers = [
    ('count_vect', CountVectorizer(max_df=0.9, min_df=0.01, ngram_range=(1,3))), 
    ('tfidf', TfidfVectorizer(max_df=0.9, min_df=0.01 ))
    ]

# Create feature union
featunion = FeatureUnion(vectorizers)

VectorizerPipeline(
    'all_vectorizers_three_class', featunion, X, y_popularity
).run_vectorizer_pipeline()

Train shape: (20304, 5207)             
Validation shape: (6769, 5207)             
Test shape: (6769, 5207)
Transformed train test split dumped at /home/jng/projects/brainstation_capstone/vectorizer_data/all_vectorizers_three_class/data.pkl as a dictionary.


# <a id='toc4_'></a>[Converting Into LexVec Word Embeddings](#toc0_)

Here we average the LexVec word embeddings of the words in each song's lyrics to form a single document embedding.
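The averaging step can be illustrated on a toy scale. The sketch below uses a made-up dictionary of 3-dimensional vectors (`toy_vectors` and `average_embedding` are hypothetical names; the real LexVec vectors are 300-dimensional) and leaves out stop-word filtering for brevity.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, made up for illustration only.
toy_vectors = {
    "love": np.array([1.0, 0.0, 0.0]),
    "night": np.array([0.0, 1.0, 0.0]),
}


def average_embedding(text, vectors, dim=3):
    """Average the vectors of whitespace-separated tokens; unknown tokens count as zeros."""
    rows = [vectors.get(token, np.zeros(dim)) for token in text.split()]
    if not rows:
        # An empty text has no tokens to average, so return a zero vector.
        return np.zeros(dim)
    return np.mean(rows, axis=0)


doc_vector = average_embedding("love night love xyz", toy_vectors)
print(doc_vector)  # [0.5, 0.25, 0.0]
```

Counting out-of-vocabulary tokens as zeros pulls the average towards the origin, so documents with many unknown words end up with smaller-magnitude embeddings.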

In [13]:
from sklearn.feature_extraction import text
stop_words = text.ENGLISH_STOP_WORDS

In [7]:

import gensim

# Instantiate the LexVec Embeddings.
model = gensim.models.KeyedVectors.load_word2vec_format(
    DATA_PATH / 'lexvec-wikipedia-word-vectors', binary=False
)

def lyric2vec(lyric):
    """
    Embed a lyric by averaging the word vectors of the lyrics for each song.
    Stop words are skipped; out-of-vocabulary words are replaced by a zero-vector.
    -----

    Input: lyric (string)
    Output: document embedding vector (np.array of shape (300,))
    """

    # Seed with one zero vector so that an empty lyric still yields a (300,) vector.
    # Note that this seed is included in the mean.
    word_embeddings = [np.zeros(300)]
    # Split on whitespace; iterating over the string directly would yield characters.
    for word in lyric.split():
        # If word is in stop words ignore it.
        if word in stop_words:
            continue
        # if the word is in the model then embed
        elif word in model:
            vector = model[word]
        # add zeros for out-of-vocab words
        else:
            vector = np.zeros(300)

        word_embeddings.append(vector)

    # average the word vectors
    sentence_embedding = np.stack(word_embeddings).mean(axis=0)

    return sentence_embedding


# Average the word embeddings in each lyric to get the document embedding. 
word_embedding_lyrics = [
    lyric2vec(lyric)
    for lyric in df['cleaned_lyrics']
]

final_lexvec = np.array(word_embedding_lyrics)

In [17]:
final_lexvec.shape

(37905, 300)

Once we have all the embeddings, we dump them to disk using `joblib`.

In [18]:
import joblib

LEXVEC_PATH = utils.get_datapath('lexVec_data')

with open(
    LEXVEC_PATH / 'lexVec.pkl',
    'wb'
) as f:
    joblib.dump(final_lexvec, f)
    
    print(f"LexVec data dumped at {LEXVEC_PATH / 'lexVec.pkl'}")
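For later notebooks, the dumped array can be read back with `joblib.load`. The sketch below shows the round trip on a small stand-in array written to a temporary directory (the array and the temporary path are assumptions for illustration; the real file is the `lexVec.pkl` written above).

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np

# Small stand-in for final_lexvec, made up for illustration only.
embeddings = np.arange(6, dtype=float).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = Path(tmp_dir) / "lexVec.pkl"
    joblib.dump(embeddings, pkl_path)
    restored = joblib.load(pkl_path)

print(restored.shape)
```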

# <a id='toc5_'></a>[Conclusion](#toc0_)

Now that we have all our representations, we can move on to some preliminary modeling. This will be a quick first pass to see how various models perform on the vectorized data above.