# WORD2VEC - Exploration with SongLyrics.csv (English-language songs) and comparison with Google's pre-trained model

### **In this notebook we do not remove stopwords yet, so the results will include conjunctions and words that share the same root. (We will remove stopwords in the following notebooks for comparison.)**

Depending on the words we query, the model trained on songlyrics.csv works quite well, in some cases almost as well as Google's. This happens with words that occur very frequently in English-language songs, since the song corpus is fairly large.

In [1]:
import pandas as pd
import numpy as np
import gensim.models.word2vec as w2v
import multiprocessing
import os
import re
import pprint
import sklearn.manifold
import matplotlib.pyplot as plt
from gensim.test.utils import datapath

from keras.utils import get_file
import gensim
import subprocess
from IPython.core.pylabtools import figsize

Using TensorFlow backend.


Although non-English artists were removed, the dataset still contained Hindi lyrics by Lata Mangeshkar written in the Latin alphabet. I therefore removed all songs sung by her.

In [2]:
songs = pd.read_csv("data/songdata.csv", header=0)
#songs.head()
songs = songs[songs.artist != 'Lata Mangeshkar']
songs.head()

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


To train the word2vec model, we first need to build its vocabulary. To do that, I tokenised each song and appended its list of tokens to a corpus list that is later fed to the model.

In [3]:
import nltk
text_corpus = []
# Split on runs of word characters, which also removes punctuation.
# The tokenizer is created once instead of once per song.
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
for song in songs['text']:
    lower_case = song.lower()
    tokens_sin_puntuacion = tokenizer.tokenize(lower_case)
    text_corpus.append(tokens_sin_puntuacion)


# Dimensionality of the resulting word vectors.
# More dimensions are more computationally expensive to train,
# but can capture finer distinctions between words.
num_features = 50
# Minimum word count threshold.
min_word_count = 1

# Number of threads to run in parallel.
#more workers, faster we train
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 7


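# Downsampling threshold for very frequent words (gensim's `sample`).
# Note: gensim's documentation gives (0, 1e-5) as the useful range, so 1e-1 downsamples very little.
downsampling = 1e-1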

# Seed for the random number generator, to make the results more reproducible.
# Note: per gensim's documentation, a fully deterministic run also requires workers=1.
seed = 1

songs2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)

songs2vec.build_vocab(text_corpus)
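
To make the tokenisation step in the cell above concrete, here is a minimal sketch that uses only the standard library. It assumes that `re.findall(r"\w+", ...)` splits text the same way as `RegexpTokenizer(r'\w+')` does; the helper name `tokenize_lyrics` is hypothetical and not part of the notebook.

```python
import re

def tokenize_lyrics(text):
    """Lowercase the text and return its runs of word characters."""
    return re.findall(r"\w+", text.lower())

# Punctuation and line breaks are dropped; contractions split at the apostrophe.
tokens = tokenize_lyrics("Hush now, don't explain \nJust say you'll remain")
print(tokens)
```

Because the apostrophe is not a word character, contractions are split into separate tokens, which is why fragments such as `re`, `t` or `ll` can show up as words in the vocabulary.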

In [4]:
print (len(text_corpus))

57618


In [5]:
import time
start_time = time.time()



songs2vec.train(text_corpus, total_examples=songs2vec.corpus_count, epochs=2)

if not os.path.exists("trained"):
    os.makedirs("trained")

songs2vec.save(os.path.join("trained", "songs2vectors.w2v"))

print("--- %s seconds ---" % (time.time() - start_time))

--- 67.89261651039124 seconds ---


In [6]:
songs2vec = w2v.Word2Vec.load(os.path.join("trained", "songs2vectors.w2v"))

#### Let's explore our model

Find similar words

In [51]:
songs2vec.wv.most_similar("kiss")

[('pretty', 0.9923005700111389),
 ('fucker', 0.9919766187667847),
 ('save', 0.9914255142211914),
 ('problem', 0.9912900924682617),
 ('wall', 0.9912290573120117),
 ('smile', 0.9907150268554688),
 ('tone', 0.9904527068138123),
 ('lets', 0.9903568029403687),
 ('north', 0.9903275966644287),
 ('chain', 0.9901459217071533)]

In [52]:
songs2vec.wv.most_similar("forever")

[('town', 0.9875481724739075),
 ('sky', 0.9823529124259949),
 ('hot', 0.9815598726272583),
 ('crazy', 0.9811124801635742),
 ('smile', 0.9806676506996155),
 ('drake', 0.9804208874702454),
 ('lonely', 0.9802340269088745),
 ('problem', 0.980161726474762),
 ('word', 0.9800530672073364),
 ('flex', 0.9798208475112915)]

In [53]:
songs2vec.wv.most_similar("time")

[('mind', 0.9782500267028809),
 ('take', 0.9778951406478882),
 ('only', 0.977516770362854),
 ('off', 0.9768518805503845),
 ('will', 0.9765108227729797),
 ('mad', 0.9749409556388855),
 ('still', 0.9745891094207764),
 ('say', 0.9728189706802368),
 ('let', 0.9726420044898987),
 ('world', 0.9725066423416138)]

In [54]:
songs2vec.wv.most_similar("love")

[('more', 0.9298206567764282),
 ('girl', 0.9220150709152222),
 ('again', 0.9189052581787109),
 ('please', 0.9185583591461182),
 ('give', 0.9172383546829224),
 ('feel', 0.911567211151123),
 ('life', 0.9105323553085327),
 ('god', 0.9091551899909973),
 ('re', 0.9052343368530273),
 ('all', 0.9046429991722107)]

In [55]:
songs2vec.wv.most_similar("fuck")

[('tha', 0.9367462396621704),
 ('stop', 0.9365383982658386),
 ('give', 0.9308817386627197),
 ('black', 0.9265210032463074),
 ('life', 0.9201059341430664),
 ('business', 0.9194768667221069),
 ('please', 0.9184359312057495),
 ('go', 0.914078950881958),
 ('true', 0.9127885699272156),
 ('bitch', 0.9108496308326721)]
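
The scores returned by `most_similar` are cosine similarities between word vectors. The following is a minimal sketch of that ranking, assuming hypothetical 3-dimensional toy vectors rather than the trained 50-dimensional ones; the names `cosine_similarity`, `toy_vectors` and `rank_similar` are illustrative only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from the trained model.
toy_vectors = {
    "kiss": [1.0, 0.9, 0.1],
    "smile": [0.9, 1.0, 0.2],
    "wall": [-0.2, 0.1, 1.0],
}

def rank_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(rank_similar("kiss", toy_vectors))
```

A score close to 1 means the two vectors point in almost the same direction. Scores around 0.99 for seemingly unrelated words, as in some of the lists above, may indicate that the vectors have not separated much after only two training epochs.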

Words out of context

In [56]:
songs2vec.wv.doesnt_match("happiness love joy hate".split())


'love'

In [57]:
songs2vec.wv.doesnt_match("fun funny enjoy hate".split())

'hate'

In [58]:
songs2vec.wv.doesnt_match("people man woman song".split())

'man'
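
As a rough sketch of the idea behind `doesnt_match`: to my understanding, it returns the word whose unit vector is least similar to the mean direction of the group. The code below illustrates that idea with hypothetical 2-dimensional toy vectors; `unit`, `odd_one_out` and `toy_vectors` are illustrative names, not gensim functions.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word least similar to the mean direction of the group."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    similarity = {w: sum(a * b for a, b in zip(u, mean))
                  for w, u in zip(words, units)}
    return min(similarity, key=similarity.get)

# Hypothetical toy vectors, chosen only for illustration.
toy_vectors = {
    "happiness": [1.0, 0.1],
    "love": [0.9, 0.2],
    "joy": [1.0, 0.0],
    "hate": [-0.8, 0.6],
}

print(odd_one_out("happiness love joy hate".split(), toy_vectors))
```

With well-separated vectors the outlier is the word pointing away from the rest. When all vectors are nearly parallel, the choice becomes close to arbitrary, which may explain answers such as `'love'` above.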

In [59]:
songs2vec.wv.most_similar(positive=['woman', 'king'], negative=['man'])
# expected answer: queen

[('salesman', 0.8610643744468689),
 ('preached', 0.8579860925674438),
 ('drift', 0.8547410368919373),
 ('cachapas', 0.8542771935462952),
 ('highs', 0.8539788722991943),
 ('llevármelos', 0.8534889221191406),
 ('mounted', 0.8522471189498901),
 ('baldwin', 0.8517227172851562),
 ('rakim', 0.8517205119132996),
 ('wages', 0.850874125957489)]

Semantic distance between words

In [60]:
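def nearest_similarity_cosmul(start1, end1, end2):
    # Find start2 such that "start1 is to end1 as start2 is to end2".
    similarities = songs2vec.wv.most_similar_cosmul(
        positive=[end2, start1],
        negative=[end1]
    )
    start2 = similarities[0][0]
    # The message is printed in Spanish: "<start1> is to <end1> as <start2> is to <end2>".
    print("{0} es a {1}, lo que {2} es a {3}".format(start1, end1, start2, end2))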

In [61]:
nearest_similarity_cosmul("paris", "france", "alabama")

paris es a france, lo que devoción es a alabama
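
The analogy query can be sketched without the trained model. The code below assumes the multiplicative combination (often called 3CosMul) that, to my understanding, `most_similar_cosmul` uses: similarities to the positive words are multiplied and divided by the similarity to the negative word. The toy vectors and the names `cosine` and `analogy_cosmul` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy_cosmul(start1, end1, end2, vectors):
    """Find start2 such that start1 : end1 :: start2 : end2."""
    def shifted(word, other):
        # Map the cosine from [-1, 1] to [0, 1] so the ratio stays non-negative.
        return (1 + cosine(vectors[word], vectors[other])) / 2

    def score(word):
        # Reward similarity to end2 and start1, penalise similarity to end1.
        return shifted(word, end2) * shifted(word, start1) / (shifted(word, end1) + 1e-6)

    # The query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in (start1, end1, end2)]
    return max(candidates, key=score)

# Hypothetical axes: [capital-ness, France-ness, Alabama-ness].
toy_vectors = {
    "paris": [1.0, 1.0, 0.0],
    "france": [0.0, 1.0, 0.0],
    "alabama": [0.0, 0.0, 1.0],
    "montgomery": [1.0, 0.0, 1.0],
    "lyon": [0.2, 1.0, 0.0],
}

print(analogy_cosmul("paris", "france", "alabama", toy_vectors))
```

In this toy space the query behaves as intended because the "capital" direction is shared by `paris` and `montgomery`. A model trained briefly on lyrics has little reason to learn such a direction, so unrelated answers are to be expected.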


## COMPARISON WITH GOOGLE'S PRE-TRAINED MODEL

#### I will try different words and similarities with both models

In [None]:
MODEL = 'GoogleNews-vectors-negative300.bin'
path = get_file(MODEL + '.gz', 'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz')
#path = get_file(MODEL + '.gz', 'https://deeplearning4jblob.blob.core.windows.net/resources/wordvectors/%s.gz' % MODEL)
# Create the data directory if it does not exist yet.
if not os.path.isdir('data'):
    os.mkdir('data')


# Decompress the archive only once; the compressed file must be read in binary mode.
if not os.path.isfile(MODEL):
    with open(MODEL, 'wb') as fout:
        zcat = subprocess.Popen(['zcat'],
                          stdin=open(path, 'rb'),
                          stdout=fout
                         )
        zcat.wait()
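
`zcat` is a Unix tool and may not be available on Windows, which the local path used later in this notebook suggests. A portable alternative is sketched below with the standard library's `gzip` and `shutil` modules. The helper name `gunzip_file` is hypothetical, and the demo runs on a tiny temporary archive that stands in for the real download.

```python
import gzip
import os
import shutil
import tempfile

def gunzip_file(src_path, dst_path):
    """Decompress a .gz file to dst_path, streaming it in chunks."""
    with gzip.open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)

# Demo on a small temporary archive (a stand-in for the model file).
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "demo.bin.gz")
    out_path = os.path.join(tmp, "demo.bin")
    with gzip.open(gz_path, "wb") as f:
        f.write(b"word2vec demo bytes")
    gunzip_file(gz_path, out_path)
    with open(out_path, "rb") as f:
        restored = f.read()

print(restored)
```

Streaming with `shutil.copyfileobj` avoids loading the whole file into memory, which matters for an archive as large as the Google News vectors.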

In [None]:
path = "C:/Users/jhern/Jupyter Notebooks/Analisis Datos no Estructurados/NLP/data/GoogleNews-vectors-negative300.bin"

In [None]:
model = gensim.models.KeyedVectors.load_word2vec_format(datapath(path), binary=True)  # path must point to the unzipped .bin file

In [62]:
#PRE-TRAINED GOOGLE-MODEL
model.most_similar(positive=['beer'])

[('beers', 0.8409688472747803),
 ('lager', 0.7733745574951172),
 ('Beer', 0.71753990650177),
 ('drinks', 0.668931245803833),
 ('lagers', 0.6570085883140564),
 ('Yuengling_Lager', 0.6554553508758545),
 ('microbrew', 0.6534324884414673),
 ('Brooklyn_Lager', 0.6501551270484924),
 ('suds', 0.6497017741203308),
 ('brewed_beer', 0.6490240097045898)]

In [63]:
songs2vec.wv.most_similar("beer")

[('yeeeee', 0.9960613250732422),
 ('security', 0.9958637952804565),
 ('cover', 0.9957069158554077),
 ('prince', 0.99549400806427),
 ('mack', 0.9953112006187439),
 ('trifle', 0.9950705170631409),
 ('foul', 0.9950664639472961),
 ('malibu', 0.994986891746521),
 ('soldier', 0.9948619604110718),
 ('uber', 0.9948562979698181)]

In [64]:
#PRE-TRAINED GOOGLE-MODEL
model.most_similar(positive=['rich'])

[('Melamine_nitrogen', 0.6186584234237671),
 ('richer', 0.6159480810165405),
 ('wealthy', 0.5974444150924683),
 ('Scicasts_Resource_Library', 0.5795599222183228),
 ('friend_Francie_Vos', 0.5740315318107605),
 ('Autonomy_Virage_visionary', 0.561326265335083),
 ('richest', 0.5493366122245789),
 ('Hunton_liquids', 0.5271087288856506),
 ('wealthiest', 0.507983922958374),
 ('fabulously_rich', 0.4980970025062561)]

In [65]:
songs2vec.wv.most_similar("rich")

[('nothing', 0.9964667558670044),
 ('cangt', 0.9956928491592407),
 ('reyez', 0.9956657886505127),
 ('lose', 0.9956492185592651),
 ('act', 0.9956234097480774),
 ('ass', 0.9955257773399353),
 ('hold', 0.9955092668533325),
 ('live', 0.9953776597976685),
 ('remember', 0.9952360987663269),
 ('talkin', 0.995000422000885)]

In [66]:
#PRE-TRAINED GOOGLE-MODEL
model.most_similar(positive=['america'])

[('american', 0.7169356346130371),
 ('americans', 0.7042055130004883),
 ('europe', 0.6617692708969116),
 ('usa', 0.6611838340759277),
 ('texas', 0.6593319177627563),
 ('india', 0.6589399576187134),
 ('africa', 0.6377725601196289),
 ('mexico', 0.6325021982192993),
 ('england', 0.6323367357254028),
 ('obama', 0.6311532855033875)]

In [67]:
songs2vec.wv.most_similar("america")

[('puff', 0.9826692342758179),
 ('bird', 0.9803715944290161),
 ('trump', 0.978955090045929),
 ('whats', 0.9789404273033142),
 ('superstar', 0.9779247641563416),
 ('rule', 0.9778056144714355),
 ('problems', 0.9774754047393799),
 ('seans', 0.9772520065307617),
 ('nelson', 0.9766177535057068),
 ('chorus', 0.9764165282249451)]

In [68]:
#PRE-TRAINED GOOGLE-MODEL
def A_is_to_B_as_C_is_to(a, b, c, topn=1):
    a, b, c = map(lambda x:x if type(x) == list else [x], (a, b, c))
    res = model.most_similar(positive=b + c, negative=a, topn=topn)
    if len(res):
        if topn == 1:
            return res[0][0]
        return [x[0] for x in res]
    return None

A_is_to_B_as_C_is_to('man', 'woman', 'king')

'queen'

In [69]:
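# Note: this call computes man - king + woman. The query that matches the Google
# example above would be nearest_similarity_cosmul("king", "man", "woman").
nearest_similarity_cosmul("man", "king", "woman")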

man es a king, lo que never es a woman


In [70]:
#PRE-TRAINED GOOGLE-MODEL
A_is_to_B_as_C_is_to('hi', 'bye', 'hello')

'byes'

In [71]:
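# Note: this call computes hi - bye + hello, whereas the Google query above computes bye + hello - hi.
nearest_similarity_cosmul('hi', 'bye', 'hello')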

hi es a bye, lo que nire es a hello


## 2D AND 3D CLUSTERING OF SONG2VEC

### COMPUTING THE NORMALISED SUM VECTOR

With the word vector embeddings in place, it is now time to calculate the normalised vector sum of each song. This process can take some time since it has to be done for each of the roughly 57,000 songs.

#### FIRST WE CREATE A DATAFRAME COLUMN CONTAINING THE CLEAN LYRICS WITHOUT PUNCTUATION

In [None]:
lyrics_clean=[]
for row in text_corpus:
    lyrics_clean.append(' '.join(row))
    
songs['lyrics_clean']=lyrics_clean

In [None]:
import sklearn.preprocessing

def songVector(row):
    # Sum the word vectors of every token in the song, then L2-normalise the sum.
    vector_sum = 0
    words = row.lower().split()
    for word in words:
        vector_sum = vector_sum + songs2vec.wv[word]
    vector_sum = vector_sum.reshape(1,-1)
    normalised_vector_sum = sklearn.preprocessing.normalize(vector_sum)
    return normalised_vector_sum


import time
start_time = time.time()

songs['song_vector'] = songs['lyrics_clean'].apply(songVector)
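
The normalised vector sum can be sketched without the trained model. The toy embeddings and the name `song_vector` below are hypothetical. Unlike the cell above, this sketch skips unknown words and returns `None` for lyrics without any known word, which is an assumption about how such cases could be handled (with `min_count=1` every corpus word is in the vocabulary, so the original code does not need this).

```python
import math

# Hypothetical 3-dimensional embeddings standing in for songs2vec.wv.
toy_embeddings = {
    "love": [0.5, 0.1, 0.0],
    "me": [0.1, 0.3, 0.2],
    "do": [0.0, 0.2, 0.4],
}

def song_vector(lyrics, embeddings):
    """Sum the vectors of all known words and scale the sum to unit length."""
    known = [embeddings[w] for w in lyrics.lower().split() if w in embeddings]
    if not known:
        return None  # no known words, so there is nothing to normalise
    total = [sum(component) for component in zip(*known)]
    norm = math.sqrt(sum(x * x for x in total))
    if norm == 0:
        return total  # a zero vector cannot be scaled to unit length
    return [x / norm for x in total]

vec = song_vector("Love me do", toy_embeddings)
print(vec)
```

Because every song vector has unit length, long and short songs become comparable: only the direction of the summed vector matters, not the number of words.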




**t-SNE and random song selection**

Each song vector has 50 dimensions. t-SNE is memory intensive, so it is easier on the computer to run it on a random sample of the roughly 57,000 songs (10% of them here).

In [12]:
song_vectors = []
from sklearn.model_selection import train_test_split

# Keep a random 10% of the songs (the `train` split) for t-SNE.
train, test = train_test_split(songs, test_size = 0.9)


for song_vector in train['song_vector']:
    song_vectors.append(song_vector)

train.head(10)

Unnamed: 0,artist,song,link,text,lyrics_clean,song_vector
20329,Usher,OMG,/u/usher/omg_20877485.html,"Oh my gosh \nBaby let me \nDid it again, so ...",oh my gosh baby let me did it again so imma le...,"[[0.12773122, 0.026794303, -0.0070865974, -0.0..."
26568,Bruce Springsteen,Girls In Their Summer Clothes,/b/bruce+springsteen/girls+in+their+summer+clo...,Well the street lights shine \nDown on Blessi...,well the street lights shine down on blessing ...,"[[0.10057295, 0.045913212, -0.042415272, -0.10..."
48975,Primus,Mama Didn't Raise No Fool,/p/primus/mama+didnt+raise+no+fool_20257006.html,"The best of times, the worst of times, \nThe ...",the best of times the worst of times the times...,"[[0.11015475, 0.08974717, -0.013811874, -0.106..."
14557,O.A.R.,The Architect,/o/oar/the+architect_21084711.html,Young man come to me \nAnd asked me for a smo...,young man come to me and asked me for a smoke ...,"[[0.15523216, 0.06130932, 0.023257168, -0.0606..."
20732,Venom,Speed King,/v/venom/speed+king_20290379.html,Good golly said little Miss Molly \nWhe she w...,good golly said little miss molly whe she was ...,"[[0.13503546, 0.09211837, -0.0052667703, -0.07..."
35643,Heart,Love Alive,/h/heart/love+alive_20064769.html,The sky was dark this morning \nWhen I raised...,the sky was dark this morning when i raised my...,"[[0.11318944, 0.020148102, -0.0122584365, -0.0..."
37499,Janis Joplin,Intruder,/j/janis+joplin/intruder_20069698.html,You come around here \nTrying to make your de...,you come around here trying to make your deman...,"[[0.14336511, 0.047080997, 0.013688436, -0.073..."
37334,Jackson Browne,All Good Things,/j/jackson+browne/all+good+things_20068533.html,All good things got to come to an end \nThe t...,all good things got to come to an end the thri...,"[[0.11332709, 0.057575725, -0.021200214, -0.03..."
42119,Lou Reed,Teach The Gifted Children,/l/lou+reed/teach+the+gifted+children_20085151...,Teach the gifted children \nTeach them to hav...,teach the gifted children teach them to have m...,"[[0.10582162, 0.05750734, 0.051301774, -0.0182..."
29737,Diana Ross,Don't Explain,/d/diana+ross/dont+explain_20040292.html,"Hush now, don't explain \nJust say you'll rem...",hush now don t explain just say you ll remain ...,"[[0.13738053, 0.068070054, -0.020861318, -0.09..."


I had a fairly modest 4 GB machine and wasn't able to generate a more accurate model. However, one can tune the number of iterations, the learning rate and other parameters to fit the model better. With many dimensions (around 300 or more), it may make sense to reduce them with PCA first and then apply t-SNE.

In [13]:
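# One row per sampled song, one column per embedding dimension (5761 x 50 here).
X = np.array(song_vectors).reshape((len(song_vectors), num_features))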

start_time = time.time()
tsne = sklearn.manifold.TSNE(n_components=2, n_iter=250, random_state=0, verbose=2)

all_word_vectors_matrix_2d = tsne.fit_transform(X)

print("--- %s seconds ---" % (time.time() - start_time))

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 5761 samples in 0.014s...
[t-SNE] Computed neighbors for 5761 samples in 3.376s...
[t-SNE] Computed conditional probabilities for sample 1000 / 5761
[t-SNE] Computed conditional probabilities for sample 2000 / 5761
[t-SNE] Computed conditional probabilities for sample 3000 / 5761
[t-SNE] Computed conditional probabilities for sample 4000 / 5761
[t-SNE] Computed conditional probabilities for sample 5000 / 5761
[t-SNE] Computed conditional probabilities for sample 5761 / 5761
[t-SNE] Mean sigma: 0.044236
[t-SNE] Computed conditional probabilities in 0.333s
[t-SNE] Iteration 50: error = 87.4744186, gradient norm = 0.0121535 (50 iterations in 5.667s)
[t-SNE] Iteration 100: error = 87.2796555, gradient norm = 0.0221936 (50 iterations in 5.049s)
[t-SNE] Iteration 150: error = 86.9056625, gradient norm = 0.0419048 (50 iterations in 2.840s)
[t-SNE] Iteration 200: error = 86.8539124, gradient norm = 0.0348001 (50 iterations in 2.701s)

In [16]:
df=pd.DataFrame(all_word_vectors_matrix_2d,columns=['X','Y'])

df.head(10)

train.head()

df.reset_index(drop=True, inplace=True)
train.reset_index(drop=True, inplace=True)

Joining the two dataframes to obtain each song's corresponding X, Y coordinates.

In [17]:
two_dimensional_songs = pd.concat([train, df], axis=1)

two_dimensional_songs.head()

Unnamed: 0,artist,song,link,text,lyrics_clean,song_vector,X,Y
0,Usher,OMG,/u/usher/omg_20877485.html,"Oh my gosh \nBaby let me \nDid it again, so ...",oh my gosh baby let me did it again so imma le...,"[[0.12773122, 0.026794303, -0.0070865974, -0.0...",-0.002756,0.238431
1,Bruce Springsteen,Girls In Their Summer Clothes,/b/bruce+springsteen/girls+in+their+summer+clo...,Well the street lights shine \nDown on Blessi...,well the street lights shine down on blessing ...,"[[0.10057295, 0.045913212, -0.042415272, -0.10...",-0.001516,0.244693
2,Primus,Mama Didn't Raise No Fool,/p/primus/mama+didnt+raise+no+fool_20257006.html,"The best of times, the worst of times, \nThe ...",the best of times the worst of times the times...,"[[0.11015475, 0.08974717, -0.013811874, -0.106...",-0.001689,0.152633
3,O.A.R.,The Architect,/o/oar/the+architect_21084711.html,Young man come to me \nAnd asked me for a smo...,young man come to me and asked me for a smoke ...,"[[0.15523216, 0.06130932, 0.023257168, -0.0606...",0.001186,-0.116892
4,Venom,Speed King,/v/venom/speed+king_20290379.html,Good golly said little Miss Molly \nWhe she w...,good golly said little miss molly whe she was ...,"[[0.13503546, 0.09211837, -0.0052667703, -0.07...",0.00064,-0.021248


**Plotting the results**

Using plotly, I plotted the results so that it becomes easier to explore similar songs based on their colors and clusters.

In [18]:
import plotly.express as px
fig=px.scatter(two_dimensional_songs, x='X', y='Y', color='artist')
fig.show()

In [19]:
import plotly.express as px
# t-SNE produced only two components, so the categorical 'song' column serves as the z-axis.
fig = px.scatter_3d(two_dimensional_songs, x='X', y='Y', z='song',
                color='artist')
fig.show()

### KMEANS

In [22]:
from sklearn import cluster
kmeans = cluster.KMeans(n_clusters=5, 
                        random_state=42).fit(X)

In [23]:
import plotly.express as px
fig = px.scatter(two_dimensional_songs, x="X", y="Y",
                 hover_data=['artist', 'song'],
                color=kmeans.labels_)
fig.show()

In [24]:
import plotly.express as px
# The categorical 'artist' column serves as the z-axis; points are coloured by k-means cluster.
fig = px.scatter_3d(two_dimensional_songs, x='X', y='Y', z='artist',
                color=kmeans.labels_)
fig.show()