# WORD EMBEDDINGS - ENGLISH SONG LYRICS CORPUS
## - OPTIMIZED PARAMETERS FOR BETTER TRAINING
## - TESTS WITH DIFFERENT NUMBERS OF CLUSTERS (KMEANS)

In [1]:
import pandas as pd
import numpy as np
import gensim.models.word2vec as w2v
import multiprocessing
import os
import re
import pprint
import sklearn.manifold
import matplotlib.pyplot as plt

Although non-English artists had been removed, the dataset still contained Lata Mangeshkar's Hindi lyrics transliterated into the Latin alphabet. I therefore decided to remove all of her songs.

In [2]:
songs = pd.read_csv("data/songdata.csv", header=0)
#songs.head()
songs = songs[songs.artist != 'Lata Mangeshkar']
songs.head()

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


To train the word2vec model, we first need to build its vocabulary. To do that, I iterated over each song and added it to an array that can later be fed to the model.

### I AM GOING TO EXTRACT MORE DIMENSIONS (100) TO MAKE THE VECTORS MORE PRECISE AND LOWER THE CONTEXT_SIZE TO 5 TO AVOID OVER-TRAINING

### I AM ALSO GOING TO USE THE TOKENIZER TO REMOVE PUNCTUATION SO THAT THE COMPARISONS ARE EASIER TO READ
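To illustrate what the `\w+` pattern does, here is a minimal sketch that uses only Python's `re` module. It assumes that `re.findall` with the same pattern is a fair stand-in for the `RegexpTokenizer` call for this purpose; the `tokenize` helper is hypothetical and the sample line comes from the first ABBA song shown above.

```python
import re

def tokenize(text):
    # Lowercase, then keep only runs of letters, digits and underscores,
    # which drops punctuation and whitespace.
    return re.findall(r"\w+", text.lower())

tokens = tokenize("Look at her face, it's a wonderful face")
print(tokens)
```

Note that the apostrophe also counts as punctuation, so "it's" is split into "it" and "s".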

In [3]:
import nltk

# Split on word characters only, which also strips punctuation.
# The tokenizer is created once instead of once per song.
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

text_corpus = []
for song in songs['text']:
    lower_case = song.lower()
    tokens_sin_puntuacion = tokenizer.tokenize(lower_case)

    text_corpus.append(tokens_sin_puntuacion)


# Dimensionality of the resulting word vectors.
# More dimensions are more computationally expensive to train,
# but can capture finer distinctions between words.
num_features = 100
# Minimum word count threshold.
min_word_count = 1

# Number of threads to run in parallel.
#more workers, faster we train
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 5


# Threshold for downsampling very frequent words (gensim's default is 1e-3).
downsampling = 1e-1

# Seed for the RNG, to make the results reproducible.
#random number generator
#deterministic, good for debugging
seed = 1

songs2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,  # renamed to vector_size in gensim >= 4.0
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)

songs2vec.build_vocab(text_corpus)
print(len(text_corpus))

57618
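The `sample` parameter controls how aggressively very frequent words are dropped during training. As a rough sketch, assuming the keep-probability formula from the original word2vec implementation (which, to my knowledge, gensim follows), the effect of different thresholds can be compared like this. The `keep_probability` helper and the 5% word frequency are hypothetical.

```python
import math

def keep_probability(word_fraction, sample):
    # word_fraction: share of all corpus tokens taken up by this word.
    # sample: the downsampling threshold passed to Word2Vec.
    ratio = word_fraction / sample
    prob = (math.sqrt(ratio) + 1) / ratio
    # Probabilities above 1 simply mean "always keep".
    return min(1.0, prob)

# A hypothetical very frequent word that makes up 5% of all tokens.
for sample in (1e-1, 1e-3, 1e-5):
    print(sample, round(keep_probability(0.05, sample), 4))
```

Under this assumption, a threshold of 1e-1 keeps even such a frequent word every time, so the setting used here leaves the corpus practically untouched.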


### I ADD MORE EPOCHS SO THAT IT TRAINS BETTER

In [None]:
import time
start_time = time.time()



songs2vec.train(text_corpus, total_examples=songs2vec.corpus_count, epochs=5)

# Create the output folder if it does not exist yet.
os.makedirs("trained", exist_ok=True)

songs2vec.save(os.path.join("trained", "songs2vectors.w2v"))

print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
songs2vec = w2v.Word2Vec.load(os.path.join("trained", "songs2vectors.w2v"))

#### Let's explore our model

Find similar words
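`most_similar` ranks the vocabulary by cosine similarity to the query word. A minimal sketch of that ranking, using hypothetical 3-dimensional toy vectors rather than the trained model:

```python
import numpy as np

# Hypothetical toy vectors, not taken from the trained model.
toy_vectors = {
    "love":  np.array([0.9, 0.1, 0.0]),
    "heart": np.array([0.8, 0.2, 0.1]),
    "baby":  np.array([0.6, 0.5, 0.0]),
    "data":  np.array([0.0, 0.1, 0.9]),
}

def cosine(u, v):
    # Cosine of the angle between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_most_similar(word, topn=3):
    # Score every other word against the query, most similar first.
    scores = [(other, cosine(toy_vectors[word], vec))
              for other, vec in toy_vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = toy_most_similar("love")
print(ranking)
```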

In [None]:
songs2vec.wv.most_similar("love")

In [None]:
songs2vec.wv.most_similar("fuck")

In [None]:
songs2vec.wv.most_similar("song")

In [None]:
songs2vec.wv.most_similar("sweet")

In [None]:
songs2vec.wv.most_similar("angel")

### THE MODEL HANDLED ALL OF THE PREVIOUS WORDS VERY WELL BECAUSE THEY ARE RELATED TO SONGS! THE FOLLOWING ONES WILL BE A BIT HARDER FOR IT

In [None]:
songs2vec.wv.most_similar("espresso")

In [None]:
songs2vec.wv.most_similar("computer")

In [None]:
songs2vec.wv.most_similar("data")

### THE SAME WILL HAPPEN WITH THE WORDS OUT OF CONTEXT

Words out of context

In [None]:
songs2vec.wv.doesnt_match("happiness love joy hate".split())

In [None]:
songs2vec.wv.doesnt_match("breakfast milk lunch dinner".split())
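As far as I understand it, `doesnt_match` picks the word whose vector is least similar to the mean of all the given word vectors. A minimal sketch of that idea with hypothetical 2-dimensional toy vectors (not the trained model):

```python
import numpy as np

# Hypothetical toy vectors: three meals point one way, "milk" another.
toy_vectors = {
    "breakfast": np.array([0.9, 0.1]),
    "lunch":     np.array([0.8, 0.2]),
    "dinner":    np.array([0.85, 0.15]),
    "milk":      np.array([0.2, 0.9]),
}

def toy_doesnt_match(words):
    # Normalise each vector to unit length.
    unit = np.array([toy_vectors[w] / np.linalg.norm(toy_vectors[w]) for w in words])
    # Average the unit vectors and normalise the mean as well.
    mean = unit.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    # The odd one out has the lowest cosine similarity to the mean.
    similarities = unit @ mean
    return words[int(np.argmin(similarities))]

odd_one_out = toy_doesnt_match(["breakfast", "milk", "lunch", "dinner"])
print(odd_one_out)
```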

In [None]:
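# Query through .wv, consistent with the other cells.
songs2vec.wv.most_similar(positive=['woman', 'king'], negative=['man'])
# The classic expected answer for this analogy is "queen".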

Semantic distance between words

In [None]:
def nearest_similarity_cosmul(start1, end1, end2):
    similarities = songs2vec.wv.most_similar_cosmul(
        positive=[end2, start1],
        negative=[end1]
    )
    start2 = similarities[0][0]
    print("{0} is to {1} as {2} is to {3}".format(start1, end1, start2, end2))

In [None]:
nearest_similarity_cosmul("paris", "france", "alabama")

In [None]:
nearest_similarity_cosmul("paris", "france", "london")
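The helper above relies on `most_similar_cosmul`, which, to my understanding, scores each candidate with a multiplicative combination of cosine similarities: similarities to the positive words are multiplied together and divided by the similarities to the negative words. A minimal sketch under that assumption, with hypothetical toy vectors and a hypothetical `toy_cosmul` function:

```python
import numpy as np

# Hypothetical toy vectors with axes (royalty, male, female).
toy_vectors = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "child": np.array([0.0, 1.0, 1.0]),
}

def unit(v):
    return v / np.linalg.norm(v)

def toy_cosmul(positive, negative, eps=1e-6):
    def shifted(word, other):
        # Cosine similarity shifted from [-1, 1] into [0, 1].
        return (1 + np.dot(unit(toy_vectors[word]), unit(toy_vectors[other]))) / 2

    scores = {}
    for candidate in toy_vectors:
        # Skip the query words themselves.
        if candidate in positive or candidate in negative:
            continue
        numerator = np.prod([shifted(candidate, p) for p in positive])
        denominator = np.prod([shifted(candidate, n) for n in negative]) + eps
        scores[candidate] = numerator / denominator
    return max(scores, key=scores.get)

best = toy_cosmul(positive=["woman", "king"], negative=["man"])
print("king is to man as {0} is to woman".format(best))
```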

### With the different words we have tried, we can see that the model does very well for words similar to those found in songs, but it struggles QUITE A LOT with words that are unusual in a musical context (such as countries and capitals)

## COMPUTING THE NORMALIZED SUM VECTOR

#### FIRST WE CREATE A COLUMN IN THE DATAFRAME THAT CONTAINS THE CLEAN LYRICS WITHOUT PUNCTUATION

In [None]:
lyrics_clean=[]
for row in text_corpus:
    lyrics_clean.append(' '.join(row))
    
songs['lyrics_clean']=lyrics_clean

With the word vector embeddings in place, it is now time to calculate the normalised vector sum of each song. This process can take some time since it has to be done for each of 57,000 songs.
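As a small illustration of what "normalised vector sum" means, here is a sketch with hypothetical 2-dimensional toy vectors instead of the trained embeddings: the word vectors of a song are added up and the result is scaled to unit length.

```python
import numpy as np

# Hypothetical toy embeddings, not the trained model.
toy_vectors = {
    "love": np.array([3.0, 0.0]),
    "you":  np.array([0.0, 4.0]),
}

def toy_song_vector(lyrics):
    # Add up the vectors of all words in the lyrics.
    total = np.zeros(2)
    for word in lyrics.split():
        total += toy_vectors[word]
    # Scale to unit length; leave an all-zero vector unchanged.
    norm = np.linalg.norm(total)
    return total / norm if norm > 0 else total

vec = toy_song_vector("love you")
print(vec, np.linalg.norm(vec))
```

Here the sum is (3, 4), which has length 5, so the normalised vector is (0.6, 0.8). Normalising means that long and short songs end up on the same scale.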

In [None]:
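import time
import sklearn.preprocessing

def songVector(row):
    # Start from a zero vector so that empty lyrics still yield an array.
    vector_sum = np.zeros(songs2vec.wv.vector_size)
    words = row.lower().split()
    for word in words:
        # Look words up through .wv, as in the similarity queries above.
        vector_sum = vector_sum + songs2vec.wv[word]
    vector_sum = vector_sum.reshape(1,-1)
    # L2-normalise so that every song vector has unit length.
    normalised_vector_sum = sklearn.preprocessing.normalize(vector_sum)
    return normalised_vector_sum


start_time = time.time()

songs['song_vector'] = songs['lyrics_clean'].apply(songVector)

print("--- %s seconds ---" % (time.time() - start_time))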




## CLUSTERING

**t-SNE and random song selection**

Each song vector has 100 dimensions. t-SNE is memory intensive, so it is easier on the computer to run it on a random sample of the roughly 57,000 songs.

In [None]:
from sklearn.model_selection import train_test_split

# Keep a random 10% of the songs ("train") so that t-SNE stays tractable.
train, test = train_test_split(songs, test_size = 0.9)

# Collect the sampled song vectors in the same row order as `train`.
song_vectors = list(train['song_vector'])

train.head(10)

I had a fairly measly 4 GB machine and wasn't able to generate a more accurate model. However, one can play around with the number of iterations, the learning rate and other factors to fit the model better. If you have too many dimensions (~300+), it might make sense to apply PCA first and then t-SNE.
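A minimal sketch of the PCA-then-t-SNE idea, run on random data rather than the song vectors. The array sizes, the 20 intermediate components and the perplexity are arbitrary assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical high-dimensional data: 60 points with 300 features each.
rng = np.random.default_rng(0)
high_dim = rng.normal(size=(60, 300))

# Step 1: compress to a few linear components with PCA.
reduced = PCA(n_components=20, random_state=0).fit_transform(high_dim)

# Step 2: run t-SNE on the compressed data to get 2-D coordinates.
embedded = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(reduced)

print(reduced.shape, embedded.shape)
```

Running PCA first reduces the cost of the pairwise distance computations inside t-SNE, at the price of discarding some variance.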

In [None]:
# One row per sampled song; the shape is inferred instead of hard-coded.
X = np.array(song_vectors).reshape(len(song_vectors), -1)

start_time = time.time()
tsne = sklearn.manifold.TSNE(n_components=2, n_iter=250, random_state=0, verbose=2)

all_word_vectors_matrix_2d = tsne.fit_transform(X)

print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
df=pd.DataFrame(all_word_vectors_matrix_2d,columns=['X','Y'])

df.head(10)

train.head()

df.reset_index(drop=True, inplace=True)
train.reset_index(drop=True, inplace=True)

Joining two dataframes to obtain each song's corresponding X,Y co-ordinate.

In [None]:
two_dimensional_songs = pd.concat([train, df], axis=1)

two_dimensional_songs.head()
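The `reset_index` calls before the join matter because `pd.concat` with `axis=1` aligns rows by index label, and the sampled songs keep their original, shuffled labels. A small sketch with hypothetical toy frames shows the difference:

```python
import pandas as pd

# Hypothetical sample that kept its original (shuffled) index labels.
sample = pd.DataFrame({"song": ["a", "b", "c"]}, index=[7, 4, 9])
# Coordinates with a fresh 0..2 index, like the t-SNE output frame.
coords = pd.DataFrame({"X": [0.1, 0.2, 0.3], "Y": [1.0, 2.0, 3.0]})

# Without resetting, no labels match and the result is padded with NaN.
misaligned = pd.concat([sample, coords], axis=1)

# After resetting, rows are paired by position as intended.
aligned = pd.concat([sample.reset_index(drop=True), coords], axis=1)

print(misaligned.shape, aligned.shape)
```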


**Plotting the results**

Using plotly, I plotted the results so that it becomes easier to explore similar songs based on their colors and clusters.

In [None]:
import plotly
plotly.offline.init_notebook_mode(connected=True) 

In [None]:
import plotly.express as px
fig=px.scatter(two_dimensional_songs, x='X', y='Y', color='artist')
fig.show()

In [None]:
import plotly.express as px
fig = px.scatter_3d(two_dimensional_songs, x='X', y='Y', z='song',
                color='artist')
fig.show()

# CLUSTERING WITH KMEANS

In [None]:
from sklearn import cluster
# One row per sampled song; the shape is inferred instead of hard-coded.
X = np.array(song_vectors).reshape(len(song_vectors), -1)

kmeans = cluster.KMeans(n_clusters=3, 
                        random_state=42).fit(X)

In [None]:
import plotly.express as px
fig = px.scatter(two_dimensional_songs, x="X", y="Y",
                 hover_data=['artist', 'song'],
                color=kmeans.labels_)
fig.show()

In [None]:
import plotly.express as px
fig = px.scatter_3d(two_dimensional_songs, x='X', y='Y', z='artist',
                color=kmeans.labels_)
fig.show()

### PLOTTING THE DIFFERENT KMEANS CLUSTERS, WE CAN SEE THAT SONGS WITH SIMILAR THEMES ARE GROUPED TOGETHER
- For example, love songs tend to be in the yellow cluster (Adam Sandler - Best Friend, Whitney Houston - For The Love Of You)
- The blue cluster contains more "independent" music, which we could perhaps categorize better by creating more clusters (we try this next)
- The pink cluster seems to contain more energetic music such as rap and rock
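One common way to judge whether more clusters help is to compare the KMeans inertia (the within-cluster sum of squared distances) for several values of k. The sketch below does this on synthetic blobs, not on the song vectors; the data, the range of k and `n_init=10` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
toy_X = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centres])

# Fit KMeans for several k and record the inertia of each fit.
inertias = {}
for k in (1, 2, 3, 4, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(toy_X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

The inertia typically drops sharply until k reaches the number of natural groups and flattens afterwards, so that bend is a reasonable starting point for choosing k.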

# KMEANS WITH 15 CLUSTERS

In [None]:
kmeans = cluster.KMeans(n_clusters=15, 
                        random_state=42).fit(X)

In [None]:
import plotly.express as px
fig = px.scatter(two_dimensional_songs, x="X", y="Y",
                 hover_data=['artist', 'song'],
                color=kmeans.labels_)
fig.show()

### HERE WE CAN SEE THAT THE GROUPING MAKES SENSE WITH RESPECT TO THE X AND Y FEATURES. THE COLOR GOES FROM LIGHT TO DARK, FORMING DISTINCT CLUSTERS

## BY ANALYZING THE LYRICS OF THE SONGS IN EACH CLUSTER, ONE COULD SEE THAT MANY WORDS ARE REPEATED ACROSS THE SONGS OF THE SAME CLUSTER
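One way this check could be carried out is to count the most frequent words per cluster. The sketch below uses hypothetical toy lyrics and labels, and `top_words_per_cluster` is an illustrative helper, not part of the notebook.

```python
from collections import Counter, defaultdict

# Hypothetical cleaned lyrics and their cluster labels.
toy_lyrics = ["love love you", "love you baby", "money cars money", "money cars fast"]
toy_labels = [0, 0, 1, 1]

def top_words_per_cluster(lyrics, labels, topn=2):
    # Accumulate one word counter per cluster label.
    counts = defaultdict(Counter)
    for text, label in zip(lyrics, labels):
        counts[label].update(text.split())
    # Keep only the topn most frequent words of each cluster.
    return {label: [word for word, _ in counter.most_common(topn)]
            for label, counter in counts.items()}

top_words = top_words_per_cluster(toy_lyrics, toy_labels)
print(top_words)
```

With the notebook's data, the inputs would presumably be `two_dimensional_songs['lyrics_clean']` and `kmeans.labels_`, assuming both are in the same row order. Very common words such as "the" or "you" would likely dominate every cluster, so filtering stop words first may be worthwhile.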