# Clustering analysis - Part 1:

In the previous notebook we did the basic exploratory analysis of the dataset. Since we could see that there was lyrical structure and similarity across artists and genres we decided to cluster the dataset into groups based on the lyrical similarity. We then wanted to visualize the artist's similarity and genre's similarity in a 2D space (based on the lyrics).

We used KMeans for clustering the Data.

The plots are as follows:
1. 2D representation of artists color coded by cluster. Each cluster is based on 6 lyric words which best represent the cluster. (Code is here)
2. 2D representation of genres color coded by cluster. Each cluster is again based on 6 lyric words which best represent the cluster.  (Code is in clustering 2 notebook)
3. Barplot representing number of artists in each cluster. (Code is here)



## Inferences:


#### A Notable cluster for artists is as follows 


Cluster 9 words: b'time', b'life', b'away', b'eyes', b'feels', b'come', b'day', b'im', b'dying', b'living',

Cluster 9 artists: davor-matosevic, back-to-the-ocean, andy-griffith, genghis-tron, break-the-silence, funker-vogt, destroy, frostfang, amberian-dawn, dax-johnson, celesty, bradley-walker, for-today, crooked-fingers, choke, abyssic-hate, gates-of-ishtar, behemoth, fair-sex, all-shall-perish, gazpacho, gallows, disbelief, abraxas, celadon-candy, covenant, earthtone9, antischism, ataraxie, gadget, goatwhore, dark-funeral, claire-voyant, echoterra, burden-of-a-day, funeral-diner, crystal-kovach, the-blackout-pact, cro-mags, atreyu, aurora-borealis, codename-rocky, blaze-bayley, antimatter, the-gathering, fallacy, dillon, clouds-over-normandy, the-accident-experiment, blood, answer-with-metal, given-free-rein, defiance, dragonforce, capture-the-crown, demonaz, dreadful-shadows,

This shows that all these artists' songs are about life and death and coincidentally they all belong to the metal genre.



Note: The following links were referred.

https://bokeh.pydata.org/en/latest/docs/reference/models/glyphs/text.html

https://www.datascience.com/resources/notebooks/word-embeddings-in-python

http://brandonrose.org/clustering

http://ahmedbesbes.com/sentiment-analysis-on-twitter-using-word2vec-and-keras.html

https://www.digitalocean.com/community/tutorials/how-to-work-with-language-data-in-python-3-using-the-natural-language-toolkit-nltk


In [493]:
import numpy as np
import pandas as pd
import nltk
from nltk.stem.snowball import SnowballStemmer
import re
import os
import codecs
from sklearn import feature_extraction

In [494]:
data = pd.read_csv('lyrics.csv')
data.head(10)
data.shape

  interactivity=interactivity, compiler=compiler, result=result)


(356467, 6)

In [495]:
data = data.dropna()

In [496]:
data.shape

(24365, 6)

In [497]:
data = data.iloc[:10000]

In [498]:
data = data.drop('index',axis=1)

In [499]:
data = data.drop('year',axis=1)

In [500]:
data = data.drop('genre',axis=1)

In [501]:
lyrics_list = []

for ix,row in data.iterrows():
    chars_rm = ['\n',',','[',']','.','?','!','(',')',':']
    ly = row['lyrics']
    for char in ly:
        if char in chars_rm:
            ly = ly.replace(char,' ')
        elif char=='\'':
            ly = ly.replace(char,'')
    lyrics_list.append(ly)

In [502]:
data['lyrics'] = lyrics_list

In [503]:
vocab = {}

for ix,row in data.iterrows():
    singer = row['artist']
    if singer in vocab:
        vocab[singer] += row['lyrics']
    else:
        vocab[singer] = row['lyrics']
    

In [504]:
cleaned_data = pd.DataFrame({'Artist':list(set(data['artist'].tolist())), 'Lyrics':[vocab[artist] for artist in set(data['artist'].tolist())]})
cleaned_data.head(10)

Unnamed: 0,Artist,Lyrics
0,children-18-3,Sing sing oh so sucker For the sugar substi...
1,efterklang,Another way another way to your heart It star...
2,ajj,Last week I saw you at the Junkie Church You t...
3,4-the-cause,When the night has come And the land is dark A...
4,animation,Dont count on me - I engineer on evry move we ...
5,celtic-thunder,In the town the people stay away From the mid...
6,austrian-death-machine,Lets go Waaa-yeah How about this one How ab...
7,frustrators,Im gonna be a plastic tree a character transpa...
8,big-bad-vodoo-daddy,And now friends let me tell you about this cat...
9,deichkind,Deine Eltern sind auf einem Tennisturnier du ...


In [505]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [506]:
stopwords = nltk.corpus.stopwords.words('english')

In [507]:
stemmer = SnowballStemmer("english")

In [508]:

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

In [509]:

def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

In [510]:
totalvocab_stemmed = []
totalvocab_tokenized = []

lyrics_sample = cleaned_data['Lyrics'].iloc[0:1000].tolist()
artists_sample = cleaned_data['Artist'].iloc[0:1000].tolist()

for l in lyrics_sample:
    allwords_stemmed = tokenize_and_stem(l) # for each item in 'synopses', tokenize/stem
    totalvocab_stemmed.extend(allwords_stemmed) # extend the 'totalvocab_stemmed' list
    
    allwords_tokenized = tokenize_only(l)
    totalvocab_tokenized.extend(allwords_tokenized)

In [511]:
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')

there are 2191621 items in vocab_frame


In [512]:
# Note that the result of this block takes a while to show
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

%time tfidf_matrix = tfidf_vectorizer.fit_transform(lyrics_sample) #fit the vectorizer to synopses

# (100, 563) means the matrix has 100 rows and 563 columns
print(tfidf_matrix.shape)
terms = tfidf_vectorizer.get_feature_names()

Wall time: 48.1 s
(478, 480)


In [513]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(tfidf_matrix)

In [514]:
from sklearn.cluster import KMeans

num_clusters = 10

km = KMeans(n_clusters=num_clusters)

%time km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

Wall time: 5.95 s


In [515]:
songs = { 'Artist': artists_sample, 'Lyrics': lyrics_sample, 'cluster': clusters }

frame = pd.DataFrame(songs, index = [clusters] , columns = ['Artist', 'cluster'])

#frame['cluster'].value_counts() #number of artists per cluster


In [516]:
from __future__ import print_function

print("Top terms per cluster:")
print() #add whitespace

top_terms_final = []

#sort cluster centers by proximity to centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1] 

for i in range(num_clusters):
    print("Cluster %d words:" % i, end='')
    top_terms = []
    for ind in order_centroids[i, :10]: #replace 6 with n words per cluster
        print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=',')
        top_terms.append((' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore')))
        
    print() #add whitespace
    print() #add whitespace
    
    top_terms_final.append(top_terms)
    
    print("Cluster %d artists:" % i, end='')
    if type(frame.ix[i]['Artist'])!=str:
        for title in frame.ix[i]['Artist'].values.tolist():
            print(' %s,' % title, end='')
    else:
        print(' %s,' % frame.ix[i]['Artist'], end='')
    print() #add whitespace
    print() #add whitespace

Top terms per cluster:

Cluster 0 words:

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  from ipykernel import kernelapp as app


 b'young', b'im', b'soul', b'want', b'say', b'dont', b'ive', b'time', b'just', b'friends',

Cluster 0 artists: eye-alaska, above-the-underground, eyes, dear-stalker, 2win,

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated




Cluster 1 words: b'love', b'baby', b'im', b'know', b'dont', b'oh', b'just', b'na', b'want', b'like',

Cluster 1 artists: austrian-death-machine, big-bad-vodoo-daddy, chicken-shack, derailers, asher-monroe, george-duke, delfonics, deerhoof, funkadelic, duke-robillard, dominique-van-hulst, bomshel, adena, ehsan, for-king-country, cohen-leonard, brightwood, the-allman-brothers, daisy-dee, 2-be-3, drew-davis, blues, freemasons, atlas-sound, firefall, all-4-one, delerium, caleb-collins, evan-t, betty-wright, area-11, ballad, diana-ross-the-supremes, gina-g, conway-twitty, bert-jansch, girls, angel-grant, busy-signal, carter-s-chord, arthur-big-boy-crudup, chris-salvatore, george, frank-derol, b3, britt-nicole, brian-simpson, delroy-wilson, bobby-bazini, engelbert, all-caps, geographer, bill-evans, connie-francis, alexia, anastacia, athena-cage, cover-girls, breakbot, billy-thorpe, george-harrison, fka-twigs, bobby-darin, burn-the-ballroom, adeade, charles-bradley, backburner, anthony-fall

In [517]:
similarity_distance = 1 - cosine_similarity(tfidf_matrix)
print(type(similarity_distance))
print(similarity_distance.shape)

<class 'numpy.ndarray'>
(478, 478)


In [518]:
import os  # for os.path.basename

import matplotlib.pyplot as plt
import matplotlib as mpl

from sklearn.manifold import MDS

# convert two components as we're plotting points in a two-dimensional plane
# "precomputed" because we provide a distance matrix
# we will also specify `random_state` so the plot is reproducible.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)

%time pos = mds.fit_transform(similarity_distance)  # shape (n_components, n_samples)

print(pos.shape)

xs, ys = pos[:, 0], pos[:, 1]



Wall time: 6.78 s
(478, 2)


In [519]:
#set up colors per clusters using a dict
cluster_colors = {0: '#a50026', 1: '#d73027', 2: '#f46d43', 3: '#fdae61', 4: '#fee08b', 5: '#d9ef8b', 6: '#a6d96a', 7: '#66bd63', 8: '#1a9850', 9: '#006837'}

cluster_names = {}

for i in range(0,10):
    cluster_names[i] = str(top_terms_final[i]).replace('b','').replace('"','').replace('[','').replace(']','')


In [520]:
#some ipython magic to show the matplotlib plots inline
%matplotlib inline 

#create data frame that has the result of the MDS plus the cluster numbers and titles
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, artist=artists_sample)) 


In [521]:
clus_ids = []
for ix,row in df.iterrows():
    clus_ids.append(cluster_names[row['label']])

In [522]:
colormap = {0: '#a50026', 1: '#d73027', 2: '#f46d43', 3: '#fdae61', 4: '#fee08b', 5: '#d9ef8b', 6: '#a6d96a', 7: '#66bd63', 8: '#1a9850', 9: '#006837'}
colors = [colormap[x] for x in df['label']]
df['color'] = colors

In [523]:
import pandas as pd
import numpy as np
from bokeh.io import output_notebook, show, curdoc, push_notebook
from bokeh.models import ColumnDataSource,HoverTool,ColorBar,LabelSet
from bokeh.plotting import figure, show, output_notebook
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from bokeh.layouts import layout
from bokeh.layouts import row
from ipywidgets import interact
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual, IntSlider

output_notebook()

colormap = {0: 'red', 1: 'blue', 2: 'green', 3: 'yellow', 4: 'orange', 5: 'black', 6: 'navy', 7: 'pink', 8: 'magenta', 9: 'brown'}
colors = [colormap[x] for x in df['label'].tolist()]

p = figure(title = "Clustering Music Artists Based on Lyrics",tools="hover,lasso_select,pan,wheel_zoom,box_zoom,reset,save")

xs = list(xs)
ys = list(ys)
print(type(clus_ids[0]))
source = ColumnDataSource(dict(
    x=xs,
    y=ys,
    color=colors,
    label=clus_ids,
    artist = [i for i in df['artist'].tolist()]
))

# scatter plot
t = p.scatter('x', 'y', source=source, fill_alpha=0.6,
              fill_color="#8724B5",
              line_color=None)

# text labels
labels = LabelSet(x='x', y='y', text='artist', y_offset=8,
                      text_font_size="6pt", text_color="color",
                      source=source, text_align='center')
#r = p.text(x='x', y='y', text_color='color', text = 'artist', text_alpha=0.8, text_font_size='5pt', source=source, legend = 'artist')

p.select_one(HoverTool).tooltips = [('Cluster', '@label')]
p.add_layout(labels)
#p.add_tools(HoverTool(tooltips=[("Cluster", "@label")]))


<class 'str'>


In [524]:
show(p)

E-1010 (CDSVIEW_SOURCE_DOESNT_MATCH): CDSView used by Glyph renderer must have a source that matches the Glyph renderer's data source: GlyphRenderer(id='32a1249e-8b95-49bf-8f81-09648fe59e1a', ...)
E-1010 (CDSVIEW_SOURCE_DOESNT_MATCH): CDSView used by Glyph renderer must have a source that matches the Glyph renderer's data source: GlyphRenderer(id='3f6f3386-5e8f-4153-b012-629833002c07', ...)
E-1010 (CDSVIEW_SOURCE_DOESNT_MATCH): CDSView used by Glyph renderer must have a source that matches the Glyph renderer's data source: GlyphRenderer(id='6b72a981-97dd-4f65-9f92-51abed943fd0', ...)
E-1010 (CDSVIEW_SOURCE_DOESNT_MATCH): CDSView used by Glyph renderer must have a source that matches the Glyph renderer's data source: GlyphRenderer(id='7aa9ae67-656a-46e2-a7b9-ca941776c34a', ...)
E-1010 (CDSVIEW_SOURCE_DOESNT_MATCH): CDSView used by Glyph renderer must have a source that matches the Glyph renderer's data source: GlyphRenderer(id='b577c7bf-ab50-4e37-841f-1e5f4a05eef4', ...)


In [525]:
counts = df['label'].value_counts()
counts

6    164
3    120
1     88
5     54
8     17
9     10
4     10
2      7
0      5
7      3
Name: label, dtype: int64

In [526]:
counts = list(counts.values)

In [527]:
from bokeh.io import show, output_file
from bokeh.plotting import figure

clusters = ['0','1','2','3','4','5','6','7','8','9']

p = figure(x_range=clusters, plot_height=500, title="Number of artists in each cluster")
p.vbar(x=clusters, top=[28,6,72,14,9,7,34,2,2,26], width=0.9)

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.axis_label = "Cluster ID"
p.yaxis.axis_label = "Number of Artists"

show(p)

E-1010 (CDSVIEW_SOURCE_DOESNT_MATCH): CDSView used by Glyph renderer must have a source that matches the Glyph renderer's data source: GlyphRenderer(id='32a1249e-8b95-49bf-8f81-09648fe59e1a', ...)
E-1010 (CDSVIEW_SOURCE_DOESNT_MATCH): CDSView used by Glyph renderer must have a source that matches the Glyph renderer's data source: GlyphRenderer(id='3f6f3386-5e8f-4153-b012-629833002c07', ...)
E-1010 (CDSVIEW_SOURCE_DOESNT_MATCH): CDSView used by Glyph renderer must have a source that matches the Glyph renderer's data source: GlyphRenderer(id='6b72a981-97dd-4f65-9f92-51abed943fd0', ...)
E-1010 (CDSVIEW_SOURCE_DOESNT_MATCH): CDSView used by Glyph renderer must have a source that matches the Glyph renderer's data source: GlyphRenderer(id='7aa9ae67-656a-46e2-a7b9-ca941776c34a', ...)
E-1010 (CDSVIEW_SOURCE_DOESNT_MATCH): CDSView used by Glyph renderer must have a source that matches the Glyph renderer's data source: GlyphRenderer(id='b577c7bf-ab50-4e37-841f-1e5f4a05eef4', ...)
