<a class="anchor" id="section2"></a>
## Section 2: Tag Clustering

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My\ Drive/WMP

Mounted at /content/drive
/content/drive/My Drive/WMP


In [None]:
!pip install -U sentence-transformers  # for BERT
!pip install -U textblob  # for translate
!pip install python-Levenshtein

There are three kinds of tag clustering:

1. Merge synonyms: tag clustering based on BERT similarity
2. Tag clustering based on Levenshtein distance
3. Tag clustering based on the tag-artist matrix

Clustering steps (initial number of tags: 11946):
1. Filter tags: keep only tags that have been used at least once (9749 tags left).
2. Translate to English.
3. Remove stopwords and punctuation.
4. Lemmatize (9311 unique tags left after steps 3 and 4).
5. Build a similarity matrix:
 - for BERT: sentence embedding -> similarity matrix
 - for Levenshtein distance: Levenshtein similarity matrix
 - for the tag-artist matrix: tag-artist matrix -> tag similarity matrix
6. AgglomerativeClustering: get a cluster-id for each tag.
7. Filter tag clusters: keep only clusters that have been used at least twice.
8. Update tags: add the cluster-id to tags, remove tags without a cluster-id.
9. Update tags-train: add the cluster-id to records, remove records without a cluster-id.
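
Step 6 can be illustrated without any library. The block below is a minimal sketch of average-linkage agglomerative clustering on a precomputed distance matrix, stopping once the closest pair of clusters is at or above the threshold. The function `average_linkage_clusters` and the toy distances are hypothetical and only meant to show the idea; the notebook itself relies on scikit-learn's `AgglomerativeClustering`.

```python
from itertools import combinations


def average_linkage_clusters(distance_matrix, threshold):
    """Naive average-linkage agglomerative clustering on a precomputed matrix.

    Merging stops when the closest pair of clusters is at or above `threshold`.
    Returns one cluster label per item.
    """
    clusters = [[i] for i in range(len(distance_matrix))]

    def linkage(a, b):
        # Average of all pairwise distances between members of a and b.
        return sum(distance_matrix[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        (x, y), best = min(
            (((p, q), linkage(clusters[p], clusters[q]))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best >= threshold:
            break
        # x < y always holds, so deleting y does not shift x.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(distance_matrix)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy distances between four tags: the first two are near-synonyms.
toy_tags = ["anti-folk", "antifolk", "thrash metal", "turkish"]
toy_distances = [
    [0.00, 0.05, 0.90, 0.95],
    [0.05, 0.00, 0.92, 0.96],
    [0.90, 0.92, 0.00, 0.85],
    [0.95, 0.96, 0.85, 0.00],
]
toy_labels = average_linkage_clusters(toy_distances, threshold=0.1)
print(dict(zip(toy_tags, toy_labels)))
```

With a threshold of 0.1 only the two near-synonyms share a label; raising the threshold merges more tags into fewer clusters, which is the trade-off explored with `distance_list` later in this section.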

In [None]:
import pandas as pd
import matplotlib.pylab as plt
import numpy as np
import os

from Levenshtein import ratio as levenshtein_ratio  # ratio = 1 - distance(a, b)/(len(a)+len(b))
from sentence_transformers import SentenceTransformer, util
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity
from textblob import TextBlob

<a class="anchor" id="section21"></a>
### Section 2.1 Tag Preprocessing

In [None]:
artists_df = pd.read_table('data/dataset/artists.dat')
tags_df = pd.read_table('data/dataset/tags.dat', encoding = "ISO-8859-1")
user_artists_df = pd.read_table('data/dataset/user_artists.dat')
user_friends_df = pd.read_table('data/dataset/user_friends.dat')
user_tag_artists_df = pd.read_table('data/dataset/user_taggedartists.dat')

artists_df.drop(['pictureURL', 'url'], inplace=True, axis=1)
user_tag_artists_df.drop(['day', 'month', 'year'], inplace=True, axis=1)

# STEP 1: filter tags: keep only tags that have been used at least once
tags = user_tag_artists_df.groupby('tagID').size().sort_values(ascending=False).to_frame()
tags.columns = ['count']
tags = tags.merge(tags_df, on='tagID', how='left')

display(tags.head())

Unnamed: 0,tagID,count,tagValue
0,73,7503,rock
1,24,5418,pop
2,79,5251,alternative
3,18,4672,electronic
4,81,4458,indie


In [None]:
display(tags_df.head())
display(artists_df.head())
display(user_artists_df.head())
display(user_friends_df.head())
display(user_tag_artists_df.head())

**Attention**

The translation step takes about 1 hour to run.

In [None]:
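# STEP 2: detect the language and translate to English
# Note: detect_language() and translate() call an online service; they were
# deprecated and later removed in newer TextBlob releases, so this cell needs an older version.
from textblob import TextBlob

def translate(text):
  text_blob = TextBlob(text)
  try:
    language = text_blob.detect_language()
    if language == 'en':
      return text
    return str(text_blob.translate(to='en'))  # translate() returns a TextBlob, not a str

  except Exception:
    return text  # detection needs at least 3 characters; keep the original text on any failure

# Reuse `tags` from STEP 1 so the 'count' column is kept (tags_df has no 'count')
tags['translated'] = tags.tagValue.apply(translate)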

In [None]:
# to avoid waiting 1 hour again, save the translations to Drive and load them later
tags.to_csv('data/interim/translated_tags.csv')

In [None]:
# if you saved the translations earlier, load them here
tags = pd.read_csv('data/interim/translated_tags.csv')
display(tags.head())

Unnamed: 0,tagID,count,tagValue,translated
0,73,7503,rock,rock
1,24,5418,pop,pop
2,79,5251,alternative,alternative
3,18,4672,electronic,electronic
4,81,4458,indie,indie
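
The save and load cells above follow a compute-once, cache-on-disk pattern. As a hedged illustration, the sketch below wraps that pattern in one helper using only the standard library. `load_or_compute` and `slow_translate` are hypothetical names, and the tiny lookup table stands in for the real translation call.

```python
import csv
import os
import tempfile


def load_or_compute(path, keys, compute):
    """Return {key: value}; read it from `path` if present, else compute and save."""
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as handle:
            return {row["key"]: row["value"] for row in csv.DictReader(handle)}

    result = {key: compute(key) for key in keys}
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["key", "value"])
        writer.writeheader()
        for key, value in result.items():
            writer.writerow({"key": key, "value": value})
    return result


calls = []


def slow_translate(tag):
    # Stand-in for the real translation call; it records each invocation.
    calls.append(tag)
    return {"turkce": "turkish"}.get(tag, tag)


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "translated_tags.csv")
    first = load_or_compute(cache_path, ["rock", "turkce"], slow_translate)
    second = load_or_compute(cache_path, ["rock", "turkce"], slow_translate)

print(first == second, len(calls))
```

The second call reads the cached file, so the expensive function runs only once per tag.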


In [None]:
# STEP 3+4: remove stopwords and punctuation, then lemmatize
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
import nltk
import string
nltk.download('punkt')
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download('averaged_perceptron_tagger')

def get_wordnet_pos(word):
    pos = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(pos, wordnet.NOUN)

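# Build the stopword set and the lemmatizer once instead of on every call.
STOPSET = set(stopwords.words('english')) | set(string.punctuation)
WORDNET_LEMMATIZER = WordNetLemmatizer()

def preprocess_tag(tag):
    """Tokenize a tag, drop stopwords and punctuation, and lemmatize the rest."""
    tokens = [word for word in word_tokenize(tag) if word not in STOPSET]
    return ' '.join(WORDNET_LEMMATIZER.lemmatize(word, get_wordnet_pos(word)) for word in tokens)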

tags['preprocessed'] = tags.translated.apply(preprocess_tag)
tags.preprocessed = tags.preprocessed.astype(str)

print(f'there are {len(tags.preprocessed.unique())} unique tags left after lemmatization')

there are 9311 unique tags left after lemmatization
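
The drop in unique tags comes from different raw tags collapsing to the same preprocessed string. The sketch below shows that effect for stopword and punctuation removal only. It uses a tiny made-up stopword set (`TOY_STOPWORDS`) and whitespace tokenization instead of NLTK, and it does no lemmatization, so its output will differ from `preprocess_tag` for inflected words.

```python
import string

# Tiny illustrative stopword list; the notebook uses NLTK's full English list.
TOY_STOPWORDS = {"i", "me", "that", "the", "to", "on", "with", "you", "am", "when", "for"}


def strip_stopwords(tag):
    """Drop stopwords and punctuation-only tokens from a whitespace-split tag."""
    tokens = tag.lower().split()
    kept = [t for t in tokens if t not in TOY_STOPWORDS and t.strip(string.punctuation)]
    return " ".join(kept)


raw_tags = ["best song on the album", "the best song album", "song that makes me cry"]
cleaned = [strip_stopwords(tag) for tag in raw_tags]
print(cleaned)
print(len(set(raw_tags)), "raw tags ->", len(set(cleaned)), "unique cleaned tags")
```

Here the first two raw tags end up identical, which is the same kind of merge that reduces the real tag list to 9311 unique strings.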


<a class="anchor" id="section22"></a>
### Section 2.2 Tag Clustering 1/3: BERT

In [None]:
# STEP 5: tag -> embeddings -> similarity matrix
# Load the sentence-embedding model (based on BERT)
model = SentenceTransformer('bert-base-nli-mean-tokens')  # alternative: paraphrase-mpnet-base-v2

# Get tag embeddings (pass a plain list of strings)
tag_embeddings = model.encode(tags.preprocessed.astype(str).tolist(), convert_to_tensor=True)

# Get the cosine similarity matrix; move it to the CPU before converting to NumPy
bert_cos_sim_matrix = util.pytorch_cos_sim(tag_embeddings, tag_embeddings).cpu().numpy()
np.fill_diagonal(bert_cos_sim_matrix, 0)
bert_distance_matrix = 1 - bert_cos_sim_matrix

#np.savetxt("tag_distance_matrix_bert.csv", bert_distance_matrix, delimiter=',')
#bert_distance_matrix = np.loadtxt("tag_distance_matrix_bert.csv", delimiter=',')



In [None]:
%%time
# STEP 6: AgglomerativeClustering: merge synonyms
distance_list = [0.1, 0.15, 0.2]
linkage_list = ['average']  # 'single', 'complete'


def tag_cluster(tags, distance, linkage, distance_matrix, prefix):
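  """Cluster tags with agglomerative clustering on a precomputed distance matrix.

  Steps:
    1. cluster the tags
    2. remove clusters that were used only once in total
    3. create a transfer table {tagID: cluster}
    4. update the user_taggedartists train files and save them to Drive:
       add a 'cluster' column and drop records whose tagID is not in the transfer table

  Params:
    tags: original tags; must contain 'tagID' and 'count', with rows in the same order as distance_matrix
    distance: distance threshold at or above which clusters are not merged
    linkage: linkage criterion passed to AgglomerativeClustering
    distance_matrix: square matrix of pairwise tag distances
    prefix: file-name prefix of the saved files
  """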
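  # 1. clustering
  # Note: scikit-learn 1.2 renamed `affinity` to `metric`; newer versions need metric='precomputed'.
  cluster = AgglomerativeClustering(n_clusters=None, affinity='precomputed', linkage=linkage, distance_threshold=distance).fit(distance_matrix)

  print(f'running tag clustering on config: distance-{distance} linkage-{linkage}')
  print(f'there are {cluster.n_clusters_} tags now')  # number of clusters; each cluster acts as one merged tag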

  # 2. remove some clusters
  tags['cluster'] = cluster.labels_

  group = tags.groupby('cluster').agg({'count':'sum'})
  delete_count = (group['count'] == 1).sum()
  print(f'{delete_count} clusters to remove')

  #left_clusters = group[(group>1).values].index.values

  # 3. transfer table
  tags['clustered_count'] = tags['cluster'].map(group['count'])  # total usage of the cluster each tag belongs to
  transfer_table = tags[tags['clustered_count'] > 1][['tagID', 'cluster']]
  transfer_table.set_index('tagID', inplace=True)
  print(f'new clusters contain {transfer_table.shape[0]} tags')

  # 4. update drive tag relative files
  tag_train = pd.read_csv('data/split/train_user_taggedartists.csv')
  tag_train_small = pd.read_csv('data/split/train_tune_user_taggedartists.csv')

  def map_tag_cluster(tag):
    """Return the cluster of a tagID, or NaN if the tag was filtered out."""
    try:
      return transfer_table.loc[tag, 'cluster']
    except KeyError:
      return np.nan

  os.makedirs('data/tags', exist_ok=True)

  tag_train['cluster'] = tag_train.tagID.apply(map_tag_cluster)
  tag_train.dropna(inplace=True)
  tag_train.to_csv(f'data/tags/{prefix}_tags_train_{str(distance)}_{linkage}.csv', index=False)

  tag_train_small['cluster'] = tag_train_small.tagID.apply(map_tag_cluster)
  tag_train_small.dropna(inplace=True)
  tag_train_small.to_csv(f'data/tags/{prefix}_tags_train_tune_{str(distance)}_{linkage}.csv', index=False)


CPU times: user 14 µs, sys: 0 ns, total: 14 µs
Wall time: 17.4 µs


In [None]:
# STEP 6.1: tag clustering based on BERT
distance_list = [0.1, 0.15, 0.2]
for distance in distance_list:
  for linkage in linkage_list:
    tag_cluster(tags.loc[:, ['tagID', 'tagValue', 'count']].copy(), distance, linkage, bert_distance_matrix, 'bert')

running tag clustering on config: distance-0.1 linkage-average
there are 7161 tags now
3504 clusters to remove
new clusters contain 6245 tags
running tag clustering on config: distance-0.15 linkage-average
there are 5247 tags now
2218 clusters to remove
new clusters contain 7531 tags
running tag clustering on config: distance-0.2 linkage-average
there are 3734 tags now
1381 clusters to remove
new clusters contain 8368 tags
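
The log lines above are easier to read with the filtering logic in mind. The sketch below mirrors steps 7 to 9 on made-up data with plain Python containers: sum the usage count per cluster, drop clusters whose total is 1, build the tagID-to-cluster transfer table, and keep only the records whose tag survives. All tag IDs, counts and users here are invented for illustration.

```python
from collections import Counter

# (tagID, usage count, cluster label) for a handful of made-up tags.
toy_tags = [
    (1, 7, 0),   # cluster 0 has two tags
    (2, 1, 0),
    (3, 1, 1),   # cluster 1: a single tag used once -> dropped
    (4, 3, 2),   # cluster 2: a single tag used three times -> kept
]

# Total usage per cluster, mirroring groupby('cluster').agg({'count': 'sum'}).
cluster_usage = Counter()
for tag_id, count, cluster in toy_tags:
    cluster_usage[cluster] += count

removed_clusters = [c for c, total in cluster_usage.items() if total == 1]
transfer_table = {tag_id: cluster for tag_id, _, cluster in toy_tags
                  if cluster_usage[cluster] > 1}

# Records whose tag has no surviving cluster are dropped, as in steps 8 and 9.
toy_records = [("user_a", 1), ("user_b", 3), ("user_c", 4)]
kept_records = [(user, tag_id, transfer_table[tag_id])
                for user, tag_id in toy_records if tag_id in transfer_table]

print(removed_clusters, transfer_table, kept_records)
```

Because every tag has been used at least once, a cluster whose total usage is 1 contains exactly one tag. Assuming the 9749 tags from step 1 are the input, the surviving tag count is therefore the input minus the removed clusters, which matches the logs: 9749 - 3504 = 6245, 9749 - 2218 = 7531 and 9749 - 1381 = 8368.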


<a class="anchor" id="section23"></a>
### Section 2.3 Tag Clustering 2/3: Levenshtein Distance

**Attention:** computing the Levenshtein distance matrix takes around 10 hours, so save it to Drive when it finishes.
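
To make the similarity measure concrete, here is a small pure-Python sketch of the ratio formula quoted in the import comment, `1 - distance(a, b) / (len(a) + len(b))`. It assumes the distance counts only insertions and deletions (so a substitution costs 2), which is computed here through the longest common subsequence. `lcs_length` and `indel_ratio` are hypothetical helpers, not the library's implementation, and exact agreement with `Levenshtein.ratio` is an assumption rather than something verified here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def indel_ratio(a, b):
    """Similarity in [0, 1]: 1 - indel_distance / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    indel_distance = total - 2 * lcs_length(a, b)
    return 1 - indel_distance / total


print(indel_ratio("east-coast", "east coast"))
print(indel_ratio("multikulti", "multiculti"))
print(indel_ratio("rock", "pop"))
```

Pairs that differ in a single character out of ten score about 0.9, that is a distance of about 0.1, which is why spelling variants are the main thing this method groups at small thresholds.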

In [None]:
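# Look up tag strings from a plain list: much cheaper than tags.iloc[i] inside the metric
tag_values = tags.tagValue.astype(str).tolist()

def levenshtein_distance(i, j):
  i, j = int(i[0]), int(j[0])
  return 1 - levenshtein_ratio(tag_values[i], tag_values[j])

# pairwise_distances passes rows of the position array, so i and j are positions in tag_values
leven_distance_matrix = pairwise_distances(np.arange(len(tag_values)).reshape(-1, 1), metric=levenshtein_distance)  # takes several hours to run
leven_distance_matrix.shape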

np.fill_diagonal(leven_distance_matrix, 1)

os.makedirs('data/interim', exist_ok=True)

# save to Drive so the slow computation above does not have to be repeated
np.savetxt("data/interim/tag_distance_matrix_lev.csv", leven_distance_matrix, delimiter=',')

(9749, 9749)

In [None]:
leven_distance_matrix = np.loadtxt("data/interim/tag_distance_matrix_lev.csv", delimiter=',')

In [None]:
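distance_list = [0.1, 0.15, 0.2]
linkage_list = ['average']  # set explicitly instead of reusing the loop variable from the BERT cell
for distance in distance_list:
  for linkage in linkage_list:
    tag_cluster(tags.loc[:, ['tagID', 'tagValue', 'count']].copy(), distance, linkage, leven_distance_matrix, 'leven')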

running tag clustering on config: distance-0.1 linkage-average
there are 9056 tags now
4926 clusters to remove
new clusters contain 4823 tags
running tag clustering on config: distance-0.15 linkage-average
there are 8638 tags now
4627 clusters to remove
new clusters contain 5122 tags
running tag clustering on config: distance-0.2 linkage-average
there are 7869 tags now
3985 clusters to remove
new clusters contain 5764 tags


<a class="anchor" id="section24"></a>
### Section 2.4 Tag Clustering 3/3: Tag-Artist-Matrix

In [None]:
# create tag_artist_matrix and normalize it so that each row sums to 1
tag_artist_matrix = pd.pivot_table(user_tag_artists_df, values='userID', index='tagID', columns='artistID', aggfunc='count', fill_value=0)
norm_sum = tag_artist_matrix.sum(axis=1)
tag_artist_matrix = tag_artist_matrix.div(norm_sum, axis=0)

In [None]:
# create the tag-artist cosine-distance matrix (saved later under the 'correlation' prefix)
tag_sim_matrix = cosine_similarity(tag_artist_matrix)
np.fill_diagonal(tag_sim_matrix, 0)
tag_distance_matrix = 1 - tag_sim_matrix
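
As a small worked example of the cell above, the sketch below computes cosine similarity between rows of a toy tag-artist count table using only the standard library. The counts and the helpers `cosine_similarity_rows` and `normalise` are made up for illustration. It also shows that dividing a row by its sum does not change its cosine similarity to other rows, since cosine similarity ignores positive rescaling of a vector.

```python
import math


def cosine_similarity_rows(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def normalise(row):
    """Scale a row of counts so that it sums to 1."""
    total = sum(row)
    return [value / total for value in row]


# Rows: tags, columns: artists, values: how often the tag was applied.
tag_artist_counts = {
    "hair metal": [4, 2, 0],
    "glam metal": [2, 1, 0],
    "turkish": [0, 0, 5],
}

raw = cosine_similarity_rows(tag_artist_counts["hair metal"],
                             tag_artist_counts["glam metal"])
normed = cosine_similarity_rows(normalise(tag_artist_counts["hair metal"]),
                                normalise(tag_artist_counts["glam metal"]))
unrelated = cosine_similarity_rows(tag_artist_counts["hair metal"],
                                   tag_artist_counts["turkish"])
print(raw, normed, 1 - unrelated)
```

Tags applied to the same artists in the same proportions get a distance near 0, while tags that never share an artist get a distance of 1, regardless of how often each tag was used overall.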

In [None]:
# one row per tag, in the same order as the rows of tag_artist_matrix
tags_temp = tag_artist_matrix.index.to_frame(index=False).merge(tags, how='left', on='tagID')
distance_list = [0.1, 0.2, 0.3, 0.4]
linkage_list = ['average']
for distance in distance_list:
  for linkage in linkage_list:
    tag_cluster(tags_temp, distance, linkage, tag_distance_matrix, 'correlation')


running tag clustering on config: distance-0.1 linkage-average
there are 6353 tags now
1443 clusters to remove
new clusters contain 8306 tags
running tag clustering on config: distance-0.2 linkage-average
there are 6145 tags now
1431 clusters to remove
new clusters contain 8318 tags
running tag clustering on config: distance-0.3 linkage-average
there are 5487 tags now
1245 clusters to remove
new clusters contain 8504 tags
running tag clustering on config: distance-0.4 linkage-average
there are 5095 tags now
1224 clusters to remove
new clusters contain 8525 tags


<a class="anchor" id="section25"></a>
### Section 2.5 Tag Clustering Result

In [None]:
check_cluster_number = 20
cluster_bert = pd.read_csv('data/tags/bert_tags_train_0.1_average.csv')
if 'cluster' in tags.columns:
  tags = tags.drop(['cluster'], axis=1)
cluster_bert = cluster_bert.merge(tags, how='left', left_on='tagID', right_on='tagID')

for index, group in cluster_bert.groupby('cluster'):
  if index > check_cluster_number:
    break
  display(group.head())

Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
9098,405,67,3253,0.0,7,cry,cry,cry
9149,405,288,3253,0.0,7,cry,cry,cry
12721,562,344,4248,0.0,1,makes me want to cry,makes me want to cry,make want cry
20406,958,4259,6367,0.0,3,song that makes me cry,song that makes me cry,song make cry
20407,958,4264,6367,0.0,3,song that makes me cry,song that makes me cry,song make cry


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
16160,723,290,5066,1.0,3,jordin sparks,jordin sparks,jordin spark
29608,1478,290,5066,1.0,3,jordin sparks,jordin sparks,jordin spark
31447,1606,290,10037,1.0,1,jordin,jordin,jordin


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
4756,211,1713,2104,2.0,6,dave gahan,dave gahan,dave gahan
11773,520,72,2104,2.0,6,dave gahan,dave gahan,dave gahan
11782,520,1713,2104,2.0,6,dave gahan,dave gahan,dave gahan


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
10770,462,7150,3750,3.0,1,boredoms,boredoms,boredom
22824,1100,1119,7211,3.0,1,laziness,laziness,laziness


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
27218,1343,1714,3889,4.0,3,melhor da alemanha,melhor da alemanha,melhor da alemanha
29610,1478,290,9532,4.0,1,lembra alguem,lembra alguem,lembra alguem


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
622,26,934,91,5.0,654,thrash metal,thrash metal,thrash metal
1273,51,707,91,5.0,654,thrash metal,thrash metal,thrash metal
1516,59,707,91,5.0,654,thrash metal,thrash metal,thrash metal
1540,59,1104,91,5.0,654,thrash metal,thrash metal,thrash metal
1737,63,707,91,5.0,654,thrash metal,thrash metal,thrash metal


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
20455,963,1464,6394,6.0,1,voulez vous,voulez vous,voulez vous


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
1328,58,67,792,7.0,48,1980s,1980s,1980s
13334,596,72,792,7.0,48,1980s,1980s,1980s
13404,596,8388,792,7.0,48,1980s,1980s,1980s
13916,624,1001,792,7.0,48,1980s,1980s,1980s
15238,692,1464,792,7.0,48,1980s,1980s,1980s


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
4342,196,154,2012,8.0,5,rock band,rock band,rock band
4399,196,424,2012,8.0,5,rock band,rock band,rock band
4453,196,982,2012,8.0,5,rock band,rock band,rock band


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
2678,106,2700,1274,9.0,2,hardcore techno,hardcore techno,hardcore techno


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
3289,152,293,1647,10.0,13,for days when i am particularly angry with you,for days when i am particularly angry with you,day particularly angry
3298,152,325,1647,10.0,13,for days when i am particularly angry with you,for days when i am particularly angry with you,day particularly angry
3340,152,481,1647,10.0,13,for days when i am particularly angry with you,for days when i am particularly angry with you,day particularly angry
3399,152,2788,1647,10.0,13,for days when i am particularly angry with you,for days when i am particularly angry with you,day particularly angry
12759,565,998,4279,10.0,6,mad about,mad about,mad


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
2180,83,2298,1022,11.0,2,onno tunc,onno tunc,onno tunc
2204,83,2319,1022,11.0,2,onno tunc,onno tunc,onno tunc
11122,480,5254,3797,11.0,4,ndw,ndw,ndw
27484,1352,14057,8852,11.0,5,nu,nu,nu
40655,1991,18074,4097,11.0,2,nin,nin,nin


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
18399,881,1045,5969,12.0,2,sampler,sampler,sampler
18402,881,10798,5969,12.0,2,sampler,sampler,sampler


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
38810,1928,2301,11753,13.0,1,ajda forever,ajda forever,ajda forever


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
297,15,600,285,15.0,625,awesome,awesome,awesome
387,21,465,285,15.0,625,awesome,awesome,awesome
449,21,789,285,15.0,625,awesome,awesome,awesome
473,21,792,285,15.0,625,awesome,awesome,awesome
591,26,917,285,15.0,625,awesome,awesome,awesome


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
7552,360,6076,1722,16.0,59,hard,hard,hard
9819,439,271,1722,16.0,59,hard,hard,hard
9985,439,2921,1722,16.0,59,hard,hard,hard
10019,439,2927,1722,16.0,59,hard,hard,hard
10036,439,2931,1722,16.0,59,hard,hard,hard


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
23141,1126,12516,2106,17.0,2,portugal,portugal,portugal


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
21034,1004,4795,5329,18.0,36,anti-folk,anti-folk,anti-folk
21037,1004,11736,5329,18.0,36,anti-folk,anti-folk,anti-folk
21040,1004,11743,5329,18.0,36,anti-folk,anti-folk,anti-folk
24616,1205,13100,5329,18.0,36,anti-folk,anti-folk,anti-folk
24620,1205,13100,7813,18.0,2,antifolk,antifolk,antifolk


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
10379,443,6948,3577,19.0,87,turkish,turkish,turkish
16382,743,3503,3577,19.0,87,turkish,turkish,turkish
16415,743,9748,3577,19.0,87,turkish,turkish,turkish
20341,955,11343,3577,19.0,87,turkish,turkish,turkish
20342,955,11343,6363,19.0,26,turkce,Turkish,Turkish


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
41067,2016,18232,10682,20.0,2,unkle,unkle,unkle


In [None]:
cluster_leven = pd.read_csv('data/tags/leven_tags_train_0.1_average.csv')
cluster_leven = cluster_leven.merge(tags, how='left', left_on='tagID', right_on='tagID')
display(cluster_leven)
for index, group in cluster_leven.groupby('cluster'):
  if index > check_cluster_number:
    break
  display(group.head())

Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
0,2,52,13,246.0,1387,chillout,chillout,chillout
1,2,52,15,538.0,685,downtempo,downtempo,downtempo
2,2,52,18,276.0,4672,electronic,electronic,electronic
3,2,52,21,549.0,876,trip-hop,trip-hop,trip-hop
4,2,73,13,246.0,1387,chillout,chillout,chillout
...,...,...,...,...,...,...,...,...
41815,2100,8322,4,8621.0,301,black metal,black metal,black metal
41816,2100,8322,3510,5103.0,13,raw black metal,raw black metal,raw black metal
41817,2100,8322,4364,4239.0,3,pagan black metal,pagan black metal,pagan black metal
41818,2100,8322,4365,4276.0,4,lithuanian black metal,lithuanian black metal,lithuanian black metal


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
27736,1389,227,9013,0.0,2,g-e-n-i-o-s,g-e-n-i-o-s,g-e-n-i-o-s
27737,1389,429,9013,0.0,2,g-e-n-i-o-s,g-e-n-i-o-s,g-e-n-i-o-s
27738,1389,622,9014,0.0,2,g-e-n-i-o,g-e-n-i-o,g-e-n-i-o


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
16996,771,9900,5383,1.0,2,apex hills,apex hills,apex hill
16997,771,9900,5384,1.0,2,apex-hills,apex-hills,apex-hills


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
269,14,552,263,2.0,4,80s garage,80s garage,80 garage
279,14,557,263,2.0,4,80s garage,80s garage,80 garage


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
1757,66,51,892,3.0,27,1982 songs,1982 songs,1982 song
1759,66,51,896,3.0,28,1981 songs,1981 songs,1981 song
1761,66,51,899,3.0,27,1983 songs,1983 songs,1983 song
1767,66,61,892,3.0,27,1982 songs,1982 songs,1982 song
1770,66,61,899,3.0,27,1983 songs,1983 songs,1983 song


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
2546,102,2626,1231,4.0,3,loneliness,loneliness,loneliness


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
20663,996,11664,6562,5.0,7,multikulti,multikulti,multikulti
20667,996,11664,6571,5.0,9,multiculti,multiculti,multiculti


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
16924,771,267,5368,6.0,6,east-coast,east-coast,east-coast
16941,771,271,5368,6.0,6,east-coast,east-coast,east-coast
16946,771,271,5380,6.0,4,east coast,east coast,east coast
16954,771,527,5368,6.0,6,east-coast,east-coast,east-coast
16959,771,3317,5368,6.0,6,east-coast,east-coast,east-coast


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
3247,152,89,1653,7.0,11,pop bitch,pop bitch,pop bitch
3256,152,293,1653,7.0,11,pop bitch,pop bitch,pop bitch
3298,152,466,1653,7.0,11,pop bitch,pop bitch,pop bitch
17989,862,289,5885,7.0,4,pop bitches,pop bitches,pop bitch
17993,862,300,5885,7.0,4,pop bitches,pop bitches,pop bitch


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
16661,759,154,623,8.0,5,depression,depression,depression
33771,1734,462,1494,8.0,6,depressing,depressing,depress
33797,1734,3320,1494,8.0,6,depressing,depressing,depress


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
18000,862,466,5890,9.0,1,best song on the album,best song on the album,best song album


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
651,31,72,498,10.0,11,i love 80s,i love 80s,love 80
655,31,72,506,10.0,5,i love 90s,i love 90s,love 90
711,31,1001,498,10.0,11,i love 80s,i love 80s,love 80
718,31,1001,506,10.0,5,i love 90s,i love 90s,love 90
741,31,1028,506,10.0,5,i love 90s,i love 90s,love 90


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
6354,297,196,2609,11.0,25,freak folk,freak people,freak people
6393,297,231,2609,11.0,25,freak folk,freak people,freak people
6496,297,5424,2609,11.0,25,freak folk,freak people,freak people
24248,1205,13100,7814,11.0,1,freak-folk,freak-folk,freak-folk
31534,1637,440,2609,11.0,25,freak folk,freak people,freak people


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
35025,1800,13187,3998,12.0,2,tech-house,tech-house,tech-house
37708,1907,16666,3045,12.0,15,tech house,tech house,tech house


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
5166,235,769,1322,14.0,33,post-metal,post-metal,post-metal
20463,993,769,1322,14.0,33,post-metal,post-metal,post-metal
20482,993,11609,1322,14.0,33,post-metal,post-metal,post-metal
20491,993,11613,1322,14.0,33,post-metal,post-metal,post-metal
20501,993,11621,1322,14.0,33,post-metal,post-metal,post-metal


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
29308,1485,9487,9565,15.0,9,dance punk,dance punk,dance punk
31922,1661,291,9565,15.0,9,dance punk,dance punk,dance punk
31936,1661,293,9565,15.0,9,dance punk,dance punk,dance punk
32018,1661,546,9565,15.0,9,dance punk,dance punk,dance punk
37382,1895,546,9565,15.0,9,dance punk,dance punk,dance punk


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
16916,771,266,5371,16.0,4,rick villa,rick villa,rick villa
16925,771,267,5369,16.0,3,rick-villa,rick-villa,rick-villa
16960,771,3317,5371,16.0,4,rick villa,rick villa,rick villa
16972,771,4696,5371,16.0,4,rick villa,rick villa,rick villa
16990,771,9900,5369,16.0,3,rick-villa,rick-villa,rick-villa


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
6979,334,1520,2879,19.0,2,indie punk,indie punk,indie punk


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
8218,388,6056,3189,20.0,9,game soundtrack,game soundtrack,game soundtrack
8219,388,6077,3189,20.0,9,game soundtrack,game soundtrack,game soundtrack
8221,388,6392,3189,20.0,9,game soundtrack,game soundtrack,game soundtrack
8222,388,6393,3189,20.0,9,game soundtrack,game soundtrack,game soundtrack
8223,388,6394,3189,20.0,9,game soundtrack,game soundtrack,game soundtrack


In [None]:
cluster_corr = pd.read_csv('data/tags/correlation_tags_train_0.1_average.csv')
cluster_corr = cluster_corr.merge(tags, how='left', left_on='tagID', right_on='tagID')
for index, group in cluster_corr.groupby('cluster'):
  if index > check_cluster_number:
    break
  display(group.head())

Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
24328,1177,371,7640,0.0,36,top christian,top christian,top christian
24329,1177,371,7641,0.0,35,diego 12,diego 12,diego 12
24343,1177,1470,7640,0.0,36,top christian,top christian,top christian
24344,1177,1470,7641,0.0,35,diego 12,diego 12,diego 12
24354,1177,1855,7640,0.0,36,top christian,top christian,top christian


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
6521,296,5367,2606,1.0,18,sixties,sixties,sixty
15488,692,227,2606,1.0,18,sixties,sixties,sixty
15492,692,227,4920,1.0,17,1960s,1960s,1960s
15493,692,227,4921,1.0,18,1960's,1960's,1960 's
15524,692,1242,2606,1.0,18,sixties,sixties,sixty


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
4497,196,534,2004,2.0,4,no doubt,no doubt,doubt
9651,421,525,2690,2.0,6,gwen stefani,gwen stefani,gwen stefani
18883,885,10822,2690,2.0,6,gwen stefani,gwen stefani,gwen stefani


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
8707,392,679,3211,3.0,5,glee cast,glee cast,glee cast
8709,392,679,3214,3.0,1,glee season 2,glee season 2,glee season 2
21001,966,679,6414,3.0,1,diana agron,diana agron,diana agron
21002,966,679,6415,3.0,1,quinn fabray,quinn fabray,quinn fabray
21003,966,679,6417,3.0,1,quinn,quinn,quinn


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
1090,45,1428,633,4.0,16,miley cyrus,miley cyrus,miley cyrus
1395,58,300,633,4.0,16,miley cyrus,miley cyrus,miley cyrus
6873,314,461,633,4.0,16,miley cyrus,miley cyrus,miley cyrus
12377,538,907,633,4.0,16,miley cyrus,miley cyrus,miley cyrus
15220,680,461,633,4.0,16,miley cyrus,miley cyrus,miley cyrus


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
640,29,55,366,5.0,5,kylie minogue,kylie minogue,kylie minogue
5106,227,55,2227,5.0,1,all-time favourites,all-time favourites,all-time favourite
14032,612,55,4451,5.0,1,xenomania,xenomania,xenomania
16915,749,55,5242,5.0,1,shoulhavemoreplays,shoulhavemoreplays,shoulhavemoreplays
18872,885,10822,366,5.0,5,kylie minogue,kylie minogue,kylie minogue


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
12397,540,1075,4105,6.0,1,lottie,lottie,lottie
15012,665,1075,4788,6.0,5,iamx,iamx,iamx
27090,1282,1075,4788,6.0,5,iamx,iamx,iamx
27094,1282,3109,4788,6.0,5,iamx,iamx,iamx
35841,1777,1075,4788,6.0,5,iamx,iamx,iamx


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
7358,337,1524,329,7.0,5,slayer,slayer,slayer
13234,579,841,4331,7.0,1,fuckin great thrash,fuckin great thrash,fuckin great thrash


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
19930,924,314,3683,8.0,5,ciara,ciara,ciara
32093,1593,314,9999,8.0,1,crunck,crunck,crunck
42889,2061,314,3683,8.0,5,ciara,ciara,ciara


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
3927,177,220,1892,9.0,16,funk rock,funk rock,funk rock
13574,594,8381,1892,9.0,16,funk rock,funk rock,funk rock
21132,977,220,6466,9.0,1,rhcp,rhcp,rhcp
21133,977,220,6467,9.0,1,love ballad,love ballad,love ballad
21134,977,220,6473,9.0,1,my rhcp favourite song,my rhcp favourite song,rhcp favourite song


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
7044,321,724,2736,10.0,5,ozzy osbourne,ozzy osbourne,ozzy osbourne
7045,321,724,2740,10.0,1,randy rhoads,randy rhoads,randy rhoads
7047,321,724,2750,10.0,1,metal icon,metal icon,metal icon
7048,321,724,2751,10.0,1,polital message,polital message,polital message
7050,321,724,2767,10.0,1,vintage,vintage,vintage


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
33648,1667,154,10392,11.0,10,romantic rock,romantic rock,romantic rock
33649,1667,154,10393,11.0,12,indie-romantic,indie-romantic,indie-romantic
33651,1667,1106,10393,11.0,12,indie-romantic,indie-romantic,indie-romantic
33657,1667,1798,10392,11.0,10,romantic rock,romantic rock,romantic rock
33658,1667,1798,10393,11.0,12,indie-romantic,indie-romantic,indie-romantic


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
15560,692,1416,696,12.0,25,1970s,1970s,1970s
15568,692,1416,4922,12.0,26,1970's,1970's,1970 's
15583,692,1464,696,12.0,25,1970s,1970s,1970s
15594,692,1464,4922,12.0,26,1970's,1970's,1970 's
15621,692,1820,696,12.0,25,1970s,1970s,1970s


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
28936,1394,1243,3202,14.0,5,mika,mika,mika
28943,1394,1400,3202,14.0,5,mika,mika,mika


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
15495,692,227,4927,15.0,18,nineties,nineties,ninety
15496,692,227,4929,15.0,15,1990's,1990's,1990 's
15507,692,997,4927,15.0,18,nineties,nineties,ninety
15508,692,997,4929,15.0,15,1990's,1990's,1990 's
15616,692,1686,4927,15.0,18,nineties,nineties,ninety


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
1775,63,1808,848,16.0,163,hair metal,hair metal,hair metal
2277,84,2341,848,16.0,163,hair metal,hair metal,hair metal
2733,109,1810,848,16.0,163,hair metal,hair metal,hair metal
3508,157,1810,848,16.0,163,hair metal,hair metal,hair metal
6261,268,2790,1286,16.0,101,glam metal,glam metal,glam metal


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
4846,213,190,2105,17.0,1,awesome rock,awesome rock,awesome rock
15455,691,190,4918,17.0,1,weird but good,weird but good,weird good
19205,904,190,2009,17.0,15,muse,muse,muse
21483,997,190,6601,17.0,1,techno muse,techno muse,techno muse
23808,1130,12547,2009,17.0,15,muse,muse,muse


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
38444,1882,17120,11517,18.0,11,world groove,world groove,world groove
38471,1882,17338,11517,18.0,11,world groove,world groove,world groove
38473,1882,17338,11523,18.0,9,maghrebi,maghrebi,maghrebi
38515,1882,17359,11517,18.0,11,world groove,world groove,world groove
38521,1882,17359,11523,18.0,9,maghrebi,maghrebi,maghrebi


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
17535,771,267,5368,19.0,6,east-coast,east-coast,east-coast
17537,771,267,5370,19.0,5,ft-lauderdale,ft-lauderdale,ft-lauderdale
17552,771,271,5368,19.0,6,east-coast,east-coast,east-coast
17553,771,271,5370,19.0,5,ft-lauderdale,ft-lauderdale,ft-lauderdale
17565,771,527,5368,19.0,6,east-coast,east-coast,east-coast


Unnamed: 0,userID,artistID,tagID,cluster,count,tagValue,translated,preprocessed
38441,1882,17120,11514,20.0,11,algeria,algeria,algeria
38442,1882,17120,11515,20.0,11,rabat,discount,discount
38443,1882,17120,11516,20.0,11,tanger,tangier,tangier
38445,1882,17120,11518,20.0,11,oran,rate,rate
38446,1882,17120,11519,20.0,10,alger,alger,alger
