# Using Wikipedia Mentions to Calculate Similarity Between Musical Artists

## 1. Introduction

Wikipedia is one of the richest sources of textual information on the internet, and it has detailed articles about thousands of musical acts. More importantly for this work, the articles of these artists often mention other artists. Take this example, extracted from the Wikipedia page on Bob Dylan:

_"Others who had hits with Dylan's songs in the early 1960s included **the Byrds**, **Sonny & Cher**, **the Hollies**, **Peter, Paul and Mary**, **the Association**, **Manfred Mann** and **the Turtles**."_

In just one sentence, we have references to seven other artists. Mentions like these are spread throughout the article and point to acts that are somehow linked to Bob Dylan, whether they influenced him, were influenced by him, or are connected in some other way.

Given a dataset that records, for each artist, how many times its Wikipedia article references other artists, this notebook uses that information to calculate the similarity between musical acts.

## 2. Setup and Dataset

Let's start by importing the libraries we will use and also blocking warnings from showing up.

In [146]:
import math
import numpy as np
import pandas as pd
import warnings
from IPython.display import display, Markdown
from scipy.sparse import dok_matrix
from sklearn.metrics.pairwise import cosine_similarity

warnings.filterwarnings('ignore')

Now we open the dataset and take a look at some sample data. Our dataset has a total of three columns.

**ARTIST_NAME** is the name of the artist.

**MENTIONED_ARTISTS** is a list with all artists mentioned in the Wikipedia article of that act. Entries are separated by semicolons, and each artist is followed by a colon and a number indicating how many times it is mentioned. If the column is empty, the article does not reference any artists that are in the dataset.

**ARTIST_CATEGORY** is the Wikipedia category from which the article was extracted.

Note that, for this notebook, only the first two columns will be used.
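To make the MENTIONED_ARTISTS format concrete, here is a minimal sketch that parses one cell into a dictionary. The helper name `parse_mentions` is hypothetical and not part of the notebook, and the sketch assumes the count is always the last colon-separated field. The sample string is taken from the dataset preview.

```python
def parse_mentions(cell):
    """Turn 'Name:count;Name:count' into a {name: count} dictionary (hypothetical helper)."""
    mentions = {}
    if not cell:
        # An empty cell means the article mentions no artist in the dataset.
        return mentions
    for item in cell.split(';'):
        # Split on the last colon only, assuming the count is the final field.
        name, count = item.rsplit(':', 1)
        mentions[name] = int(count)
    return mentions


sample = 'Big Bill Broonzy:2;Papa Charlie Jackson:1'
print(parse_mentions(sample))
print(parse_mentions(''))
```

Splitting on the last colon is a defensive choice: it keeps a name intact if it happens to contain a colon itself.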

In [147]:
# Loading the dataset and filling its NaN values with empty strings.
df_matrix = pd.read_csv('../input/wikipedia-music-links/matrix.csv', sep=',', encoding='UTF-8')
df_matrix = df_matrix.fillna('')

# Let's expand the column width to the maximum.
pd.set_option('display.max_colwidth', None)

# Showing dataframe.
df_matrix.head(5).style.hide_index()

| ARTIST_NAME | MENTIONED_ARTISTS | ARTIST_CATEGORY |
|---|---|---|
| !!! | Cake (band):1;LCD Soundsystem:1;Nate Dogg:1;Nic Offer:6;Out Hud:3;Red Hot Chili Peppers:1;The Magnetic Fields:2 | musical_groups_established_in_1996 |
| !Action Pact! |  | musical_groups_established_in_1981 |
| "Big Boy" Teddy Edwards | Big Bill Broonzy:2;Papa Charlie Jackson:1 | 20th_century_american_singers |
| "Frantic" Fay Thomas | Nellie Lutcher:1;Rose Murphy:1 | 20th_century_american_singers |
| "Scarface" John Williams | Bobby Marchan:7;Danny Barker:1;Earl King:1;Little Miss Cornshucks:1;Mahalia Jackson:1;The Meters:1;The Wild Tchoupitoulas:4 | 20th_century_american_singers |


Just for the sake of record keeping, let's print some general information on the dataset. Let's see how many artists the dataset has, how many categories all these acts come from, and the number of mentions that are registered.

Note that when one artist's article has more than one mention to another act, we count it as one. In other words, we are counting unique links. 

In [148]:
# Counting categories and links.
unique_categories = []
total_links = 0

# Iterating over the whole dataset.
for idx, row in df_matrix.iterrows():
    
    # The category column can have more than one value. If that's the case, they will be separated by a semicolon.
    # Here we count how many unique values there are without considering unique pairs, trios, etc as being unique.
    split_result = row['ARTIST_CATEGORY'].split(';')
    for category in split_result:
        if category not in unique_categories:
            unique_categories.append(category)
            
    # We split mentioned artists by a semicolon, turning the string into a list. Then we count how many items the
    # list has as the total number of links.
    split_result = row['MENTIONED_ARTISTS'].split(';')
    total_links += len(split_result)

# Creating dataframe with the counts we want.
totalizer_df = pd.DataFrame(
    {'Total Artists': ["{:,}".format(df_matrix['ARTIST_NAME'].nunique())],
     'Total Categories': [len(unique_categories)],
     'Total Links': ["{:,}".format(total_links)]
    })

# Showing dataframe.
totalizer_df.style.hide_index()

| Total Artists | Total Categories | Total Links |
|---|---|---|
| 42828 | 85 | 289595 |
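One detail of the link count is worth keeping in mind. In Python, `''.split(';')` returns `['']`, a list with one element, so a row with an empty MENTIONED_ARTISTS cell still adds one to a total computed with `len(cell.split(';'))`. The sketch below uses three cells from the sample rows shown earlier and a hypothetical helper named `count_links` to show a count that skips empty cells.

```python
# Three MENTIONED_ARTISTS cells taken from the sample rows shown earlier.
cells = [
    'Cake (band):1;LCD Soundsystem:1;Nate Dogg:1;Nic Offer:6;Out Hud:3;'
    'Red Hot Chili Peppers:1;The Magnetic Fields:2',
    '',  # "!Action Pact!" mentions no artist in the dataset
    'Big Bill Broonzy:2;Papa Charlie Jackson:1',
]


def count_links(cells):
    """Count unique links, ignoring empty cells (hypothetical helper)."""
    return sum(len(cell.split(';')) for cell in cells if cell)


# Splitting every cell, including the empty one, counts one link too many.
naive_total = sum(len(cell.split(';')) for cell in cells)

print(naive_total)
print(count_links(cells))
```

If the exact number of links matters, filtering out empty cells before splitting is a simple way to avoid the extra count.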


## 3. Artist to Artist Matrix

The first step in our similarity calculation will be the creation of what we will call an Artist to Artist Matrix. In it, each row and column will be assigned to an artist. The index of the row and column representing the artist will always be the same. 

For example, Row 1 and Column 1 may be assigned to Bob Dylan; Row 2 and Column 2 may be assigned to The Beatles; Row 3 and Column 3 may be assigned to The Clash; and so forth. The goal of this matrix is to record, in a math-friendly format, the links between artists found in our dataset.

If Bob Dylan's article has three references to The Beatles, his row (Number 1) will have a value of 3 in The Beatles' column (Number 2). If The Beatles' article has one reference to The Clash, the former's row (Number 2) will have a value of 1 in The Clash's column (Number 3).
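The example above can be sketched with a plain dictionary keyed by (row, column), which mirrors the "dictionary of keys" idea behind a sparse matrix. This is only an illustration with three artists and the two mentions described in the text. The variable names are hypothetical, and the indexes here start at 0 rather than 1.

```python
artists = ['Bob Dylan', 'The Beatles', 'The Clash']
artist_to_index = {name: i for i, name in enumerate(artists)}

# (source article, mentioned artist, number of mentions), as in the example.
mentions = [
    ('Bob Dylan', 'The Beatles', 3),
    ('The Beatles', 'The Clash', 1),
]

# Only non-zero cells are stored, keyed by (row index, column index).
toy_matrix = {}
for source, target, count in mentions:
    toy_matrix[(artist_to_index[source], artist_to_index[target])] = count

# Expand to a dense list of lists just to display the structure.
size = len(artists)
dense = [[toy_matrix.get((r, c), 0) for c in range(size)] for r in range(size)]
for name, row in zip(artists, dense):
    print(name, row)
```

Storing only the non-zero cells is what makes the full matrix manageable, since most pairs of artists never mention each other.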

The code below builds that matrix. It's a lot of processing, so it might take a bit.

In [149]:
# Here, we get all unique artists available on the dataset. They will be used to fill our Artist x Artist matrix.
list_unique_artists = df_matrix['ARTIST_NAME']

# Getting the count of total artists. This will determine the size of our Artist x Artist matrix.
total_artists = len(list_unique_artists)

# Creating a sparse matrix with the number of rows and columns being equal to the total number of artists.
artist_artist_matrix = dok_matrix((total_artists, total_artists))

# We need to associate every artist to an index. To do that, we will create two dictionaries. In one, the index will be 
# the key and the name of the artist will be the value. In the other, it will be the other way around. We do that
# because we will need to translate artist to index in the two directions depending on the situation. We go through the list 
# of artists one by one, and make the necessary associations.
index_to_artist_dictionary = {}
artist_to_index_dictionary = {}
index = 0

for artist in list_unique_artists:
    index_to_artist_dictionary[index] = artist
    artist_to_index_dictionary[artist] = index
    index += 1
    
# Now we build the sparse matrix. We begin by iterating over the lines of the dataset.
for idx, row in df_matrix.iterrows():
    
    # Each row of the dataset has two columns. ARTIST_NAME tells us the artist to which the line refers to, and 
    # MENTIONED_ARTISTS lists all artists mentioned in the Wikipedia article of ARTIST_NAME.
    artist_name = row['ARTIST_NAME']
    mentioned_artists = row['MENTIONED_ARTISTS']
    
    # Let's go ahead and get the index that corresponds to this artist.
    artist_index = artist_to_index_dictionary[artist_name]
    
    # MENTIONED_ARTISTS can be empty if the artist's Wikipedia article does not mention any other artists that are
    # in the dataset. We do not want to do anything with those.
    if mentioned_artists:
    
        # MENTIONED_ARTISTS is actually a string of tuples separated by semicolons. Here, we split that string into
        # a list.
        list_of_mentioned_artists = mentioned_artists.split(';')
        
        # Now we go through the list of mentioned artists.
        for mentioned_artist in list_of_mentioned_artists:

            # Each item of the list is composed of the artist name and the number of times it is mentioned. The two
            # values are separated by a colon. We split on the last colon only, so that a name that itself contains
            # a colon stays intact, and we convert the count to an integer.
            mentioned_artist_name, number_of_mentions = mentioned_artist.rsplit(':', 1)
            number_of_mentions = int(number_of_mentions)
            
            # Now we have the name of the artist mentioned in the Wikipedia article. Let's get its index. But first,
            # let's make sure the index for this individual exists.
            if mentioned_artist_name in artist_to_index_dictionary:
                mentioned_artist_index = artist_to_index_dictionary[mentioned_artist_name]

                # Finally, we add the number of mentions to our sparse matrix, with artist corresponding to the row, and
                # mentioned_artist corresponding to the column.
                artist_artist_matrix[artist_index, mentioned_artist_index] = number_of_mentions

Before we continue, how about we look at an example of three rows in order to better visualize the structure that was constructed? We will obviously not show all columns of those rows because there are too many of them. Instead, we will just sample a few specific columns.

Note that, for the sake of clarity, the indexes of the rows and columns were replaced by the artist name they represent.

In [150]:
# These are the artists we will use as part of the example.
example_rows_label = ['Bob Dylan', 'The Beatles', 'The Clash']
example_columns_label = ['Bob Dylan', 'Chuck Berry', 'Neil Young', 'The Beatles', 'The Byrds', 'The Clash', 'The Kinks', 'The Rolling Stones', 'Sex Pistols', 'Tom Waits']

# Let's get the indexes they correspond to.
example_rows_index = [artist_to_index_dictionary[artist] for artist in example_rows_label]
example_columns_index = [artist_to_index_dictionary[artist] for artist in example_columns_label]

# Filtering full matrix according to the indexes we want.
example_matrix = artist_artist_matrix[:, example_columns_index]
example_matrix = example_matrix[example_rows_index, :].toarray()

# We make sure all rows will be displayed.
pd.set_option('display.max_rows', len(example_rows_label))

# We turn the matrix into a dataframe and then show it on the screen.
example_matrix = pd.DataFrame(example_matrix, columns=example_columns_label, index=example_rows_label)
example_matrix

|  | Bob Dylan | Chuck Berry | Neil Young | The Beatles | The Byrds | The Clash | The Kinks | The Rolling Stones | Sex Pistols | Tom Waits |
|---|---|---|---|---|---|---|---|---|---|---|
| Bob Dylan | 0.0 | 1.0 | 2.0 | 3.0 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| The Beatles | 3.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | 2.0 | 0.0 | 0.0 |
| The Clash | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 17.0 | 0.0 |


## 4. Select Your Favorite Artist

In order to make this more interesting, this next step lets you pick the artist against which similarity will be computed. We could of course calculate similarity between all artists, but as of the creation of this notebook the Kaggle infrastructure cannot handle the processing necessary for that. Plus, calculating how similar other artists are in relation to a specific one is much faster, naturally.

Define the value of the *selected_artist* and the next steps of this notebook will use the Wikipedia mentions to look for artists that are similar to that one. The code below will tell you if the artist you have selected is part of the dataset. If it is not, the next steps will not work.

Note that the name of the artist must be exactly as it appears in the title of its Wikipedia article. For most artists, that is simply their name, as in the case of The Beatles. But some artists, like the Pixies, have their Wikipedia article titled "Pixies (band)" because there is already an entry related to the word "Pixies". As such, if you are looking for the Pixies or other bands like that, make sure to use the full title, including the disambiguation term.

In [151]:
selected_artist = 'The Clash'

if selected_artist in artist_to_index_dictionary:
    display(Markdown('Artist Found!'))
else:
    display(Markdown('Artist Not Found!<br> Adjust the name or look for another one before continuing. Remember that the name must be exactly as it appears in the title of their Wikipedia article, including the disambiguation term.'))

Artist Found!
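If an artist is not found, a case-insensitive substring search over the known names can help locate the exact title. The sketch below is a hypothetical helper, `find_candidates`, run over a small made-up list of names. In the notebook, the keys of `artist_to_index_dictionary` could be passed in instead.

```python
def find_candidates(query, artist_names):
    """Return names that contain the query, ignoring case (hypothetical helper)."""
    needle = query.casefold()
    return sorted(name for name in artist_names if needle in name.casefold())


# A tiny illustrative list; the real dataset has tens of thousands of names.
toy_names = ['Pixies (band)', 'The Beatles', 'The Clash', 'Cake (band)']

print(find_candidates('pixies', toy_names))
print(find_candidates('the', toy_names))
```

The result can then be copied verbatim into *selected_artist*, disambiguation term included.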

## 5. Similarity Calculation

The code below calculates the similarity of all artists in the dataset in relation to the one that was chosen previously. The selected similarity metric is Cosine Similarity, mainly because it yields a number whose meaning is very easy to assess. If Cosine Similarity equals 1, the mention vectors of the two artists point in the same direction, that is, they reference the same artists in the same proportions. If Cosine Similarity equals 0, the two artists have no mentions in common.

The closer similarity is to 1, then, the more similar the artist is to the one that was selected. Below, we are actually calculating three types of similarity scores. They will be explained during the next sections of this notebook.
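As a small worked example of the metric, the sketch below computes Cosine Similarity by hand for three made-up mention vectors. The function `cosine` is a plain-Python illustration, not the scikit-learn implementation used in the notebook. Returning 0 when one of the vectors is all zeros is a convention chosen for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (illustrative sketch)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention for this sketch: an artist with no mentions is similar to no one.
        return 0.0
    return dot / (norm_u * norm_v)


a = [3, 1, 0]  # mentions made by a hypothetical artist
b = [6, 2, 0]  # same proportions as a, just twice as many mentions
c = [0, 0, 5]  # no mentioned artist in common with a

print(cosine(a, b))  # approximately 1
print(cosine(a, c))  # exactly 0.0
```

Note that `a` and `b` score as maximally similar even though their raw counts differ, because only the direction of the vectors matters.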

In [152]:
# First, we obtain the index that corresponds to the selected artist.
selected_artist_index = artist_to_index_dictionary[selected_artist]

# We obtain the number of lines (artists) the Artist x Artist matrix has.
# This will be useful soon.
number_of_lines_sparse_matrix = artist_artist_matrix.shape[0]

# Due to hardware limitations, we will calculate similarities in batches since Kaggle cannot handle
# the full Artist x Artist matrix in its non-sparse format. Here we set each batch to be the size of
# 10,000 artists, and we calculate - according to the full matrix size - the number of iterations
# that will be necessary to process all artists.
batch = 10000
total_iterations = math.ceil(number_of_lines_sparse_matrix / batch)
lower_limit = 0

# We are going to calculate two types of similarity, and here we create the lists that will store these
# different results.
# similarity_scores_mentions_from will store the similarity scores as measured according to the mentions
# that the articles contain.
# similarity_scores_mentions_to will store the similarity scores as measured according to the articles
# that make mentions to that artist.
similarity_scores_mentions_from = []
similarity_scores_mentions_to = []

# Let's calculate. The number of iterations was defined according to our batch size.
for i in range(0, total_iterations):
    
    # Here we determine what slice of the matrix this current iteration will process.
    # lower_limit will be zero on the first go round.
    # upper_limit is exclusive, as in any Python slice. It is limited by either lower_limit + the batch size,
    # or by the number of lines in the matrix, because if only the batch size was taken into account, the
    # final iteration would end up trying to read beyond the last line of the matrix.
    lower_limit = i * batch
    upper_limit = min(lower_limit + batch, number_of_lines_sparse_matrix)
    
    # Here we calculate the similarity for the artists in this batch. Note that we are always comparing the selected artist to all artists
    # in the matrix that fall inside the range of this iteration. 
    # In the case of batch_similarity_from, we are calculating the similarity of the rows that belong to the artists, since those have the
    # mentions made in their articles.
    # In the case of batch_similarity_to, we are calculating the similarity of the columns that belong to the artists, since those have the
    # mentions that are made to their articles.
    batch_similarity_from = cosine_similarity(artist_artist_matrix[selected_artist_index,:], artist_artist_matrix[lower_limit:upper_limit,:])
    batch_similarity_to = cosine_similarity(artist_artist_matrix[:,selected_artist_index].transpose(), artist_artist_matrix[:,lower_limit:upper_limit].transpose())
    
    # We add the similarities calculated in this batch to the lists that will store all similarity scores.
    similarity_scores_mentions_from.extend(list(batch_similarity_from[0]))
    similarity_scores_mentions_to.extend(list(batch_similarity_to[0]))

# As a third similarity metric, we get the average of similarity_scores_mentions_from and similarity_scores_mentions_to.
similarity_scores_mentions_from_to = np.divide(np.array(similarity_scores_mentions_from) + np.array(similarity_scores_mentions_to), 2)

# Here, we get the order of the indexes of the artists according to each one of the similarity scores we calculated.
# This will be used further down.
order_mentions_from = np.array(similarity_scores_mentions_from).argsort()[::-1]
order_mentions_to = np.array(similarity_scores_mentions_to).argsort()[::-1]
order_mentions_from_to = np.array(similarity_scores_mentions_from_to).argsort()[::-1]
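The batching arithmetic can be checked in isolation. The sketch below uses a hypothetical helper, `batch_bounds`, together with the 42,828 artists reported in the dataset summary, to list the slice that each iteration would cover. The upper bound is exclusive, as in any Python slice.

```python
import math


def batch_bounds(number_of_lines, batch):
    """Return the (lower, upper) slice limits of each batch (hypothetical helper)."""
    total_iterations = math.ceil(number_of_lines / batch)
    return [
        (i * batch, min((i + 1) * batch, number_of_lines))
        for i in range(total_iterations)
    ]


for lower, upper in batch_bounds(42828, 10000):
    print(lower, upper)
```

The last batch is shorter than the others, which is why the upper limit has to be capped at the number of lines in the matrix.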

### 5.1 Metric 1 - Similarity According to Mentions in Article

The first metric we will consider is the one we have been talking about since the beginning of the notebook: how similar artists are to one another according to the artists that are mentioned in their Wikipedia articles.

Therefore, with this metric, artists whose articles have a high rate of mentions that overlap with those present in the article of the selected artist will achieve a high similarity score.

Here, we look at the Top 20 according to that metric.

In [153]:
top_indexes_mentions_from = order_mentions_from[1:21]

similarity_mentions_from_data_frame = pd.DataFrame(
    {'Position': range(1, len(top_indexes_mentions_from) + 1),
     'Artist': [index_to_artist_dictionary[index] for index in top_indexes_mentions_from],
     'Similarity to '  + selected_artist: [similarity_scores_mentions_from[index] for index in top_indexes_mentions_from]
    })

similarity_mentions_from_data_frame.style.hide_index()

| Position | Artist | Similarity to The Clash |
|---|---|---|
| 1 | Sid Vicious | 0.648829 |
| 2 | Ex Pistols | 0.640057 |
| 3 | Jess Conrad | 0.636655 |
| 4 | Afternoons (band) | 0.636655 |
| 5 | Stisism | 0.636655 |
| 6 | U.S. Crush | 0.636655 |
| 7 | Tenpole Tudor | 0.624292 |
| 8 | John Lydon | 0.580917 |
| 9 | Bollock Brothers | 0.550926 |
| 10 | The Professionals (band) | 0.546677 |


Now, you may be wondering: what were the exact mentions in the Wikipedia articles of these artists that made them similar to the one I selected? What are the artists that join them? The next analysis will answer those questions. However, first, in order to make the analysis cleaner and simpler, let's not look at the whole Top 20, but rather at one of the artists in that list. 

By default, the code below is set so the comparison is made between the artist that was selected by you and the 1st artist of the previous list (the most similar). But you can change that via the *position* variable below, with 1 corresponding to the 1st, 2 to the 2nd, and so forth.

Just make sure to set it as a number between 1 and 20.

In [154]:
position = 1

selected_option = top_indexes_mentions_from[position - 1]

display(Markdown('Selected Artist: ' + index_to_artist_dictionary[selected_option]))

Selected Artist: Sid Vicious

Finally, you can determine what artists will show up in the comparison.

Set *remove_non_overlapping_artists* to True if you only want to see the artists simultaneously present in both articles. Set *remove_non_overlapping_artists* to False if you want to see all artists that appear in any of the two articles.
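The effect of this flag comes down to a set intersection versus a set union. The sketch below illustrates that with two made-up mention dictionaries and a hypothetical helper, `artists_to_show`. The artist names and counts are placeholders, not values from the dataset.

```python
# Hypothetical mention counts for two articles (placeholder names and values).
mentions_by_first = {'Artist A': 17, 'Artist B': 15, 'Artist C': 2}
mentions_by_second = {'Artist A': 44, 'Artist B': 1, 'Artist D': 5}


def artists_to_show(first, second, remove_non_overlapping):
    """Pick the rows of the comparison table (hypothetical helper)."""
    if remove_non_overlapping:
        # Only artists mentioned in both articles.
        return sorted(set(first) & set(second))
    # Artists mentioned in at least one of the two articles.
    return sorted(set(first) | set(second))


print(artists_to_show(mentions_by_first, mentions_by_second, True))
print(artists_to_show(mentions_by_first, mentions_by_second, False))
```

Only the overlapping artists contribute to the dot product of the Cosine Similarity, so the intersection shows what actually drives the score.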

In [155]:
remove_non_overlapping_artists = True

With those configurations set, the table below will present the results of our analysis. You can now take a look at which artists caused your selected artist to be similar to any of those in the Top 20. The code below will do the necessary processing and output the table with the results.

In [156]:
# First we make a list with all artists mentioned by the artist that was selected from the Top 20.
# Then, we split each "name:count" item of that list into two tuples, one containing the names of the
# artists and the other the number of mentions made to each act. We split on the last colon only, so
# that a name that itself contains a colon stays intact.
artists_mentioned_by_nth_most_similar_artist = df_matrix.loc[selected_option, 'MENTIONED_ARTISTS'].split(';')
artists_mentioned_by_nth_most_similar_artist, number_mentions_by_most_similar = zip(*(artist.rsplit(':', 1) for artist in artists_mentioned_by_nth_most_similar_artist))

# Now, we repeat the same procedure with the artist in relation to which similarities were computed.
artists_mentioned_by_selected_artist = df_matrix.loc[artist_to_index_dictionary[selected_artist], 'MENTIONED_ARTISTS'].split(';')
artists_mentioned_by_selected_artist, number_mentions_by_selected = zip(*(artist.rsplit(':', 1) for artist in artists_mentioned_by_selected_artist))

# We need to check what artists we will display in the comparison. To do that, we will use the remove_non_overlapping_artists
# variable set by the user. If it is True, we will only show artists that appear in the two articles. If it is false, we will
# show all artists that appear in any of the two articles.
artists_to_display = []
if (remove_non_overlapping_artists):
    artists_to_display = set(artists_mentioned_by_nth_most_similar_artist) & set(artists_mentioned_by_selected_artist)
else:
    artists_to_display = set(artists_mentioned_by_nth_most_similar_artist + artists_mentioned_by_selected_artist)

# We only have the index of the artist selected from the Top 20 by the user. Let's get the name of the act so we can
# show it in the dataframe.
nth_most_similar_artist = index_to_artist_dictionary[selected_option]

# These will be the names of our columns.
column_nth_most_similar = 'Number of Mentions by ' + nth_most_similar_artist
column_selected = 'Number of Mentions by ' + selected_artist

# We create an empty dataframe with those columns.
df_mentions_comparison = pd.DataFrame(columns=(column_selected, column_nth_most_similar),
                                  index=artists_to_display)

# We start by setting all values of the dataframe to zero (no mentions).
df_mentions_comparison = df_mentions_comparison.fillna(0)

# We are ordering the dataframe by index so it is in alphabetical order.
df_mentions_comparison = df_mentions_comparison.sort_index()

# For each artist that will be displayed, we check how many mentions the articles of each of the two artists in our comparison
# have to that act. We then add that number to the corresponding dataframe row x column pairing.
for artist in artists_to_display:
    
    if artist in artists_mentioned_by_nth_most_similar_artist:
        mentions_to_artist_by_most_similar = number_mentions_by_most_similar[artists_mentioned_by_nth_most_similar_artist.index(artist)]
        df_mentions_comparison.loc[artist, column_nth_most_similar] = mentions_to_artist_by_most_similar
    
    if artist in artists_mentioned_by_selected_artist:
        mentions_to_artist_by_selected = number_mentions_by_selected[artists_mentioned_by_selected_artist.index(artist)]
        df_mentions_comparison.loc[artist, column_selected] = mentions_to_artist_by_selected

# We show the result, making sure all rows are displayed.
pd.set_option('display.max_rows', len(artists_to_display))

df_mentions_comparison

|  | Number of Mentions by The Clash | Number of Mentions by Sid Vicious |
|---|---|---|
| David Bowie | 1 | 1 |
| Joe Strummer | 15 | 1 |
| John Lydon | 1 | 4 |
| Public Image Ltd | 2 | 1 |
| Ramones | 2 | 3 |
| Sex Pistols | 17 | 44 |
| Siouxsie and the Banshees | 1 | 3 |
| The Damned (band) | 2 | 2 |
| The Slits | 2 | 1 |


### 5.2 Metric 2 - Similarity According to Mentions to Artist

The second metric we will analyze is a bit different. We are not considering the mentions in the articles of the artists we are comparing. We are taking into account, instead, the artists that mention them. In other words, we want to know how similar artists are to one another according to the artists that mention them.

Therefore, with this metric, artists who are mentioned by a similar set of articles will have high similarity scores.
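The difference between the two metrics is whether rows or columns of the Artist to Artist Matrix are compared. The sketch below uses a made-up 4 x 4 matrix with placeholder artists, in which artists B and C mention nobody (empty rows) but are mentioned by the same articles in the same proportions (matching columns). The helpers `cosine` and `column` are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def column(matrix, c):
    """Extract column c, that is, the mentions made TO artist c."""
    return [row[c] for row in matrix]


# matrix[r][c] = number of times the article of artist r mentions artist c.
# Rows and columns follow the order A, B, C, D (placeholder artists).
matrix = [
    [0, 2, 1, 0],  # A mentions B twice and C once
    [0, 0, 0, 0],  # B mentions nobody
    [0, 0, 0, 0],  # C mentions nobody
    [0, 4, 2, 1],  # D mentions B four times, C twice, and itself once
]

similarity_from = cosine(matrix[1], matrix[2])                   # rows of B and C
similarity_to = cosine(column(matrix, 1), column(matrix, 2))     # columns of B and C

print(similarity_from)  # 0.0, since neither article mentions anyone
print(similarity_to)    # approximately 1, since the same articles mention both
```

This is why the second metric can surface artists whose own articles are short or sparse but who are frequently referenced alongside the selected artist.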

Here, we look at the Top 20 according to that metric.

In [157]:
top_indexes_mentions_to = order_mentions_to[1:21]

similarity_mentions_to_data_frame = pd.DataFrame(
    {'Position': range(1, len(top_indexes_mentions_to) + 1),
     'Artist': [index_to_artist_dictionary[index] for index in top_indexes_mentions_to],
     'Similarity to ' + selected_artist : [similarity_scores_mentions_to[index] for index in top_indexes_mentions_to]
    })

similarity_mentions_to_data_frame.style.hide_index()

| Position | Artist | Similarity to The Clash |
|---|---|---|
| 1 | Eugene Hütz | 0.692588 |
| 2 | The 101ers | 0.686052 |
| 3 | The Pogues | 0.374844 |
| 4 | The Passions (British band) | 0.29039 |
| 5 | Radio 4 (band) | 0.29039 |
| 6 | Big Audio Dynamite | 0.267196 |
| 7 | Carbon/Silicon | 0.259618 |
| 8 | Black Grape | 0.230863 |
| 9 | General Public | 0.228021 |
| 10 | Street Dogs | 0.223306 |


Now you may be wondering: what are the artists that mention both the artist I selected and those in the Top 20? What are the artists that join them?

Again, the next analysis will answer those questions. But first, in order to make it cleaner and simpler, let's not look at the whole Top 20, but rather at one of the artists in that list. 

By default, the code below is set so the comparison is made between the artist that was selected by you and the 1st artist of the previous list (the most similar). But you can change that via the *position* variable below, with 1 corresponding to the 1st, 2 to the 2nd, and so forth.

Just make sure to set it as a number between 1 and 20.

In [158]:
position = 1

selected_option = top_indexes_mentions_to[position - 1]

display(Markdown('Selected Artist: ' + index_to_artist_dictionary[selected_option]))

Selected Artist: Eugene Hütz

Finally, you can determine what artists will show up in the comparison.

Set *remove_non_overlapping_artists* to True if you only want to see the artists whose articles mention both acts. Set *remove_non_overlapping_artists* to False if you want to see all artists whose articles mention either of the two.

In [159]:
remove_non_overlapping_artists = True

With those configurations set, the table below will present the results of our analysis. You can now take a look at which artists caused your selected artist to be similar to any of those in the Top 20. The code below will do the necessary processing and output the table with the results.

In [160]:
# We only have the index of the artist selected from the Top 20 by the user. Let's get the name of the act so we can
# show it in the dataframe.
nth_most_similar_artist = index_to_artist_dictionary[selected_option]

# We need to obtain all artists that mention both the selected artist and the one chosen from the Top 20.
# We also need to obtain the number of mentions that are made to each of those acts. Here, we create the
# lists that will store that information.
artists_that_mention_nth_most_similar_artist = []
number_of_mentions_to_nth_most_similar_artist = []
artists_that_mention_selected_artist = []
number_of_mentions_to_selected_artist = []

# We need to iterate over the full dataset.
for idx, row in df_matrix.iterrows():
    
    # We obtain the MENTIONED_ARTISTS column and split it into a list.
    mentioned_artists = row['MENTIONED_ARTISTS'].split(';')
    
    # Now, let's iterate over all artists in the MENTIONED_ARTISTS column.
    for artist in mentioned_artists:
        
        # Each item has the form "name:count". We split on the last colon only, so that a name
        # that itself contains a colon stays intact.
        mentioned_name = artist.rsplit(':', 1)[0]
        
        # Now we check whether that artist is one of the two we are interested in.
        # If that's the case, we append the information to the lists.
        if mentioned_name == nth_most_similar_artist:
            
            artists_that_mention_nth_most_similar_artist.append(row['ARTIST_NAME'])
            number_of_mentions_to_nth_most_similar_artist.append(artist.rsplit(':', 1)[1])
            
        elif mentioned_name == selected_artist:
            
            artists_that_mention_selected_artist.append(row['ARTIST_NAME'])
            number_of_mentions_to_selected_artist.append(artist.rsplit(':', 1)[1])

# We need to check what artists we will display in the comparison. To do that, we will use the remove_non_overlapping_artists
# variable set by the user. If it is True, we will only show artists whose articles mention both acts. If it is False, we will
# show all artists whose articles mention either of the two.
artists_to_display = []
if (remove_non_overlapping_artists):
    artists_to_display = set(artists_that_mention_nth_most_similar_artist) & set(artists_that_mention_selected_artist)
else:
    artists_to_display = set(artists_that_mention_nth_most_similar_artist + artists_that_mention_selected_artist)
    
# These will be the names of our columns.
column_nth_most_similar = 'Number of Mentions to ' + nth_most_similar_artist
column_selected = 'Number of Mentions to ' + selected_artist

# We create an empty dataframe with those columns.
df_mentions_comparison = pd.DataFrame(columns=(column_selected, column_nth_most_similar),
                                  index=artists_to_display)

# We start by setting all values of the dataframe to zero (no mentions).
df_mentions_comparison = df_mentions_comparison.fillna(0)

# We are ordering the dataframe by index so it is in alphabetical order.
df_mentions_comparison = df_mentions_comparison.sort_index()

# For each artist that will be displayed, we check how many mentions their article has to the two artists in our comparisons. 
# We then add that number to the corresponding dataframe row x column pairing.
for artist in artists_to_display:
    
    if artist in artists_that_mention_nth_most_similar_artist:
        mentions_to_nth_most_similar_by_artist = number_of_mentions_to_nth_most_similar_artist[artists_that_mention_nth_most_similar_artist.index(artist)]
        df_mentions_comparison.loc[artist, column_nth_most_similar] = mentions_to_nth_most_similar_by_artist
    
    if artist in artists_that_mention_selected_artist:
        mentions_to_selected_by_artist = number_of_mentions_to_selected_artist[artists_that_mention_selected_artist.index(artist)]
        df_mentions_comparison.loc[artist, column_selected] = mentions_to_selected_by_artist

# We show the result, making sure all rows are displayed.
pd.set_option('display.max_rows', len(artists_to_display))

df_mentions_comparison

|  | Number of Mentions to The Clash | Number of Mentions to Eugene Hütz |
|---|---|---|
| Joe Strummer | 37 | 1 |


### 5.3 Metric 3 - Similarity According to Mentions in Article and Mentions to Artist

To wrap it all up, let's combine the two similarity metrics into one via a simple average. That way, we can see what artists are the most similar to the one that was selected according to both the mentions their articles contain, and the mentions other artists make to them.

In [161]:
top_indexes_mentions_from_to = order_mentions_from_to[1:21]

similarity_mentions_from_to_data_frame = pd.DataFrame(
    {'Position': range(1, len(top_indexes_mentions_from_to) + 1),
     'Artist': [index_to_artist_dictionary[index] for index in top_indexes_mentions_from_to],
     'Similarity': [similarity_scores_mentions_from_to[index] for index in top_indexes_mentions_from_to]
    })

similarity_mentions_from_to_data_frame.style.hide_index()

| Position | Artist | Similarity |
|---|---|---|
| 1 | The 101ers | 0.611756 |
| 2 | The Pogues | 0.437643 |
| 3 | Sid Vicious | 0.382622 |
| 4 | Eugene Hütz | 0.353371 |
| 5 | John Lydon | 0.343328 |
| 6 | Jess Conrad | 0.324567 |
| 7 | Ex Pistols | 0.320029 |
| 8 | Stisism | 0.318328 |
| 9 | Afternoons (band) | 0.318328 |
| 10 | U.S. Crush | 0.318328 |


## 6. Future Improvements

Using Cosine Similarity has the benefit of yielding a similarity score whose meaning is easy to identify, since for non-negative mention counts it always ranges between 0 (minimum similarity) and 1 (maximum similarity). However, during the development of this notebook, it was observed that the Cosine Similarity metric may not be ideal for this task. In many cases, very obscure artists got the highest similarity scores simply because the only artist mentioned in their article, or the only artist that mentions them, happened to overlap with one of those of the selected artist.

Obscure artists getting the highest similarity is not the issue here. The problem is that if the only artist they mention or that mentions them overlaps with one of those of the selected artist, they will probably get a much higher score than popular artists that have multiple overlaps with the act selected by the user.

Therefore, perhaps it would be interesting to use or build a similarity metric that gives more weight to the number of overlapping artists.
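The sketch below reproduces the problem on made-up data and tries one possible adjustment: scaling the Cosine Similarity down when two artists share fewer than a chosen number of mentioned artists. The function `overlap_weighted`, the `min_overlap` threshold of 3, and all names and counts are assumptions made for illustration. This adjustment was not evaluated in this notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {artist: count} dictionaries."""
    shared = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def overlap_weighted(u, v, min_overlap=3):
    """Cosine similarity scaled down when fewer than min_overlap artists are shared.

    Hypothetical adjustment: the threshold is an arbitrary choice for this sketch.
    """
    overlap = len(set(u) & set(v))
    return cosine(u, v) * min(overlap, min_overlap) / min_overlap


# Placeholder mention counts, not taken from the dataset.
selected = {'Artist A': 17, 'Artist B': 15, 'Artist C': 2}
obscure = {'Artist A': 1}  # a single mention that happens to overlap
popular = {'Artist A': 5, 'Artist B': 3, 'Artist C': 2, 'Artist D': 10, 'Artist E': 8}

print(cosine(selected, obscure), cosine(selected, popular))
print(overlap_weighted(selected, obscure), overlap_weighted(selected, popular))
```

With plain Cosine Similarity the obscure artist scores higher than the popular one despite sharing a single artist with the selection. With the overlap weighting, the popular artist, which shares three, comes out ahead.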