# Spotify Artist Networks

Hi everyone,

This is my first Kaggle project, so all the feedback in the comments is appreciated. If you have any questions, do not hesitate to ask them. I will answer them in detail and as soon as possible.  

For those who do not know the task, here it is again:
'Many artists gain recognition by featuring other artists who are already well known. Explore different artists' networks (based on their appearances together in songs) and determine who the most influential artists are in each of them (based on popularity).'

I have made the following networks:
1. Simple network: Contains all the collaborations between artists
2. Simple network including popularity: Contains all collaborations between artists and the popularity of that song 
3. Network with self-connected nodes: By making self-connected nodes, we can include the popularity of solo projects

Based on the influence of the nodes (which represent the artists) in each network, I would like to find out who the most influential artist in the dataset is. As you can see, I have not yet built networks for specific genres or added visualizations of the networks. I plan to do this, but I would first like to upload this version so people can give feedback on my approach.

## 1. Simple network

This is a network study. A network consists of nodes and edges. Here, each node represents an artist and is labelled with the artist's name. In this first network, an edge represents a collaboration between two artists. Take, for example, the song 'Somewhere' by one of my favourite artists, 'The Blaze', which features 'Octavian'. In this case, the node 'The Blaze' is connected to the node 'Octavian' by an edge. In this network, the influence of an artist is measured by the number of artists they are connected to.
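To make the node and edge idea concrete, here is a minimal sketch of the example above as a toy graph. The name `G_toy` is only used for this illustration and is not part of the notebook.

```python
import networkx as nx

# Toy graph: one node per artist, one edge per collaboration
G_toy = nx.Graph()
G_toy.add_edge('The Blaze', 'Octavian')  # add_edge also creates missing nodes

# The neighbours of a node are the artists it collaborated with
blaze_neighbors = list(G_toy.neighbors('The Blaze'))
print(blaze_neighbors)
print(G_toy.degree('The Blaze'))
```

With only one collaboration, 'The Blaze' has a single neighbour, so its influence in this first network would be 1.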

### 1.1 Build the network

In [None]:
# Required packages
import pandas as pd 
import networkx as nx
import numpy as np

# Open data 
orig_data = pd.read_csv('../input/spotify-dataset-19212020-160k-tracks/data.csv')[['artists']]
orig_data.head()

In [None]:
# Select only the songs with multiple artists, because those are the collaborations
def sort_multiple_artist (data): 
    new_data = []
    for i in data.values:
        if len(i[0].split(',')) > 1:
            new_data.append(i)
    return new_data
data = sort_multiple_artist(orig_data) 

# The artist names need some cleaning because they are messy strings
def clean_data (data): 
    new_data = []
    for i in data:
        temp = []
        for j in i[0].split(','):
            temp.append(j.replace('[', '').replace("'", '').replace(']', '').strip())
        new_data.append(temp)
    return new_data

data = clean_data(data)
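Splitting on ',' works for most rows, but it would also cut an artist name that contains a comma into two pieces. As a possible alternative, and assuming every value in the 'artists' column is a valid Python list literal such as "['A', 'B']", the standard library can parse the string directly. The helper `parse_artists` and the sample rows below are hypothetical.

```python
import ast

def parse_artists(raw):
    """Parse a string such as "['A', 'B']" into a Python list of names."""
    return ast.literal_eval(raw)

# Hypothetical rows in the same format as the 'artists' column
sample = ["['The Blaze', 'Octavian']", "['Earth, Wind & Fire']"]

parsed = [parse_artists(s) for s in sample]

# Keep only the songs with more than one artist
collabs = [names for names in parsed if len(names) > 1]

# For comparison: a plain split sees two "artists" in the second row
naive_count = len(sample[1].split(','))
```

Here `parsed[1]` stays a single artist, while the naive split counts two.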

In [None]:
# Take all the unique artists from the data 
def unique_artists(data):
    new_data = []
    for i in data:
        for j in i:
            if j not in new_data:
                new_data.append(j)
    return new_data

node_names = unique_artists(data)  
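The `j not in new_data` check scans the whole list for every name, which gets slow with many artists. A sketch of a faster variant that keeps the same first-seen order is shown here. The name `unique_artists_fast` is hypothetical, and it assumes the same input shape as above (a list of lists of names).

```python
def unique_artists_fast(tracks):
    """Return every artist once, in order of first appearance."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(name for track in tracks for name in track))

example = unique_artists_fast([['A', 'B'], ['B', 'C'], ['A']])
```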

In [None]:
# We have the node names, now we can start making a graph
G = nx.Graph()

# Add the nodes 
for i in node_names:
    G.add_node(i)

In [None]:
# Select all the collaborations between the artists 
def collaborations (data): 
    collab = []
    for i in data:
        for j in range(1, len(i)):
            temp = []
            temp.append([i[0], i[j]])
            collab.append(temp)
    return collab

edges = collaborations(data)

In [None]:
# Now we can add the edges in the network 
for i in edges:
    i = i[0]
    G.add_edge(i[0], i[1])
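Note that `collaborations` connects the first-listed artist of a song to each of the other artists, but does not connect the other artists to each other. Whether featured artists should also be linked to one another is a modelling choice. If they should, a sketch along these lines would produce every pair per song. The function name `all_pair_collaborations` is hypothetical.

```python
from itertools import combinations

def all_pair_collaborations(tracks):
    """Return every pair of artists that appear on the same song."""
    pairs = []
    for artists in tracks:
        # combinations(..., 2) yields each unordered pair exactly once
        pairs.extend(combinations(artists, 2))
    return pairs

pairs = all_pair_collaborations([['A', 'B', 'C']])
```

For a song with three artists this gives three edges instead of two, so the rankings would likely change.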

### 1.2 Let's analyze the network

In [None]:
# If we ignore the popularity of a song, the node with the most neighbours is the most important node
def find_most_neighbors (G, node_names):
    neighbor_amount = []
    for i in node_names:
        neighbor_amount.append(len([n for n in G.neighbors(i)]))
    return neighbor_amount

neighbor_amount = find_most_neighbors(G, node_names)

In [None]:
# Makes a top list of the most influential artists 
def top_list(scores, node_names, length):
    order = np.argsort(scores)[::-1]
    temp = []
    for i in range(length):
        temp.append([node_names[order[i]], scores[order[i]]])
    return temp
    
top_list(neighbor_amount, node_names, 20)

So Mozart has the most neighbours, which indicates that, if you exclude popularity, Mozart is the most influential artist of the dataset. The list consists almost entirely of classical composers, with Dutch DJ and producer Armin van Buuren at number 10. 
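The neighbour count used above is the degree of a node, which networkx can report directly. The following is a minimal sketch on a toy graph, not a replacement for the cells above. `top_by_degree` and the toy edges are hypothetical.

```python
import networkx as nx

def top_by_degree(G, length):
    """Return the `length` nodes with the most neighbours as (node, degree) pairs."""
    # G.degree iterates over (node, degree) pairs
    ranked = sorted(G.degree, key=lambda pair: pair[1], reverse=True)
    return ranked[:length]

G_toy = nx.Graph()
G_toy.add_edges_from([('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C')])

top = top_by_degree(G_toy, 1)
```

Slicing also means that asking for more entries than there are nodes simply returns all of them.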

## 2. Simple Network including popularity 

My previous attempt ignored the popularity of the songs, which is not a good way to see which artists are influential nowadays. Maybe Mozart's music is not that popular with today's Spotify users. Therefore, we add a value to each edge that represents the popularity of the song. 

### 2.1 Build the network

In [None]:
orig_data = pd.read_csv('../input/spotify-dataset-19212020-160k-tracks/data.csv')
data = orig_data[['artists', 'popularity']]

# First we have to take only the collaboration songs 
def sort_multiple_artist (data): # Keeps only the songs with multiple artists
    new_data = []
    for i in data.values:
        if len(i[0].split(',')) > 1:
            new_data.append(i)
    return new_data

data = sort_multiple_artist(data) 

# Clean the artist names because they are messy strings
def clean_data (data):
    new_data = []
    for i in data:
        temp = []
        for j in i[0].split(','):
            temp.append(j.replace('[', '').replace("'", '').replace(']', '').strip())
        new_data.append([temp, i[1]])
    return new_data

data = clean_data(data)

In [None]:
# For the nodes it is important that each artist only receives one node
def unique_artists(data):
    new_data = []
    for i in data:
        for j in i[0]:
            if j not in new_data:
                new_data.append(j)
    return new_data

node_names = unique_artists(data)  

In [None]:
# Now we have the node names we can start making a graph
G = nx.Graph()

# Add the nodes 
for i in node_names:
    G.add_node(i)

In [None]:
# to find the edges we first want to make a list containing all collaborations including popularity 
def collaborations (data): 
    collab = []
    for i in data:
        for j in range(1, len(i[0])):
            temp = []
            temp.append([i[0][0], i[0][j]])
            collab.append([temp, i[1]])
    return collab

edges = collaborations(data)

In [None]:
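# Add the edges including the weights 
# Note: nx.Graph stores one edge per pair of artists, so when two artists share
# several songs, the weight of the last song added is the one that is kept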
for i in edges:
    edge = i[0][0]
    w = i[1]
    G.add_edge(edge[0], edge[1], weight = w)
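Because `nx.Graph` holds a single edge per pair of nodes, calling `add_edge` again for the same pair replaces the stored weight. If repeated collaborations should count, one option is to add the popularity to the existing weight. This is a sketch of that idea only. Whether to sum, average, or take the maximum is a modelling choice, and `add_weighted_edge` is a hypothetical helper.

```python
import networkx as nx

def add_weighted_edge(G, a, b, popularity):
    """Add popularity to the edge a-b, creating the edge if needed."""
    if G.has_edge(a, b):
        G[a][b]['weight'] += popularity
    else:
        G.add_edge(a, b, weight=popularity)

# Two songs by the same pair of artists
G_sum = nx.Graph()
add_weighted_edge(G_sum, 'A', 'B', 60)
add_weighted_edge(G_sum, 'A', 'B', 40)

# For comparison: plain add_edge keeps only the last weight
G_last = nx.Graph()
G_last.add_edge('A', 'B', weight=60)
G_last.add_edge('A', 'B', weight=40)
```

In this toy case the summed edge weighs 100, while the overwritten edge weighs 40.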

### 2.2 Analyze the network 

In [None]:
# Takes for each of the nodes the weights of the edges of the node 
def influence_calculator (G, node_names):
    influence_list = []
    for i in node_names:
        influence = 0
        for j in G.neighbors(i):
            influence += G.get_edge_data(i, j)['weight']   
        influence_list.append(influence)
    return influence_list

scores = influence_calculator(G, node_names)

In [None]:
# Makes a top list with the artists with the highest scores 
def top_list(scores, node_names, length):
    order = np.argsort(scores)[::-1]
    temp = []
    for i in range(length):
        temp.append([node_names[order[i]], scores[order[i]]])
    return temp
    
top_list(scores, node_names, 20)

In this network, the popularity of a song is the weight of the edge between two artists. Mozart is still the most influential artist. However, a lot of rap artists, such as Kanye West and Drake, also made it onto the list. 
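The score computed by `influence_calculator` is the sum of the weights of all edges of a node. For a graph without self-connected nodes, networkx calls this the weighted degree. Here is a minimal sketch on a toy graph with made-up weights.

```python
import networkx as nx

G_toy = nx.Graph()
G_toy.add_edge('A', 'B', weight=50)
G_toy.add_edge('A', 'C', weight=30)

# Weighted degree: sum of the 'weight' attribute over the edges of each node
weighted = dict(G_toy.degree(weight='weight'))
```

Here 'A' scores 80, 'B' scores 50, and 'C' scores 30, the same numbers the explicit loop over the neighbours would give.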

## 3. Network with self-connected nodes 

The previous networks consist only of collaborations. But if an artist has many very popular solo songs, that can also be considered influential. Therefore, this network also has self-connected nodes: each artist gets an edge to itself whose weight is the summed popularity of that artist's solo songs. 

### 3.1 Build the network

In [None]:
orig_data = pd.read_csv('../input/spotify-dataset-19212020-160k-tracks/data.csv')
data = orig_data[['artists', 'popularity']]

# This time we keep every song, solo songs as well as collaborations (note the '>= 1')
def sort_multiple_artist (data): # Keeps the songs with at least one artist
    new_data = []
    for i in data.values:
        if len(i[0].split(',')) >= 1:
            new_data.append(i)
    return new_data

data = sort_multiple_artist(data) 

# Clean the artist names because they are messy strings
def clean_data (data):
    new_data = []
    for i in data:
        temp = []
        for j in i[0].split(','):
            temp.append(j.replace('[', '').replace("'", '').replace(']', '').strip())
        new_data.append([temp, i[1]])
    return new_data

data = clean_data(data)

In [None]:
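# Take the first-listed artist of each song as a node name
# (artists who only appear as a feature still get a node when the edges are added,
# but they are not in node_names and therefore are not scored)
def unique_artists (data):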
    unique_artist_list = []
    for i in data:
        if i[0][0] not in unique_artist_list:
            unique_artist_list.append(i[0][0])
    return unique_artist_list


node_names = unique_artists(data)

In [None]:
def split_solo_colab(data):
    solo = []
    colab = []
    for i in data:
        if len(i[0]) > 1:
            colab.append(i)
        if len(i[0]) == 1:
            solo.append(i)
    return solo, colab

solo, colab = split_solo_colab(data)           

In [None]:
def count_solo_pop (node_names, solo):
    edge_w_self_connected = []
    for i in node_names:
        temp = 0
        for j in solo:
            if i == j[0][0]:
                temp += j[1]
        edge_w_self_connected.append(temp)
    return edge_w_self_connected

edge_w_self_connected = count_solo_pop(node_names, solo)      
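`count_solo_pop` walks through all solo songs once per artist, which can take a long time on a dataset of this size. A single pass with a dictionary should give the same totals. This is a sketch under the assumption that `solo` has the shape produced above, a list of `[[artist], popularity]` entries. `solo_popularity` and the sample values are hypothetical.

```python
from collections import defaultdict

def solo_popularity(solo):
    """Sum the popularity of the solo songs per artist in one pass."""
    totals = defaultdict(int)
    for artists, popularity in solo:
        totals[artists[0]] += popularity
    return totals

# Made-up solo songs in the same shape as `solo`
sample_solo = [[['A'], 10], [['B'], 5], [['A'], 7]]
totals = solo_popularity(sample_solo)

# Same order as node_names; artists without solo songs get 0
sample_nodes = ['A', 'B', 'C']
weights = [totals.get(name, 0) for name in sample_nodes]
```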

In [None]:
G = nx.Graph()

# Add the nodes 
for i in node_names:
    G.add_node(i)

In [None]:
# Add self_connected node edges 
for i in range(len(node_names)):
    edge = node_names[i]
    w = edge_w_self_connected[i]
    G.add_edge(edge, edge, weight = w)
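A self-connected node has itself as a neighbour, so a loop over `G.neighbors(node)` picks up the self-loop weight once. The built-in weighted degree behaves differently: networkx counts a self-loop twice there. The toy graph below, with made-up weights, illustrates the difference.

```python
import networkx as nx

G_loop = nx.Graph()
G_loop.add_edge('A', 'A', weight=120)  # solo popularity as a self-loop
G_loop.add_edge('A', 'B', weight=40)   # one collaboration

# Summing over the neighbours counts the self-loop once
neighbor_sum = sum(G_loop['A'][n]['weight'] for n in G_loop.neighbors('A'))

# The weighted degree counts the self-loop twice
weighted_degree = G_loop.degree('A', weight='weight')
```

So for this network the explicit loop over the neighbours and `G.degree(weight='weight')` are not interchangeable.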

In [None]:
# to find the edges we first want to make a list containing all collaborations including popularity 
def collaborations (data): 
    collab = []
    for i in data:
        for j in range(1, len(i[0])):
            temp = []
            temp.append([i[0][0], i[0][j]])
            collab.append([temp, i[1]])
    return collab

edges = collaborations(colab)

In [None]:
# Add the edges including the weights 
for i in edges:
    edge = i[0][0]
    w = i[1]
    G.add_edge(edge[0], edge[1], weight = w)

### 3.2 Analyze the network

In [None]:
# Takes for each of the nodes the weights of the edges of the node 
def influence_calculator (G, node_names):
    influence_list = []
    for i in node_names:
        influence = 0
        for j in G.neighbors(i):
            influence += G.get_edge_data(i, j)['weight']   
        influence_list.append(influence)
    return influence_list

scores = influence_calculator(G, node_names)

In [None]:
# Makes a top list with the artists with the highest scores 
def top_list(scores, node_names, length):
    order = np.argsort(scores)[::-1]
    temp = []
    for i in range(length):
        temp.append([node_names[order[i]], scores[order[i]]])
    return temp
    
top_list(scores, node_names, 20)

## 4. Conclusion 

For this Kaggle project, I made three kinds of networks. Network 1 contains only the collaborations between artists. Network 2 also contains only collaborations, but includes the popularity of each collaboration. Network 3 contains not only collaborations but also songs made by a single artist. 

This project shows that the top 20 most influential artists can differ a lot depending on how the network is built. In network 1, we see almost only classical composers in the top 20. In network 2, we still see a lot of classical composers but also a lot of rappers. In network 3, we see a lot of older artists such as The Beatles and Frank Sinatra. 

We can now partially answer the following question: 'Who are the most influential artists in each of the networks (based on popularity)?'

Let's start with network 1. It is understandable that the top 20 consists almost entirely of classical composers, because they wrote a lot of music. The most famous composers on this list had already died long before this dataset was made, so orchestras performed their work and credited the composer as one of the artists. In network 1, Mozart is the most influential artist, with honorable mentions for Verdi, Bach, and Beethoven. 

In network 2, where popularity is included, many famous rap artists replace some of the classical composers in the top 20. This change is also understandable, because nowadays rap is more popular than classical music. However, Mozart is still the most influential artist. Again, there is an honorable mention for Bach, but also for Kanye West, Drake, and Lil Wayne.  

In network 3, artists who are immensely popular but did not collaborate much with other artists have replaced almost all the classical composers and rappers. However, Mozart is still on this list. The Beatles are number one in this network, with honorable mentions for Frank Sinatra and Elvis Presley. 

So I think, for now, we can conclude that Mozart is still the most influential artist of the dataset based on these three networks. However, I believe my approach can be improved a lot. In my second attempt at this task, I want to analyze genres separately and add visualizations. Before that, I would like to receive some feedback, tips, and tricks to get a better idea of which parts of my approach need changes.

Greetings,

Floris 